Compare commits

..

165 Commits

Author SHA1 Message Date
Anna Khanova
2a88889f44 Merge pull request #7254 from neondatabase/rc/proxy/2024-03-27
Proxy release 2024-03-27
2024-03-27 11:44:09 +01:00
Conrad Ludgate
12512f3173 add authentication rate limiting (#6865)
## Problem

https://github.com/neondatabase/cloud/issues/9642

## Summary of changes

1. Make `EndpointRateLimiter` generic, renamed as `BucketRateLimiter`
2. Add support for claiming multiple tokens at once
3. Add `AuthRateLimiter` alias.
4. Check `(Endpoint, IP)` pair during authentication, weighted by how
many hashes proxy would be doing.

TODO: handle ipv6 subnets. will do this in a separate PR.
2024-03-26 19:31:19 +00:00
John Spray
b3b7ce457c pageserver: remove bare mgr::get_tenant, mgr::list_tenants (#7237)
## Problem

This is a refactor.

This PR was a precursor to a much smaller change
e5bd602dc1,
where as I was writing it I found that we were not far from getting rid
of the last non-deprecated code paths that use `mgr::` scoped functions
to get at the TenantManager state.

We're almost done cleaning this up as per
https://github.com/neondatabase/neon/issues/5796. The only significant
remaining mgr:: item is `get_active_tenant_with_timeout`, which is
page_service's path for fetching tenants.

## Summary of changes

- Remove the bool argument to get_attached_tenant_shard: this was almost
always false from API use cases, and in cases when it was true, it was
readily replacable with an explicit check of the returned tenant's
status.
- Rather than letting the timeline eviction task query any tenant it
likes via `mgr::`, pass an `Arc<Tenant>` into the task. This is still an
ugly circular reference, but should eventually go away: either when we
switch to exclusively using disk usage eviction, or when we change
metadata storage to avoid the need to imitate layer accesses.
- Convert all the mgr::get_tenant call sites to use
TenantManager::get_attached_tenant_shard
- Move list_tenants into TenantManager.
2024-03-26 18:29:08 +00:00
John Spray
6814bb4b59 tests: add a log allow list to stabilize benchmarks (#7251)
## Problem

https://github.com/neondatabase/neon/pull/7227 destabilized various
tests in the performance suite, with log errors during shutdown. It's
because we switched shutdown order to stop the storage controller before
the pageservers.

## Summary of changes

- Tolerate "connection failed" errors from pageservers trying to
validation their deletion queue.
2024-03-26 17:44:18 +00:00
John Spray
b3bb1d1cad storage controller: make direct tenant creation more robust (#7247)
## Problem

- Creations were not idempotent (unique key violation)
- Creations waited for reconciliation, which control plane blocks while
an operation is in flight

## Summary of changes

- Handle unique key constraint violation as an OK situation: if we're
creating the same tenant ID and shard count, it's reasonable to assume
this is a duplicate creation.
- Make the wait for reconcile during creation tolerate failures: this is
similar to location_conf, where the cloud control plane blocks our
notification calls until it is done with calling into our API (in future
this constraint is expected to relax as the cloud control plane learns
to run multiple operations concurrently for a tenant)
2024-03-26 16:57:35 +00:00
John Spray
47d2b3a483 pageserver: limit total ephemeral layer bytes (#7218)
## Problem

Follows: https://github.com/neondatabase/neon/pull/7182

- Sufficient concurrent writes could OOM a pageserver from the size of
indices on all the InMemoryLayer instances.
- Enforcement of checkpoint_period only happened if there were some
writes.

Closes: https://github.com/neondatabase/neon/issues/6916

## Summary of changes

- Add `ephemeral_bytes_per_memory_kb` config property. This controls the
ratio of ephemeral layer capacity to memory capacity. The weird unit is
to enable making the ratio less than 1:1 (set this property to 1024 to
use 1MB of ephemeral layers for every 1MB of RAM, set it smaller to get
a fraction).
- Implement background layer rolling checks in
Timeline::compaction_iteration -- this ensures we apply layer rolling
policy in the absence of writes.
- During background checks, if the total ephemeral layer size has
exceeded the limit, then roll layers whose size is greater than the mean
size of all ephemeral layers.
- Remove the tick() path from walreceiver: it isn't needed any more now
that we do equivalent checks from compaction_iteration.
- Add tests for the above.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-03-26 15:45:32 +00:00
John Spray
8dfe3a070c pageserver: return 429 on timeline creation in progress (#7225)
## Problem

Currently, we return 409 (Conflict) in two cases:
- Temporary: Timeline creation cannot proceed because another timeline
with the same ID is being created
- Permanent: Timeline creation cannot proceed because another timeline
exists with different parameters but the same ID.

Callers which time out a request and retry should be able to distinguish
these cases.

Closes: #7208 

## Summary of changes

- Expose `AlreadyCreating` errors as 429 instead of 409
2024-03-26 15:20:05 +00:00
Alexander Bayandin
3426619a79 test_runner/performance: skip test_bulk_insert (#7238)
## Problem
`test_bulk_insert` becomes too slow, and it fails constantly:
https://github.com/neondatabase/neon/issues/7124

## Summary of changes
- Skip `test_bulk_insert` until it's fixed
2024-03-26 15:10:15 +00:00
Vlad Lazar
de03742ca3 pageserver: drop layer map lock in Timeline::get (#7217)
## Problem
We currently hold the layer map read lock while doing IO on the read
path. This is not required for correctness.

## Summary of changes
Drop the layer map lock after figuring out which layer we wish to read
from.
Why is this correct:
* `Layer` models the lifecycle of an on disk layer. In the event the
layer is removed from local disk, it will be on demand downloaded
* `InMemoryLayer` holds the `EphemeralFile` which wraps the on disk
file. As long as the `InMemoryLayer` is in scope, it's safe to read from it.

Related https://github.com/neondatabase/neon/issues/6833
2024-03-26 14:35:36 +00:00
Christian Schwarz
ad072de420 Revert "pageserver: use a single tokio runtime (#6555)" (#7246) 2024-03-26 15:24:18 +01:00
Anna Khanova
6c18109734 proxy: reuse sess_id as request_id for the cplane requests (#7245)
## Problem

https://github.com/neondatabase/cloud/issues/11599

## Summary of changes

Reuse the same sess_id for requests within the one session.

TODO: get rid of `session_id` in query params.
2024-03-26 11:27:48 +00:00
John Spray
5dee58f492 tests: wait for uploads in test_secondary_downloads (#7220)
## Problem

- https://github.com/neondatabase/neon/issues/6966

This test occasionally failed with some layers unexpectedly not present
on the secondary pageserver. The issue in that failure is the attached
pageserver uploading heatmaps that refer to not-yet-uploaded layers.

## Summary of changes

After uploading heatmap, drain upload queue on attached pageserver, to
guarantee that all the layers referenced in the haetmap are uploaded.
2024-03-26 10:59:16 +00:00
John Spray
6313f1fa7a tests: tolerate transient unavailability in test_sharding_split_failures (#7223)
## Problem

While most forms of split rollback don't interrupt clients, there are a
couple of cases that do -- this interruption is brief, driven by the
time it takes the controller to kick off Reconcilers during the async
abort of the split, so it's operationally fine, but can trip up a test.

- #7148 

## Summary of changes

- Relax test check to require that the tenant is eventually available
after split failure, rather than immediately. In the vast majority of
cases this will pass on the first iteration.
2024-03-26 09:56:47 +00:00
Christian Schwarz
f72415e1fd refactor(remote_timeline_client): infallible stop() and shutdown() (#7234)
preliminary refactoring for
https://github.com/neondatabase/neon/pull/7233

part of #7062
2024-03-25 18:42:18 +01:00
George Ma
d837ce0686 chore: remove repetitive words (#7206)
Signed-off-by: availhang <mayangang@outlook.com>
2024-03-25 11:43:02 -04:00
John Spray
2713142308 tests: stabilize compat tests (#7227)
This test had two flaky failure modes:
- pageserver log error for timeline not found: this resulted from
changes for DR when timeline destroy/create was added, but endpoint was
left running during that operation.
- storage controller log error because the test was running for long
enough that a background reconcile happened at almost the exact moment
of test teardown, and our test fixtures tear down the pageservers before
the controller.

Closes: #7224
2024-03-25 14:35:24 +00:00
Arseny Sher
a6c1fdcaf6 Try to fix test_crafted_wal_end flakiness.
Postgres can always write some more WAL, so previous checks that WAL doesn't
change after something had been crafted were wrong; remove them. Add comments
here and there.

should fix https://github.com/neondatabase/neon/issues/4691
2024-03-25 14:53:06 +03:00
John Spray
adb0526262 pageserver: track total ephemeral layer bytes (#7182)
## Problem

Large quantities of ephemeral layer data can lead to excessive memory
consumption (https://github.com/neondatabase/neon/issues/6939). We
currently don't have a way to know how much ephemeral layer data is
present on a pageserver.

Before we can add new behaviors to proactively roll layers in response
to too much ephemeral data, we must calculate that total.

Related: https://github.com/neondatabase/neon/issues/6916

## Summary of changes

- Create GlobalResources and GlobalResourceUnits types, where timelines
carry a GlobalResourceUnits in their TimelineWriterState.
- Periodically update the size in GlobalResourceUnits:
  - During tick()
  - During layer roll
- During put() if the latest value has drifted more than 10MB since our
last update
- Expose the value of the global ephemeral layer bytes counter as a
prometheus metric.
- Extend the lifetime of TimelineWriterState:
  - Instead of dropping it in TimelineWriter::drop, let it remain.
- Drop TimelineWriterState in roll_layer: this drops our guard on the
global byte count to reflect the fact that we're freezing the layer.
- Ensure the validity of the later in the writer state by clearing the
state in the same place we freeze layers, and asserting on the
write-ability of the layer in `writer()`
- Add a 'context' parameter to `get_open_layer_action` so that it can
skip the prev_lsn==lsn check when called in tick() -- this is needed
because now tick is called with a populated state, where
prev_lsn==Some(lsn) is true for an idle timeline.
- Extend layer rolling test to use this metric
2024-03-25 11:52:50 +00:00
John Spray
0099dfa56b storage controller: tighten up secrets handling (#7105)
- Remove code for using AWS secrets manager, as we're deploying with
k8s->env vars instead
- Load each secret independently, so that one can mix CLI args with
environment variables, rather than requiring that all secrets are loaded
with the same mechanism.
- Add a 'strict mode', enabled by default, which will refuse to start if
secrets are not loaded. This avoids the risk of accidentially disabling
auth by omitting the public key, for example
2024-03-25 11:52:33 +00:00
Vlad Lazar
3a4ebfb95d test: fix test_pageserver_recovery flakyness (#7207)
## Problem
We recently introduced log file validation for the storage controller.
The heartbeater will WARN when it fails
for a node, hence the test fails.

Closes https://github.com/neondatabase/neon/issues/7159

## Summary of changes
* Warn only once for each set of heartbeat retries
* Allow list heartbeat warns
2024-03-25 09:38:12 +00:00
Christian Schwarz
3220f830b7 pageserver: use a single tokio runtime (#6555)
Before this PR, each core had 3 executor threads from 3 different
runtimes. With this PR, we just have one runtime, with one thread per
core. Switching to a single tokio runtime should reduce that effective
over-commit of CPU and in theory help with tail latencies -- iff all
tokio tasks are well-behaved and yield to the runtime regularly.

Are All Tasks Well-Behaved? Are We Ready?
-----------------------------------------

Sadly there doesn't seem to be good out-of-the box tokio tooling to
answer this question.

We *believe* all tasks are well behaved in today's code base, as of the
switch to `virtual_file_io_engine = "tokio-epoll-uring"` in production
(https://github.com/neondatabase/aws/pull/1121).

The only remaining executor-thread-blocking code is walredo and some
filesystem namespace operations.

Filesystem namespace operations work is being tracked in #6663 and not
considered likely to actually block at this time.

Regarding walredo, it currently does a blocking `poll` for read/write to
the pipe file descriptors we use for IPC with the walredo process.
There is an ongoing experiment to make walredo async (#6628), but it
needs more time because there are surprisingly tricky trade-offs that
are articulated in that PR's description (which itself is still WIP).
What's relevant for *this* PR is that
1. walredo is always CPU-bound
2. production tail latencies for walredo request-response
(`pageserver_wal_redo_seconds_bucket`) are
  - p90: with few exceptions, low hundreds of micro-seconds
  - p95: except on very packed pageservers, below 1ms
  - p99: all below 50ms, vast majority below 1ms
  - p99.9: almost all around 50ms, rarely at >= 70ms
- [Dashboard
Link](https://neonprod.grafana.net/d/edgggcrmki3uof/2024-03-walredo-latency?orgId=1&var-ds=ZNX49CDVz&var-pXX_by_instance=0.9&var-pXX_by_instance=0.99&var-pXX_by_instance=0.95&var-adhoc=instance%7C%21%3D%7Cpageserver-30.us-west-2.aws.neon.tech&var-per_instance_pXX_max_seconds=0.0005&from=1711049688777&to=1711136088777)

The ones below 1ms are below our current threshold for when we start
thinking about yielding to the executor.
The tens of milliseconds stalls aren't great, but, not least because of
the implicit overcommit of CPU by the three runtimes, we can't be sure
whether these tens of milliseconds are inherently necessary to do the
walredo work or whether we could be faster if there was less contention
for CPU.

On the first item (walredo being always CPU-bound work): it means that
walredo processes will always compete with the executor threads.
We could yield, using async walredo, but then we hit the trade-offs
explained in that PR.

tl;dr: the risk of stalling executor threads through blocking walredo
seems low, and switching to one runtime cleans up one potential source
for higher-than-necessary stall times (explained in the previous
paragraphs).


Code Changes
------------

- Remove the 3 different runtime definitions.
- Add a new definition called `THE_RUNTIME`.
- Use it in all places that previously used one of the 3 removed
runtimes.
- Remove the argument from `task_mgr`.
- Fix failpoint usage where `pausable_failpoint!` should have been used.
We encountered some actual failures because of this, e.g., hung
`get_metric()` calls during test teardown that would client-timeout
after 300s.

As indicated by the comment above `THE_RUNTIME`, we could take this
clean-up further.
But before we create so much churn, let's first validate that there's no
perf regression.


Performance
-----------

We will test this in staging using the various nightly benchmark runs.

However, the worst-case impact of this change is likely compaction
(=>image layer creation) competing with compute requests.
Image layer creation work can't be easily generated & repeated quickly
by pagebench.
So, we'll simply watch getpage & basebackup tail latencies in staging.

Additionally, I have done manual benchmarking using pagebench.
Report:
https://neondatabase.notion.site/2024-03-23-oneruntime-change-benchmarking-22a399c411e24399a73311115fb703ec?pvs=4
Tail latencies and throughput are marginally better (no regression =
good).
Except in a workload with 128 clients against one tenant.
There, the p99.9 and p99.99 getpage latency is about 2x worse (at
slightly lower throughput).
A dip in throughput every 20s (compaction_period_ is clearly visible,
and probably responsible for that worse tail latency.
This has potential to improve with async walredo, and is an edge case
workload anyway.


Future Work
-----------

1. Once this change has shown satisfying results in production, change
the codebase to use the ambient runtime instead of explicitly
referencing `THE_RUNTIME`.
2. Have a mode where we run with a single-threaded runtime, so we
uncover executor stalls more quickly.
3. Switch or write our own failpoints library that is async-native:
https://github.com/neondatabase/neon/issues/7216
2024-03-23 19:25:11 +01:00
Conrad Ludgate
72103d481d proxy: fix stack overflow in cancel publisher (#7212)
## Problem

stack overflow in blanket impl for `CancellationPublisher`

## Summary of changes

Removes `async_trait` and fixes the impl order to make it non-recursive.
2024-03-23 06:36:58 +00:00
Alex Chi Z
643683f41a fixup(#7204 / postgres): revert IsPrimaryAlive checks (#7209)
Fix #7204.

https://github.com/neondatabase/postgres/pull/400
https://github.com/neondatabase/postgres/pull/401
https://github.com/neondatabase/postgres/pull/402

These commits never go into prod. Detailed investigation will be posted
in another issue. Reverting the commits so that things can keep running
in prod. This pull request adds the test to start two replicas. It fails
on the current main https://github.com/neondatabase/neon/pull/7210 but
passes in this pull request.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-23 01:01:51 +00:00
Konstantin Knizhnik
35f4c04c9b Remove Get/SetZenithCurrentClusterSize from Postgres core (#7196)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1711003752072899

## Summary of changes

Move keeping of cluster size to neon extension

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-03-22 13:14:31 -04:00
John Spray
1787cf19e3 pageserver: write consumption metrics to S3 (#7200)
## Problem

The service that receives consumption metrics has lower availability
than S3. Writing metrics to S3 improves their availability.

Closes: https://github.com/neondatabase/cloud/issues/9824

## Summary of changes

- The same data as consumption metrics POST bodies is also compressed
and written to an S3 object with a timestamp-formatted path.
- Set `metric_collection_bucket` (same format as `remote_storage`
config) to configure the location to write to
2024-03-22 14:52:14 +00:00
Alexander Bayandin
2668a1dfab CI: deploy release version to a preprod region (#6811)
## Problem

We want to deploy releases to a preprod region first to perform required
checks

## Summary of changes
- Deploy `release-XXX` / `release-proxy-YYY` docker tags to a preprod region
2024-03-22 14:42:10 +00:00
Conrad Ludgate
77f3a30440 proxy: unit tests for auth_quirks (#7199)
## Problem

I noticed code coverage for auth_quirks was pretty bare

## Summary of changes

Adds 3 happy path unit tests for auth_quirks
* scram
* cleartext (websockets)
* cleartext (password hack)
2024-03-22 13:31:10 +00:00
John Spray
62b318c928 Fix ephemeral file warning on secondaries (#7201)
A test was added which exercises secondary locations more, and there was
a location in the secondary downloader that warned on ephemeral files.

This was intended to be fixed in this faulty commit:
8cea866adf
2024-03-22 10:10:28 +00:00
Anna Khanova
6770ddba2e proxy: connect redis with AWS IAM (#7189)
## Problem

Support of IAM Roles for Service Accounts for authentication.

## Summary of changes

* Obtain aws 15m-long credentials
* Retrieve redis password from credentials
* Update every 1h to keep connection for more than 12h
* For now allow to have different endpoints for pubsub/stream redis.

TODOs: 
* PubSub doesn't support credentials refresh, consider using stream
instead.
* We need an AWS role for proxy to be able to connect to both: S3 and
elasticache.

Credentials obtaining and connection refresh was tested on xenon
preview.

https://github.com/neondatabase/cloud/issues/10365
2024-03-22 09:38:04 +01:00
Arpad Müller
3ee34a3f26 Update Rust to 1.77.0 (#7198)
Release notes: https://blog.rust-lang.org/2024/03/21/Rust-1.77.0.html

Thanks to #6886 the diff is reasonable, only for one new lint
`clippy::suspicious_open_options`. I added `truncate()` calls to the
places where it is obviously the right choice to me, and added allows
everywhere else, leaving it for followups.

I had to specify cargo install --locked because the build would fail otherwise.
This was also recommended by upstream.
2024-03-22 06:52:31 +00:00
Christian Schwarz
fb60278e02 walredo benchmark: throughput-oriented rewrite (#7190)
See the updated `bench_walredo.rs` module comment.

tl;dr: we measure avg latency of single redo operations issues against a
single redo manager from N tokio tasks.

part of https://github.com/neondatabase/neon/issues/6628
2024-03-21 15:24:56 +01:00
Conrad Ludgate
d5304337cf proxy: simplify password validation (#7188)
## Problem

for HTTP/WS/password hack flows we imitate SCRAM to validate passwords.
This code was unnecessarily complicated.

## Summary of changes

Copy in the `pbkdf2` and 'derive keys' steps from the
`postgres_protocol` crate in our `rust-postgres` fork. Derive the
`client_key`, `server_key` and `stored_key` from the password directly.
Use constant time equality to compare the `stored_key` and `server_key`
with the ones we are sent from cplane.
2024-03-21 13:54:06 +00:00
John Spray
06cb582d91 pageserver: extend /re-attach response to include tenant mode (#6941)
This change improves the resilience of the system to unclean restarts.

Previously, re-attach responses only included attached tenants
- If the pageserver had local state for a secondary location, it would
remain, but with no guarantee that it was still _meant_ to be there.
After this change, the pageserver will only retain secondary locations
if the /re-attach response indicates that they should still be there.
- If the pageserver had local state for an attached location that was
omitted from a re-attach response, it would be entirely detached. This
is wasteful in a typical HA setup, where an offline node's tenants might
have been re-attached elsewhere before it restarts, but the offline
node's location should revert to a secondary location rather than being
wiped. Including secondary tenants in the re-attach response enables the
pageserver to avoid throwing away local state unnecessarily.

In this PR:
- The re-attach items are extended with a 'mode' field.
- Storage controller populates 'mode'
- Pageserver interprets it (default is attached if missing) to construct
either a SecondaryTenant or a Tenant.
- A new test exercises both cases.
2024-03-21 13:39:23 +00:00
John Spray
bb47d536fb pageserver: quieten log on shutdown-while-attaching (#7177)
## Problem

If a shutdown happens when a tenant is attaching, we were logging at
ERROR severity and with a backtrace. Yuck.

## Summary of changes

- Pass a flag into `make_broken` to enable quietening this non-scary
case.
2024-03-21 12:56:13 +00:00
John Spray
59cdee749e storage controller: fixes to secondary location handling (#7169)
Stacks on:
- https://github.com/neondatabase/neon/pull/7165

Fixes while working on background optimization of scheduling after a
split:
- When a tenant has secondary locations, we weren't detaching the parent
shards' secondary locations when doing a split
- When a reconciler detaches a location, it was feeding back a
locationconf with `Detached` mode in its `observed` object, whereas it
should omit that location. This could cause the background reconcile
task to keep kicking off no-op reconcilers forever (harmless but
annoying).
- During shard split, we were scheduling secondary locations for the
child shards, but no reconcile was run for these until the next time the
background reconcile task ran. Creating these ASAP is useful, because
they'll be used shortly after a shard split as the destination locations
for migrating the new shards to different nodes.
2024-03-21 12:06:57 +00:00
Vlad Lazar
c75b584430 storage_controller: add metrics (#7178)
## Problem
Storage controller had basically no metrics.

## Summary of changes
1. Migrate the existing metrics to use Conrad's
[`measured`](https://docs.rs/measured/0.0.14/measured/) crate.
2. Add metrics for incoming http requests
3. Add metrics for outgoing http requests to the pageserver
4. Add metrics for outgoing pass through requests to the pageserver
5. Add metrics for database queries

Note that the metrics response for the attachment service does not use
chunked encoding like the rest of the metrics endpoints. Conrad has
kindly extended the crate such that it can now be done. Let's leave it
for a follow-up since the payload shouldn't be that big at this point.

Fixes https://github.com/neondatabase/neon/issues/6875
2024-03-21 12:00:20 +00:00
Conrad Ludgate
5ec6862bcf proxy: async aware password validation (#7176)
## Problem

spawn_blocking in #7171 was a hack

## Summary of changes

https://github.com/neondatabase/rust-postgres/pull/29
2024-03-21 11:58:41 +01:00
Jure Bajic
94138c1a28 Enforce LSN ordering of batch entries (#7071)
## Summary of changes

Enforce LSN ordering of batch entries.

Closes https://github.com/neondatabase/neon/issues/6707
2024-03-21 09:17:24 +00:00
Joonas Koivunen
2206e14c26 fix(layer): remove the need to repair internal state (#7030)
## Problem

The current implementation of struct Layer supports canceled read
requests, but those will leave the internal state such that a following
`Layer::keep_resident` call will need to repair the state. In
pathological cases seen during generation numbers resetting in staging
or with too many in-progress on-demand downloads, this repair activity
will need to wait for the download to complete, which stalls disk
usage-based eviction. Similar stalls have been observed in staging near
disk-full situations, where downloads failed because the disk was full.

Fixes #6028 or the "layer is present on filesystem but not evictable"
problems by:
1. not canceling pending evictions by a canceled
`LayerInner::get_or_maybe_download`
2. completing post-download initialization of the `LayerInner::inner`
from the download task

Not canceling evictions above case (1) and always initializing (2) lead
to plain `LayerInner::inner` always having the up-to-date information,
which leads to the old `Layer::keep_resident` never having to wait for
downloads to complete. Finally, the `Layer::keep_resident` is replaced
with `Layer::is_likely_resident`. These fix #7145.

## Summary of changes

- add a new test showing that a canceled get_or_maybe_download should
not cancel the eviction
- switch to using a `watch` internally rather than a `broadcast` to
avoid hanging eviction while a download is ongoing
- doc changes for new semantics and cleanup
- fix `Layer::keep_resident` to use just `self.0.inner.get()` as truth
as `Layer::is_likely_resident`
- remove `LayerInner::wanted_evicted` boolean as no longer needed

Builds upon: #7185. Cc: #5331.
2024-03-21 03:19:08 +02:00
Joonas Koivunen
a95c41f463 fix(heavier_once_cell): take_and_deinit should take ownership (#7185)
Small fix to remove confusing `mut` bindings.

Builds upon #7175, split off from #7030. Cc: #5331.
2024-03-21 00:42:38 +02:00
Tristan Partin
041b653a1a Add state diagram for compute
Models a compute's lifetime.
2024-03-20 17:10:46 -05:00
Alex Chi Z
55c4ef408b safekeeper: correctly handle signals (#7167)
errno is not preserved in the signal handler. This pull request fixes
it. Maybe related: https://github.com/neondatabase/neon/issues/6969, but
does not fix the flaky test problem.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-20 15:22:25 -04:00
Alex Chi Z
5f0d9f2360 fix: add safekeeper team to pgxn codeowners (#7170)
`pgxn/` also contains WAL proposer code, so modifications to this
directory should be able to be approved by the safekeeper team.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-20 18:40:48 +00:00
Arpad Müller
34fa34d15c Dump layer map json in test_gc_feedback.py (#7179)
The layer map json is an interesting file for that test, so dump it to
make debugging easier.
2024-03-20 18:39:46 +00:00
Joonas Koivunen
e961e0d3df fix(Layer): always init after downloading in the spawned task (#7175)
Before this PR, cancellation for `LayerInner::get_or_maybe_download`
could occur so that we have downloaded the layer file in the filesystem,
but because of the cancellation chance, we have not set the internal
`LayerInner::inner` or initialized the state. With the detached init
support introduced in #7135 and in place in #7152, we can now initialize
the internal state after successfully downloading in the spawned task.

The next PR will fix the remaining problems that this PR leaves:
- `Layer::keep_resident` is still used because
- `Layer::get_or_maybe_download` always cancels an eviction, even when
canceled

Split off from #7030. Stacked on top of #7152. Cc: #5331.
2024-03-20 20:37:47 +02:00
John Spray
2726b1934e pageserver: extra debug for test_secondary_downloads failures (#7183)
- Enable debug logs for this test
- Add some debug logging detail in downloader.rs
- Add an info-level message in scheduler.rs that makes it obvious if a
command is waiting for an existing task rather than spawning a new one.
2024-03-20 18:07:45 +00:00
Joonas Koivunen
3d16cda846 refactor(layer): use detached init (#7152)
The second part of work towards fixing `Layer::keep_resident` so that it
does not need to repair the internal state. #7135 added a nicer API for
initialization. This PR uses it to remove a few indentation levels and
the loop construction. The next PR #7175 will use the refactorings done
in this PR, and always initialize the internal state after a download.

Cc: #5331
2024-03-20 18:03:09 +02:00
Joonas Koivunen
fb66a3dd85 fix: ResidentLayer::load_keys should not create INFO level span (#7174)
Since #6115 with more often used get_value_reconstruct_data and friends,
we should not have needless INFO level span creation near hot paths. In
our prod configuration, INFO spans are always created, but in practice,
very rarely anything at INFO level is logged underneath.
`ResidentLayer::load_keys` is only used during compaction so it is not
that hot, but this aligns the access paths and their span usage.

PR changes the span level to debug to align with others, and adds the
layer name to the error which was missing.

Split off from #7030.
2024-03-20 15:08:03 +01:00
Conrad Ludgate
6d996427b1 proxy: enable sha2 asm support (#7184)
## Problem

faster sha2 hashing.

## Summary of changes

enable asm feature for sha2. this feature will be default in sha2 0.11,
so we might as well lean into it now. It provides a noticeable speed
boost on macos aarch64. Haven't tested on x86 though
2024-03-20 12:26:31 +00:00
Vlad Lazar
4ba3f3518e test: fix on demand activation test flakyness (#7180)
Warm-up (and the "tenant startup complete" metric update) happens in
a background tokio task. The tenant map is eagerly updated (can happen
before the task finishes).

The test assumed that if the tenant map was updated, then the metric
should reflect that. That's not the case, so we tweak the test to wait
for the metric.

Fixes https://github.com/neondatabase/neon/issues/7158
2024-03-20 10:24:59 +00:00
John Spray
a5d5c2a6a0 storage controller: tech debt (#7165)
This is a mixed bag of changes split out for separate review while
working on other things, and batched together to reduce load on CI
runners. Each commits stands alone for review purposes:
- do_tenant_shard_split was a long function and had a synchronous
validation phase at the start that could readily be pulled out into a
separate function. This also avoids the special casing of
ApiError::BadRequest when deciding whether an abort is needed on errors
- Add a 'describe' API (GET on tenant ID) that will enable storcon-cli
to see what's going on with a tenant
- the 'locate' API wasn't really meant for use in the field. It's for
tests: demote it to the /debug/ prefix
- The `Single` placement policy was a redundant duplicate of Double(0),
and Double was a bad name. Rename it Attached.
(https://github.com/neondatabase/neon/issues/7107)
- Some neon_local commands were added for debug/demos, which are now
replaced by commands in storcon-cli (#7114 ). Even though that's not
merged yet, we don't need the neon_local ones any more.

Closes https://github.com/neondatabase/neon/issues/7107

## Backward compat of Single/Double -> `Attached(n)` change

A database migration is used to convert any existing values.
2024-03-19 16:08:20 +00:00
Tristan Partin
64c6dfd3e4 Move functions for creating/extracting tarballs into utils
Useful for other code paths which will handle zstd compression and
decompression.
2024-03-19 10:50:41 -05:00
Alex Chi Z
a8384a074e fixup(#7168): neon_local: use pageserver defaults for known but unspecified config overrides (#7166)
e2e tests cannot run on macOS unless the file engine env var is
supplied.

```
./scripts/pytest test_runner/regress/test_neon_superuser.py -s
```

will fail with tokio-epoll-uring not supported.

This is because we persist the file engine config by default. In this
pull request, we only persist when someone specifies it, so that it can
use the default platform-variant config in the page server.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-19 10:43:24 -04:00
Conrad Ludgate
5bad8126dc Merge pull request #7173 from neondatabase/rc/proxy/2024-03-19
Proxy release 2024-03-19
2024-03-19 12:11:42 +00:00
John Spray
b80704cd34 tests: log hygiene checks for storage controller (#6710)
## Problem

As with the pageserver, we should fail tests that emit unexpected log
errors/warnings.

## Summary of changes

- Refactor existing log checks to be reusable
- Run log checks for attachment_service
- Add allow lists as needed.
2024-03-19 10:30:33 +00:00
Conrad Ludgate
49be446d95 async password validation (#7171)
## Problem

password hashing can block main thread

## Summary of changes

spawn_blocking the password hash call
2024-03-18 23:57:32 +01:00
Arthur Petukhovsky
ad5efb49ee Support backpressure for sharding (#7100)
Add shard_number to PageserverFeedback and parse it on the compute side.
When compute receives a new ps_feedback, it calculates min LSNs among
feedbacks from all shards, and uses those LSNs for backpressure.

Add `test_sharding_backpressure` to verify that backpressure slows down
compute to wait for the slowest shard.
2024-03-18 21:54:44 +00:00
Christian Schwarz
2bc2fd9cfd fixup(#7160 / tokio_epoll_uring_ext): double-panic caused by info! in thread-local's drop() (#7164)
Manual testing of the changes in #7160 revealed that, if the
thread-local destructor ever runs (it apparently doesn't in our test
suite runs, otherwise #7160 would not have auto-merged), we can
encounter an `abort()` due to a double-panic in the tracing code.

This github comment here contains the stack trace:
https://github.com/neondatabase/neon/pull/7160#issuecomment-2003778176

This PR reverts #7160 and uses a atomic counter to identify the
thread-local in log messages, instead of the memory address of the
thread local, which may be re-used.
2024-03-18 16:12:01 +01:00
Joonas Koivunen
877fd14401 fix: spanless log message (#7155)
with `immediate_gc` the span only covered the `gc_iteration`, make it
cover the whole needless spawned task, which also does waiting for layer
drops and stray logging in tests.

also clarify some comments while we are here.

Fixes: #6910
2024-03-18 16:27:53 +02:00
Christian Schwarz
db749914d8 fixup(#7141 / tokio_epoll_uring_ext): high frequency log message (#7160)
The PR #7141 added log message

```
ThreadLocalState is being dropped and id might be re-used in the future
```

which was supposed to be emitted when the thread-local is destroyed.
Instead, it was emitted on _each_ call to `thread_local_system()`,
ie.., on each tokio-epoll-uring operation.

Testing
-------

Reproduced the issue locally and verified that this PR fixes the issue.
2024-03-18 12:29:20 +00:00
John Spray
1d3ae57f18 pageserver: refactoring in TenantManager to reduce duplication (#6732)
## Problem

Followup to https://github.com/neondatabase/neon/pull/6725

In that PR, code for purging local files from a tenant shard was
duplicated.

## Summary of changes

- Refactor detach code into TenantManager
- `spawn_background_purge` method can now be common between detach and
split operations
2024-03-18 10:37:20 +00:00
Joonas Koivunen
30a3d80d2f build: make procfs linux only dependency (#7156)
the dependency refuses to build on macos so builds on `main` are broken
right now, including the `release` PR.
2024-03-18 09:28:45 +00:00
Christian Schwarz
5cec5cb3cf fixup(#7120): the macOS code used an outdated constant name, broke the build (#7150) 2024-03-15 19:48:51 +00:00
Christian Schwarz
0694ee9531 tokio-epoll-uring: retry on launch failures due to locked memory (#7141)
refs https://github.com/neondatabase/neon/issues/7136

Problem
-------

Before this PR, we were using
`tokio_epoll_uring::thread_local_system()`,
which panics on tokio_epoll_uring::System::launch() failure

As we've learned in [the

past](https://github.com/neondatabase/neon/issues/6373#issuecomment-1905814391),
some older Linux kernels account io_uring instances as locked memory.

And while we've raised the limit in prod considerably, we did hit it
once on 2024-03-11 16:30 UTC.
That was after we enabled tokio-epoll-uring fleet-wide, but before
we had shipped release-5090 (c6ed86d3d0)
which did away with the last mass-creation of tokio-epoll-uring
instances as per

    commit 3da410c8fe
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Tue Mar 5 10:03:54 2024 +0100

tokio-epoll-uring: use it on the layer-creating code paths (#6378)

Nonetheless, it highlighted that panicking in this situation is probably
not ideal, as it can leave the pageserver process in a semi-broken
state.

Further, due to low sampling rate of Prometheus metrics, we don't know
much about the circumstances of this failure instance.

Solution
--------

This PR implements a custom thread_local_system() that is
pageserver-aware
and will do the following on failure:
- dump relevant stats to `tracing!`, hopefully they will be useful to
  understand the circumstances better
- if it's the locked memory failure (or any other ENOMEM): abort() the
  process
- if it's ENOMEM, retry with exponential back-off, capped at 3s.
- add metric counters so we can create an alert

This makes sense in the production environment where we know that
_usually_, there's ample locked memory allowance available, and we know
the failure rate is rare.
2024-03-15 19:46:15 +00:00
John Spray
9752ad8489 pageserver, controller: improve secondary download APIs for large shards (#7131)
## Problem

The existing secondary download API relied on the caller to wait as long
as it took to complete -- for large shards that could be a long time, so
typical clients that might have a baked-in ~30s timeout would have a
problem.

## Summary of changes

- Take a `wait_ms` query parameter to instruct the pageserver how long
to wait: if the download isn't complete in this duration, then 201 is
returned instead of 200.
- For both 200 and 201 responses, include response body describing
download progress, in terms of layers and bytes. This is sufficient for
the caller to track how much data is being transferred and log/present
that status.
- In storage controller live migrations, use this API to apply a much
longer outer timeout, with smaller individual per-request timeouts, and
log the progress of the downloads.
- Add a test that injects layer download delays to exercise the new
behavior
2024-03-15 19:45:58 +00:00
Christian Schwarz
ad6f538aef tokio-epoll-uring: use it for on-demand downloads (#6992)
# Problem

On-demand downloads are still using `tokio::fs`, which we know is
inefficient.

# Changes

- Add `pagebench ondemand-download-churn` to quantify on-demand download
throughput
- Requires dumping layer map, which required making `history_buffer`
impl `Deserialize`
- Implement an equivalent of `tokio::io::copy_buf` for owned buffers =>
`owned_buffers_io` module and children.
- Make layer file download sensitive to `io_engine::get()`, using
VirtualFile + above copy loop
- For this, I had to move some code into the `retry_download`, e.g.,
`sync_all()` call.

Drive-by:
- fix missing escaping in `scripts/ps_ec2_setup_instance_store` 
- if we failed in retry_download to create a file, we'd try to remove
it, encounter `NotFound`, and `abort()` the process using
`on_fatal_io_error`. This PR adds treats `NotFound` as a success.

# Testing

Functional

- The copy loop is generic & unit tested.

Performance

- Used the `ondemand-download-churn` benchmark to manually test against
real S3.
- Results (public Notion page):
https://neondatabase.notion.site/Benchmarking-tokio-epoll-uring-on-demand-downloads-2024-04-15-newer-code-03c0fdc475c54492b44d9627b6e4e710?pvs=4
- Performance is equivalent at low concurrency. Jumpier situation at
high concurrency, but, still less CPU / throughput with
tokio-epoll-uring.
  - It’s a win.

# Future Work

Turn the manual performance testing described in the above results
document into a performance regression test:
https://github.com/neondatabase/neon/issues/7146
2024-03-15 18:57:05 +00:00
John Spray
1aa159acca pageserver: cancellation for remote ops in tenant deletion on shutdown (#6105)
## Problem

Tenant deletion had a couple of TODOs where we weren't using proper
cancellation tokens that would have aborted the deletions during process
shutdown.

## Summary of changes

- Refactor enough that deletion/shutdown code has access to the
TenantManager's cancellation toke
- Use that cancellation token in tenant deletion instead of dummy
tokens.
2024-03-15 18:03:49 +00:00
Christian Schwarz
60f30000ef tokio-epoll-uring: fallback to std-fs if not available & not explicitly requested (#7120)
fixes https://github.com/neondatabase/neon/issues/7116

Changes:

- refactor PageServerConfigBuilder: support not-set values
- implement runtime feature test
- use runtime feature test to determine `virtual_file_io_engine` if not
explicitly configured in the config
- log the effective engine at startup
- drive-by: improve assertion messages in `test_pageserver_init_node_id`

This needed a tiny bit of tokio-epoll-uring work, hence bumping it.
Changelog:

```
    git log --no-decorate --oneline --reverse 868d2c42b5d54ca82fead6e8f2f233b69a540d3e..342ddd197a060a8354e8f11f4d12994419fff939
    c7a74c6 Bump mio from 0.8.8 to 0.8.11
    4df3466 Bump mio from 0.8.8 to 0.8.11 (#47)
    342ddd1 lifecycle: expose `LaunchResult` enum (#49)
```
2024-03-15 17:46:04 +00:00
John Spray
bc1efa827f pageserver: exclude gc_horizon from synthetic size calculation (#6407)
## Problem

See:
- https://github.com/neondatabase/neon/issues/6374

## Summary of changes

Whereas previously we calculated synthetic size from the gc_horizon or
the pitr_interval (whichever is the lower LSN), now we ignore gc_horizon
and exclusively start from the `pitr_interval`. This is a more generous
calculation for billing, where we do not charge users for data retained
due to gc_horizon.
2024-03-15 16:07:36 +00:00
John Spray
67522ce83d docs: shard splitting RFC (#6358)
Extend the previous sharding RFC with functionality for dynamically splitting shards to increase the total shard count on existing tenants.
2024-03-15 16:00:04 +00:00
John Spray
7d32af5ad5 .github: apply timeout to pytest regress (#7142)
These test runs usually take 20-30 minutes. if something hangs, we see
actions proceeding for several hours: it's more convenient to have them
time out sooner so that we notice that something has hung faster.
2024-03-15 15:57:01 +00:00
Joonas Koivunen
59b6cce418 heavier_once_cell: add detached init support (#7135)
Aiming for the design where `heavier_once_cell::OnceCell` is initialized
by a future factory lead to awkwardness with how
`LayerInner::get_or_maybe_download` looks right now with the `loop`. The
loop helps with two situations:

- an eviction has been scheduled but has not yet happened, and a read
access should cancel the eviction
- a previous `LayerInner::get_or_maybe_download` that canceled a pending
eviction was canceled leaving the `heavier_once_cell::OnceCell`
uninitialized but needing repair by the next
`LayerInner::get_or_maybe_download`

By instead supporting detached initialization in
`heavier_once_cell::OnceCell` via an `OnceCell::get_or_detached_init`,
we can fix what the monolithic #7030 does:
- spawned off download task initializes the
`heavier_once_cell::OnceCell` regardless of the download starter being
canceled
- a canceled `LayerInner::get_or_maybe_download` no longer stops
eviction but can win it if not canceled

Split off from #7030.

Cc: #5331
2024-03-15 15:54:28 +00:00
Joonas Koivunen
bf187aa13f fix(layer): metric miscalculations (#7137)
Split off from #7030:
- each early exit is counted as canceled init, even though it most
likely was just `LayerInner::keep_resident` doing the no-download repair
check
- `downloaded_after` could had been accounted for multiple times, and
also when repairing to match on-disk state

Cc: #5331
2024-03-15 17:30:13 +02:00
John Spray
22c26d610b pageserver: remove un-needed "uninit mark" (#5717)
Switched the order; doing https://github.com/neondatabase/neon/pull/6139
first then can remove uninit marker after.

## Problem

Previously, existence of a timeline directory was treated as evidence of
the timeline's logical existence. That is no longer the case since we
treat remote storage as the source of truth on each startup: we can
therefore do without this mark file.

The mark file had also been used as a pseudo-lock to guard against
concurrent creations of the same TimelineId -- now that persistence is
no longer required, this is a bit unwieldy.

In #6139 the `Tenant::timelines_creating` was added to protect against
concurrent creations on the same TimelineId, making the uninit mark file
entirely redundant.

## Summary of changes

- Code that writes & reads mark file is removed
- Some nearby `pub` definitions are amended to `pub(crate)`
- `test_duplicate_creation` is added to demonstrate that mutual
exclusion of creations still works.
2024-03-15 17:23:05 +02:00
John Spray
516f793ab4 remote_storage: make last_modified and etag mandatory (#7126)
## Problem

These fields were only optional for the convenience of the `local_fs`
test helper -- real remote storage backends provide them. It complicated
any code that actually wanted to use them for anything.

## Summary of changes

- Make these fields non-optional
- For azure/S3 it is an error if the server doesn't provide them
- For local_fs, use random strings as etags and the file's mtime for
last_modified.
2024-03-15 13:37:49 +00:00
John Spray
6443dbef90 tests: extend log allow list for test_sharding_split_failures (#7134)
Failure types that panic the storage controller can cause unlucky
pageservers to emit log warnings that they can't reach the generation
validation API:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/8284495687/index.html

Tolerate this log message: it's an expected behavior.
2024-03-15 13:18:12 +00:00
John Spray
23416cc358 docs: sharding phase 1 RFC (#5432)
We need to shard our Tenants to support larger databases without those
large databases dominating our pageservers and/or requiring dedicated
pageservers.

This RFC aims to define an initial capability that will permit creating
large-capacity databases using a static configuration
defined at time of Tenant creation.

Online re-sharding is deferred as future work, as is offloading layers
for historical reads. However, both of these capabilities would be
implementable without further changes to the control plane or compute:
this RFC aims to define the cross-component work needed to bootstrap
sharding end-to-end.
2024-03-15 11:14:25 +00:00
Anna Khanova
46098ea0ea proxy: add more missing warm logging (#7133)
## Problem

There is one more missing thing about cached connections for
`cold_start_info`.

## Summary of changes

Fix and add comments.
2024-03-15 11:13:15 +00:00
Conrad Ludgate
49bc734e02 proxy: add websocket regression tests (#7121)
## Problem

We have no regression tests for websocket flow

## Summary of changes

Add a hacky implementation of the postgres protocol over websockets just
to verify the protocol behaviour does not regress over time.
2024-03-15 10:21:48 +01:00
Alex Chi Z
76c44dc140 spec: disable neon extension auto upgrade (#7128)
This pull request disables neon extension auto upgrade to help the next
compute image upgrade smooth.

## Summary of changes

We have two places to auto-upgrade neon extension: during compute spec
update, and when the compute node starts. The compute spec update logic
is always there, and the compute node start logic is added in
https://github.com/neondatabase/neon/pull/7029. In this pull request, we
disable both of them, so that we can still roll back to an older version
of compute before figuring out the best way of extension
upgrade-downgrade. https://github.com/neondatabase/neon/issues/6936

We will enable auto-upgrade in the next release following this release.

There are no other extension upgrades from release 4917 and therefore
after this pull request, it would be safe to revert to release 4917.

Impact:

* Project created after unpinning the compute image -> if we need to
roll back, **they will stuck**, because the default neon extension
version is 1.3. Need to manually pin the compute image version if such
things happen.
* Projects already stuck on staging due to not downgradeable -> I don't
know their current status, maybe they are already running the latest
compute image?
* Other projects -> can be rolled back to release 4917.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-14 19:45:38 +00:00
Joonas Koivunen
58ef78cf41 doc(README): note cargo-nextest usage (#7122)
We have been using #5681 for quite some time, and at least since #6931
the tests have assumed `cargo-nextest` to work around our use of global
statics. Unlike the `cargo test`, the `cargo nextest run` runs each test
as a separate process that can be timeouted.

Add a mention of using `cargo-nextest` in the top-level README.md.
Sub-crates can still declare they support `cargo test`, like
`compute_tools/README.md` does.
2024-03-14 18:49:42 +00:00
John Spray
678ed39de2 storage controller: validate DNS of registering nodes (#7101)
A node with a bad DNS configuration can register itself with the storage
controller, and the controller will try and schedule work onto the node,
but never succeed because it can't reach the node.

The DNS case is a special case of asymmetric network issues. The general
case isn't covered here -- but might make sense to tighten up after
#6844 merges -- then we can avoid assuming a node is immediately
available in re_attach.
2024-03-14 16:48:38 +00:00
Vlad Lazar
3d8830ac35 test_runner: re-enable large slru benchmark (#7125)
Previously disabled due to
https://github.com/neondatabase/neon/issues/7006.
2024-03-14 16:47:32 +00:00
Vlad Lazar
38767ace68 storage_controller: periodic pageserver heartbeats (#7092)
## Problem
If a pageserver was offline when the storage controller started, there
was no mechanism to update the
storage controller state when the pageserver becomes active.

## Summary of changes
* Add a heartbeater module. The heartbeater must be driven by an
external loop.
* Integrate the heartbeater into the service.
- Extend the types used by the service and scheduler to keep track of a
nodes' utilisation score.
- Add a background loop to drive the heartbeater and update the state
based on the deltas it generated
  - Do an initial round of heartbeats at start-up
2024-03-14 15:21:36 +00:00
Arseny Sher
9fe0193e51 Bump vendor/postgres v15 v14. 2024-03-14 18:06:53 +04:00
Christian Schwarz
8075f0965a fix(test suite) virtual_file_io_engine and get_vectored_impl patametrization doesn't work (#7113)
# Problem

While investigating #7124, I noticed that the benchmark was always using
the `DEFAULT_*` `virtual_file_io_engine` , i.e., `tokio-epoll-uring` as
of https://github.com/neondatabase/neon/pull/7077.

The fundamental problem is that the `control_plane` code has its own
view of `PageServerConfig`, which, I believe, will always be a subset of
the real pageserver's `pageserver/src/config.rs`.

For the `virtual_file_io_engine` and `get_vectored_impl` parametrization
of the test suite, we were constructing a dict on the Python side that
contained these parameters, then handed it to
`control_plane::PageServerConfig`'s derived `serde::Deserialize`.
The default in serde is to ignore unknown fields, so, the Deserialize
impl silently ignored the fields.
In consequence, the fields weren't propagated to the `pageserver --init`
call, and the tests ended up using the
`pageserver/src/config.rs::DEFAULT_` values for the respective options
all the time.

Tests that explicitly used overrides in `env.pageserver.start()` and
similar were not affected by this.

But, it means that all the test suite runs where with parametrization
didn't properly exercise the code path.

# Changes

- use `serde(deny_unknown_fields)` to expose the problem  
- With this change, the Python tests that override
`virtual_file_io_engine` and
`get_vectored_impl` fail on `pageserver --init`, exposing the problem.
- use destructuring to uncover the issue in the future
- fix the issue by adding the missing fields to the `control_plane`
crate's `PageServerConf`
- A better solution would be for control plane to re-use a struct
provided
    by the pageserver crate, so that everything is in one place in
    `pageserver/src/config.rs`, but, our config parsing code is (almost)
    beyond repair anyways.
- fix the `pageserver_virtual_file_io_engine` to be responsive to the
env var
  - => required to make parametrization work in benchmarks

# Testing

Before merging this PR, I re-ran the regression tests & CI with the full
matrix of `virtual_file_io_engine` and `tokio-epoll-uring`, see
9c7ea364e0
2024-03-14 11:18:55 +00:00
Anna Khanova
27bc242085 Merge pull request #7119 from neondatabase/rc/proxy/2024-03-14
Proxy release 2024-03-14
2024-03-14 14:57:05 +05:00
Anna Khanova
192b49cc6d Merge branch 'release-proxy' into rc/proxy/2024-03-14 2024-03-14 14:16:36 +05:00
John Spray
44f42627dd pageserver/controller: error handling for shard splitting (#7074)
## Problem

Shard splits worked, but weren't safe against failures (e.g. node crash
during split) yet.

Related: #6676 

## Summary of changes

- Introduce async rwlocks at the scope of Tenant and Node:
  - exclusive tenant lock is used to protect splits
- exclusive node lock is used to protect new reconciliation process that
happens when setting node active
- exclusive locks used in both cases when doing persistent updates (e.g.
node scheduling conf) where the update to DB & in-memory state needs to
be atomic.
- Add failpoints to shard splitting in control plane and pageserver
code.
- Implement error handling in control plane for shard splits: this
detaches child chards and ensures parent shards are re-attached.
- Crash-safety for storage controller restarts requires little effort:
we already reconcile with nodes over a storage controller restart, so as
long as we reset any incomplete splits in the DB on restart (added in
this PR), things are implicitly cleaned up.
- Implement reconciliation with offline nodes before they transition to
active:
- (in this context reconciliation means something like
startup_reconcile, not literally the Reconciler)
- This covers cases where split abort cannot reach a node to clean it
up: the cleanup will eventually happen when the node is marked active,
as part of reconciliation.
- This also covers the case where a node was unavailable when the
storage controller started, but becomes available later: previously this
allowed it to skip the startup reconcile.
- Storage controller now terminates on panics. We only use panics for
true "should never happen" assertions, and these cases can leave us in
an un-usable state if we keep running (e.g. panicking in a shard split).
In the unlikely event that we get into a crashloop as a result, we'll
rely on kubernetes to back us off.
- Add `test_sharding_split_failures` which exercises a variety of
failure cases during shard split.
2024-03-14 09:11:57 +00:00
Conrad Ludgate
3bd6551b36 proxy http cancellation safety (#7117)
## Problem

hyper auto-cancels the request futures on connection close.
`sql_over_http::handle` is not 'drop cancel safe', so we need to do some
other work to make sure connections are queries in the right way.

## Summary of changes

1. tokio::spawn the request handler to resolve the initial cancel-safety
issue
2. share a cancellation token, and cancel it when the request `Service`
is dropped.
3. Add a new log span to be able to track the HTTP connection lifecycle.
2024-03-14 08:20:56 +00:00
Christian Schwarz
69338e53e3 throttling: fixup interactions with Timeline::get_vectored (#7089)
## Problem

Before this PR, `Timeline::get_vectored` would be throttled twice if the
sequential option was enabled or if validation was enabled.

Also, `pageserver_get_vectored_seconds` included the time spent in the
throttle, which turns out to be undesirable for what we use that metric
for.

## Summary of changes

Double-throttle:

* Add `Timeline::get0` method which is unthrottled.
* Use that method from within the `Timeline::get_vectored` code path.

Metric:

* return throttled time from `throttle()` method
* deduct the value from the observed time
* globally rate-limited logging of duration subtraction errors, like in
all other places that do the throttled-time deduction from observations
2024-03-13 17:49:17 +00:00
Arpad Müller
5309711691 Make tenant_id in TenantLocationConfigRequest optional (#7055)
The `tenant_id` in `TenantLocationConfigRequest` in the
`location_config` endpoint was only used in the storage
controller/attachment service, and there it was only used for assertions
and the creation part.
2024-03-13 17:30:29 +01:00
Joonas Koivunen
8a53d576e6 fix(metrics): time individual layer flush operations (#7109)
Currently, the flushing operation could flush multiple frozen layers to
the disk and store the aggregate time in the histogram. The result is a
bimodal distribution with short and over 1000-second flushes. Change it
so that we record how long one layer flush takes.
2024-03-13 15:10:20 +00:00
Anna Khanova
b0aff04157 proxy: add new dimension to exclude cplane latency (#7011)
## Problem

Currently cplane communication is a part of the latency monitoring. It
doesn't allow to setup the proper alerting based on proxy latency.

## Summary of changes

Added dimension to exclude cplane latency.
2024-03-13 13:50:05 +01:00
Anna Khanova
0554bee022 proxy: Report warm cold start if connection is from the local cache (#7104)
## Problem

* quotes in serialized string
* no status if connection is from local cache

## Summary of changes

* remove quotes
* report warm if connection if from local cache
2024-03-13 11:45:19 +00:00
Conrad Ludgate
83855a907c proxy http error classification (#7098)
## Problem

Missing error classification for SQL-over-HTTP queries.
Not respecting `UserFacingError` for SQL-over-HTTP queries.

## Summary of changes

Adds error classification.
Adds user facing errors.
2024-03-13 07:35:49 +01:00
John Spray
1b41db8bdd pageserver: enable setting stripe size inline with split request. (#7093)
## Summary

- Currently we can set stripe size at tenant creation, but it doesn't
mean anything until we have multiple shards
- When onboarding an existing tenant, it will always get a default shard
stripe size, so we would like to be able to pick the actual stripe size
at the point we split.

## Why do this inline with a split?

The alternative to this change would be to have a separate endpoint on
the storage controller for setting the stripe size on a tenant, and only
permit writes to that endpoint when the tenant has only a single shard.
That would work, but be a little bit more work for a client, and not
appreciably simpler (instead of having a special argument to the split
functions, we'd have a special separate endpoint, and a requirement that
the controller must sync its config down to the pageserver before
calling the split API). Either approach would work, but this one feels a
bit more robust end-to-end: the split API is the _very last moment_ that
the stripe size is mutable, so if we aim to set it before splitting, it
makes sense to do it as part of the same operation.
2024-03-12 20:41:08 +00:00
Jure Bajic
bac06ea1ac pageserver: fix read path max lsn bug (#7007)
## Summary of changes
The problem it fixes is when `request_lsn` is `u64::MAX-1` the
`cont_lsn` becomes `u64::MAX` which is the same as `prev_lsn` which
stops the loop.

Closes https://github.com/neondatabase/neon/issues/6812
2024-03-12 16:32:47 +00:00
John Spray
7ae8364b0b storage controller: register nodes in re-attach request (#7040)
## Problem

Currently we manually register nodes with the storage controller, and
use a script during deploy to register with the cloud control plane.
Rather than extend that script further, nodes should just register on
startup.

## Summary of changes

- Extend the re-attach request to include an optional
NodeRegisterRequest
- If the `register` field is set, handle it like a normal node
registration before executing the normal re-attach work.
- Update tests/neon_local that used to rely on doing an explicit
register step that could be enabled/disabled.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-03-12 14:47:12 +00:00
Conrad Ludgate
1f7d54f987 proxy refactor tls listener (#7056)
## Problem

Now that we have tls-listener vendored, we can refactor and remove a lot
of bloated code and make the whole flow a bit simpler

## Summary of changes

1. Remove dead code
2. Move the error handling to inside the `TlsListener` accept() function
3. Extract the peer_addr from the PROXY protocol header and log it with
errors
2024-03-12 13:05:40 +00:00
Arthur Petukhovsky
580e136b2e Forward all backpressure feedback to compute (#7079)
Previously we aggregated ps_feedback on each safekeeper and sent it to
walproposer with every AppendResponse. This PR changes it to send
ps_feedback to walproposer right after receiving it from pageserver,
without aggregating it in memory. Also contains some preparations for
implementing backpressure support for sharding.
2024-03-12 12:14:02 +00:00
Conrad Ludgate
09699d4bd8 proxy: cancel http queries on timeout (#7031)
## Problem

On HTTP query timeout, we should try and cancel the current in-flight
SQL query.

## Summary of changes

Trigger a cancellation command in postgres once the timeout is reach
2024-03-12 11:52:00 +00:00
John Spray
89cf714890 tests/neon_local: rename "attachment service" -> "storage controller" (#7087)
Not a user-facing change, but can break any existing `.neon` directories
created by neon_local, as the name of the database used by the storage
controller changes.

This PR changes all the locations apart from the path of
`control_plane/attachment_service` (waiting for an opportune moment to
do that one, because it's the most conflict-ish wrt ongoing PRs like
#6676 )
2024-03-12 11:36:27 +00:00
Heikki Linnakangas
621ea2ec44 tests: try to make restored-datadir comparison tests not flaky v2
This test occasionally fails with a difference in "pg_xact/0000" file
between the local and restored datadirs. My hypothesis is that
something changed in the database between the last explicit checkpoint
and the shutdown. I suspect autovacuum, it could certainly create
transactions.

To fix, be more precise about the point in time that we compare. Shut
down the endpoint first, then read the last LSN (i.e. the shutdown
checkpoint's LSN), from the local disk with pg_controldata. And use
exactly that LSN in the basebackup.

Closes #559
2024-03-11 23:29:32 +04:00
Heikki Linnakangas
74d09b78c7 Keep walproposer alive until shutdown checkpoint is safe on safekepeers
The walproposer pretends to be a walsender in many ways. It has a
WalSnd slot, it claims to be a walsender by calling
MarkPostmasterChildWalSender() etc. But one different to real
walsenders was that the postmaster still treated it as a bgworker
rather than a walsender. The difference is that at shutdown,
walsenders are not killed until the very end, after the checkpointer
process has written the shutdown checkpoint and exited.

As a result, the walproposer always got killed before the shutdown
checkpoint was written, so the shutdown checkpoint never made it to
safekeepers. That's fine in principle, we don't require a clean
shutdown after all. But it also feels a bit silly not to stream the
shutdown checkpoint. It could be useful for initializing hot standby
mode in a read replica, for example.

Change postmaster to treat background workers that have called
MarkPostmasterChildWalSender() as walsenders. That unfortunately
requires another small change in postgres core.

After doing that, walproposers stay alive longer. However, it also
means that the checkpointer will wait for the walproposer to switch to
WALSNDSTATE_STOPPING state, when the checkpointer sends the
PROCSIG_WALSND_INIT_STOPPING signal. We don't have the machinery in
walproposer to receive and handle that signal reliably. Instead, we
mark walproposer as being in WALSNDSTATE_STOPPING always.

In commit 568f91420a, I assumed that shutdown will wait for all the
remaining WAL to be streamed to safekeepers, but before this commit
that was not true, and the test became flaky. This should make it
stable again.

Some tests wrongly assumed that no WAL could have been written between
pg_current_wal_flush_lsn and quick pg stop after it. Fix them by introducing
flush_ep_to_pageserver which first stops the endpoint and then waits till all
committed WAL reaches the pageserver.

In passing extract safekeeper http client to its own module.
2024-03-11 23:29:32 +04:00
Arseny Sher
0cf0731d8b SIGQUIT instead of SIGKILL prewarmed postgres.
To avoid orphaned processes using wiped datadir with confusing logging.
2024-03-11 22:36:52 +04:00
Sasha Krassovsky
98723844ee Don't return from inside PG_TRY (#7095)
## Problem
Returning from PG_TRY is a bug, and we currently do that

## Summary of changes
Make it break and then return false. This should also help stabilize
test_bad_connection.py
2024-03-11 18:36:39 +00:00
Alex Chi Z
73a8c97ac8 fix: warnings when compiling neon extensions (#7053)
proceeding https://github.com/neondatabase/neon/pull/7010, close
https://github.com/neondatabase/neon/issues/6188

## Summary of changes

This pull request (should) fix all warnings except
`-Wdeclaration-after-statement` in the neon extension compilation.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-11 17:49:58 +00:00
Christian Schwarz
17a3c9036e follow-up(#7077): adjust flaky-test-detection cutoff date for tokio-epoll-uring (#7090)
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-03-11 16:36:49 +00:00
Joonas Koivunen
8c5b310090 fix: Layer delete on drop and eviction can outlive timeline shutdown (#7082)
This is a follow-up to #7051 where `LayerInner::drop` and
`LayerInner::evict_blocking` were not noticed to require a gate before
the file deletion. The lack of entering a gate opens up a similar
possibility of deleting a layer file which a newer Timeline instance has
already checked out to be resident in a similar case as #7051.
2024-03-11 16:54:06 +01:00
Christian Schwarz
8224580f3e fix(tenant/timeline metrics): race condition during shutdown + recreation (#7064)
Tenant::shutdown or Timeline::shutdown completes and becomes externally
observable before the corresponding Tenant/Timeline object is dropped.

For example, after observing a Tenant::shutdown to complete, we could
attach the same tenant_id again. The shut down Tenant object might still
be around at the time of the attach.

The race is then the following:
- old object's metrics are still around
- new object uses with_label_values
- old object calls remove_label_values

The outcome is that the new object will have the metric objects (they're
an Arc internall) but the metrics won't be part of the internal registry
and hence they'll be missing in `/metrics`.

Later, when the new object gets shut down and tries to
remove_label_value, it will observe an error because
the metric was already removed by the old object.

Changes
-------

This PR moves metric removal to `shutdown()`.

An alternative design would be to multi-version the metrics using a
distinguishing label, or, to use a better metrics crate that allows
removing metrics from the registry through the locally held metric
handle instead of interacting with the (globally shared) registry.

refs https://github.com/neondatabase/neon/pull/7051
2024-03-11 15:41:41 +01:00
Christian Schwarz
2b0f3549f7 default to tokio-epoll-uring in CI tests & on Linux (#7077)
All of production is using it now as of
https://github.com/neondatabase/aws/pull/1121

The change in `flaky_tests.py` resets the flakiness detection logic.

The alternative would have been to repeat the choice of io engine in
each test name, which would junk up the various test reports too much.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-03-11 14:35:59 +00:00
John Spray
b4972d07d4 storage controller: refactor non-mutable members up into Service (#7086)
result_tx and compute_hook were in ServiceState (i.e. behind a sync
mutex), but didn't need to be.

Moving them up into Service removes a bunch of boilerplate clones.

While we're here, create a helper `Service::maybe_reconcile_shard` which
avoids writing out all the `&self.` arguments to
`TenantState::maybe_reconcile` everywhere we call it.
2024-03-11 14:29:32 +00:00
Joonas Koivunen
26ae7b0b3e fix(metrics): reset TENANT_STATE metric on startup (#7084)
Otherwise, it might happen that we never get to witness the same state
on subsequent restarts, thus the time series will show the value from a
few restarts ago.

The actual case here was that "Activating" was showing `3` while I was
doing tenant migration testing on staging. The number 3 was however from
a startup that happened some time ago which had been interrupted by
another deployment.
2024-03-11 13:25:53 +00:00
John Spray
f8483cc4a3 pageserver: update swagger for HA APIs (#7070)
- The type of heatmap_period in tenant config was wrrong
- Secondary download and heatmap upload endpoints weren't in swagger.
2024-03-11 09:32:17 +00:00
Conrad Ludgate
cc5d6c66b3 proxy: categorise new cplane error message (#7057)
## Problem

`422 Unprocessable Entity: compute time quota of non-primary branches is
exceeded` being marked as a control plane error.

## Summary of changes

Add the manual checks to make this a user error that should not be
retried.
2024-03-11 09:20:09 +01:00
Roman Zaynetdinov
d894d2b450 Export db size, deadlocks and changed row metrics (#7050)
## Problem

We want to report metrics for the oldest user database.
2024-03-11 08:10:04 +00:00
Joonas Koivunen
b09d686335 fix: on-demand downloads can outlive timeline shutdown (#7051)
## Problem

Before this PR, it was possible that on-demand downloads were started
after `Timeline::shutdown()`.

For example, we have observed a walreceiver-connection-handler-initiated
on-demand download that was started after `Timeline::shutdown()`s final
`task_mgr::shutdown_tasks()` call.

The underlying issue is that `task_mgr::shutdown_tasks()` isn't sticky,
i.e., new tasks can be spawned during or after
`task_mgr::shutdown_tasks()`.

Cc: https://github.com/neondatabase/neon/issues/4175 in lieu of a more
specific issue for task_mgr. We already decided we want to get rid of it
anyways.

Original investigation:
https://neondb.slack.com/archives/C033RQ5SPDH/p1709824952465949

## Changes

- enter gate while downloading
- use timeline cancellation token for cancelling download

thereby, fixes #7054

Entering the gate might also remove recent "kept the gate from closing"
in staging.
2024-03-09 13:09:08 +00:00
Christian Schwarz
74d24582cf throttling: exclude throttled time from basebackup (fixup of #6953) (#7072)
PR #6953 only excluded throttled time from the handle_pagerequests
(aka smgr metrics).

This PR implements the deduction for `basebackup ` queries.

The other page_service methods either don't use Timeline::get
or they aren't used in production.

Found by manually inspecting in [staging
logs](https://neonprod.grafana.net/explore?schemaVersion=1&panes=%7B%22wx8%22:%7B%22datasource%22:%22xHHYY0dVz%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bhostname%3D%5C%22pageserver-0.eu-west-1.aws.neon.build%5C%22%7D%20%7C~%20%60git-env%7CERR%7CWARN%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22xHHYY0dVz%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22to%22:%221709919114642%22,%22from%22:%221709904430898%22%7D%7D%7D).
2024-03-09 13:37:02 +01:00
Sasha Krassovsky
4834d22d2d Revoke REPLICATION (#7052)
## Problem
Currently users can cause problems with replication
## Summary of changes
Don't let them replicate
2024-03-08 22:24:30 +00:00
Anastasia Lubennikova
86e8c43ddf Add downgrade scripts for neon extension. (#7065)
## Problem

When we start compute with newer version of extension (i.e. 1.2) and
then rollback the release, downgrading the compute version, next compute
start will try to update extension to the latest version available in
neon.control (i.e. 1.1).

Thus we need to provide downgrade scripts like neon--1.2--1.1.sql

These scripts must revert the changes made by the upgrade scripts in the
reverse order. This is necessary to ensure that the next upgrade will
work correctly.

In general, we need to write upgrade and downgrade scripts to be more
robust and add IF EXISTS / CREATE OR REPLACE clauses to all statements
(where applicable).

## Summary of changes
Adds downgrade scripts.
Adds test cases for extension downgrade/upgrade. 

fixes #7066

This is a follow-up for
https://app.incident.io/neondb/incidents/167?tab=follow-ups

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z <iskyzh@gmail.com>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2024-03-08 20:42:35 +00:00
John Spray
7329413705 storage controller: enable setting PlacementPolicy in tenant creation (#7037)
## Problem

Tenants created via the storage controller have a `PlacementPolicy` that
defines their HA/secondary/detach intent. For backward compat we can
just set it to Single, for onboarding tenants using /location_conf it is
automatically set to Double(1) if there are at least two pageservers,
but for freshly created tenants we didn't have a way to specify it.

This unblocks writing tests that create HA tenants on the storage
controller and do failure injection testing.

## Summary of changes

- Add optional fields to TenantCreateRequest for specifying
PlacementPolicy. This request structure is used both on pageserver API
and storage controller API, but this method is only meaningful for the
storage controller (same as existing `shard_parameters` attribute).
- Use the value from the creation request in tenant creation, if
provided.
2024-03-08 15:34:53 +00:00
Conrad Ludgate
e1b60f3693 Merge pull request #7041 from neondatabase/rc/proxy/2024-03-07
Proxy release 2024-03-07
2024-03-08 08:19:16 +00:00
Conrad Ludgate
2c132e45cb proxy: do not store ephemeral endpoints in http pool (#6819)
## Problem

For the ephemeral endpoint feature, it's not really too helpful to keep
them around in the connection pool. This isn't really pressing but I
think it's still a bit better this way.

## Summary of changes

Add `is_ephemeral` function to `NeonOptions`. Allow
`serverless::ConnInfo::endpoint_cache_key()` to return an `Option`.
Handle that option appropriately
2024-03-08 07:56:23 +00:00
Vlad Lazar
0f05ef67e2 pageserver: revert open layer rolling revert (#6962)
## Problem
We reverted https://github.com/neondatabase/neon/pull/6661 a few days
ago. The change led to OOMs in
benchmarks followed by large WAL reingests.

The issue was that we removed [this
code](d04af08567/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L409-L417)).
That call may trigger a roll of the open layer due to
the keepalive messages received from the safekeeper. Removing it meant
that enforcing
of checkpoint timeout became even more lax and led to using up large
amounts of memory
for the in memory layer indices.

## Summary of changes
Piggyback on keep alive messages to enforce checkpoint timeout. This is
a hack, but it's exactly what
the current code is doing.

## Alternatives
Christhian, Joonas and myself sketched out a timer based approach
[here](https://github.com/neondatabase/neon/pull/6940). While discussing
it further, it became obvious that's also a bit of a hack and not the
desired end state. I chose not
to take that further since it's not what we ultimately want and it'll be
harder to rip out.

Right now it's unclear what the ideal system behaviour is:
* early flushing on memory pressure, or ...
* detaching tenants on memory pressure
2024-03-07 19:53:10 +00:00
Conrad Ludgate
02358b21a4 update rustls (#7048)
## Summary of changes

Update rustls from 0.21 to 0.22.

reqwest/tonic/aws-smithy still use rustls 0.21. no upgrade route
available yet.
2024-03-07 18:23:19 +00:00
Sasha Krassovsky
2fc89428c3 Hopefully stabilize test_bad_connection.py (#6976)
## Problem
It seems that even though we have a retry on basebackup, it still
sometimes fails to fetch it with the failpoint enabled, resulting in a
test error.

## Summary of changes
If we fail to get the basebackup, disable the failpoint and try again.
2024-03-07 10:12:06 -08:00
Arpad Müller
ce7a82db05 Update svg_fmt (#7049)
Gets upstream PR https://github.com/nical/rust_debug/pull/3 , removes
trailing "s from output.
2024-03-07 17:32:09 +00:00
John Spray
d5a6a2a16d storage controller: robustness improvements (#7027)
## Problem


Closes: https://github.com/neondatabase/neon/issues/6847
Closes: https://github.com/neondatabase/neon/issues/7006

## Summary of changes

- Pageserver API calls are wrapped in timeout/retry logic: this prevents
a reconciler getting hung on a pageserver API hang, and prevents
reconcilers having to totally retry if one API call returns a retryable
error (e.g. 503).
- Add a cancellation token to `Node`, so that when we mark a node
offline we will cancel any API calls in progress to that node, and avoid
issuing any more API calls to that offline node.
- If the dirty locations of a shard are all on offline nodes, then don't
spawn a reconciler
- In re-attach, if we have no observed state object for a tenant then
construct one with conf: None (which means "unknown"). Then in
Reconciler, implement a TODO for scanning such locations before running,
so that we will avoid spuriously incrementing a generation in the case
of a node that was offline while we started (this is the case that
tripped up #7006)
- Refactoring: make Node contents private (and thereby guarantee that
updates to availability mode reliably update the cancellation token.)
- Refactoring: don't pass the whole map of nodes into Reconciler (and
thereby remove a bunch of .expect() calls)

Some of this was discovered/tested with a new failure injection test
that will come in a separate PR, once it is stable enough for CI.
2024-03-07 17:10:03 +00:00
Vlad Lazar
871977f14c pageserver: fix early bail out in vectored get (#7038)
## Problem
When vectored get encountered a portion of the key range that could
not be mapped to any layer in the current timeline it would incorrectly
bail out of the current timeline. This is incorrect since we may have
had layers queued for a visit in the fringe.

## Summary of changes
* Add a repro unit test
* Remove the early bail out path
* Simplify range search return value
2024-03-07 16:02:20 +00:00
Joonas Koivunen
602a4da9a5 bench: run branch_creation_many at 500, seeded (#6959)
We have a benchmark for creating a lot of branches, but it does random
things, and the branch count is not what we is the largest maximum we
aim to support. If this PR would stabilize the benchmark total duration
it means that there are some structures which are very much slower than
others. Then we should add a seed-outputting variant to help find and
reproduce such cases.

Additionally, record for the benchmark:
- shutdown duration
- startup metrics once done (on restart)
- duration of first compaction completion via debug logging
2024-03-07 16:23:42 +02:00
John Spray
d3c583efbe Rename binary attachment_service -> storage_controller (#7042)
## Problem

The storage controller binary still has its historic
`attachment_service` name -- it will be painful to change this later
because we can't atomically update this repo and the helm charts used to
deploy.

Companion helm chart change:
https://github.com/neondatabase/helm-charts/pull/70

## Summary of changes

- Change the name of the binary to `storage_controller`
- Skipping renaming things in the source right now: this is just to get
rid of the legacy name in external interfaces.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-03-07 14:06:48 +00:00
Vlad Lazar
d03ec9d998 pageserver: don't validate vectored get on shut-down (#7039)
## Problem
We attempted validation for cancelled errors under the assumption that
if vectored get fails, sequential get will too.
That's not right 100% of times though because sequential get may have
the values cached and slip them through
even when shutting down.

## Summary of changes
Don't validate if either search impl failed due to tenant shutdown.
2024-03-07 12:37:52 +00:00
Conrad Ludgate
c2876ec55d proxy http tls investigations (#7045)
## Problem

Some HTTP-specific TLS errors

## Summary of changes

Add more logging, vendor `tls-listener` with minor modifications.
2024-03-07 12:36:47 +00:00
Alex Chi Z
0b330e1310 upgrade neon extension on startup (#7029)
## Problem

Fix https://github.com/neondatabase/neon/issues/7003. Fix
https://github.com/neondatabase/neon/issues/6982. Currently, neon
extension is only upgraded when new compute spec gets applied, for
example, when creating a new role or creating a new database. This also
resolves `neon.lfc_stat` not found warnings in prod.

## Summary of changes

This pull request adds the logic to spawn a background thread to upgrade
the neon extension version if the compute is a primary. If for whatever
reason the upgrade fails, it reports an error to the console and does
not impact compute node state.

This change can be further applied to 3rd-party extension upgrades. We
can silently upgrade the version of 3rd party extensions in the
background in the future.

Questions:

* Does alter extension takes some kind of lock that will block user
requests?
* Does `ALTER EXTENSION` writes to the database if nothing needs to be
upgraded? (may impact storage size).

Otherwise it's safe to land this pull request.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-06 12:20:44 -05:00
Alexander Bayandin
f40b13d801 Update client libs for test_runner/pg_clients to their latest versions (#7022)
## Problem
Closes https://github.com/neondatabase/neon/security/dependabot/56
Supersedes https://github.com/neondatabase/neon/pull/7013

Workflow run:
https://github.com/neondatabase/neon/actions/runs/8157302480

## Summary of changes
- Update client libs for `test_runner/pg_clients` to their latest
versions
2024-03-06 17:09:54 +00:00
John Spray
a9a4a76d13 storage controller: misc fixes (#7036)
## Problem

Collection of small changes, batched together to reduce CI overhead.

## Summary of changes

- Layer download messages include size -- this is useful when watching a
pageserver hydrate its on disk cache in the log.
- Controller migrate API could put an invalid NodeId into TenantState
- Scheduling errors during tenant create could result in creating some
shards and not others.
- Consistency check could give hard-to-understand failures in tests if a
reconcile was in process: explicitly fail the check if reconciles are in
progress instead.
2024-03-06 16:47:32 +00:00
Alex Chi Z
5dc2088cf3 fix(test): drop subscription when test completes (#6975)
This pull request mitigates
https://github.com/neondatabase/neon/issues/6969, but the longer-term
problem is that we cannot properly stop Postgres if there is a
subscription.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-06 15:52:24 +00:00
John Spray
4a31e18c81 storage controller: include stripe size in compute notifications (#6974)
## Problem

- The storage controller is the source of truth for a tenant's stripe
size, but doesn't currently have a way to propagate that to compute:
we're just using the default stripe size everywhere.

Closes: https://github.com/neondatabase/neon/issues/6903

## Summary of changes

- Include stripe size in `ComputeHookNotifyRequest`
- Include stripe size in `LocationConfigResponse`

The stripe size is optional: it will only be advertised for
multi-sharded tenants. This enables the controller to defer the choice
of stripe size until we split a tenant for the first time.
2024-03-06 13:56:30 +00:00
John Spray
a3ef50c9b6 storage controller: use 'lazy' mode for location_config (#6987)
## Problem

If large numbers of shards are attached to a pageserver concurrently,
for example after another node fails, it can cause excessive I/O queue
depths due to all the newly attached shards trying to calculate logical
sizes concurrently.

#6907 added the `lazy` flag to handle this.

## Summary of changes

- Use `lazy=true` from all /location_config calls in the storage
controller Reconciler.
2024-03-06 11:26:29 +00:00
Arpad Müller
2f88e7a921 Move compaction code to compaction.rs (#7026)
Moves some of the (legacy) compaction code to compaction.rs. No
functional changes, just moves of code.

Before, compaction.rs was only for the new tiered compaction mechanism,
now it's for both the old and new mechanisms.

Part of #6768
2024-03-06 01:40:23 +00:00
Christian Schwarz
eacdc179dc fixup(#6991): it broke the macOS build (#7024) 2024-03-05 17:03:51 +00:00
Vlad Lazar
2daa2f1d10 test: disable large slru basebackup bench in ci (#7025)
The test is flaky due to
https://github.com/neondatabase/neon/issues/7006.
2024-03-05 15:41:05 +00:00
Anna Khanova
15b3665dc4 proxy: fix bug with populating the data (#7023)
## Problem

Branch/project and coldStart were not populated to data events.

## Summary of changes

Populate it. Also added logging for the coldstart info.
2024-03-05 15:32:58 +00:00
Arpad Müller
e69a25542b Minor improvements to tiered compaction (#7020)
Minor non-functional improvements to tiered compaction, mostly
consisting of comment fixes.

Followup of  #6830, part of #6768
2024-03-05 16:26:51 +01:00
Alex Chi Z
b036c32262 fix -Wmissing-prototypes for neon extension (#7010)
## Problem

ref https://github.com/neondatabase/neon/issues/6188

## Summary of changes

This pull request fixes `-Wmissing-prototypes` for the neon extension.
Note that (1) the gcc version in CI and macOS is different, therefore
some of the warning does not get reported when developing the neon
extension locally. (2) the CI env variable `COPT = -Werror` does not get
passed into the docker build process, therefore warnings are not treated
as errors on CI.


e62baa9704/.github/workflows/build_and_test.yml (L22)

There will be follow-up pull requests on solving other warnings. By the
way, I did not figure out the default compile parameters in the CI env,
and therefore this pull request is tested by manually adding
`-Wmissing-prototypes` into the `COPT`.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-05 10:03:44 -05:00
Anna Khanova
bdbb2f4afc proxy: report redis broken message metric (#7021)
## Problem

Not really a problem. Improving visibility around redis communication.

## Summary of changes

Added metric on the number of broken messages.
2024-03-05 16:02:51 +01:00
Christian Schwarz
270d3be507 feat(per-tenant throttling): exclude throttled time from page_service metrics + regression test (#6953)
part of https://github.com/neondatabase/neon/issues/5899

Problem
-------

Before this PR, the time spent waiting on the throttle was charged
towards the higher-level page_service metrics, i.e.,
`pageserver_smgr_query_seconds`.
The metrics are the foundation of internal SLIs / SLOs.
A throttled tenant would cause the SLI to degrade / SLO alerts to fire.

Changes
-------


- don't charge time spent in throttle towards the page_service metrics
- record time spent in throttle in RequestContext and subtract it from
the elapsed time
- this works because the page_service path doesn't create child context,
so, all the throttle time is recorded in the parent
- it's quite brittle and will break if we ever decide to spawn child
tasks that need child RequestContexts, which would have separate
instances of the `micros_spent_throttled` counter.
- however, let's punt that to a more general refactoring of
RequestContext
- add a test case that ensures that
- throttling happens for getpage requests; this aspect of the test
passed before this PR
- throttling delays aren't charged towards the page_service metrics;
this aspect of the test only passes with this PR
- drive-by: make the throttle log message `info!`, it's an expected
condition

Performance
-----------

I took the same measurements as in #6706 , no meaningful change in CPU
overhead.

Future Work
-----------

This PR enables us to experiment with the throttle for select tenants
without affecting the SLI metrics / triggering SLO alerts.

Before declaring this feature done, we need more work to happen,
specifically:

- decide on whether we want to retain the flexibility of throttling any
`Timeline::get` call, filtered by TaskKind
- versus: separate throttles for each page_service endpoint, potentially
with separate config options
- the trouble here is that this decision implies changes to the
TenantConfig, so, if we start using the current config style now, then
decide to switch to a different config, it'll be a breaking change

Nice-to-haves but probably not worth the time right now:

- Equivalent tests to ensure the throttle applies to all other
page_service handlers.
2024-03-05 13:44:00 +00:00
Vlad Lazar
9dec65b75b pageserver: fix vectored read path delta layer index traversal (#7001)
## Problem
Last weeks enablement of vectored get generated a number of panics.
From them, I diagnosed two issues in the delta layer index traversal
logic
1. The `key >= range.start && lsn >= lsn_range.start`
was too aggressive. Lsns are not monotonically increasing in the delta
layer index (keys are though), so we cannot assert on them.
2. Lsns greater or equal to `lsn_range.end` were not skipped. This
caused the query to consider records newer than the request Lsn.

## Summary of changes
* Fix the issues mentioned above inline
* Refactor the layer traversal logic to make it unit testable
* Add unit test which reproduces the failure modes listed above.
2024-03-05 13:35:45 +00:00
Vlad Lazar
ae8468f97e pageserver: fix AUX key vectored get validation (#7018)
## Problem
The value reconstruct of AUX_FILES_KEY from records is not deterministic
since it uses a hash map under the hood. This caused vectored get validation
failures when enabled in staging.

## Summary of changes
Deserialise AUX_FILES_KEY blobs comparing. All other keys should
reconstruct deterministically, so we simply compare the blobs.
2024-03-05 13:30:43 +00:00
Christian Schwarz
f3e4f85e65 layer file download: final rename: fix durability (#6991)
Before this PR, the layer file download code would fsync the inode after
rename instead of the timeline directory. That is not in line with what
a comment further up says we're doing, and it's obviously not achieving
the goal of making the rename durable.

part of https://github.com/neondatabase/neon/issues/6663
2024-03-05 11:09:13 +00:00
Joonas Koivunen
752bf5a22f build: clippy disallow futures::pin_mut macro (#7016)
`std` has had `pin!` macro for some time, there is no need for us to use
the older alternatives. Cannot disallow `tokio::pin` because tokio
macros use that.
2024-03-05 10:14:37 +00:00
Christian Schwarz
3da410c8fe tokio-epoll-uring: use it on the layer-creating code paths (#6378)
part of #6663 
See that epic for more context & related commits.

Problem
-------

Before this PR, the layer-file-creating code paths were using
VirtualFile, but under the hood these were still blocking system calls.

Generally this meant we'd stall the executor thread, unless the caller
"knew" and used the following pattern instead:

```
spawn_blocking(|| {
    Handle::block_on(async {
        VirtualFile::....().await;
    })
}).await
```

Solution
--------

This PR adopts `tokio-epoll-uring` on the layer-file-creating code paths
in pageserver.

Note that on-demand downloads still use `tokio::fs`, these will be
converted in a future PR.

Design: Avoiding Regressions With `std-fs` 
------------------------------------------

If we make the VirtualFile write path truly async using
`tokio-epoll-uring`, should we then remove the `spawn_blocking` +
`Handle::block_on` usage upstack in the same commit?

No, because if we’re still using the `std-fs` io engine, we’d then block
the executor in those places where previously we were protecting us from
that through the `spawn_blocking` .

So, if we want to see benefits from `tokio-epoll-uring` on the write
path while also preserving the ability to switch between
`tokio-epoll-uring` and `std-fs` , where `std-fs` will behave identical
to what we have now, we need to ***conditionally* use `spawn_blocking +
Handle::block_on`** .

I.e., in the places where we use that know, we’ll need to make that
conditional based on the currently configured io engine.

It boils down to investigating all the places where we do
`spawn_blocking(... block_on(... VirtualFile::...))`.

Detailed [write-up of that investigation in
Notion](https://neondatabase.notion.site/Surveying-VirtualFile-write-path-usage-wrt-tokio-epoll-uring-integration-spawn_blocking-Handle-bl-5dc2270dbb764db7b2e60803f375e015?pvs=4
), made publicly accessible.

tl;dr: Preceding PRs addressed the relevant call sites:
- `metadata` file: turns out we could simply remove it (#6777, #6769,
#6775)
- `create_delta_layer()`: made sensitive to `virtual_file_io_engine` in
#6986

NB: once we are switched over to `tokio-epoll-uring` everywhere in
production, we can deprecate `std-fs`; to keep macOS support, we can use
`tokio::fs` instead. That will remove this whole headache.


Code Changes In This PR
-----------------------

- VirtualFile API changes
  - `VirtualFile::write_at`
- implement an `ioengine` operation and switch `VirtualFile::write_at`
to it
  - `VirtualFile::metadata()`
- curiously, we only use it from the layer writers' `finish()` methods
- introduce a wrapper `Metadata` enum because `std::fs::Metadata` cannot
be constructed by code outside rust std
- `VirtualFile::sync_all()` and for completeness sake, add
`VirtualFile::sync_data()`

Testing & Rollout
-----------------

Before merging this PR, we ran the CI with both io engines.

Additionally, the changes will soak in staging.

We could have a feature gate / add a new io engine
`tokio-epoll-uring-write-path` to do a gradual rollout. However, that's
not part of this PR.


Future Work
-----------

There's still some use of `std::fs` and/or `tokio::fs` for directory
namespace operations, e.g. `std::fs::rename`.

We're not addressing those in this PR, as we'll need to add the support
in tokio-epoll-uring first. Note that rename itself is usually fast if
the directory is in the kernel dentry cache, and only the fsync after
rename is slow. These fsyncs are using tokio-epoll-uring, so, the impact
should be small.
2024-03-05 09:03:54 +00:00
Alex Chi Z
b7db912be6 compute_ctl: only try zenith_admin if could not authenticate (#6955)
## Problem

Fix https://github.com/neondatabase/neon/issues/6498

## Summary of changes

Only re-authenticate with zenith_admin if authentication fails.
Otherwise, directly return the error message.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-04 14:28:45 -05:00
Alexander Bayandin
3dfae4be8d upgrade mio 0.8.10 => 0.8.11 (#7009)
## Problem

`cargo deny` fails
- https://rustsec.org/advisories/RUSTSEC-2024-0019
-
https://github.com/tokio-rs/mio/security/advisories/GHSA-r8w9-5wcg-vfj7

> The vulnerability is Windows-specific, and can only happen if you are
using named pipes. Other IO resources are not affected.

## Summary of changes
- Upgrade `mio` from 0.8.10 to 0.8.11 (`cargo update -p mio`)
2024-03-04 19:16:07 +00:00
Christian Schwarz
e62baa9704 upgrade tokio 1.34 => 1.36 (#7008)
tokio 1.36 has been out for a month.

Release notes don't indicate major changes.

Skimming through their issue tracker, I can't find open `C-bug` issues
that would affect us.

(My personal motivation for this is `JoinSet::try_join_next`.)
2024-03-04 18:36:29 +01:00
Alexander Bayandin
191d8ac7e0 vm-image: update pgbouncer from 1.22.0 to 1.22.1 (#7005)
pgbouncer 1.22.1 has been released
> This release fixes issues caused by some clients using COPY FROM STDIN
queries. Such queries could introduce memory leaks, performance
regressions and prepared statement misbehavior.

- NEWS: https://www.pgbouncer.org/2024/03/pgbouncer-1-22-1
- CHANGES:
https://github.com/pgbouncer/pgbouncer/compare/pgbouncer_1_22_0...pgbouncer_1_22_1


## Summary of changes
- vm-image: update pgbouncer from 1.22.0 to 1.22.1
2024-03-04 16:04:12 +00:00
Roman Zaynetdinov
0d2395fe96 Update postgres-exporter to v0.12.1 (#7004)
Fixes https://github.com/neondatabase/neon/issues/6996

Thanks to @bayandin
2024-03-04 16:02:10 +00:00
Christian Schwarz
f0be9400f2 fix(test_remote_storage_upload_queue_retries): became flakier since #6960 (#6999)
This PR increases the `wait_until` timeout.
These are where things became more flaky as of
https://github.com/neondatabase/neon/pull/6960.
Most likely because it doubles the work in the
`churn_while_failpoints_active_thread`.

Slack context:
https://neondb.slack.com/archives/C033RQ5SPDH/p1709554455962959?thread_ts=1709286362.850549&cid=C033RQ5SPDH
2024-03-04 15:47:13 +01:00
Alex Chi Z
e938bb8157 fix epic issue template (#6920)
The template does not parse on GitHub
2024-03-04 09:17:14 -05:00
Christian Schwarz
944cac950d layer file creation: fsync timeline directories using VirtualFile::sync_all() (#6986)
Except for the involvement of the VirtualFile fd cache, this is
equivalent to what happened before at runtime.

Future PR https://github.com/neondatabase/neon/pull/6378 will implement
`VirtualFile::sync_all()` using
tokio-epoll-uring if that's configured as the io engine.
This PR is preliminary work for that.

part of https://github.com/neondatabase/neon/issues/6663
2024-03-04 13:31:09 +00:00
Anna Khanova
e1c032fb3c Fix type (#6998)
## Problem

Typo

## Summary of changes

Fix
2024-03-04 13:26:16 +00:00
Christian Schwarz
c861d71eeb layer file creation: fatal_err on timeline dir fsync (#6985)
As pointed out in the comments added in this PR:
the in-memory state of the filesystem already has the layer file in its
final place.
If the fsync fails, but pageserver continues to execute, it's quite easy
for subsequent pageserver code to observe the file being there and
assume it's durable, when it really isn't.

It can happen that we get ENOSPC during the fsync.
However,
1. the timeline dir is small (remember, the big layer _file_ has already
been synced).
Small data means ENOSPC due to delayed allocation races etc are less
likely.
2. what else are we going to do in that case?

If we decide to bubble up the error, the file remains on disk.
We could try to unlink it and fsync after the unlink.
If that fails, we would _definitely_ need to error out.
Is it worth the trouble though?

Side note: all this logic about not carrying on after fsync failure
implies that we `sync` the filesystem successfully before we restart
the pageserver. We don't do that right now, but should (=>
https://github.com/neondatabase/neon/issues/6989)

part of https://github.com/neondatabase/neon/issues/6663
2024-03-04 12:18:22 +00:00
Alexander Bayandin
6e46204712 CI(deploy): use separate workflow for proxy deploys (#6995)
## Problem

The current implementation of `deploy-prod` workflow doesn't allow to
run parallel deploys on Storage and Proxy.

## Summary of changes
- Call `deploy-proxy-prod` workflow that deploys only Proxy components,
and that can be run in parallel with `deploy-prod` for Storage.
2024-03-04 12:08:44 +00:00
Andreas Scherbaum
5c6d78d469 Rename "zenith" to "neon" (#6957)
Usually RFC documents are not modified, but the vast mentions of
"zenith" in early RFC documents make it desirable to update the product
name to today's name, to avoid confusion.

## Problem

Early RFC documents use the old "zenith" product name a lot, which is
not something everyone is aware of after the product was renamed.

## Summary of changes

Replace occurrences of "zenith" with "neon".
Images are excluded.

---------

Co-authored-by: Andreas Scherbaum <andreas@neon.tech>
2024-03-04 13:02:18 +01:00
326 changed files with 19358 additions and 7190 deletions

View File

@@ -16,9 +16,9 @@ assignees: ''
## Implementation ideas
## Tasks
```[tasklist]
### Tasks
- [ ] Example Task
```

View File

@@ -461,6 +461,7 @@ jobs:
- name: Pytest regression tests
uses: ./.github/actions/run-python-test-set
timeout-minutes: 60
with:
build_type: ${{ matrix.build_type }}
test_selection: regress
@@ -474,7 +475,7 @@ jobs:
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
CHECK_ONDISK_DATA_COMPATIBILITY: nonempty
BUILD_TAG: ${{ needs.tag.outputs.build-tag }}
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: std-fs
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: tokio-epoll-uring
PAGESERVER_GET_VECTORED_IMPL: vectored
# Temporary disable this step until we figure out why it's so flaky
@@ -554,7 +555,7 @@ jobs:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
TEST_RESULT_CONNSTR: "${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}"
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: std-fs
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: tokio-epoll-uring
# XXX: no coverage data handling here, since benchmarks are run on release builds,
# while coverage is currently collected for the debug ones
@@ -1120,10 +1121,16 @@ jobs:
run: |
if [[ "$GITHUB_REF_NAME" == "main" ]]; then
gh workflow --repo neondatabase/aws run deploy-dev.yml --ref main -f branch=main -f dockerTag=${{needs.tag.outputs.build-tag}} -f deployPreprodRegion=false
# TODO: move deployPreprodRegion to release (`"$GITHUB_REF_NAME" == "release"` block), once Staging support different compute tag prefixes for different regions
gh workflow --repo neondatabase/aws run deploy-dev.yml --ref main -f branch=main -f dockerTag=${{needs.tag.outputs.build-tag}} -f deployPreprodRegion=true
elif [[ "$GITHUB_REF_NAME" == "release" ]]; then
gh workflow --repo neondatabase/aws run deploy-dev.yml --ref main \
-f deployPgSniRouter=false \
-f deployProxy=false \
-f deployStorage=true \
-f deployStorageBroker=true \
-f branch=main \
-f dockerTag=${{needs.tag.outputs.build-tag}} \
-f deployPreprodRegion=true
gh workflow --repo neondatabase/aws run deploy-prod.yml --ref main \
-f deployPgSniRouter=false \
-f deployProxy=false \
@@ -1132,12 +1139,19 @@ jobs:
-f branch=main \
-f dockerTag=${{needs.tag.outputs.build-tag}}
elif [[ "$GITHUB_REF_NAME" == "release-proxy" ]]; then
gh workflow --repo neondatabase/aws run deploy-prod.yml --ref main \
gh workflow --repo neondatabase/aws run deploy-dev.yml --ref main \
-f deployPgSniRouter=true \
-f deployProxy=true \
-f deployStorage=false \
-f deployStorageBroker=false \
-f branch=main \
-f dockerTag=${{needs.tag.outputs.build-tag}} \
-f deployPreprodRegion=true
gh workflow --repo neondatabase/aws run deploy-proxy-prod.yml --ref main \
-f deployPgSniRouter=true \
-f deployProxy=true \
-f branch=main \
-f dockerTag=${{needs.tag.outputs.build-tag}}
else
echo "GITHUB_REF_NAME (value '$GITHUB_REF_NAME') is not set to either 'main' or 'release'"

View File

@@ -97,7 +97,7 @@ jobs:
**Please merge this Pull Request using 'Create a merge commit' button**
EOF
gh pr create --title "Proxy release ${RELEASE_DATE}}" \
gh pr create --title "Proxy release ${RELEASE_DATE}" \
--body-file "body.md" \
--head "${RELEASE_BRANCH}" \
--base "release-proxy"

View File

@@ -1,12 +1,13 @@
/compute_tools/ @neondatabase/control-plane @neondatabase/compute
/control_plane/attachment_service @neondatabase/storage
/libs/pageserver_api/ @neondatabase/storage
/libs/postgres_ffi/ @neondatabase/compute
/libs/postgres_ffi/ @neondatabase/compute @neondatabase/safekeepers
/libs/remote_storage/ @neondatabase/storage
/libs/safekeeper_api/ @neondatabase/safekeepers
/libs/vm_monitor/ @neondatabase/autoscaling
/pageserver/ @neondatabase/storage
/pgxn/ @neondatabase/compute
/pgxn/neon/ @neondatabase/compute @neondatabase/safekeepers
/proxy/ @neondatabase/proxy
/safekeeper/ @neondatabase/safekeepers
/vendor/ @neondatabase/compute

528
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -52,10 +52,12 @@ async-stream = "0.3"
async-trait = "0.1"
aws-config = { version = "1.1.4", default-features = false, features=["rustls"] }
aws-sdk-s3 = "1.14"
aws-sdk-secretsmanager = { version = "1.14.0" }
aws-sdk-iam = "1.15.0"
aws-smithy-async = { version = "1.1.4", default-features = false, features=["rt-tokio"] }
aws-smithy-types = "1.1.4"
aws-credential-types = "1.1.4"
aws-sigv4 = { version = "1.2.0", features = ["sign-http"] }
aws-types = "1.1.7"
axum = { version = "0.6.20", features = ["ws"] }
base64 = "0.13.0"
bincode = "1.3"
@@ -76,6 +78,7 @@ either = "1.8"
enum-map = "2.4.2"
enumset = "1.0.12"
fail = "0.5.0"
fallible-iterator = "0.2"
fs2 = "0.4.3"
futures = "0.3"
futures-core = "0.3"
@@ -88,6 +91,7 @@ hex = "0.4"
hex-literal = "0.4"
hmac = "0.12.1"
hostname = "0.3.1"
http = {version = "1.1.0", features = ["std"]}
http-types = { version = "2", default-features = false }
humantime = "2.1"
humantime-serde = "1.1.1"
@@ -101,6 +105,7 @@ lasso = "0.7"
leaky-bucket = "1.0.1"
libc = "0.2"
md5 = "0.7.0"
measured = { version = "0.0.13", features=["default", "lasso"] }
memoffset = "0.8"
native-tls = "0.2"
nix = { version = "0.27", features = ["fs", "process", "socket", "signal", "poll"] }
@@ -120,7 +125,7 @@ procfs = "0.14"
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
prost = "0.11"
rand = "0.8"
redis = { version = "0.24.0", features = ["tokio-rustls-comp", "keep-alive"] }
redis = { version = "0.25.2", features = ["tokio-rustls-comp", "keep-alive"] }
regex = "1.10.2"
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] }
reqwest-tracing = { version = "0.4.7", features = ["opentelemetry_0_20"] }
@@ -129,8 +134,8 @@ reqwest-retry = "0.2.2"
routerify = "3"
rpds = "0.13"
rustc-hash = "1.1.0"
rustls = "0.21"
rustls-pemfile = "1"
rustls = "0.22"
rustls-pemfile = "2"
rustls-split = "0.3"
scopeguard = "1.1"
sysinfo = "0.29.2"
@@ -148,6 +153,7 @@ smol_str = { version = "0.2.0", features = ["serde"] }
socket2 = "0.5"
strum = "0.24"
strum_macros = "0.24"
"subtle" = "2.5.0"
svg_fmt = "0.4.1"
sync_wrapper = "0.1.2"
tar = "0.4"
@@ -156,12 +162,11 @@ test-context = "0.1"
thiserror = "1.0"
tikv-jemallocator = "0.5"
tikv-jemalloc-ctl = "0.5"
tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] }
tokio = { version = "1.17", features = ["macros"] }
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.10.0"
tokio-rustls = "0.24"
tokio-postgres-rustls = "0.11.0"
tokio-rustls = "0.25"
tokio-stream = "0.1"
tokio-tar = "0.3"
tokio-util = { version = "0.7.10", features = ["io", "rt"] }
@@ -220,7 +225,7 @@ workspace_hack = { version = "0.1", path = "./workspace_hack/" }
## Build dependencies
criterion = "0.5.1"
rcgen = "0.11"
rcgen = "0.12"
rstest = "0.18"
camino-tempfile = "1.0.2"
tonic-build = "0.9"

View File

@@ -53,7 +53,7 @@ RUN set -e \
--bin pagectl \
--bin safekeeper \
--bin storage_broker \
--bin attachment_service \
--bin storage_controller \
--bin proxy \
--bin neon_local \
--locked --release \
@@ -81,7 +81,7 @@ COPY --from=build --chown=neon:neon /home/nonroot/target/release/pageserver
COPY --from=build --chown=neon:neon /home/nonroot/target/release/pagectl /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/safekeeper /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_broker /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/attachment_service /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_controller /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/proxy /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin

View File

@@ -135,7 +135,7 @@ WORKDIR /home/nonroot
# Rust
# Please keep the version of llvm (installed above) in sync with rust llvm (`rustc --version --verbose | grep LLVM`)
ENV RUSTC_VERSION=1.76.0
ENV RUSTC_VERSION=1.77.0
ENV RUSTUP_HOME="/home/nonroot/.rustup"
ENV PATH="/home/nonroot/.cargo/bin:${PATH}"
RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && whoami && \
@@ -149,7 +149,7 @@ RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux
cargo install --git https://github.com/paritytech/cachepot && \
cargo install rustfilt && \
cargo install cargo-hakari && \
cargo install cargo-deny && \
cargo install cargo-deny --locked && \
cargo install cargo-hack && \
cargo install cargo-nextest && \
rm -rf /home/nonroot/.cargo/registry && \

View File

@@ -51,7 +51,7 @@ CARGO_BUILD_FLAGS += $(filter -j1,$(MAKEFLAGS))
CARGO_CMD_PREFIX += $(if $(filter n,$(MAKEFLAGS)),,+)
# Force cargo not to print progress bar
CARGO_CMD_PREFIX += CARGO_TERM_PROGRESS_WHEN=never CI=1
# Set PQ_LIB_DIR to make sure `attachment_service` get linked with bundled libpq (through diesel)
# Set PQ_LIB_DIR to make sure `storage_controller` get linked with bundled libpq (through diesel)
CARGO_CMD_PREFIX += PQ_LIB_DIR=$(POSTGRES_INSTALL_DIR)/v16/lib
#

View File

@@ -238,6 +238,14 @@ If you encounter errors during setting up the initial tenant, it's best to stop
## Running tests
### Rust unit tests
We are using [`cargo-nextest`](https://nexte.st/) to run the tests in Github Workflows.
Some crates do not support running plain `cargo test` anymore, prefer `cargo nextest run` instead.
You can install `cargo-nextest` with `cargo install cargo-nextest`.
### Integration tests
Ensure your dependencies are installed as described [here](https://github.com/neondatabase/neon#dependency-installation-notes).
```sh

View File

@@ -2,4 +2,13 @@ disallowed-methods = [
"tokio::task::block_in_place",
# Allow this for now, to deny it later once we stop using Handle::block_on completely
# "tokio::runtime::Handle::block_on",
# use tokio_epoll_uring_ext instead
"tokio_epoll_uring::thread_local_system",
]
disallowed-macros = [
# use std::pin::pin
"futures::pin_mut",
# cannot disallow this, because clippy finds used from tokio macros
#"tokio::pin",
]

View File

@@ -32,6 +32,29 @@ compute_ctl -D /var/db/postgres/compute \
-b /usr/local/bin/postgres
```
## State Diagram
Computes can be in various states. Below is a diagram that details how a
compute moves between states.
```mermaid
%% https://mermaid.js.org/syntax/stateDiagram.html
stateDiagram-v2
[*] --> Empty : Compute spawned
Empty --> ConfigurationPending : Waiting for compute spec
ConfigurationPending --> Configuration : Received compute spec
Configuration --> Failed : Failed to configure the compute
Configuration --> Running : Compute has been configured
Empty --> Init : Compute spec is immediately available
Empty --> TerminationPending : Requested termination
Init --> Failed : Failed to start Postgres
Init --> Running : Started Postgres
Running --> TerminationPending : Requested termination
TerminationPending --> Terminated : Terminated compute
Failed --> [*] : Compute exited
Terminated --> [*] : Compute exited
```
## Tests
Cargo formatter:

View File

@@ -17,6 +17,8 @@ use chrono::{DateTime, Utc};
use futures::future::join_all;
use futures::stream::FuturesUnordered;
use futures::StreamExt;
use nix::unistd::Pid;
use postgres::error::SqlState;
use postgres::{Client, NoTls};
use tracing::{debug, error, info, instrument, warn};
use utils::id::{TenantId, TimelineId};
@@ -395,9 +397,9 @@ impl ComputeNode {
// Gets the basebackup in a retry loop
#[instrument(skip_all, fields(%lsn))]
pub fn get_basebackup(&self, compute_state: &ComputeState, lsn: Lsn) -> Result<()> {
let mut retry_period_ms = 500;
let mut retry_period_ms = 500.0;
let mut attempts = 0;
let max_attempts = 5;
let max_attempts = 10;
loop {
let result = self.try_get_basebackup(compute_state, lsn);
match result {
@@ -409,8 +411,8 @@ impl ComputeNode {
"Failed to get basebackup: {} (attempt {}/{})",
e, attempts, max_attempts
);
std::thread::sleep(std::time::Duration::from_millis(retry_period_ms));
retry_period_ms *= 2;
std::thread::sleep(std::time::Duration::from_millis(retry_period_ms as u64));
retry_period_ms *= 1.5;
}
Err(_) => {
return result;
@@ -721,8 +723,12 @@ impl ComputeNode {
// Stop it when it's ready
info!("waiting for postgres");
wait_for_postgres(&mut pg, Path::new(pgdata))?;
pg.kill()?;
info!("sent kill signal");
// SIGQUIT orders postgres to exit immediately. We don't want to SIGKILL
// it to avoid orphaned processes prowling around while datadir is
// wiped.
let pm_pid = Pid::from_raw(pg.id() as i32);
kill(pm_pid, Signal::SIGQUIT)?;
info!("sent SIGQUIT signal");
pg.wait()?;
info!("done prewarming");
@@ -763,6 +769,26 @@ impl ComputeNode {
Ok((pg, logs_handle))
}
/// Do post configuration of the already started Postgres. This function spawns a background thread to
/// configure the database after applying the compute spec. Currently, it upgrades the neon extension
/// version. In the future, it may upgrade all 3rd-party extensions.
#[instrument(skip_all)]
pub fn post_apply_config(&self) -> Result<()> {
let connstr = self.connstr.clone();
thread::spawn(move || {
let func = || {
let mut client = Client::connect(connstr.as_str(), NoTls)?;
handle_neon_extension_upgrade(&mut client)
.context("handle_neon_extension_upgrade")?;
Ok::<_, anyhow::Error>(())
};
if let Err(err) = func() {
error!("error while post_apply_config: {err:#}");
}
});
Ok(())
}
/// Do initial configuration of the already started Postgres.
#[instrument(skip_all)]
pub fn apply_config(&self, compute_state: &ComputeState) -> Result<()> {
@@ -774,27 +800,34 @@ impl ComputeNode {
// but we can create a new one and grant it all privileges.
let connstr = self.connstr.clone();
let mut client = match Client::connect(connstr.as_str(), NoTls) {
Err(e) => {
info!(
"cannot connect to postgres: {}, retrying with `zenith_admin` username",
e
);
let mut zenith_admin_connstr = connstr.clone();
Err(e) => match e.code() {
Some(&SqlState::INVALID_PASSWORD)
| Some(&SqlState::INVALID_AUTHORIZATION_SPECIFICATION) => {
// connect with zenith_admin if cloud_admin could not authenticate
info!(
"cannot connect to postgres: {}, retrying with `zenith_admin` username",
e
);
let mut zenith_admin_connstr = connstr.clone();
zenith_admin_connstr
.set_username("zenith_admin")
.map_err(|_| anyhow::anyhow!("invalid connstr"))?;
zenith_admin_connstr
.set_username("zenith_admin")
.map_err(|_| anyhow::anyhow!("invalid connstr"))?;
let mut client = Client::connect(zenith_admin_connstr.as_str(), NoTls)?;
// Disable forwarding so that users don't get a cloud_admin role
client.simple_query("SET neon.forward_ddl = false")?;
client.simple_query("CREATE USER cloud_admin WITH SUPERUSER")?;
client.simple_query("GRANT zenith_admin TO cloud_admin")?;
drop(client);
let mut client =
Client::connect(zenith_admin_connstr.as_str(), NoTls)
.context("broken cloud_admin credential: tried connecting with cloud_admin but could not authenticate, and zenith_admin does not work either")?;
// Disable forwarding so that users don't get a cloud_admin role
client.simple_query("SET neon.forward_ddl = false")?;
client.simple_query("CREATE USER cloud_admin WITH SUPERUSER")?;
client.simple_query("GRANT zenith_admin TO cloud_admin")?;
drop(client);
// reconnect with connstring with expected name
Client::connect(connstr.as_str(), NoTls)?
}
// reconnect with connstring with expected name
Client::connect(connstr.as_str(), NoTls)?
}
_ => return Err(e.into()),
},
Ok(client) => client,
};
@@ -990,18 +1023,21 @@ impl ComputeNode {
let pg_process = self.start_postgres(pspec.storage_auth_token.clone())?;
let config_time = Utc::now();
if pspec.spec.mode == ComputeMode::Primary && !pspec.spec.skip_pg_catalog_updates {
let pgdata_path = Path::new(&self.pgdata);
// temporarily reset max_cluster_size in config
// to avoid the possibility of hitting the limit, while we are applying config:
// creating new extensions, roles, etc...
config::compute_ctl_temp_override_create(pgdata_path, "neon.max_cluster_size=-1")?;
self.pg_reload_conf()?;
if pspec.spec.mode == ComputeMode::Primary {
if !pspec.spec.skip_pg_catalog_updates {
let pgdata_path = Path::new(&self.pgdata);
// temporarily reset max_cluster_size in config
// to avoid the possibility of hitting the limit, while we are applying config:
// creating new extensions, roles, etc...
config::compute_ctl_temp_override_create(pgdata_path, "neon.max_cluster_size=-1")?;
self.pg_reload_conf()?;
self.apply_config(&compute_state)?;
self.apply_config(&compute_state)?;
config::compute_ctl_temp_override_remove(pgdata_path)?;
self.pg_reload_conf()?;
config::compute_ctl_temp_override_remove(pgdata_path)?;
self.pg_reload_conf()?;
}
self.post_apply_config()?;
}
let startup_end_time = Utc::now();

View File

@@ -17,6 +17,7 @@ pub fn line_in_file(path: &Path, line: &str) -> Result<bool> {
.write(true)
.create(true)
.append(false)
.truncate(false)
.open(path)?;
let buf = io::BufReader::new(&file);
let mut count: usize = 0;

View File

@@ -302,9 +302,9 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
RoleAction::Create => {
// This branch only runs when roles are created through the console, so it is
// safe to add more permissions here. BYPASSRLS and REPLICATION are inherited
// from neon_superuser.
// from neon_superuser. (NOTE: REPLICATION has been removed from here for now).
let mut query: String = format!(
"CREATE ROLE {} INHERIT CREATEROLE CREATEDB BYPASSRLS REPLICATION IN ROLE neon_superuser",
"CREATE ROLE {} INHERIT CREATEROLE CREATEDB BYPASSRLS IN ROLE neon_superuser",
name.pg_quote()
);
info!("running role create query: '{}'", &query);
@@ -743,9 +743,21 @@ pub fn handle_extension_neon(client: &mut Client) -> Result<()> {
// which may happen in two cases:
// - extension was just installed
// - extension was already installed and is up to date
let query = "ALTER EXTENSION neon UPDATE";
info!("update neon extension schema with query: {}", query);
client.simple_query(query)?;
// DISABLED due to compute node unpinning epic
// let query = "ALTER EXTENSION neon UPDATE";
// info!("update neon extension version with query: {}", query);
// client.simple_query(query)?;
Ok(())
}
#[instrument(skip_all)]
pub fn handle_neon_extension_upgrade(_client: &mut Client) -> Result<()> {
info!("handle neon extension upgrade (not really)");
// DISABLED due to compute node unpinning epic
// let query = "ALTER EXTENSION neon UPDATE";
// info!("update neon extension version with query: {}", query);
// client.simple_query(query)?;
Ok(())
}
@@ -795,6 +807,18 @@ $$;"#,
"",
"",
// Add new migrations below.
r#"
DO $$
DECLARE
role_name TEXT;
BEGIN
FOR role_name IN SELECT rolname FROM pg_roles WHERE rolreplication IS TRUE
LOOP
RAISE NOTICE 'EXECUTING ALTER ROLE % NOREPLICATION', quote_ident(role_name);
EXECUTE 'ALTER ROLE ' || quote_ident(role_name) || ' NOREPLICATION';
END LOOP;
END
$$;"#,
];
let mut query = "CREATE SCHEMA IF NOT EXISTS neon_migration";

View File

@@ -12,6 +12,7 @@ clap.workspace = true
comfy-table.workspace = true
futures.workspace = true
git-version.workspace = true
humantime.workspace = true
nix.workspace = true
once_cell.workspace = true
postgres.workspace = true

View File

@@ -4,6 +4,10 @@ version = "0.1.0"
edition.workspace = true
license.workspace = true
[[bin]]
name = "storage_controller"
path = "src/main.rs"
[features]
default = []
# Enables test-only APIs and behaviors
@@ -12,24 +16,29 @@ testing = []
[dependencies]
anyhow.workspace = true
aws-config.workspace = true
aws-sdk-secretsmanager.workspace = true
bytes.workspace = true
camino.workspace = true
clap.workspace = true
fail.workspace = true
futures.workspace = true
git-version.workspace = true
hex.workspace = true
hyper.workspace = true
humantime.workspace = true
lasso.workspace = true
once_cell.workspace = true
pageserver_api.workspace = true
pageserver_client.workspace = true
postgres_connection.workspace = true
reqwest.workspace = true
routerify.workspace = true
serde.workspace = true
serde_json.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-util.workspace = true
tracing.workspace = true
measured.workspace = true
diesel = { version = "2.1.4", features = ["serde_json", "postgres", "r2d2"] }
diesel_migrations = { version = "2.1.0" }

View File

@@ -0,0 +1,3 @@
UPDATE tenant_shards set placement_policy='{"Double": 1}' where placement_policy='{"Attached": 1}';
UPDATE tenant_shards set placement_policy='"Single"' where placement_policy='{"Attached": 0}';

View File

@@ -0,0 +1,3 @@
UPDATE tenant_shards set placement_policy='{"Attached": 1}' where placement_policy='{"Double": 1}';
UPDATE tenant_shards set placement_policy='{"Attached": 0}' where placement_policy='"Single"';

View File

@@ -3,7 +3,7 @@ use std::{collections::HashMap, time::Duration};
use control_plane::endpoint::{ComputeControlPlane, EndpointStatus};
use control_plane::local_env::LocalEnv;
use hyper::{Method, StatusCode};
use pageserver_api::shard::{ShardIndex, ShardNumber, TenantShardId};
use pageserver_api::shard::{ShardCount, ShardNumber, ShardStripeSize, TenantShardId};
use postgres_connection::parse_host_port;
use serde::{Deserialize, Serialize};
use tokio_util::sync::CancellationToken;
@@ -19,8 +19,66 @@ const SLOWDOWN_DELAY: Duration = Duration::from_secs(5);
pub(crate) const API_CONCURRENCY: usize = 32;
pub(super) struct ComputeHookTenant {
shards: Vec<(ShardIndex, NodeId)>,
struct ShardedComputeHookTenant {
stripe_size: ShardStripeSize,
shard_count: ShardCount,
shards: Vec<(ShardNumber, NodeId)>,
}
enum ComputeHookTenant {
Unsharded(NodeId),
Sharded(ShardedComputeHookTenant),
}
impl ComputeHookTenant {
/// Construct with at least one shard's information
fn new(tenant_shard_id: TenantShardId, stripe_size: ShardStripeSize, node_id: NodeId) -> Self {
if tenant_shard_id.shard_count.count() > 1 {
Self::Sharded(ShardedComputeHookTenant {
shards: vec![(tenant_shard_id.shard_number, node_id)],
stripe_size,
shard_count: tenant_shard_id.shard_count,
})
} else {
Self::Unsharded(node_id)
}
}
/// Set one shard's location. If stripe size or shard count have changed, Self is reset
/// and drops existing content.
fn update(
&mut self,
tenant_shard_id: TenantShardId,
stripe_size: ShardStripeSize,
node_id: NodeId,
) {
match self {
Self::Unsharded(existing_node_id) if tenant_shard_id.shard_count.count() == 1 => {
*existing_node_id = node_id
}
Self::Sharded(sharded_tenant)
if sharded_tenant.stripe_size == stripe_size
&& sharded_tenant.shard_count == tenant_shard_id.shard_count =>
{
if let Some(existing) = sharded_tenant
.shards
.iter()
.position(|s| s.0 == tenant_shard_id.shard_number)
{
sharded_tenant.shards.get_mut(existing).unwrap().1 = node_id;
} else {
sharded_tenant
.shards
.push((tenant_shard_id.shard_number, node_id));
sharded_tenant.shards.sort_by_key(|s| s.0)
}
}
_ => {
// Shard count changed: reset struct.
*self = Self::new(tenant_shard_id, stripe_size, node_id);
}
}
}
}
#[derive(Serialize, Deserialize, Debug)]
@@ -33,6 +91,7 @@ struct ComputeHookNotifyRequestShard {
#[derive(Serialize, Deserialize, Debug)]
struct ComputeHookNotifyRequest {
tenant_id: TenantId,
stripe_size: Option<ShardStripeSize>,
shards: Vec<ComputeHookNotifyRequestShard>,
}
@@ -63,42 +122,43 @@ pub(crate) enum NotifyError {
}
impl ComputeHookTenant {
async fn maybe_reconfigure(&mut self, tenant_id: TenantId) -> Option<ComputeHookNotifyRequest> {
// Find the highest shard count and drop any shards that aren't
// for that shard count.
let shard_count = self.shards.iter().map(|(k, _v)| k.shard_count).max();
let Some(shard_count) = shard_count else {
// No shards, nothing to do.
tracing::info!("ComputeHookTenant::maybe_reconfigure: no shards");
return None;
};
self.shards.retain(|(k, _v)| k.shard_count == shard_count);
self.shards
.sort_by_key(|(shard, _node_id)| shard.shard_number);
if self.shards.len() == shard_count.count() as usize || shard_count.is_unsharded() {
// We have pageservers for all the shards: emit a configuration update
return Some(ComputeHookNotifyRequest {
fn maybe_reconfigure(&self, tenant_id: TenantId) -> Option<ComputeHookNotifyRequest> {
match self {
Self::Unsharded(node_id) => Some(ComputeHookNotifyRequest {
tenant_id,
shards: self
.shards
.iter()
.map(|(shard, node_id)| ComputeHookNotifyRequestShard {
shard_number: shard.shard_number,
node_id: *node_id,
})
.collect(),
});
} else {
tracing::info!(
"ComputeHookTenant::maybe_reconfigure: not enough shards ({}/{})",
self.shards.len(),
shard_count.count()
);
}
shards: vec![ComputeHookNotifyRequestShard {
shard_number: ShardNumber(0),
node_id: *node_id,
}],
stripe_size: None,
}),
Self::Sharded(sharded_tenant)
if sharded_tenant.shards.len() == sharded_tenant.shard_count.count() as usize =>
{
Some(ComputeHookNotifyRequest {
tenant_id,
shards: sharded_tenant
.shards
.iter()
.map(|(shard_number, node_id)| ComputeHookNotifyRequestShard {
shard_number: *shard_number,
node_id: *node_id,
})
.collect(),
stripe_size: Some(sharded_tenant.stripe_size),
})
}
Self::Sharded(sharded_tenant) => {
// Sharded tenant doesn't yet have information for all its shards
None
tracing::info!(
"ComputeHookTenant::maybe_reconfigure: not enough shards ({}/{})",
sharded_tenant.shards.len(),
sharded_tenant.shard_count.count()
);
None
}
}
}
}
@@ -139,7 +199,11 @@ impl ComputeHook {
};
let cplane =
ComputeControlPlane::load(env.clone()).expect("Error loading compute control plane");
let ComputeHookNotifyRequest { tenant_id, shards } = reconfigure_request;
let ComputeHookNotifyRequest {
tenant_id,
shards,
stripe_size,
} = reconfigure_request;
let compute_pageservers = shards
.into_iter()
@@ -156,7 +220,9 @@ impl ComputeHook {
for (endpoint_name, endpoint) in &cplane.endpoints {
if endpoint.tenant_id == tenant_id && endpoint.status() == EndpointStatus::Running {
tracing::info!("Reconfiguring endpoint {}", endpoint_name,);
endpoint.reconfigure(compute_pageservers.clone()).await?;
endpoint
.reconfigure(compute_pageservers.clone(), stripe_size)
.await?;
}
}
@@ -271,30 +337,26 @@ impl ComputeHook {
&self,
tenant_shard_id: TenantShardId,
node_id: NodeId,
stripe_size: ShardStripeSize,
cancel: &CancellationToken,
) -> Result<(), NotifyError> {
let mut locked = self.state.lock().await;
let entry = locked
.entry(tenant_shard_id.tenant_id)
.or_insert_with(|| ComputeHookTenant { shards: Vec::new() });
let shard_index = ShardIndex {
shard_count: tenant_shard_id.shard_count,
shard_number: tenant_shard_id.shard_number,
use std::collections::hash_map::Entry;
let tenant = match locked.entry(tenant_shard_id.tenant_id) {
Entry::Vacant(e) => e.insert(ComputeHookTenant::new(
tenant_shard_id,
stripe_size,
node_id,
)),
Entry::Occupied(e) => {
let tenant = e.into_mut();
tenant.update(tenant_shard_id, stripe_size, node_id);
tenant
}
};
let mut set = false;
for (existing_shard, existing_node) in &mut entry.shards {
if *existing_shard == shard_index {
*existing_node = node_id;
set = true;
}
}
if !set {
entry.shards.push((shard_index, node_id));
}
let reconfigure_request = entry.maybe_reconfigure(tenant_shard_id.tenant_id).await;
let reconfigure_request = tenant.maybe_reconfigure(tenant_shard_id.tenant_id);
let Some(reconfigure_request) = reconfigure_request else {
// The tenant doesn't yet have pageservers for all its shards: we won't notify anything
// until it does.
@@ -316,3 +378,85 @@ impl ComputeHook {
}
}
}
#[cfg(test)]
pub(crate) mod tests {
use pageserver_api::shard::{ShardCount, ShardNumber};
use utils::id::TenantId;
use super::*;
#[test]
fn tenant_updates() -> anyhow::Result<()> {
let tenant_id = TenantId::generate();
let mut tenant_state = ComputeHookTenant::new(
TenantShardId {
tenant_id,
shard_count: ShardCount::new(0),
shard_number: ShardNumber(0),
},
ShardStripeSize(12345),
NodeId(1),
);
// An unsharded tenant is always ready to emit a notification
assert!(tenant_state.maybe_reconfigure(tenant_id).is_some());
assert_eq!(
tenant_state
.maybe_reconfigure(tenant_id)
.unwrap()
.shards
.len(),
1
);
assert!(tenant_state
.maybe_reconfigure(tenant_id)
.unwrap()
.stripe_size
.is_none());
// Writing the first shard of a multi-sharded situation (i.e. in a split)
// resets the tenant state and puts it in an non-notifying state (need to
// see all shards)
tenant_state.update(
TenantShardId {
tenant_id,
shard_count: ShardCount::new(2),
shard_number: ShardNumber(1),
},
ShardStripeSize(32768),
NodeId(1),
);
assert!(tenant_state.maybe_reconfigure(tenant_id).is_none());
// Writing the second shard makes it ready to notify
tenant_state.update(
TenantShardId {
tenant_id,
shard_count: ShardCount::new(2),
shard_number: ShardNumber(0),
},
ShardStripeSize(32768),
NodeId(1),
);
assert!(tenant_state.maybe_reconfigure(tenant_id).is_some());
assert_eq!(
tenant_state
.maybe_reconfigure(tenant_id)
.unwrap()
.shards
.len(),
2
);
assert_eq!(
tenant_state
.maybe_reconfigure(tenant_id)
.unwrap()
.stripe_size,
Some(ShardStripeSize(32768))
);
Ok(())
}
}

View File

@@ -0,0 +1,227 @@
use futures::{stream::FuturesUnordered, StreamExt};
use std::{
collections::HashMap,
sync::Arc,
time::{Duration, Instant},
};
use tokio_util::sync::CancellationToken;
use pageserver_api::{
controller_api::{NodeAvailability, UtilizationScore},
models::PageserverUtilization,
};
use thiserror::Error;
use utils::id::NodeId;
use crate::node::Node;
struct HeartbeaterTask {
receiver: tokio::sync::mpsc::UnboundedReceiver<HeartbeatRequest>,
cancel: CancellationToken,
state: HashMap<NodeId, PageserverState>,
max_unavailable_interval: Duration,
jwt_token: Option<String>,
}
#[derive(Debug, Clone)]
pub(crate) enum PageserverState {
Available {
last_seen_at: Instant,
utilization: PageserverUtilization,
},
Offline,
}
#[derive(Debug)]
pub(crate) struct AvailablityDeltas(pub Vec<(NodeId, PageserverState)>);
#[derive(Debug, Error)]
pub(crate) enum HeartbeaterError {
#[error("Cancelled")]
Cancel,
}
struct HeartbeatRequest {
pageservers: Arc<HashMap<NodeId, Node>>,
reply: tokio::sync::oneshot::Sender<Result<AvailablityDeltas, HeartbeaterError>>,
}
pub(crate) struct Heartbeater {
sender: tokio::sync::mpsc::UnboundedSender<HeartbeatRequest>,
}
impl Heartbeater {
pub(crate) fn new(
jwt_token: Option<String>,
max_unavailable_interval: Duration,
cancel: CancellationToken,
) -> Self {
let (sender, receiver) = tokio::sync::mpsc::unbounded_channel::<HeartbeatRequest>();
let mut heartbeater =
HeartbeaterTask::new(receiver, jwt_token, max_unavailable_interval, cancel);
tokio::task::spawn(async move { heartbeater.run().await });
Self { sender }
}
pub(crate) async fn heartbeat(
&self,
pageservers: Arc<HashMap<NodeId, Node>>,
) -> Result<AvailablityDeltas, HeartbeaterError> {
let (sender, receiver) = tokio::sync::oneshot::channel();
self.sender
.send(HeartbeatRequest {
pageservers,
reply: sender,
})
.unwrap();
receiver.await.unwrap()
}
}
impl HeartbeaterTask {
fn new(
receiver: tokio::sync::mpsc::UnboundedReceiver<HeartbeatRequest>,
jwt_token: Option<String>,
max_unavailable_interval: Duration,
cancel: CancellationToken,
) -> Self {
Self {
receiver,
cancel,
state: HashMap::new(),
max_unavailable_interval,
jwt_token,
}
}
async fn run(&mut self) {
loop {
tokio::select! {
request = self.receiver.recv() => {
match request {
Some(req) => {
let res = self.heartbeat(req.pageservers).await;
req.reply.send(res).unwrap();
},
None => { return; }
}
},
_ = self.cancel.cancelled() => return
}
}
}
async fn heartbeat(
&mut self,
pageservers: Arc<HashMap<NodeId, Node>>,
) -> Result<AvailablityDeltas, HeartbeaterError> {
let mut new_state = HashMap::new();
let mut heartbeat_futs = FuturesUnordered::new();
for (node_id, node) in &*pageservers {
heartbeat_futs.push({
let jwt_token = self.jwt_token.clone();
let cancel = self.cancel.clone();
// Clone the node and mark it as available such that the request
// goes through to the pageserver even when the node is marked offline.
// This doesn't impact the availability observed by [`crate::service::Service`].
let mut node = node.clone();
node.set_availability(NodeAvailability::Active(UtilizationScore::worst()));
async move {
let response = node
.with_client_retries(
|client| async move { client.get_utilization().await },
&jwt_token,
3,
3,
Duration::from_secs(1),
&cancel,
)
.await;
let response = match response {
Some(r) => r,
None => {
// This indicates cancellation of the request.
// We ignore the node in this case.
return None;
}
};
let status = if let Ok(utilization) = response {
PageserverState::Available {
last_seen_at: Instant::now(),
utilization,
}
} else {
PageserverState::Offline
};
Some((*node_id, status))
}
});
loop {
let maybe_status = tokio::select! {
next = heartbeat_futs.next() => {
match next {
Some(result) => result,
None => { break; }
}
},
_ = self.cancel.cancelled() => { return Err(HeartbeaterError::Cancel); }
};
if let Some((node_id, status)) = maybe_status {
new_state.insert(node_id, status);
}
}
}
let mut deltas = Vec::new();
let now = Instant::now();
for (node_id, ps_state) in new_state {
use std::collections::hash_map::Entry::*;
let entry = self.state.entry(node_id);
let mut needs_update = false;
match entry {
Occupied(ref occ) => match (occ.get(), &ps_state) {
(PageserverState::Offline, PageserverState::Offline) => {}
(PageserverState::Available { last_seen_at, .. }, PageserverState::Offline) => {
if now - *last_seen_at >= self.max_unavailable_interval {
deltas.push((node_id, ps_state.clone()));
needs_update = true;
}
}
_ => {
deltas.push((node_id, ps_state.clone()));
needs_update = true;
}
},
Vacant(_) => {
deltas.push((node_id, ps_state.clone()));
}
}
match entry {
Occupied(mut occ) if needs_update => {
(*occ.get_mut()) = ps_state;
}
Vacant(vac) => {
vac.insert(ps_state);
}
_ => {}
}
}
Ok(AvailablityDeltas(deltas))
}
}

View File

@@ -1,6 +1,11 @@
use crate::metrics::{
HttpRequestLatencyLabelGroup, HttpRequestStatusLabelGroup, PageserverRequestLabelGroup,
METRICS_REGISTRY,
};
use crate::reconciler::ReconcileError;
use crate::service::{Service, STARTUP_RECONCILE_TIMEOUT};
use crate::PlacementPolicy;
use futures::Future;
use hyper::header::CONTENT_TYPE;
use hyper::{Body, Request, Response};
use hyper::{StatusCode, Uri};
use pageserver_api::models::{
@@ -11,9 +16,11 @@ use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio_util::sync::CancellationToken;
use utils::auth::{Scope, SwappableJwtAuth};
use utils::failpoint_support::failpoints_handler;
use utils::http::endpoint::{auth_middleware, check_permission_with, request_span};
use utils::http::request::{must_get_query_param, parse_request_param};
use utils::http::request::{must_get_query_param, parse_query_param, parse_request_param};
use utils::id::{TenantId, TimelineId};
use utils::{
@@ -27,11 +34,13 @@ use utils::{
};
use pageserver_api::controller_api::{
NodeConfigureRequest, NodeRegisterRequest, TenantShardMigrateRequest,
NodeAvailability, NodeConfigureRequest, NodeRegisterRequest, TenantShardMigrateRequest,
};
use pageserver_api::upcall_api::{ReAttachRequest, ValidateRequest};
use control_plane::attachment_service::{AttachHookRequest, InspectRequest};
use control_plane::storage_controller::{AttachHookRequest, InspectRequest};
use routerify::Middleware;
/// State available to HTTP request handlers
#[derive(Clone)]
@@ -119,13 +128,9 @@ async fn handle_tenant_create(
let create_req = json_request::<TenantCreateRequest>(&mut req).await?;
// TODO: enable specifying this. Using Single as a default helps legacy tests to work (they
// have no expectation of HA).
let placement_policy = PlacementPolicy::Single;
json_response(
StatusCode::CREATED,
service.tenant_create(create_req, placement_policy).await?,
service.tenant_create(create_req).await?,
)
}
@@ -179,14 +184,14 @@ async fn handle_tenant_location_config(
service: Arc<Service>,
mut req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
let tenant_shard_id: TenantShardId = parse_request_param(&req, "tenant_shard_id")?;
check_permissions(&req, Scope::PageServerApi)?;
let config_req = json_request::<TenantLocationConfigRequest>(&mut req).await?;
json_response(
StatusCode::OK,
service
.tenant_location_config(tenant_id, config_req)
.tenant_location_config(tenant_shard_id, config_req)
.await?,
)
}
@@ -251,8 +256,10 @@ async fn handle_tenant_secondary_download(
req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
service.tenant_secondary_download(tenant_id).await?;
json_response(StatusCode::OK, ())
let wait = parse_query_param(&req, "wait_ms")?.map(Duration::from_millis);
let (status, progress) = service.tenant_secondary_download(tenant_id, wait).await?;
json_response(status, progress)
}
async fn handle_tenant_delete(
@@ -314,7 +321,7 @@ async fn handle_tenant_timeline_passthrough(
tracing::info!("Proxying request for tenant {} ({})", tenant_id, path);
// Find the node that holds shard zero
let (base_url, tenant_shard_id) = service.tenant_shard0_baseurl(tenant_id)?;
let (node, tenant_shard_id) = service.tenant_shard0_node(tenant_id)?;
// Callers will always pass an unsharded tenant ID. Before proxying, we must
// rewrite this to a shard-aware shard zero ID.
@@ -323,12 +330,39 @@ async fn handle_tenant_timeline_passthrough(
let tenant_shard_str = format!("{}", tenant_shard_id);
let path = path.replace(&tenant_str, &tenant_shard_str);
let client = mgmt_api::Client::new(base_url, service.get_config().jwt_token.as_deref());
let latency = &METRICS_REGISTRY
.metrics_group
.storage_controller_passthrough_request_latency;
// This is a bit awkward. We remove the param from the request
// and join the words by '_' to get a label for the request.
let just_path = path.replace(&tenant_shard_str, "");
let path_label = just_path
.split('/')
.filter(|token| !token.is_empty())
.collect::<Vec<_>>()
.join("_");
let labels = PageserverRequestLabelGroup {
pageserver_id: &node.get_id().to_string(),
path: &path_label,
method: crate::metrics::Method::Get,
};
let _timer = latency.start_timer(labels.clone());
let client = mgmt_api::Client::new(node.base_url(), service.get_config().jwt_token.as_deref());
let resp = client.get_raw(path).await.map_err(|_e|
// FIXME: give APiError a proper Unavailable variant. We return 503 here because
// if we can't successfully send a request to the pageserver, we aren't available.
ApiError::ShuttingDown)?;
if !resp.status().is_success() {
let error_counter = &METRICS_REGISTRY
.metrics_group
.storage_controller_passthrough_request_error;
error_counter.inc(labels);
}
// We have a reqest::Response, would like a http::Response
let mut builder = hyper::Response::builder()
.status(resp.status())
@@ -354,6 +388,16 @@ async fn handle_tenant_locate(
json_response(StatusCode::OK, service.tenant_locate(tenant_id)?)
}
async fn handle_tenant_describe(
service: Arc<Service>,
req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
check_permissions(&req, Scope::Admin)?;
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
json_response(StatusCode::OK, service.tenant_describe(tenant_id)?)
}
async fn handle_node_register(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
check_permissions(&req, Scope::Admin)?;
@@ -392,7 +436,14 @@ async fn handle_node_configure(mut req: Request<Body>) -> Result<Response<Body>,
json_response(
StatusCode::OK,
state.service.node_configure(config_req).await?,
state
.service
.node_configure(
config_req.node_id,
config_req.availability.map(NodeAvailability::from),
config_req.scheduling,
)
.await?,
)
}
@@ -482,7 +533,11 @@ impl From<ReconcileError> for ApiError {
/// Common wrapper for request handlers that call into Service and will operate on tenants: they must only
/// be allowed to run if Service has finished its initial reconciliation.
async fn tenant_service_handler<R, H>(request: Request<Body>, handler: H) -> R::Output
async fn tenant_service_handler<R, H>(
request: Request<Body>,
handler: H,
request_name: RequestName,
) -> R::Output
where
R: std::future::Future<Output = Result<Response<Body>, ApiError>> + Send + 'static,
H: FnOnce(Arc<Service>, Request<Body>) -> R + Send + Sync + 'static,
@@ -502,9 +557,10 @@ where
));
}
request_span(
named_request_span(
request,
|request| async move { handler(service, request).await },
request_name,
)
.await
}
@@ -515,11 +571,98 @@ fn check_permissions(request: &Request<Body>, required_scope: Scope) -> Result<(
})
}
#[derive(Clone, Debug)]
struct RequestMeta {
method: hyper::http::Method,
at: Instant,
}
fn prologue_metrics_middleware<B: hyper::body::HttpBody + Send + Sync + 'static>(
) -> Middleware<B, ApiError> {
Middleware::pre(move |req| async move {
let meta = RequestMeta {
method: req.method().clone(),
at: Instant::now(),
};
req.set_context(meta);
Ok(req)
})
}
fn epilogue_metrics_middleware<B: hyper::body::HttpBody + Send + Sync + 'static>(
) -> Middleware<B, ApiError> {
Middleware::post_with_info(move |resp, req_info| async move {
let request_name = match req_info.context::<RequestName>() {
Some(name) => name,
None => {
return Ok(resp);
}
};
if let Some(meta) = req_info.context::<RequestMeta>() {
let status = &crate::metrics::METRICS_REGISTRY
.metrics_group
.storage_controller_http_request_status;
let latency = &crate::metrics::METRICS_REGISTRY
.metrics_group
.storage_controller_http_request_latency;
status.inc(HttpRequestStatusLabelGroup {
path: request_name.0,
method: meta.method.clone().into(),
status: crate::metrics::StatusCode(resp.status()),
});
latency.observe(
HttpRequestLatencyLabelGroup {
path: request_name.0,
method: meta.method.into(),
},
meta.at.elapsed().as_secs_f64(),
);
}
Ok(resp)
})
}
pub async fn measured_metrics_handler(_req: Request<Body>) -> Result<Response<Body>, ApiError> {
pub const TEXT_FORMAT: &str = "text/plain; version=0.0.4";
let payload = crate::metrics::METRICS_REGISTRY.encode();
let response = Response::builder()
.status(200)
.header(CONTENT_TYPE, TEXT_FORMAT)
.body(payload.into())
.unwrap();
Ok(response)
}
#[derive(Clone)]
struct RequestName(&'static str);
async fn named_request_span<R, H>(
request: Request<Body>,
handler: H,
name: RequestName,
) -> R::Output
where
R: Future<Output = Result<Response<Body>, ApiError>> + Send + 'static,
H: FnOnce(Request<Body>) -> R + Send + Sync + 'static,
{
request.set_context(name);
request_span(request, handler).await
}
pub fn make_router(
service: Arc<Service>,
auth: Option<Arc<SwappableJwtAuth>>,
) -> RouterBuilder<hyper::Body, ApiError> {
let mut router = endpoint::make_router();
let mut router = endpoint::make_router()
.middleware(prologue_metrics_middleware())
.middleware(epilogue_metrics_middleware());
if auth.is_some() {
router = router.middleware(auth_middleware(|request| {
let state = get_state(request);
@@ -528,93 +671,166 @@ pub fn make_router(
} else {
state.auth.as_deref()
}
}))
}));
}
router
.data(Arc::new(HttpState::new(service, auth)))
.get("/metrics", |r| {
named_request_span(r, measured_metrics_handler, RequestName("metrics"))
})
// Non-prefixed generic endpoints (status, metrics)
.get("/status", |r| request_span(r, handle_status))
.get("/ready", |r| request_span(r, handle_ready))
.get("/status", |r| {
named_request_span(r, handle_status, RequestName("status"))
})
.get("/ready", |r| {
named_request_span(r, handle_ready, RequestName("ready"))
})
// Upcalls for the pageserver: point the pageserver's `control_plane_api` config to this prefix
.post("/upcall/v1/re-attach", |r| {
request_span(r, handle_re_attach)
named_request_span(r, handle_re_attach, RequestName("upcall_v1_reattach"))
})
.post("/upcall/v1/validate", |r| {
named_request_span(r, handle_validate, RequestName("upcall_v1_validate"))
})
.post("/upcall/v1/validate", |r| request_span(r, handle_validate))
// Test/dev/debug endpoints
.post("/debug/v1/attach-hook", |r| {
request_span(r, handle_attach_hook)
named_request_span(r, handle_attach_hook, RequestName("debug_v1_attach_hook"))
})
.post("/debug/v1/inspect", |r| {
named_request_span(r, handle_inspect, RequestName("debug_v1_inspect"))
})
.post("/debug/v1/inspect", |r| request_span(r, handle_inspect))
.post("/debug/v1/tenant/:tenant_id/drop", |r| {
request_span(r, handle_tenant_drop)
named_request_span(r, handle_tenant_drop, RequestName("debug_v1_tenant_drop"))
})
.post("/debug/v1/node/:node_id/drop", |r| {
request_span(r, handle_node_drop)
named_request_span(r, handle_node_drop, RequestName("debug_v1_node_drop"))
})
.get("/debug/v1/tenant", |r| {
named_request_span(r, handle_tenants_dump, RequestName("debug_v1_tenant"))
})
.get("/debug/v1/tenant/:tenant_id/locate", |r| {
tenant_service_handler(
r,
handle_tenant_locate,
RequestName("debug_v1_tenant_locate"),
)
})
.get("/debug/v1/tenant", |r| request_span(r, handle_tenants_dump))
.get("/debug/v1/scheduler", |r| {
request_span(r, handle_scheduler_dump)
named_request_span(r, handle_scheduler_dump, RequestName("debug_v1_scheduler"))
})
.post("/debug/v1/consistency_check", |r| {
request_span(r, handle_consistency_check)
named_request_span(
r,
handle_consistency_check,
RequestName("debug_v1_consistency_check"),
)
})
.get("/control/v1/tenant/:tenant_id/locate", |r| {
tenant_service_handler(r, handle_tenant_locate)
.put("/debug/v1/failpoints", |r| {
request_span(r, |r| failpoints_handler(r, CancellationToken::new()))
})
// Node operations
.post("/control/v1/node", |r| {
request_span(r, handle_node_register)
named_request_span(r, handle_node_register, RequestName("control_v1_node"))
})
.get("/control/v1/node", |r| {
named_request_span(r, handle_node_list, RequestName("control_v1_node"))
})
.get("/control/v1/node", |r| request_span(r, handle_node_list))
.put("/control/v1/node/:node_id/config", |r| {
request_span(r, handle_node_configure)
named_request_span(
r,
handle_node_configure,
RequestName("control_v1_node_config"),
)
})
// Tenant Shard operations
.put("/control/v1/tenant/:tenant_shard_id/migrate", |r| {
tenant_service_handler(r, handle_tenant_shard_migrate)
tenant_service_handler(
r,
handle_tenant_shard_migrate,
RequestName("control_v1_tenant_migrate"),
)
})
.put("/control/v1/tenant/:tenant_id/shard_split", |r| {
tenant_service_handler(r, handle_tenant_shard_split)
tenant_service_handler(
r,
handle_tenant_shard_split,
RequestName("control_v1_tenant_shard_split"),
)
})
.get("/control/v1/tenant/:tenant_id", |r| {
tenant_service_handler(
r,
handle_tenant_describe,
RequestName("control_v1_tenant_describe"),
)
})
// Tenant operations
// The ^/v1/ endpoints act as a "Virtual Pageserver", enabling shard-naive clients to call into
// this service to manage tenants that actually consist of many tenant shards, as if they are a single entity.
.post("/v1/tenant", |r| {
tenant_service_handler(r, handle_tenant_create)
tenant_service_handler(r, handle_tenant_create, RequestName("v1_tenant"))
})
.delete("/v1/tenant/:tenant_id", |r| {
tenant_service_handler(r, handle_tenant_delete)
tenant_service_handler(r, handle_tenant_delete, RequestName("v1_tenant"))
})
.put("/v1/tenant/config", |r| {
tenant_service_handler(r, handle_tenant_config_set)
tenant_service_handler(r, handle_tenant_config_set, RequestName("v1_tenant_config"))
})
.get("/v1/tenant/:tenant_id/config", |r| {
tenant_service_handler(r, handle_tenant_config_get)
tenant_service_handler(r, handle_tenant_config_get, RequestName("v1_tenant_config"))
})
.put("/v1/tenant/:tenant_id/location_config", |r| {
tenant_service_handler(r, handle_tenant_location_config)
.put("/v1/tenant/:tenant_shard_id/location_config", |r| {
tenant_service_handler(
r,
handle_tenant_location_config,
RequestName("v1_tenant_location_config"),
)
})
.put("/v1/tenant/:tenant_id/time_travel_remote_storage", |r| {
tenant_service_handler(r, handle_tenant_time_travel_remote_storage)
tenant_service_handler(
r,
handle_tenant_time_travel_remote_storage,
RequestName("v1_tenant_time_travel_remote_storage"),
)
})
.post("/v1/tenant/:tenant_id/secondary/download", |r| {
tenant_service_handler(r, handle_tenant_secondary_download)
tenant_service_handler(
r,
handle_tenant_secondary_download,
RequestName("v1_tenant_secondary_download"),
)
})
// Timeline operations
.delete("/v1/tenant/:tenant_id/timeline/:timeline_id", |r| {
tenant_service_handler(r, handle_tenant_timeline_delete)
tenant_service_handler(
r,
handle_tenant_timeline_delete,
RequestName("v1_tenant_timeline"),
)
})
.post("/v1/tenant/:tenant_id/timeline", |r| {
tenant_service_handler(r, handle_tenant_timeline_create)
tenant_service_handler(
r,
handle_tenant_timeline_create,
RequestName("v1_tenant_timeline"),
)
})
// Tenant detail GET passthrough to shard zero
.get("/v1/tenant/:tenant_id", |r| {
tenant_service_handler(r, handle_tenant_timeline_passthrough)
tenant_service_handler(
r,
handle_tenant_timeline_passthrough,
RequestName("v1_tenant_passthrough"),
)
})
// Timeline GET passthrough to shard zero. Note that the `*` in the URL is a wildcard: any future
// timeline GET APIs will be implicitly included.
.get("/v1/tenant/:tenant_id/timeline*", |r| {
tenant_service_handler(r, handle_tenant_timeline_passthrough)
tenant_service_handler(
r,
handle_tenant_timeline_passthrough,
RequestName("v1_tenant_timeline_passthrough"),
)
})
}

View File

@@ -0,0 +1,54 @@
use std::{collections::HashMap, sync::Arc};
/// A map of locks covering some arbitrary identifiers. Useful if you have a collection of objects but don't
/// want to embed a lock in each one, or if your locking granularity is different to your object granularity.
/// For example, used in the storage controller where the objects are tenant shards, but sometimes locking
/// is needed at a tenant-wide granularity.
pub(crate) struct IdLockMap<T>
where
T: Eq + PartialEq + std::hash::Hash,
{
/// A synchronous lock for getting/setting the async locks that our callers will wait on.
entities: std::sync::Mutex<std::collections::HashMap<T, Arc<tokio::sync::RwLock<()>>>>,
}
impl<T> IdLockMap<T>
where
T: Eq + PartialEq + std::hash::Hash,
{
pub(crate) fn shared(
&self,
key: T,
) -> impl std::future::Future<Output = tokio::sync::OwnedRwLockReadGuard<()>> {
let mut locked = self.entities.lock().unwrap();
let entry = locked.entry(key).or_default();
entry.clone().read_owned()
}
pub(crate) fn exclusive(
&self,
key: T,
) -> impl std::future::Future<Output = tokio::sync::OwnedRwLockWriteGuard<()>> {
let mut locked = self.entities.lock().unwrap();
let entry = locked.entry(key).or_default();
entry.clone().write_owned()
}
/// Rather than building a lock guard that re-takes the [`Self::entities`] lock, we just do
/// periodic housekeeping to avoid the map growing indefinitely
pub(crate) fn housekeeping(&self) {
let mut locked = self.entities.lock().unwrap();
locked.retain(|_k, lock| lock.try_write().is_err())
}
}
impl<T> Default for IdLockMap<T>
where
T: Eq + PartialEq + std::hash::Hash,
{
fn default() -> Self {
Self {
entities: std::sync::Mutex::new(HashMap::new()),
}
}
}

View File

@@ -1,11 +1,14 @@
use serde::{Deserialize, Serialize};
use serde::Serialize;
use utils::seqwait::MonotonicCounter;
mod auth;
mod compute_hook;
mod heartbeater;
pub mod http;
mod id_lock_map;
pub mod metrics;
mod node;
mod pageserver_client;
pub mod persistence;
mod reconciler;
mod scheduler;
@@ -13,23 +16,6 @@ mod schema;
pub mod service;
mod tenant_state;
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, Eq)]
enum PlacementPolicy {
/// Cheapest way to attach a tenant: just one pageserver, no secondary
Single,
/// Production-ready way to attach a tenant: one attached pageserver and
/// some number of secondaries.
Double(usize),
/// Create one secondary mode locations. This is useful when onboarding
/// a tenant, or for an idle tenant that we might want to bring online quickly.
Secondary,
/// Do not attach to any pageservers. This is appropriate for tenants that
/// have been idle for a long time, where we do not mind some delay in making
/// them available in future.
Detached,
}
#[derive(Ord, PartialOrd, Eq, PartialEq, Copy, Clone, Serialize)]
struct Sequence(u64);
@@ -66,9 +52,3 @@ impl Sequence {
Sequence(self.0 + 1)
}
}
impl Default for PlacementPolicy {
fn default() -> Self {
PlacementPolicy::Double(1)
}
}

View File

@@ -1,15 +1,8 @@
/// The attachment service mimics the aspects of the control plane API
/// that are required for a pageserver to operate.
///
/// This enables running & testing pageservers without a full-blown
/// deployment of the Neon cloud platform.
///
use anyhow::{anyhow, Context};
use attachment_service::http::make_router;
use attachment_service::metrics::preinitialize_metrics;
use attachment_service::persistence::Persistence;
use attachment_service::service::{Config, Service};
use aws_config::{BehaviorVersion, Region};
use attachment_service::service::{Config, Service, MAX_UNAVAILABLE_INTERVAL_DEFAULT};
use camino::Utf8PathBuf;
use clap::Parser;
use diesel::Connection;
@@ -60,6 +53,30 @@ struct Cli {
/// URL to connect to postgres, like postgresql://localhost:1234/attachment_service
#[arg(long)]
database_url: Option<String>,
/// Flag to enable dev mode, which permits running without auth
#[arg(long, default_value = "false")]
dev: bool,
/// Grace period before marking unresponsive pageserver offline
#[arg(long)]
max_unavailable_interval: Option<humantime::Duration>,
}
enum StrictMode {
/// In strict mode, we will require that all secrets are loaded, i.e. security features
/// may not be implicitly turned off by omitting secrets in the environment.
Strict,
/// In dev mode, secrets are optional, and omitting a particular secret will implicitly
/// disable the auth related to it (e.g. no pageserver jwt key -> send unauthenticated
/// requests, no public key -> don't authenticate incoming requests).
Dev,
}
impl Default for StrictMode {
fn default() -> Self {
Self::Strict
}
}
/// Secrets may either be provided on the command line (for testing), or loaded from AWS SecretManager: this
@@ -72,13 +89,6 @@ struct Secrets {
}
impl Secrets {
const DATABASE_URL_SECRET: &'static str = "rds-neon-storage-controller-url";
const PAGESERVER_JWT_TOKEN_SECRET: &'static str =
"neon-storage-controller-pageserver-jwt-token";
const CONTROL_PLANE_JWT_TOKEN_SECRET: &'static str =
"neon-storage-controller-control-plane-jwt-token";
const PUBLIC_KEY_SECRET: &'static str = "neon-storage-controller-public-key";
const DATABASE_URL_ENV: &'static str = "DATABASE_URL";
const PAGESERVER_JWT_TOKEN_ENV: &'static str = "PAGESERVER_JWT_TOKEN";
const CONTROL_PLANE_JWT_TOKEN_ENV: &'static str = "CONTROL_PLANE_JWT_TOKEN";
@@ -89,111 +99,41 @@ impl Secrets {
/// - Environment variables if DATABASE_URL is set.
/// - AWS Secrets Manager secrets
async fn load(args: &Cli) -> anyhow::Result<Self> {
match &args.database_url {
Some(url) => Self::load_cli(url, args),
None => match std::env::var(Self::DATABASE_URL_ENV) {
Ok(database_url) => Self::load_env(database_url),
Err(_) => Self::load_aws_sm().await,
},
}
}
fn load_env(database_url: String) -> anyhow::Result<Self> {
let public_key = match std::env::var(Self::PUBLIC_KEY_ENV) {
Ok(public_key) => Some(JwtAuth::from_key(public_key).context("Loading public key")?),
Err(_) => None,
};
Ok(Self {
database_url,
public_key,
jwt_token: std::env::var(Self::PAGESERVER_JWT_TOKEN_ENV).ok(),
control_plane_jwt_token: std::env::var(Self::CONTROL_PLANE_JWT_TOKEN_ENV).ok(),
})
}
async fn load_aws_sm() -> anyhow::Result<Self> {
let Ok(region) = std::env::var("AWS_REGION") else {
anyhow::bail!("AWS_REGION is not set, cannot load secrets automatically: either set this, or use CLI args to supply secrets");
};
let config = aws_config::defaults(BehaviorVersion::v2023_11_09())
.region(Region::new(region.clone()))
.load()
.await;
let asm = aws_sdk_secretsmanager::Client::new(&config);
let Some(database_url) = asm
.get_secret_value()
.secret_id(Self::DATABASE_URL_SECRET)
.send()
.await?
.secret_string()
.map(str::to_string)
let Some(database_url) =
Self::load_secret(&args.database_url, Self::DATABASE_URL_ENV).await
else {
anyhow::bail!(
"Database URL secret not found at {region}/{}",
Self::DATABASE_URL_SECRET
"Database URL is not set (set `--database-url`, or `DATABASE_URL` environment)"
)
};
let jwt_token = asm
.get_secret_value()
.secret_id(Self::PAGESERVER_JWT_TOKEN_SECRET)
.send()
.await?
.secret_string()
.map(str::to_string);
if jwt_token.is_none() {
tracing::warn!("No pageserver JWT token set: this will only work if authentication is disabled on the pageserver");
}
let control_plane_jwt_token = asm
.get_secret_value()
.secret_id(Self::CONTROL_PLANE_JWT_TOKEN_SECRET)
.send()
.await?
.secret_string()
.map(str::to_string);
if jwt_token.is_none() {
tracing::warn!("No control plane JWT token set: this will only work if authentication is disabled on the pageserver");
}
let public_key = asm
.get_secret_value()
.secret_id(Self::PUBLIC_KEY_SECRET)
.send()
.await?
.secret_string()
.map(str::to_string);
let public_key = match public_key {
Some(key) => Some(JwtAuth::from_key(key)?),
None => {
tracing::warn!(
"No public key set: inccoming HTTP requests will not be authenticated"
);
None
}
let public_key = match Self::load_secret(&args.public_key, Self::PUBLIC_KEY_ENV).await {
Some(v) => Some(JwtAuth::from_key(v).context("Loading public key")?),
None => None,
};
Ok(Self {
let this = Self {
database_url,
public_key,
jwt_token,
control_plane_jwt_token,
})
jwt_token: Self::load_secret(&args.jwt_token, Self::PAGESERVER_JWT_TOKEN_ENV).await,
control_plane_jwt_token: Self::load_secret(
&args.control_plane_jwt_token,
Self::CONTROL_PLANE_JWT_TOKEN_ENV,
)
.await,
};
Ok(this)
}
fn load_cli(database_url: &str, args: &Cli) -> anyhow::Result<Self> {
let public_key = match &args.public_key {
None => None,
Some(key) => Some(JwtAuth::from_key(key.clone()).context("Loading public key")?),
};
Ok(Self {
database_url: database_url.to_owned(),
public_key,
jwt_token: args.jwt_token.clone(),
control_plane_jwt_token: args.control_plane_jwt_token.clone(),
})
async fn load_secret(cli: &Option<String>, env_name: &str) -> Option<String> {
if let Some(v) = cli {
Some(v.clone())
} else if let Ok(v) = std::env::var(env_name) {
Some(v)
} else {
None
}
}
}
@@ -212,6 +152,12 @@ async fn migration_run(database_url: &str) -> anyhow::Result<()> {
}
fn main() -> anyhow::Result<()> {
let default_panic = std::panic::take_hook();
std::panic::set_hook(Box::new(move |info| {
default_panic(info);
std::process::exit(1);
}));
tokio::runtime::Builder::new_current_thread()
// We use spawn_blocking for database operations, so require approximately
// as many blocking threads as we will open database connections.
@@ -243,12 +189,50 @@ async fn async_main() -> anyhow::Result<()> {
args.listen
);
let strict_mode = if args.dev {
StrictMode::Dev
} else {
StrictMode::Strict
};
let secrets = Secrets::load(&args).await?;
// Validate required secrets and arguments are provided in strict mode
match strict_mode {
StrictMode::Strict
if (secrets.public_key.is_none()
|| secrets.jwt_token.is_none()
|| secrets.control_plane_jwt_token.is_none()) =>
{
// Production systems should always have secrets configured: if public_key was not set
// then we would implicitly disable auth.
anyhow::bail!(
"Insecure config! One or more secrets is not set. This is only permitted in `--dev` mode"
);
}
StrictMode::Strict if args.compute_hook_url.is_none() => {
// Production systems should always have a compute hook set, to prevent falling
// back to trying to use neon_local.
anyhow::bail!(
"`--compute-hook-url` is not set: this is only permitted in `--dev` mode"
);
}
StrictMode::Strict => {
tracing::info!("Starting in strict mode: configuration is OK.")
}
StrictMode::Dev => {
tracing::warn!("Starting in dev mode: this may be an insecure configuration.")
}
}
let config = Config {
jwt_token: secrets.jwt_token,
control_plane_jwt_token: secrets.control_plane_jwt_token,
compute_hook_url: args.compute_hook_url,
max_unavailable_interval: args
.max_unavailable_interval
.map(humantime::Duration::into)
.unwrap_or(MAX_UNAVAILABLE_INTERVAL_DEFAULT),
};
// After loading secrets & config, but before starting anything else, apply database migrations

View File

@@ -1,32 +1,284 @@
use metrics::{register_int_counter, register_int_counter_vec, IntCounter, IntCounterVec};
//!
//! This module provides metric definitions for the storage controller.
//!
//! All metrics are grouped in [`StorageControllerMetricGroup`]. [`StorageControllerMetrics`] holds
//! the mentioned metrics and their encoder. It's globally available via the [`METRICS_REGISTRY`]
//! constant.
//!
//! The rest of the code defines label group types and deals with converting outer types to labels.
//!
use bytes::Bytes;
use measured::{
label::{LabelValue, StaticLabelSet},
FixedCardinalityLabel, MetricGroup,
};
use once_cell::sync::Lazy;
use std::sync::Mutex;
pub(crate) struct ReconcilerMetrics {
pub(crate) spawned: IntCounter,
pub(crate) complete: IntCounterVec,
}
use crate::persistence::{DatabaseError, DatabaseOperation};
impl ReconcilerMetrics {
// Labels used on [`Self::complete`]
pub(crate) const SUCCESS: &'static str = "ok";
pub(crate) const ERROR: &'static str = "success";
pub(crate) const CANCEL: &'static str = "cancel";
}
pub(crate) static RECONCILER: Lazy<ReconcilerMetrics> = Lazy::new(|| ReconcilerMetrics {
spawned: register_int_counter!(
"storage_controller_reconcile_spawn",
"Count of how many times we spawn a reconcile task",
)
.expect("failed to define a metric"),
complete: register_int_counter_vec!(
"storage_controller_reconcile_complete",
"Reconciler tasks completed, broken down by success/failure/cancelled",
&["status"],
)
.expect("failed to define a metric"),
});
pub(crate) static METRICS_REGISTRY: Lazy<StorageControllerMetrics> =
Lazy::new(StorageControllerMetrics::default);
pub fn preinitialize_metrics() {
Lazy::force(&RECONCILER);
Lazy::force(&METRICS_REGISTRY);
}
pub(crate) struct StorageControllerMetrics {
pub(crate) metrics_group: StorageControllerMetricGroup,
encoder: Mutex<measured::text::TextEncoder>,
}
#[derive(measured::MetricGroup)]
pub(crate) struct StorageControllerMetricGroup {
/// Count of how many times we spawn a reconcile task
pub(crate) storage_controller_reconcile_spawn: measured::Counter,
/// Reconciler tasks completed, broken down by success/failure/cancelled
pub(crate) storage_controller_reconcile_complete:
measured::CounterVec<ReconcileCompleteLabelGroupSet>,
/// HTTP request status counters for handled requests
pub(crate) storage_controller_http_request_status:
measured::CounterVec<HttpRequestStatusLabelGroupSet>,
/// HTTP request handler latency across all status codes
pub(crate) storage_controller_http_request_latency:
measured::HistogramVec<HttpRequestLatencyLabelGroupSet, 5>,
/// Count of HTTP requests to the pageserver that resulted in an error,
/// broken down by the pageserver node id, request name and method
pub(crate) storage_controller_pageserver_request_error:
measured::CounterVec<PageserverRequestLabelGroupSet>,
/// Latency of HTTP requests to the pageserver, broken down by pageserver
/// node id, request name and method. This include both successful and unsuccessful
/// requests.
pub(crate) storage_controller_pageserver_request_latency:
measured::HistogramVec<PageserverRequestLabelGroupSet, 5>,
/// Count of pass-through HTTP requests to the pageserver that resulted in an error,
/// broken down by the pageserver node id, request name and method
pub(crate) storage_controller_passthrough_request_error:
measured::CounterVec<PageserverRequestLabelGroupSet>,
/// Latency of pass-through HTTP requests to the pageserver, broken down by pageserver
/// node id, request name and method. This include both successful and unsuccessful
/// requests.
pub(crate) storage_controller_passthrough_request_latency:
measured::HistogramVec<PageserverRequestLabelGroupSet, 5>,
/// Count of errors in database queries, broken down by error type and operation.
pub(crate) storage_controller_database_query_error:
measured::CounterVec<DatabaseQueryErrorLabelGroupSet>,
/// Latency of database queries, broken down by operation.
pub(crate) storage_controller_database_query_latency:
measured::HistogramVec<DatabaseQueryLatencyLabelGroupSet, 5>,
}
impl StorageControllerMetrics {
pub(crate) fn encode(&self) -> Bytes {
let mut encoder = self.encoder.lock().unwrap();
self.metrics_group.collect_into(&mut *encoder);
encoder.finish()
}
}
impl Default for StorageControllerMetrics {
fn default() -> Self {
Self {
metrics_group: StorageControllerMetricGroup::new(),
encoder: Mutex::new(measured::text::TextEncoder::new()),
}
}
}
impl StorageControllerMetricGroup {
pub(crate) fn new() -> Self {
Self {
storage_controller_reconcile_spawn: measured::Counter::new(),
storage_controller_reconcile_complete: measured::CounterVec::new(
ReconcileCompleteLabelGroupSet {
status: StaticLabelSet::new(),
},
),
storage_controller_http_request_status: measured::CounterVec::new(
HttpRequestStatusLabelGroupSet {
path: lasso::ThreadedRodeo::new(),
method: StaticLabelSet::new(),
status: StaticLabelSet::new(),
},
),
storage_controller_http_request_latency: measured::HistogramVec::new(
measured::metric::histogram::Thresholds::exponential_buckets(0.1, 2.0),
),
storage_controller_pageserver_request_error: measured::CounterVec::new(
PageserverRequestLabelGroupSet {
pageserver_id: lasso::ThreadedRodeo::new(),
path: lasso::ThreadedRodeo::new(),
method: StaticLabelSet::new(),
},
),
storage_controller_pageserver_request_latency: measured::HistogramVec::new(
measured::metric::histogram::Thresholds::exponential_buckets(0.1, 2.0),
),
storage_controller_passthrough_request_error: measured::CounterVec::new(
PageserverRequestLabelGroupSet {
pageserver_id: lasso::ThreadedRodeo::new(),
path: lasso::ThreadedRodeo::new(),
method: StaticLabelSet::new(),
},
),
storage_controller_passthrough_request_latency: measured::HistogramVec::new(
measured::metric::histogram::Thresholds::exponential_buckets(0.1, 2.0),
),
storage_controller_database_query_error: measured::CounterVec::new(
DatabaseQueryErrorLabelGroupSet {
operation: StaticLabelSet::new(),
error_type: StaticLabelSet::new(),
},
),
storage_controller_database_query_latency: measured::HistogramVec::new(
measured::metric::histogram::Thresholds::exponential_buckets(0.1, 2.0),
),
}
}
}
#[derive(measured::LabelGroup)]
#[label(set = ReconcileCompleteLabelGroupSet)]
pub(crate) struct ReconcileCompleteLabelGroup {
pub(crate) status: ReconcileOutcome,
}
#[derive(measured::LabelGroup)]
#[label(set = HttpRequestStatusLabelGroupSet)]
pub(crate) struct HttpRequestStatusLabelGroup<'a> {
#[label(dynamic_with = lasso::ThreadedRodeo)]
pub(crate) path: &'a str,
pub(crate) method: Method,
pub(crate) status: StatusCode,
}
#[derive(measured::LabelGroup)]
#[label(set = HttpRequestLatencyLabelGroupSet)]
pub(crate) struct HttpRequestLatencyLabelGroup<'a> {
#[label(dynamic_with = lasso::ThreadedRodeo)]
pub(crate) path: &'a str,
pub(crate) method: Method,
}
impl Default for HttpRequestLatencyLabelGroupSet {
fn default() -> Self {
Self {
path: lasso::ThreadedRodeo::new(),
method: StaticLabelSet::new(),
}
}
}
#[derive(measured::LabelGroup, Clone)]
#[label(set = PageserverRequestLabelGroupSet)]
pub(crate) struct PageserverRequestLabelGroup<'a> {
#[label(dynamic_with = lasso::ThreadedRodeo)]
pub(crate) pageserver_id: &'a str,
#[label(dynamic_with = lasso::ThreadedRodeo)]
pub(crate) path: &'a str,
pub(crate) method: Method,
}
impl Default for PageserverRequestLabelGroupSet {
fn default() -> Self {
Self {
pageserver_id: lasso::ThreadedRodeo::new(),
path: lasso::ThreadedRodeo::new(),
method: StaticLabelSet::new(),
}
}
}
#[derive(measured::LabelGroup)]
#[label(set = DatabaseQueryErrorLabelGroupSet)]
pub(crate) struct DatabaseQueryErrorLabelGroup {
pub(crate) error_type: DatabaseErrorLabel,
pub(crate) operation: DatabaseOperation,
}
#[derive(measured::LabelGroup)]
#[label(set = DatabaseQueryLatencyLabelGroupSet)]
pub(crate) struct DatabaseQueryLatencyLabelGroup {
pub(crate) operation: DatabaseOperation,
}
#[derive(FixedCardinalityLabel)]
pub(crate) enum ReconcileOutcome {
#[label(rename = "ok")]
Success,
Error,
Cancel,
}
#[derive(FixedCardinalityLabel, Clone)]
pub(crate) enum Method {
Get,
Put,
Post,
Delete,
Other,
}
impl From<hyper::Method> for Method {
fn from(value: hyper::Method) -> Self {
if value == hyper::Method::GET {
Method::Get
} else if value == hyper::Method::PUT {
Method::Put
} else if value == hyper::Method::POST {
Method::Post
} else if value == hyper::Method::DELETE {
Method::Delete
} else {
Method::Other
}
}
}
pub(crate) struct StatusCode(pub(crate) hyper::http::StatusCode);
impl LabelValue for StatusCode {
fn visit<V: measured::label::LabelVisitor>(&self, v: V) -> V::Output {
v.write_int(self.0.as_u16() as u64)
}
}
impl FixedCardinalityLabel for StatusCode {
fn cardinality() -> usize {
(100..1000).len()
}
fn encode(&self) -> usize {
self.0.as_u16() as usize
}
fn decode(value: usize) -> Self {
Self(hyper::http::StatusCode::from_u16(u16::try_from(value).unwrap()).unwrap())
}
}
#[derive(FixedCardinalityLabel)]
pub(crate) enum DatabaseErrorLabel {
Query,
Connection,
ConnectionPool,
Logical,
}
impl DatabaseError {
pub(crate) fn error_label(&self) -> DatabaseErrorLabel {
match self {
Self::Query(_) => DatabaseErrorLabel::Query,
Self::Connection(_) => DatabaseErrorLabel::Connection,
Self::ConnectionPool(_) => DatabaseErrorLabel::ConnectionPool,
Self::Logical(_) => DatabaseErrorLabel::Logical,
}
}
}

View File

@@ -1,8 +1,20 @@
use pageserver_api::controller_api::{NodeAvailability, NodeSchedulingPolicy};
use serde::Serialize;
use utils::id::NodeId;
use std::{str::FromStr, time::Duration};
use crate::persistence::NodePersistence;
use hyper::StatusCode;
use pageserver_api::{
controller_api::{
NodeAvailability, NodeRegisterRequest, NodeSchedulingPolicy, TenantLocateResponseShard,
},
shard::TenantShardId,
};
use pageserver_client::mgmt_api;
use serde::Serialize;
use tokio_util::sync::CancellationToken;
use utils::{backoff, id::NodeId};
use crate::{
pageserver_client::PageserverClient, persistence::NodePersistence, scheduler::MaySchedule,
};
/// Represents the in-memory description of a Node.
///
@@ -12,16 +24,29 @@ use crate::persistence::NodePersistence;
/// implementation of serialization on this type is only for debug dumps.
#[derive(Clone, Serialize)]
pub(crate) struct Node {
pub(crate) id: NodeId,
id: NodeId,
pub(crate) availability: NodeAvailability,
pub(crate) scheduling: NodeSchedulingPolicy,
availability: NodeAvailability,
scheduling: NodeSchedulingPolicy,
pub(crate) listen_http_addr: String,
pub(crate) listen_http_port: u16,
listen_http_addr: String,
listen_http_port: u16,
pub(crate) listen_pg_addr: String,
pub(crate) listen_pg_port: u16,
listen_pg_addr: String,
listen_pg_port: u16,
// This cancellation token means "stop any RPCs in flight to this node, and don't start
// any more". It is not related to process shutdown.
#[serde(skip)]
cancel: CancellationToken,
}
/// When updating [`Node::availability`] we use this type to indicate to the caller
/// whether/how they changed it.
pub(crate) enum AvailabilityTransition {
ToActive,
ToOffline,
Unchanged,
}
impl Node {
@@ -29,18 +54,111 @@ impl Node {
format!("http://{}:{}", self.listen_http_addr, self.listen_http_port)
}
/// Is this node elegible to have work scheduled onto it?
pub(crate) fn may_schedule(&self) -> bool {
match self.availability {
NodeAvailability::Active => {}
NodeAvailability::Offline => return false,
pub(crate) fn get_id(&self) -> NodeId {
self.id
}
pub(crate) fn set_scheduling(&mut self, scheduling: NodeSchedulingPolicy) {
self.scheduling = scheduling
}
/// Does this registration request match `self`? This is used when deciding whether a registration
/// request should be allowed to update an existing record with the same node ID.
pub(crate) fn registration_match(&self, register_req: &NodeRegisterRequest) -> bool {
self.id == register_req.node_id
&& self.listen_http_addr == register_req.listen_http_addr
&& self.listen_http_port == register_req.listen_http_port
&& self.listen_pg_addr == register_req.listen_pg_addr
&& self.listen_pg_port == register_req.listen_pg_port
}
/// For a shard located on this node, populate a response object
/// with this node's address information.
pub(crate) fn shard_location(&self, shard_id: TenantShardId) -> TenantLocateResponseShard {
TenantLocateResponseShard {
shard_id,
node_id: self.id,
listen_http_addr: self.listen_http_addr.clone(),
listen_http_port: self.listen_http_port,
listen_pg_addr: self.listen_pg_addr.clone(),
listen_pg_port: self.listen_pg_port,
}
}
pub(crate) fn set_availability(&mut self, availability: NodeAvailability) {
match self.get_availability_transition(availability) {
AvailabilityTransition::ToActive => {
// Give the node a new cancellation token, effectively resetting it to un-cancelled. Any
// users of previously-cloned copies of the node will still see the old cancellation
// state. For example, Reconcilers in flight will have to complete and be spawned
// again to realize that the node has become available.
self.cancel = CancellationToken::new();
}
AvailabilityTransition::ToOffline => {
// Fire the node's cancellation token to cancel any in-flight API requests to it
self.cancel.cancel();
}
AvailabilityTransition::Unchanged => {}
}
self.availability = availability;
}
/// Without modifying the availability of the node, convert the intended availability
/// into a description of the transition.
pub(crate) fn get_availability_transition(
&self,
availability: NodeAvailability,
) -> AvailabilityTransition {
use AvailabilityTransition::*;
use NodeAvailability::*;
match (self.availability, availability) {
(Offline, Active(_)) => ToActive,
(Active(_), Offline) => ToOffline,
_ => Unchanged,
}
}
/// Whether we may send API requests to this node.
pub(crate) fn is_available(&self) -> bool {
// When we clone a node, [`Self::availability`] is a snapshot, but [`Self::cancel`] holds
// a reference to the original Node's cancellation status. Checking both of these results
// in a "pessimistic" check where we will consider a Node instance unavailable if it was unavailable
// when we cloned it, or if the original Node instance's cancellation token was fired.
matches!(self.availability, NodeAvailability::Active(_)) && !self.cancel.is_cancelled()
}
/// Is this node elegible to have work scheduled onto it?
pub(crate) fn may_schedule(&self) -> MaySchedule {
let score = match self.availability {
NodeAvailability::Active(score) => score,
NodeAvailability::Offline => return MaySchedule::No,
};
match self.scheduling {
NodeSchedulingPolicy::Active => true,
NodeSchedulingPolicy::Draining => false,
NodeSchedulingPolicy::Filling => true,
NodeSchedulingPolicy::Pause => false,
NodeSchedulingPolicy::Active => MaySchedule::Yes(score),
NodeSchedulingPolicy::Draining => MaySchedule::No,
NodeSchedulingPolicy::Filling => MaySchedule::Yes(score),
NodeSchedulingPolicy::Pause => MaySchedule::No,
}
}
pub(crate) fn new(
id: NodeId,
listen_http_addr: String,
listen_http_port: u16,
listen_pg_addr: String,
listen_pg_port: u16,
) -> Self {
Self {
id,
listen_http_addr,
listen_http_port,
listen_pg_addr,
listen_pg_port,
scheduling: NodeSchedulingPolicy::Filling,
availability: NodeAvailability::Offline,
cancel: CancellationToken::new(),
}
}
@@ -54,4 +172,100 @@ impl Node {
listen_pg_port: self.listen_pg_port as i32,
}
}
pub(crate) fn from_persistent(np: NodePersistence) -> Self {
Self {
id: NodeId(np.node_id as u64),
// At startup we consider a node offline until proven otherwise.
availability: NodeAvailability::Offline,
scheduling: NodeSchedulingPolicy::from_str(&np.scheduling_policy)
.expect("Bad scheduling policy in DB"),
listen_http_addr: np.listen_http_addr,
listen_http_port: np.listen_http_port as u16,
listen_pg_addr: np.listen_pg_addr,
listen_pg_port: np.listen_pg_port as u16,
cancel: CancellationToken::new(),
}
}
/// Wrapper for issuing requests to pageserver management API: takes care of generic
/// retry/backoff for retryable HTTP status codes.
///
/// This will return None to indicate cancellation. Cancellation may happen from
/// the cancellation token passed in, or from Self's cancellation token (i.e. node
/// going offline).
pub(crate) async fn with_client_retries<T, O, F>(
&self,
mut op: O,
jwt: &Option<String>,
warn_threshold: u32,
max_retries: u32,
timeout: Duration,
cancel: &CancellationToken,
) -> Option<mgmt_api::Result<T>>
where
O: FnMut(PageserverClient) -> F,
F: std::future::Future<Output = mgmt_api::Result<T>>,
{
fn is_fatal(e: &mgmt_api::Error) -> bool {
use mgmt_api::Error::*;
match e {
ReceiveBody(_) | ReceiveErrorBody(_) => false,
ApiError(StatusCode::SERVICE_UNAVAILABLE, _)
| ApiError(StatusCode::GATEWAY_TIMEOUT, _)
| ApiError(StatusCode::REQUEST_TIMEOUT, _) => false,
ApiError(_, _) => true,
Cancelled => true,
}
}
backoff::retry(
|| {
let http_client = reqwest::ClientBuilder::new()
.timeout(timeout)
.build()
.expect("Failed to construct HTTP client");
let client = PageserverClient::from_client(
self.get_id(),
http_client,
self.base_url(),
jwt.as_deref(),
);
let node_cancel_fut = self.cancel.cancelled();
let op_fut = op(client);
async {
tokio::select! {
r = op_fut=> {r},
_ = node_cancel_fut => {
Err(mgmt_api::Error::Cancelled)
}}
}
},
is_fatal,
warn_threshold,
max_retries,
&format!(
"Call to node {} ({}:{}) management API",
self.id, self.listen_http_addr, self.listen_http_port
),
cancel,
)
.await
}
}
impl std::fmt::Display for Node {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{} ({})", self.id, self.listen_http_addr)
}
}
impl std::fmt::Debug for Node {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{} ({})", self.id, self.listen_http_addr)
}
}

View File

@@ -0,0 +1,203 @@
use pageserver_api::{
models::{
LocationConfig, LocationConfigListResponse, PageserverUtilization, SecondaryProgress,
TenantShardSplitRequest, TenantShardSplitResponse, TimelineCreateRequest, TimelineInfo,
},
shard::TenantShardId,
};
use pageserver_client::mgmt_api::{Client, Result};
use reqwest::StatusCode;
use utils::id::{NodeId, TimelineId};
/// Thin wrapper around [`pageserver_client::mgmt_api::Client`]. It allows the storage
/// controller to collect metrics in a non-intrusive manner.
#[derive(Debug, Clone)]
pub(crate) struct PageserverClient {
inner: Client,
node_id_label: String,
}
macro_rules! measured_request {
($name:literal, $method:expr, $node_id: expr, $invoke:expr) => {{
let labels = crate::metrics::PageserverRequestLabelGroup {
pageserver_id: $node_id,
path: $name,
method: $method,
};
let latency = &crate::metrics::METRICS_REGISTRY
.metrics_group
.storage_controller_pageserver_request_latency;
let _timer_guard = latency.start_timer(labels.clone());
let res = $invoke;
if res.is_err() {
let error_counters = &crate::metrics::METRICS_REGISTRY
.metrics_group
.storage_controller_pageserver_request_error;
error_counters.inc(labels)
}
res
}};
}
impl PageserverClient {
pub(crate) fn new(node_id: NodeId, mgmt_api_endpoint: String, jwt: Option<&str>) -> Self {
Self {
inner: Client::from_client(reqwest::Client::new(), mgmt_api_endpoint, jwt),
node_id_label: node_id.0.to_string(),
}
}
pub(crate) fn from_client(
node_id: NodeId,
raw_client: reqwest::Client,
mgmt_api_endpoint: String,
jwt: Option<&str>,
) -> Self {
Self {
inner: Client::from_client(raw_client, mgmt_api_endpoint, jwt),
node_id_label: node_id.0.to_string(),
}
}
pub(crate) async fn tenant_delete(&self, tenant_shard_id: TenantShardId) -> Result<StatusCode> {
measured_request!(
"tenant",
crate::metrics::Method::Delete,
&self.node_id_label,
self.inner.tenant_delete(tenant_shard_id).await
)
}
pub(crate) async fn tenant_time_travel_remote_storage(
&self,
tenant_shard_id: TenantShardId,
timestamp: &str,
done_if_after: &str,
) -> Result<()> {
measured_request!(
"tenant_time_travel_remote_storage",
crate::metrics::Method::Put,
&self.node_id_label,
self.inner
.tenant_time_travel_remote_storage(tenant_shard_id, timestamp, done_if_after)
.await
)
}
pub(crate) async fn tenant_secondary_download(
&self,
tenant_id: TenantShardId,
wait: Option<std::time::Duration>,
) -> Result<(StatusCode, SecondaryProgress)> {
measured_request!(
"tenant_secondary_download",
crate::metrics::Method::Post,
&self.node_id_label,
self.inner.tenant_secondary_download(tenant_id, wait).await
)
}
pub(crate) async fn location_config(
&self,
tenant_shard_id: TenantShardId,
config: LocationConfig,
flush_ms: Option<std::time::Duration>,
lazy: bool,
) -> Result<()> {
measured_request!(
"location_config",
crate::metrics::Method::Put,
&self.node_id_label,
self.inner
.location_config(tenant_shard_id, config, flush_ms, lazy)
.await
)
}
pub(crate) async fn list_location_config(&self) -> Result<LocationConfigListResponse> {
measured_request!(
"location_configs",
crate::metrics::Method::Get,
&self.node_id_label,
self.inner.list_location_config().await
)
}
pub(crate) async fn get_location_config(
&self,
tenant_shard_id: TenantShardId,
) -> Result<Option<LocationConfig>> {
measured_request!(
"location_config",
crate::metrics::Method::Get,
&self.node_id_label,
self.inner.get_location_config(tenant_shard_id).await
)
}
pub(crate) async fn timeline_create(
&self,
tenant_shard_id: TenantShardId,
req: &TimelineCreateRequest,
) -> Result<TimelineInfo> {
measured_request!(
"timeline",
crate::metrics::Method::Post,
&self.node_id_label,
self.inner.timeline_create(tenant_shard_id, req).await
)
}
pub(crate) async fn timeline_delete(
&self,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
) -> Result<StatusCode> {
measured_request!(
"timeline",
crate::metrics::Method::Delete,
&self.node_id_label,
self.inner
.timeline_delete(tenant_shard_id, timeline_id)
.await
)
}
pub(crate) async fn tenant_shard_split(
&self,
tenant_shard_id: TenantShardId,
req: TenantShardSplitRequest,
) -> Result<TenantShardSplitResponse> {
measured_request!(
"tenant_shard_split",
crate::metrics::Method::Put,
&self.node_id_label,
self.inner.tenant_shard_split(tenant_shard_id, req).await
)
}
pub(crate) async fn timeline_list(
&self,
tenant_shard_id: &TenantShardId,
) -> Result<Vec<TimelineInfo>> {
measured_request!(
"timelines",
crate::metrics::Method::Get,
&self.node_id_label,
self.inner.timeline_list(tenant_shard_id).await
)
}
pub(crate) async fn get_utilization(&self) -> Result<PageserverUtilization> {
measured_request!(
"utilization",
crate::metrics::Method::Get,
&self.node_id_label,
self.inner.get_utilization().await
)
}
}

View File

@@ -7,23 +7,26 @@ use self::split_state::SplitState;
use camino::Utf8Path;
use camino::Utf8PathBuf;
use diesel::pg::PgConnection;
use diesel::{
Connection, ExpressionMethods, Insertable, QueryDsl, QueryResult, Queryable, RunQueryDsl,
Selectable, SelectableHelper,
};
use pageserver_api::controller_api::NodeSchedulingPolicy;
use diesel::prelude::*;
use diesel::Connection;
use pageserver_api::controller_api::{NodeSchedulingPolicy, PlacementPolicy};
use pageserver_api::models::TenantConfig;
use pageserver_api::shard::ShardConfigError;
use pageserver_api::shard::ShardIdentity;
use pageserver_api::shard::ShardStripeSize;
use pageserver_api::shard::{ShardCount, ShardNumber, TenantShardId};
use serde::{Deserialize, Serialize};
use utils::generation::Generation;
use utils::id::{NodeId, TenantId};
use crate::metrics::{
DatabaseQueryErrorLabelGroup, DatabaseQueryLatencyLabelGroup, METRICS_REGISTRY,
};
use crate::node::Node;
use crate::PlacementPolicy;
/// ## What do we store?
///
/// The attachment service does not store most of its state durably.
/// The storage controller service does not store most of its state durably.
///
/// The essential things to store durably are:
/// - generation numbers, as these must always advance monotonically to ensure data safety.
@@ -37,7 +40,7 @@ use crate::PlacementPolicy;
///
/// ## Performance/efficiency
///
/// The attachment service does not go via the database for most things: there are
/// The storage controller service does not go via the database for most things: there are
/// a couple of places where we must, and where efficiency matters:
/// - Incrementing generation numbers: the Reconciler has to wait for this to complete
/// before it can attach a tenant, so this acts as a bound on how fast things like
@@ -75,6 +78,33 @@ pub(crate) enum DatabaseError {
Logical(String),
}
#[derive(measured::FixedCardinalityLabel, Clone)]
pub(crate) enum DatabaseOperation {
InsertNode,
UpdateNode,
DeleteNode,
ListNodes,
BeginShardSplit,
CompleteShardSplit,
AbortShardSplit,
Detach,
ReAttach,
IncrementGeneration,
ListTenantShards,
InsertTenantShards,
UpdateTenantShard,
DeleteTenant,
UpdateTenantConfig,
}
#[must_use]
pub(crate) enum AbortShardSplitStatus {
/// We aborted the split in the database by reverting to the parent shards
Aborted,
/// The split had already been persisted.
Complete,
}
pub(crate) type DatabaseResult<T> = Result<T, DatabaseError>;
impl Persistence {
@@ -107,6 +137,34 @@ impl Persistence {
}
}
/// Wraps `with_conn` in order to collect latency and error metrics
async fn with_measured_conn<F, R>(&self, op: DatabaseOperation, func: F) -> DatabaseResult<R>
where
F: Fn(&mut PgConnection) -> DatabaseResult<R> + Send + 'static,
R: Send + 'static,
{
let latency = &METRICS_REGISTRY
.metrics_group
.storage_controller_database_query_latency;
let _timer = latency.start_timer(DatabaseQueryLatencyLabelGroup {
operation: op.clone(),
});
let res = self.with_conn(func).await;
if let Err(err) = &res {
let error_counter = &METRICS_REGISTRY
.metrics_group
.storage_controller_database_query_error;
error_counter.inc(DatabaseQueryErrorLabelGroup {
error_type: err.error_label(),
operation: op,
})
}
res
}
/// Call the provided function in a tokio blocking thread, with a Diesel database connection.
async fn with_conn<F, R>(&self, func: F) -> DatabaseResult<R>
where
@@ -122,21 +180,27 @@ impl Persistence {
/// When a node is first registered, persist it before using it for anything
pub(crate) async fn insert_node(&self, node: &Node) -> DatabaseResult<()> {
let np = node.to_persistent();
self.with_conn(move |conn| -> DatabaseResult<()> {
diesel::insert_into(crate::schema::nodes::table)
.values(&np)
.execute(conn)?;
Ok(())
})
self.with_measured_conn(
DatabaseOperation::InsertNode,
move |conn| -> DatabaseResult<()> {
diesel::insert_into(crate::schema::nodes::table)
.values(&np)
.execute(conn)?;
Ok(())
},
)
.await
}
/// At startup, populate the list of nodes which our shards may be placed on
pub(crate) async fn list_nodes(&self) -> DatabaseResult<Vec<NodePersistence>> {
let nodes: Vec<NodePersistence> = self
.with_conn(move |conn| -> DatabaseResult<_> {
Ok(crate::schema::nodes::table.load::<NodePersistence>(conn)?)
})
.with_measured_conn(
DatabaseOperation::ListNodes,
move |conn| -> DatabaseResult<_> {
Ok(crate::schema::nodes::table.load::<NodePersistence>(conn)?)
},
)
.await?;
tracing::info!("list_nodes: loaded {} nodes", nodes.len());
@@ -151,7 +215,7 @@ impl Persistence {
) -> DatabaseResult<()> {
use crate::schema::nodes::dsl::*;
let updated = self
.with_conn(move |conn| {
.with_measured_conn(DatabaseOperation::UpdateNode, move |conn| {
let updated = diesel::update(nodes)
.filter(node_id.eq(input_node_id.0 as i64))
.set((scheduling_policy.eq(String::from(input_scheduling)),))
@@ -173,9 +237,12 @@ impl Persistence {
/// be enriched at runtime with state discovered on pageservers.
pub(crate) async fn list_tenant_shards(&self) -> DatabaseResult<Vec<TenantShardPersistence>> {
let loaded = self
.with_conn(move |conn| -> DatabaseResult<_> {
Ok(crate::schema::tenant_shards::table.load::<TenantShardPersistence>(conn)?)
})
.with_measured_conn(
DatabaseOperation::ListTenantShards,
move |conn| -> DatabaseResult<_> {
Ok(crate::schema::tenant_shards::table.load::<TenantShardPersistence>(conn)?)
},
)
.await?;
if loaded.is_empty() {
@@ -203,15 +270,10 @@ impl Persistence {
let mut decoded = serde_json::from_slice::<JsonPersistence>(&bytes)
.map_err(|e| DatabaseError::Logical(format!("Deserialization error: {e}")))?;
for (tenant_id, tenant) in &mut decoded.tenants {
// Backward compat: an old attachments.json from before PR #6251, replace
// empty strings with proper defaults.
if tenant.tenant_id.is_empty() {
tenant.tenant_id = tenant_id.to_string();
tenant.config = serde_json::to_string(&TenantConfig::default())
.map_err(|e| DatabaseError::Logical(format!("Serialization error: {e}")))?;
tenant.placement_policy = serde_json::to_string(&PlacementPolicy::default())
.map_err(|e| DatabaseError::Logical(format!("Serialization error: {e}")))?;
for shard in decoded.tenants.values_mut() {
if shard.placement_policy == "\"Single\"" {
// Backward compat for test data after PR https://github.com/neondatabase/neon/pull/7165
shard.placement_policy = "{\"Attached\":0}".to_string();
}
}
@@ -257,17 +319,20 @@ impl Persistence {
shards: Vec<TenantShardPersistence>,
) -> DatabaseResult<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| -> DatabaseResult<()> {
conn.transaction(|conn| -> QueryResult<()> {
for tenant in &shards {
diesel::insert_into(tenant_shards)
.values(tenant)
.execute(conn)?;
}
self.with_measured_conn(
DatabaseOperation::InsertTenantShards,
move |conn| -> DatabaseResult<()> {
conn.transaction(|conn| -> QueryResult<()> {
for tenant in &shards {
diesel::insert_into(tenant_shards)
.values(tenant)
.execute(conn)?;
}
Ok(())
})?;
Ok(())
})?;
Ok(())
})
},
)
.await
}
@@ -275,25 +340,31 @@ impl Persistence {
/// the tenant from memory on this server.
pub(crate) async fn delete_tenant(&self, del_tenant_id: TenantId) -> DatabaseResult<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| -> DatabaseResult<()> {
diesel::delete(tenant_shards)
.filter(tenant_id.eq(del_tenant_id.to_string()))
.execute(conn)?;
self.with_measured_conn(
DatabaseOperation::DeleteTenant,
move |conn| -> DatabaseResult<()> {
diesel::delete(tenant_shards)
.filter(tenant_id.eq(del_tenant_id.to_string()))
.execute(conn)?;
Ok(())
})
Ok(())
},
)
.await
}
pub(crate) async fn delete_node(&self, del_node_id: NodeId) -> DatabaseResult<()> {
use crate::schema::nodes::dsl::*;
self.with_conn(move |conn| -> DatabaseResult<()> {
diesel::delete(nodes)
.filter(node_id.eq(del_node_id.0 as i64))
.execute(conn)?;
self.with_measured_conn(
DatabaseOperation::DeleteNode,
move |conn| -> DatabaseResult<()> {
diesel::delete(nodes)
.filter(node_id.eq(del_node_id.0 as i64))
.execute(conn)?;
Ok(())
})
Ok(())
},
)
.await
}
@@ -307,7 +378,7 @@ impl Persistence {
) -> DatabaseResult<HashMap<TenantShardId, Generation>> {
use crate::schema::tenant_shards::dsl::*;
let updated = self
.with_conn(move |conn| {
.with_measured_conn(DatabaseOperation::ReAttach, move |conn| {
let rows_updated = diesel::update(tenant_shards)
.filter(generation_pageserver.eq(node_id.0 as i64))
.set(generation.eq(generation + 1))
@@ -357,7 +428,7 @@ impl Persistence {
) -> anyhow::Result<Generation> {
use crate::schema::tenant_shards::dsl::*;
let updated = self
.with_conn(move |conn| {
.with_measured_conn(DatabaseOperation::IncrementGeneration, move |conn| {
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(tenant_shard_id.tenant_id.to_string()))
.filter(shard_number.eq(tenant_shard_id.shard_number.0 as i32))
@@ -401,7 +472,7 @@ impl Persistence {
) -> DatabaseResult<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| {
self.with_measured_conn(DatabaseOperation::UpdateTenantShard, move |conn| {
let query = diesel::update(tenant_shards)
.filter(tenant_id.eq(tenant_shard_id.tenant_id.to_string()))
.filter(shard_number.eq(tenant_shard_id.shard_number.0 as i32))
@@ -442,7 +513,7 @@ impl Persistence {
) -> DatabaseResult<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| {
self.with_measured_conn(DatabaseOperation::UpdateTenantConfig, move |conn| {
diesel::update(tenant_shards)
.filter(tenant_id.eq(input_tenant_id.to_string()))
.set((config.eq(serde_json::to_string(&input_config).unwrap()),))
@@ -457,7 +528,7 @@ impl Persistence {
pub(crate) async fn detach(&self, tenant_shard_id: TenantShardId) -> anyhow::Result<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| {
self.with_measured_conn(DatabaseOperation::Detach, move |conn| {
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(tenant_shard_id.tenant_id.to_string()))
.filter(shard_number.eq(tenant_shard_id.shard_number.0 as i32))
@@ -487,7 +558,7 @@ impl Persistence {
parent_to_children: Vec<(TenantShardId, Vec<TenantShardPersistence>)>,
) -> DatabaseResult<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| -> DatabaseResult<()> {
self.with_measured_conn(DatabaseOperation::BeginShardSplit, move |conn| -> DatabaseResult<()> {
conn.transaction(|conn| -> DatabaseResult<()> {
// Mark parent shards as splitting
@@ -551,26 +622,78 @@ impl Persistence {
old_shard_count: ShardCount,
) -> DatabaseResult<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| -> DatabaseResult<()> {
conn.transaction(|conn| -> QueryResult<()> {
// Drop parent shards
diesel::delete(tenant_shards)
.filter(tenant_id.eq(split_tenant_id.to_string()))
.filter(shard_count.eq(old_shard_count.literal() as i32))
.execute(conn)?;
self.with_measured_conn(
DatabaseOperation::CompleteShardSplit,
move |conn| -> DatabaseResult<()> {
conn.transaction(|conn| -> QueryResult<()> {
// Drop parent shards
diesel::delete(tenant_shards)
.filter(tenant_id.eq(split_tenant_id.to_string()))
.filter(shard_count.eq(old_shard_count.literal() as i32))
.execute(conn)?;
// Clear sharding flag
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(split_tenant_id.to_string()))
.set((splitting.eq(0),))
.execute(conn)?;
debug_assert!(updated > 0);
// Clear sharding flag
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(split_tenant_id.to_string()))
.set((splitting.eq(0),))
.execute(conn)?;
debug_assert!(updated > 0);
Ok(())
})?;
Ok(())
})?;
},
)
.await
}
Ok(())
})
/// Used when the remote part of a shard split failed: we will revert the database state to have only
/// the parent shards, with SplitState::Idle.
pub(crate) async fn abort_shard_split(
&self,
split_tenant_id: TenantId,
new_shard_count: ShardCount,
) -> DatabaseResult<AbortShardSplitStatus> {
use crate::schema::tenant_shards::dsl::*;
self.with_measured_conn(
DatabaseOperation::AbortShardSplit,
move |conn| -> DatabaseResult<AbortShardSplitStatus> {
let aborted =
conn.transaction(|conn| -> DatabaseResult<AbortShardSplitStatus> {
// Clear the splitting state on parent shards
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(split_tenant_id.to_string()))
.filter(shard_count.ne(new_shard_count.literal() as i32))
.set((splitting.eq(0),))
.execute(conn)?;
// Parent shards are already gone: we cannot abort.
if updated == 0 {
return Ok(AbortShardSplitStatus::Complete);
}
// Sanity check: if parent shards were present, their cardinality should
// be less than the number of child shards.
if updated >= new_shard_count.count() as usize {
return Err(DatabaseError::Logical(format!(
"Unexpected parent shard count {updated} while aborting split to \
count {new_shard_count:?} on tenant {split_tenant_id}"
)));
}
// Erase child shards
diesel::delete(tenant_shards)
.filter(tenant_id.eq(split_tenant_id.to_string()))
.filter(shard_count.eq(new_shard_count.literal() as i32))
.execute(conn)?;
Ok(AbortShardSplitStatus::Aborted)
})?;
Ok(aborted)
},
)
.await
}
}
@@ -607,6 +730,28 @@ pub(crate) struct TenantShardPersistence {
pub(crate) config: String,
}
impl TenantShardPersistence {
pub(crate) fn get_shard_identity(&self) -> Result<ShardIdentity, ShardConfigError> {
if self.shard_count == 0 {
Ok(ShardIdentity::unsharded())
} else {
Ok(ShardIdentity::new(
ShardNumber(self.shard_number as u8),
ShardCount::new(self.shard_count as u8),
ShardStripeSize(self.shard_stripe_size as u32),
)?)
}
}
pub(crate) fn get_tenant_shard_id(&self) -> Result<TenantShardId, hex::FromHexError> {
Ok(TenantShardId {
tenant_id: TenantId::from_str(self.tenant_id.as_str())?,
shard_number: ShardNumber(self.shard_number as u8),
shard_count: ShardCount::new(self.shard_count as u8),
})
}
}
/// Parts of [`crate::node::Node`] that are stored durably
#[derive(Serialize, Deserialize, Queryable, Selectable, Insertable, Eq, PartialEq)]
#[diesel(table_name = crate::schema::nodes)]

View File

@@ -1,6 +1,7 @@
use crate::pageserver_client::PageserverClient;
use crate::persistence::Persistence;
use crate::service;
use pageserver_api::controller_api::NodeAvailability;
use hyper::StatusCode;
use pageserver_api::models::{
LocationConfig, LocationConfigMode, LocationConfigSecondary, TenantConfig,
};
@@ -8,7 +9,7 @@ use pageserver_api::shard::{ShardIdentity, TenantShardId};
use pageserver_client::mgmt_api;
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;
use std::time::{Duration, Instant};
use tokio_util::sync::CancellationToken;
use utils::generation::Generation;
use utils::id::{NodeId, TimelineId};
@@ -19,6 +20,8 @@ use crate::compute_hook::{ComputeHook, NotifyError};
use crate::node::Node;
use crate::tenant_state::{IntentState, ObservedState, ObservedStateLocation};
const DEFAULT_HEATMAP_PERIOD: &str = "60s";
/// Object with the lifetime of the background reconcile task that is created
/// for tenants which have a difference between their intent and observed states.
pub(super) struct Reconciler {
@@ -28,15 +31,16 @@ pub(super) struct Reconciler {
pub(crate) shard: ShardIdentity,
pub(crate) generation: Option<Generation>,
pub(crate) intent: TargetState,
/// Nodes not referenced by [`Self::intent`], from which we should try
/// to detach this tenant shard.
pub(crate) detach: Vec<Node>,
pub(crate) config: TenantConfig,
pub(crate) observed: ObservedState,
pub(crate) service_config: service::Config,
/// A snapshot of the pageservers as they were when we were asked
/// to reconcile.
pub(crate) pageservers: Arc<HashMap<NodeId, Node>>,
/// A hook to notify the running postgres instances when we change the location
/// of a tenant. Use this via [`Self::compute_notify`] to update our failure flag
/// and guarantee eventual retries.
@@ -67,29 +71,37 @@ pub(super) struct Reconciler {
/// and the TargetState is just the instruction for a particular Reconciler run.
#[derive(Debug)]
pub(crate) struct TargetState {
pub(crate) attached: Option<NodeId>,
pub(crate) secondary: Vec<NodeId>,
pub(crate) attached: Option<Node>,
pub(crate) secondary: Vec<Node>,
}
impl TargetState {
pub(crate) fn from_intent(intent: &IntentState) -> Self {
pub(crate) fn from_intent(nodes: &HashMap<NodeId, Node>, intent: &IntentState) -> Self {
Self {
attached: *intent.get_attached(),
secondary: intent.get_secondary().clone(),
attached: intent.get_attached().map(|n| {
nodes
.get(&n)
.expect("Intent attached referenced non-existent node")
.clone()
}),
secondary: intent
.get_secondary()
.iter()
.map(|n| {
nodes
.get(n)
.expect("Intent secondary referenced non-existent node")
.clone()
})
.collect(),
}
}
fn all_pageservers(&self) -> Vec<NodeId> {
let mut result = self.secondary.clone();
if let Some(node_id) = &self.attached {
result.push(*node_id);
}
result
}
}
#[derive(thiserror::Error, Debug)]
pub(crate) enum ReconcileError {
#[error(transparent)]
Remote(#[from] mgmt_api::Error),
#[error(transparent)]
Notify(#[from] NotifyError),
#[error("Cancelled")]
@@ -101,44 +113,99 @@ pub(crate) enum ReconcileError {
impl Reconciler {
async fn location_config(
&mut self,
node_id: NodeId,
node: &Node,
config: LocationConfig,
flush_ms: Option<Duration>,
) -> anyhow::Result<()> {
let node = self
.pageservers
.get(&node_id)
.expect("Pageserver may not be removed while referenced");
lazy: bool,
) -> Result<(), ReconcileError> {
if !node.is_available() && config.mode == LocationConfigMode::Detached {
// Attempts to detach from offline nodes may be imitated without doing I/O: a node which is offline
// will get fully reconciled wrt the shard's intent state when it is reactivated, irrespective of
// what we put into `observed`, in [`crate::service::Service::node_activate_reconcile`]
tracing::info!("Node {node} is unavailable during detach: proceeding anyway, it will be detached on next activation");
self.observed.locations.remove(&node.get_id());
return Ok(());
}
self.observed
.locations
.insert(node.id, ObservedStateLocation { conf: None });
.insert(node.get_id(), ObservedStateLocation { conf: None });
tracing::info!("location_config({}) calling: {:?}", node_id, config);
let client =
mgmt_api::Client::new(node.base_url(), self.service_config.jwt_token.as_deref());
client
.location_config(self.tenant_shard_id, config.clone(), flush_ms)
.await?;
tracing::info!("location_config({}) complete: {:?}", node_id, config);
// TODO: amend locations that use long-polling: they will hit this timeout.
let timeout = Duration::from_secs(25);
self.observed
.locations
.insert(node.id, ObservedStateLocation { conf: Some(config) });
tracing::info!("location_config({node}) calling: {:?}", config);
let tenant_shard_id = self.tenant_shard_id;
let config_ref = &config;
match node
.with_client_retries(
|client| async move {
let config = config_ref.clone();
client
.location_config(tenant_shard_id, config.clone(), flush_ms, lazy)
.await
},
&self.service_config.jwt_token,
1,
3,
timeout,
&self.cancel,
)
.await
{
Some(Ok(_)) => {}
Some(Err(e)) => return Err(e.into()),
None => return Err(ReconcileError::Cancel),
};
tracing::info!("location_config({node}) complete: {:?}", config);
match config.mode {
LocationConfigMode::Detached => {
self.observed.locations.remove(&node.get_id());
}
_ => {
self.observed
.locations
.insert(node.get_id(), ObservedStateLocation { conf: Some(config) });
}
}
Ok(())
}
fn get_node(&self, node_id: &NodeId) -> Option<&Node> {
if let Some(node) = self.intent.attached.as_ref() {
if node.get_id() == *node_id {
return Some(node);
}
}
if let Some(node) = self
.intent
.secondary
.iter()
.find(|n| n.get_id() == *node_id)
{
return Some(node);
}
if let Some(node) = self.detach.iter().find(|n| n.get_id() == *node_id) {
return Some(node);
}
None
}
async fn maybe_live_migrate(&mut self) -> Result<(), ReconcileError> {
let destination = if let Some(node_id) = self.intent.attached {
match self.observed.locations.get(&node_id) {
let destination = if let Some(node) = &self.intent.attached {
match self.observed.locations.get(&node.get_id()) {
Some(conf) => {
// We will do a live migration only if the intended destination is not
// currently in an attached state.
match &conf.conf {
Some(conf) if conf.mode == LocationConfigMode::Secondary => {
// Fall through to do a live migration
node_id
node
}
None | Some(_) => {
// Attached or uncertain: don't do a live migration, proceed
@@ -151,7 +218,7 @@ impl Reconciler {
None => {
// Our destination is not attached: maybe live migrate if some other
// node is currently attached. Fall through.
node_id
node
}
}
} else {
@@ -164,15 +231,13 @@ impl Reconciler {
for (node_id, state) in &self.observed.locations {
if let Some(observed_conf) = &state.conf {
if observed_conf.mode == LocationConfigMode::AttachedSingle {
let node = self
.pageservers
.get(node_id)
.expect("Nodes may not be removed while referenced");
// We will only attempt live migration if the origin is not offline: this
// avoids trying to do it while reconciling after responding to an HA failover.
if !matches!(node.availability, NodeAvailability::Offline) {
origin = Some(*node_id);
break;
if let Some(node) = self.get_node(node_id) {
if node.is_available() {
origin = Some(node.clone());
break;
}
}
}
}
@@ -185,7 +250,7 @@ impl Reconciler {
// We have an origin and a destination: proceed to do the live migration
tracing::info!("Live migrating {}->{}", origin, destination);
self.live_migrate(origin, destination).await?;
self.live_migrate(origin, destination.clone()).await?;
Ok(())
}
@@ -193,15 +258,13 @@ impl Reconciler {
async fn get_lsns(
&self,
tenant_shard_id: TenantShardId,
node_id: &NodeId,
node: &Node,
) -> anyhow::Result<HashMap<TimelineId, Lsn>> {
let node = self
.pageservers
.get(node_id)
.expect("Pageserver may not be removed while referenced");
let client =
mgmt_api::Client::new(node.base_url(), self.service_config.jwt_token.as_deref());
let client = PageserverClient::new(
node.get_id(),
node.base_url(),
self.service_config.jwt_token.as_deref(),
);
let timelines = client.timeline_list(&tenant_shard_id).await?;
Ok(timelines
@@ -210,19 +273,86 @@ impl Reconciler {
.collect())
}
async fn secondary_download(&self, tenant_shard_id: TenantShardId, node_id: &NodeId) {
let node = self
.pageservers
.get(node_id)
.expect("Pageserver may not be removed while referenced");
async fn secondary_download(
&self,
tenant_shard_id: TenantShardId,
node: &Node,
) -> Result<(), ReconcileError> {
// This is not the timeout for a request, but the total amount of time we're willing to wait
// for a secondary location to get up to date before
const TOTAL_DOWNLOAD_TIMEOUT: Duration = Duration::from_secs(300);
let client =
mgmt_api::Client::new(node.base_url(), self.service_config.jwt_token.as_deref());
// This the long-polling interval for the secondary download requests we send to destination pageserver
// during a migration.
const REQUEST_DOWNLOAD_TIMEOUT: Duration = Duration::from_secs(20);
match client.tenant_secondary_download(tenant_shard_id).await {
Ok(()) => {}
Err(_) => {
tracing::info!(" (skipping, destination wasn't in secondary mode)")
let started_at = Instant::now();
loop {
let (status, progress) = match node
.with_client_retries(
|client| async move {
client
.tenant_secondary_download(
tenant_shard_id,
Some(REQUEST_DOWNLOAD_TIMEOUT),
)
.await
},
&self.service_config.jwt_token,
1,
3,
REQUEST_DOWNLOAD_TIMEOUT * 2,
&self.cancel,
)
.await
{
None => Err(ReconcileError::Cancel),
Some(Ok(v)) => Ok(v),
Some(Err(e)) => {
// Give up, but proceed: it's unfortunate if we couldn't freshen the destination before
// attaching, but we should not let an issue with a secondary location stop us proceeding
// with a live migration.
tracing::warn!("Failed to prepare by downloading layers on node {node}: {e})");
return Ok(());
}
}?;
if status == StatusCode::OK {
tracing::info!(
"Downloads to {} complete: {}/{} layers, {}/{} bytes",
node,
progress.layers_downloaded,
progress.layers_total,
progress.bytes_downloaded,
progress.bytes_total
);
return Ok(());
} else if status == StatusCode::ACCEPTED {
let total_runtime = started_at.elapsed();
if total_runtime > TOTAL_DOWNLOAD_TIMEOUT {
tracing::warn!("Timed out after {}ms downloading layers to {node}. Progress so far: {}/{} layers, {}/{} bytes",
total_runtime.as_millis(),
progress.layers_downloaded,
progress.layers_total,
progress.bytes_downloaded,
progress.bytes_total
);
// Give up, but proceed: an incompletely warmed destination doesn't prevent migration working,
// it just makes the I/O performance for users less good.
return Ok(());
}
// Log and proceed around the loop to retry. We don't sleep between requests, because our HTTP call
// to the pageserver is a long-poll.
tracing::info!(
"Downloads to {} not yet complete: {}/{} layers, {}/{} bytes",
node,
progress.layers_downloaded,
progress.layers_total,
progress.bytes_downloaded,
progress.bytes_total
);
}
}
}
@@ -230,17 +360,14 @@ impl Reconciler {
async fn await_lsn(
&self,
tenant_shard_id: TenantShardId,
pageserver_id: &NodeId,
node: &Node,
baseline: HashMap<TimelineId, Lsn>,
) -> anyhow::Result<()> {
loop {
let latest = match self.get_lsns(tenant_shard_id, pageserver_id).await {
let latest = match self.get_lsns(tenant_shard_id, node).await {
Ok(l) => l,
Err(e) => {
println!(
"🕑 Can't get LSNs on pageserver {} yet, waiting ({e})",
pageserver_id
);
tracing::info!("🕑 Can't get LSNs on node {node} yet, waiting ({e})",);
std::thread::sleep(Duration::from_millis(500));
continue;
}
@@ -250,7 +377,7 @@ impl Reconciler {
for (timeline_id, baseline_lsn) in &baseline {
match latest.get(timeline_id) {
Some(latest_lsn) => {
println!("🕑 LSN origin {baseline_lsn} vs destination {latest_lsn}");
tracing::info!("🕑 LSN origin {baseline_lsn} vs destination {latest_lsn}");
if latest_lsn < baseline_lsn {
any_behind = true;
}
@@ -265,7 +392,7 @@ impl Reconciler {
}
if !any_behind {
println!("✅ LSN caught up. Proceeding...");
tracing::info!("✅ LSN caught up. Proceeding...");
break;
} else {
std::thread::sleep(Duration::from_millis(500));
@@ -277,11 +404,11 @@ impl Reconciler {
pub async fn live_migrate(
&mut self,
origin_ps_id: NodeId,
dest_ps_id: NodeId,
) -> anyhow::Result<()> {
origin_ps: Node,
dest_ps: Node,
) -> Result<(), ReconcileError> {
// `maybe_live_migrate` is responsibble for sanity of inputs
assert!(origin_ps_id != dest_ps_id);
assert!(origin_ps.get_id() != dest_ps.get_id());
fn build_location_config(
shard: &ShardIdentity,
@@ -301,10 +428,7 @@ impl Reconciler {
}
}
tracing::info!(
"🔁 Switching origin pageserver {} to stale mode",
origin_ps_id
);
tracing::info!("🔁 Switching origin node {origin_ps} to stale mode",);
// FIXME: it is incorrect to use self.generation here, we should use the generation
// from the ObservedState of the origin pageserver (it might be older than self.generation)
@@ -315,21 +439,18 @@ impl Reconciler {
self.generation,
None,
);
self.location_config(origin_ps_id, stale_conf, Some(Duration::from_secs(10)))
self.location_config(&origin_ps, stale_conf, Some(Duration::from_secs(10)), false)
.await?;
let baseline_lsns = Some(self.get_lsns(self.tenant_shard_id, &origin_ps_id).await?);
let baseline_lsns = Some(self.get_lsns(self.tenant_shard_id, &origin_ps).await?);
// If we are migrating to a destination that has a secondary location, warm it up first
if let Some(destination_conf) = self.observed.locations.get(&dest_ps_id) {
if let Some(destination_conf) = self.observed.locations.get(&dest_ps.get_id()) {
if let Some(destination_conf) = &destination_conf.conf {
if destination_conf.mode == LocationConfigMode::Secondary {
tracing::info!(
"🔁 Downloading latest layers to destination pageserver {}",
dest_ps_id,
);
self.secondary_download(self.tenant_shard_id, &dest_ps_id)
.await;
tracing::info!("🔁 Downloading latest layers to destination node {dest_ps}",);
self.secondary_download(self.tenant_shard_id, &dest_ps)
.await?;
}
}
}
@@ -337,7 +458,7 @@ impl Reconciler {
// Increment generation before attaching to new pageserver
self.generation = Some(
self.persistence
.increment_generation(self.tenant_shard_id, dest_ps_id)
.increment_generation(self.tenant_shard_id, dest_ps.get_id())
.await?,
);
@@ -349,22 +470,23 @@ impl Reconciler {
None,
);
tracing::info!("🔁 Attaching to pageserver {}", dest_ps_id);
self.location_config(dest_ps_id, dest_conf, None).await?;
tracing::info!("🔁 Attaching to pageserver {dest_ps}");
self.location_config(&dest_ps, dest_conf, None, false)
.await?;
if let Some(baseline) = baseline_lsns {
tracing::info!("🕑 Waiting for LSN to catch up...");
self.await_lsn(self.tenant_shard_id, &dest_ps_id, baseline)
self.await_lsn(self.tenant_shard_id, &dest_ps, baseline)
.await?;
}
tracing::info!("🔁 Notifying compute to use pageserver {}", dest_ps_id);
tracing::info!("🔁 Notifying compute to use pageserver {dest_ps}");
// During a live migration it is unhelpful to proceed if we couldn't notify compute: if we detach
// the origin without notifying compute, we will render the tenant unavailable.
while let Err(e) = self.compute_notify().await {
match e {
NotifyError::Fatal(_) => return Err(anyhow::anyhow!(e)),
NotifyError::Fatal(_) => return Err(ReconcileError::Notify(e)),
_ => {
tracing::warn!(
"Live migration blocked by compute notification error, retrying: {e}"
@@ -373,7 +495,7 @@ impl Reconciler {
}
}
// Downgrade the origin to secondary. If the tenant's policy is PlacementPolicy::Single, then
// Downgrade the origin to secondary. If the tenant's policy is PlacementPolicy::Attached(0), then
// this location will be deleted in the general case reconciliation that runs after this.
let origin_secondary_conf = build_location_config(
&self.shard,
@@ -382,22 +504,19 @@ impl Reconciler {
None,
Some(LocationConfigSecondary { warm: true }),
);
self.location_config(origin_ps_id, origin_secondary_conf.clone(), None)
self.location_config(&origin_ps, origin_secondary_conf.clone(), None, false)
.await?;
// TODO: we should also be setting the ObservedState on earlier API calls, in case we fail
// partway through. In fact, all location conf API calls should be in a wrapper that sets
// the observed state to None, then runs, then sets it to what we wrote.
self.observed.locations.insert(
origin_ps_id,
origin_ps.get_id(),
ObservedStateLocation {
conf: Some(origin_secondary_conf),
},
);
println!(
"🔁 Switching to AttachedSingle mode on pageserver {}",
dest_ps_id
);
tracing::info!("🔁 Switching to AttachedSingle mode on node {dest_ps}",);
let dest_final_conf = build_location_config(
&self.shard,
&self.config,
@@ -405,16 +524,73 @@ impl Reconciler {
self.generation,
None,
);
self.location_config(dest_ps_id, dest_final_conf.clone(), None)
self.location_config(&dest_ps, dest_final_conf.clone(), None, false)
.await?;
self.observed.locations.insert(
dest_ps_id,
dest_ps.get_id(),
ObservedStateLocation {
conf: Some(dest_final_conf),
},
);
println!("✅ Migration complete");
tracing::info!("✅ Migration complete");
Ok(())
}
async fn maybe_refresh_observed(&mut self) -> Result<(), ReconcileError> {
// If the attached node has uncertain state, read it from the pageserver before proceeding: this
// is important to avoid spurious generation increments.
//
// We don't need to do this for secondary/detach locations because it's harmless to just PUT their
// location conf, whereas for attached locations it can interrupt clients if we spuriously destroy/recreate
// the `Timeline` object in the pageserver.
let Some(attached_node) = self.intent.attached.as_ref() else {
// Nothing to do
return Ok(());
};
if matches!(
self.observed.locations.get(&attached_node.get_id()),
Some(ObservedStateLocation { conf: None })
) {
let tenant_shard_id = self.tenant_shard_id;
let observed_conf = match attached_node
.with_client_retries(
|client| async move { client.get_location_config(tenant_shard_id).await },
&self.service_config.jwt_token,
1,
1,
Duration::from_secs(5),
&self.cancel,
)
.await
{
Some(Ok(observed)) => Some(observed),
Some(Err(mgmt_api::Error::ApiError(status, _msg)))
if status == StatusCode::NOT_FOUND =>
{
None
}
Some(Err(e)) => return Err(e.into()),
None => return Err(ReconcileError::Cancel),
};
tracing::info!("Scanned location configuration on {attached_node}: {observed_conf:?}");
match observed_conf {
Some(conf) => {
// Pageserver returned a state: update it in observed. This may still be an indeterminate (None) state,
// if internally the pageserver's TenantSlot was being mutated (e.g. some long running API call is still running)
self.observed
.locations
.insert(attached_node.get_id(), ObservedStateLocation { conf });
}
None => {
// Pageserver returned 404: we have confirmation that there is no state for this shard on that pageserver.
self.observed.locations.remove(&attached_node.get_id());
}
}
}
Ok(())
}
@@ -426,14 +602,14 @@ impl Reconciler {
/// general case reconciliation where we walk through the intent by pageserver
/// and call out to the pageserver to apply the desired state.
pub(crate) async fn reconcile(&mut self) -> Result<(), ReconcileError> {
// TODO: if any of self.observed is None, call to remote pageservers
// to learn correct state.
// Prepare: if we have uncertain `observed` state for our would-be attachement location, then refresh it
self.maybe_refresh_observed().await?;
// Special case: live migration
self.maybe_live_migrate().await?;
// If the attached pageserver is not attached, do so now.
if let Some(node_id) = self.intent.attached {
if let Some(node) = self.intent.attached.as_ref() {
// If we are in an attached policy, then generation must have been set (null generations
// are only present when a tenant is initially loaded with a secondary policy)
debug_assert!(self.generation.is_some());
@@ -443,11 +619,16 @@ impl Reconciler {
)));
};
let mut wanted_conf = attached_location_conf(generation, &self.shard, &self.config);
match self.observed.locations.get(&node_id) {
let mut wanted_conf = attached_location_conf(
generation,
&self.shard,
&self.config,
!self.intent.secondary.is_empty(),
);
match self.observed.locations.get(&node.get_id()) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {
// Nothing to do
tracing::info!(%node_id, "Observed configuration already correct.")
tracing::info!(node_id=%node.get_id(), "Observed configuration already correct.")
}
observed => {
// In all cases other than a matching observed configuration, we will
@@ -485,13 +666,21 @@ impl Reconciler {
if increment_generation {
let generation = self
.persistence
.increment_generation(self.tenant_shard_id, node_id)
.increment_generation(self.tenant_shard_id, node.get_id())
.await?;
self.generation = Some(generation);
wanted_conf.generation = generation.into();
}
tracing::info!(%node_id, "Observed configuration requires update.");
self.location_config(node_id, wanted_conf, None).await?;
tracing::info!(node_id=%node.get_id(), "Observed configuration requires update.");
// Because `node` comes from a ref to &self, clone it before calling into a &mut self
// function: this could be avoided by refactoring the state mutated by location_config into
// a separate type to Self.
let node = node.clone();
// Use lazy=true, because we may run many of Self concurrently, and do not want to
// overload the pageserver with logical size calculations.
self.location_config(&node, wanted_conf, None, true).await?;
self.compute_notify().await?;
}
}
@@ -500,33 +689,27 @@ impl Reconciler {
// Configure secondary locations: if these were previously attached this
// implicitly downgrades them from attached to secondary.
let mut changes = Vec::new();
for node_id in &self.intent.secondary {
for node in &self.intent.secondary {
let wanted_conf = secondary_location_conf(&self.shard, &self.config);
match self.observed.locations.get(node_id) {
match self.observed.locations.get(&node.get_id()) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {
// Nothing to do
tracing::info!(%node_id, "Observed configuration already correct.")
tracing::info!(node_id=%node.get_id(), "Observed configuration already correct.")
}
_ => {
// In all cases other than a matching observed configuration, we will
// reconcile this location.
tracing::info!(%node_id, "Observed configuration requires update.");
changes.push((*node_id, wanted_conf))
tracing::info!(node_id=%node.get_id(), "Observed configuration requires update.");
changes.push((node.clone(), wanted_conf))
}
}
}
// Detach any extraneous pageservers that are no longer referenced
// by our intent.
let all_pageservers = self.intent.all_pageservers();
for node_id in self.observed.locations.keys() {
if all_pageservers.contains(node_id) {
// We are only detaching pageservers that aren't used at all.
continue;
}
for node in &self.detach {
changes.push((
*node_id,
node.clone(),
LocationConfig {
mode: LocationConfigMode::Detached,
generation: None,
@@ -539,11 +722,11 @@ impl Reconciler {
));
}
for (node_id, conf) in changes {
for (node, conf) in changes {
if self.cancel.is_cancelled() {
return Err(ReconcileError::Cancel);
}
self.location_config(node_id, conf, None).await?;
self.location_config(&node, conf, None, false).await?;
}
Ok(())
@@ -552,16 +735,21 @@ impl Reconciler {
pub(crate) async fn compute_notify(&mut self) -> Result<(), NotifyError> {
// Whenever a particular Reconciler emits a notification, it is always notifying for the intended
// destination.
if let Some(node_id) = self.intent.attached {
if let Some(node) = &self.intent.attached {
let result = self
.compute_hook
.notify(self.tenant_shard_id, node_id, &self.cancel)
.notify(
self.tenant_shard_id,
node.get_id(),
self.shard.stripe_size,
&self.cancel,
)
.await;
if let Err(e) = &result {
// It is up to the caller whether they want to drop out on this error, but they don't have to:
// in general we should avoid letting unavailability of the cloud control plane stop us from
// making progress.
tracing::warn!("Failed to notify compute of attached pageserver {node_id}: {e}");
tracing::warn!("Failed to notify compute of attached pageserver {node}: {e}");
// Set this flag so that in our ReconcileResult we will set the flag on the shard that it
// needs to retry at some point.
self.compute_notify_failure = true;
@@ -573,10 +761,26 @@ impl Reconciler {
}
}
/// We tweak the externally-set TenantConfig while configuring
/// locations, using our awareness of whether secondary locations
/// are in use to automatically enable/disable heatmap uploads.
fn ha_aware_config(config: &TenantConfig, has_secondaries: bool) -> TenantConfig {
let mut config = config.clone();
if has_secondaries {
if config.heatmap_period.is_none() {
config.heatmap_period = Some(DEFAULT_HEATMAP_PERIOD.to_string());
}
} else {
config.heatmap_period = None;
}
config
}
pub(crate) fn attached_location_conf(
generation: Generation,
shard: &ShardIdentity,
config: &TenantConfig,
has_secondaries: bool,
) -> LocationConfig {
LocationConfig {
mode: LocationConfigMode::AttachedSingle,
@@ -585,7 +789,7 @@ pub(crate) fn attached_location_conf(
shard_number: shard.number.0,
shard_count: shard.count.literal(),
shard_stripe_size: shard.stripe_size.0,
tenant_conf: config.clone(),
tenant_conf: ha_aware_config(config, has_secondaries),
}
}
@@ -600,6 +804,6 @@ pub(crate) fn secondary_location_conf(
shard_number: shard.number.0,
shard_count: shard.count.literal(),
shard_stripe_size: shard.stripe_size.0,
tenant_conf: config.clone(),
tenant_conf: ha_aware_config(config, true),
}
}

View File

@@ -1,4 +1,5 @@
use crate::{node::Node, tenant_state::TenantState};
use pageserver_api::controller_api::UtilizationScore;
use serde::Serialize;
use std::collections::HashMap;
use utils::{http::error::ApiError, id::NodeId};
@@ -19,15 +20,34 @@ impl From<ScheduleError> for ApiError {
}
#[derive(Serialize, Eq, PartialEq)]
pub enum MaySchedule {
Yes(UtilizationScore),
No,
}
#[derive(Serialize)]
struct SchedulerNode {
/// How many shards are currently scheduled on this node, via their [`crate::tenant_state::IntentState`].
shard_count: usize,
/// Whether this node is currently elegible to have new shards scheduled (this is derived
/// from a node's availability state and scheduling policy).
may_schedule: bool,
may_schedule: MaySchedule,
}
impl PartialEq for SchedulerNode {
fn eq(&self, other: &Self) -> bool {
let may_schedule_matches = matches!(
(&self.may_schedule, &other.may_schedule),
(MaySchedule::Yes(_), MaySchedule::Yes(_)) | (MaySchedule::No, MaySchedule::No)
);
may_schedule_matches && self.shard_count == other.shard_count
}
}
impl Eq for SchedulerNode {}
/// This type is responsible for selecting which node is used when a tenant shard needs to choose a pageserver
/// on which to run.
///
@@ -43,7 +63,7 @@ impl Scheduler {
let mut scheduler_nodes = HashMap::new();
for node in nodes {
scheduler_nodes.insert(
node.id,
node.get_id(),
SchedulerNode {
shard_count: 0,
may_schedule: node.may_schedule(),
@@ -68,7 +88,7 @@ impl Scheduler {
let mut expect_nodes: HashMap<NodeId, SchedulerNode> = HashMap::new();
for node in nodes {
expect_nodes.insert(
node.id,
node.get_id(),
SchedulerNode {
shard_count: 0,
may_schedule: node.may_schedule(),
@@ -156,7 +176,7 @@ impl Scheduler {
pub(crate) fn node_upsert(&mut self, node: &Node) {
use std::collections::hash_map::Entry::*;
match self.nodes.entry(node.id) {
match self.nodes.entry(node.get_id()) {
Occupied(mut entry) => {
entry.get_mut().may_schedule = node.may_schedule();
}
@@ -186,13 +206,15 @@ impl Scheduler {
return None;
}
// TODO: When the utilization score returned by the pageserver becomes meaningful,
// schedule based on that instead of the shard count.
let node = nodes
.iter()
.map(|node_id| {
let may_schedule = self
.nodes
.get(node_id)
.map(|n| n.may_schedule)
.map(|n| n.may_schedule != MaySchedule::No)
.unwrap_or(false);
(*node_id, may_schedule)
})
@@ -211,7 +233,7 @@ impl Scheduler {
.nodes
.iter()
.filter_map(|(k, v)| {
if hard_exclude.contains(k) || !v.may_schedule {
if hard_exclude.contains(k) || v.may_schedule == MaySchedule::No {
None
} else {
Some((*k, v.shard_count))
@@ -230,7 +252,7 @@ impl Scheduler {
for (node_id, node) in &self.nodes {
tracing::info!(
"Node {node_id}: may_schedule={} shards={}",
node.may_schedule,
node.may_schedule != MaySchedule::No,
node.shard_count
);
}
@@ -255,7 +277,7 @@ impl Scheduler {
pub(crate) mod test_utils {
use crate::node::Node;
use pageserver_api::controller_api::{NodeAvailability, NodeSchedulingPolicy};
use pageserver_api::controller_api::{NodeAvailability, UtilizationScore};
use std::collections::HashMap;
use utils::id::NodeId;
/// Test helper: synthesize the requested number of nodes, all in active state.
@@ -264,18 +286,18 @@ pub(crate) mod test_utils {
pub(crate) fn make_test_nodes(n: u64) -> HashMap<NodeId, Node> {
(1..n + 1)
.map(|i| {
(
NodeId(i),
Node {
id: NodeId(i),
availability: NodeAvailability::Active,
scheduling: NodeSchedulingPolicy::Active,
listen_http_addr: format!("httphost-{i}"),
listen_http_port: 80 + i as u16,
listen_pg_addr: format!("pghost-{i}"),
listen_pg_port: 5432 + i as u16,
},
)
(NodeId(i), {
let mut node = Node::new(
NodeId(i),
format!("httphost-{i}"),
80 + i as u16,
format!("pghost-{i}"),
5432 + i as u16,
);
node.set_availability(NodeAvailability::Active(UtilizationScore::worst()));
assert!(node.is_available());
node
})
})
.collect()
}

File diff suppressed because it is too large Load Diff

View File

@@ -1,7 +1,14 @@
use std::{collections::HashMap, sync::Arc, time::Duration};
use std::{
collections::{HashMap, HashSet},
sync::Arc,
time::Duration,
};
use crate::{metrics, persistence::TenantShardPersistence};
use pageserver_api::controller_api::NodeAvailability;
use crate::{
metrics::{self, ReconcileCompleteLabelGroup, ReconcileOutcome},
persistence::TenantShardPersistence,
};
use pageserver_api::controller_api::PlacementPolicy;
use pageserver_api::{
models::{LocationConfig, LocationConfigMode, TenantConfig},
shard::{ShardIdentity, TenantShardId},
@@ -25,7 +32,7 @@ use crate::{
attached_location_conf, secondary_location_conf, ReconcileError, Reconciler, TargetState,
},
scheduler::{ScheduleError, Scheduler},
service, PlacementPolicy, Sequence,
service, Sequence,
};
/// Serialization helper
@@ -370,7 +377,7 @@ impl TenantState {
/// [`ObservedState`], even if it violates my [`PlacementPolicy`]. Call [`Self::schedule`] next,
/// to get an intent state that complies with placement policy. The overall goal is to do scheduling
/// in a way that makes use of any configured locations that already exist in the outside world.
pub(crate) fn intent_from_observed(&mut self) {
pub(crate) fn intent_from_observed(&mut self, scheduler: &mut Scheduler) {
// Choose an attached location by filtering observed locations, and then sorting to get the highest
// generation
let mut attached_locs = self
@@ -395,7 +402,7 @@ impl TenantState {
attached_locs.sort_by_key(|i| i.1);
if let Some((node_id, _gen)) = attached_locs.into_iter().last() {
self.intent.attached = Some(*node_id);
self.intent.set_attached(scheduler, Some(*node_id));
}
// All remaining observed locations generate secondary intents. This includes None
@@ -406,7 +413,7 @@ impl TenantState {
// will take care of promoting one of these secondaries to be attached.
self.observed.locations.keys().for_each(|node_id| {
if Some(*node_id) != self.intent.attached {
self.intent.secondary.push(*node_id);
self.intent.push_secondary(scheduler, *node_id);
}
});
}
@@ -453,22 +460,7 @@ impl TenantState {
// Add/remove nodes to fulfil policy
use PlacementPolicy::*;
match self.policy {
Single => {
// Should have exactly one attached, and zero secondaries
if !self.intent.secondary.is_empty() {
self.intent.clear_secondary(scheduler);
modified = true;
}
let (modified_attached, _attached_node_id) = self.schedule_attached(scheduler)?;
modified |= modified_attached;
if !self.intent.secondary.is_empty() {
self.intent.clear_secondary(scheduler);
modified = true;
}
}
Double(secondary_count) => {
Attached(secondary_count) => {
let retain_secondaries = if self.intent.attached.is_none()
&& scheduler.node_preferred(&self.intent.secondary).is_some()
{
@@ -564,18 +556,25 @@ impl TenantState {
}
}
fn dirty(&self) -> bool {
fn dirty(&self, nodes: &Arc<HashMap<NodeId, Node>>) -> bool {
let mut dirty_nodes = HashSet::new();
if let Some(node_id) = self.intent.attached {
// Maybe panic: it is a severe bug if we try to attach while generation is null.
let generation = self
.generation
.expect("Attempted to enter attached state without a generation");
let wanted_conf = attached_location_conf(generation, &self.shard, &self.config);
let wanted_conf = attached_location_conf(
generation,
&self.shard,
&self.config,
!self.intent.secondary.is_empty(),
);
match self.observed.locations.get(&node_id) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {}
Some(_) | None => {
return true;
dirty_nodes.insert(node_id);
}
}
}
@@ -585,7 +584,7 @@ impl TenantState {
match self.observed.locations.get(node_id) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {}
Some(_) | None => {
return true;
dirty_nodes.insert(*node_id);
}
}
}
@@ -593,24 +592,25 @@ impl TenantState {
for node_id in self.observed.locations.keys() {
if self.intent.attached != Some(*node_id) && !self.intent.secondary.contains(node_id) {
// We have observed state that isn't part of our intent: need to clean it up.
return true;
dirty_nodes.insert(*node_id);
}
}
// Even if there is no pageserver work to be done, if we have a pending notification to computes,
// wake up a reconciler to send it.
if self.pending_compute_notification {
return true;
}
dirty_nodes.retain(|node_id| {
nodes
.get(node_id)
.map(|n| n.is_available())
.unwrap_or(false)
});
false
!dirty_nodes.is_empty()
}
#[allow(clippy::too_many_arguments)]
#[instrument(skip_all, fields(tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug()))]
pub(crate) fn maybe_reconcile(
&mut self,
result_tx: tokio::sync::mpsc::UnboundedSender<ReconcileResult>,
result_tx: &tokio::sync::mpsc::UnboundedSender<ReconcileResult>,
pageservers: &Arc<HashMap<NodeId, Node>>,
compute_hook: &Arc<ComputeHook>,
service_config: &service::Config,
@@ -625,15 +625,20 @@ impl TenantState {
let node = pageservers
.get(node_id)
.expect("Nodes may not be removed while referenced");
if observed_loc.conf.is_none()
&& !matches!(node.availability, NodeAvailability::Offline)
{
if observed_loc.conf.is_none() && node.is_available() {
dirty_observed = true;
break;
}
}
if !self.dirty() && !dirty_observed {
let active_nodes_dirty = self.dirty(pageservers);
// Even if there is no pageserver work to be done, if we have a pending notification to computes,
// wake up a reconciler to send it.
let do_reconcile =
active_nodes_dirty || dirty_observed || self.pending_compute_notification;
if !do_reconcile {
tracing::info!("Not dirty, no reconciliation needed.");
return None;
}
@@ -663,6 +668,21 @@ impl TenantState {
}
}
// Build list of nodes from which the reconciler should detach
let mut detach = Vec::new();
for node_id in self.observed.locations.keys() {
if self.intent.get_attached() != &Some(*node_id)
&& !self.intent.secondary.contains(node_id)
{
detach.push(
pageservers
.get(node_id)
.expect("Intent references non-existent pageserver")
.clone(),
)
}
}
// Reconcile in flight for a stale sequence? Our sequence's task will wait for it before
// doing our sequence's work.
let old_handle = self.reconciler.take();
@@ -677,14 +697,15 @@ impl TenantState {
self.sequence = self.sequence.next();
let reconciler_cancel = cancel.child_token();
let reconciler_intent = TargetState::from_intent(pageservers, &self.intent);
let mut reconciler = Reconciler {
tenant_shard_id: self.tenant_shard_id,
shard: self.shard,
generation: self.generation,
intent: TargetState::from_intent(&self.intent),
intent: reconciler_intent,
detach,
config: self.config.clone(),
observed: self.observed.clone(),
pageservers: pageservers.clone(),
compute_hook: compute_hook.clone(),
service_config: service_config.clone(),
_gate_guard: gate_guard,
@@ -700,7 +721,11 @@ impl TenantState {
let reconciler_span = tracing::info_span!(parent: None, "reconciler", seq=%reconcile_seq,
tenant_id=%reconciler.tenant_shard_id.tenant_id,
shard_id=%reconciler.tenant_shard_id.shard_slug());
metrics::RECONCILER.spawned.inc();
metrics::METRICS_REGISTRY
.metrics_group
.storage_controller_reconcile_spawn
.inc();
let result_tx = result_tx.clone();
let join_handle = tokio::task::spawn(
async move {
// Wait for any previous reconcile task to complete before we start
@@ -717,10 +742,12 @@ impl TenantState {
// TODO: wrap all remote API operations in cancellation check
// as well.
if reconciler.cancel.is_cancelled() {
metrics::RECONCILER
.complete
.with_label_values(&[metrics::ReconcilerMetrics::CANCEL])
.inc();
metrics::METRICS_REGISTRY
.metrics_group
.storage_controller_reconcile_complete
.inc(ReconcileCompleteLabelGroup {
status: ReconcileOutcome::Cancel,
});
return;
}
@@ -735,18 +762,18 @@ impl TenantState {
}
// Update result counter
match &result {
Ok(_) => metrics::RECONCILER
.complete
.with_label_values(&[metrics::ReconcilerMetrics::SUCCESS]),
Err(ReconcileError::Cancel) => metrics::RECONCILER
.complete
.with_label_values(&[metrics::ReconcilerMetrics::CANCEL]),
Err(_) => metrics::RECONCILER
.complete
.with_label_values(&[metrics::ReconcilerMetrics::ERROR]),
}
.inc();
let outcome_label = match &result {
Ok(_) => ReconcileOutcome::Success,
Err(ReconcileError::Cancel) => ReconcileOutcome::Cancel,
Err(_) => ReconcileOutcome::Error,
};
metrics::METRICS_REGISTRY
.metrics_group
.storage_controller_reconcile_complete
.inc(ReconcileCompleteLabelGroup {
status: outcome_label,
});
result_tx
.send(ReconcileResult {
@@ -819,7 +846,10 @@ impl TenantState {
#[cfg(test)]
pub(crate) mod tests {
use pageserver_api::shard::{ShardCount, ShardNumber};
use pageserver_api::{
controller_api::NodeAvailability,
shard::{ShardCount, ShardNumber},
};
use utils::id::TenantId;
use crate::scheduler::test_utils::make_test_nodes;
@@ -858,7 +888,7 @@ pub(crate) mod tests {
let mut scheduler = Scheduler::new(nodes.values());
let mut tenant_state = make_test_tenant_shard(PlacementPolicy::Double(1));
let mut tenant_state = make_test_tenant_shard(PlacementPolicy::Attached(1));
tenant_state
.schedule(&mut scheduler)
.expect("we have enough nodes, scheduling should work");
@@ -878,7 +908,10 @@ pub(crate) mod tests {
assert_eq!(tenant_state.intent.secondary.len(), 2);
// Update the scheduler state to indicate the node is offline
nodes.get_mut(&attached_node_id).unwrap().availability = NodeAvailability::Offline;
nodes
.get_mut(&attached_node_id)
.unwrap()
.set_availability(NodeAvailability::Offline);
scheduler.node_upsert(nodes.get(&attached_node_id).unwrap());
// Scheduling the node should promote the still-available secondary node to attached
@@ -897,4 +930,54 @@ pub(crate) mod tests {
Ok(())
}
#[test]
fn intent_from_observed() -> anyhow::Result<()> {
let nodes = make_test_nodes(3);
let mut scheduler = Scheduler::new(nodes.values());
let mut tenant_state = make_test_tenant_shard(PlacementPolicy::Attached(1));
tenant_state.observed.locations.insert(
NodeId(3),
ObservedStateLocation {
conf: Some(LocationConfig {
mode: LocationConfigMode::AttachedMulti,
generation: Some(2),
secondary_conf: None,
shard_number: tenant_state.shard.number.0,
shard_count: tenant_state.shard.count.literal(),
shard_stripe_size: tenant_state.shard.stripe_size.0,
tenant_conf: TenantConfig::default(),
}),
},
);
tenant_state.observed.locations.insert(
NodeId(2),
ObservedStateLocation {
conf: Some(LocationConfig {
mode: LocationConfigMode::AttachedStale,
generation: Some(1),
secondary_conf: None,
shard_number: tenant_state.shard.number.0,
shard_count: tenant_state.shard.count.literal(),
shard_stripe_size: tenant_state.shard.stripe_size.0,
tenant_conf: TenantConfig::default(),
}),
},
);
tenant_state.intent_from_observed(&mut scheduler);
// The highest generationed attached location gets used as attached
assert_eq!(tenant_state.intent.attached, Some(NodeId(3)));
// Other locations get used as secondary
assert_eq!(tenant_state.intent.secondary, vec![NodeId(2)]);
scheduler.consistency_check(nodes.values(), [&tenant_state].into_iter())?;
tenant_state.intent.clear(&mut scheduler);
Ok(())
}
}

View File

@@ -294,7 +294,7 @@ where
// is in state 'taken' but the thread that would unlock it is
// not there.
// 2. A rust object that represented some external resource in the
// parent now got implicitly copied by the the fork, even though
// parent now got implicitly copied by the fork, even though
// the object's type is not `Copy`. The parent program may use
// non-copyability as way to enforce unique ownership of an
// external resource in the typesystem. The fork breaks that

View File

@@ -8,14 +8,14 @@
use anyhow::{anyhow, bail, Context, Result};
use clap::{value_parser, Arg, ArgAction, ArgMatches, Command, ValueEnum};
use compute_api::spec::ComputeMode;
use control_plane::attachment_service::AttachmentService;
use control_plane::endpoint::ComputeControlPlane;
use control_plane::local_env::{InitForceMode, LocalEnv};
use control_plane::pageserver::{PageServerNode, PAGESERVER_REMOTE_STORAGE_DIR};
use control_plane::safekeeper::SafekeeperNode;
use control_plane::storage_controller::StorageController;
use control_plane::{broker, local_env};
use pageserver_api::controller_api::{
NodeAvailability, NodeConfigureRequest, NodeSchedulingPolicy,
NodeAvailability, NodeConfigureRequest, NodeSchedulingPolicy, PlacementPolicy,
};
use pageserver_api::models::{
ShardParameters, TenantCreateRequest, TimelineCreateRequest, TimelineInfo,
@@ -138,7 +138,7 @@ fn main() -> Result<()> {
"start" => rt.block_on(handle_start_all(sub_args, &env)),
"stop" => rt.block_on(handle_stop_all(sub_args, &env)),
"pageserver" => rt.block_on(handle_pageserver(sub_args, &env)),
"attachment_service" => rt.block_on(handle_attachment_service(sub_args, &env)),
"storage_controller" => rt.block_on(handle_storage_controller(sub_args, &env)),
"safekeeper" => rt.block_on(handle_safekeeper(sub_args, &env)),
"endpoint" => rt.block_on(handle_endpoint(sub_args, &env)),
"mappings" => handle_mappings(sub_args, &mut env),
@@ -435,19 +435,24 @@ async fn handle_tenant(
let shard_stripe_size: Option<u32> =
create_match.get_one::<u32>("shard-stripe-size").cloned();
let placement_policy = match create_match.get_one::<String>("placement-policy") {
Some(s) if !s.is_empty() => serde_json::from_str::<PlacementPolicy>(s)?,
_ => PlacementPolicy::Attached(0),
};
let tenant_conf = PageServerNode::parse_config(tenant_conf)?;
// If tenant ID was not specified, generate one
let tenant_id = parse_tenant_id(create_match)?.unwrap_or_else(TenantId::generate);
// We must register the tenant with the attachment service, so
// We must register the tenant with the storage controller, so
// that when the pageserver restarts, it will be re-attached.
let attachment_service = AttachmentService::from_env(env);
attachment_service
let storage_controller = StorageController::from_env(env);
storage_controller
.tenant_create(TenantCreateRequest {
// Note that ::unsharded here isn't actually because the tenant is unsharded, its because the
// attachment service expecfs a shard-naive tenant_id in this attribute, and the TenantCreateRequest
// type is used both in attachment service (for creating tenants) and in pageserver (for creating shards)
// storage controller expecfs a shard-naive tenant_id in this attribute, and the TenantCreateRequest
// type is used both in storage controller (for creating tenants) and in pageserver (for creating shards)
new_tenant_id: TenantShardId::unsharded(tenant_id),
generation: None,
shard_parameters: ShardParameters {
@@ -456,6 +461,7 @@ async fn handle_tenant(
.map(ShardStripeSize)
.unwrap_or(ShardParameters::DEFAULT_STRIPE_SIZE),
},
placement_policy: Some(placement_policy),
config: tenant_conf,
})
.await?;
@@ -470,9 +476,9 @@ async fn handle_tenant(
.context("Failed to parse postgres version from the argument string")?;
// FIXME: passing None for ancestor_start_lsn is not kosher in a sharded world: we can't have
// different shards picking different start lsns. Maybe we have to teach attachment service
// different shards picking different start lsns. Maybe we have to teach storage controller
// to let shard 0 branch first and then propagate the chosen LSN to other shards.
attachment_service
storage_controller
.tenant_timeline_create(
tenant_id,
TimelineCreateRequest {
@@ -517,84 +523,6 @@ async fn handle_tenant(
.with_context(|| format!("Tenant config failed for tenant with id {tenant_id}"))?;
println!("tenant {tenant_id} successfully configured on the pageserver");
}
Some(("migrate", matches)) => {
let tenant_shard_id = get_tenant_shard_id(matches, env)?;
let new_pageserver = get_pageserver(env, matches)?;
let new_pageserver_id = new_pageserver.conf.id;
let attachment_service = AttachmentService::from_env(env);
attachment_service
.tenant_migrate(tenant_shard_id, new_pageserver_id)
.await?;
println!("tenant {tenant_shard_id} migrated to {}", new_pageserver_id);
}
Some(("status", matches)) => {
let tenant_id = get_tenant_id(matches, env)?;
let mut shard_table = comfy_table::Table::new();
shard_table.set_header(["Shard", "Pageserver", "Physical Size"]);
let mut tenant_synthetic_size = None;
let attachment_service = AttachmentService::from_env(env);
for shard in attachment_service.tenant_locate(tenant_id).await?.shards {
let pageserver =
PageServerNode::from_env(env, env.get_pageserver_conf(shard.node_id)?);
let size = pageserver
.http_client
.tenant_details(shard.shard_id)
.await?
.tenant_info
.current_physical_size
.unwrap();
shard_table.add_row([
format!("{}", shard.shard_id.shard_slug()),
format!("{}", shard.node_id.0),
format!("{} MiB", size / (1024 * 1024)),
]);
if shard.shard_id.is_zero() {
tenant_synthetic_size =
Some(pageserver.tenant_synthetic_size(shard.shard_id).await?);
}
}
let Some(synthetic_size) = tenant_synthetic_size else {
bail!("Shard 0 not found")
};
let mut tenant_table = comfy_table::Table::new();
tenant_table.add_row(["Tenant ID".to_string(), tenant_id.to_string()]);
tenant_table.add_row([
"Synthetic size".to_string(),
format!("{} MiB", synthetic_size.size.unwrap_or(0) / (1024 * 1024)),
]);
println!("{tenant_table}");
println!("{shard_table}");
}
Some(("shard-split", matches)) => {
let tenant_id = get_tenant_id(matches, env)?;
let shard_count: u8 = matches.get_one::<u8>("shard-count").cloned().unwrap_or(0);
let attachment_service = AttachmentService::from_env(env);
let result = attachment_service
.tenant_split(tenant_id, shard_count)
.await?;
println!(
"Split tenant {} into shards {}",
tenant_id,
result
.new_shards
.iter()
.map(|s| format!("{:?}", s))
.collect::<Vec<_>>()
.join(",")
);
}
Some((sub_name, _)) => bail!("Unexpected tenant subcommand '{}'", sub_name),
None => bail!("no tenant subcommand provided"),
@@ -607,7 +535,7 @@ async fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::Local
match timeline_match.subcommand() {
Some(("list", list_match)) => {
// TODO(sharding): this command shouldn't have to specify a shard ID: we should ask the attachment service
// TODO(sharding): this command shouldn't have to specify a shard ID: we should ask the storage controller
// where shard 0 is attached, and query there.
let tenant_shard_id = get_tenant_shard_id(list_match, env)?;
let timelines = pageserver.timeline_list(&tenant_shard_id).await?;
@@ -627,7 +555,7 @@ async fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::Local
let new_timeline_id_opt = parse_timeline_id(create_match)?;
let new_timeline_id = new_timeline_id_opt.unwrap_or(TimelineId::generate());
let attachment_service = AttachmentService::from_env(env);
let storage_controller = StorageController::from_env(env);
let create_req = TimelineCreateRequest {
new_timeline_id,
ancestor_timeline_id: None,
@@ -635,7 +563,7 @@ async fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::Local
ancestor_start_lsn: None,
pg_version: Some(pg_version),
};
let timeline_info = attachment_service
let timeline_info = storage_controller
.tenant_timeline_create(tenant_id, create_req)
.await?;
@@ -724,7 +652,7 @@ async fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::Local
.transpose()
.context("Failed to parse ancestor start Lsn from the request")?;
let new_timeline_id = TimelineId::generate();
let attachment_service = AttachmentService::from_env(env);
let storage_controller = StorageController::from_env(env);
let create_req = TimelineCreateRequest {
new_timeline_id,
ancestor_timeline_id: Some(ancestor_timeline_id),
@@ -732,7 +660,7 @@ async fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::Local
ancestor_start_lsn: start_lsn,
pg_version: None,
};
let timeline_info = attachment_service
let timeline_info = storage_controller
.tenant_timeline_create(tenant_id, create_req)
.await?;
@@ -761,7 +689,7 @@ async fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Re
match sub_name {
"list" => {
// TODO(sharding): this command shouldn't have to specify a shard ID: we should ask the attachment service
// TODO(sharding): this command shouldn't have to specify a shard ID: we should ask the storage controller
// where shard 0 is attached, and query there.
let tenant_shard_id = get_tenant_shard_id(sub_args, env)?;
let timeline_infos = get_timeline_infos(env, &tenant_shard_id)
@@ -946,21 +874,21 @@ async fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Re
(
vec![(parsed.0, parsed.1.unwrap_or(5432))],
// If caller is telling us what pageserver to use, this is not a tenant which is
// full managed by attachment service, therefore not sharded.
// full managed by storage controller, therefore not sharded.
ShardParameters::DEFAULT_STRIPE_SIZE,
)
} else {
// Look up the currently attached location of the tenant, and its striping metadata,
// to pass these on to postgres.
let attachment_service = AttachmentService::from_env(env);
let locate_result = attachment_service.tenant_locate(endpoint.tenant_id).await?;
let storage_controller = StorageController::from_env(env);
let locate_result = storage_controller.tenant_locate(endpoint.tenant_id).await?;
let pageservers = locate_result
.shards
.into_iter()
.map(|shard| {
(
Host::parse(&shard.listen_pg_addr)
.expect("Attachment service reported bad hostname"),
.expect("Storage controller reported bad hostname"),
shard.listen_pg_port,
)
})
@@ -1009,8 +937,8 @@ async fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Re
pageserver.pg_connection_config.port(),
)]
} else {
let attachment_service = AttachmentService::from_env(env);
attachment_service
let storage_controller = StorageController::from_env(env);
storage_controller
.tenant_locate(endpoint.tenant_id)
.await?
.shards
@@ -1018,13 +946,13 @@ async fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Re
.map(|shard| {
(
Host::parse(&shard.listen_pg_addr)
.expect("Attachment service reported malformed host"),
.expect("Storage controller reported malformed host"),
shard.listen_pg_port,
)
})
.collect::<Vec<_>>()
};
endpoint.reconfigure(pageservers).await?;
endpoint.reconfigure(pageservers, None).await?;
}
"stop" => {
let endpoint_id = sub_args
@@ -1094,9 +1022,8 @@ fn get_pageserver(env: &local_env::LocalEnv, args: &ArgMatches) -> Result<PageSe
async fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
match sub_match.subcommand() {
Some(("start", subcommand_args)) => {
let register = subcommand_args.get_one::<bool>("register").unwrap_or(&true);
if let Err(e) = get_pageserver(env, subcommand_args)?
.start(&pageserver_config_overrides(subcommand_args), *register)
.start(&pageserver_config_overrides(subcommand_args))
.await
{
eprintln!("pageserver start failed: {e}");
@@ -1125,7 +1052,7 @@ async fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) ->
}
if let Err(e) = pageserver
.start(&pageserver_config_overrides(subcommand_args), false)
.start(&pageserver_config_overrides(subcommand_args))
.await
{
eprintln!("pageserver start failed: {e}");
@@ -1138,8 +1065,8 @@ async fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) ->
let scheduling = subcommand_args.get_one("scheduling");
let availability = subcommand_args.get_one("availability");
let attachment_service = AttachmentService::from_env(env);
attachment_service
let storage_controller = StorageController::from_env(env);
storage_controller
.node_configure(NodeConfigureRequest {
node_id: pageserver.conf.id,
scheduling: scheduling.cloned(),
@@ -1164,11 +1091,11 @@ async fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) ->
Ok(())
}
async fn handle_attachment_service(
async fn handle_storage_controller(
sub_match: &ArgMatches,
env: &local_env::LocalEnv,
) -> Result<()> {
let svc = AttachmentService::from_env(env);
let svc = StorageController::from_env(env);
match sub_match.subcommand() {
Some(("start", _start_match)) => {
if let Err(e) = svc.start().await {
@@ -1188,8 +1115,8 @@ async fn handle_attachment_service(
exit(1);
}
}
Some((sub_name, _)) => bail!("Unexpected attachment_service subcommand '{}'", sub_name),
None => bail!("no attachment_service subcommand provided"),
Some((sub_name, _)) => bail!("Unexpected storage_controller subcommand '{}'", sub_name),
None => bail!("no storage_controller subcommand provided"),
}
Ok(())
}
@@ -1274,11 +1201,11 @@ async fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) ->
broker::start_broker_process(env).await?;
// Only start the attachment service if the pageserver is configured to need it
// Only start the storage controller if the pageserver is configured to need it
if env.control_plane_api.is_some() {
let attachment_service = AttachmentService::from_env(env);
if let Err(e) = attachment_service.start().await {
eprintln!("attachment_service start failed: {:#}", e);
let storage_controller = StorageController::from_env(env);
if let Err(e) = storage_controller.start().await {
eprintln!("storage_controller start failed: {:#}", e);
try_stop_all(env, true).await;
exit(1);
}
@@ -1287,7 +1214,7 @@ async fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) ->
for ps_conf in &env.pageservers {
let pageserver = PageServerNode::from_env(env, ps_conf);
if let Err(e) = pageserver
.start(&pageserver_config_overrides(sub_match), true)
.start(&pageserver_config_overrides(sub_match))
.await
{
eprintln!("pageserver {} start failed: {:#}", ps_conf.id, e);
@@ -1350,9 +1277,9 @@ async fn try_stop_all(env: &local_env::LocalEnv, immediate: bool) {
}
if env.control_plane_api.is_some() {
let attachment_service = AttachmentService::from_env(env);
if let Err(e) = attachment_service.stop(immediate).await {
eprintln!("attachment service stop failed: {e:#}");
let storage_controller = StorageController::from_env(env);
if let Err(e) = storage_controller.stop(immediate).await {
eprintln!("storage controller stop failed: {e:#}");
}
}
}
@@ -1562,24 +1489,13 @@ fn cli() -> Command {
.help("Use this tenant in future CLI commands where tenant_id is needed, but not specified"))
.arg(Arg::new("shard-count").value_parser(value_parser!(u8)).long("shard-count").action(ArgAction::Set).help("Number of shards in the new tenant (default 1)"))
.arg(Arg::new("shard-stripe-size").value_parser(value_parser!(u32)).long("shard-stripe-size").action(ArgAction::Set).help("Sharding stripe size in pages"))
.arg(Arg::new("placement-policy").value_parser(value_parser!(String)).long("placement-policy").action(ArgAction::Set).help("Placement policy shards in this tenant"))
)
.subcommand(Command::new("set-default").arg(tenant_id_arg.clone().required(true))
.about("Set a particular tenant as default in future CLI commands where tenant_id is needed, but not specified"))
.subcommand(Command::new("config")
.arg(tenant_id_arg.clone())
.arg(Arg::new("config").short('c').num_args(1).action(ArgAction::Append).required(false)))
.subcommand(Command::new("migrate")
.about("Migrate a tenant from one pageserver to another")
.arg(tenant_id_arg.clone())
.arg(pageserver_id_arg.clone()))
.subcommand(Command::new("status")
.about("Human readable summary of the tenant's shards and attachment locations")
.arg(tenant_id_arg.clone()))
.subcommand(Command::new("shard-split")
.about("Increase the number of shards in the tenant")
.arg(tenant_id_arg.clone())
.arg(Arg::new("shard-count").value_parser(value_parser!(u8)).long("shard-count").action(ArgAction::Set).help("Number of shards in the new tenant (default 1)"))
)
)
.subcommand(
Command::new("pageserver")
@@ -1589,11 +1505,7 @@ fn cli() -> Command {
.subcommand(Command::new("status"))
.subcommand(Command::new("start")
.about("Start local pageserver")
.arg(pageserver_config_args.clone()).arg(Arg::new("register")
.long("register")
.default_value("true").required(false)
.value_parser(value_parser!(bool))
.value_name("register"))
.arg(pageserver_config_args.clone())
)
.subcommand(Command::new("stop")
.about("Stop local pageserver")
@@ -1611,9 +1523,9 @@ fn cli() -> Command {
)
)
.subcommand(
Command::new("attachment_service")
Command::new("storage_controller")
.arg_required_else_help(true)
.about("Manage attachment_service")
.about("Manage storage_controller")
.subcommand(Command::new("start").about("Start local pageserver").arg(pageserver_config_args.clone()))
.subcommand(Command::new("stop").about("Stop local pageserver")
.arg(stop_mode_arg.clone()))

View File

@@ -12,7 +12,7 @@
//!
//! The endpoint is managed by the `compute_ctl` binary. When an endpoint is
//! started, we launch `compute_ctl` It synchronizes the safekeepers, downloads
//! the basebackup from the pageserver to initialize the the data directory, and
//! the basebackup from the pageserver to initialize the data directory, and
//! finally launches the PostgreSQL process. It watches the PostgreSQL process
//! until it exits.
//!
@@ -52,13 +52,14 @@ use compute_api::spec::RemoteExtSpec;
use compute_api::spec::Role;
use nix::sys::signal::kill;
use nix::sys::signal::Signal;
use pageserver_api::shard::ShardStripeSize;
use serde::{Deserialize, Serialize};
use url::Host;
use utils::id::{NodeId, TenantId, TimelineId};
use crate::attachment_service::AttachmentService;
use crate::local_env::LocalEnv;
use crate::postgresql_conf::PostgresConf;
use crate::storage_controller::StorageController;
use compute_api::responses::{ComputeState, ComputeStatus};
use compute_api::spec::{Cluster, ComputeFeature, ComputeMode, ComputeSpec};
@@ -655,7 +656,7 @@ impl Endpoint {
// Wait for it to start
let mut attempt = 0;
const ATTEMPT_INTERVAL: Duration = Duration::from_millis(100);
const MAX_ATTEMPTS: u32 = 10 * 30; // Wait up to 30 s
const MAX_ATTEMPTS: u32 = 10 * 90; // Wait up to 1.5 min
loop {
attempt += 1;
match self.get_status().await {
@@ -735,7 +736,11 @@ impl Endpoint {
}
}
pub async fn reconfigure(&self, mut pageservers: Vec<(Host, u16)>) -> Result<()> {
pub async fn reconfigure(
&self,
mut pageservers: Vec<(Host, u16)>,
stripe_size: Option<ShardStripeSize>,
) -> Result<()> {
let mut spec: ComputeSpec = {
let spec_path = self.endpoint_path().join("spec.json");
let file = std::fs::File::open(spec_path)?;
@@ -745,17 +750,17 @@ impl Endpoint {
let postgresql_conf = self.read_postgresql_conf()?;
spec.cluster.postgresql_conf = Some(postgresql_conf);
// If we weren't given explicit pageservers, query the attachment service
// If we weren't given explicit pageservers, query the storage controller
if pageservers.is_empty() {
let attachment_service = AttachmentService::from_env(&self.env);
let locate_result = attachment_service.tenant_locate(self.tenant_id).await?;
let storage_controller = StorageController::from_env(&self.env);
let locate_result = storage_controller.tenant_locate(self.tenant_id).await?;
pageservers = locate_result
.shards
.into_iter()
.map(|shard| {
(
Host::parse(&shard.listen_pg_addr)
.expect("Attachment service reported bad hostname"),
.expect("Storage controller reported bad hostname"),
shard.listen_pg_port,
)
})
@@ -765,8 +770,14 @@ impl Endpoint {
let pageserver_connstr = Self::build_pageserver_connstr(&pageservers);
assert!(!pageserver_connstr.is_empty());
spec.pageserver_connstring = Some(pageserver_connstr);
if stripe_size.is_some() {
spec.shard_stripe_size = stripe_size.map(|s| s.0 as usize);
}
let client = reqwest::Client::new();
let client = reqwest::Client::builder()
.timeout(Duration::from_secs(30))
.build()
.unwrap();
let response = client
.post(format!(
"http://{}:{}/configure",

View File

@@ -6,7 +6,6 @@
//! local installations.
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod attachment_service;
mod background_process;
pub mod broker;
pub mod endpoint;
@@ -14,3 +13,4 @@ pub mod local_env;
pub mod pageserver;
pub mod postgresql_conf;
pub mod safekeeper;
pub mod storage_controller;

View File

@@ -72,13 +72,13 @@ pub struct LocalEnv {
#[serde(default)]
pub safekeepers: Vec<SafekeeperConf>,
// Control plane upcall API for pageserver: if None, we will not run attachment_service. If set, this will
// Control plane upcall API for pageserver: if None, we will not run storage_controller If set, this will
// be propagated into each pageserver's configuration.
#[serde(default)]
pub control_plane_api: Option<Url>,
// Control plane upcall API for attachment service. If set, this will be propagated into the
// attachment service's configuration.
// Control plane upcall API for storage controller. If set, this will be propagated into the
// storage controller's configuration.
#[serde(default)]
pub control_plane_compute_hook_api: Option<Url>,
@@ -114,7 +114,7 @@ impl NeonBroker {
}
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
#[serde(default)]
#[serde(default, deny_unknown_fields)]
pub struct PageServerConf {
// node id
pub id: NodeId,
@@ -126,6 +126,9 @@ pub struct PageServerConf {
// auth type used for the PG and HTTP ports
pub pg_auth_type: AuthType,
pub http_auth_type: AuthType,
pub(crate) virtual_file_io_engine: Option<String>,
pub(crate) get_vectored_impl: Option<String>,
}
impl Default for PageServerConf {
@@ -136,6 +139,8 @@ impl Default for PageServerConf {
listen_http_addr: String::new(),
pg_auth_type: AuthType::Trust,
http_auth_type: AuthType::Trust,
virtual_file_io_engine: None,
get_vectored_impl: None,
}
}
}
@@ -227,12 +232,12 @@ impl LocalEnv {
self.neon_distrib_dir.join("pageserver")
}
pub fn attachment_service_bin(&self) -> PathBuf {
// Irrespective of configuration, attachment service binary is always
pub fn storage_controller_bin(&self) -> PathBuf {
// Irrespective of configuration, storage controller binary is always
// run from the same location as neon_local. This means that for compatibility
// tests that run old pageserver/safekeeper, they still run latest attachment service.
// tests that run old pageserver/safekeeper, they still run latest storage controller.
let neon_local_bin_dir = env::current_exe().unwrap().parent().unwrap().to_owned();
neon_local_bin_dir.join("attachment_service")
neon_local_bin_dir.join("storage_controller")
}
pub fn safekeeper_bin(&self) -> PathBuf {

View File

@@ -17,7 +17,6 @@ use std::time::Duration;
use anyhow::{bail, Context};
use camino::Utf8PathBuf;
use futures::SinkExt;
use pageserver_api::controller_api::NodeRegisterRequest;
use pageserver_api::models::{
self, LocationConfig, ShardParameters, TenantHistorySize, TenantInfo, TimelineInfo,
};
@@ -31,7 +30,6 @@ use utils::{
lsn::Lsn,
};
use crate::attachment_service::AttachmentService;
use crate::local_env::PageServerConf;
use crate::{background_process, local_env::LocalEnv};
@@ -80,18 +78,39 @@ impl PageServerNode {
///
/// These all end up on the command line of the `pageserver` binary.
fn neon_local_overrides(&self, cli_overrides: &[&str]) -> Vec<String> {
let id = format!("id={}", self.conf.id);
// FIXME: the paths should be shell-escaped to handle paths with spaces, quotas etc.
let pg_distrib_dir_param = format!(
"pg_distrib_dir='{}'",
self.env.pg_distrib_dir_raw().display()
);
let http_auth_type_param = format!("http_auth_type='{}'", self.conf.http_auth_type);
let listen_http_addr_param = format!("listen_http_addr='{}'", self.conf.listen_http_addr);
let PageServerConf {
id,
listen_pg_addr,
listen_http_addr,
pg_auth_type,
http_auth_type,
virtual_file_io_engine,
get_vectored_impl,
} = &self.conf;
let pg_auth_type_param = format!("pg_auth_type='{}'", self.conf.pg_auth_type);
let listen_pg_addr_param = format!("listen_pg_addr='{}'", self.conf.listen_pg_addr);
let id = format!("id={}", id);
let http_auth_type_param = format!("http_auth_type='{}'", http_auth_type);
let listen_http_addr_param = format!("listen_http_addr='{}'", listen_http_addr);
let pg_auth_type_param = format!("pg_auth_type='{}'", pg_auth_type);
let listen_pg_addr_param = format!("listen_pg_addr='{}'", listen_pg_addr);
let virtual_file_io_engine = if let Some(virtual_file_io_engine) = virtual_file_io_engine {
format!("virtual_file_io_engine='{virtual_file_io_engine}'")
} else {
String::new()
};
let get_vectored_impl = if let Some(get_vectored_impl) = get_vectored_impl {
format!("get_vectored_impl='{get_vectored_impl}'")
} else {
String::new()
};
let broker_endpoint_param = format!("broker_endpoint='{}'", self.env.broker.client_url());
@@ -103,6 +122,8 @@ impl PageServerNode {
listen_http_addr_param,
listen_pg_addr_param,
broker_endpoint_param,
virtual_file_io_engine,
get_vectored_impl,
];
if let Some(control_plane_api) = &self.env.control_plane_api {
@@ -111,9 +132,9 @@ impl PageServerNode {
control_plane_api.as_str()
));
// Attachment service uses the same auth as pageserver: if JWT is enabled
// Storage controller uses the same auth as pageserver: if JWT is enabled
// for us, we will also need it to talk to them.
if matches!(self.conf.http_auth_type, AuthType::NeonJWT) {
if matches!(http_auth_type, AuthType::NeonJWT) {
let jwt_token = self
.env
.generate_auth_token(&Claims::new(None, Scope::GenerationsApi))
@@ -131,8 +152,7 @@ impl PageServerNode {
));
}
if self.conf.http_auth_type != AuthType::Trust || self.conf.pg_auth_type != AuthType::Trust
{
if *http_auth_type != AuthType::Trust || *pg_auth_type != AuthType::Trust {
// Keys are generated in the toplevel repo dir, pageservers' workdirs
// are one level below that, so refer to keys with ../
overrides.push("auth_validation_public_key_path='../auth_public_key.pem'".to_owned());
@@ -163,8 +183,8 @@ impl PageServerNode {
.expect("non-Unicode path")
}
pub async fn start(&self, config_overrides: &[&str], register: bool) -> anyhow::Result<()> {
self.start_node(config_overrides, false, register).await
pub async fn start(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
self.start_node(config_overrides, false).await
}
fn pageserver_init(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
@@ -202,6 +222,28 @@ impl PageServerNode {
String::from_utf8_lossy(&init_output.stderr),
);
// Write metadata file, used by pageserver on startup to register itself with
// the storage controller
let metadata_path = datadir.join("metadata.json");
let (_http_host, http_port) =
parse_host_port(&self.conf.listen_http_addr).expect("Unable to parse listen_http_addr");
let http_port = http_port.unwrap_or(9898);
// Intentionally hand-craft JSON: this acts as an implicit format compat test
// in case the pageserver-side structure is edited, and reflects the real life
// situation: the metadata is written by some other script.
std::fs::write(
metadata_path,
serde_json::to_vec(&serde_json::json!({
"host": "localhost",
"port": self.pg_connection_config.port(),
"http_host": "localhost",
"http_port": http_port,
}))
.unwrap(),
)
.expect("Failed to write metadata file");
Ok(())
}
@@ -209,27 +251,7 @@ impl PageServerNode {
&self,
config_overrides: &[&str],
update_config: bool,
register: bool,
) -> anyhow::Result<()> {
// Register the node with the storage controller before starting pageserver: pageserver must be registered to
// successfully call /re-attach and finish starting up.
if register {
let attachment_service = AttachmentService::from_env(&self.env);
let (pg_host, pg_port) =
parse_host_port(&self.conf.listen_pg_addr).expect("Unable to parse listen_pg_addr");
let (http_host, http_port) = parse_host_port(&self.conf.listen_http_addr)
.expect("Unable to parse listen_http_addr");
attachment_service
.node_register(NodeRegisterRequest {
node_id: self.conf.id,
listen_pg_addr: pg_host.to_string(),
listen_pg_port: pg_port.unwrap_or(5432),
listen_http_addr: http_host.to_string(),
listen_http_port: http_port.unwrap_or(80),
})
.await?;
}
// TODO: using a thread here because start_process() is not async but we need to call check_status()
let datadir = self.repo_path();
print!(
@@ -429,6 +451,8 @@ impl PageServerNode {
generation,
config,
shard_parameters: ShardParameters::default(),
// Placement policy is not meaningful for creations not done via storage controller
placement_policy: None,
};
if !settings.is_empty() {
bail!("Unrecognized tenant settings: {settings:?}")
@@ -537,10 +561,11 @@ impl PageServerNode {
tenant_shard_id: TenantShardId,
config: LocationConfig,
flush_ms: Option<Duration>,
lazy: bool,
) -> anyhow::Result<()> {
Ok(self
.http_client
.location_config(tenant_shard_id, config, flush_ms)
.location_config(tenant_shard_id, config, flush_ms, lazy)
.await?)
}
@@ -551,13 +576,6 @@ impl PageServerNode {
Ok(self.http_client.list_timelines(*tenant_shard_id).await?)
}
pub async fn tenant_secondary_download(&self, tenant_id: &TenantShardId) -> anyhow::Result<()> {
Ok(self
.http_client
.tenant_secondary_download(*tenant_id)
.await?)
}
pub async fn timeline_create(
&self,
tenant_shard_id: TenantShardId,
@@ -605,7 +623,7 @@ impl PageServerNode {
eprintln!("connection error: {}", e);
}
});
tokio::pin!(client);
let client = std::pin::pin!(client);
// Init base reader
let (start_lsn, base_tarfile_path) = base;

View File

@@ -10,7 +10,7 @@ use pageserver_api::{
TenantCreateRequest, TenantShardSplitRequest, TenantShardSplitResponse,
TimelineCreateRequest, TimelineInfo,
},
shard::TenantShardId,
shard::{ShardStripeSize, TenantShardId},
};
use pageserver_client::mgmt_api::ResponseErrorMessageExt;
use postgres_backend::AuthType;
@@ -24,7 +24,7 @@ use utils::{
id::{NodeId, TenantId},
};
pub struct AttachmentService {
pub struct StorageController {
env: LocalEnv,
listen: String,
path: Utf8PathBuf,
@@ -34,9 +34,12 @@ pub struct AttachmentService {
client: reqwest::Client,
}
const COMMAND: &str = "attachment_service";
const COMMAND: &str = "storage_controller";
const ATTACHMENT_SERVICE_POSTGRES_VERSION: u32 = 16;
const STORAGE_CONTROLLER_POSTGRES_VERSION: u32 = 16;
// Use a shorter pageserver unavailability interval than the default to speed up tests.
const NEON_LOCAL_MAX_UNAVAILABLE_INTERVAL: std::time::Duration = std::time::Duration::from_secs(10);
#[derive(Serialize, Deserialize)]
pub struct AttachHookRequest {
@@ -59,7 +62,7 @@ pub struct InspectResponse {
pub attachment: Option<(u32, NodeId)>,
}
impl AttachmentService {
impl StorageController {
pub fn from_env(env: &LocalEnv) -> Self {
let path = Utf8PathBuf::from_path_buf(env.base_data_dir.clone())
.unwrap()
@@ -136,27 +139,27 @@ impl AttachmentService {
}
fn pid_file(&self) -> Utf8PathBuf {
Utf8PathBuf::from_path_buf(self.env.base_data_dir.join("attachment_service.pid"))
Utf8PathBuf::from_path_buf(self.env.base_data_dir.join("storage_controller.pid"))
.expect("non-Unicode path")
}
/// PIDFile for the postgres instance used to store attachment service state
/// PIDFile for the postgres instance used to store storage controller state
fn postgres_pid_file(&self) -> Utf8PathBuf {
Utf8PathBuf::from_path_buf(
self.env
.base_data_dir
.join("attachment_service_postgres.pid"),
.join("storage_controller_postgres.pid"),
)
.expect("non-Unicode path")
}
/// Find the directory containing postgres binaries, such as `initdb` and `pg_ctl`
///
/// This usually uses ATTACHMENT_SERVICE_POSTGRES_VERSION of postgres, but will fall back
/// This usually uses STORAGE_CONTROLLER_POSTGRES_VERSION of postgres, but will fall back
/// to other versions if that one isn't found. Some automated tests create circumstances
/// where only one version is available in pg_distrib_dir, such as `test_remote_extensions`.
pub async fn get_pg_bin_dir(&self) -> anyhow::Result<Utf8PathBuf> {
let prefer_versions = [ATTACHMENT_SERVICE_POSTGRES_VERSION, 15, 14];
let prefer_versions = [STORAGE_CONTROLLER_POSTGRES_VERSION, 15, 14];
for v in prefer_versions {
let path = Utf8PathBuf::from_path_buf(self.env.pg_bin_dir(v)?).unwrap();
@@ -189,7 +192,7 @@ impl AttachmentService {
///
/// Returns the database url
pub async fn setup_database(&self) -> anyhow::Result<String> {
const DB_NAME: &str = "attachment_service";
const DB_NAME: &str = "storage_controller";
let database_url = format!("postgresql://localhost:{}/{DB_NAME}", self.postgres_port);
let pg_bin_dir = self.get_pg_bin_dir().await?;
@@ -219,10 +222,10 @@ impl AttachmentService {
}
pub async fn start(&self) -> anyhow::Result<()> {
// Start a vanilla Postgres process used by the attachment service for persistence.
// Start a vanilla Postgres process used by the storage controller for persistence.
let pg_data_path = Utf8PathBuf::from_path_buf(self.env.base_data_dir.clone())
.unwrap()
.join("attachment_service_db");
.join("storage_controller_db");
let pg_bin_dir = self.get_pg_bin_dir().await?;
let pg_log_path = pg_data_path.join("postgres.log");
@@ -245,7 +248,7 @@ impl AttachmentService {
.await?;
};
println!("Starting attachment service database...");
println!("Starting storage controller database...");
let db_start_args = [
"-w",
"-D",
@@ -256,7 +259,7 @@ impl AttachmentService {
];
background_process::start_process(
"attachment_service_db",
"storage_controller_db",
&self.env.base_data_dir,
pg_bin_dir.join("pg_ctl").as_std_path(),
db_start_args,
@@ -269,13 +272,18 @@ impl AttachmentService {
// Run migrations on every startup, in case something changed.
let database_url = self.setup_database().await?;
let max_unavailable: humantime::Duration = NEON_LOCAL_MAX_UNAVAILABLE_INTERVAL.into();
let mut args = vec![
"-l",
&self.listen,
"-p",
self.path.as_ref(),
"--dev",
"--database-url",
&database_url,
"--max-unavailable-interval",
&max_unavailable.to_string(),
]
.into_iter()
.map(|s| s.to_string())
@@ -300,7 +308,7 @@ impl AttachmentService {
background_process::start_process(
COMMAND,
&self.env.base_data_dir,
&self.env.attachment_service_bin(),
&self.env.storage_controller_bin(),
args,
[(
"NEON_REPO_DIR".to_string(),
@@ -322,10 +330,10 @@ impl AttachmentService {
pub async fn stop(&self, immediate: bool) -> anyhow::Result<()> {
background_process::stop_process(immediate, COMMAND, &self.pid_file())?;
let pg_data_path = self.env.base_data_dir.join("attachment_service_db");
let pg_data_path = self.env.base_data_dir.join("storage_controller_db");
let pg_bin_dir = self.get_pg_bin_dir().await?;
println!("Stopping attachment service database...");
println!("Stopping storage controller database...");
let pg_stop_args = ["-D", &pg_data_path.to_string_lossy(), "stop"];
let stop_status = Command::new(pg_bin_dir.join("pg_ctl"))
.args(pg_stop_args)
@@ -344,10 +352,10 @@ impl AttachmentService {
// fine that stop failed. Otherwise it is an error that stop failed.
const PG_STATUS_NOT_RUNNING: i32 = 3;
if Some(PG_STATUS_NOT_RUNNING) == status_exitcode.code() {
println!("Attachment service data base is already stopped");
println!("Storage controller database is already stopped");
return Ok(());
} else {
anyhow::bail!("Failed to stop attachment service database: {stop_status}")
anyhow::bail!("Failed to stop storage controller database: {stop_status}")
}
}
@@ -368,7 +376,7 @@ impl AttachmentService {
}
}
/// Simple HTTP request wrapper for calling into attachment service
/// Simple HTTP request wrapper for calling into storage controller
async fn dispatch<RQ, RS>(
&self,
method: hyper::Method,
@@ -468,7 +476,7 @@ impl AttachmentService {
pub async fn tenant_locate(&self, tenant_id: TenantId) -> anyhow::Result<TenantLocateResponse> {
self.dispatch::<(), _>(
Method::GET,
format!("control/v1/tenant/{tenant_id}/locate"),
format!("debug/v1/tenant/{tenant_id}/locate"),
None,
)
.await
@@ -496,11 +504,15 @@ impl AttachmentService {
&self,
tenant_id: TenantId,
new_shard_count: u8,
new_stripe_size: Option<ShardStripeSize>,
) -> anyhow::Result<TenantShardSplitResponse> {
self.dispatch(
Method::PUT,
format!("control/v1/tenant/{tenant_id}/shard_split"),
Some(TenantShardSplitRequest { new_shard_count }),
Some(TenantShardSplitRequest {
new_shard_count,
new_stripe_size,
}),
)
.await
}

View File

@@ -70,9 +70,9 @@ Should only be used e.g. for status check/tenant creation/list.
Should only be used e.g. for status check.
Currently also used for connection from any pageserver to any safekeeper.
"generations_api": Provides access to the upcall APIs served by the attachment service or the control plane.
"generations_api": Provides access to the upcall APIs served by the storage controller or the control plane.
"admin": Provides access to the control plane and admin APIs of the attachment service.
"admin": Provides access to the control plane and admin APIs of the storage controller.
### CLI
CLI generates a key pair during call to `neon_local init` with the following commands:

View File

@@ -1,4 +1,4 @@
# Zenith storage node — alternative
# Neon storage node — alternative
## **Design considerations**

View File

@@ -1,6 +1,6 @@
# Command line interface (end-user)
Zenith CLI as it is described here mostly resides on the same conceptual level as pg_ctl/initdb/pg_recvxlog/etc and replaces some of them in an opinionated way. I would also suggest bundling our patched postgres inside zenith distribution at least at the start.
Neon CLI as it is described here mostly resides on the same conceptual level as pg_ctl/initdb/pg_recvxlog/etc and replaces some of them in an opinionated way. I would also suggest bundling our patched postgres inside neon distribution at least at the start.
This proposal is focused on managing local installations. For cluster operations, different tooling would be needed. The point of integration between the two is storage URL: no matter how complex cluster setup is it may provide an endpoint where the user may push snapshots.
@@ -8,40 +8,40 @@ The most important concept here is a snapshot, which can be created/pushed/pulle
# Possible usage scenarios
## Install zenith, run a postgres
## Install neon, run a postgres
```
> brew install pg-zenith
> zenith pg create # creates pgdata with default pattern pgdata$i
> zenith pg list
> brew install pg-neon
> neon pg create # creates pgdata with default pattern pgdata$i
> neon pg list
ID PGDATA USED STORAGE ENDPOINT
primary1 pgdata1 0G zenith-local localhost:5432
primary1 pgdata1 0G neon-local localhost:5432
```
## Import standalone postgres to zenith
## Import standalone postgres to neon
```
> zenith snapshot import --from=basebackup://replication@localhost:5432/ oldpg
> neon snapshot import --from=basebackup://replication@localhost:5432/ oldpg
[====================------------] 60% | 20MB/s
> zenith snapshot list
> neon snapshot list
ID SIZE PARENT
oldpg 5G -
> zenith pg create --snapshot oldpg
> neon pg create --snapshot oldpg
Started postgres on localhost:5432
> zenith pg list
> neon pg list
ID PGDATA USED STORAGE ENDPOINT
primary1 pgdata1 5G zenith-local localhost:5432
primary1 pgdata1 5G neon-local localhost:5432
> zenith snapshot destroy oldpg
> neon snapshot destroy oldpg
Ok
```
Also, we may start snapshot import implicitly by looking at snapshot schema
```
> zenith pg create --snapshot basebackup://replication@localhost:5432/
> neon pg create --snapshot basebackup://replication@localhost:5432/
Downloading snapshot... Done.
Started postgres on localhost:5432
Destroying snapshot... Done.
@@ -52,39 +52,39 @@ Destroying snapshot... Done.
Since we may export the whole snapshot as one big file (tar of basebackup, maybe with some manifest) it may be shared over conventional means: http, ssh, [git+lfs](https://docs.github.com/en/github/managing-large-files/about-git-large-file-storage).
```
> zenith pg create --snapshot http://learn-postgres.com/movies_db.zenith movies
> neon pg create --snapshot http://learn-postgres.com/movies_db.neon movies
```
## Create snapshot and push it to the cloud
```
> zenith snapshot create pgdata1@snap1
> zenith snapshot push --to ssh://stas@zenith.tech pgdata1@snap1
> neon snapshot create pgdata1@snap1
> neon snapshot push --to ssh://stas@neon.tech pgdata1@snap1
```
## Rollback database to the snapshot
One way to rollback the database is just to init a new database from the snapshot and destroy the old one. But creating a new database from a snapshot would require a copy of that snapshot which is time consuming operation. Another option that would be cool to support is the ability to create the copy-on-write database from the snapshot without copying data, and store updated pages in a separate location, however that way would have performance implications. So to properly rollback the database to the older state we have `zenith pg checkout`.
One way to rollback the database is just to init a new database from the snapshot and destroy the old one. But creating a new database from a snapshot would require a copy of that snapshot which is time consuming operation. Another option that would be cool to support is the ability to create the copy-on-write database from the snapshot without copying data, and store updated pages in a separate location, however that way would have performance implications. So to properly rollback the database to the older state we have `neon pg checkout`.
```
> zenith pg list
> neon pg list
ID PGDATA USED STORAGE ENDPOINT
primary1 pgdata1 5G zenith-local localhost:5432
primary1 pgdata1 5G neon-local localhost:5432
> zenith snapshot create pgdata1@snap1
> neon snapshot create pgdata1@snap1
> zenith snapshot list
> neon snapshot list
ID SIZE PARENT
oldpg 5G -
pgdata1@snap1 6G -
pgdata1@CURRENT 6G -
> zenith pg checkout pgdata1@snap1
> neon pg checkout pgdata1@snap1
Stopping postgres on pgdata1.
Rolling back pgdata1@CURRENT to pgdata1@snap1.
Starting postgres on pgdata1.
> zenith snapshot list
> neon snapshot list
ID SIZE PARENT
oldpg 5G -
pgdata1@snap1 6G -
@@ -99,7 +99,7 @@ Some notes: pgdata1@CURRENT -- implicit snapshot representing the current state
PITR area acts like a continuous snapshot where you can reset the database to any point in time within this area (by area I mean some TTL period or some size limit, both possibly infinite).
```
> zenith pitr create --storage s3tank --ttl 30d --name pitr_last_month
> neon pitr create --storage s3tank --ttl 30d --name pitr_last_month
```
Resetting the database to some state in past would require creating a snapshot on some lsn / time in this pirt area.
@@ -108,29 +108,29 @@ Resetting the database to some state in past would require creating a snapshot o
## storage
Storage is either zenith pagestore or s3. Users may create a database in a pagestore and create/move *snapshots* and *pitr regions* in both pagestore and s3. Storage is a concept similar to `git remote`. After installation, I imagine one local storage is available by default.
Storage is either neon pagestore or s3. Users may create a database in a pagestore and create/move *snapshots* and *pitr regions* in both pagestore and s3. Storage is a concept similar to `git remote`. After installation, I imagine one local storage is available by default.
**zenith storage attach** -t [native|s3] -c key=value -n name
**neon storage attach** -t [native|s3] -c key=value -n name
Attaches/initializes storage. For --type=s3, user credentials and path should be provided. For --type=native we may support --path=/local/path and --url=zenith.tech/stas/mystore. Other possible term for native is 'zstore'.
Attaches/initializes storage. For --type=s3, user credentials and path should be provided. For --type=native we may support --path=/local/path and --url=neon.tech/stas/mystore. Other possible term for native is 'zstore'.
**zenith storage list**
**neon storage list**
Show currently attached storages. For example:
```
> zenith storage list
> neon storage list
NAME USED TYPE OPTIONS PATH
local 5.1G zenith-local /opt/zenith/store/local
local.compr 20.4G zenith-local compression=on /opt/zenith/store/local.compr
zcloud 60G zenith-remote zenith.tech/stas/mystore
local 5.1G neon-local /opt/neon/store/local
local.compr 20.4G neon-local compression=on /opt/neon/store/local.compr
zcloud 60G neon-remote neon.tech/stas/mystore
s3tank 80G S3
```
**zenith storage detach**
**neon storage detach**
**zenith storage show**
**neon storage show**
@@ -140,29 +140,29 @@ Manages postgres data directories and can start postgres instances with proper c
Pg is a term for a single postgres running on some data. I'm trying to avoid separation of datadir management and postgres instance management -- both that concepts bundled here together.
**zenith pg create** [--no-start --snapshot --cow] -s storage-name -n pgdata
**neon pg create** [--no-start --snapshot --cow] -s storage-name -n pgdata
Creates (initializes) new data directory in given storage and starts postgres. I imagine that storage for this operation may be only local and data movement to remote location happens through snapshots/pitr.
--no-start: just init datadir without creating
--snapshot snap: init from the snapshot. Snap is a name or URL (zenith.tech/stas/mystore/snap1)
--snapshot snap: init from the snapshot. Snap is a name or URL (neon.tech/stas/mystore/snap1)
--cow: initialize Copy-on-Write data directory on top of some snapshot (makes sense if it is a snapshot of currently running a database)
**zenith pg destroy**
**neon pg destroy**
**zenith pg start** [--replica] pgdata
**neon pg start** [--replica] pgdata
Start postgres with proper extensions preloaded/installed.
**zenith pg checkout**
**neon pg checkout**
Rollback data directory to some previous snapshot.
**zenith pg stop** pg_id
**neon pg stop** pg_id
**zenith pg list**
**neon pg list**
```
ROLE PGDATA USED STORAGE ENDPOINT
@@ -173,7 +173,7 @@ primary my_pg2 3.2G local.compr localhost:5435
- my_pg3 9.2G local.compr -
```
**zenith pg show**
**neon pg show**
```
my_pg:
@@ -194,7 +194,7 @@ my_pg:
```
**zenith pg start-rest/graphql** pgdata
**neon pg start-rest/graphql** pgdata
Starts REST/GraphQL proxy on top of postgres master. Not sure we should do that, just an idea.
@@ -203,35 +203,35 @@ Starts REST/GraphQL proxy on top of postgres master. Not sure we should do that,
Snapshot creation is cheap -- no actual data is copied, we just start retaining old pages. Snapshot size means the amount of retained data, not all data. Snapshot name looks like pgdata_name@tag_name. tag_name is set by the user during snapshot creation. There are some reserved tag names: CURRENT represents the current state of the data directory; HEAD{i} represents the data directory state that resided in the database before i-th checkout.
**zenith snapshot create** pgdata_name@snap_name
**neon snapshot create** pgdata_name@snap_name
Creates a new snapshot in the same storage where pgdata_name exists.
**zenith snapshot push** --to url pgdata_name@snap_name
**neon snapshot push** --to url pgdata_name@snap_name
Produces binary stream of a given snapshot. Under the hood starts temp read-only postgres over this snapshot and sends basebackup stream. Receiving side should start `zenith snapshot recv` before push happens. If url has some special schema like zenith:// receiving side may require auth start `zenith snapshot recv` on the go.
Produces binary stream of a given snapshot. Under the hood starts temp read-only postgres over this snapshot and sends basebackup stream. Receiving side should start `neon snapshot recv` before push happens. If url has some special schema like neon:// receiving side may require auth start `neon snapshot recv` on the go.
**zenith snapshot recv**
**neon snapshot recv**
Starts a port listening for a basebackup stream, prints connection info to stdout (so that user may use that in push command), and expects data on that socket.
**zenith snapshot pull** --from url or path
**neon snapshot pull** --from url or path
Connects to a remote zenith/s3/file and pulls snapshot. The remote site should be zenith service or files in our format.
Connects to a remote neon/s3/file and pulls snapshot. The remote site should be neon service or files in our format.
**zenith snapshot import** --from basebackup://<...> or path
**neon snapshot import** --from basebackup://<...> or path
Creates a new snapshot out of running postgres via basebackup protocol or basebackup files.
**zenith snapshot export**
**neon snapshot export**
Starts read-only postgres over this snapshot and exports data in some format (pg_dump, or COPY TO on some/all tables). One of the options may be zenith own format which is handy for us (but I think just tar of basebackup would be okay).
Starts read-only postgres over this snapshot and exports data in some format (pg_dump, or COPY TO on some/all tables). One of the options may be neon own format which is handy for us (but I think just tar of basebackup would be okay).
**zenith snapshot diff** snap1 snap2
**neon snapshot diff** snap1 snap2
Shows size of data changed between two snapshots. We also may provide options to diff schema/data in tables. To do that start temp read-only postgreses.
**zenith snapshot destroy**
**neon snapshot destroy**
## pitr
@@ -239,7 +239,7 @@ Pitr represents wal stream and ttl policy for that stream
XXX: any suggestions on a better name?
**zenith pitr create** name
**neon pitr create** name
--ttl = inf | period
@@ -247,21 +247,21 @@ XXX: any suggestions on a better name?
--storage = storage_name
**zenith pitr extract-snapshot** pitr_name --lsn xxx
**neon pitr extract-snapshot** pitr_name --lsn xxx
Creates a snapshot out of some lsn in PITR area. The obtained snapshot may be managed with snapshot routines (move/send/export)
**zenith pitr gc** pitr_name
**neon pitr gc** pitr_name
Force garbage collection on some PITR area.
**zenith pitr list**
**neon pitr list**
**zenith pitr destroy**
**neon pitr destroy**
## console
**zenith console**
**neon console**
Opens browser targeted at web console with the more or less same functionality as described here.

View File

@@ -6,7 +6,7 @@ When do we consider the WAL record as durable, so that we can
acknowledge the commit to the client and be reasonably certain that we
will not lose the transaction?
Zenith uses a group of WAL safekeeper nodes to hold the generated WAL.
Neon uses a group of WAL safekeeper nodes to hold the generated WAL.
A WAL record is considered durable, when it has been written to a
majority of WAL safekeeper nodes. In this document, I use 5
safekeepers, because I have five fingers. A WAL record is durable,

View File

@@ -1,23 +1,23 @@
# Zenith local
# Neon local
Here I list some objectives to keep in mind when discussing zenith-local design and a proposal that brings all components together. Your comments on both parts are very welcome.
Here I list some objectives to keep in mind when discussing neon-local design and a proposal that brings all components together. Your comments on both parts are very welcome.
#### Why do we need it?
- For distribution - this easy to use binary will help us to build adoption among developers.
- For internal use - to test all components together.
In my understanding, we consider it to be just a mock-up version of zenith-cloud.
In my understanding, we consider it to be just a mock-up version of neon-cloud.
> Question: How much should we care about durability and security issues for a local setup?
#### Why is it better than a simple local postgres?
- Easy one-line setup. As simple as `cargo install zenith && zenith start`
- Easy one-line setup. As simple as `cargo install neon && neon start`
- Quick and cheap creation of compute nodes over the same storage.
> Question: How can we describe a use-case for this feature?
- Zenith-local can work with S3 directly.
- Neon-local can work with S3 directly.
- Push and pull images (snapshots) to remote S3 to exchange data with other users.
@@ -31,50 +31,50 @@ Ideally, just one binary that incorporates all elements we need.
#### Components:
- **zenith-CLI** - interface for end-users. Turns commands to REST requests and handles responses to show them in a user-friendly way.
CLI proposal is here https://github.com/libzenith/rfcs/blob/003-laptop-cli.md/003-laptop-cli.md
WIP code is here: https://github.com/libzenith/postgres/tree/main/pageserver/src/bin/cli
- **neon-CLI** - interface for end-users. Turns commands to REST requests and handles responses to show them in a user-friendly way.
CLI proposal is here https://github.com/neondatabase/rfcs/blob/003-laptop-cli.md/003-laptop-cli.md
WIP code is here: https://github.com/neondatabase/postgres/tree/main/pageserver/src/bin/cli
- **zenith-console** - WEB UI with same functionality as CLI.
- **neon-console** - WEB UI with same functionality as CLI.
>Note: not for the first release.
- **zenith-local** - entrypoint. Service that starts all other components and handles REST API requests. See REST API proposal below.
> Idea: spawn all other components as child processes, so that we could shutdown everything by stopping zenith-local.
- **neon-local** - entrypoint. Service that starts all other components and handles REST API requests. See REST API proposal below.
> Idea: spawn all other components as child processes, so that we could shutdown everything by stopping neon-local.
- **zenith-pageserver** - consists of a storage and WAL-replaying service (modified PG in current implementation).
- **neon-pageserver** - consists of a storage and WAL-replaying service (modified PG in current implementation).
> Question: Probably, for local setup we should be able to bypass page-storage and interact directly with S3 to avoid double caching in shared buffers and page-server?
WIP code is here: https://github.com/libzenith/postgres/tree/main/pageserver/src
WIP code is here: https://github.com/neondatabase/postgres/tree/main/pageserver/src
- **zenith-S3** - stores base images of the database and WAL in S3 object storage. Import and export images from/to zenith.
- **neon-S3** - stores base images of the database and WAL in S3 object storage. Import and export images from/to neon.
> Question: How should it operate in a local setup? Will we manage it ourselves or ask user to provide credentials for existing S3 object storage (i.e. minio)?
> Question: Do we use it together with local page store or they are interchangeable?
WIP code is ???
- **zenith-safekeeper** - receives WAL from postgres, stores it durably, answers to Postgres that "sync" is succeed.
- **neon-safekeeper** - receives WAL from postgres, stores it durably, answers to Postgres that "sync" is succeed.
> Question: How should it operate in a local setup? In my understanding it should push WAL directly to S3 (if we use it) or store all data locally (if we use local page storage). The latter option seems meaningless (extra overhead and no gain), but it is still good to test the system.
WIP code is here: https://github.com/libzenith/postgres/tree/main/src/bin/safekeeper
WIP code is here: https://github.com/neondatabase/postgres/tree/main/src/bin/safekeeper
- **zenith-computenode** - bottomless PostgreSQL, ideally upstream, but for a start - our modified version. User can quickly create and destroy them and work with it as a regular postgres database.
- **neon-computenode** - bottomless PostgreSQL, ideally upstream, but for a start - our modified version. User can quickly create and destroy them and work with it as a regular postgres database.
WIP code is in main branch and here: https://github.com/libzenith/postgres/commits/compute_node
WIP code is in main branch and here: https://github.com/neondatabase/postgres/commits/compute_node
#### REST API:
Service endpoint: `http://localhost:3000`
Resources:
- /storages - Where data lives: zenith-pageserver or zenith-s3
- /pgs - Postgres - zenith-computenode
- /storages - Where data lives: neon-pageserver or neon-s3
- /pgs - Postgres - neon-computenode
- /snapshots - snapshots **TODO**
>Question: Do we want to extend this API to manage zenith components? I.e. start page-server, manage safekeepers and so on? Or they will be hardcoded to just start once and for all?
>Question: Do we want to extend this API to manage neon components? I.e. start page-server, manage safekeepers and so on? Or they will be hardcoded to just start once and for all?
Methods and their mapping to CLI:
- /storages - zenith-pageserver or zenith-s3
- /storages - neon-pageserver or neon-s3
CLI | REST API
------------- | -------------
@@ -84,7 +84,7 @@ storage list | GET /storages
storage show -n name | GET /storages/:storage_name
- /pgs - zenith-computenode
- /pgs - neon-computenode
CLI | REST API
------------- | -------------

View File

@@ -1,45 +1,45 @@
Zenith CLI allows you to operate database clusters (catalog clusters) and their commit history locally and in the cloud. Since ANSI calls them catalog clusters and cluster is a loaded term in the modern infrastructure we will call it "catalog".
Neon CLI allows you to operate database clusters (catalog clusters) and their commit history locally and in the cloud. Since ANSI calls them catalog clusters and cluster is a loaded term in the modern infrastructure we will call it "catalog".
# CLI v2 (after chatting with Carl)
Zenith introduces the notion of a repository.
Neon introduces the notion of a repository.
```bash
zenith init
zenith clone zenith://zenith.tech/piedpiper/northwind -- clones a repo to the northwind directory
neon init
neon clone neon://neon.tech/piedpiper/northwind -- clones a repo to the northwind directory
```
Once you have a cluster catalog you can explore it
```bash
zenith log -- returns a list of commits
zenith status -- returns if there are changes in the catalog that can be committed
zenith commit -- commits the changes and generates a new commit hash
zenith branch experimental <hash> -- creates a branch called testdb based on a given commit hash
neon log -- returns a list of commits
neon status -- returns if there are changes in the catalog that can be committed
neon commit -- commits the changes and generates a new commit hash
neon branch experimental <hash> -- creates a branch called testdb based on a given commit hash
```
To make changes in the catalog you need to run compute nodes
```bash
-- here is how you a compute node
zenith start /home/pipedpiper/northwind:main -- starts a compute instance
zenith start zenith://zenith.tech/northwind:main -- starts a compute instance in the cloud
neon start /home/pipedpiper/northwind:main -- starts a compute instance
neon start neon://neon.tech/northwind:main -- starts a compute instance in the cloud
-- you can start a compute node against any hash or branch
zenith start /home/pipedpiper/northwind:experimental --port 8008 -- start another compute instance (on different port)
neon start /home/pipedpiper/northwind:experimental --port 8008 -- start another compute instance (on different port)
-- you can start a compute node against any hash or branch
zenith start /home/pipedpiper/northwind:<hash> --port 8009 -- start another compute instance (on different port)
neon start /home/pipedpiper/northwind:<hash> --port 8009 -- start another compute instance (on different port)
-- After running some DML you can run
-- zenith status and see how there are two WAL streams one on top of
-- neon status and see how there are two WAL streams one on top of
-- the main branch
zenith status
neon status
-- and another on top of the experimental branch
zenith status -b experimental
neon status -b experimental
-- you can commit each branch separately
zenith commit main
neon commit main
-- or
zenith commit -c /home/pipedpiper/northwind:experimental
neon commit -c /home/pipedpiper/northwind:experimental
```
Starting compute instances against cloud environments
@@ -47,18 +47,18 @@ Starting compute instances against cloud environments
```bash
-- you can start a compute instance against the cloud environment
-- in this case all of the changes will be streamed into the cloud
zenith start https://zenith:tech/pipedpiper/northwind:main
zenith start https://zenith:tech/pipedpiper/northwind:main
zenith status -c https://zenith:tech/pipedpiper/northwind:main
zenith commit -c https://zenith:tech/pipedpiper/northwind:main
zenith branch -c https://zenith:tech/pipedpiper/northwind:<hash> experimental
neon start https://neon:tecj/pipedpiper/northwind:main
neon start https://neon:tecj/pipedpiper/northwind:main
neon status -c https://neon:tecj/pipedpiper/northwind:main
neon commit -c https://neon:tecj/pipedpiper/northwind:main
neon branch -c https://neon:tecj/pipedpiper/northwind:<hash> experimental
```
Pushing data into the cloud
```bash
-- pull all the commits from the cloud
zenith pull
neon pull
-- push all the commits to the cloud
zenith push
neon push
```

View File

@@ -1,14 +1,14 @@
# Repository format
A Zenith repository is similar to a traditional PostgreSQL backup
A Neon repository is similar to a traditional PostgreSQL backup
archive, like a WAL-G bucket or pgbarman backup catalogue. It holds
multiple versions of a PostgreSQL database cluster.
The distinguishing feature is that you can launch a Zenith Postgres
The distinguishing feature is that you can launch a Neon Postgres
server directly against a branch in the repository, without having to
"restore" it first. Also, Zenith manages the storage automatically,
"restore" it first. Also, Neon manages the storage automatically,
there is no separation between full and incremental backups nor WAL
archive. Zenith relies heavily on the WAL, and uses concepts similar
archive. Neon relies heavily on the WAL, and uses concepts similar
to incremental backups and WAL archiving internally, but it is hidden
from the user.
@@ -19,15 +19,15 @@ efficient. Just something to get us started.
The repository directory looks like this:
.zenith/timelines/4543be3daeab2ed4e58a285cbb8dd1fce6970f8c/wal/
.zenith/timelines/4543be3daeab2ed4e58a285cbb8dd1fce6970f8c/snapshots/<lsn>/
.zenith/timelines/4543be3daeab2ed4e58a285cbb8dd1fce6970f8c/history
.neon/timelines/4543be3daeab2ed4e58a285cbb8dd1fce6970f8c/wal/
.neon/timelines/4543be3daeab2ed4e58a285cbb8dd1fce6970f8c/snapshots/<lsn>/
.neon/timelines/4543be3daeab2ed4e58a285cbb8dd1fce6970f8c/history
.zenith/refs/branches/mybranch
.zenith/refs/tags/foo
.zenith/refs/tags/bar
.neon/refs/branches/mybranch
.neon/refs/tags/foo
.neon/refs/tags/bar
.zenith/datadirs/<timeline uuid>
.neon/datadirs/<timeline uuid>
### Timelines
@@ -39,7 +39,7 @@ All WAL is generated on a timeline. You can launch a read-only node
against a tag or arbitrary LSN on a timeline, but in order to write,
you need to create a timeline.
Each timeline is stored in a directory under .zenith/timelines. It
Each timeline is stored in a directory under .neon/timelines. It
consists of a WAL archive, containing all the WAL in the standard
PostgreSQL format, under the wal/ subdirectory.
@@ -66,18 +66,18 @@ contains the UUID of the timeline (and LSN, for tags).
### Datadirs
.zenith/datadirs contains PostgreSQL data directories. You can launch
.neon/datadirs contains PostgreSQL data directories. You can launch
a Postgres instance on one of them with:
```
postgres -D .zenith/datadirs/4543be3daeab2ed4e58a285cbb8dd1fce6970f8c
postgres -D .neon/datadirs/4543be3daeab2ed4e58a285cbb8dd1fce6970f8c
```
All the actual data is kept in the timeline directories, under
.zenith/timelines. The data directories are only needed for active
.neon/timelines. The data directories are only needed for active
PostgreQSL instances. After an instance is stopped, the data directory
can be safely removed. "zenith start" will recreate it quickly from
the data in .zenith/timelines, if it's missing.
can be safely removed. "neon start" will recreate it quickly from
the data in .neon/timelines, if it's missing.
## Version 2
@@ -103,14 +103,14 @@ more advanced. The exact format is TODO. But it should support:
### Garbage collection
When you run "zenith gc", old timelines that are no longer needed are
When you run "neon gc", old timelines that are no longer needed are
removed. That involves collecting the list of "unreachable" objects,
starting from the named branches and tags.
Also, if enough WAL has been generated on a timeline since last
snapshot, a new snapshot or delta is created.
### zenith push/pull
### neon push/pull
Compare the tags and branches on both servers, and copy missing ones.
For each branch, compare the timeline it points to in both servers. If
@@ -123,7 +123,7 @@ every time you start up an instance? Then you would detect that the
timelines have diverged. That would match with the "epoch" concept
that we have in the WAL safekeeper
### zenith checkout/commit
### neon checkout/commit
In this format, there is no concept of a "working tree", and hence no
concept of checking out or committing. All modifications are done on
@@ -134,7 +134,7 @@ You can easily fork off a temporary timeline to emulate a "working tree".
You can later remove it and have it garbage collected, or to "commit",
re-point the branch to the new timeline.
If we want to have a worktree and "zenith checkout/commit" concept, we can
If we want to have a worktree and "neon checkout/commit" concept, we can
emulate that with a temporary timeline. Create the temporary timeline at
"zenith checkout", and have "zenith commit" modify the branch to point to
"neon checkout", and have "neon commit" modify the branch to point to
the new timeline.

View File

@@ -4,27 +4,27 @@ How it works now
1. Create repository, start page server on it
```
$ zenith init
$ neon init
...
created main branch
new zenith repository was created in .zenith
new neon repository was created in .neon
$ zenith pageserver start
Starting pageserver at '127.0.0.1:64000' in .zenith
$ neon pageserver start
Starting pageserver at '127.0.0.1:64000' in .neon
Page server started
```
2. Create a branch, and start a Postgres instance on it
```
$ zenith branch heikki main
$ neon branch heikki main
branching at end of WAL: 0/15ECF68
$ zenith pg create heikki
$ neon pg create heikki
Initializing Postgres on timeline 76cf9279915be7797095241638e64644...
Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/pg1 port=55432
Extracting base backup to create postgres instance: path=.neon/pgdatadirs/pg1 port=55432
$ zenith pg start pg1
$ neon pg start pg1
Starting postgres node at 'host=127.0.0.1 port=55432 user=heikki'
waiting for server to start.... done
server started
@@ -52,20 +52,20 @@ serverless on your laptop, so that the workflow becomes just:
1. Create repository, start page server on it (same as before)
```
$ zenith init
$ neon init
...
created main branch
new zenith repository was created in .zenith
new neon repository was created in .neon
$ zenith pageserver start
Starting pageserver at '127.0.0.1:64000' in .zenith
$ neon pageserver start
Starting pageserver at '127.0.0.1:64000' in .neon
Page server started
```
2. Create branch
```
$ zenith branch heikki main
$ neon branch heikki main
branching at end of WAL: 0/15ECF68
```

View File

@@ -7,22 +7,22 @@ Here is a proposal about implementing push/pull mechanics between pageservers. W
The origin represents connection info for some remote pageserver. Let's use here same commands as git uses except using explicit list subcommand (git uses `origin -v` for that).
```
zenith origin add <name> <connection_uri>
zenith origin list
zenith origin remove <name>
neon origin add <name> <connection_uri>
neon origin list
neon origin remove <name>
```
Connection URI a string of form `postgresql://user:pass@hostname:port` (https://www.postgresql.org/docs/13/libpq-connect.html#id-1.7.3.8.3.6). We can start with libpq password auth and later add support for client certs or require ssh as transport or invent some other kind of transport.
Behind the scenes, this commands may update toml file inside .zenith directory.
Behind the scenes, this commands may update toml file inside .neon directory.
## Push
### Pushing branch
```
zenith push mybranch cloudserver # push to eponymous branch in cloudserver
zenith push mybranch cloudserver:otherbranch # push to a different branch in cloudserver
neon push mybranch cloudserver # push to eponymous branch in cloudserver
neon push mybranch cloudserver:otherbranch # push to a different branch in cloudserver
```
Exact mechanics would be slightly different in the following situations:

View File

@@ -2,7 +2,7 @@ While working on export/import commands, I understood that they fit really well
We may think about backups as snapshots in a different format (i.e plain pgdata format, basebackup tar format, WAL-G format (if they want to support it) and so on). They use same storage API, the only difference is the code that packs/unpacks files.
Even if zenith aims to maintains durability using it's own snapshots, backups will be useful for uploading data from postgres to zenith.
Even if neon aims to maintains durability using it's own snapshots, backups will be useful for uploading data from postgres to neon.
So here is an attempt to design consistent CLI for different usage scenarios:
@@ -16,8 +16,8 @@ Save`storage_dest` and other parameters in config.
Push snapshots to `storage_dest` in background.
```
zenith init --storage_dest=S3_PREFIX
zenith start
neon init --storage_dest=S3_PREFIX
neon start
```
#### 2. Restart pageserver (manually or crash-recovery).
@@ -25,7 +25,7 @@ Take `storage_dest` from pageserver config, start pageserver from latest snapsho
Push snapshots to `storage_dest` in background.
```
zenith start
neon start
```
#### 3. Import.
@@ -35,22 +35,22 @@ Do not save `snapshot_path` and `snapshot_format` in config, as it is a one-time
Save`storage_dest` parameters in config.
Push snapshots to `storage_dest` in background.
```
//I.e. we want to start zenith on top of existing $PGDATA and use s3 as a persistent storage.
zenith init --snapshot_path=FILE_PREFIX --snapshot_format=pgdata --storage_dest=S3_PREFIX
zenith start
//I.e. we want to start neon on top of existing $PGDATA and use s3 as a persistent storage.
neon init --snapshot_path=FILE_PREFIX --snapshot_format=pgdata --storage_dest=S3_PREFIX
neon start
```
How to pass credentials needed for `snapshot_path`?
#### 4. Export.
Manually push snapshot to `snapshot_path` which differs from `storage_dest`
Optionally set `snapshot_format`, which can be plain pgdata format or zenith format.
Optionally set `snapshot_format`, which can be plain pgdata format or neon format.
```
zenith export --snapshot_path=FILE_PREFIX --snapshot_format=pgdata
neon export --snapshot_path=FILE_PREFIX --snapshot_format=pgdata
```
#### Notes and questions
- safekeeper s3_offload should use same (similar) syntax for storage. How to set it in UI?
- Why do we need `zenith init` as a separate command? Can't we init everything at first start?
- Why do we need `neon init` as a separate command? Can't we init everything at first start?
- We can think of better names for all options.
- Export to plain postgres format will be useless, if we are not 100% compatible on page level.
I can recall at least one such difference - PD_WAL_LOGGED flag in pages.

View File

@@ -9,7 +9,7 @@ receival and this might lag behind `term`; safekeeper switches to epoch `n` when
it has received all committed log records from all `< n` terms. This roughly
corresponds to proposed in
https://github.com/zenithdb/rfcs/pull/3/files
https://github.com/neondatabase/rfcs/pull/3/files
This makes our biggest our difference from Raft. In Raft, every log record is

View File

@@ -1,6 +1,6 @@
# Safekeeper gossip
Extracted from this [PR](https://github.com/zenithdb/rfcs/pull/13)
Extracted from this [PR](https://github.com/neondatabase/rfcs/pull/13)
## Motivation

View File

@@ -2,7 +2,7 @@
Created on 19.01.22
Initially created [here](https://github.com/zenithdb/rfcs/pull/16) by @kelvich.
Initially created [here](https://github.com/neondatabase/rfcs/pull/16) by @kelvich.
That it is an alternative to (014-safekeeper-gossip)[]
@@ -292,4 +292,4 @@ But with an etcd we are in a bit different situation:
1. We don't need persistency and strong consistency guarantees for the data we store in the etcd
2. etcd uses Grpc as a protocol, and messages are pretty simple
So it looks like implementing in-mem store with etcd interface is straightforward thing _if we will want that in future_. At the same time, we can avoid implementing it right now, and we will be able to run local zenith installation with etcd running somewhere in the background (as opposed to building and running console, which in turn requires Postgres).
So it looks like implementing in-mem store with etcd interface is straightforward thing _if we will want that in future_. At the same time, we can avoid implementing it right now, and we will be able to run local neon installation with etcd running somewhere in the background (as opposed to building and running console, which in turn requires Postgres).

View File

@@ -0,0 +1,408 @@
# Sharding Phase 1: Static Key-space Sharding
## Summary
To enable databases with sizes approaching the capacity of a pageserver's disk,
it is necessary to break up the storage for the database, or _shard_ it.
Sharding in general is a complex area. This RFC aims to define an initial
capability that will permit creating large-capacity databases using a static configuration
defined at time of Tenant creation.
## Motivation
Currently, all data for a Tenant, including all its timelines, is stored on a single
pageserver. The local storage required may be several times larger than the actual
database size, due to LSM write inflation.
If a database is larger than what one pageserver can hold, then it becomes impossible
for the pageserver to hold it in local storage, as it must do to provide service to
clients.
### Prior art
In Neon:
- Layer File Spreading: https://www.notion.so/neondatabase/One-Pager-Layer-File-Spreading-Konstantin-21fd9b11b618475da5f39c61dd8ab7a4
- Layer File SPreading: https://www.notion.so/neondatabase/One-Pager-Layer-File-Spreading-Christian-eb6b64182a214e11b3fceceee688d843
- Key Space partitioning: https://www.notion.so/neondatabase/One-Pager-Key-Space-Partitioning-Stas-8e3a28a600a04a25a68523f42a170677
Prior art in other distributed systems is too broad to capture here: pretty much
any scale out storage system does something like this.
## Requirements
- Enable creating a large (for example, 16TiB) database without requiring dedicated
pageserver nodes.
- Share read/write bandwidth costs for large databases across pageservers, as well
as storage capacity, in order to avoid large capacity databases acting as I/O hotspots
that disrupt service to other tenants.
- Our data distribution scheme should handle sparse/nonuniform keys well, since postgres
does not write out a single contiguous ranges of page numbers.
_Note: the definition of 'large database' is arbitrary, but the lower bound is to ensure that a database
that a user might create on a current-gen enterprise SSD should also work well on
Neon. The upper bound is whatever postgres can handle: i.e. we must make sure that the
pageserver backend is not the limiting factor in the database size_.
## Non Goals
- Independently distributing timelines within the same tenant. If a tenant has many
timelines, then sharding may be a less efficient mechanism for distributing load than
sharing out timelines between pageservers.
- Distributing work in the LSN dimension: this RFC focuses on the Key dimension only,
based on the idea that separate mechanisms will make sense for each dimension.
## Impacted Components
pageserver, control plane, postgres/smgr
## Terminology
**Key**: a postgres page number, qualified by relation. In the sense that the pageserver is a versioned key-value store,
the page number is the key in that store. `Key` is a literal data type in existing code.
**LSN dimension**: this just means the range of LSNs (history), when talking about the range
of keys and LSNs as a two dimensional space.
## Implementation
### Key sharding vs. LSN sharding
When we think of sharding across the two dimensional key/lsn space, this is an
opportunity to think about how the two dimensions differ:
- Sharding the key space distributes the _write_ workload of ingesting data
and compacting. This work must be carefully managed so that exactly one
node owns a given key.
- Sharding the LSN space distributes the _historical read_ workload. This work
can be done by anyone without any special coordination, as long as they can
see the remote index and layers.
The key sharding is the harder part, and also the more urgent one, to support larger
capacity databases. Because distributing historical LSN read work is a relatively
simpler problem that most users don't have, we defer it to future work. It is anticipated
that some quite simple P2P offload model will enable distributing work for historical
reads: a node which is low on space can call out to peer to ask it to download and
serve reads from a historical layer.
### Key mapping scheme
Having decided to focus on key sharding, we must next decide how we will map
keys to shards. It is proposed to use a "wide striping" approach, to obtain a good compromise
between data locality and avoiding entire large relations mapping to the same shard.
We will define two spaces:
- Key space: unsigned integer
- Shard space: integer from 0 to N-1, where we have N shards.
### Key -> Shard mapping
Keys are currently defined in the pageserver's getpage@lsn interface as follows:
```
pub struct Key {
pub field1: u8,
pub field2: u32,
pub field3: u32,
pub field4: u32,
pub field5: u8,
pub field6: u32,
}
fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
Key {
field1: 0x00,
field2: rel.spcnode,
field3: rel.dbnode,
field4: rel.relnode,
field5: rel.forknum,
field6: blknum,
}
}
```
_Note: keys for relation metadata are ignored here, as this data will be mirrored to all
shards. For distribution purposes, we only care about user data keys_
The properties we want from our Key->Shard mapping are:
- Locality in `blknum`, such that adjacent `blknum` will usually map to
the same stripe and consequently land on the same shard, even though the overall
collection of blocks in a relation will be spread over many stripes and therefore
many shards.
- Avoid the same blknum on different relations landing on the same stripe, so that
with many small relations we do not end up aliasing data to the same stripe/shard.
- Avoid vulnerability to aliasing in the values of relation identity fields, such that
if there are patterns in the value of `relnode`, these do not manifest as patterns
in data placement.
To accomplish this, the blknum is used to select a stripe, and stripes are
assigned to shards in a pseudorandom order via a hash. The motivation for
pseudo-random distribution (rather than sequential mapping of stripe to shard)
is to avoid I/O hotspots when sequentially reading multiple relations: we don't want
all relations' stripes to touch pageservers in the same order.
To map a `Key` to a shard:
- Hash the `Key` field 4 (relNode).
- Divide field 6 (`blknum`) field by the stripe size in pages, and combine the
hash of this with the hash from the previous step.
- The total hash modulo the shard count gives the shard holding this key.
Why don't we use the other fields in the Key?
- We ignore `forknum` for key mapping, because it distinguishes different classes of data
in the same relation, and we would like to keep the data in a relation together.
- We would like to use spcNode and dbNode, but cannot. Postgres database creation operations can refer to an existing database as a template, such that the created
database's blocks differ only by spcNode and dbNode from the original. To enable running
this type of creation without cross-pageserver communication, we must ensure that these
blocks map to the same shard -- we do this by excluding spcNode and dbNode from the hash.
### Data placement examples
For example, consider the extreme large databases cases of postgres data layout in a system with 8 shards
and a stripe size of 32k pages:
- A single large relation: `blknum` division will break the data up into 4096
stripes, which will be scattered across the shards.
- 4096 relations of of 32k pages each: each relation will map to exactly one stripe,
and that stripe will be placed according to the hash of the key fields 4. The
data placement will be statistically uniform across shards.
Data placement will be more uneven on smaller databases:
- A tenant with 2 shards and 2 relations of one stripe size each: there is a 50% chance
that both relations land on the same shard and no data lands on the other shard.
- A tenant with 8 shards and one relation of size 12 stripes: 4 shards will have double
the data of the other four shards.
These uneven cases for small amounts of data do not matter, as long as the stripe size
is an order of magnitude smaller than the amount of data we are comfortable holding
in a single shard: if our system handles shard sizes up to 10-100GB, then it is not an issue if
a tenant has some shards with 256MB size and some shards with 512MB size, even though
the standard deviation of shard size within the tenant is very high. Our key mapping
scheme provides a statistical guarantee that as the tenant's overall data size increases,
uniformity of placement will improve.
### Important Types
#### `ShardIdentity`
Provides the information needed to know whether a particular key belongs
to a particular shard:
- Layout version
- Stripe size
- Shard count
- Shard index
This structure's size is constant. Note that if we had used a differnet key
mapping scheme such as consistent hashing with explicit hash ranges assigned
to each shard, then the ShardIdentity's size would grow with the shard count: the simpler
key mapping scheme used here enables a small fixed size ShardIdentity.
### Pageserver changes
#### Structural
Everywhere the Pageserver currently deals with Tenants, it will move to dealing with
`TenantShard`s, which are just a `Tenant` plus a `ShardIdentity` telling it which part
of the keyspace it owns. An un-sharded tenant is just a `TenantShard` whose `ShardIdentity`
covers the whole keyspace.
When the pageserver writes layers and index_part.json to remote storage, it must
include the shard index & count in the name, to avoid collisions (the count is
necessary for future-proofing: the count will vary in time). These keys
will also include a generation number: the [generation numbers](025-generation-numbers.md) system will work
exactly the same for TenantShards as it does for Tenants today: each shard will have
its own generation number.
#### Storage Format: Keys
For tenants with >1 shard, layer files implicitly become sparse: within the key
range described in the layer name, the layer file for a shard will only hold the
content relevant to stripes assigned to the shard.
For this reason, the LayerFileName within a tenant is no longer unique: different shards
may use the same LayerFileName to refer to different data. We may solve this simply
by including the shard number in the keys used for layers.
The shard number will be included as a prefix (as part of tenant ID), like this:
`pageserver/v1/tenants/<tenant_id>-<shard_number><shard_count>/timelines/<timeline id>/<layer file name>-<generation>`
`pageserver/v1/tenants/<tenant_id>-<shard_number><shard_count>/timelines/<timeline id>/index_part.json-<generation>`
Reasons for this particular format:
- Use of a prefix is convenient for implementation (no need to carry the shard ID everywhere
we construct a layer file name), and enables efficient listing of index_parts within
a particular shard-timeline prefix.
- Including the shard _count_ as well as shard number means that in future when we implement
shard splitting, it will be possible for a parent shard and one of its children to write
the same layer file without a name collision. For example, a parent shard 0_1 might split
into two (0_2, 1_2), and in the process of splitting shard 0_2 could write a layer or index_part
that is distinct from what shard 0_1 would have written at the same place.
In practice, we expect shard counts to be relatively small, so a `u8` will be sufficient,
and therefore the shard part of the path can be a fixed-length hex string like `{:02X}{:02X}`,
for example a single-shard tenant's prefix will be `0001`.
For backward compatibility, we may define a special `ShardIdentity` that has shard_count==0,
and use this as a cue to construct paths with no prefix at all.
#### Storage Format: Indices
In the phase 1 described in this RFC, shards only reference layers they write themselves. However,
when we implement shard splitting in future, it will be useful to enable shards to reference layers
written by other shards (specifically the parent shard during a split), so that shards don't
have to exhaustively copy all data into their own shard-prefixed keys.
To enable this, the `IndexPart` structure will be extended to store the (shard number, shard count)
tuple on each layer, such that it can construct paths for layers written by other shards. This
naturally raises the question of who "owns" such layers written by ancestral shards: this problem
will be addressed in phase 2.
For backward compatibility, any index entry without shard information will be assumed to be
in the legacy shardidentity.
#### WAL Ingest
In Phase 1, all shards will subscribe to the safekeeper to download WAL content. They will filter
it down to the pages relevant to their shard:
- For ordinary user data writes, only retain a write if it matches the ShardIdentity
- For metadata describing relations etc, all shards retain these writes.
The pageservers must somehow give the safekeeper correct feedback on remote_consistent_lsn:
one solution here is for the 0th shard to periodically peek at the IndexParts for all the other shards,
and have only the 0th shard populate remote_consistent_lsn. However, this is relatively
expensive: if the safekeeper can be made shard-aware then it could be taught to use
the max() of all shards' remote_consistent_lsns to decide when to trim the WAL.
#### Compaction/GC
No changes needed.
The pageserver doesn't have to do anything special during compaction
or GC. It is implicitly operating on the subset of keys that map to its ShardIdentity.
This will result in sparse layer files, containing keys only in the stripes that this
shard owns. Where optimizations currently exist in compaction for spotting "gaps" in
the key range, these should be updated to ignore gaps that are due to sharding, to
avoid spuriously splitting up layers ito stripe-sized pieces.
### Compute Endpoints
Compute endpoints will need to:
- Accept a vector of connection strings as part of their configuration from the control plane
- Route pageserver requests according to mapping the hash of key to the correct
entry in the vector of connection strings.
Doing this in compute rather than routing requests via a single pageserver is
necessary to enable sharding tenants without adding latency from extra hops.
### Control Plane
Tenants, or _Projects_ in the control plane, will each own a set of TenantShards (this will
be 1 for small tenants). Logic for placement of tenant shards is just the same as the current logic for placing
tenants.
Tenant lifecycle operations like deletion will require fanning-out to all the shards
in the tenant. The same goes for timeline creation and deletion: a timeline should
not be considered created until it has been created in all shards.
#### Selectively enabling sharding for large tenants
Initially, we will explicitly enable sharding for large tenants only.
In future, this hint mechanism will become optional when we implement automatic
re-sharding of tenants.
## Future Phases
This section exists to indicate what will likely come next after this phase.
Phases 2a and 2b are amenable to execution in parallel.
### Phase 2a: WAL fan-out
**Problem**: when all shards consume the whole WAL, the network bandwidth used
for transmitting the WAL from safekeeper to pageservers is multiplied by a factor
of the shard count.
Network bandwidth is not our most pressing bottleneck, but it is likely to become
a problem if we set a modest shard count (~8) on a significant number of tenants,
especially as those larger tenants which we shard are also likely to have higher
write bandwidth than average.
### Phase 2b: Shard Splitting
**Problem**: the number of shards in a tenant is defined at creation time and cannot
be changed. This causes excessive sharding for most small tenants, and an upper
bound on scale for very large tenants.
To address this, a _splitting_ feature will later be added. One shard can split its
data into a number of children by doing a special compaction operation to generate
image layers broken up child-shard-wise, and then writing out an `index_part.json` for
each child. This will then require external coordination (by the control plane) to
safely attach these new child shards and then move them around to distribute work.
The opposite _merging_ operation can also be imagined, but is unlikely to be implemented:
once a Tenant has been sharded, the marginal efficiency benefit of merging is unlikely to justify
the risk/complexity of implementing such a rarely-encountered scenario.
### Phase N (future): distributed historical reads
**Problem**: while sharding based on key is good for handling changes in overall
database size, it is less suitable for spiky/unpredictable changes in the read
workload to historical layers. Sudden increases in historical reads could result
in sudden increases in local disk capacity required for a TenantShard.
Example: the extreme case of this would be to run a tenant for a year, then create branches
with ancestors at monthly intervals. This could lead to a sudden 12x inflation in
the on-disk capacity footprint of a TenantShard, since it would be serving reads
from all those disparate historical layers.
If we can respond fast enough, then key-sharding a tenant more finely can help with
this, but splitting may be a relatively expensive operation and the increased historical
read load may be transient.
A separate mechanism for handling heavy historical reads could be something like
a gossip mechanism for pageservers to communicate
about their workload, and then a getpageatlsn offload mechanism where one pageserver can
ask another to go read the necessary layers from remote storage to serve the read. This
requires relativly little coordination because it is read-only: any node can service any
read. All reads to a particular shard would still flow through one node, but the
disk capactity & I/O impact of servicing the read would be distributed.
## FAQ/Alternatives
### Why stripe the data, rather than using contiguous ranges of keyspace for each shard?
When a database is growing under a write workload, writes may predominantly hit the
end of the keyspace, creating a bandwidth hotspot on that shard. Similarly, if the user
is intensively re-writing a particular relation, if that relation lived in a particular
shard then it would not achieve our goal of distributing the write work across shards.
### Why not proxy read requests through one pageserver, so that endpoints don't have to change?
1. This would not achieve scale-out of network bandwidth: a busy tenant with a large
database would still cause a load hotspot on the pageserver routing its read requests.
2. The additional hop through the "proxy" pageserver would add latency and overall
resource cost (CPU, network bandwidth)
### Layer File Spreading: use one pageserver as the owner of a tenant, and have it spread out work on a per-layer basis to peers
In this model, there would be no explicit sharding of work, but the pageserver to which
a tenant is attached would not hold all layers on its disk: instead, it would call out
to peers to have them store some layers, and call out to those peers to request reads
in those layers.
This mechanism will work well for distributing work in the LSN dimension, but in the key
space dimension it has the major limitation of requiring one node to handle all
incoming writes, and compactions. Even if the write workload for a large database
fits in one pageserver, it will still be a hotspot and such tenants may still
de-facto require their own pageserver.

View File

@@ -0,0 +1,479 @@
# Shard splitting
## Summary
This RFC describes a new pageserver API for splitting an existing tenant shard into
multiple shards, and describes how to use this API to safely increase the total
shard count of a tenant.
## Motivation
In the [sharding RFC](031-sharding-static.md), a mechanism was introduced to scale
tenants beyond the capacity of a single pageserver by breaking up the key space
into stripes, and distributing these stripes across many pageservers. However,
the shard count was defined once at tenant creation time and not varied thereafter.
In practice, the expected size of a database is rarely known at creation time, and
it is inefficient to enable sharding for very small tenants: we need to be
able to create a tenant with a small number of shards (such as 1), and later expand
when it becomes clear that the tenant has grown in size to a point where sharding
is beneficial.
### Prior art
Many distributed systems have the problem of choosing how many shards to create for
tenants that do not specify an expected size up-front. There are a couple of general
approaches:
- Write to a key space in order, and start a new shard when the highest key advances
past some point. This doesn't work well for Neon, because we write to our key space
in many different contiguous ranges (per relation), rather than in one contiguous
range. To adapt to this kind of model, we would need a sharding scheme where each
relation had its own range of shards, which would be inefficient for the common
case of databases with many small relations.
- Monitor the system, and automatically re-shard at some size threshold. For
example in Ceph, the [pg_autoscaler](https://github.com/ceph/ceph/blob/49c27499af4ee9a90f69fcc6bf3597999d6efc7b/src/pybind/mgr/pg_autoscaler/module.py)
component monitors the size of each RADOS Pool, and adjusts the number of Placement
Groups (Ceph's shard equivalent).
## Requirements
- A configurable capacity limit per-shard is enforced.
- Changes in shard count do not interrupt service beyond requiring postgres
to reconnect (i.e. milliseconds).
- Human being does not have to choose shard count
## Non Goals
- Shard splitting is always a tenant-global operation: we will not enable splitting
one shard while leaving others intact.
- The inverse operation (shard merging) is not described in this RFC. This is a lower
priority than splitting, because databases grow more often than they shrink, and
a database with many shards will still work properly if the stored data shrinks, just
with slightly more overhead (e.g. redundant WAL replication)
- Shard splitting is only initiated based on capacity bounds, not load. Splitting
a tenant based on load will make sense for some medium-capacity, high-load workloads,
but is more complex to reason about and likely is not desirable until we have
shard merging to reduce the shard count again if the database becomes less busy.
## Impacted Components
pageserver, storage controller
(the _storage controller_ is the evolution of what was called `attachment_service` in our test environment)
## Terminology
**Parent** shards are the shards that exist before a split. **Child** shards are
the new shards created during a split.
**Shard** is synonymous with _tenant shard_.
**Shard Index** is the 2-tuple of shard number and shard count, written in
paths as {:02x}{:02x}, e.g. `0001`.
## Background
In the implementation section, a couple of existing aspects of sharding are important
to remember:
- Shard identifiers contain the shard number and count, so that "shard 0 of 1" (`0001`) is
a distinct shard from "shard 0 of 2" (`0002`). This is the case in key paths, local
storage paths, and remote index metadata.
- Remote layer file paths contain the shard index of the shard that created them, and
remote indices contain the same index to enable building the layer file path. A shard's
index may reference layers that were created by another shard.
- Local tenant shard directories include the shard index. All layers downloaded by
a tenant shard are stored in this shard-prefixed path, even if those layers were
initially created by another shard: tenant shards do not read and write one anothers'
paths.
- The `Tenant` pageserver type represents one tenant _shard_, not the whole tenant.
This is for historical reasons and will be cleaned up in future, but the existing
name is used here to help comprehension when reading code.
## Implementation
Note: this section focuses on the correctness of the core split process. This will
be fairly inefficient in a naive implementation, and several important optimizations
are described in a later section.
There are broadly two parts to the implementation:
1. The pageserver split API, which splits one shard on one pageserver
2. The overall tenant split proccess which is coordinated by the storage controller,
and calls into the pageserver split API as needed.
### Pageserver Split API
The pageserver will expose a new API endpoint at `/v1/tenant/:tenant_shard_id/shard_split`
that takes the new total shard count in the body.
The pageserver split API operates on one tenant shard, on one pageserver. External
coordination is required to use it safely, this is described in the later
'Split procedure' section.
#### Preparation
First identify the shard indices for the new child shards. These are deterministic,
calculated from the parent shard's index, and the number of children being created (this
is an input to the API, and validated to be a power of two). In a trivial example, splitting
0001 in two always results in 0002 and 0102.
Child shard indices are chosen such that the childrens' parts of the keyspace will
be subsets of the parent's parts of the keyspace.
#### Step 1: write new remote indices
In remote storage, splitting is very simple: we may just write new index_part.json
objects for each child shard, containing exactly the same layers as the parent shard.
The children will have more data than they need, but this avoids any exhausive
re-writing or copying of layer files.
The index key path includes a generation number: the parent shard's current
attached generation number will also be used for the child shards' indices. This
makes the operation safely retryable: if everything crashes and restarts, we may
call the split API again on the parent shard, and the result will be some new remote
indices for the child shards, under a higher generation number.
#### Step 2: start new `Tenant` objects
A new `Tenant` object may be instantiated for each child shard, while the parent
shard still exists. When calling the tenant_spawn function for this object,
the remote index from step 1 will be read, and the child shard will start
to ingest WAL to catch up from whatever was in the remote storage at step 1.
We now wait for child shards' WAL ingestion to catch up with the parent shard,
so that we can safely tear down the parent shard without risking an availability
gap to clients reading recent LSNs.
#### Step 3: tear down parent `Tenant` object
Once child shards are running and have caught up with WAL ingest, we no longer
need the parent shard. Note that clients may still be using it -- when we
shut it down, any page_service handlers will also shut down, causing clients
to disconnect. When the client reconnects, it will re-lookup the tenant,
and hit the child shard instead of the parent (shard lookup from page_service
should bias toward higher ShardCount shards).
Note that at this stage the page service client has not yet been notified of
any split. In the trivial single split example:
- Shard 0001 is gone: Tenant object torn down
- Shards 0002 and 0102 are running on the same pageserver where Shard 0001 used to live.
- Clients will continue to connect to that server thinking that shard 0001 is there,
and all requests will work, because any key that was in shard 0001 is definitely
available in either shard 0002 or shard 0102.
- Eventually, the storage controller (not the pageserver) will decide to migrate
some child shards away: at that point it will do a live migration, ensuring
that the client has an updated configuration before it detaches anything
from the original server.
#### Complete
When we send a 200 response to the split request, we are promising the caller:
- That the child shards are persistent in remote storage
- That the parent shard has been shut down
This enables the caller to proceed with the overall shard split operation, which
may involve other shards on other pageservers.
### Storage Controller Split procedure
Splitting a tenant requires calling the pageserver split API, and tracking
enough state to ensure recovery + completion in the event of any component (pageserver
or storage controller) crashing (or request timing out) during the split.
1. call the split API on all existing shards. Ensure that the resulting
child shards are pinned to their pageservers until _all_ the split calls are done.
This pinning may be implemented as a "split bit" on the tenant shards, that
blocks any migrations, and also acts as a sign that if we restart, we must go
through some recovery steps to resume the split.
2. Once all the split calls are done, we may unpin the child shards (clear
the split bit). The split is now complete: subsequent steps are just migrations,
not strictly part of the split.
3. Try to schedule new pageserver locations for the child shards, using
a soft anti-affinity constraint to place shards from the same tenant onto different
pageservers.
Updating computes about the new shard count is not necessary until we migrate
any of the child shards away from the parent's location.
### Recovering from failures
#### Rolling back an incomplete split
An incomplete shard split may be rolled back quite simply, by attaching the parent shards to pageservers,
and detaching child shards. This will lose any WAL ingested into the children after the parents
were detached earlier, but the parents will catch up.
No special pageserver API is needed for this. From the storage controllers point of view, the
procedure is:
1. For all parent shards in the tenant, ensure they are attached
2. For all child shards, ensure they are not attached
3. Drop child shards from the storage controller's database, and clear the split bit on the parent shards.
Any remote storage content for child shards is left behind. This is similar to other cases where
we may leave garbage objects in S3 (e.g. when we upload a layer but crash before uploading an
index that references it). Future online scrub/cleanup functionality can remove these objects, or
they will be removed when the tenant is deleted, as tenant deletion lists all objects in the prefix,
which would include any child shards that were rolled back.
If any timelines had been created on child shards, they will be lost when rolling back. To mitigate
this, we will **block timeline creation during splitting**, so that we can safely roll back until
the split is complete, without risking losing timelines.
Rolling back an incomplete split will happen automatically if a split fails due to some fatal
reason, and will not be accessible via an API:
- A pageserver fails to complete its split API request after too many retries
- A pageserver returns a fatal unexpected error such as 400 or 500
- The storage controller database returns a non-retryable error
- Some internal invariant is violated in the storage controller split code
#### Rolling back a complete split
A complete shard split may be rolled back similarly to an incomplete split, with the following
modifications:
- The parent shards will no longer exist in the storage controller database, so these must
be re-synthesized somehow: the hard part of this is figuring the parent shards' generations. This
may be accomplished either by probing in S3, or by retaining some tombstone state for deleted
shards in the storage controller database.
- Any timelines that were created after the split complete will disappear when rolling back
to the tenant shards. For this reason, rolling back after a complete split should only
be done due to serious issues where loss of recently created timelines is acceptable, or
in cases where we have confirmed that no timelines were created in the intervening period.
- Parent shards' layers must not have been deleted: this property will come "for free" when
we first roll out sharding, by simply not implementing deletion of parent layers after
a split. When we do implement such deletion (see "Cleaning up parent-shard layers" in the
Optimizations section), it should apply a TTL to layers such that we have a
defined walltime window in which rollback will be possible.
The storage controller will expose an API for rolling back a complete split, for use
in the field if we encounter some critical bug with a post-split tenant.
#### Retrying API calls during Pageserver Restart
When a pageserver restarts during a split API call, it may witness on-disk content for both parent and
child shards from an ongoing split. This does not intrinsically break anything, and the
pageserver may include all these shards in its `/re-attach` request to the storage controller.
In order to support such restarts, it is important that the storage controller stores
persistent records of each child shard before it calls into a pageserver, as these child shards
may require generation increments via a `/re-attach` request.
The pageserver restart will also result in a failed API call from the storage controller's point
of view. Recall that if _any_ pageserver fails to split, the overall split operation may not
complete, and all shards must remain pinned to their current pageserver locations until the
split is done.
The pageserver API calls during splitting will retry on transient errors, so that
short availability gaps do not result in a failure of the overall operation. The
split in progress will be automatically rolled back if the threshold for API
retries is reached (e.g. if a pageserver stays offline for longer than a typical
restart).
#### Rollback on Storage Controller Restart
On startup, the storage controller will inspect the split bit for tenant shards that
it loads from the database. If any splits are in progress:
- Database content will be reverted to the parent shards
- Child shards will be dropped from memory
- The parent and child shards will be included in the general startup reconciliation that
the storage controller does: any child shards will be detached from pageservers because
they don't exist in the storage controller's expected set of shards, and parent shards
will be attached if they aren't already.
#### Storage controller API request failures/retries
The split request handler will implement idempotency: if the [`Tenant`] requested to split
doesn't exist, we will check for the would-be child shards, and if they already exist,
we consider the request complete.
If a request is retried while the original request is still underway, then the split
request handler will notice an InProgress marker in TenantManager, and return 503
to encourage the client to backoff/retry. This is the same as the general pageserver
API handling for calls that try to act on an InProgress shard.
#### Compute start/restart during a split
If a compute starts up during split, it will be configured with the old sharding
configuration. This will work for reads irrespective of the progress of the split
as long as no child hards have been migrated away from their original location, and
this is guaranteed in the split procedure (see earlier section).
#### Pageserver fails permanently during a split
If a pageserver permanently fails (i.e. the storage controller availability state for it
goes to Offline) while a split is in progress, the splitting operation will roll back, and
during the roll back it will skip any API calls to the offline pageserver. If the offline
pageserver becomes available again, any stale locations will be cleaned up via the normal reconciliation process (the `/re-attach` API).
### Handling secondary locations
For correctness, it is not necessary to split secondary locations. We can simply detach
the secondary locations for parent shards, and then attach new secondary locations
for child shards.
Clearly this is not optimal, as it will result in re-downloads of layer files that
were already present on disk. See "Splitting secondary locations"
### Conditions to trigger a split
The pageserver will expose a new API for reporting on shards that are candidates
for split: this will return a top-N report of the largest tenant shards by
physical size (remote size). This should exclude any tenants that are already
at the maximum configured shard count.
The API would look something like:
`/v1/top_n_tenant?shard_count_lt=8&sort_by=resident_size`
The storage controller will poll that API across all pageservers it manages at some appropriate interval (e.g. 60 seconds).
A split operation will be started when the tenant exceeds some threshold. This threshold
should be _less than_ how large we actually want shards to be, perhaps much less. That's to
minimize the amount of work involved in splitting -- if we want 100GiB shards, we shouldn't
wait for a tenant to exceed 100GiB before we split anything. Some data analysis of existing
tenant size distribution may be useful here: if we can make a statement like "usually, if
a tenant has exceeded 20GiB they're probably going to exceed 100GiB later", then we might
make our policy to split a tenant at 20GiB.
The finest split we can do is by factors of two, but we can do higher-cardinality splits
too, and this will help to reduce the overhead of repeatedly re-splitting a tenant
as it grows. An example of a very simple heuristic for early deployment of the splitting
feature would be: "Split tenants into 8 shards when their physical size exceeds 64GiB": that
would give us two kinds of tenant (1 shard and 8 shards), and the confidence that once we had
split a tenant, it will not need re-splitting soon after.
## Optimizations
### Flush parent shard to remote storage during split
Any data that is in WAL but not remote storage at time of split will need
to be replayed by child shards when they start for the first time. To minimize
this work, we may flush the parent shard to remote storage before writing the
remote indices for child shards.
It is important that this flush is subject to some time bounds: we may be splitting
in response to a surge of write ingest, so it may be time-critical to split. A
few seconds to flush latest data should be sufficient to optimize common cases without
running the risk of holding up a split for a harmful length of time when a parent
shard is being written heavily. If the flush doesn't complete in time, we may proceed
to shut down the parent shard and carry on with the split.
### Hard linking parent layers into child shard directories
Before we start the Tenant objects for child shards, we may pre-populate their
local storage directories with hard links to the layer files already present
in the parent shard's local directory. When the child shard starts and downloads
its remote index, it will find all those layer files already present on local disk.
This avoids wasting download capacity and makes splitting faster, but more importantly
it avoids taking up a factor of N more disk space when splitting 1 shard into N.
This mechanism will work well in typical flows where shards are migrated away
promptly after a split, but for the general case including what happens when
layers are evicted and re-downloaded after a split, see the 'Proactive compaction'
section below.
### Filtering during compaction
Compaction, especially image layer generation, should skip any keys that are
present in a shard's layer files, but do not match the shard's ShardIdentity's
is_key_local() check. This avoids carrying around data for longer than necessary
in post-split compactions.
This was already implemented in https://github.com/neondatabase/neon/pull/6246
### Proactive compaction
In remote storage, there is little reason to rewrite any data on a shard split:
all the children can reference parent layers via the very cheap write of the child
index_part.json.
In local storage, things are more nuanced. During the initial split there is no
capacity cost to duplicating parent layers, if we implement the hard linking
optimization described above. However, as soon as any layers are evicted from
local disk and re-downloaded, the downloaded layers will not be hard-links any more:
they'll have real capacity footprint. That isn't a problem if we migrate child shards
away from the parent node swiftly, but it risks a significant over-use of local disk
space if we do not.
For example, if we did an 8-way split of a shard, and then _didn't_ migrate 7 of
the shards elsewhere, then churned all the layers in all the shards via eviction,
then we would blow up the storage capacity used on the node by 8x. If we're splitting
a 100GB shard, that could take the pageserver to the point of exhausting disk space.
To avoid this scenario, we could implement a special compaction mode where we just
read historic layers, drop unwanted keys, and write back the layer file. This
is pretty expensive, but useful if we have split a large shard and are not going to
migrate the child shards away.
The heuristic conditions for triggering such a compaction are:
- A) eviction plus time: if a child shard
has existed for more than a time threshold, and has been requested to perform at least one eviction, then it becomes urgent for this child shard to execute a proactive compaction to reduce its storage footprint, at the cost of I/O load.
- B) resident size plus time: we may inspect the resident layers and calculate how
many of them include the overhead of storing pre-split keys. After some time
threshold (different to the one in case A) we still have such layers occupying
local disk space, then we should proactively compact them.
### Cleaning up parent-shard layers
It is functionally harmless to leave parent shard layers in remote storage indefinitely.
They would be cleaned up in the event of the tenant's deletion.
As an optimization to avoid leaking remote storage capacity (which costs money), we may
lazily clean up parent shard layers once no child shards reference them.
This may be done _very_ lazily: e.g. check every PITR interval. The cleanup procedure is:
- list all the key prefixes beginning with the tenant ID, and select those shard prefixes
which do not belong to the most-recently-split set of shards (_ancestral shards_, i.e. `shard*count < max(shard_count) over all shards)`, and those shard prefixes which do have the latest shard count (_current shards_)
- If there are no _ancestral shard_ prefixes found, we have nothing to clean up and
may drop out now.
- find the latest-generation index for each _current shard_, read all and accumulate the set of layers belonging to ancestral shards referenced by these indices.
- for all ancestral shards, list objects in the prefix and delete any layer which was not
referenced by a current shard.
If this cleanup is scheduled for 1-2 PITR periods after the split, there is a good chance that child shards will have written their own image layers covering the whole keyspace, such that all parent shard layers will be deletable.
The cleanup may be done by the scrubber (external process), or we may choose to have
the zeroth shard in the latest generation do the work -- there is no obstacle to one shard
reading the other shard's indices at runtime, and we do not require visibility of the
latest index writes.
Cleanup should be artificially delayed by some period (for example 24 hours) to ensure
that we retain the option to roll back a split in case of bugs.
### Splitting secondary locations
We may implement a pageserver API similar to the main splitting API, which does a simpler
operation for secondary locations: it would not write anything to S3, instead it would simply
create the child shard directory on local disk, hard link in directories from the parent,
and set up the in memory (TenantSlot) state for the children.
Similar to attached locations, a subset of secondary locations will probably need re-locating
after the split is complete, to avoid leaving multiple child shards on the same pageservers,
where they may use excessive space for the tenant.
## FAQ/Alternatives
### What should the thresholds be set to?
Shard size limit: the pre-sharding default capacity quota for databases was 200GiB, so this could be a starting point for the per-shard size limit.
Max shard count:
- The safekeeper overhead to sharding is currently O(N) network bandwidth because
the un-filtered WAL is sent to all shards. To avoid this growing out of control,
a limit of 8 shards should be temporarily imposed until WAL filtering is implemented
on the safekeeper.
- there is also little benefit to increasing the shard count beyond the number
of pageservers in a region.
### Is it worth just rewriting all the data during a split to simplify reasoning about space?

View File

@@ -40,7 +40,7 @@ macro_rules! register_hll {
}};
($N:literal, $NAME:expr, $HELP:expr $(,)?) => {{
$crate::register_hll!($N, $crate::opts!($NAME, $HELP), $LABELS_NAMES)
$crate::register_hll!($N, $crate::opts!($NAME, $HELP))
}};
}

View File

@@ -29,7 +29,6 @@ pub mod launch_timestamp;
mod wrappers;
pub use wrappers::{CountedReader, CountedWriter};
mod hll;
pub mod metric_vec_duration;
pub use hll::{HyperLogLog, HyperLogLogVec};
#[cfg(target_os = "linux")]
pub mod more_process_metrics;

View File

@@ -1,23 +0,0 @@
//! Helpers for observing duration on `HistogramVec` / `CounterVec` / `GaugeVec` / `MetricVec<T>`.
use std::{future::Future, time::Instant};
pub trait DurationResultObserver {
fn observe_result<T, E>(&self, res: &Result<T, E>, duration: std::time::Duration);
}
pub async fn observe_async_block_duration_by_result<
T,
E,
F: Future<Output = Result<T, E>>,
O: DurationResultObserver,
>(
observer: &O,
block: F,
) -> Result<T, E> {
let start = Instant::now();
let result = block.await;
let duration = start.elapsed();
observer.observe_result(&result, duration);
result
}

View File

@@ -6,7 +6,10 @@ use std::str::FromStr;
use serde::{Deserialize, Serialize};
use utils::id::NodeId;
use crate::{models::ShardParameters, shard::TenantShardId};
use crate::{
models::{ShardParameters, TenantConfig},
shard::{ShardStripeSize, TenantShardId},
};
#[derive(Serialize, Deserialize)]
pub struct TenantCreateResponseShard {
@@ -35,7 +38,7 @@ pub struct NodeRegisterRequest {
pub struct NodeConfigureRequest {
pub node_id: NodeId,
pub availability: Option<NodeAvailability>,
pub availability: Option<NodeAvailabilityWrapper>,
pub scheduling: Option<NodeSchedulingPolicy>,
}
@@ -57,6 +60,31 @@ pub struct TenantLocateResponse {
pub shard_params: ShardParameters,
}
#[derive(Serialize, Deserialize)]
pub struct TenantDescribeResponse {
pub shards: Vec<TenantDescribeResponseShard>,
pub stripe_size: ShardStripeSize,
pub policy: PlacementPolicy,
pub config: TenantConfig,
}
#[derive(Serialize, Deserialize)]
pub struct TenantDescribeResponseShard {
pub tenant_shard_id: TenantShardId,
pub node_attached: Option<NodeId>,
pub node_secondary: Vec<NodeId>,
pub last_error: String,
/// A task is currently running to reconcile this tenant's intent state with the state on pageservers
pub is_reconciling: bool,
/// This shard failed in sending a compute notification to the cloud control plane, and a retry is pending.
pub is_pending_compute_notification: bool,
/// A shard split is currently underway
pub is_splitting: bool,
}
/// Explicitly migrating a particular shard is a low level operation
/// TODO: higher level "Reschedule tenant" operation where the request
/// specifies some constraints, e.g. asking it to get off particular node(s)
@@ -66,30 +94,82 @@ pub struct TenantShardMigrateRequest {
pub node_id: NodeId,
}
#[derive(Serialize, Deserialize, Clone, Copy, Eq, PartialEq)]
/// Utilisation score indicating how good a candidate a pageserver
/// is for scheduling the next tenant. See [`crate::models::PageserverUtilization`].
/// Lower values are better.
#[derive(Serialize, Deserialize, Clone, Copy, Eq, PartialEq, PartialOrd, Ord)]
pub struct UtilizationScore(pub u64);
impl UtilizationScore {
pub fn worst() -> Self {
UtilizationScore(u64::MAX)
}
}
#[derive(Serialize, Clone, Copy)]
#[serde(into = "NodeAvailabilityWrapper")]
pub enum NodeAvailability {
// Normal, happy state
Active,
Active(UtilizationScore),
// Offline: Tenants shouldn't try to attach here, but they may assume that their
// secondary locations on this node still exist. Newly added nodes are in this
// state until we successfully contact them.
Offline,
}
impl PartialEq for NodeAvailability {
fn eq(&self, other: &Self) -> bool {
use NodeAvailability::*;
matches!((self, other), (Active(_), Active(_)) | (Offline, Offline))
}
}
impl Eq for NodeAvailability {}
// This wrapper provides serde functionality and it should only be used to
// communicate with external callers which don't know or care about the
// utilisation score of the pageserver it is targeting.
#[derive(Serialize, Deserialize, Clone)]
pub enum NodeAvailabilityWrapper {
Active,
Offline,
}
impl From<NodeAvailabilityWrapper> for NodeAvailability {
fn from(val: NodeAvailabilityWrapper) -> Self {
match val {
// Assume the worst utilisation score to begin with. It will later be updated by
// the heartbeats.
NodeAvailabilityWrapper::Active => NodeAvailability::Active(UtilizationScore::worst()),
NodeAvailabilityWrapper::Offline => NodeAvailability::Offline,
}
}
}
impl From<NodeAvailability> for NodeAvailabilityWrapper {
fn from(val: NodeAvailability) -> Self {
match val {
NodeAvailability::Active(_) => NodeAvailabilityWrapper::Active,
NodeAvailability::Offline => NodeAvailabilityWrapper::Offline,
}
}
}
impl FromStr for NodeAvailability {
type Err = anyhow::Error;
fn from_str(s: &str) -> Result<Self, Self::Err> {
match s {
"active" => Ok(Self::Active),
// This is used when parsing node configuration requests from neon-local.
// Assume the worst possible utilisation score
// and let it get updated via the heartbeats.
"active" => Ok(Self::Active(UtilizationScore::worst())),
"offline" => Ok(Self::Offline),
_ => Err(anyhow::anyhow!("Unknown availability state '{s}'")),
}
}
}
/// FIXME: this is a duplicate of the type in the attachment_service crate, because the
/// type needs to be defined with diesel traits in there.
#[derive(Serialize, Deserialize, Clone, Copy, Eq, PartialEq)]
pub enum NodeSchedulingPolicy {
Active,
@@ -125,5 +205,42 @@ impl From<NodeSchedulingPolicy> for String {
}
}
/// Controls how tenant shards are mapped to locations on pageservers, e.g. whether
/// to create secondary locations.
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, Eq)]
pub enum PlacementPolicy {
/// Normal live state: one attached pageserver and zero or more secondaries.
Attached(usize),
/// Create one secondary mode locations. This is useful when onboarding
/// a tenant, or for an idle tenant that we might want to bring online quickly.
Secondary,
/// Do not attach to any pageservers. This is appropriate for tenants that
/// have been idle for a long time, where we do not mind some delay in making
/// them available in future.
Detached,
}
#[derive(Serialize, Deserialize, Debug)]
pub struct TenantShardMigrateResponse {}
#[cfg(test)]
mod test {
use super::*;
use serde_json;
/// Check stability of PlacementPolicy's serialization
#[test]
fn placement_policy_encoding() -> anyhow::Result<()> {
let v = PlacementPolicy::Attached(1);
let encoded = serde_json::to_string(&v)?;
assert_eq!(encoded, "{\"Attached\":1}");
assert_eq!(serde_json::from_str::<PlacementPolicy>(&encoded)?, v);
let v = PlacementPolicy::Detached;
let encoded = serde_json::to_string(&v)?;
assert_eq!(encoded, "\"Detached\"");
assert_eq!(serde_json::from_str::<PlacementPolicy>(&encoded)?, v);
Ok(())
}
}

View File

@@ -4,6 +4,7 @@ pub mod utilization;
pub use utilization::PageserverUtilization;
use std::{
borrow::Cow,
collections::HashMap,
io::{BufRead, Read},
num::{NonZeroU64, NonZeroUsize},
@@ -21,6 +22,7 @@ use utils::{
lsn::Lsn,
};
use crate::controller_api::PlacementPolicy;
use crate::{
reltag::RelTag,
shard::{ShardCount, ShardStripeSize, TenantShardId},
@@ -197,6 +199,13 @@ pub struct TimelineCreateRequest {
#[derive(Serialize, Deserialize)]
pub struct TenantShardSplitRequest {
pub new_shard_count: u8,
// A tenant's stripe size is only meaningful the first time their shard count goes
// above 1: therefore during a split from 1->N shards, we may modify the stripe size.
//
// If this is set while the stripe count is being increased from an already >1 value,
// then the request will fail with 400.
pub new_stripe_size: Option<ShardStripeSize>,
}
#[derive(Serialize, Deserialize)]
@@ -242,6 +251,11 @@ pub struct TenantCreateRequest {
#[serde(skip_serializing_if = "ShardParameters::is_unsharded")]
pub shard_parameters: ShardParameters,
// This parameter is only meaningful in requests sent to the storage controller
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub placement_policy: Option<PlacementPolicy>,
#[serde(flatten)]
pub config: TenantConfig, // as we have a flattened field, we should reject all unknown fields in it
}
@@ -413,7 +427,7 @@ pub struct StatusResponse {
#[derive(Serialize, Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct TenantLocationConfigRequest {
pub tenant_id: TenantShardId,
pub tenant_id: Option<TenantShardId>,
#[serde(flatten)]
pub config: LocationConfig, // as we have a flattened field, we should reject all unknown fields in it
}
@@ -435,6 +449,8 @@ pub struct TenantShardLocation {
#[serde(deny_unknown_fields)]
pub struct TenantLocationConfigResponse {
pub shards: Vec<TenantShardLocation>,
// If the shards' ShardCount count is >1, stripe_size will be set.
pub stripe_size: Option<ShardStripeSize>,
}
#[derive(Serialize, Deserialize, Debug)]
@@ -562,7 +578,7 @@ pub struct TimelineInfo {
pub walreceiver_status: String,
}
#[derive(Debug, Clone, Serialize)]
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LayerMapInfo {
pub in_memory_layers: Vec<InMemoryLayerInfo>,
pub historic_layers: Vec<HistoricLayerInfo>,
@@ -580,7 +596,7 @@ pub enum LayerAccessKind {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LayerAccessStatFullDetails {
pub when_millis_since_epoch: u64,
pub task_kind: &'static str,
pub task_kind: Cow<'static, str>,
pub access_kind: LayerAccessKind,
}
@@ -639,23 +655,23 @@ impl LayerResidenceEvent {
}
}
#[derive(Debug, Clone, Serialize)]
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct LayerAccessStats {
pub access_count_by_access_kind: HashMap<LayerAccessKind, u64>,
pub task_kind_access_flag: Vec<&'static str>,
pub task_kind_access_flag: Vec<Cow<'static, str>>,
pub first: Option<LayerAccessStatFullDetails>,
pub accesses_history: HistoryBufferWithDropCounter<LayerAccessStatFullDetails, 16>,
pub residence_events_history: HistoryBufferWithDropCounter<LayerResidenceEvent, 16>,
}
#[derive(Debug, Clone, Serialize)]
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "kind")]
pub enum InMemoryLayerInfo {
Open { lsn_start: Lsn },
Frozen { lsn_start: Lsn, lsn_end: Lsn },
}
#[derive(Debug, Clone, Serialize)]
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "kind")]
pub enum HistoricLayerInfo {
Delta {
@@ -677,6 +693,32 @@ pub enum HistoricLayerInfo {
},
}
impl HistoricLayerInfo {
pub fn layer_file_name(&self) -> &str {
match self {
HistoricLayerInfo::Delta {
layer_file_name, ..
} => layer_file_name,
HistoricLayerInfo::Image {
layer_file_name, ..
} => layer_file_name,
}
}
pub fn is_remote(&self) -> bool {
match self {
HistoricLayerInfo::Delta { remote, .. } => *remote,
HistoricLayerInfo::Image { remote, .. } => *remote,
}
}
pub fn set_remote(&mut self, value: bool) {
let field = match self {
HistoricLayerInfo::Delta { remote, .. } => remote,
HistoricLayerInfo::Image { remote, .. } => remote,
};
*field = value;
}
}
#[derive(Debug, Serialize, Deserialize)]
pub struct DownloadRemoteLayersTaskSpawnRequest {
pub max_concurrent_downloads: NonZeroUsize,
@@ -709,6 +751,52 @@ pub struct WalRedoManagerStatus {
pub pid: Option<u32>,
}
/// The progress of a secondary tenant is mostly useful when doing a long running download: e.g. initiating
/// a download job, timing out while waiting for it to run, and then inspecting this status to understand
/// what's happening.
#[derive(Default, Debug, Serialize, Deserialize, Clone)]
pub struct SecondaryProgress {
/// The remote storage LastModified time of the heatmap object we last downloaded.
#[serde(
serialize_with = "opt_ser_rfc3339_millis",
deserialize_with = "opt_deser_rfc3339_millis"
)]
pub heatmap_mtime: Option<SystemTime>,
/// The number of layers currently on-disk
pub layers_downloaded: usize,
/// The number of layers in the most recently seen heatmap
pub layers_total: usize,
/// The number of layer bytes currently on-disk
pub bytes_downloaded: u64,
/// The number of layer bytes in the most recently seen heatmap
pub bytes_total: u64,
}
fn opt_ser_rfc3339_millis<S: serde::Serializer>(
ts: &Option<SystemTime>,
serializer: S,
) -> Result<S::Ok, S::Error> {
match ts {
Some(ts) => serializer.collect_str(&humantime::format_rfc3339_millis(*ts)),
None => serializer.serialize_none(),
}
}
fn opt_deser_rfc3339_millis<'de, D>(deserializer: D) -> Result<Option<SystemTime>, D::Error>
where
D: serde::de::Deserializer<'de>,
{
let s: Option<String> = serde::de::Deserialize::deserialize(deserializer)?;
match s {
None => Ok(None),
Some(s) => humantime::parse_rfc3339(&s)
.map_err(serde::de::Error::custom)
.map(Some),
}
}
pub mod virtual_file {
#[derive(
Copy,

View File

@@ -7,7 +7,7 @@ use std::time::SystemTime;
///
/// `format: int64` fields must use `ser_saturating_u63` because openapi generated clients might
/// not handle full u64 values properly.
#[derive(serde::Serialize, Debug)]
#[derive(serde::Serialize, serde::Deserialize, Debug, Clone)]
pub struct PageserverUtilization {
/// Used disk space
#[serde(serialize_with = "ser_saturating_u63")]
@@ -21,7 +21,10 @@ pub struct PageserverUtilization {
/// When was this snapshot captured, pageserver local time.
///
/// Use millis to give confidence that the value is regenerated often enough.
#[serde(serialize_with = "ser_rfc3339_millis")]
#[serde(
serialize_with = "ser_rfc3339_millis",
deserialize_with = "deser_rfc3339_millis"
)]
pub captured_at: SystemTime,
}
@@ -32,6 +35,14 @@ fn ser_rfc3339_millis<S: serde::Serializer>(
serializer.collect_str(&humantime::format_rfc3339_millis(*ts))
}
fn deser_rfc3339_millis<'de, D>(deserializer: D) -> Result<SystemTime, D::Error>
where
D: serde::de::Deserializer<'de>,
{
let s: String = serde::de::Deserialize::deserialize(deserializer)?;
humantime::parse_rfc3339(&s).map_err(serde::de::Error::custom)
}
/// openapi knows only `format: int64`, so avoid outputting a non-parseable value by generated clients.
///
/// Instead of newtype, use this because a newtype would get require handling deserializing values

View File

@@ -6,19 +6,36 @@
use serde::{Deserialize, Serialize};
use utils::id::NodeId;
use crate::shard::TenantShardId;
use crate::{
controller_api::NodeRegisterRequest, models::LocationConfigMode, shard::TenantShardId,
};
/// Upcall message sent by the pageserver to the configured `control_plane_api` on
/// startup.
#[derive(Serialize, Deserialize)]
pub struct ReAttachRequest {
pub node_id: NodeId,
/// Optional inline self-registration: this is useful with the storage controller,
/// if the node already has a node_id set.
#[serde(skip_serializing_if = "Option::is_none", default)]
pub register: Option<NodeRegisterRequest>,
}
#[derive(Serialize, Deserialize)]
fn default_mode() -> LocationConfigMode {
LocationConfigMode::AttachedSingle
}
#[derive(Serialize, Deserialize, Debug)]
pub struct ReAttachResponseTenant {
pub id: TenantShardId,
pub gen: u32,
}
/// Mandatory if LocationConfigMode is None or set to an Attached* mode
pub gen: Option<u32>,
/// Default value only for backward compat: this field should be set
#[serde(default = "default_mode")]
pub mode: LocationConfigMode,
}
#[derive(Serialize, Deserialize)]
pub struct ReAttachResponse {
pub tenants: Vec<ReAttachResponseTenant>,

View File

@@ -6,7 +6,6 @@
#![deny(clippy::undocumented_unsafe_blocks)]
use anyhow::Context;
use bytes::Bytes;
use futures::pin_mut;
use serde::{Deserialize, Serialize};
use std::io::ErrorKind;
use std::net::SocketAddr;
@@ -378,8 +377,7 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> PostgresBackend<IO> {
&mut self,
cx: &mut std::task::Context<'_>,
) -> Poll<Result<(), std::io::Error>> {
let flush_fut = self.flush();
pin_mut!(flush_fut);
let flush_fut = std::pin::pin!(self.flush());
flush_fut.poll(cx)
}

View File

@@ -72,14 +72,19 @@ async fn simple_select() {
}
}
static KEY: Lazy<rustls::PrivateKey> = Lazy::new(|| {
static KEY: Lazy<rustls::pki_types::PrivateKeyDer<'static>> = Lazy::new(|| {
let mut cursor = Cursor::new(include_bytes!("key.pem"));
rustls::PrivateKey(rustls_pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone())
let key = rustls_pemfile::rsa_private_keys(&mut cursor)
.next()
.unwrap()
.unwrap();
rustls::pki_types::PrivateKeyDer::Pkcs1(key)
});
static CERT: Lazy<rustls::Certificate> = Lazy::new(|| {
static CERT: Lazy<rustls::pki_types::CertificateDer<'static>> = Lazy::new(|| {
let mut cursor = Cursor::new(include_bytes!("cert.pem"));
rustls::Certificate(rustls_pemfile::certs(&mut cursor).unwrap()[0].clone())
let cert = rustls_pemfile::certs(&mut cursor).next().unwrap().unwrap();
cert
});
// test that basic select with ssl works
@@ -88,9 +93,8 @@ async fn simple_select_ssl() {
let (client_sock, server_sock) = make_tcp_pair().await;
let server_cfg = rustls::ServerConfig::builder()
.with_safe_defaults()
.with_no_client_auth()
.with_single_cert(vec![CERT.clone()], KEY.clone())
.with_single_cert(vec![CERT.clone()], KEY.clone_key())
.unwrap();
let tls_config = Some(Arc::new(server_cfg));
let pgbackend =
@@ -102,10 +106,9 @@ async fn simple_select_ssl() {
});
let client_cfg = rustls::ClientConfig::builder()
.with_safe_defaults()
.with_root_certificates({
let mut store = rustls::RootCertStore::empty();
store.add(&CERT).unwrap();
store.add(CERT.clone()).unwrap();
store
})
.with_no_client_auth();

View File

@@ -1,5 +1,6 @@
use anyhow::*;
use clap::{value_parser, Arg, ArgMatches, Command};
use postgres::Client;
use std::{path::PathBuf, str::FromStr};
use wal_craft::*;
@@ -8,8 +9,8 @@ fn main() -> Result<()> {
.init();
let arg_matches = cli().get_matches();
let wal_craft = |arg_matches: &ArgMatches, client| {
let (intermediate_lsns, end_of_wal_lsn) = match arg_matches
let wal_craft = |arg_matches: &ArgMatches, client: &mut Client| {
let intermediate_lsns = match arg_matches
.get_one::<String>("type")
.map(|s| s.as_str())
.context("'type' is required")?
@@ -25,6 +26,7 @@ fn main() -> Result<()> {
LastWalRecordCrossingSegment::NAME => LastWalRecordCrossingSegment::craft(client)?,
a => panic!("Unknown --type argument: {a}"),
};
let end_of_wal_lsn = client.pg_current_wal_insert_lsn()?;
for lsn in intermediate_lsns {
println!("intermediate_lsn = {lsn}");
}

View File

@@ -5,7 +5,6 @@ use postgres::types::PgLsn;
use postgres::Client;
use postgres_ffi::{WAL_SEGMENT_SIZE, XLOG_BLCKSZ};
use postgres_ffi::{XLOG_SIZE_OF_XLOG_RECORD, XLOG_SIZE_OF_XLOG_SHORT_PHD};
use std::cmp::Ordering;
use std::path::{Path, PathBuf};
use std::process::Command;
use std::time::{Duration, Instant};
@@ -232,59 +231,52 @@ pub fn ensure_server_config(client: &mut impl postgres::GenericClient) -> anyhow
pub trait Crafter {
const NAME: &'static str;
/// Generates WAL using the client `client`. Returns a pair of:
/// * A vector of some valid "interesting" intermediate LSNs which one may start reading from.
/// May include or exclude Lsn(0) and the end-of-wal.
/// * The expected end-of-wal LSN.
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<(Vec<PgLsn>, PgLsn)>;
/// Generates WAL using the client `client`. Returns a vector of some valid
/// "interesting" intermediate LSNs which one may start reading from.
/// test_end_of_wal uses this to check various starting points.
///
/// Note that postgres is generally keen about writing some WAL. While we
/// try to disable it (autovacuum, big wal_writer_delay, etc) it is always
/// possible, e.g. xl_running_xacts are dumped each 15s. So checks about
/// stable WAL end would be flaky unless postgres is shut down. For this
/// reason returning potential end of WAL here is pointless. Most of the
/// time this doesn't happen though, so it is reasonable to create needed
/// WAL structure and immediately kill postgres like test_end_of_wal does.
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<Vec<PgLsn>>;
}
/// Wraps some WAL craft function, providing current LSN to it before the
/// insertion and flushing WAL afterwards. Also pushes initial LSN to the
/// result.
fn craft_internal<C: postgres::GenericClient>(
client: &mut C,
f: impl Fn(&mut C, PgLsn) -> anyhow::Result<(Vec<PgLsn>, Option<PgLsn>)>,
) -> anyhow::Result<(Vec<PgLsn>, PgLsn)> {
f: impl Fn(&mut C, PgLsn) -> anyhow::Result<Vec<PgLsn>>,
) -> anyhow::Result<Vec<PgLsn>> {
ensure_server_config(client)?;
let initial_lsn = client.pg_current_wal_insert_lsn()?;
info!("LSN initial = {}", initial_lsn);
let (mut intermediate_lsns, last_lsn) = f(client, initial_lsn)?;
let last_lsn = match last_lsn {
None => client.pg_current_wal_insert_lsn()?,
Some(last_lsn) => {
let insert_lsn = client.pg_current_wal_insert_lsn()?;
match last_lsn.cmp(&insert_lsn) {
Ordering::Less => bail!(
"Some records were inserted after the crafted WAL: {} vs {}",
last_lsn,
insert_lsn
),
Ordering::Equal => last_lsn,
Ordering::Greater => bail!("Reported LSN is greater than insert_lsn"),
}
}
};
let mut intermediate_lsns = f(client, initial_lsn)?;
if !intermediate_lsns.starts_with(&[initial_lsn]) {
intermediate_lsns.insert(0, initial_lsn);
}
// Some records may be not flushed, e.g. non-transactional logical messages.
//
// Note: this is broken if pg_current_wal_insert_lsn is at page boundary
// because pg_current_wal_insert_lsn skips page headers.
client.execute("select neon_xlogflush(pg_current_wal_insert_lsn())", &[])?;
match last_lsn.cmp(&client.pg_current_wal_flush_lsn()?) {
Ordering::Less => bail!("Some records were flushed after the crafted WAL"),
Ordering::Equal => {}
Ordering::Greater => bail!("Reported LSN is greater than flush_lsn"),
}
Ok((intermediate_lsns, last_lsn))
Ok(intermediate_lsns)
}
pub struct Simple;
impl Crafter for Simple {
const NAME: &'static str = "simple";
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<(Vec<PgLsn>, PgLsn)> {
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<Vec<PgLsn>> {
craft_internal(client, |client, _| {
client.execute("CREATE table t(x int)", &[])?;
Ok((Vec::new(), None))
Ok(Vec::new())
})
}
}
@@ -292,29 +284,36 @@ impl Crafter for Simple {
pub struct LastWalRecordXlogSwitch;
impl Crafter for LastWalRecordXlogSwitch {
const NAME: &'static str = "last_wal_record_xlog_switch";
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<(Vec<PgLsn>, PgLsn)> {
// Do not use generate_internal because here we end up with flush_lsn exactly on
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<Vec<PgLsn>> {
// Do not use craft_internal because here we end up with flush_lsn exactly on
// the segment boundary and insert_lsn after the initial page header, which is unusual.
ensure_server_config(client)?;
client.execute("CREATE table t(x int)", &[])?;
let before_xlog_switch = client.pg_current_wal_insert_lsn()?;
let after_xlog_switch: PgLsn = client.query_one("SELECT pg_switch_wal()", &[])?.get(0);
let next_segment = PgLsn::from(0x0200_0000);
// pg_switch_wal returns end of last record of the switched segment,
// i.e. end of SWITCH itself.
let xlog_switch_record_end: PgLsn = client.query_one("SELECT pg_switch_wal()", &[])?.get(0);
let before_xlog_switch_u64 = u64::from(before_xlog_switch);
let next_segment = PgLsn::from(
before_xlog_switch_u64 - (before_xlog_switch_u64 % WAL_SEGMENT_SIZE as u64)
+ WAL_SEGMENT_SIZE as u64,
);
ensure!(
after_xlog_switch <= next_segment,
"XLOG_SWITCH message ended after the expected segment boundary: {} > {}",
after_xlog_switch,
xlog_switch_record_end <= next_segment,
"XLOG_SWITCH record ended after the expected segment boundary: {} > {}",
xlog_switch_record_end,
next_segment
);
Ok((vec![before_xlog_switch, after_xlog_switch], next_segment))
Ok(vec![before_xlog_switch, xlog_switch_record_end])
}
}
pub struct LastWalRecordXlogSwitchEndsOnPageBoundary;
/// Craft xlog SWITCH record ending at page boundary.
impl Crafter for LastWalRecordXlogSwitchEndsOnPageBoundary {
const NAME: &'static str = "last_wal_record_xlog_switch_ends_on_page_boundary";
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<(Vec<PgLsn>, PgLsn)> {
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<Vec<PgLsn>> {
// Do not use generate_internal because here we end up with flush_lsn exactly on
// the segment boundary and insert_lsn after the initial page header, which is unusual.
ensure_server_config(client)?;
@@ -361,28 +360,29 @@ impl Crafter for LastWalRecordXlogSwitchEndsOnPageBoundary {
// Emit the XLOG_SWITCH
let before_xlog_switch = client.pg_current_wal_insert_lsn()?;
let after_xlog_switch: PgLsn = client.query_one("SELECT pg_switch_wal()", &[])?.get(0);
let xlog_switch_record_end: PgLsn = client.query_one("SELECT pg_switch_wal()", &[])?.get(0);
let next_segment = PgLsn::from(0x0200_0000);
ensure!(
after_xlog_switch < next_segment,
"XLOG_SWITCH message ended on or after the expected segment boundary: {} > {}",
after_xlog_switch,
xlog_switch_record_end < next_segment,
"XLOG_SWITCH record ended on or after the expected segment boundary: {} > {}",
xlog_switch_record_end,
next_segment
);
ensure!(
u64::from(after_xlog_switch) as usize % XLOG_BLCKSZ == XLOG_SIZE_OF_XLOG_SHORT_PHD,
u64::from(xlog_switch_record_end) as usize % XLOG_BLCKSZ == XLOG_SIZE_OF_XLOG_SHORT_PHD,
"XLOG_SWITCH message ended not on page boundary: {}, offset = {}",
after_xlog_switch,
u64::from(after_xlog_switch) as usize % XLOG_BLCKSZ
xlog_switch_record_end,
u64::from(xlog_switch_record_end) as usize % XLOG_BLCKSZ
);
Ok((vec![before_xlog_switch, after_xlog_switch], next_segment))
Ok(vec![before_xlog_switch, xlog_switch_record_end])
}
}
fn craft_single_logical_message(
/// Write ~16MB logical message; it should cross WAL segment.
fn craft_seg_size_logical_message(
client: &mut impl postgres::GenericClient,
transactional: bool,
) -> anyhow::Result<(Vec<PgLsn>, PgLsn)> {
) -> anyhow::Result<Vec<PgLsn>> {
craft_internal(client, |client, initial_lsn| {
ensure!(
initial_lsn < PgLsn::from(0x0200_0000 - 1024 * 1024),
@@ -405,34 +405,24 @@ fn craft_single_logical_message(
"Logical message crossed two segments"
);
if transactional {
// Transactional logical messages are part of a transaction, so the one above is
// followed by a small COMMIT record.
let after_message_lsn = client.pg_current_wal_insert_lsn()?;
ensure!(
message_lsn < after_message_lsn,
"No record found after the emitted message"
);
Ok((vec![message_lsn], Some(after_message_lsn)))
} else {
Ok((Vec::new(), Some(message_lsn)))
}
Ok(vec![message_lsn])
})
}
pub struct WalRecordCrossingSegmentFollowedBySmallOne;
impl Crafter for WalRecordCrossingSegmentFollowedBySmallOne {
const NAME: &'static str = "wal_record_crossing_segment_followed_by_small_one";
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<(Vec<PgLsn>, PgLsn)> {
craft_single_logical_message(client, true)
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<Vec<PgLsn>> {
// Transactional message crossing WAL segment will be followed by small
// commit record.
craft_seg_size_logical_message(client, true)
}
}
pub struct LastWalRecordCrossingSegment;
impl Crafter for LastWalRecordCrossingSegment {
const NAME: &'static str = "last_wal_record_crossing_segment";
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<(Vec<PgLsn>, PgLsn)> {
craft_single_logical_message(client, false)
fn craft(client: &mut impl postgres::GenericClient) -> anyhow::Result<Vec<PgLsn>> {
craft_seg_size_logical_message(client, false)
}
}

View File

@@ -11,13 +11,15 @@ use utils::const_assert;
use utils::lsn::Lsn;
fn init_logging() {
let _ = env_logger::Builder::from_env(env_logger::Env::default().default_filter_or(
format!("crate=info,postgres_ffi::{PG_MAJORVERSION}::xlog_utils=trace"),
))
let _ = env_logger::Builder::from_env(env_logger::Env::default().default_filter_or(format!(
"crate=info,postgres_ffi::{PG_MAJORVERSION}::xlog_utils=trace"
)))
.is_test(true)
.try_init();
}
/// Test that find_end_of_wal returns the same results as pg_dump on various
/// WALs created by Crafter.
fn test_end_of_wal<C: crate::Crafter>(test_name: &str) {
use crate::*;
@@ -38,13 +40,13 @@ fn test_end_of_wal<C: crate::Crafter>(test_name: &str) {
}
cfg.initdb().unwrap();
let srv = cfg.start_server().unwrap();
let (intermediate_lsns, expected_end_of_wal_partial) =
C::craft(&mut srv.connect_with_timeout().unwrap()).unwrap();
let intermediate_lsns = C::craft(&mut srv.connect_with_timeout().unwrap()).unwrap();
let intermediate_lsns: Vec<Lsn> = intermediate_lsns
.iter()
.map(|&lsn| u64::from(lsn).into())
.collect();
let expected_end_of_wal: Lsn = u64::from(expected_end_of_wal_partial).into();
// Kill postgres. Note that it might have inserted to WAL something after
// 'craft' did its job.
srv.kill();
// Check find_end_of_wal on the initial WAL
@@ -56,7 +58,7 @@ fn test_end_of_wal<C: crate::Crafter>(test_name: &str) {
.filter(|fname| IsXLogFileName(fname))
.max()
.unwrap();
check_pg_waldump_end_of_wal(&cfg, &last_segment, expected_end_of_wal);
let expected_end_of_wal = find_pg_waldump_end_of_wal(&cfg, &last_segment);
for start_lsn in intermediate_lsns
.iter()
.chain(std::iter::once(&expected_end_of_wal))
@@ -91,11 +93,7 @@ fn test_end_of_wal<C: crate::Crafter>(test_name: &str) {
}
}
fn check_pg_waldump_end_of_wal(
cfg: &crate::Conf,
last_segment: &str,
expected_end_of_wal: Lsn,
) {
fn find_pg_waldump_end_of_wal(cfg: &crate::Conf, last_segment: &str) -> Lsn {
// Get the actual end of WAL by pg_waldump
let waldump_output = cfg
.pg_waldump("000000010000000000000001", last_segment)
@@ -113,11 +111,8 @@ fn check_pg_waldump_end_of_wal(
}
};
let waldump_wal_end = Lsn::from_str(caps.get(1).unwrap().as_str()).unwrap();
info!(
"waldump erred on {}, expected wal end at {}",
waldump_wal_end, expected_end_of_wal
);
assert_eq!(waldump_wal_end, expected_end_of_wal);
info!("waldump erred on {}", waldump_wal_end);
waldump_wal_end
}
fn check_end_of_wal(
@@ -210,9 +205,9 @@ pub fn test_update_next_xid() {
#[test]
pub fn test_encode_logical_message() {
let expected = [
64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 21, 0, 0, 170, 34, 166, 227, 255,
38, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 112, 114,
101, 102, 105, 120, 0, 109, 101, 115, 115, 97, 103, 101,
64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 21, 0, 0, 170, 34, 166, 227, 255, 38,
0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 112, 114, 101, 102,
105, 120, 0, 109, 101, 115, 115, 97, 103, 101,
];
let actual = encode_logical_message("prefix", "message");
assert_eq!(expected, actual[..]);

View File

@@ -18,6 +18,7 @@ camino.workspace = true
humantime.workspace = true
hyper = { workspace = true, features = ["stream"] }
futures.workspace = true
rand.workspace = true
serde.workspace = true
serde_json.workspace = true
tokio = { workspace = true, features = ["sync", "fs", "io-util"] }

View File

@@ -157,9 +157,8 @@ impl AzureBlobStorage {
let mut bufs = Vec::new();
while let Some(part) = response.next().await {
let part = part?;
let etag_str: &str = part.blob.properties.etag.as_ref();
if etag.is_none() {
etag = Some(etag.unwrap_or_else(|| etag_str.to_owned()));
etag = Some(part.blob.properties.etag);
}
if last_modified.is_none() {
last_modified = Some(part.blob.properties.last_modified.into());
@@ -174,6 +173,16 @@ impl AzureBlobStorage {
.map_err(|e| DownloadError::Other(e.into()))?;
bufs.push(data);
}
if bufs.is_empty() {
return Err(DownloadError::Other(anyhow::anyhow!(
"Azure GET response contained no buffers"
)));
}
// unwrap safety: if these were None, bufs would be empty and we would have returned an error already
let etag = etag.unwrap();
let last_modified = last_modified.unwrap();
Ok(Download {
download_stream: Box::pin(futures::stream::iter(bufs.into_iter().map(Ok))),
etag,

View File

@@ -42,6 +42,9 @@ pub use self::{
};
use s3_bucket::RequestKind;
/// Azure SDK's ETag type is a simple String wrapper: we use this internally instead of repeating it here.
pub use azure_core::Etag;
pub use error::{DownloadError, TimeTravelError, TimeoutOrCancel};
/// Currently, sync happens with AWS S3, that has two limits on requests per second:
@@ -291,9 +294,9 @@ pub type DownloadStream =
pub struct Download {
pub download_stream: DownloadStream,
/// The last time the file was modified (`last-modified` HTTP header)
pub last_modified: Option<SystemTime>,
pub last_modified: SystemTime,
/// A way to identify this specific version of the resource (`etag` HTTP header)
pub etag: Option<String>,
pub etag: Etag,
/// Extra key-value data, associated with the current remote file.
pub metadata: Option<StorageMetadata>,
}

View File

@@ -10,7 +10,7 @@ use std::{
io::ErrorKind,
num::NonZeroU32,
pin::Pin,
time::{Duration, SystemTime},
time::{Duration, SystemTime, UNIX_EPOCH},
};
use anyhow::{bail, ensure, Context};
@@ -30,6 +30,7 @@ use crate::{
};
use super::{RemoteStorage, StorageMetadata};
use crate::Etag;
const LOCAL_FS_TEMP_FILE_SUFFIX: &str = "___temp";
@@ -197,6 +198,7 @@ impl LocalFs {
fs::OpenOptions::new()
.write(true)
.create(true)
.truncate(true)
.open(&temp_file_path)
.await
.with_context(|| {
@@ -406,35 +408,37 @@ impl RemoteStorage for LocalFs {
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
let target_path = from.with_base(&self.storage_root);
if file_exists(&target_path).map_err(DownloadError::BadInput)? {
let source = ReaderStream::new(
fs::OpenOptions::new()
.read(true)
.open(&target_path)
.await
.with_context(|| {
format!("Failed to open source file {target_path:?} to use in the download")
})
.map_err(DownloadError::Other)?,
);
let metadata = self
.read_storage_metadata(&target_path)
let file_metadata = file_metadata(&target_path).await?;
let source = ReaderStream::new(
fs::OpenOptions::new()
.read(true)
.open(&target_path)
.await
.map_err(DownloadError::Other)?;
.with_context(|| {
format!("Failed to open source file {target_path:?} to use in the download")
})
.map_err(DownloadError::Other)?,
);
let cancel_or_timeout = crate::support::cancel_or_timeout(self.timeout, cancel.clone());
let source = crate::support::DownloadStream::new(cancel_or_timeout, source);
let metadata = self
.read_storage_metadata(&target_path)
.await
.map_err(DownloadError::Other)?;
Ok(Download {
metadata,
last_modified: None,
etag: None,
download_stream: Box::pin(source),
})
} else {
Err(DownloadError::NotFound)
}
let cancel_or_timeout = crate::support::cancel_or_timeout(self.timeout, cancel.clone());
let source = crate::support::DownloadStream::new(cancel_or_timeout, source);
let etag = mock_etag(&file_metadata);
Ok(Download {
metadata,
last_modified: file_metadata
.modified()
.map_err(|e| DownloadError::Other(anyhow::anyhow!(e).context("Reading mtime")))?,
etag,
download_stream: Box::pin(source),
})
}
async fn download_byte_range(
@@ -452,50 +456,51 @@ impl RemoteStorage for LocalFs {
return Err(DownloadError::Other(anyhow::anyhow!("Invalid range, start ({start_inclusive}) and end_exclusive ({end_exclusive:?}) difference is zero bytes")));
}
}
let target_path = from.with_base(&self.storage_root);
if file_exists(&target_path).map_err(DownloadError::BadInput)? {
let mut source = tokio::fs::OpenOptions::new()
.read(true)
.open(&target_path)
.await
.with_context(|| {
format!("Failed to open source file {target_path:?} to use in the download")
})
.map_err(DownloadError::Other)?;
let len = source
.metadata()
.await
.context("query file length")
.map_err(DownloadError::Other)?
.len();
source
.seek(io::SeekFrom::Start(start_inclusive))
.await
.context("Failed to seek to the range start in a local storage file")
.map_err(DownloadError::Other)?;
let metadata = self
.read_storage_metadata(&target_path)
.await
.map_err(DownloadError::Other)?;
let source = source.take(end_exclusive.unwrap_or(len) - start_inclusive);
let source = ReaderStream::new(source);
let cancel_or_timeout = crate::support::cancel_or_timeout(self.timeout, cancel.clone());
let source = crate::support::DownloadStream::new(cancel_or_timeout, source);
Ok(Download {
metadata,
last_modified: None,
etag: None,
download_stream: Box::pin(source),
let file_metadata = file_metadata(&target_path).await?;
let mut source = tokio::fs::OpenOptions::new()
.read(true)
.open(&target_path)
.await
.with_context(|| {
format!("Failed to open source file {target_path:?} to use in the download")
})
} else {
Err(DownloadError::NotFound)
}
.map_err(DownloadError::Other)?;
let len = source
.metadata()
.await
.context("query file length")
.map_err(DownloadError::Other)?
.len();
source
.seek(io::SeekFrom::Start(start_inclusive))
.await
.context("Failed to seek to the range start in a local storage file")
.map_err(DownloadError::Other)?;
let metadata = self
.read_storage_metadata(&target_path)
.await
.map_err(DownloadError::Other)?;
let source = source.take(end_exclusive.unwrap_or(len) - start_inclusive);
let source = ReaderStream::new(source);
let cancel_or_timeout = crate::support::cancel_or_timeout(self.timeout, cancel.clone());
let source = crate::support::DownloadStream::new(cancel_or_timeout, source);
let etag = mock_etag(&file_metadata);
Ok(Download {
metadata,
last_modified: file_metadata
.modified()
.map_err(|e| DownloadError::Other(anyhow::anyhow!(e).context("Reading mtime")))?,
etag,
download_stream: Box::pin(source),
})
}
async fn delete(&self, path: &RemotePath, _cancel: &CancellationToken) -> anyhow::Result<()> {
@@ -610,13 +615,22 @@ async fn create_target_directory(target_file_path: &Utf8Path) -> anyhow::Result<
Ok(())
}
fn file_exists(file_path: &Utf8Path) -> anyhow::Result<bool> {
if file_path.exists() {
ensure!(file_path.is_file(), "file path '{file_path}' is not a file");
Ok(true)
} else {
Ok(false)
}
async fn file_metadata(file_path: &Utf8Path) -> Result<std::fs::Metadata, DownloadError> {
tokio::fs::metadata(&file_path).await.map_err(|e| {
if e.kind() == ErrorKind::NotFound {
DownloadError::NotFound
} else {
DownloadError::BadInput(e.into())
}
})
}
// Use mtime as stand-in for ETag. We could calculate a meaningful one by md5'ing the contents of files we
// read, but that's expensive and the local_fs test helper's whole reason for existence is to run small tests
// quickly, with less overhead than using a mock S3 server.
fn mock_etag(meta: &std::fs::Metadata) -> Etag {
let mtime = meta.modified().expect("Filesystem mtime missing");
format!("{}", mtime.duration_since(UNIX_EPOCH).unwrap().as_millis()).into()
}
#[cfg(test)]

View File

@@ -35,8 +35,8 @@ use aws_sdk_s3::{
};
use aws_smithy_async::rt::sleep::TokioSleep;
use aws_smithy_types::byte_stream::ByteStream;
use aws_smithy_types::{body::SdkBody, DateTime};
use aws_smithy_types::{byte_stream::ByteStream, date_time::ConversionError};
use bytes::Bytes;
use futures::stream::Stream;
use hyper::Body;
@@ -287,8 +287,17 @@ impl S3Bucket {
let remaining = self.timeout.saturating_sub(started_at.elapsed());
let metadata = object_output.metadata().cloned().map(StorageMetadata);
let etag = object_output.e_tag;
let last_modified = object_output.last_modified.and_then(|t| t.try_into().ok());
let etag = object_output
.e_tag
.ok_or(DownloadError::Other(anyhow::anyhow!("Missing ETag header")))?
.into();
let last_modified = object_output
.last_modified
.ok_or(DownloadError::Other(anyhow::anyhow!(
"Missing LastModified header"
)))?
.try_into()
.map_err(|e: ConversionError| DownloadError::Other(e.into()))?;
let body = object_output.body;
let body = ByteStreamAsStream::from(body);

View File

@@ -17,6 +17,7 @@ use remote_storage::{
};
use test_context::test_context;
use test_context::AsyncTestContext;
use tokio::io::AsyncBufReadExt;
use tokio_util::sync::CancellationToken;
use tracing::info;
@@ -117,7 +118,7 @@ async fn s3_time_travel_recovery_works(ctx: &mut MaybeEnabledStorage) -> anyhow:
// A little check to ensure that our clock is not too far off from the S3 clock
{
let dl = retry(|| ctx.client.download(&path2, &cancel)).await?;
let last_modified = dl.last_modified.unwrap();
let last_modified = dl.last_modified;
let half_wt = WAIT_TIME.mul_f32(0.5);
let t0_hwt = t0 + half_wt;
let t1_hwt = t1 - half_wt;
@@ -484,32 +485,33 @@ async fn download_is_cancelled(ctx: &mut MaybeEnabledStorage) {
))
.unwrap();
let len = upload_large_enough_file(&ctx.client, &path, &cancel).await;
let file_len = upload_large_enough_file(&ctx.client, &path, &cancel).await;
{
let mut stream = ctx
let stream = ctx
.client
.download(&path, &cancel)
.await
.expect("download succeeds")
.download_stream;
let first = stream
.next()
.await
.expect("should have the first blob")
.expect("should have succeeded");
let mut reader = std::pin::pin!(tokio_util::io::StreamReader::new(stream));
tracing::info!(len = first.len(), "downloaded first chunk");
let first = reader.fill_buf().await.expect("should have the first blob");
let len = first.len();
tracing::info!(len, "downloaded first chunk");
assert!(
first.len() < len,
first.len() < file_len,
"uploaded file is too small, we downloaded all on first chunk"
);
reader.consume(len);
cancel.cancel();
let next = stream.next().await.expect("stream should have more");
let next = reader.fill_buf().await;
let e = next.expect_err("expected an error, but got a chunk?");
@@ -520,6 +522,10 @@ async fn download_is_cancelled(ctx: &mut MaybeEnabledStorage) {
.is_some_and(|e| matches!(e, DownloadError::Cancelled)),
"{inner:?}"
);
let e = DownloadError::from(e);
assert!(matches!(e, DownloadError::Cancelled), "{e:?}");
}
let cancel = CancellationToken::new();

View File

@@ -247,7 +247,7 @@ fn scenario_4() {
//
// This is in total 5000 + 1000 + 5000 + 1000 = 12000
//
// (If we used the the method from the previous scenario, and
// (If we used the method from the previous scenario, and
// kept only snapshot at the branch point, we'd need to keep
// all the WAL between 10000-18000 on the main branch, so
// the total size would be 5000 + 1000 + 8000 = 14000. The

View File

@@ -13,6 +13,7 @@ testing = ["fail/failpoints"]
[dependencies]
arc-swap.workspace = true
sentry.workspace = true
async-compression.workspace = true
async-trait.workspace = true
anyhow.workspace = true
bincode.workspace = true
@@ -36,6 +37,7 @@ serde_json.workspace = true
signal-hook.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-tar.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-error.workspace = true
@@ -46,6 +48,7 @@ strum.workspace = true
strum_macros.workspace = true
url.workspace = true
uuid.workspace = true
walkdir.workspace = true
pq_proto.workspace = true
postgres_connection.workspace = true

View File

@@ -47,9 +47,10 @@ impl<T, const L: usize> ops::Deref for HistoryBufferWithDropCounter<T, L> {
}
}
#[derive(serde::Serialize)]
#[derive(serde::Serialize, serde::Deserialize)]
struct SerdeRepr<T> {
buffer: Vec<T>,
buffer_size: usize,
drop_count: u64,
}
@@ -61,6 +62,7 @@ where
let HistoryBufferWithDropCounter { buffer, drop_count } = value;
SerdeRepr {
buffer: buffer.iter().cloned().collect(),
buffer_size: L,
drop_count: *drop_count,
}
}
@@ -78,19 +80,52 @@ where
}
}
impl<'de, T, const L: usize> serde::de::Deserialize<'de> for HistoryBufferWithDropCounter<T, L>
where
T: Clone + serde::Deserialize<'de>,
{
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
let SerdeRepr {
buffer: des_buffer,
drop_count,
buffer_size,
} = SerdeRepr::<T>::deserialize(deserializer)?;
if buffer_size != L {
use serde::de::Error;
return Err(D::Error::custom(format!(
"invalid buffer_size, expecting {L} got {buffer_size}"
)));
}
let mut buffer = HistoryBuffer::new();
buffer.extend(des_buffer);
Ok(HistoryBufferWithDropCounter { buffer, drop_count })
}
}
#[cfg(test)]
mod test {
use super::HistoryBufferWithDropCounter;
#[test]
fn test_basics() {
let mut b = HistoryBufferWithDropCounter::<_, 2>::default();
let mut b = HistoryBufferWithDropCounter::<usize, 2>::default();
b.write(1);
b.write(2);
b.write(3);
assert!(b.iter().any(|e| *e == 2));
assert!(b.iter().any(|e| *e == 3));
assert!(!b.iter().any(|e| *e == 1));
// round-trip serde
let round_tripped: HistoryBufferWithDropCounter<usize, 2> =
serde_json::from_str(&serde_json::to_string(&b).unwrap()).unwrap();
assert_eq!(
round_tripped.iter().cloned().collect::<Vec<_>>(),
b.iter().cloned().collect::<Vec<_>>()
);
}
#[test]

View File

@@ -245,7 +245,7 @@ impl std::io::Write for ChannelWriter {
}
}
async fn prometheus_metrics_handler(_req: Request<Body>) -> Result<Response<Body>, ApiError> {
pub async fn prometheus_metrics_handler(_req: Request<Body>) -> Result<Response<Body>, ApiError> {
SERVE_METRICS_COUNT.inc();
let started_at = std::time::Instant::now();
@@ -367,7 +367,6 @@ pub fn make_router() -> RouterBuilder<hyper::Body, ApiError> {
.middleware(Middleware::post_with_info(
add_request_id_header_to_response,
))
.get("/metrics", |r| request_span(r, prometheus_metrics_handler))
.err_handler(route_error_handler)
}

View File

@@ -87,6 +87,8 @@ pub mod failpoint_support;
pub mod yielding_loop;
pub mod zstd;
/// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages
///
/// we have several cases:

View File

@@ -63,6 +63,7 @@ impl UnwrittenLockFile {
pub fn create_exclusive(lock_file_path: &Utf8Path) -> anyhow::Result<UnwrittenLockFile> {
let lock_file = fs::OpenOptions::new()
.create(true) // O_CREAT
.truncate(true)
.write(true)
.open(lock_file_path)
.context("open lock file")?;

View File

@@ -29,12 +29,10 @@ pub struct PageserverFeedback {
// Serialize with RFC3339 format.
#[serde(with = "serde_systemtime")]
pub replytime: SystemTime,
/// Used to track feedbacks from different shards. Always zero for unsharded tenants.
pub shard_number: u32,
}
// NOTE: Do not forget to increment this number when adding new fields to PageserverFeedback.
// Do not remove previously available fields because this might be backwards incompatible.
pub const PAGESERVER_FEEDBACK_FIELDS_NUMBER: u8 = 5;
impl PageserverFeedback {
pub fn empty() -> PageserverFeedback {
PageserverFeedback {
@@ -43,6 +41,7 @@ impl PageserverFeedback {
remote_consistent_lsn: Lsn::INVALID,
disk_consistent_lsn: Lsn::INVALID,
replytime: *PG_EPOCH,
shard_number: 0,
}
}
@@ -59,17 +58,26 @@ impl PageserverFeedback {
//
// TODO: change serialized fields names once all computes migrate to rename.
pub fn serialize(&self, buf: &mut BytesMut) {
buf.put_u8(PAGESERVER_FEEDBACK_FIELDS_NUMBER); // # of keys
let buf_ptr = buf.len();
buf.put_u8(0); // # of keys, will be filled later
let mut nkeys = 0;
nkeys += 1;
buf.put_slice(b"current_timeline_size\0");
buf.put_i32(8);
buf.put_u64(self.current_timeline_size);
nkeys += 1;
buf.put_slice(b"ps_writelsn\0");
buf.put_i32(8);
buf.put_u64(self.last_received_lsn.0);
nkeys += 1;
buf.put_slice(b"ps_flushlsn\0");
buf.put_i32(8);
buf.put_u64(self.disk_consistent_lsn.0);
nkeys += 1;
buf.put_slice(b"ps_applylsn\0");
buf.put_i32(8);
buf.put_u64(self.remote_consistent_lsn.0);
@@ -80,9 +88,19 @@ impl PageserverFeedback {
.expect("failed to serialize pg_replytime earlier than PG_EPOCH")
.as_micros() as i64;
nkeys += 1;
buf.put_slice(b"ps_replytime\0");
buf.put_i32(8);
buf.put_i64(timestamp);
if self.shard_number > 0 {
nkeys += 1;
buf.put_slice(b"shard_number\0");
buf.put_i32(4);
buf.put_u32(self.shard_number);
}
buf[buf_ptr] = nkeys;
}
// Deserialize PageserverFeedback message
@@ -123,6 +141,11 @@ impl PageserverFeedback {
rf.replytime = *PG_EPOCH - Duration::from_micros(-raw_time as u64);
}
}
b"shard_number" => {
let len = buf.get_i32();
assert_eq!(len, 4);
rf.shard_number = buf.get_u32();
}
_ => {
let len = buf.get_i32();
warn!(
@@ -194,10 +217,7 @@ mod tests {
rf.serialize(&mut data);
// Add an extra field to the buffer and adjust number of keys
if let Some(first) = data.first_mut() {
*first = PAGESERVER_FEEDBACK_FIELDS_NUMBER + 1;
}
data[0] += 1;
data.put_slice(b"new_field_one\0");
data.put_i32(8);
data.put_u64(42);

View File

@@ -110,6 +110,49 @@ impl<T> OnceCell<T> {
}
}
/// Returns a guard to an existing initialized value, or returns an unique initialization
/// permit which can be used to initialize this `OnceCell` using `OnceCell::set`.
pub async fn get_or_init_detached(&self) -> Result<Guard<'_, T>, InitPermit> {
// It looks like OnceCell::get_or_init could be implemented using this method instead of
// duplication. However, that makes the future be !Send due to possibly holding on to the
// MutexGuard over an await point.
loop {
let sem = {
let guard = self.inner.lock().unwrap();
if guard.value.is_some() {
return Ok(Guard(guard));
}
guard.init_semaphore.clone()
};
{
let permit = {
// increment the count for the duration of queued
let _guard = CountWaitingInitializers::start(self);
sem.acquire().await
};
let Ok(permit) = permit else {
let guard = self.inner.lock().unwrap();
if !Arc::ptr_eq(&sem, &guard.init_semaphore) {
// there was a take_and_deinit in between
continue;
}
assert!(
guard.value.is_some(),
"semaphore got closed, must be initialized"
);
return Ok(Guard(guard));
};
permit.forget();
}
let permit = InitPermit(sem);
return Err(permit);
}
}
/// Assuming a permit is held after previous call to [`Guard::take_and_deinit`], it can be used
/// to complete initializing the inner value.
///
@@ -202,7 +245,7 @@ impl<'a, T> Guard<'a, T> {
///
/// The permit will be on a semaphore part of the new internal value, and any following
/// [`OnceCell::get_or_init`] will wait on it to complete.
pub fn take_and_deinit(&mut self) -> (T, InitPermit) {
pub fn take_and_deinit(mut self) -> (T, InitPermit) {
let mut swapped = Inner::default();
let sem = swapped.init_semaphore.clone();
// acquire and forget right away, moving the control over to InitPermit
@@ -481,4 +524,39 @@ mod tests {
assert_eq!("t1", *cell.get().unwrap());
}
#[tokio::test(start_paused = true)]
async fn detached_init_smoke() {
let target = OnceCell::default();
let Err(permit) = target.get_or_init_detached().await else {
unreachable!("it is not initialized")
};
tokio::time::timeout(
std::time::Duration::from_secs(3600 * 24 * 7 * 365),
target.get_or_init(|permit2| async { Ok::<_, Infallible>((11, permit2)) }),
)
.await
.expect_err("should timeout since we are already holding the permit");
target.set(42, permit);
let (_answer, permit) = {
let guard = target
.get_or_init(|permit| async { Ok::<_, Infallible>((11, permit)) })
.await
.unwrap();
assert_eq!(*guard, 42);
guard.take_and_deinit()
};
assert!(target.get().is_none());
target.set(11, permit);
assert_eq!(*target.get().unwrap(), 11);
}
}

View File

@@ -1,27 +1,60 @@
use std::{alloc::Layout, cmp::Ordering, ops::RangeBounds};
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VecMapOrdering {
Greater,
GreaterOrEqual,
}
/// Ordered map datastructure implemented in a Vec.
/// Append only - can only add keys that are larger than the
/// current max key.
/// Ordering can be adjusted using [`VecMapOrdering`]
/// during `VecMap` construction.
#[derive(Clone, Debug)]
pub struct VecMap<K, V>(Vec<(K, V)>);
pub struct VecMap<K, V> {
data: Vec<(K, V)>,
ordering: VecMapOrdering,
}
impl<K, V> Default for VecMap<K, V> {
fn default() -> Self {
VecMap(Default::default())
VecMap {
data: Default::default(),
ordering: VecMapOrdering::Greater,
}
}
}
#[derive(Debug)]
pub struct InvalidKey;
#[derive(thiserror::Error, Debug)]
pub enum VecMapError {
#[error("Key violates ordering constraint")]
InvalidKey,
#[error("Mismatched ordering constraints")]
ExtendOrderingError,
}
impl<K: Ord, V> VecMap<K, V> {
pub fn new(ordering: VecMapOrdering) -> Self {
Self {
data: Vec::new(),
ordering,
}
}
pub fn with_capacity(capacity: usize, ordering: VecMapOrdering) -> Self {
Self {
data: Vec::with_capacity(capacity),
ordering,
}
}
pub fn is_empty(&self) -> bool {
self.0.is_empty()
self.data.is_empty()
}
pub fn as_slice(&self) -> &[(K, V)] {
self.0.as_slice()
self.data.as_slice()
}
/// This function may panic if given a range where the lower bound is
@@ -29,7 +62,7 @@ impl<K: Ord, V> VecMap<K, V> {
pub fn slice_range<R: RangeBounds<K>>(&self, range: R) -> &[(K, V)] {
use std::ops::Bound::*;
let binary_search = |k: &K| self.0.binary_search_by_key(&k, extract_key);
let binary_search = |k: &K| self.data.binary_search_by_key(&k, extract_key);
let start_idx = match range.start_bound() {
Unbounded => 0,
@@ -41,7 +74,7 @@ impl<K: Ord, V> VecMap<K, V> {
};
let end_idx = match range.end_bound() {
Unbounded => self.0.len(),
Unbounded => self.data.len(),
Included(k) => match binary_search(k) {
Ok(idx) => idx + 1,
Err(idx) => idx,
@@ -49,34 +82,30 @@ impl<K: Ord, V> VecMap<K, V> {
Excluded(k) => binary_search(k).unwrap_or_else(std::convert::identity),
};
&self.0[start_idx..end_idx]
&self.data[start_idx..end_idx]
}
/// Add a key value pair to the map.
/// If `key` is less than or equal to the current maximum key
/// the pair will not be added and InvalidKey error will be returned.
pub fn append(&mut self, key: K, value: V) -> Result<usize, InvalidKey> {
if let Some((last_key, _last_value)) = self.0.last() {
if &key <= last_key {
return Err(InvalidKey);
}
}
/// If `key` is not respective of the `self` ordering the
/// pair will not be added and `InvalidKey` error will be returned.
pub fn append(&mut self, key: K, value: V) -> Result<usize, VecMapError> {
self.validate_key_order(&key)?;
let delta_size = self.instrument_vec_op(|vec| vec.push((key, value)));
Ok(delta_size)
}
/// Update the maximum key value pair or add a new key value pair to the map.
/// If `key` is less than the current maximum key no updates or additions
/// will occur and InvalidKey error will be returned.
/// If `key` is not respective of the `self` ordering no updates or additions
/// will occur and `InvalidKey` error will be returned.
pub fn append_or_update_last(
&mut self,
key: K,
mut value: V,
) -> Result<(Option<V>, usize), InvalidKey> {
if let Some((last_key, last_value)) = self.0.last_mut() {
) -> Result<(Option<V>, usize), VecMapError> {
if let Some((last_key, last_value)) = self.data.last_mut() {
match key.cmp(last_key) {
Ordering::Less => return Err(InvalidKey),
Ordering::Less => return Err(VecMapError::InvalidKey),
Ordering::Equal => {
std::mem::swap(last_value, &mut value);
const DELTA_SIZE: usize = 0;
@@ -100,40 +129,67 @@ impl<K: Ord, V> VecMap<K, V> {
V: Clone,
{
let split_idx = self
.0
.data
.binary_search_by_key(&cutoff, extract_key)
.unwrap_or_else(std::convert::identity);
(
VecMap(self.0[..split_idx].to_vec()),
VecMap(self.0[split_idx..].to_vec()),
VecMap {
data: self.data[..split_idx].to_vec(),
ordering: self.ordering,
},
VecMap {
data: self.data[split_idx..].to_vec(),
ordering: self.ordering,
},
)
}
/// Move items from `other` to the end of `self`, leaving `other` empty.
/// If any keys in `other` is less than or equal to any key in `self`,
/// `InvalidKey` error will be returned and no mutation will occur.
pub fn extend(&mut self, other: &mut Self) -> Result<usize, InvalidKey> {
let self_last_opt = self.0.last().map(extract_key);
let other_first_opt = other.0.last().map(extract_key);
/// If the `other` ordering is different from `self` ordering
/// `ExtendOrderingError` error will be returned.
/// If any keys in `other` is not respective of the ordering defined in
/// `self`, `InvalidKey` error will be returned and no mutation will occur.
pub fn extend(&mut self, other: &mut Self) -> Result<usize, VecMapError> {
if self.ordering != other.ordering {
return Err(VecMapError::ExtendOrderingError);
}
if let (Some(self_last), Some(other_first)) = (self_last_opt, other_first_opt) {
if self_last >= other_first {
return Err(InvalidKey);
let other_first_opt = other.data.last().map(extract_key);
if let Some(other_first) = other_first_opt {
self.validate_key_order(other_first)?;
}
let delta_size = self.instrument_vec_op(|vec| vec.append(&mut other.data));
Ok(delta_size)
}
/// Validate the current last key in `self` and key being
/// inserted against the order defined in `self`.
fn validate_key_order(&self, key: &K) -> Result<(), VecMapError> {
if let Some(last_key) = self.data.last().map(extract_key) {
match (&self.ordering, &key.cmp(last_key)) {
(VecMapOrdering::Greater, Ordering::Less | Ordering::Equal) => {
return Err(VecMapError::InvalidKey);
}
(VecMapOrdering::Greater, Ordering::Greater) => {}
(VecMapOrdering::GreaterOrEqual, Ordering::Less) => {
return Err(VecMapError::InvalidKey);
}
(VecMapOrdering::GreaterOrEqual, Ordering::Equal | Ordering::Greater) => {}
}
}
let delta_size = self.instrument_vec_op(|vec| vec.append(&mut other.0));
Ok(delta_size)
Ok(())
}
/// Instrument an operation on the underlying [`Vec`].
/// Will panic if the operation decreases capacity.
/// Returns the increase in memory usage caused by the op.
fn instrument_vec_op(&mut self, op: impl FnOnce(&mut Vec<(K, V)>)) -> usize {
let old_cap = self.0.capacity();
op(&mut self.0);
let new_cap = self.0.capacity();
let old_cap = self.data.capacity();
op(&mut self.data);
let new_cap = self.data.capacity();
match old_cap.cmp(&new_cap) {
Ordering::Less => {
@@ -145,6 +201,36 @@ impl<K: Ord, V> VecMap<K, V> {
Ordering::Greater => panic!("VecMap capacity shouldn't ever decrease"),
}
}
/// Similar to `from_iter` defined in `FromIter` trait except
/// that it accepts an [`VecMapOrdering`]
pub fn from_iter<I: IntoIterator<Item = (K, V)>>(iter: I, ordering: VecMapOrdering) -> Self {
let iter = iter.into_iter();
let initial_capacity = {
match iter.size_hint() {
(lower_bound, None) => lower_bound,
(_, Some(upper_bound)) => upper_bound,
}
};
let mut vec_map = VecMap::with_capacity(initial_capacity, ordering);
for (key, value) in iter {
vec_map
.append(key, value)
.expect("The passed collection needs to be sorted!");
}
vec_map
}
}
impl<K: Ord, V> IntoIterator for VecMap<K, V> {
type Item = (K, V);
type IntoIter = std::vec::IntoIter<(K, V)>;
fn into_iter(self) -> Self::IntoIter {
self.data.into_iter()
}
}
fn extract_key<K, V>(entry: &(K, V)) -> &K {
@@ -155,7 +241,7 @@ fn extract_key<K, V>(entry: &(K, V)) -> &K {
mod tests {
use std::{collections::BTreeMap, ops::Bound};
use super::VecMap;
use super::{VecMap, VecMapOrdering};
#[test]
fn unbounded_range() {
@@ -310,5 +396,59 @@ mod tests {
left.extend(&mut one_map).unwrap_err();
assert_eq!(left.as_slice(), &[(0, ()), (1, ())]);
assert_eq!(one_map.as_slice(), &[(1, ())]);
let mut map_greater_or_equal = VecMap::new(VecMapOrdering::GreaterOrEqual);
map_greater_or_equal.append(2, ()).unwrap();
map_greater_or_equal.append(2, ()).unwrap();
left.extend(&mut map_greater_or_equal).unwrap_err();
assert_eq!(left.as_slice(), &[(0, ()), (1, ())]);
assert_eq!(map_greater_or_equal.as_slice(), &[(2, ()), (2, ())]);
}
#[test]
fn extend_with_ordering() {
let mut left = VecMap::new(VecMapOrdering::GreaterOrEqual);
left.append(0, ()).unwrap();
assert_eq!(left.as_slice(), &[(0, ())]);
let mut greater_right = VecMap::new(VecMapOrdering::Greater);
greater_right.append(0, ()).unwrap();
left.extend(&mut greater_right).unwrap_err();
assert_eq!(left.as_slice(), &[(0, ())]);
let mut greater_or_equal_right = VecMap::new(VecMapOrdering::GreaterOrEqual);
greater_or_equal_right.append(2, ()).unwrap();
greater_or_equal_right.append(2, ()).unwrap();
left.extend(&mut greater_or_equal_right).unwrap();
assert_eq!(left.as_slice(), &[(0, ()), (2, ()), (2, ())]);
}
#[test]
fn vec_map_from_sorted() {
let vec = vec![(1, ()), (2, ()), (3, ()), (6, ())];
let vec_map = VecMap::from_iter(vec, VecMapOrdering::Greater);
assert_eq!(vec_map.as_slice(), &[(1, ()), (2, ()), (3, ()), (6, ())]);
let vec = vec![(1, ()), (2, ()), (3, ()), (3, ()), (6, ()), (6, ())];
let vec_map = VecMap::from_iter(vec, VecMapOrdering::GreaterOrEqual);
assert_eq!(
vec_map.as_slice(),
&[(1, ()), (2, ()), (3, ()), (3, ()), (6, ()), (6, ())]
);
}
#[test]
#[should_panic]
fn vec_map_from_unsorted_greater() {
let vec = vec![(1, ()), (2, ()), (2, ()), (3, ()), (6, ())];
let _ = VecMap::from_iter(vec, VecMapOrdering::Greater);
}
#[test]
#[should_panic]
fn vec_map_from_unsorted_greater_or_equal() {
let vec = vec![(1, ()), (2, ()), (3, ()), (6, ()), (5, ())];
let _ = VecMap::from_iter(vec, VecMapOrdering::GreaterOrEqual);
}
}

78
libs/utils/src/zstd.rs Normal file
View File

@@ -0,0 +1,78 @@
use std::io::SeekFrom;
use anyhow::{Context, Result};
use async_compression::{
tokio::{bufread::ZstdDecoder, write::ZstdEncoder},
zstd::CParameter,
Level,
};
use camino::Utf8Path;
use nix::NixPath;
use tokio::{
fs::{File, OpenOptions},
io::AsyncBufRead,
io::AsyncSeekExt,
io::AsyncWriteExt,
};
use tokio_tar::{Archive, Builder, HeaderMode};
use walkdir::WalkDir;
/// Creates a Zstandard tarball.
pub async fn create_zst_tarball(path: &Utf8Path, tarball: &Utf8Path) -> Result<(File, u64)> {
let file = OpenOptions::new()
.create(true)
.truncate(true)
.read(true)
.write(true)
.open(&tarball)
.await
.with_context(|| format!("tempfile creation {tarball}"))?;
let mut paths = Vec::new();
for entry in WalkDir::new(path) {
let entry = entry?;
let metadata = entry.metadata().expect("error getting dir entry metadata");
// Also allow directories so that we also get empty directories
if !(metadata.is_file() || metadata.is_dir()) {
continue;
}
let path = entry.into_path();
paths.push(path);
}
// Do a sort to get a more consistent listing
paths.sort_unstable();
let zstd = ZstdEncoder::with_quality_and_params(
file,
Level::Default,
&[CParameter::enable_long_distance_matching(true)],
);
let mut builder = Builder::new(zstd);
// Use reproducible header mode
builder.mode(HeaderMode::Deterministic);
for p in paths {
let rel_path = p.strip_prefix(path)?;
if rel_path.is_empty() {
// The top directory should not be compressed,
// the tar crate doesn't like that
continue;
}
builder.append_path_with_name(&p, rel_path).await?;
}
let mut zstd = builder.into_inner().await?;
zstd.shutdown().await?;
let mut compressed = zstd.into_inner();
let compressed_len = compressed.metadata().await?.len();
compressed.seek(SeekFrom::Start(0)).await?;
Ok((compressed, compressed_len))
}
/// Creates a Zstandard tarball.
pub async fn extract_zst_tarball(
path: &Utf8Path,
tarball: impl AsyncBufRead + Unpin,
) -> Result<()> {
let decoder = Box::pin(ZstdDecoder::new(tarball));
let mut archive = Archive::new(decoder);
archive.unpack(path).await?;
Ok(())
}

View File

@@ -69,7 +69,7 @@ pub struct Config {
/// should be removed once we have a better solution there.
sys_buffer_bytes: u64,
/// Minimum fraction of total system memory reserved *before* the the cgroup threshold; in
/// Minimum fraction of total system memory reserved *before* the cgroup threshold; in
/// other words, providing a ceiling for the highest value of the threshold by enforcing that
/// there's at least `cgroup_min_overhead_fraction` of the total memory remaining beyond the
/// threshold.

View File

@@ -324,11 +324,11 @@ extern "C" fn finish_sync_safekeepers(wp: *mut WalProposer, lsn: XLogRecPtr) {
}
}
extern "C" fn process_safekeeper_feedback(wp: *mut WalProposer, commit_lsn: XLogRecPtr) {
extern "C" fn process_safekeeper_feedback(wp: *mut WalProposer, sk: *mut Safekeeper) {
unsafe {
let callback_data = (*(*wp).config).callback_data;
let api = callback_data as *mut Box<dyn ApiImpl>;
(*api).process_safekeeper_feedback(&mut (*wp), commit_lsn)
(*api).process_safekeeper_feedback(&mut (*wp), &mut (*sk));
}
}

View File

@@ -142,7 +142,7 @@ pub trait ApiImpl {
todo!()
}
fn process_safekeeper_feedback(&self, _wp: &mut WalProposer, _commit_lsn: u64) {
fn process_safekeeper_feedback(&mut self, _wp: &mut WalProposer, _sk: &mut Safekeeper) {
todo!()
}

View File

@@ -59,6 +59,7 @@ signal-hook.workspace = true
smallvec = { workspace = true, features = ["write"] }
svg_fmt.workspace = true
sync_wrapper.workspace = true
sysinfo.workspace = true
tokio-tar.workspace = true
thiserror.workspace = true
tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util", "time"] }
@@ -89,6 +90,9 @@ enumset = { workspace = true, features = ["serde"]}
strum.workspace = true
strum_macros.workspace = true
[target.'cfg(target_os = "linux")'.dependencies]
procfs.workspace = true
[dev-dependencies]
criterion.workspace = true
hex-literal.workspace = true

View File

@@ -1,160 +1,156 @@
//! Simple benchmarking around walredo.
//! Quantify a single walredo manager's throughput under N concurrent callers.
//!
//! Right now they hope to just set a baseline. Later we can try to expand into latency and
//! throughput after figuring out the coordinated omission problems below.
//! The benchmark implementation ([`bench_impl`]) is parametrized by
//! - `redo_work` => [`Request::short_request`] or [`Request::medium_request`]
//! - `n_redos` => number of times the benchmark shell execute the `redo_work`
//! - `nclients` => number of clients (more on this shortly).
//!
//! There are two sets of inputs; `short` and `medium`. They were collected on postgres v14 by
//! logging what happens when a sequential scan is requested on a small table, then picking out two
//! suitable from logs.
//! The benchmark impl sets up a multi-threaded tokio runtime with default parameters.
//! It spawns `nclients` times [`client`] tokio tasks.
//! Each task executes the `redo_work` `n_redos/nclients` times.
//!
//! We exercise the following combinations:
//! - `redo_work = short / medium``
//! - `nclients = [1, 2, 4, 8, 16, 32, 64, 128]`
//!
//! Reference data (git blame to see commit) on an i3en.3xlarge
// ```text
//! short/short/1 time: [39.175 µs 39.348 µs 39.536 µs]
//! short/short/2 time: [51.227 µs 51.487 µs 51.755 µs]
//! short/short/4 time: [76.048 µs 76.362 µs 76.674 µs]
//! short/short/8 time: [128.94 µs 129.82 µs 130.74 µs]
//! short/short/16 time: [227.84 µs 229.00 µs 230.28 µs]
//! short/short/32 time: [455.97 µs 457.81 µs 459.90 µs]
//! short/short/64 time: [902.46 µs 904.84 µs 907.32 µs]
//! short/short/128 time: [1.7416 ms 1.7487 ms 1.7561 ms]
//! ``
use std::sync::Arc;
//! We let `criterion` determine the `n_redos` using `iter_custom`.
//! The idea is that for each `(redo_work, nclients)` combination,
//! criterion will run the `bench_impl` multiple times with different `n_redos`.
//! The `bench_impl` reports the aggregate wall clock time from the clients' perspective.
//! Criterion will divide that by `n_redos` to compute the "time per iteration".
//! In our case, "time per iteration" means "time per redo_work execution".
//!
//! NB: the way by which `iter_custom` determines the "number of iterations"
//! is called sampling. Apparently the idea here is to detect outliers.
//! We're not sure whether the current choice of sampling method makes sense.
//! See https://bheisler.github.io/criterion.rs/book/user_guide/command_line_output.html#collecting-samples
//!
//! # Reference Numbers
//!
//! 2024-03-20 on i3en.3xlarge
//!
//! ```text
//! short/1 time: [26.483 µs 26.614 µs 26.767 µs]
//! short/2 time: [32.223 µs 32.465 µs 32.767 µs]
//! short/4 time: [47.203 µs 47.583 µs 47.984 µs]
//! short/8 time: [89.135 µs 89.612 µs 90.139 µs]
//! short/16 time: [190.12 µs 191.52 µs 192.88 µs]
//! short/32 time: [380.96 µs 382.63 µs 384.20 µs]
//! short/64 time: [736.86 µs 741.07 µs 745.03 µs]
//! short/128 time: [1.4106 ms 1.4206 ms 1.4294 ms]
//! medium/1 time: [111.81 µs 112.25 µs 112.79 µs]
//! medium/2 time: [158.26 µs 159.13 µs 160.21 µs]
//! medium/4 time: [334.65 µs 337.14 µs 340.07 µs]
//! medium/8 time: [675.32 µs 679.91 µs 685.25 µs]
//! medium/16 time: [1.2929 ms 1.2996 ms 1.3067 ms]
//! medium/32 time: [2.4295 ms 2.4461 ms 2.4623 ms]
//! medium/64 time: [4.3973 ms 4.4458 ms 4.4875 ms]
//! medium/128 time: [7.5955 ms 7.7847 ms 7.9481 ms]
//! ```
use bytes::{Buf, Bytes};
use pageserver::{
config::PageServerConf, repository::Key, walrecord::NeonWalRecord, walredo::PostgresRedoManager,
use criterion::{BenchmarkId, Criterion};
use pageserver::{config::PageServerConf, walrecord::NeonWalRecord, walredo::PostgresRedoManager};
use pageserver_api::{key::Key, shard::TenantShardId};
use std::{
sync::Arc,
time::{Duration, Instant},
};
use pageserver_api::shard::TenantShardId;
use tokio::task::JoinSet;
use tokio::{sync::Barrier, task::JoinSet};
use utils::{id::TenantId, lsn::Lsn};
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
fn bench(c: &mut Criterion) {
{
let nclients = [1, 2, 4, 8, 16, 32, 64, 128];
for nclients in nclients {
let mut group = c.benchmark_group("short");
group.bench_with_input(
BenchmarkId::from_parameter(nclients),
&nclients,
|b, nclients| {
let redo_work = Arc::new(Request::short_input());
b.iter_custom(|iters| bench_impl(Arc::clone(&redo_work), iters, *nclients));
},
);
}
}
fn redo_scenarios(c: &mut Criterion) {
// logging should be enabled when adding more inputs, since walredo will only report malformed
// input to the stderr.
// utils::logging::init(utils::logging::LogFormat::Plain).unwrap();
{
let nclients = [1, 2, 4, 8, 16, 32, 64, 128];
for nclients in nclients {
let mut group = c.benchmark_group("medium");
group.bench_with_input(
BenchmarkId::from_parameter(nclients),
&nclients,
|b, nclients| {
let redo_work = Arc::new(Request::medium_input());
b.iter_custom(|iters| bench_impl(Arc::clone(&redo_work), iters, *nclients));
},
);
}
}
}
criterion::criterion_group!(benches, bench);
criterion::criterion_main!(benches);
// Returns the sum of each client's wall-clock time spent executing their share of the n_redos.
fn bench_impl(redo_work: Arc<Request>, n_redos: u64, nclients: u64) -> Duration {
let repo_dir = camino_tempfile::tempdir_in(env!("CARGO_TARGET_TMPDIR")).unwrap();
let conf = PageServerConf::dummy_conf(repo_dir.path().to_path_buf());
let conf = Box::leak(Box::new(conf));
let tenant_shard_id = TenantShardId::unsharded(TenantId::generate());
let manager = PostgresRedoManager::new(conf, tenant_shard_id);
let manager = Arc::new(manager);
{
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.unwrap();
tracing::info!("executing first");
rt.block_on(short().execute(&manager)).unwrap();
tracing::info!("first executed");
}
let thread_counts = [1, 2, 4, 8, 16, 32, 64, 128];
let mut group = c.benchmark_group("short");
group.sampling_mode(criterion::SamplingMode::Flat);
for thread_count in thread_counts {
group.bench_with_input(
BenchmarkId::new("short", thread_count),
&thread_count,
|b, thread_count| {
add_multithreaded_walredo_requesters(b, *thread_count, &manager, short);
},
);
}
drop(group);
let mut group = c.benchmark_group("medium");
group.sampling_mode(criterion::SamplingMode::Flat);
for thread_count in thread_counts {
group.bench_with_input(
BenchmarkId::new("medium", thread_count),
&thread_count,
|b, thread_count| {
add_multithreaded_walredo_requesters(b, *thread_count, &manager, medium);
},
);
}
drop(group);
}
/// Sets up a multi-threaded tokio runtime with default worker thread count,
/// then, spawn `requesters` tasks that repeatedly:
/// - get input from `input_factor()`
/// - call `manager.request_redo()` with their input
///
/// This stress-tests the scalability of a single walredo manager at high tokio-level concurrency.
///
/// Using tokio's default worker thread count means the results will differ on machines
/// with different core countrs. We don't care about that, the performance will always
/// be different on different hardware. To compare performance of different software versions,
/// use the same hardware.
fn add_multithreaded_walredo_requesters(
b: &mut criterion::Bencher,
nrequesters: usize,
manager: &Arc<PostgresRedoManager>,
input_factory: fn() -> Request,
) {
assert_ne!(nrequesters, 0);
let rt = tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()
.unwrap();
let barrier = Arc::new(tokio::sync::Barrier::new(nrequesters + 1));
let start = Arc::new(Barrier::new(nclients as usize));
let mut requesters = JoinSet::new();
for _ in 0..nrequesters {
let _entered = rt.enter();
let manager = manager.clone();
let barrier = barrier.clone();
requesters.spawn(async move {
loop {
let input = input_factory();
barrier.wait().await;
let page = input.execute(&manager).await.unwrap();
assert_eq!(page.remaining(), 8192);
barrier.wait().await;
}
let mut tasks = JoinSet::new();
let manager = PostgresRedoManager::new(conf, tenant_shard_id);
let manager = Arc::new(manager);
for _ in 0..nclients {
rt.block_on(async {
tasks.spawn(client(
Arc::clone(&manager),
Arc::clone(&start),
Arc::clone(&redo_work),
// divide the amount of work equally among the clients
n_redos / nclients,
))
});
}
let do_one_iteration = || {
rt.block_on(async {
barrier.wait().await;
// wait for work to complete
barrier.wait().await;
})
};
b.iter_batched(
|| {
// warmup
do_one_iteration();
},
|()| {
// work loop
do_one_iteration();
},
criterion::BatchSize::PerIteration,
);
rt.block_on(requesters.shutdown());
rt.block_on(async move {
let mut total_wallclock_time = std::time::Duration::from_millis(0);
while let Some(res) = tasks.join_next().await {
total_wallclock_time += res.unwrap();
}
total_wallclock_time
})
}
criterion_group!(benches, redo_scenarios);
criterion_main!(benches);
async fn client(
mgr: Arc<PostgresRedoManager>,
start: Arc<Barrier>,
redo_work: Arc<Request>,
n_redos: u64,
) -> Duration {
start.wait().await;
let start = Instant::now();
for _ in 0..n_redos {
let page = redo_work.execute(&mgr).await.unwrap();
assert_eq!(page.remaining(), 8192);
// The real pageserver will rarely if ever do 2 walredos in a row without
// yielding to the executor.
tokio::task::yield_now().await;
}
start.elapsed()
}
macro_rules! lsn {
($input:expr) => {{
@@ -166,12 +162,46 @@ macro_rules! lsn {
}};
}
/// Short payload, 1132 bytes.
// pg_records are copypasted from log, where they are put with Debug impl of Bytes, which uses \0
// for null bytes.
#[allow(clippy::octal_escapes)]
fn short() -> Request {
Request {
/// Simple wrapper around `WalRedoManager::request_redo`.
///
/// In benchmarks this is cloned around.
#[derive(Clone)]
struct Request {
key: Key,
lsn: Lsn,
base_img: Option<(Lsn, Bytes)>,
records: Vec<(Lsn, NeonWalRecord)>,
pg_version: u32,
}
impl Request {
async fn execute(&self, manager: &PostgresRedoManager) -> anyhow::Result<Bytes> {
let Request {
key,
lsn,
base_img,
records,
pg_version,
} = self;
// TODO: avoid these clones
manager
.request_redo(*key, *lsn, base_img.clone(), records.clone(), *pg_version)
.await
}
fn pg_record(will_init: bool, bytes: &'static [u8]) -> NeonWalRecord {
let rec = Bytes::from_static(bytes);
NeonWalRecord::Postgres { will_init, rec }
}
/// Short payload, 1132 bytes.
// pg_records are copypasted from log, where they are put with Debug impl of Bytes, which uses \0
// for null bytes.
#[allow(clippy::octal_escapes)]
pub fn short_input() -> Request {
let pg_record = Self::pg_record;
Request {
key: Key {
field1: 0,
field2: 1663,
@@ -194,13 +224,14 @@ fn short() -> Request {
],
pg_version: 14,
}
}
}
/// Medium sized payload, serializes as 26393 bytes.
// see [`short`]
#[allow(clippy::octal_escapes)]
fn medium() -> Request {
Request {
/// Medium sized payload, serializes as 26393 bytes.
// see [`short`]
#[allow(clippy::octal_escapes)]
pub fn medium_input() -> Request {
let pg_record = Self::pg_record;
Request {
key: Key {
field1: 0,
field2: 1663,
@@ -442,37 +473,5 @@ fn medium() -> Request {
],
pg_version: 14,
}
}
fn pg_record(will_init: bool, bytes: &'static [u8]) -> NeonWalRecord {
let rec = Bytes::from_static(bytes);
NeonWalRecord::Postgres { will_init, rec }
}
/// Simple wrapper around `WalRedoManager::request_redo`.
///
/// In benchmarks this is cloned around.
#[derive(Clone)]
struct Request {
key: Key,
lsn: Lsn,
base_img: Option<(Lsn, Bytes)>,
records: Vec<(Lsn, NeonWalRecord)>,
pg_version: u32,
}
impl Request {
async fn execute(self, manager: &PostgresRedoManager) -> anyhow::Result<Bytes> {
let Request {
key,
lsn,
base_img,
records,
pg_version,
} = self;
manager
.request_redo(key, lsn, base_img, records, pg_version)
.await
}
}

View File

@@ -7,7 +7,7 @@ use utils::{
pub mod util;
#[derive(Debug)]
#[derive(Debug, Clone)]
pub struct Client {
mgmt_api_endpoint: String,
authorization_header: Option<String>,
@@ -24,6 +24,9 @@ pub enum Error {
#[error("pageserver API: {1}")]
ApiError(StatusCode, String),
#[error("Cancelled")]
Cancelled,
}
pub type Result<T> = std::result::Result<T, Error>;
@@ -166,7 +169,7 @@ impl Client {
self.request(Method::GET, uri, ()).await
}
async fn request<B: serde::Serialize, U: reqwest::IntoUrl>(
async fn request_noerror<B: serde::Serialize, U: reqwest::IntoUrl>(
&self,
method: Method,
uri: U,
@@ -178,7 +181,16 @@ impl Client {
} else {
req
};
let res = req.json(&body).send().await.map_err(Error::ReceiveBody)?;
req.json(&body).send().await.map_err(Error::ReceiveBody)
}
async fn request<B: serde::Serialize, U: reqwest::IntoUrl>(
&self,
method: Method,
uri: U,
body: B,
) -> Result<reqwest::Response> {
let res = self.request_noerror(method, uri, body).await?;
let response = res.error_from_body().await?;
Ok(response)
}
@@ -237,13 +249,26 @@ impl Client {
Ok(())
}
pub async fn tenant_secondary_download(&self, tenant_id: TenantShardId) -> Result<()> {
let uri = format!(
pub async fn tenant_secondary_download(
&self,
tenant_id: TenantShardId,
wait: Option<std::time::Duration>,
) -> Result<(StatusCode, SecondaryProgress)> {
let mut path = reqwest::Url::parse(&format!(
"{}/v1/tenant/{}/secondary/download",
self.mgmt_api_endpoint, tenant_id
);
self.request(Method::POST, &uri, ()).await?;
Ok(())
))
.expect("Cannot build URL");
if let Some(wait) = wait {
path.query_pairs_mut()
.append_pair("wait_ms", &format!("{}", wait.as_millis()));
}
let response = self.request(Method::POST, path, ()).await?;
let status = response.status();
let progress: SecondaryProgress = response.json().await.map_err(Error::ReceiveBody)?;
Ok((status, progress))
}
pub async fn location_config(
@@ -251,21 +276,30 @@ impl Client {
tenant_shard_id: TenantShardId,
config: LocationConfig,
flush_ms: Option<std::time::Duration>,
lazy: bool,
) -> Result<()> {
let req_body = TenantLocationConfigRequest {
tenant_id: tenant_shard_id,
tenant_id: Some(tenant_shard_id),
config,
};
let path = format!(
let mut path = reqwest::Url::parse(&format!(
"{}/v1/tenant/{}/location_config",
self.mgmt_api_endpoint, tenant_shard_id
);
let path = if let Some(flush_ms) = flush_ms {
format!("{}?flush_ms={}", path, flush_ms.as_millis())
} else {
path
};
self.request(Method::PUT, &path, &req_body).await?;
))
// Should always work: mgmt_api_endpoint is configuration, not user input.
.expect("Cannot build URL");
if lazy {
path.query_pairs_mut().append_pair("lazy", "true");
}
if let Some(flush_ms) = flush_ms {
path.query_pairs_mut()
.append_pair("flush_ms", &format!("{}", flush_ms.as_millis()));
}
self.request(Method::PUT, path, &req_body).await?;
Ok(())
}
@@ -278,6 +312,21 @@ impl Client {
.map_err(Error::ReceiveBody)
}
pub async fn get_location_config(
&self,
tenant_shard_id: TenantShardId,
) -> Result<Option<LocationConfig>> {
let path = format!(
"{}/v1/location_config/{tenant_shard_id}",
self.mgmt_api_endpoint
);
self.request(Method::GET, &path, ())
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
pub async fn timeline_create(
&self,
tenant_shard_id: TenantShardId,
@@ -389,4 +438,77 @@ impl Client {
.await
.map_err(Error::ReceiveBody)
}
pub async fn get_utilization(&self) -> Result<PageserverUtilization> {
let uri = format!("{}/v1/utilization", self.mgmt_api_endpoint);
self.get(uri)
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
pub async fn layer_map_info(
&self,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
) -> Result<LayerMapInfo> {
let uri = format!(
"{}/v1/tenant/{}/timeline/{}/layer",
self.mgmt_api_endpoint, tenant_shard_id, timeline_id,
);
self.get(&uri)
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
pub async fn layer_evict(
&self,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
layer_file_name: &str,
) -> Result<bool> {
let uri = format!(
"{}/v1/tenant/{}/timeline/{}/layer/{}",
self.mgmt_api_endpoint, tenant_shard_id, timeline_id, layer_file_name
);
let resp = self.request_noerror(Method::DELETE, &uri, ()).await?;
match resp.status() {
StatusCode::OK => Ok(true),
StatusCode::NOT_MODIFIED => Ok(false),
// TODO: dedupe this pattern / introduce separate error variant?
status => Err(match resp.json::<HttpErrorBody>().await {
Ok(HttpErrorBody { msg }) => Error::ApiError(status, msg),
Err(_) => {
Error::ReceiveErrorBody(format!("Http error ({}) at {}.", status.as_u16(), uri))
}
}),
}
}
pub async fn layer_ondemand_download(
&self,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
layer_file_name: &str,
) -> Result<bool> {
let uri = format!(
"{}/v1/tenant/{}/timeline/{}/layer/{}",
self.mgmt_api_endpoint, tenant_shard_id, timeline_id, layer_file_name
);
let resp = self.request_noerror(Method::GET, &uri, ()).await?;
match resp.status() {
StatusCode::OK => Ok(true),
StatusCode::NOT_MODIFIED => Ok(false),
// TODO: dedupe this pattern / introduce separate error variant?
status => Err(match resp.json::<HttpErrorBody>().await {
Ok(HttpErrorBody { msg }) => Error::ApiError(status, msg),
Err(_) => {
Error::ReceiveErrorBody(format!("Http error ({}) at {}.", status.as_u16(), uri))
}
}),
}
}
}

View File

@@ -63,7 +63,7 @@ pub async fn compact_tiered<E: CompactionJobExecutor>(
);
// Identify the range of LSNs that belong to this level. We assume that
// each file in this level span an LSN range up to 1.75x target file
// each file in this level spans an LSN range up to 1.75x target file
// size. That should give us enough slop that if we created a slightly
// oversized L0 layer, e.g. because flushing the in-memory layer was
// delayed for some reason, we don't consider the oversized layer to
@@ -248,7 +248,6 @@ enum CompactionStrategy {
CreateImage,
}
#[allow(dead_code)] // Todo
struct CompactionJob<E: CompactionJobExecutor> {
key_range: Range<E::Key>,
lsn_range: Range<Lsn>,
@@ -345,7 +344,7 @@ where
///
/// TODO: Currently, this is called exactly once for the level, and we
/// decide whether to create new image layers to cover the whole level, or
/// write a new set of delta. In the future, this should try to partition
/// write a new set of deltas. In the future, this should try to partition
/// the key space, and make the decision separately for each partition.
async fn divide_job(&mut self, job_id: JobId, ctx: &E::RequestContext) -> anyhow::Result<()> {
let job = &self.jobs[job_id.0];
@@ -709,18 +708,6 @@ where
}
}
// Sliding window through keyspace and values
//
// This is used to decide what layer to write next, from the beginning of the window.
//
// Candidates:
//
// 1. Create an image layer, snapping to previous images
// 2. Create a delta layer, snapping to previous images
// 3. Create an image layer, snapping to
//
//
// Take previous partitioning, based on the image layers below.
//
// Candidate is at the front:
@@ -739,6 +726,10 @@ struct WindowElement<K> {
last_key: K, // inclusive
accum_size: u64,
}
// Sliding window through keyspace and values
//
// This is used to decide what layer to write next, from the beginning of the window.
struct Window<K> {
elems: VecDeque<WindowElement<K>>,

View File

@@ -1,5 +1,5 @@
//! An LSM tree consists of multiple levels, each exponential larger than the
//! previous level. And each level consists of be multiple "tiers". With tiered
//! An LSM tree consists of multiple levels, each exponentially larger than the
//! previous level. And each level consists of multiple "tiers". With tiered
//! compaction, a level is compacted when it has accumulated more than N tiers,
//! forming one tier on the next level.
//!
@@ -170,13 +170,6 @@ where
})
}
// helper struct used in depth()
struct Event<K> {
key: K,
layer_idx: usize,
start: bool,
}
impl<L> Level<L> {
/// Count the number of deltas stacked on each other.
pub fn depth<K>(&self) -> u64
@@ -184,6 +177,11 @@ impl<L> Level<L> {
K: CompactionKey,
L: CompactionLayer<K>,
{
struct Event<K> {
key: K,
layer_idx: usize,
start: bool,
}
let mut events: Vec<Event<K>> = Vec::new();
for (idx, l) in self.layers.iter().enumerate() {
events.push(Event {
@@ -202,7 +200,7 @@ impl<L> Level<L> {
// Sweep the key space left to right. Stop at each distinct key, and
// count the number of deltas on top of the highest image at that key.
//
// This is a little enefficient, as we walk through the active_set on
// This is a little inefficient, as we walk through the active_set on
// every key. We could increment/decrement a counter on each step
// instead, but that'd require a bit more complex bookkeeping.
let mut active_set: BTreeSet<(Lsn, bool, usize)> = BTreeSet::new();
@@ -236,6 +234,7 @@ impl<L> Level<L> {
}
}
}
debug_assert_eq!(active_set, BTreeSet::new());
max_depth
}
}

View File

@@ -4,12 +4,12 @@
//! All the heavy lifting is done by the create_image and create_delta
//! functions that the implementor provides.
use async_trait::async_trait;
use futures::Future;
use pageserver_api::{key::Key, keyspace::key_range_size};
use std::ops::Range;
use utils::lsn::Lsn;
/// Public interface. This is the main thing that the implementor needs to provide
#[async_trait]
pub trait CompactionJobExecutor {
// Type system.
//
@@ -17,8 +17,7 @@ pub trait CompactionJobExecutor {
// compaction doesn't distinguish whether they are stored locally or
// remotely.
//
// The keyspace is defined by CompactionKey trait.
//
// The keyspace is defined by the CompactionKey trait.
type Key: CompactionKey;
type Layer: CompactionLayer<Self::Key> + Clone;
@@ -35,27 +34,27 @@ pub trait CompactionJobExecutor {
// ----
/// Return all layers that overlap the given bounding box.
async fn get_layers(
fn get_layers(
&mut self,
key_range: &Range<Self::Key>,
lsn_range: &Range<Lsn>,
ctx: &Self::RequestContext,
) -> anyhow::Result<Vec<Self::Layer>>;
) -> impl Future<Output = anyhow::Result<Vec<Self::Layer>>> + Send;
async fn get_keyspace(
fn get_keyspace(
&mut self,
key_range: &Range<Self::Key>,
lsn: Lsn,
ctx: &Self::RequestContext,
) -> anyhow::Result<CompactionKeySpace<Self::Key>>;
) -> impl Future<Output = anyhow::Result<CompactionKeySpace<Self::Key>>> + Send;
/// NB: This is a pretty expensive operation. In the real pageserver
/// implementation, it downloads the layer, and keeps it resident
/// until the DeltaLayer is dropped.
async fn downcast_delta_layer(
fn downcast_delta_layer(
&self,
layer: &Self::Layer,
) -> anyhow::Result<Option<Self::DeltaLayer>>;
) -> impl Future<Output = anyhow::Result<Option<Self::DeltaLayer>>> + Send;
// ----
// Functions to execute the plan
@@ -63,33 +62,33 @@ pub trait CompactionJobExecutor {
/// Create a new image layer, materializing all the values in the key range,
/// at given 'lsn'.
async fn create_image(
fn create_image(
&mut self,
lsn: Lsn,
key_range: &Range<Self::Key>,
ctx: &Self::RequestContext,
) -> anyhow::Result<()>;
) -> impl Future<Output = anyhow::Result<()>> + Send;
/// Create a new delta layer, containing all the values from 'input_layers'
/// in the given key and LSN range.
async fn create_delta(
fn create_delta(
&mut self,
lsn_range: &Range<Lsn>,
key_range: &Range<Self::Key>,
input_layers: &[Self::DeltaLayer],
ctx: &Self::RequestContext,
) -> anyhow::Result<()>;
) -> impl Future<Output = anyhow::Result<()>> + Send;
/// Delete a layer. The compaction implementation will call this only after
/// all the create_image() or create_delta() calls that deletion of this
/// layer depends on have finished. But if the implementor has extra lazy
/// background tasks, like uploading the index json file to remote storage,
/// background tasks, like uploading the index json file to remote storage.
/// it is the implementation's responsibility to track those.
async fn delete_layer(
fn delete_layer(
&mut self,
layer: &Self::Layer,
ctx: &Self::RequestContext,
) -> anyhow::Result<()>;
) -> impl Future<Output = anyhow::Result<()>> + Send;
}
pub trait CompactionKey: std::cmp::Ord + Clone + Copy + std::fmt::Display {

View File

@@ -429,7 +429,6 @@ impl From<&Arc<MockImageLayer>> for MockLayer {
}
}
#[async_trait]
impl interface::CompactionJobExecutor for MockTimeline {
type Key = Key;
type Layer = MockLayer;

View File

@@ -0,0 +1,272 @@
use pageserver_api::{models::HistoricLayerInfo, shard::TenantShardId};
use pageserver_client::mgmt_api;
use rand::seq::SliceRandom;
use tracing::{debug, info};
use utils::id::{TenantTimelineId, TimelineId};
use tokio::{
sync::{mpsc, OwnedSemaphorePermit},
task::JoinSet,
};
use std::{
num::NonZeroUsize,
sync::{
atomic::{AtomicU64, Ordering},
Arc,
},
time::{Duration, Instant},
};
/// Evict & on-demand download random layers.
#[derive(clap::Parser)]
pub(crate) struct Args {
#[clap(long, default_value = "http://localhost:9898")]
mgmt_api_endpoint: String,
#[clap(long)]
pageserver_jwt: Option<String>,
#[clap(long)]
runtime: Option<humantime::Duration>,
#[clap(long, default_value = "1")]
tasks_per_target: NonZeroUsize,
#[clap(long, default_value = "1")]
concurrency_per_target: NonZeroUsize,
/// Probability for sending `latest=true` in the request (uniform distribution).
#[clap(long)]
limit_to_first_n_targets: Option<usize>,
/// Before starting the benchmark, live-reconfigure the pageserver to use the given
/// [`pageserver_api::models::virtual_file::IoEngineKind`].
#[clap(long)]
set_io_engine: Option<pageserver_api::models::virtual_file::IoEngineKind>,
targets: Option<Vec<TenantTimelineId>>,
}
pub(crate) fn main(args: Args) -> anyhow::Result<()> {
let rt = tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()?;
let task = rt.spawn(main_impl(args));
rt.block_on(task).unwrap().unwrap();
Ok(())
}
#[derive(Debug, Default)]
struct LiveStats {
evictions: AtomicU64,
downloads: AtomicU64,
timeline_restarts: AtomicU64,
}
impl LiveStats {
fn eviction_done(&self) {
self.evictions.fetch_add(1, Ordering::Relaxed);
}
fn download_done(&self) {
self.downloads.fetch_add(1, Ordering::Relaxed);
}
fn timeline_restart_done(&self) {
self.timeline_restarts.fetch_add(1, Ordering::Relaxed);
}
}
async fn main_impl(args: Args) -> anyhow::Result<()> {
let args: &'static Args = Box::leak(Box::new(args));
let mgmt_api_client = Arc::new(pageserver_client::mgmt_api::Client::new(
args.mgmt_api_endpoint.clone(),
args.pageserver_jwt.as_deref(),
));
if let Some(engine_str) = &args.set_io_engine {
mgmt_api_client.put_io_engine(engine_str).await?;
}
// discover targets
let timelines: Vec<TenantTimelineId> = crate::util::cli::targets::discover(
&mgmt_api_client,
crate::util::cli::targets::Spec {
limit_to_first_n_targets: args.limit_to_first_n_targets,
targets: args.targets.clone(),
},
)
.await?;
let mut tasks = JoinSet::new();
let live_stats = Arc::new(LiveStats::default());
tasks.spawn({
let live_stats = Arc::clone(&live_stats);
async move {
let mut last_at = Instant::now();
loop {
tokio::time::sleep_until((last_at + Duration::from_secs(1)).into()).await;
let now = Instant::now();
let delta: Duration = now - last_at;
last_at = now;
let LiveStats {
evictions,
downloads,
timeline_restarts,
} = &*live_stats;
let evictions = evictions.swap(0, Ordering::Relaxed) as f64 / delta.as_secs_f64();
let downloads = downloads.swap(0, Ordering::Relaxed) as f64 / delta.as_secs_f64();
let timeline_restarts = timeline_restarts.swap(0, Ordering::Relaxed);
info!("evictions={evictions:.2}/s downloads={downloads:.2}/s timeline_restarts={timeline_restarts}");
}
}
});
for tl in timelines {
for _ in 0..args.tasks_per_target.get() {
tasks.spawn(timeline_actor(
args,
Arc::clone(&mgmt_api_client),
tl,
Arc::clone(&live_stats),
));
}
}
while let Some(res) = tasks.join_next().await {
res.unwrap();
}
Ok(())
}
async fn timeline_actor(
args: &'static Args,
mgmt_api_client: Arc<pageserver_client::mgmt_api::Client>,
timeline: TenantTimelineId,
live_stats: Arc<LiveStats>,
) {
// TODO: support sharding
let tenant_shard_id = TenantShardId::unsharded(timeline.tenant_id);
struct Timeline {
joinset: JoinSet<()>,
layers: Vec<mpsc::Sender<OwnedSemaphorePermit>>,
concurrency: Arc<tokio::sync::Semaphore>,
}
loop {
debug!("restarting timeline");
let layer_map_info = mgmt_api_client
.layer_map_info(tenant_shard_id, timeline.timeline_id)
.await
.unwrap();
let concurrency = Arc::new(tokio::sync::Semaphore::new(
args.concurrency_per_target.get(),
));
let mut joinset = JoinSet::new();
let layers = layer_map_info
.historic_layers
.into_iter()
.map(|historic_layer| {
let (tx, rx) = mpsc::channel(1);
joinset.spawn(layer_actor(
tenant_shard_id,
timeline.timeline_id,
historic_layer,
rx,
Arc::clone(&mgmt_api_client),
Arc::clone(&live_stats),
));
tx
})
.collect::<Vec<_>>();
let mut timeline = Timeline {
joinset,
layers,
concurrency,
};
live_stats.timeline_restart_done();
loop {
assert!(!timeline.joinset.is_empty());
if let Some(res) = timeline.joinset.try_join_next() {
debug!(?res, "a layer actor exited, should not happen");
timeline.joinset.shutdown().await;
break;
}
let mut permit = Some(
Arc::clone(&timeline.concurrency)
.acquire_owned()
.await
.unwrap(),
);
loop {
let layer_tx = {
let mut rng = rand::thread_rng();
timeline.layers.choose_mut(&mut rng).expect("no layers")
};
match layer_tx.try_send(permit.take().unwrap()) {
Ok(_) => break,
Err(e) => match e {
mpsc::error::TrySendError::Full(back) => {
// TODO: retrying introduces bias away from slow downloaders
permit.replace(back);
}
mpsc::error::TrySendError::Closed(_) => panic!(),
},
}
}
}
}
}
async fn layer_actor(
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
mut layer: HistoricLayerInfo,
mut rx: mpsc::Receiver<tokio::sync::OwnedSemaphorePermit>,
mgmt_api_client: Arc<mgmt_api::Client>,
live_stats: Arc<LiveStats>,
) {
#[derive(Clone, Copy)]
enum Action {
Evict,
OnDemandDownload,
}
while let Some(_permit) = rx.recv().await {
let action = if layer.is_remote() {
Action::OnDemandDownload
} else {
Action::Evict
};
let did_it = match action {
Action::Evict => {
let did_it = mgmt_api_client
.layer_evict(tenant_shard_id, timeline_id, layer.layer_file_name())
.await
.unwrap();
live_stats.eviction_done();
did_it
}
Action::OnDemandDownload => {
let did_it = mgmt_api_client
.layer_ondemand_download(tenant_shard_id, timeline_id, layer.layer_file_name())
.await
.unwrap();
live_stats.download_done();
did_it
}
};
if !did_it {
debug!("local copy of layer map appears out of sync, re-downloading");
return;
}
debug!("did it");
layer.set_remote(match action {
Action::Evict => true,
Action::OnDemandDownload => false,
});
}
}

View File

@@ -16,6 +16,7 @@ mod util {
mod cmd {
pub(super) mod basebackup;
pub(super) mod getpage_latest_lsn;
pub(super) mod ondemand_download_churn;
pub(super) mod trigger_initial_size_calculation;
}
@@ -25,6 +26,7 @@ enum Args {
Basebackup(cmd::basebackup::Args),
GetPageLatestLsn(cmd::getpage_latest_lsn::Args),
TriggerInitialSizeCalculation(cmd::trigger_initial_size_calculation::Args),
OndemandDownloadChurn(cmd::ondemand_download_churn::Args),
}
fn main() {
@@ -43,6 +45,7 @@ fn main() {
Args::TriggerInitialSizeCalculation(args) => {
cmd::trigger_initial_size_calculation::main(args)
}
Args::OndemandDownloadChurn(args) => cmd::ondemand_download_churn::main(args),
}
.unwrap()
}

View File

@@ -1,3 +1,5 @@
#![recursion_limit = "300"]
//! Main entry point for the Page Server executable.
use std::env::{var, VarError};
@@ -118,6 +120,9 @@ fn main() -> anyhow::Result<()> {
&[("node_id", &conf.id.to_string())],
);
// after setting up logging, log the effective IO engine choice
info!(?conf.virtual_file_io_engine, "starting with virtual_file IO engine");
let tenants_path = conf.tenants_path();
if !tenants_path.exists() {
utils::crashsafe::create_dir_all(conf.tenants_path())
@@ -312,6 +317,7 @@ fn start_pageserver(
let http_listener = tcp_listener::bind(http_addr)?;
let pg_addr = &conf.listen_pg_addr;
info!("Starting pageserver pg protocol handler on {pg_addr}");
let pageserver_listener = tcp_listener::bind(pg_addr)?;
@@ -544,7 +550,7 @@ fn start_pageserver(
let router_state = Arc::new(
http::routes::State::new(
conf,
tenant_manager,
tenant_manager.clone(),
http_auth.clone(),
remote_storage.clone(),
broker_client.clone(),
@@ -594,32 +600,37 @@ fn start_pageserver(
None,
"consumption metrics collection",
true,
async move {
// first wait until background jobs are cleared to launch.
//
// this is because we only process active tenants and timelines, and the
// Timeline::get_current_logical_size will spawn the logical size calculation,
// which will not be rate-limited.
let cancel = task_mgr::shutdown_token();
{
let tenant_manager = tenant_manager.clone();
async move {
// first wait until background jobs are cleared to launch.
//
// this is because we only process active tenants and timelines, and the
// Timeline::get_current_logical_size will spawn the logical size calculation,
// which will not be rate-limited.
let cancel = task_mgr::shutdown_token();
tokio::select! {
_ = cancel.cancelled() => { return Ok(()); },
_ = background_jobs_barrier.wait() => {}
};
tokio::select! {
_ = cancel.cancelled() => { return Ok(()); },
_ = background_jobs_barrier.wait() => {}
};
pageserver::consumption_metrics::collect_metrics(
metric_collection_endpoint,
conf.metric_collection_interval,
conf.cached_metric_collection_interval,
conf.synthetic_size_calculation_interval,
conf.id,
local_disk_storage,
cancel,
metrics_ctx,
)
.instrument(info_span!("metrics_collection"))
.await?;
Ok(())
pageserver::consumption_metrics::collect_metrics(
tenant_manager,
metric_collection_endpoint,
&conf.metric_collection_bucket,
conf.metric_collection_interval,
conf.cached_metric_collection_interval,
conf.synthetic_size_calculation_interval,
conf.id,
local_disk_storage,
cancel,
metrics_ctx,
)
.instrument(info_span!("metrics_collection"))
.await?;
Ok(())
}
},
);
}
@@ -688,6 +699,7 @@ fn start_pageserver(
let bg_remote_storage = remote_storage.clone();
let bg_deletion_queue = deletion_queue.clone();
BACKGROUND_RUNTIME.block_on(pageserver::shutdown_pageserver(
&tenant_manager,
bg_remote_storage.map(|_| bg_deletion_queue),
0,
));

View File

@@ -7,8 +7,9 @@
use anyhow::{anyhow, bail, ensure, Context, Result};
use pageserver_api::shard::TenantShardId;
use remote_storage::{RemotePath, RemoteStorageConfig};
use serde;
use serde::de::IntoDeserializer;
use std::env;
use std::{collections::HashMap, env};
use storage_broker::Uri;
use utils::crashsafe::path_with_suffix_extension;
use utils::id::ConnectionId;
@@ -29,18 +30,17 @@ use utils::{
logging::LogFormat,
};
use crate::disk_usage_eviction_task::DiskUsageEvictionTaskConfig;
use crate::tenant::config::TenantConf;
use crate::tenant::config::TenantConfOpt;
use crate::tenant::timeline::GetVectoredImpl;
use crate::tenant::vectored_blob_io::MaxVectoredReadBytes;
use crate::tenant::{
TENANTS_SEGMENT_NAME, TENANT_DELETED_MARKER_FILE_NAME, TIMELINES_SEGMENT_NAME,
};
use crate::virtual_file;
use crate::{disk_usage_eviction_task::DiskUsageEvictionTaskConfig, virtual_file::io_engine};
use crate::{tenant::config::TenantConf, virtual_file};
use crate::{
IGNORED_TENANT_FILE_NAME, TENANT_CONFIG_NAME, TENANT_HEATMAP_BASENAME,
TENANT_LOCATION_CONFIG_NAME, TIMELINE_DELETE_MARK_SUFFIX, TIMELINE_UNINIT_MARK_SUFFIX,
TENANT_LOCATION_CONFIG_NAME, TIMELINE_DELETE_MARK_SUFFIX,
};
use self::defaults::DEFAULT_CONCURRENT_TENANT_WARMUP;
@@ -83,6 +83,10 @@ pub mod defaults {
pub const DEFAULT_INGEST_BATCH_SIZE: u64 = 100;
#[cfg(target_os = "linux")]
pub const DEFAULT_VIRTUAL_FILE_IO_ENGINE: &str = "tokio-epoll-uring";
#[cfg(not(target_os = "linux"))]
pub const DEFAULT_VIRTUAL_FILE_IO_ENGINE: &str = "std-fs";
pub const DEFAULT_GET_VECTORED_IMPL: &str = "sequential";
@@ -91,6 +95,8 @@ pub mod defaults {
pub const DEFAULT_VALIDATE_VECTORED_GET: bool = true;
pub const DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB: usize = 0;
///
/// Default built-in configuration file.
///
@@ -152,6 +158,8 @@ pub mod defaults {
#heatmap_upload_concurrency = {DEFAULT_HEATMAP_UPLOAD_CONCURRENCY}
#secondary_download_concurrency = {DEFAULT_SECONDARY_DOWNLOAD_CONCURRENCY}
#ephemeral_bytes_per_memory_kb = {DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB}
[remote_storage]
"#
@@ -230,6 +238,7 @@ pub struct PageServerConf {
// How often to send unchanged cached metrics to the metrics endpoint.
pub cached_metric_collection_interval: Duration,
pub metric_collection_endpoint: Option<Url>,
pub metric_collection_bucket: Option<RemoteStorageConfig>,
pub synthetic_size_calculation_interval: Duration,
pub disk_usage_based_eviction: Option<DiskUsageEvictionTaskConfig>,
@@ -274,6 +283,13 @@ pub struct PageServerConf {
pub max_vectored_read_bytes: MaxVectoredReadBytes,
pub validate_vectored_get: bool,
/// How many bytes of ephemeral layer content will we allow per kilobyte of RAM. When this
/// is exceeded, we start proactively closing ephemeral layers to limit the total amount
/// of ephemeral data.
///
/// Setting this to zero disables limits on total ephemeral layer size.
pub ephemeral_bytes_per_memory_kb: usize,
}
/// We do not want to store this in a PageServerConf because the latter may be logged
@@ -286,21 +302,49 @@ pub static SAFEKEEPER_AUTH_TOKEN: OnceCell<Arc<String>> = OnceCell::new();
// use dedicated enum for builder to better indicate the intention
// and avoid possible confusion with nested options
#[derive(Clone, Default)]
pub enum BuilderValue<T> {
Set(T),
#[default]
NotSet,
}
impl<T> BuilderValue<T> {
pub fn ok_or<E>(self, err: E) -> Result<T, E> {
impl<T: Clone> BuilderValue<T> {
pub fn ok_or(&self, field_name: &'static str, default: BuilderValue<T>) -> anyhow::Result<T> {
match self {
Self::Set(v) => Ok(v),
Self::NotSet => Err(err),
Self::Set(v) => Ok(v.clone()),
Self::NotSet => match default {
BuilderValue::Set(v) => Ok(v.clone()),
BuilderValue::NotSet => {
anyhow::bail!("missing config value {field_name:?}")
}
},
}
}
}
// Certain metadata (e.g. externally-addressable name, AZ) is delivered
// as a separate structure. This information is not neeed by the pageserver
// itself, it is only used for registering the pageserver with the control
// plane and/or storage controller.
//
#[derive(serde::Deserialize)]
pub(crate) struct NodeMetadata {
#[serde(rename = "host")]
pub(crate) postgres_host: String,
#[serde(rename = "port")]
pub(crate) postgres_port: u16,
pub(crate) http_host: String,
pub(crate) http_port: u16,
// Deployment tools may write fields to the metadata file beyond what we
// use in this type: this type intentionally only names fields that require.
#[serde(flatten)]
pub(crate) other: HashMap<String, serde_json::Value>,
}
// needed to simplify config construction
#[derive(Default)]
struct PageServerConfigBuilder {
listen_pg_addr: BuilderValue<String>,
@@ -341,6 +385,7 @@ struct PageServerConfigBuilder {
cached_metric_collection_interval: BuilderValue<Duration>,
metric_collection_endpoint: BuilderValue<Option<Url>>,
synthetic_size_calculation_interval: BuilderValue<Duration>,
metric_collection_bucket: BuilderValue<Option<RemoteStorageConfig>>,
disk_usage_based_eviction: BuilderValue<Option<DiskUsageEvictionTaskConfig>>,
@@ -366,10 +411,13 @@ struct PageServerConfigBuilder {
max_vectored_read_bytes: BuilderValue<MaxVectoredReadBytes>,
validate_vectored_get: BuilderValue<bool>,
ephemeral_bytes_per_memory_kb: BuilderValue<usize>,
}
impl Default for PageServerConfigBuilder {
fn default() -> Self {
impl PageServerConfigBuilder {
#[inline(always)]
fn default_values() -> Self {
use self::BuilderValue::*;
use defaults::*;
Self {
@@ -422,6 +470,8 @@ impl Default for PageServerConfigBuilder {
.expect("cannot parse default synthetic size calculation interval")),
metric_collection_endpoint: Set(DEFAULT_METRIC_COLLECTION_ENDPOINT),
metric_collection_bucket: Set(None),
disk_usage_based_eviction: Set(None),
test_remote_failures: Set(0),
@@ -449,6 +499,7 @@ impl Default for PageServerConfigBuilder {
NonZeroUsize::new(DEFAULT_MAX_VECTORED_READ_BYTES).unwrap(),
)),
validate_vectored_get: Set(DEFAULT_VALIDATE_VECTORED_GET),
ephemeral_bytes_per_memory_kb: Set(DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB),
}
}
}
@@ -553,6 +604,13 @@ impl PageServerConfigBuilder {
self.metric_collection_endpoint = BuilderValue::Set(metric_collection_endpoint)
}
pub fn metric_collection_bucket(
&mut self,
metric_collection_bucket: Option<RemoteStorageConfig>,
) {
self.metric_collection_bucket = BuilderValue::Set(metric_collection_bucket)
}
pub fn synthetic_size_calculation_interval(
&mut self,
synthetic_size_calculation_interval: Duration,
@@ -621,126 +679,103 @@ impl PageServerConfigBuilder {
self.validate_vectored_get = BuilderValue::Set(value);
}
pub fn get_ephemeral_bytes_per_memory_kb(&mut self, value: usize) {
self.ephemeral_bytes_per_memory_kb = BuilderValue::Set(value);
}
pub fn build(self) -> anyhow::Result<PageServerConf> {
let concurrent_tenant_warmup = self
.concurrent_tenant_warmup
.ok_or(anyhow!("missing concurrent_tenant_warmup"))?;
let concurrent_tenant_size_logical_size_queries = self
.concurrent_tenant_size_logical_size_queries
.ok_or(anyhow!(
"missing concurrent_tenant_size_logical_size_queries"
))?;
Ok(PageServerConf {
listen_pg_addr: self
.listen_pg_addr
.ok_or(anyhow!("missing listen_pg_addr"))?,
listen_http_addr: self
.listen_http_addr
.ok_or(anyhow!("missing listen_http_addr"))?,
availability_zone: self
.availability_zone
.ok_or(anyhow!("missing availability_zone"))?,
wait_lsn_timeout: self
.wait_lsn_timeout
.ok_or(anyhow!("missing wait_lsn_timeout"))?,
wal_redo_timeout: self
.wal_redo_timeout
.ok_or(anyhow!("missing wal_redo_timeout"))?,
superuser: self.superuser.ok_or(anyhow!("missing superuser"))?,
page_cache_size: self
.page_cache_size
.ok_or(anyhow!("missing page_cache_size"))?,
max_file_descriptors: self
.max_file_descriptors
.ok_or(anyhow!("missing max_file_descriptors"))?,
workdir: self.workdir.ok_or(anyhow!("missing workdir"))?,
pg_distrib_dir: self
.pg_distrib_dir
.ok_or(anyhow!("missing pg_distrib_dir"))?,
http_auth_type: self
.http_auth_type
.ok_or(anyhow!("missing http_auth_type"))?,
pg_auth_type: self.pg_auth_type.ok_or(anyhow!("missing pg_auth_type"))?,
auth_validation_public_key_path: self
.auth_validation_public_key_path
.ok_or(anyhow!("missing auth_validation_public_key_path"))?,
remote_storage_config: self
.remote_storage_config
.ok_or(anyhow!("missing remote_storage_config"))?,
id: self.id.ok_or(anyhow!("missing id"))?,
// TenantConf is handled separately
default_tenant_conf: TenantConf::default(),
broker_endpoint: self
.broker_endpoint
.ok_or(anyhow!("No broker endpoints provided"))?,
broker_keepalive_interval: self
.broker_keepalive_interval
.ok_or(anyhow!("No broker keepalive interval provided"))?,
log_format: self.log_format.ok_or(anyhow!("missing log_format"))?,
concurrent_tenant_warmup: ConfigurableSemaphore::new(concurrent_tenant_warmup),
concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::new(
concurrent_tenant_size_logical_size_queries,
),
eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore::new(
concurrent_tenant_size_logical_size_queries,
),
metric_collection_interval: self
.metric_collection_interval
.ok_or(anyhow!("missing metric_collection_interval"))?,
cached_metric_collection_interval: self
.cached_metric_collection_interval
.ok_or(anyhow!("missing cached_metric_collection_interval"))?,
metric_collection_endpoint: self
.metric_collection_endpoint
.ok_or(anyhow!("missing metric_collection_endpoint"))?,
synthetic_size_calculation_interval: self
.synthetic_size_calculation_interval
.ok_or(anyhow!("missing synthetic_size_calculation_interval"))?,
disk_usage_based_eviction: self
.disk_usage_based_eviction
.ok_or(anyhow!("missing disk_usage_based_eviction"))?,
test_remote_failures: self
.test_remote_failures
.ok_or(anyhow!("missing test_remote_failuers"))?,
ondemand_download_behavior_treat_error_as_warn: self
.ondemand_download_behavior_treat_error_as_warn
.ok_or(anyhow!(
"missing ondemand_download_behavior_treat_error_as_warn"
))?,
background_task_maximum_delay: self
.background_task_maximum_delay
.ok_or(anyhow!("missing background_task_maximum_delay"))?,
control_plane_api: self
.control_plane_api
.ok_or(anyhow!("missing control_plane_api"))?,
control_plane_api_token: self
.control_plane_api_token
.ok_or(anyhow!("missing control_plane_api_token"))?,
control_plane_emergency_mode: self
.control_plane_emergency_mode
.ok_or(anyhow!("missing control_plane_emergency_mode"))?,
heatmap_upload_concurrency: self
.heatmap_upload_concurrency
.ok_or(anyhow!("missing heatmap_upload_concurrency"))?,
secondary_download_concurrency: self
.secondary_download_concurrency
.ok_or(anyhow!("missing secondary_download_concurrency"))?,
ingest_batch_size: self
.ingest_batch_size
.ok_or(anyhow!("missing ingest_batch_size"))?,
virtual_file_io_engine: self
.virtual_file_io_engine
.ok_or(anyhow!("missing virtual_file_io_engine"))?,
get_vectored_impl: self
.get_vectored_impl
.ok_or(anyhow!("missing get_vectored_impl"))?,
max_vectored_read_bytes: self
.max_vectored_read_bytes
.ok_or(anyhow!("missing max_vectored_read_bytes"))?,
validate_vectored_get: self
.validate_vectored_get
.ok_or(anyhow!("missing validate_vectored_get"))?,
})
let default = Self::default_values();
macro_rules! conf {
(USING DEFAULT { $($field:ident,)* } CUSTOM LOGIC { $($custom_field:ident : $custom_value:expr,)* } ) => {
PageServerConf {
$(
$field: self.$field.ok_or(stringify!($field), default.$field)?,
)*
$(
$custom_field: $custom_value,
)*
}
};
}
Ok(conf!(
USING DEFAULT
{
listen_pg_addr,
listen_http_addr,
availability_zone,
wait_lsn_timeout,
wal_redo_timeout,
superuser,
page_cache_size,
max_file_descriptors,
workdir,
pg_distrib_dir,
http_auth_type,
pg_auth_type,
auth_validation_public_key_path,
remote_storage_config,
id,
broker_endpoint,
broker_keepalive_interval,
log_format,
metric_collection_interval,
cached_metric_collection_interval,
metric_collection_endpoint,
metric_collection_bucket,
synthetic_size_calculation_interval,
disk_usage_based_eviction,
test_remote_failures,
ondemand_download_behavior_treat_error_as_warn,
background_task_maximum_delay,
control_plane_api,
control_plane_api_token,
control_plane_emergency_mode,
heatmap_upload_concurrency,
secondary_download_concurrency,
ingest_batch_size,
get_vectored_impl,
max_vectored_read_bytes,
validate_vectored_get,
ephemeral_bytes_per_memory_kb,
}
CUSTOM LOGIC
{
// TenantConf is handled separately
default_tenant_conf: TenantConf::default(),
concurrent_tenant_warmup: ConfigurableSemaphore::new({
self
.concurrent_tenant_warmup
.ok_or("concurrent_tenant_warmpup",
default.concurrent_tenant_warmup)?
}),
concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::new(
self
.concurrent_tenant_size_logical_size_queries
.ok_or("concurrent_tenant_size_logical_size_queries",
default.concurrent_tenant_size_logical_size_queries.clone())?
),
eviction_task_immitated_concurrent_logical_size_queries: ConfigurableSemaphore::new(
// re-use `concurrent_tenant_size_logical_size_queries`
self
.concurrent_tenant_size_logical_size_queries
.ok_or("eviction_task_immitated_concurrent_logical_size_queries",
default.concurrent_tenant_size_logical_size_queries.clone())?,
),
virtual_file_io_engine: match self.virtual_file_io_engine {
BuilderValue::Set(v) => v,
BuilderValue::NotSet => match crate::virtual_file::io_engine_feature_test().context("auto-detect virtual_file_io_engine")? {
io_engine::FeatureTestResult::PlatformPreferred(v) => v, // make no noise
io_engine::FeatureTestResult::Worse { engine, remark } => {
// TODO: bubble this up to the caller so we can tracing::warn! it.
eprintln!("auto-detected IO engine is not platform-preferred: engine={engine:?} remark={remark:?}");
engine
}
},
},
}
))
}
}
@@ -757,6 +792,10 @@ impl PageServerConf {
self.workdir.join("deletion")
}
pub fn metadata_path(&self) -> Utf8PathBuf {
self.workdir.join("metadata.json")
}
pub fn deletion_list_path(&self, sequence: u64) -> Utf8PathBuf {
// Encode a version in the filename, so that if we ever switch away from JSON we can
// increment this.
@@ -816,18 +855,7 @@ impl PageServerConf {
.join(timeline_id.to_string())
}
pub fn timeline_uninit_mark_file_path(
&self,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
) -> Utf8PathBuf {
path_with_suffix_extension(
self.timeline_path(&tenant_shard_id, &timeline_id),
TIMELINE_UNINIT_MARK_SUFFIX,
)
}
pub fn timeline_delete_mark_file_path(
pub(crate) fn timeline_delete_mark_file_path(
&self,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
@@ -838,7 +866,10 @@ impl PageServerConf {
)
}
pub fn tenant_deleted_mark_file_path(&self, tenant_shard_id: &TenantShardId) -> Utf8PathBuf {
pub(crate) fn tenant_deleted_mark_file_path(
&self,
tenant_shard_id: &TenantShardId,
) -> Utf8PathBuf {
self.tenant_path(tenant_shard_id)
.join(TENANT_DELETED_MARKER_FILE_NAME)
}
@@ -942,6 +973,9 @@ impl PageServerConf {
let endpoint = parse_toml_string(key, item)?.parse().context("failed to parse metric_collection_endpoint")?;
builder.metric_collection_endpoint(Some(endpoint));
},
"metric_collection_bucket" => {
builder.metric_collection_bucket(RemoteStorageConfig::from_toml(item)?)
}
"synthetic_size_calculation_interval" =>
builder.synthetic_size_calculation_interval(parse_toml_duration(key, item)?),
"test_remote_failures" => builder.test_remote_failures(parse_toml_u64(key, item)?),
@@ -995,6 +1029,9 @@ impl PageServerConf {
"validate_vectored_get" => {
builder.get_validate_vectored_get(parse_toml_bool("validate_vectored_get", item)?)
}
"ephemeral_bytes_per_memory_kb" => {
builder.get_ephemeral_bytes_per_memory_kb(parse_toml_u64("ephemeral_bytes_per_memory_kb", item)? as usize)
}
_ => bail!("unrecognized pageserver option '{key}'"),
}
}
@@ -1057,6 +1094,7 @@ impl PageServerConf {
metric_collection_interval: Duration::from_secs(60),
cached_metric_collection_interval: Duration::from_secs(60 * 60),
metric_collection_endpoint: defaults::DEFAULT_METRIC_COLLECTION_ENDPOINT,
metric_collection_bucket: None,
synthetic_size_calculation_interval: Duration::from_secs(60),
disk_usage_based_eviction: None,
test_remote_failures: 0,
@@ -1075,6 +1113,7 @@ impl PageServerConf {
.expect("Invalid default constant"),
),
validate_vectored_get: defaults::DEFAULT_VALIDATE_VECTORED_GET,
ephemeral_bytes_per_memory_kb: defaults::DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB,
}
}
}
@@ -1289,6 +1328,7 @@ background_task_maximum_delay = '334 s'
defaults::DEFAULT_CACHED_METRIC_COLLECTION_INTERVAL
)?,
metric_collection_endpoint: defaults::DEFAULT_METRIC_COLLECTION_ENDPOINT,
metric_collection_bucket: None,
synthetic_size_calculation_interval: humantime::parse_duration(
defaults::DEFAULT_SYNTHETIC_SIZE_CALCULATION_INTERVAL
)?,
@@ -1311,6 +1351,7 @@ background_task_maximum_delay = '334 s'
.expect("Invalid default constant")
),
validate_vectored_get: defaults::DEFAULT_VALIDATE_VECTORED_GET,
ephemeral_bytes_per_memory_kb: defaults::DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB
},
"Correct defaults should be used when no config values are provided"
);
@@ -1363,6 +1404,7 @@ background_task_maximum_delay = '334 s'
metric_collection_interval: Duration::from_secs(222),
cached_metric_collection_interval: Duration::from_secs(22200),
metric_collection_endpoint: Some(Url::parse("http://localhost:80/metrics")?),
metric_collection_bucket: None,
synthetic_size_calculation_interval: Duration::from_secs(333),
disk_usage_based_eviction: None,
test_remote_failures: 0,
@@ -1381,6 +1423,7 @@ background_task_maximum_delay = '334 s'
.expect("Invalid default constant")
),
validate_vectored_get: defaults::DEFAULT_VALIDATE_VECTORED_GET,
ephemeral_bytes_per_memory_kb: defaults::DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB
},
"Should be able to parse all basic config values correctly"
);

View File

@@ -3,10 +3,13 @@
use crate::context::{DownloadBehavior, RequestContext};
use crate::task_mgr::{self, TaskKind, BACKGROUND_RUNTIME};
use crate::tenant::tasks::BackgroundLoopKind;
use crate::tenant::{mgr, LogicalSizeCalculationCause, PageReconstructError, Tenant};
use crate::tenant::{
mgr::TenantManager, LogicalSizeCalculationCause, PageReconstructError, Tenant,
};
use camino::Utf8PathBuf;
use consumption_metrics::EventType;
use pageserver_api::models::TenantState;
use remote_storage::{GenericRemoteStorage, RemoteStorageConfig};
use reqwest::Url;
use std::collections::HashMap;
use std::sync::Arc;
@@ -40,7 +43,9 @@ type Cache = HashMap<MetricsKey, (EventType, u64)>;
/// Main thread that serves metrics collection
#[allow(clippy::too_many_arguments)]
pub async fn collect_metrics(
tenant_manager: Arc<TenantManager>,
metric_collection_endpoint: &Url,
metric_collection_bucket: &Option<RemoteStorageConfig>,
metric_collection_interval: Duration,
_cached_metric_collection_interval: Duration,
synthetic_size_calculation_interval: Duration,
@@ -65,15 +70,19 @@ pub async fn collect_metrics(
None,
"synthetic size calculation",
false,
async move {
calculate_synthetic_size_worker(
synthetic_size_calculation_interval,
&cancel,
&worker_ctx,
)
.instrument(info_span!("synthetic_size_worker"))
.await?;
Ok(())
{
let tenant_manager = tenant_manager.clone();
async move {
calculate_synthetic_size_worker(
tenant_manager,
synthetic_size_calculation_interval,
&cancel,
&worker_ctx,
)
.instrument(info_span!("synthetic_size_worker"))
.await?;
Ok(())
}
},
);
@@ -94,13 +103,27 @@ pub async fn collect_metrics(
.build()
.expect("Failed to create http client with timeout");
let bucket_client = if let Some(bucket_config) = metric_collection_bucket {
match GenericRemoteStorage::from_config(bucket_config) {
Ok(client) => Some(client),
Err(e) => {
// Non-fatal error: if we were given an invalid config, we will proceed
// with sending metrics over the network, but not to S3.
tracing::warn!("Invalid configuration for metric_collection_bucket: {e}");
None
}
}
} else {
None
};
let node_id = node_id.to_string();
loop {
let started_at = Instant::now();
// these are point in time, with variable "now"
let metrics = metrics::collect_all_metrics(&cached_metrics, &ctx).await;
let metrics = metrics::collect_all_metrics(&tenant_manager, &cached_metrics, &ctx).await;
let metrics = Arc::new(metrics);
@@ -118,10 +141,18 @@ pub async fn collect_metrics(
tracing::error!("failed to persist metrics to {path:?}: {e:#}");
}
}
if let Some(bucket_client) = &bucket_client {
let res =
upload::upload_metrics_bucket(bucket_client, &cancel, &node_id, &metrics).await;
if let Err(e) = res {
tracing::error!("failed to upload to S3: {e:#}");
}
}
};
let upload = async {
let res = upload::upload_metrics(
let res = upload::upload_metrics_http(
&client,
metric_collection_endpoint,
&cancel,
@@ -132,7 +163,7 @@ pub async fn collect_metrics(
.await;
if let Err(e) = res {
// serialization error which should never happen
tracing::error!("failed to upload due to {e:#}");
tracing::error!("failed to upload via HTTP due to {e:#}");
}
};
@@ -247,6 +278,7 @@ async fn reschedule(
/// Caclculate synthetic size for each active tenant
async fn calculate_synthetic_size_worker(
tenant_manager: Arc<TenantManager>,
synthetic_size_calculation_interval: Duration,
cancel: &CancellationToken,
ctx: &RequestContext,
@@ -259,7 +291,7 @@ async fn calculate_synthetic_size_worker(
loop {
let started_at = Instant::now();
let tenants = match mgr::list_tenants().await {
let tenants = match tenant_manager.list_tenants() {
Ok(tenants) => tenants,
Err(e) => {
warn!("cannot get tenant list: {e:#}");
@@ -278,10 +310,14 @@ async fn calculate_synthetic_size_worker(
continue;
}
let Ok(tenant) = mgr::get_tenant(tenant_shard_id, true) else {
let Ok(tenant) = tenant_manager.get_attached_tenant_shard(tenant_shard_id) else {
continue;
};
if !tenant.is_active() {
continue;
}
// there is never any reason to exit calculate_synthetic_size_worker following any
// return value -- we don't need to care about shutdown because no tenant is found when
// pageserver is shut down.
@@ -319,9 +355,7 @@ async fn calculate_and_log(tenant: &Tenant, cancel: &CancellationToken, ctx: &Re
};
// this error can be returned if timeline is shutting down, but it does not
// mean the synthetic size worker should terminate. we do not need any checks
// in this function because `mgr::get_tenant` will error out after shutdown has
// progressed to shutting down tenants.
// mean the synthetic size worker should terminate.
let shutting_down = matches!(
e.downcast_ref::<PageReconstructError>(),
Some(PageReconstructError::Cancelled | PageReconstructError::AncestorStopping(_))

View File

@@ -1,3 +1,4 @@
use crate::tenant::mgr::TenantManager;
use crate::{context::RequestContext, tenant::timeline::logical_size::CurrentLogicalSize};
use chrono::{DateTime, Utc};
use consumption_metrics::EventType;
@@ -181,6 +182,7 @@ impl MetricsKey {
}
pub(super) async fn collect_all_metrics(
tenant_manager: &Arc<TenantManager>,
cached_metrics: &Cache,
ctx: &RequestContext,
) -> Vec<RawMetric> {
@@ -188,7 +190,7 @@ pub(super) async fn collect_all_metrics(
let started_at = std::time::Instant::now();
let tenants = match crate::tenant::mgr::list_tenants().await {
let tenants = match tenant_manager.list_tenants() {
Ok(tenants) => tenants,
Err(err) => {
tracing::error!("failed to list tenants: {:?}", err);
@@ -200,7 +202,8 @@ pub(super) async fn collect_all_metrics(
if state != TenantState::Active || !id.is_zero() {
None
} else {
crate::tenant::mgr::get_tenant(id, true)
tenant_manager
.get_attached_tenant_shard(id)
.ok()
.map(|tenant| (id.tenant_id, tenant))
}

View File

@@ -1,4 +1,9 @@
use std::time::SystemTime;
use chrono::{DateTime, Utc};
use consumption_metrics::{Event, EventChunk, IdempotencyKey, CHUNK_SIZE};
use remote_storage::{GenericRemoteStorage, RemotePath};
use tokio::io::AsyncWriteExt;
use tokio_util::sync::CancellationToken;
use tracing::Instrument;
@@ -13,8 +18,9 @@ struct Ids {
pub(super) timeline_id: Option<TimelineId>,
}
/// Serialize and write metrics to an HTTP endpoint
#[tracing::instrument(skip_all, fields(metrics_total = %metrics.len()))]
pub(super) async fn upload_metrics(
pub(super) async fn upload_metrics_http(
client: &reqwest::Client,
metric_collection_endpoint: &reqwest::Url,
cancel: &CancellationToken,
@@ -74,6 +80,60 @@ pub(super) async fn upload_metrics(
Ok(())
}
/// Serialize and write metrics to a remote storage object
#[tracing::instrument(skip_all, fields(metrics_total = %metrics.len()))]
pub(super) async fn upload_metrics_bucket(
client: &GenericRemoteStorage,
cancel: &CancellationToken,
node_id: &str,
metrics: &[RawMetric],
) -> anyhow::Result<()> {
if metrics.is_empty() {
// Skip uploads if we have no metrics, so that readers don't have to handle the edge case
// of an empty object.
return Ok(());
}
// Compose object path
let datetime: DateTime<Utc> = SystemTime::now().into();
let ts_prefix = datetime.format("year=%Y/month=%m/day=%d/%H:%M:%SZ");
let path = RemotePath::from_string(&format!("{ts_prefix}_{node_id}.ndjson.gz"))?;
// Set up a gzip writer into a buffer
let mut compressed_bytes: Vec<u8> = Vec::new();
let compressed_writer = std::io::Cursor::new(&mut compressed_bytes);
let mut gzip_writer = async_compression::tokio::write::GzipEncoder::new(compressed_writer);
// Serialize and write into compressed buffer
let started_at = std::time::Instant::now();
for res in serialize_in_chunks(CHUNK_SIZE, metrics, node_id) {
let (_chunk, body) = res?;
gzip_writer.write_all(&body).await?;
}
gzip_writer.flush().await?;
gzip_writer.shutdown().await?;
let compressed_length = compressed_bytes.len();
// Write to remote storage
client
.upload_storage_object(
futures::stream::once(futures::future::ready(Ok(compressed_bytes.into()))),
compressed_length,
&path,
cancel,
)
.await?;
let elapsed = started_at.elapsed();
tracing::info!(
compressed_length,
elapsed_ms = elapsed.as_millis(),
"write metrics bucket at {path}",
);
Ok(())
}
// The return type is quite ugly, but we gain testability in isolation
fn serialize_in_chunks<'a, F>(
chunk_size: usize,

Some files were not shown because too many files have changed in this diff Show More