Compare commits


70 Commits

Author SHA1 Message Date
Arseny Sher
cfd78950e2 basic sk bench of pgbench init with perf fixtures 2024-01-30 14:24:27 +03:00
Arpad Müller
734755eaca Enable nextest retries for the arm build (#6496)
Also make the NEXTEST_RETRIES declaration more local.

Requested in https://github.com/neondatabase/neon/pull/6493#issuecomment-1912110202
2024-01-27 05:16:11 +01:00
Christian Schwarz
e34166a28f CI: switch back to std-fs io engine for soak time before next release (#6492)
PR #5824 introduced the concept of io engines in pageserver and
implemented `tokio-epoll-uring` in addition to our current method,
`std-fs`.

We used `tokio-epoll-uring` in CI for a day to get more exposure to
the code.  Now it's time to switch CI back so that we test with `std-fs`
as well, because that's what we're (still) using in production.
2024-01-26 22:48:34 +01:00
Christian Schwarz
3a36a0a227 fix(test suite): some tests leak child processes (#6497) 2024-01-26 18:23:53 +00:00
John Spray
58f6cb649e control_plane: database persistence for attachment_service (#6468)
## Problem

Spun off from https://github.com/neondatabase/neon/pull/6394 -- this PR
is just the persistence parts and the changes that enable it to work
nicely


## Summary of changes

- Revert #6444 and #6450
- In neon_local, start a vanilla postgres instance for the attachment
service to use.
- Adopt the `diesel` crate for database access in the attachment service.
This uses raw SQL migrations as the source of truth for the schema, so it's
a soft dependency: we can switch libraries fairly easily (see the sketch
after this list).
- Rewrite persistence.rs to use postgres (via diesel) instead of JSON.
- Preserve JSON read+write at startup and shutdown: this enables using
the JSON format in compatibility tests, so that we don't have to commit
to our DB schema yet.
- In neon_local, run database creation + migrations before starting
attachment service
- Run the initial reconciliation in Service::spawn in the background, so
that the pageserver + attachment service don't get stuck waiting for
each other to start, when restarting both together in a test.
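
For illustration, a minimal sketch of what a diesel-backed model for this
service can look like. The table and column names here are invented for the
example, not the service's real schema:

```rust
use diesel::prelude::*;

// Hypothetical table, for illustration only.
diesel::table! {
    tenant_shards (tenant_id) {
        tenant_id -> Text,
        generation -> Int4,
        placement_policy -> Text,
    }
}

#[derive(Queryable, Insertable)]
#[diesel(table_name = tenant_shards)]
struct TenantShardRow {
    tenant_id: String,
    generation: i32,
    placement_policy: String,
}

// Load all persisted rows; the raw SQL migrations define the schema,
// diesel only maps it.
fn load_all(conn: &mut PgConnection) -> QueryResult<Vec<TenantShardRow>> {
    tenant_shards::table.load(conn)
}
```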
2024-01-26 17:20:44 +00:00
Arpad Müller
dcc7610ad6 Do backoff::retry in s3 timetravel test (#6493)
The top level retries weren't enough, probably because we do so many
network requests. Fine grained retries ensure that there is higher
potential for the entire test to succeed.

To demonstrate this, consider the following example: assume that each
request has a 5% chance of failing and we do 10 requests. Then the chance
of success without any retries is 0.95^10 ≈ 0.6. With 3 top-level retries
it is 1 - 0.4^3 ≈ 0.936. With 3 fine-grained retries it is
(1 - 0.05^3)^10 ≈ 0.9988. So the chance of failure is 6.4% with top-level
retries vs. 0.12% with fine-grained retries.
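
A quick check of that arithmetic, sketched in Rust (illustrative only, not
part of the change):

```rust
fn main() {
    let p_fail = 0.05_f64; // per-request failure probability
    let requests = 10;     // requests per test run
    let retries = 3;       // attempts per retried unit

    // No retries: every request must succeed on its only attempt.
    let no_retry = (1.0 - p_fail).powi(requests);
    // Top-level retries: retry the whole 10-request sequence up to 3 times.
    let top_level = 1.0 - (1.0 - no_retry).powi(retries);
    // Fine-grained retries: each individual request gets up to 3 attempts.
    let fine_grained = (1.0 - p_fail.powi(retries)).powi(requests);

    println!("no retries:   {no_retry:.4}");     // ~0.5987
    println!("top-level:    {top_level:.4}");    // ~0.9355
    println!("fine-grained: {fine_grained:.4}"); // ~0.9988
}
```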

Follow-up of #6155
2024-01-26 16:43:56 +00:00
Alexander Bayandin
4c245b0f5a update_build_tools_image.yml: Push build-tools image to Docker Hub (#6481)
## Problem

- `docker.io/neondatabase/build-tools:pinned` image is frequently
outdated on Docker Hub because there's no automated way to update it.
- `update_build_tools_image.yml` workflow contains legacy roll-back
logic, which is not required anymore because it updates only a single
image.

## Summary of changes
- Make `update_build_tools_image.yml` workflow push images to both ECR
and Docker Hub
- Remove unneeded roll-back logic
2024-01-26 16:12:49 +00:00
John Spray
55b7cde665 tests: add basic coverage for sharding (#6380)
## Problem

The support for sharding in the pageserver was written before
https://github.com/neondatabase/neon/pull/6205 landed, so when it landed
we couldn't directly test sharding.

## Summary of changes

- Add `test_sharding_smoke`, which tests the basics of creating a
sharded tenant, creating a timeline within it, and checking that data
within it is distributed.
- Add modes to pg_regress tests for running with 4 shards as well as
with 1.
2024-01-26 14:40:47 +00:00
Vlad Lazar
5b34d5f561 pageserver: add vectored get latency histogram (#6461)
This patch introduces a new histogram metric:
pageserver_get_vectored_seconds_bucket{task_kind="Compaction|PageRequestHandler"}.

While it has a `task_kind` label, only compaction and SLRU fetches are
tracked. This limits the increase in cardinality to 24.

The metric should allow us to isolate performance regressions while the
vectorized get is being implemented. Once the implementation is
complete, it'll also allow us to quantify the improvements.
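
For illustration, roughly how such a histogram can be declared with the
`prometheus` crate; the metric name follows the description above, but the
buckets and helper names are assumptions, not the actual pageserver code:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Sketch only: bucket boundaries are assumed for the example.
static GET_VECTORED_SECONDS: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "pageserver_get_vectored_seconds",
        "Time spent in Timeline::get_vectored",
        &["task_kind"], // only Compaction and PageRequestHandler are tracked
        vec![0.001, 0.01, 0.1, 1.0, 10.0]
    )
    .expect("failed to register metric")
});

fn observe(task_kind: &str, seconds: f64) {
    GET_VECTORED_SECONDS
        .with_label_values(&[task_kind])
        .observe(seconds);
}
```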
2024-01-26 13:40:03 +00:00
Alexander Bayandin
26c55b0255 Compute: fix rdkit extension build (#6488)
## Problem

`rdkit` extension build started to fail because of the changed checksum
of the Comic Neue font:

```
Downloading https://fonts.google.com/download?family=Comic%20Neue...
CMake Error at Code/cmake/Modules/RDKitUtils.cmake:257 (MESSAGE):
  The md5 checksum for /rdkit-src/Code/GraphMol/MolDraw2D/Comic_Neue.zip is
  incorrect; expected: 850b0df852f1cda4970887b540f8f333, found:
  b7fd0df73ad4637504432d72a0accb8f
```

https://github.com/neondatabase/neon/actions/runs/7666530536/job/20895534826

Ref https://neondb.slack.com/archives/C059ZC138NR/p1706265392422469

## Summary of changes
- Disable comic fonts for `rdkit` extension
2024-01-26 12:39:20 +00:00
Vadim Kharitonov
12e9b2a909 Update plv8 (#6465) 2024-01-26 09:56:11 +00:00
Christian Schwarz
918b03b3b0 integrate tokio-epoll-uring as alternative VirtualFile IO engine (#5824) 2024-01-26 09:25:07 +01:00
Alexander Bayandin
d36623ad74 CI: cancel old e2e-tests on new commits (#6463)
## Problem

Triggered `e2e-tests` job is not cancelled along with other jobs in a PR
if the PR gets new commits. We can improve the situation by setting
`concurrency_group` for the remote workflow
(https://github.com/neondatabase/cloud/pull/9622 adds
`concurrency_group` group input to the remote workflow).

Ref https://neondb.slack.com/archives/C059ZC138NR/p1706087124297569

Cloud's part added in https://github.com/neondatabase/cloud/pull/9622

## Summary of changes
- Set `concurrency_group` parameter when triggering `e2e-tests`
- At the beginning of a CI pipeline, trigger Cloud's
`cancel-previous-in-concurrency-group.yml` workflow which cancels
previously triggered e2e-tests
2024-01-25 19:25:29 +00:00
Christian Schwarz
689ad72e92 fix(neon_local): leaks child process if it fails to start & pass checks (#6474)
refs https://github.com/neondatabase/neon/issues/6473

Before this PR, if process_started() didn't return Ok(true) until we
ran out of retries, we'd return an error but leave the process running.

Try it by adding a 20s sleep to the pageserver `main()`, e.g., right
before we claim the pidfile.

Without this PR, output looks like so:

```
(.venv) cs@devvm-mbp:[~/src/neon-work-2]: ./target/debug/neon_local start
Starting neon broker at 127.0.0.1:50051.
storage_broker started, pid: 2710939
.
attachment_service started, pid: 2710949
Starting pageserver node 1 at '127.0.0.1:64000' in ".neon/pageserver_1".....
pageserver has not started yet, continuing to wait.....
pageserver 1 start failed: pageserver did not start in 10 seconds
No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/pageserver_1/pageserver.pid"
No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/safekeepers/sk1/safekeeper.pid"
Stopping storage_broker with pid 2710939 immediately.......
storage_broker has not stopped yet, continuing to wait.....
neon broker stop failed: storage_broker with pid 2710939 did not stop in 10 seconds
Stopping attachment_service with pid 2710949 immediately.......
attachment_service has not stopped yet, continuing to wait.....
attachment service stop failed: attachment_service with pid 2710949 did not stop in 10 seconds
```

and we leak the pageserver process

```
(.venv) cs@devvm-mbp:[~/src/neon-work-2]: ps aux | grep pageserver
cs       2710959  0.0  0.2 2377960 47616 pts/4   Sl   14:36   0:00 /home/cs/src/neon-work-2/target/debug/pageserver -D .neon/pageserver_1 -c id=1 -c pg_distrib_dir='/home/cs/src/neon-work-2/pg_install' -c http_auth_type='Trust' -c pg_auth_type='Trust' -c listen_http_addr='127.0.0.1:9898' -c listen_pg_addr='127.0.0.1:64000' -c broker_endpoint='http://127.0.0.1:50051/' -c control_plane_api='http://127.0.0.1:1234/' -c remote_storage={local_path='../local_fs_remote_storage/pageserver'}
```

After this PR, there is no leaked process.
2024-01-25 19:20:02 +01:00
Christian Schwarz
fd4cce9417 test_pageserver_max_throughput_getpage_at_latest_lsn: remove n_tenants=100 combination (#6477)
Need to fix the neon_local timeouts first
(https://github.com/neondatabase/neon/issues/6473)
and also not run them on every merge, but only nightly:
https://github.com/neondatabase/neon/issues/6476
2024-01-25 18:17:53 +00:00
Arpad Müller
d52b81340f S3 based recovery (#6155)
Adds a new `time_travel_recover` function to the `RemoteStorage` trait
that allows time-travel-like functionality for S3 buckets, regardless of
their content (it is not even pageserver related). It takes a different
approach from [this
post](https://aws.amazon.com/blogs/storage/point-in-time-restore-for-amazon-s3-buckets/),
which is more complicated.

It takes as input a prefix, a target timestamp, and a limit timestamp, and:

* executes [`ListObjectVersions`](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectVersions.html)
* obtains the latest version that comes before the target timestamp
* copies that latest version to the same prefix
* if there are versions newer than the limit timestamp, it doesn't do
anything for the file

The limit timestamp is meant to be some timestamp before the start of
the recovery operation and after any changes that one wants to revert.
For example, it might be the time point after a tenant was detached from
all involved pageservers. The limiting mechanism ensures that the
operation is idempotent and can be retried without causing additional
writes/copies.

The approach fulfills all the requirements laid out in #8233, and is a
recoverable operation. Nothing is deleted permanently; only new entries are
added to the version log.
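
A sketch of the per-key decision described above; the types and helpers here
are assumptions for illustration, not the actual `RemoteStorage` code:

```rust
use std::time::SystemTime;

struct ObjectVersion {
    version_id: String,
    last_modified: SystemTime,
}

/// For one key, pick the version to copy back on top, or None to skip.
fn restore_key(
    versions: &[ObjectVersion], // all versions of one key
    target: SystemTime,         // point in time to restore to
    limit: SystemTime,          // set before the recovery started
) -> Option<&ObjectVersion> {
    // Idempotency: anything written after `limit` means a previous
    // recovery run already handled this key, so do nothing.
    if versions.iter().any(|v| v.last_modified > limit) {
        return None;
    }
    // Otherwise pick the newest version at or before the target
    // timestamp; the caller copies it to the same prefix.
    versions
        .iter()
        .filter(|v| v.last_modified <= target)
        .max_by_key(|v| v.last_modified)
}
```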

I also enable [nextest retries](https://nexte.st/book/retries.html) to
help with some general S3 flakiness (on top of low level retries).

Part of https://github.com/neondatabase/cloud/issues/8233
2024-01-25 18:23:18 +01:00
Joonas Koivunen
8dee9908f8 fix(compaction_task): wrong log levels (#6442)
Filter what we log in the compaction task. Per discussion in the last triage
call, fix these by introducing and inspecting the root cause within
anyhow::Error instead of rolling out proper conversions.

Fixes: #6365
Fixes: #6367
2024-01-25 18:45:17 +02:00
Konstantin Knizhnik
19ed230708 Add support for PS sharding in compute (#6205)
refer #5508

replaces #5837

## Problem

This PR implements sharding support on the compute side. Relations are
split into stripes and `get_page` requests are redirected to the
particular shard where the stripe is located. All other requests (e.g. get
relation or database size) are always sent to shard 0.

## Summary of changes

Sharding support on the compute side includes three things:
1. Make it possible to specify, and change at runtime, connections to more
than one pageserver
2. Send `get_page` requests to the particular shard (determined by a hash
of the page key; see the sketch after this list)
3. Support multiple servers in prefetch ring requests
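
A minimal sketch of the stripe-based routing idea. The real compute code is
C and the actual key hashing differs; the names and numbers here are
illustrative assumptions:

```rust
// Pages are grouped into stripes; consecutive stripes of a relation land
// on consecutive shards, so each `get_page` goes to exactly one shard.
fn shard_for_page(rel_hash: u32, block_no: u32, stripe_size: u32, shard_count: u32) -> u32 {
    let stripe_no = rel_hash.wrapping_add(block_no / stripe_size);
    stripe_no % shard_count
}

fn main() {
    // With 4 shards and a stripe size of 8 pages, blocks 0..7 of this
    // relation go to one shard and blocks 8..15 to the next.
    assert_eq!(shard_for_page(0, 0, 8, 4), 0);
    assert_eq!(shard_for_page(0, 8, 8, 4), 1);
}
```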

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-25 15:53:31 +02:00
Joonas Koivunen
463b6a26b5 test: show relative order eviction with "fast growing tenant" (#6377)
Refactor out test_disk_usage_eviction tenant creation and add a custom
case with 4 tenants, 3 made with pgbench scale=1 and 1 made with pgbench
scale=4.

Because the tenants are created in order of scales [1, 1, 1, 4], this is
simple enough to demonstrate the problem with using absolute access
times: on a disk-usage-based eviction run we disproportionately target
the *first* scale=1 tenant(s), while the later, larger tenant does not
lose anything.

This test is not enough to show the difference between `relative_equal`
and `relative_spare` (the fudge factor); a much larger scale would be
needed for "the large tenant", but that would make debug-mode tests
slower.

Cc: #5304
2024-01-25 15:38:28 +02:00
John Spray
c9b1657e4c pageserver: fixes for creation operations overlapping with shutdown/startup (#6436)
## Problem

For #6423, creating a reproducer turned out to be very easy, as an
extension to test_ondemand_activation.

However, before I had diagnosed the issue, I was starting with a more
brute force approach of running creation API calls in the background
while restarting a pageserver, and that shows up a bunch of other
interesting issues.

In this PR:
- Add the reproducer for #6423 by extending `test_ondemand_activation`
(confirmed that this test fails if I revert the fix from
https://github.com/neondatabase/neon/pull/6430)
- In timeline creation, return 503 responses when we get an error and
the tenant's cancellation token is set: this covers the cases where we
get an anyhow::Error from something during timeline creation as a result
of shutdown.
- While waiting for tenants to become active during creation, don't
.map_err() the result to a 500: instead let the `From` impl map the
result to something appropriate (this includes mapping shutdown to 503)
- During tenant creation, we were calling `Tenant::load_local` because
no Preload object is provided. This is usually harmless because the
tenant dir is empty, but if there are some half-created timelines in
there, bad things can happen. Propagate the SpawnMode into
Tenant::attach, so that it can properly skip _any_ attempt to load
timelines if creating.
- When we call upsert_location, there's a SpawnMode that tells us
whether to load from remote storage or not. But if the operation is a
retry and we already have the tenant, it is not correct to skip loading
from remote storage: there might be a timeline there. This isn't
strictly a correctness issue as long as the caller behaves correctly
(does not assume that any timelines are persistent until the creation is
acked), but it's a more defensive position.
- If we shut down while the task in Tenant::attach is running, it can
end up spawning rogue tasks. Fix this by holding a GateGuard through
here, and in upsert_location shutting down a tenant after calling
tenant_spawn if we can't insert it into tenants_map. This fixes the
expected behavior that after shutdown_all_tenants returns, no tenant
tasks are running.
- Add `test_create_churn_during_restart`, which runs tenant & timeline
creations across pageserver restarts.
- Update a couple of tests that covered cancellation, to reflect the
cleaner errors we now return.
2024-01-25 12:35:52 +00:00
Arpad Müller
b92be77e19 Make RemoteStorage not use async_trait (#6464)
Makes the `RemoteStorage` trait no longer based on `async_trait`.

To avoid async recursion (not supported by Rust without boxing), we made
`GenericRemoteStorage` generic over the "Unreliable" variant. That allows
us to have the unreliable wrapper never contain/call itself.
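
Roughly the shape of the change, with simplified names (a sketch, not the
actual definitions):

```rust
struct S3Bucket;
struct LocalFs;
struct UnreliableWrapper; // in the real code this wraps a concrete storage

// If `Unreliable` held another `GenericRemoteStorage`, the trait's async
// methods would recursively call themselves, which async fns cannot do
// without boxing. Making the variant's payload a type parameter breaks
// that cycle: the unreliable arm can never contain the enum itself.
enum GenericRemoteStorage<U = UnreliableWrapper> {
    S3(S3Bucket),
    LocalFs(LocalFs),
    Unreliable(U),
}
```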

related earlier work: #6305
2024-01-24 21:27:54 +01:00
Arthur Petukhovsky
8cb8c8d7b5 Allow remove_wal.rs to run on inactive timelines (#6462)
Temporarily enable it on staging to help with
https://github.com/neondatabase/neon/issues/6403.
It can also be deployed to prod if it works well on staging.
2024-01-24 16:48:56 +00:00
Conrad Ludgate
210700d0d9 proxy: add newtype wrappers for string based IDs (#6445)
## Problem

Too many string-based IDs; it's easy to mix up ID types.

## Summary of changes

Add a bunch of `SmolStr` wrappers that provide convenience methods but
are type-safe; see the sketch below.
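
The newtype pattern in question, sketched (the wrapper names here are
examples, not necessarily the PR's exact types):

```rust
use smol_str::SmolStr;

// Each ID kind gets its own type, so a ProjectId can no longer be passed
// where an EndpointId is expected; SmolStr keeps small IDs inline.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ProjectId(SmolStr);

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct EndpointId(SmolStr);

impl ProjectId {
    fn as_str(&self) -> &str {
        &self.0
    }
}

impl From<&str> for ProjectId {
    fn from(s: &str) -> Self {
        Self(SmolStr::from(s))
    }
}
```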
2024-01-24 16:38:10 +00:00
Joonas Koivunen
a0a3ba85e7 fix(page_service): walredo logging problem (#6460)
Fixes #6459 by logging the full cause chain of an error, while keeping
only the top-level string for the end user.

Changes the user-visible error detail from:

```
-DETAIL:  page server returned error: Read error: Failed to reconstruct a page image:
+DETAIL:  page server returned error: Read error
```

However, in the pageserver logs:

```
-ERROR page_service_conn_main{...}: error reading relation or page version: Read error: Failed to reconstruct a page image:
+ERROR page_service_conn_main{...}: error reading relation or page version: Read error: reconstruct a page image: launch walredo process: spawn process: Permission denied (os error 13)
```
2024-01-24 15:47:17 +00:00
Arpad Müller
d820aa1d08 Disable initdb cancellation (#6451)
## Problem

The initdb cancellation added in #5921 is not sufficient to reliably
abort the entire initdb process. Initdb also spawns children. The tests
added by #6310 (#6385) and #6436 now do initdb cancellations on a more
regular basis.

In #6385, I attempted to issue `killpg` (after giving it a new process
group ID) to kill not just the initdb but all its spawned subprocesses,
but this didn't work. Initdb doesn't take *that* long in the end either,
so we just wait until it concludes.

## Summary of changes

* revert initdb cancellation support added in #5921
* still return `Err(Cancelled)` upon cancellation; this is just to avoid
having to remove the cancellation infrastructure
* fixes to the `test_tenant_delete_races_timeline_creation` test to make
it reliably pass

Fixes #6385
2024-01-24 13:06:05 +01:00
Christian Schwarz
996abc9563 pagebench-based GetPage@LSN performance test (#6214) 2024-01-24 12:51:53 +01:00
John Spray
a72af29d12 control_plane/attachment_service: implement PlacementPolicy::Detached (#6458)
## Problem

The API for detaching things wasn't implemented yet, but one could hit
this case indirectly from tests when using attach-hook, and find tenants
unexpectedly attached again because their policy remained Single.
## Summary of changes

Add PlacementPolicy::Detached, and:
- add the behavior for it in schedule()
- in tenant_migrate, refuse if the policy is detached
- automatically set this policy in attach-hook if the caller has
specified pageserver=null.
2024-01-24 12:49:30 +01:00
Sasha Krassovsky
4f51824820 Fix creating publications for all tables 2024-01-23 22:41:00 -08:00
Christian Schwarz
743f6dfb9b fix(attachment_service): corrupted attachments.json when parallel requests (#6450)
The pagebench integration PR (#6214) issues attachment requests in
parallel.
We observed corrupted attachments.json from time to time, especially in
the test cases with high tenant counts.

The atomic overwrite added in #6444 exposed the root cause cleanly:
the `.commit()` calls of two request handlers could interleave or
be reordered.
See also:
https://github.com/neondatabase/neon/pull/6444#issuecomment-1906392259

This PR makes changes to the `persistence` module to fix the above race:
- mpsc queue for PendingWrites
- one writer task performs the writes in mpsc queue order
- request handlers that need to do writes do it using the
  new `mutating_transaction` function.

`mutating_transaction`, while holding the lock, makes the modifications,
serializes the post-modification state, and pushes that as a
`PendingWrite` into the mpsc queue.
It then releases the lock and `await`s the completion of the write.
The writer task executes the `PendingWrites` in queue order.
Once a write has been executed, it wakes the writing tokio task.
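
A sketch of that pattern using tokio primitives; this is simplified relative
to the actual module (names mirror the description above):

```rust
use tokio::sync::{mpsc, oneshot, Mutex};

struct PendingWrite {
    serialized: String,        // post-modification state, already serialized
    done: oneshot::Sender<()>, // fired by the writer task once persisted
}

struct Persistence {
    state: Mutex<serde_json::Value>,
    writes: mpsc::UnboundedSender<PendingWrite>,
}

impl Persistence {
    async fn mutating_transaction(&self, mutate: impl FnOnce(&mut serde_json::Value)) {
        // Hold the lock only while mutating and serializing...
        let rx = {
            let mut state = self.state.lock().await;
            mutate(&mut *state);
            let (tx, rx) = oneshot::channel();
            self.writes
                .send(PendingWrite { serialized: state.to_string(), done: tx })
                .expect("writer task gone");
            rx
        };
        // ...then await completion without blocking other mutators.
        let _ = rx.await;
    }
}

// Single writer: executes writes strictly in queue order, so two handlers'
// commits can no longer interleave or be reordered on disk.
async fn writer_task(mut rx: mpsc::UnboundedReceiver<PendingWrite>) {
    while let Some(write) = rx.recv().await {
        let _ = tokio::fs::write("attachments.json", write.serialized).await;
        let _ = write.done.send(());
    }
}
```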
2024-01-23 19:14:32 +00:00
Arpad Müller
faf275d4a2 Remove initdb on timeline delete (#6387)
This PR:

* makes `initdb.tar.zst` be deleted by default on timeline deletion
(#6226), mirroring the safekeeper:
https://github.com/neondatabase/neon/pull/6381
* adds a new `preserve_initdb_archive` endpoint for a timeline, to be
used during the disaster recovery process, see reasoning
[here](https://github.com/neondatabase/neon/issues/6226#issuecomment-1894574778)
* makes the creation code look for `initdb-preserved.tar.zst` in
addition to `initdb.tar.zst`.
* makes the tests use the new endpoint

fixes #6226
2024-01-23 18:22:59 +00:00
Vlad Lazar
001f0d6db7 pageserver: fix import failure caused by merge race (#6448)
PR #6406 raced with #6372 and broke main.
2024-01-23 18:07:01 +01:00
Christian Schwarz
42c17a6fc6 attachment_service: use atomic overwrite to persist attachments.json (#6444)
The pagebench integration PR (#6214) is the first to SIGQUIT & then
restart attachment_service.

With many tenants (100), we have found frequent failures on restart in
the CI[^1].

[^1]:
[Allure](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6214/7615750160/index.html#suites/e26265675583c610f99af77084ae58f1/851ff709578c4452/)

```
2024-01-22T19:07:57.932021Z  INFO request{method=POST path=/attach-hook request_id=2697503c-7b3e-4529-b8c1-d12ef912d3eb}: Request handled, status: 200 OK
2024-01-22T19:07:58.898213Z  INFO Got SIGQUIT. Terminating
2024-01-22T19:08:02.176588Z  INFO version: git-env:d56f31639356ed8e8ce832097f132f27ee19ac8a, launch_timestamp: 2024-01-22 19:08:02.174634554 UTC, build_tag build_tag-env:7615750160, state at /tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json, listening on 127.0.0.1:15048
thread 'main' panicked at /__w/neon/neon/control_plane/attachment_service/src/persistence.rs:95:17:
Failed to load state from '/tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json': trailing characters at line 1 column 8957 (maybe your .neon/ dir was written by an older version?)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: attachment_service::persistence::PersistentState::load_or_new::{{closure}}
             at ./control_plane/attachment_service/src/persistence.rs:95:17
   3: attachment_service::persistence::Persistence::new::{{closure}}
             at ./control_plane/attachment_service/src/persistence.rs:103:56
   4: attachment_service::main::{{closure}}
             at ./control_plane/attachment_service/src/main.rs:69:61
   5: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
   6: tokio::runtime::coop::with_budget
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
   7: tokio::runtime::coop::budget
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
   8: tokio::runtime::park::CachedParkThread::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
   9: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
  10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  11: tokio::runtime::context::runtime::enter_runtime
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
  12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  13: tokio::runtime::runtime::Runtime::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:50
  14: attachment_service::main
             at ./control_plane/attachment_service/src/main.rs:99:5
  15: core::ops::function::FnOnce::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```

The attachment_service handles SIGQUIT by just exiting the process.
In theory, the SIGQUIT could come in while we're writing out
`attachments.json`.

Now, in the above log output, there's a 1-second gap between the last
request completing and the SIGQUIT coming in. So there must be some
other issue.

But let's make this change anyway; maybe it helps uncover the real
cause of the test failure.
2024-01-23 17:21:06 +01:00
Vlad Lazar
37638fce79 pageserver: introduce vectored Timeline::get interface (#6372)
1. Introduce a naive `Timeline::get_vectored` implementation

The return type is intended to be flexible enough for various types of
callers. We return the pages in a map keyed by `Key` such that the
caller doesn't have to map back to the key if it needs to know it. Some
callers can ignore errors
for specific pages, so we return a separate `Result<Bytes,
PageReconstructError>` for each page and an overarching
`GetVectoredError` for API misuse. The overhead of the mapping will be
small and bounded since we enforce a maximum key count for the
operation.

2. Use the `get_vectored` API for SLRU segment reconstruction and image
layer creation.
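
The return shape described in (1), sketched with simplified types (a sketch
of the API shape, not the actual signature):

```rust
use std::collections::BTreeMap;

type Key = u64;
type Bytes = Vec<u8>;

#[derive(Debug)]
enum PageReconstructError { Missing } // per-page failure (simplified)

#[derive(Debug)]
enum GetVectoredError { TooManyKeys } // API misuse (simplified)

// Each requested page gets its own Result, keyed by Key, so callers that
// can tolerate failures for some pages keep the rest; the outer Result
// covers misuse such as exceeding the maximum key count.
fn get_vectored(
    keys: &[Key],
) -> Result<BTreeMap<Key, Result<Bytes, PageReconstructError>>, GetVectoredError> {
    const MAX_KEYS: usize = 32; // assumed bound for the sketch
    if keys.len() > MAX_KEYS {
        return Err(GetVectoredError::TooManyKeys);
    }
    // Stubbed: a real implementation reconstructs each page here.
    Ok(keys
        .iter()
        .map(|&k| (k, Err(PageReconstructError::Missing)))
        .collect())
}
```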
2024-01-23 14:23:53 +00:00
Christian Schwarz
50288c16b1 fix(pagebench): avoid CopyFail error in success case (#6443)
PR #6392 fixed CopyFail in the case where we get cancelled.
But we also want to use `client.shutdown()` if we don't get cancelled.
2024-01-23 15:11:32 +01:00
Conrad Ludgate
e03f8abba9 eager parsing of ip addr (#6446)
## Problem

Parsing the IP address at check time is a little wasteful. 

## Summary of changes

Parse the IP when we get it from cplane, adding a `None` variant to
still allow malformed patterns.
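
A sketch of the idea (names are illustrative, not the proxy's actual types):

```rust
use std::net::IpAddr;

// Parse once at fetch time; keep malformed entries representable so one
// bad pattern doesn't invalidate the whole allow-list.
enum IpPattern {
    Addr(IpAddr),
    None, // malformed entry from the control plane; never matches
}

fn parse_pattern(s: &str) -> IpPattern {
    s.parse::<IpAddr>().map(IpPattern::Addr).unwrap_or(IpPattern::None)
}
```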
2024-01-23 13:25:01 +00:00
Anna Khanova
1905f0bced proxy: store role not found in cache (#6439)
## Problem

There are a lot of responses with a 404 "role not found" error, which are
not getting cached in proxy.

## Summary of changes

If an empty secret is returned together with a project_id, store it
in the cache.
2024-01-23 13:15:05 +01:00
Conrad Ludgate
72de1cb511 remove some duped deps (#6422)
## Problem

duplicated deps

## Summary of changes

A little bit of fiddling with deps to reduce duplicates.

needs consideration:
https://github.com/notify-rs/notify/blob/main/CHANGELOG.md#notify-600-2023-05-17
2024-01-23 11:17:15 +00:00
Konstantin Knizhnik
00d9bf5b61 Implement lockless update of pageserver_connstring GUC in shared memory (#6314)
## Problem

There is "neon.pageserver_connstring" GUC with PGC_SIGHUP option,
allowing to change it using
pg_reload_conf(). It is used by control plane to update pageserver
connection string if page server is crashed,
relocated or new shards are added.
It is copied to shared memory because config can not be loaded during
query execution and we need to
reestablish connection to page server.

## Summary of changes

Copying the connection string to shared memory is done by the postmaster,
and other backends should check an update counter to determine whether
the connection URL has changed and the connection needs to be
reestablished.
We cannot use standard Postgres LW-locks because the postmaster has no
proc entry and so cannot wait on this primitive. This is why a lockless
access algorithm is implemented, using two atomic counters to enforce
consistent reads of the connection string value from shared memory.
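
Conceptually this is a seqlock. A minimal sketch in Rust (the actual
extension code is C and the string lives in shared memory; names and
memory orderings here are simplified):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct UpdateCounters {
    begin: AtomicU64, // bumped by the writer before overwriting the string
    end: AtomicU64,   // bumped by the writer after overwriting it
}

impl UpdateCounters {
    fn write_with(&self, do_write: impl FnOnce()) {
        self.begin.fetch_add(1, Ordering::SeqCst);
        do_write(); // overwrite the shared connection string
        self.end.fetch_add(1, Ordering::SeqCst);
    }

    fn read_with<T>(&self, do_read: impl Fn() -> T) -> T {
        loop {
            let before = self.end.load(Ordering::SeqCst);
            let value = do_read(); // copy the shared connection string
            let after = self.begin.load(Ordering::SeqCst);
            // The counters match only if no write overlapped the read,
            // so the copy is a consistent snapshot; otherwise retry.
            if before == after {
                return value;
            }
        }
    }
}
```

Backends additionally remember the last counter value they observed; a
changed counter tells them the URL changed and the pageserver connection
must be reestablished.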


---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-23 07:55:05 +02:00
Sasha Krassovsky
71f495c7f7 Gate it behind feature flags 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
0a7e050144 Fix test one last time 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
55bfa91bd7 Fix test again again 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
d90b2b99df Fix test again 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
27587e155d Fix test 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
55aede2762 Prevent duplicate insertions 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
9f186b4d3e Fix query 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
585687d563 Fix syntax error 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
65a98e425d Switch to bigint 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
b2e7249979 Sleep 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
844303255a Cargo fmt 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
6d8df2579b Fix dumb thing 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
3c3b53f8ad Update test 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
30064eb197 Add scary comment 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
869acfe29b Make migrations transactional 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
11a91eaf7b Uncomment the thread 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
394ef013d0 Push the migrations test 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
a718287902 Make migrations happen on a separate thread 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
2eac1adcb9 Make clippy happy 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
3f90b2d337 Fix test_ddl_forwarding 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
a40ed86d87 Add test for migrations, add initial migration 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
1bf8bb88c5 Add support for migrations within compute_ctl 2024-01-22 14:53:29 -08:00
Vlad Lazar
f1901833a6 pageserver_api: migrate keyspace related functions from pgdatadir_mapping (#6406)
The idea is to achieve separation between the keyspace layout definition
and operating on said keyspace. I've inlined all these functions since
they're small and we don't use LTO in the storage release builds
at the moment.

Closes https://github.com/neondatabase/neon/issues/6347
2024-01-22 19:16:38 +00:00
Arthur Petukhovsky
b41ee81308 Log warning on slow WAL removal (#6432)
Also add a `safekeeper_active_timelines` metric.
Should help with investigating #6403
2024-01-22 18:38:05 +00:00
Christian Schwarz
205b6111e6 attachment_service: /attach-hook: correctly handle detach (#6433)
Before this patch, we would update the `tenant_state.intent` in memory
but not persist the detachment to disk.

I noticed this in https://github.com/neondatabase/neon/pull/6214 where
we stop, then restart, the attachment service.
2024-01-22 18:27:05 +00:00
John Spray
93572a3e99 pageserver: mark tenant broken when cancelling attach (#6430)
## Problem

When a tenant is in Attaching state, and waiting for the
`concurrent_tenant_warmup` semaphore, it also listens for the tenant
cancellation token. When that token fires, Tenant::attach drops out.
Meanwhile, Tenant::set_stopping waits forever for the tenant to exit
Attaching state.

Fixes: https://github.com/neondatabase/neon/issues/6423

## Summary of changes

- In the absence of a valid state for the tenant, it is set to Broken in
this path. A more elegant solution will require more refactoring, beyond
this minimal fix.
2024-01-22 15:50:32 +00:00
Christian Schwarz
15c0df4de7 fixup(#6037): actually fix the issue, #6388 failed to do so (#6429)
Before this patch, the select! still returned immediately if `futs` was
empty. I must have tested a stale build in my manual testing of #6388.
2024-01-22 14:27:29 +00:00
Anna Khanova
3290fb09bf Proxy: fix gc (#6426)
## Problem

GC currently doesn't work properly.

## Summary of changes

Change the statement used to run GC.
2024-01-22 13:24:10 +00:00
hamishc
efdb2bf948 Added missing PG_VERSION arg into compute node dockerfile (#6382)
## Problem

If you build the compute-node dockerfile with the PG_VERSION argument
passed in (e.g. `docker build -f Dockerfile.compute-node --build-arg
PG_VERSION=v15 .`), it fails, as some of the stages don't have the
PG_VERSION arg defined.

## Summary of changes

Added the PG_VERSION arg to the plv8-build, neon-pg-ext-build, and 
pg-embedding-pg-build stages of Dockerfile.compute-node
2024-01-22 11:05:27 +00:00
Conrad Ludgate
5559b16953 bump shlex (#6421)
## Problem

https://rustsec.org/advisories/RUSTSEC-2024-0006

## Summary of changes

`cargo update -p shlex`
2024-01-22 09:14:30 +00:00
Konstantin Knizhnik
1aea65eb9d Fix potential overflow in update_next_xid (#6412)
## Problem

See https://neondb.slack.com/archives/C06F5UJH601/p1705731304237889

Adding 1 to the xid in `update_next_xid` can cause an overflow in debug
mode; 0xffffffff is a valid transaction ID.

## Summary of changes

Use `wrapping_add` 
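
For illustration:

```rust
fn main() {
    let xid: u32 = 0xFFFF_FFFF; // a valid transaction ID
    // `xid + 1` panics in debug builds on overflow; wrapping_add makes
    // the intended modular arithmetic explicit.
    assert_eq!(xid.wrapping_add(1), 0);
}
```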

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-21 22:11:00 +02:00
Conrad Ludgate
34ddec67d9 proxy small tweaks (#6398)
## Problem

In https://github.com/neondatabase/neon/pull/6283 I made a couple of
changes that weren't directly related to the goal of extracting the state
machine, so I'm putting them here.

## Summary of changes

- move postgres vs console provider into another enum
- reduce error cases for link auth
- slightly refactor link flow
2024-01-21 09:58:42 +01:00
156 changed files with 6602 additions and 2371 deletions

@@ -69,7 +69,15 @@ jobs:
         run: echo "{\"credsStore\":\"ecr-login\"}" > /kaniko/.docker/config.json
       - name: Kaniko build
-        run: /kaniko/executor --reproducible --snapshotMode=redo --skip-unused-stages --dockerfile ${{ inputs.dockerfile-path }} --cache=true --cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache --destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-amd64
+        run: |
+          /kaniko/executor \
+            --reproducible \
+            --snapshotMode=redo \
+            --skip-unused-stages \
+            --dockerfile ${{ inputs.dockerfile-path }} \
+            --cache=true \
+            --cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache \
+            --destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-amd64

   kaniko-arm:
     if: needs.check-if-build-tools-dockerfile-changed.outputs.docker_file_changed == 'true'

@@ -85,7 +93,15 @@ jobs:
         run: echo "{\"credsStore\":\"ecr-login\"}" > /kaniko/.docker/config.json
       - name: Kaniko build
-        run: /kaniko/executor --reproducible --snapshotMode=redo --skip-unused-stages --dockerfile ${{ inputs.dockerfile-path }} --cache=true --cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache --destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-arm64
+        run: |
+          /kaniko/executor \
+            --reproducible \
+            --snapshotMode=redo \
+            --skip-unused-stages \
+            --dockerfile ${{ inputs.dockerfile-path }} \
+            --cache=true \
+            --cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache \
+            --destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-arm64

   manifest:
     if: needs.check-if-build-tools-dockerfile-changed.outputs.docker_file_changed == 'true'

@@ -99,7 +115,10 @@ jobs:
     steps:
       - name: Create manifest
-        run: docker manifest create 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }} --amend 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-amd64 --amend 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-arm64
+        run: |
+          docker manifest create 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }} \
+            --amend 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-amd64 \
+            --amend 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-arm64
       - name: Push manifest
         run: docker manifest push 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}

@@ -21,6 +21,8 @@ env:
   COPT: '-Werror'
   AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_DEV }}
   AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY_DEV }}
+  # A concurrency group that we use for e2e-tests runs, matches `concurrency.group` above with `github.repository` as a prefix
+  E2E_CONCURRENCY_GROUP: ${{ github.repository }}-${{ github.workflow }}-${{ github.ref_name }}-${{ github.ref_name == 'main' && github.sha || 'anysha' }}

 jobs:
   check-permissions:

@@ -44,6 +46,20 @@ jobs:
           exit 1

+  cancel-previous-e2e-tests:
+    needs: [ check-permissions ]
+    if: github.event_name == 'pull_request'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Cancel previous e2e-tests runs for this PR
+        env:
+          GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
+        run: |
+          gh workflow --repo neondatabase/cloud \
+            run cancel-previous-in-concurrency-group.yml \
+            --field concurrency_group="${{ env.E2E_CONCURRENCY_GROUP }}"
+
   tag:
     needs: [ check-permissions ]
     runs-on: [ self-hosted, gen3, small ]

@@ -186,7 +202,11 @@ jobs:
     runs-on: [ self-hosted, gen3, large ]
     container:
       image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
-      options: --init
+      # Raise locked memory limit for tokio-epoll-uring.
+      # On 5.10 LTS kernels < 5.10.162 (and generally mainline kernels < 5.12),
+      # io_uring will account the memory of the CQ and SQ as locked.
+      # More details: https://github.com/neondatabase/neon/issues/6373#issuecomment-1905814391
+      options: --init --shm-size=512mb --ulimit memlock=67108864:67108864
     strategy:
       fail-fast: false
       matrix:

@@ -340,8 +360,12 @@ jobs:
           ${cov_prefix} mold -run cargo build $CARGO_FLAGS $CARGO_FEATURES --bins --tests

       - name: Run rust tests
+        env:
+          NEXTEST_RETRIES: 3
         run: |
-          ${cov_prefix} cargo nextest run $CARGO_FLAGS $CARGO_FEATURES
+          for io_engine in std-fs tokio-epoll-uring ; do
+            NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOENGINE=$io_engine ${cov_prefix} cargo nextest run $CARGO_FLAGS $CARGO_FEATURES
+          done

           # Run separate tests for real S3
           export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty

@@ -419,8 +443,8 @@ jobs:
     runs-on: [ self-hosted, gen3, large ]
     container:
       image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
-      # Default shared memory is 64mb
-      options: --init --shm-size=512mb
+      # for changed limits, see comments on `options:` earlier in this file
+      options: --init --shm-size=512mb --ulimit memlock=67108864:67108864
     strategy:
       fail-fast: false
       matrix:

@@ -448,6 +472,7 @@ jobs:
           TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
           CHECK_ONDISK_DATA_COMPATIBILITY: nonempty
           BUILD_TAG: ${{ needs.tag.outputs.build-tag }}
+          PAGESERVER_VIRTUAL_FILE_IO_ENGINE: std-fs

       - name: Merge and upload coverage data
         if: matrix.build_type == 'debug' && matrix.pg_version == 'v14'

@@ -458,12 +483,13 @@ jobs:
     runs-on: [ self-hosted, gen3, small ]
     container:
       image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
-      # Default shared memory is 64mb
-      options: --init --shm-size=512mb
+      # for changed limits, see comments on `options:` earlier in this file
+      options: --init --shm-size=512mb --ulimit memlock=67108864:67108864
     if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-benchmarks')
     strategy:
       fail-fast: false
       matrix:
+        # the amount of groups (N) should be reflected in `extra_params: --splits N ...`
         pytest_split_group: [ 1, 2, 3, 4 ]
         build_type: [ release ]
     steps:

@@ -477,11 +503,12 @@ jobs:
           test_selection: performance
           run_in_parallel: false
           save_perf_report: ${{ github.ref_name == 'main' }}
-          extra_params: --splits ${{ strategy.job-total }} --group ${{ matrix.pytest_split_group }}
+          extra_params: --splits 4 --group ${{ matrix.pytest_split_group }}
         env:
           VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
           PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
           TEST_RESULT_CONNSTR: "${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}"
+          PAGESERVER_VIRTUAL_FILE_IO_ENGINE: tokio-epoll-uring

       # XXX: no coverage data handling here, since benchmarks are run on release builds,
       # while coverage is currently collected for the debug ones

@@ -695,7 +722,8 @@ jobs:
               \"commit_hash\": \"$COMMIT_SHA\",
               \"remote_repo\": \"${{ github.repository }}\",
               \"storage_image_tag\": \"${{ needs.tag.outputs.build-tag }}\",
-              \"compute_image_tag\": \"${{ needs.tag.outputs.build-tag }}\"
+              \"compute_image_tag\": \"${{ needs.tag.outputs.build-tag }}\",
+              \"concurrency_group\": \"${{ env.E2E_CONCURRENCY_GROUP }}\"
             }
           }"

@@ -124,12 +124,12 @@ jobs:
       # Hence keeping target/ (and general cache size) smaller
       BUILD_TYPE: release
       CARGO_FEATURES: --features testing
-      CARGO_FLAGS: --locked --release
+      CARGO_FLAGS: --release
       AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_DEV }}
       AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY_DEV }}
     container:
-      image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
+      image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:pinned
       options: --init
     steps:

@@ -210,18 +210,20 @@ jobs:
       - name: Run cargo build
         run: |
-          mold -run cargo build $CARGO_FLAGS $CARGO_FEATURES --bins --tests
+          mold -run cargo build --locked $CARGO_FLAGS $CARGO_FEATURES --bins --tests

       - name: Run cargo test
+        env:
+          NEXTEST_RETRIES: 3
         run: |
-          cargo test $CARGO_FLAGS $CARGO_FEATURES
+          cargo nextest run $CARGO_FEATURES

           # Run separate tests for real S3
           export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty
           export REMOTE_STORAGE_S3_BUCKET=neon-github-ci-tests
           export REMOTE_STORAGE_S3_REGION=eu-central-1
           # Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
-          cargo test $CARGO_FLAGS --package remote_storage --test test_real_s3
+          cargo nextest run --package remote_storage --test test_real_s3

           # Run separate tests for real Azure Blob Storage
           # XXX: replace region with `eu-central-1`-like region

@@ -231,7 +233,7 @@ jobs:
           export REMOTE_STORAGE_AZURE_CONTAINER="${{ vars.REMOTE_STORAGE_AZURE_CONTAINER }}"
           export REMOTE_STORAGE_AZURE_REGION="${{ vars.REMOTE_STORAGE_AZURE_REGION }}"
           # Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
-          cargo test $CARGO_FLAGS --package remote_storage --test test_real_azure
+          cargo nextest run --package remote_storage --test test_real_azure

   check-codestyle-rust-arm:
     timeout-minutes: 90

@@ -20,111 +20,51 @@ defaults:
   run:
     shell: bash -euo pipefail {0}

-env:
-  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_DEV }}
-  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY_DEV }}
-
 permissions: {}

 jobs:
   tag-image:
     runs-on: [ self-hosted, gen3, small ]
-    container: golang:1.19-bullseye
     env:
-      IMAGE: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools
+      ECR_IMAGE: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools
+      DOCKER_HUB_IMAGE: docker.io/neondatabase/build-tools
       FROM_TAG: ${{ inputs.from-tag }}
       TO_TAG: ${{ inputs.to-tag }}
-    outputs:
-      next-digest-buildtools: ${{ steps.next-digest.outputs.next-digest-buildtools }}
-      prev-digest-buildtools: ${{ steps.prev-digest.outputs.prev-digest-buildtools }}
     steps:
-      - name: Install Crane & ECR helper
-        run: |
-          go install github.com/google/go-containerregistry/cmd/crane@a54d64203cffcbf94146e04069aae4a97f228ee2 # v0.16.1
-          go install github.com/awslabs/amazon-ecr-credential-helper/ecr-login/cli/docker-credential-ecr-login@adf1bafd791ae7d4ff098108b1e91f36a4da5404 # v0.7.1
-      - name: Configure ECR login
-        run: |
-          mkdir /github/home/.docker/
-          echo "{\"credsStore\":\"ecr-login\"}" > /github/home/.docker/config.json
-      - name: Get source image digest
-        id: next-digest
-        run: |
-          NEXT_DIGEST=$(crane digest ${IMAGE}:${FROM_TAG} || true)
-          if [ -z "${NEXT_DIGEST}" ]; then
-            echo >&2 "Image ${IMAGE}:${FROM_TAG} does not exist"
-            exit 1
-          fi
-          echo "Current ${IMAGE}@${FROM_TAG} image is ${IMAGE}@${NEXT_DIGEST}"
-          echo "next-digest-buildtools=$NEXT_DIGEST" >> $GITHUB_OUTPUT
-      - name: Get destination image digest (if already exists)
-        id: prev-digest
-        run: |
-          PREV_DIGEST=$(crane digest ${IMAGE}:${TO_TAG} || true)
-          if [ -z "${PREV_DIGEST}" ]; then
-            echo >&2 "Image ${IMAGE}:${TO_TAG} does not exist (it's ok)"
-          else
-            echo >&2 "Current ${IMAGE}@${TO_TAG} image is ${IMAGE}@${PREV_DIGEST}"
-            echo "prev-digest-buildtools=$PREV_DIGEST" >> $GITHUB_OUTPUT
-          fi
-      - name: Tag image
-        run: |
-          crane tag "${IMAGE}:${FROM_TAG}" "${TO_TAG}"
-
-  rollback-tag-image:
-    needs: tag-image
-    if: ${{ !success() }}
-    runs-on: [ self-hosted, gen3, small ]
-    container: golang:1.19-bullseye
-    env:
-      IMAGE: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools
-      FROM_TAG: ${{ inputs.from-tag }}
-      TO_TAG: ${{ inputs.to-tag }}
-    steps:
-      - name: Install Crane & ECR helper
-        run: |
-          go install github.com/google/go-containerregistry/cmd/crane@a54d64203cffcbf94146e04069aae4a97f228ee2 # v0.16.1
-          go install github.com/awslabs/amazon-ecr-credential-helper/ecr-login/cli/docker-credential-ecr-login@adf1bafd791ae7d4ff098108b1e91f36a4da5404 # v0.7.1
-      - name: Configure ECR login
-        run: |
-          mkdir /github/home/.docker/
-          echo "{\"credsStore\":\"ecr-login\"}" > /github/home/.docker/config.json
-      - name: Restore previous tag if needed
-        run: |
-          NEXT_DIGEST="${{ needs.tag-image.outputs.next-digest-buildtools }}"
-          PREV_DIGEST="${{ needs.tag-image.outputs.prev-digest-buildtools }}"
-          if [ -z "${NEXT_DIGEST}" ]; then
-            echo >&2 "Image ${IMAGE}:${FROM_TAG} does not exist, nothing to rollback"
-            exit 0
-          fi
-          if [ -z "${PREV_DIGEST}" ]; then
-            # I guess we should delete the tag here/untag the image, but crane does not support it
-            # - https://github.com/google/go-containerregistry/issues/999
-            echo >&2 "Image ${IMAGE}:${TO_TAG} did not exist, but it was created by the job, no need to rollback"
-            exit 0
-          fi
-          CURRENT_DIGEST=$(crane digest "${IMAGE}:${TO_TAG}")
-          if [ "${CURRENT_DIGEST}" == "${NEXT_DIGEST}" ]; then
-            crane tag "${IMAGE}@${PREV_DIGEST}" "${TO_TAG}"
-            echo >&2 "Successfully restored ${TO_TAG} tag from ${IMAGE}@${CURRENT_DIGEST} to ${IMAGE}@${PREV_DIGEST}"
-          else
-            echo >&2 "Image ${IMAGE}:${TO_TAG}@${CURRENT_DIGEST} is not required to be restored"
-          fi
+      # Use custom DOCKER_CONFIG directory to avoid conflicts with default settings
+      # The default value is ~/.docker
+      - name: Set custom docker config directory
+        run: |
+          mkdir -p .docker-custom
+          echo DOCKER_CONFIG=$(pwd)/.docker-custom >> $GITHUB_ENV
+      - uses: docker/login-action@v2
+        with:
+          username: ${{ secrets.NEON_DOCKERHUB_USERNAME }}
+          password: ${{ secrets.NEON_DOCKERHUB_PASSWORD }}
+      - uses: docker/login-action@v2
+        with:
+          registry: 369495373322.dkr.ecr.eu-central-1.amazonaws.com
+          username: ${{ secrets.AWS_ACCESS_KEY_DEV }}
+          password: ${{ secrets.AWS_SECRET_KEY_DEV }}
+      - uses: actions/setup-go@v5
+        with:
+          go-version: '1.21'
+      - name: Install crane
+        run: |
+          go install github.com/google/go-containerregistry/cmd/crane@a0658aa1d0cc7a7f1bcc4a3af9155335b6943f40 # v0.18.0
+      - name: Copy images
+        run: |
+          crane copy "${ECR_IMAGE}:${FROM_TAG}" "${ECR_IMAGE}:${TO_TAG}"
+          crane copy "${ECR_IMAGE}:${FROM_TAG}" "${DOCKER_HUB_IMAGE}:${TO_TAG}"
+      - name: Remove custom docker config directory
+        if: always()
+        run: |
+          rm -rf .docker-custom

Cargo.lock (generated)

@@ -10,9 +10,9 @@
 checksum = "8b5ace29ee3216de37c0546865ad08edef58b0f9e76838ed8959a84a990e58c5"

 [[package]]
 name = "addr2line"
-version = "0.19.0"
+version = "0.21.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "a76fd60b23679b7d19bd066031410fb7e458ccc5e958eb5c325888ce4baedc97"
+checksum = "8a30b2e23b9e17a9f90641c7ab1549cd9b44f296d3ccbf309d2863cfe398a0cb"
 dependencies = [
  "gimli",
 ]

@@ -278,6 +278,7 @@ dependencies = [
  "camino",
  "clap",
  "control_plane",
+ "diesel",
  "futures",
  "git-version",
  "hyper",

@@ -840,15 +841,15 @@
 [[package]]
 name = "backtrace"
-version = "0.3.67"
+version = "0.3.69"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "233d376d6d185f2a3093e58f283f60f880315b6c60075b01f36b3b85154564ca"
+checksum = "2089b7e3f35b9dd2d0ed921ead4f6d318c27680d4a5bd167b3ee120edb105837"
 dependencies = [
  "addr2line",
  "cc",
  "cfg-if",
  "libc",
- "miniz_oxide 0.6.2",
+ "miniz_oxide",
  "object",
  "rustc-demangle",
 ]

@@ -1215,7 +1216,7 @@ dependencies = [
  "flate2",
  "futures",
  "hyper",
- "nix 0.26.2",
+ "nix 0.27.1",
  "notify",
  "num_cpus",
  "opentelemetry",

@@ -1327,11 +1328,13 @@ dependencies = [
  "clap",
  "comfy-table",
  "compute_api",
+ "diesel",
+ "diesel_migrations",
  "futures",
  "git-version",
  "hex",
  "hyper",
- "nix 0.26.2",
+ "nix 0.27.1",
  "once_cell",
  "pageserver_api",
  "pageserver_client",

@@ -1341,6 +1344,7 @@ dependencies = [
  "regex",
  "reqwest",
  "safekeeper_api",
+ "scopeguard",
  "serde",
  "serde_json",
  "serde_with",

@@ -1636,6 +1640,52 @@ dependencies = [
  "rusticata-macros",
 ]

+[[package]]
+name = "diesel"
+version = "2.1.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "62c6fcf842f17f8c78ecf7c81d75c5ce84436b41ee07e03f490fbb5f5a8731d8"
+dependencies = [
+ "bitflags 2.4.1",
+ "byteorder",
+ "diesel_derives",
+ "itoa",
+ "pq-sys",
+ "serde_json",
+]
+
+[[package]]
+name = "diesel_derives"
+version = "2.1.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ef8337737574f55a468005a83499da720f20c65586241ffea339db9ecdfd2b44"
+dependencies = [
+ "diesel_table_macro_syntax",
+ "proc-macro2",
+ "quote",
+ "syn 2.0.32",
+]
+
+[[package]]
+name = "diesel_migrations"
+version = "2.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6036b3f0120c5961381b570ee20a02432d7e2d27ea60de9578799cf9156914ac"
+dependencies = [
+ "diesel",
+ "migrations_internals",
+ "migrations_macros",
+]
+
+[[package]]
+name = "diesel_table_macro_syntax"
+version = "0.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "fc5557efc453706fed5e4fa85006fe9817c224c3f480a34c7e5959fd700921c5"
+dependencies = [
+ "syn 2.0.32",
+]
+
 [[package]]
 name = "digest"
 version = "0.10.7"

@@ -1872,13 +1922,13 @@
 [[package]]
 name = "filetime"
-version = "0.2.21"
+version = "0.2.22"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "5cbc844cecaee9d4443931972e1289c8ff485cb4cc2767cb03ca139ed6885153"
+checksum = "d4029edd3e734da6fe05b6cd7bd2960760a616bd2ddd0d59a0124746d6272af0"
 dependencies = [
  "cfg-if",
  "libc",
- "redox_syscall 0.2.16",
+ "redox_syscall 0.3.5",
  "windows-sys 0.48.0",
 ]

@@ -1895,7 +1945,7 @@
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "3b9429470923de8e8cbd4d2dc513535400b4b3fef0319fb5c4e1f520a7bef743"
 dependencies = [
  "crc32fast",
- "miniz_oxide 0.7.1",
+ "miniz_oxide",
 ]

@@ -2093,9 +2143,9 @@
 [[package]]
 name = "gimli"
-version = "0.27.2"
+version = "0.28.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "ad0a93d233ebf96623465aad4046a8d3aa4da22d4f4beba5388838c8a434bbb4"
+checksum = "4271d37baee1b8c7e4b708028c57d816cf9d2434acb33a549475f78c181f6253"

 [[package]]
 name = "git-version"

@@ -2562,6 +2612,16 @@ dependencies = [
  "windows-sys 0.48.0",
 ]

+[[package]]
+name = "io-uring"
+version = "0.6.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "460648e47a07a43110fbfa2e0b14afb2be920093c31e5dccc50e49568e099762"
+dependencies = [
+ "bitflags 1.3.2",
+ "libc",
+]
+
 [[package]]
 name = "ipnet"
 version = "2.9.0"

@@ -2748,18 +2808,18 @@
 checksum = "f665ee40bc4a3c5590afb1e9677db74a508659dfd71e126420da8274909a0167"

 [[package]]
 name = "memoffset"
-version = "0.7.1"
+version = "0.8.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "5de893c32cde5f383baa4c04c5d6dbdd735cfd4a794b0debdb2bb1b421da5ff4"
+checksum = "d61c719bcfbcf5d62b3a09efa6088de8c54bc0bfcd3ea7ae39fcc186108b8de1"
 dependencies = [
  "autocfg",
 ]

 [[package]]
 name = "memoffset"
-version = "0.8.0"
+version = "0.9.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "d61c719bcfbcf5d62b3a09efa6088de8c54bc0bfcd3ea7ae39fcc186108b8de1"
+checksum = "5a634b1c61a95585bd15607c6ab0c4e5b226e695ff2800ba0cdccddf208c406c"
 dependencies = [
  "autocfg",
 ]

@@ -2775,6 +2835,27 @@ dependencies = [
  "workspace_hack",
 ]

+[[package]]
+name = "migrations_internals"
+version = "2.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0f23f71580015254b020e856feac3df5878c2c7a8812297edd6c0a485ac9dada"
+dependencies = [
+ "serde",
+ "toml",
+]
+
+[[package]]
+name = "migrations_macros"
+version = "2.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cce3325ac70e67bbab5bd837a31cae01f1a6db64e0e744a33cb03a543469ef08"
+dependencies = [
+ "migrations_internals",
+ "proc-macro2",
+ "quote",
+]
+
 [[package]]
 name = "mime"
 version = "0.3.17"

@@ -2797,15 +2878,6 @@ version = "0.2.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "68354c5c6bd36d73ff3feceb05efa59b6acb7626617f4962be322a825e61f79a"

-[[package]]
-name = "miniz_oxide"
-version = "0.6.2"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "b275950c28b37e794e8c55d88aeb5e139d0ce23fdbbeda68f8d7174abdf9e8fa"
-dependencies = [
- "adler",
-]
-
 [[package]]
 name = "miniz_oxide"
 version = "0.7.1"

@@ -2865,16 +2937,14 @@
 [[package]]
 name = "nix"
-version = "0.26.2"
+version = "0.27.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "bfdda3d196821d6af13126e40375cdf7da646a96114af134d5f417a9a1dc8e1a"
+checksum = "2eb04e9c688eff1c89d72b407f168cf79bb9e867a9d3323ed6c01519eb9cc053"
 dependencies = [
- "bitflags 1.3.2",
+ "bitflags 2.4.1",
  "cfg-if",
  "libc",
- "memoffset 0.7.1",
+ "memoffset 0.9.0",
- "pin-utils",
- "static_assertions",
 ]

@@ -2889,20 +2959,21 @@
 [[package]]
 name = "notify"
-version = "5.2.0"
+version = "6.1.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "729f63e1ca555a43fe3efa4f3efdf4801c479da85b432242a7b726f353c88486"
+checksum = "6205bd8bb1e454ad2e27422015fb5e4f2bcc7e08fa8f27058670d208324a4d2d"
dependencies = [ dependencies = [
"bitflags 1.3.2", "bitflags 2.4.1",
"crossbeam-channel", "crossbeam-channel",
"filetime", "filetime",
"fsevent-sys", "fsevent-sys",
"inotify 0.9.6", "inotify 0.9.6",
"kqueue", "kqueue",
"libc", "libc",
"log",
"mio", "mio",
"walkdir", "walkdir",
"windows-sys 0.45.0", "windows-sys 0.48.0",
] ]
[[package]] [[package]]
@@ -3028,9 +3099,9 @@ dependencies = [
[[package]] [[package]]
name = "object" name = "object"
version = "0.30.3" version = "0.32.2"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ea86265d3d3dcb6a27fc51bd29a4bf387fae9d2986b823079d4986af253eb439" checksum = "a6a622008b6e321afc04970976f62ee297fdbaa6f95318ca343e3eebb9648441"
dependencies = [ dependencies = [
"memchr", "memchr",
] ]
@@ -3102,9 +3173,9 @@ dependencies = [
[[package]] [[package]]
name = "opentelemetry" name = "opentelemetry"
version = "0.19.0" version = "0.20.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5f4b8347cc26099d3aeee044065ecc3ae11469796b4d65d065a23a584ed92a6f" checksum = "9591d937bc0e6d2feb6f71a559540ab300ea49955229c347a517a28d27784c54"
dependencies = [ dependencies = [
"opentelemetry_api", "opentelemetry_api",
"opentelemetry_sdk", "opentelemetry_sdk",
@@ -3112,9 +3183,9 @@ dependencies = [
[[package]] [[package]]
name = "opentelemetry-http" name = "opentelemetry-http"
version = "0.8.0" version = "0.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a819b71d6530c4297b49b3cae2939ab3a8cc1b9f382826a1bc29dd0ca3864906" checksum = "c7594ec0e11d8e33faf03530a4c49af7064ebba81c1480e01be67d90b356508b"
dependencies = [ dependencies = [
"async-trait", "async-trait",
"bytes", "bytes",
@@ -3125,54 +3196,56 @@ dependencies = [
[[package]] [[package]]
name = "opentelemetry-otlp" name = "opentelemetry-otlp"
version = "0.12.0" version = "0.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8af72d59a4484654ea8eb183fea5ae4eb6a41d7ac3e3bae5f4d2a282a3a7d3ca" checksum = "7e5e5a5c4135864099f3faafbe939eb4d7f9b80ebf68a8448da961b32a7c1275"
dependencies = [ dependencies = [
"async-trait", "async-trait",
"futures", "futures-core",
"futures-util",
"http", "http",
"opentelemetry",
"opentelemetry-http", "opentelemetry-http",
"opentelemetry-proto", "opentelemetry-proto",
"opentelemetry-semantic-conventions",
"opentelemetry_api",
"opentelemetry_sdk",
"prost", "prost",
"reqwest", "reqwest",
"thiserror", "thiserror",
"tokio",
"tonic",
] ]
[[package]] [[package]]
name = "opentelemetry-proto" name = "opentelemetry-proto"
version = "0.2.0" version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "045f8eea8c0fa19f7d48e7bc3128a39c2e5c533d5c61298c548dfefc1064474c" checksum = "b1e3f814aa9f8c905d0ee4bde026afd3b2577a97c10e1699912e3e44f0c4cbeb"
dependencies = [ dependencies = [
"futures", "opentelemetry_api",
"futures-util", "opentelemetry_sdk",
"opentelemetry",
"prost", "prost",
"tonic 0.8.3", "tonic",
] ]
[[package]] [[package]]
name = "opentelemetry-semantic-conventions" name = "opentelemetry-semantic-conventions"
version = "0.11.0" version = "0.12.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "24e33428e6bf08c6f7fcea4ddb8e358fab0fe48ab877a87c70c6ebe20f673ce5" checksum = "73c9f9340ad135068800e7f1b24e9e09ed9e7143f5bf8518ded3d3ec69789269"
dependencies = [ dependencies = [
"opentelemetry", "opentelemetry",
] ]
[[package]] [[package]]
name = "opentelemetry_api" name = "opentelemetry_api"
version = "0.19.0" version = "0.20.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ed41783a5bf567688eb38372f2b7a8530f5a607a4b49d38dd7573236c23ca7e2" checksum = "8a81f725323db1b1206ca3da8bb19874bbd3f57c3bcd59471bfb04525b265b9b"
dependencies = [ dependencies = [
"fnv",
"futures-channel", "futures-channel",
"futures-util", "futures-util",
"indexmap 1.9.3", "indexmap 1.9.3",
"js-sys",
"once_cell", "once_cell",
"pin-project-lite", "pin-project-lite",
"thiserror", "thiserror",
@@ -3181,21 +3254,22 @@ dependencies = [
[[package]] [[package]]
name = "opentelemetry_sdk" name = "opentelemetry_sdk"
version = "0.19.0" version = "0.20.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8b3a2a91fdbfdd4d212c0dcc2ab540de2c2bcbbd90be17de7a7daf8822d010c1" checksum = "fa8e705a0612d48139799fcbaba0d4a90f06277153e43dd2bdc16c6f0edd8026"
dependencies = [ dependencies = [
"async-trait", "async-trait",
"crossbeam-channel", "crossbeam-channel",
"dashmap",
"fnv",
"futures-channel", "futures-channel",
"futures-executor", "futures-executor",
"futures-util", "futures-util",
"once_cell", "once_cell",
"opentelemetry_api", "opentelemetry_api",
"ordered-float 3.9.2",
"percent-encoding", "percent-encoding",
"rand 0.8.5", "rand 0.8.5",
"regex",
"serde_json",
"thiserror", "thiserror",
"tokio", "tokio",
"tokio-stream", "tokio-stream",
@@ -3210,6 +3284,15 @@ dependencies = [
"num-traits", "num-traits",
] ]
[[package]]
name = "ordered-float"
version = "3.9.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f1e1c390732d15f1d48471625cd92d154e66db2c56645e29a9cd26f4699f72dc"
dependencies = [
"num-traits",
]
[[package]] [[package]]
name = "ordered-multimap" name = "ordered-multimap"
version = "0.7.1" version = "0.7.1"
@@ -3325,7 +3408,7 @@ dependencies = [
"itertools", "itertools",
"md5", "md5",
"metrics", "metrics",
"nix 0.26.2", "nix 0.27.1",
"num-traits", "num-traits",
"num_cpus", "num_cpus",
"once_cell", "once_cell",
@@ -3358,6 +3441,7 @@ dependencies = [
"tenant_size_model", "tenant_size_model",
"thiserror", "thiserror",
"tokio", "tokio",
"tokio-epoll-uring",
"tokio-io-timeout", "tokio-io-timeout",
"tokio-postgres", "tokio-postgres",
"tokio-stream", "tokio-stream",
@@ -3780,6 +3864,15 @@ version = "0.2.17"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5b40af805b3121feab8a3c29f04d8ad262fa8e0561883e7653e024ae4479e6de" checksum = "5b40af805b3121feab8a3c29f04d8ad262fa8e0561883e7653e024ae4479e6de"
[[package]]
name = "pq-sys"
version = "0.4.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "31c0052426df997c0cbd30789eb44ca097e3541717a7b8fa36b1c464ee7edebd"
dependencies = [
"vcpkg",
]
[[package]] [[package]]
name = "pq_proto" name = "pq_proto"
version = "0.1.0" version = "0.1.0"
@@ -4339,9 +4432,9 @@ dependencies = [
[[package]] [[package]]
name = "reqwest-tracing" name = "reqwest-tracing"
version = "0.4.5" version = "0.4.7"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1b97ad83c2fc18113346b7158d79732242002427c30f620fa817c1f32901e0a8" checksum = "5a0152176687dd5cfe7f507ac1cb1a491c679cfe483afd133a7db7aaea818bb3"
dependencies = [ dependencies = [
"anyhow", "anyhow",
"async-trait", "async-trait",
@@ -5031,9 +5124,9 @@ dependencies = [
[[package]] [[package]]
name = "shlex" name = "shlex"
version = "1.1.0" version = "1.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "43b2853a4d09f215c24cc5489c992ce46052d359b5109343cbafbf26bc62f8a3" checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64"
[[package]] [[package]]
name = "signal-hook" name = "signal-hook"
@@ -5110,9 +5203,9 @@ checksum = "62bb4feee49fdd9f707ef802e22365a35de4b7b299de4763d44bfea899442ff9"
[[package]] [[package]]
name = "smol_str" name = "smol_str"
version = "0.2.0" version = "0.2.1"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "74212e6bbe9a4352329b2f68ba3130c15a3f26fe88ff22dbdc6cdd58fa85e99c" checksum = "e6845563ada680337a52d43bb0b29f396f2d911616f6573012645b9e3d048a49"
dependencies = [ dependencies = [
"serde", "serde",
] ]
@@ -5195,7 +5288,7 @@ dependencies = [
"prost", "prost",
"tokio", "tokio",
"tokio-stream", "tokio-stream",
"tonic 0.9.2", "tonic",
"tonic-build", "tonic-build",
"tracing", "tracing",
"utils", "utils",
@@ -5379,18 +5472,18 @@ dependencies = [
[[package]] [[package]]
name = "thiserror" name = "thiserror"
version = "1.0.40" version = "1.0.47"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "978c9a314bd8dc99be594bc3c175faaa9794be04a5a5e153caba6915336cebac" checksum = "97a802ec30afc17eee47b2855fc72e0c4cd62be9b4efe6591edde0ec5bd68d8f"
dependencies = [ dependencies = [
"thiserror-impl", "thiserror-impl",
] ]
[[package]] [[package]]
name = "thiserror-impl" name = "thiserror-impl"
version = "1.0.40" version = "1.0.47"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f9456a42c5b0d803c8cd86e73dd7cc9edd429499f37a3550d286d5e86720569f" checksum = "6bb623b56e39ab7dcd4b1b98bb6c8f8d907ed255b18de254088016b27a8ee19b"
dependencies = [ dependencies = [
"proc-macro2", "proc-macro2",
"quote", "quote",
@@ -5415,7 +5508,7 @@ checksum = "7e54bc85fc7faa8bc175c4bab5b92ba8d9a3ce893d0e9f42cc455c8ab16a9e09"
dependencies = [ dependencies = [
"byteorder", "byteorder",
"integer-encoding", "integer-encoding",
"ordered-float", "ordered-float 2.10.1",
] ]
[[package]] [[package]]
@@ -5514,6 +5607,21 @@ dependencies = [
"windows-sys 0.48.0", "windows-sys 0.48.0",
] ]
[[package]]
name = "tokio-epoll-uring"
version = "0.1.0"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#0dd3a2f8bf3239d34a19719ef1a74146c093126f"
dependencies = [
"futures",
"once_cell",
"scopeguard",
"thiserror",
"tokio",
"tokio-util",
"tracing",
"uring-common",
]
[[package]] [[package]]
name = "tokio-io-timeout" name = "tokio-io-timeout"
version = "1.2.0" version = "1.2.0"
@@ -5681,38 +5789,6 @@ dependencies = [
"winnow", "winnow",
] ]
[[package]]
name = "tonic"
version = "0.8.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8f219fad3b929bef19b1f86fbc0358d35daed8f2cac972037ac0dc10bbb8d5fb"
dependencies = [
"async-stream",
"async-trait",
"axum",
"base64 0.13.1",
"bytes",
"futures-core",
"futures-util",
"h2",
"http",
"http-body",
"hyper",
"hyper-timeout",
"percent-encoding",
"pin-project",
"prost",
"prost-derive",
"tokio",
"tokio-stream",
"tokio-util",
"tower",
"tower-layer",
"tower-service",
"tracing",
"tracing-futures",
]
[[package]] [[package]]
name = "tonic" name = "tonic"
version = "0.9.2" version = "0.9.2"
@@ -5856,16 +5932,6 @@ dependencies = [
"tracing-subscriber", "tracing-subscriber",
] ]
[[package]]
name = "tracing-futures"
version = "0.2.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "97d095ae15e245a057c8e8451bab9b3ee1e1f68e9ba2b4fbc18d0ac5237835f2"
dependencies = [
"pin-project",
"tracing",
]
[[package]] [[package]]
name = "tracing-log" name = "tracing-log"
version = "0.1.3" version = "0.1.3"
@@ -5879,9 +5945,9 @@ dependencies = [
[[package]] [[package]]
name = "tracing-opentelemetry" name = "tracing-opentelemetry"
version = "0.19.0" version = "0.20.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "00a39dcf9bfc1742fa4d6215253b33a6e474be78275884c216fc2a06267b3600" checksum = "fc09e402904a5261e42cf27aea09ccb7d5318c6717a9eec3d8e2e65c56b18f19"
dependencies = [ dependencies = [
"once_cell", "once_cell",
"opentelemetry", "opentelemetry",
@@ -6065,6 +6131,15 @@ dependencies = [
"webpki-roots 0.23.1", "webpki-roots 0.23.1",
] ]
[[package]]
name = "uring-common"
version = "0.1.0"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#0dd3a2f8bf3239d34a19719ef1a74146c093126f"
dependencies = [
"io-uring",
"libc",
]
[[package]] [[package]]
name = "url" name = "url"
version = "2.3.1" version = "2.3.1"
@@ -6118,7 +6193,7 @@ dependencies = [
"hyper", "hyper",
"jsonwebtoken", "jsonwebtoken",
"metrics", "metrics",
"nix 0.26.2", "nix 0.27.1",
"once_cell", "once_cell",
"pin-project-lite", "pin-project-lite",
"postgres_connection", "postgres_connection",
@@ -6626,10 +6701,9 @@ dependencies = [
"clap", "clap",
"clap_builder", "clap_builder",
"crossbeam-utils", "crossbeam-utils",
"dashmap", "diesel",
"either", "either",
"fail", "fail",
"futures",
"futures-channel", "futures-channel",
"futures-core", "futures-core",
"futures-executor", "futures-executor",
@@ -6674,6 +6748,7 @@ dependencies = [
"tokio-util", "tokio-util",
"toml_datetime", "toml_datetime",
"toml_edit", "toml_edit",
"tonic",
"tower", "tower",
"tracing", "tracing",
"tracing-core", "tracing-core",


@@ -99,14 +99,14 @@ libc = "0.2"
md5 = "0.7.0" md5 = "0.7.0"
memoffset = "0.8" memoffset = "0.8"
native-tls = "0.2" native-tls = "0.2"
nix = "0.26" nix = { version = "0.27", features = ["fs", "process", "socket", "signal", "poll"] }
notify = "5.0.0" notify = "6.0.0"
num_cpus = "1.15" num_cpus = "1.15"
num-traits = "0.2.15" num-traits = "0.2.15"
once_cell = "1.13" once_cell = "1.13"
opentelemetry = "0.19.0" opentelemetry = "0.20.0"
opentelemetry-otlp = { version = "0.12.0", default_features=false, features = ["http-proto", "trace", "http", "reqwest-client"] } opentelemetry-otlp = { version = "0.13.0", default_features=false, features = ["http-proto", "trace", "http", "reqwest-client"] }
opentelemetry-semantic-conventions = "0.11.0" opentelemetry-semantic-conventions = "0.12.0"
parking_lot = "0.12" parking_lot = "0.12"
parquet = { version = "49.0.0", default-features = false, features = ["zstd"] } parquet = { version = "49.0.0", default-features = false, features = ["zstd"] }
parquet_derive = "49.0.0" parquet_derive = "49.0.0"
@@ -118,7 +118,7 @@ rand = "0.8"
redis = { version = "0.24.0", features = ["tokio-rustls-comp", "keep-alive"] } redis = { version = "0.24.0", features = ["tokio-rustls-comp", "keep-alive"] }
regex = "1.10.2" regex = "1.10.2"
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] } reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] }
reqwest-tracing = { version = "0.4.0", features = ["opentelemetry_0_19"] } reqwest-tracing = { version = "0.4.7", features = ["opentelemetry_0_20"] }
reqwest-middleware = "0.2.0" reqwest-middleware = "0.2.0"
reqwest-retry = "0.2.2" reqwest-retry = "0.2.2"
routerify = "3" routerify = "3"
@@ -151,6 +151,7 @@ test-context = "0.1"
thiserror = "1.0" thiserror = "1.0"
tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] } tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] }
tokio = { version = "1.17", features = ["macros"] } tokio = { version = "1.17", features = ["macros"] }
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
tokio-io-timeout = "1.2.0" tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.10.0" tokio-postgres-rustls = "0.10.0"
tokio-rustls = "0.24" tokio-rustls = "0.24"
@@ -162,7 +163,7 @@ toml_edit = "0.19"
tonic = {version = "0.9", features = ["tls", "tls-roots"]} tonic = {version = "0.9", features = ["tls", "tls-roots"]}
tracing = "0.1" tracing = "0.1"
tracing-error = "0.2.0" tracing-error = "0.2.0"
tracing-opentelemetry = "0.19.0" tracing-opentelemetry = "0.20.0"
tracing-subscriber = { version = "0.3", default_features = false, features = ["smallvec", "fmt", "tracing-log", "std", "env-filter", "json"] } tracing-subscriber = { version = "0.3", default_features = false, features = ["smallvec", "fmt", "tracing-log", "std", "env-filter", "json"] }
url = "2.2" url = "2.2"
uuid = { version = "1.6.1", features = ["v4", "v7", "serde"] } uuid = { version = "1.6.1", features = ["v4", "v7", "serde"] }


@@ -52,7 +52,7 @@ RUN cd postgres && \
# We need to grant EXECUTE on pg_stat_statements_reset() to neon_superuser. # We need to grant EXECUTE on pg_stat_statements_reset() to neon_superuser.
# In vanilla postgres this function is limited to Postgres role superuser. # In vanilla postgres this function is limited to Postgres role superuser.
# In neon we have neon_superuser role that is not a superuser but replaces superuser in some cases. # In neon we have neon_superuser role that is not a superuser but replaces superuser in some cases.
# We could add the additional grant statements to the postgres repository but it would be hard to maintain, # We could add the additional grant statements to the postgres repository but it would be hard to maintain,
# whenever we need to pick up a new postgres version and we want to limit the changes in our postgres fork, # whenever we need to pick up a new postgres version and we want to limit the changes in our postgres fork,
# so we do it here. # so we do it here.
old_list="pg_stat_statements--1.0--1.1.sql pg_stat_statements--1.1--1.2.sql pg_stat_statements--1.2--1.3.sql pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.4--1.5.sql pg_stat_statements--1.4.sql pg_stat_statements--1.5--1.6.sql"; \ old_list="pg_stat_statements--1.0--1.1.sql pg_stat_statements--1.1--1.2.sql pg_stat_statements--1.2--1.3.sql pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.4--1.5.sql pg_stat_statements--1.4.sql pg_stat_statements--1.5--1.6.sql"; \
@@ -63,14 +63,14 @@ RUN cd postgres && \
echo 'GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO neon_superuser;' >> $file; \ echo 'GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO neon_superuser;' >> $file; \
fi; \ fi; \
done; \ done; \
# the second loop is for pg_stat_statement extension versions >= 1.7, # the second loop is for pg_stat_statement extension versions >= 1.7,
# where pg_stat_statement_reset() got 3 additional arguments # where pg_stat_statement_reset() got 3 additional arguments
for file in /usr/local/pgsql/share/extension/pg_stat_statements--*.sql; do \ for file in /usr/local/pgsql/share/extension/pg_stat_statements--*.sql; do \
filename=$(basename "$file"); \ filename=$(basename "$file"); \
if ! echo "$old_list" | grep -q -F "$filename"; then \ if ! echo "$old_list" | grep -q -F "$filename"; then \
echo 'GRANT EXECUTE ON FUNCTION pg_stat_statements_reset(Oid, Oid, bigint) TO neon_superuser;' >> $file; \ echo 'GRANT EXECUTE ON FUNCTION pg_stat_statements_reset(Oid, Oid, bigint) TO neon_superuser;' >> $file; \
fi; \ fi; \
done done
######################################################################################### #########################################################################################
# #
@@ -143,29 +143,24 @@ RUN wget https://github.com/pgRouting/pgrouting/archive/v3.4.2.tar.gz -O pgrouti
######################################################################################### #########################################################################################
FROM build-deps AS plv8-build FROM build-deps AS plv8-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN apt update && \ RUN apt update && \
apt install -y ninja-build python3-dev libncurses5 binutils clang apt install -y ninja-build python3-dev libncurses5 binutils clang
RUN case "${PG_VERSION}" in \ RUN wget https://github.com/plv8/plv8/archive/refs/tags/v3.1.10.tar.gz -O plv8.tar.gz && \
"v14" | "v15") \ echo "7096c3290928561f0d4901b7a52794295dc47f6303102fae3f8e42dd575ad97d plv8.tar.gz" | sha256sum --check && \
export PLV8_VERSION=3.1.5 \
export PLV8_CHECKSUM=1e108d5df639e4c189e1c5bdfa2432a521c126ca89e7e5a969d46899ca7bf106 \
;; \
"v16") \
export PLV8_VERSION=3.1.8 \
export PLV8_CHECKSUM=92b10c7db39afdae97ff748c9ec54713826af222c459084ad002571b79eb3f49 \
;; \
*) \
echo "Export the valid PG_VERSION variable" && exit 1 \
;; \
esac && \
wget https://github.com/plv8/plv8/archive/refs/tags/v${PLV8_VERSION}.tar.gz -O plv8.tar.gz && \
echo "${PLV8_CHECKSUM} plv8.tar.gz" | sha256sum --check && \
mkdir plv8-src && cd plv8-src && tar xvzf ../plv8.tar.gz --strip-components=1 -C . && \ mkdir plv8-src && cd plv8-src && tar xvzf ../plv8.tar.gz --strip-components=1 -C . && \
# generate and copy upgrade scripts
mkdir -p upgrade && ./generate_upgrade.sh 3.1.10 && \
cp upgrade/* /usr/local/pgsql/share/extension/ && \
export PATH="/usr/local/pgsql/bin:$PATH" && \ export PATH="/usr/local/pgsql/bin:$PATH" && \
make DOCKER=1 -j $(getconf _NPROCESSORS_ONLN) install && \ make DOCKER=1 -j $(getconf _NPROCESSORS_ONLN) install && \
rm -rf /plv8-* && \ rm -rf /plv8-* && \
find /usr/local/pgsql/ -name "plv8-*.so" | xargs strip && \ find /usr/local/pgsql/ -name "plv8-*.so" | xargs strip && \
# don't break computes with installed old version of plv8
cd /usr/local/pgsql/lib/ && \
ln -s plv8-3.1.10.so plv8-3.1.5.so && \
ln -s plv8-3.1.10.so plv8-3.1.8.so && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/plv8.control && \ echo 'trusted = true' >> /usr/local/pgsql/share/extension/plv8.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/plcoffee.control && \ echo 'trusted = true' >> /usr/local/pgsql/share/extension/plcoffee.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/plls.control echo 'trusted = true' >> /usr/local/pgsql/share/extension/plls.control
@@ -551,6 +546,7 @@ RUN wget https://github.com/rdkit/rdkit/archive/refs/tags/Release_2023_03_3.tar.
-D PostgreSQL_TYPE_INCLUDE_DIR=`pg_config --includedir-server` \ -D PostgreSQL_TYPE_INCLUDE_DIR=`pg_config --includedir-server` \
-D PostgreSQL_LIBRARY_DIR=`pg_config --libdir` \ -D PostgreSQL_LIBRARY_DIR=`pg_config --libdir` \
-D RDK_INSTALL_INTREE=OFF \ -D RDK_INSTALL_INTREE=OFF \
-D RDK_INSTALL_COMIC_FONTS=OFF \
-D CMAKE_BUILD_TYPE=Release \ -D CMAKE_BUILD_TYPE=Release \
. && \ . && \
make -j $(getconf _NPROCESSORS_ONLN) && \ make -j $(getconf _NPROCESSORS_ONLN) && \
@@ -617,6 +613,7 @@ RUN wget https://github.com/theory/pg-semver/archive/refs/tags/v0.32.1.tar.gz -O
FROM build-deps AS pg-embedding-pg-build FROM build-deps AS pg-embedding-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ARG PG_VERSION
ENV PATH "/usr/local/pgsql/bin/:$PATH" ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN case "${PG_VERSION}" in \ RUN case "${PG_VERSION}" in \
"v14" | "v15") \ "v14" | "v15") \
@@ -779,6 +776,8 @@ RUN wget https://github.com/eulerto/wal2json/archive/refs/tags/wal2json_2_5.tar.
# #
######################################################################################### #########################################################################################
FROM build-deps AS neon-pg-ext-build FROM build-deps AS neon-pg-ext-build
ARG PG_VERSION
# Public extensions # Public extensions
COPY --from=postgis-build /usr/local/pgsql/ /usr/local/pgsql/ COPY --from=postgis-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=postgis-build /sfcgal/* / COPY --from=postgis-build /sfcgal/* /


@@ -700,13 +700,14 @@ impl ComputeNode {
// In this case we need to connect with old `zenith_admin` name // In this case we need to connect with old `zenith_admin` name
// and create new user. We cannot simply rename connected user, // and create new user. We cannot simply rename connected user,
// but we can create a new one and grant it all privileges. // but we can create a new one and grant it all privileges.
let mut client = match Client::connect(self.connstr.as_str(), NoTls) { let connstr = self.connstr.clone();
let mut client = match Client::connect(connstr.as_str(), NoTls) {
Err(e) => { Err(e) => {
info!( info!(
"cannot connect to postgres: {}, retrying with `zenith_admin` username", "cannot connect to postgres: {}, retrying with `zenith_admin` username",
e e
); );
let mut zenith_admin_connstr = self.connstr.clone(); let mut zenith_admin_connstr = connstr.clone();
zenith_admin_connstr zenith_admin_connstr
.set_username("zenith_admin") .set_username("zenith_admin")
@@ -719,8 +720,8 @@ impl ComputeNode {
client.simple_query("GRANT zenith_admin TO cloud_admin")?; client.simple_query("GRANT zenith_admin TO cloud_admin")?;
drop(client); drop(client);
// reconnect with connsting with expected name // reconnect with connstring with expected name
Client::connect(self.connstr.as_str(), NoTls)? Client::connect(connstr.as_str(), NoTls)?
} }
Ok(client) => client, Ok(client) => client,
}; };
@@ -734,8 +735,8 @@ impl ComputeNode {
cleanup_instance(&mut client)?; cleanup_instance(&mut client)?;
handle_roles(spec, &mut client)?; handle_roles(spec, &mut client)?;
handle_databases(spec, &mut client)?; handle_databases(spec, &mut client)?;
handle_role_deletions(spec, self.connstr.as_str(), &mut client)?; handle_role_deletions(spec, connstr.as_str(), &mut client)?;
handle_grants(spec, &mut client, self.connstr.as_str())?; handle_grants(spec, &mut client, connstr.as_str())?;
handle_extensions(spec, &mut client)?; handle_extensions(spec, &mut client)?;
handle_extension_neon(&mut client)?; handle_extension_neon(&mut client)?;
create_availability_check_data(&mut client)?; create_availability_check_data(&mut client)?;
@@ -743,6 +744,12 @@ impl ComputeNode {
// 'Close' connection // 'Close' connection
drop(client); drop(client);
if self.has_feature(ComputeFeature::Migrations) {
thread::spawn(move || {
let mut client = Client::connect(connstr.as_str(), NoTls)?;
handle_migrations(&mut client)
});
}
Ok(()) Ok(())
} }
@@ -807,6 +814,10 @@ impl ComputeNode {
handle_grants(&spec, &mut client, self.connstr.as_str())?; handle_grants(&spec, &mut client, self.connstr.as_str())?;
handle_extensions(&spec, &mut client)?; handle_extensions(&spec, &mut client)?;
handle_extension_neon(&mut client)?; handle_extension_neon(&mut client)?;
// We can skip handle_migrations here because a new migration can only appear
// if we have a new version of the compute_ctl binary, which can only happen
// if compute got restarted, in which case we'll end up inside of apply_config
// instead of reconfigure.
} }
// 'Close' connection // 'Close' connection


@@ -727,3 +727,79 @@ pub fn handle_extension_neon(client: &mut Client) -> Result<()> {
Ok(()) Ok(())
} }
#[instrument(skip_all)]
pub fn handle_migrations(client: &mut Client) -> Result<()> {
info!("handle migrations");
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// !BE SURE TO ONLY ADD MIGRATIONS TO THE END OF THIS ARRAY. IF YOU DO NOT, VERY VERY BAD THINGS MAY HAPPEN!
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
let migrations = [
"ALTER ROLE neon_superuser BYPASSRLS",
r#"
DO $$
DECLARE
role_name text;
BEGIN
FOR role_name IN SELECT rolname FROM pg_roles WHERE pg_has_role(rolname, 'neon_superuser', 'member')
LOOP
RAISE NOTICE 'EXECUTING ALTER ROLE % INHERIT', quote_ident(role_name);
EXECUTE 'ALTER ROLE ' || quote_ident(role_name) || ' INHERIT';
END LOOP;
FOR role_name IN SELECT rolname FROM pg_roles
WHERE
NOT pg_has_role(rolname, 'neon_superuser', 'member') AND NOT starts_with(rolname, 'pg_')
LOOP
RAISE NOTICE 'EXECUTING ALTER ROLE % NOBYPASSRLS', quote_ident(role_name);
EXECUTE 'ALTER ROLE ' || quote_ident(role_name) || ' NOBYPASSRLS';
END LOOP;
END $$;
"#,
];
let mut query = "CREATE SCHEMA IF NOT EXISTS neon_migration";
client.simple_query(query)?;
query = "CREATE TABLE IF NOT EXISTS neon_migration.migration_id (key INT NOT NULL PRIMARY KEY, id bigint NOT NULL DEFAULT 0)";
client.simple_query(query)?;
query = "INSERT INTO neon_migration.migration_id VALUES (0, 0) ON CONFLICT DO NOTHING";
client.simple_query(query)?;
query = "ALTER SCHEMA neon_migration OWNER TO cloud_admin";
client.simple_query(query)?;
query = "REVOKE ALL ON SCHEMA neon_migration FROM PUBLIC";
client.simple_query(query)?;
query = "SELECT id FROM neon_migration.migration_id";
let row = client.query_one(query, &[])?;
let mut current_migration: usize = row.get::<&str, i64>("id") as usize;
let starting_migration_id = current_migration;
query = "BEGIN";
client.simple_query(query)?;
while current_migration < migrations.len() {
info!("Running migration:\n{}\n", migrations[current_migration]);
client.simple_query(migrations[current_migration])?;
current_migration += 1;
}
let setval = format!(
"UPDATE neon_migration.migration_id SET id={}",
migrations.len()
);
client.simple_query(&setval)?;
query = "COMMIT";
client.simple_query(query)?;
info!(
"Ran {} migrations",
(migrations.len() - starting_migration_id)
);
Ok(())
}
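
A note on the invariant shouted above: the value persisted in neon_migration.migration_id is a count of applied entries, i.e. an index into the `migrations` array, so reordering or removing entries would silently skip statements or run the wrong ones. A minimal sketch of that indexing (hypothetical values, not part of this change):

```rust
fn main() {
    // The stored id says how many leading entries have already been applied.
    let migrations = [
        "ALTER ROLE neon_superuser BYPASSRLS", // migration 0, already applied
        "SELECT 1",                            // hypothetical migration 1, appended later
    ];
    let already_applied: usize = 1; // as read from neon_migration.migration_id
    for stmt in &migrations[already_applied..] {
        println!("would run: {stmt}"); // only the unapplied tail runs, in order
    }
}
```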


@@ -10,6 +10,8 @@ async-trait.workspace = true
camino.workspace = true camino.workspace = true
clap.workspace = true clap.workspace = true
comfy-table.workspace = true comfy-table.workspace = true
diesel = { version = "2.1.4", features = ["postgres"]}
diesel_migrations = { version = "2.1.0", features = ["postgres"]}
futures.workspace = true futures.workspace = true
git-version.workspace = true git-version.workspace = true
nix.workspace = true nix.workspace = true
@@ -19,6 +21,7 @@ hex.workspace = true
hyper.workspace = true hyper.workspace = true
regex.workspace = true regex.workspace = true
reqwest = { workspace = true, features = ["blocking", "json"] } reqwest = { workspace = true, features = ["blocking", "json"] }
scopeguard.workspace = true
serde.workspace = true serde.workspace = true
serde_json.workspace = true serde_json.workspace = true
serde_with.workspace = true serde_with.workspace = true


@@ -25,6 +25,8 @@ tracing.workspace = true
# a parsing function when loading pageservers from neon_local LocalEnv # a parsing function when loading pageservers from neon_local LocalEnv
postgres_backend.workspace = true postgres_backend.workspace = true
diesel = { version = "2.1.4", features = ["serde_json", "postgres"] }
utils = { path = "../../libs/utils/" } utils = { path = "../../libs/utils/" }
metrics = { path = "../../libs/metrics/" } metrics = { path = "../../libs/metrics/" }
control_plane = { path = ".." } control_plane = { path = ".." }


@@ -0,0 +1,6 @@
-- This file was automatically created by Diesel to setup helper functions
-- and other internal bookkeeping. This file is safe to edit, any future
-- changes will be added to existing projects as new migrations.
DROP FUNCTION IF EXISTS diesel_manage_updated_at(_tbl regclass);
DROP FUNCTION IF EXISTS diesel_set_updated_at();


@@ -0,0 +1,36 @@
-- This file was automatically created by Diesel to setup helper functions
-- and other internal bookkeeping. This file is safe to edit, any future
-- changes will be added to existing projects as new migrations.
-- Sets up a trigger for the given table to automatically set a column called
-- `updated_at` whenever the row is modified (unless `updated_at` was included
-- in the modified columns)
--
-- # Example
--
-- ```sql
-- CREATE TABLE users (id SERIAL PRIMARY KEY, updated_at TIMESTAMP NOT NULL DEFAULT NOW());
--
-- SELECT diesel_manage_updated_at('users');
-- ```
CREATE OR REPLACE FUNCTION diesel_manage_updated_at(_tbl regclass) RETURNS VOID AS $$
BEGIN
EXECUTE format('CREATE TRIGGER set_updated_at BEFORE UPDATE ON %s
FOR EACH ROW EXECUTE PROCEDURE diesel_set_updated_at()', _tbl);
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION diesel_set_updated_at() RETURNS trigger AS $$
BEGIN
IF (
NEW IS DISTINCT FROM OLD AND
NEW.updated_at IS NOT DISTINCT FROM OLD.updated_at
) THEN
NEW.updated_at := current_timestamp;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;


@@ -0,0 +1 @@
DROP TABLE tenant_shards;


@@ -0,0 +1,12 @@
CREATE TABLE tenant_shards (
tenant_id VARCHAR NOT NULL,
shard_number INTEGER NOT NULL,
shard_count INTEGER NOT NULL,
PRIMARY KEY(tenant_id, shard_number, shard_count),
shard_stripe_size INTEGER NOT NULL,
generation INTEGER NOT NULL,
generation_pageserver BIGINT NOT NULL,
placement_policy VARCHAR NOT NULL,
-- config is JSON encoded, opaque to the database.
config TEXT NOT NULL
);
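
For orientation, the new `schema` module added in this PR (its generated `schema.rs` is not shown in this view) is derived from migrations like this one; a sketch of what `diesel print-schema` would plausibly emit for it:

```rust
// Hypothetical diesel schema for the tenant_shards migration above; the
// composite primary key becomes the tuple after the table name.
diesel::table! {
    tenant_shards (tenant_id, shard_number, shard_count) {
        tenant_id -> Varchar,
        shard_number -> Int4,
        shard_count -> Int4,
        shard_stripe_size -> Int4,
        generation -> Int4,
        generation_pageserver -> Int8,
        placement_policy -> Varchar,
        config -> Text,
    }
}
```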


@@ -0,0 +1 @@
DROP TABLE nodes;


@@ -0,0 +1,10 @@
CREATE TABLE nodes (
node_id BIGINT PRIMARY KEY NOT NULL,
scheduling_policy VARCHAR NOT NULL,
listen_http_addr VARCHAR NOT NULL,
listen_http_port INTEGER NOT NULL,
listen_pg_addr VARCHAR NOT NULL,
listen_pg_port INTEGER NOT NULL
);
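
And the corresponding sketch for this table (again assuming diesel's generated schema, which this view does not include):

```rust
// Hypothetical diesel schema for the nodes migration above.
diesel::table! {
    nodes (node_id) {
        node_id -> Int8,
        scheduling_policy -> Varchar,
        listen_http_addr -> Varchar,
        listen_http_port -> Int4,
        listen_pg_addr -> Varchar,
        listen_pg_port -> Int4,
    }
}
```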


@@ -1,5 +1,5 @@
use crate::reconciler::ReconcileError; use crate::reconciler::ReconcileError;
use crate::service::Service; use crate::service::{Service, STARTUP_RECONCILE_TIMEOUT};
use hyper::{Body, Request, Response}; use hyper::{Body, Request, Response};
use hyper::{StatusCode, Uri}; use hyper::{StatusCode, Uri};
use pageserver_api::models::{TenantCreateRequest, TimelineCreateRequest}; use pageserver_api::models::{TenantCreateRequest, TimelineCreateRequest};
@@ -104,34 +104,34 @@ async fn handle_inspect(mut req: Request<Body>) -> Result<Response<Body>, ApiErr
json_response(StatusCode::OK, state.service.inspect(inspect_req)) json_response(StatusCode::OK, state.service.inspect(inspect_req))
} }
async fn handle_tenant_create(mut req: Request<Body>) -> Result<Response<Body>, ApiError> { async fn handle_tenant_create(
service: Arc<Service>,
mut req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let create_req = json_request::<TenantCreateRequest>(&mut req).await?; let create_req = json_request::<TenantCreateRequest>(&mut req).await?;
let state = get_state(&req); json_response(StatusCode::OK, service.tenant_create(create_req).await?)
json_response(
StatusCode::OK,
state.service.tenant_create(create_req).await?,
)
} }
async fn handle_tenant_timeline_create(mut req: Request<Body>) -> Result<Response<Body>, ApiError> { async fn handle_tenant_timeline_create(
service: Arc<Service>,
mut req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?; let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
let create_req = json_request::<TimelineCreateRequest>(&mut req).await?; let create_req = json_request::<TimelineCreateRequest>(&mut req).await?;
let state = get_state(&req);
json_response( json_response(
StatusCode::OK, StatusCode::OK,
state service
.service
.tenant_timeline_create(tenant_id, create_req) .tenant_timeline_create(tenant_id, create_req)
.await?, .await?,
) )
} }
async fn handle_tenant_locate(req: Request<Body>) -> Result<Response<Body>, ApiError> { async fn handle_tenant_locate(
service: Arc<Service>,
req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?; let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
let state = get_state(&req); json_response(StatusCode::OK, service.tenant_locate(tenant_id)?)
json_response(StatusCode::OK, state.service.tenant_locate(tenant_id)?)
} }
async fn handle_node_register(mut req: Request<Body>) -> Result<Response<Body>, ApiError> { async fn handle_node_register(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -154,14 +154,15 @@ async fn handle_node_configure(mut req: Request<Body>) -> Result<Response<Body>,
json_response(StatusCode::OK, state.service.node_configure(config_req)?) json_response(StatusCode::OK, state.service.node_configure(config_req)?)
} }
async fn handle_tenant_shard_migrate(mut req: Request<Body>) -> Result<Response<Body>, ApiError> { async fn handle_tenant_shard_migrate(
service: Arc<Service>,
mut req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_shard_id: TenantShardId = parse_request_param(&req, "tenant_shard_id")?; let tenant_shard_id: TenantShardId = parse_request_param(&req, "tenant_shard_id")?;
let migrate_req = json_request::<TenantShardMigrateRequest>(&mut req).await?; let migrate_req = json_request::<TenantShardMigrateRequest>(&mut req).await?;
let state = get_state(&req);
json_response( json_response(
StatusCode::OK, StatusCode::OK,
state service
.service
.tenant_shard_migrate(tenant_shard_id, migrate_req) .tenant_shard_migrate(tenant_shard_id, migrate_req)
.await?, .await?,
) )
@@ -178,6 +179,35 @@ impl From<ReconcileError> for ApiError {
} }
} }
/// Common wrapper for request handlers that call into Service and will operate on tenants: they must only
/// be allowed to run if Service has finished its initial reconciliation.
async fn tenant_service_handler<R, H>(request: Request<Body>, handler: H) -> R::Output
where
R: std::future::Future<Output = Result<Response<Body>, ApiError>> + Send + 'static,
H: FnOnce(Arc<Service>, Request<Body>) -> R + Send + Sync + 'static,
{
let state = get_state(&request);
let service = state.service.clone();
let startup_complete = service.startup_complete.clone();
if tokio::time::timeout(STARTUP_RECONCILE_TIMEOUT, startup_complete.wait())
.await
.is_err()
{
// This shouldn't happen: it is the responsibility of [`Service::startup_reconcile`] to use appropriate
// timeouts around its remote calls, to bound its runtime.
return Err(ApiError::Timeout(
"Timed out waiting for service readiness".into(),
));
}
request_span(
request,
|request| async move { handler(service, request).await },
)
.await
}
pub fn make_router( pub fn make_router(
service: Arc<Service>, service: Arc<Service>,
auth: Option<Arc<SwappableJwtAuth>>, auth: Option<Arc<SwappableJwtAuth>>,
@@ -205,14 +235,20 @@ pub fn make_router(
.put("/node/:node_id/config", |r| { .put("/node/:node_id/config", |r| {
request_span(r, handle_node_configure) request_span(r, handle_node_configure)
}) })
.post("/tenant", |r| request_span(r, handle_tenant_create)) .post("/v1/tenant", |r| {
.post("/tenant/:tenant_id/timeline", |r| { tenant_service_handler(r, handle_tenant_create)
request_span(r, handle_tenant_timeline_create) })
.post("/v1/tenant/:tenant_id/timeline", |r| {
tenant_service_handler(r, handle_tenant_timeline_create)
}) })
.get("/tenant/:tenant_id/locate", |r| { .get("/tenant/:tenant_id/locate", |r| {
request_span(r, handle_tenant_locate) tenant_service_handler(r, handle_tenant_locate)
}) })
.put("/tenant/:tenant_shard_id/migrate", |r| { .put("/tenant/:tenant_shard_id/migrate", |r| {
request_span(r, handle_tenant_shard_migrate) tenant_service_handler(r, handle_tenant_shard_migrate)
}) })
// Path aliases for tests_forward_compatibility
// TODO: remove these in future PR
.post("/re-attach", |r| request_span(r, handle_re_attach))
.post("/validate", |r| request_span(r, handle_validate))
} }


@@ -7,6 +7,7 @@ mod node;
pub mod persistence; pub mod persistence;
mod reconciler; mod reconciler;
mod scheduler; mod scheduler;
mod schema;
pub mod service; pub mod service;
mod tenant_state; mod tenant_state;
@@ -17,6 +18,8 @@ enum PlacementPolicy {
/// Production-ready way to attach a tenant: one attached pageserver and /// Production-ready way to attach a tenant: one attached pageserver and
/// some number of secondaries. /// some number of secondaries.
Double(usize), Double(usize),
/// Do not attach to any pageservers
Detached,
} }
#[derive(Ord, PartialOrd, Eq, PartialEq, Copy, Clone)] #[derive(Ord, PartialOrd, Eq, PartialEq, Copy, Clone)]

View File

@@ -12,9 +12,9 @@ use camino::Utf8PathBuf;
use clap::Parser; use clap::Parser;
use metrics::launch_timestamp::LaunchTimestamp; use metrics::launch_timestamp::LaunchTimestamp;
use std::sync::Arc; use std::sync::Arc;
use tokio::signal::unix::SignalKind;
use utils::auth::{JwtAuth, SwappableJwtAuth}; use utils::auth::{JwtAuth, SwappableJwtAuth};
use utils::logging::{self, LogFormat}; use utils::logging::{self, LogFormat};
use utils::signals::{ShutdownSignals, Signal};
use utils::{project_build_tag, project_git_version, tcp_listener}; use utils::{project_build_tag, project_git_version, tcp_listener};
@@ -40,6 +40,10 @@ struct Cli {
/// Path to the .json file to store state (will be created if it doesn't exist) /// Path to the .json file to store state (will be created if it doesn't exist)
#[arg(short, long)] #[arg(short, long)]
path: Utf8PathBuf, path: Utf8PathBuf,
/// URL to connect to postgres, like postgresql://localhost:1234/attachment_service
#[arg(long)]
database_url: String,
} }
#[tokio::main] #[tokio::main]
@@ -66,9 +70,14 @@ async fn main() -> anyhow::Result<()> {
jwt_token: args.jwt_token, jwt_token: args.jwt_token,
}; };
let persistence = Arc::new(Persistence::new(&args.path).await); let json_path = if args.path.as_os_str().is_empty() {
None
} else {
Some(args.path)
};
let persistence = Arc::new(Persistence::new(args.database_url, json_path.clone()));
let service = Service::spawn(config, persistence).await?; let service = Service::spawn(config, persistence.clone()).await?;
let http_listener = tcp_listener::bind(args.listen)?; let http_listener = tcp_listener::bind(args.listen)?;
@@ -81,20 +90,31 @@ async fn main() -> anyhow::Result<()> {
let router = make_router(service, auth) let router = make_router(service, auth)
.build() .build()
.map_err(|err| anyhow!(err))?; .map_err(|err| anyhow!(err))?;
let service = utils::http::RouterService::new(router).unwrap(); let router_service = utils::http::RouterService::new(router).unwrap();
let server = hyper::Server::from_tcp(http_listener)?.serve(service); let server = hyper::Server::from_tcp(http_listener)?.serve(router_service);
tracing::info!("Serving on {0}", args.listen); tracing::info!("Serving on {0}", args.listen);
tokio::task::spawn(server); tokio::task::spawn(server);
ShutdownSignals::handle(|signal| match signal { // Wait until we receive a signal
Signal::Interrupt | Signal::Terminate | Signal::Quit => { let mut sigint = tokio::signal::unix::signal(SignalKind::interrupt())?;
tracing::info!("Got {}. Terminating", signal.name()); let mut sigquit = tokio::signal::unix::signal(SignalKind::quit())?;
// We're just a test helper: no graceful shutdown. let mut sigterm = tokio::signal::unix::signal(SignalKind::terminate())?;
std::process::exit(0); tokio::select! {
} _ = sigint.recv() => {},
})?; _ = sigterm.recv() => {},
_ = sigquit.recv() => {},
}
tracing::info!("Terminating on signal");
Ok(()) if json_path.is_some() {
// Write out a JSON dump on shutdown: this is used in compat tests to avoid passing
// full postgres dumps around.
if let Err(e) = persistence.write_tenants_json().await {
tracing::error!("Failed to write JSON on shutdown: {e}")
}
}
std::process::exit(0);
} }


@@ -1,6 +1,8 @@
use control_plane::attachment_service::{NodeAvailability, NodeSchedulingPolicy}; use control_plane::attachment_service::{NodeAvailability, NodeSchedulingPolicy};
use utils::id::NodeId; use utils::id::NodeId;
use crate::persistence::NodePersistence;
#[derive(Clone)] #[derive(Clone)]
pub(crate) struct Node { pub(crate) struct Node {
pub(crate) id: NodeId, pub(crate) id: NodeId,
@@ -34,4 +36,15 @@ impl Node {
NodeSchedulingPolicy::Pause => false, NodeSchedulingPolicy::Pause => false,
} }
} }
pub(crate) fn to_persistent(&self) -> NodePersistence {
NodePersistence {
node_id: self.id.0 as i64,
scheduling_policy: self.scheduling.into(),
listen_http_addr: self.listen_http_addr.clone(),
listen_http_port: self.listen_http_port as i32,
listen_pg_addr: self.listen_pg_addr.clone(),
listen_pg_port: self.listen_pg_port as i32,
}
}
} }
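
`NodePersistence` itself is defined in persistence.rs, outside the hunks shown here; given the nodes migration and the casts above, it plausibly looks like the following sketch (not the committed definition):

```rust
use diesel::prelude::*;

// Hypothetical shape: one row of the `nodes` table, using SQL-friendly
// integer widths (BIGINT -> i64, INTEGER -> i32).
#[derive(Queryable, Insertable)]
#[diesel(table_name = crate::schema::nodes)]
pub(crate) struct NodePersistence {
    pub(crate) node_id: i64,
    pub(crate) scheduling_policy: String,
    pub(crate) listen_http_addr: String,
    pub(crate) listen_http_port: i32,
    pub(crate) listen_pg_addr: String,
    pub(crate) listen_pg_port: i32,
}
```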


@@ -1,139 +1,161 @@
use std::{collections::HashMap, str::FromStr}; use std::collections::HashMap;
use std::str::FromStr;
use camino::{Utf8Path, Utf8PathBuf}; use camino::Utf8Path;
use control_plane::{ use camino::Utf8PathBuf;
attachment_service::{NodeAvailability, NodeSchedulingPolicy}, use control_plane::attachment_service::{NodeAvailability, NodeSchedulingPolicy};
local_env::LocalEnv, use diesel::pg::PgConnection;
}; use diesel::prelude::*;
use pageserver_api::{ use diesel::Connection;
models::TenantConfig, use pageserver_api::models::TenantConfig;
shard::{ShardCount, ShardNumber, TenantShardId}, use pageserver_api::shard::{ShardCount, ShardNumber, TenantShardId};
};
use postgres_connection::parse_host_port; use postgres_connection::parse_host_port;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use utils::{ use utils::generation::Generation;
generation::Generation, use utils::id::{NodeId, TenantId};
id::{NodeId, TenantId},
};
use crate::{node::Node, PlacementPolicy}; use crate::node::Node;
use crate::PlacementPolicy;
/// Placeholder for storage. This will be replaced with a database client. /// ## What do we store?
///
/// The attachment service does not store most of its state durably.
///
/// The essential things to store durably are:
/// - generation numbers, as these must always advance monotonically to ensure data safety.
/// - Tenant's PlacementPolicy and TenantConfig, as the source of truth for these is something external.
/// - Node's scheduling policies, as the source of truth for these is something external.
///
/// Other things we store durably as an implementation detail:
/// - Node's host/port: this could be avoided if we made nodes emit a self-registering heartbeat,
/// but it is operationally simpler to make this service the authority for which nodes
/// it talks to.
///
/// ## Performance/efficiency
///
/// The attachment service does not go via the database for most things: there are
/// a couple of places where we must, and where efficiency matters:
/// - Incrementing generation numbers: the Reconciler has to wait for this to complete
/// before it can attach a tenant, so this acts as a bound on how fast things like
/// failover can happen.
/// - Pageserver re-attach: we will increment many shards' generations when this happens,
/// so it is important to avoid e.g. issuing O(N) queries.
///
/// Database calls relating to nodes have low performance requirements, as they are very rarely
/// updated, and reads of nodes are always from memory, not the database. We only require that
/// we can UPDATE a node's scheduling mode reasonably quickly to mark a bad node offline.
pub struct Persistence { pub struct Persistence {
state: std::sync::Mutex<PersistentState>, database_url: String,
// In test environments, we support loading+saving a JSON file. This is temporary, for the benefit of
// test_compatibility.py, so that we don't have to commit to making the database contents fully backward/forward
// compatible just yet.
json_path: Option<Utf8PathBuf>,
} }
// Top level state available to all HTTP handlers /// Legacy format, for use in JSON compat objects in test environment
#[derive(Serialize, Deserialize)] #[derive(Serialize, Deserialize)]
struct PersistentState { struct JsonPersistence {
tenants: HashMap<TenantShardId, TenantShardPersistence>, tenants: HashMap<TenantShardId, TenantShardPersistence>,
#[serde(skip)]
path: Utf8PathBuf,
} }
/// A convenience for serializing the state inside a sync lock, and then #[derive(thiserror::Error, Debug)]
/// writing it to disk outside of the lock. This will go away when switching pub(crate) enum DatabaseError {
/// to a database backend. #[error(transparent)]
struct PendingWrite { Query(#[from] diesel::result::Error),
bytes: Vec<u8>, #[error(transparent)]
path: Utf8PathBuf, Connection(#[from] diesel::result::ConnectionError),
#[error("Logical error: {0}")]
Logical(String),
} }
impl PendingWrite { pub(crate) type DatabaseResult<T> = Result<T, DatabaseError>;
async fn commit(&self) -> anyhow::Result<()> {
tokio::fs::write(&self.path, &self.bytes).await?;
Ok(())
}
}
impl PersistentState {
fn save(&self) -> PendingWrite {
PendingWrite {
bytes: serde_json::to_vec(self).expect("Serialization error"),
path: self.path.clone(),
}
}
async fn load(path: &Utf8Path) -> anyhow::Result<Self> {
let bytes = tokio::fs::read(path).await?;
let mut decoded = serde_json::from_slice::<Self>(&bytes)?;
decoded.path = path.to_owned();
for (tenant_id, tenant) in &mut decoded.tenants {
// Backward compat: an old attachments.json from before PR #6251, replace
// empty strings with proper defaults.
if tenant.tenant_id.is_empty() {
tenant.tenant_id = format!("{}", tenant_id);
tenant.config = serde_json::to_string(&TenantConfig::default())?;
tenant.placement_policy = serde_json::to_string(&PlacementPolicy::default())?;
}
}
Ok(decoded)
}
async fn load_or_new(path: &Utf8Path) -> Self {
match Self::load(path).await {
Ok(s) => {
tracing::info!("Loaded state file at {}", path);
s
}
Err(e)
if e.downcast_ref::<std::io::Error>()
.map(|e| e.kind() == std::io::ErrorKind::NotFound)
.unwrap_or(false) =>
{
tracing::info!("Will create state file at {}", path);
Self {
tenants: HashMap::new(),
path: path.to_owned(),
}
}
Err(e) => {
panic!("Failed to load state from '{}': {e:#} (maybe your .neon/ dir was written by an older version?)", path)
}
}
}
}
impl Persistence { impl Persistence {
pub async fn new(path: &Utf8Path) -> Self { pub fn new(database_url: String, json_path: Option<Utf8PathBuf>) -> Self {
let state = PersistentState::load_or_new(path).await;
Self { Self {
state: std::sync::Mutex::new(state), database_url,
json_path,
} }
} }
/// When registering a node, persist it so that on next start we will be able to /// Call the provided function in a tokio blocking thread, with a Diesel database connection.
/// iterate over known nodes to synchronize their tenant shard states with our observed state. async fn with_conn<F, R>(&self, func: F) -> DatabaseResult<R>
pub(crate) async fn insert_node(&self, _node: &Node) -> anyhow::Result<()> { where
// TODO: node persitence will come with database backend F: Fn(&mut PgConnection) -> DatabaseResult<R> + Send + 'static,
Ok(()) R: Send + 'static,
{
let database_url = self.database_url.clone();
tokio::task::spawn_blocking(move || -> DatabaseResult<R> {
// TODO: connection pooling, such as via diesel::r2d2
let mut conn = PgConnection::establish(&database_url)?;
func(&mut conn)
})
.await
.expect("Task panic")
} }
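
The TODO above names diesel::r2d2 as the pooling candidate; a hedged sketch of that direction (it assumes diesel's `r2d2` feature, which this PR does not enable):

```rust
use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool};

// Build a small connection pool instead of establishing a fresh connection
// per call; the service only hits the database on a few hot paths
// (generation increments, re-attach), so a modest pool size suffices.
fn make_pool(database_url: &str) -> Pool<ConnectionManager<PgConnection>> {
    let manager = ConnectionManager::<PgConnection>::new(database_url);
    Pool::builder()
        .max_size(8)
        .build(manager)
        .expect("failed to build connection pool")
}
```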
/// At startup, we populate the service's list of nodes, and use this list to call into /// When a node is first registered, persist it before using it for anything
/// each node to do an initial reconciliation of the state of the world with our in-memory pub(crate) async fn insert_node(&self, node: &Node) -> DatabaseResult<()> {
/// observed state. let np = node.to_persistent();
pub(crate) async fn list_nodes(&self) -> anyhow::Result<Vec<Node>> { self.with_conn(move |conn| -> DatabaseResult<()> {
let env = LocalEnv::load_config()?; diesel::insert_into(crate::schema::nodes::table)
// TODO: node persitence will come with database backend .values(&np)
.execute(conn)?;
Ok(())
})
.await
}
// XXX hack: enable test_backward_compatibility to work by populating our list of /// At startup, populate the list of nodes which our shards may be placed on
pub(crate) async fn list_nodes(&self) -> DatabaseResult<Vec<Node>> {
let nodes: Vec<Node> = self
.with_conn(move |conn| -> DatabaseResult<_> {
Ok(crate::schema::nodes::table
.load::<NodePersistence>(conn)?
.into_iter()
.map(|n| Node {
id: NodeId(n.node_id as u64),
// At startup we consider a node offline until proven otherwise.
availability: NodeAvailability::Offline,
scheduling: NodeSchedulingPolicy::from_str(&n.scheduling_policy)
.expect("Bad scheduling policy in DB"),
listen_http_addr: n.listen_http_addr,
listen_http_port: n.listen_http_port as u16,
listen_pg_addr: n.listen_pg_addr,
listen_pg_port: n.listen_pg_port as u16,
})
.collect::<Vec<Node>>())
})
.await?;
if nodes.is_empty() {
return self.list_nodes_local_env().await;
}
tracing::info!("list_nodes: loaded {} nodes", nodes.len());
Ok(nodes)
}
/// Shim for automated compatibility tests: load nodes from LocalEnv instead of database
pub(crate) async fn list_nodes_local_env(&self) -> DatabaseResult<Vec<Node>> {
// Enable test_backward_compatibility to work by populating our list of
// nodes from LocalEnv when it is not present in persistent storage. Otherwise at
// first startup in the compat test, we may have shards but no nodes.
use control_plane::local_env::LocalEnv;
let env = LocalEnv::load_config().map_err(|e| DatabaseError::Logical(format!("{e}")))?;
tracing::info!(
"Loading {} pageserver nodes from LocalEnv",
env.pageservers.len()
);
let mut nodes = Vec::new();
for ps_conf in env.pageservers {
let (pg_host, pg_port) =
parse_host_port(&ps_conf.listen_pg_addr).expect("Unable to parse listen_pg_addr");
let (http_host, http_port) = parse_host_port(&ps_conf.listen_http_addr)
.expect("Unable to parse listen_http_addr");
let node = Node {
id: ps_conf.id,
listen_pg_addr: pg_host.to_string(),
listen_pg_port: pg_port.unwrap_or(5432),
@@ -141,16 +163,96 @@ impl Persistence {
listen_http_port: http_port.unwrap_or(80),
availability: NodeAvailability::Active,
scheduling: NodeSchedulingPolicy::Active,
};
// Synchronize database with what we learn from LocalEnv
self.insert_node(&node).await?;
nodes.push(node);
}
Ok(nodes)
}
/// At startup, load the high level state for shards, such as their config + policy. This will
/// be enriched at runtime with state discovered on pageservers.
pub(crate) async fn list_tenant_shards(&self) -> DatabaseResult<Vec<TenantShardPersistence>> {
let loaded = self
.with_conn(move |conn| -> DatabaseResult<_> {
Ok(crate::schema::tenant_shards::table.load::<TenantShardPersistence>(conn)?)
})
.await?;
if loaded.is_empty() {
if let Some(path) = &self.json_path {
if tokio::fs::try_exists(path)
.await
.map_err(|e| DatabaseError::Logical(format!("Error stat'ing JSON file: {e}")))?
{
tracing::info!("Importing from legacy JSON format at {path}");
return self.list_tenant_shards_json(path).await;
}
}
}
Ok(loaded)
}
/// Shim for automated compatibility tests: load tenants from a JSON file instead of database
pub(crate) async fn list_tenant_shards_json(
&self,
path: &Utf8Path,
) -> DatabaseResult<Vec<TenantShardPersistence>> {
let bytes = tokio::fs::read(path)
.await
.map_err(|e| DatabaseError::Logical(format!("Failed to load JSON: {e}")))?;
let mut decoded = serde_json::from_slice::<JsonPersistence>(&bytes)
.map_err(|e| DatabaseError::Logical(format!("Deserialization error: {e}")))?;
for (tenant_id, tenant) in &mut decoded.tenants {
// Backward compat: an old attachments.json from before PR #6251, replace
// empty strings with proper defaults.
if tenant.tenant_id.is_empty() {
tenant.tenant_id = tenant_id.to_string();
tenant.config = serde_json::to_string(&TenantConfig::default())
.map_err(|e| DatabaseError::Logical(format!("Serialization error: {e}")))?;
tenant.placement_policy = serde_json::to_string(&PlacementPolicy::default())
.map_err(|e| DatabaseError::Logical(format!("Serialization error: {e}")))?;
}
}
let tenants: Vec<TenantShardPersistence> = decoded.tenants.into_values().collect();
// Synchronize database with what is in the JSON file
self.insert_tenant_shards(tenants.clone()).await?;
Ok(tenants)
}
/// For use in testing environments, where we dump out JSON on shutdown.
pub async fn write_tenants_json(&self) -> anyhow::Result<()> {
let Some(path) = &self.json_path else {
anyhow::bail!("Cannot write JSON if path isn't set (test environment bug)");
};
tracing::info!("Writing state to {path}...");
let tenants = self.list_tenant_shards().await?;
let mut tenants_map = HashMap::new();
for tsp in tenants {
let tenant_shard_id = TenantShardId {
tenant_id: TenantId::from_str(tsp.tenant_id.as_str())?,
shard_number: ShardNumber(tsp.shard_number as u8),
shard_count: ShardCount(tsp.shard_count as u8),
};
tenants_map.insert(tenant_shard_id, tsp);
}
let json = serde_json::to_string(&JsonPersistence {
tenants: tenants_map,
})?;
tokio::fs::write(path, &json).await?;
tracing::info!("Wrote {} bytes to {path}...", json.len());
Ok(())
}
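// The JsonPersistence type used by the two JSON shims above is not shown in this diff;
// from its usage it is presumably shaped roughly like this (a hedged reconstruction,
// not necessarily the exact definition):
#[derive(Serialize, Deserialize)]
struct JsonPersistence {
tenants: HashMap<TenantShardId, TenantShardPersistence>,
}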
/// Tenants must be persisted before we schedule them for the first time. This enables us
@@ -158,24 +260,77 @@ impl Persistence {
pub(crate) async fn insert_tenant_shards(
&self,
shards: Vec<TenantShardPersistence>,
) -> DatabaseResult<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| -> DatabaseResult<()> {
conn.transaction(|conn| -> QueryResult<()> {
for tenant in &shards {
diesel::insert_into(tenant_shards)
.values(tenant)
.execute(conn)?;
}
Ok(())
})?;
Ok(())
})
.await
}

/// Ordering: call this _after_ deleting the tenant on pageservers, but _before_ dropping state for
/// the tenant from memory on this server.
#[allow(unused)]
pub(crate) async fn delete_tenant(&self, del_tenant_id: TenantId) -> DatabaseResult<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| -> DatabaseResult<()> {
diesel::delete(tenant_shards)
.filter(tenant_id.eq(del_tenant_id.to_string()))
.execute(conn)?;
Ok(())
})
.await
}
/// When a tenant invokes the /re-attach API, this function is responsible for doing an efficient
/// batched increment of the generations of all tenants whose generation_pageserver is equal to
/// the node that called /re-attach.
#[tracing::instrument(skip_all, fields(node_id))]
pub(crate) async fn re_attach(
&self,
node_id: NodeId,
) -> DatabaseResult<HashMap<TenantShardId, Generation>> {
use crate::schema::tenant_shards::dsl::*;
let updated = self
.with_conn(move |conn| {
let rows_updated = diesel::update(tenant_shards)
.filter(generation_pageserver.eq(node_id.0 as i64))
.set(generation.eq(generation + 1))
.execute(conn)?;

tracing::info!("Incremented {} tenants' generations", rows_updated);

// TODO: UPDATE+SELECT in one query
let updated = tenant_shards
.filter(generation_pageserver.eq(node_id.0 as i64))
.select(TenantShardPersistence::as_select())
.load(conn)?;
Ok(updated)
})
.await?;

let mut result = HashMap::new();
for tsp in updated {
let tenant_shard_id = TenantShardId {
tenant_id: TenantId::from_str(tsp.tenant_id.as_str())
.map_err(|e| DatabaseError::Logical(format!("Malformed tenant id: {e}")))?,
shard_number: ShardNumber(tsp.shard_number as u8),
shard_count: ShardCount(tsp.shard_count as u8),
};
result.insert(tenant_shard_id, Generation::new(tsp.generation as u32));
}

Ok(result)
}
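// A sketch of how the UPDATE+SELECT TODO above could collapse into a single round
// trip, since diesel supports RETURNING on UPDATE (a hedged illustration using only
// APIs already exercised elsewhere in this change, not what it ships):
fn re_attach_single_query(
conn: &mut PgConnection,
node_id: NodeId,
) -> QueryResult<Vec<TenantShardPersistence>> {
use crate::schema::tenant_shards::dsl::*;
diesel::update(tenant_shards)
.filter(generation_pageserver.eq(node_id.0 as i64))
.set(generation.eq(generation + 1))
// RETURNING gives back the updated rows without a second SELECT
.returning(TenantShardPersistence::as_returning())
.get_results(conn)
}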
/// Reconciler calls this immediately before attaching to a new pageserver, to acquire a unique, monotonically
@@ -184,49 +339,48 @@ impl Persistence {
pub(crate) async fn increment_generation(
&self,
tenant_shard_id: TenantShardId,
node_id: NodeId,
) -> anyhow::Result<Generation> {
use crate::schema::tenant_shards::dsl::*;
let updated = self
.with_conn(move |conn| {
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(tenant_shard_id.tenant_id.to_string()))
.filter(shard_number.eq(tenant_shard_id.shard_number.0 as i32))
.filter(shard_count.eq(tenant_shard_id.shard_count.0 as i32))
.set((
generation.eq(generation + 1),
generation_pageserver.eq(node_id.0 as i64),
))
// TODO: only returning() the generation column
.returning(TenantShardPersistence::as_returning())
.get_result(conn)?;

Ok(updated)
})
.await?;

Ok(Generation::new(updated.generation as u32))
}

pub(crate) async fn detach(&self, tenant_shard_id: TenantShardId) -> anyhow::Result<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| {
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(tenant_shard_id.tenant_id.to_string()))
.filter(shard_number.eq(tenant_shard_id.shard_number.0 as i32))
.filter(shard_count.eq(tenant_shard_id.shard_count.0 as i32))
.set((
generation_pageserver.eq(i64::MAX),
placement_policy.eq(serde_json::to_string(&PlacementPolicy::Detached).unwrap()),
))
.execute(conn)?;

Ok(updated)
})
.await?;
Ok(())
}
// TODO: when we start shard splitting, we must durably mark the tenant so that
@@ -246,7 +400,8 @@ impl Persistence {
}

/// Parts of [`crate::tenant_state::TenantState`] that are stored durably
#[derive(Queryable, Selectable, Insertable, Serialize, Deserialize, Clone)]
#[diesel(table_name = crate::schema::tenant_shards)]
pub(crate) struct TenantShardPersistence {
#[serde(default)]
pub(crate) tenant_id: String,
@@ -257,16 +412,28 @@ pub(crate) struct TenantShardPersistence {
#[serde(default)]
pub(crate) shard_stripe_size: i32,

// Latest generation number: next time we attach, increment this
// and use the incremented number when attaching
pub(crate) generation: i32,

// Currently attached pageserver
#[serde(rename = "pageserver")]
pub(crate) generation_pageserver: i64,

#[serde(default)]
pub(crate) placement_policy: String,
#[serde(default)]
pub(crate) config: String,
}
/// Parts of [`crate::node::Node`] that are stored durably
#[derive(Serialize, Deserialize, Queryable, Selectable, Insertable)]
#[diesel(table_name = crate::schema::nodes)]
pub(crate) struct NodePersistence {
pub(crate) node_id: i64,
pub(crate) scheduling_policy: String,
pub(crate) listen_http_addr: String,
pub(crate) listen_http_port: i32,
pub(crate) listen_pg_addr: String,
pub(crate) listen_pg_port: i32,
}

View File

@@ -296,7 +296,7 @@ impl Reconciler {
// Increment generation before attaching to new pageserver
self.generation = self
.persistence
.increment_generation(self.tenant_shard_id, dest_ps_id)
.await?;

let dest_conf = build_location_config(
@@ -395,7 +395,7 @@ impl Reconciler {
// as locations with unknown (None) observed state.
self.generation = self
.persistence
.increment_generation(self.tenant_shard_id, node_id)
.await?;
wanted_conf.generation = self.generation.into();
tracing::info!("Observed configuration requires update.");

View File

@@ -0,0 +1,27 @@
// @generated automatically by Diesel CLI.
diesel::table! {
nodes (node_id) {
node_id -> Int8,
scheduling_policy -> Varchar,
listen_http_addr -> Varchar,
listen_http_port -> Int4,
listen_pg_addr -> Varchar,
listen_pg_port -> Int4,
}
}
diesel::table! {
tenant_shards (tenant_id, shard_number, shard_count) {
tenant_id -> Varchar,
shard_number -> Int4,
shard_count -> Int4,
shard_stripe_size -> Int4,
generation -> Int4,
generation_pageserver -> Int8,
placement_policy -> Varchar,
config -> Text,
}
}
diesel::allow_tables_to_appear_in_same_query!(nodes, tenant_shards,);
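// The allow_tables_to_appear_in_same_query! call above is what permits queries that
// reference both tables. A hypothetical example (no such query exists in this change),
// joining shards to the node that holds their latest generation:
fn shards_with_nodes(conn: &mut diesel::PgConnection) -> diesel::QueryResult<Vec<(String, i64)>> {
use diesel::prelude::*;
crate::schema::tenant_shards::table
.inner_join(crate::schema::nodes::table.on(
crate::schema::nodes::node_id.eq(crate::schema::tenant_shards::generation_pageserver),
))
.select((crate::schema::tenant_shards::tenant_id, crate::schema::nodes::node_id))
.load(conn)
}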

View File

@@ -11,6 +11,7 @@ use control_plane::attachment_service::{
TenantCreateResponseShard, TenantLocateResponse, TenantLocateResponseShard,
TenantShardMigrateRequest, TenantShardMigrateResponse,
};
use diesel::result::DatabaseErrorKind;
use hyper::StatusCode;
use pageserver_api::{
control_api::{
@@ -26,6 +27,7 @@ use pageserver_api::{
};
use pageserver_client::mgmt_api;
use utils::{
completion::Barrier,
generation::Generation,
http::error::ApiError,
id::{NodeId, TenantId},
@@ -35,7 +37,7 @@ use utils::{
use crate::{
compute_hook::ComputeHook,
node::Node,
persistence::{DatabaseError, Persistence, TenantShardPersistence},
scheduler::Scheduler,
tenant_state::{
IntentState, ObservedState, ObservedStateLocation, ReconcileResult, ReconcileWaitError,
@@ -46,6 +48,10 @@ use crate::{
const RECONCILE_TIMEOUT: Duration = Duration::from_secs(30);

/// How long [`Service::startup_reconcile`] is allowed to take before it should give
/// up on unresponsive pageservers and proceed.
pub(crate) const STARTUP_RECONCILE_TIMEOUT: Duration = Duration::from_secs(30);

// Top level state available to all HTTP handlers
struct ServiceState {
tenants: BTreeMap<TenantShardId, TenantState>,
@@ -79,10 +85,27 @@ pub struct Config {
pub jwt_token: Option<String>,
}
impl From<DatabaseError> for ApiError {
fn from(err: DatabaseError) -> ApiError {
match err {
DatabaseError::Query(e) => ApiError::InternalServerError(e.into()),
// FIXME: ApiError doesn't have an Unavailable variant, but ShuttingDown maps to 503.
DatabaseError::Connection(_e) => ApiError::ShuttingDown,
DatabaseError::Logical(reason) => {
ApiError::InternalServerError(anyhow::anyhow!(reason))
}
}
}
}
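// With this From impl in place, HTTP handlers that return Result<_, ApiError> can
// apply `?` directly to DatabaseResult values. A hypothetical handler body (not part
// of this change):
async fn list_nodes_for_api(service: &Service) -> Result<Vec<Node>, ApiError> {
// DatabaseError converts into ApiError via the impl above
Ok(service.persistence.list_nodes().await?)
}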
pub struct Service {
inner: Arc<std::sync::RwLock<ServiceState>>,
config: Config,
persistence: Arc<Persistence>,

/// This waits for initial reconciliation with pageservers to complete. Until this barrier
/// passes, it isn't safe to do any actions that mutate tenants.
pub(crate) startup_complete: Barrier,
}
impl From<ReconcileWaitError> for ApiError {
@@ -96,77 +119,32 @@ impl From<ReconcileWaitError> for ApiError {
}

impl Service {
pub fn get_config(&self) -> &Config {
&self.config
}

/// TODO: don't allow other API calls until this is done, don't start doing any background housekeeping
/// until this is done.
async fn startup_reconcile(&self) {
// For all tenant shards, a vector of observed states on nodes (where None means
// indeterminate, same as in [`ObservedStateLocation`])
let mut observed = HashMap::new();

let nodes = {
let locked = self.inner.read().unwrap();
locked.nodes.clone()
};

// TODO: issue these requests concurrently
for node in nodes.values() {
let client = mgmt_api::Client::new(node.base_url(), self.config.jwt_token.as_deref());

tracing::info!("Scanning shards on node {}...", node.id);
match client.list_location_config().await {
Err(e) => {
tracing::warn!("Could not contact pageserver {} ({e})", node.id);
// TODO: be more tolerant, apply a generous 5-10 second timeout with retries, in case
// pageserver is being restarted at the same time as we are
}
Ok(listing) => {
tracing::info!(
@@ -174,7 +152,6 @@ impl Service {
listing.tenant_shards.len(),
node.id
);

for (tenant_shard_id, conf_opt) in listing.tenant_shards {
observed.insert(tenant_shard_id, (node.id, conf_opt));
@@ -186,41 +163,46 @@ impl Service {
let mut cleanup = Vec::new();

// Populate intent and observed states for all tenants, based on reported state on pageservers
let shard_count = {
let mut locked = self.inner.write().unwrap();
for (tenant_shard_id, (node_id, observed_loc)) in observed {
let Some(tenant_state) = locked.tenants.get_mut(&tenant_shard_id) else {
cleanup.push((tenant_shard_id, node_id));
continue;
};

tenant_state
.observed
.locations
.insert(node_id, ObservedStateLocation { conf: observed_loc });
}

// Populate each tenant's intent state
let mut scheduler = Scheduler::new(&locked.tenants, &nodes);
for (tenant_shard_id, tenant_state) in locked.tenants.iter_mut() {
tenant_state.intent_from_observed();
if let Err(e) = tenant_state.schedule(&mut scheduler) {
// Non-fatal error: we are unable to properly schedule the tenant, perhaps because
// not enough pageservers are available. The tenant may well still be available
// to clients.
tracing::error!("Failed to schedule tenant {tenant_shard_id} at startup: {e}");
}
}

locked.tenants.len()
};
// TODO: if any tenant's intent now differs from its loaded generation_pageserver, we should clear that
// generation_pageserver in the database.
// Clean up any tenants that were found on pageservers but are not known to us.
for (tenant_shard_id, node_id) in cleanup {
// A node reported a tenant_shard_id which is unknown to us: detach it.
let node = nodes
.get(&node_id)
.expect("Always exists: only known nodes are scanned");
let client = mgmt_api::Client::new(node.base_url(), self.config.jwt_token.as_deref());
match client
.location_config(
tenant_shard_id,
@@ -252,13 +234,80 @@ impl Service {
}
}

// Finally, now that the service is up and running, launch reconcile operations for any tenants
// which require it: under normal circumstances this should only include tenants that were in some
// transient state before we restarted.
let reconcile_tasks = self.reconcile_all();
tracing::info!("Startup complete, spawned {reconcile_tasks} reconciliation tasks ({shard_count} shards total)");
}
pub async fn spawn(config: Config, persistence: Arc<Persistence>) -> anyhow::Result<Arc<Self>> {
let (result_tx, mut result_rx) = tokio::sync::mpsc::unbounded_channel();
tracing::info!("Loading nodes from database...");
let nodes = persistence.list_nodes().await?;
let nodes: HashMap<NodeId, Node> = nodes.into_iter().map(|n| (n.id, n)).collect();
tracing::info!("Loaded {} nodes from database.", nodes.len());
tracing::info!("Loading shards from database...");
let tenant_shard_persistence = persistence.list_tenant_shards().await?;
tracing::info!(
"Loaded {} shards from database.",
tenant_shard_persistence.len()
);
let mut tenants = BTreeMap::new();
for tsp in tenant_shard_persistence {
let tenant_shard_id = TenantShardId {
tenant_id: TenantId::from_str(tsp.tenant_id.as_str())?,
shard_number: ShardNumber(tsp.shard_number as u8),
shard_count: ShardCount(tsp.shard_count as u8),
};
let shard_identity = if tsp.shard_count == 0 {
ShardIdentity::unsharded()
} else {
ShardIdentity::new(
ShardNumber(tsp.shard_number as u8),
ShardCount(tsp.shard_count as u8),
ShardStripeSize(tsp.shard_stripe_size as u32),
)?
};
// We will populate intent properly later in [`Self::startup_reconcile`], initially populate
// it with what we can infer: the node for which a generation was most recently issued.
let mut intent = IntentState::new();
if tsp.generation_pageserver != i64::MAX {
intent.attached = Some(NodeId(tsp.generation_pageserver as u64))
}
let new_tenant = TenantState {
tenant_shard_id,
shard: shard_identity,
sequence: Sequence::initial(),
generation: Generation::new(tsp.generation as u32),
policy: serde_json::from_str(&tsp.placement_policy).unwrap(),
intent,
observed: ObservedState::new(),
config: serde_json::from_str(&tsp.config).unwrap(),
reconciler: None,
waiter: Arc::new(SeqWait::new(Sequence::initial())),
error_waiter: Arc::new(SeqWait::new(Sequence::initial())),
last_error: Arc::default(),
};
tenants.insert(tenant_shard_id, new_tenant);
}
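// The i64::MAX sentinel tested above means "no pageserver holds the latest
// generation". A tiny hypothetical helper that would make the convention explicit:
fn attached_node_from_db(generation_pageserver: i64) -> Option<NodeId> {
if generation_pageserver == i64::MAX {
None // detached: the next attach must increment the generation first
} else {
Some(NodeId(generation_pageserver as u64))
}
}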
let (startup_completion, startup_complete) = utils::completion::channel();
let this = Arc::new(Self {
inner: Arc::new(std::sync::RwLock::new(ServiceState::new(
result_tx, nodes, tenants,
))),
config,
persistence,
startup_complete,
});

let result_task_this = this.clone();
@@ -316,11 +365,13 @@ impl Service {
}
});

let startup_reconcile_this = this.clone();
tokio::task::spawn(async move {
// Block the [`Service::startup_complete`] barrier until we're done
let _completion = startup_completion;

startup_reconcile_this.startup_reconcile().await
});

Ok(this)
}
@@ -336,7 +387,6 @@ impl Service {
let locked = self.inner.write().unwrap();
!locked.tenants.contains_key(&attach_req.tenant_shard_id)
};

if insert {
let tsp = TenantShardPersistence {
tenant_id: attach_req.tenant_shard_id.tenant_id.to_string(),
@@ -344,31 +394,49 @@ impl Service {
shard_count: attach_req.tenant_shard_id.shard_count.0 as i32,
shard_stripe_size: 0,
generation: 0,
generation_pageserver: i64::MAX,
placement_policy: serde_json::to_string(&PlacementPolicy::default()).unwrap(),
config: serde_json::to_string(&TenantConfig::default()).unwrap(),
};
match self.persistence.insert_tenant_shards(vec![tsp]).await {
Err(e) => match e {
DatabaseError::Query(diesel::result::Error::DatabaseError(
DatabaseErrorKind::UniqueViolation,
_,
)) => {
tracing::info!(
"Raced with another request to insert tenant {}",
attach_req.tenant_shard_id
)
}
_ => return Err(e.into()),
},
Ok(()) => {
tracing::info!("Inserted shard {} in database", attach_req.tenant_shard_id);
let mut locked = self.inner.write().unwrap();
locked.tenants.insert(
attach_req.tenant_shard_id,
TenantState::new(
attach_req.tenant_shard_id,
ShardIdentity::unsharded(),
PlacementPolicy::Single,
),
);
tracing::info!("Inserted shard {} in memory", attach_req.tenant_shard_id);
}
}
}
let new_generation = if let Some(req_node_id) = attach_req.node_id {
Some(
self.persistence
.increment_generation(attach_req.tenant_shard_id, req_node_id)
.await?,
)
} else {
self.persistence.detach(attach_req.tenant_shard_id).await?;
None
};
@@ -380,6 +448,11 @@ impl Service {
if let Some(new_generation) = new_generation {
tenant_state.generation = new_generation;
} else {
// This is a detach notification. We must update placement policy to avoid re-attaching
// during background scheduling/reconciliation, or during attachment service restart.
assert!(attach_req.node_id.is_none());
tenant_state.policy = PlacementPolicy::Detached;
}

if let Some(attaching_pageserver) = attach_req.node_id.as_ref() {
@@ -407,6 +480,7 @@ impl Service {
"attach_hook: tenant {} set generation {:?}, pageserver {}", "attach_hook: tenant {} set generation {:?}, pageserver {}",
attach_req.tenant_shard_id, attach_req.tenant_shard_id,
tenant_state.generation, tenant_state.generation,
// TODO: this is an odd number of 0xf's
attach_req.node_id.unwrap_or(utils::id::NodeId(0xfffffff)) attach_req.node_id.unwrap_or(utils::id::NodeId(0xfffffff))
); );
@@ -499,6 +573,14 @@ impl Service {
id: req_tenant.id,
valid,
});
} else {
// After tenant deletion, we may approve any validation. This avoids
// spurious warnings on the pageserver if it has pending LSN updates
// at the point a deletion happens.
response.tenants.push(ValidateResponseTenant {
id: req_tenant.id,
valid: true,
});
} }
}
response
@@ -554,7 +636,7 @@ impl Service {
shard_count: tenant_shard_id.shard_count.0 as i32,
shard_stripe_size: create_req.shard_parameters.stripe_size.0 as i32,
generation: 0,
generation_pageserver: i64::MAX,
placement_policy: serde_json::to_string(&placement_policy).unwrap(),
config: serde_json::to_string(&create_req.config).unwrap(),
})
@@ -868,7 +950,6 @@ impl Service {
} else {
let old_attached = shard.intent.attached;

match shard.policy {
PlacementPolicy::Single => {
shard.intent.secondary.clear();
@@ -882,7 +963,13 @@ impl Service {
shard.intent.secondary.push(old_attached);
}
}
PlacementPolicy::Detached => {
return Err(ApiError::BadRequest(anyhow::anyhow!(
"Cannot migrate a tenant that is PlacementPolicy::Detached: configure it to an attached policy first"
)))
}
}
shard.intent.attached = Some(migrate_req.node_id);

tracing::info!("Migrating: new intent {:?}", shard.intent);
shard.sequence = shard.sequence.next();
@@ -955,10 +1042,7 @@ impl Service {
availability: NodeAvailability::Active,
};

// TODO: idempotency if the node already exists in the database
self.persistence.insert_node(&new_node).await?;
let mut locked = self.inner.write().unwrap();
let mut new_nodes = (*locked.nodes).clone();

View File

@@ -312,6 +312,18 @@ impl TenantState {
modified = true;
}
}
Detached => {
// Should have no attached or secondary pageservers
if self.intent.attached.is_some() {
self.intent.attached = None;
modified = true;
}
if !self.intent.secondary.is_empty() {
self.intent.secondary.clear();
modified = true;
}
}
}

if modified {

View File

@@ -1,5 +1,11 @@
use crate::{background_process, local_env::LocalEnv};
use camino::{Utf8Path, Utf8PathBuf};
use diesel::{
backend::Backend,
query_builder::{AstPass, QueryFragment, QueryId},
Connection, PgConnection, QueryResult, RunQueryDsl,
};
use diesel_migrations::{HarnessWithOutput, MigrationHarness};
use hyper::Method;
use pageserver_api::{
models::{ShardParameters, TenantCreateRequest, TimelineCreateRequest, TimelineInfo},
@@ -7,9 +13,9 @@ use pageserver_api::{
};
use pageserver_client::mgmt_api::ResponseErrorMessageExt;
use postgres_backend::AuthType;
use serde::{de::DeserializeOwned, Deserialize, Serialize};
use std::{env, str::FromStr};
use tokio::process::Command;
use tracing::instrument;
use utils::{
auth::{Claims, Scope},
@@ -19,14 +25,17 @@ use utils::{
pub struct AttachmentService {
env: LocalEnv,
listen: String,
path: Utf8PathBuf,
jwt_token: Option<String>,
public_key_path: Option<Utf8PathBuf>,
postgres_port: u16,
client: reqwest::Client,
}

const COMMAND: &str = "attachment_service";

const ATTACHMENT_SERVICE_POSTGRES_VERSION: u32 = 16;

#[derive(Serialize, Deserialize)]
pub struct AttachHookRequest {
pub tenant_shard_id: TenantShardId,
@@ -169,7 +178,9 @@ pub struct TenantShardMigrateResponse {}
impl AttachmentService {
pub fn from_env(env: &LocalEnv) -> Self {
let path = Utf8PathBuf::from_path_buf(env.base_data_dir.clone())
.unwrap()
.join("attachments.json");

// Makes no sense to construct this if pageservers aren't going to use it: assume
// pageservers have control plane API set
@@ -181,6 +192,13 @@ impl AttachmentService {
listen_url.port().unwrap()
);
// Convention: NeonEnv in python tests reserves the next port after the control_plane_api
// port, for use by our captive postgres.
let postgres_port = listen_url
.port()
.expect("Control plane API setting should always have a port")
+ 1;
// Assume all pageservers have symmetric auth configuration: this service
// expects to use one JWT token to talk to all of them.
let ps_conf = env
@@ -209,6 +227,7 @@ impl AttachmentService {
listen,
jwt_token,
public_key_path,
postgres_port,
client: reqwest::ClientBuilder::new()
.build()
.expect("Failed to construct http client"),
@@ -220,13 +239,214 @@ impl AttachmentService {
.expect("non-Unicode path") .expect("non-Unicode path")
} }
/// PIDFile for the postgres instance used to store attachment service state
fn postgres_pid_file(&self) -> Utf8PathBuf {
Utf8PathBuf::from_path_buf(
self.env
.base_data_dir
.join("attachment_service_postgres.pid"),
)
.expect("non-Unicode path")
}

/// In order to access database migrations, we need to find the Neon source tree
async fn find_source_root(&self) -> anyhow::Result<Utf8PathBuf> {
// We assume that either our working directory or our binary is in the source tree. The former is usually
// true for automated test runners, the latter is usually true for developer workstations. Often
// both are true, which is fine.
let candidate_start_points = [
// Current working directory
Utf8PathBuf::from_path_buf(std::env::current_dir()?).unwrap(),
// Directory containing the binary we're running inside
Utf8PathBuf::from_path_buf(env::current_exe()?.parent().unwrap().to_owned()).unwrap(),
];
// For each candidate start point, search through ancestors looking for a neon.git source tree root
for start_point in &candidate_start_points {
// Start from the build dir: assumes we are running out of a built neon source tree
for path in start_point.ancestors() {
// A crude approximation: the root of the source tree is whatever contains a "control_plane"
// subdirectory.
let control_plane = path.join("control_plane");
if tokio::fs::try_exists(&control_plane).await? {
return Ok(path.to_owned());
}
}
}
// Fall-through
Err(anyhow::anyhow!(
"Could not find control_plane src dir, after searching ancestors of {candidate_start_points:?}"
))
}
/// Find the directory containing postgres binaries, such as `initdb` and `pg_ctl`
///
/// This usually uses ATTACHMENT_SERVICE_POSTGRES_VERSION of postgres, but will fall back
/// to other versions if that one isn't found. Some automated tests create circumstances
/// where only one version is available in pg_distrib_dir, such as `test_remote_extensions`.
pub async fn get_pg_bin_dir(&self) -> anyhow::Result<Utf8PathBuf> {
let prefer_versions = [ATTACHMENT_SERVICE_POSTGRES_VERSION, 15, 14];
for v in prefer_versions {
let path = Utf8PathBuf::from_path_buf(self.env.pg_bin_dir(v)?).unwrap();
if tokio::fs::try_exists(&path).await? {
return Ok(path);
}
}
// Fall through
anyhow::bail!(
"Postgres binaries not found in {}",
self.env.pg_distrib_dir.display()
);
}
/// Readiness check for our postgres process
async fn pg_isready(&self, pg_bin_dir: &Utf8Path) -> anyhow::Result<bool> {
let bin_path = pg_bin_dir.join("pg_isready");
let args = ["-h", "localhost", "-p", &format!("{}", self.postgres_port)];
let exitcode = Command::new(bin_path).args(args).spawn()?.wait().await?;
Ok(exitcode.success())
}
/// Create our database if it doesn't exist, and run migrations.
///
/// This function is equivalent to the `diesel setup` command in the diesel CLI. We implement
/// the same steps by hand to avoid imposing a dependency on installing diesel-cli for developers
/// who just want to run `cargo neon_local` without knowing about diesel.
///
/// Returns the database url
pub async fn setup_database(&self) -> anyhow::Result<String> {
let database_url = format!(
"postgresql://localhost:{}/attachment_service",
self.postgres_port
);
println!("Running attachment service database setup...");
fn change_database_of_url(database_url: &str, default_database: &str) -> (String, String) {
let base = ::url::Url::parse(database_url).unwrap();
let database = base.path_segments().unwrap().last().unwrap().to_owned();
let mut new_url = base.join(default_database).unwrap();
new_url.set_query(base.query());
(database, new_url.into())
}
#[derive(Debug, Clone)]
pub struct CreateDatabaseStatement {
db_name: String,
}
impl CreateDatabaseStatement {
pub fn new(db_name: &str) -> Self {
CreateDatabaseStatement {
db_name: db_name.to_owned(),
}
}
}
impl<DB: Backend> QueryFragment<DB> for CreateDatabaseStatement {
fn walk_ast<'b>(&'b self, mut out: AstPass<'_, 'b, DB>) -> QueryResult<()> {
out.push_sql("CREATE DATABASE ");
out.push_identifier(&self.db_name)?;
Ok(())
}
}
impl<Conn> RunQueryDsl<Conn> for CreateDatabaseStatement {}
impl QueryId for CreateDatabaseStatement {
type QueryId = ();
const HAS_STATIC_QUERY_ID: bool = false;
}
if PgConnection::establish(&database_url).is_err() {
let (database, postgres_url) = change_database_of_url(&database_url, "postgres");
println!("Creating database: {database}");
let mut conn = PgConnection::establish(&postgres_url)?;
CreateDatabaseStatement::new(&database).execute(&mut conn)?;
}
let mut conn = PgConnection::establish(&database_url)?;
let migrations_dir = self
.find_source_root()
.await?
.join("control_plane/attachment_service/migrations");
let migrations = diesel_migrations::FileBasedMigrations::from_path(migrations_dir)?;
println!("Running migrations in {}", migrations.path().display());
HarnessWithOutput::write_to_stdout(&mut conn)
.run_pending_migrations(migrations)
.map(|_| ())
.map_err(|e| anyhow::anyhow!(e))?;
println!("Migrations complete");
Ok(database_url)
}
pub async fn start(&self) -> anyhow::Result<()> {
// Start a vanilla Postgres process used by the attachment service for persistence.
let pg_data_path = Utf8PathBuf::from_path_buf(self.env.base_data_dir.clone())
.unwrap()
.join("attachment_service_db");
let pg_bin_dir = self.get_pg_bin_dir().await?;
let pg_log_path = pg_data_path.join("postgres.log");
if !tokio::fs::try_exists(&pg_data_path).await? {
// Initialize empty database
let initdb_path = pg_bin_dir.join("initdb");
let mut child = Command::new(&initdb_path)
.args(["-D", pg_data_path.as_ref()])
.spawn()
.expect("Failed to spawn initdb");
let status = child.wait().await?;
if !status.success() {
anyhow::bail!("initdb failed with status {status}");
}
tokio::fs::write(
&pg_data_path.join("postgresql.conf"),
format!("port = {}", self.postgres_port),
)
.await?;
};
println!("Starting attachment service database...");
let db_start_args = [
"-w",
"-D",
pg_data_path.as_ref(),
"-l",
pg_log_path.as_ref(),
"start",
];
background_process::start_process(
"attachment_service_db",
&self.env.base_data_dir,
pg_bin_dir.join("pg_ctl").as_std_path(),
db_start_args,
[],
background_process::InitialPidFile::Create(self.postgres_pid_file()),
|| self.pg_isready(&pg_bin_dir),
)
.await?;
// Run migrations on every startup, in case something changed.
let database_url = self.setup_database().await?;
let mut args = vec![
"-l",
&self.listen,
"-p",
self.path.as_ref(),
"--database-url",
&database_url,
]
.into_iter()
.map(|s| s.to_string())
.collect::<Vec<_>>();
if let Some(jwt_token) = &self.jwt_token {
args.push(format!("--jwt-token={jwt_token}"));
}
@@ -235,7 +455,7 @@ impl AttachmentService {
args.push(format!("--public-key={public_key_path}")); args.push(format!("--public-key={public_key_path}"));
} }
let result = background_process::start_process( background_process::start_process(
COMMAND, COMMAND,
&self.env.base_data_dir, &self.env.base_data_dir,
&self.env.attachment_service_bin(), &self.env.attachment_service_bin(),
@@ -252,29 +472,46 @@ impl AttachmentService {
}
},
)
.await?;

Ok(())
}

pub async fn stop(&self, immediate: bool) -> anyhow::Result<()> {
background_process::stop_process(immediate, COMMAND, &self.pid_file())?;

let pg_data_path = self.env.base_data_dir.join("attachment_service_db");
let pg_bin_dir = self.get_pg_bin_dir().await?;

println!("Stopping attachment service database...");
let pg_stop_args = ["-D", &pg_data_path.to_string_lossy(), "stop"];
let stop_status = Command::new(pg_bin_dir.join("pg_ctl"))
.args(pg_stop_args)
.spawn()?
.wait()
.await?;
if !stop_status.success() {
let pg_status_args = ["-D", &pg_data_path.to_string_lossy(), "status"];
let status_exitcode = Command::new(pg_bin_dir.join("pg_ctl"))
.args(pg_status_args)
.spawn()?
.wait()
.await?;

// pg_ctl status returns this exit code if postgres is not running: in this case it is
// fine that stop failed. Otherwise it is an error that stop failed.
const PG_STATUS_NOT_RUNNING: i32 = 3;
if Some(PG_STATUS_NOT_RUNNING) == status_exitcode.code() {
println!("Attachment service database is already stopped");
return Ok(());
} else {
anyhow::bail!("Failed to stop attachment service database: {stop_status}")
}
}

Ok(())
}
/// Simple HTTP request wrapper for calling into attachment service
async fn dispatch<RQ, RS>(
&self,
@@ -356,7 +593,7 @@ impl AttachmentService {
&self,
req: TenantCreateRequest,
) -> anyhow::Result<TenantCreateResponse> {
self.dispatch(Method::POST, "v1/tenant".to_string(), Some(req))
.await
}
@@ -413,7 +650,7 @@ impl AttachmentService {
) -> anyhow::Result<TimelineInfo> {
self.dispatch(
Method::POST,
format!("v1/tenant/{tenant_id}/timeline"),
Some(req),
)
.await

View File

@@ -17,7 +17,7 @@ use std::io::Write;
use std::os::unix::prelude::AsRawFd;
use std::os::unix::process::CommandExt;
use std::path::Path;
use std::process::Command;
use std::time::Duration;
use std::{fs, io, thread};
@@ -60,7 +60,7 @@ pub async fn start_process<F, Fut, AI, A, EI>(
envs: EI,
initial_pid_file: InitialPidFile,
process_status_check: F,
) -> anyhow::Result<()>
where
F: Fn() -> Fut,
Fut: std::future::Future<Output = anyhow::Result<bool>>,
@@ -98,7 +98,7 @@ where
InitialPidFile::Expect(path) => path,
};

let spawned_process = filled_cmd.spawn().with_context(|| {
format!("Could not spawn {process_name}, see console output and log files for details.")
})?;
let pid = spawned_process.id();
@@ -106,12 +106,26 @@ where
i32::try_from(pid)
.with_context(|| format!("Subprocess {process_name} has invalid pid {pid}"))?,
);
// set up a scopeguard to kill & wait for the child in case we panic or bail below
let spawned_process = scopeguard::guard(spawned_process, |mut spawned_process| {
println!("SIGKILL & wait the started process");
(|| {
// TODO: use another signal that can be caught by the child so it can clean up any children it spawned (e.g., walredo).
spawned_process.kill().context("SIGKILL child")?;
spawned_process.wait().context("wait() for child process")?;
anyhow::Ok(())
})()
.with_context(|| format!("scopeguard kill&wait child {process_name:?}"))
.unwrap();
});
for retries in 0..RETRIES {
match process_started(pid, pid_file_to_check, &process_status_check).await {
Ok(true) => {
println!("\n{process_name} started and passed status check, pid: {pid}");
// leak the child process, it'll outlive this neon_local invocation
drop(scopeguard::ScopeGuard::into_inner(spawned_process));
return Ok(());
}
Ok(false) => {
if retries == NOTICE_AFTER_RETRIES {
@@ -126,16 +140,15 @@ where
thread::sleep(Duration::from_millis(RETRY_INTERVAL_MILLIS));
}
Err(e) => {
println!("error starting process {process_name:?}: {e:#}");
return Err(e);
}
}
}
println!();

anyhow::bail!(
"{process_name} did not start+pass status checks within {RETRY_UNTIL_SECS} seconds"
);
}

/// Stops the process, using the pid file given. Returns Ok also if the process is already not running.

View File

@@ -135,7 +135,7 @@ fn main() -> Result<()> {
"tenant" => rt.block_on(handle_tenant(sub_args, &mut env)), "tenant" => rt.block_on(handle_tenant(sub_args, &mut env)),
"timeline" => rt.block_on(handle_timeline(sub_args, &mut env)), "timeline" => rt.block_on(handle_timeline(sub_args, &mut env)),
"start" => rt.block_on(handle_start_all(sub_args, &env)), "start" => rt.block_on(handle_start_all(sub_args, &env)),
"stop" => handle_stop_all(sub_args, &env), "stop" => rt.block_on(handle_stop_all(sub_args, &env)),
"pageserver" => rt.block_on(handle_pageserver(sub_args, &env)), "pageserver" => rt.block_on(handle_pageserver(sub_args, &env)),
"attachment_service" => rt.block_on(handle_attachment_service(sub_args, &env)), "attachment_service" => rt.block_on(handle_attachment_service(sub_args, &env)),
"safekeeper" => rt.block_on(handle_safekeeper(sub_args, &env)), "safekeeper" => rt.block_on(handle_safekeeper(sub_args, &env)),
@@ -1056,8 +1056,9 @@ fn get_pageserver(env: &local_env::LocalEnv, args: &ArgMatches) -> Result<PageSe
async fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
match sub_match.subcommand() {
Some(("start", subcommand_args)) => {
let register = subcommand_args.get_one::<bool>("register").unwrap_or(&true);
if let Err(e) = get_pageserver(env, subcommand_args)?
.start(&pageserver_config_overrides(subcommand_args), *register)
.await
{
eprintln!("pageserver start failed: {e}");
@@ -1086,24 +1087,7 @@ async fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) ->
}

if let Err(e) = pageserver
.start(&pageserver_config_overrides(subcommand_args), false)
.await
{
eprintln!("pageserver start failed: {e}");
@@ -1161,7 +1145,7 @@ async fn handle_attachment_service(
.map(|s| s.as_str())
== Some("immediate");

if let Err(e) = svc.stop(immediate).await {
eprintln!("stop failed: {}", e);
exit(1);
}
@@ -1257,7 +1241,7 @@ async fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) ->
let attachment_service = AttachmentService::from_env(env);
if let Err(e) = attachment_service.start().await {
eprintln!("attachment_service start failed: {:#}", e);
try_stop_all(env, true).await;
exit(1);
}
}
@@ -1265,11 +1249,11 @@ async fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) ->
for ps_conf in &env.pageservers {
let pageserver = PageServerNode::from_env(env, ps_conf);
if let Err(e) = pageserver
.start(&pageserver_config_overrides(sub_match), true)
.await
{
eprintln!("pageserver {} start failed: {:#}", ps_conf.id, e);
try_stop_all(env, true).await;
exit(1);
}
}
@@ -1278,23 +1262,23 @@ async fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) ->
let safekeeper = SafekeeperNode::from_env(env, node);
if let Err(e) = safekeeper.start(vec![]).await {
eprintln!("safekeeper {} start failed: {:#}", safekeeper.id, e);
try_stop_all(env, false).await;
exit(1);
}
}

Ok(())
}
async fn handle_stop_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let immediate =
sub_match.get_one::<String>("stop-mode").map(|s| s.as_str()) == Some("immediate");

try_stop_all(env, immediate).await;

Ok(())
}
async fn try_stop_all(env: &local_env::LocalEnv, immediate: bool) {
// Stop all endpoints
match ComputeControlPlane::load(env.clone()) {
Ok(cplane) => {
@@ -1329,7 +1313,7 @@ fn try_stop_all(env: &local_env::LocalEnv, immediate: bool) {
if env.control_plane_api.is_some() {
let attachment_service = AttachmentService::from_env(env);
if let Err(e) = attachment_service.stop(immediate).await {
eprintln!("attachment service stop failed: {e:#}");
}
}
@@ -1549,7 +1533,11 @@ fn cli() -> Command {
.subcommand(Command::new("status")) .subcommand(Command::new("status"))
.subcommand(Command::new("start") .subcommand(Command::new("start")
.about("Start local pageserver") .about("Start local pageserver")
.arg(pageserver_config_args.clone()) .arg(pageserver_config_args.clone()).arg(Arg::new("register")
.long("register")
.default_value("true").required(false)
.value_parser(value_parser!(bool))
.value_name("register"))
) )
.subcommand(Command::new("stop") .subcommand(Command::new("stop")
.about("Stop local pageserver") .about("Stop local pageserver")

View File

@@ -57,7 +57,7 @@ use crate::local_env::LocalEnv;
use crate::postgresql_conf::PostgresConf;
use compute_api::responses::{ComputeState, ComputeStatus};
use compute_api::spec::{Cluster, ComputeFeature, ComputeMode, ComputeSpec};

// contents of an endpoint.json file
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
@@ -70,6 +70,7 @@ pub struct EndpointConf {
http_port: u16,
pg_version: u32,
skip_pg_catalog_updates: bool,
features: Vec<ComputeFeature>,
}
//
@@ -140,6 +141,7 @@ impl ComputeControlPlane {
// with this we basically test a case of waking up an idle compute, where
// we also skip catalog updates in the cloud.
skip_pg_catalog_updates: true,
features: vec![],
});
ep.create_endpoint_dir()?;
@@ -154,6 +156,7 @@ impl ComputeControlPlane {
pg_port,
pg_version,
skip_pg_catalog_updates: true,
features: vec![],
})?,
)?;
std::fs::write(
@@ -215,6 +218,9 @@ pub struct Endpoint {
// Optimizations
skip_pg_catalog_updates: bool,
// Feature flags
features: Vec<ComputeFeature>,
}
impl Endpoint {
@@ -244,6 +250,7 @@ impl Endpoint {
tenant_id: conf.tenant_id,
pg_version: conf.pg_version,
skip_pg_catalog_updates: conf.skip_pg_catalog_updates,
features: conf.features,
})
}
@@ -431,7 +438,7 @@ impl Endpoint {
}
fn wait_for_compute_ctl_to_exit(&self, send_sigterm: bool) -> Result<()> {
-// TODO use background_process::stop_process instead
// TODO use background_process::stop_process instead: https://github.com/neondatabase/neon/pull/6482
let pidfile_path = self.endpoint_path().join("compute_ctl.pid");
let pid: u32 = std::fs::read_to_string(pidfile_path)?.parse()?;
let pid = nix::unistd::Pid::from_raw(pid as i32);
@@ -519,7 +526,7 @@ impl Endpoint {
skip_pg_catalog_updates: self.skip_pg_catalog_updates,
format_version: 1.0,
operation_uuid: None,
-features: vec![],
features: self.features.clone(),
cluster: Cluster {
cluster_id: None, // project ID: not used
name: None, // project name: not used
@@ -576,9 +583,21 @@ impl Endpoint {
}
let child = cmd.spawn()?;
// set up a scopeguard to kill & wait for the child in case we panic or bail below
let child = scopeguard::guard(child, |mut child| {
println!("SIGKILL & wait the started process");
(|| {
// TODO: use another signal that can be caught by the child so it can clean up any children it spawned
child.kill().context("SIGKILL child")?;
child.wait().context("wait() for child process")?;
anyhow::Ok(())
})()
.with_context(|| format!("scopeguard kill&wait child {child:?}"))
.unwrap();
});
// Write down the pid so we can wait for it when we want to stop
-// TODO use background_process::start_process instead
// TODO use background_process::start_process instead: https://github.com/neondatabase/neon/pull/6482
let pid = child.id();
let pidfile_path = self.endpoint_path().join("compute_ctl.pid");
std::fs::write(pidfile_path, pid.to_string())?;
@@ -627,6 +646,9 @@ impl Endpoint {
std::thread::sleep(ATTEMPT_INTERVAL);
}
// disarm the scopeguard, let the child outlive this function (and neon_local invocation)
drop(scopeguard::ScopeGuard::into_inner(child));
Ok(())
}
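The scopeguard added above is the classic guard-then-disarm idiom: the child is killed and reaped if anything between spawn and success bails or panics, and the guard is disarmed only once startup is known to be good. A minimal self-contained sketch of the pattern (editor's example built on the scopeguard crate, not part of this commit):

use std::process::Command;

fn spawn_checked() -> anyhow::Result<()> {
    let child = Command::new("sleep").arg("5").spawn()?;
    // Kill & reap the child if anything below returns early or panics.
    let child = scopeguard::guard(child, |mut child| {
        let _ = child.kill();
        let _ = child.wait();
    });
    // ... readiness checks that may bail with `?` go here ...
    // Success: disarm the guard so the child outlives this function.
    drop(scopeguard::ScopeGuard::into_inner(child));
    Ok(())
}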


@@ -223,7 +223,11 @@ impl LocalEnv {
}
pub fn attachment_service_bin(&self) -> PathBuf {
-self.neon_distrib_dir.join("attachment_service")
// Irrespective of configuration, attachment service binary is always
// run from the same location as neon_local. This means that for compatibility
// tests that run old pageserver/safekeeper, they still run latest attachment service.
let neon_local_bin_dir = env::current_exe().unwrap().parent().unwrap().to_owned();
neon_local_bin_dir.join("attachment_service")
}
pub fn safekeeper_bin(&self) -> PathBuf {


@@ -11,7 +11,7 @@ use std::io;
use std::io::Write;
use std::num::NonZeroU64;
use std::path::PathBuf;
-use std::process::{Child, Command};
use std::process::Command;
use std::time::Duration;
use anyhow::{bail, Context};
@@ -30,6 +30,7 @@ use utils::{
lsn::Lsn,
};
use crate::attachment_service::{AttachmentService, NodeRegisterRequest};
use crate::local_env::PageServerConf;
use crate::{background_process, local_env::LocalEnv};
@@ -161,8 +162,8 @@ impl PageServerNode {
.expect("non-Unicode path")
}
-pub async fn start(&self, config_overrides: &[&str]) -> anyhow::Result<Child> {
pub async fn start(&self, config_overrides: &[&str], register: bool) -> anyhow::Result<()> {
-self.start_node(config_overrides, false).await
self.start_node(config_overrides, false, register).await
}
fn pageserver_init(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
@@ -207,7 +208,8 @@
&self,
config_overrides: &[&str],
update_config: bool,
-) -> anyhow::Result<Child> {
register: bool,
) -> anyhow::Result<()> {
// TODO: using a thread here because start_process() is not async but we need to call check_status()
let datadir = self.repo_path();
print!(
@@ -244,7 +246,26 @@ impl PageServerNode {
}
},
)
-.await
.await?;
if register {
let attachment_service = AttachmentService::from_env(&self.env);
let (pg_host, pg_port) =
parse_host_port(&self.conf.listen_pg_addr).expect("Unable to parse listen_pg_addr");
let (http_host, http_port) = parse_host_port(&self.conf.listen_http_addr)
.expect("Unable to parse listen_http_addr");
attachment_service
.node_register(NodeRegisterRequest {
node_id: self.conf.id,
listen_pg_addr: pg_host.to_string(),
listen_pg_port: pg_port.unwrap_or(5432),
listen_http_addr: http_host.to_string(),
listen_http_port: http_port.unwrap_or(80),
})
.await?;
}
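// (Editor's note, not part of the commit: parse_host_port is assumed to
// split "127.0.0.1:64000" into ("127.0.0.1", Some(64000)) and to return a
// None port for a bare host, which is why the pageserver defaults of
// 5432 (pg) and 80 (http) are filled in above.)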
Ok(())
}
fn pageserver_basic_args<'a>(


@@ -7,7 +7,6 @@
//! ```
use std::io::Write;
use std::path::PathBuf;
-use std::process::Child;
use std::{io, result};
use anyhow::Context;
@@ -104,7 +103,7 @@ impl SafekeeperNode {
.expect("non-Unicode path")
}
-pub async fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<Child> {
pub async fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<()> {
print!(
"Starting safekeeper at '{}' in '{}'",
self.pg_connection_config.raw_address(),

diesel.toml (new file)

@@ -0,0 +1,9 @@
# For documentation on how to configure this file,
# see https://diesel.rs/guides/configuring-diesel-cli
[print_schema]
file = "control_plane/attachment_service/src/schema.rs"
custom_type_derives = ["diesel::query_builder::QueryId"]
[migrations_directory]
dir = "control_plane/attachment_service/migrations"


@@ -90,6 +90,9 @@ pub enum ComputeFeature {
/// track short-lived connections as user activity.
ActivityMonitorExperimental,
/// Enable running migrations
Migrations,
/// This is a special feature flag that is used to represent unknown feature flags.
/// Basically all unknown to enum flags are represented as this one. See unit test
/// `parse_unknown_features()` for more details.


@@ -1,9 +1,11 @@
use anyhow::{bail, Result};
use byteorder::{ByteOrder, BE};
use postgres_ffi::relfile_utils::{FSM_FORKNUM, VISIBILITYMAP_FORKNUM};
use postgres_ffi::{Oid, TransactionId};
use serde::{Deserialize, Serialize};
-use std::fmt;
use std::{fmt, ops::Range};
-use crate::reltag::{BlockNumber, RelTag};
use crate::reltag::{BlockNumber, RelTag, SlruKind};
/// Key used in the Repository kv-store.
///
@@ -143,12 +145,390 @@ impl Key {
}
}
// Layout of the Key address space
//
// The Key struct, used to address the underlying key-value store, consists of
// 18 bytes, split into six fields. See 'Key' in repository.rs. We need to map
// all the data and metadata keys into those 18 bytes.
//
// Principles for the mapping:
//
// - Things that are often accessed or modified together, should be close to
// each other in the key space. For example, if a relation is extended by one
// block, we create a new key-value pair for the block data, and update the
// relation size entry. Because of that, the RelSize key comes after all the
// RelBlocks of a relation: the RelSize and the last RelBlock are always next
// to each other.
//
// The key space is divided into four major sections, identified by the first
// byte, and they form a hierarchy:
//
// 00 Relation data and metadata
//
// DbDir () -> (dbnode, spcnode)
// Filenodemap
// RelDir -> relnode forknum
// RelBlocks
// RelSize
//
// 01 SLRUs
//
// SlruDir kind
// SlruSegBlocks segno
// SlruSegSize
//
// 02 pg_twophase
//
// 03 misc
// Controlfile
// checkpoint
// pg_version
//
// 04 aux files
//
// Below is a full list of the keyspace allocation:
//
// DbDir:
// 00 00000000 00000000 00000000 00 00000000
//
// Filenodemap:
// 00 SPCNODE DBNODE 00000000 00 00000000
//
// RelDir:
// 00 SPCNODE DBNODE 00000000 00 00000001 (Postgres never uses relfilenode 0)
//
// RelBlock:
// 00 SPCNODE DBNODE RELNODE FORK BLKNUM
//
// RelSize:
// 00 SPCNODE DBNODE RELNODE FORK FFFFFFFF
//
// SlruDir:
// 01 kind 00000000 00000000 00 00000000
//
// SlruSegBlock:
// 01 kind 00000001 SEGNO 00 BLKNUM
//
// SlruSegSize:
// 01 kind 00000001 SEGNO 00 FFFFFFFF
//
// TwoPhaseDir:
// 02 00000000 00000000 00000000 00 00000000
//
// TwoPhaseFile:
// 02 00000000 00000000 00000000 00 XID
//
// ControlFile:
// 03 00000000 00000000 00000000 00 00000000
//
// Checkpoint:
// 03 00000000 00000000 00000000 00 00000001
//
// AuxFiles:
// 03 00000000 00000000 00000000 00 00000002
//
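// (Editor's illustration, not part of the commit: under the layout above,
// block 42 of relation (spcnode=1663, dbnode=16384, relnode=16385, main fork)
// maps to
//   00 0000067F 00004000 00004001 00 0000002A
// and because RelSize uses field6 = FFFFFFFF it sorts right after the last
// block of its relation, assuming Key's field-by-field ordering:
//   let rel = RelTag { spcnode: 1663, dbnode: 16384, relnode: 16385, forknum: 0 };
//   assert!(rel_block_to_key(rel, 42) < rel_size_to_key(rel));
// )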
//-- Section 01: relation data and metadata
pub const DBDIR_KEY: Key = Key {
field1: 0x00,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 0,
};
#[inline(always)]
pub fn dbdir_key_range(spcnode: Oid, dbnode: Oid) -> Range<Key> {
Key {
field1: 0x00,
field2: spcnode,
field3: dbnode,
field4: 0,
field5: 0,
field6: 0,
}..Key {
field1: 0x00,
field2: spcnode,
field3: dbnode,
field4: 0xffffffff,
field5: 0xff,
field6: 0xffffffff,
}
}
#[inline(always)]
pub fn relmap_file_key(spcnode: Oid, dbnode: Oid) -> Key {
Key {
field1: 0x00,
field2: spcnode,
field3: dbnode,
field4: 0,
field5: 0,
field6: 0,
}
}
#[inline(always)]
pub fn rel_dir_to_key(spcnode: Oid, dbnode: Oid) -> Key {
Key {
field1: 0x00,
field2: spcnode,
field3: dbnode,
field4: 0,
field5: 0,
field6: 1,
}
}
#[inline(always)]
pub fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
Key {
field1: 0x00,
field2: rel.spcnode,
field3: rel.dbnode,
field4: rel.relnode,
field5: rel.forknum,
field6: blknum,
}
}
#[inline(always)]
pub fn rel_size_to_key(rel: RelTag) -> Key {
Key {
field1: 0x00,
field2: rel.spcnode,
field3: rel.dbnode,
field4: rel.relnode,
field5: rel.forknum,
field6: 0xffffffff,
}
}
#[inline(always)]
pub fn rel_key_range(rel: RelTag) -> Range<Key> {
Key {
field1: 0x00,
field2: rel.spcnode,
field3: rel.dbnode,
field4: rel.relnode,
field5: rel.forknum,
field6: 0,
}..Key {
field1: 0x00,
field2: rel.spcnode,
field3: rel.dbnode,
field4: rel.relnode,
field5: rel.forknum + 1,
field6: 0,
}
}
//-- Section 02: SLRUs
#[inline(always)]
pub fn slru_dir_to_key(kind: SlruKind) -> Key {
Key {
field1: 0x01,
field2: match kind {
SlruKind::Clog => 0x00,
SlruKind::MultiXactMembers => 0x01,
SlruKind::MultiXactOffsets => 0x02,
},
field3: 0,
field4: 0,
field5: 0,
field6: 0,
}
}
#[inline(always)]
pub fn slru_block_to_key(kind: SlruKind, segno: u32, blknum: BlockNumber) -> Key {
Key {
field1: 0x01,
field2: match kind {
SlruKind::Clog => 0x00,
SlruKind::MultiXactMembers => 0x01,
SlruKind::MultiXactOffsets => 0x02,
},
field3: 1,
field4: segno,
field5: 0,
field6: blknum,
}
}
#[inline(always)]
pub fn slru_segment_size_to_key(kind: SlruKind, segno: u32) -> Key {
Key {
field1: 0x01,
field2: match kind {
SlruKind::Clog => 0x00,
SlruKind::MultiXactMembers => 0x01,
SlruKind::MultiXactOffsets => 0x02,
},
field3: 1,
field4: segno,
field5: 0,
field6: 0xffffffff,
}
}
#[inline(always)]
pub fn slru_segment_key_range(kind: SlruKind, segno: u32) -> Range<Key> {
let field2 = match kind {
SlruKind::Clog => 0x00,
SlruKind::MultiXactMembers => 0x01,
SlruKind::MultiXactOffsets => 0x02,
};
Key {
field1: 0x01,
field2,
field3: 1,
field4: segno,
field5: 0,
field6: 0,
}..Key {
field1: 0x01,
field2,
field3: 1,
field4: segno,
field5: 1,
field6: 0,
}
}
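// (Editor's note, not part of the commit: key_to_slru_block() further down
// is the inverse of slru_block_to_key(); roundtrip sketch:
//   let key = slru_block_to_key(SlruKind::Clog, 7, 3);
//   assert_eq!(key_to_slru_block(key).unwrap(), (SlruKind::Clog, 7, 3));
// )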
//-- Section 03: pg_twophase
pub const TWOPHASEDIR_KEY: Key = Key {
field1: 0x02,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 0,
};
#[inline(always)]
pub fn twophase_file_key(xid: TransactionId) -> Key {
Key {
field1: 0x02,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: xid,
}
}
#[inline(always)]
pub fn twophase_key_range(xid: TransactionId) -> Range<Key> {
let (next_xid, overflowed) = xid.overflowing_add(1);
Key {
field1: 0x02,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: xid,
}..Key {
field1: 0x02,
field2: 0,
field3: 0,
field4: 0,
field5: u8::from(overflowed),
field6: next_xid,
}
}
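// (Editor's note, not part of the commit: the overflowing_add keeps this
// range non-empty even for xid == u32::MAX, where next_xid wraps to 0;
// setting field5 to 1 in that case keeps the end key sorting above the
// start key.)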
//-- Section 03: Control file
pub const CONTROLFILE_KEY: Key = Key {
field1: 0x03,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 0,
};
pub const CHECKPOINT_KEY: Key = Key {
field1: 0x03,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 1,
};
pub const AUX_FILES_KEY: Key = Key {
field1: 0x03,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 2,
};
// Reverse mappings for a few Keys.
// These are needed by WAL redo manager.
// AUX_FILES currently stores only data for logical replication (slots etc), and
// we don't preserve these on a branch because safekeepers can't follow timeline
// switch (and generally it likely should be optional), so ignore these.
#[inline(always)]
pub fn is_inherited_key(key: Key) -> bool {
key != AUX_FILES_KEY
}
#[inline(always)]
pub fn is_rel_fsm_block_key(key: Key) -> bool {
key.field1 == 0x00 && key.field4 != 0 && key.field5 == FSM_FORKNUM && key.field6 != 0xffffffff
}
#[inline(always)]
pub fn is_rel_vm_block_key(key: Key) -> bool {
key.field1 == 0x00
&& key.field4 != 0
&& key.field5 == VISIBILITYMAP_FORKNUM
&& key.field6 != 0xffffffff
}
#[inline(always)]
pub fn key_to_slru_block(key: Key) -> anyhow::Result<(SlruKind, u32, BlockNumber)> {
Ok(match key.field1 {
0x01 => {
let kind = match key.field2 {
0x00 => SlruKind::Clog,
0x01 => SlruKind::MultiXactMembers,
0x02 => SlruKind::MultiXactOffsets,
_ => anyhow::bail!("unrecognized slru kind 0x{:02x}", key.field2),
};
let segno = key.field4;
let blknum = key.field6;
(kind, segno, blknum)
}
_ => anyhow::bail!("unexpected value kind 0x{:02x}", key.field1),
})
}
#[inline(always)]
pub fn is_slru_block_key(key: Key) -> bool {
key.field1 == 0x01 // SLRU-related
&& key.field3 == 0x00000001 // but not SlruDir
&& key.field6 != 0xffffffff // and not SlruSegSize
}
#[inline(always)]
pub fn is_rel_block_key(key: &Key) -> bool {
key.field1 == 0x00 && key.field4 != 0 && key.field6 != 0xffffffff
}
/// Guaranteed to return `Ok()` if [[is_rel_block_key]] returns `true` for `key`.
#[inline(always)]
pub fn key_to_rel_block(key: Key) -> anyhow::Result<(RelTag, BlockNumber)> {
Ok(match key.field1 {
0x00 => (


@@ -104,6 +104,7 @@ pub struct KeySpaceAccum {
accum: Option<Range<Key>>,
ranges: Vec<Range<Key>>,
size: u64,
}
impl KeySpaceAccum {
@@ -111,6 +112,7 @@ impl KeySpaceAccum {
Self {
accum: None,
ranges: Vec::new(),
size: 0,
}
}
@@ -121,6 +123,8 @@ impl KeySpaceAccum {
#[inline(always)]
pub fn add_range(&mut self, range: Range<Key>) {
self.size += key_range_size(&range) as u64;
match self.accum.as_mut() {
Some(accum) => {
if range.start == accum.end {
@@ -146,6 +150,23 @@ impl KeySpaceAccum {
ranges: self.ranges,
}
}
pub fn consume_keyspace(&mut self) -> KeySpace {
if let Some(accum) = self.accum.take() {
self.ranges.push(accum);
}
let mut prev_accum = KeySpaceAccum::new();
std::mem::swap(self, &mut prev_accum);
KeySpace {
ranges: prev_accum.ranges,
}
}
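// (Editor's note, not part of the commit: the swap with a fresh accumulator
// resets both `ranges` and `size`, so unlike to_keyspace(), which consumes
// self, consume_keyspace() leaves an empty, reusable accumulator behind.)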
pub fn size(&self) -> u64 {
self.size
}
}
///
@@ -254,6 +275,30 @@ mod tests {
}
}
#[test]
fn keyspace_consume() {
let ranges = vec![kr(0..10), kr(20..35), kr(40..45)];
let mut accum = KeySpaceAccum::new();
for range in &ranges {
accum.add_range(range.clone());
}
let expected_size: u64 = ranges.iter().map(|r| key_range_size(r) as u64).sum();
assert_eq!(accum.size(), expected_size);
assert_ks_eq(&accum.consume_keyspace(), ranges.clone());
assert_eq!(accum.size(), 0);
assert_ks_eq(&accum.consume_keyspace(), vec![]);
assert_eq!(accum.size(), 0);
for range in &ranges {
accum.add_range(range.clone());
}
assert_ks_eq(&accum.to_keyspace(), ranges);
}
#[test]
fn keyspace_add_range() {
// two separate ranges


@@ -111,7 +111,19 @@ impl RelTag {
/// These files are divided into segments, which are divided into
/// pages of the same BLCKSZ as used for relation files.
///
-#[derive(Debug, Clone, Copy, Hash, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord)]
#[derive(
Debug,
Clone,
Copy,
Hash,
Serialize,
Deserialize,
PartialEq,
Eq,
PartialOrd,
Ord,
strum_macros::EnumIter,
)]
pub enum SlruKind {
Clog,
MultiXactMembers,


@@ -329,8 +329,8 @@ impl CheckPoint {
///
/// Returns 'true' if the XID was updated.
pub fn update_next_xid(&mut self, xid: u32) -> bool {
-// nextXid should nw greater than any XID in WAL, so increment provided XID and check for wraparround.
// nextXid should be greater than any XID in WAL, so increment provided XID and check for wraparound.
-let mut new_xid = std::cmp::max(xid + 1, pg_constants::FIRST_NORMAL_TRANSACTION_ID);
let mut new_xid = std::cmp::max(xid.wrapping_add(1), pg_constants::FIRST_NORMAL_TRANSACTION_ID);
// To reduce number of metadata checkpoints, we forward align XID on XID_CHECKPOINT_INTERVAL.
// XID_CHECKPOINT_INTERVAL should not be larger than BLCKSZ*CLOG_XACTS_PER_BYTE
new_xid =
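The switch to wrapping_add matters exactly at the 32-bit XID horizon; a quick sketch of the edge case (editor's example; FIRST_NORMAL_TRANSACTION_ID is 3 in Postgres):

let xid: u32 = u32::MAX;
assert_eq!(xid.wrapping_add(1), 0); // plain `xid + 1` would panic in a debug build
assert_eq!(std::cmp::max(xid.wrapping_add(1), 3), 3); // lands back in the normal-XID range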


@@ -8,6 +8,7 @@ use std::pin::Pin;
use std::str::FromStr;
use std::sync::Arc;
use std::time::Duration;
use std::time::SystemTime;
use super::REMOTE_STORAGE_PREFIX_SEPARATOR;
use anyhow::Result;
@@ -23,6 +24,7 @@ use futures::stream::Stream;
use futures_util::StreamExt;
use http_types::{StatusCode, Url};
use tokio::time::Instant;
use tokio_util::sync::CancellationToken;
use tracing::debug;
use crate::s3_bucket::RequestKind;
@@ -183,7 +185,6 @@ fn to_download_error(error: azure_core::Error) -> DownloadError {
}
}
-#[async_trait::async_trait]
impl RemoteStorage for AzureBlobStorage {
async fn list(
&self,
@@ -371,6 +372,20 @@ impl RemoteStorage for AzureBlobStorage {
copy_status = status;
}
}
async fn time_travel_recover(
&self,
_prefix: Option<&RemotePath>,
_timestamp: SystemTime,
_done_if_after: SystemTime,
_cancel: CancellationToken,
) -> anyhow::Result<()> {
// TODO use Azure point in time recovery feature for this
// https://learn.microsoft.com/en-us/azure/storage/blobs/point-in-time-restore-overview
Err(anyhow::anyhow!(
"time travel recovery for azure blob storage is not implemented"
))
}
}
pin_project_lite::pin_project! {


@@ -25,6 +25,7 @@ use bytes::Bytes;
use futures::stream::Stream;
use serde::{Deserialize, Serialize};
use tokio::sync::Semaphore;
use tokio_util::sync::CancellationToken;
use toml_edit::Item;
use tracing::info;
@@ -142,7 +143,7 @@ pub struct Listing {
/// Storage (potentially remote) API to manage its state.
/// This storage tries to be unaware of any layered repository context,
/// providing basic CRUD operations for storage files.
-#[async_trait::async_trait]
#[allow(async_fn_in_trait)]
pub trait RemoteStorage: Send + Sync + 'static {
/// Lists all top level subdirectories for a given prefix
/// Note: here we assume that if the prefix is passed it was obtained via remote_object_id
@@ -210,6 +211,15 @@ pub trait RemoteStorage: Send + Sync + 'static {
/// Copy a remote object inside a bucket from one path to another.
async fn copy(&self, from: &RemotePath, to: &RemotePath) -> anyhow::Result<()>;
/// Resets the content of everything with the given prefix to the given state
async fn time_travel_recover(
&self,
prefix: Option<&RemotePath>,
timestamp: SystemTime,
done_if_after: SystemTime,
cancel: CancellationToken,
) -> anyhow::Result<()>;
}
pub type DownloadStream = Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Unpin + Send + Sync>>;
@@ -262,14 +272,15 @@ impl std::error::Error for DownloadError {}
/// Every storage, currently supported.
/// Serves as a simple way to pass around the [`RemoteStorage`] without dealing with generics.
#[derive(Clone)]
-pub enum GenericRemoteStorage {
// Require Clone for `Other` due to https://github.com/rust-lang/rust/issues/26925
pub enum GenericRemoteStorage<Other: Clone = Arc<UnreliableWrapper>> {
LocalFs(LocalFs),
AwsS3(Arc<S3Bucket>),
AzureBlob(Arc<AzureBlobStorage>),
-Unreliable(Arc<UnreliableWrapper>),
Unreliable(Other),
}
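The default type parameter is doing the heavy lifting here: every existing mention of plain `GenericRemoteStorage` keeps meaning `GenericRemoteStorage<Arc<UnreliableWrapper>>`, while `UnreliableWrapper` itself can instantiate the enum with a different payload. A minimal sketch of the pattern (editor's example with hypothetical names, not part of this commit):

struct DefaultBackend;
struct KnownBackend;

enum Storage<Other = DefaultBackend> {
    Known(KnownBackend),
    Other(Other),
}

fn main() {
    // Bare `Storage` still compiles and means `Storage<DefaultBackend>`.
    let _s: Storage = Storage::Known(KnownBackend);
}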
-impl GenericRemoteStorage {
impl<Other: RemoteStorage> GenericRemoteStorage<Arc<Other>> {
pub async fn list(
&self,
prefix: Option<&RemotePath>,
@@ -386,6 +397,33 @@ impl GenericRemoteStorage {
Self::Unreliable(s) => s.copy(from, to).await,
}
}
pub async fn time_travel_recover(
&self,
prefix: Option<&RemotePath>,
timestamp: SystemTime,
done_if_after: SystemTime,
cancel: CancellationToken,
) -> anyhow::Result<()> {
match self {
Self::LocalFs(s) => {
s.time_travel_recover(prefix, timestamp, done_if_after, cancel)
.await
}
Self::AwsS3(s) => {
s.time_travel_recover(prefix, timestamp, done_if_after, cancel)
.await
}
Self::AzureBlob(s) => {
s.time_travel_recover(prefix, timestamp, done_if_after, cancel)
.await
}
Self::Unreliable(s) => {
s.time_travel_recover(prefix, timestamp, done_if_after, cancel)
.await
}
}
}
}
impl GenericRemoteStorage {
@@ -673,6 +711,7 @@ impl ConcurrencyLimiter {
RequestKind::List => &self.read,
RequestKind::Delete => &self.write,
RequestKind::Copy => &self.write,
RequestKind::TimeTravel => &self.write,
}
}


@@ -4,7 +4,7 @@
//! This storage used in tests, but can also be used in cases when a certain persistent
//! volume is mounted to the local FS.
-use std::{borrow::Cow, future::Future, io::ErrorKind, pin::Pin};
use std::{borrow::Cow, future::Future, io::ErrorKind, pin::Pin, time::SystemTime};
use anyhow::{bail, ensure, Context};
use bytes::Bytes;
@@ -14,7 +14,7 @@ use tokio::{
fs,
io::{self, AsyncReadExt, AsyncSeekExt, AsyncWriteExt},
};
-use tokio_util::io::ReaderStream;
use tokio_util::{io::ReaderStream, sync::CancellationToken};
use tracing::*;
use utils::{crashsafe::path_with_suffix_extension, fs_ext::is_directory_empty};
@@ -157,7 +157,6 @@ impl LocalFs {
}
}
-#[async_trait::async_trait]
impl RemoteStorage for LocalFs {
async fn list(
&self,
@@ -423,6 +422,17 @@ impl RemoteStorage for LocalFs {
})?;
Ok(())
}
#[allow(clippy::diverging_sub_expression)]
async fn time_travel_recover(
&self,
_prefix: Option<&RemotePath>,
_timestamp: SystemTime,
_done_if_after: SystemTime,
_cancel: CancellationToken,
) -> anyhow::Result<()> {
unimplemented!()
}
}
fn storage_metadata_path(original_path: &Utf8Path) -> Utf8PathBuf {


@@ -6,12 +6,14 @@
use std::{
borrow::Cow,
collections::HashMap,
pin::Pin,
sync::Arc,
task::{Context, Poll},
time::SystemTime,
};
-use anyhow::Context as _;
use anyhow::{anyhow, Context as _};
use aws_config::{
environment::credentials::EnvironmentVariableCredentialsProvider,
imds::credentials::ImdsCredentialsProvider,
@@ -27,17 +29,19 @@ use aws_sdk_s3::{
config::{AsyncSleep, Builder, IdentityCache, Region, SharedAsyncSleep},
error::SdkError,
operation::get_object::GetObjectError,
-types::{Delete, ObjectIdentifier},
types::{Delete, DeleteMarkerEntry, ObjectIdentifier, ObjectVersion},
Client,
};
use aws_smithy_async::rt::sleep::TokioSleep;
-use aws_smithy_types::body::SdkBody;
use aws_smithy_types::byte_stream::ByteStream;
use aws_smithy_types::{body::SdkBody, DateTime};
use bytes::Bytes;
use futures::stream::Stream;
use hyper::Body;
use scopeguard::ScopeGuard;
use tokio_util::sync::CancellationToken;
use utils::backoff;
use super::StorageMetadata;
use crate::{
@@ -270,6 +274,59 @@ impl S3Bucket {
}
}
}
async fn delete_oids(
&self,
kind: RequestKind,
delete_objects: &[ObjectIdentifier],
) -> anyhow::Result<()> {
for chunk in delete_objects.chunks(MAX_KEYS_PER_DELETE) {
let started_at = start_measuring_requests(kind);
let resp = self
.client
.delete_objects()
.bucket(self.bucket_name.clone())
.delete(
Delete::builder()
.set_objects(Some(chunk.to_vec()))
.build()?,
)
.send()
.await;
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &resp, started_at);
let resp = resp?;
metrics::BUCKET_METRICS
.deleted_objects_total
.inc_by(chunk.len() as u64);
if let Some(errors) = resp.errors {
// Log a bounded number of the errors within the response:
// these requests can carry 1000 keys so logging each one
// would be too verbose, especially as errors may lead us
// to retry repeatedly.
const LOG_UP_TO_N_ERRORS: usize = 10;
for e in errors.iter().take(LOG_UP_TO_N_ERRORS) {
tracing::warn!(
"DeleteObjects key {} failed: {}: {}",
e.key.as_ref().map(Cow::from).unwrap_or("".into()),
e.code.as_ref().map(Cow::from).unwrap_or("".into()),
e.message.as_ref().map(Cow::from).unwrap_or("".into())
);
}
return Err(anyhow::format_err!(
"Failed to delete {} objects",
errors.len()
));
}
}
Ok(())
}
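// (Editor's note, not part of the commit: the chunking above exists because
// S3's DeleteObjects API accepts at most 1000 keys per request, which is
// presumably what MAX_KEYS_PER_DELETE encodes.)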
}
pin_project_lite::pin_project! {
@@ -373,7 +430,6 @@ impl<S: Stream<Item = std::io::Result<Bytes>>> Stream for TimedDownload<S> {
}
}
-#[async_trait::async_trait]
impl RemoteStorage for S3Bucket {
async fn list(
&self,
@@ -569,64 +625,168 @@ impl RemoteStorage for S3Bucket {
delete_objects.push(obj_id);
}
-for chunk in delete_objects.chunks(MAX_KEYS_PER_DELETE) {
-let started_at = start_measuring_requests(kind);
-let resp = self
-.client
-.delete_objects()
-.bucket(self.bucket_name.clone())
-.delete(
-Delete::builder()
-.set_objects(Some(chunk.to_vec()))
-.build()?,
-)
-.send()
-.await;
-let started_at = ScopeGuard::into_inner(started_at);
-metrics::BUCKET_METRICS
-.req_seconds
-.observe_elapsed(kind, &resp, started_at);
-match resp {
-Ok(resp) => {
-metrics::BUCKET_METRICS
-.deleted_objects_total
-.inc_by(chunk.len() as u64);
-if let Some(errors) = resp.errors {
-// Log a bounded number of the errors within the response:
-// these requests can carry 1000 keys so logging each one
-// would be too verbose, especially as errors may lead us
-// to retry repeatedly.
-const LOG_UP_TO_N_ERRORS: usize = 10;
-for e in errors.iter().take(LOG_UP_TO_N_ERRORS) {
-tracing::warn!(
-"DeleteObjects key {} failed: {}: {}",
-e.key.as_ref().map(Cow::from).unwrap_or("".into()),
-e.code.as_ref().map(Cow::from).unwrap_or("".into()),
-e.message.as_ref().map(Cow::from).unwrap_or("".into())
-);
-}
-return Err(anyhow::format_err!(
-"Failed to delete {} objects",
-errors.len()
-));
-}
-}
-Err(e) => {
-return Err(e.into());
-}
-}
-}
-Ok(())
self.delete_oids(kind, &delete_objects).await
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
let paths = std::array::from_ref(path);
self.delete_objects(paths).await
}
async fn time_travel_recover(
&self,
prefix: Option<&RemotePath>,
timestamp: SystemTime,
done_if_after: SystemTime,
cancel: CancellationToken,
) -> anyhow::Result<()> {
let kind = RequestKind::TimeTravel;
let _guard = self.permit(kind).await;
let timestamp = DateTime::from(timestamp);
let done_if_after = DateTime::from(done_if_after);
tracing::trace!("Target time: {timestamp:?}, done_if_after {done_if_after:?}");
// get the passed prefix or if it is not set use prefix_in_bucket value
let prefix = prefix
.map(|p| self.relative_path_to_s3_object(p))
.or_else(|| self.prefix_in_bucket.clone());
let warn_threshold = 3;
let max_retries = 10;
let is_permanent = |_e: &_| false;
let list = backoff::retry(
|| async {
Ok(self
.client
.list_object_versions()
.bucket(self.bucket_name.clone())
.set_prefix(prefix.clone())
.send()
.await?)
},
is_permanent,
warn_threshold,
max_retries,
"listing object versions for time_travel_recover",
backoff::Cancel::new(cancel.clone(), || anyhow!("Cancelled")),
)
.await?;
if list.is_truncated().unwrap_or_default() {
anyhow::bail!("Received truncated ListObjectVersions response for prefix={prefix:?}");
}
let mut versions_deletes = list
.versions()
.iter()
.map(VerOrDelete::Version)
.chain(list.delete_markers().iter().map(VerOrDelete::DeleteMarker))
.collect::<Vec<_>>();
versions_deletes.sort_by_key(|vd| (vd.key(), vd.last_modified()));
let mut vds_for_key = HashMap::<_, Vec<_>>::new();
for vd in versions_deletes {
let last_modified = vd.last_modified();
let version_id = vd.version_id();
let key = vd.key();
let (Some(last_modified), Some(version_id), Some(key)) =
(last_modified, version_id, key)
else {
anyhow::bail!(
"One (or more) of last_modified, key, and id is None. \
Is versioning enabled in the bucket? last_modified={:?} key={:?} version_id={:?}",
last_modified, key, version_id,
);
};
if version_id == "null" {
anyhow::bail!("Received ListVersions response for key={key} with version_id='null', \
indicating either disabled versioning, or legacy objects with null version id values");
}
tracing::trace!(
"Parsing version key={key} version_id={version_id} is_delete={}",
matches!(vd, VerOrDelete::DeleteMarker(_))
);
vds_for_key
.entry(key)
.or_default()
.push((vd, last_modified, version_id));
}
for (key, versions) in vds_for_key {
let (last_vd, last_last_modified, _version_id) = versions.last().unwrap();
if last_last_modified > &&done_if_after {
tracing::trace!("Key {key} has version later than done_if_after, skipping");
continue;
}
// the version we want to restore to.
let version_to_restore_to =
match versions.binary_search_by_key(&timestamp, |tpl| *tpl.1) {
Ok(v) => v,
Err(e) => e,
};
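// (Editor's note, not part of the commit: binary_search_by_key returns
// Ok(i) when versions[i] was modified exactly at `timestamp` and Err(i)
// with the insertion point otherwise; either way versions[..i] all fall at
// or before the target time, so versions[i - 1] is the state restored below.)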
if version_to_restore_to == versions.len() {
tracing::trace!("Key {key} has no changes since timestamp, skipping");
continue;
}
let mut do_delete = false;
if version_to_restore_to == 0 {
// All versions more recent, so the key didn't exist at the specified time point.
tracing::trace!(
"All {} versions more recent for {key}, deleting",
versions.len()
);
do_delete = true;
} else {
match &versions[version_to_restore_to - 1] {
(VerOrDelete::Version(_), _last_modified, version_id) => {
tracing::trace!("Copying old version {version_id} for {key}...");
// Restore the state to the last version by copying
let source_id =
format!("{}/{key}?versionId={version_id}", self.bucket_name);
backoff::retry(
|| async {
Ok(self
.client
.copy_object()
.bucket(self.bucket_name.clone())
.key(key)
.copy_source(&source_id)
.send()
.await?)
},
is_permanent,
warn_threshold,
max_retries,
"listing object versions for time_travel_recover",
backoff::Cancel::new(cancel.clone(), || anyhow!("Cancelled")),
)
.await?;
}
(VerOrDelete::DeleteMarker(_), _last_modified, _version_id) => {
do_delete = true;
}
}
};
if do_delete {
if matches!(last_vd, VerOrDelete::DeleteMarker(_)) {
// Key has since been deleted (but there was some history), no need to do anything
tracing::trace!("Key {key} already deleted, skipping.");
} else {
tracing::trace!("Deleting {key}...");
let oid = ObjectIdentifier::builder().key(key.to_owned()).build()?;
self.delete_oids(kind, &[oid]).await?;
}
}
}
Ok(())
}
}
/// On drop (cancellation) count towards [`metrics::BucketMetrics::cancelled_waits`].
@@ -651,6 +811,32 @@ fn start_measuring_requests(
})
}
enum VerOrDelete<'a> {
Version(&'a ObjectVersion),
DeleteMarker(&'a DeleteMarkerEntry),
}
impl<'a> VerOrDelete<'a> {
fn last_modified(&self) -> Option<&'a DateTime> {
match self {
VerOrDelete::Version(v) => v.last_modified(),
VerOrDelete::DeleteMarker(v) => v.last_modified(),
}
}
fn version_id(&self) -> Option<&'a str> {
match self {
VerOrDelete::Version(v) => v.version_id(),
VerOrDelete::DeleteMarker(v) => v.version_id(),
}
}
fn key(&self) -> Option<&'a str> {
match self {
VerOrDelete::Version(v) => v.key(),
VerOrDelete::DeleteMarker(v) => v.key(),
}
}
}
#[cfg(test)]
mod tests {
use camino::Utf8Path;


@@ -12,6 +12,7 @@ pub(crate) enum RequestKind {
Delete = 2,
List = 3,
Copy = 4,
TimeTravel = 5,
}
use RequestKind::*;
@@ -24,6 +25,7 @@ impl RequestKind {
Delete => "delete_object",
List => "list_objects",
Copy => "copy_object",
TimeTravel => "time_travel_recover",
}
}
const fn as_index(&self) -> usize {
@@ -31,7 +33,7 @@ impl RequestKind {
}
}
-pub(super) struct RequestTyped<C>([C; 5]);
pub(super) struct RequestTyped<C>([C; 6]);
impl<C> RequestTyped<C> {
pub(super) fn get(&self, kind: RequestKind) -> &C {
@@ -40,8 +42,8 @@ impl<C> RequestTyped<C> {
fn build_with(mut f: impl FnMut(RequestKind) -> C) -> Self {
use RequestKind::*;
-let mut it = [Get, Put, Delete, List, Copy].into_iter();
let mut it = [Get, Put, Delete, List, Copy, TimeTravel].into_iter();
-let arr = std::array::from_fn::<C, 5, _>(|index| {
let arr = std::array::from_fn::<C, 6, _>(|index| {
let next = it.next().unwrap();
assert_eq!(index, next.as_index());
f(next)


@@ -3,16 +3,19 @@
//! testing purposes.
use bytes::Bytes;
use futures::stream::Stream;
-use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::SystemTime;
use std::{collections::hash_map::Entry, sync::Arc};
use tokio_util::sync::CancellationToken;
use crate::{
-Download, DownloadError, Listing, ListingMode, RemotePath, RemoteStorage, StorageMetadata,
Download, DownloadError, GenericRemoteStorage, Listing, ListingMode, RemotePath, RemoteStorage,
StorageMetadata,
};
pub struct UnreliableWrapper {
-inner: crate::GenericRemoteStorage,
inner: GenericRemoteStorage<Arc<VoidStorage>>,
// This many attempts of each operation will fail, then we let it succeed.
attempts_to_fail: u64,
@@ -29,11 +32,21 @@ enum RemoteOp {
Download(RemotePath),
Delete(RemotePath),
DeleteObjects(Vec<RemotePath>),
TimeTravelRecover(Option<RemotePath>),
}
impl UnreliableWrapper {
pub fn new(inner: crate::GenericRemoteStorage, attempts_to_fail: u64) -> Self {
assert!(attempts_to_fail > 0);
let inner = match inner {
GenericRemoteStorage::AwsS3(s) => GenericRemoteStorage::AwsS3(s),
GenericRemoteStorage::AzureBlob(s) => GenericRemoteStorage::AzureBlob(s),
GenericRemoteStorage::LocalFs(s) => GenericRemoteStorage::LocalFs(s),
// We could also make this a no-op, as in, extract the inner of the passed generic remote storage
GenericRemoteStorage::Unreliable(_s) => {
panic!("Can't wrap unreliable wrapper unreliably")
}
};
UnreliableWrapper {
inner,
attempts_to_fail,
@@ -84,7 +97,9 @@ impl UnreliableWrapper {
}
}
-#[async_trait::async_trait]
// We never construct this, so the type is not important, just has to not be UnreliableWrapper and impl RemoteStorage.
type VoidStorage = crate::LocalFs;
impl RemoteStorage for UnreliableWrapper {
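// (Editor's note, not part of the commit: `inner` must name some concrete
// type for GenericRemoteStorage's `Other` slot even though the Unreliable
// variant is never constructed here; reusing LocalFs avoids defining a
// dedicated dummy backend, and the match in new() above already rules out
// nesting one UnreliableWrapper inside another.)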
async fn list_prefixes(
&self,
@@ -169,4 +184,17 @@ impl RemoteStorage for UnreliableWrapper {
self.attempt(RemoteOp::Upload(to.clone()))?;
self.inner.copy_object(from, to).await
}
async fn time_travel_recover(
&self,
prefix: Option<&RemotePath>,
timestamp: SystemTime,
done_if_after: SystemTime,
cancel: CancellationToken,
) -> anyhow::Result<()> {
self.attempt(RemoteOp::TimeTravelRecover(prefix.map(|p| p.to_owned())))?;
self.inner
.time_travel_recover(prefix, timestamp, done_if_after, cancel)
.await
}
}


@@ -1,15 +1,21 @@
-use std::collections::HashSet;
use std::env;
use std::fmt::{Debug, Display};
use std::num::NonZeroUsize;
use std::ops::ControlFlow;
use std::sync::Arc;
-use std::time::UNIX_EPOCH;
use std::time::{Duration, UNIX_EPOCH};
use std::{collections::HashSet, time::SystemTime};
use crate::common::{download_to_vec, upload_stream};
use anyhow::Context;
use camino::Utf8Path;
use futures_util::Future;
use remote_storage::{
GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind, S3Config,
};
use test_context::test_context;
use test_context::AsyncTestContext;
use tokio_util::sync::CancellationToken;
use tracing::info;
mod common;
@@ -18,11 +24,160 @@ mod common;
mod tests_s3;
use common::{cleanup, ensure_logging_ready, upload_remote_data, upload_simple_remote_data};
use utils::backoff;
const ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME: &str = "ENABLE_REAL_S3_REMOTE_STORAGE";
const BASE_PREFIX: &str = "test";
#[test_context(MaybeEnabledStorage)]
#[tokio::test]
async fn s3_time_travel_recovery_works(ctx: &mut MaybeEnabledStorage) -> anyhow::Result<()> {
let ctx = match ctx {
MaybeEnabledStorage::Enabled(ctx) => ctx,
MaybeEnabledStorage::Disabled => return Ok(()),
};
// Our test depends on discrepancies in the clock between S3 and the environment the tests
// run in. Therefore, wait a little bit before and after. The alternative would be
// to take the time from S3 response headers.
const WAIT_TIME: Duration = Duration::from_millis(3_000);
async fn retry<T, O, F, E>(op: O) -> Result<T, E>
where
E: Display + Debug + 'static,
O: FnMut() -> F,
F: Future<Output = Result<T, E>>,
{
let warn_threshold = 3;
let max_retries = 10;
backoff::retry(
op,
|_e| false,
warn_threshold,
max_retries,
"test retry",
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await
}
async fn time_point() -> SystemTime {
tokio::time::sleep(WAIT_TIME).await;
let ret = SystemTime::now();
tokio::time::sleep(WAIT_TIME).await;
ret
}
async fn list_files(client: &Arc<GenericRemoteStorage>) -> anyhow::Result<HashSet<RemotePath>> {
Ok(retry(|| client.list_files(None))
.await
.context("list root files failure")?
.into_iter()
.collect::<HashSet<_>>())
}
let path1 = RemotePath::new(Utf8Path::new(format!("{}/path1", ctx.base_prefix).as_str()))
.with_context(|| "RemotePath conversion")?;
let path2 = RemotePath::new(Utf8Path::new(format!("{}/path2", ctx.base_prefix).as_str()))
.with_context(|| "RemotePath conversion")?;
let path3 = RemotePath::new(Utf8Path::new(format!("{}/path3", ctx.base_prefix).as_str()))
.with_context(|| "RemotePath conversion")?;
retry(|| {
let (data, len) = upload_stream("remote blob data1".as_bytes().into());
ctx.client.upload(data, len, &path1, None)
})
.await?;
let t0_files = list_files(&ctx.client).await?;
let t0 = time_point().await;
println!("at t0: {t0_files:?}");
let old_data = "remote blob data2";
retry(|| {
let (data, len) = upload_stream(old_data.as_bytes().into());
ctx.client.upload(data, len, &path2, None)
})
.await?;
let t1_files = list_files(&ctx.client).await?;
let t1 = time_point().await;
println!("at t1: {t1_files:?}");
// A little check to ensure that our clock is not too far off from the S3 clock
{
let dl = retry(|| ctx.client.download(&path2)).await?;
let last_modified = dl.last_modified.unwrap();
let half_wt = WAIT_TIME.mul_f32(0.5);
let t0_hwt = t0 + half_wt;
let t1_hwt = t1 - half_wt;
if !(t0_hwt..=t1_hwt).contains(&last_modified) {
panic!("last_modified={last_modified:?} is not between t0_hwt={t0_hwt:?} and t1_hwt={t1_hwt:?}. \
This likely means a large clock discrepancy between S3 and the local clock.");
}
}
retry(|| {
let (data, len) = upload_stream("remote blob data3".as_bytes().into());
ctx.client.upload(data, len, &path3, None)
})
.await?;
let new_data = "new remote blob data2";
retry(|| {
let (data, len) = upload_stream(new_data.as_bytes().into());
ctx.client.upload(data, len, &path2, None)
})
.await?;
retry(|| ctx.client.delete(&path1)).await?;
let t2_files = list_files(&ctx.client).await?;
let t2 = time_point().await;
println!("at t2: {t2_files:?}");
// No changes after recovery to t2 (no-op)
let t_final = time_point().await;
ctx.client
.time_travel_recover(None, t2, t_final, CancellationToken::new())
.await?;
let t2_files_recovered = list_files(&ctx.client).await?;
println!("after recovery to t2: {t2_files_recovered:?}");
assert_eq!(t2_files, t2_files_recovered);
let path2_recovered_t2 = download_to_vec(ctx.client.download(&path2).await?).await?;
assert_eq!(path2_recovered_t2, new_data.as_bytes());
// after recovery to t1: path1 is back, path2 has the old content
let t_final = time_point().await;
ctx.client
.time_travel_recover(None, t1, t_final, CancellationToken::new())
.await?;
let t1_files_recovered = list_files(&ctx.client).await?;
println!("after recovery to t1: {t1_files_recovered:?}");
assert_eq!(t1_files, t1_files_recovered);
let path2_recovered_t1 = download_to_vec(ctx.client.download(&path2).await?).await?;
assert_eq!(path2_recovered_t1, old_data.as_bytes());
// after recovery to t0: everything is gone except for path1
let t_final = time_point().await;
ctx.client
.time_travel_recover(None, t0, t_final, CancellationToken::new())
.await?;
let t0_files_recovered = list_files(&ctx.client).await?;
println!("after recovery to t0: {t0_files_recovered:?}");
assert_eq!(t0_files, t0_files_recovered);
// cleanup
let paths = &[path1, path2, path3];
retry(|| ctx.client.delete_objects(paths)).await?;
Ok(())
}
struct EnabledS3 {
client: Arc<GenericRemoteStorage>,
base_prefix: &'static str,


@@ -131,7 +131,9 @@ pub fn api_error_handler(api_error: ApiError) -> Response<Body> {
ApiError::ResourceUnavailable(_) => info!("Error processing HTTP request: {api_error:#}"),
ApiError::NotFound(_) => info!("Error processing HTTP request: {api_error:#}"),
ApiError::InternalServerError(_) => error!("Error processing HTTP request: {api_error:?}"),
-_ => error!("Error processing HTTP request: {api_error:#}"),
ApiError::ShuttingDown => info!("Shut down while processing HTTP request"),
ApiError::Timeout(_) => info!("Timeout while processing HTTP request: {api_error:#}"),
_ => info!("Error processing HTTP request: {api_error:#}"),
}
api_error.into_response()


@@ -5,10 +5,10 @@ use std::os::unix::io::RawFd;
pub fn set_nonblock(fd: RawFd) -> Result<(), std::io::Error> {
let bits = fcntl(fd, F_GETFL)?;
-// Safety: If F_GETFL returns some unknown bits, they should be valid
// If F_GETFL returns some unknown bits, they should be valid
// for passing back to F_SETFL, too. If we left them out, the F_SETFL
// would effectively clear them, which is not what we want.
-let mut flags = unsafe { OFlag::from_bits_unchecked(bits) };
let mut flags = OFlag::from_bits_retain(bits);
flags |= OFlag::O_NONBLOCK;
fcntl(fd, F_SETFL(flags))?;


@@ -1,7 +1,6 @@
use std::{
io,
net::{TcpListener, ToSocketAddrs},
-os::unix::prelude::AsRawFd,
};
use nix::sys::socket::{setsockopt, sockopt::ReuseAddr};
@@ -10,7 +9,7 @@ use nix::sys::socket::{setsockopt, sockopt::ReuseAddr};
pub fn bind<A: ToSocketAddrs>(addr: A) -> io::Result<TcpListener> {
let listener = TcpListener::bind(addr)?;
-setsockopt(listener.as_raw_fd(), ReuseAddr, &true)?;
setsockopt(&listener, ReuseAddr, &true)?;
Ok(listener)
}


@@ -61,6 +61,7 @@ sync_wrapper.workspace = true
tokio-tar.workspace = true
thiserror.workspace = true
tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util", "time"] }
tokio-epoll-uring.workspace = true
tokio-io-timeout.workspace = true
tokio-postgres.workspace = true
tokio-stream.workspace = true


@@ -18,7 +18,7 @@ use pageserver::tenant::block_io::FileBlockReader;
use pageserver::tenant::disk_btree::{DiskBtreeReader, VisitDirection};
use pageserver::tenant::storage_layer::delta_layer::{Summary, DELTA_KEY_SIZE};
use pageserver::tenant::storage_layer::range_overlaps;
-use pageserver::virtual_file::VirtualFile;
use pageserver::virtual_file::{self, VirtualFile};
use utils::{bin_ser::BeSer, lsn::Lsn};
@@ -142,7 +142,7 @@ pub(crate) async fn main(cmd: &AnalyzeLayerMapCmd) -> Result<()> {
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
// Initialize virtual_file (file desriptor cache) and page cache which are needed to access layer persistent B-Tree.
-pageserver::virtual_file::init(10);
pageserver::virtual_file::init(10, virtual_file::IoEngineKind::StdFs);
pageserver::page_cache::init(100);
let mut total_delta_layers = 0usize;


@@ -59,7 +59,7 @@ pub(crate) enum LayerCmd {
async fn read_delta_file(path: impl AsRef<Path>, ctx: &RequestContext) -> Result<()> {
let path = Utf8Path::from_path(path.as_ref()).expect("non-Unicode path");
-virtual_file::init(10);
virtual_file::init(10, virtual_file::IoEngineKind::StdFs);
page_cache::init(100);
let file = FileBlockReader::new(VirtualFile::open(path).await?);
let summary_blk = file.read_blk(0, ctx).await?;
@@ -187,7 +187,7 @@ pub(crate) async fn main(cmd: &LayerCmd) -> Result<()> {
new_tenant_id,
new_timeline_id,
} => {
-pageserver::virtual_file::init(10);
pageserver::virtual_file::init(10, virtual_file::IoEngineKind::StdFs);
pageserver::page_cache::init(100);
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);


@@ -123,7 +123,7 @@ fn read_pg_control_file(control_file_path: &Utf8Path) -> anyhow::Result<()> {
async fn print_layerfile(path: &Utf8Path) -> anyhow::Result<()> {
// Basic initialization of things that don't change after startup
-virtual_file::init(10);
virtual_file::init(10, virtual_file::IoEngineKind::StdFs);
page_cache::init(100);
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
dump_layerfile_from_path(path, true, &ctx).await


@@ -423,8 +423,8 @@ async fn client(
tokio::select! {
res = do_requests => { res },
_ = cancel.cancelled() => {
-client.shutdown().await;
-return;
// fallthrough to shutdown
}
}
client.shutdown().await;
}


@@ -11,8 +11,9 @@
//! from data stored in object storage.
//!
use anyhow::{anyhow, bail, ensure, Context};
-use bytes::{BufMut, BytesMut};
use bytes::{BufMut, Bytes, BytesMut};
use fail::fail_point;
use pageserver_api::key::{key_to_slru_block, Key};
use postgres_ffi::pg_constants;
use std::fmt::Write as FmtWrite;
use std::time::SystemTime;
@@ -133,6 +134,87 @@ where
ctx: &'a RequestContext,
}
/// A sink that accepts SLRU blocks ordered by key and forwards
/// full segments to the archive.
struct SlruSegmentsBuilder<'a, 'b, W>
where
W: AsyncWrite + Send + Sync + Unpin,
{
ar: &'a mut Builder<&'b mut W>,
buf: Vec<u8>,
current_segment: Option<(SlruKind, u32)>,
}
impl<'a, 'b, W> SlruSegmentsBuilder<'a, 'b, W>
where
W: AsyncWrite + Send + Sync + Unpin,
{
fn new(ar: &'a mut Builder<&'b mut W>) -> Self {
Self {
ar,
buf: Vec::new(),
current_segment: None,
}
}
async fn add_block(&mut self, key: &Key, block: Bytes) -> anyhow::Result<()> {
let (kind, segno, _) = key_to_slru_block(*key)?;
match kind {
SlruKind::Clog => {
ensure!(block.len() == BLCKSZ as usize || block.len() == BLCKSZ as usize + 8);
}
SlruKind::MultiXactMembers | SlruKind::MultiXactOffsets => {
ensure!(block.len() == BLCKSZ as usize);
}
}
let segment = (kind, segno);
match self.current_segment {
None => {
self.current_segment = Some(segment);
self.buf
.extend_from_slice(block.slice(..BLCKSZ as usize).as_ref());
}
Some(current_seg) if current_seg == segment => {
self.buf
.extend_from_slice(block.slice(..BLCKSZ as usize).as_ref());
}
Some(_) => {
self.flush().await?;
self.current_segment = Some(segment);
self.buf
.extend_from_slice(block.slice(..BLCKSZ as usize).as_ref());
}
}
Ok(())
}
async fn flush(&mut self) -> anyhow::Result<()> {
let nblocks = self.buf.len() / BLCKSZ as usize;
let (kind, segno) = self.current_segment.take().unwrap();
let segname = format!("{}/{:>04X}", kind.to_str(), segno);
let header = new_tar_header(&segname, self.buf.len() as u64)?;
self.ar.append(&header, self.buf.as_slice()).await?;
trace!("Added to basebackup slru {} relsize {}", segname, nblocks);
self.buf.clear();
Ok(())
}
async fn finish(mut self) -> anyhow::Result<()> {
if self.current_segment.is_none() || self.buf.is_empty() {
return Ok(());
}
self.flush().await
}
}
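The builder above relies on `get_vectored` returning blocks in key order, so a segment boundary is detected simply by comparing the previous `(kind, segno)` pair. A stripped-down sketch of that group-and-flush pattern, with toy types standing in for the pageserver's (nothing here is the real API):

```rust
// Toy version of the SlruSegmentsBuilder batching logic: blocks arrive sorted
// by (segment, block), so a segment is complete as soon as a block for a
// different segment shows up.
struct SegmentsBuilder {
    buf: Vec<u8>,
    current_segment: Option<u32>,
    finished: Vec<(u32, usize)>, // (segno, segment byte length)
}

impl SegmentsBuilder {
    fn new() -> Self {
        Self { buf: Vec::new(), current_segment: None, finished: Vec::new() }
    }

    fn add_block(&mut self, segno: u32, block: &[u8]) {
        if self.current_segment.is_some() && self.current_segment != Some(segno) {
            self.flush(); // previous segment is complete
        }
        self.current_segment = Some(segno);
        self.buf.extend_from_slice(block);
    }

    fn flush(&mut self) {
        if let Some(segno) = self.current_segment.take() {
            self.finished.push((segno, self.buf.len()));
            self.buf.clear();
        }
    }

    fn finish(mut self) -> Vec<(u32, usize)> {
        self.flush(); // emit the trailing segment, if any
        self.finished
    }
}

fn main() {
    let mut b = SegmentsBuilder::new();
    b.add_block(0, &[0u8; 8192]);
    b.add_block(0, &[0u8; 8192]);
    b.add_block(1, &[0u8; 8192]); // segment change: segment 0 is flushed
    assert_eq!(b.finish(), vec![(0, 16384), (1, 8192)]);
}
```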
impl<'a, W> Basebackup<'a, W>
where
    W: AsyncWrite + Send + Sync + Unpin,
@@ -168,20 +250,27 @@ where
        }

        // Gather non-relational files from object storage pages.
-       for kind in [
-           SlruKind::Clog,
-           SlruKind::MultiXactOffsets,
-           SlruKind::MultiXactMembers,
-       ] {
-           for segno in self
-               .timeline
-               .list_slru_segments(kind, Version::Lsn(self.lsn), self.ctx)
-               .await?
-           {
-               self.add_slru_segment(kind, segno).await?;
-           }
-       }
+       let slru_partitions = self
+           .timeline
+           .get_slru_keyspace(Version::Lsn(self.lsn), self.ctx)
+           .await?
+           .partition(Timeline::MAX_GET_VECTORED_KEYS * BLCKSZ as u64);
+
+       let mut slru_builder = SlruSegmentsBuilder::new(&mut self.ar);
+
+       for part in slru_partitions.parts {
+           let blocks = self
+               .timeline
+               .get_vectored(&part.ranges, self.lsn, self.ctx)
+               .await?;
+
+           for (key, block) in blocks {
+               slru_builder.add_block(&key, block?).await?;
+           }
+       }
+       slru_builder.finish().await?;

        let mut min_restart_lsn: Lsn = Lsn::MAX;
        // Create tablespace directories
        for ((spcnode, dbnode), has_relmap_file) in
@@ -305,39 +394,6 @@ where
        Ok(())
    }
//
// Generate SLRU segment files from repository.
//
async fn add_slru_segment(&mut self, slru: SlruKind, segno: u32) -> anyhow::Result<()> {
let nblocks = self
.timeline
.get_slru_segment_size(slru, segno, Version::Lsn(self.lsn), self.ctx)
.await?;
let mut slru_buf: Vec<u8> = Vec::with_capacity(nblocks as usize * BLCKSZ as usize);
for blknum in 0..nblocks {
let img = self
.timeline
.get_slru_page_at_lsn(slru, segno, blknum, self.lsn, self.ctx)
.await?;
if slru == SlruKind::Clog {
ensure!(img.len() == BLCKSZ as usize || img.len() == BLCKSZ as usize + 8);
} else {
ensure!(img.len() == BLCKSZ as usize);
}
slru_buf.extend_from_slice(&img[..BLCKSZ as usize]);
}
let segname = format!("{}/{:>04X}", slru.to_str(), segno);
let header = new_tar_header(&segname, slru_buf.len() as u64)?;
self.ar.append(&header, slru_buf.as_slice()).await?;
trace!("Added to basebackup slru {} relsize {}", segname, nblocks);
Ok(())
}
    //
    // Include database/tablespace directories.
    //


@@ -130,7 +130,7 @@ fn main() -> anyhow::Result<()> {
    let scenario = failpoint_support::init();

    // Basic initialization of things that don't change after startup
-   virtual_file::init(conf.max_file_descriptors);
+   virtual_file::init(conf.max_file_descriptors, conf.virtual_file_io_engine);
    page_cache::init(conf.page_cache_size);

    start_pageserver(launch_ts, conf).context("Failed to start pageserver")?;


@@ -36,6 +36,7 @@ use crate::tenant::config::TenantConfOpt;
use crate::tenant::{
    TENANTS_SEGMENT_NAME, TENANT_DELETED_MARKER_FILE_NAME, TIMELINES_SEGMENT_NAME,
};
+use crate::virtual_file;
use crate::{
    IGNORED_TENANT_FILE_NAME, METADATA_FILE_NAME, TENANT_CONFIG_NAME, TENANT_HEATMAP_BASENAME,
    TENANT_LOCATION_CONFIG_NAME, TIMELINE_DELETE_MARK_SUFFIX, TIMELINE_UNINIT_MARK_SUFFIX,
@@ -43,6 +44,8 @@ use crate::{
use self::defaults::DEFAULT_CONCURRENT_TENANT_WARMUP;
+use self::defaults::DEFAULT_VIRTUAL_FILE_IO_ENGINE;

pub mod defaults {
    use crate::tenant::config::defaults::*;
    use const_format::formatcp;
@@ -79,6 +82,8 @@ pub mod defaults {
    pub const DEFAULT_INGEST_BATCH_SIZE: u64 = 100;
+   pub const DEFAULT_VIRTUAL_FILE_IO_ENGINE: &str = "std-fs";

    ///
    /// Default built-in configuration file.
    ///
@@ -114,6 +119,8 @@ pub mod defaults {
#ingest_batch_size = {DEFAULT_INGEST_BATCH_SIZE}
+#virtual_file_io_engine = '{DEFAULT_VIRTUAL_FILE_IO_ENGINE}'

[tenant_config]
#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
#checkpoint_timeout = {DEFAULT_CHECKPOINT_TIMEOUT}
@@ -247,6 +254,8 @@ pub struct PageServerConf {
    /// Maximum number of WAL records to be ingested and committed at the same time
    pub ingest_batch_size: u64,
+   pub virtual_file_io_engine: virtual_file::IoEngineKind,
}

/// We do not want to store this in a PageServerConf because the latter may be logged
@@ -331,6 +340,8 @@ struct PageServerConfigBuilder {
    secondary_download_concurrency: BuilderValue<usize>,
    ingest_batch_size: BuilderValue<u64>,
+   virtual_file_io_engine: BuilderValue<virtual_file::IoEngineKind>,
}

impl Default for PageServerConfigBuilder {
@@ -406,6 +417,8 @@ impl Default for PageServerConfigBuilder {
            secondary_download_concurrency: Set(DEFAULT_SECONDARY_DOWNLOAD_CONCURRENCY),
            ingest_batch_size: Set(DEFAULT_INGEST_BATCH_SIZE),
+           virtual_file_io_engine: Set(DEFAULT_VIRTUAL_FILE_IO_ENGINE.parse().unwrap()),
        }
    }
}
@@ -562,6 +575,10 @@ impl PageServerConfigBuilder {
        self.ingest_batch_size = BuilderValue::Set(ingest_batch_size)
    }

+   pub fn virtual_file_io_engine(&mut self, value: virtual_file::IoEngineKind) {
+       self.virtual_file_io_engine = BuilderValue::Set(value);
+   }

    pub fn build(self) -> anyhow::Result<PageServerConf> {
        let concurrent_tenant_warmup = self
            .concurrent_tenant_warmup
@@ -669,6 +686,9 @@ impl PageServerConfigBuilder {
            ingest_batch_size: self
                .ingest_batch_size
                .ok_or(anyhow!("missing ingest_batch_size"))?,
+           virtual_file_io_engine: self
+               .virtual_file_io_engine
+               .ok_or(anyhow!("missing virtual_file_io_engine"))?,
        })
    }
}
@@ -920,6 +940,9 @@ impl PageServerConf {
                builder.secondary_download_concurrency(parse_toml_u64(key, item)? as usize)
            },
            "ingest_batch_size" => builder.ingest_batch_size(parse_toml_u64(key, item)?),
+           "virtual_file_io_engine" => {
+               builder.virtual_file_io_engine(parse_toml_from_str("virtual_file_io_engine", item)?)
+           }
            _ => bail!("unrecognized pageserver option '{key}'"),
        }
    }
@@ -993,6 +1016,7 @@ impl PageServerConf {
            heatmap_upload_concurrency: defaults::DEFAULT_HEATMAP_UPLOAD_CONCURRENCY,
            secondary_download_concurrency: defaults::DEFAULT_SECONDARY_DOWNLOAD_CONCURRENCY,
            ingest_batch_size: defaults::DEFAULT_INGEST_BATCH_SIZE,
+           virtual_file_io_engine: DEFAULT_VIRTUAL_FILE_IO_ENGINE.parse().unwrap(),
        }
    }
}
@@ -1225,6 +1249,7 @@ background_task_maximum_delay = '334 s'
            heatmap_upload_concurrency: defaults::DEFAULT_HEATMAP_UPLOAD_CONCURRENCY,
            secondary_download_concurrency: defaults::DEFAULT_SECONDARY_DOWNLOAD_CONCURRENCY,
            ingest_batch_size: defaults::DEFAULT_INGEST_BATCH_SIZE,
+           virtual_file_io_engine: DEFAULT_VIRTUAL_FILE_IO_ENGINE.parse().unwrap(),
        },
        "Correct defaults should be used when no config values are provided"
    );
@@ -1288,6 +1313,7 @@ background_task_maximum_delay = '334 s'
            heatmap_upload_concurrency: defaults::DEFAULT_HEATMAP_UPLOAD_CONCURRENCY,
            secondary_download_concurrency: defaults::DEFAULT_SECONDARY_DOWNLOAD_CONCURRENCY,
            ingest_batch_size: 100,
+           virtual_file_io_engine: DEFAULT_VIRTUAL_FILE_IO_ENGINE.parse().unwrap(),
        },
        "Should be able to parse all basic config values correctly"
    );


@@ -877,6 +877,56 @@ paths:
              schema:
                $ref: "#/components/schemas/ServiceUnavailableError"
  /v1/tenant/{tenant_id}/{timeline_id}/preserve_initdb_archive:
    parameters:
      - name: tenant_id
        in: path
        required: true
        schema:
          type: string
      - name: timeline_id
        in: path
        required: true
        schema:
          type: string
    post:
      description: |
        Marks the initdb archive for preservation upon deletion of the timeline or tenant.
        This is meant to be part of the disaster recovery process.
      responses:
        "202":
          description: Tenant scheduled to load successfully
        "404":
          description: No tenant or timeline found for the specified ids
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/Error"
        "401":
          description: Unauthorized Error
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/UnauthorizedError"
        "403":
          description: Forbidden Error
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/ForbiddenError"
        "500":
          description: Generic operation error
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/Error"
        "503":
          description: Temporarily unavailable, please retry.
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/ServiceUnavailableError"
  /v1/tenant/{tenant_id}/synthetic_size:
    parameters:


@@ -187,6 +187,7 @@ impl From<TenantSlotUpsertError> for ApiError {
        match e {
            InternalError(e) => ApiError::InternalServerError(anyhow::anyhow!("{e}")),
            MapState(e) => e.into(),
+           ShuttingDown(_) => ApiError::ShuttingDown,
        }
    }
}
@@ -495,6 +496,10 @@ async fn timeline_create_handler(
                .map_err(ApiError::InternalServerError)?;
            json_response(StatusCode::CREATED, timeline_info)
        }
+       Err(_) if tenant.cancel.is_cancelled() => {
+           // In case we get some ugly error type during shutdown, cast it into a clean 503.
+           json_response(StatusCode::SERVICE_UNAVAILABLE, HttpErrorBody::from_msg("Tenant shutting down".to_string()))
+       }
        Err(tenant::CreateTimelineError::Conflict | tenant::CreateTimelineError::AlreadyCreating) => {
            json_response(StatusCode::CONFLICT, ())
        }
@@ -561,6 +566,43 @@ async fn timeline_list_handler(
    json_response(StatusCode::OK, response_data)
}
async fn timeline_preserve_initdb_handler(
request: Request<Body>,
_cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
// Part of the process for disaster recovery from safekeeper-stored WAL:
// If we don't recover into a new timeline but want to keep the timeline ID,
// then the initdb archive is deleted. This endpoint copies it to a different
// location where timeline recreation can find it.
async {
let tenant = mgr::get_tenant(tenant_shard_id, true)?;
let timeline = tenant
.get_timeline(timeline_id, false)
.map_err(|e| ApiError::NotFound(e.into()))?;
timeline
.preserve_initdb_archive()
.await
.context("preserving initdb archive")
.map_err(ApiError::InternalServerError)?;
Ok::<_, ApiError>(())
}
.instrument(info_span!("timeline_preserve_initdb_archive",
tenant_id = %tenant_shard_id.tenant_id,
shard_id = %tenant_shard_id.shard_slug(),
%timeline_id))
.await?;
json_response(StatusCode::OK, ())
}
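The handler wraps its fallible body in an `async` block and attaches a request-scoped span via `Instrument`, so every log line inside carries the tenant/timeline ids. A minimal sketch of that pattern (the span fields and helper names here are illustrative, not the pageserver's):

```rust
use tracing::{info_span, Instrument};

async fn preserve_initdb(timeline_id: u64) -> Result<(), String> {
    // Run the fallible body as an async block, instrumented with a span that
    // tags all events inside it with the timeline id.
    async {
        tracing::info!("preserving initdb archive");
        Ok::<_, String>(())
    }
    .instrument(info_span!("timeline_preserve_initdb_archive", %timeline_id))
    .await
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    preserve_initdb(42).await.unwrap();
}
```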
async fn timeline_detail_handler(
    request: Request<Body>,
    _cancel: CancellationToken,
@@ -1220,19 +1262,9 @@ async fn tenant_create_handler(
    };

    // We created the tenant. Existing API semantics are that the tenant
    // is Active when this function returns.
-   if let res @ Err(_) = new_tenant
-       .wait_to_become_active(ACTIVE_TENANT_TIMEOUT)
-       .await
-   {
-       // This shouldn't happen because we just created the tenant directory
-       // in upsert_location, and there aren't any remote timelines
-       // to load, so, nothing can really fail during load.
-       // Don't do cleanup because we don't know how we got here.
-       // The tenant will likely be in `Broken` state and subsequent
-       // calls will fail.
-       res.context("created tenant failed to become active")
-           .map_err(ApiError::InternalServerError)?;
-   }
+   new_tenant
+       .wait_to_become_active(ACTIVE_TENANT_TIMEOUT)
+       .await?;

    json_response(
        StatusCode::CREATED,
@@ -1943,6 +1975,10 @@ pub fn make_router(
        .post("/v1/tenant/:tenant_id/ignore", |r| {
            api_handler(r, tenant_ignore_handler)
        })
+       .post(
+           "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/preserve_initdb_archive",
+           |r| api_handler(r, timeline_preserve_initdb_handler),
+       )
        .get("/v1/tenant/:tenant_shard_id/timeline/:timeline_id", |r| {
            api_handler(r, timeline_detail_handler)
        })


@@ -1,3 +1,4 @@
+#![recursion_limit = "300"]
#![deny(clippy::undocumented_unsafe_blocks)]

mod auth;


@@ -150,6 +150,43 @@ pub(crate) static MATERIALIZED_PAGE_CACHE_HIT: Lazy<IntCounter> = Lazy::new(|| {
        .expect("failed to define a metric")
});
pub(crate) struct GetVectoredLatency {
map: EnumMap<TaskKind, Option<Histogram>>,
}
impl GetVectoredLatency {
// Only these task types perform vectored gets. Filter all other tasks out to reduce total
// cardinality of the metric.
const TRACKED_TASK_KINDS: [TaskKind; 2] = [TaskKind::Compaction, TaskKind::PageRequestHandler];
pub(crate) fn for_task_kind(&self, task_kind: TaskKind) -> Option<&Histogram> {
self.map[task_kind].as_ref()
}
}
pub(crate) static GET_VECTORED_LATENCY: Lazy<GetVectoredLatency> = Lazy::new(|| {
let inner = register_histogram_vec!(
"pageserver_get_vectored_seconds",
"Time spent in get_vectored",
&["task_kind"],
CRITICAL_OP_BUCKETS.into(),
)
.expect("failed to define a metric");
GetVectoredLatency {
map: EnumMap::from_array(std::array::from_fn(|task_kind_idx| {
let task_kind = <TaskKind as enum_map::Enum>::from_usize(task_kind_idx);
if GetVectoredLatency::TRACKED_TASK_KINDS.contains(&task_kind) {
let task_kind = task_kind.into();
Some(inner.with_label_values(&[task_kind]))
} else {
None
}
})),
}
});
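The point of `TRACKED_TASK_KINDS` is cardinality control: only task kinds that actually issue vectored gets get a pre-registered histogram, everything else maps to `None`. A dependency-free sketch of that lookup shape (the real code uses `enum_map` and prometheus histograms):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TaskKind {
    Compaction,
    PageRequestHandler,
    Eviction,
}

// Stand-in for a prometheus Histogram handle.
struct Histogram;

struct GetVectoredLatency {
    // One optional handle per task kind; only tracked kinds are Some.
    map: Vec<(TaskKind, Option<Histogram>)>,
}

impl GetVectoredLatency {
    const TRACKED_TASK_KINDS: [TaskKind; 2] =
        [TaskKind::Compaction, TaskKind::PageRequestHandler];

    fn new() -> Self {
        let all = [TaskKind::Compaction, TaskKind::PageRequestHandler, TaskKind::Eviction];
        let map = all
            .iter()
            .map(|&k| {
                let h = Self::TRACKED_TASK_KINDS.contains(&k).then_some(Histogram);
                (k, h)
            })
            .collect();
        Self { map }
    }

    // Untracked kinds simply observe nothing, keeping metric cardinality low.
    fn for_task_kind(&self, kind: TaskKind) -> Option<&Histogram> {
        self.map.iter().find(|(k, _)| *k == kind).and_then(|(_, h)| h.as_ref())
    }
}

fn main() {
    let m = GetVectoredLatency::new();
    assert!(m.for_task_kind(TaskKind::Compaction).is_some());
    assert!(m.for_task_kind(TaskKind::Eviction).is_none());
}
```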
pub(crate) struct PageCacheMetricsForTaskKind {
    pub read_accesses_materialized_page: IntCounter,
    pub read_accesses_immutable: IntCounter,
@@ -932,6 +969,7 @@ pub(crate) static STORAGE_IO_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
        .expect("failed to define a metric")
});

+#[cfg(not(test))]
pub(crate) mod virtual_file_descriptor_cache {
    use super::*;
@@ -951,6 +989,20 @@ pub(crate) mod virtual_file_descriptor_cache {
    // ```
}
#[cfg(not(test))]
pub(crate) mod virtual_file_io_engine {
use super::*;
pub(crate) static KIND: Lazy<UIntGaugeVec> = Lazy::new(|| {
register_uint_gauge_vec!(
"pageserver_virtual_file_io_engine_kind",
"The configured io engine for VirtualFile",
&["kind"],
)
.unwrap()
});
}
#[derive(Debug)]
struct GlobalAndPerTimelineHistogram {
    global: Histogram,


@@ -61,7 +61,7 @@ use crate::context::{DownloadBehavior, RequestContext};
use crate::import_datadir::import_wal_from_tar;
use crate::metrics;
use crate::metrics::LIVE_CONNECTIONS_COUNT;
-use crate::pgdatadir_mapping::{rel_block_to_key, Version};
+use crate::pgdatadir_mapping::Version;
use crate::task_mgr;
use crate::task_mgr::TaskKind;
use crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id;
@@ -75,6 +75,7 @@ use crate::tenant::PageReconstructError;
use crate::tenant::Timeline;
use crate::trace::Tracer;
+use pageserver_api::key::rel_block_to_key;
use postgres_ffi::pg_constants::DEFAULTTABLESPACE_OID;
use postgres_ffi::BLCKSZ;
@@ -321,8 +322,8 @@ enum PageStreamError {
    Shutdown,

    /// Something went wrong reading a page: this likely indicates a pageserver bug
-   #[error("Read error: {0}")]
-   Read(PageReconstructError),
+   #[error("Read error")]
+   Read(#[source] PageReconstructError),

    /// Ran out of time waiting for an LSN
    #[error("LSN timeout: {0}")]
@@ -331,11 +332,11 @@ enum PageStreamError {
    /// The entity required to serve the request (tenant or timeline) is not found,
    /// or is not found in a suitable state to serve a request.
    #[error("Not found: {0}")]
-   NotFound(std::borrow::Cow<'static, str>),
+   NotFound(Cow<'static, str>),

    /// Request asked for something that doesn't make sense, like an invalid LSN
    #[error("Bad request: {0}")]
-   BadRequest(std::borrow::Cow<'static, str>),
+   BadRequest(Cow<'static, str>),
}
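Moving the cause from the `Display` string (`"Read error: {0}"`) into a `#[source]` field changes where the detail lives: `to_string()` stays short, while error-reporting code walks the `source()` chain. A self-contained sketch, with a toy stand-in for the `report_compact_sources` helper used later in this diff:

```rust
// Sketch of why `#[error("Read error")]` + `#[source]` differs from
// `#[error("Read error: {0}")]`: the cause moves out of Display and into the
// std::error::Error::source() chain, so reporting code can walk and join it.
use std::error::Error;

#[derive(Debug, thiserror::Error)]
#[error("page reconstruction failed")]
struct PageReconstructError;

#[derive(Debug, thiserror::Error)]
enum PageStreamError {
    #[error("Read error")]
    Read(#[source] PageReconstructError),
}

// Minimal stand-in for a "report with sources" helper.
fn report_compact_sources(mut e: &dyn Error) -> String {
    let mut s = e.to_string();
    while let Some(src) = e.source() {
        s.push_str(": ");
        s.push_str(&src.to_string());
        e = src;
    }
    s
}

fn main() {
    let err = PageStreamError::Read(PageReconstructError);
    // Display alone no longer duplicates the cause...
    assert_eq!(err.to_string(), "Read error");
    // ...but walking sources recovers the full chain.
    assert_eq!(
        report_compact_sources(&err),
        "Read error: page reconstruction failed"
    );
}
```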
impl From<PageReconstructError> for PageStreamError {
@@ -386,12 +387,18 @@ impl PageServerHandler {
    /// Future that completes when we need to shut down the connection.
    ///
-   /// Reasons for need to shut down are:
-   /// - any of the timelines we hold GateGuards for in `shard_timelines` is cancelled
-   /// - task_mgr requests shutdown of the connection
+   /// We currently need to shut down when any of the following happens:
+   /// 1. any of the timelines we hold GateGuards for in `shard_timelines` is cancelled
+   /// 2. task_mgr requests shutdown of the connection
    ///
-   /// The need to check for `task_mgr` cancellation arises mainly from `handle_pagerequests`
-   /// where, at first, `shard_timelines` is empty, see <https://github.com/neondatabase/neon/pull/6388>
+   /// NB on (1): the connection's lifecycle is not actually tied to any of the
+   /// `shard_timelines`s' lifecycles. But it's _necessary_ in the current
+   /// implementation to be responsive to timeline cancellation because
+   /// the connection holds their `GateGuards` open (stored in `shard_timelines`).
+   /// We currently do the easy thing and terminate the connection if any of the
+   /// shard_timelines gets cancelled. But really, we could spend more effort
+   /// and simply remove the cancelled timeline from the `shard_timelines`, thereby
+   /// dropping the guard.
    ///
    /// NB: keep in sync with [`Self::is_connection_cancelled`]
    async fn await_connection_cancelled(&self) {
@@ -404,16 +411,17 @@ impl PageServerHandler {
        // immutable &self). So it's fine to evaluate shard_timelines after the sleep, we don't risk
        // missing any inserts to the map.

-       let mut futs = self
-           .shard_timelines
-           .values()
-           .map(|ht| ht.timeline.cancel.cancelled())
-           .collect::<FuturesUnordered<_>>();
-
-       tokio::select! {
-           _ = task_mgr::shutdown_watcher() => { },
-           _ = futs.next() => {}
-       }
+       let mut cancellation_sources = Vec::with_capacity(1 + self.shard_timelines.len());
+       use futures::future::Either;
+       cancellation_sources.push(Either::Left(task_mgr::shutdown_watcher()));
+       cancellation_sources.extend(
+           self.shard_timelines
+               .values()
+               .map(|ht| Either::Right(ht.timeline.cancel.cancelled())),
+       );
+       FuturesUnordered::from_iter(cancellation_sources)
+           .next()
+           .await;
    }

    /// Checking variant of [`Self::await_connection_cancelled`].
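`shutdown_watcher()` and `cancelled()` return futures of different concrete types, so they cannot share one `FuturesUnordered` directly; wrapping them in `futures::future::Either` unifies the element type. A runnable sketch of just that trick (timers stand in for the real cancellation sources):

```rust
use futures::future::Either;
use futures::stream::{FuturesUnordered, StreamExt};
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Two futures with the same Output but different concrete types, unified
    // into one stream element type via Either.
    let mut sources = FuturesUnordered::new();
    sources.push(Either::Left(tokio::time::sleep(Duration::from_millis(50))));
    sources.push(Either::Right(async {
        tokio::time::sleep(Duration::from_millis(10)).await;
    }));

    // Completes as soon as the first source fires, like await_connection_cancelled.
    sources.next().await;
    println!("first cancellation source fired");
}
```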
@@ -659,7 +667,10 @@ impl PageServerHandler {
                // print the all details to the log with {:#}, but for the client the
                // error message is enough. Do not log if shutting down, as the anyhow::Error
                // here includes cancellation which is not an error.
-               span.in_scope(|| error!("error reading relation or page version: {:#}", e));
+               let full = utils::error::report_compact_sources(&e);
+               span.in_scope(|| {
+                   error!("error reading relation or page version: {full:#}")
+               });
                PagestreamBeMessage::Error(PagestreamErrorResponse {
                    message: e.to_string(),
                })


@@ -13,7 +13,12 @@ use crate::repository::*;
use crate::walrecord::NeonWalRecord;
use anyhow::{ensure, Context};
use bytes::{Buf, Bytes};
-use pageserver_api::key::is_rel_block_key;
+use pageserver_api::key::{
+    dbdir_key_range, is_rel_block_key, is_slru_block_key, rel_block_to_key, rel_dir_to_key,
+    rel_key_range, rel_size_to_key, relmap_file_key, slru_block_to_key, slru_dir_to_key,
+    slru_segment_key_range, slru_segment_size_to_key, twophase_file_key, twophase_key_range,
+    AUX_FILES_KEY, CHECKPOINT_KEY, CONTROLFILE_KEY, DBDIR_KEY, TWOPHASEDIR_KEY,
+};
use pageserver_api::reltag::{BlockNumber, RelTag, SlruKind};
use postgres_ffi::relfile_utils::{FSM_FORKNUM, VISIBILITYMAP_FORKNUM};
use postgres_ffi::BLCKSZ;
@@ -22,6 +27,7 @@ use serde::{Deserialize, Serialize};
use std::collections::{hash_map, HashMap, HashSet};
use std::ops::ControlFlow;
use std::ops::Range;
+use strum::IntoEnumIterator;
use tokio_util::sync::CancellationToken;
use tracing::{debug, trace, warn};
use utils::bin_ser::DeserializeError;
@@ -528,6 +534,33 @@ impl Timeline {
        Ok(Default::default())
    }
pub(crate) async fn get_slru_keyspace(
&self,
version: Version<'_>,
ctx: &RequestContext,
) -> Result<KeySpace, PageReconstructError> {
let mut accum = KeySpaceAccum::new();
for kind in SlruKind::iter() {
let mut segments: Vec<u32> = self
.list_slru_segments(kind, version, ctx)
.await?
.into_iter()
.collect();
segments.sort_unstable();
for seg in segments {
let block_count = self.get_slru_segment_size(kind, seg, version, ctx).await?;
accum.add_range(
slru_block_to_key(kind, seg, 0)..slru_block_to_key(kind, seg, block_count),
);
}
}
Ok(accum.to_keyspace())
}
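`KeySpaceAccum` collects sorted, possibly adjacent key ranges which are later split into `MAX_GET_VECTORED_KEYS`-sized partitions; this is why the segments are sorted before being added. A toy version over plain `u64` keys, under the assumption that the real accumulator merges touching ranges the same way:

```rust
use std::ops::Range;

// Toy accumulator: ranges must be added in ascending key order, and touching
// ranges are merged, mirroring how contiguous SLRU blocks coalesce.
struct KeySpaceAccum {
    ranges: Vec<Range<u64>>,
}

impl KeySpaceAccum {
    fn new() -> Self {
        Self { ranges: Vec::new() }
    }

    fn add_range(&mut self, r: Range<u64>) {
        match self.ranges.last_mut() {
            Some(last) if last.end == r.start => last.end = r.end, // merge
            _ => self.ranges.push(r),
        }
    }
}

fn main() {
    let mut accum = KeySpaceAccum::new();
    // Segment 0 has 3 blocks, segment 1 has 2 blocks right after it.
    accum.add_range(0..3);
    accum.add_range(3..5); // touches the previous range: merged
    accum.add_range(100..102); // a later segment: separate range
    assert_eq!(accum.ranges, vec![0..5, 100..102]);
}
```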
    /// Get a list of SLRU segments
    pub(crate) async fn list_slru_segments(
        &self,
@@ -1535,366 +1568,6 @@ struct SlruSegmentDirectory {
static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; BLCKSZ as usize]);
// Layout of the Key address space
//
// The Key struct, used to address the underlying key-value store, consists of
// 18 bytes, split into six fields. See 'Key' in repository.rs. We need to map
// all the data and metadata keys into those 18 bytes.
//
// Principles for the mapping:
//
// - Things that are often accessed or modified together, should be close to
// each other in the key space. For example, if a relation is extended by one
// block, we create a new key-value pair for the block data, and update the
// relation size entry. Because of that, the RelSize key comes after all the
// RelBlocks of a relation: the RelSize and the last RelBlock are always next
// to each other.
//
// The key space is divided into four major sections, identified by the first
// byte, and the form a hierarchy:
//
// 00 Relation data and metadata
//
// DbDir () -> (dbnode, spcnode)
// Filenodemap
// RelDir -> relnode forknum
// RelBlocks
// RelSize
//
// 01 SLRUs
//
// SlruDir kind
// SlruSegBlocks segno
// SlruSegSize
//
// 02 pg_twophase
//
// 03 misc
// Controlfile
// checkpoint
// pg_version
//
// 04 aux files
//
// Below is a full list of the keyspace allocation:
//
// DbDir:
// 00 00000000 00000000 00000000 00 00000000
//
// Filenodemap:
// 00 SPCNODE DBNODE 00000000 00 00000000
//
// RelDir:
// 00 SPCNODE DBNODE 00000000 00 00000001 (Postgres never uses relfilenode 0)
//
// RelBlock:
// 00 SPCNODE DBNODE RELNODE FORK BLKNUM
//
// RelSize:
// 00 SPCNODE DBNODE RELNODE FORK FFFFFFFF
//
// SlruDir:
// 01 kind 00000000 00000000 00 00000000
//
// SlruSegBlock:
// 01 kind 00000001 SEGNO 00 BLKNUM
//
// SlruSegSize:
// 01 kind 00000001 SEGNO 00 FFFFFFFF
//
// TwoPhaseDir:
// 02 00000000 00000000 00000000 00 00000000
//
// TwoPhaseFile:
// 02 00000000 00000000 00000000 00 XID
//
// ControlFile:
// 03 00000000 00000000 00000000 00 00000000
//
// Checkpoint:
// 03 00000000 00000000 00000000 00 00000001
//
// AuxFiles:
// 03 00000000 00000000 00000000 00 00000002
//
//-- Section 01: relation data and metadata
const DBDIR_KEY: Key = Key {
field1: 0x00,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 0,
};
fn dbdir_key_range(spcnode: Oid, dbnode: Oid) -> Range<Key> {
Key {
field1: 0x00,
field2: spcnode,
field3: dbnode,
field4: 0,
field5: 0,
field6: 0,
}..Key {
field1: 0x00,
field2: spcnode,
field3: dbnode,
field4: 0xffffffff,
field5: 0xff,
field6: 0xffffffff,
}
}
fn relmap_file_key(spcnode: Oid, dbnode: Oid) -> Key {
Key {
field1: 0x00,
field2: spcnode,
field3: dbnode,
field4: 0,
field5: 0,
field6: 0,
}
}
fn rel_dir_to_key(spcnode: Oid, dbnode: Oid) -> Key {
Key {
field1: 0x00,
field2: spcnode,
field3: dbnode,
field4: 0,
field5: 0,
field6: 1,
}
}
pub(crate) fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
Key {
field1: 0x00,
field2: rel.spcnode,
field3: rel.dbnode,
field4: rel.relnode,
field5: rel.forknum,
field6: blknum,
}
}
fn rel_size_to_key(rel: RelTag) -> Key {
Key {
field1: 0x00,
field2: rel.spcnode,
field3: rel.dbnode,
field4: rel.relnode,
field5: rel.forknum,
field6: 0xffffffff,
}
}
fn rel_key_range(rel: RelTag) -> Range<Key> {
Key {
field1: 0x00,
field2: rel.spcnode,
field3: rel.dbnode,
field4: rel.relnode,
field5: rel.forknum,
field6: 0,
}..Key {
field1: 0x00,
field2: rel.spcnode,
field3: rel.dbnode,
field4: rel.relnode,
field5: rel.forknum + 1,
field6: 0,
}
}
//-- Section 02: SLRUs
fn slru_dir_to_key(kind: SlruKind) -> Key {
Key {
field1: 0x01,
field2: match kind {
SlruKind::Clog => 0x00,
SlruKind::MultiXactMembers => 0x01,
SlruKind::MultiXactOffsets => 0x02,
},
field3: 0,
field4: 0,
field5: 0,
field6: 0,
}
}
fn slru_block_to_key(kind: SlruKind, segno: u32, blknum: BlockNumber) -> Key {
Key {
field1: 0x01,
field2: match kind {
SlruKind::Clog => 0x00,
SlruKind::MultiXactMembers => 0x01,
SlruKind::MultiXactOffsets => 0x02,
},
field3: 1,
field4: segno,
field5: 0,
field6: blknum,
}
}
fn slru_segment_size_to_key(kind: SlruKind, segno: u32) -> Key {
Key {
field1: 0x01,
field2: match kind {
SlruKind::Clog => 0x00,
SlruKind::MultiXactMembers => 0x01,
SlruKind::MultiXactOffsets => 0x02,
},
field3: 1,
field4: segno,
field5: 0,
field6: 0xffffffff,
}
}
fn slru_segment_key_range(kind: SlruKind, segno: u32) -> Range<Key> {
let field2 = match kind {
SlruKind::Clog => 0x00,
SlruKind::MultiXactMembers => 0x01,
SlruKind::MultiXactOffsets => 0x02,
};
Key {
field1: 0x01,
field2,
field3: 1,
field4: segno,
field5: 0,
field6: 0,
}..Key {
field1: 0x01,
field2,
field3: 1,
field4: segno,
field5: 1,
field6: 0,
}
}
//-- Section 03: pg_twophase
const TWOPHASEDIR_KEY: Key = Key {
field1: 0x02,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 0,
};
fn twophase_file_key(xid: TransactionId) -> Key {
Key {
field1: 0x02,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: xid,
}
}
fn twophase_key_range(xid: TransactionId) -> Range<Key> {
let (next_xid, overflowed) = xid.overflowing_add(1);
Key {
field1: 0x02,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: xid,
}..Key {
field1: 0x02,
field2: 0,
field3: 0,
field4: 0,
field5: u8::from(overflowed),
field6: next_xid,
}
}
//-- Section 03: Control file
const CONTROLFILE_KEY: Key = Key {
field1: 0x03,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 0,
};
pub const CHECKPOINT_KEY: Key = Key {
field1: 0x03,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 1,
};
const AUX_FILES_KEY: Key = Key {
field1: 0x03,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 2,
};
// Reverse mappings for a few Keys.
// These are needed by WAL redo manager.
// AUX_FILES currently stores only data for logical replication (slots etc), and
// we don't preserve these on a branch because safekeepers can't follow timeline
// switch (and generally it likely should be optional), so ignore these.
pub fn is_inherited_key(key: Key) -> bool {
key != AUX_FILES_KEY
}
pub fn is_rel_fsm_block_key(key: Key) -> bool {
key.field1 == 0x00 && key.field4 != 0 && key.field5 == FSM_FORKNUM && key.field6 != 0xffffffff
}
pub fn is_rel_vm_block_key(key: Key) -> bool {
key.field1 == 0x00
&& key.field4 != 0
&& key.field5 == VISIBILITYMAP_FORKNUM
&& key.field6 != 0xffffffff
}
pub fn key_to_slru_block(key: Key) -> anyhow::Result<(SlruKind, u32, BlockNumber)> {
Ok(match key.field1 {
0x01 => {
let kind = match key.field2 {
0x00 => SlruKind::Clog,
0x01 => SlruKind::MultiXactMembers,
0x02 => SlruKind::MultiXactOffsets,
_ => anyhow::bail!("unrecognized slru kind 0x{:02x}", key.field2),
};
let segno = key.field4;
let blknum = key.field6;
(kind, segno, blknum)
}
_ => anyhow::bail!("unexpected value kind 0x{:02x}", key.field1),
})
}
fn is_slru_block_key(key: Key) -> bool {
key.field1 == 0x01 // SLRU-related
&& key.field3 == 0x00000001 // but not SlruDir
&& key.field6 != 0xffffffff // and not SlruSegSize
}
#[allow(clippy::bool_assert_comparison)]
#[cfg(test)]
mod tests {


@@ -91,7 +91,6 @@ use std::fs;
use std::fs::File;
use std::io;
use std::ops::Bound::Included;
-use std::process::Stdio;
use std::sync::atomic::AtomicU64;
use std::sync::atomic::Ordering;
use std::sync::Arc;
@@ -628,9 +627,15 @@ impl Tenant {
            deletion_queue_client,
        ));

+       // The attach task will carry a GateGuard, so that shutdown() reliably waits for it to drop out if
+       // we shut down while attaching.
+       let Ok(attach_gate_guard) = tenant.gate.enter() else {
+           // We just created the Tenant: nothing else can have shut it down yet
+           unreachable!();
+       };
+
        // Do all the hard work in the background
        let tenant_clone = Arc::clone(&tenant);
        let ctx = ctx.detached_child(TaskKind::Attach, DownloadBehavior::Warn);
        task_mgr::spawn(
            &tokio::runtime::Handle::current(),
@@ -640,6 +645,8 @@ impl Tenant {
            "attach tenant",
            false,
            async move {
+               let _gate_guard = attach_gate_guard;
+
                // Is this tenant being spawned as part of process startup?
                let starting_up = init_order.is_some();
                scopeguard::defer! {
@@ -716,6 +723,10 @@ impl Tenant {
                    // stayed in Activating for such a long time that shutdown found it in
                    // that state.
                    tracing::info!(state=%tenant_clone.current_state(), "Tenant shut down before activation");
+                   // Make the tenant broken so that set_stopping will not hang waiting for it to leave
+                   // the Attaching state. This is an over-reaction (nothing really broke, the tenant is
+                   // just shutting down), but ensures progress.
+                   make_broken(&tenant_clone, anyhow::anyhow!("Shut down while Attaching"));
                    return Ok(());
                },
            )
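The `GateGuard` held by the attach task is what lets `Tenant::shutdown` wait for in-flight background work. A rough stand-in using a `tokio::sync::Semaphore` (the pageserver's `Gate` is its own type; this only mirrors the idea):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

const MAX_TASKS: u32 = 16;

#[tokio::main]
async fn main() {
    let gate = Arc::new(Semaphore::new(MAX_TASKS as usize));

    // Like tenant.gate.enter(): take a guard before spawning the attach task.
    let guard = gate.clone().acquire_owned().await.unwrap();
    let task = tokio::spawn(async move {
        let _gate_guard = guard; // held for the task's entire lifetime
        tokio::time::sleep(Duration::from_millis(10)).await;
    });

    // Like shutdown(): acquiring every permit only succeeds once all guards
    // have been dropped, i.e. all gated tasks have finished.
    let _all = gate.acquire_many(MAX_TASKS).await.unwrap();
    task.await.unwrap();
    println!("all gated work drained");
}
```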
@@ -810,7 +821,7 @@ impl Tenant {
                        SpawnMode::Create => None,
                        SpawnMode::Normal => {Some(TENANT.attach.start_timer())}
                    };
-                   match tenant_clone.attach(preload, &ctx).await {
+                   match tenant_clone.attach(preload, mode, &ctx).await {
                        Ok(()) => {
                            info!("attach finished, activating");
                            if let Some(t)= attach_timer {t.observe_duration();}
@@ -897,15 +908,20 @@ impl Tenant {
    async fn attach(
        self: &Arc<Tenant>,
        preload: Option<TenantPreload>,
+       mode: SpawnMode,
        ctx: &RequestContext,
    ) -> anyhow::Result<()> {
        span::debug_assert_current_span_has_tenant_id();

        failpoint_support::sleep_millis_async!("before-attaching-tenant");

-       let preload = match preload {
-           Some(p) => p,
-           None => {
+       let preload = match (preload, mode) {
+           (Some(p), _) => p,
+           (None, SpawnMode::Create) => TenantPreload {
+               deleting: false,
+               timelines: HashMap::new(),
+           },
+           (None, SpawnMode::Normal) => {
                // Deprecated dev mode: load from local disk state instead of remote storage
                // https://github.com/neondatabase/neon/issues/5624
                return self.load_local(ctx).await;
@@ -1013,7 +1029,10 @@ impl Tenant {
        // IndexPart is the source of truth.
        self.clean_up_timelines(&existent_timelines)?;

-       failpoint_support::sleep_millis_async!("attach-before-activate", &self.cancel);
+       fail::fail_point!("attach-before-activate", |_| {
+           anyhow::bail!("attach-before-activate");
+       });
+       failpoint_support::sleep_millis_async!("attach-before-activate-sleep", &self.cancel);

        info!("Done");
@@ -1677,9 +1696,13 @@ impl Tenant {
        ctx: &RequestContext,
    ) -> Result<Arc<Timeline>, CreateTimelineError> {
        if !self.is_active() {
-           return Err(CreateTimelineError::Other(anyhow::anyhow!(
-               "Cannot create timelines on inactive tenant"
-           )));
+           if matches!(self.current_state(), TenantState::Stopping { .. }) {
+               return Err(CreateTimelineError::ShuttingDown);
+           } else {
+               return Err(CreateTimelineError::Other(anyhow::anyhow!(
+                   "Cannot create timelines on inactive tenant"
+               )));
+           }
        }

        let _gate = self
@@ -3755,27 +3778,25 @@ async fn run_initdb(
        .env_clear()
        .env("LD_LIBRARY_PATH", &initdb_lib_dir)
        .env("DYLD_LIBRARY_PATH", &initdb_lib_dir)
-       .stdout(Stdio::piped())
-       .stderr(Stdio::piped())
-       // If the `select!` below doesn't finish the `wait_with_output`,
-       // let the task get `wait()`ed for asynchronously by tokio.
-       // This means there is a slim chance we can go over the INIT_DB_SEMAPHORE.
-       // TODO: fix for this is non-trivial, see
-       // https://github.com/neondatabase/neon/pull/5921#pullrequestreview-1750858021
-       //
-       .kill_on_drop(true)
        .spawn()?;

-   tokio::select! {
-       initdb_output = initdb_command.wait_with_output() => {
-           let initdb_output = initdb_output?;
-           if !initdb_output.status.success() {
-               return Err(InitdbError::Failed(initdb_output.status, initdb_output.stderr));
-           }
-       }
-       _ = cancel.cancelled() => {
-           return Err(InitdbError::Cancelled);
-       }
-   }
+   // Ideally we'd select here with the cancellation token, but the problem is that
+   // we can't safely terminate initdb: it launches processes of its own, and killing
+   // initdb doesn't kill them. After we return from this function, we want the target
+   // directory to be able to be cleaned up.
+   // See https://github.com/neondatabase/neon/issues/6385
+   let initdb_output = initdb_command.wait_with_output().await?;
+   if !initdb_output.status.success() {
+       return Err(InitdbError::Failed(
+           initdb_output.status,
+           initdb_output.stderr,
+       ));
+   }
+
+   // This isn't true cancellation support, see above. Still return an error to
+   // exercise the cancellation code path.
+   if cancel.is_cancelled() {
+       return Err(InitdbError::Cancelled);
+   }

    Ok(())
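The new control flow is deliberate: no `select!` around the child, because killing initdb would orphan its own children and leave the target directory busy. A runnable miniature of the same shape, with a trivial command standing in for initdb:

```rust
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let cancel = CancellationToken::new();

    // Run the child to completion; we never race it against cancellation.
    let output = tokio::process::Command::new("true").output().await?;
    if !output.status.success() {
        eprintln!("child failed: {:?}", output.status);
    }

    // Not true cancellation: we only check the token after the fact, so the
    // caller still gets an error on the cancellation path.
    if cancel.is_cancelled() {
        eprintln!("cancelled");
    }
    Ok(())
}
```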
@@ -4031,7 +4052,7 @@ pub(crate) mod harness {
                .instrument(info_span!("try_load_preload", tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug()))
                .await?;
            tenant
-               .attach(Some(preload), ctx)
+               .attach(Some(preload), SpawnMode::Normal, ctx)
                .instrument(info_span!("try_load", tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug()))
                .await?;
        }


@@ -5,10 +5,10 @@
use super::ephemeral_file::EphemeralFile;
use super::storage_layer::delta_layer::{Adapter, DeltaLayerInner};
use crate::context::RequestContext;
-use crate::page_cache::{self, PageReadGuard, ReadBufResult, PAGE_SZ};
+use crate::page_cache::{self, PageReadGuard, PageWriteGuard, ReadBufResult, PAGE_SZ};
use crate::virtual_file::VirtualFile;
use bytes::Bytes;
-use std::ops::{Deref, DerefMut};
+use std::ops::Deref;
/// This is implemented by anything that can read 8 kB (PAGE_SZ)
/// blocks, using the page cache
@@ -39,6 +39,8 @@ pub enum BlockLease<'a> {
    EphemeralFileMutableTail(&'a [u8; PAGE_SZ]),
    #[cfg(test)]
    Arc(std::sync::Arc<[u8; PAGE_SZ]>),
+   #[cfg(test)]
+   Vec(Vec<u8>),
}

impl From<PageReadGuard<'static>> for BlockLease<'static> {
@@ -63,6 +65,10 @@ impl<'a> Deref for BlockLease<'a> {
            BlockLease::EphemeralFileMutableTail(v) => v,
            #[cfg(test)]
            BlockLease::Arc(v) => v.deref(),
+           #[cfg(test)]
+           BlockLease::Vec(v) => {
+               TryFrom::try_from(&v[..]).expect("caller must ensure that v has PAGE_SZ")
+           }
        }
    }
}
@@ -169,10 +175,14 @@ impl FileBlockReader {
    }

    /// Read a page from the underlying file into given buffer.
-   async fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), std::io::Error> {
+   async fn fill_buffer(
+       &self,
+       buf: PageWriteGuard<'static>,
+       blkno: u32,
+   ) -> Result<PageWriteGuard<'static>, std::io::Error> {
        assert!(buf.len() == PAGE_SZ);
        self.file
-           .read_exact_at(buf, blkno as u64 * PAGE_SZ as u64)
+           .read_exact_at_page(buf, blkno as u64 * PAGE_SZ as u64)
            .await
    }

    /// Read a block.
@@ -196,9 +206,9 @@ impl FileBlockReader {
            )
        })? {
            ReadBufResult::Found(guard) => Ok(guard.into()),
-           ReadBufResult::NotFound(mut write_guard) => {
+           ReadBufResult::NotFound(write_guard) => {
                // Read the page from disk into the buffer
-               self.fill_buffer(write_guard.deref_mut(), blknum).await?;
+               let write_guard = self.fill_buffer(write_guard, blknum).await?;
                Ok(write_guard.mark_valid().into())
            }
        }
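Passing the `PageWriteGuard` by value and returning it, instead of handing out `&mut [u8]`, is the buffer-ownership shape that io_uring-style APIs need: the callee owns the buffer while the read is in flight. A minimal sketch of that calling convention with a toy guard type:

```rust
// Toy guard type standing in for page_cache::PageWriteGuard.
struct PageWriteGuard {
    buf: Vec<u8>,
}

// The function consumes the guard and gives it back: the callee (and, with
// io_uring, the kernel) has exclusive ownership of the buffer while reading.
async fn fill_buffer(mut guard: PageWriteGuard, blkno: u32) -> std::io::Result<PageWriteGuard> {
    // A real implementation would read PAGE_SZ bytes at offset blkno * PAGE_SZ;
    // here we just stamp the block number so the ownership flow is visible.
    guard.buf.fill(blkno as u8);
    Ok(guard)
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let guard = PageWriteGuard { buf: vec![0u8; 8192] };
    let guard = fill_buffer(guard, 3).await?;
    assert!(guard.buf.iter().all(|&b| b == 3));
    Ok(())
}
```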


@@ -409,7 +409,10 @@ impl DeleteTenantFlow {
            .await
            .expect("cant be stopping or broken");

-       tenant.attach(preload, ctx).await.context("attach")?;
+       tenant
+           .attach(preload, super::SpawnMode::Normal, ctx)
+           .await
+           .context("attach")?;

        Self::background(
            guard,

@@ -5,11 +5,11 @@ use crate::config::PageServerConf;
use crate::context::RequestContext;
use crate::page_cache::{self, PAGE_SZ};
use crate::tenant::block_io::{BlockCursor, BlockLease, BlockReader};
-use crate::virtual_file::VirtualFile;
+use crate::virtual_file::{self, VirtualFile};
use camino::Utf8PathBuf;
use pageserver_api::shard::TenantShardId;
use std::cmp::min;
-use std::fs::OpenOptions;
use std::io::{self, ErrorKind};
use std::ops::DerefMut;
use std::sync::atomic::AtomicU64;
@@ -47,7 +47,10 @@ impl EphemeralFile {
        let file = VirtualFile::open_with_options(
            &filename,
-           OpenOptions::new().read(true).write(true).create(true),
+           virtual_file::OpenOptions::new()
+               .read(true)
+               .write(true)
+               .create(true),
        )
        .await?;
@@ -89,11 +92,10 @@ impl EphemeralFile {
            page_cache::ReadBufResult::Found(guard) => {
                return Ok(BlockLease::PageReadGuard(guard))
            }
-           page_cache::ReadBufResult::NotFound(mut write_guard) => {
-               let buf: &mut [u8] = write_guard.deref_mut();
-               debug_assert_eq!(buf.len(), PAGE_SZ);
-               self.file
-                   .read_exact_at(&mut buf[..], blknum as u64 * PAGE_SZ as u64)
+           page_cache::ReadBufResult::NotFound(write_guard) => {
+               let write_guard = self
+                   .file
+                   .read_exact_at_page(write_guard, blknum as u64 * PAGE_SZ as u64)
                    .await?;
                let read_guard = write_guard.mark_valid();
                return Ok(BlockLease::PageReadGuard(read_guard));

@@ -283,15 +283,15 @@ impl LayerMap {
    ///
    /// This is used for garbage collection, to determine if an old layer can
    /// be deleted.
-   pub fn image_layer_exists(&self, key: &Range<Key>, lsn: &Range<Lsn>) -> Result<bool> {
+   pub fn image_layer_exists(&self, key: &Range<Key>, lsn: &Range<Lsn>) -> bool {
        if key.is_empty() {
            // Vacuously true. There's a newer image for all 0 of the keys in the range.
-           return Ok(true);
+           return true;
        }

        let version = match self.historic.get().unwrap().get_version(lsn.end.0 - 1) {
            Some(v) => v,
-           None => return Ok(false),
+           None => return false,
        };

        let start = key.start.to_i128();
@@ -304,17 +304,17 @@ impl LayerMap {

        // Check the start is covered
        if !layer_covers(version.image_coverage.query(start)) {
-           return Ok(false);
+           return false;
        }

        // Check after all changes of coverage
        for (_, change_val) in version.image_coverage.range(start..end) {
            if !layer_covers(change_val) {
-               return Ok(false);
+               return false;
            }
        }

-       Ok(true)
+       true
    }

    pub fn iter_historic_layers(&self) -> impl '_ + Iterator<Item = Arc<PersistentLayerDesc>> {
@@ -325,18 +325,14 @@ impl LayerMap {
    /// Divide the whole given range of keys into sub-ranges based on the latest
    /// image layer that covers each range at the specified lsn (inclusive).
    /// This is used when creating new image layers.
-   ///
-   // FIXME: clippy complains that the result type is very complex. She's probably
-   // right...
-   #[allow(clippy::type_complexity)]
    pub fn image_coverage(
        &self,
        key_range: &Range<Key>,
        lsn: Lsn,
-   ) -> Result<Vec<(Range<Key>, Option<Arc<PersistentLayerDesc>>)>> {
+   ) -> Vec<(Range<Key>, Option<Arc<PersistentLayerDesc>>)> {
        let version = match self.historic.get().unwrap().get_version(lsn.0) {
            Some(v) => v,
-           None => return Ok(vec![]),
+           None => return vec![],
        };

        let start = key_range.start.to_i128();
@@ -359,7 +355,7 @@ impl LayerMap {
        let kr = Key::from_i128(current_key)..Key::from_i128(end);
        coverage.push((kr, current_val.take()));

-       Ok(coverage)
+       coverage
    }

    pub fn is_l0(layer: &PersistentLayerDesc) -> bool {
@@ -410,24 +406,19 @@ impl LayerMap {
    /// This number is used to compute the largest number of deltas that
    /// we'll need to visit for any page reconstruction in this region.
    /// We use this heuristic to decide whether to create an image layer.
-   pub fn count_deltas(
-       &self,
-       key: &Range<Key>,
-       lsn: &Range<Lsn>,
-       limit: Option<usize>,
-   ) -> Result<usize> {
+   pub fn count_deltas(&self, key: &Range<Key>, lsn: &Range<Lsn>, limit: Option<usize>) -> usize {
        // We get the delta coverage of the region, and for each part of the coverage
        // we recurse right underneath the delta. The recursion depth is limited by
        // the largest result this function could return, which is in practice between
        // 3 and 10 (since we usually try to create an image when the number gets larger).
        if lsn.is_empty() || key.is_empty() || limit == Some(0) {
-           return Ok(0);
+           return 0;
        }

        let version = match self.historic.get().unwrap().get_version(lsn.end.0 - 1) {
            Some(v) => v,
-           None => return Ok(0),
+           None => return 0,
        };

        let start = key.start.to_i128();
@@ -448,8 +439,7 @@ impl LayerMap {
                if !kr.is_empty() {
                    let base_count = Self::is_reimage_worthy(&val, key) as usize;
                    let new_limit = limit.map(|l| l - base_count);
-                   let max_stacked_deltas_underneath =
-                       self.count_deltas(&kr, &lr, new_limit)?;
+                   let max_stacked_deltas_underneath = self.count_deltas(&kr, &lr, new_limit);
                    max_stacked_deltas = std::cmp::max(
                        max_stacked_deltas,
                        base_count + max_stacked_deltas_underneath,
@@ -471,7 +461,7 @@ impl LayerMap {
                if !kr.is_empty() {
                    let base_count = Self::is_reimage_worthy(&val, key) as usize;
                    let new_limit = limit.map(|l| l - base_count);
-                   let max_stacked_deltas_underneath = self.count_deltas(&kr, &lr, new_limit)?;
+                   let max_stacked_deltas_underneath = self.count_deltas(&kr, &lr, new_limit);
                    max_stacked_deltas = std::cmp::max(
                        max_stacked_deltas,
                        base_count + max_stacked_deltas_underneath,
@@ -480,7 +470,7 @@ impl LayerMap {
            }
        }

-       Ok(max_stacked_deltas)
+       max_stacked_deltas
    }

    /// Count how many reimage-worthy layers we need to visit for given key-lsn pair.
@@ -592,10 +582,7 @@ impl LayerMap {
            if limit == Some(difficulty) {
                break;
            }
-           for (img_range, last_img) in self
-               .image_coverage(range, lsn)
-               .expect("why would this err?")
-           {
+           for (img_range, last_img) in self.image_coverage(range, lsn) {
                if limit == Some(difficulty) {
                    break;
                }
@@ -606,9 +593,7 @@ impl LayerMap {
            };
            if img_lsn < lsn {
-               let num_deltas = self
-                   .count_deltas(&img_range, &(img_lsn..lsn), limit)
-                   .expect("why would this err lol?");
+               let num_deltas = self.count_deltas(&img_range, &(img_lsn..lsn), limit);
                difficulty = std::cmp::max(difficulty, num_deltas);
            }
        }


@@ -7,6 +7,7 @@ use pageserver_api::models::ShardParameters;
use pageserver_api::shard::{ShardCount, ShardIdentity, ShardNumber, TenantShardId};
use rand::{distributions::Alphanumeric, Rng};
use std::borrow::Cow;
+use std::cmp::Ordering;
use std::collections::{BTreeMap, HashMap};
use std::ops::Deref;
use std::sync::Arc;
@@ -32,7 +33,8 @@ use crate::deletion_queue::DeletionQueueClient;
use crate::metrics::{TENANT, TENANT_MANAGER as METRICS};
use crate::task_mgr::{self, TaskKind};
use crate::tenant::config::{
-   AttachedLocationConfig, AttachmentMode, LocationConf, LocationMode, TenantConfOpt,
+   AttachedLocationConfig, AttachmentMode, LocationConf, LocationMode, SecondaryLocationConfig,
+   TenantConfOpt,
};
use crate::tenant::delete::DeleteTenantFlow;
use crate::tenant::span::debug_assert_current_span_has_tenant_id;
@@ -466,6 +468,26 @@ pub async fn init_tenant_mgr(
            // We have a generation map: treat it as the authority for whether
            // this tenant is really attached.
            if let Some(gen) = generations.get(&tenant_shard_id) {
if let LocationMode::Attached(attached) = &location_conf.mode {
if attached.generation > *gen {
tracing::error!(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(),
"Control plane gave decreasing generation ({gen:?}) in re-attach response for tenant that was attached in generation {:?}, demoting to secondary",
attached.generation
);
// We cannot safely attach this tenant given a bogus generation number, but let's avoid throwing away
// local disk content: demote to secondary rather than detaching.
tenants.insert(
tenant_shard_id,
TenantSlot::Secondary(SecondaryTenant::new(
tenant_shard_id,
location_conf.shard,
location_conf.tenant_conf,
&SecondaryLocationConfig { warm: false },
)),
);
}
}
                *gen
            } else {
                match &location_conf.mode {
@@ -721,7 +743,7 @@ async fn shutdown_all_tenants0(tenants: &std::sync::RwLock<TenantsMap>) {
            tokio::select! {
                Some(joined) = join_set.join_next() => {
                    match joined {
-                       Ok(()) => {}
+                       Ok(()) => {},
                        Err(join_error) if join_error.is_cancelled() => {
                            unreachable!("we are not cancelling any of the tasks");
                        }
@@ -882,7 +904,7 @@ impl TenantManager {
        tenant_shard_id: TenantShardId,
        new_location_config: LocationConf,
        flush: Option<Duration>,
-       spawn_mode: SpawnMode,
+       mut spawn_mode: SpawnMode,
        ctx: &RequestContext,
    ) -> Result<Option<Arc<Tenant>>, UpsertLocationError> {
        debug_assert_current_span_has_tenant_id();
@@ -902,19 +924,29 @@ impl TenantManager {
            tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Write)?;
        match (&new_location_config.mode, peek_slot) {
            (LocationMode::Attached(attach_conf), Some(TenantSlot::Attached(tenant))) => {
-               if attach_conf.generation == tenant.generation {
-                   // A transition from Attached to Attached in the same generation, we may
-                   // take our fast path and just provide the updated configuration
-                   // to the tenant.
-                   tenant.set_new_location_config(
-                       AttachedTenantConf::try_from(new_location_config.clone())
-                           .map_err(UpsertLocationError::BadRequest)?,
-                   );
-                   Some(FastPathModified::Attached(tenant.clone()))
-               } else {
-                   // Different generations, fall through to general case
-                   None
-               }
+               match attach_conf.generation.cmp(&tenant.generation) {
+                   Ordering::Equal => {
+                       // A transition from Attached to Attached in the same generation, we may
+                       // take our fast path and just provide the updated configuration
+                       // to the tenant.
+                       tenant.set_new_location_config(
+                           AttachedTenantConf::try_from(new_location_config.clone())
+                               .map_err(UpsertLocationError::BadRequest)?,
+                       );
+                       Some(FastPathModified::Attached(tenant.clone()))
+                   }
+                   Ordering::Less => {
+                       return Err(UpsertLocationError::BadRequest(anyhow::anyhow!(
+                           "Generation {:?} is less than existing {:?}",
+                           attach_conf.generation,
+                           tenant.generation
+                       )));
+                   }
+                   Ordering::Greater => {
+                       // Generation advanced, fall through to general case of replacing `Tenant` object
+                       None
+                   }
+               }
            }
            (
@@ -1019,6 +1051,12 @@ impl TenantManager {
}
}
slot_guard.drop_old_value().expect("We just shut it down");
// Edge case: we were called with SpawnMode::Create, but a Tenant already existed. The caller
// thought they were creating a fresh tenant, but it was already there. We must switch to
// Normal mode so that when starting this Tenant we properly probe remote storage for timelines,
// rather than assuming it to be empty.
spawn_mode = SpawnMode::Normal;
}
Some(TenantSlot::Secondary(state)) => {
info!("Shutting down secondary tenant");
@@ -1102,14 +1140,46 @@ impl TenantManager {
None
};
-slot_guard.upsert(new_slot).map_err(|e| match e {
-TenantSlotUpsertError::InternalError(e) => {
-UpsertLocationError::Other(anyhow::anyhow!(e))
-}
-TenantSlotUpsertError::MapState(e) => UpsertLocationError::Unavailable(e),
-})?;
match slot_guard.upsert(new_slot) {
Err(TenantSlotUpsertError::InternalError(e)) => {
Err(UpsertLocationError::Other(anyhow::anyhow!(e)))
}
Err(TenantSlotUpsertError::MapState(e)) => Err(UpsertLocationError::Unavailable(e)),
Err(TenantSlotUpsertError::ShuttingDown((new_slot, _completion))) => {
// If we just called tenant_spawn() on a new tenant, and can't insert it into our map, then
// we must not leak it: this would violate the invariant that after shutdown_all_tenants, all tenants
// are shutdown.
//
// We must shut it down inline here.
match new_slot {
TenantSlot::InProgress(_) => {
// Unreachable because we never insert an InProgress
unreachable!()
}
TenantSlot::Attached(tenant) => {
let (_guard, progress) = utils::completion::channel();
info!("Shutting down just-spawned tenant, because tenant manager is shut down");
match tenant.shutdown(progress, false).await {
Ok(()) => {
info!("Finished shutting down just-spawned tenant");
}
Err(barrier) => {
info!("Shutdown already in progress, waiting for it to complete");
barrier.wait().await;
}
}
}
TenantSlot::Secondary(secondary_tenant) => {
secondary_tenant.shutdown().await;
}
}
-Ok(attached_tenant)
Err(UpsertLocationError::Unavailable(
TenantMapError::ShuttingDown,
))
}
Ok(()) => Ok(attached_tenant),
}
}
/// Resetting a tenant is equivalent to detaching it, then attaching it again with the same
@@ -1728,14 +1798,31 @@ pub(crate) enum TenantSlotError {
/// Superset of TenantMapError: issues that can occur when using a SlotGuard
/// to insert a new value.
-#[derive(Debug, thiserror::Error)]
-pub enum TenantSlotUpsertError {
#[derive(thiserror::Error)]
pub(crate) enum TenantSlotUpsertError {
/// An error where the slot is in an unexpected state, indicating a code bug
#[error("Internal error updating Tenant")]
InternalError(Cow<'static, str>),
#[error(transparent)]
-MapState(#[from] TenantMapError),
MapState(TenantMapError),
// If we encounter TenantManager shutdown during upsert, we must carry the Completion
// from the SlotGuard, so that the caller can hold it while they clean up: otherwise
// TenantManager shutdown might race ahead before we're done cleaning up any Tenant that
// was protected by the SlotGuard.
#[error("Shutting down")]
ShuttingDown((TenantSlot, utils::completion::Completion)),
}
impl std::fmt::Debug for TenantSlotUpsertError {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
match self {
Self::InternalError(reason) => write!(f, "Internal Error {reason}"),
Self::MapState(map_error) => write!(f, "Tenant map state: {map_error:?}"),
Self::ShuttingDown(_completion) => write!(f, "Tenant map shutting down"),
}
}
}
#[derive(Debug, thiserror::Error)]
@@ -1784,7 +1871,7 @@ pub struct SlotGuard {
/// [`TenantSlot::InProgress`] carries the corresponding Barrier: it will
/// release any waiters as soon as this SlotGuard is dropped.
-_completion: utils::completion::Completion,
completion: utils::completion::Completion,
}
impl SlotGuard {
@@ -1797,7 +1884,7 @@ impl SlotGuard {
tenant_shard_id,
old_value,
upserted: false,
-_completion: completion,
completion,
}
}
@@ -1830,9 +1917,16 @@ impl SlotGuard {
}
let m = match &mut *locked {
-TenantsMap::Initializing => return Err(TenantMapError::StillInitializing.into()),
TenantsMap::Initializing => {
return Err(TenantSlotUpsertError::MapState(
TenantMapError::StillInitializing,
))
}
TenantsMap::ShuttingDown(_) => {
-return Err(TenantMapError::ShuttingDown.into());
return Err(TenantSlotUpsertError::ShuttingDown((
new_value,
self.completion.clone(),
)));
}
TenantsMap::Open(m) => m,
};
@@ -1880,7 +1974,9 @@ impl SlotGuard {
Err(TenantSlotUpsertError::InternalError(_)) => {
// We already logged the error, nothing else we can do.
}
-Err(TenantSlotUpsertError::MapState(_)) => {
Err(
TenantSlotUpsertError::MapState(_) | TenantSlotUpsertError::ShuttingDown(_),
) => {
// If the map is shutting down, we need not replace anything
}
Ok(()) => {}
@@ -1978,18 +2074,22 @@ fn tenant_map_peek_slot<'a>(
tenant_shard_id: &TenantShardId,
mode: TenantSlotPeekMode,
) -> Result<Option<&'a TenantSlot>, TenantMapError> {
-let m = match tenants.deref() {
-TenantsMap::Initializing => return Err(TenantMapError::StillInitializing),
match tenants.deref() {
TenantsMap::Initializing => Err(TenantMapError::StillInitializing),
TenantsMap::ShuttingDown(m) => match mode {
-TenantSlotPeekMode::Read => m,
-TenantSlotPeekMode::Write => {
-return Err(TenantMapError::ShuttingDown);
-}
TenantSlotPeekMode::Read => Ok(Some(
// When reading in ShuttingDown state, we must translate None results
// into a ShuttingDown error, because absence of a tenant shard ID in the map
// isn't a reliable indicator of the tenant being gone: it might have been
// InProgress when shutdown started, and cleaned up from that state such
// that it's now no longer in the map. Callers will have to wait until
// we next start up to get a proper answer. This avoids incorrect 404 API responses.
m.get(tenant_shard_id).ok_or(TenantMapError::ShuttingDown)?,
)),
TenantSlotPeekMode::Write => Err(TenantMapError::ShuttingDown),
},
-TenantsMap::Open(m) => m,
-};
-Ok(m.get(tenant_shard_id))
TenantsMap::Open(m) => Ok(m.get(tenant_shard_id)),
}
}
enum TenantSlotAcquireMode {
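
The generation check in the upsert path above reduces to a three-way comparison. A minimal, self-contained sketch of the rule (illustrative names only, not pageserver types):

use std::cmp::Ordering;

// Hypothetical simplified generation numbers, for illustration only.
fn upsert_decision(incoming: u32, current: u32) -> &'static str {
    match incoming.cmp(&current) {
        // Same generation: hand the tenant its updated config in place.
        Ordering::Equal => "fast-path config update",
        // Stale generation: the request is behind the control plane; reject it.
        Ordering::Less => "reject as BadRequest",
        // Newer generation: shut down and replace the Tenant object.
        Ordering::Greater => "replace Tenant (general case)",
    }
}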

View File

@@ -257,6 +257,8 @@ pub(crate) const FAILED_UPLOAD_WARN_THRESHOLD: u32 = 3;
pub(crate) const INITDB_PATH: &str = "initdb.tar.zst";
pub(crate) const INITDB_PRESERVED_PATH: &str = "initdb-preserved.tar.zst";
/// Default buffer size when interfacing with [`tokio::fs::File`].
pub(crate) const BUFFER_SIZE: usize = 32 * 1024;
@@ -1066,6 +1068,28 @@ impl RemoteTimelineClient {
Ok(())
}
pub(crate) async fn preserve_initdb_archive(
self: &Arc<Self>,
tenant_id: &TenantId,
timeline_id: &TimelineId,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
backoff::retry(
|| async {
upload::preserve_initdb_archive(&self.storage_impl, tenant_id, timeline_id, cancel)
.await
},
|_e| false,
FAILED_DOWNLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"preserve_initdb_tar_zst",
backoff::Cancel::new(cancel.clone(), || anyhow::anyhow!("Cancelled!")),
)
.await
.context("backing up initdb archive")?;
Ok(())
}
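
The wrapper above relies on backoff::retry for fine-grained, per-operation retries. A hedged, self-contained sketch of the general pattern (illustrative signature, not the actual utils crate API):

use std::time::Duration;

// Illustrative only: retry an async operation with capped exponential backoff.
async fn retry_with_backoff<T, E, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut attempt = 0u32;
    loop {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                attempt += 1;
                // Capped exponential backoff; the sleep itself is elided here.
                // A real caller would e.g. tokio::time::sleep(_delay).await.
                let _delay = Duration::from_millis(100 * (1u64 << attempt.min(6)));
            }
        }
    }
}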
/// Prerequisites: UploadQueue should be in stopped state and deleted_at should be successfully set.
/// The function deletes layer files one by one, then lists the prefix to see if we leaked something;
/// deletes leaked files if any, and proceeds with deletion of the index file at the end.
@@ -1101,6 +1125,14 @@ impl RemoteTimelineClient {
let layer_deletion_count = layers.len();
self.deletion_queue_client.push_immediate(layers).await?;
// Delete the initdb.tar.zst, which is not always present, but deletion attempts of
// nonexistent objects are not considered errors.
let initdb_path =
remote_initdb_archive_path(&self.tenant_shard_id.tenant_id, &self.timeline_id);
self.deletion_queue_client
.push_immediate(vec![initdb_path])
.await?;
// Do not delete index part yet, it is needed for possible retry. If we remove it first
// and the retry arrives at a different pageserver, there won't be any traces of it on remote storage.
let timeline_storage_path = remote_timeline_path(&self.tenant_shard_id, &self.timeline_id);
@@ -1148,10 +1180,8 @@ impl RemoteTimelineClient {
if p == &latest_index {
return false;
}
-if let Some(name) = p.object_name() {
-if name == INITDB_PATH {
-return false;
-}
if p.object_name() == Some(INITDB_PRESERVED_PATH) {
return false;
}
true
})
@@ -1724,6 +1754,16 @@ pub fn remote_initdb_archive_path(tenant_id: &TenantId, timeline_id: &TimelineId
.expect("Failed to construct path")
}
pub fn remote_initdb_preserved_archive_path(
tenant_id: &TenantId,
timeline_id: &TimelineId,
) -> RemotePath {
RemotePath::from_string(&format!(
"tenants/{tenant_id}/{TIMELINES_SEGMENT_NAME}/{timeline_id}/{INITDB_PRESERVED_PATH}"
))
.expect("Failed to construct path")
}
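
For orientation, the two initdb objects sit side by side under the timeline prefix (a hedged illustration, assuming TIMELINES_SEGMENT_NAME is "timelines"):

tenants/<tenant_id>/timelines/<timeline_id>/initdb.tar.zst            <- written at timeline creation
tenants/<tenant_id>/timelines/<timeline_id>/initdb-preserved.tar.zst  <- copy made by preserve_initdb_archive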
pub fn remote_index_path(
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,

View File

@@ -32,7 +32,8 @@ use utils::id::TimelineId;
use super::index::{IndexPart, LayerFileMetadata};
use super::{
parse_remote_index_path, remote_index_path, remote_initdb_archive_path,
-FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES, INITDB_PATH,
remote_initdb_preserved_archive_path, FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES,
INITDB_PATH,
};
///
@@ -430,6 +431,9 @@ pub(crate) async fn download_initdb_tar_zst(
let remote_path = remote_initdb_archive_path(&tenant_shard_id.tenant_id, timeline_id);
let remote_preserved_path =
remote_initdb_preserved_archive_path(&tenant_shard_id.tenant_id, timeline_id);
let timeline_path = conf.timelines_path(tenant_shard_id);
if !timeline_path.exists() {
@@ -456,8 +460,16 @@ pub(crate) async fn download_initdb_tar_zst(
.with_context(|| format!("tempfile creation {temp_path}"))
.map_err(DownloadError::Other)?;
-let download =
-download_cancellable(&cancel_inner, storage.download(&remote_path)).await?;
let download = match download_cancellable(&cancel_inner, storage.download(&remote_path))
.await
{
Ok(dl) => dl,
Err(DownloadError::NotFound) => {
download_cancellable(&cancel_inner, storage.download(&remote_preserved_path))
.await?
}
Err(other) => Err(other)?,
};
let mut download = tokio_util::io::StreamReader::new(download.download_stream);
let mut writer = tokio::io::BufWriter::with_capacity(8 * 1024, file);

View File

@@ -13,8 +13,8 @@ use super::Generation;
use crate::{
config::PageServerConf,
tenant::remote_timeline_client::{
-index::IndexPart, remote_index_path, remote_initdb_archive_path, remote_path,
-upload_cancellable,
index::IndexPart, remote_index_path, remote_initdb_archive_path,
remote_initdb_preserved_archive_path, remote_path, upload_cancellable,
},
};
use remote_storage::GenericRemoteStorage;
@@ -144,3 +144,16 @@ pub(crate) async fn upload_initdb_dir(
.await
.with_context(|| format!("upload initdb dir for '{tenant_id} / {timeline_id}'"))
}
pub(crate) async fn preserve_initdb_archive(
storage: &GenericRemoteStorage,
tenant_id: &TenantId,
timeline_id: &TimelineId,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let source_path = remote_initdb_archive_path(tenant_id, timeline_id);
let dest_path = remote_initdb_preserved_archive_path(tenant_id, timeline_id);
upload_cancellable(cancel, storage.copy_object(&source_path, &dest_path))
.await
.with_context(|| format!("backing up initdb archive for '{tenant_id} / {timeline_id}'"))
}
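
A hedged usage sketch of the helper above, assuming a GenericRemoteStorage handle and the IDs are in scope. Since copy_object performs the copy on the remote side (e.g. an S3 CopyObject request), the archive never travels through the pageserver:

// Illustrative call site, not taken from the codebase:
preserve_initdb_archive(&storage, &tenant_id, &timeline_id, &cancel).await?;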

View File

@@ -36,7 +36,7 @@ use crate::tenant::block_io::{BlockBuf, BlockCursor, BlockLease, BlockReader, Fi
use crate::tenant::disk_btree::{DiskBtreeBuilder, DiskBtreeReader, VisitDirection};
use crate::tenant::storage_layer::{Layer, ValueReconstructResult, ValueReconstructState};
use crate::tenant::Timeline;
-use crate::virtual_file::VirtualFile;
use crate::virtual_file::{self, VirtualFile};
use crate::{walrecord, TEMP_FILE_SUFFIX};
use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{bail, ensure, Context, Result};
@@ -649,7 +649,7 @@ impl DeltaLayer {
{
let file = VirtualFile::open_with_options(
path,
-&*std::fs::OpenOptions::new().read(true).write(true),
virtual_file::OpenOptions::new().read(true).write(true),
)
.await
.with_context(|| format!("Failed to open file '{}'", path))?;
@@ -884,7 +884,7 @@ impl DeltaLayerInner {
let keys = self.load_keys(ctx).await?;
-async fn dump_blob(val: &ValueRef<'_>, ctx: &RequestContext) -> anyhow::Result<String> {
async fn dump_blob(val: ValueRef<'_>, ctx: &RequestContext) -> anyhow::Result<String> {
let buf = val.reader.read_blob(val.blob_ref.pos(), ctx).await?;
let val = Value::des(&buf)?;
let desc = match val {
@@ -905,30 +905,14 @@ impl DeltaLayerInner {
}
for entry in keys {
let DeltaEntry { key, lsn, val, .. } = entry;
-let desc = match dump_blob(&val, ctx).await {
let desc = match dump_blob(val, ctx).await {
Ok(desc) => desc,
Err(err) => {
format!("ERROR: {err}")
}
};
println!(" key {key} at {lsn}: {desc}");
-use crate::pgdatadir_mapping::CHECKPOINT_KEY;
-use postgres_ffi::CheckPoint;
-if key == CHECKPOINT_KEY
-{
-let buf = val.reader.read_blob(val.blob_ref.pos(), ctx).await?;
-let val = Value::des(&buf)?;
-match val {
-Value::Image(img) => {
-let checkpoint = CheckPoint::decode(&img)?;
-println!(" CHECKPOINT: {:?}", checkpoint);
-}
-Value::WalRecord(_rec) => {
-format!(" unexpected walrecord for checkpoint key");
-}
-}
-}
}
Ok(())

View File

@@ -34,7 +34,7 @@ use crate::tenant::storage_layer::{
LayerAccessStats, ValueReconstructResult, ValueReconstructState,
};
use crate::tenant::Timeline;
-use crate::virtual_file::VirtualFile;
use crate::virtual_file::{self, VirtualFile};
use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION, TEMP_FILE_SUFFIX};
use anyhow::{bail, ensure, Context, Result};
use bytes::Bytes;
@@ -327,7 +327,7 @@ impl ImageLayer {
{
let file = VirtualFile::open_with_options(
path,
-&*std::fs::OpenOptions::new().read(true).write(true),
virtual_file::OpenOptions::new().read(true).write(true),
)
.await
.with_context(|| format!("Failed to open file '{}'", path))?;
@@ -492,11 +492,15 @@ impl ImageLayerWriterInner {
},
);
info!("new image layer {path}");
-let mut file = VirtualFile::open_with_options(
-&path,
-std::fs::OpenOptions::new().write(true).create_new(true),
-)
-.await?;
let mut file = {
VirtualFile::open_with_options(
&path,
virtual_file::OpenOptions::new()
.write(true)
.create_new(true),
)
.await?
};
// make room for the header block
file.seek(SeekFrom::Start(PAGE_SZ as u64)).await?;
let blob_writer = BlobWriter::new(file, PAGE_SZ as u64);

View File

@@ -9,6 +9,7 @@ use crate::context::{DownloadBehavior, RequestContext};
use crate::metrics::TENANT_TASK_EVENTS;
use crate::task_mgr;
use crate::task_mgr::{TaskKind, BACKGROUND_RUNTIME};
use crate::tenant::timeline::CompactionError;
use crate::tenant::{Tenant, TenantState};
use tokio_util::sync::CancellationToken;
use tracing::*;
@@ -181,8 +182,11 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
);
error_run_count += 1;
let wait_duration = Duration::from_secs_f64(wait_duration);
-error!(
-"Compaction failed {error_run_count} times, retrying in {wait_duration:?}: {e:?}",
log_compaction_error(
&e,
error_run_count,
&wait_duration,
cancel.is_cancelled(),
);
wait_duration
} else {
@@ -210,6 +214,58 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
TENANT_TASK_EVENTS.with_label_values(&["stop"]).inc();
}
fn log_compaction_error(
e: &CompactionError,
error_run_count: u32,
sleep_duration: &std::time::Duration,
task_cancelled: bool,
) {
use crate::tenant::upload_queue::NotInitialized;
use crate::tenant::PageReconstructError;
use CompactionError::*;
enum LooksLike {
Info,
Error,
}
let decision = match e {
ShuttingDown => None,
_ if task_cancelled => Some(LooksLike::Info),
Other(e) => {
let root_cause = e.root_cause();
let is_stopping = {
let upload_queue = root_cause
.downcast_ref::<NotInitialized>()
.is_some_and(|e| e.is_stopping());
let timeline = root_cause
.downcast_ref::<PageReconstructError>()
.is_some_and(|e| e.is_stopping());
upload_queue || timeline
};
if is_stopping {
Some(LooksLike::Info)
} else {
Some(LooksLike::Error)
}
}
};
match decision {
Some(LooksLike::Info) => info!(
"Compaction failed {error_run_count} times, retrying in {sleep_duration:?}: {e:#}",
),
Some(LooksLike::Error) => error!(
"Compaction failed {error_run_count} times, retrying in {sleep_duration:?}: {e:?}",
),
None => {}
}
}
///
/// GC task's main loop
///

View File

@@ -14,6 +14,7 @@ use enumset::EnumSet;
use fail::fail_point;
use itertools::Itertools;
use pageserver_api::{
keyspace::{key_range_size, KeySpaceAccum},
models::{
DownloadRemoteLayersTaskInfo, DownloadRemoteLayersTaskSpawnRequest, EvictionPolicy,
LayerMapInfo, TimelineState,
@@ -32,7 +33,7 @@ use tokio_util::sync::CancellationToken;
use tracing::*;
use utils::sync::gate::Gate;
-use std::collections::{BinaryHeap, HashMap, HashSet};
use std::collections::{BTreeMap, BinaryHeap, HashMap, HashSet};
use std::ops::{Deref, Range};
use std::pin::pin;
use std::sync::atomic::Ordering as AtomicOrdering;
@@ -73,8 +74,8 @@ use crate::metrics::{
TimelineMetrics, MATERIALIZED_PAGE_CACHE_HIT, MATERIALIZED_PAGE_CACHE_HIT_DIRECT,
};
use crate::pgdatadir_mapping::CalculateLogicalSizeError;
-use crate::pgdatadir_mapping::{is_inherited_key, is_rel_fsm_block_key, is_rel_vm_block_key};
use crate::tenant::config::TenantConfOpt;
use pageserver_api::key::{is_inherited_key, is_rel_fsm_block_key, is_rel_vm_block_key};
use pageserver_api::reltag::RelTag;
use pageserver_api::shard::ShardIndex;
@@ -391,8 +392,7 @@ pub(crate) enum PageReconstructError {
#[error("Ancestor LSN wait error: {0}")] #[error("Ancestor LSN wait error: {0}")]
AncestorLsnTimeout(#[from] WaitLsnError), AncestorLsnTimeout(#[from] WaitLsnError),
/// The operation was cancelled #[error("timeline shutting down")]
#[error("Cancelled")]
Cancelled, Cancelled,
/// The ancestor of this is being stopped /// The ancestor of this is being stopped
@@ -404,6 +404,34 @@ pub(crate) enum PageReconstructError {
WalRedo(anyhow::Error),
}
impl PageReconstructError {
/// Returns true if this error indicates a tenant/timeline shutdown alike situation
pub(crate) fn is_stopping(&self) -> bool {
use PageReconstructError::*;
match self {
Other(_) => false,
AncestorLsnTimeout(_) => false,
Cancelled | AncestorStopping(_) => true,
WalRedo(_) => false,
}
}
}
#[derive(thiserror::Error, Debug)]
enum CreateImageLayersError {
#[error("timeline shutting down")]
Cancelled,
#[error(transparent)]
GetVectoredError(GetVectoredError),
#[error(transparent)]
PageReconstructError(PageReconstructError),
#[error(transparent)]
Other(#[from] anyhow::Error),
}
#[derive(thiserror::Error, Debug)]
enum FlushLayerError {
/// Timeline cancellation token was cancelled
@@ -411,12 +439,24 @@ enum FlushLayerError {
Cancelled,
#[error(transparent)]
-PageReconstructError(#[from] PageReconstructError),
CreateImageLayersError(CreateImageLayersError),
#[error(transparent)]
Other(#[from] anyhow::Error),
}
#[derive(thiserror::Error, Debug)]
pub(crate) enum GetVectoredError {
#[error("timeline shutting down")]
Cancelled,
#[error("Requested too many keys: {0} > {}", Timeline::MAX_GET_VECTORED_KEYS)]
Oversized(u64),
#[error("Requested at invalid LSN: {0}")]
InvalidLsn(Lsn),
}
#[derive(Clone, Copy)]
pub enum LogicalSizeCalculationCause {
Initial,
@@ -456,6 +496,45 @@ pub(crate) enum WaitLsnError {
Timeout(String),
}
// The impls below achieve cancellation mapping for errors.
// Perhaps there's a way of achieving this with less cruft.
impl From<CreateImageLayersError> for CompactionError {
fn from(e: CreateImageLayersError) -> Self {
match e {
CreateImageLayersError::Cancelled => CompactionError::ShuttingDown,
_ => CompactionError::Other(e.into()),
}
}
}
impl From<CreateImageLayersError> for FlushLayerError {
fn from(e: CreateImageLayersError) -> Self {
match e {
CreateImageLayersError::Cancelled => FlushLayerError::Cancelled,
any => FlushLayerError::CreateImageLayersError(any),
}
}
}
impl From<PageReconstructError> for CreateImageLayersError {
fn from(e: PageReconstructError) -> Self {
match e {
PageReconstructError::Cancelled => CreateImageLayersError::Cancelled,
_ => CreateImageLayersError::PageReconstructError(e),
}
}
}
impl From<GetVectoredError> for CreateImageLayersError {
fn from(e: GetVectoredError) -> Self {
match e {
GetVectoredError::Cancelled => CreateImageLayersError::Cancelled,
_ => CreateImageLayersError::GetVectoredError(e),
}
}
}
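
The comment above wonders whether the mapping could be achieved with less cruft; one hedged alternative (not in the codebase) is a small macro that stamps out the impls whose fallback is an Other(anyhow::Error) arm:

// Hypothetical macro sketch; it assumes the error enums defined in this file.
macro_rules! cancel_mapping {
    ($from:ty => $to:ty, $cancelled:path => $mapped:expr) => {
        impl From<$from> for $to {
            fn from(e: $from) -> Self {
                match e {
                    $cancelled => $mapped,
                    other => <$to>::Other(other.into()),
                }
            }
        }
    };
}

// Usage sketch:
// cancel_mapping!(CreateImageLayersError => CompactionError,
//                 CreateImageLayersError::Cancelled => CompactionError::ShuttingDown);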
/// Public interface functions
impl Timeline {
/// Get the LSN where this branch was created
@@ -575,6 +654,57 @@ impl Timeline {
res
}
pub(crate) const MAX_GET_VECTORED_KEYS: u64 = 32;
/// Look up multiple page versions at a given LSN
///
/// This naive implementation will be replaced with a more efficient one
/// which actually vectorizes the read path.
pub(crate) async fn get_vectored(
&self,
key_ranges: &[Range<Key>],
lsn: Lsn,
ctx: &RequestContext,
) -> Result<BTreeMap<Key, Result<Bytes, PageReconstructError>>, GetVectoredError> {
if !lsn.is_valid() {
return Err(GetVectoredError::InvalidLsn(lsn));
}
let key_count = key_ranges
.iter()
.map(|range| key_range_size(range) as u64)
.sum();
if key_count > Timeline::MAX_GET_VECTORED_KEYS {
return Err(GetVectoredError::Oversized(key_count));
}
let _timer = crate::metrics::GET_VECTORED_LATENCY
.for_task_kind(ctx.task_kind())
.map(|t| t.start_timer());
let mut values = BTreeMap::new();
for range in key_ranges {
let mut key = range.start;
while key != range.end {
assert!(!self.shard_identity.is_key_disposable(&key));
let block = self.get(key, lsn, ctx).await;
if matches!(
block,
Err(PageReconstructError::Cancelled | PageReconstructError::AncestorStopping(_))
) {
return Err(GetVectoredError::Cancelled);
}
values.insert(key, block);
key = key.next();
}
}
Ok(values)
}
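
A hedged usage sketch of the API above (timeline, key, lsn and ctx are assumed to be in scope). Oversized requests are rejected rather than split, so callers must batch keys themselves:

// Illustrative call site, not taken from the codebase:
let ranges = vec![key..key.next()]; // a single-key range
match timeline.get_vectored(&ranges, lsn, ctx).await {
    Ok(values) => {
        // Each key carries its own, independent reconstruction result.
        for (k, block) in values {
            match block {
                Ok(img) => println!("key {k}: {} bytes", img.len()),
                Err(e) => println!("key {k}: reconstruction failed: {e:?}"),
            }
        }
    }
    Err(e) => println!("vectored read rejected: {e}"),
}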
/// Get last or prev record separately. Same as get_last_record_rlsn().last/prev.
pub fn get_last_record_lsn(&self) -> Lsn {
self.last_record_lsn.load().last
@@ -2582,7 +2712,7 @@ impl Timeline {
return;
}
err @ Err(
-FlushLayerError::Other(_) | FlushLayerError::PageReconstructError(_),
FlushLayerError::Other(_) | FlushLayerError::CreateImageLayersError(_),
) => {
error!("could not flush frozen layer: {err:?}");
break err;
@@ -2859,6 +2989,21 @@ impl Timeline {
Ok(())
}
pub(crate) async fn preserve_initdb_archive(&self) -> anyhow::Result<()> {
if let Some(remote_client) = &self.remote_client {
remote_client
.preserve_initdb_archive(
&self.tenant_shard_id.tenant_id,
&self.timeline_id,
&self.cancel,
)
.await?;
} else {
bail!("No remote storage configured, but was asked to back up the initdb archive for {} / {}", self.tenant_shard_id.tenant_id, self.timeline_id);
}
Ok(())
}
// Write out the given frozen in-memory layer as a new L0 delta file. This L0 file will not be tracked
// in the layer map immediately. The caller is responsible for putting it into the layer map.
async fn create_delta_layer(
@@ -2950,11 +3095,7 @@ impl Timeline {
}
// Is it time to create a new image layer for the given partition?
-async fn time_for_new_image_layer(
-&self,
-partition: &KeySpace,
-lsn: Lsn,
-) -> anyhow::Result<bool> {
async fn time_for_new_image_layer(&self, partition: &KeySpace, lsn: Lsn) -> bool {
let threshold = self.get_image_creation_threshold();
let guard = self.layers.read().await;
@@ -2974,20 +3115,20 @@ impl Timeline {
// but the range is already covered by image layers at more recent LSNs. Before we
// create a new image layer, check if the range is already covered at more recent LSNs.
if !layers
-.image_layer_exists(&img_range, &(Lsn::min(lsn, *cutoff_lsn)..lsn + 1))?
.image_layer_exists(&img_range, &(Lsn::min(lsn, *cutoff_lsn)..lsn + 1))
{
debug!(
"Force generation of layer {}-{} wanted by GC, cutoff={}, lsn={})",
img_range.start, img_range.end, cutoff_lsn, lsn
);
-return Ok(true);
return true;
}
}
}
}
for part_range in &partition.ranges {
-let image_coverage = layers.image_coverage(part_range, lsn)?;
let image_coverage = layers.image_coverage(part_range, lsn);
for (img_range, last_img) in image_coverage {
let img_lsn = if let Some(last_img) = last_img {
last_img.get_lsn_range().end
@@ -3008,7 +3149,7 @@ impl Timeline {
// after we read last_record_lsn, which is passed here in the 'lsn' argument.
if img_lsn < lsn {
let num_deltas =
-layers.count_deltas(&img_range, &(img_lsn..lsn), Some(threshold))?;
layers.count_deltas(&img_range, &(img_lsn..lsn), Some(threshold));
max_deltas = max_deltas.max(num_deltas);
if num_deltas >= threshold {
@@ -3016,7 +3157,7 @@ impl Timeline {
"key range {}-{}, has {} deltas on this timeline in LSN range {}..{}", "key range {}-{}, has {} deltas on this timeline in LSN range {}..{}",
img_range.start, img_range.end, num_deltas, img_lsn, lsn img_range.start, img_range.end, num_deltas, img_lsn, lsn
); );
return Ok(true); return true;
} }
} }
} }
@@ -3026,7 +3167,7 @@ impl Timeline {
max_deltas,
"none of the partitioned ranges had >= {threshold} deltas"
);
-Ok(false)
false
}
#[tracing::instrument(skip_all, fields(%lsn, %force))]
@@ -3036,7 +3177,7 @@ impl Timeline {
lsn: Lsn,
force: bool,
ctx: &RequestContext,
-) -> Result<Vec<ResidentLayer>, PageReconstructError> {
) -> Result<Vec<ResidentLayer>, CreateImageLayersError> {
let timer = self.metrics.create_images_time_histo.start_timer();
let mut image_layers = Vec::new();
@@ -3054,7 +3195,7 @@ impl Timeline {
for partition in partitioning.parts.iter() {
let img_range = start..partition.ranges.last().unwrap().end;
start = img_range.end;
-if force || self.time_for_new_image_layer(partition, lsn).await? {
if force || self.time_for_new_image_layer(partition, lsn).await {
let mut image_layer_writer = ImageLayerWriter::new(
self.conf,
self.timeline_id,
@@ -3065,10 +3206,12 @@ impl Timeline {
.await?;
fail_point!("image-layer-writer-fail-before-finish", |_| {
-Err(PageReconstructError::Other(anyhow::anyhow!(
Err(CreateImageLayersError::Other(anyhow::anyhow!(
"failpoint image-layer-writer-fail-before-finish"
)))
});
let mut key_request_accum = KeySpaceAccum::new();
for range in &partition.ranges {
let mut key = range.start;
while key < range.end {
@@ -3081,34 +3224,55 @@ impl Timeline {
key = key.next();
continue;
}
-let img = match self.get(key, lsn, ctx).await {
-Ok(img) => img,
-Err(err) => {
-// If we fail to reconstruct a VM or FSM page, we can zero the
-// page without losing any actual user data. That seems better
-// than failing repeatedly and getting stuck.
-//
-// We had a bug at one point, where we truncated the FSM and VM
-// in the pageserver, but the Postgres didn't know about that
-// and continued to generate incremental WAL records for pages
-// that didn't exist in the pageserver. Trying to replay those
-// WAL records failed to find the previous image of the page.
-// This special case allows us to recover from that situation.
-// See https://github.com/neondatabase/neon/issues/2601.
-//
-// Unfortunately we cannot do this for the main fork, or for
-// any metadata keys, keys, as that would lead to actual data
-// loss.
-if is_rel_fsm_block_key(key) || is_rel_vm_block_key(key) {
-warn!("could not reconstruct FSM or VM key {key}, filling with zeros: {err:?}");
-ZERO_PAGE.clone()
-} else {
-return Err(err);
-}
-}
-};
-image_layer_writer.put_image(key, &img).await?;
key_request_accum.add_key(key);
if key_request_accum.size() >= Timeline::MAX_GET_VECTORED_KEYS
|| key.next() == range.end
{
let results = self
.get_vectored(
&key_request_accum.consume_keyspace().ranges,
lsn,
ctx,
)
.await?;
for (img_key, img) in results {
let img = match img {
Ok(img) => img,
Err(err) => {
// If we fail to reconstruct a VM or FSM page, we can zero the
// page without losing any actual user data. That seems better
// than failing repeatedly and getting stuck.
//
// We had a bug at one point, where we truncated the FSM and VM
// in the pageserver, but Postgres didn't know about that
// and continued to generate incremental WAL records for pages
// that didn't exist in the pageserver. Trying to replay those
// WAL records failed to find the previous image of the page.
// This special case allows us to recover from that situation.
// See https://github.com/neondatabase/neon/issues/2601.
//
// Unfortunately we cannot do this for the main fork, or for
// any metadata keys, as that would lead to actual data
// loss.
if is_rel_fsm_block_key(img_key)
|| is_rel_vm_block_key(img_key)
{
warn!("could not reconstruct FSM or VM key {img_key}, filling with zeros: {err:?}");
ZERO_PAGE.clone()
} else {
return Err(
CreateImageLayersError::PageReconstructError(err),
);
}
}
};
image_layer_writer.put_image(img_key, &img).await?;
}
}
key = key.next();
}
}
@@ -3484,7 +3648,7 @@ impl Timeline {
// does not make much sense, because the largest holes will correspond to field1/field2 changes.
// But we are mostly interested in eliminating holes which cause generation of excessive image layers.
// That is why it is better to measure the size of a hole as the number of covering image layers.
-let coverage_size = layers.image_coverage(&key_range, last_record_lsn)?.len();
let coverage_size = layers.image_coverage(&key_range, last_record_lsn).len();
if coverage_size >= min_hole_coverage_size {
heap.push(Hole {
key_range,
@@ -4110,7 +4274,7 @@ impl Timeline {
// we cannot remove C, even though it's older than 2500, because
// the delta layer 2000-3000 depends on it.
if !layers
-.image_layer_exists(&l.get_key_range(), &(l.get_lsn_range().end..new_gc_cutoff))?
.image_layer_exists(&l.get_key_range(), &(l.get_lsn_range().end..new_gc_cutoff))
{
debug!("keeping {} because it is the latest layer", l.filename());
// Collect delta key ranges that need image layers to allow garbage
@@ -4240,7 +4404,7 @@ impl Timeline {
.walredo_mgr
.request_redo(key, request_lsn, data.img, data.records, self.pg_version)
.await
-.context("Failed to reconstruct a page image:")
.context("reconstruct a page image")
{
Ok(img) => img,
Err(e) => return Err(PageReconstructError::WalRedo(e)),

View File

@@ -126,6 +126,27 @@ pub(super) struct UploadQueueStopped {
pub(super) deleted_at: SetDeletedFlagProgress,
}
#[derive(thiserror::Error, Debug)]
pub(crate) enum NotInitialized {
#[error("queue is in state Uninitialized")]
Uninitialized,
#[error("queue is in state Stopping")]
Stopped,
#[error("queue is shutting down")]
ShuttingDown,
}
impl NotInitialized {
pub(crate) fn is_stopping(&self) -> bool {
use NotInitialized::*;
match self {
Uninitialized => false,
Stopped => true,
ShuttingDown => true,
}
}
}
impl UploadQueue {
pub(crate) fn initialize_empty_remote(
&mut self,
@@ -214,17 +235,17 @@ impl UploadQueue {
}
pub(crate) fn initialized_mut(&mut self) -> anyhow::Result<&mut UploadQueueInitialized> {
-match self {
-UploadQueue::Uninitialized | UploadQueue::Stopped(_) => {
-anyhow::bail!("queue is in state {}", self.as_str())
-}
-UploadQueue::Initialized(x) => {
-if !x.shutting_down {
-Ok(x)
-} else {
-anyhow::bail!("queue is shutting down")
-}
-}
-}
use UploadQueue::*;
match self {
Uninitialized => Err(NotInitialized::Uninitialized.into()),
Initialized(x) => {
if x.shutting_down {
Err(NotInitialized::ShuttingDown.into())
} else {
Ok(x)
}
}
Stopped(_) => Err(NotInitialized::Stopped.into()),
}
}

View File

@@ -11,18 +11,28 @@
//! src/backend/storage/file/fd.c
//!
use crate::metrics::{StorageIoOperation, STORAGE_IO_SIZE, STORAGE_IO_TIME_METRIC};
use crate::page_cache::PageWriteGuard;
use crate::tenant::TENANTS_SEGMENT_NAME;
use camino::{Utf8Path, Utf8PathBuf};
use once_cell::sync::OnceCell;
use pageserver_api::shard::TenantShardId;
-use std::fs::{self, File, OpenOptions};
use std::fs::{self, File};
use std::io::{Error, ErrorKind, Seek, SeekFrom};
use tokio_epoll_uring::IoBufMut;
use std::os::fd::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};
use std::os::unix::fs::FileExt;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use tokio::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};
use tokio::time::Instant;
use utils::fs_ext;
mod io_engine;
mod open_options;
pub use io_engine::IoEngineKind;
pub(crate) use open_options::*;
///
/// A virtual file descriptor. You can use this just like std::fs::File, but internally
/// the underlying file is closed if the system is low on file descriptors,
@@ -106,7 +116,38 @@ struct SlotInner {
tag: u64,
/// the underlying file
-file: Option<File>,
file: Option<OwnedFd>,
}
/// Impl of [`tokio_epoll_uring::IoBuf`] and [`tokio_epoll_uring::IoBufMut`] for [`PageWriteGuard`].
struct PageWriteGuardBuf {
page: PageWriteGuard<'static>,
init_up_to: usize,
}
// Safety: the [`PageWriteGuard`] gives us exclusive ownership of the page cache slot,
// and the location remains stable even if [`Self`] or the [`PageWriteGuard`] is moved.
unsafe impl tokio_epoll_uring::IoBuf for PageWriteGuardBuf {
fn stable_ptr(&self) -> *const u8 {
self.page.as_ptr()
}
fn bytes_init(&self) -> usize {
self.init_up_to
}
fn bytes_total(&self) -> usize {
self.page.len()
}
}
// Safety: see above, plus: the ownership of [`PageWriteGuard`] means exclusive access,
// hence it's safe to hand out the `stable_mut_ptr()`.
unsafe impl tokio_epoll_uring::IoBufMut for PageWriteGuardBuf {
fn stable_mut_ptr(&mut self) -> *mut u8 {
self.page.as_mut_ptr()
}
unsafe fn set_init(&mut self, pos: usize) {
assert!(pos <= self.page.len());
self.init_up_to = pos;
}
}
impl OpenFiles {
@@ -274,6 +315,10 @@ macro_rules! with_file {
let $ident = $this.lock_file().await?;
observe_duration!($op, $($body)*)
}};
($this:expr, $op:expr, | mut $ident:ident | $($body:tt)*) => {{
let mut $ident = $this.lock_file().await?;
observe_duration!($op, $($body)*)
}};
}
impl VirtualFile {
@@ -326,7 +371,9 @@ impl VirtualFile {
// NB: there is also StorageIoOperation::OpenAfterReplace which is for the case
// where our caller doesn't get to use the returned VirtualFile before its
// slot gets re-used by someone else.
-let file = observe_duration!(StorageIoOperation::Open, open_options.open(path))?;
let file = observe_duration!(StorageIoOperation::Open, {
open_options.open(path.as_std_path()).await?
});
// Strip all options other than read and write.
//
@@ -400,15 +447,13 @@ impl VirtualFile {
/// Call File::sync_all() on the underlying File.
pub async fn sync_all(&self) -> Result<(), Error> {
-with_file!(self, StorageIoOperation::Fsync, |file| file
-.as_ref()
-.sync_all())
with_file!(self, StorageIoOperation::Fsync, |file_guard| file_guard
.with_std_file(|std_file| std_file.sync_all()))
}
pub async fn metadata(&self) -> Result<fs::Metadata, Error> {
-with_file!(self, StorageIoOperation::Metadata, |file| file
-.as_ref()
-.metadata())
with_file!(self, StorageIoOperation::Metadata, |file_guard| file_guard
.with_std_file(|std_file| std_file.metadata()))
}
/// Helper function internal to `VirtualFile` that looks up the underlying File,
@@ -417,7 +462,7 @@ impl VirtualFile {
///
/// We are doing it via a macro as Rust doesn't support async closures that
/// take parameters with lifetimes.
-async fn lock_file(&self) -> Result<FileGuard<'_>, Error> {
async fn lock_file(&self) -> Result<FileGuard, Error> {
let open_files = get_open_files();
let mut handle_guard = {
@@ -463,10 +508,9 @@ impl VirtualFile {
// NB: we use StorageIoOperation::OpenAfterReplace for this to distinguish this
// case from StorageIoOperation::Open. This helps with identifying thrashing
// of the virtual file descriptor cache.
-let file = observe_duration!(
-StorageIoOperation::OpenAfterReplace,
-self.open_options.open(&self.path)
-)?;
let file = observe_duration!(StorageIoOperation::OpenAfterReplace, {
self.open_options.open(self.path.as_std_path()).await?
});
// Store the File in the slot and update the handle in the VirtualFile
// to point to it.
@@ -491,9 +535,8 @@ impl VirtualFile {
self.pos = offset;
}
SeekFrom::End(offset) => {
-self.pos = with_file!(self, StorageIoOperation::Seek, |file| file
-.as_ref()
-.seek(SeekFrom::End(offset)))?
self.pos = with_file!(self, StorageIoOperation::Seek, |mut file_guard| file_guard
.with_std_file_mut(|std_file| std_file.seek(SeekFrom::End(offset))))?
}
SeekFrom::Current(offset) => {
let pos = self.pos as i128 + offset as i128;
@@ -512,25 +555,28 @@ impl VirtualFile {
Ok(self.pos)
}
-// Copied from https://doc.rust-lang.org/1.72.0/src/std/os/unix/fs.rs.html#117-135
-pub async fn read_exact_at(&self, mut buf: &mut [u8], mut offset: u64) -> Result<(), Error> {
-while !buf.is_empty() {
-match self.read_at(buf, offset).await {
-Ok(0) => {
-return Err(Error::new(
-std::io::ErrorKind::UnexpectedEof,
-"failed to fill whole buffer",
-))
-}
-Ok(n) => {
-buf = &mut buf[n..];
-offset += n as u64;
-}
-Err(ref e) if e.kind() == std::io::ErrorKind::Interrupted => {}
-Err(e) => return Err(e),
-}
-}
-Ok(())
-}
pub async fn read_exact_at<B>(&self, buf: B, offset: u64) -> Result<B, Error>
where
B: IoBufMut + Send,
{
let (buf, res) =
read_exact_at_impl(buf, offset, |buf, offset| self.read_at(buf, offset)).await;
res.map(|()| buf)
}
/// Like [`Self::read_exact_at`] but for [`PageWriteGuard`].
pub async fn read_exact_at_page(
&self,
page: PageWriteGuard<'static>,
offset: u64,
) -> Result<PageWriteGuard<'static>, Error> {
let buf = PageWriteGuardBuf {
page,
init_up_to: 0,
};
let res = self.read_exact_at(buf, offset).await;
res.map(|PageWriteGuardBuf { page, .. }| page)
.map_err(|e| Error::new(ErrorKind::Other, e))
}
// Copied from https://doc.rust-lang.org/1.72.0/src/std/os/unix/fs.rs.html#219-235
@@ -580,22 +626,35 @@ impl VirtualFile {
Ok(n)
}
-pub async fn read_at(&self, buf: &mut [u8], offset: u64) -> Result<usize, Error> {
-let result = with_file!(self, StorageIoOperation::Read, |file| file
-.as_ref()
-.read_at(buf, offset));
-if let Ok(size) = result {
-STORAGE_IO_SIZE
-.with_label_values(&["read", &self.tenant_id, &self.shard_id, &self.timeline_id])
-.add(size as i64);
-}
-result
pub(crate) async fn read_at<B>(&self, buf: B, offset: u64) -> (B, Result<usize, Error>)
where
B: tokio_epoll_uring::BoundedBufMut + Send,
{
let file_guard = match self.lock_file().await {
Ok(file_guard) => file_guard,
Err(e) => return (buf, Err(e)),
};
observe_duration!(StorageIoOperation::Read, {
let ((_file_guard, buf), res) = io_engine::get().read_at(file_guard, offset, buf).await;
if let Ok(size) = res {
STORAGE_IO_SIZE
.with_label_values(&[
"read",
&self.tenant_id,
&self.shard_id,
&self.timeline_id,
])
.add(size as i64);
}
(buf, res)
})
}
async fn write_at(&self, buf: &[u8], offset: u64) -> Result<usize, Error> {
-let result = with_file!(self, StorageIoOperation::Write, |file| file
-.as_ref()
-.write_at(buf, offset));
let result = with_file!(self, StorageIoOperation::Write, |file_guard| {
file_guard.with_std_file(|std_file| std_file.write_at(buf, offset))
});
if let Ok(size) = result {
STORAGE_IO_SIZE
.with_label_values(&["write", &self.tenant_id, &self.shard_id, &self.timeline_id])
@@ -605,18 +664,241 @@ impl VirtualFile {
}
}
-struct FileGuard<'a> {
-slot_guard: RwLockReadGuard<'a, SlotInner>,
// Adapted from https://doc.rust-lang.org/1.72.0/src/std/os/unix/fs.rs.html#117-135
pub async fn read_exact_at_impl<B, F, Fut>(
buf: B,
mut offset: u64,
mut read_at: F,
) -> (B, std::io::Result<()>)
where
B: IoBufMut + Send,
F: FnMut(tokio_epoll_uring::Slice<B>, u64) -> Fut,
Fut: std::future::Future<Output = (tokio_epoll_uring::Slice<B>, std::io::Result<usize>)>,
{
use tokio_epoll_uring::BoundedBuf;
let mut buf: tokio_epoll_uring::Slice<B> = buf.slice_full(); // includes all the uninitialized memory
while buf.bytes_total() != 0 {
let res;
(buf, res) = read_at(buf, offset).await;
match res {
Ok(0) => break,
Ok(n) => {
buf = buf.slice(n..);
offset += n as u64;
}
Err(ref e) if e.kind() == std::io::ErrorKind::Interrupted => {}
Err(e) => return (buf.into_inner(), Err(e)),
}
}
// NB: don't use `buf.is_empty()` here; it is from the
// `impl Deref for Slice { Target = [u8] }`; the &[u8]
// returned by it only covers the initialized portion of `buf`.
// Whereas we're interested in ensuring that we filled the entire
// buffer that the user passed in.
if buf.bytes_total() != 0 {
(
buf.into_inner(),
Err(std::io::Error::new(
std::io::ErrorKind::UnexpectedEof,
"failed to fill whole buffer",
)),
)
} else {
assert_eq!(buf.len(), buf.bytes_total());
(buf.into_inner(), Ok(()))
}
-}
}
-impl<'a> AsRef<File> for FileGuard<'a> {
-fn as_ref(&self) -> &File {
#[cfg(test)]
mod test_read_exact_at_impl {
use std::{collections::VecDeque, sync::Arc};
use tokio_epoll_uring::{BoundedBuf, BoundedBufMut};
use super::read_exact_at_impl;
struct Expectation {
offset: u64,
bytes_total: usize,
result: std::io::Result<Vec<u8>>,
}
struct MockReadAt {
expectations: VecDeque<Expectation>,
}
impl MockReadAt {
async fn read_at(
&mut self,
mut buf: tokio_epoll_uring::Slice<Vec<u8>>,
offset: u64,
) -> (tokio_epoll_uring::Slice<Vec<u8>>, std::io::Result<usize>) {
let exp = self
.expectations
.pop_front()
.expect("read_at called but we have no expectations left");
assert_eq!(exp.offset, offset);
assert_eq!(exp.bytes_total, buf.bytes_total());
match exp.result {
Ok(bytes) => {
assert!(bytes.len() <= buf.bytes_total());
buf.put_slice(&bytes);
(buf, Ok(bytes.len()))
}
Err(e) => (buf, Err(e)),
}
}
}
impl Drop for MockReadAt {
fn drop(&mut self) {
assert_eq!(self.expectations.len(), 0);
}
}
#[tokio::test]
async fn test_basic() {
let buf = Vec::with_capacity(5);
let mock_read_at = Arc::new(tokio::sync::Mutex::new(MockReadAt {
expectations: VecDeque::from(vec![Expectation {
offset: 0,
bytes_total: 5,
result: Ok(vec![b'a', b'b', b'c', b'd', b'e']),
}]),
}));
let (buf, res) = read_exact_at_impl(buf, 0, |buf, offset| {
let mock_read_at = Arc::clone(&mock_read_at);
async move { mock_read_at.lock().await.read_at(buf, offset).await }
})
.await;
assert!(res.is_ok());
assert_eq!(buf, vec![b'a', b'b', b'c', b'd', b'e']);
}
#[tokio::test]
async fn test_empty_buf_issues_no_syscall() {
let buf = Vec::new();
let mock_read_at = Arc::new(tokio::sync::Mutex::new(MockReadAt {
expectations: VecDeque::new(),
}));
let (_buf, res) = read_exact_at_impl(buf, 0, |buf, offset| {
let mock_read_at = Arc::clone(&mock_read_at);
async move { mock_read_at.lock().await.read_at(buf, offset).await }
})
.await;
assert!(res.is_ok());
}
#[tokio::test]
async fn test_two_read_at_calls_needed_until_buf_filled() {
let buf = Vec::with_capacity(4);
let mock_read_at = Arc::new(tokio::sync::Mutex::new(MockReadAt {
expectations: VecDeque::from(vec![
Expectation {
offset: 0,
bytes_total: 4,
result: Ok(vec![b'a', b'b']),
},
Expectation {
offset: 2,
bytes_total: 2,
result: Ok(vec![b'c', b'd']),
},
]),
}));
let (buf, res) = read_exact_at_impl(buf, 0, |buf, offset| {
let mock_read_at = Arc::clone(&mock_read_at);
async move { mock_read_at.lock().await.read_at(buf, offset).await }
})
.await;
assert!(res.is_ok());
assert_eq!(buf, vec![b'a', b'b', b'c', b'd']);
}
#[tokio::test]
async fn test_eof_before_buffer_full() {
let buf = Vec::with_capacity(3);
let mock_read_at = Arc::new(tokio::sync::Mutex::new(MockReadAt {
expectations: VecDeque::from(vec![
Expectation {
offset: 0,
bytes_total: 3,
result: Ok(vec![b'a']),
},
Expectation {
offset: 1,
bytes_total: 2,
result: Ok(vec![b'b']),
},
Expectation {
offset: 2,
bytes_total: 1,
result: Ok(vec![]),
},
]),
}));
let (_buf, res) = read_exact_at_impl(buf, 0, |buf, offset| {
let mock_read_at = Arc::clone(&mock_read_at);
async move { mock_read_at.lock().await.read_at(buf, offset).await }
})
.await;
let Err(err) = res else {
panic!("should return an error");
};
assert_eq!(err.kind(), std::io::ErrorKind::UnexpectedEof);
assert_eq!(format!("{err}"), "failed to fill whole buffer");
// buffer contents on error are unspecified
}
}
struct FileGuard {
slot_guard: RwLockReadGuard<'static, SlotInner>,
}
impl AsRef<OwnedFd> for FileGuard {
fn as_ref(&self) -> &OwnedFd {
// This unwrap is safe because we only create `FileGuard`s
// if we know that the file is Some.
self.slot_guard.file.as_ref().unwrap()
}
}
impl FileGuard {
/// Soft deprecation: we'll move VirtualFile to async APIs and remove this function eventually.
fn with_std_file<F, R>(&self, with: F) -> R
where
F: FnOnce(&File) -> R,
{
// SAFETY:
// - lifetime of the fd: `file` doesn't outlive the OwnedFd stored in `self`.
// - `&` usage below: `self` is `&`, hence the Rust typesystem guarantees there is no `&mut`
let file = unsafe { File::from_raw_fd(self.as_ref().as_raw_fd()) };
let res = with(&file);
let _ = file.into_raw_fd();
res
}
/// Soft deprecation: we'll move VirtualFile to async APIs and remove this function eventually.
fn with_std_file_mut<F, R>(&mut self, with: F) -> R
where
F: FnOnce(&mut File) -> R,
{
// SAFETY:
// - lifetime of the fd: `file` doesn't outlive the OwnedFd stored in `self`.
// - &mut usage below: `self` is `&mut`, hence this call is the only task/thread that has control over the underlying fd
let mut file = unsafe { File::from_raw_fd(self.as_ref().as_raw_fd()) };
let res = with(&mut file);
let _ = file.into_raw_fd();
res
}
}
impl tokio_epoll_uring::IoFd for FileGuard {
unsafe fn as_fd(&self) -> RawFd {
let owned_fd: &OwnedFd = self.as_ref();
owned_fd.as_raw_fd()
}
}
#[cfg(test)]
impl VirtualFile {
pub(crate) async fn read_blk(
@@ -624,16 +906,19 @@ impl VirtualFile {
blknum: u32,
) -> Result<crate::tenant::block_io::BlockLease<'_>, std::io::Error> {
use crate::page_cache::PAGE_SZ;
-let mut buf = [0; PAGE_SZ];
-self.read_exact_at(&mut buf, blknum as u64 * (PAGE_SZ as u64))
let buf = vec![0; PAGE_SZ];
let buf = self
.read_exact_at(buf, blknum as u64 * (PAGE_SZ as u64))
.await?;
-Ok(std::sync::Arc::new(buf).into())
Ok(crate::tenant::block_io::BlockLease::Vec(buf))
}
async fn read_to_end(&mut self, buf: &mut Vec<u8>) -> Result<(), Error> {
let mut tmp = vec![0; 128];
loop {
-let mut tmp = [0; 128];
-match self.read_at(&mut tmp, self.pos).await {
let res;
(tmp, res) = self.read_at(tmp, self.pos).await;
match res {
Ok(0) => return Ok(()),
Ok(n) => {
self.pos += n as u64;
@@ -709,10 +994,12 @@ impl OpenFiles {
/// Initialize the virtual file module. This must be called once at page /// Initialize the virtual file module. This must be called once at page
/// server startup. /// server startup.
/// ///
pub fn init(num_slots: usize) { #[cfg(not(test))]
pub fn init(num_slots: usize, engine: IoEngineKind) {
if OPEN_FILES.set(OpenFiles::new(num_slots)).is_err() { if OPEN_FILES.set(OpenFiles::new(num_slots)).is_err() {
panic!("virtual_file::init called twice"); panic!("virtual_file::init called twice");
} }
io_engine::init(engine);
crate::metrics::virtual_file_descriptor_cache::SIZE_MAX.set(num_slots as u64); crate::metrics::virtual_file_descriptor_cache::SIZE_MAX.set(num_slots as u64);
} }
@@ -757,10 +1044,10 @@ mod tests {
} }
impl MaybeVirtualFile { impl MaybeVirtualFile {
async fn read_exact_at(&self, buf: &mut [u8], offset: u64) -> Result<(), Error> { async fn read_exact_at(&self, mut buf: Vec<u8>, offset: u64) -> Result<Vec<u8>, Error> {
match self { match self {
MaybeVirtualFile::VirtualFile(file) => file.read_exact_at(buf, offset).await, MaybeVirtualFile::VirtualFile(file) => file.read_exact_at(buf, offset).await,
MaybeVirtualFile::File(file) => file.read_exact_at(buf, offset), MaybeVirtualFile::File(file) => file.read_exact_at(&mut buf, offset).map(|()| buf),
} }
} }
async fn write_all_at(&self, buf: &[u8], offset: u64) -> Result<(), Error> { async fn write_all_at(&self, buf: &[u8], offset: u64) -> Result<(), Error> {
@@ -802,14 +1089,14 @@ mod tests {
// Helper function to slurp a portion of a file into a string // Helper function to slurp a portion of a file into a string
async fn read_string_at(&mut self, pos: u64, len: usize) -> Result<String, Error> { async fn read_string_at(&mut self, pos: u64, len: usize) -> Result<String, Error> {
let mut buf = vec![0; len]; let buf = vec![0; len];
self.read_exact_at(&mut buf, pos).await?; let buf = self.read_exact_at(buf, pos).await?;
Ok(String::from_utf8(buf).unwrap()) Ok(String::from_utf8(buf).unwrap())
} }
} }
#[tokio::test] #[tokio::test]
async fn test_virtual_files() -> Result<(), Error> { async fn test_virtual_files() -> anyhow::Result<()> {
// The real work is done in the test_files() helper function. This // The real work is done in the test_files() helper function. This
// allows us to run the same set of tests against a native File, and // allows us to run the same set of tests against a native File, and
// VirtualFile. We trust the native Files and wouldn't need to test them, // VirtualFile. We trust the native Files and wouldn't need to test them,
@@ -825,14 +1112,17 @@ mod tests {
} }
#[tokio::test] #[tokio::test]
async fn test_physical_files() -> Result<(), Error> { async fn test_physical_files() -> anyhow::Result<()> {
test_files("physical_files", |path, open_options| async move { test_files("physical_files", |path, open_options| async move {
Ok(MaybeVirtualFile::File(open_options.open(path)?)) Ok(MaybeVirtualFile::File({
let owned_fd = open_options.open(path.as_std_path()).await?;
File::from(owned_fd)
}))
}) })
.await .await
} }
async fn test_files<OF, FT>(testname: &str, openfunc: OF) -> Result<(), Error> async fn test_files<OF, FT>(testname: &str, openfunc: OF) -> anyhow::Result<()>
where where
OF: Fn(Utf8PathBuf, OpenOptions) -> FT, OF: Fn(Utf8PathBuf, OpenOptions) -> FT,
FT: Future<Output = Result<MaybeVirtualFile, std::io::Error>>, FT: Future<Output = Result<MaybeVirtualFile, std::io::Error>>,
@@ -976,11 +1266,11 @@ mod tests {
for _threadno in 0..THREADS { for _threadno in 0..THREADS {
let files = files.clone(); let files = files.clone();
let hdl = rt.spawn(async move { let hdl = rt.spawn(async move {
let mut buf = [0u8; SIZE]; let mut buf = vec![0u8; SIZE];
let mut rng = rand::rngs::OsRng; let mut rng = rand::rngs::OsRng;
for _ in 1..1000 { for _ in 1..1000 {
let f = &files[rng.gen_range(0..files.len())]; let f = &files[rng.gen_range(0..files.len())];
f.read_exact_at(&mut buf, 0).await.unwrap(); buf = f.read_exact_at(buf, 0).await.unwrap();
assert!(buf == SAMPLE); assert!(buf == SAMPLE);
} }
}); });

View File

@@ -0,0 +1,114 @@
//! [`super::VirtualFile`] supports different IO engines.
//!
//! The [`IoEngineKind`] enum identifies them.
//!
//! The choice of IO engine is global.
//! Initialize using [`init`].
//!
//! Then use [`get`] and [`super::OpenOptions`].
#[derive(
Copy,
Clone,
PartialEq,
Eq,
Hash,
strum_macros::EnumString,
strum_macros::Display,
serde_with::DeserializeFromStr,
serde_with::SerializeDisplay,
Debug,
)]
#[strum(serialize_all = "kebab-case")]
pub enum IoEngineKind {
StdFs,
#[cfg(target_os = "linux")]
TokioEpollUring,
}
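Because the enum derives strum's EnumString/Display with kebab-case serialization, engine names round-trip through FromStr. A hedged usage sketch, assuming the type is in scope:

use std::str::FromStr;

fn parse_engine(name: &str) -> IoEngineKind {
    // "std-fs" and, on Linux, "tokio-epoll-uring" are the kebab-case names.
    IoEngineKind::from_str(name)
        .unwrap_or_else(|e| panic!("invalid io engine {name:?}: {e}"))
}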
static IO_ENGINE: once_cell::sync::OnceCell<IoEngineKind> = once_cell::sync::OnceCell::new();
#[cfg(not(test))]
pub(super) fn init(engine: IoEngineKind) {
if IO_ENGINE.set(engine).is_err() {
panic!("called twice");
}
crate::metrics::virtual_file_io_engine::KIND
.with_label_values(&[&format!("{engine}")])
.set(1);
}
pub(super) fn get() -> &'static IoEngineKind {
#[cfg(test)]
{
let env_var_name = "NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOENGINE";
IO_ENGINE.get_or_init(|| match std::env::var(env_var_name) {
Ok(v) => match v.parse::<IoEngineKind>() {
Ok(engine_kind) => engine_kind,
Err(e) => {
panic!("invalid VirtualFile io engine for env var {env_var_name}: {e:#}: {v:?}")
}
},
Err(std::env::VarError::NotPresent) => {
crate::config::defaults::DEFAULT_VIRTUAL_FILE_IO_ENGINE
.parse()
.unwrap()
}
Err(std::env::VarError::NotUnicode(_)) => {
panic!("env var {env_var_name} is not unicode");
}
})
}
#[cfg(not(test))]
IO_ENGINE.get().unwrap()
}
use std::os::unix::prelude::FileExt;
use super::FileGuard;
impl IoEngineKind {
pub(super) async fn read_at<B>(
&self,
file_guard: FileGuard,
offset: u64,
mut buf: B,
) -> ((FileGuard, B), std::io::Result<usize>)
where
B: tokio_epoll_uring::BoundedBufMut + Send,
{
match self {
IoEngineKind::StdFs => {
// SAFETY: `dst` only lives at most as long as this match arm, during which buf remains valid memory.
let dst = unsafe {
std::slice::from_raw_parts_mut(buf.stable_mut_ptr(), buf.bytes_total())
};
let res = file_guard.with_std_file(|std_file| std_file.read_at(dst, offset));
if let Ok(nbytes) = &res {
assert!(*nbytes <= buf.bytes_total());
// SAFETY: see above assertion
unsafe {
buf.set_init(*nbytes);
}
}
#[allow(dropping_references)]
drop(dst);
((file_guard, buf), res)
}
#[cfg(target_os = "linux")]
IoEngineKind::TokioEpollUring => {
let system = tokio_epoll_uring::thread_local_system().await;
let (resources, res) = system.read(file_guard, offset, buf).await;
(
resources,
res.map_err(|e| match e {
tokio_epoll_uring::Error::Op(e) => e,
tokio_epoll_uring::Error::System(system) => {
std::io::Error::new(std::io::ErrorKind::Other, system)
}
}),
)
}
}
}
}
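The StdFs arm above bridges std's slice-based reads with the uring-style owned-buffer contract: the destination may be logically uninitialized, so after a successful read of n bytes the code must mark exactly those bytes as initialized via set_init. A simplified illustration of why that step matters, using a plain Vec whose spare capacity plays the role of the uninitialized tail (the helper is hypothetical):

fn read_into_spare_capacity(src: &[u8], buf: &mut Vec<u8>) -> usize {
    // Destination: the logically-uninitialized tail of the Vec.
    let spare = buf.spare_capacity_mut();
    let n = src.len().min(spare.len());
    for (dst, b) in spare.iter_mut().zip(&src[..n]) {
        dst.write(*b);
    }
    // The analogue of buf.set_init(nbytes): only after we promise that the
    // first n spare bytes are initialized may the length grow over them.
    unsafe { buf.set_len(buf.len() + n) };
    n
}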

View File

@@ -0,0 +1,138 @@
//! Enum-dispatch to the `OpenOptions` type of the respective [`super::IoEngineKind`];
use super::IoEngineKind;
use std::{os::fd::OwnedFd, path::Path};
#[derive(Debug, Clone)]
pub enum OpenOptions {
StdFs(std::fs::OpenOptions),
#[cfg(target_os = "linux")]
TokioEpollUring(tokio_epoll_uring::ops::open_at::OpenOptions),
}
impl Default for OpenOptions {
fn default() -> Self {
match super::io_engine::get() {
IoEngineKind::StdFs => Self::StdFs(std::fs::OpenOptions::new()),
#[cfg(target_os = "linux")]
IoEngineKind::TokioEpollUring => {
Self::TokioEpollUring(tokio_epoll_uring::ops::open_at::OpenOptions::new())
}
}
}
}
impl OpenOptions {
pub fn new() -> OpenOptions {
Self::default()
}
pub fn read(&mut self, read: bool) -> &mut OpenOptions {
match self {
OpenOptions::StdFs(x) => {
let _ = x.read(read);
}
#[cfg(target_os = "linux")]
OpenOptions::TokioEpollUring(x) => {
let _ = x.read(read);
}
}
self
}
pub fn write(&mut self, write: bool) -> &mut OpenOptions {
match self {
OpenOptions::StdFs(x) => {
let _ = x.write(write);
}
#[cfg(target_os = "linux")]
OpenOptions::TokioEpollUring(x) => {
let _ = x.write(write);
}
}
self
}
pub fn create(&mut self, create: bool) -> &mut OpenOptions {
match self {
OpenOptions::StdFs(x) => {
let _ = x.create(create);
}
#[cfg(target_os = "linux")]
OpenOptions::TokioEpollUring(x) => {
let _ = x.create(create);
}
}
self
}
pub fn create_new(&mut self, create_new: bool) -> &mut OpenOptions {
match self {
OpenOptions::StdFs(x) => {
let _ = x.create_new(create_new);
}
#[cfg(target_os = "linux")]
OpenOptions::TokioEpollUring(x) => {
let _ = x.create_new(create_new);
}
}
self
}
pub fn truncate(&mut self, truncate: bool) -> &mut OpenOptions {
match self {
OpenOptions::StdFs(x) => {
let _ = x.truncate(truncate);
}
#[cfg(target_os = "linux")]
OpenOptions::TokioEpollUring(x) => {
let _ = x.truncate(truncate);
}
}
self
}
pub(in crate::virtual_file) async fn open(&self, path: &Path) -> std::io::Result<OwnedFd> {
match self {
OpenOptions::StdFs(x) => x.open(path).map(|file| file.into()),
#[cfg(target_os = "linux")]
OpenOptions::TokioEpollUring(x) => {
let system = tokio_epoll_uring::thread_local_system().await;
system.open(path, x).await.map_err(|e| match e {
tokio_epoll_uring::Error::Op(e) => e,
tokio_epoll_uring::Error::System(system) => {
std::io::Error::new(std::io::ErrorKind::Other, system)
}
})
}
}
}
}
impl std::os::unix::prelude::OpenOptionsExt for OpenOptions {
fn mode(&mut self, mode: u32) -> &mut OpenOptions {
match self {
OpenOptions::StdFs(x) => {
let _ = x.mode(mode);
}
#[cfg(target_os = "linux")]
OpenOptions::TokioEpollUring(x) => {
let _ = x.mode(mode);
}
}
self
}
fn custom_flags(&mut self, flags: i32) -> &mut OpenOptions {
match self {
OpenOptions::StdFs(x) => {
let _ = x.custom_flags(flags);
}
#[cfg(target_os = "linux")]
OpenOptions::TokioEpollUring(x) => {
let _ = x.custom_flags(flags);
}
}
self
}
}
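A hedged usage sketch of the enum-dispatched builder above. Note that open() is crate-internal (pub(in crate::virtual_file)), so this would only compile inside that module; the path argument is hypothetical:

use std::{os::fd::OwnedFd, path::Path};

async fn open_for_rw(path: &Path) -> std::io::Result<OwnedFd> {
    // The call pattern is identical whichever engine was selected at init.
    let mut options = OpenOptions::new();
    options.read(true).write(true).create(true);
    options.open(path).await
}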

View File

@@ -26,30 +26,27 @@ use postgres_ffi::v14::nonrelfile_utils::clogpage_precedes;
use postgres_ffi::v14::nonrelfile_utils::slru_may_delete_clogsegment;
use postgres_ffi::{fsm_logical_to_physical, page_is_new, page_set_lsn};
-use std::str::FromStr;
use anyhow::{bail, Context, Result};
use bytes::{Buf, Bytes, BytesMut};
-use hex::FromHex;
use tracing::*;
use utils::failpoint_support;
use crate::context::RequestContext;
use crate::metrics::WAL_INGEST;
-use crate::pgdatadir_mapping::*;
+use crate::pgdatadir_mapping::{DatadirModification, Version};
use crate::tenant::PageReconstructError;
use crate::tenant::Timeline;
use crate::walrecord::*;
use crate::ZERO_PAGE;
+use pageserver_api::key::rel_block_to_key;
use pageserver_api::reltag::{BlockNumber, RelTag, SlruKind};
use postgres_ffi::pg_constants;
use postgres_ffi::relfile_utils::{FSM_FORKNUM, INIT_FORKNUM, MAIN_FORKNUM, VISIBILITYMAP_FORKNUM};
use postgres_ffi::v14::nonrelfile_utils::mx_offset_to_member_segment;
use postgres_ffi::v14::xlog_utils::*;
-use postgres_ffi::v14::{bindings::FullTransactionId, CheckPoint};
+use postgres_ffi::v14::CheckPoint;
use postgres_ffi::TransactionId;
use postgres_ffi::BLCKSZ;
-use utils::id::TenantId;
use utils::lsn::Lsn;
pub struct WalIngest {
@@ -112,55 +109,6 @@ impl WalIngest {
self.checkpoint_modified = true;
}
// BEGIN ONE-OFF HACK
//
// We had a bug where we incorrectly passed 0 to update_next_xid(). That was
// harmless as long as nextXid was < 2^31, because 0 looked like a very old
// XID. But once nextXid reaches 2^31, 0 starts to look like a very new XID, and
// we incorrectly bumped up nextXid to the next epoch, to value '1:1024'
//
// We have one known timeline in production where that happened. This is a one-off
// fix to fix that damage. The last WAL record on that timeline as of this writing
// is this:
//
// rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 35A/E32D86D8, prev 35A/E32D86B0, desc: RUNNING_XACTS nextXid 2325447052 latestCompletedXid 2325447051 oldestRunningXid 2325447052
//
// So on that particular timeline, before that LSN, fix the incorrectly set
// nextXid to the nextXid value from that record, plus 1000 to give some safety
// margin.
// For testing this hack, this failpoint temporarily re-introduces the bug that
// was fixed
fn reintroduce_bug_failpoint_activated() -> bool {
fail::fail_point!("reintroduce-nextxid-update-bug", |_| { true });
false
}
if decoded.xl_xid == pg_constants::INVALID_TRANSACTION_ID
&& reintroduce_bug_failpoint_activated()
&& self.checkpoint.update_next_xid(decoded.xl_xid)
{
info!(
"failpoint: Incorrectly updated nextXid at LSN {} to {}",
lsn, self.checkpoint.nextXid.value
);
self.checkpoint_modified = true;
}
if self.checkpoint.nextXid.value == 4294968320 && // 1::1024, the incorrect value
modification.tline.tenant_shard_id.tenant_id == TenantId::from_hex("df254570a4f603805528b46b0d45a76c").unwrap() &&
lsn < Lsn::from_str("35A/E32D9000").unwrap() &&
!reintroduce_bug_failpoint_activated()
{
// This is the last nextXid value from the last RUNNING_XACTS record, at the
// end of the WAL as of this writing.
self.checkpoint.nextXid = FullTransactionId {
value: 2325447052 + 1000,
};
self.checkpoint_modified = true;
warn!("nextXid fixed by one-off hack at LSN {}", lsn);
}
// END ONE-OFF HACK
match decoded.xl_rmid {
pg_constants::RM_HEAP_ID | pg_constants::RM_HEAP2_ID => {
// Heap AM records need some special handling, because they modify VM pages
@@ -1085,7 +1033,23 @@ impl WalIngest {
// Copy content
debug!("copying rel {} to {}, {} blocks", src_rel, dst_rel, nblocks);
for blknum in 0..nblocks {
-debug!("copying block {} from {} to {}", blknum, src_rel, dst_rel);
+// Sharding:
+//  - src and dst are always on the same shard, because they differ only by dbNode, and
+//    dbNode is not included in the hash inputs for sharding.
+//  - This WAL command is replayed on all shards, but each shard only copies the blocks
+//    that belong to it.
+let src_key = rel_block_to_key(src_rel, blknum);
+if !self.shard.is_key_local(&src_key) {
+debug!(
+"Skipping non-local key {} during XLOG_DBASE_CREATE",
+src_key
+);
+continue;
+}
+debug!(
+"copying block {} from {} ({}) to {}",
+blknum, src_rel, src_key, dst_rel
+);
let content = modification
.tline

View File

@@ -47,11 +47,10 @@ use crate::metrics::{
WAL_REDO_PROCESS_LAUNCH_DURATION_HISTOGRAM, WAL_REDO_RECORDS_HISTOGRAM,
WAL_REDO_RECORD_COUNTER, WAL_REDO_TIME,
};
-use crate::pgdatadir_mapping::key_to_slru_block;
use crate::repository::Key;
use crate::walrecord::NeonWalRecord;
-use pageserver_api::key::key_to_rel_block;
+use pageserver_api::key::{key_to_rel_block, key_to_slru_block};
use pageserver_api::reltag::{RelTag, SlruKind};
use postgres_ffi::pg_constants;
use postgres_ffi::relfile_utils::VISIBILITYMAP_FORKNUM;
@@ -837,9 +836,8 @@ impl WalRedoProcess {
let mut proc = { input }; // TODO: remove this legacy rename, but this keep the patch small.
let mut nwrite = 0usize;
-let mut stdin_pollfds = [PollFd::new(proc.stdin.as_raw_fd(), PollFlags::POLLOUT)];
while nwrite < writebuf.len() {
+let mut stdin_pollfds = [PollFd::new(&proc.stdin, PollFlags::POLLOUT)];
let n = loop {
match nix::poll::poll(&mut stdin_pollfds[..], wal_redo_timeout.as_millis() as i32) {
Err(nix::errno::Errno::EINTR) => continue,
@@ -878,7 +876,6 @@ impl WalRedoProcess {
// advancing processed responses number.
let mut output = self.stdout.lock().unwrap();
-let mut stdout_pollfds = [PollFd::new(output.stdout.as_raw_fd(), PollFlags::POLLIN)];
let n_processed_responses = output.n_processed_responses;
while n_processed_responses + output.pending_responses.len() <= request_no {
// We expect the WAL redo process to respond with an 8k page image. We read it
@@ -886,6 +883,7 @@ impl WalRedoProcess {
let mut resultbuf = vec![0; BLCKSZ.into()];
let mut nresult: usize = 0; // # of bytes read into 'resultbuf' so far
while nresult < BLCKSZ.into() {
+let mut stdout_pollfds = [PollFd::new(&output.stdout, PollFlags::POLLIN)];
// We do two things simultaneously: reading response from stdout
// and forward any logging information that the child writes to its stderr to the page server's log.
let n = loop {

View File

@@ -637,7 +637,7 @@ HandleAlterRole(AlterRoleStmt *stmt)
ListCell *option;
const char *role_name = stmt->role->rolename;
-if (RoleIsNeonSuperuser(role_name))
+if (RoleIsNeonSuperuser(role_name) && !superuser())
elog(ERROR, "can't ALTER neon_superuser");
foreach(option, stmt->options)

View File

@@ -15,6 +15,7 @@
#include "postgres.h" #include "postgres.h"
#include "access/xlog.h" #include "access/xlog.h"
#include "common/hashfn.h"
#include "fmgr.h" #include "fmgr.h"
#include "libpq-fe.h" #include "libpq-fe.h"
#include "libpq/libpq.h" #include "libpq/libpq.h"
@@ -38,17 +39,6 @@
#define MIN_RECONNECT_INTERVAL_USEC 1000 #define MIN_RECONNECT_INTERVAL_USEC 1000
#define MAX_RECONNECT_INTERVAL_USEC 1000000 #define MAX_RECONNECT_INTERVAL_USEC 1000000
bool connected = false;
PGconn *pageserver_conn = NULL;
/*
* WaitEventSet containing:
* - WL_SOCKET_READABLE on pageserver_conn,
* - WL_LATCH_SET on MyLatch, and
* - WL_EXIT_ON_PM_DEATH.
*/
WaitEventSet *pageserver_conn_wes = NULL;
/* GUCs */ /* GUCs */
char *neon_timeline; char *neon_timeline;
char *neon_tenant; char *neon_tenant;
@@ -59,16 +49,40 @@ char *neon_auth_token;
int readahead_buffer_size = 128; int readahead_buffer_size = 128;
int flush_every_n_requests = 8; int flush_every_n_requests = 8;
static int n_reconnect_attempts = 0; static int n_reconnect_attempts = 0;
static int max_reconnect_attempts = 60; static int max_reconnect_attempts = 60;
static int stripe_size;
#define MAX_PAGESERVER_CONNSTRING_SIZE 256
typedef struct typedef struct
{ {
LWLockId lock; char connstring[MAX_SHARDS][MAX_PAGESERVER_CONNSTRING_SIZE];
pg_atomic_uint64 update_counter; size_t num_shards;
char pageserver_connstring[MAX_PAGESERVER_CONNSTRING_SIZE]; } ShardMap;
/*
* PagestoreShmemState is kept in shared memory. It contains the connection
* strings for each shard.
*
* The "neon.pageserver_connstring" GUC is marked with the PGC_SIGHUP option,
* allowing it to be changed using pg_reload_conf(). The control plane can
* update the connection string if the pageserver crashes, is relocated, or
* new shards are added. A parsed copy of the current value of the GUC is kept
* in shared memory, updated by the postmaster, because regular backends don't
* reload the config during query execution, but we might need to re-establish
* the pageserver connection with the new connection string even in the middle
* of a query.
*
* The shared memory copy is protected by a lockless algorithm using two
* atomic counters. The counters allow a backend to quickly check if the value
* has changed since last access, and to detect and retry copying the value if
* the postmaster changes the value concurrently. (Postmaster doesn't have a
* PGPROC entry and therefore cannot use LWLocks.)
*/
typedef struct
{
pg_atomic_uint64 begin_update_counter;
pg_atomic_uint64 end_update_counter;
ShardMap shard_map;
} PagestoreShmemState; } PagestoreShmemState;
#if PG_VERSION_NUM >= 150000 #if PG_VERSION_NUM >= 150000
@@ -78,76 +92,242 @@ static void walproposer_shmem_request(void);
static shmem_startup_hook_type prev_shmem_startup_hook; static shmem_startup_hook_type prev_shmem_startup_hook;
static PagestoreShmemState *pagestore_shared; static PagestoreShmemState *pagestore_shared;
static uint64 pagestore_local_counter = 0; static uint64 pagestore_local_counter = 0;
static char local_pageserver_connstring[MAX_PAGESERVER_CONNSTRING_SIZE];
static bool pageserver_flush(void); /* This backend's per-shard connections */
static void pageserver_disconnect(void); typedef struct
{
PGconn *conn;
/*---
* WaitEventSet containing:
* - WL_SOCKET_READABLE on 'conn'
* - WL_LATCH_SET on MyLatch, and
* - WL_EXIT_ON_PM_DEATH.
*/
WaitEventSet *wes;
} PageServer;
static PageServer page_servers[MAX_SHARDS];
static bool pageserver_flush(shardno_t shard_no);
static void pageserver_disconnect(shardno_t shard_no);
static bool static bool
PagestoreShmemIsValid() PagestoreShmemIsValid(void)
{ {
return pagestore_shared && UsedShmemSegAddr; return pagestore_shared && UsedShmemSegAddr;
} }
/*
* Parse a comma-separated list of connection strings into a ShardMap.
*
* If 'result' is NULL, just checks that the input is valid. If the input is
* not valid, returns false. The contents of *result are undefined in
* that case, and must not be relied on.
*/
static bool
ParseShardMap(const char *connstr, ShardMap *result)
{
const char *p;
int nshards = 0;
if (result)
memset(result, 0, sizeof(ShardMap));
p = connstr;
nshards = 0;
for (;;)
{
const char *sep;
size_t connstr_len;
sep = strchr(p, ',');
connstr_len = sep != NULL ? sep - p : strlen(p);
if (connstr_len == 0 && sep == NULL)
break; /* ignore trailing comma */
if (nshards >= MAX_SHARDS)
{
neon_log(LOG, "Too many shards");
return false;
}
if (connstr_len >= MAX_PAGESERVER_CONNSTRING_SIZE)
{
neon_log(LOG, "Connection string too long");
return false;
}
if (result)
{
memcpy(result->connstring[nshards], p, connstr_len);
result->connstring[nshards][connstr_len] = '\0';
}
nshards++;
if (sep == NULL)
break;
p = sep + 1;
}
if (result)
result->num_shards = nshards;
return true;
}
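For readers more at home in Rust, here is a rough equivalent of ParseShardMap's rule, as a hedged sketch only (it returns the parsed list instead of filling a fixed-size struct, and mirrors the C code's single special case for a trailing comma):

const MAX_SHARDS: usize = 128;
const MAX_PAGESERVER_CONNSTRING_SIZE: usize = 256;

fn parse_shard_map(connstr: &str) -> Option<Vec<String>> {
    let parts: Vec<&str> = connstr.split(',').collect();
    let mut shards = Vec::new();
    for (i, part) in parts.iter().enumerate() {
        // Ignore a trailing comma, like the `sep == NULL` check above.
        if part.is_empty() && i == parts.len() - 1 {
            break;
        }
        // Enforce the same shard-count and length limits as the C code.
        if shards.len() >= MAX_SHARDS || part.len() >= MAX_PAGESERVER_CONNSTRING_SIZE {
            return None;
        }
        shards.push(part.to_string());
    }
    Some(shards)
}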
static bool
CheckPageserverConnstring(char **newval, void **extra, GucSource source)
{
-return strlen(*newval) < MAX_PAGESERVER_CONNSTRING_SIZE;
+char *p = *newval;
+
+return ParseShardMap(p, NULL);
}
static void
AssignPageserverConnstring(const char *newval, void *extra)
{
-if (!PagestoreShmemIsValid())
+ShardMap shard_map;
+
+/*
+ * Only postmaster updates the copy in shared memory.
+ */
+if (!PagestoreShmemIsValid() || IsUnderPostmaster)
return;
-LWLockAcquire(pagestore_shared->lock, LW_EXCLUSIVE);
-strlcpy(pagestore_shared->pageserver_connstring, newval, MAX_PAGESERVER_CONNSTRING_SIZE);
-pg_atomic_fetch_add_u64(&pagestore_shared->update_counter, 1);
-LWLockRelease(pagestore_shared->lock);
-}
-
-static bool
-CheckConnstringUpdated()
-{
-if (!PagestoreShmemIsValid())
-return false;
-return pagestore_local_counter < pg_atomic_read_u64(&pagestore_shared->update_counter);
+if (!ParseShardMap(newval, &shard_map))
+{
+/*
+ * shouldn't happen, because we already checked the value in
+ * CheckPageserverConnstring
+ */
+elog(ERROR, "could not parse shard map");
+}
+
+if (memcmp(&pagestore_shared->shard_map, &shard_map, sizeof(ShardMap)) != 0)
+{
+pg_atomic_add_fetch_u64(&pagestore_shared->begin_update_counter, 1);
+pg_write_barrier();
+memcpy(&pagestore_shared->shard_map, &shard_map, sizeof(ShardMap));
+pg_write_barrier();
+pg_atomic_add_fetch_u64(&pagestore_shared->end_update_counter, 1);
+}
+else
+{
+/* no change */
+}
}
/*
* Get the current number of shards, and/or the connection string for a
* particular shard from the shard map in shared memory.
*
* If num_shards_p is not NULL, it is set to the current number of shards.
*
* If connstr_p is not NULL, the connection string for 'shard_no' is copied to
* it. It must point to a buffer at least MAX_PAGESERVER_CONNSTRING_SIZE bytes
* long.
*
* As a side-effect, if the shard map in shared memory had changed since the
* last call, terminates all existing connections to all pageservers.
*/
static void
-ReloadConnstring()
+load_shard_map(shardno_t shard_no, char *connstr_p, shardno_t *num_shards_p)
{
-if (!PagestoreShmemIsValid())
-return;
-LWLockAcquire(pagestore_shared->lock, LW_SHARED);
-strlcpy(local_pageserver_connstring, pagestore_shared->pageserver_connstring, sizeof(local_pageserver_connstring));
-pagestore_local_counter = pg_atomic_read_u64(&pagestore_shared->update_counter);
-LWLockRelease(pagestore_shared->lock);
+uint64 begin_update_counter;
+uint64 end_update_counter;
+ShardMap *shard_map = &pagestore_shared->shard_map;
+shardno_t num_shards;
+
+/*
* Postmaster can update the shared memory values concurrently, in which
* case we would copy a garbled mix of the old and new values. We will
* detect it because the counters won't match, and retry. But it's
* important that we don't do anything within the retry-loop that would
* depend on the string having valid contents.
*/
do
{
begin_update_counter = pg_atomic_read_u64(&pagestore_shared->begin_update_counter);
end_update_counter = pg_atomic_read_u64(&pagestore_shared->end_update_counter);
num_shards = shard_map->num_shards;
if (connstr_p && shard_no < MAX_SHARDS)
strlcpy(connstr_p, shard_map->connstring[shard_no], MAX_PAGESERVER_CONNSTRING_SIZE);
pg_memory_barrier();
}
while (begin_update_counter != end_update_counter
|| begin_update_counter != pg_atomic_read_u64(&pagestore_shared->begin_update_counter)
|| end_update_counter != pg_atomic_read_u64(&pagestore_shared->end_update_counter));
if (connstr_p && shard_no >= num_shards)
neon_log(ERROR, "Shard %d is greater or equal than number of shards %d",
shard_no, num_shards);
/*
* If any of the connection strings changed, reset all connections.
*/
if (pagestore_local_counter != end_update_counter)
{
for (shardno_t i = 0; i < MAX_SHARDS; i++)
{
if (page_servers[i].conn)
pageserver_disconnect(i);
}
pagestore_local_counter = end_update_counter;
}
if (num_shards_p)
*num_shards_p = num_shards;
}
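The begin/end counter pair implements a classic sequence lock: the writer brackets its update with two counter bumps, and readers retry until they observe equal, unchanged counters around their copy. A hedged Rust sketch of just the protocol, with byte-level atomics standing in for the plain shared-memory buffer and with memory orderings simplified relative to the C code's explicit barriers:

use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

struct SeqMap<const N: usize> {
    begin: AtomicU64,
    end: AtomicU64,
    data: [AtomicU8; N],
}

impl<const N: usize> SeqMap<N> {
    /// Writer side (single writer, e.g. the postmaster).
    fn write(&self, new: &[u8; N]) {
        self.begin.fetch_add(1, Ordering::Release);
        for (slot, b) in self.data.iter().zip(new) {
            slot.store(*b, Ordering::Relaxed);
        }
        self.end.fetch_add(1, Ordering::Release);
    }

    /// Reader side: copy the bytes, then retry if the counters show that a
    /// concurrent update started or finished while we were copying.
    fn read(&self) -> [u8; N] {
        loop {
            let begin = self.begin.load(Ordering::Acquire);
            let end = self.end.load(Ordering::Acquire);
            let mut copy = [0u8; N];
            for (dst, slot) in copy.iter_mut().zip(&self.data) {
                *dst = slot.load(Ordering::Relaxed);
            }
            if begin == end
                && begin == self.begin.load(Ordering::Acquire)
                && end == self.end.load(Ordering::Acquire)
            {
                return copy;
            }
        }
    }
}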
#define MB (1024*1024)
shardno_t
get_shard_number(BufferTag *tag)
{
shardno_t n_shards;
uint32 hash;
load_shard_map(0, NULL, &n_shards);
#if PG_MAJORVERSION_NUM < 16
hash = murmurhash32(tag->rnode.relNode);
hash = hash_combine(hash, murmurhash32(tag->blockNum / stripe_size));
#else
hash = murmurhash32(tag->relNumber);
hash = hash_combine(hash, murmurhash32(tag->blockNum / stripe_size));
#endif
return hash % n_shards;
} }
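A hedged Rust rendering of the routing rule in get_shard_number above; murmurhash32 and hash_combine are stand-ins modeled on PostgreSQL's common/hashfn.h, and the stripe size is measured in blocks:

fn murmurhash32(mut h: u32) -> u32 {
    // 32-bit murmur finalizer, as in PostgreSQL's hashfn.h.
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^= h >> 16;
    h
}

fn hash_combine(a: u32, b: u32) -> u32 {
    a ^ b
        .wrapping_add(0x9e37_79b9)
        .wrapping_add(a << 6)
        .wrapping_add(a >> 2)
}

fn shard_for_block(rel_number: u32, block_num: u32, stripe_size: u32, n_shards: u32) -> u32 {
    // Blocks are striped across shards: every block in the same stripe of a
    // relation lands on the same shard.
    let hash = hash_combine(murmurhash32(rel_number), murmurhash32(block_num / stripe_size));
    hash % n_shards
}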
static bool
-pageserver_connect(int elevel)
+pageserver_connect(shardno_t shard_no, int elevel)
{
char *query;
int ret;
const char *keywords[3];
const char *values[3];
int n;
+PGconn *conn;
+WaitEventSet *wes;
+char connstr[MAX_PAGESERVER_CONNSTRING_SIZE];
static TimestampTz last_connect_time = 0;
static uint64_t delay_us = MIN_RECONNECT_INTERVAL_USEC;
TimestampTz now;
uint64_t us_since_last_connect;
-Assert(!connected);
+Assert(page_servers[shard_no].conn == NULL);
-if (CheckConnstringUpdated())
-{
-ReloadConnstring();
-}
+/*
+ * Get the connection string for this shard. If the shard map has been
+ * updated since we last looked, this will also disconnect any existing
+ * pageserver connections as a side effect.
+ */
+load_shard_map(shard_no, connstr, NULL);
now = GetCurrentTimestamp();
us_since_last_connect = now - last_connect_time;
if (us_since_last_connect < delay_us)
{
pg_usleep(delay_us - us_since_last_connect);
@@ -180,76 +360,84 @@ pageserver_connect(int elevel)
n++;
}
keywords[n] = "dbname";
-values[n] = local_pageserver_connstring;
+values[n] = connstr;
n++;
keywords[n] = NULL;
values[n] = NULL;
n++;
-pageserver_conn = PQconnectdbParams(keywords, values, 1);
-if (PQstatus(pageserver_conn) == CONNECTION_BAD)
+conn = PQconnectdbParams(keywords, values, 1);
+if (PQstatus(conn) == CONNECTION_BAD)
{
-char *msg = pchomp(PQerrorMessage(pageserver_conn));
-PQfinish(pageserver_conn);
-pageserver_conn = NULL;
+char *msg = pchomp(PQerrorMessage(conn));
+PQfinish(conn);
ereport(elevel,
(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-errmsg(NEON_TAG "could not establish connection to pageserver"),
+errmsg(NEON_TAG "[shard %d] could not establish connection to pageserver", shard_no),
errdetail_internal("%s", msg)));
+pfree(msg);
return false;
}
query = psprintf("pagestream %s %s", neon_tenant, neon_timeline);
-ret = PQsendQuery(pageserver_conn, query);
+ret = PQsendQuery(conn, query);
+pfree(query);
if (ret != 1)
{
-PQfinish(pageserver_conn);
-pageserver_conn = NULL;
-neon_log(elevel, "could not send pagestream command to pageserver");
+PQfinish(conn);
+neon_shard_log(shard_no, elevel, "could not send pagestream command to pageserver");
return false;
}
-pageserver_conn_wes = CreateWaitEventSet(TopMemoryContext, 3);
-AddWaitEventToSet(pageserver_conn_wes, WL_LATCH_SET, PGINVALID_SOCKET,
+wes = CreateWaitEventSet(TopMemoryContext, 3);
+AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
-AddWaitEventToSet(pageserver_conn_wes, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
+AddWaitEventToSet(wes, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
NULL, NULL);
-AddWaitEventToSet(pageserver_conn_wes, WL_SOCKET_READABLE, PQsocket(pageserver_conn), NULL, NULL);
+AddWaitEventToSet(wes, WL_SOCKET_READABLE, PQsocket(conn), NULL, NULL);
-while (PQisBusy(pageserver_conn))
+PG_TRY();
{
-WaitEvent event;
-/* Sleep until there's something to do */
-(void) WaitEventSetWait(pageserver_conn_wes, -1L, &event, 1, PG_WAIT_EXTENSION);
-ResetLatch(MyLatch);
-CHECK_FOR_INTERRUPTS();
-/* Data available in socket? */
-if (event.events & WL_SOCKET_READABLE)
+while (PQisBusy(conn))
{
-if (!PQconsumeInput(pageserver_conn))
+WaitEvent event;
+/* Sleep until there's something to do */
+(void) WaitEventSetWait(wes, -1L, &event, 1, PG_WAIT_EXTENSION);
+ResetLatch(MyLatch);
+CHECK_FOR_INTERRUPTS();
+/* Data available in socket? */
+if (event.events & WL_SOCKET_READABLE)
{
-char *msg = pchomp(PQerrorMessage(pageserver_conn));
+if (!PQconsumeInput(conn))
+{
+char *msg = pchomp(PQerrorMessage(conn));
-PQfinish(pageserver_conn);
-pageserver_conn = NULL;
-FreeWaitEventSet(pageserver_conn_wes);
-pageserver_conn_wes = NULL;
-neon_log(elevel, "could not complete handshake with pageserver: %s",
+PQfinish(conn);
+FreeWaitEventSet(wes);
+neon_shard_log(shard_no, elevel, "could not complete handshake with pageserver: %s",
msg);
return false;
+}
}
}
}
+PG_CATCH();
+{
+PQfinish(conn);
+FreeWaitEventSet(wes);
+PG_RE_THROW();
+}
+PG_END_TRY();
-neon_log(LOG, "libpagestore: connected to '%s'", page_server_connstring);
+neon_shard_log(shard_no, LOG, "libpagestore: connected to '%s'", connstr);
+page_servers[shard_no].conn = conn;
+page_servers[shard_no].wes = wes;
-connected = true;
return true;
}
@@ -257,9 +445,10 @@ pageserver_connect(int elevel)
* A wrapper around PQgetCopyData that checks for interrupts while sleeping.
*/
static int
-call_PQgetCopyData(char **buffer)
+call_PQgetCopyData(shardno_t shard_no, char **buffer)
{
int ret;
+PGconn *pageserver_conn = page_servers[shard_no].conn;
retry:
ret = PQgetCopyData(pageserver_conn, buffer, 1 /* async */ );
@@ -269,7 +458,7 @@ retry:
WaitEvent event;
/* Sleep until there's something to do */
-(void) WaitEventSetWait(pageserver_conn_wes, -1L, &event, 1, PG_WAIT_EXTENSION);
+(void) WaitEventSetWait(page_servers[shard_no].wes, -1L, &event, 1, PG_WAIT_EXTENSION);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
@@ -281,7 +470,7 @@ retry:
{
char *msg = pchomp(PQerrorMessage(pageserver_conn));
-neon_log(LOG, "could not get response from pageserver: %s", msg);
+neon_shard_log(shard_no, LOG, "could not get response from pageserver: %s", msg);
pfree(msg);
return -1;
}
@@ -295,7 +484,7 @@
static void
-pageserver_disconnect(void)
+pageserver_disconnect(shardno_t shard_no)
{
/*
* If anything goes wrong while we were sending a request, it's not clear
@@ -304,38 +493,38 @@ pageserver_disconnect(void)
* time later after we have already sent a new unrelated request. Close
* the connection to avoid getting confused.
*/
-if (connected)
+if (page_servers[shard_no].conn)
{
-neon_log(LOG, "dropping connection to page server due to error");
-PQfinish(pageserver_conn);
-pageserver_conn = NULL;
-connected = false;
+neon_shard_log(shard_no, LOG, "dropping connection to page server due to error");
+PQfinish(page_servers[shard_no].conn);
+page_servers[shard_no].conn = NULL;
+
+/*
+ * If the connection to any pageserver is lost, we throw away the
+ * whole prefetch queue, even for other pageservers. It should not
+ * cause big problems, because connection loss is supposed to be a
+ * rare event.
+ */
prefetch_on_ps_disconnect();
}
-if (pageserver_conn_wes != NULL)
+if (page_servers[shard_no].wes != NULL)
{
-FreeWaitEventSet(pageserver_conn_wes);
-pageserver_conn_wes = NULL;
+FreeWaitEventSet(page_servers[shard_no].wes);
+page_servers[shard_no].wes = NULL;
}
}
static bool
-pageserver_send(NeonRequest *request)
+pageserver_send(shardno_t shard_no, NeonRequest *request)
{
StringInfoData req_buff;
+PGconn *pageserver_conn = page_servers[shard_no].conn;
-if (CheckConnstringUpdated())
-{
-pageserver_disconnect();
-ReloadConnstring();
-}
/* If the connection was lost for some reason, reconnect */
-if (connected && PQstatus(pageserver_conn) == CONNECTION_BAD)
+if (pageserver_conn && PQstatus(pageserver_conn) == CONNECTION_BAD)
{
-neon_log(LOG, "pageserver_send disconnect bad connection");
-pageserver_disconnect();
+neon_shard_log(shard_no, LOG, "pageserver_send disconnect bad connection");
+pageserver_disconnect(shard_no);
}
req_buff = nm_pack_request(request);
@@ -349,9 +538,9 @@ pageserver_send(NeonRequest *request)
* https://github.com/neondatabase/neon/issues/1138 So try to reestablish
* connection in case of failure.
*/
-if (!connected)
+if (!page_servers[shard_no].conn)
{
-while (!pageserver_connect(n_reconnect_attempts < max_reconnect_attempts ? LOG : ERROR))
+while (!pageserver_connect(shard_no, n_reconnect_attempts < max_reconnect_attempts ? LOG : ERROR))
{
HandleMainLoopInterrupts();
n_reconnect_attempts += 1;
@@ -359,6 +548,8 @@ pageserver_send(NeonRequest *request)
n_reconnect_attempts = 0;
}
+pageserver_conn = page_servers[shard_no].conn;
/*
* Send request.
*
@@ -371,8 +562,8 @@ pageserver_send(NeonRequest *request)
{
char *msg = pchomp(PQerrorMessage(pageserver_conn));
-pageserver_disconnect();
-neon_log(LOG, "pageserver_send disconnect because failed to send page request (try to reconnect): %s", msg);
+pageserver_disconnect(shard_no);
+neon_shard_log(shard_no, LOG, "pageserver_send disconnect because failed to send page request (try to reconnect): %s", msg);
pfree(msg);
pfree(req_buff.data);
return false;
@@ -384,19 +575,20 @@ pageserver_send(NeonRequest *request)
{
char *msg = nm_to_string((NeonMessage *) request);
-neon_log(PageStoreTrace, "sent request: %s", msg);
+neon_shard_log(shard_no, PageStoreTrace, "sent request: %s", msg);
pfree(msg);
}
return true;
}
static NeonResponse *
-pageserver_receive(void)
+pageserver_receive(shardno_t shard_no)
{
StringInfoData resp_buff;
NeonResponse *resp;
+PGconn *pageserver_conn = page_servers[shard_no].conn;
-if (!connected)
+if (!pageserver_conn)
return NULL;
PG_TRY();
@@ -404,7 +596,7 @@ pageserver_receive(void)
/* read response */
int rc;
-rc = call_PQgetCopyData(&resp_buff.data);
+rc = call_PQgetCopyData(shard_no, &resp_buff.data);
if (rc >= 0)
{
resp_buff.len = rc;
@@ -416,33 +608,33 @@
{
char *msg = nm_to_string((NeonMessage *) resp);
-neon_log(PageStoreTrace, "got response: %s", msg);
+neon_shard_log(shard_no, PageStoreTrace, "got response: %s", msg);
pfree(msg);
}
}
else if (rc == -1)
{
-neon_log(LOG, "pageserver_receive disconnect because call_PQgetCopyData returns -1: %s", pchomp(PQerrorMessage(pageserver_conn)));
-pageserver_disconnect();
+neon_shard_log(shard_no, LOG, "pageserver_receive disconnect because call_PQgetCopyData returns -1: %s", pchomp(PQerrorMessage(pageserver_conn)));
+pageserver_disconnect(shard_no);
resp = NULL;
}
else if (rc == -2)
{
char *msg = pchomp(PQerrorMessage(pageserver_conn));
-pageserver_disconnect();
-neon_log(ERROR, "pageserver_receive disconnect because could not read COPY data: %s", msg);
+pageserver_disconnect(shard_no);
+neon_shard_log(shard_no, ERROR, "pageserver_receive disconnect because could not read COPY data: %s", msg);
}
else
{
-pageserver_disconnect();
-neon_log(ERROR, "pageserver_receive disconnect because unexpected PQgetCopyData return value: %d", rc);
+pageserver_disconnect(shard_no);
+neon_shard_log(shard_no, ERROR, "pageserver_receive disconnect because unexpected PQgetCopyData return value: %d", rc);
}
}
PG_CATCH();
{
-neon_log(LOG, "pageserver_receive disconnect due to caught exception");
-pageserver_disconnect();
+neon_shard_log(shard_no, LOG, "pageserver_receive disconnect due to caught exception");
+pageserver_disconnect(shard_no);
PG_RE_THROW();
}
PG_END_TRY();
@@ -452,11 +644,13 @@
static bool
-pageserver_flush(void)
+pageserver_flush(shardno_t shard_no)
{
-if (!connected)
+PGconn *pageserver_conn = page_servers[shard_no].conn;
+
+if (!pageserver_conn)
{
-neon_log(WARNING, "Tried to flush while disconnected");
+neon_shard_log(shard_no, WARNING, "Tried to flush while disconnected");
}
else
{
@@ -464,8 +658,8 @@
{
char *msg = pchomp(PQerrorMessage(pageserver_conn));
-pageserver_disconnect();
-neon_log(LOG, "pageserver_flush disconnect because failed to flush page requests: %s", msg);
+pageserver_disconnect(shard_no);
+neon_shard_log(shard_no, LOG, "pageserver_flush disconnect because failed to flush page requests: %s", msg);
pfree(msg);
return false;
}
@@ -505,8 +699,9 @@ PagestoreShmemInit(void)
&found);
if (!found)
{
-pagestore_shared->lock = &(GetNamedLWLockTranche("neon_libpagestore")->lock);
-pg_atomic_init_u64(&pagestore_shared->update_counter, 0);
+pg_atomic_init_u64(&pagestore_shared->begin_update_counter, 0);
+pg_atomic_init_u64(&pagestore_shared->end_update_counter, 0);
+memset(&pagestore_shared->shard_map, 0, sizeof(ShardMap));
AssignPageserverConnstring(page_server_connstring, NULL);
}
LWLockRelease(AddinShmemInitLock);
@@ -531,7 +726,6 @@
#endif
RequestAddinShmemSpace(PagestoreShmemSize());
-RequestNamedLWLockTranche("neon_libpagestore", 1);
}
static void
@@ -582,6 +776,15 @@
0, /* no flags required */
check_neon_id, NULL, NULL);
+DefineCustomIntVariable("neon.stripe_size",
+"sharding stripe size",
+NULL,
+&stripe_size,
+32768, 1, INT_MAX,
+PGC_SIGHUP,
+GUC_UNIT_BLOCKS,
+NULL, NULL, NULL);
DefineCustomIntVariable("neon.max_cluster_size",
"cluster size limit",
NULL,

View File

@@ -20,9 +20,13 @@
#include "lib/stringinfo.h" #include "lib/stringinfo.h"
#include "libpq/pqformat.h" #include "libpq/pqformat.h"
#include "storage/block.h" #include "storage/block.h"
#include "storage/buf_internals.h"
#include "storage/smgr.h" #include "storage/smgr.h"
#include "utils/memutils.h" #include "utils/memutils.h"
#define MAX_SHARDS 128
#define MAX_PAGESERVER_CONNSTRING_SIZE 256
typedef enum typedef enum
{ {
/* pagestore_client -> pagestore */ /* pagestore_client -> pagestore */
@@ -51,6 +55,9 @@ typedef struct
#define neon_log(tag, fmt, ...) ereport(tag, \ #define neon_log(tag, fmt, ...) ereport(tag, \
(errmsg(NEON_TAG fmt, ##__VA_ARGS__), \ (errmsg(NEON_TAG fmt, ##__VA_ARGS__), \
errhidestmt(true), errhidecontext(true), errposition(0), internalerrposition(0))) errhidestmt(true), errhidecontext(true), errposition(0), internalerrposition(0)))
#define neon_shard_log(shard_no, tag, fmt, ...) ereport(tag, \
(errmsg(NEON_TAG "[shard %d] " fmt, shard_no, ##__VA_ARGS__), \
errhidestmt(true), errhidecontext(true), errposition(0), internalerrposition(0)))
/* /*
* supertype of all the Neon*Request structs below * supertype of all the Neon*Request structs below
@@ -141,11 +148,13 @@ extern char *nm_to_string(NeonMessage *msg);
* API * API
*/ */
typedef unsigned shardno_t;
typedef struct typedef struct
{ {
bool (*send) (NeonRequest *request); bool (*send) (shardno_t shard_no, NeonRequest * request);
NeonResponse *(*receive) (void); NeonResponse *(*receive) (shardno_t shard_no);
bool (*flush) (void); bool (*flush) (shardno_t shard_no);
} page_server_api; } page_server_api;
extern void prefetch_on_ps_disconnect(void); extern void prefetch_on_ps_disconnect(void);
@@ -159,6 +168,8 @@ extern char *neon_timeline;
extern char *neon_tenant; extern char *neon_tenant;
extern int32 max_cluster_size; extern int32 max_cluster_size;
extern shardno_t get_shard_number(BufferTag* tag);
extern const f_smgr *smgr_neon(BackendId backend, NRelFileInfo rinfo); extern const f_smgr *smgr_neon(BackendId backend, NRelFileInfo rinfo);
extern void smgr_init_neon(void); extern void smgr_init_neon(void);
extern void readahead_buffer_resize(int newsize, void *extra); extern void readahead_buffer_resize(int newsize, void *extra);

View File

@@ -172,6 +172,7 @@ typedef struct PrefetchRequest
XLogRecPtr actual_request_lsn;
NeonResponse *response; /* may be null */
PrefetchStatus status;
+shardno_t shard_no;
uint64 my_ring_index;
} PrefetchRequest;
@@ -239,10 +240,17 @@ typedef struct PrefetchState
* also unused */
/* the buffers */
prfh_hash *prf_hash;
+int max_shard_no;
+/* Mark shards involved in prefetch */
+uint8 shard_bitmap[(MAX_SHARDS + 7)/8];
PrefetchRequest prf_buffer[]; /* prefetch buffers */
} PrefetchState;
+#define BITMAP_ISSET(bm, bit) ((bm)[(bit) >> 3] & (1 << ((bit) & 7)))
+#define BITMAP_SET(bm, bit) (bm)[(bit) >> 3] |= (1 << ((bit) & 7))
+#define BITMAP_CLR(bm, bit) (bm)[(bit) >> 3] &= ~(1 << ((bit) & 7))
static PrefetchState *MyPState;
#define GetPrfSlot(ring_index) ( \
@@ -327,6 +335,7 @@ compact_prefetch_buffers(void)
Assert(target_slot->status == PRFS_UNUSED);
target_slot->buftag = source_slot->buftag;
+target_slot->shard_no = source_slot->shard_no;
target_slot->status = source_slot->status;
target_slot->response = source_slot->response;
target_slot->effective_request_lsn = source_slot->effective_request_lsn;
@@ -494,6 +503,23 @@
}
}
static bool
prefetch_flush_requests(void)
{
for (shardno_t shard_no = 0; shard_no < MyPState->max_shard_no; shard_no++)
{
if (BITMAP_ISSET(MyPState->shard_bitmap, shard_no))
{
if (!page_server->flush(shard_no))
return false;
BITMAP_CLR(MyPState->shard_bitmap, shard_no);
}
}
MyPState->max_shard_no = 0;
return true;
}
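The shard bitmap plus the max_shard_no bound keeps the flush loop proportional to the shards actually touched by queued prefetch requests. A hedged Rust sketch of the same bookkeeping (the names are ours, not the extension's):

const MAX_SHARDS: usize = 128;

struct ShardFlushSet {
    bitmap: [u8; (MAX_SHARDS + 7) / 8],
    max_shard_no: usize,
}

impl ShardFlushSet {
    fn mark(&mut self, shard: usize) {
        self.bitmap[shard >> 3] |= 1u8 << (shard & 7);
        self.max_shard_no = self.max_shard_no.max(shard + 1);
    }

    /// Flush every marked shard; on failure, leave the remaining marks in
    /// place so the caller can retry, mirroring prefetch_flush_requests above.
    fn flush_all(&mut self, mut flush: impl FnMut(usize) -> bool) -> bool {
        for shard in 0..self.max_shard_no {
            let bit = 1u8 << (shard & 7);
            if self.bitmap[shard >> 3] & bit != 0 {
                if !flush(shard) {
                    return false;
                }
                self.bitmap[shard >> 3] &= !bit;
            }
        }
        self.max_shard_no = 0;
        true
    }
}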
/*
* Wait for slot of ring_index to have received its response.
* The caller is responsible for making sure the request buffer is flushed.
@@ -509,7 +535,7 @@ prefetch_wait_for(uint64 ring_index)
if (MyPState->ring_flush <= ring_index &&
MyPState->ring_unused > MyPState->ring_flush)
{
-if (!page_server->flush())
+if (!prefetch_flush_requests())
return false;
MyPState->ring_flush = MyPState->ring_unused;
}
@@ -547,7 +573,7 @@ prefetch_read(PrefetchRequest *slot)
Assert(slot->my_ring_index == MyPState->ring_receive);
old = MemoryContextSwitchTo(MyPState->errctx);
-response = (NeonResponse *) page_server->receive();
+response = (NeonResponse *) page_server->receive(slot->shard_no);
MemoryContextSwitchTo(old);
if (response)
{
@@ -704,12 +730,14 @@ prefetch_do_request(PrefetchRequest *slot, bool *force_latest, XLogRecPtr *force
Assert(slot->response == NULL);
Assert(slot->my_ring_index == MyPState->ring_unused);
-while (!page_server->send((NeonRequest *) &request));
+while (!page_server->send(slot->shard_no, (NeonRequest *) &request));
/* update prefetch state */
MyPState->n_requests_inflight += 1;
MyPState->n_unused -= 1;
MyPState->ring_unused += 1;
+BITMAP_SET(MyPState->shard_bitmap, slot->shard_no);
+MyPState->max_shard_no = Max(slot->shard_no+1, MyPState->max_shard_no);
/* update slot state */
slot->status = PRFS_REQUESTED;
@@ -880,6 +908,7 @@ Retry:
* function reads the buffer tag from the slot.
*/
slot->buftag = tag;
+slot->shard_no = get_shard_number(&tag);
slot->my_ring_index = ring_index;
prefetch_do_request(slot, force_latest, force_lsn);
@@ -890,7 +919,7 @@ Retry:
if (flush_every_n_requests > 0 &&
MyPState->ring_unused - MyPState->ring_flush >= flush_every_n_requests)
{
-if (!page_server->flush())
+if (!prefetch_flush_requests())
{
/*
* Prefetch set is reset in case of error, so we should try to
@@ -908,13 +937,44 @@ static NeonResponse *
page_server_request(void const *req)
{
NeonResponse *resp;
BufferTag tag = {0};
shardno_t shard_no;
switch (((NeonRequest *) req)->tag)
{
case T_NeonExistsRequest:
CopyNRelFileInfoToBufTag(tag, ((NeonExistsRequest *) req)->rinfo);
break;
case T_NeonNblocksRequest:
CopyNRelFileInfoToBufTag(tag, ((NeonNblocksRequest *) req)->rinfo);
break;
case T_NeonDbSizeRequest:
NInfoGetDbOid(BufTagGetNRelFileInfo(tag)) = ((NeonDbSizeRequest *) req)->dbNode;
break;
case T_NeonGetPageRequest:
CopyNRelFileInfoToBufTag(tag, ((NeonGetPageRequest *) req)->rinfo);
tag.blockNum = ((NeonGetPageRequest *) req)->blkno;
break;
default:
neon_log(ERROR, "Unexpected request tag: %d", ((NeonRequest *) req)->tag);
}
shard_no = get_shard_number(&tag);
/*
* Current sharding model assumes that all metadata is present only at shard 0.
* We still need to call get_shard_no() to check if shard map is up-to-date.
*/
if (((NeonRequest *) req)->tag != T_NeonGetPageRequest || ((NeonGetPageRequest *) req)->forknum != MAIN_FORKNUM)
{
shard_no = 0;
}
do
{
-while (!page_server->send((NeonRequest *) req) || !page_server->flush());
+while (!page_server->send(shard_no, (NeonRequest *) req) || !page_server->flush(shard_no));
+MyPState->ring_flush = MyPState->ring_unused;
consume_prefetch_responses();
-resp = page_server->receive();
+resp = page_server->receive(shard_no);
} while (resp == NULL);
return resp;
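A hedged sketch of the routing decision made in page_server_request above: only GetPage requests for the main fork are routed by block hash, while everything else (relation metadata, db size, non-main forks) goes to shard 0. shard_for_block refers to the earlier sketch; the enum is illustrative, not the extension's real request types:

enum NeonReq {
    GetPage { rel_number: u32, block_num: u32, is_main_fork: bool },
    Exists { rel_number: u32 },
    Nblocks { rel_number: u32 },
    DbSize { db_oid: u32 },
}

fn route_request(req: &NeonReq, stripe_size: u32, n_shards: u32) -> u32 {
    match req {
        NeonReq::GetPage { rel_number, block_num, is_main_fork: true } => {
            shard_for_block(*rel_number, *block_num, stripe_size, n_shards)
        }
        // The current sharding model keeps all metadata on shard 0.
        _ => 0,
    }
}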
@@ -2098,8 +2158,8 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
case T_NeonErrorResponse:
ereport(ERROR,
(errcode(ERRCODE_IO_ERROR),
-errmsg(NEON_TAG "could not read block %u in rel %u/%u/%u.%u from page server at lsn %X/%08X",
-blkno,
+errmsg(NEON_TAG "[shard %d] could not read block %u in rel %u/%u/%u.%u from page server at lsn %X/%08X",
+slot->shard_no, blkno,
RelFileInfoFmt(rinfo),
forkNum,
(uint32) (request_lsn >> 32), (uint32) request_lsn),

View File

@@ -5,7 +5,7 @@ edition.workspace = true
license.workspace = true
[features]
-default = ["testing"]
+default = []
testing = []
[dependencies]

View File

@@ -4,7 +4,9 @@ pub mod backend;
pub use backend::BackendType;
mod credentials;
-pub use credentials::{check_peer_addr_is_in_list, endpoint_sni, ComputeUserInfoMaybeEndpoint};
+pub use credentials::{
+check_peer_addr_is_in_list, endpoint_sni, ComputeUserInfoMaybeEndpoint, IpPattern,
+};
mod password_hack;
pub use password_hack::parse_endpoint_param;

View File

@@ -3,19 +3,18 @@ mod hacks;
mod link;
pub use link::LinkAuthError;
-use smol_str::SmolStr;
use tokio_postgres::config::AuthKeys;
use crate::auth::credentials::check_peer_addr_is_in_list;
use crate::auth::validate_password_and_exchange;
use crate::cache::Cached;
use crate::console::errors::GetAuthInfoError;
+use crate::console::provider::ConsoleBackend;
use crate::console::AuthSecret;
use crate::context::RequestMonitoring;
use crate::proxy::connect_compute::handle_try_wake;
use crate::proxy::retry::retry_after;
use crate::proxy::NeonOptions;
-use crate::scram;
use crate::stream::Stream;
use crate::{
auth::{self, ComputeUserInfoMaybeEndpoint},
@@ -27,6 +26,7 @@ use crate::{
},
stream, url,
};
+use crate::{scram, EndpointCacheKey, EndpointId, RoleName};
use futures::TryFutureExt;
use std::borrow::Cow;
use std::ops::ControlFlow;
@@ -34,6 +34,8 @@ use std::sync::Arc;
use tokio::io::{AsyncRead, AsyncWrite};
use tracing::{error, info, warn};
+use super::IpPattern;
/// This type serves two purposes:
///
/// * When `T` is `()`, it's just a regular auth backend selector
@@ -43,11 +45,8 @@ use tracing::{error, info, warn};
/// this helps us provide the credentials only to those auth
/// backends which require them for the authentication process.
pub enum BackendType<'a, T> {
-/// Current Cloud API (V2).
-Console(Cow<'a, console::provider::neon::Api>, T),
-/// Local mock of Cloud API (V2).
-#[cfg(feature = "testing")]
-Postgres(Cow<'a, console::provider::mock::Api>, T),
+/// Cloud API (V2).
+Console(Cow<'a, ConsoleBackend>, T),
/// Authentication via a web browser.
Link(Cow<'a, url::ApiUrl>),
#[cfg(test)]
@@ -57,16 +56,22 @@ pub enum BackendType<'a, T> {
pub trait TestBackend: Send + Sync + 'static {
fn wake_compute(&self) -> Result<CachedNodeInfo, console::errors::WakeComputeError>;
-fn get_allowed_ips(&self) -> Result<Vec<SmolStr>, console::errors::GetAuthInfoError>;
+fn get_allowed_ips(&self) -> Result<Vec<IpPattern>, console::errors::GetAuthInfoError>;
}
impl std::fmt::Display for BackendType<'_, ()> {
fn fmt(&self, fmt: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
use BackendType::*;
match self {
-Console(endpoint, _) => fmt.debug_tuple("Console").field(&endpoint.url()).finish(),
-#[cfg(feature = "testing")]
-Postgres(endpoint, _) => fmt.debug_tuple("Postgres").field(&endpoint.url()).finish(),
+Console(api, _) => match &**api {
+ConsoleBackend::Console(endpoint) => {
+fmt.debug_tuple("Console").field(&endpoint.url()).finish()
+}
+#[cfg(feature = "testing")]
+ConsoleBackend::Postgres(endpoint) => {
+fmt.debug_tuple("Postgres").field(&endpoint.url()).finish()
+}
+},
Link(url) => fmt.debug_tuple("Link").field(&url.as_str()).finish(),
#[cfg(test)]
Test(_) => fmt.debug_tuple("Test").finish(),
@@ -81,8 +86,6 @@ impl<T> BackendType<'_, T> {
use BackendType::*;
match self {
Console(c, x) => Console(Cow::Borrowed(c), x),
-#[cfg(feature = "testing")]
-Postgres(c, x) => Postgres(Cow::Borrowed(c), x),
Link(c) => Link(Cow::Borrowed(c)),
#[cfg(test)]
Test(x) => Test(*x),
@@ -98,8 +101,6 @@ impl<'a, T> BackendType<'a, T> {
use BackendType::*;
match self {
Console(c, x) => Console(c, f(x)),
-#[cfg(feature = "testing")]
-Postgres(c, x) => Postgres(c, f(x)),
Link(c) => Link(c),
#[cfg(test)]
Test(x) => Test(x),
@@ -114,8 +115,6 @@ impl<'a, T, E> BackendType<'a, Result<T, E>> {
use BackendType::*;
match self {
Console(c, x) => x.map(|x| Console(c, x)),
-#[cfg(feature = "testing")]
-Postgres(c, x) => x.map(|x| Postgres(c, x)),
Link(c) => Ok(Link(c)),
#[cfg(test)]
Test(x) => Ok(Test(x)),
@@ -130,19 +129,19 @@ pub struct ComputeCredentials<T> {
#[derive(Debug, Clone)]
pub struct ComputeUserInfoNoEndpoint {
-pub user: SmolStr,
+pub user: RoleName,
pub options: NeonOptions,
}
#[derive(Debug, Clone)]
pub struct ComputeUserInfo {
-pub endpoint: SmolStr,
-pub user: SmolStr,
+pub endpoint: EndpointId,
+pub user: RoleName,
pub options: NeonOptions,
}
impl ComputeUserInfo {
-pub fn endpoint_cache_key(&self) -> SmolStr {
+pub fn endpoint_cache_key(&self) -> EndpointCacheKey {
self.options.get_cache_key(&self.endpoint)
}
}
@@ -158,7 +157,7 @@ impl TryFrom<ComputeUserInfoMaybeEndpoint> for ComputeUserInfo {
type Error = ComputeUserInfoNoEndpoint;
fn try_from(user_info: ComputeUserInfoMaybeEndpoint) -> Result<Self, Self::Error> {
-match user_info.project {
+match user_info.endpoint_id {
None => Err(ComputeUserInfoNoEndpoint {
user: user_info.user,
options: user_info.options,
@@ -204,21 +203,18 @@ async fn auth_quirks(
if !check_peer_addr_is_in_list(&ctx.peer_addr, &allowed_ips) {
return Err(auth::AuthError::ip_address_not_allowed());
}
-let maybe_secret = api.get_role_secret(ctx, &info).await?;
-let cached_secret = maybe_secret.unwrap_or_else(|| {
+let cached_secret = api.get_role_secret(ctx, &info).await?;
+let secret = cached_secret.value.clone().unwrap_or_else(|| {
// If we don't have an authentication secret, we mock one to
// prevent malicious probing (possible due to missing protocol steps).
// This mocked secret will never lead to successful authentication. // This mocked secret will never lead to successful authentication.
info!("authentication info not found, mocking it"); info!("authentication info not found, mocking it");
Cached::new_uncached(AuthSecret::Scram(scram::ServerSecret::mock( AuthSecret::Scram(scram::ServerSecret::mock(&info.user, rand::random()))
&info.user,
rand::random(),
)))
}); });
match authenticate_with_secret( match authenticate_with_secret(
ctx, ctx,
cached_secret.value.clone(), secret,
info, info,
client, client,
unauthenticated_password, unauthenticated_password,
@@ -320,13 +316,11 @@ async fn auth_and_wake_compute(
impl<'a> BackendType<'a, ComputeUserInfoMaybeEndpoint> { impl<'a> BackendType<'a, ComputeUserInfoMaybeEndpoint> {
/// Get compute endpoint name from the credentials. /// Get compute endpoint name from the credentials.
pub fn get_endpoint(&self) -> Option<SmolStr> { pub fn get_endpoint(&self) -> Option<EndpointId> {
use BackendType::*; use BackendType::*;
match self { match self {
Console(_, user_info) => user_info.project.clone(), Console(_, user_info) => user_info.endpoint_id.clone(),
#[cfg(feature = "testing")]
Postgres(_, user_info) => user_info.project.clone(),
Link(_) => Some("link".into()), Link(_) => Some("link".into()),
#[cfg(test)] #[cfg(test)]
Test(_) => Some("test".into()), Test(_) => Some("test".into()),
@@ -339,8 +333,6 @@ impl<'a> BackendType<'a, ComputeUserInfoMaybeEndpoint> {
match self { match self {
Console(_, user_info) => &user_info.user, Console(_, user_info) => &user_info.user,
#[cfg(feature = "testing")]
Postgres(_, user_info) => &user_info.user,
Link(_) => "link", Link(_) => "link",
#[cfg(test)] #[cfg(test)]
Test(_) => "test", Test(_) => "test",
@@ -362,7 +354,7 @@ impl<'a> BackendType<'a, ComputeUserInfoMaybeEndpoint> {
Console(api, user_info) => { Console(api, user_info) => {
info!( info!(
user = &*user_info.user, user = &*user_info.user,
project = user_info.project(), project = user_info.endpoint(),
"performing authentication using the console" "performing authentication using the console"
); );
@@ -371,19 +363,6 @@ impl<'a> BackendType<'a, ComputeUserInfoMaybeEndpoint> {
.await?; .await?;
(cache_info, BackendType::Console(api, user_info)) (cache_info, BackendType::Console(api, user_info))
} }
#[cfg(feature = "testing")]
Postgres(api, user_info) => {
info!(
user = &*user_info.user,
project = user_info.project(),
"performing authentication using a local postgres instance"
);
let (cache_info, user_info) =
auth_and_wake_compute(ctx, &*api, user_info, client, allow_cleartext, config)
.await?;
(cache_info, BackendType::Postgres(api, user_info))
}
// NOTE: this auth backend doesn't use client credentials. // NOTE: this auth backend doesn't use client credentials.
Link(url) => { Link(url) => {
info!("performing link authentication"); info!("performing link authentication");
@@ -414,8 +393,6 @@ impl BackendType<'_, ComputeUserInfo> {
use BackendType::*; use BackendType::*;
match self { match self {
Console(api, user_info) => api.get_allowed_ips(ctx, user_info).await, Console(api, user_info) => api.get_allowed_ips(ctx, user_info).await,
#[cfg(feature = "testing")]
Postgres(api, user_info) => api.get_allowed_ips(ctx, user_info).await,
Link(_) => Ok(Cached::new_uncached(Arc::new(vec![]))), Link(_) => Ok(Cached::new_uncached(Arc::new(vec![]))),
#[cfg(test)] #[cfg(test)]
Test(x) => Ok(Cached::new_uncached(Arc::new(x.get_allowed_ips()?))), Test(x) => Ok(Cached::new_uncached(Arc::new(x.get_allowed_ips()?))),
@@ -432,8 +409,6 @@ impl BackendType<'_, ComputeUserInfo> {
match self { match self {
Console(api, user_info) => api.wake_compute(ctx, user_info).map_ok(Some).await, Console(api, user_info) => api.wake_compute(ctx, user_info).map_ok(Some).await,
#[cfg(feature = "testing")]
Postgres(api, user_info) => api.wake_compute(ctx, user_info).map_ok(Some).await,
Link(_) => Ok(None), Link(_) => Ok(None),
#[cfg(test)] #[cfg(test)]
Test(x) => x.wake_compute().map(Some), Test(x) => x.wake_compute().map(Some),
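
The refactor above folds the cfg-gated `Postgres` variant into the single `Console` variant, so the `#[cfg(feature = "testing")]` arm now appears once, inside `ConsoleBackend`'s own dispatch, instead of in every `match` over `BackendType`. A minimal, self-contained sketch of that pattern, with hypothetical `RealApi`/`MockApi` types standing in for the proxy's `neon::Api` and `mock::Api`:

    struct RealApi;
    impl RealApi {
        fn url(&self) -> &'static str {
            "https://console.example.invalid"
        }
    }

    #[cfg(feature = "testing")]
    struct MockApi;
    #[cfg(feature = "testing")]
    impl MockApi {
        fn url(&self) -> &'static str {
            "postgres://mock.example.invalid"
        }
    }

    // One enum owns the cfg-gated variant...
    enum ConsoleBackend {
        Console(RealApi),
        #[cfg(feature = "testing")]
        Postgres(MockApi),
    }

    impl ConsoleBackend {
        // ...so each operation matches on it exactly once, and callers
        // no longer repeat a cfg-gated `Postgres` arm at every site.
        fn url(&self) -> &'static str {
            match self {
                ConsoleBackend::Console(api) => api.url(),
                #[cfg(feature = "testing")]
                ConsoleBackend::Postgres(api) => api.url(),
            }
        }
    }

    fn main() {
        let backend = ConsoleBackend::Console(RealApi);
        println!("{}", backend.url());
    }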

View File

@@ -57,24 +57,31 @@ pub(super) async fn authenticate(
     link_uri: &reqwest::Url,
     client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin>,
 ) -> auth::Result<NodeInfo> {
-    let psql_session_id = new_psql_session_id();
+    // registering waiter can fail if we get unlucky with rng.
+    // just try again.
+    let (psql_session_id, waiter) = loop {
+        let psql_session_id = new_psql_session_id();
+        match console::mgmt::get_waiter(&psql_session_id) {
+            Ok(waiter) => break (psql_session_id, waiter),
+            Err(_e) => continue,
+        }
+    };
     let span = info_span!("link", psql_session_id = &psql_session_id);
     let greeting = hello_message(link_uri, &psql_session_id);
 
-    let db_info = console::mgmt::with_waiter(psql_session_id, |waiter| async {
-        // Give user a URL to spawn a new database.
-        info!(parent: &span, "sending the auth URL to the user");
-        client
-            .write_message_noflush(&Be::AuthenticationOk)?
-            .write_message_noflush(&Be::CLIENT_ENCODING)?
-            .write_message(&Be::NoticeResponse(&greeting))
-            .await?;
-
-        // Wait for web console response (see `mgmt`).
-        info!(parent: &span, "waiting for console's reply...");
-        waiter.await?.map_err(LinkAuthError::AuthFailed)
-    })
-    .await?;
+    // Give user a URL to spawn a new database.
+    info!(parent: &span, "sending the auth URL to the user");
+    client
+        .write_message_noflush(&Be::AuthenticationOk)?
+        .write_message_noflush(&Be::CLIENT_ENCODING)?
+        .write_message(&Be::NoticeResponse(&greeting))
+        .await?;
+
+    // Wait for web console response (see `mgmt`).
+    info!(parent: &span, "waiting for console's reply...");
+    let db_info = waiter.await.map_err(LinkAuthError::from)?;
 
     client.write_message_noflush(&Be::NoticeResponse("Connecting to database."))?;
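
With `get_waiter` replacing the closure-based `with_waiter`, registration now happens before the greeting is sent, and a random session-id collision simply triggers another attempt. A minimal sketch of that retry-on-collision registration, using a plain `HashMap` as a stand-in for the proxy's waiter registry (`new_session_id` is a hypothetical generator, not the real `new_psql_session_id`):

    use std::collections::{hash_map::Entry, HashMap};

    fn new_session_id(counter: &mut u64) -> u64 {
        // Stand-in for a random psql_session_id generator.
        *counter = counter.wrapping_mul(6364136223846793005).wrapping_add(1);
        *counter >> 32
    }

    /// Keep generating ids until one is free, then register under it.
    fn register(waiters: &mut HashMap<u64, ()>, counter: &mut u64) -> u64 {
        loop {
            let id = new_session_id(counter);
            match waiters.entry(id) {
                Entry::Vacant(slot) => {
                    slot.insert(());
                    break id;
                }
                // Unlucky with the rng: the id is taken, just try again.
                Entry::Occupied(_) => continue,
            }
        }
    }

    fn main() {
        let mut waiters = HashMap::new();
        let mut counter = 42;
        let a = register(&mut waiters, &mut counter);
        let b = register(&mut waiters, &mut counter);
        assert_ne!(a, b);
    }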

View File

@@ -2,12 +2,12 @@
 use crate::{
     auth::password_hack::parse_endpoint_param, context::RequestMonitoring, error::UserFacingError,
-    metrics::NUM_CONNECTION_ACCEPTED_BY_SNI, proxy::NeonOptions,
+    metrics::NUM_CONNECTION_ACCEPTED_BY_SNI, proxy::NeonOptions, EndpointId, RoleName,
 };
 use itertools::Itertools;
 use pq_proto::StartupMessageParams;
 use smol_str::SmolStr;
-use std::{collections::HashSet, net::IpAddr};
+use std::{collections::HashSet, net::IpAddr, str::FromStr};
 use thiserror::Error;
 use tracing::{info, warn};
 
@@ -21,7 +21,10 @@ pub enum ComputeUserInfoParseError {
         SNI ('{}') and project option ('{}').",
         .domain, .option,
     )]
-    InconsistentProjectNames { domain: SmolStr, option: SmolStr },
+    InconsistentProjectNames {
+        domain: EndpointId,
+        option: EndpointId,
+    },
 
     #[error(
         "Common name inferred from SNI ('{}') is not known",
@@ -30,7 +33,7 @@ pub enum ComputeUserInfoParseError {
     UnknownCommonName { cn: String },
 
     #[error("Project name ('{0}') must contain only alphanumeric characters and hyphen.")]
-    MalformedProjectName(SmolStr),
+    MalformedProjectName(EndpointId),
 }
 
 impl UserFacingError for ComputeUserInfoParseError {}
@@ -39,17 +42,15 @@ impl UserFacingError for ComputeUserInfoParseError {}
 /// Note that we don't store any kind of client key or password here.
 #[derive(Debug, Clone, PartialEq, Eq)]
 pub struct ComputeUserInfoMaybeEndpoint {
-    pub user: SmolStr,
-    // TODO: this is a severe misnomer! We should think of a new name ASAP.
-    pub project: Option<SmolStr>,
+    pub user: RoleName,
+    pub endpoint_id: Option<EndpointId>,
     pub options: NeonOptions,
 }
 
 impl ComputeUserInfoMaybeEndpoint {
     #[inline]
-    pub fn project(&self) -> Option<&str> {
-        self.project.as_deref()
+    pub fn endpoint(&self) -> Option<&str> {
+        self.endpoint_id.as_deref()
     }
 }
@@ -79,15 +80,15 @@ impl ComputeUserInfoMaybeEndpoint {
         // Some parameters are stored in the startup message.
         let get_param = |key| params.get(key).ok_or(MissingKey(key));
-        let user: SmolStr = get_param("user")?.into();
+        let user: RoleName = get_param("user")?.into();
 
         // record the values if we have them
         ctx.set_application(params.get("application_name").map(SmolStr::from));
         ctx.set_user(user.clone());
-        ctx.set_endpoint_id(sni.map(SmolStr::from));
+        ctx.set_endpoint_id(sni.map(EndpointId::from));
 
         // Project name might be passed via PG's command-line options.
-        let project_option = params
+        let endpoint_option = params
             .options_raw()
             .and_then(|options| {
                 // We support both `project` (deprecated) and `endpoint` options for backward compatibility.
@@ -100,9 +101,9 @@ impl ComputeUserInfoMaybeEndpoint {
             })
             .map(|name| name.into());
 
-        let project_from_domain = if let Some(sni_str) = sni {
+        let endpoint_from_domain = if let Some(sni_str) = sni {
             if let Some(cn) = common_names {
-                Some(SmolStr::from(endpoint_sni(sni_str, cn)?))
+                Some(EndpointId::from(endpoint_sni(sni_str, cn)?))
             } else {
                 None
             }
@@ -110,7 +111,7 @@ impl ComputeUserInfoMaybeEndpoint {
             None
         };
 
-        let project = match (project_option, project_from_domain) {
+        let endpoint = match (endpoint_option, endpoint_from_domain) {
             // Invariant: if we have both project name variants, they should match.
             (Some(option), Some(domain)) if option != domain => {
                 Some(Err(InconsistentProjectNames { domain, option }))
@@ -123,13 +124,13 @@ impl ComputeUserInfoMaybeEndpoint {
         }
         .transpose()?;
 
-        info!(%user, project = project.as_deref(), "credentials");
+        info!(%user, project = endpoint.as_deref(), "credentials");
 
         if sni.is_some() {
             info!("Connection with sni");
             NUM_CONNECTION_ACCEPTED_BY_SNI
                 .with_label_values(&["sni"])
                 .inc();
-        } else if project.is_some() {
+        } else if endpoint.is_some() {
             NUM_CONNECTION_ACCEPTED_BY_SNI
                 .with_label_values(&["no_sni"])
                 .inc();
@@ -145,36 +146,57 @@ impl ComputeUserInfoMaybeEndpoint {
         Ok(Self {
             user,
-            project,
+            endpoint_id: endpoint.map(EndpointId::from),
             options,
         })
     }
 }
 
-pub fn check_peer_addr_is_in_list(peer_addr: &IpAddr, ip_list: &Vec<SmolStr>) -> bool {
-    if ip_list.is_empty() {
-        return true;
-    }
-    for ip in ip_list {
-        // We expect that all ip addresses from control plane are correct.
-        // However, if some of them are broken, we still can check the others.
-        match parse_ip_pattern(ip) {
-            Ok(pattern) => {
-                if check_ip(peer_addr, &pattern) {
-                    return true;
-                }
-            }
-            Err(err) => warn!("Cannot parse ip: {}; err: {}", ip, err),
-        }
-    }
-    false
+pub fn check_peer_addr_is_in_list(peer_addr: &IpAddr, ip_list: &[IpPattern]) -> bool {
+    ip_list.is_empty() || ip_list.iter().any(|pattern| check_ip(peer_addr, pattern))
 }
 
 #[derive(Debug, Clone, Eq, PartialEq)]
-enum IpPattern {
+pub enum IpPattern {
     Subnet(ipnet::IpNet),
     Range(IpAddr, IpAddr),
     Single(IpAddr),
+    None,
+}
+
+impl<'de> serde::de::Deserialize<'de> for IpPattern {
+    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
+    where
+        D: serde::Deserializer<'de>,
+    {
+        struct StrVisitor;
+        impl<'de> serde::de::Visitor<'de> for StrVisitor {
+            type Value = IpPattern;
+
+            fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
+                write!(formatter, "comma separated list with ip address, ip address range, or ip address subnet mask")
+            }
+
+            fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
+            where
+                E: serde::de::Error,
+            {
+                Ok(parse_ip_pattern(v).unwrap_or_else(|e| {
+                    warn!("Cannot parse ip pattern {v}: {e}");
+                    IpPattern::None
+                }))
+            }
+        }
+        deserializer.deserialize_str(StrVisitor)
+    }
+}
+
+impl FromStr for IpPattern {
+    type Err = anyhow::Error;
+
+    fn from_str(s: &str) -> Result<Self, Self::Err> {
+        parse_ip_pattern(s)
+    }
 }
 
 fn parse_ip_pattern(pattern: &str) -> anyhow::Result<IpPattern> {
@@ -196,6 +218,7 @@ fn check_ip(ip: &IpAddr, pattern: &IpPattern) -> bool {
         IpPattern::Subnet(subnet) => subnet.contains(ip),
         IpPattern::Range(start, end) => start <= ip && ip <= end,
         IpPattern::Single(addr) => addr == ip,
+        IpPattern::None => false,
     }
 }
 
@@ -206,6 +229,7 @@ fn project_name_valid(name: &str) -> bool {
 #[cfg(test)]
 mod tests {
     use super::*;
+    use serde_json::json;
     use ComputeUserInfoParseError::*;
 
     #[test]
@@ -215,7 +239,7 @@ mod tests {
         let mut ctx = RequestMonitoring::test();
         let user_info = ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, None, None)?;
         assert_eq!(user_info.user, "john_doe");
-        assert_eq!(user_info.project, None);
+        assert_eq!(user_info.endpoint_id, None);
 
         Ok(())
     }
@@ -230,7 +254,7 @@ mod tests {
         let mut ctx = RequestMonitoring::test();
         let user_info = ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, None, None)?;
         assert_eq!(user_info.user, "john_doe");
-        assert_eq!(user_info.project, None);
+        assert_eq!(user_info.endpoint_id, None);
 
         Ok(())
     }
@@ -246,7 +270,7 @@ mod tests {
         let user_info =
             ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, sni, common_names.as_ref())?;
         assert_eq!(user_info.user, "john_doe");
-        assert_eq!(user_info.project.as_deref(), Some("foo"));
+        assert_eq!(user_info.endpoint_id.as_deref(), Some("foo"));
         assert_eq!(user_info.options.get_cache_key("foo"), "foo");
 
         Ok(())
@@ -262,7 +286,7 @@ mod tests {
         let mut ctx = RequestMonitoring::test();
         let user_info = ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, None, None)?;
         assert_eq!(user_info.user, "john_doe");
-        assert_eq!(user_info.project.as_deref(), Some("bar"));
+        assert_eq!(user_info.endpoint_id.as_deref(), Some("bar"));
 
         Ok(())
     }
@@ -277,7 +301,7 @@ mod tests {
         let mut ctx = RequestMonitoring::test();
         let user_info = ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, None, None)?;
         assert_eq!(user_info.user, "john_doe");
-        assert_eq!(user_info.project.as_deref(), Some("bar"));
+        assert_eq!(user_info.endpoint_id.as_deref(), Some("bar"));
 
         Ok(())
     }
@@ -295,7 +319,7 @@ mod tests {
         let mut ctx = RequestMonitoring::test();
         let user_info = ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, None, None)?;
         assert_eq!(user_info.user, "john_doe");
-        assert!(user_info.project.is_none());
+        assert!(user_info.endpoint_id.is_none());
 
         Ok(())
     }
@@ -310,7 +334,7 @@ mod tests {
         let mut ctx = RequestMonitoring::test();
         let user_info = ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, None, None)?;
         assert_eq!(user_info.user, "john_doe");
-        assert!(user_info.project.is_none());
+        assert!(user_info.endpoint_id.is_none());
 
         Ok(())
     }
@@ -326,7 +350,7 @@ mod tests {
         let user_info =
             ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, sni, common_names.as_ref())?;
         assert_eq!(user_info.user, "john_doe");
-        assert_eq!(user_info.project.as_deref(), Some("baz"));
+        assert_eq!(user_info.endpoint_id.as_deref(), Some("baz"));
 
         Ok(())
     }
@@ -340,14 +364,14 @@ mod tests {
         let mut ctx = RequestMonitoring::test();
         let user_info =
            ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, sni, common_names.as_ref())?;
-        assert_eq!(user_info.project.as_deref(), Some("p1"));
+        assert_eq!(user_info.endpoint_id.as_deref(), Some("p1"));
 
         let common_names = Some(["a.com".into(), "b.com".into()].into());
         let sni = Some("p1.b.com");
         let mut ctx = RequestMonitoring::test();
         let user_info =
             ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, sni, common_names.as_ref())?;
-        assert_eq!(user_info.project.as_deref(), Some("p1"));
+        assert_eq!(user_info.endpoint_id.as_deref(), Some("p1"));
 
         Ok(())
     }
@@ -404,7 +428,7 @@ mod tests {
         let mut ctx = RequestMonitoring::test();
         let user_info =
             ComputeUserInfoMaybeEndpoint::parse(&mut ctx, &options, sni, common_names.as_ref())?;
-        assert_eq!(user_info.project.as_deref(), Some("project"));
+        assert_eq!(user_info.endpoint_id.as_deref(), Some("project"));
         assert_eq!(
             user_info.options.get_cache_key("project"),
             "project endpoint_type:read_write lsn:0/2"
@@ -415,21 +439,17 @@ mod tests {
 
     #[test]
     fn test_check_peer_addr_is_in_list() {
-        let peer_addr = IpAddr::from([127, 0, 0, 1]);
-        assert!(check_peer_addr_is_in_list(&peer_addr, &vec![]));
-        assert!(check_peer_addr_is_in_list(
-            &peer_addr,
-            &vec!["127.0.0.1".into()]
-        ));
-        assert!(!check_peer_addr_is_in_list(
-            &peer_addr,
-            &vec!["8.8.8.8".into()]
-        ));
+        fn check(v: serde_json::Value) -> bool {
+            let peer_addr = IpAddr::from([127, 0, 0, 1]);
+            let ip_list: Vec<IpPattern> = serde_json::from_value(v).unwrap();
+            check_peer_addr_is_in_list(&peer_addr, &ip_list)
+        }
+
+        assert!(check(json!([])));
+        assert!(check(json!(["127.0.0.1"])));
+        assert!(!check(json!(["8.8.8.8"])));
         // If there is an incorrect address, it will be skipped.
-        assert!(check_peer_addr_is_in_list(
-            &peer_addr,
-            &vec!["88.8.8".into(), "127.0.0.1".into()]
-        ));
+        assert!(check(json!(["88.8.8", "127.0.0.1"])));
     }
 
     #[test]
     fn test_parse_ip_v4() -> anyhow::Result<()> {
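
The custom `Deserialize` impl above is deliberately lenient: a malformed entry degrades to `IpPattern::None` (which never matches) instead of failing the whole allowed-ips list, preserving the old "skip broken addresses" behavior. A minimal sketch of the same visitor pattern under that assumption, using a hypothetical `Port` type so the example stays self-contained (requires the `serde` and `serde_json` crates):

    use serde::de::{self, Deserialize, Deserializer, Visitor};

    #[derive(Debug, PartialEq)]
    enum Port {
        Valid(u16),
        Invalid,
    }

    impl<'de> Deserialize<'de> for Port {
        fn deserialize<D: Deserializer<'de>>(d: D) -> Result<Self, D::Error> {
            struct StrVisitor;
            impl<'de> Visitor<'de> for StrVisitor {
                type Value = Port;

                fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
                    write!(f, "a port number as a string")
                }

                fn visit_str<E: de::Error>(self, v: &str) -> Result<Port, E> {
                    // Parse failures become `Invalid`, never a deserialize error,
                    // so one bad entry can't poison the whole list.
                    Ok(v.parse().map(Port::Valid).unwrap_or(Port::Invalid))
                }
            }
            d.deserialize_str(StrVisitor)
        }
    }

    fn main() {
        let ports: Vec<Port> = serde_json::from_str(r#"["80", "oops"]"#).unwrap();
        assert_eq!(ports, vec![Port::Valid(80), Port::Invalid]);
    }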

View File

@@ -4,10 +4,11 @@
 //! UPDATE (Mon Aug 8 13:20:34 UTC 2022): the payload format has been simplified.
 
 use bstr::ByteSlice;
-use smol_str::SmolStr;
+
+use crate::EndpointId;
 
 pub struct PasswordHackPayload {
-    pub endpoint: SmolStr,
+    pub endpoint: EndpointId,
     pub password: Vec<u8>,
 }

View File

@@ -249,12 +249,19 @@ async fn main() -> anyhow::Result<()> {
     }
 
     if let auth::BackendType::Console(api, _) = &config.auth_backend {
-        let cache = api.caches.project_info.clone();
-        if let Some(url) = args.redis_notifications {
-            info!("Starting redis notifications listener ({url})");
-            maintenance_tasks.spawn(notifications::task_main(url.to_owned(), cache.clone()));
+        match &**api {
+            proxy::console::provider::ConsoleBackend::Console(api) => {
+                let cache = api.caches.project_info.clone();
+                if let Some(url) = args.redis_notifications {
+                    info!("Starting redis notifications listener ({url})");
+                    maintenance_tasks
+                        .spawn(notifications::task_main(url.to_owned(), cache.clone()));
+                }
+                maintenance_tasks.spawn(async move { cache.clone().gc_worker().await });
+            }
+            #[cfg(feature = "testing")]
+            proxy::console::provider::ConsoleBackend::Postgres(_) => {}
         }
-        maintenance_tasks.spawn(async move { cache.clone().gc_worker().await });
     }
 
     let maintenance = loop {
@@ -351,13 +358,15 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
             let endpoint = http::Endpoint::new(url, http::new_client(rate_limiter_config));
             let api = console::provider::neon::Api::new(endpoint, caches, locks);
+            let api = console::provider::ConsoleBackend::Console(api);
             auth::BackendType::Console(Cow::Owned(api), ())
         }
         #[cfg(feature = "testing")]
         AuthBackend::Postgres => {
             let url = args.auth_endpoint.parse()?;
             let api = console::provider::mock::Api::new(url);
-            auth::BackendType::Postgres(Cow::Owned(api), ())
+            let api = console::provider::ConsoleBackend::Postgres(api);
+            auth::BackendType::Console(Cow::Owned(api), ())
         }
         AuthBackend::Link => {
             let url = args.uri.parse()?;

View File

@@ -11,13 +11,16 @@ use smol_str::SmolStr;
 use tokio::time::Instant;
 use tracing::{debug, info};
 
-use crate::{config::ProjectInfoCacheOptions, console::AuthSecret};
+use crate::{
+    auth::IpPattern, config::ProjectInfoCacheOptions, console::AuthSecret, EndpointId, ProjectId,
+    RoleName,
+};
 
 use super::{Cache, Cached};
 
 pub trait ProjectInfoCache {
-    fn invalidate_allowed_ips_for_project(&self, project_id: &SmolStr);
-    fn invalidate_role_secret_for_project(&self, project_id: &SmolStr, role_name: &SmolStr);
+    fn invalidate_allowed_ips_for_project(&self, project_id: &ProjectId);
+    fn invalidate_role_secret_for_project(&self, project_id: &ProjectId, role_name: &RoleName);
     fn enable_ttl(&self);
     fn disable_ttl(&self);
 }
@@ -44,8 +47,8 @@ impl<T> From<T> for Entry<T> {
 #[derive(Default)]
 struct EndpointInfo {
-    secret: std::collections::HashMap<SmolStr, Entry<AuthSecret>>,
-    allowed_ips: Option<Entry<Arc<Vec<SmolStr>>>>,
+    secret: std::collections::HashMap<RoleName, Entry<Option<AuthSecret>>>,
+    allowed_ips: Option<Entry<Arc<Vec<IpPattern>>>>,
 }
 
 impl EndpointInfo {
@@ -57,10 +60,10 @@ impl EndpointInfo {
     }
     pub fn get_role_secret(
         &self,
-        role_name: &SmolStr,
+        role_name: &RoleName,
         valid_since: Instant,
         ignore_cache_since: Option<Instant>,
-    ) -> Option<(AuthSecret, bool)> {
+    ) -> Option<(Option<AuthSecret>, bool)> {
         if let Some(secret) = self.secret.get(role_name) {
             if valid_since < secret.created_at {
                 return Some((
@@ -76,7 +79,7 @@ impl EndpointInfo {
         &self,
         valid_since: Instant,
         ignore_cache_since: Option<Instant>,
-    ) -> Option<(Arc<Vec<SmolStr>>, bool)> {
+    ) -> Option<(Arc<Vec<IpPattern>>, bool)> {
         if let Some(allowed_ips) = &self.allowed_ips {
             if valid_since < allowed_ips.created_at {
                 return Some((
@@ -90,7 +93,7 @@ impl EndpointInfo {
     pub fn invalidate_allowed_ips(&mut self) {
         self.allowed_ips = None;
     }
-    pub fn invalidate_role_secret(&mut self, role_name: &SmolStr) {
+    pub fn invalidate_role_secret(&mut self, role_name: &RoleName) {
         self.secret.remove(role_name);
     }
 }
@@ -103,9 +106,9 @@ impl EndpointInfo {
 /// One may ask, why the data is stored per project, when on the user request there is only data about the endpoint available?
 /// On the cplane side updates are done per project (or per branch), so it's easier to invalidate the whole project cache.
 pub struct ProjectInfoCacheImpl {
-    cache: DashMap<SmolStr, EndpointInfo>,
+    cache: DashMap<EndpointId, EndpointInfo>,
 
-    project2ep: DashMap<SmolStr, HashSet<SmolStr>>,
+    project2ep: DashMap<ProjectId, HashSet<EndpointId>>,
     config: ProjectInfoCacheOptions,
 
     start_time: Instant,
@@ -113,7 +116,7 @@ pub struct ProjectInfoCacheImpl {
 }
 
 impl ProjectInfoCache for ProjectInfoCacheImpl {
-    fn invalidate_allowed_ips_for_project(&self, project_id: &SmolStr) {
+    fn invalidate_allowed_ips_for_project(&self, project_id: &ProjectId) {
         info!("invalidating allowed ips for project `{}`", project_id);
         let endpoints = self
             .project2ep
@@ -126,7 +129,7 @@ impl ProjectInfoCache for ProjectInfoCacheImpl {
             }
         }
     }
-    fn invalidate_role_secret_for_project(&self, project_id: &SmolStr, role_name: &SmolStr) {
+    fn invalidate_role_secret_for_project(&self, project_id: &ProjectId, role_name: &RoleName) {
         info!(
             "invalidating role secret for project_id `{}` and role_name `{}`",
             project_id, role_name
@@ -167,9 +170,9 @@ impl ProjectInfoCacheImpl {
     pub fn get_role_secret(
         &self,
-        endpoint_id: &SmolStr,
-        role_name: &SmolStr,
-    ) -> Option<Cached<&Self, AuthSecret>> {
+        endpoint_id: &EndpointId,
+        role_name: &RoleName,
+    ) -> Option<Cached<&Self, Option<AuthSecret>>> {
         let (valid_since, ignore_cache_since) = self.get_cache_times();
         let endpoint_info = self.cache.get(endpoint_id)?;
         let (value, ignore_cache) =
@@ -188,8 +191,8 @@ impl ProjectInfoCacheImpl {
     }
     pub fn get_allowed_ips(
         &self,
-        endpoint_id: &SmolStr,
-    ) -> Option<Cached<&Self, Arc<Vec<SmolStr>>>> {
+        endpoint_id: &EndpointId,
+    ) -> Option<Cached<&Self, Arc<Vec<IpPattern>>>> {
         let (valid_since, ignore_cache_since) = self.get_cache_times();
         let endpoint_info = self.cache.get(endpoint_id)?;
         let value = endpoint_info.get_allowed_ips(valid_since, ignore_cache_since);
@@ -205,10 +208,10 @@ impl ProjectInfoCacheImpl {
     }
     pub fn insert_role_secret(
         &self,
-        project_id: &SmolStr,
-        endpoint_id: &SmolStr,
-        role_name: &SmolStr,
-        secret: AuthSecret,
+        project_id: &ProjectId,
+        endpoint_id: &EndpointId,
+        role_name: &RoleName,
+        secret: Option<AuthSecret>,
     ) {
         if self.cache.len() >= self.config.size {
             // If there are too many entries, wait until the next gc cycle.
@@ -222,9 +225,9 @@ impl ProjectInfoCacheImpl {
     }
     pub fn insert_allowed_ips(
         &self,
-        project_id: &SmolStr,
-        endpoint_id: &SmolStr,
-        allowed_ips: Arc<Vec<SmolStr>>,
+        project_id: &ProjectId,
+        endpoint_id: &EndpointId,
+        allowed_ips: Arc<Vec<IpPattern>>,
     ) {
         if self.cache.len() >= self.config.size {
             // If there are too many entries, wait until the next gc cycle.
@@ -236,7 +239,7 @@ impl ProjectInfoCacheImpl {
             .or_default()
             .allowed_ips = Some(allowed_ips.into());
     }
-    fn inser_project2endpoint(&self, project_id: &SmolStr, endpoint_id: &SmolStr) {
+    fn inser_project2endpoint(&self, project_id: &ProjectId, endpoint_id: &EndpointId) {
         if let Some(mut endpoints) = self.project2ep.get_mut(project_id) {
             endpoints.insert(endpoint_id.clone());
         } else {
@@ -266,7 +269,7 @@ impl ProjectInfoCacheImpl {
             tokio::time::interval(self.config.gc_interval / (self.cache.shards().len()) as u32);
         loop {
             interval.tick().await;
-            if self.cache.len() <= self.config.size {
+            if self.cache.len() < self.config.size {
                 // If there are not too many entries, wait until the next gc cycle.
                 continue;
             }
@@ -297,18 +300,18 @@ impl ProjectInfoCacheImpl {
 /// This is used to invalidate cache entries.
 pub struct CachedLookupInfo {
     /// Search by this key.
-    endpoint_id: SmolStr,
+    endpoint_id: EndpointId,
     lookup_type: LookupType,
 }
 
 impl CachedLookupInfo {
-    pub(self) fn new_role_secret(endpoint_id: SmolStr, role_name: SmolStr) -> Self {
+    pub(self) fn new_role_secret(endpoint_id: EndpointId, role_name: RoleName) -> Self {
         Self {
             endpoint_id,
             lookup_type: LookupType::RoleSecret(role_name),
         }
     }
-    pub(self) fn new_allowed_ips(endpoint_id: SmolStr) -> Self {
+    pub(self) fn new_allowed_ips(endpoint_id: EndpointId) -> Self {
         Self {
             endpoint_id,
             lookup_type: LookupType::AllowedIps,
@@ -317,7 +320,7 @@ impl CachedLookupInfo {
 }
 
 enum LookupType {
-    RoleSecret(SmolStr),
+    RoleSecret(RoleName),
     AllowedIps,
 }
@@ -348,7 +351,6 @@ impl Cache for ProjectInfoCacheImpl {
 mod tests {
     use super::*;
     use crate::{console::AuthSecret, scram::ServerSecret};
-    use smol_str::SmolStr;
     use std::{sync::Arc, time::Duration};
 
     #[tokio::test]
@@ -362,11 +364,17 @@ mod tests {
         });
         let project_id = "project".into();
         let endpoint_id = "endpoint".into();
-        let user1: SmolStr = "user1".into();
-        let user2: SmolStr = "user2".into();
-        let secret1 = AuthSecret::Scram(ServerSecret::mock(user1.as_str(), [1; 32]));
-        let secret2 = AuthSecret::Scram(ServerSecret::mock(user2.as_str(), [2; 32]));
-        let allowed_ips = Arc::new(vec!["allowed_ip1".into(), "allowed_ip2".into()]);
+        let user1: RoleName = "user1".into();
+        let user2: RoleName = "user2".into();
+        let secret1 = Some(AuthSecret::Scram(ServerSecret::mock(
+            user1.as_str(),
+            [1; 32],
+        )));
+        let secret2 = None;
+        let allowed_ips = Arc::new(vec![
+            "127.0.0.1".parse().unwrap(),
+            "127.0.0.2".parse().unwrap(),
+        ]);
         cache.insert_role_secret(&project_id, &endpoint_id, &user1, secret1.clone());
         cache.insert_role_secret(&project_id, &endpoint_id, &user2, secret2.clone());
         cache.insert_allowed_ips(&project_id, &endpoint_id, allowed_ips.clone());
@@ -379,8 +387,11 @@ mod tests {
         assert_eq!(cached.value, secret2);
 
         // Shouldn't add more than 2 roles.
-        let user3: SmolStr = "user3".into();
-        let secret3 = AuthSecret::Scram(ServerSecret::mock(user3.as_str(), [3; 32]));
+        let user3: RoleName = "user3".into();
+        let secret3 = Some(AuthSecret::Scram(ServerSecret::mock(
+            user3.as_str(),
+            [3; 32],
+        )));
         cache.insert_role_secret(&project_id, &endpoint_id, &user3, secret3.clone());
         assert!(cache.get_role_secret(&endpoint_id, &user3).is_none());
@@ -411,11 +422,20 @@ mod tests {
         let project_id = "project".into();
         let endpoint_id = "endpoint".into();
-        let user1: SmolStr = "user1".into();
-        let user2: SmolStr = "user2".into();
-        let secret1 = AuthSecret::Scram(ServerSecret::mock(user1.as_str(), [1; 32]));
-        let secret2 = AuthSecret::Scram(ServerSecret::mock(user2.as_str(), [2; 32]));
-        let allowed_ips = Arc::new(vec!["allowed_ip1".into(), "allowed_ip2".into()]);
+        let user1: RoleName = "user1".into();
+        let user2: RoleName = "user2".into();
+        let secret1 = Some(AuthSecret::Scram(ServerSecret::mock(
+            user1.as_str(),
+            [1; 32],
+        )));
+        let secret2 = Some(AuthSecret::Scram(ServerSecret::mock(
+            user2.as_str(),
+            [2; 32],
+        )));
+        let allowed_ips = Arc::new(vec![
+            "127.0.0.1".parse().unwrap(),
+            "127.0.0.2".parse().unwrap(),
+        ]);
         cache.insert_role_secret(&project_id, &endpoint_id, &user1, secret1.clone());
         cache.insert_role_secret(&project_id, &endpoint_id, &user2, secret2.clone());
         cache.insert_allowed_ips(&project_id, &endpoint_id, allowed_ips.clone());
@@ -457,11 +477,20 @@ mod tests {
         let project_id = "project".into();
         let endpoint_id = "endpoint".into();
-        let user1: SmolStr = "user1".into();
-        let user2: SmolStr = "user2".into();
-        let secret1 = AuthSecret::Scram(ServerSecret::mock(user1.as_str(), [1; 32]));
-        let secret2 = AuthSecret::Scram(ServerSecret::mock(user2.as_str(), [2; 32]));
-        let allowed_ips = Arc::new(vec!["allowed_ip1".into(), "allowed_ip2".into()]);
+        let user1: RoleName = "user1".into();
+        let user2: RoleName = "user2".into();
+        let secret1 = Some(AuthSecret::Scram(ServerSecret::mock(
+            user1.as_str(),
+            [1; 32],
+        )));
+        let secret2 = Some(AuthSecret::Scram(ServerSecret::mock(
+            user2.as_str(),
+            [2; 32],
+        )));
+        let allowed_ips = Arc::new(vec![
+            "127.0.0.1".parse().unwrap(),
+            "127.0.0.2".parse().unwrap(),
+        ]);
         cache.insert_role_secret(&project_id, &endpoint_id, &user1, secret1.clone());
         cache.clone().disable_ttl();
         tokio::time::advance(Duration::from_millis(100)).await;
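
A notable semantic change in this file: the cache now stores `Option<AuthSecret>` rather than `AuthSecret`, so "this role has no secret" is itself a cacheable answer and repeated probes for a missing role no longer bypass the cache. A minimal sketch of that two-level `Option` shape, with a hypothetical `Secret` type in place of the proxy's `AuthSecret`:

    use std::collections::HashMap;

    #[derive(Clone, Debug, PartialEq)]
    struct Secret(&'static str);

    #[derive(Default)]
    struct RoleSecretCache {
        // Outer Option: cache miss. Inner Option: cached "role has no secret".
        secrets: HashMap<&'static str, Option<Secret>>,
    }

    impl RoleSecretCache {
        fn lookup(&self, role: &str) -> Option<Option<Secret>> {
            self.secrets.get(role).cloned()
        }
    }

    fn main() {
        let mut cache = RoleSecretCache::default();
        cache.secrets.insert("alice", Some(Secret("scram-...")));
        cache.secrets.insert("ghost", None); // negative entry, still a cache hit

        assert_eq!(cache.lookup("alice"), Some(Some(Secret("scram-..."))));
        assert_eq!(cache.lookup("ghost"), Some(None)); // no refetch needed
        assert_eq!(cache.lookup("unknown"), None); // genuine miss: ask the cplane
    }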

View File

@@ -1,7 +1,10 @@
 use serde::Deserialize;
-use smol_str::SmolStr;
 use std::fmt;
 
+use crate::auth::IpPattern;
+use crate::{BranchId, EndpointId, ProjectId};
+
 /// Generic error response with human-readable description.
 /// Note that we can't always present it to user as is.
 #[derive(Debug, Deserialize)]
@@ -14,8 +17,8 @@ pub struct ConsoleError {
 #[derive(Deserialize)]
 pub struct GetRoleSecret {
     pub role_secret: Box<str>,
-    pub allowed_ips: Option<Vec<Box<str>>>,
-    pub project_id: Option<Box<str>>,
+    pub allowed_ips: Option<Vec<IpPattern>>,
+    pub project_id: Option<ProjectId>,
 }
 
 // Manually implement debug to omit sensitive info.
@@ -92,9 +95,9 @@ impl fmt::Debug for DatabaseInfo {
 /// Also known as `ProxyMetricsAuxInfo` in the console.
 #[derive(Debug, Deserialize, Clone, Default)]
 pub struct MetricsAuxInfo {
-    pub endpoint_id: SmolStr,
-    pub project_id: SmolStr,
-    pub branch_id: SmolStr,
+    pub endpoint_id: EndpointId,
+    pub project_id: ProjectId,
+    pub branch_id: BranchId,
 }
 
 impl MetricsAuxInfo {

View File

@@ -13,16 +13,10 @@ use tracing::{error, info, info_span, Instrument};
 static CPLANE_WAITERS: Lazy<Waiters<ComputeReady>> = Lazy::new(Default::default);
 
 /// Give caller an opportunity to wait for the cloud's reply.
-pub async fn with_waiter<R, T, E>(
+pub fn get_waiter(
     psql_session_id: impl Into<String>,
-    action: impl FnOnce(Waiter<'static, ComputeReady>) -> R,
-) -> Result<T, E>
-where
-    R: std::future::Future<Output = Result<T, E>>,
-    E: From<waiters::RegisterError>,
-{
-    let waiter = CPLANE_WAITERS.register(psql_session_id.into())?;
-    action(waiter).await
+) -> Result<Waiter<'static, ComputeReady>, waiters::RegisterError> {
+    CPLANE_WAITERS.register(psql_session_id.into())
 }
 
 pub fn notify(psql_session_id: &str, msg: ComputeReady) -> Result<(), waiters::NotifyError> {
@@ -77,7 +71,7 @@ async fn handle_connection(socket: TcpStream) -> Result<(), QueryError> {
 }
 
 /// A message received by `mgmt` when a compute node is ready.
-pub type ComputeReady = Result<DatabaseInfo, String>;
+pub type ComputeReady = DatabaseInfo;
 
 // TODO: replace with an http-based protocol.
 struct MgmtHandler;
@@ -102,7 +96,7 @@ fn try_process_query(pgb: &mut PostgresBackendTCP, query: &str) -> Result<(), Qu
     let _enter = span.enter();
     info!("got response: {:?}", resp.result);
 
-    match notify(resp.session_id, Ok(resp.result)) {
+    match notify(resp.session_id, resp.result) {
         Ok(()) => {
             pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
                 .write_message_noflush(&BeMessage::DataRow(&[Some(b"ok")]))?

View File

@@ -4,16 +4,15 @@ pub mod neon;
 use super::messages::MetricsAuxInfo;
 use crate::{
-    auth::backend::ComputeUserInfo,
+    auth::{backend::ComputeUserInfo, IpPattern},
     cache::{project_info::ProjectInfoCacheImpl, Cached, TimedLru},
     compute,
     config::{CacheOptions, ProjectInfoCacheOptions},
     context::RequestMonitoring,
-    scram,
+    scram, EndpointCacheKey, ProjectId,
 };
 use async_trait::async_trait;
 use dashmap::DashMap;
-use smol_str::SmolStr;
 use std::{sync::Arc, time::Duration};
 use tokio::sync::{OwnedSemaphorePermit, Semaphore};
 use tokio::time::Instant;
@@ -212,9 +211,9 @@ pub enum AuthSecret {
 pub struct AuthInfo {
     pub secret: Option<AuthSecret>,
     /// List of IP addresses allowed for the autorization.
-    pub allowed_ips: Vec<SmolStr>,
+    pub allowed_ips: Vec<IpPattern>,
     /// Project ID. This is used for cache invalidation.
-    pub project_id: Option<SmolStr>,
+    pub project_id: Option<ProjectId>,
 }
 
 /// Info for establishing a connection to a compute node.
@@ -233,10 +232,10 @@ pub struct NodeInfo {
     pub allow_self_signed_compute: bool,
 }
 
-pub type NodeInfoCache = TimedLru<SmolStr, NodeInfo>;
+pub type NodeInfoCache = TimedLru<EndpointCacheKey, NodeInfo>;
 pub type CachedNodeInfo = Cached<&'static NodeInfoCache>;
-pub type CachedRoleSecret = Cached<&'static ProjectInfoCacheImpl, AuthSecret>;
-pub type CachedAllowedIps = Cached<&'static ProjectInfoCacheImpl, Arc<Vec<SmolStr>>>;
+pub type CachedRoleSecret = Cached<&'static ProjectInfoCacheImpl, Option<AuthSecret>>;
+pub type CachedAllowedIps = Cached<&'static ProjectInfoCacheImpl, Arc<Vec<IpPattern>>>;
 
 /// This will allocate per each call, but the http requests alone
 /// already require a few allocations, so it should be fine.
@@ -248,23 +247,75 @@ pub trait Api {
     async fn get_role_secret(
         &self,
         ctx: &mut RequestMonitoring,
-        creds: &ComputeUserInfo,
-    ) -> Result<Option<CachedRoleSecret>, errors::GetAuthInfoError>;
+        user_info: &ComputeUserInfo,
+    ) -> Result<CachedRoleSecret, errors::GetAuthInfoError>;
 
     async fn get_allowed_ips(
         &self,
         ctx: &mut RequestMonitoring,
-        creds: &ComputeUserInfo,
+        user_info: &ComputeUserInfo,
     ) -> Result<CachedAllowedIps, errors::GetAuthInfoError>;
 
     /// Wake up the compute node and return the corresponding connection info.
     async fn wake_compute(
         &self,
         ctx: &mut RequestMonitoring,
-        creds: &ComputeUserInfo,
+        user_info: &ComputeUserInfo,
     ) -> Result<CachedNodeInfo, errors::WakeComputeError>;
 }
 
+#[derive(Clone)]
+pub enum ConsoleBackend {
+    /// Current Cloud API (V2).
+    Console(neon::Api),
+    /// Local mock of Cloud API (V2).
+    #[cfg(feature = "testing")]
+    Postgres(mock::Api),
+}
+
+#[async_trait]
+impl Api for ConsoleBackend {
+    async fn get_role_secret(
+        &self,
+        ctx: &mut RequestMonitoring,
+        user_info: &ComputeUserInfo,
+    ) -> Result<CachedRoleSecret, errors::GetAuthInfoError> {
+        use ConsoleBackend::*;
+        match self {
+            Console(api) => api.get_role_secret(ctx, user_info).await,
+            #[cfg(feature = "testing")]
+            Postgres(api) => api.get_role_secret(ctx, user_info).await,
+        }
+    }
+
+    async fn get_allowed_ips(
+        &self,
+        ctx: &mut RequestMonitoring,
+        user_info: &ComputeUserInfo,
+    ) -> Result<CachedAllowedIps, errors::GetAuthInfoError> {
+        use ConsoleBackend::*;
+        match self {
+            Console(api) => api.get_allowed_ips(ctx, user_info).await,
+            #[cfg(feature = "testing")]
+            Postgres(api) => api.get_allowed_ips(ctx, user_info).await,
+        }
+    }
+
+    async fn wake_compute(
+        &self,
+        ctx: &mut RequestMonitoring,
+        user_info: &ComputeUserInfo,
+    ) -> Result<CachedNodeInfo, errors::WakeComputeError> {
+        use ConsoleBackend::*;
+        match self {
+            Console(api) => api.wake_compute(ctx, user_info).await,
+            #[cfg(feature = "testing")]
+            Postgres(api) => api.wake_compute(ctx, user_info).await,
+        }
+    }
+}
+
 /// Various caches for [`console`](super).
 pub struct ApiCaches {
     /// Cache for the `wake_compute` API method.
@@ -293,7 +344,7 @@ impl ApiCaches {
 /// Various caches for [`console`](super).
 pub struct ApiLocks {
     name: &'static str,
-    node_locks: DashMap<SmolStr, Arc<Semaphore>>,
+    node_locks: DashMap<EndpointCacheKey, Arc<Semaphore>>,
     permits: usize,
     timeout: Duration,
     registered: prometheus::IntCounter,
@@ -361,7 +412,7 @@ impl ApiLocks {
     pub async fn get_wake_compute_permit(
         &self,
-        key: &SmolStr,
+        key: &EndpointCacheKey,
     ) -> Result<WakeComputePermit, errors::WakeComputeError> {
         if self.permits == 0 {
             return Ok(WakeComputePermit { permit: None });
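
The new `ConsoleBackend` implements the same `Api` trait as the backends it wraps, forwarding each async method; `#[async_trait]` is what makes async methods usable in a trait here. A minimal sketch of delegating an async trait through an enum, using a hypothetical `Ping` trait (requires the `async-trait` and `tokio` crates):

    use async_trait::async_trait;

    #[async_trait]
    trait Ping {
        async fn ping(&self) -> &'static str;
    }

    struct Real;
    struct Mock;

    #[async_trait]
    impl Ping for Real {
        async fn ping(&self) -> &'static str {
            "real"
        }
    }

    #[async_trait]
    impl Ping for Mock {
        async fn ping(&self) -> &'static str {
            "mock"
        }
    }

    enum Backend {
        Real(Real),
        Mock(Mock),
    }

    // The enum forwards each call to whichever variant it holds.
    #[async_trait]
    impl Ping for Backend {
        async fn ping(&self) -> &'static str {
            match self {
                Backend::Real(api) => api.ping().await,
                Backend::Mock(api) => api.ping().await,
            }
        }
    }

    #[tokio::main]
    async fn main() {
        assert_eq!(Backend::Real(Real).ping().await, "real");
        assert_eq!(Backend::Mock(Mock).ping().await, "mock");
    }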

View File

@@ -4,14 +4,13 @@ use super::{
     errors::{ApiError, GetAuthInfoError, WakeComputeError},
     AuthInfo, AuthSecret, CachedNodeInfo, NodeInfo,
 };
-use crate::cache::Cached;
 use crate::console::provider::{CachedAllowedIps, CachedRoleSecret};
 use crate::context::RequestMonitoring;
 use crate::{auth::backend::ComputeUserInfo, compute, error::io_error, scram, url::ApiUrl};
+use crate::{auth::IpPattern, cache::Cached};
 use async_trait::async_trait;
 use futures::TryFutureExt;
-use smol_str::SmolStr;
-use std::sync::Arc;
+use std::{str::FromStr, sync::Arc};
 use thiserror::Error;
 use tokio_postgres::{config::SslMode, Client};
 use tracing::{error, info, info_span, warn, Instrument};
@@ -88,7 +87,9 @@ impl Api {
         {
             Some(s) => {
                 info!("got allowed_ips: {s}");
-                s.split(',').map(String::from).collect()
+                s.split(',')
+                    .map(|s| IpPattern::from_str(s).unwrap())
+                    .collect()
             }
             None => vec![],
         };
@@ -100,7 +101,7 @@ impl Api {
             .await?;
         Ok(AuthInfo {
             secret,
-            allowed_ips: allowed_ips.iter().map(SmolStr::from).collect(),
+            allowed_ips,
             project_id: None,
         })
     }
@@ -150,12 +151,10 @@ impl super::Api for Api {
         &self,
         _ctx: &mut RequestMonitoring,
         user_info: &ComputeUserInfo,
-    ) -> Result<Option<CachedRoleSecret>, GetAuthInfoError> {
-        Ok(self
-            .do_get_auth_info(user_info)
-            .await?
-            .secret
-            .map(CachedRoleSecret::new_uncached))
+    ) -> Result<CachedRoleSecret, GetAuthInfoError> {
+        Ok(CachedRoleSecret::new_uncached(
+            self.do_get_auth_info(user_info).await?.secret,
+        ))
     }
 
     async fn get_allowed_ips(

View File

@@ -14,8 +14,6 @@ use crate::{
 };
 use async_trait::async_trait;
 use futures::TryFutureExt;
-use itertools::Itertools;
-use smol_str::SmolStr;
 use std::sync::Arc;
 use tokio::time::Instant;
 use tokio_postgres::config::SslMode;
@@ -86,20 +84,20 @@ impl Api {
             },
         };
 
-        let secret = scram::ServerSecret::parse(&body.role_secret)
-            .map(AuthSecret::Scram)
-            .ok_or(GetAuthInfoError::BadSecret)?;
-        let allowed_ips = body
-            .allowed_ips
-            .into_iter()
-            .flatten()
-            .map(SmolStr::from)
-            .collect_vec();
+        let secret = if body.role_secret.is_empty() {
+            None
+        } else {
+            let secret = scram::ServerSecret::parse(&body.role_secret)
+                .map(AuthSecret::Scram)
+                .ok_or(GetAuthInfoError::BadSecret)?;
+            Some(secret)
+        };
+        let allowed_ips = body.allowed_ips.unwrap_or_default();
         ALLOWED_IPS_NUMBER.observe(allowed_ips.len() as f64);
         Ok(AuthInfo {
-            secret: Some(secret),
+            secret,
             allowed_ips,
-            project_id: body.project_id.map(SmolStr::from),
+            project_id: body.project_id,
         })
     }
     .map_err(crate::error::log_error)
@@ -172,19 +170,20 @@ impl super::Api for Api {
         &self,
         ctx: &mut RequestMonitoring,
         user_info: &ComputeUserInfo,
-    ) -> Result<Option<CachedRoleSecret>, GetAuthInfoError> {
+    ) -> Result<CachedRoleSecret, GetAuthInfoError> {
         let ep = &user_info.endpoint;
         let user = &user_info.user;
         if let Some(role_secret) = self.caches.project_info.get_role_secret(ep, user) {
-            return Ok(Some(role_secret));
+            return Ok(role_secret);
        }
         let auth_info = self.do_get_auth_info(ctx, user_info).await?;
         if let Some(project_id) = auth_info.project_id {
-            if let Some(secret) = &auth_info.secret {
-                self.caches
-                    .project_info
-                    .insert_role_secret(&project_id, ep, user, secret.clone())
-            }
+            self.caches.project_info.insert_role_secret(
+                &project_id,
+                ep,
+                user,
+                auth_info.secret.clone(),
+            );
             self.caches.project_info.insert_allowed_ips(
                 &project_id,
                 ep,
@@ -192,7 +191,7 @@ impl super::Api for Api {
             );
         }
         // When we just got a secret, we don't need to invalidate it.
-        Ok(auth_info.secret.map(Cached::new_uncached))
+        Ok(Cached::new_uncached(auth_info.secret))
     }
 
     async fn get_allowed_ips(
@@ -214,11 +213,12 @@ impl super::Api for Api {
         let allowed_ips = Arc::new(auth_info.allowed_ips);
         let user = &user_info.user;
         if let Some(project_id) = auth_info.project_id {
-            if let Some(secret) = &auth_info.secret {
-                self.caches
-                    .project_info
-                    .insert_role_secret(&project_id, ep, user, secret.clone())
-            }
+            self.caches.project_info.insert_role_secret(
+                &project_id,
+                ep,
+                user,
+                auth_info.secret.clone(),
+            );
             self.caches
                 .project_info
                 .insert_allowed_ips(&project_id, ep, allowed_ips.clone());
@@ -238,7 +238,7 @@ impl super::Api for Api {
         // for some time (highly depends on the console's scale-to-zero policy);
         // The connection info remains the same during that period of time,
         // which means that we might cache it to reduce the load and latency.
-        if let Some(cached) = self.caches.node_info.get(&*key) {
+        if let Some(cached) = self.caches.node_info.get(&key) {
             info!(key = &*key, "found cached compute node info");
             return Ok(cached);
         }
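
The `do_get_auth_info` hunk above changes what an empty `role_secret` in the console response means: it is now treated as "role has no secret" (`None`, which is cached as such) rather than a parse failure, while a non-empty string must still parse as a SCRAM secret. A minimal sketch of that split, with a hypothetical `parse_secret` standing in for `scram::ServerSecret::parse`:

    // Hypothetical stand-in: "parses" anything containing a ':'.
    fn parse_secret(s: &str) -> Option<String> {
        s.contains(':').then(|| s.to_string())
    }

    fn secret_from_body(role_secret: &str) -> Result<Option<String>, &'static str> {
        if role_secret.is_empty() {
            return Ok(None); // absent secret is a valid, cacheable answer
        }
        parse_secret(role_secret).map(Some).ok_or("bad secret")
    }

    fn main() {
        assert_eq!(secret_from_body("").unwrap(), None);
        assert!(secret_from_body("garbage").is_err());
        assert!(secret_from_body("iter:salt").unwrap().is_some());
    }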

Some files were not shown because too many files have changed in this diff