Compare commits

..

63 Commits

Author SHA1 Message Date
Nikita Kalyanov
96a4e8de66 Add /terminate API (#6745) (#6853)
this is to speed up suspends, see
https://github.com/neondatabase/cloud/issues/10284


Cherry-pick to release branch to build new compute images
2024-02-22 11:51:19 +02:00
Arseny Sher
01180666b0 Merge pull request #6803 from neondatabase/releases/2024-02-19
Release 2024-02-19
2024-02-19 16:38:35 +04:00
John Spray
5667372c61 pageserver: during shard split, wait for child to activate (#6789)
## Problem

test_sharding_split_unsharded was flaky with log errors from tenants not
being active. This was happening when the split function enters
wait_lsn() while the child shard might still be activating. It's flaky
rather than an outright failure because activation is usually very fast.

This is also a real bug fix, because in realistic scenarios we could
proceed to detach the parent shard before the children are ready,
leading to an availability gap for clients.

## Summary of changes

- Do a short wait_to_become_active on the child shards before proceeding
to wait for their LSNs to advance

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-02-18 15:55:19 +00:00
Alexander Bayandin
61f99d703d test_create_snapshot: do not try to copy pg_dynshmem dir (#6796)
## Problem
`test_create_snapshot` is flaky[0] on CI and fails constantly on macOS,
but with a slightly different error:
```
shutil.Error: [('/Users/bayandin/work/neon/test_output/test_create_snapshot[release-pg15-1-100]/repo/endpoints/ep-1/pgdata/pg_dynshmem', '/Users/bayandin/work/neon/test_output/compatibility_snapshot_pgv15/repo/endpoints/ep-1/pgdata/pg_dynshmem', "[Errno 2] No such file or directory: '/Users/bayandin/work/neon/test_output/test_create_snapshot[release-pg15-1-100]/repo/endpoints/ep-1/pgdata/pg_dynshmem'")]
```
Also (on macOS) `repo/endpoints/ep-1/pgdata/pg_dynshmem` is a symlink
to `/dev/shm/`.

- [0] https://github.com/neondatabase/neon/issues/6784

## Summary of changes
Ignore `pg_dynshmem` directory while copying a snapshot
2024-02-18 12:16:07 +00:00
John Spray
24014d8383 pageserver: fix sharding emitting empty image layers during compaction (#6776)
## Problem

Sharded tenants would sometimes try to write empty image layers during
compaction: this was more noticeable on larger databases.
- https://github.com/neondatabase/neon/issues/6755

**Note to reviewers: the last commit is a refactor that de-intents a
whole block, I recommend reviewing the earlier commits one by one to see
the real changes**

## Summary of changes

- Fix a case where when we drop a key during compaction, we might fail
to write out keys (this was broken when vectored get was added)
- If an image layer is empty, then do not try and write it out, but
leave `start` where it is so that if the subsequent key range meets
criteria for writing an image layer, we will extend its key range to
cover the empty area.
- Add a compaction test that configures small layers and compaction
thresholds, and asserts that we really successfully did image layer
generation. This fails before the fix.
2024-02-18 08:51:12 +00:00
Konstantin Knizhnik
e3ded64d1b Support pg-ivm extension (#6793)
## Problem

See https://github.com/neondatabase/cloud/issues/10268

## Summary of changes

Add pg_ivm extension

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-02-17 22:13:25 +02:00
dependabot[bot]
9b714c8572 build(deps): bump cryptography from 42.0.0 to 42.0.2 (#6792) 2024-02-17 19:15:21 +00:00
Alex Chi Z
29fb675432 Revert "fix superuser permission check for extensions (#6733)" (#6791)
This reverts commit 9ad940086c.

This pull request reverts #6733 to avoid incompatibility with pgvector
and I will push further fixes later. Note that after reverting this pull
request, the postgres submodule will point to some detached branches.
2024-02-16 20:50:09 +00:00
Christian Schwarz
ca07fa5f8b per-TenantShard read throttling (#6706) 2024-02-16 21:26:59 +01:00
John Spray
5d039c6e9b libs: add 'generations_api' auth scope (#6783)
## Problem

Even if you're not enforcing auth, the JwtAuth middleware barfs on
scopes it doesn't know about.

Add `generations_api` scope, which was invented in the cloud control
plane for the pageserver's /re-attach and /validate upcalls: this will
be enforced in storage controller's implementation of these in a later
PR.

Unfortunately the scope's naming doesn't match the other scope's naming
styles, so needs a manual serde decorator to give it an underscore.

## Summary of changes

- Add `Scope::GenerationsApi` variant
- Update pageserver + safekeeper auth code to print appropriate message
if they see it.
2024-02-16 15:53:09 +00:00
Calin Anca
36e1100949 bench_walredo: use tokio multi-threaded runtime (#6743)
fixes https://github.com/neondatabase/neon/issues/6648

Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-02-16 16:31:54 +01:00
Alexander Bayandin
59c5b374de test_pageserver_max_throughput_getpage_at_latest_lsn: disable on CI (#6785)
## Problem
`test_pageserver_max_throughput_getpage_at_latest_lsn` is flaky which
makes CI status red pretty frequently. `benchmarks` is not a blocking
job (doesn't block `deploy`), so having it red might hide failures in
other jobs

Ref: https://github.com/neondatabase/neon/issues/6724

## Summary of changes
- Disable `test_pageserver_max_throughput_getpage_at_latest_lsn` on CI
until it fixed
2024-02-16 15:30:04 +00:00
Arpad Müller
0f3b87d023 Add test for pageserver_directory_entries_count metric (#6767)
Adds a simple test to ensure the metric works.

The test creates a bunch of relations to activate the metric.

Follow-up of #6736
2024-02-16 14:53:36 +00:00
Konstantin Knizhnik
c19625a29c Support sharding for compute_ctl (#6787)
## Problem

See https://github.com/neondatabase/neon/issues/6786

## Summary of changes

Split connection string in compute.rs when requesting basebackup
2024-02-16 14:50:09 +00:00
John Spray
f2e5212fed storage controller: background reconcile, graceful shutdown, better logging (#6709)
## Problem

Now that the storage controller is working end to end, we start burning
down the robustness aspects.

## Summary of changes

- Add a background task that periodically calls `reconcile_all`. This
ensures that if earlier operations couldn't succeed (e.g. because a node
was unavailable), we will eventually retry. This is a naive initial
implementation can start an unlimited number of reconcile tasks:
limiting reconcile concurrency is a later item in #6342
- Add a number of tracing spans in key locations: each background task,
each reconciler task.
- Add a top level CancellationToken and Gate, and use these to implement
a graceful shutdown that waits for tasks to shut down. This is not
bulletproof yet, because within these tasks we have remote HTTP calls
that aren't wrapped in cancellation/timeouts, but it creates the
structure, and if we don't shutdown promptly then k8s will kill us.
- To protect shard splits from background reconciliation, expose the `SplitState`
in memory and use it to guard any APIs that require an attached tenant.
2024-02-16 13:00:53 +00:00
Christian Schwarz
568bc1fde3 fix(build): production flamegraphs are useless (#6764) 2024-02-16 10:12:34 +00:00
Christian Schwarz
45e929c069 stop reading local metadata file (#6777) 2024-02-16 09:35:11 +00:00
John Spray
6b980f38da libs: refactor ShardCount.0 to private (#6690)
## Problem

The ShardCount type has a magic '0' value that represents a legacy
single-sharded tenant, whose TenantShardId is formatted without a
`-0001` suffix (i.e. formatted as a traditional TenantId).

This was error-prone in code locations that wanted the actual number of
shards: they had to handle the 0 case specially.

## Summary of changes

- Make the internal value of ShardCount private, and expose `count()`
and `literal()` getters so that callers have to explicitly say whether
they want the literal value (e.g. for storing in a TenantShardId), or
the actual number of shards in the tenant.


---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-02-15 21:59:39 +00:00
MMeent
f0d8bd7855 Update Makefile (#6779)
This fixes issues where `neon-pg-ext-clean-vYY` is used as target and
resolves using the `neon-pg-ext-%` template with `$*` resolving as `clean-vYY`, for
older versions of GNU Make, rather than `neon-pg-ext-clean-%` using `$*` = `vYY`

## Problem

```
$ make clean
...
rm -f pg_config_paths.h

Compiling neon clean-v14

mkdir -p /Users/<user>/neon-build//pg_install//build/neon-clean-v14

/Applications/Xcode.app/Contents/Developer/usr/bin/make PG_CONFIG=/Users/<user>/neon-build//pg_install//clean-v14/bin/pg_config CFLAGS='-O0 -g3  ' \

        -C /Users/<user>/neon-build//pg_install//build/neon-clean-v14 \

        -f /Users/<user>/neon-build//pgxn/neon/Makefile install

make[1]: /Users/<user>/neon-build//pg_install//clean-v14/bin/pg_config: Command not found

make[1]: *** No rule to make target `install'.  Stop.

make: *** [neon-pg-ext-clean-v14] Error 2
```
2024-02-15 19:48:50 +00:00
Joonas Koivunen
046d9c69e6 fix: require wider jwt for changing the io engine (#6770)
io-engine should not be changeable with any JWT token, for example the
tenant_id scoped token which computes have.
2024-02-15 16:58:26 +00:00
Alexander Bayandin
c72cb44213 test_runner/performance: parametrize benchmarks (#6744)
## Problem
Currently, we don't store `PLATFORM` for Nightly Benchmarks. It
causes them to be merged as reruns in Allure report (because they have
the same test name).

## Summary of changes
- Parametrize benchmarks by 
  - Postgres Version (14/15/16)
  - Build Type (debug/release/remote)
  - PLATFORM (neon-staging/github-actions-selfhosted/...)

---------

Co-authored-by: Bodobolero <peterbendel@neon.tech>
2024-02-15 15:53:58 +00:00
Arpad Müller
cd3e4ac18d Rename TEST_IMG function to test_img (#6762)
Latter follows the canonical way to naming functions in Rust.
2024-02-15 15:14:51 +00:00
Alex Chi Z
9ad940086c fix superuser permission check for extensions (#6733)
close https://github.com/neondatabase/neon/issues/6236

This pull request bumps neon postgres dependencies. The corresponding
postgres commits fix the checks for superuser permission when creating
an extension. Also, for creating native functinos, it now allows
neon_superuser only in the extension creation process.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-02-15 14:59:13 +00:00
Joonas Koivunen
936f2ee2a5 fix: accidential wide span in tests (#6772)
introduced in a PR without other #[tracing::instrument] changes.
2024-02-15 13:48:44 +00:00
Heikki Linnakangas
1af047dd3e Fix typo in CI message (#6749) 2024-02-15 14:34:19 +02:00
Conrad Ludgate
6c94269c32 Merge pull request #6758 from neondatabase/release-proxy-2024-02-14
2024-02-14 Proxy Release
2024-02-15 09:45:08 +00:00
John Spray
5fa747e493 pageserver: shard splitting refinements (parent deletion, hard linking) (#6725)
## Problem

- We weren't deleting parent shard contents once the split was done
- Re-downloading layers into child shards is wasteful

## Summary of changes

- Hard-link layers into child chart local storage during split
- Delete parent shards content at the end

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-02-15 10:21:53 +02:00
Anna Khanova
edc691647d Proxy: remove fail fast logic to connect to compute (#6759)
## Problem

Flaky tests

## Summary of changes

Remove failfast logic
2024-02-15 07:42:12 +00:00
Joonas Koivunen
80854b98ff move timeouts and cancellation handling to remote_storage (#6697)
Cancellation and timeouts are handled at remote_storage callsites, if
they are. However they should always be handled, because we've had
transient problems with remote storage connections.

- Add cancellation token to the `trait RemoteStorage` methods
- For `download*`, `list*` methods there is
`DownloadError::{Cancelled,Timeout}`
- For the rest now using `anyhow::Error`, it will have root cause
`remote_storage::TimeoutOrCancel::{Cancel,Timeout}`
- Both types have `::is_permanent` equivalent which should be passed to
`backoff::retry`
- New generic RemoteStorageConfig option `timeout`, defaults to 120s
- Start counting timeouts only after acquiring concurrency limiter
permit
- Cancellable permit acquiring
- Download stream timeout or cancellation is communicated via an
`std::io::Error`
- Exit backoff::retry by marking cancellation errors permanent

Fixes: #6096
Closes: #4781

Co-authored-by: arpad-m <arpad-m@users.noreply.github.com>
2024-02-14 23:24:07 +00:00
Christian Schwarz
024372a3db Revert "refactor(VirtualFile::crashsafe_overwrite): avoid Handle::block_on in callers" (#6765)
Reverts neondatabase/neon#6731

On high tenant count Pageservers in staging, memory and CPU usage shoots
to 100% with this change. (NB: staging currently has tokio-epoll-uring
enabled)

Will analyze tomorrow.


https://neondb.slack.com/archives/C03H1K0PGKH/p1707933875639379?thread_ts=1707929541.125329&cid=C03H1K0PGKH
2024-02-14 19:17:12 +00:00
Shayan Hosseini
fff2468aa2 Add resource consume test funcs (#6747)
## Problem

Building on #5875 to add handy test functions for autoscaling.

Resolves #5609

## Summary of changes

This PR makes the following changes to #5875:
- Enable `neon_test_utils` extension in the compute node docker image,
so we could use it in the e2e tests (as discussed with @kelvich).
- Removed test functions related to disk as we don't use them for
autoscaling.
- Fix the warning with printf-ing unsigned long variables.

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-02-14 18:45:05 +00:00
Anna Khanova
c7538a2c20 Proxy: remove fail fast logic to connect to compute (#6759)
## Problem

Flaky tests

## Summary of changes

Remove failfast logic
2024-02-14 18:43:52 +00:00
Arpad Müller
a2d0d44b42 Remove unused allow's (#6760)
These allow's became redundant some time ago so remove them, or address
them if addressing is very simple.
2024-02-14 18:16:05 +00:00
Christian Schwarz
7d3cdc05d4 fix(pageserver): pagebench doesn't work with released artifacts (#6757)
The canonical release artifact of neon.git is the Docker image with all
the binaries in them:

```
docker pull neondatabase/neon:release-4854
docker create --name extract neondatabase/neon:release-4854
docker cp extract:/usr/local/bin/pageserver ./pageserver.release-4854
chmod +x pageserver.release-4854
cp -a pageserver.release-4854 ./target/release/pageserver
```

Before this PR, these artifacts didn't expose the `keyspace` API,
thereby preventing `pagebench get-page-latest-lsn` from working.

Having working pagebench is useful, e.g., for experiments in staging.
So, expose the API, but don't document it, as it's not part of the
interface with control plane.
2024-02-14 17:01:15 +00:00
John Spray
840abe3954 pageserver: store aux files as deltas (#6742)
## Problem

Aux files were stored with an O(N^2) cost, since on each modification
the entire map is re-written as a page image.

This addresses one axis of the inefficiency in logical replication's use
of storage (https://github.com/neondatabase/neon/issues/6626). It will
still be writing a large amount of duplicative data if writing the same
slot's state every 15 seconds, but the impact will be O(N) instead of
O(N^2).

## Summary of changes

- Introduce `NeonWalRecord::AuxFile`
- In `DatadirModification`, if the AUX_FILES_KEY has already been set,
then write a delta instead of an image
2024-02-14 15:01:16 +00:00
Christian Schwarz
774a6e7475 refactor(virtual_file) make write_all_at take owned buffers (#6673)
context: https://github.com/neondatabase/neon/issues/6663

Building atop #6664, this PR switches `write_all_at` to take owned
buffers.

The main challenge here is the `EphemeralFile::mutable_tail`, for which
I'm picking the ugly solution of an `Option` that is `None` while the IO
is in flight.

After this, we will be able to switch `write_at` to take owned buffers
and call tokio-epoll-uring's `write` function with that owned buffer.
That'll be done in #6378.
2024-02-14 15:59:06 +01:00
Conrad Ludgate
855d7b4781 hold cancel session (#6750)
## Problem

In a recent refactor, we accidentally dropped the cancel session early

## Summary of changes

Hold the cancel session during proxy passthrough
2024-02-14 14:57:22 +00:00
Anna Khanova
c49c9707ce Proxy: send cancel notifications to all instances (#6719)
## Problem

If cancel request ends up on the wrong proxy instance, it doesn't take
an effect.

## Summary of changes

Send redis notifications to all proxy pods about the cancel request.

Related issue: https://github.com/neondatabase/neon/issues/5839,
https://github.com/neondatabase/cloud/issues/10262
2024-02-14 14:57:22 +00:00
Anna Khanova
2227540a0d Proxy refactor auth+connect (#6708)
## Problem

Not really a problem, just refactoring.

## Summary of changes

Separate authenticate from wake compute.

Do not call wake compute second time if we managed to connect to
postgres or if we got it not from cache.
2024-02-14 14:57:22 +00:00
Conrad Ludgate
f1347f2417 proxy: add more http logging (#6726)
## Problem

hard to see where time is taken during HTTP flow.

## Summary of changes

add a lot more for query state. add a conn_id field to the sql-over-http
span
2024-02-14 14:57:22 +00:00
Conrad Ludgate
30b295b017 proxy: some more parquet data (#6711)
## Summary of changes

add auth_method and database to the parquet logs
2024-02-14 14:57:22 +00:00
Anna Khanova
1cef395266 Proxy: copy bidirectional fork (#6720)
## Problem

`tokio::io::copy_bidirectional` doesn't close the connection once one of
the sides closes it. It's not really suitable for the postgres protocol.

## Summary of changes

Fork `copy_bidirectional` and initiate a shutdown for both connections.

---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2024-02-14 14:57:22 +00:00
Christian Schwarz
df5d588f63 refactor(VirtualFile::crashsafe_overwrite): avoid Handle::block_on in callers (#6731)
Some callers of `VirtualFile::crashsafe_overwrite` call it on the
executor thread, thereby potentially stalling it.

Others are more diligent and wrap it in `spawn_blocking(...,
Handle::block_on, ... )` to avoid stalling the executor thread.

However, because `crashsafe_overwrite` uses
VirtualFile::open_with_options internally, we spawn a new thread-local
`tokio-epoll-uring::System` in the blocking pool thread that's used for
the `spawn_blocking` call.

This PR refactors the situation such that we do the `spawn_blocking`
inside `VirtualFile::crashsafe_overwrite`. This unifies the situation
for the better:

1. Callers who didn't wrap in `spawn_blocking(..., Handle::block_on,
...)` before no longer stall the executor.
2. Callers who did it before now can avoid the `block_on`, resolving the
problem with the short-lived `tokio-epoll-uring::System`s in the
blocking pool threads.

A future PR will build on top of this and divert to tokio-epoll-uring if
it's configures as the IO engine.

Changes
-------

- Convert implementation to std::fs and move it into `crashsafe.rs`
- Yes, I know, Safekeepers (cc @arssher ) added `durable_rename` and
`fsync_async_opt` recently. However, `crashsafe_overwrite` is different
in the sense that it's higher level, i.e., it's more like
`std::fs::write` and the Safekeeper team's code is more building block
style.
- The consequence is that we don't use the VirtualFile file descriptor
cache anymore.
- I don't think it's a big deal because we have plenty of slack wrt
production file descriptor limit rlimit (see [this
dashboard](https://neonprod.grafana.net/d/e4a40325-9acf-4aa0-8fd9-f6322b3f30bd/pageserver-open-file-descriptors?orgId=1))

- Use `tokio::task::spawn_blocking` in
`VirtualFile::crashsafe_overwrite` to call the new
`crashsafe::overwrite` API.
- Inspect all callers to remove any double-`spawn_blocking`
- spawn_blocking requires the captures data to be 'static + Send. So,
refactor the callers. We'll need this for future tokio-epoll-uring
support anyway, because tokio-epoll-uring requires owned buffers.

Related Issues
--------------

- overall epic to enable write path to tokio-epoll-uring: #6663
- this is also kind of relevant to the tokio-epoll-uring System creation
failures that we encountered in staging, investigation being tracked in
#6667
- why is it relevant? Because this PR removes two uses of
`spawn_blocking+Handle::block_on`
2024-02-14 14:22:41 +00:00
John Spray
f39b0fce9b Revert #6666 "tests: try to make restored-datadir comparison tests not flaky" (#6751)
The #6666  change appears to have made the test fail more often.

PR https://github.com/neondatabase/neon/pull/6712 should re-instate this
change, along with its change to make the overall flow more reliable.

This reverts commit 568f91420a.
2024-02-14 10:57:01 +00:00
Conrad Ludgate
a9ec4eb4fc hold cancel session (#6750)
## Problem

In a recent refactor, we accidentally dropped the cancel session early

## Summary of changes

Hold the cancel session during proxy passthrough
2024-02-14 10:26:32 +00:00
Heikki Linnakangas
a97b54e3b9 Cherry-pick Postgres bugfix to 'mmap' DSM implementation
Cherry-pick Upstream commit fbf9a7ac4d to neon stable branches. We'll
get it in the next PostgreSQL minor release anyway, but we need it
now, if we want to start using the 'mmap' implementation.

See https://github.com/neondatabase/autoscaling/issues/800 for the
plans on doing that.
2024-02-14 11:37:52 +02:00
Heikki Linnakangas
a5114a99b2 Create a symlink from pg_dynshmem to /dev/shm
See included comment and issue
https://github.com/neondatabase/autoscaling/issues/800 for details.

This has no effect, unless you set "dynamic_shared_memory_type = mmap"
in postgresql.conf.
2024-02-14 11:37:52 +02:00
Arpad Müller
ee7bbdda0e Create new metric for directory counts (#6736)
There is O(n^2) issues due to how we store these directories (#6626), so
it's good to keep an eye on them and ensure the numbers stay low.

The new per-timeline metric `pageserver_directory_entries_count`
isn't perfect, namely we don't calculate it every time we attach
the timeline, but only if there is an actual change.
Also, it is a collective metric over multiple scalars. Lastly,
we only emit the metric if it is above a certain threshold.

However, the metric still give a feel for the general size of the timeline.
We care less for small values as the metric is mainly there to
detect and track tenants with large directory counts.

We also expose the directory counts in `TimelineInfo` so that one can
get the detailed size distribution directly via the pageserver's API.

Related: #6642 , https://github.com/neondatabase/cloud/issues/10273
2024-02-14 02:12:00 +01:00
Konstantin Knizhnik
b6e070bf85 Do not perform fast exit for catalog pages in redo filter (#6730)
## Problem

See https://github.com/neondatabase/neon/issues/6674

Current implementation of `neon_redo_read_buffer_filter` performs fast
exist for catalog pages:
```
       /*
        * Out of an abundance of caution, we always run redo on shared catalogs,
        * regardless of whether the block is stored in shared buffers. See also
        * this function's top comment.
        */
       if (!OidIsValid(NInfoGetDbOid(rinfo)))
               return false;
*/

as a result last written lsn and relation size for FSM fork are not correctly updated for catalog relations.

## Summary of changes

Do not perform fast path return for catalog relations.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-02-13 20:41:17 +02:00
Christian Schwarz
7fa732c96c refactor(virtual_file): take owned buffer in VirtualFile::write_all (#6664)
Building atop #6660 , this PR converts VirtualFile::write_all to
owned buffers.

Part of https://github.com/neondatabase/neon/issues/6663
2024-02-13 18:46:25 +01:00
Anna Khanova
331935df91 Proxy: send cancel notifications to all instances (#6719)
## Problem

If cancel request ends up on the wrong proxy instance, it doesn't take
an effect.

## Summary of changes

Send redis notifications to all proxy pods about the cancel request.

Related issue: https://github.com/neondatabase/neon/issues/5839,
https://github.com/neondatabase/cloud/issues/10262
2024-02-13 17:58:58 +01:00
John Spray
a8eb4042ba tests: test_secondary_mode_eviction: avoid use of mocked statvfs (#6698)
## Problem

Test sometimes fails with `used_blocks > total_blocks`, because when
using mocked statvfs with the total blocks set to the size of data on
disk before starting, we are implicitly asserting that nothing at all
can be written to disk between startup and calling statvfs.

Related: https://github.com/neondatabase/neon/issues/6511

## Summary of changes

- Use HTTP API to invoke disk usage eviction instead of mocked statvfs
2024-02-13 09:00:50 +02:00
Arthur Petukhovsky
4be2223a4c Discrete event simulation for safekeepers (#5804)
This PR contains the first version of a
[FoundationDB-like](https://www.youtube.com/watch?v=4fFDFbi3toc)
simulation testing for safekeeper and walproposer.

### desim

This is a core "framework" for running determenistic simulation. It
operates on threads, allowing to test syncronous code (like walproposer).

`libs/desim/src/executor.rs` contains implementation of a determenistic
thread execution. This is achieved by blocking all threads, and each
time allowing only a single thread to make an execution step. All
executor's threads are blocked using `yield_me(after_ms)` function. This
function is called when a thread wants to sleep or wait for an external
notification (like blocking on a channel until it has a ready message).

`libs/desim/src/chan.rs` contains implementation of a channel (basic
sync primitive). It has unlimited capacity and any thread can push or
read messages to/from it.

`libs/desim/src/network.rs` has a very naive implementation of a network
(only reliable TCP-like connections are supported for now), that can
have arbitrary delays for each package and failure injections for
breaking connections with some probability.

`libs/desim/src/world.rs` ties everything together, to have a concept of
virtual nodes that can have network connections between them.

### walproposer_sim

Has everything to run walproposer and safekeepers in a simulation.

`safekeeper.rs` reimplements all necesary stuff from `receive_wal.rs`,
`send_wal.rs` and `timelines_global_map.rs`.

`walproposer_api.rs` implements all walproposer callback to use
simulation library.

`simulation.rs` defines a schedule – a set of events like `restart <sk>`
or `write_wal` that should happen at time `<ts>`. It also has code to
spawn walproposer/safekeeper threads and provide config to them.

### tests

`simple_test.rs` has tests that just start walproposer and 3 safekeepers
together in a simulation, and tests that they are not crashing right
away.

`misc_test.rs` has tests checking more advanced simulation cases, like
crashing or restarting threads, testing memory deallocation, etc.

`random_test.rs` is the main test, it checks thousands of random seeds
(schedules) for correctness. It roughly corresponds to running a real
python integration test in an environment with very unstable network and
cpu, but in a determenistic way (each seed results in the same execution
log) and much much faster.

Closes #547

---------

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2024-02-12 20:29:57 +00:00
Anna Khanova
fac50a6264 Proxy refactor auth+connect (#6708)
## Problem

Not really a problem, just refactoring.

## Summary of changes

Separate authenticate from wake compute.

Do not call wake compute second time if we managed to connect to
postgres or if we got it not from cache.
2024-02-12 18:41:02 +00:00
Arpad Müller
a1f37cba1c Add test that runs the S3 scrubber (#6641)
In #6079 it was found that there is no test that executes the scrubber.
We now add such a test, which does the following things:

* create a tenant, write some data
* run the scrubber
* remove the tenant
* run the scrubber again

Each time, the scrubber runs the scan-metadata command. Before #6079 we
would have errored, now we don't.

Fixes #6080
2024-02-12 19:15:21 +01:00
Christian Schwarz
8b8ff88e4b GH actions: label to disable CI runs completely (#6677)
I don't want my very-early-draft PRs to trigger any CI runs.
So, add a label `run-no-ci`, and piggy-back on the `check-permissions` job.
2024-02-12 15:25:33 +00:00
Joonas Koivunen
7ea593db22 refactor(LayerManager): resident layers query (#6634)
Refactor out layer accesses so that we can have easy access to resident
layers, which are needed for number of cases instead of layers for
eviction. Simplifies the heatmap building by only using Layers, not
RemoteTimelineClient.

Cc: #5331
2024-02-12 17:13:35 +02:00
Conrad Ludgate
789a71c4ee proxy: add more http logging (#6726)
## Problem

hard to see where time is taken during HTTP flow.

## Summary of changes

add a lot more for query state. add a conn_id field to the sql-over-http
span
2024-02-12 15:03:45 +00:00
Christian Schwarz
242dd8398c refactor(blob_io): use owned buffers (#6660)
This PR refactors the `blob_io` code away from using slices towards
taking owned buffers and return them after use.
Using owned buffers will eventually allow us to use io_uring for writes.

part of https://github.com/neondatabase/neon/issues/6663

Depends on https://github.com/neondatabase/tokio-epoll-uring/pull/43

The high level scheme is as follows:
- call writing functions with the `BoundedBuf`
- return the underlying `BoundedBuf::Buf` for potential reuse in the
caller

NB: Invoking `BoundedBuf::slice(..)` will return a slice that _includes
the uninitialized portion of `BoundedBuf`_.
I.e., the portion between `bytes_init()` and `bytes_total()`.
It's a safe API that actually permits access to uninitialized memory.
Not great.

Another wrinkle is that it panics if the range has length 0.

However, I don't want to switch away from the `BoundedBuf` API, since
it's what tokio-uring uses.
We can always weed this out later by replacing `BoundedBuf` with our own
type.
Created an issue so we don't forget:
https://github.com/neondatabase/tokio-epoll-uring/issues/46
2024-02-12 15:58:55 +01:00
Conrad Ludgate
98ec5c5c46 proxy: some more parquet data (#6711)
## Summary of changes

add auth_method and database to the parquet logs
2024-02-12 13:14:06 +00:00
Anna Khanova
020e607637 Proxy: copy bidirectional fork (#6720)
## Problem

`tokio::io::copy_bidirectional` doesn't close the connection once one of
the sides closes it. It's not really suitable for the postgres protocol.

## Summary of changes

Fork `copy_bidirectional` and initiate a shutdown for both connections.

---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2024-02-12 14:04:46 +01:00
Joonas Koivunen
c77411e903 cleanup around attach (#6621)
The smaller changes I found while looking around #6584.

- rustfmt was not able to format handle_timeline_create
- fix Generation::get_suffix always allocating
- Generation was missing a `#[track_caller]` for panicky method
- attach has a lot of issues, but even with this PR it cannot be
formatted by rustfmt
- moved the `preload` span to be on top of `attach` -- it is awaited
inline
- make disconnected panic! or unreachable! into expect, expect_err
2024-02-12 14:52:20 +02:00
Joonas Koivunen
aeda82a010 fix(heavier_once_cell): assertion failure can be hit (#6722)
@problame noticed that the `tokio::sync::AcquireError` branch assertion
can be hit like in the added test. We haven't seen this yet in
production, but I'd prefer not to see it there. There `take_and_deinit`
is being used, but this race must be quite timing sensitive.

Rework of earlier: #6652.
2024-02-12 09:57:29 +00:00
187 changed files with 10378 additions and 2854 deletions

View File

@@ -17,6 +17,7 @@ concurrency:
jobs:
actionlint:
if: ${{ !contains(github.event.pull_request.labels.*.name, 'run-no-ci') }}
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

View File

@@ -26,8 +26,8 @@ env:
jobs:
check-permissions:
if: ${{ !contains(github.event.pull_request.labels.*.name, 'run-no-ci') }}
runs-on: ubuntu-latest
steps:
- name: Disallow PRs from forks
if: |
@@ -253,7 +253,7 @@ jobs:
done
if [ "${FAILED}" = "true" ]; then
echo >&2 "Please update vendors/revisions.json if these changes are intentional"
echo >&2 "Please update vendor/revisions.json if these changes are intentional"
exit 1
fi

View File

@@ -117,6 +117,7 @@ jobs:
check-linux-arm-build:
timeout-minutes: 90
if: ${{ !contains(github.event.pull_request.labels.*.name, 'run-no-ci') }}
runs-on: [ self-hosted, dev, arm64 ]
env:
@@ -237,6 +238,7 @@ jobs:
check-codestyle-rust-arm:
timeout-minutes: 90
if: ${{ !contains(github.event.pull_request.labels.*.name, 'run-no-ci') }}
runs-on: [ self-hosted, dev, arm64 ]
container:

49
Cargo.lock generated
View File

@@ -1639,6 +1639,22 @@ dependencies = [
"rusticata-macros",
]
[[package]]
name = "desim"
version = "0.1.0"
dependencies = [
"anyhow",
"bytes",
"hex",
"parking_lot 0.12.1",
"rand 0.8.5",
"scopeguard",
"smallvec",
"tracing",
"utils",
"workspace_hack",
]
[[package]]
name = "diesel"
version = "2.1.4"
@@ -1797,6 +1813,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e875f1719c16de097dee81ed675e2d9bb63096823ed3f0ca827b7dea3028bbbb"
dependencies = [
"enumset_derive",
"serde",
]
[[package]]
@@ -2247,11 +2264,11 @@ dependencies = [
[[package]]
name = "hashlink"
version = "0.8.2"
version = "0.8.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0761a1b9491c4f2e3d66aa0f62d0fba0af9a0e2852e4d48ea506632a4b56e6aa"
checksum = "e8094feaf31ff591f651a2664fb9cfd92bba7a60ce3197265e9482ebe753c8f7"
dependencies = [
"hashbrown 0.13.2",
"hashbrown 0.14.0",
]
[[package]]
@@ -2741,6 +2758,17 @@ version = "1.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "830d08ce1d1d941e6b30645f1a0eb5643013d835ce3779a5fc208261dbe10f55"
[[package]]
name = "leaky-bucket"
version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8eb491abd89e9794d50f93c8db610a29509123e3fbbc9c8c67a528e9391cd853"
dependencies = [
"parking_lot 0.12.1",
"tokio",
"tracing",
]
[[package]]
name = "libc"
version = "0.2.150"
@@ -3432,6 +3460,7 @@ name = "pageserver"
version = "0.1.0"
dependencies = [
"anyhow",
"arc-swap",
"async-compression",
"async-stream",
"async-trait",
@@ -3459,6 +3488,7 @@ dependencies = [
"humantime-serde",
"hyper",
"itertools",
"leaky-bucket",
"md5",
"metrics",
"nix 0.27.1",
@@ -3936,6 +3966,7 @@ dependencies = [
"pin-project-lite",
"postgres-protocol",
"rand 0.8.5",
"serde",
"thiserror",
"tokio",
"tracing",
@@ -4419,6 +4450,7 @@ dependencies = [
"futures",
"futures-util",
"http-types",
"humantime",
"hyper",
"itertools",
"metrics",
@@ -4430,6 +4462,7 @@ dependencies = [
"serde_json",
"test-context",
"tokio",
"tokio-stream",
"tokio-util",
"toml_edit",
"tracing",
@@ -4827,6 +4860,7 @@ dependencies = [
"clap",
"const_format",
"crc32c",
"desim",
"fail",
"fs2",
"futures",
@@ -4842,6 +4876,7 @@ dependencies = [
"postgres_backend",
"postgres_ffi",
"pq_proto",
"rand 0.8.5",
"regex",
"remote_storage",
"reqwest",
@@ -4862,8 +4897,10 @@ dependencies = [
"tokio-util",
"toml_edit",
"tracing",
"tracing-subscriber",
"url",
"utils",
"walproposer",
"workspace_hack",
]
@@ -5740,7 +5777,7 @@ dependencies = [
[[package]]
name = "tokio-epoll-uring"
version = "0.1.0"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#d6a1c93442fb6b3a5bec490204961134e54925dc"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#868d2c42b5d54ca82fead6e8f2f233b69a540d3e"
dependencies = [
"futures",
"nix 0.26.4",
@@ -6265,8 +6302,9 @@ dependencies = [
[[package]]
name = "uring-common"
version = "0.1.0"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#d6a1c93442fb6b3a5bec490204961134e54925dc"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#868d2c42b5d54ca82fead6e8f2f233b69a540d3e"
dependencies = [
"bytes",
"io-uring",
"libc",
]
@@ -6323,6 +6361,7 @@ dependencies = [
"hex-literal",
"hyper",
"jsonwebtoken",
"leaky-bucket",
"metrics",
"nix 0.27.1",
"once_cell",

View File

@@ -18,6 +18,7 @@ members = [
"libs/pageserver_api",
"libs/postgres_ffi",
"libs/safekeeper_api",
"libs/desim",
"libs/utils",
"libs/consumption_metrics",
"libs/postgres_backend",
@@ -80,7 +81,7 @@ futures-core = "0.3"
futures-util = "0.3"
git-version = "0.3"
hashbrown = "0.13"
hashlink = "0.8.1"
hashlink = "0.8.4"
hdrhistogram = "7.5.2"
hex = "0.4"
hex-literal = "0.4"
@@ -96,6 +97,7 @@ ipnet = "2.9.0"
itertools = "0.10"
jsonwebtoken = "9"
lasso = "0.7"
leaky-bucket = "1.0.1"
libc = "0.2"
md5 = "0.7.0"
memoffset = "0.8"
@@ -203,6 +205,7 @@ postgres_ffi = { version = "0.1", path = "./libs/postgres_ffi/" }
pq_proto = { version = "0.1", path = "./libs/pq_proto/" }
remote_storage = { version = "0.1", path = "./libs/remote_storage/" }
safekeeper_api = { version = "0.1", path = "./libs/safekeeper_api" }
desim = { version = "0.1", path = "./libs/desim" }
storage_broker = { version = "0.1", path = "./storage_broker/" } # Note: main broker code is inside the binary crate, so linking with the library shouldn't be heavy.
tenant_size_model = { version = "0.1", path = "./libs/tenant_size_model/" }
tracing-utils = { version = "0.1", path = "./libs/tracing-utils/" }

View File

@@ -47,7 +47,7 @@ COPY --chown=nonroot . .
# Show build caching stats to check if it was used in the end.
# Has to be the part of the same RUN since cachepot daemon is killed in the end of this RUN, losing the compilation stats.
RUN set -e \
&& mold -run cargo build \
&& RUSTFLAGS="-Clinker=clang -Clink-arg=-fuse-ld=mold -Clink-arg=-Wl,--no-rosegment" cargo build \
--bin pg_sni_router \
--bin pageserver \
--bin pagectl \

View File

@@ -769,6 +769,24 @@ RUN wget https://github.com/eulerto/wal2json/archive/refs/tags/wal2json_2_5.tar.
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install
#########################################################################################
#
# Layer "pg_ivm"
# compile pg_ivm extension
#
#########################################################################################
FROM build-deps AS pg-ivm-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/sraoss/pg_ivm/archive/refs/tags/v1.7.tar.gz -O pg_ivm.tar.gz && \
echo "ebfde04f99203c7be4b0e873f91104090e2e83e5429c32ac242d00f334224d5e pg_ivm.tar.gz" | sha256sum --check && \
mkdir pg_ivm-src && cd pg_ivm-src && tar xvzf ../pg_ivm.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_ivm.control
#########################################################################################
#
# Layer "neon-pg-ext-build"
@@ -810,6 +828,7 @@ COPY --from=pg-semver-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-embedding-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=wal2json-pg-build /usr/local/pgsql /usr/local/pgsql
COPY --from=pg-anon-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-ivm-build /usr/local/pgsql/ /usr/local/pgsql/
COPY pgxn/ pgxn/
RUN make -j $(getconf _NPROCESSORS_ONLN) \
@@ -820,6 +839,10 @@ RUN make -j $(getconf _NPROCESSORS_ONLN) \
PG_CONFIG=/usr/local/pgsql/bin/pg_config \
-C pgxn/neon_utils \
-s install && \
make -j $(getconf _NPROCESSORS_ONLN) \
PG_CONFIG=/usr/local/pgsql/bin/pg_config \
-C pgxn/neon_test_utils \
-s install && \
make -j $(getconf _NPROCESSORS_ONLN) \
PG_CONFIG=/usr/local/pgsql/bin/pg_config \
-C pgxn/neon_rmgr \

View File

@@ -159,8 +159,8 @@ neon-pg-ext-%: postgres-%
-C $(POSTGRES_INSTALL_DIR)/build/neon-utils-$* \
-f $(ROOT_PROJECT_DIR)/pgxn/neon_utils/Makefile install
.PHONY: neon-pg-ext-clean-%
neon-pg-ext-clean-%:
.PHONY: neon-pg-clean-ext-%
neon-pg-clean-ext-%:
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/$*/bin/pg_config \
-C $(POSTGRES_INSTALL_DIR)/build/neon-$* \
-f $(ROOT_PROJECT_DIR)/pgxn/neon/Makefile clean
@@ -216,11 +216,11 @@ neon-pg-ext: \
neon-pg-ext-v15 \
neon-pg-ext-v16
.PHONY: neon-pg-ext-clean
neon-pg-ext-clean: \
neon-pg-ext-clean-v14 \
neon-pg-ext-clean-v15 \
neon-pg-ext-clean-v16
.PHONY: neon-pg-clean-ext
neon-pg-clean-ext: \
neon-pg-clean-ext-v14 \
neon-pg-clean-ext-v15 \
neon-pg-clean-ext-v16
# shorthand to build all Postgres versions
.PHONY: postgres
@@ -249,7 +249,7 @@ postgres-check: \
# This doesn't remove the effects of 'configure'.
.PHONY: clean
clean: postgres-clean neon-pg-ext-clean
clean: postgres-clean neon-pg-clean-ext
$(CARGO_CMD_PREFIX) cargo clean
# This removes everything

View File

@@ -249,6 +249,16 @@ testing locally, it is convenient to run just one set of permutations, like this
DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest
```
## Flamegraphs
You may find yourself in need of flamegraphs for software in this repository.
You can use [`flamegraph-rs`](https://github.com/flamegraph-rs/flamegraph) or the original [`flamegraph.pl`](https://github.com/brendangregg/FlameGraph). Your choice!
>[!IMPORTANT]
> If you're using `lld` or `mold`, you need the `--no-rosegment` linker argument.
> It's a [general thing with Rust / lld / mold](https://crbug.com/919499#c16), not specific to this repository.
> See [this PR for further instructions](https://github.com/neondatabase/neon/pull/6764).
## Documentation
[docs](/docs) Contains a top-level overview of all available markdown documentation.

View File

@@ -45,7 +45,6 @@ use std::{thread, time::Duration};
use anyhow::{Context, Result};
use chrono::Utc;
use clap::Arg;
use nix::sys::signal::{kill, Signal};
use signal_hook::consts::{SIGQUIT, SIGTERM};
use signal_hook::{consts::SIGINT, iterator::Signals};
use tracing::{error, info};
@@ -53,7 +52,9 @@ use url::Url;
use compute_api::responses::ComputeStatus;
use compute_tools::compute::{ComputeNode, ComputeState, ParsedSpec, PG_PID, SYNC_SAFEKEEPERS_PID};
use compute_tools::compute::{
forward_termination_signal, ComputeNode, ComputeState, ParsedSpec, PG_PID,
};
use compute_tools::configurator::launch_configurator;
use compute_tools::extension_server::get_pg_version;
use compute_tools::http::api::launch_http_server;
@@ -394,6 +395,15 @@ fn main() -> Result<()> {
info!("synced safekeepers at lsn {lsn}");
}
let mut state = compute.state.lock().unwrap();
if state.status == ComputeStatus::TerminationPending {
state.status = ComputeStatus::Terminated;
compute.state_changed.notify_all();
// we were asked to terminate gracefully, don't exit to avoid restart
delay_exit = true
}
drop(state);
if let Err(err) = compute.check_for_core_dumps() {
error!("error while checking for core dumps: {err:?}");
}
@@ -523,16 +533,7 @@ fn cli() -> clap::Command {
/// wait for termination which would be easy then.
fn handle_exit_signal(sig: i32) {
info!("received {sig} termination signal");
let ss_pid = SYNC_SAFEKEEPERS_PID.load(Ordering::SeqCst);
if ss_pid != 0 {
let ss_pid = nix::unistd::Pid::from_raw(ss_pid as i32);
kill(ss_pid, Signal::SIGTERM).ok();
}
let pg_pid = PG_PID.load(Ordering::SeqCst);
if pg_pid != 0 {
let pg_pid = nix::unistd::Pid::from_raw(pg_pid as i32);
kill(pg_pid, Signal::SIGTERM).ok();
}
forward_termination_signal();
exit(1);
}

View File

@@ -2,7 +2,7 @@ use std::collections::HashMap;
use std::env;
use std::fs;
use std::io::BufRead;
use std::os::unix::fs::PermissionsExt;
use std::os::unix::fs::{symlink, PermissionsExt};
use std::path::Path;
use std::process::{Command, Stdio};
use std::str::FromStr;
@@ -28,6 +28,8 @@ use compute_api::responses::{ComputeMetrics, ComputeStatus};
use compute_api::spec::{ComputeFeature, ComputeMode, ComputeSpec};
use utils::measured_stream::MeasuredReader;
use nix::sys::signal::{kill, Signal};
use remote_storage::{DownloadError, RemotePath};
use crate::checker::create_availability_check_data;
@@ -324,7 +326,8 @@ impl ComputeNode {
let spec = compute_state.pspec.as_ref().expect("spec must be set");
let start_time = Instant::now();
let mut config = postgres::Config::from_str(&spec.pageserver_connstr)?;
let shard0_connstr = spec.pageserver_connstr.split(',').next().unwrap();
let mut config = postgres::Config::from_str(shard0_connstr)?;
// Use the storage auth token from the config file, if given.
// Note: this overrides any password set in the connection string.
@@ -634,6 +637,48 @@ impl ComputeNode {
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path)?;
// Place pg_dynshmem under /dev/shm. This allows us to use
// 'dynamic_shared_memory_type = mmap' so that the files are placed in
// /dev/shm, similar to how 'dynamic_shared_memory_type = posix' works.
//
// Why on earth don't we just stick to the 'posix' default, you might
// ask. It turns out that making large allocations with 'posix' doesn't
// work very well with autoscaling. The behavior we want is that:
//
// 1. You can make large DSM allocations, larger than the current RAM
// size of the VM, without errors
//
// 2. If the allocated memory is really used, the VM is scaled up
// automatically to accommodate that
//
// We try to make that possible by having swap in the VM. But with the
// default 'posix' DSM implementation, we fail step 1, even when there's
// plenty of swap available. PostgreSQL uses posix_fallocate() to create
// the shmem segment, which is really just a file in /dev/shm in Linux,
// but posix_fallocate() on tmpfs returns ENOMEM if the size is larger
// than available RAM.
//
// Using 'dynamic_shared_memory_type = mmap' works around that, because
// the Postgres 'mmap' DSM implementation doesn't use
// posix_fallocate(). Instead, it uses repeated calls to write(2) to
// fill the file with zeros. It's weird that that differs between
// 'posix' and 'mmap', but we take advantage of it. When the file is
// filled slowly with write(2), the kernel allows it to grow larger, as
// long as there's swap available.
//
// In short, using 'dynamic_shared_memory_type = mmap' allows us one DSM
// segment to be larger than currently available RAM. But because we
// don't want to store it on a real file, which the kernel would try to
// flush to disk, so symlink pg_dynshm to /dev/shm.
//
// We don't set 'dynamic_shared_memory_type = mmap' here, we let the
// control plane control that option. If 'mmap' is not used, this
// symlink doesn't affect anything.
//
// See https://github.com/neondatabase/autoscaling/issues/800
std::fs::remove_dir(pgdata_path.join("pg_dynshmem"))?;
symlink("/dev/shm/", pgdata_path.join("pg_dynshmem"))?;
match spec.mode {
ComputeMode::Primary => {}
ComputeMode::Replica | ComputeMode::Static(..) => {
@@ -1279,3 +1324,17 @@ LIMIT 100",
Ok(remote_ext_metrics)
}
}
pub fn forward_termination_signal() {
let ss_pid = SYNC_SAFEKEEPERS_PID.load(Ordering::SeqCst);
if ss_pid != 0 {
let ss_pid = nix::unistd::Pid::from_raw(ss_pid as i32);
kill(ss_pid, Signal::SIGTERM).ok();
}
let pg_pid = PG_PID.load(Ordering::SeqCst);
if pg_pid != 0 {
let pg_pid = nix::unistd::Pid::from_raw(pg_pid as i32);
// use 'immediate' shutdown (SIGQUIT): https://www.postgresql.org/docs/current/server-shutdown.html
kill(pg_pid, Signal::SIGQUIT).ok();
}
}

View File

@@ -51,6 +51,9 @@ pub fn write_postgres_conf(
if let Some(s) = &spec.pageserver_connstring {
writeln!(file, "neon.pageserver_connstring={}", escape_conf_value(s))?;
}
if let Some(stripe_size) = spec.shard_stripe_size {
writeln!(file, "neon.stripe_size={stripe_size}")?;
}
if !spec.safekeeper_connstrings.is_empty() {
writeln!(
file,

View File

@@ -5,6 +5,7 @@ use std::net::SocketAddr;
use std::sync::Arc;
use std::thread;
use crate::compute::forward_termination_signal;
use crate::compute::{ComputeNode, ComputeState, ParsedSpec};
use compute_api::requests::ConfigurationRequest;
use compute_api::responses::{ComputeStatus, ComputeStatusResponse, GenericAPIError};
@@ -123,6 +124,17 @@ async fn routes(req: Request<Body>, compute: &Arc<ComputeNode>) -> Response<Body
}
}
(&Method::POST, "/terminate") => {
info!("serving /terminate POST request");
match handle_terminate_request(compute).await {
Ok(()) => Response::new(Body::empty()),
Err((msg, code)) => {
error!("error handling /terminate request: {msg}");
render_json_error(&msg, code)
}
}
}
// download extension files from remote extension storage on demand
(&Method::POST, route) if route.starts_with("/extension_server/") => {
info!("serving {:?} POST request", route);
@@ -297,6 +309,49 @@ fn render_json_error(e: &str, status: StatusCode) -> Response<Body> {
.unwrap()
}
async fn handle_terminate_request(compute: &Arc<ComputeNode>) -> Result<(), (String, StatusCode)> {
{
let mut state = compute.state.lock().unwrap();
if state.status == ComputeStatus::Terminated {
return Ok(());
}
if state.status != ComputeStatus::Empty && state.status != ComputeStatus::Running {
let msg = format!(
"invalid compute status for termination request: {:?}",
state.status.clone()
);
return Err((msg, StatusCode::PRECONDITION_FAILED));
}
state.status = ComputeStatus::TerminationPending;
compute.state_changed.notify_all();
drop(state);
}
forward_termination_signal();
info!("sent signal and notified waiters");
// Spawn a blocking thread to wait for compute to become Terminated.
// This is needed to do not block the main pool of workers and
// be able to serve other requests while some particular request
// is waiting for compute to finish configuration.
let c = compute.clone();
task::spawn_blocking(move || {
let mut state = c.state.lock().unwrap();
while state.status != ComputeStatus::Terminated {
state = c.state_changed.wait(state).unwrap();
info!(
"waiting for compute to become Terminated, current status: {:?}",
state.status
);
}
Ok(())
})
.await
.unwrap()?;
info!("terminated Postgres");
Ok(())
}
// Main Hyper HTTP server function that runs it and blocks waiting on it forever.
#[tokio::main]
async fn serve(port: u16, state: Arc<ComputeNode>) {

View File

@@ -168,6 +168,29 @@ paths:
schema:
$ref: "#/components/schemas/GenericError"
/terminate:
post:
tags:
- Terminate
summary: Terminate Postgres and wait for it to exit
description: ""
operationId: terminate
responses:
200:
description: Result
412:
description: "wrong state"
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
500:
description: "Unexpected error"
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
components:
securitySchemes:
JWT:

View File

@@ -4,6 +4,11 @@ version = "0.1.0"
edition.workspace = true
license.workspace = true
[features]
default = []
# Enables test-only APIs and behaviors
testing = []
[dependencies]
anyhow.workspace = true
aws-config.workspace = true

View File

@@ -3,7 +3,7 @@ use std::{collections::HashMap, time::Duration};
use control_plane::endpoint::{ComputeControlPlane, EndpointStatus};
use control_plane::local_env::LocalEnv;
use hyper::{Method, StatusCode};
use pageserver_api::shard::{ShardCount, ShardIndex, ShardNumber, TenantShardId};
use pageserver_api::shard::{ShardIndex, ShardNumber, TenantShardId};
use postgres_connection::parse_host_port;
use serde::{Deserialize, Serialize};
use tokio_util::sync::CancellationToken;
@@ -77,7 +77,7 @@ impl ComputeHookTenant {
self.shards
.sort_by_key(|(shard, _node_id)| shard.shard_number);
if self.shards.len() == shard_count.0 as usize || shard_count == ShardCount(0) {
if self.shards.len() == shard_count.count() as usize || shard_count.is_unsharded() {
// We have pageservers for all the shards: emit a configuration update
return Some(ComputeHookNotifyRequest {
tenant_id,
@@ -94,7 +94,7 @@ impl ComputeHookTenant {
tracing::info!(
"ComputeHookTenant::maybe_reconfigure: not enough shards ({}/{})",
self.shards.len(),
shard_count.0
shard_count.count()
);
}
@@ -155,7 +155,7 @@ impl ComputeHook {
for (endpoint_name, endpoint) in &cplane.endpoints {
if endpoint.tenant_id == tenant_id && endpoint.status() == EndpointStatus::Running {
tracing::info!("🔁 Reconfiguring endpoint {}", endpoint_name,);
tracing::info!("Reconfiguring endpoint {}", endpoint_name,);
endpoint.reconfigure(compute_pageservers.clone()).await?;
}
}
@@ -177,7 +177,7 @@ impl ComputeHook {
req
};
tracing::debug!(
tracing::info!(
"Sending notify request to {} ({:?})",
url,
reconfigure_request
@@ -266,7 +266,7 @@ impl ComputeHook {
/// periods, but we don't retry forever. The **caller** is responsible for handling failures and
/// ensuring that they eventually call again to ensure that the compute is eventually notified of
/// the proper pageserver nodes for a tenant.
#[tracing::instrument(skip_all, fields(tenant_shard_id, node_id))]
#[tracing::instrument(skip_all, fields(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), node_id))]
pub(super) async fn notify(
&self,
tenant_shard_id: TenantShardId,
@@ -298,7 +298,7 @@ impl ComputeHook {
let Some(reconfigure_request) = reconfigure_request else {
// The tenant doesn't yet have pageservers for all its shards: we won't notify anything
// until it does.
tracing::debug!("Tenant isn't yet ready to emit a notification",);
tracing::info!("Tenant isn't yet ready to emit a notification");
return Ok(());
};

View File

@@ -37,6 +37,12 @@ impl std::fmt::Display for Sequence {
}
}
impl std::fmt::Debug for Sequence {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(f, "{}", self.0)
}
}
impl MonotonicCounter<Sequence> for Sequence {
fn cnt_advance(&mut self, v: Sequence) {
assert!(*self <= v);

View File

@@ -15,6 +15,7 @@ use diesel::Connection;
use metrics::launch_timestamp::LaunchTimestamp;
use std::sync::Arc;
use tokio::signal::unix::SignalKind;
use tokio_util::sync::CancellationToken;
use utils::auth::{JwtAuth, SwappableJwtAuth};
use utils::logging::{self, LogFormat};
@@ -237,15 +238,23 @@ async fn async_main() -> anyhow::Result<()> {
let auth = secrets
.public_key
.map(|jwt_auth| Arc::new(SwappableJwtAuth::new(jwt_auth)));
let router = make_router(service, auth)
let router = make_router(service.clone(), auth)
.build()
.map_err(|err| anyhow!(err))?;
let router_service = utils::http::RouterService::new(router).unwrap();
let server = hyper::Server::from_tcp(http_listener)?.serve(router_service);
// Start HTTP server
let server_shutdown = CancellationToken::new();
let server = hyper::Server::from_tcp(http_listener)?
.serve(router_service)
.with_graceful_shutdown({
let server_shutdown = server_shutdown.clone();
async move {
server_shutdown.cancelled().await;
}
});
tracing::info!("Serving on {0}", args.listen);
tokio::task::spawn(server);
let server_task = tokio::task::spawn(server);
// Wait until we receive a signal
let mut sigint = tokio::signal::unix::signal(SignalKind::interrupt())?;
@@ -266,5 +275,16 @@ async fn async_main() -> anyhow::Result<()> {
}
}
// Stop HTTP server first, so that we don't have to service requests
// while shutting down Service
server_shutdown.cancel();
if let Err(e) = server_task.await {
tracing::error!("Error joining HTTP server task: {e}")
}
tracing::info!("Joined HTTP server task");
service.shutdown().await;
tracing::info!("Service shutdown complete");
std::process::exit(0);
}

View File

@@ -222,7 +222,7 @@ impl Persistence {
let tenant_shard_id = TenantShardId {
tenant_id: TenantId::from_str(tsp.tenant_id.as_str())?,
shard_number: ShardNumber(tsp.shard_number as u8),
shard_count: ShardCount(tsp.shard_count as u8),
shard_count: ShardCount::new(tsp.shard_count as u8),
};
tenants_map.insert(tenant_shard_id, tsp);
@@ -318,7 +318,7 @@ impl Persistence {
tenant_id: TenantId::from_str(tsp.tenant_id.as_str())
.map_err(|e| DatabaseError::Logical(format!("Malformed tenant id: {e}")))?,
shard_number: ShardNumber(tsp.shard_number as u8),
shard_count: ShardCount(tsp.shard_count as u8),
shard_count: ShardCount::new(tsp.shard_count as u8),
};
result.insert(tenant_shard_id, Generation::new(tsp.generation as u32));
}
@@ -340,7 +340,7 @@ impl Persistence {
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(tenant_shard_id.tenant_id.to_string()))
.filter(shard_number.eq(tenant_shard_id.shard_number.0 as i32))
.filter(shard_count.eq(tenant_shard_id.shard_count.0 as i32))
.filter(shard_count.eq(tenant_shard_id.shard_count.literal() as i32))
.set((
generation.eq(generation + 1),
generation_pageserver.eq(node_id.0 as i64),
@@ -362,7 +362,7 @@ impl Persistence {
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(tenant_shard_id.tenant_id.to_string()))
.filter(shard_number.eq(tenant_shard_id.shard_number.0 as i32))
.filter(shard_count.eq(tenant_shard_id.shard_count.0 as i32))
.filter(shard_count.eq(tenant_shard_id.shard_count.literal() as i32))
.set((
generation_pageserver.eq(i64::MAX),
placement_policy.eq(serde_json::to_string(&PlacementPolicy::Detached).unwrap()),
@@ -381,7 +381,6 @@ impl Persistence {
//
// We create the child shards here, so that they will be available for increment_generation calls
// if some pageserver holding a child shard needs to restart before the overall tenant split is complete.
#[allow(dead_code)]
pub(crate) async fn begin_shard_split(
&self,
old_shard_count: ShardCount,
@@ -393,21 +392,19 @@ impl Persistence {
conn.transaction(|conn| -> DatabaseResult<()> {
// Mark parent shards as splitting
let expect_parent_records = std::cmp::max(1, old_shard_count.0);
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(split_tenant_id.to_string()))
.filter(shard_count.eq(old_shard_count.0 as i32))
.filter(shard_count.eq(old_shard_count.literal() as i32))
.set((splitting.eq(1),))
.execute(conn)?;
if u8::try_from(updated)
.map_err(|_| DatabaseError::Logical(
format!("Overflow existing shard count {} while splitting", updated))
)? != expect_parent_records {
)? != old_shard_count.count() {
// Perhaps a deletion or another split raced with this attempt to split, mutating
// the parent shards that we intend to split. In this case the split request should fail.
return Err(DatabaseError::Logical(
format!("Unexpected existing shard count {updated} when preparing tenant for split (expected {expect_parent_records})")
format!("Unexpected existing shard count {updated} when preparing tenant for split (expected {})", old_shard_count.count())
));
}
@@ -419,7 +416,7 @@ impl Persistence {
let mut parent = crate::schema::tenant_shards::table
.filter(tenant_id.eq(parent_shard_id.tenant_id.to_string()))
.filter(shard_number.eq(parent_shard_id.shard_number.0 as i32))
.filter(shard_count.eq(parent_shard_id.shard_count.0 as i32))
.filter(shard_count.eq(parent_shard_id.shard_count.literal() as i32))
.load::<TenantShardPersistence>(conn)?;
let parent = if parent.len() != 1 {
return Err(DatabaseError::Logical(format!(
@@ -449,7 +446,6 @@ impl Persistence {
// When we finish shard splitting, we must atomically clean up the old shards
// and insert the new shards, and clear the splitting marker.
#[allow(dead_code)]
pub(crate) async fn complete_shard_split(
&self,
split_tenant_id: TenantId,
@@ -461,7 +457,7 @@ impl Persistence {
// Drop parent shards
diesel::delete(tenant_shards)
.filter(tenant_id.eq(split_tenant_id.to_string()))
.filter(shard_count.eq(old_shard_count.0 as i32))
.filter(shard_count.eq(old_shard_count.literal() as i32))
.execute(conn)?;
// Clear sharding flag

View File

@@ -13,6 +13,7 @@ use tokio_util::sync::CancellationToken;
use utils::generation::Generation;
use utils::id::{NodeId, TimelineId};
use utils::lsn::Lsn;
use utils::sync::gate::GateGuard;
use crate::compute_hook::{ComputeHook, NotifyError};
use crate::node::Node;
@@ -53,6 +54,10 @@ pub(super) struct Reconciler {
/// the tenant is changed.
pub(crate) cancel: CancellationToken,
/// Reconcilers are registered with a Gate so that during a graceful shutdown we
/// can wait for all the reconcilers to respond to their cancellation tokens.
pub(crate) _gate_guard: GateGuard,
/// Access to persistent storage for updating generation numbers
pub(crate) persistence: Arc<Persistence>,
}
@@ -263,7 +268,7 @@ impl Reconciler {
secondary_conf,
tenant_conf: config.clone(),
shard_number: shard.number.0,
shard_count: shard.count.0,
shard_count: shard.count.literal(),
shard_stripe_size: shard.stripe_size.0,
}
}
@@ -458,7 +463,7 @@ impl Reconciler {
generation: None,
secondary_conf: None,
shard_number: self.shard.number.0,
shard_count: self.shard.count.0,
shard_count: self.shard.count.literal(),
shard_stripe_size: self.shard.stripe_size.0,
tenant_conf: self.config.clone(),
},
@@ -506,7 +511,7 @@ pub(crate) fn attached_location_conf(
generation: generation.into(),
secondary_conf: None,
shard_number: shard.number.0,
shard_count: shard.count.0,
shard_count: shard.count.literal(),
shard_stripe_size: shard.stripe_size.0,
tenant_conf: config.clone(),
}
@@ -521,7 +526,7 @@ pub(crate) fn secondary_location_conf(
generation: None,
secondary_conf: Some(LocationConfigSecondary { warm: true }),
shard_number: shard.number.0,
shard_count: shard.count.0,
shard_count: shard.count.literal(),
shard_stripe_size: shard.stripe_size.0,
tenant_conf: config.clone(),
}

View File

@@ -77,12 +77,11 @@ impl Scheduler {
return Err(ScheduleError::ImpossibleConstraint);
}
for (node_id, count) in &tenant_counts {
tracing::info!("tenant_counts[{node_id}]={count}");
}
let node_id = tenant_counts.first().unwrap().0;
tracing::info!("scheduler selected node {node_id}");
tracing::info!(
"scheduler selected node {node_id} (elegible nodes {:?}, exclude: {hard_exclude:?})",
tenant_counts.iter().map(|i| i.0 .0).collect::<Vec<_>>()
);
*self.tenant_counts.get_mut(&node_id).unwrap() += 1;
Ok(node_id)
}

View File

@@ -30,6 +30,7 @@ use pageserver_api::{
};
use pageserver_client::mgmt_api;
use tokio_util::sync::CancellationToken;
use tracing::instrument;
use utils::{
backoff,
completion::Barrier,
@@ -37,6 +38,7 @@ use utils::{
http::error::ApiError,
id::{NodeId, TenantId, TimelineId},
seqwait::SeqWait,
sync::gate::Gate,
};
use crate::{
@@ -124,6 +126,12 @@ pub struct Service {
config: Config,
persistence: Arc<Persistence>,
// Process shutdown will fire this token
cancel: CancellationToken,
// Background tasks will hold this gate
gate: Gate,
/// This waits for initial reconciliation with pageservers to complete. Until this barrier
/// passes, it isn't safe to do any actions that mutate tenants.
pub(crate) startup_complete: Barrier,
@@ -144,8 +152,9 @@ impl Service {
&self.config
}
/// TODO: don't allow other API calls until this is done, don't start doing any background housekeeping
/// until this is done.
/// Called once on startup, this function attempts to contact all pageservers to build an up-to-date
/// view of the world, and determine which pageservers are responsive.
#[instrument(skip_all)]
async fn startup_reconcile(&self) {
// For all tenant shards, a vector of observed states on nodes (where None means
// indeterminate, same as in [`ObservedStateLocation`])
@@ -153,9 +162,6 @@ impl Service {
let mut nodes_online = HashSet::new();
// TODO: give Service a cancellation token for clean shutdown
let cancel = CancellationToken::new();
// TODO: issue these requests concurrently
{
let nodes = {
@@ -190,7 +196,7 @@ impl Service {
1,
5,
"Location config listing",
&cancel,
&self.cancel,
)
.await;
let Some(list_response) = list_response else {
@@ -292,7 +298,7 @@ impl Service {
generation: None,
secondary_conf: None,
shard_number: tenant_shard_id.shard_number.0,
shard_count: tenant_shard_id.shard_count.0,
shard_count: tenant_shard_id.shard_count.literal(),
shard_stripe_size: 0,
tenant_conf: models::TenantConfig::default(),
},
@@ -331,7 +337,7 @@ impl Service {
let stream = futures::stream::iter(compute_notifications.into_iter())
.map(|(tenant_shard_id, node_id)| {
let compute_hook = compute_hook.clone();
let cancel = cancel.clone();
let cancel = self.cancel.clone();
async move {
if let Err(e) = compute_hook.notify(tenant_shard_id, node_id, &cancel).await {
tracing::error!(
@@ -368,8 +374,98 @@ impl Service {
tracing::info!("Startup complete, spawned {reconcile_tasks} reconciliation tasks ({shard_count} shards total)");
}
/// Long running background task that periodically wakes up and looks for shards that need
/// reconciliation. Reconciliation is fallible, so any reconciliation tasks that fail during
/// e.g. a tenant create/attach/migrate must eventually be retried: this task is responsible
/// for those retries.
#[instrument(skip_all)]
async fn background_reconcile(&self) {
self.startup_complete.clone().wait().await;
const BACKGROUND_RECONCILE_PERIOD: Duration = Duration::from_secs(20);
let mut interval = tokio::time::interval(BACKGROUND_RECONCILE_PERIOD);
while !self.cancel.is_cancelled() {
tokio::select! {
_ = interval.tick() => { self.reconcile_all(); }
_ = self.cancel.cancelled() => return
}
}
}
#[instrument(skip_all)]
async fn process_results(
&self,
mut result_rx: tokio::sync::mpsc::UnboundedReceiver<ReconcileResult>,
) {
loop {
// Wait for the next result, or for cancellation
let result = tokio::select! {
r = result_rx.recv() => {
match r {
Some(result) => {result},
None => {break;}
}
}
_ = self.cancel.cancelled() => {
break;
}
};
tracing::info!(
"Reconcile result for sequence {}, ok={}",
result.sequence,
result.result.is_ok()
);
let mut locked = self.inner.write().unwrap();
let Some(tenant) = locked.tenants.get_mut(&result.tenant_shard_id) else {
// A reconciliation result might race with removing a tenant: drop results for
// tenants that aren't in our map.
continue;
};
// Usually generation should only be updated via this path, so the max() isn't
// needed, but it is used to handle out-of-band updates via. e.g. test hook.
tenant.generation = std::cmp::max(tenant.generation, result.generation);
// If the reconciler signals that it failed to notify compute, set this state on
// the shard so that a future [`TenantState::maybe_reconcile`] will try again.
tenant.pending_compute_notification = result.pending_compute_notification;
match result.result {
Ok(()) => {
for (node_id, loc) in &result.observed.locations {
if let Some(conf) = &loc.conf {
tracing::info!("Updating observed location {}: {:?}", node_id, conf);
} else {
tracing::info!("Setting observed location {} to None", node_id,)
}
}
tenant.observed = result.observed;
tenant.waiter.advance(result.sequence);
}
Err(e) => {
tracing::warn!(
"Reconcile error on tenant {}: {}",
tenant.tenant_shard_id,
e
);
// Ordering: populate last_error before advancing error_seq,
// so that waiters will see the correct error after waiting.
*(tenant.last_error.lock().unwrap()) = format!("{e}");
tenant.error_waiter.advance(result.sequence);
for (node_id, o) in result.observed.locations {
tenant.observed.locations.insert(node_id, o);
}
}
}
}
}
pub async fn spawn(config: Config, persistence: Arc<Persistence>) -> anyhow::Result<Arc<Self>> {
let (result_tx, mut result_rx) = tokio::sync::mpsc::unbounded_channel();
let (result_tx, result_rx) = tokio::sync::mpsc::unbounded_channel();
tracing::info!("Loading nodes from database...");
let nodes = persistence.list_nodes().await?;
@@ -389,14 +485,14 @@ impl Service {
let tenant_shard_id = TenantShardId {
tenant_id: TenantId::from_str(tsp.tenant_id.as_str())?,
shard_number: ShardNumber(tsp.shard_number as u8),
shard_count: ShardCount(tsp.shard_count as u8),
shard_count: ShardCount::new(tsp.shard_count as u8),
};
let shard_identity = if tsp.shard_count == 0 {
ShardIdentity::unsharded()
} else {
ShardIdentity::new(
ShardNumber(tsp.shard_number as u8),
ShardCount(tsp.shard_count as u8),
ShardCount::new(tsp.shard_count as u8),
ShardStripeSize(tsp.shard_stripe_size as u32),
)?
};
@@ -418,6 +514,7 @@ impl Service {
observed: ObservedState::new(),
config: serde_json::from_str(&tsp.config).unwrap(),
reconciler: None,
splitting: tsp.splitting,
waiter: Arc::new(SeqWait::new(Sequence::initial())),
error_waiter: Arc::new(SeqWait::new(Sequence::initial())),
last_error: Arc::default(),
@@ -439,73 +536,35 @@ impl Service {
config,
persistence,
startup_complete: startup_complete.clone(),
cancel: CancellationToken::new(),
gate: Gate::default(),
});
let result_task_this = this.clone();
tokio::task::spawn(async move {
while let Some(result) = result_rx.recv().await {
tracing::info!(
"Reconcile result for sequence {}, ok={}",
result.sequence,
result.result.is_ok()
);
let mut locked = result_task_this.inner.write().unwrap();
let Some(tenant) = locked.tenants.get_mut(&result.tenant_shard_id) else {
// A reconciliation result might race with removing a tenant: drop results for
// tenants that aren't in our map.
continue;
};
// Usually generation should only be updated via this path, so the max() isn't
// needed, but it is used to handle out-of-band updates via. e.g. test hook.
tenant.generation = std::cmp::max(tenant.generation, result.generation);
// If the reconciler signals that it failed to notify compute, set this state on
// the shard so that a future [`TenantState::maybe_reconcile`] will try again.
tenant.pending_compute_notification = result.pending_compute_notification;
match result.result {
Ok(()) => {
for (node_id, loc) in &result.observed.locations {
if let Some(conf) = &loc.conf {
tracing::info!(
"Updating observed location {}: {:?}",
node_id,
conf
);
} else {
tracing::info!("Setting observed location {} to None", node_id,)
}
}
tenant.observed = result.observed;
tenant.waiter.advance(result.sequence);
}
Err(e) => {
tracing::warn!(
"Reconcile error on tenant {}: {}",
tenant.tenant_shard_id,
e
);
// Ordering: populate last_error before advancing error_seq,
// so that waiters will see the correct error after waiting.
*(tenant.last_error.lock().unwrap()) = format!("{e}");
tenant.error_waiter.advance(result.sequence);
for (node_id, o) in result.observed.locations {
tenant.observed.locations.insert(node_id, o);
}
}
}
// Block shutdown until we're done (we must respect self.cancel)
if let Ok(_gate) = result_task_this.gate.enter() {
result_task_this.process_results(result_rx).await
}
});
let startup_reconcile_this = this.clone();
tokio::task::spawn(async move {
// Block the [`Service::startup_complete`] barrier until we're done
let _completion = startup_completion;
tokio::task::spawn({
let this = this.clone();
// We will block the [`Service::startup_complete`] barrier until [`Self::startup_reconcile`]
// is done.
let startup_completion = startup_completion.clone();
async move {
// Block shutdown until we're done (we must respect self.cancel)
let Ok(_gate) = this.gate.enter() else {
return;
};
startup_reconcile_this.startup_reconcile().await
this.startup_reconcile().await;
drop(startup_completion);
this.background_reconcile().await;
}
});
Ok(this)
@@ -526,7 +585,7 @@ impl Service {
let tsp = TenantShardPersistence {
tenant_id: attach_req.tenant_shard_id.tenant_id.to_string(),
shard_number: attach_req.tenant_shard_id.shard_number.0 as i32,
shard_count: attach_req.tenant_shard_id.shard_count.0 as i32,
shard_count: attach_req.tenant_shard_id.shard_count.literal() as i32,
shard_stripe_size: 0,
generation: 0,
generation_pageserver: i64::MAX,
@@ -620,6 +679,28 @@ impl Service {
attach_req.node_id.unwrap_or(utils::id::NodeId(0xfffffff))
);
// Trick the reconciler into not doing anything for this tenant: this helps
// tests that manually configure a tenant on the pagesrever, and then call this
// attach hook: they don't want background reconciliation to modify what they
// did to the pageserver.
#[cfg(feature = "testing")]
{
if let Some(node_id) = attach_req.node_id {
tenant_state.observed.locations = HashMap::from([(
node_id,
ObservedStateLocation {
conf: Some(attached_location_conf(
tenant_state.generation,
&tenant_state.shard,
&tenant_state.config,
)),
},
)]);
} else {
tenant_state.observed.locations.clear();
}
}
Ok(AttachHookResponse {
gen: attach_req
.node_id
@@ -726,16 +807,9 @@ impl Service {
&self,
create_req: TenantCreateRequest,
) -> Result<TenantCreateResponse, ApiError> {
// Shard count 0 is valid: it means create a single shard (ShardCount(0) means "unsharded")
let literal_shard_count = if create_req.shard_parameters.is_unsharded() {
1
} else {
create_req.shard_parameters.count.0
};
// This service expects to handle sharding itself: it is an error to try and directly create
// a particular shard here.
let tenant_id = if create_req.new_tenant_id.shard_count > ShardCount(1) {
let tenant_id = if !create_req.new_tenant_id.is_unsharded() {
return Err(ApiError::BadRequest(anyhow::anyhow!(
"Attempted to create a specific shard, this API is for creating the whole tenant"
)));
@@ -749,7 +823,7 @@ impl Service {
create_req.shard_parameters.count,
);
let create_ids = (0..literal_shard_count)
let create_ids = (0..create_req.shard_parameters.count.count())
.map(|i| TenantShardId {
tenant_id,
shard_number: ShardNumber(i),
@@ -769,7 +843,7 @@ impl Service {
.map(|tenant_shard_id| TenantShardPersistence {
tenant_id: tenant_shard_id.tenant_id.to_string(),
shard_number: tenant_shard_id.shard_number.0 as i32,
shard_count: tenant_shard_id.shard_count.0 as i32,
shard_count: tenant_shard_id.shard_count.literal() as i32,
shard_stripe_size: create_req.shard_parameters.stripe_size.0 as i32,
generation: create_req.generation.map(|g| g as i32).unwrap_or(0),
generation_pageserver: i64::MAX,
@@ -875,6 +949,8 @@ impl Service {
&compute_hook,
&self.config,
&self.persistence,
&self.gate,
&self.cancel,
)
})
.collect::<Vec<_>>();
@@ -914,7 +990,7 @@ impl Service {
tenant_id: TenantId,
req: TenantLocationConfigRequest,
) -> Result<TenantLocationConfigResponse, ApiError> {
if req.tenant_id.shard_count.0 > 1 {
if !req.tenant_id.is_unsharded() {
return Err(ApiError::BadRequest(anyhow::anyhow!(
"This API is for importing single-sharded or unsharded tenants"
)));
@@ -977,6 +1053,8 @@ impl Service {
&compute_hook,
&self.config,
&self.persistence,
&self.gate,
&self.cancel,
);
if let Some(waiter) = maybe_waiter {
waiters.push(waiter);
@@ -1066,6 +1144,8 @@ impl Service {
}
pub(crate) async fn tenant_delete(&self, tenant_id: TenantId) -> Result<StatusCode, ApiError> {
self.ensure_attached_wait(tenant_id).await?;
// TODO: refactor into helper
let targets = {
let locked = self.inner.read().unwrap();
@@ -1087,8 +1167,6 @@ impl Service {
targets
};
// TODO: error out if the tenant is not attached anywhere.
// Phase 1: delete on the pageservers
let mut any_pending = false;
for (tenant_shard_id, node) in targets {
@@ -1424,9 +1502,6 @@ impl Service {
let mut policy = None;
let mut shard_ident = None;
// TODO: put a cancellation token on Service for clean shutdown
let cancel = CancellationToken::new();
// A parent shard which will be split
struct SplitTarget {
parent_id: TenantShardId,
@@ -1449,7 +1524,7 @@ impl Service {
for (tenant_shard_id, shard) in
locked.tenants.range(TenantShardId::tenant_range(tenant_id))
{
match shard.shard.count.0.cmp(&split_req.new_shard_count) {
match shard.shard.count.count().cmp(&split_req.new_shard_count) {
Ordering::Equal => {
// Already split this
children_found.push(*tenant_shard_id);
@@ -1459,7 +1534,7 @@ impl Service {
return Err(ApiError::BadRequest(anyhow::anyhow!(
"Requested count {} but already have shards at count {}",
split_req.new_shard_count,
shard.shard.count.0
shard.shard.count.count()
)));
}
Ordering::Less => {
@@ -1489,7 +1564,7 @@ impl Service {
shard_ident = Some(shard.shard);
}
if tenant_shard_id.shard_count == ShardCount(split_req.new_shard_count) {
if tenant_shard_id.shard_count.count() == split_req.new_shard_count {
tracing::info!(
"Tenant shard {} already has shard count {}",
tenant_shard_id,
@@ -1515,7 +1590,7 @@ impl Service {
targets.push(SplitTarget {
parent_id: *tenant_shard_id,
node: node.clone(),
child_ids: tenant_shard_id.split(ShardCount(split_req.new_shard_count)),
child_ids: tenant_shard_id.split(ShardCount::new(split_req.new_shard_count)),
});
}
@@ -1562,7 +1637,7 @@ impl Service {
this_child_tsps.push(TenantShardPersistence {
tenant_id: child.tenant_id.to_string(),
shard_number: child.shard_number.0 as i32,
shard_count: child.shard_count.0 as i32,
shard_count: child.shard_count.literal() as i32,
shard_stripe_size: shard_ident.stripe_size.0 as i32,
// Note: this generation is a placeholder, [`Persistence::begin_shard_split`] will
// populate the correct generation as part of its transaction, to protect us
@@ -1598,6 +1673,18 @@ impl Service {
}
}
// Now that I have persisted the splitting state, apply it in-memory. This is infallible, so
// callers may assume that if splitting is set in memory, then it was persisted, and if splitting
// is not set in memory, then it was not persisted.
{
let mut locked = self.inner.write().unwrap();
for target in &targets {
if let Some(parent_shard) = locked.tenants.get_mut(&target.parent_id) {
parent_shard.splitting = SplitState::Splitting;
}
}
}
// FIXME: we have now committed the shard split state to the database, so any subsequent
// failure needs to roll it back. We will later wrap this function in logic to roll back
// the split if it fails.
@@ -1657,7 +1744,7 @@ impl Service {
.complete_shard_split(tenant_id, old_shard_count)
.await?;
// Replace all the shards we just split with their children
// Replace all the shards we just split with their children: this phase is infallible.
let mut response = TenantShardSplitResponse {
new_shards: Vec::new(),
};
@@ -1705,6 +1792,10 @@ impl Service {
child_state.generation = generation;
child_state.config = config.clone();
// The child's TenantState::splitting is intentionally left at the default value of Idle,
// as at this point in the split process we have succeeded and this part is infallible:
// we will never need to do any special recovery from this state.
child_locations.push((child, pageserver));
locked.tenants.insert(child, child_state);
@@ -1716,7 +1807,7 @@ impl Service {
// Send compute notifications for all the new shards
let mut failed_notifications = Vec::new();
for (child_id, child_ps) in child_locations {
if let Err(e) = compute_hook.notify(child_id, child_ps, &cancel).await {
if let Err(e) = compute_hook.notify(child_id, child_ps, &self.cancel).await {
tracing::warn!("Failed to update compute of {}->{} during split, proceeding anyway to complete split ({e})",
child_id, child_ps);
failed_notifications.push(child_id);
@@ -1792,6 +1883,8 @@ impl Service {
&compute_hook,
&self.config,
&self.persistence,
&self.gate,
&self.cancel,
)
};
@@ -1993,6 +2086,8 @@ impl Service {
&compute_hook,
&self.config,
&self.persistence,
&self.gate,
&self.cancel,
);
}
}
@@ -2014,6 +2109,8 @@ impl Service {
&compute_hook,
&self.config,
&self.persistence,
&self.gate,
&self.cancel,
);
}
}
@@ -2053,6 +2150,8 @@ impl Service {
&compute_hook,
&self.config,
&self.persistence,
&self.gate,
&self.cancel,
) {
waiters.push(waiter);
}
@@ -2064,6 +2163,17 @@ impl Service {
let ensure_waiters = {
let locked = self.inner.write().unwrap();
// Check if the tenant is splitting: in this case, even if it is attached,
// we must act as if it is not: this blocks e.g. timeline creation/deletion
// operations during the split.
for (_shard_id, shard) in locked.tenants.range(TenantShardId::tenant_range(tenant_id)) {
if !matches!(shard.splitting, SplitState::Idle) {
return Err(ApiError::ResourceUnavailable(
"Tenant shards are currently splitting".into(),
));
}
}
self.ensure_attached_schedule(locked, tenant_id)
.map_err(ApiError::InternalServerError)?
};
@@ -2095,8 +2205,25 @@ impl Service {
&compute_hook,
&self.config,
&self.persistence,
&self.gate,
&self.cancel,
)
})
.count()
}
pub async fn shutdown(&self) {
// Note that this already stops processing any results from reconciles: so
// we do not expect that our [`TenantState`] objects will reach a neat
// final state.
self.cancel.cancel();
// The cancellation tokens in [`crate::reconciler::Reconciler`] are children
// of our cancellation token, so we do not need to explicitly cancel each of
// them.
// Background tasks and reconcilers hold gate guards: this waits for them all
// to complete.
self.gate.close().await;
}
}

View File

@@ -7,16 +7,18 @@ use pageserver_api::{
};
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;
use tracing::{instrument, Instrument};
use utils::{
generation::Generation,
id::NodeId,
seqwait::{SeqWait, SeqWaitError},
sync::gate::Gate,
};
use crate::{
compute_hook::ComputeHook,
node::Node,
persistence::Persistence,
persistence::{split_state::SplitState, Persistence},
reconciler::{attached_location_conf, secondary_location_conf, ReconcileError, Reconciler},
scheduler::{ScheduleError, Scheduler},
service, PlacementPolicy, Sequence,
@@ -58,6 +60,11 @@ pub(crate) struct TenantState {
/// cancellation token has been fired)
pub(crate) reconciler: Option<ReconcilerHandle>,
/// If a tenant is being split, then all shards with that TenantId will have a
/// SplitState set, this acts as a guard against other operations such as background
/// reconciliation, and timeline creation.
pub(crate) splitting: SplitState,
/// Optionally wait for reconciliation to complete up to a particular
/// sequence number.
pub(crate) waiter: std::sync::Arc<SeqWait<Sequence, Sequence>>,
@@ -238,6 +245,7 @@ impl TenantState {
observed: ObservedState::default(),
config: TenantConfig::default(),
reconciler: None,
splitting: SplitState::Idle,
sequence: Sequence(1),
waiter: Arc::new(SeqWait::new(Sequence(0))),
error_waiter: Arc::new(SeqWait::new(Sequence(0))),
@@ -415,6 +423,8 @@ impl TenantState {
false
}
#[allow(clippy::too_many_arguments)]
#[instrument(skip_all, fields(tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug()))]
pub(crate) fn maybe_reconcile(
&mut self,
result_tx: tokio::sync::mpsc::UnboundedSender<ReconcileResult>,
@@ -422,6 +432,8 @@ impl TenantState {
compute_hook: &Arc<ComputeHook>,
service_config: &service::Config,
persistence: &Arc<Persistence>,
gate: &Gate,
cancel: &CancellationToken,
) -> Option<ReconcilerWaiter> {
// If there are any ambiguous observed states, and the nodes they refer to are available,
// we should reconcile to clean them up.
@@ -443,6 +455,14 @@ impl TenantState {
return None;
}
// If we are currently splitting, then never start a reconciler task: the splitting logic
// requires that shards are not interfered with while it runs. Do this check here rather than
// up top, so that we only log this message if we would otherwise have done a reconciliation.
if !matches!(self.splitting, SplitState::Idle) {
tracing::info!("Refusing to reconcile, splitting in progress");
return None;
}
// Reconcile already in flight for the current sequence?
if let Some(handle) = &self.reconciler {
if handle.sequence == self.sequence {
@@ -460,7 +480,12 @@ impl TenantState {
// doing our sequence's work.
let old_handle = self.reconciler.take();
let cancel = CancellationToken::new();
let Ok(gate_guard) = gate.enter() else {
// Shutting down, don't start a reconciler
return None;
};
let reconciler_cancel = cancel.child_token();
let mut reconciler = Reconciler {
tenant_shard_id: self.tenant_shard_id,
shard: self.shard,
@@ -471,59 +496,66 @@ impl TenantState {
pageservers: pageservers.clone(),
compute_hook: compute_hook.clone(),
service_config: service_config.clone(),
cancel: cancel.clone(),
_gate_guard: gate_guard,
cancel: reconciler_cancel.clone(),
persistence: persistence.clone(),
compute_notify_failure: false,
};
let reconcile_seq = self.sequence;
tracing::info!("Spawning Reconciler for sequence {}", self.sequence);
tracing::info!(seq=%reconcile_seq, "Spawning Reconciler for sequence {}", self.sequence);
let must_notify = self.pending_compute_notification;
let join_handle = tokio::task::spawn(async move {
// Wait for any previous reconcile task to complete before we start
if let Some(old_handle) = old_handle {
old_handle.cancel.cancel();
if let Err(e) = old_handle.handle.await {
// We can't do much with this other than log it: the task is done, so
// we may proceed with our work.
tracing::error!("Unexpected join error waiting for reconcile task: {e}");
let reconciler_span = tracing::info_span!(parent: None, "reconciler", seq=%reconcile_seq,
tenant_id=%reconciler.tenant_shard_id.tenant_id,
shard_id=%reconciler.tenant_shard_id.shard_slug());
let join_handle = tokio::task::spawn(
async move {
// Wait for any previous reconcile task to complete before we start
if let Some(old_handle) = old_handle {
old_handle.cancel.cancel();
if let Err(e) = old_handle.handle.await {
// We can't do much with this other than log it: the task is done, so
// we may proceed with our work.
tracing::error!("Unexpected join error waiting for reconcile task: {e}");
}
}
// Early check for cancellation before doing any work
// TODO: wrap all remote API operations in cancellation check
// as well.
if reconciler.cancel.is_cancelled() {
return;
}
// Attempt to make observed state match intent state
let result = reconciler.reconcile().await;
// If we know we had a pending compute notification from some previous action, send a notification irrespective
// of whether the above reconcile() did any work
if result.is_ok() && must_notify {
// If this fails we will send the need to retry in [`ReconcileResult::pending_compute_notification`]
reconciler.compute_notify().await.ok();
}
result_tx
.send(ReconcileResult {
sequence: reconcile_seq,
result,
tenant_shard_id: reconciler.tenant_shard_id,
generation: reconciler.generation,
observed: reconciler.observed,
pending_compute_notification: reconciler.compute_notify_failure,
})
.ok();
}
// Early check for cancellation before doing any work
// TODO: wrap all remote API operations in cancellation check
// as well.
if reconciler.cancel.is_cancelled() {
return;
}
// Attempt to make observed state match intent state
let result = reconciler.reconcile().await;
// If we know we had a pending compute notification from some previous action, send a notification irrespective
// of whether the above reconcile() did any work
if result.is_ok() && must_notify {
// If this fails we will send the need to retry in [`ReconcileResult::pending_compute_notification`]
reconciler.compute_notify().await.ok();
}
result_tx
.send(ReconcileResult {
sequence: reconcile_seq,
result,
tenant_shard_id: reconciler.tenant_shard_id,
generation: reconciler.generation,
observed: reconciler.observed,
pending_compute_notification: reconciler.compute_notify_failure,
})
.ok();
});
.instrument(reconciler_span),
);
self.reconciler = Some(ReconcilerHandle {
sequence: self.sequence,
handle: join_handle,
cancel,
cancel: reconciler_cancel,
});
Some(ReconcilerWaiter {

View File

@@ -450,7 +450,7 @@ async fn handle_tenant(
new_tenant_id: TenantShardId::unsharded(tenant_id),
generation: None,
shard_parameters: ShardParameters {
count: ShardCount(shard_count),
count: ShardCount::new(shard_count),
stripe_size: shard_stripe_size
.map(ShardStripeSize)
.unwrap_or(ShardParameters::DEFAULT_STRIPE_SIZE),

View File

@@ -652,7 +652,9 @@ impl Endpoint {
}
ComputeStatus::Empty
| ComputeStatus::ConfigurationPending
| ComputeStatus::Configuration => {
| ComputeStatus::Configuration
| ComputeStatus::TerminationPending
| ComputeStatus::Terminated => {
bail!("unexpected compute status: {:?}", state.status)
}
}

View File

@@ -400,6 +400,11 @@ impl PageServerNode {
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'lazy_slru_download' as bool")?,
timeline_get_throttle: settings
.remove("timeline_get_throttle")
.map(serde_json::from_str)
.transpose()
.context("parse `timeline_get_throttle` from json")?,
};
if !settings.is_empty() {
bail!("Unrecognized tenant settings: {settings:?}")
@@ -505,6 +510,11 @@ impl PageServerNode {
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'lazy_slru_download' as bool")?,
timeline_get_throttle: settings
.remove("timeline_get_throttle")
.map(serde_json::from_str)
.transpose()
.context("parse `timeline_get_throttle` from json")?,
}
};

View File

@@ -52,6 +52,10 @@ pub enum ComputeStatus {
// compute will exit soon or is waiting for
// control-plane to terminate it.
Failed,
// Termination requested
TerminationPending,
// Terminated Postgres
Terminated,
}
fn rfc3339_serialize<S>(x: &Option<DateTime<Utc>>, s: S) -> Result<S::Ok, S::Error>

18
libs/desim/Cargo.toml Normal file
View File

@@ -0,0 +1,18 @@
[package]
name = "desim"
version = "0.1.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow.workspace = true
rand.workspace = true
tracing.workspace = true
bytes.workspace = true
utils.workspace = true
parking_lot.workspace = true
hex.workspace = true
scopeguard.workspace = true
smallvec = { workspace = true, features = ["write"] }
workspace_hack.workspace = true

7
libs/desim/README.md Normal file
View File

@@ -0,0 +1,7 @@
# Discrete Event SIMulator
This is a library for running simulations of distributed systems. The main idea is borrowed from [FoundationDB](https://www.youtube.com/watch?v=4fFDFbi3toc).
Each node runs as a separate thread. This library was not optimized for speed yet, but it's already much faster than running usual intergration tests in real time, because it uses virtual simulation time and can fast-forward time to skip intervals where all nodes are doing nothing but sleeping or waiting for something.
The original purpose for this library is to test walproposer and safekeeper implementation working together, in a scenarios close to the real world environment. This simulator is determenistic and can inject failures in networking without waiting minutes of wall-time to trigger timeout, which makes it easier to find bugs in our consensus implementation compared to using integration tests.

108
libs/desim/src/chan.rs Normal file
View File

@@ -0,0 +1,108 @@
use std::{collections::VecDeque, sync::Arc};
use parking_lot::{Mutex, MutexGuard};
use crate::executor::{self, PollSome, Waker};
/// FIFO channel with blocking send and receive. Can be cloned and shared between threads.
/// Blocking functions should be used only from threads that are managed by the executor.
pub struct Chan<T> {
shared: Arc<State<T>>,
}
impl<T> Clone for Chan<T> {
fn clone(&self) -> Self {
Chan {
shared: self.shared.clone(),
}
}
}
impl<T> Default for Chan<T> {
fn default() -> Self {
Self::new()
}
}
impl<T> Chan<T> {
pub fn new() -> Chan<T> {
Chan {
shared: Arc::new(State {
queue: Mutex::new(VecDeque::new()),
waker: Waker::new(),
}),
}
}
/// Get a message from the front of the queue, block if the queue is empty.
/// If not called from the executor thread, it can block forever.
pub fn recv(&self) -> T {
self.shared.recv()
}
/// Panic if the queue is empty.
pub fn must_recv(&self) -> T {
self.shared
.try_recv()
.expect("message should've been ready")
}
/// Get a message from the front of the queue, return None if the queue is empty.
/// Never blocks.
pub fn try_recv(&self) -> Option<T> {
self.shared.try_recv()
}
/// Send a message to the back of the queue.
pub fn send(&self, t: T) {
self.shared.send(t);
}
}
struct State<T> {
queue: Mutex<VecDeque<T>>,
waker: Waker,
}
impl<T> State<T> {
fn send(&self, t: T) {
self.queue.lock().push_back(t);
self.waker.wake_all();
}
fn try_recv(&self) -> Option<T> {
let mut q = self.queue.lock();
q.pop_front()
}
fn recv(&self) -> T {
// interrupt the receiver to prevent consuming everything at once
executor::yield_me(0);
let mut queue = self.queue.lock();
if let Some(t) = queue.pop_front() {
return t;
}
loop {
self.waker.wake_me_later();
if let Some(t) = queue.pop_front() {
return t;
}
MutexGuard::unlocked(&mut queue, || {
executor::yield_me(-1);
});
}
}
}
impl<T> PollSome for Chan<T> {
/// Schedules a wakeup for the current thread.
fn wake_me(&self) {
self.shared.waker.wake_me_later();
}
/// Checks if chan has any pending messages.
fn has_some(&self) -> bool {
!self.shared.queue.lock().is_empty()
}
}

483
libs/desim/src/executor.rs Normal file
View File

@@ -0,0 +1,483 @@
use std::{
panic::AssertUnwindSafe,
sync::{
atomic::{AtomicBool, AtomicU32, AtomicU8, Ordering},
mpsc, Arc, OnceLock,
},
thread::JoinHandle,
};
use tracing::{debug, error, trace};
use crate::time::Timing;
/// Stores status of the running threads. Threads are registered in the runtime upon creation
/// and deregistered upon termination.
pub struct Runtime {
// stores handles to all threads that are currently running
threads: Vec<ThreadHandle>,
// stores current time and pending wakeups
clock: Arc<Timing>,
// thread counter
thread_counter: AtomicU32,
// Thread step counter -- how many times all threads has been actually
// stepped (note that all world/time/executor/thread have slightly different
// meaning of steps). For observability.
pub step_counter: u64,
}
impl Runtime {
/// Init new runtime, no running threads.
pub fn new(clock: Arc<Timing>) -> Self {
Self {
threads: Vec::new(),
clock,
thread_counter: AtomicU32::new(0),
step_counter: 0,
}
}
/// Spawn a new thread and register it in the runtime.
pub fn spawn<F>(&mut self, f: F) -> ExternalHandle
where
F: FnOnce() + Send + 'static,
{
let (tx, rx) = mpsc::channel();
let clock = self.clock.clone();
let tid = self.thread_counter.fetch_add(1, Ordering::SeqCst);
debug!("spawning thread-{}", tid);
let join = std::thread::spawn(move || {
let _guard = tracing::info_span!("", tid).entered();
let res = std::panic::catch_unwind(AssertUnwindSafe(|| {
with_thread_context(|ctx| {
assert!(ctx.clock.set(clock).is_ok());
ctx.id.store(tid, Ordering::SeqCst);
tx.send(ctx.clone()).expect("failed to send thread context");
// suspend thread to put it to `threads` in sleeping state
ctx.yield_me(0);
});
// start user-provided function
f();
}));
debug!("thread finished");
if let Err(e) = res {
with_thread_context(|ctx| {
if !ctx.allow_panic.load(std::sync::atomic::Ordering::SeqCst) {
error!("thread panicked, terminating the process: {:?}", e);
std::process::exit(1);
}
debug!("thread panicked: {:?}", e);
let mut result = ctx.result.lock();
if result.0 == -1 {
*result = (256, format!("thread panicked: {:?}", e));
}
});
}
with_thread_context(|ctx| {
ctx.finish_me();
});
});
let ctx = rx.recv().expect("failed to receive thread context");
let handle = ThreadHandle::new(ctx.clone(), join);
self.threads.push(handle);
ExternalHandle { ctx }
}
/// Returns true if there are any unfinished activity, such as running thread or pending events.
/// Otherwise returns false, which means all threads are blocked forever.
pub fn step(&mut self) -> bool {
trace!("runtime step");
// have we run any thread?
let mut ran = false;
self.threads.retain(|thread: &ThreadHandle| {
let res = thread.ctx.wakeup.compare_exchange(
PENDING_WAKEUP,
NO_WAKEUP,
Ordering::SeqCst,
Ordering::SeqCst,
);
if res.is_err() {
// thread has no pending wakeups, leaving as is
return true;
}
ran = true;
trace!("entering thread-{}", thread.ctx.tid());
let status = thread.step();
self.step_counter += 1;
trace!(
"out of thread-{} with status {:?}",
thread.ctx.tid(),
status
);
if status == Status::Sleep {
true
} else {
trace!("thread has finished");
// removing the thread from the list
false
}
});
if !ran {
trace!("no threads were run, stepping clock");
if let Some(ctx_to_wake) = self.clock.step() {
trace!("waking up thread-{}", ctx_to_wake.tid());
ctx_to_wake.inc_wake();
} else {
return false;
}
}
true
}
/// Kill all threads. This is done by setting a flag in each thread context and waking it up.
pub fn crash_all_threads(&mut self) {
for thread in self.threads.iter() {
thread.ctx.crash_stop();
}
// all threads should be finished after a few steps
while !self.threads.is_empty() {
self.step();
}
}
}
impl Drop for Runtime {
fn drop(&mut self) {
debug!("dropping the runtime");
self.crash_all_threads();
}
}
#[derive(Clone)]
pub struct ExternalHandle {
ctx: Arc<ThreadContext>,
}
impl ExternalHandle {
/// Returns true if thread has finished execution.
pub fn is_finished(&self) -> bool {
let status = self.ctx.mutex.lock();
*status == Status::Finished
}
/// Returns exitcode and message, which is available after thread has finished execution.
pub fn result(&self) -> (i32, String) {
let result = self.ctx.result.lock();
result.clone()
}
/// Returns thread id.
pub fn id(&self) -> u32 {
self.ctx.id.load(Ordering::SeqCst)
}
/// Sets a flag to crash thread on the next wakeup.
pub fn crash_stop(&self) {
self.ctx.crash_stop();
}
}
struct ThreadHandle {
ctx: Arc<ThreadContext>,
_join: JoinHandle<()>,
}
impl ThreadHandle {
/// Create a new [`ThreadHandle`] and wait until thread will enter [`Status::Sleep`] state.
fn new(ctx: Arc<ThreadContext>, join: JoinHandle<()>) -> Self {
let mut status = ctx.mutex.lock();
// wait until thread will go into the first yield
while *status != Status::Sleep {
ctx.condvar.wait(&mut status);
}
drop(status);
Self { ctx, _join: join }
}
/// Allows thread to execute one step of its execution.
/// Returns [`Status`] of the thread after the step.
fn step(&self) -> Status {
let mut status = self.ctx.mutex.lock();
assert!(matches!(*status, Status::Sleep));
*status = Status::Running;
self.ctx.condvar.notify_all();
while *status == Status::Running {
self.ctx.condvar.wait(&mut status);
}
*status
}
}
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
/// Thread is running.
Running,
/// Waiting for event to complete, will be resumed by the executor step, once wakeup flag is set.
Sleep,
/// Thread finished execution.
Finished,
}
const NO_WAKEUP: u8 = 0;
const PENDING_WAKEUP: u8 = 1;
pub struct ThreadContext {
id: AtomicU32,
// used to block thread until it is woken up
mutex: parking_lot::Mutex<Status>,
condvar: parking_lot::Condvar,
// used as a flag to indicate runtime that thread is ready to be woken up
wakeup: AtomicU8,
clock: OnceLock<Arc<Timing>>,
// execution result, set by exit() call
result: parking_lot::Mutex<(i32, String)>,
// determines if process should be killed on receiving panic
allow_panic: AtomicBool,
// acts as a signal that thread should crash itself on the next wakeup
crash_request: AtomicBool,
}
impl ThreadContext {
pub(crate) fn new() -> Self {
Self {
id: AtomicU32::new(0),
mutex: parking_lot::Mutex::new(Status::Running),
condvar: parking_lot::Condvar::new(),
wakeup: AtomicU8::new(NO_WAKEUP),
clock: OnceLock::new(),
result: parking_lot::Mutex::new((-1, String::new())),
allow_panic: AtomicBool::new(false),
crash_request: AtomicBool::new(false),
}
}
}
// Functions for executor to control thread execution.
impl ThreadContext {
/// Set atomic flag to indicate that thread is ready to be woken up.
fn inc_wake(&self) {
self.wakeup.store(PENDING_WAKEUP, Ordering::SeqCst);
}
/// Internal function used for event queues.
pub(crate) fn schedule_wakeup(self: &Arc<Self>, after_ms: u64) {
self.clock
.get()
.unwrap()
.schedule_wakeup(after_ms, self.clone());
}
fn tid(&self) -> u32 {
self.id.load(Ordering::SeqCst)
}
fn crash_stop(&self) {
let status = self.mutex.lock();
if *status == Status::Finished {
debug!(
"trying to crash thread-{}, which is already finished",
self.tid()
);
return;
}
assert!(matches!(*status, Status::Sleep));
drop(status);
self.allow_panic.store(true, Ordering::SeqCst);
self.crash_request.store(true, Ordering::SeqCst);
// set a wakeup
self.inc_wake();
// it will panic on the next wakeup
}
}
// Internal functions.
impl ThreadContext {
/// Blocks thread until it's woken up by the executor. If `after_ms` is 0, is will be
/// woken on the next step. If `after_ms` > 0, wakeup is scheduled after that time.
/// Otherwise wakeup is not scheduled inside `yield_me`, and should be arranged before
/// calling this function.
fn yield_me(self: &Arc<Self>, after_ms: i64) {
let mut status = self.mutex.lock();
assert!(matches!(*status, Status::Running));
match after_ms.cmp(&0) {
std::cmp::Ordering::Less => {
// block until something wakes us up
}
std::cmp::Ordering::Equal => {
// tell executor that we are ready to be woken up
self.inc_wake();
}
std::cmp::Ordering::Greater => {
// schedule wakeup
self.clock
.get()
.unwrap()
.schedule_wakeup(after_ms as u64, self.clone());
}
}
*status = Status::Sleep;
self.condvar.notify_all();
// wait until executor wakes us up
while *status != Status::Running {
self.condvar.wait(&mut status);
}
if self.crash_request.load(Ordering::SeqCst) {
panic!("crashed by request");
}
}
/// Called only once, exactly before thread finishes execution.
fn finish_me(&self) {
let mut status = self.mutex.lock();
assert!(matches!(*status, Status::Running));
*status = Status::Finished;
{
let mut result = self.result.lock();
if result.0 == -1 {
*result = (0, "finished normally".to_owned());
}
}
self.condvar.notify_all();
}
}
/// Invokes the given closure with a reference to the current thread [`ThreadContext`].
#[inline(always)]
fn with_thread_context<T>(f: impl FnOnce(&Arc<ThreadContext>) -> T) -> T {
thread_local!(static THREAD_DATA: Arc<ThreadContext> = Arc::new(ThreadContext::new()));
THREAD_DATA.with(f)
}
/// Waker is used to wake up threads that are blocked on condition.
/// It keeps track of contexts [`Arc<ThreadContext>`] and can increment the counter
/// of several contexts to send a notification.
pub struct Waker {
// contexts that are waiting for a notification
contexts: parking_lot::Mutex<smallvec::SmallVec<[Arc<ThreadContext>; 8]>>,
}
impl Default for Waker {
fn default() -> Self {
Self::new()
}
}
impl Waker {
pub fn new() -> Self {
Self {
contexts: parking_lot::Mutex::new(smallvec::SmallVec::new()),
}
}
/// Subscribe current thread to receive a wake notification later.
pub fn wake_me_later(&self) {
with_thread_context(|ctx| {
self.contexts.lock().push(ctx.clone());
});
}
/// Wake up all threads that are waiting for a notification and clear the list.
pub fn wake_all(&self) {
let mut v = self.contexts.lock();
for ctx in v.iter() {
ctx.inc_wake();
}
v.clear();
}
}
/// See [`ThreadContext::yield_me`].
pub fn yield_me(after_ms: i64) {
with_thread_context(|ctx| ctx.yield_me(after_ms))
}
/// Get current time.
pub fn now() -> u64 {
with_thread_context(|ctx| ctx.clock.get().unwrap().now())
}
pub fn exit(code: i32, msg: String) {
with_thread_context(|ctx| {
ctx.allow_panic.store(true, Ordering::SeqCst);
let mut result = ctx.result.lock();
*result = (code, msg);
panic!("exit");
});
}
pub(crate) fn get_thread_ctx() -> Arc<ThreadContext> {
with_thread_context(|ctx| ctx.clone())
}
/// Trait for polling channels until they have something.
pub trait PollSome {
/// Schedule wakeup for message arrival.
fn wake_me(&self);
/// Check if channel has a ready message.
fn has_some(&self) -> bool;
}
/// Blocks current thread until one of the channels has a ready message. Returns
/// index of the channel that has a message. If timeout is reached, returns None.
///
/// Negative timeout means block forever. Zero timeout means check channels and return
/// immediately. Positive timeout means block until timeout is reached.
pub fn epoll_chans(chans: &[Box<dyn PollSome>], timeout: i64) -> Option<usize> {
let deadline = if timeout < 0 {
0
} else {
now() + timeout as u64
};
loop {
for chan in chans {
chan.wake_me()
}
for (i, chan) in chans.iter().enumerate() {
if chan.has_some() {
return Some(i);
}
}
if timeout < 0 {
// block until wakeup
yield_me(-1);
} else {
let current_time = now();
if current_time >= deadline {
return None;
}
yield_me((deadline - current_time) as i64);
}
}
}

8
libs/desim/src/lib.rs Normal file
View File

@@ -0,0 +1,8 @@
pub mod chan;
pub mod executor;
pub mod network;
pub mod node_os;
pub mod options;
pub mod proto;
pub mod time;
pub mod world;

451
libs/desim/src/network.rs Normal file
View File

@@ -0,0 +1,451 @@
use std::{
cmp::Ordering,
collections::{BinaryHeap, VecDeque},
fmt::{self, Debug},
ops::DerefMut,
sync::{mpsc, Arc},
};
use parking_lot::{
lock_api::{MappedMutexGuard, MutexGuard},
Mutex, RawMutex,
};
use rand::rngs::StdRng;
use tracing::debug;
use crate::{
executor::{self, ThreadContext},
options::NetworkOptions,
proto::NetEvent,
proto::NodeEvent,
};
use super::{chan::Chan, proto::AnyMessage};
pub struct NetworkTask {
options: Arc<NetworkOptions>,
connections: Mutex<Vec<VirtualConnection>>,
/// min-heap of connections having something to deliver.
events: Mutex<BinaryHeap<Event>>,
task_context: Arc<ThreadContext>,
}
impl NetworkTask {
pub fn start_new(options: Arc<NetworkOptions>, tx: mpsc::Sender<Arc<NetworkTask>>) {
let ctx = executor::get_thread_ctx();
let task = Arc::new(Self {
options,
connections: Mutex::new(Vec::new()),
events: Mutex::new(BinaryHeap::new()),
task_context: ctx,
});
// send the task upstream
tx.send(task.clone()).unwrap();
// start the task
task.start();
}
pub fn start_new_connection(self: &Arc<Self>, rng: StdRng, dst_accept: Chan<NodeEvent>) -> TCP {
let now = executor::now();
let connection_id = self.connections.lock().len();
let vc = VirtualConnection {
connection_id,
dst_accept,
dst_sockets: [Chan::new(), Chan::new()],
state: Mutex::new(ConnectionState {
buffers: [NetworkBuffer::new(None), NetworkBuffer::new(Some(now))],
rng,
}),
};
vc.schedule_timeout(self);
vc.send_connect(self);
let recv_chan = vc.dst_sockets[0].clone();
self.connections.lock().push(vc);
TCP {
net: self.clone(),
conn_id: connection_id,
dir: 0,
recv_chan,
}
}
}
// private functions
impl NetworkTask {
/// Schedule to wakeup network task (self) `after_ms` later to deliver
/// messages of connection `id`.
fn schedule(&self, id: usize, after_ms: u64) {
self.events.lock().push(Event {
time: executor::now() + after_ms,
conn_id: id,
});
self.task_context.schedule_wakeup(after_ms);
}
/// Get locked connection `id`.
fn get(&self, id: usize) -> MappedMutexGuard<'_, RawMutex, VirtualConnection> {
MutexGuard::map(self.connections.lock(), |connections| {
connections.get_mut(id).unwrap()
})
}
fn collect_pending_events(&self, now: u64, vec: &mut Vec<Event>) {
vec.clear();
let mut events = self.events.lock();
while let Some(event) = events.peek() {
if event.time > now {
break;
}
let event = events.pop().unwrap();
vec.push(event);
}
}
fn start(self: &Arc<Self>) {
debug!("started network task");
let mut events = Vec::new();
loop {
let now = executor::now();
self.collect_pending_events(now, &mut events);
for event in events.drain(..) {
let conn = self.get(event.conn_id);
conn.process(self);
}
// block until wakeup
executor::yield_me(-1);
}
}
}
// 0 - from node(0) to node(1)
// 1 - from node(1) to node(0)
type MessageDirection = u8;
fn sender_str(dir: MessageDirection) -> &'static str {
match dir {
0 => "client",
1 => "server",
_ => unreachable!(),
}
}
fn receiver_str(dir: MessageDirection) -> &'static str {
match dir {
0 => "server",
1 => "client",
_ => unreachable!(),
}
}
/// Virtual connection between two nodes.
/// Node 0 is the creator of the connection (client),
/// and node 1 is the acceptor (server).
struct VirtualConnection {
connection_id: usize,
/// one-off chan, used to deliver Accept message to dst
dst_accept: Chan<NodeEvent>,
/// message sinks
dst_sockets: [Chan<NetEvent>; 2],
state: Mutex<ConnectionState>,
}
struct ConnectionState {
buffers: [NetworkBuffer; 2],
rng: StdRng,
}
impl VirtualConnection {
/// Notify the future about the possible timeout.
fn schedule_timeout(&self, net: &NetworkTask) {
if let Some(timeout) = net.options.keepalive_timeout {
net.schedule(self.connection_id, timeout);
}
}
/// Send the handshake (Accept) to the server.
fn send_connect(&self, net: &NetworkTask) {
let now = executor::now();
let mut state = self.state.lock();
let delay = net.options.connect_delay.delay(&mut state.rng);
let buffer = &mut state.buffers[0];
assert!(buffer.buf.is_empty());
assert!(!buffer.recv_closed);
assert!(!buffer.send_closed);
assert!(buffer.last_recv.is_none());
let delay = if let Some(ms) = delay {
ms
} else {
debug!("NET: TCP #{} dropped connect", self.connection_id);
buffer.send_closed = true;
return;
};
// Send a message into the future.
buffer
.buf
.push_back((now + delay, AnyMessage::InternalConnect));
net.schedule(self.connection_id, delay);
}
/// Transmit some of the messages from the buffer to the nodes.
fn process(&self, net: &Arc<NetworkTask>) {
let now = executor::now();
let mut state = self.state.lock();
for direction in 0..2 {
self.process_direction(
net,
state.deref_mut(),
now,
direction as MessageDirection,
&self.dst_sockets[direction ^ 1],
);
}
// Close the one side of the connection by timeout if the node
// has not received any messages for a long time.
if let Some(timeout) = net.options.keepalive_timeout {
let mut to_close = [false, false];
for direction in 0..2 {
let buffer = &mut state.buffers[direction];
if buffer.recv_closed {
continue;
}
if let Some(last_recv) = buffer.last_recv {
if now - last_recv >= timeout {
debug!(
"NET: connection {} timed out at {}",
self.connection_id,
receiver_str(direction as MessageDirection)
);
let node_idx = direction ^ 1;
to_close[node_idx] = true;
}
}
}
drop(state);
for (node_idx, should_close) in to_close.iter().enumerate() {
if *should_close {
self.close(node_idx);
}
}
}
}
/// Process messages in the buffer in the given direction.
fn process_direction(
&self,
net: &Arc<NetworkTask>,
state: &mut ConnectionState,
now: u64,
direction: MessageDirection,
to_socket: &Chan<NetEvent>,
) {
let buffer = &mut state.buffers[direction as usize];
if buffer.recv_closed {
assert!(buffer.buf.is_empty());
}
while !buffer.buf.is_empty() && buffer.buf.front().unwrap().0 <= now {
let msg = buffer.buf.pop_front().unwrap().1;
buffer.last_recv = Some(now);
self.schedule_timeout(net);
if let AnyMessage::InternalConnect = msg {
// TODO: assert to_socket is the server
let server_to_client = TCP {
net: net.clone(),
conn_id: self.connection_id,
dir: direction ^ 1,
recv_chan: to_socket.clone(),
};
// special case, we need to deliver new connection to a separate channel
self.dst_accept.send(NodeEvent::Accept(server_to_client));
} else {
to_socket.send(NetEvent::Message(msg));
}
}
}
/// Try to send a message to the buffer, optionally dropping it and
/// determining delivery timestamp.
fn send(&self, net: &NetworkTask, direction: MessageDirection, msg: AnyMessage) {
let now = executor::now();
let mut state = self.state.lock();
let (delay, close) = if let Some(ms) = net.options.send_delay.delay(&mut state.rng) {
(ms, false)
} else {
(0, true)
};
let buffer = &mut state.buffers[direction as usize];
if buffer.send_closed {
debug!(
"NET: TCP #{} dropped message {:?} (broken pipe)",
self.connection_id, msg
);
return;
}
if close {
debug!(
"NET: TCP #{} dropped message {:?} (pipe just broke)",
self.connection_id, msg
);
buffer.send_closed = true;
return;
}
if buffer.recv_closed {
debug!(
"NET: TCP #{} dropped message {:?} (recv closed)",
self.connection_id, msg
);
return;
}
// Send a message into the future.
buffer.buf.push_back((now + delay, msg));
net.schedule(self.connection_id, delay);
}
/// Close the connection. Only one side of the connection will be closed,
/// and no further messages will be delivered. The other side will not be notified.
fn close(&self, node_idx: usize) {
let mut state = self.state.lock();
let recv_buffer = &mut state.buffers[1 ^ node_idx];
if recv_buffer.recv_closed {
debug!(
"NET: TCP #{} closed twice at {}",
self.connection_id,
sender_str(node_idx as MessageDirection),
);
return;
}
debug!(
"NET: TCP #{} closed at {}",
self.connection_id,
sender_str(node_idx as MessageDirection),
);
recv_buffer.recv_closed = true;
for msg in recv_buffer.buf.drain(..) {
debug!(
"NET: TCP #{} dropped message {:?} (closed)",
self.connection_id, msg
);
}
let send_buffer = &mut state.buffers[node_idx];
send_buffer.send_closed = true;
drop(state);
// TODO: notify the other side?
self.dst_sockets[node_idx].send(NetEvent::Closed);
}
}
struct NetworkBuffer {
/// Messages paired with time of delivery
buf: VecDeque<(u64, AnyMessage)>,
/// True if the connection is closed on the receiving side,
/// i.e. no more messages from the buffer will be delivered.
recv_closed: bool,
/// True if the connection is closed on the sending side,
/// i.e. no more messages will be added to the buffer.
send_closed: bool,
/// Last time a message was delivered from the buffer.
/// If None, it means that the server is the receiver and
/// it has not yet aware of this connection (i.e. has not
/// received the Accept).
last_recv: Option<u64>,
}
impl NetworkBuffer {
fn new(last_recv: Option<u64>) -> Self {
Self {
buf: VecDeque::new(),
recv_closed: false,
send_closed: false,
last_recv,
}
}
}
/// Single end of a bidirectional network stream without reordering (TCP-like).
/// Reads are implemented using channels, writes go to the buffer inside VirtualConnection.
pub struct TCP {
net: Arc<NetworkTask>,
conn_id: usize,
dir: MessageDirection,
recv_chan: Chan<NetEvent>,
}
impl Debug for TCP {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "TCP #{} ({})", self.conn_id, sender_str(self.dir),)
}
}
impl TCP {
/// Send a message to the other side. It's guaranteed that it will not arrive
/// before the arrival of all messages sent earlier.
pub fn send(&self, msg: AnyMessage) {
let conn = self.net.get(self.conn_id);
conn.send(&self.net, self.dir, msg);
}
/// Get a channel to receive incoming messages.
pub fn recv_chan(&self) -> Chan<NetEvent> {
self.recv_chan.clone()
}
pub fn connection_id(&self) -> usize {
self.conn_id
}
pub fn close(&self) {
let conn = self.net.get(self.conn_id);
conn.close(self.dir as usize);
}
}
struct Event {
time: u64,
conn_id: usize,
}
// BinaryHeap is a max-heap, and we want a min-heap. Reverse the ordering here
// to get that.
impl PartialOrd for Event {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl Ord for Event {
fn cmp(&self, other: &Self) -> Ordering {
(other.time, other.conn_id).cmp(&(self.time, self.conn_id))
}
}
impl PartialEq for Event {
fn eq(&self, other: &Self) -> bool {
(other.time, other.conn_id) == (self.time, self.conn_id)
}
}
impl Eq for Event {}

54
libs/desim/src/node_os.rs Normal file
View File

@@ -0,0 +1,54 @@
use std::sync::Arc;
use rand::Rng;
use crate::proto::NodeEvent;
use super::{
chan::Chan,
network::TCP,
world::{Node, NodeId, World},
};
/// Abstraction with all functions (aka syscalls) available to the node.
#[derive(Clone)]
pub struct NodeOs {
world: Arc<World>,
internal: Arc<Node>,
}
impl NodeOs {
pub fn new(world: Arc<World>, internal: Arc<Node>) -> NodeOs {
NodeOs { world, internal }
}
/// Get the node id.
pub fn id(&self) -> NodeId {
self.internal.id
}
/// Opens a bidirectional connection with the other node. Always successful.
pub fn open_tcp(&self, dst: NodeId) -> TCP {
self.world.open_tcp(dst)
}
/// Returns a channel to receive node events (socket Accept and internal messages).
pub fn node_events(&self) -> Chan<NodeEvent> {
self.internal.node_events()
}
/// Get current time.
pub fn now(&self) -> u64 {
self.world.now()
}
/// Generate a random number in range [0, max).
pub fn random(&self, max: u64) -> u64 {
self.internal.rng.lock().gen_range(0..max)
}
/// Append a new event to the world event log.
pub fn log_event(&self, data: String) {
self.internal.log_event(data)
}
}

50
libs/desim/src/options.rs Normal file
View File

@@ -0,0 +1,50 @@
use rand::{rngs::StdRng, Rng};
/// Describes random delays and failures. Delay will be uniformly distributed in [min, max].
/// Connection failure will occur with the probablity fail_prob.
#[derive(Clone, Debug)]
pub struct Delay {
pub min: u64,
pub max: u64,
pub fail_prob: f64, // [0; 1]
}
impl Delay {
/// Create a struct with no delay, no failures.
pub fn empty() -> Delay {
Delay {
min: 0,
max: 0,
fail_prob: 0.0,
}
}
/// Create a struct with a fixed delay.
pub fn fixed(ms: u64) -> Delay {
Delay {
min: ms,
max: ms,
fail_prob: 0.0,
}
}
/// Generate a random delay in range [min, max]. Return None if the
/// message should be dropped.
pub fn delay(&self, rng: &mut StdRng) -> Option<u64> {
if rng.gen_bool(self.fail_prob) {
return None;
}
Some(rng.gen_range(self.min..=self.max))
}
}
/// Describes network settings. All network packets will be subjected to the same delays and failures.
#[derive(Clone, Debug)]
pub struct NetworkOptions {
/// Connection will be automatically closed after this timeout if no data is received.
pub keepalive_timeout: Option<u64>,
/// New connections will be delayed by this amount of time.
pub connect_delay: Delay,
/// Each message will be delayed by this amount of time.
pub send_delay: Delay,
}

63
libs/desim/src/proto.rs Normal file
View File

@@ -0,0 +1,63 @@
use std::fmt::Debug;
use bytes::Bytes;
use utils::lsn::Lsn;
use crate::{network::TCP, world::NodeId};
/// Internal node events.
#[derive(Debug)]
pub enum NodeEvent {
Accept(TCP),
Internal(AnyMessage),
}
/// Events that are coming from a network socket.
#[derive(Clone, Debug)]
pub enum NetEvent {
Message(AnyMessage),
Closed,
}
/// Custom events generated throughout the simulation. Can be used by the test to verify the correctness.
#[derive(Debug)]
pub struct SimEvent {
pub time: u64,
pub node: NodeId,
pub data: String,
}
/// Umbrella type for all possible flavours of messages. These events can be sent over network
/// or to an internal node events channel.
#[derive(Clone)]
pub enum AnyMessage {
/// Not used, empty placeholder.
None,
/// Used internally for notifying node about new incoming connection.
InternalConnect,
Just32(u32),
ReplCell(ReplCell),
Bytes(Bytes),
LSN(u64),
}
impl Debug for AnyMessage {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
AnyMessage::None => write!(f, "None"),
AnyMessage::InternalConnect => write!(f, "InternalConnect"),
AnyMessage::Just32(v) => write!(f, "Just32({})", v),
AnyMessage::ReplCell(v) => write!(f, "ReplCell({:?})", v),
AnyMessage::Bytes(v) => write!(f, "Bytes({})", hex::encode(v)),
AnyMessage::LSN(v) => write!(f, "LSN({})", Lsn(*v)),
}
}
}
/// Used in reliable_copy_test.rs
#[derive(Clone, Debug)]
pub struct ReplCell {
pub value: u32,
pub client_id: u32,
pub seqno: u32,
}

129
libs/desim/src/time.rs Normal file
View File

@@ -0,0 +1,129 @@
use std::{
cmp::Ordering,
collections::BinaryHeap,
ops::DerefMut,
sync::{
atomic::{AtomicU32, AtomicU64},
Arc,
},
};
use parking_lot::Mutex;
use tracing::trace;
use crate::executor::ThreadContext;
/// Holds current time and all pending wakeup events.
pub struct Timing {
/// Current world's time.
current_time: AtomicU64,
/// Pending timers.
queue: Mutex<BinaryHeap<Pending>>,
/// Global nonce. Makes picking events from binary heap queue deterministic
/// by appending a number to events with the same timestamp.
nonce: AtomicU32,
/// Used to schedule fake events.
fake_context: Arc<ThreadContext>,
}
impl Default for Timing {
fn default() -> Self {
Self::new()
}
}
impl Timing {
/// Create a new empty clock with time set to 0.
pub fn new() -> Timing {
Timing {
current_time: AtomicU64::new(0),
queue: Mutex::new(BinaryHeap::new()),
nonce: AtomicU32::new(0),
fake_context: Arc::new(ThreadContext::new()),
}
}
/// Return the current world's time.
pub fn now(&self) -> u64 {
self.current_time.load(std::sync::atomic::Ordering::SeqCst)
}
/// Tick-tock the global clock. Return the event ready to be processed
/// or move the clock forward and then return the event.
pub(crate) fn step(&self) -> Option<Arc<ThreadContext>> {
let mut queue = self.queue.lock();
if queue.is_empty() {
// no future events
return None;
}
if !self.is_event_ready(queue.deref_mut()) {
let next_time = queue.peek().unwrap().time;
self.current_time
.store(next_time, std::sync::atomic::Ordering::SeqCst);
trace!("rewind time to {}", next_time);
assert!(self.is_event_ready(queue.deref_mut()));
}
Some(queue.pop().unwrap().wake_context)
}
/// Append an event to the queue, to wakeup the thread in `ms` milliseconds.
pub(crate) fn schedule_wakeup(&self, ms: u64, wake_context: Arc<ThreadContext>) {
self.nonce.fetch_add(1, std::sync::atomic::Ordering::SeqCst);
let nonce = self.nonce.load(std::sync::atomic::Ordering::SeqCst);
self.queue.lock().push(Pending {
time: self.now() + ms,
nonce,
wake_context,
})
}
/// Append a fake event to the queue, to prevent clocks from skipping this time.
pub fn schedule_fake(&self, ms: u64) {
self.queue.lock().push(Pending {
time: self.now() + ms,
nonce: 0,
wake_context: self.fake_context.clone(),
});
}
/// Return true if there is a ready event.
fn is_event_ready(&self, queue: &mut BinaryHeap<Pending>) -> bool {
queue.peek().map_or(false, |x| x.time <= self.now())
}
/// Clear all pending events.
pub(crate) fn clear(&self) {
self.queue.lock().clear();
}
}
struct Pending {
time: u64,
nonce: u32,
wake_context: Arc<ThreadContext>,
}
// BinaryHeap is a max-heap, and we want a min-heap. Reverse the ordering here
// to get that.
impl PartialOrd for Pending {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl Ord for Pending {
fn cmp(&self, other: &Self) -> Ordering {
(other.time, other.nonce).cmp(&(self.time, self.nonce))
}
}
impl PartialEq for Pending {
fn eq(&self, other: &Self) -> bool {
(other.time, other.nonce) == (self.time, self.nonce)
}
}
impl Eq for Pending {}

180
libs/desim/src/world.rs Normal file
View File

@@ -0,0 +1,180 @@
use parking_lot::Mutex;
use rand::{rngs::StdRng, SeedableRng};
use std::{
ops::DerefMut,
sync::{mpsc, Arc},
};
use crate::{
executor::{ExternalHandle, Runtime},
network::NetworkTask,
options::NetworkOptions,
proto::{NodeEvent, SimEvent},
time::Timing,
};
use super::{chan::Chan, network::TCP, node_os::NodeOs};
pub type NodeId = u32;
/// World contains simulation state.
pub struct World {
nodes: Mutex<Vec<Arc<Node>>>,
/// Random number generator.
rng: Mutex<StdRng>,
/// Internal event log.
events: Mutex<Vec<SimEvent>>,
/// Separate task that processes all network messages.
network_task: Arc<NetworkTask>,
/// Runtime for running threads and moving time.
runtime: Mutex<Runtime>,
/// To get current time.
timing: Arc<Timing>,
}
impl World {
pub fn new(seed: u64, options: Arc<NetworkOptions>) -> World {
let timing = Arc::new(Timing::new());
let mut runtime = Runtime::new(timing.clone());
let (tx, rx) = mpsc::channel();
runtime.spawn(move || {
// create and start network background thread, and send it back via the channel
NetworkTask::start_new(options, tx)
});
// wait for the network task to start
while runtime.step() {}
let network_task = rx.recv().unwrap();
World {
nodes: Mutex::new(Vec::new()),
rng: Mutex::new(StdRng::seed_from_u64(seed)),
events: Mutex::new(Vec::new()),
network_task,
runtime: Mutex::new(runtime),
timing,
}
}
pub fn step(&self) -> bool {
self.runtime.lock().step()
}
pub fn get_thread_step_count(&self) -> u64 {
self.runtime.lock().step_counter
}
/// Create a new random number generator.
pub fn new_rng(&self) -> StdRng {
let mut rng = self.rng.lock();
StdRng::from_rng(rng.deref_mut()).unwrap()
}
/// Create a new node.
pub fn new_node(self: &Arc<Self>) -> Arc<Node> {
let mut nodes = self.nodes.lock();
let id = nodes.len() as NodeId;
let node = Arc::new(Node::new(id, self.clone(), self.new_rng()));
nodes.push(node.clone());
node
}
/// Get an internal node state by id.
fn get_node(&self, id: NodeId) -> Option<Arc<Node>> {
let nodes = self.nodes.lock();
let num = id as usize;
if num < nodes.len() {
Some(nodes[num].clone())
} else {
None
}
}
pub fn stop_all(&self) {
self.runtime.lock().crash_all_threads();
}
/// Returns a writable end of a TCP connection, to send src->dst messages.
pub fn open_tcp(self: &Arc<World>, dst: NodeId) -> TCP {
// TODO: replace unwrap() with /dev/null socket.
let dst = self.get_node(dst).unwrap();
let dst_accept = dst.node_events.lock().clone();
let rng = self.new_rng();
self.network_task.start_new_connection(rng, dst_accept)
}
/// Get current time.
pub fn now(&self) -> u64 {
self.timing.now()
}
/// Get a copy of the internal clock.
pub fn clock(&self) -> Arc<Timing> {
self.timing.clone()
}
pub fn add_event(&self, node: NodeId, data: String) {
let time = self.now();
self.events.lock().push(SimEvent { time, node, data });
}
pub fn take_events(&self) -> Vec<SimEvent> {
let mut events = self.events.lock();
let mut res = Vec::new();
std::mem::swap(&mut res, &mut events);
res
}
pub fn deallocate(&self) {
self.stop_all();
self.timing.clear();
self.nodes.lock().clear();
}
}
/// Internal node state.
pub struct Node {
pub id: NodeId,
node_events: Mutex<Chan<NodeEvent>>,
world: Arc<World>,
pub(crate) rng: Mutex<StdRng>,
}
impl Node {
pub fn new(id: NodeId, world: Arc<World>, rng: StdRng) -> Node {
Node {
id,
node_events: Mutex::new(Chan::new()),
world,
rng: Mutex::new(rng),
}
}
/// Spawn a new thread with this node context.
pub fn launch(self: &Arc<Self>, f: impl FnOnce(NodeOs) + Send + 'static) -> ExternalHandle {
let node = self.clone();
let world = self.world.clone();
self.world.runtime.lock().spawn(move || {
f(NodeOs::new(world, node.clone()));
})
}
/// Returns a channel to receive Accepts and internal messages.
pub fn node_events(&self) -> Chan<NodeEvent> {
self.node_events.lock().clone()
}
/// This will drop all in-flight Accept messages.
pub fn replug_node_events(&self, chan: Chan<NodeEvent>) {
*self.node_events.lock() = chan;
}
/// Append event to the world's log.
pub fn log_event(&self, data: String) {
self.world.add_event(self.id, data)
}
}

View File

@@ -0,0 +1,244 @@
//! Simple test to verify that simulator is working.
#[cfg(test)]
mod reliable_copy_test {
use anyhow::Result;
use desim::executor::{self, PollSome};
use desim::options::{Delay, NetworkOptions};
use desim::proto::{NetEvent, NodeEvent, ReplCell};
use desim::world::{NodeId, World};
use desim::{node_os::NodeOs, proto::AnyMessage};
use parking_lot::Mutex;
use std::sync::Arc;
use tracing::info;
/// Disk storage trait and implementation.
pub trait Storage<T> {
fn flush_pos(&self) -> u32;
fn flush(&mut self) -> Result<()>;
fn write(&mut self, t: T);
}
#[derive(Clone)]
pub struct SharedStorage<T> {
pub state: Arc<Mutex<InMemoryStorage<T>>>,
}
impl<T> SharedStorage<T> {
pub fn new() -> Self {
Self {
state: Arc::new(Mutex::new(InMemoryStorage::new())),
}
}
}
impl<T> Storage<T> for SharedStorage<T> {
fn flush_pos(&self) -> u32 {
self.state.lock().flush_pos
}
fn flush(&mut self) -> Result<()> {
executor::yield_me(0);
self.state.lock().flush()
}
fn write(&mut self, t: T) {
executor::yield_me(0);
self.state.lock().write(t);
}
}
pub struct InMemoryStorage<T> {
pub data: Vec<T>,
pub flush_pos: u32,
}
impl<T> InMemoryStorage<T> {
pub fn new() -> Self {
Self {
data: Vec::new(),
flush_pos: 0,
}
}
pub fn flush(&mut self) -> Result<()> {
self.flush_pos = self.data.len() as u32;
Ok(())
}
pub fn write(&mut self, t: T) {
self.data.push(t);
}
}
/// Server implementation.
pub fn run_server(os: NodeOs, mut storage: Box<dyn Storage<u32>>) {
info!("started server");
let node_events = os.node_events();
let mut epoll_vec: Vec<Box<dyn PollSome>> = vec![Box::new(node_events.clone())];
let mut sockets = vec![];
loop {
let index = executor::epoll_chans(&epoll_vec, -1).unwrap();
if index == 0 {
let node_event = node_events.must_recv();
info!("got node event: {:?}", node_event);
if let NodeEvent::Accept(tcp) = node_event {
tcp.send(AnyMessage::Just32(storage.flush_pos()));
epoll_vec.push(Box::new(tcp.recv_chan()));
sockets.push(tcp);
}
continue;
}
let recv_chan = sockets[index - 1].recv_chan();
let socket = &sockets[index - 1];
let event = recv_chan.must_recv();
info!("got event: {:?}", event);
if let NetEvent::Message(AnyMessage::ReplCell(cell)) = event {
if cell.seqno != storage.flush_pos() {
info!("got out of order data: {:?}", cell);
continue;
}
storage.write(cell.value);
storage.flush().unwrap();
socket.send(AnyMessage::Just32(storage.flush_pos()));
}
}
}
/// Client copies all data from array to the remote node.
pub fn run_client(os: NodeOs, data: &[ReplCell], dst: NodeId) {
info!("started client");
let mut delivered = 0;
let mut sock = os.open_tcp(dst);
let mut recv_chan = sock.recv_chan();
while delivered < data.len() {
let num = &data[delivered];
info!("sending data: {:?}", num.clone());
sock.send(AnyMessage::ReplCell(num.clone()));
// loop {
let event = recv_chan.recv();
match event {
NetEvent::Message(AnyMessage::Just32(flush_pos)) => {
if flush_pos == 1 + delivered as u32 {
delivered += 1;
}
}
NetEvent::Closed => {
info!("connection closed, reestablishing");
sock = os.open_tcp(dst);
recv_chan = sock.recv_chan();
}
_ => {}
}
// }
}
let sock = os.open_tcp(dst);
for num in data {
info!("sending data: {:?}", num.clone());
sock.send(AnyMessage::ReplCell(num.clone()));
}
info!("sent all data and finished client");
}
/// Run test simulations.
#[test]
fn sim_example_reliable_copy() {
utils::logging::init(
utils::logging::LogFormat::Test,
utils::logging::TracingErrorLayerEnablement::Disabled,
utils::logging::Output::Stdout,
)
.expect("logging init failed");
let delay = Delay {
min: 1,
max: 60,
fail_prob: 0.4,
};
let network = NetworkOptions {
keepalive_timeout: Some(50),
connect_delay: delay.clone(),
send_delay: delay.clone(),
};
for seed in 0..20 {
let u32_data: [u32; 5] = [1, 2, 3, 4, 5];
let data = u32_to_cells(&u32_data, 1);
let world = Arc::new(World::new(seed, Arc::new(network.clone())));
start_simulation(Options {
world,
time_limit: 1_000_000,
client_fn: Box::new(move |os, server_id| run_client(os, &data, server_id)),
u32_data,
});
}
}
pub struct Options {
pub world: Arc<World>,
pub time_limit: u64,
pub u32_data: [u32; 5],
pub client_fn: Box<dyn FnOnce(NodeOs, u32) + Send + 'static>,
}
pub fn start_simulation(options: Options) {
let world = options.world;
let client_node = world.new_node();
let server_node = world.new_node();
let server_id = server_node.id;
// start the client thread
client_node.launch(move |os| {
let client_fn = options.client_fn;
client_fn(os, server_id);
});
// start the server thread
let shared_storage = SharedStorage::new();
let server_storage = shared_storage.clone();
server_node.launch(move |os| run_server(os, Box::new(server_storage)));
while world.step() && world.now() < options.time_limit {}
let disk_data = shared_storage.state.lock().data.clone();
assert!(verify_data(&disk_data, &options.u32_data[..]));
}
pub fn u32_to_cells(data: &[u32], client_id: u32) -> Vec<ReplCell> {
let mut res = Vec::new();
for (i, _) in data.iter().enumerate() {
res.push(ReplCell {
client_id,
seqno: i as u32,
value: data[i],
});
}
res
}
fn verify_data(disk_data: &[u32], data: &[u32]) -> bool {
if disk_data.len() != data.len() {
return false;
}
for i in 0..data.len() {
if disk_data[i] != data[i] {
return false;
}
}
true
}
}

View File

@@ -115,7 +115,6 @@ pub fn set_build_info_metric(revision: &str, build_tag: &str) {
// performed by the process.
// We know the size of the block, so we can determine the I/O bytes out of it.
// The value might be not 100% exact, but should be fine for Prometheus metrics in this case.
#[allow(clippy::unnecessary_cast)]
fn update_rusage_metrics() {
let rusage_stats = get_rusage_stats();

View File

@@ -214,14 +214,14 @@ impl ShardParameters {
pub const DEFAULT_STRIPE_SIZE: ShardStripeSize = ShardStripeSize(256 * 1024 / 8);
pub fn is_unsharded(&self) -> bool {
self.count == ShardCount(0)
self.count.is_unsharded()
}
}
impl Default for ShardParameters {
fn default() -> Self {
Self {
count: ShardCount(0),
count: ShardCount::new(0),
stripe_size: Self::DEFAULT_STRIPE_SIZE,
}
}
@@ -283,6 +283,7 @@ pub struct TenantConfig {
pub gc_feedback: Option<bool>,
pub heatmap_period: Option<String>,
pub lazy_slru_download: Option<bool>,
pub timeline_get_throttle: Option<ThrottleConfig>,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
@@ -309,6 +310,35 @@ pub struct EvictionPolicyLayerAccessThreshold {
pub threshold: Duration,
}
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
pub struct ThrottleConfig {
pub task_kinds: Vec<String>, // TaskKind
pub initial: usize,
#[serde(with = "humantime_serde")]
pub refill_interval: Duration,
pub refill_amount: NonZeroUsize,
pub max: usize,
pub fair: bool,
}
impl ThrottleConfig {
pub fn disabled() -> Self {
Self {
task_kinds: vec![], // effectively disables the throttle
// other values don't matter with emtpy `task_kinds`.
initial: 0,
refill_interval: Duration::from_millis(1),
refill_amount: NonZeroUsize::new(1).unwrap(),
max: 1,
fair: true,
}
}
/// The requests per second allowed by the given config.
pub fn steady_rps(&self) -> f64 {
(self.refill_amount.get() as f64) / (self.refill_interval.as_secs_f64()) / 1e3
}
}
/// A flattened analog of a `pagesever::tenant::LocationMode`, which
/// lists out all possible states (and the virtual "Detached" state)
/// in a flat form rather than using rust-style enums.
@@ -494,6 +524,8 @@ pub struct TimelineInfo {
pub current_logical_size: u64,
pub current_logical_size_is_accurate: bool,
pub directory_entries_counts: Vec<u64>,
/// Sum of the size of all layer files.
/// If a layer is present in both local FS and S3, it counts only once.
pub current_physical_size: Option<u64>, // is None when timeline is Unloaded

View File

@@ -124,6 +124,7 @@ impl RelTag {
Ord,
strum_macros::EnumIter,
strum_macros::FromRepr,
enum_map::Enum,
)]
#[repr(u8)]
pub enum SlruKind {

View File

@@ -13,10 +13,41 @@ use utils::id::TenantId;
pub struct ShardNumber(pub u8);
#[derive(Ord, PartialOrd, Eq, PartialEq, Clone, Copy, Serialize, Deserialize, Debug, Hash)]
pub struct ShardCount(pub u8);
pub struct ShardCount(u8);
impl ShardCount {
pub const MAX: Self = Self(u8::MAX);
/// The internal value of a ShardCount may be zero, which means "1 shard, but use
/// legacy format for TenantShardId that excludes the shard suffix", also known
/// as `TenantShardId::unsharded`.
///
/// This method returns the actual number of shards, i.e. if our internal value is
/// zero, we return 1 (unsharded tenants have 1 shard).
pub fn count(&self) -> u8 {
if self.0 > 0 {
self.0
} else {
1
}
}
/// The literal internal value: this is **not** the number of shards in the
/// tenant, as we have a special zero value for legacy unsharded tenants. Use
/// [`Self::count`] if you want to know the cardinality of shards.
pub fn literal(&self) -> u8 {
self.0
}
pub fn is_unsharded(&self) -> bool {
self.0 == 0
}
/// `v` may be zero, or the number of shards in the tenant. `v` is what
/// [`Self::literal`] would return.
pub fn new(val: u8) -> Self {
Self(val)
}
}
impl ShardNumber {
@@ -86,7 +117,7 @@ impl TenantShardId {
}
pub fn is_unsharded(&self) -> bool {
self.shard_number == ShardNumber(0) && self.shard_count == ShardCount(0)
self.shard_number == ShardNumber(0) && self.shard_count.is_unsharded()
}
/// Convenience for dropping the tenant_id and just getting the ShardIndex: this
@@ -471,10 +502,12 @@ impl ShardIdentity {
pub fn is_key_disposable(&self, key: &Key) -> bool {
if key_is_shard0(key) {
// Q: Why can't we dispose of shard0 content if we're not shard 0?
// A: because the WAL ingestion logic currently ingests some shard 0
// content on all shards, even though it's only read on shard 0. If we
// dropped it, then subsequent WAL ingest to these keys would encounter
// an error.
// A1: because the WAL ingestion logic currently ingests some shard 0
// content on all shards, even though it's only read on shard 0. If we
// dropped it, then subsequent WAL ingest to these keys would encounter
// an error.
// A2: because key_is_shard0 also covers relation size keys, which are written
// on all shards even though they're only maintained accurately on shard 0.
false
} else {
!self.is_key_local(key)

View File

@@ -3,7 +3,7 @@
#![allow(non_snake_case)]
// bindgen creates some unsafe code with no doc comments.
#![allow(clippy::missing_safety_doc)]
// noted at 1.63 that in many cases there's a u32 -> u32 transmutes in bindgen code.
// noted at 1.63 that in many cases there's u32 -> u32 transmutes in bindgen code.
#![allow(clippy::useless_transmute)]
// modules included with the postgres_ffi macro depend on the types of the specific version's
// types, and trigger a too eager lint.

View File

@@ -431,11 +431,11 @@ pub fn generate_wal_segment(segno: u64, system_id: u64, lsn: Lsn) -> Result<Byte
#[repr(C)]
#[derive(Serialize)]
struct XlLogicalMessage {
db_id: Oid,
transactional: uint32, // bool, takes 4 bytes due to alignment in C structures
prefix_size: uint64,
message_size: uint64,
pub struct XlLogicalMessage {
pub db_id: Oid,
pub transactional: uint32, // bool, takes 4 bytes due to alignment in C structures
pub prefix_size: uint64,
pub message_size: uint64,
}
impl XlLogicalMessage {

View File

@@ -13,5 +13,6 @@ rand.workspace = true
tokio.workspace = true
tracing.workspace = true
thiserror.workspace = true
serde.workspace = true
workspace_hack.workspace = true

View File

@@ -7,6 +7,7 @@ pub mod framed;
use byteorder::{BigEndian, ReadBytesExt};
use bytes::{Buf, BufMut, Bytes, BytesMut};
use serde::{Deserialize, Serialize};
use std::{borrow::Cow, collections::HashMap, fmt, io, str};
// re-export for use in utils pageserver_feedback.rs
@@ -123,7 +124,7 @@ impl StartupMessageParams {
}
}
#[derive(Debug, Hash, PartialEq, Eq, Clone, Copy)]
#[derive(Debug, Hash, PartialEq, Eq, Clone, Copy, Serialize, Deserialize)]
pub struct CancelKeyData {
pub backend_pid: i32,
pub cancel_key: i32,

View File

@@ -15,11 +15,13 @@ aws-sdk-s3.workspace = true
aws-credential-types.workspace = true
bytes.workspace = true
camino.workspace = true
humantime.workspace = true
hyper = { workspace = true, features = ["stream"] }
futures.workspace = true
serde.workspace = true
serde_json.workspace = true
tokio = { workspace = true, features = ["sync", "fs", "io-util"] }
tokio-stream.workspace = true
tokio-util = { workspace = true, features = ["compat"] }
toml_edit.workspace = true
tracing.workspace = true

View File

@@ -22,16 +22,15 @@ use azure_storage_blobs::{blob::operations::GetBlobBuilder, prelude::ContainerCl
use bytes::Bytes;
use futures::stream::Stream;
use futures_util::StreamExt;
use futures_util::TryStreamExt;
use http_types::{StatusCode, Url};
use tokio::time::Instant;
use tokio_util::sync::CancellationToken;
use tracing::debug;
use crate::s3_bucket::RequestKind;
use crate::TimeTravelError;
use crate::{
AzureConfig, ConcurrencyLimiter, Download, DownloadError, Listing, ListingMode, RemotePath,
RemoteStorage, StorageMetadata,
error::Cancelled, s3_bucket::RequestKind, AzureConfig, ConcurrencyLimiter, Download,
DownloadError, Listing, ListingMode, RemotePath, RemoteStorage, StorageMetadata,
TimeTravelError, TimeoutOrCancel,
};
pub struct AzureBlobStorage {
@@ -39,10 +38,12 @@ pub struct AzureBlobStorage {
prefix_in_container: Option<String>,
max_keys_per_list_response: Option<NonZeroU32>,
concurrency_limiter: ConcurrencyLimiter,
// Per-request timeout. Accessible for tests.
pub timeout: Duration,
}
impl AzureBlobStorage {
pub fn new(azure_config: &AzureConfig) -> Result<Self> {
pub fn new(azure_config: &AzureConfig, timeout: Duration) -> Result<Self> {
debug!(
"Creating azure remote storage for azure container {}",
azure_config.container_name
@@ -79,6 +80,7 @@ impl AzureBlobStorage {
prefix_in_container: azure_config.prefix_in_container.to_owned(),
max_keys_per_list_response,
concurrency_limiter: ConcurrencyLimiter::new(azure_config.concurrency_limit.get()),
timeout,
})
}
@@ -121,8 +123,11 @@ impl AzureBlobStorage {
async fn download_for_builder(
&self,
builder: GetBlobBuilder,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
let mut response = builder.into_stream();
let kind = RequestKind::Get;
let _permit = self.permit(kind, cancel).await?;
let mut etag = None;
let mut last_modified = None;
@@ -130,39 +135,70 @@ impl AzureBlobStorage {
// TODO give proper streaming response instead of buffering into RAM
// https://github.com/neondatabase/neon/issues/5563
let mut bufs = Vec::new();
while let Some(part) = response.next().await {
let part = part.map_err(to_download_error)?;
let etag_str: &str = part.blob.properties.etag.as_ref();
if etag.is_none() {
etag = Some(etag.unwrap_or_else(|| etag_str.to_owned()));
let download = async {
let response = builder
// convert to concrete Pageable
.into_stream()
// convert to TryStream
.into_stream()
.map_err(to_download_error);
// apply per request timeout
let response = tokio_stream::StreamExt::timeout(response, self.timeout);
// flatten
let response = response.map(|res| match res {
Ok(res) => res,
Err(_elapsed) => Err(DownloadError::Timeout),
});
let mut response = std::pin::pin!(response);
let mut bufs = Vec::new();
while let Some(part) = response.next().await {
let part = part?;
let etag_str: &str = part.blob.properties.etag.as_ref();
if etag.is_none() {
etag = Some(etag.unwrap_or_else(|| etag_str.to_owned()));
}
if last_modified.is_none() {
last_modified = Some(part.blob.properties.last_modified.into());
}
if let Some(blob_meta) = part.blob.metadata {
metadata.extend(blob_meta.iter().map(|(k, v)| (k.to_owned(), v.to_owned())));
}
let data = part
.data
.collect()
.await
.map_err(|e| DownloadError::Other(e.into()))?;
bufs.push(data);
}
if last_modified.is_none() {
last_modified = Some(part.blob.properties.last_modified.into());
}
if let Some(blob_meta) = part.blob.metadata {
metadata.extend(blob_meta.iter().map(|(k, v)| (k.to_owned(), v.to_owned())));
}
let data = part
.data
.collect()
.await
.map_err(|e| DownloadError::Other(e.into()))?;
bufs.push(data);
Ok(Download {
download_stream: Box::pin(futures::stream::iter(bufs.into_iter().map(Ok))),
etag,
last_modified,
metadata: Some(StorageMetadata(metadata)),
})
};
tokio::select! {
bufs = download => bufs,
_ = cancel.cancelled() => Err(DownloadError::Cancelled),
}
Ok(Download {
download_stream: Box::pin(futures::stream::iter(bufs.into_iter().map(Ok))),
etag,
last_modified,
metadata: Some(StorageMetadata(metadata)),
})
}
async fn permit(&self, kind: RequestKind) -> tokio::sync::SemaphorePermit<'_> {
self.concurrency_limiter
.acquire(kind)
.await
.expect("semaphore is never closed")
async fn permit(
&self,
kind: RequestKind,
cancel: &CancellationToken,
) -> Result<tokio::sync::SemaphorePermit<'_>, Cancelled> {
let acquire = self.concurrency_limiter.acquire(kind);
tokio::select! {
permit = acquire => Ok(permit.expect("never closed")),
_ = cancel.cancelled() => Err(Cancelled),
}
}
}
@@ -192,66 +228,87 @@ impl RemoteStorage for AzureBlobStorage {
prefix: Option<&RemotePath>,
mode: ListingMode,
max_keys: Option<NonZeroU32>,
cancel: &CancellationToken,
) -> anyhow::Result<Listing, DownloadError> {
// get the passed prefix or if it is not set use prefix_in_bucket value
let list_prefix = prefix
.map(|p| self.relative_path_to_name(p))
.or_else(|| self.prefix_in_container.clone())
.map(|mut p| {
// required to end with a separator
// otherwise request will return only the entry of a prefix
if matches!(mode, ListingMode::WithDelimiter)
&& !p.ends_with(REMOTE_STORAGE_PREFIX_SEPARATOR)
{
p.push(REMOTE_STORAGE_PREFIX_SEPARATOR);
}
p
let _permit = self.permit(RequestKind::List, cancel).await?;
let op = async {
// get the passed prefix or if it is not set use prefix_in_bucket value
let list_prefix = prefix
.map(|p| self.relative_path_to_name(p))
.or_else(|| self.prefix_in_container.clone())
.map(|mut p| {
// required to end with a separator
// otherwise request will return only the entry of a prefix
if matches!(mode, ListingMode::WithDelimiter)
&& !p.ends_with(REMOTE_STORAGE_PREFIX_SEPARATOR)
{
p.push(REMOTE_STORAGE_PREFIX_SEPARATOR);
}
p
});
let mut builder = self.client.list_blobs();
if let ListingMode::WithDelimiter = mode {
builder = builder.delimiter(REMOTE_STORAGE_PREFIX_SEPARATOR.to_string());
}
if let Some(prefix) = list_prefix {
builder = builder.prefix(Cow::from(prefix.to_owned()));
}
if let Some(limit) = self.max_keys_per_list_response {
builder = builder.max_results(MaxResults::new(limit));
}
let response = builder.into_stream();
let response = response.into_stream().map_err(to_download_error);
let response = tokio_stream::StreamExt::timeout(response, self.timeout);
let response = response.map(|res| match res {
Ok(res) => res,
Err(_elapsed) => Err(DownloadError::Timeout),
});
let mut builder = self.client.list_blobs();
let mut response = std::pin::pin!(response);
if let ListingMode::WithDelimiter = mode {
builder = builder.delimiter(REMOTE_STORAGE_PREFIX_SEPARATOR.to_string());
}
let mut res = Listing::default();
if let Some(prefix) = list_prefix {
builder = builder.prefix(Cow::from(prefix.to_owned()));
}
let mut max_keys = max_keys.map(|mk| mk.get());
while let Some(entry) = response.next().await {
let entry = entry?;
let prefix_iter = entry
.blobs
.prefixes()
.map(|prefix| self.name_to_relative_path(&prefix.name));
res.prefixes.extend(prefix_iter);
if let Some(limit) = self.max_keys_per_list_response {
builder = builder.max_results(MaxResults::new(limit));
}
let blob_iter = entry
.blobs
.blobs()
.map(|k| self.name_to_relative_path(&k.name));
let mut response = builder.into_stream();
let mut res = Listing::default();
// NonZeroU32 doesn't support subtraction apparently
let mut max_keys = max_keys.map(|mk| mk.get());
while let Some(l) = response.next().await {
let entry = l.map_err(to_download_error)?;
let prefix_iter = entry
.blobs
.prefixes()
.map(|prefix| self.name_to_relative_path(&prefix.name));
res.prefixes.extend(prefix_iter);
for key in blob_iter {
res.keys.push(key);
let blob_iter = entry
.blobs
.blobs()
.map(|k| self.name_to_relative_path(&k.name));
for key in blob_iter {
res.keys.push(key);
if let Some(mut mk) = max_keys {
assert!(mk > 0);
mk -= 1;
if mk == 0 {
return Ok(res); // limit reached
if let Some(mut mk) = max_keys {
assert!(mk > 0);
mk -= 1;
if mk == 0 {
return Ok(res); // limit reached
}
max_keys = Some(mk);
}
max_keys = Some(mk);
}
}
Ok(res)
};
tokio::select! {
res = op => res,
_ = cancel.cancelled() => Err(DownloadError::Cancelled),
}
Ok(res)
}
async fn upload(
@@ -260,35 +317,52 @@ impl RemoteStorage for AzureBlobStorage {
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let _permit = self.permit(RequestKind::Put).await;
let blob_client = self.client.blob_client(self.relative_path_to_name(to));
let _permit = self.permit(RequestKind::Put, cancel).await?;
let from: Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static>> =
Box::pin(from);
let op = async {
let blob_client = self.client.blob_client(self.relative_path_to_name(to));
let from = NonSeekableStream::new(from, data_size_bytes);
let from: Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static>> =
Box::pin(from);
let body = azure_core::Body::SeekableStream(Box::new(from));
let from = NonSeekableStream::new(from, data_size_bytes);
let mut builder = blob_client.put_block_blob(body);
let body = azure_core::Body::SeekableStream(Box::new(from));
if let Some(metadata) = metadata {
builder = builder.metadata(to_azure_metadata(metadata));
let mut builder = blob_client.put_block_blob(body);
if let Some(metadata) = metadata {
builder = builder.metadata(to_azure_metadata(metadata));
}
let fut = builder.into_future();
let fut = tokio::time::timeout(self.timeout, fut);
match fut.await {
Ok(Ok(_response)) => Ok(()),
Ok(Err(azure)) => Err(azure.into()),
Err(_timeout) => Err(TimeoutOrCancel::Cancel.into()),
}
};
tokio::select! {
res = op => res,
_ = cancel.cancelled() => Err(TimeoutOrCancel::Cancel.into()),
}
let _response = builder.into_future().await?;
Ok(())
}
async fn download(&self, from: &RemotePath) -> Result<Download, DownloadError> {
let _permit = self.permit(RequestKind::Get).await;
async fn download(
&self,
from: &RemotePath,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
let blob_client = self.client.blob_client(self.relative_path_to_name(from));
let builder = blob_client.get();
self.download_for_builder(builder).await
self.download_for_builder(builder, cancel).await
}
async fn download_byte_range(
@@ -296,8 +370,8 @@ impl RemoteStorage for AzureBlobStorage {
from: &RemotePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
let _permit = self.permit(RequestKind::Get).await;
let blob_client = self.client.blob_client(self.relative_path_to_name(from));
let mut builder = blob_client.get();
@@ -309,82 +383,113 @@ impl RemoteStorage for AzureBlobStorage {
};
builder = builder.range(range);
self.download_for_builder(builder).await
self.download_for_builder(builder, cancel).await
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
let _permit = self.permit(RequestKind::Delete).await;
let blob_client = self.client.blob_client(self.relative_path_to_name(path));
async fn delete(&self, path: &RemotePath, cancel: &CancellationToken) -> anyhow::Result<()> {
self.delete_objects(std::array::from_ref(path), cancel)
.await
}
let builder = blob_client.delete();
async fn delete_objects<'a>(
&self,
paths: &'a [RemotePath],
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let _permit = self.permit(RequestKind::Delete, cancel).await?;
match builder.into_future().await {
Ok(_response) => Ok(()),
Err(e) => {
if let Some(http_err) = e.as_http_error() {
if http_err.status() == StatusCode::NotFound {
return Ok(());
let op = async {
// TODO batch requests are also not supported by the SDK
// https://github.com/Azure/azure-sdk-for-rust/issues/1068
// https://github.com/Azure/azure-sdk-for-rust/issues/1249
for path in paths {
let blob_client = self.client.blob_client(self.relative_path_to_name(path));
let request = blob_client.delete().into_future();
let res = tokio::time::timeout(self.timeout, request).await;
match res {
Ok(Ok(_response)) => continue,
Ok(Err(e)) => {
if let Some(http_err) = e.as_http_error() {
if http_err.status() == StatusCode::NotFound {
continue;
}
}
return Err(e.into());
}
Err(_elapsed) => return Err(TimeoutOrCancel::Timeout.into()),
}
Err(anyhow::Error::new(e))
}
Ok(())
};
tokio::select! {
res = op => res,
_ = cancel.cancelled() => Err(TimeoutOrCancel::Cancel.into()),
}
}
async fn delete_objects<'a>(&self, paths: &'a [RemotePath]) -> anyhow::Result<()> {
// Permit is already obtained by inner delete function
async fn copy(
&self,
from: &RemotePath,
to: &RemotePath,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let _permit = self.permit(RequestKind::Copy, cancel).await?;
// TODO batch requests are also not supported by the SDK
// https://github.com/Azure/azure-sdk-for-rust/issues/1068
// https://github.com/Azure/azure-sdk-for-rust/issues/1249
for path in paths {
self.delete(path).await?;
}
Ok(())
}
let timeout = tokio::time::sleep(self.timeout);
async fn copy(&self, from: &RemotePath, to: &RemotePath) -> anyhow::Result<()> {
let _permit = self.permit(RequestKind::Copy).await;
let blob_client = self.client.blob_client(self.relative_path_to_name(to));
let mut copy_status = None;
let source_url = format!(
"{}/{}",
self.client.url()?,
self.relative_path_to_name(from)
);
let builder = blob_client.copy(Url::from_str(&source_url)?);
let op = async {
let blob_client = self.client.blob_client(self.relative_path_to_name(to));
let result = builder.into_future().await?;
let source_url = format!(
"{}/{}",
self.client.url()?,
self.relative_path_to_name(from)
);
let mut copy_status = result.copy_status;
let start_time = Instant::now();
const MAX_WAIT_TIME: Duration = Duration::from_secs(60);
loop {
match copy_status {
CopyStatus::Aborted => {
anyhow::bail!("Received abort for copy from {from} to {to}.");
let builder = blob_client.copy(Url::from_str(&source_url)?);
let copy = builder.into_future();
let result = copy.await?;
copy_status = Some(result.copy_status);
loop {
match copy_status.as_ref().expect("we always set it to Some") {
CopyStatus::Aborted => {
anyhow::bail!("Received abort for copy from {from} to {to}.");
}
CopyStatus::Failed => {
anyhow::bail!("Received failure response for copy from {from} to {to}.");
}
CopyStatus::Success => return Ok(()),
CopyStatus::Pending => (),
}
CopyStatus::Failed => {
anyhow::bail!("Received failure response for copy from {from} to {to}.");
}
CopyStatus::Success => return Ok(()),
CopyStatus::Pending => (),
// The copy is taking longer. Waiting a second and then re-trying.
// TODO estimate time based on copy_progress and adjust time based on that
tokio::time::sleep(Duration::from_millis(1000)).await;
let properties = blob_client.get_properties().into_future().await?;
let Some(status) = properties.blob.properties.copy_status else {
tracing::warn!("copy_status for copy is None!, from={from}, to={to}");
return Ok(());
};
copy_status = Some(status);
}
// The copy is taking longer. Waiting a second and then re-trying.
// TODO estimate time based on copy_progress and adjust time based on that
tokio::time::sleep(Duration::from_millis(1000)).await;
let properties = blob_client.get_properties().into_future().await?;
let Some(status) = properties.blob.properties.copy_status else {
tracing::warn!("copy_status for copy is None!, from={from}, to={to}");
return Ok(());
};
if start_time.elapsed() > MAX_WAIT_TIME {
anyhow::bail!("Copy from from {from} to {to} took longer than limit MAX_WAIT_TIME={}s. copy_pogress={:?}.",
MAX_WAIT_TIME.as_secs_f32(),
properties.blob.properties.copy_progress,
);
}
copy_status = status;
};
tokio::select! {
res = op => res,
_ = cancel.cancelled() => Err(anyhow::Error::new(TimeoutOrCancel::Cancel)),
_ = timeout => {
let e = anyhow::Error::new(TimeoutOrCancel::Timeout);
let e = e.context(format!("Timeout, last status: {copy_status:?}"));
Err(e)
},
}
}

View File

@@ -0,0 +1,181 @@
/// Reasons for downloads or listings to fail.
#[derive(Debug)]
pub enum DownloadError {
/// Validation or other error happened due to user input.
BadInput(anyhow::Error),
/// The file was not found in the remote storage.
NotFound,
/// A cancellation token aborted the download, typically during
/// tenant detach or process shutdown.
Cancelled,
/// A timeout happened while executing the request. Possible reasons:
/// - stuck tcp connection
///
/// Concurrency control is not timed within timeout.
Timeout,
/// The file was found in the remote storage, but the download failed.
Other(anyhow::Error),
}
impl std::fmt::Display for DownloadError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
DownloadError::BadInput(e) => {
write!(f, "Failed to download a remote file due to user input: {e}")
}
DownloadError::NotFound => write!(f, "No file found for the remote object id given"),
DownloadError::Cancelled => write!(f, "Cancelled, shutting down"),
DownloadError::Timeout => write!(f, "timeout"),
DownloadError::Other(e) => write!(f, "Failed to download a remote file: {e:?}"),
}
}
}
impl std::error::Error for DownloadError {}
impl DownloadError {
/// Returns true if the error should not be retried with backoff
pub fn is_permanent(&self) -> bool {
use DownloadError::*;
match self {
BadInput(_) | NotFound | Cancelled => true,
Timeout | Other(_) => false,
}
}
}
#[derive(Debug)]
pub enum TimeTravelError {
/// Validation or other error happened due to user input.
BadInput(anyhow::Error),
/// The used remote storage does not have time travel recovery implemented
Unimplemented,
/// The number of versions/deletion markers is above our limit.
TooManyVersions,
/// A cancellation token aborted the process, typically during
/// request closure or process shutdown.
Cancelled,
/// Other errors
Other(anyhow::Error),
}
impl std::fmt::Display for TimeTravelError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
TimeTravelError::BadInput(e) => {
write!(
f,
"Failed to time travel recover a prefix due to user input: {e}"
)
}
TimeTravelError::Unimplemented => write!(
f,
"time travel recovery is not implemented for the current storage backend"
),
TimeTravelError::Cancelled => write!(f, "Cancelled, shutting down"),
TimeTravelError::TooManyVersions => {
write!(f, "Number of versions/delete markers above limit")
}
TimeTravelError::Other(e) => write!(f, "Failed to time travel recover a prefix: {e:?}"),
}
}
}
impl std::error::Error for TimeTravelError {}
/// Plain cancelled error.
///
/// By design this type does not not implement `std::error::Error` so it cannot be put as the root
/// cause of `std::io::Error` or `anyhow::Error`. It should never need to be exposed out of this
/// crate.
///
/// It exists to implement permit acquiring in `{Download,TimeTravel}Error` and `anyhow::Error` returning
/// operations and ensuring that those get converted to proper versions with just `?`.
#[derive(Debug)]
pub(crate) struct Cancelled;
impl From<Cancelled> for anyhow::Error {
fn from(_: Cancelled) -> Self {
anyhow::Error::new(TimeoutOrCancel::Cancel)
}
}
impl From<Cancelled> for TimeTravelError {
fn from(_: Cancelled) -> Self {
TimeTravelError::Cancelled
}
}
impl From<Cancelled> for TimeoutOrCancel {
fn from(_: Cancelled) -> Self {
TimeoutOrCancel::Cancel
}
}
impl From<Cancelled> for DownloadError {
fn from(_: Cancelled) -> Self {
DownloadError::Cancelled
}
}
/// This type is used at as the root cause for timeouts and cancellations with `anyhow::Error` returning
/// RemoteStorage methods.
///
/// For use with `utils::backoff::retry` and `anyhow::Error` returning operations there is
/// `TimeoutOrCancel::caused_by_cancel` method to query "proper form" errors.
#[derive(Debug)]
pub enum TimeoutOrCancel {
Timeout,
Cancel,
}
impl std::fmt::Display for TimeoutOrCancel {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
use TimeoutOrCancel::*;
match self {
Timeout => write!(f, "timeout"),
Cancel => write!(f, "cancel"),
}
}
}
impl std::error::Error for TimeoutOrCancel {}
impl TimeoutOrCancel {
pub fn caused(error: &anyhow::Error) -> Option<&Self> {
error.root_cause().downcast_ref()
}
/// Returns true if the error was caused by [`TimeoutOrCancel::Cancel`].
pub fn caused_by_cancel(error: &anyhow::Error) -> bool {
Self::caused(error).is_some_and(Self::is_cancel)
}
pub fn is_cancel(&self) -> bool {
matches!(self, TimeoutOrCancel::Cancel)
}
pub fn is_timeout(&self) -> bool {
matches!(self, TimeoutOrCancel::Timeout)
}
}
/// This conversion is used when [`crate::support::DownloadStream`] notices a cancellation or
/// timeout to wrap it in an `std::io::Error`.
impl From<TimeoutOrCancel> for std::io::Error {
fn from(value: TimeoutOrCancel) -> Self {
let e = DownloadError::from(value);
std::io::Error::other(e)
}
}
impl From<TimeoutOrCancel> for DownloadError {
fn from(value: TimeoutOrCancel) -> Self {
use TimeoutOrCancel::*;
match value {
Timeout => DownloadError::Timeout,
Cancel => DownloadError::Cancelled,
}
}
}

View File

@@ -10,6 +10,7 @@
#![deny(clippy::undocumented_unsafe_blocks)]
mod azure_blob;
mod error;
mod local_fs;
mod s3_bucket;
mod simulate_failures;
@@ -21,7 +22,7 @@ use std::{
num::{NonZeroU32, NonZeroUsize},
pin::Pin,
sync::Arc,
time::SystemTime,
time::{Duration, SystemTime},
};
use anyhow::{bail, Context};
@@ -41,6 +42,8 @@ pub use self::{
};
use s3_bucket::RequestKind;
pub use error::{DownloadError, TimeTravelError, TimeoutOrCancel};
/// Currently, sync happens with AWS S3, that has two limits on requests per second:
/// ~200 RPS for IAM services
/// <https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.IAMDBAuth.html>
@@ -158,9 +161,10 @@ pub trait RemoteStorage: Send + Sync + 'static {
async fn list_prefixes(
&self,
prefix: Option<&RemotePath>,
cancel: &CancellationToken,
) -> Result<Vec<RemotePath>, DownloadError> {
let result = self
.list(prefix, ListingMode::WithDelimiter, None)
.list(prefix, ListingMode::WithDelimiter, None, cancel)
.await?
.prefixes;
Ok(result)
@@ -182,9 +186,10 @@ pub trait RemoteStorage: Send + Sync + 'static {
&self,
prefix: Option<&RemotePath>,
max_keys: Option<NonZeroU32>,
cancel: &CancellationToken,
) -> Result<Vec<RemotePath>, DownloadError> {
let result = self
.list(prefix, ListingMode::NoDelimiter, max_keys)
.list(prefix, ListingMode::NoDelimiter, max_keys, cancel)
.await?
.keys;
Ok(result)
@@ -195,9 +200,13 @@ pub trait RemoteStorage: Send + Sync + 'static {
prefix: Option<&RemotePath>,
_mode: ListingMode,
max_keys: Option<NonZeroU32>,
cancel: &CancellationToken,
) -> Result<Listing, DownloadError>;
/// Streams the local file contents into remote into the remote storage entry.
///
/// If the operation fails because of timeout or cancellation, the root cause of the error will be
/// set to `TimeoutOrCancel`.
async fn upload(
&self,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
@@ -206,27 +215,61 @@ pub trait RemoteStorage: Send + Sync + 'static {
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<()>;
/// Streams the remote storage entry contents into the buffered writer given, returns the filled writer.
/// Streams the remote storage entry contents.
///
/// The returned download stream will obey initial timeout and cancellation signal by erroring
/// on whichever happens first. Only one of the reasons will fail the stream, which is usually
/// enough for `tokio::io::copy_buf` usage. If needed the error can be filtered out.
///
/// Returns the metadata, if any was stored with the file previously.
async fn download(&self, from: &RemotePath) -> Result<Download, DownloadError>;
async fn download(
&self,
from: &RemotePath,
cancel: &CancellationToken,
) -> Result<Download, DownloadError>;
/// Streams a given byte range of the remote storage entry contents into the buffered writer given, returns the filled writer.
/// Streams a given byte range of the remote storage entry contents.
///
/// The returned download stream will obey initial timeout and cancellation signal by erroring
/// on whichever happens first. Only one of the reasons will fail the stream, which is usually
/// enough for `tokio::io::copy_buf` usage. If needed the error can be filtered out.
///
/// Returns the metadata, if any was stored with the file previously.
async fn download_byte_range(
&self,
from: &RemotePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
cancel: &CancellationToken,
) -> Result<Download, DownloadError>;
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()>;
/// Delete a single path from remote storage.
///
/// If the operation fails because of timeout or cancellation, the root cause of the error will be
/// set to `TimeoutOrCancel`. In such situation it is unknown if the deletion went through.
async fn delete(&self, path: &RemotePath, cancel: &CancellationToken) -> anyhow::Result<()>;
async fn delete_objects<'a>(&self, paths: &'a [RemotePath]) -> anyhow::Result<()>;
/// Delete a multiple paths from remote storage.
///
/// If the operation fails because of timeout or cancellation, the root cause of the error will be
/// set to `TimeoutOrCancel`. In such situation it is unknown which deletions, if any, went
/// through.
async fn delete_objects<'a>(
&self,
paths: &'a [RemotePath],
cancel: &CancellationToken,
) -> anyhow::Result<()>;
/// Copy a remote object inside a bucket from one path to another.
async fn copy(&self, from: &RemotePath, to: &RemotePath) -> anyhow::Result<()>;
async fn copy(
&self,
from: &RemotePath,
to: &RemotePath,
cancel: &CancellationToken,
) -> anyhow::Result<()>;
/// Resets the content of everything with the given prefix to the given state
async fn time_travel_recover(
@@ -238,7 +281,13 @@ pub trait RemoteStorage: Send + Sync + 'static {
) -> Result<(), TimeTravelError>;
}
pub type DownloadStream = Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Unpin + Send + Sync>>;
/// DownloadStream is sensitive to the timeout and cancellation used with the original
/// [`RemoteStorage::download`] request. The type yields `std::io::Result<Bytes>` to be compatible
/// with `tokio::io::copy_buf`.
// This has 'static because safekeepers do not use cancellation tokens (yet)
pub type DownloadStream =
Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static>>;
pub struct Download {
pub download_stream: DownloadStream,
/// The last time the file was modified (`last-modified` HTTP header)
@@ -257,86 +306,6 @@ impl Debug for Download {
}
}
#[derive(Debug)]
pub enum DownloadError {
/// Validation or other error happened due to user input.
BadInput(anyhow::Error),
/// The file was not found in the remote storage.
NotFound,
/// A cancellation token aborted the download, typically during
/// tenant detach or process shutdown.
Cancelled,
/// The file was found in the remote storage, but the download failed.
Other(anyhow::Error),
}
impl std::fmt::Display for DownloadError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
DownloadError::BadInput(e) => {
write!(f, "Failed to download a remote file due to user input: {e}")
}
DownloadError::Cancelled => write!(f, "Cancelled, shutting down"),
DownloadError::NotFound => write!(f, "No file found for the remote object id given"),
DownloadError::Other(e) => write!(f, "Failed to download a remote file: {e:?}"),
}
}
}
impl std::error::Error for DownloadError {}
impl DownloadError {
/// Returns true if the error should not be retried with backoff
pub fn is_permanent(&self) -> bool {
use DownloadError::*;
match self {
BadInput(_) => true,
NotFound => true,
Cancelled => true,
Other(_) => false,
}
}
}
#[derive(Debug)]
pub enum TimeTravelError {
/// Validation or other error happened due to user input.
BadInput(anyhow::Error),
/// The used remote storage does not have time travel recovery implemented
Unimplemented,
/// The number of versions/deletion markers is above our limit.
TooManyVersions,
/// A cancellation token aborted the process, typically during
/// request closure or process shutdown.
Cancelled,
/// Other errors
Other(anyhow::Error),
}
impl std::fmt::Display for TimeTravelError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
TimeTravelError::BadInput(e) => {
write!(
f,
"Failed to time travel recover a prefix due to user input: {e}"
)
}
TimeTravelError::Unimplemented => write!(
f,
"time travel recovery is not implemented for the current storage backend"
),
TimeTravelError::Cancelled => write!(f, "Cancelled, shutting down"),
TimeTravelError::TooManyVersions => {
write!(f, "Number of versions/delete markers above limit")
}
TimeTravelError::Other(e) => write!(f, "Failed to time travel recover a prefix: {e:?}"),
}
}
}
impl std::error::Error for TimeTravelError {}
/// Every storage, currently supported.
/// Serves as a simple way to pass around the [`RemoteStorage`] without dealing with generics.
#[derive(Clone)]
@@ -354,12 +323,13 @@ impl<Other: RemoteStorage> GenericRemoteStorage<Arc<Other>> {
prefix: Option<&RemotePath>,
mode: ListingMode,
max_keys: Option<NonZeroU32>,
cancel: &CancellationToken,
) -> anyhow::Result<Listing, DownloadError> {
match self {
Self::LocalFs(s) => s.list(prefix, mode, max_keys).await,
Self::AwsS3(s) => s.list(prefix, mode, max_keys).await,
Self::AzureBlob(s) => s.list(prefix, mode, max_keys).await,
Self::Unreliable(s) => s.list(prefix, mode, max_keys).await,
Self::LocalFs(s) => s.list(prefix, mode, max_keys, cancel).await,
Self::AwsS3(s) => s.list(prefix, mode, max_keys, cancel).await,
Self::AzureBlob(s) => s.list(prefix, mode, max_keys, cancel).await,
Self::Unreliable(s) => s.list(prefix, mode, max_keys, cancel).await,
}
}
@@ -372,12 +342,13 @@ impl<Other: RemoteStorage> GenericRemoteStorage<Arc<Other>> {
&self,
folder: Option<&RemotePath>,
max_keys: Option<NonZeroU32>,
cancel: &CancellationToken,
) -> Result<Vec<RemotePath>, DownloadError> {
match self {
Self::LocalFs(s) => s.list_files(folder, max_keys).await,
Self::AwsS3(s) => s.list_files(folder, max_keys).await,
Self::AzureBlob(s) => s.list_files(folder, max_keys).await,
Self::Unreliable(s) => s.list_files(folder, max_keys).await,
Self::LocalFs(s) => s.list_files(folder, max_keys, cancel).await,
Self::AwsS3(s) => s.list_files(folder, max_keys, cancel).await,
Self::AzureBlob(s) => s.list_files(folder, max_keys, cancel).await,
Self::Unreliable(s) => s.list_files(folder, max_keys, cancel).await,
}
}
@@ -387,36 +358,43 @@ impl<Other: RemoteStorage> GenericRemoteStorage<Arc<Other>> {
pub async fn list_prefixes(
&self,
prefix: Option<&RemotePath>,
cancel: &CancellationToken,
) -> Result<Vec<RemotePath>, DownloadError> {
match self {
Self::LocalFs(s) => s.list_prefixes(prefix).await,
Self::AwsS3(s) => s.list_prefixes(prefix).await,
Self::AzureBlob(s) => s.list_prefixes(prefix).await,
Self::Unreliable(s) => s.list_prefixes(prefix).await,
Self::LocalFs(s) => s.list_prefixes(prefix, cancel).await,
Self::AwsS3(s) => s.list_prefixes(prefix, cancel).await,
Self::AzureBlob(s) => s.list_prefixes(prefix, cancel).await,
Self::Unreliable(s) => s.list_prefixes(prefix, cancel).await,
}
}
/// See [`RemoteStorage::upload`]
pub async fn upload(
&self,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
match self {
Self::LocalFs(s) => s.upload(from, data_size_bytes, to, metadata).await,
Self::AwsS3(s) => s.upload(from, data_size_bytes, to, metadata).await,
Self::AzureBlob(s) => s.upload(from, data_size_bytes, to, metadata).await,
Self::Unreliable(s) => s.upload(from, data_size_bytes, to, metadata).await,
Self::LocalFs(s) => s.upload(from, data_size_bytes, to, metadata, cancel).await,
Self::AwsS3(s) => s.upload(from, data_size_bytes, to, metadata, cancel).await,
Self::AzureBlob(s) => s.upload(from, data_size_bytes, to, metadata, cancel).await,
Self::Unreliable(s) => s.upload(from, data_size_bytes, to, metadata, cancel).await,
}
}
pub async fn download(&self, from: &RemotePath) -> Result<Download, DownloadError> {
pub async fn download(
&self,
from: &RemotePath,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
match self {
Self::LocalFs(s) => s.download(from).await,
Self::AwsS3(s) => s.download(from).await,
Self::AzureBlob(s) => s.download(from).await,
Self::Unreliable(s) => s.download(from).await,
Self::LocalFs(s) => s.download(from, cancel).await,
Self::AwsS3(s) => s.download(from, cancel).await,
Self::AzureBlob(s) => s.download(from, cancel).await,
Self::Unreliable(s) => s.download(from, cancel).await,
}
}
@@ -425,54 +403,72 @@ impl<Other: RemoteStorage> GenericRemoteStorage<Arc<Other>> {
from: &RemotePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
match self {
Self::LocalFs(s) => {
s.download_byte_range(from, start_inclusive, end_exclusive)
s.download_byte_range(from, start_inclusive, end_exclusive, cancel)
.await
}
Self::AwsS3(s) => {
s.download_byte_range(from, start_inclusive, end_exclusive)
s.download_byte_range(from, start_inclusive, end_exclusive, cancel)
.await
}
Self::AzureBlob(s) => {
s.download_byte_range(from, start_inclusive, end_exclusive)
s.download_byte_range(from, start_inclusive, end_exclusive, cancel)
.await
}
Self::Unreliable(s) => {
s.download_byte_range(from, start_inclusive, end_exclusive)
s.download_byte_range(from, start_inclusive, end_exclusive, cancel)
.await
}
}
}
pub async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
/// See [`RemoteStorage::delete`]
pub async fn delete(
&self,
path: &RemotePath,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
match self {
Self::LocalFs(s) => s.delete(path).await,
Self::AwsS3(s) => s.delete(path).await,
Self::AzureBlob(s) => s.delete(path).await,
Self::Unreliable(s) => s.delete(path).await,
Self::LocalFs(s) => s.delete(path, cancel).await,
Self::AwsS3(s) => s.delete(path, cancel).await,
Self::AzureBlob(s) => s.delete(path, cancel).await,
Self::Unreliable(s) => s.delete(path, cancel).await,
}
}
pub async fn delete_objects<'a>(&self, paths: &'a [RemotePath]) -> anyhow::Result<()> {
/// See [`RemoteStorage::delete_objects`]
pub async fn delete_objects(
&self,
paths: &[RemotePath],
cancel: &CancellationToken,
) -> anyhow::Result<()> {
match self {
Self::LocalFs(s) => s.delete_objects(paths).await,
Self::AwsS3(s) => s.delete_objects(paths).await,
Self::AzureBlob(s) => s.delete_objects(paths).await,
Self::Unreliable(s) => s.delete_objects(paths).await,
Self::LocalFs(s) => s.delete_objects(paths, cancel).await,
Self::AwsS3(s) => s.delete_objects(paths, cancel).await,
Self::AzureBlob(s) => s.delete_objects(paths, cancel).await,
Self::Unreliable(s) => s.delete_objects(paths, cancel).await,
}
}
pub async fn copy_object(&self, from: &RemotePath, to: &RemotePath) -> anyhow::Result<()> {
/// See [`RemoteStorage::copy`]
pub async fn copy_object(
&self,
from: &RemotePath,
to: &RemotePath,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
match self {
Self::LocalFs(s) => s.copy(from, to).await,
Self::AwsS3(s) => s.copy(from, to).await,
Self::AzureBlob(s) => s.copy(from, to).await,
Self::Unreliable(s) => s.copy(from, to).await,
Self::LocalFs(s) => s.copy(from, to, cancel).await,
Self::AwsS3(s) => s.copy(from, to, cancel).await,
Self::AzureBlob(s) => s.copy(from, to, cancel).await,
Self::Unreliable(s) => s.copy(from, to, cancel).await,
}
}
/// See [`RemoteStorage::time_travel_recover`].
pub async fn time_travel_recover(
&self,
prefix: Option<&RemotePath>,
@@ -503,10 +499,11 @@ impl<Other: RemoteStorage> GenericRemoteStorage<Arc<Other>> {
impl GenericRemoteStorage {
pub fn from_config(storage_config: &RemoteStorageConfig) -> anyhow::Result<Self> {
let timeout = storage_config.timeout;
Ok(match &storage_config.storage {
RemoteStorageKind::LocalFs(root) => {
info!("Using fs root '{root}' as a remote storage");
Self::LocalFs(LocalFs::new(root.clone())?)
RemoteStorageKind::LocalFs(path) => {
info!("Using fs root '{path}' as a remote storage");
Self::LocalFs(LocalFs::new(path.clone(), timeout)?)
}
RemoteStorageKind::AwsS3(s3_config) => {
// The profile and access key id are only printed here for debugging purposes,
@@ -516,12 +513,12 @@ impl GenericRemoteStorage {
std::env::var("AWS_ACCESS_KEY_ID").unwrap_or_else(|_| "<none>".into());
info!("Using s3 bucket '{}' in region '{}' as a remote storage, prefix in bucket: '{:?}', bucket endpoint: '{:?}', profile: {profile}, access_key_id: {access_key_id}",
s3_config.bucket_name, s3_config.bucket_region, s3_config.prefix_in_bucket, s3_config.endpoint);
Self::AwsS3(Arc::new(S3Bucket::new(s3_config)?))
Self::AwsS3(Arc::new(S3Bucket::new(s3_config, timeout)?))
}
RemoteStorageKind::AzureContainer(azure_config) => {
info!("Using azure container '{}' in region '{}' as a remote storage, prefix in container: '{:?}'",
azure_config.container_name, azure_config.container_region, azure_config.prefix_in_container);
Self::AzureBlob(Arc::new(AzureBlobStorage::new(azure_config)?))
Self::AzureBlob(Arc::new(AzureBlobStorage::new(azure_config, timeout)?))
}
})
}
@@ -530,18 +527,15 @@ impl GenericRemoteStorage {
Self::Unreliable(Arc::new(UnreliableWrapper::new(s, fail_first)))
}
/// Takes storage object contents and its size and uploads to remote storage,
/// mapping `from_path` to the corresponding remote object id in the storage.
///
/// The storage object does not have to be present on the `from_path`,
/// this path is used for the remote object id conversion only.
/// See [`RemoteStorage::upload`], which this method calls with `None` as metadata.
pub async fn upload_storage_object(
&self,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
from_size_bytes: usize,
to: &RemotePath,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
self.upload(from, from_size_bytes, to, None)
self.upload(from, from_size_bytes, to, None, cancel)
.await
.with_context(|| {
format!("Failed to upload data of length {from_size_bytes} to storage path {to:?}")
@@ -554,10 +548,11 @@ impl GenericRemoteStorage {
&self,
byte_range: Option<(u64, Option<u64>)>,
from: &RemotePath,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
match byte_range {
Some((start, end)) => self.download_byte_range(from, start, end).await,
None => self.download(from).await,
Some((start, end)) => self.download_byte_range(from, start, end, cancel).await,
None => self.download(from, cancel).await,
}
}
}
@@ -572,6 +567,9 @@ pub struct StorageMetadata(HashMap<String, String>);
pub struct RemoteStorageConfig {
/// The storage connection configuration.
pub storage: RemoteStorageKind,
/// A common timeout enforced for all requests after concurrency limiter permit has been
/// acquired.
pub timeout: Duration,
}
/// A kind of a remote storage to connect to, with its connection configuration.
@@ -656,6 +654,8 @@ impl Debug for AzureConfig {
}
impl RemoteStorageConfig {
pub const DEFAULT_TIMEOUT: Duration = std::time::Duration::from_secs(120);
pub fn from_toml(toml: &toml_edit::Item) -> anyhow::Result<Option<RemoteStorageConfig>> {
let local_path = toml.get("local_path");
let bucket_name = toml.get("bucket_name");
@@ -685,6 +685,27 @@ impl RemoteStorageConfig {
.map(|endpoint| parse_toml_string("endpoint", endpoint))
.transpose()?;
let timeout = toml
.get("timeout")
.map(|timeout| {
timeout
.as_str()
.ok_or_else(|| anyhow::Error::msg("timeout was not a string"))
})
.transpose()
.and_then(|timeout| {
timeout
.map(humantime::parse_duration)
.transpose()
.map_err(anyhow::Error::new)
})
.context("parse timeout")?
.unwrap_or(Self::DEFAULT_TIMEOUT);
if timeout < Duration::from_secs(1) {
bail!("timeout was specified as {timeout:?} which is too low");
}
let storage = match (
local_path,
bucket_name,
@@ -746,7 +767,7 @@ impl RemoteStorageConfig {
}
};
Ok(Some(RemoteStorageConfig { storage }))
Ok(Some(RemoteStorageConfig { storage, timeout }))
}
}
@@ -842,4 +863,24 @@ mod tests {
let err = RemotePath::new(Utf8Path::new("/")).expect_err("Should fail on absolute paths");
assert_eq!(err.to_string(), "Path \"/\" is not relative");
}
#[test]
fn parse_localfs_config_with_timeout() {
let input = "local_path = '.'
timeout = '5s'";
let toml = input.parse::<toml_edit::Document>().unwrap();
let config = RemoteStorageConfig::from_toml(toml.as_item())
.unwrap()
.expect("it exists");
assert_eq!(
config,
RemoteStorageConfig {
storage: RemoteStorageKind::LocalFs(Utf8PathBuf::from(".")),
timeout: Duration::from_secs(5)
}
);
}
}

View File

@@ -5,7 +5,12 @@
//! volume is mounted to the local FS.
use std::{
borrow::Cow, future::Future, io::ErrorKind, num::NonZeroU32, pin::Pin, time::SystemTime,
borrow::Cow,
future::Future,
io::ErrorKind,
num::NonZeroU32,
pin::Pin,
time::{Duration, SystemTime},
};
use anyhow::{bail, ensure, Context};
@@ -20,7 +25,9 @@ use tokio_util::{io::ReaderStream, sync::CancellationToken};
use tracing::*;
use utils::{crashsafe::path_with_suffix_extension, fs_ext::is_directory_empty};
use crate::{Download, DownloadError, Listing, ListingMode, RemotePath, TimeTravelError};
use crate::{
Download, DownloadError, Listing, ListingMode, RemotePath, TimeTravelError, TimeoutOrCancel,
};
use super::{RemoteStorage, StorageMetadata};
@@ -29,12 +36,13 @@ const LOCAL_FS_TEMP_FILE_SUFFIX: &str = "___temp";
#[derive(Debug, Clone)]
pub struct LocalFs {
storage_root: Utf8PathBuf,
timeout: Duration,
}
impl LocalFs {
/// Attempts to create local FS storage, along with its root directory.
/// Storage root will be created (if does not exist) and transformed into an absolute path (if passed as relative).
pub fn new(mut storage_root: Utf8PathBuf) -> anyhow::Result<Self> {
pub fn new(mut storage_root: Utf8PathBuf, timeout: Duration) -> anyhow::Result<Self> {
if !storage_root.exists() {
std::fs::create_dir_all(&storage_root).with_context(|| {
format!("Failed to create all directories in the given root path {storage_root:?}")
@@ -46,7 +54,10 @@ impl LocalFs {
})?;
}
Ok(Self { storage_root })
Ok(Self {
storage_root,
timeout,
})
}
// mirrors S3Bucket::s3_object_to_relative_path
@@ -157,80 +168,14 @@ impl LocalFs {
Ok(files)
}
}
impl RemoteStorage for LocalFs {
async fn list(
&self,
prefix: Option<&RemotePath>,
mode: ListingMode,
max_keys: Option<NonZeroU32>,
) -> Result<Listing, DownloadError> {
let mut result = Listing::default();
if let ListingMode::NoDelimiter = mode {
let keys = self
.list_recursive(prefix)
.await
.map_err(DownloadError::Other)?;
result.keys = keys
.into_iter()
.filter(|k| {
let path = k.with_base(&self.storage_root);
!path.is_dir()
})
.collect();
if let Some(max_keys) = max_keys {
result.keys.truncate(max_keys.get() as usize);
}
return Ok(result);
}
let path = match prefix {
Some(prefix) => Cow::Owned(prefix.with_base(&self.storage_root)),
None => Cow::Borrowed(&self.storage_root),
};
let prefixes_to_filter = get_all_files(path.as_ref(), false)
.await
.map_err(DownloadError::Other)?;
// filter out empty directories to mirror s3 behavior.
for prefix in prefixes_to_filter {
if prefix.is_dir()
&& is_directory_empty(&prefix)
.await
.map_err(DownloadError::Other)?
{
continue;
}
let stripped = prefix
.strip_prefix(&self.storage_root)
.context("Failed to strip prefix")
.and_then(RemotePath::new)
.expect(
"We list files for storage root, hence should be able to remote the prefix",
);
if prefix.is_dir() {
result.prefixes.push(stripped);
} else {
result.keys.push(stripped);
}
}
Ok(result)
}
async fn upload(
async fn upload0(
&self,
data: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let target_file_path = to.with_base(&self.storage_root);
create_target_directory(&target_file_path).await?;
@@ -265,9 +210,26 @@ impl RemoteStorage for LocalFs {
let mut buffer_to_read = data.take(from_size_bytes);
// alternatively we could just write the bytes to a file, but local_fs is a testing utility
let bytes_read = io::copy_buf(&mut buffer_to_read, &mut destination)
.await
.with_context(|| {
let copy = io::copy_buf(&mut buffer_to_read, &mut destination);
let bytes_read = tokio::select! {
biased;
_ = cancel.cancelled() => {
let file = destination.into_inner();
// wait for the inflight operation(s) to complete so that there could be a next
// attempt right away and our writes are not directed to their file.
file.into_std().await;
// TODO: leave the temp or not? leaving is probably less racy. enabled truncate at
// least.
fs::remove_file(temp_file_path).await.context("remove temp_file_path after cancellation or timeout")?;
return Err(TimeoutOrCancel::Cancel.into());
}
read = copy => read,
};
let bytes_read =
bytes_read.with_context(|| {
format!(
"Failed to upload file (write temp) to the local storage at '{temp_file_path}'",
)
@@ -299,6 +261,9 @@ impl RemoteStorage for LocalFs {
})?;
if let Some(storage_metadata) = metadata {
// FIXME: we must not be using metadata much, since this would forget the old metadata
// for new writes? or perhaps metadata is sticky; could consider removing if it's never
// used.
let storage_metadata_path = storage_metadata_path(&target_file_path);
fs::write(
&storage_metadata_path,
@@ -315,8 +280,131 @@ impl RemoteStorage for LocalFs {
Ok(())
}
}
async fn download(&self, from: &RemotePath) -> Result<Download, DownloadError> {
impl RemoteStorage for LocalFs {
async fn list(
&self,
prefix: Option<&RemotePath>,
mode: ListingMode,
max_keys: Option<NonZeroU32>,
cancel: &CancellationToken,
) -> Result<Listing, DownloadError> {
let op = async {
let mut result = Listing::default();
if let ListingMode::NoDelimiter = mode {
let keys = self
.list_recursive(prefix)
.await
.map_err(DownloadError::Other)?;
result.keys = keys
.into_iter()
.filter(|k| {
let path = k.with_base(&self.storage_root);
!path.is_dir()
})
.collect();
if let Some(max_keys) = max_keys {
result.keys.truncate(max_keys.get() as usize);
}
return Ok(result);
}
let path = match prefix {
Some(prefix) => Cow::Owned(prefix.with_base(&self.storage_root)),
None => Cow::Borrowed(&self.storage_root),
};
let prefixes_to_filter = get_all_files(path.as_ref(), false)
.await
.map_err(DownloadError::Other)?;
// filter out empty directories to mirror s3 behavior.
for prefix in prefixes_to_filter {
if prefix.is_dir()
&& is_directory_empty(&prefix)
.await
.map_err(DownloadError::Other)?
{
continue;
}
let stripped = prefix
.strip_prefix(&self.storage_root)
.context("Failed to strip prefix")
.and_then(RemotePath::new)
.expect(
"We list files for storage root, hence should be able to remote the prefix",
);
if prefix.is_dir() {
result.prefixes.push(stripped);
} else {
result.keys.push(stripped);
}
}
Ok(result)
};
let timeout = async {
tokio::time::sleep(self.timeout).await;
Err(DownloadError::Timeout)
};
let cancelled = async {
cancel.cancelled().await;
Err(DownloadError::Cancelled)
};
tokio::select! {
res = op => res,
res = timeout => res,
res = cancelled => res,
}
}
async fn upload(
&self,
data: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let cancel = cancel.child_token();
let op = self.upload0(data, data_size_bytes, to, metadata, &cancel);
let mut op = std::pin::pin!(op);
// race the upload0 to the timeout; if it goes over, do a graceful shutdown
let (res, timeout) = tokio::select! {
res = &mut op => (res, false),
_ = tokio::time::sleep(self.timeout) => {
cancel.cancel();
(op.await, true)
}
};
match res {
Err(e) if timeout && TimeoutOrCancel::caused_by_cancel(&e) => {
// we caused this cancel (or they happened simultaneously) -- swap it out to
// Timeout
Err(TimeoutOrCancel::Timeout.into())
}
res => res,
}
}
async fn download(
&self,
from: &RemotePath,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
let target_path = from.with_base(&self.storage_root);
if file_exists(&target_path).map_err(DownloadError::BadInput)? {
let source = ReaderStream::new(
@@ -334,6 +422,10 @@ impl RemoteStorage for LocalFs {
.read_storage_metadata(&target_path)
.await
.map_err(DownloadError::Other)?;
let cancel_or_timeout = crate::support::cancel_or_timeout(self.timeout, cancel.clone());
let source = crate::support::DownloadStream::new(cancel_or_timeout, source);
Ok(Download {
metadata,
last_modified: None,
@@ -350,6 +442,7 @@ impl RemoteStorage for LocalFs {
from: &RemotePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
if let Some(end_exclusive) = end_exclusive {
if end_exclusive <= start_inclusive {
@@ -391,6 +484,9 @@ impl RemoteStorage for LocalFs {
let source = source.take(end_exclusive.unwrap_or(len) - start_inclusive);
let source = ReaderStream::new(source);
let cancel_or_timeout = crate::support::cancel_or_timeout(self.timeout, cancel.clone());
let source = crate::support::DownloadStream::new(cancel_or_timeout, source);
Ok(Download {
metadata,
last_modified: None,
@@ -402,7 +498,7 @@ impl RemoteStorage for LocalFs {
}
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
async fn delete(&self, path: &RemotePath, _cancel: &CancellationToken) -> anyhow::Result<()> {
let file_path = path.with_base(&self.storage_root);
match fs::remove_file(&file_path).await {
Ok(()) => Ok(()),
@@ -414,14 +510,23 @@ impl RemoteStorage for LocalFs {
}
}
async fn delete_objects<'a>(&self, paths: &'a [RemotePath]) -> anyhow::Result<()> {
async fn delete_objects<'a>(
&self,
paths: &'a [RemotePath],
cancel: &CancellationToken,
) -> anyhow::Result<()> {
for path in paths {
self.delete(path).await?
self.delete(path, cancel).await?
}
Ok(())
}
async fn copy(&self, from: &RemotePath, to: &RemotePath) -> anyhow::Result<()> {
async fn copy(
&self,
from: &RemotePath,
to: &RemotePath,
_cancel: &CancellationToken,
) -> anyhow::Result<()> {
let from_path = from.with_base(&self.storage_root);
let to_path = to.with_base(&self.storage_root);
create_target_directory(&to_path).await?;
@@ -435,7 +540,6 @@ impl RemoteStorage for LocalFs {
Ok(())
}
#[allow(clippy::diverging_sub_expression)]
async fn time_travel_recover(
&self,
_prefix: Option<&RemotePath>,
@@ -529,8 +633,9 @@ mod fs_tests {
remote_storage_path: &RemotePath,
expected_metadata: Option<&StorageMetadata>,
) -> anyhow::Result<String> {
let cancel = CancellationToken::new();
let download = storage
.download(remote_storage_path)
.download(remote_storage_path, &cancel)
.await
.map_err(|e| anyhow::anyhow!("Download failed: {e}"))?;
ensure!(
@@ -545,16 +650,16 @@ mod fs_tests {
#[tokio::test]
async fn upload_file() -> anyhow::Result<()> {
let storage = create_storage()?;
let (storage, cancel) = create_storage()?;
let target_path_1 = upload_dummy_file(&storage, "upload_1", None).await?;
let target_path_1 = upload_dummy_file(&storage, "upload_1", None, &cancel).await?;
assert_eq!(
storage.list_all().await?,
vec![target_path_1.clone()],
"Should list a single file after first upload"
);
let target_path_2 = upload_dummy_file(&storage, "upload_2", None).await?;
let target_path_2 = upload_dummy_file(&storage, "upload_2", None, &cancel).await?;
assert_eq!(
list_files_sorted(&storage).await?,
vec![target_path_1.clone(), target_path_2.clone()],
@@ -566,7 +671,7 @@ mod fs_tests {
#[tokio::test]
async fn upload_file_negatives() -> anyhow::Result<()> {
let storage = create_storage()?;
let (storage, cancel) = create_storage()?;
let id = RemotePath::new(Utf8Path::new("dummy"))?;
let content = Bytes::from_static(b"12345");
@@ -575,34 +680,34 @@ mod fs_tests {
// Check that you get an error if the size parameter doesn't match the actual
// size of the stream.
storage
.upload(content(), 0, &id, None)
.upload(content(), 0, &id, None, &cancel)
.await
.expect_err("upload with zero size succeeded");
storage
.upload(content(), 4, &id, None)
.upload(content(), 4, &id, None, &cancel)
.await
.expect_err("upload with too short size succeeded");
storage
.upload(content(), 6, &id, None)
.upload(content(), 6, &id, None, &cancel)
.await
.expect_err("upload with too large size succeeded");
// Correct size is 5, this should succeed.
storage.upload(content(), 5, &id, None).await?;
storage.upload(content(), 5, &id, None, &cancel).await?;
Ok(())
}
fn create_storage() -> anyhow::Result<LocalFs> {
fn create_storage() -> anyhow::Result<(LocalFs, CancellationToken)> {
let storage_root = tempdir()?.path().to_path_buf();
LocalFs::new(storage_root)
LocalFs::new(storage_root, Duration::from_secs(120)).map(|s| (s, CancellationToken::new()))
}
#[tokio::test]
async fn download_file() -> anyhow::Result<()> {
let storage = create_storage()?;
let (storage, cancel) = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&storage, upload_name, None).await?;
let upload_target = upload_dummy_file(&storage, upload_name, None, &cancel).await?;
let contents = read_and_check_metadata(&storage, &upload_target, None).await?;
assert_eq!(
@@ -612,7 +717,7 @@ mod fs_tests {
);
let non_existing_path = "somewhere/else";
match storage.download(&RemotePath::new(Utf8Path::new(non_existing_path))?).await {
match storage.download(&RemotePath::new(Utf8Path::new(non_existing_path))?, &cancel).await {
Err(DownloadError::NotFound) => {} // Should get NotFound for non existing keys
other => panic!("Should get a NotFound error when downloading non-existing storage files, but got: {other:?}"),
}
@@ -621,9 +726,9 @@ mod fs_tests {
#[tokio::test]
async fn download_file_range_positive() -> anyhow::Result<()> {
let storage = create_storage()?;
let (storage, cancel) = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&storage, upload_name, None).await?;
let upload_target = upload_dummy_file(&storage, upload_name, None, &cancel).await?;
let full_range_download_contents =
read_and_check_metadata(&storage, &upload_target, None).await?;
@@ -637,7 +742,12 @@ mod fs_tests {
let (first_part_local, second_part_local) = uploaded_bytes.split_at(3);
let first_part_download = storage
.download_byte_range(&upload_target, 0, Some(first_part_local.len() as u64))
.download_byte_range(
&upload_target,
0,
Some(first_part_local.len() as u64),
&cancel,
)
.await?;
assert!(
first_part_download.metadata.is_none(),
@@ -655,6 +765,7 @@ mod fs_tests {
&upload_target,
first_part_local.len() as u64,
Some((first_part_local.len() + second_part_local.len()) as u64),
&cancel,
)
.await?;
assert!(
@@ -669,7 +780,7 @@ mod fs_tests {
);
let suffix_bytes = storage
.download_byte_range(&upload_target, 13, None)
.download_byte_range(&upload_target, 13, None, &cancel)
.await?
.download_stream;
let suffix_bytes = aggregate(suffix_bytes).await?;
@@ -677,7 +788,7 @@ mod fs_tests {
assert_eq!(upload_name, suffix);
let all_bytes = storage
.download_byte_range(&upload_target, 0, None)
.download_byte_range(&upload_target, 0, None, &cancel)
.await?
.download_stream;
let all_bytes = aggregate(all_bytes).await?;
@@ -689,9 +800,9 @@ mod fs_tests {
#[tokio::test]
async fn download_file_range_negative() -> anyhow::Result<()> {
let storage = create_storage()?;
let (storage, cancel) = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&storage, upload_name, None).await?;
let upload_target = upload_dummy_file(&storage, upload_name, None, &cancel).await?;
let start = 1_000_000_000;
let end = start + 1;
@@ -700,6 +811,7 @@ mod fs_tests {
&upload_target,
start,
Some(end), // exclusive end
&cancel,
)
.await
{
@@ -716,7 +828,7 @@ mod fs_tests {
let end = 234;
assert!(start > end, "Should test an incorrect range");
match storage
.download_byte_range(&upload_target, start, Some(end))
.download_byte_range(&upload_target, start, Some(end), &cancel)
.await
{
Ok(_) => panic!("Should not allow downloading wrong ranges"),
@@ -733,15 +845,15 @@ mod fs_tests {
#[tokio::test]
async fn delete_file() -> anyhow::Result<()> {
let storage = create_storage()?;
let (storage, cancel) = create_storage()?;
let upload_name = "upload_1";
let upload_target = upload_dummy_file(&storage, upload_name, None).await?;
let upload_target = upload_dummy_file(&storage, upload_name, None, &cancel).await?;
storage.delete(&upload_target).await?;
storage.delete(&upload_target, &cancel).await?;
assert!(storage.list_all().await?.is_empty());
storage
.delete(&upload_target)
.delete(&upload_target, &cancel)
.await
.expect("Should allow deleting non-existing storage files");
@@ -750,14 +862,14 @@ mod fs_tests {
#[tokio::test]
async fn file_with_metadata() -> anyhow::Result<()> {
let storage = create_storage()?;
let (storage, cancel) = create_storage()?;
let upload_name = "upload_1";
let metadata = StorageMetadata(HashMap::from([
("one".to_string(), "1".to_string()),
("two".to_string(), "2".to_string()),
]));
let upload_target =
upload_dummy_file(&storage, upload_name, Some(metadata.clone())).await?;
upload_dummy_file(&storage, upload_name, Some(metadata.clone()), &cancel).await?;
let full_range_download_contents =
read_and_check_metadata(&storage, &upload_target, Some(&metadata)).await?;
@@ -771,7 +883,12 @@ mod fs_tests {
let (first_part_local, _) = uploaded_bytes.split_at(3);
let partial_download_with_metadata = storage
.download_byte_range(&upload_target, 0, Some(first_part_local.len() as u64))
.download_byte_range(
&upload_target,
0,
Some(first_part_local.len() as u64),
&cancel,
)
.await?;
let first_part_remote = aggregate(partial_download_with_metadata.download_stream).await?;
assert_eq!(
@@ -792,16 +909,20 @@ mod fs_tests {
#[tokio::test]
async fn list() -> anyhow::Result<()> {
// No delimiter: should recursively list everything
let storage = create_storage()?;
let child = upload_dummy_file(&storage, "grandparent/parent/child", None).await?;
let uncle = upload_dummy_file(&storage, "grandparent/uncle", None).await?;
let (storage, cancel) = create_storage()?;
let child = upload_dummy_file(&storage, "grandparent/parent/child", None, &cancel).await?;
let uncle = upload_dummy_file(&storage, "grandparent/uncle", None, &cancel).await?;
let listing = storage.list(None, ListingMode::NoDelimiter, None).await?;
let listing = storage
.list(None, ListingMode::NoDelimiter, None, &cancel)
.await?;
assert!(listing.prefixes.is_empty());
assert_eq!(listing.keys, [uncle.clone(), child.clone()].to_vec());
// Delimiter: should only go one deep
let listing = storage.list(None, ListingMode::WithDelimiter, None).await?;
let listing = storage
.list(None, ListingMode::WithDelimiter, None, &cancel)
.await?;
assert_eq!(
listing.prefixes,
@@ -815,6 +936,7 @@ mod fs_tests {
Some(&RemotePath::from_string("timelines/some_timeline/grandparent").unwrap()),
ListingMode::WithDelimiter,
None,
&cancel,
)
.await?;
assert_eq!(
@@ -827,10 +949,75 @@ mod fs_tests {
Ok(())
}
#[tokio::test]
async fn overwrite_shorter_file() -> anyhow::Result<()> {
let (storage, cancel) = create_storage()?;
let path = RemotePath::new("does/not/matter/file".into())?;
let body = Bytes::from_static(b"long file contents is long");
{
let len = body.len();
let body =
futures::stream::once(futures::future::ready(std::io::Result::Ok(body.clone())));
storage.upload(body, len, &path, None, &cancel).await?;
}
let read = aggregate(storage.download(&path, &cancel).await?.download_stream).await?;
assert_eq!(body, read);
let shorter = Bytes::from_static(b"shorter body");
{
let len = shorter.len();
let body =
futures::stream::once(futures::future::ready(std::io::Result::Ok(shorter.clone())));
storage.upload(body, len, &path, None, &cancel).await?;
}
let read = aggregate(storage.download(&path, &cancel).await?.download_stream).await?;
assert_eq!(shorter, read);
Ok(())
}
#[tokio::test]
async fn cancelled_upload_can_later_be_retried() -> anyhow::Result<()> {
let (storage, cancel) = create_storage()?;
let path = RemotePath::new("does/not/matter/file".into())?;
let body = Bytes::from_static(b"long file contents is long");
{
let len = body.len();
let body =
futures::stream::once(futures::future::ready(std::io::Result::Ok(body.clone())));
let cancel = cancel.child_token();
cancel.cancel();
let e = storage
.upload(body, len, &path, None, &cancel)
.await
.unwrap_err();
assert!(TimeoutOrCancel::caused_by_cancel(&e));
}
{
let len = body.len();
let body =
futures::stream::once(futures::future::ready(std::io::Result::Ok(body.clone())));
storage.upload(body, len, &path, None, &cancel).await?;
}
let read = aggregate(storage.download(&path, &cancel).await?.download_stream).await?;
assert_eq!(body, read);
Ok(())
}
async fn upload_dummy_file(
storage: &LocalFs,
name: &str,
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<RemotePath> {
let from_path = storage
.storage_root
@@ -852,7 +1039,9 @@ mod fs_tests {
let file = tokio_util::io::ReaderStream::new(file);
storage.upload(file, size, &relative_path, metadata).await?;
storage
.upload(file, size, &relative_path, metadata, cancel)
.await?;
Ok(relative_path)
}

View File

@@ -11,7 +11,7 @@ use std::{
pin::Pin,
sync::Arc,
task::{Context, Poll},
time::SystemTime,
time::{Duration, SystemTime},
};
use anyhow::{anyhow, Context as _};
@@ -46,9 +46,9 @@ use utils::backoff;
use super::StorageMetadata;
use crate::{
support::PermitCarrying, ConcurrencyLimiter, Download, DownloadError, Listing, ListingMode,
RemotePath, RemoteStorage, S3Config, TimeTravelError, MAX_KEYS_PER_DELETE,
REMOTE_STORAGE_PREFIX_SEPARATOR,
error::Cancelled, support::PermitCarrying, ConcurrencyLimiter, Download, DownloadError,
Listing, ListingMode, RemotePath, RemoteStorage, S3Config, TimeTravelError, TimeoutOrCancel,
MAX_KEYS_PER_DELETE, REMOTE_STORAGE_PREFIX_SEPARATOR,
};
pub(super) mod metrics;
@@ -63,6 +63,8 @@ pub struct S3Bucket {
prefix_in_bucket: Option<String>,
max_keys_per_list_response: Option<i32>,
concurrency_limiter: ConcurrencyLimiter,
// Per-request timeout. Accessible for tests.
pub timeout: Duration,
}
struct GetObjectRequest {
@@ -72,7 +74,7 @@ struct GetObjectRequest {
}
impl S3Bucket {
/// Creates the S3 storage, errors if incorrect AWS S3 configuration provided.
pub fn new(aws_config: &S3Config) -> anyhow::Result<Self> {
pub fn new(aws_config: &S3Config, timeout: Duration) -> anyhow::Result<Self> {
tracing::debug!(
"Creating s3 remote storage for S3 bucket {}",
aws_config.bucket_name
@@ -152,6 +154,7 @@ impl S3Bucket {
max_keys_per_list_response: aws_config.max_keys_per_list_response,
prefix_in_bucket,
concurrency_limiter: ConcurrencyLimiter::new(aws_config.concurrency_limit.get()),
timeout,
})
}
@@ -185,40 +188,55 @@ impl S3Bucket {
}
}
async fn permit(&self, kind: RequestKind) -> tokio::sync::SemaphorePermit<'_> {
async fn permit(
&self,
kind: RequestKind,
cancel: &CancellationToken,
) -> Result<tokio::sync::SemaphorePermit<'_>, Cancelled> {
let started_at = start_counting_cancelled_wait(kind);
let permit = self
.concurrency_limiter
.acquire(kind)
.await
.expect("semaphore is never closed");
let acquire = self.concurrency_limiter.acquire(kind);
let permit = tokio::select! {
permit = acquire => permit.expect("semaphore is never closed"),
_ = cancel.cancelled() => return Err(Cancelled),
};
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.wait_seconds
.observe_elapsed(kind, started_at);
permit
Ok(permit)
}
async fn owned_permit(&self, kind: RequestKind) -> tokio::sync::OwnedSemaphorePermit {
async fn owned_permit(
&self,
kind: RequestKind,
cancel: &CancellationToken,
) -> Result<tokio::sync::OwnedSemaphorePermit, Cancelled> {
let started_at = start_counting_cancelled_wait(kind);
let permit = self
.concurrency_limiter
.acquire_owned(kind)
.await
.expect("semaphore is never closed");
let acquire = self.concurrency_limiter.acquire_owned(kind);
let permit = tokio::select! {
permit = acquire => permit.expect("semaphore is never closed"),
_ = cancel.cancelled() => return Err(Cancelled),
};
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.wait_seconds
.observe_elapsed(kind, started_at);
permit
Ok(permit)
}
async fn download_object(&self, request: GetObjectRequest) -> Result<Download, DownloadError> {
async fn download_object(
&self,
request: GetObjectRequest,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
let kind = RequestKind::Get;
let permit = self.owned_permit(kind).await;
let permit = self.owned_permit(kind, cancel).await?;
let started_at = start_measuring_requests(kind);
@@ -228,8 +246,13 @@ impl S3Bucket {
.bucket(request.bucket)
.key(request.key)
.set_range(request.range)
.send()
.await;
.send();
let get_object = tokio::select! {
res = get_object => res,
_ = tokio::time::sleep(self.timeout) => return Err(DownloadError::Timeout),
_ = cancel.cancelled() => return Err(DownloadError::Cancelled),
};
let started_at = ScopeGuard::into_inner(started_at);
@@ -259,6 +282,10 @@ impl S3Bucket {
}
};
// even if we would have no timeout left, continue anyways. the caller can decide to ignore
// the errors considering timeouts and cancellation.
let remaining = self.timeout.saturating_sub(started_at.elapsed());
let metadata = object_output.metadata().cloned().map(StorageMetadata);
let etag = object_output.e_tag;
let last_modified = object_output.last_modified.and_then(|t| t.try_into().ok());
@@ -268,6 +295,9 @@ impl S3Bucket {
let body = PermitCarrying::new(permit, body);
let body = TimedDownload::new(started_at, body);
let cancel_or_timeout = crate::support::cancel_or_timeout(remaining, cancel.clone());
let body = crate::support::DownloadStream::new(cancel_or_timeout, body);
Ok(Download {
metadata,
etag,
@@ -278,33 +308,44 @@ impl S3Bucket {
async fn delete_oids(
&self,
kind: RequestKind,
_permit: &tokio::sync::SemaphorePermit<'_>,
delete_objects: &[ObjectIdentifier],
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let kind = RequestKind::Delete;
let mut cancel = std::pin::pin!(cancel.cancelled());
for chunk in delete_objects.chunks(MAX_KEYS_PER_DELETE) {
let started_at = start_measuring_requests(kind);
let resp = self
let req = self
.client
.delete_objects()
.bucket(self.bucket_name.clone())
.delete(
Delete::builder()
.set_objects(Some(chunk.to_vec()))
.build()?,
.build()
.context("build request")?,
)
.send()
.await;
.send();
let resp = tokio::select! {
resp = req => resp,
_ = tokio::time::sleep(self.timeout) => return Err(TimeoutOrCancel::Timeout.into()),
_ = &mut cancel => return Err(TimeoutOrCancel::Cancel.into()),
};
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &resp, started_at);
let resp = resp?;
let resp = resp.context("request deletion")?;
metrics::BUCKET_METRICS
.deleted_objects_total
.inc_by(chunk.len() as u64);
if let Some(errors) = resp.errors {
// Log a bounded number of the errors within the response:
// these requests can carry 1000 keys so logging each one
@@ -320,9 +361,10 @@ impl S3Bucket {
);
}
return Err(anyhow::format_err!(
"Failed to delete {} objects",
errors.len()
return Err(anyhow::anyhow!(
"Failed to delete {}/{} objects",
errors.len(),
chunk.len(),
));
}
}
@@ -410,6 +452,7 @@ impl RemoteStorage for S3Bucket {
prefix: Option<&RemotePath>,
mode: ListingMode,
max_keys: Option<NonZeroU32>,
cancel: &CancellationToken,
) -> Result<Listing, DownloadError> {
let kind = RequestKind::List;
// s3 sdk wants i32
@@ -431,10 +474,11 @@ impl RemoteStorage for S3Bucket {
p
});
let _permit = self.permit(kind, cancel).await?;
let mut continuation_token = None;
loop {
let _guard = self.permit(kind).await;
let started_at = start_measuring_requests(kind);
// min of two Options, returning Some if one is value and another is
@@ -456,9 +500,15 @@ impl RemoteStorage for S3Bucket {
request = request.delimiter(REMOTE_STORAGE_PREFIX_SEPARATOR.to_string());
}
let response = request
.send()
.await
let request = request.send();
let response = tokio::select! {
res = request => res,
_ = tokio::time::sleep(self.timeout) => return Err(DownloadError::Timeout),
_ = cancel.cancelled() => return Err(DownloadError::Cancelled),
};
let response = response
.context("Failed to list S3 prefixes")
.map_err(DownloadError::Other);
@@ -511,16 +561,17 @@ impl RemoteStorage for S3Bucket {
from_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let kind = RequestKind::Put;
let _guard = self.permit(kind).await;
let _permit = self.permit(kind, cancel).await?;
let started_at = start_measuring_requests(kind);
let body = Body::wrap_stream(from);
let bytes_stream = ByteStream::new(SdkBody::from_body_0_4(body));
let res = self
let upload = self
.client
.put_object()
.bucket(self.bucket_name.clone())
@@ -528,22 +579,40 @@ impl RemoteStorage for S3Bucket {
.set_metadata(metadata.map(|m| m.0))
.content_length(from_size_bytes.try_into()?)
.body(bytes_stream)
.send()
.await;
.send();
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &res, started_at);
let upload = tokio::time::timeout(self.timeout, upload);
res?;
let res = tokio::select! {
res = upload => res,
_ = cancel.cancelled() => return Err(TimeoutOrCancel::Cancel.into()),
};
Ok(())
if let Ok(inner) = &res {
// do not incl. timeouts as errors in metrics but cancellations
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, inner, started_at);
}
match res {
Ok(Ok(_put)) => Ok(()),
Ok(Err(sdk)) => Err(sdk.into()),
Err(_timeout) => Err(TimeoutOrCancel::Timeout.into()),
}
}
async fn copy(&self, from: &RemotePath, to: &RemotePath) -> anyhow::Result<()> {
async fn copy(
&self,
from: &RemotePath,
to: &RemotePath,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let kind = RequestKind::Copy;
let _guard = self.permit(kind).await;
let _permit = self.permit(kind, cancel).await?;
let timeout = tokio::time::sleep(self.timeout);
let started_at = start_measuring_requests(kind);
@@ -554,14 +623,19 @@ impl RemoteStorage for S3Bucket {
self.relative_path_to_s3_object(from)
);
let res = self
let op = self
.client
.copy_object()
.bucket(self.bucket_name.clone())
.key(self.relative_path_to_s3_object(to))
.copy_source(copy_source)
.send()
.await;
.send();
let res = tokio::select! {
res = op => res,
_ = timeout => return Err(TimeoutOrCancel::Timeout.into()),
_ = cancel.cancelled() => return Err(TimeoutOrCancel::Cancel.into()),
};
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
@@ -573,14 +647,21 @@ impl RemoteStorage for S3Bucket {
Ok(())
}
async fn download(&self, from: &RemotePath) -> Result<Download, DownloadError> {
async fn download(
&self,
from: &RemotePath,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
// if prefix is not none then download file `prefix/from`
// if prefix is none then download file `from`
self.download_object(GetObjectRequest {
bucket: self.bucket_name.clone(),
key: self.relative_path_to_s3_object(from),
range: None,
})
self.download_object(
GetObjectRequest {
bucket: self.bucket_name.clone(),
key: self.relative_path_to_s3_object(from),
range: None,
},
cancel,
)
.await
}
@@ -589,6 +670,7 @@ impl RemoteStorage for S3Bucket {
from: &RemotePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
// S3 accepts ranges as https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
// and needs both ends to be exclusive
@@ -598,31 +680,39 @@ impl RemoteStorage for S3Bucket {
None => format!("bytes={start_inclusive}-"),
});
self.download_object(GetObjectRequest {
bucket: self.bucket_name.clone(),
key: self.relative_path_to_s3_object(from),
range,
})
self.download_object(
GetObjectRequest {
bucket: self.bucket_name.clone(),
key: self.relative_path_to_s3_object(from),
range,
},
cancel,
)
.await
}
async fn delete_objects<'a>(&self, paths: &'a [RemotePath]) -> anyhow::Result<()> {
let kind = RequestKind::Delete;
let _guard = self.permit(kind).await;
async fn delete_objects<'a>(
&self,
paths: &'a [RemotePath],
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let kind = RequestKind::Delete;
let permit = self.permit(kind, cancel).await?;
let mut delete_objects = Vec::with_capacity(paths.len());
for path in paths {
let obj_id = ObjectIdentifier::builder()
.set_key(Some(self.relative_path_to_s3_object(path)))
.build()?;
.build()
.context("convert path to oid")?;
delete_objects.push(obj_id);
}
self.delete_oids(kind, &delete_objects).await
self.delete_oids(&permit, &delete_objects, cancel).await
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
async fn delete(&self, path: &RemotePath, cancel: &CancellationToken) -> anyhow::Result<()> {
let paths = std::array::from_ref(path);
self.delete_objects(paths).await
self.delete_objects(paths, cancel).await
}
async fn time_travel_recover(
@@ -633,7 +723,7 @@ impl RemoteStorage for S3Bucket {
cancel: &CancellationToken,
) -> Result<(), TimeTravelError> {
let kind = RequestKind::TimeTravel;
let _guard = self.permit(kind).await;
let permit = self.permit(kind, cancel).await?;
let timestamp = DateTime::from(timestamp);
let done_if_after = DateTime::from(done_if_after);
@@ -647,7 +737,7 @@ impl RemoteStorage for S3Bucket {
let warn_threshold = 3;
let max_retries = 10;
let is_permanent = |_e: &_| false;
let is_permanent = |e: &_| matches!(e, TimeTravelError::Cancelled);
let mut key_marker = None;
let mut version_id_marker = None;
@@ -656,15 +746,19 @@ impl RemoteStorage for S3Bucket {
loop {
let response = backoff::retry(
|| async {
self.client
let op = self
.client
.list_object_versions()
.bucket(self.bucket_name.clone())
.set_prefix(prefix.clone())
.set_key_marker(key_marker.clone())
.set_version_id_marker(version_id_marker.clone())
.send()
.await
.map_err(|e| TimeTravelError::Other(e.into()))
.send();
tokio::select! {
res = op => res.map_err(|e| TimeTravelError::Other(e.into())),
_ = cancel.cancelled() => Err(TimeTravelError::Cancelled),
}
},
is_permanent,
warn_threshold,
@@ -786,14 +880,18 @@ impl RemoteStorage for S3Bucket {
backoff::retry(
|| async {
self.client
let op = self
.client
.copy_object()
.bucket(self.bucket_name.clone())
.key(key)
.copy_source(&source_id)
.send()
.await
.map_err(|e| TimeTravelError::Other(e.into()))
.send();
tokio::select! {
res = op => res.map_err(|e| TimeTravelError::Other(e.into())),
_ = cancel.cancelled() => Err(TimeTravelError::Cancelled),
}
},
is_permanent,
warn_threshold,
@@ -824,10 +922,18 @@ impl RemoteStorage for S3Bucket {
let oid = ObjectIdentifier::builder()
.key(key.to_owned())
.build()
.map_err(|e| TimeTravelError::Other(anyhow::Error::new(e)))?;
self.delete_oids(kind, &[oid])
.map_err(|e| TimeTravelError::Other(e.into()))?;
self.delete_oids(&permit, &[oid], cancel)
.await
.map_err(TimeTravelError::Other)?;
.map_err(|e| {
// delete_oid0 will use TimeoutOrCancel
if TimeoutOrCancel::caused_by_cancel(&e) {
TimeTravelError::Cancelled
} else {
TimeTravelError::Other(e)
}
})?;
}
}
}
@@ -963,7 +1069,8 @@ mod tests {
concurrency_limit: NonZeroUsize::new(100).unwrap(),
max_keys_per_list_response: Some(5),
};
let storage = S3Bucket::new(&config).expect("remote storage init");
let storage =
S3Bucket::new(&config, std::time::Duration::ZERO).expect("remote storage init");
for (test_path_idx, test_path) in all_paths.iter().enumerate() {
let result = storage.relative_path_to_s3_object(test_path);
let expected = expected_outputs[prefix_idx][test_path_idx];

View File

@@ -90,11 +90,16 @@ impl UnreliableWrapper {
}
}
async fn delete_inner(&self, path: &RemotePath, attempt: bool) -> anyhow::Result<()> {
async fn delete_inner(
&self,
path: &RemotePath,
attempt: bool,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
if attempt {
self.attempt(RemoteOp::Delete(path.clone()))?;
}
self.inner.delete(path).await
self.inner.delete(path, cancel).await
}
}
@@ -105,20 +110,22 @@ impl RemoteStorage for UnreliableWrapper {
async fn list_prefixes(
&self,
prefix: Option<&RemotePath>,
cancel: &CancellationToken,
) -> Result<Vec<RemotePath>, DownloadError> {
self.attempt(RemoteOp::ListPrefixes(prefix.cloned()))
.map_err(DownloadError::Other)?;
self.inner.list_prefixes(prefix).await
self.inner.list_prefixes(prefix, cancel).await
}
async fn list_files(
&self,
folder: Option<&RemotePath>,
max_keys: Option<NonZeroU32>,
cancel: &CancellationToken,
) -> Result<Vec<RemotePath>, DownloadError> {
self.attempt(RemoteOp::ListPrefixes(folder.cloned()))
.map_err(DownloadError::Other)?;
self.inner.list_files(folder, max_keys).await
self.inner.list_files(folder, max_keys, cancel).await
}
async fn list(
@@ -126,10 +133,11 @@ impl RemoteStorage for UnreliableWrapper {
prefix: Option<&RemotePath>,
mode: ListingMode,
max_keys: Option<NonZeroU32>,
cancel: &CancellationToken,
) -> Result<Listing, DownloadError> {
self.attempt(RemoteOp::ListPrefixes(prefix.cloned()))
.map_err(DownloadError::Other)?;
self.inner.list(prefix, mode, max_keys).await
self.inner.list(prefix, mode, max_keys, cancel).await
}
async fn upload(
@@ -140,15 +148,22 @@ impl RemoteStorage for UnreliableWrapper {
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
self.attempt(RemoteOp::Upload(to.clone()))?;
self.inner.upload(data, data_size_bytes, to, metadata).await
self.inner
.upload(data, data_size_bytes, to, metadata, cancel)
.await
}
async fn download(&self, from: &RemotePath) -> Result<Download, DownloadError> {
async fn download(
&self,
from: &RemotePath,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
self.attempt(RemoteOp::Download(from.clone()))
.map_err(DownloadError::Other)?;
self.inner.download(from).await
self.inner.download(from, cancel).await
}
async fn download_byte_range(
@@ -156,6 +171,7 @@ impl RemoteStorage for UnreliableWrapper {
from: &RemotePath,
start_inclusive: u64,
end_exclusive: Option<u64>,
cancel: &CancellationToken,
) -> Result<Download, DownloadError> {
// Note: We treat any download_byte_range as an "attempt" of the same
// operation. We don't pay attention to the ranges. That's good enough
@@ -163,20 +179,24 @@ impl RemoteStorage for UnreliableWrapper {
self.attempt(RemoteOp::Download(from.clone()))
.map_err(DownloadError::Other)?;
self.inner
.download_byte_range(from, start_inclusive, end_exclusive)
.download_byte_range(from, start_inclusive, end_exclusive, cancel)
.await
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
self.delete_inner(path, true).await
async fn delete(&self, path: &RemotePath, cancel: &CancellationToken) -> anyhow::Result<()> {
self.delete_inner(path, true, cancel).await
}
async fn delete_objects<'a>(&self, paths: &'a [RemotePath]) -> anyhow::Result<()> {
async fn delete_objects<'a>(
&self,
paths: &'a [RemotePath],
cancel: &CancellationToken,
) -> anyhow::Result<()> {
self.attempt(RemoteOp::DeleteObjects(paths.to_vec()))?;
let mut error_counter = 0;
for path in paths {
// Dont record attempt because it was already recorded above
if (self.delete_inner(path, false).await).is_err() {
if (self.delete_inner(path, false, cancel).await).is_err() {
error_counter += 1;
}
}
@@ -189,11 +209,16 @@ impl RemoteStorage for UnreliableWrapper {
Ok(())
}
async fn copy(&self, from: &RemotePath, to: &RemotePath) -> anyhow::Result<()> {
async fn copy(
&self,
from: &RemotePath,
to: &RemotePath,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
// copy is equivalent to download + upload
self.attempt(RemoteOp::Download(from.clone()))?;
self.attempt(RemoteOp::Upload(to.clone()))?;
self.inner.copy_object(from, to).await
self.inner.copy_object(from, to, cancel).await
}
async fn time_travel_recover(

View File

@@ -1,9 +1,15 @@
use std::{
future::Future,
pin::Pin,
task::{Context, Poll},
time::Duration,
};
use bytes::Bytes;
use futures_util::Stream;
use tokio_util::sync::CancellationToken;
use crate::TimeoutOrCancel;
pin_project_lite::pin_project! {
/// An `AsyncRead` adapter which carries a permit for the lifetime of the value.
@@ -31,3 +37,133 @@ impl<S: Stream> Stream for PermitCarrying<S> {
self.inner.size_hint()
}
}
pin_project_lite::pin_project! {
pub(crate) struct DownloadStream<F, S> {
hit: bool,
#[pin]
cancellation: F,
#[pin]
inner: S,
}
}
impl<F, S> DownloadStream<F, S> {
pub(crate) fn new(cancellation: F, inner: S) -> Self {
Self {
cancellation,
hit: false,
inner,
}
}
}
/// See documentation on [`crate::DownloadStream`] on rationale why `std::io::Error` is used.
impl<E, F, S> Stream for DownloadStream<F, S>
where
std::io::Error: From<E>,
F: Future<Output = E>,
S: Stream<Item = std::io::Result<Bytes>>,
{
type Item = <S as Stream>::Item;
fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
let this = self.project();
if !*this.hit {
if let Poll::Ready(e) = this.cancellation.poll(cx) {
*this.hit = true;
let e = Err(std::io::Error::from(e));
return Poll::Ready(Some(e));
}
}
this.inner.poll_next(cx)
}
fn size_hint(&self) -> (usize, Option<usize>) {
self.inner.size_hint()
}
}
/// Fires only on the first cancel or timeout, not on both.
pub(crate) async fn cancel_or_timeout(
timeout: Duration,
cancel: CancellationToken,
) -> TimeoutOrCancel {
tokio::select! {
_ = tokio::time::sleep(timeout) => TimeoutOrCancel::Timeout,
_ = cancel.cancelled() => TimeoutOrCancel::Cancel,
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::DownloadError;
use futures::stream::StreamExt;
#[tokio::test(start_paused = true)]
async fn cancelled_download_stream() {
let inner = futures::stream::pending();
let timeout = Duration::from_secs(120);
let cancel = CancellationToken::new();
let stream = DownloadStream::new(cancel_or_timeout(timeout, cancel.clone()), inner);
let mut stream = std::pin::pin!(stream);
let mut first = stream.next();
tokio::select! {
_ = &mut first => unreachable!("we haven't yet cancelled nor is timeout passed"),
_ = tokio::time::sleep(Duration::from_secs(1)) => {},
}
cancel.cancel();
let e = first.await.expect("there must be some").unwrap_err();
assert!(matches!(e.kind(), std::io::ErrorKind::Other), "{e:?}");
let inner = e.get_ref().expect("inner should be set");
assert!(
inner
.downcast_ref::<DownloadError>()
.is_some_and(|e| matches!(e, DownloadError::Cancelled)),
"{inner:?}"
);
tokio::select! {
_ = stream.next() => unreachable!("no timeout ever happens as we were already cancelled"),
_ = tokio::time::sleep(Duration::from_secs(121)) => {},
}
}
#[tokio::test(start_paused = true)]
async fn timeouted_download_stream() {
let inner = futures::stream::pending();
let timeout = Duration::from_secs(120);
let cancel = CancellationToken::new();
let stream = DownloadStream::new(cancel_or_timeout(timeout, cancel.clone()), inner);
let mut stream = std::pin::pin!(stream);
// because the stream uses 120s timeout we are paused, we advance to 120s right away.
let first = stream.next();
let e = first.await.expect("there must be some").unwrap_err();
assert!(matches!(e.kind(), std::io::ErrorKind::Other), "{e:?}");
let inner = e.get_ref().expect("inner should be set");
assert!(
inner
.downcast_ref::<DownloadError>()
.is_some_and(|e| matches!(e, DownloadError::Timeout)),
"{inner:?}"
);
cancel.cancel();
tokio::select! {
_ = stream.next() => unreachable!("no cancellation ever happens because we already timed out"),
_ = tokio::time::sleep(Duration::from_secs(121)) => {},
}
}
}

View File

@@ -10,6 +10,7 @@ use futures::stream::Stream;
use once_cell::sync::OnceCell;
use remote_storage::{Download, GenericRemoteStorage, RemotePath};
use tokio::task::JoinSet;
use tokio_util::sync::CancellationToken;
use tracing::{debug, error, info};
static LOGGING_DONE: OnceCell<()> = OnceCell::new();
@@ -58,8 +59,12 @@ pub(crate) async fn upload_simple_remote_data(
) -> ControlFlow<HashSet<RemotePath>, HashSet<RemotePath>> {
info!("Creating {upload_tasks_count} remote files");
let mut upload_tasks = JoinSet::new();
let cancel = CancellationToken::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
let cancel = cancel.clone();
upload_tasks.spawn(async move {
let blob_path = PathBuf::from(format!("folder{}/blob_{}.txt", i / 7, i));
let blob_path = RemotePath::new(
@@ -69,7 +74,9 @@ pub(crate) async fn upload_simple_remote_data(
debug!("Creating remote item {i} at path {blob_path:?}");
let (data, len) = upload_stream(format!("remote blob data {i}").into_bytes().into());
task_client.upload(data, len, &blob_path, None).await?;
task_client
.upload(data, len, &blob_path, None, &cancel)
.await?;
Ok::<_, anyhow::Error>(blob_path)
});
@@ -107,13 +114,15 @@ pub(crate) async fn cleanup(
"Removing {} objects from the remote storage during cleanup",
objects_to_delete.len()
);
let cancel = CancellationToken::new();
let mut delete_tasks = JoinSet::new();
for object_to_delete in objects_to_delete {
let task_client = Arc::clone(client);
let cancel = cancel.clone();
delete_tasks.spawn(async move {
debug!("Deleting remote item at path {object_to_delete:?}");
task_client
.delete(&object_to_delete)
.delete(&object_to_delete, &cancel)
.await
.with_context(|| format!("{object_to_delete:?} removal"))
});
@@ -141,8 +150,12 @@ pub(crate) async fn upload_remote_data(
) -> ControlFlow<Uploads, Uploads> {
info!("Creating {upload_tasks_count} remote files");
let mut upload_tasks = JoinSet::new();
let cancel = CancellationToken::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
let cancel = cancel.clone();
upload_tasks.spawn(async move {
let prefix = format!("{base_prefix_str}/sub_prefix_{i}/");
let blob_prefix = RemotePath::new(Utf8Path::new(&prefix))
@@ -152,7 +165,9 @@ pub(crate) async fn upload_remote_data(
let (data, data_len) =
upload_stream(format!("remote blob data {i}").into_bytes().into());
task_client.upload(data, data_len, &blob_path, None).await?;
task_client
.upload(data, data_len, &blob_path, None, &cancel)
.await?;
Ok::<_, anyhow::Error>((blob_prefix, blob_path))
});

View File

@@ -4,6 +4,7 @@ use remote_storage::RemotePath;
use std::sync::Arc;
use std::{collections::HashSet, num::NonZeroU32};
use test_context::test_context;
use tokio_util::sync::CancellationToken;
use tracing::debug;
use crate::common::{download_to_vec, upload_stream, wrap_stream};
@@ -45,13 +46,15 @@ async fn pagination_should_work(ctx: &mut MaybeEnabledStorageWithTestBlobs) -> a
}
};
let cancel = CancellationToken::new();
let test_client = Arc::clone(&ctx.enabled.client);
let expected_remote_prefixes = ctx.remote_prefixes.clone();
let base_prefix = RemotePath::new(Utf8Path::new(ctx.enabled.base_prefix))
.context("common_prefix construction")?;
let root_remote_prefixes = test_client
.list_prefixes(None)
.list_prefixes(None, &cancel)
.await
.context("client list root prefixes failure")?
.into_iter()
@@ -62,7 +65,7 @@ async fn pagination_should_work(ctx: &mut MaybeEnabledStorageWithTestBlobs) -> a
);
let nested_remote_prefixes = test_client
.list_prefixes(Some(&base_prefix))
.list_prefixes(Some(&base_prefix), &cancel)
.await
.context("client list nested prefixes failure")?
.into_iter()
@@ -99,11 +102,12 @@ async fn list_files_works(ctx: &mut MaybeEnabledStorageWithSimpleTestBlobs) -> a
anyhow::bail!("S3 init failed: {e:?}")
}
};
let cancel = CancellationToken::new();
let test_client = Arc::clone(&ctx.enabled.client);
let base_prefix =
RemotePath::new(Utf8Path::new("folder1")).context("common_prefix construction")?;
let root_files = test_client
.list_files(None, None)
.list_files(None, None, &cancel)
.await
.context("client list root files failure")?
.into_iter()
@@ -117,13 +121,13 @@ async fn list_files_works(ctx: &mut MaybeEnabledStorageWithSimpleTestBlobs) -> a
// Test that max_keys limit works. In total there are about 21 files (see
// upload_simple_remote_data call in test_real_s3.rs).
let limited_root_files = test_client
.list_files(None, Some(NonZeroU32::new(2).unwrap()))
.list_files(None, Some(NonZeroU32::new(2).unwrap()), &cancel)
.await
.context("client list root files failure")?;
assert_eq!(limited_root_files.len(), 2);
let nested_remote_files = test_client
.list_files(Some(&base_prefix), None)
.list_files(Some(&base_prefix), None, &cancel)
.await
.context("client list nested files failure")?
.into_iter()
@@ -150,12 +154,17 @@ async fn delete_non_exising_works(ctx: &mut MaybeEnabledStorage) -> anyhow::Resu
MaybeEnabledStorage::Disabled => return Ok(()),
};
let cancel = CancellationToken::new();
let path = RemotePath::new(Utf8Path::new(
format!("{}/for_sure_there_is_nothing_there_really", ctx.base_prefix).as_str(),
))
.with_context(|| "RemotePath conversion")?;
ctx.client.delete(&path).await.expect("should succeed");
ctx.client
.delete(&path, &cancel)
.await
.expect("should succeed");
Ok(())
}
@@ -168,6 +177,8 @@ async fn delete_objects_works(ctx: &mut MaybeEnabledStorage) -> anyhow::Result<(
MaybeEnabledStorage::Disabled => return Ok(()),
};
let cancel = CancellationToken::new();
let path1 = RemotePath::new(Utf8Path::new(format!("{}/path1", ctx.base_prefix).as_str()))
.with_context(|| "RemotePath conversion")?;
@@ -178,21 +189,21 @@ async fn delete_objects_works(ctx: &mut MaybeEnabledStorage) -> anyhow::Result<(
.with_context(|| "RemotePath conversion")?;
let (data, len) = upload_stream("remote blob data1".as_bytes().into());
ctx.client.upload(data, len, &path1, None).await?;
ctx.client.upload(data, len, &path1, None, &cancel).await?;
let (data, len) = upload_stream("remote blob data2".as_bytes().into());
ctx.client.upload(data, len, &path2, None).await?;
ctx.client.upload(data, len, &path2, None, &cancel).await?;
let (data, len) = upload_stream("remote blob data3".as_bytes().into());
ctx.client.upload(data, len, &path3, None).await?;
ctx.client.upload(data, len, &path3, None, &cancel).await?;
ctx.client.delete_objects(&[path1, path2]).await?;
ctx.client.delete_objects(&[path1, path2], &cancel).await?;
let prefixes = ctx.client.list_prefixes(None).await?;
let prefixes = ctx.client.list_prefixes(None, &cancel).await?;
assert_eq!(prefixes.len(), 1);
ctx.client.delete_objects(&[path3]).await?;
ctx.client.delete_objects(&[path3], &cancel).await?;
Ok(())
}
@@ -204,6 +215,8 @@ async fn upload_download_works(ctx: &mut MaybeEnabledStorage) -> anyhow::Result<
return Ok(());
};
let cancel = CancellationToken::new();
let path = RemotePath::new(Utf8Path::new(format!("{}/file", ctx.base_prefix).as_str()))
.with_context(|| "RemotePath conversion")?;
@@ -211,47 +224,56 @@ async fn upload_download_works(ctx: &mut MaybeEnabledStorage) -> anyhow::Result<
let (data, len) = wrap_stream(orig.clone());
ctx.client.upload(data, len, &path, None).await?;
ctx.client.upload(data, len, &path, None, &cancel).await?;
// Normal download request
let dl = ctx.client.download(&path).await?;
let dl = ctx.client.download(&path, &cancel).await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
// Full range (end specified)
let dl = ctx
.client
.download_byte_range(&path, 0, Some(len as u64))
.download_byte_range(&path, 0, Some(len as u64), &cancel)
.await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
// partial range (end specified)
let dl = ctx.client.download_byte_range(&path, 4, Some(10)).await?;
let dl = ctx
.client
.download_byte_range(&path, 4, Some(10), &cancel)
.await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[4..10]);
// partial range (end beyond real end)
let dl = ctx
.client
.download_byte_range(&path, 8, Some(len as u64 * 100))
.download_byte_range(&path, 8, Some(len as u64 * 100), &cancel)
.await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[8..]);
// Partial range (end unspecified)
let dl = ctx.client.download_byte_range(&path, 4, None).await?;
let dl = ctx
.client
.download_byte_range(&path, 4, None, &cancel)
.await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[4..]);
// Full range (end unspecified)
let dl = ctx.client.download_byte_range(&path, 0, None).await?;
let dl = ctx
.client
.download_byte_range(&path, 0, None, &cancel)
.await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
debug!("Cleanup: deleting file at path {path:?}");
ctx.client
.delete(&path)
.delete(&path, &cancel)
.await
.with_context(|| format!("{path:?} removal"))?;
@@ -265,6 +287,8 @@ async fn copy_works(ctx: &mut MaybeEnabledStorage) -> anyhow::Result<()> {
return Ok(());
};
let cancel = CancellationToken::new();
let path = RemotePath::new(Utf8Path::new(
format!("{}/file_to_copy", ctx.base_prefix).as_str(),
))
@@ -278,18 +302,18 @@ async fn copy_works(ctx: &mut MaybeEnabledStorage) -> anyhow::Result<()> {
let (data, len) = wrap_stream(orig.clone());
ctx.client.upload(data, len, &path, None).await?;
ctx.client.upload(data, len, &path, None, &cancel).await?;
// Normal download request
ctx.client.copy_object(&path, &path_dest).await?;
ctx.client.copy_object(&path, &path_dest, &cancel).await?;
let dl = ctx.client.download(&path_dest).await?;
let dl = ctx.client.download(&path_dest, &cancel).await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
debug!("Cleanup: deleting file at path {path:?}");
ctx.client
.delete_objects(&[path.clone(), path_dest.clone()])
.delete_objects(&[path.clone(), path_dest.clone()], &cancel)
.await
.with_context(|| format!("{path:?} removal"))?;

View File

@@ -1,9 +1,9 @@
use std::collections::HashSet;
use std::env;
use std::num::NonZeroUsize;
use std::ops::ControlFlow;
use std::sync::Arc;
use std::time::UNIX_EPOCH;
use std::{collections::HashSet, time::Duration};
use anyhow::Context;
use remote_storage::{
@@ -39,6 +39,17 @@ impl EnabledAzure {
base_prefix: BASE_PREFIX,
}
}
#[allow(unused)] // this will be needed when moving the timeout integration tests back
fn configure_request_timeout(&mut self, timeout: Duration) {
match Arc::get_mut(&mut self.client).expect("outer Arc::get_mut") {
GenericRemoteStorage::AzureBlob(azure) => {
let azure = Arc::get_mut(azure).expect("inner Arc::get_mut");
azure.timeout = timeout;
}
_ => unreachable!(),
}
}
}
enum MaybeEnabledStorage {
@@ -213,6 +224,7 @@ fn create_azure_client(
concurrency_limit: NonZeroUsize::new(100).unwrap(),
max_keys_per_list_response,
}),
timeout: Duration::from_secs(120),
};
Ok(Arc::new(
GenericRemoteStorage::from_config(&remote_storage_config).context("remote storage init")?,

View File

@@ -1,5 +1,6 @@
use std::env;
use std::fmt::{Debug, Display};
use std::future::Future;
use std::num::NonZeroUsize;
use std::ops::ControlFlow;
use std::sync::Arc;
@@ -9,9 +10,10 @@ use std::{collections::HashSet, time::SystemTime};
use crate::common::{download_to_vec, upload_stream};
use anyhow::Context;
use camino::Utf8Path;
use futures_util::Future;
use futures_util::StreamExt;
use remote_storage::{
GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind, S3Config,
DownloadError, GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind,
S3Config,
};
use test_context::test_context;
use test_context::AsyncTestContext;
@@ -27,7 +29,6 @@ use common::{cleanup, ensure_logging_ready, upload_remote_data, upload_simple_re
use utils::backoff;
const ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME: &str = "ENABLE_REAL_S3_REMOTE_STORAGE";
const BASE_PREFIX: &str = "test";
#[test_context(MaybeEnabledStorage)]
@@ -69,8 +70,11 @@ async fn s3_time_travel_recovery_works(ctx: &mut MaybeEnabledStorage) -> anyhow:
ret
}
async fn list_files(client: &Arc<GenericRemoteStorage>) -> anyhow::Result<HashSet<RemotePath>> {
Ok(retry(|| client.list_files(None, None))
async fn list_files(
client: &Arc<GenericRemoteStorage>,
cancel: &CancellationToken,
) -> anyhow::Result<HashSet<RemotePath>> {
Ok(retry(|| client.list_files(None, None, cancel))
.await
.context("list root files failure")?
.into_iter()
@@ -90,11 +94,11 @@ async fn s3_time_travel_recovery_works(ctx: &mut MaybeEnabledStorage) -> anyhow:
retry(|| {
let (data, len) = upload_stream("remote blob data1".as_bytes().into());
ctx.client.upload(data, len, &path1, None)
ctx.client.upload(data, len, &path1, None, &cancel)
})
.await?;
let t0_files = list_files(&ctx.client).await?;
let t0_files = list_files(&ctx.client, &cancel).await?;
let t0 = time_point().await;
println!("at t0: {t0_files:?}");
@@ -102,17 +106,17 @@ async fn s3_time_travel_recovery_works(ctx: &mut MaybeEnabledStorage) -> anyhow:
retry(|| {
let (data, len) = upload_stream(old_data.as_bytes().into());
ctx.client.upload(data, len, &path2, None)
ctx.client.upload(data, len, &path2, None, &cancel)
})
.await?;
let t1_files = list_files(&ctx.client).await?;
let t1_files = list_files(&ctx.client, &cancel).await?;
let t1 = time_point().await;
println!("at t1: {t1_files:?}");
// A little check to ensure that our clock is not too far off from the S3 clock
{
let dl = retry(|| ctx.client.download(&path2)).await?;
let dl = retry(|| ctx.client.download(&path2, &cancel)).await?;
let last_modified = dl.last_modified.unwrap();
let half_wt = WAIT_TIME.mul_f32(0.5);
let t0_hwt = t0 + half_wt;
@@ -125,7 +129,7 @@ async fn s3_time_travel_recovery_works(ctx: &mut MaybeEnabledStorage) -> anyhow:
retry(|| {
let (data, len) = upload_stream("remote blob data3".as_bytes().into());
ctx.client.upload(data, len, &path3, None)
ctx.client.upload(data, len, &path3, None, &cancel)
})
.await?;
@@ -133,12 +137,12 @@ async fn s3_time_travel_recovery_works(ctx: &mut MaybeEnabledStorage) -> anyhow:
retry(|| {
let (data, len) = upload_stream(new_data.as_bytes().into());
ctx.client.upload(data, len, &path2, None)
ctx.client.upload(data, len, &path2, None, &cancel)
})
.await?;
retry(|| ctx.client.delete(&path1)).await?;
let t2_files = list_files(&ctx.client).await?;
retry(|| ctx.client.delete(&path1, &cancel)).await?;
let t2_files = list_files(&ctx.client, &cancel).await?;
let t2 = time_point().await;
println!("at t2: {t2_files:?}");
@@ -147,10 +151,10 @@ async fn s3_time_travel_recovery_works(ctx: &mut MaybeEnabledStorage) -> anyhow:
ctx.client
.time_travel_recover(None, t2, t_final, &cancel)
.await?;
let t2_files_recovered = list_files(&ctx.client).await?;
let t2_files_recovered = list_files(&ctx.client, &cancel).await?;
println!("after recovery to t2: {t2_files_recovered:?}");
assert_eq!(t2_files, t2_files_recovered);
let path2_recovered_t2 = download_to_vec(ctx.client.download(&path2).await?).await?;
let path2_recovered_t2 = download_to_vec(ctx.client.download(&path2, &cancel).await?).await?;
assert_eq!(path2_recovered_t2, new_data.as_bytes());
// after recovery to t1: path1 is back, path2 has the old content
@@ -158,10 +162,10 @@ async fn s3_time_travel_recovery_works(ctx: &mut MaybeEnabledStorage) -> anyhow:
ctx.client
.time_travel_recover(None, t1, t_final, &cancel)
.await?;
let t1_files_recovered = list_files(&ctx.client).await?;
let t1_files_recovered = list_files(&ctx.client, &cancel).await?;
println!("after recovery to t1: {t1_files_recovered:?}");
assert_eq!(t1_files, t1_files_recovered);
let path2_recovered_t1 = download_to_vec(ctx.client.download(&path2).await?).await?;
let path2_recovered_t1 = download_to_vec(ctx.client.download(&path2, &cancel).await?).await?;
assert_eq!(path2_recovered_t1, old_data.as_bytes());
// after recovery to t0: everything is gone except for path1
@@ -169,14 +173,14 @@ async fn s3_time_travel_recovery_works(ctx: &mut MaybeEnabledStorage) -> anyhow:
ctx.client
.time_travel_recover(None, t0, t_final, &cancel)
.await?;
let t0_files_recovered = list_files(&ctx.client).await?;
let t0_files_recovered = list_files(&ctx.client, &cancel).await?;
println!("after recovery to t0: {t0_files_recovered:?}");
assert_eq!(t0_files, t0_files_recovered);
// cleanup
let paths = &[path1, path2, path3];
retry(|| ctx.client.delete_objects(paths)).await?;
retry(|| ctx.client.delete_objects(paths, &cancel)).await?;
Ok(())
}
@@ -197,6 +201,16 @@ impl EnabledS3 {
base_prefix: BASE_PREFIX,
}
}
fn configure_request_timeout(&mut self, timeout: Duration) {
match Arc::get_mut(&mut self.client).expect("outer Arc::get_mut") {
GenericRemoteStorage::AwsS3(s3) => {
let s3 = Arc::get_mut(s3).expect("inner Arc::get_mut");
s3.timeout = timeout;
}
_ => unreachable!(),
}
}
}
enum MaybeEnabledStorage {
@@ -370,8 +384,169 @@ fn create_s3_client(
concurrency_limit: NonZeroUsize::new(100).unwrap(),
max_keys_per_list_response,
}),
timeout: RemoteStorageConfig::DEFAULT_TIMEOUT,
};
Ok(Arc::new(
GenericRemoteStorage::from_config(&remote_storage_config).context("remote storage init")?,
))
}
#[test_context(MaybeEnabledStorage)]
#[tokio::test]
async fn download_is_timeouted(ctx: &mut MaybeEnabledStorage) {
let MaybeEnabledStorage::Enabled(ctx) = ctx else {
return;
};
let cancel = CancellationToken::new();
let path = RemotePath::new(Utf8Path::new(
format!("{}/file_to_copy", ctx.base_prefix).as_str(),
))
.unwrap();
let len = upload_large_enough_file(&ctx.client, &path, &cancel).await;
let timeout = std::time::Duration::from_secs(5);
ctx.configure_request_timeout(timeout);
let started_at = std::time::Instant::now();
let mut stream = ctx
.client
.download(&path, &cancel)
.await
.expect("download succeeds")
.download_stream;
if started_at.elapsed().mul_f32(0.9) >= timeout {
tracing::warn!(
elapsed_ms = started_at.elapsed().as_millis(),
"timeout might be too low, consumed most of it during headers"
);
}
let first = stream
.next()
.await
.expect("should have the first blob")
.expect("should have succeeded");
tracing::info!(len = first.len(), "downloaded first chunk");
assert!(
first.len() < len,
"uploaded file is too small, we downloaded all on first chunk"
);
tokio::time::sleep(timeout).await;
{
let started_at = std::time::Instant::now();
let next = stream
.next()
.await
.expect("stream should not have ended yet");
tracing::info!(
next.is_err = next.is_err(),
elapsed_ms = started_at.elapsed().as_millis(),
"received item after timeout"
);
let e = next.expect_err("expected an error, but got a chunk?");
let inner = e.get_ref().expect("std::io::Error::inner should be set");
assert!(
inner
.downcast_ref::<DownloadError>()
.is_some_and(|e| matches!(e, DownloadError::Timeout)),
"{inner:?}"
);
}
ctx.configure_request_timeout(RemoteStorageConfig::DEFAULT_TIMEOUT);
ctx.client.delete_objects(&[path], &cancel).await.unwrap()
}
#[test_context(MaybeEnabledStorage)]
#[tokio::test]
async fn download_is_cancelled(ctx: &mut MaybeEnabledStorage) {
let MaybeEnabledStorage::Enabled(ctx) = ctx else {
return;
};
let cancel = CancellationToken::new();
let path = RemotePath::new(Utf8Path::new(
format!("{}/file_to_copy", ctx.base_prefix).as_str(),
))
.unwrap();
let len = upload_large_enough_file(&ctx.client, &path, &cancel).await;
{
let mut stream = ctx
.client
.download(&path, &cancel)
.await
.expect("download succeeds")
.download_stream;
let first = stream
.next()
.await
.expect("should have the first blob")
.expect("should have succeeded");
tracing::info!(len = first.len(), "downloaded first chunk");
assert!(
first.len() < len,
"uploaded file is too small, we downloaded all on first chunk"
);
cancel.cancel();
let next = stream.next().await.expect("stream should have more");
let e = next.expect_err("expected an error, but got a chunk?");
let inner = e.get_ref().expect("std::io::Error::inner should be set");
assert!(
inner
.downcast_ref::<DownloadError>()
.is_some_and(|e| matches!(e, DownloadError::Cancelled)),
"{inner:?}"
);
}
let cancel = CancellationToken::new();
ctx.client.delete_objects(&[path], &cancel).await.unwrap();
}
/// Upload a long enough file so that we cannot download it in single chunk
///
/// For s3 the first chunk seems to be less than 10kB, so this has a bit of a safety margin
async fn upload_large_enough_file(
client: &GenericRemoteStorage,
path: &RemotePath,
cancel: &CancellationToken,
) -> usize {
let header = bytes::Bytes::from_static("remote blob data content".as_bytes());
let body = bytes::Bytes::from(vec![0u8; 1024]);
let contents = std::iter::once(header).chain(std::iter::repeat(body).take(128));
let len = contents.clone().fold(0, |acc, next| acc + next.len());
let contents = futures::stream::iter(contents.map(std::io::Result::Ok));
client
.upload(contents, len, path, None, cancel)
.await
.expect("upload succeeds");
len
}

View File

@@ -25,6 +25,7 @@ hyper = { workspace = true, features = ["full"] }
fail.workspace = true
futures = { workspace = true}
jsonwebtoken.workspace = true
leaky-bucket.workspace = true
nix.workspace = true
once_cell.workspace = true
pin-project-lite.workspace = true

View File

@@ -1,5 +1,3 @@
#![allow(unused)]
use criterion::{criterion_group, criterion_main, Criterion};
use utils::id;

View File

@@ -29,6 +29,9 @@ pub enum Scope {
// Should only be used e.g. for status check.
// Currently also used for connection from any pageserver to any safekeeper.
SafekeeperData,
// The scope used by pageservers in upcalls to storage controller and cloud control plane
#[serde(rename = "generations_api")]
GenerationsApi,
}
/// JWT payload. See docs/authentication.md for the format

View File

@@ -54,12 +54,10 @@ impl Generation {
}
#[track_caller]
pub fn get_suffix(&self) -> String {
pub fn get_suffix(&self) -> impl std::fmt::Display {
match self {
Self::Valid(v) => {
format!("-{:08x}", v)
}
Self::None => "".into(),
Self::Valid(v) => GenerationFileSuffix(Some(*v)),
Self::None => GenerationFileSuffix(None),
Self::Broken => {
panic!("Tried to use a broken generation");
}
@@ -90,6 +88,7 @@ impl Generation {
}
}
#[track_caller]
pub fn next(&self) -> Generation {
match self {
Self::Valid(n) => Self::Valid(*n + 1),
@@ -107,6 +106,18 @@ impl Generation {
}
}
struct GenerationFileSuffix(Option<u32>);
impl std::fmt::Display for GenerationFileSuffix {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
if let Some(g) = self.0 {
write!(f, "-{g:08x}")
} else {
Ok(())
}
}
}
impl Serialize for Generation {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
@@ -164,4 +175,24 @@ mod test {
assert!(Generation::none() < Generation::new(0));
assert!(Generation::none() < Generation::new(1));
}
#[test]
fn suffix_is_stable() {
use std::fmt::Write as _;
// the suffix must remain stable through-out the pageserver remote storage evolution and
// not be changed accidentially without thinking about migration
let examples = [
(line!(), Generation::None, ""),
(line!(), Generation::Valid(0), "-00000000"),
(line!(), Generation::Valid(u32::MAX), "-ffffffff"),
];
let mut s = String::new();
for (line, gen, expected) in examples {
s.clear();
write!(s, "{}", &gen.get_suffix()).expect("string grows");
assert_eq!(s, expected, "example on {line}");
}
}
}

View File

@@ -69,37 +69,44 @@ impl<T> OnceCell<T> {
F: FnOnce(InitPermit) -> Fut,
Fut: std::future::Future<Output = Result<(T, InitPermit), E>>,
{
let sem = {
loop {
let sem = {
let guard = self.inner.lock().unwrap();
if guard.value.is_some() {
return Ok(Guard(guard));
}
guard.init_semaphore.clone()
};
{
let permit = {
// increment the count for the duration of queued
let _guard = CountWaitingInitializers::start(self);
sem.acquire().await
};
let Ok(permit) = permit else {
let guard = self.inner.lock().unwrap();
if !Arc::ptr_eq(&sem, &guard.init_semaphore) {
// there was a take_and_deinit in between
continue;
}
assert!(
guard.value.is_some(),
"semaphore got closed, must be initialized"
);
return Ok(Guard(guard));
};
permit.forget();
}
let permit = InitPermit(sem);
let (value, _permit) = factory(permit).await?;
let guard = self.inner.lock().unwrap();
if guard.value.is_some() {
return Ok(Guard(guard));
}
guard.init_semaphore.clone()
};
let permit = {
// increment the count for the duration of queued
let _guard = CountWaitingInitializers::start(self);
sem.acquire_owned().await
};
match permit {
Ok(permit) => {
let permit = InitPermit(permit);
let (value, _permit) = factory(permit).await?;
let guard = self.inner.lock().unwrap();
Ok(Self::set0(value, guard))
}
Err(_closed) => {
let guard = self.inner.lock().unwrap();
assert!(
guard.value.is_some(),
"semaphore got closed, must be initialized"
);
return Ok(Guard(guard));
}
return Ok(Self::set0(value, guard));
}
}
@@ -197,27 +204,41 @@ impl<'a, T> Guard<'a, T> {
/// [`OnceCell::get_or_init`] will wait on it to complete.
pub fn take_and_deinit(&mut self) -> (T, InitPermit) {
let mut swapped = Inner::default();
let permit = swapped
.init_semaphore
.clone()
.try_acquire_owned()
.expect("we just created this");
let sem = swapped.init_semaphore.clone();
// acquire and forget right away, moving the control over to InitPermit
sem.try_acquire().expect("we just created this").forget();
std::mem::swap(&mut *self.0, &mut swapped);
swapped
.value
.map(|v| (v, InitPermit(permit)))
.map(|v| (v, InitPermit(sem)))
.expect("guard is not created unless value has been initialized")
}
}
/// Type held by OnceCell (de)initializing task.
pub struct InitPermit(tokio::sync::OwnedSemaphorePermit);
///
/// On drop, this type will return the permit.
pub struct InitPermit(Arc<tokio::sync::Semaphore>);
impl Drop for InitPermit {
fn drop(&mut self) {
assert_eq!(
self.0.available_permits(),
0,
"InitPermit should only exist as the unique permit"
);
self.0.add_permits(1);
}
}
#[cfg(test)]
mod tests {
use futures::Future;
use super::*;
use std::{
convert::Infallible,
pin::{pin, Pin},
sync::atomic::{AtomicUsize, Ordering},
time::Duration,
};
@@ -380,4 +401,85 @@ mod tests {
.unwrap();
assert_eq!(*g, "now initialized");
}
#[tokio::test(start_paused = true)]
async fn reproduce_init_take_deinit_race() {
init_take_deinit_scenario(|cell, factory| {
Box::pin(async {
cell.get_or_init(factory).await.unwrap();
})
})
.await;
}
type BoxedInitFuture<T, E> = Pin<Box<dyn Future<Output = Result<(T, InitPermit), E>>>>;
type BoxedInitFunction<T, E> = Box<dyn Fn(InitPermit) -> BoxedInitFuture<T, E>>;
/// Reproduce an assertion failure.
///
/// This has interesting generics to be generic between `get_or_init` and `get_mut_or_init`.
/// We currently only have one, but the structure is kept.
async fn init_take_deinit_scenario<F>(init_way: F)
where
F: for<'a> Fn(
&'a OnceCell<&'static str>,
BoxedInitFunction<&'static str, Infallible>,
) -> Pin<Box<dyn Future<Output = ()> + 'a>>,
{
let cell = OnceCell::default();
// acquire the init_semaphore only permit to drive initializing tasks in order to waiting
// on the same semaphore.
let permit = cell
.inner
.lock()
.unwrap()
.init_semaphore
.clone()
.try_acquire_owned()
.unwrap();
let mut t1 = pin!(init_way(
&cell,
Box::new(|permit| Box::pin(async move { Ok(("t1", permit)) })),
));
let mut t2 = pin!(init_way(
&cell,
Box::new(|permit| Box::pin(async move { Ok(("t2", permit)) })),
));
// drive t2 first to the init_semaphore -- the timeout will be hit once t2 future can
// no longer make progress
tokio::select! {
_ = &mut t2 => unreachable!("it cannot get permit"),
_ = tokio::time::sleep(Duration::from_secs(3600 * 24 * 7 * 365)) => {}
}
// followed by t1 in the init_semaphore
tokio::select! {
_ = &mut t1 => unreachable!("it cannot get permit"),
_ = tokio::time::sleep(Duration::from_secs(3600 * 24 * 7 * 365)) => {}
}
// now let t2 proceed and initialize
drop(permit);
t2.await;
let (s, permit) = { cell.get().unwrap().take_and_deinit() };
assert_eq!("t2", s);
// now originally t1 would see the semaphore it has as closed. it cannot yet get a permit from
// the new one.
tokio::select! {
_ = &mut t1 => unreachable!("it cannot get permit"),
_ = tokio::time::sleep(Duration::from_secs(3600 * 24 * 7 * 365)) => {}
}
// only now we get to initialize it
drop(permit);
t1.await;
assert_eq!("t1", *cell.get().unwrap());
}
}

View File

@@ -34,6 +34,9 @@ fn main() -> anyhow::Result<()> {
println!("cargo:rustc-link-lib=static=walproposer");
println!("cargo:rustc-link-search={walproposer_lib_search_str}");
// Rebuild crate when libwalproposer.a changes
println!("cargo:rerun-if-changed={walproposer_lib_search_str}/libwalproposer.a");
let pg_config_bin = pg_install_abs.join("v16").join("bin").join("pg_config");
let inc_server_path: String = if pg_config_bin.exists() {
let output = Command::new(pg_config_bin)
@@ -79,6 +82,7 @@ fn main() -> anyhow::Result<()> {
.allowlist_function("WalProposerBroadcast")
.allowlist_function("WalProposerPoll")
.allowlist_function("WalProposerFree")
.allowlist_function("SafekeeperStateDesiredEvents")
.allowlist_var("DEBUG5")
.allowlist_var("DEBUG4")
.allowlist_var("DEBUG3")

View File

@@ -22,6 +22,7 @@ use crate::bindings::WalProposerExecStatusType;
use crate::bindings::WalproposerShmemState;
use crate::bindings::XLogRecPtr;
use crate::walproposer::ApiImpl;
use crate::walproposer::StreamingCallback;
use crate::walproposer::WaitResult;
extern "C" fn get_shmem_state(wp: *mut WalProposer) -> *mut WalproposerShmemState {
@@ -36,7 +37,8 @@ extern "C" fn start_streaming(wp: *mut WalProposer, startpos: XLogRecPtr) {
unsafe {
let callback_data = (*(*wp).config).callback_data;
let api = callback_data as *mut Box<dyn ApiImpl>;
(*api).start_streaming(startpos)
let callback = StreamingCallback::new(wp);
(*api).start_streaming(startpos, &callback);
}
}
@@ -134,19 +136,18 @@ extern "C" fn conn_async_read(
unsafe {
let callback_data = (*(*(*sk).wp).config).callback_data;
let api = callback_data as *mut Box<dyn ApiImpl>;
let (res, result) = (*api).conn_async_read(&mut (*sk));
// This function has guarantee that returned buf will be valid until
// the next call. So we can store a Vec in each Safekeeper and reuse
// it on the next call.
let mut inbuf = take_vec_u8(&mut (*sk).inbuf).unwrap_or_default();
inbuf.clear();
inbuf.extend_from_slice(res);
let result = (*api).conn_async_read(&mut (*sk), &mut inbuf);
// Put a Vec back to sk->inbuf and return data ptr.
*amount = inbuf.len() as i32;
*buf = store_vec_u8(&mut (*sk).inbuf, inbuf);
*amount = res.len() as i32;
result
}
@@ -182,6 +183,10 @@ extern "C" fn recovery_download(wp: *mut WalProposer, sk: *mut Safekeeper) -> bo
unsafe {
let callback_data = (*(*(*sk).wp).config).callback_data;
let api = callback_data as *mut Box<dyn ApiImpl>;
// currently `recovery_download` is always called right after election
(*api).after_election(&mut (*wp));
(*api).recovery_download(&mut (*wp), &mut (*sk))
}
}
@@ -277,7 +282,8 @@ extern "C" fn wait_event_set(
}
WaitResult::Timeout => {
*event_sk = std::ptr::null_mut();
*events = crate::bindings::WL_TIMEOUT;
// WaitEventSetWait returns 0 for timeout.
*events = 0;
0
}
WaitResult::Network(sk, event_mask) => {
@@ -340,7 +346,7 @@ extern "C" fn log_internal(
}
}
#[derive(Debug)]
#[derive(Debug, PartialEq)]
pub enum Level {
Debug5,
Debug4,

View File

@@ -1,13 +1,13 @@
use std::ffi::CString;
use postgres_ffi::WAL_SEGMENT_SIZE;
use utils::id::TenantTimelineId;
use utils::{id::TenantTimelineId, lsn::Lsn};
use crate::{
api_bindings::{create_api, take_vec_u8, Level},
bindings::{
NeonWALReadResult, Safekeeper, WalProposer, WalProposerConfig, WalProposerCreate,
WalProposerFree, WalProposerStart,
NeonWALReadResult, Safekeeper, WalProposer, WalProposerBroadcast, WalProposerConfig,
WalProposerCreate, WalProposerFree, WalProposerPoll, WalProposerStart,
},
};
@@ -16,11 +16,11 @@ use crate::{
///
/// Refer to `pgxn/neon/walproposer.h` for documentation.
pub trait ApiImpl {
fn get_shmem_state(&self) -> &mut crate::bindings::WalproposerShmemState {
fn get_shmem_state(&self) -> *mut crate::bindings::WalproposerShmemState {
todo!()
}
fn start_streaming(&self, _startpos: u64) {
fn start_streaming(&self, _startpos: u64, _callback: &StreamingCallback) {
todo!()
}
@@ -70,7 +70,11 @@ pub trait ApiImpl {
todo!()
}
fn conn_async_read(&self, _sk: &mut Safekeeper) -> (&[u8], crate::bindings::PGAsyncReadResult) {
fn conn_async_read(
&self,
_sk: &mut Safekeeper,
_vec: &mut Vec<u8>,
) -> crate::bindings::PGAsyncReadResult {
todo!()
}
@@ -151,12 +155,14 @@ pub trait ApiImpl {
}
}
#[derive(Debug)]
pub enum WaitResult {
Latch,
Timeout,
Network(*mut Safekeeper, u32),
}
#[derive(Clone)]
pub struct Config {
/// Tenant and timeline id
pub ttid: TenantTimelineId,
@@ -242,6 +248,24 @@ impl Drop for Wrapper {
}
}
pub struct StreamingCallback {
wp: *mut WalProposer,
}
impl StreamingCallback {
pub fn new(wp: *mut WalProposer) -> StreamingCallback {
StreamingCallback { wp }
}
pub fn broadcast(&self, startpos: Lsn, endpos: Lsn) {
unsafe { WalProposerBroadcast(self.wp, startpos.0, endpos.0) }
}
pub fn poll(&self) {
unsafe { WalProposerPoll(self.wp) }
}
}
#[cfg(test)]
mod tests {
use core::panic;
@@ -344,14 +368,13 @@ mod tests {
fn conn_async_read(
&self,
_: &mut crate::bindings::Safekeeper,
) -> (&[u8], crate::bindings::PGAsyncReadResult) {
vec: &mut Vec<u8>,
) -> crate::bindings::PGAsyncReadResult {
println!("conn_async_read");
let reply = self.next_safekeeper_reply();
println!("conn_async_read result: {:?}", reply);
(
reply,
crate::bindings::PGAsyncReadResult_PG_ASYNC_READ_SUCCESS,
)
vec.extend_from_slice(reply);
crate::bindings::PGAsyncReadResult_PG_ASYNC_READ_SUCCESS
}
fn conn_blocking_write(&self, _: &mut crate::bindings::Safekeeper, buf: &[u8]) -> bool {

View File

@@ -12,6 +12,7 @@ testing = ["fail/failpoints"]
[dependencies]
anyhow.workspace = true
arc-swap.workspace = true
async-compression.workspace = true
async-stream.workspace = true
async-trait.workspace = true
@@ -35,6 +36,7 @@ humantime.workspace = true
humantime-serde.workspace = true
hyper.workspace = true
itertools.workspace = true
leaky-bucket.workspace = true
md5.workspace = true
nix.workspace = true
# hack to get the number of worker threads tokio uses
@@ -82,7 +84,7 @@ workspace_hack.workspace = true
reqwest.workspace = true
rpds.workspace = true
enum-map.workspace = true
enumset.workspace = true
enumset = { workspace = true, features = ["serde"]}
strum.workspace = true
strum_macros.workspace = true

View File

@@ -6,14 +6,28 @@
//! There are two sets of inputs; `short` and `medium`. They were collected on postgres v14 by
//! logging what happens when a sequential scan is requested on a small table, then picking out two
//! suitable from logs.
//!
//!
//! Reference data (git blame to see commit) on an i3en.3xlarge
// ```text
//! short/short/1 time: [39.175 µs 39.348 µs 39.536 µs]
//! short/short/2 time: [51.227 µs 51.487 µs 51.755 µs]
//! short/short/4 time: [76.048 µs 76.362 µs 76.674 µs]
//! short/short/8 time: [128.94 µs 129.82 µs 130.74 µs]
//! short/short/16 time: [227.84 µs 229.00 µs 230.28 µs]
//! short/short/32 time: [455.97 µs 457.81 µs 459.90 µs]
//! short/short/64 time: [902.46 µs 904.84 µs 907.32 µs]
//! short/short/128 time: [1.7416 ms 1.7487 ms 1.7561 ms]
//! ``
use std::sync::{Arc, Barrier};
use std::sync::Arc;
use bytes::{Buf, Bytes};
use pageserver::{
config::PageServerConf, repository::Key, walrecord::NeonWalRecord, walredo::PostgresRedoManager,
};
use pageserver_api::shard::TenantShardId;
use tokio::task::JoinSet;
use utils::{id::TenantId, lsn::Lsn};
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
@@ -39,11 +53,11 @@ fn redo_scenarios(c: &mut Criterion) {
.build()
.unwrap();
tracing::info!("executing first");
short().execute(rt.handle(), &manager).unwrap();
rt.block_on(short().execute(&manager)).unwrap();
tracing::info!("first executed");
}
let thread_counts = [1, 2, 4, 8, 16];
let thread_counts = [1, 2, 4, 8, 16, 32, 64, 128];
let mut group = c.benchmark_group("short");
group.sampling_mode(criterion::SamplingMode::Flat);
@@ -74,114 +88,69 @@ fn redo_scenarios(c: &mut Criterion) {
drop(group);
}
/// Sets up `threads` number of requesters to `request_redo`, with the given input.
/// Sets up a multi-threaded tokio runtime with default worker thread count,
/// then, spawn `requesters` tasks that repeatedly:
/// - get input from `input_factor()`
/// - call `manager.request_redo()` with their input
///
/// This stress-tests the scalability of a single walredo manager at high tokio-level concurrency.
///
/// Using tokio's default worker thread count means the results will differ on machines
/// with different core countrs. We don't care about that, the performance will always
/// be different on different hardware. To compare performance of different software versions,
/// use the same hardware.
fn add_multithreaded_walredo_requesters(
b: &mut criterion::Bencher,
threads: u32,
nrequesters: usize,
manager: &Arc<PostgresRedoManager>,
input_factory: fn() -> Request,
) {
assert_ne!(threads, 0);
assert_ne!(nrequesters, 0);
if threads == 1 {
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.unwrap();
let handle = rt.handle();
b.iter_batched_ref(
|| Some(input_factory()),
|input| execute_all(input.take(), handle, manager),
criterion::BatchSize::PerIteration,
);
} else {
let (work_tx, work_rx) = std::sync::mpsc::sync_channel(threads as usize);
let rt = tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()
.unwrap();
let work_rx = std::sync::Arc::new(std::sync::Mutex::new(work_rx));
let barrier = Arc::new(tokio::sync::Barrier::new(nrequesters + 1));
let barrier = Arc::new(Barrier::new(threads as usize + 1));
let jhs = (0..threads)
.map(|_| {
std::thread::spawn({
let manager = manager.clone();
let barrier = barrier.clone();
let work_rx = work_rx.clone();
move || {
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.unwrap();
let handle = rt.handle();
loop {
// queue up and wait if we want to go another round
if work_rx.lock().unwrap().recv().is_err() {
break;
}
let input = Some(input_factory());
barrier.wait();
execute_all(input, handle, &manager).unwrap();
barrier.wait();
}
}
})
})
.collect::<Vec<_>>();
let _jhs = JoinOnDrop(jhs);
b.iter_batched(
|| {
for _ in 0..threads {
work_tx.send(()).unwrap()
}
},
|()| {
// start the work
barrier.wait();
// wait for work to complete
barrier.wait();
},
criterion::BatchSize::PerIteration,
);
drop(work_tx);
let mut requesters = JoinSet::new();
for _ in 0..nrequesters {
let _entered = rt.enter();
let manager = manager.clone();
let barrier = barrier.clone();
requesters.spawn(async move {
loop {
let input = input_factory();
barrier.wait().await;
let page = input.execute(&manager).await.unwrap();
assert_eq!(page.remaining(), 8192);
barrier.wait().await;
}
});
}
}
struct JoinOnDrop(Vec<std::thread::JoinHandle<()>>);
let do_one_iteration = || {
rt.block_on(async {
barrier.wait().await;
// wait for work to complete
barrier.wait().await;
})
};
impl Drop for JoinOnDrop {
// it's not really needless because we want join all then check for panicks
#[allow(clippy::needless_collect)]
fn drop(&mut self) {
// first join all
let results = self.0.drain(..).map(|jh| jh.join()).collect::<Vec<_>>();
// then check the results; panicking here is not great, but it does get the message across
// to the user, and sets an exit value.
results.into_iter().try_for_each(|res| res).unwrap();
}
}
b.iter_batched(
|| {
// warmup
do_one_iteration();
},
|()| {
// work loop
do_one_iteration();
},
criterion::BatchSize::PerIteration,
);
fn execute_all<I>(
input: I,
handle: &tokio::runtime::Handle,
manager: &PostgresRedoManager,
) -> anyhow::Result<()>
where
I: IntoIterator<Item = Request>,
{
// just fire all requests as fast as possible
input.into_iter().try_for_each(|req| {
let page = req.execute(handle, manager)?;
assert_eq!(page.remaining(), 8192);
anyhow::Ok(())
})
rt.block_on(requesters.shutdown());
}
criterion_group!(benches, redo_scenarios);
@@ -493,11 +462,7 @@ struct Request {
}
impl Request {
fn execute(
self,
rt: &tokio::runtime::Handle,
manager: &PostgresRedoManager,
) -> anyhow::Result<Bytes> {
async fn execute(self, manager: &PostgresRedoManager) -> anyhow::Result<Bytes> {
let Request {
key,
lsn,
@@ -506,6 +471,8 @@ impl Request {
pg_version,
} = self;
rt.block_on(manager.request_redo(key, lsn, base_img, records, pg_version))
manager
.request_redo(key, lsn, base_img, records, pg_version)
.await
}
}

View File

@@ -1,6 +1,5 @@
use anyhow::Context;
use camino::Utf8PathBuf;
use futures::future::join_all;
use pageserver_api::key::{is_rel_block_key, key_to_rel_block, Key};
use pageserver_api::keyspace::KeySpaceAccum;
use pageserver_api::models::PagestreamGetPageRequest;
@@ -10,11 +9,10 @@ use utils::id::TenantTimelineId;
use utils::lsn::Lsn;
use rand::prelude::*;
use tokio::sync::Barrier;
use tokio::task::JoinSet;
use tracing::{info, instrument};
use tracing::info;
use std::collections::{HashMap, HashSet};
use std::collections::HashSet;
use std::future::Future;
use std::num::NonZeroUsize;
use std::pin::Pin;
@@ -38,8 +36,12 @@ pub(crate) struct Args {
num_clients: NonZeroUsize,
#[clap(long)]
runtime: Option<humantime::Duration>,
/// Each client sends requests at the given rate.
///
/// If a request takes too long and we should be issuing a new request already,
/// we skip that request and account it as `MISSED`.
#[clap(long)]
per_target_rate_limit: Option<usize>,
per_client_rate: Option<usize>,
/// Probability for sending `latest=true` in the request (uniform distribution).
#[clap(long, default_value = "1")]
req_latest_probability: f64,
@@ -61,12 +63,16 @@ pub(crate) struct Args {
#[derive(Debug, Default)]
struct LiveStats {
completed_requests: AtomicU64,
missed: AtomicU64,
}
impl LiveStats {
fn inc(&self) {
fn request_done(&self) {
self.completed_requests.fetch_add(1, Ordering::Relaxed);
}
fn missed(&self, n: u64) {
self.missed.fetch_add(n, Ordering::Relaxed);
}
}
#[derive(Clone, serde::Serialize, serde::Deserialize)]
@@ -220,13 +226,12 @@ async fn main_impl(
let live_stats = Arc::new(LiveStats::default());
let num_client_tasks = args.num_clients.get() * timelines.len();
let num_live_stats_dump = 1;
let num_work_sender_tasks = 1;
let num_work_sender_tasks = args.num_clients.get() * timelines.len();
let num_main_impl = 1;
let start_work_barrier = Arc::new(tokio::sync::Barrier::new(
num_client_tasks + num_live_stats_dump + num_work_sender_tasks + num_main_impl,
num_live_stats_dump + num_work_sender_tasks + num_main_impl,
));
tokio::spawn({
@@ -238,10 +243,12 @@ async fn main_impl(
let start = std::time::Instant::now();
tokio::time::sleep(std::time::Duration::from_secs(1)).await;
let completed_requests = stats.completed_requests.swap(0, Ordering::Relaxed);
let missed = stats.missed.swap(0, Ordering::Relaxed);
let elapsed = start.elapsed();
info!(
"RPS: {:.0}",
completed_requests as f64 / elapsed.as_secs_f64()
"RPS: {:.0} MISSED: {:.0}",
completed_requests as f64 / elapsed.as_secs_f64(),
missed as f64 / elapsed.as_secs_f64()
);
}
}
@@ -249,127 +256,105 @@ async fn main_impl(
let cancel = CancellationToken::new();
let mut work_senders: HashMap<WorkerId, _> = HashMap::new();
let mut tasks = Vec::new();
let rps_period = args
.per_client_rate
.map(|rps_limit| Duration::from_secs_f64(1.0 / (rps_limit as f64)));
let make_worker: &dyn Fn(WorkerId) -> Pin<Box<dyn Send + Future<Output = ()>>> = &|worker_id| {
let live_stats = live_stats.clone();
let start_work_barrier = start_work_barrier.clone();
let ranges: Vec<KeyRange> = all_ranges
.iter()
.filter(|r| r.timeline == worker_id.timeline)
.cloned()
.collect();
let weights =
rand::distributions::weighted::WeightedIndex::new(ranges.iter().map(|v| v.len()))
.unwrap();
let cancel = cancel.clone();
Box::pin(async move {
let client =
pageserver_client::page_service::Client::new(args.page_service_connstring.clone())
.await
.unwrap();
let mut client = client
.pagestream(worker_id.timeline.tenant_id, worker_id.timeline.timeline_id)
.await
.unwrap();
start_work_barrier.wait().await;
let client_start = Instant::now();
let mut ticks_processed = 0;
while !cancel.is_cancelled() {
// Detect if a request took longer than the RPS rate
if let Some(period) = &rps_period {
let periods_passed_until_now =
usize::try_from(client_start.elapsed().as_micros() / period.as_micros())
.unwrap();
if periods_passed_until_now > ticks_processed {
live_stats.missed((periods_passed_until_now - ticks_processed) as u64);
}
ticks_processed = periods_passed_until_now;
}
let start = Instant::now();
let req = {
let mut rng = rand::thread_rng();
let r = &ranges[weights.sample(&mut rng)];
let key: i128 = rng.gen_range(r.start..r.end);
let key = Key::from_i128(key);
assert!(is_rel_block_key(&key));
let (rel_tag, block_no) =
key_to_rel_block(key).expect("we filter non-rel-block keys out above");
PagestreamGetPageRequest {
latest: rng.gen_bool(args.req_latest_probability),
lsn: r.timeline_lsn,
rel: rel_tag,
blkno: block_no,
}
};
client.getpage(req).await.unwrap();
let end = Instant::now();
live_stats.request_done();
ticks_processed += 1;
STATS.with(|stats| {
stats
.borrow()
.lock()
.unwrap()
.observe(end.duration_since(start))
.unwrap();
});
if let Some(period) = &rps_period {
let next_at = client_start
+ Duration::from_micros(
(ticks_processed) as u64 * u64::try_from(period.as_micros()).unwrap(),
);
tokio::time::sleep_until(next_at.into()).await;
}
}
})
};
info!("spawning workers");
let mut workers = JoinSet::new();
for timeline in timelines.iter().cloned() {
for num_client in 0..args.num_clients.get() {
let (sender, receiver) = tokio::sync::mpsc::channel(10); // TODO: not sure what the implications of this are
let worker_id = WorkerId {
timeline,
num_client,
};
work_senders.insert(worker_id, sender);
tasks.push(tokio::spawn(client(
args,
worker_id,
Arc::clone(&start_work_barrier),
receiver,
Arc::clone(&live_stats),
cancel.clone(),
)));
workers.spawn(make_worker(worker_id));
}
}
let work_sender: Pin<Box<dyn Send + Future<Output = ()>>> = {
let start_work_barrier = start_work_barrier.clone();
let cancel = cancel.clone();
match args.per_target_rate_limit {
None => Box::pin(async move {
let weights = rand::distributions::weighted::WeightedIndex::new(
all_ranges.iter().map(|v| v.len()),
)
.unwrap();
start_work_barrier.wait().await;
while !cancel.is_cancelled() {
let (timeline, req) = {
let mut rng = rand::thread_rng();
let r = &all_ranges[weights.sample(&mut rng)];
let key: i128 = rng.gen_range(r.start..r.end);
let key = Key::from_i128(key);
let (rel_tag, block_no) =
key_to_rel_block(key).expect("we filter non-rel-block keys out above");
(
WorkerId {
timeline: r.timeline,
num_client: rng.gen_range(0..args.num_clients.get()),
},
PagestreamGetPageRequest {
latest: rng.gen_bool(args.req_latest_probability),
lsn: r.timeline_lsn,
rel: rel_tag,
blkno: block_no,
},
)
};
let sender = work_senders.get(&timeline).unwrap();
// TODO: what if this blocks?
if sender.send(req).await.is_err() {
assert!(cancel.is_cancelled(), "client has gone away unexpectedly");
}
}
}),
Some(rps_limit) => Box::pin(async move {
let period = Duration::from_secs_f64(1.0 / (rps_limit as f64));
let make_task: &dyn Fn(WorkerId) -> Pin<Box<dyn Send + Future<Output = ()>>> =
&|worker_id| {
let sender = work_senders.get(&worker_id).unwrap();
let ranges: Vec<KeyRange> = all_ranges
.iter()
.filter(|r| r.timeline == worker_id.timeline)
.cloned()
.collect();
let weights = rand::distributions::weighted::WeightedIndex::new(
ranges.iter().map(|v| v.len()),
)
.unwrap();
let cancel = cancel.clone();
Box::pin(async move {
let mut ticker = tokio::time::interval(period);
ticker.set_missed_tick_behavior(
/* TODO review this choice */
tokio::time::MissedTickBehavior::Burst,
);
while !cancel.is_cancelled() {
ticker.tick().await;
let req = {
let mut rng = rand::thread_rng();
let r = &ranges[weights.sample(&mut rng)];
let key: i128 = rng.gen_range(r.start..r.end);
let key = Key::from_i128(key);
assert!(is_rel_block_key(&key));
let (rel_tag, block_no) = key_to_rel_block(key)
.expect("we filter non-rel-block keys out above");
PagestreamGetPageRequest {
latest: rng.gen_bool(args.req_latest_probability),
lsn: r.timeline_lsn,
rel: rel_tag,
blkno: block_no,
}
};
if sender.send(req).await.is_err() {
assert!(
cancel.is_cancelled(),
"client has gone away unexpectedly"
);
}
}
})
};
let tasks: Vec<_> = work_senders.keys().map(|tl| make_task(*tl)).collect();
start_work_barrier.wait().await;
join_all(tasks).await;
}),
let workers = async move {
while let Some(res) = workers.join_next().await {
res.unwrap();
}
};
let work_sender_task = tokio::spawn(work_sender);
info!("waiting for everything to become ready");
start_work_barrier.wait().await;
info!("work started");
@@ -377,20 +362,13 @@ async fn main_impl(
tokio::time::sleep(runtime.into()).await;
info!("runtime over, signalling cancellation");
cancel.cancel();
work_sender_task.await.unwrap();
workers.await;
info!("work sender exited");
} else {
work_sender_task.await.unwrap();
workers.await;
unreachable!("work sender never terminates");
}
info!("joining clients");
for t in tasks {
t.await.unwrap();
}
info!("all clients stopped");
let output = Output {
total: {
let mut agg_stats = request_stats::Stats::new();
@@ -407,49 +385,3 @@ async fn main_impl(
anyhow::Ok(())
}
#[instrument(skip_all)]
async fn client(
args: &'static Args,
id: WorkerId,
start_work_barrier: Arc<Barrier>,
mut work: tokio::sync::mpsc::Receiver<PagestreamGetPageRequest>,
live_stats: Arc<LiveStats>,
cancel: CancellationToken,
) {
let WorkerId {
timeline,
num_client: _,
} = id;
let client = pageserver_client::page_service::Client::new(args.page_service_connstring.clone())
.await
.unwrap();
let mut client = client
.pagestream(timeline.tenant_id, timeline.timeline_id)
.await
.unwrap();
let do_requests = async {
start_work_barrier.wait().await;
while let Some(req) = work.recv().await {
let start = Instant::now();
client
.getpage(req)
.await
.with_context(|| format!("getpage for {timeline}"))
.unwrap();
let elapsed = start.elapsed();
live_stats.inc();
STATS.with(|stats| {
stats.borrow().lock().unwrap().observe(elapsed).unwrap();
});
}
};
tokio::select! {
res = do_requests => { res },
_ = cancel.cancelled() => {
// fallthrough to shutdown
}
}
client.shutdown().await;
}

View File

@@ -14,8 +14,12 @@ pub fn check_permission(claims: &Claims, tenant_id: Option<TenantId>) -> Result<
}
(Scope::PageServerApi, None) => Ok(()), // access to management api for PageServerApi scope
(Scope::PageServerApi, Some(_)) => Ok(()), // access to tenant api using PageServerApi scope
(Scope::SafekeeperData, _) => Err(AuthError(
"SafekeeperData scope makes no sense for Pageserver".into(),
(Scope::SafekeeperData | Scope::GenerationsApi, _) => Err(AuthError(
format!(
"JWT scope '{:?}' is ineligible for Pageserver auth",
claims.scope
)
.into(),
)),
}
}

View File

@@ -1359,6 +1359,7 @@ broker_endpoint = '{broker_endpoint}'
parsed_remote_storage_config,
RemoteStorageConfig {
storage: RemoteStorageKind::LocalFs(local_storage_path.clone()),
timeout: RemoteStorageConfig::DEFAULT_TIMEOUT,
},
"Remote storage config should correctly parse the local FS config and fill other storage defaults"
);
@@ -1426,6 +1427,7 @@ broker_endpoint = '{broker_endpoint}'
concurrency_limit: s3_concurrency_limit,
max_keys_per_list_response: None,
}),
timeout: RemoteStorageConfig::DEFAULT_TIMEOUT,
},
"Remote storage config should correctly parse the S3 config"
);

View File

@@ -234,7 +234,7 @@ impl DeletionHeader {
let header_bytes = serde_json::to_vec(self).context("serialize deletion header")?;
let header_path = conf.deletion_header_path();
let temp_path = path_with_suffix_extension(&header_path, TEMP_SUFFIX);
VirtualFile::crashsafe_overwrite(&header_path, &temp_path, &header_bytes)
VirtualFile::crashsafe_overwrite(&header_path, &temp_path, header_bytes)
.await
.maybe_fatal_err("save deletion header")?;
@@ -325,7 +325,7 @@ impl DeletionList {
let temp_path = path_with_suffix_extension(&path, TEMP_SUFFIX);
let bytes = serde_json::to_vec(self).expect("Failed to serialize deletion list");
VirtualFile::crashsafe_overwrite(&path, &temp_path, &bytes)
VirtualFile::crashsafe_overwrite(&path, &temp_path, bytes)
.await
.maybe_fatal_err("save deletion list")
.map_err(Into::into)
@@ -834,7 +834,6 @@ mod test {
}
impl ControlPlaneGenerationsApi for MockControlPlane {
#[allow(clippy::diverging_sub_expression)] // False positive via async_trait
async fn re_attach(&self) -> Result<HashMap<TenantShardId, Generation>, RetryForeverError> {
unimplemented!()
}
@@ -868,6 +867,7 @@ mod test {
let remote_fs_dir = harness.conf.workdir.join("remote_fs").canonicalize_utf8()?;
let storage_config = RemoteStorageConfig {
storage: RemoteStorageKind::LocalFs(remote_fs_dir.clone()),
timeout: RemoteStorageConfig::DEFAULT_TIMEOUT,
};
let storage = GenericRemoteStorage::from_config(&storage_config).unwrap();
@@ -1171,6 +1171,7 @@ pub(crate) mod mock {
pub struct ConsumerState {
rx: tokio::sync::mpsc::UnboundedReceiver<ListWriterQueueMessage>,
executor_rx: tokio::sync::mpsc::Receiver<DeleterMessage>,
cancel: CancellationToken,
}
impl ConsumerState {
@@ -1184,7 +1185,7 @@ pub(crate) mod mock {
match msg {
DeleterMessage::Delete(objects) => {
for path in objects {
match remote_storage.delete(&path).await {
match remote_storage.delete(&path, &self.cancel).await {
Ok(_) => {
debug!("Deleted {path}");
}
@@ -1217,7 +1218,7 @@ pub(crate) mod mock {
for path in objects {
info!("Executing deletion {path}");
match remote_storage.delete(&path).await {
match remote_storage.delete(&path, &self.cancel).await {
Ok(_) => {
debug!("Deleted {path}");
}
@@ -1267,7 +1268,11 @@ pub(crate) mod mock {
executor_tx,
executed,
remote_storage,
consumer: std::sync::Mutex::new(ConsumerState { rx, executor_rx }),
consumer: std::sync::Mutex::new(ConsumerState {
rx,
executor_rx,
cancel: CancellationToken::new(),
}),
lsn_table: Arc::new(std::sync::RwLock::new(VisibleLsnUpdates::new())),
}
}

View File

@@ -8,6 +8,7 @@
use remote_storage::GenericRemoteStorage;
use remote_storage::RemotePath;
use remote_storage::TimeoutOrCancel;
use remote_storage::MAX_KEYS_PER_DELETE;
use std::time::Duration;
use tokio_util::sync::CancellationToken;
@@ -71,9 +72,11 @@ impl Deleter {
Err(anyhow::anyhow!("failpoint: deletion-queue-before-execute"))
});
self.remote_storage.delete_objects(&self.accumulator).await
self.remote_storage
.delete_objects(&self.accumulator, &self.cancel)
.await
},
|_| false,
TimeoutOrCancel::caused_by_cancel,
3,
10,
"executing deletion batch",

View File

@@ -351,7 +351,6 @@ pub enum IterationOutcome<U> {
Finished(IterationOutcomeFinished<U>),
}
#[allow(dead_code)]
#[derive(Debug, Serialize)]
pub struct IterationOutcomeFinished<U> {
/// The actual usage observed before we started the iteration.
@@ -366,7 +365,6 @@ pub struct IterationOutcomeFinished<U> {
}
#[derive(Debug, Serialize)]
#[allow(dead_code)]
struct AssumedUsage<U> {
/// The expected value for `after`, after phase 2.
projected_after: U,
@@ -374,14 +372,12 @@ struct AssumedUsage<U> {
failed: LayerCount,
}
#[allow(dead_code)]
#[derive(Debug, Serialize)]
struct PlannedUsage<U> {
respecting_tenant_min_resident_size: U,
fallback_to_global_lru: Option<U>,
}
#[allow(dead_code)]
#[derive(Debug, Default, Serialize)]
struct LayerCount {
file_sizes: u64,
@@ -565,7 +561,6 @@ pub(crate) struct EvictionSecondaryLayer {
#[derive(Clone)]
pub(crate) enum EvictionLayer {
Attached(Layer),
#[allow(dead_code)]
Secondary(EvictionSecondaryLayer),
}
@@ -1105,7 +1100,6 @@ mod filesystem_level_usage {
use super::DiskUsageEvictionTaskConfig;
#[derive(Debug, Clone, Copy)]
#[allow(dead_code)]
pub struct Usage<'a> {
config: &'a DiskUsageEvictionTaskConfig,

View File

@@ -83,12 +83,12 @@ use utils::{
// This is not functionally necessary (clients will retry), but avoids generating a lot of
// failed API calls while tenants are activating.
#[cfg(not(feature = "testing"))]
const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(5000);
pub(crate) const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(5000);
// Tests run on slow/oversubscribed nodes, and may need to wait much longer for tenants to
// finish attaching, if calls to remote storage are slow.
#[cfg(feature = "testing")]
const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(30000);
pub(crate) const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(30000);
pub struct State {
conf: &'static PageServerConf,
@@ -422,6 +422,7 @@ async fn build_timeline_info_common(
tenant::timeline::logical_size::Accuracy::Approximate => false,
tenant::timeline::logical_size::Accuracy::Exact => true,
},
directory_entries_counts: timeline.get_directory_metrics().to_vec(),
current_physical_size,
current_logical_size_non_incremental: None,
timeline_dir_layer_file_size_sum: None,
@@ -488,7 +489,9 @@ async fn timeline_create_handler(
let state = get_state(&request);
async {
let tenant = state.tenant_manager.get_attached_tenant_shard(tenant_shard_id, false)?;
let tenant = state
.tenant_manager
.get_attached_tenant_shard(tenant_shard_id, false)?;
tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT).await?;
@@ -498,48 +501,62 @@ async fn timeline_create_handler(
tracing::info!("bootstrapping");
}
match tenant.create_timeline(
new_timeline_id,
request_data.ancestor_timeline_id.map(TimelineId::from),
request_data.ancestor_start_lsn,
request_data.pg_version.unwrap_or(crate::DEFAULT_PG_VERSION),
request_data.existing_initdb_timeline_id,
state.broker_client.clone(),
&ctx,
)
.await {
match tenant
.create_timeline(
new_timeline_id,
request_data.ancestor_timeline_id,
request_data.ancestor_start_lsn,
request_data.pg_version.unwrap_or(crate::DEFAULT_PG_VERSION),
request_data.existing_initdb_timeline_id,
state.broker_client.clone(),
&ctx,
)
.await
{
Ok(new_timeline) => {
// Created. Construct a TimelineInfo for it.
let timeline_info = build_timeline_info_common(&new_timeline, &ctx, tenant::timeline::GetLogicalSizePriority::User)
.await
.map_err(ApiError::InternalServerError)?;
let timeline_info = build_timeline_info_common(
&new_timeline,
&ctx,
tenant::timeline::GetLogicalSizePriority::User,
)
.await
.map_err(ApiError::InternalServerError)?;
json_response(StatusCode::CREATED, timeline_info)
}
Err(_) if tenant.cancel.is_cancelled() => {
// In case we get some ugly error type during shutdown, cast it into a clean 503.
json_response(StatusCode::SERVICE_UNAVAILABLE, HttpErrorBody::from_msg("Tenant shutting down".to_string()))
}
Err(tenant::CreateTimelineError::Conflict | tenant::CreateTimelineError::AlreadyCreating) => {
json_response(StatusCode::CONFLICT, ())
}
Err(tenant::CreateTimelineError::AncestorLsn(err)) => {
json_response(StatusCode::NOT_ACCEPTABLE, HttpErrorBody::from_msg(
format!("{err:#}")
))
}
Err(e @ tenant::CreateTimelineError::AncestorNotActive) => {
json_response(StatusCode::SERVICE_UNAVAILABLE, HttpErrorBody::from_msg(e.to_string()))
}
Err(tenant::CreateTimelineError::ShuttingDown) => {
json_response(StatusCode::SERVICE_UNAVAILABLE,HttpErrorBody::from_msg("tenant shutting down".to_string()))
json_response(
StatusCode::SERVICE_UNAVAILABLE,
HttpErrorBody::from_msg("Tenant shutting down".to_string()),
)
}
Err(
tenant::CreateTimelineError::Conflict
| tenant::CreateTimelineError::AlreadyCreating,
) => json_response(StatusCode::CONFLICT, ()),
Err(tenant::CreateTimelineError::AncestorLsn(err)) => json_response(
StatusCode::NOT_ACCEPTABLE,
HttpErrorBody::from_msg(format!("{err:#}")),
),
Err(e @ tenant::CreateTimelineError::AncestorNotActive) => json_response(
StatusCode::SERVICE_UNAVAILABLE,
HttpErrorBody::from_msg(e.to_string()),
),
Err(tenant::CreateTimelineError::ShuttingDown) => json_response(
StatusCode::SERVICE_UNAVAILABLE,
HttpErrorBody::from_msg("tenant shutting down".to_string()),
),
Err(tenant::CreateTimelineError::Other(err)) => Err(ApiError::InternalServerError(err)),
}
}
.instrument(info_span!("timeline_create",
tenant_id = %tenant_shard_id.tenant_id,
shard_id = %tenant_shard_id.shard_slug(),
timeline_id = %new_timeline_id, lsn=?request_data.ancestor_start_lsn, pg_version=?request_data.pg_version))
timeline_id = %new_timeline_id,
lsn=?request_data.ancestor_start_lsn,
pg_version=?request_data.pg_version
))
.await
}
@@ -554,10 +571,16 @@ async fn timeline_list_handler(
parse_query_param(&request, "force-await-initial-logical-size")?;
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
let state = get_state(&request);
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let response_data = async {
let tenant = mgr::get_tenant(tenant_shard_id, true)?;
let tenant = state
.tenant_manager
.get_attached_tenant_shard(tenant_shard_id, false)?;
tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT).await?;
let timelines = tenant.list_timelines();
let mut response_data = Vec::with_capacity(timelines.len());
@@ -1119,7 +1142,7 @@ async fn tenant_shard_split_handler(
let new_shards = state
.tenant_manager
.shard_split(tenant_shard_id, ShardCount(req.new_shard_count), &ctx)
.shard_split(tenant_shard_id, ShardCount::new(req.new_shard_count), &ctx)
.await
.map_err(ApiError::InternalServerError)?;
@@ -1934,6 +1957,7 @@ async fn put_io_engine_handler(
mut r: Request<Body>,
_cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
check_permission(&r, None)?;
let kind: crate::virtual_file::IoEngineKind = json_request(&mut r).await?;
crate::virtual_file::io_engine::set(kind);
json_response(StatusCode::OK, ())
@@ -2197,7 +2221,7 @@ pub fn make_router(
)
.get(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/keyspace",
|r| testing_api_handler("read out the keyspace", r, timeline_collect_keyspace),
|r| api_handler(r, timeline_collect_keyspace),
)
.put("/v1/io_engine", |r| api_handler(r, put_io_engine_handler))
.any(handler_404))

View File

@@ -602,6 +602,15 @@ pub(crate) mod initial_logical_size {
});
}
static DIRECTORY_ENTRIES_COUNT: Lazy<UIntGaugeVec> = Lazy::new(|| {
register_uint_gauge_vec!(
"pageserver_directory_entries_count",
"Sum of the entries in pageserver-stored directory listings",
&["tenant_id", "shard_id", "timeline_id"]
)
.expect("failed to define a metric")
});
pub(crate) static TENANT_STATE_METRIC: Lazy<UIntGaugeVec> = Lazy::new(|| {
register_uint_gauge_vec!(
"pageserver_tenant_states_count",
@@ -1809,6 +1818,7 @@ pub(crate) struct TimelineMetrics {
resident_physical_size_gauge: UIntGauge,
/// copy of LayeredTimeline.current_logical_size
pub current_logical_size_gauge: UIntGauge,
pub directory_entries_count_gauge: Lazy<UIntGauge, Box<dyn Send + Fn() -> UIntGauge>>,
pub num_persistent_files_created: IntCounter,
pub persistent_bytes_written: IntCounter,
pub evictions: IntCounter,
@@ -1818,12 +1828,12 @@ pub(crate) struct TimelineMetrics {
impl TimelineMetrics {
pub fn new(
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
timeline_id_raw: &TimelineId,
evictions_with_low_residence_duration_builder: EvictionsWithLowResidenceDurationBuilder,
) -> Self {
let tenant_id = tenant_shard_id.tenant_id.to_string();
let shard_id = format!("{}", tenant_shard_id.shard_slug());
let timeline_id = timeline_id.to_string();
let timeline_id = timeline_id_raw.to_string();
let flush_time_histo = StorageTimeMetrics::new(
StorageTimeOperation::LayerFlush,
&tenant_id,
@@ -1876,6 +1886,22 @@ impl TimelineMetrics {
let current_logical_size_gauge = CURRENT_LOGICAL_SIZE
.get_metric_with_label_values(&[&tenant_id, &shard_id, &timeline_id])
.unwrap();
// TODO use impl Trait syntax here once we have ability to use it: https://github.com/rust-lang/rust/issues/63065
let directory_entries_count_gauge_closure = {
let tenant_shard_id = *tenant_shard_id;
let timeline_id_raw = *timeline_id_raw;
move || {
let tenant_id = tenant_shard_id.tenant_id.to_string();
let shard_id = format!("{}", tenant_shard_id.shard_slug());
let timeline_id = timeline_id_raw.to_string();
let gauge: UIntGauge = DIRECTORY_ENTRIES_COUNT
.get_metric_with_label_values(&[&tenant_id, &shard_id, &timeline_id])
.unwrap();
gauge
}
};
let directory_entries_count_gauge: Lazy<UIntGauge, Box<dyn Send + Fn() -> UIntGauge>> =
Lazy::new(Box::new(directory_entries_count_gauge_closure));
let num_persistent_files_created = NUM_PERSISTENT_FILES_CREATED
.get_metric_with_label_values(&[&tenant_id, &shard_id, &timeline_id])
.unwrap();
@@ -1902,6 +1928,7 @@ impl TimelineMetrics {
last_record_gauge,
resident_physical_size_gauge,
current_logical_size_gauge,
directory_entries_count_gauge,
num_persistent_files_created,
persistent_bytes_written,
evictions,
@@ -1944,6 +1971,9 @@ impl Drop for TimelineMetrics {
RESIDENT_PHYSICAL_SIZE.remove_label_values(&[tenant_id, &shard_id, timeline_id]);
}
let _ = CURRENT_LOGICAL_SIZE.remove_label_values(&[tenant_id, &shard_id, timeline_id]);
if let Some(metric) = Lazy::get(&DIRECTORY_ENTRIES_COUNT) {
let _ = metric.remove_label_values(&[tenant_id, &shard_id, timeline_id]);
}
let _ =
NUM_PERSISTENT_FILES_CREATED.remove_label_values(&[tenant_id, &shard_id, timeline_id]);
let _ = PERSISTENT_BYTES_WRITTEN.remove_label_values(&[tenant_id, &shard_id, timeline_id]);
@@ -2466,6 +2496,56 @@ pub mod tokio_epoll_uring {
}
}
pub(crate) mod tenant_throttling {
use metrics::{register_int_counter_vec, IntCounter};
use once_cell::sync::Lazy;
use crate::tenant::{self, throttle::Metric};
pub(crate) struct TimelineGet {
wait_time: IntCounter,
count: IntCounter,
}
pub(crate) static TIMELINE_GET: Lazy<TimelineGet> = Lazy::new(|| {
static WAIT_USECS: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_wait_usecs_sum_global",
"Sum of microseconds that tenants spent waiting for a tenant throttle of a given kind.",
&["kind"]
)
.unwrap()
});
static WAIT_COUNT: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
register_int_counter_vec!(
"pageserver_tenant_throttling_count_global",
"Count of tenant throttlings, by kind of throttle.",
&["kind"]
)
.unwrap()
});
let kind = "timeline_get";
TimelineGet {
wait_time: WAIT_USECS.with_label_values(&[kind]),
count: WAIT_COUNT.with_label_values(&[kind]),
}
});
impl Metric for &'static TimelineGet {
#[inline(always)]
fn observe_throttling(
&self,
tenant::throttle::Observation { wait_time }: &tenant::throttle::Observation,
) {
let val = u64::try_from(wait_time.as_micros()).unwrap();
self.wait_time.inc_by(val);
self.count.inc();
}
}
}
pub fn preinitialize_metrics() {
// Python tests need these and on some we do alerting.
//
@@ -2527,4 +2607,5 @@ pub fn preinitialize_metrics() {
// Custom
Lazy::force(&RECONSTRUCT_TIME);
Lazy::force(&tenant_throttling::TIMELINE_GET);
}

View File

@@ -26,7 +26,7 @@ use pageserver_api::models::{
PagestreamNblocksResponse,
};
use pageserver_api::shard::ShardIndex;
use pageserver_api::shard::{ShardCount, ShardNumber};
use pageserver_api::shard::ShardNumber;
use postgres_backend::{self, is_expected_io_error, AuthType, PostgresBackend, QueryError};
use pq_proto::framed::ConnectionError;
use pq_proto::FeStartupPacket;
@@ -998,7 +998,7 @@ impl PageServerHandler {
) -> Result<&Arc<Timeline>, Key> {
let key = if let Some((first_idx, first_timeline)) = self.shard_timelines.iter().next() {
// Fastest path: single sharded case
if first_idx.shard_count < ShardCount(2) {
if first_idx.shard_count.count() == 1 {
return Ok(&first_timeline.timeline);
}

View File

@@ -14,6 +14,7 @@ use crate::span::debug_assert_current_span_has_tenant_and_timeline_id_no_shard_i
use crate::walrecord::NeonWalRecord;
use anyhow::{ensure, Context};
use bytes::{Buf, Bytes, BytesMut};
use enum_map::Enum;
use pageserver_api::key::{
dbdir_key_range, is_rel_block_key, is_slru_block_key, rel_block_to_key, rel_dir_to_key,
rel_key_range, rel_size_to_key, relmap_file_key, slru_block_to_key, slru_dir_to_key,
@@ -155,6 +156,8 @@ impl Timeline {
pending_updates: HashMap::new(),
pending_deletions: Vec::new(),
pending_nblocks: 0,
pending_aux_files: None,
pending_directory_entries: Vec::new(),
lsn,
}
}
@@ -868,6 +871,15 @@ pub struct DatadirModification<'a> {
pending_updates: HashMap<Key, Vec<(Lsn, Value)>>,
pending_deletions: Vec<(Range<Key>, Lsn)>,
pending_nblocks: i64,
// If we already wrote any aux file changes in this modification, stash the latest dir. If set,
// [`Self::put_file`] may assume that it is safe to emit a delta rather than checking
// if AUX_FILES_KEY is already set.
pending_aux_files: Option<AuxFilesDirectory>,
/// For special "directory" keys that store key-value maps, track the size of the map
/// if it was updated in this modification.
pending_directory_entries: Vec<(DirectoryKind, usize)>,
}
impl<'a> DatadirModification<'a> {
@@ -899,6 +911,7 @@ impl<'a> DatadirModification<'a> {
let buf = DbDirectory::ser(&DbDirectory {
dbdirs: HashMap::new(),
})?;
self.pending_directory_entries.push((DirectoryKind::Db, 0));
self.put(DBDIR_KEY, Value::Image(buf.into()));
// Create AuxFilesDirectory
@@ -907,16 +920,24 @@ impl<'a> DatadirModification<'a> {
let buf = TwoPhaseDirectory::ser(&TwoPhaseDirectory {
xids: HashSet::new(),
})?;
self.pending_directory_entries
.push((DirectoryKind::TwoPhase, 0));
self.put(TWOPHASEDIR_KEY, Value::Image(buf.into()));
let buf: Bytes = SlruSegmentDirectory::ser(&SlruSegmentDirectory::default())?.into();
let empty_dir = Value::Image(buf);
self.put(slru_dir_to_key(SlruKind::Clog), empty_dir.clone());
self.pending_directory_entries
.push((DirectoryKind::SlruSegment(SlruKind::Clog), 0));
self.put(
slru_dir_to_key(SlruKind::MultiXactMembers),
empty_dir.clone(),
);
self.pending_directory_entries
.push((DirectoryKind::SlruSegment(SlruKind::Clog), 0));
self.put(slru_dir_to_key(SlruKind::MultiXactOffsets), empty_dir);
self.pending_directory_entries
.push((DirectoryKind::SlruSegment(SlruKind::MultiXactOffsets), 0));
Ok(())
}
@@ -1017,6 +1038,7 @@ impl<'a> DatadirModification<'a> {
let buf = RelDirectory::ser(&RelDirectory {
rels: HashSet::new(),
})?;
self.pending_directory_entries.push((DirectoryKind::Rel, 0));
self.put(
rel_dir_to_key(spcnode, dbnode),
Value::Image(Bytes::from(buf)),
@@ -1039,6 +1061,8 @@ impl<'a> DatadirModification<'a> {
if !dir.xids.insert(xid) {
anyhow::bail!("twophase file for xid {} already exists", xid);
}
self.pending_directory_entries
.push((DirectoryKind::TwoPhase, dir.xids.len()));
self.put(
TWOPHASEDIR_KEY,
Value::Image(Bytes::from(TwoPhaseDirectory::ser(&dir)?)),
@@ -1074,6 +1098,8 @@ impl<'a> DatadirModification<'a> {
let mut dir = DbDirectory::des(&buf)?;
if dir.dbdirs.remove(&(spcnode, dbnode)).is_some() {
let buf = DbDirectory::ser(&dir)?;
self.pending_directory_entries
.push((DirectoryKind::Db, dir.dbdirs.len()));
self.put(DBDIR_KEY, Value::Image(buf.into()));
} else {
warn!(
@@ -1111,6 +1137,8 @@ impl<'a> DatadirModification<'a> {
// Didn't exist. Update dbdir
dbdir.dbdirs.insert((rel.spcnode, rel.dbnode), false);
let buf = DbDirectory::ser(&dbdir).context("serialize db")?;
self.pending_directory_entries
.push((DirectoryKind::Db, dbdir.dbdirs.len()));
self.put(DBDIR_KEY, Value::Image(buf.into()));
// and create the RelDirectory
@@ -1125,6 +1153,10 @@ impl<'a> DatadirModification<'a> {
if !rel_dir.rels.insert((rel.relnode, rel.forknum)) {
return Err(RelationError::AlreadyExists);
}
self.pending_directory_entries
.push((DirectoryKind::Rel, rel_dir.rels.len()));
self.put(
rel_dir_key,
Value::Image(Bytes::from(
@@ -1216,6 +1248,9 @@ impl<'a> DatadirModification<'a> {
let buf = self.get(dir_key, ctx).await?;
let mut dir = RelDirectory::des(&buf)?;
self.pending_directory_entries
.push((DirectoryKind::Rel, dir.rels.len()));
if dir.rels.remove(&(rel.relnode, rel.forknum)) {
self.put(dir_key, Value::Image(Bytes::from(RelDirectory::ser(&dir)?)));
} else {
@@ -1251,6 +1286,8 @@ impl<'a> DatadirModification<'a> {
if !dir.segments.insert(segno) {
anyhow::bail!("slru segment {kind:?}/{segno} already exists");
}
self.pending_directory_entries
.push((DirectoryKind::SlruSegment(kind), dir.segments.len()));
self.put(
dir_key,
Value::Image(Bytes::from(SlruSegmentDirectory::ser(&dir)?)),
@@ -1295,6 +1332,8 @@ impl<'a> DatadirModification<'a> {
if !dir.segments.remove(&segno) {
warn!("slru segment {:?}/{} does not exist", kind, segno);
}
self.pending_directory_entries
.push((DirectoryKind::SlruSegment(kind), dir.segments.len()));
self.put(
dir_key,
Value::Image(Bytes::from(SlruSegmentDirectory::ser(&dir)?)),
@@ -1325,6 +1364,8 @@ impl<'a> DatadirModification<'a> {
if !dir.xids.remove(&xid) {
warn!("twophase file for xid {} does not exist", xid);
}
self.pending_directory_entries
.push((DirectoryKind::TwoPhase, dir.xids.len()));
self.put(
TWOPHASEDIR_KEY,
Value::Image(Bytes::from(TwoPhaseDirectory::ser(&dir)?)),
@@ -1340,6 +1381,8 @@ impl<'a> DatadirModification<'a> {
let buf = AuxFilesDirectory::ser(&AuxFilesDirectory {
files: HashMap::new(),
})?;
self.pending_directory_entries
.push((DirectoryKind::AuxFiles, 0));
self.put(AUX_FILES_KEY, Value::Image(Bytes::from(buf)));
Ok(())
}
@@ -1350,28 +1393,76 @@ impl<'a> DatadirModification<'a> {
content: &[u8],
ctx: &RequestContext,
) -> anyhow::Result<()> {
let mut dir = match self.get(AUX_FILES_KEY, ctx).await {
Ok(buf) => AuxFilesDirectory::des(&buf)?,
Err(e) => {
// This is expected: historical databases do not have the key.
debug!("Failed to get info about AUX files: {}", e);
AuxFilesDirectory {
files: HashMap::new(),
let file_path = path.to_string();
let content = if content.is_empty() {
None
} else {
Some(Bytes::copy_from_slice(content))
};
let dir = if let Some(mut dir) = self.pending_aux_files.take() {
// We already updated aux files in `self`: emit a delta and update our latest value
self.put(
AUX_FILES_KEY,
Value::WalRecord(NeonWalRecord::AuxFile {
file_path: file_path.clone(),
content: content.clone(),
}),
);
dir.upsert(file_path, content);
dir
} else {
// Check if the AUX_FILES_KEY is initialized
match self.get(AUX_FILES_KEY, ctx).await {
Ok(dir_bytes) => {
let mut dir = AuxFilesDirectory::des(&dir_bytes)?;
// Key is already set, we may append a delta
self.put(
AUX_FILES_KEY,
Value::WalRecord(NeonWalRecord::AuxFile {
file_path: file_path.clone(),
content: content.clone(),
}),
);
dir.upsert(file_path, content);
dir
}
Err(
e @ (PageReconstructError::AncestorStopping(_)
| PageReconstructError::Cancelled
| PageReconstructError::AncestorLsnTimeout(_)),
) => {
// Important that we do not interpret a shutdown error as "not found" and thereby
// reset the map.
return Err(e.into());
}
// FIXME: PageReconstructError doesn't have an explicit variant for key-not-found, so
// we are assuming that all _other_ possible errors represents a missing key. If some
// other error occurs, we may incorrectly reset the map of aux files.
Err(PageReconstructError::Other(_) | PageReconstructError::WalRedo(_)) => {
// Key is missing, we must insert an image as the basis for subsequent deltas.
let mut dir = AuxFilesDirectory {
files: HashMap::new(),
};
dir.upsert(file_path, content);
self.put(
AUX_FILES_KEY,
Value::Image(Bytes::from(
AuxFilesDirectory::ser(&dir).context("serialize")?,
)),
);
dir
}
}
};
let path = path.to_string();
if content.is_empty() {
dir.files.remove(&path);
} else {
dir.files.insert(path, Bytes::copy_from_slice(content));
}
self.put(
AUX_FILES_KEY,
Value::Image(Bytes::from(
AuxFilesDirectory::ser(&dir).context("serialize")?,
)),
);
self.pending_directory_entries
.push((DirectoryKind::AuxFiles, dir.files.len()));
self.pending_aux_files = Some(dir);
Ok(())
}
@@ -1427,6 +1518,10 @@ impl<'a> DatadirModification<'a> {
self.pending_nblocks = 0;
}
for (kind, count) in std::mem::take(&mut self.pending_directory_entries) {
writer.update_directory_entries_count(kind, count as u64);
}
Ok(())
}
@@ -1464,6 +1559,10 @@ impl<'a> DatadirModification<'a> {
writer.update_current_logical_size(pending_nblocks * i64::from(BLCKSZ));
}
for (kind, count) in std::mem::take(&mut self.pending_directory_entries) {
writer.update_directory_entries_count(kind, count as u64);
}
Ok(())
}
@@ -1573,8 +1672,18 @@ struct RelDirectory {
}
#[derive(Debug, Serialize, Deserialize, Default)]
struct AuxFilesDirectory {
files: HashMap<String, Bytes>,
pub(crate) struct AuxFilesDirectory {
pub(crate) files: HashMap<String, Bytes>,
}
impl AuxFilesDirectory {
pub(crate) fn upsert(&mut self, key: String, value: Option<Bytes>) {
if let Some(value) = value {
self.files.insert(key, value);
} else {
self.files.remove(&key);
}
}
}
#[derive(Debug, Serialize, Deserialize)]
@@ -1588,13 +1697,82 @@ struct SlruSegmentDirectory {
segments: HashSet<u32>,
}
#[derive(Copy, Clone, PartialEq, Eq, Debug, enum_map::Enum)]
#[repr(u8)]
pub(crate) enum DirectoryKind {
Db,
TwoPhase,
Rel,
AuxFiles,
SlruSegment(SlruKind),
}
impl DirectoryKind {
pub(crate) const KINDS_NUM: usize = <DirectoryKind as Enum>::LENGTH;
pub(crate) fn offset(&self) -> usize {
self.into_usize()
}
}
static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; BLCKSZ as usize]);
#[allow(clippy::bool_assert_comparison)]
#[cfg(test)]
mod tests {
//use super::repo_harness::*;
//use super::*;
use hex_literal::hex;
use utils::id::TimelineId;
use super::*;
use crate::{tenant::harness::TenantHarness, DEFAULT_PG_VERSION};
/// Test a round trip of aux file updates, from DatadirModification to reading back from the Timeline
#[tokio::test]
async fn aux_files_round_trip() -> anyhow::Result<()> {
let name = "aux_files_round_trip";
let harness = TenantHarness::create(name)?;
pub const TIMELINE_ID: TimelineId =
TimelineId::from_array(hex!("11223344556677881122334455667788"));
let (tenant, ctx) = harness.load().await;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION, &ctx)
.await?;
let tline = tline.raw_timeline().unwrap();
// First modification: insert two keys
let mut modification = tline.begin_modification(Lsn(0x1000));
modification.put_file("foo/bar1", b"content1", &ctx).await?;
modification.set_lsn(Lsn(0x1008))?;
modification.put_file("foo/bar2", b"content2", &ctx).await?;
modification.commit(&ctx).await?;
let expect_1008 = HashMap::from([
("foo/bar1".to_string(), Bytes::from_static(b"content1")),
("foo/bar2".to_string(), Bytes::from_static(b"content2")),
]);
let readback = tline.list_aux_files(Lsn(0x1008), &ctx).await?;
assert_eq!(readback, expect_1008);
// Second modification: update one key, remove the other
let mut modification = tline.begin_modification(Lsn(0x2000));
modification.put_file("foo/bar1", b"content3", &ctx).await?;
modification.set_lsn(Lsn(0x2008))?;
modification.put_file("foo/bar2", b"", &ctx).await?;
modification.commit(&ctx).await?;
let expect_2008 =
HashMap::from([("foo/bar1".to_string(), Bytes::from_static(b"content3"))]);
let readback = tline.list_aux_files(Lsn(0x2008), &ctx).await?;
assert_eq!(readback, expect_2008);
// Reading back in time works
let readback = tline.list_aux_files(Lsn(0x1008), &ctx).await?;
assert_eq!(readback, expect_1008);
Ok(())
}
/*
fn assert_current_logical_size<R: Repository>(timeline: &DatadirTimeline<R>, lsn: Lsn) {

View File

@@ -30,10 +30,6 @@
//! only a single tenant or timeline.
//!
// Clippy 1.60 incorrectly complains about the tokio::task_local!() macro.
// Silence it. See https://github.com/rust-lang/rust-clippy/issues/9224.
#![allow(clippy::declare_interior_mutable_const)]
use std::collections::HashMap;
use std::fmt;
use std::future::Future;
@@ -192,6 +188,7 @@ task_local! {
serde::Serialize,
serde::Deserialize,
strum_macros::IntoStaticStr,
strum_macros::EnumString,
)]
pub enum TaskKind {
// Pageserver startup, i.e., `main`
@@ -312,7 +309,6 @@ struct MutableTaskState {
}
struct PageServerTask {
#[allow(dead_code)] // unused currently
task_id: PageserverTaskId,
kind: TaskKind,

File diff suppressed because it is too large Load Diff

View File

@@ -11,6 +11,9 @@
//! len < 128: 0XXXXXXX
//! len >= 128: 1XXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
//!
use bytes::{BufMut, BytesMut};
use tokio_epoll_uring::{BoundedBuf, Slice};
use crate::context::RequestContext;
use crate::page_cache::PAGE_SZ;
use crate::tenant::block_io::BlockCursor;
@@ -100,6 +103,8 @@ pub struct BlobWriter<const BUFFERED: bool> {
offset: u64,
/// A buffer to save on write calls, only used if BUFFERED=true
buf: Vec<u8>,
/// We do tiny writes for the length headers; they need to be in an owned buffer;
io_buf: Option<BytesMut>,
}
impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
@@ -108,6 +113,7 @@ impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
inner,
offset: start_offset,
buf: Vec::with_capacity(Self::CAPACITY),
io_buf: Some(BytesMut::new()),
}
}
@@ -117,21 +123,31 @@ impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
const CAPACITY: usize = if BUFFERED { PAGE_SZ } else { 0 };
#[inline(always)]
/// Writes the given buffer directly to the underlying `VirtualFile`.
/// You need to make sure that the internal buffer is empty, otherwise
/// data will be written in wrong order.
async fn write_all_unbuffered(&mut self, src_buf: &[u8]) -> Result<(), Error> {
self.inner.write_all(src_buf).await?;
self.offset += src_buf.len() as u64;
Ok(())
#[inline(always)]
async fn write_all_unbuffered<B: BoundedBuf>(
&mut self,
src_buf: B,
) -> (B::Buf, Result<(), Error>) {
let (src_buf, res) = self.inner.write_all(src_buf).await;
let nbytes = match res {
Ok(nbytes) => nbytes,
Err(e) => return (src_buf, Err(e)),
};
self.offset += nbytes as u64;
(src_buf, Ok(()))
}
#[inline(always)]
/// Flushes the internal buffer to the underlying `VirtualFile`.
pub async fn flush_buffer(&mut self) -> Result<(), Error> {
self.inner.write_all(&self.buf).await?;
self.buf.clear();
let buf = std::mem::take(&mut self.buf);
let (mut buf, res) = self.inner.write_all(buf).await;
res?;
buf.clear();
self.buf = buf;
Ok(())
}
@@ -146,62 +162,91 @@ impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
}
/// Internal, possibly buffered, write function
async fn write_all(&mut self, mut src_buf: &[u8]) -> Result<(), Error> {
async fn write_all<B: BoundedBuf>(&mut self, src_buf: B) -> (B::Buf, Result<(), Error>) {
if !BUFFERED {
assert!(self.buf.is_empty());
self.write_all_unbuffered(src_buf).await?;
return Ok(());
return self.write_all_unbuffered(src_buf).await;
}
let remaining = Self::CAPACITY - self.buf.len();
let src_buf_len = src_buf.bytes_init();
if src_buf_len == 0 {
return (Slice::into_inner(src_buf.slice_full()), Ok(()));
}
let mut src_buf = src_buf.slice(0..src_buf_len);
// First try to copy as much as we can into the buffer
if remaining > 0 {
let copied = self.write_into_buffer(src_buf);
src_buf = &src_buf[copied..];
let copied = self.write_into_buffer(&src_buf);
src_buf = src_buf.slice(copied..);
}
// Then, if the buffer is full, flush it out
if self.buf.len() == Self::CAPACITY {
self.flush_buffer().await?;
if let Err(e) = self.flush_buffer().await {
return (Slice::into_inner(src_buf), Err(e));
}
}
// Finally, write the tail of src_buf:
// If it wholly fits into the buffer without
// completely filling it, then put it there.
// If not, write it out directly.
if !src_buf.is_empty() {
let src_buf = if !src_buf.is_empty() {
assert_eq!(self.buf.len(), 0);
if src_buf.len() < Self::CAPACITY {
let copied = self.write_into_buffer(src_buf);
let copied = self.write_into_buffer(&src_buf);
// We just verified above that src_buf fits into our internal buffer.
assert_eq!(copied, src_buf.len());
Slice::into_inner(src_buf)
} else {
self.write_all_unbuffered(src_buf).await?;
let (src_buf, res) = self.write_all_unbuffered(src_buf).await;
if let Err(e) = res {
return (src_buf, Err(e));
}
src_buf
}
}
Ok(())
} else {
Slice::into_inner(src_buf)
};
(src_buf, Ok(()))
}
/// Write a blob of data. Returns the offset that it was written to,
/// which can be used to retrieve the data later.
pub async fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error> {
pub async fn write_blob<B: BoundedBuf>(&mut self, srcbuf: B) -> (B::Buf, Result<u64, Error>) {
let offset = self.offset;
if srcbuf.len() < 128 {
// Short blob. Write a 1-byte length header
let len_buf = srcbuf.len() as u8;
self.write_all(&[len_buf]).await?;
} else {
// Write a 4-byte length header
if srcbuf.len() > 0x7fff_ffff {
return Err(Error::new(
ErrorKind::Other,
format!("blob too large ({} bytes)", srcbuf.len()),
));
let len = srcbuf.bytes_init();
let mut io_buf = self.io_buf.take().expect("we always put it back below");
io_buf.clear();
let (io_buf, hdr_res) = async {
if len < 128 {
// Short blob. Write a 1-byte length header
io_buf.put_u8(len as u8);
self.write_all(io_buf).await
} else {
// Write a 4-byte length header
if len > 0x7fff_ffff {
return (
io_buf,
Err(Error::new(
ErrorKind::Other,
format!("blob too large ({} bytes)", len),
)),
);
}
let mut len_buf = (len as u32).to_be_bytes();
len_buf[0] |= 0x80;
io_buf.extend_from_slice(&len_buf[..]);
self.write_all(io_buf).await
}
let mut len_buf = ((srcbuf.len()) as u32).to_be_bytes();
len_buf[0] |= 0x80;
self.write_all(&len_buf).await?;
}
self.write_all(srcbuf).await?;
Ok(offset)
.await;
self.io_buf = Some(io_buf);
match hdr_res {
Ok(_) => (),
Err(e) => return (Slice::into_inner(srcbuf.slice(..)), Err(e)),
}
let (srcbuf, res) = self.write_all(srcbuf).await;
(srcbuf, res.map(|_| offset))
}
}
@@ -248,12 +293,14 @@ mod tests {
let file = VirtualFile::create(pathbuf.as_path()).await?;
let mut wtr = BlobWriter::<BUFFERED>::new(file, 0);
for blob in blobs.iter() {
let offs = wtr.write_blob(blob).await?;
let (_, res) = wtr.write_blob(blob.clone()).await;
let offs = res?;
offsets.push(offs);
}
// Write out one page worth of zeros so that we can
// read again with read_blk
let offs = wtr.write_blob(&vec![0; PAGE_SZ]).await?;
let (_, res) = wtr.write_blob(vec![0; PAGE_SZ]).await;
let offs = res?;
println!("Writing final blob at offs={offs}");
wtr.flush_buffer().await?;
}

View File

@@ -9,8 +9,8 @@
//! may lead to a data loss.
//!
use anyhow::bail;
use pageserver_api::models;
use pageserver_api::models::EvictionPolicy;
use pageserver_api::models::{self, ThrottleConfig};
use pageserver_api::shard::{ShardCount, ShardIdentity, ShardNumber, ShardStripeSize};
use serde::de::IntoDeserializer;
use serde::{Deserialize, Serialize};
@@ -251,7 +251,7 @@ impl LocationConf {
} else {
ShardIdentity::new(
ShardNumber(conf.shard_number),
ShardCount(conf.shard_count),
ShardCount::new(conf.shard_count),
ShardStripeSize(conf.shard_stripe_size),
)?
};
@@ -285,7 +285,7 @@ impl Default for LocationConf {
///
/// For storing and transmitting individual tenant's configuration, see
/// TenantConfOpt.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub struct TenantConf {
// Flush out an inmemory layer, if it's holding WAL older than this
// This puts a backstop on how much WAL needs to be re-digested if the
@@ -348,11 +348,13 @@ pub struct TenantConf {
/// If true then SLRU segments are dowloaded on demand, if false SLRU segments are included in basebackup
pub lazy_slru_download: bool,
pub timeline_get_throttle: pageserver_api::models::ThrottleConfig,
}
/// Same as TenantConf, but this struct preserves the information about
/// which parameters are set and which are not.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, Default)]
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize, Default)]
pub struct TenantConfOpt {
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(default)]
@@ -437,6 +439,9 @@ pub struct TenantConfOpt {
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(default)]
pub lazy_slru_download: Option<bool>,
#[serde(skip_serializing_if = "Option::is_none")]
pub timeline_get_throttle: Option<pageserver_api::models::ThrottleConfig>,
}
impl TenantConfOpt {
@@ -485,6 +490,10 @@ impl TenantConfOpt {
lazy_slru_download: self
.lazy_slru_download
.unwrap_or(global_conf.lazy_slru_download),
timeline_get_throttle: self
.timeline_get_throttle
.clone()
.unwrap_or(global_conf.timeline_get_throttle),
}
}
}
@@ -524,6 +533,7 @@ impl Default for TenantConf {
gc_feedback: false,
heatmap_period: Duration::ZERO,
lazy_slru_download: false,
timeline_get_throttle: crate::tenant::throttle::Config::disabled(),
}
}
}
@@ -596,6 +606,7 @@ impl From<TenantConfOpt> for models::TenantConfig {
gc_feedback: value.gc_feedback,
heatmap_period: value.heatmap_period.map(humantime),
lazy_slru_download: value.lazy_slru_download,
timeline_get_throttle: value.timeline_get_throttle.map(ThrottleConfig::from),
}
}
}

View File

@@ -3,10 +3,10 @@ use std::sync::Arc;
use anyhow::Context;
use camino::{Utf8Path, Utf8PathBuf};
use pageserver_api::{models::TenantState, shard::TenantShardId};
use remote_storage::{GenericRemoteStorage, RemotePath};
use remote_storage::{GenericRemoteStorage, RemotePath, TimeoutOrCancel};
use tokio::sync::OwnedMutexGuard;
use tokio_util::sync::CancellationToken;
use tracing::{error, instrument, Instrument, Span};
use tracing::{error, instrument, Instrument};
use utils::{backoff, completion, crashsafe, fs_ext, id::TimelineId};
@@ -84,17 +84,17 @@ async fn create_remote_delete_mark(
let data = bytes::Bytes::from_static(data);
let stream = futures::stream::once(futures::future::ready(Ok(data)));
remote_storage
.upload(stream, 0, &remote_mark_path, None)
.upload(stream, 0, &remote_mark_path, None, cancel)
.await
},
|_e| false,
TimeoutOrCancel::caused_by_cancel,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"mark_upload",
cancel,
)
.await
.ok_or_else(|| anyhow::anyhow!("Cancelled"))
.ok_or_else(|| anyhow::Error::new(TimeoutOrCancel::Cancel))
.and_then(|x| x)
.context("mark_upload")?;
@@ -184,15 +184,15 @@ async fn remove_tenant_remote_delete_mark(
if let Some(remote_storage) = remote_storage {
let path = remote_tenant_delete_mark_path(conf, tenant_shard_id)?;
backoff::retry(
|| async { remote_storage.delete(&path).await },
|_e| false,
|| async { remote_storage.delete(&path, cancel).await },
TimeoutOrCancel::caused_by_cancel,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"remove_tenant_remote_delete_mark",
cancel,
)
.await
.ok_or_else(|| anyhow::anyhow!("Cancelled"))
.ok_or_else(|| anyhow::Error::new(TimeoutOrCancel::Cancel))
.and_then(|x| x)
.context("remove_tenant_remote_delete_mark")?;
}
@@ -496,11 +496,7 @@ impl DeleteTenantFlow {
};
Ok(())
}
.instrument({
let span = tracing::info_span!(parent: None, "delete_tenant", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug());
span.follows_from(Span::current());
span
}),
.instrument(tracing::info_span!(parent: None, "delete_tenant", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug())),
);
}

View File

@@ -36,7 +36,6 @@ use crate::{
pub const VALUE_SZ: usize = 5;
pub const MAX_VALUE: u64 = 0x007f_ffff_ffff;
#[allow(dead_code)]
pub const PAGE_SZ: usize = 8192;
#[derive(Clone, Copy, Debug)]

View File

@@ -6,6 +6,7 @@ use crate::context::RequestContext;
use crate::page_cache::{self, PAGE_SZ};
use crate::tenant::block_io::{BlockCursor, BlockLease, BlockReader};
use crate::virtual_file::{self, VirtualFile};
use bytes::BytesMut;
use camino::Utf8PathBuf;
use pageserver_api::shard::TenantShardId;
use std::cmp::min;
@@ -26,7 +27,10 @@ pub struct EphemeralFile {
/// An ephemeral file is append-only.
/// We keep the last page, which can still be modified, in [`Self::mutable_tail`].
/// The other pages, which can no longer be modified, are accessed through the page cache.
mutable_tail: [u8; PAGE_SZ],
///
/// None <=> IO is ongoing.
/// Size is fixed to PAGE_SZ at creation time and must not be changed.
mutable_tail: Option<BytesMut>,
}
impl EphemeralFile {
@@ -60,7 +64,7 @@ impl EphemeralFile {
_timeline_id: timeline_id,
file,
len: 0,
mutable_tail: [0u8; PAGE_SZ],
mutable_tail: Some(BytesMut::zeroed(PAGE_SZ)),
})
}
@@ -103,7 +107,13 @@ impl EphemeralFile {
};
} else {
debug_assert_eq!(blknum as u64, self.len / PAGE_SZ as u64);
Ok(BlockLease::EphemeralFileMutableTail(&self.mutable_tail))
Ok(BlockLease::EphemeralFileMutableTail(
self.mutable_tail
.as_deref()
.expect("we're not doing IO, it must be Some()")
.try_into()
.expect("we ensure that it's always PAGE_SZ"),
))
}
}
@@ -135,21 +145,27 @@ impl EphemeralFile {
) -> Result<(), io::Error> {
let mut src_remaining = src;
while !src_remaining.is_empty() {
let dst_remaining = &mut self.ephemeral_file.mutable_tail[self.off..];
let dst_remaining = &mut self
.ephemeral_file
.mutable_tail
.as_deref_mut()
.expect("IO is not yet ongoing")[self.off..];
let n = min(dst_remaining.len(), src_remaining.len());
dst_remaining[..n].copy_from_slice(&src_remaining[..n]);
self.off += n;
src_remaining = &src_remaining[n..];
if self.off == PAGE_SZ {
match self
let mutable_tail = std::mem::take(&mut self.ephemeral_file.mutable_tail)
.expect("IO is not yet ongoing");
let (mutable_tail, res) = self
.ephemeral_file
.file
.write_all_at(
&self.ephemeral_file.mutable_tail,
self.blknum as u64 * PAGE_SZ as u64,
)
.await
{
.write_all_at(mutable_tail, self.blknum as u64 * PAGE_SZ as u64)
.await;
// TODO: If we panic before we can put the mutable_tail back, subsequent calls will fail.
// I.e., the IO isn't retryable if we panic.
self.ephemeral_file.mutable_tail = Some(mutable_tail);
match res {
Ok(_) => {
// Pre-warm the page cache with what we just wrote.
// This isn't necessary for coherency/correctness, but it's how we've always done it.
@@ -169,7 +185,12 @@ impl EphemeralFile {
Ok(page_cache::ReadBufResult::NotFound(mut write_guard)) => {
let buf: &mut [u8] = write_guard.deref_mut();
debug_assert_eq!(buf.len(), PAGE_SZ);
buf.copy_from_slice(&self.ephemeral_file.mutable_tail);
buf.copy_from_slice(
self.ephemeral_file
.mutable_tail
.as_deref()
.expect("IO is not ongoing"),
);
let _ = write_guard.mark_valid();
// pre-warm successful
}
@@ -181,7 +202,11 @@ impl EphemeralFile {
// Zero the buffer for re-use.
// Zeroing is critical for correcntess because the write_blob code below
// and similarly read_blk expect zeroed pages.
self.ephemeral_file.mutable_tail.fill(0);
self.ephemeral_file
.mutable_tail
.as_deref_mut()
.expect("IO is not ongoing")
.fill(0);
// This block is done, move to next one.
self.blknum += 1;
self.off = 0;

View File

@@ -279,7 +279,7 @@ pub async fn save_metadata(
let path = conf.metadata_path(tenant_shard_id, timeline_id);
let temp_path = path_with_suffix_extension(&path, TEMP_FILE_SUFFIX);
let metadata_bytes = data.to_bytes().context("serialize metadata")?;
VirtualFile::crashsafe_overwrite(&path, &temp_path, &metadata_bytes)
VirtualFile::crashsafe_overwrite(&path, &temp_path, metadata_bytes)
.await
.context("write metadata")?;
Ok(())
@@ -294,17 +294,6 @@ pub enum LoadMetadataError {
Decode(#[from] anyhow::Error),
}
pub fn load_metadata(
conf: &'static PageServerConf,
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
) -> Result<TimelineMetadata, LoadMetadataError> {
let metadata_path = conf.metadata_path(tenant_shard_id, timeline_id);
let metadata_bytes = std::fs::read(metadata_path)?;
Ok(TimelineMetadata::from_bytes(&metadata_bytes)?)
}
#[cfg(test)]
mod tests {
use super::*;

View File

@@ -2,6 +2,7 @@
//! page server.
use camino::{Utf8DirEntry, Utf8Path, Utf8PathBuf};
use futures::stream::StreamExt;
use itertools::Itertools;
use pageserver_api::key::Key;
use pageserver_api::models::ShardParameters;
@@ -31,6 +32,7 @@ use crate::control_plane_client::{
ControlPlaneClient, ControlPlaneGenerationsApi, RetryForeverError,
};
use crate::deletion_queue::DeletionQueueClient;
use crate::http::routes::ACTIVE_TENANT_TIMEOUT;
use crate::metrics::{TENANT, TENANT_MANAGER as METRICS};
use crate::task_mgr::{self, TaskKind};
use crate::tenant::config::{
@@ -483,7 +485,7 @@ pub async fn init_tenant_mgr(
TenantSlot::Secondary(SecondaryTenant::new(
tenant_shard_id,
location_conf.shard,
location_conf.tenant_conf,
location_conf.tenant_conf.clone(),
&SecondaryLocationConfig { warm: false },
)),
);
@@ -793,7 +795,7 @@ pub(crate) async fn set_new_tenant_config(
info!("configuring tenant {tenant_id}");
let tenant = get_tenant(tenant_shard_id, true)?;
if tenant.tenant_shard_id().shard_count > ShardCount(0) {
if !tenant.tenant_shard_id().shard_count.is_unsharded() {
// Note that we use ShardParameters::default below.
return Err(SetNewTenantConfigError::Other(anyhow::anyhow!(
"This API may only be used on single-sharded tenants, use the /location_config API for sharded tenants"
@@ -804,7 +806,7 @@ pub(crate) async fn set_new_tenant_config(
// API to use is the location_config/ endpoint, which lets the caller provide
// the full LocationConf.
let location_conf = LocationConf::attached_single(
new_tenant_conf,
new_tenant_conf.clone(),
tenant.generation,
&ShardParameters::default(),
);
@@ -1375,7 +1377,7 @@ impl TenantManager {
result
}
#[instrument(skip_all, fields(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), new_shard_count=%new_shard_count.0))]
#[instrument(skip_all, fields(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), new_shard_count=%new_shard_count.literal()))]
pub(crate) async fn shard_split(
&self,
tenant_shard_id: TenantShardId,
@@ -1385,11 +1387,10 @@ impl TenantManager {
let tenant = get_tenant(tenant_shard_id, true)?;
// Plan: identify what the new child shards will be
let effective_old_shard_count = std::cmp::max(tenant_shard_id.shard_count.0, 1);
if new_shard_count <= ShardCount(effective_old_shard_count) {
if new_shard_count.count() <= tenant_shard_id.shard_count.count() {
anyhow::bail!("Requested shard count is not an increase");
}
let expansion_factor = new_shard_count.0 / effective_old_shard_count;
let expansion_factor = new_shard_count.count() / tenant_shard_id.shard_count.count();
if !expansion_factor.is_power_of_two() {
anyhow::bail!("Requested split is not a power of two");
}
@@ -1439,8 +1440,10 @@ impl TenantManager {
}
};
// TODO: hardlink layers from the parent into the child shard directories so that they don't immediately re-download
// TODO: erase the dentries from the parent
// Optimization: hardlink layers from the parent into the children, so that they don't have to
// re-download & duplicate the data referenced in their initial IndexPart
self.shard_split_hardlink(parent, child_shards.clone())
.await?;
// Take a snapshot of where the parent's WAL ingest had got to: we will wait for
// child shards to reach this point.
@@ -1464,7 +1467,7 @@ impl TenantManager {
attach_mode: AttachmentMode::Single,
}),
shard: child_shard_identity,
tenant_conf: parent_tenant_conf,
tenant_conf: parent_tenant_conf.clone(),
};
self.upsert_location(
@@ -1479,13 +1482,24 @@ impl TenantManager {
// Phase 4: wait for child chards WAL ingest to catch up to target LSN
for child_shard_id in &child_shards {
let child_shard_id = *child_shard_id;
let child_shard = {
let locked = TENANTS.read().unwrap();
let peek_slot =
tenant_map_peek_slot(&locked, child_shard_id, TenantSlotPeekMode::Read)?;
tenant_map_peek_slot(&locked, &child_shard_id, TenantSlotPeekMode::Read)?;
peek_slot.and_then(|s| s.get_attached()).cloned()
};
if let Some(t) = child_shard {
// Wait for the child shard to become active: this should be very quick because it only
// has to download the index_part that we just uploaded when creating it.
if let Err(e) = t.wait_to_become_active(ACTIVE_TENANT_TIMEOUT).await {
// This is not fatal: we have durably created the child shard. It just makes the
// split operation less seamless for clients, as we will may detach the parent
// shard before the child shards are fully ready to serve requests.
tracing::warn!("Failed to wait for shard {child_shard_id} to activate: {e}");
continue;
}
let timelines = t.timelines.lock().unwrap().clone();
for timeline in timelines.values() {
let Some(target_lsn) = target_lsns.get(&timeline.timeline_id) else {
@@ -1517,7 +1531,7 @@ impl TenantManager {
}
}
// Phase 5: Shut down the parent shard.
// Phase 5: Shut down the parent shard, and erase it from disk
let (_guard, progress) = completion::channel();
match parent.shutdown(progress, false).await {
Ok(()) => {}
@@ -1525,6 +1539,24 @@ impl TenantManager {
other.wait().await;
}
}
let local_tenant_directory = self.conf.tenant_path(&tenant_shard_id);
let tmp_path = safe_rename_tenant_dir(&local_tenant_directory)
.await
.with_context(|| format!("local tenant directory {local_tenant_directory:?} rename"))?;
task_mgr::spawn(
task_mgr::BACKGROUND_RUNTIME.handle(),
TaskKind::MgmtRequest,
None,
None,
"tenant_files_delete",
false,
async move {
fs::remove_dir_all(tmp_path.as_path())
.await
.with_context(|| format!("tenant directory {:?} deletion", tmp_path))
},
);
parent_slot_guard.drop_old_value()?;
// Phase 6: Release the InProgress on the parent shard
@@ -1532,6 +1564,130 @@ impl TenantManager {
Ok(child_shards)
}
/// Part of [`Self::shard_split`]: hard link parent shard layers into child shards, as an optimization
/// to avoid the children downloading them again.
///
/// For each resident layer in the parent shard, we will hard link it into all of the child shards.
async fn shard_split_hardlink(
&self,
parent_shard: &Tenant,
child_shards: Vec<TenantShardId>,
) -> anyhow::Result<()> {
debug_assert_current_span_has_tenant_id();
let parent_path = self.conf.tenant_path(parent_shard.get_tenant_shard_id());
let (parent_timelines, parent_layers) = {
let mut parent_layers = Vec::new();
let timelines = parent_shard.timelines.lock().unwrap().clone();
let parent_timelines = timelines.keys().cloned().collect::<Vec<_>>();
for timeline in timelines.values() {
let timeline_layers = timeline
.layers
.read()
.await
.resident_layers()
.collect::<Vec<_>>()
.await;
for layer in timeline_layers {
let relative_path = layer
.local_path()
.strip_prefix(&parent_path)
.context("Removing prefix from parent layer path")?;
parent_layers.push(relative_path.to_owned());
}
}
debug_assert!(
!parent_layers.is_empty(),
"shutdown cannot empty the layermap"
);
(parent_timelines, parent_layers)
};
let mut child_prefixes = Vec::new();
let mut create_dirs = Vec::new();
for child in child_shards {
let child_prefix = self.conf.tenant_path(&child);
create_dirs.push(child_prefix.clone());
create_dirs.extend(
parent_timelines
.iter()
.map(|t| self.conf.timeline_path(&child, t)),
);
child_prefixes.push(child_prefix);
}
// Since we will do a large number of small filesystem metadata operations, batch them into
// spawn_blocking calls rather than doing each one as a tokio::fs round-trip.
let jh = tokio::task::spawn_blocking(move || -> anyhow::Result<usize> {
for dir in &create_dirs {
if let Err(e) = std::fs::create_dir_all(dir) {
// Ignore AlreadyExists errors, drop out on all other errors
match e.kind() {
std::io::ErrorKind::AlreadyExists => {}
_ => {
return Err(anyhow::anyhow!(e).context(format!("Creating {dir}")));
}
}
}
}
for child_prefix in child_prefixes {
for relative_layer in &parent_layers {
let parent_path = parent_path.join(relative_layer);
let child_path = child_prefix.join(relative_layer);
if let Err(e) = std::fs::hard_link(&parent_path, &child_path) {
match e.kind() {
std::io::ErrorKind::AlreadyExists => {}
std::io::ErrorKind::NotFound => {
tracing::info!(
"Layer {} not found during hard-linking, evicted during split?",
relative_layer
);
}
_ => {
return Err(anyhow::anyhow!(e).context(format!(
"Hard linking {relative_layer} into {child_prefix}"
)))
}
}
}
}
}
// Durability is not required for correctness, but if we crashed during split and
// then came restarted with empty timeline dirs, it would be very inefficient to
// re-populate from remote storage.
for dir in create_dirs {
if let Err(e) = crashsafe::fsync(&dir) {
// Something removed a newly created timeline dir out from underneath us? Extremely
// unexpected, but not worth panic'ing over as this whole function is just an
// optimization.
tracing::warn!("Failed to fsync directory {dir}: {e}")
}
}
Ok(parent_layers.len())
});
match jh.await {
Ok(Ok(layer_count)) => {
tracing::info!(count = layer_count, "Hard linked layers into child shards");
}
Ok(Err(e)) => {
// This is an optimization, so we tolerate failure.
tracing::warn!("Error hard-linking layers, proceeding anyway: {e}")
}
Err(e) => {
// This is something totally unexpected like a panic, so bail out.
anyhow::bail!("Error joining hard linking task: {e}");
}
}
Ok(())
}
}
#[derive(Debug, thiserror::Error)]

View File

@@ -196,14 +196,12 @@ pub(crate) use upload::upload_initdb_dir;
use utils::backoff::{
self, exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS,
};
use utils::timeout::{timeout_cancellable, TimeoutCancellableError};
use std::collections::{HashMap, VecDeque};
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Mutex};
use std::time::Duration;
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath, TimeoutOrCancel};
use std::ops::DerefMut;
use tracing::{debug, error, info, instrument, warn};
use tracing::{info_span, Instrument};
@@ -263,11 +261,6 @@ pub(crate) const INITDB_PRESERVED_PATH: &str = "initdb-preserved.tar.zst";
/// Default buffer size when interfacing with [`tokio::fs::File`].
pub(crate) const BUFFER_SIZE: usize = 32 * 1024;
/// This timeout is intended to deal with hangs in lower layers, e.g. stuck TCP flows. It is not
/// intended to be snappy enough for prompt shutdown, as we have a CancellationToken for that.
pub(crate) const UPLOAD_TIMEOUT: Duration = Duration::from_secs(120);
pub(crate) const DOWNLOAD_TIMEOUT: Duration = Duration::from_secs(120);
pub enum MaybeDeletedIndexPart {
IndexPart(IndexPart),
Deleted(IndexPart),
@@ -331,40 +324,6 @@ pub struct RemoteTimelineClient {
cancel: CancellationToken,
}
/// Wrapper for timeout_cancellable that flattens result and converts TimeoutCancellableError to anyhow.
///
/// This is a convenience for the various upload functions. In future
/// the anyhow::Error result should be replaced with a more structured type that
/// enables callers to avoid handling shutdown as an error.
async fn upload_cancellable<F>(cancel: &CancellationToken, future: F) -> anyhow::Result<()>
where
F: std::future::Future<Output = anyhow::Result<()>>,
{
match timeout_cancellable(UPLOAD_TIMEOUT, cancel, future).await {
Ok(Ok(())) => Ok(()),
Ok(Err(e)) => Err(e),
Err(TimeoutCancellableError::Timeout) => Err(anyhow::anyhow!("Timeout")),
Err(TimeoutCancellableError::Cancelled) => Err(anyhow::anyhow!("Shutting down")),
}
}
/// Wrapper for timeout_cancellable that flattens result and converts TimeoutCancellableError to DownloaDError.
async fn download_cancellable<F, R>(
cancel: &CancellationToken,
future: F,
) -> Result<R, DownloadError>
where
F: std::future::Future<Output = Result<R, DownloadError>>,
{
match timeout_cancellable(DOWNLOAD_TIMEOUT, cancel, future).await {
Ok(Ok(r)) => Ok(r),
Ok(Err(e)) => Err(e),
Err(TimeoutCancellableError::Timeout) => {
Err(DownloadError::Other(anyhow::anyhow!("Timed out")))
}
Err(TimeoutCancellableError::Cancelled) => Err(DownloadError::Cancelled),
}
}
impl RemoteTimelineClient {
///
/// Create a remote storage client for given timeline
@@ -1050,7 +1009,7 @@ impl RemoteTimelineClient {
&self.cancel,
)
.await
.ok_or_else(|| anyhow::anyhow!("Cancelled"))
.ok_or_else(|| anyhow::Error::new(TimeoutOrCancel::Cancel))
.and_then(|x| x)?;
// all good, disarm the guard and mark as success
@@ -1082,14 +1041,14 @@ impl RemoteTimelineClient {
upload::preserve_initdb_archive(&self.storage_impl, tenant_id, timeline_id, cancel)
.await
},
|_e| false,
TimeoutOrCancel::caused_by_cancel,
FAILED_DOWNLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"preserve_initdb_tar_zst",
&cancel.clone(),
)
.await
.ok_or_else(|| anyhow::anyhow!("Cancellled"))
.ok_or_else(|| anyhow::Error::new(TimeoutOrCancel::Cancel))
.and_then(|x| x)
.context("backing up initdb archive")?;
Ok(())
@@ -1151,7 +1110,7 @@ impl RemoteTimelineClient {
let remaining = download_retry(
|| async {
self.storage_impl
.list_files(Some(&timeline_storage_path), None)
.list_files(Some(&timeline_storage_path), None, &cancel)
.await
},
"list remaining files",
@@ -1445,6 +1404,10 @@ impl RemoteTimelineClient {
Ok(()) => {
break;
}
Err(e) if TimeoutOrCancel::caused_by_cancel(&e) => {
// loop around to do the proper stopping
continue;
}
Err(e) => {
let retries = task.retries.fetch_add(1, Ordering::SeqCst);
@@ -1700,23 +1663,6 @@ impl RemoteTimelineClient {
}
}
}
pub(crate) fn get_layers_metadata(
&self,
layers: Vec<LayerFileName>,
) -> anyhow::Result<Vec<Option<LayerFileMetadata>>> {
let q = self.upload_queue.lock().unwrap();
let q = match &*q {
UploadQueue::Stopped(_) | UploadQueue::Uninitialized => {
anyhow::bail!("queue is in state {}", q.as_str())
}
UploadQueue::Initialized(inner) => inner,
};
let decorated = layers.into_iter().map(|l| q.latest_files.get(&l).cloned());
Ok(decorated.collect())
}
}
pub fn remote_timelines_path(tenant_shard_id: &TenantShardId) -> RemotePath {

View File

@@ -11,16 +11,14 @@ use camino::{Utf8Path, Utf8PathBuf};
use pageserver_api::shard::TenantShardId;
use tokio::fs::{self, File, OpenOptions};
use tokio::io::{AsyncSeekExt, AsyncWriteExt};
use tokio_util::io::StreamReader;
use tokio_util::sync::CancellationToken;
use tracing::warn;
use utils::timeout::timeout_cancellable;
use utils::{backoff, crashsafe};
use crate::config::PageServerConf;
use crate::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::remote_timeline_client::{
download_cancellable, remote_layer_path, remote_timelines_path, DOWNLOAD_TIMEOUT,
};
use crate::tenant::remote_timeline_client::{remote_layer_path, remote_timelines_path};
use crate::tenant::storage_layer::LayerFileName;
use crate::tenant::Generation;
use crate::virtual_file::on_fatal_io_error;
@@ -83,15 +81,13 @@ pub async fn download_layer_file<'a>(
.with_context(|| format!("create a destination file for layer '{temp_file_path}'"))
.map_err(DownloadError::Other)?;
// Cancellation safety: it is safe to cancel this future, because it isn't writing to a local
// file: the write to local file doesn't start until after the request header is returned
// and we start draining the body stream below
let download = download_cancellable(cancel, storage.download(&remote_path))
let download = storage
.download(&remote_path, cancel)
.await
.with_context(|| {
format!(
"open a download stream for layer with remote storage path '{remote_path:?}'"
)
"open a download stream for layer with remote storage path '{remote_path:?}'"
)
})
.map_err(DownloadError::Other)?;
@@ -100,43 +96,26 @@ pub async fn download_layer_file<'a>(
let mut reader = tokio_util::io::StreamReader::new(download.download_stream);
// Cancellation safety: it is safe to cancel this future because it is writing into a temporary file,
// and we will unlink the temporary file if there is an error. This unlink is important because we
// are in a retry loop, and we wouldn't want to leave behind a rogue write I/O to a file that
// we will imminiently try and write to again.
let bytes_amount: u64 = match timeout_cancellable(
DOWNLOAD_TIMEOUT,
cancel,
tokio::io::copy_buf(&mut reader, &mut destination_file),
)
.await
.with_context(|| {
format!(
let bytes_amount = tokio::io::copy_buf(&mut reader, &mut destination_file)
.await
.with_context(|| format!(
"download layer at remote path '{remote_path:?}' into file {temp_file_path:?}"
)
})
.map_err(DownloadError::Other)?
{
Ok(b) => Ok(b),
))
.map_err(DownloadError::Other);
match bytes_amount {
Ok(bytes_amount) => {
let destination_file = destination_file.into_inner();
Ok((destination_file, bytes_amount))
}
Err(e) => {
// Remove incomplete files: on restart Timeline would do this anyway, but we must
// do it here for the retry case.
if let Err(e) = tokio::fs::remove_file(&temp_file_path).await {
on_fatal_io_error(&e, &format!("Removing temporary file {temp_file_path}"));
}
Err(e)
}
}
.with_context(|| {
format!(
"download layer at remote path '{remote_path:?}' into file {temp_file_path:?}"
)
})
.map_err(DownloadError::Other)?;
let destination_file = destination_file.into_inner();
Ok((destination_file, bytes_amount))
},
&format!("download {remote_path:?}"),
cancel,
@@ -218,9 +197,11 @@ pub async fn list_remote_timelines(
let listing = download_retry_forever(
|| {
download_cancellable(
storage.list(
Some(&remote_path),
ListingMode::WithDelimiter,
None,
&cancel,
storage.list(Some(&remote_path), ListingMode::WithDelimiter, None),
)
},
&format!("list timelines for {tenant_shard_id}"),
@@ -259,26 +240,23 @@ async fn do_download_index_part(
index_generation: Generation,
cancel: &CancellationToken,
) -> Result<IndexPart, DownloadError> {
use futures::stream::StreamExt;
let remote_path = remote_index_path(tenant_shard_id, timeline_id, index_generation);
let index_part_bytes = download_retry_forever(
|| async {
// Cancellation: if is safe to cancel this future because we're just downloading into
// a memory buffer, not touching local disk.
let index_part_download =
download_cancellable(cancel, storage.download(&remote_path)).await?;
let download = storage.download(&remote_path, cancel).await?;
let mut index_part_bytes = Vec::new();
let mut stream = std::pin::pin!(index_part_download.download_stream);
while let Some(chunk) = stream.next().await {
let chunk = chunk
.with_context(|| format!("download index part at {remote_path:?}"))
.map_err(DownloadError::Other)?;
index_part_bytes.extend_from_slice(&chunk[..]);
}
Ok(index_part_bytes)
let mut bytes = Vec::new();
let stream = download.download_stream;
let mut stream = StreamReader::new(stream);
tokio::io::copy_buf(&mut stream, &mut bytes)
.await
.with_context(|| format!("download index part at {remote_path:?}"))
.map_err(DownloadError::Other)?;
Ok(bytes)
},
&format!("download {remote_path:?}"),
cancel,
@@ -373,7 +351,7 @@ pub(super) async fn download_index_part(
let index_prefix = remote_index_path(tenant_shard_id, timeline_id, Generation::none());
let indices = download_retry(
|| async { storage.list_files(Some(&index_prefix), None).await },
|| async { storage.list_files(Some(&index_prefix), None, cancel).await },
"list index_part files",
cancel,
)
@@ -446,11 +424,10 @@ pub(crate) async fn download_initdb_tar_zst(
.with_context(|| format!("tempfile creation {temp_path}"))
.map_err(DownloadError::Other)?;
let download = match download_cancellable(cancel, storage.download(&remote_path)).await
{
let download = match storage.download(&remote_path, cancel).await {
Ok(dl) => dl,
Err(DownloadError::NotFound) => {
download_cancellable(cancel, storage.download(&remote_preserved_path)).await?
storage.download(&remote_preserved_path, cancel).await?
}
Err(other) => Err(other)?,
};
@@ -460,6 +437,7 @@ pub(crate) async fn download_initdb_tar_zst(
// TODO: this consumption of the response body should be subject to timeout + cancellation, but
// not without thinking carefully about how to recover safely from cancelling a write to
// local storage (e.g. by writing into a temp file as we do in download_layer)
// FIXME: flip the weird error wrapping
tokio::io::copy_buf(&mut download, &mut writer)
.await
.with_context(|| format!("download initdb.tar.zst at {remote_path:?}"))

View File

@@ -16,7 +16,7 @@ use crate::{
config::PageServerConf,
tenant::remote_timeline_client::{
index::IndexPart, remote_index_path, remote_initdb_archive_path,
remote_initdb_preserved_archive_path, remote_path, upload_cancellable,
remote_initdb_preserved_archive_path, remote_path,
},
};
use remote_storage::{GenericRemoteStorage, TimeTravelError};
@@ -49,16 +49,15 @@ pub(crate) async fn upload_index_part<'a>(
let index_part_bytes = bytes::Bytes::from(index_part_bytes);
let remote_path = remote_index_path(tenant_shard_id, timeline_id, generation);
upload_cancellable(
cancel,
storage.upload_storage_object(
storage
.upload_storage_object(
futures::stream::once(futures::future::ready(Ok(index_part_bytes))),
index_part_size,
&remote_path,
),
)
.await
.with_context(|| format!("upload index part for '{tenant_shard_id} / {timeline_id}'"))
cancel,
)
.await
.with_context(|| format!("upload index part for '{tenant_shard_id} / {timeline_id}'"))
}
/// Attempts to upload given layer files.
@@ -115,11 +114,10 @@ pub(super) async fn upload_timeline_layer<'a>(
let reader = tokio_util::io::ReaderStream::with_capacity(source_file, super::BUFFER_SIZE);
upload_cancellable(cancel, storage.upload(reader, fs_size, &storage_path, None))
storage
.upload(reader, fs_size, &storage_path, None, cancel)
.await
.with_context(|| format!("upload layer from local path '{source_path}'"))?;
Ok(())
.with_context(|| format!("upload layer from local path '{source_path}'"))
}
/// Uploads the given `initdb` data to the remote storage.
@@ -139,12 +137,10 @@ pub(crate) async fn upload_initdb_dir(
let file = tokio_util::io::ReaderStream::with_capacity(initdb_tar_zst, super::BUFFER_SIZE);
let remote_path = remote_initdb_archive_path(tenant_id, timeline_id);
upload_cancellable(
cancel,
storage.upload_storage_object(file, size as usize, &remote_path),
)
.await
.with_context(|| format!("upload initdb dir for '{tenant_id} / {timeline_id}'"))
storage
.upload_storage_object(file, size as usize, &remote_path, cancel)
.await
.with_context(|| format!("upload initdb dir for '{tenant_id} / {timeline_id}'"))
}
pub(crate) async fn preserve_initdb_archive(
@@ -155,7 +151,8 @@ pub(crate) async fn preserve_initdb_archive(
) -> anyhow::Result<()> {
let source_path = remote_initdb_archive_path(tenant_id, timeline_id);
let dest_path = remote_initdb_preserved_archive_path(tenant_id, timeline_id);
upload_cancellable(cancel, storage.copy_object(&source_path, &dest_path))
storage
.copy_object(&source_path, &dest_path, cancel)
.await
.with_context(|| format!("backing up initdb archive for '{tenant_id} / {timeline_id}'"))
}

View File

@@ -133,7 +133,7 @@ impl SecondaryTenant {
}
pub(crate) fn set_tenant_conf(&self, config: &TenantConfOpt) {
*(self.tenant_conf.lock().unwrap()) = *config;
*(self.tenant_conf.lock().unwrap()) = config.clone();
}
/// For API access: generate a LocationConfig equivalent to the one that would be used to
@@ -144,13 +144,13 @@ impl SecondaryTenant {
let conf = models::LocationConfigSecondary { warm: conf.warm };
let tenant_conf = *self.tenant_conf.lock().unwrap();
let tenant_conf = self.tenant_conf.lock().unwrap().clone();
models::LocationConfig {
mode: models::LocationConfigMode::Secondary,
generation: None,
secondary_conf: Some(conf),
shard_number: self.tenant_shard_id.shard_number.0,
shard_count: self.tenant_shard_id.shard_count.0,
shard_count: self.tenant_shard_id.shard_count.literal(),
shard_stripe_size: self.shard_identity.stripe_size.0,
tenant_conf: tenant_conf.into(),
}

View File

@@ -486,7 +486,7 @@ impl<'a> TenantDownloader<'a> {
let heatmap_path_bg = heatmap_path.clone();
tokio::task::spawn_blocking(move || {
tokio::runtime::Handle::current().block_on(async move {
VirtualFile::crashsafe_overwrite(&heatmap_path_bg, &temp_path, &heatmap_bytes).await
VirtualFile::crashsafe_overwrite(&heatmap_path_bg, &temp_path, heatmap_bytes).await
})
})
.await
@@ -523,12 +523,13 @@ impl<'a> TenantDownloader<'a> {
tracing::debug!("Downloading heatmap for secondary tenant",);
let heatmap_path = remote_heatmap_path(tenant_shard_id);
let cancel = &self.secondary_state.cancel;
let heatmap_bytes = backoff::retry(
|| async {
let download = self
.remote_storage
.download(&heatmap_path)
.download(&heatmap_path, cancel)
.await
.map_err(UpdateError::from)?;
let mut heatmap_bytes = Vec::new();
@@ -540,7 +541,7 @@ impl<'a> TenantDownloader<'a> {
FAILED_DOWNLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"download heatmap",
&self.secondary_state.cancel,
cancel,
)
.await
.ok_or_else(|| UpdateError::Cancelled)

View File

@@ -21,18 +21,17 @@ use futures::Future;
use md5;
use pageserver_api::shard::TenantShardId;
use rand::Rng;
use remote_storage::GenericRemoteStorage;
use remote_storage::{GenericRemoteStorage, TimeoutOrCancel};
use super::{
heatmap::HeatMapTenant,
scheduler::{self, JobGenerator, RunningJob, SchedulingResult, TenantBackgroundJobs},
CommandRequest,
CommandRequest, UploadCommand,
};
use tokio_util::sync::CancellationToken;
use tracing::{info_span, instrument, Instrument};
use utils::{backoff, completion::Barrier, yielding_loop::yielding_loop};
use super::{heatmap::HeatMapTenant, UploadCommand};
pub(super) async fn heatmap_uploader_task(
tenant_manager: Arc<TenantManager>,
remote_storage: GenericRemoteStorage,
@@ -417,10 +416,10 @@ async fn upload_tenant_heatmap(
|| async {
let bytes = futures::stream::once(futures::future::ready(Ok(bytes.clone())));
remote_storage
.upload_storage_object(bytes, size, &path)
.upload_storage_object(bytes, size, &path, cancel)
.await
},
|_| false,
TimeoutOrCancel::caused_by_cancel,
3,
u32::MAX,
"Uploading heatmap",

View File

@@ -257,6 +257,12 @@ impl LayerAccessStats {
ret
}
/// Get the latest access timestamp, falling back to latest residence event, further falling
/// back to `SystemTime::now` for a usable timestamp for eviction.
pub(crate) fn latest_activity_or_now(&self) -> SystemTime {
self.latest_activity().unwrap_or_else(SystemTime::now)
}
/// Get the latest access timestamp, falling back to latest residence event.
///
/// This function can only return `None` if there has not yet been a call to the
@@ -271,7 +277,7 @@ impl LayerAccessStats {
/// that that type can only be produced by inserting into the layer map.
///
/// [`record_residence_event`]: Self::record_residence_event
pub(crate) fn latest_activity(&self) -> Option<SystemTime> {
fn latest_activity(&self) -> Option<SystemTime> {
let locked = self.0.lock().unwrap();
let inner = &locked.for_eviction_policy;
match inner.last_accesses.recent() {

View File

@@ -416,27 +416,31 @@ impl DeltaLayerWriterInner {
/// The values must be appended in key, lsn order.
///
async fn put_value(&mut self, key: Key, lsn: Lsn, val: Value) -> anyhow::Result<()> {
self.put_value_bytes(key, lsn, &Value::ser(&val)?, val.will_init())
.await
let (_, res) = self
.put_value_bytes(key, lsn, Value::ser(&val)?, val.will_init())
.await;
res
}
async fn put_value_bytes(
&mut self,
key: Key,
lsn: Lsn,
val: &[u8],
val: Vec<u8>,
will_init: bool,
) -> anyhow::Result<()> {
) -> (Vec<u8>, anyhow::Result<()>) {
assert!(self.lsn_range.start <= lsn);
let off = self.blob_writer.write_blob(val).await?;
let (val, res) = self.blob_writer.write_blob(val).await;
let off = match res {
Ok(off) => off,
Err(e) => return (val, Err(anyhow::anyhow!(e))),
};
let blob_ref = BlobRef::new(off, will_init);
let delta_key = DeltaKey::from_key_lsn(&key, lsn);
self.tree.append(&delta_key.0, blob_ref.0)?;
Ok(())
let res = self.tree.append(&delta_key.0, blob_ref.0);
(val, res.map_err(|e| anyhow::anyhow!(e)))
}
fn size(&self) -> u64 {
@@ -457,7 +461,8 @@ impl DeltaLayerWriterInner {
file.seek(SeekFrom::Start(index_start_blk as u64 * PAGE_SZ as u64))
.await?;
for buf in block_buf.blocks {
file.write_all(buf.as_ref()).await?;
let (_buf, res) = file.write_all(buf).await;
res?;
}
assert!(self.lsn_range.start < self.lsn_range.end);
// Fill in the summary on blk 0
@@ -472,17 +477,12 @@ impl DeltaLayerWriterInner {
index_root_blk,
};
let mut buf = smallvec::SmallVec::<[u8; PAGE_SZ]>::new();
let mut buf = Vec::with_capacity(PAGE_SZ);
// TODO: could use smallvec here but it's a pain with Slice<T>
Summary::ser_into(&summary, &mut buf)?;
if buf.spilled() {
// This is bad as we only have one free block for the summary
warn!(
"Used more than one page size for summary buffer: {}",
buf.len()
);
}
file.seek(SeekFrom::Start(0)).await?;
file.write_all(&buf).await?;
let (_buf, res) = file.write_all(buf).await;
res?;
let metadata = file
.metadata()
@@ -587,9 +587,9 @@ impl DeltaLayerWriter {
&mut self,
key: Key,
lsn: Lsn,
val: &[u8],
val: Vec<u8>,
will_init: bool,
) -> anyhow::Result<()> {
) -> (Vec<u8>, anyhow::Result<()>) {
self.inner
.as_mut()
.unwrap()
@@ -675,18 +675,12 @@ impl DeltaLayer {
let new_summary = rewrite(actual_summary);
let mut buf = smallvec::SmallVec::<[u8; PAGE_SZ]>::new();
let mut buf = Vec::with_capacity(PAGE_SZ);
// TODO: could use smallvec here, but it's a pain with Slice<T>
Summary::ser_into(&new_summary, &mut buf).context("serialize")?;
if buf.spilled() {
// The code in DeltaLayerWriterInner just warn!()s for this.
// It should probably error out as well.
return Err(RewriteSummaryError::Other(anyhow::anyhow!(
"Used more than one page size for summary buffer: {}",
buf.len()
)));
}
file.seek(SeekFrom::Start(0)).await?;
file.write_all(&buf).await?;
let (_buf, res) = file.write_all(buf).await;
res?;
Ok(())
}
}

View File

@@ -341,18 +341,12 @@ impl ImageLayer {
let new_summary = rewrite(actual_summary);
let mut buf = smallvec::SmallVec::<[u8; PAGE_SZ]>::new();
let mut buf = Vec::with_capacity(PAGE_SZ);
// TODO: could use smallvec here but it's a pain with Slice<T>
Summary::ser_into(&new_summary, &mut buf).context("serialize")?;
if buf.spilled() {
// The code in ImageLayerWriterInner just warn!()s for this.
// It should probably error out as well.
return Err(RewriteSummaryError::Other(anyhow::anyhow!(
"Used more than one page size for summary buffer: {}",
buf.len()
)));
}
file.seek(SeekFrom::Start(0)).await?;
file.write_all(&buf).await?;
let (_buf, res) = file.write_all(buf).await;
res?;
Ok(())
}
}
@@ -528,9 +522,11 @@ impl ImageLayerWriterInner {
///
/// The page versions must be appended in blknum order.
///
async fn put_image(&mut self, key: Key, img: &[u8]) -> anyhow::Result<()> {
async fn put_image(&mut self, key: Key, img: Bytes) -> anyhow::Result<()> {
ensure!(self.key_range.contains(&key));
let off = self.blob_writer.write_blob(img).await?;
let (_img, res) = self.blob_writer.write_blob(img).await;
// TODO: re-use the buffer for `img` further upstack
let off = res?;
let mut keybuf: [u8; KEY_SIZE] = [0u8; KEY_SIZE];
key.write_to_byte_slice(&mut keybuf);
@@ -553,7 +549,8 @@ impl ImageLayerWriterInner {
.await?;
let (index_root_blk, block_buf) = self.tree.finish()?;
for buf in block_buf.blocks {
file.write_all(buf.as_ref()).await?;
let (_buf, res) = file.write_all(buf).await;
res?;
}
// Fill in the summary on blk 0
@@ -568,17 +565,12 @@ impl ImageLayerWriterInner {
index_root_blk,
};
let mut buf = smallvec::SmallVec::<[u8; PAGE_SZ]>::new();
let mut buf = Vec::with_capacity(PAGE_SZ);
// TODO: could use smallvec here but it's a pain with Slice<T>
Summary::ser_into(&summary, &mut buf)?;
if buf.spilled() {
// This is bad as we only have one free block for the summary
warn!(
"Used more than one page size for summary buffer: {}",
buf.len()
);
}
file.seek(SeekFrom::Start(0)).await?;
file.write_all(&buf).await?;
let (_buf, res) = file.write_all(buf).await;
res?;
let metadata = file
.metadata()
@@ -659,7 +651,7 @@ impl ImageLayerWriter {
///
/// The page versions must be appended in blknum order.
///
pub async fn put_image(&mut self, key: Key, img: &[u8]) -> anyhow::Result<()> {
pub async fn put_image(&mut self, key: Key, img: Bytes) -> anyhow::Result<()> {
self.inner.as_mut().unwrap().put_image(key, img).await
}

View File

@@ -383,9 +383,11 @@ impl InMemoryLayer {
for (lsn, pos) in vec_map.as_slice() {
cursor.read_blob_into_buf(*pos, &mut buf, &ctx).await?;
let will_init = Value::des(&buf)?.will_init();
delta_layer_writer
.put_value_bytes(key, *lsn, &buf, will_init)
.await?;
let res;
(buf, res) = delta_layer_writer
.put_value_bytes(key, *lsn, buf, will_init)
.await;
res?;
}
}

View File

@@ -1413,10 +1413,6 @@ impl ResidentLayer {
&self.owner.0.path
}
pub(crate) fn access_stats(&self) -> &LayerAccessStats {
self.owner.access_stats()
}
pub(crate) fn metadata(&self) -> LayerFileMetadata {
self.owner.metadata()
}

View File

@@ -9,6 +9,7 @@ use crate::context::{DownloadBehavior, RequestContext};
use crate::metrics::TENANT_TASK_EVENTS;
use crate::task_mgr;
use crate::task_mgr::{TaskKind, BACKGROUND_RUNTIME};
use crate::tenant::throttle::Stats;
use crate::tenant::timeline::CompactionError;
use crate::tenant::{Tenant, TenantState};
use tokio_util::sync::CancellationToken;
@@ -139,6 +140,8 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
// How many errors we have seen consequtively
let mut error_run_count = 0;
let mut last_throttle_flag_reset_at = Instant::now();
TENANT_TASK_EVENTS.with_label_values(&["start"]).inc();
async {
let ctx = RequestContext::todo_child(TaskKind::Compaction, DownloadBehavior::Download);
@@ -203,6 +206,27 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
walredo_mgr.maybe_quiesce(period * 10);
}
// TODO: move this (and walredo quiesce) to a separate task that isn't affected by the back-off,
// so we get some upper bound guarantee on when walredo quiesce / this throttling reporting here happens.
info_span!(parent: None, "timeline_get_throttle", tenant_id=%tenant.tenant_shard_id, shard_id=%tenant.tenant_shard_id.shard_slug()).in_scope(|| {
let now = Instant::now();
let prev = std::mem::replace(&mut last_throttle_flag_reset_at, now);
let Stats { count_accounted, count_throttled, sum_throttled_usecs } = tenant.timeline_get_throttle.reset_stats();
if count_throttled == 0 {
return;
}
let allowed_rps = tenant.timeline_get_throttle.steady_rps();
let delta = now - prev;
warn!(
n_seconds=%format_args!("{:.3}",
delta.as_secs_f64()),
count_accounted,
count_throttled,
sum_throttled_usecs,
allowed_rps=%format_args!("{allowed_rps:.0}"),
"shard was throttled in the last n_seconds")
});
// Sleep
if tokio::time::timeout(sleep_duration, cancel.cancelled())
.await

Some files were not shown because too many files have changed in this diff Show More