Compare commits


282 Commits

Author SHA1 Message Date
Conrad Ludgate
a5c1716edc hakari 2023-12-19 15:56:19 +00:00
Conrad Ludgate
0f36927a17 deps 2023-12-19 15:55:56 +00:00
Conrad Ludgate
b78a8c4d53 fix 2023-12-19 15:49:35 +00:00
Conrad Ludgate
dc109c42bc update rustls 2023-12-19 15:49:35 +00:00
Arpad Müller
8b91bbc38e Update jsonwebtoken to 9 and sct to 0.7.1 (#6189)
This increases the list of crates that are based on `ring` 0.17.
2023-12-19 15:45:17 +00:00
Christian Schwarz
e6bf6952b8 higher resolution histograms for getpage@lsn (#6177)
part of https://github.com/neondatabase/cloud/issues/7811
2023-12-19 14:46:17 +01:00
Arpad Müller
a2fab34371 Update zstd to 0.13 (#6187)
This updates the `zstd` crate to 0.13, and `zstd-sys` with it (it
contains C code, so we should always run the newest version of that).
2023-12-19 13:16:53 +00:00
Vadim Kharitonov
c52384752e Compile pg_semver extension (#6184)
Closes #6183
2023-12-19 15:10:07 +02:00
Bodobolero
73d247c464 Analyze clickbench performance with explain plans and pg_stat_statements (#6161)
## Problem

To understand differences in performance between Neon, Aurora, and RDS, we
want to collect explain analyze plans and pg_stat_statements for
selected benchmarking runs.

## Summary of changes

Add workflow input options to collect explain and pg_stat_statements for
benchmarking workflow

Co-authored-by: BodoBolero <bodobolero@gmail.com>
2023-12-19 11:44:25 +00:00
Arseny Sher
b701394d7a Fix WAL waiting in walproposer for v16.
Just preparing the cv right before waiting is not enough, as we might have already
missed the flushptr change & wakeup, so re-check before sleeping.

https://neondb.slack.com/archives/C03QLRH7PPD/p1702830965396619?thread_ts=1702756761.836649&cid=C03QLRH7PPD
2023-12-19 15:34:14 +04:00
John Spray
d89af4cf8e pageserver: downgrade 'connection reset' WAL errors (#6181)
This squashes a particularly noisy warn-level log that occurs when
safekeepers are restarted.

Unfortunately the error type from `tonic` doesn't provide a neat way of
matching this, so we use a string comparison instead.
2023-12-19 10:38:00 +00:00
Christian Schwarz
6ffbbb2e02 include timeline ids in tenant details response (#6166)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771

This allows getting the list of tenants and timelines without triggering
initial logical size calculation by requesting the timeline details API
response, which would skew our results.
2023-12-19 10:32:51 +00:00
Arpad Müller
fbb979d5e3 remote_storage: move shared utilities for S3 and Azure into common module (#6176)
The PR does two things:

* move the util functions present in the remote_storage Azure and S3
test files into a shared one, deduplicating them.
* add a `s3_upload_download_works` test as a copy of the Azure test

The goal is mainly to fight duplication and make the code a little bit
more generic (like removing mentions of s3 and azure from function
names).

This is a first step towards #6146.
2023-12-19 11:29:50 +01:00
Arpad Müller
a89d6dc76e Always send a json response for timeline_get_lsn_by_timestamp (#6178)
As part of the transition laid out in
[this](https://github.com/neondatabase/cloud/pull/7553#discussion_r1370473911)
comment, don't read the `version` query parameter in
`timeline_get_lsn_by_timestamp`, but always return the structured json
response.

Follow-up of https://github.com/neondatabase/neon/pull/5608
2023-12-19 11:29:16 +01:00
Christian Schwarz
c272c68e5c RFC: Per-Tenant GetPage@LSN Throttling (#5648)
Implementation epic: https://github.com/neondatabase/neon/issues/5899
2023-12-19 11:20:56 +01:00
Anna Khanova
6e6e40dd7f Invalidate credentials on auth failure (#6171)
## Problem

If the user resets their password, the cache could receive this information only
after `ttl` minutes.

## Summary of changes

Invalidate password on auth failure.
2023-12-18 23:24:22 +01:00
Heikki Linnakangas
6939fc3db6 Remove declarations of non-existent global variables and functions
FileCacheMonitorMain was removed in commit b497d0094e.
2023-12-18 21:05:31 +02:00
Heikki Linnakangas
c4c48cfd63 Clean up #includes
- No need to include c.h, port.h or pg_config.h, they are included in
  postgres.h
- No need to include postgres.h in header files. Instead, the
  assumption in PostgreSQL is that all .c files include postgres.h.
- Reorder includes to alphabetical order, and system headers before
  pgsql headers
- Remove bunch of other unnecessary includes that got copy-pasted from
  one source file to another
2023-12-18 21:05:29 +02:00
Heikki Linnakangas
82215d20b0 Mark some variables 'static'
Move initialization of neon_redo_read_buffer_filter. This allows
marking it 'static', too.
2023-12-18 21:05:24 +02:00
Sasha Krassovsky
62737f3776 Grant BYPASSRLS and REPLICATION explicitly to neon_superuser roles 2023-12-18 10:54:14 -08:00
Christian Schwarz
1f9a7d1cd0 add a Rust client for Pageserver page_service (#6128)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771

Stacked atop https://github.com/neondatabase/neon/pull/6145
2023-12-18 18:17:19 +00:00
John Spray
4ea4812ab2 tests: update python dependencies (#6164)
## Problem

Existing dependencies didn't work on Fedora 39 (python 3.12)

## Summary of changes

- Update pyyaml 6.0 -> 6.0.1
- Update yarl 1.8.2->1.9.4
- Update the `dnf install` line in README to include dependencies of
python packages (unrelated to upgrades, just noticed absences while
doing fresh pysync run)
2023-12-18 15:47:09 +00:00
Anna Khanova
00d90ce76a Added cache for get role secret (#6165)
## Problem

Currently, if we get many consecutive connections to the same
user/endpoint, we send a lot of traffic to the console.

## Summary of changes

Cache the proxy_get_role_secret response with ttl=4min.

Note: this is a temporary hack; the notifier listener is WIP.
2023-12-18 16:04:47 +01:00
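For illustration only, a minimal sketch of the TTL-cache idea described in the commit above, using a plain `HashMap` plus `Instant` with made-up names (the invalidate-on-auth-failure commit further up would correspond to calling `invalidate`); this is not the proxy's actual code:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical sketch of a TTL cache for role secrets; names and layout are
/// illustrative, not the proxy's implementation.
struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return the cached value only if it was inserted less than `ttl` ago.
    fn get(&self, key: &str) -> Option<V> {
        self.entries
            .get(key)
            .filter(|(inserted, _)| inserted.elapsed() < self.ttl)
            .map(|(_, v)| v.clone())
    }

    fn insert(&mut self, key: String, value: V) {
        self.entries.insert(key, (Instant::now(), value));
    }

    /// Drop a stale entry early, e.g. after an authentication failure.
    fn invalidate(&mut self, key: &str) {
        self.entries.remove(key);
    }
}
```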
John Khvatov
33cb9a68f7 pageserver: Reduce tracing overhead in timeline::get (#6115)
## Problem

Compaction process (specifically the image layer reconstructions part)
is lagging behind wal ingest (at speed ~10-15MB/s) for medium-sized
tenants (30-50GB). CPU profile shows that significant amount of time
(see flamegraph) is being spent in `tracing::span::Span::new`.

mainline (commit: 0ba4cae491):

![reconstruct-mainline-0ba4cae491c2](https://github.com/neondatabase/neon/assets/289788/ebfd262e-5c97-4858-80c7-664a1dbcc59d)

## Summary of changes

By lowering the tracing level in get_value_reconstruct_data and
get_or_maybe_download from info to debug, we can reduce the overhead of
span creation in prod environments. On my system, this sped up the image
reconstruction process by 60% (from 14500 to 23160 page reconstructions
per second).

pr:

![reconstruct-opt-2](https://github.com/neondatabase/neon/assets/289788/563a159b-8f2f-4300-b0a1-6cd66e7df769)


`create_image_layers()` (it's CPU-bound on one core here), mainline vs PR:

![image](https://github.com/neondatabase/neon/assets/289788/a981e3cb-6df9-4882-8a94-95e99c35aa83)
2023-12-18 13:33:23 +00:00
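For readers unfamiliar with `tracing` levels, the change described above amounts to swapping the span macro used on the hot path; a generic sketch (function and field names are made up, not the actual pageserver code):

```rust
use tracing::{debug_span, info_span};

// Before: an info-level span is created (and its fields recorded) on every call,
// even when the subscriber's filter would discard it anyway.
fn reconstruct_page_before() {
    let _span = info_span!("get_value_reconstruct_data", key = "example").entered();
    // ... reconstruction work ...
}

// After: a debug-level span; with a production filter of `info`, the span is
// disabled and most of its creation cost disappears.
fn reconstruct_page_after() {
    let _span = debug_span!("get_value_reconstruct_data", key = "example").entered();
    // ... reconstruction work ...
}
```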
Conrad Ludgate
17bde7eda5 proxy refactor large files (#6153)
## Problem

The `src/proxy.rs` file is far too large

## Summary of changes

Creates 3 new files:
```
src/metrics.rs
src/proxy/retry.rs
src/proxy/connect_compute.rs
```
2023-12-18 10:59:49 +00:00
John Spray
dbdb1d21f2 pageserver: on-demand activation cleanups (#6157)
## Problem

#6112 added some logs and metrics: clean these up a bit:
- Avoid counting startup completions for tenants launched after startup
- Exclude no-op cases from timing histograms
- Remove a rogue log message
2023-12-18 10:29:19 +00:00
Arseny Sher
e1935f42a1 Don't generate core dump when walproposer intentionally panics.
Walproposer sometimes intentionally PANICs when its term is defeated, as the
basebackup is likely spoiled by that time. We don't want a core dump in this
case.
2023-12-18 11:03:34 +04:00
Alexander Bayandin
9bdc25f0af Revert "CI: build build-tools image" (#6156)
It turns out the issue with skipped jobs is not so trivial (because
GitHub checks jobs transitively); a possible workaround with `if:
always() && contains(fromJSON('["success", "skipped"]'),
needs.build-buildtools-image.result)` would tangle the workflow really
badly. We'll need to come up with a better solution.

To unblock main, I'm going to revert
https://github.com/neondatabase/neon/pull/6082.
2023-12-16 12:32:00 +00:00
Christian Schwarz
47873470db pageserver: add method to dump keyspace in mgmt api client (#6145)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771
2023-12-16 10:52:48 +00:00
Abhijeet Patil
8619e6295a CI: build build-tools image (#6082)
## Currently our build Dockerfile is located in the build repo; it makes
sense to have it as a part of our neon repo

## Summary of changes
The Dockerfile that we use to build our binary and other tools used to
reside in the build repo.
It made sense to bring the Dockerfile into the repo where it is used,
so that contributors can also view it and amend it if required.
This reduces maintenance: Dockerfile changes and code changes can
be accommodated in the same PR.
Also, building the image and pushing it to ECR is abstracted into a
reusable workflow. Ideally that workflow is used for any other such jobs too.

## Checklist before requesting a review

- [x] Moved the docker file used to build the binary from the build repo
to the neon repo
- [x] adding gh workflow to build and push the image
- [x] adding gh workflow to tag the pushed image
- [x] update README file

---------

Co-authored-by: Abhijeet Patil <abhijeet@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-12-16 10:33:52 +00:00
Conrad Ludgate
83811491da update zerocopy (#6148)
## Problem

https://github.com/neondatabase/neon/security/dependabot/48

```
$ cargo tree -i zerocopy
zerocopy v0.7.3
└── ahash v0.8.5
    └── hashbrown v0.13.2
```

ahash doesn't use the affected APIs, so we are not vulnerable, but it's best to
update to silence the alert anyway.

## Summary of changes

```
$ cargo update -p zerocopy --precise 0.7.31
    Updating crates.io index
    Updating syn v2.0.28 -> v2.0.32
    Updating zerocopy v0.7.3 -> v0.7.31
    Updating zerocopy-derive v0.7.3 -> v0.7.31
```
2023-12-16 09:06:00 +00:00
John Spray
d066dad84b pageserver: prioritize activation of tenants with client requests (#6112)
## Problem

During startup, a client request might have to wait a long time while
the system is busy initializing all the attached tenants, even though
most of the attached tenants probably don't have any client requests to
service, and could wait a bit.

## Summary of changes

- Add a semaphore to limit how many Tenant::spawn()s may concurrently do
I/O to attach their tenant (i.e. read indices from remote storage, scan
local layer files, etc).
- Add Tenant::activate_now, a hook for kicking a tenant in its spawn()
method to skip waiting for the warmup semaphore
- For tenants that attached via warmup semaphore units, wait for logical
size calculation to complete before dropping the warmup units
- Set Tenant::activate_now in `get_active_tenant_with_timeout` (the page
service's path for getting a reference to a tenant).
- Wait for tenant activation in HTTP handlers for timeline creation and
deletion: like page service requests, these require an active tenant and
should prioritize activation if called.
2023-12-15 20:37:47 +00:00
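A rough sketch of the "queue behind a warmup semaphore unless a client asks for the tenant now" pattern described above, assuming `tokio::sync::{Semaphore, Notify}`; this is illustrative, not the pageserver's actual code:

```rust
use tokio::sync::{Notify, Semaphore};

/// Illustrative only: wait for a warmup permit, but skip the queue if
/// `activate_now` is signalled (e.g. because a client request arrived).
async fn wait_for_warmup(warmup: &Semaphore, activate_now: &Notify) {
    tokio::select! {
        permit = warmup.acquire() => {
            // Hold the permit while doing attach I/O (read indices from
            // remote storage, scan local layer files, ...), then drop it.
            let _permit = permit.expect("warmup semaphore is never closed here");
        }
        _ = activate_now.notified() => {
            // A page service request is waiting on this tenant: activate
            // immediately without queueing behind other tenants.
        }
    }
    // ... proceed with activation either way ...
}
```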
John Spray
56f7d55ba7 pageserver: basic cancel/timeout for remote storage operations (#6097)
## Problem

Various places in remote storage were not subject to a timeout (thereby
stuck TCP connections could hold things up), and did not respect a
cancellation token (so things like timeline deletion or tenant detach
would have to wait arbitrarily long).



## Summary of changes

- Add download_cancellable and upload_cancellable helpers, and use them
in all the places we wait for remote storage operations (with the
exception of initdb downloads, where it would not have been safe).
- Add a cancellation token arg to `download_retry`.
- Use cancellation token args in various places that were missing one
per #5066

Closes: #5066 

Why is this only "basic" handling?
- Doesn't express difference between shutdown and errors in return
types, to avoid refactoring all the places that use an anyhow::Error
(these should all eventually return a more structured error type)
- Implements timeouts on top of remote storage, rather than within it:
this means that operations hitting their timeout will lose their
semaphore permit and thereby go to the back of the queue for their
retry.
- Doing a nicer job is tracked in
https://github.com/neondatabase/neon/issues/6096
2023-12-15 17:43:02 +00:00
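As a rough illustration of "timeouts on top of remote storage, plus cancellation", here is what such a helper could look like (assuming `tokio_util::sync::CancellationToken`; the name mirrors the one mentioned above, but the body is a guess rather than the real code):

```rust
use std::future::Future;
use std::time::Duration;
use tokio_util::sync::CancellationToken;

#[derive(Debug)]
enum StorageWaitError {
    Cancelled,
    TimedOut,
}

/// Sketch: await a remote storage operation, bailing out on cancellation or timeout.
async fn download_cancellable<F, T>(
    cancel: &CancellationToken,
    timeout: Duration,
    op: F,
) -> Result<T, StorageWaitError>
where
    F: Future<Output = T>,
{
    tokio::select! {
        // Shutdown / detach / deletion requested: stop waiting immediately.
        _ = cancel.cancelled() => Err(StorageWaitError::Cancelled),
        // Otherwise bound the wait so a stuck TCP connection cannot hold us up forever.
        res = tokio::time::timeout(timeout, op) => {
            res.map_err(|_elapsed| StorageWaitError::TimedOut)
        }
    }
}
```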
Christian Schwarz
1a9854bfb7 add a Rust client for Pageserver management API (#6127)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771

This PR moves the control plane's spread-all-over-the-place client for
the pageserver management API into a separate module within the
pageserver crate.

I need that client to be async in my benchmarking work, so, this PR
switches to the async version of `reqwest`.
That is also the right direction generally IMO.

The switch to async in turn mandated converting most of the
`control_plane/` code to async.

Note that some of the client methods should be taking `TenantShardId`
instead of `TenantId`, but, none of the callers seem to be
sharding-aware.
Leaving that for another time:
https://github.com/neondatabase/neon/issues/6154
2023-12-15 18:33:45 +01:00
John Spray
de1a9c6e3b s3_scrubber: basic support for sharding (#6119)
This doesn't make the scrubber smart enough to understand that many
shards are part of the same tenants, but it makes it understand paths
well enough to scrub the individual shards without thinking they're
malformed.

This is a prerequisite to being able to run tests with sharding enabled.

Related: #5929
2023-12-15 15:48:55 +00:00
Arseny Sher
e62569a878 A few comments on rust walproposer build. 2023-12-15 19:31:51 +04:00
John Spray
bd1cb1b217 tests: update allow list for negative_env (#6144)
Tests attaching the tenant immediately after the fixture detaches it
could result in LSN updates failing validation

e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6142/7211196140/index.html#suites/7745dadbd815ab87f5798aa881796f47/32b12ccc0b01b122
2023-12-15 15:08:28 +00:00
Conrad Ludgate
98629841e0 improve proxy code cov (#6141)
## Summary of changes

saw some low-hanging codecov improvements. even if code coverage is
somewhat of a pointless game, might as well add tests where we can and
delete code if it's unused
2023-12-15 12:11:50 +00:00
Arpad Müller
215cdd18c4 Make initdb upload retries cancellable and seek to beginning (#6147)
* initdb uploads had no cancellation token, which means that when we
were stuck in upload retries, we wouldn't be able to delete the
timeline. in general, the combination of retrying forever and not having
cancellation tokens is quite dangerous.
* initdb uploads wouldn't rewind the file. this wasn't discovered in the
purposefully unreliable test-s3 in pytest because those fail on the
first byte always, not somewhere during the connection. we'd be getting
errors from the AWS sdk that the file was at an unexpected end.

slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1702632247784079
2023-12-15 12:11:25 +00:00
Joonas Koivunen
0fd80484a9 fix: Timeline deletion during busy startup (#6133)
Compaction was holding back timeline deletion because the compaction
lock had been acquired while the semaphore was still being waited on. Timeline
deletion was waiting on the same lock for 1500s.

This replaces the
`pageserver::tenant::tasks::concurrent_background_tasks_rate_limit`
(which looks correct) with a simpler `..._permit`, which is just an
infallible acquire, making it easier to spot that it needs to be raced
with cancellation tokens.

Ref: https://neondb.slack.com/archives/C03F5SM1N02/p1702496912904719
Ref: https://neondb.slack.com/archives/C03F5SM1N02/p1702578093497779
2023-12-15 11:59:24 +00:00
Joonas Koivunen
07508fb110 fix: better Json parsing errors (#6135)
Before, json parsing from the http api only returned per-field
errors. Now parsing is done using `serde_path_to_error`, which at
least helped greatly with the `disk_usage_eviction_run` used for
testing. I don't think this can conflict with anything added in #5310.
2023-12-15 12:18:22 +02:00
Arseny Sher
5bb9ba37cc Fix python list_segments of sk.
Fixes rare test_peer_recovery flakiness as we started to compare tmp control
file.

https://neondb.slack.com/archives/C04KGFVUWUQ/p1702310929657179
2023-12-15 13:43:11 +04:00
John Spray
f1cd1a2122 pageserver: improved handling of concurrent timeline creations on the same ID (#6139)
## Problem

Historically, the pageserver used an "uninit mark" file on disk for two
purposes:
- Track which timeline dirs are incomplete for handling on restart
- Avoid trying to create the same timeline twice at the same time.

The original purpose of handling restarts is now defunct, as we use
remote storage as the source of truth and clean up any trash timeline
dirs on startup. Using the file to mutually exclude creation operations
is error prone compared with just doing it in memory, and the existing
checks happened some way into the creation operation, and could expose
errors as 500s (anyhow::Errors) rather than something clean.

## Summary of changes

- Creations are now mutually excluded in memory (using
`Tenant::timelines_creating`), rather than relying on a file on disk for
coordination.
- Acquiring unique access to the timeline ID now happens earlier in the
request.
- Creating the same timeline which already exists is now a 201: this
simplifies retry handling for clients.
- 409 is still returned if a timeline with the same ID is still being
created: if this happens it is probably because the client timed out an
earlier request and has retried.
- Colliding timeline creation requests should no longer return 500
errors

This paves the way to entirely removing uninit markers in a subsequent
change.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-15 08:51:23 +00:00
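A sketch of the in-memory mutual exclusion idea, using a plain `Mutex<HashSet<_>>` and an RAII guard with placeholder types; the real `Tenant::timelines_creating` may look different:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

type TimelineId = u128; // placeholder for the real ID type

struct CreationGuard {
    creating: Arc<Mutex<HashSet<TimelineId>>>,
    id: TimelineId,
}

impl Drop for CreationGuard {
    fn drop(&mut self) {
        // Creation finished (or failed): allow future creations of this ID.
        self.creating.lock().unwrap().remove(&self.id);
    }
}

/// Returns a guard if we are the only in-flight creation for `id`,
/// or `None` if another creation for the same ID is in progress (the 409 case).
fn start_creating(
    creating: &Arc<Mutex<HashSet<TimelineId>>>,
    id: TimelineId,
) -> Option<CreationGuard> {
    let inserted = creating.lock().unwrap().insert(id);
    inserted.then(|| CreationGuard {
        creating: Arc::clone(creating),
        id,
    })
}
```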
Joonas Koivunen
f010479107 feat(layer): pageserver_layer_redownloaded_after histogram (#6132)
this is aimed at replacing the current mtime-only-based thrashing
alerting later.

Cc: #5331
2023-12-14 21:32:54 +02:00
Conrad Ludgate
cc633585dc gauge guards (#6138)
## Problem

The websockets gauge for active db connections seems to be growing more
than the gauge for client connections over websockets, which does not
make sense.

## Summary of changes

refactor how our counter-pair gauges are represented. Not sure if this
will improve the problem, but it should be harder to mess up the
counters. The API is much nicer now, though, and doesn't require
scopeguard::defer hacks.
2023-12-14 17:21:39 +00:00
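One common way to make counter-pair gauges hard to mess up is an RAII guard whose lifetime ties the increment to the decrement; a generic sketch with atomics (not the proxy's actual metrics types):

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Generic sketch of a "counter pair": `opened - closed` is the current gauge
/// value, while the two monotonic counters are what actually gets exported.
#[derive(Default)]
struct CounterPair {
    opened: AtomicI64,
    closed: AtomicI64,
}

/// RAII guard: constructing it increments `opened`, dropping it increments
/// `closed`, so a forgotten decrement cannot happen.
struct GaugeGuard {
    pair: Arc<CounterPair>,
}

impl CounterPair {
    fn guard(self: Arc<Self>) -> GaugeGuard {
        self.opened.fetch_add(1, Ordering::Relaxed);
        GaugeGuard { pair: self }
    }

    fn current(&self) -> i64 {
        self.opened.load(Ordering::Relaxed) - self.closed.load(Ordering::Relaxed)
    }
}

impl Drop for GaugeGuard {
    fn drop(&mut self) {
        self.pair.closed.fetch_add(1, Ordering::Relaxed);
    }
}

fn example() {
    let db_connections = Arc::new(CounterPair::default());
    {
        let _guard = Arc::clone(&db_connections).guard();
        // ... connection is alive while the guard is in scope ...
        assert_eq!(db_connections.current(), 1);
    }
    assert_eq!(db_connections.current(), 0);
}
```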
Christian Schwarz
aa5581d14f utils::logging: TracingEventCountLayer: don't use with_label_values() on hot path (#6129)
fixes #6126
2023-12-14 16:31:41 +01:00
John Spray
c4e0ef507f pageserver: heatmap uploads (#6050)
Dependency (commits inline):
https://github.com/neondatabase/neon/pull/5842

## Problem

Secondary mode tenants need a manifest of what to download. Ultimately
this will be some kind of heat-scored set of layers, but as a robust
first step we will simply use the set of resident layers: secondary
tenant locations will aim to match the on-disk content of the attached
location.

## Summary of changes

- Add heatmap types representing the remote structure
- Add hooks to Tenant/Timeline for generating these heatmaps
- Create a new `HeatmapUploader` type that is external to `Tenant`, and
responsible for walking the list of attached tenants and scheduling
heatmap uploads.

Notes to reviewers:
- Putting the logic for uploads (and later, secondary mode downloads)
outside of `Tenant` is an opinionated choice, motivated by:
- Enable future smarter scheduling of operations, e.g. uploading the
stalest tenant first, rather than having all tenants compete for a fair
semaphore on a first-come-first-served basis. Similarly for downloads,
we may wish to schedule the tenants with the hottest un-downloaded
layers first.
- Enable accessing upload-related state without synchronization (it
belongs to HeatmapUploader, rather than being some Mutex<>'d part of
Tenant)
- Avoid further expanding the scope of Tenant/Timeline types, which are
already among the largest in the codebase
- You might reasonably wonder how much of the uploader code could be a
generic job manager thing. Probably some of it: but let's defer pulling
that out until we have at least two users (perhaps secondary downloads
will be the second one) to highlight which bits are really generic.

Compromises:
- Later, instead of using digests of heatmaps to decide whether anything
changed, I would prefer to avoid walking the layers in tenants that
don't have changes: tracking that will be a bit invasive, as it needs
input from both remote_timeline_client and Layer.
2023-12-14 13:09:24 +00:00
Conrad Ludgate
6987b5c44e proxy: add more rates to endpoint limiter (#6130)
## Problem

A single rate bucket is limited in usefulness.

## Summary of changes

Introduce a secondary bucket allowing an average of 200 requests per
second over 1 minute, and a tertiary bucket allowing an average of 100
requests per second over 10 minutes.

Configured by using a format like

```sh
proxy --endpoint-rps-limit 300@1s --endpoint-rps-limit 100@10s --endpoint-rps-limit 50@1m
```

If the bucket limits are inconsistent, an error is returned on startup

```
$ proxy --endpoint-rps-limit 300@1s --endpoint-rps-limit 10@10s
Error: invalid endpoint RPS limits. 10@10s allows fewer requests per bucket than 300@1s (100 vs 300)
```
2023-12-13 21:43:49 +00:00
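A simplified sketch of checking several fixed-window buckets at once (the shape of limiter described above, e.g. `300@1s`, `100@10s`, `50@1m`); the real limiter keeps per-endpoint sharded state and is more careful about partial admission than this is:

```rust
use std::time::{Duration, Instant};

/// One fixed window: at most `max` requests per `interval`.
struct RateBucket {
    max: u32,
    interval: Duration,
    window_start: Instant,
    count: u32,
}

impl RateBucket {
    fn new(max: u32, interval: Duration) -> Self {
        Self { max, interval, window_start: Instant::now(), count: 0 }
    }

    /// Returns true if this bucket admits one more request at `now`.
    fn check(&mut self, now: Instant) -> bool {
        if now.duration_since(self.window_start) >= self.interval {
            // Start a fresh window.
            self.window_start = now;
            self.count = 0;
        }
        if self.count < self.max {
            self.count += 1;
            true
        } else {
            false
        }
    }
}

/// All buckets must admit the request, otherwise it is rejected.
/// (This sketch does not roll back counts in earlier buckets on rejection.)
fn check_all(buckets: &mut [RateBucket]) -> bool {
    let now = Instant::now();
    buckets.iter_mut().all(|b| b.check(now))
}
```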
Alexander Bayandin
0cd49cac84 test_compatibility: make it use initdb.tar.zst 2023-12-13 15:04:25 -06:00
Alexander Bayandin
904dff58b5 test_wal_restore_http: cleanup test 2023-12-13 15:04:25 -06:00
Arthur Petukhovsky
f401a21cf6 Fix test_simple_sync_safekeepers
There is a postgres 16 version encoded in a binary message.
2023-12-13 15:04:25 -06:00
Tristan Partin
158adf602e Update Postgres 16 series to 16.1 2023-12-13 15:04:25 -06:00
Tristan Partin
c94db6adbb Update Postgres 15 series to 15.5 2023-12-13 15:04:25 -06:00
Tristan Partin
85720616b1 Update Postgres 14 series to 14.10 2023-12-13 15:04:25 -06:00
George MacKerron
d6fcc18eb2 Add Neon-Batch- headers to OPTIONS response for SQL-over-HTTP requests (#6116)
This is needed to allow use of batch queries from browsers.

## Problem

SQL-over-HTTP batch queries fail from web browsers because the relevant
headers, `Neon-Batch-Isolation-Level` and `Neon-Batch-Read-Only`, are
not included in the server's OPTIONS response. I think we simply forgot
to add them when implementing the batch query feature.

## Summary of changes

Added `Neon-Batch-Isolation-Level` and `Neon-Batch-Read-Only` to the
OPTIONS response.
2023-12-13 17:18:20 +00:00
Vadim Kharitonov
c2528ae671 Increase pgbouncer pool size to 64 for VMs (#6124)
The pool size was changed for pods
(https://github.com/neondatabase/cloud/pull/8057). The idea is to increase
it for VMs too.
2023-12-13 16:23:24 +00:00
Joonas Koivunen
a919b863d1 refactor: remove eviction batching (#6060)
We no longer have `layer_removal_cs` since #5108, so we no longer need
batching.
2023-12-13 18:05:33 +02:00
Joonas Koivunen
2d22661061 refactor: calculate_synthetic_size_worker, remove PRE::NeedsDownload (#6111)
Changes I wanted to make on #6106 but decided to leave out to keep that
commit clean for including in the #6090. Finally remove
`PageReconstructionError::NeedsDownload`.
2023-12-13 14:23:19 +00:00
John Spray
e3778381a8 tests: make test_bulk_insert recreate tenant in same generation (#6113)
## Problem

The test deletes a tenant and recreates it with the same ID. The recreation bumps
the generation number. This could lead to stale generation warnings in the
logs.

## Summary of changes

Handle this more gracefully by re-creating in the same generation that
the tenant was previously attached in.

We could also update the tenant delete path to have the attachment
service drop tenant state on delete, but I like having it there: it
makes debugging easier, and the only time it's a problem is when a test is
re-using a tenant ID after deletion.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2023-12-13 14:14:38 +00:00
Conrad Ludgate
c8316b7a3f simplify endpoint limiter (#6122)
## Problem

1. Using chrono for durations only is wasteful
2. The arc/mutex was not being utilised
3. Locking every shard in the dashmap every GC could cause latency
spikes
4. More buckets

## Summary of changes

1. Use `Instant` instead of `NaiveTime`.
2. Remove the `Arc<Mutex<_>>` wrapper, utilising that dashmap entry
returns mut access
3. Clear only a random shard, update gc interval accordingly
4. Multiple buckets can be checked before allowing access

When I benchmarked the check function, it took on average 811ns when
multithreaded over the course of 10 million checks.
2023-12-13 13:53:23 +00:00
Stas Kelvich
8460654f61 Add per-endpoint rate limiter to proxy 2023-12-13 07:03:21 +02:00
Arpad Müller
7c2c87a5ab Update azure SDK to 0.18 and use open range support (#6103)
* Update `azure-*` crates to 0.18
* Use new open ranges support added by upstream in
https://github.com/Azure/azure-sdk-for-rust/pull/1482

Part of #5567. Prior update PR: #6081
2023-12-12 18:20:12 +01:00
Arpad Müller
5820faaa87 Use extend instead of groups of append calls in tests (#6109)
Repeated calls to `.append` don't line up as nicely, as they might get
formatted in different ways. Also, it is more characters and the lines
might be longer.

Saw this while working on #5912.
2023-12-12 18:00:37 +01:00
John Spray
dfb0a6fdaf scrubber: handle initdb files, fix an issue with prefixes (#6079)
- The code for calculating the prefix in the bucket was expecting a
trailing slash (as it is in the tests), but that's an awkward
expectation to impose for use in the field: make the code more flexible
by only trimming a trailing character if it is indeed a slash.
- initdb archives were detected by the scrubber as malformed layer
files. Teach it to recognize and ignore them.
2023-12-12 16:53:08 +00:00
Alexander Bayandin
6acbee2368 test_runner: add from_repo_dir method (#6087)
## Problem

We need a reliable way to restore a project state (in this context, I
mean data on pageservers, safekeepers, and remote storage) from a
snapshot. The existing method (that we use in `test_compatibility`)
heavily relies on config files, which makes it harder to add/change
fields in the config.
The proposed solution uses config file only to get `default_tenant_id`
and `branch_name_mappings`.

## Summary of changes
- Add `NeonEnvBuilder#from_repo_dir` method, which allows using the
`neon_env_builder` fixture with data from a snapshot.
- Use `NeonEnvBuilder#from_repo_dir` in compatibility tests

Requires for https://github.com/neondatabase/neon/issues/6033
2023-12-12 16:24:13 +00:00
Konstantin Knizhnik
aec1acdbac Do not inherit replication slots in branch (#5898)
## Problem
 
See 
https://github.com/neondatabase/company_projects/issues/111
https://neondb.slack.com/archives/C03H1K0PGKH/p1700166126954079


## Summary of changes

Do not search for AUX_FILES_KEY in parent timelines

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-12-12 14:24:21 +02:00
Konstantin Knizhnik
8bb4a13192 Do not materialize null images in PS (#5979)
## Problem

PG16 writes null images during relation extension.
The pageserver implements an optimisation which replaces a WAL record
containing an FPI with the page image.
So instead of a ~30 byte WAL record we store an 8kB null page image.
And this image is almost useless, because it will most likely be shortly
rewritten with actual page content.

## Summary of changes

Do not materialize WAL records with null page FPIs.
 
## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-12-12 14:23:45 +02:00
Anna Khanova
9e071e4458 Propagate information about the protocol to console (#6102)
## Problem

In snowflake logs there is currently no information about the protocol
that the client uses.

## Summary of changes

Propagate the information about the protocol together with the app_name,
in the format `{app_name}/{sql_over_http/tcp/ws}`.

This will give @stepashka more observability into what our clients are
using.
2023-12-12 11:42:51 +00:00
John Spray
fead836f26 swagger: remove 'format: hex' from tenant IDs (#6099)
## Problem

TenantId is changing to TenantShardId in many APIs. The swagger had
`format: hex` attributes on some of these IDs. That isn't formally
defined anywhere, but a reasonable person might think it means "hex
digits only", which will no longer be the case once we start using
shard-aware IDs (they're like `<tenant_id>-0001`).



## Summary of changes

- Remove these `format` attributes from all `tenant_id` fields in the
swagger definition
2023-12-12 10:39:34 +00:00
John Spray
20e9cf7d31 pageserver: tweaks to slow/hung task logging (#6098)
## Problem

- `shutdown_tasks` would log when a particular task was taking a long
time to shut down, but not when it eventually completed. That left one
uncertain as to whether the slow task was the source of a hang, or just
a precursor.

## Summary of changes

- Add a log line after a slow task shutdown
- Add an equivalent in Gate's `warn_if_stuck`, in case we ever need it.
This isn't related to the original issue but was noticed when checking
through these logging paths.
2023-12-12 07:19:59 +00:00
Joonas Koivunen
3b04f3a749 fix: accidential return Ok (#6106)
An error indicating request cancellation OR timeline shutdown was deemed a
reason to exit the background worker that calculated synthetic size.
Fix it so that it is only used to avoid logging such errors.
2023-12-11 21:27:53 +00:00
Arpad Müller
c49fd69bd6 Add initdb_lsn to TimelineInfo (#6104)
This way, we can query it.

Background: I want to do statistics for how reproducible `initdb_lsn`
really is, see https://github.com/neondatabase/cloud/issues/8284 and
https://neondb.slack.com/archives/C036U0GRMRB/p1701895218280269
2023-12-11 21:08:14 +00:00
Tristan Partin
5ab9592a2d Add submodule paths as safe directories as a precaution
The check-codestyle-rust-arm job requires this for some reason, so let's
just add them everywhere we do this workaround.
2023-12-11 13:08:37 -06:00
Tristan Partin
036558c956 Fix git ownership issue in check-codestyle-rust-arm
We have this workaround for other jobs. Looks like this one was
forgotten about.
2023-12-11 13:08:37 -06:00
John Spray
6a922b1a75 tests: start adding tests for secondary mode, live migration (#5842)
These tests have been loitering on a branch of mine for a while: they
already provide value even without all the secondary mode bits landed
yet, and the Workload helper is handy for other tests too.

- `Workload` is a re-usable test workload that replaces some of the
arbitrary "write a few rows" SQL that I've found my self repeating, and
adds a systematic way to append data and check that reads properly
reflect the changes. This append+validate stuff is important when doing
migrations, as we want to detect situations where we might be reading
from a pageserver that has not properly seen latest changes.
- test_multi_attach is a validation of how the pageserver handles
attaching the same tenant to multiple pageservers, from a safety point
of view. This is intentionally separate from the larger testing of
migration, to provide an isolated environment for multi-attachment.
- test_location_conf_churn is a pseudo-random walk through the various
states that TenantSlot can be put into, with validation that attached
tenants remain externally readable when they should, and as a side
effect validating that the compute endpoint's online configuration
changes work as expected.
- test_live_migration is the reference implementation of how to drive a
pair of pageservers through a zero-downtime migration of a tenant.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-12-11 16:55:43 +00:00
John Spray
f1fc1fd639 pageserver: further refactoring from TenantId to TenantShardId (#6059)
## Problem

In https://github.com/neondatabase/neon/pull/5957, the most essential
types were updated to use TenantShardId rather than TenantId. That
unblocked other work, but didn't fully enable running multiple shards
from the same tenant on the same pageserver.

## Summary of changes

- Use TenantShardId in page cache key for materialized pages
- Update mgr.rs get_tenant() and list_tenants() functions to use a shard
id, and update all callers.
- Eliminate the exactly_one_or_none helper in mgr.rs and all code that
used it
- Convert timeline HTTP routes to use tenant_shard_id

Note on page cache:
```
struct MaterializedPageHashKey {
    /// Why is this TenantShardId rather than TenantId?
    ///
    /// Usually, the materialized value of a page@lsn is identical on any shard in the same tenant.  However, this
    /// is not the case for certain internally-generated pages (e.g. relation sizes).  In future, we may make this
    /// key smaller by omitting the shard, if we ensure that reads to such pages always skip the cache, or are
    /// special-cased in some other way.
    tenant_shard_id: TenantShardId,
    timeline_id: TimelineId,
    key: Key,
}
```
2023-12-11 15:52:33 +00:00
Alexander Bayandin
66a7a226f8 test_runner: use toml instead of formatted strings (#6088)
## Problem

A bunch of refactorings extracted from
https://github.com/neondatabase/neon/pull/6087 (not required for it); 
the most significant one is using toml instead of formatted strings.

## Summary of changes 
- Use toml instead of formatted strings for config
- Skip pageserver log check if `pageserver.log` doesn't exist
- `chmod -x test_runner/regress/test_config.py`
2023-12-11 15:13:27 +00:00
Joonas Koivunen
f0d15cee6f build: update azure-* to 0.17 (#6081)
this is a drive-by upgrade while we refresh the access tokens at the
same time.
2023-12-11 12:21:02 +01:00
Sasha Krassovsky
0ba4cae491 Fix RLS/REPLICATION granting (#6083)
## Problem

## Summary of changes

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2023-12-08 12:55:44 -08:00
Andrew Rudenko
df1f8e13c4 proxy: pass neon options in deep object format (#6068)
---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2023-12-08 19:58:36 +01:00
John Spray
e640bc7dba tests: allow-lists for occasional failures (#6074)
test_creating_tenant_conf_after...
- Test detaches a tenant and then re-attaches immediately: this causes a
race between pending remote LSN update and the generation bump in the
attachment.

test_gc_cutoff:
- Test rapidly restarts a pageserver before one generation has had the
chance to process deletions from the previous generation
2023-12-08 17:32:16 +00:00
Christian Schwarz
cf024de202 virtual_file metrics: expose max size of the fd cache (#6078)
And also leave a comment on how to determine current size.

Kind of follow-up to #6066

refs https://github.com/neondatabase/cloud/issues/8351
refs https://github.com/neondatabase/neon/issues/5479
2023-12-08 17:23:50 +00:00
Conrad Ludgate
e1a564ace2 proxy simplify cancellation (#5916)
## Problem

The cancellation code was confusing and error prone (as seen before in
our memory leaks).

## Summary of changes

* Use the new `TaskTracker` primitve instead of JoinSet to gracefully
wait for tasks to shutdown.
* Updated libs/utils/completion to use `TaskTracker`
* Remove `tokio::select` in favour of `futures::future::select` in a
specialised `run_until_cancelled()` helper function
2023-12-08 16:21:17 +00:00
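The `run_until_cancelled()` helper mentioned above presumably has roughly this shape (a sketch only, assuming `futures` and `tokio_util::sync::CancellationToken`, not the actual proxy code):

```rust
use futures::future::{select, Either};
use tokio_util::sync::CancellationToken;

/// Run `fut` to completion unless `token` is cancelled first.
/// Returns `Some(output)` on completion and `None` on cancellation.
async fn run_until_cancelled<F: std::future::Future>(
    fut: F,
    token: &CancellationToken,
) -> Option<F::Output> {
    // Pin both futures on the stack so they can be used with `select`.
    let fut = std::pin::pin!(fut);
    let cancelled = std::pin::pin!(token.cancelled());
    match select(fut, cancelled).await {
        Either::Left((out, _)) => Some(out),
        Either::Right(((), _)) => None,
    }
}
```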
Christian Schwarz
f5b9af6ac7 page cache: improve eviction-related metrics (#6077)
These changes help with identifying thrashing.

The existing `pageserver_page_cache_find_victim_iters_total` is already
useful, but it doesn't tell us how many individual find_victim() calls
are happening, only how many clock-LRU steps happened in the entire
system, without info about whether we actually needed to evict other
data or just scanned for a long time, e.g. because the cache is large.

The changes in this PR allow us to
1. count each possible outcome separately, especially evictions
2. compute mean iterations per outcome

I don't think anyone except me was paying close attention to
`pageserver_page_cache_find_victim_iters_total` before, so,
I think the slight behavior change of also counting iterations
for the 'iters exceeded' case is fine.

refs https://github.com/neondatabase/cloud/issues/8351
refs https://github.com/neondatabase/neon/issues/5479
2023-12-08 15:27:21 +00:00
John Spray
5e98855d80 tests: update tests that used local_fs&mock_s3 to use one or the other (#6015)
## Problem

This was wasting resources: if we run a test with mock s3 we don't then
need to run it again with local fs. When we're running in CI, we don't
need to run with the mock/local storage as well as real S3. There is
some value in having CI notice/spot issues that might otherwise only
happen when running locally, but that doesn't justify the cost of
running the tests so many more times on every PR.

## Summary of changes

- For tests that used available_remote_storages or
available_s3_storages, update them to either specify no remote storage
(therefore inherit the default, which is currently local fs), or to
specify s3_storage() for the tests that actually want an S3 API.
2023-12-08 14:52:37 +00:00
Conrad Ludgate
699049b8f3 proxy: make auth more type safe (#5689)
## Problem

a5292f7e67/proxy/src/auth/backend.rs (L146-L148)

a5292f7e67/proxy/src/console/provider/neon.rs (L90)

a5292f7e67/proxy/src/console/provider/neon.rs (L154)

## Summary of changes

1. Test backend is only enabled on `cfg(test)`.
2. Postgres mock backend + MD5 auth keys are only enabled on
`cfg(feature = testing)`
3. Password hack and cleartext flow will have their passwords validated
before proceeding.
4. Distinguish between ClientCredentials with endpoint and without,
removing many panics in the process
2023-12-08 11:48:37 +00:00
John Spray
2c544343e0 pageserver: filtered WAL ingest for sharding (#6024)
## Problem

Currently, if one creates many shards they will all ingest all the data:
not much use! We want them to ingest a proportional share of the data
each.

Closes: #6025

## Summary of changes

- WalIngest object gets a copy of the ShardIdentity for the Tenant it
was created by.
- While iterating the `blocks` part of a decoded record, blocks that do
not match the current shard are ignored, apart from on shard zero where
they are used to update relation sizes in `observe_decoded_block` (but
not stored).
- Before committing a `DataDirModification` from a WAL record, we check
if it's empty, and drop the record if so. This check is necessary
(rather than just looking at the `blocks` part) because certain record
types may modify blocks in non-obvious ways (e.g.
`ingest_heapam_record`).
- Add WAL ingest metrics to record the total received, total committed,
and total filtered out
- Behaviour for unsharded tenants is unchanged: they will continue to
ingest all blocks, and will take the fast path through `is_key_local`
that doesn't bother calculating any hashes.

After this change, shards store a subset of the tenant's total data, and
accurate relation sizes are only maintained on shard zero.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-12-08 10:12:37 +00:00
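As a toy illustration of the filtering step described above: the real `ShardIdentity::is_key_local` uses a specific key hash and stripe layout, which the plain modulo below does not reproduce:

```rust
/// Toy shard identity: `count` shards, this instance is shard `number`.
struct ShardIdentity {
    number: u32,
    count: u32,
}

impl ShardIdentity {
    /// Unsharded tenants take the fast path and keep everything.
    fn is_key_local(&self, key_hash: u64) -> bool {
        self.count <= 1 || (key_hash % self.count as u64) == self.number as u64
    }
}

/// Keep only the blocks of a decoded record that belong to this shard;
/// if nothing is left, the record can be dropped before committing it.
fn filter_blocks(shard: &ShardIdentity, block_hashes: Vec<u64>) -> Vec<u64> {
    block_hashes
        .into_iter()
        .filter(|h| shard.is_key_local(*h))
        .collect()
}
```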
Arseny Sher
193e60e2b8 Fix/edit pgindent confusing places in neon. 2023-12-08 14:03:13 +04:00
Arseny Sher
1bbd6cae24 pgindent pgxn/neon 2023-12-08 14:03:13 +04:00
Arseny Sher
65f48c7002 Make targets to run pgindent on core and neon extension. 2023-12-08 14:03:13 +04:00
Alexander Bayandin
d9d8e9afc7 test_tenant_reattach: fix reattach mode names (#6070)
## Problem

Ref
https://neondb.slack.com/archives/C033QLM5P7D/p1701987609146109?thread_ts=1701976393.757279&cid=C033QLM5P7D

## Summary of changes
- Make reattach mode names unique for `test_tenant_reattach`
2023-12-08 08:39:45 +00:00
Arpad Müller
7914eaf1e6 Buffer initdb.tar.zst to a temporary file before upload (#5944)
In https://github.com/neondatabase/neon/pull/5912#pullrequestreview-1749982732 , Christian liked the idea of using files instead of buffering the
archive to RAM for the *download* path. This is for the upload path,
which is a very similar situation.
2023-12-08 03:33:44 +01:00
Joonas Koivunen
37fdbc3aaa fix: use larger buffers for remote storage (#6069)
We currently use 8kB buffers; raise that to 32kB to hopefully cut
`spawn_blocking` usage to 1/4. Also a drive-by fix of the last `tokio::io::copy`
to `tokio::io::copy_buf`.
2023-12-07 19:36:44 +00:00
Tristan Partin
7aa1e58301 Add support for Python 3.12 2023-12-07 12:30:42 -06:00
Christian Schwarz
f2892d3798 virtual_file metrics: distinguish first and subsequent open() syscalls (#6066)
This helps with identifying thrashing.

I don't love the name, but, there is already "close-by-replace".

While reading the code, I also found a case where we waste
work in a cache pressure situation:
https://github.com/neondatabase/neon/issues/6065

refs https://github.com/neondatabase/cloud/issues/8351
2023-12-07 16:17:33 +00:00
Joonas Koivunen
b492cedf51 fix(remote_storage): buffering, by using streams for upload and download (#5446)
There is 8KiB of double buffering in remote_storage and in pageserver
when using `tokio::io::copy` to read a `BufReader<ReaderStream<_>>`.

Switches downloads and uploads to use `Stream<Item =
std::io::Result<Bytes>>`. Caller and only caller now handles setting up
buffering. For reading, `Stream<Item = ...>` is also a `AsyncBufRead`,
so when writing to a file, we now have `tokio::io::copy_buf` reading
full buffers and writing them to `tokio::io::BufWriter` which handles
the buffering before dispatching over to `tokio::fs::File`.

Additionally implements streaming uploads for azure. With azure
downloads are a bit nicer than before, but not much; instead of one huge
vec they just hold on to N allocations we got over the wire.

This PR will also make it trivial to switch reading and writing to
io-uring based methods.

Cc: #5563.
2023-12-07 15:52:22 +00:00
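The download side described above, sketched with `tokio_util::io::StreamReader` feeding `tokio::io::copy_buf` into a `BufWriter` (paths and error handling simplified; not the actual remote_storage code):

```rust
use bytes::Bytes;
use futures::Stream;
use tokio::fs::File;
use tokio::io::{AsyncWriteExt, BufWriter};
use tokio_util::io::StreamReader;

/// Sketch: write a stream of Bytes chunks to a file with a single layer of
/// buffering on the write side; `copy_buf` hands full chunks to the writer.
async fn download_to_file<S>(stream: S, path: &std::path::Path) -> std::io::Result<u64>
where
    S: Stream<Item = std::io::Result<Bytes>> + Unpin,
{
    // The stream of chunks doubles as an AsyncBufRead.
    let mut reader = StreamReader::new(stream);
    let file = File::create(path).await?;
    let mut writer = BufWriter::new(file);
    let n = tokio::io::copy_buf(&mut reader, &mut writer).await?;
    writer.flush().await?;
    Ok(n)
}
```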
John Spray
880663f6bc tests: use tenant_create() helper in test_bulk_insert (#6064)
## Problem

Since #5449 we enable generations in tests by default. Running
benchmarks was missed while merging that PR, and there was one that
needed updating.

## Summary of changes

Make test_bulk_insert use the proper generation-aware helper for tenant
creation.
2023-12-07 14:52:16 +00:00
John Spray
e89e41f8ba tests: update for tenant generations (#5449)
## Problem

Some existing tests are written in a way that's incompatible with tenant
generations.

## Summary of changes

Update all the tests that need updating: this is things like calling
through the NeonPageserver.tenant_attach helper to get a generation
number, instead of calling directly into the pageserver API. There are
various more subtle cases.
2023-12-07 12:27:16 +00:00
Conrad Ludgate
f9401fdd31 proxy: fix channel binding error messages (#6054)
## Problem

For channel binding failed messages we were still saying "channel
binding not supported" in the errors.

## Summary of changes

Fix error messages
2023-12-07 11:47:16 +00:00
Joonas Koivunen
b7ffe24426 build: update tokio to 1.34.0, tokio-utils 0.7.10 (#6061)
We should still remember to bump the minimum versions of these crates for
libraries beginning to use the task tracker.
2023-12-07 11:31:38 +00:00
Joonas Koivunen
52718bb8ff fix(layer): metric splitting, span rename (#5902)
Per [feedback], split the Layer metrics, also finally account for lost
and [re-submitted feedback] on `layer_gc` by renaming it to
`layer_delete`, `Layer::garbage_collect_on_drop` renamed to
`Layer::delete_on_drop`. References to "gc" dropped from metric names
and elsewhere.

Also fixes how the cancellations were tracked: there was one rare
counter. Now there is a top level metric for cancelled inits, and the
rare "download failed but failed to communicate" counter is kept.

Fixes: #6027

[feedback]: https://github.com/neondatabase/neon/pull/5809#pullrequestreview-1720043251
[re-submitted feedback]: https://github.com/neondatabase/neon/pull/5108#discussion_r1401867311
2023-12-07 11:39:40 +02:00
Joonas Koivunen
10c77cb410 temp: increase the wait tenant activation timeout (#6058)
5s is causing way too much noise; this is of course a temporary fix. We
should prioritize tenants for which there are pagestream openings the
highest, with basebackups second.

Deployment thread for context:
https://neondb.slack.com/archives/C03H1K0PGKH/p1701935048144479?thread_ts=1701765158.926659&cid=C03H1K0PGKH
2023-12-07 09:01:08 +00:00
Heikki Linnakangas
31be301ef3 Make simple_rcu::RcuWaitList::wait() async (#6046)
The gc_timeline() function is async, but it calls the synchronous wait()
function. In the worst case, that could lead to a deadlock by using up
all tokio executor threads.

In passing, fix a few typos in comments.

Fixes issue #6045.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-07 10:20:40 +02:00
Joonas Koivunen
a3c7d400b4 fix: avoid allocations with logging a slug (#6047)
to_string forces allocating a less-than-pointer-sized string (costing
4 usizes on the stack); using a Display-formattable slug saves that. The
difference seems small, but at the same time, we log these a lot.
2023-12-07 07:25:22 +00:00
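The difference the commit describes, in miniature (the types are made up):

```rust
use std::fmt;

struct RequestSlug<'a> {
    tenant: &'a str,
    timeline: &'a str,
}

impl fmt::Display for RequestSlug<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}/{}", self.tenant, self.timeline)
    }
}

fn log_both(slug: &RequestSlug<'_>) {
    // Allocates an intermediate String on every call:
    println!("request {}", slug.to_string());
    // Formats directly into the output, no intermediate allocation:
    println!("request {}", slug);
}
```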
Vadim Kharitonov
7501ca6efb Revert timescaledb for pg14 and pg15 (#6056)
```
could not start the compute node: compute is in state "failed": db error: ERROR: could not access file "$libdir/timescaledb-2.10.1": No such file or directory Caused by: ERROR: could not access file "$libdir/timescaledb-2.10.1": No such file or directory
```
2023-12-06 15:12:36 +00:00
Christian Schwarz
987c9aaea0 virtual_file: fix the metric for close() calls done by VirtualFile::drop (#6051)
Before this PR we would inc() the counter for `Close` even though the
slot's FD had already been closed.

Especially visible when subtracting `open` from `close+close-by-replace`
on a system that does a lot of attach and detach.

refs https://github.com/neondatabase/cloud/issues/8440
refs https://github.com/neondatabase/cloud/issues/8351
2023-12-06 12:05:28 +00:00
Konstantin Knizhnik
7fab731f65 Track size of FSM fork while applying records at replica (#5901)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1700560921471619

## Summary of changes

Update relation size cache for FSM fork in WAL records filter

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-12-05 18:49:24 +02:00
John Spray
483caa22c6 pageserver: logging tweaks (#6039)
- The `Attaching tenant` log message omitted some useful information
like the generation and mode
- info-level messages about writing configuration files were
unnecessarily verbose
- During process shutdown, we don't emit logs about the various phases:
this is very cheap to log since we do it once per process lifetime, and
is helpful when figuring out where something got stuck during a hang.
2023-12-05 16:11:15 +00:00
John Spray
da5e03b0d8 pageserver: add a /reset API for tenants (#6014)
## Problem

Traditionally we would detach/attach directly with curl if we wanted to
"reboot" a single tenant. That's kind of inconvenient these days,
because one needs to know a generation number to issue an attach
request.

Closes: https://github.com/neondatabase/neon/issues/6011

## Summary of changes

- Introduce a new `/reset` API, which remembers the LocationConf from
the current attachment so that callers do not have to work out the
correct configuration/generation to use.
- As an additional support tool, allow an optional `drop_cache` query
parameter, for situations where we are concerned that some on-disk state
might be bad and want to clear that as well as the in-memory state.

One might wonder why I didn't call this "reattach" -- it's because
there's already a PS->CP API of that name and it could get confusing.
2023-12-05 15:38:27 +00:00
John Spray
be885370f6 pageserver: remove redundant unsafe_create_dir_all (#6040)
This non-fsyncing analog to our safe directory creation function was
just duplicating what tokio's fs::create_dir_all does.
2023-12-05 15:03:07 +00:00
Alexey Kondratov
bc1020f965 compute_ctl: Notify waiters when Postgres failed to start (#6034)
In case of configuring the empty compute, API handler is waiting on
condvar for compute state change. Yet, previously if Postgres failed to
start we were just setting compute status to `Failed` without notifying.
It causes a timeout on control plane side, although we can return a
proper error from compute earlier.

With this commit API handler should be properly notified.
2023-12-05 13:38:45 +01:00
John Spray
61fe9d360d pageserver: add Key->Shard mapping logic & use it in page service (#5980)
## Problem

When a pageserver receives a page service request identified by
TenantId, it must decide which `Tenant` object to route it to.

As in earlier PRs, this stuff is all a no-op for tenants with a single
shard: calls to `is_key_local` always return true without doing any
hashing on a single-shard ShardIdentity.

Closes: https://github.com/neondatabase/neon/issues/6026

## Summary of changes

- Carry immutable `ShardIdentity` objects in Tenant and Timeline. These
provide the information that Tenants/Timelines need to figure out which
shard is responsible for which Key.
- Augment `get_active_tenant_with_timeout` to take a `ShardSelector`
specifying how the shard should be resolved for this tenant. This mode
depends on the kind of request (e.g. basebackups always go to shard
zero).
- In `handle_get_page_at_lsn_request`, handle the case where the
Timeline we looked up at connection time is not the correct shard for
the page being requested. This can happen whenever one node holds
multiple shards for the same tenant. This is currently written as a
"slow path" with the optimistic expectation that usually we'll run with
one shard per pageserver, and the Timeline resolved at connection time
will be the one serving page requests. There is scope for optimization
here later, to avoid doing the full shard lookup for each page.
- Omit consumption metrics from nonzero shards: only the 0th shard is
responsible for tracing accurate relation sizes.

Note to reviewers:
- Testing of these changes is happening separately on the
`jcsp/sharding-pt1` branch, where we have hacked neon_local etc needed
to run a test_pg_regress.
- The main caveat to this implementation is that page service
connections still look up one Timeline when the connection is opened,
before they know which pages are going to be read. If there is one shard
per pageserver then this will always also be the Timeline that serves
page requests. However, if multiple shards are on one pageserver then
get page requests will incur the cost of looking up the correct Timeline
on each getpage request. We may look to improve this in future with a
"sticky" timeline per connection handler so that subsequent requests for
the same Timeline don't have to look up again, and/or by having postgres
pass a shard hint when connecting. This is tracked in the "Loose ends"
section of https://github.com/neondatabase/neon/issues/5507
2023-12-05 12:01:55 +00:00
Conrad Ludgate
f60e49fe8e proxy: fix panic in startup packet (#6032)
## Problem

Panic when less than 8 bytes is presented in a startup packet.

## Summary of changes

We need there to be a 4 byte message code, so the expected min length is
8.
2023-12-05 11:24:16 +01:00
Anna Khanova
c48918d329 Rename metric (#6030)
## Problem

It looks like, because of the reallocation of the buckets in a previous PR, the
metric is broken in Grafana.

## Summary of changes

Renamed the metric.
2023-12-05 10:03:07 +00:00
Sasha Krassovsky
bad686bb71 Remove trusted from wal2json (#6035)
## Problem

## Summary of changes
2023-12-04 21:10:23 +00:00
Alexey Kondratov
85d08581ed [compute_ctl] Introduce feature flags in the compute spec (#6016)
## Problem

In the past we've rolled out all new `compute_ctl` functionality right
to all users, which could be risky. I want to have more fine-grained
control over what we enable, in which env, and for which users.

## Summary of changes

Add an option to pass a list of feature flags to `compute_ctl`. If not
passed, it defaults to an empty list. Any unknown flags are ignored.

This allows us to release new experimental features safer, as we can
then flip the flag for one specific user, only Neon employees, free /
pro / etc. users and so on. Or control it per environment.

In the current implementation feature flags are passed via compute spec,
so they do not allow controlling behavior of `empty` computes. For them,
we can either stick with the previous approach, i.e. add separate cli
args or introduce a more generic `--features` cli argument.
2023-12-04 19:54:18 +01:00
Christian Schwarz
c7f1143e57 concurrency-limit low-priority initial logical size calculation [v2] (#6000)
Problem
-------

Before this PR, there was no concurrency limit on initial logical size
computations.

While logical size computations are lazy in theory, in practice
(production), they happen in a short timeframe after restart.

This means that on a PS with 20k tenants, we'd have up to 20k concurrent
initial logical size calculation requests.

This is self-inflicted needless overload.

This hasn't been a problem so far because the `.await` points on the
logical size calculation path never return `Pending`, hence we have a
natural concurrency limit of the number of executor threads.
But, as soon as we return `Pending` somewhere in the logical size
calculation path, other concurrent tasks get scheduled by tokio.
If these other tasks are also logical size calculations, they eventually
pound on the same bottleneck.

For example, in #5479, we want to switch the VirtualFile descriptor
cache to a `tokio::sync::RwLock`, which makes us return `Pending`, and
without measures like this patch, after PS restart, VirtualFile
descriptor cache thrashes heavily for 2 hours until all the logical size
calculations have been computed and the degree of concurrency /
concurrent VirtualFile operations is down to regular levels.
See the *Experiment* section below for details.

<!-- Experiments (see below) show that plain #5479 causes heavy
thrashing of the VirtualFile descriptor cache.
The high degree of concurrency is too much for 
In the case of #5479 the VirtualFile descriptor cache size starts
thrashing heavily.


-->

Background
----------

Before this PR, initial logical size calculation was spawned lazily on
first call to `Timeline::get_current_logical_size()`.

In practice (prod), the lazy calculation is triggered by
`WalReceiverConnectionHandler` if the timeline is active according to
storage broker, or by the first iteration of consumption metrics worker
after restart (`MetricsCollection`).

The spawns by walreceiver are high-priority because logical size is
needed by Safekeepers (via walreceiver `PageserverFeedback`) to enforce
the project logical size limit.
The spawns by metrics collection are not on the user-critical path and
hence low-priority. [^consumption_metrics_slo]

[^consumption_metrics_slo]: We can't delay metrics collection
indefinitely because there are TBD internal SLOs tied to metrics
collection happening in a timely manner
(https://github.com/neondatabase/cloud/issues/7408). But let's ignore
that in this issue.

The ratio of walreceiver-initiated spawns vs
consumption-metrics-initiated spawns can be reconstructed from logs
(`spawning logical size computation from context of task kind {:?}"`).
PRs #5995 and #6018 add metrics for this.

A first investigation of the ratio led to the discovery that walreceiver
spawns 75% of init logical size computations.
That's because of two bugs:
- In Safekeepers: https://github.com/neondatabase/neon/issues/5993
- In interaction between Pageservers and Safekeepers:
https://github.com/neondatabase/neon/issues/5962

The safekeeper bug is likely primarily responsible but we don't have the
data yet. The metrics will hopefully provide some insights.

When assessing production-readiness of this PR, please assume that
neither of these bugs is fixed yet.


Changes In This PR
------------------

With this PR, initial logical size calculation is reworked as follows:

First, all initial logical size calculation task_mgr tasks are started
early, as part of timeline activation, and run a retry loop with long
back-off until success. This removes the lazy computation; it was
needless complexity because in practice we compute all logical sizes
anyway, since consumption metrics collects them.

Second, within the initial logical size calculation task, each attempt
queues behind the background loop concurrency limiter semaphore. This
fixes the performance issue that we pointed out in the "Problem" section
earlier.

Third, there is a twist to queuing behind the background loop
concurrency limiter semaphore. Logical size is needed by Safekeepers
(via walreceiver `PageserverFeedback`) to enforce the project logical
size limit. However, we currently do open walreceiver connections even
before we have an exact logical size. That's bad, and I'll build on top
of this PR to fix that
(https://github.com/neondatabase/neon/issues/5963). But, for the
purposes of this PR, we don't want to introduce a regression, i.e., we
don't want to provide an exact value later than before this PR. The
solution is to introduce a priority-boosting mechanism
(`GetLogicalSizePriority`), allowing callers of
`Timeline::get_current_logical_size` to specify how urgently they need
an exact value. The effect of specifying high urgency is that the
initial logical size calculation task for the timeline will skip the
concurrency limiting semaphore. This should yield effectively the same
behavior as we had before this PR with lazy spawning.
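
A rough sketch of that skip-the-limiter behavior, with made-up names standing in for the real pageserver types:

```
use std::sync::Arc;
use tokio::sync::Semaphore;

// Illustrative only: low-priority (background) attempts wait for a permit from
// the shared background-loop semaphore; urgent callers skip the limiter.
enum SizePriority {
    Background,
    Urgent,
}

async fn initial_logical_size_attempt(limiter: Arc<Semaphore>, prio: SizePriority) -> u64 {
    let _permit = match prio {
        SizePriority::Background => Some(
            limiter
                .acquire_owned()
                .await
                .expect("background-loop semaphore closed"),
        ),
        SizePriority::Urgent => None, // walreceiver-style callers bypass the queue
    };
    // ... the actual size calculation would run here ...
    0
}
```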

Last, the priority-boosting mechanism obsoletes the `init_order`'s grace
period for initial logical size calculations. It's a separate commit to
reduce the churn during review. We can drop that commit if people think
it's too much churn, and commit it later once we know this PR here
worked as intended.

Experiment With #5479 
---------------------

I validated this PR combined with #5479 to assess whether we're making
forward progress towards asyncification.

The setup is an `i3en.3xlarge` instance with 20k tenants, each with one
timeline that has 9 layers.
All tenants are inactive, i.e., not known to SKs nor storage broker.
This means all initial logical size calculations are spawned by
consumption metrics `MetricsCollection` task kind.
The consumption metrics worker starts requesting logical sizes at low
priority immediately after restart. This is achieved by deleting the
consumption metrics cache file on disk before starting
PS.[^consumption_metrics_cache_file]

[^consumption_metrics_cache_file]: The consumption metrics worker persists
its reporting interval across PS restarts to keep reporting intervals stable;
delete the state file on disk to get predictable (and, I believe, worst-case
in terms of concurrency during PS restart) behavior.

Before this patch, all of these timelines would do their initial
logical size calculation in parallel, leading to extreme thrashing in
the page cache and virtual file cache.

With this patch, the virtual file cache thrashing is reduced
significantly (from 80k `open`-system-calls/second to ~500
`open`-system-calls/second during loading).


### Critique

The obvious critique of the above experiment is that there is no skipping
of the semaphore, i.e., the priority-boosting aspect of this PR is not
exercised.

If even just 1% of our 20k tenants in the setup were active in
SK/storage_broker, then 200 logical size calculations would skip the
limiting semaphore immediately after restart and run concurrently.

Further critique: given the two bugs wrt timeline inactive vs active
state that were mentioned in the Background section, we could have 75%
of our 20k tenants being (falsely) active on restart.

So... (next section)

This Doesn't Make Us Ready For Async VirtualFile
------------------------------------------------

This PR is a step towards asynchronous `VirtualFile`, aka, #5479 or even
#4744.

But it doesn't yet enable us to ship #5479.

The reason is that this PR doesn't limit the number of high-priority
logical size computations.
If many high-priority logical size calculations are requested, we'll
fall over just as we would if #5479 were applied without this PR.
And currently, at the very least due to the bugs mentioned in the Background
section, we run thousands of high-priority logical size calculations on
PS startup in prod.

So, at a minimum, we need to fix these bugs.

Then we can ship #5479 and #4744, and things will likely be fine under
normal operation.

But in high-traffic situations, overload problems will still be more
likely to happen, e.g., VirtualFile cache descriptor thrashing.
The solution candidates for that are orthogonal to this PR though:
* global concurrency limiting
* per-tenant rate limiting => #5899
* load shedding
* scaling bottleneck resources (fd cache size (neondatabase/cloud#8351),
page cache size(neondatabase/cloud#8351), spread load across more PSes,
etc)

Conclusion
----------

Even with the remarks in the previous section, we should merge this
PR because:
1. it's an improvement over the status quo (esp. if the aforementioned
bugs wrt timeline active / inactive are fixed)
2. it prepares the way for
https://github.com/neondatabase/neon/pull/6010
3. it gets us close to shipping #5479 and #4744
2023-12-04 17:22:26 +00:00
Christian Schwarz
7403d55013 walredo: stderr cleanup & make explicitly cancel safe (#6031)
# Problem

I need walredo to be cancellation-safe for
https://github.com/neondatabase/neon/pull/6000#discussion_r1412049728

# Solution

We are only an `async fn` because of
`wait_for(stderr_logger_task_done).await`, added in #5560.

The `stderr_logger_cancel` and `stderr_logger_task_done` were there as a
precaution in case the stderr logger task for some reason did not stop
when the walredo process terminated.
That hasn't been a problem in practice.
So, simplify things:
- remove `stderr_logger_cancel` and the
`wait_for(...stderr_logger_task_done...)`
- use `tokio::process::ChildStderr` in the stderr logger task
- add metrics to track the number of running stderr logger tasks so that, in
case I'm wrong here, we can use these metrics to identify the issue (not
planning to put them into a dashboard or anything)
2023-12-04 16:06:41 +00:00
Anna Khanova
12f02523a4 Enable dynamic rate limiter (#6029)
## Problem

Limit the number of open connections between the control plane and
proxy.

## Summary of changes

Enable dynamic rate limiter in prod.

Unfortunately the latency metrics are a bit broken, but from the logs I see
that on staging over the past 7 days the acquire latency exceeded 1ms only
twice (in most cases it is insignificant).
2023-12-04 15:00:24 +00:00
Arseny Sher
207c527270 Safekeepers: persist state before timeline deactivation.
Without it, we sometimes lose the latest remote_consistent_lsn on restart,
which leads to excessive ps -> sk reconnections.

https://github.com/neondatabase/neon/issues/5993
2023-12-04 18:22:36 +04:00
John Khvatov
eae49ff598 Perform L0 compaction before creating new image layers (#5950)
If there are too many L0 layers before compaction, the compaction
process becomes slow because of slow `Timeline::get`. As a result of the
slowdown, the pageserver will generate even more L0 layers for the next
iteration, further exacerbating the slow performance.

Change to perform L0 -> L1 compaction before creating new images. This
simple change speeds up compaction and `Timeline::get` by 5x, since
`Timeline::get` is faster on top of L1 layers.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-04 12:35:09 +00:00
Alexander Bayandin
e6b2f89fec test_pg_clients: fix test that reads from stdout (#6021)
## Problem

`test_pg_clients` reads the actual result from a *.stdout file;
https://github.com/neondatabase/neon/pull/5977 added a header to
such files, so `test_pg_clients` started to fail.

## Summary of changes
- Use `capture_stdout` and compare the expected result with the output
instead of *.stdout file content
2023-12-04 11:18:41 +00:00
John Spray
1d81e70d60 pageserver: tweak logs for index_part loading (#6005)
## Problem

On pageservers upgraded to enable generations, these INFO level logs
were rather frequent. If a tenant timeline hasn't written new layers
since the upgrade, it will emit the "No index_part.json*" log every time
it starts.

## Summary of changes

- Downgrade two log lines from info to debug
- Add a tiny unit test that I wrote for sanity-checking that there
wasn't something wrong with our Generation-comparing logic when loading
index parts.
2023-12-04 09:57:47 +00:00
Anastasia Lubennikova
e3512340c1 Override neon.max_cluster_size for the time of compute_ctl (#5998)
Temporarily reset neon.max_cluster_size to avoid
the possibility of hitting the limit, while we are applying config:
creating new extensions, roles, etc...
2023-12-03 15:21:44 +00:00
Christian Schwarz
e43cde7aba initial logical size: remove CALLS metric from hot path (#6018)
The metric was only introduced a few hours ago (#5995). Taking a look at the
numbers from staging, I realized that `get_current_logical_size()` is on the
walingest hot path: we call it for every `ReplicationMessage::XLogData`
that we receive.

Since the metric is global, it would be quite a busy cache line.

This PR replaces it with a new metric purpose-built for what's most
interesting right now.
2023-12-01 22:45:04 +01:00
Alexey Kondratov
c1295bfb3a [compute_ctl] User correct HTTP code in the /configure errors (#6017)
It was using `PRECONDITION_FAILED` for errors during `ComputeSpec` to
`ParsedSpec` conversion, but this disobeys the OpenAPI spec [1]; the
correct code is `BAD_REQUEST` for any spec processing errors.

While on it, I also noticed that `compute_ctl` OpenAPI spec has an
invalid format and fixed it.

[1] fd81945a60/compute_tools/src/http/openapi_spec.yaml (L119-L120)
2023-12-01 18:19:55 +01:00
Joonas Koivunen
711425cc47 fix: use create_new instead of create for mutex file (#6012)
Using create_new makes the uninit marker work as a mutual exclusion
primitive. Hopefully this is temporary.
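
In std terms, the difference is roughly the following (a sketch, not the actual pageserver code):

```
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

// With create_new(true), the open fails with ErrorKind::AlreadyExists if the
// uninit marker is already present, so only one creation attempt can win it;
// plain create(true) would silently reuse the existing file.
fn claim_uninit_marker(path: &Path) -> io::Result<File> {
    OpenOptions::new().write(true).create_new(true).open(path)
}
```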
2023-12-01 18:30:51 +02:00
bojanserafimov
fd81945a60 Use TEST_OUTPUT envvar in pageserver (#5984) 2023-12-01 09:16:24 -05:00
bojanserafimov
e49c21a3cd Speed up rel extend (#5983) 2023-12-01 09:11:41 -05:00
Anastasia Lubennikova
92e7cd40e8 add sql_exporter to vm-image (#5949)
expose LFC metrics
2023-12-01 13:40:49 +00:00
Alexander Bayandin
7eabfc40ee test_runner: use separate directory for each rerun (#6004)
## Problem

While investigating https://github.com/neondatabase/neon/issues/5854, we
hypothesised that logs/repo-dir from the initial failure might leak into
reruns. Use different directories for each run to avoid such a
possibility.

## Summary of changes
- make each test rerun use different directories
- update `pytest-rerunfailure` plugin from 11.1.2 to 13.0
2023-12-01 13:26:19 +00:00
Christian Schwarz
ce1652990d logical size: better represent level of accuracy in the type system (#5999)
I would love to not expose the inaccurate value in the mgmt API at all,
and in fact the control plane doesn't use it [^1].
But our tests do, and I have no desire to change them at this time.

[^1]: https://github.com/neondatabase/cloud/pull/8317
2023-12-01 14:16:29 +01:00
Christian Schwarz
8cd28e1718 logical size calculation: make .current_size() infallible (#5999)
... by panicking on overflow.

It was made fallible initially due to a lack of confidence in the logical size
calculation. However, the error has never happened in the time I have been at Neon.

Let's stop worrying about this by converting the overflow check into a panic.
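
Roughly the shape of the change, with simplified types (a sketch, not the actual accessor):

```
// Instead of returning a Result, treat overflow/underflow of the incremental
// size bookkeeping as a bug and panic.
fn current_size(initial: u64, incremental_delta: i64) -> u64 {
    u64::try_from(i128::from(initial) + i128::from(incremental_delta))
        .expect("logical size accounting overflowed u64")
}
```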
2023-12-01 14:16:29 +01:00
Christian Schwarz
1c88824ed0 initial logical size calculation: add a bunch of metrics (#5995)
These will help us answer questions such as:
- when & at what do calculations get started after PS restart?
- how often is the api to get current incrementally-computed logical
  size called, and does it return Exact vs Approximate?

I'd also be interested in a histogram of how much wall clock
time size calculations take, but I don't know good bucket sizes,
and logging it would introduce yet another per-timeline log
message during startup; I don't think that's worth it just yet.

Context

- https://neondb.slack.com/archives/C033RQ5SPDH/p1701197668789769
- https://github.com/neondatabase/neon/issues/5962
- https://github.com/neondatabase/neon/issues/5963
- https://github.com/neondatabase/neon/pull/5955
- https://github.com/neondatabase/cloud/issues/7408
2023-12-01 12:52:59 +01:00
Arpad Müller
1ce1c82d78 Clean up local state if index_part.json request gives 404 (#6009)
If `index_part.json` is (verifiably) not present in remote storage, we
should regard the timeline as nonexistent. This lets `clean_up_timelines`
purge the partial local disk state, which is important in the case of
incomplete creations leaving behind state that hinders retries. For
incomplete deletions, we also want the timeline's local disk content to be
gone completely.

The PR removes the allowed warnings added by #5390 and #5912, as we now
are only supposed to issue info level messages. It also adds a
reproducer for #6007, by parametrizing the
`test_timeline_init_break_before_checkpoint_recreate` test added by
#5390. If one reverts the .rs changes, the "cannot create its uninit
mark file" log line occurs once one comments out the failing checks for
the local disk state being actually empty.

Closes #6007

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-01 10:58:06 +00:00
Vadim Kharitonov
f784e59b12 Update timescaledb to 2.13.0 (#5975)
TimescaleDB has released 2.13.0. This version is compatible with
Postgres 16.
2023-11-30 17:12:52 -06:00
Arpad Müller
b71b8ecfc2 Add existing_initdb_timeline_id param to timeline creation (#5912)
This PR adds an `existing_initdb_timeline_id` option to timeline
creation APIs, taking an optional timeline ID.

Follow-up of  #5390.

If the `existing_initdb_timeline_id` option is specified via the HTTP
API, the pageserver downloads the existing initdb archive from the given
timeline ID and extracts it, instead of running initdb itself.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-30 22:32:04 +01:00
Arpad Müller
3842773546 Correct RFC number for Pageserver WAL DR RFC (#5997)
When I opened #5248, 27 was an unused RFC number. Since then, two RFCs
have been merged, so now 27 is taken. 29 is free though, so move it
there.
2023-11-30 21:01:25 +00:00
Conrad Ludgate
f39fca0049 proxy: chore: replace strings with SmolStr (#5786)
## Problem

no problem

## Summary of changes

Replaces boxstr with arcstr, as it's cheaper to clone; a mild perf
improvement.

We should probably look into other small-string optimisations, as they will
likely be even better. The longest endpoint name I was able to construct
is something like `ep-weathered-wildflower-12345678`, which is 32 bytes.
Most string optimisations top out at 23 bytes.
2023-11-30 20:52:30 +00:00
Joonas Koivunen
b451e75dc6 test: include cmdline in captured output (#5977)
Aiming to make a bunch of `.stdout` and `.stderr` files faster to understand;
see the example echo_1.stdout differences:

```
+# echo foobar abbacd
+
 foobar abbacd
```

It can be disabled, and is disabled in this PR for some tests; use
`pg_bin.run_capture(..., with_command_header=False)` for that.

As a bonus, this cleans up the echoed newlines from s3_scrubber
output, which are also saved to file but echoed to the test log.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-11-30 17:31:03 +00:00
Anna Khanova
3657a3c76e Proxy fix metrics record (#5996)
## Problem

Some latency metrics are recorded in an inconsistent way.

## Summary of changes

Make sure that everything is recorded in seconds.
2023-11-30 16:33:54 +00:00
Joonas Koivunen
eba3bfc57e test: python needs thread safety as well (#5992)
We have test cases which launch processes from threads, and they capture
output assuming this counter is thread-safe. At least according to my
understanding, this operation in Python requires a lock to be
thread-safe.
2023-11-30 15:48:40 +00:00
John Spray
57ae9cd07f pageserver: add flush_ms and document /location_config API (#5860)
- During migration of tenants, it is useful for callers to
`/location_conf` to flush a tenant's layers while transitioning to
AttachedStale: this optimization reduces the redundant WAL replay work
that the tenant's new attached pageserver will have to do. Test coverage
for this will come as part of the larger tests for live migration in
#5745 #5842
- Flushing is controlled with the `flush_ms` query parameter: it is the
caller's job to decide how long they want to wait for a flush to
complete. If the flush is not complete within the time limit, the pageserver
still returns success: flushing is only an optimization.
- Add swagger definitions for all this: the location_config API is the
primary interface for driving tenant migration as described in
docs/rfcs/028-pageserver-migration.md, and will eventually replace the
various /attach /detach /load /ignore APIs.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-30 14:22:07 +00:00
Christian Schwarz
3bb1030f5d walingest: refactor if-cascade on decoded.xl_rmid into match statement (#5974)
refs https://github.com/neondatabase/neon/issues/5962

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-30 14:07:41 +00:00
John Spray
5d3c3636fc tests: add a log allow list entry in test_timeline_deletion_with_files_stuck_in_upload_queue (#5981)
Test failure seen here:

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5860/7032903218/index.html#suites/837740b64a53e769572c4ed7b7a7eeeb/c0f1c79a70a3b9ab

```
E   AssertionError: assert not [(302, '2023-11-29T13:23:51.046801Z ERROR request{method=PUT path=/v1/tenant/f6b845de60cb0e92f4426e0d6af1d2ea/timeline/69a8c98004abe71a281cff8642a45274/checkpoint request_id=eca33d8a-7af2-46e7-92ab-c28629feb42c}: Error processing HTTP request: InternalServerError(queue is in state Stopped\n')]
```

This appears to be a legitimate log: the test is issuing checkpoint
requests in the background, and deleting (therefore shutting down) a
timeline.
2023-11-30 13:44:14 +00:00
Conrad Ludgate
0c87d1866b proxy: fix wake_compute error prop (#5989)
## Problem

Fixes #5654 - WakeComputeErrors occurring during connect_to_compute were
propagated as IO errors, which were forwarded to the user as "Couldn't
connect to compute node" with no helpful message.

## Summary of changes

Handle WakeComputeError during ConnectionError properly
2023-11-30 13:43:21 +00:00
Arpad Müller
8ec6033ed8 Pageserver disaster recovery RFC (#5248)
Enable the pageserver to recover from data corruption events by
implementing a feature to re-apply historic WAL records in parallel to the already
occurring WAL replay.

The feature is outside of the user-visible backup and history story, and
only
serves as a second-level backup for the case that there is a bug in the
pageservers that corrupted the served pages.

The RFC proposes the addition of two new features:
* recover a broken branch from WAL (downtime is allowed)
* a test recovery system to recover random branches to make sure
recovery works
2023-11-30 14:30:17 +01:00
Anna Khanova
e12e2681e9 IP allowlist on the proxy side (#5906)
## Problem

Per-project IP allowlist:
https://github.com/neondatabase/cloud/issues/8116

## Summary of changes

Implemented IP filtering on the proxy side. 

To retrieve the IP allowlist in all scenarios, added a `get_auth_info` call
to the control plane for:
* sql-over-http
* password_hack
* cleartext_hack

Added a cache with TTL for the sql-over-http path.

This might slow things down a bit; consider using Redis in the future.

---------

Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2023-11-30 13:14:33 +00:00
Joonas Koivunen
1e57ddaabc fix: flush loop should also keep the gate open (#5987)
I was expecting this to already be in place, because it should not
conflict with how we shut down (0. cancel, 1. shutdown_tasks, 2. close gate).
2023-11-30 14:26:11 +02:00
John Khvatov
3e094e90d7 update aws sdk to 1.0.x (#5976)
This change will be useful for experimenting with S3 performance.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-30 14:17:58 +02:00
Christian Schwarz
292281c9df pagectl: add subcommand to rewrite layer file summary (#5933)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771
2023-11-30 11:34:30 +00:00
Rahul Modpur
50d959fddc refactor: use serde for TenantConf deserialization Fixes: #5300 (#5310)
Remove handcrafted TenantConf deserialization code. Use
`serde_path_to_error` to include the field which failed parsing. Leaves
the duplicated TenantConf in pageserver and models, does not touch
PageserverConf handcrafted deserialization.

Error change:
- before change: "configure option `checkpoint_distance` cannot be
negative"
- after change: "`checkpoint_distance`: invalid value: integer `-1`,
expected u64"
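
A hedged sketch of how `serde_path_to_error` produces that kind of message (the struct here is a stand-in, not the real TenantConf):

```
use serde::Deserialize;

#[derive(Deserialize)]
struct DemoConf {
    checkpoint_distance: u64,
}

// serde_path_to_error wraps the deserializer and records the path of the
// field that failed, which is then combined with serde's own value error.
fn parse_demo(json: &str) -> Result<DemoConf, String> {
    let de = &mut serde_json::Deserializer::from_str(json);
    serde_path_to_error::deserialize(de)
        .map_err(|e| format!("`{}`: {}", e.path(), e.inner()))
}
```

Feeding it `{"checkpoint_distance": -1}` yields an error in the "after change" form above (serde_json appends its own line/column info).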

Fixes: #5300
Cc: #3682

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Co-authored-by: Shany Pozin <shany@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-30 12:47:13 +02:00
Conrad Ludgate
fc77c42c57 proxy: add flag to enable http pool for all users (#5959)
## Problem

#5123

## Summary of changes

Add `--sql-over-http-pool-opt-in true` default cli arg. Allows us to set
`--sql-over-http-pool-opt-in false` region-by-region
2023-11-30 10:19:30 +00:00
Conrad Ludgate
f05d1b598a proxy: add more db error info (#5951)
## Problem

https://github.com/neondatabase/serverless/issues/51

## Summary of changes

include more error fields in the json response
2023-11-30 10:18:59 +00:00
Christian Schwarz
ca597206b8 walredo: latency histogram for spawn duration (#5925)
fixes https://github.com/neondatabase/neon/issues/5891
2023-11-29 18:44:37 +00:00
Rahul Modpur
46f20faa0d neon_local: fix endpoint api to prevent two primary endpoints (#5520)
The `neon_local endpoint` subcommand currently allows creating two primary
endpoints for the same branch, which leads to the shutdown of both endpoints.

New behavior of `neon_local endpoint start`:
1. Fail if the endpoint doesn't exist
2. Fail if a conflict between two primaries is detected

Fixes #4959
Closes #5426

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-29 19:38:03 +02:00
John Spray
9e55ad4796 pageserver: refactor TenantId to TenantShardId in Tenant & Timeline (#5957)
(includes two preparatory commits from
https://github.com/neondatabase/neon/pull/5960)

## Problem

To accommodate multiple shards in the same tenant on the same
pageserver, we must include the full TenantShardId in local paths. That
means that all code touching local storage needs to see the
TenantShardId.

## Summary of changes

- Replace `tenant_id: TenantId` with `tenant_shard_id: TenantShardId` on
Tenant, Timeline and RemoteTimelineClient.
- Use TenantShardId in helpers for building local paths.
- Update all the relevant call sites.

This doesn't update absolutely everything: things like PageCache,
TaskMgr, and WalRedo are still shard-naive. The purpose of this PR is to
update the core types so that other code can be added/updated
incrementally without churning the most central shared types.
2023-11-29 14:52:35 +00:00
John Spray
70b5646fba pageserver: remove redundant serialization helpers on DeletionList (#5960)
Precursor for https://github.com/neondatabase/neon/pull/5957

## Problem

When DeletionList was written, TenantId/TimelineId didn't have
human-friendly modes in their serde. #5335 added those, such that the
helpers used in serialization of HashMaps are no longer necessary.

## Summary of changes

- Add a unit test to ensure that this change isn't changing anything
about the serialized form
- Remove the serialization helpers for maps of Id
2023-11-29 10:39:12 +00:00
Konstantin Knizhnik
64890594a5 Optimize storing of null page in WAL (#5910)
## Problem

PG16 (https://github.com/neondatabase/postgres/pull/327) adds a new
function to SMGR: zeroextend.
Its implementation in Neon WAL-logs zero pages of the extended
relation.
This zero page is WAL-logged using XLOG_FPI.
Since the page is zero, the hole optimization (excluding from the image
everything between pd_lower and pd_upper) doesn't work.

## Summary of changes

In the case of a zero page (`PageIsNull()` returns true), assume
`hole_size=BLCKSZ`.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-11-29 12:08:20 +02:00
Arseny Sher
78e73b20e1 Notify safekeeper readiness with systemd.
To avoid downtime during deploys, as in busy regions the initial load can
currently take ~30s.
2023-11-29 14:07:06 +04:00
John Spray
c48cc020bd pageserver: fix race between deletion completion and incoming requests (#5941)
## Problem

This is a narrow race that can leave a stuck Stopping tenant behind,
while emitting a log error "Missing InProgress marker during tenant
upsert, this is a bug"

- Deletion request 1 puts tenant into Stopping state, and fires off
background part of DeleteTenantFlow
- Deletion request 2 acquires a SlotGuard for the same tenant ID, and leaves
a TenantSlot::InProgress in place while it checks whether the tenant's state
is acceptable.
- DeleteTenantFlow finishes, calls TenantsMap::remove, which removes the
InProgress marker.
- Deletion request 2 calls SlotGuard::revert, which upserts the old
value (the Tenant in Stopping state), and emits the telltale log
message.

Closes: #5936 

## Summary of changes

- Add a regression test which uses pausable failpoints to reproduce this
scenario.
- TenantsMap::remove is only called by DeleteTenantFlow. Its behavior is
tweaked to express the different possible states, especially
`InProgress`, which carries a barrier.
- In DeleteTenantFlow, if we see such a barrier result from remove(),
wait for the barrier and then try removing again.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-29 09:32:26 +00:00
dependabot[bot]
a15969714c build(deps): bump openssl from 0.10.57 to 0.10.60 in /test_runner/pg_clients/rust/tokio-postgres (#5966)
Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.57 to 0.10.60.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-29 02:17:15 +01:00
dependabot[bot]
8c195d8214 build(deps): bump cryptography from 41.0.4 to 41.0.6 (#5970)
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-29 02:16:35 +01:00
dependabot[bot]
0d16874960 build(deps): bump openssl from 0.10.55 to 0.10.60 (#5965)
Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.55 to 0.10.60.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-29 01:24:02 +01:00
Alexander Bayandin
fd440e7d79 neonvm: add pgbouncer patch to support DEALLOCATE/DISCARD ALL (#5958)
pgbouncer 1.21.0 doesn't play nicely with DEALLOCATE/DISCARD ALL if
prepared statement support is enabled (max_prepared_statements > 0).
There's a patch[0] that improves this (it will be included in the next
release of pgbouncer).

This PR applies this patch on top of the 1.21.0 release tarball.
For some reason, the tarball doesn't include `test/test_prepared.py`
(which is modified by the patch as well), so the patch can't be applied
cleanly. I use `filterdiff` (from the `patchutils` package) to apply
the required changes.

[0] a7b3c0a5f4
2023-11-28 23:43:24 +00:00
bojanserafimov
65160650da Add walingest test (#5892) 2023-11-28 12:50:53 -05:00
dependabot[bot]
12dd6b61df build(deps): bump aiohttp from 3.8.6 to 3.9.0 (#5946) 2023-11-28 17:47:15 +00:00
bojanserafimov
5345c1c21b perf readme fix (#5956) 2023-11-28 17:31:42 +00:00
Joonas Koivunen
105edc265c fix: remove layer_removal_cs (#5108)
Quest: https://github.com/neondatabase/neon/issues/4745. Follow-up to
#4938.

- add in locks for compaction and gc, so we don't have multiple
executions at the same time in tests
- remove layer_removal_cs
- remove waiting for uploads in eviction/gc/compaction
    - #4938 will keep the file resident until upload completes

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-28 19:15:21 +02:00
Shany Pozin
8625466144 Move run_initdb to be async and guarded by max of 8 running tasks. Fixes #5895. Use tenant.cancel for cancellation (#5921)
## Problem
https://github.com/neondatabase/neon/issues/5895
2023-11-28 14:49:31 +00:00
John Spray
1ab0cfc8cb pageserver: add sharding metadata to LocationConf (#5932)
## Problem

The TenantShardId in API URLs is sufficient to uniquely identify a
tenant shard, but not for it to function: it also needs to know its full
sharding configuration (stripe size, layout version) in order to map
keys to shards.

## Summary of changes

- Introduce ShardIdentity: this is the superset of ShardIndex (#5924 )
that is required for translating keys to shard numbers.
- Include ShardIdentity as an optional attribute of LocationConf
- Extend the public `LocationConfig` API structure with a flat
representation of shard attributes.

The net result is that at the point where we construct a `Tenant`, we have a
`ShardIdentity` (inside LocationConf). This enables the next steps to
actually use the ShardIdentity to split WAL and validate that page
service requests are reaching the correct shard.
2023-11-28 13:14:51 +00:00
John Spray
ca469be1cf pageserver: add shard indices to layer metadata (#5928)
## Problem

For sharded tenants, the layer keys must include the shard number and
shard count, to disambiguate keys written by different shards in the
same tenant (shard number), and disambiguate layers written before and
after splits (shard count).

Closes: https://github.com/neondatabase/neon/issues/5924

## Summary of changes

There are no functional changes in this PR: everything behaves the same
for the default ShardIndex::unsharded() value. Actual construct of
sharded tenants will come next.

- Add a ShardIndex type: this is just a wrapper for a ShardCount and
ShardNumber. This is a subset of ShardIdentity: whereas ShardIdentity
contains enough information to filter page keys, ShardIndex contains
just enough information to construct a remote key. ShardIndex has a
compact encoding, the same as the shard part of TenantShardId.
- Store the ShardIndex as part of IndexLayerMetadata, if it is set to a
different value than ShardIndex::unsharded.
- Update RemoteTimelineClient and DeletionQueue to construct paths using
the layer metadata. Deletion code paths that previously just passed a
`Generation` now pass a full `LayerFileMetadata` to capture the shard as
well.

Notes to reviewers:
- In deletion code paths, I could have used a (Generation, ShardIndex)
instead of the full LayerFileMetadata. I opted for the full object
partly for brevity, and partly because in future when we add checksums
the deletion code really will care about the full metadata in order to
validate that it is deleting what was intended.
- While ShardIdentity and TenantShardId could both use a ShardIndex, I
find that they read more cleanly as "flat" structs that spell out the
shard count and number field separately. Serialization code would need
writing out by hand anyway, because TenantShardId's serialized form is
not a serde struct-style serialization.
- ShardIndex doesn't _have_ to exist (we could use ShardIdentity
everywhere), but it is a worthwhile optimization, as we will have many
copies of this as part of layer metadata. In future the size difference
between ShardIndex and ShardIdentity may become larger if we implement
more sophisticated key distribution mechanisms (i.e. new values of
ShardIdentity::layout).

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-28 11:47:25 +00:00
Christian Schwarz
286f34dfce test suite: add method for generation-aware detachment of a tenant (#5939)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771
2023-11-28 09:51:37 +00:00
Sasha Krassovsky
f290b27378 Fix check for if shmem is valid to take into account detached shmem (#5937)
## Problem
We can segfault if we update connstr inside a process that has
detached from shmem (e.g. inside the stats collector).
## Summary of changes
Add a check to make sure we're not detached
2023-11-28 03:14:42 +00:00
Sasha Krassovsky
4cd18fcebd Compile wal2json (#5893)
Add wal2json extension
2023-11-27 18:17:26 -08:00
Anastasia Lubennikova
4c29e0594e Update neon extension relocatable for existing installations (#5943) 2023-11-27 23:29:24 +00:00
Anastasia Lubennikova
3c56a4dd18 Make neon extension relocatable to allow SET SCHEMA (#5942) 2023-11-27 21:45:41 +00:00
Conrad Ludgate
316309c85b channel binding (#5683)
## Problem

Channel binding protects SCRAM from sophisticated MITM attacks where the
attacker is able to produce 'valid' TLS certificates.

## Summary of changes

Get the tls-server-end-point channel binding, and verify it is correct
for the SCRAM-SHA-256-PLUS authentication flow.
2023-11-27 21:45:15 +00:00
Arpad Müller
e09bb9974c bootstrap_timeline: rename initdb_path to pgdata_path (#5931)
This is a rename without functional changes, in preparation for #5912.

Split off from #5912 as per review request.
2023-11-27 20:14:39 +00:00
Anastasia Lubennikova
5289f341ce Use test specific directory in test_remote_extensions (#5938) 2023-11-27 18:57:58 +00:00
Joonas Koivunen
683ec2417c deflake: test_live_reconfig_get_evictions_low_residence_... (#5926)
- disable extra tenant
- disable compaction which could try to repartition while we assert

Split from #5108.
2023-11-27 15:20:54 +02:00
Christian Schwarz
a76a503b8b remove confusing no-op .take() of init_tenant_load_remote (#5923)
The `Tenant::spawn()` method already `.take()`s it.

I think this was an oversight in
https://github.com/neondatabase/neon/pull/5580 .
2023-11-27 12:50:19 +00:00
Anastasia Lubennikova
92bc2bb132 Refactor remote extensions feature to request extensions from proxy (#5836)
instead of direct S3 request.

Pros:
- simplify code a lot (no need to provide AWS credentials and paths);
- reduce latency of downloading extension data, as the proxy resides near
computes;
- reduce AWS costs, as the proxy has a cache and 1000 computes asking for
the same extension will not generate 1000 downloads from S3;
- we can use only one S3 bucket to store extensions (and get rid of the
regional buckets which were introduced to reduce latency).

Changes:
- deprecate the remote-ext-config compute_ctl parameter; use
http://pg-ext-s3-gateway if any old-format remote-ext-config is provided;
- refactor tests to use a mock HTTP server.
2023-11-27 12:10:23 +00:00
John Spray
b80b9e1c4c pageserver: remove defunct local timeline delete markers (#5699)
## Problem

Historically, we treated the presence of a timeline on local disk as
evidence that it logically exists. Since #5580 that is no longer the
case, so we can always rely on remote storage. If we restart and the
timeline is gone in remote storage, we will also purge it from local
disk: no need for a marker.

Reference on why this PR is for timeline markers and not tenant markers:
https://github.com/neondatabase/neon/issues/5080#issuecomment-1783187807

## Summary of changes

Remove code paths that read + write deletion marker for timelines.

Leave code path that deletes these markers, just in case we deploy while
there are some in existence. This can be cleaned up later.
(https://github.com/neondatabase/neon/issues/5718)
2023-11-27 09:31:20 +00:00
Anastasia Lubennikova
87b8ac3ec3 Only create neon extension in postgres database; (#5918)
Create neon extension in neon schema.
2023-11-26 08:37:01 +00:00
Joonas Koivunen
6b1c4cc983 fix: long timeline create cancelled by tenant delete (#5917)
Fix the fallible vs. infallible check order with
`UninitTimeline::finish_creation` so that the incomplete timeline can be
removed. Currently the order of drop guard unwrapping causes uninit
files to be left on pageserver, blocking the tenant deletion.

Cc: #5914
Cc: #investigation-2023-11-23-stuck-tenant-deletion
2023-11-24 16:17:56 +00:00
Joonas Koivunen
831fad46d5 tests: fix allowed_error for compaction detecting a shutdown (#5919)
This has been causing flaky tests, [example evidence].

Follow-up to #5883 where I forgot to fix this.

[example evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5917/6981540065/index.html#suites/9d2450a537238135fd4007859e09aca7/6fd3556a879fa3d1
2023-11-24 16:14:32 +00:00
Joonas Koivunen
53851ea8ec fix: log cancelled request handler errors (#5915)
Noticed during an [investigation] with @problame: a major point of lost
error logging which would have sped up the investigation.

Cc: #5815

[investigation]:
https://neondb.slack.com/archives/C066ZFAJU85/p1700751858049319
2023-11-24 15:54:06 +02:00
Joonas Koivunen
044375732a test: support validating allowed_errors against a logfile (#5905)
This will make it easier to test whether an added allowed_error does in fact
match, for example against a log file from an Allure report.
```
$ python3 test_runner/fixtures/pageserver/allowed_errors.py --help
usage: allowed_errors.py [-h] [-i INPUT]

check input against pageserver global allowed_errors

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Pageserver logs file. Reads from stdin if no file is provided.
```

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-11-24 12:43:25 +00:00
Konstantin Knizhnik
ea63b43009 Check if LFC was initialized in local_cache_pages function (#5911)
## Problem

There is no check that LFC is initialised (`lfc_max_size != 0`) in
the `local_cache_pages` function.

## Summary of changes

Add proper check.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-11-24 08:23:00 +02:00
Conrad Ludgate
a56fd45f56 proxy: fix memory leak again (#5909)
## Problem

The connections.join_next helped but it wasn't enough... The way I
implemented the improvement before was still faulty but it mostly worked
so it looked like it was working correctly.

From [`tokio::select`
docs](https://docs.rs/tokio/latest/tokio/macro.select.html):
> 4. Once an <async expression> returns a value, attempt to apply the
value to the provided <pattern>, if the pattern matches, evaluate
<handler> and return. If the pattern does not match, disable the current
branch and for the remainder of the current call to select!. Continue
from step 3.

The `connections.join_next()` future would complete and the `Some(Err(e))`
branch would be evaluated but not match (as the future would complete
without panicking, we would hope). Since the branch doesn't match, it gets
disabled. The select continues but never attempts to call `join_next`
again. If we're unlucky, more TCP connections are created than we ever
attempt to join_next.

## Summary of changes

Replace the `Some(Err(e))` pattern with `Some(e)`. Because of the
auto-disabling feature, we don't need the `if !connections.is_empty()`
step as the `None` pattern will disable it for us.
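
A sketch of the fixed shape (simplified; the connection task type and shutdown signalling are placeholders):

```
use tokio::sync::watch;
use tokio::task::JoinSet;

// The branch pattern matches every completed task (`Some(res)`), not just
// panicked ones, so the branch stays enabled and finished connections keep
// getting reaped. When the set is empty, `None` disables the branch for us.
async fn accept_loop(mut connections: JoinSet<()>, mut shutdown: watch::Receiver<bool>) {
    loop {
        tokio::select! {
            Some(res) = connections.join_next() => {
                if let Err(e) = res {
                    eprintln!("connection task panicked: {e}");
                }
            }
            _ = shutdown.changed() => break,
        }
    }
}
```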
2023-11-23 19:11:24 +00:00
Anastasia Lubennikova
582a42762b update extension version in test_neon_extension 2023-11-23 18:53:03 +00:00
Anastasia Lubennikova
f5dfa6f140 Create extension neon in existing databases too 2023-11-23 18:53:03 +00:00
Anastasia Lubennikova
f8d9bd8d14 Add extension neon to all databases.
- Run CREATE EXTENSION neon for template1, so that it was created in all databases.
- Run ALTER EXTENSION neon in all databases, to always have the newest version of the extension in computes.
- Add test_neon_extension test
2023-11-23 18:53:03 +00:00
Anastasia Lubennikova
04e6c09f14 Add pgxn/neon/README.md 2023-11-23 18:53:03 +00:00
Arpad Müller
54327bbeec Upload initdb results to S3 (#5390)
## Problem

See #2592

## Summary of changes

Compresses the results of initdb into a .tar.zst file and uploads them
to S3, to enable usage in recovery from lsn.

Generations should not be involved I think because we do this only once
at the very beginning of a timeline.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-23 18:11:52 +00:00
Shany Pozin
35f243e787 Move weekly release PR trigger to Monday morning (#5908) 2023-11-23 19:09:34 +02:00
Shany Pozin
b7a988ba46 Support cancellation for find_lsn_for_timestamp API (#5904)
## Problem
#5900
## Summary of changes
Added a cancellation token as a parameter in all relevant code paths and actually used it in the find_lsn_for_timestamp main loop.
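
A hedged sketch of the shape of that change (the search body here is a placeholder, not the real implementation):

```
use tokio_util::sync::CancellationToken;

// Check the cancellation token on every iteration of the binary search so a
// tenant/timeline shutdown doesn't have to wait for the search to finish.
async fn find_lsn_demo(cancel: &CancellationToken, mut low: u64, mut high: u64) -> Option<u64> {
    while low < high {
        if cancel.is_cancelled() {
            return None;
        }
        let mid = low + (high - low + 1) / 2;
        // ... probe the timestamp at `mid` here and narrow [low, high] ...
        high = mid - 1; // placeholder narrowing so the sketch terminates
    }
    Some(low)
}
```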
2023-11-23 17:08:32 +02:00
Christian Schwarz
a0e61145c8 fix: cleanup of layers from the future can race with their re-creation (#5890)
fixes https://github.com/neondatabase/neon/issues/5878
obsoletes https://github.com/neondatabase/neon/issues/5879

Before this PR, it could happen that `load_layer_map` schedules removal
of the future
image layer. Then a later compaction run could re-create the same image
layer, scheduling a PUT.
Due to lack of an upload queue barrier, the PUT and DELETE could be
re-ordered.
The result was IndexPart referencing a non-existent object.

## Summary of changes

* Add support to `pagectl` / Python tests to decode `IndexPart`
  * Rust
    * new `pagectl` Subcommand
* `IndexPart::{from,to}_s3_bytes()` methods to internalize knowledge
about encoding of `IndexPart`
  * Python
    * new `NeonCli` subclass
* Add regression test
  * Rust
* Ability to force repartitioning; required to ensure image layer
creation at last_record_lsn
  * Python
    * The regression test.
* Fix the issue
  * Insert an `UploadOp::Barrier` after scheduling the deletions.
2023-11-23 13:33:41 +00:00
Konstantin Knizhnik
6afbadc90e LFC fixes + statistics (#5727)
## Problem

## Summary of changes

See #5500

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-11-23 08:59:19 +02:00
Anastasia Lubennikova
2a12e9c46b Add documentation for our sample pre-commit hook (#5868) 2023-11-22 12:04:36 +00:00
Christian Schwarz
9e3c07611c logging: support output to stderr (#5896)
(part of the getpage benchmarking epic #5771)

The plan is to make the benchmarking tool log on stderr and emit results
as JSON on stdout. That way, the test suite can simply capture stdout and
json.loads() it, while interactive users of the benchmarking tool have a
reasonable experience as well.

Existing logging users continue to print to stdout, so, this change
should be a no-op functionally and performance-wise.
2023-11-22 11:08:35 +00:00
Christian Schwarz
d353fa1998 refer to our rust-postgres.git fork by branch name (#5894)
This way, `cargo update -p tokio-postgres` just works. The `Cargo.toml`
communicates more clearly that we're referring to the `main` branch. And
the git revision is still pinned in `Cargo.lock`.
2023-11-22 10:58:27 +00:00
Joonas Koivunen
0d10992e46 Cleanup compact_level0_phase1 fsyncing (#5852)
While reviewing code I noticed a scary `layer_paths.pop().unwrap()`, then
realized this should be further asyncified, something I forgot to do
when I switched `compact_level0_phase1` back to async in #4938.
This keeps the double-fsync for new deltas as #4749 is still unsolved.
2023-11-21 15:30:40 +02:00
Arpad Müller
3e131bb3d7 Update Rust to 1.74.0 (#5873)
[Release notes](https://github.com/rust-lang/rust/releases/tag/1.74.0).
2023-11-21 11:41:41 +01:00
Sasha Krassovsky
81b2cefe10 Disallow CREATE DATABASE WITH OWNER neon_superuser (#5887)
## Problem
Currently, the control plane doesn't know about neon_superuser, so if a user
creates a database with owner neon_superuser, it causes an exception when
the control plane tries to forward it. It is also currently possible to ALTER ROLE
neon_superuser.

## Summary of changes
Disallow creating a database with owner neon_superuser. This is probably
fine, since I don't think you can create a database owned by a normal
superuser either. Also forbid altering neon_superuser.
2023-11-20 22:39:47 +00:00
Christian Schwarz
d2ca410919 build: back to opt-level=0 in debug builds, for faster compile times (#5751)
This change brings down incremental compilation for me
from > 1min to 10s (and this is a pretty old Ryzen 1700X).

More details: "incremental compilation" here means to change one
character
in the `failed to read value from offset` string in `image_layer.rs`.
The command for incremental compilation is `cargo build_testing`.
The system on which I got these numbers uses `mold` via
`~/.cargo/config.toml`.

As a bonus, `rust-gdb` is now at least a little fun again.

Some tests are timing out in debug builds due to these changes.
This PR makes them skip for debug builds.
We run both debug and release builds, so the loss of coverage is
marginal.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-11-20 15:41:37 +01:00
Joonas Koivunen
d98ac04136 chore(background_tasks): missed allowed_error change, logging change (#5883)
- I am always confused by the log for the error wait time, now it will
be `2s` or `2.0s` not `2.0`
- fix missed string change introduced in #5881 [evidence]

[evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/6921062837/index.html#suites/f9eba3cfdb71aa6e2b54f6466222829b/87897fe1ddee3825
2023-11-20 07:33:17 +00:00
Joonas Koivunen
ac08072d2e fix(layer): VirtualFile opening and read errors can be caused by contention (#5880)
A very low number of layer loads have been wrongly marked as permanent,
as I did not remember that `VirtualFile::open` or reading could fail
transiently due to contention. Return separate errors for transient and
persistent failures from `{Delta,Image}LayerInner::load`.

Includes drive-by comment changes.

The implementation looks quite ugly because the same type serves as both
the inner (operation error) and outer (critical error), but with the
alternatives I tried I did not find a better way.
2023-11-19 14:57:39 +00:00
John Spray
d22dce2e31 pageserver: shut down idle walredo processes (#5877)
The longer a pageserver runs, the more walredo processes it accumulates
from tenants that are touched intermittently (e.g. by availability
checks). This can lead to the pageserver getting OOM-killed.

Changes:
- Add an Instant recording the last use of the walredo process for a
tenant
- After compaction iteration in the background task, check for idleness
and stop the walredo process if idle for more than 10x compaction
period.
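
Schematically, the idleness check amounts to something like this (field and function names are made up for illustration):

```
use std::time::{Duration, Instant};

// Record when walredo was last used; after each compaction iteration, drop the
// process if it has been idle for more than 10x the compaction period.
struct WalRedoProcessDemo {
    last_used: Instant,
}

fn maybe_reap(
    process: Option<WalRedoProcessDemo>,
    compaction_period: Duration,
) -> Option<WalRedoProcessDemo> {
    match process {
        Some(p) if p.last_used.elapsed() > compaction_period * 10 => None, // stop it
        other => other,
    }
}
```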

Cc: #3620

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Shany Pozin <shany@neon.tech>
2023-11-19 14:21:16 +00:00
Joonas Koivunen
3b3f040be3 fix(background_tasks): first backoff, compaction error stacktraces (#5881)
The first compaction/gc error backoff starts from 0, which is less than
the 2s it was before #5672. This is now fixed to be the intended 2**n.
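
For reference, a sketch of that progression (the cap value is illustrative):

```
use std::time::Duration;

// First retry waits 2^1 = 2s, then 4s, 8s, ... saturating at a cap so a long
// error streak doesn't wait unboundedly.
fn error_backoff(consecutive_failures: u32) -> Duration {
    const MAX_BACKOFF_SECS: u64 = 300; // illustrative cap
    let secs = 2u64
        .saturating_pow(consecutive_failures.max(1))
        .min(MAX_BACKOFF_SECS);
    Duration::from_secs(secs)
}
```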

Additionally, I noticed that `compaction_iteration` creating an
`anyhow::Error` via `into()` always captures a stacktrace, even if we already
had a stacktrace-carrying anyhow error within the CompactionError, because there is
no stable API for querying that.
2023-11-19 14:16:31 +00:00
Em Sharnoff
cad0dca4b8 compute_ctl: Remove deprecated flag --file-cache-on-disk (#5622)
See neondatabase/cloud#7516 for more.
2023-11-18 12:43:54 +01:00
Em Sharnoff
5d13a2e426 Improve error message when neon.max_cluster_size reached (#4173)
Changes the error message encountered when the `neon.max_cluster_size`
limit is reached. Reasoning is that this is user-visible, and so should
*probably* use language that's closer to what users are familiar with.
2023-11-16 21:51:26 +00:00
khanova
0c243faf96 Proxy log pid hack (#5869)
## Problem

Improve observability for the compute node.

## Summary of changes

Log pid from the compute node. Doesn't work with pgbouncer.
2023-11-16 20:46:23 +00:00
Em Sharnoff
d0a842a509 Update vm-builder to v0.19.0 and move its customization here (#5783)
ref neondatabase/autoscaling#600 for more
2023-11-16 18:17:42 +01:00
khanova
6b82f22ada Collect number of connections by sni type (#5867)
## Problem

We don't know the number of users with the different kinds of
authentication: ["sni", "endpoint in options" (A and B from
[here](https://neon.tech/docs/connect/connection-errors)),
"password_hack"].

## Summary of changes

Collect metrics by sni kind.
2023-11-16 12:19:13 +00:00
John Spray
ab631e6792 pageserver: make TenantsMap shard-aware (#5819)
## Problem

When using TenantId as the key, we are unable to handle multiple tenant
shards attached to the same pageserver for the same tenant ID. This is
an expected scenario if we have e.g. 8 shards and 5 pageservers.

## Summary of changes

- TenantsMap is now a BTreeMap instead of a HashMap: this enables
looking up by range. In future, we will need this for page_service, as
incoming requests will just specify the Key, and we'll have to figure
out which shard to route it to.
- A new key type TenantShardId is introduced, to act as the key in
TenantsMap, and as the id type in external APIs. Its human readable
serialization is backward compatible with TenantId, and also
forward-compatible as long as sharding is not actually used (when we
construct a TenantShardId with ShardCount(0), it serializes to an
old-fashioned TenantId).
- Essential tenant APIs are updated to accept TenantShardIds:
tenant/timeline create, tenant delete, and /location_conf. These are the
APIs that will enable driving sharded tenants. Other apis like /attach
/detach /load /ignore will not work with sharding: those will soon be
deprecated and replaced with /location_conf as part of the live
migration work.

Closes: #5787
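
A toy illustration of why a BTreeMap helps here (the key and value types are heavily simplified):

```
use std::collections::BTreeMap;

// All shards of one tenant sort next to each other under a composite key, so
// they can be found with a range scan rather than a per-key lookup.
type TenantId = u64;
type ShardNumber = u8;

fn shards_of<'a>(
    map: &'a BTreeMap<(TenantId, ShardNumber), String>,
    tenant: TenantId,
) -> impl Iterator<Item = &'a String> + 'a {
    map.range((tenant, u8::MIN)..=(tenant, u8::MAX)).map(|(_, v)| v)
}
```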
2023-11-15 23:20:21 +02:00
Alexander Bayandin
f84ac2b98d Fix baseline commit and branch for code coverage (#5769)
## Problem

`HEAD` commit for a PR is a phantom merge commit which skews the baseline
commit for coverage reports.

See
https://github.com/neondatabase/neon/pull/5751#issuecomment-1790717867

## Summary of changes
- Use commit hash instead of `HEAD` for finding baseline commits for
code coverage
- Use the base branch for PRs or the current branch for pushes
2023-11-15 12:40:21 +01:00
dependabot[bot]
5cd5b93066 build(deps): bump aiohttp from 3.8.5 to 3.8.6 (#5864) 2023-11-15 11:08:49 +00:00
khanova
2f0d245c2a Proxy control plane rate limiter (#5785)
## Problem

Proxy might overload the control plane.

## Summary of changes

Implement rate limiter for proxy<->control plane connection.
Resolves https://github.com/neondatabase/neon/issues/5707

Used implementation ideas from https://github.com/conradludgate/squeeze/
2023-11-15 09:15:59 +00:00
Joonas Koivunen
462f04d377 Smaller test addition and change (#5858)
- trivial serialization roundtrip test for
`pageserver::repository::Value`
- add missing `start_paused = true` to 15s test making it <0s test
- completely unrelated future clippy lint avoidance (helps beta channel
users)
2023-11-14 18:04:34 +01:00
Arpad Müller
31a54d663c Migrate links from wiki to notion (#5862)
See the slack discussion:
https://neondb.slack.com/archives/C033A2WE6BZ/p1696429688621489?thread_ts=1695647103.117499
2023-11-14 15:36:47 +00:00
John Spray
7709c91fe5 neon_local: use remote storage by default, add cargo neon tenant migrate (#5760)
## Problem

Currently the only way to exercise tenant migration is via python test
code. We need a convenient way for developers to do it directly in a
neon local environment.

## Summary of changes

- Add a `--num-pageservers` argument to `cargo neon init` so that it's
easy to run with multiple pageservers
- Modify default pageserver overrides in neon_local to set up `LocalFs`
remote storage, as any migration/attach/detach stuff doesn't work in the
legacy local storage mode. This also unblocks removing the pageserver's
support for the legacy local mode.
- Add a new `cargo neon tenant migrate` command that orchestrates tenant
migration, including endpoints.
2023-11-14 09:51:51 +00:00
Arpad Müller
f7249b9018 Fix comment in find_lsn_for_timestamp (#5855)
We still subtract 1 from low to compute `commit_lsn`; the comment
moved/added by #5844 should point this out.
2023-11-11 00:32:00 +00:00
Joonas Koivunen
74d150ba45 build: upgrade ahash (#5851)
`cargo deny` was complaining that version 0.8.3 was yanked (for a possible
DoS attack [wiki]), but the latest version (0.8.5) also includes aarch64
fixes which may or may not be relevant. Our usage of ahash is limited to
proxy, but I don't think we are at any risk.

[wiki]: https://github.com/tkaitchuck/aHash/wiki/Yanked-versions
2023-11-10 19:10:54 +00:00
Joonas Koivunen
b7f45204a2 build: deny async-std and friends (#5849)
Rationale: some crates pull these in by default; hopefully these hints
will mean less cleanup afterwards and less Cargo.lock file watching.

follow-up to #5848.
2023-11-10 18:02:22 +01:00
Joonas Koivunen
a05f104cce build: remove async-std dependency (#5848)
Introduced by accident (missing `default-features = false`) in
e09d5ada6a. We directly need only `http_types::StatusCode`.
2023-11-10 16:05:21 +02:00
John Spray
d672e44eee pageserver: error type for collect_keyspace (#5846)
## Problem

This is a log hygiene fix, for an occasional test failure.

warn-level logging in imitate_timeline_cached_layer_accesses can't
distinguish actual errors from shutdown cases.

## Summary of changes

Replaced anyhow::Error with an explicit CollectKeySpaceError type, that
includes conversion from PageReconstructError::Cancelled.
2023-11-10 13:58:18 +00:00
Rahul Modpur
a6f892e200 metric: add started and killed walredo processes counter (#5809)
In OOM situations, knowing exactly how many walredo processes there were
at a given time would help afterwards to understand why the pageserver was
OOM-killed. Add a `pageserver_wal_redo_process_total` metric to keep track of
the total WAL redo processes started, shut down, and killed since pageserver
start.

Closes #5722
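
A hedged sketch of such a counter using the `prometheus` crate (the label name and registration details are assumptions, not necessarily what the PR does):

```
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Counter of walredo processes by what happened to them; the "outcome" label
// (started / shutdown / killed) is illustrative.
static WAL_REDO_PROCESS_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_wal_redo_process_total",
        "WAL redo processes started, shut down, or killed since pageserver start",
        &["outcome"]
    )
    .expect("failed to register metric")
});

fn on_walredo_spawned() {
    WAL_REDO_PROCESS_TOTAL.with_label_values(&["started"]).inc();
}
```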

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <me@cschwarz.com>
2023-11-10 15:05:22 +02:00
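For illustration, a counter of this shape can be registered with the `prometheus` crate roughly as follows; the label name, values, and helper function here are assumptions, not the pageserver's actual code:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Counter vector labelled by what happened to the walredo process.
static WAL_REDO_PROCESS_COUNTERS: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_wal_redo_process_total",
        "Number of walredo processes started/shut down/killed since pageserver start",
        &["outcome"] // e.g. "started", "killed" -- illustrative label
    )
    .expect("failed to register metric")
});

fn on_walredo_process_started() {
    WAL_REDO_PROCESS_COUNTERS.with_label_values(&["started"]).inc();
}
```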
Alexander Bayandin
71b380f90a Set BUILD_TAG for build-neon job (#5847)
## Problem

I've added `BUILD_TAG` to docker images.
(https://github.com/neondatabase/neon/pull/5812), but forgot to add it
to services that we build for tests

## Summary of changes
- Set `BUILD_TAG` in `build-neon` job
2023-11-10 12:49:52 +00:00
Alexander Bayandin
6e145a44fa workflows/neon_extra_builds: run check-codestyle-rust & build-neon on arm64 (#5832)
## Problem

Some developers use workstations with arm CPUs, and sometimes x86-64
code is not fully compatible with them (for example,
https://github.com/neondatabase/neon/pull/5827).
Although we don't have arm CPUs in prod (yet?), it is worth having
some basic checks for this architecture to provide a better developer
experience.

Closes https://github.com/neondatabase/neon/issues/5829

## Summary of changes
- Run `check-codestyle-rust`-like & `build-neon`-like jobs on Arm runner
- Add `run-extra-build-*` label to run all available extra builds
2023-11-10 12:45:41 +00:00
Arpad Müller
8e5e3971ba find_lsn_for_timestamp fixes (#5844)
Includes the changes of #3689 that address point 1 of #3689, plus some
further improvements. In particular, this PR does:

* set `min_lsn` to a safe value to create branches from (and verify it
in tests)
* return `min_lsn` instead of `max_lsn` for `NoData` and `Past` (verify
it in test for `Past`, `NoData` is harder and not as important)
* return `commit_lsn` instead of `max_lsn` for Future (and verify it in
the tests)
* add some comments

Split out of #5686 to get something more minimal out to users.
2023-11-10 13:38:44 +01:00
Joonas Koivunen
8dd29f1e27 fix(pageserver): spawn all kinds of tenant shutdowns (#5841)
Minor bugfix, something noticed during manual code review. Use the same
joinset for in-progress tenants so we get the benefit of the
buffered logging just as we do for attached tenants, and no single
in-progress task can hold up shutdown of other tenants.
2023-11-09 21:36:57 +00:00
Joonas Koivunen
f5344fb85a temp: log all layer loading errors while we lose them (#5816)
Temporary workaround while some errors are not being logged.

Cc: #5815.
2023-11-09 21:31:53 +00:00
Arpad Müller
f95f001b8b Lsn for get_timestamp_of_lsn should be string, not integer (#5840)
The `get_timestamp_of_lsn` pageserver endpoint has been added in #5497,
but the yml it added was wrong: the lsn is expected in hex format, not
in integer (decimal) format.
2023-11-09 16:12:18 +00:00
John Spray
e0821e1eab pageserver: refined Timeline shutdown (#5833)
## Problem

We have observed the shutdown of a timeline taking a long time when a
deletion arrives at a busy time for the system. This suggests that we
are not respecting cancellation tokens promptly enough.

## Summary of changes

- Refactor timeline shutdown so that rather than having a shutdown()
function that takes a flag for optionally flushing, there are two
distinct functions, one for graceful flushing shutdown, and another that
does the "normal" shutdown where we're just setting a cancellation token
and then tearing down as fast as we can. This makes things a bit easier
to reason about, and enables us to remove the hand-written variant of
shutdown that was maintained in `delete.rs`
- Layer flush task checks cancellation token more carefully
- Logical size calculation's handling of cancellation tokens is
simplified: rather than passing one in, it respects the Timeline's
cancellation token.

This PR doesn't touch RemoteTimelineClient, which will be a key thing to
fix as well, so that a slow remote storage op doesn't hold up shutdown.
2023-11-09 16:02:59 +00:00
bojanserafimov
4469b1a62c Fix blob_io test (#5818) 2023-11-09 10:47:03 -05:00
Joonas Koivunen
842223b47f fix(metric): remove pageserver_wal_redo_wait_seconds (#5791)
The meaning of the values recorded in this histogram changed with #5560
and we never had it visualized as a histogram, just the
`increase(_sum)`. The histogram is not too interesting to look at, so
remove it per discussion in [slack
thread](https://neondb.slack.com/archives/C063LJFF26S/p1699008316109999?thread_ts=1698852436.637559&cid=C063LJFF26S).
2023-11-09 16:40:52 +02:00
Anna Stepanyan
893616051d Update epic-template.md (#5709)
Replace the checkbox list with a proper task list in the epic template.

NB: this PR does not change the code, it only touches the github issue
templates
2023-11-09 15:24:43 +01:00
Conrad Ludgate
7cdde285a5 proxy: limit concurrent wake_compute requests per endpoint (#5799)
## Problem

A user can open many database connections at the same instant - these
will all miss the cache and materialise as requests to the control
plane. #5705

## Summary of changes

I am using a `DashMap` (a sharded `RwLock<HashMap>`) of endpoints ->
semaphores to apply a limiter. If the limiter is enabled (permits > 0),
the semaphore will be retrieved per endpoint and a permit will be
awaited before continuing to call the wake_compute endpoint.

### Important details

This dashmap would grow uncontrollably without maintenance. It's not a
cache so I don't think an LRU-based reclamation makes sense. Instead,
I've made use of the sharding functionality of DashMap to lock a single
shard and clear out unused semaphores periodically.

I ran a test in release, using 128 tokio tasks among 12 threads each
pushing 1000 entries into the map per second, clearing a shard every 2
seconds (64 second epoch with 32 shards). The endpoint names were
sampled from a gamma distribution to make sure some overlap would occur,
and each permit was held for 1ms. The histogram for time to clear each
shard settled between 256-512us without any variance in my testing.

Holding a lock on one of the shards for under a millisecond does not
concern me as a blocking risk.
2023-11-09 14:14:30 +00:00
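A minimal sketch of the endpoint -> semaphore map described above (the names and the plain string key are illustrative, not the real proxy types); the important detail is that the DashMap shard guard is dropped before awaiting the permit:

```rust
use std::sync::Arc;
use dashmap::DashMap;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

struct WakeComputeLimiter {
    permits_per_endpoint: usize,
    map: DashMap<String, Arc<Semaphore>>,
}

impl WakeComputeLimiter {
    fn new(permits_per_endpoint: usize) -> Self {
        Self { permits_per_endpoint, map: DashMap::new() }
    }

    /// Get (or lazily create) this endpoint's semaphore and wait for a permit
    /// before the caller issues a wake_compute request.
    async fn acquire(&self, endpoint: &str) -> OwnedSemaphorePermit {
        let sem = {
            let entry = self
                .map
                .entry(endpoint.to_owned())
                .or_insert_with(|| Arc::new(Semaphore::new(self.permits_per_endpoint)));
            Arc::clone(entry.value())
        }; // the DashMap shard guard is dropped here, before we await
        sem.acquire_owned().await.expect("semaphore is never closed")
    }
}
```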
John Spray
9c30883c4b remote_storage: use S3 SDK's adaptive retry policy (#5813)
## Problem

Currently, we aren't doing any explicit slowdown in response to 429
responses. Recently, as we hit remote storage a bit harder (pageserver
does more ListObjectsv2 requests than it used to since #5580 ), we're
seeing storms of 429 responses that may be the result of not just doing
too many requests, but continuing to do those extra requests without
backing off any more than our usual backoff::exponential.

## Summary of changes

Switch from AWS's "Standard" retry policy to "Adaptive" -- docs describe
this as experimental but it has been around for a long time. The main
difference between Standard and Adaptive is that Adaptive rate-limits
the client in response to feedback from the server, which is meant to
avoid scenarios where the client would otherwise repeatedly hit
throttling responses.
2023-11-09 13:50:13 +00:00
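For reference, switching the retry mode with the Rust AWS SDK looks roughly like this (a sketch; the exact loader calls vary between SDK versions):

```rust
use aws_config::retry::RetryConfig;

// Opt the S3 client into the SDK's adaptive retry mode, which rate-limits the
// client based on throttling feedback from the server.
async fn build_s3_client() -> aws_sdk_s3::Client {
    let config = aws_config::from_env()
        .retry_config(RetryConfig::adaptive().with_max_attempts(10))
        .load()
        .await;
    aws_sdk_s3::Client::new(&config)
}
```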
Arthur Petukhovsky
0495798591 Fix walproposer build on aarch64 (#5827)
There was a compilation error due to `std::ffi::c_char` being a different type on different platforms. Clippy also complained for a similar reason.
2023-11-09 13:05:17 +00:00
Sasha Krassovsky
87389bc933 Add test simulating bad connection between pageserver and compute (#5728)
## Problem
We have a funny 3-day timeout for connections between the compute and
pageserver. We want to get rid of it, so to do that we need to make sure
the compute is resilient to connection failures.

Closes: https://github.com/neondatabase/neon/issues/5518

## Summary of changes
This test makes the pageserver randomly drop the connection if the
failpoint is enabled, and ensures we can keep querying the pageserver.

This PR also reduces the default timeout to 10 minutes from 3 days.
2023-11-08 19:48:57 +00:00
Arpad Müller
ea118a238a JWT logging improvements (#5823)
* Lower the level on auth success from info to debug (fixes #5820).
* Don't log stacktraces on auth errors (as requested on Slack). We do this by introducing an `AuthError` type instead of using `anyhow` and `bail`.
* Return errors that have been censored, for improved security.
2023-11-08 16:56:53 +00:00
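A sketch of what such a dedicated error type can look like with `thiserror` (the variants are illustrative, not the actual pageserver set):

```rust
use thiserror::Error;

#[derive(Debug, Error)]
enum AuthError {
    #[error("malformed authorization header")]
    MalformedHeader,
    // Keep the underlying cause out of the Display text so that what we log
    // and return stays censored; the source is still available for debugging.
    #[error("token validation failed")]
    InvalidToken(#[source] jsonwebtoken::errors::Error),
}
```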
Christian Schwarz
e9b227a11e cleanup unused RemoteStorage fields (#5830)
Found this while working on #5771
2023-11-08 16:54:33 +00:00
John Spray
40441f8ada pageserver: use Gate for stronger safety check in SlotGuard (#5793)
## Problem

#5711 and #5367 raced -- the `SlotGuard` type needs `Gate` to properly
enforce its invariant that we may not drop an `Arc<Tenant>` from a slot.

## Summary of changes

Replace the TODO with the intended check of Gate.
2023-11-08 13:00:11 +00:00
John Spray
a8a39cd464 test: de-flake test_deletion_queue_recovery (#5822)
## Problem

This test could fail if timing is unlucky, and the deletions in the test
land in two deletion lists instead of one.

## Summary of changes

We await _some_ validations instead of _all_ validations, because our
execution failpoint
will prevent validation proceeding for any but the first DeletionList.
Usually the workload
just generates one, but if it generates two due to timing, then we must
not expect that the
second one will be validated.
2023-11-08 12:41:48 +00:00
John Spray
b989ad1922 extend test_change_pageserver for failure case, rework changing pageserver (#5693)
Reproducer for https://github.com/neondatabase/neon/issues/5692

The test change in this PR intentionally fails, to demonstrate the
issue.

---------

Co-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>
2023-11-08 11:26:56 +00:00
Em Sharnoff
acef742a6e vm-monitor: Remove dependency on workspace_hack (#5752)
neondatabase/autoscaling builds libs/vm-monitor during CI because it's a
necessary component of autoscaling.

workspace_hack includes a lot of crates that are not necessary for
vm-monitor, which artificially inflates the build time on the
autoscaling side, so hopefully removing the dependency should speed
things up.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 09:41:20 -08:00
duguorong009
11d9d801b5 pageserver: improve the shutdown log error (#5792)
## Problem
- Close #5784 

## Summary of changes
- Update the `GetActiveTenantError` -> `QueryError` conversion process
in `pageserver/src/page_service.rs`
- Update the pytest logging exceptions in
`./test_runner/regress/test_tenant_detach.py`
2023-11-07 16:57:26 +00:00
Andrew Rudenko
fc47af156f Passing neon options to the console (#5781)
The idea is to pass neon_*-prefixed options to the control plane. They can be
used by cplane to dynamically create timelines and computes. Such
options should also be excluded from what is passed to compute. Another issue
is how connection caching works now: because a compute's instance now
depends not only on the hostname but probably on such options too, I included
them in the cache key.
2023-11-07 16:49:26 +01:00
Arpad Müller
e310533ed3 Support JWT key reload in pageserver (#5594)
## Problem

For quickly rotating JWT secrets, we want to be able to reload the JWT
public key file in the pageserver, and also support multiple JWT keys.

See #4897.

## Summary of changes

* Allow directories for the `auth_validation_public_key_path` config
param instead of just files. For the safekeepers, all of their config options
also support multiple JWT keys.
* For the pageservers, make the JWT public keys easily globally swappable
by using the `arc-swap` crate.
* Add an endpoint to the pageserver, triggered by a POST to
`/v1/reload_auth_validation_keys`, that reloads the JWT public keys from
the pre-configured path (for security reasons, you cannot upload any
keys yourself).

Fixes #4897

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 15:43:29 +01:00
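A minimal sketch of the `arc-swap` pattern described above, assuming a plain `Vec` of `jsonwebtoken::DecodingKey`s (the real pageserver type is richer):

```rust
use std::sync::Arc;
use arc_swap::ArcSwap;
use jsonwebtoken::DecodingKey;

struct JwtAuth {
    keys: ArcSwap<Vec<DecodingKey>>,
}

impl JwtAuth {
    fn new(initial: Vec<DecodingKey>) -> Self {
        Self { keys: ArcSwap::from_pointee(initial) }
    }

    /// Called when the reload endpoint re-reads the key directory: readers that
    /// already loaded the old Vec keep using it; new validations see the fresh one.
    fn reload(&self, fresh: Vec<DecodingKey>) {
        self.keys.store(Arc::new(fresh));
    }

    fn current(&self) -> Arc<Vec<DecodingKey>> {
        self.keys.load_full()
    }
}
```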
John Spray
1d68f52b57 pageserver: move deletion failpoint inside backoff (#5814)
## Problem

When enabled, this failpoint would busy-spin in a loop that emits log
messages.

## Summary of changes

Move the failpoint inside a backoff::exponential block: it will still
spam the log, but at much lower rate.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 14:25:51 +00:00
Alexander Bayandin
4cd47b7d4b Dockerfile: Set BUILD_TAG for storage services (#5812)
## Problem

https://github.com/neondatabase/neon/pull/5576 added `build-tag`
reporting to `libmetrics_build_info`, but it's not reported because we
didn't set the corresponding env variable in the build process.

## Summary of changes
- Add `BUILD_TAG` env var while building services
2023-11-07 13:45:59 +00:00
Fernando Luz
0141c95788 build: Add warning when missing postgres submodule during the build (#5614)
I forked the project and, in my local repo, I wasn't able to compile it.
In my search I found the solution in the Neon forum. After a PR
discussion, I made a change in the Makefile to warn about the missing `git
submodule update` step.

---------

Signed-off-by: Fernando Luz <prof.fernando.luz@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 12:13:05 +00:00
Shany Pozin
0ac4cf67a6 Use self.tenants instead of TENANTS (#5811) 2023-11-07 11:38:02 +00:00
Joonas Koivunen
4be6bc7251 refactor: remove unnecessary unsafe (#5802)
unsafe impls for `Send` and `Sync` should not be added by default. In
the case of `SlotGuard`, removing them does not cause any issues, as the
compiler automatically derives those.

This PR adds a requirement to document the unsafety (see
[clippy::undocumented_unsafe_blocks]) and opportunistically adds
`#![deny(unsafe_code)]` to most places where we don't have unsafe code
right now.

TRPL on Send and Sync:
https://doc.rust-lang.org/book/ch16-04-extensible-concurrency-sync-and-send.html

[clippy::undocumented_unsafe_blocks]:
https://rust-lang.github.io/rust-clippy/master/#/undocumented_unsafe_blocks
2023-11-07 10:26:25 +00:00
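A tiny illustration of the convention the lint enforces (not code from this PR): crates without unsafe code get `#![deny(unsafe_code)]` at the crate root, and any remaining `unsafe` block carries a SAFETY comment:

```rust
fn first_byte(bytes: &[u8]) -> u8 {
    assert!(!bytes.is_empty());
    // SAFETY: the assert above guarantees the slice has at least one element.
    unsafe { *bytes.get_unchecked(0) }
}
```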
John Spray
a394f49e0d pageserver: avoid converting an error to anyhow::Error (#5803)
This was preventing it getting cleanly converted to a
CalculateLogicalSizeError::Cancelled, resulting in "Logical size
calculation failed" errors in logs.
2023-11-07 09:35:45 +00:00
John Spray
c00651ff9b pageserver: start refactoring into TenantManager (#5797)
## Problem

See: https://github.com/neondatabase/neon/issues/5796

## Summary of changes

Completing the refactor is quite verbose and can be done in stages: each
interface that is currently called directly from a top-level mgr.rs
function can be moved into TenantManager once the relevant subsystems
have access to it.

Landing the initial change to create of TenantManager is useful because
it enables new code to use it without having to be altered later, and
sets us up to incrementally fix the existing code to use an explicit
Arc<TenantManager> instead of relying on the static TENANTS.
2023-11-07 09:06:53 +00:00
Richy Wang
bea8efac24 Fix comments in 'receive_wal.rs'. (#5807)
## Problem
Some comments in 'receive_wal.rs' are not accurate. They appear to have been
copied from 'send_wal.rs' and left unchanged.
## Summary of changes
This commit fixes two comments in the code:
Changed "/// Unregister walsender." to "/// Unregister walreceiver."
Changed "///Scope guard to access slot in WalSenders registry" to
"///Scope guard to access slot in WalReceivers registry."
2023-11-07 09:13:01 +01:00
Conrad Ludgate
ad5b02e175 proxy: remove unsafe (#5805)
## Problem

`unsafe {}`

## Summary of changes

`CStr` has a method to parse the bytes up to a null byte, so we don't
have to do it ourselves.
2023-11-06 17:44:44 +00:00
Arpad Müller
b09a851705 Make azure blob storage not do extra metadata requests (#5777)
Load the metadata from the returned `GetBlobResponse` and avoid
downloading it via a separate request.
As it turns out, the SDK does return the metadata:
https://github.com/Azure/azure-sdk-for-rust/issues/1439 .

This PR will reduce the number of requests to Azure caused by downloads.

Fixes #5571
2023-11-06 15:16:55 +00:00
John Spray
85cd97af61 pageserver: add InProgress tenant map state, use a sync lock for the map (#5367)
## Problem

Follows on from #5299 
- We didn't have a generic way to protect a tenant undergoing changes:
`Tenant` had states, but for our arbitrary transitions between
secondary/attached, we need a general way to say "reserve this tenant
ID, and don't allow any other ops on it, but don't try and report it as
being in any particular state".
- The TenantsMap structure was behind an async RwLock, but it was never
correct to hold it across await points: that would block any other
changes for all tenants.


## Summary of changes

- Add the `TenantSlot::InProgress` value.  This means:
  - Incoming administrative operations on the tenant should retry later
- Anything trying to read the live state of the tenant (e.g. a page
service reader) should retry later or block.
- Store TenantsMap in `std::sync::RwLock`
- Provide an extended `get_active_tenant_with_timeout` for page_service
to use, which will wait on InProgress slots as well as non-active
tenants.

Closes: https://github.com/neondatabase/neon/issues/5378

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-06 14:03:22 +00:00
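A sketch of the shape described above, with illustrative names (the real map is keyed by `TenantId` and has more slot states):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

struct Tenant; // stand-in

enum TenantSlot {
    /// Fully attached tenant, safe to hand out.
    Attached(Arc<Tenant>),
    /// Reserved: an attach/detach/migration is in flight; callers retry later.
    InProgress,
}

/// A *synchronous* lock: the guard is never held across an await point.
type TenantsMap = RwLock<HashMap<String, TenantSlot>>;

fn get_active_tenant(map: &TenantsMap, tenant_id: &str) -> Option<Arc<Tenant>> {
    match map.read().unwrap().get(tenant_id) {
        Some(TenantSlot::Attached(t)) => Some(Arc::clone(t)),
        // InProgress (or missing): the caller retries later or waits elsewhere.
        Some(TenantSlot::InProgress) | None => None,
    }
}
```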
Arpad Müller
e6470ee92e Add API description for safekeeper copy endpoint (#5770)
Adds a yaml API description for a new endpoint that allows creation of a
new timeline as the copy of an existing one.
 
Part of #5282
2023-11-06 15:00:07 +01:00
bojanserafimov
dc72567288 Layer flush minor speedup (#5765)
Convert keys to `i128` before sorting
2023-11-06 08:58:20 -05:00
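The idea, sketched with a stand-in 16-byte key type (the real `Key` differs): compute the integer form of each key once and sort on that, rather than converting inside the comparator on every comparison:

```rust
/// Decorate-sort-undecorate: tag each key with its i128 value, sort, strip the tag.
fn sort_keys(keys: Vec<[u8; 16]>) -> Vec<[u8; 16]> {
    let mut tagged: Vec<(i128, [u8; 16])> = keys
        .into_iter()
        .map(|k| (i128::from_be_bytes(k), k))
        .collect();
    tagged.sort_unstable_by_key(|t| t.0);
    tagged.into_iter().map(|(_, k)| k).collect()
}
```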
John Spray
6defa2b5d5 pageserver: add Gate as a partner to CancellationToken for safe shutdown of Tenant & Timeline (#5711)
## Problem

When shutting down a Tenant, it isn't just important to cause any
background tasks to stop. It's also important to wait until they have
stopped before declaring shutdown complete, in cases where we may re-use
the tenant's local storage for something else, such as running in
secondary mode, or creating a new tenant with the same ID.

## Summary of changes

A `Gate` class is added, inspired by
[seastar::gate](https://docs.seastar.io/master/classseastar_1_1gate.html).
For types that have an important lifetime that corresponds to some
physical resource, use of a Gate as well as a CancellationToken provides
a robust pattern for async requests & shutdown:
- Requests must always acquire the gate as long as they are using the
object
- Shutdown must set the cancellation token, and then `close()` the gate
to wait for requests in progress before returning.

This is not for memory safety: it's for expressing the difference
between "Arc<Tenant> exists", and "This tenant's files on disk are
eligible to be read/written".

- Both Tenant and Timeline get a Gate & CancellationToken.
- The Timeline gate is held during eviction of layers, and during
page_service requests.
- Existing cancellation support in page_service is refined to use the
timeline-scope cancellation token instead of a process-scope
cancellation token. This replaces the use of `task_mgr::associate_with`:
tasks no longer change their tenant/timeline identity after being
spawned.

The Tenant's Gate is not yet used, but will be important for
Tenant-scoped operations in secondary mode, where we must ensure that
our secondary-mode downloads for a tenant are gated wrt the activity of
an attached Tenant.

This is part of a broader move away from using the global-state driven
`task_mgr` shutdown tokens:
- less global state where we rely on implicit knowledge of what task a
given function is running in, and more explicit references to the
cancellation token that a particular function/type will respect, making
shutdown easier to reason about.
- eventually avoid the big global TASKS mutex.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-06 12:39:20 +00:00
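A compressed sketch of the request/shutdown pattern, with a toy semaphore-backed `Gate` standing in for the real type (which is richer than this):

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore, TryAcquireError};
use tokio_util::sync::CancellationToken;

const GATE_PERMITS: u32 = 1 << 20;

/// Toy gate: each request holds one permit; close() takes (and forgets) every
/// permit, so it only resolves once all outstanding guards have been dropped,
/// and later enter() calls fail.
struct Gate {
    sem: Arc<Semaphore>,
}

impl Gate {
    fn new() -> Self {
        Self { sem: Arc::new(Semaphore::new(GATE_PERMITS as usize)) }
    }

    fn enter(&self) -> Result<OwnedSemaphorePermit, TryAcquireError> {
        self.sem.clone().try_acquire_owned()
    }

    async fn close(&self) {
        self.sem
            .clone()
            .acquire_many_owned(GATE_PERMITS)
            .await
            .expect("gate semaphore is never closed")
            .forget();
    }
}

struct Timeline {
    cancel: CancellationToken,
    gate: Gate,
}

impl Timeline {
    async fn handle_request(&self) -> Result<(), &'static str> {
        // Hold the gate open for as long as we touch the timeline's resources.
        let _guard = self.gate.enter().map_err(|_| "shutting down")?;
        if self.cancel.is_cancelled() {
            return Err("shutting down");
        }
        // ... do the actual work here ...
        Ok(())
    }

    /// Order matters: signal cancellation first, then wait for in-flight
    /// requests to drop their gate guards.
    async fn shutdown(&self) {
        self.cancel.cancel();
        self.gate.close().await;
    }
}
```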
duguorong009
b3d3a2587d feat: improve the serde impl for several types(Lsn, TenantId, TimelineId ...) (#5335)
Improve the serde impl for several types (`Lsn`, `TenantId`,
`TimelineId`) by making them sensitive to
`Serializer::is_human_readadable` (true for json, false for bincode).

Fixes #3511 by:
- Implement the custom serde for `Lsn`
- Implement the custom serde for `Id`
- Add the helper module `serde_as_u64` in `libs/utils/src/lsn.rs`
- Remove the unnecessary attr `#[serde_as(as = "DisplayFromStr")]` in
all possible structs

Additionally some safekeeper types gained serde tests.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-06 11:40:03 +02:00
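A minimal sketch of the `is_human_readable` branch for an `Lsn`-like newtype (the real implementation also covers `Deserialize`):

```rust
use serde::{Serialize, Serializer};

/// Stand-in for the real Lsn newtype.
struct Lsn(u64);

impl Serialize for Lsn {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        if serializer.is_human_readable() {
            // json etc.: keep the familiar "0/16B5A50"-style text form
            serializer.collect_str(&format_args!("{:X}/{:X}", self.0 >> 32, self.0 & 0xffff_ffff))
        } else {
            // bincode etc.: a plain u64, much more compact
            serializer.serialize_u64(self.0)
        }
    }
}
```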
Heikki Linnakangas
b85fc39bdb Update control plane API path for getting compute spec. (#5357)
We changed the path in the control plane. The old path is still accepted
for compatibility with existing computes, but we'd like to phase it out.
2023-11-06 09:26:09 +02:00
duguorong009
09b5954526 refactor: use streaming in safekeeper /v1/debug_dump http response (#5731)
- Update the handler for `/v1/debug_dump` http response in safekeeper
- Update `debug_dump::build()` to use streaming in the JSON build
process
2023-11-05 10:16:54 +00:00
John Spray
306c4f9967 s3_scrubber: prepare for scrubbing buckets with generation-aware content (#5700)
## Problem

The scrubber didn't know how to find the latest index_part when
generations were in use.

## Summary of changes

- Teach the scrubber to do the same dance that pageserver does when
finding the latest index_part.json
- Teach the scrubber how to understand layer files with generation
suffixes.
- General improvement to testability: scan_metadata has a machine
readable output that the testing `S3Scrubber` wrapper can read.
- Existing test coverage of scrubber was false-passing because it just
didn't see any data due to prefixing of data in the bucket. Fix that.

This is incremental improvement: the more confidence we can have in the
scrubber, the more we can use it in integration tests to validate the
state of remote storage.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-11-03 17:36:02 +00:00
Konstantin Knizhnik
5ceccdc7de Logical replication startup fixes (#5750)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1698226491736459

## Summary of changes

Update WAL affected buffers when restoring WAL from safekeeper

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-11-03 18:40:27 +02:00
Conrad Ludgate
cdcaa329bf proxy: no more statements (#5747)
## Problem

My prepared statements change in tokio-postgres landed in the latest
release. It didn't work as we intended.

## Summary of changes

https://github.com/neondatabase/rust-postgres/pull/24
2023-11-03 08:30:58 +00:00
Joonas Koivunen
27bdbf5e36 chore(layer): restore logging, doc changes (#5766)
Some of the log messages were lost with #4938. This PR adds some of
them back, most notably:

- starting to on-demand download
- successful completion of on-demand download
- ability to see when there were many waiters for the layer download
- "unexpectedly on-demand downloading ..." is now `info!`

Additionally, some rare events that should never happen are now logged as
errors.
2023-11-02 19:05:33 +00:00
khanova
4c7fa12a2a Proxy introduce allowed ips (#5729)
## Problem

Proxy doesn't accept wake_compute responses with the allowed IPs.

## Summary of changes

Extend wake_compute api to be able to return allowed_ips.
2023-11-02 16:26:15 +00:00
Em Sharnoff
367971a0e9 vm-monitor: Remove support for file cache in tmpfs (#5617)
ref neondatabase/cloud#7516.

We switched everything over to file cache on disk, now time to remove
support for having it in tmpfs.
2023-11-02 16:06:16 +00:00
bojanserafimov
51570114ea Remove outdated and flaky perf test (#5762) 2023-11-02 10:43:59 -04:00
Joonas Koivunen
098d3111a5 fix(layer): get_and_upgrade and metrics (#5767)
When introducing `get_and_upgrade` I forgot that an `evict_and_wait`
would have already incremented the counter for started evictions, but an
upgrade would just "silently" cancel the eviction as no drop would ever
run. These metrics are likely sources for alerts with the next release,
so it's important to keep them correct.
2023-11-02 13:06:14 +00:00
Joonas Koivunen
3737fe3a4b fix(layer): error out early if layer path is non-file (#5756)
In an earlier PR
https://github.com/neondatabase/neon/pull/5743#discussion_r1378625244 I
added a FIXME, and there's a simple solution suggested by @jcsp, so this PR
implements it. Wondering why I did not implement this originally: there
is no concept of a permanent failure, so this failure will happen quite
often. I don't think the frequency is a problem, however.

Sadly for std::fs::FileType there is only decimal and hex formatting, no
octal.
2023-11-02 11:03:38 +00:00
John Spray
5650138532 pageserver: helpers for explicitly dying on fatal I/O errors (#5651)
Following from discussion on
https://github.com/neondatabase/neon/pull/5436 where hacking an implicit
die-on-fatal-io behavior into an Error type was a source of disagreement
-- in this PR, dying on fatal I/O errors is explicit, with `fatal_err`
and `maybe_fatal_err` helpers in the `MaybeFatalIo` trait, which is
implemented for std::io::Result.

To enable this approach with `crashsafe_overwrite`, the return type of
that function is changed to std::io::Result -- the previous error enum
for this function was not used for any logic, and the utility of saying
exactly which step in the function failed is outweighed by the hygiene
of having an I/O function return an io::Result.

The initial use case for these helpers is the deletion queue.
2023-11-02 09:14:26 +00:00
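A sketch of what such a trait can look like (the signatures, the use of the `libc` crate, and the criteria for "fatal" are assumptions, not the actual pageserver code):

```rust
use std::io;

trait MaybeFatalIo<T> {
    /// Treat any I/O error as fatal: log it and abort the process.
    fn fatal_err(self, context: &str) -> T;
    /// Abort only on errors considered unrecoverable (here: EIO); pass others on.
    fn maybe_fatal_err(self, context: &str) -> io::Result<T>;
}

impl<T> MaybeFatalIo<T> for io::Result<T> {
    fn fatal_err(self, context: &str) -> T {
        match self {
            Ok(v) => v,
            Err(e) => {
                eprintln!("fatal I/O error: {context}: {e}");
                std::process::abort();
            }
        }
    }

    fn maybe_fatal_err(self, context: &str) -> io::Result<T> {
        if let Err(e) = &self {
            if e.raw_os_error() == Some(libc::EIO) {
                eprintln!("fatal I/O error: {context}: {e}");
                std::process::abort();
            }
        }
        self
    }
}
```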
Joonas Koivunen
2dca4c03fc feat(layer): cancellable get_or_maybe_download (#5744)
With the layer implementation as done in #4938, it is possible via
cancellation to cause two concurrent downloads on the same path, due to
how `RemoteTimelineClient::download_remote_layer` handles tempfiles. Thread
the init semaphore through the spawned download task to make this
impossible.
2023-11-02 08:06:32 +00:00
bojanserafimov
0b790b6d00 Record wal size in import benchmark (#5755) 2023-11-01 17:02:58 -04:00
Joonas Koivunen
e82d1ad6b8 fix(layer): reinit on access before eviction happens (#5743)
Right before merging, I added a loop to `fn
LayerInner::get_or_maybe_download`, which was always supposed to be
there. However I had forgotten to restart initialization instead of
waiting for the eviction to happen, to support the original design goal of
"eviction should always lose to redownload (or init)". This was wrong.
After this fix, if the `spawn_blocking` queue is blocked on something,
nothing bad will happen.

Part of #5737.
2023-11-01 17:38:32 +02:00
343 changed files with 28124 additions and 13398 deletions

View File

@@ -1,17 +1,3 @@
# The binaries are really slow, if you compile them in 'dev' mode with the defaults.
# Enable some optimizations even in 'dev' mode, to make tests faster. The basic
# optimizations enabled by "opt-level=1" don't affect debuggability too much.
#
# See https://www.reddit.com/r/rust/comments/gvrgca/this_is_a_neat_trick_for_getting_good_runtime/
#
[profile.dev.package."*"]
# Set the default for dependencies in Development mode.
opt-level = 3
[profile.dev]
# Turn on a small amount of optimization in Development mode.
opt-level = 1
[build]
# This is only present for local builds, as it will be overridden
# by the RUSTDOCFLAGS env var in CI.

View File

@@ -22,5 +22,11 @@ platforms = [
# "x86_64-pc-windows-msvc",
]
[final-excludes]
# vm_monitor benefits from the same Cargo.lock as the rest of our artifacts, but
# it is built primarly in separate repo neondatabase/autoscaling and thus is excluded
# from depending on workspace-hack because most of the dependencies are not used.
workspace-members = ["vm_monitor"]
# Write out exact versions rather than a semver range. (Defaults to false.)
# exact-versions = true

View File

@@ -17,8 +17,9 @@ assignees: ''
## Implementation ideas
## Tasks
- [ ]
```[tasklist]
### Tasks
```
## Other related tasks and Epics

View File

@@ -3,7 +3,7 @@
**NB: this PR must be merged only by 'Create a merge commit'!**
### Checklist when preparing for release
- [ ] Read or refresh [the release flow guide](https://github.com/neondatabase/cloud/wiki/Release:-general-flow)
- [ ] Read or refresh [the release flow guide](https://www.notion.so/neondatabase/Release-general-flow-61f2e39fd45d4d14a70c7749604bd70b)
- [ ] Ask in the [cloud Slack channel](https://neondb.slack.com/archives/C033A2WE6BZ) that you are going to rollout the release. Any blockers?
- [ ] Does this release contain any db migrations? Destructive ones? What is the rollback plan?

View File

@@ -1,5 +1,7 @@
self-hosted-runner:
labels:
- arm64
- dev
- gen3
- large
- small

View File

@@ -11,7 +11,7 @@ on:
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
# │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
- cron: '0 3 * * *' # run once a day, timezone is utc
- cron: '0 3 * * *' # run once a day, timezone is utc
workflow_dispatch: # adds ability to run this manually
inputs:
@@ -23,6 +23,21 @@ on:
type: boolean
description: 'Publish perf report. If not set, the report will be published only for the main branch'
required: false
collect_olap_explain:
type: boolean
description: 'Collect EXPLAIN ANALYZE for OLAP queries. If not set, EXPLAIN ANALYZE will not be collected'
required: false
default: false
collect_pg_stat_statements:
type: boolean
description: 'Collect pg_stat_statements for OLAP queries. If not set, pg_stat_statements will not be collected'
required: false
default: false
run_AWS_RDS_AND_AURORA:
type: boolean
description: 'AWS-RDS and AWS-AURORA normally only run on Saturday. Set this to true to run them on every workflow_dispatch'
required: false
default: false
defaults:
run:
@@ -113,6 +128,8 @@ jobs:
# - neon-captest-reuse: Reusing existing project
# - rds-aurora: Aurora Postgres Serverless v2 with autoscaling from 0.5 to 2 ACUs
# - rds-postgres: RDS Postgres db.m5.large instance (2 vCPU, 8 GiB) with gp3 EBS storage
env:
RUN_AWS_RDS_AND_AURORA: ${{ github.event.inputs.run_AWS_RDS_AND_AURORA || 'false' }}
runs-on: ubuntu-latest
outputs:
pgbench-compare-matrix: ${{ steps.pgbench-compare-matrix.outputs.matrix }}
@@ -152,7 +169,7 @@ jobs:
]
}'
if [ "$(date +%A)" = "Saturday" ]; then
if [ "$(date +%A)" = "Saturday" ] || [ ${RUN_AWS_RDS_AND_AURORA} = "true" ]; then
matrix=$(echo "$matrix" | jq '.include += [{ "platform": "rds-postgres" },
{ "platform": "rds-aurora" }]')
fi
@@ -171,9 +188,9 @@ jobs:
]
}'
if [ "$(date +%A)" = "Saturday" ]; then
if [ "$(date +%A)" = "Saturday" ] || [ ${RUN_AWS_RDS_AND_AURORA} = "true" ]; then
matrix=$(echo "$matrix" | jq '.include += [{ "platform": "rds-postgres", "scale": "10" },
{ "platform": "rds-aurora", "scale": "10" }]')
{ "platform": "rds-aurora", "scale": "10" }]')
fi
echo "matrix=$(echo "$matrix" | jq --compact-output '.')" >> $GITHUB_OUTPUT
@@ -337,6 +354,8 @@ jobs:
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
DEFAULT_PG_VERSION: 14
TEST_OUTPUT: /tmp/test_output
TEST_OLAP_COLLECT_EXPLAIN: ${{ github.event.inputs.collect_olap_explain }}
TEST_OLAP_COLLECT_PG_STAT_STATEMENTS: ${{ github.event.inputs.collect_pg_stat_statements }}
BUILD_TYPE: remote
SAVE_PERF_REPORT: ${{ github.event.inputs.save_perf_report || ( github.ref_name == 'main' ) }}
PLATFORM: ${{ matrix.platform }}
@@ -399,6 +418,8 @@ jobs:
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
TEST_OLAP_COLLECT_EXPLAIN: ${{ github.event.inputs.collect_olap_explain || 'false' }}
TEST_OLAP_COLLECT_PG_STAT_STATEMENTS: ${{ github.event.inputs.collect_pg_stat_statements || 'false' }}
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
TEST_OLAP_SCALE: 10

View File

@@ -172,10 +172,10 @@ jobs:
# https://github.com/EmbarkStudios/cargo-deny
- name: Check rust licenses/bans/advisories/sources
if: ${{ !cancelled() }}
run: cargo deny check
run: cargo deny check --hide-inclusion-graph
build-neon:
needs: [ check-permissions ]
needs: [ check-permissions, tag ]
runs-on: [ self-hosted, gen3, large ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
@@ -187,6 +187,7 @@ jobs:
env:
BUILD_TYPE: ${{ matrix.build_type }}
GIT_VERSION: ${{ github.event.pull_request.head.sha || github.sha }}
BUILD_TAG: ${{ needs.tag.outputs.build-tag }}
steps:
- name: Fix git ownership
@@ -198,6 +199,10 @@ jobs:
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- name: Checkout
uses: actions/checkout@v3
@@ -403,7 +408,7 @@ jobs:
uses: ./.github/actions/save-coverage-data
regress-tests:
needs: [ check-permissions, build-neon ]
needs: [ check-permissions, build-neon, tag ]
runs-on: [ self-hosted, gen3, large ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
@@ -435,6 +440,7 @@ jobs:
env:
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
CHECK_ONDISK_DATA_COMPATIBILITY: nonempty
BUILD_TAG: ${{ needs.tag.outputs.build-tag }}
- name: Merge and upload coverage data
if: matrix.build_type == 'debug' && matrix.pg_version == 'v14'
@@ -585,10 +591,13 @@ jobs:
id: upload-coverage-report-new
env:
BUCKET: neon-github-public-dev
# A differential coverage report is available only for PRs.
# (i.e. for pushes into main/release branches we have a regular coverage report)
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_SHA: ${{ github.event.pull_request.base.sha || github.sha }}
run: |
BASELINE="$(git merge-base HEAD origin/main)"
CURRENT="${COMMIT_SHA}"
BASELINE="$(git merge-base $BASE_SHA $CURRENT)"
cp /tmp/coverage/report/lcov.info ./${CURRENT}.info
@@ -723,6 +732,7 @@ jobs:
--cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache
--context .
--build-arg GIT_VERSION=${{ github.event.pull_request.head.sha || github.sha }}
--build-arg BUILD_TAG=${{ needs.tag.outputs.build-tag }}
--build-arg REPOSITORY=369495373322.dkr.ecr.eu-central-1.amazonaws.com
--destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/neon:${{needs.tag.outputs.build-tag}}
--destination neondatabase/neon:${{needs.tag.outputs.build-tag}}
@@ -847,7 +857,7 @@ jobs:
run:
shell: sh -eu {0}
env:
VM_BUILDER_VERSION: v0.18.5
VM_BUILDER_VERSION: v0.19.0
steps:
- name: Checkout
@@ -869,8 +879,7 @@ jobs:
- name: Build vm image
run: |
./vm-builder \
-enable-file-cache \
-cgroup-uid=postgres \
-spec=vm-image-spec.yaml \
-src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
-dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
@@ -1092,6 +1101,10 @@ jobs:
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- name: Checkout
uses: actions/checkout@v3

View File

@@ -21,7 +21,10 @@ env:
jobs:
check-macos-build:
if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos')
if: |
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
timeout-minutes: 90
runs-on: macos-latest
@@ -112,8 +115,200 @@ jobs:
- name: Check that no warnings are produced
run: ./run_clippy.sh
check-linux-arm-build:
timeout-minutes: 90
runs-on: [ self-hosted, dev, arm64 ]
env:
# Use release build only, to have less debug info around
# Hence keeping target/ (and general cache size) smaller
BUILD_TYPE: release
CARGO_FEATURES: --features testing
CARGO_FLAGS: --locked --release
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_DEV }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY_DEV }}
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
options: --init
steps:
- name: Fix git ownership
run: |
# Workaround for `fatal: detected dubious ownership in repository at ...`
#
# Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
# Ref https://github.com/actions/checkout/issues/785
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- name: Checkout
uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 1
- name: Set pg 14 revision for caching
id: pg_v14_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v14) >> $GITHUB_OUTPUT
- name: Set pg 15 revision for caching
id: pg_v15_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v15) >> $GITHUB_OUTPUT
- name: Set pg 16 revision for caching
id: pg_v16_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v16) >> $GITHUB_OUTPUT
- name: Set env variables
run: |
echo "CARGO_HOME=${GITHUB_WORKSPACE}/.cargo" >> $GITHUB_ENV
- name: Cache postgres v14 build
id: cache_pg_14
uses: actions/cache@v3
with:
path: pg_install/v14
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v14_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v15 build
id: cache_pg_15
uses: actions/cache@v3
with:
path: pg_install/v15
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v15_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v16 build
id: cache_pg_16
uses: actions/cache@v3
with:
path: pg_install/v16
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v16_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Build postgres v14
if: steps.cache_pg_14.outputs.cache-hit != 'true'
run: mold -run make postgres-v14 -j$(nproc)
- name: Build postgres v15
if: steps.cache_pg_15.outputs.cache-hit != 'true'
run: mold -run make postgres-v15 -j$(nproc)
- name: Build postgres v16
if: steps.cache_pg_16.outputs.cache-hit != 'true'
run: mold -run make postgres-v16 -j$(nproc)
- name: Build neon extensions
run: mold -run make neon-pg-ext -j$(nproc)
- name: Build walproposer-lib
run: mold -run make walproposer-lib -j$(nproc)
- name: Run cargo build
run: |
mold -run cargo build $CARGO_FLAGS $CARGO_FEATURES --bins --tests
- name: Run cargo test
run: |
cargo test $CARGO_FLAGS $CARGO_FEATURES
# Run separate tests for real S3
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty
export REMOTE_STORAGE_S3_BUCKET=neon-github-public-dev
export REMOTE_STORAGE_S3_REGION=eu-central-1
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
cargo test $CARGO_FLAGS --package remote_storage --test test_real_s3
# Run separate tests for real Azure Blob Storage
# XXX: replace region with `eu-central-1`-like region
export ENABLE_REAL_AZURE_REMOTE_STORAGE=y
export AZURE_STORAGE_ACCOUNT="${{ secrets.AZURE_STORAGE_ACCOUNT_DEV }}"
export AZURE_STORAGE_ACCESS_KEY="${{ secrets.AZURE_STORAGE_ACCESS_KEY_DEV }}"
export REMOTE_STORAGE_AZURE_CONTAINER="${{ vars.REMOTE_STORAGE_AZURE_CONTAINER }}"
export REMOTE_STORAGE_AZURE_REGION="${{ vars.REMOTE_STORAGE_AZURE_REGION }}"
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
cargo test $CARGO_FLAGS --package remote_storage --test test_real_azure
check-codestyle-rust-arm:
timeout-minutes: 90
runs-on: [ self-hosted, dev, arm64 ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
options: --init
steps:
- name: Fix git ownership
run: |
# Workaround for `fatal: detected dubious ownership in repository at ...`
#
# Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
# Ref https://github.com/actions/checkout/issues/785
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- name: Checkout
uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 1
# Some of our rust modules use FFI and need those to be checked
- name: Get postgres headers
run: make postgres-headers -j$(nproc)
# cargo hack runs the given cargo subcommand (clippy in this case) for all feature combinations.
# This will catch compiler & clippy warnings in all feature combinations.
# TODO: use cargo hack for build and test as well, but, that's quite expensive.
# NB: keep clippy args in sync with ./run_clippy.sh
- run: |
CLIPPY_COMMON_ARGS="$( source .neon_clippy_args; echo "$CLIPPY_COMMON_ARGS")"
if [ "$CLIPPY_COMMON_ARGS" = "" ]; then
echo "No clippy args found in .neon_clippy_args"
exit 1
fi
echo "CLIPPY_COMMON_ARGS=${CLIPPY_COMMON_ARGS}" >> $GITHUB_ENV
- name: Run cargo clippy (debug)
run: cargo hack --feature-powerset clippy $CLIPPY_COMMON_ARGS
- name: Run cargo clippy (release)
run: cargo hack --feature-powerset clippy --release $CLIPPY_COMMON_ARGS
- name: Check documentation generation
run: cargo doc --workspace --no-deps --document-private-items
env:
RUSTDOCFLAGS: "-Dwarnings -Arustdoc::private_intra_doc_links"
# Use `${{ !cancelled() }}` to run quck tests after the longer clippy run
- name: Check formatting
if: ${{ !cancelled() }}
run: cargo fmt --all -- --check
# https://github.com/facebookincubator/cargo-guppy/tree/bec4e0eb29dcd1faac70b1b5360267fc02bf830e/tools/cargo-hakari#2-keep-the-workspace-hack-up-to-date-in-ci
- name: Check rust dependencies
if: ${{ !cancelled() }}
run: |
cargo hakari generate --diff # workspace-hack Cargo.toml is up-to-date
cargo hakari manage-deps --dry-run # all workspace crates depend on workspace-hack
# https://github.com/EmbarkStudios/cargo-deny
- name: Check rust licenses/bans/advisories/sources
if: ${{ !cancelled() }}
run: cargo deny check
gather-rust-build-stats:
if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-extra-build-stats')
if: |
contains(github.event.pull_request.labels.*.name, 'run-extra-build-stats') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
runs-on: [ self-hosted, gen3, large ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned

View File

@@ -2,7 +2,7 @@ name: Create Release Branch
on:
schedule:
- cron: '0 7 * * 5'
- cron: '0 6 * * 1'
workflow_dispatch:
jobs:

.gitignore vendored
View File

@@ -18,3 +18,6 @@ test_output/
*.o
*.so
*.Po
# pgindent typedef lists
*.list

View File

@@ -9,6 +9,24 @@ refactoring, additional comments, and so forth. Let's try to raise the
bar, and clean things up as we go. Try to leave code in a better shape
than it was before.
## Pre-commit hook
We have a sample pre-commit hook in `pre-commit.py`.
To set it up, run:
```bash
ln -s ../../pre-commit.py .git/hooks/pre-commit
```
This will run following checks on staged files before each commit:
- `rustfmt`
- checks for python files, see [obligatory checks](/docs/sourcetree.md#obligatory-checks).
There is also a separate script `./run_clippy.sh` that runs `cargo clippy` on the whole project
and `./scripts/reformat` that runs all formatting tools to ensure the project is up to date.
If you want to skip the hook, run `git commit` with `--no-verify` option.
## Submitting changes
1. Get at least one +1 on your PR before you push.

Cargo.lock generated

File diff suppressed because it is too large.

View File

@@ -5,6 +5,7 @@ members = [
"control_plane",
"pageserver",
"pageserver/ctl",
"pageserver/client",
"proxy",
"safekeeper",
"storage_broker",
@@ -36,19 +37,20 @@ license = "Apache-2.0"
## All dependency versions, used in the project
[workspace.dependencies]
anyhow = { version = "1.0", features = ["backtrace"] }
async-compression = { version = "0.4.0", features = ["tokio", "gzip"] }
azure_core = "0.16"
azure_identity = "0.16"
azure_storage = "0.16"
azure_storage_blobs = "0.16"
arc-swap = "1.6"
async-compression = { version = "0.4.0", features = ["tokio", "gzip", "zstd"] }
azure_core = "0.18"
azure_identity = "0.18"
azure_storage = "0.18"
azure_storage_blobs = "0.18"
flate2 = "1.0.26"
async-stream = "0.3"
async-trait = "0.1"
aws-config = { version = "0.56", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.29"
aws-smithy-http = "0.56"
aws-credential-types = "0.56"
aws-types = "0.56"
aws-config = { version = "1.0", default-features = false, features=["rustls"] }
aws-sdk-s3 = "1.0"
aws-smithy-async = { version = "1.0", default-features = false, features=["rt-tokio"] }
aws-smithy-types = "1.0"
aws-credential-types = "1.0"
axum = { version = "0.6.20", features = ["ws"] }
base64 = "0.13.0"
bincode = "1.3"
@@ -65,7 +67,7 @@ comfy-table = "6.1"
const_format = "0.2"
crc32c = "0.6"
crossbeam-utils = "0.8.5"
dashmap = "5.5.0"
dashmap = { version = "5.5.0", features = ["raw-api"] }
either = "1.8"
enum-map = "2.4.2"
enumset = "1.0.12"
@@ -81,14 +83,15 @@ hex = "0.4"
hex-literal = "0.4"
hmac = "0.12.1"
hostname = "0.3.1"
http-types = "2"
http-types = { version = "2", default-features = false }
humantime = "2.1"
humantime-serde = "1.1.1"
hyper = "0.14"
hyper-tungstenite = "0.11"
inotify = "0.10.2"
ipnet = "2.9.0"
itertools = "0.10"
jsonwebtoken = "8"
jsonwebtoken = "9"
libc = "0.2"
md5 = "0.7.0"
memoffset = "0.8"
@@ -107,42 +110,48 @@ pin-project-lite = "0.2"
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
prost = "0.11"
rand = "0.8"
regex = "1.4"
regex = "1.10.2"
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] }
reqwest-tracing = { version = "0.4.0", features = ["opentelemetry_0_19"] }
reqwest-middleware = "0.2.0"
reqwest-retry = "0.2.2"
ring = "0.17"
routerify = "3"
rpds = "0.13"
rustc-hash = "1.1.0"
rustls = "0.21"
rustls-pemfile = "1"
rustls = "0.22.1"
rustls-pemfile = "2.0.0"
rustls-split = "0.3"
scopeguard = "1.1"
sysinfo = "0.29.2"
sd-notify = "0.4.1"
sentry = { version = "0.31", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_path_to_error = "0.1"
serde_with = "2.0"
serde_assert = "0.5.0"
sha2 = "0.10.2"
signal-hook = "0.3"
smallvec = "1.11"
smol_str = { version = "0.2.0", features = ["serde"] }
socket2 = "0.5"
strum = "0.24"
strum_macros = "0.24"
svg_fmt = "0.4.1"
sync_wrapper = "0.1.2"
tar = "0.4"
task-local-extensions = "0.1.4"
test-context = "0.1"
thiserror = "1.0"
tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] }
tls-listener = { version = "0.9.0", features = ["rustls"] }
tokio = { version = "1.17", features = ["macros"] }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.10.0"
tokio-rustls = "0.24"
tokio-rustls = "0.25.0"
tokio-stream = "0.1"
tokio-tar = "0.3"
tokio-util = { version = "0.7", features = ["io"] }
tokio-util = { version = "0.7.10", features = ["io", "rt"] }
toml = "0.7"
toml_edit = "0.19"
tonic = {version = "0.9", features = ["tls", "tls-roots"]}
@@ -161,11 +170,11 @@ env_logger = "0.10"
log = "0.4"
## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
postgres-native-tls = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
postgres-native-tls = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
## Other git libraries
heapless = { default-features=false, features=[], git = "https://github.com/japaric/heapless.git", rev = "644653bf3b831c6bb4963be2de24804acf5e5001" } # upstream release pending
@@ -175,6 +184,7 @@ compute_api = { version = "0.1", path = "./libs/compute_api/" }
consumption_metrics = { version = "0.1", path = "./libs/consumption_metrics/" }
metrics = { version = "0.1", path = "./libs/metrics/" }
pageserver_api = { version = "0.1", path = "./libs/pageserver_api/" }
pageserver_client = { path = "./pageserver/client" }
postgres_backend = { version = "0.1", path = "./libs/postgres_backend/" }
postgres_connection = { version = "0.1", path = "./libs/postgres_connection/" }
postgres_ffi = { version = "0.1", path = "./libs/postgres_ffi/" }
@@ -193,7 +203,7 @@ workspace_hack = { version = "0.1", path = "./workspace_hack/" }
## Build dependencies
criterion = "0.5.1"
rcgen = "0.11"
rcgen = "0.12"
rstest = "0.18"
camino-tempfile = "1.0.2"
tonic-build = "0.9"
@@ -202,7 +212,9 @@ tonic-build = "0.9"
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="7434d9388965a17a6d113e5dfc0e65666a03b4c2" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
tls-listener = { git = "https://github.com/conradludgate/tls-listener", branch="main" }
################# Binary contents sections

View File

@@ -27,6 +27,7 @@ RUN set -e \
FROM $REPOSITORY/$IMAGE:$TAG AS build
WORKDIR /home/nonroot
ARG GIT_VERSION=local
ARG BUILD_TAG
# Enable https://github.com/paritytech/cachepot to cache Rust crates' compilation results in Docker builds.
# Set up cachepot to use an AWS S3 bucket for cache results, to reuse it between `docker build` invocations.
@@ -78,9 +79,9 @@ COPY --from=build --chown=neon:neon /home/nonroot/target/release/pg_sni_router
COPY --from=build --chown=neon:neon /home/nonroot/target/release/pageserver /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/pagectl /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/safekeeper /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_broker /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_broker /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/proxy /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin
COPY --from=pg-build /home/nonroot/pg_install/v14 /usr/local/v14/
COPY --from=pg-build /home/nonroot/pg_install/v15 /usr/local/v15/

View File

@@ -393,7 +393,9 @@ RUN case "${PG_VERSION}" in \
export TIMESCALEDB_CHECKSUM=6fca72a6ed0f6d32d2b3523951ede73dc5f9b0077b38450a029a5f411fdb8c73 \
;; \
*) \
echo "TimescaleDB not supported on this PostgreSQL version. See https://github.com/timescale/timescaledb/issues/5752" && exit 0;; \
export TIMESCALEDB_VERSION=2.13.0 \
export TIMESCALEDB_CHECKSUM=584a351c7775f0e067eaa0e7277ea88cab9077cc4c455cbbf09a5d9723dce95d \
;; \
esac && \
apt-get update && \
apt-get install -y cmake && \
@@ -567,6 +569,23 @@ RUN wget https://github.com/ChenHuajun/pg_roaringbitmap/archive/refs/tags/v0.5.4
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/roaringbitmap.control
#########################################################################################
#
# Layer "pg-semver-pg-build"
# compile pg_semver extension
#
#########################################################################################
FROM build-deps AS pg-semver-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/theory/pg-semver/archive/refs/tags/v0.32.1.tar.gz -O pg_semver.tar.gz && \
echo "fbdaf7512026d62eec03fad8687c15ed509b6ba395bff140acd63d2e4fbe25d7 pg_semver.tar.gz" | sha256sum --check && \
mkdir pg_semver-src && cd pg_semver-src && tar xvzf ../pg_semver.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/semver.control
#########################################################################################
#
# Layer "pg-embedding-pg-build"
@@ -714,6 +733,23 @@ RUN wget https://github.com/pksunkara/pgx_ulid/archive/refs/tags/v0.1.3.tar.gz -
cargo pgrx install --release && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/ulid.control
#########################################################################################
#
# Layer "wal2json-build"
# Compile "wal2json" extension
#
#########################################################################################
FROM build-deps AS wal2json-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/eulerto/wal2json/archive/refs/tags/wal2json_2_5.tar.gz && \
echo "b516653575541cf221b99cf3f8be9b6821f6dbcfc125675c85f35090f824f00e wal2json_2_5.tar.gz" | sha256sum --check && \
mkdir wal2json-src && cd wal2json-src && tar xvzf ../wal2json_2_5.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install
#########################################################################################
#
# Layer "neon-pg-ext-build"
@@ -749,7 +785,9 @@ COPY --from=pg-pgx-ulid-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=rdkit-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-uuidv7-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-roaringbitmap-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-semver-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-embedding-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=wal2json-pg-build /usr/local/pgsql /usr/local/pgsql
COPY pgxn/ pgxn/
RUN make -j $(getconf _NPROCESSORS_ONLN) \

View File

@@ -72,6 +72,10 @@ neon: postgres-headers walproposer-lib
#
$(POSTGRES_INSTALL_DIR)/build/%/config.status:
+@echo "Configuring Postgres $* build"
@test -s $(ROOT_PROJECT_DIR)/vendor/postgres-$*/configure || { \
echo "\nPostgres submodule not found in $(ROOT_PROJECT_DIR)/vendor/postgres-$*/, execute "; \
echo "'git submodule update --init --recursive --depth 2 --progress .' in project root.\n"; \
exit 1; }
mkdir -p $(POSTGRES_INSTALL_DIR)/build/$*
(cd $(POSTGRES_INSTALL_DIR)/build/$* && \
env PATH="$(EXTRA_PATH_OVERRIDES):$$PATH" $(ROOT_PROJECT_DIR)/vendor/postgres-$*/configure \
@@ -256,6 +260,44 @@ distclean:
fmt:
./pre-commit.py --fix-inplace
postgres-%-pg-bsd-indent: postgres-%
+@echo "Compiling pg_bsd_indent"
$(MAKE) -C $(POSTGRES_INSTALL_DIR)/build/$*/src/tools/pg_bsd_indent/
# Create typedef list for the core. Note that generally it should be combined with
# buildfarm one to cover platform specific stuff.
# https://wiki.postgresql.org/wiki/Running_pgindent_on_non-core_code_or_development_code
postgres-%-typedefs.list: postgres-%
$(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/find_typedef $(POSTGRES_INSTALL_DIR)/$*/bin > $@
# Indent postgres. See src/tools/pgindent/README for details.
.PHONY: postgres-%-pgindent
postgres-%-pgindent: postgres-%-pg-bsd-indent postgres-%-typedefs.list
+@echo merge with buildfarm typedef to cover all platforms
+@echo note: I first tried to download from pgbuildfarm.org, but for unclear reason e.g. \
REL_16_STABLE list misses PGSemaphoreData
# wget -q -O - "http://www.pgbuildfarm.org/cgi-bin/typedefs.pl?branch=REL_16_STABLE" |\
# cat - postgres-$*-typedefs.list | sort | uniq > postgres-$*-typedefs-full.list
cat $(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/pgindent/typedefs.list |\
cat - postgres-$*-typedefs.list | sort | uniq > postgres-$*-typedefs-full.list
+@echo note: you might want to run it on selected files/dirs instead.
INDENT=$(POSTGRES_INSTALL_DIR)/build/$*/src/tools/pg_bsd_indent/pg_bsd_indent \
$(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/pgindent/pgindent --typedefs postgres-$*-typedefs-full.list \
$(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/ \
--excludes $(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/pgindent/exclude_file_patterns
rm -f pg*.BAK
# Indent pxgn/neon.
.PHONY: pgindent
neon-pgindent: postgres-v16-pg-bsd-indent neon-pg-ext-v16
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/v16/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
FIND_TYPEDEF=$(ROOT_PROJECT_DIR)/vendor/postgres-v16/src/tools/find_typedef \
INDENT=$(POSTGRES_INSTALL_DIR)/build/v16/src/tools/pg_bsd_indent/pg_bsd_indent \
PGINDENT_SCRIPT=$(ROOT_PROJECT_DIR)/vendor/postgres-v16/src/tools/pgindent/pgindent \
-C $(POSTGRES_INSTALL_DIR)/build/neon-v16 \
-f $(ROOT_PROJECT_DIR)/pgxn/neon/Makefile pgindent
.PHONY: setup-pre-commit-hook
setup-pre-commit-hook:
ln -s -f $(ROOT_PROJECT_DIR)/pre-commit.py .git/hooks/pre-commit

View File

@@ -29,13 +29,14 @@ See developer documentation in [SUMMARY.md](/docs/SUMMARY.md) for more informati
```bash
apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \
libcurl4-openssl-dev openssl python-poetry lsof libicu-dev
libcurl4-openssl-dev openssl python3-poetry lsof libicu-dev
```
* On Fedora, these packages are needed:
```bash
dnf install flex bison readline-devel zlib-devel openssl-devel \
libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \
protobuf-devel libcurl-devel openssl poetry lsof libicu-devel
protobuf-devel libcurl-devel openssl poetry lsof libicu-devel libpq-devel python3-devel \
libffi-devel
```
* On Arch based systems, these packages are needed:
```bash
@@ -149,6 +150,9 @@ tenant 9ef87a5bf0d92544f6fafeeb3239695c successfully created on the pageserver
Created an initial timeline 'de200bd42b49cc1814412c7e592dd6e9' at Lsn 0/16B5A50 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c
Setting tenant 9ef87a5bf0d92544f6fafeeb3239695c as a default one
# create postgres compute node
> cargo neon endpoint create main
# start postgres compute node
> cargo neon endpoint start main
Starting new endpoint main (PostgreSQL v14) on timeline de200bd42b49cc1814412c7e592dd6e9 ...
@@ -185,8 +189,11 @@ Created timeline 'b3b863fa45fa9e57e615f9f2d944e601' at Lsn 0/16F9A00 for tenant:
(L) main [de200bd42b49cc1814412c7e592dd6e9]
(L) ┗━ @0/16F9A00: migration_check [b3b863fa45fa9e57e615f9f2d944e601]
# create postgres on that branch
> cargo neon endpoint create migration_check --branch-name migration_check
# start postgres on that branch
> cargo neon endpoint start migration_check --branch-name migration_check
> cargo neon endpoint start migration_check
Starting new endpoint migration_check (PostgreSQL v14) on timeline b3b863fa45fa9e57e615f9f2d944e601 ...
Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55434/postgres'

View File

@@ -37,4 +37,5 @@ workspace_hack.workspace = true
toml_edit.workspace = true
remote_storage = { version = "0.1", path = "../libs/remote_storage/" }
vm_monitor = { version = "0.1", path = "../libs/vm_monitor/" }
zstd = "0.12.4"
zstd = "0.13"
bytes = "1.0"

View File

@@ -31,7 +31,7 @@
//! -C 'postgresql://cloud_admin@localhost/postgres' \
//! -S /var/db/postgres/specs/current.json \
//! -b /usr/local/bin/postgres \
//! -r {"bucket": "neon-dev-extensions-eu-central-1", "region": "eu-central-1"}
//! -r http://pg-ext-s3-gateway
//! ```
//!
use std::collections::HashMap;
@@ -51,7 +51,7 @@ use compute_api::responses::ComputeStatus;
use compute_tools::compute::{ComputeNode, ComputeState, ParsedSpec};
use compute_tools::configurator::launch_configurator;
use compute_tools::extension_server::{get_pg_version, init_remote_storage};
use compute_tools::extension_server::get_pg_version;
use compute_tools::http::api::launch_http_server;
use compute_tools::logger::*;
use compute_tools::monitor::launch_monitor;
@@ -60,7 +60,7 @@ use compute_tools::spec::*;
// this is an arbitrary build tag. Fine as a default / for testing purposes
// in case the environment variable is not set
const BUILD_TAG_DEFAULT: &str = "5670669815";
const BUILD_TAG_DEFAULT: &str = "latest";
fn main() -> Result<()> {
init_tracing_and_logging(DEFAULT_LOG_LEVEL)?;
@@ -74,10 +74,18 @@ fn main() -> Result<()> {
let pgbin_default = String::from("postgres");
let pgbin = matches.get_one::<String>("pgbin").unwrap_or(&pgbin_default);
let remote_ext_config = matches.get_one::<String>("remote-ext-config");
let ext_remote_storage = remote_ext_config.map(|x| {
init_remote_storage(x).expect("cannot initialize remote extension storage from config")
});
let ext_remote_storage = matches
.get_one::<String>("remote-ext-config")
// Compatibility hack: if the control plane specified any remote-ext-config
// use the default value for extension storage proxy gateway.
// Remove this once the control plane is updated to pass the gateway URL
.map(|conf| {
if conf.starts_with("http") {
conf.trim_end_matches('/')
} else {
"http://pg-ext-s3-gateway"
}
});
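A minimal standalone sketch of the fallback logic above, assuming only what the diff shows (the gateway address is the default introduced here; everything else is illustrative):

```rust
// Sketch: values that look like URLs are used as-is (minus a trailing slash),
// anything else (e.g. the legacy JSON bucket config) maps to the default gateway.
fn resolve_ext_gateway(conf: &str) -> &str {
    if conf.starts_with("http") {
        conf.trim_end_matches('/')
    } else {
        "http://pg-ext-s3-gateway"
    }
}

fn main() {
    assert_eq!(
        resolve_ext_gateway("http://my-gateway:8080/"),
        "http://my-gateway:8080"
    );
    assert_eq!(
        resolve_ext_gateway(r#"{"bucket":"x","region":"y"}"#),
        "http://pg-ext-s3-gateway"
    );
}
```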
let http_port = *matches
.get_one::<u16>("http-port")
@@ -198,7 +206,7 @@ fn main() -> Result<()> {
live_config_allowed,
state: Mutex::new(new_state),
state_changed: Condvar::new(),
ext_remote_storage,
ext_remote_storage: ext_remote_storage.map(|s| s.to_string()),
ext_download_progress: RwLock::new(HashMap::new()),
build_tag,
};
@@ -266,7 +274,13 @@ fn main() -> Result<()> {
let mut state = compute.state.lock().unwrap();
state.error = Some(format!("{:?}", err));
state.status = ComputeStatus::Failed;
drop(state);
// Notify others that Postgres failed to start. When configuring an empty
// compute, the API handler is likely still waiting for a compute state
// change. With this we notify it that the compute is in the Failed state,
// so the control plane learns about it earlier and records a proper error
// instead of a timeout.
compute.state_changed.notify_all();
drop(state); // unlock
delay_exit = true;
None
}
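To make the notify-before-unlock reasoning above concrete, here is a minimal sketch with plain `std` types (not compute_ctl's own state machine): the waiter blocks on the condvar and is guaranteed to observe the `Failed` status once `notify_all` fires while the state is already updated.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Debug, PartialEq)]
enum Status {
    Init,
    Failed,
}

fn main() {
    let pair = Arc::new((Mutex::new(Status::Init), Condvar::new()));

    // Waiter: models the API handler blocked on a state change.
    let waiter = {
        let pair = Arc::clone(&pair);
        thread::spawn(move || {
            let (lock, cvar) = &*pair;
            let state = cvar
                .wait_while(lock.lock().unwrap(), |s| *s == Status::Init)
                .unwrap();
            println!("observed state: {:?}", *state);
        })
    };

    // Notifier: update the state, notify while still holding the lock, then unlock.
    {
        let (lock, cvar) = &*pair;
        let mut state = lock.lock().unwrap();
        *state = Status::Failed;
        cvar.notify_all();
    }

    waiter.join().unwrap();
}
```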
@@ -283,7 +297,6 @@ fn main() -> Result<()> {
.expect("--vm-monitor-addr should always be set because it has a default arg");
let file_cache_connstr = matches.get_one::<String>("filecache-connstr");
let cgroup = matches.get_one::<String>("cgroup");
let file_cache_on_disk = matches.get_flag("file-cache-on-disk");
// Only make a runtime if we need to.
// Note: it seems like you can make a runtime in an inner scope and
@@ -310,7 +323,6 @@ fn main() -> Result<()> {
cgroup: cgroup.cloned(),
pgconnstr: file_cache_connstr.cloned(),
addr: vm_monitor_addr.clone(),
file_cache_on_disk,
})),
token.clone(),
))
@@ -481,11 +493,6 @@ fn cli() -> clap::Command {
)
.value_name("FILECACHE_CONNSTR"),
)
.arg(
Arg::new("file-cache-on-disk")
.long("file-cache-on-disk")
.action(clap::ArgAction::SetTrue),
)
}
#[test]

View File

@@ -22,10 +22,10 @@ use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
use compute_api::responses::{ComputeMetrics, ComputeStatus};
use compute_api::spec::{ComputeMode, ComputeSpec};
use compute_api::spec::{ComputeFeature, ComputeMode, ComputeSpec};
use utils::measured_stream::MeasuredReader;
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use remote_storage::{DownloadError, RemotePath};
use crate::checker::create_availability_check_data;
use crate::pg_helpers::*;
@@ -59,8 +59,8 @@ pub struct ComputeNode {
pub state: Mutex<ComputeState>,
/// `Condvar` to allow notifying waiters about state changes.
pub state_changed: Condvar,
/// the S3 bucket that we search for extensions in
pub ext_remote_storage: Option<GenericRemoteStorage>,
/// the address of extension storage proxy gateway
pub ext_remote_storage: Option<String>,
// key: ext_archive_name, value: started download time, download_completed?
pub ext_download_progress: RwLock<HashMap<String, (DateTime<Utc>, bool)>>,
pub build_tag: String,
@@ -252,7 +252,7 @@ fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()>
IF NOT EXISTS (
SELECT FROM pg_catalog.pg_roles WHERE rolname = 'neon_superuser')
THEN
CREATE ROLE neon_superuser CREATEDB CREATEROLE NOLOGIN REPLICATION IN ROLE pg_read_all_data, pg_write_all_data;
CREATE ROLE neon_superuser CREATEDB CREATEROLE NOLOGIN REPLICATION BYPASSRLS IN ROLE pg_read_all_data, pg_write_all_data;
IF array_length(roles, 1) IS NOT NULL THEN
EXECUTE format('GRANT neon_superuser TO %s',
array_to_string(ARRAY(SELECT quote_ident(x) FROM unnest(roles) as x), ', '));
@@ -277,6 +277,17 @@ fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()>
}
impl ComputeNode {
/// Check that compute node has corresponding feature enabled.
pub fn has_feature(&self, feature: ComputeFeature) -> bool {
let state = self.state.lock().unwrap();
if let Some(s) = state.pspec.as_ref() {
s.spec.features.contains(&feature)
} else {
false
}
}
pub fn set_status(&self, status: ComputeStatus) {
let mut state = self.state.lock().unwrap();
state.status = status;
@@ -698,6 +709,7 @@ impl ComputeNode {
handle_role_deletions(spec, self.connstr.as_str(), &mut client)?;
handle_grants(spec, &mut client, self.connstr.as_str())?;
handle_extensions(spec, &mut client)?;
handle_extension_neon(&mut client)?;
create_availability_check_data(&mut client)?;
// 'Close' connection
@@ -710,8 +722,12 @@ impl ComputeNode {
// `pg_ctl` for start / stop, so this just seems much easier to do as we already
// have opened connection to Postgres and superuser access.
#[instrument(skip_all)]
fn pg_reload_conf(&self, client: &mut Client) -> Result<()> {
client.simple_query("SELECT pg_reload_conf()")?;
fn pg_reload_conf(&self) -> Result<()> {
let pgctl_bin = Path::new(&self.pgbin).parent().unwrap().join("pg_ctl");
Command::new(pgctl_bin)
.args(["reload", "-D", &self.pgdata])
.output()
.expect("cannot run pg_ctl process");
Ok(())
}
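A hedged variant of the `pg_ctl`-based reload above that checks the command's exit status instead of panicking via `.expect(...)`; the binary and data-directory paths are placeholders:

```rust
use std::process::Command;

use anyhow::{ensure, Result};

// Run `pg_ctl reload -D <pgdata>` and surface stderr on failure.
fn reload_postgres(pgctl_bin: &str, pgdata: &str) -> Result<()> {
    let out = Command::new(pgctl_bin)
        .args(["reload", "-D", pgdata])
        .output()?;
    ensure!(
        out.status.success(),
        "pg_ctl reload failed: {}",
        String::from_utf8_lossy(&out.stderr)
    );
    Ok(())
}
```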
@@ -723,10 +739,15 @@ impl ComputeNode {
// Write new config
let pgdata_path = Path::new(&self.pgdata);
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), &spec, None)?;
let postgresql_conf_path = pgdata_path.join("postgresql.conf");
config::write_postgres_conf(&postgresql_conf_path, &spec, None)?;
// temporarily reset max_cluster_size in config
// to avoid the possibility of hitting the limit, while we are reconfiguring:
// creating new extensions, roles, etc...
config::compute_ctl_temp_override_create(pgdata_path, "neon.max_cluster_size=-1")?;
self.pg_reload_conf()?;
let mut client = Client::connect(self.connstr.as_str(), NoTls)?;
self.pg_reload_conf(&mut client)?;
// Proceed with post-startup configuration. Note, that order of operations is important.
// Disable DDL forwarding because control plane already knows about these roles/databases.
@@ -738,11 +759,16 @@ impl ComputeNode {
handle_role_deletions(&spec, self.connstr.as_str(), &mut client)?;
handle_grants(&spec, &mut client, self.connstr.as_str())?;
handle_extensions(&spec, &mut client)?;
handle_extension_neon(&mut client)?;
}
// 'Close' connection
drop(client);
// reset max_cluster_size in config back to original value and reload config
config::compute_ctl_temp_override_remove(pgdata_path)?;
self.pg_reload_conf()?;
let unknown_op = "unknown".to_string();
let op_id = spec.operation_uuid.as_ref().unwrap_or(&unknown_op);
info!(
@@ -803,7 +829,17 @@ impl ComputeNode {
let config_time = Utc::now();
if pspec.spec.mode == ComputeMode::Primary && !pspec.spec.skip_pg_catalog_updates {
let pgdata_path = Path::new(&self.pgdata);
// temporarily reset max_cluster_size in config
// to avoid the possibility of hitting the limit, while we are applying config:
// creating new extensions, roles, etc...
config::compute_ctl_temp_override_create(pgdata_path, "neon.max_cluster_size=-1")?;
self.pg_reload_conf()?;
self.apply_config(&compute_state)?;
config::compute_ctl_temp_override_remove(pgdata_path)?;
self.pg_reload_conf()?;
}
let startup_end_time = Utc::now();
@@ -951,12 +987,12 @@ LIMIT 100",
real_ext_name: String,
ext_path: RemotePath,
) -> Result<u64, DownloadError> {
let remote_storage = self
.ext_remote_storage
.as_ref()
.ok_or(DownloadError::BadInput(anyhow::anyhow!(
"Remote extensions storage is not configured",
)))?;
let ext_remote_storage =
self.ext_remote_storage
.as_ref()
.ok_or(DownloadError::BadInput(anyhow::anyhow!(
"Remote extensions storage is not configured",
)))?;
let ext_archive_name = ext_path.object_name().expect("bad path");
@@ -1012,7 +1048,7 @@ LIMIT 100",
let download_size = extension_server::download_extension(
&real_ext_name,
&ext_path,
remote_storage,
ext_remote_storage,
&self.pgbin,
)
.await

View File

@@ -93,5 +93,25 @@ pub fn write_postgres_conf(
writeln!(file, "neon.extension_server_port={}", port)?;
}
// It is essential to keep this line at the end of the file,
// because it is intended to override any settings above.
writeln!(file, "include_if_exists = 'compute_ctl_temp_override.conf'")?;
Ok(())
}
/// create file compute_ctl_temp_override.conf in pgdata_dir
/// add provided options to this file
pub fn compute_ctl_temp_override_create(pgdata_path: &Path, options: &str) -> Result<()> {
let path = pgdata_path.join("compute_ctl_temp_override.conf");
let mut file = File::create(path)?;
write!(file, "{}", options)?;
Ok(())
}
/// remove file compute_ctl_temp_override.conf in pgdata_dir
pub fn compute_ctl_temp_override_remove(pgdata_path: &Path) -> Result<()> {
let path = pgdata_path.join("compute_ctl_temp_override.conf");
std::fs::remove_file(path)?;
Ok(())
}
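Because `postgresql.conf` now ends with the `include_if_exists` line, the two helpers above can bracket any risky reconfiguration step. A hedged usage sketch (`with_temp_override` is a hypothetical wrapper, and a real caller would also reload the Postgres config after creating and after removing the file):

```rust
use std::fs;
use std::path::Path;

use anyhow::Result;

// Write the override file, run the caller's work, then remove the override again.
fn with_temp_override(
    pgdata_path: &Path,
    options: &str, // e.g. "neon.max_cluster_size=-1"
    work: impl FnOnce() -> Result<()>,
) -> Result<()> {
    let path = pgdata_path.join("compute_ctl_temp_override.conf");
    fs::write(&path, options)?; // picked up via include_if_exists
    let result = work(); // reconfigure: extensions, roles, ...
    fs::remove_file(&path)?; // back to the original settings
    result
}
```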

View File

@@ -71,18 +71,16 @@ More specifically, here is an example ext_index.json
}
}
*/
use anyhow::Context;
use anyhow::{self, Result};
use anyhow::{bail, Context};
use bytes::Bytes;
use compute_api::spec::RemoteExtSpec;
use regex::Regex;
use remote_storage::*;
use serde_json;
use std::io::Read;
use std::num::{NonZeroU32, NonZeroUsize};
use reqwest::StatusCode;
use std::path::Path;
use std::str;
use tar::Archive;
use tokio::io::AsyncReadExt;
use tracing::info;
use tracing::log::warn;
use zstd::stream::read::Decoder;
@@ -133,67 +131,36 @@ fn parse_pg_version(human_version: &str) -> &str {
panic!("Unsupported postgres version {human_version}");
}
#[cfg(test)]
mod tests {
use super::parse_pg_version;
#[test]
fn test_parse_pg_version() {
assert_eq!(parse_pg_version("PostgreSQL 15.4"), "v15");
assert_eq!(parse_pg_version("PostgreSQL 15.14"), "v15");
assert_eq!(
parse_pg_version("PostgreSQL 15.4 (Ubuntu 15.4-0ubuntu0.23.04.1)"),
"v15"
);
assert_eq!(parse_pg_version("PostgreSQL 14.15"), "v14");
assert_eq!(parse_pg_version("PostgreSQL 14.0"), "v14");
assert_eq!(
parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
"v14"
);
assert_eq!(parse_pg_version("PostgreSQL 16devel"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16beta1"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16rc2"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16extra"), "v16");
}
#[test]
#[should_panic]
fn test_parse_pg_unsupported_version() {
parse_pg_version("PostgreSQL 13.14");
}
#[test]
#[should_panic]
fn test_parse_pg_incorrect_version_format() {
parse_pg_version("PostgreSQL 14");
}
}
// download the archive for a given extension,
// unzip it, and place files in the appropriate locations (share/lib)
pub async fn download_extension(
ext_name: &str,
ext_path: &RemotePath,
remote_storage: &GenericRemoteStorage,
ext_remote_storage: &str,
pgbin: &str,
) -> Result<u64> {
info!("Download extension {:?} from {:?}", ext_name, ext_path);
let mut download = remote_storage.download(ext_path).await?;
let mut download_buffer = Vec::new();
download
.download_stream
.read_to_end(&mut download_buffer)
.await?;
// TODO add retry logic
let download_buffer =
match download_extension_tar(ext_remote_storage, &ext_path.to_string()).await {
Ok(buffer) => buffer,
Err(error_message) => {
return Err(anyhow::anyhow!(
"error downloading extension {:?}: {:?}",
ext_name,
error_message
));
}
};
let download_size = download_buffer.len() as u64;
info!("Download size {:?}", download_size);
// it's unclear whether it is more performant to decompress into memory or not
// TODO: decompressing into memory can be avoided
let mut decoder = Decoder::new(download_buffer.as_slice())?;
let mut decompress_buffer = Vec::new();
decoder.read_to_end(&mut decompress_buffer)?;
let mut archive = Archive::new(decompress_buffer.as_slice());
let decoder = Decoder::new(download_buffer.as_ref())?;
let mut archive = Archive::new(decoder);
let unzip_dest = pgbin
.strip_suffix("/bin/postgres")
.expect("bad pgbin")
@@ -261,29 +228,69 @@ pub fn create_control_files(remote_extensions: &RemoteExtSpec, pgbin: &str) {
}
}
// This function initializes the necessary structs to use remote storage
pub fn init_remote_storage(remote_ext_config: &str) -> anyhow::Result<GenericRemoteStorage> {
#[derive(Debug, serde::Deserialize)]
struct RemoteExtJson {
bucket: String,
region: String,
endpoint: Option<String>,
prefix: Option<String>,
}
let remote_ext_json = serde_json::from_str::<RemoteExtJson>(remote_ext_config)?;
// Do a request to the extension storage proxy, i.e.
// curl http://pg-ext-s3-gateway/latest/v15/extensions/anon.tar.zst
// using HTTP GET,
// and return the response body as bytes.
//
async fn download_extension_tar(ext_remote_storage: &str, ext_path: &str) -> Result<Bytes> {
let uri = format!("{}/{}", ext_remote_storage, ext_path);
let config = S3Config {
bucket_name: remote_ext_json.bucket,
bucket_region: remote_ext_json.region,
prefix_in_bucket: remote_ext_json.prefix,
endpoint: remote_ext_json.endpoint,
concurrency_limit: NonZeroUsize::new(100).expect("100 != 0"),
max_keys_per_list_response: None,
};
let config = RemoteStorageConfig {
max_concurrent_syncs: NonZeroUsize::new(100).expect("100 != 0"),
max_sync_errors: NonZeroU32::new(100).expect("100 != 0"),
storage: RemoteStorageKind::AwsS3(config),
};
GenericRemoteStorage::from_config(&config)
info!("Download extension {:?} from uri {:?}", ext_path, uri);
let resp = reqwest::get(uri).await?;
match resp.status() {
StatusCode::OK => match resp.bytes().await {
Ok(resp) => {
info!("Download extension {:?} completed successfully", ext_path);
Ok(resp)
}
Err(e) => bail!("could not deserialize remote extension response: {}", e),
},
StatusCode::SERVICE_UNAVAILABLE => bail!("remote extension is temporarily unavailable"),
_ => bail!(
"unexpected remote extension response status code: {}",
resp.status()
),
}
}
#[cfg(test)]
mod tests {
use super::parse_pg_version;
#[test]
fn test_parse_pg_version() {
assert_eq!(parse_pg_version("PostgreSQL 15.4"), "v15");
assert_eq!(parse_pg_version("PostgreSQL 15.14"), "v15");
assert_eq!(
parse_pg_version("PostgreSQL 15.4 (Ubuntu 15.4-0ubuntu0.23.04.1)"),
"v15"
);
assert_eq!(parse_pg_version("PostgreSQL 14.15"), "v14");
assert_eq!(parse_pg_version("PostgreSQL 14.0"), "v14");
assert_eq!(
parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
"v14"
);
assert_eq!(parse_pg_version("PostgreSQL 16devel"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16beta1"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16rc2"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16extra"), "v16");
}
#[test]
#[should_panic]
fn test_parse_pg_unsupported_version() {
parse_pg_version("PostgreSQL 13.14");
}
#[test]
#[should_panic]
fn test_parse_pg_incorrect_version_format() {
parse_pg_version("PostgreSQL 14");
}
}
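Putting the pieces above together, a hedged end-to-end sketch of the new download path (gateway URL and destination are placeholders, error handling is simplified, and an async runtime such as tokio is assumed): fetch the zstd-compressed tarball over HTTP and stream it through the decoder into `tar` without a separate decompression buffer.

```rust
use std::path::Path;

use anyhow::Result;
use tar::Archive;
use zstd::stream::read::Decoder;

// e.g. gateway = "http://pg-ext-s3-gateway", ext_path = "latest/v15/extensions/anon.tar.zst"
async fn fetch_and_unpack(gateway: &str, ext_path: &str, dest: &Path) -> Result<u64> {
    let uri = format!("{gateway}/{ext_path}");
    let body = reqwest::get(&uri).await?.error_for_status()?.bytes().await?;
    let size = body.len() as u64;

    let decoder = Decoder::new(body.as_ref())?; // zstd stream over the downloaded bytes
    Archive::new(decoder).unpack(dest)?; // untar directly from the decoder
    Ok(size)
}
```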

View File

@@ -123,7 +123,7 @@ async fn routes(req: Request<Body>, compute: &Arc<ComputeNode>) -> Response<Body
}
}
// download extension files from S3 on demand
// download extension files from remote extension storage on demand
(&Method::POST, route) if route.starts_with("/extension_server/") => {
info!("serving {:?} POST request", route);
info!("req.uri {:?}", req.uri());
@@ -227,7 +227,7 @@ async fn handle_configure_request(
let parsed_spec = match ParsedSpec::try_from(spec) {
Ok(ps) => ps,
Err(msg) => return Err((msg, StatusCode::PRECONDITION_FAILED)),
Err(msg) => return Err((msg, StatusCode::BAD_REQUEST)),
};
// XXX: wrap state update under lock in code blocks. Otherwise,

View File

@@ -156,17 +156,17 @@ paths:
description: Error text or 'OK' if download succeeded.
example: "OK"
400:
description: Request is invalid.
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
description: Request is invalid.
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
500:
description: Extension download request failed.
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
description: Extension download request failed.
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
components:
securitySchemes:

View File

@@ -1,7 +1,7 @@
//!
//! Various tools and helpers to handle cluster / compute node (Postgres)
//! configuration.
//!
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod checker;
pub mod config;
pub mod configurator;

View File

@@ -193,16 +193,11 @@ impl Escaping for PgIdent {
/// Build a list of existing Postgres roles
pub fn get_existing_roles(xact: &mut Transaction<'_>) -> Result<Vec<Role>> {
let postgres_roles = xact
.query(
"SELECT rolname, rolpassword, rolreplication, rolbypassrls FROM pg_catalog.pg_authid",
&[],
)?
.query("SELECT rolname, rolpassword FROM pg_catalog.pg_authid", &[])?
.iter()
.map(|row| Role {
name: row.get("rolname"),
encrypted_password: row.get("rolpassword"),
replication: Some(row.get("rolreplication")),
bypassrls: Some(row.get("rolbypassrls")),
options: None,
})
.collect();

View File

@@ -68,7 +68,7 @@ pub fn get_spec_from_control_plane(
base_uri: &str,
compute_id: &str,
) -> Result<Option<ComputeSpec>> {
let cp_uri = format!("{base_uri}/management/api/v2/computes/{compute_id}/spec");
let cp_uri = format!("{base_uri}/compute/api/v2/computes/{compute_id}/spec");
let jwt: String = match std::env::var("NEON_CONTROL_PLANE_TOKEN") {
Ok(v) => v,
Err(_) => "".to_string(),
@@ -118,19 +118,6 @@ pub fn get_spec_from_control_plane(
spec
}
/// It takes cluster specification and does the following:
/// - Serialize cluster config and put it into `postgresql.conf` completely rewriting the file.
/// - Update `pg_hba.conf` to allow external connections.
pub fn handle_configuration(spec: &ComputeSpec, pgdata_path: &Path) -> Result<()> {
// File `postgresql.conf` is no longer included into `basebackup`, so just
// always write all config into it creating new file.
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec, None)?;
update_pg_hba(pgdata_path)?;
Ok(())
}
/// Check `pg_hba.conf` and update if needed to allow external connections.
pub fn update_pg_hba(pgdata_path: &Path) -> Result<()> {
// XXX: consider making it a part of spec.json
@@ -265,8 +252,6 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let action = if let Some(r) = pg_role {
if (r.encrypted_password.is_none() && role.encrypted_password.is_some())
|| (r.encrypted_password.is_some() && role.encrypted_password.is_none())
|| !r.bypassrls.unwrap_or(false)
|| !r.replication.unwrap_or(false)
{
RoleAction::Update
} else if let Some(pg_pwd) = &r.encrypted_password {
@@ -298,14 +283,22 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
match action {
RoleAction::None => {}
RoleAction::Update => {
let mut query: String =
format!("ALTER ROLE {} BYPASSRLS REPLICATION", name.pg_quote());
// This can be run on /every/ role! Not just ones created through the console.
// This means that if you add some funny ALTER here that adds a permission,
// this will get run even on user-created roles! This will result in different
// behavior before and after a spec gets reapplied. The below ALTER as it stands
// now only grants LOGIN and changes the password. Please do not allow this branch
// to do anything silly.
let mut query: String = format!("ALTER ROLE {} ", name.pg_quote());
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
}
RoleAction::Create => {
// This branch only runs when roles are created through the console, so it is
// safe to add more permissions here. BYPASSRLS and REPLICATION are inherited
// from neon_superuser.
let mut query: String = format!(
"CREATE ROLE {} CREATEROLE CREATEDB BYPASSRLS REPLICATION IN ROLE neon_superuser",
"CREATE ROLE {} INHERIT CREATEROLE CREATEDB BYPASSRLS REPLICATION IN ROLE neon_superuser",
name.pg_quote()
);
info!("role create query: '{}'", &query);
@@ -674,3 +667,33 @@ pub fn handle_extensions(spec: &ComputeSpec, client: &mut Client) -> Result<()>
Ok(())
}
/// Run CREATE and ALTER EXTENSION neon UPDATE for postgres database
#[instrument(skip_all)]
pub fn handle_extension_neon(client: &mut Client) -> Result<()> {
info!("handle extension neon");
let mut query = "CREATE SCHEMA IF NOT EXISTS neon";
client.simple_query(query)?;
query = "CREATE EXTENSION IF NOT EXISTS neon WITH SCHEMA neon";
info!("create neon extension with query: {}", query);
client.simple_query(query)?;
query = "UPDATE pg_extension SET extrelocatable = true WHERE extname = 'neon'";
client.simple_query(query)?;
query = "ALTER EXTENSION neon SET SCHEMA neon";
info!("alter neon extension schema with query: {}", query);
client.simple_query(query)?;
// this will be a no-op if extension is already up to date,
// which may happen in two cases:
// - extension was just installed
// - extension was already installed and is up to date
let query = "ALTER EXTENSION neon UPDATE";
info!("update neon extension schema with query: {}", query);
client.simple_query(query)?;
Ok(())
}

View File

@@ -6,9 +6,11 @@ license.workspace = true
[dependencies]
anyhow.workspace = true
async-trait.workspace = true
camino.workspace = true
clap.workspace = true
comfy-table.workspace = true
futures.workspace = true
git-version.workspace = true
nix.workspace = true
once_cell.workspace = true
@@ -24,10 +26,11 @@ tar.workspace = true
thiserror.workspace = true
toml.workspace = true
tokio.workspace = true
tokio-postgres.workspace = true
tokio-util.workspace = true
url.workspace = true
# Note: Do not directly depend on pageserver or safekeeper; use pageserver_api or safekeeper_api
# instead, so that recompile times are better.
pageserver_api.workspace = true
pageserver_client.workspace = true
postgres_backend.workspace = true
safekeeper_api.workspace = true
postgres_connection.workspace = true

View File

@@ -2,7 +2,6 @@ use crate::{background_process, local_env::LocalEnv};
use anyhow::anyhow;
use camino::Utf8PathBuf;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::{path::PathBuf, process::Child};
use utils::id::{NodeId, TenantId};
@@ -10,14 +9,13 @@ pub struct AttachmentService {
env: LocalEnv,
listen: String,
path: PathBuf,
client: reqwest::Client,
}
const COMMAND: &str = "attachment_service";
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct AttachHookRequest {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
pub node_id: Option<NodeId>,
}
@@ -27,6 +25,16 @@ pub struct AttachHookResponse {
pub gen: Option<u32>,
}
#[derive(Serialize, Deserialize)]
pub struct InspectRequest {
pub tenant_id: TenantId,
}
#[derive(Serialize, Deserialize)]
pub struct InspectResponse {
pub attachment: Option<(u32, NodeId)>,
}
impl AttachmentService {
pub fn from_env(env: &LocalEnv) -> Self {
let path = env.base_data_dir.join("attachments.json");
@@ -45,6 +53,9 @@ impl AttachmentService {
env: env.clone(),
path,
listen,
client: reqwest::ClientBuilder::new()
.build()
.expect("Failed to construct http client"),
}
}
@@ -53,7 +64,7 @@ impl AttachmentService {
.expect("non-Unicode path")
}
pub fn start(&self) -> anyhow::Result<Child> {
pub async fn start(&self) -> anyhow::Result<Child> {
let path_str = self.path.to_string_lossy();
background_process::start_process(
@@ -62,10 +73,11 @@ impl AttachmentService {
&self.env.attachment_service_bin(),
["-l", &self.listen, "-p", &path_str],
[],
background_process::InitialPidFile::Create(&self.pid_file()),
background_process::InitialPidFile::Create(self.pid_file()),
// TODO: a real status check
|| Ok(true),
|| async move { anyhow::Ok(true) },
)
.await
}
pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
@@ -73,7 +85,7 @@ impl AttachmentService {
}
/// Call into the attach_hook API, for use before handing out attachments to pageservers
pub fn attach_hook(
pub async fn attach_hook(
&self,
tenant_id: TenantId,
pageserver_id: NodeId,
@@ -87,21 +99,40 @@ impl AttachmentService {
.unwrap()
.join("attach-hook")
.unwrap();
let client = reqwest::blocking::ClientBuilder::new()
.build()
.expect("Failed to construct http client");
let request = AttachHookRequest {
tenant_id,
node_id: Some(pageserver_id),
};
let response = client.post(url).json(&request).send()?;
let response = self.client.post(url).json(&request).send().await?;
if response.status() != StatusCode::OK {
return Err(anyhow!("Unexpected status {}", response.status()));
}
let response = response.json::<AttachHookResponse>()?;
let response = response.json::<AttachHookResponse>().await?;
Ok(response.gen)
}
pub async fn inspect(&self, tenant_id: TenantId) -> anyhow::Result<Option<(u32, NodeId)>> {
use hyper::StatusCode;
let url = self
.env
.control_plane_api
.clone()
.unwrap()
.join("inspect")
.unwrap();
let request = InspectRequest { tenant_id };
let response = self.client.post(url).json(&request).send().await?;
if response.status() != StatusCode::OK {
return Err(anyhow!("Unexpected status {}", response.status()));
}
let response = response.json::<InspectResponse>().await?;
Ok(response.attachment)
}
}

View File

@@ -44,15 +44,15 @@ const NOTICE_AFTER_RETRIES: u64 = 50;
/// Argument to `start_process`, to indicate whether it should create pidfile or if the process creates
/// it itself.
pub enum InitialPidFile<'t> {
pub enum InitialPidFile {
/// Create a pidfile, to allow future CLI invocations to manipulate the process.
Create(&'t Utf8Path),
Create(Utf8PathBuf),
/// The process will create the pidfile itself, need to wait for that event.
Expect(&'t Utf8Path),
Expect(Utf8PathBuf),
}
/// Start a background child process using the parameters given.
pub fn start_process<F, AI, A, EI>(
pub async fn start_process<F, Fut, AI, A, EI>(
process_name: &str,
datadir: &Path,
command: &Path,
@@ -62,7 +62,8 @@ pub fn start_process<F, AI, A, EI>(
process_status_check: F,
) -> anyhow::Result<Child>
where
F: Fn() -> anyhow::Result<bool>,
F: Fn() -> Fut,
Fut: std::future::Future<Output = anyhow::Result<bool>>,
AI: IntoIterator<Item = A>,
A: AsRef<OsStr>,
// Not generic AsRef<OsStr>, otherwise empty `envs` prevents type inference
@@ -89,7 +90,7 @@ where
let filled_cmd = fill_remote_storage_secrets_vars(fill_rust_env_vars(background_command));
filled_cmd.envs(envs);
let pid_file_to_check = match initial_pid_file {
let pid_file_to_check = match &initial_pid_file {
InitialPidFile::Create(path) => {
pre_exec_create_pidfile(filled_cmd, path);
path
@@ -107,7 +108,7 @@ where
);
for retries in 0..RETRIES {
match process_started(pid, Some(pid_file_to_check), &process_status_check) {
match process_started(pid, pid_file_to_check, &process_status_check).await {
Ok(true) => {
println!("\n{process_name} started, pid: {pid}");
return Ok(spawned_process);
@@ -262,7 +263,7 @@ where
P: Into<Utf8PathBuf>,
{
let path: Utf8PathBuf = path.into();
// SAFETY
// SAFETY:
// pre_exec is marked unsafe because it runs between fork and exec.
// Why is that dangerous in various ways?
// Long answer: https://github.com/rust-lang/rust/issues/39575
@@ -316,22 +317,20 @@ where
cmd
}
fn process_started<F>(
async fn process_started<F, Fut>(
pid: Pid,
pid_file_to_check: Option<&Utf8Path>,
pid_file_to_check: &Utf8Path,
status_check: &F,
) -> anyhow::Result<bool>
where
F: Fn() -> anyhow::Result<bool>,
F: Fn() -> Fut,
Fut: std::future::Future<Output = anyhow::Result<bool>>,
{
match status_check() {
Ok(true) => match pid_file_to_check {
Some(pid_file_path) => match pid_file::read(pid_file_path)? {
PidFileRead::NotExist => Ok(false),
PidFileRead::LockedByOtherProcess(pid_in_file) => Ok(pid_in_file == pid),
PidFileRead::NotHeldByAnyProcess(_) => Ok(false),
},
None => Ok(true),
match status_check().await {
Ok(true) => match pid_file::read(pid_file_to_check)? {
PidFileRead::NotExist => Ok(false),
PidFileRead::LockedByOtherProcess(pid_in_file) => Ok(pid_in_file == pid),
PidFileRead::NotHeldByAnyProcess(_) => Ok(false),
},
Ok(false) => Ok(false),
Err(e) => anyhow::bail!("process failed to start: {e}"),
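The new bound `F: Fn() -> Fut, Fut: Future<Output = anyhow::Result<bool>>` is the only non-obvious part of this change. A minimal sketch of a retry loop with the same shape (names, retry count, and sleep interval are illustrative, not the real `start_process`):

```rust
use std::future::Future;
use std::time::Duration;

use anyhow::Result;

// Poll an async status check until it reports success or the retries run out.
async fn wait_until_started<F, Fut>(check: F, retries: u32) -> Result<bool>
where
    F: Fn() -> Fut,
    Fut: Future<Output = Result<bool>>,
{
    for _ in 0..retries {
        if check().await? {
            return Ok(true);
        }
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
    Ok(false)
}

// usage, mirroring the trivial check the attachment service passes:
//     wait_until_started(|| async { anyhow::Ok(true) }, 10).await?;
```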

View File

@@ -9,6 +9,7 @@ use clap::Parser;
use hex::FromHex;
use hyper::StatusCode;
use hyper::{Body, Request, Response};
use pageserver_api::shard::TenantShardId;
use serde::{Deserialize, Serialize};
use std::path::{Path, PathBuf};
use std::{collections::HashMap, sync::Arc};
@@ -32,7 +33,9 @@ use pageserver_api::control_api::{
ValidateResponseTenant,
};
use control_plane::attachment_service::{AttachHookRequest, AttachHookResponse};
use control_plane::attachment_service::{
AttachHookRequest, AttachHookResponse, InspectRequest, InspectResponse,
};
#[derive(Parser)]
#[command(author, version, about, long_about = None)]
@@ -171,7 +174,8 @@ async fn handle_re_attach(mut req: Request<Body>) -> Result<Response<Body>, ApiE
if state.pageserver == Some(reattach_req.node_id) {
state.generation += 1;
response.tenants.push(ReAttachResponseTenant {
id: *t,
// TODO(sharding): make this shard-aware
id: TenantShardId::unsharded(*t),
gen: state.generation,
});
}
@@ -194,8 +198,15 @@ async fn handle_validate(mut req: Request<Body>) -> Result<Response<Body>, ApiEr
};
for req_tenant in validate_req.tenants {
if let Some(tenant_state) = locked.tenants.get(&req_tenant.id) {
// TODO(sharding): make this shard-aware
if let Some(tenant_state) = locked.tenants.get(&req_tenant.id.tenant_id) {
let valid = tenant_state.generation == req_tenant.gen;
tracing::info!(
"handle_validate: {}(gen {}): valid={valid} (latest {})",
req_tenant.id,
req_tenant.gen,
tenant_state.generation
);
response.tenants.push(ValidateResponseTenant {
id: req_tenant.id,
valid,
@@ -245,6 +256,13 @@ async fn handle_attach_hook(mut req: Request<Body>) -> Result<Response<Body>, Ap
tenant_state.pageserver = attach_req.node_id;
let generation = tenant_state.generation;
tracing::info!(
"handle_attach_hook: tenant {} set generation {}, pageserver {}",
attach_req.tenant_id,
tenant_state.generation,
attach_req.node_id.unwrap_or(utils::id::NodeId(0xfffffff))
);
locked.save().await.map_err(ApiError::InternalServerError)?;
json_response(
@@ -255,12 +273,28 @@ async fn handle_attach_hook(mut req: Request<Body>) -> Result<Response<Body>, Ap
)
}
async fn handle_inspect(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let inspect_req = json_request::<InspectRequest>(&mut req).await?;
let state = get_state(&req).inner.clone();
let locked = state.write().await;
let tenant_state = locked.tenants.get(&inspect_req.tenant_id);
json_response(
StatusCode::OK,
InspectResponse {
attachment: tenant_state.and_then(|s| s.pageserver.map(|ps| (s.generation, ps))),
},
)
}
fn make_router(persistent_state: PersistentState) -> RouterBuilder<hyper::Body, ApiError> {
endpoint::make_router()
.data(Arc::new(State::new(persistent_state)))
.post("/re-attach", |r| request_span(r, handle_re_attach))
.post("/validate", |r| request_span(r, handle_validate))
.post("/attach-hook", |r| request_span(r, handle_attach_hook))
.post("/inspect", |r| request_span(r, handle_inspect))
}
#[tokio::main]
@@ -268,6 +302,7 @@ async fn main() -> anyhow::Result<()> {
logging::init(
LogFormat::Plain,
logging::TracingErrorLayerEnablement::Disabled,
logging::Output::Stdout,
)?;
let args = Cli::parse();

View File

@@ -11,13 +11,14 @@ use compute_api::spec::ComputeMode;
use control_plane::attachment_service::AttachmentService;
use control_plane::endpoint::ComputeControlPlane;
use control_plane::local_env::LocalEnv;
use control_plane::pageserver::PageServerNode;
use control_plane::pageserver::{PageServerNode, PAGESERVER_REMOTE_STORAGE_DIR};
use control_plane::safekeeper::SafekeeperNode;
use control_plane::tenant_migration::migrate_tenant;
use control_plane::{broker, local_env};
use pageserver_api::models::TimelineInfo;
use pageserver_api::{
DEFAULT_HTTP_LISTEN_ADDR as DEFAULT_PAGESERVER_HTTP_ADDR,
DEFAULT_PG_LISTEN_ADDR as DEFAULT_PAGESERVER_PG_ADDR,
DEFAULT_HTTP_LISTEN_PORT as DEFAULT_PAGESERVER_HTTP_PORT,
DEFAULT_PG_LISTEN_PORT as DEFAULT_PAGESERVER_PG_PORT,
};
use postgres_backend::AuthType;
use safekeeper_api::{
@@ -46,8 +47,8 @@ const DEFAULT_PG_VERSION: &str = "15";
const DEFAULT_PAGESERVER_CONTROL_PLANE_API: &str = "http://127.0.0.1:1234/";
fn default_conf() -> String {
format!(
fn default_conf(num_pageservers: u16) -> String {
let mut template = format!(
r#"
# Default built-in configuration, defined in main.rs
control_plane_api = '{DEFAULT_PAGESERVER_CONTROL_PLANE_API}'
@@ -55,21 +56,33 @@ control_plane_api = '{DEFAULT_PAGESERVER_CONTROL_PLANE_API}'
[broker]
listen_addr = '{DEFAULT_BROKER_ADDR}'
[[pageservers]]
id = {DEFAULT_PAGESERVER_ID}
listen_pg_addr = '{DEFAULT_PAGESERVER_PG_ADDR}'
listen_http_addr = '{DEFAULT_PAGESERVER_HTTP_ADDR}'
pg_auth_type = '{trust_auth}'
http_auth_type = '{trust_auth}'
[[safekeepers]]
id = {DEFAULT_SAFEKEEPER_ID}
pg_port = {DEFAULT_SAFEKEEPER_PG_PORT}
http_port = {DEFAULT_SAFEKEEPER_HTTP_PORT}
"#,
trust_auth = AuthType::Trust,
)
);
for i in 0..num_pageservers {
let pageserver_id = NodeId(DEFAULT_PAGESERVER_ID.0 + i as u64);
let pg_port = DEFAULT_PAGESERVER_PG_PORT + i;
let http_port = DEFAULT_PAGESERVER_HTTP_PORT + i;
template += &format!(
r#"
[[pageservers]]
id = {pageserver_id}
listen_pg_addr = '127.0.0.1:{pg_port}'
listen_http_addr = '127.0.0.1:{http_port}'
pg_auth_type = '{trust_auth}'
http_auth_type = '{trust_auth}'
"#,
trust_auth = AuthType::Trust,
)
}
template
}
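The loop above simply offsets the default pageserver ports by the instance index so that several local pageservers never collide on 127.0.0.1. A small sketch of that layout (base ports are parameters here because the real defaults come from `pageserver_api`; the values in `main` are hypothetical):

```rust
fn pageserver_ports(base_pg_port: u16, base_http_port: u16, num_pageservers: u16) -> Vec<(u16, u16)> {
    (0..num_pageservers)
        .map(|i| (base_pg_port + i, base_http_port + i))
        .collect()
}

fn main() {
    // hypothetical base ports, two pageservers
    assert_eq!(
        pageserver_ports(64000, 9898, 2),
        vec![(64000, 9898), (64001, 9899)]
    );
}
```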
///
@@ -107,15 +120,20 @@ fn main() -> Result<()> {
let mut env = LocalEnv::load_config().context("Error loading config")?;
let original_env = env.clone();
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.unwrap();
let subcommand_result = match sub_name {
"tenant" => handle_tenant(sub_args, &mut env),
"timeline" => handle_timeline(sub_args, &mut env),
"start" => handle_start_all(sub_args, &env),
"tenant" => rt.block_on(handle_tenant(sub_args, &mut env)),
"timeline" => rt.block_on(handle_timeline(sub_args, &mut env)),
"start" => rt.block_on(handle_start_all(sub_args, &env)),
"stop" => handle_stop_all(sub_args, &env),
"pageserver" => handle_pageserver(sub_args, &env),
"attachment_service" => handle_attachment_service(sub_args, &env),
"safekeeper" => handle_safekeeper(sub_args, &env),
"endpoint" => handle_endpoint(sub_args, &env),
"pageserver" => rt.block_on(handle_pageserver(sub_args, &env)),
"attachment_service" => rt.block_on(handle_attachment_service(sub_args, &env)),
"safekeeper" => rt.block_on(handle_safekeeper(sub_args, &env)),
"endpoint" => rt.block_on(handle_endpoint(sub_args, &env)),
"mappings" => handle_mappings(sub_args, &mut env),
"pg" => bail!("'pg' subcommand has been renamed to 'endpoint'"),
_ => bail!("unexpected subcommand {sub_name}"),
@@ -155,7 +173,7 @@ fn print_timelines_tree(
info: t.clone(),
children: BTreeSet::new(),
name: timeline_name_mappings
.remove(&TenantTimelineId::new(t.tenant_id, t.timeline_id)),
.remove(&TenantTimelineId::new(t.tenant_id.tenant_id, t.timeline_id)),
},
)
})
@@ -256,12 +274,13 @@ fn print_timeline(
/// Returns a map of timeline IDs to timeline_id@lsn strings.
/// Connects to the pageserver to query this information.
fn get_timeline_infos(
async fn get_timeline_infos(
env: &local_env::LocalEnv,
tenant_id: &TenantId,
) -> Result<HashMap<TimelineId, TimelineInfo>> {
Ok(get_default_pageserver(env)
.timeline_list(tenant_id)?
.timeline_list(tenant_id)
.await?
.into_iter()
.map(|timeline_info| (timeline_info.timeline_id, timeline_info))
.collect())
@@ -295,6 +314,9 @@ fn parse_timeline_id(sub_match: &ArgMatches) -> anyhow::Result<Option<TimelineId
}
fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
let num_pageservers = init_match
.get_one::<u16>("num-pageservers")
.expect("num-pageservers arg has a default");
// Create config file
let toml_file: String = if let Some(config_path) = init_match.get_one::<PathBuf>("config") {
// load and parse the file
@@ -306,7 +328,7 @@ fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
})?
} else {
// Built-in default config
default_conf()
default_conf(*num_pageservers)
};
let pg_version = init_match
@@ -320,6 +342,9 @@ fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
env.init(pg_version, force)
.context("Failed to initialize neon repository")?;
// Create remote storage location for default LocalFs remote storage
std::fs::create_dir_all(env.base_data_dir.join(PAGESERVER_REMOTE_STORAGE_DIR))?;
// Initialize pageserver, create initial tenant and timeline.
for ps_conf in &env.pageservers {
PageServerNode::from_env(&env, ps_conf)
@@ -354,11 +379,14 @@ fn pageserver_config_overrides(init_match: &ArgMatches) -> Vec<&str> {
.collect()
}
fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> anyhow::Result<()> {
async fn handle_tenant(
tenant_match: &ArgMatches,
env: &mut local_env::LocalEnv,
) -> anyhow::Result<()> {
let pageserver = get_default_pageserver(env);
match tenant_match.subcommand() {
Some(("list", _)) => {
for t in pageserver.tenant_list()? {
for t in pageserver.tenant_list().await? {
println!("{} {:?}", t.id, t.state);
}
}
@@ -375,12 +403,16 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
// We must register the tenant with the attachment service, so
// that when the pageserver restarts, it will be re-attached.
let attachment_service = AttachmentService::from_env(env);
attachment_service.attach_hook(tenant_id, pageserver.conf.id)?
attachment_service
.attach_hook(tenant_id, pageserver.conf.id)
.await?
} else {
None
};
pageserver.tenant_create(tenant_id, generation, tenant_conf)?;
pageserver
.tenant_create(tenant_id, generation, tenant_conf)
.await?;
println!("tenant {tenant_id} successfully created on the pageserver");
// Create an initial timeline for the new tenant
@@ -390,13 +422,16 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
.copied()
.context("Failed to parse postgres version from the argument string")?;
let timeline_info = pageserver.timeline_create(
tenant_id,
new_timeline_id,
None,
None,
Some(pg_version),
)?;
let timeline_info = pageserver
.timeline_create(
tenant_id,
new_timeline_id,
None,
None,
Some(pg_version),
None,
)
.await?;
let new_timeline_id = timeline_info.timeline_id;
let last_record_lsn = timeline_info.last_record_lsn;
@@ -430,22 +465,32 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
pageserver
.tenant_config(tenant_id, tenant_conf)
.await
.with_context(|| format!("Tenant config failed for tenant with id {tenant_id}"))?;
println!("tenant {tenant_id} successfully configured on the pageserver");
}
Some(("migrate", matches)) => {
let tenant_id = get_tenant_id(matches, env)?;
let new_pageserver = get_pageserver(env, matches)?;
let new_pageserver_id = new_pageserver.conf.id;
migrate_tenant(env, tenant_id, new_pageserver).await?;
println!("tenant {tenant_id} migrated to {}", new_pageserver_id);
}
Some((sub_name, _)) => bail!("Unexpected tenant subcommand '{}'", sub_name),
None => bail!("no tenant subcommand provided"),
}
Ok(())
}
fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Result<()> {
async fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Result<()> {
let pageserver = get_default_pageserver(env);
match timeline_match.subcommand() {
Some(("list", list_match)) => {
let tenant_id = get_tenant_id(list_match, env)?;
let timelines = pageserver.timeline_list(&tenant_id)?;
let timelines = pageserver.timeline_list(&tenant_id).await?;
print_timelines_tree(timelines, env.timeline_name_mappings())?;
}
Some(("create", create_match)) => {
@@ -459,8 +504,18 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -
.copied()
.context("Failed to parse postgres version from the argument string")?;
let timeline_info =
pageserver.timeline_create(tenant_id, None, None, None, Some(pg_version))?;
let new_timeline_id_opt = parse_timeline_id(create_match)?;
let timeline_info = pageserver
.timeline_create(
tenant_id,
new_timeline_id_opt,
None,
None,
Some(pg_version),
None,
)
.await?;
let new_timeline_id = timeline_info.timeline_id;
let last_record_lsn = timeline_info.last_record_lsn;
@@ -505,7 +560,9 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -
let mut cplane = ComputeControlPlane::load(env.clone())?;
println!("Importing timeline into pageserver ...");
pageserver.timeline_import(tenant_id, timeline_id, base, pg_wal, pg_version)?;
pageserver
.timeline_import(tenant_id, timeline_id, base, pg_wal, pg_version)
.await?;
env.register_branch_mapping(name.to_string(), tenant_id, timeline_id)?;
println!("Creating endpoint for imported timeline ...");
@@ -541,13 +598,16 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -
.map(|lsn_str| Lsn::from_str(lsn_str))
.transpose()
.context("Failed to parse ancestor start Lsn from the request")?;
let timeline_info = pageserver.timeline_create(
tenant_id,
None,
start_lsn,
Some(ancestor_timeline_id),
None,
)?;
let timeline_info = pageserver
.timeline_create(
tenant_id,
None,
start_lsn,
Some(ancestor_timeline_id),
None,
None,
)
.await?;
let new_timeline_id = timeline_info.timeline_id;
let last_record_lsn = timeline_info.last_record_lsn;
@@ -566,22 +626,22 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -
Ok(())
}
fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
async fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let (sub_name, sub_args) = match ep_match.subcommand() {
Some(ep_subcommand_data) => ep_subcommand_data,
None => bail!("no endpoint subcommand provided"),
};
let mut cplane = ComputeControlPlane::load(env.clone())?;
// All subcommands take an optional --tenant-id option
let tenant_id = get_tenant_id(sub_args, env)?;
match sub_name {
"list" => {
let timeline_infos = get_timeline_infos(env, &tenant_id).unwrap_or_else(|e| {
eprintln!("Failed to load timeline info: {}", e);
HashMap::new()
});
let tenant_id = get_tenant_id(sub_args, env)?;
let timeline_infos = get_timeline_infos(env, &tenant_id)
.await
.unwrap_or_else(|e| {
eprintln!("Failed to load timeline info: {}", e);
HashMap::new()
});
let timeline_name_mappings = env.timeline_name_mappings();
@@ -637,6 +697,7 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
println!("{table}");
}
"create" => {
let tenant_id = get_tenant_id(sub_args, env)?;
let branch_name = sub_args
.get_one::<String>("branch-name")
.map(|s| s.as_str())
@@ -681,6 +742,18 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
(Some(_), true) => anyhow::bail!("cannot specify both lsn and hot-standby"),
};
match (mode, hot_standby) {
(ComputeMode::Static(_), true) => {
bail!("Cannot start a node in hot standby mode when it is already configured as a static replica")
}
(ComputeMode::Primary, true) => {
bail!("Cannot start a node as a hot standby replica, it is already configured as primary node")
}
_ => {}
}
cplane.check_conflicting_endpoints(mode, tenant_id, timeline_id)?;
cplane.new_endpoint(
&endpoint_id,
tenant_id,
@@ -693,8 +766,6 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
)?;
}
"start" => {
let pg_port: Option<u16> = sub_args.get_one::<u16>("pg-port").copied();
let http_port: Option<u16> = sub_args.get_one::<u16>("http-port").copied();
let endpoint_id = sub_args
.get_one::<String>("endpoint_id")
.ok_or_else(|| anyhow!("No endpoint ID was provided to start"))?;
@@ -723,80 +794,30 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
env.safekeepers.iter().map(|sk| sk.id).collect()
};
let endpoint = cplane.endpoints.get(endpoint_id.as_str());
let endpoint = cplane
.endpoints
.get(endpoint_id.as_str())
.ok_or_else(|| anyhow::anyhow!("endpoint {endpoint_id} not found"))?;
cplane.check_conflicting_endpoints(
endpoint.mode,
endpoint.tenant_id,
endpoint.timeline_id,
)?;
let ps_conf = env.get_pageserver_conf(pageserver_id)?;
let auth_token = if matches!(ps_conf.pg_auth_type, AuthType::NeonJWT) {
let claims = Claims::new(Some(tenant_id), Scope::Tenant);
let claims = Claims::new(Some(endpoint.tenant_id), Scope::Tenant);
Some(env.generate_auth_token(&claims)?)
} else {
None
};
let hot_standby = sub_args
.get_one::<bool>("hot-standby")
.copied()
.unwrap_or(false);
if let Some(endpoint) = endpoint {
match (&endpoint.mode, hot_standby) {
(ComputeMode::Static(_), true) => {
bail!("Cannot start a node in hot standby mode when it is already configured as a static replica")
}
(ComputeMode::Primary, true) => {
bail!("Cannot start a node as a hot standby replica, it is already configured as primary node")
}
_ => {}
}
println!("Starting existing endpoint {endpoint_id}...");
endpoint.start(&auth_token, safekeepers, remote_ext_config)?;
} else {
let branch_name = sub_args
.get_one::<String>("branch-name")
.map(|s| s.as_str())
.unwrap_or(DEFAULT_BRANCH_NAME);
let timeline_id = env
.get_branch_timeline_id(branch_name, tenant_id)
.ok_or_else(|| {
anyhow!("Found no timeline id for branch name '{branch_name}'")
})?;
let lsn = sub_args
.get_one::<String>("lsn")
.map(|lsn_str| Lsn::from_str(lsn_str))
.transpose()
.context("Failed to parse Lsn from the request")?;
let pg_version = sub_args
.get_one::<u32>("pg-version")
.copied()
.context("Failed to `pg-version` from the argument string")?;
let mode = match (lsn, hot_standby) {
(Some(lsn), false) => ComputeMode::Static(lsn),
(None, true) => ComputeMode::Replica,
(None, false) => ComputeMode::Primary,
(Some(_), true) => anyhow::bail!("cannot specify both lsn and hot-standby"),
};
// when used with custom port this results in non obvious behaviour
// port is remembered from first start command, i e
// start --port X
// stop
// start <-- will also use port X even without explicit port argument
println!("Starting new endpoint {endpoint_id} (PostgreSQL v{pg_version}) on timeline {timeline_id} ...");
let ep = cplane.new_endpoint(
endpoint_id,
tenant_id,
timeline_id,
pg_port,
http_port,
pg_version,
mode,
pageserver_id,
)?;
ep.start(&auth_token, safekeepers, remote_ext_config)?;
}
println!("Starting existing endpoint {endpoint_id}...");
endpoint
.start(&auth_token, safekeepers, remote_ext_config)
.await?;
}
"reconfigure" => {
let endpoint_id = sub_args
@@ -814,7 +835,7 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
} else {
None
};
endpoint.reconfigure(pageserver_id)?;
endpoint.reconfigure(pageserver_id).await?;
}
"stop" => {
let endpoint_id = sub_args
@@ -867,24 +888,25 @@ fn handle_mappings(sub_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Res
}
}
fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
fn get_pageserver(env: &local_env::LocalEnv, args: &ArgMatches) -> Result<PageServerNode> {
let node_id = if let Some(id_str) = args.get_one::<String>("pageserver-id") {
NodeId(id_str.parse().context("while parsing pageserver id")?)
} else {
DEFAULT_PAGESERVER_ID
};
fn get_pageserver(env: &local_env::LocalEnv, args: &ArgMatches) -> Result<PageServerNode> {
let node_id = if let Some(id_str) = args.get_one::<String>("pageserver-id") {
NodeId(id_str.parse().context("while parsing pageserver id")?)
} else {
DEFAULT_PAGESERVER_ID
};
Ok(PageServerNode::from_env(
env,
env.get_pageserver_conf(node_id)?,
))
}
Ok(PageServerNode::from_env(
env,
env.get_pageserver_conf(node_id)?,
))
}
async fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
match sub_match.subcommand() {
Some(("start", subcommand_args)) => {
if let Err(e) = get_pageserver(env, subcommand_args)?
.start(&pageserver_config_overrides(subcommand_args))
.await
{
eprintln!("pageserver start failed: {e}");
exit(1);
@@ -911,14 +933,34 @@ fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
exit(1);
}
if let Err(e) = pageserver.start(&pageserver_config_overrides(subcommand_args)) {
if let Err(e) = pageserver
.start(&pageserver_config_overrides(subcommand_args))
.await
{
eprintln!("pageserver start failed: {e}");
exit(1);
}
}
Some(("migrate", subcommand_args)) => {
let pageserver = get_pageserver(env, subcommand_args)?;
//TODO what shutdown strategy should we use here?
if let Err(e) = pageserver.stop(false) {
eprintln!("pageserver stop failed: {}", e);
exit(1);
}
if let Err(e) = pageserver
.start(&pageserver_config_overrides(subcommand_args))
.await
{
eprintln!("pageserver start failed: {e}");
exit(1);
}
}
Some(("status", subcommand_args)) => {
match get_pageserver(env, subcommand_args)?.check_status() {
match get_pageserver(env, subcommand_args)?.check_status().await {
Ok(_) => println!("Page server is up and running"),
Err(err) => {
eprintln!("Page server is not available: {}", err);
@@ -933,11 +975,14 @@ fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
Ok(())
}
fn handle_attachment_service(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
async fn handle_attachment_service(
sub_match: &ArgMatches,
env: &local_env::LocalEnv,
) -> Result<()> {
let svc = AttachmentService::from_env(env);
match sub_match.subcommand() {
Some(("start", _start_match)) => {
if let Err(e) = svc.start() {
if let Err(e) = svc.start().await {
eprintln!("start failed: {e}");
exit(1);
}
@@ -978,7 +1023,7 @@ fn safekeeper_extra_opts(init_match: &ArgMatches) -> Vec<String> {
.collect()
}
fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
async fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let (sub_name, sub_args) = match sub_match.subcommand() {
Some(safekeeper_command_data) => safekeeper_command_data,
None => bail!("no safekeeper subcommand provided"),
@@ -996,7 +1041,7 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
"start" => {
let extra_opts = safekeeper_extra_opts(sub_args);
if let Err(e) = safekeeper.start(extra_opts) {
if let Err(e) = safekeeper.start(extra_opts).await {
eprintln!("safekeeper start failed: {}", e);
exit(1);
}
@@ -1022,7 +1067,7 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
}
let extra_opts = safekeeper_extra_opts(sub_args);
if let Err(e) = safekeeper.start(extra_opts) {
if let Err(e) = safekeeper.start(extra_opts).await {
eprintln!("safekeeper start failed: {}", e);
exit(1);
}
@@ -1035,15 +1080,15 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
Ok(())
}
fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow::Result<()> {
async fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow::Result<()> {
// Endpoints are not started automatically
broker::start_broker_process(env)?;
broker::start_broker_process(env).await?;
// Only start the attachment service if the pageserver is configured to need it
if env.control_plane_api.is_some() {
let attachment_service = AttachmentService::from_env(env);
if let Err(e) = attachment_service.start() {
if let Err(e) = attachment_service.start().await {
eprintln!("attachment_service start failed: {:#}", e);
try_stop_all(env, true);
exit(1);
@@ -1052,7 +1097,10 @@ fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow
for ps_conf in &env.pageservers {
let pageserver = PageServerNode::from_env(env, ps_conf);
if let Err(e) = pageserver.start(&pageserver_config_overrides(sub_match)) {
if let Err(e) = pageserver
.start(&pageserver_config_overrides(sub_match))
.await
{
eprintln!("pageserver {} start failed: {:#}", ps_conf.id, e);
try_stop_all(env, true);
exit(1);
@@ -1061,7 +1109,7 @@ fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow
for node in env.safekeepers.iter() {
let safekeeper = SafekeeperNode::from_env(env, node);
if let Err(e) = safekeeper.start(vec![]) {
if let Err(e) = safekeeper.start(vec![]).await {
eprintln!("safekeeper {} start failed: {:#}", safekeeper.id, e);
try_stop_all(env, false);
exit(1);
@@ -1203,7 +1251,7 @@ fn cli() -> Command {
let remote_ext_config_args = Arg::new("remote-ext-config")
.long("remote-ext-config")
.num_args(1)
.help("Configure the S3 bucket that we search for extensions in.")
.help("Configure the remote extensions storage proxy gateway to request extensions from.")
.required(false);
let lsn_arg = Arg::new("lsn")
@@ -1224,6 +1272,13 @@ fn cli() -> Command {
.help("Force initialization even if the repository is not empty")
.required(false);
let num_pageservers_arg = Arg::new("num-pageservers")
.value_parser(value_parser!(u16))
.long("num-pageservers")
.help("How many pageservers to create (default 1)")
.required(false)
.default_value("1");
Command::new("Neon CLI")
.arg_required_else_help(true)
.version(GIT_VERSION)
@@ -1231,6 +1286,7 @@ fn cli() -> Command {
Command::new("init")
.about("Initialize a new Neon repository, preparing configs for services to start with")
.arg(pageserver_config_args.clone())
.arg(num_pageservers_arg.clone())
.arg(
Arg::new("config")
.long("config")
@@ -1258,6 +1314,7 @@ fn cli() -> Command {
.subcommand(Command::new("create")
.about("Create a new blank timeline")
.arg(tenant_id_arg.clone())
.arg(timeline_id_arg.clone())
.arg(branch_name_arg.clone())
.arg(pg_version_arg.clone())
)
@@ -1301,6 +1358,10 @@ fn cli() -> Command {
.subcommand(Command::new("config")
.arg(tenant_id_arg.clone())
.arg(Arg::new("config").short('c').num_args(1).action(ArgAction::Append).required(false)))
.subcommand(Command::new("migrate")
.about("Migrate a tenant from one pageserver to another")
.arg(tenant_id_arg.clone())
.arg(pageserver_id_arg.clone()))
)
.subcommand(
Command::new("pageserver")
@@ -1375,15 +1436,7 @@ fn cli() -> Command {
.subcommand(Command::new("start")
.about("Start postgres.\n If the endpoint doesn't exist yet, it is created.")
.arg(endpoint_id_arg.clone())
.arg(tenant_id_arg.clone())
.arg(branch_name_arg.clone())
.arg(timeline_id_arg.clone())
.arg(lsn_arg)
.arg(pg_port_arg)
.arg(http_port_arg)
.arg(endpoint_pageserver_id_arg.clone())
.arg(pg_version_arg)
.arg(hot_standby_arg)
.arg(safekeepers_arg)
.arg(remote_ext_config_args)
)
@@ -1396,7 +1449,6 @@ fn cli() -> Command {
.subcommand(
Command::new("stop")
.arg(endpoint_id_arg)
.arg(tenant_id_arg.clone())
.arg(
Arg::new("destroy")
.help("Also delete data directory (now optional, should be default in future)")

View File

@@ -11,7 +11,7 @@ use camino::Utf8PathBuf;
use crate::{background_process, local_env};
pub fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
pub async fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
let broker = &env.broker;
let listen_addr = &broker.listen_addr;
@@ -19,15 +19,15 @@ pub fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
let args = [format!("--listen-addr={listen_addr}")];
let client = reqwest::blocking::Client::new();
let client = reqwest::Client::new();
background_process::start_process(
"storage_broker",
&env.base_data_dir,
&env.storage_broker_bin(),
args,
[],
background_process::InitialPidFile::Create(&storage_broker_pid_file_path(env)),
|| {
background_process::InitialPidFile::Create(storage_broker_pid_file_path(env)),
|| async {
let url = broker.client_url();
let status_url = url.join("status").with_context(|| {
format!("Failed to append /status path to broker endpoint {url}")
@@ -36,12 +36,13 @@ pub fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
.get(status_url)
.build()
.with_context(|| format!("Failed to construct request to broker endpoint {url}"))?;
match client.execute(request) {
match client.execute(request).await {
Ok(resp) => Ok(resp.status().is_success()),
Err(_) => Ok(false),
}
},
)
.await
.context("Failed to spawn storage_broker subprocess")?;
Ok(())
}

View File

@@ -45,8 +45,8 @@ use std::sync::Arc;
use std::time::Duration;
use anyhow::{anyhow, bail, Context, Result};
use compute_api::spec::RemoteExtSpec;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::id::{NodeId, TenantId, TimelineId};
use crate::local_env::LocalEnv;
@@ -57,13 +57,10 @@ use compute_api::responses::{ComputeState, ComputeStatus};
use compute_api::spec::{Cluster, ComputeMode, ComputeSpec};
// contents of a endpoint.json file
#[serde_as]
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
pub struct EndpointConf {
endpoint_id: String,
#[serde_as(as = "DisplayFromStr")]
tenant_id: TenantId,
#[serde_as(as = "DisplayFromStr")]
timeline_id: TimelineId,
mode: ComputeMode,
pg_port: u16,
@@ -128,6 +125,7 @@ impl ComputeControlPlane {
let http_port = http_port.unwrap_or_else(|| self.get_port() + 1);
let pageserver =
PageServerNode::from_env(&self.env, self.env.get_pageserver_conf(pageserver_id)?);
let ep = Arc::new(Endpoint {
endpoint_id: endpoint_id.to_owned(),
pg_address: SocketAddr::new("127.0.0.1".parse().unwrap(), pg_port),
@@ -172,6 +170,30 @@ impl ComputeControlPlane {
Ok(ep)
}
pub fn check_conflicting_endpoints(
&self,
mode: ComputeMode,
tenant_id: TenantId,
timeline_id: TimelineId,
) -> Result<()> {
if matches!(mode, ComputeMode::Primary) {
// this check is not complete, as you could have a concurrent attempt at
// creating another primary, both reading the state before checking it here,
// but it's better than nothing.
let mut duplicates = self.endpoints.iter().filter(|(_k, v)| {
v.tenant_id == tenant_id
&& v.timeline_id == timeline_id
&& v.mode == mode
&& v.status() != "stopped"
});
if let Some((key, _)) = duplicates.next() {
bail!("attempting to create a duplicate primary endpoint on tenant {tenant_id}, timeline {timeline_id}: endpoint {key:?} exists already. please don't do this, it is not supported.");
}
}
Ok(())
}
}
///////////////////////////////////////////////////////////////////////////////
@@ -442,7 +464,7 @@ impl Endpoint {
}
}
pub fn start(
pub async fn start(
&self,
auth_token: &Option<String>,
safekeepers: Vec<NodeId>,
@@ -480,11 +502,24 @@ impl Endpoint {
}
}
// check for file remote_extensions_spec.json
// if it is present, read it and pass to compute_ctl
let remote_extensions_spec_path = self.endpoint_path().join("remote_extensions_spec.json");
let remote_extensions_spec = std::fs::File::open(remote_extensions_spec_path);
let remote_extensions: Option<RemoteExtSpec>;
if let Ok(spec_file) = remote_extensions_spec {
remote_extensions = serde_json::from_reader(spec_file).ok();
} else {
remote_extensions = None;
};
// Create spec file
let spec = ComputeSpec {
skip_pg_catalog_updates: self.skip_pg_catalog_updates,
format_version: 1.0,
operation_uuid: None,
features: vec![],
cluster: Cluster {
cluster_id: None, // project ID: not used
name: None, // project name: not used
@@ -501,7 +536,7 @@ impl Endpoint {
pageserver_connstring: Some(pageserver_connstring),
safekeeper_connstrings,
storage_auth_token: auth_token.clone(),
remote_extensions: None,
remote_extensions,
};
let spec_path = self.endpoint_path().join("spec.json");
std::fs::write(spec_path, serde_json::to_string_pretty(&spec)?)?;
@@ -552,7 +587,7 @@ impl Endpoint {
const MAX_ATTEMPTS: u32 = 10 * 30; // Wait up to 30 s
loop {
attempt += 1;
match self.get_status() {
match self.get_status().await {
Ok(state) => {
match state.status {
ComputeStatus::Init => {
@@ -594,8 +629,8 @@ impl Endpoint {
}
// Call the /status HTTP API
pub fn get_status(&self) -> Result<ComputeState> {
let client = reqwest::blocking::Client::new();
pub async fn get_status(&self) -> Result<ComputeState> {
let client = reqwest::Client::new();
let response = client
.request(
@@ -606,16 +641,17 @@ impl Endpoint {
self.http_address.port()
),
)
.send()?;
.send()
.await?;
// Interpret the response
let status = response.status();
if !(status.is_client_error() || status.is_server_error()) {
Ok(response.json()?)
Ok(response.json().await?)
} else {
// reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = response.url().to_owned();
let msg = match response.text() {
let msg = match response.text().await {
Ok(err_body) => format!("Error: {}", err_body),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
};
@@ -623,7 +659,7 @@ impl Endpoint {
}
}
pub fn reconfigure(&self, pageserver_id: Option<NodeId>) -> Result<()> {
pub async fn reconfigure(&self, pageserver_id: Option<NodeId>) -> Result<()> {
let mut spec: ComputeSpec = {
let spec_path = self.endpoint_path().join("spec.json");
let file = std::fs::File::open(spec_path)?;
@@ -652,7 +688,7 @@ impl Endpoint {
spec.pageserver_connstring = Some(format!("postgresql://no_user@{host}:{port}"));
}
let client = reqwest::blocking::Client::new();
let client = reqwest::Client::new();
let response = client
.post(format!(
"http://{}:{}/configure",
@@ -663,14 +699,15 @@ impl Endpoint {
"{{\"spec\":{}}}",
serde_json::to_string_pretty(&spec)?
))
.send()?;
.send()
.await?;
let status = response.status();
if !(status.is_client_error() || status.is_server_error()) {
Ok(())
} else {
let url = response.url().to_owned();
let msg = match response.text() {
let msg = match response.text().await {
Ok(err_body) => format!("Error: {}", err_body),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
};

View File

@@ -1,11 +1,10 @@
//
// Local control plane.
//
// Can start, configure and stop postgres instances running as a local processes.
//
// Intended to be used in integration tests and in CLI tools for
// local installations.
//
//! Local control plane.
//!
//! Can start, configure and stop postgres instances running as a local processes.
//!
//! Intended to be used in integration tests and in CLI tools for
//! local installations.
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod attachment_service;
mod background_process;
@@ -15,3 +14,4 @@ pub mod local_env;
pub mod pageserver;
pub mod postgresql_conf;
pub mod safekeeper;
pub mod tenant_migration;

View File

@@ -8,7 +8,6 @@ use anyhow::{bail, ensure, Context};
use postgres_backend::AuthType;
use reqwest::Url;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::collections::HashMap;
use std::env;
use std::fs;
@@ -33,7 +32,6 @@ pub const DEFAULT_PG_VERSION: u32 = 15;
// to 'neon_local init --config=<path>' option. See control_plane/simple.conf for
// an example.
//
#[serde_as]
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
pub struct LocalEnv {
// Base directory for all the nodes (the pageserver, safekeepers and
@@ -59,7 +57,6 @@ pub struct LocalEnv {
// Default tenant ID to use with the 'neon_local' command line utility, when
// --tenant_id is not explicitly specified.
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub default_tenant_id: Option<TenantId>,
// used to issue tokens during e.g pg start
@@ -84,7 +81,6 @@ pub struct LocalEnv {
// A `HashMap<String, HashMap<TenantId, TimelineId>>` would be more appropriate here,
// but deserialization into a generic toml object as `toml::Value::try_from` fails with an error.
// https://toml.io/en/v1.0.0 does not contain a concept of "a table inside another table".
#[serde_as(as = "HashMap<_, Vec<(DisplayFromStr, DisplayFromStr)>>")]
branch_name_mappings: HashMap<String, Vec<(TenantId, TimelineId)>>,
}

View File

@@ -6,24 +6,24 @@
//!
use std::borrow::Cow;
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufReader, Write};
use std::io;
use std::io::Write;
use std::num::NonZeroU64;
use std::path::PathBuf;
use std::process::{Child, Command};
use std::{io, result};
use std::time::Duration;
use anyhow::{bail, Context};
use camino::Utf8PathBuf;
use pageserver_api::models::{self, TenantInfo, TimelineInfo};
use futures::SinkExt;
use pageserver_api::models::{self, LocationConfig, TenantInfo, TimelineInfo};
use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api;
use postgres_backend::AuthType;
use postgres_connection::{parse_host_port, PgConnectionConfig};
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use thiserror::Error;
use utils::auth::{Claims, Scope};
use utils::{
http::error::HttpErrorBody,
id::{TenantId, TimelineId},
lsn::Lsn,
};
@@ -31,44 +31,8 @@ use utils::{
use crate::local_env::PageServerConf;
use crate::{background_process, local_env::LocalEnv};
#[derive(Error, Debug)]
pub enum PageserverHttpError {
#[error("Reqwest error: {0}")]
Transport(#[from] reqwest::Error),
#[error("Error: {0}")]
Response(String),
}
impl From<anyhow::Error> for PageserverHttpError {
fn from(e: anyhow::Error) -> Self {
Self::Response(e.to_string())
}
}
type Result<T> = result::Result<T, PageserverHttpError>;
pub trait ResponseErrorMessageExt: Sized {
fn error_from_body(self) -> Result<Self>;
}
impl ResponseErrorMessageExt for Response {
fn error_from_body(self) -> Result<Self> {
let status = self.status();
if !(status.is_client_error() || status.is_server_error()) {
return Ok(self);
}
// reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = self.url().to_owned();
Err(PageserverHttpError::Response(
match self.json::<HttpErrorBody>() {
Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
},
))
}
}
/// Directory within .neon which will be used by default for LocalFs remote storage.
pub const PAGESERVER_REMOTE_STORAGE_DIR: &str = "local_fs_remote_storage/pageserver";
//
// Control routines for pageserver.
@@ -80,8 +44,7 @@ pub struct PageServerNode {
pub pg_connection_config: PgConnectionConfig,
pub conf: PageServerConf,
pub env: LocalEnv,
pub http_client: Client,
pub http_base_url: String,
pub http_client: mgmt_api::Client,
}
impl PageServerNode {
@@ -93,13 +56,26 @@ impl PageServerNode {
pg_connection_config: PgConnectionConfig::new_host_port(host, port),
conf: conf.clone(),
env: env.clone(),
http_client: Client::new(),
http_base_url: format!("http://{}/v1", conf.listen_http_addr),
http_client: mgmt_api::Client::new(
format!("http://{}", conf.listen_http_addr),
{
match conf.http_auth_type {
AuthType::Trust => None,
AuthType::NeonJWT => Some(
env.generate_auth_token(&Claims::new(None, Scope::PageServerApi))
.unwrap(),
),
}
}
.as_deref(),
),
}
}
// pageserver conf overrides defined by neon_local configuration.
fn neon_local_overrides(&self) -> Vec<String> {
/// Merge overrides provided by the user on the command line with our default overrides derived from neon_local configuration.
///
/// These all end up on the command line of the `pageserver` binary.
fn neon_local_overrides(&self, cli_overrides: &[&str]) -> Vec<String> {
let id = format!("id={}", self.conf.id);
// FIXME: the paths should be shell-escaped to handle paths with spaces, quotas etc.
let pg_distrib_dir_param = format!(
@@ -132,12 +108,25 @@ impl PageServerNode {
));
}
if !cli_overrides
.iter()
.any(|c| c.starts_with("remote_storage"))
{
overrides.push(format!(
"remote_storage={{local_path='../{PAGESERVER_REMOTE_STORAGE_DIR}'}}"
));
}
if self.conf.http_auth_type != AuthType::Trust || self.conf.pg_auth_type != AuthType::Trust
{
// Keys are generated in the toplevel repo dir, pageservers' workdirs
// are one level below that, so refer to keys with ../
overrides.push("auth_validation_public_key_path='../auth_public_key.pem'".to_owned());
}
// Apply the user-provided overrides
overrides.extend(cli_overrides.iter().map(|&c| c.to_owned()));
overrides
}
@@ -160,8 +149,8 @@ impl PageServerNode {
.expect("non-Unicode path")
}
pub fn start(&self, config_overrides: &[&str]) -> anyhow::Result<Child> {
self.start_node(config_overrides, false)
pub async fn start(&self, config_overrides: &[&str]) -> anyhow::Result<Child> {
self.start_node(config_overrides, false).await
}
fn pageserver_init(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
@@ -202,10 +191,12 @@ impl PageServerNode {
Ok(())
}
fn start_node(&self, config_overrides: &[&str], update_config: bool) -> anyhow::Result<Child> {
let mut overrides = self.neon_local_overrides();
overrides.extend(config_overrides.iter().map(|&c| c.to_owned()));
async fn start_node(
&self,
config_overrides: &[&str],
update_config: bool,
) -> anyhow::Result<Child> {
// TODO: using a thread here because start_process() is not async but we need to call check_status()
let datadir = self.repo_path();
print!(
"Starting pageserver node {} at '{}' in {:?}",
@@ -213,7 +204,7 @@ impl PageServerNode {
self.pg_connection_config.raw_address(),
datadir
);
io::stdout().flush()?;
io::stdout().flush().context("flush stdout")?;
let datadir_path_str = datadir.to_str().with_context(|| {
format!(
@@ -225,20 +216,23 @@ impl PageServerNode {
if update_config {
args.push(Cow::Borrowed("--update-config"));
}
background_process::start_process(
"pageserver",
&datadir,
&self.env.pageserver_bin(),
args.iter().map(Cow::as_ref),
self.pageserver_env_variables()?,
background_process::InitialPidFile::Expect(&self.pid_file()),
|| match self.check_status() {
Ok(()) => Ok(true),
Err(PageserverHttpError::Transport(_)) => Ok(false),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
background_process::InitialPidFile::Expect(self.pid_file()),
|| async {
let st = self.check_status().await;
match st {
Ok(()) => Ok(true),
Err(mgmt_api::Error::ReceiveBody(_)) => Ok(false),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
}
},
)
.await
}
fn pageserver_basic_args<'a>(
@@ -248,8 +242,7 @@ impl PageServerNode {
) -> Vec<Cow<'a, str>> {
let mut args = vec![Cow::Borrowed("-D"), Cow::Borrowed(datadir_path_str)];
let mut overrides = self.neon_local_overrides();
overrides.extend(config_overrides.iter().map(|&c| c.to_owned()));
let overrides = self.neon_local_overrides(config_overrides);
for config_override in overrides {
args.push(Cow::Borrowed("-c"));
args.push(Cow::Owned(config_override));
@@ -285,7 +278,12 @@ impl PageServerNode {
background_process::stop_process(immediate, "pageserver", &self.pid_file())
}
pub fn page_server_psql_client(&self) -> anyhow::Result<postgres::Client> {
pub async fn page_server_psql_client(
&self,
) -> anyhow::Result<(
tokio_postgres::Client,
tokio_postgres::Connection<tokio_postgres::Socket, tokio_postgres::tls::NoTlsStream>,
)> {
let mut config = self.pg_connection_config.clone();
if self.conf.pg_auth_type == AuthType::NeonJWT {
let token = self
@@ -293,36 +291,18 @@ impl PageServerNode {
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
config = config.set_password(Some(token));
}
Ok(config.connect_no_tls()?)
Ok(config.connect_no_tls().await?)
}
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> anyhow::Result<RequestBuilder> {
let mut builder = self.http_client.request(method, url);
if self.conf.http_auth_type == AuthType::NeonJWT {
let token = self
.env
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
builder = builder.bearer_auth(token)
}
Ok(builder)
pub async fn check_status(&self) -> mgmt_api::Result<()> {
self.http_client.status().await
}
pub fn check_status(&self) -> Result<()> {
self.http_request(Method::GET, format!("{}/status", self.http_base_url))?
.send()?
.error_from_body()?;
Ok(())
pub async fn tenant_list(&self) -> mgmt_api::Result<Vec<TenantInfo>> {
self.http_client.list_tenants().await
}
pub fn tenant_list(&self) -> Result<Vec<TenantInfo>> {
Ok(self
.http_request(Method::GET, format!("{}/tenant", self.http_base_url))?
.send()?
.error_from_body()?
.json()?)
}
pub fn tenant_create(
pub async fn tenant_create(
&self,
new_tenant_id: TenantId,
generation: Option<u32>,
@@ -389,33 +369,21 @@ impl PageServerNode {
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'gc_feedback' as bool")?,
heatmap_period: settings.remove("heatmap_period").map(|x| x.to_string()),
};
let request = models::TenantCreateRequest {
new_tenant_id,
new_tenant_id: TenantShardId::unsharded(new_tenant_id),
generation,
config,
};
if !settings.is_empty() {
bail!("Unrecognized tenant settings: {settings:?}")
}
self.http_request(Method::POST, format!("{}/tenant", self.http_base_url))?
.json(&request)
.send()?
.error_from_body()?
.json::<Option<String>>()
.with_context(|| {
format!("Failed to parse tenant creation response for tenant id: {new_tenant_id:?}")
})?
.context("No tenant id was found in the tenant creation response")
.and_then(|tenant_id_string| {
tenant_id_string.parse().with_context(|| {
format!("Failed to parse response string as tenant id: '{tenant_id_string}'")
})
})
Ok(self.http_client.tenant_create(&request).await?)
}
pub fn tenant_config(
pub async fn tenant_config(
&self,
tenant_id: TenantId,
mut settings: HashMap<&str, &str>,
@@ -486,6 +454,7 @@ impl PageServerNode {
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'gc_feedback' as bool")?,
heatmap_period: settings.remove("heatmap_period").map(|x| x.to_string()),
}
};
@@ -493,59 +462,48 @@ impl PageServerNode {
bail!("Unrecognized tenant settings: {settings:?}")
}
self.http_request(Method::PUT, format!("{}/tenant/config", self.http_base_url))?
.json(&models::TenantConfigRequest { tenant_id, config })
.send()?
.error_from_body()?;
self.http_client
.tenant_config(&models::TenantConfigRequest { tenant_id, config })
.await?;
Ok(())
}
pub fn timeline_list(&self, tenant_id: &TenantId) -> anyhow::Result<Vec<TimelineInfo>> {
let timeline_infos: Vec<TimelineInfo> = self
.http_request(
Method::GET,
format!("{}/tenant/{}/timeline", self.http_base_url, tenant_id),
)?
.send()?
.error_from_body()?
.json()?;
Ok(timeline_infos)
pub async fn location_config(
&self,
tenant_id: TenantId,
config: LocationConfig,
flush_ms: Option<Duration>,
) -> anyhow::Result<()> {
Ok(self
.http_client
.location_config(tenant_id, config, flush_ms)
.await?)
}
pub fn timeline_create(
pub async fn timeline_list(&self, tenant_id: &TenantId) -> anyhow::Result<Vec<TimelineInfo>> {
Ok(self.http_client.list_timelines(*tenant_id).await?)
}
pub async fn timeline_create(
&self,
tenant_id: TenantId,
new_timeline_id: Option<TimelineId>,
ancestor_start_lsn: Option<Lsn>,
ancestor_timeline_id: Option<TimelineId>,
pg_version: Option<u32>,
existing_initdb_timeline_id: Option<TimelineId>,
) -> anyhow::Result<TimelineInfo> {
// If timeline ID was not specified, generate one
let new_timeline_id = new_timeline_id.unwrap_or(TimelineId::generate());
self.http_request(
Method::POST,
format!("{}/tenant/{}/timeline", self.http_base_url, tenant_id),
)?
.json(&models::TimelineCreateRequest {
let req = models::TimelineCreateRequest {
new_timeline_id,
ancestor_start_lsn,
ancestor_timeline_id,
pg_version,
})
.send()?
.error_from_body()?
.json::<Option<TimelineInfo>>()
.with_context(|| {
format!("Failed to parse timeline creation response for tenant id: {tenant_id}")
})?
.with_context(|| {
format!(
"No timeline id was found in the timeline creation response for tenant {tenant_id}"
)
})
existing_initdb_timeline_id,
};
Ok(self.http_client.timeline_create(tenant_id, &req).await?)
}
/// Import a basebackup prepared using either:
@@ -557,7 +515,7 @@ impl PageServerNode {
/// * `timeline_id` - id to assign to imported timeline
/// * `base` - (start lsn of basebackup, path to `base.tar` file)
/// * `pg_wal` - if there's any wal to import: (end lsn, path to `pg_wal.tar`)
pub fn timeline_import(
pub async fn timeline_import(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -565,36 +523,60 @@ impl PageServerNode {
pg_wal: Option<(Lsn, PathBuf)>,
pg_version: u32,
) -> anyhow::Result<()> {
let mut client = self.page_server_psql_client()?;
let (client, conn) = self.page_server_psql_client().await?;
// The connection object performs the actual communication with the database,
// so spawn it off to run on its own.
tokio::spawn(async move {
if let Err(e) = conn.await {
eprintln!("connection error: {}", e);
}
});
tokio::pin!(client);
// Init base reader
let (start_lsn, base_tarfile_path) = base;
let base_tarfile = File::open(base_tarfile_path)?;
let mut base_reader = BufReader::new(base_tarfile);
let base_tarfile = tokio::fs::File::open(base_tarfile_path).await?;
let base_tarfile = tokio_util::io::ReaderStream::new(base_tarfile);
// Init wal reader if necessary
let (end_lsn, wal_reader) = if let Some((end_lsn, wal_tarfile_path)) = pg_wal {
let wal_tarfile = File::open(wal_tarfile_path)?;
let wal_reader = BufReader::new(wal_tarfile);
let wal_tarfile = tokio::fs::File::open(wal_tarfile_path).await?;
let wal_reader = tokio_util::io::ReaderStream::new(wal_tarfile);
(end_lsn, Some(wal_reader))
} else {
(start_lsn, None)
};
// Import base
let import_cmd = format!(
"import basebackup {tenant_id} {timeline_id} {start_lsn} {end_lsn} {pg_version}"
);
let mut writer = client.copy_in(&import_cmd)?;
io::copy(&mut base_reader, &mut writer)?;
writer.finish()?;
let copy_in = |reader, cmd| {
let client = &client;
async move {
let writer = client.copy_in(&cmd).await?;
let writer = std::pin::pin!(writer);
let mut writer = writer.sink_map_err(|e| {
std::io::Error::new(std::io::ErrorKind::Other, format!("{e}"))
});
let mut reader = std::pin::pin!(reader);
writer.send_all(&mut reader).await?;
writer.into_inner().finish().await?;
anyhow::Ok(())
}
};
// Import base
copy_in(
base_tarfile,
format!(
"import basebackup {tenant_id} {timeline_id} {start_lsn} {end_lsn} {pg_version}"
),
)
.await?;
// Import wal if necessary
if let Some(mut wal_reader) = wal_reader {
let import_cmd = format!("import wal {tenant_id} {timeline_id} {start_lsn} {end_lsn}");
let mut writer = client.copy_in(&import_cmd)?;
io::copy(&mut wal_reader, &mut writer)?;
writer.finish()?;
if let Some(wal_reader) = wal_reader {
copy_in(
wal_reader,
format!("import wal {tenant_id} {timeline_id} {start_lsn} {end_lsn}"),
)
.await?;
}
Ok(())

View File

@@ -13,7 +13,6 @@ use std::{io, result};
use anyhow::Context;
use camino::Utf8PathBuf;
use postgres_connection::PgConnectionConfig;
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use thiserror::Error;
use utils::{http::error::HttpErrorBody, id::NodeId};
@@ -34,12 +33,14 @@ pub enum SafekeeperHttpError {
type Result<T> = result::Result<T, SafekeeperHttpError>;
#[async_trait::async_trait]
pub trait ResponseErrorMessageExt: Sized {
fn error_from_body(self) -> Result<Self>;
async fn error_from_body(self) -> Result<Self>;
}
impl ResponseErrorMessageExt for Response {
fn error_from_body(self) -> Result<Self> {
#[async_trait::async_trait]
impl ResponseErrorMessageExt for reqwest::Response {
async fn error_from_body(self) -> Result<Self> {
let status = self.status();
if !(status.is_client_error() || status.is_server_error()) {
return Ok(self);
@@ -48,7 +49,7 @@ impl ResponseErrorMessageExt for Response {
// reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = self.url().to_owned();
Err(SafekeeperHttpError::Response(
match self.json::<HttpErrorBody>() {
match self.json::<HttpErrorBody>().await {
Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
},
@@ -69,7 +70,7 @@ pub struct SafekeeperNode {
pub pg_connection_config: PgConnectionConfig,
pub env: LocalEnv,
pub http_client: Client,
pub http_client: reqwest::Client,
pub http_base_url: String,
}
@@ -80,7 +81,7 @@ impl SafekeeperNode {
conf: conf.clone(),
pg_connection_config: Self::safekeeper_connection_config(conf.pg_port),
env: env.clone(),
http_client: Client::new(),
http_client: reqwest::Client::new(),
http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port),
}
}
@@ -103,7 +104,7 @@ impl SafekeeperNode {
.expect("non-Unicode path")
}
pub fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<Child> {
pub async fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<Child> {
print!(
"Starting safekeeper at '{}' in '{}'",
self.pg_connection_config.raw_address(),
@@ -191,13 +192,16 @@ impl SafekeeperNode {
&self.env.safekeeper_bin(),
&args,
[],
background_process::InitialPidFile::Expect(&self.pid_file()),
|| match self.check_status() {
Ok(()) => Ok(true),
Err(SafekeeperHttpError::Transport(_)) => Ok(false),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
background_process::InitialPidFile::Expect(self.pid_file()),
|| async {
match self.check_status().await {
Ok(()) => Ok(true),
Err(SafekeeperHttpError::Transport(_)) => Ok(false),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
}
},
)
.await
}
///
@@ -216,7 +220,7 @@ impl SafekeeperNode {
)
}
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> RequestBuilder {
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> reqwest::RequestBuilder {
// TODO: authentication
//if self.env.auth_type == AuthType::NeonJWT {
// builder = builder.bearer_auth(&self.env.safekeeper_auth_token)
@@ -224,10 +228,12 @@ impl SafekeeperNode {
self.http_client.request(method, url)
}
pub fn check_status(&self) -> Result<()> {
pub async fn check_status(&self) -> Result<()> {
self.http_request(Method::GET, format!("{}/{}", self.http_base_url, "status"))
.send()?
.error_from_body()?;
.send()
.await?
.error_from_body()
.await?;
Ok(())
}
}

View File

@@ -0,0 +1,205 @@
//!
//! Functionality for migrating tenants across pageservers: unlike most of neon_local, this code
//! isn't scoped to a particular physical service, as it needs to update compute endpoints to
//! point to the new pageserver.
//!
use crate::local_env::LocalEnv;
use crate::{
attachment_service::AttachmentService, endpoint::ComputeControlPlane,
pageserver::PageServerNode,
};
use pageserver_api::models::{
LocationConfig, LocationConfigMode, LocationConfigSecondary, TenantConfig,
};
use std::collections::HashMap;
use std::time::Duration;
use utils::{
id::{TenantId, TimelineId},
lsn::Lsn,
};
/// Given an attached pageserver, retrieve the LSN for all timelines
async fn get_lsns(
tenant_id: TenantId,
pageserver: &PageServerNode,
) -> anyhow::Result<HashMap<TimelineId, Lsn>> {
let timelines = pageserver.timeline_list(&tenant_id).await?;
Ok(timelines
.into_iter()
.map(|t| (t.timeline_id, t.last_record_lsn))
.collect())
}
/// Wait for the timeline LSNs on `pageserver` to catch up with or overtake
/// `baseline`.
async fn await_lsn(
tenant_id: TenantId,
pageserver: &PageServerNode,
baseline: HashMap<TimelineId, Lsn>,
) -> anyhow::Result<()> {
loop {
let latest = match get_lsns(tenant_id, pageserver).await {
Ok(l) => l,
Err(e) => {
println!(
"🕑 Can't get LSNs on pageserver {} yet, waiting ({e})",
pageserver.conf.id
);
std::thread::sleep(Duration::from_millis(500));
continue;
}
};
let mut any_behind: bool = false;
for (timeline_id, baseline_lsn) in &baseline {
match latest.get(timeline_id) {
Some(latest_lsn) => {
println!("🕑 LSN origin {baseline_lsn} vs destination {latest_lsn}");
if latest_lsn < baseline_lsn {
any_behind = true;
}
}
None => {
// Expected timeline isn't yet visible on migration destination.
// (IRL we would have to account for timeline deletion, but this
// is just a test helper)
any_behind = true;
}
}
}
if !any_behind {
println!("✅ LSN caught up. Proceeding...");
break;
} else {
std::thread::sleep(Duration::from_millis(500));
}
}
Ok(())
}
/// This function spans multiple services, to demonstrate live migration of a tenant
/// between pageservers:
/// - Coordinate attach/secondary/detach on pageservers
/// - call into attachment_service for generations
/// - reconfigure compute endpoints to point to new attached pageserver
pub async fn migrate_tenant(
env: &LocalEnv,
tenant_id: TenantId,
dest_ps: PageServerNode,
) -> anyhow::Result<()> {
// Get a new generation
let attachment_service = AttachmentService::from_env(env);
fn build_location_config(
mode: LocationConfigMode,
generation: Option<u32>,
secondary_conf: Option<LocationConfigSecondary>,
) -> LocationConfig {
LocationConfig {
mode,
generation,
secondary_conf,
tenant_conf: TenantConfig::default(),
shard_number: 0,
shard_count: 0,
shard_stripe_size: 0,
}
}
let previous = attachment_service.inspect(tenant_id).await?;
let mut baseline_lsns = None;
if let Some((generation, origin_ps_id)) = &previous {
let origin_ps = PageServerNode::from_env(env, env.get_pageserver_conf(*origin_ps_id)?);
if origin_ps_id == &dest_ps.conf.id {
println!("🔁 Already attached to {origin_ps_id}, freshening...");
let gen = attachment_service
.attach_hook(tenant_id, dest_ps.conf.id)
.await?;
let dest_conf = build_location_config(LocationConfigMode::AttachedSingle, gen, None);
dest_ps.location_config(tenant_id, dest_conf, None).await?;
println!("✅ Migration complete");
return Ok(());
}
println!("🔁 Switching origin pageserver {origin_ps_id} to stale mode");
let stale_conf =
build_location_config(LocationConfigMode::AttachedStale, Some(*generation), None);
origin_ps
.location_config(tenant_id, stale_conf, Some(Duration::from_secs(10)))
.await?;
baseline_lsns = Some(get_lsns(tenant_id, &origin_ps).await?);
}
let gen = attachment_service
.attach_hook(tenant_id, dest_ps.conf.id)
.await?;
let dest_conf = build_location_config(LocationConfigMode::AttachedMulti, gen, None);
println!("🔁 Attaching to pageserver {}", dest_ps.conf.id);
dest_ps.location_config(tenant_id, dest_conf, None).await?;
if let Some(baseline) = baseline_lsns {
println!("🕑 Waiting for LSN to catch up...");
await_lsn(tenant_id, &dest_ps, baseline).await?;
}
let cplane = ComputeControlPlane::load(env.clone())?;
for (endpoint_name, endpoint) in &cplane.endpoints {
if endpoint.tenant_id == tenant_id {
println!(
"🔁 Reconfiguring endpoint {} to use pageserver {}",
endpoint_name, dest_ps.conf.id
);
endpoint.reconfigure(Some(dest_ps.conf.id)).await?;
}
}
for other_ps_conf in &env.pageservers {
if other_ps_conf.id == dest_ps.conf.id {
continue;
}
let other_ps = PageServerNode::from_env(env, other_ps_conf);
let other_ps_tenants = other_ps.tenant_list().await?;
// Check if this tenant is attached
let found = other_ps_tenants
.into_iter()
.map(|t| t.id)
.any(|i| i.tenant_id == tenant_id);
if !found {
continue;
}
// Downgrade to a secondary location
let secondary_conf = build_location_config(
LocationConfigMode::Secondary,
None,
Some(LocationConfigSecondary { warm: true }),
);
println!(
"💤 Switching to secondary mode on pageserver {}",
other_ps.conf.id
);
other_ps
.location_config(tenant_id, secondary_conf, None)
.await?;
}
println!(
"🔁 Switching to AttachedSingle mode on pageserver {}",
dest_ps.conf.id
);
let dest_conf = build_location_config(LocationConfigMode::AttachedSingle, gen, None);
dest_ps.location_config(tenant_id, dest_conf, None).await?;
println!("✅ Migration complete");
Ok(())
}

View File

@@ -74,10 +74,30 @@ highlight = "all"
workspace-default-features = "allow"
external-default-features = "allow"
allow = []
deny = []
skip = []
skip-tree = []
[[bans.deny]]
# we use tokio, the same rationale applies for async-{io,waker,global-executor,executor,channel,lock}, smol
# if you find yourself here while adding a dependency, try "default-features = false", ask around on #rust
name = "async-std"
[[bans.deny]]
name = "async-io"
[[bans.deny]]
name = "async-waker"
[[bans.deny]]
name = "async-global-executor"
[[bans.deny]]
name = "async-executor"
[[bans.deny]]
name = "smol"
# This section is considered when running `cargo deny check sources`.
# More documentation about the 'sources' section can be found here:
# https://embarkstudios.github.io/cargo-deny/checks/sources/cfg.html

View File

@@ -177,7 +177,7 @@ I e during migration create_branch can be called on old pageserver and newly cre
The difference of simplistic approach from one described above is that it calls ignore on source tenant first and then calls attach on target pageserver. Approach above does it in opposite order thus opening a possibility for race conditions we strive to avoid.
The approach largely follows this guide: <https://github.com/neondatabase/cloud/wiki/Cloud:-Ad-hoc-tenant-relocation>
The approach largely follows this guide: <https://www.notion.so/neondatabase/Cloud-Ad-hoc-tenant-relocation-f687474f7bfc42269e6214e3acba25c7>
The happy path sequence:

View File

@@ -0,0 +1,197 @@
# Per-Tenant GetPage@LSN Throttling
Author: Christian Schwarz
Date: Oct 24, 2023
## Summary
This RFC proposes per-tenant throttling of GetPage@LSN requests inside Pageserver
and the interactions with its client, i.e., the neon_smgr component in Compute.
The result of implementing & executing this RFC will be a fleet-wide upper limit for
**"the highest GetPage/second that Pageserver can support for a single tenant/shard"**.
## Background
### GetPage@LSN Request Flow
Pageserver exposes its `page_service.rs` as a libpq listener.
The Computes' `neon_smgr` module connects to that libpq listener.
Once a connection is established, the protocol allows Compute to request page images at a given LSN.
We call these requests GetPage@LSN requests, or GetPage requests for short.
Other request types can be sent, but these are low traffic compared to GetPage requests
and are not the concern of this RFC.
Pageserver associates one libpq connection with one tokio task.
Per connection/task, the pq protocol is handled by the common `postgres_backend` crate.
Its `run_message_loop` function invokes the `page_service` specific `impl<IO> postgres_backend::Handler<IO> for PageServerHandler`.
Requests are processed in the order in which they arrive via the TCP-based pq protocol.
So, there is no concurrent request processing within one connection/task.
There is a degree of natural pipelining:
Compute can "fill the pipe" by sending more than one GetPage request into the libpq TCP stream.
And Pageserver can fill the pipe with responses in the other direction.
Both directions are subject to the limit of tx/rx buffers, nodelay, TCP flow control, etc.
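For illustration only, here is a minimal sketch of that per-connection model, assuming tokio and using a toy line-based protocol as a stand-in for the pq protocol (this is not the real `postgres_backend::Handler` machinery): one task per connection, requests answered strictly in arrival order, with any pipelining left to the client and the TCP buffers.
```rust
// Toy model of the per-connection request flow (assumes tokio; a line-based
// protocol stands in for libpq). One task per connection; requests on a
// connection are served strictly in arrival order, while the client is free
// to keep several requests in flight.
use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0").await?;
    loop {
        let (socket, _) = listener.accept().await?;
        // "One libpq connection == one tokio task".
        tokio::spawn(async move {
            let (rd, mut wr) = socket.into_split();
            let mut requests = BufReader::new(rd).lines();
            // No intra-connection concurrency: read, handle, respond, repeat.
            while let Ok(Some(request)) = requests.next_line().await {
                let response = format!("page for {request}\n"); // stand-in for a GetPage response
                if wr.write_all(response.as_bytes()).await.is_err() {
                    break;
                }
            }
        });
    }
}
```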
### GetPage@LSN Access Pattern
The Compute has its own hierarchy of caches, specifically `shared_buffers` and the `local file cache` (LFC).
Compute only issues GetPage requests to Pageserver if it encounters a miss in these caches.
If the working set stops fitting into Compute's caches, requests to Pageserver increase sharply -- the Compute starts *thrashing*.
## Motivation
In INC-69, a tenant issued 155k GetPage/second for a period of 10 minutes and 60k GetPage/second for a period of 3h,
then dropped to ca. 18k GetPage/second for a period of 9h.
We noticed this because of an internal GetPage latency SLO burn rate alert, i.e.,
the request latency profile during this period significantly exceeded what was acceptable according to the internal SLO.
Sadly, we do not have the observability data to determine the impact of this tenant on other tenants on the same pageserver.
However, here are some illustrative data points for the 155k period:
The tenant was responsible for >= 99% of the GetPage traffic and, frankly, the overall activity on this Pageserver instance.
We were serving pages at 10 Gb/s (`155k x 8 kbyte (PAGE_SZ) per second is 1.12GiB/s = 9.4Gb/s.`)
The CPU utilization of the instance was 75% user+system.
Pageserver page cache served 1.75M accesses/second at a hit rate of ca 90%.
The hit rate for materialized pages was ca. 40%.
Curiously, IOPS to the Instance Store NVMe were very low, rarely exceeding 100.
The fact that the IOPS were so low / the materialized page cache hit rate was so high suggests that **this tenant's compute's caches were thrashing**.
The compute was of type `k8s-pod`; hence, auto-scaling could/would not have helped remediate the thrashing by provisioning more RAM.
The consequence was that the **thrashing translated into excessive GetPage requests against Pageserver**.
My claim is that it was **unhealthy to serve this workload at the pace we did**:
* it is likely that other tenants were/would have experienced high latencies (again, we sadly don't have per-tenant latency data to confirm this)
* more importantly, it was **unsustainable** to serve traffic at this pace for multiple reasons:
* **predictability of performance**: when the working set grows, the pageserver materialized page cache hit rate drops.
At some point, we're bound by the EC2 Instance Store NVMe drive's IOPS limit.
The result is an **uneven** performance profile from the Compute perspective.
* **economics**: Neon currently does not charge for IOPS, only capacity.
**We cannot afford to undercut the market in IOPS/$ this drastically; it leads to adverse selection and perverse incentives.**
For example, the 155k IOPS, which we served for 10min, would cost ca. 6.5k$/month when provisioned as an io2 EBS volume.
Even the 18k IOPS, which we served for 9h, would cost ca. 1.1k$/month when provisioned as an io2 EBS volume.
We charge 0$.
It could be economically advantageous to keep using a low-DRAM compute because Pageserver IOPS are fast enough and free.
Note: It is helpful to think of Pageserver as a disk, because it's precisely where `neon_smgr` sits:
vanilla Postgres gets its pages from disk, Neon Postgres gets them from Pageserver.
So, regarding the above performance & economic arguments, it is fair to say that we currently provide an "as-fast-as-possible-IOPS" disk that we charge for only by capacity.
## Solution: Throttling GetPage Requests
**The consequence of the above analysis must be that Pageserver throttles GetPage@LSN requests**.
That is, unless we want to start charging for provisioned GetPage@LSN/second.
Throttling sets the correct incentive for a thrashing Compute to scale up its DRAM to the working set size.
Neon Autoscaling will make this easy, [eventually](https://github.com/neondatabase/neon/pull/3913).
## The Design Space
What remains is the question of *policy* and *mechanism*:
**Policy** concerns itself with the question of what limit applies to a given connection|timeline|tenant.
Candidates are:
* hard limit, same limit value per connection|timeline|tenant
* Per-tenant will provide an upper bound for the impact of a tenant on a given Pageserver instance.
This is a major operational pain point / risk right now.
* hard limit, configurable per connection|timeline|tenant
* This outsources policy to console/control plane, with obvious advantages for flexible structuring of what service we offer to customers.
* Note that this is not a mechanism to guarantee a minimum provisioned rate, i.e., this is not a mechanism to guarantee a certain QoS for a tenant.
* fair share among active connections|timelines|tenants per instance
* example: each connection|timeline|tenant gets a fair fraction of the machine's GetPage/second capacity
* NB: needs definition of "active", and knowledge of available GetPage/second capacity in advance
* ...
Regarding **mechanism**, it's clear that **backpressure** is the way to go.
However, we must choose between
* **implicit** backpressure through pq/TCP and
* **explicit** rejection of requests + retries with exponential backoff
Further, there is the question of how throttling GetPage@LSN will affect the **internal GetPage latency SLO**:
where do we measure the SLI for Pageserver's internal getpage latency SLO? Before or after the throttling?
And when we eventually move the measurement point into the Computes (to avoid coordinated omission),
how do we avoid counting throttling-induced latency toward the internal getpage latency SLI/SLO?
## Scope Of This RFC
**This RFC proposes introducing a hard GetPage@LSN/second limit per tenant, with the same value applying to each tenant on a Pageserver**.
This proposal is easy to implement and significantly de-risks operating large Pageservers,
based on the assumption that extremely-high-GetPage-rate-episodes like the one from the "Motivation" section are uncorrelated between tenants.
For example, suppose we pick a limit that allows up to 10 tenants to go at limit rate.
Suppose our Pageserver can serve 100k GetPage/second total at a 100% page cache miss rate.
If each tenant gets a hard limit of 10k GetPage/second, we can serve up to 10 tenants at limit speed without latency degradation.
The mechanism for backpressure will be TCP-based implicit backpressure.
The compute team isn't concerned about prefetch queue depth.
Pageserver will implement it by delaying the reading of requests from the libpq connection(s).
The rate limit will be implemented using a per-tenant token bucket.
The bucket will be shared among all connections to the tenant.
The bucket implementation supports starvation-preventing `await`ing.
The current candidate for the implementation is [`leaky_bucket`](https://docs.rs/leaky-bucket/).
The getpage@lsn benchmark that's being added in https://github.com/neondatabase/neon/issues/5771
can be used to evaluate the overhead of sharing the bucket among connections of a tenant.
A possible technique to mitigate the impact of sharing the bucket would be to maintain a buffer of a few tokens per connection handler.
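To make the mechanism concrete, here is a minimal hand-rolled sketch of such a shared per-tenant token bucket, assuming tokio; names like `TokenBucket` and `acquire_one` are illustrative, and the actual proposal is to use the `leaky-bucket` crate, which additionally provides the starvation-preventing queued acquire that this sketch does not.
```rust
// Minimal sketch of a shared per-tenant token bucket (assumes tokio). The real
// proposal is to use the `leaky-bucket` crate, which also provides fair,
// starvation-preventing queuing that this sketch does not attempt.
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::{sleep_until, Instant};

pub struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    state: Mutex<(f64, Instant)>, // (available tokens, time of last refill)
}

impl TokenBucket {
    pub fn new(capacity: f64, refill_per_sec: f64) -> Arc<Self> {
        Arc::new(Self {
            capacity,
            refill_per_sec,
            state: Mutex::new((capacity, Instant::now())),
        })
    }

    /// Wait until one token is available, then consume it.
    pub async fn acquire_one(&self) {
        loop {
            let mut st = self.state.lock().await;
            let (tokens, last) = *st;
            let now = Instant::now();
            let refilled = (tokens
                + now.duration_since(last).as_secs_f64() * self.refill_per_sec)
                .min(self.capacity);
            if refilled >= 1.0 {
                *st = (refilled - 1.0, now);
                return;
            }
            // Not enough tokens: drop the lock and sleep roughly until one is refilled.
            let wait = Duration::from_secs_f64((1.0 - refilled) / self.refill_per_sec);
            drop(st);
            sleep_until(now + wait).await;
        }
    }
}
```
In this sketch, each page_service connection handler of a tenant would hold an `Arc<TokenBucket>` and await `acquire_one()` before reading the next GetPage request off its socket, so that the throttle surfaces to the Compute as ordinary TCP backpressure.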
Regarding metrics / the internal GetPage latency SLO:
we will measure the GetPage latency SLO _after_ the throttler and introduce a new metric to measure the amount of throttling, quantified by:
- histogram that records the tenants' observations of queue depth before they start waiting (one such histogram per pageserver)
- histogram that records the tenants' observations of time spent waiting (one such histogram per pageserver)
Further observability measures:
- an INFO log message at a frequency of 1/min if the tenant/timeline/connection was throttled in the last minute.
The message will identify the tenant/timeline/connection to allow correlation with compute logs/stats.
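As a sketch of what the two proposed histograms could look like, assuming the `prometheus` crate; the metric names and buckets below are placeholders, not the final ones:
```rust
// Sketch of the two proposed per-pageserver histograms (assumes the `prometheus`
// crate; metric names and buckets are hypothetical).
use prometheus::{Histogram, HistogramOpts, Registry};

pub fn register_throttle_metrics(
    registry: &Registry,
) -> prometheus::Result<(Histogram, Histogram)> {
    let queue_depth = Histogram::with_opts(
        HistogramOpts::new(
            "pageserver_getpage_throttle_observed_queue_depth",
            "Queue depth a tenant observed on the token bucket before it started waiting",
        )
        .buckets(vec![1.0, 2.0, 4.0, 8.0, 16.0, 32.0]),
    )?;
    let wait_seconds = Histogram::with_opts(
        HistogramOpts::new(
            "pageserver_getpage_throttle_wait_seconds",
            "Time a tenant spent waiting on the per-tenant GetPage token bucket",
        )
        .buckets(vec![0.001, 0.01, 0.1, 1.0, 10.0]),
    )?;
    registry.register(Box::new(queue_depth.clone()))?;
    registry.register(Box::new(wait_seconds.clone()))?;
    Ok((queue_depth, wait_seconds))
}
```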
Rollout will happen as follows:
- deploy 1: implementation + config: disabled by default, ability to enable it per tenant through tenant_conf
- experimentation in staging and later production to study impact & interaction with auto-scaling
- determination of a sensible global default value
- the value will be chosen as high as possible ...
- ... but low enough to work towards this RFC's goal that one tenant should not be able to dominate a pageserver instance.
- deploy 2: implementation fixes if any + config: enabled by default with the aforementioned global default
- reset of the experimental per-tenant overrides
- gain experience & lower the limit over time
- we stop lowering the limit as soon as this RFC's goal is achieved, i.e.,
once we decide that in practice the chosen value sufficiently de-risks operating large pageservers
The per-tenant override will remain for emergencies and testing.
But since Console doesn't preserve it during tenant migrations, it isn't durably configurable for the tenant.
Toward the upper layers of the Neon stack, the resulting limit will be
**"the highest GetPage/second that Pageserver can support for a single tenant"**.
### Rationale
We decided against error + retry because of worries about starvation.
## Future Work
Enable per-tenant emergency override of the limit via Console.
Should be part of a more general framework to specify tenant config overrides.
**NB:** this is **not** the right mechanism to _sell_ different max GetPage/second levels to users,
or _auto-scale_ the GetPage/second levels. Such functionality will require a separate RFC that
concerns itself with GetPage/second capacity planning.
Compute-side metrics for GetPage latency.
Back-channel to inform Compute/Autoscaling/ControlPlane that the project is being throttled.
Compute-side neon_smgr improvements to avoid sending the same GetPage request multiple times if multiple backends experience a cache miss.
Dealing with read-only endpoints: users use read-only endpoints to scale reads for a single tenant.
Possibly there are also assumptions around read-only endpoints not affecting the primary read-write endpoint's performance.
With per-tenant rate limiting, we will not meet that expectation.
However, we can currently only scale per tenant.
Soon, we will have sharding (#5505), which will apply the throttling on a per-shard basis.
But, that's orthogonal to scaling reads: if many endpoints hit one shard, they share the same throttling limit.
To solve this properly, I think we'll need replicas for tenants / shard.
To performance-isolate a tenant's endpoints from each other, we'd then route them to different replicas.

View File

@@ -0,0 +1,205 @@
# Name
Created on: 2023-09-08
Author: Arpad Müller
## Summary
Enable the pageserver to recover from data corruption events by implementing
a feature to re-apply historic WAL records in parallel to the already occurring
WAL replay.
The feature is outside of the user-visible backup and history story, and only
serves as a second-level backup for the case that there is a bug in the
pageservers that corrupted the served pages.
The RFC proposes the addition of two new features:
* recover a broken branch from WAL (downtime is allowed)
* a test recovery system to recover random branches to make sure recovery works
## Motivation
The historic WAL is currently stored in S3 even after it has been replayed by
the pageserver and thus been integrated into the pageserver's storage system.
This is done to defend from data corruption failures inside the pageservers.
However, application of this WAL in the disaster recovery setting is currently
very manual and we want to automate this to make it easier.
### Use cases
There are various use cases for this feature, like:
* The main motivation is replaying WAL in the event of pageservers corrupting
data.
* We might want to, beyond the user-visible history features, through our
support channels and upon customer request, in select instances, recover
historic versions beyond the range of history that we officially support.
* Running the recovery process in the background for random tenant timelines
to figure out if there was a corruption of data (we would compare with what
the pageserver stores for the "official" timeline).
* Using the WAL to arrive at historic pages we can then back up to S3 so that
WAL itself can be discarded, or at least not used for future replays.
Again, this sounds a lot like what the pageserver is already doing, but the
point is to provide a fallback to the service provided by the pageserver.
## Design
### Design constraints
The main design constraint is that the feature needs to be *simple* enough that
the number of bugs is as low as possible and reliability as high as possible: the main
goal of this endeavour is to achieve higher correctness than the pageserver.
For the background process, we cannot afford a downtime of the timeline that is
being cloned, as we don't want to restrict ourselves to offline tenants only.
In the scenario where we want to recover from disasters or roll back to a
historic LSN through support staff, downtime is more affordable, and
inevitable if the original was subject to the corruption. Ideally, the
two code paths would share code, so the solution should be designed to not
require downtime.
### API endpoint changes
This RFC proposes two API endpoint changes in the safekeeper and the
pageserver.
Remember, the pageserver timeline creation API endpoint is at this URL:
```
/v1/tenant/{tenant_id}/timeline/
```
Where `{tenant_id}` is the ID of the tenant the timeline is created for,
and specified as part of the URL. The timeline ID is passed via the POST
request body as the only required parameter `new_timeline_id`.
This proposal adds one optional parameter called
`existing_initdb_timeline_id` to the request's json body. If the parameter
is not specified, behaviour stays as it is today: the pageserver runs
initdb.
If the parameter is specified, it is expected to point to a timeline ID.
In fact, that ID might match `new_timeline_id`; what's important is that
S3 storage contains a matching initdb snapshot under the path matching the given
tenant and timeline.
Having both `ancestor_timeline_id` and `existing_initdb_timeline_id`
specified is illegal and will result in an HTTP error. This feature is
only meant for the "main" branch, which doesn't have any ancestors
of its own, as initdb is only relevant there.
For the safekeeper, we propose the addition of the following copy endpoint:
```
/v1/tenant/{tenant_id}/timeline/{source_timeline_id}/copy
```
It accepts POST requests with a JSON body and takes the two URL parameters
`tenant_id` and `source_timeline_id`. The json request body contains
the two required parameters `target_timeline_id` and `until_lsn`.
When invoked, the copy endpoint starts copying WAL from the source timeline
to the target timeline. The LSN is updated according to the
progress of the API call.
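A hypothetical usage sketch of the two endpoints described above, written with `reqwest` and `serde_json` (the ports, ID formats, and exact JSON shape are assumptions drawn from the text, not a final wire format):
```rust
// Sketch of calling the two proposed endpoints (assumes reqwest with the "json"
// feature, serde_json, and anyhow; URLs and field names follow the proposal above).
use reqwest::Client;
use serde_json::json;

/// Re-create a (deleted) timeline on the pageserver, reusing the initdb snapshot
/// that is already in S3 for `timeline_id`.
async fn create_timeline_from_existing_initdb(
    client: &Client,
    pageserver_http: &str, // e.g. "http://localhost:9898" (assumed port)
    tenant_id: &str,
    timeline_id: &str,
) -> anyhow::Result<()> {
    client
        .post(format!("{pageserver_http}/v1/tenant/{tenant_id}/timeline/"))
        .json(&json!({
            "new_timeline_id": timeline_id,
            "existing_initdb_timeline_id": timeline_id,
        }))
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}

/// Ask a safekeeper to copy WAL from `source_timeline_id` into `target_timeline_id`,
/// up to `until_lsn` (used by the test-DR flow described in the next section).
async fn copy_wal(
    client: &Client,
    safekeeper_http: &str, // e.g. "http://localhost:7676" (assumed port)
    tenant_id: &str,
    source_timeline_id: &str,
    target_timeline_id: &str,
    until_lsn: &str,
) -> anyhow::Result<()> {
    client
        .post(format!(
            "{safekeeper_http}/v1/tenant/{tenant_id}/timeline/{source_timeline_id}/copy"
        ))
        .json(&json!({
            "target_timeline_id": target_timeline_id,
            "until_lsn": until_lsn,
        }))
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```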
### Higher level features
We want the API changes to support the following higher level features:
* recovery-after-corruption DR of the main timeline of a tenant. This
feature allows for downtime.
* test DR of the main timeline into a special copy timeline. this feature
is meant to run against selected production tenants in the background,
without the user noticing, so it does not allow for downtime.
The recovery-after-corruption DR only needs the pageserver changes.
It works as follows:
* delete the timeline from the pageservers via timeline deletion API
* re-create it via timeline creation API (same ID as before) and set
`existing_initdb_timeline_id` to the same timeline ID
The test DR also requires the copy primitive and works as follows:
* copy the WAL of the timeline to a new place
* create a new timeline for the tenant
## Non Goals
At the risk of being repetitive, the main goal of this feature is to be a
backup method, so reliability is very important. This implies that other
aspects like performance or space reduction are less important.
### Corrupt WAL
The process suggested by this RFC assumes that the WAL is free of corruption.
In some instances, corruption can make it into WAL, like for example when
higher level components like postgres or the application first read corrupt
data, and then execute a write with data derived from that earlier read. That
written data might then contain the corruption.
Common use cases can hit this quite easily. For example, an application reads
some counter, increments it, and then writes the new counter value to the
database.
On a lower level, the compute might put FPIs (Full Page Images) into the WAL,
which have corrupt data for rows unrelated to the write operation at hand.
Separating corrupt writes from non-corrupt ones is a hard problem in general,
and if the application was involved in making the corrupt write, a recovery
would also involve the application. Therefore, corruption that has made it into
the WAL is outside the scope of this feature. However, the WAL replay can be
run up to just before the point in time where the corruption occurred. Then the
data loss is isolated to post-corruption writes only.
## Impacted components (e.g. pageserver, safekeeper, console, etc)
Most changes would happen to the pageservers.
For the higher level features, maybe other components like the console would
be involved.
We need to make sure that the shadow timelines are not subject to the usual
limits and billing we apply to existing timelines.
## Proposed implementation
The first problem to keep in mind is the reproducibility of `initdb`.
So an initial step would be to upload `initdb` snapshots to S3.
After that, we'd have the endpoint spawn a background process which
performs the replay of the WAL to that new timeline. This process should
follow the existing workflows as closely as possible, just using the
WAL records of a different timeline.
The timeline created will be in a special state that solely looks for WAL
entries of the timeline it is trying to copy. Once the target LSN is reached,
it turns into a normal timeline that also accepts writes to its own
timeline ID.
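A minimal sketch of that lifecycle, with placeholder types standing in for the real `TimelineId`/`Lsn` (none of these names are the pageserver's actual types; this is only meant to illustrate the state transition described above):
```rust
// Hypothetical sketch of the "shadow timeline" lifecycle: replay another
// timeline's WAL up to `until_lsn`, then switch to normal operation.
type TimelineId = [u8; 16]; // placeholder for utils::id::TimelineId
type Lsn = u64;             // placeholder for utils::lsn::Lsn

enum RecoveryState {
    /// Only ingest WAL records addressed to `source_timeline`, up to `until_lsn`.
    ReplayingForeignWal {
        source_timeline: TimelineId,
        until_lsn: Lsn,
    },
    /// Behaves like any other timeline: accepts WAL for its own timeline ID.
    Normal,
}

/// Decide whether to ingest a WAL record, and flip to `Normal` once the target
/// LSN has been reached. Returns true if the record should be ingested.
fn on_wal_record(state: &mut RecoveryState, record_timeline: TimelineId, record_lsn: Lsn) -> bool {
    let (ingest, reached_target) = match state {
        RecoveryState::ReplayingForeignWal { source_timeline, until_lsn } => {
            if record_timeline != *source_timeline || record_lsn > *until_lsn {
                (false, false) // not part of the recovery replay: ignore
            } else {
                (true, record_lsn == *until_lsn)
            }
        }
        RecoveryState::Normal => (true, false),
    };
    if reached_target {
        // Target LSN reached: from now on behave like a normal timeline.
        *state = RecoveryState::Normal;
    }
    ingest
}
```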
### Scalability
For now we want to run this entire process on a single node, and as
it is by nature linear, it's hard to parallelize. However, for the
verification workloads, we can easily start the WAL replay in parallel
for different points in time. This is valuable especially for tenants
with large WAL records.
Compare this with the tricks to make addition circuits execute with
lower latency by making them perform the addition for both possible
values of the carry bit, and then, in a second step, taking the
result for the carry bit that was actually obtained.
The other scalability dimension to consider is the WAL length, which
is a growing question as tenants accumulate changes. There are
possible approaches to this, including creating snapshots of the
page files and uploading them to S3, but if we do this for every single
branch, we lose the cheap branching property.
### Implementation by component
The proposed changes for the various components of the neon architecture
are written up in this notion page:
https://www.notion.so/neondatabase/Pageserver-disaster-recovery-one-pager-4ecfb5df16ce4f6bbfc3817ed1a6cbb2
### Unresolved questions
none known (outside of the mentioned ones).

View File

@@ -1,3 +1,5 @@
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod requests;
pub mod responses;
pub mod spec;

View File

@@ -6,7 +6,6 @@
use std::collections::HashMap;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
@@ -19,7 +18,6 @@ pub type PgIdent = String;
/// Cluster spec or configuration represented as an optional number of
/// delta operations + final cluster state description.
#[serde_as]
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
pub struct ComputeSpec {
pub format_version: f32,
@@ -28,6 +26,13 @@ pub struct ComputeSpec {
// but we don't use it for anything. Serde will ignore missing fields when
// deserializing it.
pub operation_uuid: Option<String>,
/// Compute features to enable. These feature flags are provided when we
/// already know all the details about the client's compute, so they cannot be
/// used to change `Empty` compute behavior.
#[serde(default)]
pub features: Vec<ComputeFeature>,
/// Expected cluster state at the end of transition process.
pub cluster: Cluster,
pub delta_operations: Option<Vec<DeltaOp>>,
@@ -50,12 +55,12 @@ pub struct ComputeSpec {
// these, and instead set the "neon.tenant_id", "neon.timeline_id",
// etc. GUCs in cluster.settings. TODO: Once the control plane has been
// updated to fill these fields, we can make these non optional.
#[serde_as(as = "Option<DisplayFromStr>")]
pub tenant_id: Option<TenantId>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub timeline_id: Option<TimelineId>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub pageserver_connstring: Option<String>,
#[serde(default)]
pub safekeeper_connstrings: Vec<String>,
@@ -70,6 +75,19 @@ pub struct ComputeSpec {
pub remote_extensions: Option<RemoteExtSpec>,
}
/// Feature flag to signal `compute_ctl` to enable certain experimental functionality.
#[derive(Serialize, Clone, Copy, Debug, Deserialize, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum ComputeFeature {
// XXX: Add more feature flags here.
// This is a special feature flag that is used to represent unknown feature flags.
// Basically, all flags unknown to the enum are represented as this one. See the unit test
// `parse_unknown_features()` for more details.
#[serde(other)]
UnknownFeature,
}
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
pub struct RemoteExtSpec {
pub public_extensions: Option<Vec<String>>,
@@ -140,14 +158,13 @@ impl RemoteExtSpec {
}
}
#[serde_as]
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq, Deserialize, Serialize)]
pub enum ComputeMode {
/// A read-write node
#[default]
Primary,
/// A read-only node, pinned at a particular LSN
Static(#[serde_as(as = "DisplayFromStr")] Lsn),
Static(Lsn),
/// A read-only node that follows the tip of the branch in hot standby mode
///
/// Future versions may want to distinguish between replicas with hot standby
@@ -190,8 +207,6 @@ pub struct DeltaOp {
pub struct Role {
pub name: PgIdent,
pub encrypted_password: Option<String>,
pub replication: Option<bool>,
pub bypassrls: Option<bool>,
pub options: GenericOptions,
}
@@ -232,7 +247,10 @@ mod tests {
#[test]
fn parse_spec_file() {
let file = File::open("tests/cluster_spec.json").unwrap();
let _spec: ComputeSpec = serde_json::from_reader(file).unwrap();
let spec: ComputeSpec = serde_json::from_reader(file).unwrap();
// Features list defaults to empty vector.
assert!(spec.features.is_empty());
}
#[test]
@@ -244,4 +262,22 @@ mod tests {
ob.insert("unknown_field_123123123".into(), "hello".into());
let _spec: ComputeSpec = serde_json::from_value(json).unwrap();
}
#[test]
fn parse_unknown_features() {
// Test that unknown feature flags do not cause any errors.
let file = File::open("tests/cluster_spec.json").unwrap();
let mut json: serde_json::Value = serde_json::from_reader(file).unwrap();
let ob = json.as_object_mut().unwrap();
// Add unknown feature flags.
let features = vec!["foo_bar_feature", "baz_feature"];
ob.insert("features".into(), features.into());
let spec: ComputeSpec = serde_json::from_value(json).unwrap();
assert!(spec.features.len() == 2);
assert!(spec.features.contains(&ComputeFeature::UnknownFeature));
assert_eq!(spec.features, vec![ComputeFeature::UnknownFeature; 2]);
}
}

View File

@@ -1,6 +1,6 @@
//!
//! Shared code for consumption metrics collection
//!
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
use chrono::{DateTime, Utc};
use rand::Rng;
use serde::{Deserialize, Serialize};

View File

@@ -2,8 +2,12 @@
//! make sure that we use the same dep version everywhere.
//! Otherwise, we might not see all metrics registered via
//! a default registry.
#![deny(clippy::undocumented_unsafe_blocks)]
use once_cell::sync::Lazy;
use prometheus::core::{AtomicU64, Collector, GenericGauge, GenericGaugeVec};
use prometheus::core::{
Atomic, AtomicU64, Collector, GenericCounter, GenericCounterVec, GenericGauge, GenericGaugeVec,
};
pub use prometheus::opts;
pub use prometheus::register;
pub use prometheus::Error;
@@ -131,3 +135,137 @@ fn get_rusage_stats() -> libc::rusage {
rusage.assume_init()
}
}
/// Create an [`IntCounterPairVec`] and registers to default registry.
#[macro_export(local_inner_macros)]
macro_rules! register_int_counter_pair_vec {
($NAME1:expr, $HELP1:expr, $NAME2:expr, $HELP2:expr, $LABELS_NAMES:expr $(,)?) => {{
match (
$crate::register_int_counter_vec!($NAME1, $HELP1, $LABELS_NAMES),
$crate::register_int_counter_vec!($NAME2, $HELP2, $LABELS_NAMES),
) {
(Ok(inc), Ok(dec)) => Ok($crate::IntCounterPairVec::new(inc, dec)),
(Err(e), _) | (_, Err(e)) => Err(e),
}
}};
}
/// Create an [`IntCounterPair`] and registers to default registry.
#[macro_export(local_inner_macros)]
macro_rules! register_int_counter_pair {
($NAME1:expr, $HELP1:expr, $NAME2:expr, $HELP2:expr $(,)?) => {{
match (
$crate::register_int_counter!($NAME1, $HELP1),
$crate::register_int_counter!($NAME2, $HELP2),
) {
(Ok(inc), Ok(dec)) => Ok($crate::IntCounterPair::new(inc, dec)),
(Err(e), _) | (_, Err(e)) => Err(e),
}
}};
}
/// A Pair of [`GenericCounterVec`]s. Like a [`GenericGaugeVec`] but will always observe changes
pub struct GenericCounterPairVec<P: Atomic> {
inc: GenericCounterVec<P>,
dec: GenericCounterVec<P>,
}
/// A Pair of [`GenericCounter`]s. Like a [`GenericGauge`] but will always observe changes
pub struct GenericCounterPair<P: Atomic> {
inc: GenericCounter<P>,
dec: GenericCounter<P>,
}
impl<P: Atomic> GenericCounterPairVec<P> {
pub fn new(inc: GenericCounterVec<P>, dec: GenericCounterVec<P>) -> Self {
Self { inc, dec }
}
/// `get_metric_with_label_values` returns the [`GenericCounterPair<P>`] for the given slice
/// of label values (same order as the VariableLabels in Desc). If that combination of
/// label values is accessed for the first time, a new [`GenericCounterPair<P>`] is created.
///
/// An error is returned if the number of label values is not the same as the
/// number of VariableLabels in Desc.
pub fn get_metric_with_label_values(&self, vals: &[&str]) -> Result<GenericCounterPair<P>> {
Ok(GenericCounterPair {
inc: self.inc.get_metric_with_label_values(vals)?,
dec: self.dec.get_metric_with_label_values(vals)?,
})
}
/// `with_label_values` works as `get_metric_with_label_values`, but panics if an error
/// occurs.
pub fn with_label_values(&self, vals: &[&str]) -> GenericCounterPair<P> {
self.get_metric_with_label_values(vals).unwrap()
}
}
impl<P: Atomic> GenericCounterPair<P> {
pub fn new(inc: GenericCounter<P>, dec: GenericCounter<P>) -> Self {
Self { inc, dec }
}
/// Increment the gauge by 1, returning a guard that decrements by 1 on drop.
pub fn guard(&self) -> GenericCounterPairGuard<P> {
self.inc.inc();
GenericCounterPairGuard(self.dec.clone())
}
/// Increment the gauge by n, returning a guard that decrements by n on drop.
pub fn guard_by(&self, n: P::T) -> GenericCounterPairGuardBy<P> {
self.inc.inc_by(n);
GenericCounterPairGuardBy(self.dec.clone(), n)
}
/// Increase the gauge by 1.
#[inline]
pub fn inc(&self) {
self.inc.inc();
}
/// Decrease the gauge by 1.
#[inline]
pub fn dec(&self) {
self.dec.inc();
}
/// Add the given value to the gauge. (The value can be
/// negative, resulting in a decrement of the gauge.)
#[inline]
pub fn inc_by(&self, v: P::T) {
self.inc.inc_by(v);
}
/// Subtract the given value from the gauge. (The value can be
/// negative, resulting in an increment of the gauge.)
#[inline]
pub fn dec_by(&self, v: P::T) {
self.dec.inc_by(v);
}
}
/// Guard returned by [`GenericCounterPair::guard`]
pub struct GenericCounterPairGuard<P: Atomic>(GenericCounter<P>);
impl<P: Atomic> Drop for GenericCounterPairGuard<P> {
fn drop(&mut self) {
self.0.inc();
}
}
/// Guard returned by [`GenericCounterPair::guard_by`]
pub struct GenericCounterPairGuardBy<P: Atomic>(GenericCounter<P>, P::T);
impl<P: Atomic> Drop for GenericCounterPairGuardBy<P> {
fn drop(&mut self) {
self.0.inc_by(self.1);
}
}
/// A Pair of [`IntCounterVec`]s. Like an [`IntGaugeVec`] but will always observe changes
pub type IntCounterPairVec = GenericCounterPairVec<AtomicU64>;
/// A Pair of [`IntCounter`]s. Like an [`IntGauge`] but will always observe changes
pub type IntCounterPair = GenericCounterPair<AtomicU64>;
/// A guard for [`IntCounterPair`] that will decrement the gauge on drop
pub type IntCounterPairGuard = GenericCounterPairGuard<AtomicU64>;
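For reference, a brief usage sketch of the counter-pair helpers above. The
metric names, the label, and `handle_connection` are made up for illustration,
and it is assumed the crate is consumed under the name `metrics`:

```rust
use metrics::{register_int_counter_pair_vec, IntCounterPairVec};
use once_cell::sync::Lazy;

static CONNECTIONS: Lazy<IntCounterPairVec> = Lazy::new(|| {
    register_int_counter_pair_vec!(
        "example_connections_opened_total",
        "Number of connections opened",
        "example_connections_closed_total",
        "Number of connections closed",
        &["protocol"]
    )
    .expect("failed to register counter pair")
});

fn handle_connection() {
    // Bumps the "opened" counter now; the guard bumps the "closed" counter on
    // drop, so the difference between the two tracks live connections while
    // both counters remain monotonic.
    let _guard = CONNECTIONS.with_label_values(&["http"]).guard();
    // ... serve the connection ...
}
```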

View File

@@ -17,5 +17,11 @@ postgres_ffi.workspace = true
enum-map.workspace = true
strum.workspace = true
strum_macros.workspace = true
hex.workspace = true
thiserror.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
bincode.workspace = true
rand.workspace = true

View File

@@ -4,19 +4,18 @@
//! See docs/rfcs/025-generation-numbers.md
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::id::{NodeId, TenantId};
use utils::id::NodeId;
use crate::shard::TenantShardId;
#[derive(Serialize, Deserialize)]
pub struct ReAttachRequest {
pub node_id: NodeId,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct ReAttachResponseTenant {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
pub id: TenantShardId,
pub gen: u32,
}
@@ -25,11 +24,9 @@ pub struct ReAttachResponse {
pub tenants: Vec<ReAttachResponseTenant>,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct ValidateRequestTenant {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
pub id: TenantShardId,
pub gen: u32,
}
@@ -43,10 +40,8 @@ pub struct ValidateResponse {
pub tenants: Vec<ValidateResponseTenant>,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct ValidateResponseTenant {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
pub id: TenantShardId,
pub valid: bool,
}

View File

@@ -0,0 +1,180 @@
use anyhow::{bail, Result};
use byteorder::{ByteOrder, BE};
use serde::{Deserialize, Serialize};
use std::fmt;
/// Key used in the Repository kv-store.
///
/// The Repository treats this as an opaque struct, but see the code in pgdatadir_mapping.rs
/// for what we actually store in these fields.
#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, Ord, PartialOrd, Serialize, Deserialize)]
pub struct Key {
pub field1: u8,
pub field2: u32,
pub field3: u32,
pub field4: u32,
pub field5: u8,
pub field6: u32,
}
pub const KEY_SIZE: usize = 18;
impl Key {
/// 'field2' is used to store tablespaceid for relations and small enum numbers for other relish.
/// As long as Neon does not support tablespace (because of lack of access to local file system),
/// we can assume that only some predefined namespace OIDs are used which can fit in u16
pub fn to_i128(&self) -> i128 {
assert!(self.field2 < 0xFFFF || self.field2 == 0xFFFFFFFF || self.field2 == 0x22222222);
(((self.field1 & 0xf) as i128) << 120)
| (((self.field2 & 0xFFFF) as i128) << 104)
| ((self.field3 as i128) << 72)
| ((self.field4 as i128) << 40)
| ((self.field5 as i128) << 32)
| self.field6 as i128
}
pub const fn from_i128(x: i128) -> Self {
Key {
field1: ((x >> 120) & 0xf) as u8,
field2: ((x >> 104) & 0xFFFF) as u32,
field3: (x >> 72) as u32,
field4: (x >> 40) as u32,
field5: (x >> 32) as u8,
field6: x as u32,
}
}
pub fn next(&self) -> Key {
self.add(1)
}
pub fn add(&self, x: u32) -> Key {
let mut key = *self;
let r = key.field6.overflowing_add(x);
key.field6 = r.0;
if r.1 {
let r = key.field5.overflowing_add(1);
key.field5 = r.0;
if r.1 {
let r = key.field4.overflowing_add(1);
key.field4 = r.0;
if r.1 {
let r = key.field3.overflowing_add(1);
key.field3 = r.0;
if r.1 {
let r = key.field2.overflowing_add(1);
key.field2 = r.0;
if r.1 {
let r = key.field1.overflowing_add(1);
key.field1 = r.0;
assert!(!r.1);
}
}
}
}
}
key
}
pub fn from_slice(b: &[u8]) -> Self {
Key {
field1: b[0],
field2: u32::from_be_bytes(b[1..5].try_into().unwrap()),
field3: u32::from_be_bytes(b[5..9].try_into().unwrap()),
field4: u32::from_be_bytes(b[9..13].try_into().unwrap()),
field5: b[13],
field6: u32::from_be_bytes(b[14..18].try_into().unwrap()),
}
}
pub fn write_to_byte_slice(&self, buf: &mut [u8]) {
buf[0] = self.field1;
BE::write_u32(&mut buf[1..5], self.field2);
BE::write_u32(&mut buf[5..9], self.field3);
BE::write_u32(&mut buf[9..13], self.field4);
buf[13] = self.field5;
BE::write_u32(&mut buf[14..18], self.field6);
}
}
impl fmt::Display for Key {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(
f,
"{:02X}{:08X}{:08X}{:08X}{:02X}{:08X}",
self.field1, self.field2, self.field3, self.field4, self.field5, self.field6
)
}
}
impl Key {
pub const MIN: Key = Key {
field1: u8::MIN,
field2: u32::MIN,
field3: u32::MIN,
field4: u32::MIN,
field5: u8::MIN,
field6: u32::MIN,
};
pub const MAX: Key = Key {
field1: u8::MAX,
field2: u32::MAX,
field3: u32::MAX,
field4: u32::MAX,
field5: u8::MAX,
field6: u32::MAX,
};
pub fn from_hex(s: &str) -> Result<Self> {
if s.len() != 36 {
bail!("parse error");
}
Ok(Key {
field1: u8::from_str_radix(&s[0..2], 16)?,
field2: u32::from_str_radix(&s[2..10], 16)?,
field3: u32::from_str_radix(&s[10..18], 16)?,
field4: u32::from_str_radix(&s[18..26], 16)?,
field5: u8::from_str_radix(&s[26..28], 16)?,
field6: u32::from_str_radix(&s[28..36], 16)?,
})
}
}
pub fn is_rel_block_key(key: &Key) -> bool {
key.field1 == 0x00 && key.field4 != 0
}
impl std::str::FromStr for Key {
type Err = anyhow::Error;
fn from_str(s: &str) -> std::result::Result<Self, Self::Err> {
Self::from_hex(s)
}
}
#[cfg(test)]
mod tests {
use std::str::FromStr;
use crate::key::Key;
use rand::Rng;
use rand::SeedableRng;
#[test]
fn display_fromstr_bijection() {
let mut rng = rand::rngs::StdRng::seed_from_u64(42);
let key = Key {
field1: rng.gen(),
field2: rng.gen(),
field3: rng.gen(),
field4: rng.gen(),
field5: rng.gen(),
field6: rng.gen(),
};
assert_eq!(key, Key::from_str(&format!("{key}")).unwrap());
}
}

View File

@@ -1,11 +1,12 @@
use crate::repository::{key_range_size, singleton_range, Key};
use postgres_ffi::BLCKSZ;
use std::ops::Range;
use crate::key::Key;
///
/// Represents a set of Keys, in a compact form.
///
#[derive(Clone, Debug, Default)]
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct KeySpace {
/// Contiguous ranges of keys that belong to the key space. In key order,
/// and with no overlap.
@@ -186,6 +187,33 @@ impl KeySpaceRandomAccum {
}
}
pub fn key_range_size(key_range: &Range<Key>) -> u32 {
let start = key_range.start;
let end = key_range.end;
if end.field1 != start.field1
|| end.field2 != start.field2
|| end.field3 != start.field3
|| end.field4 != start.field4
{
return u32::MAX;
}
let start = (start.field5 as u64) << 32 | start.field6 as u64;
let end = (end.field5 as u64) << 32 | end.field6 as u64;
let diff = end - start;
if diff > u32::MAX as u64 {
u32::MAX
} else {
diff as u32
}
}
pub fn singleton_range(key: Key) -> Range<Key> {
key..key.next()
}
#[cfg(test)]
mod tests {
use super::*;

View File

@@ -1,9 +1,14 @@
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
use const_format::formatcp;
/// Public API types
pub mod control_api;
pub mod key;
pub mod keyspace;
pub mod models;
pub mod reltag;
pub mod shard;
pub const DEFAULT_PG_LISTEN_PORT: u16 = 64000;
pub const DEFAULT_PG_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_PG_LISTEN_PORT}");

View File

@@ -1,24 +1,26 @@
pub mod partitioning;
use std::{
collections::HashMap,
io::Read,
num::{NonZeroU64, NonZeroUsize},
time::SystemTime,
};
use byteorder::{BigEndian, ReadBytesExt};
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use serde_with::serde_as;
use strum_macros;
use utils::{
completion,
generation::Generation,
history_buffer::HistoryBufferWithDropCounter,
id::{NodeId, TenantId, TimelineId},
lsn::Lsn,
};
use crate::reltag::RelTag;
use crate::{reltag::RelTag, shard::TenantShardId};
use anyhow::bail;
use bytes::{BufMut, Bytes, BytesMut};
use bytes::{Buf, BufMut, Bytes, BytesMut};
/// The state of a tenant in this pageserver.
///
@@ -174,26 +176,22 @@ pub enum TimelineState {
Broken { reason: String, backtrace: String },
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct TimelineCreateRequest {
#[serde_as(as = "DisplayFromStr")]
pub new_timeline_id: TimelineId,
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_timeline_id: Option<TimelineId>,
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub existing_initdb_timeline_id: Option<TimelineId>,
#[serde(default)]
pub ancestor_start_lsn: Option<Lsn>,
pub pg_version: Option<u32>,
}
#[serde_as]
#[derive(Serialize, Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct TenantCreateRequest {
#[serde_as(as = "DisplayFromStr")]
pub new_tenant_id: TenantId,
pub new_tenant_id: TenantShardId,
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub generation: Option<u32>,
@@ -201,7 +199,6 @@ pub struct TenantCreateRequest {
pub config: TenantConfig, // as we have a flattened field, we should reject all unknown fields in it
}
#[serde_as]
#[derive(Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct TenantLoadRequest {
@@ -243,6 +240,7 @@ pub struct TenantConfig {
pub min_resident_size_override: Option<u64>,
pub evictions_low_residence_duration_metric_threshold: Option<String>,
pub gc_feedback: Option<bool>,
pub heatmap_period: Option<String>,
}
/// A flattened analog of a `pagesever::tenant::LocationMode`, which
@@ -269,40 +267,44 @@ pub struct LocationConfig {
pub mode: LocationConfigMode,
/// If attaching, in what generation?
#[serde(default)]
pub generation: Option<Generation>,
pub generation: Option<u32>,
#[serde(default)]
pub secondary_conf: Option<LocationConfigSecondary>,
// Shard parameters: if shard_count is nonzero, then other shard_* fields
// must be set accurately.
#[serde(default)]
pub shard_number: u8,
#[serde(default)]
pub shard_count: u8,
#[serde(default)]
pub shard_stripe_size: u32,
// If requesting mode `Secondary`, configuration for that.
// Custom storage configuration for the tenant, if any
pub tenant_conf: TenantConfig,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
#[serde(transparent)]
pub struct TenantCreateResponse(#[serde_as(as = "DisplayFromStr")] pub TenantId);
pub struct TenantCreateResponse(pub TenantId);
#[derive(Serialize)]
pub struct StatusResponse {
pub id: NodeId,
}
#[serde_as]
#[derive(Serialize, Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct TenantLocationConfigRequest {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
#[serde(flatten)]
pub config: LocationConfig, // as we have a flattened field, we should reject all unknown fields in it
}
#[serde_as]
#[derive(Serialize, Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct TenantConfigRequest {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
#[serde(flatten)]
pub config: TenantConfig, // as we have a flattened field, we should reject all unknown fields in it
@@ -318,31 +320,14 @@ impl std::ops::Deref for TenantConfigRequest {
impl TenantConfigRequest {
pub fn new(tenant_id: TenantId) -> TenantConfigRequest {
let config = TenantConfig {
checkpoint_distance: None,
checkpoint_timeout: None,
compaction_target_size: None,
compaction_period: None,
compaction_threshold: None,
gc_horizon: None,
gc_period: None,
image_creation_threshold: None,
pitr_interval: None,
walreceiver_connect_timeout: None,
lagging_wal_timeout: None,
max_lsn_wal_lag: None,
trace_read_requests: None,
eviction_policy: None,
min_resident_size_override: None,
evictions_low_residence_duration_metric_threshold: None,
gc_feedback: None,
};
let config = TenantConfig::default();
TenantConfigRequest { tenant_id, config }
}
}
#[derive(Debug, Deserialize)]
pub struct TenantAttachRequest {
#[serde(default)]
pub config: TenantAttachConfig,
#[serde(default)]
pub generation: Option<u32>,
@@ -350,7 +335,7 @@ pub struct TenantAttachRequest {
/// Newtype to enforce deny_unknown_fields on TenantConfig for
/// its usage inside `TenantAttachRequest`.
#[derive(Debug, Serialize, Deserialize)]
#[derive(Debug, Serialize, Deserialize, Default)]
#[serde(deny_unknown_fields)]
pub struct TenantAttachConfig {
#[serde(flatten)]
@@ -374,11 +359,9 @@ pub enum TenantAttachmentStatus {
Failed { reason: String },
}
#[serde_as]
#[derive(Serialize, Deserialize, Clone)]
pub struct TenantInfo {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
pub id: TenantShardId,
// NB: intentionally not part of OpenAPI, we don't want to commit to a specific set of TenantState's
pub state: TenantState,
/// Sum of the size of all layer files.
@@ -387,37 +370,39 @@ pub struct TenantInfo {
pub attachment_status: TenantAttachmentStatus,
}
#[derive(Serialize, Deserialize, Clone)]
pub struct TenantDetails {
#[serde(flatten)]
pub tenant_info: TenantInfo,
pub timelines: Vec<TimelineId>,
}
/// This represents the output of the "timeline_detail" and "timeline_list" API calls.
#[serde_as]
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct TimelineInfo {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantShardId,
pub timeline_id: TimelineId,
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_timeline_id: Option<TimelineId>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub ancestor_lsn: Option<Lsn>,
#[serde_as(as = "DisplayFromStr")]
pub last_record_lsn: Lsn,
#[serde_as(as = "Option<DisplayFromStr>")]
pub prev_record_lsn: Option<Lsn>,
#[serde_as(as = "DisplayFromStr")]
pub latest_gc_cutoff_lsn: Lsn,
#[serde_as(as = "DisplayFromStr")]
pub disk_consistent_lsn: Lsn,
/// The LSN that we have successfully uploaded to remote storage
#[serde_as(as = "DisplayFromStr")]
pub remote_consistent_lsn: Lsn,
/// The LSN that we are advertising to safekeepers
#[serde_as(as = "DisplayFromStr")]
pub remote_consistent_lsn_visible: Lsn,
pub current_logical_size: Option<u64>, // is None when timeline is Unloaded
/// The LSN from the start of the root timeline (never changes)
pub initdb_lsn: Lsn,
pub current_logical_size: u64,
pub current_logical_size_is_accurate: bool,
/// Sum of the size of all layer files.
/// If a layer is present in both local FS and S3, it counts only once.
pub current_physical_size: Option<u64>, // is None when timeline is Unloaded
@@ -426,7 +411,6 @@ pub struct TimelineInfo {
pub timeline_dir_layer_file_size_sum: Option<u64>,
pub wal_source_connstr: Option<String>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub last_received_msg_lsn: Option<Lsn>,
/// the timestamp (in microseconds) of the last received message
pub last_received_msg_ts: Option<u128>,
@@ -523,23 +507,13 @@ pub struct LayerAccessStats {
pub residence_events_history: HistoryBufferWithDropCounter<LayerResidenceEvent, 16>,
}
#[serde_as]
#[derive(Debug, Clone, Serialize)]
#[serde(tag = "kind")]
pub enum InMemoryLayerInfo {
Open {
#[serde_as(as = "DisplayFromStr")]
lsn_start: Lsn,
},
Frozen {
#[serde_as(as = "DisplayFromStr")]
lsn_start: Lsn,
#[serde_as(as = "DisplayFromStr")]
lsn_end: Lsn,
},
Open { lsn_start: Lsn },
Frozen { lsn_start: Lsn, lsn_end: Lsn },
}
#[serde_as]
#[derive(Debug, Clone, Serialize)]
#[serde(tag = "kind")]
pub enum HistoricLayerInfo {
@@ -547,9 +521,7 @@ pub enum HistoricLayerInfo {
layer_file_name: String,
layer_file_size: u64,
#[serde_as(as = "DisplayFromStr")]
lsn_start: Lsn,
#[serde_as(as = "DisplayFromStr")]
lsn_end: Lsn,
remote: bool,
access_stats: LayerAccessStats,
@@ -558,7 +530,6 @@ pub enum HistoricLayerInfo {
layer_file_name: String,
layer_file_size: u64,
#[serde_as(as = "DisplayFromStr")]
lsn_start: Lsn,
remote: bool,
access_stats: LayerAccessStats,
@@ -614,6 +585,7 @@ pub enum PagestreamFeMessage {
}
// Wrapped in libpq CopyData
#[derive(strum_macros::EnumProperty)]
pub enum PagestreamBeMessage {
Exists(PagestreamExistsResponse),
Nblocks(PagestreamNblocksResponse),
@@ -622,6 +594,29 @@ pub enum PagestreamBeMessage {
DbSize(PagestreamDbSizeResponse),
}
// Keep in sync with `pagestore_client.h`
#[repr(u8)]
enum PagestreamBeMessageTag {
Exists = 100,
Nblocks = 101,
GetPage = 102,
Error = 103,
DbSize = 104,
}
impl TryFrom<u8> for PagestreamBeMessageTag {
type Error = u8;
fn try_from(value: u8) -> Result<Self, u8> {
match value {
100 => Ok(PagestreamBeMessageTag::Exists),
101 => Ok(PagestreamBeMessageTag::Nblocks),
102 => Ok(PagestreamBeMessageTag::GetPage),
103 => Ok(PagestreamBeMessageTag::Error),
104 => Ok(PagestreamBeMessageTag::DbSize),
_ => Err(value),
}
}
}
#[derive(Debug, PartialEq, Eq)]
pub struct PagestreamExistsRequest {
pub latest: bool,
@@ -777,35 +772,91 @@ impl PagestreamBeMessage {
pub fn serialize(&self) -> Bytes {
let mut bytes = BytesMut::new();
use PagestreamBeMessageTag as Tag;
match self {
Self::Exists(resp) => {
bytes.put_u8(100); /* tag from pagestore_client.h */
bytes.put_u8(Tag::Exists as u8);
bytes.put_u8(resp.exists as u8);
}
Self::Nblocks(resp) => {
bytes.put_u8(101); /* tag from pagestore_client.h */
bytes.put_u8(Tag::Nblocks as u8);
bytes.put_u32(resp.n_blocks);
}
Self::GetPage(resp) => {
bytes.put_u8(102); /* tag from pagestore_client.h */
bytes.put_u8(Tag::GetPage as u8);
bytes.put(&resp.page[..]);
}
Self::Error(resp) => {
bytes.put_u8(103); /* tag from pagestore_client.h */
bytes.put_u8(Tag::Error as u8);
bytes.put(resp.message.as_bytes());
bytes.put_u8(0); // null terminator
}
Self::DbSize(resp) => {
bytes.put_u8(104); /* tag from pagestore_client.h */
bytes.put_u8(Tag::DbSize as u8);
bytes.put_i64(resp.db_size);
}
}
bytes.into()
}
pub fn deserialize(buf: Bytes) -> anyhow::Result<Self> {
let mut buf = buf.reader();
let msg_tag = buf.read_u8()?;
use PagestreamBeMessageTag as Tag;
let ok =
match Tag::try_from(msg_tag).map_err(|tag: u8| anyhow::anyhow!("invalid tag {tag}"))? {
Tag::Exists => {
let exists = buf.read_u8()?;
Self::Exists(PagestreamExistsResponse {
exists: exists != 0,
})
}
Tag::Nblocks => {
let n_blocks = buf.read_u32::<BigEndian>()?;
Self::Nblocks(PagestreamNblocksResponse { n_blocks })
}
Tag::GetPage => {
let mut page = vec![0; 8192]; // TODO: use MaybeUninit
buf.read_exact(&mut page)?;
PagestreamBeMessage::GetPage(PagestreamGetPageResponse { page: page.into() })
}
Tag::Error => {
let buf = buf.get_ref();
let cstr = std::ffi::CStr::from_bytes_until_nul(buf)?;
let rust_str = cstr.to_str()?;
PagestreamBeMessage::Error(PagestreamErrorResponse {
message: rust_str.to_owned(),
})
}
Tag::DbSize => {
let db_size = buf.read_i64::<BigEndian>()?;
Self::DbSize(PagestreamDbSizeResponse { db_size })
}
};
let remaining = buf.into_inner();
if !remaining.is_empty() {
anyhow::bail!(
"remaining bytes in msg with tag={msg_tag}: {}",
remaining.len()
);
}
Ok(ok)
}
pub fn kind(&self) -> &'static str {
match self {
Self::Exists(_) => "Exists",
Self::Nblocks(_) => "Nblocks",
Self::GetPage(_) => "GetPage",
Self::Error(_) => "Error",
Self::DbSize(_) => "DbSize",
}
}
}
#[cfg(test)]
@@ -867,7 +918,7 @@ mod tests {
fn test_tenantinfo_serde() {
// Test serialization/deserialization of TenantInfo
let original_active = TenantInfo {
id: TenantId::generate(),
id: TenantShardId::unsharded(TenantId::generate()),
state: TenantState::Active,
current_physical_size: Some(42),
attachment_status: TenantAttachmentStatus::Attached,
@@ -884,7 +935,7 @@ mod tests {
});
let original_broken = TenantInfo {
id: TenantId::generate(),
id: TenantShardId::unsharded(TenantId::generate()),
state: TenantState::Broken {
reason: "reason".into(),
backtrace: "backtrace info".into(),

View File

@@ -0,0 +1,151 @@
use utils::lsn::Lsn;
#[derive(Debug, PartialEq, Eq)]
pub struct Partitioning {
pub keys: crate::keyspace::KeySpace,
pub at_lsn: Lsn,
}
impl serde::Serialize for Partitioning {
fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
pub struct KeySpace<'a>(&'a crate::keyspace::KeySpace);
impl<'a> serde::Serialize for KeySpace<'a> {
fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
use serde::ser::SerializeSeq;
let mut seq = serializer.serialize_seq(Some(self.0.ranges.len()))?;
for kr in &self.0.ranges {
seq.serialize_element(&KeyRange(kr))?;
}
seq.end()
}
}
use serde::ser::SerializeMap;
let mut map = serializer.serialize_map(Some(2))?;
map.serialize_key("keys")?;
map.serialize_value(&KeySpace(&self.keys))?;
map.serialize_key("at_lsn")?;
map.serialize_value(&WithDisplay(&self.at_lsn))?;
map.end()
}
}
pub struct WithDisplay<'a, T>(&'a T);
impl<'a, T: std::fmt::Display> serde::Serialize for WithDisplay<'a, T> {
fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
serializer.collect_str(&self.0)
}
}
pub struct KeyRange<'a>(&'a std::ops::Range<crate::key::Key>);
impl<'a> serde::Serialize for KeyRange<'a> {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
use serde::ser::SerializeTuple;
let mut t = serializer.serialize_tuple(2)?;
t.serialize_element(&WithDisplay(&self.0.start))?;
t.serialize_element(&WithDisplay(&self.0.end))?;
t.end()
}
}
impl<'a> serde::Deserialize<'a> for Partitioning {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'a>,
{
pub struct KeySpace(crate::keyspace::KeySpace);
impl<'de> serde::Deserialize<'de> for KeySpace {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
#[serde_with::serde_as]
#[derive(serde::Deserialize)]
#[serde(transparent)]
struct Key(#[serde_as(as = "serde_with::DisplayFromStr")] crate::key::Key);
#[serde_with::serde_as]
#[derive(serde::Deserialize)]
struct Range(Key, Key);
let ranges: Vec<Range> = serde::Deserialize::deserialize(deserializer)?;
Ok(Self(crate::keyspace::KeySpace {
ranges: ranges
.into_iter()
.map(|Range(start, end)| (start.0..end.0))
.collect(),
}))
}
}
#[serde_with::serde_as]
#[derive(serde::Deserialize)]
struct De {
keys: KeySpace,
#[serde_as(as = "serde_with::DisplayFromStr")]
at_lsn: Lsn,
}
let de: De = serde::Deserialize::deserialize(deserializer)?;
Ok(Self {
at_lsn: de.at_lsn,
keys: de.keys.0,
})
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_serialization_roundtrip() {
let reference = r#"
{
"keys": [
[
"000000000000000000000000000000000000",
"000000000000000000000000000000000001"
],
[
"000000067F00000001000000000000000000",
"000000067F00000001000000000000000002"
],
[
"030000000000000000000000000000000000",
"030000000000000000000000000000000003"
]
],
"at_lsn": "0/2240160"
}
"#;
let de: Partitioning = serde_json::from_str(reference).unwrap();
let ser = serde_json::to_string(&de).unwrap();
let ser_de: serde_json::Value = serde_json::from_str(&ser).unwrap();
assert_eq!(
ser_de,
serde_json::from_str::<'_, serde_json::Value>(reference).unwrap()
);
}
}

View File

@@ -0,0 +1,762 @@
use std::{ops::RangeInclusive, str::FromStr};
use crate::key::{is_rel_block_key, Key};
use hex::FromHex;
use serde::{Deserialize, Serialize};
use thiserror;
use utils::id::TenantId;
#[derive(Ord, PartialOrd, Eq, PartialEq, Clone, Copy, Serialize, Deserialize, Debug, Hash)]
pub struct ShardNumber(pub u8);
#[derive(Ord, PartialOrd, Eq, PartialEq, Clone, Copy, Serialize, Deserialize, Debug, Hash)]
pub struct ShardCount(pub u8);
impl ShardCount {
pub const MAX: Self = Self(u8::MAX);
}
impl ShardNumber {
pub const MAX: Self = Self(u8::MAX);
}
/// A TenantShardId identifies a unit of work for the Pageserver.
///
/// These are written as `<tenant_id>-<shard number><shard-count>`, for example:
///
/// # The second shard in a two-shard tenant
/// 072f1291a5310026820b2fe4b2968934-0102
///
/// Historically, tenants could not have multiple shards, and were identified
/// by TenantId. To support this, TenantShardId has a special legacy
/// mode where `shard_count` is equal to zero: this represents a single-sharded
/// tenant which should be written as a TenantId with no suffix.
///
/// The human-readable encoding of TenantShardId, such as used in API URLs,
/// is both forward and backward compatible: a legacy TenantId can be
/// decoded as a TenantShardId, and when re-encoded it will be parseable
/// as a TenantId.
///
/// Note that the binary encoding is _not_ backward compatible, because
/// at the time sharding is introduced, there are no existing binary structures
/// containing TenantId that we need to handle.
#[derive(Eq, PartialEq, PartialOrd, Ord, Clone, Copy, Hash)]
pub struct TenantShardId {
pub tenant_id: TenantId,
pub shard_number: ShardNumber,
pub shard_count: ShardCount,
}
impl TenantShardId {
pub fn unsharded(tenant_id: TenantId) -> Self {
Self {
tenant_id,
shard_number: ShardNumber(0),
shard_count: ShardCount(0),
}
}
/// The range of all TenantShardId that belong to a particular TenantId. This is useful when
/// you have a BTreeMap of TenantShardId, and are querying by TenantId.
pub fn tenant_range(tenant_id: TenantId) -> RangeInclusive<Self> {
RangeInclusive::new(
Self {
tenant_id,
shard_number: ShardNumber(0),
shard_count: ShardCount(0),
},
Self {
tenant_id,
shard_number: ShardNumber::MAX,
shard_count: ShardCount::MAX,
},
)
}
pub fn shard_slug(&self) -> impl std::fmt::Display + '_ {
ShardSlug(self)
}
/// Convenience for code that has special behavior on the 0th shard.
pub fn is_zero(&self) -> bool {
self.shard_number == ShardNumber(0)
}
}
/// Formatting helper
struct ShardSlug<'a>(&'a TenantShardId);
impl<'a> std::fmt::Display for ShardSlug<'a> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(
f,
"{:02x}{:02x}",
self.0.shard_number.0, self.0.shard_count.0
)
}
}
impl std::fmt::Display for TenantShardId {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
if self.shard_count != ShardCount(0) {
write!(f, "{}-{}", self.tenant_id, self.shard_slug())
} else {
// Legacy case (shard_count == 0) -- format as just the tenant id. Note that this
// is distinct from the normal single shard case (shard count == 1).
self.tenant_id.fmt(f)
}
}
}
impl std::fmt::Debug for TenantShardId {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
// Debug is the same as Display: the compact hex representation
write!(f, "{}", self)
}
}
impl std::str::FromStr for TenantShardId {
type Err = hex::FromHexError;
fn from_str(s: &str) -> Result<Self, Self::Err> {
// Expect format: 16 byte TenantId, '-', 1 byte shard number, 1 byte shard count
if s.len() == 32 {
// Legacy case: no shard specified
Ok(Self {
tenant_id: TenantId::from_str(s)?,
shard_number: ShardNumber(0),
shard_count: ShardCount(0),
})
} else if s.len() == 37 {
let bytes = s.as_bytes();
let tenant_id = TenantId::from_hex(&bytes[0..32])?;
let mut shard_parts: [u8; 2] = [0u8; 2];
hex::decode_to_slice(&bytes[33..37], &mut shard_parts)?;
Ok(Self {
tenant_id,
shard_number: ShardNumber(shard_parts[0]),
shard_count: ShardCount(shard_parts[1]),
})
} else {
Err(hex::FromHexError::InvalidStringLength)
}
}
}
impl From<[u8; 18]> for TenantShardId {
fn from(b: [u8; 18]) -> Self {
let tenant_id_bytes: [u8; 16] = b[0..16].try_into().unwrap();
Self {
tenant_id: TenantId::from(tenant_id_bytes),
shard_number: ShardNumber(b[16]),
shard_count: ShardCount(b[17]),
}
}
}
/// For use within the context of a particular tenant, when we need to know which
/// shard we're dealing with, but do not need to know the full ShardIdentity (because
/// we won't be doing any page->shard mapping), and do not need to know the fully qualified
/// TenantShardId.
#[derive(Eq, PartialEq, PartialOrd, Ord, Clone, Copy)]
pub struct ShardIndex {
pub shard_number: ShardNumber,
pub shard_count: ShardCount,
}
impl ShardIndex {
pub fn new(number: ShardNumber, count: ShardCount) -> Self {
Self {
shard_number: number,
shard_count: count,
}
}
pub fn unsharded() -> Self {
Self {
shard_number: ShardNumber(0),
shard_count: ShardCount(0),
}
}
pub fn is_unsharded(&self) -> bool {
self.shard_number == ShardNumber(0) && self.shard_count == ShardCount(0)
}
/// For use in constructing remote storage paths: concatenate this with a TenantId
/// to get a fully qualified TenantShardId.
///
/// Backward compat: this function returns an empty string if Self::is_unsharded, such
/// that the legacy pre-sharding remote key format is preserved.
pub fn get_suffix(&self) -> String {
if self.is_unsharded() {
"".to_string()
} else {
format!("-{:02x}{:02x}", self.shard_number.0, self.shard_count.0)
}
}
}
impl std::fmt::Display for ShardIndex {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{:02x}{:02x}", self.shard_number.0, self.shard_count.0)
}
}
impl std::fmt::Debug for ShardIndex {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
// Debug is the same as Display: the compact hex representation
write!(f, "{}", self)
}
}
impl std::str::FromStr for ShardIndex {
type Err = hex::FromHexError;
fn from_str(s: &str) -> Result<Self, Self::Err> {
// Expect format: 1 byte shard number, 1 byte shard count
if s.len() == 4 {
let bytes = s.as_bytes();
let mut shard_parts: [u8; 2] = [0u8; 2];
hex::decode_to_slice(bytes, &mut shard_parts)?;
Ok(Self {
shard_number: ShardNumber(shard_parts[0]),
shard_count: ShardCount(shard_parts[1]),
})
} else {
Err(hex::FromHexError::InvalidStringLength)
}
}
}
impl From<[u8; 2]> for ShardIndex {
fn from(b: [u8; 2]) -> Self {
Self {
shard_number: ShardNumber(b[0]),
shard_count: ShardCount(b[1]),
}
}
}
impl Serialize for TenantShardId {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if serializer.is_human_readable() {
serializer.collect_str(self)
} else {
let mut packed: [u8; 18] = [0; 18];
packed[0..16].clone_from_slice(&self.tenant_id.as_arr());
packed[16] = self.shard_number.0;
packed[17] = self.shard_count.0;
packed.serialize(serializer)
}
}
}
impl<'de> Deserialize<'de> for TenantShardId {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
struct IdVisitor {
is_human_readable_deserializer: bool,
}
impl<'de> serde::de::Visitor<'de> for IdVisitor {
type Value = TenantShardId;
fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
if self.is_human_readable_deserializer {
formatter.write_str("value in form of hex string")
} else {
formatter.write_str("value in form of integer array([u8; 18])")
}
}
fn visit_seq<A>(self, seq: A) -> Result<Self::Value, A::Error>
where
A: serde::de::SeqAccess<'de>,
{
let s = serde::de::value::SeqAccessDeserializer::new(seq);
let id: [u8; 18] = Deserialize::deserialize(s)?;
Ok(TenantShardId::from(id))
}
fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
where
E: serde::de::Error,
{
TenantShardId::from_str(v).map_err(E::custom)
}
}
if deserializer.is_human_readable() {
deserializer.deserialize_str(IdVisitor {
is_human_readable_deserializer: true,
})
} else {
deserializer.deserialize_tuple(
18,
IdVisitor {
is_human_readable_deserializer: false,
},
)
}
}
}
/// Stripe size in number of pages
#[derive(Clone, Copy, Serialize, Deserialize, Eq, PartialEq, Debug)]
pub struct ShardStripeSize(pub u32);
/// Layout version: for future upgrades where we might change how the key->shard mapping works
#[derive(Clone, Copy, Serialize, Deserialize, Eq, PartialEq, Debug)]
pub struct ShardLayout(u8);
const LAYOUT_V1: ShardLayout = ShardLayout(1);
/// ShardIdentity uses a magic layout value to indicate if it is unusable
const LAYOUT_BROKEN: ShardLayout = ShardLayout(255);
/// Default stripe size in pages: 256MiB divided by 8kiB page size.
const DEFAULT_STRIPE_SIZE: ShardStripeSize = ShardStripeSize(256 * 1024 / 8);
/// The ShardIdentity contains the information needed for one member of the shard map
/// to resolve a key to a shard, and then check whether that shard is ==self.
#[derive(Clone, Copy, Serialize, Deserialize, Eq, PartialEq, Debug)]
pub struct ShardIdentity {
pub number: ShardNumber,
pub count: ShardCount,
stripe_size: ShardStripeSize,
layout: ShardLayout,
}
#[derive(thiserror::Error, Debug, PartialEq, Eq)]
pub enum ShardConfigError {
#[error("Invalid shard count")]
InvalidCount,
#[error("Invalid shard number")]
InvalidNumber,
#[error("Invalid stripe size")]
InvalidStripeSize,
}
impl ShardIdentity {
/// An identity with number=0 count=0 is a "none" identity, which represents legacy
/// tenants. Modern single-shard tenants should not use this: they should
/// have number=0 count=1.
pub fn unsharded() -> Self {
Self {
number: ShardNumber(0),
count: ShardCount(0),
layout: LAYOUT_V1,
stripe_size: DEFAULT_STRIPE_SIZE,
}
}
/// A broken instance of this type is only used for `TenantState::Broken` tenants,
/// which are constructed in code paths that don't have access to proper configuration.
///
/// A ShardIdentity in this state may not be used for anything, and should not be persisted.
/// Enforcement is via assertions, to avoid making our interface fallible for this
/// edge case: it is the Tenant's responsibility to avoid trying to do any I/O when in a broken
/// state, and by extension to avoid trying to do any page->shard resolution.
pub fn broken(number: ShardNumber, count: ShardCount) -> Self {
Self {
number,
count,
layout: LAYOUT_BROKEN,
stripe_size: DEFAULT_STRIPE_SIZE,
}
}
pub fn is_unsharded(&self) -> bool {
self.number == ShardNumber(0) && self.count == ShardCount(0)
}
/// Count must be nonzero, and number must be < count. To construct
/// the legacy case (count==0), use Self::unsharded instead.
pub fn new(
number: ShardNumber,
count: ShardCount,
stripe_size: ShardStripeSize,
) -> Result<Self, ShardConfigError> {
if count.0 == 0 {
Err(ShardConfigError::InvalidCount)
} else if number.0 > count.0 - 1 {
Err(ShardConfigError::InvalidNumber)
} else if stripe_size.0 == 0 {
Err(ShardConfigError::InvalidStripeSize)
} else {
Ok(Self {
number,
count,
layout: LAYOUT_V1,
stripe_size,
})
}
}
fn is_broken(&self) -> bool {
self.layout == LAYOUT_BROKEN
}
pub fn get_shard_number(&self, key: &Key) -> ShardNumber {
assert!(!self.is_broken());
key_to_shard_number(self.count, self.stripe_size, key)
}
/// Return true if the key should be ingested by this shard
pub fn is_key_local(&self, key: &Key) -> bool {
assert!(!self.is_broken());
if self.count < ShardCount(2) || (key_is_shard0(key) && self.number == ShardNumber(0)) {
true
} else {
key_to_shard_number(self.count, self.stripe_size, key) == self.number
}
}
pub fn shard_slug(&self) -> String {
if self.count > ShardCount(0) {
format!("-{:02x}{:02x}", self.number.0, self.count.0)
} else {
String::new()
}
}
/// Convenience for checking if this identity is the 0th shard in a tenant,
/// for special cases on shard 0 such as ingesting relation sizes.
pub fn is_zero(&self) -> bool {
self.number == ShardNumber(0)
}
}
impl Serialize for ShardIndex {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if serializer.is_human_readable() {
serializer.collect_str(self)
} else {
// Binary encoding is not used in index_part.json, but is included in anticipation of
// switching various structures (e.g. inter-process communication, remote metadata) to more
// compact binary encodings in future.
let mut packed: [u8; 2] = [0; 2];
packed[0] = self.shard_number.0;
packed[1] = self.shard_count.0;
packed.serialize(serializer)
}
}
}
impl<'de> Deserialize<'de> for ShardIndex {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
struct IdVisitor {
is_human_readable_deserializer: bool,
}
impl<'de> serde::de::Visitor<'de> for IdVisitor {
type Value = ShardIndex;
fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
if self.is_human_readable_deserializer {
formatter.write_str("value in form of hex string")
} else {
formatter.write_str("value in form of integer array([u8; 2])")
}
}
fn visit_seq<A>(self, seq: A) -> Result<Self::Value, A::Error>
where
A: serde::de::SeqAccess<'de>,
{
let s = serde::de::value::SeqAccessDeserializer::new(seq);
let id: [u8; 2] = Deserialize::deserialize(s)?;
Ok(ShardIndex::from(id))
}
fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
where
E: serde::de::Error,
{
ShardIndex::from_str(v).map_err(E::custom)
}
}
if deserializer.is_human_readable() {
deserializer.deserialize_str(IdVisitor {
is_human_readable_deserializer: true,
})
} else {
deserializer.deserialize_tuple(
2,
IdVisitor {
is_human_readable_deserializer: false,
},
)
}
}
}
/// Whether this key is always held on shard 0 (e.g. shard 0 holds all SLRU keys
/// in order to be able to serve basebackup requests without peer communication).
fn key_is_shard0(key: &Key) -> bool {
// To decide what to shard out to shards >0, we apply a simple rule that only
// relation pages are distributed to shards other than shard zero. Everything else gets
// stored on shard 0. This guarantees that shard 0 can independently serve basebackup
// requests, and any request other than those for particular blocks in relations.
//
// In this condition:
// - is_rel_block_key includes only relations, i.e. excludes SLRU data and
// all metadata.
// - field6 is set to -1 for relation size pages.
!(is_rel_block_key(key) && key.field6 != 0xffffffff)
}
/// Provide the same result as the function in postgres `hashfn.h` with the same name
fn murmurhash32(mut h: u32) -> u32 {
h ^= h >> 16;
h = h.wrapping_mul(0x85ebca6b);
h ^= h >> 13;
h = h.wrapping_mul(0xc2b2ae35);
h ^= h >> 16;
h
}
/// Provide the same result as the function in postgres `hashfn.h` with the same name
fn hash_combine(mut a: u32, mut b: u32) -> u32 {
b = b.wrapping_add(0x9e3779b9);
b = b.wrapping_add(a << 6);
b = b.wrapping_add(a >> 2);
a ^= b;
a
}
/// Where a Key is to be distributed across shards, select the shard. This function
/// does not account for keys that should be broadcast across shards.
///
/// The hashing in this function must exactly match what we do in postgres smgr
/// code. The resulting distribution of pages is intended to preserve locality within
/// `stripe_size` ranges of contiguous block numbers in the same relation, while otherwise
/// distributing data pseudo-randomly.
///
/// The mapping of key to shard is not stable across changes to ShardCount: this is intentional
/// and will be handled at higher levels when shards are split.
fn key_to_shard_number(count: ShardCount, stripe_size: ShardStripeSize, key: &Key) -> ShardNumber {
// Fast path for un-sharded tenants or broadcast keys
if count < ShardCount(2) || key_is_shard0(key) {
return ShardNumber(0);
}
// relNode
let mut hash = murmurhash32(key.field4);
// blockNum/stripe size
hash = hash_combine(hash, murmurhash32(key.field6 / stripe_size.0));
ShardNumber((hash % count.0 as u32) as u8)
}
#[cfg(test)]
mod tests {
use std::str::FromStr;
use bincode;
use utils::{id::TenantId, Hex};
use super::*;
const EXAMPLE_TENANT_ID: &str = "1f359dd625e519a1a4e8d7509690f6fc";
#[test]
fn tenant_shard_id_string() -> Result<(), hex::FromHexError> {
let example = TenantShardId {
tenant_id: TenantId::from_str(EXAMPLE_TENANT_ID).unwrap(),
shard_count: ShardCount(10),
shard_number: ShardNumber(7),
};
let encoded = format!("{example}");
let expected = format!("{EXAMPLE_TENANT_ID}-070a");
assert_eq!(&encoded, &expected);
let decoded = TenantShardId::from_str(&encoded)?;
assert_eq!(example, decoded);
Ok(())
}
#[test]
fn tenant_shard_id_binary() -> Result<(), hex::FromHexError> {
let example = TenantShardId {
tenant_id: TenantId::from_str(EXAMPLE_TENANT_ID).unwrap(),
shard_count: ShardCount(10),
shard_number: ShardNumber(7),
};
let encoded = bincode::serialize(&example).unwrap();
let expected: [u8; 18] = [
0x1f, 0x35, 0x9d, 0xd6, 0x25, 0xe5, 0x19, 0xa1, 0xa4, 0xe8, 0xd7, 0x50, 0x96, 0x90,
0xf6, 0xfc, 0x07, 0x0a,
];
assert_eq!(Hex(&encoded), Hex(&expected));
let decoded = bincode::deserialize(&encoded).unwrap();
assert_eq!(example, decoded);
Ok(())
}
#[test]
fn tenant_shard_id_backward_compat() -> Result<(), hex::FromHexError> {
// Test that TenantShardId can decode a TenantId in human
// readable form
let example = TenantId::from_str(EXAMPLE_TENANT_ID).unwrap();
let encoded = format!("{example}");
assert_eq!(&encoded, EXAMPLE_TENANT_ID);
let decoded = TenantShardId::from_str(&encoded)?;
assert_eq!(example, decoded.tenant_id);
assert_eq!(decoded.shard_count, ShardCount(0));
assert_eq!(decoded.shard_number, ShardNumber(0));
Ok(())
}
#[test]
fn tenant_shard_id_forward_compat() -> Result<(), hex::FromHexError> {
// Test that a legacy TenantShardId encodes into a form that
// can be decoded as TenantId
let example_tenant_id = TenantId::from_str(EXAMPLE_TENANT_ID).unwrap();
let example = TenantShardId::unsharded(example_tenant_id);
let encoded = format!("{example}");
assert_eq!(&encoded, EXAMPLE_TENANT_ID);
let decoded = TenantId::from_str(&encoded)?;
assert_eq!(example_tenant_id, decoded);
Ok(())
}
#[test]
fn tenant_shard_id_legacy_binary() -> Result<(), hex::FromHexError> {
// Unlike in human readable encoding, binary encoding does not
// do any special handling of legacy unsharded TenantIds: this test
// is equivalent to the main test for binary encoding, just verifying
// that the same behavior applies when we have used `unsharded()` to
// construct a TenantShardId.
let example = TenantShardId::unsharded(TenantId::from_str(EXAMPLE_TENANT_ID).unwrap());
let encoded = bincode::serialize(&example).unwrap();
let expected: [u8; 18] = [
0x1f, 0x35, 0x9d, 0xd6, 0x25, 0xe5, 0x19, 0xa1, 0xa4, 0xe8, 0xd7, 0x50, 0x96, 0x90,
0xf6, 0xfc, 0x00, 0x00,
];
assert_eq!(Hex(&encoded), Hex(&expected));
let decoded = bincode::deserialize::<TenantShardId>(&encoded).unwrap();
assert_eq!(example, decoded);
Ok(())
}
#[test]
fn shard_identity_validation() -> Result<(), ShardConfigError> {
// Happy cases
ShardIdentity::new(ShardNumber(0), ShardCount(1), DEFAULT_STRIPE_SIZE)?;
ShardIdentity::new(ShardNumber(0), ShardCount(1), ShardStripeSize(1))?;
ShardIdentity::new(ShardNumber(254), ShardCount(255), ShardStripeSize(1))?;
assert_eq!(
ShardIdentity::new(ShardNumber(0), ShardCount(0), DEFAULT_STRIPE_SIZE),
Err(ShardConfigError::InvalidCount)
);
assert_eq!(
ShardIdentity::new(ShardNumber(10), ShardCount(10), DEFAULT_STRIPE_SIZE),
Err(ShardConfigError::InvalidNumber)
);
assert_eq!(
ShardIdentity::new(ShardNumber(11), ShardCount(10), DEFAULT_STRIPE_SIZE),
Err(ShardConfigError::InvalidNumber)
);
assert_eq!(
ShardIdentity::new(ShardNumber(255), ShardCount(255), DEFAULT_STRIPE_SIZE),
Err(ShardConfigError::InvalidNumber)
);
assert_eq!(
ShardIdentity::new(ShardNumber(0), ShardCount(1), ShardStripeSize(0)),
Err(ShardConfigError::InvalidStripeSize)
);
Ok(())
}
#[test]
fn shard_index_human_encoding() -> Result<(), hex::FromHexError> {
let example = ShardIndex {
shard_number: ShardNumber(13),
shard_count: ShardCount(17),
};
let expected: String = "0d11".to_string();
let encoded = format!("{example}");
assert_eq!(&encoded, &expected);
let decoded = ShardIndex::from_str(&encoded)?;
assert_eq!(example, decoded);
Ok(())
}
#[test]
fn shard_index_binary_encoding() -> Result<(), hex::FromHexError> {
let example = ShardIndex {
shard_number: ShardNumber(13),
shard_count: ShardCount(17),
};
let expected: [u8; 2] = [0x0d, 0x11];
let encoded = bincode::serialize(&example).unwrap();
assert_eq!(Hex(&encoded), Hex(&expected));
let decoded = bincode::deserialize(&encoded).unwrap();
assert_eq!(example, decoded);
Ok(())
}
// These are only smoke tests to spot check that our implementation doesn't
// deviate from a few examples values: not aiming to validate the overall
// hashing algorithm.
#[test]
fn murmur_hash() {
assert_eq!(murmurhash32(0), 0);
assert_eq!(hash_combine(0xb1ff3b40, 0), 0xfb7923c9);
}
#[test]
fn shard_mapping() {
let key = Key {
field1: 0x00,
field2: 0x67f,
field3: 0x5,
field4: 0x400c,
field5: 0x00,
field6: 0x7d06,
};
let shard = key_to_shard_number(ShardCount(10), DEFAULT_STRIPE_SIZE, &key);
assert_eq!(shard, ShardNumber(8));
}
}
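As a small illustration of the shard-suffix encoding defined above (the crate
path `pageserver_api` and the concrete values are assumptions for the example):

```rust
use pageserver_api::shard::{ShardCount, ShardIndex, ShardNumber};

fn main() {
    // Sharded case: shard 1 of 4 renders as "0104", and the remote-storage
    // suffix carries a leading dash.
    let idx = ShardIndex::new(ShardNumber(1), ShardCount(4));
    assert_eq!(idx.to_string(), "0104");
    assert_eq!(idx.get_suffix(), "-0104");

    // Legacy unsharded case: the suffix is empty, preserving the pre-sharding
    // remote key format.
    assert_eq!(ShardIndex::unsharded().get_suffix(), "");
}
```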

View File

@@ -9,10 +9,12 @@ async-trait.workspace = true
anyhow.workspace = true
bytes.workspace = true
futures.workspace = true
ring.workspace = true
rustls.workspace = true
serde.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-postgres.workspace = true
tokio-rustls.workspace = true
tracing.workspace = true
@@ -22,5 +24,4 @@ workspace_hack.workspace = true
[dev-dependencies]
once_cell.workspace = true
rustls-pemfile.workspace = true
tokio-postgres.workspace = true
tokio-postgres-rustls.workspace = true
# tokio-postgres-rustls.workspace = true

View File

@@ -2,9 +2,11 @@
//! To use, create PostgresBackend and run() it, passing the Handler
//! implementation determining how to process the queries. Currently its API
//! is rather narrow, but we can extend it once required.
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
use anyhow::Context;
use bytes::Bytes;
use futures::pin_mut;
use futures::{pin_mut, TryFutureExt, FutureExt};
use serde::{Deserialize, Serialize};
use std::io::ErrorKind;
use std::net::SocketAddr;
@@ -15,7 +17,7 @@ use std::{fmt, io};
use std::{future::Future, str::FromStr};
use tokio::io::{AsyncRead, AsyncWrite};
use tokio_rustls::TlsAcceptor;
use tracing::{debug, error, info, trace};
use tracing::{debug, error, info, trace, warn};
use pq_proto::framed::{ConnectionError, Framed, FramedReader, FramedWriter};
use pq_proto::{
@@ -33,6 +35,11 @@ pub enum QueryError {
/// We were instructed to shutdown while processing the query
#[error("Shutting down")]
Shutdown,
/// Authentication failure
#[error("Unauthorized: {0}")]
Unauthorized(std::borrow::Cow<'static, str>),
#[error("Simulated Connection Error")]
SimulatedConnectionError,
/// Some other error
#[error(transparent)]
Other(#[from] anyhow::Error),
@@ -47,8 +54,9 @@ impl From<io::Error> for QueryError {
impl QueryError {
pub fn pg_error_code(&self) -> &'static [u8; 5] {
match self {
Self::Disconnected(_) => b"08006", // connection failure
Self::Disconnected(_) | Self::SimulatedConnectionError => b"08006", // connection failure
Self::Shutdown => SQLSTATE_ADMIN_SHUTDOWN,
Self::Unauthorized(_) => SQLSTATE_INTERNAL_ERROR,
Self::Other(_) => SQLSTATE_INTERNAL_ERROR, // internal error
}
}
@@ -608,7 +616,7 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> PostgresBackend<IO> {
if let Err(e) = handler.check_auth_jwt(self, jwt_response) {
self.write_message_noflush(&BeMessage::ErrorResponse(
&e.to_string(),
&short_error(&e),
Some(e.pg_error_code()),
))?;
return Err(e);
@@ -728,12 +736,20 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> PostgresBackend<IO> {
trace!("got query {query_string:?}");
if let Err(e) = handler.process_query(self, query_string).await {
log_query_error(query_string, &e);
let short_error = short_error(&e);
self.write_message_noflush(&BeMessage::ErrorResponse(
&short_error,
Some(e.pg_error_code()),
))?;
match e {
QueryError::Shutdown => return Ok(ProcessMsgResult::Break),
QueryError::SimulatedConnectionError => {
return Err(QueryError::SimulatedConnectionError)
}
e => {
log_query_error(query_string, &e);
let short_error = short_error(&e);
self.write_message_noflush(&BeMessage::ErrorResponse(
&short_error,
Some(e.pg_error_code()),
))?;
}
}
}
self.write_message_noflush(&BeMessage::ReadyForQuery)?;
}
@@ -959,6 +975,8 @@ pub fn short_error(e: &QueryError) -> String {
match e {
QueryError::Disconnected(connection_error) => connection_error.to_string(),
QueryError::Shutdown => "shutdown".to_string(),
QueryError::Unauthorized(_e) => "JWT authentication error".to_string(),
QueryError::SimulatedConnectionError => "simulated connection error".to_string(),
QueryError::Other(e) => format!("{e:#}"),
}
}
@@ -975,9 +993,15 @@ fn log_query_error(query: &str, e: &QueryError) {
QueryError::Disconnected(other_connection_error) => {
error!("query handler for '{query}' failed with connection error: {other_connection_error:?}")
}
QueryError::SimulatedConnectionError => {
error!("query handler for query '{query}' failed due to a simulated connection error")
}
QueryError::Shutdown => {
info!("query handler for '{query}' cancelled during tenant shutdown")
}
QueryError::Unauthorized(e) => {
warn!("query handler for '{query}' failed with authentication error: {e}");
}
QueryError::Other(e) => {
error!("query handler for '{query}' failed: {e:?}");
}
@@ -1006,3 +1030,115 @@ pub enum CopyStreamHandlerEnd {
#[error(transparent)]
Other(#[from] anyhow::Error),
}
#[derive(Clone)]
pub struct MakeRustlsConnect {
config: Arc<rustls::ClientConfig>,
}
impl MakeRustlsConnect {
pub fn new(config: rustls::ClientConfig) -> Self {
Self {
config: Arc::new(config),
}
}
}
impl<S> tokio_postgres::tls::MakeTlsConnect<S> for MakeRustlsConnect
where
S: AsyncRead + AsyncWrite + Unpin + Send + 'static,
{
type Stream = RustlsStream<S>;
type TlsConnect = RustlsConnect;
type Error = io::Error;
fn make_tls_connect(&mut self, hostname: &str) -> io::Result<RustlsConnect> {
rustls::pki_types::ServerName::try_from(hostname)
.map(|dns_name| {
RustlsConnect(Some(RustlsConnectData {
hostname: dns_name.to_owned(),
connector: Arc::clone(&self.config).into(),
}))
})
.or(Ok(RustlsConnect(None)))
}
}
pub struct RustlsConnect(Option<RustlsConnectData>);
struct RustlsConnectData {
hostname: rustls::pki_types::ServerName<'static>,
connector: tokio_rustls::TlsConnector,
}
impl<S> tokio_postgres::tls::TlsConnect<S> for RustlsConnect
where
S: AsyncRead + AsyncWrite + Unpin + Send + 'static,
{
type Stream = RustlsStream<S>;
type Error = io::Error;
type Future = Pin<Box<dyn Future<Output = io::Result<RustlsStream<S>>> + Send>>;
fn connect(self, stream: S) -> Self::Future {
match self.0 {
None => Box::pin(core::future::ready(Err(io::ErrorKind::InvalidInput.into()))),
Some(c) => c
.connector
.connect(c.hostname, stream)
.map_ok(|s| RustlsStream(Box::pin(s)))
.boxed(),
}
}
}
pub struct RustlsStream<S>(Pin<Box<tokio_rustls::client::TlsStream<S>>>);
impl<S> tokio_postgres::tls::TlsStream for RustlsStream<S>
where
S: AsyncRead + AsyncWrite + Unpin,
{
fn channel_binding(&self) -> tokio_postgres::tls::ChannelBinding {
let (_, session) = self.0.get_ref();
match session.peer_certificates() {
Some(certs) if !certs.is_empty() => {
let sha256 = ring::digest::digest(&ring::digest::SHA256, certs[0].as_ref());
tokio_postgres::tls::ChannelBinding::tls_server_end_point(sha256.as_ref().into())
}
_ => tokio_postgres::tls::ChannelBinding::none(),
}
}
}
impl<S> AsyncRead for RustlsStream<S>
where
S: AsyncRead + AsyncWrite + Unpin,
{
fn poll_read(
mut self: Pin<&mut Self>,
cx: &mut std::task::Context,
buf: &mut tokio::io::ReadBuf<'_>,
) -> Poll<tokio::io::Result<()>> {
self.0.as_mut().poll_read(cx, buf)
}
}
impl<S> AsyncWrite for RustlsStream<S>
where
S: AsyncRead + AsyncWrite + Unpin,
{
fn poll_write(
mut self: Pin<&mut Self>,
cx: &mut std::task::Context,
buf: &[u8],
) -> Poll<tokio::io::Result<usize>> {
self.0.as_mut().poll_write(cx, buf)
}
fn poll_flush(mut self: Pin<&mut Self>, cx: &mut std::task::Context) -> Poll<tokio::io::Result<()>> {
self.0.as_mut().poll_flush(cx)
}
fn poll_shutdown(mut self: Pin<&mut Self>, cx: &mut std::task::Context) -> Poll<tokio::io::Result<()>> {
self.0.as_mut().poll_shutdown(cx)
}
}


@@ -1,5 +1,6 @@
/// Test postgres_backend_async with tokio_postgres
use once_cell::sync::Lazy;
use postgres_backend::MakeRustlsConnect;
use postgres_backend::{AuthType, Handler, PostgresBackend, QueryError};
use pq_proto::{BeMessage, RowDescriptor};
use std::io::Cursor;
@@ -9,7 +10,6 @@ use tokio::net::{TcpListener, TcpStream};
use tokio_postgres::config::SslMode;
use tokio_postgres::tls::MakeTlsConnect;
use tokio_postgres::{Config, NoTls, SimpleQueryMessage};
use tokio_postgres_rustls::MakeRustlsConnect;
// generate client, server test streams
async fn make_tcp_pair() -> (TcpStream, TcpStream) {
@@ -72,14 +72,21 @@ async fn simple_select() {
}
}
static KEY: Lazy<rustls::PrivateKey> = Lazy::new(|| {
static KEY: Lazy<rustls::pki_types::PrivatePkcs1KeyDer<'static>> = Lazy::new(|| {
let mut cursor = Cursor::new(include_bytes!("key.pem"));
rustls::PrivateKey(rustls_pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone())
let key = rustls_pemfile::rsa_private_keys(&mut cursor)
.next()
.unwrap()
.unwrap();
key.secret_pkcs1_der().to_owned().into()
});
static CERT: Lazy<rustls::Certificate> = Lazy::new(|| {
static CERT: Lazy<rustls::pki_types::CertificateDer<'static>> = Lazy::new(|| {
let mut cursor = Cursor::new(include_bytes!("cert.pem"));
rustls::Certificate(rustls_pemfile::certs(&mut cursor).unwrap()[0].clone())
let cert = rustls_pemfile::certs(&mut cursor).next().unwrap().unwrap();
cert.into_owned()
});
// test that basic select with ssl works
@@ -87,10 +94,10 @@ static CERT: Lazy<rustls::Certificate> = Lazy::new(|| {
async fn simple_select_ssl() {
let (client_sock, server_sock) = make_tcp_pair().await;
let key = rustls::pki_types::PrivateKeyDer::Pkcs1(KEY.secret_pkcs1_der().to_owned().into());
let server_cfg = rustls::ServerConfig::builder()
.with_safe_defaults()
.with_no_client_auth()
.with_single_cert(vec![CERT.clone()], KEY.clone())
.with_single_cert(vec![CERT.clone()], key)
.unwrap();
let tls_config = Some(Arc::new(server_cfg));
let pgbackend =
@@ -102,14 +109,13 @@ async fn simple_select_ssl() {
});
let client_cfg = rustls::ClientConfig::builder()
.with_safe_defaults()
.with_root_certificates({
let mut store = rustls::RootCertStore::empty();
store.add(&CERT).unwrap();
store.add(CERT.clone()).unwrap();
store
})
.with_no_client_auth();
let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
let mut make_tls_connect = MakeRustlsConnect::new(client_cfg);
let tls_connect = <MakeRustlsConnect as MakeTlsConnect<TcpStream>>::make_tls_connect(
&mut make_tls_connect,
"localhost",


@@ -1,3 +1,5 @@
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
use anyhow::{bail, Context};
use itertools::Itertools;
use std::borrow::Cow;
@@ -161,8 +163,18 @@ impl PgConnectionConfig {
}
/// Connect using postgres protocol with TLS disabled.
pub fn connect_no_tls(&self) -> Result<postgres::Client, postgres::Error> {
postgres::Config::from(self.to_tokio_postgres_config()).connect(postgres::NoTls)
pub async fn connect_no_tls(
&self,
) -> Result<
(
tokio_postgres::Client,
tokio_postgres::Connection<tokio_postgres::Socket, tokio_postgres::tls::NoTlsStream>,
),
postgres::Error,
> {
self.to_tokio_postgres_config()
.connect(postgres::NoTls)
.await
}
}
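
A consequence of making `connect_no_tls` async is that it now hands back the raw `tokio_postgres` client/connection pair, so the caller has to drive the connection future itself. A minimal sketch of the expected call site, assuming a `config: PgConnectionConfig` value and an `anyhow`-style error context (both illustrative, not from this diff):

```rust
// Minimal sketch, assuming `config: PgConnectionConfig` is already built.
let (client, connection) = config.connect_no_tls().await?;
// tokio_postgres performs all I/O on the connection future, so it must be
// polled (here: spawned) for the client to make progress.
tokio::spawn(async move {
    if let Err(e) = connection.await {
        eprintln!("postgres connection error: {e}");
    }
});
let row = client.query_one("SELECT 1::int", &[]).await?;
assert_eq!(row.get::<_, i32>(0), 1);
```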


@@ -8,6 +8,7 @@
// modules included with the postgres_ffi macro depend on the types of the specific version's
// types, and trigger a too eager lint.
#![allow(clippy::duplicate_mod)]
#![deny(clippy::undocumented_unsafe_blocks)]
use bytes::Bytes;
use utils::bin_ser::SerializeError;
@@ -20,6 +21,7 @@ macro_rules! postgres_ffi {
pub mod bindings {
// bindgen generates bindings for a lot of stuff we don't need
#![allow(dead_code)]
#![allow(clippy::undocumented_unsafe_blocks)]
use serde::{Deserialize, Serialize};
include!(concat!(


@@ -1,6 +1,7 @@
//! Postgres protocol messages serialization-deserialization. See
//! <https://www.postgresql.org/docs/devel/protocol-message-formats.html>
//! on message formats.
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod framed;
@@ -288,10 +289,10 @@ impl FeStartupPacket {
// We shouldn't advance `buf` because the full message is probably not there yet,
// so we can't use Bytes::get_u32 etc. directly.
let len = (&buf[0..4]).read_u32::<BigEndian>().unwrap() as usize;
// The proposed replacement is `!(4..=MAX_STARTUP_PACKET_LENGTH).contains(&len)`
// The proposed replacement is `!(8..=MAX_STARTUP_PACKET_LENGTH).contains(&len)`
// which is less readable
#[allow(clippy::manual_range_contains)]
if len < 4 || len > MAX_STARTUP_PACKET_LENGTH {
if len < 8 || len > MAX_STARTUP_PACKET_LENGTH {
return Err(ProtocolError::Protocol(format!(
"invalid startup packet message length {}",
len
@@ -974,4 +975,10 @@ mod tests {
let params = make_params("foo\\ bar \\ \\\\ baz\\ lol");
assert_eq!(split_options(&params), ["foo bar", " \\", "baz ", "lol"]);
}
#[test]
fn parse_fe_startup_packet_regression() {
let data = [0, 0, 0, 7, 0, 0, 0, 0];
FeStartupPacket::parse(&mut BytesMut::from_iter(data)).unwrap_err();
}
}
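
The bumped lower bound reflects the wire format: the 4-byte length field is followed by at least a 4-byte protocol/request code, so any startup-family packet shorter than 8 bytes is malformed (the regression test above sends a declared length of 7). A hedged companion sketch of a minimal well-formed packet, using the SSLRequest magic code:

```rust
use bytes::{BufMut, BytesMut};

// Sketch: the shortest valid startup-family packet is 8 bytes.
let mut buf = BytesMut::new();
buf.put_u32(8);        // total length, including the length field itself
buf.put_u32(80877103); // SSLRequest magic code (1234 << 16 | 5679)
assert!(FeStartupPacket::parse(&mut buf).is_ok());
```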


@@ -8,18 +8,19 @@ license.workspace = true
anyhow.workspace = true
async-trait.workspace = true
once_cell.workspace = true
aws-smithy-http.workspace = true
aws-types.workspace = true
aws-smithy-async.workspace = true
aws-smithy-types.workspace = true
aws-config.workspace = true
aws-sdk-s3.workspace = true
aws-credential-types.workspace = true
bytes.workspace = true
camino.workspace = true
hyper = { workspace = true, features = ["stream"] }
futures.workspace = true
serde.workspace = true
serde_json.workspace = true
tokio = { workspace = true, features = ["sync", "fs", "io-util"] }
tokio-util.workspace = true
tokio-util = { workspace = true, features = ["compat"] }
toml_edit.workspace = true
tracing.workspace = true
scopeguard.workspace = true


@@ -1,24 +1,24 @@
//! Azure Blob Storage wrapper
use std::borrow::Cow;
use std::collections::HashMap;
use std::env;
use std::num::NonZeroU32;
use std::pin::Pin;
use std::sync::Arc;
use std::{borrow::Cow, collections::HashMap, io::Cursor};
use super::REMOTE_STORAGE_PREFIX_SEPARATOR;
use anyhow::Result;
use azure_core::request_options::{MaxResults, Metadata, Range};
use azure_core::Header;
use azure_core::RetryOptions;
use azure_identity::DefaultAzureCredential;
use azure_storage::StorageCredentials;
use azure_storage_blobs::prelude::ClientBuilder;
use azure_storage_blobs::{
blob::operations::GetBlobBuilder,
prelude::{BlobClient, ContainerClient},
};
use azure_storage_blobs::{blob::operations::GetBlobBuilder, prelude::ContainerClient};
use bytes::Bytes;
use futures::stream::Stream;
use futures_util::StreamExt;
use http_types::StatusCode;
use tokio::io::AsyncRead;
use tracing::debug;
use crate::s3_bucket::RequestKind;
@@ -52,7 +52,8 @@ impl AzureBlobStorage {
StorageCredentials::token_credential(Arc::new(token_credential))
};
let builder = ClientBuilder::new(account, credentials);
// we have an outer retry
let builder = ClientBuilder::new(account, credentials).retry(RetryOptions::none());
let client = builder.container_client(azure_config.container_name.to_owned());
@@ -112,47 +113,32 @@ impl AzureBlobStorage {
async fn download_for_builder(
&self,
metadata: StorageMetadata,
builder: GetBlobBuilder,
) -> Result<Download, DownloadError> {
let mut response = builder.into_stream();
let mut metadata = HashMap::new();
// TODO give proper streaming response instead of buffering into RAM
// https://github.com/neondatabase/neon/issues/5563
let mut buf = Vec::new();
let mut bufs = Vec::new();
while let Some(part) = response.next().await {
let part = part.map_err(to_download_error)?;
if let Some(blob_meta) = part.blob.metadata {
metadata.extend(blob_meta.iter().map(|(k, v)| (k.to_owned(), v.to_owned())));
}
let data = part
.data
.collect()
.await
.map_err(|e| DownloadError::Other(e.into()))?;
buf.extend_from_slice(&data.slice(..));
bufs.push(data);
}
Ok(Download {
download_stream: Box::pin(Cursor::new(buf)),
metadata: Some(metadata),
download_stream: Box::pin(futures::stream::iter(bufs.into_iter().map(Ok))),
metadata: Some(StorageMetadata(metadata)),
})
}
// TODO get rid of this function once we have metadata included in the response
// https://github.com/Azure/azure-sdk-for-rust/issues/1439
async fn get_metadata(
&self,
blob_client: &BlobClient,
) -> Result<StorageMetadata, DownloadError> {
let builder = blob_client.get_metadata();
let response = builder.into_future().await.map_err(to_download_error)?;
let mut map = HashMap::new();
for md in response.metadata.iter() {
map.insert(
md.name().as_str().to_string(),
md.value().as_str().to_string(),
);
}
Ok(StorageMetadata(map))
}
async fn permit(&self, kind: RequestKind) -> tokio::sync::SemaphorePermit<'_> {
self.concurrency_limiter
@@ -236,9 +222,10 @@ impl RemoteStorage for AzureBlobStorage {
}
Ok(res)
}
async fn upload(
&self,
mut from: impl AsyncRead + Unpin + Send + Sync + 'static,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
@@ -246,13 +233,12 @@ impl RemoteStorage for AzureBlobStorage {
let _permit = self.permit(RequestKind::Put).await;
let blob_client = self.client.blob_client(self.relative_path_to_name(to));
// TODO FIX THIS UGLY HACK and don't buffer the entire object
// into RAM here, but use the streaming interface. For that,
// we'd have to change the interface though...
// https://github.com/neondatabase/neon/issues/5563
let mut buf = Vec::with_capacity(data_size_bytes);
tokio::io::copy(&mut from, &mut buf).await?;
let body = azure_core::Body::Bytes(buf.into());
let from: Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static>> =
Box::pin(from);
let from = NonSeekableStream::new(from, data_size_bytes);
let body = azure_core::Body::SeekableStream(Box::new(from));
let mut builder = blob_client.put_block_blob(body);
@@ -269,11 +255,9 @@ impl RemoteStorage for AzureBlobStorage {
let _permit = self.permit(RequestKind::Get).await;
let blob_client = self.client.blob_client(self.relative_path_to_name(from));
let metadata = self.get_metadata(&blob_client).await?;
let builder = blob_client.get();
self.download_for_builder(metadata, builder).await
self.download_for_builder(builder).await
}
async fn download_byte_range(
@@ -285,23 +269,16 @@ impl RemoteStorage for AzureBlobStorage {
let _permit = self.permit(RequestKind::Get).await;
let blob_client = self.client.blob_client(self.relative_path_to_name(from));
let metadata = self.get_metadata(&blob_client).await?;
let mut builder = blob_client.get();
if let Some(end_exclusive) = end_exclusive {
builder = builder.range(Range::new(start_inclusive, end_exclusive));
let range: Range = if let Some(end_exclusive) = end_exclusive {
(start_inclusive..end_exclusive).into()
} else {
// Open ranges are not supported by the SDK, so we work around that
// by setting the upper limit extremely high (while keeping it low enough
// to still be representable by signed 64 bit integers).
// TODO remove workaround once the SDK adds open range support
// https://github.com/Azure/azure-sdk-for-rust/issues/1438
let end_exclusive = u64::MAX / 4;
builder = builder.range(Range::new(start_inclusive, end_exclusive));
}
(start_inclusive..).into()
};
builder = builder.range(range);
self.download_for_builder(metadata, builder).await
self.download_for_builder(builder).await
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
@@ -335,3 +312,153 @@ impl RemoteStorage for AzureBlobStorage {
Ok(())
}
}
pin_project_lite::pin_project! {
/// Hack to work around not being able to stream once with azure sdk.
///
/// The Azure sdk clones streams around with the assumption that they are like
/// `Arc<tokio::fs::File>` (except not supporting tokio); however, our streams are not like
/// that. For example, for an `index_part.json` we just have a single chunk of [`Bytes`]
/// representing the whole serialized vec. It could be made trivially cloneable and
/// "semi-trivially" seekable, but it is easier to just retry the request.
#[project = NonSeekableStreamProj]
enum NonSeekableStream<S> {
/// A stream wrapper's initial form.
///
/// Mutex exists to allow moving when cloning. If the sdk changes to do less than 1
/// clone before first request, then this must be changed.
Initial {
inner: std::sync::Mutex<Option<tokio_util::compat::Compat<tokio_util::io::StreamReader<S, Bytes>>>>,
len: usize,
},
/// The actually readable variant, produced by cloning the Initial variant.
///
/// The sdk currently always clones once, even without retry policy.
Actual {
#[pin]
inner: tokio_util::compat::Compat<tokio_util::io::StreamReader<S, Bytes>>,
len: usize,
read_any: bool,
},
/// Most likely unneeded, but left to make life easier, in case more clones are added.
Cloned {
len_was: usize,
}
}
}
impl<S> NonSeekableStream<S>
where
S: Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
{
fn new(inner: S, len: usize) -> NonSeekableStream<S> {
use tokio_util::compat::TokioAsyncReadCompatExt;
let inner = tokio_util::io::StreamReader::new(inner).compat();
let inner = Some(inner);
let inner = std::sync::Mutex::new(inner);
NonSeekableStream::Initial { inner, len }
}
}
impl<S> std::fmt::Debug for NonSeekableStream<S> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::Initial { len, .. } => f.debug_struct("Initial").field("len", len).finish(),
Self::Actual { len, .. } => f.debug_struct("Actual").field("len", len).finish(),
Self::Cloned { len_was, .. } => f.debug_struct("Cloned").field("len", len_was).finish(),
}
}
}
impl<S> futures::io::AsyncRead for NonSeekableStream<S>
where
S: Stream<Item = std::io::Result<Bytes>>,
{
fn poll_read(
self: std::pin::Pin<&mut Self>,
cx: &mut std::task::Context<'_>,
buf: &mut [u8],
) -> std::task::Poll<std::io::Result<usize>> {
match self.project() {
NonSeekableStreamProj::Actual {
inner, read_any, ..
} => {
*read_any = true;
inner.poll_read(cx, buf)
}
// NonSeekableStream::Initial does not support reading because it is just much easier
// to have the mutex in place where one does not poll the contents, or that's how it
// seemed originally. If there is a version upgrade which changes the cloning, then
// that support needs to be hacked in.
//
// Including {self:?} in the message would be useful, but it is unclear how to unproject here.
_ => std::task::Poll::Ready(Err(std::io::Error::new(
std::io::ErrorKind::Other,
"cloned or initial values cannot be read",
))),
}
}
}
impl<S> Clone for NonSeekableStream<S> {
/// Weird clone implementation exists to support the sdk doing cloning before issuing the first
/// request, see type documentation.
fn clone(&self) -> Self {
use NonSeekableStream::*;
match self {
Initial { inner, len } => {
if let Some(inner) = inner.lock().unwrap().take() {
Actual {
inner,
len: *len,
read_any: false,
}
} else {
Self::Cloned { len_was: *len }
}
}
Actual { len, .. } => Cloned { len_was: *len },
Cloned { len_was } => Cloned { len_was: *len_was },
}
}
}
#[async_trait::async_trait]
impl<S> azure_core::SeekableStream for NonSeekableStream<S>
where
S: Stream<Item = std::io::Result<Bytes>> + Unpin + Send + Sync + 'static,
{
async fn reset(&mut self) -> azure_core::error::Result<()> {
use NonSeekableStream::*;
let msg = match self {
Initial { inner, .. } => {
if inner.get_mut().unwrap().is_some() {
return Ok(());
} else {
"reset after first clone is not supported"
}
}
Actual { read_any, .. } if !*read_any => return Ok(()),
Actual { .. } => "reset after reading is not supported",
Cloned { .. } => "reset after second clone is not supported",
};
Err(azure_core::error::Error::new(
azure_core::error::ErrorKind::Io,
std::io::Error::new(std::io::ErrorKind::Other, msg),
))
}
// Note: it is not documented whether this should be the total or the remaining length;
// the total length passes the tests.
fn len(&self) -> usize {
use NonSeekableStream::*;
match self {
Initial { len, .. } => *len,
Actual { len, .. } => *len,
Cloned { len_was, .. } => *len_was,
}
}
}
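
A test-style sketch of the intended clone lifecycle documented above (it assumes it sits inside this module so the private type is visible, and that the sdk clones the stream exactly once before the first request): the first clone is the only readable one, further clones become the inert `Cloned` variant, and `reset` only succeeds while nothing has been consumed.

```rust
use azure_core::SeekableStream as _; // bring len()/reset() into scope for the sketch

let body = futures::stream::once(futures::future::ready(
    Ok::<_, std::io::Error>(bytes::Bytes::from_static(b"payload")),
));
let initial = NonSeekableStream::new(body, 7);
let mut actual = initial.clone(); // Initial -> Actual: the one readable clone
let extra = initial.clone();      // any further clone is the inert Cloned variant
assert_eq!(initial.len(), 7);
assert_eq!(extra.len(), 7);
assert!(actual.reset().await.is_ok()); // reset is allowed before any bytes were read
```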


@@ -6,25 +6,23 @@
//! * [`s3_bucket`] uses an AWS S3 bucket as an external storage
//! * [`azure_blob`] allows using Azure Blob storage as an external storage
//!
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
mod azure_blob;
mod local_fs;
mod s3_bucket;
mod simulate_failures;
use std::{
collections::HashMap,
fmt::Debug,
num::{NonZeroU32, NonZeroUsize},
pin::Pin,
sync::Arc,
};
use std::{collections::HashMap, fmt::Debug, num::NonZeroUsize, pin::Pin, sync::Arc};
use anyhow::{bail, Context};
use camino::{Utf8Path, Utf8PathBuf};
use bytes::Bytes;
use futures::stream::Stream;
use serde::{Deserialize, Serialize};
use tokio::{io, sync::Semaphore};
use tokio::sync::Semaphore;
use toml_edit::Item;
use tracing::info;
@@ -34,12 +32,6 @@ pub use self::{
};
use s3_bucket::RequestKind;
/// How many different timelines can be processed simultaneously when synchronizing layers with the remote storage.
/// During regular work, pageserver produces one layer file per timeline checkpoint, with bursts of concurrency
/// during start (where local and remote timelines are compared and initial sync tasks are scheduled) and timeline attach.
/// Both cases may trigger timeline download, that might download a lot of layers. This concurrency is limited by the clients internally, if needed.
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS: usize = 50;
pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10;
/// Currently, sync happens with AWS S3, which has two limits on requests per second:
/// ~200 RPS for IAM services
/// <https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.IAMDBAuth.html>
@@ -189,7 +181,7 @@ pub trait RemoteStorage: Send + Sync + 'static {
/// Streams the local file contents into the remote storage entry.
async fn upload(
&self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
// S3 PUT request requires the content length to be specified,
// otherwise it starts to fail with the concurrent connection count increasing.
data_size_bytes: usize,
@@ -216,7 +208,7 @@ pub trait RemoteStorage: Send + Sync + 'static {
}
pub struct Download {
pub download_stream: Pin<Box<dyn io::AsyncRead + Unpin + Send + Sync>>,
pub download_stream: Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Unpin + Send + Sync>>,
/// Extra key-value data, associated with the current remote file.
pub metadata: Option<StorageMetadata>,
}
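
Both sides of the API now speak `Stream<Item = std::io::Result<Bytes>>` instead of `AsyncRead`. A hedged sketch of the adaptation callers need (using `tokio_util`, which this change enables with the `compat` feature): wrap an `AsyncRead` in `ReaderStream` for uploads, and wrap a download stream in `StreamReader` to get an `AsyncRead` back. The `storage`, `remote_path`, and file path values below are illustrative assumptions.

```rust
use tokio_util::io::{ReaderStream, StreamReader};

// Upload side: any AsyncRead (here a file, path is illustrative) becomes a byte stream.
let file = tokio::fs::File::open("layer.tmp").await?;
let size = file.metadata().await?.len() as usize;
storage.upload(ReaderStream::new(file), size, &remote_path, None).await?;

// Download side: turn the chunk stream back into an AsyncRead and copy it out.
let download = storage.download(&remote_path).await?;
let mut reader = StreamReader::new(download.download_stream);
let mut buf = Vec::new();
tokio::io::copy_buf(&mut reader, &mut buf).await?;
```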
@@ -310,7 +302,7 @@ impl GenericRemoteStorage {
pub async fn upload(
&self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
@@ -408,7 +400,7 @@ impl GenericRemoteStorage {
/// this path is used for the remote object id conversion only.
pub async fn upload_storage_object(
&self,
from: impl tokio::io::AsyncRead + Unpin + Send + Sync + 'static,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
from_size_bytes: usize,
to: &RemotePath,
) -> anyhow::Result<()> {
@@ -441,10 +433,6 @@ pub struct StorageMetadata(HashMap<String, String>);
/// External backup storage configuration, enough for creating a client for that storage.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemoteStorageConfig {
/// Max allowed number of concurrent sync operations between the API user and the remote storage.
pub max_concurrent_syncs: NonZeroUsize,
/// Max allowed errors before the sync task is considered failed and evicted.
pub max_sync_errors: NonZeroU32,
/// The storage connection configuration.
pub storage: RemoteStorageKind,
}
@@ -540,18 +528,6 @@ impl RemoteStorageConfig {
let use_azure = container_name.is_some() && container_region.is_some();
let max_concurrent_syncs = NonZeroUsize::new(
parse_optional_integer("max_concurrent_syncs", toml)?
.unwrap_or(DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS),
)
.context("Failed to parse 'max_concurrent_syncs' as a positive integer")?;
let max_sync_errors = NonZeroU32::new(
parse_optional_integer("max_sync_errors", toml)?
.unwrap_or(DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS),
)
.context("Failed to parse 'max_sync_errors' as a positive integer")?;
let default_concurrency_limit = if use_azure {
DEFAULT_REMOTE_STORAGE_AZURE_CONCURRENCY_LIMIT
} else {
@@ -633,11 +609,7 @@ impl RemoteStorageConfig {
}
};
Ok(Some(RemoteStorageConfig {
max_concurrent_syncs,
max_sync_errors,
storage,
}))
Ok(Some(RemoteStorageConfig { storage }))
}
}


@@ -7,11 +7,14 @@
use std::{borrow::Cow, future::Future, io::ErrorKind, pin::Pin};
use anyhow::{bail, ensure, Context};
use bytes::Bytes;
use camino::{Utf8Path, Utf8PathBuf};
use futures::stream::Stream;
use tokio::{
fs,
io::{self, AsyncReadExt, AsyncSeekExt, AsyncWriteExt},
};
use tokio_util::io::ReaderStream;
use tracing::*;
use utils::{crashsafe::path_with_suffix_extension, fs_ext::is_directory_empty};
@@ -99,27 +102,35 @@ impl LocalFs {
};
// If we were given a directory, we may use it as our starting point.
// Otherwise, we must go up to the parent directory. This is because
// Otherwise, we must go up to the first ancestor dir that exists. This is because
// S3 object list prefixes can be arbitrary strings, but when reading
// the local filesystem we need a directory to start calling read_dir on.
let mut initial_dir = full_path.clone();
match fs::metadata(full_path.clone()).await {
Ok(meta) => {
if !meta.is_dir() {
loop {
// Did we make it to the root?
if initial_dir.parent().is_none() {
anyhow::bail!("list_files: failed to find valid ancestor dir for {full_path}");
}
match fs::metadata(initial_dir.clone()).await {
Ok(meta) if meta.is_dir() => {
// We found a directory, break
break;
}
Ok(_meta) => {
// It's not a directory: strip back to the parent
initial_dir.pop();
}
}
Err(e) if e.kind() == ErrorKind::NotFound => {
// It's not a file that exists: strip the prefix back to the parent directory
initial_dir.pop();
}
Err(e) => {
// Unexpected I/O error
anyhow::bail!(e)
Err(e) if e.kind() == ErrorKind::NotFound => {
// It's not a file that exists: strip the prefix back to the parent directory
initial_dir.pop();
}
Err(e) => {
// Unexpected I/O error
anyhow::bail!(e)
}
}
}
// Note that Utf8PathBuf starts_with only considers full path segments, but
// object prefixes are arbitrary strings, so we need the strings for doing
// starts_with later.
@@ -211,7 +222,7 @@ impl RemoteStorage for LocalFs {
async fn upload(
&self,
data: impl io::AsyncRead + Unpin + Send + Sync + 'static,
data: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync,
data_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
@@ -244,9 +255,12 @@ impl RemoteStorage for LocalFs {
);
let from_size_bytes = data_size_bytes as u64;
let data = tokio_util::io::StreamReader::new(data);
let data = std::pin::pin!(data);
let mut buffer_to_read = data.take(from_size_bytes);
let bytes_read = io::copy(&mut buffer_to_read, &mut destination)
// alternatively we could just write the bytes to a file, but local_fs is a testing utility
let bytes_read = io::copy_buf(&mut buffer_to_read, &mut destination)
.await
.with_context(|| {
format!(
@@ -300,7 +314,7 @@ impl RemoteStorage for LocalFs {
async fn download(&self, from: &RemotePath) -> Result<Download, DownloadError> {
let target_path = from.with_base(&self.storage_root);
if file_exists(&target_path).map_err(DownloadError::BadInput)? {
let source = io::BufReader::new(
let source = ReaderStream::new(
fs::OpenOptions::new()
.read(true)
.open(&target_path)
@@ -340,16 +354,14 @@ impl RemoteStorage for LocalFs {
}
let target_path = from.with_base(&self.storage_root);
if file_exists(&target_path).map_err(DownloadError::BadInput)? {
let mut source = io::BufReader::new(
fs::OpenOptions::new()
.read(true)
.open(&target_path)
.await
.with_context(|| {
format!("Failed to open source file {target_path:?} to use in the download")
})
.map_err(DownloadError::Other)?,
);
let mut source = tokio::fs::OpenOptions::new()
.read(true)
.open(&target_path)
.await
.with_context(|| {
format!("Failed to open source file {target_path:?} to use in the download")
})
.map_err(DownloadError::Other)?;
source
.seek(io::SeekFrom::Start(start_inclusive))
.await
@@ -363,11 +375,13 @@ impl RemoteStorage for LocalFs {
Ok(match end_exclusive {
Some(end_exclusive) => Download {
metadata,
download_stream: Box::pin(source.take(end_exclusive - start_inclusive)),
download_stream: Box::pin(ReaderStream::new(
source.take(end_exclusive - start_inclusive),
)),
},
None => Download {
metadata,
download_stream: Box::pin(source),
download_stream: Box::pin(ReaderStream::new(source)),
},
})
} else {
@@ -467,7 +481,9 @@ fn file_exists(file_path: &Utf8Path) -> anyhow::Result<bool> {
mod fs_tests {
use super::*;
use bytes::Bytes;
use camino_tempfile::tempdir;
use futures_util::Stream;
use std::{collections::HashMap, io::Write};
async fn read_and_assert_remote_file_contents(
@@ -477,7 +493,7 @@ mod fs_tests {
remote_storage_path: &RemotePath,
expected_metadata: Option<&StorageMetadata>,
) -> anyhow::Result<String> {
let mut download = storage
let download = storage
.download(remote_storage_path)
.await
.map_err(|e| anyhow::anyhow!("Download failed: {e}"))?;
@@ -486,13 +502,9 @@ mod fs_tests {
"Unexpected metadata returned for the downloaded file"
);
let mut contents = String::new();
download
.download_stream
.read_to_string(&mut contents)
.await
.context("Failed to read remote file contents into string")?;
Ok(contents)
let contents = aggregate(download.download_stream).await?;
String::from_utf8(contents).map_err(anyhow::Error::new)
}
#[tokio::test]
@@ -521,25 +533,26 @@ mod fs_tests {
let storage = create_storage()?;
let id = RemotePath::new(Utf8Path::new("dummy"))?;
let content = std::io::Cursor::new(b"12345");
let content = Bytes::from_static(b"12345");
let content = move || futures::stream::once(futures::future::ready(Ok(content.clone())));
// Check that you get an error if the size parameter doesn't match the actual
// size of the stream.
storage
.upload(Box::new(content.clone()), 0, &id, None)
.upload(content(), 0, &id, None)
.await
.expect_err("upload with zero size succeeded");
storage
.upload(Box::new(content.clone()), 4, &id, None)
.upload(content(), 4, &id, None)
.await
.expect_err("upload with too short size succeeded");
storage
.upload(Box::new(content.clone()), 6, &id, None)
.upload(content(), 6, &id, None)
.await
.expect_err("upload with too large size succeeded");
// Correct size is 5, this should succeed.
storage.upload(Box::new(content), 5, &id, None).await?;
storage.upload(content(), 5, &id, None).await?;
Ok(())
}
@@ -587,7 +600,7 @@ mod fs_tests {
let uploaded_bytes = dummy_contents(upload_name).into_bytes();
let (first_part_local, second_part_local) = uploaded_bytes.split_at(3);
let mut first_part_download = storage
let first_part_download = storage
.download_byte_range(&upload_target, 0, Some(first_part_local.len() as u64))
.await?;
assert!(
@@ -595,21 +608,13 @@ mod fs_tests {
"No metadata should be returned for no metadata upload"
);
let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
io::copy(
&mut first_part_download.download_stream,
&mut first_part_remote,
)
.await?;
first_part_remote.flush().await?;
let first_part_remote = first_part_remote.into_inner().into_inner();
let first_part_remote = aggregate(first_part_download.download_stream).await?;
assert_eq!(
first_part_local,
first_part_remote.as_slice(),
first_part_local, first_part_remote,
"First part bytes should be returned when requested"
);
let mut second_part_download = storage
let second_part_download = storage
.download_byte_range(
&upload_target,
first_part_local.len() as u64,
@@ -621,17 +626,9 @@ mod fs_tests {
"No metadata should be returned for no metadata upload"
);
let mut second_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
io::copy(
&mut second_part_download.download_stream,
&mut second_part_remote,
)
.await?;
second_part_remote.flush().await?;
let second_part_remote = second_part_remote.into_inner().into_inner();
let second_part_remote = aggregate(second_part_download.download_stream).await?;
assert_eq!(
second_part_local,
second_part_remote.as_slice(),
second_part_local, second_part_remote,
"Second part bytes should be returned when requested"
);
@@ -721,17 +718,10 @@ mod fs_tests {
let uploaded_bytes = dummy_contents(upload_name).into_bytes();
let (first_part_local, _) = uploaded_bytes.split_at(3);
let mut partial_download_with_metadata = storage
let partial_download_with_metadata = storage
.download_byte_range(&upload_target, 0, Some(first_part_local.len() as u64))
.await?;
let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
io::copy(
&mut partial_download_with_metadata.download_stream,
&mut first_part_remote,
)
.await?;
first_part_remote.flush().await?;
let first_part_remote = first_part_remote.into_inner().into_inner();
let first_part_remote = aggregate(partial_download_with_metadata.download_stream).await?;
assert_eq!(
first_part_local,
first_part_remote.as_slice(),
@@ -807,16 +797,16 @@ mod fs_tests {
)
})?;
storage
.upload(Box::new(file), size, &relative_path, metadata)
.await?;
let file = tokio_util::io::ReaderStream::new(file);
storage.upload(file, size, &relative_path, metadata).await?;
Ok(relative_path)
}
async fn create_file_for_upload(
path: &Utf8Path,
contents: &str,
) -> anyhow::Result<(io::BufReader<fs::File>, usize)> {
) -> anyhow::Result<(fs::File, usize)> {
std::fs::create_dir_all(path.parent().unwrap())?;
let mut file_for_writing = std::fs::OpenOptions::new()
.write(true)
@@ -826,7 +816,7 @@ mod fs_tests {
drop(file_for_writing);
let file_size = path.metadata()?.len() as usize;
Ok((
io::BufReader::new(fs::OpenOptions::new().read(true).open(&path).await?),
fs::OpenOptions::new().read(true).open(&path).await?,
file_size,
))
}
@@ -840,4 +830,16 @@ mod fs_tests {
files.sort_by(|a, b| a.0.cmp(&b.0));
Ok(files)
}
async fn aggregate(
stream: impl Stream<Item = std::io::Result<Bytes>>,
) -> anyhow::Result<Vec<u8>> {
use futures::stream::StreamExt;
let mut out = Vec::new();
let mut stream = std::pin::pin!(stream);
while let Some(res) = stream.next().await {
out.extend_from_slice(&res?[..]);
}
Ok(out)
}
}


@@ -4,29 +4,39 @@
//! allowing multiple api users to independently work with the same S3 bucket, if
//! their bucket prefixes are both specified and different.
use std::borrow::Cow;
use std::{
borrow::Cow,
pin::Pin,
sync::Arc,
task::{Context, Poll},
};
use anyhow::Context;
use anyhow::Context as _;
use aws_config::{
environment::credentials::EnvironmentVariableCredentialsProvider,
imds::credentials::ImdsCredentialsProvider, meta::credentials::CredentialsProviderChain,
provider_config::ProviderConfig, web_identity_token::WebIdentityTokenCredentialsProvider,
imds::credentials::ImdsCredentialsProvider,
meta::credentials::CredentialsProviderChain,
provider_config::ProviderConfig,
retry::{RetryConfigBuilder, RetryMode},
web_identity_token::WebIdentityTokenCredentialsProvider,
BehaviorVersion,
};
use aws_credential_types::cache::CredentialsCache;
use aws_credential_types::provider::SharedCredentialsProvider;
use aws_sdk_s3::{
config::{Config, Region},
config::{AsyncSleep, Builder, IdentityCache, Region, SharedAsyncSleep},
error::SdkError,
operation::get_object::GetObjectError,
primitives::ByteStream,
types::{Delete, ObjectIdentifier},
Client,
};
use aws_smithy_http::body::SdkBody;
use aws_smithy_async::rt::sleep::TokioSleep;
use aws_smithy_types::body::SdkBody;
use aws_smithy_types::byte_stream::ByteStream;
use bytes::Bytes;
use futures::stream::Stream;
use hyper::Body;
use scopeguard::ScopeGuard;
use tokio::io::{self, AsyncRead};
use tokio_util::io::ReaderStream;
use tracing::debug;
use super::StorageMetadata;
use crate::{
@@ -57,7 +67,7 @@ struct GetObjectRequest {
impl S3Bucket {
/// Creates the S3 storage, errors if incorrect AWS S3 configuration provided.
pub fn new(aws_config: &S3Config) -> anyhow::Result<Self> {
debug!(
tracing::debug!(
"Creating s3 remote storage for S3 bucket {}",
aws_config.bucket_name
);
@@ -74,7 +84,6 @@ impl S3Bucket {
// needed to access remote extensions bucket
.or_else("token", {
let provider_conf = ProviderConfig::without_region().with_region(region.clone());
WebIdentityTokenCredentialsProvider::builder()
.configure(&provider_conf)
.build()
@@ -83,16 +92,31 @@ impl S3Bucket {
.or_else("imds", ImdsCredentialsProvider::builder().build())
};
let mut config_builder = Config::builder()
// AWS SDK requires us to specify how the RetryConfig should sleep when it wants to back off
let sleep_impl: Arc<dyn AsyncSleep> = Arc::new(TokioSleep::new());
// We do our own retries (see [`backoff::retry`]). However, for the AWS SDK to enable rate limiting in response to throttling
// responses (e.g. 429 on too many ListObjectsv2 requests), we must provide a retry config. We set it to use at most one
// attempt, and enable 'Adaptive' mode, which causes rate limiting to be enabled.
let mut retry_config = RetryConfigBuilder::new();
retry_config
.set_max_attempts(Some(1))
.set_mode(Some(RetryMode::Adaptive));
let mut config_builder = Builder::default()
.behavior_version(BehaviorVersion::v2023_11_09())
.region(region)
.credentials_cache(CredentialsCache::lazy())
.credentials_provider(credentials_provider);
.identity_cache(IdentityCache::lazy().build())
.credentials_provider(SharedCredentialsProvider::new(credentials_provider))
.retry_config(retry_config.build())
.sleep_impl(SharedAsyncSleep::from(sleep_impl));
if let Some(custom_endpoint) = aws_config.endpoint.clone() {
config_builder = config_builder
.endpoint_url(custom_endpoint)
.force_path_style(true);
}
let client = Client::from_conf(config_builder.build());
let prefix_in_bucket = aws_config.prefix_in_bucket.as_deref().map(|prefix| {
@@ -205,12 +229,15 @@ impl S3Bucket {
match get_object {
Ok(object_output) => {
let metadata = object_output.metadata().cloned().map(StorageMetadata);
let body = object_output.body;
let body = ByteStreamAsStream::from(body);
let body = PermitCarrying::new(permit, body);
let body = TimedDownload::new(started_at, body);
Ok(Download {
metadata,
download_stream: Box::pin(io::BufReader::new(TimedDownload::new(
started_at,
RatelimitedAsyncRead::new(permit, object_output.body.into_async_read()),
))),
download_stream: Box::pin(body),
})
}
Err(SdkError::ServiceError(e)) if matches!(e.err(), GetObjectError::NoSuchKey(_)) => {
@@ -223,29 +250,55 @@ impl S3Bucket {
}
}
pin_project_lite::pin_project! {
struct ByteStreamAsStream {
#[pin]
inner: aws_smithy_types::byte_stream::ByteStream
}
}
impl From<aws_smithy_types::byte_stream::ByteStream> for ByteStreamAsStream {
fn from(inner: aws_smithy_types::byte_stream::ByteStream) -> Self {
ByteStreamAsStream { inner }
}
}
impl Stream for ByteStreamAsStream {
type Item = std::io::Result<Bytes>;
fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
// this does the std::io::ErrorKind::Other conversion
self.project().inner.poll_next(cx).map_err(|x| x.into())
}
// We cannot implement size_hint: inner.size_hint() reports the remaining size in bytes,
// which makes sense for a ByteStream, whereas Stream::size_hint counts items.
}
pin_project_lite::pin_project! {
/// An `AsyncRead` adapter which carries a permit for the lifetime of the value.
struct RatelimitedAsyncRead<S> {
struct PermitCarrying<S> {
permit: tokio::sync::OwnedSemaphorePermit,
#[pin]
inner: S,
}
}
impl<S: AsyncRead> RatelimitedAsyncRead<S> {
impl<S> PermitCarrying<S> {
fn new(permit: tokio::sync::OwnedSemaphorePermit, inner: S) -> Self {
RatelimitedAsyncRead { permit, inner }
Self { permit, inner }
}
}
impl<S: AsyncRead> AsyncRead for RatelimitedAsyncRead<S> {
fn poll_read(
self: std::pin::Pin<&mut Self>,
cx: &mut std::task::Context<'_>,
buf: &mut io::ReadBuf<'_>,
) -> std::task::Poll<std::io::Result<()>> {
let this = self.project();
this.inner.poll_read(cx, buf)
impl<S: Stream<Item = std::io::Result<Bytes>>> Stream for PermitCarrying<S> {
type Item = <S as Stream>::Item;
fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
self.project().inner.poll_next(cx)
}
fn size_hint(&self) -> (usize, Option<usize>) {
self.inner.size_hint()
}
}
@@ -265,7 +318,7 @@ pin_project_lite::pin_project! {
}
}
impl<S: AsyncRead> TimedDownload<S> {
impl<S> TimedDownload<S> {
fn new(started_at: std::time::Instant, inner: S) -> Self {
TimedDownload {
started_at,
@@ -275,25 +328,26 @@ impl<S: AsyncRead> TimedDownload<S> {
}
}
impl<S: AsyncRead> AsyncRead for TimedDownload<S> {
fn poll_read(
self: std::pin::Pin<&mut Self>,
cx: &mut std::task::Context<'_>,
buf: &mut io::ReadBuf<'_>,
) -> std::task::Poll<std::io::Result<()>> {
impl<S: Stream<Item = std::io::Result<Bytes>>> Stream for TimedDownload<S> {
type Item = <S as Stream>::Item;
fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
use std::task::ready;
let this = self.project();
let before = buf.filled().len();
let read = std::task::ready!(this.inner.poll_read(cx, buf));
let read_eof = buf.filled().len() == before;
match read {
Ok(()) if read_eof => *this.outcome = AttemptOutcome::Ok,
Ok(()) => { /* still in progress */ }
Err(_) => *this.outcome = AttemptOutcome::Err,
let res = ready!(this.inner.poll_next(cx));
match &res {
Some(Ok(_)) => {}
Some(Err(_)) => *this.outcome = metrics::AttemptOutcome::Err,
None => *this.outcome = metrics::AttemptOutcome::Ok,
}
std::task::Poll::Ready(read)
Poll::Ready(res)
}
fn size_hint(&self) -> (usize, Option<usize>) {
self.inner.size_hint()
}
}
@@ -354,11 +408,11 @@ impl RemoteStorage for S3Bucket {
let response = response?;
let keys = response.contents().unwrap_or_default();
let keys = response.contents();
let empty = Vec::new();
let prefixes = response.common_prefixes.as_ref().unwrap_or(&empty);
tracing::info!("list: {} prefixes, {} keys", prefixes.len(), keys.len());
tracing::debug!("list: {} prefixes, {} keys", prefixes.len(), keys.len());
for object in keys {
let object_path = object.key().expect("response does not contain a key");
@@ -383,7 +437,7 @@ impl RemoteStorage for S3Bucket {
async fn upload(
&self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
from_size_bytes: usize,
to: &RemotePath,
metadata: Option<StorageMetadata>,
@@ -393,8 +447,8 @@ impl RemoteStorage for S3Bucket {
let started_at = start_measuring_requests(kind);
let body = Body::wrap_stream(ReaderStream::new(from));
let bytes_stream = ByteStream::new(SdkBody::from(body));
let body = Body::wrap_stream(from);
let bytes_stream = ByteStream::new(SdkBody::from_body_0_4(body));
let res = self
.client
@@ -457,7 +511,7 @@ impl RemoteStorage for S3Bucket {
for path in paths {
let obj_id = ObjectIdentifier::builder()
.set_key(Some(self.relative_path_to_s3_object(path)))
.build();
.build()?;
delete_objects.push(obj_id);
}
@@ -468,7 +522,11 @@ impl RemoteStorage for S3Bucket {
.client
.delete_objects()
.bucket(self.bucket_name.clone())
.delete(Delete::builder().set_objects(Some(chunk.to_vec())).build())
.delete(
Delete::builder()
.set_objects(Some(chunk.to_vec()))
.build()?,
)
.send()
.await;


@@ -1,6 +1,8 @@
//! This module provides a wrapper around a real RemoteStorage implementation that
//! causes the first N attempts at each upload or download operation to fail. For
//! testing purposes.
use bytes::Bytes;
use futures::stream::Stream;
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::Mutex;
@@ -108,7 +110,7 @@ impl RemoteStorage for UnreliableWrapper {
async fn upload(
&self,
data: impl tokio::io::AsyncRead + Unpin + Send + Sync + 'static,
data: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
// S3 PUT request requires the content length to be specified,
// otherwise it starts to fail with the concurrent connection count increasing.
data_size_bytes: usize,


@@ -0,0 +1,200 @@
use std::collections::HashSet;
use std::ops::ControlFlow;
use std::path::PathBuf;
use std::sync::Arc;
use anyhow::Context;
use bytes::Bytes;
use camino::Utf8Path;
use futures::stream::Stream;
use once_cell::sync::OnceCell;
use remote_storage::{Download, GenericRemoteStorage, RemotePath};
use tokio::task::JoinSet;
use tracing::{debug, error, info};
static LOGGING_DONE: OnceCell<()> = OnceCell::new();
pub(crate) fn upload_stream(
content: std::borrow::Cow<'static, [u8]>,
) -> (
impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
usize,
) {
use std::borrow::Cow;
let content = match content {
Cow::Borrowed(x) => Bytes::from_static(x),
Cow::Owned(vec) => Bytes::from(vec),
};
wrap_stream(content)
}
pub(crate) fn wrap_stream(
content: bytes::Bytes,
) -> (
impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
usize,
) {
let len = content.len();
let content = futures::future::ready(Ok(content));
(futures::stream::once(content), len)
}
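
Both helpers return the stream together with its exact length, since the new `upload` signature needs `data_size_bytes` up front. A tiny usage sketch, mirroring the tests further down (the `client` and `path` values are assumed to exist):

```rust
let (data, len) = wrap_stream(bytes::Bytes::from_static(b"remote blob data"));
client.upload(data, len, &path, None).await?;
```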
pub(crate) async fn download_to_vec(dl: Download) -> anyhow::Result<Vec<u8>> {
let mut buf = Vec::new();
tokio::io::copy_buf(
&mut tokio_util::io::StreamReader::new(dl.download_stream),
&mut buf,
)
.await?;
Ok(buf)
}
// Uploads files `folder{j}/blob_{i}.txt`. See the test description for more details.
pub(crate) async fn upload_simple_remote_data(
client: &Arc<GenericRemoteStorage>,
upload_tasks_count: usize,
) -> ControlFlow<HashSet<RemotePath>, HashSet<RemotePath>> {
info!("Creating {upload_tasks_count} remote files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let blob_path = PathBuf::from(format!("folder{}/blob_{}.txt", i / 7, i));
let blob_path = RemotePath::new(
Utf8Path::from_path(blob_path.as_path()).expect("must be valid blob path"),
)
.with_context(|| format!("{blob_path:?} to RemotePath conversion"))?;
debug!("Creating remote item {i} at path {blob_path:?}");
let (data, len) = upload_stream(format!("remote blob data {i}").into_bytes().into());
task_client.upload(data, len, &blob_path, None).await?;
Ok::<_, anyhow::Error>(blob_path)
});
}
let mut upload_tasks_failed = false;
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok(upload_path) => {
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
if upload_tasks_failed {
ControlFlow::Break(uploaded_blobs)
} else {
ControlFlow::Continue(uploaded_blobs)
}
}
pub(crate) async fn cleanup(
client: &Arc<GenericRemoteStorage>,
objects_to_delete: HashSet<RemotePath>,
) {
info!(
"Removing {} objects from the remote storage during cleanup",
objects_to_delete.len()
);
let mut delete_tasks = JoinSet::new();
for object_to_delete in objects_to_delete {
let task_client = Arc::clone(client);
delete_tasks.spawn(async move {
debug!("Deleting remote item at path {object_to_delete:?}");
task_client
.delete(&object_to_delete)
.await
.with_context(|| format!("{object_to_delete:?} removal"))
});
}
while let Some(task_run_result) = delete_tasks.join_next().await {
match task_run_result {
Ok(task_result) => match task_result {
Ok(()) => {}
Err(e) => error!("Delete task failed: {e:?}"),
},
Err(join_err) => error!("Delete task did not finish correctly: {join_err}"),
}
}
}
pub(crate) struct Uploads {
pub(crate) prefixes: HashSet<RemotePath>,
pub(crate) blobs: HashSet<RemotePath>,
}
pub(crate) async fn upload_remote_data(
client: &Arc<GenericRemoteStorage>,
base_prefix_str: &'static str,
upload_tasks_count: usize,
) -> ControlFlow<Uploads, Uploads> {
info!("Creating {upload_tasks_count} remote files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let prefix = format!("{base_prefix_str}/sub_prefix_{i}/");
let blob_prefix = RemotePath::new(Utf8Path::new(&prefix))
.with_context(|| format!("{prefix:?} to RemotePath conversion"))?;
let blob_path = blob_prefix.join(Utf8Path::new(&format!("blob_{i}")));
debug!("Creating remote item {i} at path {blob_path:?}");
let (data, data_len) =
upload_stream(format!("remote blob data {i}").into_bytes().into());
task_client.upload(data, data_len, &blob_path, None).await?;
Ok::<_, anyhow::Error>((blob_prefix, blob_path))
});
}
let mut upload_tasks_failed = false;
let mut uploaded_prefixes = HashSet::with_capacity(upload_tasks_count);
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok((upload_prefix, upload_path)) => {
uploaded_prefixes.insert(upload_prefix);
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
let uploads = Uploads {
prefixes: uploaded_prefixes,
blobs: uploaded_blobs,
};
if upload_tasks_failed {
ControlFlow::Break(uploads)
} else {
ControlFlow::Continue(uploads)
}
}
pub(crate) fn ensure_logging_ready() {
LOGGING_DONE.get_or_init(|| {
utils::logging::init(
utils::logging::LogFormat::Test,
utils::logging::TracingErrorLayerEnablement::Disabled,
utils::logging::Output::Stdout,
)
.expect("logging init failed");
});
}


@@ -1,22 +1,24 @@
use std::collections::HashSet;
use std::env;
use std::num::{NonZeroU32, NonZeroUsize};
use std::num::NonZeroUsize;
use std::ops::ControlFlow;
use std::path::PathBuf;
use std::sync::Arc;
use std::time::UNIX_EPOCH;
use anyhow::Context;
use camino::Utf8Path;
use once_cell::sync::OnceCell;
use remote_storage::{
AzureConfig, Download, GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind,
AzureConfig, GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind,
};
use test_context::{test_context, AsyncTestContext};
use tokio::task::JoinSet;
use tracing::{debug, error, info};
use tracing::{debug, info};
static LOGGING_DONE: OnceCell<()> = OnceCell::new();
mod common;
use common::{
cleanup, download_to_vec, ensure_logging_ready, upload_remote_data, upload_simple_remote_data,
upload_stream, wrap_stream,
};
const ENABLE_REAL_AZURE_REMOTE_STORAGE_ENV_VAR_NAME: &str = "ENABLE_REAL_AZURE_REMOTE_STORAGE";
@@ -28,7 +30,7 @@ const BASE_PREFIX: &str = "test";
/// If real Azure tests are disabled, the test passes, skipping any real test run: currently, there's no way to mark a test as ignored at runtime with the
/// default test framework, see https://github.com/rust-lang/rust/issues/68007 for details.
///
/// First, the test creates a set of Azure blobs with keys `/${random_prefix_part}/${base_prefix_str}/sub_prefix_${i}/blob_${i}` in [`upload_azure_data`]
/// First, the test creates a set of Azure blobs with keys `/${random_prefix_part}/${base_prefix_str}/sub_prefix_${i}/blob_${i}` in [`upload_remote_data`]
/// where
/// * `random_prefix_part` is set for the entire Azure client during the Azure client creation in [`create_azure_client`], to avoid multiple test runs interference
/// * `base_prefix_str` is a common prefix to use in the client requests: we would want to ensure that the client is able to list nested prefixes inside the bucket
@@ -95,7 +97,7 @@ async fn azure_pagination_should_work(
/// Uses real Azure and requires [`ENABLE_REAL_AZURE_REMOTE_STORAGE_ENV_VAR_NAME`] and related Azure cred env vars specified. Test will skip real code and pass if env vars not set.
/// See `Azure_pagination_should_work` for more information.
///
/// First, create a set of Azure objects with keys `random_prefix/folder{j}/blob_{i}.txt` in [`upload_azure_data`]
/// First, create a set of Azure objects with keys `random_prefix/folder{j}/blob_{i}.txt` in [`upload_remote_data`]
/// Then performs the following queries:
/// 1. `list_files(None)`. This should return all files `random_prefix/folder{j}/blob_{i}.txt`
/// 2. `list_files("folder1")`. This should return all files `random_prefix/folder1/blob_{i}.txt`
@@ -180,23 +182,14 @@ async fn azure_delete_objects_works(ctx: &mut MaybeEnabledAzure) -> anyhow::Resu
let path3 = RemotePath::new(Utf8Path::new(format!("{}/path3", ctx.base_prefix).as_str()))
.with_context(|| "RemotePath conversion")?;
let data1 = "remote blob data1".as_bytes();
let data1_len = data1.len();
let data2 = "remote blob data2".as_bytes();
let data2_len = data2.len();
let data3 = "remote blob data3".as_bytes();
let data3_len = data3.len();
ctx.client
.upload(std::io::Cursor::new(data1), data1_len, &path1, None)
.await?;
let (data, len) = upload_stream("remote blob data1".as_bytes().into());
ctx.client.upload(data, len, &path1, None).await?;
ctx.client
.upload(std::io::Cursor::new(data2), data2_len, &path2, None)
.await?;
let (data, len) = upload_stream("remote blob data2".as_bytes().into());
ctx.client.upload(data, len, &path2, None).await?;
ctx.client
.upload(std::io::Cursor::new(data3), data3_len, &path3, None)
.await?;
let (data, len) = upload_stream("remote blob data3".as_bytes().into());
ctx.client.upload(data, len, &path3, None).await?;
ctx.client.delete_objects(&[path1, path2]).await?;
@@ -219,53 +212,47 @@ async fn azure_upload_download_works(ctx: &mut MaybeEnabledAzure) -> anyhow::Res
let path = RemotePath::new(Utf8Path::new(format!("{}/file", ctx.base_prefix).as_str()))
.with_context(|| "RemotePath conversion")?;
let data = "remote blob data here".as_bytes();
let data_len = data.len() as u64;
let orig = bytes::Bytes::from_static("remote blob data here".as_bytes());
ctx.client
.upload(std::io::Cursor::new(data), data.len(), &path, None)
.await?;
let (data, len) = wrap_stream(orig.clone());
ctx.client.upload(data, len, &path, None).await?;
async fn download_and_compare(mut dl: Download) -> anyhow::Result<Vec<u8>> {
let mut buf = Vec::new();
tokio::io::copy(&mut dl.download_stream, &mut buf).await?;
Ok(buf)
}
// Normal download request
let dl = ctx.client.download(&path).await?;
let buf = download_and_compare(dl).await?;
assert_eq!(buf, data);
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
// Full range (end specified)
let dl = ctx
.client
.download_byte_range(&path, 0, Some(data_len))
.download_byte_range(&path, 0, Some(len as u64))
.await?;
let buf = download_and_compare(dl).await?;
assert_eq!(buf, data);
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
// partial range (end specified)
let dl = ctx.client.download_byte_range(&path, 4, Some(10)).await?;
let buf = download_and_compare(dl).await?;
assert_eq!(buf, data[4..10]);
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[4..10]);
// partial range (end beyond real end)
let dl = ctx
.client
.download_byte_range(&path, 8, Some(data_len * 100))
.download_byte_range(&path, 8, Some(len as u64 * 100))
.await?;
let buf = download_and_compare(dl).await?;
assert_eq!(buf, data[8..]);
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[8..]);
// Partial range (end unspecified)
let dl = ctx.client.download_byte_range(&path, 4, None).await?;
let buf = download_and_compare(dl).await?;
assert_eq!(buf, data[4..]);
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[4..]);
// Full range (end unspecified)
let dl = ctx.client.download_byte_range(&path, 0, None).await?;
let buf = download_and_compare(dl).await?;
assert_eq!(buf, data);
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
debug!("Cleanup: deleting file at path {path:?}");
ctx.client
@@ -276,16 +263,6 @@ async fn azure_upload_download_works(ctx: &mut MaybeEnabledAzure) -> anyhow::Res
Ok(())
}
fn ensure_logging_ready() {
LOGGING_DONE.get_or_init(|| {
utils::logging::init(
utils::logging::LogFormat::Test,
utils::logging::TracingErrorLayerEnablement::Disabled,
)
.expect("logging init failed");
});
}
struct EnabledAzure {
client: Arc<GenericRemoteStorage>,
base_prefix: &'static str,
@@ -355,7 +332,7 @@ impl AsyncTestContext for MaybeEnabledAzureWithTestBlobs {
let enabled = EnabledAzure::setup(Some(max_keys_in_list_response)).await;
match upload_azure_data(&enabled.client, enabled.base_prefix, upload_tasks_count).await {
match upload_remote_data(&enabled.client, enabled.base_prefix, upload_tasks_count).await {
ControlFlow::Continue(uploads) => {
info!("Remote objects created successfully");
@@ -417,7 +394,7 @@ impl AsyncTestContext for MaybeEnabledAzureWithSimpleTestBlobs {
let enabled = EnabledAzure::setup(Some(max_keys_in_list_response)).await;
match upload_simple_azure_data(&enabled.client, upload_tasks_count).await {
match upload_simple_remote_data(&enabled.client, upload_tasks_count).await {
ControlFlow::Continue(uploads) => {
info!("Remote objects created successfully");
@@ -469,8 +446,6 @@ fn create_azure_client(
let random = rand::thread_rng().gen::<u32>();
let remote_storage_config = RemoteStorageConfig {
max_concurrent_syncs: NonZeroUsize::new(100).unwrap(),
max_sync_errors: NonZeroU32::new(5).unwrap(),
storage: RemoteStorageKind::AzureContainer(AzureConfig {
container_name: remote_storage_azure_container,
container_region: remote_storage_azure_region,
@@ -483,143 +458,3 @@ fn create_azure_client(
GenericRemoteStorage::from_config(&remote_storage_config).context("remote storage init")?,
))
}
struct Uploads {
prefixes: HashSet<RemotePath>,
blobs: HashSet<RemotePath>,
}
async fn upload_azure_data(
client: &Arc<GenericRemoteStorage>,
base_prefix_str: &'static str,
upload_tasks_count: usize,
) -> ControlFlow<Uploads, Uploads> {
info!("Creating {upload_tasks_count} Azure files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let prefix = format!("{base_prefix_str}/sub_prefix_{i}/");
let blob_prefix = RemotePath::new(Utf8Path::new(&prefix))
.with_context(|| format!("{prefix:?} to RemotePath conversion"))?;
let blob_path = blob_prefix.join(Utf8Path::new(&format!("blob_{i}")));
debug!("Creating remote item {i} at path {blob_path:?}");
let data = format!("remote blob data {i}").into_bytes();
let data_len = data.len();
task_client
.upload(std::io::Cursor::new(data), data_len, &blob_path, None)
.await?;
Ok::<_, anyhow::Error>((blob_prefix, blob_path))
});
}
let mut upload_tasks_failed = false;
let mut uploaded_prefixes = HashSet::with_capacity(upload_tasks_count);
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok((upload_prefix, upload_path)) => {
uploaded_prefixes.insert(upload_prefix);
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
let uploads = Uploads {
prefixes: uploaded_prefixes,
blobs: uploaded_blobs,
};
if upload_tasks_failed {
ControlFlow::Break(uploads)
} else {
ControlFlow::Continue(uploads)
}
}
async fn cleanup(client: &Arc<GenericRemoteStorage>, objects_to_delete: HashSet<RemotePath>) {
info!(
"Removing {} objects from the remote storage during cleanup",
objects_to_delete.len()
);
let mut delete_tasks = JoinSet::new();
for object_to_delete in objects_to_delete {
let task_client = Arc::clone(client);
delete_tasks.spawn(async move {
debug!("Deleting remote item at path {object_to_delete:?}");
task_client
.delete(&object_to_delete)
.await
.with_context(|| format!("{object_to_delete:?} removal"))
});
}
while let Some(task_run_result) = delete_tasks.join_next().await {
match task_run_result {
Ok(task_result) => match task_result {
Ok(()) => {}
Err(e) => error!("Delete task failed: {e:?}"),
},
Err(join_err) => error!("Delete task did not finish correctly: {join_err}"),
}
}
}
// Uploads files `folder{j}/blob{i}.txt`. See test description for more details.
async fn upload_simple_azure_data(
client: &Arc<GenericRemoteStorage>,
upload_tasks_count: usize,
) -> ControlFlow<HashSet<RemotePath>, HashSet<RemotePath>> {
info!("Creating {upload_tasks_count} Azure files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let blob_path = PathBuf::from(format!("folder{}/blob_{}.txt", i / 7, i));
let blob_path = RemotePath::new(
Utf8Path::from_path(blob_path.as_path()).expect("must be valid blob path"),
)
.with_context(|| format!("{blob_path:?} to RemotePath conversion"))?;
debug!("Creating remote item {i} at path {blob_path:?}");
let data = format!("remote blob data {i}").into_bytes();
let data_len = data.len();
task_client
.upload(std::io::Cursor::new(data), data_len, &blob_path, None)
.await?;
Ok::<_, anyhow::Error>(blob_path)
});
}
let mut upload_tasks_failed = false;
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok(upload_path) => {
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
if upload_tasks_failed {
ControlFlow::Break(uploaded_blobs)
} else {
ControlFlow::Continue(uploaded_blobs)
}
}

View File

@@ -1,22 +1,24 @@
use std::collections::HashSet;
use std::env;
use std::num::{NonZeroU32, NonZeroUsize};
use std::num::NonZeroUsize;
use std::ops::ControlFlow;
use std::path::PathBuf;
use std::sync::Arc;
use std::time::UNIX_EPOCH;
use anyhow::Context;
use camino::Utf8Path;
use once_cell::sync::OnceCell;
use remote_storage::{
GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind, S3Config,
};
use test_context::{test_context, AsyncTestContext};
use tokio::task::JoinSet;
use tracing::{debug, error, info};
use tracing::{debug, info};
static LOGGING_DONE: OnceCell<()> = OnceCell::new();
mod common;
use common::{
cleanup, download_to_vec, ensure_logging_ready, upload_remote_data, upload_simple_remote_data,
upload_stream, wrap_stream,
};
const ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME: &str = "ENABLE_REAL_S3_REMOTE_STORAGE";
@@ -28,7 +30,7 @@ const BASE_PREFIX: &str = "test";
/// If real S3 tests are disabled, the test passes, skipping any real test run: currently, there's no way to mark the test ignored at runtime with the
/// default test framework, see https://github.com/rust-lang/rust/issues/68007 for details.
///
/// First, the test creates a set of S3 objects with keys `/${random_prefix_part}/${base_prefix_str}/sub_prefix_${i}/blob_${i}` in [`upload_s3_data`]
/// First, the test creates a set of S3 objects with keys `/${random_prefix_part}/${base_prefix_str}/sub_prefix_${i}/blob_${i}` in [`upload_remote_data`]
/// where
/// * `random_prefix_part` is set for the entire S3 client during the S3 client creation in [`create_s3_client`], to avoid interference between multiple test runs
/// * `base_prefix_str` is a common prefix to use in the client requests: we would want to ensure that the client is able to list nested prefixes inside the bucket
@@ -93,7 +95,7 @@ async fn s3_pagination_should_work(ctx: &mut MaybeEnabledS3WithTestBlobs) -> any
/// Uses real S3 and requires [`ENABLE_REAL_S3_REMOTE_STORAGE_ENV_VAR_NAME`] and related S3 cred env vars specified. Test will skip real code and pass if env vars not set.
/// See `s3_pagination_should_work` for more information.
///
/// First, create a set of S3 objects with keys `random_prefix/folder{j}/blob_{i}.txt` in [`upload_s3_data`]
/// First, create a set of S3 objects with keys `random_prefix/folder{j}/blob_{i}.txt` in [`upload_remote_data`]
/// Then performs the following queries:
/// 1. `list_files(None)`. This should return all files `random_prefix/folder{j}/blob_{i}.txt`
/// 2. `list_files("folder1")`. This should return all files `random_prefix/folder1/blob_{i}.txt`
@@ -176,23 +178,14 @@ async fn s3_delete_objects_works(ctx: &mut MaybeEnabledS3) -> anyhow::Result<()>
let path3 = RemotePath::new(Utf8Path::new(format!("{}/path3", ctx.base_prefix).as_str()))
.with_context(|| "RemotePath conversion")?;
let data1 = "remote blob data1".as_bytes();
let data1_len = data1.len();
let data2 = "remote blob data2".as_bytes();
let data2_len = data2.len();
let data3 = "remote blob data3".as_bytes();
let data3_len = data3.len();
ctx.client
.upload(std::io::Cursor::new(data1), data1_len, &path1, None)
.await?;
let (data, len) = upload_stream("remote blob data1".as_bytes().into());
ctx.client.upload(data, len, &path1, None).await?;
ctx.client
.upload(std::io::Cursor::new(data2), data2_len, &path2, None)
.await?;
let (data, len) = upload_stream("remote blob data2".as_bytes().into());
ctx.client.upload(data, len, &path2, None).await?;
ctx.client
.upload(std::io::Cursor::new(data3), data3_len, &path3, None)
.await?;
let (data, len) = upload_stream("remote blob data3".as_bytes().into());
ctx.client.upload(data, len, &path3, None).await?;
ctx.client.delete_objects(&[path1, path2]).await?;
@@ -205,14 +198,65 @@ async fn s3_delete_objects_works(ctx: &mut MaybeEnabledS3) -> anyhow::Result<()>
Ok(())
}
fn ensure_logging_ready() {
LOGGING_DONE.get_or_init(|| {
utils::logging::init(
utils::logging::LogFormat::Test,
utils::logging::TracingErrorLayerEnablement::Disabled,
)
.expect("logging init failed");
});
#[test_context(MaybeEnabledS3)]
#[tokio::test]
async fn s3_upload_download_works(ctx: &mut MaybeEnabledS3) -> anyhow::Result<()> {
let MaybeEnabledS3::Enabled(ctx) = ctx else {
return Ok(());
};
let path = RemotePath::new(Utf8Path::new(format!("{}/file", ctx.base_prefix).as_str()))
.with_context(|| "RemotePath conversion")?;
let orig = bytes::Bytes::from_static("remote blob data here".as_bytes());
let (data, len) = wrap_stream(orig.clone());
ctx.client.upload(data, len, &path, None).await?;
// Normal download request
let dl = ctx.client.download(&path).await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
// Full range (end specified)
let dl = ctx
.client
.download_byte_range(&path, 0, Some(len as u64))
.await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
// partial range (end specified)
let dl = ctx.client.download_byte_range(&path, 4, Some(10)).await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[4..10]);
// partial range (end beyond real end)
let dl = ctx
.client
.download_byte_range(&path, 8, Some(len as u64 * 100))
.await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[8..]);
// Partial range (end unspecified)
let dl = ctx.client.download_byte_range(&path, 4, None).await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig[4..]);
// Full range (end unspecified)
let dl = ctx.client.download_byte_range(&path, 0, None).await?;
let buf = download_to_vec(dl).await?;
assert_eq!(&buf, &orig);
debug!("Cleanup: deleting file at path {path:?}");
ctx.client
.delete(&path)
.await
.with_context(|| format!("{path:?} removal"))?;
Ok(())
}
struct EnabledS3 {
@@ -284,7 +328,7 @@ impl AsyncTestContext for MaybeEnabledS3WithTestBlobs {
let enabled = EnabledS3::setup(Some(max_keys_in_list_response)).await;
match upload_s3_data(&enabled.client, enabled.base_prefix, upload_tasks_count).await {
match upload_remote_data(&enabled.client, enabled.base_prefix, upload_tasks_count).await {
ControlFlow::Continue(uploads) => {
info!("Remote objects created successfully");
@@ -346,7 +390,7 @@ impl AsyncTestContext for MaybeEnabledS3WithSimpleTestBlobs {
let enabled = EnabledS3::setup(Some(max_keys_in_list_response)).await;
match upload_simple_s3_data(&enabled.client, upload_tasks_count).await {
match upload_simple_remote_data(&enabled.client, upload_tasks_count).await {
ControlFlow::Continue(uploads) => {
info!("Remote objects created successfully");
@@ -396,8 +440,6 @@ fn create_s3_client(
let random = rand::thread_rng().gen::<u32>();
let remote_storage_config = RemoteStorageConfig {
max_concurrent_syncs: NonZeroUsize::new(100).unwrap(),
max_sync_errors: NonZeroU32::new(5).unwrap(),
storage: RemoteStorageKind::AwsS3(S3Config {
bucket_name: remote_storage_s3_bucket,
bucket_region: remote_storage_s3_region,
@@ -411,143 +453,3 @@ fn create_s3_client(
GenericRemoteStorage::from_config(&remote_storage_config).context("remote storage init")?,
))
}
struct Uploads {
prefixes: HashSet<RemotePath>,
blobs: HashSet<RemotePath>,
}
async fn upload_s3_data(
client: &Arc<GenericRemoteStorage>,
base_prefix_str: &'static str,
upload_tasks_count: usize,
) -> ControlFlow<Uploads, Uploads> {
info!("Creating {upload_tasks_count} S3 files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let prefix = format!("{base_prefix_str}/sub_prefix_{i}/");
let blob_prefix = RemotePath::new(Utf8Path::new(&prefix))
.with_context(|| format!("{prefix:?} to RemotePath conversion"))?;
let blob_path = blob_prefix.join(Utf8Path::new(&format!("blob_{i}")));
debug!("Creating remote item {i} at path {blob_path:?}");
let data = format!("remote blob data {i}").into_bytes();
let data_len = data.len();
task_client
.upload(std::io::Cursor::new(data), data_len, &blob_path, None)
.await?;
Ok::<_, anyhow::Error>((blob_prefix, blob_path))
});
}
let mut upload_tasks_failed = false;
let mut uploaded_prefixes = HashSet::with_capacity(upload_tasks_count);
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok((upload_prefix, upload_path)) => {
uploaded_prefixes.insert(upload_prefix);
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
let uploads = Uploads {
prefixes: uploaded_prefixes,
blobs: uploaded_blobs,
};
if upload_tasks_failed {
ControlFlow::Break(uploads)
} else {
ControlFlow::Continue(uploads)
}
}
async fn cleanup(client: &Arc<GenericRemoteStorage>, objects_to_delete: HashSet<RemotePath>) {
info!(
"Removing {} objects from the remote storage during cleanup",
objects_to_delete.len()
);
let mut delete_tasks = JoinSet::new();
for object_to_delete in objects_to_delete {
let task_client = Arc::clone(client);
delete_tasks.spawn(async move {
debug!("Deleting remote item at path {object_to_delete:?}");
task_client
.delete(&object_to_delete)
.await
.with_context(|| format!("{object_to_delete:?} removal"))
});
}
while let Some(task_run_result) = delete_tasks.join_next().await {
match task_run_result {
Ok(task_result) => match task_result {
Ok(()) => {}
Err(e) => error!("Delete task failed: {e:?}"),
},
Err(join_err) => error!("Delete task did not finish correctly: {join_err}"),
}
}
}
// Uploads files `folder{j}/blob{i}.txt`. See test description for more details.
async fn upload_simple_s3_data(
client: &Arc<GenericRemoteStorage>,
upload_tasks_count: usize,
) -> ControlFlow<HashSet<RemotePath>, HashSet<RemotePath>> {
info!("Creating {upload_tasks_count} S3 files");
let mut upload_tasks = JoinSet::new();
for i in 1..upload_tasks_count + 1 {
let task_client = Arc::clone(client);
upload_tasks.spawn(async move {
let blob_path = PathBuf::from(format!("folder{}/blob_{}.txt", i / 7, i));
let blob_path = RemotePath::new(
Utf8Path::from_path(blob_path.as_path()).expect("must be valid blob path"),
)
.with_context(|| format!("{blob_path:?} to RemotePath conversion"))?;
debug!("Creating remote item {i} at path {blob_path:?}");
let data = format!("remote blob data {i}").into_bytes();
let data_len = data.len();
task_client
.upload(std::io::Cursor::new(data), data_len, &blob_path, None)
.await?;
Ok::<_, anyhow::Error>(blob_path)
});
}
let mut upload_tasks_failed = false;
let mut uploaded_blobs = HashSet::with_capacity(upload_tasks_count);
while let Some(task_run_result) = upload_tasks.join_next().await {
match task_run_result
.context("task join failed")
.and_then(|task_result| task_result.context("upload task failed"))
{
Ok(upload_path) => {
uploaded_blobs.insert(upload_path);
}
Err(e) => {
error!("Upload task failed: {e:?}");
upload_tasks_failed = true;
}
}
}
if upload_tasks_failed {
ControlFlow::Break(uploaded_blobs)
} else {
ControlFlow::Continue(uploaded_blobs)
}
}

View File

@@ -1,3 +1,5 @@
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
use const_format::formatcp;
/// Public API types

View File

@@ -1,23 +1,18 @@
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::{
id::{NodeId, TenantId, TimelineId},
lsn::Lsn,
};
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct TimelineCreateRequest {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
#[serde_as(as = "DisplayFromStr")]
pub timeline_id: TimelineId,
pub peer_ids: Option<Vec<NodeId>>,
pub pg_version: u32,
pub system_id: Option<u64>,
pub wal_seg_size: Option<u32>,
#[serde_as(as = "DisplayFromStr")]
pub commit_lsn: Lsn,
// If not passed, it is assigned to the beginning of commit_lsn segment.
pub local_start_lsn: Option<Lsn>,
@@ -28,7 +23,6 @@ fn lsn_invalid() -> Lsn {
}
/// Data about safekeeper's timeline, mirrors broker.proto.
#[serde_as]
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct SkTimelineInfo {
/// Term.
@@ -36,25 +30,19 @@ pub struct SkTimelineInfo {
/// Term of the last entry.
pub last_log_term: Option<u64>,
/// LSN of the last record.
#[serde_as(as = "DisplayFromStr")]
#[serde(default = "lsn_invalid")]
pub flush_lsn: Lsn,
/// Up to which LSN safekeeper regards its WAL as committed.
#[serde_as(as = "DisplayFromStr")]
#[serde(default = "lsn_invalid")]
pub commit_lsn: Lsn,
/// LSN up to which safekeeper has backed WAL.
#[serde_as(as = "DisplayFromStr")]
#[serde(default = "lsn_invalid")]
pub backup_lsn: Lsn,
/// LSN of last checkpoint uploaded by pageserver.
#[serde_as(as = "DisplayFromStr")]
#[serde(default = "lsn_invalid")]
pub remote_consistent_lsn: Lsn,
#[serde_as(as = "DisplayFromStr")]
#[serde(default = "lsn_invalid")]
pub peer_horizon_lsn: Lsn,
#[serde_as(as = "DisplayFromStr")]
#[serde(default = "lsn_invalid")]
pub local_start_lsn: Lsn,
/// A connection string to use for WAL receiving.
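The `#[serde_as(as = "DisplayFromStr")]` attributes can be dropped here because `Lsn` now handles human-readable (de)serialization itself (see the `Lsn` serde changes further down in this diff). A minimal sketch of the resulting JSON shape, assuming the `utils` crate from this repository and that `Lsn`'s `Display` stays symmetric with the `FromStr` shown in its tests:

```rust
use serde::Serialize;
use utils::lsn::Lsn;

#[derive(Serialize)]
struct Info {
    // No #[serde_as(as = "DisplayFromStr")] needed anymore.
    commit_lsn: Lsn,
}

fn main() {
    let json = serde_json::to_string(&Info {
        commit_lsn: Lsn(0x1234_5678_AAAA_5555),
    })
    .unwrap();
    // serde_json is a human-readable format, so Lsn serializes via Display.
    assert_eq!(json, r#"{"commit_lsn":"12345678/AAAA5555"}"#);
}
```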

View File

@@ -1,4 +1,6 @@
//! Synthetic size calculation
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
mod calculation;
pub mod svg;

View File

@@ -32,6 +32,8 @@
//! .init();
//! }
//! ```
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
use opentelemetry::sdk::Resource;
use opentelemetry::KeyValue;

View File

@@ -5,6 +5,7 @@ edition.workspace = true
license.workspace = true
[dependencies]
arc-swap.workspace = true
sentry.workspace = true
async-trait.workspace = true
anyhow.workspace = true
@@ -49,12 +50,15 @@ const_format.workspace = true
# why is it only here? no other crate should use it, streams are rarely needed.
tokio-stream = { version = "0.1.14" }
serde_path_to_error.workspace = true
[dev-dependencies]
byteorder.workspace = true
bytes.workspace = true
criterion.workspace = true
hex-literal.workspace = true
camino-tempfile.workspace = true
serde_assert.workspace = true
[[bench]]
name = "benchmarks"

View File

@@ -0,0 +1,21 @@
#!/bin/bash
# like restore_from_wal.sh, but takes existing initdb.tar.zst
set -euxo pipefail
PG_BIN=$1
WAL_PATH=$2
DATA_DIR=$3
PORT=$4
echo "port=$PORT" >> "$DATA_DIR"/postgresql.conf
echo "shared_preload_libraries='\$libdir/neon_rmgr.so'" >> "$DATA_DIR"/postgresql.conf
REDO_POS=0x$("$PG_BIN"/pg_controldata -D "$DATA_DIR" | grep -F "REDO location"| cut -c 42-)
declare -i WAL_SIZE=$REDO_POS+114
"$PG_BIN"/pg_ctl -D "$DATA_DIR" -l "$DATA_DIR/logfile.log" start
"$PG_BIN"/pg_ctl -D "$DATA_DIR" -l "$DATA_DIR/logfile.log" stop -m immediate
cp "$DATA_DIR"/pg_wal/000000010000000000000001 .
cp "$WAL_PATH"/* "$DATA_DIR"/pg_wal/
for partial in "$DATA_DIR"/pg_wal/*.partial ; do mv "$partial" "${partial%.partial}" ; done
dd if=000000010000000000000001 of="$DATA_DIR"/pg_wal/000000010000000000000001 bs=$WAL_SIZE count=1 conv=notrunc
rm -f 000000010000000000000001

View File

@@ -1,7 +1,8 @@
// For details about authentication see docs/authentication.md
use arc_swap::ArcSwap;
use serde;
use std::fs;
use std::{borrow::Cow, fmt::Display, fs, sync::Arc};
use anyhow::Result;
use camino::Utf8Path;
@@ -9,9 +10,8 @@ use jsonwebtoken::{
decode, encode, Algorithm, DecodingKey, EncodingKey, Header, TokenData, Validation,
};
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use crate::id::TenantId;
use crate::{http::error::ApiError, id::TenantId};
/// Algorithm to use. We require EdDSA.
const STORAGE_TOKEN_ALGORITHM: Algorithm = Algorithm::EdDSA;
@@ -32,11 +32,9 @@ pub enum Scope {
}
/// JWT payload. See docs/authentication.md for the format
#[serde_as]
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq)]
pub struct Claims {
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub tenant_id: Option<TenantId>,
pub scope: Scope,
}
@@ -47,31 +45,106 @@ impl Claims {
}
}
pub struct SwappableJwtAuth(ArcSwap<JwtAuth>);
impl SwappableJwtAuth {
pub fn new(jwt_auth: JwtAuth) -> Self {
SwappableJwtAuth(ArcSwap::new(Arc::new(jwt_auth)))
}
pub fn swap(&self, jwt_auth: JwtAuth) {
self.0.swap(Arc::new(jwt_auth));
}
pub fn decode(&self, token: &str) -> std::result::Result<TokenData<Claims>, AuthError> {
self.0.load().decode(token)
}
}
impl std::fmt::Debug for SwappableJwtAuth {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "Swappable({:?})", self.0.load())
}
}
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct AuthError(pub Cow<'static, str>);
impl Display for AuthError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.0)
}
}
impl From<AuthError> for ApiError {
fn from(_value: AuthError) -> Self {
// Don't pass on the value of the AuthError as a precautionary measure.
// Being intentionally vague in public error communication hurts debuggability
// but it is more secure.
ApiError::Forbidden("JWT authentication error".to_string())
}
}
pub struct JwtAuth {
decoding_key: DecodingKey,
decoding_keys: Vec<DecodingKey>,
validation: Validation,
}
impl JwtAuth {
pub fn new(decoding_key: DecodingKey) -> Self {
pub fn new(decoding_keys: Vec<DecodingKey>) -> Self {
let mut validation = Validation::default();
validation.algorithms = vec![STORAGE_TOKEN_ALGORITHM];
// The default 'required_spec_claims' is 'exp'. But we don't want to require
// expiration.
validation.required_spec_claims = [].into();
Self {
decoding_key,
decoding_keys,
validation,
}
}
pub fn from_key_path(key_path: &Utf8Path) -> Result<Self> {
let public_key = fs::read(key_path)?;
Ok(Self::new(DecodingKey::from_ed_pem(&public_key)?))
let metadata = key_path.metadata()?;
let decoding_keys = if metadata.is_dir() {
let mut keys = Vec::new();
for entry in fs::read_dir(key_path)? {
let path = entry?.path();
if !path.is_file() {
// Ignore directories (don't recurse)
continue;
}
let public_key = fs::read(path)?;
keys.push(DecodingKey::from_ed_pem(&public_key)?);
}
keys
} else if metadata.is_file() {
let public_key = fs::read(key_path)?;
vec![DecodingKey::from_ed_pem(&public_key)?]
} else {
anyhow::bail!("path is neither a directory or a file")
};
if decoding_keys.is_empty() {
anyhow::bail!("Configured for JWT auth with zero decoding keys. All JWT gated requests would be rejected.");
}
Ok(Self::new(decoding_keys))
}
pub fn decode(&self, token: &str) -> Result<TokenData<Claims>> {
Ok(decode(token, &self.decoding_key, &self.validation)?)
/// Attempt to decode the token with the internal decoding keys.
///
/// The function tries the stored decoding keys in succession,
/// and returns the first yielding a successful result.
/// If there is no working decoding key, it returns the last error.
pub fn decode(&self, token: &str) -> std::result::Result<TokenData<Claims>, AuthError> {
let mut res = None;
for decoding_key in &self.decoding_keys {
res = Some(decode(token, decoding_key, &self.validation));
if let Some(Ok(res)) = res {
return Ok(res);
}
}
if let Some(res) = res {
res.map_err(|e| AuthError(Cow::Owned(e.to_string())))
} else {
Err(AuthError(Cow::Borrowed("no JWT decoding keys configured")))
}
}
}
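A self-contained sketch of the decode strategy above: try each configured key, return the first success, otherwise the last error. `verify_with_key` is a hypothetical stand-in for `jsonwebtoken::decode`, used only to keep the example runnable outside this crate:

```rust
// Illustrative only: mirrors JwtAuth::decode's "first success wins, else last error" loop.
fn verify_with_key(token: &str, key: &str) -> Result<String, String> {
    // Hypothetical stand-in for jsonwebtoken::decode.
    if token.ends_with(key) {
        Ok(format!("verified with {key}"))
    } else {
        Err(format!("{key} did not match"))
    }
}

fn decode_with_any(token: &str, keys: &[&str]) -> Result<String, String> {
    let mut last_err = None;
    for key in keys {
        match verify_with_key(token, key) {
            Ok(ok) => return Ok(ok),
            Err(e) => last_err = Some(e),
        }
    }
    // Either every key failed, or no keys were configured at all.
    last_err.map_or(Err("no JWT decoding keys configured".to_string()), Err)
}

fn main() {
    assert!(decode_with_any("token-signed-with-k2", &["k1", "k2"]).is_ok());
    assert!(decode_with_any("token-signed-with-k3", &["k1", "k2"]).is_err());
    assert!(decode_with_any("token", &[]).is_err());
}
```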
@@ -111,9 +184,9 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
"#;
#[test]
fn test_decode() -> Result<(), anyhow::Error> {
fn test_decode() {
let expected_claims = Claims {
tenant_id: Some(TenantId::from_str("3d1f7595b468230304e0b73cecbcb081")?),
tenant_id: Some(TenantId::from_str("3d1f7595b468230304e0b73cecbcb081").unwrap()),
scope: Scope::Tenant,
};
@@ -132,28 +205,24 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
let encoded_eddsa = "eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJzY29wZSI6InRlbmFudCIsInRlbmFudF9pZCI6IjNkMWY3NTk1YjQ2ODIzMDMwNGUwYjczY2VjYmNiMDgxIiwiaXNzIjoibmVvbi5jb250cm9scGxhbmUiLCJleHAiOjE3MDkyMDA4NzksImlhdCI6MTY3ODQ0MjQ3OX0.U3eA8j-uU-JnhzeO3EDHRuXLwkAUFCPxtGHEgw6p7Ccc3YRbFs2tmCdbD9PZEXP-XsxSeBQi1FY0YPcT3NXADw";
// Check it can be validated with the public key
let auth = JwtAuth::new(DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519)?);
let claims_from_token = auth.decode(encoded_eddsa)?.claims;
let auth = JwtAuth::new(vec![DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519).unwrap()]);
let claims_from_token = auth.decode(encoded_eddsa).unwrap().claims;
assert_eq!(claims_from_token, expected_claims);
Ok(())
}
#[test]
fn test_encode() -> Result<(), anyhow::Error> {
fn test_encode() {
let claims = Claims {
tenant_id: Some(TenantId::from_str("3d1f7595b468230304e0b73cecbcb081")?),
tenant_id: Some(TenantId::from_str("3d1f7595b468230304e0b73cecbcb081").unwrap()),
scope: Scope::Tenant,
};
let encoded = encode_from_key_file(&claims, TEST_PRIV_KEY_ED25519)?;
let encoded = encode_from_key_file(&claims, TEST_PRIV_KEY_ED25519).unwrap();
// decode it back
let auth = JwtAuth::new(DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519)?);
let decoded = auth.decode(&encoded)?;
let auth = JwtAuth::new(vec![DecodingKey::from_ed_pem(TEST_PUB_KEY_ED25519).unwrap()]);
let decoded = auth.decode(&encoded).unwrap();
assert_eq!(decoded.claims, claims);
Ok(())
}
}

View File

@@ -1,16 +1,14 @@
use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};
use tokio_util::task::{task_tracker::TaskTrackerToken, TaskTracker};
/// While a reference is kept around, the associated [`Barrier::wait`] will wait.
///
/// Can be cloned, moved and kept around in futures as "guard objects".
#[derive(Clone)]
pub struct Completion(mpsc::Sender<()>);
pub struct Completion(TaskTrackerToken);
/// Barrier will wait until all clones of [`Completion`] have been dropped.
#[derive(Clone)]
pub struct Barrier(Arc<Mutex<mpsc::Receiver<()>>>);
pub struct Barrier(TaskTracker);
impl Default for Barrier {
fn default() -> Self {
@@ -21,7 +19,7 @@ impl Default for Barrier {
impl Barrier {
pub async fn wait(self) {
self.0.lock().await.recv().await;
self.0.wait().await;
}
pub async fn maybe_wait(barrier: Option<Barrier>) {
@@ -33,8 +31,7 @@ impl Barrier {
impl PartialEq for Barrier {
fn eq(&self, other: &Self) -> bool {
// we don't use dyn so this is good
Arc::ptr_eq(&self.0, &other.0)
TaskTracker::ptr_eq(&self.0, &other.0)
}
}
@@ -42,8 +39,10 @@ impl Eq for Barrier {}
/// Create new Guard and Barrier pair.
pub fn channel() -> (Completion, Barrier) {
let (tx, rx) = mpsc::channel::<()>(1);
let rx = Mutex::new(rx);
let rx = Arc::new(rx);
(Completion(tx), Barrier(rx))
let tracker = TaskTracker::new();
// otherwise wait never exits
tracker.close();
let token = tracker.token();
(Completion(token), Barrier(tracker))
}
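A hedged usage sketch for the `channel()`/`Barrier` pair as defined above, assuming this module is `utils::completion` and a tokio runtime is available:

```rust
use std::time::Duration;
use utils::completion;

#[tokio::main]
async fn main() {
    let (done, barrier) = completion::channel();

    // Each task keeps a clone of the Completion alive while it works.
    for i in 0..3u64 {
        let done = done.clone();
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_millis(10 * i)).await;
            drop(done); // releasing the guard lets the barrier make progress
        });
    }

    // Drop the original too, otherwise wait() would never return.
    drop(done);

    // Resolves once every Completion clone has been dropped.
    barrier.wait().await;
}
```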

View File

@@ -7,7 +7,7 @@ use serde::{Deserialize, Serialize};
///
/// See docs/rfcs/025-generation-numbers.md for detail on how generation
/// numbers are used.
#[derive(Copy, Clone, Eq, PartialEq, PartialOrd, Ord)]
#[derive(Copy, Clone, Eq, PartialEq, PartialOrd, Ord, Hash)]
pub enum Generation {
// Generations with this magic value will not add a suffix to S3 keys, and will not
// be included in persisted index_part.json. This value is only to be used
@@ -152,3 +152,16 @@ impl Debug for Generation {
}
}
}
#[cfg(test)]
mod test {
use super::*;
#[test]
fn generation_gt() {
// Important that a None generation compares less than a valid one, during upgrades from
// pre-generation systems.
assert!(Generation::none() < Generation::new(0));
assert!(Generation::none() < Generation::new(1));
}
}

libs/utils/src/hex.rs (new file, 41 lines)
View File

@@ -0,0 +1,41 @@
/// Useful type for asserting that expected bytes match, reporting the bytes as more readable,
/// array-syntax compatible hex bytes.
///
/// # Usage
///
/// ```
/// use utils::Hex;
///
/// let actual = serialize_something();
/// let expected = [0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64];
///
/// // the type implements PartialEq and on mismatch, both sides are printed in 16 wide multiline
/// // output suffixed with an array style length for easier comparisons.
/// assert_eq!(Hex(&actual), Hex(&expected));
///
/// // with `let expected = [0x68];` the error would have been:
/// // assertion `left == right` failed
/// // left: [0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64; 11]
/// // right: [0x68; 1]
/// # fn serialize_something() -> Vec<u8> { "hello world".as_bytes().to_vec() }
/// ```
#[derive(PartialEq)]
pub struct Hex<'a>(pub &'a [u8]);
impl std::fmt::Debug for Hex<'_> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "[")?;
for (i, c) in self.0.chunks(16).enumerate() {
if i > 0 && !c.is_empty() {
writeln!(f, ", ")?;
}
for (j, b) in c.iter().enumerate() {
if j > 0 {
write!(f, ", ")?;
}
write!(f, "0x{b:02x}")?;
}
}
write!(f, "; {}]", self.0.len())
}
}

View File

@@ -1,4 +1,4 @@
use crate::auth::{Claims, JwtAuth};
use crate::auth::{AuthError, Claims, SwappableJwtAuth};
use crate::http::error::{api_error_handler, route_error_handler, ApiError};
use anyhow::Context;
use hyper::header::{HeaderName, AUTHORIZATION};
@@ -14,6 +14,11 @@ use tracing::{self, debug, info, info_span, warn, Instrument};
use std::future::Future;
use std::str::FromStr;
use bytes::{Bytes, BytesMut};
use std::io::Write as _;
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;
static SERVE_METRICS_COUNT: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"libmetrics_metric_handler_requests_total",
@@ -146,94 +151,89 @@ impl Drop for RequestCancelled {
}
}
/// An [`std::io::Write`] implementation on top of a channel sending [`bytes::Bytes`] chunks.
pub struct ChannelWriter {
buffer: BytesMut,
pub tx: mpsc::Sender<std::io::Result<Bytes>>,
written: usize,
}
impl ChannelWriter {
pub fn new(buf_len: usize, tx: mpsc::Sender<std::io::Result<Bytes>>) -> Self {
assert_ne!(buf_len, 0);
ChannelWriter {
// split about half off the buffer from the start, because we flush depending on
// capacity. first flush will come sooner than without this, but now resizes will
// have better chance of picking up the "other" half. not guaranteed of course.
buffer: BytesMut::with_capacity(buf_len).split_off(buf_len / 2),
tx,
written: 0,
}
}
pub fn flush0(&mut self) -> std::io::Result<usize> {
let n = self.buffer.len();
if n == 0 {
return Ok(0);
}
tracing::trace!(n, "flushing");
let ready = self.buffer.split().freeze();
// not ideal to call from blocking code to block_on, but we are sure that this
// operation does not spawn_blocking other tasks
let res: Result<(), ()> = tokio::runtime::Handle::current().block_on(async {
self.tx.send(Ok(ready)).await.map_err(|_| ())?;
// throttle sending to allow reuse of our buffer in `write`.
self.tx.reserve().await.map_err(|_| ())?;
// now the response task has picked up the buffer and hopefully started
// sending it to the client.
Ok(())
});
if res.is_err() {
return Err(std::io::ErrorKind::BrokenPipe.into());
}
self.written += n;
Ok(n)
}
pub fn flushed_bytes(&self) -> usize {
self.written
}
}
impl std::io::Write for ChannelWriter {
fn write(&mut self, mut buf: &[u8]) -> std::io::Result<usize> {
let remaining = self.buffer.capacity() - self.buffer.len();
let out_of_space = remaining < buf.len();
let original_len = buf.len();
if out_of_space {
let can_still_fit = buf.len() - remaining;
self.buffer.extend_from_slice(&buf[..can_still_fit]);
buf = &buf[can_still_fit..];
self.flush0()?;
}
// assume that this will often under normal operation just move the pointer back to the
// beginning of allocation, because previous split off parts are already sent and
// dropped.
self.buffer.extend_from_slice(buf);
Ok(original_len)
}
fn flush(&mut self) -> std::io::Result<()> {
self.flush0().map(|_| ())
}
}
async fn prometheus_metrics_handler(_req: Request<Body>) -> Result<Response<Body>, ApiError> {
use bytes::{Bytes, BytesMut};
use std::io::Write as _;
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;
SERVE_METRICS_COUNT.inc();
/// An [`std::io::Write`] implementation on top of a channel sending [`bytes::Bytes`] chunks.
struct ChannelWriter {
buffer: BytesMut,
tx: mpsc::Sender<std::io::Result<Bytes>>,
written: usize,
}
impl ChannelWriter {
fn new(buf_len: usize, tx: mpsc::Sender<std::io::Result<Bytes>>) -> Self {
assert_ne!(buf_len, 0);
ChannelWriter {
// split about half off the buffer from the start, because we flush depending on
// capacity. first flush will come sooner than without this, but now resizes will
// have better chance of picking up the "other" half. not guaranteed of course.
buffer: BytesMut::with_capacity(buf_len).split_off(buf_len / 2),
tx,
written: 0,
}
}
fn flush0(&mut self) -> std::io::Result<usize> {
let n = self.buffer.len();
if n == 0 {
return Ok(0);
}
tracing::trace!(n, "flushing");
let ready = self.buffer.split().freeze();
// not ideal to call from blocking code to block_on, but we are sure that this
// operation does not spawn_blocking other tasks
let res: Result<(), ()> = tokio::runtime::Handle::current().block_on(async {
self.tx.send(Ok(ready)).await.map_err(|_| ())?;
// throttle sending to allow reuse of our buffer in `write`.
self.tx.reserve().await.map_err(|_| ())?;
// now the response task has picked up the buffer and hopefully started
// sending it to the client.
Ok(())
});
if res.is_err() {
return Err(std::io::ErrorKind::BrokenPipe.into());
}
self.written += n;
Ok(n)
}
fn flushed_bytes(&self) -> usize {
self.written
}
}
impl std::io::Write for ChannelWriter {
fn write(&mut self, mut buf: &[u8]) -> std::io::Result<usize> {
let remaining = self.buffer.capacity() - self.buffer.len();
let out_of_space = remaining < buf.len();
let original_len = buf.len();
if out_of_space {
let can_still_fit = buf.len() - remaining;
self.buffer.extend_from_slice(&buf[..can_still_fit]);
buf = &buf[can_still_fit..];
self.flush0()?;
}
// assume that this will often under normal operation just move the pointer back to the
// beginning of allocation, because previous split off parts are already sent and
// dropped.
self.buffer.extend_from_slice(buf);
Ok(original_len)
}
fn flush(&mut self) -> std::io::Result<()> {
self.flush0().map(|_| ())
}
}
let started_at = std::time::Instant::now();
let (tx, rx) = mpsc::channel(1);
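A sketch of how the now-public `ChannelWriter` might be driven outside this handler. The module path (`utils::http::endpoint`) and the buffer size are assumptions, and the writer is driven from blocking code because `flush0` calls `block_on` internally:

```rust
use std::io::Write as _;
use utils::http::endpoint::ChannelWriter;

#[tokio::main(flavor = "multi_thread")]
async fn main() {
    let (tx, mut rx) = tokio::sync::mpsc::channel(1);

    // Produce data from blocking code; each flush sends one Bytes chunk to rx.
    let producer = tokio::task::spawn_blocking(move || {
        let mut w = ChannelWriter::new(4096, tx);
        for i in 0..8 {
            writeln!(w, "chunk line {i}").unwrap();
        }
        w.flush().unwrap();
        // Dropping `w` closes the channel, ending the consumer loop below.
        w.flushed_bytes()
    });

    // Consume chunks as they arrive, as the streaming response body does.
    let mut received = 0usize;
    while let Some(chunk) = rx.recv().await {
        received += chunk.expect("writer only sends Ok chunks here").len();
    }
    assert_eq!(received, producer.await.unwrap());
}
```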
@@ -389,7 +389,7 @@ fn parse_token(header_value: &str) -> Result<&str, ApiError> {
}
pub fn auth_middleware<B: hyper::body::HttpBody + Send + Sync + 'static>(
provide_auth: fn(&Request<Body>) -> Option<&JwtAuth>,
provide_auth: fn(&Request<Body>) -> Option<&SwappableJwtAuth>,
) -> Middleware<B, ApiError> {
Middleware::pre(move |req| async move {
if let Some(auth) = provide_auth(&req) {
@@ -400,9 +400,11 @@ pub fn auth_middleware<B: hyper::body::HttpBody + Send + Sync + 'static>(
})?;
let token = parse_token(header_value)?;
let data = auth
.decode(token)
.map_err(|_| ApiError::Unauthorized("malformed jwt token".to_string()))?;
let data = auth.decode(token).map_err(|err| {
warn!("Authentication error: {err}");
// Rely on From<AuthError> for ApiError impl
err
})?;
req.set_context(data.claims);
}
None => {
@@ -450,12 +452,11 @@ where
pub fn check_permission_with(
req: &Request<Body>,
check_permission: impl Fn(&Claims) -> Result<(), anyhow::Error>,
check_permission: impl Fn(&Claims) -> Result<(), AuthError>,
) -> Result<(), ApiError> {
match req.context::<Claims>() {
Some(claims) => {
Ok(check_permission(&claims).map_err(|err| ApiError::Forbidden(err.to_string()))?)
}
Some(claims) => Ok(check_permission(&claims)
.map_err(|_err| ApiError::Forbidden("JWT authentication error".to_string()))?),
None => Ok(()), // claims is None because auth is disabled
}
}

View File

@@ -3,7 +3,7 @@ use serde::{Deserialize, Serialize};
use std::borrow::Cow;
use std::error::Error as StdError;
use thiserror::Error;
use tracing::{error, info};
use tracing::{error, info, warn};
#[derive(Debug, Error)]
pub enum ApiError {
@@ -118,6 +118,9 @@ pub fn api_error_handler(api_error: ApiError) -> Response<Body> {
// Print a stack trace for Internal Server errors
match api_error {
ApiError::Forbidden(_) | ApiError::Unauthorized(_) => {
warn!("Error processing HTTP request: {api_error:#}")
}
ApiError::ResourceUnavailable(_) => info!("Error processing HTTP request: {api_error:#}"),
ApiError::NotFound(_) => info!("Error processing HTTP request: {api_error:#}"),
ApiError::InternalServerError(_) => error!("Error processing HTTP request: {api_error:?}"),

View File

@@ -25,8 +25,12 @@ pub async fn json_request_or_empty_body<T: for<'de> Deserialize<'de>>(
if body.remaining() == 0 {
return Ok(None);
}
serde_json::from_reader(body.reader())
.context("Failed to parse json request")
let mut deser = serde_json::de::Deserializer::from_reader(body.reader());
serde_path_to_error::deserialize(&mut deser)
// intentionally stringify because the debug version is not helpful in python logs
.map_err(|e| anyhow::anyhow!("Failed to parse json request: {e}"))
.map(Some)
.map_err(ApiError::BadRequest)
}
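The switch to `serde_path_to_error` means a parse failure names the offending field instead of only carrying the bare serde message. A small standalone sketch of that behaviour (the struct and field names here are made up for illustration):

```rust
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct ExampleRequest {
    tenant_id: String,
    pg_version: u32,
}

fn main() {
    // "pg_version" has the wrong type, so the error points at that field.
    let body = r#"{"tenant_id": "abc", "pg_version": "fifteen"}"#;
    let mut deser = serde_json::Deserializer::from_str(body);
    let err = serde_path_to_error::deserialize::<_, ExampleRequest>(&mut deser).unwrap_err();
    assert_eq!(err.path().to_string(), "pg_version");
    // Matches how the handler stringifies it for readable (e.g. python test) logs.
    println!("Failed to parse json request: {err}");
}
```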

View File

@@ -3,6 +3,7 @@ use std::{fmt, str::FromStr};
use anyhow::Context;
use hex::FromHex;
use rand::Rng;
use serde::de::Visitor;
use serde::{Deserialize, Serialize};
use thiserror::Error;
@@ -17,12 +18,74 @@ pub enum IdError {
///
/// NOTE: It (de)serializes as an array of hex bytes, so the string representation would look
/// like `[173,80,132,115,129,226,72,254,170,201,135,108,199,26,228,24]`.
///
/// Use `#[serde_as(as = "DisplayFromStr")]` to (de)serialize it as hex string instead: `ad50847381e248feaac9876cc71ae418`.
/// Check the `serde_with::serde_as` documentation for options for more complex types.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Serialize, Deserialize, PartialOrd, Ord)]
#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct Id([u8; 16]);
impl Serialize for Id {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if serializer.is_human_readable() {
serializer.collect_str(self)
} else {
self.0.serialize(serializer)
}
}
}
impl<'de> Deserialize<'de> for Id {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
struct IdVisitor {
is_human_readable_deserializer: bool,
}
impl<'de> Visitor<'de> for IdVisitor {
type Value = Id;
fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
if self.is_human_readable_deserializer {
formatter.write_str("value in form of hex string")
} else {
formatter.write_str("value in form of integer array([u8; 16])")
}
}
fn visit_seq<A>(self, seq: A) -> Result<Self::Value, A::Error>
where
A: serde::de::SeqAccess<'de>,
{
let s = serde::de::value::SeqAccessDeserializer::new(seq);
let id: [u8; 16] = Deserialize::deserialize(s)?;
Ok(Id::from(id))
}
fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
where
E: serde::de::Error,
{
Id::from_str(v).map_err(E::custom)
}
}
if deserializer.is_human_readable() {
deserializer.deserialize_str(IdVisitor {
is_human_readable_deserializer: true,
})
} else {
deserializer.deserialize_tuple(
16,
IdVisitor {
is_human_readable_deserializer: false,
},
)
}
}
}
impl Id {
pub fn get_from_buf(buf: &mut impl bytes::Buf) -> Id {
let mut arr = [0u8; 16];
@@ -57,6 +120,8 @@ impl Id {
chunk[0] = HEX[((b >> 4) & 0xf) as usize];
chunk[1] = HEX[(b & 0xf) as usize];
}
// SAFETY: vec constructed out of `HEX`, it can only be ascii
unsafe { String::from_utf8_unchecked(buf) }
}
}
@@ -308,3 +373,112 @@ impl fmt::Display for NodeId {
write!(f, "{}", self.0)
}
}
#[cfg(test)]
mod tests {
use serde_assert::{Deserializer, Serializer, Token, Tokens};
use crate::bin_ser::BeSer;
use super::*;
#[test]
fn test_id_serde_non_human_readable() {
let original_id = Id([
173, 80, 132, 115, 129, 226, 72, 254, 170, 201, 135, 108, 199, 26, 228, 24,
]);
let expected_tokens = Tokens(vec![
Token::Tuple { len: 16 },
Token::U8(173),
Token::U8(80),
Token::U8(132),
Token::U8(115),
Token::U8(129),
Token::U8(226),
Token::U8(72),
Token::U8(254),
Token::U8(170),
Token::U8(201),
Token::U8(135),
Token::U8(108),
Token::U8(199),
Token::U8(26),
Token::U8(228),
Token::U8(24),
Token::TupleEnd,
]);
let serializer = Serializer::builder().is_human_readable(false).build();
let serialized_tokens = original_id.serialize(&serializer).unwrap();
assert_eq!(serialized_tokens, expected_tokens);
let mut deserializer = Deserializer::builder()
.is_human_readable(false)
.tokens(serialized_tokens)
.build();
let deserialized_id = Id::deserialize(&mut deserializer).unwrap();
assert_eq!(deserialized_id, original_id);
}
#[test]
fn test_id_serde_human_readable() {
let original_id = Id([
173, 80, 132, 115, 129, 226, 72, 254, 170, 201, 135, 108, 199, 26, 228, 24,
]);
let expected_tokens = Tokens(vec![Token::Str(String::from(
"ad50847381e248feaac9876cc71ae418",
))]);
let serializer = Serializer::builder().is_human_readable(true).build();
let serialized_tokens = original_id.serialize(&serializer).unwrap();
assert_eq!(serialized_tokens, expected_tokens);
let mut deserializer = Deserializer::builder()
.is_human_readable(true)
.tokens(Tokens(vec![Token::Str(String::from(
"ad50847381e248feaac9876cc71ae418",
))]))
.build();
assert_eq!(Id::deserialize(&mut deserializer).unwrap(), original_id);
}
macro_rules! roundtrip_type {
($type:ty, $expected_bytes:expr) => {{
let expected_bytes: [u8; 16] = $expected_bytes;
let original_id = <$type>::from(expected_bytes);
let ser_bytes = original_id.ser().unwrap();
assert_eq!(ser_bytes, expected_bytes);
let des_id = <$type>::des(&ser_bytes).unwrap();
assert_eq!(des_id, original_id);
}};
}
#[test]
fn test_id_bincode_serde() {
let expected_bytes = [
173, 80, 132, 115, 129, 226, 72, 254, 170, 201, 135, 108, 199, 26, 228, 24,
];
roundtrip_type!(Id, expected_bytes);
}
#[test]
fn test_tenant_id_bincode_serde() {
let expected_bytes = [
173, 80, 132, 115, 129, 226, 72, 254, 170, 201, 135, 108, 199, 26, 228, 24,
];
roundtrip_type!(TenantId, expected_bytes);
}
#[test]
fn test_timeline_id_bincode_serde() {
let expected_bytes = [
173, 80, 132, 115, 129, 226, 72, 254, 170, 201, 135, 108, 199, 26, 228, 24,
];
roundtrip_type!(TimelineId, expected_bytes);
}
}

View File

@@ -1,5 +1,6 @@
//! `utils` is intended to be a place to put code that is shared
//! between other crates in this repository.
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod backoff;
@@ -24,6 +25,10 @@ pub mod auth;
// utility functions and helper traits for unified unique id generation/serialization etc.
pub mod id;
mod hex;
pub use hex::Hex;
// http endpoint utils
pub mod http;
@@ -73,6 +78,9 @@ pub mod completion;
/// Reporting utilities
pub mod error;
/// async timeout helper
pub mod timeout;
pub mod sync;
/// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages

View File

@@ -1,6 +1,7 @@
use std::str::FromStr;
use anyhow::Context;
use metrics::{IntCounter, IntCounterVec};
use once_cell::sync::Lazy;
use strum_macros::{EnumString, EnumVariantNames};
@@ -24,16 +25,48 @@ impl LogFormat {
}
}
static TRACING_EVENT_COUNT: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
metrics::register_int_counter_vec!(
struct TracingEventCountMetric {
error: IntCounter,
warn: IntCounter,
info: IntCounter,
debug: IntCounter,
trace: IntCounter,
}
static TRACING_EVENT_COUNT_METRIC: Lazy<TracingEventCountMetric> = Lazy::new(|| {
let vec = metrics::register_int_counter_vec!(
"libmetrics_tracing_event_count",
"Number of tracing events, by level",
&["level"]
)
.expect("failed to define metric")
.expect("failed to define metric");
TracingEventCountMetric::new(vec)
});
struct TracingEventCountLayer(&'static metrics::IntCounterVec);
impl TracingEventCountMetric {
fn new(vec: IntCounterVec) -> Self {
Self {
error: vec.with_label_values(&["error"]),
warn: vec.with_label_values(&["warn"]),
info: vec.with_label_values(&["info"]),
debug: vec.with_label_values(&["debug"]),
trace: vec.with_label_values(&["trace"]),
}
}
fn inc_for_level(&self, level: tracing::Level) {
let counter = match level {
tracing::Level::ERROR => &self.error,
tracing::Level::WARN => &self.warn,
tracing::Level::INFO => &self.info,
tracing::Level::DEBUG => &self.debug,
tracing::Level::TRACE => &self.trace,
};
counter.inc();
}
}
struct TracingEventCountLayer(&'static TracingEventCountMetric);
impl<S> tracing_subscriber::layer::Layer<S> for TracingEventCountLayer
where
@@ -44,15 +77,7 @@ where
event: &tracing::Event<'_>,
_ctx: tracing_subscriber::layer::Context<'_, S>,
) {
let level = event.metadata().level();
let level = match *level {
tracing::Level::ERROR => "error",
tracing::Level::WARN => "warn",
tracing::Level::INFO => "info",
tracing::Level::DEBUG => "debug",
tracing::Level::TRACE => "trace",
};
self.0.with_label_values(&[level]).inc();
self.0.inc_for_level(*event.metadata().level());
}
}
@@ -66,9 +91,17 @@ pub enum TracingErrorLayerEnablement {
EnableWithRustLogFilter,
}
/// Where the logging should output to.
#[derive(Clone, Copy)]
pub enum Output {
Stdout,
Stderr,
}
pub fn init(
log_format: LogFormat,
tracing_error_layer_enablement: TracingErrorLayerEnablement,
output: Output,
) -> anyhow::Result<()> {
// We fall back to printing all spans at info-level or above if
// the RUST_LOG environment variable is not set.
@@ -85,7 +118,12 @@ pub fn init(
let log_layer = tracing_subscriber::fmt::layer()
.with_target(false)
.with_ansi(false)
.with_writer(std::io::stdout);
.with_writer(move || -> Box<dyn std::io::Write> {
match output {
Output::Stdout => Box::new(std::io::stdout()),
Output::Stderr => Box::new(std::io::stderr()),
}
});
let log_layer = match log_format {
LogFormat::Json => log_layer.json().boxed(),
LogFormat::Plain => log_layer.boxed(),
@@ -93,7 +131,9 @@ pub fn init(
};
log_layer.with_filter(rust_log_env_filter())
});
let r = r.with(TracingEventCountLayer(&TRACING_EVENT_COUNT).with_filter(rust_log_env_filter()));
let r = r.with(
TracingEventCountLayer(&TRACING_EVENT_COUNT_METRIC).with_filter(rust_log_env_filter()),
);
match tracing_error_layer_enablement {
TracingErrorLayerEnablement::EnableWithRustLogFilter => r
.with(tracing_error::ErrorLayer::default().with_filter(rust_log_env_filter()))
@@ -244,14 +284,14 @@ impl std::fmt::Debug for SecretString {
mod tests {
use metrics::{core::Opts, IntCounterVec};
use super::TracingEventCountLayer;
use crate::logging::{TracingEventCountLayer, TracingEventCountMetric};
#[test]
fn tracing_event_count_metric() {
let counter_vec =
IntCounterVec::new(Opts::new("testmetric", "testhelp"), &["level"]).unwrap();
let counter_vec = Box::leak(Box::new(counter_vec)); // make it 'static
let layer = TracingEventCountLayer(counter_vec);
let metric = Box::leak(Box::new(TracingEventCountMetric::new(counter_vec.clone())));
let layer = TracingEventCountLayer(metric);
use tracing_subscriber::prelude::*;
tracing::subscriber::with_default(tracing_subscriber::registry().with(layer), || {

View File

@@ -1,7 +1,7 @@
#![warn(missing_docs)]
use camino::Utf8Path;
use serde::{Deserialize, Serialize};
use serde::{de::Visitor, Deserialize, Serialize};
use std::fmt;
use std::ops::{Add, AddAssign};
use std::str::FromStr;
@@ -13,10 +13,114 @@ use crate::seqwait::MonotonicCounter;
pub const XLOG_BLCKSZ: u32 = 8192;
/// A Postgres LSN (Log Sequence Number), also known as an XLogRecPtr
#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Hash, Serialize, Deserialize)]
#[serde(transparent)]
#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Hash)]
pub struct Lsn(pub u64);
impl Serialize for Lsn {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if serializer.is_human_readable() {
serializer.collect_str(self)
} else {
self.0.serialize(serializer)
}
}
}
impl<'de> Deserialize<'de> for Lsn {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
struct LsnVisitor {
is_human_readable_deserializer: bool,
}
impl<'de> Visitor<'de> for LsnVisitor {
type Value = Lsn;
fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
if self.is_human_readable_deserializer {
formatter.write_str(
"value in form of hex string({upper_u32_hex}/{lower_u32_hex}) representing u64 integer",
)
} else {
formatter.write_str("value in form of integer(u64)")
}
}
fn visit_u64<E>(self, v: u64) -> Result<Self::Value, E>
where
E: serde::de::Error,
{
Ok(Lsn(v))
}
fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
where
E: serde::de::Error,
{
Lsn::from_str(v).map_err(|e| E::custom(e))
}
}
if deserializer.is_human_readable() {
deserializer.deserialize_str(LsnVisitor {
is_human_readable_deserializer: true,
})
} else {
deserializer.deserialize_u64(LsnVisitor {
is_human_readable_deserializer: false,
})
}
}
}
/// Allows (de)serialization of an `Lsn` always as `u64`.
///
/// ### Example
///
/// ```rust
/// # use serde::{Serialize, Deserialize};
/// use utils::lsn::Lsn;
///
/// #[derive(PartialEq, Serialize, Deserialize, Debug)]
/// struct Foo {
/// #[serde(with = "utils::lsn::serde_as_u64")]
/// always_u64: Lsn,
/// }
///
/// let orig = Foo { always_u64: Lsn(1234) };
///
/// let res = serde_json::to_string(&orig).unwrap();
/// assert_eq!(res, r#"{"always_u64":1234}"#);
///
/// let foo = serde_json::from_str::<Foo>(&res).unwrap();
/// assert_eq!(foo, orig);
/// ```
///
pub mod serde_as_u64 {
use super::Lsn;
/// Serializes the Lsn as u64 disregarding the human readability of the format.
///
/// Meant to be used via `#[serde(with = "...")]` or `#[serde(serialize_with = "...")]`.
pub fn serialize<S: serde::Serializer>(lsn: &Lsn, serializer: S) -> Result<S::Ok, S::Error> {
use serde::Serialize;
lsn.0.serialize(serializer)
}
/// Deserializes the Lsn as u64 disregarding the human readability of the format.
///
/// Meant to be used via `#[serde(with = "...")]` or `#[serde(deserialize_with = "...")]`.
pub fn deserialize<'de, D: serde::Deserializer<'de>>(deserializer: D) -> Result<Lsn, D::Error> {
use serde::Deserialize;
u64::deserialize(deserializer).map(Lsn)
}
}
/// We tried to parse an LSN from a string, but failed
#[derive(Debug, PartialEq, Eq, thiserror::Error)]
#[error("LsnParseError")]
@@ -264,8 +368,13 @@ impl MonotonicCounter<Lsn> for RecordLsn {
#[cfg(test)]
mod tests {
use crate::bin_ser::BeSer;
use super::*;
use serde::ser::Serialize;
use serde_assert::{Deserializer, Serializer, Token, Tokens};
#[test]
fn test_lsn_strings() {
assert_eq!("12345678/AAAA5555".parse(), Ok(Lsn(0x12345678AAAA5555)));
@@ -341,4 +450,95 @@ mod tests {
assert_eq!(lsn.fetch_max(Lsn(6000)), Lsn(5678));
assert_eq!(lsn.fetch_max(Lsn(5000)), Lsn(6000));
}
#[test]
fn test_lsn_serde() {
let original_lsn = Lsn(0x0123456789abcdef);
let expected_readable_tokens = Tokens(vec![Token::U64(0x0123456789abcdef)]);
let expected_non_readable_tokens =
Tokens(vec![Token::Str(String::from("1234567/89ABCDEF"))]);
// Testing human_readable ser/de
let serializer = Serializer::builder().is_human_readable(false).build();
let readable_ser_tokens = original_lsn.serialize(&serializer).unwrap();
assert_eq!(readable_ser_tokens, expected_readable_tokens);
let mut deserializer = Deserializer::builder()
.is_human_readable(false)
.tokens(readable_ser_tokens)
.build();
let des_lsn = Lsn::deserialize(&mut deserializer).unwrap();
assert_eq!(des_lsn, original_lsn);
// Testing NON human_readable ser/de
let serializer = Serializer::builder().is_human_readable(true).build();
let non_readable_ser_tokens = original_lsn.serialize(&serializer).unwrap();
assert_eq!(non_readable_ser_tokens, expected_non_readable_tokens);
let mut deserializer = Deserializer::builder()
.is_human_readable(true)
.tokens(non_readable_ser_tokens)
.build();
let des_lsn = Lsn::deserialize(&mut deserializer).unwrap();
assert_eq!(des_lsn, original_lsn);
// Testing mismatching ser/de
let serializer = Serializer::builder().is_human_readable(false).build();
let non_readable_ser_tokens = original_lsn.serialize(&serializer).unwrap();
let mut deserializer = Deserializer::builder()
.is_human_readable(true)
.tokens(non_readable_ser_tokens)
.build();
Lsn::deserialize(&mut deserializer).unwrap_err();
let serializer = Serializer::builder().is_human_readable(true).build();
let readable_ser_tokens = original_lsn.serialize(&serializer).unwrap();
let mut deserializer = Deserializer::builder()
.is_human_readable(false)
.tokens(readable_ser_tokens)
.build();
Lsn::deserialize(&mut deserializer).unwrap_err();
}
#[test]
fn test_lsn_ensure_roundtrip() {
let original_lsn = Lsn(0xaaaabbbb);
let serializer = Serializer::builder().is_human_readable(false).build();
let ser_tokens = original_lsn.serialize(&serializer).unwrap();
let mut deserializer = Deserializer::builder()
.is_human_readable(false)
.tokens(ser_tokens)
.build();
let des_lsn = Lsn::deserialize(&mut deserializer).unwrap();
assert_eq!(des_lsn, original_lsn);
}
#[test]
fn test_lsn_bincode_serde() {
let lsn = Lsn(0x0123456789abcdef);
let expected_bytes = [0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef];
let ser_bytes = lsn.ser().unwrap();
assert_eq!(ser_bytes, expected_bytes);
let des_lsn = Lsn::des(&ser_bytes).unwrap();
assert_eq!(des_lsn, lsn);
}
#[test]
fn test_lsn_bincode_ensure_roundtrip() {
let original_lsn = Lsn(0x01_02_03_04_05_06_07_08);
let expected_bytes = vec![0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08];
let ser_bytes = original_lsn.ser().unwrap();
assert_eq!(ser_bytes, expected_bytes);
let des_lsn = Lsn::des(&ser_bytes).unwrap();
assert_eq!(des_lsn, original_lsn);
}
}

View File

@@ -3,7 +3,6 @@ use std::time::{Duration, SystemTime};
use bytes::{Buf, BufMut, Bytes, BytesMut};
use pq_proto::{read_cstr, PG_EPOCH};
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use tracing::{trace, warn};
use crate::lsn::Lsn;
@@ -15,21 +14,17 @@ use crate::lsn::Lsn;
///
/// serde Serialize is used only for human readable dump to json (e.g. in
/// safekeepers debug_dump).
#[serde_as]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub struct PageserverFeedback {
/// Last known size of the timeline. Used to enforce timeline size limit.
pub current_timeline_size: u64,
/// LSN last received and ingested by the pageserver. Controls backpressure.
#[serde_as(as = "DisplayFromStr")]
pub last_received_lsn: Lsn,
/// LSN up to which data is persisted by the pageserver to its local disc.
/// Controls backpressure.
#[serde_as(as = "DisplayFromStr")]
pub disk_consistent_lsn: Lsn,
/// LSN up to which data is persisted by the pageserver on s3; safekeepers
/// consider WAL before it can be removed.
#[serde_as(as = "DisplayFromStr")]
pub remote_consistent_lsn: Lsn,
// Serialize with RFC3339 format.
#[serde(with = "serde_systemtime")]

View File

@@ -125,6 +125,9 @@ where
// Wake everyone with an error.
let mut internal = self.internal.lock().unwrap();
// Block any future waiters from starting
internal.shutdown = true;
// This will steal the entire waiters map.
// When we drop it all waiters will be woken.
mem::take(&mut internal.waiters)

View File

@@ -1,6 +1,7 @@
/// Immediately terminate the calling process without calling
/// atexit callbacks, C runtime destructors etc. We mainly use
/// this to protect coverage data from concurrent writes.
pub fn exit_now(code: u8) {
pub fn exit_now(code: u8) -> ! {
// SAFETY: exiting is safe, the ffi is not safe
unsafe { nix::libc::_exit(code as _) };
}

View File

@@ -1,10 +1,10 @@
//!
//! RCU stands for Read-Copy-Update. It's a synchronization mechanism somewhat
//! similar to a lock, but it allows readers to "hold on" to an old value of RCU
//! without blocking writers, and allows writing a new values without blocking
//! readers. When you update the new value, the new value is immediately visible
//! without blocking writers, and allows writing a new value without blocking
//! readers. When you update the value, the new value is immediately visible
//! to new readers, but the update waits until all existing readers have
//! finishe, so that no one sees the old value anymore.
//! finished, so that on return, no one sees the old value anymore.
//!
//! This implementation isn't wait-free; it uses an RwLock that is held for a
//! short duration when the value is read or updated.
@@ -26,6 +26,7 @@
//! Increment the value by one, and wait for old readers to finish:
//!
//! ```
//! # async fn dox() {
//! # let rcu = utils::simple_rcu::Rcu::new(1);
//! let write_guard = rcu.lock_for_write();
//!
@@ -36,15 +37,17 @@
//!
//! // Concurrent reads and writes are now possible again. Wait for all the readers
//! // that still observe the old value to finish.
//! waitlist.wait();
//! waitlist.wait().await;
//! # }
//! ```
//!
#![warn(missing_docs)]
use std::ops::Deref;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::{Arc, Weak};
use std::sync::{Mutex, RwLock, RwLockWriteGuard};
use std::sync::{RwLock, RwLockWriteGuard};
use tokio::sync::watch;
///
/// Rcu allows multiple readers to read and hold onto a value without blocking
@@ -68,22 +71,21 @@ struct RcuCell<V> {
value: V,
/// A dummy channel. We never send anything to this channel. The point is
/// that when the RcuCell is dropped, any cloned Senders will be notified
/// that when the RcuCell is dropped, any subscribed Receivers will be notified
/// that the channel is closed. Updaters can use this to wait out until the
/// RcuCell has been dropped, i.e. until the old value is no longer in use.
///
/// We never do anything with the receiver, we just need to hold onto it so
/// that the Senders will be notified when it's dropped. But because it's
/// not Sync, we need a Mutex on it.
watch: (SyncSender<()>, Mutex<Receiver<()>>),
/// We never send anything to this, we just need to hold onto it so that the
/// Receivers will be notified when it's dropped.
watch: watch::Sender<()>,
}
impl<V> RcuCell<V> {
fn new(value: V) -> Self {
let (watch_sender, watch_receiver) = sync_channel(0);
let (watch_sender, _) = watch::channel(());
RcuCell {
value,
watch: (watch_sender, Mutex::new(watch_receiver)),
watch: watch_sender,
}
}
}
@@ -141,10 +143,10 @@ impl<V> Deref for RcuReadGuard<V> {
///
/// Write guard returned by `write`
///
/// NB: Holding this guard blocks all concurrent `read` and `write` calls, so
/// it should only be held for a short duration!
/// NB: Holding this guard blocks all concurrent `read` and `write` calls, so it should only be
/// held for a short duration!
///
/// Calling `store` consumes the guard, making new reads and new writes possible
/// Calling [`Self::store_and_unlock`] consumes the guard, making new reads and new writes possible
/// again.
///
pub struct RcuWriteGuard<'a, V> {
@@ -179,7 +181,7 @@ impl<'a, V> RcuWriteGuard<'a, V> {
// the watches for any that do.
self.inner.old_cells.retain(|weak| {
if let Some(cell) = weak.upgrade() {
watches.push(cell.watch.0.clone());
watches.push(cell.watch.subscribe());
true
} else {
false
@@ -193,20 +195,20 @@ impl<'a, V> RcuWriteGuard<'a, V> {
///
/// List of readers who can still see old values.
///
pub struct RcuWaitList(Vec<SyncSender<()>>);
pub struct RcuWaitList(Vec<watch::Receiver<()>>);
impl RcuWaitList {
///
/// Wait for old readers to finish.
///
pub fn wait(mut self) {
pub async fn wait(mut self) {
// after all the old_cells are no longer in use, we're done
for w in self.0.iter_mut() {
// This will block until the Receiver is closed. That happens when
// the RcuCell is dropped.
#[allow(clippy::single_match)]
match w.send(()) {
Ok(_) => panic!("send() unexpectedly succeeded on dummy channel"),
match w.changed().await {
Ok(_) => panic!("changed() unexpectedly succeeded on dummy channel"),
Err(_) => {
// closed, which means that the cell has been dropped, and
// its value is no longer in use
@@ -220,11 +222,10 @@ impl RcuWaitList {
mod tests {
use super::*;
use std::sync::{Arc, Mutex};
use std::thread::{sleep, spawn};
use std::time::Duration;
#[test]
fn two_writers() {
#[tokio::test]
async fn two_writers() {
let rcu = Rcu::new(1);
let read1 = rcu.read();
@@ -248,33 +249,35 @@ mod tests {
assert_eq!(*read1, 1);
let log = Arc::new(Mutex::new(Vec::new()));
// Wait for the old readers to finish in separate threads.
// Wait for the old readers to finish in separate tasks.
let log_clone = Arc::clone(&log);
let thread2 = spawn(move || {
wait2.wait();
let task2 = tokio::spawn(async move {
wait2.wait().await;
log_clone.lock().unwrap().push("wait2 done");
});
let log_clone = Arc::clone(&log);
let thread3 = spawn(move || {
wait3.wait();
let task3 = tokio::spawn(async move {
wait3.wait().await;
log_clone.lock().unwrap().push("wait3 done");
});
// without this sleep the test can pass on accident if the writer is slow
sleep(Duration::from_millis(500));
tokio::time::sleep(Duration::from_millis(100)).await;
// Release first reader. This allows first write to finish, but calling
// wait() on the second one would still block.
// wait() on the 'task3' would still block.
log.lock().unwrap().push("dropping read1");
drop(read1);
thread2.join().unwrap();
task2.await.unwrap();
sleep(Duration::from_millis(500));
assert!(!task3.is_finished());
tokio::time::sleep(Duration::from_millis(100)).await;
// Release second reader, and finish second writer.
log.lock().unwrap().push("dropping read2");
drop(read2);
thread3.join().unwrap();
task3.await.unwrap();
assert_eq!(
log.lock().unwrap().as_slice(),
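To make the new async API concrete, here is a minimal usage sketch that combines the doc-comment example and the `two_writers` test above. It is not part of the diff; it assumes only the names visible here (`Rcu::new`, `read`, `lock_for_write`, `store_and_unlock`, `wait`), including the assumption that `store_and_unlock(new_value)` returns the `RcuWaitList`, and `rcu_example` is an illustrative name.

```rust
// Sketch only: exercises the tokio-based RcuWaitList shown in this diff.
async fn rcu_example() {
    let rcu = utils::simple_rcu::Rcu::new(1);

    let read_guard = rcu.read(); // holds the current value without blocking writers for long

    let write_guard = rcu.lock_for_write(); // briefly blocks new reads/writes
    let waitlist = write_guard.store_and_unlock(2);

    // New readers already observe the new value; the old guard still sees 1.
    assert_eq!(*rcu.read(), 2);
    assert_eq!(*read_guard, 1);

    drop(read_guard);
    waitlist.wait().await; // completes once no reader can see the old value anymore
}
```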


@@ -1 +1,3 @@
pub mod heavier_once_cell;
pub mod gate;

libs/utils/src/sync/gate.rs

@@ -0,0 +1,172 @@
use std::{sync::Arc, time::Duration};
/// Gates are a concurrency helper, primarily used for implementing safe shutdown.
///
/// Users of a resource call `enter()` to acquire a GateGuard, and the owner of
/// the resource calls `close()` when they want to ensure that all holders of guards
/// have released them, and that no future guards will be issued.
pub struct Gate {
/// Each caller of enter() takes one unit from the semaphore. In close(), we
/// take all the units to ensure all GateGuards are destroyed.
sem: Arc<tokio::sync::Semaphore>,
/// For observability only: a name that will be used to log warnings if a particular
/// gate is holding up shutdown
name: String,
}
/// RAII guard for a [`Gate`]: as long as this exists, calls to [`Gate::close`] will
/// not complete.
#[derive(Debug)]
pub struct GateGuard(tokio::sync::OwnedSemaphorePermit);
/// Observability helper: every `warn_period`, emit a log warning that we're still waiting on this gate
async fn warn_if_stuck<Fut: std::future::Future>(
fut: Fut,
name: &str,
warn_period: std::time::Duration,
) -> <Fut as std::future::Future>::Output {
let started = std::time::Instant::now();
let mut fut = std::pin::pin!(fut);
let mut warned = false;
let ret = loop {
match tokio::time::timeout(warn_period, &mut fut).await {
Ok(ret) => break ret,
Err(_) => {
tracing::warn!(
gate = name,
elapsed_ms = started.elapsed().as_millis(),
"still waiting, taking longer than expected..."
);
warned = true;
}
}
};
// If we emitted a warning for slowness, also emit a message when we complete, so that
// someone debugging a shutdown can know for sure whether we have moved past this operation.
if warned {
tracing::info!(
gate = name,
elapsed_ms = started.elapsed().as_millis(),
"completed, after taking longer than expected"
)
}
ret
}
#[derive(Debug)]
pub enum GateError {
GateClosed,
}
impl Gate {
const MAX_UNITS: u32 = u32::MAX;
pub fn new(name: String) -> Self {
Self {
sem: Arc::new(tokio::sync::Semaphore::new(Self::MAX_UNITS as usize)),
name,
}
}
/// Acquire a guard that will prevent close() calls from completing. If close()
/// was already called, this will return an error which should be interpreted
/// as "shutting down".
///
/// This function would typically be used from e.g. request handlers. While holding
/// the guard returned from this function, it is important to respect a CancellationToken
/// to avoid blocking close() indefinitely: typically types that contain a Gate will
/// also contain a CancellationToken.
pub fn enter(&self) -> Result<GateGuard, GateError> {
self.sem
.clone()
.try_acquire_owned()
.map(GateGuard)
.map_err(|_| GateError::GateClosed)
}
/// Types with a shutdown() method and a gate should call this method at the
/// end of shutdown, to ensure that all GateGuard holders are done.
///
/// This will wait for all guards to be destroyed. For this to complete promptly, it is
/// important that the holders of such guards are respecting a CancellationToken which has
/// been cancelled before entering this function.
pub async fn close(&self) {
warn_if_stuck(self.do_close(), &self.name, Duration::from_millis(1000)).await
}
/// Check if [`Self::close()`] has finished waiting for all [`Self::enter()`] users to finish. This
/// is usually analogous to "Did shutdown finish?" for types that include a Gate, whereas checking
/// the CancellationToken on such types is analogous to "Did shutdown start?"
pub fn close_complete(&self) -> bool {
self.sem.is_closed()
}
async fn do_close(&self) {
tracing::debug!(gate = self.name, "Closing Gate...");
match self.sem.acquire_many(Self::MAX_UNITS).await {
Ok(_units) => {
// While holding all units, close the semaphore. All subsequent calls to enter() will fail.
self.sem.close();
}
Err(_) => {
// Semaphore closed: we are the only function that can do this, so it indicates a double-call.
// This is legal. Timeline::shutdown for example is not protected from being called more than
// once.
tracing::debug!(gate = self.name, "Double close")
}
}
tracing::debug!(gate = self.name, "Closed Gate.")
}
}
#[cfg(test)]
mod tests {
use futures::FutureExt;
use super::*;
#[tokio::test]
async fn test_idle_gate() {
// Having taken no gates, we should not be blocked in close
let gate = Gate::new("test".to_string());
gate.close().await;
// If a guard is dropped before entering, close should not be blocked
let gate = Gate::new("test".to_string());
let guard = gate.enter().unwrap();
drop(guard);
gate.close().await;
// Entering a closed guard fails
gate.enter().expect_err("enter should fail after close");
}
#[tokio::test]
async fn test_busy_gate() {
let gate = Gate::new("test".to_string());
let guard = gate.enter().unwrap();
let mut close_fut = std::pin::pin!(gate.close());
// Close should be blocked
assert!(close_fut.as_mut().now_or_never().is_none());
// Attempting to enter() should fail, even though close isn't done yet.
gate.enter()
.expect_err("enter should fail after entering close");
drop(guard);
// Guard is gone, close should finish
assert!(close_fut.as_mut().now_or_never().is_some());
// Attempting to enter() is still forbidden
gate.enter().expect_err("enter should fail finishing close");
}
}
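The doc comments above describe pairing a `Gate` with a `CancellationToken` so that guard holders never block `close()`. The sketch below illustrates that pattern; it assumes `Gate` and `GateError` from this new module are in scope, and `Service`, `handle_request`, and `shutdown` are illustrative names, not part of the diff.

```rust
// Sketch only: a type that owns a Gate plus a CancellationToken, as recommended above.
use tokio_util::sync::CancellationToken;

struct Service {
    gate: Gate,
    cancel: CancellationToken,
}

impl Service {
    fn new() -> Self {
        Self {
            gate: Gate::new("service".to_string()),
            cancel: CancellationToken::new(),
        }
    }

    async fn handle_request(&self) -> Result<(), GateError> {
        // Fails with GateError::GateClosed once close() has been called.
        let _guard = self.gate.enter()?;
        tokio::select! {
            // Respect the CancellationToken so this guard never holds up close().
            _ = self.cancel.cancelled() => {}
            // Stand-in for the actual request work.
            _ = tokio::time::sleep(std::time::Duration::from_millis(10)) => {}
        }
        Ok(())
    }

    async fn shutdown(&self) {
        self.cancel.cancel();                      // "did shutdown start?"
        self.gate.close().await;                   // waits for every GateGuard to drop
        debug_assert!(self.gate.close_complete()); // "did shutdown finish?"
    }
}
```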


@@ -1,4 +1,7 @@
use std::sync::{Arc, Mutex, MutexGuard};
use std::sync::{
atomic::{AtomicUsize, Ordering},
Arc, Mutex, MutexGuard,
};
use tokio::sync::Semaphore;
/// Custom design like [`tokio::sync::OnceCell`] but using [`OwnedSemaphorePermit`] instead of
@@ -10,6 +13,7 @@ use tokio::sync::Semaphore;
/// [`OwnedSemaphorePermit`]: tokio::sync::OwnedSemaphorePermit
pub struct OnceCell<T> {
inner: Mutex<Inner<T>>,
initializers: AtomicUsize,
}
impl<T> Default for OnceCell<T> {
@@ -17,6 +21,7 @@ impl<T> Default for OnceCell<T> {
fn default() -> Self {
Self {
inner: Default::default(),
initializers: AtomicUsize::new(0),
}
}
}
@@ -49,6 +54,7 @@ impl<T> OnceCell<T> {
init_semaphore: Arc::new(sem),
value: Some(value),
}),
initializers: AtomicUsize::new(0),
}
}
@@ -60,8 +66,8 @@ impl<T> OnceCell<T> {
/// Initialization is panic-safe and cancellation-safe.
pub async fn get_or_init<F, Fut, E>(&self, factory: F) -> Result<Guard<'_, T>, E>
where
F: FnOnce() -> Fut,
Fut: std::future::Future<Output = Result<T, E>>,
F: FnOnce(InitPermit) -> Fut,
Fut: std::future::Future<Output = Result<(T, InitPermit), E>>,
{
let sem = {
let guard = self.inner.lock().unwrap();
@@ -71,29 +77,61 @@ impl<T> OnceCell<T> {
guard.init_semaphore.clone()
};
let permit = sem.acquire_owned().await;
if permit.is_err() {
let guard = self.inner.lock().unwrap();
assert!(
guard.value.is_some(),
"semaphore got closed, must be initialized"
);
return Ok(Guard(guard));
} else {
// now we try
let value = factory().await?;
let permit = {
// increment the count for the duration of queued
let _guard = CountWaitingInitializers::start(self);
sem.acquire_owned().await
};
let mut guard = self.inner.lock().unwrap();
assert!(
guard.value.is_none(),
"we won permit, must not be initialized"
);
guard.value = Some(value);
guard.init_semaphore.close();
Ok(Guard(guard))
match permit {
Ok(permit) => {
let permit = InitPermit(permit);
let (value, _permit) = factory(permit).await?;
let guard = self.inner.lock().unwrap();
Ok(Self::set0(value, guard))
}
Err(_closed) => {
let guard = self.inner.lock().unwrap();
assert!(
guard.value.is_some(),
"semaphore got closed, must be initialized"
);
return Ok(Guard(guard));
}
}
}
/// Assuming a permit is held after a previous call to [`Guard::take_and_deinit`], it can be used
/// to complete initializing the inner value.
///
/// # Panics
///
/// If the inner has already been initialized.
pub fn set(&self, value: T, _permit: InitPermit) -> Guard<'_, T> {
let guard = self.inner.lock().unwrap();
// cannot assert that this permit is for self.inner.init_semaphore, but we can assert it cannot
// give more permits right now.
if guard.init_semaphore.try_acquire().is_ok() {
drop(guard);
panic!("permit is of wrong origin");
}
Self::set0(value, guard)
}
fn set0(value: T, mut guard: std::sync::MutexGuard<'_, Inner<T>>) -> Guard<'_, T> {
if guard.value.is_some() {
drop(guard);
unreachable!("we won permit, must not be initialized");
}
guard.value = Some(value);
guard.init_semaphore.close();
Guard(guard)
}
/// Returns a guard to an existing initialized value, if any.
pub fn get(&self) -> Option<Guard<'_, T>> {
let guard = self.inner.lock().unwrap();
@@ -103,6 +141,28 @@ impl<T> OnceCell<T> {
None
}
}
/// Return the number of [`Self::get_or_init`] calls waiting for initialization to complete.
pub fn initializer_count(&self) -> usize {
self.initializers.load(Ordering::Relaxed)
}
}
/// Drop-guard counter for queued tasks waiting to initialize; mainly useful to the
/// initializing task, for example at the end of initialization.
struct CountWaitingInitializers<'a, T>(&'a OnceCell<T>);
impl<'a, T> CountWaitingInitializers<'a, T> {
fn start(target: &'a OnceCell<T>) -> Self {
target.initializers.fetch_add(1, Ordering::Relaxed);
CountWaitingInitializers(target)
}
}
impl<'a, T> Drop for CountWaitingInitializers<'a, T> {
fn drop(&mut self) {
self.0.initializers.fetch_sub(1, Ordering::Relaxed);
}
}
/// Uninteresting guard object to allow short-lived access to inspect or clone the held,
@@ -135,7 +195,7 @@ impl<'a, T> Guard<'a, T> {
///
/// The permit will be on a semaphore part of the new internal value, and any following
/// [`OnceCell::get_or_init`] will wait on it to complete.
pub fn take_and_deinit(&mut self) -> (T, tokio::sync::OwnedSemaphorePermit) {
pub fn take_and_deinit(&mut self) -> (T, InitPermit) {
let mut swapped = Inner::default();
let permit = swapped
.init_semaphore
@@ -145,11 +205,14 @@ impl<'a, T> Guard<'a, T> {
std::mem::swap(&mut *self.0, &mut swapped);
swapped
.value
.map(|v| (v, permit))
.map(|v| (v, InitPermit(permit)))
.expect("guard is not created unless value has been initialized")
}
}
/// Type held by OnceCell (de)initializing task.
pub struct InitPermit(tokio::sync::OwnedSemaphorePermit);
#[cfg(test)]
mod tests {
use super::*;
@@ -185,11 +248,11 @@ mod tests {
barrier.wait().await;
let won = {
let g = cell
.get_or_init(|| {
.get_or_init(|permit| {
counters.factory_got_to_run.fetch_add(1, Ordering::Relaxed);
async {
counters.future_polled.fetch_add(1, Ordering::Relaxed);
Ok::<_, Infallible>(i)
Ok::<_, Infallible>((i, permit))
}
})
.await
@@ -243,7 +306,7 @@ mod tests {
deinitialization_started.wait().await;
let started_at = tokio::time::Instant::now();
cell.get_or_init(|| async { Ok::<_, Infallible>(reinit) })
cell.get_or_init(|permit| async { Ok::<_, Infallible>((reinit, permit)) })
.await
.unwrap();
@@ -258,18 +321,32 @@ mod tests {
assert_eq!(*cell.get().unwrap(), reinit);
}
#[test]
fn reinit_with_deinit_permit() {
let cell = Arc::new(OnceCell::new(42));
let (mol, permit) = cell.get().unwrap().take_and_deinit();
cell.set(5, permit);
assert_eq!(*cell.get().unwrap(), 5);
let (five, permit) = cell.get().unwrap().take_and_deinit();
assert_eq!(5, five);
cell.set(mol, permit);
assert_eq!(*cell.get().unwrap(), 42);
}
#[tokio::test]
async fn initialization_attemptable_until_ok() {
let cell = OnceCell::default();
for _ in 0..10 {
cell.get_or_init(|| async { Err("whatever error") })
cell.get_or_init(|_permit| async { Err("whatever error") })
.await
.unwrap_err();
}
let g = cell
.get_or_init(|| async { Ok::<_, Infallible>("finally success") })
.get_or_init(|permit| async { Ok::<_, Infallible>(("finally success", permit)) })
.await
.unwrap();
assert_eq!(*g, "finally success");
@@ -281,11 +358,11 @@ mod tests {
let barrier = tokio::sync::Barrier::new(2);
let initializer = cell.get_or_init(|| async {
let initializer = cell.get_or_init(|permit| async {
barrier.wait().await;
futures::future::pending::<()>().await;
Ok::<_, Infallible>("never reached")
Ok::<_, Infallible>(("never reached", permit))
});
tokio::select! {
@@ -298,7 +375,7 @@ mod tests {
assert!(cell.get().is_none());
let g = cell
.get_or_init(|| async { Ok::<_, Infallible>("now initialized") })
.get_or_init(|permit| async { Ok::<_, Infallible>(("now initialized", permit)) })
.await
.unwrap();
assert_eq!(*g, "now initialized");
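Pulling the pieces of the changed contract together: the factory passed to `get_or_init` now receives an `InitPermit` and must hand it back alongside the value, and a permit obtained from `take_and_deinit` can later be fed to `set`. The sketch below is based on the tests above and assumes the `OnceCell`, `Guard`, and `InitPermit` of this module; `roundtrip` is an illustrative name.

```rust
// Sketch only, not part of the diff.
use std::convert::Infallible;

async fn roundtrip(cell: &OnceCell<String>) {
    // No one is waiting on initialization yet.
    assert_eq!(cell.initializer_count(), 0);

    let guard = cell
        .get_or_init(|permit| async move {
            // ... fallible async initialization would go here ...
            Ok::<_, Infallible>(("initialized".to_string(), permit))
        })
        .await
        .unwrap();
    assert_eq!(*guard, "initialized");
    drop(guard); // release the inner mutex before touching the cell again

    // Deinitialize while keeping the permit; concurrent get_or_init callers
    // will wait on it until set() completes the re-initialization.
    let (old, permit) = cell.get().unwrap().take_and_deinit();
    cell.set(format!("re-{old}"), permit);
    assert_eq!(*cell.get().unwrap(), "re-initialized");
}
```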

libs/utils/src/timeout.rs

@@ -0,0 +1,40 @@
use std::time::Duration;
use tokio_util::sync::CancellationToken;
#[derive(thiserror::Error, Debug)]
pub enum TimeoutCancellableError {
#[error("Timed out")]
Timeout,
#[error("Cancelled")]
Cancelled,
}
/// Wrap [`tokio::time::timeout`] with a CancellationToken.
///
/// This wrapper is appropriate for any long running operation in a task
/// that ought to respect a CancellationToken (which means most tasks).
///
/// The only time you should use a bare tokio::timeout is when the future `F`
/// itself respects a CancellationToken: otherwise, always use this wrapper
/// with your CancellationToken to ensure that your task does not hold up
/// graceful shutdown.
pub async fn timeout_cancellable<F>(
duration: Duration,
cancel: &CancellationToken,
future: F,
) -> Result<F::Output, TimeoutCancellableError>
where
F: std::future::Future,
{
tokio::select!(
r = tokio::time::timeout(duration, future) => {
r.map_err(|_| TimeoutCancellableError::Timeout)
},
_ = cancel.cancelled() => {
Err(TimeoutCancellableError::Cancelled)
}
)
}
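A small usage sketch of the new helper, racing some work against both a timeout and a `CancellationToken`. It assumes only the items defined above plus tokio and tracing; `example` and the inner future are illustrative.

```rust
// Sketch only, not part of the diff.
use std::time::Duration;
use tokio_util::sync::CancellationToken;

async fn example(cancel: &CancellationToken) {
    match timeout_cancellable(Duration::from_secs(5), cancel, async {
        // Stand-in for a long-running operation that does not itself watch `cancel`.
        tokio::time::sleep(Duration::from_millis(100)).await;
        42u64
    })
    .await
    {
        Ok(n) => tracing::info!("finished with {n}"),
        Err(TimeoutCancellableError::Timeout) => tracing::warn!("timed out"),
        Err(TimeoutCancellableError::Cancelled) => tracing::info!("shutting down"),
    }
}
```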


@@ -19,13 +19,12 @@ inotify.workspace = true
serde.workspace = true
serde_json.workspace = true
sysinfo.workspace = true
tokio.workspace = true
tokio = { workspace = true, features = ["rt-multi-thread"] }
tokio-postgres.workspace = true
tokio-stream.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-subscriber.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
[target.'cfg(target_os = "linux")'.dependencies]
cgroups-rs = "0.3.3"


@@ -21,11 +21,6 @@ pub struct FileCacheState {
#[derive(Debug)]
pub struct FileCacheConfig {
/// Whether the file cache is *actually* stored in memory (e.g. by writing to
/// a tmpfs or shmem file). If true, the size of the file cache will be counted against the
/// memory available for the cgroup.
pub(crate) in_memory: bool,
/// The size of the file cache, in terms of the size of the resource it consumes
/// (currently: only memory)
///
@@ -59,22 +54,9 @@ pub struct FileCacheConfig {
spread_factor: f64,
}
impl FileCacheConfig {
pub fn default_in_memory() -> Self {
impl Default for FileCacheConfig {
fn default() -> Self {
Self {
in_memory: true,
// 75 %
resource_multiplier: 0.75,
// 640 MiB; (512 + 128)
min_remaining_after_cache: NonZeroU64::new(640 * MiB).unwrap(),
// ensure any increase in file cache size is split 90-10 with 10% to other memory
spread_factor: 0.1,
}
}
pub fn default_on_disk() -> Self {
Self {
in_memory: false,
resource_multiplier: 0.75,
// 256 MiB - lower than when in memory because overcommitting is safe; if we don't have
// memory, the kernel will just evict from its page cache, rather than e.g. killing
@@ -83,7 +65,9 @@ impl FileCacheConfig {
spread_factor: 0.1,
}
}
}
impl FileCacheConfig {
/// Make sure fields of the config are consistent.
pub fn validate(&self) -> anyhow::Result<()> {
// Single field validity
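For context on how the simplified config is consumed (see also the Runner changes further down), here is a hedged sketch. It assumes `FileCacheConfig::default()` and `validate()` as shown above, plus a `calculate_cache_size(usable_bytes) -> u64` method whose exact signature is inferred from the Runner diff below; the 4 GiB figure and `target_cache_size` name are purely illustrative.

```rust
// Sketch only: derive the target file cache size from the usable system memory.
fn target_cache_size() -> anyhow::Result<u64> {
    let config = FileCacheConfig::default(); // on-disk defaults shown above
    config.validate()?;                      // multipliers and spread_factor must be consistent
    let usable_system_memory: u64 = 4 * 1024 * 1024 * 1024; // illustrative 4 GiB
    Ok(config.calculate_cache_size(usable_system_memory))
}
```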


@@ -1,3 +1,5 @@
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
#![cfg(target_os = "linux")]
use anyhow::Context;
@@ -39,16 +41,6 @@ pub struct Args {
#[arg(short, long)]
pub pgconnstr: Option<String>,
/// Flag to signal that the Postgres file cache is on disk (i.e. not in memory aside from the
/// kernel's page cache), and therefore should not count against available memory.
//
// NB: Ideally this flag would directly refer to whether the file cache is in memory (rather
// than a roundabout way, via whether it's on disk), but in order to be backwards compatible
// during the switch away from an in-memory file cache, we had to default to the previous
// behavior.
#[arg(long)]
pub file_cache_on_disk: bool,
/// The address we should listen on for connection requests. For the
/// agent, this is 0.0.0.0:10301. For the informant, this is 127.0.0.1:10369.
#[arg(short, long)]


@@ -156,10 +156,7 @@ impl Runner {
// memory limits.
if let Some(connstr) = &args.pgconnstr {
info!("initializing file cache");
let config = match args.file_cache_on_disk {
true => FileCacheConfig::default_on_disk(),
false => FileCacheConfig::default_in_memory(),
};
let config = FileCacheConfig::default();
let mut file_cache = FileCacheState::new(connstr, config, token.clone())
.await
@@ -187,10 +184,7 @@ impl Runner {
info!("file cache size actually got set to {actual_size}")
}
if args.file_cache_on_disk {
file_cache_disk_size = actual_size;
}
file_cache_disk_size = actual_size;
state.filecache = Some(file_cache);
}
@@ -239,17 +233,11 @@ impl Runner {
let requested_mem = target.mem;
let usable_system_memory = requested_mem.saturating_sub(self.config.sys_buffer_bytes);
let (expected_file_cache_size, expected_file_cache_disk_size) = self
let expected_file_cache_size = self
.filecache
.as_ref()
.map(|file_cache| {
let size = file_cache.config.calculate_cache_size(usable_system_memory);
match file_cache.config.in_memory {
true => (size, 0),
false => (size, size),
}
})
.unwrap_or((0, 0));
.map(|file_cache| file_cache.config.calculate_cache_size(usable_system_memory))
.unwrap_or(0);
if let Some(cgroup) = &self.cgroup {
let (last_time, last_history) = *cgroup.watcher.borrow();
@@ -273,7 +261,7 @@ impl Runner {
let new_threshold = self
.config
.cgroup_threshold(usable_system_memory, expected_file_cache_disk_size);
.cgroup_threshold(usable_system_memory, expected_file_cache_size);
let current = last_history.avg_non_reclaimable;
@@ -300,13 +288,10 @@ impl Runner {
.set_file_cache_size(expected_file_cache_size)
.await
.context("failed to set file cache size")?;
if !file_cache.config.in_memory {
file_cache_disk_size = actual_usage;
}
file_cache_disk_size = actual_usage;
let message = format!(
"set file cache size to {} MiB (in memory = {})",
"set file cache size to {} MiB",
bytes_to_mebibytes(actual_usage),
file_cache.config.in_memory,
);
info!("downscale: {message}");
status.push(message);
@@ -357,9 +342,7 @@ impl Runner {
.set_file_cache_size(expected_usage)
.await
.context("failed to set file cache size")?;
if !file_cache.config.in_memory {
file_cache_disk_size = actual_usage;
}
file_cache_disk_size = actual_usage;
if actual_usage != expected_usage {
warn!(


@@ -1 +1,2 @@
#include "postgres.h"
#include "walproposer.h"

Some files were not shown because too many files have changed in this diff.