Compare commits

...

88 Commits

Author SHA1 Message Date
Conrad Ludgate
6d2bbffdab only for console 2023-12-15 12:28:50 +00:00
Conrad Ludgate
7151bcc175 proxy console force http2 2023-12-15 12:26:51 +00:00
Conrad Ludgate
98629841e0 improve proxy code cov (#6141)
## Summary of changes

Saw some low-hanging codecov improvements. Even if code coverage is
somewhat of a pointless game, we might as well add tests where we can
and delete code if it's unused
2023-12-15 12:11:50 +00:00
Arpad Müller
215cdd18c4 Make initdb upload retries cancellable and seek to beginning (#6147)
* initdb uploads had no cancellation token, which meant that when we
got stuck in upload retries, we wouldn't be able to delete the
timeline. in general, the combination of retrying forever and not having
cancellation tokens is quite dangerous.
* initdb uploads wouldn't rewind the file. this wasn't discovered by the
purposefully unreliable test-s3 in pytest because those always fail on
the first byte, not somewhere during the connection. we'd be getting
errors from the AWS SDK that the file ended unexpectedly.

slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1702632247784079
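The rewind fix can be sketched with std types; the names and retry shape below are illustrative, not the actual pageserver code:

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

/// Illustrative retry loop: rewind the source before every attempt so a
/// failed upload never leaves the reader positioned at an unexpected EOF.
fn upload_with_retries<R: Read + Seek>(
    src: &mut R,
    mut try_upload: impl FnMut(&[u8]) -> Result<(), String>,
    max_attempts: u32,
) -> Result<(), String> {
    let mut last_err = String::new();
    for _ in 0..max_attempts {
        // The bug: without this seek, attempt 2 would read zero bytes.
        src.seek(SeekFrom::Start(0)).map_err(|e| e.to_string())?;
        let mut buf = Vec::new();
        src.read_to_end(&mut buf).map_err(|e| e.to_string())?;
        match try_upload(&buf) {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

fn main() {
    let mut src = Cursor::new(b"initdb.tar.zst contents".to_vec());
    let mut attempts = 0;
    // Fail the first attempt, succeed on the second; both must see all bytes.
    let res = upload_with_retries(
        &mut src,
        |body| {
            attempts += 1;
            assert_eq!(body.len(), 23); // full archive on every attempt
            if attempts == 1 { Err("connection reset".into()) } else { Ok(()) }
        },
        3,
    );
    assert!(res.is_ok());
    println!("uploaded after {attempts} attempts");
}
```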
2023-12-15 12:11:25 +00:00
Joonas Koivunen
0fd80484a9 fix: Timeline deletion during busy startup (#6133)
Compaction was holding back timeline deletion because the compaction
lock had been acquired, but the semaphore was waited on. Timeline
deletion was waiting on the same lock for 1500s.

This replaces the
`pageserver::tenant::tasks::concurrent_background_tasks_rate_limit`
(which looks correct) with a simpler `..._permit` which is just an
infallible acquire, making it easier to spot that it needs to be raced
with cancellation tokens.

Ref: https://neondb.slack.com/archives/C03F5SM1N02/p1702496912904719
Ref: https://neondb.slack.com/archives/C03F5SM1N02/p1702578093497779
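The hazard of an infallible acquire can be sketched with a std-only analog (the real code uses tokio primitives; this just contrasts blocking forever against an acquire that observes a cancellation flag):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Minimal counting semaphore whose acquire can observe a cancellation flag.
struct Semaphore {
    permits: Mutex<usize>,
    cond: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cond: Condvar::new() }
    }

    /// Returns true if a permit was taken, false if cancelled while waiting.
    fn acquire_cancellable(&self, cancel: &AtomicBool) -> bool {
        let mut permits = self.permits.lock().unwrap();
        loop {
            if cancel.load(Ordering::Relaxed) {
                return false;
            }
            if *permits > 0 {
                *permits -= 1;
                return true;
            }
            // Wake up periodically to re-check the cancellation flag,
            // instead of blocking forever like an infallible acquire would.
            let (guard, _timeout) = self
                .cond
                .wait_timeout(permits, Duration::from_millis(10))
                .unwrap();
            permits = guard;
        }
    }
}

fn main() {
    let sem = Semaphore::new(1);
    let cancel = AtomicBool::new(false);
    assert!(sem.acquire_cancellable(&cancel)); // permit available: acquired

    // No permits left and deletion wants to proceed: set the flag and the
    // waiter returns instead of hanging for 1500s.
    cancel.store(true, Ordering::Relaxed);
    assert!(!sem.acquire_cancellable(&cancel));
    println!("cancelled waiter returned promptly");
}
```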
2023-12-15 11:59:24 +00:00
Joonas Koivunen
07508fb110 fix: better Json parsing errors (#6135)
Before this, the only errors returned from JSON parsing in the HTTP API
were per-field errors. Now they are produced using
`serde_path_to_error`, which at least helped greatly with the
`disk_usage_eviction_run` used for testing. I don't think this can
conflict with anything added in #5310.
2023-12-15 12:18:22 +02:00
Arseny Sher
5bb9ba37cc Fix python list_segments of sk.
Fixes rare test_peer_recovery flakiness, as we had started to compare
the tmp control file.

https://neondb.slack.com/archives/C04KGFVUWUQ/p1702310929657179
2023-12-15 13:43:11 +04:00
John Spray
f1cd1a2122 pageserver: improved handling of concurrent timeline creations on the same ID (#6139)
## Problem

Historically, the pageserver used an "uninit mark" file on disk for two
purposes:
- Track which timeline dirs are incomplete for handling on restart
- Avoid trying to create the same timeline twice at the same time.

The original purpose of handling restarts is now defunct, as we use
remote storage as the source of truth and clean up any trash timeline
dirs on startup. Using the file to mutually exclude creation operations
is error prone compared with just doing it in memory, and the existing
checks happened some way into the creation operation, and could expose
errors as 500s (anyhow::Errors) rather than something clean.

## Summary of changes

- Creations are now mutually excluded in memory (using
`Tenant::timelines_creating`), rather than relying on a file on disk for
coordination.
- Acquiring unique access to the timeline ID now happens earlier in the
request.
- Creating a timeline that already exists now returns a 201: this
simplifies retry handling for clients.
- 409 is still returned if a timeline with the same ID is still being
created: if this happens it is probably because the client timed out an
earlier request and has retried.
- Colliding timeline creation requests should no longer return 500
errors

This paves the way to entirely removing uninit markers in a subsequent
change.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-15 08:51:23 +00:00
Joonas Koivunen
f010479107 feat(layer): pageserver_layer_redownloaded_after histogram (#6132)
this is aimed at replacing the current mtime-only-based thrashing
alerting later.

Cc: #5331
2023-12-14 21:32:54 +02:00
Conrad Ludgate
cc633585dc gauge guards (#6138)
## Problem

The websockets gauge for active db connections seems to be growing more
than the gauge for client connections over websockets, which does not
make sense.

## Summary of changes

Refactor how our counter-pair gauges are represented. Not sure if this
will fix the problem, but it should be harder to mess up the counters.
The API is much nicer now and doesn't require `scopeguard::defer` hacks.
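The guard idea can be sketched with std atomics; this is an illustrative Drop-based gauge pair, not the proxy's actual metrics types:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// A counter-pair gauge guard: increments on creation and decrements on
/// Drop, so the "active connections" count can never be left dangling by
/// an early return or panic path.
struct GaugeGuard<'a> {
    gauge: &'a AtomicI64,
}

impl<'a> GaugeGuard<'a> {
    fn new(gauge: &'a AtomicI64) -> Self {
        gauge.fetch_add(1, Ordering::Relaxed);
        GaugeGuard { gauge }
    }
}

impl Drop for GaugeGuard<'_> {
    fn drop(&mut self) {
        self.gauge.fetch_sub(1, Ordering::Relaxed);
    }
}

fn main() {
    static ACTIVE_DB_CONNECTIONS: AtomicI64 = AtomicI64::new(0);

    {
        let _conn = GaugeGuard::new(&ACTIVE_DB_CONNECTIONS);
        assert_eq!(ACTIVE_DB_CONNECTIONS.load(Ordering::Relaxed), 1);
        // ...handle the connection; no scopeguard::defer needed...
    } // guard dropped here, gauge decremented on every exit path

    assert_eq!(ACTIVE_DB_CONNECTIONS.load(Ordering::Relaxed), 0);
    println!("gauge balanced");
}
```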
2023-12-14 17:21:39 +00:00
Christian Schwarz
aa5581d14f utils::logging: TracingEventCountLayer: don't use with_label_values() on hot path (#6129)
fixes #6126
2023-12-14 16:31:41 +01:00
John Spray
c4e0ef507f pageserver: heatmap uploads (#6050)
Dependency (commits inline):
https://github.com/neondatabase/neon/pull/5842

## Problem

Secondary mode tenants need a manifest of what to download. Ultimately
this will be some kind of heat-scored set of layers, but as a robust
first step we will simply use the set of resident layers: secondary
tenant locations will aim to match the on-disk content of the attached
location.

## Summary of changes

- Add heatmap types representing the remote structure
- Add hooks to Tenant/Timeline for generating these heatmaps
- Create a new `HeatmapUploader` type that is external to `Tenant`, and
responsible for walking the list of attached tenants and scheduling
heatmap uploads.

Notes to reviewers:
- Putting the logic for uploads (and later, secondary mode downloads)
outside of `Tenant` is an opinionated choice, motivated by:
- Enable future smarter scheduling of operations, e.g. uploading the
stalest tenant first, rather than having all tenants compete for a fair
semaphore on a first-come-first-served basis. Similarly for downloads,
we may wish to schedule the tenants with the hottest un-downloaded
layers first.
- Enable accessing upload-related state without synchronization (it
belongs to HeatmapUploader, rather than being some Mutex<>'d part of
Tenant)
- Avoid further expanding the scope of Tenant/Timeline types, which are
already among the largest in the codebase
- You might reasonably wonder how much of the uploader code could be a
generic job manager thing. Probably some of it: but let's defer pulling
that out until we have at least two users (perhaps secondary downloads
will be the second one) to highlight which bits are really generic.

Compromises:
- Later, instead of using digests of heatmaps to decide whether anything
changed, I would prefer to avoid walking the layers in tenants that
don't have changes: tracking that will be a bit invasive, as it needs
input from both remote_timeline_client and Layer.
2023-12-14 13:09:24 +00:00
Conrad Ludgate
6987b5c44e proxy: add more rates to endpoint limiter (#6130)
## Problem

Single rate bucket is limited in usefulness

## Summary of changes

Introduce a secondary bucket allowing an average of 200 requests per
second over 1 minute, and a tertiary bucket allowing an average of 100
requests per second over 10 minutes.

Configured by using a format like

```sh
proxy --endpoint-rps-limit 300@1s --endpoint-rps-limit 100@10s --endpoint-rps-limit 50@1m
```

If the bucket limits are inconsistent, an error is returned on startup

```
$ proxy --endpoint-rps-limit 300@1s --endpoint-rps-limit 10@10s
Error: invalid endpoint RPS limits. 10@10s allows fewer requests per bucket than 300@1s (100 vs 300)
```
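The consistency rule can be sketched from the example output above: a longer-window bucket must admit at least as many requests per bucket as any shorter-window one, otherwise the longer limit makes the shorter one unreachable. The parsing and validation below are inferred from the commit's examples, not the proxy's actual code:

```rust
/// A single `rps@window` limit, e.g. "300@1s": 300 requests/second
/// averaged over a 1 second bucket.
#[derive(Debug, Clone, Copy)]
struct RpsLimit {
    rps: u64,
    window_secs: u64,
}

impl RpsLimit {
    fn parse(s: &str) -> Option<RpsLimit> {
        let (rps, window) = s.split_once('@')?;
        let rps = rps.parse().ok()?;
        let window_secs = match window {
            w if w.ends_with('s') => w.trim_end_matches('s').parse().ok()?,
            w if w.ends_with('m') => w.trim_end_matches('m').parse::<u64>().ok()? * 60,
            _ => return None,
        };
        Some(RpsLimit { rps, window_secs })
    }

    /// Total requests one bucket admits before it is exhausted.
    fn bucket_capacity(&self) -> u64 {
        self.rps * self.window_secs
    }
}

/// Reject configs where a longer-window bucket admits fewer requests per
/// bucket than a shorter-window one.
fn validate(limits: &[RpsLimit]) -> Result<(), String> {
    let mut sorted = limits.to_vec();
    sorted.sort_by_key(|l| l.window_secs);
    for pair in sorted.windows(2) {
        let (a, b) = (pair[0], pair[1]);
        if b.bucket_capacity() < a.bucket_capacity() {
            return Err(format!(
                "{}@{}s allows fewer requests per bucket than {}@{}s ({} vs {})",
                b.rps, b.window_secs, a.rps, a.window_secs,
                b.bucket_capacity(), a.bucket_capacity()
            ));
        }
    }
    Ok(())
}

fn main() {
    let ok: Vec<_> = ["300@1s", "100@10s", "50@1m"]
        .iter().map(|s| RpsLimit::parse(s).unwrap()).collect();
    assert!(validate(&ok).is_ok()); // capacities 300, 1000, 3000: consistent

    let bad: Vec<_> = ["300@1s", "10@10s"]
        .iter().map(|s| RpsLimit::parse(s).unwrap()).collect();
    assert!(validate(&bad).is_err()); // 100 < 300: rejected at startup
    println!("validation matches the commit's examples");
}
```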
2023-12-13 21:43:49 +00:00
Alexander Bayandin
0cd49cac84 test_compatibility: make it use initdb.tar.zst 2023-12-13 15:04:25 -06:00
Alexander Bayandin
904dff58b5 test_wal_restore_http: cleanup test 2023-12-13 15:04:25 -06:00
Arthur Petukhovsky
f401a21cf6 Fix test_simple_sync_safekeepers
There is a postgres 16 version encoded in a binary message.
2023-12-13 15:04:25 -06:00
Tristan Partin
158adf602e Update Postgres 16 series to 16.1 2023-12-13 15:04:25 -06:00
Tristan Partin
c94db6adbb Update Postgres 15 series to 15.5 2023-12-13 15:04:25 -06:00
Tristan Partin
85720616b1 Update Postgres 14 series to 14.10 2023-12-13 15:04:25 -06:00
George MacKerron
d6fcc18eb2 Add Neon-Batch- headers to OPTIONS response for SQL-over-HTTP requests (#6116)
This is needed to allow use of batch queries from browsers.

## Problem

SQL-over-HTTP batch queries fail from web browsers because the relevant
headers, `Neon-Batch-Isolation-Level` and `Neon-Batch-Read-Only`, are
not included in the server's OPTIONS response. I think we simply forgot
to add them when implementing the batch query feature.

## Summary of changes

Added `Neon-Batch-Isolation-Level` and `Neon-Batch-Read-Only` to the
OPTIONS response.
2023-12-13 17:18:20 +00:00
Vadim Kharitonov
c2528ae671 Increase pgbouncer pool size to 64 for VMs (#6124)
The pool size was changed for pods
(https://github.com/neondatabase/cloud/pull/8057). The idea is to
increase it for VMs too.
2023-12-13 16:23:24 +00:00
Joonas Koivunen
a919b863d1 refactor: remove eviction batching (#6060)
Since #5108 we no longer have `layer_removal_cs`, so we no longer need
batching.
2023-12-13 18:05:33 +02:00
Joonas Koivunen
2d22661061 refactor: calculate_synthetic_size_worker, remove PRE::NeedsDownload (#6111)
Changes I wanted to make on #6106 but decided to leave out to keep that
commit clean for including in the #6090. Finally remove
`PageReconstructionError::NeedsDownload`.
2023-12-13 14:23:19 +00:00
John Spray
e3778381a8 tests: make test_bulk_insert recreate tenant in same generation (#6113)
## Problem

Test deletes tenant and recreates with the same ID. The recreation bumps
generation number. This could lead to stale generation warnings in the
logs.

## Summary of changes

Handle this more gracefully by re-creating in the same generation that
the tenant was previously attached in.

We could also update the tenant delete path to have the attachment
service to drop tenant state on delete, but I like having it there: it
makes debug easier, and the only time it's a problem is when a test is
re-using a tenant ID after deletion.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2023-12-13 14:14:38 +00:00
Conrad Ludgate
c8316b7a3f simplify endpoint limiter (#6122)
## Problem

1. Using chrono for durations only is wasteful
2. The arc/mutex was not being utilised
3. Locking every shard in the dashmap every GC could cause latency
spikes
4. More buckets

## Summary of changes

1. Use `Instant` instead of `NaiveTime`.
2. Remove the `Arc<Mutex<_>>` wrapper, utilising the fact that dashmap
entries return mutable access
3. Clear only a random shard, update gc interval accordingly
4. Multiple buckets can be checked before allowing access

When I benchmarked the check function, it took on average 811ns when
multithreaded over the course of 10 million checks.
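The shard-at-a-time GC idea can be sketched with std types. The sharding, map contents, and the tiny LCG standing in for a real RNG are all illustrative, not the limiter's actual implementation:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Sharded map where GC clears only one shard per pass, so a GC run never
/// locks every shard at once (the latency-spike problem described above).
struct ShardedLimiter {
    shards: Vec<Mutex<HashMap<String, u64>>>,
    rng_state: Mutex<u64>,
}

impl ShardedLimiter {
    fn new(n: usize) -> Self {
        ShardedLimiter {
            shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect(),
            rng_state: Mutex::new(0x243F6A8885A308D3),
        }
    }

    fn insert(&self, shard: usize, key: &str) {
        self.shards[shard].lock().unwrap().insert(key.to_string(), 0);
    }

    /// Clear a single pseudo-randomly chosen shard. Because only 1/N of
    /// the state is dropped per pass, the GC interval is shortened
    /// accordingly.
    fn gc(&self) -> usize {
        let mut state = self.rng_state.lock().unwrap();
        *state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        let shard = (*state >> 33) as usize % self.shards.len();
        self.shards[shard].lock().unwrap().clear();
        shard
    }

    fn total_entries(&self) -> usize {
        self.shards.iter().map(|s| s.lock().unwrap().len()).sum()
    }
}

fn main() {
    let limiter = ShardedLimiter::new(4);
    for shard in 0..4 {
        limiter.insert(shard, "some-endpoint-id");
    }
    assert_eq!(limiter.total_entries(), 4);

    let cleared = limiter.gc();
    // Exactly one shard was emptied; the other three keep their entries.
    assert_eq!(limiter.total_entries(), 3);
    println!("gc cleared shard {cleared}");
}
```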
2023-12-13 13:53:23 +00:00
Stas Kelvich
8460654f61 Add per-endpoint rate limiter to proxy 2023-12-13 07:03:21 +02:00
Arpad Müller
7c2c87a5ab Update azure SDK to 0.18 and use open range support (#6103)
* Update `azure-*` crates to 0.18
* Use new open ranges support added by upstream in
https://github.com/Azure/azure-sdk-for-rust/pull/1482

Part of #5567. Prior update PR: #6081
2023-12-12 18:20:12 +01:00
Arpad Müller
5820faaa87 Use extend instead of groups of append calls in tests (#6109)
Repeated calls to `.append` don't line up as nicely, as they might get
formatted in different ways. They also take more characters, and the
lines can get longer.

Saw this while working on #5912.
2023-12-12 18:00:37 +01:00
John Spray
dfb0a6fdaf scrubber: handle initdb files, fix an issue with prefixes (#6079)
- The code for calculating the prefix in the bucket was expecting a
trailing slash (as it is in the tests), but that's an awkward
expectation to impose for use in the field: make the code more flexible
by only trimming a trailing character if it is indeed a slash.
- initdb archives were detected by the scrubber as malformed layer
files. Teach it to recognize and ignore them.
2023-12-12 16:53:08 +00:00
Alexander Bayandin
6acbee2368 test_runner: add from_repo_dir method (#6087)
## Problem

We need a reliable way to restore a project state (in this context, I
mean data on pageservers, safekeepers, and remote storage) from a
snapshot. The existing method (that we use in `test_compatibility`)
heavily relies on config files, which makes it harder to add/change
fields in the config.
The proposed solution uses config file only to get `default_tenant_id`
and `branch_name_mappings`.

## Summary of changes
- Add `NeonEnvBuilder#from_repo_dir` method, which allows using the
`neon_env_builder` fixture with data from a snapshot.
- Use `NeonEnvBuilder#from_repo_dir` in compatibility tests

Requires for https://github.com/neondatabase/neon/issues/6033
2023-12-12 16:24:13 +00:00
Konstantin Knizhnik
aec1acdbac Do not inherit replication slots in branch (#5898)
## Problem
 
See 
https://github.com/neondatabase/company_projects/issues/111
https://neondb.slack.com/archives/C03H1K0PGKH/p1700166126954079


## Summary of changes

Do not search for AUX_FILES_KEY in parent timelines


---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-12-12 14:24:21 +02:00
Konstantin Knizhnik
8bb4a13192 Do not materialize null images in PS (#5979)
## Problem

PG16 is writing null images during relation extension.
And page server implements optimisation which replace WAL record with
FPI with page image.
So instead of WAL record ~30 bytes we store 8kb null page image.
Ans this image is almost useless, because most likely it will be shortly
rewritten with actual page content.

## Summary of changes

Do not materialize WAL records with a null page FPI.
 

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-12-12 14:23:45 +02:00
Anna Khanova
9e071e4458 Propagate information about the protocol to console (#6102)
## Problem

Currently there is no information in the Snowflake logs about the
protocol that the client uses.

## Summary of changes

Propagate the protocol information together with the app_name, in the
format `{app_name}/{sql_over_http/tcp/ws}`.

This will give @stepashka more observability into what our clients are
using.
2023-12-12 11:42:51 +00:00
John Spray
fead836f26 swagger: remove 'format: hex' from tenant IDs (#6099)
## Problem

TenantId is changing to TenantShardId in many APIs. The swagger had
`format: hex` attributes on some of these IDs. That isn't formally
defined anywhere, but a reasonable person might think it means "hex
digits only", which will no longer be the case once we start using
shard-aware IDs (they're like `<tenant_id>-0001`).



## Summary of changes

- Remove these `format` attributes from all `tenant_id` fields in the
swagger definition
2023-12-12 10:39:34 +00:00
John Spray
20e9cf7d31 pageserver: tweaks to slow/hung task logging (#6098)
## Problem

- `shutdown_tasks` would log when a particular task was taking a long
time to shut down, but not when it eventually completed. That left one
uncertain as to whether the slow task was the source of a hang, or just
a precursor.

## Summary of changes

- Add a log line after a slow task shutdown
- Add an equivalent in Gate's `warn_if_stuck`, in case we ever need it.
This isn't related to the original issue but was noticed when checking
through these logging paths.
2023-12-12 07:19:59 +00:00
Joonas Koivunen
3b04f3a749 fix: accidental return Ok (#6106)
An error indicating request cancellation OR timeline shutdown was
treated as a reason to exit the background worker that calculates
synthetic size. Fix it so the error is only considered for avoiding
logging such errors.
2023-12-11 21:27:53 +00:00
Arpad Müller
c49fd69bd6 Add initdb_lsn to TimelineInfo (#6104)
This way, we can query it.

Background: I want to do statistics for how reproducible `initdb_lsn`
really is, see https://github.com/neondatabase/cloud/issues/8284 and
https://neondb.slack.com/archives/C036U0GRMRB/p1701895218280269
2023-12-11 21:08:14 +00:00
Tristan Partin
5ab9592a2d Add submodule paths as safe directories as a precaution
The check-codestyle-rust-arm job requires this for some reason, so let's
just add them everywhere we do this workaround.
2023-12-11 13:08:37 -06:00
Tristan Partin
036558c956 Fix git ownership issue in check-codestyle-rust-arm
We have this workaround for other jobs. Looks like this one was
forgotten about.
2023-12-11 13:08:37 -06:00
John Spray
6a922b1a75 tests: start adding tests for secondary mode, live migration (#5842)
These tests have been loitering on a branch of mine for a while: they
already provide value even without all the secondary mode bits landed
yet, and the Workload helper is handy for other tests too.

- `Workload` is a re-usable test workload that replaces some of the
arbitrary "write a few rows" SQL that I've found my self repeating, and
adds a systematic way to append data and check that reads properly
reflect the changes. This append+validate stuff is important when doing
migrations, as we want to detect situations where we might be reading
from a pageserver that has not properly seen latest changes.
- test_multi_attach is a validation of how the pageserver handles
attaching the same tenant to multiple pageservers, from a safety point
of view. This is intentionally separate from the larger testing of
migration, to provide an isolated environment for multi-attachment.
- test_location_conf_churn is a pseudo-random walk through the various
states that TenantSlot can be put into, with validation that attached
tenants remain externally readable when they should, and as a side
effect validating that the compute endpoint's online configuration
changes work as expected.
- test_live_migration is the reference implementation of how to drive a
pair of pageservers through a zero-downtime migration of a tenant.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-12-11 16:55:43 +00:00
John Spray
f1fc1fd639 pageserver: further refactoring from TenantId to TenantShardId (#6059)
## Problem

In https://github.com/neondatabase/neon/pull/5957, the most essential
types were updated to use TenantShardId rather than TenantId. That
unblocked other work, but didn't fully enable running multiple shards
from the same tenant on the same pageserver.

## Summary of changes

- Use TenantShardId in page cache key for materialized pages
- Update mgr.rs get_tenant() and list_tenants() functions to use a shard
id, and update all callers.
- Eliminate the exactly_one_or_none helper in mgr.rs and all code that
used it
- Convert timeline HTTP routes to use tenant_shard_id

Note on page cache:
```
struct MaterializedPageHashKey {
    /// Why is this TenantShardId rather than TenantId?
    ///
    /// Usually, the materialized value of a page@lsn is identical on any shard in the same tenant.  However, this
    /// is not the case for certain internally-generated pages (e.g. relation sizes).  In future, we may make this
    /// key smaller by omitting the shard, if we ensure that reads to such pages always skip the cache, or are
    /// special-cased in some other way.
    tenant_shard_id: TenantShardId,
    timeline_id: TimelineId,
    key: Key,
}
```
2023-12-11 15:52:33 +00:00
Alexander Bayandin
66a7a226f8 test_runner: use toml instead of formatted strings (#6088)
## Problem

A bunch of refactorings extracted from
https://github.com/neondatabase/neon/pull/6087 (not required for it); 
the most significant one is using toml instead of formatted strings.

## Summary of changes 
- Use toml instead of formatted strings for config
- Skip pageserver log check if `pageserver.log` doesn't exist
- `chmod -x test_runner/regress/test_config.py`
2023-12-11 15:13:27 +00:00
Joonas Koivunen
f0d15cee6f build: update azure-* to 0.17 (#6081)
this is a drive-by upgrade while we refresh the access tokens at the
same time.
2023-12-11 12:21:02 +01:00
Sasha Krassovsky
0ba4cae491 Fix RLS/REPLICATION granting (#6083)
2023-12-08 12:55:44 -08:00
Andrew Rudenko
df1f8e13c4 proxy: pass neon options in deep object format (#6068)
---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2023-12-08 19:58:36 +01:00
John Spray
e640bc7dba tests: allow-lists for occasional failures (#6074)
test_creating_tenant_conf_after...
- Test detaches a tenant and then re-attaches immediately: this causes
a race between the pending remote LSN update and the generation bump in
the attachment.

test_gc_cutoff:
- Test rapidly restarts a pageserver before one generation has had the
chance to process deletions from the previous generation
2023-12-08 17:32:16 +00:00
Christian Schwarz
cf024de202 virtual_file metrics: expose max size of the fd cache (#6078)
And also leave a comment on how to determine current size.

Kind of follow-up to #6066

refs https://github.com/neondatabase/cloud/issues/8351
refs https://github.com/neondatabase/neon/issues/5479
2023-12-08 17:23:50 +00:00
Conrad Ludgate
e1a564ace2 proxy simplify cancellation (#5916)
## Problem

The cancellation code was confusing and error prone (as seen before in
our memory leaks).

## Summary of changes

* Use the new `TaskTracker` primitive instead of JoinSet to gracefully
wait for tasks to shut down.
* Updated libs/utils/completion to use `TaskTracker`
* Remove `tokio::select` in favour of `futures::future::select` in a
specialised `run_until_cancelled()` helper function
2023-12-08 16:21:17 +00:00
Christian Schwarz
f5b9af6ac7 page cache: improve eviction-related metrics (#6077)
These changes help with identifying thrashing.

The existing `pageserver_page_cache_find_victim_iters_total` is already
useful, but it doesn't tell us how many individual find_victim() calls
are happening, only how many clock-LRU steps happened in the entire
system, without info about whether we needed to actually evict other
data or just scanned for a long time, e.g., because the cache is large.

The changes in this PR allow us to
1. count each possible outcome separately, esp evictions
2. compute mean iterations/outcome

I don't think anyone except me was paying close attention to
`pageserver_page_cache_find_victim_iters_total` before, so I think the
slight behavior change of also counting iterations for the 'iters
exceeded' case is fine.

refs https://github.com/neondatabase/cloud/issues/8351
refs https://github.com/neondatabase/neon/issues/5479
2023-12-08 15:27:21 +00:00
John Spray
5e98855d80 tests: update tests that used local_fs&mock_s3 to use one or the other (#6015)
## Problem

This was wasting resources: if we run a test with mock s3 we don't then
need to run it again with local fs. When we're running in CI, we don't
need to run with the mock/local storage as well as real S3. There is
some value in having CI notice/spot issues that might otherwise only
happen when running locally, but that doesn't justify the cost of
running the tests so many more times on every PR.

## Summary of changes

- For tests that used available_remote_storages or
available_s3_storages, update them to either specify no remote storage
(therefore inherit the default, which is currently local fs), or to
specify s3_storage() for the tests that actually want an S3 API.
2023-12-08 14:52:37 +00:00
Conrad Ludgate
699049b8f3 proxy: make auth more type safe (#5689)
## Problem

a5292f7e67/proxy/src/auth/backend.rs (L146-L148)

a5292f7e67/proxy/src/console/provider/neon.rs (L90)

a5292f7e67/proxy/src/console/provider/neon.rs (L154)

## Summary of changes

1. Test backend is only enabled on `cfg(test)`.
2. Postgres mock backend + MD5 auth keys are only enabled on
`cfg(feature = testing)`
3. Password hack and cleartext flow will have their passwords validated
before proceeding.
4. Distinguish between ClientCredentials with endpoint and without,
removing many panics in the process
2023-12-08 11:48:37 +00:00
John Spray
2c544343e0 pageserver: filtered WAL ingest for sharding (#6024)
## Problem

Currently, if one creates many shards they will all ingest all the data:
not much use! We want them to ingest a proportional share of the data
each.

Closes: #6025

## Summary of changes

- WalIngest object gets a copy of the ShardIdentity for the Tenant it
was created by.
- While iterating the `blocks` part of a decoded record, blocks that do
not match the current shard are ignored, apart from on shard zero where
they are used to update relation sizes in `observe_decoded_block` (but
not stored).
- Before committing a `DataDirModification` from a WAL record, we check
if it's empty, and drop the record if so. This check is necessary
(rather than just looking at the `blocks` part) because certain record
types may modify blocks in non-obvious ways (e.g.
`ingest_heapam_record`).
- Add WAL ingest metrics to record the total received, total committed,
and total filtered out
- Behaviour for unsharded tenants is unchanged: they will continue to
ingest all blocks, and will take the fast path through `is_key_local`
that doesn't bother calculating any hashes.

After this change, shards store a subset of the tenant's total data, and
accurate relation sizes are only maintained on shard zero.
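The per-block ownership check can be sketched with std hashing. The key shape, the hash function, and the struct below are illustrative; the actual `ShardIdentity` logic (and the shard-zero relation-size special case) is more involved:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative shard filter: hash each block's key and keep the block
/// only if it maps to this shard. `shard_count == 1` takes a fast path
/// with no hashing at all, as the commit notes.
#[derive(Clone, Copy)]
struct ShardIdentity {
    shard_number: u32,
    shard_count: u32,
}

impl ShardIdentity {
    fn is_key_local(&self, key: &(u32, u32, u32)) -> bool {
        if self.shard_count == 1 {
            return true; // unsharded tenant: ingest every block
        }
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() % self.shard_count as u64) as u32 == self.shard_number
    }
}

fn main() {
    let keys: Vec<(u32, u32, u32)> =
        (0..1000).map(|blkno| (1663, 16384, blkno)).collect();

    // Unsharded: everything is local.
    let single = ShardIdentity { shard_number: 0, shard_count: 1 };
    assert!(keys.iter().all(|k| single.is_key_local(k)));

    // Four shards: every key is owned by exactly one shard, so together
    // they cover all keys, each ingesting a proportional share.
    let shards: Vec<_> = (0..4)
        .map(|n| ShardIdentity { shard_number: n, shard_count: 4 })
        .collect();
    for key in &keys {
        let owners = shards.iter().filter(|s| s.is_key_local(key)).count();
        assert_eq!(owners, 1);
    }
    println!("each block owned by exactly one of 4 shards");
}
```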

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-12-08 10:12:37 +00:00
Arseny Sher
193e60e2b8 Fix/edit pgindent confusing places in neon. 2023-12-08 14:03:13 +04:00
Arseny Sher
1bbd6cae24 pgindent pgxn/neon 2023-12-08 14:03:13 +04:00
Arseny Sher
65f48c7002 Make targets to run pgindent on core and neon extension. 2023-12-08 14:03:13 +04:00
Alexander Bayandin
d9d8e9afc7 test_tenant_reattach: fix reattach mode names (#6070)
## Problem

Ref
https://neondb.slack.com/archives/C033QLM5P7D/p1701987609146109?thread_ts=1701976393.757279&cid=C033QLM5P7D

## Summary of changes
- Make reattach mode names unique for `test_tenant_reattach`
2023-12-08 08:39:45 +00:00
Arpad Müller
7914eaf1e6 Buffer initdb.tar.zst to a temporary file before upload (#5944)
In https://github.com/neondatabase/neon/pull/5912#pullrequestreview-1749982732 , Christian liked the idea of using files instead of buffering the
archive to RAM for the *download* path. This is for the upload path,
which is a very similar situation.
2023-12-08 03:33:44 +01:00
Joonas Koivunen
37fdbc3aaa fix: use larger buffers for remote storage (#6069)
Currently we use 8kB buffers; raise that to 32kB to hopefully cut
`spawn_blocking` usage to 1/4. Also a drive-by fix of the last
`tokio::io::copy` to use `tokio::io::copy_buf`.
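Why a 4x larger buffer quarters the downstream call count can be shown with std's `BufWriter`; the counting writer here stands in for the expensive per-write hop being reduced, and is purely illustrative:

```rust
use std::io::{BufWriter, Write};

/// Counts how many write calls reach the underlying target.
struct CountingWriter {
    calls: usize,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.calls += 1;
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

/// Push `total` bytes through a BufWriter in `chunk`-sized writes and
/// report how many writes reached the inner target.
fn count_inner_writes(buf_capacity: usize, chunk: usize, total: usize) -> usize {
    let mut writer = BufWriter::with_capacity(buf_capacity, CountingWriter { calls: 0 });
    let data = vec![0u8; chunk];
    let mut written = 0;
    while written < total {
        writer.write_all(&data).unwrap();
        written += chunk;
    }
    writer.flush().unwrap();
    writer.get_ref().calls
}

fn main() {
    // Copy 1 MiB in 1 KiB chunks: a 32 KiB buffer reaches the underlying
    // writer 4x less often than an 8 KiB one.
    let with_8k = count_inner_writes(8 * 1024, 1024, 1024 * 1024);
    let with_32k = count_inner_writes(32 * 1024, 1024, 1024 * 1024);
    assert_eq!(with_8k, 128);
    assert_eq!(with_32k, 32);
    println!("8KiB buffer: {with_8k} inner writes, 32KiB: {with_32k}");
}
```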
2023-12-07 19:36:44 +00:00
Tristan Partin
7aa1e58301 Add support for Python 3.12 2023-12-07 12:30:42 -06:00
Christian Schwarz
f2892d3798 virtual_file metrics: distinguish first and subsequent open() syscalls (#6066)
This helps with identifying thrashing.

I don't love the name, but there is already "close-by-replace".

While reading the code, I also found a case where we waste
work in a cache pressure situation:
https://github.com/neondatabase/neon/issues/6065

refs https://github.com/neondatabase/cloud/issues/8351
2023-12-07 16:17:33 +00:00
Joonas Koivunen
b492cedf51 fix(remote_storage): buffering, by using streams for upload and download (#5446)
There is double buffering in remote_storage and in pageserver for 8KiB
in using `tokio::io::copy` to read `BufReader<ReaderStream<_>>`.

Switches downloads and uploads to use `Stream<Item =
std::io::Result<Bytes>>`. Caller and only caller now handles setting up
buffering. For reading, `Stream<Item = ...>` is also a `AsyncBufRead`,
so when writing to a file, we now have `tokio::io::copy_buf` reading
full buffers and writing them to `tokio::io::BufWriter` which handles
the buffering before dispatching over to `tokio::fs::File`.

Additionally implements streaming uploads for azure. With azure
downloads are a bit nicer than before, but not much; instead of one huge
vec they just hold on to N allocations we got over the wire.

This PR will also make it trivial to switch reading and writing to
io-uring based methods.

Cc: #5563.
2023-12-07 15:52:22 +00:00
John Spray
880663f6bc tests: use tenant_create() helper in test_bulk_insert (#6064)
## Problem

Since #5449 we enable generations in tests by default. Running
benchmarks was missed while merging that PR, and there was one that
needed updating.

## Summary of changes

Make test_bulk_insert use the proper generation-aware helper for tenant
creation.
2023-12-07 14:52:16 +00:00
John Spray
e89e41f8ba tests: update for tenant generations (#5449)
## Problem

Some existing tests are written in a way that's incompatible with tenant
generations.

## Summary of changes

Update all the tests that need updating: this is things like calling
through the NeonPageserver.tenant_attach helper to get a generation
number, instead of calling directly into the pageserver API. There are
various more subtle cases.
2023-12-07 12:27:16 +00:00
Conrad Ludgate
f9401fdd31 proxy: fix channel binding error messages (#6054)
## Problem

For channel binding failed messages we were still saying "channel
binding not supported" in the errors.

## Summary of changes

Fix error messages
2023-12-07 11:47:16 +00:00
Joonas Koivunen
b7ffe24426 build: update tokio to 1.34.0, tokio-utils 0.7.10 (#6061)
We should still remember to bump the minimum crate versions for
libraries beginning to use task tracker.
2023-12-07 11:31:38 +00:00
Joonas Koivunen
52718bb8ff fix(layer): metric splitting, span rename (#5902)
Per [feedback], split the Layer metrics, also finally account for lost
and [re-submitted feedback] on `layer_gc` by renaming it to
`layer_delete`, `Layer::garbage_collect_on_drop` renamed to
`Layer::delete_on_drop`. References to "gc" dropped from metric names
and elsewhere.

Also fixes how the cancellations were tracked: there was one rare
counter. Now there is a top level metric for cancelled inits, and the
rare "download failed but failed to communicate" counter is kept.

Fixes: #6027

[feedback]: https://github.com/neondatabase/neon/pull/5809#pullrequestreview-1720043251
[re-submitted feedback]: https://github.com/neondatabase/neon/pull/5108#discussion_r1401867311
2023-12-07 11:39:40 +02:00
Joonas Koivunen
10c77cb410 temp: increase the wait tenant activation timeout (#6058)
5s is causing way too much noise; this is of course a temporary fix. We
should give the highest priority to tenants for which there are pagestream
openings, and the second highest to basebackups.

Deployment thread for context:
https://neondb.slack.com/archives/C03H1K0PGKH/p1701935048144479?thread_ts=1701765158.926659&cid=C03H1K0PGKH
2023-12-07 09:01:08 +00:00
Heikki Linnakangas
31be301ef3 Make simple_rcu::RcuWaitList::wait() async (#6046)
The gc_timeline() function is async, but it calls the synchronous wait()
function. In the worst case, that could lead to a deadlock by using up
all tokio executor threads.

In passing, fix a few typos in comments.

Fixes issue #6045.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-07 10:20:40 +02:00
Joonas Koivunen
a3c7d400b4 fix: avoid allocations with logging a slug (#6047)
`to_string` forces allocating a string smaller than a pointer (while still
costing 4 usizes on the stack); using a `Display`-formattable slug saves
that allocation. The difference seems small, but we log these a lot.
2023-12-07 07:25:22 +00:00
Vadim Kharitonov
7501ca6efb Revert timescaledb for pg14 and pg15 (#6056)
```
could not start the compute node: compute is in state "failed": db error: ERROR: could not access file "$libdir/timescaledb-2.10.1": No such file or directory Caused by: ERROR: could not access file "$libdir/timescaledb-2.10.1": No such file or directory
```
2023-12-06 15:12:36 +00:00
Christian Schwarz
987c9aaea0 virtual_file: fix the metric for close() calls done by VirtualFile::drop (#6051)
Before this PR we would inc() the counter for `Close` even though the
slot's FD had already been closed.

Especially visible when subtracting `open` from `close+close-by-replace`
on a system that does a lot of attach and detach.

refs https://github.com/neondatabase/cloud/issues/8440
refs https://github.com/neondatabase/cloud/issues/8351
2023-12-06 12:05:28 +00:00
Konstantin Knizhnik
7fab731f65 Track size of FSM fork while applying records at replica (#5901)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1700560921471619

## Summary of changes

Update relation size cache for FSM fork in WAL records filter

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-12-05 18:49:24 +02:00
John Spray
483caa22c6 pageserver: logging tweaks (#6039)
- The `Attaching tenant` log message omitted some useful information
like the generation and mode
- info-level messages about writing configuration files were
unnecessarily verbose
- During process shutdown, we don't emit logs about the various phases:
this is very cheap to log since we do it once per process lifetime, and
is helpful when figuring out where something got stuck during a hang.
2023-12-05 16:11:15 +00:00
John Spray
da5e03b0d8 pageserver: add a /reset API for tenants (#6014)
## Problem

Traditionally we would detach/attach directly with curl if we wanted to
"reboot" a single tenant. That's kind of inconvenient these days,
because one needs to know a generation number to issue an attach
request.

Closes: https://github.com/neondatabase/neon/issues/6011

## Summary of changes

- Introduce a new `/reset` API, which remembers the LocationConf from
the current attachment so that callers do not have to work out the
correct configuration/generation to use.
- As an additional support tool, allow an optional `drop_cache` query
parameter, for situations where we are concerned that some on-disk state
might be bad and want to clear that as well as the in-memory state.

One might wonder why I didn't call this "reattach" -- it's because
there's already a PS->CP API of that name and it could get confusing.
2023-12-05 15:38:27 +00:00
John Spray
be885370f6 pageserver: remove redundant unsafe_create_dir_all (#6040)
This non-fsyncing analog to our safe directory creation function was
just duplicating what tokio's fs::create_dir_all does.
2023-12-05 15:03:07 +00:00
Alexey Kondratov
bc1020f965 compute_ctl: Notify waiters when Postgres failed to start (#6034)
When configuring an empty compute, the API handler waits on a condvar for
a compute state change. Previously, if Postgres failed to start, we just
set the compute status to `Failed` without notifying the waiters. That
caused a timeout on the control plane side, even though we could have
returned a proper error from the compute earlier.

With this commit, the API handler is properly notified.
2023-12-05 13:38:45 +01:00
John Spray
61fe9d360d pageserver: add Key->Shard mapping logic & use it in page service (#5980)
## Problem

When a pageserver receives a page service request identified by
TenantId, it must decide which `Tenant` object to route it to.

As in earlier PRs, this stuff is all a no-op for tenants with a single
shard: calls to `is_key_local` always return true without doing any
hashing on a single-shard ShardIdentity.

Closes: https://github.com/neondatabase/neon/issues/6026

## Summary of changes

- Carry immutable `ShardIdentity` objects in Tenant and Timeline. These
provide the information that Tenants/Timelines need to figure out which
shard is responsible for which Key.
- Augment `get_active_tenant_with_timeout` to take a `ShardSelector`
specifying how the shard should be resolved for this tenant. This mode
depends on the kind of request (e.g. basebackups always go to shard
zero).
- In `handle_get_page_at_lsn_request`, handle the case where the
Timeline we looked up at connection time is not the correct shard for
the page being requested. This can happen whenever one node holds
multiple shards for the same tenant. This is currently written as a
"slow path" with the optimistic expectation that usually we'll run with
one shard per pageserver, and the Timeline resolved at connection time
will be the one serving page requests. There is scope for optimization
here later, to avoid doing the full shard lookup for each page.
- Omit consumption metrics from nonzero shards: only the 0th shard is
responsible for tracing accurate relation sizes.

Note to reviewers:
- Testing of these changes is happening separately on the
`jcsp/sharding-pt1` branch, where we have hacked neon_local etc needed
to run a test_pg_regress.
- The main caveat to this implementation is that page service
connections still look up one Timeline when the connection is opened,
before they know which pages are going to be read. If there is one shard
per pageserver then this will always also be the Timeline that serves
page requests. However, if multiple shards are on one pageserver then
get page requests will incur the cost of looking up the correct Timeline
on each getpage request. We may look to improve this in future with a
"sticky" timeline per connection handler so that subsequent requests for
the same Timeline don't have to look up again, and/or by having postgres
pass a shard hint when connecting. This is tracked in the "Loose ends"
section of https://github.com/neondatabase/neon/issues/5507
2023-12-05 12:01:55 +00:00
Conrad Ludgate
f60e49fe8e proxy: fix panic in startup packet (#6032)
## Problem

The proxy panicked when fewer than 8 bytes were presented in a startup packet.

## Summary of changes

We need there to be a 4-byte message code (after the 4-byte length), so
the expected minimum length is 8.
2023-12-05 11:24:16 +01:00
Anna Khanova
c48918d329 Rename metric (#6030)
## Problem

It looks like the metric is broken in Grafana because of the reallocation
of the buckets in the previous PR.

## Summary of changes

Renamed the metric.
2023-12-05 10:03:07 +00:00
Sasha Krassovsky
bad686bb71 Remove trusted from wal2json (#6035)
## Problem

## Summary of changes
2023-12-04 21:10:23 +00:00
Alexey Kondratov
85d08581ed [compute_ctl] Introduce feature flags in the compute spec (#6016)
## Problem

In the past we've rolled out all new `compute_ctl` functionality right
to all users, which could be risky. I want to have a more fine-grained
control over what we enable, in which env and to which users.

## Summary of changes

Add an option to pass a list of feature flags to `compute_ctl`. If not
passed, it defaults to an empty list. Any unknown flags are ignored.

This allows us to release new experimental features safer, as we can
then flip the flag for one specific user, only Neon employees, free /
pro / etc. users and so on. Or control it per environment.

In the current implementation feature flags are passed via compute spec,
so they do not allow controlling behavior of `empty` computes. For them,
we can either stick with the previous approach, i.e. add separate cli
args or introduce a more generic `--features` cli argument.
2023-12-04 19:54:18 +01:00
Christian Schwarz
c7f1143e57 concurrency-limit low-priority initial logical size calculation [v2] (#6000)
Problem
-------

Before this PR, there was no concurrency limit on initial logical size
computations.

While logical size computations are lazy in theory, in practice
(production), they happen in a short timeframe after restart.

This means that on a PS with 20k tenants, we'd have up to 20k concurrent
initial logical size calculation requests.

This is self-inflicted needless overload.

This hasn't been a problem so far because the `.await` points on the
logical size calculation path never return `Pending`, hence we have a
natural concurrency limit of the number of executor threads.
But, as soon as we return `Pending` somewhere in the logical size
calculation path, other concurrent tasks get scheduled by tokio.
If these other tasks are also logical size calculations, they eventually
pound on the same bottleneck.

For example, in #5479, we want to switch the VirtualFile descriptor
cache to a `tokio::sync::RwLock`, which makes us return `Pending`, and
without measures like this patch, after PS restart, VirtualFile
descriptor cache thrashes heavily for 2 hours until all the logical size
calculations have been computed and the degree of concurrency /
concurrent VirtualFile operations is down to regular levels.
See the *Experiment* section below for details.


Background
----------

Before this PR, initial logical size calculation was spawned lazily on
first call to `Timeline::get_current_logical_size()`.

In practice (prod), the lazy calculation is triggered by
`WalReceiverConnectionHandler` if the timeline is active according to
storage broker, or by the first iteration of consumption metrics worker
after restart (`MetricsCollection`).

The spawns by walreceiver are high-priority because logical size is
needed by Safekeepers (via walreceiver `PageserverFeedback`) to enforce
the project logical size limit.
The spawns by metrics collection are not on the user-critical path and
hence low-priority. [^consumption_metrics_slo]

[^consumption_metrics_slo]: We can't delay metrics collection
indefinitely because there are TBD internal SLOs tied to metrics
collection happening in a timely manner
(https://github.com/neondatabase/cloud/issues/7408). But let's ignore
that in this issue.

The ratio of walreceiver-initiated spawns vs
consumption-metrics-initiated spawns can be reconstructed from logs
(`spawning logical size computation from context of task kind {:?}"`).
PR #5995 and #6018 adds metrics for this.

First investigation of the ratio led to the discovery that walreceiver
spawns 75% of init logical size computations.
That's because of two bugs:
- In Safekeepers: https://github.com/neondatabase/neon/issues/5993
- In interaction between Pageservers and Safekeepers:
https://github.com/neondatabase/neon/issues/5962

The safekeeper bug is likely primarily responsible but we don't have the
data yet. The metrics will hopefully provide some insights.

When assessing production-readiness of this PR, please assume that
neither of these bugs are fixed yet.


Changes In This PR
------------------

With this PR, initial logical size calculation is reworked as follows:

First, all initial logical size calculation task_mgr tasks are started
early, as part of timeline activation, and run a retry loop with long
back-off until success. This removes the lazy computation; it was
needless complexity because in practice, we compute all logical sizes
anyways, because consumption metrics collects it.

Second, within the initial logical size calculation task, each attempt
queues behind the background loop concurrency limiter semaphore. This
fixes the performance issue that we pointed out in the "Problem" section
earlier.

Third, there is a twist to queuing behind the background loop
concurrency limiter semaphore. Logical size is needed by Safekeepers
(via walreceiver `PageserverFeedback`) to enforce the project logical
size limit. However, we currently do open walreceiver connections even
before we have an exact logical size. That's bad, and I'll build on top
of this PR to fix that
(https://github.com/neondatabase/neon/issues/5963). But, for the
purposes of this PR, we don't want to introduce a regression, i.e., we
don't want to provide an exact value later than before this PR. The
solution is to introduce a priority-boosting mechanism
(`GetLogicalSizePriority`), allowing callers of
`Timeline::get_current_logical_size` to specify how urgently they need
an exact value. The effect of specifying high urgency is that the
initial logical size calculation task for the timeline will skip the
concurrency limiting semaphore. This should yield effectively the same
behavior as we had before this PR with lazy spawning.

Last, the priority-boosting mechanism obsoletes the `init_order`'s grace
period for initial logical size calculations. It's a separate commit to
reduce the churn during review. We can drop that commit if people think
it's too much churn, and commit it later once we know this PR here
worked as intended.

Experiment With #5479 
---------------------

I validated this PR combined with #5479 to assess whether we're making
forward progress towards asyncification.

The setup is an `i3en.3xlarge` instance with 20k tenants, each with one
timeline that has 9 layers.
All tenants are inactive, i.e., not known to SKs nor storage broker.
This means all initial logical size calculations are spawned by
consumption metrics `MetricsCollection` task kind.
The consumption metrics worker starts requesting logical sizes at low
priority immediately after restart. This is achieved by deleting the
consumption metrics cache file on disk before starting
PS.[^consumption_metrics_cache_file]

[^consumption_metrics_cache_file]: The consumption metrics worker persists
its interval across restarts to achieve persistent reporting intervals
across PS restarts; delete the state file on disk to get predictable
(and, I believe, worst-case in terms of concurrency during PS restart)
behavior.

Before this patch, all of these timelines would all do their initial
logical size calculation in parallel, leading to extreme thrashing in
page cache and virtual file cache.

With this patch, the virtual file cache thrashing is reduced
significantly (from 80k `open`-system-calls/second to ~500
`open`-system-calls/second during loading).


### Critique

The obvious critique of the above experiment is that there's no skipping
of the semaphore, i.e., the priority-boosting aspect of this PR is not
exercised.

If even just 1% of our 20k tenants in the setup were active in
SK/storage_broker, then 200 logical size calculations would skip the
limiting semaphore immediately after restart and run concurrently.

Further critique: given the two bugs wrt timeline inactive vs active
state that were mentioned in the Background section, we could have 75%
of our 20k tenants being (falsely) active on restart.

So... (next section)

This Doesn't Make Us Ready For Async VirtualFile
------------------------------------------------

This PR is a step towards asynchronous `VirtualFile`, aka, #5479 or even
#4744.

But it doesn't yet enable us to ship #5479.

The reason is that this PR doesn't limit the amount of high-priority
logical size computations.
If there are many high-priority logical size calculations requested,
we'll fall over like we did if #5479 is applied without this PR.
And currently, at very least due to the bugs mentioned in the Background
section, we run thousands of high-priority logical size calculations on
PS startup in prod.

So, at a minimum, we need to fix these bugs.

Then we can ship #5479 and #4744, and things will likely be fine under
normal operation.

But in high-traffic situations, overload problems will still be more
likely to happen, e.g., VirtualFile cache descriptor thrashing.
The solution candidates for that are orthogonal to this PR though:
* global concurrency limiting
* per-tenant rate limiting => #5899
* load shedding
* scaling bottleneck resources (fd cache size (neondatabase/cloud#8351),
page cache size(neondatabase/cloud#8351), spread load across more PSes,
etc)

Conclusion
----------

Even with the remarks in the previous section, we should merge this
PR because:
1. it's an improvement over the status quo (esp. if the aforementioned
bugs wrt timeline active / inactive are fixed)
2. it prepares the way for
https://github.com/neondatabase/neon/pull/6010
3. it gets us close to shipping #5479 and #4744
2023-12-04 17:22:26 +00:00
Christian Schwarz
7403d55013 walredo: stderr cleanup & make explicitly cancel safe (#6031)
# Problem

I need walredo to be cancellation-safe for
https://github.com/neondatabase/neon/pull/6000#discussion_r1412049728

# Solution

We are only `async fn` because of
`wait_for(stderr_logger_task_done).await`, added in #5560 .

The `stderr_logger_cancel` and `stderr_logger_task_done` were there out
of precaution that the stderr logger task might for some reason not stop
when the walredo process terminates.
That hasn't been a problem in practice.
So, simplify things:
- remove `stderr_logger_cancel` and the
`wait_for(...stderr_logger_task_done...)`
- use `tokio::process::ChildStderr` in the stderr logger task
- add metrics to track number of running stderr logger tasks so in case
I'm wrong here, we can use these metrics to identify the issue (not
planning to put them into a dashboard or anything)
2023-12-04 16:06:41 +00:00
Anna Khanova
12f02523a4 Enable dynamic rate limiter (#6029)
## Problem

Limit the number of open connections between the control plane and
proxy.

## Summary of changes

Enable dynamic rate limiter in prod.

Unfortunately the latency metrics are a bit broken, but from the logs I
see that on staging, over the past 7 days, the latency for acquiring was
greater than 1ms only twice (in most cases it's insignificant).
2023-12-04 15:00:24 +00:00
Arseny Sher
207c527270 Safekeepers: persist state before timeline deactivation.
Without it, sometimes on restart we lose the latest remote_consistent_lsn,
which leads to excessive ps -> sk reconnections.

https://github.com/neondatabase/neon/issues/5993
2023-12-04 18:22:36 +04:00
John Khvatov
eae49ff598 Perform L0 compaction before creating new image layers (#5950)
If there are too many L0 layers before compaction, the compaction
process becomes slow because of slow `Timeline::get`. As a result of the
slowdown, the pageserver will generate even more L0 layers for the next
iteration, further exacerbating the slow performance.

Change to perform L0 -> L1 compaction before creating new image layers.
This simple change speeds up compaction and makes `Timeline::get` up to
5x faster, as `Timeline::get` is faster on top of L1 layers.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-04 12:35:09 +00:00
Alexander Bayandin
e6b2f89fec test_pg_clients: fix test that reads from stdout (#6021)
## Problem

`test_pg_clients` reads the actual result from a *.stdout file,
https://github.com/neondatabase/neon/pull/5977 has added a header to
such files, so `test_pg_clients` started to fail.

## Summary of changes
- Use `capture_stdout` and compare the expected result with the output
instead of *.stdout file content
2023-12-04 11:18:41 +00:00
John Spray
1d81e70d60 pageserver: tweak logs for index_part loading (#6005)
## Problem

On pageservers upgraded to enable generations, these INFO level logs
were rather frequent. If a tenant timeline hasn't written new layers
since the upgrade, it will emit the "No index_part.json*" log every time
it starts.

## Summary of changes

- Downgrade two log lines from info to debug
- Add a tiny unit test that I wrote for sanity-checking that there
wasn't something wrong with our Generation-comparing logic when loading
index parts.
2023-12-04 09:57:47 +00:00
179 changed files with 7843 additions and 4300 deletions


@@ -199,6 +199,10 @@ jobs:
           #
           git config --global --add safe.directory ${{ github.workspace }}
           git config --global --add safe.directory ${GITHUB_WORKSPACE}
+          for r in 14 15 16; do
+            git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
+            git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
+          done
       - name: Checkout
         uses: actions/checkout@v3
@@ -1097,6 +1101,10 @@ jobs:
           #
           git config --global --add safe.directory ${{ github.workspace }}
           git config --global --add safe.directory ${GITHUB_WORKSPACE}
+          for r in 14 15 16; do
+            git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
+            git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
+          done
       - name: Checkout
         uses: actions/checkout@v3


@@ -142,6 +142,10 @@ jobs:
           #
           git config --global --add safe.directory ${{ github.workspace }}
           git config --global --add safe.directory ${GITHUB_WORKSPACE}
+          for r in 14 15 16; do
+            git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
+            git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
+          done
       - name: Checkout
         uses: actions/checkout@v4
@@ -238,6 +242,20 @@ jobs:
         options: --init
     steps:
+      - name: Fix git ownership
+        run: |
+          # Workaround for `fatal: detected dubious ownership in repository at ...`
+          #
+          # Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
+          # Ref https://github.com/actions/checkout/issues/785
+          #
+          git config --global --add safe.directory ${{ github.workspace }}
+          git config --global --add safe.directory ${GITHUB_WORKSPACE}
+          for r in 14 15 16; do
+            git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
+            git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
+          done
       - name: Checkout
         uses: actions/checkout@v4
         with:

.gitignore (vendored)

@@ -18,3 +18,6 @@ test_output/
 *.o
 *.so
 *.Po
+
+# pgindent typedef lists
+*.list

Cargo.lock (generated)

@@ -44,6 +44,12 @@ dependencies = [
  "memchr",
 ]
 
+[[package]]
+name = "allocator-api2"
+version = "0.2.16"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0942ffc6dcaadf03badf6e6a2d0228460359d5e34b57ccdc720b7382dfbd5ec5"
+
 [[package]]
 name = "android_system_properties"
 version = "0.1.5"
@@ -178,7 +184,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "81953c529336010edd6d8e358f886d9581267795c61b19475b71314bffa46d35"
 dependencies = [
  "concurrent-queue",
- "event-listener",
+ "event-listener 2.5.3",
  "futures-core",
 ]
@@ -199,11 +205,13 @@ dependencies = [
 [[package]]
 name = "async-lock"
-version = "2.8.0"
+version = "3.2.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "287272293e9d8c41773cec55e365490fe034813a2f172f502d6ddcf75b2f582b"
+checksum = "7125e42787d53db9dd54261812ef17e937c95a51e4d291373b670342fa44310c"
 dependencies = [
- "event-listener",
+ "event-listener 4.0.0",
+ "event-listener-strategy",
+ "pin-project-lite",
 ]
 
 [[package]]
@@ -686,9 +694,9 @@ dependencies = [
 [[package]]
 name = "azure_core"
-version = "0.16.0"
+version = "0.18.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "8e29286b9edfdd6f2c7e9d970bb5b015df8621258acab9ecfcea09b2d7692467"
+checksum = "a6218987c374650fdad0b476bfc675729762c28dfb35f58608a38a2b1ea337dd"
 dependencies = [
  "async-trait",
  "base64 0.21.1",
@@ -696,8 +704,10 @@ dependencies = [
  "dyn-clone",
  "futures",
  "getrandom 0.2.11",
+ "hmac",
  "http-types",
  "log",
+ "once_cell",
  "paste",
  "pin-project",
  "quick-xml",
@@ -706,6 +716,7 @@ dependencies = [
  "rustc_version",
  "serde",
  "serde_json",
+ "sha2",
  "time",
  "url",
  "uuid",
@@ -713,9 +724,9 @@ dependencies = [
 [[package]]
 name = "azure_identity"
-version = "0.16.2"
+version = "0.18.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "5b67b337346da8739e91ea1e9400a6ebc9bc54e0b2af1d23c9bcd565950588f9"
+checksum = "9e1eacc4f7fb2a73d57c39139d0fc3aed78435606055779ddaef4b43cdf919a8"
 dependencies = [
  "async-lock",
  "async-trait",
@@ -725,7 +736,6 @@ dependencies = [
  "oauth2",
  "pin-project",
  "serde",
- "serde_json",
  "time",
  "tz-rs",
  "url",
@@ -734,21 +744,18 @@ dependencies = [
 [[package]]
 name = "azure_storage"
-version = "0.16.0"
+version = "0.18.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "bed0ccefde57930b2886fd4aed1f70ac469c197b8c2e94828290d71bcbdb5d97"
+checksum = "ade8f2653e408de88b9eafec9f48c3c26b94026375e88adbd34523a7dd9795a1"
 dependencies = [
  "RustyXML",
- "async-lock",
  "async-trait",
  "azure_core",
  "bytes",
- "futures",
- "hmac",
  "log",
  "serde",
  "serde_derive",
- "serde_json",
- "sha2",
  "time",
  "url",
  "uuid",
@@ -756,13 +763,14 @@ dependencies = [
 [[package]]
 name = "azure_storage_blobs"
-version = "0.16.0"
+version = "0.18.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "f91a52da2d192cfe43759f61e8bb31a5969f1722d5b85ac89627f356ad674ab4"
+checksum = "025701c7cc5b523100f0f3b2b01723564ec5a86c03236521c06826337047e872"
 dependencies = [
  "RustyXML",
  "azure_core",
  "azure_storage",
+ "azure_svc_blobstorage",
  "bytes",
  "futures",
  "log",
@@ -774,6 +782,22 @@ dependencies = [
  "uuid",
 ]
 
+[[package]]
+name = "azure_svc_blobstorage"
+version = "0.18.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "76051e5bb67cea1055abe5e530a0878feac7e0ab4cbbcb4a6adc953a58993389"
+dependencies = [
+ "azure_core",
+ "bytes",
+ "futures",
+ "log",
+ "once_cell",
+ "serde",
+ "serde_json",
+ "time",
+]
+
 [[package]]
 name = "backtrace"
 version = "0.3.67"
@@ -890,7 +914,7 @@ checksum = "a246e68bb43f6cd9db24bea052a53e40405417c5fb372e3d1a8a7f770a564ef5"
 dependencies = [
  "memchr",
  "once_cell",
- "regex-automata",
+ "regex-automata 0.1.10",
  "serde",
 ]
@@ -1680,6 +1704,27 @@ version = "2.5.3"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "0206175f82b8d6bf6652ff7d71a1e27fd2e4efde587fd368662814d6ec1d9ce0"
 
+[[package]]
+name = "event-listener"
+version = "4.0.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "770d968249b5d99410d61f5bf89057f3199a077a04d087092f58e7d10692baae"
+dependencies = [
+ "concurrent-queue",
+ "parking",
+ "pin-project-lite",
+]
+
+[[package]]
+name = "event-listener-strategy"
+version = "0.4.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "958e4d70b6d5e81971bebec42271ec641e7ff4e170a6fa605f2b8a8b65cb97d3"
+dependencies = [
+ "event-listener 4.0.0",
+ "pin-project-lite",
+]
+
 [[package]]
 name = "fail"
 version = "0.5.1"
@@ -2042,6 +2087,10 @@ name = "hashbrown"
 version = "0.14.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "2c6201b9ff9fd90a5a3bac2e56a830d0caa509576f0e503818ee82c181b3437a"
+dependencies = [
+ "ahash",
+ "allocator-api2",
+]
 
 [[package]]
 name = "hashlink"
@@ -2533,7 +2582,7 @@ version = "0.1.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "8263075bb86c5a1b1427b5ae862e8889656f126e9f77c484496e8b47cf5c5558"
 dependencies = [
- "regex-automata",
+ "regex-automata 0.1.10",
 ]
 
@@ -2559,9 +2608,9 @@ checksum = "490cc448043f947bae3cbee9c203358d62dbee0db12107a74be5c30ccfd09771"
 [[package]]
 name = "memchr"
-version = "2.5.0"
+version = "2.6.4"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "2dffe52ecf27772e601905b7522cb4ef790d2cc203488bbd0e2fe85fcb74566d"
+checksum = "f665ee40bc4a3c5590afb1e9677db74a508659dfd71e126420da8274909a0167"
 
 [[package]]
 name = "memoffset"
@@ -2634,14 +2683,14 @@ dependencies = [
 [[package]]
 name = "mio"
-version = "0.8.6"
+version = "0.8.10"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "5b9d9a46eff5b4ff64b45a9e316a6d1e0bc719ef429cbec4dc630684212bfdf9"
+checksum = "8f3d0b296e374a4e6f3c7b0a1f5a51d748a0d34c85e7dc48fc3fa9a87657fe09"
 dependencies = [
  "libc",
  "log",
  "wasi 0.11.0+wasi-snapshot-preview1",
- "windows-sys 0.45.0",
+ "windows-sys 0.48.0",
 ]
 
 [[package]]
@@ -3054,6 +3103,7 @@ dependencies = [
  "humantime-serde",
  "hyper",
  "itertools",
+ "md5",
  "metrics",
  "nix 0.26.2",
  "num-traits",
@@ -3644,7 +3694,7 @@ dependencies = [
  "serde_json",
  "sha2",
  "smol_str",
- "socket2 0.5.3",
+ "socket2 0.5.5",
  "sync_wrapper",
  "task-local-extensions",
  "thiserror",
@@ -3668,9 +3718,9 @@ dependencies = [
 [[package]]
 name = "quick-xml"
-version = "0.30.0"
+version = "0.31.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "eff6510e86862b57b210fd8cbe8ed3f0d7d600b9c2863cd4549a2e033c66e956"
+checksum = "1004a344b30a54e2ee58d66a71b32d2db2feb0a31f9a2d302bf0536f15de2a33"
 dependencies = [
  "memchr",
  "serde",
@@ -3810,13 +3860,14 @@ dependencies = [
 [[package]]
 name = "regex"
-version = "1.8.2"
+version = "1.10.2"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "d1a59b5d8e97dee33696bf13c5ba8ab85341c002922fba050069326b9c498974"
+checksum = "380b951a9c5e80ddfd6136919eef32310721aa4aacd4889a8d39124b026ab343"
 dependencies = [
  "aho-corasick",
  "memchr",
- "regex-syntax 0.7.2",
+ "regex-automata 0.4.3",
+ "regex-syntax 0.8.2",
 ]
 
@@ -3828,6 +3879,17 @@ dependencies = [
  "regex-syntax 0.6.29",
 ]
 
+[[package]]
+name = "regex-automata"
+version = "0.4.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5f804c7828047e88b2d32e2d7fe5a105da8ee3264f01902f796c8e067dc2483f"
+dependencies = [
+ "aho-corasick",
+ "memchr",
+ "regex-syntax 0.8.2",
+]
+
 [[package]]
 name = "regex-syntax"
 version = "0.6.29"
@@ -3836,9 +3898,9 @@ checksum = "f162c6dd7b008981e4d40210aca20b4bd0f9b60ca9271061b07f78537722f2e1"
 [[package]]
name = "regex-syntax" name = "regex-syntax"
version = "0.7.2" version = "0.8.2"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "436b050e76ed2903236f032a59761c1eb99e1b0aead2c257922771dab1fc8c78" checksum = "c08c74e62047bb2de4ff487b251e4a92e24f48745648451635cec7d591162d9f"
[[package]] [[package]]
name = "relative-path" name = "relative-path"
@@ -3864,6 +3926,7 @@ dependencies = [
"bytes", "bytes",
"camino", "camino",
"camino-tempfile", "camino-tempfile",
"futures",
"futures-util", "futures-util",
"http-types", "http-types",
"hyper", "hyper",
@@ -4291,6 +4354,7 @@ dependencies = [
"tokio-io-timeout", "tokio-io-timeout",
"tokio-postgres", "tokio-postgres",
"tokio-stream", "tokio-stream",
"tokio-util",
"toml_edit", "toml_edit",
"tracing", "tracing",
"url", "url",
@@ -4731,9 +4795,9 @@ dependencies = [
[[package]] [[package]]
name = "socket2" name = "socket2"
version = "0.5.3" version = "0.5.5"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2538b18701741680e0322a2302176d3253a35388e2e62f172f64f4f16605f877" checksum = "7b5fac59a5cb5dd637972e5fca70daf0523c9067fcdc4842f053dae04a18f8e9"
dependencies = [ dependencies = [
"libc", "libc",
"windows-sys 0.48.0", "windows-sys 0.48.0",
@@ -5080,18 +5144,18 @@ dependencies = [
[[package]] [[package]]
name = "tokio" name = "tokio"
version = "1.28.1" version = "1.34.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0aa32867d44e6f2ce3385e89dceb990188b8bb0fb25b0cf576647a6f98ac5105" checksum = "d0c014766411e834f7af5b8f4cf46257aab4036ca95e9d2c144a10f59ad6f5b9"
dependencies = [ dependencies = [
"autocfg", "backtrace",
"bytes", "bytes",
"libc", "libc",
"mio", "mio",
"num_cpus", "num_cpus",
"pin-project-lite", "pin-project-lite",
"signal-hook-registry", "signal-hook-registry",
"socket2 0.4.9", "socket2 0.5.5",
"tokio-macros", "tokio-macros",
"windows-sys 0.48.0", "windows-sys 0.48.0",
] ]
@@ -5108,9 +5172,9 @@ dependencies = [
[[package]] [[package]]
name = "tokio-macros" name = "tokio-macros"
version = "2.1.0" version = "2.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "630bdcf245f78637c13ec01ffae6187cca34625e8c63150d424b59e55af2675e" checksum = "5b8a1e28f2deaa14e508979454cb3a223b10b938b45af148bc0986de36f1923b"
dependencies = [ dependencies = [
"proc-macro2", "proc-macro2",
"quote", "quote",
@@ -5145,7 +5209,7 @@ dependencies = [
"pin-project-lite", "pin-project-lite",
"postgres-protocol", "postgres-protocol",
"postgres-types", "postgres-types",
"socket2 0.5.3", "socket2 0.5.5",
"tokio", "tokio",
"tokio-util", "tokio-util",
] ]
@@ -5214,13 +5278,16 @@ dependencies = [
[[package]] [[package]]
name = "tokio-util" name = "tokio-util"
version = "0.7.8" version = "0.7.10"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "806fe8c2c87eccc8b3267cbae29ed3ab2d0bd37fca70ab622e46aaa9375ddb7d" checksum = "5419f34732d9eb6ee4c3578b7989078579b7f039cbbb9ca2c4da015749371e15"
dependencies = [ dependencies = [
"bytes", "bytes",
"futures-core", "futures-core",
"futures-io",
"futures-sink", "futures-sink",
"futures-util",
"hashbrown 0.14.0",
"pin-project-lite", "pin-project-lite",
"tokio", "tokio",
"tracing", "tracing",
@@ -5698,6 +5765,7 @@ dependencies = [
"serde", "serde",
"serde_assert", "serde_assert",
"serde_json", "serde_json",
"serde_path_to_error",
"serde_with", "serde_with",
"signal-hook", "signal-hook",
"strum", "strum",
@@ -6216,7 +6284,8 @@ dependencies = [
"prost", "prost",
"rand 0.8.5", "rand 0.8.5",
"regex", "regex",
"regex-syntax 0.7.2", "regex-automata 0.4.3",
"regex-syntax 0.8.2",
"reqwest", "reqwest",
"ring 0.16.20", "ring 0.16.20",
"rustls", "rustls",


@@ -38,10 +38,10 @@ license = "Apache-2.0"
 anyhow = { version = "1.0", features = ["backtrace"] }
 arc-swap = "1.6"
 async-compression = { version = "0.4.0", features = ["tokio", "gzip", "zstd"] }
-azure_core = "0.16"
-azure_identity = "0.16"
-azure_storage = "0.16"
-azure_storage_blobs = "0.16"
+azure_core = "0.18"
+azure_identity = "0.18"
+azure_storage = "0.18"
+azure_storage_blobs = "0.18"
 flate2 = "1.0.26"
 async-stream = "0.3"
 async-trait = "0.1"
@@ -109,7 +109,7 @@ pin-project-lite = "0.2"
 prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
 prost = "0.11"
 rand = "0.8"
-regex = "1.4"
+regex = "1.10.2"
 reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] }
 reqwest-tracing = { version = "0.4.0", features = ["opentelemetry_0_19"] }
 reqwest-middleware = "0.2.0"
@@ -149,7 +149,7 @@ tokio-postgres-rustls = "0.10.0"
 tokio-rustls = "0.24"
 tokio-stream = "0.1"
 tokio-tar = "0.3"
-tokio-util = { version = "0.7", features = ["io"] }
+tokio-util = { version = "0.7.10", features = ["io", "rt"] }
 toml = "0.7"
 toml_edit = "0.19"
 tonic = {version = "0.9", features = ["tls", "tls-roots"]}


@@ -387,10 +387,20 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
 ARG PG_VERSION
 ENV PATH "/usr/local/pgsql/bin:$PATH"
-RUN apt-get update && \
+RUN case "${PG_VERSION}" in \
+    "v14" | "v15") \
+        export TIMESCALEDB_VERSION=2.10.1 \
+        export TIMESCALEDB_CHECKSUM=6fca72a6ed0f6d32d2b3523951ede73dc5f9b0077b38450a029a5f411fdb8c73 \
+        ;; \
+    *) \
+        export TIMESCALEDB_VERSION=2.13.0 \
+        export TIMESCALEDB_CHECKSUM=584a351c7775f0e067eaa0e7277ea88cab9077cc4c455cbbf09a5d9723dce95d \
+        ;; \
+    esac && \
+    apt-get update && \
     apt-get install -y cmake && \
-    wget https://github.com/timescale/timescaledb/archive/refs/tags/2.13.0.tar.gz -O timescaledb.tar.gz && \
-    echo "584a351c7775f0e067eaa0e7277ea88cab9077cc4c455cbbf09a5d9723dce95d timescaledb.tar.gz" | sha256sum --check && \
+    wget https://github.com/timescale/timescaledb/archive/refs/tags/${TIMESCALEDB_VERSION}.tar.gz -O timescaledb.tar.gz && \
+    echo "${TIMESCALEDB_CHECKSUM} timescaledb.tar.gz" | sha256sum --check && \
     mkdir timescaledb-src && cd timescaledb-src && tar xvzf ../timescaledb.tar.gz --strip-components=1 -C . && \
     ./bootstrap -DSEND_TELEMETRY_DEFAULT:BOOL=OFF -DUSE_TELEMETRY:BOOL=OFF -DAPACHE_ONLY:BOOL=ON -DCMAKE_BUILD_TYPE=Release && \
     cd build && \
@@ -721,8 +731,7 @@ RUN wget https://github.com/eulerto/wal2json/archive/refs/tags/wal2json_2_5.tar.
     echo "b516653575541cf221b99cf3f8be9b6821f6dbcfc125675c85f35090f824f00e wal2json_2_5.tar.gz" | sha256sum --check && \
     mkdir wal2json-src && cd wal2json-src && tar xvzf ../wal2json_2_5.tar.gz --strip-components=1 -C . && \
     make -j $(getconf _NPROCESSORS_ONLN) && \
-    make -j $(getconf _NPROCESSORS_ONLN) install && \
-    echo 'trusted = true' >> /usr/local/pgsql/share/extension/wal2json.control
+    make -j $(getconf _NPROCESSORS_ONLN) install

 #########################################################################################
 #


@@ -260,6 +260,44 @@ distclean:
 fmt:
 	./pre-commit.py --fix-inplace
+
+postgres-%-pg-bsd-indent: postgres-%
+	+@echo "Compiling pg_bsd_indent"
+	$(MAKE) -C $(POSTGRES_INSTALL_DIR)/build/$*/src/tools/pg_bsd_indent/
+
+# Create typedef list for the core. Note that generally it should be combined with
+# buildfarm one to cover platform specific stuff.
+# https://wiki.postgresql.org/wiki/Running_pgindent_on_non-core_code_or_development_code
+postgres-%-typedefs.list: postgres-%
+	$(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/find_typedef $(POSTGRES_INSTALL_DIR)/$*/bin > $@
+
+# Indent postgres. See src/tools/pgindent/README for details.
+.PHONY: postgres-%-pgindent
+postgres-%-pgindent: postgres-%-pg-bsd-indent postgres-%-typedefs.list
+	+@echo merge with buildfarm typedef to cover all platforms
+	+@echo note: I first tried to download from pgbuildfarm.org, but for unclear reason e.g. \
+	REL_16_STABLE list misses PGSemaphoreData
+# wget -q -O - "http://www.pgbuildfarm.org/cgi-bin/typedefs.pl?branch=REL_16_STABLE" |\
+#	cat - postgres-$*-typedefs.list | sort | uniq > postgres-$*-typedefs-full.list
+	cat $(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/pgindent/typedefs.list |\
+	cat - postgres-$*-typedefs.list | sort | uniq > postgres-$*-typedefs-full.list
+	+@echo note: you might want to run it on selected files/dirs instead.
+	INDENT=$(POSTGRES_INSTALL_DIR)/build/$*/src/tools/pg_bsd_indent/pg_bsd_indent \
+	$(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/pgindent/pgindent --typedefs postgres-$*-typedefs-full.list \
+	$(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/ \
+	--excludes $(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/pgindent/exclude_file_patterns
+	rm -f pg*.BAK
+
+# Indent pxgn/neon.
+.PHONY: pgindent
+neon-pgindent: postgres-v16-pg-bsd-indent neon-pg-ext-v16
+	$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/v16/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
+	FIND_TYPEDEF=$(ROOT_PROJECT_DIR)/vendor/postgres-v16/src/tools/find_typedef \
+	INDENT=$(POSTGRES_INSTALL_DIR)/build/v16/src/tools/pg_bsd_indent/pg_bsd_indent \
+	PGINDENT_SCRIPT=$(ROOT_PROJECT_DIR)/vendor/postgres-v16/src/tools/pgindent/pgindent \
+	-C $(POSTGRES_INSTALL_DIR)/build/neon-v16 \
+	-f $(ROOT_PROJECT_DIR)/pgxn/neon/Makefile pgindent
+
 .PHONY: setup-pre-commit-hook
 setup-pre-commit-hook:
 	ln -s -f $(ROOT_PROJECT_DIR)/pre-commit.py .git/hooks/pre-commit


@@ -274,7 +274,13 @@ fn main() -> Result<()> {
             let mut state = compute.state.lock().unwrap();
             state.error = Some(format!("{:?}", err));
             state.status = ComputeStatus::Failed;
-            drop(state);
+            // Notify others that Postgres failed to start. In case of configuring the
+            // empty compute, it's likely that API handler is still waiting for compute
+            // state change. With this we will notify it that compute is in Failed state,
+            // so control plane will know about it earlier and record proper error instead
+            // of timeout.
+            compute.state_changed.notify_all();
+            drop(state); // unlock
             delay_exit = true;
             None
         }

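The ordering in the hunk above (record `Failed`, `notify_all()` while still holding the lock, then unlock) is the standard Mutex + Condvar handshake. A stdlib-only sketch of the same pattern — the `Status` enum and function names here are illustrative, not `compute_ctl`'s actual types:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Status {
    Init,
    Failed,
}

// Spawn a waiter that blocks until the status leaves Init, then flip the
// status to Failed and notify_all() before releasing the lock.
pub fn run_handshake() -> Status {
    let state = Arc::new((Mutex::new(Status::Init), Condvar::new()));

    let waiter = {
        let state = Arc::clone(&state);
        thread::spawn(move || {
            let (lock, cvar) = &*state;
            let mut status = lock.lock().unwrap();
            // "API handler" side: wait for any state change.
            while *status == Status::Init {
                status = cvar.wait(status).unwrap();
            }
            *status
        })
    };

    // "main()" side: record the failure, wake everyone, then unlock.
    let (lock, cvar) = &*state;
    let mut status = lock.lock().unwrap();
    *status = Status::Failed;
    cvar.notify_all();
    drop(status); // unlock
    waiter.join().unwrap()
}

fn main() {
    assert_eq!(run_handshake(), Status::Failed);
    println!("waiter observed Failed");
}
```

Notifying while the lock is still held guarantees the waiter cannot miss the transition: it either sees `Failed` on its first check or is parked inside `wait` when the notification arrives.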

@@ -22,7 +22,7 @@ use utils::id::{TenantId, TimelineId};
 use utils::lsn::Lsn;

 use compute_api::responses::{ComputeMetrics, ComputeStatus};
-use compute_api::spec::{ComputeMode, ComputeSpec};
+use compute_api::spec::{ComputeFeature, ComputeMode, ComputeSpec};
 use utils::measured_stream::MeasuredReader;

 use remote_storage::{DownloadError, RemotePath};
@@ -252,7 +252,7 @@ fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()>
             IF NOT EXISTS (
                 SELECT FROM pg_catalog.pg_roles WHERE rolname = 'neon_superuser')
             THEN
-                CREATE ROLE neon_superuser CREATEDB CREATEROLE NOLOGIN REPLICATION IN ROLE pg_read_all_data, pg_write_all_data;
+                CREATE ROLE neon_superuser CREATEDB CREATEROLE NOLOGIN REPLICATION BYPASSRLS IN ROLE pg_read_all_data, pg_write_all_data;
                 IF array_length(roles, 1) IS NOT NULL THEN
                     EXECUTE format('GRANT neon_superuser TO %s',
                         array_to_string(ARRAY(SELECT quote_ident(x) FROM unnest(roles) as x), ', '));
@@ -277,6 +277,17 @@ fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()>
 }

 impl ComputeNode {
+    /// Check that compute node has corresponding feature enabled.
+    pub fn has_feature(&self, feature: ComputeFeature) -> bool {
+        let state = self.state.lock().unwrap();
+
+        if let Some(s) = state.pspec.as_ref() {
+            s.spec.features.contains(&feature)
+        } else {
+            false
+        }
+    }
+
     pub fn set_status(&self, status: ComputeStatus) {
         let mut state = self.state.lock().unwrap();
         state.status = status;


@@ -193,16 +193,11 @@ impl Escaping for PgIdent {
 /// Build a list of existing Postgres roles
 pub fn get_existing_roles(xact: &mut Transaction<'_>) -> Result<Vec<Role>> {
     let postgres_roles = xact
-        .query(
-            "SELECT rolname, rolpassword, rolreplication, rolbypassrls FROM pg_catalog.pg_authid",
-            &[],
-        )?
+        .query("SELECT rolname, rolpassword FROM pg_catalog.pg_authid", &[])?
         .iter()
         .map(|row| Role {
             name: row.get("rolname"),
             encrypted_password: row.get("rolpassword"),
-            replication: Some(row.get("rolreplication")),
-            bypassrls: Some(row.get("rolbypassrls")),
             options: None,
         })
         .collect();


@@ -252,8 +252,6 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
         let action = if let Some(r) = pg_role {
             if (r.encrypted_password.is_none() && role.encrypted_password.is_some())
                 || (r.encrypted_password.is_some() && role.encrypted_password.is_none())
-                || !r.bypassrls.unwrap_or(false)
-                || !r.replication.unwrap_or(false)
             {
                 RoleAction::Update
             } else if let Some(pg_pwd) = &r.encrypted_password {
@@ -285,14 +283,22 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
         match action {
             RoleAction::None => {}
             RoleAction::Update => {
-                let mut query: String =
-                    format!("ALTER ROLE {} BYPASSRLS REPLICATION", name.pg_quote());
+                // This can be run on /every/ role! Not just ones created through the console.
+                // This means that if you add some funny ALTER here that adds a permission,
+                // this will get run even on user-created roles! This will result in different
+                // behavior before and after a spec gets reapplied. The below ALTER as it stands
+                // now only grants LOGIN and changes the password. Please do not allow this branch
+                // to do anything silly.
+                let mut query: String = format!("ALTER ROLE {} ", name.pg_quote());
                 query.push_str(&role.to_pg_options());
                 xact.execute(query.as_str(), &[])?;
             }
             RoleAction::Create => {
+                // This branch only runs when roles are created through the console, so it is
+                // safe to add more permissions here. BYPASSRLS and REPLICATION are inherited
+                // from neon_superuser.
                 let mut query: String = format!(
-                    "CREATE ROLE {} CREATEROLE CREATEDB BYPASSRLS REPLICATION IN ROLE neon_superuser",
+                    "CREATE ROLE {} INHERIT CREATEROLE CREATEDB IN ROLE neon_superuser",
                     name.pg_quote()
                 );
                 info!("role create query: '{}'", &query);


@@ -201,6 +201,12 @@ async fn handle_validate(mut req: Request<Body>) -> Result<Response<Body>, ApiEr
         // TODO(sharding): make this shard-aware
         if let Some(tenant_state) = locked.tenants.get(&req_tenant.id.tenant_id) {
             let valid = tenant_state.generation == req_tenant.gen;
+            tracing::info!(
+                "handle_validate: {}(gen {}): valid={valid} (latest {})",
+                req_tenant.id,
+                req_tenant.gen,
+                tenant_state.generation
+            );
             response.tenants.push(ValidateResponseTenant {
                 id: req_tenant.id,
                 valid,
@@ -250,6 +256,13 @@ async fn handle_attach_hook(mut req: Request<Body>) -> Result<Response<Body>, Ap
     tenant_state.pageserver = attach_req.node_id;
     let generation = tenant_state.generation;

+    tracing::info!(
+        "handle_attach_hook: tenant {} set generation {}, pageserver {}",
+        attach_req.tenant_id,
+        tenant_state.generation,
+        attach_req.node_id.unwrap_or(utils::id::NodeId(0xfffffff))
+    );
+
     locked.save().await.map_err(ApiError::InternalServerError)?;

     json_response(


@@ -168,7 +168,7 @@ fn print_timelines_tree(
                 info: t.clone(),
                 children: BTreeSet::new(),
                 name: timeline_name_mappings
-                    .remove(&TenantTimelineId::new(t.tenant_id, t.timeline_id)),
+                    .remove(&TenantTimelineId::new(t.tenant_id.tenant_id, t.timeline_id)),
             },
         )
     })


@@ -519,6 +519,7 @@ impl Endpoint {
             skip_pg_catalog_updates: self.skip_pg_catalog_updates,
             format_version: 1.0,
             operation_uuid: None,
+            features: vec![],
             cluster: Cluster {
                 cluster_id: None, // project ID: not used
                 name: None,      // project name: not used


@@ -407,6 +407,7 @@ impl PageServerNode {
                 .map(|x| x.parse::<bool>())
                 .transpose()
                 .context("Failed to parse 'gc_feedback' as bool")?,
+            heatmap_period: settings.remove("heatmap_period").map(|x| x.to_string()),
         };

         let request = models::TenantCreateRequest {
@@ -504,6 +505,7 @@ impl PageServerNode {
                 .map(|x| x.parse::<bool>())
                 .transpose()
                 .context("Failed to parse 'gc_feedback' as bool")?,
+            heatmap_period: settings.remove("heatmap_period").map(|x| x.to_string()),
         }
     };


@@ -165,7 +165,7 @@ pub fn migrate_tenant(
             let found = other_ps_tenants
                 .into_iter()
                 .map(|t| t.id)
-                .any(|i| i == tenant_id);
+                .any(|i| i.tenant_id == tenant_id);
             if !found {
                 continue;
             }


@@ -26,6 +26,13 @@ pub struct ComputeSpec {
     // but we don't use it for anything. Serde will ignore missing fields when
     // deserializing it.
     pub operation_uuid: Option<String>,
+
+    /// Compute features to enable. These feature flags are provided, when we
+    /// know all the details about client's compute, so they cannot be used
+    /// to change `Empty` compute behavior.
+    #[serde(default)]
+    pub features: Vec<ComputeFeature>,
+
     /// Expected cluster state at the end of transition process.
     pub cluster: Cluster,
     pub delta_operations: Option<Vec<DeltaOp>>,
@@ -68,6 +75,19 @@ pub struct ComputeSpec {
     pub remote_extensions: Option<RemoteExtSpec>,
 }

+/// Feature flag to signal `compute_ctl` to enable certain experimental functionality.
+#[derive(Serialize, Clone, Copy, Debug, Deserialize, PartialEq, Eq)]
+#[serde(rename_all = "snake_case")]
+pub enum ComputeFeature {
+    // XXX: Add more feature flags here.
+
+    // This is a special feature flag that is used to represent unknown feature flags.
+    // Basically all unknown to enum flags are represented as this one. See unit test
+    // `parse_unknown_features()` for more details.
+    #[serde(other)]
+    UnknownFeature,
+}
+
 #[derive(Clone, Debug, Default, Deserialize, Serialize)]
 pub struct RemoteExtSpec {
     pub public_extensions: Option<Vec<String>>,
@@ -187,8 +207,6 @@ pub struct DeltaOp {
 pub struct Role {
     pub name: PgIdent,
     pub encrypted_password: Option<String>,
-    pub replication: Option<bool>,
-    pub bypassrls: Option<bool>,
     pub options: GenericOptions,
 }

@@ -229,7 +247,10 @@ mod tests {
     #[test]
     fn parse_spec_file() {
         let file = File::open("tests/cluster_spec.json").unwrap();
-        let _spec: ComputeSpec = serde_json::from_reader(file).unwrap();
+        let spec: ComputeSpec = serde_json::from_reader(file).unwrap();
+
+        // Features list defaults to empty vector.
+        assert!(spec.features.is_empty());
     }

     #[test]
@@ -241,4 +262,22 @@ mod tests {
         ob.insert("unknown_field_123123123".into(), "hello".into());
         let _spec: ComputeSpec = serde_json::from_value(json).unwrap();
     }
+
+    #[test]
+    fn parse_unknown_features() {
+        // Test that unknown feature flags do not cause any errors.
+        let file = File::open("tests/cluster_spec.json").unwrap();
+        let mut json: serde_json::Value = serde_json::from_reader(file).unwrap();
+        let ob = json.as_object_mut().unwrap();
+
+        // Add unknown feature flags.
+        let features = vec!["foo_bar_feature", "baz_feature"];
+        ob.insert("features".into(), features.into());
+
+        let spec: ComputeSpec = serde_json::from_value(json).unwrap();
+
+        assert!(spec.features.len() == 2);
+        assert!(spec.features.contains(&ComputeFeature::UnknownFeature));
+        assert_eq!(spec.features, vec![ComputeFeature::UnknownFeature; 2]);
+    }
 }
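The `#[serde(other)]` variant in the hunk above makes the spec forward compatible: flags emitted by a newer control plane deserialize as `UnknownFeature` on an older `compute_ctl` instead of failing the whole spec. A stdlib-only sketch of the same degrade-to-unknown idea, without serde (the function and variant names are illustrative, not the crate's API):

```rust
// Forward-compatible feature parsing: unknown names degrade to a sentinel
// variant instead of turning the whole parse into an error.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComputeFeature {
    // Known flags would get their own variants; everything else falls through.
    UnknownFeature,
}

pub fn parse_feature(name: &str) -> ComputeFeature {
    match name {
        // "some_known_flag" => ComputeFeature::SomeKnownFlag,
        _ => ComputeFeature::UnknownFeature,
    }
}

fn main() {
    // Same shape as the `parse_unknown_features` unit test in the diff.
    let features: Vec<ComputeFeature> = ["foo_bar_feature", "baz_feature"]
        .iter()
        .map(|n| parse_feature(n))
        .collect();
    assert_eq!(features, vec![ComputeFeature::UnknownFeature; 2]);
    println!("unknown flags tolerated: {:?}", features);
}
```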


@@ -3,8 +3,11 @@
 //! Otherwise, we might not see all metrics registered via
 //! a default registry.
 #![deny(clippy::undocumented_unsafe_blocks)]

 use once_cell::sync::Lazy;
-use prometheus::core::{AtomicU64, Collector, GenericGauge, GenericGaugeVec};
+use prometheus::core::{
+    Atomic, AtomicU64, Collector, GenericCounter, GenericCounterVec, GenericGauge, GenericGaugeVec,
+};
 pub use prometheus::opts;
 pub use prometheus::register;
 pub use prometheus::Error;
@@ -132,3 +135,137 @@ fn get_rusage_stats() -> libc::rusage {
         rusage.assume_init()
     }
 }
+
+/// Create an [`IntCounterPairVec`] and registers to default registry.
+#[macro_export(local_inner_macros)]
+macro_rules! register_int_counter_pair_vec {
+    ($NAME1:expr, $HELP1:expr, $NAME2:expr, $HELP2:expr, $LABELS_NAMES:expr $(,)?) => {{
+        match (
+            $crate::register_int_counter_vec!($NAME1, $HELP1, $LABELS_NAMES),
+            $crate::register_int_counter_vec!($NAME2, $HELP2, $LABELS_NAMES),
+        ) {
+            (Ok(inc), Ok(dec)) => Ok($crate::IntCounterPairVec::new(inc, dec)),
+            (Err(e), _) | (_, Err(e)) => Err(e),
+        }
+    }};
+}
+
+/// Create an [`IntCounterPair`] and registers to default registry.
+#[macro_export(local_inner_macros)]
+macro_rules! register_int_counter_pair {
+    ($NAME1:expr, $HELP1:expr, $NAME2:expr, $HELP2:expr $(,)?) => {{
+        match (
+            $crate::register_int_counter!($NAME1, $HELP1),
+            $crate::register_int_counter!($NAME2, $HELP2),
+        ) {
+            (Ok(inc), Ok(dec)) => Ok($crate::IntCounterPair::new(inc, dec)),
+            (Err(e), _) | (_, Err(e)) => Err(e),
+        }
+    }};
+}
+
+/// A Pair of [`GenericCounterVec`]s. Like an [`GenericGaugeVec`] but will always observe changes
+pub struct GenericCounterPairVec<P: Atomic> {
+    inc: GenericCounterVec<P>,
+    dec: GenericCounterVec<P>,
+}
+
+/// A Pair of [`GenericCounter`]s. Like an [`GenericGauge`] but will always observe changes
+pub struct GenericCounterPair<P: Atomic> {
+    inc: GenericCounter<P>,
+    dec: GenericCounter<P>,
+}
+
+impl<P: Atomic> GenericCounterPairVec<P> {
+    pub fn new(inc: GenericCounterVec<P>, dec: GenericCounterVec<P>) -> Self {
+        Self { inc, dec }
+    }
+
+    /// `get_metric_with_label_values` returns the [`GenericCounterPair<P>`] for the given slice
+    /// of label values (same order as the VariableLabels in Desc). If that combination of
+    /// label values is accessed for the first time, a new [`GenericCounterPair<P>`] is created.
+    ///
+    /// An error is returned if the number of label values is not the same as the
+    /// number of VariableLabels in Desc.
+    pub fn get_metric_with_label_values(&self, vals: &[&str]) -> Result<GenericCounterPair<P>> {
+        Ok(GenericCounterPair {
+            inc: self.inc.get_metric_with_label_values(vals)?,
+            dec: self.dec.get_metric_with_label_values(vals)?,
+        })
+    }
+
+    /// `with_label_values` works as `get_metric_with_label_values`, but panics if an error
+    /// occurs.
+    pub fn with_label_values(&self, vals: &[&str]) -> GenericCounterPair<P> {
+        self.get_metric_with_label_values(vals).unwrap()
+    }
+}
+
+impl<P: Atomic> GenericCounterPair<P> {
+    pub fn new(inc: GenericCounter<P>, dec: GenericCounter<P>) -> Self {
+        Self { inc, dec }
+    }
+
+    /// Increment the gauge by 1, returning a guard that decrements by 1 on drop.
+    pub fn guard(&self) -> GenericCounterPairGuard<P> {
+        self.inc.inc();
+        GenericCounterPairGuard(self.dec.clone())
+    }
+
+    /// Increment the gauge by n, returning a guard that decrements by n on drop.
+    pub fn guard_by(&self, n: P::T) -> GenericCounterPairGuardBy<P> {
+        self.inc.inc_by(n);
+        GenericCounterPairGuardBy(self.dec.clone(), n)
+    }
+
+    /// Increase the gauge by 1.
+    #[inline]
+    pub fn inc(&self) {
+        self.inc.inc();
+    }
+
+    /// Decrease the gauge by 1.
+    #[inline]
+    pub fn dec(&self) {
+        self.dec.inc();
+    }
+
+    /// Add the given value to the gauge. (The value can be
+    /// negative, resulting in a decrement of the gauge.)
+    #[inline]
+    pub fn inc_by(&self, v: P::T) {
+        self.inc.inc_by(v);
+    }
+
+    /// Subtract the given value from the gauge. (The value can be
+    /// negative, resulting in an increment of the gauge.)
+    #[inline]
+    pub fn dec_by(&self, v: P::T) {
+        self.dec.inc_by(v);
+    }
+}
+
+/// Guard returned by [`GenericCounterPair::guard`]
+pub struct GenericCounterPairGuard<P: Atomic>(GenericCounter<P>);
+
+impl<P: Atomic> Drop for GenericCounterPairGuard<P> {
+    fn drop(&mut self) {
+        self.0.inc();
+    }
+}
+
+/// Guard returned by [`GenericCounterPair::guard_by`]
+pub struct GenericCounterPairGuardBy<P: Atomic>(GenericCounter<P>, P::T);
+
+impl<P: Atomic> Drop for GenericCounterPairGuardBy<P> {
+    fn drop(&mut self) {
+        self.0.inc_by(self.1);
+    }
+}
+
+/// A Pair of [`IntCounterVec`]s. Like an [`IntGaugeVec`] but will always observe changes
+pub type IntCounterPairVec = GenericCounterPairVec<AtomicU64>;
+
+/// A Pair of [`IntCounter`]s. Like an [`IntGauge`] but will always observe changes
+pub type IntCounterPair = GenericCounterPair<AtomicU64>;
+
+/// A guard for [`IntCounterPair`] that will decrement the gauge on drop
+pub type IntCounterPairGuard = GenericCounterPairGuard<AtomicU64>;
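The counter pair added above exists because a plain gauge can hide short-lived activity: if the gauge rises and falls between two scrapes, neither change is observed, while two monotonic counters record every entry and exit. A minimal stdlib sketch of the same RAII-guard idea, without the prometheus types (`CounterPair` and `PairGuard` are illustrative names):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Two monotonic counters standing in for one gauge: `inc` counts entries,
// `dec` counts exits, and the live value is their difference.
pub struct CounterPair {
    inc: Arc<AtomicU64>,
    dec: Arc<AtomicU64>,
}

// RAII guard: bumps the `dec` side when dropped.
pub struct PairGuard(Arc<AtomicU64>);

impl CounterPair {
    pub fn new() -> Self {
        Self {
            inc: Arc::new(AtomicU64::new(0)),
            dec: Arc::new(AtomicU64::new(0)),
        }
    }

    // Increment on entry; the returned guard increments `dec` on drop.
    pub fn guard(&self) -> PairGuard {
        self.inc.fetch_add(1, Ordering::Relaxed);
        PairGuard(Arc::clone(&self.dec))
    }

    // Current "gauge" value, derived from the two monotonic sides.
    pub fn current(&self) -> u64 {
        self.inc.load(Ordering::Relaxed) - self.dec.load(Ordering::Relaxed)
    }
}

impl Drop for PairGuard {
    fn drop(&mut self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let pair = CounterPair::new();
    {
        let _g = pair.guard();
        assert_eq!(pair.current(), 1);
    }
    // Guard dropped: both sides were recorded, live value is back to 0.
    assert_eq!(pair.current(), 0);
    println!("inc/dec pair balanced");
}
```

Because both sides only ever grow, a scraper computing `rate(inc) - rate(dec)` still sees churn that a sampled gauge would miss.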


@@ -140,3 +140,7 @@ impl Key {
         })
     }
 }
+
+pub fn is_rel_block_key(key: &Key) -> bool {
+    key.field1 == 0x00 && key.field4 != 0
+}


@@ -237,6 +237,7 @@ pub struct TenantConfig {
     pub min_resident_size_override: Option<u64>,
     pub evictions_low_residence_duration_metric_threshold: Option<String>,
     pub gc_feedback: Option<bool>,
+    pub heatmap_period: Option<String>,
 }

 /// A flattened analog of a `pagesever::tenant::LocationMode`, which
@@ -323,6 +324,7 @@ impl TenantConfigRequest {
 #[derive(Debug, Deserialize)]
 pub struct TenantAttachRequest {
+    #[serde(default)]
     pub config: TenantAttachConfig,
     #[serde(default)]
     pub generation: Option<u32>,
@@ -330,7 +332,7 @@ pub struct TenantAttachRequest {
 /// Newtype to enforce deny_unknown_fields on TenantConfig for
 /// its usage inside `TenantAttachRequest`.
-#[derive(Debug, Serialize, Deserialize)]
+#[derive(Debug, Serialize, Deserialize, Default)]
 #[serde(deny_unknown_fields)]
 pub struct TenantAttachConfig {
     #[serde(flatten)]
@@ -356,7 +358,7 @@ pub enum TenantAttachmentStatus {
 #[derive(Serialize, Deserialize, Clone)]
 pub struct TenantInfo {
-    pub id: TenantId,
+    pub id: TenantShardId,
     // NB: intentionally not part of OpenAPI, we don't want to commit to a specific set of TenantState's
     pub state: TenantState,
     /// Sum of the size of all layer files.
@@ -368,7 +370,7 @@ pub struct TenantInfo {
 /// This represents the output of the "timeline_detail" and "timeline_list" API calls.
 #[derive(Debug, Serialize, Deserialize, Clone)]
 pub struct TimelineInfo {
-    pub tenant_id: TenantId,
+    pub tenant_id: TenantShardId,
     pub timeline_id: TimelineId,
     pub ancestor_timeline_id: Option<TimelineId>,
@@ -384,6 +386,9 @@ pub struct TimelineInfo {
     /// The LSN that we are advertizing to safekeepers
     pub remote_consistent_lsn_visible: Lsn,

+    /// The LSN from the start of the root timeline (never changes)
+    pub initdb_lsn: Lsn,
+
     pub current_logical_size: u64,
     pub current_logical_size_is_accurate: bool,
@@ -822,7 +827,7 @@ mod tests {
     fn test_tenantinfo_serde() {
         // Test serialization/deserialization of TenantInfo
         let original_active = TenantInfo {
-            id: TenantId::generate(),
+            id: TenantShardId::unsharded(TenantId::generate()),
             state: TenantState::Active,
             current_physical_size: Some(42),
             attachment_status: TenantAttachmentStatus::Attached,
@@ -839,7 +844,7 @@ mod tests {
}); });
let original_broken = TenantInfo { let original_broken = TenantInfo {
id: TenantId::generate(), id: TenantShardId::unsharded(TenantId::generate()),
state: TenantState::Broken { state: TenantState::Broken {
reason: "reason".into(), reason: "reason".into(),
backtrace: "backtrace info".into(), backtrace: "backtrace info".into(),

@@ -1,5 +1,6 @@
use std::{ops::RangeInclusive, str::FromStr}; use std::{ops::RangeInclusive, str::FromStr};
use crate::key::{is_rel_block_key, Key};
use hex::FromHex; use hex::FromHex;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use thiserror; use thiserror;
@@ -72,19 +73,33 @@ impl TenantShardId {
) )
} }
pub fn shard_slug(&self) -> String { pub fn shard_slug(&self) -> impl std::fmt::Display + '_ {
format!("{:02x}{:02x}", self.shard_number.0, self.shard_count.0) ShardSlug(self)
}
/// Convenience for code that has special behavior on the 0th shard.
pub fn is_zero(&self) -> bool {
self.shard_number == ShardNumber(0)
}
}
/// Formatting helper
struct ShardSlug<'a>(&'a TenantShardId);
impl<'a> std::fmt::Display for ShardSlug<'a> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(
f,
"{:02x}{:02x}",
self.0.shard_number.0, self.0.shard_count.0
)
} }
} }
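The `ShardSlug` change above is an instance of a general Rust pattern: instead of allocating a `String`, return a borrowed wrapper that implements `Display`, so callers format it in place. A minimal sketch with illustrative names:

```rust
use std::fmt;

// Borrowed formatting helper: nothing is allocated until the caller formats it.
struct Slug<'a>(&'a (u8, u8));

impl<'a> fmt::Display for Slug<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:02x}{:02x}", self.0 .0, self.0 .1)
    }
}

// Returning `impl Display + '_` ties the wrapper's lifetime to the borrow.
fn slug(id: &(u8, u8)) -> impl fmt::Display + '_ {
    Slug(id)
}
```

`slug(&(3, 16)).to_string()` yields `"0310"`; the wrapper also composes directly into `write!(f, "{}-{}", tenant_id, slug)` as the new `Display for TenantShardId` does.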
impl std::fmt::Display for TenantShardId { impl std::fmt::Display for TenantShardId {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
if self.shard_count != ShardCount(0) { if self.shard_count != ShardCount(0) {
write!( write!(f, "{}-{}", self.tenant_id, self.shard_slug())
f,
"{}-{:02x}{:02x}",
self.tenant_id, self.shard_number.0, self.shard_count.0
)
} else { } else {
// Legacy case (shard_count == 0) -- format as just the tenant id. Note that this // Legacy case (shard_count == 0) -- format as just the tenant id. Note that this
// is distinct from the normal single shard case (shard count == 1). // is distinct from the normal single shard case (shard count == 1).
@@ -302,6 +317,8 @@ pub struct ShardStripeSize(pub u32);
pub struct ShardLayout(u8); pub struct ShardLayout(u8);
const LAYOUT_V1: ShardLayout = ShardLayout(1); const LAYOUT_V1: ShardLayout = ShardLayout(1);
/// ShardIdentity uses a magic layout value to indicate if it is unusable
const LAYOUT_BROKEN: ShardLayout = ShardLayout(255);
/// Default stripe size in pages: 256MiB divided by 8kiB page size. /// Default stripe size in pages: 256MiB divided by 8kiB page size.
const DEFAULT_STRIPE_SIZE: ShardStripeSize = ShardStripeSize(256 * 1024 / 8); const DEFAULT_STRIPE_SIZE: ShardStripeSize = ShardStripeSize(256 * 1024 / 8);
@@ -310,10 +327,10 @@ const DEFAULT_STRIPE_SIZE: ShardStripeSize = ShardStripeSize(256 * 1024 / 8);
/// to resolve a key to a shard, and then check whether that shard is ==self. /// to resolve a key to a shard, and then check whether that shard is ==self.
#[derive(Clone, Copy, Serialize, Deserialize, Eq, PartialEq, Debug)] #[derive(Clone, Copy, Serialize, Deserialize, Eq, PartialEq, Debug)]
pub struct ShardIdentity { pub struct ShardIdentity {
pub layout: ShardLayout,
pub number: ShardNumber, pub number: ShardNumber,
pub count: ShardCount, pub count: ShardCount,
pub stripe_size: ShardStripeSize, stripe_size: ShardStripeSize,
layout: ShardLayout,
} }
#[derive(thiserror::Error, Debug, PartialEq, Eq)] #[derive(thiserror::Error, Debug, PartialEq, Eq)]
@@ -339,6 +356,22 @@ impl ShardIdentity {
} }
} }
/// A broken instance of this type is only used for `TenantState::Broken` tenants,
/// which are constructed in code paths that don't have access to proper configuration.
///
/// A ShardIdentity in this state may not be used for anything, and should not be persisted.
/// Enforcement is via assertions, to avoid making our interface fallible for this
/// edge case: it is the Tenant's responsibility to avoid trying to do any I/O when in a broken
/// state, and by extension to avoid trying to do any page->shard resolution.
pub fn broken(number: ShardNumber, count: ShardCount) -> Self {
Self {
number,
count,
layout: LAYOUT_BROKEN,
stripe_size: DEFAULT_STRIPE_SIZE,
}
}
pub fn is_unsharded(&self) -> bool { pub fn is_unsharded(&self) -> bool {
self.number == ShardNumber(0) && self.count == ShardCount(0) self.number == ShardNumber(0) && self.count == ShardCount(0)
} }
@@ -365,6 +398,39 @@ impl ShardIdentity {
}) })
} }
} }
fn is_broken(&self) -> bool {
self.layout == LAYOUT_BROKEN
}
pub fn get_shard_number(&self, key: &Key) -> ShardNumber {
assert!(!self.is_broken());
key_to_shard_number(self.count, self.stripe_size, key)
}
/// Return true if the key should be ingested by this shard
pub fn is_key_local(&self, key: &Key) -> bool {
assert!(!self.is_broken());
if self.count < ShardCount(2) || (key_is_shard0(key) && self.number == ShardNumber(0)) {
true
} else {
key_to_shard_number(self.count, self.stripe_size, key) == self.number
}
}
pub fn shard_slug(&self) -> String {
if self.count > ShardCount(0) {
format!("-{:02x}{:02x}", self.number.0, self.count.0)
} else {
String::new()
}
}
/// Convenience for checking if this identity is the 0th shard in a tenant,
/// for special cases on shard 0 such as ingesting relation sizes.
pub fn is_zero(&self) -> bool {
self.number == ShardNumber(0)
}
} }
impl Serialize for ShardIndex { impl Serialize for ShardIndex {
@@ -438,6 +504,65 @@ impl<'de> Deserialize<'de> for ShardIndex {
} }
} }
/// Whether this key is always held on shard 0 (e.g. shard 0 holds all SLRU keys
/// in order to be able to serve basebackup requests without peer communication).
fn key_is_shard0(key: &Key) -> bool {
// To decide what to shard out to shards >0, we apply a simple rule that only
// relation pages are distributed to shards other than shard zero. Everything else gets
// stored on shard 0. This guarantees that shard 0 can independently serve basebackup
// requests, and any request other than those for particular blocks in relations.
//
// In this condition:
// - is_rel_block_key includes only relations, i.e. excludes SLRU data and
// all metadata.
// - field6 is set to -1 for relation size pages.
!(is_rel_block_key(key) && key.field6 != 0xffffffff)
}
/// Provide the same result as the function in postgres `hashfn.h` with the same name
fn murmurhash32(mut h: u32) -> u32 {
h ^= h >> 16;
h = h.wrapping_mul(0x85ebca6b);
h ^= h >> 13;
h = h.wrapping_mul(0xc2b2ae35);
h ^= h >> 16;
h
}
/// Provide the same result as the function in postgres `hashfn.h` with the same name
fn hash_combine(mut a: u32, mut b: u32) -> u32 {
b = b.wrapping_add(0x9e3779b9);
b = b.wrapping_add(a << 6);
b = b.wrapping_add(a >> 2);
a ^= b;
a
}
/// Where a Key is to be distributed across shards, select the shard. This function
/// does not account for keys that should be broadcast across shards.
///
/// The hashing in this function must exactly match what we do in postgres smgr
/// code. The resulting distribution of pages is intended to preserve locality within
/// `stripe_size` ranges of contiguous block numbers in the same relation, while otherwise
/// distributing data pseudo-randomly.
///
/// The mapping of key to shard is not stable across changes to ShardCount: this is intentional
/// and will be handled at higher levels when shards are split.
fn key_to_shard_number(count: ShardCount, stripe_size: ShardStripeSize, key: &Key) -> ShardNumber {
// Fast path for un-sharded tenants or broadcast keys
if count < ShardCount(2) || key_is_shard0(key) {
return ShardNumber(0);
}
// relNode
let mut hash = murmurhash32(key.field4);
// blockNum/stripe size
hash = hash_combine(hash, murmurhash32(key.field6 / stripe_size.0));
ShardNumber((hash % count.0 as u32) as u8)
}
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use std::str::FromStr; use std::str::FromStr;
@@ -609,4 +734,29 @@ mod tests {
Ok(()) Ok(())
} }
// These are only smoke tests to spot-check that our implementation doesn't
// deviate from a few example values: not aiming to validate the overall
// hashing algorithm.
#[test]
fn murmur_hash() {
assert_eq!(murmurhash32(0), 0);
assert_eq!(hash_combine(0xb1ff3b40, 0), 0xfb7923c9);
}
#[test]
fn shard_mapping() {
let key = Key {
field1: 0x00,
field2: 0x67f,
field3: 0x5,
field4: 0x400c,
field5: 0x00,
field6: 0x7d06,
};
let shard = key_to_shard_number(ShardCount(10), DEFAULT_STRIPE_SIZE, &key);
assert_eq!(shard, ShardNumber(8));
}
} }
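The key-to-shard mapping above can be reproduced with a std-only sketch; the function below takes relNode and blockNum directly instead of the full `Key`, and omits the `key_is_shard0` broadcast check for brevity:

```rust
// Same bit-mixing as the functions of the same name in postgres `hashfn.h`.
fn murmurhash32(mut h: u32) -> u32 {
    h ^= h >> 16;
    h = h.wrapping_mul(0x85ebca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2ae35);
    h ^= h >> 16;
    h
}

fn hash_combine(mut a: u32, mut b: u32) -> u32 {
    b = b.wrapping_add(0x9e3779b9);
    b = b.wrapping_add(a << 6);
    b = b.wrapping_add(a >> 2);
    a ^= b;
    a
}

// Simplified stand-in for `key_to_shard_number`: hash relNode, then fold in
// the stripe index so `stripe_size` consecutive blocks stay on one shard.
fn shard_for(rel_node: u32, block_num: u32, shard_count: u8, stripe_size: u32) -> u8 {
    if shard_count < 2 {
        return 0; // un-sharded fast path (the real code also checks key_is_shard0)
    }
    let hash = hash_combine(murmurhash32(rel_node), murmurhash32(block_num / stripe_size));
    (hash % shard_count as u32) as u8
}
```

With the `shard_mapping` test's key (`relNode = 0x400c`, `blockNum = 0x7d06`) and the default stripe size of `256 * 1024 / 8` pages, ten shards map the key to shard 8, matching the test above.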

@@ -289,10 +289,10 @@ impl FeStartupPacket {
// We shouldn't advance `buf` as probably full message is not there yet, // We shouldn't advance `buf` as probably full message is not there yet,
// so can't directly use Bytes::get_u32 etc. // so can't directly use Bytes::get_u32 etc.
let len = (&buf[0..4]).read_u32::<BigEndian>().unwrap() as usize; let len = (&buf[0..4]).read_u32::<BigEndian>().unwrap() as usize;
// The proposed replacement is `!(4..=MAX_STARTUP_PACKET_LENGTH).contains(&len)` // The proposed replacement is `!(8..=MAX_STARTUP_PACKET_LENGTH).contains(&len)`
// which is less readable // which is less readable
#[allow(clippy::manual_range_contains)] #[allow(clippy::manual_range_contains)]
if len < 4 || len > MAX_STARTUP_PACKET_LENGTH { if len < 8 || len > MAX_STARTUP_PACKET_LENGTH {
return Err(ProtocolError::Protocol(format!( return Err(ProtocolError::Protocol(format!(
"invalid startup packet message length {}", "invalid startup packet message length {}",
len len
@@ -975,4 +975,10 @@ mod tests {
let params = make_params("foo\\ bar \\ \\\\ baz\\ lol"); let params = make_params("foo\\ bar \\ \\\\ baz\\ lol");
assert_eq!(split_options(&params), ["foo bar", " \\", "baz ", "lol"]); assert_eq!(split_options(&params), ["foo bar", " \\", "baz ", "lol"]);
} }
#[test]
fn parse_fe_startup_packet_regression() {
let data = [0, 0, 0, 7, 0, 0, 0, 0];
FeStartupPacket::parse(&mut BytesMut::from_iter(data)).unwrap_err();
}
} }
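The corrected bound can be checked in isolation: the length word counts itself (4 bytes), and a valid startup packet also carries at least a 4-byte protocol version, so any `len < 8` is malformed. A std-only sketch (the `MAX_STARTUP_PACKET_LENGTH` value here is illustrative, not the proxy's actual constant):

```rust
const MAX_STARTUP_PACKET_LENGTH: usize = 10_000; // illustrative bound

fn startup_len_ok(buf: &[u8]) -> bool {
    if buf.len() < 4 {
        return false; // not enough bytes for the length word itself
    }
    // Big-endian length, counting the length word and everything after it.
    let len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    (8..=MAX_STARTUP_PACKET_LENGTH).contains(&len)
}
```

The regression input `[0, 0, 0, 7, 0, 0, 0, 0]` declares a 7-byte packet, too short to hold a protocol version, and is rejected.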

@@ -16,10 +16,11 @@ aws-credential-types.workspace = true
bytes.workspace = true bytes.workspace = true
camino.workspace = true camino.workspace = true
hyper = { workspace = true, features = ["stream"] } hyper = { workspace = true, features = ["stream"] }
futures.workspace = true
serde.workspace = true serde.workspace = true
serde_json.workspace = true serde_json.workspace = true
tokio = { workspace = true, features = ["sync", "fs", "io-util"] } tokio = { workspace = true, features = ["sync", "fs", "io-util"] }
tokio-util.workspace = true tokio-util = { workspace = true, features = ["compat"] }
toml_edit.workspace = true toml_edit.workspace = true
tracing.workspace = true tracing.workspace = true
scopeguard.workspace = true scopeguard.workspace = true

@@ -1,21 +1,24 @@
//! Azure Blob Storage wrapper //! Azure Blob Storage wrapper
use std::borrow::Cow;
use std::collections::HashMap; use std::collections::HashMap;
use std::env; use std::env;
use std::num::NonZeroU32; use std::num::NonZeroU32;
use std::pin::Pin;
use std::sync::Arc; use std::sync::Arc;
use std::{borrow::Cow, io::Cursor};
use super::REMOTE_STORAGE_PREFIX_SEPARATOR; use super::REMOTE_STORAGE_PREFIX_SEPARATOR;
use anyhow::Result; use anyhow::Result;
use azure_core::request_options::{MaxResults, Metadata, Range}; use azure_core::request_options::{MaxResults, Metadata, Range};
use azure_core::RetryOptions;
use azure_identity::DefaultAzureCredential; use azure_identity::DefaultAzureCredential;
use azure_storage::StorageCredentials; use azure_storage::StorageCredentials;
use azure_storage_blobs::prelude::ClientBuilder; use azure_storage_blobs::prelude::ClientBuilder;
use azure_storage_blobs::{blob::operations::GetBlobBuilder, prelude::ContainerClient}; use azure_storage_blobs::{blob::operations::GetBlobBuilder, prelude::ContainerClient};
use bytes::Bytes;
use futures::stream::Stream;
use futures_util::StreamExt; use futures_util::StreamExt;
use http_types::StatusCode; use http_types::StatusCode;
use tokio::io::AsyncRead;
use tracing::debug; use tracing::debug;
use crate::s3_bucket::RequestKind; use crate::s3_bucket::RequestKind;
@@ -49,7 +52,8 @@ impl AzureBlobStorage {
StorageCredentials::token_credential(Arc::new(token_credential)) StorageCredentials::token_credential(Arc::new(token_credential))
}; };
let builder = ClientBuilder::new(account, credentials); // we have an outer retry
let builder = ClientBuilder::new(account, credentials).retry(RetryOptions::none());
let client = builder.container_client(azure_config.container_name.to_owned()); let client = builder.container_client(azure_config.container_name.to_owned());
@@ -116,7 +120,8 @@ impl AzureBlobStorage {
let mut metadata = HashMap::new(); let mut metadata = HashMap::new();
// TODO give proper streaming response instead of buffering into RAM // TODO give proper streaming response instead of buffering into RAM
// https://github.com/neondatabase/neon/issues/5563 // https://github.com/neondatabase/neon/issues/5563
let mut buf = Vec::new();
let mut bufs = Vec::new();
while let Some(part) = response.next().await { while let Some(part) = response.next().await {
let part = part.map_err(to_download_error)?; let part = part.map_err(to_download_error)?;
if let Some(blob_meta) = part.blob.metadata { if let Some(blob_meta) = part.blob.metadata {
@@ -127,10 +132,10 @@ impl AzureBlobStorage {
.collect() .collect()
.await .await
.map_err(|e| DownloadError::Other(e.into()))?; .map_err(|e| DownloadError::Other(e.into()))?;
buf.extend_from_slice(&data.slice(..)); bufs.push(data);
} }
Ok(Download { Ok(Download {
download_stream: Box::pin(Cursor::new(buf)), download_stream: Box::pin(futures::stream::iter(bufs.into_iter().map(Ok))),
metadata: Some(StorageMetadata(metadata)), metadata: Some(StorageMetadata(metadata)),
}) })
} }
@@ -217,9 +222,10 @@ impl RemoteStorage for AzureBlobStorage {
} }
Ok(res) Ok(res)
} }
async fn upload( async fn upload(
&self, &self,
mut from: impl AsyncRead + Unpin + Send + Sync + 'static, from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize, data_size_bytes: usize,
to: &RemotePath, to: &RemotePath,
metadata: Option<StorageMetadata>, metadata: Option<StorageMetadata>,
@@ -227,13 +233,12 @@ impl RemoteStorage for AzureBlobStorage {
let _permit = self.permit(RequestKind::Put).await; let _permit = self.permit(RequestKind::Put).await;
let blob_client = self.client.blob_client(self.relative_path_to_name(to)); let blob_client = self.client.blob_client(self.relative_path_to_name(to));
// TODO FIX THIS UGLY HACK and don't buffer the entire object let from: Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static>> =
// into RAM here, but use the streaming interface. For that, Box::pin(from);
// we'd have to change the interface though...
// https://github.com/neondatabase/neon/issues/5563 let from = NonSeekableStream::new(from, data_size_bytes);
let mut buf = Vec::with_capacity(data_size_bytes);
tokio::io::copy(&mut from, &mut buf).await?; let body = azure_core::Body::SeekableStream(Box::new(from));
let body = azure_core::Body::Bytes(buf.into());
let mut builder = blob_client.put_block_blob(body); let mut builder = blob_client.put_block_blob(body);
@@ -266,17 +271,12 @@ impl RemoteStorage for AzureBlobStorage {
let mut builder = blob_client.get(); let mut builder = blob_client.get();
if let Some(end_exclusive) = end_exclusive { let range: Range = if let Some(end_exclusive) = end_exclusive {
builder = builder.range(Range::new(start_inclusive, end_exclusive)); (start_inclusive..end_exclusive).into()
} else { } else {
// Open ranges are not supported by the SDK so we work around (start_inclusive..).into()
// by setting the upper limit extremely high (but high enough };
// to still be representable by signed 64 bit integers). builder = builder.range(range);
// TODO remove workaround once the SDK adds open range support
// https://github.com/Azure/azure-sdk-for-rust/issues/1438
let end_exclusive = u64::MAX / 4;
builder = builder.range(Range::new(start_inclusive, end_exclusive));
}
self.download_for_builder(builder).await self.download_for_builder(builder).await
} }
@@ -312,3 +312,153 @@ impl RemoteStorage for AzureBlobStorage {
Ok(()) Ok(())
} }
} }
pin_project_lite::pin_project! {
/// Hack to work around not being able to stream once with azure sdk.
///
/// Azure sdk clones streams around with the assumption that they are like
/// `Arc<tokio::fs::File>` (except not supporting tokio), however our streams are not like
/// that. For example for an `index_part.json` we just have a single chunk of [`Bytes`]
/// representing the whole serialized vec. It could be trivially cloneable and "semi-trivially"
/// seekable, but we can also just re-try the request easier.
#[project = NonSeekableStreamProj]
enum NonSeekableStream<S> {
/// A stream wrapper's initial form.
///
/// Mutex exists to allow moving when cloning. If the sdk changes to do fewer
/// than one clone before the first request, then this must be changed.
Initial {
inner: std::sync::Mutex<Option<tokio_util::compat::Compat<tokio_util::io::StreamReader<S, Bytes>>>>,
len: usize,
},
/// The actually readable variant, produced by cloning the Initial variant.
///
/// The sdk currently always clones once, even without retry policy.
Actual {
#[pin]
inner: tokio_util::compat::Compat<tokio_util::io::StreamReader<S, Bytes>>,
len: usize,
read_any: bool,
},
/// Most likely unneeded, but left to make life easier, in case more clones are added.
Cloned {
len_was: usize,
}
}
}
impl<S> NonSeekableStream<S>
where
S: Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
{
fn new(inner: S, len: usize) -> NonSeekableStream<S> {
use tokio_util::compat::TokioAsyncReadCompatExt;
let inner = tokio_util::io::StreamReader::new(inner).compat();
let inner = Some(inner);
let inner = std::sync::Mutex::new(inner);
NonSeekableStream::Initial { inner, len }
}
}
impl<S> std::fmt::Debug for NonSeekableStream<S> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::Initial { len, .. } => f.debug_struct("Initial").field("len", len).finish(),
Self::Actual { len, .. } => f.debug_struct("Actual").field("len", len).finish(),
Self::Cloned { len_was, .. } => f.debug_struct("Cloned").field("len", len_was).finish(),
}
}
}
impl<S> futures::io::AsyncRead for NonSeekableStream<S>
where
S: Stream<Item = std::io::Result<Bytes>>,
{
fn poll_read(
self: std::pin::Pin<&mut Self>,
cx: &mut std::task::Context<'_>,
buf: &mut [u8],
) -> std::task::Poll<std::io::Result<usize>> {
match self.project() {
NonSeekableStreamProj::Actual {
inner, read_any, ..
} => {
*read_any = true;
inner.poll_read(cx, buf)
}
// NonSeekableStream::Initial does not support reading because it is just much easier
// to have the mutex in place where one does not poll the contents, or that's how it
// seemed originally. If there is a version upgrade which changes the cloning, then
// that support needs to be hacked in.
//
// including {self:?} into the message would be useful, but unsure how to unproject.
_ => std::task::Poll::Ready(Err(std::io::Error::new(
std::io::ErrorKind::Other,
"cloned or initial values cannot be read",
))),
}
}
}
impl<S> Clone for NonSeekableStream<S> {
/// Weird clone implementation exists to support the sdk doing cloning before issuing the first
/// request, see type documentation.
fn clone(&self) -> Self {
use NonSeekableStream::*;
match self {
Initial { inner, len } => {
if let Some(inner) = inner.lock().unwrap().take() {
Actual {
inner,
len: *len,
read_any: false,
}
} else {
Self::Cloned { len_was: *len }
}
}
Actual { len, .. } => Cloned { len_was: *len },
Cloned { len_was } => Cloned { len_was: *len_was },
}
}
}
#[async_trait::async_trait]
impl<S> azure_core::SeekableStream for NonSeekableStream<S>
where
S: Stream<Item = std::io::Result<Bytes>> + Unpin + Send + Sync + 'static,
{
async fn reset(&mut self) -> azure_core::error::Result<()> {
use NonSeekableStream::*;
let msg = match self {
Initial { inner, .. } => {
if inner.get_mut().unwrap().is_some() {
return Ok(());
} else {
"reset after first clone is not supported"
}
}
Actual { read_any, .. } if !*read_any => return Ok(()),
Actual { .. } => "reset after reading is not supported",
Cloned { .. } => "reset after second clone is not supported",
};
Err(azure_core::error::Error::new(
azure_core::error::ErrorKind::Io,
std::io::Error::new(std::io::ErrorKind::Other, msg),
))
}
// Note: it is not documented if this should be the total or remaining length, total passes the
// tests.
fn len(&self) -> usize {
use NonSeekableStream::*;
match self {
Initial { len, .. } => *len,
Actual { len, .. } => *len,
Cloned { len_was, .. } => *len_was,
}
}
}
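The clone handling above boils down to a small state machine: the inner reader is single-use, so the first `clone()` moves it out of a `Mutex<Option<_>>` and any later clone gets an unusable placeholder. A stripped-down sketch of that trick (illustrative names, no SDK types):

```rust
use std::sync::Mutex;

// A single-use value that survives exactly one clone.
enum OneShot<T> {
    Initial(Mutex<Option<T>>), // Mutex allows moving the value out through `&self`
    Actual(T),                 // the one readable copy, produced by the first clone
    Cloned,                    // any further clone: unusable placeholder
}

impl<T> Clone for OneShot<T> {
    fn clone(&self) -> Self {
        match self {
            OneShot::Initial(slot) => match slot.lock().unwrap().take() {
                Some(inner) => OneShot::Actual(inner),
                None => OneShot::Cloned,
            },
            _ => OneShot::Cloned,
        }
    }
}
```

Note that `T` never needs to be `Clone` here, which is the whole point: the wrapped stream cannot be duplicated, only handed off once.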

@@ -19,8 +19,10 @@ use std::{collections::HashMap, fmt::Debug, num::NonZeroUsize, pin::Pin, sync::A
use anyhow::{bail, Context}; use anyhow::{bail, Context};
use camino::{Utf8Path, Utf8PathBuf}; use camino::{Utf8Path, Utf8PathBuf};
use bytes::Bytes;
use futures::stream::Stream;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
use tokio::{io, sync::Semaphore}; use tokio::sync::Semaphore;
use toml_edit::Item; use toml_edit::Item;
use tracing::info; use tracing::info;
@@ -179,7 +181,7 @@ pub trait RemoteStorage: Send + Sync + 'static {
/// Streams the local file contents into remote into the remote storage entry. /// Streams the local file contents into remote into the remote storage entry.
async fn upload( async fn upload(
&self, &self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static, from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
// S3 PUT request requires the content length to be specified, // S3 PUT request requires the content length to be specified,
// otherwise it starts to fail with the concurrent connection count increasing. // otherwise it starts to fail with the concurrent connection count increasing.
data_size_bytes: usize, data_size_bytes: usize,
@@ -206,7 +208,7 @@ pub trait RemoteStorage: Send + Sync + 'static {
} }
pub struct Download { pub struct Download {
pub download_stream: Pin<Box<dyn io::AsyncRead + Unpin + Send + Sync>>, pub download_stream: Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Unpin + Send + Sync>>,
/// Extra key-value data, associated with the current remote file. /// Extra key-value data, associated with the current remote file.
pub metadata: Option<StorageMetadata>, pub metadata: Option<StorageMetadata>,
} }
@@ -300,7 +302,7 @@ impl GenericRemoteStorage {
pub async fn upload( pub async fn upload(
&self, &self,
from: impl io::AsyncRead + Unpin + Send + Sync + 'static, from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
data_size_bytes: usize, data_size_bytes: usize,
to: &RemotePath, to: &RemotePath,
metadata: Option<StorageMetadata>, metadata: Option<StorageMetadata>,
@@ -398,7 +400,7 @@ impl GenericRemoteStorage {
/// this path is used for the remote object id conversion only. /// this path is used for the remote object id conversion only.
pub async fn upload_storage_object( pub async fn upload_storage_object(
&self, &self,
from: impl tokio::io::AsyncRead + Unpin + Send + Sync + 'static, from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
from_size_bytes: usize, from_size_bytes: usize,
to: &RemotePath, to: &RemotePath,
) -> anyhow::Result<()> { ) -> anyhow::Result<()> {

@@ -7,11 +7,14 @@
use std::{borrow::Cow, future::Future, io::ErrorKind, pin::Pin}; use std::{borrow::Cow, future::Future, io::ErrorKind, pin::Pin};
use anyhow::{bail, ensure, Context}; use anyhow::{bail, ensure, Context};
use bytes::Bytes;
use camino::{Utf8Path, Utf8PathBuf}; use camino::{Utf8Path, Utf8PathBuf};
use futures::stream::Stream;
use tokio::{ use tokio::{
fs, fs,
io::{self, AsyncReadExt, AsyncSeekExt, AsyncWriteExt}, io::{self, AsyncReadExt, AsyncSeekExt, AsyncWriteExt},
}; };
use tokio_util::io::ReaderStream;
use tracing::*; use tracing::*;
use utils::{crashsafe::path_with_suffix_extension, fs_ext::is_directory_empty}; use utils::{crashsafe::path_with_suffix_extension, fs_ext::is_directory_empty};
@@ -99,27 +102,35 @@ impl LocalFs {
}; };
// If we were given a directory, we may use it as our starting point. // If we were given a directory, we may use it as our starting point.
// Otherwise, we must go up to the parent directory. This is because // Otherwise, we must go up to the first ancestor dir that exists. This is because
// S3 object list prefixes can be arbitrary strings, but when reading // S3 object list prefixes can be arbitrary strings, but when reading
// the local filesystem we need a directory to start calling read_dir on. // the local filesystem we need a directory to start calling read_dir on.
let mut initial_dir = full_path.clone(); let mut initial_dir = full_path.clone();
match fs::metadata(full_path.clone()).await { loop {
Ok(meta) => { // Did we make it to the root?
if !meta.is_dir() { if initial_dir.parent().is_none() {
anyhow::bail!("list_files: failed to find valid ancestor dir for {full_path}");
}
match fs::metadata(initial_dir.clone()).await {
Ok(meta) if meta.is_dir() => {
// We found a directory, break
break;
}
Ok(_meta) => {
// It's not a directory: strip back to the parent // It's not a directory: strip back to the parent
initial_dir.pop(); initial_dir.pop();
} }
} Err(e) if e.kind() == ErrorKind::NotFound => {
Err(e) if e.kind() == ErrorKind::NotFound => { // It's not a file that exists: strip the prefix back to the parent directory
// It's not a file that exists: strip the prefix back to the parent directory initial_dir.pop();
initial_dir.pop(); }
} Err(e) => {
Err(e) => { // Unexpected I/O error
// Unexpected I/O error anyhow::bail!(e)
anyhow::bail!(e) }
} }
} }
// Note that Utf8PathBuf starts_with only considers full path segments, but // Note that Utf8PathBuf starts_with only considers full path segments, but
// object prefixes are arbitrary strings, so we need the strings for doing // object prefixes are arbitrary strings, so we need the strings for doing
// starts_with later. // starts_with later.
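The new loop generalizes the old single `pop()`: keep stripping path components until an existing directory is found, giving up if the root is reached first. A synchronous std-only sketch of the same logic (hypothetical helper name, and unlike the real code it does not distinguish `NotFound` from other I/O errors):

```rust
use std::path::{Path, PathBuf};

// Walk up from `start` to the first ancestor that is an existing directory.
fn first_existing_ancestor_dir(start: &Path) -> Option<PathBuf> {
    let mut dir = start.to_path_buf();
    loop {
        match std::fs::metadata(&dir) {
            // Found a directory we can start calling read_dir on.
            Ok(meta) if meta.is_dir() => return Some(dir),
            // A plain file or a missing entry: strip back to the parent.
            _ => {
                if !dir.pop() {
                    return None; // ran out of ancestors
                }
            }
        }
    }
}
```

For a prefix like `tenants/1234` where only `tenants/` exists on disk, the helper lands on `tenants/`, mirroring how S3-style string prefixes are mapped onto a real directory to list.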
@@ -211,7 +222,7 @@ impl RemoteStorage for LocalFs {
async fn upload( async fn upload(
&self, &self,
data: impl io::AsyncRead + Unpin + Send + Sync + 'static, data: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync,
data_size_bytes: usize, data_size_bytes: usize,
to: &RemotePath, to: &RemotePath,
metadata: Option<StorageMetadata>, metadata: Option<StorageMetadata>,
@@ -244,9 +255,12 @@ impl RemoteStorage for LocalFs {
); );
let from_size_bytes = data_size_bytes as u64; let from_size_bytes = data_size_bytes as u64;
let data = tokio_util::io::StreamReader::new(data);
let data = std::pin::pin!(data);
let mut buffer_to_read = data.take(from_size_bytes); let mut buffer_to_read = data.take(from_size_bytes);
let bytes_read = io::copy(&mut buffer_to_read, &mut destination) // alternatively we could just write the bytes to a file, but local_fs is a testing utility
let bytes_read = io::copy_buf(&mut buffer_to_read, &mut destination)
.await .await
.with_context(|| { .with_context(|| {
format!( format!(
@@ -300,7 +314,7 @@ impl RemoteStorage for LocalFs {
async fn download(&self, from: &RemotePath) -> Result<Download, DownloadError> { async fn download(&self, from: &RemotePath) -> Result<Download, DownloadError> {
let target_path = from.with_base(&self.storage_root); let target_path = from.with_base(&self.storage_root);
if file_exists(&target_path).map_err(DownloadError::BadInput)? { if file_exists(&target_path).map_err(DownloadError::BadInput)? {
let source = io::BufReader::new( let source = ReaderStream::new(
fs::OpenOptions::new() fs::OpenOptions::new()
.read(true) .read(true)
.open(&target_path) .open(&target_path)
@@ -340,16 +354,14 @@ impl RemoteStorage for LocalFs {
} }
let target_path = from.with_base(&self.storage_root); let target_path = from.with_base(&self.storage_root);
if file_exists(&target_path).map_err(DownloadError::BadInput)? { if file_exists(&target_path).map_err(DownloadError::BadInput)? {
let mut source = io::BufReader::new( let mut source = tokio::fs::OpenOptions::new()
fs::OpenOptions::new() .read(true)
.read(true) .open(&target_path)
        .open(&target_path)
        .await
        .with_context(|| {
            format!("Failed to open source file {target_path:?} to use in the download")
        })
-       .map_err(DownloadError::Other)?,
-   );
+       .map_err(DownloadError::Other)?;
    source
        .seek(io::SeekFrom::Start(start_inclusive))
        .await
@@ -363,11 +375,13 @@ impl RemoteStorage for LocalFs {
    Ok(match end_exclusive {
        Some(end_exclusive) => Download {
            metadata,
-           download_stream: Box::pin(source.take(end_exclusive - start_inclusive)),
+           download_stream: Box::pin(ReaderStream::new(
+               source.take(end_exclusive - start_inclusive),
+           )),
        },
        None => Download {
            metadata,
-           download_stream: Box::pin(source),
+           download_stream: Box::pin(ReaderStream::new(source)),
        },
    })
    } else {
@@ -467,7 +481,9 @@ fn file_exists(file_path: &Utf8Path) -> anyhow::Result<bool> {
mod fs_tests {
    use super::*;
+   use bytes::Bytes;
    use camino_tempfile::tempdir;
+   use futures_util::Stream;
    use std::{collections::HashMap, io::Write};

    async fn read_and_assert_remote_file_contents(
@@ -477,7 +493,7 @@ mod fs_tests {
        remote_storage_path: &RemotePath,
        expected_metadata: Option<&StorageMetadata>,
    ) -> anyhow::Result<String> {
-       let mut download = storage
+       let download = storage
            .download(remote_storage_path)
            .await
            .map_err(|e| anyhow::anyhow!("Download failed: {e}"))?;
@@ -486,13 +502,9 @@ mod fs_tests {
            "Unexpected metadata returned for the downloaded file"
        );

-       let mut contents = String::new();
-       download
-           .download_stream
-           .read_to_string(&mut contents)
-           .await
-           .context("Failed to read remote file contents into string")?;
-       Ok(contents)
+       let contents = aggregate(download.download_stream).await?;
+
+       String::from_utf8(contents).map_err(anyhow::Error::new)
    }

    #[tokio::test]
@@ -521,25 +533,26 @@ mod fs_tests {
        let storage = create_storage()?;

        let id = RemotePath::new(Utf8Path::new("dummy"))?;
-       let content = std::io::Cursor::new(b"12345");
+       let content = Bytes::from_static(b"12345");
+       let content = move || futures::stream::once(futures::future::ready(Ok(content.clone())));

        // Check that you get an error if the size parameter doesn't match the actual
        // size of the stream.
        storage
-           .upload(Box::new(content.clone()), 0, &id, None)
+           .upload(content(), 0, &id, None)
            .await
            .expect_err("upload with zero size succeeded");
        storage
-           .upload(Box::new(content.clone()), 4, &id, None)
+           .upload(content(), 4, &id, None)
            .await
            .expect_err("upload with too short size succeeded");
        storage
-           .upload(Box::new(content.clone()), 6, &id, None)
+           .upload(content(), 6, &id, None)
            .await
            .expect_err("upload with too large size succeeded");

        // Correct size is 5, this should succeed.
-       storage.upload(Box::new(content), 5, &id, None).await?;
+       storage.upload(content(), 5, &id, None).await?;

        Ok(())
    }
@@ -587,7 +600,7 @@ mod fs_tests {
        let uploaded_bytes = dummy_contents(upload_name).into_bytes();
        let (first_part_local, second_part_local) = uploaded_bytes.split_at(3);

-       let mut first_part_download = storage
+       let first_part_download = storage
            .download_byte_range(&upload_target, 0, Some(first_part_local.len() as u64))
            .await?;
        assert!(
@@ -595,21 +608,13 @@ mod fs_tests {
            "No metadata should be returned for no metadata upload"
        );

-       let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
-       io::copy(
-           &mut first_part_download.download_stream,
-           &mut first_part_remote,
-       )
-       .await?;
-       first_part_remote.flush().await?;
-       let first_part_remote = first_part_remote.into_inner().into_inner();
+       let first_part_remote = aggregate(first_part_download.download_stream).await?;
        assert_eq!(
-           first_part_local,
-           first_part_remote.as_slice(),
+           first_part_local, first_part_remote,
            "First part bytes should be returned when requested"
        );

-       let mut second_part_download = storage
+       let second_part_download = storage
            .download_byte_range(
                &upload_target,
                first_part_local.len() as u64,
@@ -621,17 +626,9 @@ mod fs_tests {
            "No metadata should be returned for no metadata upload"
        );

-       let mut second_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
-       io::copy(
-           &mut second_part_download.download_stream,
-           &mut second_part_remote,
-       )
-       .await?;
-       second_part_remote.flush().await?;
-       let second_part_remote = second_part_remote.into_inner().into_inner();
+       let second_part_remote = aggregate(second_part_download.download_stream).await?;
        assert_eq!(
-           second_part_local,
-           second_part_remote.as_slice(),
+           second_part_local, second_part_remote,
            "Second part bytes should be returned when requested"
        );
@@ -721,17 +718,10 @@ mod fs_tests {
        let uploaded_bytes = dummy_contents(upload_name).into_bytes();
        let (first_part_local, _) = uploaded_bytes.split_at(3);

-       let mut partial_download_with_metadata = storage
+       let partial_download_with_metadata = storage
            .download_byte_range(&upload_target, 0, Some(first_part_local.len() as u64))
            .await?;
-       let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new()));
-       io::copy(
-           &mut partial_download_with_metadata.download_stream,
-           &mut first_part_remote,
-       )
-       .await?;
-       first_part_remote.flush().await?;
-       let first_part_remote = first_part_remote.into_inner().into_inner();
+       let first_part_remote = aggregate(partial_download_with_metadata.download_stream).await?;
        assert_eq!(
            first_part_local,
            first_part_remote.as_slice(),
@@ -807,16 +797,16 @@ mod fs_tests {
            )
        })?;

-       storage
-           .upload(Box::new(file), size, &relative_path, metadata)
-           .await?;
+       let file = tokio_util::io::ReaderStream::new(file);
+
+       storage.upload(file, size, &relative_path, metadata).await?;
        Ok(relative_path)
    }

    async fn create_file_for_upload(
        path: &Utf8Path,
        contents: &str,
-   ) -> anyhow::Result<(io::BufReader<fs::File>, usize)> {
+   ) -> anyhow::Result<(fs::File, usize)> {
        std::fs::create_dir_all(path.parent().unwrap())?;
        let mut file_for_writing = std::fs::OpenOptions::new()
            .write(true)
@@ -826,7 +816,7 @@ mod fs_tests {
        drop(file_for_writing);
        let file_size = path.metadata()?.len() as usize;
        Ok((
-           io::BufReader::new(fs::OpenOptions::new().read(true).open(&path).await?),
+           fs::OpenOptions::new().read(true).open(&path).await?,
            file_size,
        ))
    }
@@ -840,4 +830,16 @@ mod fs_tests {
        files.sort_by(|a, b| a.0.cmp(&b.0));
        Ok(files)
    }
+
+   async fn aggregate(
+       stream: impl Stream<Item = std::io::Result<Bytes>>,
+   ) -> anyhow::Result<Vec<u8>> {
+       use futures::stream::StreamExt;
+       let mut out = Vec::new();
+       let mut stream = std::pin::pin!(stream);
+       while let Some(res) = stream.next().await {
+           out.extend_from_slice(&res?[..]);
+       }
+       Ok(out)
+   }
}
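The `aggregate` test helper added above folds a stream of `io::Result<Bytes>` chunks into one buffer, bailing on the first error. A minimal synchronous sketch of that same fold, using only the standard library (an iterator of `io::Result<Vec<u8>>` stands in for the async stream):

```rust
use std::io;

// Sketch of the `aggregate` logic: concatenate fallible chunks into one
// buffer, propagating the first error via `?`, as the stream loop does.
fn aggregate_chunks(
    chunks: impl IntoIterator<Item = io::Result<Vec<u8>>>,
) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    for chunk in chunks {
        // `?` stops at the first failed chunk
        out.extend_from_slice(&chunk?);
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    let whole = aggregate_chunks([Ok(b"123".to_vec()), Ok(b"45".to_vec())])?;
    assert_eq!(whole, b"12345");
    Ok(())
}
```

The async version differs only in pinning the stream (`std::pin::pin!`) and awaiting each chunk with `StreamExt::next`.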


@@ -4,9 +4,14 @@
//! allowing multiple api users to independently work with the same S3 bucket, if
//! their bucket prefixes are both specified and different.

-use std::{borrow::Cow, sync::Arc};
+use std::{
+    borrow::Cow,
+    pin::Pin,
+    sync::Arc,
+    task::{Context, Poll},
+};

-use anyhow::Context;
+use anyhow::Context as _;
use aws_config::{
    environment::credentials::EnvironmentVariableCredentialsProvider,
    imds::credentials::ImdsCredentialsProvider,
@@ -28,11 +33,10 @@ use aws_smithy_async::rt::sleep::TokioSleep;
use aws_smithy_types::body::SdkBody;
use aws_smithy_types::byte_stream::ByteStream;
+use bytes::Bytes;
+use futures::stream::Stream;
use hyper::Body;
use scopeguard::ScopeGuard;
-use tokio::io::{self, AsyncRead};
-use tokio_util::io::ReaderStream;
-use tracing::debug;

use super::StorageMetadata;
use crate::{
@@ -63,7 +67,7 @@ struct GetObjectRequest {
impl S3Bucket {
    /// Creates the S3 storage, errors if incorrect AWS S3 configuration provided.
    pub fn new(aws_config: &S3Config) -> anyhow::Result<Self> {
-       debug!(
+       tracing::debug!(
            "Creating s3 remote storage for S3 bucket {}",
            aws_config.bucket_name
        );
@@ -225,12 +229,15 @@ impl S3Bucket {
        match get_object {
            Ok(object_output) => {
                let metadata = object_output.metadata().cloned().map(StorageMetadata);
+               let body = object_output.body;
+               let body = ByteStreamAsStream::from(body);
+               let body = PermitCarrying::new(permit, body);
+               let body = TimedDownload::new(started_at, body);
+
                Ok(Download {
                    metadata,
-                   download_stream: Box::pin(io::BufReader::new(TimedDownload::new(
-                       started_at,
-                       RatelimitedAsyncRead::new(permit, object_output.body.into_async_read()),
-                   ))),
+                   download_stream: Box::pin(body),
                })
            }
            Err(SdkError::ServiceError(e)) if matches!(e.err(), GetObjectError::NoSuchKey(_)) => {
@@ -243,29 +250,55 @@ impl S3Bucket {
    }
}

+pin_project_lite::pin_project! {
+    struct ByteStreamAsStream {
+        #[pin]
+        inner: aws_smithy_types::byte_stream::ByteStream
+    }
+}
+
+impl From<aws_smithy_types::byte_stream::ByteStream> for ByteStreamAsStream {
+    fn from(inner: aws_smithy_types::byte_stream::ByteStream) -> Self {
+        ByteStreamAsStream { inner }
+    }
+}
+
+impl Stream for ByteStreamAsStream {
+    type Item = std::io::Result<Bytes>;
+
+    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
+        // this does the std::io::ErrorKind::Other conversion
+        self.project().inner.poll_next(cx).map_err(|x| x.into())
+    }
+
+    // cannot implement size_hint because inner.size_hint is remaining size in bytes, which makes
+    // sense and Stream::size_hint does not really
+}
+
pin_project_lite::pin_project! {
    /// An `AsyncRead` adapter which carries a permit for the lifetime of the value.
-   struct RatelimitedAsyncRead<S> {
+   struct PermitCarrying<S> {
        permit: tokio::sync::OwnedSemaphorePermit,
        #[pin]
        inner: S,
    }
}

-impl<S: AsyncRead> RatelimitedAsyncRead<S> {
+impl<S> PermitCarrying<S> {
    fn new(permit: tokio::sync::OwnedSemaphorePermit, inner: S) -> Self {
-       RatelimitedAsyncRead { permit, inner }
+       Self { permit, inner }
    }
}

-impl<S: AsyncRead> AsyncRead for RatelimitedAsyncRead<S> {
-   fn poll_read(
-       self: std::pin::Pin<&mut Self>,
-       cx: &mut std::task::Context<'_>,
-       buf: &mut io::ReadBuf<'_>,
-   ) -> std::task::Poll<std::io::Result<()>> {
-       let this = self.project();
-       this.inner.poll_read(cx, buf)
+impl<S: Stream<Item = std::io::Result<Bytes>>> Stream for PermitCarrying<S> {
+   type Item = <S as Stream>::Item;
+
+   fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
+       self.project().inner.poll_next(cx)
+   }
+
+   fn size_hint(&self) -> (usize, Option<usize>) {
+       self.inner.size_hint()
    }
}

@@ -285,7 +318,7 @@ pin_project_lite::pin_project! {
    }
}

-impl<S: AsyncRead> TimedDownload<S> {
+impl<S> TimedDownload<S> {
    fn new(started_at: std::time::Instant, inner: S) -> Self {
        TimedDownload {
            started_at,
@@ -295,25 +328,26 @@ impl<S: AsyncRead> TimedDownload<S> {
    }
}

-impl<S: AsyncRead> AsyncRead for TimedDownload<S> {
-   fn poll_read(
-       self: std::pin::Pin<&mut Self>,
-       cx: &mut std::task::Context<'_>,
-       buf: &mut io::ReadBuf<'_>,
-   ) -> std::task::Poll<std::io::Result<()>> {
-       let this = self.project();
-       let before = buf.filled().len();
-       let read = std::task::ready!(this.inner.poll_read(cx, buf));
-       let read_eof = buf.filled().len() == before;
-
-       match read {
-           Ok(()) if read_eof => *this.outcome = AttemptOutcome::Ok,
-           Ok(()) => { /* still in progress */ }
-           Err(_) => *this.outcome = AttemptOutcome::Err,
-       }
-
-       std::task::Poll::Ready(read)
+impl<S: Stream<Item = std::io::Result<Bytes>>> Stream for TimedDownload<S> {
+   type Item = <S as Stream>::Item;
+
+   fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
+       use std::task::ready;
+
+       let this = self.project();
+
+       let res = ready!(this.inner.poll_next(cx));
+       match &res {
+           Some(Ok(_)) => {}
+           Some(Err(_)) => *this.outcome = metrics::AttemptOutcome::Err,
+           None => *this.outcome = metrics::AttemptOutcome::Ok,
+       }
+
+       Poll::Ready(res)
+   }
+
+   fn size_hint(&self) -> (usize, Option<usize>) {
+       self.inner.size_hint()
    }
}

@@ -378,7 +412,7 @@ impl RemoteStorage for S3Bucket {
        let empty = Vec::new();
        let prefixes = response.common_prefixes.as_ref().unwrap_or(&empty);

-       tracing::info!("list: {} prefixes, {} keys", prefixes.len(), keys.len());
+       tracing::debug!("list: {} prefixes, {} keys", prefixes.len(), keys.len());

        for object in keys {
            let object_path = object.key().expect("response does not contain a key");
@@ -403,7 +437,7 @@ impl RemoteStorage for S3Bucket {
    async fn upload(
        &self,
-       from: impl io::AsyncRead + Unpin + Send + Sync + 'static,
+       from: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
        from_size_bytes: usize,
        to: &RemotePath,
        metadata: Option<StorageMetadata>,
@@ -413,7 +447,7 @@ impl RemoteStorage for S3Bucket {
        let started_at = start_measuring_requests(kind);

-       let body = Body::wrap_stream(ReaderStream::new(from));
+       let body = Body::wrap_stream(from);
        let bytes_stream = ByteStream::new(SdkBody::from_body_0_4(body));

        let res = self
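`PermitCarrying` above is a pure pass-through `Stream` whose only job is to keep a semaphore permit alive while the download is consumed. The ownership trick is independent of async: a std-only sketch with an `Iterator` and an `Rc` standing in for the `Stream` and the `OwnedSemaphorePermit` (both names here are stand-ins, not the real types):

```rust
use std::rc::Rc;

// Pass-through iterator that keeps a guard object alive exactly as long as
// the data is being consumed, mirroring PermitCarrying's poll_next/size_hint.
struct GuardCarrying<G, I> {
    _guard: G, // held only so it drops when the wrapper drops
    inner: I,
}

impl<G, I> GuardCarrying<G, I> {
    fn new(guard: G, inner: I) -> Self {
        Self { _guard: guard, inner }
    }
}

impl<G, I: Iterator> Iterator for GuardCarrying<G, I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        self.inner.next() // pure delegation, like poll_next above
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        self.inner.size_hint()
    }
}

fn main() {
    // Rc's refcount makes the guard's lifetime observable.
    let permit = Rc::new(());
    let wrapped = GuardCarrying::new(Rc::clone(&permit), [1, 2, 3].into_iter());
    assert_eq!(Rc::strong_count(&permit), 2);

    let collected: Vec<_> = wrapped.collect(); // consuming drops wrapper + guard
    assert_eq!(collected, vec![1, 2, 3]);
    assert_eq!(Rc::strong_count(&permit), 1);
}
```

This is why the real permit is taken *before* the GET and threaded into the returned stream: the rate limit stays held until the caller finishes (or drops) the download, not just until the request returns.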


@@ -1,6 +1,8 @@
//! This module provides a wrapper around a real RemoteStorage implementation that
//! causes the first N attempts at each upload or download operatio to fail. For
//! testing purposes.
+use bytes::Bytes;
+use futures::stream::Stream;
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::Mutex;
@@ -108,7 +110,7 @@ impl RemoteStorage for UnreliableWrapper {
    async fn upload(
        &self,
-       data: impl tokio::io::AsyncRead + Unpin + Send + Sync + 'static,
+       data: impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
        // S3 PUT request requires the content length to be specified,
        // otherwise it starts to fail with the concurrent connection count increasing.
        data_size_bytes: usize,


@@ -7,7 +7,9 @@ use std::sync::Arc;
use std::time::UNIX_EPOCH;

use anyhow::Context;
+use bytes::Bytes;
use camino::Utf8Path;
+use futures::stream::Stream;
use once_cell::sync::OnceCell;
use remote_storage::{
    AzureConfig, Download, GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind,
@@ -180,23 +182,14 @@ async fn azure_delete_objects_works(ctx: &mut MaybeEnabledAzure) -> anyhow::Resu
    let path3 = RemotePath::new(Utf8Path::new(format!("{}/path3", ctx.base_prefix).as_str()))
        .with_context(|| "RemotePath conversion")?;

-   let data1 = "remote blob data1".as_bytes();
-   let data1_len = data1.len();
-   let data2 = "remote blob data2".as_bytes();
-   let data2_len = data2.len();
-   let data3 = "remote blob data3".as_bytes();
-   let data3_len = data3.len();
-   ctx.client
-       .upload(std::io::Cursor::new(data1), data1_len, &path1, None)
-       .await?;
-   ctx.client
-       .upload(std::io::Cursor::new(data2), data2_len, &path2, None)
-       .await?;
-   ctx.client
-       .upload(std::io::Cursor::new(data3), data3_len, &path3, None)
-       .await?;
+   let (data, len) = upload_stream("remote blob data1".as_bytes().into());
+   ctx.client.upload(data, len, &path1, None).await?;
+
+   let (data, len) = upload_stream("remote blob data2".as_bytes().into());
+   ctx.client.upload(data, len, &path2, None).await?;
+
+   let (data, len) = upload_stream("remote blob data3".as_bytes().into());
+   ctx.client.upload(data, len, &path3, None).await?;

    ctx.client.delete_objects(&[path1, path2]).await?;
@@ -219,53 +212,56 @@ async fn azure_upload_download_works(ctx: &mut MaybeEnabledAzure) -> anyhow::Res
    let path = RemotePath::new(Utf8Path::new(format!("{}/file", ctx.base_prefix).as_str()))
        .with_context(|| "RemotePath conversion")?;

-   let data = "remote blob data here".as_bytes();
-   let data_len = data.len() as u64;
-   ctx.client
-       .upload(std::io::Cursor::new(data), data.len(), &path, None)
-       .await?;
+   let orig = bytes::Bytes::from_static("remote blob data here".as_bytes());
+
+   let (data, len) = wrap_stream(orig.clone());
+
+   ctx.client.upload(data, len, &path, None).await?;

-   async fn download_and_compare(mut dl: Download) -> anyhow::Result<Vec<u8>> {
+   async fn download_and_compare(dl: Download) -> anyhow::Result<Vec<u8>> {
        let mut buf = Vec::new();
-       tokio::io::copy(&mut dl.download_stream, &mut buf).await?;
+       tokio::io::copy_buf(
+           &mut tokio_util::io::StreamReader::new(dl.download_stream),
+           &mut buf,
+       )
+       .await?;
        Ok(buf)
    }

    // Normal download request
    let dl = ctx.client.download(&path).await?;
    let buf = download_and_compare(dl).await?;
-   assert_eq!(buf, data);
+   assert_eq!(&buf, &orig);

    // Full range (end specified)
    let dl = ctx
        .client
-       .download_byte_range(&path, 0, Some(data_len))
+       .download_byte_range(&path, 0, Some(len as u64))
        .await?;
    let buf = download_and_compare(dl).await?;
-   assert_eq!(buf, data);
+   assert_eq!(&buf, &orig);

    // partial range (end specified)
    let dl = ctx.client.download_byte_range(&path, 4, Some(10)).await?;
    let buf = download_and_compare(dl).await?;
-   assert_eq!(buf, data[4..10]);
+   assert_eq!(&buf, &orig[4..10]);

    // partial range (end beyond real end)
    let dl = ctx
        .client
-       .download_byte_range(&path, 8, Some(data_len * 100))
+       .download_byte_range(&path, 8, Some(len as u64 * 100))
        .await?;
    let buf = download_and_compare(dl).await?;
-   assert_eq!(buf, data[8..]);
+   assert_eq!(&buf, &orig[8..]);

    // Partial range (end unspecified)
    let dl = ctx.client.download_byte_range(&path, 4, None).await?;
    let buf = download_and_compare(dl).await?;
-   assert_eq!(buf, data[4..]);
+   assert_eq!(&buf, &orig[4..]);

    // Full range (end unspecified)
    let dl = ctx.client.download_byte_range(&path, 0, None).await?;
    let buf = download_and_compare(dl).await?;
-   assert_eq!(buf, data);
+   assert_eq!(&buf, &orig);

    debug!("Cleanup: deleting file at path {path:?}");
    ctx.client
@@ -504,11 +500,8 @@ async fn upload_azure_data(
        let blob_path = blob_prefix.join(Utf8Path::new(&format!("blob_{i}")));
        debug!("Creating remote item {i} at path {blob_path:?}");

-       let data = format!("remote blob data {i}").into_bytes();
-       let data_len = data.len();
-       task_client
-           .upload(std::io::Cursor::new(data), data_len, &blob_path, None)
-           .await?;
+       let (data, len) = upload_stream(format!("remote blob data {i}").into_bytes().into());
+       task_client.upload(data, len, &blob_path, None).await?;

        Ok::<_, anyhow::Error>((blob_prefix, blob_path))
    });
@@ -589,11 +582,8 @@ async fn upload_simple_azure_data(
            .with_context(|| format!("{blob_path:?} to RemotePath conversion"))?;
        debug!("Creating remote item {i} at path {blob_path:?}");

-       let data = format!("remote blob data {i}").into_bytes();
-       let data_len = data.len();
-       task_client
-           .upload(std::io::Cursor::new(data), data_len, &blob_path, None)
-           .await?;
+       let (data, len) = upload_stream(format!("remote blob data {i}").into_bytes().into());
+       task_client.upload(data, len, &blob_path, None).await?;

        Ok::<_, anyhow::Error>(blob_path)
    });
@@ -622,3 +612,32 @@ async fn upload_simple_azure_data(
        ControlFlow::Continue(uploaded_blobs)
    }
}
+
+// FIXME: copypasted from test_real_s3, can't remember how to share a module which is not compiled
+// to binary
+fn upload_stream(
+    content: std::borrow::Cow<'static, [u8]>,
+) -> (
+    impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
+    usize,
+) {
+    use std::borrow::Cow;
+
+    let content = match content {
+        Cow::Borrowed(x) => Bytes::from_static(x),
+        Cow::Owned(vec) => Bytes::from(vec),
+    };
+    wrap_stream(content)
+}
+
+fn wrap_stream(
+    content: bytes::Bytes,
+) -> (
+    impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
+    usize,
+) {
+    let len = content.len();
+    let content = futures::future::ready(Ok(content));
+    (futures::stream::once(content), len)
+}
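The `upload_stream` helper above accepts a `Cow<'static, [u8]>` so callers can pass either a static slice or an owned buffer, and only the owned path moves data. A std-only sketch of that `Cow` dispatch (a plain `Vec<u8>` stands in for `bytes::Bytes`, and `into_body` is a hypothetical name):

```rust
use std::borrow::Cow;

// Sketch of upload_stream's Cow handling: pick the right conversion per
// variant and report the payload length alongside the body.
fn into_body(content: Cow<'static, [u8]>) -> (Vec<u8>, usize) {
    let content = match content {
        // In the real helper this arm is the zero-copy Bytes::from_static path;
        // with Vec we have to copy.
        Cow::Borrowed(x) => x.to_vec(),
        Cow::Owned(vec) => vec,
    };
    let len = content.len();
    (content, len)
}

fn main() {
    // &'static [u8] converts into Cow::Borrowed via .into()
    let (body, len) = into_body("remote blob data".as_bytes().into());
    assert_eq!(len, 16);
    assert_eq!(body, b"remote blob data");
}
```

With `Bytes` the borrowed arm is genuinely free: `Bytes::from_static` just records the pointer and length of the `'static` slice, which is why the helper distinguishes the two variants at all.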


@@ -7,7 +7,9 @@ use std::sync::Arc;
use std::time::UNIX_EPOCH;

use anyhow::Context;
+use bytes::Bytes;
use camino::Utf8Path;
+use futures::stream::Stream;
use once_cell::sync::OnceCell;
use remote_storage::{
    GenericRemoteStorage, RemotePath, RemoteStorageConfig, RemoteStorageKind, S3Config,
@@ -176,23 +178,14 @@ async fn s3_delete_objects_works(ctx: &mut MaybeEnabledS3) -> anyhow::Result<()>
    let path3 = RemotePath::new(Utf8Path::new(format!("{}/path3", ctx.base_prefix).as_str()))
        .with_context(|| "RemotePath conversion")?;

-   let data1 = "remote blob data1".as_bytes();
-   let data1_len = data1.len();
-   let data2 = "remote blob data2".as_bytes();
-   let data2_len = data2.len();
-   let data3 = "remote blob data3".as_bytes();
-   let data3_len = data3.len();
-   ctx.client
-       .upload(std::io::Cursor::new(data1), data1_len, &path1, None)
-       .await?;
-   ctx.client
-       .upload(std::io::Cursor::new(data2), data2_len, &path2, None)
-       .await?;
-   ctx.client
-       .upload(std::io::Cursor::new(data3), data3_len, &path3, None)
-       .await?;
+   let (data, len) = upload_stream("remote blob data1".as_bytes().into());
+   ctx.client.upload(data, len, &path1, None).await?;
+
+   let (data, len) = upload_stream("remote blob data2".as_bytes().into());
+   ctx.client.upload(data, len, &path2, None).await?;
+
+   let (data, len) = upload_stream("remote blob data3".as_bytes().into());
+   ctx.client.upload(data, len, &path3, None).await?;

    ctx.client.delete_objects(&[path1, path2]).await?;
@@ -432,11 +425,9 @@ async fn upload_s3_data(
        let blob_path = blob_prefix.join(Utf8Path::new(&format!("blob_{i}")));
        debug!("Creating remote item {i} at path {blob_path:?}");

-       let data = format!("remote blob data {i}").into_bytes();
-       let data_len = data.len();
-       task_client
-           .upload(std::io::Cursor::new(data), data_len, &blob_path, None)
-           .await?;
+       let (data, data_len) =
+           upload_stream(format!("remote blob data {i}").into_bytes().into());
+       task_client.upload(data, data_len, &blob_path, None).await?;

        Ok::<_, anyhow::Error>((blob_prefix, blob_path))
    });
@@ -517,11 +508,9 @@ async fn upload_simple_s3_data(
            .with_context(|| format!("{blob_path:?} to RemotePath conversion"))?;
        debug!("Creating remote item {i} at path {blob_path:?}");

-       let data = format!("remote blob data {i}").into_bytes();
-       let data_len = data.len();
-       task_client
-           .upload(std::io::Cursor::new(data), data_len, &blob_path, None)
-           .await?;
+       let (data, data_len) =
+           upload_stream(format!("remote blob data {i}").into_bytes().into());
+       task_client.upload(data, data_len, &blob_path, None).await?;

        Ok::<_, anyhow::Error>(blob_path)
    });
@@ -550,3 +539,30 @@ async fn upload_simple_s3_data(
        ControlFlow::Continue(uploaded_blobs)
    }
}
+
+fn upload_stream(
+    content: std::borrow::Cow<'static, [u8]>,
+) -> (
+    impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
+    usize,
+) {
+    use std::borrow::Cow;
+
+    let content = match content {
+        Cow::Borrowed(x) => Bytes::from_static(x),
+        Cow::Owned(vec) => Bytes::from(vec),
+    };
+    wrap_stream(content)
+}
+
+fn wrap_stream(
+    content: bytes::Bytes,
+) -> (
+    impl Stream<Item = std::io::Result<Bytes>> + Send + Sync + 'static,
+    usize,
+) {
+    let len = content.len();
+    let content = futures::future::ready(Ok(content));
+    (futures::stream::once(content), len)
+}


@@ -50,6 +50,8 @@ const_format.workspace = true
# why is it only here? no other crate should use it, streams are rarely needed.
tokio-stream = { version = "0.1.14" }

+serde_path_to_error.workspace = true
+
[dev-dependencies]
byteorder.workspace = true
bytes.workspace = true


@@ -1,16 +1,14 @@
-use std::sync::Arc;
-
-use tokio::sync::{mpsc, Mutex};
+use tokio_util::task::{task_tracker::TaskTrackerToken, TaskTracker};

/// While a reference is kept around, the associated [`Barrier::wait`] will wait.
///
/// Can be cloned, moved and kept around in futures as "guard objects".
#[derive(Clone)]
-pub struct Completion(mpsc::Sender<()>);
+pub struct Completion(TaskTrackerToken);

/// Barrier will wait until all clones of [`Completion`] have been dropped.
#[derive(Clone)]
-pub struct Barrier(Arc<Mutex<mpsc::Receiver<()>>>);
+pub struct Barrier(TaskTracker);

impl Default for Barrier {
    fn default() -> Self {
@@ -21,7 +19,7 @@ impl Default for Barrier {
impl Barrier {
    pub async fn wait(self) {
-       self.0.lock().await.recv().await;
+       self.0.wait().await;
    }

    pub async fn maybe_wait(barrier: Option<Barrier>) {
@@ -33,8 +31,7 @@ impl Barrier {
impl PartialEq for Barrier {
    fn eq(&self, other: &Self) -> bool {
-       // we don't use dyn so this is good
-       Arc::ptr_eq(&self.0, &other.0)
+       TaskTracker::ptr_eq(&self.0, &other.0)
    }
}

@@ -42,8 +39,10 @@ impl Eq for Barrier {}

/// Create new Guard and Barrier pair.
pub fn channel() -> (Completion, Barrier) {
-   let (tx, rx) = mpsc::channel::<()>(1);
-   let rx = Mutex::new(rx);
-   let rx = Arc::new(rx);
-   (Completion(tx), Barrier(rx))
+   let tracker = TaskTracker::new();
+   // otherwise wait never exits
+   tracker.close();
+
+   let token = tracker.token();
+   (Completion(token), Barrier(tracker))
}
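Both the old tokio-mpsc implementation and the new `TaskTracker` one satisfy the same contract: `Barrier::wait` returns once every clone of `Completion` has been dropped. A std-only sketch of that contract using a synchronous channel (this mirrors the *old* sender/receiver scheme, simplified; the real `Barrier` is additionally `Clone`, which is exactly what forced the `Arc<Mutex<Receiver>>` wrapping being replaced here):

```rust
use std::sync::mpsc;

/// Guard object: while any clone is alive, the barrier below blocks.
#[derive(Clone)]
struct Completion(mpsc::Sender<()>);

struct Barrier(mpsc::Receiver<()>);

impl Barrier {
    fn wait(self) {
        // recv() only errors out once all senders (Completion clones) are gone;
        // nothing is ever actually sent on the channel.
        let _ = self.0.recv();
    }
}

fn channel() -> (Completion, Barrier) {
    let (tx, rx) = mpsc::channel();
    (Completion(tx), Barrier(rx))
}

fn main() {
    let (completion, barrier) = channel();
    let second = completion.clone();
    drop(completion);
    drop(second);
    barrier.wait(); // returns immediately: no Completion is left alive
}
```

`TaskTracker` gives the same semantics more directly: each `TaskTrackerToken` is a live "task", and `wait()` resolves when the tracker is closed and token count reaches zero — hence the `tracker.close()` right after construction.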


@@ -152,3 +152,16 @@ impl Debug for Generation {
        }
    }
}
+
+#[cfg(test)]
+mod test {
+    use super::*;
+
+    #[test]
+    fn generation_gt() {
+        // Important that a None generation compares less than a valid one, during upgrades from
+        // pre-generation systems.
+        assert!(Generation::none() < Generation::new(0));
+        assert!(Generation::none() < Generation::new(1));
+    }
+}
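The ordering property this new test pins down falls out of how Rust derives `Ord` for option-like enums. A self-contained sketch, assuming `Generation` behaves like an `Option<u32>` with the derived ordering (`Gen` here is a hypothetical stand-in, not the real type):

```rust
// With derived Ord, the None variant sorts before every Some value, so a
// "no generation" marker from a pre-generation system compares less than
// any valid generation number.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Gen(Option<u32>);

fn main() {
    assert!(Gen(None) < Gen(Some(0)));
    assert!(Gen(None) < Gen(Some(1)));
    // and valid generations order numerically among themselves
    assert!(Gen(Some(0)) < Gen(Some(1)));
}
```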


@@ -25,8 +25,12 @@ pub async fn json_request_or_empty_body<T: for<'de> Deserialize<'de>>(
    if body.remaining() == 0 {
        return Ok(None);
    }
-   serde_json::from_reader(body.reader())
-       .context("Failed to parse json request")
+
+   let mut deser = serde_json::de::Deserializer::from_reader(body.reader());
+
+   serde_path_to_error::deserialize(&mut deser)
+       // intentionally stringify because the debug version is not helpful in python logs
+       .map_err(|e| anyhow::anyhow!("Failed to parse json request: {e}"))
        .map(Some)
        .map_err(ApiError::BadRequest)
}


@@ -1,6 +1,7 @@
use std::str::FromStr;

use anyhow::Context;
+use metrics::{IntCounter, IntCounterVec};
use once_cell::sync::Lazy;
use strum_macros::{EnumString, EnumVariantNames};
@@ -24,16 +25,48 @@ impl LogFormat {
    }
}

-static TRACING_EVENT_COUNT: Lazy<metrics::IntCounterVec> = Lazy::new(|| {
-   metrics::register_int_counter_vec!(
+struct TracingEventCountMetric {
+   error: IntCounter,
+   warn: IntCounter,
+   info: IntCounter,
+   debug: IntCounter,
+   trace: IntCounter,
+}
+
+static TRACING_EVENT_COUNT_METRIC: Lazy<TracingEventCountMetric> = Lazy::new(|| {
+   let vec = metrics::register_int_counter_vec!(
        "libmetrics_tracing_event_count",
        "Number of tracing events, by level",
        &["level"]
    )
-   .expect("failed to define metric")
+   .expect("failed to define metric");
+   TracingEventCountMetric::new(vec)
});

-struct TracingEventCountLayer(&'static metrics::IntCounterVec);
+impl TracingEventCountMetric {
+   fn new(vec: IntCounterVec) -> Self {
+       Self {
+           error: vec.with_label_values(&["error"]),
+           warn: vec.with_label_values(&["warn"]),
+           info: vec.with_label_values(&["info"]),
+           debug: vec.with_label_values(&["debug"]),
+           trace: vec.with_label_values(&["trace"]),
+       }
+   }
+
+   fn inc_for_level(&self, level: tracing::Level) {
+       let counter = match level {
+           tracing::Level::ERROR => &self.error,
+           tracing::Level::WARN => &self.warn,
+           tracing::Level::INFO => &self.info,
+           tracing::Level::DEBUG => &self.debug,
+           tracing::Level::TRACE => &self.trace,
+       };
+       counter.inc();
+   }
+}
+
+struct TracingEventCountLayer(&'static TracingEventCountMetric);

impl<S> tracing_subscriber::layer::Layer<S> for TracingEventCountLayer
where
@@ -44,15 +77,7 @@ where
        event: &tracing::Event<'_>,
        _ctx: tracing_subscriber::layer::Context<'_, S>,
    ) {
-       let level = event.metadata().level();
-       let level = match *level {
-           tracing::Level::ERROR => "error",
-           tracing::Level::WARN => "warn",
-           tracing::Level::INFO => "info",
-           tracing::Level::DEBUG => "debug",
-           tracing::Level::TRACE => "trace",
-       };
-       self.0.with_label_values(&[level]).inc();
+       self.0.inc_for_level(*event.metadata().level());
    }
}

@@ -106,7 +131,9 @@ pub fn init(
        };
        log_layer.with_filter(rust_log_env_filter())
    });
-   let r = r.with(TracingEventCountLayer(&TRACING_EVENT_COUNT).with_filter(rust_log_env_filter()));
+   let r = r.with(
+       TracingEventCountLayer(&TRACING_EVENT_COUNT_METRIC).with_filter(rust_log_env_filter()),
+   );
    match tracing_error_layer_enablement {
        TracingErrorLayerEnablement::EnableWithRustLogFilter => r
            .with(tracing_error::ErrorLayer::default().with_filter(rust_log_env_filter()))
@@ -257,14 +284,14 @@ impl std::fmt::Debug for SecretString {
mod tests {
    use metrics::{core::Opts, IntCounterVec};

-   use super::TracingEventCountLayer;
+   use crate::logging::{TracingEventCountLayer, TracingEventCountMetric};

    #[test]
    fn tracing_event_count_metric() {
        let counter_vec =
            IntCounterVec::new(Opts::new("testmetric", "testhelp"), &["level"]).unwrap();
-       let counter_vec = Box::leak(Box::new(counter_vec)); // make it 'static
-       let layer = TracingEventCountLayer(counter_vec);
+       let metric = Box::leak(Box::new(TracingEventCountMetric::new(counter_vec.clone())));
+       let layer = TracingEventCountLayer(metric);

        use tracing_subscriber::prelude::*;

        tracing::subscriber::with_default(tracing_subscriber::registry().with(layer), || {

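The refactor above resolves each label value into its `IntCounter` once at startup, so the per-event hot path no longer calls `with_label_values` (which hashes the label string) on every tracing event. A minimal std-only sketch of the same idea (the names are illustrative and `AtomicU64` stands in for `metrics::IntCounter`; this is not the pageserver's API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Pre-resolved per-level counters, mirroring the shape of TracingEventCountMetric.
struct LevelCounters {
    error: AtomicU64,
    warn: AtomicU64,
    info: AtomicU64,
}

#[derive(Clone, Copy)]
enum Level {
    Error,
    Warn,
    Info,
}

static COUNTERS: LevelCounters = LevelCounters {
    error: AtomicU64::new(0),
    warn: AtomicU64::new(0),
    info: AtomicU64::new(0),
};

impl LevelCounters {
    // Hot path: a branch plus one atomic increment, instead of a
    // label-string lookup per event.
    fn inc_for_level(&self, level: Level) {
        let counter = match level {
            Level::Error => &self.error,
            Level::Warn => &self.warn,
            Level::Info => &self.info,
        };
        counter.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    COUNTERS.inc_for_level(Level::Warn);
    COUNTERS.inc_for_level(Level::Warn);
    COUNTERS.inc_for_level(Level::Info);
    assert_eq!(COUNTERS.warn.load(Ordering::Relaxed), 2);
    assert_eq!(COUNTERS.info.load(Ordering::Relaxed), 1);
}
```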
View File

@@ -1,10 +1,10 @@
 //!
 //! RCU stands for Read-Copy-Update. It's a synchronization mechanism somewhat
 //! similar to a lock, but it allows readers to "hold on" to an old value of RCU
-//! without blocking writers, and allows writing a new values without blocking
-//! readers. When you update the new value, the new value is immediately visible
+//! without blocking writers, and allows writing a new value without blocking
+//! readers. When you update the value, the new value is immediately visible
 //! to new readers, but the update waits until all existing readers have
-//! finishe, so that no one sees the old value anymore.
+//! finished, so that on return, no one sees the old value anymore.
 //!
 //! This implementation isn't wait-free; it uses an RwLock that is held for a
 //! short duration when the value is read or updated.
@@ -26,6 +26,7 @@
 //! Increment the value by one, and wait for old readers to finish:
 //!
 //! ```
+//! # async fn dox() {
 //! # let rcu = utils::simple_rcu::Rcu::new(1);
 //! let write_guard = rcu.lock_for_write();
 //!
@@ -36,15 +37,17 @@
 //!
 //! // Concurrent reads and writes are now possible again. Wait for all the readers
 //! // that still observe the old value to finish.
-//! waitlist.wait();
+//! waitlist.wait().await;
+//! # }
 //! ```
 //!
 #![warn(missing_docs)]

 use std::ops::Deref;
-use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
 use std::sync::{Arc, Weak};
-use std::sync::{Mutex, RwLock, RwLockWriteGuard};
+use std::sync::{RwLock, RwLockWriteGuard};
+
+use tokio::sync::watch;

 ///
 /// Rcu allows multiple readers to read and hold onto a value without blocking
@@ -68,22 +71,21 @@ struct RcuCell<V> {
     value: V,

     /// A dummy channel. We never send anything to this channel. The point is
-    /// that when the RcuCell is dropped, any cloned Senders will be notified
+    /// that when the RcuCell is dropped, any subscribed Receivers will be notified
     /// that the channel is closed. Updaters can use this to wait out until the
     /// RcuCell has been dropped, i.e. until the old value is no longer in use.
     ///
-    /// We never do anything with the receiver, we just need to hold onto it so
-    /// that the Senders will be notified when it's dropped. But because it's
-    /// not Sync, we need a Mutex on it.
-    watch: (SyncSender<()>, Mutex<Receiver<()>>),
+    /// We never send anything to this, we just need to hold onto it so that the
+    /// Receivers will be notified when it's dropped.
+    watch: watch::Sender<()>,
 }

 impl<V> RcuCell<V> {
     fn new(value: V) -> Self {
-        let (watch_sender, watch_receiver) = sync_channel(0);
+        let (watch_sender, _) = watch::channel(());
         RcuCell {
             value,
-            watch: (watch_sender, Mutex::new(watch_receiver)),
+            watch: watch_sender,
         }
     }
 }
@@ -141,10 +143,10 @@ impl<V> Deref for RcuReadGuard<V> {
 ///
 /// Write guard returned by `write`
 ///
-/// NB: Holding this guard blocks all concurrent `read` and `write` calls, so
-/// it should only be held for a short duration!
+/// NB: Holding this guard blocks all concurrent `read` and `write` calls, so it should only be
+/// held for a short duration!
 ///
-/// Calling `store` consumes the guard, making new reads and new writes possible
+/// Calling [`Self::store_and_unlock`] consumes the guard, making new reads and new writes possible
 /// again.
 ///
 pub struct RcuWriteGuard<'a, V> {
@@ -179,7 +181,7 @@ impl<'a, V> RcuWriteGuard<'a, V> {
         // the watches for any that do.
         self.inner.old_cells.retain(|weak| {
             if let Some(cell) = weak.upgrade() {
-                watches.push(cell.watch.0.clone());
+                watches.push(cell.watch.subscribe());
                 true
             } else {
                 false
@@ -193,20 +195,20 @@ impl<'a, V> RcuWriteGuard<'a, V> {
 ///
 /// List of readers who can still see old values.
 ///
-pub struct RcuWaitList(Vec<SyncSender<()>>);
+pub struct RcuWaitList(Vec<watch::Receiver<()>>);

 impl RcuWaitList {
     ///
     /// Wait for old readers to finish.
     ///
-    pub fn wait(mut self) {
+    pub async fn wait(mut self) {
         // after all the old_cells are no longer in use, we're done
         for w in self.0.iter_mut() {
             // This will block until the Receiver is closed. That happens when
             // the RcuCell is dropped.
             #[allow(clippy::single_match)]
-            match w.send(()) {
-                Ok(_) => panic!("send() unexpectedly succeeded on dummy channel"),
+            match w.changed().await {
+                Ok(_) => panic!("changed() unexpectedly succeeded on dummy channel"),
                 Err(_) => {
                     // closed, which means that the cell has been dropped, and
                     // its value is no longer in use
@@ -220,11 +222,10 @@ impl RcuWaitList {
 mod tests {
     use super::*;
     use std::sync::{Arc, Mutex};
-    use std::thread::{sleep, spawn};
     use std::time::Duration;

-    #[test]
-    fn two_writers() {
+    #[tokio::test]
+    async fn two_writers() {
         let rcu = Rcu::new(1);

         let read1 = rcu.read();
@@ -248,33 +249,35 @@ mod tests {
         assert_eq!(*read1, 1);

         let log = Arc::new(Mutex::new(Vec::new()));

-        // Wait for the old readers to finish in separate threads.
+        // Wait for the old readers to finish in separate tasks.
         let log_clone = Arc::clone(&log);
-        let thread2 = spawn(move || {
-            wait2.wait();
+        let task2 = tokio::spawn(async move {
+            wait2.wait().await;
             log_clone.lock().unwrap().push("wait2 done");
         });
         let log_clone = Arc::clone(&log);
-        let thread3 = spawn(move || {
-            wait3.wait();
+        let task3 = tokio::spawn(async move {
+            wait3.wait().await;
            log_clone.lock().unwrap().push("wait3 done");
         });

         // without this sleep the test can pass on accident if the writer is slow
-        sleep(Duration::from_millis(500));
+        tokio::time::sleep(Duration::from_millis(100)).await;

         // Release first reader. This allows first write to finish, but calling
-        // wait() on the second one would still block.
+        // wait() on the 'task3' would still block.
         log.lock().unwrap().push("dropping read1");
         drop(read1);
-        thread2.join().unwrap();
+        task2.await.unwrap();

-        sleep(Duration::from_millis(500));
+        assert!(!task3.is_finished());
+        tokio::time::sleep(Duration::from_millis(100)).await;

         // Release second reader, and finish second writer.
         log.lock().unwrap().push("dropping read2");
         drop(read2);
-        thread3.join().unwrap();
+        task3.await.unwrap();

         assert_eq!(
             log.lock().unwrap().as_slice(),

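The commit swaps the old "dummy channel" trick for `tokio::sync::watch`: nobody ever transfers data over the channel; the updater only cares that the channel reports closed once the `RcuCell` holding the other end is dropped. A std-only sketch of the pre-patch mechanism (hypothetical helper name), where `send()` on a rendezvous channel returns `Err` as soon as the `Receiver` is gone:

```rust
use std::sync::mpsc::{sync_channel, SyncSender};

// The "dummy channel" idiom: we never deliver a value, we only use the
// channel's closed-ness to detect that the other side has been dropped.
fn reader_gone(tx: &SyncSender<()>) -> bool {
    tx.send(()).is_err()
}

fn main() {
    let (tx, rx) = sync_channel::<()>(0);
    drop(rx); // the "reader" finishes: its end of the channel is dropped
    assert!(reader_gone(&tx));
}
```

The drawback of the std version is that `send()` on a live rendezvous channel blocks the thread, which cannot be awaited; with `tokio::sync::watch`, `Receiver::changed()` resolves to `Err` once the `Sender` (held by the `RcuCell`) is dropped, making `wait()` an async fn.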
View File

@@ -30,18 +30,32 @@ async fn warn_if_stuck<Fut: std::future::Future>(
     let mut fut = std::pin::pin!(fut);

-    loop {
+    let mut warned = false;
+    let ret = loop {
         match tokio::time::timeout(warn_period, &mut fut).await {
-            Ok(ret) => return ret,
+            Ok(ret) => break ret,
             Err(_) => {
                 tracing::warn!(
                     gate = name,
                     elapsed_ms = started.elapsed().as_millis(),
                     "still waiting, taking longer than expected..."
                 );
+                warned = true;
             }
         }
-    }
+    };
+
+    // If we emitted a warning for slowness, also emit a message when we complete, so that
+    // someone debugging a shutdown can know for sure whether we have moved past this operation.
+    if warned {
+        tracing::info!(
+            gate = name,
+            elapsed_ms = started.elapsed().as_millis(),
+            "completed, after taking longer than expected"
+        )
+    }
+
+    ret
 }

 #[derive(Debug)]

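The warn-once / completion-notice pattern above can be sketched with std threads instead of tokio (function and message names here are illustrative, not the pageserver's API): poll the worker with a timeout, warn each time the timeout fires, and if any warning was emitted, log again on completion.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `f` on a worker thread; warn if it takes longer than `warn_period`,
// and note completion if we warned, so log readers can see we moved on.
fn warn_if_stuck<T: Send + 'static>(
    f: impl FnOnce() -> T + Send + 'static,
    name: &str,
    warn_period: Duration,
) -> T {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(f());
    });
    let mut warned = false;
    let ret = loop {
        match rx.recv_timeout(warn_period) {
            Ok(ret) => break ret,
            Err(mpsc::RecvTimeoutError::Timeout) => {
                eprintln!("{name}: still waiting, taking longer than expected...");
                warned = true;
            }
            Err(mpsc::RecvTimeoutError::Disconnected) => panic!("worker died"),
        }
    };
    if warned {
        eprintln!("{name}: completed, after taking longer than expected");
    }
    ret
}

fn main() {
    let v = warn_if_stuck(
        || {
            thread::sleep(Duration::from_millis(30));
            42
        },
        "demo",
        Duration::from_millis(10),
    );
    assert_eq!(v, 42);
}
```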
View File

@@ -436,9 +436,9 @@ mod tests {
                 event_mask: 0,
             }),
             expected_messages: vec![
-                // Greeting(ProposerGreeting { protocol_version: 2, pg_version: 160000, proposer_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], system_id: 0, timeline_id: 9e4c8f36063c6c6e93bc20d65a820f3d, tenant_id: 9e4c8f36063c6c6e93bc20d65a820f3d, tli: 1, wal_seg_size: 16777216 })
+                // Greeting(ProposerGreeting { protocol_version: 2, pg_version: 160001, proposer_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], system_id: 0, timeline_id: 9e4c8f36063c6c6e93bc20d65a820f3d, tenant_id: 9e4c8f36063c6c6e93bc20d65a820f3d, tli: 1, wal_seg_size: 16777216 })
                 vec![
-                    103, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 113, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                    103, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 113, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 158, 76, 143, 54, 6, 60, 108, 110,
                     147, 188, 32, 214, 90, 130, 15, 61, 158, 76, 143, 54, 6, 60, 108, 110, 147,
                     188, 32, 214, 90, 130, 15, 61, 1, 0, 0, 0, 0, 0, 0, 1,
@@ -478,7 +478,7 @@ mod tests {
         // walproposer will panic when it finishes sync_safekeepers
         std::panic::catch_unwind(|| wp.start()).unwrap_err();
         // validate the resulting LSN
-        assert_eq!(receiver.recv()?, 1337);
+        assert_eq!(receiver.try_recv(), Ok(1337));
         Ok(())
         // drop() will free up resources here
     }

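The test change above swaps a blocking `recv()` for `try_recv()`: if the walproposer never sent the expected LSN, the test now fails immediately instead of hanging. A minimal `std::sync::mpsc` illustration of the difference:

```rust
use std::sync::mpsc::channel;

fn main() {
    let (tx, rx) = channel();
    tx.send(1337u64).unwrap();
    // try_recv returns the queued value without blocking...
    assert_eq!(rx.try_recv(), Ok(1337));
    // ...and errors out immediately when nothing is pending, rather than
    // blocking forever the way recv() would.
    assert!(rx.try_recv().is_err());
}
```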
View File

@@ -36,6 +36,7 @@ humantime.workspace = true
 humantime-serde.workspace = true
 hyper.workspace = true
 itertools.workspace = true
+md5.workspace = true
 nix.workspace = true
 # hack to get the number of worker threads tokio uses
 num_cpus = { version = "1.15" }

View File

@@ -14,7 +14,7 @@ use pageserver::control_plane_client::ControlPlaneClient;
 use pageserver::disk_usage_eviction_task::{self, launch_disk_usage_global_eviction_task};
 use pageserver::metrics::{STARTUP_DURATION, STARTUP_IS_LOADING};
 use pageserver::task_mgr::WALRECEIVER_RUNTIME;
-use pageserver::tenant::TenantSharedResources;
+use pageserver::tenant::{secondary, TenantSharedResources};
 use remote_storage::GenericRemoteStorage;
 use tokio::time::Instant;
 use tracing::*;
@@ -402,15 +402,11 @@ fn start_pageserver(
     let (init_remote_done_tx, init_remote_done_rx) = utils::completion::channel();
     let (init_done_tx, init_done_rx) = utils::completion::channel();

-    let (init_logical_size_done_tx, init_logical_size_done_rx) = utils::completion::channel();
-
     let (background_jobs_can_start, background_jobs_barrier) = utils::completion::channel();

     let order = pageserver::InitializationOrder {
         initial_tenant_load_remote: Some(init_done_tx),
         initial_tenant_load: Some(init_remote_done_tx),
-        initial_logical_size_can_start: init_done_rx.clone(),
-        initial_logical_size_attempt: Some(init_logical_size_done_tx),
         background_jobs_can_start: background_jobs_barrier.clone(),
     };
@@ -429,7 +425,6 @@ fn start_pageserver(
     let tenant_manager = Arc::new(tenant_manager);

     BACKGROUND_RUNTIME.spawn({
-        let init_done_rx = init_done_rx;
         let shutdown_pageserver = shutdown_pageserver.clone();
         let drive_init = async move {
             // NOTE: unlike many futures in pageserver, this one is cancellation-safe
@@ -464,7 +459,7 @@ fn start_pageserver(
             });

             let WaitForPhaseResult {
-                timeout_remaining: timeout,
+                timeout_remaining: _timeout,
                 skipped: init_load_skipped,
             } = wait_for_phase("initial_tenant_load", init_load_done, timeout).await;
@@ -472,26 +467,6 @@ fn start_pageserver(
             scopeguard::ScopeGuard::into_inner(guard);

-            let guard = scopeguard::guard_on_success((), |_| {
-                tracing::info!("Cancelled before initial logical sizes completed")
-            });
-
-            let logical_sizes_done = std::pin::pin!(async {
-                init_logical_size_done_rx.wait().await;
-                startup_checkpoint(
-                    started_startup_at,
-                    "initial_logical_sizes",
-                    "Initial logical sizes completed",
-                );
-            });
-
-            let WaitForPhaseResult {
-                timeout_remaining: _,
-                skipped: logical_sizes_skipped,
-            } = wait_for_phase("initial_logical_sizes", logical_sizes_done, timeout).await;
-
-            scopeguard::ScopeGuard::into_inner(guard);
-
             // allow background jobs to start: we either completed prior stages, or they reached timeout
             // and were skipped. It is important that we do not let them block background jobs indefinitely,
             // because things like consumption metrics for billing are blocked by this barrier.
@@ -514,9 +489,6 @@ fn start_pageserver(
             if let Some(f) = init_load_skipped {
                 f.await;
             }
-            if let Some(f) = logical_sizes_skipped {
-                f.await;
-            }
             scopeguard::ScopeGuard::into_inner(guard);

             startup_checkpoint(started_startup_at, "complete", "Startup complete");
@@ -532,6 +504,17 @@ fn start_pageserver(
         }
     });

+    let secondary_controller = if let Some(remote_storage) = &remote_storage {
+        secondary::spawn_tasks(
+            tenant_manager.clone(),
+            remote_storage.clone(),
+            background_jobs_barrier.clone(),
+            shutdown_pageserver.clone(),
+        )
+    } else {
+        secondary::null_controller()
+    };
+
     // shared state between the disk-usage backed eviction background task and the http endpoint
     // that allows triggering disk-usage based eviction manually. note that the http endpoint
     // is still accessible even if background task is not configured as long as remote storage has
@@ -561,6 +544,7 @@ fn start_pageserver(
                     broker_client.clone(),
                     disk_usage_eviction_state,
                     deletion_queue.new_client(),
+                    secondary_controller,
                 )
                 .context("Failed to initialize router state")?,
         );
@@ -587,7 +571,6 @@ fn start_pageserver(
     }

     if let Some(metric_collection_endpoint) = &conf.metric_collection_endpoint {
-        let background_jobs_barrier = background_jobs_barrier;
         let metrics_ctx = RequestContext::todo_child(
             TaskKind::MetricsCollection,
             // This task itself shouldn't download anything.

View File

@@ -70,6 +70,8 @@ pub mod defaults {
     pub const DEFAULT_SYNTHETIC_SIZE_CALCULATION_INTERVAL: &str = "10 min";
     pub const DEFAULT_BACKGROUND_TASK_MAXIMUM_DELAY: &str = "10s";

+    pub const DEFAULT_HEATMAP_UPLOAD_CONCURRENCY: usize = 8;
+
     ///
     /// Default built-in configuration file.
     ///
@@ -117,6 +119,8 @@ pub mod defaults {
 #evictions_low_residence_duration_metric_threshold = '{DEFAULT_EVICTIONS_LOW_RESIDENCE_DURATION_METRIC_THRESHOLD}'
 #gc_feedback = false

+#heatmap_upload_concurrency = {DEFAULT_HEATMAP_UPLOAD_CONCURRENCY}
+
 [remote_storage]

 "#
@@ -215,6 +219,10 @@ pub struct PageServerConf {
     /// If true, pageserver will make best-effort to operate without a control plane: only
     /// for use in major incidents.
     pub control_plane_emergency_mode: bool,
+
+    /// How many heatmap uploads may be done concurrency: lower values implicitly deprioritize
+    /// heatmap uploads vs. other remote storage operations.
+    pub heatmap_upload_concurrency: usize,
 }

 /// We do not want to store this in a PageServerConf because the latter may be logged
@@ -293,6 +301,8 @@ struct PageServerConfigBuilder {
     control_plane_api: BuilderValue<Option<Url>>,
     control_plane_api_token: BuilderValue<Option<SecretString>>,
     control_plane_emergency_mode: BuilderValue<bool>,
+
+    heatmap_upload_concurrency: BuilderValue<usize>,
 }

 impl Default for PageServerConfigBuilder {
@@ -361,6 +371,8 @@ impl Default for PageServerConfigBuilder {
             control_plane_api: Set(None),
             control_plane_api_token: Set(None),
             control_plane_emergency_mode: Set(false),
+
+            heatmap_upload_concurrency: Set(DEFAULT_HEATMAP_UPLOAD_CONCURRENCY),
         }
     }
 }
@@ -501,6 +513,10 @@ impl PageServerConfigBuilder {
         self.control_plane_emergency_mode = BuilderValue::Set(enabled)
     }

+    pub fn heatmap_upload_concurrency(&mut self, value: usize) {
+        self.heatmap_upload_concurrency = BuilderValue::Set(value)
+    }
+
     pub fn build(self) -> anyhow::Result<PageServerConf> {
         let concurrent_tenant_size_logical_size_queries = self
             .concurrent_tenant_size_logical_size_queries
@@ -595,6 +611,10 @@ impl PageServerConfigBuilder {
             control_plane_emergency_mode: self
                 .control_plane_emergency_mode
                 .ok_or(anyhow!("missing control_plane_emergency_mode"))?,
+            heatmap_upload_concurrency: self
+                .heatmap_upload_concurrency
+                .ok_or(anyhow!("missing heatmap_upload_concurrency"))?,
         })
     }
 }
@@ -828,7 +848,9 @@ impl PageServerConf {
                 },
                 "control_plane_emergency_mode" => {
                     builder.control_plane_emergency_mode(parse_toml_bool(key, item)?)
                 },
+                "heatmap_upload_concurrency" => {
+                    builder.heatmap_upload_concurrency(parse_toml_u64(key, item)? as usize)
+                },
                 _ => bail!("unrecognized pageserver option '{key}'"),
             }
@@ -896,6 +918,7 @@ impl PageServerConf {
             control_plane_api: None,
             control_plane_api_token: None,
             control_plane_emergency_mode: false,
+            heatmap_upload_concurrency: defaults::DEFAULT_HEATMAP_UPLOAD_CONCURRENCY,
         }
     }
 }
@@ -1120,7 +1143,8 @@ background_task_maximum_delay = '334 s'
                 )?,
                 control_plane_api: None,
                 control_plane_api_token: None,
-                control_plane_emergency_mode: false
+                control_plane_emergency_mode: false,
+                heatmap_upload_concurrency: defaults::DEFAULT_HEATMAP_UPLOAD_CONCURRENCY
             },
             "Correct defaults should be used when no config values are provided"
         );
@@ -1177,7 +1201,8 @@ background_task_maximum_delay = '334 s'
                 background_task_maximum_delay: Duration::from_secs(334),
                 control_plane_api: None,
                 control_plane_api_token: None,
-                control_plane_emergency_mode: false
+                control_plane_emergency_mode: false,
+                heatmap_upload_concurrency: defaults::DEFAULT_HEATMAP_UPLOAD_CONCURRENCY
             },
             "Should be able to parse all basic config values correctly"
         );

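The new `heatmap_upload_concurrency` option follows the `BuilderValue` pattern used throughout this config file: every field starts out `Set(default)`, an explicit TOML key overwrites it, and `build()` errors if a field was somehow left unset. A self-contained sketch of that pattern, reduced to the one new field (`ConfigBuilder` is a simplified stand-in for `PageServerConfigBuilder`):

```rust
// Each config field is either explicitly Set or NotSet; build() refuses
// to produce a config with a NotSet field.
enum BuilderValue<T> {
    Set(T),
    NotSet,
}

impl<T> BuilderValue<T> {
    fn ok_or(self, err: &'static str) -> Result<T, &'static str> {
        match self {
            BuilderValue::Set(v) => Ok(v),
            BuilderValue::NotSet => Err(err),
        }
    }
}

const DEFAULT_HEATMAP_UPLOAD_CONCURRENCY: usize = 8;

struct ConfigBuilder {
    heatmap_upload_concurrency: BuilderValue<usize>,
}

impl ConfigBuilder {
    fn new() -> Self {
        // Defaults are applied up front, so an empty TOML still builds.
        ConfigBuilder {
            heatmap_upload_concurrency: BuilderValue::Set(DEFAULT_HEATMAP_UPLOAD_CONCURRENCY),
        }
    }

    fn heatmap_upload_concurrency(&mut self, value: usize) {
        self.heatmap_upload_concurrency = BuilderValue::Set(value);
    }

    fn build(self) -> Result<usize, &'static str> {
        self.heatmap_upload_concurrency
            .ok_or("missing heatmap_upload_concurrency")
    }
}

fn main() {
    // Default applies when the TOML doesn't mention the key...
    assert_eq!(ConfigBuilder::new().build(), Ok(8));
    // ...and an explicit `heatmap_upload_concurrency = 4` overrides it.
    let mut b = ConfigBuilder::new();
    b.heatmap_upload_concurrency(4);
    assert_eq!(b.build(), Ok(4));
}
```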
View File

@@ -3,7 +3,7 @@
 use crate::context::{DownloadBehavior, RequestContext};
 use crate::task_mgr::{self, TaskKind, BACKGROUND_RUNTIME};
 use crate::tenant::tasks::BackgroundLoopKind;
-use crate::tenant::{mgr, LogicalSizeCalculationCause, PageReconstructError};
+use crate::tenant::{mgr, LogicalSizeCalculationCause, PageReconstructError, Tenant};
 use camino::Utf8PathBuf;
 use consumption_metrics::EventType;
 use pageserver_api::models::TenantState;
@@ -256,8 +256,6 @@ async fn calculate_synthetic_size_worker(
         info!("calculate_synthetic_size_worker stopped");
     };

-    let cause = LogicalSizeCalculationCause::ConsumptionMetricsSyntheticSize;
-
     loop {
         let started_at = Instant::now();
@@ -269,26 +267,25 @@ async fn calculate_synthetic_size_worker(
             }
         };

-        for (tenant_id, tenant_state) in tenants {
+        for (tenant_shard_id, tenant_state) in tenants {
             if tenant_state != TenantState::Active {
                 continue;
             }

-            if let Ok(tenant) = mgr::get_tenant(tenant_id, true) {
-                // TODO should we use concurrent_background_tasks_rate_limit() here, like the other background tasks?
-                // We can put in some prioritization for consumption metrics.
-                // Same for the loop that fetches computed metrics.
-                // By using the same limiter, we centralize metrics collection for "start" and "finished" counters,
-                // which turns out is really handy to understand the system.
-                if let Err(e) = tenant.calculate_synthetic_size(cause, cancel, ctx).await {
-                    if let Some(PageReconstructError::Cancelled) =
-                        e.downcast_ref::<PageReconstructError>()
-                    {
-                        return Ok(());
-                    }
-                    error!("failed to calculate synthetic size for tenant {tenant_id}: {e:#}");
-                }
-            }
+            if !tenant_shard_id.is_zero() {
+                // We only send consumption metrics from shard 0, so don't waste time calculating
+                // synthetic size on other shards.
+                continue;
+            }
+
+            let Ok(tenant) = mgr::get_tenant(tenant_shard_id, true) else {
+                continue;
+            };
+
+            // there is never any reason to exit calculate_synthetic_size_worker following any
+            // return value -- we don't need to care about shutdown because no tenant is found when
+            // pageserver is shut down.
+            calculate_and_log(&tenant, cancel, ctx).await;
         }

         crate::tenant::tasks::warn_when_period_overrun(
@@ -299,7 +296,7 @@ async fn calculate_synthetic_size_worker(
         let res = tokio::time::timeout_at(
             started_at + synthetic_size_calculation_interval,
-            task_mgr::shutdown_token().cancelled(),
+            cancel.cancelled(),
         )
         .await;
         if res.is_ok() {
@@ -307,3 +304,31 @@ async fn calculate_synthetic_size_worker(
         }
     }
 }
+
+async fn calculate_and_log(tenant: &Tenant, cancel: &CancellationToken, ctx: &RequestContext) {
+    const CAUSE: LogicalSizeCalculationCause =
+        LogicalSizeCalculationCause::ConsumptionMetricsSyntheticSize;
+
+    // TODO should we use concurrent_background_tasks_rate_limit() here, like the other background tasks?
+    // We can put in some prioritization for consumption metrics.
+    // Same for the loop that fetches computed metrics.
+    // By using the same limiter, we centralize metrics collection for "start" and "finished" counters,
+    // which turns out is really handy to understand the system.
+    let Err(e) = tenant.calculate_synthetic_size(CAUSE, cancel, ctx).await else {
+        return;
+    };
+
+    // this error can be returned if timeline is shutting down, but it does not
+    // mean the synthetic size worker should terminate. we do not need any checks
+    // in this function because `mgr::get_tenant` will error out after shutdown has
+    // progressed to shutting down tenants.
+    let shutting_down = matches!(
+        e.downcast_ref::<PageReconstructError>(),
+        Some(PageReconstructError::Cancelled | PageReconstructError::AncestorStopping(_))
+    );
+
+    if !shutting_down {
+        let tenant_shard_id = tenant.tenant_shard_id();
+        error!("failed to calculate synthetic size for tenant {tenant_shard_id}: {e:#}");
+    }
+}

View File

@@ -197,12 +197,12 @@ pub(super) async fn collect_all_metrics(
}; };
let tenants = futures::stream::iter(tenants).filter_map(|(id, state)| async move { let tenants = futures::stream::iter(tenants).filter_map(|(id, state)| async move {
if state != TenantState::Active { if state != TenantState::Active || !id.is_zero() {
None None
} else { } else {
crate::tenant::mgr::get_tenant(id, true) crate::tenant::mgr::get_tenant(id, true)
.ok() .ok()
.map(|tenant| (id, tenant)) .map(|tenant| (id.tenant_id, tenant))
} }
}); });
@@ -351,7 +351,12 @@ impl TimelineSnapshot {
let current_exact_logical_size = { let current_exact_logical_size = {
let span = tracing::info_span!("collect_metrics_iteration", tenant_id = %t.tenant_shard_id.tenant_id, timeline_id = %t.timeline_id); let span = tracing::info_span!("collect_metrics_iteration", tenant_id = %t.tenant_shard_id.tenant_id, timeline_id = %t.timeline_id);
let size = span.in_scope(|| t.get_current_logical_size(ctx)); let size = span.in_scope(|| {
t.get_current_logical_size(
crate::tenant::timeline::GetLogicalSizePriority::Background,
ctx,
)
});
match size { match size {
// Only send timeline logical size when it is fully calculated. // Only send timeline logical size when it is fully calculated.
CurrentLogicalSize::Exact(ref size) => Some(size.into()), CurrentLogicalSize::Exact(ref size) => Some(size.into()),

View File

@@ -312,7 +312,18 @@ impl ListWriter {
                     for (tenant_shard_id, tenant_list) in &mut deletion_list.tenants {
                         if let Some(attached_gen) = attached_tenants.get(tenant_shard_id) {
                             if attached_gen.previous() == tenant_list.generation {
+                                info!(
+                                    seq=%s, tenant_id=%tenant_shard_id.tenant_id,
+                                    shard_id=%tenant_shard_id.shard_slug(),
+                                    old_gen=?tenant_list.generation, new_gen=?attached_gen,
+                                    "Updating gen on recovered list");
                                 tenant_list.generation = *attached_gen;
+                            } else {
+                                info!(
+                                    seq=%s, tenant_id=%tenant_shard_id.tenant_id,
+                                    shard_id=%tenant_shard_id.shard_slug(),
+                                    old_gen=?tenant_list.generation, new_gen=?attached_gen,
+                                    "Encountered stale generation on recovered list");
                             }
                         }
                     }

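The recovery rule the hunk above logs around can be stated compactly: a recovered deletion list is only promoted to the current generation if it was written by the generation immediately preceding it; anything older is stale and left alone. An illustrative sketch, with plain `u32` standing in for the pageserver's `Generation` type and `previous()` modeled as "minus one":

```rust
// Promote the recovered list's generation only when it is exactly one
// behind the currently attached generation; otherwise it is stale.
fn reconcile(list_gen: u32, attached_gen: u32) -> Option<u32> {
    if attached_gen.checked_sub(1) == Some(list_gen) {
        Some(attached_gen) // update the list to the new generation
    } else {
        None // stale: leave the recovered list's generation alone
    }
}

fn main() {
    assert_eq!(reconcile(4, 5), Some(5)); // written by the previous generation
    assert_eq!(reconcile(2, 5), None); // too old: stale
    assert_eq!(reconcile(5, 5), None); // same generation: not "previous"
}
```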
View File

@@ -42,7 +42,6 @@
 // reading these fields. We use the Debug impl for semi-structured logging, though.

 use std::{
-    collections::HashMap,
     sync::Arc,
     time::{Duration, SystemTime},
 };
@@ -125,7 +124,7 @@ pub fn launch_disk_usage_global_eviction_task(
 async fn disk_usage_eviction_task(
     state: &State,
     task_config: &DiskUsageEvictionTaskConfig,
-    _storage: &GenericRemoteStorage,
+    storage: &GenericRemoteStorage,
     tenants_dir: &Utf8Path,
     cancel: CancellationToken,
 ) {
@@ -149,8 +148,14 @@ async fn disk_usage_eviction_task(
         let start = Instant::now();

         async {
-            let res =
-                disk_usage_eviction_task_iteration(state, task_config, tenants_dir, &cancel).await;
+            let res = disk_usage_eviction_task_iteration(
+                state,
+                task_config,
+                storage,
+                tenants_dir,
+                &cancel,
+            )
+            .await;

             match res {
                 Ok(()) => {}
@@ -181,12 +186,13 @@ pub trait Usage: Clone + Copy + std::fmt::Debug {
 async fn disk_usage_eviction_task_iteration(
     state: &State,
     task_config: &DiskUsageEvictionTaskConfig,
+    storage: &GenericRemoteStorage,
     tenants_dir: &Utf8Path,
     cancel: &CancellationToken,
 ) -> anyhow::Result<()> {
     let usage_pre = filesystem_level_usage::get(tenants_dir, task_config)
         .context("get filesystem-level disk usage before evictions")?;
-    let res = disk_usage_eviction_task_iteration_impl(state, usage_pre, cancel).await;
+    let res = disk_usage_eviction_task_iteration_impl(state, storage, usage_pre, cancel).await;
     match res {
         Ok(outcome) => {
             debug!(?outcome, "disk_usage_eviction_iteration finished");
@@ -268,8 +274,9 @@ struct LayerCount {
count: usize, count: usize,
} }
pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>( pub(crate) async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
state: &State, state: &State,
_storage: &GenericRemoteStorage,
usage_pre: U, usage_pre: U,
cancel: &CancellationToken, cancel: &CancellationToken,
) -> anyhow::Result<IterationOutcome<U>> { ) -> anyhow::Result<IterationOutcome<U>> {
@@ -321,16 +328,16 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
// Walk through the list of candidates, until we have accumulated enough layers to get // Walk through the list of candidates, until we have accumulated enough layers to get
// us back under the pressure threshold. 'usage_planned' is updated so that it tracks // us back under the pressure threshold. 'usage_planned' is updated so that it tracks
// how much disk space would be used after evicting all the layers up to the current // how much disk space would be used after evicting all the layers up to the current
// point in the list. The layers are collected in 'batched', grouped per timeline. // point in the list.
// //
// If we get far enough in the list that we start to evict layers that are below // If we get far enough in the list that we start to evict layers that are below
// the tenant's min-resident-size threshold, print a warning, and memorize the disk // the tenant's min-resident-size threshold, print a warning, and memorize the disk
// usage at that point, in 'usage_planned_min_resident_size_respecting'. // usage at that point, in 'usage_planned_min_resident_size_respecting'.
let mut batched: HashMap<_, Vec<_>> = HashMap::new();
let mut warned = None; let mut warned = None;
let mut usage_planned = usage_pre; let mut usage_planned = usage_pre;
let mut max_batch_size = 0; let mut evicted_amount = 0;
for (i, (partition, candidate)) in candidates.into_iter().enumerate() {
for (i, (partition, candidate)) in candidates.iter().enumerate() {
if !usage_planned.has_pressure() { if !usage_planned.has_pressure() {
debug!( debug!(
no_candidates_evicted = i, no_candidates_evicted = i,
@@ -339,25 +346,13 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
break; break;
} }
if partition == MinResidentSizePartition::Below && warned.is_none() { if partition == &MinResidentSizePartition::Below && warned.is_none() {
warn!(?usage_pre, ?usage_planned, candidate_no=i, "tenant_min_resident_size-respecting LRU would not relieve pressure, evicting more following global LRU policy"); warn!(?usage_pre, ?usage_planned, candidate_no=i, "tenant_min_resident_size-respecting LRU would not relieve pressure, evicting more following global LRU policy");
warned = Some(usage_planned); warned = Some(usage_planned);
} }
usage_planned.add_available_bytes(candidate.layer.layer_desc().file_size); usage_planned.add_available_bytes(candidate.layer.layer_desc().file_size);
evicted_amount += 1;
// FIXME: batching makes no sense anymore because of no layermap locking, should just spawn
// tasks to evict all seen layers until we have evicted enough
let batch = batched.entry(TimelineKey(candidate.timeline)).or_default();
// semaphore will later be used to limit eviction concurrency, and we can express at
// most u32 number of permits. unlikely we would have u32::MAX layers to be evicted,
// but fail gracefully by not making batches larger.
if batch.len() < u32::MAX as usize {
batch.push(candidate.layer);
max_batch_size = max_batch_size.max(batch.len());
}
} }
let usage_planned = match warned { let usage_planned = match warned {
@@ -372,100 +367,79 @@ pub async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
}; };
debug!(?usage_planned, "usage planned"); debug!(?usage_planned, "usage planned");
// phase2: evict victims batched by timeline // phase2: evict layers
let mut js = tokio::task::JoinSet::new(); let mut js = tokio::task::JoinSet::new();
let limit = 1000;
// ratelimit to 1k files or any higher max batch size let mut evicted = candidates.into_iter().take(evicted_amount).fuse();
let limit = Arc::new(tokio::sync::Semaphore::new(1000.max(max_batch_size))); let mut consumed_all = false;
for (timeline, batch) in batched { // After the evictions, `usage_assumed` is the post-eviction usage,
let tenant_shard_id = timeline.tenant_shard_id; // according to internal accounting.
let timeline_id = timeline.timeline_id; let mut usage_assumed = usage_pre;
let batch_size = let mut evictions_failed = LayerCount::default();
u32::try_from(batch.len()).expect("batch size limited to u32::MAX during partitioning");
// I dislike naming of `available_permits` but it means current total amount of permits let evict_layers = async move {
// because permits can be added loop {
assert!(batch_size as usize <= limit.available_permits()); let next = if js.len() >= limit || consumed_all {
js.join_next().await
} else if !js.is_empty() {
// opportunistically consume ready result, one per each new evicted
futures::future::FutureExt::now_or_never(js.join_next()).and_then(|x| x)
} else {
None
};
debug!(%timeline_id, "evicting batch for timeline"); if let Some(next) = next {
match next {
let evict = { Ok(Ok(file_size)) => {
let limit = limit.clone(); usage_assumed.add_available_bytes(file_size);
let cancel = cancel.clone();
async move {
let mut evicted_bytes = 0;
let mut evictions_failed = LayerCount::default();
let Ok(_permit) = limit.acquire_many_owned(batch_size).await else {
// semaphore closing means cancelled
return (evicted_bytes, evictions_failed);
};
let results = timeline.evict_layers(&batch).await;
match results {
Ok(results) => {
assert_eq!(results.len(), batch.len());
for (result, layer) in results.into_iter().zip(batch.iter()) {
let file_size = layer.layer_desc().file_size;
match result {
Some(Ok(())) => {
evicted_bytes += file_size;
}
Some(Err(EvictionError::NotFound | EvictionError::Downloaded)) => {
evictions_failed.file_sizes += file_size;
evictions_failed.count += 1;
}
None => {
assert!(cancel.is_cancelled());
}
}
}
} }
Err(e) => { Ok(Err((file_size, EvictionError::NotFound | EvictionError::Downloaded))) => {
warn!("failed to evict batch: {:#}", e); evictions_failed.file_sizes += file_size;
evictions_failed.count += 1;
} }
Err(je) if je.is_cancelled() => unreachable!("not used"),
Err(je) if je.is_panic() => { /* already logged */ }
Err(je) => tracing::error!("unknown JoinError: {je:?}"),
} }
(evicted_bytes, evictions_failed)
} }
}
.instrument(tracing::info_span!("evict_batch", tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), %timeline_id, batch_size));
js.spawn(evict); if consumed_all && js.is_empty() {
break;
// spawning multiple thousands of these is essentially blocking, so give already spawned a
// chance of making progress
tokio::task::yield_now().await;
}
let join_all = async move {
// After the evictions, `usage_assumed` is the post-eviction usage,
// according to internal accounting.
let mut usage_assumed = usage_pre;
let mut evictions_failed = LayerCount::default();
while let Some(res) = js.join_next().await {
match res {
Ok((evicted_bytes, failed)) => {
usage_assumed.add_available_bytes(evicted_bytes);
evictions_failed.file_sizes += failed.file_sizes;
evictions_failed.count += failed.count;
}
Err(je) if je.is_cancelled() => unreachable!("not used"),
Err(je) if je.is_panic() => { /* already logged */ }
Err(je) => tracing::error!("unknown JoinError: {je:?}"),
} }
// calling again when consumed_all is fine as evicted is fused.
let Some((_partition, candidate)) = evicted.next() else {
consumed_all = true;
continue;
};
js.spawn(async move {
let rtc = candidate.timeline.remote_client.as_ref().expect(
"holding the witness, all timelines must have a remote timeline client",
);
let file_size = candidate.layer.layer_desc().file_size;
candidate
.layer
.evict_and_wait(rtc)
.await
.map(|()| file_size)
.map_err(|e| (file_size, e))
});
tokio::task::yield_now().await;
} }
(usage_assumed, evictions_failed) (usage_assumed, evictions_failed)
}; };
let (usage_assumed, evictions_failed) = tokio::select! { let (usage_assumed, evictions_failed) = tokio::select! {
tuple = join_all => { tuple }, tuple = evict_layers => { tuple },
_ = cancel.cancelled() => { _ = cancel.cancelled() => {
// close the semaphore to stop any pending acquires // dropping joinset will abort all pending evict_and_waits and that is fine, our
limit.close(); // requests will still stand
return Ok(IterationOutcome::Cancelled); return Ok(IterationOutcome::Cancelled);
} }
}; };
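The planning phase above walks the LRU-ordered candidates, adding each layer's `file_size` to the planned usage until `has_pressure()` clears, and records `evicted_amount` so phase 2 knows how many candidates to take. A minimal self-contained sketch of that accounting, with illustrative `Usage` and candidate types standing in for the real `Usage` trait and layer descriptors:

```rust
// Simplified model of phase 1 of disk_usage_eviction_task_iteration_impl:
// accumulate freed bytes over LRU-ordered candidates until disk pressure
// is relieved, and count how many layers phase 2 should evict.
// `Usage` here is a toy struct, not the pageserver's `Usage` trait.

#[derive(Clone, Copy, Debug)]
struct Usage {
    used_bytes: u64,
    threshold_bytes: u64,
}

impl Usage {
    fn has_pressure(&self) -> bool {
        self.used_bytes > self.threshold_bytes
    }
    fn add_available_bytes(&mut self, freed: u64) {
        self.used_bytes = self.used_bytes.saturating_sub(freed);
    }
}

/// Returns (planned usage after evictions, number of layers to evict).
fn plan_evictions(usage_pre: Usage, candidate_sizes: &[u64]) -> (Usage, usize) {
    let mut usage_planned = usage_pre;
    let mut evicted_amount = 0;
    for size in candidate_sizes {
        if !usage_planned.has_pressure() {
            // Enough layers planned; the rest of the LRU list is untouched.
            break;
        }
        usage_planned.add_available_bytes(*size);
        evicted_amount += 1;
    }
    (usage_planned, evicted_amount)
}

fn main() {
    let usage = Usage { used_bytes: 100, threshold_bytes: 70 };
    // 100 - 20 = 80 (still pressure), 80 - 25 = 55 (relieved): 2 layers.
    let (planned, n) = plan_evictions(usage, &[20, 25, 10]);
    assert_eq!(n, 2);
    assert_eq!(planned.used_bytes, 55);
}
```

In the real code, phase 2 then takes exactly `evicted_amount` candidates (`candidates.into_iter().take(evicted_amount)`) and evicts them concurrently via a `JoinSet` capped at `limit` in-flight tasks, crediting successes into `usage_assumed` and failures into `evictions_failed`.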


@@ -84,7 +84,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
get: get:
description: Get tenant status description: Get tenant status
responses: responses:
@@ -181,7 +180,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
get: get:
description: Get timelines for tenant description: Get timelines for tenant
responses: responses:
@@ -232,7 +230,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
- name: timeline_id - name: timeline_id
in: path in: path
required: true required: true
@@ -338,7 +335,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
- name: timeline_id - name: timeline_id
in: path in: path
required: true required: true
@@ -401,7 +397,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
- name: timeline_id - name: timeline_id
in: path in: path
required: true required: true
@@ -469,7 +464,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
- name: timeline_id - name: timeline_id
in: path in: path
required: true required: true
@@ -523,7 +517,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
post: post:
description: | description: |
Schedules attach operation to happen in the background for the given tenant. Schedules attach operation to happen in the background for the given tenant.
@@ -631,7 +624,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
- name: flush_ms - name: flush_ms
in: query in: query
required: false required: false
@@ -724,7 +716,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
- name: detach_ignored - name: detach_ignored
in: query in: query
required: false required: false
@@ -784,7 +775,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
post: post:
description: | description: |
Remove tenant data (including all corresponding timelines) from pageserver's memory. Remove tenant data (including all corresponding timelines) from pageserver's memory.
@@ -833,7 +823,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
post: post:
description: | description: |
Schedules an operation that attempts to load a tenant from the local disk and Schedules an operation that attempts to load a tenant from the local disk and
@@ -890,7 +879,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
get: get:
description: | description: |
Calculate tenant's synthetic size Calculate tenant's synthetic size
@@ -933,7 +921,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
- name: inputs_only - name: inputs_only
in: query in: query
required: false required: false
@@ -1003,11 +990,10 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
post: post:
description: | description: |
Create a timeline. Returns new timeline id on success.\ Create a timeline. Returns new timeline id on success.
If no new timeline id is specified in parameters, it would be generated. It's an error to recreate the same timeline. Recreating the same timeline will succeed if the parameters match the existing timeline.
If no pg_version is specified, assume DEFAULT_PG_VERSION hardcoded in the pageserver. If no pg_version is specified, assume DEFAULT_PG_VERSION hardcoded in the pageserver.
requestBody: requestBody:
content: content:
@@ -1137,7 +1123,6 @@ paths:
application/json: application/json:
schema: schema:
type: string type: string
format: hex
"400": "400":
description: Malformed tenant create request description: Malformed tenant create request
content: content:
@@ -1234,7 +1219,6 @@ paths:
required: true required: true
schema: schema:
type: string type: string
format: hex
get: get:
description: | description: |
Returns tenant's config description: specific config overrides a tenant has Returns tenant's config description: specific config overrides a tenant has
@@ -1340,7 +1324,6 @@ components:
properties: properties:
new_tenant_id: new_tenant_id:
type: string type: string
format: hex
generation: generation:
type: integer type: integer
description: Attachment generation number. description: Attachment generation number.
@@ -1369,7 +1352,6 @@ components:
properties: properties:
tenant_id: tenant_id:
type: string type: string
format: hex
TenantLocationConfigRequest: TenantLocationConfigRequest:
type: object type: object
required: required:
@@ -1377,7 +1359,6 @@ components:
properties: properties:
tenant_id: tenant_id:
type: string type: string
format: hex
mode: mode:
type: string type: string
enum: ["AttachedSingle", "AttachedMulti", "AttachedStale", "Secondary", "Detached"] enum: ["AttachedSingle", "AttachedMulti", "AttachedStale", "Secondary", "Detached"]
@@ -1424,6 +1405,8 @@ components:
type: integer type: integer
trace_read_requests: trace_read_requests:
type: boolean type: boolean
heatmap_period:
type: integer
TenantConfigResponse: TenantConfigResponse:
type: object type: object
properties: properties:
@@ -1446,7 +1429,6 @@ components:
format: hex format: hex
tenant_id: tenant_id:
type: string type: string
format: hex
last_record_lsn: last_record_lsn:
type: string type: string
format: hex format: hex
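The reworded timeline-create description above changes the semantics from "recreating is an error" to "recreating succeeds if the parameters match the existing timeline", i.e. idempotent creation with a conflict on mismatched parameters (matching the new `CreateTimelineError::Conflict` variant seen in the handler diff). A hedged sketch of that decision logic, with illustrative types rather than the real pageserver ones:

```rust
// Sketch of idempotent timeline creation: identical repeated requests
// succeed, a request reusing an id with different parameters conflicts.
// `TimelineParams` and the string id are stand-ins for illustration.

use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Debug)]
struct TimelineParams {
    pg_version: u32,
    ancestor: Option<String>,
}

#[derive(PartialEq, Eq, Debug)]
enum CreateOutcome {
    Created,       // new timeline
    AlreadyExists, // parameters matched: idempotent success
    Conflict,      // same id, different parameters: HTTP 409
}

fn create_timeline(
    timelines: &mut HashMap<String, TimelineParams>,
    id: &str,
    params: TimelineParams,
) -> CreateOutcome {
    match timelines.get(id) {
        None => {
            timelines.insert(id.to_string(), params);
            CreateOutcome::Created
        }
        Some(existing) if *existing == params => CreateOutcome::AlreadyExists,
        Some(_) => CreateOutcome::Conflict,
    }
}

fn main() {
    let mut tls = HashMap::new();
    let p = TimelineParams { pg_version: 15, ancestor: None };
    assert_eq!(create_timeline(&mut tls, "tl1", p.clone()), CreateOutcome::Created);
    // Retrying the same request is now a success, not an error.
    assert_eq!(create_timeline(&mut tls, "tl1", p), CreateOutcome::AlreadyExists);
    // Same id with different parameters is a conflict.
    let q = TimelineParams { pg_version: 16, ancestor: None };
    assert_eq!(create_timeline(&mut tls, "tl1", q), CreateOutcome::Conflict);
}
```

This makes create requests safe to retry, which matters because callers may resend after a timeout without knowing whether the first attempt landed.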


@@ -42,6 +42,7 @@ use crate::tenant::mgr::{
GetTenantError, SetNewTenantConfigError, TenantManager, TenantMapError, TenantMapInsertError, GetTenantError, SetNewTenantConfigError, TenantManager, TenantMapError, TenantMapInsertError,
TenantSlotError, TenantSlotUpsertError, TenantStateError, TenantSlotError, TenantSlotUpsertError, TenantStateError,
}; };
use crate::tenant::secondary::SecondaryController;
use crate::tenant::size::ModelInputs; use crate::tenant::size::ModelInputs;
use crate::tenant::storage_layer::LayerAccessStatsReset; use crate::tenant::storage_layer::LayerAccessStatsReset;
use crate::tenant::timeline::CompactFlags; use crate::tenant::timeline::CompactFlags;
@@ -75,9 +76,11 @@ pub struct State {
broker_client: storage_broker::BrokerClientChannel, broker_client: storage_broker::BrokerClientChannel,
disk_usage_eviction_state: Arc<disk_usage_eviction_task::State>, disk_usage_eviction_state: Arc<disk_usage_eviction_task::State>,
deletion_queue_client: DeletionQueueClient, deletion_queue_client: DeletionQueueClient,
secondary_controller: SecondaryController,
} }
impl State { impl State {
#[allow(clippy::too_many_arguments)]
pub fn new( pub fn new(
conf: &'static PageServerConf, conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>, tenant_manager: Arc<TenantManager>,
@@ -86,6 +89,7 @@ impl State {
broker_client: storage_broker::BrokerClientChannel, broker_client: storage_broker::BrokerClientChannel,
disk_usage_eviction_state: Arc<disk_usage_eviction_task::State>, disk_usage_eviction_state: Arc<disk_usage_eviction_task::State>,
deletion_queue_client: DeletionQueueClient, deletion_queue_client: DeletionQueueClient,
secondary_controller: SecondaryController,
) -> anyhow::Result<Self> { ) -> anyhow::Result<Self> {
let allowlist_routes = ["/v1/status", "/v1/doc", "/swagger.yml", "/metrics"] let allowlist_routes = ["/v1/status", "/v1/doc", "/swagger.yml", "/metrics"]
.iter() .iter()
@@ -100,6 +104,7 @@ impl State {
broker_client, broker_client,
disk_usage_eviction_state, disk_usage_eviction_state,
deletion_queue_client, deletion_queue_client,
secondary_controller,
}) })
} }
@@ -136,11 +141,6 @@ impl From<PageReconstructError> for ApiError {
fn from(pre: PageReconstructError) -> ApiError { fn from(pre: PageReconstructError) -> ApiError {
match pre { match pre {
PageReconstructError::Other(pre) => ApiError::InternalServerError(pre), PageReconstructError::Other(pre) => ApiError::InternalServerError(pre),
PageReconstructError::NeedsDownload(_, _) => {
// This shouldn't happen, because we use a RequestContext that requests to
// download any missing layer files on-demand.
ApiError::InternalServerError(anyhow::anyhow!("need to download remote layer file"))
}
PageReconstructError::Cancelled => { PageReconstructError::Cancelled => {
ApiError::InternalServerError(anyhow::anyhow!("request was cancelled")) ApiError::InternalServerError(anyhow::anyhow!("request was cancelled"))
} }
@@ -319,6 +319,7 @@ async fn build_timeline_info_common(
ctx: &RequestContext, ctx: &RequestContext,
) -> anyhow::Result<TimelineInfo> { ) -> anyhow::Result<TimelineInfo> {
crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id(); crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id();
let initdb_lsn = timeline.initdb_lsn;
let last_record_lsn = timeline.get_last_record_lsn(); let last_record_lsn = timeline.get_last_record_lsn();
let (wal_source_connstr, last_received_msg_lsn, last_received_msg_ts) = { let (wal_source_connstr, last_received_msg_lsn, last_received_msg_ts) = {
let guard = timeline.last_received_wal.lock().unwrap(); let guard = timeline.last_received_wal.lock().unwrap();
@@ -338,7 +339,8 @@ async fn build_timeline_info_common(
Lsn(0) => None, Lsn(0) => None,
lsn @ Lsn(_) => Some(lsn), lsn @ Lsn(_) => Some(lsn),
}; };
let current_logical_size = timeline.get_current_logical_size(ctx); let current_logical_size =
timeline.get_current_logical_size(tenant::timeline::GetLogicalSizePriority::User, ctx);
let current_physical_size = Some(timeline.layer_size_sum().await); let current_physical_size = Some(timeline.layer_size_sum().await);
let state = timeline.current_state(); let state = timeline.current_state();
let remote_consistent_lsn_projected = timeline let remote_consistent_lsn_projected = timeline
@@ -351,14 +353,14 @@ async fn build_timeline_info_common(
let walreceiver_status = timeline.walreceiver_status(); let walreceiver_status = timeline.walreceiver_status();
let info = TimelineInfo { let info = TimelineInfo {
// TODO(sharding): add a shard_id field, or make tenant_id into a tenant_shard_id tenant_id: timeline.tenant_shard_id,
tenant_id: timeline.tenant_shard_id.tenant_id,
timeline_id: timeline.timeline_id, timeline_id: timeline.timeline_id,
ancestor_timeline_id, ancestor_timeline_id,
ancestor_lsn, ancestor_lsn,
disk_consistent_lsn: timeline.get_disk_consistent_lsn(), disk_consistent_lsn: timeline.get_disk_consistent_lsn(),
remote_consistent_lsn: remote_consistent_lsn_projected, remote_consistent_lsn: remote_consistent_lsn_projected,
remote_consistent_lsn_visible, remote_consistent_lsn_visible,
initdb_lsn,
last_record_lsn, last_record_lsn,
prev_record_lsn: Some(timeline.get_prev_record_lsn()), prev_record_lsn: Some(timeline.get_prev_record_lsn()),
latest_gc_cutoff_lsn: *timeline.get_latest_gc_cutoff_lsn(), latest_gc_cutoff_lsn: *timeline.get_latest_gc_cutoff_lsn(),
@@ -451,7 +453,7 @@ async fn timeline_create_handler(
.map_err(ApiError::InternalServerError)?; .map_err(ApiError::InternalServerError)?;
json_response(StatusCode::CREATED, timeline_info) json_response(StatusCode::CREATED, timeline_info)
} }
Err(tenant::CreateTimelineError::AlreadyExists) => { Err(tenant::CreateTimelineError::Conflict | tenant::CreateTimelineError::AlreadyCreating) => {
json_response(StatusCode::CONFLICT, ()) json_response(StatusCode::CONFLICT, ())
} }
Err(tenant::CreateTimelineError::AncestorLsn(err)) => { Err(tenant::CreateTimelineError::AncestorLsn(err)) => {
@@ -479,15 +481,15 @@ async fn timeline_list_handler(
request: Request<Body>, request: Request<Body>,
_cancel: CancellationToken, _cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> { ) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?; let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
let include_non_incremental_logical_size: Option<bool> = let include_non_incremental_logical_size: Option<bool> =
parse_query_param(&request, "include-non-incremental-logical-size")?; parse_query_param(&request, "include-non-incremental-logical-size")?;
check_permission(&request, Some(tenant_id))?; check_permission(&request, Some(tenant_shard_id.tenant_id))?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download); let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let response_data = async { let response_data = async {
let tenant = mgr::get_tenant(tenant_id, true)?; let tenant = mgr::get_tenant(tenant_shard_id, true)?;
let timelines = tenant.list_timelines(); let timelines = tenant.list_timelines();
let mut response_data = Vec::with_capacity(timelines.len()); let mut response_data = Vec::with_capacity(timelines.len());
@@ -506,7 +508,9 @@ async fn timeline_list_handler(
} }
Ok::<Vec<TimelineInfo>, ApiError>(response_data) Ok::<Vec<TimelineInfo>, ApiError>(response_data)
} }
.instrument(info_span!("timeline_list", %tenant_id)) .instrument(info_span!("timeline_list",
tenant_id = %tenant_shard_id.tenant_id,
shard_id = %tenant_shard_id.shard_slug()))
.await?; .await?;
json_response(StatusCode::OK, response_data) json_response(StatusCode::OK, response_data)
@@ -516,17 +520,17 @@ async fn timeline_detail_handler(
request: Request<Body>, request: Request<Body>,
_cancel: CancellationToken, _cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> { ) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?; let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?; let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
let include_non_incremental_logical_size: Option<bool> = let include_non_incremental_logical_size: Option<bool> =
parse_query_param(&request, "include-non-incremental-logical-size")?; parse_query_param(&request, "include-non-incremental-logical-size")?;
check_permission(&request, Some(tenant_id))?; check_permission(&request, Some(tenant_shard_id.tenant_id))?;
// Logical size calculation needs downloading. // Logical size calculation needs downloading.
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download); let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let timeline_info = async { let timeline_info = async {
let tenant = mgr::get_tenant(tenant_id, true)?; let tenant = mgr::get_tenant(tenant_shard_id, true)?;
let timeline = tenant let timeline = tenant
.get_timeline(timeline_id, false) .get_timeline(timeline_id, false)
@@ -543,7 +547,10 @@ async fn timeline_detail_handler(
Ok::<_, ApiError>(timeline_info) Ok::<_, ApiError>(timeline_info)
} }
.instrument(info_span!("timeline_detail", %tenant_id, %timeline_id)) .instrument(info_span!("timeline_detail",
tenant_id = %tenant_shard_id.tenant_id,
shard_id = %tenant_shard_id.shard_slug(),
%timeline_id))
.await?; .await?;
json_response(StatusCode::OK, timeline_info) json_response(StatusCode::OK, timeline_info)
@@ -553,8 +560,15 @@ async fn get_lsn_by_timestamp_handler(
request: Request<Body>, request: Request<Body>,
cancel: CancellationToken, cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> { ) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?; let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
check_permission(&request, Some(tenant_id))?; check_permission(&request, Some(tenant_shard_id.tenant_id))?;
if !tenant_shard_id.is_zero() {
// Requires SLRU contents, which are only stored on shard zero
return Err(ApiError::BadRequest(anyhow!(
"Size calculations are only available on shard zero"
)));
}
let version: Option<u8> = parse_query_param(&request, "version")?; let version: Option<u8> = parse_query_param(&request, "version")?;
@@ -566,7 +580,7 @@ async fn get_lsn_by_timestamp_handler(
let timestamp_pg = postgres_ffi::to_pg_timestamp(timestamp); let timestamp_pg = postgres_ffi::to_pg_timestamp(timestamp);
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download); let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?; let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;
let result = timeline let result = timeline
.find_lsn_for_timestamp(timestamp_pg, &cancel, &ctx) .find_lsn_for_timestamp(timestamp_pg, &cancel, &ctx)
.await?; .await?;
@@ -601,8 +615,15 @@ async fn get_timestamp_of_lsn_handler(
request: Request<Body>, request: Request<Body>,
_cancel: CancellationToken, _cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> { ) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?; let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
check_permission(&request, Some(tenant_id))?; check_permission(&request, Some(tenant_shard_id.tenant_id))?;
if !tenant_shard_id.is_zero() {
// Requires SLRU contents, which are only stored on shard zero
return Err(ApiError::BadRequest(anyhow!(
"Size calculations are only available on shard zero"
)));
}
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?; let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
@@ -612,7 +633,7 @@ async fn get_timestamp_of_lsn_handler(
.map_err(ApiError::BadRequest)?; .map_err(ApiError::BadRequest)?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download); let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?; let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;
let result = timeline.get_timestamp_for_lsn(lsn, &ctx).await?; let result = timeline.get_timestamp_for_lsn(lsn, &ctx).await?;
match result { match result {
@@ -708,6 +729,26 @@ async fn tenant_detach_handler(
json_response(StatusCode::OK, ()) json_response(StatusCode::OK, ())
} }
async fn tenant_reset_handler(
request: Request<Body>,
_cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
let drop_cache: Option<bool> = parse_query_param(&request, "drop_cache")?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);
let state = get_state(&request);
state
.tenant_manager
.reset_tenant(tenant_shard_id, drop_cache.unwrap_or(false), ctx)
.await
.map_err(ApiError::InternalServerError)?;
json_response(StatusCode::OK, ())
}
async fn tenant_load_handler( async fn tenant_load_handler(
mut request: Request<Body>, mut request: Request<Body>,
_cancel: CancellationToken, _cancel: CancellationToken,
@@ -784,11 +825,11 @@ async fn tenant_status(
request: Request<Body>, request: Request<Body>,
_cancel: CancellationToken, _cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> { ) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?; let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

     let tenant_info = async {
-        let tenant = mgr::get_tenant(tenant_id, false)?;
+        let tenant = mgr::get_tenant(tenant_shard_id, false)?;

         // Calculate total physical size of all timelines
         let mut current_physical_size = 0;
@@ -798,13 +839,15 @@ async fn tenant_status(
         let state = tenant.current_state();
         Result::<_, ApiError>::Ok(TenantInfo {
-            id: tenant_id,
+            id: tenant_shard_id,
             state: state.clone(),
             current_physical_size: Some(current_physical_size),
             attachment_status: state.attachment_status(),
         })
     }
-    .instrument(info_span!("tenant_status_handler", %tenant_id))
+    .instrument(info_span!("tenant_status_handler",
+                tenant_id = %tenant_shard_id.tenant_id,
+                shard_id = %tenant_shard_id.shard_slug()))
     .await?;

     json_response(StatusCode::OK, tenant_info)
@@ -823,7 +866,7 @@ async fn tenant_delete_handler(
     mgr::delete_tenant(state.conf, state.remote_storage.clone(), tenant_shard_id)
         .instrument(info_span!("tenant_delete_handler",
             tenant_id = %tenant_shard_id.tenant_id,
-            shard = tenant_shard_id.shard_slug()
+            shard = %tenant_shard_id.shard_slug()
         ))
         .await?;
@@ -847,14 +890,20 @@ async fn tenant_size_handler(
     request: Request<Body>,
     cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;
     let inputs_only: Option<bool> = parse_query_param(&request, "inputs_only")?;
     let retention_period: Option<u64> = parse_query_param(&request, "retention_period")?;
     let headers = request.headers();

     let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
-    let tenant = mgr::get_tenant(tenant_id, true)?;
+    let tenant = mgr::get_tenant(tenant_shard_id, true)?;
+    if !tenant_shard_id.is_zero() {
+        return Err(ApiError::BadRequest(anyhow!(
+            "Size calculations are only available on shard zero"
+        )));
+    }

     // this can be long operation
     let inputs = tenant
@@ -906,7 +955,7 @@ async fn tenant_size_handler(
     json_response(
         StatusCode::OK,
         TenantHistorySize {
-            id: tenant_id,
+            id: tenant_shard_id.tenant_id,
             size: sizes.as_ref().map(|x| x.total_size),
             segment_sizes: sizes.map(|x| x.segments),
             inputs,
@@ -918,14 +967,14 @@ async fn layer_map_info_handler(
     request: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
     let reset: LayerAccessStatsReset =
         parse_query_param(&request, "reset")?.unwrap_or(LayerAccessStatsReset::NoReset);
-    check_permission(&request, Some(tenant_id))?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

-    let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
+    let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;
     let layer_map_info = timeline.layer_map_info(reset).await;

     json_response(StatusCode::OK, layer_map_info)
@@ -935,13 +984,12 @@ async fn layer_download_handler(
     request: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
     let layer_file_name = get_request_param(&request, "layer_file_name")?;
-    check_permission(&request, Some(tenant_id))?;
-    let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;
+    let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;
     let downloaded = timeline
         .download_layer(layer_file_name)
         .await
@@ -952,7 +1000,7 @@ async fn layer_download_handler(
         Some(false) => json_response(StatusCode::NOT_MODIFIED, ()),
         None => json_response(
             StatusCode::BAD_REQUEST,
-            format!("Layer {tenant_id}/{timeline_id}/{layer_file_name} not found"),
+            format!("Layer {tenant_shard_id}/{timeline_id}/{layer_file_name} not found"),
         ),
     }
 }
@@ -961,12 +1009,12 @@ async fn evict_timeline_layer_handler(
     request: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
     let layer_file_name = get_request_param(&request, "layer_file_name")?;
-    let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
+    let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;
     let evicted = timeline
         .evict_layer(layer_file_name)
         .await
@@ -977,7 +1025,7 @@ async fn evict_timeline_layer_handler(
         Some(false) => json_response(StatusCode::NOT_MODIFIED, ()),
         None => json_response(
             StatusCode::BAD_REQUEST,
-            format!("Layer {tenant_id}/{timeline_id}/{layer_file_name} not found"),
+            format!("Layer {tenant_shard_id}/{timeline_id}/{layer_file_name} not found"),
         ),
     }
 }
@@ -1109,10 +1157,10 @@ async fn get_tenant_config_handler(
     request: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

-    let tenant = mgr::get_tenant(tenant_id, false)?;
+    let tenant = mgr::get_tenant(tenant_shard_id, false)?;

     let response = HashMap::from([
         (
@@ -1172,7 +1220,7 @@ async fn put_tenant_location_config_handler(
         mgr::detach_tenant(conf, tenant_shard_id, true, &state.deletion_queue_client)
             .instrument(info_span!("tenant_detach",
                 tenant_id = %tenant_shard_id.tenant_id,
-                shard = tenant_shard_id.shard_slug()
+                shard = %tenant_shard_id.shard_slug()
             ))
             .await
         {
@@ -1206,9 +1254,9 @@ async fn handle_tenant_break(
     r: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&r, "tenant_id")?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&r, "tenant_shard_id")?;

-    let tenant = crate::tenant::mgr::get_tenant(tenant_id, true)
+    let tenant = crate::tenant::mgr::get_tenant(tenant_shard_id, true)
         .map_err(|_| ApiError::Conflict(String::from("no active tenant found")))?;

     tenant.set_broken("broken from test".to_owned()).await;
@@ -1249,14 +1297,15 @@ async fn timeline_gc_handler(
     mut request: Request<Body>,
     cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

     let gc_req: TimelineGcRequest = json_request(&mut request).await?;

     let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
-    let wait_task_done = mgr::immediate_gc(tenant_id, timeline_id, gc_req, cancel, &ctx).await?;
+    let wait_task_done =
+        mgr::immediate_gc(tenant_shard_id, timeline_id, gc_req, cancel, &ctx).await?;
     let gc_result = wait_task_done
         .await
         .context("wait for gc task")
@@ -1271,9 +1320,9 @@ async fn timeline_compact_handler(
     request: Request<Body>,
     cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

     let mut flags = EnumSet::empty();
     if Some(true) == parse_query_param::<_, bool>(&request, "force_repartition")? {
@@ -1281,14 +1330,14 @@ async fn timeline_compact_handler(
     }
     async {
         let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
-        let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
+        let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;
         timeline
             .compact(&cancel, flags, &ctx)
             .await
             .map_err(|e| ApiError::InternalServerError(e.into()))?;
         json_response(StatusCode::OK, ())
     }
-    .instrument(info_span!("manual_compaction", %tenant_id, %timeline_id))
+    .instrument(info_span!("manual_compaction", tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug(), %timeline_id))
     .await
 }
@@ -1297,9 +1346,9 @@ async fn timeline_checkpoint_handler(
     request: Request<Body>,
     cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

     let mut flags = EnumSet::empty();
     if Some(true) == parse_query_param::<_, bool>(&request, "force_repartition")? {
@@ -1307,7 +1356,7 @@ async fn timeline_checkpoint_handler(
     }
     async {
         let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
-        let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
+        let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;
         timeline
             .freeze_and_flush()
             .await
@@ -1319,7 +1368,7 @@ async fn timeline_checkpoint_handler(
         json_response(StatusCode::OK, ())
     }
-    .instrument(info_span!("manual_checkpoint", %tenant_id, %timeline_id))
+    .instrument(info_span!("manual_checkpoint", tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug(), %timeline_id))
     .await
 }
@@ -1327,12 +1376,12 @@ async fn timeline_download_remote_layers_handler_post(
     mut request: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
     let body: DownloadRemoteLayersTaskSpawnRequest = json_request(&mut request).await?;
-    check_permission(&request, Some(tenant_id))?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

-    let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
+    let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;
     match timeline.spawn_download_all_remote_layers(body).await {
         Ok(st) => json_response(StatusCode::ACCEPTED, st),
         Err(st) => json_response(StatusCode::CONFLICT, st),
@@ -1343,11 +1392,11 @@ async fn timeline_download_remote_layers_handler_get(
     request: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;

-    let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
+    let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;
     let info = timeline
         .get_download_all_remote_layers_task_info()
         .context("task never started since last pageserver process start")
@@ -1393,9 +1442,9 @@ async fn getpage_at_lsn_handler(
     request: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

     struct Key(crate::repository::Key);
@@ -1414,7 +1463,7 @@ async fn getpage_at_lsn_handler(
     async {
         let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
-        let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
+        let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;

         let page = timeline.get(key.0, lsn, &ctx).await?;
@@ -1426,7 +1475,7 @@ async fn getpage_at_lsn_handler(
             .unwrap(),
         )
     }
-    .instrument(info_span!("timeline_get", %tenant_id, %timeline_id))
+    .instrument(info_span!("timeline_get", tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug(), %timeline_id))
     .await
 }
@@ -1434,9 +1483,9 @@ async fn timeline_collect_keyspace(
     request: Request<Body>,
     _cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
-    let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
     let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
-    check_permission(&request, Some(tenant_id))?;
+    check_permission(&request, Some(tenant_shard_id.tenant_id))?;

     struct Partitioning {
         keys: crate::keyspace::KeySpace,
@@ -1505,7 +1554,7 @@ async fn timeline_collect_keyspace(
     async {
         let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
-        let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
+        let timeline = active_timeline_of_active_tenant(tenant_shard_id, timeline_id).await?;
         let at_lsn = at_lsn.unwrap_or_else(|| timeline.get_last_record_lsn());
         let keys = timeline
             .collect_keyspace(at_lsn, &ctx)
@@ -1514,15 +1563,15 @@ async fn timeline_collect_keyspace(
         json_response(StatusCode::OK, Partitioning { keys, at_lsn })
     }
-    .instrument(info_span!("timeline_collect_keyspace", %tenant_id, %timeline_id))
+    .instrument(info_span!("timeline_collect_keyspace", tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug(), %timeline_id))
     .await
 }

 async fn active_timeline_of_active_tenant(
-    tenant_id: TenantId,
+    tenant_shard_id: TenantShardId,
     timeline_id: TimelineId,
 ) -> Result<Arc<Timeline>, ApiError> {
-    let tenant = mgr::get_tenant(tenant_id, true)?;
+    let tenant = mgr::get_tenant(tenant_shard_id, true)?;
     tenant
         .get_timeline(timeline_id, true)
         .map_err(|e| ApiError::NotFound(e.into()))
@@ -1544,7 +1593,7 @@ async fn always_panic_handler(
 async fn disk_usage_eviction_run(
     mut r: Request<Body>,
-    _cancel: CancellationToken,
+    cancel: CancellationToken,
 ) -> Result<Response<Body>, ApiError> {
     check_permission(&r, None)?;
@@ -1572,57 +1621,48 @@ async fn disk_usage_eviction_run(
         }
     }

-    let config = json_request::<Config>(&mut r)
-        .await
-        .map_err(|_| ApiError::BadRequest(anyhow::anyhow!("invalid JSON body")))?;
+    let config = json_request::<Config>(&mut r).await?;

     let usage = Usage {
         config,
         freed_bytes: 0,
     };

-    let (tx, rx) = tokio::sync::oneshot::channel();
-
     let state = get_state(&r);

-    if state.remote_storage.as_ref().is_none() {
+    let Some(storage) = state.remote_storage.as_ref() else {
         return Err(ApiError::InternalServerError(anyhow::anyhow!(
             "remote storage not configured, cannot run eviction iteration"
         )));
-    }
+    };

     let state = state.disk_usage_eviction_state.clone();

-    let cancel = CancellationToken::new();
-    let child_cancel = cancel.clone();
-    let _g = cancel.drop_guard();
-
-    crate::task_mgr::spawn(
-        crate::task_mgr::BACKGROUND_RUNTIME.handle(),
-        TaskKind::DiskUsageEviction,
-        None,
-        None,
-        "ondemand disk usage eviction",
-        false,
-        async move {
-            let res = crate::disk_usage_eviction_task::disk_usage_eviction_task_iteration_impl(
-                &state,
-                usage,
-                &child_cancel,
-            )
-            .await;
-
-            info!(?res, "disk_usage_eviction_task_iteration_impl finished");
-
-            let _ = tx.send(res);
-            Ok(())
-        }
-        .in_current_span(),
-    );
-
-    let response = rx.await.unwrap().map_err(ApiError::InternalServerError)?;
-
-    json_response(StatusCode::OK, response)
+    let res = crate::disk_usage_eviction_task::disk_usage_eviction_task_iteration_impl(
+        &state, storage, usage, &cancel,
+    )
+    .await;
+
+    info!(?res, "disk_usage_eviction_task_iteration_impl finished");
+
+    let res = res.map_err(ApiError::InternalServerError)?;
+    json_response(StatusCode::OK, res)
+}
+
+async fn secondary_upload_handler(
+    request: Request<Body>,
+    _cancel: CancellationToken,
+) -> Result<Response<Body>, ApiError> {
+    let state = get_state(&request);
+    let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
+    state
+        .secondary_controller
+        .upload_tenant(tenant_shard_id)
+        .await
+        .map_err(ApiError::InternalServerError)?;
+
+    json_response(StatusCode::OK, ())
 }

 async fn handler_404(_: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -1799,23 +1839,25 @@ pub fn make_router(
         })
         .get("/v1/tenant", |r| api_handler(r, tenant_list_handler))
         .post("/v1/tenant", |r| api_handler(r, tenant_create_handler))
-        .get("/v1/tenant/:tenant_id", |r| api_handler(r, tenant_status))
+        .get("/v1/tenant/:tenant_shard_id", |r| {
+            api_handler(r, tenant_status)
+        })
         .delete("/v1/tenant/:tenant_shard_id", |r| {
             api_handler(r, tenant_delete_handler)
         })
-        .get("/v1/tenant/:tenant_id/synthetic_size", |r| {
+        .get("/v1/tenant/:tenant_shard_id/synthetic_size", |r| {
             api_handler(r, tenant_size_handler)
         })
         .put("/v1/tenant/config", |r| {
             api_handler(r, update_tenant_config_handler)
         })
-        .get("/v1/tenant/:tenant_id/config", |r| {
+        .get("/v1/tenant/:tenant_shard_id/config", |r| {
             api_handler(r, get_tenant_config_handler)
         })
         .put("/v1/tenant/:tenant_shard_id/location_config", |r| {
             api_handler(r, put_tenant_location_config_handler)
         })
-        .get("/v1/tenant/:tenant_id/timeline", |r| {
+        .get("/v1/tenant/:tenant_shard_id/timeline", |r| {
             api_handler(r, timeline_list_handler)
         })
         .post("/v1/tenant/:tenant_shard_id/timeline", |r| {
@@ -1827,73 +1869,83 @@ pub fn make_router(
         .post("/v1/tenant/:tenant_id/detach", |r| {
             api_handler(r, tenant_detach_handler)
         })
+        .post("/v1/tenant/:tenant_shard_id/reset", |r| {
+            api_handler(r, tenant_reset_handler)
+        })
         .post("/v1/tenant/:tenant_id/load", |r| {
             api_handler(r, tenant_load_handler)
         })
         .post("/v1/tenant/:tenant_id/ignore", |r| {
             api_handler(r, tenant_ignore_handler)
         })
-        .get("/v1/tenant/:tenant_id/timeline/:timeline_id", |r| {
+        .get("/v1/tenant/:tenant_shard_id/timeline/:timeline_id", |r| {
             api_handler(r, timeline_detail_handler)
         })
         .get(
-            "/v1/tenant/:tenant_id/timeline/:timeline_id/get_lsn_by_timestamp",
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/get_lsn_by_timestamp",
             |r| api_handler(r, get_lsn_by_timestamp_handler),
         )
         .get(
-            "/v1/tenant/:tenant_id/timeline/:timeline_id/get_timestamp_of_lsn",
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/get_timestamp_of_lsn",
             |r| api_handler(r, get_timestamp_of_lsn_handler),
         )
-        .put("/v1/tenant/:tenant_id/timeline/:timeline_id/do_gc", |r| {
-            api_handler(r, timeline_gc_handler)
-        })
-        .put("/v1/tenant/:tenant_id/timeline/:timeline_id/compact", |r| {
-            testing_api_handler("run timeline compaction", r, timeline_compact_handler)
-        })
         .put(
-            "/v1/tenant/:tenant_id/timeline/:timeline_id/checkpoint",
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/do_gc",
+            |r| api_handler(r, timeline_gc_handler),
+        )
+        .put(
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/compact",
+            |r| testing_api_handler("run timeline compaction", r, timeline_compact_handler),
+        )
+        .put(
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/checkpoint",
             |r| testing_api_handler("run timeline checkpoint", r, timeline_checkpoint_handler),
         )
         .post(
-            "/v1/tenant/:tenant_id/timeline/:timeline_id/download_remote_layers",
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/download_remote_layers",
             |r| api_handler(r, timeline_download_remote_layers_handler_post),
         )
         .get(
-            "/v1/tenant/:tenant_id/timeline/:timeline_id/download_remote_layers",
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/download_remote_layers",
             |r| api_handler(r, timeline_download_remote_layers_handler_get),
         )
         .delete("/v1/tenant/:tenant_shard_id/timeline/:timeline_id", |r| {
             api_handler(r, timeline_delete_handler)
         })
-        .get("/v1/tenant/:tenant_id/timeline/:timeline_id/layer", |r| {
-            api_handler(r, layer_map_info_handler)
-        })
         .get(
-            "/v1/tenant/:tenant_id/timeline/:timeline_id/layer/:layer_file_name",
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/layer",
+            |r| api_handler(r, layer_map_info_handler),
+        )
+        .get(
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/layer/:layer_file_name",
             |r| api_handler(r, layer_download_handler),
         )
         .delete(
-            "/v1/tenant/:tenant_id/timeline/:timeline_id/layer/:layer_file_name",
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/layer/:layer_file_name",
             |r| api_handler(r, evict_timeline_layer_handler),
         )
+        .post("/v1/tenant/:tenant_shard_id/heatmap_upload", |r| {
+            api_handler(r, secondary_upload_handler)
+        })
         .put("/v1/disk_usage_eviction/run", |r| {
             api_handler(r, disk_usage_eviction_run)
         })
        .put("/v1/deletion_queue/flush", |r| {
             api_handler(r, deletion_queue_flush)
         })
-        .put("/v1/tenant/:tenant_id/break", |r| {
+        .put("/v1/tenant/:tenant_shard_id/break", |r| {
             testing_api_handler("set tenant state to broken", r, handle_tenant_break)
         })
         .get("/v1/panic", |r| api_handler(r, always_panic_handler))
         .post("/v1/tracing/event", |r| {
             testing_api_handler("emit a tracing event", r, post_tracing_event_handler)
         })
-        .get("/v1/tenant/:tenant_id/timeline/:timeline_id/getpage", |r| {
-            testing_api_handler("getpage@lsn", r, getpage_at_lsn_handler)
-        })
         .get(
-            "/v1/tenant/:tenant_id/timeline/:timeline_id/keyspace",
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/getpage",
            |r| testing_api_handler("getpage@lsn", r, getpage_at_lsn_handler),
+        )
+        .get(
+            "/v1/tenant/:tenant_shard_id/timeline/:timeline_id/keyspace",
             |r| testing_api_handler("read out the keyspace", r, timeline_collect_keyspace),
         )
         .any(handler_404))

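The rewritten `disk_usage_eviction_run` above swaps an `is_none()` check for a `let ... else` guard, which binds the inner value and forces the failure arm to diverge. A minimal, synchronous sketch of that pattern (the `require_storage` helper and its string error are illustrative, not pageserver API):

```rust
// `let ... else` guard: bind the Some value, or take the diverging else arm.
// Mirrors the handler's "remote storage not configured" early return.
fn require_storage(remote_storage: Option<&str>) -> Result<String, String> {
    let Some(storage) = remote_storage else {
        return Err("remote storage not configured, cannot run eviction iteration".to_owned());
    };
    Ok(format!("running eviction against {storage}"))
}
```

Compared to `if x.is_none() { return ... }` followed by `x.unwrap()`, the guard gives the bound name (`storage`) to the rest of the function with no second lookup.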

@@ -2,9 +2,8 @@
//! Import data and WAL from a PostgreSQL data directory and WAL segments into //! Import data and WAL from a PostgreSQL data directory and WAL segments into
//! a neon Timeline. //! a neon Timeline.
//! //!
use std::io::SeekFrom;
use std::path::{Path, PathBuf}; use std::path::{Path, PathBuf};
use std::pin::Pin;
use std::task::{self, Poll};
use anyhow::{bail, ensure, Context, Result}; use anyhow::{bail, ensure, Context, Result};
use async_compression::tokio::bufread::ZstdDecoder; use async_compression::tokio::bufread::ZstdDecoder;
@@ -13,7 +12,8 @@ use bytes::Bytes;
use camino::Utf8Path; use camino::Utf8Path;
use futures::StreamExt; use futures::StreamExt;
use nix::NixPath; use nix::NixPath;
use tokio::io::{AsyncBufRead, AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt}; use tokio::fs::{File, OpenOptions};
use tokio::io::{AsyncBufRead, AsyncRead, AsyncReadExt, AsyncSeekExt, AsyncWriteExt};
use tokio_tar::Archive; use tokio_tar::Archive;
use tokio_tar::Builder; use tokio_tar::Builder;
use tokio_tar::HeaderMode; use tokio_tar::HeaderMode;
@@ -629,70 +629,16 @@ async fn read_all_bytes(reader: &mut (impl AsyncRead + Unpin)) -> Result<Bytes>
Ok(Bytes::from(buf)) Ok(Bytes::from(buf))
} }
-/// An in-memory buffer implementing `AsyncWrite`, inserting yields every now and then
-///
-/// The number of yields is bounded by above by the number of times poll_write is called,
-/// so calling it with 8 KB chunks and 8 MB chunks gives the same number of yields in total.
-/// This is an explicit choice as the `YieldingVec` is meant to give the async executor
-/// breathing room between units of CPU intensive preparation of buffers to be written.
-/// Once a write call is issued, the whole buffer has been prepared already, so there is no
-/// gain in splitting up the memcopy further.
-struct YieldingVec {
-    yield_budget: usize,
-    // the buffer written into
-    buf: Vec<u8>,
-}
-
-impl YieldingVec {
-    fn new() -> Self {
-        Self {
-            yield_budget: 0,
-            buf: Vec::new(),
-        }
-    }
-
-    // Whether we should yield for a read operation of given size
-    fn should_yield(&mut self, add_buf_len: usize) -> bool {
-        // Set this limit to a small value so that we are a
-        // good async citizen and yield repeatedly (but not
-        // too often for many small writes to cause many yields)
-        const YIELD_DIST: usize = 1024;
-        let target_buf_len = self.buf.len() + add_buf_len;
-        let ret = self.yield_budget / YIELD_DIST < target_buf_len / YIELD_DIST;
-        if self.yield_budget < target_buf_len {
-            self.yield_budget += add_buf_len;
-        }
-        ret
-    }
-}
-
-impl AsyncWrite for YieldingVec {
-    fn poll_write(
-        mut self: Pin<&mut Self>,
-        cx: &mut task::Context<'_>,
-        buf: &[u8],
-    ) -> Poll<std::io::Result<usize>> {
-        if self.should_yield(buf.len()) {
-            cx.waker().wake_by_ref();
-            return Poll::Pending;
-        }
-        self.get_mut().buf.extend_from_slice(buf);
-        Poll::Ready(Ok(buf.len()))
-    }
-
-    fn poll_flush(self: Pin<&mut Self>, _cx: &mut task::Context<'_>) -> Poll<std::io::Result<()>> {
-        Poll::Ready(Ok(()))
-    }
-
-    fn poll_shutdown(
-        self: Pin<&mut Self>,
-        _cx: &mut task::Context<'_>,
-    ) -> Poll<std::io::Result<()>> {
-        Poll::Ready(Ok(()))
-    }
-}
-
-pub async fn create_tar_zst(pgdata_path: &Utf8Path) -> Result<Vec<u8>> {
+pub async fn create_tar_zst(pgdata_path: &Utf8Path, tmp_path: &Utf8Path) -> Result<(File, u64)> {
+    let file = OpenOptions::new()
+        .create(true)
+        .truncate(true)
+        .read(true)
+        .write(true)
+        .open(&tmp_path)
+        .await
+        .with_context(|| format!("tempfile creation {tmp_path}"))?;
     let mut paths = Vec::new();
     for entry in WalkDir::new(pgdata_path) {
         let entry = entry?;
@@ -707,7 +653,7 @@ pub async fn create_tar_zst(pgdata_path: &Utf8Path) -> Result<Vec<u8>> {
     // Do a sort to get a more consistent listing
     paths.sort_unstable();
     let zstd = ZstdEncoder::with_quality_and_params(
-        YieldingVec::new(),
+        file,
         Level::Default,
         &[CParameter::enable_long_distance_matching(true)],
     );
@@ -725,13 +671,14 @@ pub async fn create_tar_zst(pgdata_path: &Utf8Path) -> Result<Vec<u8>> {
     }
     let mut zstd = builder.into_inner().await?;
     zstd.shutdown().await?;
-    let compressed = zstd.into_inner();
-    let compressed_len = compressed.buf.len();
-    const INITDB_TAR_ZST_WARN_LIMIT: usize = 2_000_000;
+    let mut compressed = zstd.into_inner();
+    let compressed_len = compressed.metadata().await?.len();
+    const INITDB_TAR_ZST_WARN_LIMIT: u64 = 2 * 1024 * 1024;
     if compressed_len > INITDB_TAR_ZST_WARN_LIMIT {
         warn!("compressed {INITDB_PATH} size of {compressed_len} is above limit {INITDB_TAR_ZST_WARN_LIMIT}.");
     }
-    Ok(compressed.buf)
+    compressed.seek(SeekFrom::Start(0)).await?;
+    Ok((compressed, compressed_len))
 }

 pub async fn extract_tar_zst(
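The removed `YieldingVec` yielded back to the executor whenever the cumulative number of bytes written crossed a 1 KiB boundary, with at most one yield per `poll_write` call. A standalone, std-only sketch of that yield-budget arithmetic (the `Budget`/`count_yields` names are illustrative, not from the diff):

```rust
// Sketch of the yield-budget arithmetic from the removed `YieldingVec`:
// a write first yields when the running byte total crosses a
// YIELD_DIST boundary that the budget has not yet reached.
const YIELD_DIST: usize = 1024;

struct Budget {
    yield_budget: usize,
    written: usize,
}

impl Budget {
    fn new() -> Self {
        Budget { yield_budget: 0, written: 0 }
    }

    /// True if a write of `len` bytes should first yield to the executor.
    fn should_yield(&mut self, len: usize) -> bool {
        let target = self.written + len;
        let ret = self.yield_budget / YIELD_DIST < target / YIELD_DIST;
        if self.yield_budget < target {
            // Advance the budget so the retried write does not yield again.
            self.yield_budget += len;
        }
        ret
    }
}

/// Simulate writing `total` bytes in `chunk`-sized poll_write calls,
/// counting how often the writer would return Poll::Pending.
fn count_yields(chunk: usize, total: usize) -> usize {
    let mut b = Budget::new();
    let mut yields = 0;
    let mut done = 0;
    while done < total {
        let len = chunk.min(total - done);
        if b.should_yield(len) {
            yields += 1; // a real AsyncWrite returns Poll::Pending here
        } else {
            b.written += len;
            done += len;
        }
    }
    yields
}

fn main() {
    // Each 8 KiB chunk crosses a 1 KiB boundary, so each call yields once;
    // sub-KiB chunks yield only when a boundary is actually crossed.
    println!("8 KiB chunks over 1 MiB: {} yields", count_yields(8 * 1024, 1 << 20));
    println!("512 B chunks over 1 MiB: {} yields", count_yields(512, 1 << 20));
}
```

Note how the budget is advanced even on the yielding call: when tokio re-polls the same write, the condition is false and the data is copied, so a single logical write never yields twice.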

View File

@@ -186,13 +186,6 @@ pub struct InitializationOrder {
     /// Each initial tenant load task carries this until completion.
     pub initial_tenant_load: Option<utils::completion::Completion>,

-    /// Barrier for when we can start initial logical size calculations.
-    pub initial_logical_size_can_start: utils::completion::Barrier,
-
-    /// Each timeline owns a clone of this to be consumed on the initial logical size calculation
-    /// attempt. It is important to drop this once the attempt has completed.
-    pub initial_logical_size_attempt: Option<utils::completion::Completion>,
-
     /// Barrier for when we can start any background jobs.
     ///
     /// This can be broken up later on, but right now there is just one class of a background job.
@@ -212,7 +205,7 @@ async fn timed<Fut: std::future::Future>(
     match tokio::time::timeout(warn_at, &mut fut).await {
         Ok(ret) => {
             tracing::info!(
-                task = name,
+                stage = name,
                 elapsed_ms = started.elapsed().as_millis(),
                 "completed"
             );
@@ -220,7 +213,7 @@
         }
         Err(_) => {
             tracing::info!(
-                task = name,
+                stage = name,
                 elapsed_ms = started.elapsed().as_millis(),
                 "still waiting, taking longer than expected..."
             );
@@ -229,7 +222,7 @@
             // this has a global allowed_errors
             tracing::warn!(
-                task = name,
+                stage = name,
                 elapsed_ms = started.elapsed().as_millis(),
                 "completed, took longer than expected"
             );

View File

@@ -2,9 +2,10 @@ use enum_map::EnumMap;
 use metrics::metric_vec_duration::DurationResultObserver;
 use metrics::{
     register_counter_vec, register_gauge_vec, register_histogram, register_histogram_vec,
-    register_int_counter, register_int_counter_vec, register_int_gauge, register_int_gauge_vec,
-    register_uint_gauge, register_uint_gauge_vec, Counter, CounterVec, GaugeVec, Histogram,
-    HistogramVec, IntCounter, IntCounterVec, IntGauge, IntGaugeVec, UIntGauge, UIntGaugeVec,
+    register_int_counter, register_int_counter_pair_vec, register_int_counter_vec,
+    register_int_gauge, register_int_gauge_vec, register_uint_gauge, register_uint_gauge_vec,
+    Counter, CounterVec, GaugeVec, Histogram, HistogramVec, IntCounter, IntCounterPairVec,
+    IntCounterVec, IntGauge, IntGaugeVec, UIntGauge, UIntGaugeVec,
 };
 use once_cell::sync::Lazy;
 use pageserver_api::shard::TenantShardId;
@@ -285,6 +286,63 @@ pub static PAGE_CACHE_SIZE: Lazy<PageCacheSizeMetrics> = Lazy::new(|| PageCacheS
     },
 });

+pub(crate) mod page_cache_eviction_metrics {
+    use std::num::NonZeroUsize;
+
+    use metrics::{register_int_counter_vec, IntCounter, IntCounterVec};
+    use once_cell::sync::Lazy;
+
+    #[derive(Clone, Copy)]
+    pub(crate) enum Outcome {
+        FoundSlotUnused { iters: NonZeroUsize },
+        FoundSlotEvicted { iters: NonZeroUsize },
+        ItersExceeded { iters: NonZeroUsize },
+    }
+
+    static ITERS_TOTAL_VEC: Lazy<IntCounterVec> = Lazy::new(|| {
+        register_int_counter_vec!(
+            "pageserver_page_cache_find_victim_iters_total",
+            "Counter for the number of iterations in the find_victim loop",
+            &["outcome"],
+        )
+        .expect("failed to define a metric")
+    });
+
+    static CALLS_VEC: Lazy<IntCounterVec> = Lazy::new(|| {
+        register_int_counter_vec!(
+            "pageserver_page_cache_find_victim_calls",
+            "Incremented at the end of each find_victim() call.\
+             Filter by outcome to get e.g., eviction rate.",
+            &["outcome"]
+        )
+        .unwrap()
+    });
+
+    pub(crate) fn observe(outcome: Outcome) {
+        macro_rules! dry {
+            ($label:literal, $iters:expr) => {{
+                static LABEL: &'static str = $label;
+                static ITERS_TOTAL: Lazy<IntCounter> =
+                    Lazy::new(|| ITERS_TOTAL_VEC.with_label_values(&[LABEL]));
+                static CALLS: Lazy<IntCounter> =
+                    Lazy::new(|| CALLS_VEC.with_label_values(&[LABEL]));
+                ITERS_TOTAL.inc_by(($iters.get()) as u64);
+                CALLS.inc();
+            }};
+        }
+        match outcome {
+            Outcome::FoundSlotUnused { iters } => dry!("found_empty", iters),
+            Outcome::FoundSlotEvicted { iters } => {
+                dry!("found_evicted", iters)
+            }
+            Outcome::ItersExceeded { iters } => {
+                dry!("err_iters_exceeded", iters);
+                super::page_cache_errors_inc(super::PageCacheErrorKind::EvictIterLimit);
+            }
+        }
+    }
+}
+
 pub(crate) static PAGE_CACHE_ACQUIRE_PINNED_SLOT_TIME: Lazy<Histogram> = Lazy::new(|| {
     register_histogram!(
         "pageserver_page_cache_acquire_pinned_slot_seconds",
@@ -294,14 +352,6 @@ pub(crate) static PAGE_CACHE_ACQUIRE_PINNED_SLOT_TIME: Lazy<Histogram> = Lazy::n
     .expect("failed to define a metric")
 });

-pub(crate) static PAGE_CACHE_FIND_VICTIMS_ITERS_TOTAL: Lazy<IntCounter> = Lazy::new(|| {
-    register_int_counter!(
-        "pageserver_page_cache_find_victim_iters_total",
-        "Counter for the number of iterations in the find_victim loop",
-    )
-    .expect("failed to define a metric")
-});
-
 static PAGE_CACHE_ERRORS: Lazy<IntCounterVec> = Lazy::new(|| {
     register_int_counter_vec!(
         "page_cache_errors_total",
@@ -407,16 +457,14 @@ pub(crate) mod initial_logical_size {
     use metrics::{register_int_counter, register_int_counter_vec, IntCounter, IntCounterVec};
     use once_cell::sync::Lazy;

-    use crate::task_mgr::TaskKind;
-
     pub(crate) struct StartCalculation(IntCounterVec);
     pub(crate) static START_CALCULATION: Lazy<StartCalculation> = Lazy::new(|| {
         StartCalculation(
             register_int_counter_vec!(
                 "pageserver_initial_logical_size_start_calculation",
                 "Incremented each time we start an initial logical size calculation attempt. \
-                 The `task_kind` label is for the task kind that caused this attempt.",
-                &["attempt", "task_kind"]
+                 The `circumstances` label provides some additional details.",
+                &["attempt", "circumstances"]
             )
             .unwrap(),
         )
@@ -464,19 +512,24 @@ pub(crate) mod initial_logical_size {
         inc_drop_calculation: Option<IntCounter>,
     }

+    #[derive(strum_macros::IntoStaticStr)]
+    pub(crate) enum StartCircumstances {
+        EmptyInitial,
+        SkippedConcurrencyLimiter,
+        AfterBackgroundTasksRateLimit,
+    }
+
     impl StartCalculation {
-        pub(crate) fn first(&self, causing_task_kind: Option<TaskKind>) -> OngoingCalculationGuard {
-            let task_kind_label: &'static str =
-                causing_task_kind.map(|k| k.into()).unwrap_or_default();
-            self.0.with_label_values(&["first", task_kind_label]);
+        pub(crate) fn first(&self, circumstances: StartCircumstances) -> OngoingCalculationGuard {
+            let circumstances_label: &'static str = circumstances.into();
+            self.0.with_label_values(&["first", circumstances_label]);
             OngoingCalculationGuard {
                 inc_drop_calculation: Some(DROP_CALCULATION.first.clone()),
             }
         }
-        pub(crate) fn retry(&self, causing_task_kind: Option<TaskKind>) -> OngoingCalculationGuard {
-            let task_kind_label: &'static str =
-                causing_task_kind.map(|k| k.into()).unwrap_or_default();
-            self.0.with_label_values(&["retry", task_kind_label]);
+        pub(crate) fn retry(&self, circumstances: StartCircumstances) -> OngoingCalculationGuard {
+            let circumstances_label: &'static str = circumstances.into();
+            self.0.with_label_values(&["retry", circumstances_label]);
             OngoingCalculationGuard {
                 inc_drop_calculation: Some(DROP_CALCULATION.retry.clone()),
             }
@@ -598,7 +651,7 @@ static EVICTIONS_WITH_LOW_RESIDENCE_DURATION: Lazy<IntCounterVec> = Lazy::new(||
         "pageserver_evictions_with_low_residence_duration",
         "If a layer is evicted that was resident for less than `low_threshold`, it is counted to this counter. \
          Residence duration is determined using the `residence_duration_data_source`.",
-        &["tenant_id", "timeline_id", "residence_duration_data_source", "low_threshold_secs"]
+        &["tenant_id", "shard_id", "timeline_id", "residence_duration_data_source", "low_threshold_secs"]
     )
     .expect("failed to define a metric")
 });
@@ -662,10 +715,16 @@ impl EvictionsWithLowResidenceDurationBuilder {
         }
     }

-    fn build(&self, tenant_id: &str, timeline_id: &str) -> EvictionsWithLowResidenceDuration {
+    fn build(
+        &self,
+        tenant_id: &str,
+        shard_id: &str,
+        timeline_id: &str,
+    ) -> EvictionsWithLowResidenceDuration {
         let counter = EVICTIONS_WITH_LOW_RESIDENCE_DURATION
             .get_metric_with_label_values(&[
                 tenant_id,
+                shard_id,
                 timeline_id,
                 self.data_source,
                 &EvictionsWithLowResidenceDuration::threshold_label_value(self.threshold),
@@ -696,21 +755,24 @@ impl EvictionsWithLowResidenceDuration {
     pub fn change_threshold(
         &mut self,
         tenant_id: &str,
+        shard_id: &str,
         timeline_id: &str,
         new_threshold: Duration,
     ) {
         if new_threshold == self.threshold {
             return;
         }
-        let mut with_new =
-            EvictionsWithLowResidenceDurationBuilder::new(self.data_source, new_threshold)
-                .build(tenant_id, timeline_id);
+        let mut with_new = EvictionsWithLowResidenceDurationBuilder::new(
+            self.data_source,
+            new_threshold,
+        )
+        .build(tenant_id, shard_id, timeline_id);
         std::mem::swap(self, &mut with_new);
-        with_new.remove(tenant_id, timeline_id);
+        with_new.remove(tenant_id, shard_id, timeline_id);
     }

     // This could be a `Drop` impl, but, we need the `tenant_id` and `timeline_id`.
-    fn remove(&mut self, tenant_id: &str, timeline_id: &str) {
+    fn remove(&mut self, tenant_id: &str, shard_id: &str, timeline_id: &str) {
         let Some(_counter) = self.counter.take() else {
             return;
         };
@@ -719,6 +781,7 @@ impl EvictionsWithLowResidenceDuration {
         let removed = EVICTIONS_WITH_LOW_RESIDENCE_DURATION.remove_label_values(&[
             tenant_id,
+            shard_id,
             timeline_id,
             self.data_source,
             &threshold,
@@ -771,6 +834,7 @@ const STORAGE_IO_TIME_BUCKETS: &[f64] = &[
 )]
 pub(crate) enum StorageIoOperation {
     Open,
+    OpenAfterReplace,
     Close,
     CloseByReplace,
     Read,
@@ -784,6 +848,7 @@ impl StorageIoOperation {
     pub fn as_str(&self) -> &'static str {
         match self {
             StorageIoOperation::Open => "open",
+            StorageIoOperation::OpenAfterReplace => "open-after-replace",
             StorageIoOperation::Close => "close",
             StorageIoOperation::CloseByReplace => "close-by-replace",
             StorageIoOperation::Read => "read",
@@ -838,6 +903,25 @@ pub(crate) static STORAGE_IO_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
     .expect("failed to define a metric")
 });

+pub(crate) mod virtual_file_descriptor_cache {
+    use super::*;
+
+    pub(crate) static SIZE_MAX: Lazy<UIntGauge> = Lazy::new(|| {
+        register_uint_gauge!(
+            "pageserver_virtual_file_descriptor_cache_size_max",
+            "Maximum number of open file descriptors in the cache."
+        )
+        .unwrap()
+    });
+
+    // SIZE_CURRENT: derive it like so:
+    // ```
+    // sum (pageserver_io_operations_seconds_count{operation=~"^(open|open-after-replace)$")
+    // -ignoring(operation)
+    // sum(pageserver_io_operations_seconds_count{operation=~"^(close|close-by-replace)$"}
+    // ```
+}
+
 #[derive(Debug)]
 struct GlobalAndPerTimelineHistogram {
     global: Histogram,
@@ -1164,6 +1248,52 @@ pub(crate) static DELETION_QUEUE: Lazy<DeletionQueueMetrics> = Lazy::new(|| {
     }
 });

+pub(crate) struct WalIngestMetrics {
+    pub(crate) records_received: IntCounter,
+    pub(crate) records_committed: IntCounter,
+    pub(crate) records_filtered: IntCounter,
+}
+
+pub(crate) static WAL_INGEST: Lazy<WalIngestMetrics> = Lazy::new(|| WalIngestMetrics {
+    records_received: register_int_counter!(
+        "pageserver_wal_ingest_records_received",
+        "Number of WAL records received from safekeepers"
+    )
+    .expect("failed to define a metric"),
+    records_committed: register_int_counter!(
+        "pageserver_wal_ingest_records_committed",
+        "Number of WAL records which resulted in writes to pageserver storage"
+    )
+    .expect("failed to define a metric"),
+    records_filtered: register_int_counter!(
+        "pageserver_wal_ingest_records_filtered",
+        "Number of WAL records filtered out due to sharding"
+    )
+    .expect("failed to define a metric"),
+});
+
+pub(crate) struct SecondaryModeMetrics {
+    pub(crate) upload_heatmap: IntCounter,
+    pub(crate) upload_heatmap_errors: IntCounter,
+    pub(crate) upload_heatmap_duration: Histogram,
+}
+
+pub(crate) static SECONDARY_MODE: Lazy<SecondaryModeMetrics> = Lazy::new(|| SecondaryModeMetrics {
+    upload_heatmap: register_int_counter!(
+        "pageserver_secondary_upload_heatmap",
+        "Number of heatmaps written to remote storage by attached tenants"
+    )
+    .expect("failed to define a metric"),
+    upload_heatmap_errors: register_int_counter!(
+        "pageserver_secondary_upload_heatmap_errors",
+        "Failures writing heatmap to remote storage"
+    )
+    .expect("failed to define a metric"),
+    upload_heatmap_duration: register_histogram!(
+        "pageserver_secondary_upload_heatmap_duration",
+        "Time to build and upload a heatmap, including any waiting inside the S3 client"
+    )
+    .expect("failed to define a metric"),
+});
+
 #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
 pub enum RemoteOpKind {
     Upload,
@@ -1214,25 +1344,16 @@ pub(crate) static TENANT_TASK_EVENTS: Lazy<IntCounterVec> = Lazy::new(|| {
     .expect("Failed to register tenant_task_events metric")
 });

-pub(crate) static BACKGROUND_LOOP_SEMAPHORE_WAIT_START_COUNT: Lazy<IntCounterVec> =
-    Lazy::new(|| {
-        register_int_counter_vec!(
-            "pageserver_background_loop_semaphore_wait_start_count",
-            "Counter for background loop concurrency-limiting semaphore acquire calls started",
-            &["task"],
-        )
-        .unwrap()
-    });
-
-pub(crate) static BACKGROUND_LOOP_SEMAPHORE_WAIT_FINISH_COUNT: Lazy<IntCounterVec> =
-    Lazy::new(|| {
-        register_int_counter_vec!(
-            "pageserver_background_loop_semaphore_wait_finish_count",
-            "Counter for background loop concurrency-limiting semaphore acquire calls finished",
-            &["task"],
-        )
-        .unwrap()
-    });
+pub(crate) static BACKGROUND_LOOP_SEMAPHORE_WAIT_GAUGE: Lazy<IntCounterPairVec> = Lazy::new(|| {
+    register_int_counter_pair_vec!(
+        "pageserver_background_loop_semaphore_wait_start_count",
+        "Counter for background loop concurrency-limiting semaphore acquire calls started",
+        "pageserver_background_loop_semaphore_wait_finish_count",
+        "Counter for background loop concurrency-limiting semaphore acquire calls finished",
+        &["task"],
+    )
+    .unwrap()
+});

 pub(crate) static BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT: Lazy<IntCounterVec> = Lazy::new(|| {
     register_int_counter_vec!(
@@ -1385,6 +1506,8 @@ pub(crate) static WAL_REDO_PROCESS_LAUNCH_DURATION_HISTOGRAM: Lazy<Histogram> =
 pub(crate) struct WalRedoProcessCounters {
     pub(crate) started: IntCounter,
     pub(crate) killed_by_cause: enum_map::EnumMap<WalRedoKillCause, IntCounter>,
+    pub(crate) active_stderr_logger_tasks_started: IntCounter,
+    pub(crate) active_stderr_logger_tasks_finished: IntCounter,
 }

 #[derive(Debug, enum_map::Enum, strum_macros::IntoStaticStr)]
@@ -1408,6 +1531,19 @@ impl Default for WalRedoProcessCounters {
             &["cause"],
         )
         .unwrap();
+
+        let active_stderr_logger_tasks_started = register_int_counter!(
+            "pageserver_walredo_stderr_logger_tasks_started_total",
+            "Number of active walredo stderr logger tasks that have started",
+        )
+        .unwrap();
+
+        let active_stderr_logger_tasks_finished = register_int_counter!(
+            "pageserver_walredo_stderr_logger_tasks_finished_total",
+            "Number of active walredo stderr logger tasks that have finished",
+        )
+        .unwrap();
+
         Self {
             started,
             killed_by_cause: EnumMap::from_array(std::array::from_fn(|i| {
@@ -1415,6 +1551,8 @@ impl Default for WalRedoProcessCounters {
                 let cause_str: &'static str = cause.into();
                 killed.with_label_values(&[cause_str])
             })),
+            active_stderr_logger_tasks_started,
+            active_stderr_logger_tasks_finished,
         }
     }
 }
@@ -1489,6 +1627,7 @@ impl StorageTimeMetrics {
 #[derive(Debug)]
 pub struct TimelineMetrics {
     tenant_id: String,
+    shard_id: String,
     timeline_id: String,
     pub flush_time_histo: StorageTimeMetrics,
     pub compact_time_histo: StorageTimeMetrics,
@@ -1509,11 +1648,12 @@ pub struct TimelineMetrics {
 impl TimelineMetrics {
     pub fn new(
-        tenant_id: &TenantId,
+        tenant_shard_id: &TenantShardId,
         timeline_id: &TimelineId,
         evictions_with_low_residence_duration_builder: EvictionsWithLowResidenceDurationBuilder,
     ) -> Self {
-        let tenant_id = tenant_id.to_string();
+        let tenant_id = tenant_shard_id.tenant_id.to_string();
+        let shard_id = format!("{}", tenant_shard_id.shard_slug());
         let timeline_id = timeline_id.to_string();
         let flush_time_histo =
             StorageTimeMetrics::new(StorageTimeOperation::LayerFlush, &tenant_id, &timeline_id);
@@ -1550,11 +1690,12 @@ impl TimelineMetrics {
         let evictions = EVICTIONS
             .get_metric_with_label_values(&[&tenant_id, &timeline_id])
             .unwrap();
-        let evictions_with_low_residence_duration =
-            evictions_with_low_residence_duration_builder.build(&tenant_id, &timeline_id);
+        let evictions_with_low_residence_duration = evictions_with_low_residence_duration_builder
+            .build(&tenant_id, &shard_id, &timeline_id);

         TimelineMetrics {
             tenant_id,
+            shard_id,
             timeline_id,
             flush_time_histo,
             compact_time_histo,
@@ -1600,6 +1741,7 @@ impl Drop for TimelineMetrics {
     fn drop(&mut self) {
         let tenant_id = &self.tenant_id;
         let timeline_id = &self.timeline_id;
+        let shard_id = &self.shard_id;
         let _ = LAST_RECORD_LSN.remove_label_values(&[tenant_id, timeline_id]);
         {
             RESIDENT_PHYSICAL_SIZE_GLOBAL.sub(self.resident_physical_size_get());
@@ -1613,7 +1755,7 @@ impl Drop for TimelineMetrics {
         self.evictions_with_low_residence_duration
             .write()
             .unwrap()
-            .remove(tenant_id, timeline_id);
+            .remove(tenant_id, shard_id, timeline_id);

         // The following metrics are born outside of the TimelineMetrics lifecycle but still
         // removed at the end of it. The idea is to have the metrics outlive the
@@ -2074,6 +2216,8 @@ pub fn preinitialize_metrics() {
     // Tenant manager stats
     Lazy::force(&TENANT_MANAGER);

+    Lazy::force(&crate::tenant::storage_layer::layer::LAYER_IMPL_METRICS);
+
     // countervecs
     [&BACKGROUND_LOOP_PERIOD_OVERRUN_COUNT]
         .into_iter()
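The `IntCounterPairVec` introduced above fuses the old started/finished counter pair into one registration: the difference of the two sides behaves like a gauge of in-flight semaphore waits, while each side remains a monotonic counter usable in `rate()` queries. A minimal std-only sketch of that pair pattern (this is not the `metrics` crate's API, just the idea behind it):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A pair of monotonic counters; `started - finished` is the number of
// in-flight operations, while each side stays usable for rate queries.
struct IntCounterPair {
    started: AtomicU64,
    finished: AtomicU64,
}

// RAII guard: incrementing `started` hands back a guard that increments
// `finished` on drop, so the pair can never leak an unmatched increment.
struct PairGuard<'a>(&'a IntCounterPair);

impl IntCounterPair {
    const fn new() -> Self {
        Self {
            started: AtomicU64::new(0),
            finished: AtomicU64::new(0),
        }
    }

    fn guard(&self) -> PairGuard<'_> {
        self.started.fetch_add(1, Ordering::Relaxed);
        PairGuard(self)
    }

    fn in_flight(&self) -> u64 {
        self.started.load(Ordering::Relaxed) - self.finished.load(Ordering::Relaxed)
    }
}

impl Drop for PairGuard<'_> {
    fn drop(&mut self) {
        self.0.finished.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let wait = IntCounterPair::new();
    assert_eq!(wait.in_flight(), 0);
    {
        let _g = wait.guard(); // e.g. while waiting on the background-loop semaphore
        assert_eq!(wait.in_flight(), 1);
    }
    assert_eq!(wait.in_flight(), 0);
    assert_eq!(wait.started.load(Ordering::Relaxed), 1);
}
```

The guard-on-drop shape is why a pair is safer than two independent statics: a panic or early return between the two increments can no longer leave the "gauge" permanently skewed.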

View File

@@ -28,7 +28,7 @@
 //! Page cache maps from a cache key to a buffer slot.
 //! The cache key uniquely identifies the piece of data that is being cached.
 //!
-//! The cache key for **materialized pages** is [`TenantId`], [`TimelineId`], [`Key`], and [`Lsn`].
+//! The cache key for **materialized pages** is [`TenantShardId`], [`TimelineId`], [`Key`], and [`Lsn`].
 //! Use [`PageCache::memorize_materialized_page`] and [`PageCache::lookup_materialized_page`] for fill & access.
 //!
 //! The cache key for **immutable file** pages is [`FileId`] and a block number.
@@ -83,12 +83,14 @@ use std::{
 use anyhow::Context;
 use once_cell::sync::OnceCell;
-use utils::{
-    id::{TenantId, TimelineId},
-    lsn::Lsn,
-};
+use pageserver_api::shard::TenantShardId;
+use utils::{id::TimelineId, lsn::Lsn};

-use crate::{context::RequestContext, metrics::PageCacheSizeMetrics, repository::Key};
+use crate::{
+    context::RequestContext,
+    metrics::{page_cache_eviction_metrics, PageCacheSizeMetrics},
+    repository::Key,
+};

 static PAGE_CACHE: OnceCell<PageCache> = OnceCell::new();
 const TEST_PAGE_CACHE_SIZE: usize = 50;
@@ -150,7 +152,13 @@ enum CacheKey {
 #[derive(Debug, PartialEq, Eq, Hash, Clone)]
 struct MaterializedPageHashKey {
-    tenant_id: TenantId,
+    /// Why is this TenantShardId rather than TenantId?
+    ///
+    /// Usually, the materialized value of a page@lsn is identical on any shard in the same tenant. However, this
+    /// this not the case for certain internally-generated pages (e.g. relation sizes). In future, we may make this
+    /// key smaller by omitting the shard, if we ensure that reads to such pages always skip the cache, or are
+    /// special-cased in some other way.
+    tenant_shard_id: TenantShardId,
     timeline_id: TimelineId,
     key: Key,
 }
@@ -374,7 +382,7 @@ impl PageCache {
     /// returned page.
     pub async fn lookup_materialized_page(
         &self,
-        tenant_id: TenantId,
+        tenant_shard_id: TenantShardId,
         timeline_id: TimelineId,
         key: &Key,
         lsn: Lsn,
@@ -391,7 +399,7 @@ impl PageCache {
         let mut cache_key = CacheKey::MaterializedPage {
             hash_key: MaterializedPageHashKey {
-                tenant_id,
+                tenant_shard_id,
                 timeline_id,
                 key: *key,
             },
@@ -432,7 +440,7 @@ impl PageCache {
     ///
     pub async fn memorize_materialized_page(
         &self,
-        tenant_id: TenantId,
+        tenant_shard_id: TenantShardId,
         timeline_id: TimelineId,
         key: Key,
         lsn: Lsn,
@@ -440,7 +448,7 @@ impl PageCache {
     ) -> anyhow::Result<()> {
         let cache_key = CacheKey::MaterializedPage {
             hash_key: MaterializedPageHashKey {
-                tenant_id,
+                tenant_shard_id,
                 timeline_id,
                 key,
             },
@@ -897,8 +905,10 @@ impl PageCache {
             // Note that just yielding to tokio during iteration without such
             // priority boosting is likely counter-productive. We'd just give more opportunities
             // for B to bump usage count, further starving A.
-            crate::metrics::page_cache_errors_inc(
-                crate::metrics::PageCacheErrorKind::EvictIterLimit,
+            page_cache_eviction_metrics::observe(
+                page_cache_eviction_metrics::Outcome::ItersExceeded {
+                    iters: iters.try_into().unwrap(),
+                },
             );
             anyhow::bail!("exceeded evict iter limit");
         }
@@ -909,8 +919,18 @@ impl PageCache {
                 // remove mapping for old buffer
                 self.remove_mapping(old_key);
                 inner.key = None;
+                page_cache_eviction_metrics::observe(
+                    page_cache_eviction_metrics::Outcome::FoundSlotEvicted {
+                        iters: iters.try_into().unwrap(),
+                    },
+                );
+            } else {
+                page_cache_eviction_metrics::observe(
+                    page_cache_eviction_metrics::Outcome::FoundSlotUnused {
+                        iters: iters.try_into().unwrap(),
+                    },
+                );
             }
-            crate::metrics::PAGE_CACHE_FIND_VICTIMS_ITERS_TOTAL.inc_by(iters as u64);
             return Ok((slot_idx, inner));
         }
     }
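The new eviction metrics distinguish three `find_victim` outcomes: the chosen slot was empty, the slot held a key that had to be evicted, or the iteration cap was hit. A compact, self-contained sketch of a clock-sweep victim search that classifies its result the same way (slot layout, constants, and the `find_victim` signature here are illustrative, not the pageserver's actual types):

```rust
use std::num::NonZeroUsize;

#[derive(Debug)]
enum Outcome {
    FoundSlotUnused { iters: NonZeroUsize },
    FoundSlotEvicted { iters: NonZeroUsize },
    ItersExceeded { iters: NonZeroUsize },
}

struct Slot {
    usage_count: u8,
    key: Option<u64>, // Some(..) means the slot holds cached data
}

// Clock sweep: decrement usage counts until a slot with count 0 is found,
// then classify whether it was unused or required an eviction.
fn find_victim(slots: &mut [Slot], max_iters: usize) -> (Option<usize>, Outcome) {
    let mut next = 0usize;
    for iter in 1..=max_iters {
        let idx = next % slots.len();
        next += 1;
        let slot = &mut slots[idx];
        if slot.usage_count == 0 {
            let iters = NonZeroUsize::new(iter).unwrap();
            let outcome = match slot.key.take() {
                Some(_old_key) => Outcome::FoundSlotEvicted { iters },
                None => Outcome::FoundSlotUnused { iters },
            };
            return (Some(idx), outcome);
        }
        slot.usage_count -= 1;
    }
    let iters = NonZeroUsize::new(max_iters).unwrap();
    (None, Outcome::ItersExceeded { iters })
}

fn main() {
    let mut slots = vec![
        Slot { usage_count: 2, key: Some(1) },
        Slot { usage_count: 0, key: None },
    ];
    let (idx, outcome) = find_victim(&mut slots, 10);
    assert_eq!(idx, Some(1));
    assert!(matches!(outcome, Outcome::FoundSlotUnused { .. }));
}
```

Carrying `iters` inside each outcome is what lets one labeled counter vector replace the old global `PAGE_CACHE_FIND_VICTIMS_ITERS_TOTAL`: the iteration cost can now be attributed per outcome.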

View File

@@ -53,21 +53,23 @@ use crate::context::{DownloadBehavior, RequestContext};
use crate::import_datadir::import_wal_from_tar; use crate::import_datadir::import_wal_from_tar;
use crate::metrics; use crate::metrics;
use crate::metrics::LIVE_CONNECTIONS_COUNT; use crate::metrics::LIVE_CONNECTIONS_COUNT;
use crate::pgdatadir_mapping::rel_block_to_key;
use crate::task_mgr; use crate::task_mgr;
use crate::task_mgr::TaskKind; use crate::task_mgr::TaskKind;
use crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id; use crate::tenant::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::mgr; use crate::tenant::mgr;
use crate::tenant::mgr::get_active_tenant_with_timeout; use crate::tenant::mgr::get_active_tenant_with_timeout;
use crate::tenant::mgr::GetActiveTenantError; use crate::tenant::mgr::GetActiveTenantError;
use crate::tenant::mgr::ShardSelector;
use crate::tenant::Timeline; use crate::tenant::Timeline;
use crate::trace::Tracer; use crate::trace::Tracer;
use postgres_ffi::pg_constants::DEFAULTTABLESPACE_OID; use postgres_ffi::pg_constants::DEFAULTTABLESPACE_OID;
use postgres_ffi::BLCKSZ; use postgres_ffi::BLCKSZ;
// How long we may block waiting for a [`TenantSlot::InProgress`]` and/or a [`Tenant`] which // How long we may wait for a [`TenantSlot::InProgress`]` and/or a [`Tenant`] which
// is not yet in state [`TenantState::Active`]. // is not yet in state [`TenantState::Active`].
const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(5000); const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(30000);
/// Read the end of a tar archive. /// Read the end of a tar archive.
/// ///
@@ -399,16 +401,19 @@ impl PageServerHandler {
{ {
debug_assert_current_span_has_tenant_and_timeline_id(); debug_assert_current_span_has_tenant_and_timeline_id();
// TODO(sharding): enumerate local tenant shards for this tenant, and select the one // Note that since one connection may contain getpage requests that target different
// that should serve this request. // shards (e.g. during splitting when the compute is not yet aware of the split), the tenant
// that we look up here may not be the one that serves all the actual requests: we will double
// Make request tracer if needed // check the mapping of key->shard later before calling into Timeline for getpage requests.
let tenant = mgr::get_active_tenant_with_timeout( let tenant = mgr::get_active_tenant_with_timeout(
tenant_id, tenant_id,
ShardSelector::First,
ACTIVE_TENANT_TIMEOUT, ACTIVE_TENANT_TIMEOUT,
&task_mgr::shutdown_token(), &task_mgr::shutdown_token(),
) )
.await?; .await?;
// Make request tracer if needed
let mut tracer = if tenant.get_trace_read_requests() { let mut tracer = if tenant.get_trace_read_requests() {
let connection_id = ConnectionId::generate(); let connection_id = ConnectionId::generate();
let path = let path =
@@ -566,6 +571,7 @@ impl PageServerHandler {
         info!("creating new timeline");
         let tenant = get_active_tenant_with_timeout(
             tenant_id,
+            ShardSelector::Zero,
             ACTIVE_TENANT_TIMEOUT,
             &task_mgr::shutdown_token(),
         )
@@ -628,7 +634,7 @@ impl PageServerHandler {
         debug_assert_current_span_has_tenant_and_timeline_id();
         let timeline = self
-            .get_active_tenant_timeline(tenant_id, timeline_id)
+            .get_active_tenant_timeline(tenant_id, timeline_id, ShardSelector::Zero)
             .await?;
         let last_record_lsn = timeline.get_last_record_lsn();
         if last_record_lsn != start_lsn {
@@ -807,9 +813,49 @@ impl PageServerHandler {
         }
         */

-        let page = timeline
-            .get_rel_page_at_lsn(req.rel, req.blkno, lsn, req.latest, ctx)
-            .await?;
+        let key = rel_block_to_key(req.rel, req.blkno);
+        let page = if timeline.get_shard_identity().is_key_local(&key) {
+            timeline
+                .get_rel_page_at_lsn(req.rel, req.blkno, lsn, req.latest, ctx)
+                .await?
+        } else {
+            // The Tenant shard we looked up at connection start does not hold this particular
+            // key: look for other shards in this tenant. This scenario occurs if a pageserver
+            // has multiple shards for the same tenant.
+            //
+            // TODO: optimize this (https://github.com/neondatabase/neon/pull/6037)
+            let timeline = match self
+                .get_active_tenant_timeline(
+                    timeline.tenant_shard_id.tenant_id,
+                    timeline.timeline_id,
+                    ShardSelector::Page(key),
+                )
+                .await
+            {
+                Ok(t) => t,
+                Err(GetActiveTimelineError::Tenant(GetActiveTenantError::NotFound(_))) => {
+                    // We already know this tenant exists in general, because we resolved it at
+                    // start of connection.  Getting a NotFound here indicates that the shard containing
+                    // the requested page is not present on this node.
+                    // TODO: this should be some kind of structured error that the client will understand,
+                    // so that it can block until its config is updated: this error is expected in the case
+                    // that the Tenant's shards' placements are being updated and the client hasn't been
+                    // informed yet.
+                    //
+                    // https://github.com/neondatabase/neon/issues/6038
+                    return Err(anyhow::anyhow!("Request routed to wrong shard"));
+                }
+                Err(e) => return Err(e.into()),
+            };
+
+            // Take a GateGuard for the duration of this request.  If we were using our main Timeline object,
+            // the GateGuard was already held over the whole connection.
+            let _timeline_guard = timeline.gate.enter().map_err(|_| QueryError::Shutdown)?;
+
+            timeline
+                .get_rel_page_at_lsn(req.rel, req.blkno, lsn, req.latest, ctx)
+                .await?
+        };

         Ok(PagestreamBeMessage::GetPage(PagestreamGetPageResponse {
             page,
@@ -838,7 +884,7 @@ impl PageServerHandler {
         // check that the timeline exists
         let timeline = self
-            .get_active_tenant_timeline(tenant_id, timeline_id)
+            .get_active_tenant_timeline(tenant_id, timeline_id, ShardSelector::Zero)
             .await?;
         let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
         if let Some(lsn) = lsn {
@@ -944,9 +990,11 @@ impl PageServerHandler {
         &self,
         tenant_id: TenantId,
         timeline_id: TimelineId,
+        selector: ShardSelector,
     ) -> Result<Arc<Timeline>, GetActiveTimelineError> {
         let tenant = get_active_tenant_with_timeout(
             tenant_id,
+            selector,
             ACTIVE_TENANT_TIMEOUT,
             &task_mgr::shutdown_token(),
         )
@@ -1120,7 +1168,7 @@ where
         self.check_permission(Some(tenant_id))?;

         let timeline = self
-            .get_active_tenant_timeline(tenant_id, timeline_id)
+            .get_active_tenant_timeline(tenant_id, timeline_id, ShardSelector::Zero)
             .await?;

         let end_of_timeline = timeline.get_last_record_rlsn();
@@ -1307,6 +1355,7 @@ where
                 let tenant = get_active_tenant_with_timeout(
                     tenant_id,
+                    ShardSelector::Zero,
                     ACTIVE_TENANT_TIMEOUT,
                     &task_mgr::shutdown_token(),
                 )

View File

@@ -13,6 +13,7 @@ use crate::repository::*;
 use crate::walrecord::NeonWalRecord;
 use anyhow::Context;
 use bytes::{Buf, Bytes};
+use pageserver_api::key::is_rel_block_key;
 use pageserver_api::reltag::{RelTag, SlruKind};
 use postgres_ffi::relfile_utils::{FSM_FORKNUM, VISIBILITYMAP_FORKNUM};
 use postgres_ffi::BLCKSZ;
@@ -282,6 +283,10 @@ impl Timeline {
     }

     /// Get a list of all existing relations in given tablespace and database.
+    ///
+    /// # Cancel-Safety
+    ///
+    /// This method is cancellation-safe.
     pub async fn list_rels(
         &self,
         spcnode: Oid,
@@ -630,6 +635,10 @@ impl Timeline {
     ///
     /// Only relation blocks are counted currently. That excludes metadata,
     /// SLRUs, twophase files etc.
+    ///
+    /// # Cancel-Safety
+    ///
+    /// This method is cancellation-safe.
     pub async fn get_current_logical_size_non_incremental(
         &self,
         lsn: Lsn,
@@ -813,10 +822,7 @@ impl<'a> DatadirModification<'a> {
         self.put(DBDIR_KEY, Value::Image(buf.into()));

         // Create AuxFilesDirectory
-        let buf = AuxFilesDirectory::ser(&AuxFilesDirectory {
-            files: HashMap::new(),
-        })?;
-        self.put(AUX_FILES_KEY, Value::Image(Bytes::from(buf)));
+        self.init_aux_dir()?;

         let buf = TwoPhaseDirectory::ser(&TwoPhaseDirectory {
             xids: HashSet::new(),
@@ -924,10 +930,7 @@ impl<'a> DatadirModification<'a> {
             self.put(DBDIR_KEY, Value::Image(buf.into()));

             // Create AuxFilesDirectory as well
-            let buf = AuxFilesDirectory::ser(&AuxFilesDirectory {
-                files: HashMap::new(),
-            })?;
-            self.put(AUX_FILES_KEY, Value::Image(Bytes::from(buf)));
+            self.init_aux_dir()?;
         }
         if r.is_none() {
             // Create RelDirectory
@@ -1252,6 +1255,14 @@ impl<'a> DatadirModification<'a> {
         Ok(())
     }

+    pub fn init_aux_dir(&mut self) -> anyhow::Result<()> {
+        let buf = AuxFilesDirectory::ser(&AuxFilesDirectory {
+            files: HashMap::new(),
+        })?;
+        self.put(AUX_FILES_KEY, Value::Image(Bytes::from(buf)));
+        Ok(())
+    }
+
     pub async fn put_file(
         &mut self,
         path: &str,
@@ -1314,7 +1325,7 @@ impl<'a> DatadirModification<'a> {
         // Flush relation and SLRU data blocks, keep metadata.
         let mut retained_pending_updates = HashMap::new();
         for (key, value) in self.pending_updates.drain() {
-            if is_rel_block_key(key) || is_slru_block_key(key) {
+            if is_rel_block_key(&key) || is_slru_block_key(key) {
                 // This bails out on first error without modifying pending_updates.
                 // That's Ok, cf this function's doc comment.
                 writer.put(key, self.lsn, &value, ctx).await?;
@@ -1359,6 +1370,10 @@ impl<'a> DatadirModification<'a> {
         Ok(())
     }

+    pub(crate) fn is_empty(&self) -> bool {
+        self.pending_updates.is_empty() && self.pending_deletions.is_empty()
+    }
+
     // Internal helper functions to batch the modifications

     async fn get(&self, key: Key, ctx: &RequestContext) -> Result<Bytes, PageReconstructError> {
@@ -1570,7 +1585,7 @@ fn rel_dir_to_key(spcnode: Oid, dbnode: Oid) -> Key {
     }
 }

-fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
+pub(crate) fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
     Key {
         field1: 0x00,
         field2: rel.spcnode,
@@ -1754,6 +1769,13 @@ const AUX_FILES_KEY: Key = Key {
 // Reverse mappings for a few Keys.
 // These are needed by WAL redo manager.

+// AUX_FILES currently stores only data for logical replication (slots etc), and
+// we don't preserve these on a branch because safekeepers can't follow timeline
+// switch (and generally it likely should be optional), so ignore these.
+pub fn is_inherited_key(key: Key) -> bool {
+    key != AUX_FILES_KEY
+}
+
 pub fn key_to_rel_block(key: Key) -> anyhow::Result<(RelTag, BlockNumber)> {
     Ok(match key.field1 {
         0x00 => (
@@ -1769,10 +1791,6 @@ pub fn key_to_rel_block(key: Key) -> anyhow::Result<(RelTag, BlockNumber)> {
     })
 }

-fn is_rel_block_key(key: Key) -> bool {
-    key.field1 == 0x00 && key.field4 != 0
-}
-
 pub fn is_rel_fsm_block_key(key: Key) -> bool {
     key.field1 == 0x00 && key.field4 != 0 && key.field5 == FSM_FORKNUM && key.field6 != 0xffffffff
 }

View File

@@ -42,6 +42,7 @@ use std::sync::atomic::{AtomicU64, Ordering};
 use std::sync::{Arc, Mutex};

 use futures::FutureExt;
+use pageserver_api::shard::TenantShardId;
 use tokio::runtime::Runtime;
 use tokio::task::JoinHandle;
 use tokio::task_local;
@@ -51,7 +52,7 @@ use tracing::{debug, error, info, warn};
 use once_cell::sync::Lazy;

-use utils::id::{TenantId, TimelineId};
+use utils::id::TimelineId;

 use crate::shutdown_pageserver;
@@ -257,6 +258,9 @@ pub enum TaskKind {
     /// See [`crate::disk_usage_eviction_task`].
     DiskUsageEviction,

+    /// See [`crate::tenant::secondary`].
+    SecondaryUploads,
+
     // Initial logical size calculation
     InitialLogicalSizeCalculation,
@@ -317,7 +321,7 @@ struct PageServerTask {
     /// Tasks may optionally be launched for a particular tenant/timeline, enabling
     /// later cancelling tasks for that tenant/timeline in [`shutdown_tasks`]
-    tenant_id: Option<TenantId>,
+    tenant_shard_id: Option<TenantShardId>,
     timeline_id: Option<TimelineId>,

     mutable: Mutex<MutableTaskState>,
@@ -329,7 +333,7 @@ struct PageServerTask {
 pub fn spawn<F>(
     runtime: &tokio::runtime::Handle,
     kind: TaskKind,
-    tenant_id: Option<TenantId>,
+    tenant_shard_id: Option<TenantShardId>,
     timeline_id: Option<TimelineId>,
     name: &str,
     shutdown_process_on_error: bool,
@@ -345,7 +349,7 @@ where
         kind,
         name: name.to_string(),
         cancel: cancel.clone(),
-        tenant_id,
+        tenant_shard_id,
         timeline_id,
         mutable: Mutex::new(MutableTaskState { join_handle: None }),
     });
@@ -424,28 +428,28 @@ async fn task_finish(
         Ok(Err(err)) => {
             if shutdown_process_on_error {
                 error!(
-                    "Shutting down: task '{}' tenant_id: {:?}, timeline_id: {:?} exited with error: {:?}",
-                    task_name, task.tenant_id, task.timeline_id, err
+                    "Shutting down: task '{}' tenant_shard_id: {:?}, timeline_id: {:?} exited with error: {:?}",
+                    task_name, task.tenant_shard_id, task.timeline_id, err
                 );
                 shutdown_process = true;
             } else {
                 error!(
-                    "Task '{}' tenant_id: {:?}, timeline_id: {:?} exited with error: {:?}",
-                    task_name, task.tenant_id, task.timeline_id, err
+                    "Task '{}' tenant_shard_id: {:?}, timeline_id: {:?} exited with error: {:?}",
+                    task_name, task.tenant_shard_id, task.timeline_id, err
                 );
             }
         }
         Err(err) => {
             if shutdown_process_on_error {
                 error!(
-                    "Shutting down: task '{}' tenant_id: {:?}, timeline_id: {:?} panicked: {:?}",
-                    task_name, task.tenant_id, task.timeline_id, err
+                    "Shutting down: task '{}' tenant_shard_id: {:?}, timeline_id: {:?} panicked: {:?}",
+                    task_name, task.tenant_shard_id, task.timeline_id, err
                 );
                 shutdown_process = true;
             } else {
                 error!(
-                    "Task '{}' tenant_id: {:?}, timeline_id: {:?} panicked: {:?}",
-                    task_name, task.tenant_id, task.timeline_id, err
+                    "Task '{}' tenant_shard_id: {:?}, timeline_id: {:?} panicked: {:?}",
+                    task_name, task.tenant_shard_id, task.timeline_id, err
                 );
             }
         }
@@ -467,11 +471,11 @@ async fn task_finish(
 ///
 /// Or to shut down all tasks for given timeline:
 ///
-///   shutdown_tasks(None, Some(tenant_id), Some(timeline_id))
+///   shutdown_tasks(None, Some(tenant_shard_id), Some(timeline_id))
 ///
 pub async fn shutdown_tasks(
     kind: Option<TaskKind>,
-    tenant_id: Option<TenantId>,
+    tenant_shard_id: Option<TenantShardId>,
     timeline_id: Option<TimelineId>,
 ) {
     let mut victim_tasks = Vec::new();
@@ -480,35 +484,35 @@ pub async fn shutdown_tasks(
         let tasks = TASKS.lock().unwrap();
         for task in tasks.values() {
             if (kind.is_none() || Some(task.kind) == kind)
-                && (tenant_id.is_none() || task.tenant_id == tenant_id)
+                && (tenant_shard_id.is_none() || task.tenant_shard_id == tenant_shard_id)
                 && (timeline_id.is_none() || task.timeline_id == timeline_id)
             {
                 task.cancel.cancel();
                 victim_tasks.push((
                     Arc::clone(task),
                     task.kind,
-                    task.tenant_id,
+                    task.tenant_shard_id,
                     task.timeline_id,
                 ));
             }
         }
     }

-    let log_all = kind.is_none() && tenant_id.is_none() && timeline_id.is_none();
-    for (task, task_kind, tenant_id, timeline_id) in victim_tasks {
+    let log_all = kind.is_none() && tenant_shard_id.is_none() && timeline_id.is_none();
+    for (task, task_kind, tenant_shard_id, timeline_id) in victim_tasks {
         let join_handle = {
             let mut task_mut = task.mutable.lock().unwrap();
             task_mut.join_handle.take()
         };
         if let Some(mut join_handle) = join_handle {
             if log_all {
-                if tenant_id.is_none() {
+                if tenant_shard_id.is_none() {
                     // there are quite few of these
                     info!(name = task.name, kind = ?task_kind, "stopping global task");
                 } else {
                     // warn to catch these in tests; there shouldn't be any
-                    warn!(name = task.name, tenant_id = ?tenant_id, timeline_id = ?timeline_id, kind = ?task_kind, "stopping left-over");
+                    warn!(name = task.name, tenant_shard_id = ?tenant_shard_id, timeline_id = ?timeline_id, kind = ?task_kind, "stopping left-over");
                 }
             }
             if tokio::time::timeout(std::time::Duration::from_secs(1), &mut join_handle)
@@ -517,12 +521,13 @@ pub async fn shutdown_tasks(
             {
                 // allow some time to elapse before logging to cut down the number of log
                 // lines.
-                info!("waiting for {} to shut down", task.name);
+                info!("waiting for task {} to shut down", task.name);
                 // we never handled this return value, but:
                 // - we don't deschedule which would lead to is_cancelled
                 // - panics are already logged (is_panicked)
                 // - task errors are already logged in the wrapper
                 let _ = join_handle.await;
+                info!("task {} completed", task.name);
             }
         } else {
             // Possibly one of:
@@ -556,9 +561,14 @@ pub async fn shutdown_watcher() {
 /// cancelled. It can however be moved to other tasks, such as `tokio::task::spawn_blocking` or
 /// `tokio::task::JoinSet::spawn`.
 pub fn shutdown_token() -> CancellationToken {
-    SHUTDOWN_TOKEN
-        .try_with(|t| t.clone())
-        .expect("shutdown_token() called in an unexpected task or thread")
+    let res = SHUTDOWN_TOKEN.try_with(|t| t.clone());
+
+    if cfg!(test) {
+        // in tests this method is called from non-taskmgr spawned tasks, and that is all ok.
+        res.unwrap_or_default()
+    } else {
+        res.expect("shutdown_token() called in an unexpected task or thread")
+    }
 }

 /// Has the current task been requested to shut down?

View File

@@ -12,13 +12,13 @@
 //!

 use anyhow::{bail, Context};
-use bytes::Bytes;
 use camino::{Utf8Path, Utf8PathBuf};
 use enumset::EnumSet;
 use futures::stream::FuturesUnordered;
 use futures::FutureExt;
 use futures::StreamExt;
 use pageserver_api::models::TimelineState;
+use pageserver_api::shard::ShardIdentity;
 use pageserver_api::shard::TenantShardId;
 use remote_storage::DownloadError;
 use remote_storage::GenericRemoteStorage;
@@ -48,6 +48,7 @@ use self::mgr::GetActiveTenantError;
 use self::mgr::GetTenantError;
 use self::mgr::TenantsMap;
 use self::remote_timeline_client::RemoteTimelineClient;
+use self::timeline::uninit::TimelineExclusionError;
 use self::timeline::uninit::TimelineUninitMark;
 use self::timeline::uninit::UninitializedTimeline;
 use self::timeline::EvictionTaskTenantState;
@@ -68,6 +69,7 @@ use crate::tenant::config::TenantConfOpt;
 use crate::tenant::metadata::load_metadata;
 pub use crate::tenant::remote_timeline_client::index::IndexPart;
 use crate::tenant::remote_timeline_client::MaybeDeletedIndexPart;
+use crate::tenant::remote_timeline_client::INITDB_PATH;
 use crate::tenant::storage_layer::DeltaLayer;
 use crate::tenant::storage_layer::ImageLayer;
 use crate::InitializationOrder;
@@ -86,7 +88,6 @@ use std::process::Stdio;
 use std::sync::atomic::AtomicU64;
 use std::sync::atomic::Ordering;
 use std::sync::Arc;
-use std::sync::MutexGuard;
 use std::sync::{Mutex, RwLock};
 use std::time::{Duration, Instant};
@@ -143,6 +144,7 @@ pub mod storage_layer;
 pub mod config;
 pub mod delete;
 pub mod mgr;
+pub mod secondary;
 pub mod tasks;
 pub mod upload_queue;
@@ -236,6 +238,9 @@ pub struct Tenant {
     tenant_shard_id: TenantShardId,

+    // The detailed sharding information, beyond the number/count in tenant_shard_id
+    shard_identity: ShardIdentity,
+
     /// The remote storage generation, used to protect S3 objects from split-brain.
     /// Does not change over the lifetime of the [`Tenant`] object.
     ///
@@ -244,6 +249,12 @@ pub struct Tenant {
     generation: Generation,

     timelines: Mutex<HashMap<TimelineId, Arc<Timeline>>>,
+
+    /// During timeline creation, we first insert the TimelineId to the
+    /// creating map, then `timelines`, then remove it from the creating map.
+    /// **Lock order**: if acquiring both, acquire `timelines` before `timelines_creating`
+    timelines_creating: std::sync::Mutex<HashSet<TimelineId>>,
+
     // This mutex prevents creation of new timelines during GC.
     // Adding yet another mutex (in addition to `timelines`) is needed because holding
     // `timelines` mutex during all GC iteration
@@ -312,6 +323,9 @@ impl WalRedoManager {
         }
     }

+    /// # Cancel-Safety
+    ///
+    /// This method is cancellation-safe.
     pub async fn request_redo(
         &self,
         key: crate::repository::Key,
@@ -399,8 +413,10 @@ impl Debug for SetStoppingError {
 #[derive(thiserror::Error, Debug)]
 pub enum CreateTimelineError {
-    #[error("a timeline with the given ID already exists")]
-    AlreadyExists,
+    #[error("creation of timeline with the given ID is in progress")]
+    AlreadyCreating,
+    #[error("timeline already exists with different parameters")]
+    Conflict,
     #[error(transparent)]
     AncestorLsn(anyhow::Error),
     #[error("ancestor timeline is not active")]
@@ -469,7 +485,6 @@ impl Tenant {
         index_part: Option<IndexPart>,
         metadata: TimelineMetadata,
         ancestor: Option<Arc<Timeline>>,
-        init_order: Option<&InitializationOrder>,
         _ctx: &RequestContext,
     ) -> anyhow::Result<()> {
         let tenant_id = self.tenant_shard_id;
@@ -479,7 +494,6 @@ impl Tenant {
             &metadata,
             ancestor.clone(),
             resources,
-            init_order,
             CreateTimelineCause::Load,
         )?;
         let disk_consistent_lsn = timeline.get_disk_consistent_lsn();
@@ -567,6 +581,7 @@ impl Tenant {
         tenant_shard_id: TenantShardId,
         resources: TenantSharedResources,
         attached_conf: AttachedTenantConf,
+        shard_identity: ShardIdentity,
         init_order: Option<InitializationOrder>,
         tenants: &'static std::sync::RwLock<TenantsMap>,
         mode: SpawnMode,
@@ -588,6 +603,7 @@ impl Tenant {
             TenantState::Attaching,
             conf,
             attached_conf,
+            shard_identity,
             wal_redo_manager,
             tenant_shard_id,
             remote_storage.clone(),
@@ -601,7 +617,7 @@ impl Tenant {
         task_mgr::spawn(
             &tokio::runtime::Handle::current(),
             TaskKind::Attach,
-            Some(tenant_shard_id.tenant_id),
+            Some(tenant_shard_id),
             None,
             "attach tenant",
             false,
@@ -680,10 +696,6 @@ impl Tenant {
                     // as we are no longer loading, signal completion by dropping
                     // the completion while we resume deletion
                     drop(_completion);
-                    // do not hold to initial_logical_size_attempt as it will prevent loading from proceeding without timeout
-                    let _ = init_order
-                        .as_mut()
-                        .and_then(|x| x.initial_logical_size_attempt.take());
                     let background_jobs_can_start =
                         init_order.as_ref().map(|x| &x.background_jobs_can_start);
                     if let Some(background) = background_jobs_can_start {
@@ -697,7 +709,6 @@ impl Tenant {
                         &tenant_clone,
                         preload,
                         tenants,
-                        init_order,
                         &ctx,
                     )
                     .await
@@ -710,7 +721,7 @@ impl Tenant {
                 }
             }

-            match tenant_clone.attach(init_order, preload, &ctx).await {
+            match tenant_clone.attach(preload, &ctx).await {
                 Ok(()) => {
                     info!("attach finished, activating");
                     tenant_clone.activate(broker_client, None, &ctx);
@@ -773,7 +784,6 @@ impl Tenant {
     ///
     async fn attach(
         self: &Arc<Tenant>,
-        init_order: Option<InitializationOrder>,
         preload: Option<TenantPreload>,
         ctx: &RequestContext,
     ) -> anyhow::Result<()> {
@@ -786,7 +796,7 @@ impl Tenant {
             None => {
                 // Deprecated dev mode: load from local disk state instead of remote storage
                 // https://github.com/neondatabase/neon/issues/5624
-                return self.load_local(init_order, ctx).await;
+                return self.load_local(ctx).await;
             }
         };
@@ -881,7 +891,6 @@ impl Tenant {
                     &index_part.metadata,
                     Some(remote_timeline_client),
                     self.deletion_queue_client.clone(),
-                    None,
                 )
                 .await
                 .context("resume_deletion")
@@ -1006,10 +1015,6 @@ impl Tenant {
             None
         };

-        // we can load remote timelines during init, but they are assumed to be so rare that
-        // initialization order is not passed to here.
-        let init_order = None;
-
         // timeline loading after attach expects to find metadata file for each metadata
         save_metadata(
             self.conf,
@@ -1027,7 +1032,6 @@ impl Tenant {
             Some(index_part),
             remote_metadata,
             ancestor,
-            init_order,
             ctx,
         )
         .await
@@ -1051,6 +1055,9 @@ impl Tenant {
             },
             conf,
             AttachedTenantConf::try_from(LocationConf::default()).unwrap(),
+            // Shard identity isn't meaningful for a broken tenant: it's just a placeholder
+            // to occupy the slot for this TenantShardId.
+            ShardIdentity::broken(tenant_shard_id.shard_number, tenant_shard_id.shard_count),
             wal_redo_manager,
             tenant_shard_id,
             None,
@@ -1269,11 +1276,7 @@ impl Tenant {
     /// files on disk. Used at pageserver startup.
     ///
     /// No background tasks are started as part of this routine.
-    async fn load_local(
-        self: &Arc<Tenant>,
-        init_order: Option<InitializationOrder>,
-        ctx: &RequestContext,
-    ) -> anyhow::Result<()> {
+    async fn load_local(self: &Arc<Tenant>, ctx: &RequestContext) -> anyhow::Result<()> {
         span::debug_assert_current_span_has_tenant_id();

         debug!("loading tenant task");
@@ -1299,7 +1302,7 @@ impl Tenant {
         // Process loadable timelines first
         for (timeline_id, local_metadata) in scan.sorted_timelines_to_load {
             if let Err(e) = self
-                .load_local_timeline(timeline_id, local_metadata, init_order.as_ref(), ctx, false)
+                .load_local_timeline(timeline_id, local_metadata, ctx, false)
                 .await
             {
                 match e {
@@ -1333,13 +1336,7 @@ impl Tenant {
                 }
                 Some(local_metadata) => {
                     if let Err(e) = self
-                        .load_local_timeline(
-                            timeline_id,
-                            local_metadata,
-                            init_order.as_ref(),
-                            ctx,
-                            true,
-                        )
+                        .load_local_timeline(timeline_id, local_metadata, ctx, true)
                         .await
                     {
                         match e {
@@ -1367,12 +1364,11 @@ impl Tenant {
     /// Subroutine of `load_tenant`, to load an individual timeline
     ///
     /// NB: The parent is assumed to be already loaded!
-    #[instrument(skip(self, local_metadata, init_order, ctx))]
+    #[instrument(skip(self, local_metadata, ctx))]
     async fn load_local_timeline(
         self: &Arc<Self>,
         timeline_id: TimelineId,
         local_metadata: TimelineMetadata,
-        init_order: Option<&InitializationOrder>,
         ctx: &RequestContext,
         found_delete_mark: bool,
     ) -> Result<(), LoadLocalTimelineError> {
@@ -1389,7 +1385,6 @@ impl Tenant {
                 &local_metadata,
                 None,
                 self.deletion_queue_client.clone(),
-                init_order,
             )
             .await
             .context("resume deletion")
@@ -1406,17 +1401,9 @@ impl Tenant {
             None
         };

-        self.timeline_init_and_sync(
-            timeline_id,
-            resources,
-            None,
-            local_metadata,
-            ancestor,
-            init_order,
-            ctx,
-        )
-        .await
-        .map_err(LoadLocalTimelineError::Load)
+        self.timeline_init_and_sync(timeline_id, resources, None, local_metadata, ancestor, ctx)
+            .await
+            .map_err(LoadLocalTimelineError::Load)
     }

     pub(crate) fn tenant_id(&self) -> TenantId {
@@ -1479,7 +1466,7 @@ impl Tenant {
/// For tests, use `DatadirModification::init_empty_test_timeline` + `commit` to setup the /// For tests, use `DatadirModification::init_empty_test_timeline` + `commit` to setup the
/// minimum amount of keys required to get a writable timeline. /// minimum amount of keys required to get a writable timeline.
/// (Without it, `put` might fail due to `repartition` failing.) /// (Without it, `put` might fail due to `repartition` failing.)
pub async fn create_empty_timeline( pub(crate) async fn create_empty_timeline(
&self, &self,
new_timeline_id: TimelineId, new_timeline_id: TimelineId,
initdb_lsn: Lsn, initdb_lsn: Lsn,
@@ -1491,10 +1478,7 @@ impl Tenant {
"Cannot create empty timelines on inactive tenant" "Cannot create empty timelines on inactive tenant"
); );
let timeline_uninit_mark = { let timeline_uninit_mark = self.create_timeline_uninit_mark(new_timeline_id)?;
let timelines = self.timelines.lock().unwrap();
self.create_timeline_uninit_mark(new_timeline_id, &timelines)?
};
let new_metadata = TimelineMetadata::new( let new_metadata = TimelineMetadata::new(
// Initialize disk_consistent LSN to 0, The caller must import some data to // Initialize disk_consistent LSN to 0, The caller must import some data to
// make it valid, before calling finish_creation() // make it valid, before calling finish_creation()
@@ -1571,7 +1555,7 @@ impl Tenant {
     /// If the caller specified the timeline ID to use (`new_timeline_id`), and timeline with
     /// the same timeline ID already exists, returns CreateTimelineError::AlreadyExists.
     #[allow(clippy::too_many_arguments)]
-    pub async fn create_timeline(
+    pub(crate) async fn create_timeline(
         &self,
         new_timeline_id: TimelineId,
         ancestor_timeline_id: Option<TimelineId>,
@@ -1592,26 +1576,51 @@ impl Tenant {
             .enter()
             .map_err(|_| CreateTimelineError::ShuttingDown)?;
 
-        if let Ok(existing) = self.get_timeline(new_timeline_id, false) {
-            debug!("timeline {new_timeline_id} already exists");
-
-            if let Some(remote_client) = existing.remote_client.as_ref() {
-                // Wait for uploads to complete, so that when we return Ok, the timeline
-                // is known to be durable on remote storage. Just like we do at the end of
-                // this function, after we have created the timeline ourselves.
-                //
-                // We only really care that the initial version of `index_part.json` has
-                // been uploaded. That's enough to remember that the timeline
-                // exists. However, there is no function to wait specifically for that so
-                // we just wait for all in-progress uploads to finish.
-                remote_client
-                    .wait_completion()
-                    .await
-                    .context("wait for timeline uploads to complete")?;
-            }
-
-            return Err(CreateTimelineError::AlreadyExists);
-        }
+        // Get exclusive access to the timeline ID: this ensures that it does not already exist,
+        // and that no other creation attempts will be allowed while we are working. The
+        // uninit_mark is a guard.
+        let uninit_mark = match self.create_timeline_uninit_mark(new_timeline_id) {
+            Ok(m) => m,
+            Err(TimelineExclusionError::AlreadyCreating) => {
+                // Creation is in progress, we cannot create it again, and we cannot
+                // check if this request matches the existing one, so caller must try
+                // again later.
+                return Err(CreateTimelineError::AlreadyCreating);
+            }
+            Err(TimelineExclusionError::Other(e)) => {
+                return Err(CreateTimelineError::Other(e));
+            }
+            Err(TimelineExclusionError::AlreadyExists(existing)) => {
+                debug!("timeline {new_timeline_id} already exists");
+
+                // Idempotency: creating the same timeline twice is not an error, unless
+                // the second creation has different parameters.
+                if existing.get_ancestor_timeline_id() != ancestor_timeline_id
+                    || existing.pg_version != pg_version
+                    || (ancestor_start_lsn.is_some()
+                        && ancestor_start_lsn != Some(existing.get_ancestor_lsn()))
+                {
+                    return Err(CreateTimelineError::Conflict);
+                }
+
+                if let Some(remote_client) = existing.remote_client.as_ref() {
+                    // Wait for uploads to complete, so that when we return Ok, the timeline
+                    // is known to be durable on remote storage. Just like we do at the end of
+                    // this function, after we have created the timeline ourselves.
+                    //
+                    // We only really care that the initial version of `index_part.json` has
+                    // been uploaded. That's enough to remember that the timeline
+                    // exists. However, there is no function to wait specifically for that so
+                    // we just wait for all in-progress uploads to finish.
+                    remote_client
+                        .wait_completion()
+                        .await
+                        .context("wait for timeline uploads to complete")?;
+                }
+
+                return Ok(existing);
+            }
+        };
         let loaded_timeline = match ancestor_timeline_id {
             Some(ancestor_timeline_id) => {
@@ -1648,18 +1657,32 @@ impl Tenant {
                     ancestor_timeline.wait_lsn(*lsn, ctx).await?;
                 }
-                self.branch_timeline(&ancestor_timeline, new_timeline_id, ancestor_start_lsn, ctx)
-                    .await?
+                self.branch_timeline(
+                    &ancestor_timeline,
+                    new_timeline_id,
+                    ancestor_start_lsn,
+                    uninit_mark,
+                    ctx,
+                )
+                .await?
             }
             None => {
-                self.bootstrap_timeline(new_timeline_id, pg_version, load_existing_initdb, ctx)
-                    .await?
+                self.bootstrap_timeline(
+                    new_timeline_id,
+                    pg_version,
+                    load_existing_initdb,
+                    uninit_mark,
+                    ctx,
+                )
+                .await?
             }
         };
 
+        // At this point we have dropped our guard on [`Self::timelines_creating`], and
+        // the timeline is visible in [`Self::timelines`], but it is _not_ durable yet. We must
+        // not send a success to the caller until it is. The same applies to handling retries,
+        // see the handling of [`TimelineExclusionError::AlreadyExists`] above.
         if let Some(remote_client) = loaded_timeline.remote_client.as_ref() {
-            // Wait for the upload of the 'index_part.json` file to finish, so that when we return
-            // Ok, the timeline is durable in remote storage.
             let kind = ancestor_timeline_id
                 .map(|_| "branched")
                 .unwrap_or("bootstrapped");
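The retry handling introduced in this hunk treats a repeated creation request as idempotent only when its parameters match the existing timeline. A standalone, hypothetical sketch of that comparison (the names `CreateParams`, `CreateOutcome`, and `check_retry` are illustrative stand-ins, not pageserver APIs; integer stand-ins replace `TimelineId` and `Lsn`):

```rust
// Hypothetical mirror of the idempotency check in `create_timeline`:
// a retried creation succeeds only if it asks for the same timeline shape.
#[derive(Clone, PartialEq, Eq, Debug)]
struct CreateParams {
    ancestor: Option<u64>,           // stand-in for ancestor TimelineId
    pg_version: u32,
    ancestor_start_lsn: Option<u64>, // stand-in for Lsn
}

#[derive(Debug, PartialEq)]
enum CreateOutcome {
    Idempotent, // same parameters: return the existing timeline
    Conflict,   // same timeline ID, different parameters
}

fn check_retry(existing: &CreateParams, request: &CreateParams) -> CreateOutcome {
    // An unspecified start LSN on the retry matches whatever the original used,
    // mirroring the `ancestor_start_lsn.is_some() && ...` condition in the diff.
    if existing.ancestor != request.ancestor
        || existing.pg_version != request.pg_version
        || (request.ancestor_start_lsn.is_some()
            && request.ancestor_start_lsn != existing.ancestor_start_lsn)
    {
        CreateOutcome::Conflict
    } else {
        CreateOutcome::Idempotent
    }
}

fn main() {
    let existing = CreateParams { ancestor: Some(1), pg_version: 15, ancestor_start_lsn: Some(100) };
    // A retry that omits the start LSN still matches the original request.
    let retry = CreateParams { ancestor_start_lsn: None, ..existing.clone() };
    assert_eq!(check_retry(&existing, &retry), CreateOutcome::Idempotent);
    // Changing any parameter turns the retry into a conflict.
    let different = CreateParams { pg_version: 16, ..existing.clone() };
    assert_eq!(check_retry(&existing, &different), CreateOutcome::Conflict);
}
```

This is why the `AlreadyExists` arm can return `Ok(existing)`: a client that timed out and retried gets the same success it would have gotten the first time.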
@@ -1939,7 +1962,7 @@ impl Tenant {
         //
         // this will additionally shutdown and await all timeline tasks.
         tracing::debug!("Waiting for tasks...");
-        task_mgr::shutdown_tasks(None, Some(self.tenant_shard_id.tenant_id), None).await;
+        task_mgr::shutdown_tasks(None, Some(self.tenant_shard_id), None).await;
 
         // Wait for any in-flight operations to complete
         self.gate.close().await;
@@ -2136,6 +2159,14 @@ impl Tenant {
             .attach_mode
             .clone()
     }
+
+    pub(crate) fn get_tenant_shard_id(&self) -> &TenantShardId {
+        &self.tenant_shard_id
+    }
+
+    pub(crate) fn get_generation(&self) -> Generation {
+        self.generation
+    }
 }
 
 /// Given a Vec of timelines and their ancestors (timeline_id, ancestor_id),
@@ -2274,6 +2305,18 @@ impl Tenant {
             .or(self.conf.default_tenant_conf.min_resident_size_override)
     }
 
+    pub fn get_heatmap_period(&self) -> Option<Duration> {
+        let tenant_conf = self.tenant_conf.read().unwrap().tenant_conf;
+        let heatmap_period = tenant_conf
+            .heatmap_period
+            .unwrap_or(self.conf.default_tenant_conf.heatmap_period);
+        if heatmap_period.is_zero() {
+            None
+        } else {
+            Some(heatmap_period)
+        }
+    }
+
     pub fn set_new_tenant_config(&self, new_tenant_conf: TenantConfOpt) {
         self.tenant_conf.write().unwrap().tenant_conf = new_tenant_conf;
         // Don't hold self.timelines.lock() during the notifies.
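The new `get_heatmap_period` accessor uses a zero `Duration` as the sentinel for "heatmap uploads disabled", converting it to an `Option` so callers never have to remember the sentinel. A hypothetical standalone version of that merge-and-normalize logic (`effective_heatmap_period` is an illustrative name, not a pageserver API):

```rust
use std::time::Duration;

// Per-tenant override wins over the pageserver-wide default; a resulting
// zero duration means "heatmap uploads disabled" and becomes None.
fn effective_heatmap_period(per_tenant: Option<Duration>, default: Duration) -> Option<Duration> {
    let period = per_tenant.unwrap_or(default);
    if period.is_zero() {
        None
    } else {
        Some(period)
    }
}

fn main() {
    // The default in this PR is Duration::ZERO, i.e. disabled unless a tenant opts in.
    assert_eq!(effective_heatmap_period(None, Duration::ZERO), None);
    assert_eq!(
        effective_heatmap_period(Some(Duration::from_secs(300)), Duration::ZERO),
        Some(Duration::from_secs(300))
    );
    // An explicit zero override disables uploads even with a non-zero default.
    assert_eq!(
        effective_heatmap_period(Some(Duration::ZERO), Duration::from_secs(300)),
        None
    );
}
```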
@@ -2311,7 +2354,6 @@ impl Tenant {
         new_metadata: &TimelineMetadata,
         ancestor: Option<Arc<Timeline>>,
         resources: TimelineResources,
-        init_order: Option<&InitializationOrder>,
         cause: CreateTimelineCause,
     ) -> anyhow::Result<Arc<Timeline>> {
         let state = match cause {
@@ -2326,9 +2368,6 @@ impl Tenant {
             CreateTimelineCause::Delete => TimelineState::Stopping,
         };
 
-        let initial_logical_size_can_start = init_order.map(|x| &x.initial_logical_size_can_start);
-        let initial_logical_size_attempt = init_order.map(|x| &x.initial_logical_size_attempt);
-
         let pg_version = new_metadata.pg_version();
 
         let timeline = Timeline::new(
@@ -2339,11 +2378,10 @@ impl Tenant {
             new_timeline_id,
             self.tenant_shard_id,
             self.generation,
+            self.shard_identity,
             Arc::clone(&self.walredo_mgr),
             resources,
             pg_version,
-            initial_logical_size_can_start.cloned(),
-            initial_logical_size_attempt.cloned().flatten(),
             state,
             self.cancel.child_token(),
         );
@@ -2358,6 +2396,7 @@ impl Tenant {
         state: TenantState,
         conf: &'static PageServerConf,
         attached_conf: AttachedTenantConf,
+        shard_identity: ShardIdentity,
         walredo_mgr: Arc<WalRedoManager>,
         tenant_shard_id: TenantShardId,
         remote_storage: Option<GenericRemoteStorage>,
@@ -2419,6 +2458,7 @@ impl Tenant {
         Tenant {
             tenant_shard_id,
+            shard_identity,
             generation: attached_conf.location.generation,
             conf,
             // using now here is good enough approximation to catch tenants with really long
@@ -2426,6 +2466,7 @@ impl Tenant {
             loading_started_at: Instant::now(),
             tenant_conf: Arc::new(RwLock::new(attached_conf)),
             timelines: Mutex::new(HashMap::new()),
+            timelines_creating: Mutex::new(HashSet::new()),
             gc_cs: tokio::sync::Mutex::new(()),
             walredo_mgr,
             remote_storage,
@@ -2540,7 +2581,7 @@ impl Tenant {
             }
         }
 
-        info!("persisting tenantconf to {config_path}");
+        debug!("persisting tenantconf to {config_path}");
 
         let mut conf_content = r#"# This file contains a specific per-tenant's config.
 #  It is read in case of pageserver restart.
@@ -2575,7 +2616,7 @@ impl Tenant {
         target_config_path: &Utf8Path,
         tenant_conf: &TenantConfOpt,
     ) -> anyhow::Result<()> {
-        info!("persisting tenantconf to {target_config_path}");
+        debug!("persisting tenantconf to {target_config_path}");
 
         let mut conf_content = r#"# This file contains a specific per-tenant's config.
 #  It is read in case of pageserver restart.
@@ -2817,8 +2858,9 @@ impl Tenant {
         start_lsn: Option<Lsn>,
         ctx: &RequestContext,
     ) -> Result<Arc<Timeline>, CreateTimelineError> {
+        let uninit_mark = self.create_timeline_uninit_mark(dst_id).unwrap();
         let tl = self
-            .branch_timeline_impl(src_timeline, dst_id, start_lsn, ctx)
+            .branch_timeline_impl(src_timeline, dst_id, start_lsn, uninit_mark, ctx)
             .await?;
         tl.set_state(TimelineState::Active);
         Ok(tl)
@@ -2832,9 +2874,10 @@ impl Tenant {
         src_timeline: &Arc<Timeline>,
         dst_id: TimelineId,
         start_lsn: Option<Lsn>,
+        timeline_uninit_mark: TimelineUninitMark<'_>,
         ctx: &RequestContext,
     ) -> Result<Arc<Timeline>, CreateTimelineError> {
-        self.branch_timeline_impl(src_timeline, dst_id, start_lsn, ctx)
+        self.branch_timeline_impl(src_timeline, dst_id, start_lsn, timeline_uninit_mark, ctx)
             .await
     }
@@ -2843,13 +2886,14 @@ impl Tenant {
         src_timeline: &Arc<Timeline>,
         dst_id: TimelineId,
         start_lsn: Option<Lsn>,
+        timeline_uninit_mark: TimelineUninitMark<'_>,
         _ctx: &RequestContext,
     ) -> Result<Arc<Timeline>, CreateTimelineError> {
         let src_id = src_timeline.timeline_id;
 
-        // First acquire the GC lock so that another task cannot advance the GC
-        // cutoff in 'gc_info', and make 'start_lsn' invalid, while we are
-        // creating the branch.
+        // We will validate our ancestor LSN in this function. Acquire the GC lock so that
+        // this check cannot race with GC, and the ancestor LSN is guaranteed to remain
+        // valid while we are creating the branch.
         let _gc_cs = self.gc_cs.lock().await;
 
         // If no start LSN is specified, we branch the new timeline from the source timeline's last record LSN
@@ -2859,13 +2903,6 @@ impl Tenant {
             lsn
         });
 
-        // Create a placeholder for the new branch. This will error
-        // out if the new timeline ID is already in use.
-        let timeline_uninit_mark = {
-            let timelines = self.timelines.lock().unwrap();
-            self.create_timeline_uninit_mark(dst_id, &timelines)?
-        };
-
         // Ensure that `start_lsn` is valid, i.e. the LSN is within the PITR
         // horizon on the source timeline
         //
@@ -2957,27 +2994,44 @@ impl Tenant {
         Ok(new_timeline)
     }
-    /// - run initdb to init temporary instance and get bootstrap data
-    /// - after initialization completes, tar up the temp dir and upload it to S3.
-    ///
-    /// The caller is responsible for activating the returned timeline.
-    pub(crate) async fn bootstrap_timeline(
+    /// For unit tests, make this visible so that other modules can directly create timelines
+    #[cfg(test)]
+    pub(crate) async fn bootstrap_timeline_test(
         &self,
         timeline_id: TimelineId,
         pg_version: u32,
         load_existing_initdb: Option<TimelineId>,
         ctx: &RequestContext,
     ) -> anyhow::Result<Arc<Timeline>> {
-        let timeline_uninit_mark = {
-            let timelines = self.timelines.lock().unwrap();
-            self.create_timeline_uninit_mark(timeline_id, &timelines)?
-        };
+        let uninit_mark = self.create_timeline_uninit_mark(timeline_id).unwrap();
+        self.bootstrap_timeline(
+            timeline_id,
+            pg_version,
+            load_existing_initdb,
+            uninit_mark,
+            ctx,
+        )
+        .await
+    }
+
+    /// - run initdb to init temporary instance and get bootstrap data
+    /// - after initialization completes, tar up the temp dir and upload it to S3.
+    ///
+    /// The caller is responsible for activating the returned timeline.
+    async fn bootstrap_timeline(
+        &self,
+        timeline_id: TimelineId,
+        pg_version: u32,
+        load_existing_initdb: Option<TimelineId>,
+        timeline_uninit_mark: TimelineUninitMark<'_>,
+        ctx: &RequestContext,
+    ) -> anyhow::Result<Arc<Timeline>> {
         // create a `tenant/{tenant_id}/timelines/basebackup-{timeline_id}.{TEMP_FILE_SUFFIX}/`
         // temporary directory for basebackup files for the given timeline.
+        let timelines_path = self.conf.timelines_path(&self.tenant_shard_id);
         let pgdata_path = path_with_suffix_extension(
-            self.conf
-                .timelines_path(&self.tenant_shard_id)
-                .join(format!("basebackup-{timeline_id}")),
+            timelines_path.join(format!("basebackup-{timeline_id}")),
             TEMP_FILE_SUFFIX,
         );
@@ -3008,31 +3062,43 @@ impl Tenant {
             )
             .await
             .context("download initdb tar")?;
-            let buf_read = Box::pin(BufReader::new(initdb_tar_zst));
+            let buf_read =
+                BufReader::with_capacity(remote_timeline_client::BUFFER_SIZE, initdb_tar_zst);
             import_datadir::extract_tar_zst(&pgdata_path, buf_read)
                 .await
                 .context("extract initdb tar")?;
 
-            if initdb_tar_zst_path.exists() {
-                tokio::fs::remove_file(&initdb_tar_zst_path)
-                    .await
-                    .context("tempfile removal")?;
-            }
+            tokio::fs::remove_file(&initdb_tar_zst_path)
+                .await
+                .or_else(|e| {
+                    if e.kind() == std::io::ErrorKind::NotFound {
+                        // If something else already removed the file, ignore the error
+                        Ok(())
+                    } else {
+                        Err(e)
+                    }
+                })
+                .with_context(|| format!("tempfile removal {initdb_tar_zst_path}"))?;
         } else {
             // Init a temporary repo to get bootstrap data, this creates a directory in the `initdb_path` path
             run_initdb(self.conf, &pgdata_path, pg_version, &self.cancel).await?;
 
             // Upload the created data dir to S3
             if let Some(storage) = &self.remote_storage {
-                let pgdata_zstd = import_datadir::create_tar_zst(&pgdata_path).await?;
-                let pgdata_zstd = Bytes::from(pgdata_zstd);
+                let temp_path = timelines_path.join(format!(
+                    "{INITDB_PATH}.upload-{timeline_id}.{TEMP_FILE_SUFFIX}"
+                ));
+                let (pgdata_zstd, tar_zst_size) =
+                    import_datadir::create_tar_zst(&pgdata_path, &temp_path).await?;
                 backoff::retry(
                     || async {
                         self::remote_timeline_client::upload_initdb_dir(
                             storage,
                             &self.tenant_shard_id.tenant_id,
                             &timeline_id,
-                            pgdata_zstd.clone(),
+                            pgdata_zstd.try_clone().await?,
+                            tar_zst_size,
                         )
                         .await
                     },
@@ -3040,10 +3106,23 @@ impl Tenant {
                     3,
                     u32::MAX,
                     "persist_initdb_tar_zst",
-                    // TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
-                    backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
+                    backoff::Cancel::new(self.cancel.clone(), || {
+                        anyhow::anyhow!("initdb upload cancelled")
+                    }),
                 )
                 .await?;
+
+                tokio::fs::remove_file(&temp_path)
+                    .await
+                    .or_else(|e| {
+                        if e.kind() == std::io::ErrorKind::NotFound {
+                            // If something else already removed the file, ignore the error
+                            Ok(())
+                        } else {
+                            Err(e)
+                        }
+                    })
+                    .with_context(|| format!("tempfile removal {temp_path}"))?;
             }
         }
         let pgdata_lsn = import_datadir::get_lsn_from_controlfile(&pgdata_path)?.align();
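The hunk above applies the same cleanup pattern in two places: a failed `remove_file` is only an error if the file still exists, so a concurrent cleanup racing with us is harmless. A synchronous, standalone sketch of that pattern (the real code uses `tokio::fs::remove_file`; `remove_file_if_exists` is an illustrative helper name, not a pageserver API):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Treat NotFound as success: removal is idempotent, so racing with another
// cleanup (or retrying after a partial failure) is not an error.
fn remove_file_if_exists(path: &Path) -> io::Result<()> {
    match fs::remove_file(path) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("uninit-mark-demo");
    fs::write(&path, b"x")?;
    remove_file_if_exists(&path)?; // removes the file
    remove_file_if_exists(&path)?; // second call is a no-op, not an error
    assert!(!path.exists());
    Ok(())
}
```

Compared with the old `if path.exists() { remove_file(...) }` form, this avoids the check-then-act race: the existence check and the removal are a single operation whose NotFound outcome is tolerated.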
@@ -3144,11 +3223,11 @@ impl Tenant {
     /// at 'disk_consistent_lsn'. After any initial data has been imported, call
     /// `finish_creation` to insert the Timeline into the timelines map and to remove the
     /// uninit mark file.
-    async fn prepare_new_timeline(
-        &self,
+    async fn prepare_new_timeline<'a>(
+        &'a self,
         new_timeline_id: TimelineId,
         new_metadata: &TimelineMetadata,
-        uninit_mark: TimelineUninitMark,
+        uninit_mark: TimelineUninitMark<'a>,
         start_lsn: Lsn,
         ancestor: Option<Arc<Timeline>>,
     ) -> anyhow::Result<UninitializedTimeline> {
@@ -3165,7 +3244,6 @@ impl Tenant {
             new_metadata,
             ancestor,
             resources,
-            None,
             CreateTimelineCause::Load,
         )
         .context("Failed to create timeline data structure")?;
@@ -3222,23 +3300,38 @@ impl Tenant {
     fn create_timeline_uninit_mark(
         &self,
         timeline_id: TimelineId,
-        timelines: &MutexGuard<HashMap<TimelineId, Arc<Timeline>>>,
-    ) -> anyhow::Result<TimelineUninitMark> {
+    ) -> Result<TimelineUninitMark, TimelineExclusionError> {
         let tenant_shard_id = self.tenant_shard_id;
-        anyhow::ensure!(
-            timelines.get(&timeline_id).is_none(),
-            "Timeline {tenant_shard_id}/{timeline_id} already exists in pageserver's memory"
-        );
-        let timeline_path = self.conf.timeline_path(&tenant_shard_id, &timeline_id);
-        anyhow::ensure!(
-            !timeline_path.exists(),
-            "Timeline {timeline_path} already exists, cannot create its uninit mark file",
-        );
-
         let uninit_mark_path = self
             .conf
             .timeline_uninit_mark_file_path(tenant_shard_id, timeline_id);
+        let timeline_path = self.conf.timeline_path(&tenant_shard_id, &timeline_id);
+
+        let uninit_mark = TimelineUninitMark::new(
+            self,
+            timeline_id,
+            uninit_mark_path.clone(),
+            timeline_path.clone(),
+        )?;
+
+        // At this stage, we have got exclusive access to in-memory state for this timeline ID
+        // for creation.
+        // A timeline directory should never exist on disk already:
+        // - a previous failed creation would have cleaned up after itself
+        // - a pageserver restart would clean up timeline directories that don't have valid remote state
+        //
+        // Therefore it is an unexpected internal error to encounter a timeline directory already existing here,
+        // this error may indicate a bug in cleanup on failed creations.
+        if timeline_path.exists() {
+            return Err(TimelineExclusionError::Other(anyhow::anyhow!(
+                "Timeline directory already exists! This is a bug."
+            )));
+        }
+
+        // Create the on-disk uninit mark _after_ the in-memory acquisition of the tenant ID: guarantees
+        // that during process runtime, colliding creations will be caught in-memory without getting
+        // as far as failing to write a file.
         fs::OpenOptions::new()
             .write(true)
             .create_new(true)
@@ -3252,8 +3345,6 @@ impl Tenant {
                 format!("Failed to create uninit mark for timeline {tenant_shard_id}/{timeline_id}")
             })?;
 
-        let uninit_mark = TimelineUninitMark::new(uninit_mark_path, timeline_path);
-
         Ok(uninit_mark)
     }
@@ -3696,6 +3787,7 @@ pub(crate) mod harness {
                     tenant_conf.evictions_low_residence_duration_metric_threshold,
                 ),
                 gc_feedback: Some(tenant_conf.gc_feedback),
+                heatmap_period: Some(tenant_conf.heatmap_period),
             }
         }
     }
@@ -3831,6 +3923,8 @@ pub(crate) mod harness {
                     self.generation,
                 ))
                 .unwrap(),
+                // This is a legacy/test code path: sharding isn't supported here.
+                ShardIdentity::unsharded(),
                 walredo_mgr,
                 self.tenant_shard_id,
                 Some(self.remote_storage.clone()),
@@ -3840,7 +3934,7 @@ pub(crate) mod harness {
             match mode {
                 LoadMode::Local => {
                     tenant
-                        .load_local(None, ctx)
+                        .load_local(ctx)
                         .instrument(info_span!("try_load", tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug()))
                         .await?;
                 }
@@ -3850,7 +3944,7 @@ pub(crate) mod harness {
                         .instrument(info_span!("try_load_preload", tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug()))
                         .await?;
                     tenant
-                        .attach(None, Some(preload), ctx)
+                        .attach(Some(preload), ctx)
                         .instrument(info_span!("try_load", tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug()))
                         .await?;
                 }
@@ -3893,6 +3987,9 @@ pub(crate) mod harness {
     pub(crate) struct TestRedoManager;
 
     impl TestRedoManager {
+        /// # Cancel-Safety
+        ///
+        /// This method is cancellation-safe.
         pub async fn request_redo(
             &self,
             key: Key,
@@ -3997,13 +4094,7 @@ mod tests {
             .await
         {
             Ok(_) => panic!("duplicate timeline creation should fail"),
-            Err(e) => assert_eq!(
-                e.to_string(),
-                format!(
-                    "Timeline {}/{} already exists in pageserver's memory",
-                    tenant.tenant_shard_id, TIMELINE_ID
-                )
-            ),
+            Err(e) => assert_eq!(e.to_string(), "Already exists".to_string()),
         }
 
         Ok(())


@@ -334,6 +334,11 @@ pub struct TenantConf {
     #[serde(with = "humantime_serde")]
     pub evictions_low_residence_duration_metric_threshold: Duration,
     pub gc_feedback: bool,
+
+    /// If non-zero, the period between uploads of a heatmap from attached tenants. This
+    /// may be disabled if a Tenant will not have secondary locations: only secondary
+    /// locations will use the heatmap uploaded by attached locations.
+    pub heatmap_period: Duration,
 }
 
 /// Same as TenantConf, but this struct preserves the information about
@@ -414,6 +419,11 @@ pub struct TenantConfOpt {
     #[serde(skip_serializing_if = "Option::is_none")]
     #[serde(default)]
     pub gc_feedback: Option<bool>,
+
+    #[serde(skip_serializing_if = "Option::is_none")]
+    #[serde(with = "humantime_serde")]
+    #[serde(default)]
+    pub heatmap_period: Option<Duration>,
 }
 
 #[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
@@ -482,6 +492,7 @@ impl TenantConfOpt {
                 .evictions_low_residence_duration_metric_threshold
                 .unwrap_or(global_conf.evictions_low_residence_duration_metric_threshold),
             gc_feedback: self.gc_feedback.unwrap_or(global_conf.gc_feedback),
+            heatmap_period: self.heatmap_period.unwrap_or(global_conf.heatmap_period),
         }
     }
 }
@@ -519,6 +530,7 @@ impl Default for TenantConf {
             )
             .expect("cannot parse default evictions_low_residence_duration_metric_threshold"),
             gc_feedback: false,
+            heatmap_period: Duration::ZERO,
         }
     }
 }


@@ -15,7 +15,6 @@ use crate::{
     context::RequestContext,
     task_mgr::{self, TaskKind},
     tenant::mgr::{TenantSlot, TenantsMapRemoveResult},
-    InitializationOrder,
 };
 
 use super::{
@@ -78,8 +77,10 @@ async fn create_remote_delete_mark(
     let data: &[u8] = &[];
     backoff::retry(
         || async {
+            let data = bytes::Bytes::from_static(data);
+            let stream = futures::stream::once(futures::future::ready(Ok(data)));
             remote_storage
-                .upload(data, 0, &remote_mark_path, None)
+                .upload(stream, 0, &remote_mark_path, None)
                 .await
         },
         |_e| false,
@@ -390,7 +391,6 @@ impl DeleteTenantFlow {
         tenant: &Arc<Tenant>,
         preload: Option<TenantPreload>,
         tenants: &'static std::sync::RwLock<TenantsMap>,
-        init_order: Option<InitializationOrder>,
         ctx: &RequestContext,
     ) -> Result<(), DeleteTenantError> {
         let (_, progress) = completion::channel();
@@ -400,10 +400,7 @@ impl DeleteTenantFlow {
             .await
             .expect("cant be stopping or broken");
 
-        tenant
-            .attach(init_order, preload, ctx)
-            .await
-            .context("attach")?;
+        tenant.attach(preload, ctx).await.context("attach")?;
 
         Self::background(
             guard,
@@ -466,7 +463,7 @@ impl DeleteTenantFlow {
         task_mgr::spawn(
             task_mgr::BACKGROUND_RUNTIME.handle(),
             TaskKind::TimelineDeletionWorker,
-            Some(tenant_shard_id.tenant_id),
+            Some(tenant_shard_id),
             None,
             "tenant_delete",
             false,
@@ -553,7 +550,7 @@ impl DeleteTenantFlow {
         // we encounter an InProgress marker, yield the barrier it contains and wait on it.
         let barrier = {
            let mut locked = tenants.write().unwrap();
-            let removed = locked.remove(&tenant.tenant_shard_id.tenant_id);
+            let removed = locked.remove(tenant.tenant_shard_id);
 
             // FIXME: we should not be modifying this from outside of mgr.rs.
             // This will go away when we simplify deletion (https://github.com/neondatabase/neon/issues/5080)


@@ -2,7 +2,8 @@
 //! page server.
 
 use camino::{Utf8DirEntry, Utf8Path, Utf8PathBuf};
-use pageserver_api::shard::TenantShardId;
+use pageserver_api::key::Key;
+use pageserver_api::shard::{ShardIdentity, ShardNumber, TenantShardId};
 use rand::{distributions::Alphanumeric, Rng};
 use std::borrow::Cow;
 use std::collections::{BTreeMap, HashMap};
@@ -97,49 +98,76 @@ pub(crate) enum TenantsMap {
     ShuttingDown(BTreeMap<TenantShardId, TenantSlot>),
 }
 
-/// Helper for mapping shard-unaware functions to a sharding-aware map
-/// TODO(sharding): all users of this must be made shard-aware.
-fn exactly_one_or_none<'a>(
-    map: &'a BTreeMap<TenantShardId, TenantSlot>,
-    tenant_id: &TenantId,
-) -> Option<(&'a TenantShardId, &'a TenantSlot)> {
-    let mut slots = map.range(TenantShardId::tenant_range(*tenant_id));
-    // Retrieve the first two slots in the range: if both are populated, we must panic because the caller
-    // needs a shard-naive view of the world in which only one slot can exist for a TenantId at a time.
-    let slot_a = slots.next();
-    let slot_b = slots.next();
-    match (slot_a, slot_b) {
-        (None, None) => None,
-        (Some(slot), None) => {
-            // Exactly one matching slot
-            Some(slot)
-        }
-        (Some(_slot_a), Some(_slot_b)) => {
-            // Multiple shards for this tenant: cannot handle this yet.
-            // TODO(sharding): callers of get() should be shard-aware.
-            todo!("Attaching multiple shards in the same tenant to the same pageserver")
-        }
-        (None, Some(_)) => unreachable!(),
-    }
-}
-
 pub(crate) enum TenantsMapRemoveResult {
     Occupied(TenantSlot),
     Vacant,
     InProgress(utils::completion::Barrier),
 }
 
+/// When resolving a TenantId to a shard, we may be looking for the 0th
+/// shard, or we might be looking for whichever shard holds a particular page.
+pub(crate) enum ShardSelector {
+    /// Only return the 0th shard, if it is present. If a non-0th shard is present,
+    /// ignore it.
+    Zero,
+    /// Pick the first shard we find for the TenantId
+    First,
+    /// Pick the shard that holds this key
+    Page(Key),
+}
+
 impl TenantsMap {
     /// Convenience function for typical usage, where we want to get a `Tenant` object, for
     /// working with attached tenants. If the TenantId is in the map but in Secondary state,
     /// None is returned.
-    pub(crate) fn get(&self, tenant_id: &TenantId) -> Option<&Arc<Tenant>> {
+    pub(crate) fn get(&self, tenant_shard_id: &TenantShardId) -> Option<&Arc<Tenant>> {
         match self {
             TenantsMap::Initializing => None,
             TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => {
-                // TODO(sharding): callers of get() should be shard-aware.
-                exactly_one_or_none(m, tenant_id).and_then(|(_, slot)| slot.get_attached())
+                m.get(tenant_shard_id).and_then(|slot| slot.get_attached())
             }
         }
     }
 
+    /// A page service client sends a TenantId, and to look up the correct Tenant we must
+    /// resolve this to a fully qualified TenantShardId.
+    fn resolve_shard(
+        &self,
+        tenant_id: &TenantId,
+        selector: ShardSelector,
+    ) -> Option<TenantShardId> {
+        let mut want_shard = None;
+        match self {
+            TenantsMap::Initializing => None,
+            TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => {
+                for slot in m.range(TenantShardId::tenant_range(*tenant_id)) {
+                    match selector {
+                        ShardSelector::First => return Some(*slot.0),
+                        ShardSelector::Zero if slot.0.shard_number == ShardNumber(0) => {
+                            return Some(*slot.0)
+                        }
+                        ShardSelector::Page(key) => {
+                            if let Some(tenant) = slot.1.get_attached() {
+                                // First slot we see for this tenant, calculate the expected shard number
+                                // for the key: we will use this for checking if this and subsequent
+                                // slots contain the key, rather than recalculating the hash each time.
+                                if want_shard.is_none() {
+                                    want_shard = Some(tenant.shard_identity.get_shard_number(&key));
+                                }
+
+                                if Some(tenant.shard_identity.number) == want_shard {
+                                    return Some(*slot.0);
+                                }
+                            } else {
+                                continue;
+                            }
+                        }
+                        _ => continue,
+                    }
+                }
+
+                // Fall through: we didn't find an acceptable shard
+                None
+            }
        }
    }
@@ -148,25 +176,19 @@ impl TenantsMap {
/// ///
/// The normal way to remove a tenant is using a SlotGuard, which will gracefully remove the guarded /// The normal way to remove a tenant is using a SlotGuard, which will gracefully remove the guarded
/// slot if the enclosed tenant is shutdown. /// slot if the enclosed tenant is shutdown.
pub(crate) fn remove(&mut self, tenant_id: &TenantId) -> TenantsMapRemoveResult { pub(crate) fn remove(&mut self, tenant_shard_id: TenantShardId) -> TenantsMapRemoveResult {
use std::collections::btree_map::Entry; use std::collections::btree_map::Entry;
match self { match self {
TenantsMap::Initializing => TenantsMapRemoveResult::Vacant, TenantsMap::Initializing => TenantsMapRemoveResult::Vacant,
TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => { TenantsMap::Open(m) | TenantsMap::ShuttingDown(m) => match m.entry(tenant_shard_id) {
let key = exactly_one_or_none(m, tenant_id).map(|(k, _)| *k); Entry::Occupied(entry) => match entry.get() {
match key { TenantSlot::InProgress(barrier) => {
Some(key) => match m.entry(key) { TenantsMapRemoveResult::InProgress(barrier.clone())
Entry::Occupied(entry) => match entry.get() { }
TenantSlot::InProgress(barrier) => { _ => TenantsMapRemoveResult::Occupied(entry.remove()),
TenantsMapRemoveResult::InProgress(barrier.clone()) },
} Entry::Vacant(_entry) => TenantsMapRemoveResult::Vacant,
_ => TenantsMapRemoveResult::Occupied(entry.remove()), },
},
Entry::Vacant(_entry) => TenantsMapRemoveResult::Vacant,
},
None => TenantsMapRemoveResult::Vacant,
}
}
} }
} }
@@ -214,49 +236,6 @@ async fn safe_rename_tenant_dir(path: impl AsRef<Utf8Path>) -> std::io::Result<U
static TENANTS: Lazy<std::sync::RwLock<TenantsMap>> = static TENANTS: Lazy<std::sync::RwLock<TenantsMap>> =
Lazy::new(|| std::sync::RwLock::new(TenantsMap::Initializing)); Lazy::new(|| std::sync::RwLock::new(TenantsMap::Initializing));
/// Create a directory, including parents. This does no fsyncs and makes
/// no guarantees about the persistence of the resulting metadata: for
/// use when creating dirs for use as cache.
async fn unsafe_create_dir_all(path: &Utf8PathBuf) -> std::io::Result<()> {
let mut dirs_to_create = Vec::new();
let mut path: &Utf8Path = path.as_ref();
// Figure out which directories we need to create.
loop {
let meta = tokio::fs::metadata(path).await;
match meta {
Ok(metadata) if metadata.is_dir() => break,
Ok(_) => {
return Err(std::io::Error::new(
std::io::ErrorKind::AlreadyExists,
format!("non-directory found in path: {path}"),
));
}
Err(ref e) if e.kind() == std::io::ErrorKind::NotFound => {}
Err(e) => return Err(e),
}
dirs_to_create.push(path);
match path.parent() {
Some(parent) => path = parent,
None => {
return Err(std::io::Error::new(
std::io::ErrorKind::InvalidInput,
format!("can't find parent of path '{path}'"),
));
}
}
}
// Create directories from parent to child.
for &path in dirs_to_create.iter().rev() {
tokio::fs::create_dir(path).await?;
}
Ok(())
}
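The helper above is deleted in favor of the library recursive-create, which provides the same contract (succeed if the directory exists, fail on a non-directory in the way) without the hand-rolled parent walk; a minimal sketch using the synchronous std counterpart of `tokio::fs::create_dir_all`:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Recursively create a directory path. Like the removed helper, this makes
// no fsync/durability guarantees: it is only for cache-like directories.
fn ensure_dir(path: &Path) -> io::Result<()> {
    fs::create_dir_all(path)
}

fn main() -> io::Result<()> {
    let base = std::env::temp_dir().join("ensure_dir_demo").join("a").join("b");
    ensure_dir(&base)?;
    assert!(base.is_dir());
    // Idempotent: creating an existing directory succeeds.
    ensure_dir(&base)?;
    // A regular file blocking the path is still an error.
    let file_path = std::env::temp_dir().join("ensure_dir_demo_file");
    fs::write(&file_path, b"x")?;
    assert!(ensure_dir(&file_path.join("child")).is_err());
    Ok(())
}
```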
/// The TenantManager is responsible for storing and mutating the collection of all tenants /// The TenantManager is responsible for storing and mutating the collection of all tenants
/// that this pageserver process has state for. Every Tenant and SecondaryTenant instance /// that this pageserver process has state for. Every Tenant and SecondaryTenant instance
/// lives inside the TenantManager. /// lives inside the TenantManager.
@@ -515,12 +494,14 @@ pub async fn init_tenant_mgr(
location_conf.attach_in_generation(generation); location_conf.attach_in_generation(generation);
Tenant::persist_tenant_config(conf, &tenant_shard_id, &location_conf).await?; Tenant::persist_tenant_config(conf, &tenant_shard_id, &location_conf).await?;
let shard_identity = location_conf.shard;
match tenant_spawn( match tenant_spawn(
conf, conf,
tenant_shard_id, tenant_shard_id,
&tenant_dir_path, &tenant_dir_path,
resources.clone(), resources.clone(),
AttachedTenantConf::try_from(location_conf)?, AttachedTenantConf::try_from(location_conf)?,
shard_identity,
Some(init_order.clone()), Some(init_order.clone()),
&TENANTS, &TENANTS,
SpawnMode::Normal, SpawnMode::Normal,
@@ -561,6 +542,7 @@ pub(crate) fn tenant_spawn(
tenant_path: &Utf8Path, tenant_path: &Utf8Path,
resources: TenantSharedResources, resources: TenantSharedResources,
location_conf: AttachedTenantConf, location_conf: AttachedTenantConf,
shard_identity: ShardIdentity,
init_order: Option<InitializationOrder>, init_order: Option<InitializationOrder>,
tenants: &'static std::sync::RwLock<TenantsMap>, tenants: &'static std::sync::RwLock<TenantsMap>,
mode: SpawnMode, mode: SpawnMode,
@@ -587,12 +569,19 @@ pub(crate) fn tenant_spawn(
"Cannot load tenant, ignore mark found at {tenant_ignore_mark:?}" "Cannot load tenant, ignore mark found at {tenant_ignore_mark:?}"
); );
info!("Attaching tenant {tenant_shard_id}"); info!(
tenant_id = %tenant_shard_id.tenant_id,
shard_id = %tenant_shard_id.shard_slug(),
generation = ?location_conf.location.generation,
attach_mode = ?location_conf.location.attach_mode,
"Attaching tenant"
);
let tenant = match Tenant::spawn( let tenant = match Tenant::spawn(
conf, conf,
tenant_shard_id, tenant_shard_id,
resources, resources,
location_conf, location_conf,
shard_identity,
init_order, init_order,
tenants, tenants,
mode, mode,
@@ -762,12 +751,14 @@ pub(crate) async fn create_tenant(
tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::MustNotExist)?; tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::MustNotExist)?;
let tenant_path = super::create_tenant_files(conf, &location_conf, &tenant_shard_id).await?; let tenant_path = super::create_tenant_files(conf, &location_conf, &tenant_shard_id).await?;
let shard_identity = location_conf.shard;
let created_tenant = tenant_spawn( let created_tenant = tenant_spawn(
conf, conf,
tenant_shard_id, tenant_shard_id,
&tenant_path, &tenant_path,
resources, resources,
AttachedTenantConf::try_from(location_conf)?, AttachedTenantConf::try_from(location_conf)?,
shard_identity,
None, None,
&TENANTS, &TENANTS,
SpawnMode::Create, SpawnMode::Create,
@@ -797,14 +788,16 @@ pub(crate) async fn set_new_tenant_config(
new_tenant_conf: TenantConfOpt, new_tenant_conf: TenantConfOpt,
tenant_id: TenantId, tenant_id: TenantId,
) -> Result<(), SetNewTenantConfigError> { ) -> Result<(), SetNewTenantConfigError> {
// Legacy API: does not support sharding
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
info!("configuring tenant {tenant_id}"); info!("configuring tenant {tenant_id}");
let tenant = get_tenant(tenant_id, true)?; let tenant = get_tenant(tenant_shard_id, true)?;
// This is a legacy API that only operates on attached tenants: the preferred // This is a legacy API that only operates on attached tenants: the preferred
// API to use is the location_config/ endpoint, which lets the caller provide // API to use is the location_config/ endpoint, which lets the caller provide
// the full LocationConf. // the full LocationConf.
let location_conf = LocationConf::attached_single(new_tenant_conf, tenant.generation); let location_conf = LocationConf::attached_single(new_tenant_conf, tenant.generation);
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
Tenant::persist_tenant_config(conf, &tenant_shard_id, &location_conf) Tenant::persist_tenant_config(conf, &tenant_shard_id, &location_conf)
.await .await
@@ -814,6 +807,12 @@ pub(crate) async fn set_new_tenant_config(
} }
impl TenantManager { impl TenantManager {
/// Convenience function so that anyone with a TenantManager can get at the global configuration, without
/// having to pass it around everywhere as a separate object.
pub(crate) fn get_conf(&self) -> &'static PageServerConf {
self.conf
}
/// Gets the attached tenant from the in-memory data, erroring if it's absent, in secondary mode, or is not fitting to the query. /// Gets the attached tenant from the in-memory data, erroring if it's absent, in secondary mode, or is not fitting to the query.
/// `active_only = true` allows to query only tenants that are ready for operations, erroring on other kinds of tenants. /// `active_only = true` allows to query only tenants that are ready for operations, erroring on other kinds of tenants.
pub(crate) fn get_attached_tenant_shard( pub(crate) fn get_attached_tenant_shard(
@@ -860,6 +859,7 @@ impl TenantManager {
Ok(()) Ok(())
} }
#[instrument(skip_all, fields(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug()))]
pub(crate) async fn upsert_location( pub(crate) async fn upsert_location(
&self, &self,
tenant_shard_id: TenantShardId, tenant_shard_id: TenantShardId,
@@ -972,7 +972,7 @@ impl TenantManager {
LocationMode::Secondary(_) => { LocationMode::Secondary(_) => {
// Directory doesn't need to be fsync'd because if we crash it can // Directory doesn't need to be fsync'd because if we crash it can
// safely be recreated next time this tenant location is configured. // safely be recreated next time this tenant location is configured.
unsafe_create_dir_all(&tenant_path) tokio::fs::create_dir_all(&tenant_path)
.await .await
.with_context(|| format!("Creating {tenant_path}"))?; .with_context(|| format!("Creating {tenant_path}"))?;
@@ -988,7 +988,7 @@ impl TenantManager {
// Directory doesn't need to be fsync'd because we do not depend on // Directory doesn't need to be fsync'd because we do not depend on
// it to exist after crashes: it may be recreated when tenant is // it to exist after crashes: it may be recreated when tenant is
// re-attached, see https://github.com/neondatabase/neon/issues/5550 // re-attached, see https://github.com/neondatabase/neon/issues/5550
unsafe_create_dir_all(&timelines_path) tokio::fs::create_dir_all(&timelines_path)
.await .await
.with_context(|| format!("Creating {timelines_path}"))?; .with_context(|| format!("Creating {timelines_path}"))?;
@@ -996,12 +996,14 @@ impl TenantManager {
.await .await
.map_err(SetNewTenantConfigError::Persist)?; .map_err(SetNewTenantConfigError::Persist)?;
let shard_identity = new_location_config.shard;
let tenant = tenant_spawn( let tenant = tenant_spawn(
self.conf, self.conf,
tenant_shard_id, tenant_shard_id,
&tenant_path, &tenant_path,
self.resources.clone(), self.resources.clone(),
AttachedTenantConf::try_from(new_location_config)?, AttachedTenantConf::try_from(new_location_config)?,
shard_identity,
None, None,
self.tenants, self.tenants,
SpawnMode::Normal, SpawnMode::Normal,
@@ -1016,6 +1018,95 @@ impl TenantManager {
Ok(()) Ok(())
} }
/// Resetting a tenant is equivalent to detaching it, then attaching it again with the same
/// LocationConf that was last used to attach it. Optionally, the local file cache may be
/// dropped before re-attaching.
///
/// This is not part of a tenant's normal lifecycle: it is used for debug/support, in situations
/// where an issue is identified that would go away with a restart of the tenant.
///
/// This does not have any special "force" shutdown of a tenant: it relies on the tenant's tasks
/// to respect the cancellation tokens used in normal shutdown().
#[instrument(skip_all, fields(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(), %drop_cache))]
pub(crate) async fn reset_tenant(
&self,
tenant_shard_id: TenantShardId,
drop_cache: bool,
ctx: RequestContext,
) -> anyhow::Result<()> {
let mut slot_guard = tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::Any)?;
let Some(old_slot) = slot_guard.get_old_value() else {
anyhow::bail!("Tenant not found when trying to reset");
};
let Some(tenant) = old_slot.get_attached() else {
slot_guard.revert();
anyhow::bail!("Tenant is not in attached state");
};
let (_guard, progress) = utils::completion::channel();
match tenant.shutdown(progress, false).await {
Ok(()) => {
slot_guard.drop_old_value()?;
}
Err(_barrier) => {
slot_guard.revert();
anyhow::bail!("Cannot reset Tenant, already shutting down");
}
}
let tenant_path = self.conf.tenant_path(&tenant_shard_id);
let timelines_path = self.conf.timelines_path(&tenant_shard_id);
let config = Tenant::load_tenant_config(self.conf, &tenant_shard_id)?;
if drop_cache {
tracing::info!("Dropping local file cache");
match tokio::fs::read_dir(&timelines_path).await {
Err(e) => {
tracing::warn!("Failed to list timelines while dropping cache: {}", e);
}
Ok(mut entries) => {
while let Some(entry) = entries.next_entry().await? {
tokio::fs::remove_dir_all(entry.path()).await?;
}
}
}
}
let shard_identity = config.shard;
let tenant = tenant_spawn(
self.conf,
tenant_shard_id,
&tenant_path,
self.resources.clone(),
AttachedTenantConf::try_from(config)?,
shard_identity,
None,
self.tenants,
SpawnMode::Normal,
&ctx,
)?;
slot_guard.upsert(TenantSlot::Attached(tenant))?;
Ok(())
}
pub(crate) fn get_attached_active_tenant_shards(&self) -> Vec<Arc<Tenant>> {
let locked = self.tenants.read().unwrap();
match &*locked {
TenantsMap::Initializing => Vec::new(),
TenantsMap::Open(map) | TenantsMap::ShuttingDown(map) => map
.values()
.filter_map(|slot| {
slot.get_attached()
.and_then(|t| if t.is_active() { Some(t.clone()) } else { None })
})
.collect(),
}
}
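The `filter_map` above keeps only slots that are both attached and active; the same selection pattern, sketched with a hypothetical slot enum in place of `TenantSlot`/`Arc<Tenant>`:

```rust
// Simplified stand-in for TenantSlot; `id` plays the role of Arc<Tenant>.
enum Slot {
    Attached { active: bool, id: u32 },
    Secondary,
    InProgress,
}

// Collect the ids of slots that are attached and active.
fn active_attached(slots: &[Slot]) -> Vec<u32> {
    slots
        .iter()
        .filter_map(|s| match s {
            Slot::Attached { active: true, id } => Some(*id),
            _ => None,
        })
        .collect()
}

fn main() {
    let slots = vec![
        Slot::Attached { active: true, id: 1 },
        Slot::Attached { active: false, id: 2 },
        Slot::Secondary,
        Slot::InProgress,
    ];
    assert_eq!(active_attached(&slots), vec![1]);
}
```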
} }
#[derive(Debug, thiserror::Error)] #[derive(Debug, thiserror::Error)]
@@ -1040,14 +1131,11 @@ pub(crate) enum GetTenantError {
/// ///
/// This method is cancel-safe. /// This method is cancel-safe.
pub(crate) fn get_tenant( pub(crate) fn get_tenant(
tenant_id: TenantId, tenant_shard_id: TenantShardId,
active_only: bool, active_only: bool,
) -> Result<Arc<Tenant>, GetTenantError> { ) -> Result<Arc<Tenant>, GetTenantError> {
let locked = TENANTS.read().unwrap(); let locked = TENANTS.read().unwrap();
// TODO(sharding): make all callers of get_tenant shard-aware
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
let peek_slot = tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)?; let peek_slot = tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)?;
match peek_slot { match peek_slot {
@@ -1059,14 +1147,18 @@ pub(crate) fn get_tenant(
TenantState::Active => Ok(Arc::clone(tenant)), TenantState::Active => Ok(Arc::clone(tenant)),
_ => { _ => {
if active_only { if active_only {
Err(GetTenantError::NotActive(tenant_id)) Err(GetTenantError::NotActive(tenant_shard_id.tenant_id))
} else { } else {
Ok(Arc::clone(tenant)) Ok(Arc::clone(tenant))
} }
} }
}, },
Some(TenantSlot::InProgress(_)) => Err(GetTenantError::NotActive(tenant_id)), Some(TenantSlot::InProgress(_)) => {
None | Some(TenantSlot::Secondary) => Err(GetTenantError::NotFound(tenant_id)), Err(GetTenantError::NotActive(tenant_shard_id.tenant_id))
}
None | Some(TenantSlot::Secondary) => {
Err(GetTenantError::NotFound(tenant_shard_id.tenant_id))
}
} }
} }
@@ -1100,6 +1192,7 @@ pub(crate) enum GetActiveTenantError {
/// then wait for up to `timeout` (minus however long we waited for the slot). /// then wait for up to `timeout` (minus however long we waited for the slot).
pub(crate) async fn get_active_tenant_with_timeout( pub(crate) async fn get_active_tenant_with_timeout(
tenant_id: TenantId, tenant_id: TenantId,
shard_selector: ShardSelector,
timeout: Duration, timeout: Duration,
cancel: &CancellationToken, cancel: &CancellationToken,
) -> Result<Arc<Tenant>, GetActiveTenantError> { ) -> Result<Arc<Tenant>, GetActiveTenantError> {
@@ -1108,15 +1201,17 @@ pub(crate) async fn get_active_tenant_with_timeout(
Tenant(Arc<Tenant>), Tenant(Arc<Tenant>),
} }
// TODO(sharding): make page service interface sharding-aware (page service should apply ShardIdentity to the key
// to decide which shard services the request)
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
let wait_start = Instant::now(); let wait_start = Instant::now();
let deadline = wait_start + timeout; let deadline = wait_start + timeout;
let wait_for = { let (wait_for, tenant_shard_id) = {
let locked = TENANTS.read().unwrap(); let locked = TENANTS.read().unwrap();
// Resolve TenantId to TenantShardId
let tenant_shard_id = locked.resolve_shard(&tenant_id, shard_selector).ok_or(
GetActiveTenantError::NotFound(GetTenantError::NotFound(tenant_id)),
)?;
let peek_slot = tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read) let peek_slot = tenant_map_peek_slot(&locked, &tenant_shard_id, TenantSlotPeekMode::Read)
.map_err(GetTenantError::MapState)?; .map_err(GetTenantError::MapState)?;
match peek_slot { match peek_slot {
@@ -1126,7 +1221,7 @@ pub(crate) async fn get_active_tenant_with_timeout(
// Fast path: we don't need to do any async waiting. // Fast path: we don't need to do any async waiting.
return Ok(tenant.clone()); return Ok(tenant.clone());
} }
_ => WaitFor::Tenant(tenant.clone()), _ => (WaitFor::Tenant(tenant.clone()), tenant_shard_id),
} }
} }
Some(TenantSlot::Secondary) => { Some(TenantSlot::Secondary) => {
@@ -1134,7 +1229,9 @@ pub(crate) async fn get_active_tenant_with_timeout(
tenant_id, tenant_id,
))) )))
} }
Some(TenantSlot::InProgress(barrier)) => WaitFor::Barrier(barrier.clone()), Some(TenantSlot::InProgress(barrier)) => {
(WaitFor::Barrier(barrier.clone()), tenant_shard_id)
}
None => { None => {
return Err(GetActiveTenantError::NotFound(GetTenantError::NotFound( return Err(GetActiveTenantError::NotFound(GetTenantError::NotFound(
tenant_id, tenant_id,
@@ -1219,8 +1316,7 @@ pub(crate) async fn delete_tenant(
// See https://github.com/neondatabase/neon/issues/5080 // See https://github.com/neondatabase/neon/issues/5080
// TODO(sharding): make delete API sharding-aware // TODO(sharding): make delete API sharding-aware
let mut slot_guard = let slot_guard = tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::MustExist)?;
tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::MustExist)?;
// unwrap is safe because we used MustExist mode when acquiring // unwrap is safe because we used MustExist mode when acquiring
let tenant = match slot_guard.get_old_value().as_ref().unwrap() { let tenant = match slot_guard.get_old_value().as_ref().unwrap() {
@@ -1377,12 +1473,14 @@ pub(crate) async fn load_tenant(
Tenant::persist_tenant_config(conf, &tenant_shard_id, &location_conf).await?; Tenant::persist_tenant_config(conf, &tenant_shard_id, &location_conf).await?;
let shard_identity = location_conf.shard;
let new_tenant = tenant_spawn( let new_tenant = tenant_spawn(
conf, conf,
tenant_shard_id, tenant_shard_id,
&tenant_path, &tenant_path,
resources, resources,
AttachedTenantConf::try_from(location_conf)?, AttachedTenantConf::try_from(location_conf)?,
shard_identity,
None, None,
&TENANTS, &TENANTS,
SpawnMode::Normal, SpawnMode::Normal,
@@ -1433,7 +1531,8 @@ pub(crate) enum TenantMapListError {
/// ///
/// Get list of tenants, for the mgmt API /// Get list of tenants, for the mgmt API
/// ///
pub(crate) async fn list_tenants() -> Result<Vec<(TenantId, TenantState)>, TenantMapListError> { pub(crate) async fn list_tenants() -> Result<Vec<(TenantShardId, TenantState)>, TenantMapListError>
{
let tenants = TENANTS.read().unwrap(); let tenants = TENANTS.read().unwrap();
let m = match &*tenants { let m = match &*tenants {
TenantsMap::Initializing => return Err(TenantMapListError::Initializing), TenantsMap::Initializing => return Err(TenantMapListError::Initializing),
@@ -1441,12 +1540,10 @@ pub(crate) async fn list_tenants() -> Result<Vec<(TenantId, TenantState)>, Tenan
}; };
Ok(m.iter() Ok(m.iter()
.filter_map(|(id, tenant)| match tenant { .filter_map(|(id, tenant)| match tenant {
TenantSlot::Attached(tenant) => Some((id, tenant.current_state())), TenantSlot::Attached(tenant) => Some((*id, tenant.current_state())),
TenantSlot::Secondary => None, TenantSlot::Secondary => None,
TenantSlot::InProgress(_) => None, TenantSlot::InProgress(_) => None,
}) })
// TODO(sharding): make callers of this function shard-aware
.map(|(k, v)| (k.tenant_id, v))
.collect()) .collect())
} }
@@ -1472,12 +1569,14 @@ pub(crate) async fn attach_tenant(
// TODO: tenant directory remains on disk if we bail out from here on. // TODO: tenant directory remains on disk if we bail out from here on.
// See https://github.com/neondatabase/neon/issues/4233 // See https://github.com/neondatabase/neon/issues/4233
let shard_identity = location_conf.shard;
let attached_tenant = tenant_spawn( let attached_tenant = tenant_spawn(
conf, conf,
tenant_shard_id, tenant_shard_id,
&tenant_dir, &tenant_dir,
resources, resources,
AttachedTenantConf::try_from(location_conf)?, AttachedTenantConf::try_from(location_conf)?,
shard_identity,
None, None,
&TENANTS, &TENANTS,
SpawnMode::Normal, SpawnMode::Normal,
@@ -1543,9 +1642,10 @@ pub enum TenantSlotUpsertError {
MapState(#[from] TenantMapError), MapState(#[from] TenantMapError),
} }
#[derive(Debug)] #[derive(Debug, thiserror::Error)]
enum TenantSlotDropError { enum TenantSlotDropError {
/// It is only legal to drop a TenantSlot if its contents are fully shut down /// It is only legal to drop a TenantSlot if its contents are fully shut down
#[error("Tenant was not shut down")]
NotShutdown, NotShutdown,
} }
@@ -1605,9 +1705,9 @@ impl SlotGuard {
} }
} }
/// Take any value that was present in the slot before we acquired ownership /// Get any value that was present in the slot before we acquired ownership
/// of it: in state transitions, this will be the old state. /// of it: in state transitions, this will be the old state.
fn get_old_value(&mut self) -> &Option<TenantSlot> { fn get_old_value(&self) -> &Option<TenantSlot> {
&self.old_value &self.old_value
} }
@@ -1825,7 +1925,7 @@ fn tenant_map_acquire_slot_impl(
METRICS.tenant_slot_writes.inc(); METRICS.tenant_slot_writes.inc();
let mut locked = tenants.write().unwrap(); let mut locked = tenants.write().unwrap();
let span = tracing::info_span!("acquire_slot", tenant_id=%tenant_shard_id.tenant_id, shard=tenant_shard_id.shard_slug()); let span = tracing::info_span!("acquire_slot", tenant_id=%tenant_shard_id.tenant_id, shard = %tenant_shard_id.shard_slug());
let _guard = span.enter(); let _guard = span.enter();
let m = match &mut *locked { let m = match &mut *locked {
@@ -1977,21 +2077,19 @@ use {
}; };
pub(crate) async fn immediate_gc( pub(crate) async fn immediate_gc(
tenant_id: TenantId, tenant_shard_id: TenantShardId,
timeline_id: TimelineId, timeline_id: TimelineId,
gc_req: TimelineGcRequest, gc_req: TimelineGcRequest,
cancel: CancellationToken, cancel: CancellationToken,
ctx: &RequestContext, ctx: &RequestContext,
) -> Result<tokio::sync::oneshot::Receiver<Result<GcResult, anyhow::Error>>, ApiError> { ) -> Result<tokio::sync::oneshot::Receiver<Result<GcResult, anyhow::Error>>, ApiError> {
let guard = TENANTS.read().unwrap(); let guard = TENANTS.read().unwrap();
let tenant = guard
.get(&tenant_id)
.map(Arc::clone)
.with_context(|| format!("tenant {tenant_id}"))
.map_err(|e| ApiError::NotFound(e.into()))?;
// TODO(sharding): make callers of this function shard-aware let tenant = guard
let tenant_shard_id = TenantShardId::unsharded(tenant_id); .get(&tenant_shard_id)
.map(Arc::clone)
.with_context(|| format!("tenant {tenant_shard_id}"))
.map_err(|e| ApiError::NotFound(e.into()))?;
let gc_horizon = gc_req.gc_horizon.unwrap_or_else(|| tenant.get_gc_horizon()); let gc_horizon = gc_req.gc_horizon.unwrap_or_else(|| tenant.get_gc_horizon());
// Use tenant's pitr setting // Use tenant's pitr setting
@@ -2004,9 +2102,9 @@ pub(crate) async fn immediate_gc(
task_mgr::spawn( task_mgr::spawn(
&tokio::runtime::Handle::current(), &tokio::runtime::Handle::current(),
TaskKind::GarbageCollector, TaskKind::GarbageCollector,
Some(tenant_id), Some(tenant_shard_id),
Some(timeline_id), Some(timeline_id),
&format!("timeline_gc_handler garbage collection run for tenant {tenant_id} timeline {timeline_id}"), &format!("timeline_gc_handler garbage collection run for tenant {tenant_shard_id} timeline {timeline_id}"),
false, false,
async move { async move {
fail::fail_point!("immediate_gc_task_pre"); fail::fail_point!("immediate_gc_task_pre");

View File

@@ -180,7 +180,7 @@
//! [`Tenant::timeline_init_and_sync`]: super::Tenant::timeline_init_and_sync //! [`Tenant::timeline_init_and_sync`]: super::Tenant::timeline_init_and_sync
//! [`Timeline::load_layer_map`]: super::Timeline::load_layer_map //! [`Timeline::load_layer_map`]: super::Timeline::load_layer_map
mod download; pub(crate) mod download;
pub mod index; pub mod index;
mod upload; mod upload;
@@ -254,6 +254,9 @@ pub(crate) const FAILED_UPLOAD_WARN_THRESHOLD: u32 = 3;
pub(crate) const INITDB_PATH: &str = "initdb.tar.zst"; pub(crate) const INITDB_PATH: &str = "initdb.tar.zst";
/// Default buffer size when interfacing with [`tokio::fs::File`].
pub(crate) const BUFFER_SIZE: usize = 32 * 1024;
pub enum MaybeDeletedIndexPart { pub enum MaybeDeletedIndexPart {
IndexPart(IndexPart), IndexPart(IndexPart),
Deleted(IndexPart), Deleted(IndexPart),
@@ -1220,7 +1223,7 @@ impl RemoteTimelineClient {
task_mgr::spawn( task_mgr::spawn(
&self.runtime, &self.runtime,
TaskKind::RemoteUploadTask, TaskKind::RemoteUploadTask,
Some(self.tenant_shard_id.tenant_id), Some(self.tenant_shard_id),
Some(self.timeline_id), Some(self.timeline_id),
"remote upload", "remote upload",
false, false,
@@ -1601,6 +1604,23 @@ impl RemoteTimelineClient {
} }
} }
} }
pub(crate) fn get_layers_metadata(
&self,
layers: Vec<LayerFileName>,
) -> anyhow::Result<Vec<Option<LayerFileMetadata>>> {
let q = self.upload_queue.lock().unwrap();
let q = match &*q {
UploadQueue::Stopped(_) | UploadQueue::Uninitialized => {
anyhow::bail!("queue is in state {}", q.as_str())
}
UploadQueue::Initialized(inner) => inner,
};
let decorated = layers.into_iter().map(|l| q.latest_files.get(&l).cloned());
Ok(decorated.collect())
}
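`get_layers_metadata` decorates the requested layer names with whatever the initialized upload queue currently knows about them: a per-key clone out of `latest_files`, with `None` for unknown layers. The lookup itself, sketched with strings and integers standing in for `LayerFileName`/`LayerFileMetadata`:

```rust
use std::collections::HashMap;

// For each requested layer name, return its metadata if known, else None,
// preserving the order of the request.
fn layers_metadata(
    latest_files: &HashMap<String, u64>, // name -> metadata, simplified
    layers: Vec<String>,
) -> Vec<Option<u64>> {
    layers
        .into_iter()
        .map(|l| latest_files.get(&l).cloned())
        .collect()
}

fn main() {
    let mut latest = HashMap::new();
    latest.insert("layer-a".to_string(), 42u64);
    let out = layers_metadata(&latest, vec!["layer-a".into(), "layer-b".into()]);
    assert_eq!(out, vec![Some(42), None]);
}
```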
} }
pub fn remote_timelines_path(tenant_shard_id: &TenantShardId) -> RemotePath { pub fn remote_timelines_path(tenant_shard_id: &TenantShardId) -> RemotePath {
@@ -1656,6 +1676,13 @@ pub fn remote_index_path(
.expect("Failed to construct path") .expect("Failed to construct path")
} }
pub const HEATMAP_BASENAME: &str = "heatmap-v1.json";
pub(crate) fn remote_heatmap_path(tenant_shard_id: &TenantShardId) -> RemotePath {
RemotePath::from_string(&format!("tenants/{tenant_shard_id}/{HEATMAP_BASENAME}"))
.expect("Failed to construct path")
}
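The heatmap object lives directly under the tenant prefix, alongside the per-timeline paths built elsewhere in this module; stripped of `RemotePath` validation, the key construction is just:

```rust
const HEATMAP_BASENAME: &str = "heatmap-v1.json";

// Simplified: the real function wraps this in a validated RemotePath.
fn heatmap_key(tenant_shard_id: &str) -> String {
    format!("tenants/{tenant_shard_id}/{HEATMAP_BASENAME}")
}

fn main() {
    assert_eq!(heatmap_key("abc"), "tenants/abc/heatmap-v1.json");
}
```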
/// Given the key of an index, parse out the generation part of the name /// Given the key of an index, parse out the generation part of the name
pub fn parse_remote_index_path(path: RemotePath) -> Option<Generation> { pub fn parse_remote_index_path(path: RemotePath) -> Option<Generation> {
let file_name = match path.get_path().file_name() { let file_name = match path.get_path().file_name() {

View File

@@ -75,12 +75,11 @@ pub async fn download_layer_file<'a>(
let (mut destination_file, bytes_amount) = download_retry( let (mut destination_file, bytes_amount) = download_retry(
|| async { || async {
// TODO: this doesn't use the cached fd for some reason? let destination_file = tokio::fs::File::create(&temp_file_path)
let mut destination_file = fs::File::create(&temp_file_path)
.await .await
.with_context(|| format!("create a destination file for layer '{temp_file_path}'")) .with_context(|| format!("create a destination file for layer '{temp_file_path}'"))
.map_err(DownloadError::Other)?; .map_err(DownloadError::Other)?;
let mut download = storage let download = storage
.download(&remote_path) .download(&remote_path)
.await .await
.with_context(|| { .with_context(|| {
@@ -90,9 +89,14 @@ pub async fn download_layer_file<'a>(
}) })
.map_err(DownloadError::Other)?; .map_err(DownloadError::Other)?;
let mut destination_file =
tokio::io::BufWriter::with_capacity(super::BUFFER_SIZE, destination_file);
let mut reader = tokio_util::io::StreamReader::new(download.download_stream);
let bytes_amount = tokio::time::timeout( let bytes_amount = tokio::time::timeout(
MAX_DOWNLOAD_DURATION, MAX_DOWNLOAD_DURATION,
tokio::io::copy(&mut download.download_stream, &mut destination_file), tokio::io::copy_buf(&mut reader, &mut destination_file),
) )
.await .await
.map_err(|e| DownloadError::Other(anyhow::anyhow!("Timed out {:?}", e)))? .map_err(|e| DownloadError::Other(anyhow::anyhow!("Timed out {:?}", e)))?
@@ -103,6 +107,8 @@ pub async fn download_layer_file<'a>(
}) })
.map_err(DownloadError::Other)?; .map_err(DownloadError::Other)?;
let destination_file = destination_file.into_inner();
Ok((destination_file, bytes_amount)) Ok((destination_file, bytes_amount))
}, },
&format!("download {remote_path:?}"), &format!("download {remote_path:?}"),
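The hunk above wraps the destination file in a sized `BufWriter` and adapts the byte stream with `StreamReader` so `copy_buf` moves data in larger chunks instead of many small writes. The synchronous analogue of that buffering, with a cursor standing in for the download stream:

```rust
use std::io::{self, BufWriter, Cursor, Read, Write};

// Copy `src` into `dst` through a fixed-capacity write buffer, mirroring
// BufWriter::with_capacity(BUFFER_SIZE, file) in the async download path.
fn buffered_copy<R: Read, W: Write>(mut src: R, dst: W, cap: usize) -> io::Result<u64> {
    let mut writer = BufWriter::with_capacity(cap, dst);
    let n = io::copy(&mut src, &mut writer)?;
    // Flush explicitly so buffered bytes reach `dst` before we report success.
    writer.flush()?;
    Ok(n)
}

fn main() -> io::Result<()> {
    let payload = vec![7u8; 100_000];
    let mut out = Vec::new();
    let n = buffered_copy(Cursor::new(&payload), &mut out, 32 * 1024)?;
    assert_eq!(n, 100_000);
    assert_eq!(out, payload);
    Ok(())
}
```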
@@ -220,20 +226,22 @@ async fn do_download_index_part(
index_generation: Generation, index_generation: Generation,
cancel: CancellationToken, cancel: CancellationToken,
) -> Result<IndexPart, DownloadError> { ) -> Result<IndexPart, DownloadError> {
use futures::stream::StreamExt;
let remote_path = remote_index_path(tenant_shard_id, timeline_id, index_generation); let remote_path = remote_index_path(tenant_shard_id, timeline_id, index_generation);
let index_part_bytes = download_retry_forever( let index_part_bytes = download_retry_forever(
 || async {
-    let mut index_part_download = storage.download(&remote_path).await?;
+    let index_part_download = storage.download(&remote_path).await?;
     let mut index_part_bytes = Vec::new();
-    tokio::io::copy(
-        &mut index_part_download.download_stream,
-        &mut index_part_bytes,
-    )
-    .await
-    .with_context(|| format!("download index part at {remote_path:?}"))
-    .map_err(DownloadError::Other)?;
+    let mut stream = std::pin::pin!(index_part_download.download_stream);
+    while let Some(chunk) = stream.next().await {
+        let chunk = chunk
+            .with_context(|| format!("download index part at {remote_path:?}"))
+            .map_err(DownloadError::Other)?;
+        index_part_bytes.extend_from_slice(&chunk[..]);
+    }
     Ok(index_part_bytes)
 },
 &format!("download {remote_path:?}"),
@@ -363,7 +371,7 @@ pub(super) async fn download_index_part(
     None => {
         // Migration from legacy pre-generation state: we have a generation but no prior
         // attached pageservers did. Try to load from a no-generation path.
-        tracing::info!("No index_part.json* found");
+        tracing::debug!("No index_part.json* found");
         do_download_index_part(
             storage,
             tenant_shard_id,
@@ -394,11 +402,13 @@ pub(crate) async fn download_initdb_tar_zst(
         .with_context(|| format!("timeline dir creation {timeline_path}"))
         .map_err(DownloadError::Other)?;
 }
-let temp_path = timeline_path.join(format!("{INITDB_PATH}-{timeline_id}.{TEMP_FILE_SUFFIX}"));
+let temp_path = timeline_path.join(format!(
+    "{INITDB_PATH}.download-{timeline_id}.{TEMP_FILE_SUFFIX}"
+));
 let file = download_retry(
     || async {
-        let mut file = OpenOptions::new()
+        let file = OpenOptions::new()
             .create(true)
             .truncate(true)
             .read(true)
@@ -408,13 +418,17 @@ pub(crate) async fn download_initdb_tar_zst(
             .with_context(|| format!("tempfile creation {temp_path}"))
             .map_err(DownloadError::Other)?;
-        let mut download = storage.download(&remote_path).await?;
-        tokio::io::copy(&mut download.download_stream, &mut file)
+        let download = storage.download(&remote_path).await?;
+        let mut download = tokio_util::io::StreamReader::new(download.download_stream);
+        let mut writer = tokio::io::BufWriter::with_capacity(8 * 1024, file);
+        tokio::io::copy_buf(&mut download, &mut writer)
             .await
             .with_context(|| format!("download initdb.tar.zst at {remote_path:?}"))
             .map_err(DownloadError::Other)?;
+        let mut file = writer.into_inner();
         file.seek(std::io::SeekFrom::Start(0))
             .await
             .with_context(|| format!("rewinding initdb.tar.zst at: {remote_path:?}"))
@@ -426,10 +440,10 @@ pub(crate) async fn download_initdb_tar_zst(
     )
     .await
     .map_err(|e| {
-        if temp_path.exists() {
-            // Do a best-effort attempt at deleting the temporary file upon encountering an error.
-            // We don't have async here nor do we want to pile on any extra errors.
-            if let Err(e) = std::fs::remove_file(&temp_path) {
+        // Do a best-effort attempt at deleting the temporary file upon encountering an error.
+        // We don't have async here nor do we want to pile on any extra errors.
+        if let Err(e) = std::fs::remove_file(&temp_path) {
+            if e.kind() != std::io::ErrorKind::NotFound {
                 warn!("error deleting temporary file {temp_path}: {e}");
             }
         }


@@ -1,12 +1,12 @@
 //! Helper functions to upload files to remote storage with a RemoteStorage
 use anyhow::{bail, Context};
+use bytes::Bytes;
 use camino::Utf8Path;
 use fail::fail_point;
 use pageserver_api::shard::TenantShardId;
-use std::io::ErrorKind;
-use tokio::fs;
+use std::io::{ErrorKind, SeekFrom};
+use tokio::fs::{self, File};
+use tokio::io::AsyncSeekExt;
 use super::Generation;
 use crate::{
@@ -41,11 +41,15 @@ pub(super) async fn upload_index_part<'a>(
     .to_s3_bytes()
     .context("serialize index part file into bytes")?;
 let index_part_size = index_part_bytes.len();
-let index_part_bytes = tokio::io::BufReader::new(std::io::Cursor::new(index_part_bytes));
+let index_part_bytes = bytes::Bytes::from(index_part_bytes);
 let remote_path = remote_index_path(tenant_shard_id, timeline_id, generation);
 storage
-    .upload_storage_object(Box::new(index_part_bytes), index_part_size, &remote_path)
+    .upload_storage_object(
+        futures::stream::once(futures::future::ready(Ok(index_part_bytes))),
+        index_part_size,
+        &remote_path,
+    )
     .await
     .with_context(|| format!("upload index part for '{tenant_shard_id} / {timeline_id}'"))
 }
@@ -101,8 +105,10 @@ pub(super) async fn upload_timeline_layer<'a>(
 let fs_size = usize::try_from(fs_size)
     .with_context(|| format!("convert {source_path:?} size {fs_size} usize"))?;
+let reader = tokio_util::io::ReaderStream::with_capacity(source_file, super::BUFFER_SIZE);
 storage
-    .upload(source_file, fs_size, &storage_path, None)
+    .upload(reader, fs_size, &storage_path, None)
     .await
     .with_context(|| format!("upload layer from local path '{source_path}'"))?;
@@ -114,16 +120,19 @@ pub(crate) async fn upload_initdb_dir(
 storage: &GenericRemoteStorage,
 tenant_id: &TenantId,
 timeline_id: &TimelineId,
-initdb_dir: Bytes,
+mut initdb_tar_zst: File,
+size: u64,
 ) -> anyhow::Result<()> {
 tracing::trace!("uploading initdb dir");
-let size = initdb_dir.len();
-let bytes = tokio::io::BufReader::new(std::io::Cursor::new(initdb_dir));
+// We might have read somewhat into the file already in the prior retry attempt
+initdb_tar_zst.seek(SeekFrom::Start(0)).await?;
+let file = tokio_util::io::ReaderStream::with_capacity(initdb_tar_zst, super::BUFFER_SIZE);
 let remote_path = remote_initdb_archive_path(tenant_id, timeline_id);
 storage
-    .upload_storage_object(bytes, size, &remote_path)
+    .upload_storage_object(file, size as usize, &remote_path)
     .await
     .with_context(|| format!("upload initdb dir for '{tenant_id} / {timeline_id}'"))
 }


@@ -0,0 +1,104 @@
pub mod heatmap;
mod heatmap_uploader;
use std::sync::Arc;
use crate::task_mgr::{self, TaskKind, BACKGROUND_RUNTIME};
use self::heatmap_uploader::heatmap_uploader_task;
use super::mgr::TenantManager;
use pageserver_api::shard::TenantShardId;
use remote_storage::GenericRemoteStorage;
use tokio_util::sync::CancellationToken;
use utils::completion::Barrier;
enum UploadCommand {
Upload(TenantShardId),
}
struct CommandRequest<T> {
payload: T,
response_tx: tokio::sync::oneshot::Sender<CommandResponse>,
}
struct CommandResponse {
result: anyhow::Result<()>,
}
/// The SecondaryController is a pseudo-rpc client for administrative control of secondary mode downloads,
/// and heatmap uploads. This is not a hot data path: it's primarily a hook for tests,
/// where we want to immediately upload/download for a particular tenant. In normal operation
/// uploads & downloads are autonomous and not driven by this interface.
pub struct SecondaryController {
upload_req_tx: tokio::sync::mpsc::Sender<CommandRequest<UploadCommand>>,
}
impl SecondaryController {
async fn dispatch<T>(
&self,
queue: &tokio::sync::mpsc::Sender<CommandRequest<T>>,
payload: T,
) -> anyhow::Result<()> {
let (response_tx, response_rx) = tokio::sync::oneshot::channel();
queue
.send(CommandRequest {
payload,
response_tx,
})
.await
.map_err(|_| anyhow::anyhow!("Receiver shut down"))?;
let response = response_rx
.await
.map_err(|_| anyhow::anyhow!("Request dropped"))?;
response.result
}
pub async fn upload_tenant(&self, tenant_shard_id: TenantShardId) -> anyhow::Result<()> {
self.dispatch(&self.upload_req_tx, UploadCommand::Upload(tenant_shard_id))
.await
}
}
pub fn spawn_tasks(
tenant_manager: Arc<TenantManager>,
remote_storage: GenericRemoteStorage,
background_jobs_can_start: Barrier,
cancel: CancellationToken,
) -> SecondaryController {
let (upload_req_tx, upload_req_rx) =
tokio::sync::mpsc::channel::<CommandRequest<UploadCommand>>(16);
task_mgr::spawn(
BACKGROUND_RUNTIME.handle(),
TaskKind::SecondaryUploads,
None,
None,
"heatmap uploads",
false,
async move {
heatmap_uploader_task(
tenant_manager,
remote_storage,
upload_req_rx,
background_jobs_can_start,
cancel,
)
.await
},
);
SecondaryController { upload_req_tx }
}
/// For running with remote storage disabled: a SecondaryController that is connected to nothing.
pub fn null_controller() -> SecondaryController {
let (upload_req_tx, _upload_req_rx) =
tokio::sync::mpsc::channel::<CommandRequest<UploadCommand>>(16);
SecondaryController { upload_req_tx }
}


@@ -0,0 +1,64 @@
use std::time::SystemTime;
use crate::tenant::{
remote_timeline_client::index::IndexLayerMetadata, storage_layer::LayerFileName,
};
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr, TimestampSeconds};
use utils::{generation::Generation, id::TimelineId};
#[derive(Serialize, Deserialize)]
pub(super) struct HeatMapTenant {
/// Generation of the attached location that uploaded the heatmap: this is not required
/// for correctness, but acts as a hint to secondary locations in order to detect thrashing
/// in the unlikely event that two attached locations are both uploading conflicting heatmaps.
pub(super) generation: Generation,
pub(super) timelines: Vec<HeatMapTimeline>,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub(crate) struct HeatMapTimeline {
#[serde_as(as = "DisplayFromStr")]
pub(super) timeline_id: TimelineId,
pub(super) layers: Vec<HeatMapLayer>,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub(crate) struct HeatMapLayer {
pub(super) name: LayerFileName,
pub(super) metadata: IndexLayerMetadata,
#[serde_as(as = "TimestampSeconds<i64>")]
pub(super) access_time: SystemTime,
// TODO: an actual 'heat' score that would let secondary locations prioritize downloading
// the hottest layers, rather than trying to simply mirror whatever layers are on-disk on the primary.
}
impl HeatMapLayer {
pub(crate) fn new(
name: LayerFileName,
metadata: IndexLayerMetadata,
access_time: SystemTime,
) -> Self {
Self {
name,
metadata,
access_time,
}
}
}
impl HeatMapTimeline {
pub(crate) fn new(timeline_id: TimelineId, layers: Vec<HeatMapLayer>) -> Self {
Self {
timeline_id,
layers,
}
}
}


@@ -0,0 +1,582 @@
use std::{
collections::HashMap,
sync::{Arc, Weak},
time::{Duration, Instant},
};
use crate::{
metrics::SECONDARY_MODE,
tenant::{
config::AttachmentMode, mgr::TenantManager, remote_timeline_client::remote_heatmap_path,
secondary::CommandResponse, span::debug_assert_current_span_has_tenant_id, Tenant,
},
};
use md5;
use pageserver_api::shard::TenantShardId;
use remote_storage::GenericRemoteStorage;
use tokio::task::JoinSet;
use tokio_util::sync::CancellationToken;
use tracing::instrument;
use utils::{backoff, completion::Barrier};
use super::{heatmap::HeatMapTenant, CommandRequest, UploadCommand};
/// Period between heatmap uploader walking Tenants to look for work to do.
/// If any tenants have a heatmap upload period lower than this, it will be adjusted
/// downward to match.
const DEFAULT_SCHEDULING_INTERVAL: Duration = Duration::from_millis(60000);
const MIN_SCHEDULING_INTERVAL: Duration = Duration::from_millis(1000);
struct WriteInProgress {
barrier: Barrier,
}
struct UploadPending {
tenant: Arc<Tenant>,
last_digest: Option<md5::Digest>,
}
struct WriteComplete {
tenant_shard_id: TenantShardId,
completed_at: Instant,
digest: Option<md5::Digest>,
next_upload: Option<Instant>,
}
/// The heatmap uploader keeps a little bit of per-tenant state, mainly to remember
/// when we last did a write. We only populate this after doing at least one
/// write for a tenant -- this avoids holding state for tenants that have
/// uploads disabled.
struct UploaderTenantState {
// This Weak only exists to enable culling idle instances of this type
// when the Tenant has been deallocated.
tenant: Weak<Tenant>,
/// Digest of the serialized heatmap that we last successfully uploaded
///
/// md5 is generally a bad hash. We use it because it's convenient for interop with AWS S3's ETag,
/// which is also an md5sum.
last_digest: Option<md5::Digest>,
/// When the last upload attempt completed (may have been successful or failed)
last_upload: Option<Instant>,
/// When should we next do an upload? None means never.
next_upload: Option<Instant>,
}
/// This type is owned by a single task ([`heatmap_uploader_task`]) which runs an event
/// handling loop and mutates it as needed: there are no locks here, because that event loop
/// can hold &mut references to this type throughout.
struct HeatmapUploader {
tenant_manager: Arc<TenantManager>,
remote_storage: GenericRemoteStorage,
cancel: CancellationToken,
tenants: HashMap<TenantShardId, UploaderTenantState>,
/// Tenants with work to do, for which tasks should be spawned as soon as concurrency
/// limits permit it.
tenants_pending: std::collections::VecDeque<UploadPending>,
/// Tenants for which a task in `tasks` has been spawned.
tenants_uploading: HashMap<TenantShardId, WriteInProgress>,
tasks: JoinSet<()>,
/// Channel for our child tasks to send results to: we use a channel for results rather than
/// just getting task results via JoinSet because we need the channel's recv() "sleep until something
/// is available" semantic, rather than JoinSet::join_next()'s "sleep until next thing is available _or_ I'm empty"
/// behavior.
task_result_tx: tokio::sync::mpsc::UnboundedSender<WriteComplete>,
task_result_rx: tokio::sync::mpsc::UnboundedReceiver<WriteComplete>,
concurrent_uploads: usize,
scheduling_interval: Duration,
}
/// The uploader task runs a loop that periodically wakes up and schedules tasks for
/// tenants that require an upload, or handles any commands that have been sent into
/// `command_queue`. No I/O is done in this loop: that all happens in the tasks we
/// spawn.
///
/// Scheduling iterations are somewhat infrequent. However, each one will enqueue
/// all tenants that require an upload, and in between scheduling iterations we will
/// continue to spawn new tasks for pending tenants, as our concurrency limit permits.
///
/// While we take a CancellationToken here, it is subordinate to the CancellationTokens
/// of tenants: i.e. we expect all Tenants to have been shut down before we are shut down, otherwise
/// we might block waiting on a Tenant.
pub(super) async fn heatmap_uploader_task(
tenant_manager: Arc<TenantManager>,
remote_storage: GenericRemoteStorage,
mut command_queue: tokio::sync::mpsc::Receiver<CommandRequest<UploadCommand>>,
background_jobs_can_start: Barrier,
cancel: CancellationToken,
) -> anyhow::Result<()> {
let concurrent_uploads = tenant_manager.get_conf().heatmap_upload_concurrency;
let (result_tx, result_rx) = tokio::sync::mpsc::unbounded_channel();
let mut uploader = HeatmapUploader {
tenant_manager,
remote_storage,
cancel: cancel.clone(),
tasks: JoinSet::new(),
tenants: HashMap::new(),
tenants_pending: std::collections::VecDeque::new(),
tenants_uploading: HashMap::new(),
task_result_tx: result_tx,
task_result_rx: result_rx,
concurrent_uploads,
scheduling_interval: DEFAULT_SCHEDULING_INTERVAL,
};
tracing::info!("Waiting for background_jobs_can_start...");
background_jobs_can_start.wait().await;
tracing::info!("background_jobs_can_start is ready, proceeding.");
while !cancel.is_cancelled() {
// Look for new work: this is relatively expensive because we have to go acquire the lock on
// the tenant manager to retrieve tenants, and then iterate over them to figure out which ones
// require an upload.
uploader.schedule_iteration().await?;
// Between scheduling iterations, we will:
// - Drain any complete tasks and spawn pending tasks
// - Handle incoming administrative commands
// - Check our cancellation token
let next_scheduling_iteration = Instant::now()
.checked_add(uploader.scheduling_interval)
.unwrap_or_else(|| {
tracing::warn!(
"Scheduling interval invalid ({}s), running immediately!",
uploader.scheduling_interval.as_secs_f64()
);
Instant::now()
});
loop {
tokio::select! {
_ = cancel.cancelled() => {
// We do not simply drop the JoinSet, in order to have an orderly shutdown without cancellation.
tracing::info!("Heatmap uploader joining tasks");
while let Some(_r) = uploader.tasks.join_next().await {};
tracing::info!("Heatmap uploader terminating");
break;
},
_ = tokio::time::sleep(next_scheduling_iteration.duration_since(Instant::now())) => {
tracing::debug!("heatmap_uploader_task: woke for scheduling interval");
break;
},
cmd = command_queue.recv() => {
tracing::debug!("heatmap_uploader_task: woke for command queue");
let cmd = match cmd {
Some(c) => c,
None => {
// SecondaryController was destroyed, and this has raced with
// our CancellationToken
tracing::info!("Heatmap uploader terminating");
cancel.cancel();
break;
}
};
let CommandRequest{
response_tx,
payload
} = cmd;
uploader.handle_command(payload, response_tx);
},
_ = uploader.process_next_completion() => {
if !cancel.is_cancelled() {
uploader.spawn_pending();
}
}
}
}
}
Ok(())
}
impl HeatmapUploader {
/// Periodic execution phase: inspect all attached tenants and schedule any work they require.
async fn schedule_iteration(&mut self) -> anyhow::Result<()> {
// Cull any entries in self.tenants whose Arc<Tenant> is gone
self.tenants
.retain(|_k, v| v.tenant.upgrade().is_some() && v.next_upload.is_some());
// The priority order of previously scheduled work may be invalidated by current state: drop
// all pending work (it will be re-scheduled if still needed)
self.tenants_pending.clear();
// Use a fixed 'now' throughout the following loop, for efficiency and fairness.
let now = Instant::now();
// While iterating over the potentially-long list of tenants, we will periodically yield
// to avoid blocking executor.
const YIELD_ITERATIONS: usize = 1000;
// Iterate over tenants looking for work to do.
let tenants = self.tenant_manager.get_attached_active_tenant_shards();
for (i, tenant) in tenants.into_iter().enumerate() {
// Process is shutting down, drop out
if self.cancel.is_cancelled() {
return Ok(());
}
// Skip tenants that already have a write in flight
if self
.tenants_uploading
.contains_key(tenant.get_tenant_shard_id())
{
continue;
}
self.maybe_schedule_upload(&now, tenant);
if (i + 1) % YIELD_ITERATIONS == 0 {
tokio::task::yield_now().await;
}
}
// Spawn tasks for as many of our pending tenants as we can.
self.spawn_pending();
Ok(())
}
/// Wait for the next background upload task to complete and handle its result.
///
/// Cancellation: this method is cancel-safe.
async fn process_next_completion(&mut self) {
match self.task_result_rx.recv().await {
Some(r) => {
self.on_completion(r);
}
None => {
unreachable!("Result sender is stored on Self");
}
}
}
/// The 'maybe' refers to the tenant's state: whether it is configured
/// for heatmap uploads at all, and whether sufficient time has passed
/// since the last upload.
fn maybe_schedule_upload(&mut self, now: &Instant, tenant: Arc<Tenant>) {
match tenant.get_heatmap_period() {
None => {
// Heatmaps are disabled for this tenant
return;
}
Some(period) => {
// If any tenant has asked for uploads more frequent than our scheduling interval,
// reduce it to match so that we can keep up. This is mainly useful in testing, where
// we may set rather short intervals.
if period < self.scheduling_interval {
self.scheduling_interval = std::cmp::max(period, MIN_SCHEDULING_INTERVAL);
}
}
}
// Stale attachments do not upload anything: if we are in this state, there is probably some
// other attachment in mode Single or Multi running on another pageserver, and we don't
// want to thrash and overwrite their heatmap uploads.
if tenant.get_attach_mode() == AttachmentMode::Stale {
return;
}
// Create an entry in self.tenants if one doesn't already exist: this will later be updated
// with the completion time in on_completion.
let state = self
.tenants
.entry(*tenant.get_tenant_shard_id())
.or_insert_with(|| UploaderTenantState {
tenant: Arc::downgrade(&tenant),
last_upload: None,
next_upload: Some(Instant::now()),
last_digest: None,
});
// Decline to do the upload if insufficient time has passed
if state.next_upload.map(|nu| &nu > now).unwrap_or(false) {
return;
}
let last_digest = state.last_digest;
self.tenants_pending.push_back(UploadPending {
tenant,
last_digest,
})
}
fn spawn_pending(&mut self) {
while !self.tenants_pending.is_empty()
&& self.tenants_uploading.len() < self.concurrent_uploads
{
// unwrap: loop condition includes !is_empty()
let pending = self.tenants_pending.pop_front().unwrap();
self.spawn_upload(pending.tenant, pending.last_digest);
}
}
fn spawn_upload(&mut self, tenant: Arc<Tenant>, last_digest: Option<md5::Digest>) {
let remote_storage = self.remote_storage.clone();
let tenant_shard_id = *tenant.get_tenant_shard_id();
let (completion, barrier) = utils::completion::channel();
let result_tx = self.task_result_tx.clone();
self.tasks.spawn(async move {
// Guard for the barrier in [`WriteInProgress`]
let _completion = completion;
let started_at = Instant::now();
let digest = match upload_tenant_heatmap(remote_storage, &tenant, last_digest).await {
Ok(UploadHeatmapOutcome::Uploaded(digest)) => {
let duration = Instant::now().duration_since(started_at);
SECONDARY_MODE
.upload_heatmap_duration
.observe(duration.as_secs_f64());
SECONDARY_MODE.upload_heatmap.inc();
Some(digest)
}
Ok(UploadHeatmapOutcome::NoChange | UploadHeatmapOutcome::Skipped) => last_digest,
Err(UploadHeatmapError::Upload(e)) => {
tracing::warn!(
"Failed to upload heatmap for tenant {}: {e:#}",
tenant.get_tenant_shard_id(),
);
let duration = Instant::now().duration_since(started_at);
SECONDARY_MODE
.upload_heatmap_duration
.observe(duration.as_secs_f64());
SECONDARY_MODE.upload_heatmap_errors.inc();
last_digest
}
Err(UploadHeatmapError::Cancelled) => {
tracing::info!("Cancelled heatmap upload, shutting down");
last_digest
}
};
let now = Instant::now();
let next_upload = tenant
.get_heatmap_period()
.and_then(|period| now.checked_add(period));
result_tx
.send(WriteComplete {
tenant_shard_id: *tenant.get_tenant_shard_id(),
completed_at: now,
digest,
next_upload,
})
.ok();
});
self.tenants_uploading
.insert(tenant_shard_id, WriteInProgress { barrier });
}
#[instrument(skip_all, fields(tenant_id=%completion.tenant_shard_id.tenant_id, shard_id=%completion.tenant_shard_id.shard_slug()))]
fn on_completion(&mut self, completion: WriteComplete) {
tracing::debug!("Heatmap upload completed");
let WriteComplete {
tenant_shard_id,
completed_at,
digest,
next_upload,
} = completion;
self.tenants_uploading.remove(&tenant_shard_id);
use std::collections::hash_map::Entry;
match self.tenants.entry(tenant_shard_id) {
Entry::Vacant(_) => {
// Tenant state was dropped, nothing to update.
}
Entry::Occupied(mut entry) => {
entry.get_mut().last_upload = Some(completed_at);
entry.get_mut().last_digest = digest;
entry.get_mut().next_upload = next_upload
}
}
}
fn handle_command(
&mut self,
command: UploadCommand,
response_tx: tokio::sync::oneshot::Sender<CommandResponse>,
) {
match command {
UploadCommand::Upload(tenant_shard_id) => {
// If an upload was ongoing for this tenant, let it finish first.
let barrier = if let Some(writing_state) =
self.tenants_uploading.get(&tenant_shard_id)
{
tracing::info!(
tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(),
"Waiting for heatmap write to complete");
writing_state.barrier.clone()
} else {
// Spawn the upload then immediately wait for it. This will block processing of other commands and
// starting of other background work.
tracing::info!(
tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(),
"Starting heatmap write on command");
let tenant = match self
.tenant_manager
.get_attached_tenant_shard(tenant_shard_id, true)
{
Ok(t) => t,
Err(e) => {
// Drop result of send: we don't care if caller dropped their receiver
drop(response_tx.send(CommandResponse {
result: Err(e.into()),
}));
return;
}
};
self.spawn_upload(tenant, None);
let writing_state = self
.tenants_uploading
.get(&tenant_shard_id)
.expect("We just inserted this");
tracing::info!(
tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(),
"Waiting for heatmap upload to complete");
writing_state.barrier.clone()
};
// This task does no I/O: it only listens for a barrier's completion and then
// sends to the command response channel. It is therefore safe to spawn this without
// any gates/task_mgr hooks.
tokio::task::spawn(async move {
barrier.wait().await;
tracing::info!(
tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(),
"Heatmap upload complete");
// Drop result of send: we don't care if caller dropped their receiver
drop(response_tx.send(CommandResponse { result: Ok(()) }))
});
}
}
}
}
enum UploadHeatmapOutcome {
/// We successfully wrote to remote storage, with this digest.
Uploaded(md5::Digest),
/// We did not upload because the heatmap digest was unchanged since the last upload
NoChange,
/// We skipped the upload for some reason, such as tenant/timeline not ready
Skipped,
}
#[derive(thiserror::Error, Debug)]
enum UploadHeatmapError {
#[error("Cancelled")]
Cancelled,
#[error(transparent)]
Upload(#[from] anyhow::Error),
}
/// The inner upload operation. This will skip if `last_digest` is Some and matches the digest
/// of the object we would have uploaded.
#[instrument(skip_all, fields(tenant_id = %tenant.get_tenant_shard_id().tenant_id, shard_id = %tenant.get_tenant_shard_id().shard_slug()))]
async fn upload_tenant_heatmap(
remote_storage: GenericRemoteStorage,
tenant: &Arc<Tenant>,
last_digest: Option<md5::Digest>,
) -> Result<UploadHeatmapOutcome, UploadHeatmapError> {
debug_assert_current_span_has_tenant_id();
let generation = tenant.get_generation();
if generation.is_none() {
// We do not expect this: generations were implemented before heatmap uploads. However,
// handle it so that we don't have to make the generation in the heatmap an Option<>
// (Generation::none is not serializable)
tracing::warn!("Skipping heatmap upload for tenant with generation==None");
return Ok(UploadHeatmapOutcome::Skipped);
}
let mut heatmap = HeatMapTenant {
timelines: Vec::new(),
generation,
};
let timelines = tenant.timelines.lock().unwrap().clone();
let tenant_cancel = tenant.cancel.clone();
// Ensure that Tenant::shutdown waits for any upload in flight: this is needed because otherwise
// when we delete a tenant, we might race with an upload in flight and end up leaving a heatmap behind
// in remote storage.
let _guard = match tenant.gate.enter() {
Ok(g) => g,
Err(_) => {
tracing::info!("Skipping heatmap upload for tenant which is shutting down");
return Err(UploadHeatmapError::Cancelled);
}
};
for (timeline_id, timeline) in timelines {
let heatmap_timeline = timeline.generate_heatmap().await;
match heatmap_timeline {
None => {
tracing::debug!(
"Skipping heatmap upload because timeline {timeline_id} is not ready"
);
return Ok(UploadHeatmapOutcome::Skipped);
}
Some(heatmap_timeline) => {
heatmap.timelines.push(heatmap_timeline);
}
}
}
// Serialize the heatmap
let bytes = serde_json::to_vec(&heatmap).map_err(|e| anyhow::anyhow!(e))?;
let size = bytes.len();
// Drop out early if nothing changed since our last upload
let digest = md5::compute(&bytes);
if Some(digest) == last_digest {
return Ok(UploadHeatmapOutcome::NoChange);
}
let path = remote_heatmap_path(tenant.get_tenant_shard_id());
// Write the heatmap.
tracing::debug!("Uploading {size} byte heatmap to {path}");
if let Err(e) = backoff::retry(
|| async {
let bytes = futures::stream::once(futures::future::ready(Ok(bytes::Bytes::from(
bytes.clone(),
))));
remote_storage
.upload_storage_object(bytes, size, &path)
.await
},
|_| false,
3,
u32::MAX,
"Uploading heatmap",
backoff::Cancel::new(tenant_cancel.clone(), || anyhow::anyhow!("Shutting down")),
)
.await
{
if tenant_cancel.is_cancelled() {
return Err(UploadHeatmapError::Cancelled);
} else {
return Err(e.into());
}
}
tracing::info!("Successfully uploaded {size} byte heatmap to {path}");
Ok(UploadHeatmapOutcome::Uploaded(digest))
}


@@ -4,7 +4,7 @@ pub mod delta_layer;
 mod filename;
 pub mod image_layer;
 mod inmemory_layer;
-mod layer;
+pub(crate) mod layer;
 mod layer_desc;
 use crate::context::{AccessStatsBehavior, RequestContext};


@@ -222,14 +222,18 @@ impl Layer {
 ///
 /// [gc]: [`RemoteTimelineClient::schedule_gc_update`]
 /// [compaction]: [`RemoteTimelineClient::schedule_compaction_update`]
-pub(crate) fn garbage_collect_on_drop(&self) {
-    self.0.garbage_collect_on_drop();
+pub(crate) fn delete_on_drop(&self) {
+    self.0.delete_on_drop();
 }
 /// Return data needed to reconstruct given page at LSN.
 ///
 /// It is up to the caller to collect more data from the previous layer and
 /// perform WAL redo, if necessary.
+///
+/// # Cancellation-Safety
+///
+/// This method is cancellation-safe.
 pub(crate) async fn get_value_reconstruct_data(
     &self,
     key: Key,
@@ -327,10 +331,10 @@ impl Layer {
     Ok(())
 }
-/// Waits until this layer has been dropped (and if needed, local garbage collection and remote
+/// Waits until this layer has been dropped (and if needed, local file deletion and remote
 /// deletion scheduling has completed).
 ///
-/// Does not start garbage collection, use [`Self::garbage_collect_on_drop`] for that
+/// Does not start local deletion, use [`Self::delete_on_drop`] for that
 /// separatedly.
 #[cfg(feature = "testing")]
 pub(crate) fn wait_drop(&self) -> impl std::future::Future<Output = ()> + 'static {
@@ -419,8 +423,8 @@ struct LayerInner {
 /// Initialization and deinitialization are done while holding a permit.
 inner: heavier_once_cell::OnceCell<ResidentOrWantedEvicted>,
-/// Do we want to garbage collect this when `LayerInner` is dropped
-wanted_garbage_collected: AtomicBool,
+/// Do we want to delete locally and remotely this when `LayerInner` is dropped
+wanted_deleted: AtomicBool,
 /// Do we want to evict this layer as soon as possible? After being set to `true`, all accesses
 /// will try to downgrade [`ResidentOrWantedEvicted`], which will eventually trigger
@@ -434,10 +438,6 @@ struct LayerInner {
 version: AtomicUsize,
 /// Allow subscribing to when the layer actually gets evicted.
-///
-/// If in future we need to implement "wait until layer instances are gone and done", carrying
-/// this over to the gc spawn_blocking from LayerInner::drop will do the trick, and adding a
-/// method for "wait_gc" which will wait to this being closed.
 status: tokio::sync::broadcast::Sender<Status>,
 /// Counter for exponential backoff with the download
@@ -457,6 +457,8 @@ struct LayerInner {
 /// For loaded layers, this may be some other value if the tenant has undergone
 /// a shard split since the layer was originally written.
 shard: ShardIndex,
+last_evicted_at: std::sync::Mutex<Option<std::time::Instant>>,
 }
 impl std::fmt::Display for LayerInner {
@@ -479,14 +481,14 @@ enum Status {
 impl Drop for LayerInner {
     fn drop(&mut self) {
-        if !*self.wanted_garbage_collected.get_mut() {
+        if !*self.wanted_deleted.get_mut() {
             // should we try to evict if the last wish was for eviction?
             // feels like there's some hazard of overcrowding near shutdown near by, but we don't
             // run drops during shutdown (yet)
             return;
         }
-        let span = tracing::info_span!(parent: None, "layer_gc", tenant_id = %self.layer_desc().tenant_shard_id.tenant_id, shard_id=%self.layer_desc().tenant_shard_id.shard_slug(), timeline_id = %self.layer_desc().timeline_id);
+        let span = tracing::info_span!(parent: None, "layer_delete", tenant_id = %self.layer_desc().tenant_shard_id.tenant_id, shard_id=%self.layer_desc().tenant_shard_id.shard_slug(), timeline_id = %self.layer_desc().timeline_id);
         let path = std::mem::take(&mut self.path);
         let file_name = self.layer_desc().filename();
@@ -513,8 +515,8 @@ impl Drop for LayerInner {
         false
     }
     Err(e) => {
-        tracing::error!("failed to remove garbage collected layer: {e}");
-        LAYER_IMPL_METRICS.inc_gc_removes_failed();
+        tracing::error!("failed to remove wanted deleted layer: {e}");
+        LAYER_IMPL_METRICS.inc_delete_removes_failed();
         false
     }
 };
@@ -536,15 +538,15 @@ impl Drop for LayerInner {
     } else {
         tracing::warn!("scheduling deletion on drop failed: {e:#}");
     }
-    LAYER_IMPL_METRICS.inc_gcs_failed(GcFailed::DeleteSchedulingFailed);
+    LAYER_IMPL_METRICS.inc_deletes_failed(DeleteFailed::DeleteSchedulingFailed);
 } else {
-    LAYER_IMPL_METRICS.inc_completed_gcs();
+    LAYER_IMPL_METRICS.inc_completed_deletes();
 }
 }
 } else {
     // no need to nag that timeline is gone: under normal situation on
     // task_mgr::remove_tenant_from_memory the timeline is gone before we get dropped.
-    LAYER_IMPL_METRICS.inc_gcs_failed(GcFailed::TimelineGone);
+    LAYER_IMPL_METRICS.inc_deletes_failed(DeleteFailed::TimelineGone);
 }
 });
 }
@@ -579,7 +581,7 @@ impl LayerInner {
 timeline: Arc::downgrade(timeline),
 have_remote_client: timeline.remote_client.is_some(),
access_stats, access_stats,
wanted_garbage_collected: AtomicBool::new(false), wanted_deleted: AtomicBool::new(false),
wanted_evicted: AtomicBool::new(false), wanted_evicted: AtomicBool::new(false),
inner, inner,
version: AtomicUsize::new(version), version: AtomicUsize::new(version),
@@ -587,19 +589,17 @@ impl LayerInner {
consecutive_failures: AtomicUsize::new(0), consecutive_failures: AtomicUsize::new(0),
generation, generation,
shard, shard,
last_evicted_at: std::sync::Mutex::default(),
} }
} }
fn garbage_collect_on_drop(&self) { fn delete_on_drop(&self) {
let res = self.wanted_garbage_collected.compare_exchange( let res =
false, self.wanted_deleted
true, .compare_exchange(false, true, Ordering::Release, Ordering::Relaxed);
Ordering::Release,
Ordering::Relaxed,
);
if res.is_ok() { if res.is_ok() {
LAYER_IMPL_METRICS.inc_started_gcs(); LAYER_IMPL_METRICS.inc_started_deletes();
} }
} }
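The renamed `delete_on_drop` above relies on a compare-and-swap so that only the first caller flips the flag and the started-deletes metric counts each layer at most once. A std-only sketch of that pattern (the `Layer` struct and `STARTED_DELETES` counter here are illustrative stand-ins, not the pageserver types):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Hypothetical stand-ins for LayerInner's flag and LAYER_IMPL_METRICS.
struct Layer {
    wanted_deleted: AtomicBool,
}

static STARTED_DELETES: AtomicUsize = AtomicUsize::new(0);

impl Layer {
    // Mirrors `delete_on_drop`: only the caller whose CAS succeeds
    // increments the metric, so repeated calls count the layer once.
    fn delete_on_drop(&self) {
        let res = self
            .wanted_deleted
            .compare_exchange(false, true, Ordering::Release, Ordering::Relaxed);
        if res.is_ok() {
            STARTED_DELETES.fetch_add(1, Ordering::Relaxed);
        }
    }
}

fn main() {
    let layer = Layer { wanted_deleted: AtomicBool::new(false) };
    layer.delete_on_drop();
    layer.delete_on_drop(); // second call loses the CAS race: no double count
    assert_eq!(STARTED_DELETES.load(Ordering::Relaxed), 1);
}
```

The `Release`/`Relaxed` ordering pair matches the diff: the success ordering publishes the flag for the later `Acquire` load in `on_downloaded_layer_drop`, while a failed CAS needs no synchronization.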
@@ -667,6 +667,10 @@ impl LayerInner {
// disable any scheduled but not yet running eviction deletions for this // disable any scheduled but not yet running eviction deletions for this
let next_version = 1 + self.version.fetch_add(1, Ordering::Relaxed); let next_version = 1 + self.version.fetch_add(1, Ordering::Relaxed);
// count cancellations, which currently remain largely unexpected
let init_cancelled =
scopeguard::guard((), |_| LAYER_IMPL_METRICS.inc_init_cancelled());
// no need to make the evict_and_wait wait for the actual download to complete // no need to make the evict_and_wait wait for the actual download to complete
drop(self.status.send(Status::Downloaded)); drop(self.status.send(Status::Downloaded));
@@ -675,6 +679,8 @@ impl LayerInner {
.upgrade() .upgrade()
.ok_or_else(|| DownloadError::TimelineShutdown)?; .ok_or_else(|| DownloadError::TimelineShutdown)?;
// FIXME: grab a gate
let can_ever_evict = timeline.remote_client.as_ref().is_some(); let can_ever_evict = timeline.remote_client.as_ref().is_some();
// check if we really need to be downloaded; could have been already downloaded by a // check if we really need to be downloaded; could have been already downloaded by a
@@ -719,6 +725,14 @@ impl LayerInner {
permit permit
}; };
let since_last_eviction =
self.last_evicted_at.lock().unwrap().map(|ts| ts.elapsed());
if let Some(since_last_eviction) = since_last_eviction {
// FIXME: this will not always be recorded correctly until #6028 (the no
// download needed branch above)
LAYER_IMPL_METRICS.record_redownloaded_after(since_last_eviction);
}
let res = Arc::new(DownloadedLayer { let res = Arc::new(DownloadedLayer {
owner: Arc::downgrade(self), owner: Arc::downgrade(self),
kind: tokio::sync::OnceCell::default(), kind: tokio::sync::OnceCell::default(),
@@ -735,6 +749,8 @@ impl LayerInner {
tracing::info!(waiters, "completing the on-demand download for other tasks"); tracing::info!(waiters, "completing the on-demand download for other tasks");
} }
scopeguard::ScopeGuard::into_inner(init_cancelled);
Ok((ResidentOrWantedEvicted::Resident(res), permit)) Ok((ResidentOrWantedEvicted::Resident(res), permit))
}; };
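The `init_cancelled` change above counts cancellations by arming a `scopeguard::guard` at the top of the init path and defusing it with `ScopeGuard::into_inner` just before the successful return; if the future is dropped mid-way, the guard's closure runs and the counter ticks. A std-only re-implementation of that drop-guard idea (names are illustrative, and a real cancellation would be the async task being dropped rather than an early return):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static INITS_CANCELLED: AtomicUsize = AtomicUsize::new(0);

// Minimal drop guard in the spirit of `scopeguard::guard((), |_| ...)`.
struct CancelCounter {
    armed: bool,
}

impl CancelCounter {
    fn new() -> Self {
        CancelCounter { armed: true }
    }
    // Equivalent of `scopeguard::ScopeGuard::into_inner`: disarm on success.
    fn defuse(mut self) {
        self.armed = false;
    }
}

impl Drop for CancelCounter {
    fn drop(&mut self) {
        if self.armed {
            INITS_CANCELLED.fetch_add(1, Ordering::Relaxed);
        }
    }
}

fn init(succeed: bool) {
    let guard = CancelCounter::new();
    if !succeed {
        return; // early exit stands in for the future being dropped: counted
    }
    guard.defuse(); // reached the end of init: not a cancellation
}

fn main() {
    init(true);
    init(false);
    assert_eq!(INITS_CANCELLED.load(Ordering::Relaxed), 1);
}
```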
@@ -832,7 +848,7 @@ impl LayerInner {
crate::task_mgr::spawn( crate::task_mgr::spawn(
&tokio::runtime::Handle::current(), &tokio::runtime::Handle::current(),
crate::task_mgr::TaskKind::RemoteDownloadTask, crate::task_mgr::TaskKind::RemoteDownloadTask,
Some(self.desc.tenant_shard_id.tenant_id), Some(self.desc.tenant_shard_id),
Some(self.desc.timeline_id), Some(self.desc.timeline_id),
&task_name, &task_name,
false, false,
@@ -863,14 +879,13 @@ impl LayerInner {
match res { match res {
(Ok(()), _) => { (Ok(()), _) => {
// our caller is cancellation safe so this is fine; if someone // our caller is cancellation safe so this is fine; if someone
// else requests the layer, they'll find it already downloaded // else requests the layer, they'll find it already downloaded.
// or redownload.
// //
// however, could be that we should consider marking the layer // See counter [`LayerImplMetrics::inc_init_needed_no_download`]
// for eviction? alas, cannot: because only DownloadedLayer //
// will handle that. // FIXME(#6028): however, could be that we should consider marking the
tracing::info!("layer file download completed after requester had cancelled"); // layer for eviction? alas, cannot: because only DownloadedLayer will
LAYER_IMPL_METRICS.inc_download_completed_without_requester(); // handle that.
}, },
(Err(e), _) => { (Err(e), _) => {
// our caller is cancellation safe, but we might be racing with // our caller is cancellation safe, but we might be racing with
@@ -990,12 +1005,15 @@ impl LayerInner {
/// `DownloadedLayer` is being dropped, so it calls this method. /// `DownloadedLayer` is being dropped, so it calls this method.
fn on_downloaded_layer_drop(self: Arc<LayerInner>, version: usize) { fn on_downloaded_layer_drop(self: Arc<LayerInner>, version: usize) {
let gc = self.wanted_garbage_collected.load(Ordering::Acquire); let delete = self.wanted_deleted.load(Ordering::Acquire);
let evict = self.wanted_evicted.load(Ordering::Acquire); let evict = self.wanted_evicted.load(Ordering::Acquire);
let can_evict = self.have_remote_client; let can_evict = self.have_remote_client;
if gc { if delete {
// do nothing now, only in LayerInner::drop // do nothing now, only in LayerInner::drop -- this was originally implemented because
// we could have already scheduled the deletion at the time.
//
// FIXME: this is not true anymore, we can safely evict wanted deleted files.
} else if can_evict && evict { } else if can_evict && evict {
let span = tracing::info_span!(parent: None, "layer_evict", tenant_id = %self.desc.tenant_shard_id.tenant_id, shard_id = %self.desc.tenant_shard_id.shard_slug(), timeline_id = %self.desc.timeline_id, layer=%self, %version); let span = tracing::info_span!(parent: None, "layer_evict", tenant_id = %self.desc.tenant_shard_id.tenant_id, shard_id = %self.desc.tenant_shard_id.shard_slug(), timeline_id = %self.desc.timeline_id, layer=%self, %version);
@@ -1010,7 +1028,7 @@ impl LayerInner {
crate::task_mgr::BACKGROUND_RUNTIME.spawn_blocking(move || { crate::task_mgr::BACKGROUND_RUNTIME.spawn_blocking(move || {
let _g = span.entered(); let _g = span.entered();
// if LayerInner is already dropped here, do nothing because the garbage collection // if LayerInner is already dropped here, do nothing because the delete on drop
// has already run while we were in queue // has already run while we were in queue
let Some(this) = this.upgrade() else { let Some(this) = this.upgrade() else {
LAYER_IMPL_METRICS.inc_eviction_cancelled(EvictionCancelled::LayerGone); LAYER_IMPL_METRICS.inc_eviction_cancelled(EvictionCancelled::LayerGone);
@@ -1110,6 +1128,8 @@ impl LayerInner {
// we are still holding the permit, so no new spawn_download_and_wait can happen // we are still holding the permit, so no new spawn_download_and_wait can happen
drop(self.status.send(Status::Evicted)); drop(self.status.send(Status::Evicted));
*self.last_evicted_at.lock().unwrap() = Some(std::time::Instant::now());
res res
} }
@@ -1401,35 +1421,38 @@ impl From<ResidentLayer> for Layer {
} }
} }
use metrics::{IntCounter, IntCounterVec}; use metrics::IntCounter;
struct LayerImplMetrics { pub(crate) struct LayerImplMetrics {
started_evictions: IntCounter, started_evictions: IntCounter,
completed_evictions: IntCounter, completed_evictions: IntCounter,
cancelled_evictions: IntCounterVec, cancelled_evictions: enum_map::EnumMap<EvictionCancelled, IntCounter>,
started_gcs: IntCounter, started_deletes: IntCounter,
completed_gcs: IntCounter, completed_deletes: IntCounter,
failed_gcs: IntCounterVec, failed_deletes: enum_map::EnumMap<DeleteFailed, IntCounter>,
rare_counters: IntCounterVec, rare_counters: enum_map::EnumMap<RareEvent, IntCounter>,
inits_cancelled: metrics::core::GenericCounter<metrics::core::AtomicU64>,
redownload_after: metrics::Histogram,
} }
impl Default for LayerImplMetrics { impl Default for LayerImplMetrics {
fn default() -> Self { fn default() -> Self {
let evictions = metrics::register_int_counter_vec!( use enum_map::Enum;
"pageserver_layer_evictions_count",
"Evictions started and completed in the Layer implementation", // reminder: these will be pageserver_layer_* with "_total" suffix
&["state"]
let started_evictions = metrics::register_int_counter!(
"pageserver_layer_started_evictions",
"Evictions started in the Layer implementation"
)
.unwrap();
let completed_evictions = metrics::register_int_counter!(
"pageserver_layer_completed_evictions",
"Evictions completed in the Layer implementation"
) )
.unwrap(); .unwrap();
let started_evictions = evictions
.get_metric_with_label_values(&["started"])
.unwrap();
let completed_evictions = evictions
.get_metric_with_label_values(&["completed"])
.unwrap();
let cancelled_evictions = metrics::register_int_counter_vec!( let cancelled_evictions = metrics::register_int_counter_vec!(
"pageserver_layer_cancelled_evictions_count", "pageserver_layer_cancelled_evictions_count",
@@ -1438,24 +1461,36 @@ impl Default for LayerImplMetrics {
) )
.unwrap(); .unwrap();
// reminder: this will be pageserver_layer_gcs_count_total with "_total" suffix let cancelled_evictions = enum_map::EnumMap::from_array(std::array::from_fn(|i| {
let gcs = metrics::register_int_counter_vec!( let reason = EvictionCancelled::from_usize(i);
"pageserver_layer_gcs_count", let s = reason.as_str();
"Garbage collections started and completed in the Layer implementation", cancelled_evictions.with_label_values(&[s])
&["state"] }));
let started_deletes = metrics::register_int_counter!(
"pageserver_layer_started_deletes",
"Deletions on drop pending in the Layer implementation"
)
.unwrap();
let completed_deletes = metrics::register_int_counter!(
"pageserver_layer_completed_deletes",
"Deletions on drop completed in the Layer implementation"
) )
.unwrap(); .unwrap();
let started_gcs = gcs.get_metric_with_label_values(&["pending"]).unwrap(); let failed_deletes = metrics::register_int_counter_vec!(
let completed_gcs = gcs.get_metric_with_label_values(&["completed"]).unwrap(); "pageserver_layer_failed_deletes_count",
"Different reasons for deletions on drop to have failed",
let failed_gcs = metrics::register_int_counter_vec!(
"pageserver_layer_failed_gcs_count",
"Different reasons for garbage collections to have failed",
&["reason"] &["reason"]
) )
.unwrap(); .unwrap();
let failed_deletes = enum_map::EnumMap::from_array(std::array::from_fn(|i| {
let reason = DeleteFailed::from_usize(i);
let s = reason.as_str();
failed_deletes.with_label_values(&[s])
}));
let rare_counters = metrics::register_int_counter_vec!( let rare_counters = metrics::register_int_counter_vec!(
"pageserver_layer_assumed_rare_count", "pageserver_layer_assumed_rare_count",
"Times unexpected or assumed rare event happened", "Times unexpected or assumed rare event happened",
@@ -1463,16 +1498,50 @@ impl Default for LayerImplMetrics {
) )
.unwrap(); .unwrap();
let rare_counters = enum_map::EnumMap::from_array(std::array::from_fn(|i| {
let event = RareEvent::from_usize(i);
let s = event.as_str();
rare_counters.with_label_values(&[s])
}));
let inits_cancelled = metrics::register_int_counter!(
"pageserver_layer_inits_cancelled_count",
"Times Layer initialization was cancelled",
)
.unwrap();
let redownload_after = {
let minute = 60.0;
let hour = 60.0 * minute;
metrics::register_histogram!(
"pageserver_layer_redownloaded_after",
"Time between evicting and re-downloading.",
vec![
10.0,
30.0,
minute,
5.0 * minute,
15.0 * minute,
30.0 * minute,
hour,
12.0 * hour,
]
)
.unwrap()
};
Self { Self {
started_evictions, started_evictions,
completed_evictions, completed_evictions,
cancelled_evictions, cancelled_evictions,
started_gcs, started_deletes,
completed_gcs, completed_deletes,
failed_gcs, failed_deletes,
rare_counters, rare_counters,
inits_cancelled,
redownload_after,
} }
} }
} }
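The `redownload_after` histogram registered above uses hand-picked bucket edges in seconds (10s up to 12h); a sample lands in the first bucket whose upper edge is at least the observed value, with everything past the last edge going to the implicit `+Inf` bucket. A small sketch of that bucketing logic (`bucket_index` is an illustrative helper, not the `metrics` crate API):

```rust
// Returns the index of the first bucket whose upper edge is >= value,
// or buckets.len() for the implicit +Inf bucket.
fn bucket_index(buckets: &[f64], value: f64) -> usize {
    buckets
        .iter()
        .position(|&edge| value <= edge)
        .unwrap_or(buckets.len())
}

fn main() {
    let minute = 60.0;
    let hour = 60.0 * minute;
    // Same edges as the diff's `redownload_after` registration.
    let buckets = [
        10.0,
        30.0,
        minute,
        5.0 * minute,
        15.0 * minute,
        30.0 * minute,
        hour,
        12.0 * hour,
    ];
    assert_eq!(bucket_index(&buckets, 7.5), 0); // fast re-download
    assert_eq!(bucket_index(&buckets, 120.0), 3); // 2 min -> "<= 5 min" bucket
    assert_eq!(bucket_index(&buckets, 2.0 * 24.0 * hour), buckets.len()); // +Inf
}
```

The skew toward sub-hour edges suggests the interesting signal is layers re-downloaded shortly after eviction, i.e. eviction thrashing.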
@@ -1485,57 +1554,33 @@ impl LayerImplMetrics {
self.completed_evictions.inc(); self.completed_evictions.inc();
} }
fn inc_eviction_cancelled(&self, reason: EvictionCancelled) { fn inc_eviction_cancelled(&self, reason: EvictionCancelled) {
self.cancelled_evictions self.cancelled_evictions[reason].inc()
.get_metric_with_label_values(&[reason.as_str()])
.unwrap()
.inc()
} }
fn inc_started_gcs(&self) { fn inc_started_deletes(&self) {
self.started_gcs.inc(); self.started_deletes.inc();
} }
fn inc_completed_gcs(&self) { fn inc_completed_deletes(&self) {
self.completed_gcs.inc(); self.completed_deletes.inc();
} }
fn inc_gcs_failed(&self, reason: GcFailed) { fn inc_deletes_failed(&self, reason: DeleteFailed) {
self.failed_gcs self.failed_deletes[reason].inc();
.get_metric_with_label_values(&[reason.as_str()])
.unwrap()
.inc();
} }
/// Counted separately from failed gcs because we will complete the gc attempt regardless of /// Counted separately from failed layer deletes because we will complete the layer deletion
/// failure to delete local file. /// attempt regardless of failure to delete local file.
fn inc_gc_removes_failed(&self) { fn inc_delete_removes_failed(&self) {
self.rare_counters self.rare_counters[RareEvent::RemoveOnDropFailed].inc();
.get_metric_with_label_values(&["gc_remove_failed"])
.unwrap()
.inc();
} }
/// Expected rare because requires a race with `evict_blocking` and /// Expected rare because requires a race with `evict_blocking` and `get_or_maybe_download`.
/// `get_or_maybe_download`.
fn inc_retried_get_or_maybe_download(&self) { fn inc_retried_get_or_maybe_download(&self) {
self.rare_counters self.rare_counters[RareEvent::RetriedGetOrMaybeDownload].inc();
.get_metric_with_label_values(&["retried_gomd"])
.unwrap()
.inc();
} }
/// Expected rare because cancellations are unexpected /// Expected rare because cancellations are unexpected, and failures are unexpected
fn inc_download_completed_without_requester(&self) {
self.rare_counters
.get_metric_with_label_values(&["download_completed_without"])
.unwrap()
.inc();
}
/// Expected rare because cancellations are unexpected
fn inc_download_failed_without_requester(&self) { fn inc_download_failed_without_requester(&self) {
self.rare_counters self.rare_counters[RareEvent::DownloadFailedWithoutRequester].inc();
.get_metric_with_label_values(&["download_failed_without"])
.unwrap()
.inc();
} }
/// The Weak in ResidentOrWantedEvicted::WantedEvicted was successfully upgraded. /// The Weak in ResidentOrWantedEvicted::WantedEvicted was successfully upgraded.
@@ -1543,37 +1588,34 @@ impl LayerImplMetrics {
/// If this counter is always zero, we should replace ResidentOrWantedEvicted type with an /// If this counter is always zero, we should replace ResidentOrWantedEvicted type with an
/// Option. /// Option.
fn inc_raced_wanted_evicted_accesses(&self) { fn inc_raced_wanted_evicted_accesses(&self) {
self.rare_counters self.rare_counters[RareEvent::UpgradedWantedEvicted].inc();
.get_metric_with_label_values(&["raced_wanted_evicted"])
.unwrap()
.inc();
} }
/// These are only expected for [`Self::inc_download_completed_without_requester`] amount when /// These are only expected for [`Self::inc_init_cancelled`] amount when
/// running with remote storage. /// running with remote storage.
fn inc_init_needed_no_download(&self) { fn inc_init_needed_no_download(&self) {
self.rare_counters self.rare_counters[RareEvent::InitWithoutDownload].inc();
.get_metric_with_label_values(&["init_needed_no_download"])
.unwrap()
.inc();
} }
/// Expected rare because all layer files should be readable and good /// Expected rare because all layer files should be readable and good
fn inc_permanent_loading_failures(&self) { fn inc_permanent_loading_failures(&self) {
self.rare_counters self.rare_counters[RareEvent::PermanentLoadingFailure].inc();
.get_metric_with_label_values(&["permanent_loading_failure"])
.unwrap()
.inc();
} }
fn inc_broadcast_lagged(&self) { fn inc_broadcast_lagged(&self) {
self.rare_counters self.rare_counters[RareEvent::EvictAndWaitLagged].inc();
.get_metric_with_label_values(&["broadcast_lagged"]) }
.unwrap()
.inc(); fn inc_init_cancelled(&self) {
self.inits_cancelled.inc()
}
fn record_redownloaded_after(&self, duration: std::time::Duration) {
self.redownload_after.observe(duration.as_secs_f64())
} }
} }
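The metrics rewrite above replaces every `get_metric_with_label_values(&[...]).unwrap().inc()` with an `enum_map::EnumMap<Reason, IntCounter>` built once at registration, so the hot path is a plain array index. A std-only sketch of the same dense-index idea, using a hand-rolled enum-to-index mapping in place of `#[derive(enum_map::Enum)]` (types here are illustrative stand-ins for the real counters):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

#[derive(Clone, Copy)]
enum DeleteFailed {
    TimelineGone,
    DeleteSchedulingFailed,
}

impl DeleteFailed {
    const COUNT: usize = 2;
    // What `enum_map::Enum` derives: a dense index per variant.
    fn index(self) -> usize {
        match self {
            DeleteFailed::TimelineGone => 0,
            DeleteFailed::DeleteSchedulingFailed => 1,
        }
    }
}

// Stand-in for EnumMap<DeleteFailed, IntCounter>.
struct FailedDeletes([AtomicUsize; DeleteFailed::COUNT]);

impl FailedDeletes {
    fn inc(&self, reason: DeleteFailed) {
        // No label lookup, no unwrap: just an array index.
        self.0[reason.index()].fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let failed = FailedDeletes([AtomicUsize::new(0), AtomicUsize::new(0)]);
    failed.inc(DeleteFailed::TimelineGone);
    failed.inc(DeleteFailed::TimelineGone);
    assert_eq!(failed.0[0].load(Ordering::Relaxed), 2);
    assert_eq!(failed.0[1].load(Ordering::Relaxed), 0);
}
```

Beyond speed, this moves the `unwrap()` from every increment site to the single registration site, so a typo in a label string fails at startup instead of at some rare code path.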
#[derive(enum_map::Enum)]
enum EvictionCancelled { enum EvictionCancelled {
LayerGone, LayerGone,
TimelineGone, TimelineGone,
@@ -1602,19 +1644,47 @@ impl EvictionCancelled {
} }
} }
enum GcFailed { #[derive(enum_map::Enum)]
enum DeleteFailed {
TimelineGone, TimelineGone,
DeleteSchedulingFailed, DeleteSchedulingFailed,
} }
impl GcFailed { impl DeleteFailed {
fn as_str(&self) -> &'static str { fn as_str(&self) -> &'static str {
match self { match self {
GcFailed::TimelineGone => "timeline_gone", DeleteFailed::TimelineGone => "timeline_gone",
GcFailed::DeleteSchedulingFailed => "delete_scheduling_failed", DeleteFailed::DeleteSchedulingFailed => "delete_scheduling_failed",
} }
} }
} }
static LAYER_IMPL_METRICS: once_cell::sync::Lazy<LayerImplMetrics> = #[derive(enum_map::Enum)]
enum RareEvent {
RemoveOnDropFailed,
RetriedGetOrMaybeDownload,
DownloadFailedWithoutRequester,
UpgradedWantedEvicted,
InitWithoutDownload,
PermanentLoadingFailure,
EvictAndWaitLagged,
}
impl RareEvent {
fn as_str(&self) -> &'static str {
use RareEvent::*;
match self {
RemoveOnDropFailed => "remove_on_drop_failed",
RetriedGetOrMaybeDownload => "retried_gomd",
DownloadFailedWithoutRequester => "download_failed_without",
UpgradedWantedEvicted => "raced_wanted_evicted",
InitWithoutDownload => "init_needed_no_download",
PermanentLoadingFailure => "permanent_loading_failure",
EvictAndWaitLagged => "broadcast_lagged",
}
}
}
pub(crate) static LAYER_IMPL_METRICS: once_cell::sync::Lazy<LayerImplMetrics> =
once_cell::sync::Lazy::new(LayerImplMetrics::default); once_cell::sync::Lazy::new(LayerImplMetrics::default);


@@ -44,6 +44,7 @@ pub(crate) enum BackgroundLoopKind {
Eviction, Eviction,
ConsumptionMetricsCollectMetrics, ConsumptionMetricsCollectMetrics,
ConsumptionMetricsSyntheticSizeWorker, ConsumptionMetricsSyntheticSizeWorker,
InitialLogicalSizeCalculation,
} }
impl BackgroundLoopKind { impl BackgroundLoopKind {
@@ -53,31 +54,18 @@ impl BackgroundLoopKind {
} }
} }
pub(crate) enum RateLimitError { /// Cancellation safe.
Cancelled, pub(crate) async fn concurrent_background_tasks_rate_limit_permit(
}
pub(crate) async fn concurrent_background_tasks_rate_limit(
loop_kind: BackgroundLoopKind, loop_kind: BackgroundLoopKind,
_ctx: &RequestContext, _ctx: &RequestContext,
cancel: &CancellationToken, ) -> impl Drop {
) -> Result<impl Drop, RateLimitError> { let _guard = crate::metrics::BACKGROUND_LOOP_SEMAPHORE_WAIT_GAUGE
crate::metrics::BACKGROUND_LOOP_SEMAPHORE_WAIT_START_COUNT
.with_label_values(&[loop_kind.as_static_str()]) .with_label_values(&[loop_kind.as_static_str()])
.inc(); .guard();
scopeguard::defer!(
crate::metrics::BACKGROUND_LOOP_SEMAPHORE_WAIT_FINISH_COUNT.with_label_values(&[loop_kind.as_static_str()]).inc(); match CONCURRENT_BACKGROUND_TASKS.acquire().await {
); Ok(permit) => permit,
tokio::select! { Err(_closed) => unreachable!("we never close the semaphore"),
permit = CONCURRENT_BACKGROUND_TASKS.acquire() => {
match permit {
Ok(permit) => Ok(permit),
Err(_closed) => unreachable!("we never close the semaphore"),
}
},
_ = cancel.cancelled() => {
Err(RateLimitError::Cancelled)
}
} }
} }
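The new `concurrent_background_tasks_rate_limit_permit` makes the acquire itself infallible and cancellation safe, and leaves racing it against a cancellation token to each caller (as the eviction loop does below with `tokio::select!`). A blocking, std-only analogue of "wait for a permit, but bail out if cancelled" — the semaphore, enum, and polling interval here are all illustrative, not the tokio API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Condvar, Mutex};
use std::time::Duration;

// Toy counting semaphore; the timed wait lets us re-check the cancel flag,
// which stands in for `tokio::select!` racing acquire vs. cancellation.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

enum Acquired {
    Permit,
    Cancelled,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }

    fn acquire_or_cancel(&self, cancel: &AtomicBool) -> Acquired {
        let mut permits = self.permits.lock().unwrap();
        loop {
            if cancel.load(Ordering::Relaxed) {
                return Acquired::Cancelled;
            }
            if *permits > 0 {
                *permits -= 1;
                return Acquired::Permit;
            }
            // Wake up periodically to observe cancellation.
            let (guard, _timed_out) = self
                .cv
                .wait_timeout(permits, Duration::from_millis(10))
                .unwrap();
            permits = guard;
        }
    }
}

fn main() {
    let sem = Semaphore::new(0); // all permits taken: acquire would block
    let cancel = AtomicBool::new(true); // caller was already cancelled
    assert!(matches!(sem.acquire_or_cancel(&cancel), Acquired::Cancelled));
}
```

This is the shape of the original bug: with the permits exhausted by compaction, a waiter that cannot observe cancellation blocks timeline deletion indefinitely, which is why the fix pushes the cancellation race out to the call sites.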
@@ -86,13 +74,13 @@ pub fn start_background_loops(
tenant: &Arc<Tenant>, tenant: &Arc<Tenant>,
background_jobs_can_start: Option<&completion::Barrier>, background_jobs_can_start: Option<&completion::Barrier>,
) { ) {
let tenant_id = tenant.tenant_shard_id.tenant_id; let tenant_shard_id = tenant.tenant_shard_id;
task_mgr::spawn( task_mgr::spawn(
BACKGROUND_RUNTIME.handle(), BACKGROUND_RUNTIME.handle(),
TaskKind::Compaction, TaskKind::Compaction,
Some(tenant_id), Some(tenant_shard_id),
None, None,
&format!("compactor for tenant {tenant_id}"), &format!("compactor for tenant {tenant_shard_id}"),
false, false,
{ {
let tenant = Arc::clone(tenant); let tenant = Arc::clone(tenant);
@@ -104,7 +92,7 @@ pub fn start_background_loops(
_ = completion::Barrier::maybe_wait(background_jobs_can_start) => {} _ = completion::Barrier::maybe_wait(background_jobs_can_start) => {}
}; };
compaction_loop(tenant, cancel) compaction_loop(tenant, cancel)
.instrument(info_span!("compaction_loop", tenant_id = %tenant_id)) .instrument(info_span!("compaction_loop", tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug()))
.await; .await;
Ok(()) Ok(())
} }
@@ -113,9 +101,9 @@ pub fn start_background_loops(
task_mgr::spawn( task_mgr::spawn(
BACKGROUND_RUNTIME.handle(), BACKGROUND_RUNTIME.handle(),
TaskKind::GarbageCollector, TaskKind::GarbageCollector,
Some(tenant_id), Some(tenant_shard_id),
None, None,
&format!("garbage collector for tenant {tenant_id}"), &format!("garbage collector for tenant {tenant_shard_id}"),
false, false,
{ {
let tenant = Arc::clone(tenant); let tenant = Arc::clone(tenant);
@@ -127,7 +115,7 @@ pub fn start_background_loops(
_ = completion::Barrier::maybe_wait(background_jobs_can_start) => {} _ = completion::Barrier::maybe_wait(background_jobs_can_start) => {}
}; };
gc_loop(tenant, cancel) gc_loop(tenant, cancel)
.instrument(info_span!("gc_loop", tenant_id = %tenant_id)) .instrument(info_span!("gc_loop", tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug()))
.await; .await;
Ok(()) Ok(())
} }

File diff suppressed because it is too large


@@ -21,7 +21,6 @@ use crate::{
}, },
CreateTimelineCause, DeleteTimelineError, Tenant, CreateTimelineCause, DeleteTimelineError, Tenant,
}, },
InitializationOrder,
}; };
use super::{Timeline, TimelineResources}; use super::{Timeline, TimelineResources};
@@ -44,7 +43,7 @@ async fn stop_tasks(timeline: &Timeline) -> Result<(), DeleteTimelineError> {
// Shut down the layer flush task before the remote client, as one depends on the other // Shut down the layer flush task before the remote client, as one depends on the other
task_mgr::shutdown_tasks( task_mgr::shutdown_tasks(
Some(TaskKind::LayerFlushTask), Some(TaskKind::LayerFlushTask),
Some(timeline.tenant_shard_id.tenant_id), Some(timeline.tenant_shard_id),
Some(timeline.timeline_id), Some(timeline.timeline_id),
) )
.await; .await;
@@ -72,7 +71,7 @@ async fn stop_tasks(timeline: &Timeline) -> Result<(), DeleteTimelineError> {
info!("waiting for timeline tasks to shutdown"); info!("waiting for timeline tasks to shutdown");
task_mgr::shutdown_tasks( task_mgr::shutdown_tasks(
None, None,
Some(timeline.tenant_shard_id.tenant_id), Some(timeline.tenant_shard_id),
Some(timeline.timeline_id), Some(timeline.timeline_id),
) )
.await; .await;
@@ -407,7 +406,6 @@ impl DeleteTimelineFlow {
local_metadata: &TimelineMetadata, local_metadata: &TimelineMetadata,
remote_client: Option<RemoteTimelineClient>, remote_client: Option<RemoteTimelineClient>,
deletion_queue_client: DeletionQueueClient, deletion_queue_client: DeletionQueueClient,
init_order: Option<&InitializationOrder>,
) -> anyhow::Result<()> { ) -> anyhow::Result<()> {
// Note: here we even skip populating layer map. Timeline is essentially uninitialized. // Note: here we even skip populating layer map. Timeline is essentially uninitialized.
// RemoteTimelineClient is the only functioning part. // RemoteTimelineClient is the only functioning part.
@@ -420,7 +418,6 @@ impl DeleteTimelineFlow {
remote_client, remote_client,
deletion_queue_client, deletion_queue_client,
}, },
init_order,
// Important. We dont pass ancestor above because it can be missing. // Important. We dont pass ancestor above because it can be missing.
// Thus we need to skip the validation here. // Thus we need to skip the validation here.
CreateTimelineCause::Delete, CreateTimelineCause::Delete,
@@ -531,7 +528,7 @@ impl DeleteTimelineFlow {
task_mgr::spawn( task_mgr::spawn(
task_mgr::BACKGROUND_RUNTIME.handle(), task_mgr::BACKGROUND_RUNTIME.handle(),
TaskKind::TimelineDeletionWorker, TaskKind::TimelineDeletionWorker,
Some(tenant_shard_id.tenant_id), Some(tenant_shard_id),
Some(timeline_id), Some(timeline_id),
"timeline_delete", "timeline_delete",
false, false,


@@ -30,7 +30,7 @@ use crate::{
task_mgr::{self, TaskKind, BACKGROUND_RUNTIME}, task_mgr::{self, TaskKind, BACKGROUND_RUNTIME},
tenant::{ tenant::{
config::{EvictionPolicy, EvictionPolicyLayerAccessThreshold}, config::{EvictionPolicy, EvictionPolicyLayerAccessThreshold},
tasks::{BackgroundLoopKind, RateLimitError}, tasks::BackgroundLoopKind,
timeline::EvictionError, timeline::EvictionError,
LogicalSizeCalculationCause, Tenant, LogicalSizeCalculationCause, Tenant,
}, },
@@ -60,7 +60,7 @@ impl Timeline {
task_mgr::spawn( task_mgr::spawn(
BACKGROUND_RUNTIME.handle(), BACKGROUND_RUNTIME.handle(),
TaskKind::Eviction, TaskKind::Eviction,
Some(self.tenant_shard_id.tenant_id), Some(self.tenant_shard_id),
Some(self.timeline_id), Some(self.timeline_id),
&format!( &format!(
"layer eviction for {}/{}", "layer eviction for {}/{}",
@@ -158,15 +158,15 @@ impl Timeline {
) -> ControlFlow<()> { ) -> ControlFlow<()> {
let now = SystemTime::now(); let now = SystemTime::now();
let _permit = match crate::tenant::tasks::concurrent_background_tasks_rate_limit( let acquire_permit = crate::tenant::tasks::concurrent_background_tasks_rate_limit_permit(
BackgroundLoopKind::Eviction, BackgroundLoopKind::Eviction,
ctx, ctx,
cancel, );
)
.await let _permit = tokio::select! {
{ permit = acquire_permit => permit,
Ok(permit) => permit, _ = cancel.cancelled() => return ControlFlow::Break(()),
Err(RateLimitError::Cancelled) => return ControlFlow::Break(()), _ = self.cancel.cancelled() => return ControlFlow::Break(()),
}; };
// If we evict layers but keep cached values derived from those layers, then // If we evict layers but keep cached values derived from those layers, then
@@ -212,11 +212,21 @@ impl Timeline {
// Gather layers for eviction. // Gather layers for eviction.
// NB: all the checks can be invalidated as soon as we release the layer map lock. // NB: all the checks can be invalidated as soon as we release the layer map lock.
// We don't want to hold the layer map lock during eviction. // We don't want to hold the layer map lock during eviction.
// So, we just need to deal with this. // So, we just need to deal with this.
let candidates: Vec<_> = {
let remote_client = match self.remote_client.as_ref() {
Some(c) => c,
None => {
error!("no remote storage configured, cannot evict layers");
return ControlFlow::Continue(());
}
};
let mut js = tokio::task::JoinSet::new();
{
let guard = self.layers.read().await; let guard = self.layers.read().await;
let layers = guard.layer_map(); let layers = guard.layer_map();
let mut candidates = Vec::new();
for hist_layer in layers.iter_historic_layers() { for hist_layer in layers.iter_historic_layers() {
let hist_layer = guard.get_from_desc(&hist_layer); let hist_layer = guard.get_from_desc(&hist_layer);
@@ -262,54 +272,49 @@ impl Timeline {
continue; continue;
} }
}; };
let layer = guard.drop_eviction_guard();
if no_activity_for > p.threshold { if no_activity_for > p.threshold {
candidates.push(guard.drop_eviction_guard()) let remote_client = remote_client.clone();
// this could cause a lot of allocations in some cases
js.spawn(async move { layer.evict_and_wait(&remote_client).await });
stats.candidates += 1;
} }
} }
candidates
};
stats.candidates = candidates.len();
let remote_client = match self.remote_client.as_ref() {
None => {
error!(
num_candidates = candidates.len(),
"no remote storage configured, cannot evict layers"
);
return ControlFlow::Continue(());
}
Some(c) => c,
}; };
let results = match self.evict_layer_batch(remote_client, &candidates).await { let join_all = async move {
Err(pre_err) => { while let Some(next) = js.join_next().await {
stats.errors += candidates.len(); match next {
error!("could not do any evictions: {pre_err:#}"); Ok(Ok(())) => stats.evicted += 1,
return ControlFlow::Continue(()); Ok(Err(EvictionError::NotFound | EvictionError::Downloaded)) => {
stats.not_evictable += 1;
}
Err(je) if je.is_cancelled() => unreachable!("not used"),
Err(je) if je.is_panic() => {
/* already logged */
stats.errors += 1;
}
Err(je) => tracing::error!("unknown JoinError: {je:?}"),
}
} }
Ok(results) => results, stats
}; };
assert_eq!(results.len(), candidates.len());
for result in results { tokio::select! {
match result { stats = join_all => {
None => { if stats.candidates == stats.not_evictable {
stats.skipped_for_shutdown += 1; debug!(stats=?stats, "eviction iteration complete");
} } else if stats.errors > 0 || stats.not_evictable > 0 {
Some(Ok(())) => { warn!(stats=?stats, "eviction iteration complete");
stats.evicted += 1; } else {
} info!(stats=?stats, "eviction iteration complete");
Some(Err(EvictionError::NotFound | EvictionError::Downloaded)) => {
stats.not_evictable += 1;
} }
} }
_ = cancel.cancelled() => {
// just drop the joinset to "abort"
}
} }
if stats.candidates == stats.not_evictable {
debug!(stats=?stats, "eviction iteration complete");
} else if stats.errors > 0 || stats.not_evictable > 0 {
warn!(stats=?stats, "eviction iteration complete");
} else {
info!(stats=?stats, "eviction iteration complete");
}
ControlFlow::Continue(()) ControlFlow::Continue(())
} }
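The rewrite above replaces the batch `evict_layer_batch` call with a `tokio::task::JoinSet`: one task per candidate, with the parent tallying `evicted` / `not_evictable` / `errors` as tasks finish and dropping the set on cancellation. A thread-based, std-only sketch of that fan-out-and-tally shape (the `evict` function and its failure rule are made up for illustration; there is no cancellation path here):

```rust
use std::thread;

#[derive(Default, Debug)]
struct Stats {
    candidates: usize,
    evicted: usize,
    not_evictable: usize,
    errors: usize,
}

enum EvictionError {
    NotFound,
    Downloaded,
}

// Hypothetical eviction: every third layer turns out not to be evictable.
fn evict(i: usize) -> Result<(), EvictionError> {
    if i % 3 == 0 {
        Err(EvictionError::NotFound)
    } else {
        Ok(())
    }
}

fn main() {
    let mut stats = Stats::default();
    let mut handles = Vec::new();
    for i in 0..6 {
        stats.candidates += 1;
        handles.push(thread::spawn(move || evict(i)));
    }
    // Mirrors the JoinSet drain: classify each outcome into the stats.
    for h in handles {
        match h.join() {
            Ok(Ok(())) => stats.evicted += 1,
            Ok(Err(EvictionError::NotFound | EvictionError::Downloaded)) => {
                stats.not_evictable += 1
            }
            Err(_panic) => stats.errors += 1, // panic payload already logged
        }
    }
    assert_eq!(stats.evicted, 4);
    assert_eq!(stats.not_evictable, 2);
    assert_eq!(stats.errors, 0);
}
```

One difference from the thread sketch: in the diff, cancellation simply drops the `JoinSet`, which aborts the in-flight eviction tasks, so a slow batch can no longer hold up shutdown the way the old all-or-nothing batch call could.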
@@ -343,7 +348,7 @@ impl Timeline {
// Make one of the tenant's timelines draw the short straw and run the calculation. // Make one of the tenant's timelines draw the short straw and run the calculation.
// The others wait until the calculation is done so that they take into account the // The others wait until the calculation is done so that they take into account the
// imitated accesses that the winner made. // imitated accesses that the winner made.
let tenant = match crate::tenant::mgr::get_tenant(self.tenant_shard_id.tenant_id, true) { let tenant = match crate::tenant::mgr::get_tenant(self.tenant_shard_id, true) {
Ok(t) => t, Ok(t) => t,
Err(_) => { Err(_) => {
return ControlFlow::Break(()); return ControlFlow::Break(());

View File

@@ -243,7 +243,7 @@ impl LayerManager {
        // map index without actually rebuilding the index.
        updates.remove_historic(desc);
        mapping.remove(layer);
-        layer.garbage_collect_on_drop();
+        layer.delete_on_drop();
    }

    pub(crate) fn contains(&self, layer: &Layer) -> bool {

View File

@@ -1,11 +1,10 @@
 use anyhow::Context;
-use once_cell::sync::OnceCell;
-use tokio::sync::Semaphore;
+use once_cell::sync::OnceCell;
+use tokio_util::sync::CancellationToken;
 use utils::lsn::Lsn;
 use std::sync::atomic::{AtomicBool, AtomicI64, Ordering as AtomicOrdering};
-use std::sync::Arc;
/// Internal structure to hold all data needed for logical size calculation.
///
@@ -28,8 +27,12 @@ pub(super) struct LogicalSize {
        crate::metrics::initial_logical_size::FinishedCalculationGuard,
    )>,

-    /// Semaphore to track ongoing calculation of `initial_logical_size`.
-    pub initial_size_computation: Arc<tokio::sync::Semaphore>,
+    /// Cancellation for the best-effort logical size calculation.
+    ///
+    /// The token is kept in a once-cell so that we can error out if a higher priority
+    /// request comes in *before* we have started the normal logical size calculation.
+    pub(crate) cancel_wait_for_background_loop_concurrency_limit_semaphore:
+        OnceCell<CancellationToken>,

    /// Latest Lsn that has its size uncalculated, could be absent for freshly created timelines.
    pub initial_part_end: Option<Lsn>,
@@ -72,7 +75,7 @@ pub(crate) enum CurrentLogicalSize {
    Exact(Exact),
}

-#[derive(Debug, Copy, Clone)]
+#[derive(Debug, Copy, Clone, PartialEq, Eq)]
 pub(crate) enum Accuracy {
     Approximate,
     Exact,
@@ -115,11 +118,10 @@ impl LogicalSize {
        Self {
            initial_logical_size: OnceCell::with_value((0, {
                crate::metrics::initial_logical_size::START_CALCULATION
-                    .first(None)
+                    .first(crate::metrics::initial_logical_size::StartCircumstances::EmptyInitial)
                    .calculation_result_saved()
            })),
-            // initial_logical_size already computed, so, don't admit any calculations
-            initial_size_computation: Arc::new(Semaphore::new(0)),
+            cancel_wait_for_background_loop_concurrency_limit_semaphore: OnceCell::new(),
            initial_part_end: None,
            size_added_after_initial: AtomicI64::new(0),
            did_return_approximate_to_walreceiver: AtomicBool::new(false),
@@ -129,7 +131,7 @@ impl LogicalSize {
    pub(super) fn deferred_initial(compute_to: Lsn) -> Self {
        Self {
            initial_logical_size: OnceCell::new(),
-            initial_size_computation: Arc::new(Semaphore::new(1)),
+            cancel_wait_for_background_loop_concurrency_limit_semaphore: OnceCell::new(),
            initial_part_end: Some(compute_to),
            size_added_after_initial: AtomicI64::new(0),
            did_return_approximate_to_walreceiver: AtomicBool::new(false),
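Both constructors seed the new once-cell field empty. The claim-once semantics that replaced the semaphore, where the first caller installs its token and later callers can only observe (and, in the real code, cancel) it, can be sketched with std's `OnceLock` standing in for `once_cell`, and a plain placeholder struct standing in for tokio-util's `CancellationToken`:

```rust
use std::sync::OnceLock;

/// Placeholder for tokio_util::sync::CancellationToken (assumption for this sketch).
#[derive(Clone, Debug, PartialEq)]
struct Token(u32);

struct LogicalSizeSlot {
    cancel_slot: OnceLock<Token>,
}

impl LogicalSizeSlot {
    fn new() -> Self {
        Self { cancel_slot: OnceLock::new() }
    }

    /// First caller stores its token and "owns" the background calculation.
    /// Later callers get the already-stored token back, which they could
    /// use to cancel the in-flight work (as the higher-priority path does).
    fn try_claim(&self, token: Token) -> Result<(), &Token> {
        match self.cancel_slot.set(token) {
            Ok(()) => Ok(()),
            Err(_rejected) => Err(self.cancel_slot.get().unwrap()),
        }
    }
}

fn main() {
    let slot = LogicalSizeSlot::new();
    assert!(slot.try_claim(Token(1)).is_ok()); // first claim wins
    assert_eq!(slot.try_claim(Token(2)), Err(&Token(1))); // later claims see the winner's token
}
```

The once-cell also encodes "already computed": for `empty_initial` nothing will ever claim the slot, matching the old `Semaphore::new(0)` that admitted no calculations.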

View File

@@ -19,14 +19,14 @@ use super::Timeline;
 pub struct UninitializedTimeline<'t> {
     pub(crate) owning_tenant: &'t Tenant,
     timeline_id: TimelineId,
-    raw_timeline: Option<(Arc<Timeline>, TimelineUninitMark)>,
+    raw_timeline: Option<(Arc<Timeline>, TimelineUninitMark<'t>)>,
 }

 impl<'t> UninitializedTimeline<'t> {
     pub(crate) fn new(
         owning_tenant: &'t Tenant,
         timeline_id: TimelineId,
-        raw_timeline: Option<(Arc<Timeline>, TimelineUninitMark)>,
+        raw_timeline: Option<(Arc<Timeline>, TimelineUninitMark<'t>)>,
     ) -> Self {
         Self {
             owning_tenant,
@@ -169,18 +169,55 @@ pub(crate) fn cleanup_timeline_directory(uninit_mark: TimelineUninitMark) {
 ///
 /// XXX: it's important to create it near the timeline dir, not inside it to ensure timeline dir gets removed first.
 #[must_use]
-pub(crate) struct TimelineUninitMark {
+pub(crate) struct TimelineUninitMark<'t> {
+    owning_tenant: &'t Tenant,
+    timeline_id: TimelineId,
     uninit_mark_deleted: bool,
     uninit_mark_path: Utf8PathBuf,
     pub(crate) timeline_path: Utf8PathBuf,
 }

-impl TimelineUninitMark {
-    pub(crate) fn new(uninit_mark_path: Utf8PathBuf, timeline_path: Utf8PathBuf) -> Self {
-        Self {
-            uninit_mark_deleted: false,
-            uninit_mark_path,
-            timeline_path,
+/// Errors when acquiring exclusive access to a timeline ID for creation
+#[derive(thiserror::Error, Debug)]
+pub(crate) enum TimelineExclusionError {
+    #[error("Already exists")]
+    AlreadyExists(Arc<Timeline>),
+    #[error("Already creating")]
+    AlreadyCreating,
+
+    // e.g. I/O errors, or some failure deep in postgres initdb
+    #[error(transparent)]
+    Other(#[from] anyhow::Error),
+}
+
+impl<'t> TimelineUninitMark<'t> {
+    pub(crate) fn new(
+        owning_tenant: &'t Tenant,
+        timeline_id: TimelineId,
+        uninit_mark_path: Utf8PathBuf,
+        timeline_path: Utf8PathBuf,
+    ) -> Result<Self, TimelineExclusionError> {
+        // Lock order: this is the only place we take both locks. During drop() we only
+        // lock creating_timelines
+        let timelines = owning_tenant.timelines.lock().unwrap();
+        let mut creating_timelines: std::sync::MutexGuard<
+            '_,
+            std::collections::HashSet<TimelineId>,
+        > = owning_tenant.timelines_creating.lock().unwrap();
+
+        if let Some(existing) = timelines.get(&timeline_id) {
+            Err(TimelineExclusionError::AlreadyExists(existing.clone()))
+        } else if creating_timelines.contains(&timeline_id) {
+            Err(TimelineExclusionError::AlreadyCreating)
+        } else {
+            creating_timelines.insert(timeline_id);
+            Ok(Self {
+                owning_tenant,
+                timeline_id,
+                uninit_mark_deleted: false,
+                uninit_mark_path,
+                timeline_path,
+            })
         }
     }
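The exclusion scheme above (insert the id into `timelines_creating` under a lock, release it in `Drop` so error paths can't leak the reservation) can be sketched in isolation. `Tenant`, the `u64` ids, and the `CreationGuard` name here are simplified stand-ins for the real types:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Simplified stand-in for the tenant and its `timelines_creating` set.
struct Tenant {
    timelines_creating: Mutex<HashSet<u64>>,
}

/// RAII guard: holding it gives exclusive creation rights for one timeline id.
struct CreationGuard<'t> {
    tenant: &'t Tenant,
    timeline_id: u64,
}

impl<'t> CreationGuard<'t> {
    /// Returns None if another creation for the same id is already in flight.
    fn acquire(tenant: &'t Tenant, timeline_id: u64) -> Option<Self> {
        let mut creating = tenant.timelines_creating.lock().unwrap();
        if !creating.insert(timeline_id) {
            return None; // "Already creating"
        }
        Some(CreationGuard { tenant, timeline_id })
    }
}

impl Drop for CreationGuard<'_> {
    fn drop(&mut self) {
        // release the id on every exit path, including errors and unwinds
        self.tenant
            .timelines_creating
            .lock()
            .unwrap()
            .remove(&self.timeline_id);
    }
}

fn main() {
    let t = Tenant { timelines_creating: Mutex::new(HashSet::new()) };
    let g = CreationGuard::acquire(&t, 7).expect("first acquire succeeds");
    assert!(CreationGuard::acquire(&t, 7).is_none()); // exclusive while held
    drop(g);
    assert!(CreationGuard::acquire(&t, 7).is_some()); // released on Drop
}
```

The real code additionally checks the `timelines` map first (yielding `AlreadyExists`), taking both locks only in `new()` to keep the lock order trivially consistent.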
@@ -207,7 +244,7 @@ impl TimelineUninitMark {
        }
    }

-impl Drop for TimelineUninitMark {
+impl Drop for TimelineUninitMark<'_> {
     fn drop(&mut self) {
         if !self.uninit_mark_deleted {
             if self.timeline_path.exists() {
@@ -226,5 +263,11 @@ impl Drop for TimelineUninitMark {
                }
            }
        }
+
+        self.owning_tenant
+            .timelines_creating
+            .lock()
+            .unwrap()
+            .remove(&self.timeline_id);
    }
}

View File

@@ -30,6 +30,7 @@ use crate::tenant::timeline::walreceiver::connection_manager::{
    connection_manager_loop_step, ConnectionManagerState,
};

+use pageserver_api::shard::TenantShardId;
 use std::future::Future;
 use std::num::NonZeroU64;
 use std::ops::ControlFlow;
@@ -41,7 +42,7 @@ use tokio::sync::watch;
 use tokio_util::sync::CancellationToken;
 use tracing::*;
-use utils::id::TenantTimelineId;
+use utils::id::TimelineId;

 use self::connection_manager::ConnectionManagerStatus;
@@ -60,7 +61,8 @@ pub struct WalReceiverConf {
 }

 pub struct WalReceiver {
-    timeline: TenantTimelineId,
+    tenant_shard_id: TenantShardId,
+    timeline_id: TimelineId,
     manager_status: Arc<std::sync::RwLock<Option<ConnectionManagerStatus>>>,
 }
@@ -71,7 +73,7 @@ impl WalReceiver {
        mut broker_client: BrokerClientChannel,
        ctx: &RequestContext,
    ) -> Self {
-        let tenant_id = timeline.tenant_shard_id.tenant_id;
+        let tenant_shard_id = timeline.tenant_shard_id;
        let timeline_id = timeline.timeline_id;
        let walreceiver_ctx =
            ctx.detached_child(TaskKind::WalReceiverManager, DownloadBehavior::Error);
@@ -81,9 +83,9 @@ impl WalReceiver {
        task_mgr::spawn(
            WALRECEIVER_RUNTIME.handle(),
            TaskKind::WalReceiverManager,
-            Some(tenant_id),
+            Some(timeline.tenant_shard_id),
            Some(timeline_id),
-            &format!("walreceiver for timeline {tenant_id}/{timeline_id}"),
+            &format!("walreceiver for timeline {tenant_shard_id}/{timeline_id}"),
            false,
            async move {
                debug_assert_current_span_has_tenant_and_timeline_id();
@@ -117,11 +119,12 @@ impl WalReceiver {
                *loop_status.write().unwrap() = None;
                Ok(())
            }
-            .instrument(info_span!(parent: None, "wal_connection_manager", tenant_id = %tenant_id, timeline_id = %timeline_id))
+            .instrument(info_span!(parent: None, "wal_connection_manager", tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug(), timeline_id = %timeline_id))
        );

        Self {
-            timeline: TenantTimelineId::new(tenant_id, timeline_id),
+            tenant_shard_id,
+            timeline_id,
            manager_status,
        }
    }
@@ -129,8 +132,8 @@ impl WalReceiver {
    pub async fn stop(self) {
        task_mgr::shutdown_tasks(
            Some(TaskKind::WalReceiverManager),
-            Some(self.timeline.tenant_id),
-            Some(self.timeline.timeline_id),
+            Some(self.tenant_shard_id),
+            Some(self.timeline_id),
        )
        .await;
    }

View File

@@ -163,7 +163,7 @@ pub(super) async fn handle_walreceiver_connection(
    task_mgr::spawn(
        WALRECEIVER_RUNTIME.handle(),
        TaskKind::WalReceiverConnectionPoller,
-        Some(timeline.tenant_shard_id.tenant_id),
+        Some(timeline.tenant_shard_id),
        Some(timeline.timeline_id),
        "walreceiver connection",
        false,
@@ -397,7 +397,10 @@ pub(super) async fn handle_walreceiver_connection(
            // Send the replication feedback message.
            // Regular standby_status_update fields are put into this message.
            let current_timeline_size = timeline
-                .get_current_logical_size(&ctx)
+                .get_current_logical_size(
+                    crate::tenant::timeline::GetLogicalSizePriority::User,
+                    &ctx,
+                )
                // FIXME: https://github.com/neondatabase/neon/issues/5963
                .size_dont_care_about_accuracy();
            let status_update = PageserverFeedback {

View File

@@ -288,6 +288,9 @@ impl VirtualFile {
        }
        let (handle, mut slot_guard) = get_open_files().find_victim_slot();

+        // NB: there is also StorageIoOperation::OpenAfterReplace which is for the case
+        // where our caller doesn't get to use the returned VirtualFile before its
+        // slot gets re-used by someone else.
        let file = STORAGE_IO_TIME_METRIC
            .get(StorageIoOperation::Open)
            .observe_closure_duration(|| open_options.open(path))?;
@@ -311,6 +314,9 @@ impl VirtualFile {
            timeline_id,
        };

+        // TODO: Under pressure, it's likely the slot will get re-used and
+        // the underlying file closed before they get around to using it.
+        // => https://github.com/neondatabase/neon/issues/6065
        slot_guard.file.replace(file);

        Ok(vfile)
@@ -421,9 +427,12 @@ impl VirtualFile {
        // now locked in write-mode. Find a free slot to put it in.
        let (handle, mut slot_guard) = open_files.find_victim_slot();

-        // Open the physical file
+        // Re-open the physical file.
+        // NB: we use StorageIoOperation::OpenAferReplace for this to distinguish this
+        // case from StorageIoOperation::Open. This helps with identifying thrashing
+        // of the virtual file descriptor cache.
        let file = STORAGE_IO_TIME_METRIC
-            .get(StorageIoOperation::Open)
+            .get(StorageIoOperation::OpenAfterReplace)
            .observe_closure_duration(|| self.open_options.open(&self.path))?;

        // Perform the requested operation on it
@@ -610,9 +619,11 @@ impl Drop for VirtualFile {
            slot.recently_used.store(false, Ordering::Relaxed);
            // there is also operation "close-by-replace" for closes done on eviction for
            // comparison.
-            STORAGE_IO_TIME_METRIC
-                .get(StorageIoOperation::Close)
-                .observe_closure_duration(|| drop(slot_guard.file.take()));
+            if let Some(fd) = slot_guard.file.take() {
+                STORAGE_IO_TIME_METRIC
+                    .get(StorageIoOperation::Close)
+                    .observe_closure_duration(|| drop(fd));
+            }
        }
    }
}
@@ -643,6 +654,7 @@ pub fn init(num_slots: usize) {
    if OPEN_FILES.set(OpenFiles::new(num_slots)).is_err() {
        panic!("virtual_file::init called twice");
    }
+    crate::metrics::virtual_file_descriptor_cache::SIZE_MAX.set(num_slots as u64);
}

const TEST_MAX_FILE_DESCRIPTORS: usize = 10;
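The `Open` vs `OpenAfterReplace` split above exists to make fd-cache thrashing visible: a re-open forced by another file stealing the victim slot is counted under a separate metric label. A toy model with a single slot shows the distinction (everything besides those two metric names is our simplification):

```rust
use std::collections::HashMap;

/// Toy virtual-fd cache with N slots, a round-robin victim policy, and the
/// two open counters that mirror StorageIoOperation::Open / OpenAfterReplace.
struct FdCache {
    slots: Vec<Option<&'static str>>, // path currently occupying each slot
    next_victim: usize,
    opens: HashMap<&'static str, u32>,   // first opens
    reopens: HashMap<&'static str, u32>, // opens after the slot was replaced
}

impl FdCache {
    fn new(n: usize) -> Self {
        FdCache {
            slots: vec![None; n],
            next_victim: 0,
            opens: HashMap::new(),
            reopens: HashMap::new(),
        }
    }

    /// Ensure `path` occupies some slot, counting first open vs re-open.
    fn access(&mut self, path: &'static str) {
        if self.slots.iter().any(|s| *s == Some(path)) {
            return; // fd still cached: no syscall, no metric
        }
        let victim = self.next_victim % self.slots.len();
        self.next_victim += 1;
        let first_time = !self.opens.contains_key(path) && !self.reopens.contains_key(path);
        if first_time {
            *self.opens.entry(path).or_insert(0) += 1;
        } else {
            *self.reopens.entry(path).or_insert(0) += 1; // "open after replace"
        }
        self.slots[victim] = Some(path);
    }
}

fn main() {
    let mut c = FdCache::new(1); // one slot forces thrashing
    c.access("a"); // first open of "a"
    c.access("b"); // first open of "b", evicts "a"
    c.access("a"); // "a" lost its slot -> counted as open-after-replace
    assert_eq!(c.opens["a"], 1);
    assert_eq!(c.reopens["a"], 1);
}
```

In a healthy system `reopens` stays near zero; a growing ratio of re-opens to opens is the thrashing signal the new metric label is meant to surface.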

View File

@@ -21,6 +21,7 @@
//! redo Postgres process, but some records it can handle directly with
//! bespoken Rust code.

+use pageserver_api::shard::ShardIdentity;
 use postgres_ffi::v14::nonrelfile_utils::clogpage_precedes;
 use postgres_ffi::v14::nonrelfile_utils::slru_may_delete_clogsegment;
 use postgres_ffi::{fsm_logical_to_physical, page_is_new, page_set_lsn};
@@ -30,6 +31,7 @@ use bytes::{Buf, Bytes, BytesMut};
 use tracing::*;

 use crate::context::RequestContext;
+use crate::metrics::WAL_INGEST;
 use crate::pgdatadir_mapping::*;
 use crate::tenant::PageReconstructError;
 use crate::tenant::Timeline;
@@ -46,6 +48,7 @@ use postgres_ffi::BLCKSZ;
 use utils::lsn::Lsn;

 pub struct WalIngest<'a> {
+    shard: ShardIdentity,
     timeline: &'a Timeline,
     checkpoint: CheckPoint,
@@ -65,6 +68,7 @@ impl<'a> WalIngest<'a> {
        trace!("CheckPoint.nextXid = {}", checkpoint.nextXid.value);

        Ok(WalIngest {
+            shard: *timeline.get_shard_identity(),
            timeline,
            checkpoint,
            checkpoint_modified: false,
@@ -87,6 +91,8 @@ impl<'a> WalIngest<'a> {
        decoded: &mut DecodedWALRecord,
        ctx: &RequestContext,
    ) -> anyhow::Result<()> {
+        WAL_INGEST.records_received.inc();
        modification.lsn = lsn;
        decode_wal_record(recdata, decoded, self.timeline.pg_version)?;
@@ -355,6 +361,33 @@ impl<'a> WalIngest<'a> {
        // Iterate through all the blocks that the record modifies, and
        // "put" a separate copy of the record for each block.
        for blk in decoded.blocks.iter() {
+            let rel = RelTag {
+                spcnode: blk.rnode_spcnode,
+                dbnode: blk.rnode_dbnode,
+                relnode: blk.rnode_relnode,
+                forknum: blk.forknum,
+            };
+
+            let key = rel_block_to_key(rel, blk.blkno);
+            let key_is_local = self.shard.is_key_local(&key);
+
+            tracing::debug!(
+                lsn=%lsn,
+                key=%key,
+                "ingest: shard decision {} (checkpoint={})",
+                if !key_is_local { "drop" } else { "keep" },
+                self.checkpoint_modified
+            );
+
+            if !key_is_local {
+                if self.shard.is_zero() {
+                    // Shard 0 tracks relation sizes. Although we will not store this block, we will observe
+                    // its blkno in case it implicitly extends a relation.
+                    self.observe_decoded_block(modification, blk, ctx).await?;
+                }
+
+                continue;
+            }
            self.ingest_decoded_block(modification, lsn, decoded, blk, ctx)
                .await?;
        }
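The keep/drop decision above routes each block's key to exactly one shard. The invariant can be illustrated with a toy stand-in for `ShardIdentity::is_key_local` using a plain hash; the real pageserver mapping is stripe-aware rather than a per-key hash, so every name and the routing function here are assumptions of this sketch:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy shard identity: `number` of `count` shards.
struct ShardIdentity {
    number: u32,
    count: u32,
}

impl ShardIdentity {
    /// Deterministically map a (spcnode, dbnode, relnode, blkno) key to a shard.
    fn get_shard_number(&self, key: &(u32, u32, u32, u32)) -> u32 {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() % self.count as u64) as u32
    }

    /// A key is local iff it maps to this shard (unsharded tenants keep everything).
    fn is_key_local(&self, key: &(u32, u32, u32, u32)) -> bool {
        self.count == 1 || self.get_shard_number(key) == self.number
    }
}

fn main() {
    let shards: Vec<_> = (0..4).map(|n| ShardIdentity { number: n, count: 4 }).collect();
    let key = (1663, 16384, 24576, 0); // (spcnode, dbnode, relnode, blkno)
    // exactly one of the four shards keeps any given key
    let keepers = shards.iter().filter(|s| s.is_key_local(&key)).count();
    assert_eq!(keepers, 1);
}
```

The special case in the diff, where shard 0 still *observes* dropped blocks, exists because only shard 0 maintains relation sizes, so it must notice implicit relation extensions even for blocks it does not store.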
@@ -367,13 +400,38 @@ impl<'a> WalIngest<'a> {
            self.checkpoint_modified = false;
        }

+        if modification.is_empty() {
+            tracing::debug!("ingest: filtered out record @ LSN {lsn}");
+            WAL_INGEST.records_filtered.inc();
+            modification.tline.finish_write(lsn);
+        } else {
+            WAL_INGEST.records_committed.inc();
+            modification.commit(ctx).await?;
+        }
+
        // Now that this record has been fully handled, including updating the
-        // checkpoint data, let the repository know that it is up-to-date to this LSN
-        modification.commit(ctx).await?;
+        // checkpoint data, let the repository know that it is up-to-date to this LSN.

        Ok(())
    }

+    /// Do not store this block, but observe it for the purposes of updating our relation size state.
+    async fn observe_decoded_block(
+        &mut self,
+        modification: &mut DatadirModification<'_>,
+        blk: &DecodedBkpBlock,
+        ctx: &RequestContext,
+    ) -> Result<(), PageReconstructError> {
+        let rel = RelTag {
+            spcnode: blk.rnode_spcnode,
+            dbnode: blk.rnode_dbnode,
+            relnode: blk.rnode_relnode,
+            forknum: blk.forknum,
+        };
+
+        self.handle_rel_extend(modification, rel, blk.blkno, ctx)
+            .await
+    }
+
    async fn ingest_decoded_block(
        &mut self,
        modification: &mut DatadirModification<'_>,
@@ -400,8 +458,10 @@ impl<'a> WalIngest<'a> {
            && decoded.xl_rmid == pg_constants::RM_XLOG_ID
            && (decoded.xl_info == pg_constants::XLOG_FPI
                || decoded.xl_info == pg_constants::XLOG_FPI_FOR_HINT)
            // compression of WAL is not yet supported: fall back to storing the original WAL record
            && !postgres_ffi::bkpimage_is_compressed(blk.bimg_info, self.timeline.pg_version)?
+            // do not materialize null pages because them most likely be soon replaced with real data
+            && blk.bimg_len != 0
        {
            // Extract page image from FPI record
            let img_len = blk.bimg_len as usize;
@@ -1465,8 +1525,15 @@ impl<'a> WalIngest<'a> {
        //info!("extending {} {} to {}", rel, old_nblocks, new_nblocks);
        modification.put_rel_extend(rel, new_nblocks, ctx).await?;

+        let mut key = rel_block_to_key(rel, blknum);
        // fill the gap with zeros
        for gap_blknum in old_nblocks..blknum {
+            key.field6 = gap_blknum;
+
+            if self.shard.get_shard_number(&key) != self.shard.number {
+                continue;
+            }
+
            modification.put_rel_page_image(rel, gap_blknum, ZERO_PAGE.clone())?;
        }
    }
@@ -2124,7 +2191,7 @@ mod tests {
            .load()
            .await;
        let tline = tenant
-            .bootstrap_timeline(TIMELINE_ID, pg_version, None, &ctx)
+            .bootstrap_timeline_test(TIMELINE_ID, pg_version, None, &ctx)
            .await
            .unwrap();

View File

@@ -34,7 +34,6 @@ use std::process::{Child, ChildStdin, ChildStdout, Command};
 use std::sync::{Arc, Mutex, MutexGuard, RwLock};
 use std::time::Duration;
 use std::time::Instant;
-use tokio_util::sync::CancellationToken;
 use tracing::*;
 use utils::{bin_ser::BeSer, id::TenantId, lsn::Lsn, nonblock::set_nonblock};
@@ -124,7 +123,9 @@ impl PostgresRedoManager {
    /// The WAL redo is handled by a separate thread, so this just sends a request
    /// to the thread and waits for response.
    ///
-    /// CANCEL SAFETY: NOT CANCEL SAFE.
+    /// # Cancel-Safety
+    ///
+    /// This method is cancellation-safe.
    pub async fn request_redo(
        &self,
        key: Key,
@@ -157,7 +158,6 @@ impl PostgresRedoManager {
                    self.conf.wal_redo_timeout,
                    pg_version,
                )
-                .await
            };
            img = Some(result?);
@@ -178,7 +178,6 @@ impl PostgresRedoManager {
                    self.conf.wal_redo_timeout,
                    pg_version,
                )
-                .await
            }
        }
    }
@@ -216,7 +215,7 @@ impl PostgresRedoManager {
    /// Process one request for WAL redo using wal-redo postgres
    ///
    #[allow(clippy::too_many_arguments)]
-    async fn apply_batch_postgres(
+    fn apply_batch_postgres(
        &self,
        key: Key,
        lsn: Lsn,
@@ -332,12 +331,7 @@ impl PostgresRedoManager {
            // than we can SIGKILL & `wait` for them to exit. By doing it the way we do here,
            // we limit this risk of run-away to at most $num_runtimes * $num_executor_threads.
            // This probably needs revisiting at some later point.
-            let mut wait_done = proc.stderr_logger_task_done.clone();
            drop(proc);
-            wait_done
-                .wait_for(|v| *v)
-                .await
-                .expect("we use scopeguard to ensure we always send `true` to the channel before dropping the sender");
        } else if n_attempts != 0 {
            info!(n_attempts, "retried walredo succeeded");
        }
@@ -649,8 +643,6 @@ struct WalRedoProcess {
    child: Option<NoLeakChild>,
    stdout: Mutex<ProcessOutput>,
    stdin: Mutex<ProcessInput>,
-    stderr_logger_cancel: CancellationToken,
-    stderr_logger_task_done: tokio::sync::watch::Receiver<bool>,
    /// Counter to separate same sized walredo inputs failing at the same millisecond.
    #[cfg(feature = "testing")]
    dump_sequence: AtomicUsize,
@@ -699,6 +691,8 @@ impl WalRedoProcess {
        let stdin = child.stdin.take().unwrap();
        let stdout = child.stdout.take().unwrap();
        let stderr = child.stderr.take().unwrap();
+        let stderr = tokio::process::ChildStderr::from_std(stderr)
+            .context("convert to tokio::ChildStderr")?;

        macro_rules! set_nonblock_or_log_err {
            ($file:ident) => {{
                let res = set_nonblock($file.as_raw_fd());
@@ -710,69 +704,45 @@ impl WalRedoProcess {
        }
        set_nonblock_or_log_err!(stdin)?;
        set_nonblock_or_log_err!(stdout)?;
-        set_nonblock_or_log_err!(stderr)?;
-        let mut stderr = tokio::io::unix::AsyncFd::new(stderr).context("AsyncFd::with_interest")?;

        // all fallible operations post-spawn are complete, so get rid of the guard
        let child = scopeguard::ScopeGuard::into_inner(child);

-        let stderr_logger_cancel = CancellationToken::new();
-        let (stderr_logger_task_done_tx, stderr_logger_task_done_rx) =
-            tokio::sync::watch::channel(false);
-        tokio::spawn({
-            let stderr_logger_cancel = stderr_logger_cancel.clone();
+        tokio::spawn(
            async move {
                scopeguard::defer! {
                    debug!("wal-redo-postgres stderr_logger_task finished");
-                    let _ = stderr_logger_task_done_tx.send(true);
+                    crate::metrics::WAL_REDO_PROCESS_COUNTERS.active_stderr_logger_tasks_finished.inc();
                }
                debug!("wal-redo-postgres stderr_logger_task started");
-                loop {
-                    // NB: we purposefully don't do a select! for the cancellation here.
-                    // The cancellation would likely cause us to miss stderr messages.
-                    // We can rely on this to return from .await because when we SIGKILL
-                    // the child, the writing end of the stderr pipe gets closed.
-                    match stderr.readable_mut().await {
-                        Ok(mut guard) => {
-                            let mut errbuf = [0; 16384];
-                            let res = guard.try_io(|fd| {
-                                use std::io::Read;
-                                fd.get_mut().read(&mut errbuf)
-                            });
-                            match res {
-                                Ok(Ok(0)) => {
-                                    // it closed the stderr pipe
-                                    break;
-                                }
-                                Ok(Ok(n)) => {
-                                    // The message might not be split correctly into lines here. But this is
-                                    // good enough, the important thing is to get the message to the log.
-                                    let output = String::from_utf8_lossy(&errbuf[0..n]).to_string();
-                                    error!(output, "received output");
-                                },
-                                Ok(Err(e)) => {
-                                    error!(error = ?e, "read() error, waiting for cancellation");
-                                    stderr_logger_cancel.cancelled().await;
-                                    error!(error = ?e, "read() error, cancellation complete");
-                                    break;
-                                }
-                                Err(e) => {
-                                    let _e: tokio::io::unix::TryIoError = e;
-                                    // the read() returned WouldBlock, that's expected
-                                }
-                            }
-                        }
-                        Err(e) => {
-                            error!(error = ?e, "read() error, waiting for cancellation");
-                            stderr_logger_cancel.cancelled().await;
-                            error!(error = ?e, "read() error, cancellation complete");
-                            break;
-                        }
-                    }
-                }
+                crate::metrics::WAL_REDO_PROCESS_COUNTERS.active_stderr_logger_tasks_started.inc();
+
+                use tokio::io::AsyncBufReadExt;
+                let mut stderr_lines = tokio::io::BufReader::new(stderr);
+                let mut buf = Vec::new();
+                let res = loop {
+                    buf.clear();
+                    // TODO we don't trust the process to cap its stderr length.
+                    // Currently it can do unbounded Vec allocation.
+                    match stderr_lines.read_until(b'\n', &mut buf).await {
+                        Ok(0) => break Ok(()), // eof
+                        Ok(num_bytes) => {
+                            let output = String::from_utf8_lossy(&buf[..num_bytes]);
+                            error!(%output, "received output");
+                        }
+                        Err(e) => {
+                            break Err(e);
+                        }
+                    }
+                };
+                match res {
+                    Ok(()) => (),
+                    Err(e) => {
+                        error!(error=?e, "failed to read from walredo stderr");
+                    }
+                }
            }.instrument(tracing::info_span!(parent: None, "wal-redo-postgres-stderr", pid = child.id(), tenant_id = %tenant_id, %pg_version))
-        });
+        );
        Ok(Self {
            conf,
@@ -787,8 +757,6 @@ impl WalRedoProcess {
                pending_responses: VecDeque::new(),
                n_processed_responses: 0,
            }),
-            stderr_logger_cancel,
-            stderr_logger_task_done: stderr_logger_task_done_rx,
            #[cfg(feature = "testing")]
            dump_sequence: AtomicUsize::default(),
        })
@@ -1029,7 +997,6 @@ impl Drop for WalRedoProcess {
            .take()
            .expect("we only do this once")
            .kill_and_wait(WalRedoKillCause::WalRedoProcessDrop);
-        self.stderr_logger_cancel.cancel();
        // no way to wait for stderr_logger_task from Drop because that is async only
    }
}
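The rewritten stderr task reads the pipe line-by-line with `read_until`, breaking on `Ok(0)` (EOF, i.e. the child's writing end closed after SIGKILL) or on an I/O error, which is what lets it drop the explicit cancellation token. The same loop shape in blocking std I/O, with a `Cursor` simulating the child's stderr pipe:

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Drain a reader line-by-line, the way the walredo stderr logger does:
/// read_until(b'\n'), stop on Ok(0) (EOF) or propagate the error.
fn drain_stderr(reader: impl std::io::Read) -> std::io::Result<Vec<String>> {
    let mut lines = Vec::new();
    let mut r = BufReader::new(reader);
    let mut buf = Vec::new();
    loop {
        buf.clear();
        match r.read_until(b'\n', &mut buf)? {
            0 => break, // eof: the writing end of the pipe was closed
            n => lines.push(String::from_utf8_lossy(&buf[..n]).trim_end().to_string()),
        }
    }
    Ok(lines)
}

fn main() {
    let fake_pipe = Cursor::new(b"could not redo\npanic!\n".to_vec());
    let lines = drain_stderr(fake_pipe).unwrap();
    assert_eq!(lines, vec!["could not redo", "panic!"]);
}
```

Because EOF is guaranteed once the child dies, the loop terminates on its own; the old version needed a `CancellationToken` plus a `watch` channel only to cope with its raw non-blocking `read()` error paths.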

View File

@@ -41,6 +41,17 @@ libwalproposer.a: $(WALPROP_OBJS)
	rm -f $@
	$(AR) $(AROPT) $@ $^

+# needs vars:
+# FIND_TYPEDEF pointing to find_typedef
+# INDENT pointing to pg_bsd_indent
+# PGINDENT_SCRIPT pointing to pgindent (be careful with PGINDENT var name:
+# pgindent will pick it up as pg_bsd_indent path).
+.PHONY: pgindent
+pgindent:
+	+@ echo top_srcdir=$(top_srcdir) top_builddir=$(top_builddir) srcdir=$(srcdir)
+	$(FIND_TYPEDEF) . > neon.typedefs
+	INDENT=$(INDENT) $(PGINDENT_SCRIPT) --typedefs neon.typedefs $(srcdir)/*.c $(srcdir)/*.h

PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

View File

@@ -41,7 +41,7 @@ static char *ConsoleURL = NULL;
static bool ForwardDDL = true;

/* Curl structures for sending the HTTP requests */
-static CURL * CurlHandle;
+static CURL *CurlHandle;
static struct curl_slist *ContentHeader = NULL;

/*
@@ -54,7 +54,7 @@ typedef enum
{
	Op_Set, /* An upsert: Either a creation or an alter */
	Op_Delete,
} OpType;

typedef struct
{
@@ -62,7 +62,7 @@ typedef struct
	Oid owner;
	char old_name[NAMEDATALEN];
	OpType type;
} DbEntry;

typedef struct
{
@@ -70,7 +70,7 @@ typedef struct
	char old_name[NAMEDATALEN];
	const char *password;
	OpType type;
} RoleEntry;

/*
 * We keep one of these for each subtransaction in a stack. When a subtransaction
@@ -82,10 +82,10 @@ typedef struct DdlHashTable
	struct DdlHashTable *prev_table;
	HTAB *db_table;
	HTAB *role_table;
} DdlHashTable;

static DdlHashTable RootTable;
-static DdlHashTable * CurrentDdlTable = &RootTable;
+static DdlHashTable *CurrentDdlTable = &RootTable;

static void
PushKeyValue(JsonbParseState **state, char *key, char *value)
@@ -199,7 +199,7 @@ typedef struct
{
	char str[ERROR_SIZE];
	size_t size;
} ErrorString;

static size_t
ErrorWriteCallback(char *ptr, size_t size, size_t nmemb, void *userdata)
@@ -478,7 +478,7 @@ NeonXactCallback(XactEvent event, void *arg)
static bool
RoleIsNeonSuperuser(const char *role_name)
{
	return strcmp(role_name, "neon_superuser") == 0;
}

static void
@@ -509,6 +509,7 @@ HandleCreateDb(CreatedbStmt *stmt)
	if (downer && downer->arg)
	{
		const char *owner_name = defGetString(downer);
+
		if (RoleIsNeonSuperuser(owner_name))
			elog(ERROR, "can't create a database with owner neon_superuser");
		entry->owner = get_role_oid(owner_name, false);
@@ -536,6 +537,7 @@ HandleAlterOwner(AlterOwnerStmt *stmt)
if (!found) if (!found)
memset(entry->old_name, 0, sizeof(entry->old_name)); memset(entry->old_name, 0, sizeof(entry->old_name));
const char *new_owner = get_rolespec_name(stmt->newowner); const char *new_owner = get_rolespec_name(stmt->newowner);
if (RoleIsNeonSuperuser(new_owner)) if (RoleIsNeonSuperuser(new_owner))
elog(ERROR, "can't alter owner to neon_superuser"); elog(ERROR, "can't alter owner to neon_superuser");
entry->owner = get_role_oid(new_owner, false); entry->owner = get_role_oid(new_owner, false);
@@ -633,6 +635,7 @@ HandleAlterRole(AlterRoleStmt *stmt)
DefElem *dpass = NULL; DefElem *dpass = NULL;
ListCell *option; ListCell *option;
const char *role_name = stmt->role->rolename; const char *role_name = stmt->role->rolename;
if (RoleIsNeonSuperuser(role_name)) if (RoleIsNeonSuperuser(role_name))
elog(ERROR, "can't ALTER neon_superuser"); elog(ERROR, "can't ALTER neon_superuser");


@@ -25,79 +25,81 @@
 #include <curl/curl.h>
 static int	extension_server_port = 0;
 static download_extension_file_hook_type prev_download_extension_file_hook = NULL;
-// to download all SQL (and data) files for an extension:
-// curl -X POST http://localhost:8080/extension_server/postgis
-// it covers two possible extension files layouts:
-// 1. extension_name--version--platform.sql
-// 2. extension_name/extension_name--version.sql
-//    extension_name/extra_files.csv
-//
-// to download specific library file:
-// curl -X POST http://localhost:8080/extension_server/postgis-3.so?is_library=true
+/*
+ * to download all SQL (and data) files for an extension:
+ * curl -X POST http://localhost:8080/extension_server/postgis
+ * it covers two possible extension files layouts:
+ * 1. extension_name--version--platform.sql
+ * 2. extension_name/extension_name--version.sql
+ *    extension_name/extra_files.csv
+ *
+ * to download specific library file:
+ * curl -X POST http://localhost:8080/extension_server/postgis-3.so?is_library=true
+ */
 static bool
 neon_download_extension_file_http(const char *filename, bool is_library)
 {
 	CURL	   *curl;
 	CURLcode	res;
 	char	   *compute_ctl_url;
 	char	   *postdata;
 	bool		ret = false;
 	if ((curl = curl_easy_init()) == NULL)
 	{
 		elog(ERROR, "Failed to initialize curl handle");
 	}
 	compute_ctl_url = psprintf("http://localhost:%d/extension_server/%s%s",
 							   extension_server_port, filename, is_library ? "?is_library=true" : "");
 	elog(LOG, "Sending request to compute_ctl: %s", compute_ctl_url);
 	curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "POST");
 	curl_easy_setopt(curl, CURLOPT_URL, compute_ctl_url);
-	curl_easy_setopt(curl, CURLOPT_TIMEOUT, 3L /* seconds */);
+	curl_easy_setopt(curl, CURLOPT_TIMEOUT, 3L /* seconds */ );
 	if (curl)
 	{
 		/* Perform the request, res will get the return code */
 		res = curl_easy_perform(curl);
 		/* Check for errors */
 		if (res == CURLE_OK)
 		{
 			ret = true;
 		}
 		else
 		{
-			// Don't error here because postgres will try to find the file
-			// and will fail with some proper error message if it's not found.
+			/* Don't error here because postgres will try to find the file */
+			/* and will fail with some proper error message if it's not found. */
 			elog(WARNING, "neon_download_extension_file_http failed: %s\n", curl_easy_strerror(res));
 		}
 		/* always cleanup */
 		curl_easy_cleanup(curl);
 	}
 	return ret;
 }
-void pg_init_extension_server()
+void
+pg_init_extension_server()
 {
-	// Port to connect to compute_ctl on localhost
-	// to request extension files.
+	/* Port to connect to compute_ctl on localhost */
+	/* to request extension files. */
 	DefineCustomIntVariable("neon.extension_server_port",
 							"connection string to the compute_ctl",
 							NULL,
 							&extension_server_port,
 							0, 0, INT_MAX,
 							PGC_POSTMASTER,
 							0,	/* no flags required */
 							NULL, NULL, NULL);
-	// set download_extension_file_hook
+	/* set download_extension_file_hook */
 	prev_download_extension_file_hook = download_extension_file_hook;
 	download_extension_file_hook = neon_download_extension_file_http;
 }


@@ -67,32 +67,34 @@
 typedef struct FileCacheEntry
 {
 	BufferTag	key;
 	uint32		hash;
 	uint32		offset;
 	uint32		access_count;
-	uint32		bitmap[BLOCKS_PER_CHUNK/32];
+	uint32		bitmap[BLOCKS_PER_CHUNK / 32];
 	dlist_node	lru_node;		/* LRU list node */
 } FileCacheEntry;
 typedef struct FileCacheControl
 {
-	uint64		generation;		/* generation is needed to handle correct hash reenabling */
-	uint32		size;			/* size of cache file in chunks */
-	uint32		used;			/* number of used chunks */
-	uint32		limit;			/* shared copy of lfc_size_limit */
-	uint64		hits;
-	uint64		misses;
-	uint64		writes;
-	dlist_head	lru;			/* double linked list for LRU replacement algorithm */
+	uint64		generation;		/* generation is needed to handle correct hash
+								 * reenabling */
+	uint32		size;			/* size of cache file in chunks */
+	uint32		used;			/* number of used chunks */
+	uint32		limit;			/* shared copy of lfc_size_limit */
+	uint64		hits;
+	uint64		misses;
+	uint64		writes;
+	dlist_head	lru;			/* double linked list for LRU replacement
+								 * algorithm */
 } FileCacheControl;
-static HTAB* lfc_hash;
+static HTAB *lfc_hash;
 static int	lfc_desc = 0;
 static LWLockId lfc_lock;
 static int	lfc_max_size;
 static int	lfc_size_limit;
-static char* lfc_path;
-static FileCacheControl* lfc_ctl;
+static char *lfc_path;
+static FileCacheControl *lfc_ctl;
 static shmem_startup_hook_type prev_shmem_startup_hook;
 #if PG_VERSION_NUM>=150000
 static shmem_request_hook_type prev_shmem_request_hook;
@@ -100,7 +102,7 @@ static shmem_request_hook_type prev_shmem_request_hook;
 #define LFC_ENABLED() (lfc_ctl->limit != 0)
 void		PGDLLEXPORT FileCacheMonitorMain(Datum main_arg);
 /*
  * Local file cache is optional and Neon can work without it.
@@ -109,9 +111,10 @@ void PGDLLEXPORT FileCacheMonitorMain(Datum main_arg);
  * All cache content should be invalidated to avoid reading of stale or corrupted data
  */
 static void
-lfc_disable(char const* op)
+lfc_disable(char const *op)
 {
 	int			fd;
+
 	elog(WARNING, "Failed to %s local file cache at %s: %m, disabling local file cache", op, lfc_path);
 	/* Invalidate hash */
@@ -120,7 +123,7 @@ lfc_disable(char const* op)
 	if (LFC_ENABLED())
 	{
 		HASH_SEQ_STATUS status;
-		FileCacheEntry* entry;
+		FileCacheEntry *entry;
 		hash_seq_init(&status, lfc_hash);
 		while ((entry = hash_seq_search(&status)) != NULL)
@@ -135,16 +138,24 @@ lfc_disable(char const* op)
 		if (lfc_desc > 0)
 		{
-			/* If the reason of error is ENOSPC, then truncation of file may help to reclaim some space */
-			int rc = ftruncate(lfc_desc, 0);
+			/*
+			 * If the reason of error is ENOSPC, then truncation of file may
+			 * help to reclaim some space
+			 */
+			int			rc = ftruncate(lfc_desc, 0);
 			if (rc < 0)
 				elog(WARNING, "Failed to truncate local file cache %s: %m", lfc_path);
 		}
 	}
-	/* We need to use unlink to to avoid races in LFC write, because it is not protectedby */
+	/*
+	 * We need to use unlink to to avoid races in LFC write, because it is not
+	 * protectedby
+	 */
 	unlink(lfc_path);
-	fd = BasicOpenFile(lfc_path, O_RDWR|O_CREAT|O_TRUNC);
+	fd = BasicOpenFile(lfc_path, O_RDWR | O_CREAT | O_TRUNC);
 	if (fd < 0)
 		elog(WARNING, "Failed to recreate local file cache %s: %m", lfc_path);
 	else
@@ -170,13 +181,15 @@ lfc_maybe_disabled(void)
 static bool
 lfc_ensure_opened(void)
 {
 	bool		enabled = !lfc_maybe_disabled();
 	/* Open cache file if not done yet */
 	if (lfc_desc <= 0 && enabled)
 	{
 		lfc_desc = BasicOpenFile(lfc_path, O_RDWR);
-		if (lfc_desc < 0) {
+		if (lfc_desc < 0)
+		{
 			lfc_disable("open");
 			return false;
 		}
@@ -187,7 +200,7 @@ lfc_ensure_opened(void)
 static void
 lfc_shmem_startup(void)
 {
 	bool		found;
 	static HASHCTL info;
 	if (prev_shmem_startup_hook)
@@ -197,17 +210,22 @@ lfc_shmem_startup(void)
 	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
-	lfc_ctl = (FileCacheControl*)ShmemInitStruct("lfc", sizeof(FileCacheControl), &found);
+	lfc_ctl = (FileCacheControl *) ShmemInitStruct("lfc", sizeof(FileCacheControl), &found);
 	if (!found)
 	{
 		int			fd;
 		uint32		lfc_size = SIZE_MB_TO_CHUNKS(lfc_max_size);
-		lfc_lock = (LWLockId)GetNamedLWLockTranche("lfc_lock");
+
+		lfc_lock = (LWLockId) GetNamedLWLockTranche("lfc_lock");
 		info.keysize = sizeof(BufferTag);
 		info.entrysize = sizeof(FileCacheEntry);
+
+		/*
+		 * lfc_size+1 because we add new element to hash table before eviction
+		 * of victim
+		 */
 		lfc_hash = ShmemInitHash("lfc_hash",
-								 /* lfc_size+1 because we add new element to hash table before eviction of victim */
-								 lfc_size+1, lfc_size+1,
+								 lfc_size + 1, lfc_size + 1,
								 &info,
								 HASH_ELEM | HASH_BLOBS);
 		lfc_ctl->generation = 0;
@@ -219,7 +237,7 @@ lfc_shmem_startup(void)
 		dlist_init(&lfc_ctl->lru);
 		/* Recreate file cache on restart */
-		fd = BasicOpenFile(lfc_path, O_RDWR|O_CREAT|O_TRUNC);
+		fd = BasicOpenFile(lfc_path, O_RDWR | O_CREAT | O_TRUNC);
 		if (fd < 0)
 		{
 			elog(WARNING, "Failed to create local file cache %s: %m", lfc_path);
@@ -242,7 +260,7 @@ lfc_shmem_request(void)
 		prev_shmem_request_hook();
 #endif
-	RequestAddinShmemSpace(sizeof(FileCacheControl) + hash_estimate_size(SIZE_MB_TO_CHUNKS(lfc_max_size)+1, sizeof(FileCacheEntry)));
+	RequestAddinShmemSpace(sizeof(FileCacheControl) + hash_estimate_size(SIZE_MB_TO_CHUNKS(lfc_max_size) + 1, sizeof(FileCacheEntry)));
 	RequestNamedLWLockTranche("lfc_lock", 1);
 }
@@ -250,9 +268,11 @@ static bool
 is_normal_backend(void)
 {
 	/*
-	 * Stats collector detach shared memory, so we should not try to access shared memory here.
-	 * Parallel workers first assign default value (0), so not perform truncation in parallel workers.
-	 * The Postmaster can handle SIGHUP and it has access to shared memory (UsedShmemSegAddr != NULL), but has no PGPROC.
+	 * Stats collector detach shared memory, so we should not try to access
+	 * shared memory here. Parallel workers first assign default value (0), so
+	 * not perform truncation in parallel workers. The Postmaster can handle
+	 * SIGHUP and it has access to shared memory (UsedShmemSegAddr != NULL),
+	 * but has no PGPROC.
	 */
 	return lfc_ctl && MyProc && UsedShmemSegAddr && !IsParallelWorker();
 }
@@ -271,7 +291,7 @@ lfc_check_limit_hook(int *newval, void **extra, GucSource source)
 static void
 lfc_change_limit_hook(int newval, void *extra)
 {
 	uint32		new_size = SIZE_MB_TO_CHUNKS(newval);
 	if (!is_normal_backend())
 		return;
@@ -283,11 +303,15 @@ lfc_change_limit_hook(int newval, void *extra)
 	while (new_size < lfc_ctl->used && !dlist_is_empty(&lfc_ctl->lru))
 	{
-		/* Shrink cache by throwing away least recently accessed chunks and returning their space to file system */
-		FileCacheEntry* victim = dlist_container(FileCacheEntry, lru_node, dlist_pop_head_node(&lfc_ctl->lru));
+		/*
+		 * Shrink cache by throwing away least recently accessed chunks and
+		 * returning their space to file system
+		 */
+		FileCacheEntry *victim = dlist_container(FileCacheEntry, lru_node, dlist_pop_head_node(&lfc_ctl->lru));
 		Assert(victim->access_count == 0);
 #ifdef FALLOC_FL_PUNCH_HOLE
-		if (fallocate(lfc_desc, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, (off_t)victim->offset*BLOCKS_PER_CHUNK*BLCKSZ, BLOCKS_PER_CHUNK*BLCKSZ) < 0)
+		if (fallocate(lfc_desc, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, (off_t) victim->offset * BLOCKS_PER_CHUNK * BLCKSZ, BLOCKS_PER_CHUNK * BLCKSZ) < 0)
 			elog(LOG, "Failed to punch hole in file: %m");
 #endif
 		hash_search_with_hash_value(lfc_hash, &victim->key, victim->hash, HASH_REMOVE, NULL);
@@ -314,7 +338,7 @@ lfc_init(void)
							"Maximal size of Neon local file cache",
							NULL,
							&lfc_max_size,
							0,	/* disabled by default */
							0,
							INT_MAX,
							PGC_POSTMASTER,
@@ -327,7 +351,7 @@ lfc_init(void)
							"Current limit for size of Neon local file cache",
							NULL,
							&lfc_size_limit,
							0,	/* disabled by default */
							0,
							INT_MAX,
							PGC_SIGHUP,
@@ -367,18 +391,18 @@ lfc_init(void)
 bool
 lfc_cache_contains(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno)
 {
 	BufferTag	tag;
-	FileCacheEntry* entry;
-	int chunk_offs = blkno & (BLOCKS_PER_CHUNK-1);
+	FileCacheEntry *entry;
+	int			chunk_offs = blkno & (BLOCKS_PER_CHUNK - 1);
 	bool		found = false;
 	uint32		hash;
 	if (lfc_maybe_disabled())	/* fast exit if file cache is disabled */
 		return false;
 	CopyNRelFileInfoToBufTag(tag, rinfo);
 	tag.forkNum = forkNum;
-	tag.blockNum = blkno & ~(BLOCKS_PER_CHUNK-1);
+	tag.blockNum = blkno & ~(BLOCKS_PER_CHUNK - 1);
 	hash = get_hash_value(lfc_hash, &tag);
 	LWLockAcquire(lfc_lock, LW_SHARED);
@@ -397,13 +421,13 @@ lfc_cache_contains(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno)
 void
 lfc_evict(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno)
 {
 	BufferTag	tag;
-	FileCacheEntry* entry;
+	FileCacheEntry *entry;
 	bool		found;
-	int chunk_offs = blkno & (BLOCKS_PER_CHUNK-1);
+	int			chunk_offs = blkno & (BLOCKS_PER_CHUNK - 1);
 	uint32		hash;
 	if (lfc_maybe_disabled())	/* fast exit if file cache is disabled */
 		return;
 	CopyNRelFileInfoToBufTag(tag, rinfo);
@@ -438,9 +462,10 @@ lfc_evict(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno)
	 */
 	if (entry->bitmap[chunk_offs >> 5] == 0)
 	{
 		bool		has_remaining_pages;
-		for (int i = 0; i < (BLOCKS_PER_CHUNK / 32); i++) {
+
+		for (int i = 0; i < (BLOCKS_PER_CHUNK / 32); i++)
+		{
 			if (entry->bitmap[i] != 0)
 			{
 				has_remaining_pages = true;
@@ -449,8 +474,8 @@ lfc_evict(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno)
 		}
 		/*
-		 * Put the entry at the position that is first to be reclaimed when
-		 * we have no cached pages remaining in the chunk
+		 * Put the entry at the position that is first to be reclaimed when we
+		 * have no cached pages remaining in the chunk
		 */
 		if (!has_remaining_pages)
 		{
@@ -476,16 +501,16 @@ bool
 lfc_read(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
		 char *buffer)
 {
 	BufferTag	tag;
-	FileCacheEntry* entry;
+	FileCacheEntry *entry;
 	ssize_t		rc;
-	int chunk_offs = blkno & (BLOCKS_PER_CHUNK-1);
+	int			chunk_offs = blkno & (BLOCKS_PER_CHUNK - 1);
 	bool		result = true;
 	uint32		hash;
 	uint64		generation;
 	uint32		entry_offset;
 	if (lfc_maybe_disabled())	/* fast exit if file cache is disabled */
 		return false;
 	if (!lfc_ensure_opened())
@@ -493,7 +518,7 @@ lfc_read(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
 	CopyNRelFileInfoToBufTag(tag, rinfo);
 	tag.forkNum = forkNum;
-	tag.blockNum = blkno & ~(BLOCKS_PER_CHUNK-1);
+	tag.blockNum = blkno & ~(BLOCKS_PER_CHUNK - 1);
 	hash = get_hash_value(lfc_hash, &tag);
 	LWLockAcquire(lfc_lock, LW_EXCLUSIVE);
@@ -520,7 +545,7 @@ lfc_read(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
 	LWLockRelease(lfc_lock);
-	rc = pread(lfc_desc, buffer, BLCKSZ, ((off_t)entry_offset*BLOCKS_PER_CHUNK + chunk_offs)*BLCKSZ);
+	rc = pread(lfc_desc, buffer, BLCKSZ, ((off_t) entry_offset * BLOCKS_PER_CHUNK + chunk_offs) * BLCKSZ);
 	if (rc != BLCKSZ)
 	{
 		lfc_disable("read");
@@ -551,30 +576,29 @@ lfc_read(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
  * If cache is full then evict some other page.
  */
 void
-lfc_write(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
 #if PG_MAJORVERSION_NUM < 16
-		  char *buffer)
+lfc_write(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno, char *buffer)
 #else
-		  const void *buffer)
+lfc_write(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno, const void *buffer)
 #endif
 {
 	BufferTag	tag;
-	FileCacheEntry* entry;
+	FileCacheEntry *entry;
 	ssize_t		rc;
 	bool		found;
-	int chunk_offs = blkno & (BLOCKS_PER_CHUNK-1);
+	int			chunk_offs = blkno & (BLOCKS_PER_CHUNK - 1);
 	uint32		hash;
 	uint64		generation;
 	uint32		entry_offset;
 	if (lfc_maybe_disabled())	/* fast exit if file cache is disabled */
 		return;
 	if (!lfc_ensure_opened())
 		return;
 	tag.forkNum = forkNum;
-	tag.blockNum = blkno & ~(BLOCKS_PER_CHUNK-1);
+	tag.blockNum = blkno & ~(BLOCKS_PER_CHUNK - 1);
 	CopyNRelFileInfoToBufTag(tag, rinfo);
 	hash = get_hash_value(lfc_hash, &tag);
@@ -590,24 +614,36 @@ lfc_write(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
 	if (found)
 	{
-		/* Unlink entry from LRU list to pin it for the duration of IO operation */
+		/*
+		 * Unlink entry from LRU list to pin it for the duration of IO
+		 * operation
+		 */
 		if (entry->access_count++ == 0)
 			dlist_delete(&entry->lru_node);
 	}
 	else
 	{
 		/*
-		 * We have two choices if all cache pages are pinned (i.e. used in IO operations):
-		 * 1. Wait until some of this operation is completed and pages is unpinned
-		 * 2. Allocate one more chunk, so that specified cache size is more recommendation than hard limit.
-		 * As far as probability of such event (that all pages are pinned) is considered to be very very small:
-		 * there are should be very large number of concurrent IO operations and them are limited by max_connections,
-		 * we prefer not to complicate code and use second approach.
+		 * We have two choices if all cache pages are pinned (i.e. used in IO
+		 * operations):
+		 *
+		 * 1) Wait until some of this operation is completed and pages is
+		 * unpinned.
+		 *
+		 * 2) Allocate one more chunk, so that specified cache size is more
+		 * recommendation than hard limit.
+		 *
+		 * As far as probability of such event (that all pages are pinned) is
+		 * considered to be very very small: there are should be very large
+		 * number of concurrent IO operations and them are limited by
+		 * max_connections, we prefer not to complicate code and use second
+		 * approach.
		 */
 		if (lfc_ctl->used >= lfc_ctl->limit && !dlist_is_empty(&lfc_ctl->lru))
 		{
 			/* Cache overflow: evict least recently used chunk */
-			FileCacheEntry* victim = dlist_container(FileCacheEntry, lru_node, dlist_pop_head_node(&lfc_ctl->lru));
+			FileCacheEntry *victim = dlist_container(FileCacheEntry, lru_node, dlist_pop_head_node(&lfc_ctl->lru));
 			Assert(victim->access_count == 0);
 			entry->offset = victim->offset; /* grab victim's chunk */
 			hash_search_with_hash_value(lfc_hash, &victim->key, victim->hash, HASH_REMOVE, NULL);
@@ -616,7 +652,8 @@ lfc_write(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
 		else
 		{
 			lfc_ctl->used += 1;
-			entry->offset = lfc_ctl->size++;	/* allocate new chunk at end of file */
+			entry->offset = lfc_ctl->size++;	/* allocate new chunk at end
+												 * of file */
 		}
 		entry->access_count = 1;
 		entry->hash = hash;
@@ -628,7 +665,7 @@ lfc_write(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
 	lfc_ctl->writes += 1;
 	LWLockRelease(lfc_lock);
-	rc = pwrite(lfc_desc, buffer, BLCKSZ, ((off_t)entry_offset*BLOCKS_PER_CHUNK + chunk_offs)*BLCKSZ);
+	rc = pwrite(lfc_desc, buffer, BLCKSZ, ((off_t) entry_offset * BLOCKS_PER_CHUNK + chunk_offs) * BLCKSZ);
 	if (rc != BLCKSZ)
 	{
 		lfc_disable("write");
@@ -665,13 +702,13 @@ Datum
 neon_get_lfc_stats(PG_FUNCTION_ARGS)
 {
 	FuncCallContext *funcctx;
-	NeonGetStatsCtx* fctx;
+	NeonGetStatsCtx *fctx;
 	MemoryContext oldcontext;
 	TupleDesc	tupledesc;
 	Datum		result;
 	HeapTuple	tuple;
-	char const* key;
+	char const *key;
 	uint64		value;
 	Datum		values[NUM_NEON_GET_STATS_COLS];
 	bool		nulls[NUM_NEON_GET_STATS_COLS];
@@ -683,7 +720,7 @@ neon_get_lfc_stats(PG_FUNCTION_ARGS)
 	oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
 	/* Create a user function context for cross-call persistence */
-	fctx = (NeonGetStatsCtx*) palloc(sizeof(NeonGetStatsCtx));
+	fctx = (NeonGetStatsCtx *) palloc(sizeof(NeonGetStatsCtx));
 	/* Construct a tuple descriptor for the result rows. */
 	tupledesc = CreateTemplateTupleDesc(NUM_NEON_GET_STATS_COLS);
@@ -704,7 +741,7 @@ neon_get_lfc_stats(PG_FUNCTION_ARGS)
 	funcctx = SRF_PERCALL_SETUP();
 	/* Get the saved state */
-	fctx = (NeonGetStatsCtx*) funcctx->user_fctx;
+	fctx = (NeonGetStatsCtx *) funcctx->user_fctx;
 	switch (funcctx->call_cntr)
 	{
@@ -792,9 +829,9 @@ local_cache_pages(PG_FUNCTION_ARGS)
 	if (SRF_IS_FIRSTCALL())
 	{
 		HASH_SEQ_STATUS status;
-		FileCacheEntry* entry;
+		FileCacheEntry *entry;
 		uint32		n_pages = 0;
 		funcctx = SRF_FIRSTCALL_INIT();
@@ -851,7 +888,7 @@ local_cache_pages(PG_FUNCTION_ARGS)
 		hash_seq_init(&status, lfc_hash);
 		while ((entry = hash_seq_search(&status)) != NULL)
 		{
-			for (int i = 0; i < BLOCKS_PER_CHUNK/32; i++)
+			for (int i = 0; i < BLOCKS_PER_CHUNK / 32; i++)
				n_pages += pg_popcount32(entry->bitmap[i]);
		}
	}
@@ -870,10 +907,11 @@ local_cache_pages(PG_FUNCTION_ARGS)
	if (n_pages != 0)
	{
		/*
-		 * Scan through all the cache entries, saving the relevant fields in the
-		 * fctx->record structure.
+		 * Scan through all the cache entries, saving the relevant fields
+		 * in the fctx->record structure.
		 */
		uint32		n = 0;
		hash_seq_init(&status, lfc_hash);
		while ((entry = hash_seq_search(&status)) != NULL)
		{
@@ -881,7 +919,7 @@ local_cache_pages(PG_FUNCTION_ARGS)
			{
				if (entry->bitmap[i >> 5] & (1 << (i & 31)))
				{
-					fctx->record[n].pageoffs = entry->offset*BLOCKS_PER_CHUNK + i;
+					fctx->record[n].pageoffs = entry->offset * BLOCKS_PER_CHUNK + i;
					fctx->record[n].relfilenode = NInfoGetRelNumber(BufTagGetNRelFileInfo(entry->key));
					fctx->record[n].reltablespace = NInfoGetSpcOid(BufTagGetNRelFileInfo(entry->key));
					fctx->record[n].reldatabase = NInfoGetDbOid(BufTagGetNRelFileInfo(entry->key));


@@ -69,9 +69,9 @@ int max_reconnect_attempts = 60;
typedef struct typedef struct
{ {
LWLockId lock; LWLockId lock;
pg_atomic_uint64 update_counter; pg_atomic_uint64 update_counter;
char pageserver_connstring[MAX_PAGESERVER_CONNSTRING_SIZE]; char pageserver_connstring[MAX_PAGESERVER_CONNSTRING_SIZE];
} PagestoreShmemState; } PagestoreShmemState;
#if PG_VERSION_NUM >= 150000 #if PG_VERSION_NUM >= 150000
@@ -83,7 +83,7 @@ static PagestoreShmemState *pagestore_shared;
static uint64 pagestore_local_counter = 0; static uint64 pagestore_local_counter = 0;
static char local_pageserver_connstring[MAX_PAGESERVER_CONNSTRING_SIZE]; static char local_pageserver_connstring[MAX_PAGESERVER_CONNSTRING_SIZE];
bool (*old_redo_read_buffer_filter) (XLogReaderState *record, uint8 block_id) = NULL; bool (*old_redo_read_buffer_filter) (XLogReaderState *record, uint8 block_id) = NULL;
static bool pageserver_flush(void); static bool pageserver_flush(void);
static void pageserver_disconnect(void); static void pageserver_disconnect(void);
@@ -91,43 +91,43 @@ static void pageserver_disconnect(void);
static bool static bool
PagestoreShmemIsValid() PagestoreShmemIsValid()
{ {
return pagestore_shared && UsedShmemSegAddr; return pagestore_shared && UsedShmemSegAddr;
} }
static bool static bool
CheckPageserverConnstring(char **newval, void **extra, GucSource source) CheckPageserverConnstring(char **newval, void **extra, GucSource source)
{ {
return strlen(*newval) < MAX_PAGESERVER_CONNSTRING_SIZE; return strlen(*newval) < MAX_PAGESERVER_CONNSTRING_SIZE;
} }
static void static void
AssignPageserverConnstring(const char *newval, void *extra) AssignPageserverConnstring(const char *newval, void *extra)
{ {
if(!PagestoreShmemIsValid()) if (!PagestoreShmemIsValid())
return; return;
LWLockAcquire(pagestore_shared->lock, LW_EXCLUSIVE); LWLockAcquire(pagestore_shared->lock, LW_EXCLUSIVE);
strlcpy(pagestore_shared->pageserver_connstring, newval, MAX_PAGESERVER_CONNSTRING_SIZE); strlcpy(pagestore_shared->pageserver_connstring, newval, MAX_PAGESERVER_CONNSTRING_SIZE);
pg_atomic_fetch_add_u64(&pagestore_shared->update_counter, 1); pg_atomic_fetch_add_u64(&pagestore_shared->update_counter, 1);
LWLockRelease(pagestore_shared->lock); LWLockRelease(pagestore_shared->lock);
} }
static bool static bool
CheckConnstringUpdated() CheckConnstringUpdated()
{ {
if(!PagestoreShmemIsValid()) if (!PagestoreShmemIsValid())
return false; return false;
return pagestore_local_counter < pg_atomic_read_u64(&pagestore_shared->update_counter); return pagestore_local_counter < pg_atomic_read_u64(&pagestore_shared->update_counter);
} }
static void static void
ReloadConnstring() ReloadConnstring()
{ {
if(!PagestoreShmemIsValid()) if (!PagestoreShmemIsValid())
return; return;
LWLockAcquire(pagestore_shared->lock, LW_SHARED); LWLockAcquire(pagestore_shared->lock, LW_SHARED);
strlcpy(local_pageserver_connstring, pagestore_shared->pageserver_connstring, sizeof(local_pageserver_connstring)); strlcpy(local_pageserver_connstring, pagestore_shared->pageserver_connstring, sizeof(local_pageserver_connstring));
pagestore_local_counter = pg_atomic_read_u64(&pagestore_shared->update_counter); pagestore_local_counter = pg_atomic_read_u64(&pagestore_shared->update_counter);
LWLockRelease(pagestore_shared->lock); LWLockRelease(pagestore_shared->lock);
} }
static bool static bool
@@ -141,21 +141,20 @@ pageserver_connect(int elevel)
Assert(!connected); Assert(!connected);
if(CheckConnstringUpdated()) if (CheckConnstringUpdated())
{ {
ReloadConnstring(); ReloadConnstring();
} }
	/*
	 * Connect using the connection string we got from the
	 * neon.pageserver_connstring GUC. If the NEON_AUTH_TOKEN environment
	 * variable was set, use that as the password.
	 *
	 * The connection options are parsed in the order they're given, so when
	 * we set the password before the connection string, the connection string
	 * can override the password from the env variable. Seems useful, although
	 * we don't currently use that capability anywhere.
	 */
	n = 0;
	if (neon_auth_token)
@@ -198,9 +197,9 @@ pageserver_connect(int elevel)
	pageserver_conn_wes = CreateWaitEventSet(TopMemoryContext, 3);
	AddWaitEventToSet(pageserver_conn_wes, WL_LATCH_SET, PGINVALID_SOCKET,
					  MyLatch, NULL);
	AddWaitEventToSet(pageserver_conn_wes, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
					  NULL, NULL);
	AddWaitEventToSet(pageserver_conn_wes, WL_SOCKET_READABLE, PQsocket(pageserver_conn), NULL, NULL);

	while (PQisBusy(pageserver_conn))
@@ -265,6 +264,7 @@ retry:
		if (!PQconsumeInput(pageserver_conn))
		{
			char	   *msg = pchomp(PQerrorMessage(pageserver_conn));

			neon_log(LOG, "could not get response from pageserver: %s", msg);
			pfree(msg);
			return -1;
@@ -305,15 +305,15 @@ pageserver_disconnect(void)
}

static bool
pageserver_send(NeonRequest *request)
{
	StringInfoData req_buff;

	if (CheckConnstringUpdated())
	{
		pageserver_disconnect();
		ReloadConnstring();
	}

	/* If the connection was lost for some reason, reconnect */
	if (connected && PQstatus(pageserver_conn) == CONNECTION_BAD)
@@ -326,10 +326,12 @@ pageserver_send(NeonRequest * request)
	/*
	 * If pageserver is stopped, the connections from compute node are broken.
	 * The compute node doesn't notice that immediately, but it will cause the
	 * next request to fail, usually on the next query. That causes
	 * user-visible errors if pageserver is restarted, or the tenant is moved
	 * from one pageserver to another. See
	 * https://github.com/neondatabase/neon/issues/1138 So try to reestablish
	 * connection in case of failure.
	 */
	if (!connected)
	{
@@ -353,6 +355,7 @@ pageserver_send(NeonRequest * request)
	if (PQputCopyData(pageserver_conn, req_buff.data, req_buff.len) <= 0)
	{
		char	   *msg = pchomp(PQerrorMessage(pageserver_conn));

		pageserver_disconnect();
		neon_log(LOG, "pageserver_send disconnect because failed to send page request (try to reconnect): %s", msg);
		pfree(msg);
@@ -410,7 +413,8 @@ pageserver_receive(void)
	}
	else if (rc == -2)
	{
		char	   *msg = pchomp(PQerrorMessage(pageserver_conn));

		pageserver_disconnect();
		neon_log(ERROR, "pageserver_receive disconnect because could not read COPY data: %s", msg);
	}
@@ -444,6 +448,7 @@ pageserver_flush(void)
	if (PQflush(pageserver_conn))
	{
		char	   *msg = pchomp(PQerrorMessage(pageserver_conn));

		pageserver_disconnect();
		neon_log(LOG, "pageserver_flush disconnect because failed to flush page requests: %s", msg);
		pfree(msg);
@@ -471,46 +476,47 @@ check_neon_id(char **newval, void **extra, GucSource source)
static Size
PagestoreShmemSize(void)
{
	return sizeof(PagestoreShmemState);
}

static bool
PagestoreShmemInit(void)
{
	bool		found;

	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
	pagestore_shared = ShmemInitStruct("libpagestore shared state",
									   PagestoreShmemSize(),
									   &found);
	if (!found)
	{
		pagestore_shared->lock = &(GetNamedLWLockTranche("neon_libpagestore")->lock);
		pg_atomic_init_u64(&pagestore_shared->update_counter, 0);
		AssignPageserverConnstring(page_server_connstring, NULL);
	}
	LWLockRelease(AddinShmemInitLock);

	return found;
}

static void
pagestore_shmem_startup_hook(void)
{
	if (prev_shmem_startup_hook)
		prev_shmem_startup_hook();

	PagestoreShmemInit();
}
static void
pagestore_shmem_request(void)
{
#if PG_VERSION_NUM >= 150000
	if (prev_shmem_request_hook)
		prev_shmem_request_hook();
#endif

	RequestAddinShmemSpace(PagestoreShmemSize());
	RequestNamedLWLockTranche("neon_libpagestore", 1);
}

static void
@@ -520,7 +526,7 @@ pagestore_prepare_shmem(void)
	prev_shmem_request_hook = shmem_request_hook;
	shmem_request_hook = pagestore_shmem_request;
#else
	pagestore_shmem_request();
#endif
	prev_shmem_startup_hook = shmem_startup_hook;
	shmem_startup_hook = pagestore_shmem_startup_hook;
@@ -532,7 +538,7 @@ pagestore_prepare_shmem(void)
void
pg_init_libpagestore(void)
{
	pagestore_prepare_shmem();

	DefineCustomStringVariable("neon.pageserver_connstring",
							   "connection string to the page server",
@@ -607,7 +613,10 @@ pg_init_libpagestore(void)
	neon_log(PageStoreTrace, "libpagestore already loaded");
	page_server = &api;

	/*
	 * Retrieve the auth token to use when connecting to pageserver and
	 * safekeepers
	 */
	neon_auth_token = getenv("NEON_AUTH_TOKEN");
	if (neon_auth_token)
		neon_log(LOG, "using storage auth token from NEON_AUTH_TOKEN environment variable");


@@ -48,9 +48,11 @@ _PG_init(void)
	pg_init_extension_server();

	/*
	 * Important: This must happen after other parts of the extension are
	 * loaded, otherwise any settings to GUCs that were set before the
	 * extension was loaded will be removed.
	 */
	EmitWarningsOnPlaceholders("neon");
}


@@ -32,7 +32,7 @@ extern void pg_init_extension_server(void);
 * block_id; false otherwise.
 */
extern bool neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id);
extern bool (*old_redo_read_buffer_filter) (XLogReaderState *record, uint8 block_id);
extern uint64 BackpressureThrottlingTime(void);
extern void replication_feedback_get_lsns(XLogRecPtr *writeLsn, XLogRecPtr *flushLsn, XLogRecPtr *applyLsn);


@@ -59,7 +59,7 @@
#define DropRelationAllLocalBuffers DropRelFileNodeAllLocalBuffers
#else							/* major version >= 16 */
#define USE_RELFILELOCATOR
@@ -109,4 +109,4 @@
#define DropRelationAllLocalBuffers DropRelationAllLocalBuffers
#endif
#endif							/* NEON_PGVERSIONCOMPAT_H */


@@ -40,13 +40,13 @@ typedef enum
	T_NeonGetPageResponse,
	T_NeonErrorResponse,
	T_NeonDbSizeResponse,
} NeonMessageTag;

/* base struct for c-style inheritance */
typedef struct
{
	NeonMessageTag tag;
} NeonMessage;

#define messageTag(m) (((const NeonMessage *)(m))->tag)
@@ -67,27 +67,27 @@ typedef struct
	NeonMessageTag tag;
	bool		latest;			/* if true, request latest page version */
	XLogRecPtr	lsn;			/* request page version @ this LSN */
} NeonRequest;

typedef struct
{
	NeonRequest req;
	NRelFileInfo rinfo;
	ForkNumber	forknum;
} NeonExistsRequest;

typedef struct
{
	NeonRequest req;
	NRelFileInfo rinfo;
	ForkNumber	forknum;
} NeonNblocksRequest;

typedef struct
{
	NeonRequest req;
	Oid			dbNode;
} NeonDbSizeRequest;

typedef struct
{
@@ -95,31 +95,31 @@ typedef struct
	NRelFileInfo rinfo;
	ForkNumber	forknum;
	BlockNumber blkno;
} NeonGetPageRequest;

/* supertype of all the Neon*Response structs below */
typedef struct
{
	NeonMessageTag tag;
} NeonResponse;

typedef struct
{
	NeonMessageTag tag;
	bool		exists;
} NeonExistsResponse;

typedef struct
{
	NeonMessageTag tag;
	uint32		n_blocks;
} NeonNblocksResponse;

typedef struct
{
	NeonMessageTag tag;
	char		page[FLEXIBLE_ARRAY_MEMBER];
} NeonGetPageResponse;

#define PS_GETPAGERESPONSE_SIZE (MAXALIGN(offsetof(NeonGetPageResponse, page) + BLCKSZ))
@@ -127,18 +127,18 @@ typedef struct
{
	NeonMessageTag tag;
	int64		db_size;
} NeonDbSizeResponse;

typedef struct
{
	NeonMessageTag tag;
	char		message[FLEXIBLE_ARRAY_MEMBER]; /* null-terminated error
												 * message */
} NeonErrorResponse;

extern StringInfoData nm_pack_request(NeonRequest *msg);
extern NeonResponse *nm_unpack_response(StringInfo s);
extern char *nm_to_string(NeonMessage *msg);

/*
 * API
@@ -146,20 +146,20 @@ extern char *nm_to_string(NeonMessage * msg);
typedef struct
{
	bool		(*send) (NeonRequest *request);
	NeonResponse *(*receive) (void);
	bool		(*flush) (void);
} page_server_api;

extern void prefetch_on_ps_disconnect(void);

extern page_server_api *page_server;

extern char *page_server_connstring;
extern int	flush_every_n_requests;
extern int	readahead_buffer_size;
extern bool seqscan_prefetch_enabled;
extern int	seqscan_prefetch_distance;
extern char *neon_timeline;
extern char *neon_tenant;
extern bool wal_redo;
@@ -194,14 +194,14 @@ extern bool neon_prefetch(SMgrRelation reln, ForkNumber forknum,
extern void neon_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
					  char *buffer);
extern PGDLLEXPORT void neon_read_at_lsn(NRelFileInfo rnode, ForkNumber forkNum, BlockNumber blkno,
										 XLogRecPtr request_lsn, bool request_latest, char *buffer);
extern void neon_write(SMgrRelation reln, ForkNumber forknum,
					   BlockNumber blocknum, char *buffer, bool skipFsync);
#else
extern void neon_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
					  void *buffer);
extern PGDLLEXPORT void neon_read_at_lsn(NRelFileInfo rnode, ForkNumber forkNum, BlockNumber blkno,
										 XLogRecPtr request_lsn, bool request_latest, void *buffer);
extern void neon_write(SMgrRelation reln, ForkNumber forknum,
					   BlockNumber blocknum, const void *buffer, bool skipFsync);
#endif


@@ -59,6 +59,7 @@
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
#include "storage/fsm_internals.h"
#include "storage/smgr.h"
#include "storage/md.h"
#include "pgstat.h"
@@ -100,21 +101,21 @@ typedef enum
	UNLOGGED_BUILD_PHASE_1,
	UNLOGGED_BUILD_PHASE_2,
	UNLOGGED_BUILD_NOT_PERMANENT
} UnloggedBuildPhase;

static SMgrRelation unlogged_build_rel = NULL;
static UnloggedBuildPhase unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
/* /*
* Prefetch implementation: * Prefetch implementation:
* *
* Prefetch is performed locally by each backend. * Prefetch is performed locally by each backend.
* *
* There can be up to readahead_buffer_size active IO requests registered at * There can be up to readahead_buffer_size active IO requests registered at
* any time. Requests using smgr_prefetch are sent to the pageserver, but we * any time. Requests using smgr_prefetch are sent to the pageserver, but we
* don't wait on the response. Requests using smgr_read are either read from * don't wait on the response. Requests using smgr_read are either read from
* the buffer, or (if that's not possible) we wait on the response to arrive - * the buffer, or (if that's not possible) we wait on the response to arrive -
* this also will allow us to receive other prefetched pages. * this also will allow us to receive other prefetched pages.
* Each request is immediately written to the output buffer of the pageserver * Each request is immediately written to the output buffer of the pageserver
* connection, but may not be flushed if smgr_prefetch is used: pageserver * connection, but may not be flushed if smgr_prefetch is used: pageserver
* flushes sent requests on manual flush, or every neon.flush_output_after * flushes sent requests on manual flush, or every neon.flush_output_after
@@ -138,7 +139,7 @@ static UnloggedBuildPhase unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
/*
 * State machine:
 *
 * not in hash : in hash
 *             :
 * UNUSED ------> REQUESTED --> RECEIVED
@@ -149,30 +150,34 @@ static UnloggedBuildPhase unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
 *        +----------------+------------+
 *                         :
 */
typedef enum PrefetchStatus
{
	PRFS_UNUSED = 0,			/* unused slot */
	PRFS_REQUESTED,				/* request was written to the sendbuffer to
								 * PS, but not necessarily flushed. all fields
								 * except response valid */
	PRFS_RECEIVED,				/* all fields valid */
	PRFS_TAG_REMAINS,			/* only buftag and my_ring_index are still
								 * valid */
} PrefetchStatus;

typedef struct PrefetchRequest
{
	BufferTag	buftag;			/* must be first entry in the struct */
	XLogRecPtr	effective_request_lsn;
	XLogRecPtr	actual_request_lsn;
	NeonResponse *response;		/* may be null */
	PrefetchStatus status;
	uint64		my_ring_index;
} PrefetchRequest;

/* prefetch buffer lookup hash table */
typedef struct PrfHashEntry
{
	PrefetchRequest *slot;
	uint32		status;
	uint32		hash;
} PrfHashEntry;

#define SH_PREFIX prfh
@@ -196,36 +201,42 @@ typedef struct PrfHashEntry {
/*
 * PrefetchState maintains the state of (prefetch) getPage@LSN requests.
 * It maintains a (ring) buffer of in-flight requests and responses.
 *
 * We maintain several indexes into the ring buffer:
 * ring_unused >= ring_flush >= ring_receive >= ring_last >= 0
 *
 * ring_unused points to the first unused slot of the buffer
 * ring_receive is the next request that is to be received
 * ring_last is the oldest received entry in the buffer
 *
 * Apart from being an entry in the ring buffer of prefetch requests, each
 * PrefetchRequest that is not UNUSED is indexed in prf_hash by buftag.
 */
typedef struct PrefetchState
{
	MemoryContext bufctx;		/* context for prf_buffer[].response
								 * allocations */
	MemoryContext errctx;		/* context for prf_buffer[].response
								 * allocations */
	MemoryContext hashctx;		/* context for prf_buffer */

	/* buffer indexes */
	uint64		ring_unused;	/* first unused slot */
	uint64		ring_flush;		/* next request to flush */
	uint64		ring_receive;	/* next slot that is to receive a response */
	uint64		ring_last;		/* min slot with a response value */

	/* metrics / statistics */
	int			n_responses_buffered;	/* count of PS responses not yet in
										 * buffers */
	int			n_requests_inflight;	/* count of PS requests considered in
										 * flight */
	int			n_unused;		/* count of buffers < unused, > last, that are
								 * also unused */

	/* the buffers */
	prfh_hash  *prf_hash;
	PrefetchRequest prf_buffer[];	/* prefetch buffers */
} PrefetchState;

PrefetchState *MyPState;
@@ -263,10 +274,10 @@ static XLogRecPtr neon_get_request_lsn(bool *latest, NRelFileInfo rinfo,
static bool
compact_prefetch_buffers(void)
{
	uint64		empty_ring_index = MyPState->ring_last;
	uint64		search_ring_index = MyPState->ring_receive;
	int			n_moved = 0;

	if (MyPState->ring_receive == MyPState->ring_last)
		return false;
@@ -281,15 +292,14 @@ compact_prefetch_buffers(void)
	}

	/*
	 * Here we have established: slots < search_ring_index have an unknown
	 * state (not scanned) slots >= search_ring_index and <= empty_ring_index
	 * are unused slots > empty_ring_index are in use, or outside our buffer's
	 * range. ... unless search_ring_index <= ring_last
	 *
	 * Therefore, there is a gap of at least one unused items between
	 * search_ring_index and empty_ring_index (both inclusive), which grows as
	 * we hit more unused items while moving backwards through the array.
	 */
	while (search_ring_index > MyPState->ring_last)
@@ -329,7 +339,10 @@ compact_prefetch_buffers(void)
		/* empty the moved slot */
		source_slot->status = PRFS_UNUSED;
		source_slot->buftag = (BufferTag)
		{
			0
		};
		source_slot->response = NULL;
		source_slot->my_ring_index = 0;
		source_slot->effective_request_lsn = 0;
@@ -339,8 +352,8 @@ compact_prefetch_buffers(void)
	}

	/*
	 * Only when we've moved slots we can expect trailing unused slots, so
	 * only then we clean up trailing unused slots.
	 */
	if (n_moved > 0)
	{
@@ -357,10 +370,9 @@ readahead_buffer_resize(int newsize, void *extra)
	uint64		end,
				nfree = newsize;
	PrefetchState *newPState;
	Size		newprfs_size = offsetof(PrefetchState, prf_buffer) +
		(sizeof(PrefetchRequest) * newsize);

	/* don't try to re-initialize if we haven't initialized yet */
	if (MyPState == NULL)
		return;
@@ -387,12 +399,12 @@ readahead_buffer_resize(int newsize, void *extra)
	newPState->ring_receive = newsize;
	newPState->ring_flush = newsize;

	/*
	 * Copy over the prefetches.
	 *
	 * We populate the prefetch array from the end; to retain the most recent
	 * prefetches, but this has the benefit of only needing to do one
	 * iteration on the dataset, and trivial compaction.
	 */
	for (end = MyPState->ring_unused - 1;
		 end >= MyPState->ring_last && end != UINT64_MAX && nfree != 0;
@@ -400,7 +412,7 @@ readahead_buffer_resize(int newsize, void *extra)
	{
		PrefetchRequest *slot = GetPrfSlot(end);
		PrefetchRequest *newslot;
		bool		found;

		if (slot->status == PRFS_UNUSED)
			continue;
@@ -463,10 +475,11 @@ consume_prefetch_responses(void)
static void
prefetch_cleanup_trailing_unused(void)
{
	uint64		ring_index;
	PrefetchRequest *slot;

	while (MyPState->ring_last < MyPState->ring_receive)
	{
		ring_index = MyPState->ring_last;
		slot = GetPrfSlot(ring_index);
@@ -480,7 +493,7 @@ prefetch_cleanup_trailing_unused(void)
/*
 * Wait for slot of ring_index to have received its response.
 * The caller is responsible for making sure the request buffer is flushed.
 *
 * NOTE: this function may indirectly update MyPState->pfs_hash; which
 * invalidates any active pointers into the hash table.
 */
@@ -512,7 +525,7 @@ prefetch_wait_for(uint64 ring_index)
/*
 * Read the response of a prefetch request into its slot.
 *
 * The caller is responsible for making sure that the request for this buffer
 * was flushed to the PageServer.
 *
@@ -552,7 +565,7 @@ prefetch_read(PrefetchRequest *slot)
/*
 * Disconnect hook - drop prefetches when the connection drops
 *
 * If we don't remove the failed prefetches, we'd be serving incorrect
 * data to the smgr.
 */
@@ -563,7 +576,7 @@ prefetch_on_ps_disconnect(void)
	while (MyPState->ring_receive < MyPState->ring_unused)
	{
		PrefetchRequest *slot;
		uint64		ring_index = MyPState->ring_receive;

		slot = GetPrfSlot(ring_index);
@@ -593,7 +606,7 @@ prefetch_set_unused(uint64 ring_index)
	PrefetchRequest *slot = GetPrfSlot(ring_index);

	if (ring_index < MyPState->ring_last)
		return;					/* Should already be unused */

	Assert(MyPState->ring_unused > ring_index);
@@ -624,7 +637,11 @@ prefetch_set_unused(uint64 ring_index)
	/* run cleanup if we're holding back ring_last */
	if (MyPState->ring_last == ring_index)
		prefetch_cleanup_trailing_unused();

	/*
	 * ... and try to store the buffered responses more compactly if > 12.5%
	 * of the buffer is gaps
	 */
	else if (ReceiveBufferNeedsCompaction())
		compact_prefetch_buffers();
}
@@ -632,7 +649,7 @@ prefetch_set_unused(uint64 ring_index)
static void
prefetch_do_request(PrefetchRequest *slot, bool *force_latest, XLogRecPtr *force_lsn)
{
	bool		found;
	NeonGetPageRequest request = {
		.req.tag = T_NeonGetPageRequest,
		.req.latest = false,
@@ -650,21 +667,22 @@ prefetch_do_request(PrefetchRequest *slot, bool *force_latest, XLogRecPtr *force
	}
	else
	{
		XLogRecPtr	lsn = neon_get_request_lsn(
											   &request.req.latest,
											   BufTagGetNRelFileInfo(slot->buftag),
											   slot->buftag.forkNum,
											   slot->buftag.blockNum
		);

		/*
		 * Note: effective_request_lsn is potentially higher than the
		 * requested LSN, but still correct:
		 *
		 * We know there are no changes between the actual requested LSN and
		 * the value of effective_request_lsn: If there were, the page would
		 * have been in cache and evicted between those LSN values, which then
		 * would have had to result in a larger request LSN for this page.
		 *
		 * It is possible that a concurrent backend loads the page, modifies
		 * it and then evicts it again, but the LSN of that eviction cannot be
		 * smaller than the current WAL insert/redo pointer, which is already
@@ -701,7 +719,7 @@ prefetch_do_request(PrefetchRequest *slot, bool *force_latest, XLogRecPtr *force
 * prefetch_register_buffer() - register and prefetch buffer
 *
 * Register that we may want the contents of BufferTag in the near future.
 *
 * If force_latest and force_lsn are not NULL, those values are sent to the
 * pageserver. If they are NULL, we utilize the lastWrittenLsn -infrastructure
 * to fill in these values manually.
@@ -713,14 +731,14 @@ prefetch_do_request(PrefetchRequest *slot, bool *force_latest, XLogRecPtr *force
static uint64
prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_lsn)
{
uint64 ring_index;
PrefetchRequest req;
PrefetchRequest *slot;
PrfHashEntry *entry;

/* use an intermediate PrefetchRequest struct to ensure correct alignment */
req.buftag = tag;

Retry:
entry = prfh_lookup(MyPState->prf_hash, (PrefetchRequest *) &req);
if (entry != NULL)
@@ -740,7 +758,10 @@ prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_ls
 */
if (force_latest && force_lsn)
{
-/* if we want the latest version, any effective_request_lsn < request lsn is OK */
+/*
+ * if we want the latest version, any effective_request_lsn <
+ * request lsn is OK
+ */
if (*force_latest)
{
if (*force_lsn > slot->effective_request_lsn)
@@ -751,7 +772,11 @@ prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_ls
}
}
-/* if we don't want the latest version, only accept requests with the exact same LSN */
+/*
+ * if we don't want the latest version, only accept requests with
+ * the exact same LSN
+ */
else
{
if (*force_lsn != slot->effective_request_lsn)
@@ -798,7 +823,8 @@ prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_ls
 */
if (MyPState->ring_last + readahead_buffer_size - 1 == MyPState->ring_unused)
{
uint64 cleanup_index = MyPState->ring_last;

slot = GetPrfSlot(cleanup_index);

Assert(slot->status != PRFS_UNUSED);
@@ -813,7 +839,10 @@ prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_ls
}
else
{
-/* We have the slot for ring_last, so that must still be in progress */
+/*
+ * We have the slot for ring_last, so that must still be in
+ * progress
+ */
switch (slot->status)
{
case PRFS_REQUESTED:
@@ -832,8 +861,8 @@ prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_ls
}

/*
- * The next buffer pointed to by `ring_unused` is now definitely empty,
- * so we can insert the new request to it.
+ * The next buffer pointed to by `ring_unused` is now definitely empty, so
+ * we can insert the new request to it.
 */
ring_index = MyPState->ring_unused;
slot = &MyPState->prf_buffer[((ring_index) % readahead_buffer_size)];
@@ -859,7 +888,10 @@ prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_ls
{
if (!page_server->flush())
{
-/* Prefetch set is reset in case of error, so we should try to register our request once again */
+/*
+ * Prefetch set is reset in case of error, so we should try to
+ * register our request once again
+ */
goto Retry;
}
MyPState->ring_flush = MyPState->ring_unused;
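The ring bookkeeping that these hunks touch (`ring_last`, `ring_unused`, and mapping a monotonically growing request index onto a fixed slot with `index % readahead_buffer_size`) can be sketched as a toy model. The class, buffer size, and method names below are illustrative only, not the extension's actual API:

```python
READAHEAD_BUFFER_SIZE = 4  # illustrative; the real size is configurable


class PrefetchRing:
    """Toy model of the prefetch slot ring: ever-growing request indices
    mapped onto a fixed-size buffer, as with MyPState->prf_buffer."""

    def __init__(self):
        self.ring_last = 0    # oldest still-tracked request index
        self.ring_unused = 0  # next unassigned request index
        self.buf = [None] * READAHEAD_BUFFER_SIZE

    def slot(self, ring_index):
        # a given ring index always lands in the same physical slot
        return ring_index % READAHEAD_BUFFER_SIZE

    def register(self, tag):
        # drop the oldest entry when the ring is full
        if self.ring_unused - self.ring_last == READAHEAD_BUFFER_SIZE:
            self.buf[self.slot(self.ring_last)] = None
            self.ring_last += 1
        idx = self.ring_unused
        self.buf[self.slot(idx)] = tag
        self.ring_unused += 1
        return idx


ring = PrefetchRing()
indexes = [ring.register(f"blk{i}") for i in range(6)]
assert indexes == [0, 1, 2, 3, 4, 5]
assert ring.ring_last == 2               # the two oldest entries were dropped
assert ring.buf[ring.slot(5)] == "blk5"  # newest request occupies slot 5 % 4 == 1
```

This is why the C code can compare raw `uint64` indices (`ring_last + readahead_buffer_size - 1 == ring_unused`) to detect a full ring without ever wrapping the counters themselves.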
@@ -871,8 +903,10 @@ prefetch_register_buffer(BufferTag tag, bool *force_latest, XLogRecPtr *force_ls
static NeonResponse *
page_server_request(void const *req)
{
-NeonResponse* resp;
-do {
+NeonResponse *resp;
+
+do
+{
while (!page_server->send((NeonRequest *) req) || !page_server->flush());
MyPState->ring_flush = MyPState->ring_unused;
consume_prefetch_responses();
@@ -884,7 +918,7 @@ page_server_request(void const *req)
StringInfoData
-nm_pack_request(NeonRequest * msg)
+nm_pack_request(NeonRequest *msg)
{
StringInfoData s;
@@ -1000,7 +1034,7 @@ nm_unpack_response(StringInfo s)
/* XXX: should be varlena */
memcpy(msg_resp->page, pq_getmsgbytes(s, BLCKSZ), BLCKSZ);
pq_getmsgend(s);

Assert(msg_resp->tag == T_NeonGetPageResponse);
resp = (NeonResponse *) msg_resp;
@@ -1056,7 +1090,7 @@ nm_unpack_response(StringInfo s)
/* dump to json for debugging / error reporting purposes */
char *
-nm_to_string(NeonMessage * msg)
+nm_to_string(NeonMessage *msg)
{
StringInfoData s;
@@ -1185,7 +1219,7 @@ nm_to_string(NeonMessage * msg)
 * directly because it skips the logging if the LSN is new enough.
 */
static XLogRecPtr
-log_newpage_copy(NRelFileInfo *rinfo, ForkNumber forkNum, BlockNumber blkno,
+log_newpage_copy(NRelFileInfo * rinfo, ForkNumber forkNum, BlockNumber blkno,
Page page, bool page_std)
{
PGAlignedBlock copied_buffer;
@@ -1208,11 +1242,10 @@ PageIsEmptyHeapPage(char *buffer)
}

static void
-neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
#if PG_MAJORVERSION_NUM < 16
-char *buffer, bool force)
+neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool force)
#else
-const char *buffer, bool force)
+neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const char *buffer, bool force)
#endif
{
XLogRecPtr lsn = PageGetLSN((Page) buffer);
@@ -1312,24 +1345,23 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void
neon_init(void)
{
Size prfs_size;

if (MyPState != NULL)
return;

-prfs_size = offsetof(PrefetchState, prf_buffer) + (
-sizeof(PrefetchRequest) * readahead_buffer_size
-);
+prfs_size = offsetof(PrefetchState, prf_buffer) +
+sizeof(PrefetchRequest) * readahead_buffer_size;

MyPState = MemoryContextAllocZero(TopMemoryContext, prfs_size);

MyPState->n_unused = readahead_buffer_size;

MyPState->bufctx = SlabContextCreate(TopMemoryContext,
"NeonSMGR/prefetch",
SLAB_DEFAULT_BLOCK_SIZE * 17,
PS_GETPAGERESPONSE_SIZE);
MyPState->errctx = AllocSetContextCreate(TopMemoryContext,
"NeonSMGR/errors",
ALLOCSET_DEFAULT_SIZES);
MyPState->hashctx = AllocSetContextCreate(TopMemoryContext,
@@ -1569,14 +1601,14 @@ neon_create(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
/*
 * Newly created relation is empty, remember that in the relsize cache.
 *
- * Note that in REDO, this is called to make sure the relation fork exists,
- * but it does not truncate the relation. So, we can only update the
- * relsize if it didn't exist before.
+ * Note that in REDO, this is called to make sure the relation fork
+ * exists, but it does not truncate the relation. So, we can only update
+ * the relsize if it didn't exist before.
 *
 * Also, in redo, we must make sure to update the cached size of the
- * relation, as that is the primary source of truth for REDO's
- * file length considerations, and as file extension isn't (perfectly)
- * logged, we need to take care of that before we hit file size checks.
+ * relation, as that is the primary source of truth for REDO's file length
+ * considerations, and as file extension isn't (perfectly) logged, we need
+ * to take care of that before we hit file size checks.
 *
 * FIXME: This is currently not just an optimization, but required for
 * correctness. Postgres can call smgrnblocks() on the newly-created
@@ -1652,7 +1684,7 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
#endif
{
XLogRecPtr lsn;
BlockNumber n_blocks = 0;

switch (reln->smgr_relpersistence)
{
@@ -1693,9 +1725,10 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
}

/*
- * Usually Postgres doesn't extend relation on more than one page
- * (leaving holes). But this rule is violated in PG-15 where CreateAndCopyRelationData
- * call smgrextend for destination relation n using size of source relation
+ * Usually Postgres doesn't extend relation on more than one page (leaving
+ * holes). But this rule is violated in PG-15 where
+ * CreateAndCopyRelationData call smgrextend for destination relation n
+ * using size of source relation
 */
n_blocks = neon_nblocks(reln, forkNum);
while (n_blocks < blkno)
@@ -1716,11 +1749,13 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
if (IS_LOCAL_REL(reln))
mdextend(reln, forkNum, blkno, buffer, skipFsync);
#endif

/*
- * smgr_extend is often called with an all-zeroes page, so lsn==InvalidXLogRecPtr.
- * An smgr_write() call will come for the buffer later, after it has been initialized
- * with the real page contents, and it is eventually evicted from the buffer cache.
- * But we need a valid LSN to the relation metadata update now.
+ * smgr_extend is often called with an all-zeroes page, so
+ * lsn==InvalidXLogRecPtr. An smgr_write() call will come for the buffer
+ * later, after it has been initialized with the real page contents, and
+ * it is eventually evicted from the buffer cache. But we need a valid LSN
+ * to the relation metadata update now.
 */
if (lsn == InvalidXLogRecPtr)
{
@@ -1779,9 +1814,9 @@ neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("cannot extend file \"%s\" beyond %u blocks",
relpath(reln->smgr_rlocator, forkNum),
InvalidBlockNumber)));

/* Don't log any pages if we're not allowed to do so. */
if (!XLogInsertAllowed())
@@ -1863,12 +1898,12 @@ neon_close(SMgrRelation reln, ForkNumber forknum)
bool
neon_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
-BufferTag tag;
uint64 ring_index PG_USED_FOR_ASSERTS_ONLY;
+BufferTag tag;

switch (reln->smgr_relpersistence)
{
case 0: /* probably shouldn't happen, but ignore it */
case RELPERSISTENCE_PERMANENT:
break;
@@ -1883,10 +1918,9 @@ neon_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
if (lfc_cache_contains(InfoFromSMgrRel(reln), forknum, blocknum))
return false;

-tag = (BufferTag) {
-.forkNum = forknum,
-.blockNum = blocknum
-};
+tag.forkNum = forknum;
+tag.blockNum = blocknum;

CopyNRelFileInfoToBufTag(tag, InfoFromSMgrRel(reln));

ring_index = prefetch_register_buffer(tag, NULL, NULL);
@@ -1939,23 +1973,21 @@ neon_writeback(SMgrRelation reln, ForkNumber forknum,
 * While function is defined in the neon extension it's used within neon_test_utils directly.
 * To avoid breaking tests in the runtime please keep function signature in sync.
 */
-void
#if PG_MAJORVERSION_NUM < 16
+void PGDLLEXPORT
neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
XLogRecPtr request_lsn, bool request_latest, char *buffer)
#else
+void PGDLLEXPORT
neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
XLogRecPtr request_lsn, bool request_latest, void *buffer)
#endif
{
NeonResponse *resp;
-BufferTag buftag;
uint64 ring_index;
PrfHashEntry *entry;
PrefetchRequest *slot;
-buftag = (BufferTag) {
+BufferTag buftag =
+{
.forkNum = forkNum,
.blockNum = blkno,
};
@@ -1964,12 +1996,11 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
/*
 * The redo process does not lock pages that it needs to replay but are
- * not in the shared buffers, so a concurrent process may request the
- * page after redo has decided it won't redo that page and updated the
- * LwLSN for that page.
- * If we're in hot standby we need to take care that we don't return
- * until after REDO has finished replaying up to that LwLSN, as the page
- * should have been locked up to that point.
+ * not in the shared buffers, so a concurrent process may request the page
+ * after redo has decided it won't redo that page and updated the LwLSN
+ * for that page. If we're in hot standby we need to take care that we
+ * don't return until after REDO has finished replaying up to that LwLSN,
+ * as the page should have been locked up to that point.
 *
 * See also the description on neon_redo_read_buffer_filter below.
 *
@@ -1977,7 +2008,7 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
 * concurrent failed read IOs. Those IOs should never have a request_lsn
 * that is as large as the WAL record we're currently replaying, if it
 * weren't for the behaviour of the LwLsn cache that uses the highest
 * value of the LwLsn cache when the entry is not found.
 */
if (RecoveryInProgress() && !(MyBackendType == B_STARTUP))
XLogWaitForReplayOf(request_lsn);
@@ -1995,12 +2026,14 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
ring_index = slot->my_ring_index;
pgBufferUsage.prefetch.hits += 1;
}
-else /* the current prefetch LSN is not large enough, so drop the prefetch */
+else /* the current prefetch LSN is not large
+      * enough, so drop the prefetch */
{
/*
 * We can't drop cache for not-yet-received requested items. It is
- * unlikely this happens, but it can happen if prefetch distance is
- * large enough and a backend didn't consume all prefetch requests.
+ * unlikely this happens, but it can happen if prefetch distance
+ * is large enough and a backend didn't consume all prefetch
+ * requests.
 */
if (slot->status == PRFS_REQUESTED)
{
@@ -2027,11 +2060,11 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
else
{
/*
- * Empty our reference to the prefetch buffer's hash entry.
- * When we wait for prefetches, the entry reference is invalidated by
+ * Empty our reference to the prefetch buffer's hash entry. When
+ * we wait for prefetches, the entry reference is invalidated by
 * potential updates to the hash, and when we reconnect to the
- * pageserver the prefetch we're waiting for may be dropped,
- * in which case we need to retry and take the branch above.
+ * pageserver the prefetch we're waiting for may be dropped, in
+ * which case we need to retry and take the branch above.
 */
entry = NULL;
}
@@ -2079,11 +2112,10 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
 * neon_read() -- Read the specified block from a relation.
 */
void
-neon_read(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
#if PG_MAJORVERSION_NUM < 16
-char *buffer)
+neon_read(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno, char *buffer)
#else
-void *buffer)
+neon_read(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno, void *buffer)
#endif
{
bool latest;
@@ -2218,11 +2250,10 @@ hexdump_page(char *page)
 * use mdextend().
 */
void
-neon_write(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
#if PG_MAJORVERSION_NUM < 16
-char *buffer, bool skipFsync)
+neon_write(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync)
#else
-const void *buffer, bool skipFsync)
+neon_write(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const void *buffer, bool skipFsync)
#endif
{
XLogRecPtr lsn;
@@ -2722,9 +2753,90 @@ smgr_init_neon(void)
}
static void
neon_extend_rel_size(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno, XLogRecPtr end_recptr)
{
BlockNumber relsize;
/* Extend the relation if we know its size */
if (get_cached_relsize(rinfo, forknum, &relsize))
{
if (relsize < blkno + 1)
{
update_cached_relsize(rinfo, forknum, blkno + 1);
SetLastWrittenLSNForRelation(end_recptr, rinfo, forknum);
}
}
else
{
/*
* Size was not cached. We populate the cache now, with the size of
* the relation measured after this WAL record is applied.
*
* This length is later reused when we open the smgr to read the
* block, which is fine and expected.
*/
NeonResponse *response;
NeonNblocksResponse *nbresponse;
NeonNblocksRequest request = {
.req = (NeonRequest) {
.lsn = end_recptr,
.latest = false,
.tag = T_NeonNblocksRequest,
},
.rinfo = rinfo,
.forknum = forknum,
};
response = page_server_request(&request);
Assert(response->tag == T_NeonNblocksResponse);
nbresponse = (NeonNblocksResponse *) response;
relsize = Max(nbresponse->n_blocks, blkno + 1);
set_cached_relsize(rinfo, forknum, relsize);
SetLastWrittenLSNForRelation(end_recptr, rinfo, forknum);
elog(SmgrTrace, "Set length to %d", relsize);
}
}
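The cache-or-query decision in the newly added `neon_extend_rel_size` above can be sketched in Python. The helper name, the dictionary cache, and the `fetch_nblocks` callback standing in for the pageserver nblocks request are illustrative assumptions, not the extension's API:

```python
def extend_rel_size(cache, rel_fork, blkno, fetch_nblocks):
    """Sketch of neon_extend_rel_size: grow a cached relation size so it
    covers blkno, querying the pageserver (fetch_nblocks) only on a miss."""
    if rel_fork in cache:
        # cache hit: only ever grow the cached size
        if cache[rel_fork] < blkno + 1:
            cache[rel_fork] = blkno + 1
    else:
        # cache miss: take the pageserver's answer, but at least blkno + 1,
        # mirroring relsize = Max(nbresponse->n_blocks, blkno + 1)
        cache[rel_fork] = max(fetch_nblocks(), blkno + 1)
    return cache[rel_fork]


cache = {}
# miss: pageserver says 4 blocks, but the record touches block 9
assert extend_rel_size(cache, ("rel", "main"), 9, lambda: 4) == 10
# hit: a smaller block never shrinks the cached size, pageserver not consulted
assert extend_rel_size(cache, ("rel", "main"), 3, lambda: 99) == 10
```

The `max(...)` on the miss path matters: the WAL record being replayed may extend the relation past what the pageserver has seen at this LSN.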
#define FSM_TREE_DEPTH ((SlotsPerFSMPage >= 1626) ? 3 : 4)
/*
 * TODO: Maybe it would be better to make the corresponding function from freespace.c public?
*/
static BlockNumber
get_fsm_physical_block(BlockNumber heapblk)
{
BlockNumber pages;
int leafno;
int l;
/*
* Calculate the logical page number of the first leaf page below the
* given page.
*/
leafno = heapblk / SlotsPerFSMPage;
/* Count upper level nodes required to address the leaf page */
pages = 0;
for (l = 0; l < FSM_TREE_DEPTH; l++)
{
pages += leafno + 1;
leafno /= SlotsPerFSMPage;
}
/* Turn the page count into 0-based block number */
return pages - 1;
}
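Mirrored in Python, the address computation in `get_fsm_physical_block` above counts the upper-level FSM pages that precede a leaf page. `SLOTS_PER_FSM_PAGE` is hardcoded here to a typical value for 8 KB blocks; the real constant comes from PostgreSQL's freespace.c:

```python
SLOTS_PER_FSM_PAGE = 4046  # assumption: typical value for BLCKSZ = 8192
FSM_TREE_DEPTH = 3 if SLOTS_PER_FSM_PAGE >= 1626 else 4


def get_fsm_physical_block(heapblk: int) -> int:
    """Physical FSM block holding the leaf slot for heapblk
    (mirrors the C function above)."""
    # logical number of the leaf page below the given heap block
    leafno = heapblk // SLOTS_PER_FSM_PAGE
    # count upper-level nodes required to address the leaf page
    pages = 0
    for _ in range(FSM_TREE_DEPTH):
        pages += leafno + 1
        leafno //= SLOTS_PER_FSM_PAGE
    # turn the page count into a 0-based block number
    return pages - 1


# the first leaf sits behind one page per upper tree level
assert get_fsm_physical_block(0) == FSM_TREE_DEPTH - 1
# the next leaf page is one block further, plus its root-path bookkeeping
assert get_fsm_physical_block(SLOTS_PER_FSM_PAGE) == 3
```

This is what lets the redo filter translate a heap block number into the FSM fork block that must also be extended.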
/*
 * Return whether we can skip the redo for this block.
 *
 * The conditions for skipping the IO are:
 *
 * - The block is not in the shared buffers, and
@@ -2763,13 +2875,12 @@ neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id)
XLogRecPtr end_recptr = record->EndRecPtr;
NRelFileInfo rinfo;
ForkNumber forknum;
BlockNumber blkno;
BufferTag tag;
uint32 hash;
LWLock *partitionLock;
Buffer buffer;
bool no_redo_needed;
-BlockNumber relsize;

if (old_redo_read_buffer_filter && old_redo_read_buffer_filter(record, block_id))
return true;
@@ -2783,8 +2894,8 @@ neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id)
/*
 * Out of an abundance of caution, we always run redo on shared catalogs,
- * regardless of whether the block is stored in shared buffers.
- * See also this function's top comment.
+ * regardless of whether the block is stored in shared buffers. See also
+ * this function's top comment.
 */
if (!OidIsValid(NInfoGetDbOid(rinfo)))
return false;
@@ -2810,8 +2921,9 @@ neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id)
/* In both cases set lwlsn past this WAL record */
SetLastWrittenLSNForBlock(end_recptr, rinfo, forknum, blkno);

-/* we don't have the buffer in memory, update lwLsn past this record,
- * also evict page fro file cache
- */
+/*
+ * we don't have the buffer in memory, update lwLsn past this record, also
+ * evict the page from the file cache
+ */
if (no_redo_needed)
lfc_evict(rinfo, forknum, blkno);
@@ -2819,49 +2931,10 @@ neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id)
LWLockRelease(partitionLock);

-/* Extend the relation if we know its size */
-if (get_cached_relsize(rinfo, forknum, &relsize))
-{
-if (relsize < blkno + 1)
-{
-update_cached_relsize(rinfo, forknum, blkno + 1);
-SetLastWrittenLSNForRelation(end_recptr, rinfo, forknum);
-}
-}
-else
-{
-/*
- * Size was not cached. We populate the cache now, with the size of the
- * relation measured after this WAL record is applied.
- *
- * This length is later reused when we open the smgr to read the block,
- * which is fine and expected.
- */
-NeonResponse *response;
-NeonNblocksResponse *nbresponse;
-NeonNblocksRequest request = {
-.req = (NeonRequest) {
-.lsn = end_recptr,
-.latest = false,
-.tag = T_NeonNblocksRequest,
-},
-.rinfo = rinfo,
-.forknum = forknum,
-};
-response = page_server_request(&request);
-Assert(response->tag == T_NeonNblocksResponse);
-nbresponse = (NeonNblocksResponse *) response;
-Assert(nbresponse->n_blocks > blkno);
-set_cached_relsize(rinfo, forknum, nbresponse->n_blocks);
-SetLastWrittenLSNForRelation(end_recptr, rinfo, forknum);
-elog(SmgrTrace, "Set length to %d", nbresponse->n_blocks);
-}
+neon_extend_rel_size(rinfo, forknum, blkno, end_recptr);
+if (forknum == MAIN_FORKNUM)
+{
+neon_extend_rel_size(rinfo, FSM_FORKNUM, get_fsm_physical_block(blkno), end_recptr);
+}

return no_redo_needed;
}


@@ -178,7 +178,7 @@ WalProposerFree(WalProposer *wp)
if (wp->propTermHistory.entries != NULL)
pfree(wp->propTermHistory.entries);
wp->propTermHistory.entries = NULL;

pfree(wp);
}
@@ -275,7 +275,7 @@ WalProposerPoll(WalProposer *wp)
wp->config->safekeeper_connection_timeout))
{
walprop_log(WARNING, "terminating connection to safekeeper '%s:%s' in '%s' state: no messages received during the last %dms or connection attempt took longer than that",
sk->host, sk->port, FormatSafekeeperState(sk->state), wp->config->safekeeper_connection_timeout);
ShutdownConnection(sk);
}
}
@@ -395,7 +395,7 @@ ResetConnection(Safekeeper *sk)
 * https://www.postgresql.org/docs/devel/libpq-connect.html#LIBPQ-PQCONNECTSTARTPARAMS
 */
walprop_log(WARNING, "Immediate failure to connect with node '%s:%s':\n\terror: %s",
sk->host, sk->port, wp->api.conn_error_message(sk));

/*
 * Even though the connection failed, we still need to clean up the
@@ -489,7 +489,7 @@ AdvancePollState(Safekeeper *sk, uint32 events)
 */
case SS_OFFLINE:
walprop_log(FATAL, "Unexpected safekeeper %s:%s state advancement: is offline",
sk->host, sk->port);
break; /* actually unreachable, but prevents
        * -Wimplicit-fallthrough */
@@ -525,7 +525,7 @@ AdvancePollState(Safekeeper *sk, uint32 events)
 */
case SS_VOTING:
walprop_log(WARNING, "EOF from node %s:%s in %s state", sk->host,
sk->port, FormatSafekeeperState(sk->state));
ResetConnection(sk);
return;
@@ -554,7 +554,7 @@ AdvancePollState(Safekeeper *sk, uint32 events)
 */
case SS_IDLE:
walprop_log(WARNING, "EOF from node %s:%s in %s state", sk->host,
sk->port, FormatSafekeeperState(sk->state));
ResetConnection(sk);
return;
@@ -580,7 +580,7 @@ HandleConnectionEvent(Safekeeper *sk)
{
case WP_CONN_POLLING_OK:
walprop_log(LOG, "connected with node %s:%s", sk->host,
sk->port);
sk->latestMsgReceivedAt = wp->api.get_current_timestamp(wp);

/*
@@ -604,7 +604,7 @@ HandleConnectionEvent(Safekeeper *sk)
case WP_CONN_POLLING_FAILED:
walprop_log(WARNING, "failed to connect to node '%s:%s': %s",
sk->host, sk->port, wp->api.conn_error_message(sk));

/*
 * If connecting failed, we don't want to restart the connection
@@ -641,7 +641,7 @@ SendStartWALPush(Safekeeper *sk)
if (!wp->api.conn_send_query(sk, "START_WAL_PUSH"))
{
walprop_log(WARNING, "Failed to send 'START_WAL_PUSH' query to safekeeper %s:%s: %s",
sk->host, sk->port, wp->api.conn_error_message(sk));
ShutdownConnection(sk);
return;
}
@@ -678,7 +678,7 @@ RecvStartWALPushResult(Safekeeper *sk)
case WP_EXEC_FAILED:
walprop_log(WARNING, "Failed to send query to safekeeper %s:%s: %s",
sk->host, sk->port, wp->api.conn_error_message(sk));
ShutdownConnection(sk);
return;
@@ -689,7 +689,7 @@ RecvStartWALPushResult(Safekeeper *sk)
 */
case WP_EXEC_UNEXPECTED_SUCCESS:
walprop_log(WARNING, "Received bad response from safekeeper %s:%s query execution",
sk->host, sk->port);
ShutdownConnection(sk);
return;
}
@@ -758,8 +758,8 @@ RecvAcceptorGreeting(Safekeeper *sk)
{
/* Another compute with higher term is running. */
walprop_log(FATAL, "WAL acceptor %s:%s with term " INT64_FORMAT " rejects our connection request with term " INT64_FORMAT "",
sk->host, sk->port,
sk->greetResponse.term, wp->propTerm);
}

/*
@@ -817,11 +817,11 @@ RecvVoteResponse(Safekeeper *sk)
return;

walprop_log(LOG,
"got VoteResponse from acceptor %s:%s, voteGiven=" UINT64_FORMAT ", epoch=" UINT64_FORMAT ", flushLsn=%X/%X, truncateLsn=%X/%X, timelineStartLsn=%X/%X",
sk->host, sk->port, sk->voteResponse.voteGiven, GetHighestTerm(&sk->voteResponse.termHistory),
LSN_FORMAT_ARGS(sk->voteResponse.flushLsn),
LSN_FORMAT_ARGS(sk->voteResponse.truncateLsn),
LSN_FORMAT_ARGS(sk->voteResponse.timelineStartLsn));

/*
 * In case of acceptor rejecting our vote, bail out, but only if either it
@@ -832,8 +832,8 @@ RecvVoteResponse(Safekeeper *sk)
(sk->voteResponse.term > wp->propTerm || wp->n_votes < wp->quorum))
{
walprop_log(FATAL, "WAL acceptor %s:%s with term " INT64_FORMAT " rejects our connection request with term " INT64_FORMAT "",
sk->host, sk->port,
sk->voteResponse.term, wp->propTerm);
}
Assert(sk->voteResponse.term == wp->propTerm);
@@ -877,10 +877,10 @@ HandleElectedProposer(WalProposer *wp)
if (wp->truncateLsn < wp->propEpochStartLsn)
{
walprop_log(LOG,
"start recovery because truncateLsn=%X/%X is not "
"equal to epochStartLsn=%X/%X",
LSN_FORMAT_ARGS(wp->truncateLsn),
LSN_FORMAT_ARGS(wp->propEpochStartLsn));
/* Perform recovery */
if (!wp->api.recovery_download(&wp->safekeeper[wp->donor], wp->greetRequest.timeline, wp->truncateLsn, wp->propEpochStartLsn))
walprop_log(FATAL, "Failed to recover state");
@@ -990,9 +990,9 @@ DetermineEpochStartLsn(WalProposer *wp)
wp->timelineStartLsn != wp->safekeeper[i].voteResponse.timelineStartLsn) wp->timelineStartLsn != wp->safekeeper[i].voteResponse.timelineStartLsn)
{ {
walprop_log(WARNING, walprop_log(WARNING,
"inconsistent timelineStartLsn: current %X/%X, received %X/%X", "inconsistent timelineStartLsn: current %X/%X, received %X/%X",
LSN_FORMAT_ARGS(wp->timelineStartLsn), LSN_FORMAT_ARGS(wp->timelineStartLsn),
LSN_FORMAT_ARGS(wp->safekeeper[i].voteResponse.timelineStartLsn)); LSN_FORMAT_ARGS(wp->safekeeper[i].voteResponse.timelineStartLsn));
} }
wp->timelineStartLsn = wp->safekeeper[i].voteResponse.timelineStartLsn; wp->timelineStartLsn = wp->safekeeper[i].voteResponse.timelineStartLsn;
} }
@@ -1038,11 +1038,11 @@ DetermineEpochStartLsn(WalProposer *wp)
wp->propTermHistory.entries[wp->propTermHistory.n_entries - 1].lsn = wp->propEpochStartLsn; wp->propTermHistory.entries[wp->propTermHistory.n_entries - 1].lsn = wp->propEpochStartLsn;
walprop_log(LOG, "got votes from majority (%d) of nodes, term " UINT64_FORMAT ", epochStartLsn %X/%X, donor %s:%s, truncate_lsn %X/%X", walprop_log(LOG, "got votes from majority (%d) of nodes, term " UINT64_FORMAT ", epochStartLsn %X/%X, donor %s:%s, truncate_lsn %X/%X",
wp->quorum, wp->quorum,
wp->propTerm, wp->propTerm,
LSN_FORMAT_ARGS(wp->propEpochStartLsn), LSN_FORMAT_ARGS(wp->propEpochStartLsn),
wp->safekeeper[wp->donor].host, wp->safekeeper[wp->donor].port, wp->safekeeper[wp->donor].host, wp->safekeeper[wp->donor].port,
LSN_FORMAT_ARGS(wp->truncateLsn)); LSN_FORMAT_ARGS(wp->truncateLsn));
/* /*
* Ensure the basebackup we are running (at RedoStartLsn) matches LSN * Ensure the basebackup we are running (at RedoStartLsn) matches LSN
@@ -1070,18 +1070,18 @@ DetermineEpochStartLsn(WalProposer *wp)
walprop_shared->mineLastElectedTerm))) walprop_shared->mineLastElectedTerm)))
{ {
walprop_log(PANIC, walprop_log(PANIC,
"collected propEpochStartLsn %X/%X, but basebackup LSN %X/%X", "collected propEpochStartLsn %X/%X, but basebackup LSN %X/%X",
LSN_FORMAT_ARGS(wp->propEpochStartLsn), LSN_FORMAT_ARGS(wp->propEpochStartLsn),
LSN_FORMAT_ARGS(wp->api.get_redo_start_lsn(wp))); LSN_FORMAT_ARGS(wp->api.get_redo_start_lsn(wp)));
} }
} }
walprop_shared->mineLastElectedTerm = wp->propTerm; walprop_shared->mineLastElectedTerm = wp->propTerm;
} }
/* /*
* WalProposer has just elected itself and initialized history, so * WalProposer has just elected itself and initialized history, so we can
* we can call election callback. Usually it updates truncateLsn to * call election callback. Usually it updates truncateLsn to fetch WAL for
* fetch WAL for logical replication. * logical replication.
*/ */
wp->api.after_election(wp); wp->api.after_election(wp);
} }
@@ -1155,8 +1155,8 @@ SendProposerElected(Safekeeper *sk)
sk->startStreamingAt = wp->truncateLsn; sk->startStreamingAt = wp->truncateLsn;
walprop_log(WARNING, "empty safekeeper joined cluster as %s:%s, historyStart=%X/%X, sk->startStreamingAt=%X/%X", walprop_log(WARNING, "empty safekeeper joined cluster as %s:%s, historyStart=%X/%X, sk->startStreamingAt=%X/%X",
sk->host, sk->port, LSN_FORMAT_ARGS(wp->propTermHistory.entries[0].lsn), sk->host, sk->port, LSN_FORMAT_ARGS(wp->propTermHistory.entries[0].lsn),
LSN_FORMAT_ARGS(sk->startStreamingAt)); LSN_FORMAT_ARGS(sk->startStreamingAt));
} }
} }
else else
@@ -1190,8 +1190,8 @@ SendProposerElected(Safekeeper *sk)
lastCommonTerm = i >= 0 ? wp->propTermHistory.entries[i].term : 0; lastCommonTerm = i >= 0 ? wp->propTermHistory.entries[i].term : 0;
walprop_log(LOG, walprop_log(LOG,
"sending elected msg to node " UINT64_FORMAT " term=" UINT64_FORMAT ", startStreamingAt=%X/%X (lastCommonTerm=" UINT64_FORMAT "), termHistory.n_entries=%u to %s:%s, timelineStartLsn=%X/%X", "sending elected msg to node " UINT64_FORMAT " term=" UINT64_FORMAT ", startStreamingAt=%X/%X (lastCommonTerm=" UINT64_FORMAT "), termHistory.n_entries=%u to %s:%s, timelineStartLsn=%X/%X",
sk->greetResponse.nodeId, msg.term, LSN_FORMAT_ARGS(msg.startStreamingAt), lastCommonTerm, msg.termHistory->n_entries, sk->host, sk->port, LSN_FORMAT_ARGS(msg.timelineStartLsn)); sk->greetResponse.nodeId, msg.term, LSN_FORMAT_ARGS(msg.startStreamingAt), lastCommonTerm, msg.termHistory->n_entries, sk->host, sk->port, LSN_FORMAT_ARGS(msg.timelineStartLsn));
resetStringInfo(&sk->outbuf); resetStringInfo(&sk->outbuf);
pq_sendint64_le(&sk->outbuf, msg.tag); pq_sendint64_le(&sk->outbuf, msg.tag);
@@ -1355,11 +1355,11 @@ SendAppendRequests(Safekeeper *sk)
PrepareAppendRequest(sk->wp, &sk->appendRequest, sk->streamingAt, endLsn); PrepareAppendRequest(sk->wp, &sk->appendRequest, sk->streamingAt, endLsn);
walprop_log(DEBUG2, "sending message len %ld beginLsn=%X/%X endLsn=%X/%X commitLsn=%X/%X truncateLsn=%X/%X to %s:%s", walprop_log(DEBUG2, "sending message len %ld beginLsn=%X/%X endLsn=%X/%X commitLsn=%X/%X truncateLsn=%X/%X to %s:%s",
req->endLsn - req->beginLsn, req->endLsn - req->beginLsn,
LSN_FORMAT_ARGS(req->beginLsn), LSN_FORMAT_ARGS(req->beginLsn),
LSN_FORMAT_ARGS(req->endLsn), LSN_FORMAT_ARGS(req->endLsn),
LSN_FORMAT_ARGS(req->commitLsn), LSN_FORMAT_ARGS(req->commitLsn),
LSN_FORMAT_ARGS(wp->truncateLsn), sk->host, sk->port); LSN_FORMAT_ARGS(wp->truncateLsn), sk->host, sk->port);
resetStringInfo(&sk->outbuf); resetStringInfo(&sk->outbuf);
@@ -1398,8 +1398,8 @@ SendAppendRequests(Safekeeper *sk)
case PG_ASYNC_WRITE_FAIL: case PG_ASYNC_WRITE_FAIL:
walprop_log(WARNING, "Failed to send to node %s:%s in %s state: %s", walprop_log(WARNING, "Failed to send to node %s:%s in %s state: %s",
sk->host, sk->port, FormatSafekeeperState(sk->state), sk->host, sk->port, FormatSafekeeperState(sk->state),
wp->api.conn_error_message(sk)); wp->api.conn_error_message(sk));
ShutdownConnection(sk); ShutdownConnection(sk);
return false; return false;
default: default:
@@ -1438,17 +1438,17 @@ RecvAppendResponses(Safekeeper *sk)
break; break;
walprop_log(DEBUG2, "received message term=" INT64_FORMAT " flushLsn=%X/%X commitLsn=%X/%X from %s:%s", walprop_log(DEBUG2, "received message term=" INT64_FORMAT " flushLsn=%X/%X commitLsn=%X/%X from %s:%s",
sk->appendResponse.term, sk->appendResponse.term,
LSN_FORMAT_ARGS(sk->appendResponse.flushLsn), LSN_FORMAT_ARGS(sk->appendResponse.flushLsn),
LSN_FORMAT_ARGS(sk->appendResponse.commitLsn), LSN_FORMAT_ARGS(sk->appendResponse.commitLsn),
sk->host, sk->port); sk->host, sk->port);
if (sk->appendResponse.term > wp->propTerm) if (sk->appendResponse.term > wp->propTerm)
{ {
/* Another compute with higher term is running. */ /* Another compute with higher term is running. */
walprop_log(PANIC, "WAL acceptor %s:%s with term " INT64_FORMAT " rejected our request, our term " INT64_FORMAT "", walprop_log(PANIC, "WAL acceptor %s:%s with term " INT64_FORMAT " rejected our request, our term " INT64_FORMAT "",
sk->host, sk->port, sk->host, sk->port,
sk->appendResponse.term, wp->propTerm); sk->appendResponse.term, wp->propTerm);
} }
readAnything = true; readAnything = true;
@@ -1493,7 +1493,7 @@ ParsePageserverFeedbackMessage(WalProposer *wp, StringInfo reply_message, Pagese
/* read value length */ /* read value length */
rf->currentClusterSize = pq_getmsgint64(reply_message); rf->currentClusterSize = pq_getmsgint64(reply_message);
walprop_log(DEBUG2, "ParsePageserverFeedbackMessage: current_timeline_size %lu", walprop_log(DEBUG2, "ParsePageserverFeedbackMessage: current_timeline_size %lu",
rf->currentClusterSize); rf->currentClusterSize);
} }
else if ((strcmp(key, "ps_writelsn") == 0) || (strcmp(key, "last_received_lsn") == 0)) else if ((strcmp(key, "ps_writelsn") == 0) || (strcmp(key, "last_received_lsn") == 0))
{ {
@@ -1501,7 +1501,7 @@ ParsePageserverFeedbackMessage(WalProposer *wp, StringInfo reply_message, Pagese
/* read value length */ /* read value length */
rf->last_received_lsn = pq_getmsgint64(reply_message); rf->last_received_lsn = pq_getmsgint64(reply_message);
walprop_log(DEBUG2, "ParsePageserverFeedbackMessage: last_received_lsn %X/%X", walprop_log(DEBUG2, "ParsePageserverFeedbackMessage: last_received_lsn %X/%X",
LSN_FORMAT_ARGS(rf->last_received_lsn)); LSN_FORMAT_ARGS(rf->last_received_lsn));
} }
else if ((strcmp(key, "ps_flushlsn") == 0) || (strcmp(key, "disk_consistent_lsn") == 0)) else if ((strcmp(key, "ps_flushlsn") == 0) || (strcmp(key, "disk_consistent_lsn") == 0))
{ {
@@ -1509,7 +1509,7 @@ ParsePageserverFeedbackMessage(WalProposer *wp, StringInfo reply_message, Pagese
/* read value length */ /* read value length */
rf->disk_consistent_lsn = pq_getmsgint64(reply_message); rf->disk_consistent_lsn = pq_getmsgint64(reply_message);
walprop_log(DEBUG2, "ParsePageserverFeedbackMessage: disk_consistent_lsn %X/%X", walprop_log(DEBUG2, "ParsePageserverFeedbackMessage: disk_consistent_lsn %X/%X",
LSN_FORMAT_ARGS(rf->disk_consistent_lsn)); LSN_FORMAT_ARGS(rf->disk_consistent_lsn));
} }
else if ((strcmp(key, "ps_applylsn") == 0) || (strcmp(key, "remote_consistent_lsn") == 0)) else if ((strcmp(key, "ps_applylsn") == 0) || (strcmp(key, "remote_consistent_lsn") == 0))
{ {
@@ -1517,7 +1517,7 @@ ParsePageserverFeedbackMessage(WalProposer *wp, StringInfo reply_message, Pagese
/* read value length */ /* read value length */
rf->remote_consistent_lsn = pq_getmsgint64(reply_message); rf->remote_consistent_lsn = pq_getmsgint64(reply_message);
walprop_log(DEBUG2, "ParsePageserverFeedbackMessage: remote_consistent_lsn %X/%X", walprop_log(DEBUG2, "ParsePageserverFeedbackMessage: remote_consistent_lsn %X/%X",
LSN_FORMAT_ARGS(rf->remote_consistent_lsn)); LSN_FORMAT_ARGS(rf->remote_consistent_lsn));
} }
else if ((strcmp(key, "ps_replytime") == 0) || (strcmp(key, "replytime") == 0)) else if ((strcmp(key, "ps_replytime") == 0) || (strcmp(key, "replytime") == 0))
{ {
@@ -1530,7 +1530,7 @@ ParsePageserverFeedbackMessage(WalProposer *wp, StringInfo reply_message, Pagese
/* Copy because timestamptz_to_str returns a static buffer */ /* Copy because timestamptz_to_str returns a static buffer */
replyTimeStr = pstrdup(timestamptz_to_str(rf->replytime)); replyTimeStr = pstrdup(timestamptz_to_str(rf->replytime));
walprop_log(DEBUG2, "ParsePageserverFeedbackMessage: replytime %lu reply_time: %s", walprop_log(DEBUG2, "ParsePageserverFeedbackMessage: replytime %lu reply_time: %s",
rf->replytime, replyTimeStr); rf->replytime, replyTimeStr);
pfree(replyTimeStr); pfree(replyTimeStr);
} }
@@ -1700,8 +1700,8 @@ AsyncRead(Safekeeper *sk, char **buf, int *buf_size)
case PG_ASYNC_READ_FAIL: case PG_ASYNC_READ_FAIL:
walprop_log(WARNING, "Failed to read from node %s:%s in %s state: %s", sk->host, walprop_log(WARNING, "Failed to read from node %s:%s in %s state: %s", sk->host,
sk->port, FormatSafekeeperState(sk->state), sk->port, FormatSafekeeperState(sk->state),
wp->api.conn_error_message(sk)); wp->api.conn_error_message(sk));
ShutdownConnection(sk); ShutdownConnection(sk);
return false; return false;
} }
@@ -1740,7 +1740,7 @@ AsyncReadMessage(Safekeeper *sk, AcceptorProposerMessage *anymsg)
if (tag != anymsg->tag) if (tag != anymsg->tag)
{ {
walprop_log(WARNING, "unexpected message tag %c from node %s:%s in state %s", (char) tag, sk->host, walprop_log(WARNING, "unexpected message tag %c from node %s:%s in state %s", (char) tag, sk->host,
sk->port, FormatSafekeeperState(sk->state)); sk->port, FormatSafekeeperState(sk->state));
ResetConnection(sk); ResetConnection(sk);
return false; return false;
} }
@@ -1816,8 +1816,8 @@ BlockingWrite(Safekeeper *sk, void *msg, size_t msg_size, SafekeeperState succes
if (!wp->api.conn_blocking_write(sk, msg, msg_size)) if (!wp->api.conn_blocking_write(sk, msg, msg_size))
{ {
walprop_log(WARNING, "Failed to send to node %s:%s in %s state: %s", walprop_log(WARNING, "Failed to send to node %s:%s in %s state: %s",
sk->host, sk->port, FormatSafekeeperState(sk->state), sk->host, sk->port, FormatSafekeeperState(sk->state),
wp->api.conn_error_message(sk)); wp->api.conn_error_message(sk));
ShutdownConnection(sk); ShutdownConnection(sk);
return false; return false;
} }
@@ -1863,8 +1863,8 @@ AsyncWrite(Safekeeper *sk, void *msg, size_t msg_size, SafekeeperState flush_sta
return false; return false;
case PG_ASYNC_WRITE_FAIL: case PG_ASYNC_WRITE_FAIL:
walprop_log(WARNING, "Failed to send to node %s:%s in %s state: %s", walprop_log(WARNING, "Failed to send to node %s:%s in %s state: %s",
sk->host, sk->port, FormatSafekeeperState(sk->state), sk->host, sk->port, FormatSafekeeperState(sk->state),
wp->api.conn_error_message(sk)); wp->api.conn_error_message(sk));
ShutdownConnection(sk); ShutdownConnection(sk);
return false; return false;
default: default:
@@ -1902,8 +1902,8 @@ AsyncFlush(Safekeeper *sk)
return false; return false;
case -1: case -1:
walprop_log(WARNING, "Failed to flush write to node %s:%s in %s state: %s", walprop_log(WARNING, "Failed to flush write to node %s:%s in %s state: %s",
sk->host, sk->port, FormatSafekeeperState(sk->state), sk->host, sk->port, FormatSafekeeperState(sk->state),
wp->api.conn_error_message(sk)); wp->api.conn_error_message(sk));
ResetConnection(sk); ResetConnection(sk);
return false; return false;
default: default:
@@ -2008,7 +2008,7 @@ AssertEventsOkForState(uint32 events, Safekeeper *sk)
* and then an assertion that's guaranteed to fail. * and then an assertion that's guaranteed to fail.
*/ */
walprop_log(WARNING, "events %s mismatched for safekeeper %s:%s in state [%s]", walprop_log(WARNING, "events %s mismatched for safekeeper %s:%s in state [%s]",
FormatEvents(wp, events), sk->host, sk->port, FormatSafekeeperState(sk->state)); FormatEvents(wp, events), sk->host, sk->port, FormatSafekeeperState(sk->state));
Assert(events_ok_for_state); Assert(events_ok_for_state);
} }
} }
@@ -2111,7 +2111,7 @@ FormatEvents(WalProposer *wp, uint32 events)
if (events & (~all_flags)) if (events & (~all_flags))
{ {
walprop_log(WARNING, "Event formatting found unexpected component %d", walprop_log(WARNING, "Event formatting found unexpected component %d",
events & (~all_flags)); events & (~all_flags));
return_str[6] = '*'; return_str[6] = '*';
return_str[7] = '\0'; return_str[7] = '\0';
} }


@@ -356,7 +356,8 @@ typedef struct Safekeeper
/* postgres-specific fields */
#ifndef WALPROPOSER_LIB
/*
* postgres protocol connection to the WAL acceptor
*
@@ -374,17 +375,18 @@ typedef struct Safekeeper
* Position in wait event set. Equal to -1 if no event
*/
int eventPos;
#endif
/* WalProposer library specifics */
#ifdef WALPROPOSER_LIB
/*
* Buffer for incoming messages. Usually Rust vector is stored here.
* Caller is responsible for freeing the buffer.
*/
StringInfoData inbuf;
#endif
} Safekeeper;
/* Re-exported PostgresPollingStatusType */
@@ -472,7 +474,7 @@ typedef struct walproposer_api
WalProposerConnStatusType (*conn_status) (Safekeeper *sk);
/* Start the connection, aka PQconnectStart. */
void (*conn_connect_start) (Safekeeper *sk);
/* Poll an asynchronous connection, aka PQconnectPoll. */
WalProposerConnectPollStatusType (*conn_connect_poll) (Safekeeper *sk);
@@ -490,7 +492,7 @@ typedef struct walproposer_api
void (*conn_finish) (Safekeeper *sk);
/*
* Try to read CopyData message from the safekeeper, aka PQgetCopyData.
*
* On success, the data is placed in *buf. It is valid until the next call
* to this function.
@@ -510,7 +512,7 @@ typedef struct walproposer_api
void (*wal_read) (Safekeeper *sk, char *buf, XLogRecPtr startptr, Size count);
/* Allocate WAL reader. */
void (*wal_reader_allocate) (Safekeeper *sk);
/* Deallocate event set. */
void (*free_event_set) (WalProposer *wp);
@@ -572,7 +574,7 @@ typedef struct walproposer_api
/*
* Called right after the proposer was elected, but before it started
* recovery and sent ProposerElected message to the safekeepers.
*
* Used by logical replication to update truncateLsn.
*/
void (*after_election) (WalProposer *wp);
@@ -626,10 +628,10 @@ typedef struct WalProposerConfig
uint64 systemId;
/* Will be passed to safekeepers in greet request. */
TimeLineID pgTimeline;
#ifdef WALPROPOSER_LIB
void *callback_data;
#endif
} WalProposerConfig;
@@ -710,10 +712,11 @@ extern void WalProposerPoll(WalProposer *wp);
extern void WalProposerFree(WalProposer *wp);
-#define WPEVENT 1337 /* special log level for walproposer internal events */
+#define WPEVENT 1337 /* special log level for walproposer internal
+ * events */
#ifdef WALPROPOSER_LIB
-void WalProposerLibLog(WalProposer *wp, int elevel, char *fmt, ...);
+extern void WalProposerLibLog(WalProposer *wp, int elevel, char *fmt,...);
#define walprop_log(elevel, ...) WalProposerLibLog(wp, elevel, __VA_ARGS__)
#else
#define walprop_log(elevel, ...) elog(elevel, __VA_ARGS__)


@@ -9,8 +9,9 @@
#include "utils/datetime.h"
#include "miscadmin.h"
-void ExceptionalCondition(const char *conditionName,
-const char *fileName, int lineNumber)
+void
+ExceptionalCondition(const char *conditionName,
+const char *fileName, int lineNumber)
{
fprintf(stderr, "ExceptionalCondition: %s:%d: %s\n",
fileName, lineNumber, conditionName);
@@ -169,17 +170,18 @@ timestamptz_to_str(TimestampTz t)
bool
TimestampDifferenceExceeds(TimestampTz start_time,
TimestampTz stop_time,
int msec)
{
TimestampTz diff = stop_time - start_time;
return (diff >= msec * INT64CONST(1000));
}
void
-WalProposerLibLog(WalProposer *wp, int elevel, char *fmt, ...)
+WalProposerLibLog(WalProposer *wp, int elevel, char *fmt,...)
{
char buf[1024];
va_list args;
fmt = _(fmt);


@@ -637,8 +637,8 @@ walprop_connect_start(Safekeeper *sk)
*/
sk->conn = palloc(sizeof(WalProposerConn));
sk->conn->pg_conn = pg_conn;
-sk->conn->is_nonblocking = false; /* connections always start in blocking
- * mode */
+sk->conn->is_nonblocking = false; /* connections always start in
+ * blocking mode */
sk->conn->recvbuf = NULL;
}
@@ -1291,10 +1291,11 @@ XLogWalPropWrite(WalProposer *wp, char *buf, Size nbytes, XLogRecPtr recptr)
/*
* Apart from walproposer, basebackup LSN page is also written out by
* postgres itself which writes WAL only in pages, and in basebackup it is
-* inherently dummy (only safekeepers have historic WAL). Update WAL buffers
-* here to avoid dummy page overwriting correct one we download here. Ugly,
-* but alternatives are about the same ugly. We won't need that if we switch
-* to on-demand WAL download from safekeepers, without writing to disk.
+* inherently dummy (only safekeepers have historic WAL). Update WAL
+* buffers here to avoid dummy page overwriting correct one we download
+* here. Ugly, but alternatives are about the same ugly. We won't need
+* that if we switch to on-demand WAL download from safekeepers, without
+* writing to disk.
*
* https://github.com/neondatabase/neon/issues/5749
*/
@@ -1681,17 +1682,17 @@ walprop_pg_log_internal(WalProposer *wp, int level, const char *line)
static void
walprop_pg_after_election(WalProposer *wp)
{
-FILE* f;
+FILE *f;
XLogRecPtr lrRestartLsn;
-/* We don't need to do anything in syncSafekeepers mode.*/
+/* We don't need to do anything in syncSafekeepers mode. */
if (wp->config->syncSafekeepers)
return;
/*
-* If there are active logical replication subscription we need
-* to provide enough WAL for their WAL senders based on th position
-* of their replication slots.
+* If there are active logical replication subscription we need to provide
+* enough WAL for their WAL senders based on th position of their
+* replication slots.
*/
f = fopen("restart.lsn", "rb");
if (f != NULL && !wp->config->syncSafekeepers)
@@ -1700,8 +1701,12 @@ walprop_pg_after_election(WalProposer *wp)
fclose(f);
if (lrRestartLsn != InvalidXLogRecPtr)
{
elog(LOG, "Logical replication restart LSN %X/%X", LSN_FORMAT_ARGS(lrRestartLsn));
-/* start from the beginning of the segment to fetch page headers verifed by XLogReader */
+/*
+ * start from the beginning of the segment to fetch page headers
+ * verifed by XLogReader
+ */
lrRestartLsn = lrRestartLsn - XLogSegmentOffset(lrRestartLsn, wal_segment_size);
wp->truncateLsn = Min(wp->truncateLsn, lrRestartLsn);
}

poetry.lock (generated)

@@ -1,4 +1,4 @@
-# This file is automatically @generated by Poetry 1.7.1 and should not be changed by hand.
+# This file is automatically @generated by Poetry 1.5.1 and should not be changed by hand.
[[package]]
name = "aiohttp"
@@ -98,18 +98,18 @@ speedups = ["Brotli", "aiodns", "brotlicffi"]
[[package]]
name = "aiopg"
-version = "1.3.4"
+version = "1.4.0"
description = "Postgres integration with asyncio."
optional = false
-python-versions = ">=3.6"
+python-versions = ">=3.7"
files = [
-    {file = "aiopg-1.3.4-py3-none-any.whl", hash = "sha256:b5b74a124831aad71608c3c203479db90bac4a7eb3f8982bc48c3d3e6f1e57bf"},
+    {file = "aiopg-1.4.0-py3-none-any.whl", hash = "sha256:aea46e8aff30b039cfa818e6db4752c97656e893fc75e5a5dc57355a9e9dedbd"},
-    {file = "aiopg-1.3.4.tar.gz", hash = "sha256:23f9e4cd9f28e9d91a6de3b4fb517e8bed25511cd954acccba9fe3a702d9b7d0"},
+    {file = "aiopg-1.4.0.tar.gz", hash = "sha256:116253bef86b4d954116716d181e9a0294037f266718b2e1c9766af995639d71"},
]
[package.dependencies]
async-timeout = ">=3.0,<5.0"
-psycopg2-binary = ">=2.8.4"
+psycopg2-binary = ">=2.9.5"
[package.extras]
sa = ["sqlalchemy[postgresql-psycopg2binary] (>=1.3,<1.5)"]
@@ -160,64 +160,71 @@ pluggy = ">=0.4.0"
[[package]]
name = "async-timeout"
-version = "4.0.2"
+version = "4.0.3"
description = "Timeout context manager for asyncio programs"
optional = false
-python-versions = ">=3.6"
+python-versions = ">=3.7"
files = [
-    {file = "async-timeout-4.0.2.tar.gz", hash = "sha256:2163e1640ddb52b7a8c80d0a67a08587e5d245cc9c553a74a847056bc2976b15"},
+    {file = "async-timeout-4.0.3.tar.gz", hash = "sha256:4640d96be84d82d02ed59ea2b7105a0f7b33abe8703703cd0ab0bf87c427522f"},
-    {file = "async_timeout-4.0.2-py3-none-any.whl", hash = "sha256:8ca1e4fcf50d07413d66d1a5e416e42cfdf5851c981d679a09851a6853383b3c"},
+    {file = "async_timeout-4.0.3-py3-none-any.whl", hash = "sha256:7405140ff1230c310e51dc27b3145b9092d659ce68ff733fb0cefe3ee42be028"},
]
[[package]]
name = "asyncpg"
-version = "0.27.0"
+version = "0.29.0"
description = "An asyncio PostgreSQL driver"
optional = false
-python-versions = ">=3.7.0"
+python-versions = ">=3.8.0"
files = [
-    {file = "asyncpg-0.27.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:fca608d199ffed4903dce1bcd97ad0fe8260f405c1c225bdf0002709132171c2"},
+    {file = "asyncpg-0.29.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:72fd0ef9f00aeed37179c62282a3d14262dbbafb74ec0ba16e1b1864d8a12169"},
-    {file = "asyncpg-0.27.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:20b596d8d074f6f695c13ffb8646d0b6bb1ab570ba7b0cfd349b921ff03cfc1e"},
+    {file = "asyncpg-0.29.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:52e8f8f9ff6e21f9b39ca9f8e3e33a5fcdceaf5667a8c5c32bee158e313be385"},
-    {file = "asyncpg-0.27.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7a6206210c869ebd3f4eb9e89bea132aefb56ff3d1b7dd7e26b102b17e27bbb1"},
+    {file = "asyncpg-0.29.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a9e6823a7012be8b68301342ba33b4740e5a166f6bbda0aee32bc01638491a22"},
-    {file = "asyncpg-0.27.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a7a94c03386bb95456b12c66026b3a87d1b965f0f1e5733c36e7229f8f137747"},
+    {file = "asyncpg-0.29.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:746e80d83ad5d5464cfbf94315eb6744222ab00aa4e522b704322fb182b83610"},
-    {file = "asyncpg-0.27.0-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:bfc3980b4ba6f97138b04f0d32e8af21d6c9fa1f8e6e140c07d15690a0a99279"},
+    {file = "asyncpg-0.29.0-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:ff8e8109cd6a46ff852a5e6bab8b0a047d7ea42fcb7ca5ae6eaae97d8eacf397"},
-    {file = "asyncpg-0.27.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:9654085f2b22f66952124de13a8071b54453ff972c25c59b5ce1173a4283ffd9"},
+    {file = "asyncpg-0.29.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:97eb024685b1d7e72b1972863de527c11ff87960837919dac6e34754768098eb"},
-    {file = "asyncpg-0.27.0-cp310-cp310-win32.whl", hash = "sha256:879c29a75969eb2722f94443752f4720d560d1e748474de54ae8dd230bc4956b"},
+    {file = "asyncpg-0.29.0-cp310-cp310-win32.whl", hash = "sha256:5bbb7f2cafd8d1fa3e65431833de2642f4b2124be61a449fa064e1a08d27e449"},
-    {file = "asyncpg-0.27.0-cp310-cp310-win_amd64.whl", hash = "sha256:ab0f21c4818d46a60ca789ebc92327d6d874d3b7ccff3963f7af0a21dc6cff52"},
+    {file = "asyncpg-0.29.0-cp310-cp310-win_amd64.whl", hash = "sha256:76c3ac6530904838a4b650b2880f8e7af938ee049e769ec2fba7cd66469d7772"},
-    {file = "asyncpg-0.27.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:18f77e8e71e826ba2d0c3ba6764930776719ae2b225ca07e014590545928b576"},
+    {file = "asyncpg-0.29.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:d4900ee08e85af01adb207519bb4e14b1cae8fd21e0ccf80fac6aa60b6da37b4"},
-    {file = "asyncpg-0.27.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c2232d4625c558f2aa001942cac1d7952aa9f0dbfc212f63bc754277769e1ef2"},
+    {file = "asyncpg-0.29.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:a65c1dcd820d5aea7c7d82a3fdcb70e096f8f70d1a8bf93eb458e49bfad036ac"},
-    {file = "asyncpg-0.27.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9a3a4ff43702d39e3c97a8786314123d314e0f0e4dabc8367db5b665c93914de"},
+    {file = "asyncpg-0.29.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5b52e46f165585fd6af4863f268566668407c76b2c72d366bb8b522fa66f1870"},
-    {file = "asyncpg-0.27.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:ccddb9419ab4e1c48742457d0c0362dbdaeb9b28e6875115abfe319b29ee225d"},
+    {file = "asyncpg-0.29.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:dc600ee8ef3dd38b8d67421359779f8ccec30b463e7aec7ed481c8346decf99f"},
-    {file = "asyncpg-0.27.0-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:768e0e7c2898d40b16d4ef7a0b44e8150db3dd8995b4652aa1fe2902e92c7df8"},
+    {file = "asyncpg-0.29.0-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:039a261af4f38f949095e1e780bae84a25ffe3e370175193174eb08d3cecab23"},
-    {file = "asyncpg-0.27.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:609054a1f47292a905582a1cfcca51a6f3f30ab9d822448693e66fdddde27920"},
+    {file = "asyncpg-0.29.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:6feaf2d8f9138d190e5ec4390c1715c3e87b37715cd69b2c3dfca616134efd2b"},
-    {file = "asyncpg-0.27.0-cp311-cp311-win32.whl", hash = "sha256:8113e17cfe236dc2277ec844ba9b3d5312f61bd2fdae6d3ed1c1cdd75f6cf2d8"},
+    {file = "asyncpg-0.29.0-cp311-cp311-win32.whl", hash = "sha256:1e186427c88225ef730555f5fdda6c1812daa884064bfe6bc462fd3a71c4b675"},
-    {file = "asyncpg-0.27.0-cp311-cp311-win_amd64.whl", hash = "sha256:bb71211414dd1eeb8d31ec529fe77cff04bf53efc783a5f6f0a32d84923f45cf"},
+    {file = "asyncpg-0.29.0-cp311-cp311-win_amd64.whl", hash = "sha256:cfe73ffae35f518cfd6e4e5f5abb2618ceb5ef02a2365ce64f132601000587d3"},
-    {file = "asyncpg-0.27.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4750f5cf49ed48a6e49c6e5aed390eee367694636c2dcfaf4a273ca832c5c43c"},
+    {file = "asyncpg-0.29.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:6011b0dc29886ab424dc042bf9eeb507670a3b40aece3439944006aafe023178"},
-    {file = "asyncpg-0.27.0-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:eca01eb112a39d31cc4abb93a5aef2a81514c23f70956729f42fb83b11b3483f"},
+    {file = "asyncpg-0.29.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:b544ffc66b039d5ec5a7454667f855f7fec08e0dfaf5a5490dfafbb7abbd2cfb"},
-    {file = "asyncpg-0.27.0-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:5710cb0937f696ce303f5eed6d272e3f057339bb4139378ccecafa9ee923a71c"},
+    {file = "asyncpg-0.29.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d84156d5fb530b06c493f9e7635aa18f518fa1d1395ef240d211cb563c4e2364"},
-    {file = "asyncpg-0.27.0-cp37-cp37m-win_amd64.whl", hash = "sha256:71cca80a056ebe19ec74b7117b09e650990c3ca535ac1c35234a96f65604192f"},
+    {file = "asyncpg-0.29.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:54858bc25b49d1114178d65a88e48ad50cb2b6f3e475caa0f0c092d5f527c106"},
-    {file = "asyncpg-0.27.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:4bb366ae34af5b5cabc3ac6a5347dfb6013af38c68af8452f27968d49085ecc0"},
+    {file = "asyncpg-0.29.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:bde17a1861cf10d5afce80a36fca736a86769ab3579532c03e45f83ba8a09c59"},
-    {file = "asyncpg-0.27.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:16ba8ec2e85d586b4a12bcd03e8d29e3d99e832764d6a1d0b8c27dbbe4a2569d"},
+    {file = "asyncpg-0.29.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:37a2ec1b9ff88d8773d3eb6d3784dc7e3fee7756a5317b67f923172a4748a175"},
-    {file = "asyncpg-0.27.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d20dea7b83651d93b1eb2f353511fe7fd554752844523f17ad30115d8b9c8cd6"},
+    {file = "asyncpg-0.29.0-cp312-cp312-win32.whl", hash = "sha256:bb1292d9fad43112a85e98ecdc2e051602bce97c199920586be83254d9dafc02"},
{file = "asyncpg-0.27.0-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:e56ac8a8237ad4adec97c0cd4728596885f908053ab725e22900b5902e7f8e69"}, {file = "asyncpg-0.29.0-cp312-cp312-win_amd64.whl", hash = "sha256:2245be8ec5047a605e0b454c894e54bf2ec787ac04b1cb7e0d3c67aa1e32f0fe"},
{file = "asyncpg-0.27.0-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:bf21ebf023ec67335258e0f3d3ad7b91bb9507985ba2b2206346de488267cad0"}, {file = "asyncpg-0.29.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:0009a300cae37b8c525e5b449233d59cd9868fd35431abc470a3e364d2b85cb9"},
{file = "asyncpg-0.27.0-cp38-cp38-win32.whl", hash = "sha256:69aa1b443a182b13a17ff926ed6627af2d98f62f2fe5890583270cc4073f63bf"}, {file = "asyncpg-0.29.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:5cad1324dbb33f3ca0cd2074d5114354ed3be2b94d48ddfd88af75ebda7c43cc"},
{file = "asyncpg-0.27.0-cp38-cp38-win_amd64.whl", hash = "sha256:62932f29cf2433988fcd799770ec64b374a3691e7902ecf85da14d5e0854d1ea"}, {file = "asyncpg-0.29.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:012d01df61e009015944ac7543d6ee30c2dc1eb2f6b10b62a3f598beb6531548"},
{file = "asyncpg-0.27.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:fddcacf695581a8d856654bc4c8cfb73d5c9df26d5f55201722d3e6a699e9629"}, {file = "asyncpg-0.29.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:000c996c53c04770798053e1730d34e30cb645ad95a63265aec82da9093d88e7"},
{file = "asyncpg-0.27.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:7d8585707ecc6661d07367d444bbaa846b4e095d84451340da8df55a3757e152"}, {file = "asyncpg-0.29.0-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:e0bfe9c4d3429706cf70d3249089de14d6a01192d617e9093a8e941fea8ee775"},
{file = "asyncpg-0.27.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:975a320baf7020339a67315284a4d3bf7460e664e484672bd3e71dbd881bc692"}, {file = "asyncpg-0.29.0-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:642a36eb41b6313ffa328e8a5c5c2b5bea6ee138546c9c3cf1bffaad8ee36dd9"},
{file = "asyncpg-0.27.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:2232ebae9796d4600a7819fc383da78ab51b32a092795f4555575fc934c1c89d"}, {file = "asyncpg-0.29.0-cp38-cp38-win32.whl", hash = "sha256:a921372bbd0aa3a5822dd0409da61b4cd50df89ae85150149f8c119f23e8c408"},
{file = "asyncpg-0.27.0-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:88b62164738239f62f4af92567b846a8ef7cf8abf53eddd83650603de4d52163"}, {file = "asyncpg-0.29.0-cp38-cp38-win_amd64.whl", hash = "sha256:103aad2b92d1506700cbf51cd8bb5441e7e72e87a7b3a2ca4e32c840f051a6a3"},
{file = "asyncpg-0.27.0-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:eb4b2fdf88af4fb1cc569781a8f933d2a73ee82cd720e0cb4edabbaecf2a905b"}, {file = "asyncpg-0.29.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:5340dd515d7e52f4c11ada32171d87c05570479dc01dc66d03ee3e150fb695da"},
{file = "asyncpg-0.27.0-cp39-cp39-win32.whl", hash = "sha256:8934577e1ed13f7d2d9cea3cc016cc6f95c19faedea2c2b56a6f94f257cea672"}, {file = "asyncpg-0.29.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:e17b52c6cf83e170d3d865571ba574577ab8e533e7361a2b8ce6157d02c665d3"},
{file = "asyncpg-0.27.0-cp39-cp39-win_amd64.whl", hash = "sha256:1b6499de06fe035cf2fa932ec5617ed3f37d4ebbf663b655922e105a484a6af9"}, {file = "asyncpg-0.29.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f100d23f273555f4b19b74a96840aa27b85e99ba4b1f18d4ebff0734e78dc090"},
{file = "asyncpg-0.27.0.tar.gz", hash = "sha256:720986d9a4705dd8a40fdf172036f5ae787225036a7eb46e704c45aa8f62c054"}, {file = "asyncpg-0.29.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:48e7c58b516057126b363cec8ca02b804644fd012ef8e6c7e23386b7d5e6ce83"},
{file = "asyncpg-0.29.0-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:f9ea3f24eb4c49a615573724d88a48bd1b7821c890c2effe04f05382ed9e8810"},
{file = "asyncpg-0.29.0-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:8d36c7f14a22ec9e928f15f92a48207546ffe68bc412f3be718eedccdf10dc5c"},
{file = "asyncpg-0.29.0-cp39-cp39-win32.whl", hash = "sha256:797ab8123ebaed304a1fad4d7576d5376c3a006a4100380fb9d517f0b59c1ab2"},
{file = "asyncpg-0.29.0-cp39-cp39-win_amd64.whl", hash = "sha256:cce08a178858b426ae1aa8409b5cc171def45d4293626e7aa6510696d46decd8"},
{file = "asyncpg-0.29.0.tar.gz", hash = "sha256:d1c49e1f44fffafd9a55e1a9b101590859d881d639ea2922516f5d9c512d354e"},
] ]
[package.dependencies]
async-timeout = {version = ">=4.0.3", markers = "python_version < \"3.12.0\""}
[package.extras] [package.extras]
dev = ["Cython (>=0.29.24,<0.30.0)", "Sphinx (>=4.1.2,<4.2.0)", "flake8 (>=5.0.4,<5.1.0)", "pytest (>=6.0)", "sphinx-rtd-theme (>=0.5.2,<0.6.0)", "sphinxcontrib-asyncio (>=0.3.0,<0.4.0)", "uvloop (>=0.15.3)"] docs = ["Sphinx (>=5.3.0,<5.4.0)", "sphinx-rtd-theme (>=1.2.2)", "sphinxcontrib-asyncio (>=0.3.0,<0.4.0)"]
docs = ["Sphinx (>=4.1.2,<4.2.0)", "sphinx-rtd-theme (>=0.5.2,<0.6.0)", "sphinxcontrib-asyncio (>=0.3.0,<0.4.0)"] test = ["flake8 (>=6.1,<7.0)", "uvloop (>=0.15.3)"]
test = ["flake8 (>=5.0.4,<5.1.0)", "uvloop (>=0.15.3)"]
 [[package]]
 name = "attrs"
@@ -2476,6 +2483,16 @@ files = [
 {file = "wrapt-1.14.1-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:8ad85f7f4e20964db4daadcab70b47ab05c7c1cf2a7c1e51087bfaa83831854c"},
 {file = "wrapt-1.14.1-cp310-cp310-win32.whl", hash = "sha256:a9a52172be0b5aae932bef82a79ec0a0ce87288c7d132946d645eba03f0ad8a8"},
 {file = "wrapt-1.14.1-cp310-cp310-win_amd64.whl", hash = "sha256:6d323e1554b3d22cfc03cd3243b5bb815a51f5249fdcbb86fda4bf62bab9e164"},
+{file = "wrapt-1.14.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:ecee4132c6cd2ce5308e21672015ddfed1ff975ad0ac8d27168ea82e71413f55"},
+{file = "wrapt-1.14.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:2020f391008ef874c6d9e208b24f28e31bcb85ccff4f335f15a3251d222b92d9"},
+{file = "wrapt-1.14.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2feecf86e1f7a86517cab34ae6c2f081fd2d0dac860cb0c0ded96d799d20b335"},
+{file = "wrapt-1.14.1-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:240b1686f38ae665d1b15475966fe0472f78e71b1b4903c143a842659c8e4cb9"},
+{file = "wrapt-1.14.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a9008dad07d71f68487c91e96579c8567c98ca4c3881b9b113bc7b33e9fd78b8"},
+{file = "wrapt-1.14.1-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:6447e9f3ba72f8e2b985a1da758767698efa72723d5b59accefd716e9e8272bf"},
+{file = "wrapt-1.14.1-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:acae32e13a4153809db37405f5eba5bac5fbe2e2ba61ab227926a22901051c0a"},
+{file = "wrapt-1.14.1-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:49ef582b7a1152ae2766557f0550a9fcbf7bbd76f43fbdc94dd3bf07cc7168be"},
+{file = "wrapt-1.14.1-cp311-cp311-win32.whl", hash = "sha256:358fe87cc899c6bb0ddc185bf3dbfa4ba646f05b1b0b9b5a27c2cb92c2cea204"},
+{file = "wrapt-1.14.1-cp311-cp311-win_amd64.whl", hash = "sha256:26046cd03936ae745a502abf44dac702a5e6880b2b01c29aea8ddf3353b68224"},
 {file = "wrapt-1.14.1-cp35-cp35m-manylinux1_i686.whl", hash = "sha256:43ca3bbbe97af00f49efb06e352eae40434ca9d915906f77def219b88e85d907"},
 {file = "wrapt-1.14.1-cp35-cp35m-manylinux1_x86_64.whl", hash = "sha256:6b1a564e6cb69922c7fe3a678b9f9a3c54e72b469875aa8018f18b4d1dd1adf3"},
 {file = "wrapt-1.14.1-cp35-cp35m-manylinux2010_i686.whl", hash = "sha256:00b6d4ea20a906c0ca56d84f93065b398ab74b927a7a3dbd470f6fc503f95dc3"},
@@ -2697,4 +2714,4 @@ cffi = ["cffi (>=1.11)"]
 [metadata]
 lock-version = "2.0"
 python-versions = "^3.9"
-content-hash = "9f33b4404dbb9803ede5785469241dde1d09132427b87db8928bdbc37ccd6b7a"
+content-hash = "c4e38082d246636903e15c02fbf8364c6afc1fd35d36a81c49f596ba68fc739b"


@@ -4,6 +4,10 @@ version = "0.1.0"
 edition.workspace = true
 license.workspace = true

+[features]
+default = []
+testing = []

 [dependencies]
 anyhow.workspace = true
 async-trait.workspace = true
@@ -57,6 +61,7 @@ thiserror.workspace = true
 tls-listener.workspace = true
 tokio-postgres.workspace = true
 tokio-rustls.workspace = true
+tokio-util.workspace = true
 tokio = { workspace = true, features = ["signal"] }
 tracing-opentelemetry.workspace = true
 tracing-subscriber.workspace = true
@@ -69,13 +74,12 @@ webpki-roots.workspace = true
 x509-parser.workspace = true
 native-tls.workspace = true
 postgres-native-tls.workspace = true
-postgres-protocol.workspace = true
 smol_str.workspace = true
 workspace_hack.workspace = true
-tokio-util.workspace = true

 [dev-dependencies]
 rcgen.workspace = true
 rstest.workspace = true
 tokio-postgres-rustls.workspace = true
+postgres-protocol.workspace = true


@@ -62,6 +62,9 @@ pub enum AuthErrorImpl {
         Please add it to the allowed list in the Neon console."
     )]
     IpAddressNotAllowed,
+
+    #[error("Too many connections to this endpoint. Please try again later.")]
+    TooManyConnections,
 }

 #[derive(Debug, Error)]
@@ -80,6 +83,10 @@ impl AuthError {
     pub fn ip_address_not_allowed() -> Self {
         AuthErrorImpl::IpAddressNotAllowed.into()
     }
+
+    pub fn too_many_connections() -> Self {
+        AuthErrorImpl::TooManyConnections.into()
+    }
 }

 impl<E: Into<AuthErrorImpl>> From<E> for AuthError {
@@ -102,6 +109,7 @@ impl UserFacingError for AuthError {
             MissingEndpointName => self.to_string(),
             Io(_) => "Internal error".to_string(),
             IpAddressNotAllowed => self.to_string(),
+            TooManyConnections => self.to_string(),
         }
     }
 }
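The hunk above adds a `TooManyConnections` variant whose message passes through to the client, while internal errors stay masked. A std-only sketch of that user-facing mapping (hand-written `Display` standing in for the `thiserror` derive; the variant set here is illustrative, not the full enum):

```rust
use std::fmt;

// Each auth error renders a message; only some are safe to show clients.
#[derive(Debug)]
enum AuthError {
    IpAddressNotAllowed,
    TooManyConnections,
    Io,
}

impl fmt::Display for AuthError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AuthError::IpAddressNotAllowed => write!(f, "This IP address is not allowed."),
            AuthError::TooManyConnections => {
                write!(f, "Too many connections to this endpoint. Please try again later.")
            }
            AuthError::Io => write!(f, "io error: connection reset"),
        }
    }
}

// Mirrors the `UserFacingError` impl: I/O details are masked as an
// internal error, everything else passes through verbatim.
fn user_facing(e: &AuthError) -> String {
    match e {
        AuthError::Io => "Internal error".to_string(),
        other => other.to_string(),
    }
}

fn main() {
    assert_eq!(user_facing(&AuthError::Io), "Internal error");
    assert!(user_facing(&AuthError::TooManyConnections).starts_with("Too many connections"));
    println!("ok");
}
```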


@@ -3,9 +3,11 @@ mod hacks;
 mod link;
 pub use link::LinkAuthError;

+use smol_str::SmolStr;
 use tokio_postgres::config::AuthKeys;

 use crate::auth::credentials::check_peer_addr_is_in_list;
+use crate::auth::validate_password_and_exchange;
 use crate::console::errors::GetAuthInfoError;
 use crate::console::provider::AuthInfo;
 use crate::console::AuthSecret;
@@ -24,31 +26,12 @@ use crate::{
 };
 use futures::TryFutureExt;
 use std::borrow::Cow;
+use std::net::IpAddr;
 use std::ops::ControlFlow;
 use std::sync::Arc;
 use tokio::io::{AsyncRead, AsyncWrite};
 use tracing::{error, info, warn};

-/// A product of successful authentication.
-pub struct AuthSuccess<T> {
-    /// Did we send [`pq_proto::BeMessage::AuthenticationOk`] to client?
-    pub reported_auth_ok: bool,
-    /// Something to be considered a positive result.
-    pub value: T,
-}
-
-impl<T> AuthSuccess<T> {
-    /// Very similar to [`std::option::Option::map`].
-    /// Maps [`AuthSuccess<T>`] to [`AuthSuccess<R>`] by applying
-    /// a function to a contained value.
-    pub fn map<R>(self, f: impl FnOnce(T) -> R) -> AuthSuccess<R> {
-        AuthSuccess {
-            reported_auth_ok: self.reported_auth_ok,
-            value: f(self.value),
-        }
-    }
-}

 /// This type serves two purposes:
 ///
 /// * When `T` is `()`, it's just a regular auth backend selector
@@ -61,9 +44,11 @@ pub enum BackendType<'a, T> {
     /// Current Cloud API (V2).
     Console(Cow<'a, console::provider::neon::Api>, T),
     /// Local mock of Cloud API (V2).
+    #[cfg(feature = "testing")]
     Postgres(Cow<'a, console::provider::mock::Api>, T),
     /// Authentication via a web browser.
     Link(Cow<'a, url::ApiUrl>),
+    #[cfg(test)]
     /// Test backend.
     Test(&'a dyn TestBackend),
 }
@@ -78,8 +63,10 @@ impl std::fmt::Display for BackendType<'_, ()> {
         use BackendType::*;
         match self {
             Console(endpoint, _) => fmt.debug_tuple("Console").field(&endpoint.url()).finish(),
+            #[cfg(feature = "testing")]
             Postgres(endpoint, _) => fmt.debug_tuple("Postgres").field(&endpoint.url()).finish(),
             Link(url) => fmt.debug_tuple("Link").field(&url.as_str()).finish(),
+            #[cfg(test)]
             Test(_) => fmt.debug_tuple("Test").finish(),
         }
     }
@@ -92,8 +79,10 @@ impl<T> BackendType<'_, T> {
         use BackendType::*;
         match self {
             Console(c, x) => Console(Cow::Borrowed(c), x),
+            #[cfg(feature = "testing")]
             Postgres(c, x) => Postgres(Cow::Borrowed(c), x),
             Link(c) => Link(Cow::Borrowed(c)),
+            #[cfg(test)]
             Test(x) => Test(*x),
         }
     }
@@ -107,8 +96,10 @@ impl<'a, T> BackendType<'a, T> {
         use BackendType::*;
         match self {
             Console(c, x) => Console(c, f(x)),
+            #[cfg(feature = "testing")]
             Postgres(c, x) => Postgres(c, f(x)),
             Link(c) => Link(c),
+            #[cfg(test)]
             Test(x) => Test(x),
         }
     }
@@ -121,51 +112,87 @@ impl<'a, T, E> BackendType<'a, Result<T, E>> {
         use BackendType::*;
         match self {
             Console(c, x) => x.map(|x| Console(c, x)),
+            #[cfg(feature = "testing")]
             Postgres(c, x) => x.map(|x| Postgres(c, x)),
             Link(c) => Ok(Link(c)),
+            #[cfg(test)]
             Test(x) => Ok(Test(x)),
         }
     }
 }

-pub enum ComputeCredentials {
+pub struct ComputeCredentials<T> {
+    pub info: ComputeUserInfo,
+    pub keys: T,
+}
+
+pub struct ComputeUserInfoNoEndpoint {
+    pub user: SmolStr,
+    pub peer_addr: IpAddr,
+    pub cache_key: SmolStr,
+}
+
+pub struct ComputeUserInfo {
+    pub endpoint: SmolStr,
+    pub inner: ComputeUserInfoNoEndpoint,
+}
+
+pub enum ComputeCredentialKeys {
+    #[cfg(feature = "testing")]
     Password(Vec<u8>),
     AuthKeys(AuthKeys),
 }

+impl TryFrom<ClientCredentials> for ComputeUserInfo {
+    // user name
+    type Error = ComputeUserInfoNoEndpoint;
+
+    fn try_from(creds: ClientCredentials) -> Result<Self, Self::Error> {
+        let inner = ComputeUserInfoNoEndpoint {
+            user: creds.user,
+            peer_addr: creds.peer_addr,
+            cache_key: creds.cache_key,
+        };
+        match creds.project {
+            None => Err(inner),
+            Some(endpoint) => Ok(ComputeUserInfo { endpoint, inner }),
+        }
+    }
+}
+
 /// True to its name, this function encapsulates our current auth trade-offs.
 /// Here, we choose the appropriate auth flow based on circumstances.
-async fn auth_quirks_creds(
+///
+/// All authentication flows will emit an AuthenticationOk message if successful.
+async fn auth_quirks(
     api: &impl console::Api,
-    extra: &ConsoleReqExtra<'_>,
-    creds: &mut ClientCredentials<'_>,
+    extra: &ConsoleReqExtra,
+    creds: ClientCredentials,
     client: &mut stream::PqStream<Stream<impl AsyncRead + AsyncWrite + Unpin>>,
     allow_cleartext: bool,
     config: &'static AuthenticationConfig,
     latency_timer: &mut LatencyTimer,
-) -> auth::Result<AuthSuccess<ComputeCredentials>> {
+) -> auth::Result<ComputeCredentials<ComputeCredentialKeys>> {
     // If there's no project so far, that entails that client doesn't
     // support SNI or other means of passing the endpoint (project) name.
     // We now expect to see a very specific payload in the place of password.
-    let maybe_success = if creds.project.is_none() {
-        // Password will be checked by the compute node later.
-        Some(hacks::password_hack(creds, client, latency_timer).await?)
-    } else {
-        None
+    let (info, unauthenticated_password) = match creds.try_into() {
+        Err(info) => {
+            let res = hacks::password_hack_no_authentication(info, client, latency_timer).await?;
+            (res.info, Some(res.keys))
+        }
+        Ok(info) => (info, None),
     };

-    // Password hack should set the project name.
-    // TODO: make `creds.project` more type-safe.
-    assert!(creds.project.is_some());
-
     info!("fetching user's authentication info");
     // TODO(anna): this will slow down both "hacks" below; we probably need a cache.
     let AuthInfo {
         secret,
         allowed_ips,
-    } = api.get_auth_info(extra, creds).await?;
+    } = api.get_auth_info(extra, &info).await?;
     // check allowed list
-    if !check_peer_addr_is_in_list(&creds.peer_addr.ip(), &allowed_ips) {
+    if !check_peer_addr_is_in_list(&info.inner.peer_addr, &allowed_ips) {
         return Err(auth::AuthError::ip_address_not_allowed());
     }
     let secret = secret.unwrap_or_else(|| {
@@ -173,36 +200,49 @@ async fn auth_quirks_creds(
         // prevent malicious probing (possible due to missing protocol steps).
         // This mocked secret will never lead to successful authentication.
         info!("authentication info not found, mocking it");
-        AuthSecret::Scram(scram::ServerSecret::mock(creds.user, rand::random()))
+        AuthSecret::Scram(scram::ServerSecret::mock(&info.inner.user, rand::random()))
     });

-    if let Some(success) = maybe_success {
-        return Ok(success);
+    if let Some(password) = unauthenticated_password {
+        let auth_outcome = validate_password_and_exchange(&password, secret)?;
+        let keys = match auth_outcome {
+            crate::sasl::Outcome::Success(key) => key,
+            crate::sasl::Outcome::Failure(reason) => {
+                info!("auth backend failed with an error: {reason}");
+                return Err(auth::AuthError::auth_failed(&*info.inner.user));
+            }
+        };
+        // we have authenticated the password
+        client.write_message_noflush(&pq_proto::BeMessage::AuthenticationOk)?;
+        return Ok(ComputeCredentials { info, keys });
     }

+    // -- the remaining flows are self-authenticating --
+
     // Perform cleartext auth if we're allowed to do that.
     // Currently, we use it for websocket connections (latency).
     if allow_cleartext {
-        // Password will be checked by the compute node later.
-        return hacks::cleartext_hack(client, latency_timer).await;
+        return hacks::authenticate_cleartext(info, client, latency_timer, secret).await;
     }

     // Finally, proceed with the main auth flow (SCRAM-based).
-    classic::authenticate(creds, client, config, latency_timer, secret).await
+    classic::authenticate(info, client, config, latency_timer, secret).await
 }
-/// True to its name, this function encapsulates our current auth trade-offs.
-/// Here, we choose the appropriate auth flow based on circumstances.
-async fn auth_quirks(
+/// Authenticate the user and then wake a compute (or retrieve an existing compute session from cache)
+/// only if authentication was successfuly.
+async fn auth_and_wake_compute(
     api: &impl console::Api,
-    extra: &ConsoleReqExtra<'_>,
-    creds: &mut ClientCredentials<'_>,
+    extra: &ConsoleReqExtra,
+    creds: ClientCredentials,
     client: &mut stream::PqStream<Stream<impl AsyncRead + AsyncWrite + Unpin>>,
     allow_cleartext: bool,
     config: &'static AuthenticationConfig,
     latency_timer: &mut LatencyTimer,
-) -> auth::Result<AuthSuccess<CachedNodeInfo>> {
-    let auth_stuff = auth_quirks_creds(
+) -> auth::Result<(CachedNodeInfo, ComputeUserInfo)> {
+    let compute_credentials = auth_quirks(
         api,
         extra,
         creds,
@@ -215,7 +255,7 @@ async fn auth_quirks(
     let mut num_retries = 0;
     let mut node = loop {
-        let wake_res = api.wake_compute(extra, creds).await;
+        let wake_res = api.wake_compute(extra, &compute_credentials.info).await;
         match handle_try_wake(wake_res, num_retries) {
             Err(e) => {
                 error!(error = ?e, num_retries, retriable = false, "couldn't wake compute node");
@@ -232,27 +272,27 @@ async fn auth_quirks(
         tokio::time::sleep(wait_duration).await;
     };

-    match auth_stuff.value {
-        ComputeCredentials::Password(password) => node.config.password(password),
-        ComputeCredentials::AuthKeys(auth_keys) => node.config.auth_keys(auth_keys),
+    match compute_credentials.keys {
+        #[cfg(feature = "testing")]
+        ComputeCredentialKeys::Password(password) => node.config.password(password),
+        ComputeCredentialKeys::AuthKeys(auth_keys) => node.config.auth_keys(auth_keys),
     };

-    Ok(AuthSuccess {
-        reported_auth_ok: auth_stuff.reported_auth_ok,
-        value: node,
-    })
+    Ok((node, compute_credentials.info))
 }

-impl BackendType<'_, ClientCredentials<'_>> {
+impl<'a> BackendType<'a, ClientCredentials> {
     /// Get compute endpoint name from the credentials.
-    pub fn get_endpoint(&self) -> Option<String> {
+    pub fn get_endpoint(&self) -> Option<SmolStr> {
         use BackendType::*;

         match self {
             Console(_, creds) => creds.project.clone(),
+            #[cfg(feature = "testing")]
             Postgres(_, creds) => creds.project.clone(),
-            Link(_) => Some("link".to_owned()),
-            Test(_) => Some("test".to_owned()),
+            Link(_) => Some("link".into()),
+            #[cfg(test)]
+            Test(_) => Some("test".into()),
         }
     }
@@ -261,9 +301,11 @@ impl BackendType<'_, ClientCredentials<'_>> {
         use BackendType::*;

         match self {
-            Console(_, creds) => creds.user,
-            Postgres(_, creds) => creds.user,
+            Console(_, creds) => &creds.user,
+            #[cfg(feature = "testing")]
+            Postgres(_, creds) => &creds.user,
             Link(_) => "link",
+            #[cfg(test)]
             Test(_) => "test",
         }
     }
@@ -271,26 +313,25 @@ impl BackendType<'_, ClientCredentials<'_>> {
     /// Authenticate the client via the requested backend, possibly using credentials.
     #[tracing::instrument(fields(allow_cleartext = allow_cleartext), skip_all)]
     pub async fn authenticate(
-        &mut self,
-        extra: &ConsoleReqExtra<'_>,
+        self,
+        extra: &ConsoleReqExtra,
         client: &mut stream::PqStream<Stream<impl AsyncRead + AsyncWrite + Unpin>>,
         allow_cleartext: bool,
         config: &'static AuthenticationConfig,
         latency_timer: &mut LatencyTimer,
-    ) -> auth::Result<AuthSuccess<CachedNodeInfo>> {
+    ) -> auth::Result<(CachedNodeInfo, BackendType<'a, ComputeUserInfo>)> {
         use BackendType::*;

         let res = match self {
             Console(api, creds) => {
                 info!(
-                    user = creds.user,
+                    user = &*creds.user,
                     project = creds.project(),
                     "performing authentication using the console"
                 );
-                let api = api.as_ref();
-                auth_quirks(
-                    api,
+                let (cache_info, user_info) = auth_and_wake_compute(
+                    &*api,
                     extra,
                     creds,
                     client,
@@ -298,18 +339,19 @@ impl BackendType<'_, ClientCredentials<'_>> {
                     config,
                     latency_timer,
                 )
-                .await?
+                .await?;
+                (cache_info, BackendType::Console(api, user_info))
             }
+            #[cfg(feature = "testing")]
             Postgres(api, creds) => {
                 info!(
-                    user = creds.user,
+                    user = &*creds.user,
                     project = creds.project(),
                     "performing authentication using a local postgres instance"
                 );
-                let api = api.as_ref();
-                auth_quirks(
-                    api,
+                let (cache_info, user_info) = auth_and_wake_compute(
+                    &*api,
                     extra,
                     creds,
                     client,
@@ -317,16 +359,21 @@ impl BackendType<'_, ClientCredentials<'_>> {
                     config,
                     latency_timer,
                 )
-                .await?
+                .await?;
+                (cache_info, BackendType::Postgres(api, user_info))
             }
             // NOTE: this auth backend doesn't use client credentials.
             Link(url) => {
                 info!("performing link authentication");
-                link::authenticate(url, client)
-                    .await?
-                    .map(CachedNodeInfo::new_uncached)
+                let node_info = link::authenticate(&url, client).await?;
+
+                (
+                    CachedNodeInfo::new_uncached(node_info),
+                    BackendType::Link(url),
+                )
             }
+            #[cfg(test)]
             Test(_) => {
                 unreachable!("this function should never be called in the test backend")
             }
@@ -335,16 +382,20 @@ impl BackendType<'_, ClientCredentials<'_>> {
         info!("user successfully authenticated");
         Ok(res)
     }
+}
+
+impl BackendType<'_, ComputeUserInfo> {
     pub async fn get_allowed_ips(
         &self,
-        extra: &ConsoleReqExtra<'_>,
+        extra: &ConsoleReqExtra,
     ) -> Result<Arc<Vec<String>>, GetAuthInfoError> {
         use BackendType::*;
         match self {
             Console(api, creds) => api.get_allowed_ips(extra, creds).await,
+            #[cfg(feature = "testing")]
             Postgres(api, creds) => api.get_allowed_ips(extra, creds).await,
             Link(_) => Ok(Arc::new(vec![])),
+            #[cfg(test)]
             Test(x) => x.get_allowed_ips(),
         }
     }
@@ -353,14 +404,16 @@ impl BackendType<'_, ClientCredentials<'_>> {
     /// The link auth flow doesn't support this, so we return [`None`] in that case.
     pub async fn wake_compute(
         &self,
-        extra: &ConsoleReqExtra<'_>,
+        extra: &ConsoleReqExtra,
     ) -> Result<Option<CachedNodeInfo>, console::errors::WakeComputeError> {
         use BackendType::*;
         match self {
             Console(api, creds) => api.wake_compute(extra, creds).map_ok(Some).await,
+            #[cfg(feature = "testing")]
             Postgres(api, creds) => api.wake_compute(extra, creds).map_ok(Some).await,
             Link(_) => Ok(None),
+            #[cfg(test)]
             Test(x) => x.wake_compute().map(Some),
         }
     }
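The `wake_compute` loop earlier in this file retries a fallible wake with a bounded retry count, sleeping between attempts. A synchronous, self-contained sketch of that shape (the real code's `handle_try_wake` and its backoff policy are not reproduced; `wake`, `wake_with_retries`, and the fixed delay are hypothetical stand-ins):

```rust
use std::time::Duration;

// Simulated wake: fails until the compute is "ready" on the third attempt.
fn wake(attempt: u32) -> Result<&'static str, &'static str> {
    if attempt < 2 { Err("not ready") } else { Ok("node") }
}

// Retry loop mirroring the diff's structure: track `num_retries`, give up
// past the cap, otherwise sleep and try again.
fn wake_with_retries(max_retries: u32) -> Result<&'static str, &'static str> {
    let mut num_retries = 0;
    loop {
        match wake(num_retries) {
            Ok(node) => return Ok(node),
            Err(e) if num_retries >= max_retries => return Err(e),
            Err(_) => {
                num_retries += 1;
                // The real proxy sleeps an adaptive `wait_duration` here.
                std::thread::sleep(Duration::from_millis(1));
            }
        }
    }
}

fn main() {
    assert_eq!(wake_with_retries(5), Ok("node"));
    assert_eq!(wake_with_retries(1), Err("not ready"));
    println!("ok");
}
```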


@@ -1,6 +1,6 @@
-use super::{AuthSuccess, ComputeCredentials};
+use super::{ComputeCredentials, ComputeUserInfo};
 use crate::{
-    auth::{self, AuthFlow, ClientCredentials},
+    auth::{self, backend::ComputeCredentialKeys, AuthFlow},
     compute,
     config::AuthenticationConfig,
     console::AuthSecret,
@@ -12,14 +12,15 @@ use tokio::io::{AsyncRead, AsyncWrite};
use tracing::{info, warn}; use tracing::{info, warn};
pub(super) async fn authenticate( pub(super) async fn authenticate(
creds: &ClientCredentials<'_>, creds: ComputeUserInfo,
client: &mut PqStream<Stream<impl AsyncRead + AsyncWrite + Unpin>>, client: &mut PqStream<Stream<impl AsyncRead + AsyncWrite + Unpin>>,
config: &'static AuthenticationConfig, config: &'static AuthenticationConfig,
latency_timer: &mut LatencyTimer, latency_timer: &mut LatencyTimer,
secret: AuthSecret, secret: AuthSecret,
) -> auth::Result<AuthSuccess<ComputeCredentials>> { ) -> auth::Result<ComputeCredentials<ComputeCredentialKeys>> {
let flow = AuthFlow::new(client); let flow = AuthFlow::new(client);
let scram_keys = match secret { let scram_keys = match secret {
#[cfg(feature = "testing")]
AuthSecret::Md5(_) => { AuthSecret::Md5(_) => {
info!("auth endpoint chooses MD5"); info!("auth endpoint chooses MD5");
return Err(auth::AuthError::bad_auth_method("MD5")); return Err(auth::AuthError::bad_auth_method("MD5"));
@@ -53,7 +54,7 @@ pub(super) async fn authenticate(
sasl::Outcome::Success(key) => key, sasl::Outcome::Success(key) => key,
sasl::Outcome::Failure(reason) => { sasl::Outcome::Failure(reason) => {
info!("auth backend failed with an error: {reason}"); info!("auth backend failed with an error: {reason}");
return Err(auth::AuthError::auth_failed(creds.user)); return Err(auth::AuthError::auth_failed(&*creds.inner.user));
} }
}; };
@@ -64,9 +65,9 @@ pub(super) async fn authenticate(
} }
}; };
Ok(AuthSuccess { Ok(ComputeCredentials {
reported_auth_ok: false, info: creds,
value: ComputeCredentials::AuthKeys(tokio_postgres::config::AuthKeys::ScramSha256( keys: ComputeCredentialKeys::AuthKeys(tokio_postgres::config::AuthKeys::ScramSha256(
scram_keys, scram_keys,
)), )),
}) })
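The new `authenticate` path above converts a SASL outcome into either SCRAM keys or an auth error naming the user. A minimal standalone sketch of that match, with a hypothetical stand-in `Outcome` enum in place of the proxy's `sasl::Outcome` (names here are illustrative, not the crate's API):

```rust
// Hypothetical stand-in for the proxy's sasl::Outcome type.
enum Outcome<K> {
    Success(K),
    Failure(&'static str),
}

// Mirrors the diff's match: success yields the keys, failure becomes an
// error mentioning the user (as in `auth_failed(&*creds.inner.user)`).
fn keys_or_error<K>(outcome: Outcome<K>, user: &str) -> Result<K, String> {
    match outcome {
        Outcome::Success(key) => Ok(key),
        // the real code logs `reason` via tracing before returning the error
        Outcome::Failure(reason) => Err(format!(
            "password authentication failed for user '{user}': {reason}"
        )),
    }
}

fn main() {
    assert_eq!(keys_or_error(Outcome::Success(42u8), "john_doe"), Ok(42));
    assert!(keys_or_error::<u8>(Outcome::Failure("bad password"), "john_doe").is_err());
}
```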


@@ -1,7 +1,11 @@
-use super::{AuthSuccess, ComputeCredentials};
+use super::{
+    ComputeCredentialKeys, ComputeCredentials, ComputeUserInfo, ComputeUserInfoNoEndpoint,
+};
 use crate::{
-    auth::{self, AuthFlow, ClientCredentials},
+    auth::{self, AuthFlow},
+    console::AuthSecret,
     proxy::LatencyTimer,
+    sasl,
     stream::{self, Stream},
 };
 use tokio::io::{AsyncRead, AsyncWrite};
@@ -11,35 +15,42 @@ use tracing::{info, warn};
 /// one round trip and *expensive* computations (>= 4096 HMAC iterations).
 /// These properties are benefical for serverless JS workers, so we
 /// use this mechanism for websocket connections.
-pub async fn cleartext_hack(
+pub async fn authenticate_cleartext(
+    info: ComputeUserInfo,
     client: &mut stream::PqStream<Stream<impl AsyncRead + AsyncWrite + Unpin>>,
     latency_timer: &mut LatencyTimer,
-) -> auth::Result<AuthSuccess<ComputeCredentials>> {
+    secret: AuthSecret,
+) -> auth::Result<ComputeCredentials<ComputeCredentialKeys>> {
     warn!("cleartext auth flow override is enabled, proceeding");

     // pause the timer while we communicate with the client
     let _paused = latency_timer.pause();

-    let password = AuthFlow::new(client)
-        .begin(auth::CleartextPassword)
+    let auth_outcome = AuthFlow::new(client)
+        .begin(auth::CleartextPassword(secret))
         .await?
         .authenticate()
         .await?;

-    // Report tentative success; compute node will check the password anyway.
-    Ok(AuthSuccess {
-        reported_auth_ok: false,
-        value: ComputeCredentials::Password(password),
-    })
+    let keys = match auth_outcome {
+        sasl::Outcome::Success(key) => key,
+        sasl::Outcome::Failure(reason) => {
+            info!("auth backend failed with an error: {reason}");
+            return Err(auth::AuthError::auth_failed(&*info.inner.user));
+        }
+    };
+
+    Ok(ComputeCredentials { info, keys })
 }

 /// Workaround for clients which don't provide an endpoint (project) name.
-/// Very similar to [`cleartext_hack`], but there's a specific password format.
-pub async fn password_hack(
-    creds: &mut ClientCredentials<'_>,
+/// Similar to [`authenticate_cleartext`], but there's a specific password format,
+/// and passwords are not yet validated (we don't know how to validate them!)
+pub async fn password_hack_no_authentication(
+    info: ComputeUserInfoNoEndpoint,
     client: &mut stream::PqStream<Stream<impl AsyncRead + AsyncWrite + Unpin>>,
     latency_timer: &mut LatencyTimer,
-) -> auth::Result<AuthSuccess<ComputeCredentials>> {
+) -> auth::Result<ComputeCredentials<Vec<u8>>> {
     warn!("project not specified, resorting to the password hack auth flow");

     // pause the timer while we communicate with the client
@@ -48,15 +59,17 @@ pub async fn password_hack(
     let payload = AuthFlow::new(client)
         .begin(auth::PasswordHack)
         .await?
-        .authenticate()
+        .get_password()
         .await?;

-    info!(project = &payload.endpoint, "received missing parameter");
-    creds.project = Some(payload.endpoint);
+    info!(project = &*payload.endpoint, "received missing parameter");

     // Report tentative success; compute node will check the password anyway.
-    Ok(AuthSuccess {
-        reported_auth_ok: false,
-        value: ComputeCredentials::Password(payload.password),
+    Ok(ComputeCredentials {
+        info: ComputeUserInfo {
+            inner: info,
+            endpoint: payload.endpoint,
+        },
+        keys: payload.password,
     })
 }
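The password-hack refactor above replaces in-place mutation (`creds.project = Some(payload.endpoint)`) with returning a value that carries the recovered endpoint alongside the still-unvalidated password. A sketch of that shape, using hypothetical stand-in structs for `ComputeUserInfoNoEndpoint` / `ComputeUserInfo` / `ComputeCredentials` (the real types live in the proxy's `auth::backend`):

```rust
// Illustrative stand-ins, not the proxy's actual type definitions.
struct UserInfoNoEndpoint { user: String }
struct UserInfo { inner: UserInfoNoEndpoint, endpoint: String }
struct Credentials<K> { info: UserInfo, keys: K }

// Like password_hack_no_authentication: take ownership of the partial info,
// attach the endpoint recovered from the payload, and return the raw
// password bytes as unvalidated keys instead of mutating the input.
fn attach_endpoint(
    info: UserInfoNoEndpoint,
    endpoint: String,
    password: Vec<u8>,
) -> Credentials<Vec<u8>> {
    Credentials {
        info: UserInfo { inner: info, endpoint },
        keys: password,
    }
}

fn main() {
    let creds = attach_endpoint(
        UserInfoNoEndpoint { user: "john_doe".into() },
        "my-project".into(),
        b"secret".to_vec(),
    );
    assert_eq!(creds.info.endpoint, "my-project");
    assert_eq!(creds.info.inner.user, "john_doe");
    assert_eq!(creds.keys, b"secret".to_vec());
}
```

Returning owned data also lets the caller keep the no-endpoint and with-endpoint states as distinct types, which is what makes the "endpoint is now known" transition explicit in the diff.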


@@ -1,4 +1,3 @@
-use super::AuthSuccess;
 use crate::{
     auth, compute,
     console::{self, provider::NodeInfo},
@@ -57,7 +56,7 @@ pub fn new_psql_session_id() -> String {
 pub(super) async fn authenticate(
     link_uri: &reqwest::Url,
     client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin>,
-) -> auth::Result<AuthSuccess<NodeInfo>> {
+) -> auth::Result<NodeInfo> {
     let psql_session_id = new_psql_session_id();
     let span = info_span!("link", psql_session_id = &psql_session_id);
     let greeting = hello_message(link_uri, &psql_session_id);
@@ -102,12 +101,9 @@ pub(super) async fn authenticate(
         config.password(password.as_ref());
     }

-    Ok(AuthSuccess {
-        reported_auth_ok: true,
-        value: NodeInfo {
-            config,
-            aux: db_info.aux,
-            allow_self_signed_compute: false, // caller may override
-        },
+    Ok(NodeInfo {
+        config,
+        aux: db_info.aux,
+        allow_self_signed_compute: false, // caller may override
     })
 }


@@ -3,14 +3,12 @@
 use crate::{
     auth::password_hack::parse_endpoint_param,
     error::UserFacingError,
-    proxy::{neon_options, NUM_CONNECTION_ACCEPTED_BY_SNI},
+    proxy::{neon_options_str, NUM_CONNECTION_ACCEPTED_BY_SNI},
 };
 use itertools::Itertools;
 use pq_proto::StartupMessageParams;
-use std::{
-    collections::HashSet,
-    net::{IpAddr, SocketAddr},
-};
+use smol_str::SmolStr;
+use std::{collections::HashSet, net::IpAddr};
 use thiserror::Error;
 use tracing::{info, warn};
@@ -24,7 +22,7 @@ pub enum ClientCredsParseError {
         SNI ('{}') and project option ('{}').",
         .domain, .option,
     )]
-    InconsistentProjectNames { domain: String, option: String },
+    InconsistentProjectNames { domain: SmolStr, option: SmolStr },

     #[error(
         "Common name inferred from SNI ('{}') is not known",
@@ -33,7 +31,7 @@
     UnknownCommonName { cn: String },

     #[error("Project name ('{0}') must contain only alphanumeric characters and hyphen.")]
-    MalformedProjectName(String),
+    MalformedProjectName(SmolStr),
 }

 impl UserFacingError for ClientCredsParseError {}
@@ -41,34 +39,34 @@
 /// Various client credentials which we use for authentication.
 /// Note that we don't store any kind of client key or password here.
 #[derive(Debug, Clone, PartialEq, Eq)]
-pub struct ClientCredentials<'a> {
-    pub user: &'a str,
+pub struct ClientCredentials {
+    pub user: SmolStr,
     // TODO: this is a severe misnomer! We should think of a new name ASAP.
-    pub project: Option<String>,
-    pub cache_key: String,
-    pub peer_addr: SocketAddr,
+    pub project: Option<SmolStr>,
+    pub cache_key: SmolStr,
+    pub peer_addr: IpAddr,
 }

-impl ClientCredentials<'_> {
+impl ClientCredentials {
     #[inline]
     pub fn project(&self) -> Option<&str> {
         self.project.as_deref()
     }
 }

-impl<'a> ClientCredentials<'a> {
+impl ClientCredentials {
     pub fn parse(
-        params: &'a StartupMessageParams,
+        params: &StartupMessageParams,
         sni: Option<&str>,
         common_names: Option<HashSet<String>>,
-        peer_addr: SocketAddr,
+        peer_addr: IpAddr,
     ) -> Result<Self, ClientCredsParseError> {
         use ClientCredsParseError::*;

         // Some parameters are stored in the startup message.
         let get_param = |key| params.get(key).ok_or(MissingKey(key));
-        let user = get_param("user")?;
+        let user = get_param("user")?.into();

         // Project name might be passed via PG's command-line options.
         let project_option = params
@@ -82,7 +80,7 @@ impl<'a> ClientCredentials<'a> {
                 .at_most_one()
                 .ok()?
         })
-        .map(|name| name.to_string());
+        .map(|name| name.into());

         let project_from_domain = if let Some(sni_str) = sni {
             if let Some(cn) = common_names {
@@ -121,7 +119,7 @@ impl<'a> ClientCredentials<'a> {
         }
         .transpose()?;

-        info!(user, project = project.as_deref(), "credentials");
+        info!(%user, project = project.as_deref(), "credentials");

         if sni.is_some() {
             info!("Connection with sni");
             NUM_CONNECTION_ACCEPTED_BY_SNI
@@ -142,8 +140,9 @@ impl<'a> ClientCredentials<'a> {
         let cache_key = format!(
             "{}{}",
             project.as_deref().unwrap_or(""),
-            neon_options(params).unwrap_or("".to_string())
-        );
+            neon_options_str(params)
+        )
+        .into();

         Ok(Self {
             user,
@@ -206,10 +205,10 @@ fn project_name_valid(name: &str) -> bool {
     name.chars().all(|c| c.is_alphanumeric() || c == '-')
 }

-fn subdomain_from_sni(sni: &str, common_name: &str) -> Option<String> {
+fn subdomain_from_sni(sni: &str, common_name: &str) -> Option<SmolStr> {
     sni.strip_suffix(common_name)?
         .strip_suffix('.')
-        .map(str::to_owned)
+        .map(SmolStr::from)
 }

 #[cfg(test)]
@@ -221,7 +220,7 @@ mod tests {
     fn parse_bare_minimum() -> anyhow::Result<()> {
         // According to postgresql, only `user` should be required.
         let options = StartupMessageParams::new([("user", "john_doe")]);
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, None, None, peer_addr)?;
         assert_eq!(creds.user, "john_doe");
         assert_eq!(creds.project, None);
@@ -236,7 +235,7 @@
             ("database", "world"), // should be ignored
             ("foo", "bar"), // should be ignored
         ]);
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, None, None, peer_addr)?;
         assert_eq!(creds.user, "john_doe");
         assert_eq!(creds.project, None);
@@ -251,7 +250,7 @@
         let sni = Some("foo.localhost");
         let common_names = Some(["localhost".into()].into());
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, sni, common_names, peer_addr)?;
         assert_eq!(creds.user, "john_doe");
         assert_eq!(creds.project.as_deref(), Some("foo"));
@@ -267,7 +266,7 @@
             ("options", "-ckey=1 project=bar -c geqo=off"),
         ]);
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, None, None, peer_addr)?;
         assert_eq!(creds.user, "john_doe");
         assert_eq!(creds.project.as_deref(), Some("bar"));
@@ -282,7 +281,7 @@
             ("options", "-ckey=1 endpoint=bar -c geqo=off"),
         ]);
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, None, None, peer_addr)?;
         assert_eq!(creds.user, "john_doe");
         assert_eq!(creds.project.as_deref(), Some("bar"));
@@ -300,7 +299,7 @@
             ),
         ]);
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, None, None, peer_addr)?;
         assert_eq!(creds.user, "john_doe");
         assert!(creds.project.is_none());
@@ -315,7 +314,7 @@
             ("options", "-ckey=1 endpoint=bar project=foo -c geqo=off"),
         ]);
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, None, None, peer_addr)?;
         assert_eq!(creds.user, "john_doe");
         assert!(creds.project.is_none());
@@ -330,7 +329,7 @@
         let sni = Some("baz.localhost");
         let common_names = Some(["localhost".into()].into());
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, sni, common_names, peer_addr)?;
         assert_eq!(creds.user, "john_doe");
         assert_eq!(creds.project.as_deref(), Some("baz"));
@@ -344,13 +343,13 @@
         let common_names = Some(["a.com".into(), "b.com".into()].into());
         let sni = Some("p1.a.com");
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, sni, common_names, peer_addr)?;
         assert_eq!(creds.project.as_deref(), Some("p1"));

         let common_names = Some(["a.com".into(), "b.com".into()].into());
         let sni = Some("p1.b.com");
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, sni, common_names, peer_addr)?;
         assert_eq!(creds.project.as_deref(), Some("p1"));
@@ -365,7 +364,7 @@
         let sni = Some("second.localhost");
         let common_names = Some(["localhost".into()].into());
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let err = ClientCredentials::parse(&options, sni, common_names, peer_addr)
             .expect_err("should fail");
         match err {
@@ -384,7 +383,7 @@
         let sni = Some("project.localhost");
         let common_names = Some(["example.com".into()].into());
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let err = ClientCredentials::parse(&options, sni, common_names, peer_addr)
             .expect_err("should fail");
         match err {
@@ -404,13 +403,10 @@
         let sni = Some("project.localhost");
         let common_names = Some(["localhost".into()].into());
-        let peer_addr = SocketAddr::from(([127, 0, 0, 1], 1234));
+        let peer_addr = IpAddr::from([127, 0, 0, 1]);
         let creds = ClientCredentials::parse(&options, sni, common_names, peer_addr)?;
         assert_eq!(creds.project.as_deref(), Some("project"));
-        assert_eq!(
-            creds.cache_key,
-            "projectneon_endpoint_type:read_write neon_lsn:0/2"
-        );
+        assert_eq!(creds.cache_key, "projectendpoint_type:read_write lsn:0/2");

         Ok(())
     }
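`subdomain_from_sni` in the diff above is small enough to run standalone. This sketch keeps the exact logic but returns `String` instead of `SmolStr` so it compiles with the standard library only:

```rust
// Same logic as the diff's subdomain_from_sni, minus the SmolStr dependency:
// strip the common name, then the separating dot; None if either is missing.
fn subdomain_from_sni(sni: &str, common_name: &str) -> Option<String> {
    sni.strip_suffix(common_name)?
        .strip_suffix('.')
        .map(str::to_owned)
}

fn main() {
    assert_eq!(
        subdomain_from_sni("foo.localhost", "localhost"),
        Some("foo".to_owned())
    );
    assert_eq!(subdomain_from_sni("p1.a.com", "a.com"), Some("p1".to_owned()));
    // An SNI equal to the common name has no dot to strip, so it yields None,
    // as does an SNI under an unrelated domain.
    assert_eq!(subdomain_from_sni("localhost", "localhost"), None);
    assert_eq!(subdomain_from_sni("project.example.org", "localhost"), None);
}
```

The `?` after `strip_suffix(common_name)` is what makes both failure cases fall through to `None` without an explicit match.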

Some files were not shown because too many files have changed in this diff.