Compare commits

..

54 Commits

Author SHA1 Message Date
Konstantin Knizhnik
92e72cc3f3 Restore checkl for FSM/VM fork in neon_wallog_page 2024-06-04 14:08:45 +03:00
Konstantin Knizhnik
f9416ebf2b Do not write pages to the local disk during unlogged build 2024-06-04 09:20:51 +03:00
Konstantin Knizhnik
0c9dee9d06 Rebase with main 2024-06-03 21:36:37 +03:00
Konstantin Knizhnik
5a5775806f Restore check for poreserving pgdata_dir content 2024-06-03 21:16:04 +03:00
Konstantin Knizhnik
947f8c59dd Fix unlogged build 2024-06-03 21:16:02 +03:00
Konstantin Knizhnik
520101170f Pin information about unlogged relations in relsize cache until end of the build 2024-06-03 21:15:14 +03:00
Konstantin Knizhnik
1bd86c5c6a Rewrite unlogged relation build 2024-06-03 21:15:12 +03:00
Konstantin Knizhnik
e4fc6c3162 Comment check for pgdatadir match 2024-06-03 21:12:23 +03:00
Konstantin Knizhnik
fcd7d7008f Support unlogged build in Neon erxtension 2024-06-03 21:12:21 +03:00
Arpad Müller
acf0a11fea Move keyspace utils to inherent impls (#7929)
The keyspace utils like `is_rel_size_key` or `is_rel_fsm_block_key` and
many others are free functions and have to be either imported separately
or specified with the full path starting in `pageserver_api:🔑:`.
This is less convenient than if these functions were just inherent
impls.

Follow-up of #7890
Fixes #6438
2024-06-03 16:18:07 +02:00
Alex Chi Z
c1f55c1525 feat(pageserver): collect aux file tombstones (#7900)
close https://github.com/neondatabase/neon/issues/7800

This is a small change to enable the tombstone -> exclude from image
layer path. Most of the pull request is unit tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-03 09:56:36 -04:00
Joonas Koivunen
34f450c05a test: allow no vectored gets happening (#7939)
when running the regress tests locally without any environment variables
we use on CI, `test_pageserver_compaction_smoke` fails with division by
zero. fix it temporarily by allowing no vectored read happening. to be
cleaned when vectored get validation gets removed and the default value
can be changed.

Cc: https://github.com/neondatabase/neon/issues/7381
2024-06-03 09:37:11 -04:00
Arpad Müller
db477c0b8c Add metrics for Azure blob storage (#7933)
In issue #5590 it was proposed to implement metrics for Azure blob
storage. This PR implements them except for the part that performs the
rename, which is left for a followup.

Closes #5590
2024-06-02 14:10:56 +00:00
Arthur Petukhovsky
a345cf3fc6 Fix span for WAL removal task (#7930)
During refactoring in https://github.com/neondatabase/neon/pull/7887 I
forgot to add "WAL removal" span with ttid. This commit fixes it.
2024-06-01 12:23:59 +01:00
Arthur Petukhovsky
e98bc4fd2b Run gc on too many partial backup segments (#7700)
The general partial backup idea is that each safekeeper keeps only one
partial segment in remote storage at a time. Sometimes this is not true,
for example if we uploaded object to S3 but got an error when tried to
remove the previous upload. In this case we still keep a list of all
potentially uploaded objects in safekeeper state.

This commit prints a warning to logs if there is too many objects in
safekeeper state. This is not expected and we should try to fix this
state, we can do this by running gc.

I haven't seen this being an issue anywhere, but printing a warning is
something that I wanted to do and forgot in initial PR.
2024-06-01 00:18:56 +01:00
John Spray
7e60563910 pageserver: add GcError type (#7917)
## Problem

- Because GC exposes all errors as an anyhow::Error, we have
intermittent issues with spurious log errors during shutdown, e.g. in
this failure of a performance test
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9300804302/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/214a2154f6f0217a/

```
Gc failed 1 times, retrying in 2s: shutting down
```

GC really doesn't do a lot of complicated IO: it doesn't benefit from
the backtrace capabilities of anyhow::Error, and can be expressed more
robustly as an enum.

## Summary of changes

- Add GcError type and use it instead of anyhow::Error in GC functions
- In `gc_iteration_internal`, return GcError::Cancelled on shutdown
rather than Ok(()) (we only used Ok before because we didn't have a
clear cancellation error variant to use).
- In `gc_iteration_internal`, skip past timelines that are shutting
down, to avoid having to go through another GC iteration if we happen to
see a deleting timeline during a GC run.
- In `refresh_gc_info_internal`, avoid an error case where a timeline
might not be found after being looked up, by carrying an Arc<Timeline>
instead of a TimelineId between the first loop and second loop in the
function.
- In HTTP request handler, handle Cancelled variants as 503 instead of
turning all GC errors into 500s.
2024-05-31 22:20:06 +01:00
Joonas Koivunen
ef83f31e77 pagectl: key command for dumping what we know about the key (#7890)
What we know about the key via added `pagectl key $key` command:
- debug formatting
- shard placement when `--shard-count` is specified
- different boolean queries in `key.rs`
- aux files v2

Example:

```
$ cargo run -qp pagectl -- key 000000063F00004005000060270000100E2C
parsed from hex: 000000063F00004005000060270000100E2C:

Key { field1: 0, field2: 1599, field3: 16389, field4: 24615, field5: 0, field6: 1052204 }
rel_block:         true
rel_vm_block:      false
rel_fsm_block:     false
slru_block:        false
inherited:         true
rel_size:          false
slru_segment_size: false
recognized kind:   None
```
2024-05-31 18:19:41 +00:00
John Spray
9fda85b486 pageserver: remove AncestorStopping error variants (#7916)
## Problem

In all cases, AncestorStopping is equivalent to Cancelled.

This became more obvious in
https://github.com/neondatabase/neon/pull/7912#discussion_r1620582309
when updating these error types.

## Summary of changes

- Remove AncestorStopping, always use Cancelled instead
2024-05-31 17:02:10 +01:00
Alex Chi Z
87afbf6b24 test(pageserver): add test interface to create artificial layers (#7899)
This pull request adds necessary interfaces to deterministically create
scenarios we want to test. Simplify some test cases to use this
interface to make it stable + reproducible.

Compaction test will be able to use this interface. Also the upcoming
delete tombstone tests will use this interface to make test
reproducible.

## Summary of changes

* `force_create_image_layer`
* `force_create_delta_layer`
* `force_advance_lsn`
* `create_test_timeline_with_states`
* `branch_timeline_test_with_states`

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-31 12:00:40 -04:00
Arthur Petukhovsky
16b2e74037 Add FullAccessTimeline guard in safekeepers (#7887)
This is a preparation for
https://github.com/neondatabase/neon/issues/6337.

The idea is to add FullAccessTimeline, which will act as a guard for
tasks requiring access to WAL files. Eviction will be blocked on these
tasks and WAL won't be deleted from disk until there is at least one
active FullAccessTimeline.

To get FullAccessTimeline, tasks call `tli.full_access_guard().await?`.
After eviction is implemented, this function will be responsible for
downloading missing WAL file and waiting until the download finishes.

This commit also contains other small refactorings:
- Separate `get_tenant_dir` and `get_timeline_dir` functions for
building a local path. This is useful for looking at usages and finding
tasks requiring access to local filesystem.
- `timeline_manager` is now responsible for spawning all background
tasks
- WAL removal task is now spawned instantly after horizon is updated
2024-05-31 13:19:45 +00:00
John Spray
5a394fde56 pageserver: avoid spurious "bad state" logs/errors during shutdown (#7912)
## Problem

- Initial size calculations tend to fail with `Bad state (not active)`

Closes: https://github.com/neondatabase/neon/issues/7911

## Summary of changes

- In `wait_lsn`, return WaitLsnError::Cancelled rather than BadState
when the state is Stopping
- Replace PageReconstructError's `Other` variant with a specific
`BadState` variant
- Avoid returning anyhow::Error from get_ready_ancestor_timeline -- this
was only used for the case where there was no ancestor. All callers of
this function had implicitly checked that the ancestor timeline exists
before calling it, so they can pass in the ancestor instead of handling
an error.
2024-05-31 13:31:42 +01:00
Arseny Sher
7ec70b5eff safekeeper: rename epoch to last_log_term.
epoch is a historical and potentially confusing name. It semantically means
lastLogTerm from the raft paper, so let's use it.

This commit changes only internal namings, not public interface (http).
2024-05-31 12:59:13 +03:00
Arseny Sher
1fcc2b37eb Add test checking term change during pull_timeline. 2024-05-31 12:58:59 +03:00
Arseny Sher
af40bf3c2e Fix term/epoch confusion in python tests.
Call epoch last_log_term and add separate term field.
2024-05-31 12:58:59 +03:00
Arseny Sher
e6db8069b0 neon_walreader: check after local read that the segment still exists.
Otherwise read might receive zeros/garbage if the file is recycled (renamed) for
as a future segment.
2024-05-31 12:57:56 +03:00
John Spray
98dadf8543 pageserver: quieten some shutdown logs around logical size and flush (#7907)
## Problem

Looking at several noisy shutdown logs:
- In https://github.com/neondatabase/neon/issues/7861 we're hitting a
log error with `InternalServerError(timeline shutting down\n'` on the
checkpoint API handler.
- In the field, we see initial_logical_size_calculation errors on
shutdown, via DownloadError
- In the field, we see errors logged from layer download code
(independent of the error propagated) during shutdown

Closes: https://github.com/neondatabase/neon/issues/7861

## Summary of changes

The theme of these changes is to avoid propagating anyhow::Errors for
cases that aren't really unexpected error cases that we might want a
stacktrace for, and avoid "Other" error variants unless we really do
have unexpected error cases to propagate.

- On the flush_frozen_layers path, use the `FlushLayerError` type
throughout, rather than munging it into an anyhow::Error. Give
FlushLayerError an explicit from_anyhow helper that checks for timeline
cancellation, and uses it to give a Cancelled error instead of an Other
error when the timeline is shutting down.
- In logical size calculation, remove BackgroundCalculationError (this
type was just a Cancelled variant and an Other variant), and instead use
CalculateLogicalSizeError throughout. This can express a
PageReconstructError, and has a From impl that translates cancel-like
page reconstruct errors to Cancelled.
- Replace CalculateLogicalSizeError's Other(anyhow::Error) variant case
with a Decode(DeserializeError) variant, as this was the only kind of
error we actually used in the Other case.
- During layer download, drop out early if the timeline is shutting
down, so that we don't do an `error!()` log of the shutdown error in
this case.
2024-05-31 09:18:58 +01:00
Arpad Müller
c18b1c0646 Update tokio-epoll-uring for linux-raw-sys (#7918)
Updates the `tokio-epoll-uring` dependency.

There is [only one change](342ddd197a...08ccfa94ff),
the adoption of linux-raw-sys for `statx` instead of using libc.

Part of #7889.
2024-05-30 17:45:48 +02:00
Alex Chi Z
f20a9e760f chore(pageserver): warn on delete non-existing file (#7847)
Consider the following sequence of migration:

```
1. user starts compute
2. force migrate to v2
3. user continues to write data
```

At the time of (3), the compute node is not aware that the page server
does not contain replication states any more, and might continue to
ingest neon-file records into the safekeeper. This will leave the
pageserver store a partial replication state and cause some errors. For
example, the compute could issue a deletion of some aux files in v1, but
this file does not exist in v2. Therefore, we should ignore all these
errors until everyone is migrated to v2.

Also note that if we see this warning in prod, it is likely because we
did not fully suspend users' compute when flipping the v1/v2 flag.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-30 14:45:34 +00:00
Alex Chi Z
33395dcf4e perf(pageserver): postpone vectored get fringe keyspace construction (#7904)
Perf shows a significant amount of time is spent on `Keyspace::merge`.
This pull request postpones merging keyspace until retrieving the layer,
which contributes to a 30x improvement in aux keyspace basebackup time.

```
--- old
10000 files found in 0.580569459s
--- new
10000 files found in 0.02995075s
```

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-30 10:31:57 -04:00
Alex Chi Z
1eca8b8a6b fix(pageserver): ensure to_i128 works for metadata keys (#7895)
field2 of metadata keys can be 0xFFFF because of the mapping. Allow
0xFFFF for `to_i128`. An alternative is to encode 0xFFFF as 0xFFFFFFFF
(which is allowed in the original `to_i128`). But checking the places
where field2 is referenced, the rest part of the system does not seem to
depend on this assertion.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-30 10:03:17 -04:00
YukiSeino
167394a073 refacter : VirtualFile::open uses AsRef (#7908)
## Problem
#7371 

## Summary of changes
* The VirtualFile::open, open_with_options, and create methods use
AsRef, similar to the standard library's std::fs APIs.
2024-05-30 15:58:20 +02:00
Conrad Ludgate
9a081c230f proxy: lazily parse startup pg params (#7905)
## Problem

proxy params being a `HashMap<String,String>` when it contains just
```
application_name: psql
database: neondb
user: neondb_owner
```
is quite wasteful allocation wise.

## Summary of changes

Keep the params in the wire protocol form, eg:
```
application_name\0psql\0database\0neondb\0user\0neondb_owner\0
```

Using a linear search for the map is fast enough at small sizes, which
is the normal case.
2024-05-30 11:02:38 +00:00
Conrad Ludgate
fddd11dd1a proxy: upload postgres connection options as json in the parquet upload (#7903)
## Problem

https://github.com/neondatabase/cloud/issues/9943

## Summary of changes

Captures the postgres options, converts them to json, uploads them in
parquet.
2024-05-30 11:10:27 +01:00
Conrad Ludgate
238fa47bee proxy fix wake compute rate limit (#7902)
## Problem

We were rate limiting wake_compute in the wrong place

## Summary of changes

Move wake_compute rate limit to after the permit is acquired. Also makes
a slight refactor on normalize, as it caught my eye
2024-05-30 11:09:27 +01:00
a-masterov
b0a954bde2 CI: switch ubuntu-latest with ubuntu-22.04 (#7256) (#7901)
## Problem
We use ubuntu-latest as a default OS for running jobs. It can cause
problems due to instability, so we should use the LTS version of Ubuntu.

## Summary of changes
The image ubuntu-latest was changed with ubuntu-22.04 in workflows.

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-05-30 08:25:10 +02:00
Konstantin Knizhnik
7ac11d3942 Do not produce error if gin page is not restored in redo (#7876)
## Problem

See https://github.com/neondatabase/cloud/issues/10845

## Summary of changes

Do not report error if GIN page is not restored

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-05-29 22:18:09 +03:00
Conrad Ludgate
c8cebecabf proxy: reintroduce dynamic limiter for compute lock (#7737)
## Problem

Computes that are healthy can manage many connection attempts at a time.
Unhealthy computes cannot. We initially handled this with a fixed
concurrency limit, but it seems this inhibits pgbench.

## Summary of changes

Support AIMD for connect_to_compute lock to allow varying the
concurrency limit based on compute health
2024-05-29 11:17:05 +01:00
Arpad Müller
14df69d0e3 Drop postgres-native-tls in favour of tokio-postgres-rustls (#7883)
Get rid of postgres-native-tls and openssl in favour of rustls in our
dependency tree.

Do further steps to completely remove native-tls and openssl.

Among other advantages, this allows us to do static musl builds more
easily: #7889
2024-05-28 15:40:52 +00:00
John Spray
352b08d0be pageserver: fix a warning on secondary mode downloads after evictions (#7877)
## Problem

In 4ce6e2d2fc we added a warning when progress stats don't look right at
the end of a secondary download pass.

This `Correcting drift in progress stats` warning fired in staging on a
pageserver that had been doing some disk usage eviction.

The impact is low because in the same place we log the warning, we also
fix up the progress values.

## Summary of changes

- When we skip downloading a layer because it was recently evicted,
update the progress stats to ensure they still reach a clean complete
state at the end of a download pass.
- Also add a log for evicting secondary location layers, for symmetry
with attached locations, so that we can clearly see when eviction has
happened for a particular tenant's layers when investigating issues.

This is a point fix -- the code would also benefit from being refactored
so that there is some "download result" type with a Skip variant, to
ensure that we are updating the progress stats uniformly for those
cases.
2024-05-28 16:06:47 +01:00
Peter Bendel
f9f69a2ee7 clarify how to load the dbpedia vector embeddings into a postgres database (#7894)
## Problem


Improve the readme for the data load step in the pgvector performance
test.
2024-05-28 17:21:09 +03:00
Peter Bendel
fabeff822f Performance test for pgvector HNSW index build and queries (#7873)
## Problem

We want to regularly verify the performance of pgvector HNSW parallel
index builds and parallel similarity search using HNSW indexes.
The first release that considerably improved the index-build parallelism
was pgvector 0.7.0 and we want to make sure that we do not regress by
our neon compute VM settings (swap, memory over commit, pg conf etc.)

## Summary of changes

Prepare a Neon project with 1 million openAI vector embeddings (vector
size 1536).
Run HNSW indexing operations in the regression test for the various
distance metrics.
Run similarity queries using pgbench with 100 concurrent clients.

I have also added the relevant metrics to the grafana dashboards pgbench
and olape

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-05-28 11:05:33 +00:00
Arseny Sher
4a0ce9512b Add safekeeper test truncating WAL.
We do it as a part of more complicated tests like test_compute_restarts, but
let's have a simple test as well.
2024-05-28 11:08:29 +03:00
Konstantin Knizhnik
d61e924103 Fix connect to PS on MacOS/X (#7885)
## Problem

After [0e4f182680] which introduce async
connect
Neon is not able to connect to page server.

## Summary of changes

Perform sync commit at MacOS/X

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-05-27 15:57:57 +03:00
Arseny Sher
b2d34a82b9 Make python Safekeeper datadir Path instead of str. 2024-05-25 06:06:32 +03:00
Arseny Sher
3797566c36 safekeeper: test pull_timeline with WAL gc.
Do pull_timeline while WAL is being removed. To this end
- extract pausable_failpoint to utils, sprinkle pull_timeline with it
- add 'checkpoint' sk http endpoint to force WAL removal.

After fixing checking for pull file status code test fails so far which is
expected.
2024-05-25 06:06:32 +03:00
Conrad Ludgate
43f9a16e46 proxy: fix websocket buffering (#7878)
## Problem

Seems the websocket buffering was broken for large query responses only

## Summary of changes

Move buffering until after the underlying stream is ready.
Tested locally confirms this fixes the bug.

Also fixes the pg-sni-router missing metrics bug
2024-05-24 17:56:12 +01:00
Alexander Bayandin
71a7fd983e CI(release): tune Storage & Compute release PR title (#7870)
## Problem

A title for automatic proxy release PRs is `Proxy release`, and for
storage & compute, it's just `Release`

## Summary of changes
- Amend PR title for Storage & Compute releases to "Storage & Compute
release"
2024-05-24 14:11:51 +01:00
Joonas Koivunen
a3f5b83677 chore: lower gate guard drop logging threshold to 100ms (#7862)
We have some 1001ms cases, which do not yield gate guard context.
2024-05-24 14:07:58 +01:00
John Spray
1455f5a261 pageserver: revert concurrent secondary downloads, make DownloadStream always yield Err after cancel (#7866)
## Problem

Ongoing hunt for secondary location shutdown hang issues.

## Summary of changes

- Revert the functional changes from #7675 
- Tweak a log in secondary downloads to make it more apparent when we
drop out on cancellation
- Modify DownloadStream's behavior to always return an Err after it has
been cancelled. This _should_ not impact anything, but it makes the
behavior simpler to reason about (e.g. even if the poll function somehow
got called again, it could never end up in an un-cancellable state)

Related #https://github.com/neondatabase/cloud/issues/13576
2024-05-24 11:45:34 +03:00
John Spray
3860bc9c6c pageserver: post-shard-split layer rewrites (2/2) (#7531)
## Problem

- After a shard split of a large existing tenant, child tenants can end
up with oversized historic layers indefinitely, if those layers are
prevented from being GC'd by branchpoints.

This PR follows https://github.com/neondatabase/neon/pull/7531, and adds
rewriting of layers that contain a mixture of needed & un-needed
contents, in addition to dropping un-needed layers.

Closes: https://github.com/neondatabase/neon/issues/7504

## Summary of changes

- Add methods to ImageLayer for reading back existing layers
- Extend `compact_shard_ancestors` to rewrite layer files that contain a
mixture of keys that we want and keys we do not, if unwanted keys are
the majority of those in the file.
- Amend initialization code to handle multiple layers with the same
LayerName properly
- Get rid of of renaming bad layer files to `.old` since that's now
expected on restarts during rewrites.
2024-05-24 08:33:19 +00:00
Roman Zaynetdinov
c1f4028fc0 Export db size metrics for 10 user databases (#7857)
## Problem

One database is too limiting. We have agreed to raise this limit to 10.

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-05-24 09:05:20 +01:00
MMeent
0e4f182680 Rework PageStream connection state handling: (#7611)
* Make PS connection startup use async APIs
   This allows for improved query cancellation when we start connections
 * Make PS connections have per-shard connection retry state.
   Previously they shared global backoff state, which is bad for quickly
   getting all connections started and/or back online.
 * Make sure we clean up most connection state on failed connections.
   Previously, we could technically leak some resources that we'd otherwise
   clean up. Now, the resources are correctly cleaned up.
 * pagestore_smgr.c now PANICs on unexpected response message types.
   Unexpected responses are likely a symptom of having a desynchronized
   view of the connection state. As a desynchronized connection state can
   cause corruption, we PANIC, as we don't know what data may have been
   written to buffers: the only solution is to fail fast & hope we didn't
   write wrong data.
 * Catch errors in sync pagestream request handling.
   Previously, if a query was cancelled after a message was sent to
   the pageserver, but before the data was received, the backend
   could forget that it sent the synchronous request, and let others
   deal with the repercussions. This could then lead to incorrect
   responses, or errors such as "unexpected response from page
   server with tag 0x68"
2024-05-23 23:26:42 +02:00
Sasha Krassovsky
ea2e830707 Remove apostrophe (#7868)
## Problem

## Summary of changes

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-05-23 20:35:59 +00:00
Joonas Koivunen
7cf726e36e refactor(rtc): remove the duplicate IndexLayerMetadata (#7860)
Once upon a time, we used to have duplicated types for runtime IndexPart
and whatever we stored. Because of the serde fixes in #5335 we have no
need for duplicated IndexPart type anymore, but the `IndexLayerMetadata`
stayed.

- remove the type
- remove LayerFileMetadata::file_size() in favor of direct field access

Split off from #7833. Cc: #3072.
2024-05-23 23:24:31 +03:00
145 changed files with 5856 additions and 2546 deletions

View File

@@ -24,7 +24,7 @@ jobs:
actionlint:
needs: [ check-permissions ]
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v4
- uses: reviewdog/action-actionlint@v1
@@ -36,3 +36,15 @@ jobs:
fail_on_error: true
filter_mode: nofilter
level: error
- run: |
PAT='^\s*runs-on:.*-latest'
if grep -ERq $PAT .github/workflows
then
grep -ERl $PAT .github/workflows |\
while read -r f
do
l=$(grep -nE $PAT .github/workflows/release.yml | awk -F: '{print $1}' | head -1)
echo "::error file=$f,line=$l::Please, do not use ubuntu-latest images to run on, use LTS instead."
done
exit 1
fi

View File

@@ -44,7 +44,7 @@ jobs:
contains(fromJSON('["opened", "synchronize", "reopened", "closed"]'), github.event.action) &&
contains(github.event.pull_request.labels.*.name, 'approved-for-ci-run')
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- run: gh pr --repo "${GITHUB_REPOSITORY}" edit "${PR_NUMBER}" --remove-label "approved-for-ci-run"
@@ -60,7 +60,7 @@ jobs:
github.event.action == 'labeled' &&
contains(github.event.pull_request.labels.*.name, 'approved-for-ci-run')
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- run: gh pr --repo "${GITHUB_REPOSITORY}" edit "${PR_NUMBER}" --remove-label "approved-for-ci-run"
@@ -109,7 +109,7 @@ jobs:
github.event.action == 'closed' &&
github.event.pull_request.head.repo.full_name != github.repository
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- name: Close PR and delete `ci-run/pr-${{ env.PR_NUMBER }}` branch

View File

@@ -38,6 +38,11 @@ on:
description: 'AWS-RDS and AWS-AURORA normally only run on Saturday. Set this to true to run them on every workflow_dispatch'
required: false
default: false
run_only_pgvector_tests:
type: boolean
description: 'Run pgvector tests but no other tests. If not set, all tests including pgvector tests will be run'
required: false
default: false
defaults:
run:
@@ -50,6 +55,7 @@ concurrency:
jobs:
bench:
if: ${{ github.event.inputs.run_only_pgvector_tests == 'false' || github.event.inputs.run_only_pgvector_tests == null }}
env:
TEST_PG_BENCH_DURATIONS_MATRIX: "300"
TEST_PG_BENCH_SCALES_MATRIX: "10,100"
@@ -120,6 +126,7 @@ jobs:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
generate-matrices:
if: ${{ github.event.inputs.run_only_pgvector_tests == 'false' || github.event.inputs.run_only_pgvector_tests == null }}
# Create matrices for the benchmarking jobs, so we run benchmarks on rds only once a week (on Saturday)
#
# Available platforms:
@@ -130,7 +137,7 @@ jobs:
# - rds-postgres: RDS Postgres db.m5.large instance (2 vCPU, 8 GiB) with gp3 EBS storage
env:
RUN_AWS_RDS_AND_AURORA: ${{ github.event.inputs.run_AWS_RDS_AND_AURORA || 'false' }}
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
outputs:
pgbench-compare-matrix: ${{ steps.pgbench-compare-matrix.outputs.matrix }}
olap-compare-matrix: ${{ steps.olap-compare-matrix.outputs.matrix }}
@@ -197,6 +204,7 @@ jobs:
echo "matrix=$(echo "$matrix" | jq --compact-output '.')" >> $GITHUB_OUTPUT
pgbench-compare:
if: ${{ github.event.inputs.run_only_pgvector_tests == 'false' || github.event.inputs.run_only_pgvector_tests == null }}
needs: [ generate-matrices ]
strategy:
@@ -343,6 +351,92 @@ jobs:
env:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
pgbench-pgvector:
env:
TEST_PG_BENCH_DURATIONS_MATRIX: "15m"
TEST_PG_BENCH_SCALES_MATRIX: "1"
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
DEFAULT_PG_VERSION: 16
TEST_OUTPUT: /tmp/test_output
BUILD_TYPE: remote
SAVE_PERF_REPORT: ${{ github.event.inputs.save_perf_report || ( github.ref_name == 'main' ) }}
PLATFORM: "neon-captest-pgvector"
runs-on: [ self-hosted, us-east-2, x64 ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:pinned
options: --init
steps:
- uses: actions/checkout@v4
- name: Download Neon artifact
uses: ./.github/actions/download
with:
name: neon-${{ runner.os }}-release-artifact
path: /tmp/neon/
prefix: latest
- name: Add Postgres binaries to PATH
run: |
${POSTGRES_DISTRIB_DIR}/v${DEFAULT_PG_VERSION}/bin/pgbench --version
echo "${POSTGRES_DISTRIB_DIR}/v${DEFAULT_PG_VERSION}/bin" >> $GITHUB_PATH
- name: Set up Connection String
id: set-up-connstr
run: |
CONNSTR=${{ secrets.BENCHMARK_PGVECTOR_CONNSTR }}
echo "connstr=${CONNSTR}" >> $GITHUB_OUTPUT
QUERIES=("SELECT version()")
QUERIES+=("SHOW neon.tenant_id")
QUERIES+=("SHOW neon.timeline_id")
for q in "${QUERIES[@]}"; do
psql ${CONNSTR} -c "${q}"
done
- name: Benchmark pgvector hnsw indexing
uses: ./.github/actions/run-python-test-set
with:
build_type: ${{ env.BUILD_TYPE }}
test_selection: performance/test_perf_olap.py
run_in_parallel: false
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_pgvector_indexing
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
- name: Benchmark pgvector hnsw queries
uses: ./.github/actions/run-python-test-set
with:
build_type: ${{ env.BUILD_TYPE }}
test_selection: performance
run_in_parallel: false
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_pgbench_remote_pgvector
env:
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
- name: Create Allure report
if: ${{ !cancelled() }}
uses: ./.github/actions/allure-report-generate
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}
uses: slackapi/slack-github-action@v1
with:
channel-id: "C033QLM5P7D" # dev-staging-stream
slack-message: "Periodic perf testing neon-captest-pgvector: ${{ job.status }}\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
env:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
clickbench-compare:
# ClichBench DB for rds-aurora and rds-Postgres deployed to the same clusters
# we use for performance testing in pgbench-compare.
@@ -351,7 +445,7 @@ jobs:
#
# *_CLICKBENCH_CONNSTR: Genuine ClickBench DB with ~100M rows
# *_CLICKBENCH_10M_CONNSTR: DB with the first 10M rows of ClickBench DB
if: ${{ !cancelled() }}
if: ${{ !cancelled() && (github.event.inputs.run_only_pgvector_tests == 'false' || github.event.inputs.run_only_pgvector_tests == null) }}
needs: [ generate-matrices, pgbench-compare ]
strategy:
@@ -455,7 +549,7 @@ jobs:
# We might change it after https://github.com/neondatabase/neon/issues/2900.
#
# *_TPCH_S10_CONNSTR: DB generated with scale factor 10 (~10 GB)
if: ${{ !cancelled() }}
if: ${{ !cancelled() && (github.event.inputs.run_only_pgvector_tests == 'false' || github.event.inputs.run_only_pgvector_tests == null) }}
needs: [ generate-matrices, clickbench-compare ]
strategy:
@@ -557,7 +651,7 @@ jobs:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
user-examples-compare:
if: ${{ !cancelled() }}
if: ${{ !cancelled() && (github.event.inputs.run_only_pgvector_tests == 'false' || github.event.inputs.run_only_pgvector_tests == null) }}
needs: [ generate-matrices, tpch-compare ]
strategy:

View File

@@ -88,7 +88,7 @@ jobs:
merge-images:
needs: [ build-image ]
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
env:
IMAGE_TAG: ${{ inputs.image-tag }}

View File

@@ -35,7 +35,7 @@ jobs:
cancel-previous-e2e-tests:
needs: [ check-permissions ]
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- name: Cancel previous e2e-tests runs for this PR
@@ -549,7 +549,7 @@ jobs:
report-benchmarks-failures:
needs: [ benchmarks, create-test-report ]
if: github.ref_name == 'main' && failure() && needs.benchmarks.result == 'failure'
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- uses: slackapi/slack-github-action@v1
@@ -774,7 +774,7 @@ jobs:
neon-image:
needs: [ neon-image-arch, tag ]
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- uses: docker/login-action@v3
@@ -884,7 +884,7 @@ jobs:
compute-node-image:
needs: [ compute-node-image-arch, tag ]
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
strategy:
matrix:
@@ -1032,7 +1032,7 @@ jobs:
promote-images:
needs: [ check-permissions, tag, test-images, vm-compute-node-image ]
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
env:
VERSIONS: v14 v15 v16
@@ -1077,7 +1077,7 @@ jobs:
trigger-custom-extensions-build-and-wait:
needs: [ check-permissions, tag ]
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- name: Set PR's status to pending and request a remote CI test
run: |

View File

@@ -19,7 +19,7 @@ permissions: {}
jobs:
check-image:
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
outputs:
tag: ${{ steps.get-build-tools-tag.outputs.image-tag }}
found: ${{ steps.check-image.outputs.found }}

View File

@@ -16,7 +16,7 @@ permissions: {}
jobs:
check-permissions:
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- name: Disallow CI runs on PRs from forks
if: |

View File

@@ -9,7 +9,7 @@ on:
jobs:
cleanup:
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- name: Cleanup
run: |

View File

@@ -20,7 +20,7 @@ concurrency:
jobs:
test-postgres-client-libs:
# TODO: switch to gen2 runner, requires docker
runs-on: [ ubuntu-latest ]
runs-on: ubuntu-22.04
env:
DEFAULT_PG_VERSION: 14

View File

@@ -26,7 +26,7 @@ permissions: {}
jobs:
tag-image:
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
env:
FROM_TAG: ${{ inputs.from-tag }}

View File

@@ -19,7 +19,7 @@ on:
jobs:
notify:
runs-on: [ ubuntu-latest ]
runs-on: ubuntu-22.04
steps:
- uses: neondatabase/dev-actions/release-pr-notify@main

View File

@@ -26,7 +26,7 @@ defaults:
jobs:
create-storage-release-branch:
if: ${{ github.event.schedule == '0 6 * * MON' || format('{0}', inputs.create-storage-release-branch) == 'true' }}
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
permissions:
contents: write # for `git push`
@@ -53,7 +53,7 @@ jobs:
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: |
cat << EOF > body.md
## Release ${RELEASE_DATE}
## Storage & Compute release ${RELEASE_DATE}
**Please merge this Pull Request using 'Create a merge commit' button**
EOF
@@ -65,7 +65,7 @@ jobs:
create-proxy-release-branch:
if: ${{ github.event.schedule == '0 6 * * THU' || format('{0}', inputs.create-proxy-release-branch) == 'true' }}
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
permissions:
contents: write # for `git push`

View File

@@ -19,7 +19,7 @@ env:
jobs:
cancel-previous-e2e-tests:
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- name: Cancel previous e2e-tests runs for this PR
@@ -31,7 +31,7 @@ jobs:
--field concurrency_group="${{ env.E2E_CONCURRENCY_GROUP }}"
tag:
runs-on: [ ubuntu-latest ]
runs-on: ubuntu-22.04
outputs:
build-tag: ${{ steps.build-tag.outputs.tag }}
@@ -62,7 +62,7 @@ jobs:
trigger-e2e-tests:
needs: [ tag ]
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
env:
TAG: ${{ needs.tag.outputs.build-tag }}
steps:

168
Cargo.lock generated
View File

@@ -776,7 +776,6 @@ dependencies = [
"pin-project",
"serde",
"time",
"tz-rs",
"url",
"uuid",
]
@@ -1291,12 +1290,6 @@ dependencies = [
"tiny-keccak",
]
[[package]]
name = "const_fn"
version = "0.4.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fbdcdcb6d86f71c5e97409ad45898af11cbc995b4ee8112d59095a28d376c935"
[[package]]
name = "const_format"
version = "0.2.30"
@@ -1976,21 +1969,6 @@ version = "1.0.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1"
[[package]]
name = "foreign-types"
version = "0.3.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f6f339eb8adc052cd2ca78910fda869aefa38d22d5cb648e6485e4d3fc06f3b1"
dependencies = [
"foreign-types-shared",
]
[[package]]
name = "foreign-types-shared"
version = "0.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "00b0228411908ca8685dba7fc2cdd70ec9990a6e753e89b6ac91a84c40fbaf4b"
[[package]]
name = "form_urlencoded"
version = "1.1.0"
@@ -2620,19 +2598,6 @@ dependencies = [
"tokio-io-timeout",
]
[[package]]
name = "hyper-tls"
version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d6183ddfa99b85da61a140bea0efc93fdf56ceaa041b37d553518030827f9905"
dependencies = [
"bytes",
"hyper 0.14.26",
"native-tls",
"tokio",
"tokio-native-tls",
]
[[package]]
name = "hyper-util"
version = "0.1.3"
@@ -2950,6 +2915,12 @@ version = "0.4.13"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "01cda141df6706de531b6c46c3a33ecca755538219bd484262fa09410c13539c"
[[package]]
name = "linux-raw-sys"
version = "0.6.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f0b5399f6804fbab912acbd8878ed3532d506b7c951b8f9f164ef90fef39e3f4"
[[package]]
name = "lock_api"
version = "0.4.10"
@@ -3168,24 +3139,6 @@ version = "0.8.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e5ce46fe64a9d73be07dcbe690a38ce1b293be448fd8ce1e6c1b8062c9f72c6a"
[[package]]
name = "native-tls"
version = "0.2.11"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "07226173c32f2926027b63cce4bcd8076c3552846cbe7925f3aaffeac0a3b92e"
dependencies = [
"lazy_static",
"libc",
"log",
"openssl",
"openssl-probe",
"openssl-sys",
"schannel",
"security-framework",
"security-framework-sys",
"tempfile",
]
[[package]]
name = "nix"
version = "0.25.1"
@@ -3356,15 +3309,6 @@ dependencies = [
"libc",
]
[[package]]
name = "num_threads"
version = "0.1.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2819ce041d2ee131036f4fc9d6ae7ae125a3a40e97ba64d04fe799ad9dabbb44"
dependencies = [
"libc",
]
[[package]]
name = "oauth2"
version = "4.4.2"
@@ -3414,50 +3358,12 @@ version = "11.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0ab1bc2a289d34bd04a330323ac98a1b4bc82c9d9fcb1e66b63caa84da26b575"
[[package]]
name = "openssl"
version = "0.10.60"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "79a4c6c3a2b158f7f8f2a2fc5a969fa3a068df6fc9dbb4a43845436e3af7c800"
dependencies = [
"bitflags 2.4.1",
"cfg-if",
"foreign-types",
"libc",
"once_cell",
"openssl-macros",
"openssl-sys",
]
[[package]]
name = "openssl-macros"
version = "0.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a948666b637a0f465e8564c73e89d4dde00d72d4d473cc972f390fc3dcee7d9c"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.52",
]
[[package]]
name = "openssl-probe"
version = "0.1.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ff011a302c396a5197692431fc1948019154afc178baf7d8e37367442a4601cf"
[[package]]
name = "openssl-sys"
version = "0.9.96"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3812c071ba60da8b5677cc12bcb1d42989a65553772897a7e0355545a819838f"
dependencies = [
"cc",
"libc",
"pkg-config",
"vcpkg",
]
[[package]]
name = "opentelemetry"
version = "0.20.0"
@@ -3664,6 +3570,7 @@ dependencies = [
"serde",
"serde_json",
"svg_fmt",
"thiserror",
"tokio",
"tokio-util",
"toml_edit",
@@ -4105,17 +4012,6 @@ dependencies = [
"tokio-postgres",
]
[[package]]
name = "postgres-native-tls"
version = "0.5.0"
source = "git+https://github.com/neondatabase/rust-postgres.git?branch=neon#20031d7a9ee1addeae6e0968e3899ae6bf01cee2"
dependencies = [
"native-tls",
"tokio",
"tokio-native-tls",
"tokio-postgres",
]
[[package]]
name = "postgres-protocol"
version = "0.6.4"
@@ -4224,6 +4120,7 @@ version = "0.1.0"
dependencies = [
"byteorder",
"bytes",
"itertools",
"pin-project-lite",
"postgres-protocol",
"rand 0.8.5",
@@ -4413,6 +4310,7 @@ dependencies = [
"http 1.1.0",
"http-body-util",
"humantime",
"humantime-serde",
"hyper 0.14.26",
"hyper 1.2.0",
"hyper-util",
@@ -4423,7 +4321,6 @@ dependencies = [
"md5",
"measured",
"metrics",
"native-tls",
"once_cell",
"opentelemetry",
"parking_lot 0.12.1",
@@ -4431,7 +4328,6 @@ dependencies = [
"parquet_derive",
"pbkdf2",
"pin-project-lite",
"postgres-native-tls",
"postgres-protocol",
"postgres_backend",
"pq_proto",
@@ -4450,6 +4346,7 @@ dependencies = [
"rstest",
"rustc-hash",
"rustls 0.22.4",
"rustls-native-certs 0.7.0",
"rustls-pemfile 2.1.1",
"scopeguard",
"serde",
@@ -4479,7 +4376,6 @@ dependencies = [
"utils",
"uuid",
"walkdir",
"webpki-roots 0.25.2",
"workspace_hack",
"x509-parser",
]
@@ -4786,20 +4682,21 @@ dependencies = [
"http 0.2.9",
"http-body 0.4.5",
"hyper 0.14.26",
"hyper-tls",
"hyper-rustls 0.24.0",
"ipnet",
"js-sys",
"log",
"mime",
"native-tls",
"once_cell",
"percent-encoding",
"pin-project-lite",
"rustls 0.21.11",
"rustls-pemfile 1.0.2",
"serde",
"serde_json",
"serde_urlencoded",
"tokio",
"tokio-native-tls",
"tokio-rustls 0.24.0",
"tokio-util",
"tower-service",
"url",
@@ -4807,6 +4704,7 @@ dependencies = [
"wasm-bindgen-futures",
"wasm-streams 0.3.0",
"web-sys",
"webpki-roots 0.25.2",
"winreg 0.50.0",
]
@@ -5232,20 +5130,22 @@ dependencies = [
"hex",
"histogram",
"itertools",
"native-tls",
"once_cell",
"pageserver",
"pageserver_api",
"postgres-native-tls",
"postgres_ffi",
"rand 0.8.5",
"remote_storage",
"reqwest 0.12.4",
"rustls 0.22.4",
"rustls-native-certs 0.7.0",
"serde",
"serde_json",
"serde_with",
"thiserror",
"tokio",
"tokio-postgres",
"tokio-postgres-rustls",
"tokio-rustls 0.25.0",
"tokio-stream",
"tokio-util",
@@ -6189,8 +6089,6 @@ checksum = "8f3403384eaacbca9923fa06940178ac13e4edb725486d70e8e15881d0c836cc"
dependencies = [
"itoa",
"js-sys",
"libc",
"num_threads",
"serde",
"time-core",
"time-macros",
@@ -6266,7 +6164,7 @@ dependencies = [
[[package]]
name = "tokio-epoll-uring"
version = "0.1.0"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#342ddd197a060a8354e8f11f4d12994419fff939"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#08ccfa94ff5507727bf4d8d006666b5b192e04c6"
dependencies = [
"futures",
"nix 0.26.4",
@@ -6300,16 +6198,6 @@ dependencies = [
"syn 2.0.52",
]
[[package]]
name = "tokio-native-tls"
version = "0.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bbae76ab933c85776efabc971569dd6119c580d8f5d448769dec1764bf796ef2"
dependencies = [
"native-tls",
"tokio",
]
[[package]]
name = "tokio-postgres"
version = "0.7.7"
@@ -6716,15 +6604,6 @@ version = "1.16.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "497961ef93d974e23eb6f433eb5fe1b7930b659f06d12dec6fc44a8f554c0bba"
[[package]]
name = "tz-rs"
version = "0.6.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "33851b15c848fad2cf4b105c6bb66eb9512b6f6c44a4b13f57c53c73c707e2b4"
dependencies = [
"const_fn",
]
[[package]]
name = "uname"
version = "0.1.1"
@@ -6797,11 +6676,12 @@ dependencies = [
[[package]]
name = "uring-common"
version = "0.1.0"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#342ddd197a060a8354e8f11f4d12994419fff939"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#08ccfa94ff5507727bf4d8d006666b5b192e04c6"
dependencies = [
"bytes",
"io-uring",
"libc",
"linux-raw-sys 0.6.4",
]
[[package]]
@@ -7629,9 +7509,9 @@ dependencies = [
[[package]]
name = "zeroize"
version = "1.6.0"
version = "1.7.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2a0956f1ba7c7909bfb66c2e9e4124ab6f6482560f6628b5aaeba39207c9aad9"
checksum = "525b4ec142c6b68a2d10f01f7bbf6755599ca3f81ea53b8431b7dd348f5fdb2d"
dependencies = [
"zeroize_derive",
]

View File

@@ -46,10 +46,10 @@ anyhow = { version = "1.0", features = ["backtrace"] }
arc-swap = "1.6"
async-compression = { version = "0.4.0", features = ["tokio", "gzip", "zstd"] }
atomic-take = "1.1.0"
azure_core = "0.19"
azure_identity = "0.19"
azure_storage = "0.19"
azure_storage_blobs = "0.19"
azure_core = { version = "0.19", default-features = false, features = ["enable_reqwest_rustls", "hmac_rust"] }
azure_identity = { version = "0.19", default-features = false, features = ["enable_reqwest_rustls"] }
azure_storage = { version = "0.19", default-features = false, features = ["enable_reqwest_rustls"] }
azure_storage_blobs = { version = "0.19", default-features = false, features = ["enable_reqwest_rustls"] }
flate2 = "1.0.26"
async-stream = "0.3"
async-trait = "0.1"
@@ -114,7 +114,6 @@ md5 = "0.7.0"
measured = { version = "0.0.21", features=["lasso"] }
measured-process = { version = "0.0.21" }
memoffset = "0.8"
native-tls = "0.2"
nix = { version = "0.27", features = ["fs", "process", "socket", "signal", "poll"] }
notify = "6.0.0"
num_cpus = "1.15"
@@ -191,7 +190,7 @@ url = "2.2"
urlencoding = "2.1"
uuid = { version = "1.6.1", features = ["v4", "v7", "serde"] }
walkdir = "2.3.2"
webpki-roots = "0.25"
rustls-native-certs = "0.7"
x509-parser = "0.15"
## TODO replace this with tracing
@@ -200,7 +199,6 @@ log = "0.4"
## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
postgres-native-tls = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
@@ -241,8 +239,7 @@ tonic-build = "0.9"
[patch.crates-io]
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
# Needed to get `tokio-postgres-rustls` to depend on our fork.
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
# bug fixes for UUID

View File

@@ -99,6 +99,13 @@ name = "async-executor"
[[bans.deny]]
name = "smol"
[[bans.deny]]
# We want to use rustls instead of the platform's native tls implementation.
name = "native-tls"
[[bans.deny]]
name = "openssl"
# This section is considered when running `cargo deny check sources`.
# More documentation about the 'sources' section can be found here:
# https://embarkstudios.github.io/cargo-deny/checks/sources/cfg.html

View File

@@ -1,6 +1,5 @@
use anyhow::{bail, Result};
use byteorder::{ByteOrder, BE};
use bytes::BufMut;
use postgres_ffi::relfile_utils::{FSM_FORKNUM, VISIBILITYMAP_FORKNUM};
use postgres_ffi::{Oid, TransactionId};
use serde::{Deserialize, Serialize};
@@ -53,14 +52,8 @@ impl Key {
/// Encode a metadata key to a storage key.
pub fn from_metadata_key_fixed_size(key: &[u8; METADATA_KEY_SIZE]) -> Self {
assert!(is_metadata_key_slice(key), "key not in metadata key range");
Key {
field1: key[0],
field2: u16::from_be_bytes(key[1..3].try_into().unwrap()) as u32,
field3: u32::from_be_bytes(key[3..7].try_into().unwrap()),
field4: u32::from_be_bytes(key[7..11].try_into().unwrap()),
field5: key[11],
field6: u32::from_be_bytes(key[12..16].try_into().unwrap()),
}
// Metadata key space ends at 0x7F so it's fine to directly convert it to i128.
Self::from_i128(i128::from_be_bytes(*key))
}
/// Encode a metadata key to a storage key.
@@ -68,17 +61,6 @@ impl Key {
Self::from_metadata_key_fixed_size(key.try_into().expect("expect 16 byte metadata key"))
}
/// Extract a metadata key to a writer. The result should always be 16 bytes.
pub fn extract_metadata_key_to_writer(&self, mut writer: impl BufMut) {
writer.put_u8(self.field1);
assert!(self.field2 <= 0xFFFF);
writer.put_u16(self.field2 as u16);
writer.put_u32(self.field3);
writer.put_u32(self.field4);
writer.put_u8(self.field5);
writer.put_u32(self.field6);
}
/// Get the range of metadata keys.
pub const fn metadata_key_range() -> Range<Self> {
Key {
@@ -121,7 +103,7 @@ impl Key {
/// As long as Neon does not support tablespace (because of lack of access to local file system),
/// we can assume that only some predefined namespace OIDs are used which can fit in u16
pub fn to_i128(&self) -> i128 {
assert!(self.field2 < 0xFFFF || self.field2 == 0xFFFFFFFF || self.field2 == 0x22222222);
assert!(self.field2 <= 0xFFFF || self.field2 == 0xFFFFFFFF || self.field2 == 0x22222222);
(((self.field1 & 0x7F) as i128) << 120)
| (((self.field2 & 0xFFFF) as i128) << 104)
| ((self.field3 as i128) << 72)
@@ -175,7 +157,7 @@ impl Key {
}
/// Convert a 18B slice to a key. This function should not be used for metadata keys because field2 is handled differently.
/// Use [`Key::from_metadata_key`] instead.
/// Use [`Key::from_i128`] instead if you want to handle 16B keys (i.e., metadata keys).
pub fn from_slice(b: &[u8]) -> Self {
Key {
field1: b[0],
@@ -188,7 +170,7 @@ impl Key {
}
/// Convert a key to a 18B slice. This function should not be used for metadata keys because field2 is handled differently.
/// Use [`Key::extract_metadata_key_to_writer`] instead.
/// Use [`Key::to_i128`] instead if you want to get a 16B key (i.e., metadata keys).
pub fn write_to_byte_slice(&self, buf: &mut [u8]) {
buf[0] = self.field1;
BE::write_u32(&mut buf[1..5], self.field2);
@@ -399,7 +381,14 @@ pub fn rel_size_to_key(rel: RelTag) -> Key {
field3: rel.dbnode,
field4: rel.relnode,
field5: rel.forknum,
field6: 0xffffffff,
field6: 0xffff_ffff,
}
}
impl Key {
#[inline(always)]
pub fn is_rel_size_key(&self) -> bool {
self.field1 == 0 && self.field6 == u32::MAX
}
}
@@ -440,6 +429,25 @@ pub fn slru_dir_to_key(kind: SlruKind) -> Key {
}
}
#[inline(always)]
pub fn slru_dir_kind(key: &Key) -> Option<Result<SlruKind, u32>> {
if key.field1 == 0x01
&& key.field3 == 0
&& key.field4 == 0
&& key.field5 == 0
&& key.field6 == 0
{
match key.field2 {
0 => Some(Ok(SlruKind::Clog)),
1 => Some(Ok(SlruKind::MultiXactMembers)),
2 => Some(Ok(SlruKind::MultiXactOffsets)),
x => Some(Err(x)),
}
} else {
None
}
}
#[inline(always)]
pub fn slru_block_to_key(kind: SlruKind, segno: u32, blknum: BlockNumber) -> Key {
Key {
@@ -468,7 +476,17 @@ pub fn slru_segment_size_to_key(kind: SlruKind, segno: u32) -> Key {
field3: 1,
field4: segno,
field5: 0,
field6: 0xffffffff,
field6: 0xffff_ffff,
}
}
impl Key {
pub fn is_slru_segment_size_key(&self) -> bool {
self.field1 == 0x01
&& self.field2 < 0x03
&& self.field3 == 0x01
&& self.field5 == 0
&& self.field6 == u32::MAX
}
}
@@ -577,73 +595,78 @@ pub const NON_INHERITED_RANGE: Range<Key> = AUX_FILES_KEY..AUX_FILES_KEY.next();
/// Sparse keyspace range for vectored get. Missing key error will be ignored for this range.
pub const NON_INHERITED_SPARSE_RANGE: Range<Key> = Key::metadata_key_range();
// AUX_FILES currently stores only data for logical replication (slots etc), and
// we don't preserve these on a branch because safekeepers can't follow timeline
// switch (and generally it likely should be optional), so ignore these.
#[inline(always)]
pub fn is_inherited_key(key: Key) -> bool {
!NON_INHERITED_RANGE.contains(&key) && !NON_INHERITED_SPARSE_RANGE.contains(&key)
}
impl Key {
// AUX_FILES currently stores only data for logical replication (slots etc), and
// we don't preserve these on a branch because safekeepers can't follow timeline
// switch (and generally it likely should be optional), so ignore these.
#[inline(always)]
pub fn is_inherited_key(self) -> bool {
!NON_INHERITED_RANGE.contains(&self) && !NON_INHERITED_SPARSE_RANGE.contains(&self)
}
#[inline(always)]
pub fn is_rel_fsm_block_key(key: Key) -> bool {
key.field1 == 0x00 && key.field4 != 0 && key.field5 == FSM_FORKNUM && key.field6 != 0xffffffff
}
#[inline(always)]
pub fn is_rel_fsm_block_key(self) -> bool {
self.field1 == 0x00
&& self.field4 != 0
&& self.field5 == FSM_FORKNUM
&& self.field6 != 0xffffffff
}
#[inline(always)]
pub fn is_rel_vm_block_key(key: Key) -> bool {
key.field1 == 0x00
&& key.field4 != 0
&& key.field5 == VISIBILITYMAP_FORKNUM
&& key.field6 != 0xffffffff
}
#[inline(always)]
pub fn is_rel_vm_block_key(self) -> bool {
self.field1 == 0x00
&& self.field4 != 0
&& self.field5 == VISIBILITYMAP_FORKNUM
&& self.field6 != 0xffffffff
}
#[inline(always)]
pub fn key_to_slru_block(key: Key) -> anyhow::Result<(SlruKind, u32, BlockNumber)> {
Ok(match key.field1 {
0x01 => {
let kind = match key.field2 {
0x00 => SlruKind::Clog,
0x01 => SlruKind::MultiXactMembers,
0x02 => SlruKind::MultiXactOffsets,
_ => anyhow::bail!("unrecognized slru kind 0x{:02x}", key.field2),
};
let segno = key.field4;
let blknum = key.field6;
#[inline(always)]
pub fn to_slru_block(self) -> anyhow::Result<(SlruKind, u32, BlockNumber)> {
Ok(match self.field1 {
0x01 => {
let kind = match self.field2 {
0x00 => SlruKind::Clog,
0x01 => SlruKind::MultiXactMembers,
0x02 => SlruKind::MultiXactOffsets,
_ => anyhow::bail!("unrecognized slru kind 0x{:02x}", self.field2),
};
let segno = self.field4;
let blknum = self.field6;
(kind, segno, blknum)
}
_ => anyhow::bail!("unexpected value kind 0x{:02x}", key.field1),
})
}
(kind, segno, blknum)
}
_ => anyhow::bail!("unexpected value kind 0x{:02x}", self.field1),
})
}
#[inline(always)]
pub fn is_slru_block_key(key: Key) -> bool {
key.field1 == 0x01 // SLRU-related
&& key.field3 == 0x00000001 // but not SlruDir
&& key.field6 != 0xffffffff // and not SlruSegSize
}
#[inline(always)]
pub fn is_slru_block_key(self) -> bool {
self.field1 == 0x01 // SLRU-related
&& self.field3 == 0x00000001 // but not SlruDir
&& self.field6 != 0xffffffff // and not SlruSegSize
}
#[inline(always)]
pub fn is_rel_block_key(key: &Key) -> bool {
key.field1 == 0x00 && key.field4 != 0 && key.field6 != 0xffffffff
}
#[inline(always)]
pub fn is_rel_block_key(&self) -> bool {
self.field1 == 0x00 && self.field4 != 0 && self.field6 != 0xffffffff
}
/// Guaranteed to return `Ok()` if [[is_rel_block_key]] returns `true` for `key`.
#[inline(always)]
pub fn key_to_rel_block(key: Key) -> anyhow::Result<(RelTag, BlockNumber)> {
Ok(match key.field1 {
0x00 => (
RelTag {
spcnode: key.field2,
dbnode: key.field3,
relnode: key.field4,
forknum: key.field5,
},
key.field6,
),
_ => anyhow::bail!("unexpected value kind 0x{:02x}", key.field1),
})
/// Guaranteed to return `Ok()` if [`Self::is_rel_block_key`] returns `true` for `key`.
#[inline(always)]
pub fn to_rel_block(self) -> anyhow::Result<(RelTag, BlockNumber)> {
Ok(match self.field1 {
0x00 => (
RelTag {
spcnode: self.field2,
dbnode: self.field3,
relnode: self.field4,
forknum: self.field5,
},
self.field6,
),
_ => anyhow::bail!("unexpected value kind 0x{:02x}", self.field1),
})
}
}
impl std::str::FromStr for Key {
@@ -687,10 +710,15 @@ mod tests {
let mut metadata_key = vec![AUX_KEY_PREFIX];
metadata_key.extend_from_slice(&[0xFF; 15]);
let encoded_key = Key::from_metadata_key(&metadata_key);
let mut output_key = Vec::new();
encoded_key.extract_metadata_key_to_writer(&mut output_key);
let output_key = encoded_key.to_i128().to_be_bytes();
assert_eq!(metadata_key, output_key);
assert!(encoded_key.is_metadata_key());
assert!(is_metadata_key_slice(&metadata_key));
}
#[test]
fn test_possible_largest_key() {
Key::from_i128(0x7FFF_FFFF_FFFF_FFFF_FFFF_FFFF_FFFF_FFFF);
// TODO: put this key into the system and see if anything breaks.
}
}

View File

@@ -3,7 +3,7 @@ use std::cmp::Ordering;
use std::fmt;
use postgres_ffi::pg_constants::GLOBALTABLESPACE_OID;
use postgres_ffi::relfile_utils::forknumber_to_name;
use postgres_ffi::relfile_utils::{forkname_to_number, forknumber_to_name, MAIN_FORKNUM};
use postgres_ffi::Oid;
///
@@ -68,6 +68,57 @@ impl fmt::Display for RelTag {
}
}
#[derive(Debug, thiserror::Error)]
pub enum ParseRelTagError {
#[error("invalid forknum")]
InvalidForknum(#[source] std::num::ParseIntError),
#[error("missing triplet member {}", .0)]
MissingTripletMember(usize),
#[error("invalid triplet member {}", .0)]
InvalidTripletMember(usize, #[source] std::num::ParseIntError),
}
impl std::str::FromStr for RelTag {
type Err = ParseRelTagError;
fn from_str(s: &str) -> Result<Self, Self::Err> {
use ParseRelTagError::*;
// FIXME: in postgres logs this separator is dot
// Example:
// could not read block 2 in rel 1663/208101/2620.1 from page server at lsn 0/2431E6F0
// with a regex we could get this more painlessly
let (triplet, forknum) = match s.split_once('_').or_else(|| s.split_once('.')) {
Some((t, f)) => {
let forknum = forkname_to_number(Some(f));
let forknum = if let Ok(f) = forknum {
f
} else {
f.parse::<u8>().map_err(InvalidForknum)?
};
(t, Some(forknum))
}
None => (s, None),
};
let mut split = triplet
.splitn(3, '/')
.enumerate()
.map(|(i, s)| s.parse::<u32>().map_err(|e| InvalidTripletMember(i, e)));
let spcnode = split.next().ok_or(MissingTripletMember(0))??;
let dbnode = split.next().ok_or(MissingTripletMember(1))??;
let relnode = split.next().ok_or(MissingTripletMember(2))??;
Ok(RelTag {
spcnode,
forknum: forknum.unwrap_or(MAIN_FORKNUM),
dbnode,
relnode,
})
}
}
impl RelTag {
pub fn to_segfile_name(&self, segno: u32) -> String {
let mut name = if self.spcnode == GLOBALTABLESPACE_OID {

View File

@@ -1,9 +1,6 @@
use std::{ops::RangeInclusive, str::FromStr};
use crate::{
key::{is_rel_block_key, Key},
models::ShardParameters,
};
use crate::{key::Key, models::ShardParameters};
use hex::FromHex;
use postgres_ffi::relfile_utils::INIT_FORKNUM;
use serde::{Deserialize, Serialize};
@@ -428,6 +425,12 @@ impl<'de> Deserialize<'de> for TenantShardId {
#[derive(Clone, Copy, Serialize, Deserialize, Eq, PartialEq, Debug)]
pub struct ShardStripeSize(pub u32);
impl Default for ShardStripeSize {
fn default() -> Self {
DEFAULT_STRIPE_SIZE
}
}
/// Layout version: for future upgrades where we might change how the key->shard mapping works
#[derive(Clone, Copy, Serialize, Deserialize, Eq, PartialEq, Debug)]
pub struct ShardLayout(u8);
@@ -666,7 +669,7 @@ fn key_is_shard0(key: &Key) -> bool {
// because they must be included in basebackups.
let is_initfork = key.field5 == INIT_FORKNUM;
!is_rel_block_key(key) || is_initfork
!key.is_rel_block_key() || is_initfork
}
/// Provide the same result as the function in postgres `hashfn.h` with the same name
@@ -713,6 +716,25 @@ fn key_to_shard_number(count: ShardCount, stripe_size: ShardStripeSize, key: &Ke
ShardNumber((hash % count.0 as u32) as u8)
}
/// For debugging, while not exposing the internals.
#[derive(Debug)]
#[allow(unused)] // used by debug formatting by pagectl
struct KeyShardingInfo {
shard0: bool,
shard_number: ShardNumber,
}
pub fn describe(
key: &Key,
shard_count: ShardCount,
stripe_size: ShardStripeSize,
) -> impl std::fmt::Debug {
KeyShardingInfo {
shard0: key_is_shard0(key),
shard_number: key_to_shard_number(shard_count, stripe_size, key),
}
}
#[cfg(test)]
mod tests {
use utils::Hex;

View File

@@ -7,6 +7,7 @@ license.workspace = true
[dependencies]
bytes.workspace = true
byteorder.workspace = true
itertools.workspace = true
pin-project-lite.workspace = true
postgres-protocol.workspace = true
rand.workspace = true

View File

@@ -7,8 +7,9 @@ pub mod framed;
use byteorder::{BigEndian, ReadBytesExt};
use bytes::{Buf, BufMut, Bytes, BytesMut};
use itertools::Itertools;
use serde::{Deserialize, Serialize};
use std::{borrow::Cow, collections::HashMap, fmt, io, str};
use std::{borrow::Cow, fmt, io, str};
// re-export for use in utils pageserver_feedback.rs
pub use postgres_protocol::PG_EPOCH;
@@ -50,15 +51,37 @@ pub enum FeStartupPacket {
},
}
#[derive(Debug)]
#[derive(Debug, Clone, Default)]
pub struct StartupMessageParamsBuilder {
params: BytesMut,
}
impl StartupMessageParamsBuilder {
/// Set parameter's value by its name.
/// name and value must not contain a \0 byte
pub fn insert(&mut self, name: &str, value: &str) {
self.params.put(name.as_bytes());
self.params.put(&b"\0"[..]);
self.params.put(value.as_bytes());
self.params.put(&b"\0"[..]);
}
pub fn freeze(self) -> StartupMessageParams {
StartupMessageParams {
params: self.params.freeze(),
}
}
}
#[derive(Debug, Clone, Default)]
pub struct StartupMessageParams {
params: HashMap<String, String>,
params: Bytes,
}
impl StartupMessageParams {
/// Get parameter's value by its name.
pub fn get(&self, name: &str) -> Option<&str> {
self.params.get(name).map(|s| s.as_str())
self.iter().find_map(|(k, v)| (k == name).then_some(v))
}
/// Split command-line options according to PostgreSQL's logic,
@@ -112,15 +135,19 @@ impl StartupMessageParams {
/// Iterate through key-value pairs in an arbitrary order.
pub fn iter(&self) -> impl Iterator<Item = (&str, &str)> {
self.params.iter().map(|(k, v)| (k.as_str(), v.as_str()))
let params =
std::str::from_utf8(&self.params).expect("should be validated as utf8 already");
params.split_terminator('\0').tuples()
}
// This function is mostly useful in tests.
#[doc(hidden)]
pub fn new<'a, const N: usize>(pairs: [(&'a str, &'a str); N]) -> Self {
Self {
params: pairs.map(|(k, v)| (k.to_owned(), v.to_owned())).into(),
let mut b = StartupMessageParamsBuilder::default();
for (k, v) in pairs {
b.insert(k, v)
}
b.freeze()
}
}
@@ -345,35 +372,21 @@ impl FeStartupPacket {
(major_version, minor_version) => {
// StartupMessage
// Parse pairs of null-terminated strings (key, value).
// See `postgres: ProcessStartupPacket, build_startup_packet`.
let mut tokens = str::from_utf8(&msg)
.map_err(|_e| {
ProtocolError::BadMessage("StartupMessage params: invalid utf-8".to_owned())
})?
.strip_suffix('\0') // drop packet's own null
.ok_or_else(|| {
ProtocolError::Protocol(
"StartupMessage params: missing null terminator".to_string(),
)
})?
.split_terminator('\0');
let mut params = HashMap::new();
while let Some(name) = tokens.next() {
let value = tokens.next().ok_or_else(|| {
ProtocolError::Protocol(
"StartupMessage params: key without value".to_string(),
)
})?;
params.insert(name.to_owned(), value.to_owned());
}
let s = str::from_utf8(&msg).map_err(|_e| {
ProtocolError::BadMessage("StartupMessage params: invalid utf-8".to_owned())
})?;
let s = s.strip_suffix('\0').ok_or_else(|| {
ProtocolError::Protocol(
"StartupMessage params: missing null terminator".to_string(),
)
})?;
FeStartupPacket::StartupMessage {
major_version,
minor_version,
params: StartupMessageParams { params },
params: StartupMessageParams {
params: msg.slice_ref(s.as_bytes()),
},
}
}
};

View File

@@ -26,14 +26,14 @@ use futures::stream::Stream;
use futures_util::StreamExt;
use futures_util::TryStreamExt;
use http_types::{StatusCode, Url};
use scopeguard::ScopeGuard;
use tokio_util::sync::CancellationToken;
use tracing::debug;
use crate::RemoteStorageActivity;
use crate::metrics::{start_measuring_requests, AttemptOutcome, RequestKind};
use crate::{
error::Cancelled, s3_bucket::RequestKind, AzureConfig, ConcurrencyLimiter, Download,
DownloadError, Listing, ListingMode, RemotePath, RemoteStorage, StorageMetadata,
TimeTravelError, TimeoutOrCancel,
error::Cancelled, AzureConfig, ConcurrencyLimiter, Download, DownloadError, Listing,
ListingMode, RemotePath, RemoteStorage, StorageMetadata, TimeTravelError, TimeoutOrCancel,
};
pub struct AzureBlobStorage {
@@ -138,6 +138,8 @@ impl AzureBlobStorage {
let mut last_modified = None;
let mut metadata = HashMap::new();
let started_at = start_measuring_requests(kind);
let download = async {
let response = builder
// convert to concrete Pageable
@@ -201,13 +203,22 @@ impl AzureBlobStorage {
})
};
tokio::select! {
let download = tokio::select! {
bufs = download => bufs,
cancel_or_timeout = cancel_or_timeout => match cancel_or_timeout {
TimeoutOrCancel::Timeout => Err(DownloadError::Timeout),
TimeoutOrCancel::Cancel => Err(DownloadError::Cancelled),
TimeoutOrCancel::Timeout => return Err(DownloadError::Timeout),
TimeoutOrCancel::Cancel => return Err(DownloadError::Cancelled),
},
}
};
let started_at = ScopeGuard::into_inner(started_at);
let outcome = match &download {
Ok(_) => AttemptOutcome::Ok,
Err(_) => AttemptOutcome::Err,
};
crate::metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, outcome, started_at);
download
}
async fn permit(
@@ -341,7 +352,10 @@ impl RemoteStorage for AzureBlobStorage {
metadata: Option<StorageMetadata>,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let _permit = self.permit(RequestKind::Put, cancel).await?;
let kind = RequestKind::Put;
let _permit = self.permit(kind, cancel).await?;
let started_at = start_measuring_requests(kind);
let op = async {
let blob_client = self.client.blob_client(self.relative_path_to_name(to));
@@ -365,14 +379,25 @@ impl RemoteStorage for AzureBlobStorage {
match fut.await {
Ok(Ok(_response)) => Ok(()),
Ok(Err(azure)) => Err(azure.into()),
Err(_timeout) => Err(TimeoutOrCancel::Cancel.into()),
Err(_timeout) => Err(TimeoutOrCancel::Timeout.into()),
}
};
tokio::select! {
let res = tokio::select! {
res = op => res,
_ = cancel.cancelled() => Err(TimeoutOrCancel::Cancel.into()),
}
_ = cancel.cancelled() => return Err(TimeoutOrCancel::Cancel.into()),
};
let outcome = match res {
Ok(_) => AttemptOutcome::Ok,
Err(_) => AttemptOutcome::Err,
};
let started_at = ScopeGuard::into_inner(started_at);
crate::metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, outcome, started_at);
res
}
async fn download(
@@ -418,12 +443,13 @@ impl RemoteStorage for AzureBlobStorage {
paths: &'a [RemotePath],
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let _permit = self.permit(RequestKind::Delete, cancel).await?;
let kind = RequestKind::Delete;
let _permit = self.permit(kind, cancel).await?;
let started_at = start_measuring_requests(kind);
let op = async {
// TODO batch requests are also not supported by the SDK
// TODO batch requests are not supported by the SDK
// https://github.com/Azure/azure-sdk-for-rust/issues/1068
// https://github.com/Azure/azure-sdk-for-rust/issues/1249
for path in paths {
let blob_client = self.client.blob_client(self.relative_path_to_name(path));
@@ -448,10 +474,16 @@ impl RemoteStorage for AzureBlobStorage {
Ok(())
};
tokio::select! {
let res = tokio::select! {
res = op => res,
_ = cancel.cancelled() => Err(TimeoutOrCancel::Cancel.into()),
}
_ = cancel.cancelled() => return Err(TimeoutOrCancel::Cancel.into()),
};
let started_at = ScopeGuard::into_inner(started_at);
crate::metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &res, started_at);
res
}
async fn copy(
@@ -460,7 +492,9 @@ impl RemoteStorage for AzureBlobStorage {
to: &RemotePath,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
let _permit = self.permit(RequestKind::Copy, cancel).await?;
let kind = RequestKind::Copy;
let _permit = self.permit(kind, cancel).await?;
let started_at = start_measuring_requests(kind);
let timeout = tokio::time::sleep(self.timeout);
@@ -504,15 +538,21 @@ impl RemoteStorage for AzureBlobStorage {
}
};
tokio::select! {
let res = tokio::select! {
res = op => res,
_ = cancel.cancelled() => Err(anyhow::Error::new(TimeoutOrCancel::Cancel)),
_ = cancel.cancelled() => return Err(anyhow::Error::new(TimeoutOrCancel::Cancel)),
_ = timeout => {
let e = anyhow::Error::new(TimeoutOrCancel::Timeout);
let e = e.context(format!("Timeout, last status: {copy_status:?}"));
Err(e)
},
}
};
let started_at = ScopeGuard::into_inner(started_at);
crate::metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &res, started_at);
res
}
async fn time_travel_recover(
@@ -526,10 +566,6 @@ impl RemoteStorage for AzureBlobStorage {
// https://learn.microsoft.com/en-us/azure/storage/blobs/point-in-time-restore-overview
Err(TimeTravelError::Unimplemented)
}
fn activity(&self) -> RemoteStorageActivity {
self.concurrency_limiter.activity()
}
}
pin_project_lite::pin_project! {

View File

@@ -12,6 +12,7 @@
mod azure_blob;
mod error;
mod local_fs;
mod metrics;
mod s3_bucket;
mod simulate_failures;
mod support;
@@ -121,8 +122,8 @@ impl RemotePath {
self.0.file_name()
}
pub fn join(&self, segment: &Utf8Path) -> Self {
Self(self.0.join(segment))
pub fn join(&self, path: impl AsRef<Utf8Path>) -> Self {
Self(self.0.join(path))
}
pub fn get_path(&self) -> &Utf8PathBuf {
@@ -263,17 +264,6 @@ pub trait RemoteStorage: Send + Sync + 'static {
done_if_after: SystemTime,
cancel: &CancellationToken,
) -> Result<(), TimeTravelError>;
/// Query how busy we currently are: may be used by callers which wish to politely
/// back off if there are already a lot of operations underway.
fn activity(&self) -> RemoteStorageActivity;
}
pub struct RemoteStorageActivity {
pub read_available: usize,
pub read_total: usize,
pub write_available: usize,
pub write_total: usize,
}
/// DownloadStream is sensitive to the timeout and cancellation used with the original
@@ -455,15 +445,6 @@ impl<Other: RemoteStorage> GenericRemoteStorage<Arc<Other>> {
}
}
}
pub fn activity(&self) -> RemoteStorageActivity {
match self {
Self::LocalFs(s) => s.activity(),
Self::AwsS3(s) => s.activity(),
Self::AzureBlob(s) => s.activity(),
Self::Unreliable(s) => s.activity(),
}
}
}
impl GenericRemoteStorage {
@@ -794,9 +775,6 @@ struct ConcurrencyLimiter {
// The helps to ensure we don't exceed the thresholds.
write: Arc<Semaphore>,
read: Arc<Semaphore>,
write_total: usize,
read_total: usize,
}
impl ConcurrencyLimiter {
@@ -825,21 +803,10 @@ impl ConcurrencyLimiter {
Arc::clone(self.for_kind(kind)).acquire_owned().await
}
fn activity(&self) -> RemoteStorageActivity {
RemoteStorageActivity {
read_available: self.read.available_permits(),
read_total: self.read_total,
write_available: self.write.available_permits(),
write_total: self.write_total,
}
}
fn new(limit: usize) -> ConcurrencyLimiter {
Self {
read: Arc::new(Semaphore::new(limit)),
write: Arc::new(Semaphore::new(limit)),
read_total: limit,
write_total: limit,
}
}
}

View File

@@ -23,8 +23,8 @@ use tokio_util::{io::ReaderStream, sync::CancellationToken};
use utils::crashsafe::path_with_suffix_extension;
use crate::{
Download, DownloadError, Listing, ListingMode, RemotePath, RemoteStorageActivity,
TimeTravelError, TimeoutOrCancel, REMOTE_STORAGE_PREFIX_SEPARATOR,
Download, DownloadError, Listing, ListingMode, RemotePath, TimeTravelError, TimeoutOrCancel,
REMOTE_STORAGE_PREFIX_SEPARATOR,
};
use super::{RemoteStorage, StorageMetadata};
@@ -605,16 +605,6 @@ impl RemoteStorage for LocalFs {
) -> Result<(), TimeTravelError> {
Err(TimeTravelError::Unimplemented)
}
fn activity(&self) -> RemoteStorageActivity {
// LocalFS has no concurrency limiting: give callers the impression that plenty of units are available
RemoteStorageActivity {
read_available: 16,
read_total: 16,
write_available: 16,
write_total: 16,
}
}
}
fn storage_metadata_path(original_path: &Utf8Path) -> Utf8PathBuf {

View File

@@ -15,6 +15,7 @@ pub(crate) enum RequestKind {
TimeTravel = 5,
}
use scopeguard::ScopeGuard;
use RequestKind::*;
impl RequestKind {
@@ -33,10 +34,10 @@ impl RequestKind {
}
}
pub(super) struct RequestTyped<C>([C; 6]);
pub(crate) struct RequestTyped<C>([C; 6]);
impl<C> RequestTyped<C> {
pub(super) fn get(&self, kind: RequestKind) -> &C {
pub(crate) fn get(&self, kind: RequestKind) -> &C {
&self.0[kind.as_index()]
}
@@ -58,19 +59,19 @@ impl<C> RequestTyped<C> {
}
impl RequestTyped<Histogram> {
pub(super) fn observe_elapsed(&self, kind: RequestKind, started_at: std::time::Instant) {
pub(crate) fn observe_elapsed(&self, kind: RequestKind, started_at: std::time::Instant) {
self.get(kind).observe(started_at.elapsed().as_secs_f64())
}
}
pub(super) struct PassFailCancelledRequestTyped<C> {
pub(crate) struct PassFailCancelledRequestTyped<C> {
success: RequestTyped<C>,
fail: RequestTyped<C>,
cancelled: RequestTyped<C>,
}
#[derive(Debug, Clone, Copy)]
pub(super) enum AttemptOutcome {
pub(crate) enum AttemptOutcome {
Ok,
Err,
Cancelled,
@@ -86,7 +87,7 @@ impl<T, E> From<&Result<T, E>> for AttemptOutcome {
}
impl AttemptOutcome {
pub(super) fn as_str(&self) -> &'static str {
pub(crate) fn as_str(&self) -> &'static str {
match self {
AttemptOutcome::Ok => "ok",
AttemptOutcome::Err => "err",
@@ -96,7 +97,7 @@ impl AttemptOutcome {
}
impl<C> PassFailCancelledRequestTyped<C> {
pub(super) fn get(&self, kind: RequestKind, outcome: AttemptOutcome) -> &C {
pub(crate) fn get(&self, kind: RequestKind, outcome: AttemptOutcome) -> &C {
let target = match outcome {
AttemptOutcome::Ok => &self.success,
AttemptOutcome::Err => &self.fail,
@@ -119,7 +120,7 @@ impl<C> PassFailCancelledRequestTyped<C> {
}
impl PassFailCancelledRequestTyped<Histogram> {
pub(super) fn observe_elapsed(
pub(crate) fn observe_elapsed(
&self,
kind: RequestKind,
outcome: impl Into<AttemptOutcome>,
@@ -130,19 +131,44 @@ impl PassFailCancelledRequestTyped<Histogram> {
}
}
pub(super) struct BucketMetrics {
/// On drop (cancellation) count towards [`BucketMetrics::cancelled_waits`].
pub(crate) fn start_counting_cancelled_wait(
kind: RequestKind,
) -> ScopeGuard<std::time::Instant, impl FnOnce(std::time::Instant), scopeguard::OnSuccess> {
scopeguard::guard_on_success(std::time::Instant::now(), move |_| {
crate::metrics::BUCKET_METRICS
.cancelled_waits
.get(kind)
.inc()
})
}
/// On drop (cancellation) add time to [`BucketMetrics::req_seconds`].
pub(crate) fn start_measuring_requests(
kind: RequestKind,
) -> ScopeGuard<std::time::Instant, impl FnOnce(std::time::Instant), scopeguard::OnSuccess> {
scopeguard::guard_on_success(std::time::Instant::now(), move |started_at| {
crate::metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
kind,
AttemptOutcome::Cancelled,
started_at,
)
})
}
pub(crate) struct BucketMetrics {
/// Full request duration until successful completion, error or cancellation.
pub(super) req_seconds: PassFailCancelledRequestTyped<Histogram>,
pub(crate) req_seconds: PassFailCancelledRequestTyped<Histogram>,
/// Total amount of seconds waited on queue.
pub(super) wait_seconds: RequestTyped<Histogram>,
pub(crate) wait_seconds: RequestTyped<Histogram>,
/// Track how many semaphore awaits were cancelled per request type.
///
/// This is in case cancellations are happening more than expected.
pub(super) cancelled_waits: RequestTyped<IntCounter>,
pub(crate) cancelled_waits: RequestTyped<IntCounter>,
/// Total amount of deleted objects in batches or single requests.
pub(super) deleted_objects_total: IntCounter,
pub(crate) deleted_objects_total: IntCounter,
}
impl Default for BucketMetrics {

View File

@@ -46,15 +46,16 @@ use utils::backoff;
use super::StorageMetadata;
use crate::{
error::Cancelled, support::PermitCarrying, ConcurrencyLimiter, Download, DownloadError,
Listing, ListingMode, RemotePath, RemoteStorage, RemoteStorageActivity, S3Config,
TimeTravelError, TimeoutOrCancel, MAX_KEYS_PER_DELETE, REMOTE_STORAGE_PREFIX_SEPARATOR,
error::Cancelled,
metrics::{start_counting_cancelled_wait, start_measuring_requests},
support::PermitCarrying,
ConcurrencyLimiter, Download, DownloadError, Listing, ListingMode, RemotePath, RemoteStorage,
S3Config, TimeTravelError, TimeoutOrCancel, MAX_KEYS_PER_DELETE,
REMOTE_STORAGE_PREFIX_SEPARATOR,
};
pub(super) mod metrics;
use self::metrics::AttemptOutcome;
pub(super) use self::metrics::RequestKind;
use crate::metrics::AttemptOutcome;
pub(super) use crate::metrics::RequestKind;
/// AWS S3 storage.
pub struct S3Bucket {
@@ -227,7 +228,7 @@ impl S3Bucket {
};
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
crate::metrics::BUCKET_METRICS
.wait_seconds
.observe_elapsed(kind, started_at);
@@ -248,7 +249,7 @@ impl S3Bucket {
};
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
crate::metrics::BUCKET_METRICS
.wait_seconds
.observe_elapsed(kind, started_at);
Ok(permit)
@@ -287,7 +288,7 @@ impl S3Bucket {
// Count this in the AttemptOutcome::Ok bucket, because 404 is not
// an error: we expect to sometimes fetch an object and find it missing,
// e.g. when probing for timeline indices.
metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
crate::metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
kind,
AttemptOutcome::Ok,
started_at,
@@ -295,7 +296,7 @@ impl S3Bucket {
return Err(DownloadError::NotFound);
}
Err(e) => {
metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
crate::metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
kind,
AttemptOutcome::Err,
started_at,
@@ -371,12 +372,12 @@ impl S3Bucket {
};
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
crate::metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &resp, started_at);
let resp = resp.context("request deletion")?;
metrics::BUCKET_METRICS
crate::metrics::BUCKET_METRICS
.deleted_objects_total
.inc_by(chunk.len() as u64);
@@ -435,14 +436,14 @@ pin_project_lite::pin_project! {
/// Times and tracks the outcome of the request.
struct TimedDownload<S> {
started_at: std::time::Instant,
outcome: metrics::AttemptOutcome,
outcome: AttemptOutcome,
#[pin]
inner: S
}
impl<S> PinnedDrop for TimedDownload<S> {
fn drop(mut this: Pin<&mut Self>) {
metrics::BUCKET_METRICS.req_seconds.observe_elapsed(RequestKind::Get, this.outcome, this.started_at);
crate::metrics::BUCKET_METRICS.req_seconds.observe_elapsed(RequestKind::Get, this.outcome, this.started_at);
}
}
}
@@ -451,7 +452,7 @@ impl<S> TimedDownload<S> {
fn new(started_at: std::time::Instant, inner: S) -> Self {
TimedDownload {
started_at,
outcome: metrics::AttemptOutcome::Cancelled,
outcome: AttemptOutcome::Cancelled,
inner,
}
}
@@ -468,8 +469,8 @@ impl<S: Stream<Item = std::io::Result<Bytes>>> Stream for TimedDownload<S> {
let res = ready!(this.inner.poll_next(cx));
match &res {
Some(Ok(_)) => {}
Some(Err(_)) => *this.outcome = metrics::AttemptOutcome::Err,
None => *this.outcome = metrics::AttemptOutcome::Ok,
Some(Err(_)) => *this.outcome = AttemptOutcome::Err,
None => *this.outcome = AttemptOutcome::Ok,
}
Poll::Ready(res)
@@ -543,7 +544,7 @@ impl RemoteStorage for S3Bucket {
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
crate::metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &response, started_at);
@@ -625,7 +626,7 @@ impl RemoteStorage for S3Bucket {
if let Ok(inner) = &res {
// do not incl. timeouts as errors in metrics but cancellations
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
crate::metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, inner, started_at);
}
@@ -673,7 +674,7 @@ impl RemoteStorage for S3Bucket {
};
let started_at = ScopeGuard::into_inner(started_at);
metrics::BUCKET_METRICS
crate::metrics::BUCKET_METRICS
.req_seconds
.observe_elapsed(kind, &res, started_at);
@@ -975,32 +976,6 @@ impl RemoteStorage for S3Bucket {
}
Ok(())
}
fn activity(&self) -> RemoteStorageActivity {
self.concurrency_limiter.activity()
}
}
/// On drop (cancellation) count towards [`metrics::BucketMetrics::cancelled_waits`].
fn start_counting_cancelled_wait(
kind: RequestKind,
) -> ScopeGuard<std::time::Instant, impl FnOnce(std::time::Instant), scopeguard::OnSuccess> {
scopeguard::guard_on_success(std::time::Instant::now(), move |_| {
metrics::BUCKET_METRICS.cancelled_waits.get(kind).inc()
})
}
/// On drop (cancellation) add time to [`metrics::BucketMetrics::req_seconds`].
fn start_measuring_requests(
kind: RequestKind,
) -> ScopeGuard<std::time::Instant, impl FnOnce(std::time::Instant), scopeguard::OnSuccess> {
scopeguard::guard_on_success(std::time::Instant::now(), move |started_at| {
metrics::BUCKET_METRICS.req_seconds.observe_elapsed(
kind,
AttemptOutcome::Cancelled,
started_at,
)
})
}
// Save RAM and only store the needed data instead of the entire ObjectVersion/DeleteMarkerEntry

View File

@@ -12,7 +12,7 @@ use tokio_util::sync::CancellationToken;
use crate::{
Download, DownloadError, GenericRemoteStorage, Listing, ListingMode, RemotePath, RemoteStorage,
RemoteStorageActivity, StorageMetadata, TimeTravelError,
StorageMetadata, TimeTravelError,
};
pub struct UnreliableWrapper {
@@ -213,8 +213,4 @@ impl RemoteStorage for UnreliableWrapper {
.time_travel_recover(prefix, timestamp, done_if_after, cancel)
.await
}
fn activity(&self) -> RemoteStorageActivity {
self.inner.activity()
}
}

View File

@@ -9,6 +9,33 @@ use serde::{Deserialize, Serialize};
use tokio_util::sync::CancellationToken;
use tracing::*;
/// Declare a failpoint that can use the `pause` failpoint action.
/// We don't want to block the executor thread, hence, spawn_blocking + await.
#[macro_export]
macro_rules! pausable_failpoint {
($name:literal) => {
if cfg!(feature = "testing") {
tokio::task::spawn_blocking({
let current = tracing::Span::current();
move || {
let _entered = current.entered();
tracing::info!("at failpoint {}", $name);
fail::fail_point!($name);
}
})
.await
.expect("spawn_blocking");
}
};
($name:literal, $cond:expr) => {
if cfg!(feature = "testing") {
if $cond {
pausable_failpoint!($name)
}
}
};
}
/// use with fail::cfg("$name", "return(2000)")
///
/// The effect is similar to a "sleep(2000)" action, i.e. we sleep for the

View File

@@ -19,13 +19,13 @@
/// // right: [0x68; 1]
/// # fn serialize_something() -> Vec<u8> { "hello world".as_bytes().to_vec() }
/// ```
#[derive(PartialEq)]
pub struct Hex<'a>(pub &'a [u8]);
pub struct Hex<S>(pub S);
impl std::fmt::Debug for Hex<'_> {
impl<S: AsRef<[u8]>> std::fmt::Debug for Hex<S> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "[")?;
for (i, c) in self.0.chunks(16).enumerate() {
let chunks = self.0.as_ref().chunks(16);
for (i, c) in chunks.enumerate() {
if i > 0 && !c.is_empty() {
writeln!(f, ", ")?;
}
@@ -36,6 +36,15 @@ impl std::fmt::Debug for Hex<'_> {
write!(f, "0x{b:02x}")?;
}
}
write!(f, "; {}]", self.0.len())
write!(f, "; {}]", self.0.as_ref().len())
}
}
impl<R: AsRef<[u8]>, L: AsRef<[u8]>> PartialEq<Hex<R>> for Hex<L> {
fn eq(&self, other: &Hex<R>) -> bool {
let left = self.0.as_ref();
let right = other.0.as_ref();
left == right
}
}

View File

@@ -135,7 +135,8 @@ impl Gate {
let started_at = std::time::Instant::now();
let mut do_close = std::pin::pin!(self.do_close());
let nag_after = Duration::from_secs(1);
// with 1s we rarely saw anything, let's try if we get more gate closing reasons with 100ms
let nag_after = Duration::from_millis(100);
let Err(_timeout) = tokio::time::timeout(nag_after, &mut do_close).await else {
return;

View File

@@ -380,8 +380,8 @@ impl interface::CompactionLayer<Key> for MockLayer {
}
fn file_size(&self) -> u64 {
match self {
MockLayer::Delta(this) => this.file_size(),
MockLayer::Image(this) => this.file_size(),
MockLayer::Delta(this) => this.file_size,
MockLayer::Image(this) => this.file_size,
}
}
fn short_id(&self) -> String {

View File

@@ -17,6 +17,7 @@ pageserver = { path = ".." }
pageserver_api.workspace = true
remote_storage = { path = "../../libs/remote_storage" }
postgres_ffi.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-util.workspace = true
toml_edit.workspace = true

View File

@@ -2,7 +2,7 @@ use std::collections::HashMap;
use anyhow::Context;
use camino::Utf8PathBuf;
use pageserver::tenant::remote_timeline_client::index::IndexLayerMetadata;
use pageserver::tenant::remote_timeline_client::index::LayerFileMetadata;
use pageserver::tenant::storage_layer::LayerName;
use pageserver::tenant::{metadata::TimelineMetadata, IndexPart};
use utils::lsn::Lsn;
@@ -19,7 +19,7 @@ pub(crate) async fn main(cmd: &IndexPartCmd) -> anyhow::Result<()> {
let des: IndexPart = IndexPart::from_s3_bytes(&bytes).context("deserialize")?;
#[derive(serde::Serialize)]
struct Output<'a> {
layer_metadata: &'a HashMap<LayerName, IndexLayerMetadata>,
layer_metadata: &'a HashMap<LayerName, LayerFileMetadata>,
disk_consistent_lsn: Lsn,
timeline_metadata: &'a TimelineMetadata,
}

475
pageserver/ctl/src/key.rs Normal file
View File

@@ -0,0 +1,475 @@
use anyhow::Context;
use clap::Parser;
use pageserver_api::{
key::Key,
reltag::{BlockNumber, RelTag, SlruKind},
shard::{ShardCount, ShardStripeSize},
};
use std::str::FromStr;
#[derive(Parser)]
pub(super) struct DescribeKeyCommand {
/// Key material in one of the forms: hex, span attributes captured from log, reltag blocknum
input: Vec<String>,
/// The number of shards to calculate what Keys placement would be.
#[arg(long)]
shard_count: Option<CustomShardCount>,
/// The sharding stripe size.
///
/// The default is hardcoded. It makes no sense to provide this without providing
/// `--shard-count`.
#[arg(long, requires = "shard_count")]
stripe_size: Option<u32>,
}
/// Sharded shard count without unsharded count, which the actual ShardCount supports.
#[derive(Clone, Copy)]
pub(super) struct CustomShardCount(std::num::NonZeroU8);
#[derive(Debug, thiserror::Error)]
pub(super) enum InvalidShardCount {
#[error(transparent)]
ParsingFailed(#[from] std::num::ParseIntError),
#[error("too few shards")]
TooFewShards,
}
impl FromStr for CustomShardCount {
type Err = InvalidShardCount;
fn from_str(s: &str) -> Result<Self, Self::Err> {
let inner: std::num::NonZeroU8 = s.parse()?;
if inner.get() < 2 {
Err(InvalidShardCount::TooFewShards)
} else {
Ok(CustomShardCount(inner))
}
}
}
impl From<CustomShardCount> for ShardCount {
fn from(value: CustomShardCount) -> Self {
ShardCount::new(value.0.get())
}
}
impl DescribeKeyCommand {
pub(super) fn execute(self) {
let DescribeKeyCommand {
input,
shard_count,
stripe_size,
} = self;
let material = KeyMaterial::try_from(input.as_slice()).unwrap();
let kind = material.kind();
let key = Key::from(material);
println!("parsed from {kind}: {key}:");
println!();
println!("{key:?}");
macro_rules! kind_query {
([$($name:ident),*$(,)?]) => {{[$(kind_query!($name)),*]}};
($name:ident) => {{
let s: &'static str = stringify!($name);
let s = s.strip_prefix("is_").unwrap_or(s);
let s = s.strip_suffix("_key").unwrap_or(s);
#[allow(clippy::needless_borrow)]
(s, key.$name())
}};
}
// the current characterization is a mess of these boolean queries and separate
// "recognization". I think it accurately represents how strictly we model the Key
// right now, but could of course be made less confusing.
let queries = kind_query!([
is_rel_block_key,
is_rel_vm_block_key,
is_rel_fsm_block_key,
is_slru_block_key,
is_inherited_key,
is_rel_size_key,
is_slru_segment_size_key,
]);
let recognized_kind = "recognized kind";
let metadata_key = "metadata key";
let shard_placement = "shard placement";
let longest = queries
.iter()
.map(|t| t.0)
.chain([recognized_kind, metadata_key, shard_placement])
.map(|s| s.len())
.max()
.unwrap();
let colon = 1;
let padding = 1;
for (name, is) in queries {
let width = longest - name.len() + colon + padding;
println!("{}{:width$}{}", name, ":", is);
}
let width = longest - recognized_kind.len() + colon + padding;
println!(
"{}{:width$}{:?}",
recognized_kind,
":",
RecognizedKeyKind::new(key),
);
if let Some(shard_count) = shard_count {
// seeing the sharding placement might be confusing, so leave it out unless shard
// count was given.
let stripe_size = stripe_size.map(ShardStripeSize).unwrap_or_default();
println!(
"# placement with shard_count: {} and stripe_size: {}:",
shard_count.0, stripe_size.0
);
let width = longest - shard_placement.len() + colon + padding;
println!(
"{}{:width$}{:?}",
shard_placement,
":",
pageserver_api::shard::describe(&key, shard_count.into(), stripe_size)
);
}
}
}
/// Hand-wavy "inputs we accept" for a key.
#[derive(Debug)]
pub(super) enum KeyMaterial {
Hex(Key),
String(SpanAttributesFromLogs),
Split(RelTag, BlockNumber),
}
impl KeyMaterial {
fn kind(&self) -> &'static str {
match self {
KeyMaterial::Hex(_) => "hex",
KeyMaterial::String(_) | KeyMaterial::Split(_, _) => "split",
}
}
}
impl From<KeyMaterial> for Key {
fn from(value: KeyMaterial) -> Self {
match value {
KeyMaterial::Hex(key) => key,
KeyMaterial::String(SpanAttributesFromLogs(rt, blocknum))
| KeyMaterial::Split(rt, blocknum) => {
pageserver_api::key::rel_block_to_key(rt, blocknum)
}
}
}
}
impl<S: AsRef<str>> TryFrom<&[S]> for KeyMaterial {
type Error = anyhow::Error;
fn try_from(value: &[S]) -> Result<Self, Self::Error> {
match value {
[] => anyhow::bail!(
"need 1..N positional arguments describing the key, try hex or a log line"
),
[one] => {
let one = one.as_ref();
let key = Key::from_hex(one).map(KeyMaterial::Hex);
let attrs = SpanAttributesFromLogs::from_str(one).map(KeyMaterial::String);
match (key, attrs) {
(Ok(key), _) => Ok(key),
(_, Ok(s)) => Ok(s),
(Err(e1), Err(e2)) => anyhow::bail!(
"failed to parse {one:?} as hex or span attributes:\n- {e1:#}\n- {e2:#}"
),
}
}
more => {
// assume going left to right one of these is a reltag and then we find a blocknum
// this works, because we don't have plain numbers at least right after reltag in
// logs. for some definition of "works".
let Some((reltag_at, reltag)) = more
.iter()
.map(AsRef::as_ref)
.enumerate()
.find_map(|(i, s)| {
s.split_once("rel=")
.map(|(_garbage, actual)| actual)
.unwrap_or(s)
.parse::<RelTag>()
.ok()
.map(|rt| (i, rt))
})
else {
anyhow::bail!("found no RelTag in arguments");
};
let Some(blocknum) = more
.iter()
.map(AsRef::as_ref)
.skip(reltag_at)
.find_map(|s| {
s.split_once("blkno=")
.map(|(_garbage, actual)| actual)
.unwrap_or(s)
.parse::<BlockNumber>()
.ok()
})
else {
anyhow::bail!("found no blocknum in arguments");
};
Ok(KeyMaterial::Split(reltag, blocknum))
}
}
}
}
#[derive(Debug)]
pub(super) struct SpanAttributesFromLogs(RelTag, BlockNumber);
impl std::str::FromStr for SpanAttributesFromLogs {
type Err = anyhow::Error;
fn from_str(s: &str) -> Result<Self, Self::Err> {
// accept the span separator but do not require or fail if either is missing
// "whatever{rel=1663/16389/24615 blkno=1052204 req_lsn=FFFFFFFF/FFFFFFFF}"
let (_, reltag) = s
.split_once("rel=")
.ok_or_else(|| anyhow::anyhow!("cannot find 'rel='"))?;
let reltag = reltag.split_whitespace().next().unwrap();
let (_, blocknum) = s
.split_once("blkno=")
.ok_or_else(|| anyhow::anyhow!("cannot find 'blkno='"))?;
let blocknum = blocknum.split_whitespace().next().unwrap();
let reltag = reltag
.parse()
.with_context(|| format!("parse reltag from {reltag:?}"))?;
let blocknum = blocknum
.parse()
.with_context(|| format!("parse blocknum from {blocknum:?}"))?;
Ok(Self(reltag, blocknum))
}
}
#[derive(Debug)]
#[allow(dead_code)] // debug print is used
enum RecognizedKeyKind {
DbDir,
ControlFile,
Checkpoint,
AuxFilesV1,
SlruDir(Result<SlruKind, u32>),
RelMap(RelTagish<2>),
RelDir(RelTagish<2>),
AuxFileV2(Result<AuxFileV2, utils::Hex<[u8; 16]>>),
}
#[derive(Debug, PartialEq)]
#[allow(unused)]
enum AuxFileV2 {
Recognized(&'static str, utils::Hex<[u8; 13]>),
OtherWithPrefix(&'static str, utils::Hex<[u8; 13]>),
Other(utils::Hex<[u8; 13]>),
}
impl RecognizedKeyKind {
fn new(key: Key) -> Option<Self> {
use RecognizedKeyKind::{
AuxFilesV1, Checkpoint, ControlFile, DbDir, RelDir, RelMap, SlruDir,
};
let slru_dir_kind = pageserver_api::key::slru_dir_kind(&key);
Some(match key {
pageserver_api::key::DBDIR_KEY => DbDir,
pageserver_api::key::CONTROLFILE_KEY => ControlFile,
pageserver_api::key::CHECKPOINT_KEY => Checkpoint,
pageserver_api::key::AUX_FILES_KEY => AuxFilesV1,
_ if slru_dir_kind.is_some() => SlruDir(slru_dir_kind.unwrap()),
_ if key.field1 == 0 && key.field4 == 0 && key.field5 == 0 && key.field6 == 0 => {
RelMap([key.field2, key.field3].into())
}
_ if key.field1 == 0 && key.field4 == 0 && key.field5 == 0 && key.field6 == 1 => {
RelDir([key.field2, key.field3].into())
}
_ if key.is_metadata_key() => RecognizedKeyKind::AuxFileV2(
AuxFileV2::new(key).ok_or_else(|| utils::Hex(key.to_i128().to_be_bytes())),
),
_ => return None,
})
}
}
impl AuxFileV2 {
fn new(key: Key) -> Option<AuxFileV2> {
const EMPTY_HASH: [u8; 13] = {
let mut out = [0u8; 13];
let hash = pageserver::aux_file::fnv_hash(b"").to_be_bytes();
let mut i = 3;
while i < 16 {
out[i - 3] = hash[i];
i += 1;
}
out
};
let bytes = key.to_i128().to_be_bytes();
let hash = utils::Hex(<[u8; 13]>::try_from(&bytes[3..]).unwrap());
assert_eq!(EMPTY_HASH.len(), hash.0.len());
// TODO: we could probably find the preimages for the hashes
Some(match (bytes[1], bytes[2]) {
(1, 1) => AuxFileV2::Recognized("pg_logical/mappings/", hash),
(1, 2) => AuxFileV2::Recognized("pg_logical/snapshots/", hash),
(1, 3) if hash.0 == EMPTY_HASH => {
AuxFileV2::Recognized("pg_logical/replorigin_checkpoint", hash)
}
(2, 1) => AuxFileV2::Recognized("pg_replslot/", hash),
(1, 0xff) => AuxFileV2::OtherWithPrefix("pg_logical/", hash),
(0xff, 0xff) => AuxFileV2::Other(hash),
_ => return None,
})
}
}
/// Prefix of RelTag, currently only known use cases are the two item versions.
///
/// Renders like a reltag with `/`, nothing else.
struct RelTagish<const N: usize>([u32; N]);
impl<const N: usize> From<[u32; N]> for RelTagish<N> {
fn from(val: [u32; N]) -> Self {
RelTagish(val)
}
}
impl<const N: usize> std::fmt::Debug for RelTagish<N> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
use std::fmt::Write as _;
let mut first = true;
self.0.iter().try_for_each(|x| {
if !first {
f.write_char('/')?;
}
first = false;
write!(f, "{}", x)
})
}
}
#[cfg(test)]
mod tests {
use pageserver::aux_file::encode_aux_file_key;
use super::*;
#[test]
fn hex_is_key_material() {
let m = KeyMaterial::try_from(&["000000067F0000400200DF927900FFFFFFFF"][..]).unwrap();
assert!(matches!(m, KeyMaterial::Hex(_)), "{m:?}");
}
#[test]
fn single_positional_spanalike_is_key_material() {
// why is this needed? if you are checking many, then copypaste starts to appeal
let strings = [
(line!(), "2024-05-15T15:33:49.873906Z ERROR page_service_conn_main{peer_addr=A:B}:process_query{tenant_id=C timeline_id=D}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/208101/2620_fsm blkno=2 req_lsn=0/238D98C8}: error reading relation or page version: Read error: could not find data for key 000000067F00032CE5000000000000000001 (shard ShardNumber(0)) at LSN 0/1D0A16C1, request LSN 0/238D98C8, ancestor 0/0"),
(line!(), "rel=1663/208101/2620_fsm blkno=2"),
(line!(), "rel=1663/208101/2620.1 blkno=2"),
];
let mut first: Option<Key> = None;
for (line, example) in strings {
let m = KeyMaterial::try_from(&[example][..])
.unwrap_or_else(|e| panic!("failed to parse example from line {line}: {e:?}"));
let key = Key::from(m);
if let Some(first) = first {
assert_eq!(first, key);
} else {
first = Some(key);
}
}
// not supporting this is rather accidential, but I think the input parsing is lenient
// enough already
KeyMaterial::try_from(&["1663/208101/2620_fsm 2"][..]).unwrap_err();
}
#[test]
fn multiple_spanlike_args() {
let strings = [
(line!(), &["process_query{tenant_id=C", "timeline_id=D}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/208101/2620_fsm", "blkno=2", "req_lsn=0/238D98C8}"][..]),
(line!(), &["rel=1663/208101/2620_fsm", "blkno=2"][..]),
(line!(), &["1663/208101/2620_fsm", "2"][..]),
];
let mut first: Option<Key> = None;
for (line, example) in strings {
let m = KeyMaterial::try_from(example)
.unwrap_or_else(|e| panic!("failed to parse example from line {line}: {e:?}"));
let key = Key::from(m);
if let Some(first) = first {
assert_eq!(first, key);
} else {
first = Some(key);
}
}
}
#[test]
fn recognized_auxfiles() {
use AuxFileV2::*;
let empty = [
0x2e, 0x07, 0xbb, 0x01, 0x42, 0x62, 0xb8, 0x21, 0x75, 0x62, 0x95, 0xc5, 0x8d,
];
let foobar = [
0x62, 0x79, 0x3c, 0x64, 0xbf, 0x6f, 0x0d, 0x35, 0x97, 0xba, 0x44, 0x6f, 0x18,
];
#[rustfmt::skip]
let examples = [
(line!(), "pg_logical/mappings/foobar", Recognized("pg_logical/mappings/", utils::Hex(foobar))),
(line!(), "pg_logical/snapshots/foobar", Recognized("pg_logical/snapshots/", utils::Hex(foobar))),
(line!(), "pg_logical/replorigin_checkpoint", Recognized("pg_logical/replorigin_checkpoint", utils::Hex(empty))),
(line!(), "pg_logical/foobar", OtherWithPrefix("pg_logical/", utils::Hex(foobar))),
(line!(), "pg_replslot/foobar", Recognized("pg_replslot/", utils::Hex(foobar))),
(line!(), "foobar", Other(utils::Hex(foobar))),
];
for (line, path, expected) in examples {
let key = encode_aux_file_key(path);
let recognized =
AuxFileV2::new(key).unwrap_or_else(|| panic!("line {line} example failed"));
assert_eq!(recognized, expected);
}
assert_eq!(
AuxFileV2::new(Key::from_hex("600000102000000000000000000000000000").unwrap()),
None,
"example key has one too few 0 after 6 before 1"
);
}
}

View File

@@ -6,6 +6,7 @@
mod draw_timeline_dir;
mod index_part;
mod key;
mod layer_map_analyzer;
mod layers;
@@ -61,6 +62,8 @@ enum Commands {
AnalyzeLayerMap(AnalyzeLayerMapCmd),
#[command(subcommand)]
Layer(LayerCmd),
/// Debug print a hex key found from logs
Key(key::DescribeKeyCommand),
}
/// Read and update pageserver metadata file
@@ -183,6 +186,7 @@ async fn main() -> anyhow::Result<()> {
.time_travel_recover(Some(&prefix), timestamp, done_if_after, &cancel)
.await?;
}
Commands::Key(dkc) => dkc.execute(),
};
Ok(())
}

View File

@@ -5,6 +5,7 @@ use utils::lsn::Lsn;
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Instant;
/// Ingest aux files into the pageserver.
#[derive(clap::Parser)]
@@ -88,11 +89,17 @@ async fn main_impl(args: Args) -> anyhow::Result<()> {
println!("ingested {file_cnt} files");
}
let files = mgmt_api_client
.list_aux_files(tenant_shard_id, timeline_id, Lsn(Lsn::MAX.0 - 1))
.await?;
println!("{} files found", files.len());
for _ in 0..100 {
let start = Instant::now();
let files = mgmt_api_client
.list_aux_files(tenant_shard_id, timeline_id, Lsn(Lsn::MAX.0 - 1))
.await?;
println!(
"{} files found in {}s",
files.len(),
start.elapsed().as_secs_f64()
);
}
anyhow::Ok(())
}

View File

@@ -1,6 +1,6 @@
use anyhow::Context;
use camino::Utf8PathBuf;
use pageserver_api::key::{is_rel_block_key, key_to_rel_block, Key};
use pageserver_api::key::Key;
use pageserver_api::keyspace::KeySpaceAccum;
use pageserver_api::models::PagestreamGetPageRequest;
@@ -187,7 +187,7 @@ async fn main_impl(
for r in partitioning.keys.ranges.iter() {
let mut i = r.start;
while i != r.end {
if is_rel_block_key(&i) {
if i.is_rel_block_key() {
filtered.add_key(i);
}
i = i.next();
@@ -308,9 +308,10 @@ async fn main_impl(
let r = &ranges[weights.sample(&mut rng)];
let key: i128 = rng.gen_range(r.start..r.end);
let key = Key::from_i128(key);
assert!(is_rel_block_key(&key));
let (rel_tag, block_no) =
key_to_rel_block(key).expect("we filter non-rel-block keys out above");
assert!(key.is_rel_block_key());
let (rel_tag, block_no) = key
.to_rel_block()
.expect("we filter non-rel-block keys out above");
PagestreamGetPageRequest {
request_lsn: if rng.gen_bool(args.req_latest_probability) {
Lsn::MAX

View File

@@ -13,7 +13,7 @@
use anyhow::{anyhow, Context};
use bytes::{BufMut, Bytes, BytesMut};
use fail::fail_point;
use pageserver_api::key::{key_to_slru_block, Key};
use pageserver_api::key::Key;
use postgres_ffi::pg_constants;
use std::fmt::Write as FmtWrite;
use std::time::SystemTime;
@@ -170,7 +170,7 @@ where
}
async fn add_block(&mut self, key: &Key, block: Bytes) -> Result<(), BasebackupError> {
let (kind, segno, _) = key_to_slru_block(*key)?;
let (kind, segno, _) = key.to_slru_block()?;
match kind {
SlruKind::Clog => {

View File

@@ -358,7 +358,7 @@ async fn calculate_and_log(tenant: &Tenant, cancel: &CancellationToken, ctx: &Re
// mean the synthetic size worker should terminate.
let shutting_down = matches!(
e.downcast_ref::<PageReconstructError>(),
Some(PageReconstructError::Cancelled | PageReconstructError::AncestorStopping(_))
Some(PageReconstructError::Cancelled)
);
if !shutting_down {

View File

@@ -311,7 +311,7 @@ impl DeletionList {
result.extend(
timeline_layers
.into_iter()
.map(|l| timeline_remote_path.join(&Utf8PathBuf::from(l))),
.map(|l| timeline_remote_path.join(Utf8PathBuf::from(l))),
);
}
}

View File

@@ -534,7 +534,7 @@ pub(crate) async fn disk_usage_eviction_task_iteration_impl<U: Usage>(
});
}
EvictionLayer::Secondary(layer) => {
let file_size = layer.metadata.file_size();
let file_size = layer.metadata.file_size;
js.spawn(async move {
layer
@@ -641,7 +641,7 @@ impl EvictionLayer {
pub(crate) fn get_file_size(&self) -> u64 {
match self {
Self::Attached(l) => l.layer_desc().file_size,
Self::Secondary(sl) => sl.metadata.file_size(),
Self::Secondary(sl) => sl.metadata.file_size,
}
}
}

View File

@@ -74,6 +74,7 @@ use crate::tenant::size::ModelInputs;
use crate::tenant::storage_layer::LayerAccessStatsReset;
use crate::tenant::storage_layer::LayerName;
use crate::tenant::timeline::CompactFlags;
use crate::tenant::timeline::CompactionError;
use crate::tenant::timeline::Timeline;
use crate::tenant::GetTimelineError;
use crate::tenant::SpawnMode;
@@ -183,9 +184,6 @@ impl From<PageReconstructError> for ApiError {
PageReconstructError::Cancelled => {
ApiError::InternalServerError(anyhow::anyhow!("request was cancelled"))
}
PageReconstructError::AncestorStopping(_) => {
ApiError::ResourceUnavailable(format!("{pre}").into())
}
PageReconstructError::AncestorLsnTimeout(e) => ApiError::Timeout(format!("{e}").into()),
PageReconstructError::WalRedo(pre) => ApiError::InternalServerError(pre),
}
@@ -1813,11 +1811,22 @@ async fn timeline_checkpoint_handler(
timeline
.freeze_and_flush()
.await
.map_err(ApiError::InternalServerError)?;
.map_err(|e| {
match e {
tenant::timeline::FlushLayerError::Cancelled => ApiError::ShuttingDown,
other => ApiError::InternalServerError(other.into()),
}
})?;
timeline
.compact(&cancel, flags, &ctx)
.await
.map_err(|e| ApiError::InternalServerError(e.into()))?;
.map_err(|e|
match e {
CompactionError::ShuttingDown => ApiError::ShuttingDown,
CompactionError::Other(e) => ApiError::InternalServerError(e)
}
)?;
if wait_until_uploaded {
timeline.remote_client.wait_completion().await.map_err(ApiError::InternalServerError)?;

View File

@@ -66,6 +66,7 @@ use crate::tenant::mgr::GetTenantError;
use crate::tenant::mgr::ShardResolveResult;
use crate::tenant::mgr::ShardSelector;
use crate::tenant::mgr::TenantManager;
use crate::tenant::timeline::FlushLayerError;
use crate::tenant::timeline::WaitLsnError;
use crate::tenant::GetTimelineError;
use crate::tenant::PageReconstructError;
@@ -260,6 +261,8 @@ async fn page_service_conn_main(
socket.set_timeout(Some(std::time::Duration::from_millis(socket_timeout_ms)));
let socket = std::pin::pin!(socket);
fail::fail_point!("ps::connection-start::pre-login");
// XXX: pgbackend.run() should take the connection_ctx,
// and create a child per-query context when it invokes process_query.
// But it's in a shared crate, so, we store connection_ctx inside PageServerHandler
@@ -370,7 +373,7 @@ impl From<WaitLsnError> for PageStreamError {
match value {
e @ WaitLsnError::Timeout(_) => Self::LsnTimeout(e),
WaitLsnError::Shutdown => Self::Shutdown,
WaitLsnError::BadState => Self::Reconnect("Timeline is not active".into()),
e @ WaitLsnError::BadState { .. } => Self::Reconnect(format!("{e}").into()),
}
}
}
@@ -380,7 +383,7 @@ impl From<WaitLsnError> for QueryError {
match value {
e @ WaitLsnError::Timeout(_) => Self::Other(anyhow::Error::new(e)),
WaitLsnError::Shutdown => Self::Shutdown,
WaitLsnError::BadState => Self::Reconnect,
WaitLsnError::BadState { .. } => Self::Reconnect,
}
}
}
@@ -603,6 +606,7 @@ impl PageServerHandler {
};
trace!("query: {copy_data_bytes:?}");
fail::fail_point!("ps::handle-pagerequest-message");
// Trace request if needed
if let Some(t) = tracer.as_mut() {
@@ -617,6 +621,7 @@ impl PageServerHandler {
let (response, span) = match neon_fe_msg {
PagestreamFeMessage::Exists(req) => {
fail::fail_point!("ps::handle-pagerequest-message::exists");
let span = tracing::info_span!("handle_get_rel_exists_request", rel = %req.rel, req_lsn = %req.request_lsn);
(
self.handle_get_rel_exists_request(tenant_id, timeline_id, &req, &ctx)
@@ -626,6 +631,7 @@ impl PageServerHandler {
)
}
PagestreamFeMessage::Nblocks(req) => {
fail::fail_point!("ps::handle-pagerequest-message::nblocks");
let span = tracing::info_span!("handle_get_nblocks_request", rel = %req.rel, req_lsn = %req.request_lsn);
(
self.handle_get_nblocks_request(tenant_id, timeline_id, &req, &ctx)
@@ -635,6 +641,7 @@ impl PageServerHandler {
)
}
PagestreamFeMessage::GetPage(req) => {
fail::fail_point!("ps::handle-pagerequest-message::getpage");
// shard_id is filled in by the handler
let span = tracing::info_span!("handle_get_page_at_lsn_request", rel = %req.rel, blkno = %req.blkno, req_lsn = %req.request_lsn);
(
@@ -645,6 +652,7 @@ impl PageServerHandler {
)
}
PagestreamFeMessage::DbSize(req) => {
fail::fail_point!("ps::handle-pagerequest-message::dbsize");
let span = tracing::info_span!("handle_db_size_request", dbnode = %req.dbnode, req_lsn = %req.request_lsn);
(
self.handle_db_size_request(tenant_id, timeline_id, &req, &ctx)
@@ -654,6 +662,7 @@ impl PageServerHandler {
)
}
PagestreamFeMessage::GetSlruSegment(req) => {
fail::fail_point!("ps::handle-pagerequest-message::slrusegment");
let span = tracing::info_span!("handle_get_slru_segment_request", kind = %req.kind, segno = %req.segno, req_lsn = %req.request_lsn);
(
self.handle_get_slru_segment_request(tenant_id, timeline_id, &req, &ctx)
@@ -822,7 +831,10 @@ impl PageServerHandler {
// We only want to persist the data, and it doesn't matter if it's in the
// shape of deltas or images.
info!("flushing layers");
timeline.freeze_and_flush().await?;
timeline.freeze_and_flush().await.map_err(|e| match e {
FlushLayerError::Cancelled => QueryError::Shutdown,
other => QueryError::Other(other.into()),
})?;
info!("done");
Ok(())
@@ -1505,6 +1517,7 @@ where
_pgb: &mut PostgresBackend<IO>,
_sm: &FeStartupPacket,
) -> Result<(), QueryError> {
fail::fail_point!("ps::connection-start::startup-packet");
Ok(())
}
@@ -1519,6 +1532,8 @@ where
Err(QueryError::SimulatedConnectionError)
});
fail::fail_point!("ps::connection-start::process-query");
let ctx = self.connection_ctx.attached_child();
debug!("process query {query_string:?}");
let parts = query_string.split_whitespace().collect::<Vec<_>>();

View File

@@ -17,10 +17,10 @@ use bytes::{Buf, Bytes, BytesMut};
use enum_map::Enum;
use itertools::Itertools;
use pageserver_api::key::{
dbdir_key_range, is_rel_block_key, is_slru_block_key, rel_block_to_key, rel_dir_to_key,
rel_key_range, rel_size_to_key, relmap_file_key, slru_block_to_key, slru_dir_to_key,
slru_segment_key_range, slru_segment_size_to_key, twophase_file_key, twophase_key_range,
AUX_FILES_KEY, CHECKPOINT_KEY, CONTROLFILE_KEY, DBDIR_KEY, TWOPHASEDIR_KEY,
dbdir_key_range, rel_block_to_key, rel_dir_to_key, rel_key_range, rel_size_to_key,
relmap_file_key, slru_block_to_key, slru_dir_to_key, slru_segment_key_range,
slru_segment_size_to_key, twophase_file_key, twophase_key_range, AUX_FILES_KEY, CHECKPOINT_KEY,
CONTROLFILE_KEY, DBDIR_KEY, TWOPHASEDIR_KEY,
};
use pageserver_api::keyspace::SparseKeySpace;
use pageserver_api::models::AuxFilePolicy;
@@ -78,11 +78,19 @@ pub enum LsnForTimestamp {
}
#[derive(Debug, thiserror::Error)]
pub enum CalculateLogicalSizeError {
pub(crate) enum CalculateLogicalSizeError {
#[error("cancelled")]
Cancelled,
/// Something went wrong while reading the metadata we use to calculate logical size
/// Note that cancellation variants of `PageReconstructError` are transformed to [`Self::Cancelled`]
/// in the `From` implementation for this variant.
#[error(transparent)]
Other(#[from] anyhow::Error),
PageRead(PageReconstructError),
/// Something went wrong deserializing metadata that we read to calculate logical size
#[error("decode error: {0}")]
Decode(#[from] DeserializeError),
}
#[derive(Debug, thiserror::Error)]
@@ -107,10 +115,8 @@ impl From<PageReconstructError> for CollectKeySpaceError {
impl From<PageReconstructError> for CalculateLogicalSizeError {
fn from(pre: PageReconstructError) -> Self {
match pre {
PageReconstructError::AncestorStopping(_) | PageReconstructError::Cancelled => {
Self::Cancelled
}
_ => Self::Other(pre.into()),
PageReconstructError::Cancelled => Self::Cancelled,
_ => Self::PageRead(pre),
}
}
}
@@ -763,7 +769,7 @@ impl Timeline {
/// # Cancel-Safety
///
/// This method is cancellation-safe.
pub async fn get_current_logical_size_non_incremental(
pub(crate) async fn get_current_logical_size_non_incremental(
&self,
lsn: Lsn,
ctx: &RequestContext,
@@ -772,7 +778,7 @@ impl Timeline {
// Fetch list of database dirs and iterate them
let buf = self.get(DBDIR_KEY, lsn, ctx).await?;
let dbdir = DbDirectory::des(&buf).context("deserialize db directory")?;
let dbdir = DbDirectory::des(&buf)?;
let mut total_size: u64 = 0;
for (spcnode, dbnode) in dbdir.dbdirs.keys() {
@@ -1552,7 +1558,7 @@ impl<'a> DatadirModification<'a> {
self.tline.aux_file_size_estimator.on_add(content.len());
new_files.push((path, content));
}
(None, true) => anyhow::bail!("removing non-existing aux file: {}", path),
(None, true) => warn!("removing non-existing aux file: {}", path),
}
let new_val = aux_file::encode_file_value(&new_files)?;
self.put(key, Value::Image(new_val.into()));
@@ -1606,8 +1612,7 @@ impl<'a> DatadirModification<'a> {
aux_files.dir = Some(dir);
}
Err(
e @ (PageReconstructError::AncestorStopping(_)
| PageReconstructError::Cancelled
e @ (PageReconstructError::Cancelled
| PageReconstructError::AncestorLsnTimeout(_)),
) => {
// Important that we do not interpret a shutdown error as "not found" and thereby
@@ -1679,7 +1684,7 @@ impl<'a> DatadirModification<'a> {
let mut retained_pending_updates = HashMap::<_, Vec<_>>::new();
for (key, values) in self.pending_updates.drain() {
for (lsn, value) in values {
if is_rel_block_key(&key) || is_slru_block_key(key) {
if key.is_rel_block_key() || key.is_slru_block_key() {
// This bails out on first error without modifying pending_updates.
// That's Ok, cf this function's doc comment.
writer.put(key, lsn, &value, ctx).await?;

View File

@@ -42,6 +42,7 @@ use utils::completion;
use utils::crashsafe::path_with_suffix_extension;
use utils::failpoint_support;
use utils::fs_ext;
use utils::pausable_failpoint;
use utils::sync::gate::Gate;
use utils::sync::gate::GateGuard;
use utils::timeout::timeout_cancellable;
@@ -122,32 +123,6 @@ use utils::{
lsn::{Lsn, RecordLsn},
};
/// Declare a failpoint that can use the `pause` failpoint action.
/// We don't want to block the executor thread, hence, spawn_blocking + await.
macro_rules! pausable_failpoint {
($name:literal) => {
if cfg!(feature = "testing") {
tokio::task::spawn_blocking({
let current = tracing::Span::current();
move || {
let _entered = current.entered();
tracing::info!("at failpoint {}", $name);
fail::fail_point!($name);
}
})
.await
.expect("spawn_blocking");
}
};
($name:literal, $cond:expr) => {
if cfg!(feature = "testing") {
if $cond {
pausable_failpoint!($name)
}
}
};
}
pub mod blob_io;
pub mod block_io;
pub mod vectored_blob_io;
@@ -512,6 +487,33 @@ enum CreateTimelineCause {
Delete,
}
#[derive(thiserror::Error, Debug)]
pub(crate) enum GcError {
// The tenant is shutting down
#[error("tenant shutting down")]
TenantCancelled,
// The tenant is shutting down
#[error("timeline shutting down")]
TimelineCancelled,
// The tenant is in a state inelegible to run GC
#[error("not active")]
NotActive,
// A requested GC cutoff LSN was invalid, for example it tried to move backwards
#[error("not active")]
BadLsn { why: String },
// A remote storage error while scheduling updates after compaction
#[error(transparent)]
Remote(anyhow::Error),
// If GC was invoked for a particular timeline, this error means it didn't exist
#[error("timeline not found")]
TimelineNotFound,
}
impl Tenant {
/// Yet another helper for timeline initialization.
///
@@ -1418,6 +1420,36 @@ impl Tenant {
Ok(tl)
}
/// Helper for unit tests to create a timeline with some pre-loaded states.
#[cfg(test)]
#[allow(clippy::too_many_arguments)]
pub async fn create_test_timeline_with_layers(
&self,
new_timeline_id: TimelineId,
initdb_lsn: Lsn,
pg_version: u32,
ctx: &RequestContext,
delta_layer_desc: Vec<Vec<(pageserver_api::key::Key, Lsn, crate::repository::Value)>>,
image_layer_desc: Vec<(Lsn, Vec<(pageserver_api::key::Key, bytes::Bytes)>)>,
end_lsn: Lsn,
) -> anyhow::Result<Arc<Timeline>> {
let tline = self
.create_test_timeline(new_timeline_id, initdb_lsn, pg_version, ctx)
.await?;
tline.force_advance_lsn(end_lsn);
for deltas in delta_layer_desc {
tline
.force_create_delta_layer(deltas, Some(initdb_lsn), ctx)
.await?;
}
for (lsn, images) in image_layer_desc {
tline
.force_create_image_layer(lsn, images, Some(initdb_lsn), ctx)
.await?;
}
Ok(tline)
}
/// Create a new timeline.
///
/// Returns the new timeline ID and reference to its Timeline object.
@@ -1532,7 +1564,7 @@ impl Tenant {
.wait_lsn(*lsn, timeline::WaitLsnWaiter::Tenant, ctx)
.await
.map_err(|e| match e {
e @ (WaitLsnError::Timeout(_) | WaitLsnError::BadState) => {
e @ (WaitLsnError::Timeout(_) | WaitLsnError::BadState { .. }) => {
CreateTimelineError::AncestorLsn(anyhow::anyhow!(e))
}
WaitLsnError::Shutdown => CreateTimelineError::ShuttingDown,
@@ -1600,24 +1632,23 @@ impl Tenant {
/// GC cutoff point is determined conservatively by either `horizon` and `pitr`, whichever
/// requires more history to be retained.
//
pub async fn gc_iteration(
pub(crate) async fn gc_iteration(
&self,
target_timeline_id: Option<TimelineId>,
horizon: u64,
pitr: Duration,
cancel: &CancellationToken,
ctx: &RequestContext,
) -> anyhow::Result<GcResult> {
) -> Result<GcResult, GcError> {
// Don't start doing work during shutdown
if let TenantState::Stopping { .. } = self.current_state() {
return Ok(GcResult::default());
}
// there is a global allowed_error for this
anyhow::ensure!(
self.is_active(),
"Cannot run GC iteration on inactive tenant"
);
if !self.is_active() {
return Err(GcError::NotActive);
}
{
let conf = self.tenant_conf.load();
@@ -2785,28 +2816,13 @@ impl Tenant {
pitr: Duration,
cancel: &CancellationToken,
ctx: &RequestContext,
) -> anyhow::Result<GcResult> {
) -> Result<GcResult, GcError> {
let mut totals: GcResult = Default::default();
let now = Instant::now();
let gc_timelines = match self
let gc_timelines = self
.refresh_gc_info_internal(target_timeline_id, horizon, pitr, cancel, ctx)
.await
{
Ok(result) => result,
Err(e) => {
if let Some(PageReconstructError::Cancelled) =
e.downcast_ref::<PageReconstructError>()
{
// Handle cancellation
totals.elapsed = now.elapsed();
return Ok(totals);
} else {
// Propagate other errors
return Err(e);
}
}
};
.await?;
failpoint_support::sleep_millis_async!("gc_iteration_internal_after_getting_gc_timelines");
@@ -2831,7 +2847,19 @@ impl Tenant {
// made.
break;
}
let result = timeline.gc().await?;
let result = match timeline.gc().await {
Err(GcError::TimelineCancelled) => {
if target_timeline_id.is_some() {
// If we were targetting this specific timeline, surface cancellation to caller
return Err(GcError::TimelineCancelled);
} else {
// A timeline may be shutting down independently of the tenant's lifecycle: we should
// skip past this and proceed to try GC on other timelines.
continue;
}
}
r => r?,
};
totals += result;
}
@@ -2844,11 +2872,11 @@ impl Tenant {
/// [`Tenant::get_gc_horizon`].
///
/// This is usually executed as part of periodic gc, but can now be triggered more often.
pub async fn refresh_gc_info(
pub(crate) async fn refresh_gc_info(
&self,
cancel: &CancellationToken,
ctx: &RequestContext,
) -> anyhow::Result<Vec<Arc<Timeline>>> {
) -> Result<Vec<Arc<Timeline>>, GcError> {
// since this method can now be called at different rates than the configured gc loop, it
// might be that these configuration values get applied faster than what it was previously,
// since these were only read from the gc task.
@@ -2869,7 +2897,7 @@ impl Tenant {
pitr: Duration,
cancel: &CancellationToken,
ctx: &RequestContext,
) -> anyhow::Result<Vec<Arc<Timeline>>> {
) -> Result<Vec<Arc<Timeline>>, GcError> {
// before taking the gc_cs lock, do the heavier weight finding of gc_cutoff points for
// currently visible timelines.
let timelines = self
@@ -2906,8 +2934,8 @@ impl Tenant {
}
}
if !self.is_active() {
anyhow::bail!("shutting down");
if !self.is_active() || self.cancel.is_cancelled() {
return Err(GcError::TenantCancelled);
}
// grab mutex to prevent new timelines from being created here; avoid doing long operations
@@ -2916,19 +2944,19 @@ impl Tenant {
// Scan all timelines. For each timeline, remember the timeline ID and
// the branch point where it was created.
let (all_branchpoints, timeline_ids): (BTreeSet<(TimelineId, Lsn)>, _) = {
let (all_branchpoints, timelines): (BTreeSet<(TimelineId, Lsn)>, _) = {
let timelines = self.timelines.lock().unwrap();
let mut all_branchpoints = BTreeSet::new();
let timeline_ids = {
let timelines = {
if let Some(target_timeline_id) = target_timeline_id.as_ref() {
if timelines.get(target_timeline_id).is_none() {
bail!("gc target timeline does not exist")
return Err(GcError::TimelineNotFound);
}
};
timelines
.iter()
.map(|(timeline_id, timeline_entry)| {
.map(|(_timeline_id, timeline_entry)| {
if let Some(ancestor_timeline_id) =
&timeline_entry.get_ancestor_timeline_id()
{
@@ -2950,33 +2978,28 @@ impl Tenant {
}
}
*timeline_id
timeline_entry.clone()
})
.collect::<Vec<_>>()
};
(all_branchpoints, timeline_ids)
(all_branchpoints, timelines)
};
// Ok, we now know all the branch points.
// Update the GC information for each timeline.
let mut gc_timelines = Vec::with_capacity(timeline_ids.len());
for timeline_id in timeline_ids {
// Timeline is known to be local and loaded.
let timeline = self
.get_timeline(timeline_id, false)
.with_context(|| format!("Timeline {timeline_id} was not found"))?;
let mut gc_timelines = Vec::with_capacity(timelines.len());
for timeline in timelines {
// If target_timeline is specified, ignore all other timelines
if let Some(target_timeline_id) = target_timeline_id {
if timeline_id != target_timeline_id {
if timeline.timeline_id != target_timeline_id {
continue;
}
}
let branchpoints: Vec<Lsn> = all_branchpoints
.range((
Included((timeline_id, Lsn(0))),
Included((timeline_id, Lsn(u64::MAX))),
Included((timeline.timeline_id, Lsn(0))),
Included((timeline.timeline_id, Lsn(u64::MAX))),
))
.map(|&x| x.1)
.collect();
@@ -2984,7 +3007,7 @@ impl Tenant {
{
let mut target = timeline.gc_info.write().unwrap();
match gc_cutoffs.remove(&timeline_id) {
match gc_cutoffs.remove(&timeline.timeline_id) {
Some(cutoffs) => {
*target = GcInfo {
retain_lsns: branchpoints,
@@ -3017,17 +3040,53 @@ impl Tenant {
&self,
src_timeline: &Arc<Timeline>,
dst_id: TimelineId,
start_lsn: Option<Lsn>,
ancestor_lsn: Option<Lsn>,
ctx: &RequestContext,
) -> Result<Arc<Timeline>, CreateTimelineError> {
let create_guard = self.create_timeline_create_guard(dst_id).unwrap();
let tl = self
.branch_timeline_impl(src_timeline, dst_id, start_lsn, create_guard, ctx)
.branch_timeline_impl(src_timeline, dst_id, ancestor_lsn, create_guard, ctx)
.await?;
tl.set_state(TimelineState::Active);
Ok(tl)
}
/// Helper for unit tests to branch a timeline with some pre-loaded states.
#[cfg(test)]
#[allow(clippy::too_many_arguments)]
pub async fn branch_timeline_test_with_layers(
&self,
src_timeline: &Arc<Timeline>,
dst_id: TimelineId,
ancestor_lsn: Option<Lsn>,
ctx: &RequestContext,
delta_layer_desc: Vec<Vec<(pageserver_api::key::Key, Lsn, crate::repository::Value)>>,
image_layer_desc: Vec<(Lsn, Vec<(pageserver_api::key::Key, bytes::Bytes)>)>,
end_lsn: Lsn,
) -> anyhow::Result<Arc<Timeline>> {
let tline = self
.branch_timeline_test(src_timeline, dst_id, ancestor_lsn, ctx)
.await?;
let ancestor_lsn = if let Some(ancestor_lsn) = ancestor_lsn {
ancestor_lsn
} else {
tline.get_last_record_lsn()
};
assert!(end_lsn >= ancestor_lsn);
tline.force_advance_lsn(end_lsn);
for deltas in delta_layer_desc {
tline
.force_create_delta_layer(deltas, Some(ancestor_lsn), ctx)
.await?;
}
for (lsn, images) in image_layer_desc {
tline
.force_create_image_layer(lsn, images, Some(ancestor_lsn), ctx)
.await?;
}
Ok(tline)
}
/// Branch an existing timeline.
///
/// The caller is responsible for activating the returned timeline.
@@ -4179,7 +4238,7 @@ mod tests {
.await?;
writer.finish_write(lsn);
}
tline.freeze_and_flush().await
tline.freeze_and_flush().await.map_err(|e| e.into())
}
#[tokio::test]
@@ -4333,9 +4392,10 @@ mod tests {
// This needs to traverse to the parent, and fails.
let err = newtline.get(*TEST_KEY, Lsn(0x50), &ctx).await.unwrap_err();
assert!(err
.to_string()
.contains("will not become active. Current state: Broken"));
assert!(err.to_string().starts_with(&format!(
"Bad state on timeline {}: Broken",
tline.timeline_id
)));
Ok(())
}
@@ -6230,75 +6290,36 @@ mod tests {
async fn test_vectored_missing_data_key_reads() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_vectored_missing_data_key_reads")?;
let (tenant, ctx) = harness.load().await;
let tline = tenant
.create_test_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
.await?;
let cancel = CancellationToken::new();
let base_key = Key::from_hex("000000000033333333444444445500000000").unwrap();
let base_key_child = Key::from_hex("000000000033333333444444445500000001").unwrap();
let base_key_nonexist = Key::from_hex("000000000033333333444444445500000002").unwrap();
let mut lsn = Lsn(0x20);
{
let mut writer = tline.writer().await;
writer
.put(base_key, lsn, &Value::Image(test_img("data key 1")), &ctx)
.await?;
writer.finish_write(lsn);
drop(writer);
tline.freeze_and_flush().await?; // this will create a image layer
}
let tline = tenant
.create_test_timeline_with_layers(
TIMELINE_ID,
Lsn(0x10),
DEFAULT_PG_VERSION,
&ctx,
Vec::new(), // delta layers
vec![(Lsn(0x20), vec![(base_key, test_img("data key 1"))])], // image layers
Lsn(0x20), // it's fine to not advance LSN to 0x30 while using 0x30 to get below because `get_vectored_impl` does not wait for LSN
)
.await?;
let child = tenant
.branch_timeline_test(&tline, NEW_TIMELINE_ID, Some(lsn), &ctx)
.branch_timeline_test_with_layers(
&tline,
NEW_TIMELINE_ID,
Some(Lsn(0x20)),
&ctx,
Vec::new(), // delta layers
vec![(Lsn(0x30), vec![(base_key_child, test_img("data key 2"))])], // image layers
Lsn(0x30),
)
.await
.unwrap();
lsn.0 += 0x10;
{
let mut writer = child.writer().await;
writer
.put(
base_key_child,
lsn,
&Value::Image(test_img("data key 2")),
&ctx,
)
.await?;
writer.finish_write(lsn);
drop(writer);
child.freeze_and_flush().await?; // this will create a delta
{
// update the partitioning to include the test key space, otherwise they
// will be dropped by image layer creation
let mut guard = child.partitioning.lock().await;
let ((partitioning, _), partition_lsn) = &mut *guard;
partitioning
.parts
.push(KeySpace::single(base_key..base_key_nonexist)); // exclude the nonexist key
*partition_lsn = lsn;
}
child
.compact(
&cancel,
{
let mut set = EnumSet::empty();
set.insert(CompactFlags::ForceImageLayerCreation);
set
},
&ctx,
)
.await?; // force create an image layer for the keys, TODO: check if the image layer is created
}
async fn get_vectored_impl_wrapper(
tline: &Arc<Timeline>,
key: Key,
@@ -6320,6 +6341,8 @@ mod tests {
}))
}
let lsn = Lsn(0x30);
// test vectored get on parent timeline
assert_eq!(
get_vectored_impl_wrapper(&tline, base_key, lsn, &ctx).await?,
@@ -6357,94 +6380,42 @@ mod tests {
#[tokio::test]
async fn test_vectored_missing_metadata_key_reads() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_vectored_missing_metadata_key_reads")?;
let harness = TenantHarness::create("test_vectored_missing_data_key_reads")?;
let (tenant, ctx) = harness.load().await;
let base_key = Key::from_hex("620000000033333333444444445500000000").unwrap();
let base_key_child = Key::from_hex("620000000033333333444444445500000001").unwrap();
let base_key_nonexist = Key::from_hex("620000000033333333444444445500000002").unwrap();
assert_eq!(base_key.field1, AUX_KEY_PREFIX); // in case someone accidentally changed the prefix...
let tline = tenant
.create_test_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
.create_test_timeline_with_layers(
TIMELINE_ID,
Lsn(0x10),
DEFAULT_PG_VERSION,
&ctx,
Vec::new(), // delta layers
vec![(Lsn(0x20), vec![(base_key, test_img("metadata key 1"))])], // image layers
Lsn(0x20), // it's fine to not advance LSN to 0x30 while using 0x30 to get below because `get_vectored_impl` does not wait for LSN
)
.await?;
let cancel = CancellationToken::new();
let mut base_key = Key::from_hex("000000000033333333444444445500000000").unwrap();
let mut base_key_child = Key::from_hex("000000000033333333444444445500000001").unwrap();
let mut base_key_nonexist = Key::from_hex("000000000033333333444444445500000002").unwrap();
base_key.field1 = AUX_KEY_PREFIX;
base_key_child.field1 = AUX_KEY_PREFIX;
base_key_nonexist.field1 = AUX_KEY_PREFIX;
let mut lsn = Lsn(0x20);
{
let mut writer = tline.writer().await;
writer
.put(
base_key,
lsn,
&Value::Image(test_img("metadata key 1")),
&ctx,
)
.await?;
writer.finish_write(lsn);
drop(writer);
tline.freeze_and_flush().await?; // this will create an image layer
tline
.compact(
&cancel,
{
let mut set = EnumSet::empty();
set.insert(CompactFlags::ForceImageLayerCreation);
set.insert(CompactFlags::ForceRepartition);
set
},
&ctx,
)
.await?; // force create an image layer for metadata keys
tenant
.gc_iteration(Some(tline.timeline_id), 0, Duration::ZERO, &cancel, &ctx)
.await?;
}
let child = tenant
.branch_timeline_test(&tline, NEW_TIMELINE_ID, Some(lsn), &ctx)
.branch_timeline_test_with_layers(
&tline,
NEW_TIMELINE_ID,
Some(Lsn(0x20)),
&ctx,
Vec::new(), // delta layers
vec![(
Lsn(0x30),
vec![(base_key_child, test_img("metadata key 2"))],
)], // image layers
Lsn(0x30),
)
.await
.unwrap();
lsn.0 += 0x10;
{
let mut writer = child.writer().await;
writer
.put(
base_key_child,
lsn,
&Value::Image(test_img("metadata key 2")),
&ctx,
)
.await?;
writer.finish_write(lsn);
drop(writer);
child.freeze_and_flush().await?;
child
.compact(
&cancel,
{
let mut set = EnumSet::empty();
set.insert(CompactFlags::ForceImageLayerCreation);
set.insert(CompactFlags::ForceRepartition);
set
},
&ctx,
)
.await?; // force create an image layer for metadata keys
tenant
.gc_iteration(Some(child.timeline_id), 0, Duration::ZERO, &cancel, &ctx)
.await?;
}
async fn get_vectored_impl_wrapper(
tline: &Arc<Timeline>,
key: Key,
@@ -6466,6 +6437,8 @@ mod tests {
}))
}
let lsn = Lsn(0x30);
// test vectored get on parent timeline
assert_eq!(
get_vectored_impl_wrapper(&tline, base_key, lsn, &ctx).await?,
@@ -6496,4 +6469,208 @@ mod tests {
Ok(())
}
async fn get_vectored_impl_wrapper(
tline: &Arc<Timeline>,
key: Key,
lsn: Lsn,
ctx: &RequestContext,
) -> Result<Option<Bytes>, GetVectoredError> {
let mut reconstruct_state = ValuesReconstructState::new();
let mut res = tline
.get_vectored_impl(
KeySpace::single(key..key.next()),
lsn,
&mut reconstruct_state,
ctx,
)
.await?;
Ok(res.pop_last().map(|(k, v)| {
assert_eq!(k, key);
v.unwrap()
}))
}
#[tokio::test]
async fn test_metadata_tombstone_reads() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_metadata_tombstone_reads")?;
let (tenant, ctx) = harness.load().await;
let key0 = Key::from_hex("620000000033333333444444445500000000").unwrap();
let key1 = Key::from_hex("620000000033333333444444445500000001").unwrap();
let key2 = Key::from_hex("620000000033333333444444445500000002").unwrap();
let key3 = Key::from_hex("620000000033333333444444445500000003").unwrap();
// We emulate the situation that the compaction algorithm creates an image layer that removes the tombstones
// Lsn 0x30 key0, key3, no key1+key2
// Lsn 0x20 key1+key2 tomestones
// Lsn 0x10 key1 in image, key2 in delta
let tline = tenant
.create_test_timeline_with_layers(
TIMELINE_ID,
Lsn(0x10),
DEFAULT_PG_VERSION,
&ctx,
// delta layers
vec![
vec![(key2, Lsn(0x10), Value::Image(test_img("metadata key 2")))],
vec![(key1, Lsn(0x20), Value::Image(Bytes::new()))],
vec![(key2, Lsn(0x20), Value::Image(Bytes::new()))],
],
// image layers
vec![
(Lsn(0x10), vec![(key1, test_img("metadata key 1"))]),
(
Lsn(0x30),
vec![
(key0, test_img("metadata key 0")),
(key3, test_img("metadata key 3")),
],
),
],
Lsn(0x30),
)
.await?;
let lsn = Lsn(0x30);
let old_lsn = Lsn(0x20);
assert_eq!(
get_vectored_impl_wrapper(&tline, key0, lsn, &ctx).await?,
Some(test_img("metadata key 0"))
);
assert_eq!(
get_vectored_impl_wrapper(&tline, key1, lsn, &ctx).await?,
None,
);
assert_eq!(
get_vectored_impl_wrapper(&tline, key2, lsn, &ctx).await?,
None,
);
assert_eq!(
get_vectored_impl_wrapper(&tline, key1, old_lsn, &ctx).await?,
Some(Bytes::new()),
);
assert_eq!(
get_vectored_impl_wrapper(&tline, key2, old_lsn, &ctx).await?,
Some(Bytes::new()),
);
assert_eq!(
get_vectored_impl_wrapper(&tline, key3, lsn, &ctx).await?,
Some(test_img("metadata key 3"))
);
Ok(())
}
#[tokio::test]
async fn test_metadata_tombstone_image_creation() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_metadata_tombstone_image_creation")?;
let (tenant, ctx) = harness.load().await;
let key0 = Key::from_hex("620000000033333333444444445500000000").unwrap();
let key1 = Key::from_hex("620000000033333333444444445500000001").unwrap();
let key2 = Key::from_hex("620000000033333333444444445500000002").unwrap();
let key3 = Key::from_hex("620000000033333333444444445500000003").unwrap();
let tline = tenant
.create_test_timeline_with_layers(
TIMELINE_ID,
Lsn(0x10),
DEFAULT_PG_VERSION,
&ctx,
// delta layers
vec![
vec![(key2, Lsn(0x10), Value::Image(test_img("metadata key 2")))],
vec![(key1, Lsn(0x20), Value::Image(Bytes::new()))],
vec![(key2, Lsn(0x20), Value::Image(Bytes::new()))],
vec![
(key0, Lsn(0x30), Value::Image(test_img("metadata key 0"))),
(key3, Lsn(0x30), Value::Image(test_img("metadata key 3"))),
],
],
// image layers
vec![(Lsn(0x10), vec![(key1, test_img("metadata key 1"))])],
Lsn(0x30),
)
.await?;
let cancel = CancellationToken::new();
tline
.compact(
&cancel,
{
let mut flags = EnumSet::new();
flags.insert(CompactFlags::ForceImageLayerCreation);
flags.insert(CompactFlags::ForceRepartition);
flags
},
&ctx,
)
.await?;
// Image layers are created at last_record_lsn
let images = tline
.inspect_image_layers(Lsn(0x30), &ctx)
.await?
.into_iter()
.filter(|(k, _)| k.is_metadata_key())
.collect::<Vec<_>>();
assert_eq!(images.len(), 2); // the image layer should only contain two existing keys, tombstones should be removed.
Ok(())
}
#[tokio::test]
async fn test_metadata_tombstone_empty_image_creation() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_metadata_tombstone_image_creation")?;
let (tenant, ctx) = harness.load().await;
let key1 = Key::from_hex("620000000033333333444444445500000001").unwrap();
let key2 = Key::from_hex("620000000033333333444444445500000002").unwrap();
let tline = tenant
.create_test_timeline_with_layers(
TIMELINE_ID,
Lsn(0x10),
DEFAULT_PG_VERSION,
&ctx,
// delta layers
vec![
vec![(key2, Lsn(0x10), Value::Image(test_img("metadata key 2")))],
vec![(key1, Lsn(0x20), Value::Image(Bytes::new()))],
vec![(key2, Lsn(0x20), Value::Image(Bytes::new()))],
],
// image layers
vec![(Lsn(0x10), vec![(key1, test_img("metadata key 1"))])],
Lsn(0x30),
)
.await?;
let cancel = CancellationToken::new();
tline
.compact(
&cancel,
{
let mut flags = EnumSet::new();
flags.insert(CompactFlags::ForceImageLayerCreation);
flags.insert(CompactFlags::ForceRepartition);
flags
},
&ctx,
)
.await?;
// Image layers are created at last_record_lsn
let images = tline
.inspect_image_layers(Lsn(0x30), &ctx)
.await?
.into_iter()
.filter(|(k, _)| k.is_metadata_key())
.collect::<Vec<_>>();
assert_eq!(images.len(), 0); // the image layer should not contain tombstones, or it is not created
Ok(())
}
}

View File

@@ -8,7 +8,7 @@ use tokio::sync::OwnedMutexGuard;
use tokio_util::sync::CancellationToken;
use tracing::{error, instrument, Instrument};
use utils::{backoff, completion, crashsafe, fs_ext, id::TimelineId};
use utils::{backoff, completion, crashsafe, fs_ext, id::TimelineId, pausable_failpoint};
use crate::{
config::PageServerConf,

View File

@@ -45,7 +45,7 @@ use crate::tenant::delete::DeleteTenantFlow;
use crate::tenant::span::debug_assert_current_span_has_tenant_id;
use crate::tenant::storage_layer::inmemory_layer;
use crate::tenant::timeline::ShutdownMode;
use crate::tenant::{AttachedTenantConf, SpawnMode, Tenant, TenantState};
use crate::tenant::{AttachedTenantConf, GcError, SpawnMode, Tenant, TenantState};
use crate::{InitializationOrder, IGNORED_TENANT_FILE_NAME, TEMP_FILE_SUFFIX};
use utils::crashsafe::path_with_suffix_extension;
@@ -2833,7 +2833,13 @@ pub(crate) async fn immediate_gc(
}
}
result.map_err(ApiError::InternalServerError)
result.map_err(|e| match e {
GcError::TenantCancelled | GcError::TimelineCancelled => ApiError::ShuttingDown,
GcError::TimelineNotFound => {
ApiError::NotFound(anyhow::anyhow!("Timeline not found").into())
}
other => ApiError::InternalServerError(anyhow::anyhow!(other)),
})
}
#[cfg(test)]

View File

@@ -197,6 +197,7 @@ pub(crate) use upload::upload_initdb_dir;
use utils::backoff::{
self, exponential_backoff, DEFAULT_BASE_BACKOFF_SECONDS, DEFAULT_MAX_BACKOFF_SECONDS,
};
use utils::pausable_failpoint;
use std::collections::{HashMap, VecDeque};
use std::sync::atomic::{AtomicU32, Ordering};
@@ -1192,7 +1193,7 @@ impl RemoteTimelineClient {
&self.storage_impl,
uploaded.local_path(),
&remote_path,
uploaded.metadata().file_size(),
uploaded.metadata().file_size,
cancel,
)
.await
@@ -1573,7 +1574,7 @@ impl RemoteTimelineClient {
&self.storage_impl,
local_path,
&remote_path,
layer_metadata.file_size(),
layer_metadata.file_size,
&self.cancel,
)
.measure_remote_op(
@@ -1768,7 +1769,7 @@ impl RemoteTimelineClient {
UploadOp::UploadLayer(_, m) => (
RemoteOpFileKind::Layer,
RemoteOpKind::Upload,
RemoteTimelineClientMetricsCallTrackSize::Bytes(m.file_size()),
RemoteTimelineClientMetricsCallTrackSize::Bytes(m.file_size),
),
UploadOp::UploadMetadata(_, _) => (
RemoteOpFileKind::Index,

View File

@@ -84,7 +84,7 @@ pub async fn download_layer_file<'a>(
)
.await?;
let expected = layer_metadata.file_size();
let expected = layer_metadata.file_size;
if expected != bytes_amount {
return Err(DownloadError::Other(anyhow!(
"According to layer file metadata should have downloaded {expected} bytes but downloaded {bytes_amount} bytes into file {temp_file_path:?}",

View File

@@ -17,46 +17,6 @@ use pageserver_api::shard::ShardIndex;
use utils::lsn::Lsn;
/// Metadata gathered for each of the layer files.
///
/// Fields have to be `Option`s because remote [`IndexPart`]'s can be from different version, which
/// might have less or more metadata depending if upgrading or rolling back an upgrade.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
//#[cfg_attr(test, derive(Default))]
pub struct LayerFileMetadata {
file_size: u64,
pub(crate) generation: Generation,
pub(crate) shard: ShardIndex,
}
impl From<&'_ IndexLayerMetadata> for LayerFileMetadata {
fn from(other: &IndexLayerMetadata) -> Self {
LayerFileMetadata {
file_size: other.file_size,
generation: other.generation,
shard: other.shard,
}
}
}
impl LayerFileMetadata {
pub fn new(file_size: u64, generation: Generation, shard: ShardIndex) -> Self {
LayerFileMetadata {
file_size,
generation,
shard,
}
}
pub fn file_size(&self) -> u64 {
self.file_size
}
}
// TODO seems like another part of the remote storage file format
// compatibility issue, see https://github.com/neondatabase/neon/issues/3072
/// In-memory representation of an `index_part.json` file
///
/// Contains the data about all files in the timeline, present remotely and its metadata.
@@ -77,7 +37,7 @@ pub struct IndexPart {
///
/// Older versions of `IndexPart` will not have this property or have only a part of metadata
/// that latest version stores.
pub layer_metadata: HashMap<LayerName, IndexLayerMetadata>,
pub layer_metadata: HashMap<LayerName, LayerFileMetadata>,
// 'disk_consistent_lsn' is a copy of the 'disk_consistent_lsn' in the metadata.
// It's duplicated for convenience when reading the serialized structure, but is
@@ -127,10 +87,7 @@ impl IndexPart {
lineage: Lineage,
last_aux_file_policy: Option<AuxFilePolicy>,
) -> Self {
let layer_metadata = layers_and_metadata
.iter()
.map(|(k, v)| (k.to_owned(), IndexLayerMetadata::from(v)))
.collect();
let layer_metadata = layers_and_metadata.clone();
Self {
version: Self::LATEST_VERSION,
@@ -194,9 +151,12 @@ impl From<&UploadQueueInitialized> for IndexPart {
}
}
/// Serialized form of [`LayerFileMetadata`].
#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize)]
pub struct IndexLayerMetadata {
/// Metadata gathered for each of the layer files.
///
/// Fields have to be `Option`s because remote [`IndexPart`]'s can be from different version, which
/// might have less or more metadata depending if upgrading or rolling back an upgrade.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub struct LayerFileMetadata {
pub file_size: u64,
#[serde(default = "Generation::none")]
@@ -208,12 +168,12 @@ pub struct IndexLayerMetadata {
pub shard: ShardIndex,
}
impl From<&LayerFileMetadata> for IndexLayerMetadata {
fn from(other: &LayerFileMetadata) -> Self {
IndexLayerMetadata {
file_size: other.file_size,
generation: other.generation,
shard: other.shard,
impl LayerFileMetadata {
pub fn new(file_size: u64, generation: Generation, shard: ShardIndex) -> Self {
LayerFileMetadata {
file_size,
generation,
shard,
}
}
}
@@ -307,12 +267,12 @@ mod tests {
// note this is not verified, could be anything, but exists for humans debugging.. could be the git version instead?
version: 1,
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
@@ -349,12 +309,12 @@ mod tests {
// note this is not verified, could be anything, but exists for humans debugging.. could be the git version instead?
version: 1,
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
@@ -392,12 +352,12 @@ mod tests {
// note this is not verified, could be anything, but exists for humans debugging.. could be the git version instead?
version: 2,
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
@@ -480,12 +440,12 @@ mod tests {
let expected = IndexPart {
version: 4,
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,
@@ -522,12 +482,12 @@ mod tests {
let expected = IndexPart {
version: 5,
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000014EF420-00000000014EF499".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000014EF420-00000000014EF499".parse().unwrap(), LayerFileMetadata {
file_size: 23289856,
generation: Generation::new(1),
shard: ShardIndex::unsharded(),
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000014EF499-00000000015A7619".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000014EF499-00000000015A7619".parse().unwrap(), LayerFileMetadata {
file_size: 1015808,
generation: Generation::new(1),
shard: ShardIndex::unsharded(),
@@ -569,12 +529,12 @@ mod tests {
let expected = IndexPart {
version: 6,
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), IndexLayerMetadata {
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
// serde_json should always parse this but this might be a double with jq for
// example.
file_size: 9007199254741001,

View File

@@ -9,7 +9,7 @@ use std::time::SystemTime;
use tokio::fs::{self, File};
use tokio::io::AsyncSeekExt;
use tokio_util::sync::CancellationToken;
use utils::backoff;
use utils::{backoff, pausable_failpoint};
use super::Generation;
use crate::tenant::remote_timeline_client::{

View File

@@ -187,6 +187,7 @@ impl SecondaryTenant {
};
let now = SystemTime::now();
tracing::info!("Evicting secondary layer");
let this = self.clone();

View File

@@ -45,10 +45,10 @@ use crate::tenant::{
use camino::Utf8PathBuf;
use chrono::format::{DelayedFormat, StrftimeItems};
use futures::{Future, StreamExt};
use futures::Future;
use pageserver_api::models::SecondaryProgress;
use pageserver_api::shard::TenantShardId;
use remote_storage::{DownloadError, Etag, GenericRemoteStorage, RemoteStorageActivity};
use remote_storage::{DownloadError, Etag, GenericRemoteStorage};
use tokio_util::sync::CancellationToken;
use tracing::{info_span, instrument, warn, Instrument};
@@ -67,12 +67,6 @@ use super::{
/// download, if the uploader populated it.
const DEFAULT_DOWNLOAD_INTERVAL: Duration = Duration::from_millis(60000);
/// Range of concurrency we may use when downloading layers within a timeline. This is independent
/// for each tenant we're downloading: the concurrency of _tenants_ is defined separately in
/// `PageServerConf::secondary_download_concurrency`
const MAX_LAYER_CONCURRENCY: usize = 16;
const MIN_LAYER_CONCURRENCY: usize = 1;
pub(super) async fn downloader_task(
tenant_manager: Arc<TenantManager>,
remote_storage: GenericRemoteStorage,
@@ -81,15 +75,14 @@ pub(super) async fn downloader_task(
cancel: CancellationToken,
root_ctx: RequestContext,
) {
// How many tenants' secondary download operations we will run concurrently
let tenant_concurrency = tenant_manager.get_conf().secondary_download_concurrency;
let concurrency = tenant_manager.get_conf().secondary_download_concurrency;
let generator = SecondaryDownloader {
tenant_manager,
remote_storage,
root_ctx,
};
let mut scheduler = Scheduler::new(generator, tenant_concurrency);
let mut scheduler = Scheduler::new(generator, concurrency);
scheduler
.run(command_queue, background_jobs_can_start, cancel)
@@ -414,7 +407,7 @@ impl JobGenerator<PendingDownload, RunningDownload, CompleteDownload, DownloadCo
tracing::warn!("Insufficient space while downloading. Will retry later.");
}
Err(UpdateError::Cancelled) => {
tracing::debug!("Shut down while downloading");
tracing::info!("Shut down while downloading");
},
Err(UpdateError::Deserialize(e)) => {
tracing::error!("Corrupt content while downloading tenant: {e}");
@@ -716,7 +709,7 @@ impl<'a> TenantDownloader<'a> {
let mut layer_byte_count: u64 = timeline_state
.on_disk_layers
.values()
.map(|l| l.metadata.file_size())
.map(|l| l.metadata.file_size)
.sum();
// Remove on-disk layers that are no longer present in heatmap
@@ -727,7 +720,7 @@ impl<'a> TenantDownloader<'a> {
.get(layer_file_name)
.unwrap()
.metadata
.file_size();
.file_size;
let local_path = local_layer_path(
self.conf,
@@ -848,8 +841,6 @@ impl<'a> TenantDownloader<'a> {
tracing::debug!(timeline_id=%timeline.timeline_id, "Downloading layers, {} in heatmap", timeline.layers.len());
let mut download_futs = Vec::new();
// Download heatmap layers that are not present on local disk, or update their
// access time if they are already present.
for layer in timeline.layers {
@@ -886,9 +877,7 @@ impl<'a> TenantDownloader<'a> {
}
}
if on_disk.metadata != LayerFileMetadata::from(&layer.metadata)
|| on_disk.access_time != layer.access_time
{
if on_disk.metadata != layer.metadata || on_disk.access_time != layer.access_time {
// We already have this layer on disk. Update its access time.
tracing::debug!(
"Access time updated for layer {}: {} -> {}",
@@ -920,35 +909,19 @@ impl<'a> TenantDownloader<'a> {
strftime(&layer.access_time),
strftime(evicted_at)
);
self.skip_layer(layer);
continue;
}
}
download_futs.push(self.download_layer(
tenant_shard_id,
&timeline.timeline_id,
layer,
ctx,
));
}
// Break up layer downloads into chunks, so that for each chunk we can re-check how much
// concurrency to use based on activity level of remote storage.
while !download_futs.is_empty() {
let chunk =
download_futs.split_off(download_futs.len().saturating_sub(MAX_LAYER_CONCURRENCY));
let concurrency = Self::layer_concurrency(self.remote_storage.activity());
let mut result_stream = futures::stream::iter(chunk).buffered(concurrency);
let mut result_stream = std::pin::pin!(result_stream);
while let Some(result) = result_stream.next().await {
match result {
Err(e) => return Err(e),
Ok(None) => {
// No error, but we didn't download the layer. Don't mark it touched
}
Ok(Some(layer)) => touched.push(layer),
match self
.download_layer(tenant_shard_id, &timeline.timeline_id, layer, ctx)
.await?
{
Some(layer) => touched.push(layer),
None => {
// Not an error but we didn't download it: remote layer is missing. Don't add it to the list of
// things to consider touched.
}
}
}
@@ -979,7 +952,7 @@ impl<'a> TenantDownloader<'a> {
tenant_shard_id,
&timeline.timeline_id,
t.name,
LayerFileMetadata::from(&t.metadata),
t.metadata.clone(),
t.access_time,
local_path,
));
@@ -991,6 +964,15 @@ impl<'a> TenantDownloader<'a> {
Ok(())
}
/// Call this during timeline download if a layer will _not_ be downloaded, to update progress statistics
fn skip_layer(&self, layer: HeatMapLayer) {
let mut progress = self.secondary_state.progress.lock().unwrap();
progress.layers_total = progress.layers_total.saturating_sub(1);
progress.bytes_total = progress
.bytes_total
.saturating_sub(layer.metadata.file_size);
}
async fn download_layer(
&self,
tenant_shard_id: &TenantShardId,
@@ -1024,7 +1006,7 @@ impl<'a> TenantDownloader<'a> {
*tenant_shard_id,
*timeline_id,
&layer.name,
&LayerFileMetadata::from(&layer.metadata),
&layer.metadata,
&local_path,
&self.secondary_state.cancel,
ctx,
@@ -1040,13 +1022,7 @@ impl<'a> TenantDownloader<'a> {
"Skipped downloading missing layer {}, raced with compaction/gc?",
layer.name
);
// If the layer is 404, adjust the progress statistics to reflect that we will not download it.
let mut progress = self.secondary_state.progress.lock().unwrap();
progress.layers_total = progress.layers_total.saturating_sub(1);
progress.bytes_total = progress
.bytes_total
.saturating_sub(layer.metadata.file_size);
self.skip_layer(layer);
return Ok(None);
}
@@ -1083,19 +1059,6 @@ impl<'a> TenantDownloader<'a> {
Ok(Some(layer))
}
/// Calculate the currently allowed parallelism of layer download tasks, based on activity level of the remote storage
fn layer_concurrency(activity: RemoteStorageActivity) -> usize {
// When less than 75% of units are available, use minimum concurrency. Else, do a linear mapping
// of our concurrency range to the units available within the remaining 25%.
let clamp_at = (activity.read_total * 3) / 4;
if activity.read_available > clamp_at {
(MAX_LAYER_CONCURRENCY * (activity.read_available - clamp_at))
/ (activity.read_total - clamp_at)
} else {
MIN_LAYER_CONCURRENCY
}
}
}
/// Scan local storage and build up Layer objects based on the metadata in a HeatMapTimeline
@@ -1185,7 +1148,7 @@ async fn init_timeline_state(
tenant_shard_id,
&heatmap.timeline_id,
name,
LayerFileMetadata::from(&remote_meta.metadata),
remote_meta.metadata.clone(),
remote_meta.access_time,
file_path,
),
@@ -1219,58 +1182,3 @@ async fn init_timeline_state(
detail
}
#[cfg(test)]
mod test {
use super::*;
#[test]
fn layer_concurrency() {
// Totally idle
assert_eq!(
TenantDownloader::layer_concurrency(RemoteStorageActivity {
read_available: 16,
read_total: 16,
write_available: 16,
write_total: 16
}),
MAX_LAYER_CONCURRENCY
);
// Totally busy
assert_eq!(
TenantDownloader::layer_concurrency(RemoteStorageActivity {
read_available: 0,
read_total: 16,
write_available: 16,
write_total: 16
}),
MIN_LAYER_CONCURRENCY
);
// Edge of the range at which we interpolate
assert_eq!(
TenantDownloader::layer_concurrency(RemoteStorageActivity {
read_available: 12,
read_total: 16,
write_available: 16,
write_total: 16
}),
MIN_LAYER_CONCURRENCY
);
// Midpoint of the range in which we interpolate
assert_eq!(
TenantDownloader::layer_concurrency(RemoteStorageActivity {
read_available: 14,
read_total: 16,
write_available: 16,
write_total: 16
}),
MAX_LAYER_CONCURRENCY / 2
);
}
}

View File

@@ -1,6 +1,6 @@
use std::time::SystemTime;
use crate::tenant::{remote_timeline_client::index::IndexLayerMetadata, storage_layer::LayerName};
use crate::tenant::{remote_timeline_client::index::LayerFileMetadata, storage_layer::LayerName};
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr, TimestampSeconds};
@@ -38,7 +38,7 @@ pub(crate) struct HeatMapTimeline {
#[derive(Serialize, Deserialize)]
pub(crate) struct HeatMapLayer {
pub(super) name: LayerName,
pub(super) metadata: IndexLayerMetadata,
pub(super) metadata: LayerFileMetadata,
#[serde_as(as = "TimestampSeconds<i64>")]
pub(super) access_time: SystemTime,
@@ -49,7 +49,7 @@ pub(crate) struct HeatMapLayer {
impl HeatMapLayer {
pub(crate) fn new(
name: LayerName,
metadata: IndexLayerMetadata,
metadata: LayerFileMetadata,
access_time: SystemTime,
) -> Self {
Self {

View File

@@ -318,7 +318,7 @@ pub(crate) struct LayerFringe {
#[derive(Debug)]
struct LayerKeyspace {
layer: ReadableLayer,
target_keyspace: KeySpace,
target_keyspace: Vec<KeySpace>,
}
impl LayerFringe {
@@ -336,6 +336,7 @@ impl LayerFringe {
};
let removed = self.layers.remove_entry(&read_desc.layer_id);
match removed {
Some((
_,
@@ -343,7 +344,15 @@ impl LayerFringe {
layer,
target_keyspace,
},
)) => Some((layer, target_keyspace, read_desc.lsn_range)),
)) => {
let mut keyspace = KeySpaceRandomAccum::new();
for ks in target_keyspace {
for part in ks.ranges {
keyspace.add_range(part);
}
}
Some((layer, keyspace.consume_keyspace(), read_desc.lsn_range))
}
None => unreachable!("fringe internals are always consistent"),
}
}
@@ -358,7 +367,7 @@ impl LayerFringe {
let entry = self.layers.entry(layer_id.clone());
match entry {
Entry::Occupied(mut entry) => {
entry.get_mut().target_keyspace.merge(&keyspace);
entry.get_mut().target_keyspace.push(keyspace);
}
Entry::Vacant(entry) => {
self.planned_reads_by_lsn.push(ReadDesc {
@@ -367,7 +376,7 @@ impl LayerFringe {
});
entry.insert(LayerKeyspace {
layer,
target_keyspace: keyspace,
target_keyspace: vec![keyspace],
});
}
}

View File

@@ -47,7 +47,7 @@ use hex;
use itertools::Itertools;
use pageserver_api::keyspace::KeySpace;
use pageserver_api::models::LayerAccessKind;
use pageserver_api::shard::TenantShardId;
use pageserver_api::shard::{ShardIdentity, TenantShardId};
use rand::{distributions::Alphanumeric, Rng};
use serde::{Deserialize, Serialize};
use std::fs::File;
@@ -473,7 +473,7 @@ impl ImageLayerInner {
ctx: &RequestContext,
) -> Result<(), GetVectoredError> {
let reads = self
.plan_reads(keyspace, ctx)
.plan_reads(keyspace, None, ctx)
.await
.map_err(GetVectoredError::Other)?;
@@ -485,9 +485,15 @@ impl ImageLayerInner {
Ok(())
}
/// Traverse the layer's index to build read operations on the overlap of the input keyspace
/// and the keys in this layer.
///
/// If shard_identity is provided, it will be used to filter keys down to those stored on
/// this shard.
async fn plan_reads(
&self,
keyspace: KeySpace,
shard_identity: Option<&ShardIdentity>,
ctx: &RequestContext,
) -> anyhow::Result<Vec<VectoredRead>> {
let mut planner = VectoredReadPlanner::new(
@@ -507,7 +513,6 @@ impl ImageLayerInner {
for range in keyspace.ranges.iter() {
let mut range_end_handled = false;
let mut search_key: [u8; KEY_SIZE] = [0u8; KEY_SIZE];
range.start.write_to_byte_slice(&mut search_key);
@@ -520,12 +525,22 @@ impl ImageLayerInner {
let key = Key::from_slice(&raw_key[..KEY_SIZE]);
assert!(key >= range.start);
let flag = if let Some(shard_identity) = shard_identity {
if shard_identity.is_key_disposable(&key) {
BlobFlag::Ignore
} else {
BlobFlag::None
}
} else {
BlobFlag::None
};
if key >= range.end {
planner.handle_range_end(offset);
range_end_handled = true;
break;
} else {
planner.handle(key, self.lsn, offset, BlobFlag::None);
planner.handle(key, self.lsn, offset, flag);
}
}
@@ -538,6 +553,50 @@ impl ImageLayerInner {
Ok(planner.finish())
}
/// Given a key range, select the parts of that range that should be retained by the ShardIdentity,
/// then execute vectored GET operations, passing the results of all read keys into the writer.
pub(super) async fn filter(
&self,
shard_identity: &ShardIdentity,
writer: &mut ImageLayerWriter,
ctx: &RequestContext,
) -> anyhow::Result<usize> {
// Fragment the range into the regions owned by this ShardIdentity
let plan = self
.plan_reads(
KeySpace {
// If asked for the total key space, plan_reads will give us all the keys in the layer
ranges: vec![Key::MIN..Key::MAX],
},
Some(shard_identity),
ctx,
)
.await?;
let vectored_blob_reader = VectoredBlobReader::new(&self.file);
let mut key_count = 0;
for read in plan.into_iter() {
let buf_size = read.size();
let buf = BytesMut::with_capacity(buf_size);
let blobs_buf = vectored_blob_reader.read_blobs(&read, buf, ctx).await?;
let frozen_buf = blobs_buf.buf.freeze();
for meta in blobs_buf.blobs.iter() {
let img_buf = frozen_buf.slice(meta.start..meta.end);
key_count += 1;
writer
.put_image(meta.meta.key, img_buf, ctx)
.await
.context(format!("Storing key {}", meta.meta.key))?;
}
}
Ok(key_count)
}
async fn do_reads_and_update_state(
&self,
reads: Vec<VectoredRead>,
@@ -855,3 +914,136 @@ impl Drop for ImageLayerWriter {
}
}
}
#[cfg(test)]
mod test {
use bytes::Bytes;
use pageserver_api::{
key::Key,
shard::{ShardCount, ShardIdentity, ShardNumber, ShardStripeSize},
};
use utils::{id::TimelineId, lsn::Lsn};
use crate::{tenant::harness::TenantHarness, DEFAULT_PG_VERSION};
use super::ImageLayerWriter;
#[tokio::test]
async fn image_layer_rewrite() {
let harness = TenantHarness::create("test_image_layer_rewrite").unwrap();
let (tenant, ctx) = harness.load().await;
// The LSN at which we will create an image layer to filter
let lsn = Lsn(0xdeadbeef0000);
let timeline_id = TimelineId::generate();
let timeline = tenant
.create_test_timeline(timeline_id, lsn, DEFAULT_PG_VERSION, &ctx)
.await
.unwrap();
// This key range contains several 0x8000 page stripes, only one of which belongs to shard zero
let input_start = Key::from_hex("000000067f00000001000000ae0000000000").unwrap();
let input_end = Key::from_hex("000000067f00000001000000ae0000020000").unwrap();
let range = input_start..input_end;
// Build an image layer to filter
let resident = {
let mut writer = ImageLayerWriter::new(
harness.conf,
timeline_id,
harness.tenant_shard_id,
&range,
lsn,
&ctx,
)
.await
.unwrap();
let foo_img = Bytes::from_static(&[1, 2, 3, 4]);
let mut key = range.start;
while key < range.end {
writer.put_image(key, foo_img.clone(), &ctx).await.unwrap();
key = key.next();
}
writer.finish(&timeline, &ctx).await.unwrap()
};
let original_size = resident.metadata().file_size;
// Filter for various shards: this exercises cases like values at start of key range, end of key
// range, middle of key range.
for shard_number in 0..4 {
let mut filtered_writer = ImageLayerWriter::new(
harness.conf,
timeline_id,
harness.tenant_shard_id,
&range,
lsn,
&ctx,
)
.await
.unwrap();
// TenantHarness gave us an unsharded tenant, but we'll use a sharded ShardIdentity
// to exercise filter()
let shard_identity = ShardIdentity::new(
ShardNumber(shard_number),
ShardCount::new(4),
ShardStripeSize(0x8000),
)
.unwrap();
let wrote_keys = resident
.filter(&shard_identity, &mut filtered_writer, &ctx)
.await
.unwrap();
let replacement = if wrote_keys > 0 {
Some(filtered_writer.finish(&timeline, &ctx).await.unwrap())
} else {
None
};
// This exact size and those below will need updating as/when the layer encoding changes, but
// should be deterministic for a given version of the format, as we used no randomness generating the input.
assert_eq!(original_size, 1597440);
match shard_number {
0 => {
// We should have written out just one stripe for our shard identity
assert_eq!(wrote_keys, 0x8000);
let replacement = replacement.unwrap();
// We should have dropped some of the data
assert!(replacement.metadata().file_size < original_size);
assert!(replacement.metadata().file_size > 0);
// Assert that we dropped ~3/4 of the data.
assert_eq!(replacement.metadata().file_size, 417792);
}
1 => {
// Shard 1 has no keys in our input range
assert_eq!(wrote_keys, 0x0);
assert!(replacement.is_none());
}
2 => {
// Shard 2 has one stripes in the input range
assert_eq!(wrote_keys, 0x8000);
let replacement = replacement.unwrap();
assert!(replacement.metadata().file_size < original_size);
assert!(replacement.metadata().file_size > 0);
assert_eq!(replacement.metadata().file_size, 417792);
}
3 => {
// Shard 3 has two stripes in the input range
assert_eq!(wrote_keys, 0x10000);
let replacement = replacement.unwrap();
assert!(replacement.metadata().file_size < original_size);
assert!(replacement.metadata().file_size > 0);
assert_eq!(replacement.metadata().file_size, 811008);
}
_ => unreachable!(),
}
}
}
}

View File

@@ -4,7 +4,7 @@ use pageserver_api::keyspace::KeySpace;
use pageserver_api::models::{
HistoricLayerInfo, LayerAccessKind, LayerResidenceEventReason, LayerResidenceStatus,
};
use pageserver_api::shard::{ShardIndex, TenantShardId};
use pageserver_api::shard::{ShardIdentity, ShardIndex, TenantShardId};
use std::ops::Range;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Weak};
@@ -23,10 +23,10 @@ use crate::tenant::timeline::GetVectoredError;
use crate::tenant::{remote_timeline_client::LayerFileMetadata, Timeline};
use super::delta_layer::{self, DeltaEntry};
use super::image_layer;
use super::image_layer::{self};
use super::{
AsLayerDesc, LayerAccessStats, LayerAccessStatsReset, LayerName, PersistentLayerDesc,
ValueReconstructResult, ValueReconstructState, ValuesReconstructState,
AsLayerDesc, ImageLayerWriter, LayerAccessStats, LayerAccessStatsReset, LayerName,
PersistentLayerDesc, ValueReconstructResult, ValueReconstructState, ValuesReconstructState,
};
use utils::generation::Generation;
@@ -161,7 +161,7 @@ impl Layer {
timeline.tenant_shard_id,
timeline.timeline_id,
file_name,
metadata.file_size(),
metadata.file_size,
);
let access_stats = LayerAccessStats::for_loading_layer(LayerResidenceStatus::Evicted);
@@ -194,7 +194,7 @@ impl Layer {
timeline.tenant_shard_id,
timeline.timeline_id,
file_name,
metadata.file_size(),
metadata.file_size,
);
let access_stats = LayerAccessStats::for_loading_layer(LayerResidenceStatus::Resident);
@@ -227,7 +227,7 @@ impl Layer {
timeline
.metrics
.resident_physical_size_add(metadata.file_size());
.resident_physical_size_add(metadata.file_size);
ResidentLayer { downloaded, owner }
}
@@ -366,7 +366,10 @@ impl Layer {
.0
.get_or_maybe_download(true, Some(ctx))
.await
.map_err(|err| GetVectoredError::Other(anyhow::anyhow!(err)))?;
.map_err(|err| match err {
DownloadError::DownloadCancelled => GetVectoredError::Cancelled,
other => GetVectoredError::Other(anyhow::anyhow!(other)),
})?;
self.0
.access_stats
@@ -1158,6 +1161,11 @@ impl LayerInner {
let consecutive_failures =
1 + self.consecutive_failures.fetch_add(1, Ordering::Relaxed);
if timeline.cancel.is_cancelled() {
// If we're shutting down, drop out before logging the error
return Err(e);
}
tracing::error!(consecutive_failures, "layer file download failed: {e:#}");
let backoff = utils::backoff::exponential_backoff_duration_seconds(
@@ -1802,16 +1810,15 @@ impl ResidentLayer {
use LayerKind::*;
let owner = &self.owner.0;
match self.downloaded.get(owner, ctx).await? {
Delta(ref d) => {
// this is valid because the DownloadedLayer::kind is a OnceCell, not a
// Mutex<OnceCell>, so we cannot go and deinitialize the value with OnceCell::take
// while it's being held.
owner
.access_stats
.record_access(LayerAccessKind::KeyIter, ctx);
// this is valid because the DownloadedLayer::kind is a OnceCell, not a
// Mutex<OnceCell>, so we cannot go and deinitialize the value with OnceCell::take
// while it's being held.
delta_layer::DeltaLayerInner::load_keys(d, ctx)
.await
.with_context(|| format!("Layer index is corrupted for {self}"))
@@ -1820,6 +1827,23 @@ impl ResidentLayer {
}
}
/// Read all they keys in this layer which match the ShardIdentity, and write them all to
/// the provided writer. Return the number of keys written.
#[tracing::instrument(level = tracing::Level::DEBUG, skip_all, fields(layer=%self))]
pub(crate) async fn filter<'a>(
&'a self,
shard_identity: &ShardIdentity,
writer: &mut ImageLayerWriter,
ctx: &RequestContext,
) -> anyhow::Result<usize> {
use LayerKind::*;
match self.downloaded.get(&self.owner.0, ctx).await? {
Delta(_) => anyhow::bail!(format!("cannot filter() on a delta layer {self}")),
Image(i) => i.filter(shard_identity, writer, ctx).await,
}
}
/// Returns the amount of keys and values written to the writer.
pub(crate) async fn copy_delta_prefix(
&self,

View File

@@ -17,7 +17,7 @@ use crate::tenant::{Tenant, TenantState};
use rand::Rng;
use tokio_util::sync::CancellationToken;
use tracing::*;
use utils::{backoff, completion};
use utils::{backoff, completion, pausable_failpoint};
static CONCURRENT_BACKGROUND_TASKS: once_cell::sync::Lazy<tokio::sync::Semaphore> =
once_cell::sync::Lazy::new(|| {
@@ -380,21 +380,28 @@ async fn gc_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
let res = tenant
.gc_iteration(None, gc_horizon, tenant.get_pitr_interval(), &cancel, &ctx)
.await;
if let Err(e) = res {
let wait_duration = backoff::exponential_backoff_duration_seconds(
error_run_count + 1,
1.0,
MAX_BACKOFF_SECS,
);
error_run_count += 1;
let wait_duration = Duration::from_secs_f64(wait_duration);
error!(
match res {
Ok(_) => {
error_run_count = 0;
period
}
Err(crate::tenant::GcError::TenantCancelled) => {
return;
}
Err(e) => {
let wait_duration = backoff::exponential_backoff_duration_seconds(
error_run_count + 1,
1.0,
MAX_BACKOFF_SECS,
);
error_run_count += 1;
let wait_duration = Duration::from_secs_f64(wait_duration);
error!(
"Gc failed {error_run_count} times, retrying in {wait_duration:?}: {e:?}",
);
wait_duration
} else {
error_run_count = 0;
period
wait_duration
}
}
};

File diff suppressed because it is too large Load Diff

View File

@@ -176,13 +176,24 @@ impl Timeline {
async fn compact_shard_ancestors(
self: &Arc<Self>,
rewrite_max: usize,
_ctx: &RequestContext,
ctx: &RequestContext,
) -> anyhow::Result<()> {
let mut drop_layers = Vec::new();
let layers_to_rewrite: Vec<Layer> = Vec::new();
let mut layers_to_rewrite: Vec<Layer> = Vec::new();
// We will use the PITR cutoff as a condition for rewriting layers.
let pitr_cutoff = self.gc_info.read().unwrap().cutoffs.pitr;
// We will use the Lsn cutoff of the last GC as a threshold for rewriting layers: if a
// layer is behind this Lsn, it indicates that the layer is being retained beyond the
// pitr_interval, for example because a branchpoint references it.
//
// Holding this read guard also blocks [`Self::gc_timeline`] from entering while we
// are rewriting layers.
let latest_gc_cutoff = self.get_latest_gc_cutoff_lsn();
tracing::info!(
"latest_gc_cutoff: {}, pitr cutoff {}",
*latest_gc_cutoff,
self.gc_info.read().unwrap().cutoffs.pitr
);
let layers = self.layers.read().await;
for layer_desc in layers.layer_map().iter_historic_layers() {
@@ -241,9 +252,9 @@ impl Timeline {
// Don't bother re-writing a layer if it is within the PITR window: it will age-out eventually
// without incurring the I/O cost of a rewrite.
if layer_desc.get_lsn_range().end >= pitr_cutoff {
debug!(%layer, "Skipping rewrite of layer still in PITR window ({} >= {})",
layer_desc.get_lsn_range().end, pitr_cutoff);
if layer_desc.get_lsn_range().end >= *latest_gc_cutoff {
debug!(%layer, "Skipping rewrite of layer still in GC window ({} >= {})",
layer_desc.get_lsn_range().end, *latest_gc_cutoff);
continue;
}
@@ -253,13 +264,10 @@ impl Timeline {
continue;
}
// Only rewrite layers if they would have different remote paths: either they belong to this
// shard but an old generation, or they belonged to another shard. This also implicitly
// guarantees that the layer is persistent in remote storage (as only remote persistent
// layers are carried across shard splits, any local-only layer would be in the current generation)
if layer.metadata().generation == self.generation
&& layer.metadata().shard.shard_count == self.shard_identity.count
{
// Only rewrite layers if their generations differ. This guarantees:
// - that local rewrite is safe, as local layer paths will differ between existing layer and rewritten one
// - that the layer is persistent in remote storage, as we only see old-generation'd layer via loading from remote storage
if layer.metadata().generation == self.generation {
debug!(%layer, "Skipping rewrite, is not from old generation");
continue;
}
@@ -272,18 +280,69 @@ impl Timeline {
}
// Fall through: all our conditions for doing a rewrite passed.
// TODO: implement rewriting
tracing::debug!(%layer, "Would rewrite layer");
layers_to_rewrite.push(layer);
}
// Drop the layers read lock: we will acquire it for write in [`Self::rewrite_layers`]
// Drop read lock on layer map before we start doing time-consuming I/O
drop(layers);
// TODO: collect layers to rewrite
let replace_layers = Vec::new();
let mut replace_image_layers = Vec::new();
for layer in layers_to_rewrite {
tracing::info!(layer=%layer, "Rewriting layer after shard split...");
let mut image_layer_writer = ImageLayerWriter::new(
self.conf,
self.timeline_id,
self.tenant_shard_id,
&layer.layer_desc().key_range,
layer.layer_desc().image_layer_lsn(),
ctx,
)
.await?;
// Safety of layer rewrites:
// - We are writing to a different local file path than we are reading from, so the old Layer
// cannot interfere with the new one.
// - In the page cache, contents for a particular VirtualFile are stored with a file_id that
// is different for two layers with the same name (in `ImageLayerInner::new` we always
// acquire a fresh id from [`crate::page_cache::next_file_id`]. So readers do not risk
// reading the index from one layer file, and then data blocks from the rewritten layer file.
// - Any readers that have a reference to the old layer will keep it alive until they are done
// with it. If they are trying to promote from remote storage, that will fail, but this is the same
// as for compaction generally: compaction is allowed to delete layers that readers might be trying to use.
// - We do not run concurrently with other kinds of compaction, so the only layer map writes we race with are:
// - GC, which at worst witnesses us "undelete" a layer that they just deleted.
// - ingestion, which only inserts layers, therefore cannot collide with us.
let resident = layer.download_and_keep_resident().await?;
let keys_written = resident
.filter(&self.shard_identity, &mut image_layer_writer, ctx)
.await?;
if keys_written > 0 {
let new_layer = image_layer_writer.finish(self, ctx).await?;
tracing::info!(layer=%new_layer, "Rewrote layer, {} -> {} bytes",
layer.metadata().file_size,
new_layer.metadata().file_size);
replace_image_layers.push((layer, new_layer));
} else {
// Drop the old layer. Usually for this case we would already have noticed that
// the layer has no data for us with the ShardedRange check above, but
drop_layers.push(layer);
}
}
// At this point, we have replaced local layer files with their rewritten form, but not yet uploaded
// metadata to reflect that. If we restart here, the replaced layer files will look invalid (size mismatch
// to remote index) and be removed. This is inefficient but safe.
fail::fail_point!("compact-shard-ancestors-localonly");
// Update the LayerMap so that readers will use the new layers, and enqueue it for writing to remote storage
self.rewrite_layers(replace_layers, drop_layers).await?;
self.rewrite_layers(replace_image_layers, drop_layers)
.await?;
fail::fail_point!("compact-shard-ancestors-enqueued");
// We wait for all uploads to complete before finishing this compaction stage. This is not
// necessary for correctness, but it simplifies testing, and avoids proceeding with another
@@ -291,6 +350,8 @@ impl Timeline {
// load.
self.remote_client.wait_completion().await?;
fail::fail_point!("compact-shard-ancestors-persistent");
Ok(())
}

View File

@@ -7,7 +7,7 @@ use anyhow::Context;
use pageserver_api::{models::TimelineState, shard::TenantShardId};
use tokio::sync::OwnedMutexGuard;
use tracing::{error, info, instrument, Instrument};
use utils::{crashsafe, fs_ext, id::TimelineId};
use utils::{crashsafe, fs_ext, id::TimelineId, pausable_failpoint};
use crate::{
config::PageServerConf,

View File

@@ -1,6 +1,6 @@
use std::sync::Arc;
use super::{layer_manager::LayerManager, Timeline};
use super::{layer_manager::LayerManager, FlushLayerError, Timeline};
use crate::{
context::{DownloadBehavior, RequestContext},
task_mgr::TaskKind,
@@ -23,7 +23,7 @@ pub(crate) enum Error {
#[error("shutting down, please retry later")]
ShuttingDown,
#[error("flushing failed")]
FlushAncestor(#[source] anyhow::Error),
FlushAncestor(#[source] FlushLayerError),
#[error("layer download failed")]
RewrittenDeltaDownloadFailed(#[source] anyhow::Error),
#[error("copying LSN prefix locally failed")]

View File

@@ -7,19 +7,20 @@ use crate::{
index::{IndexPart, LayerFileMetadata},
},
storage_layer::LayerName,
Generation,
},
};
use anyhow::Context;
use camino::{Utf8Path, Utf8PathBuf};
use pageserver_api::shard::ShardIndex;
use std::{collections::HashMap, str::FromStr};
use std::{
collections::{hash_map, HashMap},
str::FromStr,
};
use utils::lsn::Lsn;
/// Identified files in the timeline directory.
pub(super) enum Discovered {
/// The only one we care about
Layer(LayerName, Utf8PathBuf, u64),
Layer(LayerName, LocalLayerFileMetadata),
/// Old ephmeral files from previous launches, should be removed
Ephemeral(String),
/// Old temporary timeline files, unsure what these really are, should be removed
@@ -27,7 +28,7 @@ pub(super) enum Discovered {
/// Temporary on-demand download files, should be removed
TemporaryDownload(String),
/// Backup file from previously future layers
IgnoredBackup,
IgnoredBackup(Utf8PathBuf),
/// Unrecognized, warn about these
Unknown(String),
}
@@ -43,12 +44,15 @@ pub(super) fn scan_timeline_dir(path: &Utf8Path) -> anyhow::Result<Vec<Discovere
let discovered = match LayerName::from_str(&file_name) {
Ok(file_name) => {
let file_size = direntry.metadata()?.len();
Discovered::Layer(file_name, direntry.path().to_owned(), file_size)
Discovered::Layer(
file_name,
LocalLayerFileMetadata::new(direntry.path().to_owned(), file_size),
)
}
Err(_) => {
if file_name.ends_with(".old") {
// ignore these
Discovered::IgnoredBackup
Discovered::IgnoredBackup(direntry.path().to_owned())
} else if remote_timeline_client::is_temp_download_file(direntry.path()) {
Discovered::TemporaryDownload(file_name)
} else if is_ephemeral_file(&file_name) {
@@ -71,37 +75,32 @@ pub(super) fn scan_timeline_dir(path: &Utf8Path) -> anyhow::Result<Vec<Discovere
/// this structure extends it with metadata describing the layer's presence in local storage.
#[derive(Clone, Debug)]
pub(super) struct LocalLayerFileMetadata {
pub(super) metadata: LayerFileMetadata,
pub(super) file_size: u64,
pub(super) local_path: Utf8PathBuf,
}
impl LocalLayerFileMetadata {
pub fn new(
local_path: Utf8PathBuf,
file_size: u64,
generation: Generation,
shard: ShardIndex,
) -> Self {
pub fn new(local_path: Utf8PathBuf, file_size: u64) -> Self {
Self {
local_path,
metadata: LayerFileMetadata::new(file_size, generation, shard),
file_size,
}
}
}
/// Decision on what to do with a layer file after considering its local and remote metadata.
/// For a layer that is present in remote metadata, this type describes how to handle
/// it during startup: it is either Resident (and we have some metadata about a local file),
/// or it is Evicted (and we only have remote metadata).
#[derive(Clone, Debug)]
pub(super) enum Decision {
/// The layer is not present locally.
Evicted(LayerFileMetadata),
/// The layer is present locally, but local metadata does not match remote; we must
/// delete it and treat it as evicted.
UseRemote {
/// The layer is present locally, and metadata matches: we may hook up this layer to the
/// existing file in local storage.
Resident {
local: LocalLayerFileMetadata,
remote: LayerFileMetadata,
},
/// The layer is present locally, and metadata matches.
UseLocal(LocalLayerFileMetadata),
}
/// A layer needs to be left out of the layer map.
@@ -117,77 +116,81 @@ pub(super) enum DismissedLayer {
/// In order to make crash safe updates to layer map, we must dismiss layers which are only
/// found locally or not yet included in the remote `index_part.json`.
LocalOnly(LocalLayerFileMetadata),
/// The layer exists in remote storage but the local layer's metadata (e.g. file size)
/// does not match it
BadMetadata(LocalLayerFileMetadata),
}
/// Merges local discoveries and remote [`IndexPart`] to a collection of decisions.
pub(super) fn reconcile(
discovered: Vec<(LayerName, Utf8PathBuf, u64)>,
local_layers: Vec<(LayerName, LocalLayerFileMetadata)>,
index_part: Option<&IndexPart>,
disk_consistent_lsn: Lsn,
generation: Generation,
shard: ShardIndex,
) -> Vec<(LayerName, Result<Decision, DismissedLayer>)> {
use Decision::*;
let Some(index_part) = index_part else {
// If we have no remote metadata, no local layer files are considered valid to load
return local_layers
.into_iter()
.map(|(layer_name, local_metadata)| {
(layer_name, Err(DismissedLayer::LocalOnly(local_metadata)))
})
.collect();
};
// name => (local_metadata, remote_metadata)
type Collected =
HashMap<LayerName, (Option<LocalLayerFileMetadata>, Option<LayerFileMetadata>)>;
let mut result = Vec::new();
let mut discovered = discovered
.into_iter()
.map(|(layer_name, local_path, file_size)| {
(
layer_name,
// The generation and shard here will be corrected to match IndexPart in the merge below, unless
// it is not in IndexPart, in which case using our current generation makes sense
// because it will be uploaded in this generation.
(
Some(LocalLayerFileMetadata::new(
local_path, file_size, generation, shard,
)),
None,
),
)
})
.collect::<Collected>();
let mut remote_layers = HashMap::new();
// merge any index_part information, when available
// Construct Decisions for layers that are found locally, if they're in remote metadata. Otherwise
// construct DismissedLayers to get rid of them.
for (layer_name, local_metadata) in local_layers {
let Some(remote_metadata) = index_part.layer_metadata.get(&layer_name) else {
result.push((layer_name, Err(DismissedLayer::LocalOnly(local_metadata))));
continue;
};
if remote_metadata.file_size != local_metadata.file_size {
result.push((layer_name, Err(DismissedLayer::BadMetadata(local_metadata))));
continue;
}
remote_layers.insert(
layer_name,
Decision::Resident {
local: local_metadata,
remote: remote_metadata.clone(),
},
);
}
// Construct Decision for layers that were not found locally
index_part
.as_ref()
.map(|ip| ip.layer_metadata.iter())
.into_iter()
.flatten()
.map(|(name, metadata)| (name, LayerFileMetadata::from(metadata)))
.layer_metadata
.iter()
.for_each(|(name, metadata)| {
if let Some(existing) = discovered.get_mut(name) {
existing.1 = Some(metadata);
} else {
discovered.insert(name.to_owned(), (None, Some(metadata)));
if let hash_map::Entry::Vacant(entry) = remote_layers.entry(name.clone()) {
entry.insert(Decision::Evicted(metadata.clone()));
}
});
discovered
.into_iter()
.map(|(name, (local, remote))| {
let decision = if name.is_in_future(disk_consistent_lsn) {
Err(DismissedLayer::Future { local })
} else {
match (local, remote) {
(Some(local), Some(remote)) if local.metadata != remote => {
Ok(UseRemote { local, remote })
}
(Some(x), Some(_)) => Ok(UseLocal(x)),
(None, Some(x)) => Ok(Evicted(x)),
(Some(x), None) => Err(DismissedLayer::LocalOnly(x)),
(None, None) => {
unreachable!("there must not be any non-local non-remote files")
}
}
};
// For layers that were found in authoritative remote metadata, apply a final check that they are within
// the disk_consistent_lsn.
result.extend(remote_layers.into_iter().map(|(name, decision)| {
if name.is_in_future(disk_consistent_lsn) {
match decision {
Decision::Evicted(_remote) => (name, Err(DismissedLayer::Future { local: None })),
Decision::Resident {
local,
remote: _remote,
} => (name, Err(DismissedLayer::Future { local: Some(local) })),
}
} else {
(name, Ok(decision))
}
}));
(name, decision)
})
.collect::<Vec<_>>()
result
}
pub(super) fn cleanup(path: &Utf8Path, kind: &str) -> anyhow::Result<()> {
@@ -196,25 +199,15 @@ pub(super) fn cleanup(path: &Utf8Path, kind: &str) -> anyhow::Result<()> {
std::fs::remove_file(path).with_context(|| format!("failed to remove {kind} at {path}"))
}
pub(super) fn cleanup_local_file_for_remote(
local: &LocalLayerFileMetadata,
remote: &LayerFileMetadata,
) -> anyhow::Result<()> {
let local_size = local.metadata.file_size();
let remote_size = remote.file_size();
pub(super) fn cleanup_local_file_for_remote(local: &LocalLayerFileMetadata) -> anyhow::Result<()> {
let local_size = local.file_size;
let path = &local.local_path;
let file_name = path.file_name().expect("must be file path");
tracing::warn!("removing local file {file_name:?} because it has unexpected length {local_size}; length in remote index is {remote_size}");
if let Err(err) = crate::tenant::timeline::rename_to_backup(path) {
assert!(
path.exists(),
"we would leave the local_layer without a file if this does not hold: {path}",
);
Err(err)
} else {
Ok(())
}
tracing::warn!(
"removing local file {file_name:?} because it has unexpected length {local_size};"
);
std::fs::remove_file(path).with_context(|| format!("failed to remove layer at {path}"))
}
pub(super) fn cleanup_future_layer(
@@ -236,8 +229,8 @@ pub(super) fn cleanup_local_only_file(
) -> anyhow::Result<()> {
let kind = name.kind();
tracing::info!(
"found local-only {kind} layer {name}, metadata {:?}",
local.metadata
"found local-only {kind} layer {name} size {}",
local.file_size
);
std::fs::remove_file(&local.local_path)?;
Ok(())

View File

@@ -166,7 +166,7 @@ impl LayerManager {
/// Flush a frozen layer and add the written delta layer to the layer map.
pub(crate) fn finish_flush_l0_layer(
&mut self,
delta_layers: &[ResidentLayer],
delta_layer: Option<&ResidentLayer>,
frozen_layer_for_check: &Arc<InMemoryLayer>,
metrics: &TimelineMetrics,
) {
@@ -181,12 +181,10 @@ impl LayerManager {
// layer to disk at the same time, that would not work.
assert_eq!(Arc::as_ptr(&inmem), Arc::as_ptr(frozen_layer_for_check));
if !delta_layers.is_empty() {
if let Some(l) = delta_layer {
let mut updates = self.layer_map.batch_update();
for l in delta_layers {
Self::insert_historic_layer(l.as_ref().clone(), &mut updates, &mut self.layer_fmgr);
metrics.record_new_file_metrics(l.layer_desc().file_size);
}
Self::insert_historic_layer(l.as_ref().clone(), &mut updates, &mut self.layer_fmgr);
metrics.record_new_file_metrics(l.layer_desc().file_size);
updates.flush();
}
}
@@ -214,13 +212,34 @@ impl LayerManager {
&mut self,
rewrite_layers: &[(Layer, ResidentLayer)],
drop_layers: &[Layer],
_metrics: &TimelineMetrics,
metrics: &TimelineMetrics,
) {
let mut updates = self.layer_map.batch_update();
for (old_layer, new_layer) in rewrite_layers {
debug_assert_eq!(
old_layer.layer_desc().key_range,
new_layer.layer_desc().key_range
);
debug_assert_eq!(
old_layer.layer_desc().lsn_range,
new_layer.layer_desc().lsn_range
);
// TODO: implement rewrites (currently this code path only used for drops)
assert!(rewrite_layers.is_empty());
// Safety: we may never rewrite the same file in-place. Callers are responsible
// for ensuring that they only rewrite layers after something changes the path,
// such as an increment in the generation number.
assert_ne!(old_layer.local_path(), new_layer.local_path());
Self::delete_historic_layer(old_layer, &mut updates, &mut self.layer_fmgr);
Self::insert_historic_layer(
new_layer.as_ref().clone(),
&mut updates,
&mut self.layer_fmgr,
);
metrics.record_new_file_metrics(new_layer.layer_desc().file_size);
}
for l in drop_layers {
Self::delete_historic_layer(l, &mut updates, &mut self.layer_fmgr);
}
@@ -236,6 +255,13 @@ impl LayerManager {
updates.flush()
}
#[cfg(test)]
pub(crate) fn force_insert_layer(&mut self, layer: ResidentLayer) {
let mut updates = self.layer_map.batch_update();
Self::insert_historic_layer(layer.as_ref().clone(), &mut updates, &mut self.layer_fmgr);
updates.flush()
}
/// Helper function to insert a layer into the layer map and file manager.
fn insert_historic_layer(
layer: Layer,

View File

@@ -213,10 +213,7 @@ impl UploadQueue {
let mut files = HashMap::with_capacity(index_part.layer_metadata.len());
for (layer_name, layer_metadata) in &index_part.layer_metadata {
files.insert(
layer_name.to_owned(),
LayerFileMetadata::from(layer_metadata),
);
files.insert(layer_name.to_owned(), layer_metadata.clone());
}
info!(
@@ -322,9 +319,7 @@ impl std::fmt::Display for UploadOp {
write!(
f,
"UploadLayer({}, size={:?}, gen={:?})",
layer,
metadata.file_size(),
metadata.generation
layer, metadata.file_size, metadata.generation
)
}
UploadOp::UploadMetadata(_, lsn) => {

View File

@@ -344,21 +344,21 @@ macro_rules! with_file {
impl VirtualFile {
/// Open a file in read-only mode. Like File::open.
pub async fn open(
path: &Utf8Path,
pub async fn open<P: AsRef<Utf8Path>>(
path: P,
ctx: &RequestContext,
) -> Result<VirtualFile, std::io::Error> {
Self::open_with_options(path, OpenOptions::new().read(true), ctx).await
Self::open_with_options(path.as_ref(), OpenOptions::new().read(true), ctx).await
}
/// Create a new file for writing. If the file exists, it will be truncated.
/// Like File::create.
pub async fn create(
path: &Utf8Path,
pub async fn create<P: AsRef<Utf8Path>>(
path: P,
ctx: &RequestContext,
) -> Result<VirtualFile, std::io::Error> {
Self::open_with_options(
path,
path.as_ref(),
OpenOptions::new().write(true).create(true).truncate(true),
ctx,
)
@@ -370,12 +370,13 @@ impl VirtualFile {
/// Note: If any custom flags were set in 'open_options' through OpenOptionsExt,
/// they will be applied also when the file is subsequently re-opened, not only
/// on the first time. Make sure that's sane!
pub async fn open_with_options(
path: &Utf8Path,
pub async fn open_with_options<P: AsRef<Utf8Path>>(
path: P,
open_options: &OpenOptions,
_ctx: &RequestContext, /* TODO: carry a pointer to the metrics in the RequestContext instead of the parsing https://github.com/neondatabase/neon/issues/6107 */
) -> Result<VirtualFile, std::io::Error> {
let path_str = path.to_string();
let path_ref = path.as_ref();
let path_str = path_ref.to_string();
let parts = path_str.split('/').collect::<Vec<&str>>();
let (tenant_id, shard_id, timeline_id) =
if parts.len() > 5 && parts[parts.len() - 5] == TENANTS_SEGMENT_NAME {
@@ -401,7 +402,7 @@ impl VirtualFile {
// where our caller doesn't get to use the returned VirtualFile before its
// slot gets re-used by someone else.
let file = observe_duration!(StorageIoOperation::Open, {
open_options.open(path.as_std_path()).await?
open_options.open(path_ref.as_std_path()).await?
});
// Strip all options other than read and write.
@@ -417,7 +418,7 @@ impl VirtualFile {
let vfile = VirtualFile {
handle: RwLock::new(handle),
pos: 0,
path: path.to_path_buf(),
path: path_ref.to_path_buf(),
open_options: reopen_options,
tenant_id,
shard_id,

View File

@@ -34,7 +34,6 @@ use crate::repository::Key;
use crate::walrecord::NeonWalRecord;
use anyhow::Context;
use bytes::{Bytes, BytesMut};
use pageserver_api::key::key_to_rel_block;
use pageserver_api::models::{WalRedoManagerProcessStatus, WalRedoManagerStatus};
use pageserver_api::shard::TenantShardId;
use std::sync::Arc;
@@ -208,7 +207,7 @@ impl PostgresRedoManager {
) -> anyhow::Result<Bytes> {
*(self.last_redo_at.lock().unwrap()) = Some(Instant::now());
let (rel, blknum) = key_to_rel_block(key).context("invalid record")?;
let (rel, blknum) = key.to_rel_block().context("invalid record")?;
const MAX_RETRY_ATTEMPTS: u32 = 1;
let mut n_attempts = 0u32;
loop {

View File

@@ -3,7 +3,7 @@ use crate::walrecord::NeonWalRecord;
use anyhow::Context;
use byteorder::{ByteOrder, LittleEndian};
use bytes::{BufMut, BytesMut};
use pageserver_api::key::{key_to_rel_block, key_to_slru_block, Key};
use pageserver_api::key::Key;
use pageserver_api::reltag::SlruKind;
use postgres_ffi::pg_constants;
use postgres_ffi::relfile_utils::VISIBILITYMAP_FORKNUM;
@@ -48,7 +48,7 @@ pub(crate) fn apply_in_neon(
flags,
} => {
// sanity check that this is modifying the correct relation
let (rel, blknum) = key_to_rel_block(key).context("invalid record")?;
let (rel, blknum) = key.to_rel_block().context("invalid record")?;
assert!(
rel.forknum == VISIBILITYMAP_FORKNUM,
"ClearVisibilityMapFlags record on unexpected rel {}",
@@ -85,7 +85,7 @@ pub(crate) fn apply_in_neon(
// Non-relational WAL records are handled here, with custom code that has the
// same effects as the corresponding Postgres WAL redo function.
NeonWalRecord::ClogSetCommitted { xids, timestamp } => {
let (slru_kind, segno, blknum) = key_to_slru_block(key).context("invalid record")?;
let (slru_kind, segno, blknum) = key.to_slru_block().context("invalid record")?;
assert_eq!(
slru_kind,
SlruKind::Clog,
@@ -130,7 +130,7 @@ pub(crate) fn apply_in_neon(
}
}
NeonWalRecord::ClogSetAborted { xids } => {
let (slru_kind, segno, blknum) = key_to_slru_block(key).context("invalid record")?;
let (slru_kind, segno, blknum) = key.to_slru_block().context("invalid record")?;
assert_eq!(
slru_kind,
SlruKind::Clog,
@@ -160,7 +160,7 @@ pub(crate) fn apply_in_neon(
}
}
NeonWalRecord::MultixactOffsetCreate { mid, moff } => {
let (slru_kind, segno, blknum) = key_to_slru_block(key).context("invalid record")?;
let (slru_kind, segno, blknum) = key.to_slru_block().context("invalid record")?;
assert_eq!(
slru_kind,
SlruKind::MultiXactOffsets,
@@ -192,7 +192,7 @@ pub(crate) fn apply_in_neon(
LittleEndian::write_u32(&mut page[offset..offset + 4], *moff);
}
NeonWalRecord::MultixactMembersCreate { moff, members } => {
let (slru_kind, segno, blknum) = key_to_slru_block(key).context("invalid record")?;
let (slru_kind, segno, blknum) = key.to_slru_block().context("invalid record")?;
assert_eq!(
slru_kind,
SlruKind::MultiXactMembers,

View File

@@ -51,7 +51,6 @@ int flush_every_n_requests = 8;
int neon_protocol_version = 2;
static int n_reconnect_attempts = 0;
static int max_reconnect_attempts = 60;
static int stripe_size;
@@ -95,18 +94,37 @@ static shmem_startup_hook_type prev_shmem_startup_hook;
static PagestoreShmemState *pagestore_shared;
static uint64 pagestore_local_counter = 0;
typedef enum PSConnectionState {
PS_Disconnected, /* no connection yet */
PS_Connecting_Startup, /* connection starting up */
PS_Connecting_PageStream, /* negotiating pagestream */
PS_Connected, /* connected, pagestream established */
} PSConnectionState;
/* This backend's per-shard connections */
typedef struct
{
PGconn *conn;
TimestampTz last_connect_time; /* read-only debug value */
TimestampTz last_reconnect_time;
uint32 delay_us;
int n_reconnect_attempts;
/*---
* WaitEventSet containing:
* - WL_SOCKET_READABLE on 'conn'
* - WL_LATCH_SET on MyLatch, and
* - WL_EXIT_ON_PM_DEATH.
* Pageserver connection state, i.e.
* disconnected: conn == NULL, wes == NULL;
* conn_startup: connection initiated, waiting for connection establishing
* conn_ps: PageStream query sent, waiting for confirmation
* connected: PageStream established
*/
WaitEventSet *wes;
PSConnectionState state;
PGconn *conn;
/*---
* WaitEventSet containing:
* - WL_SOCKET_READABLE on 'conn'
* - WL_LATCH_SET on MyLatch, and
* - WL_EXIT_ON_PM_DEATH.
*/
WaitEventSet *wes_read;
} PageServer;
static PageServer page_servers[MAX_SHARDS];
@@ -303,119 +321,277 @@ get_shard_number(BufferTag *tag)
return hash % n_shards;
}
static inline void
CLEANUP_AND_DISCONNECT(PageServer *shard)
{
if (shard->wes_read)
{
FreeWaitEventSet(shard->wes_read);
shard->wes_read = NULL;
}
if (shard->conn)
{
PQfinish(shard->conn);
shard->conn = NULL;
}
shard->state = PS_Disconnected;
}
/*
* Connect to a pageserver, or continue to try to connect if we're yet to
* complete the connection (e.g. due to receiving an earlier cancellation
* during connection start).
* Returns true if successfully connected; false if the connection failed.
*
* Throws errors in unrecoverable situations, or when this backend's query
* is canceled.
*/
static bool
pageserver_connect(shardno_t shard_no, int elevel)
{
char *query;
int ret;
const char *keywords[3];
const char *values[3];
int n;
PGconn *conn;
WaitEventSet *wes;
PageServer *shard = &page_servers[shard_no];
char connstr[MAX_PAGESERVER_CONNSTRING_SIZE];
static TimestampTz last_connect_time = 0;
static uint64_t delay_us = MIN_RECONNECT_INTERVAL_USEC;
TimestampTz now;
uint64_t us_since_last_connect;
bool broke_from_loop = false;
Assert(page_servers[shard_no].conn == NULL);
/*
* Get the connection string for this shard. If the shard map has been
* updated since we last looked, this will also disconnect any existing
* pageserver connections as a side effect.
* Note that connstr is used both during connection start, and when we
* log the successful connection.
*/
load_shard_map(shard_no, connstr, NULL);
now = GetCurrentTimestamp();
us_since_last_connect = now - last_connect_time;
if (us_since_last_connect < MAX_RECONNECT_INTERVAL_USEC)
switch (shard->state)
{
pg_usleep(delay_us);
delay_us *= 2;
}
else
case PS_Disconnected:
{
delay_us = MIN_RECONNECT_INTERVAL_USEC;
}
const char *keywords[3];
const char *values[3];
int n_pgsql_params;
TimestampTz now;
int64 us_since_last_attempt;
/*
* Connect using the connection string we got from the
* neon.pageserver_connstring GUC. If the NEON_AUTH_TOKEN environment
* variable was set, use that as the password.
*
* The connection options are parsed in the order they're given, so when
* we set the password before the connection string, the connection string
* can override the password from the env variable. Seems useful, although
* we don't currently use that capability anywhere.
*/
n = 0;
if (neon_auth_token)
{
keywords[n] = "password";
values[n] = neon_auth_token;
n++;
/* Make sure we start with a clean slate */
CLEANUP_AND_DISCONNECT(shard);
neon_shard_log(shard_no, DEBUG5, "Connection state: Disconnected");
now = GetCurrentTimestamp();
us_since_last_attempt = (int64) (now - shard->last_reconnect_time);
shard->last_reconnect_time = now;
/*
* If we did other tasks between reconnect attempts, then we won't
* need to wait as long as a full delay.
*/
if (us_since_last_attempt < shard->delay_us)
{
pg_usleep(shard->delay_us - us_since_last_attempt);
}
/* update the delay metric */
shard->delay_us = Min(shard->delay_us * 2, MAX_RECONNECT_INTERVAL_USEC);
/*
* Connect using the connection string we got from the
* neon.pageserver_connstring GUC. If the NEON_AUTH_TOKEN environment
* variable was set, use that as the password.
*
* The connection options are parsed in the order they're given, so when
* we set the password before the connection string, the connection string
* can override the password from the env variable. Seems useful, although
* we don't currently use that capability anywhere.
*/
keywords[0] = "dbname";
values[0] = connstr;
n_pgsql_params = 1;
if (neon_auth_token)
{
keywords[1] = "password";
values[1] = neon_auth_token;
n_pgsql_params++;
}
keywords[n_pgsql_params] = NULL;
values[n_pgsql_params] = NULL;
shard->conn = PQconnectStartParams(keywords, values, 1);
if (!shard->conn)
{
neon_shard_log(shard_no, elevel, "Failed to connect to pageserver: out of memory");
return false;
}
shard->state = PS_Connecting_Startup;
/* fallthrough */
}
keywords[n] = "dbname";
values[n] = connstr;
n++;
keywords[n] = NULL;
values[n] = NULL;
n++;
conn = PQconnectdbParams(keywords, values, 1);
last_connect_time = GetCurrentTimestamp();
if (PQstatus(conn) == CONNECTION_BAD)
case PS_Connecting_Startup:
{
char *msg = pchomp(PQerrorMessage(conn));
char *pagestream_query;
int ps_send_query_ret;
bool connected = false;
int poll_result = PGRES_POLLING_WRITING;
neon_shard_log(shard_no, DEBUG5, "Connection state: Connecting_Startup");
PQfinish(conn);
do
{
WaitEvent event;
ereport(elevel,
(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
errmsg(NEON_TAG "[shard %d] could not establish connection to pageserver", shard_no),
errdetail_internal("%s", msg)));
pfree(msg);
return false;
}
switch (neon_protocol_version)
{
switch (poll_result)
{
default: /* unknown/unused states are handled as a failed connection */
case PGRES_POLLING_FAILED:
{
char *pqerr = PQerrorMessage(shard->conn);
char *msg = NULL;
neon_shard_log(shard_no, DEBUG5, "POLLING_FAILED");
if (pqerr)
msg = pchomp(pqerr);
CLEANUP_AND_DISCONNECT(shard);
if (msg)
{
neon_shard_log(shard_no, elevel,
"could not connect to pageserver: %s",
msg);
pfree(msg);
}
else
neon_shard_log(shard_no, elevel,
"could not connect to pageserver");
return false;
}
case PGRES_POLLING_READING:
/* Sleep until there's something to do */
while (true)
{
int rc = WaitLatchOrSocket(MyLatch,
WL_EXIT_ON_PM_DEATH | WL_LATCH_SET | WL_SOCKET_READABLE,
PQsocket(shard->conn),
0,
PG_WAIT_EXTENSION);
elog(DEBUG5, "PGRES_POLLING_READING=>%d", rc);
if (rc & WL_LATCH_SET)
{
ResetLatch(MyLatch);
/* query cancellation, backend shutdown */
CHECK_FOR_INTERRUPTS();
}
if (rc & WL_SOCKET_READABLE)
break;
}
/* PQconnectPoll() handles the socket polling state updates */
break;
case PGRES_POLLING_WRITING:
/* Sleep until there's something to do */
while (true)
{
int rc = WaitLatchOrSocket(MyLatch,
WL_EXIT_ON_PM_DEATH | WL_LATCH_SET | WL_SOCKET_WRITEABLE,
PQsocket(shard->conn),
0,
PG_WAIT_EXTENSION);
elog(DEBUG5, "PGRES_POLLING_WRITING=>%d", rc);
if (rc & WL_LATCH_SET)
{
ResetLatch(MyLatch);
/* query cancellation, backend shutdown */
CHECK_FOR_INTERRUPTS();
}
if (rc & WL_SOCKET_WRITEABLE)
break;
}
/* PQconnectPoll() handles the socket polling state updates */
break;
case PGRES_POLLING_OK:
neon_shard_log(shard_no, DEBUG5, "POLLING_OK");
connected = true;
break;
}
poll_result = PQconnectPoll(shard->conn);
elog(DEBUG5, "PQconnectPoll=>%d", poll_result);
}
while (!connected);
/* No more polling needed; connection succeeded */
shard->last_connect_time = GetCurrentTimestamp();
shard->wes_read = CreateWaitEventSet(TopMemoryContext, 3);
AddWaitEventToSet(shard->wes_read, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
AddWaitEventToSet(shard->wes_read, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
NULL, NULL);
AddWaitEventToSet(shard->wes_read, WL_SOCKET_READABLE, PQsocket(shard->conn), NULL, NULL);
switch (neon_protocol_version)
{
case 2:
query = psprintf("pagestream_v2 %s %s", neon_tenant, neon_timeline);
pagestream_query = psprintf("pagestream_v2 %s %s", neon_tenant, neon_timeline);
break;
case 1:
query = psprintf("pagestream %s %s", neon_tenant, neon_timeline);
pagestream_query = psprintf("pagestream %s %s", neon_tenant, neon_timeline);
break;
default:
elog(ERROR, "unexpected neon_protocol_version %d", neon_protocol_version);
}
ret = PQsendQuery(conn, query);
pfree(query);
if (ret != 1)
{
PQfinish(conn);
neon_shard_log(shard_no, elevel, "could not send pagestream command to pageserver");
return false;
}
}
wes = CreateWaitEventSet(TopMemoryContext, 3);
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET,
MyLatch, NULL);
AddWaitEventToSet(wes, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
NULL, NULL);
AddWaitEventToSet(wes, WL_SOCKET_READABLE, PQsocket(conn), NULL, NULL);
if (PQstatus(shard->conn) == CONNECTION_BAD)
{
char *msg = pchomp(PQerrorMessage(shard->conn));
PG_TRY();
CLEANUP_AND_DISCONNECT(shard);
ereport(elevel,
(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
errmsg(NEON_TAG "[shard %d] could not establish connection to pageserver", shard_no),
errdetail_internal("%s", msg)));
pfree(msg);
return false;
}
ps_send_query_ret = PQsendQuery(shard->conn, pagestream_query);
pfree(pagestream_query);
if (ps_send_query_ret != 1)
{
CLEANUP_AND_DISCONNECT(shard);
neon_shard_log(shard_no, elevel, "could not send pagestream command to pageserver");
return false;
}
shard->state = PS_Connecting_PageStream;
/* fallthrough */
}
case PS_Connecting_PageStream:
{
while (PQisBusy(conn))
neon_shard_log(shard_no, DEBUG5, "Connection state: Connecting_PageStream");
if (PQstatus(shard->conn) == CONNECTION_BAD)
{
char *msg = pchomp(PQerrorMessage(shard->conn));
CLEANUP_AND_DISCONNECT(shard);
ereport(elevel,
(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
errmsg(NEON_TAG "[shard %d] could not establish connection to pageserver", shard_no),
errdetail_internal("%s", msg)));
pfree(msg);
return false;
}
while (PQisBusy(shard->conn))
{
WaitEvent event;
/* Sleep until there's something to do */
(void) WaitEventSetWait(wes, -1L, &event, 1, PG_WAIT_EXTENSION);
(void) WaitEventSetWait(shard->wes_read, -1L, &event, 1, PG_WAIT_EXTENSION);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
@@ -423,40 +599,37 @@ pageserver_connect(shardno_t shard_no, int elevel)
/* Data available in socket? */
if (event.events & WL_SOCKET_READABLE)
{
if (!PQconsumeInput(conn))
if (!PQconsumeInput(shard->conn))
{
char *msg = pchomp(PQerrorMessage(conn));
PQfinish(conn);
FreeWaitEventSet(wes);
char *msg = pchomp(PQerrorMessage(shard->conn));
CLEANUP_AND_DISCONNECT(shard);
neon_shard_log(shard_no, elevel, "could not complete handshake with pageserver: %s",
msg);
/* Returning from inside PG_TRY is bad, so we break/return later */
broke_from_loop = true;
break;
pfree(msg);
return false;
}
}
}
}
PG_CATCH();
{
PQfinish(conn);
FreeWaitEventSet(wes);
PG_RE_THROW();
}
PG_END_TRY();
if (broke_from_loop)
{
return false;
shard->state = PS_Connected;
/* fallthrough */
}
case PS_Connected:
/*
* We successfully connected. Future connections to this PageServer
* will do fast retries again, with exponential backoff.
*/
shard->delay_us = MIN_RECONNECT_INTERVAL_USEC;
neon_shard_log(shard_no, LOG, "libpagestore: connected to '%s' with protocol version %d", connstr, neon_protocol_version);
page_servers[shard_no].conn = conn;
page_servers[shard_no].wes = wes;
return true;
neon_shard_log(shard_no, DEBUG5, "Connection state: Connected");
neon_shard_log(shard_no, LOG, "libpagestore: connected to '%s' with protocol version %d", connstr, neon_protocol_version);
return true;
default:
neon_shard_log(shard_no, ERROR, "libpagestore: invalid connection state %d", shard->state);
}
/* This shouldn't be hit */
Assert(false);
}
/*
@@ -476,7 +649,7 @@ retry:
WaitEvent event;
/* Sleep until there's something to do */
(void) WaitEventSetWait(page_servers[shard_no].wes, -1L, &event, 1, PG_WAIT_EXTENSION);
(void) WaitEventSetWait(page_servers[shard_no].wes_read, -1L, &event, 1, PG_WAIT_EXTENSION);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
@@ -502,7 +675,8 @@ retry:
/*
* Reset prefetch and drop connection to the shard.
* It also drops connection to all other shards involved in prefetch.
* It also drops connection to all other shards involved in prefetch, through
* prefetch_on_ps_disconnect().
*/
static void
pageserver_disconnect(shardno_t shard_no)
@@ -512,9 +686,6 @@ pageserver_disconnect(shardno_t shard_no)
* whole prefetch queue, even for other pageservers. It should not
* cause big problems, because connection loss is supposed to be a
* rare event.
*
* Prefetch state should be reset even if page_servers[shard_no].conn == NULL,
* because prefetch request may be registered before connection is established.
*/
prefetch_on_ps_disconnect();
@@ -527,37 +698,36 @@ pageserver_disconnect(shardno_t shard_no)
static void
pageserver_disconnect_shard(shardno_t shard_no)
{
PageServer *shard = &page_servers[shard_no];
/*
* If anything goes wrong while we were sending a request, it's not clear
* what state the connection is in. For example, if we sent the request
* but didn't receive a response yet, we might receive the response some
* time later after we have already sent a new unrelated request. Close
* the connection to avoid getting confused.
* Similarly, even when we're in PS_DISCONNECTED, we may have junk to
* clean up: It is possible that we encountered an error allocating any
* of the wait event sets or the psql connection, or failed when we tried
* to attach wait events to the WaitEventSets.
*/
if (page_servers[shard_no].conn)
{
neon_shard_log(shard_no, LOG, "dropping connection to page server due to error");
PQfinish(page_servers[shard_no].conn);
page_servers[shard_no].conn = NULL;
}
if (page_servers[shard_no].wes != NULL)
{
FreeWaitEventSet(page_servers[shard_no].wes);
page_servers[shard_no].wes = NULL;
}
CLEANUP_AND_DISCONNECT(shard);
shard->state = PS_Disconnected;
}
static bool
pageserver_send(shardno_t shard_no, NeonRequest *request)
{
StringInfoData req_buff;
PGconn *pageserver_conn = page_servers[shard_no].conn;
PageServer *shard = &page_servers[shard_no];
PGconn *pageserver_conn;
/* If the connection was lost for some reason, reconnect */
if (pageserver_conn && PQstatus(pageserver_conn) == CONNECTION_BAD)
if (shard->state == PS_Connected && PQstatus(shard->conn) == CONNECTION_BAD)
{
neon_shard_log(shard_no, LOG, "pageserver_send disconnect bad connection");
pageserver_disconnect(shard_no);
pageserver_conn = NULL;
}
req_buff = nm_pack_request(request);
@@ -571,17 +741,19 @@ pageserver_send(shardno_t shard_no, NeonRequest *request)
* https://github.com/neondatabase/neon/issues/1138 So try to reestablish
* connection in case of failure.
*/
if (!page_servers[shard_no].conn)
if (shard->state != PS_Connected)
{
while (!pageserver_connect(shard_no, n_reconnect_attempts < max_reconnect_attempts ? LOG : ERROR))
while (!pageserver_connect(shard_no, shard->n_reconnect_attempts < max_reconnect_attempts ? LOG : ERROR))
{
HandleMainLoopInterrupts();
n_reconnect_attempts += 1;
shard->n_reconnect_attempts += 1;
}
n_reconnect_attempts = 0;
shard->n_reconnect_attempts = 0;
} else {
Assert(shard->conn != NULL);
}
pageserver_conn = page_servers[shard_no].conn;
pageserver_conn = shard->conn;
/*
* Send request.
@@ -590,13 +762,17 @@ pageserver_send(shardno_t shard_no, NeonRequest *request)
* should use async mode and check for interrupts while waiting. In
* practice, our requests are small enough to always fit in the output and
* TCP buffer.
*
* Note that this also will fail when the connection is in the
* PGRES_POLLING_WRITING state. It's kinda dirty to disconnect at this
* point, but on the grand scheme of things it's only a small issue.
*/
if (PQputCopyData(pageserver_conn, req_buff.data, req_buff.len) <= 0)
{
char *msg = pchomp(PQerrorMessage(pageserver_conn));
pageserver_disconnect(shard_no);
neon_shard_log(shard_no, LOG, "pageserver_send disconnect because failed to send page request (try to reconnect): %s", msg);
neon_shard_log(shard_no, LOG, "pageserver_send disconnected: failed to send page request (try to reconnect): %s", msg);
pfree(msg);
pfree(req_buff.data);
return false;
@@ -611,6 +787,7 @@ pageserver_send(shardno_t shard_no, NeonRequest *request)
neon_shard_log(shard_no, PageStoreTrace, "sent request: %s", msg);
pfree(msg);
}
return true;
}
@@ -619,58 +796,68 @@ pageserver_receive(shardno_t shard_no)
{
StringInfoData resp_buff;
NeonResponse *resp;
PGconn *pageserver_conn = page_servers[shard_no].conn;
PageServer *shard = &page_servers[shard_no];
PGconn *pageserver_conn = shard->conn;
/* read response */
int rc;
if (!pageserver_conn)
return NULL;
PG_TRY();
if (shard->state != PS_Connected)
{
/* read response */
int rc;
neon_shard_log(shard_no, LOG,
"pageserver_receive: returning NULL for non-connected pageserver connection: 0x%02x",
shard->state);
return NULL;
}
rc = call_PQgetCopyData(shard_no, &resp_buff.data);
if (rc >= 0)
Assert(pageserver_conn);
rc = call_PQgetCopyData(shard_no, &resp_buff.data);
if (rc >= 0)
{
/* call_PQgetCopyData handles rc == 0 */
Assert(rc > 0);
PG_TRY();
{
resp_buff.len = rc;
resp_buff.cursor = 0;
resp = nm_unpack_response(&resp_buff);
PQfreemem(resp_buff.data);
if (message_level_is_interesting(PageStoreTrace))
{
char *msg = nm_to_string((NeonMessage *) resp);
neon_shard_log(shard_no, PageStoreTrace, "got response: %s", msg);
pfree(msg);
}
}
else if (rc == -1)
PG_CATCH();
{
neon_shard_log(shard_no, LOG, "pageserver_receive disconnect because call_PQgetCopyData returns -1: %s", pchomp(PQerrorMessage(pageserver_conn)));
neon_shard_log(shard_no, LOG, "pageserver_receive: disconnect due malformatted response");
pageserver_disconnect(shard_no);
resp = NULL;
PG_RE_THROW();
}
else if (rc == -2)
{
char *msg = pchomp(PQerrorMessage(pageserver_conn));
PG_END_TRY();
pageserver_disconnect(shard_no);
neon_shard_log(shard_no, ERROR, "pageserver_receive disconnect because could not read COPY data: %s", msg);
}
else
if (message_level_is_interesting(PageStoreTrace))
{
pageserver_disconnect(shard_no);
neon_shard_log(shard_no, ERROR, "pageserver_receive disconnect because unexpected PQgetCopyData return value: %d", rc);
char *msg = nm_to_string((NeonMessage *) resp);
neon_shard_log(shard_no, PageStoreTrace, "got response: %s", msg);
pfree(msg);
}
}
PG_CATCH();
else if (rc == -1)
{
neon_shard_log(shard_no, LOG, "pageserver_receive disconnect due to caught exception");
neon_shard_log(shard_no, LOG, "pageserver_receive disconnect: psql end of copy data: %s", pchomp(PQerrorMessage(pageserver_conn)));
pageserver_disconnect(shard_no);
PG_RE_THROW();
resp = NULL;
}
else if (rc == -2)
{
char *msg = pchomp(PQerrorMessage(pageserver_conn));
pageserver_disconnect(shard_no);
neon_shard_log(shard_no, ERROR, "pageserver_receive disconnect: could not read COPY data: %s", msg);
}
else
{
pageserver_disconnect(shard_no);
neon_shard_log(shard_no, ERROR, "pageserver_receive disconnect: unexpected PQgetCopyData return value: %d", rc);
}
PG_END_TRY();
return (NeonResponse *) resp;
}
@@ -681,7 +868,7 @@ pageserver_flush(shardno_t shard_no)
{
PGconn *pageserver_conn = page_servers[shard_no].conn;
if (!pageserver_conn)
if (page_servers[shard_no].state != PS_Connected)
{
neon_shard_log(shard_no, WARNING, "Tried to flush while disconnected");
}
@@ -697,6 +884,7 @@ pageserver_flush(shardno_t shard_no)
return false;
}
}
return true;
}
@@ -891,5 +1079,7 @@ pg_init_libpagestore(void)
dbsize_hook = neon_dbsize;
}
memset(page_servers, 0, sizeof(page_servers));
lfc_init();
}

View File

@@ -184,8 +184,8 @@ NeonWALRead(NeonWALReader *state, char *buf, XLogRecPtr startptr, Size count, Ti
}
else if (state->wre_errno == ENOENT)
{
nwr_log(LOG, "local read failed as segment at %X/%X doesn't exist, attempting remote",
LSN_FORMAT_ARGS(startptr));
nwr_log(LOG, "local read at %X/%X len %zu failed as segment file doesn't exist, attempting remote",
LSN_FORMAT_ARGS(startptr), count);
return NeonWALReadRemote(state, buf, startptr, count, tli);
}
else
@@ -614,6 +614,7 @@ NeonWALReadLocal(NeonWALReader *state, char *buf, XLogRecPtr startptr, Size coun
uint32 startoff;
int segbytes;
int readbytes;
XLogSegNo lastRemovedSegNo;
startoff = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
@@ -689,6 +690,23 @@ NeonWALReadLocal(NeonWALReader *state, char *buf, XLogRecPtr startptr, Size coun
return false;
}
/*
* Recheck that the segment hasn't been removed while we were reading
* it.
*/
lastRemovedSegNo = XLogGetLastRemovedSegno();
if (state->seg.ws_segno <= lastRemovedSegNo)
{
char fname[MAXFNAMELEN];
state->wre_errno = ENOENT;
XLogFileName(fname, tli, state->seg.ws_segno, state->segcxt.ws_segsize);
snprintf(state->err_msg, sizeof(state->err_msg), "WAL segment %s has been removed during the read, lastRemovedSegNo " UINT64_FORMAT,
fname, lastRemovedSegNo);
return false;
}
/* Update state for read */
recptr += readbytes;
nbytes -= readbytes;

View File

@@ -295,10 +295,16 @@ extern void neon_immedsync(SMgrRelation reln, ForkNumber forknum);
/* utils for neon relsize cache */
extern void relsize_hash_init(void);
extern bool get_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber *size);
extern void set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size);
extern void set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber new_size);
extern void update_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size);
extern void forget_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum);
extern void start_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blocknum);
extern bool is_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum);
extern void stop_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum);
/* functions for local file cache */
#if PG_MAJORVERSION_NUM < 16
extern void lfc_write(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,

View File

@@ -10,10 +10,6 @@
* Temporary and unlogged tables are stored locally, by md.c. The functions
* here just pass the calls through to corresponding md.c functions.
*
* Index build operations that use the buffer cache are also handled locally,
* just like unlogged tables. Such operations must be marked by calling
* smgr_start_unlogged_build() and friends.
*
* In order to know what relations are permanent and which ones are not, we
* have added a 'smgr_relpersistence' field to SmgrRelationData, and it is set
* by smgropen() callers, when they have the relcache entry at hand. However,
@@ -64,6 +60,7 @@
#include "storage/fsm_internals.h"
#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/rel.h"
#include "pagestore_client.h"
@@ -94,19 +91,13 @@ static char *hexdump_page(char *page);
const int SmgrTrace = DEBUG5;
#define NEON_PANIC_CONNECTION_STATE(shard_no, elvl, message, ...) \
neon_shard_log(shard_no, elvl, "Broken connection state: " message, \
##__VA_ARGS__)
page_server_api *page_server;
/* unlogged relation build states */
typedef enum
{
UNLOGGED_BUILD_NOT_IN_PROGRESS = 0,
UNLOGGED_BUILD_PHASE_1,
UNLOGGED_BUILD_PHASE_2,
UNLOGGED_BUILD_NOT_PERMANENT
} UnloggedBuildPhase;
static SMgrRelation unlogged_build_rel = NULL;
static UnloggedBuildPhase unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
const PGAlignedBlock zero_buffer;
static bool neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id);
static bool (*old_redo_read_buffer_filter) (XLogReaderState *record, uint8 block_id) = NULL;
@@ -526,6 +517,8 @@ prefetch_flush_requests(void)
*
* NOTE: this function may indirectly update MyPState->pfs_hash; which
* invalidates any active pointers into the hash table.
* NOTE: callers should make sure they can handle query cancellations in this
* function's call path.
*/
static bool
prefetch_wait_for(uint64 ring_index)
@@ -561,6 +554,8 @@ prefetch_wait_for(uint64 ring_index)
*
* NOTE: this function may indirectly update MyPState->pfs_hash; which
* invalidates any active pointers into the hash table.
*
* NOTE: this does IO, and can get canceled out-of-line.
*/
static bool
prefetch_read(PrefetchRequest *slot)
@@ -572,6 +567,14 @@ prefetch_read(PrefetchRequest *slot)
Assert(slot->response == NULL);
Assert(slot->my_ring_index == MyPState->ring_receive);
if (slot->status != PRFS_REQUESTED ||
slot->response != NULL ||
slot->my_ring_index != MyPState->ring_receive)
neon_shard_log(slot->shard_no, ERROR,
"Incorrect prefetch read: status=%d response=%p my=%lu receive=%lu",
slot->status, slot->response,
(long)slot->my_ring_index, (long)MyPState->ring_receive);
old = MemoryContextSwitchTo(MyPState->errctx);
response = (NeonResponse *) page_server->receive(slot->shard_no);
MemoryContextSwitchTo(old);
@@ -589,6 +592,11 @@ prefetch_read(PrefetchRequest *slot)
}
else
{
neon_shard_log(slot->shard_no, WARNING,
"No response from reading prefetch entry %lu: %u/%u/%u.%u block %u. This can be caused by a concurrent disconnect",
(long)slot->my_ring_index,
RelFileInfoFmt(BufTagGetNRelFileInfo(slot->buftag)),
slot->buftag.forkNum, slot->buftag.blockNum);
return false;
}
}
@@ -603,6 +611,7 @@ void
prefetch_on_ps_disconnect(void)
{
MyPState->ring_flush = MyPState->ring_unused;
while (MyPState->ring_receive < MyPState->ring_unused)
{
PrefetchRequest *slot;
@@ -625,6 +634,7 @@ prefetch_on_ps_disconnect(void)
slot->status = PRFS_TAG_REMAINS;
MyPState->n_requests_inflight -= 1;
MyPState->ring_receive += 1;
prefetch_set_unused(ring_index);
}
}
@@ -691,6 +701,8 @@ static void
prefetch_do_request(PrefetchRequest *slot, neon_request_lsns *force_request_lsns)
{
bool found;
uint64 mySlotNo = slot->my_ring_index;
NeonGetPageRequest request = {
.req.tag = T_NeonGetPageRequest,
/* lsn and not_modified_since are filled in below */
@@ -699,6 +711,8 @@ prefetch_do_request(PrefetchRequest *slot, neon_request_lsns *force_request_lsns
.blkno = slot->buftag.blockNum,
};
Assert(mySlotNo == MyPState->ring_unused);
if (force_request_lsns)
slot->request_lsns = *force_request_lsns;
else
@@ -711,7 +725,11 @@ prefetch_do_request(PrefetchRequest *slot, neon_request_lsns *force_request_lsns
Assert(slot->response == NULL);
Assert(slot->my_ring_index == MyPState->ring_unused);
while (!page_server->send(slot->shard_no, (NeonRequest *) &request));
while (!page_server->send(slot->shard_no, (NeonRequest *) &request))
{
Assert(mySlotNo == MyPState->ring_unused);
/* loop */
}
/* update prefetch state */
MyPState->n_requests_inflight += 1;
@@ -722,7 +740,6 @@ prefetch_do_request(PrefetchRequest *slot, neon_request_lsns *force_request_lsns
/* update slot state */
slot->status = PRFS_REQUESTED;
prfh_insert(MyPState->prf_hash, slot, &found);
Assert(!found);
}
@@ -894,6 +911,10 @@ Retry:
return ring_index;
}
/*
* Note: this function can get canceled and use a long jump to the next catch
* context. Take care.
*/
static NeonResponse *
page_server_request(void const *req)
{
@@ -925,19 +946,38 @@ page_server_request(void const *req)
* Current sharding model assumes that all metadata is present only at shard 0.
* We still need to call get_shard_no() to check if shard map is up-to-date.
*/
if (((NeonRequest *) req)->tag != T_NeonGetPageRequest || ((NeonGetPageRequest *) req)->forknum != MAIN_FORKNUM)
if (((NeonRequest *) req)->tag != T_NeonGetPageRequest ||
((NeonGetPageRequest *) req)->forknum != MAIN_FORKNUM)
{
shard_no = 0;
}
do
{
while (!page_server->send(shard_no, (NeonRequest *) req) || !page_server->flush(shard_no));
consume_prefetch_responses();
resp = page_server->receive(shard_no);
} while (resp == NULL);
return resp;
PG_TRY();
{
while (!page_server->send(shard_no, (NeonRequest *) req)
|| !page_server->flush(shard_no))
{
/* do nothing */
}
consume_prefetch_responses();
resp = page_server->receive(shard_no);
}
PG_CATCH();
{
/*
* Cancellation in this code needs to be handled better at some
* point, but this currently seems fine for now.
*/
page_server->disconnect(shard_no);
PG_RE_THROW();
}
PG_END_TRY();
} while (resp == NULL);
return resp;
}
@@ -1349,10 +1389,6 @@ PageIsEmptyHeapPage(char *buffer)
return memcmp(buffer, empty_page.data, BLCKSZ) == 0;
}
/*
* A page is being evicted from the shared buffer cache. Update the
* last-written LSN of the page, and WAL-log it if needed.
*/
static void
#if PG_MAJORVERSION_NUM < 16
neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool force)
@@ -1360,6 +1396,7 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, ch
neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const char *buffer, bool force)
#endif
{
BlockNumber relsize;
XLogRecPtr lsn = PageGetLSN((Page) buffer);
bool log_page;
@@ -1376,13 +1413,28 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, co
Assert(XLogInsertAllowed());
log_page = true;
}
else if (XLogInsertAllowed() &&
!ShutdownRequestPending &&
(forknum == FSM_FORKNUM || forknum == VISIBILITYMAP_FORKNUM))
else if (XLogInsertAllowed() && !ShutdownRequestPending)
{
log_page = true;
if (forknum == MAIN_FORKNUM)
{
if (!PageIsNew((Page) buffer))
{
if (lsn < FirstNormalUnloggedLSN)
{
start_unlogged_build(InfoFromSMgrRel(reln), forknum, blocknum);
log_page = true;
}
else if (is_unlogged_build(InfoFromSMgrRel(reln), forknum))
{
log_page = true;
}
}
}
else
{
log_page = true;
}
}
if (log_page)
{
XLogRecPtr recptr;
@@ -1455,14 +1507,6 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, co
lsn = GetXLogReplayRecPtr(NULL); /* in standby mode, soldier on */
}
}
else
{
ereport(SmgrTrace,
(errmsg(NEON_TAG "Evicting page %u of relation %u/%u/%u.%u with lsn=%X/%X",
blocknum,
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, LSN_FORMAT_ARGS(lsn))));
}
/*
* Remember the LSN on this page. When we read the page again, we must
@@ -1471,6 +1515,19 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, co
SetLastWrittenLSNForBlock(lsn, InfoFromSMgrRel(reln), forknum, blocknum);
}
/*
* Check if unlogged build is in progress for specified relation
* and stop it if so. It is used as callback for log_newpage_range( function
* which is called at the end of unlogged build.
*/
static void
neon_log_newpage_range_callback(Relation rel, ForkNumber forknum)
{
SMgrRelation smgr = RelationGetSmgr(rel);
stop_unlogged_build(InfoFromSMgrRel(smgr), forknum);
}
/*
* neon_init() -- Initialize private state
*/
@@ -1506,6 +1563,8 @@ neon_init(void)
old_redo_read_buffer_filter = redo_read_buffer_filter;
redo_read_buffer_filter = neon_redo_read_buffer_filter;
log_newpage_range_callback = neon_log_newpage_range_callback;
#ifdef DEBUG_COMPARE_LOCAL
mdinit();
#endif
@@ -1905,7 +1964,9 @@ neon_exists(SMgrRelation reln, ForkNumber forkNum)
break;
default:
neon_log(ERROR, "unexpected response from page server with tag 0x%02x in neon_exists", resp->tag);
NEON_PANIC_CONNECTION_STATE(-1, PANIC,
"Expected Exists (0x%02x) or Error (0x%02x) response to ExistsRequest, but got 0x%02x",
T_NeonExistsResponse, T_NeonErrorResponse, resp->tag);
}
pfree(resp);
return exists;
@@ -2077,6 +2138,7 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
neon_wallog_page(reln, forkNum, n_blocks++, buffer, true);
neon_wallog_page(reln, forkNum, blkno, buffer, false);
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blkno + 1);
lsn = PageGetLSN((Page) buffer);
@@ -2112,8 +2174,7 @@ void
neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
int nblocks, bool skipFsync)
{
const PGAlignedBlock buffer = {0};
int remblocks = nblocks;
BlockNumber remblocks = nblocks;
XLogRecPtr lsn = 0;
switch (reln->smgr_relpersistence)
@@ -2163,8 +2224,24 @@ neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
if (!XLogInsertAllowed())
return;
/* ensure we have enough xlog buffers to log max-sized records */
XLogEnsureRecordSpace(Min(remblocks, (XLR_MAX_BLOCK_ID - 1)), 0);
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blocknum + nblocks);
if (forkNum != MAIN_FORKNUM) /* no need to wal-log zero pages except VM/FSM forks */
{
/* ensure we have enough xlog buffers to log max-sized records */
XLogEnsureRecordSpace(Min(remblocks, (XLR_MAX_BLOCK_ID - 1)), 0);
}
else
{
/*
* smgr_extend is often called with an all-zeroes page, so
* lsn==InvalidXLogRecPtr. An smgr_write() call will come for the buffer
* later, after it has been initialized with the real page contents, and
* it is eventually evicted from the buffer cache. But we need a valid LSN
* to the relation metadata update now.
*/
lsn = GetXLogInsertRecPtr();
}
/*
* Iterate over all the pages. They are collected into batches of
@@ -2175,17 +2252,19 @@ neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
{
int count = Min(remblocks, XLR_MAX_BLOCK_ID);
XLogBeginInsert();
if (forkNum != MAIN_FORKNUM) /* no need to wal-log zero pages except VM/FSM forks */
{
XLogBeginInsert();
for (int i = 0; i < count; i++)
XLogRegisterBlock(i, &InfoFromSMgrRel(reln), forkNum, blocknum + i,
(char *) buffer.data, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
lsn = XLogInsert(RM_XLOG_ID, XLOG_FPI);
for (int i = 0; i < count; i++)
XLogRegisterBlock(i, &InfoFromSMgrRel(reln), forkNum, blocknum + i,
(char *) zero_buffer.data, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
lsn = XLogInsert(RM_XLOG_ID, XLOG_FPI);
}
for (int i = 0; i < count; i++)
{
lfc_write(InfoFromSMgrRel(reln), forkNum, blocknum + i, buffer.data);
lfc_write(InfoFromSMgrRel(reln), forkNum, blocknum + i, zero_buffer.data);
SetLastWrittenLSNForBlock(lsn, InfoFromSMgrRel(reln), forkNum,
blocknum + i);
}
@@ -2197,7 +2276,6 @@ neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
Assert(lsn != 0);
SetLastWrittenLSNForRelation(lsn, InfoFromSMgrRel(reln), forkNum);
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blocknum);
}
#endif
@@ -2357,7 +2435,7 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
/*
* Try to find prefetched page in the list of received pages.
*/
Retry:
Retry:
entry = prfh_lookup(MyPState->prf_hash, (PrefetchRequest *) &buftag);
if (entry != NULL)
@@ -2443,7 +2521,9 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
((NeonErrorResponse *) resp)->message)));
break;
default:
neon_log(ERROR, "unexpected response from page server with tag 0x%02x in neon_read_at_lsn", resp->tag);
NEON_PANIC_CONNECTION_STATE(slot->shard_no, PANIC,
"Expected GetPage (0x%02x) or Error (0x%02x) response to GetPageRequest, but got 0x%02x",
T_NeonGetPageResponse, T_NeonErrorResponse, resp->tag);
}
/* buffer was used, clean up for later reuse */
@@ -2462,6 +2542,7 @@ neon_read(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno, void *buffer
#endif
{
neon_request_lsns request_lsns;
BlockNumber relsize;
switch (reln->smgr_relpersistence)
{
@@ -2714,7 +2795,9 @@ neon_nblocks(SMgrRelation reln, ForkNumber forknum)
break;
default:
neon_log(ERROR, "unexpected response from page server with tag 0x%02x in neon_nblocks", resp->tag);
NEON_PANIC_CONNECTION_STATE(-1, PANIC,
"Expected Nblocks (0x%02x) or Error (0x%02x) response to NblocksRequest, but got 0x%02x",
T_NeonNblocksResponse, T_NeonErrorResponse, resp->tag);
}
update_cached_relsize(InfoFromSMgrRel(reln), forknum, n_blocks);
@@ -2767,7 +2850,9 @@ neon_dbsize(Oid dbNode)
break;
default:
neon_log(ERROR, "unexpected response from page server with tag 0x%02x in neon_dbsize", resp->tag);
NEON_PANIC_CONNECTION_STATE(-1, PANIC,
"Expected DbSize (0x%02x) or Error (0x%02x) response to DbSizeRequest, but got 0x%02x",
T_NeonDbSizeResponse, T_NeonErrorResponse, resp->tag);
}
neon_log(SmgrTrace, "neon_dbsize: db %u (request LSN %X/%08X): %ld bytes",
@@ -2878,150 +2963,6 @@ neon_immedsync(SMgrRelation reln, ForkNumber forknum)
#endif
}
/*
* neon_start_unlogged_build() -- Starting build operation on a rel.
*
* Some indexes are built in two phases, by first populating the table with
* regular inserts, using the shared buffer cache but skipping WAL-logging,
* and WAL-logging the whole relation after it's done. Neon relies on the
* WAL to reconstruct pages, so we cannot use the page server in the
* first phase when the changes are not logged.
*/
static void
neon_start_unlogged_build(SMgrRelation reln)
{
/*
* Currently, there can be only one unlogged relation build operation in
* progress at a time. That's enough for the current usage.
*/
if (unlogged_build_phase != UNLOGGED_BUILD_NOT_IN_PROGRESS)
neon_log(ERROR, "unlogged relation build is already in progress");
Assert(unlogged_build_rel == NULL);
ereport(SmgrTrace,
(errmsg(NEON_TAG "starting unlogged build of relation %u/%u/%u",
RelFileInfoFmt(InfoFromSMgrRel(reln)))));
switch (reln->smgr_relpersistence)
{
case 0:
neon_log(ERROR, "cannot call smgr_start_unlogged_build() on rel with unknown persistence");
break;
case RELPERSISTENCE_PERMANENT:
break;
case RELPERSISTENCE_TEMP:
case RELPERSISTENCE_UNLOGGED:
unlogged_build_rel = reln;
unlogged_build_phase = UNLOGGED_BUILD_NOT_PERMANENT;
return;
default:
neon_log(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
}
if (smgrnblocks(reln, MAIN_FORKNUM) != 0)
neon_log(ERROR, "cannot perform unlogged index build, index is not empty ");
unlogged_build_rel = reln;
unlogged_build_phase = UNLOGGED_BUILD_PHASE_1;
/* Make the relation look like it's unlogged */
reln->smgr_relpersistence = RELPERSISTENCE_UNLOGGED;
/*
* Create the local file. In a parallel build, the leader is expected to
* call this first and do it.
*
* FIXME: should we pass isRedo true to create the tablespace dir if it
* doesn't exist? Is it needed?
*/
if (!IsParallelWorker())
mdcreate(reln, MAIN_FORKNUM, false);
}
/*
* neon_finish_unlogged_build_phase_1()
*
* Call this after you have finished populating a relation in unlogged mode,
* before you start WAL-logging it.
*/
static void
neon_finish_unlogged_build_phase_1(SMgrRelation reln)
{
Assert(unlogged_build_rel == reln);
ereport(SmgrTrace,
(errmsg(NEON_TAG "finishing phase 1 of unlogged build of relation %u/%u/%u",
RelFileInfoFmt(InfoFromSMgrRel(reln)))));
if (unlogged_build_phase == UNLOGGED_BUILD_NOT_PERMANENT)
return;
Assert(unlogged_build_phase == UNLOGGED_BUILD_PHASE_1);
Assert(reln->smgr_relpersistence == RELPERSISTENCE_UNLOGGED);
/*
* In a parallel build, (only) the leader process performs the 2nd
* phase.
*/
if (IsParallelWorker())
{
unlogged_build_rel = NULL;
unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
}
else
unlogged_build_phase = UNLOGGED_BUILD_PHASE_2;
}
/*
* neon_end_unlogged_build() -- Finish an unlogged rel build.
*
* Call this after you have finished WAL-logging an relation that was
* first populated without WAL-logging.
*
* This removes the local copy of the rel, since it's now been fully
* WAL-logged and is present in the page server.
*/
static void
neon_end_unlogged_build(SMgrRelation reln)
{
NRelFileInfoBackend rinfob = InfoBFromSMgrRel(reln);
Assert(unlogged_build_rel == reln);
ereport(SmgrTrace,
(errmsg(NEON_TAG "ending unlogged build of relation %u/%u/%u",
RelFileInfoFmt(InfoFromNInfoB(rinfob)))));
if (unlogged_build_phase != UNLOGGED_BUILD_NOT_PERMANENT)
{
Assert(unlogged_build_phase == UNLOGGED_BUILD_PHASE_2);
Assert(reln->smgr_relpersistence == RELPERSISTENCE_UNLOGGED);
/* Make the relation look permanent again */
reln->smgr_relpersistence = RELPERSISTENCE_PERMANENT;
/* Remove local copy */
rinfob = InfoBFromSMgrRel(reln);
for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
neon_log(SmgrTrace, "forgetting cached relsize for %u/%u/%u.%u",
RelFileInfoFmt(InfoFromNInfoB(rinfob)),
forknum);
forget_cached_relsize(InfoFromNInfoB(rinfob), forknum);
mdclose(reln, forknum);
/* use isRedo == true, so that we drop it immediately */
mdunlink(rinfob, forknum, true);
}
}
unlogged_build_rel = NULL;
unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
}
#define STRPREFIX(str, prefix) (strncmp(str, prefix, strlen(prefix)) == 0)
static int
@@ -3106,47 +3047,15 @@ neon_read_slru_segment(SMgrRelation reln, const char* path, int segno, void* buf
break;
default:
neon_log(ERROR, "unexpected response from page server with tag 0x%02x in neon_read_slru_segment", resp->tag);
NEON_PANIC_CONNECTION_STATE(-1, PANIC,
"Expected GetSlruSegment (0x%02x) or Error (0x%02x) response to GetSlruSegmentRequest, but got 0x%02x",
T_NeonGetSlruSegmentResponse, T_NeonErrorResponse, resp->tag);
}
pfree(resp);
return n_blocks;
}
static void
AtEOXact_neon(XactEvent event, void *arg)
{
switch (event)
{
case XACT_EVENT_ABORT:
case XACT_EVENT_PARALLEL_ABORT:
/*
* Forget about any build we might have had in progress. The local
* file will be unlinked by smgrDoPendingDeletes()
*/
unlogged_build_rel = NULL;
unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
break;
case XACT_EVENT_COMMIT:
case XACT_EVENT_PARALLEL_COMMIT:
case XACT_EVENT_PREPARE:
case XACT_EVENT_PRE_COMMIT:
case XACT_EVENT_PARALLEL_PRE_COMMIT:
case XACT_EVENT_PRE_PREPARE:
if (unlogged_build_phase != UNLOGGED_BUILD_NOT_IN_PROGRESS)
{
unlogged_build_rel = NULL;
unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
ereport(ERROR,
(errcode(ERRCODE_INTERNAL_ERROR),
(errmsg(NEON_TAG "unlogged index build was not properly finished"))));
}
break;
}
}
static const struct f_smgr neon_smgr =
{
.smgr_init = neon_init,
@@ -3168,10 +3077,6 @@ static const struct f_smgr neon_smgr =
.smgr_truncate = neon_truncate,
.smgr_immedsync = neon_immedsync,
.smgr_start_unlogged_build = neon_start_unlogged_build,
.smgr_finish_unlogged_build_phase_1 = neon_finish_unlogged_build_phase_1,
.smgr_end_unlogged_build = neon_end_unlogged_build,
.smgr_read_slru_segment = neon_read_slru_segment,
};
@@ -3189,8 +3094,6 @@ smgr_neon(BackendId backend, NRelFileInfo rinfo)
void
smgr_init_neon(void)
{
RegisterXactCallback(AtEOXact_neon, NULL);
smgr_init_standard();
neon_init();
}

View File

@@ -39,7 +39,8 @@ typedef struct
typedef struct
{
RelTag tag;
BlockNumber size;
BlockNumber size : 31;
BlockNumber unlogged : 1;
dlist_node lru_node; /* LRU list node */
} RelSizeEntry;
@@ -117,9 +118,12 @@ get_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber *size)
*size = entry->size;
relsize_ctl->hits += 1;
found = true;
/* Move entry to the LRU list tail */
dlist_delete(&entry->lru_node);
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
if (!entry->unlogged) /* entries of relation involved in unlogged build are pinned */
{
/* Move entry to the LRU list tail */
dlist_delete(&entry->lru_node);
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
}
}
else
{
@@ -130,6 +134,9 @@ get_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber *size)
return found;
}
/*
* Cache relation size.
*/
void
set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size)
{
@@ -148,31 +155,53 @@ set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size)
*/
while ((entry = hash_search(relsize_hash, &tag, HASH_ENTER_NULL, &found)) == NULL)
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
Assert(relsize_ctl->size > 0);
relsize_ctl->size -= 1;
if (dlist_is_empty(&relsize_ctl->lru))
{
elog(FATAL, "No more free relsize cache entries");
}
else
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
Assert(relsize_ctl->size > 0);
relsize_ctl->size -= 1;
}
}
entry->size = size;
if (!found)
{
if (++relsize_ctl->size == relsize_hash_size)
entry->unlogged = false;
if (relsize_ctl->size+1 == relsize_hash_size)
{
/*
* Remove least recently used elment from the hash.
* Hash size after is becomes `relsize_hash_size-1`.
* But it is not considered to be a problem, because size of this hash is expecrted large enough and +-1 doesn't matter.
*/
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
relsize_ctl->size -= 1;
if (dlist_is_empty(&relsize_ctl->lru))
{
elog(FATAL, "No more free relsize cache entries");
}
else
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
}
}
else
{
relsize_ctl->size += 1;
}
}
else
else if (entry->unlogged) /* entries of relation involved in unlogged build are pinned */
{
dlist_delete(&entry->lru_node);
}
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
if (!entry->unlogged) /* entries of relation involved in unlogged build are pinned */
{
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
}
relsize_ctl->writes += 1;
LWLockRelease(relsize_lock);
}
@@ -191,23 +220,42 @@ update_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size)
tag.forknum = forknum;
LWLockAcquire(relsize_lock, LW_EXCLUSIVE);
entry = hash_search(relsize_hash, &tag, HASH_ENTER, &found);
if (!found || entry->size < size)
if (!found) {
entry->unlogged = false;
entry->size = size;
if (!found)
{
if (++relsize_ctl->size == relsize_hash_size)
if (relsize_ctl->size+1 == relsize_hash_size)
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
relsize_ctl->size -= 1;
if (dlist_is_empty(&relsize_ctl->lru))
{
elog(FATAL, "No more free relsize cache entries");
}
else
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
}
}
else
{
relsize_ctl->size += 1;
}
}
else
{
dlist_delete(&entry->lru_node);
if (entry->size < size)
entry->size = size;
if (!entry->unlogged) /* entries of relation involved in unlogged build are pinned */
{
dlist_delete(&entry->lru_node);
}
}
relsize_ctl->writes += 1;
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
if (!entry->unlogged) /* entries of relation involved in unlogged build are pinned */
{
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
}
LWLockRelease(relsize_lock);
}
}
@@ -225,13 +273,154 @@ forget_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum)
entry = hash_search(relsize_hash, &tag, HASH_REMOVE, NULL);
if (entry)
{
dlist_delete(&entry->lru_node);
if (!entry->unlogged)
{
/* Entried of relations involved in unlogged build are pinned */
dlist_delete(&entry->lru_node);
}
relsize_ctl->size -= 1;
}
LWLockRelease(relsize_lock);
}
}
/*
* This function starts unlogged build if it was not yet started.
* The criteria for starting iunlogged build is writing page without normal LSN.
* It can happen in any backend when page is evicted from shared buffers.
* Or can not happen at all if index fits in shared buffers.
*/
void
start_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blocknum)
{
if (relsize_hash_size > 0)
{
RelTag tag;
RelSizeEntry *entry;
bool found;
bool start = false;
tag.rinfo = rinfo;
tag.forknum = forknum;
LWLockAcquire(relsize_lock, LW_EXCLUSIVE);
entry = hash_search(relsize_hash, &tag, HASH_ENTER, &found);
if (!found) {
entry->size = blocknum + 1;
start = true;
if (relsize_ctl->size+1 == relsize_hash_size)
{
if (dlist_is_empty(&relsize_ctl->lru))
{
elog(FATAL, "No more free relsize cache entries");
}
else
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
}
}
else
{
relsize_ctl->size += 1;
}
}
else
{
start = !entry->unlogged;
if (entry->size <= blocknum)
{
entry->size = blocknum + 1;
}
if (start)
{
/* relation involved in unlogged build are pinned until the end of the build */
dlist_delete(&entry->lru_node);
}
}
entry->unlogged = true;
relsize_ctl->writes += 1;
/*
* We are not putting entry in LRU least to prevent it fro eviction until the end of unlogged build
*/
if (start)
elog(LOG, "Start unlogged build for %u/%u/%u.%u",
RelFileInfoFmt(rinfo), forknum);
LWLockRelease(relsize_lock);
}
}
/*
* Check if unlogged build is in progress.
*/
bool
is_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum)
{
bool unlogged = false;
if (relsize_hash_size > 0)
{
RelTag tag;
RelSizeEntry *entry;
tag.rinfo = rinfo;
tag.forknum = forknum;
LWLockAcquire(relsize_lock, LW_SHARED);
entry = hash_search(relsize_hash, &tag, HASH_FIND, NULL);
if (entry != NULL)
{
unlogged = entry->unlogged;
relsize_ctl->hits += 1;
}
else
{
relsize_ctl->misses += 1;
}
LWLockRelease(relsize_lock);
}
return unlogged;
}
/*
* Clear unlogged build if it was set.
*/
void
stop_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum)
{
if (relsize_hash_size > 0)
{
RelTag tag;
RelSizeEntry *entry;
tag.rinfo = rinfo;
tag.forknum = forknum;
LWLockAcquire(relsize_lock, LW_EXCLUSIVE);
entry = hash_search(relsize_hash, &tag, HASH_FIND, NULL);
if (entry != NULL)
{
bool unlogged = entry->unlogged;
entry->unlogged = false;
relsize_ctl->hits += 1;
if (unlogged)
{
elog(LOG, "Stop unlogged build for %u/%u/%u.%u",
RelFileInfoFmt(rinfo), forknum);
/* Return entry to the LRU list */
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
}
}
else
{
relsize_ctl->misses += 1;
}
LWLockRelease(relsize_lock);
}
}
void
relsize_hash_init(void)
{

View File

@@ -38,6 +38,7 @@ hmac.workspace = true
hostname.workspace = true
http.workspace = true
humantime.workspace = true
humantime-serde.workspace = true
hyper.workspace = true
hyper1 = { package = "hyper", version = "1.2", features = ["server"] }
hyper-util = { version = "0.1", features = ["server", "http1", "http2", "tokio"] }
@@ -82,6 +83,7 @@ thiserror.workspace = true
tikv-jemallocator.workspace = true
tikv-jemalloc-ctl = { workspace = true, features = ["use_std"] }
tokio-postgres.workspace = true
tokio-postgres-rustls.workspace = true
tokio-rustls.workspace = true
tokio-util.workspace = true
tokio = { workspace = true, features = ["signal"] }
@@ -94,10 +96,8 @@ url.workspace = true
urlencoding.workspace = true
utils.workspace = true
uuid.workspace = true
webpki-roots.workspace = true
rustls-native-certs.workspace = true
x509-parser.workspace = true
native-tls.workspace = true
postgres-native-tls.workspace = true
postgres-protocol.workspace = true
redis.workspace = true

View File

@@ -35,7 +35,7 @@ use crate::{
},
stream, url,
};
use crate::{scram, EndpointCacheKey, EndpointId, Normalize, RoleName};
use crate::{scram, EndpointCacheKey, EndpointId, RoleName};
/// Alternative to [`std::borrow::Cow`] but doesn't need `T: ToOwned` as we don't need that functionality
pub enum MaybeOwned<'a, T> {

View File

@@ -100,6 +100,7 @@ pub(super) async fn authenticate(
.dbname(&db_info.dbname)
.user(&db_info.user);
ctx.set_dbname(db_info.dbname.into());
ctx.set_user(db_info.user.into());
ctx.set_project(db_info.aux.clone());
info!("woken up a compute node");

View File

@@ -11,7 +11,6 @@ use crate::{
};
use itertools::Itertools;
use pq_proto::StartupMessageParams;
use smol_str::SmolStr;
use std::{collections::HashSet, net::IpAddr, str::FromStr};
use thiserror::Error;
use tracing::{info, warn};
@@ -96,13 +95,6 @@ impl ComputeUserInfoMaybeEndpoint {
let get_param = |key| params.get(key).ok_or(MissingKey(key));
let user: RoleName = get_param("user")?.into();
// record the values if we have them
ctx.set_application(params.get("application_name").map(SmolStr::from));
ctx.set_user(user.clone());
if let Some(dbname) = params.get("database") {
ctx.set_dbname(dbname.into());
}
// Project name might be passed via PG's command-line options.
let endpoint_option = params
.options_raw()

View File

@@ -9,6 +9,7 @@ use futures::future::Either;
use itertools::Itertools;
use proxy::config::TlsServerEndPoint;
use proxy::context::RequestMonitoring;
use proxy::metrics::{Metrics, ThreadPoolMetrics};
use proxy::proxy::{copy_bidirectional_client_compute, run_until_cancelled};
use rustls::pki_types::PrivateKeyDer;
use tokio::net::TcpListener;
@@ -65,6 +66,8 @@ async fn main() -> anyhow::Result<()> {
let _panic_hook_guard = utils::logging::replace_panic_hook_with_tracing_panic_hook();
let _sentry_guard = init_sentry(Some(GIT_VERSION.into()), &[]);
Metrics::install(Arc::new(ThreadPoolMetrics::new(0)));
let args = cli().get_matches();
let destination: String = args.get_one::<String>("dest").unwrap().parse()?;

View File

@@ -557,14 +557,14 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
let config::ConcurrencyLockOptions {
shards,
permits,
limiter,
epoch,
timeout,
} = args.wake_compute_lock.parse()?;
info!(permits, shards, ?epoch, "Using NodeLocks (wake_compute)");
info!(?limiter, shards, ?epoch, "Using NodeLocks (wake_compute)");
let locks = Box::leak(Box::new(console::locks::ApiLocks::new(
"wake_compute_lock",
permits,
limiter,
shards,
timeout,
epoch,
@@ -603,14 +603,19 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
let config::ConcurrencyLockOptions {
shards,
permits,
limiter,
epoch,
timeout,
} = args.connect_compute_lock.parse()?;
info!(permits, shards, ?epoch, "Using NodeLocks (connect_compute)");
info!(
?limiter,
shards,
?epoch,
"Using NodeLocks (connect_compute)"
);
let connect_compute_locks = console::locks::ApiLocks::new(
"connect_compute_lock",
permits,
limiter,
shards,
timeout,
epoch,

View File

@@ -10,11 +10,14 @@ use crate::{
};
use futures::{FutureExt, TryFutureExt};
use itertools::Itertools;
use once_cell::sync::OnceCell;
use pq_proto::StartupMessageParams;
use std::{io, net::SocketAddr, time::Duration};
use rustls::{client::danger::ServerCertVerifier, pki_types::InvalidDnsNameError};
use std::{io, net::SocketAddr, sync::Arc, time::Duration};
use thiserror::Error;
use tokio::net::TcpStream;
use tokio_postgres::tls::MakeTlsConnect;
use tokio_postgres_rustls::MakeRustlsConnect;
use tracing::{error, info, warn};
const COULD_NOT_CONNECT: &str = "Couldn't connect to compute node";
@@ -30,7 +33,7 @@ pub enum ConnectionError {
CouldNotConnect(#[from] io::Error),
#[error("{COULD_NOT_CONNECT}: {0}")]
TlsError(#[from] native_tls::Error),
TlsError(#[from] InvalidDnsNameError),
#[error("{COULD_NOT_CONNECT}: {0}")]
WakeComputeError(#[from] WakeComputeError),
@@ -257,7 +260,7 @@ pub struct PostgresConnection {
/// Socket connected to a compute node.
pub stream: tokio_postgres::maybe_tls_stream::MaybeTlsStream<
tokio::net::TcpStream,
postgres_native_tls::TlsStream<tokio::net::TcpStream>,
tokio_postgres_rustls::RustlsStream<tokio::net::TcpStream>,
>,
/// PostgreSQL connection parameters.
pub params: std::collections::HashMap<String, String>,
@@ -282,12 +285,23 @@ impl ConnCfg {
let (socket_addr, stream, host) = self.connect_raw(timeout).await?;
drop(pause);
let tls_connector = native_tls::TlsConnector::builder()
.danger_accept_invalid_certs(allow_self_signed_compute)
.build()
.unwrap();
let mut mk_tls = postgres_native_tls::MakeTlsConnector::new(tls_connector);
let tls = MakeTlsConnect::<tokio::net::TcpStream>::make_tls_connect(&mut mk_tls, host)?;
let client_config = if allow_self_signed_compute {
// Allow all certificates for creating the connection
let verifier = Arc::new(AcceptEverythingVerifier) as Arc<dyn ServerCertVerifier>;
rustls::ClientConfig::builder()
.dangerous()
.with_custom_certificate_verifier(verifier)
} else {
let root_store = TLS_ROOTS.get_or_try_init(load_certs)?.clone();
rustls::ClientConfig::builder().with_root_certificates(root_store)
};
let client_config = client_config.with_no_client_auth();
let mut mk_tls = tokio_postgres_rustls::MakeRustlsConnect::new(client_config);
let tls = <MakeRustlsConnect as MakeTlsConnect<tokio::net::TcpStream>>::make_tls_connect(
&mut mk_tls,
host,
)?;
// connect_raw() will not use TLS if sslmode is "disable"
let pause = ctx.latency_timer.pause(crate::metrics::Waiting::Compute);
@@ -340,6 +354,58 @@ fn filtered_options(params: &StartupMessageParams) -> Option<String> {
Some(options)
}
fn load_certs() -> Result<Arc<rustls::RootCertStore>, io::Error> {
let der_certs = rustls_native_certs::load_native_certs()?;
let mut store = rustls::RootCertStore::empty();
store.add_parsable_certificates(der_certs);
Ok(Arc::new(store))
}
static TLS_ROOTS: OnceCell<Arc<rustls::RootCertStore>> = OnceCell::new();
#[derive(Debug)]
struct AcceptEverythingVerifier;
impl ServerCertVerifier for AcceptEverythingVerifier {
fn supported_verify_schemes(&self) -> Vec<rustls::SignatureScheme> {
use rustls::SignatureScheme::*;
// The schemes for which `SignatureScheme::supported_in_tls13` returns true.
vec![
ECDSA_NISTP521_SHA512,
ECDSA_NISTP384_SHA384,
ECDSA_NISTP256_SHA256,
RSA_PSS_SHA512,
RSA_PSS_SHA384,
RSA_PSS_SHA256,
ED25519,
]
}
fn verify_server_cert(
&self,
_end_entity: &rustls::pki_types::CertificateDer<'_>,
_intermediates: &[rustls::pki_types::CertificateDer<'_>],
_server_name: &rustls::pki_types::ServerName<'_>,
_ocsp_response: &[u8],
_now: rustls::pki_types::UnixTime,
) -> Result<rustls::client::danger::ServerCertVerified, rustls::Error> {
Ok(rustls::client::danger::ServerCertVerified::assertion())
}
fn verify_tls12_signature(
&self,
_message: &[u8],
_cert: &rustls::pki_types::CertificateDer<'_>,
_dss: &rustls::DigitallySignedStruct,
) -> Result<rustls::client::danger::HandshakeSignatureValid, rustls::Error> {
Ok(rustls::client::danger::HandshakeSignatureValid::assertion())
}
fn verify_tls13_signature(
&self,
_message: &[u8],
_cert: &rustls::pki_types::CertificateDer<'_>,
_dss: &rustls::DigitallySignedStruct,
) -> Result<rustls::client::danger::HandshakeSignatureValid, rustls::Error> {
Ok(rustls::client::danger::HandshakeSignatureValid::assertion())
}
}
#[cfg(test)]
mod tests {
use super::*;

View File

@@ -1,7 +1,7 @@
use crate::{
auth::{self, backend::AuthRateLimiter},
console::locks::ApiLocks,
rate_limiter::RateBucketInfo,
rate_limiter::{RateBucketInfo, RateLimitAlgorithm, RateLimiterConfig},
scram::threadpool::ThreadPool,
serverless::{cancel_set::CancelSet, GlobalConnPoolOptions},
Host,
@@ -580,14 +580,18 @@ impl RetryConfig {
}
/// Helper for cmdline cache options parsing.
#[derive(serde::Deserialize)]
pub struct ConcurrencyLockOptions {
/// The number of shards the lock map should have
pub shards: usize,
/// The number of allowed concurrent requests for each endpoitn
pub permits: usize,
#[serde(flatten)]
pub limiter: RateLimiterConfig,
/// Garbage collection epoch
#[serde(deserialize_with = "humantime_serde::deserialize")]
pub epoch: Duration,
/// Lock timeout
#[serde(deserialize_with = "humantime_serde::deserialize")]
pub timeout: Duration,
}
@@ -596,13 +600,18 @@ impl ConcurrencyLockOptions {
pub const DEFAULT_OPTIONS_WAKE_COMPUTE_LOCK: &'static str = "permits=0";
/// Default options for [`crate::console::provider::ApiLocks`].
pub const DEFAULT_OPTIONS_CONNECT_COMPUTE_LOCK: &'static str =
"shards=64,permits=10,epoch=10m,timeout=10ms";
"shards=64,permits=100,epoch=10m,timeout=10ms";
// pub const DEFAULT_OPTIONS_WAKE_COMPUTE_LOCK: &'static str = "shards=32,permits=4,epoch=10m,timeout=1s";
/// Parse lock options passed via cmdline.
/// Example: [`Self::DEFAULT_OPTIONS_WAKE_COMPUTE_LOCK`].
fn parse(options: &str) -> anyhow::Result<Self> {
let options = options.trim();
if options.starts_with('{') && options.ends_with('}') {
return Ok(serde_json::from_str(options)?);
}
let mut shards = None;
let mut permits = None;
let mut epoch = None;
@@ -629,9 +638,13 @@ impl ConcurrencyLockOptions {
shards = Some(2);
}
let permits = permits.context("missing `permits`")?;
let out = Self {
shards: shards.context("missing `shards`")?,
permits: permits.context("missing `permits`")?,
limiter: RateLimiterConfig {
algorithm: RateLimitAlgorithm::Fixed,
initial_limit: permits,
},
epoch: epoch.context("missing `epoch`")?,
timeout: timeout.context("missing `timeout`")?,
};
@@ -657,6 +670,8 @@ impl FromStr for ConcurrencyLockOptions {
#[cfg(test)]
mod tests {
use crate::rate_limiter::Aimd;
use super::*;
#[test]
@@ -684,36 +699,68 @@ mod tests {
fn test_parse_lock_options() -> anyhow::Result<()> {
let ConcurrencyLockOptions {
epoch,
permits,
limiter,
shards,
timeout,
} = "shards=32,permits=4,epoch=10m,timeout=1s".parse()?;
assert_eq!(epoch, Duration::from_secs(10 * 60));
assert_eq!(timeout, Duration::from_secs(1));
assert_eq!(shards, 32);
assert_eq!(permits, 4);
assert_eq!(limiter.initial_limit, 4);
assert_eq!(limiter.algorithm, RateLimitAlgorithm::Fixed);
let ConcurrencyLockOptions {
epoch,
permits,
limiter,
shards,
timeout,
} = "epoch=60s,shards=16,timeout=100ms,permits=8".parse()?;
assert_eq!(epoch, Duration::from_secs(60));
assert_eq!(timeout, Duration::from_millis(100));
assert_eq!(shards, 16);
assert_eq!(permits, 8);
assert_eq!(limiter.initial_limit, 8);
assert_eq!(limiter.algorithm, RateLimitAlgorithm::Fixed);
let ConcurrencyLockOptions {
epoch,
permits,
limiter,
shards,
timeout,
} = "permits=0".parse()?;
assert_eq!(epoch, Duration::ZERO);
assert_eq!(timeout, Duration::ZERO);
assert_eq!(shards, 2);
assert_eq!(permits, 0);
assert_eq!(limiter.initial_limit, 0);
assert_eq!(limiter.algorithm, RateLimitAlgorithm::Fixed);
Ok(())
}
#[test]
fn test_parse_json_lock_options() -> anyhow::Result<()> {
let ConcurrencyLockOptions {
epoch,
limiter,
shards,
timeout,
} = r#"{"shards":32,"initial_limit":44,"aimd":{"min":5,"max":500,"inc":10,"dec":0.9,"utilisation":0.8},"epoch":"10m","timeout":"1s"}"#
.parse()?;
assert_eq!(epoch, Duration::from_secs(10 * 60));
assert_eq!(timeout, Duration::from_secs(1));
assert_eq!(shards, 32);
assert_eq!(limiter.initial_limit, 44);
assert_eq!(
limiter.algorithm,
RateLimitAlgorithm::Aimd {
conf: Aimd {
min: 5,
max: 500,
dec: 0.9,
inc: 10,
utilisation: 0.8
}
},
);
Ok(())
}

View File

@@ -15,11 +15,11 @@ use crate::{
error::ReportableError,
intern::ProjectIdInt,
metrics::ApiLockMetrics,
rate_limiter::{DynamicLimiter, Outcome, RateLimiterConfig, Token},
scram, EndpointCacheKey,
};
use dashmap::DashMap;
use std::{hash::Hash, sync::Arc, time::Duration};
use tokio::sync::{OwnedSemaphorePermit, Semaphore};
use tokio::time::Instant;
use tracing::info;
@@ -443,8 +443,8 @@ impl ApiCaches {
/// Various caches for [`console`](super).
pub struct ApiLocks<K> {
name: &'static str,
node_locks: DashMap<K, Arc<Semaphore>>,
permits: usize,
node_locks: DashMap<K, Arc<DynamicLimiter>>,
config: RateLimiterConfig,
timeout: Duration,
epoch: std::time::Duration,
metrics: &'static ApiLockMetrics,
@@ -452,8 +452,6 @@ pub struct ApiLocks<K> {
#[derive(Debug, thiserror::Error)]
pub enum ApiLockError {
#[error("lock was closed")]
AcquireError(#[from] tokio::sync::AcquireError),
#[error("permit could not be acquired")]
TimeoutError(#[from] tokio::time::error::Elapsed),
}
@@ -461,7 +459,6 @@ pub enum ApiLockError {
impl ReportableError for ApiLockError {
fn get_error_kind(&self) -> crate::error::ErrorKind {
match self {
ApiLockError::AcquireError(_) => crate::error::ErrorKind::Service,
ApiLockError::TimeoutError(_) => crate::error::ErrorKind::RateLimit,
}
}
@@ -470,7 +467,7 @@ impl ReportableError for ApiLockError {
impl<K: Hash + Eq + Clone> ApiLocks<K> {
pub fn new(
name: &'static str,
permits: usize,
config: RateLimiterConfig,
shards: usize,
timeout: Duration,
epoch: std::time::Duration,
@@ -479,7 +476,7 @@ impl<K: Hash + Eq + Clone> ApiLocks<K> {
Ok(Self {
name,
node_locks: DashMap::with_shard_amount(shards),
permits,
config,
timeout,
epoch,
metrics,
@@ -487,8 +484,10 @@ impl<K: Hash + Eq + Clone> ApiLocks<K> {
}
pub async fn get_permit(&self, key: &K) -> Result<WakeComputePermit, ApiLockError> {
if self.permits == 0 {
return Ok(WakeComputePermit { permit: None });
if self.config.initial_limit == 0 {
return Ok(WakeComputePermit {
permit: Token::disabled(),
});
}
let now = Instant::now();
let semaphore = {
@@ -500,24 +499,22 @@ impl<K: Hash + Eq + Clone> ApiLocks<K> {
.entry(key.clone())
.or_insert_with(|| {
self.metrics.semaphores_registered.inc();
Arc::new(Semaphore::new(self.permits))
DynamicLimiter::new(self.config)
})
.clone()
}
};
let permit = tokio::time::timeout_at(now + self.timeout, semaphore.acquire_owned()).await;
let permit = semaphore.acquire_deadline(now + self.timeout).await;
self.metrics
.semaphore_acquire_seconds
.observe(now.elapsed().as_secs_f64());
Ok(WakeComputePermit {
permit: Some(permit??),
})
Ok(WakeComputePermit { permit: permit? })
}
pub async fn garbage_collect_worker(&self) {
if self.permits == 0 {
if self.config.initial_limit == 0 {
return;
}
let mut interval =
@@ -547,12 +544,21 @@ impl<K: Hash + Eq + Clone> ApiLocks<K> {
}
pub struct WakeComputePermit {
// None if the lock is disabled
permit: Option<OwnedSemaphorePermit>,
permit: Token,
}
impl WakeComputePermit {
pub fn should_check_cache(&self) -> bool {
self.permit.is_some()
!self.permit.is_disabled()
}
pub fn release(self, outcome: Outcome) {
self.permit.release(outcome)
}
pub fn release_result<T, E>(self, res: Result<T, E>) -> Result<T, E> {
match res {
Ok(_) => self.release(Outcome::Success),
Err(_) => self.release(Outcome::Overload),
}
res
}
}

View File

@@ -13,7 +13,7 @@ use crate::{
http,
metrics::{CacheOutcome, Metrics},
rate_limiter::EndpointRateLimiter,
scram, EndpointCacheKey, Normalize,
scram, EndpointCacheKey,
};
use crate::{cache::Cached, context::RequestMonitoring};
use futures::TryFutureExt;
@@ -281,14 +281,6 @@ impl super::Api for Api {
return Ok(cached);
}
// check rate limit
if !self
.wake_compute_endpoint_rate_limiter
.check(user_info.endpoint.normalize().into(), 1)
{
return Err(WakeComputeError::TooManyConnections);
}
let permit = self.locks.get_permit(&key).await?;
// after getting back a permit - it's possible the cache was filled
@@ -301,7 +293,16 @@ impl super::Api for Api {
}
}
let mut node = self.do_wake_compute(ctx, user_info).await?;
// check rate limit
if !self
.wake_compute_endpoint_rate_limiter
.check(user_info.endpoint.normalize_intern(), 1)
{
info!(key = &*key, "found cached compute node info");
return Err(WakeComputeError::TooManyConnections);
}
let mut node = permit.release_result(self.do_wake_compute(ctx, user_info).await)?;
ctx.set_project(node.aux.clone());
let cold_start_info = node.aux.cold_start_info;
info!("woken up a compute node");

View File

@@ -2,6 +2,7 @@
use chrono::Utc;
use once_cell::sync::OnceCell;
use pq_proto::StartupMessageParams;
use smol_str::SmolStr;
use std::net::IpAddr;
use tokio::sync::mpsc;
@@ -46,6 +47,7 @@ pub struct RequestMonitoring {
pub(crate) auth_method: Option<AuthMethod>,
success: bool,
pub(crate) cold_start_info: ColdStartInfo,
pg_options: Option<StartupMessageParams>,
// extra
// This sender is here to keep the request monitoring channel open while requests are taking place.
@@ -102,6 +104,7 @@ impl RequestMonitoring {
success: false,
rejected: None,
cold_start_info: ColdStartInfo::Unknown,
pg_options: None,
sender: LOG_CHAN.get().and_then(|tx| tx.upgrade()),
disconnect_sender: LOG_CHAN_DISCONNECT.get().and_then(|tx| tx.upgrade()),
@@ -132,6 +135,18 @@ impl RequestMonitoring {
self.latency_timer.cold_start_info(info);
}
pub fn set_db_options(&mut self, options: StartupMessageParams) {
self.set_application(options.get("application_name").map(SmolStr::from));
if let Some(user) = options.get("user") {
self.set_user(user.into());
}
if let Some(dbname) = options.get("database") {
self.set_dbname(dbname.into());
}
self.pg_options = Some(options);
}
pub fn set_project(&mut self, x: MetricsAuxInfo) {
if self.endpoint_id.is_none() {
self.set_endpoint_id(x.endpoint_id.as_str().into())
@@ -155,8 +170,10 @@ impl RequestMonitoring {
}
}
pub fn set_application(&mut self, app: Option<SmolStr>) {
self.application = app.or_else(|| self.application.clone());
fn set_application(&mut self, app: Option<SmolStr>) {
if let Some(app) = app {
self.application = Some(app);
}
}
pub fn set_dbname(&mut self, dbname: DbName) {

View File

@@ -13,7 +13,9 @@ use parquet::{
},
record::RecordWriter,
};
use pq_proto::StartupMessageParams;
use remote_storage::{GenericRemoteStorage, RemotePath, TimeoutOrCancel};
use serde::ser::SerializeMap;
use tokio::{sync::mpsc, time};
use tokio_util::sync::CancellationToken;
use tracing::{debug, info, Span};
@@ -87,6 +89,7 @@ pub struct RequestData {
database: Option<String>,
project: Option<String>,
branch: Option<String>,
pg_options: Option<String>,
auth_method: Option<&'static str>,
error: Option<&'static str>,
/// Success is counted if we form a HTTP response with sql rows inside
@@ -101,6 +104,23 @@ pub struct RequestData {
disconnect_timestamp: Option<chrono::NaiveDateTime>,
}
struct Options<'a> {
options: &'a StartupMessageParams,
}
impl<'a> serde::Serialize for Options<'a> {
fn serialize<S>(&self, s: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
let mut state = s.serialize_map(None)?;
for (k, v) in self.options.iter() {
state.serialize_entry(k, v)?;
}
state.end()
}
}
impl From<&RequestMonitoring> for RequestData {
fn from(value: &RequestMonitoring) -> Self {
Self {
@@ -113,6 +133,10 @@ impl From<&RequestMonitoring> for RequestData {
database: value.dbname.as_deref().map(String::from),
project: value.project.as_deref().map(String::from),
branch: value.branch.as_deref().map(String::from),
pg_options: value
.pg_options
.as_ref()
.and_then(|options| serde_json::to_string(&Options { options }).ok()),
auth_method: value.auth_method.as_ref().map(|x| match x {
super::AuthMethod::Web => "web",
super::AuthMethod::ScramSha256 => "scram_sha_256",
@@ -494,6 +518,7 @@ mod tests {
database: Some(hex::encode(rng.gen::<[u8; 16]>())),
project: Some(hex::encode(rng.gen::<[u8; 16]>())),
branch: Some(hex::encode(rng.gen::<[u8; 16]>())),
pg_options: None,
auth_method: None,
protocol: ["tcp", "ws", "http"][rng.gen_range(0..3)],
region: "us-east-1",
@@ -570,15 +595,15 @@ mod tests {
assert_eq!(
file_stats,
[
(1315314, 3, 6000),
(1315307, 3, 6000),
(1315367, 3, 6000),
(1315324, 3, 6000),
(1315454, 3, 6000),
(1315296, 3, 6000),
(1315088, 3, 6000),
(1315324, 3, 6000),
(438713, 1, 2000)
(1315874, 3, 6000),
(1315867, 3, 6000),
(1315927, 3, 6000),
(1315884, 3, 6000),
(1316014, 3, 6000),
(1315856, 3, 6000),
(1315648, 3, 6000),
(1315884, 3, 6000),
(438913, 1, 2000)
]
);
@@ -608,11 +633,11 @@ mod tests {
assert_eq!(
file_stats,
[
(1222212, 5, 10000),
(1228362, 5, 10000),
(1230156, 5, 10000),
(1229518, 5, 10000),
(1220796, 5, 10000)
(1223214, 5, 10000),
(1229364, 5, 10000),
(1231158, 5, 10000),
(1230520, 5, 10000),
(1221798, 5, 10000)
]
);
@@ -644,11 +669,11 @@ mod tests {
assert_eq!(
file_stats,
[
(1207859, 5, 10000),
(1207590, 5, 10000),
(1207883, 5, 10000),
(1207871, 5, 10000),
(1208126, 5, 10000)
(1208861, 5, 10000),
(1208592, 5, 10000),
(1208885, 5, 10000),
(1208873, 5, 10000),
(1209128, 5, 10000)
]
);
@@ -673,15 +698,15 @@ mod tests {
assert_eq!(
file_stats,
[
(1315314, 3, 6000),
(1315307, 3, 6000),
(1315367, 3, 6000),
(1315324, 3, 6000),
(1315454, 3, 6000),
(1315296, 3, 6000),
(1315088, 3, 6000),
(1315324, 3, 6000),
(438713, 1, 2000)
(1315874, 3, 6000),
(1315867, 3, 6000),
(1315927, 3, 6000),
(1315884, 3, 6000),
(1316014, 3, 6000),
(1315856, 3, 6000),
(1315648, 3, 6000),
(1315884, 3, 6000),
(438913, 1, 2000)
]
);
@@ -718,7 +743,7 @@ mod tests {
// files are smaller than the size threshold, but they took too long to fill so were flushed early
assert_eq!(
file_stats,
[(659462, 2, 3001), (659176, 2, 3000), (658972, 2, 2999)]
[(659836, 2, 3001), (659550, 2, 3000), (659346, 2, 2999)]
);
tmpdir.close().unwrap();

View File

@@ -3,6 +3,7 @@
use std::convert::Infallible;
use anyhow::{bail, Context};
use intern::{EndpointIdInt, EndpointIdTag, InternId};
use tokio::task::JoinError;
use tokio_util::sync::CancellationToken;
use tracing::warn;
@@ -129,20 +130,22 @@ macro_rules! smol_str_wrapper {
const POOLER_SUFFIX: &str = "-pooler";
pub trait Normalize {
fn normalize(&self) -> Self;
}
impl<S: Clone + AsRef<str> + From<String>> Normalize for S {
impl EndpointId {
fn normalize(&self) -> Self {
if self.as_ref().ends_with(POOLER_SUFFIX) {
let mut s = self.as_ref().to_string();
s.truncate(s.len() - POOLER_SUFFIX.len());
s.into()
if let Some(stripped) = self.as_ref().strip_suffix(POOLER_SUFFIX) {
stripped.into()
} else {
self.clone()
}
}
fn normalize_intern(&self) -> EndpointIdInt {
if let Some(stripped) = self.as_ref().strip_suffix(POOLER_SUFFIX) {
EndpointIdTag::get_interner().get_or_intern(stripped)
} else {
self.into()
}
}
}
// 90% of role name strings are 20 characters or less.

View File

@@ -267,6 +267,8 @@ pub async fn handle_client<S: AsyncRead + AsyncWrite + Unpin>(
};
drop(pause);
ctx.set_db_options(params.clone());
let hostname = mode.hostname(stream.get_ref());
let common_names = tls.map(|tls| &tls.common_names);

View File

@@ -84,8 +84,8 @@ impl ConnectMechanism for TcpMechanism<'_> {
timeout: time::Duration,
) -> Result<PostgresConnection, Self::Error> {
let host = node_info.config.get_host()?;
let _permit = self.locks.get_permit(&host).await?;
node_info.connect(ctx, timeout).await
let permit = self.locks.get_permit(&host).await?;
permit.release_result(node_info.connect(ctx, timeout).await)
}
fn update_connect_config(&self, config: &mut compute::ConnCfg) {

View File

@@ -1,2 +1,6 @@
mod limit_algorithm;
mod limiter;
pub use limit_algorithm::{
aimd::Aimd, DynamicLimiter, Outcome, RateLimitAlgorithm, RateLimiterConfig, Token,
};
pub use limiter::{BucketRateLimiter, EndpointRateLimiter, GlobalRateLimiter, RateBucketInfo};

View File

@@ -0,0 +1,275 @@
//! Algorithms for controlling concurrency limits.
use parking_lot::Mutex;
use std::{pin::pin, sync::Arc, time::Duration};
use tokio::{
sync::Notify,
time::{error::Elapsed, timeout_at, Instant},
};
use self::aimd::Aimd;
pub mod aimd;
/// Whether a job succeeded or failed as a result of congestion/overload.
///
/// Errors not considered to be caused by overload should be ignored.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
/// The job succeeded, or failed in a way unrelated to overload.
Success,
/// The job failed because of overload, e.g. it timed out or an explicit backpressure signal
/// was observed.
Overload,
}
/// An algorithm for controlling a concurrency limit.
pub trait LimitAlgorithm: Send + Sync + 'static {
/// Update the concurrency limit in response to a new job completion.
fn update(&self, old_limit: usize, sample: Sample) -> usize;
}
/// The result of a job (or jobs), including the [`Outcome`] (loss) and latency (delay).
#[derive(Debug, Clone, PartialEq, Eq, Copy)]
pub struct Sample {
pub(crate) latency: Duration,
/// Jobs in flight when the sample was taken.
pub(crate) in_flight: usize,
pub(crate) outcome: Outcome,
}
#[derive(Clone, Copy, Debug, Default, serde::Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum RateLimitAlgorithm {
#[default]
Fixed,
Aimd {
#[serde(flatten)]
conf: Aimd,
},
}
pub struct Fixed;
impl LimitAlgorithm for Fixed {
fn update(&self, old_limit: usize, _sample: Sample) -> usize {
old_limit
}
}
#[derive(Clone, Copy, Debug, serde::Deserialize, PartialEq)]
pub struct RateLimiterConfig {
#[serde(flatten)]
pub algorithm: RateLimitAlgorithm,
pub initial_limit: usize,
}
impl RateLimiterConfig {
pub fn create_rate_limit_algorithm(self) -> Box<dyn LimitAlgorithm> {
match self.algorithm {
RateLimitAlgorithm::Fixed => Box::new(Fixed),
RateLimitAlgorithm::Aimd { conf } => Box::new(conf),
}
}
}
pub struct LimiterInner {
alg: Box<dyn LimitAlgorithm>,
available: usize,
limit: usize,
in_flight: usize,
}
impl LimiterInner {
fn update(&mut self, latency: Duration, outcome: Option<Outcome>) {
if let Some(outcome) = outcome {
let sample = Sample {
latency,
in_flight: self.in_flight,
outcome,
};
self.limit = self.alg.update(self.limit, sample);
}
}
fn take(&mut self, ready: &Notify) -> Option<()> {
if self.available > 1 {
self.available -= 1;
self.in_flight += 1;
// tell the next in the queue that there is a permit ready
if self.available > 1 {
ready.notify_one();
}
Some(())
} else {
None
}
}
}
/// Limits the number of concurrent jobs.
///
/// Concurrency is limited through the use of [`Token`]s. Acquire a token to run a job, and release the
/// token once the job is finished.
///
/// The limit will be automatically adjusted based on observed latency (delay) and/or failures
/// caused by overload (loss).
pub struct DynamicLimiter {
config: RateLimiterConfig,
inner: Mutex<LimiterInner>,
// to notify when a token is available
ready: Notify,
}
/// A concurrency token, required to run a job.
///
/// Release the token back to the [`DynamicLimiter`] after the job is complete.
pub struct Token {
start: Instant,
limiter: Option<Arc<DynamicLimiter>>,
}
/// A snapshot of the state of the [`DynamicLimiter`].
///
/// Not guaranteed to be consistent under high concurrency.
#[derive(Debug, Clone, Copy)]
pub struct LimiterState {
limit: usize,
in_flight: usize,
}
impl DynamicLimiter {
/// Create a limiter with a given limit control algorithm.
pub fn new(config: RateLimiterConfig) -> Arc<Self> {
let ready = Notify::new();
ready.notify_one();
Arc::new(Self {
inner: Mutex::new(LimiterInner {
alg: config.create_rate_limit_algorithm(),
available: config.initial_limit,
limit: config.initial_limit,
in_flight: 0,
}),
ready,
config,
})
}
/// Try to acquire a concurrency [Token], waiting for `duration` if there are none available.
///
/// Returns `None` if there are none available after `duration`.
pub async fn acquire_timeout(self: &Arc<Self>, duration: Duration) -> Result<Token, Elapsed> {
self.acquire_deadline(Instant::now() + duration).await
}
/// Try to acquire a concurrency [Token], waiting until `deadline` if there are none available.
///
/// Returns `None` if there are none available after `deadline`.
pub async fn acquire_deadline(self: &Arc<Self>, deadline: Instant) -> Result<Token, Elapsed> {
if self.config.initial_limit == 0 {
// If the rate limiter is disabled, we can always acquire a token.
Ok(Token::disabled())
} else {
let mut notified = pin!(self.ready.notified());
let mut ready = notified.as_mut().enable();
loop {
let mut limit = None;
if ready {
let mut inner = self.inner.lock();
if inner.take(&self.ready).is_some() {
break Ok(Token::new(self.clone()));
}
limit = Some(inner.limit);
}
match timeout_at(deadline, notified.as_mut()).await {
Ok(()) => ready = true,
Err(e) => {
let limit = limit.unwrap_or_else(|| self.inner.lock().limit);
tracing::info!(limit, "could not acquire token in time");
break Err(e);
}
}
}
}
}
/// Return the concurrency [Token], along with the outcome of the job.
///
/// The [Outcome] of the job, and the time taken to perform it, may be used
/// to update the concurrency limit.
///
/// Set the outcome to `None` to ignore the job.
fn release_inner(&self, start: Instant, outcome: Option<Outcome>) {
tracing::info!("outcome is {:?}", outcome);
if self.config.initial_limit == 0 {
return;
}
let mut inner = self.inner.lock();
inner.update(start.elapsed(), outcome);
if inner.in_flight < inner.limit {
inner.available = inner.limit - inner.in_flight;
// At least 1 permit is now available
self.ready.notify_one();
}
inner.in_flight -= 1;
}
/// The current state of the limiter.
pub fn state(&self) -> LimiterState {
let inner = self.inner.lock();
LimiterState {
limit: inner.limit,
in_flight: inner.in_flight,
}
}
}
impl Token {
fn new(limiter: Arc<DynamicLimiter>) -> Self {
Self {
start: Instant::now(),
limiter: Some(limiter),
}
}
pub fn disabled() -> Self {
Self {
start: Instant::now(),
limiter: None,
}
}
pub fn is_disabled(&self) -> bool {
self.limiter.is_none()
}
pub fn release(mut self, outcome: Outcome) {
self.release_mut(Some(outcome))
}
pub fn release_mut(&mut self, outcome: Option<Outcome>) {
if let Some(limiter) = self.limiter.take() {
limiter.release_inner(self.start, outcome);
}
}
}
impl Drop for Token {
fn drop(&mut self) {
self.release_mut(None)
}
}
impl LimiterState {
/// The current concurrency limit.
pub fn limit(&self) -> usize {
self.limit
}
/// The number of jobs in flight.
pub fn in_flight(&self) -> usize {
self.in_flight
}
}

View File

@@ -0,0 +1,184 @@
use std::usize;
use super::{LimitAlgorithm, Outcome, Sample};
/// Loss-based congestion avoidance.
///
/// Additive-increase, multiplicative decrease.
///
/// Adds available currency when:
/// 1. no load-based errors are observed, and
/// 2. the utilisation of the current limit is high.
///
/// Reduces available concurrency by a factor when load-based errors are detected.
#[derive(Clone, Copy, Debug, serde::Deserialize, PartialEq)]
pub struct Aimd {
/// Minimum limit for AIMD algorithm.
pub min: usize,
/// Maximum limit for AIMD algorithm.
pub max: usize,
/// Decrease AIMD decrease by value in case of error.
pub dec: f32,
/// Increase AIMD increase by value in case of success.
pub inc: usize,
/// A threshold below which the limit won't be increased.
pub utilisation: f32,
}
impl LimitAlgorithm for Aimd {
fn update(&self, old_limit: usize, sample: Sample) -> usize {
use Outcome::*;
match sample.outcome {
Success => {
let utilisation = sample.in_flight as f32 / old_limit as f32;
if utilisation > self.utilisation {
let limit = old_limit + self.inc;
let increased_limit = limit.clamp(self.min, self.max);
if increased_limit > old_limit {
tracing::info!(increased_limit, "limit increased");
}
increased_limit
} else {
old_limit
}
}
Overload => {
let limit = old_limit as f32 * self.dec;
// Floor instead of round, so the limit reduces even with small numbers.
// E.g. round(2 * 0.9) = 2, but floor(2 * 0.9) = 1
let limit = limit.floor() as usize;
limit.clamp(self.min, self.max)
}
}
}
}
#[cfg(test)]
mod tests {
use std::time::Duration;
use crate::rate_limiter::limit_algorithm::{
DynamicLimiter, RateLimitAlgorithm, RateLimiterConfig,
};
use super::*;
#[tokio::test(start_paused = true)]
async fn should_decrease_limit_on_overload() {
let config = RateLimiterConfig {
initial_limit: 10,
algorithm: RateLimitAlgorithm::Aimd {
conf: Aimd {
min: 1,
max: 1500,
inc: 10,
dec: 0.5,
utilisation: 0.8,
},
},
};
let limiter = DynamicLimiter::new(config);
let token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
token.release(Outcome::Overload);
assert_eq!(limiter.state().limit(), 5, "overload: decrease");
}
#[tokio::test(start_paused = true)]
async fn should_increase_limit_on_success_when_using_gt_util_threshold() {
let config = RateLimiterConfig {
initial_limit: 4,
algorithm: RateLimitAlgorithm::Aimd {
conf: Aimd {
min: 1,
max: 1500,
inc: 1,
dec: 0.5,
utilisation: 0.5,
},
},
};
let limiter = DynamicLimiter::new(config);
let token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
let _token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
let _token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
token.release(Outcome::Success);
assert_eq!(limiter.state().limit(), 5, "success: increase");
}
#[tokio::test(start_paused = true)]
async fn should_not_change_limit_on_success_when_using_lt_util_threshold() {
let config = RateLimiterConfig {
initial_limit: 4,
algorithm: RateLimitAlgorithm::Aimd {
conf: Aimd {
min: 1,
max: 1500,
inc: 10,
dec: 0.5,
utilisation: 0.5,
},
},
};
let limiter = DynamicLimiter::new(config);
let token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
token.release(Outcome::Success);
assert_eq!(
limiter.state().limit(),
4,
"success: ignore when < half limit"
);
}
#[tokio::test(start_paused = true)]
async fn should_not_change_limit_when_no_outcome() {
let config = RateLimiterConfig {
initial_limit: 10,
algorithm: RateLimitAlgorithm::Aimd {
conf: Aimd {
min: 1,
max: 1500,
inc: 10,
dec: 0.5,
utilisation: 0.5,
},
},
};
let limiter = DynamicLimiter::new(config);
let token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
drop(token);
assert_eq!(limiter.state().limit(), 10, "ignore");
}
}

View File

@@ -232,9 +232,9 @@ impl ConnectMechanism for TokioMechanism {
.connect_timeout(timeout);
let pause = ctx.latency_timer.pause(crate::metrics::Waiting::Compute);
let (client, connection) = config.connect(tokio_postgres::NoTls).await?;
let res = config.connect(tokio_postgres::NoTls).await;
drop(pause);
drop(permit);
let (client, connection) = permit.release_result(res)?;
tracing::Span::current().record("pid", &tracing::field::display(client.get_process_id()));
Ok(poll_client(

View File

@@ -17,6 +17,7 @@ use hyper1::http::HeaderValue;
use hyper1::Response;
use hyper1::StatusCode;
use hyper1::{HeaderMap, Request};
use pq_proto::StartupMessageParamsBuilder;
use serde_json::json;
use serde_json::Value;
use tokio::time;
@@ -192,13 +193,13 @@ fn get_conn_info(
let mut options = Option::None;
let mut params = StartupMessageParamsBuilder::default();
params.insert("user", &username);
params.insert("database", &dbname);
for (key, value) in pairs {
match &*key {
"options" => {
options = Some(NeonOptions::parse_options_raw(&value));
}
"application_name" => ctx.set_application(Some(value.into())),
_ => {}
params.insert(&key, &value);
if key == "options" {
options = Some(NeonOptions::parse_options_raw(&value));
}
}

View File

@@ -51,9 +51,10 @@ impl<S: AsyncRead + AsyncWrite + Unpin> AsyncWrite for WebSocketRw<S> {
) -> Poll<io::Result<usize>> {
let this = self.project();
let mut stream = this.stream;
this.send.put(buf);
ready!(stream.as_mut().poll_ready(cx).map_err(io_error))?;
this.send.put(buf);
match stream.as_mut().start_send(Frame::binary(this.send.split())) {
Ok(()) => Poll::Ready(Ok(buf.len())),
Err(e) => Poll::Ready(Err(io_error(e))),

View File

@@ -54,6 +54,7 @@ build-backend = "poetry.core.masonry.api"
exclude = [
"^vendor/",
"^target/",
"test_runner/performance/pgvector/loaddata.py",
]
check_untyped_defs = true
# Help mypy find imports when running against list of individual files.

View File

@@ -22,8 +22,7 @@ serde_with.workspace = true
workspace_hack.workspace = true
utils.workspace = true
async-stream.workspace = true
native-tls.workspace = true
postgres-native-tls.workspace = true
tokio-postgres-rustls.workspace = true
postgres_ffi.workspace = true
tokio-stream.workspace = true
tokio-postgres.workspace = true
@@ -31,6 +30,9 @@ tokio-util = { workspace = true }
futures-util.workspace = true
itertools.workspace = true
camino.workspace = true
rustls.workspace = true
rustls-native-certs.workspace = true
once_cell.workspace = true
tokio = { workspace = true, features = ["macros", "rt-multi-thread"] }
chrono = { workspace = true, default-features = false, features = ["clock", "serde"] }

View File

@@ -2,7 +2,7 @@ use std::collections::{HashMap, HashSet};
use anyhow::Context;
use aws_sdk_s3::{types::ObjectIdentifier, Client};
use pageserver::tenant::remote_timeline_client::index::IndexLayerMetadata;
use pageserver::tenant::remote_timeline_client::index::LayerFileMetadata;
use pageserver_api::shard::ShardIndex;
use tracing::{error, info, warn};
use utils::generation::Generation;
@@ -208,7 +208,7 @@ impl TenantObjectListing {
&mut self,
timeline_id: TimelineId,
layer_file: &LayerName,
metadata: &IndexLayerMetadata,
metadata: &LayerFileMetadata,
) -> bool {
let Some(shard_tl) = self.shard_timelines.get_mut(&(metadata.shard, timeline_id)) else {
return false;

View File

@@ -1,7 +1,8 @@
use std::{collections::HashSet, str::FromStr};
use std::{collections::HashSet, str::FromStr, sync::Arc};
use aws_sdk_s3::Client;
use futures::stream::{StreamExt, TryStreamExt};
use once_cell::sync::OnceCell;
use pageserver_api::shard::TenantShardId;
use postgres_ffi::{XLogFileName, PG_TLI};
use serde::Serialize;
@@ -70,9 +71,12 @@ pub async fn scan_safekeeper_metadata(
"checking bucket {}, region {}, dump_db_table {}",
bucket_config.bucket, bucket_config.region, dump_db_table
);
// Use the native TLS implementation (Neon requires TLS)
let tls_connector =
postgres_native_tls::MakeTlsConnector::new(native_tls::TlsConnector::new().unwrap());
// Use rustls (Neon requires TLS)
let root_store = TLS_ROOTS.get_or_try_init(load_certs)?.clone();
let client_config = rustls::ClientConfig::builder()
.with_root_certificates(root_store)
.with_no_client_auth();
let tls_connector = tokio_postgres_rustls::MakeRustlsConnect::new(client_config);
let (client, connection) = tokio_postgres::connect(&dump_db_connstr, tls_connector).await?;
// The connection object performs the actual communication with the database,
// so spawn it off to run on its own.
@@ -234,3 +238,11 @@ async fn check_timeline(
is_deleted: false,
})
}
fn load_certs() -> Result<Arc<rustls::RootCertStore>, std::io::Error> {
let der_certs = rustls_native_certs::load_native_certs()?;
let mut store = rustls::RootCertStore::empty();
store.add_parsable_certificates(der_certs);
Ok(Arc::new(store))
}
static TLS_ROOTS: OnceCell<Arc<rustls::RootCertStore>> = OnceCell::new();

View File

@@ -11,7 +11,7 @@ use async_stream::stream;
use aws_sdk_s3::Client;
use camino::Utf8PathBuf;
use futures::{StreamExt, TryStreamExt};
use pageserver::tenant::remote_timeline_client::index::IndexLayerMetadata;
use pageserver::tenant::remote_timeline_client::index::LayerFileMetadata;
use pageserver::tenant::storage_layer::LayerName;
use pageserver::tenant::IndexPart;
use pageserver_api::shard::TenantShardId;
@@ -49,8 +49,8 @@ impl SnapshotDownloader {
&self,
ttid: TenantShardTimelineId,
layer_name: LayerName,
layer_metadata: IndexLayerMetadata,
) -> anyhow::Result<(LayerName, IndexLayerMetadata)> {
layer_metadata: LayerFileMetadata,
) -> anyhow::Result<(LayerName, LayerFileMetadata)> {
// Note this is local as in a local copy of S3 data, not local as in the pageserver's local format. They use
// different layer names (remote-style has the generation suffix)
let local_path = self.output_path.join(format!(
@@ -110,7 +110,7 @@ impl SnapshotDownloader {
async fn download_layers(
&self,
ttid: TenantShardTimelineId,
layers: Vec<(LayerName, IndexLayerMetadata)>,
layers: Vec<(LayerName, LayerFileMetadata)>,
) -> anyhow::Result<()> {
let layer_count = layers.len();
tracing::info!("Downloading {} layers for timeline {ttid}...", layer_count);
@@ -161,10 +161,7 @@ impl SnapshotDownloader {
ttid: TenantShardTimelineId,
index_part: Box<IndexPart>,
index_part_generation: Generation,
ancestor_layers: &mut HashMap<
TenantShardTimelineId,
HashMap<LayerName, IndexLayerMetadata>,
>,
ancestor_layers: &mut HashMap<TenantShardTimelineId, HashMap<LayerName, LayerFileMetadata>>,
) -> anyhow::Result<()> {
let index_bytes = serde_json::to_string(&index_part).unwrap();
@@ -234,7 +231,7 @@ impl SnapshotDownloader {
// happen if this tenant has been split at some point)
let mut ancestor_layers: HashMap<
TenantShardTimelineId,
HashMap<LayerName, IndexLayerMetadata>,
HashMap<LayerName, LayerFileMetadata>,
> = Default::default();
for shard in shards.into_iter().filter(|s| s.shard_count == shard_count) {

View File

@@ -29,13 +29,12 @@ use safekeeper::defaults::{
DEFAULT_HEARTBEAT_TIMEOUT, DEFAULT_HTTP_LISTEN_ADDR, DEFAULT_MAX_OFFLOADER_LAG_BYTES,
DEFAULT_PARTIAL_BACKUP_TIMEOUT, DEFAULT_PG_LISTEN_ADDR,
};
use safekeeper::remove_wal;
use safekeeper::http;
use safekeeper::wal_service;
use safekeeper::GlobalTimelines;
use safekeeper::SafeKeeperConf;
use safekeeper::{broker, WAL_SERVICE_RUNTIME};
use safekeeper::{control_file, BROKER_RUNTIME};
use safekeeper::{http, WAL_REMOVER_RUNTIME};
use safekeeper::{wal_backup, HTTP_RUNTIME};
use storage_broker::DEFAULT_ENDPOINT;
use utils::auth::{JwtAuth, Scope, SwappableJwtAuth};
@@ -441,14 +440,6 @@ async fn start_safekeeper(conf: SafeKeeperConf) -> Result<()> {
.map(|res| ("broker main".to_owned(), res));
tasks_handles.push(Box::pin(broker_task_handle));
let conf_ = conf.clone();
let wal_remover_handle = current_thread_rt
.as_ref()
.unwrap_or_else(|| WAL_REMOVER_RUNTIME.handle())
.spawn(remove_wal::task_main(conf_))
.map(|res| ("WAL remover".to_owned(), res));
tasks_handles.push(Box::pin(wal_remover_handle));
set_build_info_metric(GIT_VERSION, BUILD_TAG);
// TODO: update tokio-stream, convert to real async Stream with

Some files were not shown because too many files have changed in this diff Show More