Compare commits

..

48 Commits

Author SHA1 Message Date
Konstantin Knizhnik
8a697d63e0 Do not write to LFC during unlogged build 2024-06-09 16:15:36 +03:00
Konstantin Knizhnik
55904ee4f4 Extend relation on disk in case of start of unlogged build 2024-06-08 22:17:33 +03:00
Peter Bendel
8e68d56ce2 Merge branch 'main' into undo_unlogged_build 2024-06-07 13:34:55 +02:00
BodoBolero
bb44be5d91 forward fit to pgvector 0.7.1 2024-06-07 13:33:51 +02:00
BodoBolero
ff4200e8cf with the change in this PR the pgvector patch should become obsolete 2024-06-07 13:28:13 +02:00
a-masterov
2078dc827b CI: copy run-* labels from external contributors' PRs (#7915)
## Problem
We don't carry run-* labels from external contributors' PRs to
ci-run/pr-* PRs. This is not really convenient.
Need to sync labels in approved-for-ci-run workflow.
## Summary of changes
Added the procedure of transition of labels from the original PR 

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-06-07 10:04:59 +02:00
Joonas Koivunen
8ee191c271 test_local_only_layers_after_crash: various fixes (#7986)
In #7927 I needed to fix this test case, but the fixes should be
possible to land irrespective of the layer ingestion code change.

The most important fix is the behavior if an image layer is found: the
assertion message formatting raises a runtime error, which obscures the
fact that we found an image layer.
2024-06-07 10:18:05 +03:00
Anastasia Lubennikova
66c6b270f1 Downgrade No response from reading prefetch entry WARNING to LOG 2024-06-06 20:56:19 +01:00
Arthur Petukhovsky
e4e444f59f Remove random sleep in partial backup (#7982)
We had a random sleep in the beginning of partial backup task, which was
needed for the first partial backup deploy. It helped with gradual
upload of segments without causing network overload. Now partial backup
is deployed everywhere, so we don't need this random sleep anymore.

We also had an issue related to this, in which manager task was not shut
down for a long time. The cause of the issue is this random sleep that
didn't take timeline cancellation into account, meanwhile manager task
waited for partial backup to complete.

Fixes https://github.com/neondatabase/neon/issues/7967
2024-06-06 17:54:44 +00:00
Joonas Koivunen
d46d19456d raise the warning for oversized L0 to 2*target (#7985)
currently we warn even by going over a single byte. even that will be
hit much more rarely once #7927 lands, but get this in earlier.

rationale for 2*checkpoint_distance: anything smaller is not really
worth a warn.

we have an global allowed_error for this warning, which still cannot be
removed nor can it be removed with #7927 because of many tests with very
small `checkpoint_distance`.
2024-06-06 20:18:39 +03:00
Alex Chi Z
5d05013857 fix(pageserver): skip metadata compaction is LSN is not accumulated enough (#7962)
close https://github.com/neondatabase/neon/issues/7937

Only trigger metadata image layer creation if enough delta layers are
accumulated.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-06 11:34:44 -04:00
Konstantin Knizhnik
8c6429164f SAtop unlogged build in neon_immedsync which is called by GIST for sorted index build 2024-06-06 18:15:28 +03:00
Alex Chi Z
014509987d fix(pageserver): more flexible layer size test (#7945)
M-series macOS has different alignments/size for some fields (which I
did not investigate in detail) and therefore this test cannot pass on
macOS. Fixed by using `<=` for the comparison so that we do not test for
an exact match.

observed by @yliang412 

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-06 14:40:58 +00:00
Arpad Müller
75bca9bb19 Perform retries on azure bulk deletion (#7964)
This adds retries to the bulk deletion, because if there is a certain
chance n that a request fails, the chance that at least one of the
requests in a chain of requests fails increases exponentially.

We've had similar issues with the S3 DR tests, which in the end yielded
in adding retries at the remote_storage level. Retries at the top level
are not sufficient when one remote_storage "operation" is multiple
network requests in a trench coat, especially when there is no notion of
saving the progress: even if prior deletions had been successful, we'd
still need to get a 404 in order to continue the loop and get to the
point where we failed in the last iteration. Maybe we'll fail again but
before we've even reached it.

Retries at the bottom level avoid this issue because they have the
notion of progress and also when one network operation fails, only that
operation is retried.

First part of #7931.
2024-06-06 14:21:27 +00:00
Joonas Koivunen
a8be07785e fix: do TimelineMetrics::shutdown only once (#7983)
Related to #7341 tenant deletion will end up shutting down timelines
twice, once before actually starting and the second time when per
timeline deletion is requested. Shutting down TimelineMetrics causes
underflows. Add an atomic boolean and only do the shutdown once.
2024-06-06 14:20:54 +00:00
Yuchen Liang
630cfbe420 refactor(pageserver): designated api error type for cancelled request (#7949)
Closes #7406.

## Problem

When a `get_lsn_by_timestamp` request is cancelled, an anyhow error is
exposed to handle that case, which verbosely logs the error. However, we
don't benefit from having the full backtrace provided by anyhow in this
case.

## Summary of changes

This PR introduces a new `ApiError` type to handle errors caused by
cancelled request more robustly.
-  A new enum variant `ApiError::Cancelled`
- Currently the cancelled request is mapped to status code 500.
- Need to handle this error in proxy's `http_util` as well.
- Added a failpoint test to simulate cancelled `get_lsn_by_timestamp`
request.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-06-06 14:00:14 +00:00
Christian Schwarz
0a65333fff chore(walredo): avoid duplicate tenant_id and shard_slug fields (#7977)
spotted during reviews of async walredo work in #6628
2024-06-06 15:10:16 +02:00
Konstantin Knizhnik
dd28d82558 Ignore files left after interrupted unlogged build when comparing pgdir 2024-06-06 10:02:13 +03:00
Konstantin Knizhnik
6d694f2983 Fix unlogged_extend 2024-06-06 09:56:28 +03:00
John Spray
91dd99038e pageserver/controller: enable tenant deletion without attachment (#7957)
## Problem

As described in #7952, the controller's attempt to reconcile a tenant
before finally deleting it can get hung up waiting for the compute
notification hook to accept updates.

The fact that we try and reconcile a tenant at all during deletion is
part of a more general design issue (#5080), where deletion was
implemented as an operation on attached tenant, requiring the tenant to
be attached in order to delete it, which is not in principle necessary.

Closes: #7952

## Summary of changes

- In the pageserver deletion API, only do the traditional deletion path
if the tenant is attached. If it's secondary, then tear down the
secondary location, and then do a remote delete. If it's not attached at
all, just do the remote delete.
- In the storage controller, instead of ensuring a tenant is attached
before deletion, do a best-effort detach of the tenant, and then call
into some arbitrary pageserver to issue a deletion of remote content.

The pageserver retains its existing delete behavior when invoked on
attached locations. We can remove this later when all users of the API
are updated to either do a detach-before-delete. This will enable
removing the "weird" code paths during startup that sometimes load a
tenant and then immediately delete it, and removing the deletion markers
on tenants.
2024-06-05 20:22:54 +00:00
Konstantin Knizhnik
ab8f127fc8 Conditionally release lock in stop_unlogged_build 2024-06-05 19:31:45 +03:00
Konstantin Knizhnik
c660697578 Remove assert 2024-06-05 18:19:59 +03:00
Konstantin Knizhnik
327f8f3989 Update comments 2024-06-05 17:49:34 +03:00
Christian Schwarz
83ab14e271 chore!: remove walredo_process_kind config option & kind type (#7756)
refs https://github.com/neondatabase/neon/issues/7753

Preceding PR https://github.com/neondatabase/neon/pull/7754
laid out the plan, this one wraps it up.
2024-06-05 14:21:10 +02:00
Peter Bendel
85ef6b1645 upgrade pgvector from 0.7.0 to 0.7.1 (#7954)
## Problem

## Summary of changes

performance improvements in pgvector 0.7.1 for hnsw index builds, see
https://github.com/pgvector/pgvector/issues/570
2024-06-05 10:32:03 +02:00
Alex Chi Z
1a8d53ab9d feat(pageserver): compute aux file size on initial logical size calculation (#7958)
close https://github.com/neondatabase/neon/issues/7822
close https://github.com/neondatabase/neon/issues/7443

Aux file metrics is computed incrementally. If the size is not
initialized, the metrics will never show up. This pull request adds the
functionality to compute the aux file size on initial logical size
calculation.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-04 13:47:48 -04:00
Joonas Koivunen
3d6e389aa2 feat: support changing IndexPart::metadata_bytes to json in future release (#7693)
## Problem

Currently we serialize the `TimelineMetadata` into bytes to put it into
`index_part.json`. This `Vec<u8>` (hopefully `[u8; 512]`) representation
was chosen because of problems serializing TimelineId and Lsn between
different serializers (bincode, json). After #5335, the serialization of
those types became serialization format aware or format agnostic.

We've removed the pageserver local `metadata` file writing in #6769.

## Summary of changes

Allow switching from the current serialization format to plain JSON for
the legacy TimelineMetadata format in the future by adding a competitive
serialization method to the current one
(`crate::tenant::metadata::modern_serde`), which accepts both old bytes
and new plain JSON.

The benefits of this are that dumping the index_part.json with pretty
printing no longer produces more than 500 lines of output, but after
enabling it produces lines only proportional to the layer count, like:

```json
{
  "version": ???,
  "layer_metadata": { ... },
  "disk_consistent_lsn": "0/15FD5D8",
  "legacy_metadata": {
    "disk_consistent_lsn": "0/15FD5D8",
    "prev_record_lsn": "0/15FD5A0",
    "ancestor_timeline": null,
    "ancestor_lsn": "0/0",
    "latest_gc_cutoff_lsn": "0/149FD18",
    "initdb_lsn": "0/149FD18",
    "pg_version": 15
  }
}
```

In the future, I propose we completely stop using this legacy metadata
type and wasting time trying to come up with another version numbering
scheme in addition to the informative-only one already found in
`index_part.json`, and go ahead with storing metadata or feature flags
on the `index_part.json` itself.

#7699 is the "one release after" changes which starts to produce
metadata in the index_part.json as json.
2024-06-04 19:36:22 +03:00
Christian Schwarz
17116f2ea9 fix(pageserver): abort on duplicate layers, before doing damage (#7799)
fixes https://github.com/neondatabase/neon/issues/7790 (duplicating most
of the issue description here for posterity)

# Background

From the time before always-authoritative `index_part.json`, we had to
handle duplicate layers. See the RFC for an illustration of how
duplicate layers could happen:
a8e6d259cb/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md (L41-L50)

As of #5198 , we should not be exposed to that problem anymore.

# Problem 1

We still have
1. [code in
Pageserver](82960b2175/pageserver/src/tenant/timeline.rs (L4502-L4521))
than handles duplicate layers
2. [tests in the test
suite](d9dcbffac3/test_runner/regress/test_duplicate_layers.py (L15))
that demonstrates the problem using a failpoint

However, the test in the test suite doesn't use the failpoint to induce
a crash that could legitimately happen in production.
What is does instead is to return early with an `Ok()`, so that the code
in Pageserver that handles duplicate layers (item 1) actually gets
exercised.

That "return early" would be a bug in the routine if it happened in
production.
So, the tests in the test suite are tests for their own sake, but don't
serve to actually regress-test any production behavior.

# Problem 2

Further, if production code _did_ (it nowawdays doesn't!) create a
duplicate layer, the code in Pageserver that handles the condition (item
1 above) is too little and too late:

* the code handles it by discarding the newer `struct Layer`; that's
good.
* however, on disk, we have already overwritten the old with the new
layer file
* the fact that we do it atomically doesn't matter because ...
* if the new layer file is not bit-identical, then we have a cache
coherency problem
  * PS PageCache block cache: caches old bit battern
* blob_io offsets stored in variables, based on pre-overwrite bit
pattern / offsets
* => reading based on these offsets from the new file might yield
different data than before
 
# Solution

- Remove the test suite code pertaining to Problem 1
- Move & rename test suite code that actually tests RFC-27
crash-consistent layer map.
- Remove the Pageserver code that handles duplicate layers too late
(Problem 1)
- Use `RENAME_NOREPLACE` to prevent over-rename the file during
`.finish()`, bail with an error if it happens (Problem 2)
- This bailing prevents the caller from even trying to insert into the
layer map, as they don't even get a `struct Layer` at hand.
- Add `abort`s in the place where we have the layer map lock and check
for duplicates (Problem 2)
- Note again, we can't reach there because we bail from `.finish()` much
earlier in the code.
- Share the logic to clean up after failed `.finish()` between image
layers and delta layers (drive-by cleanup)
- This exposed that test `image_layer_rewrite` was overwriting layer
files in place. Fix the test.

# Future Work

This PR adds a new failure scenario that was previously "papered over"
by the overwriting of layers:
1. Start a compaction that will produce 3 layers: A, B, C
2. Layer A is `finish()`ed successfully.
3. Layer B fails mid-way at some `put_value()`.
4. Compaction bails out, sleeps 20s.
5. Some disk space gets freed in the meantime.
6. Compaction wakes from sleep, another iteration starts, it attempts to
write Layer A again. But the `.finish()` **fails because A already
exists on disk**.

The failure in step 5 is new with this PR, and it **causes the
compaction to get stuck**.
Before, it would silently overwrite the file and "successfully" complete
the second iteration.

The mitigation for this is to `/reset` the tenant.
2024-06-04 16:16:23 +00:00
John Spray
fd22fc5b7d pageserver: include heatmap in tenant deletion (#7928)
## Problem

This was an oversight when adding heatmaps: because they are at the top
level of the tenant, they aren't included in the catch-all list & delete
that happens for timeline paths.

This doesn't break anything, but it leaves behind a few kilobytes of
garbage in the S3 bucket after a tenant is deleted, generating work for
the scrubber.

## Summary of changes

- During deletion, explicitly remove the heatmap file
- In test_tenant_delete_smoke, upload a heatmap so that the test would
fail its "remote storage empty after delete" check if we didn't delete
it.
2024-06-04 16:16:50 +01:00
Joonas Koivunen
0112097e13 feat(rtc): maintain dirty and uploaded IndexPart (#7833)
RemoteTimelineClient maintains a copy of "next IndexPart" as a number of
fields which are like an IndexPart but this is not immediately obvious.
Instead of multiple fields, maintain a `dirty` ("next IndexPart") and
`clean` ("uploaded IndexPart") fields.

Additional cleanup:
- rename `IndexPart::disk_consistent_lsn` accessor
`duplicated_disk_consistent_lsn`
- no one except scrubber should be looking at it, even scrubber is a
stretch
- remove usage elsewhere (pagectl used by tests, metadata scan endpoint)
- serialize index part *before* the index upload operation
- avoid upload operation being retried because of serialization error
- serialization error is fatal anyway for timeline -- it can only make
transient local progress after that, at least the error is bubbled up
now
- gather exploded IndexPart fields into single actual
`UploadQueueInitialized::dirty` of which the uploaded snapshot is
serialized
- implement the long wished monotonicity check with the `clean`
IndexPart with an assertion which is not expected to fire

Continued work from #7860 towards next step of #6994.
2024-06-04 17:27:08 +03:00
Joonas Koivunen
9d4c113f9b build(Dockerfile.compute-node): do not log tar contents (#7953)
in build logs we get a lot of lines for building the compute node images
because of verbose tar unpack. we know the sha256 so we don't need to
log the contents. my hope is that this will allow us more reliably use
the github live updating log view.
2024-06-04 12:42:57 +01:00
Joonas Koivunen
0acb604fa3 test: no missed wakeups, cancellation and timeout flow to downloads (#7863)
I suspected a wakeup could be lost with
`remote_storage::support::DownloadStream` if the cancellation and inner
stream wakeups happen simultaneously. The next poll would only return
the cancellation error without setting the wakeup. There is no lost
wakeup because the single future for getting the cancellation error is
consumed when the value is ready, and a new future is created for the
*next* value. The new future is always polled. Similarly, if only the
`Stream::poll_next` is being used after a `Some(_)` value has been
yielded, it makes no sense to have an expectation of a wakeup for the
*(N+1)th* stream value already set because when a value is wanted,
`Stream::poll_next` will be called.

A test is added to show that the above is true.

Additionally, there was a question of these cancellations and timeouts
flowing to attached or secondary tenant downloads. A test is added to
show that this, in fact, happens.

Lastly, a warning message is logged when a download stream is polled
after a timeout or cancellation error (currently unexpected) so we can
rule it out while troubleshooting.
2024-06-04 14:19:36 +03:00
Konstantin Knizhnik
387a36874c Set page LSN when reconstructing VM in page server (#7935)
## Problem

Page LSN is not set while VM update.
May be reason of test_vm_bits flukyness.
Buit more serious issues can be also caused by wrong LSN.

Related: https://github.com/neondatabase/neon/pull/7935

## Summary of changes

- In `apply_in_neon`, set the LSN bytes when applying records of type
`ClearVisibilityMapFlags`
2024-06-04 09:56:03 +01:00
Anna Khanova
00032c9d9f [proxy] Fix dynamic rate limiter (#7950)
## Problem

There was a bug in dynamic rate limiter, which exhausted CPU in proxy
and proxy wasn't able to accept any connections.

## Summary of changes

1. `if self.available > 1` -> `if self.available >= 1`
2. remove `timeout_at` to use just timeout
3. remove potential infinite loops which can exhaust CPUs.
2024-06-04 05:07:54 +01:00
John Spray
11bb265de1 pageserver: don't squash all image layer generation errors into anyhow::Error (#7943)
## Problem

CreateImageLayersError and CompactionError had proper From
implementations, but compact_legacy was explicitly squashing all image
layer errors into an anyhow::Error anyway.

This led to errors like:
```
 Error processing HTTP request: InternalServerError(timeline shutting down

Stack backtrace:
   0: <<anyhow::Error as core::convert::From<pageserver::tenant::timeline::CreateImageLayersError>>::from as core::ops::function::FnOnce<(pageserver::tenant::timeline::CreateImageLayersError,)>>::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
   1: <core::result::Result<alloc::vec::Vec<pageserver::tenant::storage_layer::layer::ResidentLayer>, pageserver::tenant::timeline::CreateImageLayersError>>::map_err::<anyhow::Error, <anyhow::Error as core::convert::From<pageserver::tenant::timeline::CreateImageLayersError>>::from>
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:829:27
   2: <pageserver::tenant::timeline::Timeline>::compact_legacy::{closure#0}
             at pageserver/src/tenant/timeline/compaction.rs:125:36
   3: <pageserver::tenant::timeline::Timeline>::compact::{closure#0}
             at pageserver/src/tenant/timeline.rs:1719:84
   4: pageserver::http::routes::timeline_checkpoint_handler::{closure#0}::{closure#0}
```

Closes: https://github.com/neondatabase/neon/issues/7861
2024-06-03 22:10:13 +02:00
Konstantin Knizhnik
0c9dee9d06 Rebase with main 2024-06-03 21:36:37 +03:00
Konstantin Knizhnik
5a5775806f Restore check for poreserving pgdata_dir content 2024-06-03 21:16:04 +03:00
Konstantin Knizhnik
947f8c59dd Fix unlogged build 2024-06-03 21:16:02 +03:00
Konstantin Knizhnik
520101170f Pin information about unlogged relations in relsize cache until end of the build 2024-06-03 21:15:14 +03:00
Konstantin Knizhnik
1bd86c5c6a Rewrite unlogged relation build 2024-06-03 21:15:12 +03:00
John Spray
69026a9a36 storcon_cli: add 'drop' and eviction interval utilities (#7938)
The storage controller has 'drop' APIs for tenants and nodes, for use in
situations where something weird has happened:
- node-drop is useful until we implement proper node decom, or if we
have a partially provisioned node that somehow gets registered with the
storage controller but is then dead.
- tenant-drop is useful if we accidentally add a tenant that shouldn't
be there at all, or if we want to make the controller forget about a
tenant without deleting its data. For example, if one uses the
tenant-warmup command with a bad tenant ID and needs to clean that up.

The drop commands require an `--unsafe` parameter, to reduce the chance
that someone incorrectly assumes these are the normal/clean ways to
delete things.

This PR also adds a convenience command for setting the time based
eviction parameters on a tenant. This is useful when onboarding an
existing tenant that has high resident size due to storage amplification
in compaction: setting a lower time based eviction threshold brings down
the resident size ahead of doing a shard split.
2024-06-03 18:13:01 +00:00
Konstantin Knizhnik
e4fc6c3162 Comment check for pgdatadir match 2024-06-03 21:12:23 +03:00
Konstantin Knizhnik
fcd7d7008f Support unlogged build in Neon erxtension 2024-06-03 21:12:21 +03:00
Konstantin Knizhnik
7006caf3a1 Store logical replication origin in KV storage (#7099)
Store logical replication origin in KV storage

## Problem

See  #6977

## Summary of changes

* Extract origin_lsn from commit WAl record
* Add ReplOrigin key to KV storage and store origin_lsn
* In basebackup replace snapshot origin_lsn with last committed
origin_lsn at basebackup LSN

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>
2024-06-03 19:37:33 +03:00
John Spray
69d18d6429 s3_scrubber: add pageserver-physical-gc (#7925)
## Problem

Currently, we leave `index_part.json` objects from old generations
behind each time a pageserver restarts or a tenant is migrated. This
doesn't break anything, but it's annoying when a tenant has been around
for a long time and starts to accumulate 10s-100s of these.

Partially implements: #7043 

## Summary of changes

- Add a new `pageserver-physical-gc` command to `s3_scrubber`

The name is a bit of a mouthful, but I think it makes sense:
- GC is the accurate term for what we are doing here: removing data that
takes up storage but can never be accessed.
- "physical" is a necessary distinction from the "normal" GC that we do
online in the pageserver, which operates at a higher level in terms of
LSNs+layers, whereas this type of GC is purely about S3 objects.
- "pageserver" makes clear that this command deals exclusively with
pageserver data, not safekeeper.
2024-06-03 17:16:23 +01:00
Arpad Müller
acf0a11fea Move keyspace utils to inherent impls (#7929)
The keyspace utils like `is_rel_size_key` or `is_rel_fsm_block_key` and
many others are free functions and have to be either imported separately
or specified with the full path starting in `pageserver_api:🔑:`.
This is less convenient than if these functions were just inherent
impls.

Follow-up of #7890
Fixes #6438
2024-06-03 16:18:07 +02:00
Alex Chi Z
c1f55c1525 feat(pageserver): collect aux file tombstones (#7900)
close https://github.com/neondatabase/neon/issues/7800

This is a small change to enable the tombstone -> exclude from image
layer path. Most of the pull request is unit tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-03 09:56:36 -04:00
Joonas Koivunen
34f450c05a test: allow no vectored gets happening (#7939)
when running the regress tests locally without any environment variables
we use on CI, `test_pageserver_compaction_smoke` fails with division by
zero. fix it temporarily by allowing no vectored read happening. to be
cleaned when vectored get validation gets removed and the default value
can be changed.

Cc: https://github.com/neondatabase/neon/issues/7381
2024-06-03 09:37:11 -04:00
82 changed files with 3585 additions and 1593 deletions

View File

@@ -69,15 +69,41 @@ jobs:
with:
ref: main
token: ${{ secrets.CI_ACCESS_TOKEN }}
- name: Look for existing PR
id: get-pr
env:
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: |
ALREADY_CREATED="$(gh pr --repo ${GITHUB_REPOSITORY} list --head ${BRANCH} --base main --json number --jq '.[].number')"
echo "ALREADY_CREATED=${ALREADY_CREATED}" >> ${GITHUB_OUTPUT}
- name: Get changed labels
id: get-labels
if: steps.get-pr.outputs.ALREADY_CREATED != ''
env:
ALREADY_CREATED: ${{ steps.get-pr.outputs.ALREADY_CREATED }}
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: |
LABELS_TO_REMOVE=$(comm -23 <(gh pr --repo ${GITHUB_REPOSITORY} view ${ALREADY_CREATED} --json labels --jq '.labels.[].name'| ( grep -E '^run' || true ) | sort) \
<(gh pr --repo ${GITHUB_REPOSITORY} view ${PR_NUMBER} --json labels --jq '.labels.[].name' | ( grep -E '^run' || true ) | sort ) |\
( grep -v run-e2e-tests-in-draft || true ) | paste -sd , -)
LABELS_TO_ADD=$(comm -13 <(gh pr --repo ${GITHUB_REPOSITORY} view ${ALREADY_CREATED} --json labels --jq '.labels.[].name'| ( grep -E '^run' || true ) |sort) \
<(gh pr --repo ${GITHUB_REPOSITORY} view ${PR_NUMBER} --json labels --jq '.labels.[].name' | ( grep -E '^run' || true ) | sort ) |\
paste -sd , -)
echo "LABELS_TO_ADD=${LABELS_TO_ADD}" >> ${GITHUB_OUTPUT}
echo "LABELS_TO_REMOVE=${LABELS_TO_REMOVE}" >> ${GITHUB_OUTPUT}
- run: gh pr checkout "${PR_NUMBER}"
- run: git checkout -b "${BRANCH}"
- run: git push --force origin "${BRANCH}"
if: steps.get-pr.outputs.ALREADY_CREATED == ''
- name: Create a Pull Request for CI run (if required)
env:
if: steps.get-pr.outputs.ALREADY_CREATED == ''
env:
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: |
cat << EOF > body.md
@@ -88,16 +114,33 @@ jobs:
Feel free to review/comment/discuss the original PR #${PR_NUMBER}.
EOF
ALREADY_CREATED="$(gh pr --repo ${GITHUB_REPOSITORY} list --head ${BRANCH} --base main --json number --jq '.[].number')"
if [ -z "${ALREADY_CREATED}" ]; then
gh pr --repo "${GITHUB_REPOSITORY}" create --title "CI run for PR #${PR_NUMBER}" \
LABELS=$( (gh pr --repo "${GITHUB_REPOSITORY}" view ${PR_NUMBER} --json labels --jq '.labels.[].name'; echo run-e2e-tests-in-draft )| \
grep -E '^run' | paste -sd , -)
gh pr --repo "${GITHUB_REPOSITORY}" create --title "CI run for PR #${PR_NUMBER}" \
--body-file "body.md" \
--head "${BRANCH}" \
--base "main" \
--label "run-e2e-tests-in-draft" \
--label ${LABELS} \
--draft
- name: Modify the existing pull request (if required)
if: steps.get-pr.outputs.ALREADY_CREATED != ''
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
LABELS_TO_ADD: ${{ steps.get-labels.outputs.LABELS_TO_ADD }}
LABELS_TO_REMOVE: ${{ steps.get-labels.outputs.LABELS_TO_REMOVE }}
ALREADY_CREATED: ${{ steps.get-pr.outputs.ALREADY_CREATED }}
run: |
ADD_CMD=
REMOVE_CMD=
[ -z "${LABELS_TO_ADD}" ] || ADD_CMD="--add-label ${LABELS_TO_ADD}"
[ -z "${LABELS_TO_REMOVE}" ] || REMOVE_CMD="--remove-label ${LABELS_TO_REMOVE}"
if [ -n "${ADD_CMD}" ] || [ -n "${REMOVE_CMD}" ]; then
gh pr --repo "${GITHUB_REPOSITORY}" edit ${ALREADY_CREATED} ${ADD_CMD} ${REMOVE_CMD}
fi
- run: git push --force origin "${BRANCH}"
if: steps.get-pr.outputs.ALREADY_CREATED != ''
cleanup:
# Close PRs and delete branchs if the original PR is closed.

2
Cargo.lock generated
View File

@@ -5129,6 +5129,7 @@ dependencies = [
"futures-util",
"hex",
"histogram",
"humantime",
"itertools",
"once_cell",
"pageserver",
@@ -5819,6 +5820,7 @@ dependencies = [
"anyhow",
"clap",
"comfy-table",
"humantime",
"hyper 0.14.26",
"pageserver_api",
"pageserver_client",

View File

@@ -89,7 +89,7 @@ RUN apt update && \
# SFCGAL > 1.3 requires CGAL > 5.2, Bullseye's libcgal-dev is 5.2
RUN wget https://gitlab.com/Oslandia/SFCGAL/-/archive/v1.3.10/SFCGAL-v1.3.10.tar.gz -O SFCGAL.tar.gz && \
echo "4e39b3b2adada6254a7bdba6d297bb28e1a9835a9f879b74f37e2dab70203232 SFCGAL.tar.gz" | sha256sum --check && \
mkdir sfcgal-src && cd sfcgal-src && tar xvzf ../SFCGAL.tar.gz --strip-components=1 -C . && \
mkdir sfcgal-src && cd sfcgal-src && tar xzf ../SFCGAL.tar.gz --strip-components=1 -C . && \
cmake -DCMAKE_BUILD_TYPE=Release . && make -j $(getconf _NPROCESSORS_ONLN) && \
DESTDIR=/sfcgal make install -j $(getconf _NPROCESSORS_ONLN) && \
make clean && cp -R /sfcgal/* /
@@ -98,7 +98,7 @@ ENV PATH "/usr/local/pgsql/bin:$PATH"
RUN wget https://download.osgeo.org/postgis/source/postgis-3.3.3.tar.gz -O postgis.tar.gz && \
echo "74eb356e3f85f14233791013360881b6748f78081cc688ff9d6f0f673a762d13 postgis.tar.gz" | sha256sum --check && \
mkdir postgis-src && cd postgis-src && tar xvzf ../postgis.tar.gz --strip-components=1 -C . && \
mkdir postgis-src && cd postgis-src && tar xzf ../postgis.tar.gz --strip-components=1 -C . && \
find /usr/local/pgsql -type f | sed 's|^/usr/local/pgsql/||' > /before.txt &&\
./autogen.sh && \
./configure --with-sfcgal=/usr/local/bin/sfcgal-config && \
@@ -124,7 +124,7 @@ RUN wget https://download.osgeo.org/postgis/source/postgis-3.3.3.tar.gz -O postg
RUN wget https://github.com/pgRouting/pgrouting/archive/v3.4.2.tar.gz -O pgrouting.tar.gz && \
echo "cac297c07d34460887c4f3b522b35c470138760fe358e351ad1db4edb6ee306e pgrouting.tar.gz" | sha256sum --check && \
mkdir pgrouting-src && cd pgrouting-src && tar xvzf ../pgrouting.tar.gz --strip-components=1 -C . && \
mkdir pgrouting-src && cd pgrouting-src && tar xzf ../pgrouting.tar.gz --strip-components=1 -C . && \
mkdir build && cd build && \
cmake -DCMAKE_BUILD_TYPE=Release .. && \
make -j $(getconf _NPROCESSORS_ONLN) && \
@@ -149,7 +149,7 @@ RUN apt update && \
RUN wget https://github.com/plv8/plv8/archive/refs/tags/v3.1.10.tar.gz -O plv8.tar.gz && \
echo "7096c3290928561f0d4901b7a52794295dc47f6303102fae3f8e42dd575ad97d plv8.tar.gz" | sha256sum --check && \
mkdir plv8-src && cd plv8-src && tar xvzf ../plv8.tar.gz --strip-components=1 -C . && \
mkdir plv8-src && cd plv8-src && tar xzf ../plv8.tar.gz --strip-components=1 -C . && \
# generate and copy upgrade scripts
mkdir -p upgrade && ./generate_upgrade.sh 3.1.10 && \
cp upgrade/* /usr/local/pgsql/share/extension/ && \
@@ -194,7 +194,7 @@ RUN case "$(uname -m)" in \
RUN wget https://github.com/uber/h3/archive/refs/tags/v4.1.0.tar.gz -O h3.tar.gz && \
echo "ec99f1f5974846bde64f4513cf8d2ea1b8d172d2218ab41803bf6a63532272bc h3.tar.gz" | sha256sum --check && \
mkdir h3-src && cd h3-src && tar xvzf ../h3.tar.gz --strip-components=1 -C . && \
mkdir h3-src && cd h3-src && tar xzf ../h3.tar.gz --strip-components=1 -C . && \
mkdir build && cd build && \
cmake .. -DCMAKE_BUILD_TYPE=Release && \
make -j $(getconf _NPROCESSORS_ONLN) && \
@@ -204,7 +204,7 @@ RUN wget https://github.com/uber/h3/archive/refs/tags/v4.1.0.tar.gz -O h3.tar.gz
RUN wget https://github.com/zachasme/h3-pg/archive/refs/tags/v4.1.3.tar.gz -O h3-pg.tar.gz && \
echo "5c17f09a820859ffe949f847bebf1be98511fb8f1bd86f94932512c00479e324 h3-pg.tar.gz" | sha256sum --check && \
mkdir h3-pg-src && cd h3-pg-src && tar xvzf ../h3-pg.tar.gz --strip-components=1 -C . && \
mkdir h3-pg-src && cd h3-pg-src && tar xzf ../h3-pg.tar.gz --strip-components=1 -C . && \
export PATH="/usr/local/pgsql/bin:$PATH" && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
@@ -222,7 +222,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/df7cb/postgresql-unit/archive/refs/tags/7.7.tar.gz -O postgresql-unit.tar.gz && \
echo "411d05beeb97e5a4abf17572bfcfbb5a68d98d1018918feff995f6ee3bb03e79 postgresql-unit.tar.gz" | sha256sum --check && \
mkdir postgresql-unit-src && cd postgresql-unit-src && tar xvzf ../postgresql-unit.tar.gz --strip-components=1 -C . && \
mkdir postgresql-unit-src && cd postgresql-unit-src && tar xzf ../postgresql-unit.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
# unit extension's "create extension" script relies on absolute install path to fill some reference tables.
@@ -241,15 +241,13 @@ RUN wget https://github.com/df7cb/postgresql-unit/archive/refs/tags/7.7.tar.gz -
FROM build-deps AS vector-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY patches/pgvector.patch /pgvector.patch
# By default, pgvector Makefile uses `-march=native`. We don't want that,
# By default, pgvector Makefile uses `-march=native`. We don't want that,
# because we build the images on different machines than where we run them.
# Pass OPTFLAGS="" to remove it.
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.7.0.tar.gz -O pgvector.tar.gz && \
echo "1b5503a35c265408b6eb282621c5e1e75f7801afc04eecb950796cfee2e3d1d8 pgvector.tar.gz" | sha256sum --check && \
mkdir pgvector-src && cd pgvector-src && tar xvzf ../pgvector.tar.gz --strip-components=1 -C . && \
patch -p1 < /pgvector.patch && \
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.7.1.tar.gz -O pgvector.tar.gz && \
echo "fe6c8cb4e0cd1a8cb60f5badf9e1701e0fcabcfc260931c26d01e155c4dd21d1 pgvector.tar.gz" | sha256sum --check && \
mkdir pgvector-src && cd pgvector-src && tar xzf ../pgvector.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) OPTFLAGS="" PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) OPTFLAGS="" install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/vector.control
@@ -266,7 +264,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
# 9742dab1b2f297ad3811120db7b21451bca2d3c9 made on 13/11/2021
RUN wget https://github.com/michelp/pgjwt/archive/9742dab1b2f297ad3811120db7b21451bca2d3c9.tar.gz -O pgjwt.tar.gz && \
echo "cfdefb15007286f67d3d45510f04a6a7a495004be5b3aecb12cda667e774203f pgjwt.tar.gz" | sha256sum --check && \
mkdir pgjwt-src && cd pgjwt-src && tar xvzf ../pgjwt.tar.gz --strip-components=1 -C . && \
mkdir pgjwt-src && cd pgjwt-src && tar xzf ../pgjwt.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgjwt.control
@@ -281,7 +279,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/HypoPG/hypopg/archive/refs/tags/1.4.0.tar.gz -O hypopg.tar.gz && \
echo "0821011743083226fc9b813c1f2ef5897a91901b57b6bea85a78e466187c6819 hypopg.tar.gz" | sha256sum --check && \
mkdir hypopg-src && cd hypopg-src && tar xvzf ../hypopg.tar.gz --strip-components=1 -C . && \
mkdir hypopg-src && cd hypopg-src && tar xzf ../hypopg.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/hypopg.control
@@ -297,7 +295,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/iCyberon/pg_hashids/archive/refs/tags/v1.2.1.tar.gz -O pg_hashids.tar.gz && \
echo "74576b992d9277c92196dd8d816baa2cc2d8046fe102f3dcd7f3c3febed6822a pg_hashids.tar.gz" | sha256sum --check && \
mkdir pg_hashids-src && cd pg_hashids-src && tar xvzf ../pg_hashids.tar.gz --strip-components=1 -C . && \
mkdir pg_hashids-src && cd pg_hashids-src && tar xzf ../pg_hashids.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_hashids.control
@@ -313,7 +311,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/postgrespro/rum/archive/refs/tags/1.3.13.tar.gz -O rum.tar.gz && \
echo "6ab370532c965568df6210bd844ac6ba649f53055e48243525b0b7e5c4d69a7d rum.tar.gz" | sha256sum --check && \
mkdir rum-src && cd rum-src && tar xvzf ../rum.tar.gz --strip-components=1 -C . && \
mkdir rum-src && cd rum-src && tar xzf ../rum.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/rum.control
@@ -329,7 +327,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/theory/pgtap/archive/refs/tags/v1.2.0.tar.gz -O pgtap.tar.gz && \
echo "9c7c3de67ea41638e14f06da5da57bac6f5bd03fea05c165a0ec862205a5c052 pgtap.tar.gz" | sha256sum --check && \
mkdir pgtap-src && cd pgtap-src && tar xvzf ../pgtap.tar.gz --strip-components=1 -C . && \
mkdir pgtap-src && cd pgtap-src && tar xzf ../pgtap.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgtap.control
@@ -345,7 +343,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/RhodiumToad/ip4r/archive/refs/tags/2.4.2.tar.gz -O ip4r.tar.gz && \
echo "0f7b1f159974f49a47842a8ab6751aecca1ed1142b6d5e38d81b064b2ead1b4b ip4r.tar.gz" | sha256sum --check && \
mkdir ip4r-src && cd ip4r-src && tar xvzf ../ip4r.tar.gz --strip-components=1 -C . && \
mkdir ip4r-src && cd ip4r-src && tar xzf ../ip4r.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/ip4r.control
@@ -361,7 +359,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/dimitri/prefix/archive/refs/tags/v1.2.10.tar.gz -O prefix.tar.gz && \
echo "4342f251432a5f6fb05b8597139d3ccde8dcf87e8ca1498e7ee931ca057a8575 prefix.tar.gz" | sha256sum --check && \
mkdir prefix-src && cd prefix-src && tar xvzf ../prefix.tar.gz --strip-components=1 -C . && \
mkdir prefix-src && cd prefix-src && tar xzf ../prefix.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/prefix.control
@@ -377,7 +375,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/citusdata/postgresql-hll/archive/refs/tags/v2.18.tar.gz -O hll.tar.gz && \
echo "e2f55a6f4c4ab95ee4f1b4a2b73280258c5136b161fe9d059559556079694f0e hll.tar.gz" | sha256sum --check && \
mkdir hll-src && cd hll-src && tar xvzf ../hll.tar.gz --strip-components=1 -C . && \
mkdir hll-src && cd hll-src && tar xzf ../hll.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/hll.control
@@ -393,7 +391,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/okbob/plpgsql_check/archive/refs/tags/v2.5.3.tar.gz -O plpgsql_check.tar.gz && \
echo "6631ec3e7fb3769eaaf56e3dfedb829aa761abf163d13dba354b4c218508e1c0 plpgsql_check.tar.gz" | sha256sum --check && \
mkdir plpgsql_check-src && cd plpgsql_check-src && tar xvzf ../plpgsql_check.tar.gz --strip-components=1 -C . && \
mkdir plpgsql_check-src && cd plpgsql_check-src && tar xzf ../plpgsql_check.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/plpgsql_check.control
@@ -424,7 +422,7 @@ RUN case "${PG_VERSION}" in \
apt-get install -y cmake && \
wget https://github.com/timescale/timescaledb/archive/refs/tags/${TIMESCALEDB_VERSION}.tar.gz -O timescaledb.tar.gz && \
echo "${TIMESCALEDB_CHECKSUM} timescaledb.tar.gz" | sha256sum --check && \
mkdir timescaledb-src && cd timescaledb-src && tar xvzf ../timescaledb.tar.gz --strip-components=1 -C . && \
mkdir timescaledb-src && cd timescaledb-src && tar xzf ../timescaledb.tar.gz --strip-components=1 -C . && \
./bootstrap -DSEND_TELEMETRY_DEFAULT:BOOL=OFF -DUSE_TELEMETRY:BOOL=OFF -DAPACHE_ONLY:BOOL=ON -DCMAKE_BUILD_TYPE=Release && \
cd build && \
make -j $(getconf _NPROCESSORS_ONLN) && \
@@ -462,7 +460,7 @@ RUN case "${PG_VERSION}" in \
esac && \
wget https://github.com/ossc-db/pg_hint_plan/archive/refs/tags/REL${PG_HINT_PLAN_VERSION}.tar.gz -O pg_hint_plan.tar.gz && \
echo "${PG_HINT_PLAN_CHECKSUM} pg_hint_plan.tar.gz" | sha256sum --check && \
mkdir pg_hint_plan-src && cd pg_hint_plan-src && tar xvzf ../pg_hint_plan.tar.gz --strip-components=1 -C . && \
mkdir pg_hint_plan-src && cd pg_hint_plan-src && tar xzf ../pg_hint_plan.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make install -j $(getconf _NPROCESSORS_ONLN) && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/pg_hint_plan.control
@@ -481,7 +479,7 @@ RUN apt-get update && \
apt-get install -y git libgtk2.0-dev libpq-dev libpam-dev libxslt-dev libkrb5-dev cmake && \
wget https://github.com/ketteq-neon/postgres-exts/archive/e0bd1a9d9313d7120c1b9c7bb15c48c0dede4c4e.tar.gz -O kq_imcx.tar.gz && \
echo "dc93a97ff32d152d32737ba7e196d9687041cda15e58ab31344c2f2de8855336 kq_imcx.tar.gz" | sha256sum --check && \
mkdir kq_imcx-src && cd kq_imcx-src && tar xvzf ../kq_imcx.tar.gz --strip-components=1 -C . && \
mkdir kq_imcx-src && cd kq_imcx-src && tar xzf ../kq_imcx.tar.gz --strip-components=1 -C . && \
find /usr/local/pgsql -type f | sed 's|^/usr/local/pgsql/||' > /before.txt &&\
mkdir build && cd build && \
cmake -DCMAKE_BUILD_TYPE=Release .. && \
@@ -505,7 +503,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/citusdata/pg_cron/archive/refs/tags/v1.6.0.tar.gz -O pg_cron.tar.gz && \
echo "383a627867d730222c272bfd25cd5e151c578d73f696d32910c7db8c665cc7db pg_cron.tar.gz" | sha256sum --check && \
mkdir pg_cron-src && cd pg_cron-src && tar xvzf ../pg_cron.tar.gz --strip-components=1 -C . && \
mkdir pg_cron-src && cd pg_cron-src && tar xzf ../pg_cron.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_cron.control
@@ -531,7 +529,7 @@ RUN apt-get update && \
ENV PATH "/usr/local/pgsql/bin/:/usr/local/pgsql/:$PATH"
RUN wget https://github.com/rdkit/rdkit/archive/refs/tags/Release_2023_03_3.tar.gz -O rdkit.tar.gz && \
echo "bdbf9a2e6988526bfeb8c56ce3cdfe2998d60ac289078e2215374288185e8c8d rdkit.tar.gz" | sha256sum --check && \
mkdir rdkit-src && cd rdkit-src && tar xvzf ../rdkit.tar.gz --strip-components=1 -C . && \
mkdir rdkit-src && cd rdkit-src && tar xzf ../rdkit.tar.gz --strip-components=1 -C . && \
cmake \
-D RDK_BUILD_CAIRO_SUPPORT=OFF \
-D RDK_BUILD_INCHI_SUPPORT=ON \
@@ -571,7 +569,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/fboulnois/pg_uuidv7/archive/refs/tags/v1.0.1.tar.gz -O pg_uuidv7.tar.gz && \
echo "0d0759ab01b7fb23851ecffb0bce27822e1868a4a5819bfd276101c716637a7a pg_uuidv7.tar.gz" | sha256sum --check && \
mkdir pg_uuidv7-src && cd pg_uuidv7-src && tar xvzf ../pg_uuidv7.tar.gz --strip-components=1 -C . && \
mkdir pg_uuidv7-src && cd pg_uuidv7-src && tar xzf ../pg_uuidv7.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_uuidv7.control
@@ -588,7 +586,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/ChenHuajun/pg_roaringbitmap/archive/refs/tags/v0.5.4.tar.gz -O pg_roaringbitmap.tar.gz && \
echo "b75201efcb1c2d1b014ec4ae6a22769cc7a224e6e406a587f5784a37b6b5a2aa pg_roaringbitmap.tar.gz" | sha256sum --check && \
mkdir pg_roaringbitmap-src && cd pg_roaringbitmap-src && tar xvzf ../pg_roaringbitmap.tar.gz --strip-components=1 -C . && \
mkdir pg_roaringbitmap-src && cd pg_roaringbitmap-src && tar xzf ../pg_roaringbitmap.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/roaringbitmap.control
@@ -605,7 +603,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/theory/pg-semver/archive/refs/tags/v0.32.1.tar.gz -O pg_semver.tar.gz && \
echo "fbdaf7512026d62eec03fad8687c15ed509b6ba395bff140acd63d2e4fbe25d7 pg_semver.tar.gz" | sha256sum --check && \
mkdir pg_semver-src && cd pg_semver-src && tar xvzf ../pg_semver.tar.gz --strip-components=1 -C . && \
mkdir pg_semver-src && cd pg_semver-src && tar xzf ../pg_semver.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/semver.control
@@ -631,7 +629,7 @@ RUN case "${PG_VERSION}" in \
esac && \
wget https://github.com/neondatabase/pg_embedding/archive/refs/tags/${PG_EMBEDDING_VERSION}.tar.gz -O pg_embedding.tar.gz && \
echo "${PG_EMBEDDING_CHECKSUM} pg_embedding.tar.gz" | sha256sum --check && \
mkdir pg_embedding-src && cd pg_embedding-src && tar xvzf ../pg_embedding.tar.gz --strip-components=1 -C . && \
mkdir pg_embedding-src && cd pg_embedding-src && tar xzf ../pg_embedding.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install
@@ -647,7 +645,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/neondatabase/postgresql_anonymizer/archive/refs/tags/neon_1.1.1.tar.gz -O pg_anon.tar.gz && \
echo "321ea8d5c1648880aafde850a2c576e4a9e7b9933a34ce272efc839328999fa9 pg_anon.tar.gz" | sha256sum --check && \
mkdir pg_anon-src && cd pg_anon-src && tar xvzf ../pg_anon.tar.gz --strip-components=1 -C . && \
mkdir pg_anon-src && cd pg_anon-src && tar xzf ../pg_anon.tar.gz --strip-components=1 -C . && \
find /usr/local/pgsql -type f | sed 's|^/usr/local/pgsql/||' > /before.txt &&\
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/anon.control && \
@@ -696,7 +694,7 @@ ARG PG_VERSION
RUN wget https://github.com/supabase/pg_jsonschema/archive/refs/tags/v0.2.0.tar.gz -O pg_jsonschema.tar.gz && \
echo "9118fc508a6e231e7a39acaa6f066fcd79af17a5db757b47d2eefbe14f7794f0 pg_jsonschema.tar.gz" | sha256sum --check && \
mkdir pg_jsonschema-src && cd pg_jsonschema-src && tar xvzf ../pg_jsonschema.tar.gz --strip-components=1 -C . && \
mkdir pg_jsonschema-src && cd pg_jsonschema-src && tar xzf ../pg_jsonschema.tar.gz --strip-components=1 -C . && \
sed -i 's/pgrx = "0.10.2"/pgrx = { version = "0.10.2", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgrx install --release && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/pg_jsonschema.control
@@ -713,7 +711,7 @@ ARG PG_VERSION
RUN wget https://github.com/supabase/pg_graphql/archive/refs/tags/v1.4.0.tar.gz -O pg_graphql.tar.gz && \
echo "bd8dc7230282b3efa9ae5baf053a54151ed0e66881c7c53750e2d0c765776edc pg_graphql.tar.gz" | sha256sum --check && \
mkdir pg_graphql-src && cd pg_graphql-src && tar xvzf ../pg_graphql.tar.gz --strip-components=1 -C . && \
mkdir pg_graphql-src && cd pg_graphql-src && tar xzf ../pg_graphql.tar.gz --strip-components=1 -C . && \
sed -i 's/pgrx = "=0.10.2"/pgrx = { version = "0.10.2", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgrx install --release && \
# it's needed to enable extension because it uses untrusted C language
@@ -733,7 +731,7 @@ ARG PG_VERSION
# 26806147b17b60763039c6a6878884c41a262318 made on 26/09/2023
RUN wget https://github.com/kelvich/pg_tiktoken/archive/26806147b17b60763039c6a6878884c41a262318.tar.gz -O pg_tiktoken.tar.gz && \
echo "e64e55aaa38c259512d3e27c572da22c4637418cf124caba904cd50944e5004e pg_tiktoken.tar.gz" | sha256sum --check && \
mkdir pg_tiktoken-src && cd pg_tiktoken-src && tar xvzf ../pg_tiktoken.tar.gz --strip-components=1 -C . && \
mkdir pg_tiktoken-src && cd pg_tiktoken-src && tar xzf ../pg_tiktoken.tar.gz --strip-components=1 -C . && \
cargo pgrx install --release && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/pg_tiktoken.control
@@ -749,7 +747,7 @@ ARG PG_VERSION
RUN wget https://github.com/pksunkara/pgx_ulid/archive/refs/tags/v0.1.3.tar.gz -O pgx_ulid.tar.gz && \
echo "ee5db82945d2d9f2d15597a80cf32de9dca67b897f605beb830561705f12683c pgx_ulid.tar.gz" | sha256sum --check && \
mkdir pgx_ulid-src && cd pgx_ulid-src && tar xvzf ../pgx_ulid.tar.gz --strip-components=1 -C . && \
mkdir pgx_ulid-src && cd pgx_ulid-src && tar xzf ../pgx_ulid.tar.gz --strip-components=1 -C . && \
echo "******************* Apply a patch for Postgres 16 support; delete in the next release ******************" && \
wget https://github.com/pksunkara/pgx_ulid/commit/f84954cf63fc8c80d964ac970d9eceed3c791196.patch && \
patch -p1 < f84954cf63fc8c80d964ac970d9eceed3c791196.patch && \
@@ -771,7 +769,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/eulerto/wal2json/archive/refs/tags/wal2json_2_5.tar.gz && \
echo "b516653575541cf221b99cf3f8be9b6821f6dbcfc125675c85f35090f824f00e wal2json_2_5.tar.gz" | sha256sum --check && \
mkdir wal2json-src && cd wal2json-src && tar xvzf ../wal2json_2_5.tar.gz --strip-components=1 -C . && \
mkdir wal2json-src && cd wal2json-src && tar xzf ../wal2json_2_5.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install
@@ -787,7 +785,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/sraoss/pg_ivm/archive/refs/tags/v1.7.tar.gz -O pg_ivm.tar.gz && \
echo "ebfde04f99203c7be4b0e873f91104090e2e83e5429c32ac242d00f334224d5e pg_ivm.tar.gz" | sha256sum --check && \
mkdir pg_ivm-src && cd pg_ivm-src && tar xvzf ../pg_ivm.tar.gz --strip-components=1 -C . && \
mkdir pg_ivm-src && cd pg_ivm-src && tar xzf ../pg_ivm.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_ivm.control
@@ -804,7 +802,7 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/pgpartman/pg_partman/archive/refs/tags/v5.0.1.tar.gz -O pg_partman.tar.gz && \
echo "75b541733a9659a6c90dbd40fccb904a630a32880a6e3044d0c4c5f4c8a65525 pg_partman.tar.gz" | sha256sum --check && \
mkdir pg_partman-src && cd pg_partman-src && tar xvzf ../pg_partman.tar.gz --strip-components=1 -C . && \
mkdir pg_partman-src && cd pg_partman-src && tar xzf ../pg_partman.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_partman.control

View File

@@ -9,6 +9,7 @@ license.workspace = true
anyhow.workspace = true
clap.workspace = true
comfy-table.workspace = true
humantime.workspace = true
hyper.workspace = true
pageserver_api.workspace = true
pageserver_client.workspace = true

View File

@@ -7,8 +7,9 @@ use pageserver_api::{
TenantDescribeResponse, TenantPolicyRequest,
},
models::{
LocationConfigSecondary, ShardParameters, TenantConfig, TenantConfigRequest,
TenantCreateRequest, TenantShardSplitRequest, TenantShardSplitResponse,
EvictionPolicy, EvictionPolicyLayerAccessThreshold, LocationConfigSecondary,
ShardParameters, TenantConfig, TenantConfigRequest, TenantCreateRequest,
TenantShardSplitRequest, TenantShardSplitResponse,
},
shard::{ShardStripeSize, TenantShardId},
};
@@ -125,6 +126,28 @@ enum Command {
#[arg(long)]
tenant_id: TenantId,
},
/// Uncleanly drop a tenant from the storage controller: this doesn't delete anything from pageservers. Appropriate
/// if you e.g. used `tenant-warmup` by mistake on a tenant ID that doesn't really exist, or is in some other region.
TenantDrop {
#[arg(long)]
tenant_id: TenantId,
#[arg(long)]
unclean: bool,
},
NodeDrop {
#[arg(long)]
node_id: NodeId,
#[arg(long)]
unclean: bool,
},
TenantSetTimeBasedEviction {
#[arg(long)]
tenant_id: TenantId,
#[arg(long)]
period: humantime::Duration,
#[arg(long)]
threshold: humantime::Duration,
},
}
#[derive(Parser)]
@@ -674,6 +697,46 @@ async fn main() -> anyhow::Result<()> {
}
}
}
Command::TenantDrop { tenant_id, unclean } => {
if !unclean {
anyhow::bail!("This command is not a tenant deletion, and uncleanly drops all controller state for the tenant. If you know what you're doing, add `--unclean` to proceed.")
}
storcon_client
.dispatch::<(), ()>(
Method::POST,
format!("debug/v1/tenant/{tenant_id}/drop"),
None,
)
.await?;
}
Command::NodeDrop { node_id, unclean } => {
if !unclean {
anyhow::bail!("This command is not a clean node decommission, and uncleanly drops all controller state for the node, without checking if any tenants still refer to it. If you know what you're doing, add `--unclean` to proceed.")
}
storcon_client
.dispatch::<(), ()>(Method::POST, format!("debug/v1/node/{node_id}/drop"), None)
.await?;
}
Command::TenantSetTimeBasedEviction {
tenant_id,
period,
threshold,
} => {
vps_client
.tenant_config(&TenantConfigRequest {
tenant_id,
config: TenantConfig {
eviction_policy: Some(EvictionPolicy::LayerAccessThreshold(
EvictionPolicyLayerAccessThreshold {
period: period.into(),
threshold: threshold.into(),
},
)),
..Default::default()
},
})
.await?;
}
}
Ok(())

View File

@@ -1,6 +1,7 @@
use anyhow::{bail, Result};
use byteorder::{ByteOrder, BE};
use postgres_ffi::relfile_utils::{FSM_FORKNUM, VISIBILITYMAP_FORKNUM};
use postgres_ffi::RepOriginId;
use postgres_ffi::{Oid, TransactionId};
use serde::{Deserialize, Serialize};
use std::{fmt, ops::Range};
@@ -38,6 +39,9 @@ pub const RELATION_SIZE_PREFIX: u8 = 0x61;
/// The key prefix of AUX file keys.
pub const AUX_KEY_PREFIX: u8 = 0x62;
/// The key prefix of ReplOrigin keys.
pub const REPL_ORIGIN_KEY_PREFIX: u8 = 0x63;
/// Check if the key falls in the range of metadata keys.
pub const fn is_metadata_key_slice(key: &[u8]) -> bool {
key[0] >= METADATA_KEY_BEGIN_PREFIX && key[0] < METADATA_KEY_END_PREFIX
@@ -385,9 +389,11 @@ pub fn rel_size_to_key(rel: RelTag) -> Key {
}
}
#[inline(always)]
pub fn is_rel_size_key(key: &Key) -> bool {
key.field1 == 0 && key.field6 == u32::MAX
impl Key {
#[inline(always)]
pub fn is_rel_size_key(&self) -> bool {
self.field1 == 0 && self.field6 == u32::MAX
}
}
#[inline(always)]
@@ -478,12 +484,14 @@ pub fn slru_segment_size_to_key(kind: SlruKind, segno: u32) -> Key {
}
}
pub fn is_slru_segment_size_key(key: &Key) -> bool {
key.field1 == 0x01
&& key.field2 < 0x03
&& key.field3 == 0x01
&& key.field5 == 0
&& key.field6 == u32::MAX
impl Key {
pub fn is_slru_segment_size_key(&self) -> bool {
self.field1 == 0x01
&& self.field2 < 0x03
&& self.field3 == 0x01
&& self.field5 == 0
&& self.field6 == u32::MAX
}
}
#[inline(always)]
@@ -583,6 +591,37 @@ pub const AUX_FILES_KEY: Key = Key {
field6: 2,
};
#[inline(always)]
pub fn repl_origin_key(origin_id: RepOriginId) -> Key {
Key {
field1: REPL_ORIGIN_KEY_PREFIX,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: origin_id as u32,
}
}
/// Get the range of replorigin keys.
pub fn repl_origin_key_range() -> Range<Key> {
Key {
field1: REPL_ORIGIN_KEY_PREFIX,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 0,
}..Key {
field1: REPL_ORIGIN_KEY_PREFIX,
field2: 0,
field3: 0,
field4: 0,
field5: 0,
field6: 0x10000,
}
}
// Reverse mappings for a few Keys.
// These are needed by WAL redo manager.
@@ -591,73 +630,78 @@ pub const NON_INHERITED_RANGE: Range<Key> = AUX_FILES_KEY..AUX_FILES_KEY.next();
/// Sparse keyspace range for vectored get. Missing key error will be ignored for this range.
pub const NON_INHERITED_SPARSE_RANGE: Range<Key> = Key::metadata_key_range();
// AUX_FILES currently stores only data for logical replication (slots etc), and
// we don't preserve these on a branch because safekeepers can't follow timeline
// switch (and generally it likely should be optional), so ignore these.
#[inline(always)]
pub fn is_inherited_key(key: Key) -> bool {
!NON_INHERITED_RANGE.contains(&key) && !NON_INHERITED_SPARSE_RANGE.contains(&key)
}
impl Key {
// AUX_FILES currently stores only data for logical replication (slots etc), and
// we don't preserve these on a branch because safekeepers can't follow timeline
// switch (and generally it likely should be optional), so ignore these.
#[inline(always)]
pub fn is_inherited_key(self) -> bool {
!NON_INHERITED_RANGE.contains(&self) && !NON_INHERITED_SPARSE_RANGE.contains(&self)
}
#[inline(always)]
pub fn is_rel_fsm_block_key(key: Key) -> bool {
key.field1 == 0x00 && key.field4 != 0 && key.field5 == FSM_FORKNUM && key.field6 != 0xffffffff
}
#[inline(always)]
pub fn is_rel_fsm_block_key(self) -> bool {
self.field1 == 0x00
&& self.field4 != 0
&& self.field5 == FSM_FORKNUM
&& self.field6 != 0xffffffff
}
#[inline(always)]
pub fn is_rel_vm_block_key(key: Key) -> bool {
key.field1 == 0x00
&& key.field4 != 0
&& key.field5 == VISIBILITYMAP_FORKNUM
&& key.field6 != 0xffffffff
}
#[inline(always)]
pub fn is_rel_vm_block_key(self) -> bool {
self.field1 == 0x00
&& self.field4 != 0
&& self.field5 == VISIBILITYMAP_FORKNUM
&& self.field6 != 0xffffffff
}
#[inline(always)]
pub fn key_to_slru_block(key: Key) -> anyhow::Result<(SlruKind, u32, BlockNumber)> {
Ok(match key.field1 {
0x01 => {
let kind = match key.field2 {
0x00 => SlruKind::Clog,
0x01 => SlruKind::MultiXactMembers,
0x02 => SlruKind::MultiXactOffsets,
_ => anyhow::bail!("unrecognized slru kind 0x{:02x}", key.field2),
};
let segno = key.field4;
let blknum = key.field6;
#[inline(always)]
pub fn to_slru_block(self) -> anyhow::Result<(SlruKind, u32, BlockNumber)> {
Ok(match self.field1 {
0x01 => {
let kind = match self.field2 {
0x00 => SlruKind::Clog,
0x01 => SlruKind::MultiXactMembers,
0x02 => SlruKind::MultiXactOffsets,
_ => anyhow::bail!("unrecognized slru kind 0x{:02x}", self.field2),
};
let segno = self.field4;
let blknum = self.field6;
(kind, segno, blknum)
}
_ => anyhow::bail!("unexpected value kind 0x{:02x}", key.field1),
})
}
(kind, segno, blknum)
}
_ => anyhow::bail!("unexpected value kind 0x{:02x}", self.field1),
})
}
#[inline(always)]
pub fn is_slru_block_key(key: Key) -> bool {
key.field1 == 0x01 // SLRU-related
&& key.field3 == 0x00000001 // but not SlruDir
&& key.field6 != 0xffffffff // and not SlruSegSize
}
#[inline(always)]
pub fn is_slru_block_key(self) -> bool {
self.field1 == 0x01 // SLRU-related
&& self.field3 == 0x00000001 // but not SlruDir
&& self.field6 != 0xffffffff // and not SlruSegSize
}
#[inline(always)]
pub fn is_rel_block_key(key: &Key) -> bool {
key.field1 == 0x00 && key.field4 != 0 && key.field6 != 0xffffffff
}
#[inline(always)]
pub fn is_rel_block_key(&self) -> bool {
self.field1 == 0x00 && self.field4 != 0 && self.field6 != 0xffffffff
}
/// Guaranteed to return `Ok()` if [[is_rel_block_key]] returns `true` for `key`.
#[inline(always)]
pub fn key_to_rel_block(key: Key) -> anyhow::Result<(RelTag, BlockNumber)> {
Ok(match key.field1 {
0x00 => (
RelTag {
spcnode: key.field2,
dbnode: key.field3,
relnode: key.field4,
forknum: key.field5,
},
key.field6,
),
_ => anyhow::bail!("unexpected value kind 0x{:02x}", key.field1),
})
/// Guaranteed to return `Ok()` if [`Self::is_rel_block_key`] returns `true` for `key`.
#[inline(always)]
pub fn to_rel_block(self) -> anyhow::Result<(RelTag, BlockNumber)> {
Ok(match self.field1 {
0x00 => (
RelTag {
spcnode: self.field2,
dbnode: self.field3,
relnode: self.field4,
forknum: self.field5,
},
self.field6,
),
_ => anyhow::bail!("unexpected value kind 0x{:02x}", self.field1),
})
}
}
impl std::str::FromStr for Key {

View File

@@ -1,9 +1,6 @@
use std::{ops::RangeInclusive, str::FromStr};
use crate::{
key::{is_rel_block_key, Key},
models::ShardParameters,
};
use crate::{key::Key, models::ShardParameters};
use hex::FromHex;
use postgres_ffi::relfile_utils::INIT_FORKNUM;
use serde::{Deserialize, Serialize};
@@ -672,7 +669,7 @@ fn key_is_shard0(key: &Key) -> bool {
// because they must be included in basebackups.
let is_initfork = key.field5 == INIT_FORKNUM;
!is_rel_block_key(key) || is_initfork
!key.is_rel_block_key() || is_initfork
}
/// Provide the same result as the function in postgres `hashfn.h` with the same name

View File

@@ -126,6 +126,7 @@ fn main() -> anyhow::Result<()> {
.allowlist_type("PageHeaderData")
.allowlist_type("DBState")
.allowlist_type("RelMapFile")
.allowlist_type("RepOriginId")
// Because structs are used for serialization, tell bindgen to emit
// explicit padding fields.
.explicit_padding(true)

View File

@@ -110,6 +110,7 @@ pub mod pg_constants;
pub mod relfile_utils;
// Export some widely used datatypes that are unlikely to change across Postgres versions
pub use v14::bindings::RepOriginId;
pub use v14::bindings::{uint32, uint64, Oid};
pub use v14::bindings::{BlockNumber, OffsetNumber};
pub use v14::bindings::{MultiXactId, TransactionId};

View File

@@ -102,7 +102,7 @@ pub const XACT_XINFO_HAS_SUBXACTS: u32 = 1u32 << 1;
pub const XACT_XINFO_HAS_RELFILENODES: u32 = 1u32 << 2;
pub const XACT_XINFO_HAS_INVALS: u32 = 1u32 << 3;
pub const XACT_XINFO_HAS_TWOPHASE: u32 = 1u32 << 4;
// pub const XACT_XINFO_HAS_ORIGIN: u32 = 1u32 << 5;
pub const XACT_XINFO_HAS_ORIGIN: u32 = 1u32 << 5;
// pub const XACT_XINFO_HAS_AE_LOCKS: u32 = 1u32 << 6;
// pub const XACT_XINFO_HAS_GID: u32 = 1u32 << 7;
@@ -167,6 +167,7 @@ pub const RM_RELMAP_ID: u8 = 7;
pub const RM_STANDBY_ID: u8 = 8;
pub const RM_HEAP2_ID: u8 = 9;
pub const RM_HEAP_ID: u8 = 10;
pub const RM_REPLORIGIN_ID: u8 = 19;
pub const RM_LOGICALMSG_ID: u8 = 21;
// from neon_rmgr.h
@@ -223,6 +224,10 @@ pub const XLOG_CHECKPOINT_ONLINE: u8 = 0x10;
pub const XLP_FIRST_IS_CONTRECORD: u16 = 0x0001;
pub const XLP_LONG_HEADER: u16 = 0x0002;
/* From xlog.h */
pub const XLOG_REPLORIGIN_SET: u8 = 0x00;
pub const XLOG_REPLORIGIN_DROP: u8 = 0x10;
/* From replication/slot.h */
pub const REPL_SLOT_ON_DISK_OFFSETOF_RESTART_LSN: usize = 4*4 /* offset of `slotdata` in ReplicationSlotOnDisk */
+ 64 /* NameData */ + 4*4;
@@ -237,6 +242,9 @@ pub const SLOTS_PER_FSM_PAGE: u32 = FSM_LEAF_NODES_PER_PAGE as u32;
pub const VM_HEAPBLOCKS_PER_PAGE: u32 =
(BLCKSZ as usize - SIZEOF_PAGE_HEADER_DATA) as u32 * (8 / 2); // MAPSIZE * (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
/* From origin.c */
pub const REPLICATION_STATE_MAGIC: u32 = 0x1257DADE;
// List of subdirectories inside pgdata.
// Copied from src/bin/initdb/initdb.c
pub const PGDATA_SUBDIRS: [&str; 22] = [

View File

@@ -3,6 +3,7 @@
use std::borrow::Cow;
use std::collections::HashMap;
use std::env;
use std::fmt::Display;
use std::io;
use std::num::NonZeroU32;
use std::pin::Pin;
@@ -29,6 +30,7 @@ use http_types::{StatusCode, Url};
use scopeguard::ScopeGuard;
use tokio_util::sync::CancellationToken;
use tracing::debug;
use utils::backoff;
use crate::metrics::{start_measuring_requests, AttemptOutcome, RequestKind};
use crate::{
@@ -451,26 +453,58 @@ impl RemoteStorage for AzureBlobStorage {
// TODO batch requests are not supported by the SDK
// https://github.com/Azure/azure-sdk-for-rust/issues/1068
for path in paths {
let blob_client = self.client.blob_client(self.relative_path_to_name(path));
let request = blob_client.delete().into_future();
let res = tokio::time::timeout(self.timeout, request).await;
match res {
Ok(Ok(_response)) => continue,
Ok(Err(e)) => {
if let Some(http_err) = e.as_http_error() {
if http_err.status() == StatusCode::NotFound {
continue;
}
}
return Err(e.into());
}
Err(_elapsed) => return Err(TimeoutOrCancel::Timeout.into()),
#[derive(Debug)]
enum AzureOrTimeout {
AzureError(azure_core::Error),
Timeout,
Cancel,
}
}
impl Display for AzureOrTimeout {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{self:?}")
}
}
let warn_threshold = 3;
let max_retries = 5;
backoff::retry(
|| async {
let blob_client = self.client.blob_client(self.relative_path_to_name(path));
let request = blob_client.delete().into_future();
let res = tokio::time::timeout(self.timeout, request).await;
match res {
Ok(Ok(_v)) => Ok(()),
Ok(Err(azure_err)) => {
if let Some(http_err) = azure_err.as_http_error() {
if http_err.status() == StatusCode::NotFound {
return Ok(());
}
}
Err(AzureOrTimeout::AzureError(azure_err))
}
Err(_elapsed) => Err(AzureOrTimeout::Timeout),
}
},
|err| match err {
AzureOrTimeout::AzureError(_) | AzureOrTimeout::Timeout => false,
AzureOrTimeout::Cancel => true,
},
warn_threshold,
max_retries,
"deleting remote object",
cancel,
)
.await
.ok_or_else(|| AzureOrTimeout::Cancel)
.and_then(|x| x)
.map_err(|e| match e {
AzureOrTimeout::AzureError(err) => anyhow::Error::from(err),
AzureOrTimeout::Timeout => TimeoutOrCancel::Timeout.into(),
AzureOrTimeout::Cancel => TimeoutOrCancel::Cancel.into(),
})?;
}
Ok(())
};

View File

@@ -78,6 +78,10 @@ where
let e = Err(std::io::Error::from(e));
return Poll::Ready(Some(e));
}
} else {
// this would be perfectly valid behaviour for doing a graceful completion on the
// download for example, but not one we expect to do right now.
tracing::warn!("continuing polling after having cancelled or timeouted");
}
this.inner.poll_next(cx)
@@ -89,13 +93,22 @@ where
}
/// Fires only on the first cancel or timeout, not on both.
pub(crate) async fn cancel_or_timeout(
pub(crate) fn cancel_or_timeout(
timeout: Duration,
cancel: CancellationToken,
) -> TimeoutOrCancel {
tokio::select! {
_ = tokio::time::sleep(timeout) => TimeoutOrCancel::Timeout,
_ = cancel.cancelled() => TimeoutOrCancel::Cancel,
) -> impl std::future::Future<Output = TimeoutOrCancel> + 'static {
// futures are lazy, they don't do anything before being polled.
//
// "precalculate" the wanted deadline before returning the future, so that we can use pause
// failpoint to trigger a timeout in test.
let deadline = tokio::time::Instant::now() + timeout;
async move {
tokio::select! {
_ = tokio::time::sleep_until(deadline) => TimeoutOrCancel::Timeout,
_ = cancel.cancelled() => {
TimeoutOrCancel::Cancel
},
}
}
}
@@ -172,4 +185,31 @@ mod tests {
_ = tokio::time::sleep(Duration::from_secs(121)) => {},
}
}
#[tokio::test]
async fn notified_but_pollable_after() {
let inner = futures::stream::once(futures::future::ready(Ok(bytes::Bytes::from_static(
b"hello world",
))));
let timeout = Duration::from_secs(120);
let cancel = CancellationToken::new();
cancel.cancel();
let stream = DownloadStream::new(cancel_or_timeout(timeout, cancel.clone()), inner);
let mut stream = std::pin::pin!(stream);
let next = stream.next().await;
let ioe = next.unwrap().unwrap_err();
assert!(
matches!(
ioe.get_ref().unwrap().downcast_ref::<DownloadError>(),
Some(&DownloadError::Cancelled)
),
"{ioe:?}"
);
let next = stream.next().await;
let bytes = next.unwrap().unwrap();
assert_eq!(&b"hello world"[..], bytes);
}
}

View File

@@ -3,6 +3,9 @@ use std::{fs, io, path::Path};
use anyhow::Context;
mod rename_noreplace;
pub use rename_noreplace::rename_noreplace;
pub trait PathExt {
/// Returns an error if `self` is not a directory.
fn is_empty_dir(&self) -> io::Result<bool>;

View File

@@ -0,0 +1,109 @@
use nix::NixPath;
/// Rename a file without replacing an existing file.
///
/// This is a wrapper around platform-specific APIs.
pub fn rename_noreplace<P1: ?Sized + NixPath, P2: ?Sized + NixPath>(
src: &P1,
dst: &P2,
) -> nix::Result<()> {
{
#[cfg(target_os = "linux")]
{
nix::fcntl::renameat2(
None,
src,
None,
dst,
nix::fcntl::RenameFlags::RENAME_NOREPLACE,
)
}
#[cfg(target_os = "macos")]
{
let res = src.with_nix_path(|src| {
dst.with_nix_path(|dst|
// SAFETY: `src` and `dst` are valid C strings as per the NixPath trait and they outlive the call to renamex_np.
unsafe {
nix::libc::renamex_np(src.as_ptr(), dst.as_ptr(), nix::libc::RENAME_EXCL)
})
})??;
nix::errno::Errno::result(res).map(drop)
}
#[cfg(not(any(target_os = "linux", target_os = "macos")))]
{
std::compile_error!("OS does not support no-replace renames");
}
}
}
#[cfg(test)]
mod test {
use std::{fs, path::PathBuf};
use super::*;
fn testdir() -> camino_tempfile::Utf8TempDir {
match crate::env::var("NEON_UTILS_RENAME_NOREPLACE_TESTDIR") {
Some(path) => {
let path: camino::Utf8PathBuf = path;
camino_tempfile::tempdir_in(path).unwrap()
}
None => camino_tempfile::tempdir().unwrap(),
}
}
#[test]
fn test_absolute_paths() {
let testdir = testdir();
println!("testdir: {}", testdir.path());
let src = testdir.path().join("src");
let dst = testdir.path().join("dst");
fs::write(&src, b"").unwrap();
fs::write(&dst, b"").unwrap();
let src = src.canonicalize().unwrap();
assert!(src.is_absolute());
let dst = dst.canonicalize().unwrap();
assert!(dst.is_absolute());
let result = rename_noreplace(&src, &dst);
assert_eq!(result.unwrap_err(), nix::Error::EEXIST);
}
#[test]
fn test_relative_paths() {
let testdir = testdir();
println!("testdir: {}", testdir.path());
// this is fine because we run in nextest => process per test
std::env::set_current_dir(testdir.path()).unwrap();
let src = PathBuf::from("src");
let dst = PathBuf::from("dst");
fs::write(&src, b"").unwrap();
fs::write(&dst, b"").unwrap();
let result = rename_noreplace(&src, &dst);
assert_eq!(result.unwrap_err(), nix::Error::EEXIST);
}
#[test]
fn test_works_when_not_exists() {
let testdir = testdir();
println!("testdir: {}", testdir.path());
let src = testdir.path().join("src");
let dst = testdir.path().join("dst");
fs::write(&src, b"content").unwrap();
rename_noreplace(src.as_std_path(), dst.as_std_path()).unwrap();
assert_eq!(
"content",
String::from_utf8(std::fs::read(&dst).unwrap()).unwrap()
);
}
}

View File

@@ -34,6 +34,9 @@ pub enum ApiError {
#[error("Timeout")]
Timeout(Cow<'static, str>),
#[error("Request cancelled")]
Cancelled,
#[error(transparent)]
InternalServerError(anyhow::Error),
}
@@ -74,6 +77,10 @@ impl ApiError {
err.to_string(),
StatusCode::REQUEST_TIMEOUT,
),
ApiError::Cancelled => HttpErrorBody::response_from_msg_and_status(
self.to_string(),
StatusCode::INTERNAL_SERVER_ERROR,
),
ApiError::InternalServerError(err) => HttpErrorBody::response_from_msg_and_status(
err.to_string(),
StatusCode::INTERNAL_SERVER_ERROR,
@@ -133,6 +140,7 @@ pub fn api_error_handler(api_error: ApiError) -> Response<Body> {
ApiError::InternalServerError(_) => error!("Error processing HTTP request: {api_error:?}"),
ApiError::ShuttingDown => info!("Shut down while processing HTTP request"),
ApiError::Timeout(_) => info!("Timeout while processing HTTP request: {api_error:#}"),
ApiError::Cancelled => info!("Request cancelled while processing HTTP request"),
_ => info!("Error processing HTTP request: {api_error:#}"),
}

View File

@@ -26,7 +26,7 @@ pub(crate) async fn main(cmd: &IndexPartCmd) -> anyhow::Result<()> {
let output = Output {
layer_metadata: &des.layer_metadata,
disk_consistent_lsn: des.get_disk_consistent_lsn(),
disk_consistent_lsn: des.metadata.disk_consistent_lsn(),
timeline_metadata: &des.metadata,
};

View File

@@ -72,13 +72,14 @@ impl DescribeKeyCommand {
println!("{key:?}");
macro_rules! kind_query {
([$($name:ident),*$(,)?]) => {{[$(kind_query!($name)),*]}};
($name:ident) => {{
let s: &'static str = stringify!($name);
let s = s.strip_prefix("is_").unwrap_or(s);
let s = s.strip_suffix("_key").unwrap_or(s);
#[allow(clippy::needless_borrow)]
(s, pageserver_api::key::$name(key))
(s, key.$name())
}};
}
@@ -86,18 +87,15 @@ impl DescribeKeyCommand {
// "recognization". I think it accurately represents how strictly we model the Key
// right now, but could of course be made less confusing.
let queries = [
("rel_block", pageserver_api::key::is_rel_block_key(&key)),
kind_query!(is_rel_vm_block_key),
kind_query!(is_rel_fsm_block_key),
kind_query!(is_slru_block_key),
kind_query!(is_inherited_key),
("rel_size", pageserver_api::key::is_rel_size_key(&key)),
(
"slru_segment_size",
pageserver_api::key::is_slru_segment_size_key(&key),
),
];
let queries = kind_query!([
is_rel_block_key,
is_rel_vm_block_key,
is_rel_fsm_block_key,
is_slru_block_key,
is_inherited_key,
is_rel_size_key,
is_slru_segment_size_key,
]);
let recognized_kind = "recognized kind";
let metadata_key = "metadata key";

View File

@@ -1,6 +1,6 @@
use anyhow::Context;
use camino::Utf8PathBuf;
use pageserver_api::key::{is_rel_block_key, key_to_rel_block, Key};
use pageserver_api::key::Key;
use pageserver_api::keyspace::KeySpaceAccum;
use pageserver_api::models::PagestreamGetPageRequest;
@@ -187,7 +187,7 @@ async fn main_impl(
for r in partitioning.keys.ranges.iter() {
let mut i = r.start;
while i != r.end {
if is_rel_block_key(&i) {
if i.is_rel_block_key() {
filtered.add_key(i);
}
i = i.next();
@@ -308,9 +308,10 @@ async fn main_impl(
let r = &ranges[weights.sample(&mut rng)];
let key: i128 = rng.gen_range(r.start..r.end);
let key = Key::from_i128(key);
assert!(is_rel_block_key(&key));
let (rel_tag, block_no) =
key_to_rel_block(key).expect("we filter non-rel-block keys out above");
assert!(key.is_rel_block_key());
let (rel_tag, block_no) = key
.to_rel_block()
.expect("we filter non-rel-block keys out above");
PagestreamGetPageRequest {
request_lsn: if rng.gen_bool(args.req_latest_probability) {
Lsn::MAX

View File

@@ -178,7 +178,8 @@ impl AuxFileSizeEstimator {
}
}
pub fn on_base_backup(&self, new_size: usize) {
/// When generating base backup or doing initial logical size calculation
pub fn on_initial(&self, new_size: usize) {
let mut guard = self.size.lock().unwrap();
*guard = Some(new_size as isize);
self.report(new_size as isize);

View File

@@ -13,7 +13,7 @@
use anyhow::{anyhow, Context};
use bytes::{BufMut, Bytes, BytesMut};
use fail::fail_point;
use pageserver_api::key::{key_to_slru_block, Key};
use pageserver_api::key::Key;
use postgres_ffi::pg_constants;
use std::fmt::Write as FmtWrite;
use std::time::SystemTime;
@@ -170,7 +170,7 @@ where
}
async fn add_block(&mut self, key: &Key, block: Bytes) -> Result<(), BasebackupError> {
let (kind, segno, _) = key_to_slru_block(*key)?;
let (kind, segno, _) = key.to_slru_block()?;
match kind {
SlruKind::Clog => {
@@ -362,6 +362,13 @@ where
));
info!("Replication slot {} restart LSN={}", path, restart_lsn);
min_restart_lsn = Lsn::min(min_restart_lsn, restart_lsn);
} else if path == "pg_logical/replorigin_checkpoint" {
// replorigin_checkoint is written only on compute shutdown, so it contains
// deteriorated values. So we generate our own version of this file for the particular LSN
// based on information about replorigins extracted from transaction commit records.
// In future we will not generate AUX record for "pg_logical/replorigin_checkpoint" at all,
// but now we should handle (skip) it for backward compatibility.
continue;
}
let header = new_tar_header(&path, content.len() as u64)?;
self.ar
@@ -390,6 +397,32 @@ where
{
self.add_twophase_file(xid).await?;
}
let repl_origins = self
.timeline
.get_replorigins(self.lsn, self.ctx)
.await
.map_err(|e| BasebackupError::Server(e.into()))?;
let n_origins = repl_origins.len();
if n_origins != 0 {
//
// Construct "pg_logical/replorigin_checkpoint" file based on information about replication origins
// extracted from transaction commit record. We are using this file to pass information about replication
// origins to compute to allow logical replication to restart from proper point.
//
let mut content = Vec::with_capacity(n_origins * 16 + 8);
content.extend_from_slice(&pg_constants::REPLICATION_STATE_MAGIC.to_le_bytes());
for (origin_id, origin_lsn) in repl_origins {
content.extend_from_slice(&origin_id.to_le_bytes());
content.extend_from_slice(&[0u8; 6]); // align to 8 bytes
content.extend_from_slice(&origin_lsn.0.to_le_bytes());
}
let crc32 = crc32c::crc32c(&content);
content.extend_from_slice(&crc32.to_le_bytes());
let header = new_tar_header("pg_logical/replorigin_checkpoint", content.len() as u64)?;
self.ar.append(&header, &*content).await.context(
"could not add pg_logical/replorigin_checkpoint file to basebackup tarball",
)?;
}
fail_point!("basebackup-before-control-file", |_| {
Err(BasebackupError::Server(anyhow!(

View File

@@ -99,8 +99,6 @@ pub mod defaults {
pub const DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB: usize = 0;
pub const DEFAULT_WALREDO_PROCESS_KIND: &str = "async";
///
/// Default built-in configuration file.
///
@@ -146,8 +144,6 @@ pub mod defaults {
#validate_vectored_get = '{DEFAULT_VALIDATE_VECTORED_GET}'
#walredo_process_kind = '{DEFAULT_WALREDO_PROCESS_KIND}'
[tenant_config]
#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
#checkpoint_timeout = {DEFAULT_CHECKPOINT_TIMEOUT}
@@ -300,8 +296,6 @@ pub struct PageServerConf {
///
/// Setting this to zero disables limits on total ephemeral layer size.
pub ephemeral_bytes_per_memory_kb: usize,
pub walredo_process_kind: crate::walredo::ProcessKind,
}
/// We do not want to store this in a PageServerConf because the latter may be logged
@@ -407,8 +401,6 @@ struct PageServerConfigBuilder {
validate_vectored_get: BuilderValue<bool>,
ephemeral_bytes_per_memory_kb: BuilderValue<usize>,
walredo_process_kind: BuilderValue<crate::walredo::ProcessKind>,
}
impl PageServerConfigBuilder {
@@ -497,8 +489,6 @@ impl PageServerConfigBuilder {
)),
validate_vectored_get: Set(DEFAULT_VALIDATE_VECTORED_GET),
ephemeral_bytes_per_memory_kb: Set(DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB),
walredo_process_kind: Set(DEFAULT_WALREDO_PROCESS_KIND.parse().unwrap()),
}
}
}
@@ -686,10 +676,6 @@ impl PageServerConfigBuilder {
self.ephemeral_bytes_per_memory_kb = BuilderValue::Set(value);
}
pub fn get_walredo_process_kind(&mut self, value: crate::walredo::ProcessKind) {
self.walredo_process_kind = BuilderValue::Set(value);
}
pub fn build(self) -> anyhow::Result<PageServerConf> {
let default = Self::default_values();
@@ -747,7 +733,6 @@ impl PageServerConfigBuilder {
max_vectored_read_bytes,
validate_vectored_get,
ephemeral_bytes_per_memory_kb,
walredo_process_kind,
}
CUSTOM LOGIC
{
@@ -1044,9 +1029,6 @@ impl PageServerConf {
"ephemeral_bytes_per_memory_kb" => {
builder.get_ephemeral_bytes_per_memory_kb(parse_toml_u64("ephemeral_bytes_per_memory_kb", item)? as usize)
}
"walredo_process_kind" => {
builder.get_walredo_process_kind(parse_toml_from_str("walredo_process_kind", item)?)
}
_ => bail!("unrecognized pageserver option '{key}'"),
}
}
@@ -1130,7 +1112,6 @@ impl PageServerConf {
),
validate_vectored_get: defaults::DEFAULT_VALIDATE_VECTORED_GET,
ephemeral_bytes_per_memory_kb: defaults::DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB,
walredo_process_kind: defaults::DEFAULT_WALREDO_PROCESS_KIND.parse().unwrap(),
}
}
}
@@ -1370,7 +1351,6 @@ background_task_maximum_delay = '334 s'
),
validate_vectored_get: defaults::DEFAULT_VALIDATE_VECTORED_GET,
ephemeral_bytes_per_memory_kb: defaults::DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB,
walredo_process_kind: defaults::DEFAULT_WALREDO_PROCESS_KIND.parse().unwrap(),
},
"Correct defaults should be used when no config values are provided"
);
@@ -1444,7 +1424,6 @@ background_task_maximum_delay = '334 s'
),
validate_vectored_get: defaults::DEFAULT_VALIDATE_VECTORED_GET,
ephemeral_bytes_per_memory_kb: defaults::DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB,
walredo_process_kind: defaults::DEFAULT_WALREDO_PROCESS_KIND.parse().unwrap(),
},
"Should be able to parse all basic config values correctly"
);

View File

@@ -81,8 +81,10 @@ paths:
Attempts to delete specified tenant. 500, 503 and 409 errors should be retried until 404 is retrieved.
404 means that deletion successfully finished"
responses:
"200":
description: Tenant was successfully deleted, or was already not found.
"404":
description: Tenant not found. This is the success path.
description: Tenant not found. This is a success result, equivalent to 200.
content:
application/json:
schema:

View File

@@ -181,9 +181,7 @@ impl From<PageReconstructError> for ApiError {
PageReconstructError::MissingKey(e) => {
ApiError::InternalServerError(anyhow::anyhow!("{e}"))
}
PageReconstructError::Cancelled => {
ApiError::InternalServerError(anyhow::anyhow!("request was cancelled"))
}
PageReconstructError::Cancelled => ApiError::Cancelled,
PageReconstructError::AncestorLsnTimeout(e) => ApiError::Timeout(format!("{e}").into()),
PageReconstructError::WalRedo(pre) => ApiError::InternalServerError(pre),
}
@@ -1073,7 +1071,7 @@ async fn tenant_delete_handler(
let state = get_state(&request);
state
let status = state
.tenant_manager
.delete_tenant(tenant_shard_id, ACTIVE_TENANT_TIMEOUT)
.instrument(info_span!("tenant_delete_handler",
@@ -1082,7 +1080,14 @@ async fn tenant_delete_handler(
))
.await?;
json_response(StatusCode::ACCEPTED, ())
// Callers use 404 as success for deletions, for historical reasons.
if status == StatusCode::NOT_FOUND {
return Err(ApiError::NotFound(
anyhow::anyhow!("Deletion complete").into(),
));
}
json_response(status, ())
}
/// HTTP endpoint to query the current tenant_size of a tenant.
@@ -2182,7 +2187,7 @@ async fn tenant_scan_remote_handler(
{
Ok((index_part, index_generation)) => {
tracing::info!("Found timeline {tenant_shard_id}/{timeline_id} metadata (gen {index_generation:?}, {} layers, {} consistent LSN)",
index_part.layer_metadata.len(), index_part.get_disk_consistent_lsn());
index_part.layer_metadata.len(), index_part.metadata.disk_consistent_lsn());
generation = std::cmp::max(generation, index_generation);
}
Err(DownloadError::NotFound) => {

View File

@@ -2108,6 +2108,7 @@ pub(crate) struct TimelineMetrics {
pub directory_entries_count_gauge: Lazy<UIntGauge, Box<dyn Send + Fn() -> UIntGauge>>,
pub evictions: IntCounter,
pub evictions_with_low_residence_duration: std::sync::RwLock<EvictionsWithLowResidenceDuration>,
shutdown: std::sync::atomic::AtomicBool,
}
impl TimelineMetrics {
@@ -2227,6 +2228,7 @@ impl TimelineMetrics {
evictions_with_low_residence_duration: std::sync::RwLock::new(
evictions_with_low_residence_duration,
),
shutdown: std::sync::atomic::AtomicBool::default(),
}
}
@@ -2249,6 +2251,17 @@ impl TimelineMetrics {
}
pub(crate) fn shutdown(&self) {
let was_shutdown = self
.shutdown
.swap(true, std::sync::atomic::Ordering::Relaxed);
if was_shutdown {
// this happens on tenant deletion because tenant first shuts down timelines, then
// invokes timeline deletion which first shuts down the timeline again.
// TODO: this can be removed once https://github.com/neondatabase/neon/issues/5080
return;
}
let tenant_id = &self.tenant_id;
let timeline_id = &self.timeline_id;
let shard_id = &self.shard_id;

View File

@@ -17,8 +17,8 @@ use bytes::{Buf, Bytes, BytesMut};
use enum_map::Enum;
use itertools::Itertools;
use pageserver_api::key::{
dbdir_key_range, is_rel_block_key, is_slru_block_key, rel_block_to_key, rel_dir_to_key,
rel_key_range, rel_size_to_key, relmap_file_key, slru_block_to_key, slru_dir_to_key,
dbdir_key_range, rel_block_to_key, rel_dir_to_key, rel_key_range, rel_size_to_key,
relmap_file_key, repl_origin_key, repl_origin_key_range, slru_block_to_key, slru_dir_to_key,
slru_segment_key_range, slru_segment_size_to_key, twophase_file_key, twophase_key_range,
AUX_FILES_KEY, CHECKPOINT_KEY, CONTROLFILE_KEY, DBDIR_KEY, TWOPHASEDIR_KEY,
};
@@ -27,7 +27,7 @@ use pageserver_api::models::AuxFilePolicy;
use pageserver_api::reltag::{BlockNumber, RelTag, SlruKind};
use postgres_ffi::relfile_utils::{FSM_FORKNUM, VISIBILITYMAP_FORKNUM};
use postgres_ffi::BLCKSZ;
use postgres_ffi::{Oid, TimestampTz, TransactionId};
use postgres_ffi::{Oid, RepOriginId, TimestampTz, TransactionId};
use serde::{Deserialize, Serialize};
use std::collections::{hash_map, HashMap, HashSet};
use std::ops::ControlFlow;
@@ -36,6 +36,7 @@ use strum::IntoEnumIterator;
use tokio_util::sync::CancellationToken;
use tracing::{debug, info, trace, warn};
use utils::bin_ser::DeserializeError;
use utils::pausable_failpoint;
use utils::vec_map::{VecMap, VecMapOrdering};
use utils::{bin_ser::BeSer, lsn::Lsn};
@@ -409,6 +410,8 @@ impl Timeline {
cancel: &CancellationToken,
ctx: &RequestContext,
) -> Result<LsnForTimestamp, PageReconstructError> {
pausable_failpoint!("find-lsn-for-timestamp-pausable");
let gc_cutoff_lsn_guard = self.get_latest_gc_cutoff_lsn();
// We use this method to figure out the branching LSN for the new branch, but the
// GC cutoff could be before the branching point and we cannot create a new branch
@@ -424,6 +427,7 @@ impl Timeline {
let mut found_smaller = false;
let mut found_larger = false;
while low < high {
if cancel.is_cancelled() {
return Err(PageReconstructError::Cancelled);
@@ -718,10 +722,22 @@ impl Timeline {
result.insert(fname, content);
}
}
self.aux_file_size_estimator.on_base_backup(sz);
self.aux_file_size_estimator.on_initial(sz);
Ok(result)
}
pub(crate) async fn trigger_aux_file_size_computation(
&self,
lsn: Lsn,
ctx: &RequestContext,
) -> Result<(), PageReconstructError> {
let current_policy = self.last_aux_file_policy.load();
if let Some(AuxFilePolicy::V2) | Some(AuxFilePolicy::CrossValidation) = current_policy {
self.list_aux_files_v2(lsn, ctx).await?;
}
Ok(())
}
pub(crate) async fn list_aux_files(
&self,
lsn: Lsn,
@@ -760,6 +776,27 @@ impl Timeline {
}
}
pub(crate) async fn get_replorigins(
&self,
lsn: Lsn,
ctx: &RequestContext,
) -> Result<HashMap<RepOriginId, Lsn>, PageReconstructError> {
let kv = self
.scan(KeySpace::single(repl_origin_key_range()), lsn, ctx)
.await
.context("scan")?;
let mut result = HashMap::new();
for (k, v) in kv {
let v = v.context("get value")?;
let origin_id = k.field6 as RepOriginId;
let origin_lsn = Lsn::des(&v).unwrap();
if origin_lsn != Lsn::INVALID {
result.insert(origin_id, origin_lsn);
}
}
Ok(result)
}
/// Does the same as get_current_logical_size but counted on demand.
/// Used to initialize the logical size tracking on startup.
///
@@ -885,7 +922,9 @@ impl Timeline {
Ok((
result.to_keyspace(),
/* AUX sparse key space */
SparseKeySpace(KeySpace::single(Key::metadata_aux_key_range())),
SparseKeySpace(KeySpace {
ranges: vec![repl_origin_key_range(), Key::metadata_aux_key_range()],
}),
))
}
@@ -1154,6 +1193,20 @@ impl<'a> DatadirModification<'a> {
Ok(())
}
pub async fn set_replorigin(
&mut self,
origin_id: RepOriginId,
origin_lsn: Lsn,
) -> anyhow::Result<()> {
let key = repl_origin_key(origin_id);
self.put(key, Value::Image(origin_lsn.ser().unwrap().into()));
Ok(())
}
pub async fn drop_replorigin(&mut self, origin_id: RepOriginId) -> anyhow::Result<()> {
self.set_replorigin(origin_id, Lsn::INVALID).await
}
pub fn put_control_file(&mut self, img: Bytes) -> anyhow::Result<()> {
self.put(CONTROLFILE_KEY, Value::Image(img));
Ok(())
@@ -1684,7 +1737,7 @@ impl<'a> DatadirModification<'a> {
let mut retained_pending_updates = HashMap::<_, Vec<_>>::new();
for (key, values) in self.pending_updates.drain() {
for (lsn, value) in values {
if is_rel_block_key(&key) || is_slru_block_key(key) {
if key.is_rel_block_key() || key.is_slru_block_key() {
// This bails out on first error without modifying pending_updates.
// That's Ok, cf this function's doc comment.
writer.put(key, lsn, &value, ctx).await?;

View File

@@ -3865,6 +3865,9 @@ pub(crate) mod harness {
pub fn create_custom(
test_name: &'static str,
tenant_conf: TenantConf,
tenant_id: TenantId,
shard_identity: ShardIdentity,
generation: Generation,
) -> anyhow::Result<Self> {
setup_logging();
@@ -3877,8 +3880,12 @@ pub(crate) mod harness {
// OK in a test.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
let tenant_id = TenantId::generate();
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
let shard = shard_identity.shard_index();
let tenant_shard_id = TenantShardId {
tenant_id,
shard_number: shard.shard_number,
shard_count: shard.shard_count,
};
fs::create_dir_all(conf.tenant_path(&tenant_shard_id))?;
fs::create_dir_all(conf.timelines_path(&tenant_shard_id))?;
@@ -3896,8 +3903,8 @@ pub(crate) mod harness {
conf,
tenant_conf,
tenant_shard_id,
generation: Generation::new(0xdeadbeef),
shard: ShardIndex::unsharded(),
generation,
shard,
remote_storage,
remote_fs_dir,
deletion_queue,
@@ -3912,8 +3919,15 @@ pub(crate) mod harness {
compaction_period: Duration::ZERO,
..TenantConf::default()
};
Self::create_custom(test_name, tenant_conf)
let tenant_id = TenantId::generate();
let shard = ShardIdentity::unsharded();
Self::create_custom(
test_name,
tenant_conf,
tenant_id,
shard,
Generation::new(0xdeadbeef),
)
}
pub fn span(&self) -> tracing::Span {
@@ -3992,8 +4006,8 @@ pub(crate) mod harness {
let base_img = base_img.expect("Neon WAL redo requires base image").1;
let mut page = BytesMut::new();
page.extend_from_slice(&base_img);
for (_record_lsn, record) in records {
apply_neon::apply_in_neon(&record, key, &mut page)?;
for (record_lsn, record) in records {
apply_neon::apply_in_neon(&record, record_lsn, key, &mut page)?;
}
Ok(page.freeze())
} else {
@@ -4037,6 +4051,7 @@ mod tests {
use tests::storage_layer::ValuesReconstructState;
use tests::timeline::{GetVectoredError, ShutdownMode};
use utils::bin_ser::BeSer;
use utils::id::TenantId;
static TEST_KEY: Lazy<Key> =
Lazy::new(|| Key::from_slice(&hex!("010000000033333333444444445500000001")));
@@ -4936,7 +4951,13 @@ mod tests {
..TenantConf::default()
};
let harness = TenantHarness::create_custom("test_get_vectored_key_gap", tenant_conf)?;
let harness = TenantHarness::create_custom(
"test_get_vectored_key_gap",
tenant_conf,
TenantId::generate(),
ShardIdentity::unsharded(),
Generation::new(0xdeadbeef),
)?;
let (tenant, ctx) = harness.load().await;
let mut current_key = Key::from_hex("010000000033333333444444445500000000").unwrap();
@@ -6469,4 +6490,208 @@ mod tests {
Ok(())
}
async fn get_vectored_impl_wrapper(
tline: &Arc<Timeline>,
key: Key,
lsn: Lsn,
ctx: &RequestContext,
) -> Result<Option<Bytes>, GetVectoredError> {
let mut reconstruct_state = ValuesReconstructState::new();
let mut res = tline
.get_vectored_impl(
KeySpace::single(key..key.next()),
lsn,
&mut reconstruct_state,
ctx,
)
.await?;
Ok(res.pop_last().map(|(k, v)| {
assert_eq!(k, key);
v.unwrap()
}))
}
#[tokio::test]
async fn test_metadata_tombstone_reads() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_metadata_tombstone_reads")?;
let (tenant, ctx) = harness.load().await;
let key0 = Key::from_hex("620000000033333333444444445500000000").unwrap();
let key1 = Key::from_hex("620000000033333333444444445500000001").unwrap();
let key2 = Key::from_hex("620000000033333333444444445500000002").unwrap();
let key3 = Key::from_hex("620000000033333333444444445500000003").unwrap();
// We emulate the situation that the compaction algorithm creates an image layer that removes the tombstones
// Lsn 0x30 key0, key3, no key1+key2
// Lsn 0x20 key1+key2 tomestones
// Lsn 0x10 key1 in image, key2 in delta
let tline = tenant
.create_test_timeline_with_layers(
TIMELINE_ID,
Lsn(0x10),
DEFAULT_PG_VERSION,
&ctx,
// delta layers
vec![
vec![(key2, Lsn(0x10), Value::Image(test_img("metadata key 2")))],
vec![(key1, Lsn(0x20), Value::Image(Bytes::new()))],
vec![(key2, Lsn(0x20), Value::Image(Bytes::new()))],
],
// image layers
vec![
(Lsn(0x10), vec![(key1, test_img("metadata key 1"))]),
(
Lsn(0x30),
vec![
(key0, test_img("metadata key 0")),
(key3, test_img("metadata key 3")),
],
),
],
Lsn(0x30),
)
.await?;
let lsn = Lsn(0x30);
let old_lsn = Lsn(0x20);
assert_eq!(
get_vectored_impl_wrapper(&tline, key0, lsn, &ctx).await?,
Some(test_img("metadata key 0"))
);
assert_eq!(
get_vectored_impl_wrapper(&tline, key1, lsn, &ctx).await?,
None,
);
assert_eq!(
get_vectored_impl_wrapper(&tline, key2, lsn, &ctx).await?,
None,
);
assert_eq!(
get_vectored_impl_wrapper(&tline, key1, old_lsn, &ctx).await?,
Some(Bytes::new()),
);
assert_eq!(
get_vectored_impl_wrapper(&tline, key2, old_lsn, &ctx).await?,
Some(Bytes::new()),
);
assert_eq!(
get_vectored_impl_wrapper(&tline, key3, lsn, &ctx).await?,
Some(test_img("metadata key 3"))
);
Ok(())
}
#[tokio::test]
async fn test_metadata_tombstone_image_creation() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_metadata_tombstone_image_creation")?;
let (tenant, ctx) = harness.load().await;
let key0 = Key::from_hex("620000000033333333444444445500000000").unwrap();
let key1 = Key::from_hex("620000000033333333444444445500000001").unwrap();
let key2 = Key::from_hex("620000000033333333444444445500000002").unwrap();
let key3 = Key::from_hex("620000000033333333444444445500000003").unwrap();
let tline = tenant
.create_test_timeline_with_layers(
TIMELINE_ID,
Lsn(0x10),
DEFAULT_PG_VERSION,
&ctx,
// delta layers
vec![
vec![(key2, Lsn(0x10), Value::Image(test_img("metadata key 2")))],
vec![(key1, Lsn(0x20), Value::Image(Bytes::new()))],
vec![(key2, Lsn(0x20), Value::Image(Bytes::new()))],
vec![
(key0, Lsn(0x30), Value::Image(test_img("metadata key 0"))),
(key3, Lsn(0x30), Value::Image(test_img("metadata key 3"))),
],
],
// image layers
vec![(Lsn(0x10), vec![(key1, test_img("metadata key 1"))])],
Lsn(0x30),
)
.await?;
let cancel = CancellationToken::new();
tline
.compact(
&cancel,
{
let mut flags = EnumSet::new();
flags.insert(CompactFlags::ForceImageLayerCreation);
flags.insert(CompactFlags::ForceRepartition);
flags
},
&ctx,
)
.await?;
// Image layers are created at last_record_lsn
let images = tline
.inspect_image_layers(Lsn(0x30), &ctx)
.await?
.into_iter()
.filter(|(k, _)| k.is_metadata_key())
.collect::<Vec<_>>();
assert_eq!(images.len(), 2); // the image layer should only contain two existing keys, tombstones should be removed.
Ok(())
}
#[tokio::test]
async fn test_metadata_tombstone_empty_image_creation() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_metadata_tombstone_image_creation")?;
let (tenant, ctx) = harness.load().await;
let key1 = Key::from_hex("620000000033333333444444445500000001").unwrap();
let key2 = Key::from_hex("620000000033333333444444445500000002").unwrap();
let tline = tenant
.create_test_timeline_with_layers(
TIMELINE_ID,
Lsn(0x10),
DEFAULT_PG_VERSION,
&ctx,
// delta layers
vec![
vec![(key2, Lsn(0x10), Value::Image(test_img("metadata key 2")))],
vec![(key1, Lsn(0x20), Value::Image(Bytes::new()))],
vec![(key2, Lsn(0x20), Value::Image(Bytes::new()))],
],
// image layers
vec![(Lsn(0x10), vec![(key1, test_img("metadata key 1"))])],
Lsn(0x30),
)
.await?;
let cancel = CancellationToken::new();
tline
.compact(
&cancel,
{
let mut flags = EnumSet::new();
flags.insert(CompactFlags::ForceImageLayerCreation);
flags.insert(CompactFlags::ForceRepartition);
flags
},
&ctx,
)
.await?;
// Image layers are created at last_record_lsn
let images = tline
.inspect_image_layers(Lsn(0x30), &ctx)
.await?
.into_iter()
.filter(|(k, _)| k.is_metadata_key())
.collect::<Vec<_>>();
assert_eq!(images.len(), 0); // the image layer should not contain tombstones, or it is not created
Ok(())
}
}

View File

@@ -16,6 +16,7 @@ use crate::{
task_mgr::{self, TaskKind},
tenant::{
mgr::{TenantSlot, TenantsMapRemoveResult},
remote_timeline_client::remote_heatmap_path,
timeline::ShutdownMode,
},
};
@@ -531,6 +532,25 @@ impl DeleteTenantFlow {
}
}
// Remove top-level tenant objects that don't belong to a timeline, such as heatmap
let heatmap_path = remote_heatmap_path(&tenant.tenant_shard_id());
if let Some(Err(e)) = backoff::retry(
|| async {
remote_storage
.delete(&heatmap_path, &task_mgr::shutdown_token())
.await
},
TimeoutOrCancel::caused_by_cancel,
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"remove_remote_tenant_heatmap",
&task_mgr::shutdown_token(),
)
.await
{
tracing::warn!("Failed to delete heatmap at {heatmap_path}: {e}");
}
let timelines_path = conf.timelines_path(&tenant.tenant_shard_id);
// May not exist if we fail in cleanup_remaining_fs_traces after removing it
if timelines_path.exists() {

View File

@@ -267,7 +267,7 @@ impl<'de> Deserialize<'de> for TimelineMetadata {
D: serde::Deserializer<'de>,
{
let bytes = Vec::<u8>::deserialize(deserializer)?;
Self::from_bytes(bytes.as_slice()).map_err(|e| D::Error::custom(format!("{e}")))
Self::from_bytes(bytes.as_slice()).map_err(D::Error::custom)
}
}
@@ -276,13 +276,163 @@ impl Serialize for TimelineMetadata {
where
S: Serializer,
{
let bytes = self
.to_bytes()
.map_err(|e| serde::ser::Error::custom(format!("{e}")))?;
let bytes = self.to_bytes().map_err(serde::ser::Error::custom)?;
bytes.serialize(serializer)
}
}
pub(crate) mod modern_serde {
use crate::tenant::metadata::METADATA_FORMAT_VERSION;
use super::{
TimelineMetadata, TimelineMetadataBodyV2, TimelineMetadataHeader, METADATA_HDR_SIZE,
};
use serde::{Deserialize, Serialize};
pub(crate) fn deserialize<'de, D>(deserializer: D) -> Result<TimelineMetadata, D::Error>
where
D: serde::de::Deserializer<'de>,
{
// for legacy reasons versions 1-5 had TimelineMetadata serialized as a Vec<u8> field with
// BeSer.
struct Visitor;
impl<'d> serde::de::Visitor<'d> for Visitor {
type Value = TimelineMetadata;
fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
f.write_str("BeSer bytes or json structure")
}
fn visit_seq<A>(self, seq: A) -> Result<Self::Value, A::Error>
where
A: serde::de::SeqAccess<'d>,
{
use serde::de::Error;
let de = serde::de::value::SeqAccessDeserializer::new(seq);
Vec::<u8>::deserialize(de)
.map(|v| TimelineMetadata::from_bytes(&v).map_err(A::Error::custom))?
}
fn visit_map<A>(self, map: A) -> Result<Self::Value, A::Error>
where
A: serde::de::MapAccess<'d>,
{
use serde::de::Error;
let de = serde::de::value::MapAccessDeserializer::new(map);
let body = TimelineMetadataBodyV2::deserialize(de)?;
// jump through hoops to calculate the crc32 so that TimelineMetadata::ne works
// across serialization versions
let mut sink = Crc32Sink::default();
<TimelineMetadataBodyV2 as utils::bin_ser::BeSer>::ser_into(&body, &mut sink)
.map_err(|e| A::Error::custom(Crc32CalculationFailed(e)))?;
let size = METADATA_HDR_SIZE + sink.count;
Ok(TimelineMetadata {
hdr: TimelineMetadataHeader {
checksum: sink.crc,
size: size as u16,
format_version: METADATA_FORMAT_VERSION,
},
body,
})
}
}
deserializer.deserialize_any(Visitor)
}
#[derive(Default)]
struct Crc32Sink {
crc: u32,
count: usize,
}
impl std::io::Write for Crc32Sink {
fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
self.crc = crc32c::crc32c_append(self.crc, buf);
self.count += buf.len();
Ok(buf.len())
}
fn flush(&mut self) -> std::io::Result<()> {
Ok(())
}
}
#[derive(thiserror::Error)]
#[error("re-serializing for crc32 failed")]
struct Crc32CalculationFailed<E>(#[source] E);
// this should be true for one release, after that we can change it to false
// remember to check the IndexPart::metadata field TODO comment as well
const LEGACY_BINCODED_BYTES: bool = true;
#[derive(serde::Serialize)]
#[serde(transparent)]
struct LegacyPaddedBytes<'a>(&'a TimelineMetadata);
struct JustTheBodyV2<'a>(&'a TimelineMetadata);
impl serde::Serialize for JustTheBodyV2<'_> {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
// header is not needed, upon reading we've upgraded all v1 to v2
self.0.body.serialize(serializer)
}
}
pub(crate) fn serialize<S>(
metadata: &TimelineMetadata,
serializer: S,
) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
// we cannot use TimelineMetadata::serialize for now because it'll do
// TimelineMetadata::to_bytes
if LEGACY_BINCODED_BYTES {
LegacyPaddedBytes(metadata).serialize(serializer)
} else {
JustTheBodyV2(metadata).serialize(serializer)
}
}
#[test]
fn deserializes_bytes_as_well_as_equivalent_body_v2() {
#[derive(serde::Deserialize, serde::Serialize)]
struct Wrapper(#[serde(deserialize_with = "deserialize")] TimelineMetadata);
let too_many_bytes = "[216,111,252,208,0,54,0,4,0,0,0,0,1,73,253,144,1,0,0,0,0,1,73,253,24,0,0,0,0,0,0,0,0,0,0,0,0,0,1,73,253,24,0,0,0,0,1,73,253,24,0,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]";
let wrapper_from_bytes = serde_json::from_str::<Wrapper>(too_many_bytes).unwrap();
let serialized = serde_json::to_value(JustTheBodyV2(&wrapper_from_bytes.0)).unwrap();
assert_eq!(
serialized,
serde_json::json! {{
"disk_consistent_lsn": "0/149FD90",
"prev_record_lsn": "0/149FD18",
"ancestor_timeline": null,
"ancestor_lsn": "0/0",
"latest_gc_cutoff_lsn": "0/149FD18",
"initdb_lsn": "0/149FD18",
"pg_version": 15
}}
);
let wrapper_from_json = serde_json::value::from_value::<Wrapper>(serialized).unwrap();
assert_eq!(wrapper_from_bytes.0, wrapper_from_json.0);
}
}
/// Parts of the metadata which are regularly modified.
pub(crate) struct MetadataUpdate {
disk_consistent_lsn: Lsn,

View File

@@ -3,6 +3,7 @@
use camino::{Utf8DirEntry, Utf8Path, Utf8PathBuf};
use futures::StreamExt;
use hyper::StatusCode;
use itertools::Itertools;
use pageserver_api::key::Key;
use pageserver_api::models::LocationConfigMode;
@@ -54,6 +55,7 @@ use utils::generation::Generation;
use utils::id::{TenantId, TimelineId};
use super::delete::DeleteTenantError;
use super::remote_timeline_client::remote_tenant_path;
use super::secondary::SecondaryTenant;
use super::timeline::detach_ancestor::PreparedTimelineDetach;
use super::TenantSharedResources;
@@ -1369,7 +1371,7 @@ impl TenantManager {
&self,
tenant_shard_id: TenantShardId,
activation_timeout: Duration,
) -> Result<(), DeleteTenantError> {
) -> Result<StatusCode, DeleteTenantError> {
super::span::debug_assert_current_span_has_tenant_id();
// We acquire a SlotGuard during this function to protect against concurrent
// changes while the ::prepare phase of DeleteTenantFlow executes, but then
@@ -1382,18 +1384,79 @@ impl TenantManager {
//
// See https://github.com/neondatabase/neon/issues/5080
let slot_guard =
tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::MustExist)?;
// Tenant deletion can happen two ways:
// - Legacy: called on an attached location. The attached Tenant object stays alive in Stopping
// state until deletion is complete.
// - New: called on a pageserver without an attached location. We proceed with deletion from
// remote storage.
//
// See https://github.com/neondatabase/neon/issues/5080 for more context on this transition.
// unwrap is safe because we used MustExist mode when acquiring
let tenant = match slot_guard.get_old_value().as_ref().unwrap() {
TenantSlot::Attached(tenant) => tenant.clone(),
_ => {
// Express "not attached" as equivalent to "not found"
return Err(DeleteTenantError::NotAttached);
let slot_guard = tenant_map_acquire_slot(&tenant_shard_id, TenantSlotAcquireMode::Any)?;
match &slot_guard.old_value {
Some(TenantSlot::Attached(tenant)) => {
// Legacy deletion flow: the tenant remains attached, goes to Stopping state, and
// deletion will be resumed across restarts.
let tenant = tenant.clone();
return self
.delete_tenant_attached(slot_guard, tenant, activation_timeout)
.await;
}
Some(TenantSlot::Secondary(secondary_tenant)) => {
secondary_tenant.shutdown().await;
let local_tenant_directory = self.conf.tenant_path(&tenant_shard_id);
let tmp_dir = safe_rename_tenant_dir(&local_tenant_directory)
.await
.with_context(|| {
format!("local tenant directory {local_tenant_directory:?} rename")
})?;
spawn_background_purge(tmp_dir);
}
Some(TenantSlot::InProgress(_)) => unreachable!(),
None => {}
};
// Fall through: local state for this tenant is no longer present, proceed with remote delete
let remote_path = remote_tenant_path(&tenant_shard_id);
let keys = match self
.resources
.remote_storage
.list(
Some(&remote_path),
remote_storage::ListingMode::NoDelimiter,
None,
&self.cancel,
)
.await
{
Ok(listing) => listing.keys,
Err(remote_storage::DownloadError::Cancelled) => {
return Err(DeleteTenantError::Cancelled)
}
Err(remote_storage::DownloadError::NotFound) => return Ok(StatusCode::NOT_FOUND),
Err(other) => return Err(DeleteTenantError::Other(anyhow::anyhow!(other))),
};
if keys.is_empty() {
tracing::info!("Remote storage already deleted");
} else {
tracing::info!("Deleting {} keys from remote storage", keys.len());
self.resources
.remote_storage
.delete_objects(&keys, &self.cancel)
.await?;
}
// Callers use 404 as success for deletions, for historical reasons.
Ok(StatusCode::NOT_FOUND)
}
async fn delete_tenant_attached(
&self,
slot_guard: SlotGuard,
tenant: Arc<Tenant>,
activation_timeout: Duration,
) -> Result<StatusCode, DeleteTenantError> {
match tenant.current_state() {
TenantState::Broken { .. } | TenantState::Stopping { .. } => {
// If deletion is already in progress, return success (the semantics of this
@@ -1403,7 +1466,7 @@ impl TenantManager {
// The `delete_progress` lock is held: deletion is already happening
// in the bacckground
slot_guard.revert();
return Ok(());
return Ok(StatusCode::ACCEPTED);
}
}
_ => {
@@ -1436,7 +1499,8 @@ impl TenantManager {
// The Tenant goes back into the map in Stopping state, it will eventually be removed by DeleteTenantFLow
slot_guard.revert();
result
let () = result?;
Ok(StatusCode::ACCEPTED)
}
#[instrument(skip_all, fields(tenant_id=%tenant.get_tenant_shard_id().tenant_id, shard_id=%tenant.get_tenant_shard_id().shard_slug(), new_shard_count=%new_shard_count.literal()))]

View File

@@ -91,8 +91,7 @@
//!
//! The *actual* remote state lags behind the *desired* remote state while
//! there are in-flight operations.
//! We keep track of the desired remote state in
//! [`UploadQueueInitialized::latest_files`] and [`UploadQueueInitialized::latest_metadata`].
//! We keep track of the desired remote state in [`UploadQueueInitialized::dirty`].
//! It is initialized based on the [`IndexPart`] that was passed during init
//! and updated with every `schedule_*` function call.
//! All this is necessary necessary to compute the future [`IndexPart`]s
@@ -115,8 +114,7 @@
//!
//! # Completion
//!
//! Once an operation has completed, we update
//! [`UploadQueueInitialized::projected_remote_consistent_lsn`] immediately,
//! Once an operation has completed, we update [`UploadQueueInitialized::clean`] immediately,
//! and submit a request through the DeletionQueue to update
//! [`UploadQueueInitialized::visible_remote_consistent_lsn`] after it has
//! validated that our generation is not stale. It is this visible value
@@ -416,6 +414,7 @@ impl RemoteTimelineClient {
Ok(())
}
/// Returns `None` if nothing is yet uplodaded, `Some(disk_consistent_lsn)` otherwise.
pub fn remote_consistent_lsn_projected(&self) -> Option<Lsn> {
match &mut *self.upload_queue.lock().unwrap() {
UploadQueue::Uninitialized => None,
@@ -442,13 +441,11 @@ impl RemoteTimelineClient {
/// Returns true if this timeline was previously detached at this Lsn and the remote timeline
/// client is currently initialized.
pub(crate) fn is_previous_ancestor_lsn(&self, lsn: Lsn) -> bool {
// technically this is a dirty read, but given how timeline detach ancestor is implemented
// via tenant restart, the lineage has always been uploaded.
self.upload_queue
.lock()
.unwrap()
.initialized_mut()
.map(|uq| uq.latest_lineage.is_previous_ancestor_lsn(lsn))
.map(|uq| uq.clean.0.lineage.is_previous_ancestor_lsn(lsn))
.unwrap_or(false)
}
@@ -457,7 +454,6 @@ impl RemoteTimelineClient {
current_remote_index_part
.layer_metadata
.values()
// If we don't have the file size for the layer, don't account for it in the metric.
.map(|ilmd| ilmd.file_size)
.sum()
} else {
@@ -585,9 +581,9 @@ impl RemoteTimelineClient {
// As documented in the struct definition, it's ok for latest_metadata to be
// ahead of what's _actually_ on the remote during index upload.
upload_queue.latest_metadata = metadata.clone();
upload_queue.dirty.metadata = metadata.clone();
self.schedule_index_upload(upload_queue);
self.schedule_index_upload(upload_queue)?;
Ok(())
}
@@ -606,9 +602,9 @@ impl RemoteTimelineClient {
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut()?;
upload_queue.latest_metadata.apply(update);
upload_queue.dirty.metadata.apply(update);
self.schedule_index_upload(upload_queue);
self.schedule_index_upload(upload_queue)?;
Ok(())
}
@@ -620,8 +616,8 @@ impl RemoteTimelineClient {
) -> anyhow::Result<()> {
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut()?;
upload_queue.last_aux_file_policy = last_aux_file_policy;
self.schedule_index_upload(upload_queue);
upload_queue.dirty.last_aux_file_policy = last_aux_file_policy;
self.schedule_index_upload(upload_queue)?;
Ok(())
}
///
@@ -639,30 +635,44 @@ impl RemoteTimelineClient {
let upload_queue = guard.initialized_mut()?;
if upload_queue.latest_files_changes_since_metadata_upload_scheduled > 0 {
self.schedule_index_upload(upload_queue);
self.schedule_index_upload(upload_queue)?;
}
Ok(())
}
/// Launch an index-file upload operation in the background (internal function)
fn schedule_index_upload(self: &Arc<Self>, upload_queue: &mut UploadQueueInitialized) {
let disk_consistent_lsn = upload_queue.latest_metadata.disk_consistent_lsn();
fn schedule_index_upload(
self: &Arc<Self>,
upload_queue: &mut UploadQueueInitialized,
) -> anyhow::Result<()> {
let disk_consistent_lsn = upload_queue.dirty.metadata.disk_consistent_lsn();
// fix up the duplicated field
upload_queue.dirty.disk_consistent_lsn = disk_consistent_lsn;
// make sure it serializes before doing it in perform_upload_task so that it doesn't
// look like a retryable error
let void = std::io::sink();
serde_json::to_writer(void, &upload_queue.dirty).context("serialize index_part.json")?;
let index_part = &upload_queue.dirty;
info!(
"scheduling metadata upload up to consistent LSN {disk_consistent_lsn} with {} files ({} changed)",
upload_queue.latest_files.len(),
index_part.layer_metadata.len(),
upload_queue.latest_files_changes_since_metadata_upload_scheduled,
);
let index_part = IndexPart::from(&*upload_queue);
let op = UploadOp::UploadMetadata(Box::new(index_part), disk_consistent_lsn);
let op = UploadOp::UploadMetadata {
uploaded: Box::new(index_part.clone()),
};
self.metric_begin(&op);
upload_queue.queued_operations.push_back(op);
upload_queue.latest_files_changes_since_metadata_upload_scheduled = 0;
// Launch the task immediately, if possible
self.launch_queued_tasks(upload_queue);
Ok(())
}
pub(crate) async fn schedule_reparenting_and_wait(
@@ -675,16 +685,16 @@ impl RemoteTimelineClient {
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut()?;
let Some(prev) = upload_queue.latest_metadata.ancestor_timeline() else {
let Some(prev) = upload_queue.dirty.metadata.ancestor_timeline() else {
return Err(anyhow::anyhow!(
"cannot reparent without a current ancestor"
));
};
upload_queue.latest_metadata.reparent(new_parent);
upload_queue.latest_lineage.record_previous_ancestor(&prev);
upload_queue.dirty.metadata.reparent(new_parent);
upload_queue.dirty.lineage.record_previous_ancestor(&prev);
self.schedule_index_upload(upload_queue);
self.schedule_index_upload(upload_queue)?;
self.schedule_barrier0(upload_queue)
};
@@ -705,16 +715,17 @@ impl RemoteTimelineClient {
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut()?;
upload_queue.latest_metadata.detach_from_ancestor(&adopted);
upload_queue.latest_lineage.record_detaching(&adopted);
upload_queue.dirty.metadata.detach_from_ancestor(&adopted);
upload_queue.dirty.lineage.record_detaching(&adopted);
for layer in layers {
upload_queue
.latest_files
.dirty
.layer_metadata
.insert(layer.layer_desc().layer_name(), layer.metadata());
}
self.schedule_index_upload(upload_queue);
self.schedule_index_upload(upload_queue)?;
let barrier = self.schedule_barrier0(upload_queue);
self.launch_queued_tasks(upload_queue);
@@ -746,7 +757,8 @@ impl RemoteTimelineClient {
let metadata = layer.metadata();
upload_queue
.latest_files
.dirty
.layer_metadata
.insert(layer.layer_desc().layer_name(), metadata.clone());
upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1;
@@ -776,8 +788,8 @@ impl RemoteTimelineClient {
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut()?;
let with_metadata =
self.schedule_unlinking_of_layers_from_index_part0(upload_queue, names.iter().cloned());
let with_metadata = self
.schedule_unlinking_of_layers_from_index_part0(upload_queue, names.iter().cloned())?;
self.schedule_deletion_of_unlinked0(upload_queue, with_metadata);
@@ -801,7 +813,7 @@ impl RemoteTimelineClient {
let names = gc_layers.iter().map(|x| x.layer_desc().layer_name());
self.schedule_unlinking_of_layers_from_index_part0(upload_queue, names);
self.schedule_unlinking_of_layers_from_index_part0(upload_queue, names)?;
self.launch_queued_tasks(upload_queue);
@@ -814,7 +826,7 @@ impl RemoteTimelineClient {
self: &Arc<Self>,
upload_queue: &mut UploadQueueInitialized,
names: I,
) -> Vec<(LayerName, LayerFileMetadata)>
) -> anyhow::Result<Vec<(LayerName, LayerFileMetadata)>>
where
I: IntoIterator<Item = LayerName>,
{
@@ -824,7 +836,7 @@ impl RemoteTimelineClient {
let with_metadata: Vec<_> = names
.into_iter()
.filter_map(|name| {
let meta = upload_queue.latest_files.remove(&name);
let meta = upload_queue.dirty.layer_metadata.remove(&name);
if let Some(meta) = meta {
upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1;
@@ -856,10 +868,10 @@ impl RemoteTimelineClient {
// index_part update, because that needs to be uploaded before we can actually delete the
// files.
if upload_queue.latest_files_changes_since_metadata_upload_scheduled > 0 {
self.schedule_index_upload(upload_queue);
self.schedule_index_upload(upload_queue)?;
}
with_metadata
Ok(with_metadata)
}
/// Schedules deletion for layer files which have previously been unlinked from the
@@ -950,7 +962,7 @@ impl RemoteTimelineClient {
let names = compacted_from.iter().map(|x| x.layer_desc().layer_name());
self.schedule_unlinking_of_layers_from_index_part0(upload_queue, names);
self.schedule_unlinking_of_layers_from_index_part0(upload_queue, names)?;
self.launch_queued_tasks(upload_queue);
Ok(())
@@ -1085,7 +1097,7 @@ impl RemoteTimelineClient {
let deleted_at = Utc::now().naive_utc();
stopped.deleted_at = SetDeletedFlagProgress::InProgress(deleted_at);
let mut index_part = IndexPart::from(&stopped.upload_queue_for_deletion);
let mut index_part = stopped.upload_queue_for_deletion.dirty.clone();
index_part.deleted_at = Some(deleted_at);
index_part
};
@@ -1296,7 +1308,8 @@ impl RemoteTimelineClient {
stopped
.upload_queue_for_deletion
.latest_files
.dirty
.layer_metadata
.drain()
.map(|(file_name, meta)| {
remote_layer_path(
@@ -1433,7 +1446,7 @@ impl RemoteTimelineClient {
// Can always be scheduled.
true
}
UploadOp::UploadMetadata(_, _) => {
UploadOp::UploadMetadata { .. } => {
// These can only be performed after all the preceding operations
// have finished.
upload_queue.inprogress_tasks.is_empty()
@@ -1475,7 +1488,7 @@ impl RemoteTimelineClient {
UploadOp::UploadLayer(_, _) => {
upload_queue.num_inprogress_layer_uploads += 1;
}
UploadOp::UploadMetadata(_, _) => {
UploadOp::UploadMetadata { .. } => {
upload_queue.num_inprogress_metadata_uploads += 1;
}
UploadOp::Delete(_) => {
@@ -1584,22 +1597,13 @@ impl RemoteTimelineClient {
)
.await
}
UploadOp::UploadMetadata(ref index_part, _lsn) => {
let mention_having_future_layers = if cfg!(feature = "testing") {
index_part
.layer_metadata
.keys()
.any(|x| x.is_in_future(*_lsn))
} else {
false
};
UploadOp::UploadMetadata { ref uploaded } => {
let res = upload::upload_index_part(
&self.storage_impl,
&self.tenant_shard_id,
&self.timeline_id,
self.generation,
index_part,
uploaded,
&self.cancel,
)
.measure_remote_op(
@@ -1609,10 +1613,21 @@ impl RemoteTimelineClient {
)
.await;
if res.is_ok() {
self.update_remote_physical_size_gauge(Some(index_part));
self.update_remote_physical_size_gauge(Some(uploaded));
let mention_having_future_layers = if cfg!(feature = "testing") {
uploaded
.layer_metadata
.keys()
.any(|x| x.is_in_future(uploaded.metadata.disk_consistent_lsn()))
} else {
false
};
if mention_having_future_layers {
// find rationale near crate::tenant::timeline::init::cleanup_future_layer
tracing::info!(disk_consistent_lsn=%_lsn, "uploaded an index_part.json with future layers -- this is ok! if shutdown now, expect future layer cleanup");
tracing::info!(
disk_consistent_lsn = %uploaded.metadata.disk_consistent_lsn(),
"uploaded an index_part.json with future layers -- this is ok! if shutdown now, expect future layer cleanup"
);
}
}
res
@@ -1713,11 +1728,23 @@ impl RemoteTimelineClient {
upload_queue.num_inprogress_layer_uploads -= 1;
None
}
UploadOp::UploadMetadata(_, lsn) => {
UploadOp::UploadMetadata { ref uploaded } => {
upload_queue.num_inprogress_metadata_uploads -= 1;
// XXX monotonicity check?
upload_queue.projected_remote_consistent_lsn = Some(lsn);
// the task id is reused as a monotonicity check for storing the "clean"
// IndexPart.
let last_updater = upload_queue.clean.1;
let is_later = last_updater.is_some_and(|task_id| task_id < task.task_id);
let monotone = is_later || last_updater.is_none();
assert!(monotone, "no two index uploads should be completing at the same time, prev={last_updater:?}, task.task_id={}", task.task_id);
// not taking ownership is wasteful
upload_queue.clean.0.clone_from(uploaded);
upload_queue.clean.1 = Some(task.task_id);
let lsn = upload_queue.clean.0.metadata.disk_consistent_lsn();
if self.generation.is_none() {
// Legacy mode: skip validating generation
upload_queue.visible_remote_consistent_lsn.store(lsn);
@@ -1771,7 +1798,7 @@ impl RemoteTimelineClient {
RemoteOpKind::Upload,
RemoteTimelineClientMetricsCallTrackSize::Bytes(m.file_size),
),
UploadOp::UploadMetadata(_, _) => (
UploadOp::UploadMetadata { .. } => (
RemoteOpFileKind::Index,
RemoteOpKind::Upload,
DontTrackSize {
@@ -1847,11 +1874,9 @@ impl RemoteTimelineClient {
// Deletion is not really perf sensitive so there shouldnt be any problems with cloning a fraction of it.
let upload_queue_for_deletion = UploadQueueInitialized {
task_counter: 0,
latest_files: initialized.latest_files.clone(),
dirty: initialized.dirty.clone(),
clean: initialized.clean.clone(),
latest_files_changes_since_metadata_upload_scheduled: 0,
latest_metadata: initialized.latest_metadata.clone(),
latest_lineage: initialized.latest_lineage.clone(),
projected_remote_consistent_lsn: None,
visible_remote_consistent_lsn: initialized
.visible_remote_consistent_lsn
.clone(),
@@ -1864,7 +1889,6 @@ impl RemoteTimelineClient {
dangling_files: HashMap::default(),
shutting_down: false,
shutdown_ready: Arc::new(tokio::sync::Semaphore::new(0)),
last_aux_file_policy: initialized.last_aux_file_policy,
};
let upload_queue = std::mem::replace(

View File

@@ -28,6 +28,7 @@ use crate::TEMP_FILE_SUFFIX;
use remote_storage::{DownloadError, GenericRemoteStorage, ListingMode, RemotePath};
use utils::crashsafe::path_with_suffix_extension;
use utils::id::{TenantId, TimelineId};
use utils::pausable_failpoint;
use super::index::{IndexPart, LayerFileMetadata};
use super::{
@@ -152,6 +153,8 @@ async fn download_object<'a>(
let download = storage.download(src_path, cancel).await?;
pausable_failpoint!("before-downloading-layer-stream-pausable");
let mut buf_writer =
tokio::io::BufWriter::with_capacity(super::BUFFER_SIZE, destination_file);
@@ -199,6 +202,8 @@ async fn download_object<'a>(
let mut download = storage.download(src_path, cancel).await?;
pausable_failpoint!("before-downloading-layer-stream-pausable");
// TODO: use vectored write (writev) once supported by tokio-epoll-uring.
// There's chunks_vectored() on the stream.
let (bytes_amount, destination_file) = async {

View File

@@ -11,7 +11,6 @@ use utils::id::TimelineId;
use crate::tenant::metadata::TimelineMetadata;
use crate::tenant::storage_layer::LayerName;
use crate::tenant::upload_queue::UploadQueueInitialized;
use crate::tenant::Generation;
use pageserver_api::shard::ShardIndex;
@@ -42,9 +41,13 @@ pub struct IndexPart {
// 'disk_consistent_lsn' is a copy of the 'disk_consistent_lsn' in the metadata.
// It's duplicated for convenience when reading the serialized structure, but is
// private because internally we would read from metadata instead.
disk_consistent_lsn: Lsn,
pub(super) disk_consistent_lsn: Lsn,
#[serde(rename = "metadata_bytes")]
// TODO: later make this "rename" to "alias", rename field as "legacy_metadata"
#[serde(
rename = "metadata_bytes",
with = "crate::tenant::metadata::modern_serde"
)]
pub metadata: TimelineMetadata,
#[serde(default)]
@@ -80,23 +83,15 @@ impl IndexPart {
pub const FILE_NAME: &'static str = "index_part.json";
fn new(
layers_and_metadata: &HashMap<LayerName, LayerFileMetadata>,
disk_consistent_lsn: Lsn,
metadata: TimelineMetadata,
lineage: Lineage,
last_aux_file_policy: Option<AuxFilePolicy>,
) -> Self {
let layer_metadata = layers_and_metadata.clone();
Self {
pub(crate) fn empty(metadata: TimelineMetadata) -> Self {
IndexPart {
version: Self::LATEST_VERSION,
layer_metadata,
disk_consistent_lsn,
layer_metadata: Default::default(),
disk_consistent_lsn: metadata.disk_consistent_lsn(),
metadata,
deleted_at: None,
lineage,
last_aux_file_policy,
lineage: Default::default(),
last_aux_file_policy: None,
}
}
@@ -106,7 +101,7 @@ impl IndexPart {
/// If you want this under normal operations, read it from self.metadata:
/// this method is just for the scrubber to use when validating an index.
pub fn get_disk_consistent_lsn(&self) -> Lsn {
pub fn duplicated_disk_consistent_lsn(&self) -> Lsn {
self.disk_consistent_lsn
}
@@ -120,14 +115,7 @@ impl IndexPart {
#[cfg(test)]
pub(crate) fn example() -> Self {
let example_metadata = TimelineMetadata::example();
Self::new(
&HashMap::new(),
example_metadata.disk_consistent_lsn(),
example_metadata,
Default::default(),
Some(AuxFilePolicy::V1),
)
Self::empty(TimelineMetadata::example())
}
pub(crate) fn last_aux_file_policy(&self) -> Option<AuxFilePolicy> {
@@ -135,22 +123,6 @@ impl IndexPart {
}
}
impl From<&UploadQueueInitialized> for IndexPart {
fn from(uq: &UploadQueueInitialized) -> Self {
let disk_consistent_lsn = uq.latest_metadata.disk_consistent_lsn();
let metadata = uq.latest_metadata.clone();
let lineage = uq.latest_lineage.clone();
Self::new(
&uq.latest_files,
disk_consistent_lsn,
metadata,
lineage,
uq.last_aux_file_policy,
)
}
}
/// Metadata gathered for each of the layer files.
///
/// Fields have to be `Option`s because remote [`IndexPart`]'s can be from different version, which
@@ -236,11 +208,10 @@ impl Lineage {
/// The queried lsn is most likely the basebackup lsn, and this answers question "is it allowed
/// to start a read/write primary at this lsn".
///
/// Returns true if the Lsn was previously a branch point.
/// Returns true if the Lsn was previously our branch point.
pub(crate) fn is_previous_ancestor_lsn(&self, lsn: Lsn) -> bool {
self.original_ancestor
.as_ref()
.is_some_and(|(_, ancestor_lsn, _)| lsn == *ancestor_lsn)
.is_some_and(|(_, ancestor_lsn, _)| ancestor_lsn == lsn)
}
}

View File

@@ -1,6 +1,7 @@
//! Helper functions to upload files to remote storage with a RemoteStorage
use anyhow::{bail, Context};
use bytes::Bytes;
use camino::Utf8Path;
use fail::fail_point;
use pageserver_api::shard::TenantShardId;
@@ -11,10 +12,10 @@ use tokio::io::AsyncSeekExt;
use tokio_util::sync::CancellationToken;
use utils::{backoff, pausable_failpoint};
use super::index::IndexPart;
use super::Generation;
use crate::tenant::remote_timeline_client::{
index::IndexPart, remote_index_path, remote_initdb_archive_path,
remote_initdb_preserved_archive_path,
remote_index_path, remote_initdb_archive_path, remote_initdb_preserved_archive_path,
};
use remote_storage::{GenericRemoteStorage, RemotePath, TimeTravelError};
use utils::id::{TenantId, TimelineId};
@@ -27,7 +28,7 @@ pub(crate) async fn upload_index_part<'a>(
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
generation: Generation,
index_part: &'a IndexPart,
index_part: &IndexPart,
cancel: &CancellationToken,
) -> anyhow::Result<()> {
tracing::trace!("uploading new index part");
@@ -37,16 +38,16 @@ pub(crate) async fn upload_index_part<'a>(
});
pausable_failpoint!("before-upload-index-pausable");
let index_part_bytes = index_part
.to_s3_bytes()
.context("serialize index part file into bytes")?;
let index_part_size = index_part_bytes.len();
let index_part_bytes = bytes::Bytes::from(index_part_bytes);
// FIXME: this error comes too late
let serialized = index_part.to_s3_bytes()?;
let serialized = Bytes::from(serialized);
let index_part_size = serialized.len();
let remote_path = remote_index_path(tenant_shard_id, timeline_id, generation);
storage
.upload_storage_object(
futures::stream::once(futures::future::ready(Ok(index_part_bytes))),
futures::stream::once(futures::future::ready(Ok(serialized))),
index_part_size,
&remote_path,
cancel,

View File

@@ -1000,7 +1000,7 @@ impl<'a> TenantDownloader<'a> {
layer.name,
layer.metadata.file_size
);
let downloaded_bytes = match download_layer_file(
let downloaded_bytes = download_layer_file(
self.conf,
self.remote_storage,
*tenant_shard_id,
@@ -1011,8 +1011,9 @@ impl<'a> TenantDownloader<'a> {
&self.secondary_state.cancel,
ctx,
)
.await
{
.await;
let downloaded_bytes = match downloaded_bytes {
Ok(bytes) => bytes,
Err(DownloadError::NotFound) => {
// A heatmap might be out of date and refer to a layer that doesn't exist any more.

View File

@@ -334,8 +334,11 @@ where
let tenant_shard_id = job.get_tenant_shard_id();
let barrier = if let Some(barrier) = self.get_running(tenant_shard_id) {
tracing::info!(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug(),
"Command already running, waiting for it");
tracing::info!(
tenant_id=%tenant_shard_id.tenant_id,
shard_id=%tenant_shard_id.shard_slug(),
"Command already running, waiting for it"
);
barrier
} else {
let running = self.spawn_now(job);

View File

@@ -478,6 +478,23 @@ impl DeltaLayerWriterInner {
key_end: Key,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<ResidentLayer> {
let temp_path = self.path.clone();
let result = self.finish0(key_end, timeline, ctx).await;
if result.is_err() {
tracing::info!(%temp_path, "cleaning up temporary file after error during writing");
if let Err(e) = std::fs::remove_file(&temp_path) {
tracing::warn!(error=%e, %temp_path, "error cleaning up temporary layer file after error during writing");
}
}
result
}
async fn finish0(
self,
key_end: Key,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<ResidentLayer> {
let index_start_blk =
((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;
@@ -651,19 +668,11 @@ impl DeltaLayerWriter {
timeline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<ResidentLayer> {
let inner = self.inner.take().unwrap();
let temp_path = inner.path.clone();
let result = inner.finish(key_end, timeline, ctx).await;
// The delta layer files can sometimes be really large. Clean them up.
if result.is_err() {
tracing::warn!(
"Cleaning up temporary delta file {temp_path} after error during writing"
);
if let Err(e) = std::fs::remove_file(&temp_path) {
tracing::warn!("Error cleaning up temporary delta layer file {temp_path}: {e:?}")
}
}
result
self.inner
.take()
.unwrap()
.finish(key_end, timeline, ctx)
.await
}
}

View File

@@ -917,26 +917,57 @@ impl Drop for ImageLayerWriter {
#[cfg(test)]
mod test {
use std::time::Duration;
use bytes::Bytes;
use pageserver_api::{
key::Key,
shard::{ShardCount, ShardIdentity, ShardNumber, ShardStripeSize},
};
use utils::{id::TimelineId, lsn::Lsn};
use utils::{
generation::Generation,
id::{TenantId, TimelineId},
lsn::Lsn,
};
use crate::{tenant::harness::TenantHarness, DEFAULT_PG_VERSION};
use crate::{
tenant::{config::TenantConf, harness::TenantHarness},
DEFAULT_PG_VERSION,
};
use super::ImageLayerWriter;
#[tokio::test]
async fn image_layer_rewrite() {
let harness = TenantHarness::create("test_image_layer_rewrite").unwrap();
let (tenant, ctx) = harness.load().await;
let tenant_conf = TenantConf {
gc_period: Duration::ZERO,
compaction_period: Duration::ZERO,
..TenantConf::default()
};
let tenant_id = TenantId::generate();
let mut gen = Generation::new(0xdead0001);
let mut get_next_gen = || {
let ret = gen;
gen = gen.next();
ret
};
// The LSN at which we will create an image layer to filter
let lsn = Lsn(0xdeadbeef0000);
let timeline_id = TimelineId::generate();
//
// Create an unsharded parent with a layer.
//
let harness = TenantHarness::create_custom(
"test_image_layer_rewrite--parent",
tenant_conf.clone(),
tenant_id,
ShardIdentity::unsharded(),
get_next_gen(),
)
.unwrap();
let (tenant, ctx) = harness.load().await;
let timeline = tenant
.create_test_timeline(timeline_id, lsn, DEFAULT_PG_VERSION, &ctx)
.await
@@ -971,9 +1002,47 @@ mod test {
};
let original_size = resident.metadata().file_size;
//
// Create child shards and do the rewrite, exercising filter().
// TODO: abstraction in TenantHarness for splits.
//
// Filter for various shards: this exercises cases like values at start of key range, end of key
// range, middle of key range.
for shard_number in 0..4 {
let shard_count = ShardCount::new(4);
for shard_number in 0..shard_count.count() {
//
// mimic the shard split
//
let shard_identity = ShardIdentity::new(
ShardNumber(shard_number),
shard_count,
ShardStripeSize(0x8000),
)
.unwrap();
let harness = TenantHarness::create_custom(
Box::leak(Box::new(format!(
"test_image_layer_rewrite--child{}",
shard_identity.shard_slug()
))),
tenant_conf.clone(),
tenant_id,
shard_identity,
// NB: in reality, the shards would each fork off their own gen number sequence from the parent.
// But here, all we care about is that the gen number is unique.
get_next_gen(),
)
.unwrap();
let (tenant, ctx) = harness.load().await;
let timeline = tenant
.create_test_timeline(timeline_id, lsn, DEFAULT_PG_VERSION, &ctx)
.await
.unwrap();
//
// use filter() and make assertions
//
let mut filtered_writer = ImageLayerWriter::new(
harness.conf,
timeline_id,
@@ -985,15 +1054,6 @@ mod test {
.await
.unwrap();
// TenantHarness gave us an unsharded tenant, but we'll use a sharded ShardIdentity
// to exercise filter()
let shard_identity = ShardIdentity::new(
ShardNumber(shard_number),
ShardCount::new(4),
ShardStripeSize(0x8000),
)
.unwrap();
let wrote_keys = resident
.filter(&shard_identity, &mut filtered_writer, &ctx)
.await

View File

@@ -277,9 +277,10 @@ impl Layer {
let downloaded = resident.expect("just initialized");
// if the rename works, the path is as expected
// TODO: sync system call
std::fs::rename(temp_path, owner.local_path())
// We never want to overwrite an existing file, so we use `RENAME_NOREPLACE`.
// TODO: this leaves the temp file in place if the rename fails, risking us running
// out of space. Should we clean it up here or does the calling context deal with this?
utils::fs_ext::rename_noreplace(temp_path.as_std_path(), owner.local_path().as_std_path())
.with_context(|| format!("rename temporary file as correct path for {owner}"))?;
Ok(ResidentLayer { downloaded, owner })

View File

@@ -815,6 +815,7 @@ async fn eviction_cancellation_on_drop() {
/// A test case to remind you the cost of these structures. You can bump the size limit
/// below if it is really necessary to add more fields to the structures.
#[test]
#[cfg(target_arch = "x86_64")]
fn layer_size() {
assert_eq!(std::mem::size_of::<LayerAccessStats>(), 2040);
assert_eq!(std::mem::size_of::<PersistentLayerDesc>(), 104);

View File

@@ -102,7 +102,6 @@ use crate::metrics::{
};
use crate::pgdatadir_mapping::CalculateLogicalSizeError;
use crate::tenant::config::TenantConfOpt;
use pageserver_api::key::{is_inherited_key, is_rel_fsm_block_key, is_rel_vm_block_key};
use pageserver_api::reltag::RelTag;
use pageserver_api::shard::ShardIndex;
@@ -2788,17 +2787,21 @@ impl Timeline {
crate::metrics::initial_logical_size::START_CALCULATION.retry(circumstances)
};
match self_ref
let calculated_size = self_ref
.logical_size_calculation_task(
initial_part_end,
LogicalSizeCalculationCause::Initial,
background_ctx,
)
.await
{
Ok(calculated_size) => Ok((calculated_size, metrics_guard)),
Err(e) => Err(e),
}
.await?;
self_ref
.trigger_aux_file_size_computation(initial_part_end, background_ctx)
.await?;
// TODO: add aux file size to logical size
Ok((calculated_size, metrics_guard))
}
};
@@ -3191,7 +3194,7 @@ impl Timeline {
// Recurse into ancestor if needed
if let Some(ancestor_timeline) = timeline.ancestor_timeline.as_ref() {
if is_inherited_key(key) && Lsn(cont_lsn.0 - 1) <= timeline.ancestor_lsn {
if key.is_inherited_key() && Lsn(cont_lsn.0 - 1) <= timeline.ancestor_lsn {
trace!(
"going into ancestor {}, cont_lsn is {}",
timeline.ancestor_lsn,
@@ -3880,22 +3883,25 @@ impl Timeline {
return Err(FlushLayerError::Cancelled);
}
// FIXME(auxfilesv2): support multiple metadata key partitions might need initdb support as well?
// This code path will not be hit during regression tests. After #7099 we have a single partition
// with two key ranges. If someone wants to fix initdb optimization in the future, this might need
// to be fixed.
// For metadata, always create delta layers.
let delta_layer = if !metadata_partition.parts.is_empty() {
assert_eq!(
metadata_partition.parts.len(),
1,
"currently sparse keyspace should only contain a single aux file keyspace"
"currently sparse keyspace should only contain a single metadata keyspace"
);
let metadata_keyspace = &metadata_partition.parts[0];
assert_eq!(
metadata_keyspace.0.ranges.len(),
1,
"aux file keyspace should be a single range"
);
self.create_delta_layer(
&frozen_layer,
Some(metadata_keyspace.0.ranges[0].clone()),
Some(
metadata_keyspace.0.ranges.first().unwrap().start
..metadata_keyspace.0.ranges.last().unwrap().end,
),
ctx,
)
.await
@@ -4262,7 +4268,7 @@ impl Timeline {
// Unfortunately we cannot do this for the main fork, or for
// any metadata keys, keys, as that would lead to actual data
// loss.
if is_rel_fsm_block_key(img_key) || is_rel_vm_block_key(img_key) {
if img_key.is_rel_fsm_block_key() || img_key.is_rel_vm_block_key() {
warn!("could not reconstruct FSM or VM key {img_key}, filling with zeros: {err:?}");
ZERO_PAGE.clone()
} else {
@@ -4312,6 +4318,7 @@ impl Timeline {
ctx: &RequestContext,
img_range: Range<Key>,
mode: ImageLayerCreationMode,
start: Key,
) -> Result<ImageLayerCreationOutcome, CreateImageLayersError> {
assert!(!matches!(mode, ImageLayerCreationMode::Initial));
@@ -4320,39 +4327,43 @@ impl Timeline {
let data = self
.get_vectored_impl(partition.clone(), lsn, &mut reconstruct_state, ctx)
.await?;
let (data, total_kb_retrieved, total_key_retrieved) = {
let (data, total_kb_retrieved, total_keys_retrieved) = {
let mut new_data = BTreeMap::new();
let mut total_kb_retrieved = 0;
let mut total_key_retrieved = 0;
let mut total_keys_retrieved = 0;
for (k, v) in data {
let v = v.map_err(CreateImageLayersError::PageReconstructError)?;
total_kb_retrieved += KEY_SIZE + v.len();
total_key_retrieved += 1;
total_keys_retrieved += 1;
new_data.insert(k, v);
}
(new_data, total_kb_retrieved / 1024, total_key_retrieved)
(new_data, total_kb_retrieved / 1024, total_keys_retrieved)
};
let delta_file_accessed = reconstruct_state.get_delta_layers_visited();
let delta_files_accessed = reconstruct_state.get_delta_layers_visited();
let trigger_generation = delta_file_accessed as usize >= MAX_AUX_FILE_V2_DELTAS;
let trigger_generation = delta_files_accessed as usize >= MAX_AUX_FILE_V2_DELTAS;
debug!(
"generate image layers for metadata keys: trigger_generation={trigger_generation}, \
delta_file_accessed={delta_file_accessed}, total_kb_retrieved={total_kb_retrieved}, \
total_key_retrieved={total_key_retrieved}"
trigger_generation,
delta_files_accessed,
total_kb_retrieved,
total_keys_retrieved,
"generate metadata images"
);
if !trigger_generation && mode == ImageLayerCreationMode::Try {
return Ok(ImageLayerCreationOutcome {
image: None,
next_start_key: img_range.end,
});
}
let has_keys = !data.is_empty();
let mut wrote_any_image = false;
for (k, v) in data {
// Even if the value is empty (deleted), we do not delete it for now until we can ensure vectored get
// considers this situation properly.
// if v.is_empty() {
// continue;
// }
if v.is_empty() {
// the key has been deleted, it does not need an image
// in metadata keyspace, an empty image == tombstone
continue;
}
wrote_any_image = true;
// No need to handle sharding b/c metadata keys are always on the 0-th shard.
@@ -4360,16 +4371,26 @@ impl Timeline {
// on the normal data path either.
image_layer_writer.put_image(k, v, ctx).await?;
}
Ok(ImageLayerCreationOutcome {
image: if has_keys {
let image_layer = image_layer_writer.finish(self, ctx).await?;
Some(image_layer)
} else {
tracing::debug!("no data in range {}-{}", img_range.start, img_range.end);
None
},
next_start_key: img_range.end,
})
if wrote_any_image {
// Normal path: we have written some data into the new image layer for this
// partition, so flush it to disk.
let image_layer = image_layer_writer.finish(self, ctx).await?;
Ok(ImageLayerCreationOutcome {
image: Some(image_layer),
next_start_key: img_range.end,
})
} else {
// Special case: the image layer may be empty if this is a sharded tenant and the
// partition does not cover any keys owned by this shard. In this case, to ensure
// we don't leave gaps between image layers, leave `start` where it is, so that the next
// layer we write will cover the key range that we just scanned.
tracing::debug!("no data in range {}-{}", img_range.start, img_range.end);
Ok(ImageLayerCreationOutcome {
image: None,
next_start_key: start,
})
}
}
#[tracing::instrument(skip_all, fields(%lsn, %mode))]
@@ -4425,6 +4446,12 @@ impl Timeline {
if mode == ImageLayerCreationMode::Initial {
return Err(CreateImageLayersError::Other(anyhow::anyhow!("no image layer should be created for metadata keys when flushing frozen layers")));
}
if mode == ImageLayerCreationMode::Try && !check_for_image_layers {
// Skip compaction if there are not enough updates. Metadata compaction will do a scan and
// might mess up with evictions.
start = img_range.end;
continue;
}
} else if let ImageLayerCreationMode::Try = mode {
// check_for_image_layers = false -> skip
// check_for_image_layers = true -> check time_for_new_image_layer -> skip/generate
@@ -4479,6 +4506,7 @@ impl Timeline {
ctx,
img_range,
mode,
start,
)
.await?;
start = next_start_key;
@@ -5448,11 +5476,12 @@ impl Timeline {
let min_key = *deltas.first().map(|(k, _, _)| k).unwrap();
let max_key = deltas.last().map(|(k, _, _)| k).unwrap().next();
let min_lsn = *deltas.iter().map(|(_, lsn, _)| lsn).min().unwrap();
let max_lsn = Lsn(deltas.iter().map(|(_, lsn, _)| lsn).max().unwrap().0 + 1);
let max_lsn = *deltas.iter().map(|(_, lsn, _)| lsn).max().unwrap();
assert!(
max_lsn <= last_record_lsn,
"advance last record lsn before inserting a layer, max_lsn={max_lsn}, last_record_lsn={last_record_lsn}"
);
let end_lsn = Lsn(max_lsn.0 + 1);
if let Some(check_start_lsn) = check_start_lsn {
assert!(min_lsn >= check_start_lsn);
}
@@ -5461,7 +5490,7 @@ impl Timeline {
self.timeline_id,
self.tenant_shard_id,
min_key,
min_lsn..max_lsn,
min_lsn..end_lsn,
ctx,
)
.await?;
@@ -5477,6 +5506,36 @@ impl Timeline {
Ok(())
}
/// Return all keys at the LSN in the image layers
#[cfg(test)]
pub(crate) async fn inspect_image_layers(
self: &Arc<Timeline>,
lsn: Lsn,
ctx: &RequestContext,
) -> anyhow::Result<Vec<(Key, Bytes)>> {
let mut all_data = Vec::new();
let guard = self.layers.read().await;
for layer in guard.layer_map().iter_historic_layers() {
if !layer.is_delta() && layer.image_layer_lsn() == lsn {
let layer = guard.get_from_desc(&layer);
let mut reconstruct_data = ValuesReconstructState::default();
layer
.get_values_reconstruct_data(
KeySpace::single(Key::MIN..Key::MAX),
lsn..Lsn(lsn.0 + 1),
&mut reconstruct_data,
ctx,
)
.await?;
for (k, v) in reconstruct_data.keys {
all_data.push((k, v?.img.unwrap().1));
}
}
}
all_data.sort();
Ok(all_data)
}
}
type TraversalPathItem = (ValueReconstructResult, Lsn, TraversalId);
@@ -5619,7 +5678,7 @@ impl<'a> TimelineWriter<'a> {
self.tl.flush_frozen_layers();
let current_size = self.write_guard.as_ref().unwrap().current_size;
if current_size > self.get_checkpoint_distance() {
if current_size >= self.get_checkpoint_distance() * 2 {
warn!("Flushed oversized open layer with size {}", current_size)
}

View File

@@ -133,8 +133,7 @@ impl Timeline {
},
&image_ctx,
)
.await
.map_err(anyhow::Error::from)?;
.await?;
self.upload_new_image_layers(image_layers)?;
partitioning.parts.len()
@@ -422,48 +421,6 @@ impl Timeline {
return Ok(CompactLevel0Phase1Result::default());
}
// This failpoint is used together with `test_duplicate_layers` integration test.
// It returns the compaction result exactly the same layers as input to compaction.
// We want to ensure that this will not cause any problem when updating the layer map
// after the compaction is finished.
//
// Currently, there are two rare edge cases that will cause duplicated layers being
// inserted.
// 1. The compaction job is inturrupted / did not finish successfully. Assume we have file 1, 2, 3, 4, which
// is compacted to 5, but the page server is shut down, next time we start page server we will get a layer
// map containing 1, 2, 3, 4, and 5, whereas 5 has the same content as 4. If we trigger L0 compation at this
// point again, it is likely that we will get a file 6 which has the same content and the key range as 5,
// and this causes an overwrite. This is acceptable because the content is the same, and we should do a
// layer replace instead of the normal remove / upload process.
// 2. The input workload pattern creates exactly n files that are sorted, non-overlapping and is of target file
// size length. Compaction will likely create the same set of n files afterwards.
//
// This failpoint is a superset of both of the cases.
if cfg!(feature = "testing") {
let active = (|| {
::fail::fail_point!("compact-level0-phase1-return-same", |_| true);
false
})();
if active {
let mut new_layers = Vec::with_capacity(level0_deltas.len());
for delta in &level0_deltas {
// we are just faking these layers as being produced again for this failpoint
new_layers.push(
delta
.download_and_keep_resident()
.await
.context("download layer for failpoint")?,
);
}
tracing::info!("compact-level0-phase1-return-same"); // so that we can check if we hit the failpoint
return Ok(CompactLevel0Phase1Result {
new_layers,
deltas_to_compact: level0_deltas,
});
}
}
// Gather the files to compact in this iteration.
//
// Start with the oldest Level 0 delta file, and collect any other

View File

@@ -3,12 +3,10 @@ use super::storage_layer::ResidentLayer;
use crate::tenant::metadata::TimelineMetadata;
use crate::tenant::remote_timeline_client::index::IndexPart;
use crate::tenant::remote_timeline_client::index::LayerFileMetadata;
use crate::tenant::remote_timeline_client::index::Lineage;
use std::collections::{HashMap, VecDeque};
use std::fmt::Debug;
use chrono::NaiveDateTime;
use pageserver_api::models::AuxFilePolicy;
use std::sync::Arc;
use tracing::info;
use utils::lsn::AtomicLsn;
@@ -45,34 +43,25 @@ pub(crate) struct UploadQueueInitialized {
/// Counter to assign task IDs
pub(crate) task_counter: u64,
/// All layer files stored in the remote storage, taking into account all
/// in-progress and queued operations
pub(crate) latest_files: HashMap<LayerName, LayerFileMetadata>,
/// The next uploaded index_part.json; assumed to be dirty.
///
/// Should not be read, directly except for layer file updates. Instead you should add a
/// projected field.
pub(crate) dirty: IndexPart,
/// The latest remote persisted IndexPart.
///
/// Each completed metadata upload will update this. The second item is the task_id which last
/// updated the value, used to ensure we never store an older value over a newer one.
pub(crate) clean: (IndexPart, Option<u64>),
/// How many file uploads or deletions been scheduled, since the
/// last (scheduling of) metadata index upload?
pub(crate) latest_files_changes_since_metadata_upload_scheduled: u64,
/// Metadata stored in the remote storage, taking into account all
/// in-progress and queued operations.
/// DANGER: do not return to outside world, e.g., safekeepers.
pub(crate) latest_metadata: TimelineMetadata,
/// Part of the flattened "next" `index_part.json`.
pub(crate) latest_lineage: Lineage,
/// The last aux file policy used on this timeline.
pub(crate) last_aux_file_policy: Option<AuxFilePolicy>,
/// `disk_consistent_lsn` from the last metadata file that was successfully
/// uploaded. `Lsn(0)` if nothing was uploaded yet.
/// Unlike `latest_files` or `latest_metadata`, this value is never ahead.
/// Safekeeper can rely on it to make decisions for WAL storage.
///
/// visible_remote_consistent_lsn is only updated after our generation has been validated with
/// The Lsn is only updated after our generation has been validated with
/// the control plane (unlesss a timeline's generation is None, in which case
/// we skip validation)
pub(crate) projected_remote_consistent_lsn: Option<Lsn>,
pub(crate) visible_remote_consistent_lsn: Arc<AtomicLsn>,
// Breakdown of different kinds of tasks currently in-progress
@@ -118,7 +107,8 @@ impl UploadQueueInitialized {
}
pub(super) fn get_last_remote_consistent_lsn_projected(&self) -> Option<Lsn> {
self.projected_remote_consistent_lsn
let lsn = self.clean.0.metadata.disk_consistent_lsn();
self.clean.1.map(|_| lsn)
}
}
@@ -174,13 +164,12 @@ impl UploadQueue {
info!("initializing upload queue for empty remote");
let index_part = IndexPart::empty(metadata.clone());
let state = UploadQueueInitialized {
// As described in the doc comment, it's ok for `latest_files` and `latest_metadata` to be ahead.
latest_files: HashMap::new(),
dirty: index_part.clone(),
clean: (index_part, None),
latest_files_changes_since_metadata_upload_scheduled: 0,
latest_metadata: metadata.clone(),
latest_lineage: Lineage::default(),
projected_remote_consistent_lsn: None,
visible_remote_consistent_lsn: Arc::new(AtomicLsn::new(0)),
// what follows are boring default initializations
task_counter: 0,
@@ -193,7 +182,6 @@ impl UploadQueue {
dangling_files: HashMap::new(),
shutting_down: false,
shutdown_ready: Arc::new(tokio::sync::Semaphore::new(0)),
last_aux_file_policy: Default::default(),
};
*self = UploadQueue::Initialized(state);
@@ -211,22 +199,15 @@ impl UploadQueue {
}
}
let mut files = HashMap::with_capacity(index_part.layer_metadata.len());
for (layer_name, layer_metadata) in &index_part.layer_metadata {
files.insert(layer_name.to_owned(), layer_metadata.clone());
}
info!(
"initializing upload queue with remote index_part.disk_consistent_lsn: {}",
index_part.metadata.disk_consistent_lsn()
);
let state = UploadQueueInitialized {
latest_files: files,
dirty: index_part.clone(),
clean: (index_part.clone(), None),
latest_files_changes_since_metadata_upload_scheduled: 0,
latest_metadata: index_part.metadata.clone(),
latest_lineage: index_part.lineage.clone(),
projected_remote_consistent_lsn: Some(index_part.metadata.disk_consistent_lsn()),
visible_remote_consistent_lsn: Arc::new(
index_part.metadata.disk_consistent_lsn().into(),
),
@@ -241,7 +222,6 @@ impl UploadQueue {
dangling_files: HashMap::new(),
shutting_down: false,
shutdown_ready: Arc::new(tokio::sync::Semaphore::new(0)),
last_aux_file_policy: index_part.last_aux_file_policy(),
};
*self = UploadQueue::Initialized(state);
@@ -298,13 +278,16 @@ pub(crate) enum UploadOp {
/// Upload a layer file
UploadLayer(ResidentLayer, LayerFileMetadata),
/// Upload the metadata file
UploadMetadata(Box<IndexPart>, Lsn),
/// Upload a index_part.json file
UploadMetadata {
/// The next [`UploadQueueInitialized::clean`] after this upload succeeds.
uploaded: Box<IndexPart>,
},
/// Delete layer files
Delete(Delete),
/// Barrier. When the barrier operation is reached,
/// Barrier. When the barrier operation is reached, the channel is closed.
Barrier(tokio::sync::watch::Sender<()>),
/// Shutdown; upon encountering this operation no new operations will be spawned, otherwise
@@ -322,8 +305,12 @@ impl std::fmt::Display for UploadOp {
layer, metadata.file_size, metadata.generation
)
}
UploadOp::UploadMetadata(_, lsn) => {
write!(f, "UploadMetadata(lsn: {})", lsn)
UploadOp::UploadMetadata { uploaded, .. } => {
write!(
f,
"UploadMetadata(lsn: {})",
uploaded.metadata.disk_consistent_lsn()
)
}
UploadOp::Delete(delete) => {
write!(f, "Delete({} layers)", delete.layers.len())

View File

@@ -234,6 +234,7 @@ impl WalIngest {
modification,
&parsed_xact,
info == pg_constants::XLOG_XACT_COMMIT,
decoded.origin_id,
ctx,
)
.await?;
@@ -246,6 +247,7 @@ impl WalIngest {
modification,
&parsed_xact,
info == pg_constants::XLOG_XACT_COMMIT_PREPARED,
decoded.origin_id,
ctx,
)
.await?;
@@ -375,6 +377,18 @@ impl WalIngest {
self.checkpoint.oldestActiveXid = xlrec.oldest_running_xid;
}
}
pg_constants::RM_REPLORIGIN_ID => {
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
if info == pg_constants::XLOG_REPLORIGIN_SET {
let xlrec = crate::walrecord::XlReploriginSet::decode(&mut buf);
modification
.set_replorigin(xlrec.node_id, xlrec.remote_lsn)
.await?
} else if info == pg_constants::XLOG_REPLORIGIN_DROP {
let xlrec = crate::walrecord::XlReploriginDrop::decode(&mut buf);
modification.drop_replorigin(xlrec.node_id).await?
}
}
_x => {
// TODO: should probably log & fail here instead of blindly
// doing something without understanding the protocol
@@ -1178,6 +1192,7 @@ impl WalIngest {
modification: &mut DatadirModification<'_>,
parsed: &XlXactParsedRecord,
is_commit: bool,
origin_id: u16,
ctx: &RequestContext,
) -> anyhow::Result<()> {
// Record update of CLOG pages
@@ -1243,6 +1258,11 @@ impl WalIngest {
}
}
}
if origin_id != 0 {
modification
.set_replorigin(origin_id, parsed.origin_lsn)
.await?;
}
Ok(())
}

View File

@@ -9,10 +9,10 @@ use postgres_ffi::pg_constants;
use postgres_ffi::BLCKSZ;
use postgres_ffi::{BlockNumber, TimestampTz};
use postgres_ffi::{MultiXactId, MultiXactOffset, MultiXactStatus, Oid, TransactionId};
use postgres_ffi::{XLogRecord, XLOG_SIZE_OF_XLOG_RECORD};
use postgres_ffi::{RepOriginId, XLogRecord, XLOG_SIZE_OF_XLOG_RECORD};
use serde::{Deserialize, Serialize};
use tracing::*;
use utils::bin_ser::DeserializeError;
use utils::{bin_ser::DeserializeError, lsn::Lsn};
/// Each update to a page is represented by a NeonWalRecord. It can be a wrapper
/// around a PostgreSQL WAL record, or a custom neon-specific "record".
@@ -116,6 +116,7 @@ pub struct DecodedWALRecord {
pub blocks: Vec<DecodedBkpBlock>,
pub main_data_offset: usize,
pub origin_id: u16,
}
#[repr(C)]
@@ -573,6 +574,7 @@ pub struct XlXactParsedRecord {
pub subxacts: Vec<TransactionId>,
pub xnodes: Vec<RelFileNode>,
pub origin_lsn: Lsn,
}
impl XlXactParsedRecord {
@@ -651,6 +653,11 @@ impl XlXactParsedRecord {
debug!("XLOG_XACT_COMMIT-XACT_XINFO_HAS_TWOPHASE xid {}", xid);
}
let origin_lsn = if xinfo & pg_constants::XACT_XINFO_HAS_ORIGIN != 0 {
Lsn(buf.get_u64_le())
} else {
Lsn::INVALID
};
XlXactParsedRecord {
xid,
info,
@@ -660,6 +667,7 @@ impl XlXactParsedRecord {
ts_id,
subxacts,
xnodes,
origin_lsn,
}
}
}
@@ -810,6 +818,36 @@ impl XlRunningXacts {
}
}
#[repr(C)]
#[derive(Debug)]
pub struct XlReploriginDrop {
pub node_id: RepOriginId,
}
impl XlReploriginDrop {
pub fn decode(buf: &mut Bytes) -> XlReploriginDrop {
XlReploriginDrop {
node_id: buf.get_u16_le(),
}
}
}
#[repr(C)]
#[derive(Debug)]
pub struct XlReploriginSet {
pub remote_lsn: Lsn,
pub node_id: RepOriginId,
}
impl XlReploriginSet {
pub fn decode(buf: &mut Bytes) -> XlReploriginSet {
XlReploriginSet {
remote_lsn: Lsn(buf.get_u64_le()),
node_id: buf.get_u16_le(),
}
}
}
/// Main routine to decode a WAL record and figure out which blocks are modified
//
// See xlogrecord.h for details
@@ -844,6 +882,7 @@ pub fn decode_wal_record(
let mut rnode_dbnode: u32 = 0;
let mut rnode_relnode: u32 = 0;
let mut got_rnode = false;
let mut origin_id: u16 = 0;
let mut buf = record.clone();
@@ -891,7 +930,7 @@ pub fn decode_wal_record(
pg_constants::XLR_BLOCK_ID_ORIGIN => {
// RepOriginId is uint16
buf.advance(2);
origin_id = buf.get_u16_le();
}
pg_constants::XLR_BLOCK_ID_TOPLEVEL_XID => {
@@ -1088,6 +1127,7 @@ pub fn decode_wal_record(
decoded.xl_info = xlogrec.xl_info;
decoded.xl_rmid = xlogrec.xl_rmid;
decoded.record = record;
decoded.origin_id = origin_id;
decoded.main_data_offset = main_data_offset;
Ok(())

View File

@@ -20,7 +20,6 @@
/// Process lifecycle and abstracction for the IPC protocol.
mod process;
pub use process::Kind as ProcessKind;
/// Code to apply [`NeonWalRecord`]s.
pub(crate) mod apply_neon;
@@ -34,7 +33,6 @@ use crate::repository::Key;
use crate::walrecord::NeonWalRecord;
use anyhow::Context;
use bytes::{Bytes, BytesMut};
use pageserver_api::key::key_to_rel_block;
use pageserver_api::models::{WalRedoManagerProcessStatus, WalRedoManagerStatus};
use pageserver_api::shard::TenantShardId;
use std::sync::Arc;
@@ -55,7 +53,7 @@ pub struct PostgresRedoManager {
tenant_shard_id: TenantShardId,
conf: &'static PageServerConf,
last_redo_at: std::sync::Mutex<Option<Instant>>,
/// The current [`process::Process`] that is used by new redo requests.
/// The current [`process::WalRedoProcess`] that is used by new redo requests.
/// We use [`heavier_once_cell`] for coalescing the spawning, but the redo
/// requests don't use the [`heavier_once_cell::Guard`] to keep ahold of the
/// their process object; we use [`Arc::clone`] for that.
@@ -67,7 +65,7 @@ pub struct PostgresRedoManager {
/// still be using the old redo process. But, those other tasks will most likely
/// encounter an error as well, and errors are an unexpected condition anyway.
/// So, probably we could get rid of the `Arc` in the future.
redo_process: heavier_once_cell::OnceCell<Arc<process::Process>>,
redo_process: heavier_once_cell::OnceCell<Arc<process::WalRedoProcess>>,
}
///
@@ -208,30 +206,35 @@ impl PostgresRedoManager {
) -> anyhow::Result<Bytes> {
*(self.last_redo_at.lock().unwrap()) = Some(Instant::now());
let (rel, blknum) = key_to_rel_block(key).context("invalid record")?;
let (rel, blknum) = key.to_rel_block().context("invalid record")?;
const MAX_RETRY_ATTEMPTS: u32 = 1;
let mut n_attempts = 0u32;
loop {
let proc: Arc<process::Process> = match self.redo_process.get_or_init_detached().await {
Ok(guard) => Arc::clone(&guard),
Err(permit) => {
// don't hold poison_guard, the launch code can bail
let start = Instant::now();
let proc = Arc::new(
process::Process::launch(self.conf, self.tenant_shard_id, pg_version)
let proc: Arc<process::WalRedoProcess> =
match self.redo_process.get_or_init_detached().await {
Ok(guard) => Arc::clone(&guard),
Err(permit) => {
// don't hold poison_guard, the launch code can bail
let start = Instant::now();
let proc = Arc::new(
process::WalRedoProcess::launch(
self.conf,
self.tenant_shard_id,
pg_version,
)
.context("launch walredo process")?,
);
let duration = start.elapsed();
WAL_REDO_PROCESS_LAUNCH_DURATION_HISTOGRAM.observe(duration.as_secs_f64());
info!(
duration_ms = duration.as_millis(),
pid = proc.id(),
"launched walredo process"
);
self.redo_process.set(Arc::clone(&proc), permit);
proc
}
};
);
let duration = start.elapsed();
WAL_REDO_PROCESS_LAUNCH_DURATION_HISTOGRAM.observe(duration.as_secs_f64());
info!(
duration_ms = duration.as_millis(),
pid = proc.id(),
"launched walredo process"
);
self.redo_process.set(Arc::clone(&proc), permit);
proc
}
};
let started_at = std::time::Instant::now();
@@ -362,10 +365,10 @@ impl PostgresRedoManager {
&self,
key: Key,
page: &mut BytesMut,
_record_lsn: Lsn,
record_lsn: Lsn,
record: &NeonWalRecord,
) -> anyhow::Result<()> {
apply_neon::apply_in_neon(record, key, page)?;
apply_neon::apply_in_neon(record, record_lsn, key, page)?;
Ok(())
}

View File

@@ -3,7 +3,7 @@ use crate::walrecord::NeonWalRecord;
use anyhow::Context;
use byteorder::{ByteOrder, LittleEndian};
use bytes::{BufMut, BytesMut};
use pageserver_api::key::{key_to_rel_block, key_to_slru_block, Key};
use pageserver_api::key::Key;
use pageserver_api::reltag::SlruKind;
use postgres_ffi::pg_constants;
use postgres_ffi::relfile_utils::VISIBILITYMAP_FORKNUM;
@@ -14,6 +14,7 @@ use postgres_ffi::v14::nonrelfile_utils::{
use postgres_ffi::BLCKSZ;
use tracing::*;
use utils::bin_ser::BeSer;
use utils::lsn::Lsn;
/// Can this request be served by neon redo functions
/// or we need to pass it to wal-redo postgres process?
@@ -32,6 +33,7 @@ pub(crate) fn can_apply_in_neon(rec: &NeonWalRecord) -> bool {
pub(crate) fn apply_in_neon(
record: &NeonWalRecord,
lsn: Lsn,
key: Key,
page: &mut BytesMut,
) -> Result<(), anyhow::Error> {
@@ -48,7 +50,7 @@ pub(crate) fn apply_in_neon(
flags,
} => {
// sanity check that this is modifying the correct relation
let (rel, blknum) = key_to_rel_block(key).context("invalid record")?;
let (rel, blknum) = key.to_rel_block().context("invalid record")?;
assert!(
rel.forknum == VISIBILITYMAP_FORKNUM,
"ClearVisibilityMapFlags record on unexpected rel {}",
@@ -67,6 +69,7 @@ pub(crate) fn apply_in_neon(
let map = &mut page[pg_constants::MAXALIGN_SIZE_OF_PAGE_HEADER_DATA..];
map[map_byte as usize] &= !(flags << map_offset);
postgres_ffi::page_set_lsn(page, lsn);
}
// Repeat for 'old_heap_blkno', if any
@@ -80,12 +83,13 @@ pub(crate) fn apply_in_neon(
let map = &mut page[pg_constants::MAXALIGN_SIZE_OF_PAGE_HEADER_DATA..];
map[map_byte as usize] &= !(flags << map_offset);
postgres_ffi::page_set_lsn(page, lsn);
}
}
// Non-relational WAL records are handled here, with custom code that has the
// same effects as the corresponding Postgres WAL redo function.
NeonWalRecord::ClogSetCommitted { xids, timestamp } => {
let (slru_kind, segno, blknum) = key_to_slru_block(key).context("invalid record")?;
let (slru_kind, segno, blknum) = key.to_slru_block().context("invalid record")?;
assert_eq!(
slru_kind,
SlruKind::Clog,
@@ -130,7 +134,7 @@ pub(crate) fn apply_in_neon(
}
}
NeonWalRecord::ClogSetAborted { xids } => {
let (slru_kind, segno, blknum) = key_to_slru_block(key).context("invalid record")?;
let (slru_kind, segno, blknum) = key.to_slru_block().context("invalid record")?;
assert_eq!(
slru_kind,
SlruKind::Clog,
@@ -160,7 +164,7 @@ pub(crate) fn apply_in_neon(
}
}
NeonWalRecord::MultixactOffsetCreate { mid, moff } => {
let (slru_kind, segno, blknum) = key_to_slru_block(key).context("invalid record")?;
let (slru_kind, segno, blknum) = key.to_slru_block().context("invalid record")?;
assert_eq!(
slru_kind,
SlruKind::MultiXactOffsets,
@@ -192,7 +196,7 @@ pub(crate) fn apply_in_neon(
LittleEndian::write_u32(&mut page[offset..offset + 4], *moff);
}
NeonWalRecord::MultixactMembersCreate { moff, members } => {
let (slru_kind, segno, blknum) = key_to_slru_block(key).context("invalid record")?;
let (slru_kind, segno, blknum) = key.to_slru_block().context("invalid record")?;
assert_eq!(
slru_kind,
SlruKind::MultiXactMembers,
@@ -285,7 +289,7 @@ mod test {
let mut page = BytesMut::from_iter(base_image);
for record in deltas {
apply_in_neon(&record, file_path, &mut page)?;
apply_in_neon(&record, Lsn(8), file_path, &mut page)?;
}
let reconstructed = AuxFilesDirectory::des(&page)?;

View File

@@ -1,64 +1,187 @@
/// Layer of indirection previously used to support multiple implementations.
/// Subject to removal: <https://github.com/neondatabase/neon/issues/7753>
use std::time::Duration;
use bytes::Bytes;
use pageserver_api::{reltag::RelTag, shard::TenantShardId};
use tracing::warn;
use utils::lsn::Lsn;
use crate::{config::PageServerConf, walrecord::NeonWalRecord};
mod no_leak_child;
/// The IPC protocol that pageserver and walredo process speak over their shared pipe.
mod protocol;
mod process_impl {
pub(super) mod process_async;
use self::no_leak_child::NoLeakChild;
use crate::{
config::PageServerConf,
metrics::{WalRedoKillCause, WAL_REDO_PROCESS_COUNTERS, WAL_REDO_RECORD_COUNTER},
span::debug_assert_current_span_has_tenant_id,
walrecord::NeonWalRecord,
};
use anyhow::Context;
use bytes::Bytes;
use pageserver_api::{reltag::RelTag, shard::TenantShardId};
use postgres_ffi::BLCKSZ;
#[cfg(feature = "testing")]
use std::sync::atomic::AtomicUsize;
use std::{
collections::VecDeque,
process::{Command, Stdio},
time::Duration,
};
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tracing::{debug, error, instrument, Instrument};
use utils::{lsn::Lsn, poison::Poison};
pub struct WalRedoProcess {
#[allow(dead_code)]
conf: &'static PageServerConf,
#[cfg(feature = "testing")]
tenant_shard_id: TenantShardId,
// Some() on construction, only becomes None on Drop.
child: Option<NoLeakChild>,
stdout: tokio::sync::Mutex<Poison<ProcessOutput>>,
stdin: tokio::sync::Mutex<Poison<ProcessInput>>,
/// Counter to separate same sized walredo inputs failing at the same millisecond.
#[cfg(feature = "testing")]
dump_sequence: AtomicUsize,
}
#[derive(
Clone,
Copy,
Debug,
PartialEq,
Eq,
strum_macros::EnumString,
strum_macros::Display,
strum_macros::IntoStaticStr,
serde_with::DeserializeFromStr,
serde_with::SerializeDisplay,
)]
#[strum(serialize_all = "kebab-case")]
#[repr(u8)]
pub enum Kind {
Sync,
Async,
struct ProcessInput {
stdin: tokio::process::ChildStdin,
n_requests: usize,
}
pub(crate) struct Process(process_impl::process_async::WalRedoProcess);
struct ProcessOutput {
stdout: tokio::process::ChildStdout,
pending_responses: VecDeque<Option<Bytes>>,
n_processed_responses: usize,
}
impl Process {
#[inline(always)]
pub fn launch(
impl WalRedoProcess {
//
// Start postgres binary in special WAL redo mode.
//
#[instrument(skip_all,fields(pg_version=pg_version))]
pub(crate) fn launch(
conf: &'static PageServerConf,
tenant_shard_id: TenantShardId,
pg_version: u32,
) -> anyhow::Result<Self> {
if conf.walredo_process_kind != Kind::Async {
warn!(
configured = %conf.walredo_process_kind,
"the walredo_process_kind setting has been turned into a no-op, using async implementation"
);
}
Ok(Self(process_impl::process_async::WalRedoProcess::launch(
crate::span::debug_assert_current_span_has_tenant_id();
let pg_bin_dir_path = conf.pg_bin_dir(pg_version).context("pg_bin_dir")?; // TODO these should be infallible.
let pg_lib_dir_path = conf.pg_lib_dir(pg_version).context("pg_lib_dir")?;
use no_leak_child::NoLeakChildCommandExt;
// Start postgres itself
let child = Command::new(pg_bin_dir_path.join("postgres"))
// the first arg must be --wal-redo so the child process enters into walredo mode
.arg("--wal-redo")
// the child doesn't process this arg, but, having it in the argv helps indentify the
// walredo process for a particular tenant when debugging a pagserver
.args(["--tenant-shard-id", &format!("{tenant_shard_id}")])
.stdin(Stdio::piped())
.stderr(Stdio::piped())
.stdout(Stdio::piped())
.env_clear()
.env("LD_LIBRARY_PATH", &pg_lib_dir_path)
.env("DYLD_LIBRARY_PATH", &pg_lib_dir_path)
// NB: The redo process is not trusted after we sent it the first
// walredo work. Before that, it is trusted. Specifically, we trust
// it to
// 1. close all file descriptors except stdin, stdout, stderr because
// pageserver might not be 100% diligent in setting FD_CLOEXEC on all
// the files it opens, and
// 2. to use seccomp to sandbox itself before processing the first
// walredo request.
.spawn_no_leak_child(tenant_shard_id)
.context("spawn process")?;
WAL_REDO_PROCESS_COUNTERS.started.inc();
let mut child = scopeguard::guard(child, |child| {
error!("killing wal-redo-postgres process due to a problem during launch");
child.kill_and_wait(WalRedoKillCause::Startup);
});
let stdin = child.stdin.take().unwrap();
let stdout = child.stdout.take().unwrap();
let stderr = child.stderr.take().unwrap();
let stderr = tokio::process::ChildStderr::from_std(stderr)
.context("convert to tokio::ChildStderr")?;
let stdin =
tokio::process::ChildStdin::from_std(stdin).context("convert to tokio::ChildStdin")?;
let stdout = tokio::process::ChildStdout::from_std(stdout)
.context("convert to tokio::ChildStdout")?;
// all fallible operations post-spawn are complete, so get rid of the guard
let child = scopeguard::ScopeGuard::into_inner(child);
tokio::spawn(
async move {
scopeguard::defer! {
debug!("wal-redo-postgres stderr_logger_task finished");
crate::metrics::WAL_REDO_PROCESS_COUNTERS.active_stderr_logger_tasks_finished.inc();
}
debug!("wal-redo-postgres stderr_logger_task started");
crate::metrics::WAL_REDO_PROCESS_COUNTERS.active_stderr_logger_tasks_started.inc();
use tokio::io::AsyncBufReadExt;
let mut stderr_lines = tokio::io::BufReader::new(stderr);
let mut buf = Vec::new();
let res = loop {
buf.clear();
// TODO we don't trust the process to cap its stderr length.
// Currently it can do unbounded Vec allocation.
match stderr_lines.read_until(b'\n', &mut buf).await {
Ok(0) => break Ok(()), // eof
Ok(num_bytes) => {
let output = String::from_utf8_lossy(&buf[..num_bytes]);
error!(%output, "received output");
}
Err(e) => {
break Err(e);
}
}
};
match res {
Ok(()) => (),
Err(e) => {
error!(error=?e, "failed to read from walredo stderr");
}
}
}.instrument(tracing::info_span!(parent: None, "wal-redo-postgres-stderr", pid = child.id(), tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug(), %pg_version))
);
Ok(Self {
conf,
#[cfg(feature = "testing")]
tenant_shard_id,
pg_version,
)?))
child: Some(child),
stdin: tokio::sync::Mutex::new(Poison::new(
"stdin",
ProcessInput {
stdin,
n_requests: 0,
},
)),
stdout: tokio::sync::Mutex::new(Poison::new(
"stdout",
ProcessOutput {
stdout,
pending_responses: VecDeque::new(),
n_processed_responses: 0,
},
)),
#[cfg(feature = "testing")]
dump_sequence: AtomicUsize::default(),
})
}
#[inline(always)]
pub(crate) fn id(&self) -> u32 {
self.child
.as_ref()
.expect("must not call this during Drop")
.id()
}
/// Apply given WAL records ('records') over an old page image. Returns
/// new page image.
///
/// # Cancel-Safety
///
/// Cancellation safe.
#[instrument(skip_all, fields(pid=%self.id()))]
pub(crate) async fn apply_wal_records(
&self,
rel: RelTag,
@@ -67,12 +190,193 @@ impl Process {
records: &[(Lsn, NeonWalRecord)],
wal_redo_timeout: Duration,
) -> anyhow::Result<Bytes> {
self.0
.apply_wal_records(rel, blknum, base_img, records, wal_redo_timeout)
.await
debug_assert_current_span_has_tenant_id();
let tag = protocol::BufferTag { rel, blknum };
// Serialize all the messages to send the WAL redo process first.
//
// This could be problematic if there are millions of records to replay,
// but in practice the number of records is usually so small that it doesn't
// matter, and it's better to keep this code simple.
//
// Most requests start with a before-image with BLCKSZ bytes, followed by
// by some other WAL records. Start with a buffer that can hold that
// comfortably.
let mut writebuf: Vec<u8> = Vec::with_capacity((BLCKSZ as usize) * 3);
protocol::build_begin_redo_for_block_msg(tag, &mut writebuf);
if let Some(img) = base_img {
protocol::build_push_page_msg(tag, img, &mut writebuf);
}
for (lsn, rec) in records.iter() {
if let NeonWalRecord::Postgres {
will_init: _,
rec: postgres_rec,
} = rec
{
protocol::build_apply_record_msg(*lsn, postgres_rec, &mut writebuf);
} else {
anyhow::bail!("tried to pass neon wal record to postgres WAL redo");
}
}
protocol::build_get_page_msg(tag, &mut writebuf);
WAL_REDO_RECORD_COUNTER.inc_by(records.len() as u64);
let Ok(res) =
tokio::time::timeout(wal_redo_timeout, self.apply_wal_records0(&writebuf)).await
else {
anyhow::bail!("WAL redo timed out");
};
if res.is_err() {
// not all of these can be caused by this particular input, however these are so rare
// in tests so capture all.
self.record_and_log(&writebuf);
}
res
}
pub(crate) fn id(&self) -> u32 {
self.0.id()
/// # Cancel-Safety
///
/// When not polled to completion (e.g. because in `tokio::select!` another
/// branch becomes ready before this future), concurrent and subsequent
/// calls may fail due to [`utils::poison::Poison::check_and_arm`] calls.
/// Dispose of this process instance and create a new one.
async fn apply_wal_records0(&self, writebuf: &[u8]) -> anyhow::Result<Bytes> {
let request_no = {
let mut lock_guard = self.stdin.lock().await;
let mut poison_guard = lock_guard.check_and_arm()?;
let input = poison_guard.data_mut();
input
.stdin
.write_all(writebuf)
.await
.context("write to walredo stdin")?;
let request_no = input.n_requests;
input.n_requests += 1;
poison_guard.disarm();
request_no
};
// To improve walredo performance we separate sending requests and receiving
// responses. Them are protected by different mutexes (output and input).
// If thread T1, T2, T3 send requests D1, D2, D3 to walredo process
// then there is not warranty that T1 will first granted output mutex lock.
// To address this issue we maintain number of sent requests, number of processed
// responses and ring buffer with pending responses. After sending response
// (under input mutex), threads remembers request number. Then it releases
// input mutex, locks output mutex and fetch in ring buffer all responses until
// its stored request number. The it takes correspondent element from
// pending responses ring buffer and truncate all empty elements from the front,
// advancing processed responses number.
let mut lock_guard = self.stdout.lock().await;
let mut poison_guard = lock_guard.check_and_arm()?;
let output = poison_guard.data_mut();
let n_processed_responses = output.n_processed_responses;
while n_processed_responses + output.pending_responses.len() <= request_no {
// We expect the WAL redo process to respond with an 8k page image. We read it
// into this buffer.
let mut resultbuf = vec![0; BLCKSZ.into()];
output
.stdout
.read_exact(&mut resultbuf)
.await
.context("read walredo stdout")?;
output
.pending_responses
.push_back(Some(Bytes::from(resultbuf)));
}
// Replace our request's response with None in `pending_responses`.
// Then make space in the ring buffer by clearing out any seqence of contiguous
// `None`'s from the front of `pending_responses`.
// NB: We can't pop_front() because other requests' responses because another
// requester might have grabbed the output mutex before us:
// T1: grab input mutex
// T1: send request_no 23
// T1: release input mutex
// T2: grab input mutex
// T2: send request_no 24
// T2: release input mutex
// T2: grab output mutex
// T2: n_processed_responses + output.pending_responses.len() <= request_no
// 23 0 24
// T2: enters poll loop that reads stdout
// T2: put response for 23 into pending_responses
// T2: put response for 24 into pending_resposnes
// pending_responses now looks like this: Front Some(response_23) Some(response_24) Back
// T2: takes its response_24
// pending_responses now looks like this: Front Some(response_23) None Back
// T2: does the while loop below
// pending_responses now looks like this: Front Some(response_23) None Back
// T2: releases output mutex
// T1: grabs output mutex
// T1: n_processed_responses + output.pending_responses.len() > request_no
// 23 2 23
// T1: skips poll loop that reads stdout
// T1: takes its response_23
// pending_responses now looks like this: Front None None Back
// T2: does the while loop below
// pending_responses now looks like this: Front Back
// n_processed_responses now has value 25
let res = output.pending_responses[request_no - n_processed_responses]
.take()
.expect("we own this request_no, nobody else is supposed to take it");
while let Some(front) = output.pending_responses.front() {
if front.is_none() {
output.pending_responses.pop_front();
output.n_processed_responses += 1;
} else {
break;
}
}
poison_guard.disarm();
Ok(res)
}
#[cfg(feature = "testing")]
fn record_and_log(&self, writebuf: &[u8]) {
use std::sync::atomic::Ordering;
let millis = std::time::SystemTime::now()
.duration_since(std::time::SystemTime::UNIX_EPOCH)
.unwrap()
.as_millis();
let seq = self.dump_sequence.fetch_add(1, Ordering::Relaxed);
// these files will be collected to an allure report
let filename = format!("walredo-{millis}-{}-{seq}.walredo", writebuf.len());
let path = self.conf.tenant_path(&self.tenant_shard_id).join(&filename);
use std::io::Write;
let res = std::fs::OpenOptions::new()
.write(true)
.create_new(true)
.read(true)
.open(path)
.and_then(|mut f| f.write_all(writebuf));
// trip up allowed_errors
if let Err(e) = res {
tracing::error!(target=%filename, length=writebuf.len(), "failed to write out the walredo errored input: {e}");
} else {
tracing::error!(filename, "erroring walredo input saved");
}
}
#[cfg(not(feature = "testing"))]
fn record_and_log(&self, _: &[u8]) {}
}
impl Drop for WalRedoProcess {
fn drop(&mut self) {
self.child
.take()
.expect("we only do this once")
.kill_and_wait(WalRedoKillCause::WalRedoProcessDrop);
// no way to wait for stderr_logger_task from Drop because that is async only
}
}

View File

@@ -1,374 +0,0 @@
use self::no_leak_child::NoLeakChild;
use crate::{
config::PageServerConf,
metrics::{WalRedoKillCause, WAL_REDO_PROCESS_COUNTERS, WAL_REDO_RECORD_COUNTER},
walrecord::NeonWalRecord,
walredo::process::{no_leak_child, protocol},
};
use anyhow::Context;
use bytes::Bytes;
use pageserver_api::{reltag::RelTag, shard::TenantShardId};
use postgres_ffi::BLCKSZ;
#[cfg(feature = "testing")]
use std::sync::atomic::AtomicUsize;
use std::{
collections::VecDeque,
process::{Command, Stdio},
time::Duration,
};
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tracing::{debug, error, instrument, Instrument};
use utils::{lsn::Lsn, poison::Poison};
pub struct WalRedoProcess {
#[allow(dead_code)]
conf: &'static PageServerConf,
tenant_shard_id: TenantShardId,
// Some() on construction, only becomes None on Drop.
child: Option<NoLeakChild>,
stdout: tokio::sync::Mutex<Poison<ProcessOutput>>,
stdin: tokio::sync::Mutex<Poison<ProcessInput>>,
/// Counter to separate same sized walredo inputs failing at the same millisecond.
#[cfg(feature = "testing")]
dump_sequence: AtomicUsize,
}
struct ProcessInput {
stdin: tokio::process::ChildStdin,
n_requests: usize,
}
struct ProcessOutput {
stdout: tokio::process::ChildStdout,
pending_responses: VecDeque<Option<Bytes>>,
n_processed_responses: usize,
}
impl WalRedoProcess {
//
// Start postgres binary in special WAL redo mode.
//
#[instrument(skip_all,fields(pg_version=pg_version))]
pub(crate) fn launch(
conf: &'static PageServerConf,
tenant_shard_id: TenantShardId,
pg_version: u32,
) -> anyhow::Result<Self> {
crate::span::debug_assert_current_span_has_tenant_id();
let pg_bin_dir_path = conf.pg_bin_dir(pg_version).context("pg_bin_dir")?; // TODO these should be infallible.
let pg_lib_dir_path = conf.pg_lib_dir(pg_version).context("pg_lib_dir")?;
use no_leak_child::NoLeakChildCommandExt;
// Start postgres itself
let child = Command::new(pg_bin_dir_path.join("postgres"))
// the first arg must be --wal-redo so the child process enters into walredo mode
.arg("--wal-redo")
// the child doesn't process this arg, but, having it in the argv helps indentify the
// walredo process for a particular tenant when debugging a pagserver
.args(["--tenant-shard-id", &format!("{tenant_shard_id}")])
.stdin(Stdio::piped())
.stderr(Stdio::piped())
.stdout(Stdio::piped())
.env_clear()
.env("LD_LIBRARY_PATH", &pg_lib_dir_path)
.env("DYLD_LIBRARY_PATH", &pg_lib_dir_path)
// NB: The redo process is not trusted after we sent it the first
// walredo work. Before that, it is trusted. Specifically, we trust
// it to
// 1. close all file descriptors except stdin, stdout, stderr because
// pageserver might not be 100% diligent in setting FD_CLOEXEC on all
// the files it opens, and
// 2. to use seccomp to sandbox itself before processing the first
// walredo request.
.spawn_no_leak_child(tenant_shard_id)
.context("spawn process")?;
WAL_REDO_PROCESS_COUNTERS.started.inc();
let mut child = scopeguard::guard(child, |child| {
error!("killing wal-redo-postgres process due to a problem during launch");
child.kill_and_wait(WalRedoKillCause::Startup);
});
let stdin = child.stdin.take().unwrap();
let stdout = child.stdout.take().unwrap();
let stderr = child.stderr.take().unwrap();
let stderr = tokio::process::ChildStderr::from_std(stderr)
.context("convert to tokio::ChildStderr")?;
let stdin =
tokio::process::ChildStdin::from_std(stdin).context("convert to tokio::ChildStdin")?;
let stdout = tokio::process::ChildStdout::from_std(stdout)
.context("convert to tokio::ChildStdout")?;
// all fallible operations post-spawn are complete, so get rid of the guard
let child = scopeguard::ScopeGuard::into_inner(child);
tokio::spawn(
async move {
scopeguard::defer! {
debug!("wal-redo-postgres stderr_logger_task finished");
crate::metrics::WAL_REDO_PROCESS_COUNTERS.active_stderr_logger_tasks_finished.inc();
}
debug!("wal-redo-postgres stderr_logger_task started");
crate::metrics::WAL_REDO_PROCESS_COUNTERS.active_stderr_logger_tasks_started.inc();
use tokio::io::AsyncBufReadExt;
let mut stderr_lines = tokio::io::BufReader::new(stderr);
let mut buf = Vec::new();
let res = loop {
buf.clear();
// TODO we don't trust the process to cap its stderr length.
// Currently it can do unbounded Vec allocation.
match stderr_lines.read_until(b'\n', &mut buf).await {
Ok(0) => break Ok(()), // eof
Ok(num_bytes) => {
let output = String::from_utf8_lossy(&buf[..num_bytes]);
error!(%output, "received output");
}
Err(e) => {
break Err(e);
}
}
};
match res {
Ok(()) => (),
Err(e) => {
error!(error=?e, "failed to read from walredo stderr");
}
}
}.instrument(tracing::info_span!(parent: None, "wal-redo-postgres-stderr", pid = child.id(), tenant_id = %tenant_shard_id.tenant_id, shard_id = %tenant_shard_id.shard_slug(), %pg_version))
);
Ok(Self {
conf,
tenant_shard_id,
child: Some(child),
stdin: tokio::sync::Mutex::new(Poison::new(
"stdin",
ProcessInput {
stdin,
n_requests: 0,
},
)),
stdout: tokio::sync::Mutex::new(Poison::new(
"stdout",
ProcessOutput {
stdout,
pending_responses: VecDeque::new(),
n_processed_responses: 0,
},
)),
#[cfg(feature = "testing")]
dump_sequence: AtomicUsize::default(),
})
}
pub(crate) fn id(&self) -> u32 {
self.child
.as_ref()
.expect("must not call this during Drop")
.id()
}
/// Apply given WAL records ('records') over an old page image. Returns
/// new page image.
///
/// # Cancel-Safety
///
/// Cancellation safe.
#[instrument(skip_all, fields(tenant_id=%self.tenant_shard_id.tenant_id, shard_id=%self.tenant_shard_id.shard_slug(), pid=%self.id()))]
pub(crate) async fn apply_wal_records(
&self,
rel: RelTag,
blknum: u32,
base_img: &Option<Bytes>,
records: &[(Lsn, NeonWalRecord)],
wal_redo_timeout: Duration,
) -> anyhow::Result<Bytes> {
let tag = protocol::BufferTag { rel, blknum };
// Serialize all the messages to send the WAL redo process first.
//
// This could be problematic if there are millions of records to replay,
// but in practice the number of records is usually so small that it doesn't
// matter, and it's better to keep this code simple.
//
// Most requests start with a before-image with BLCKSZ bytes, followed by
// by some other WAL records. Start with a buffer that can hold that
// comfortably.
let mut writebuf: Vec<u8> = Vec::with_capacity((BLCKSZ as usize) * 3);
protocol::build_begin_redo_for_block_msg(tag, &mut writebuf);
if let Some(img) = base_img {
protocol::build_push_page_msg(tag, img, &mut writebuf);
}
for (lsn, rec) in records.iter() {
if let NeonWalRecord::Postgres {
will_init: _,
rec: postgres_rec,
} = rec
{
protocol::build_apply_record_msg(*lsn, postgres_rec, &mut writebuf);
} else {
anyhow::bail!("tried to pass neon wal record to postgres WAL redo");
}
}
protocol::build_get_page_msg(tag, &mut writebuf);
WAL_REDO_RECORD_COUNTER.inc_by(records.len() as u64);
let Ok(res) =
tokio::time::timeout(wal_redo_timeout, self.apply_wal_records0(&writebuf)).await
else {
anyhow::bail!("WAL redo timed out");
};
if res.is_err() {
// not all of these can be caused by this particular input, however these are so rare
// in tests so capture all.
self.record_and_log(&writebuf);
}
res
}
/// # Cancel-Safety
///
/// When not polled to completion (e.g. because in `tokio::select!` another
/// branch becomes ready before this future), concurrent and subsequent
/// calls may fail due to [`utils::poison::Poison::check_and_arm`] calls.
/// Dispose of this process instance and create a new one.
async fn apply_wal_records0(&self, writebuf: &[u8]) -> anyhow::Result<Bytes> {
let request_no = {
let mut lock_guard = self.stdin.lock().await;
let mut poison_guard = lock_guard.check_and_arm()?;
let input = poison_guard.data_mut();
input
.stdin
.write_all(writebuf)
.await
.context("write to walredo stdin")?;
let request_no = input.n_requests;
input.n_requests += 1;
poison_guard.disarm();
request_no
};
// To improve walredo performance we separate sending requests and receiving
// responses. Them are protected by different mutexes (output and input).
// If thread T1, T2, T3 send requests D1, D2, D3 to walredo process
// then there is not warranty that T1 will first granted output mutex lock.
// To address this issue we maintain number of sent requests, number of processed
// responses and ring buffer with pending responses. After sending response
// (under input mutex), threads remembers request number. Then it releases
// input mutex, locks output mutex and fetch in ring buffer all responses until
// its stored request number. The it takes correspondent element from
// pending responses ring buffer and truncate all empty elements from the front,
// advancing processed responses number.
let mut lock_guard = self.stdout.lock().await;
let mut poison_guard = lock_guard.check_and_arm()?;
let output = poison_guard.data_mut();
let n_processed_responses = output.n_processed_responses;
while n_processed_responses + output.pending_responses.len() <= request_no {
// We expect the WAL redo process to respond with an 8k page image. We read it
// into this buffer.
let mut resultbuf = vec![0; BLCKSZ.into()];
output
.stdout
.read_exact(&mut resultbuf)
.await
.context("read walredo stdout")?;
output
.pending_responses
.push_back(Some(Bytes::from(resultbuf)));
}
// Replace our request's response with None in `pending_responses`.
// Then make space in the ring buffer by clearing out any seqence of contiguous
// `None`'s from the front of `pending_responses`.
// NB: We can't pop_front() because other requests' responses because another
// requester might have grabbed the output mutex before us:
// T1: grab input mutex
// T1: send request_no 23
// T1: release input mutex
// T2: grab input mutex
// T2: send request_no 24
// T2: release input mutex
// T2: grab output mutex
// T2: n_processed_responses + output.pending_responses.len() <= request_no
// 23 0 24
// T2: enters poll loop that reads stdout
// T2: put response for 23 into pending_responses
// T2: put response for 24 into pending_resposnes
// pending_responses now looks like this: Front Some(response_23) Some(response_24) Back
// T2: takes its response_24
// pending_responses now looks like this: Front Some(response_23) None Back
// T2: does the while loop below
// pending_responses now looks like this: Front Some(response_23) None Back
// T2: releases output mutex
// T1: grabs output mutex
// T1: n_processed_responses + output.pending_responses.len() > request_no
// 23 2 23
// T1: skips poll loop that reads stdout
// T1: takes its response_23
// pending_responses now looks like this: Front None None Back
// T2: does the while loop below
// pending_responses now looks like this: Front Back
// n_processed_responses now has value 25
let res = output.pending_responses[request_no - n_processed_responses]
.take()
.expect("we own this request_no, nobody else is supposed to take it");
while let Some(front) = output.pending_responses.front() {
if front.is_none() {
output.pending_responses.pop_front();
output.n_processed_responses += 1;
} else {
break;
}
}
poison_guard.disarm();
Ok(res)
}
#[cfg(feature = "testing")]
fn record_and_log(&self, writebuf: &[u8]) {
use std::sync::atomic::Ordering;
let millis = std::time::SystemTime::now()
.duration_since(std::time::SystemTime::UNIX_EPOCH)
.unwrap()
.as_millis();
let seq = self.dump_sequence.fetch_add(1, Ordering::Relaxed);
// these files will be collected to an allure report
let filename = format!("walredo-{millis}-{}-{seq}.walredo", writebuf.len());
let path = self.conf.tenant_path(&self.tenant_shard_id).join(&filename);
use std::io::Write;
let res = std::fs::OpenOptions::new()
.write(true)
.create_new(true)
.read(true)
.open(path)
.and_then(|mut f| f.write_all(writebuf));
// trip up allowed_errors
if let Err(e) = res {
tracing::error!(target=%filename, length=writebuf.len(), "failed to write out the walredo errored input: {e}");
} else {
tracing::error!(filename, "erroring walredo input saved");
}
}
#[cfg(not(feature = "testing"))]
fn record_and_log(&self, _: &[u8]) {}
}
impl Drop for WalRedoProcess {
fn drop(&mut self) {
self.child
.take()
.expect("we only do this once")
.kill_and_wait(WalRedoKillCause::WalRedoProcessDrop);
// no way to wait for stderr_logger_task from Drop because that is async only
}
}

View File

@@ -1,78 +0,0 @@
From 0b0194a57bd0f3598bd57dbedd0df3932330169d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 2 Feb 2024 22:26:45 +0200
Subject: [PATCH 1/1] Make v0.6.0 work with Neon
Now that the WAL-logging happens as a separate step at the end of the
build, we need a few neon-specific hints to make it work.
---
src/hnswbuild.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/src/hnswbuild.c b/src/hnswbuild.c
index 680789b..ec54dea 100644
--- a/src/hnswbuild.c
+++ b/src/hnswbuild.c
@@ -840,9 +840,17 @@ HnswParallelBuildMain(dsm_segment *seg, shm_toc *toc)
hnswarea = shm_toc_lookup(toc, PARALLEL_KEY_HNSW_AREA, false);
+#ifdef NEON_SMGR
+ smgr_start_unlogged_build(RelationGetSmgr(indexRel));
+#endif
+
/* Perform inserts */
HnswParallelScanAndInsert(heapRel, indexRel, hnswshared, hnswarea, false);
+#ifdef NEON_SMGR
+ smgr_finish_unlogged_build_phase_1(RelationGetSmgr(indexRel));
+#endif
+
/* Close relations within worker */
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
@@ -1089,13 +1097,41 @@ BuildIndex(Relation heap, Relation index, IndexInfo *indexInfo,
SeedRandom(42);
#endif
+#ifdef NEON_SMGR
+ smgr_start_unlogged_build(RelationGetSmgr(index));
+#endif
+
InitBuildState(buildstate, heap, index, indexInfo, forkNum);
BuildGraph(buildstate, forkNum);
+#ifdef NEON_SMGR
+ smgr_finish_unlogged_build_phase_1(RelationGetSmgr(index));
+#endif
+
if (RelationNeedsWAL(index))
+ {
log_newpage_range(index, forkNum, 0, RelationGetNumberOfBlocks(index), true);
+#ifdef NEON_SMGR
+ {
+#if PG_VERSION_NUM >= 160000
+ RelFileLocator rlocator = RelationGetSmgr(index)->smgr_rlocator.locator;
+#else
+ RelFileNode rlocator = RelationGetSmgr(index)->smgr_rnode.node;
+#endif
+
+ SetLastWrittenLSNForBlockRange(XactLastRecEnd, rlocator,
+ MAIN_FORKNUM, 0, RelationGetNumberOfBlocks(index));
+ SetLastWrittenLSNForRelation(XactLastRecEnd, rlocator, MAIN_FORKNUM);
+ }
+#endif
+ }
+
+#ifdef NEON_SMGR
+ smgr_end_unlogged_build(RelationGetSmgr(index));
+#endif
+
FreeBuildState(buildstate);
}
--
2.39.2

View File

@@ -295,10 +295,18 @@ extern void neon_immedsync(SMgrRelation reln, ForkNumber forknum);
/* utils for neon relsize cache */
extern void relsize_hash_init(void);
extern bool get_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber *size);
extern void set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size);
extern bool set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber new_size, BlockNumber* old_size);
extern void update_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size);
extern void forget_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum);
extern bool start_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blocknum, BlockNumber* relsize);
extern bool is_unlogged_build_extend(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blocknum, BlockNumber* relsize);
extern bool is_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber* relsize);
extern bool stop_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum);
extern void resume_unlogged_build(void);
/* functions for local file cache */
#if PG_MAJORVERSION_NUM < 16
extern void lfc_write(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,

View File

@@ -10,10 +10,6 @@
* Temporary and unlogged tables are stored locally, by md.c. The functions
* here just pass the calls through to corresponding md.c functions.
*
* Index build operations that use the buffer cache are also handled locally,
* just like unlogged tables. Such operations must be marked by calling
* smgr_start_unlogged_build() and friends.
*
* In order to know what relations are permanent and which ones are not, we
* have added a 'smgr_relpersistence' field to SmgrRelationData, and it is set
* by smgropen() callers, when they have the relcache entry at hand. However,
@@ -64,6 +60,7 @@
#include "storage/fsm_internals.h"
#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/rel.h"
#include "pagestore_client.h"
@@ -100,17 +97,7 @@ const int SmgrTrace = DEBUG5;
page_server_api *page_server;
/* unlogged relation build states */
typedef enum
{
UNLOGGED_BUILD_NOT_IN_PROGRESS = 0,
UNLOGGED_BUILD_PHASE_1,
UNLOGGED_BUILD_PHASE_2,
UNLOGGED_BUILD_NOT_PERMANENT
} UnloggedBuildPhase;
static SMgrRelation unlogged_build_rel = NULL;
static UnloggedBuildPhase unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
const PGAlignedBlock zero_buffer;
static bool neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id);
static bool (*old_redo_read_buffer_filter) (XLogReaderState *record, uint8 block_id) = NULL;
@@ -605,7 +592,7 @@ prefetch_read(PrefetchRequest *slot)
}
else
{
neon_shard_log(slot->shard_no, WARNING,
neon_shard_log(slot->shard_no, LOG,
"No response from reading prefetch entry %lu: %u/%u/%u.%u block %u. This can be caused by a concurrent disconnect",
(long)slot->my_ring_index,
RelFileInfoFmt(BufTagGetNRelFileInfo(slot->buftag)),
@@ -1406,6 +1393,20 @@ PageIsEmptyHeapPage(char *buffer)
* A page is being evicted from the shared buffer cache. Update the
* last-written LSN of the page, and WAL-log it if needed.
*/
static void
unlogged_extend(SMgrRelation reln, ForkNumber forknum, BlockNumber old_relsize, BlockNumber new_relsize)
{
if (new_relsize > old_relsize)
{
#if PG_MAJORVERSION_NUM < 16
mdextend(reln, forknum, new_relsize, (char *) zero_buffer.data, true);
#else
mdzeroextend(reln, forknum, old_relsize, new_relsize - old_relsize, true);
#endif
}
}
static void
#if PG_MAJORVERSION_NUM < 16
neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool force)
@@ -1413,6 +1414,7 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, ch
neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const char *buffer, bool force)
#endif
{
BlockNumber relsize;
XLogRecPtr lsn = PageGetLSN((Page) buffer);
bool log_page;
@@ -1479,6 +1481,7 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, co
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum)));
}
#if 0
else if (PageIsEmptyHeapPage((Page) buffer))
{
ereport(SmgrTrace,
@@ -1487,34 +1490,95 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, co
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum)));
}
#endif
else if (forknum != FSM_FORKNUM && forknum != VISIBILITYMAP_FORKNUM)
{
/*
* Its a bad sign if there is a page with zero LSN in the buffer
* cache in a standby, too. However, PANICing seems like a cure
* worse than the disease, as the damage has likely already been
* done in the primary. So in a standby, make this an assertion,
* and in a release build just LOG the error and soldier on. We
* update the last-written LSN of the page with a conservative
* value in that case, which is the last replayed LSN.
*/
ereport(RecoveryInProgress() ? LOG : PANIC,
(errmsg(NEON_TAG "Page %u of relation %u/%u/%u.%u is evicted with zero LSN",
if (start_unlogged_build(InfoFromSMgrRel(reln), forknum, blocknum, &relsize))
{
mdcreate(reln, forknum, true);
if (relsize != 0)
unlogged_extend(reln, forknum, 0, relsize);
elog(SmgrTrace, "neon_wallog_page: start unlogged %u/%u/%u.%u blk %u, relsize %u",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, blocknum, relsize);
}
else
elog(SmgrTrace, "neon_wallog_page: continue unlogged %u/%u/%u.%u blk %u, relsize %u",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, blocknum, relsize);
if (blocknum >= relsize)
{
unlogged_extend(reln, forknum, relsize, blocknum+1);
}
mdwrite(reln, forknum, blocknum, buffer, true);
resume_unlogged_build();
ereport(SmgrTrace,
(errmsg(NEON_TAG "Page %u of relation %u/%u/%u.%u is saved locally.",
blocknum,
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum)));
Assert(false);
lsn = GetXLogReplayRecPtr(NULL); /* in standby mode, soldier on */
}
}
else
else if (lsn < FirstNormalUnloggedLSN)
{
if (start_unlogged_build(InfoFromSMgrRel(reln),forknum, blocknum, &relsize))
{
mdcreate(reln, forknum, true);
if (relsize != 0)
unlogged_extend(reln, forknum, 0, relsize);
elog(SmgrTrace, "neon_wallog_page: start unlogged %u/%u/%u.%u blk %u, relsize %u, LSN %X",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, blocknum, relsize, (unsigned)lsn);
}
else
elog(SmgrTrace, "neon_wallog_page: continue unlogged %u/%u/%u.%u blk %u, relsize %u, LSN %X",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, blocknum, relsize, (unsigned)lsn);
if (blocknum >= relsize)
{
unlogged_extend(reln, forknum, relsize, blocknum+1);
}
mdwrite(reln, forknum, blocknum, buffer, true);
resume_unlogged_build();
ereport(SmgrTrace,
(errmsg(NEON_TAG "Evicting page %u of relation %u/%u/%u.%u with lsn=%X/%X",
(errmsg(NEON_TAG "Page %u of relation %u/%u/%u.%u is saved locally.",
blocknum,
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, LSN_FORMAT_ARGS(lsn))));
forknum)));
}
else
{
if (is_unlogged_build_extend(InfoFromSMgrRel(reln), forknum, blocknum, &relsize))
{
elog(SmgrTrace, "neon_wallog_page: unlogged extend %u/%u/%u.%u blk %u, relsize %u",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, blocknum, relsize);
if (blocknum >= relsize)
{
unlogged_extend(reln, forknum, relsize, blocknum+1);
}
mdwrite(reln, forknum, blocknum, buffer, true);
ereport(SmgrTrace,
(errmsg(NEON_TAG "Page %u with LSN=%X/%X of relation %u/%u/%u.%u is saved locally.",
blocknum,
LSN_FORMAT_ARGS(lsn),
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum)));
}
else
{
ereport(SmgrTrace,
(errmsg(NEON_TAG "Page %u of relation %u/%u/%u.%u is already wal-logged at lsn=%X/%X",
blocknum,
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, LSN_FORMAT_ARGS(lsn)
)));
}
resume_unlogged_build();
}
/*
@@ -1524,6 +1588,27 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, co
SetLastWrittenLSNForBlock(lsn, InfoFromSMgrRel(reln), forknum, blocknum);
}
/*
* Check if unlogged build is in progress for specified relation
* and stop it if so. It is used as callback for log_newpage_range( function
* which is called at the end of unlogged build.
*/
static void
neon_log_newpage_range_callback(Relation rel, ForkNumber forknum)
{
SMgrRelation smgr = RelationGetSmgr(rel);
if (stop_unlogged_build(InfoFromSMgrRel(smgr), forknum))
{
mdclose(smgr, forknum);
/* use isRedo == true, so that we drop it immediately */
mdunlink(InfoBFromSMgrRel(smgr), forknum, true);
resume_unlogged_build(); /* doesn't actually resume build, just release lock */
}
}
/*
* neon_init() -- Initialize private state
*/
@@ -1559,6 +1644,8 @@ neon_init(void)
old_redo_read_buffer_filter = redo_read_buffer_filter;
redo_read_buffer_filter = neon_redo_read_buffer_filter;
log_newpage_range_callback = neon_log_newpage_range_callback;
#ifdef DEBUG_COMPARE_LOCAL
mdinit();
#endif
@@ -2021,7 +2108,7 @@ neon_create(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
&reln->smgr_cached_nblocks[forkNum]);
}
else
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, 0);
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, 0, NULL);
#ifdef DEBUG_COMPARE_LOCAL
if (IS_LOCAL_REL(reln))
@@ -2081,6 +2168,7 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
#endif
{
XLogRecPtr lsn;
BlockNumber old_relsize;
BlockNumber n_blocks = 0;
switch (reln->smgr_relpersistence)
@@ -2132,7 +2220,14 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
neon_wallog_page(reln, forkNum, n_blocks++, buffer, true);
neon_wallog_page(reln, forkNum, blkno, buffer, false);
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blkno + 1);
if (set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blkno + 1, &old_relsize))
{
unlogged_extend(reln, forkNum, old_relsize, blkno + 1);
resume_unlogged_build();
}
else /* Do not store pages during unlogedbuild in LFC two avoid double local storage consumption */
lfc_write(InfoFromSMgrRel(reln), forkNum, blkno, buffer);
lsn = PageGetLSN((Page) buffer);
neon_log(SmgrTrace, "smgrextend called for %u/%u/%u.%u blk %u, page LSN: %X/%08X",
@@ -2140,8 +2235,6 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
forkNum, blkno,
(uint32) (lsn >> 32), (uint32) lsn);
lfc_write(InfoFromSMgrRel(reln), forkNum, blkno, buffer);
#ifdef DEBUG_COMPARE_LOCAL
if (IS_LOCAL_REL(reln))
mdextend(reln, forkNum, blkno, buffer, skipFsync);
@@ -2167,9 +2260,10 @@ void
neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
int nblocks, bool skipFsync)
{
const PGAlignedBlock buffer = {0};
int remblocks = nblocks;
BlockNumber old_relsize;
BlockNumber remblocks = nblocks;
XLogRecPtr lsn = 0;
bool unlogged = false;
switch (reln->smgr_relpersistence)
{
@@ -2218,8 +2312,29 @@ neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
if (!XLogInsertAllowed())
return;
/* ensure we have enough xlog buffers to log max-sized records */
XLogEnsureRecordSpace(Min(remblocks, (XLR_MAX_BLOCK_ID - 1)), 0);
if (set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blocknum + nblocks, &old_relsize))
{
unlogged_extend(reln, forkNum, old_relsize, blocknum + nblocks);
resume_unlogged_build();
unlogged = true;
}
if (forkNum != MAIN_FORKNUM) /* no need to wal-log zero pages except VM/FSM forks */
{
/* ensure we have enough xlog buffers to log max-sized records */
XLogEnsureRecordSpace(Min(remblocks, (XLR_MAX_BLOCK_ID - 1)), 0);
}
else
{
/*
* smgr_extend is often called with an all-zeroes page, so
* lsn==InvalidXLogRecPtr. An smgr_write() call will come for the buffer
* later, after it has been initialized with the real page contents, and
* it is eventually evicted from the buffer cache. But we need a valid LSN
* to the relation metadata update now.
*/
lsn = GetXLogInsertRecPtr();
}
/*
* Iterate over all the pages. They are collected into batches of
@@ -2230,17 +2345,20 @@ neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
{
int count = Min(remblocks, XLR_MAX_BLOCK_ID);
XLogBeginInsert();
if (forkNum != MAIN_FORKNUM) /* no need to wal-log zero pages except VM/FSM forks */
{
XLogBeginInsert();
for (int i = 0; i < count; i++)
XLogRegisterBlock(i, &InfoFromSMgrRel(reln), forkNum, blocknum + i,
(char *) buffer.data, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
lsn = XLogInsert(RM_XLOG_ID, XLOG_FPI);
for (int i = 0; i < count; i++)
XLogRegisterBlock(i, &InfoFromSMgrRel(reln), forkNum, blocknum + i,
(char *) zero_buffer.data, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
lsn = XLogInsert(RM_XLOG_ID, XLOG_FPI);
}
for (int i = 0; i < count; i++)
{
lfc_write(InfoFromSMgrRel(reln), forkNum, blocknum + i, buffer.data);
if (!unlogged) /* Do not store pages during unlogedbuild in LFC two avoid double local storage consumption */
lfc_write(InfoFromSMgrRel(reln), forkNum, blocknum + i, zero_buffer.data);
SetLastWrittenLSNForBlock(lsn, InfoFromSMgrRel(reln), forkNum,
blocknum + i);
}
@@ -2252,7 +2370,6 @@ neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
Assert(lsn != 0);
SetLastWrittenLSNForRelation(lsn, InfoFromSMgrRel(reln), forkNum);
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blocknum);
}
#endif
@@ -2519,6 +2636,7 @@ neon_read(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno, void *buffer
#endif
{
neon_request_lsns request_lsns;
BlockNumber relsize;
switch (reln->smgr_relpersistence)
{
@@ -2537,15 +2655,33 @@ neon_read(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno, void *buffer
neon_log(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
}
/* Try to read from local file cache */
if (lfc_read(InfoFromSMgrRel(reln), forkNum, blkno, buffer))
if (is_unlogged_build(InfoFromSMgrRel(reln), forkNum, &relsize))
{
return;
if (blkno >= relsize)
{
elog(SmgrTrace, "Get empty local page %d of relation %u/%u/%u.%u",
blkno, RelFileInfoFmt(InfoFromSMgrRel(reln)), forkNum);
memset(buffer, 0, BLCKSZ);
}
else
{
elog(SmgrTrace, "Read local page %d of relation %u/%u/%u.%u",
blkno, RelFileInfoFmt(InfoFromSMgrRel(reln)), forkNum);
mdread(reln, forkNum, blkno, buffer);
}
resume_unlogged_build();
}
else
{
/* Try to read from local file cache */
if (lfc_read(InfoFromSMgrRel(reln), forkNum, blkno, buffer))
{
return;
}
request_lsns = neon_get_request_lsns(InfoFromSMgrRel(reln), forkNum, blkno);
neon_read_at_lsn(InfoFromSMgrRel(reln), forkNum, blkno, request_lsns, buffer);
request_lsns = neon_get_request_lsns(InfoFromSMgrRel(reln), forkNum, blkno);
neon_read_at_lsn(InfoFromSMgrRel(reln), forkNum, blkno, request_lsns, buffer);
}
#ifdef DEBUG_COMPARE_LOCAL
if (forkNum == MAIN_FORKNUM && IS_LOCAL_REL(reln))
{
@@ -2655,24 +2791,36 @@ neon_write(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *bu
neon_write(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const void *buffer, bool skipFsync)
#endif
{
BlockNumber relsize;
XLogRecPtr lsn;
bool unlogged = false;
switch (reln->smgr_relpersistence)
{
case 0:
/* This is a bit tricky. Check if the relation exists locally */
if (mdexists(reln, forknum))
if (is_unlogged_build_extend(InfoFromSMgrRel(reln), forknum, blocknum, &relsize))
{
/* It exists locally. Guess it's unlogged then. */
if (blocknum >= relsize)
{
unlogged_extend(reln, forknum, relsize, blocknum+1);
}
unlogged = true;
elog(SmgrTrace, "neon_write: extend %u/%u/%u.%u blk %u, relsize %u",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, blocknum, relsize);
} else {
unlogged = mdexists(reln, forknum);
}
if (unlogged)
{
elog(SmgrTrace, "neon_write: mdwrite %u/%u/%u.%u blk %u",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, blocknum);
mdwrite(reln, forknum, blocknum, buffer, skipFsync);
/*
* We could set relpersistence now that we have determined
* that it's local. But we don't dare to do it, because that
* would immediately allow reads as well, which shouldn't
* happen. We could cache it with a different 'relpersistence'
* value, but this isn't performance critical.
*/
}
resume_unlogged_build();
if (unlogged)
{
return;
}
break;
@@ -2864,7 +3012,7 @@ neon_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
neon_log(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
}
set_cached_relsize(InfoFromSMgrRel(reln), forknum, nblocks);
set_cached_relsize(InfoFromSMgrRel(reln), forknum, nblocks, NULL);
/*
* Truncating a relation drops all its buffers from the buffer cache
@@ -2920,6 +3068,13 @@ neon_immedsync(SMgrRelation reln, ForkNumber forknum)
break;
case RELPERSISTENCE_PERMANENT:
if (stop_unlogged_build(InfoFromSMgrRel(reln), forknum))
{
mdclose(reln, forknum);
/* use isRedo == true, so that we drop it immediately */
mdunlink(InfoBFromSMgrRel(reln), forknum, true);
resume_unlogged_build(); /* doesn't actually resume build, just release lock */
}
break;
case RELPERSISTENCE_TEMP:
@@ -2939,150 +3094,6 @@ neon_immedsync(SMgrRelation reln, ForkNumber forknum)
#endif
}
/*
* neon_start_unlogged_build() -- Starting build operation on a rel.
*
* Some indexes are built in two phases, by first populating the table with
* regular inserts, using the shared buffer cache but skipping WAL-logging,
* and WAL-logging the whole relation after it's done. Neon relies on the
* WAL to reconstruct pages, so we cannot use the page server in the
* first phase when the changes are not logged.
*/
static void
neon_start_unlogged_build(SMgrRelation reln)
{
/*
* Currently, there can be only one unlogged relation build operation in
* progress at a time. That's enough for the current usage.
*/
if (unlogged_build_phase != UNLOGGED_BUILD_NOT_IN_PROGRESS)
neon_log(ERROR, "unlogged relation build is already in progress");
Assert(unlogged_build_rel == NULL);
ereport(SmgrTrace,
(errmsg(NEON_TAG "starting unlogged build of relation %u/%u/%u",
RelFileInfoFmt(InfoFromSMgrRel(reln)))));
switch (reln->smgr_relpersistence)
{
case 0:
neon_log(ERROR, "cannot call smgr_start_unlogged_build() on rel with unknown persistence");
break;
case RELPERSISTENCE_PERMANENT:
break;
case RELPERSISTENCE_TEMP:
case RELPERSISTENCE_UNLOGGED:
unlogged_build_rel = reln;
unlogged_build_phase = UNLOGGED_BUILD_NOT_PERMANENT;
return;
default:
neon_log(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
}
if (smgrnblocks(reln, MAIN_FORKNUM) != 0)
neon_log(ERROR, "cannot perform unlogged index build, index is not empty ");
unlogged_build_rel = reln;
unlogged_build_phase = UNLOGGED_BUILD_PHASE_1;
/* Make the relation look like it's unlogged */
reln->smgr_relpersistence = RELPERSISTENCE_UNLOGGED;
/*
* Create the local file. In a parallel build, the leader is expected to
* call this first and do it.
*
* FIXME: should we pass isRedo true to create the tablespace dir if it
* doesn't exist? Is it needed?
*/
if (!IsParallelWorker())
mdcreate(reln, MAIN_FORKNUM, false);
}
/*
* neon_finish_unlogged_build_phase_1()
*
* Call this after you have finished populating a relation in unlogged mode,
* before you start WAL-logging it.
*/
static void
neon_finish_unlogged_build_phase_1(SMgrRelation reln)
{
Assert(unlogged_build_rel == reln);
ereport(SmgrTrace,
(errmsg(NEON_TAG "finishing phase 1 of unlogged build of relation %u/%u/%u",
RelFileInfoFmt(InfoFromSMgrRel(reln)))));
if (unlogged_build_phase == UNLOGGED_BUILD_NOT_PERMANENT)
return;
Assert(unlogged_build_phase == UNLOGGED_BUILD_PHASE_1);
Assert(reln->smgr_relpersistence == RELPERSISTENCE_UNLOGGED);
/*
* In a parallel build, (only) the leader process performs the 2nd
* phase.
*/
if (IsParallelWorker())
{
unlogged_build_rel = NULL;
unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
}
else
unlogged_build_phase = UNLOGGED_BUILD_PHASE_2;
}
/*
* neon_end_unlogged_build() -- Finish an unlogged rel build.
*
* Call this after you have finished WAL-logging an relation that was
* first populated without WAL-logging.
*
* This removes the local copy of the rel, since it's now been fully
* WAL-logged and is present in the page server.
*/
static void
neon_end_unlogged_build(SMgrRelation reln)
{
NRelFileInfoBackend rinfob = InfoBFromSMgrRel(reln);
Assert(unlogged_build_rel == reln);
ereport(SmgrTrace,
(errmsg(NEON_TAG "ending unlogged build of relation %u/%u/%u",
RelFileInfoFmt(InfoFromNInfoB(rinfob)))));
if (unlogged_build_phase != UNLOGGED_BUILD_NOT_PERMANENT)
{
Assert(unlogged_build_phase == UNLOGGED_BUILD_PHASE_2);
Assert(reln->smgr_relpersistence == RELPERSISTENCE_UNLOGGED);
/* Make the relation look permanent again */
reln->smgr_relpersistence = RELPERSISTENCE_PERMANENT;
/* Remove local copy */
rinfob = InfoBFromSMgrRel(reln);
for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
neon_log(SmgrTrace, "forgetting cached relsize for %u/%u/%u.%u",
RelFileInfoFmt(InfoFromNInfoB(rinfob)),
forknum);
forget_cached_relsize(InfoFromNInfoB(rinfob), forknum);
mdclose(reln, forknum);
/* use isRedo == true, so that we drop it immediately */
mdunlink(rinfob, forknum, true);
}
}
unlogged_build_rel = NULL;
unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
}
#define STRPREFIX(str, prefix) (strncmp(str, prefix, strlen(prefix)) == 0)
static int
@@ -3176,40 +3187,6 @@ neon_read_slru_segment(SMgrRelation reln, const char* path, int segno, void* buf
return n_blocks;
}
static void
AtEOXact_neon(XactEvent event, void *arg)
{
switch (event)
{
case XACT_EVENT_ABORT:
case XACT_EVENT_PARALLEL_ABORT:
/*
* Forget about any build we might have had in progress. The local
* file will be unlinked by smgrDoPendingDeletes()
*/
unlogged_build_rel = NULL;
unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
break;
case XACT_EVENT_COMMIT:
case XACT_EVENT_PARALLEL_COMMIT:
case XACT_EVENT_PREPARE:
case XACT_EVENT_PRE_COMMIT:
case XACT_EVENT_PARALLEL_PRE_COMMIT:
case XACT_EVENT_PRE_PREPARE:
if (unlogged_build_phase != UNLOGGED_BUILD_NOT_IN_PROGRESS)
{
unlogged_build_rel = NULL;
unlogged_build_phase = UNLOGGED_BUILD_NOT_IN_PROGRESS;
ereport(ERROR,
(errcode(ERRCODE_INTERNAL_ERROR),
(errmsg(NEON_TAG "unlogged index build was not properly finished"))));
}
break;
}
}
static const struct f_smgr neon_smgr =
{
.smgr_init = neon_init,
@@ -3231,10 +3208,6 @@ static const struct f_smgr neon_smgr =
.smgr_truncate = neon_truncate,
.smgr_immedsync = neon_immedsync,
.smgr_start_unlogged_build = neon_start_unlogged_build,
.smgr_finish_unlogged_build_phase_1 = neon_finish_unlogged_build_phase_1,
.smgr_end_unlogged_build = neon_end_unlogged_build,
.smgr_read_slru_segment = neon_read_slru_segment,
};
@@ -3252,8 +3225,6 @@ smgr_neon(BackendId backend, NRelFileInfo rinfo)
void
smgr_init_neon(void)
{
RegisterXactCallback(AtEOXact_neon, NULL);
smgr_init_standard();
neon_init();
}
@@ -3304,7 +3275,7 @@ neon_extend_rel_size(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
relsize = Max(nbresponse->n_blocks, blkno + 1);
set_cached_relsize(rinfo, forknum, relsize);
set_cached_relsize(rinfo, forknum, relsize, NULL);
SetLastWrittenLSNForRelation(end_recptr, rinfo, forknum);
neon_log(SmgrTrace, "Set length to %d", relsize);

View File

@@ -39,7 +39,8 @@ typedef struct
typedef struct
{
RelTag tag;
BlockNumber size;
BlockNumber size : 31;
BlockNumber unlogged : 1;
dlist_node lru_node; /* LRU list node */
} RelSizeEntry;
@@ -117,9 +118,12 @@ get_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber *size)
*size = entry->size;
relsize_ctl->hits += 1;
found = true;
/* Move entry to the LRU list tail */
dlist_delete(&entry->lru_node);
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
if (!entry->unlogged) /* entries of relation involved in unlogged build are pinned */
{
/* Move entry to the LRU list tail */
dlist_delete(&entry->lru_node);
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
}
}
else
{
@@ -130,9 +134,15 @@ get_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber *size)
return found;
}
void
set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size)
/*
* Cache relation size.
* Returns true if it happens during unlogged build.
* In this case lock is not released.
*/
bool
set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber new_size, BlockNumber* old_size)
{
bool unlogged = false;
if (relsize_hash_size > 0)
{
RelTag tag;
@@ -148,34 +158,69 @@ set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size)
*/
while ((entry = hash_search(relsize_hash, &tag, HASH_ENTER_NULL, &found)) == NULL)
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
Assert(relsize_ctl->size > 0);
relsize_ctl->size -= 1;
if (dlist_is_empty(&relsize_ctl->lru))
{
elog(FATAL, "No more free relsize cache entries");
}
else
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
Assert(relsize_ctl->size > 0);
relsize_ctl->size -= 1;
}
}
entry->size = size;
if (old_size)
{
*old_size = found ? entry->size : 0;
}
entry->size = new_size;
if (!found)
{
if (++relsize_ctl->size == relsize_hash_size)
entry->unlogged = false;
if (relsize_ctl->size+1 == relsize_hash_size)
{
/*
* Remove least recently used elment from the hash.
* Hash size after is becomes `relsize_hash_size-1`.
* But it is not considered to be a problem, because size of this hash is expecrted large enough and +-1 doesn't matter.
*/
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
relsize_ctl->size -= 1;
if (dlist_is_empty(&relsize_ctl->lru))
{
elog(FATAL, "No more free relsize cache entries");
}
else
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
}
}
else
{
relsize_ctl->size += 1;
}
}
else
else if (entry->unlogged) /* entries of relation involved in unlogged build are pinned */
{
dlist_delete(&entry->lru_node);
}
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
if (!entry->unlogged) /* entries of relation involved in unlogged build are pinned */
{
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
}
else
{
Assert(old_size);
unlogged = true;
}
relsize_ctl->writes += 1;
LWLockRelease(relsize_lock);
if (!unlogged)
{
LWLockRelease(relsize_lock);
}
}
return unlogged;
}
void
@@ -191,23 +236,42 @@ update_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size)
tag.forknum = forknum;
LWLockAcquire(relsize_lock, LW_EXCLUSIVE);
entry = hash_search(relsize_hash, &tag, HASH_ENTER, &found);
if (!found || entry->size < size)
if (!found) {
entry->unlogged = false;
entry->size = size;
if (!found)
{
if (++relsize_ctl->size == relsize_hash_size)
if (relsize_ctl->size+1 == relsize_hash_size)
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
relsize_ctl->size -= 1;
if (dlist_is_empty(&relsize_ctl->lru))
{
elog(FATAL, "No more free relsize cache entries");
}
else
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
}
}
else
{
relsize_ctl->size += 1;
}
}
else
{
dlist_delete(&entry->lru_node);
if (entry->size < size)
entry->size = size;
if (!entry->unlogged) /* entries of relation involved in unlogged build are pinned */
{
dlist_delete(&entry->lru_node);
}
}
relsize_ctl->writes += 1;
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
if (!entry->unlogged) /* entries of relation involved in unlogged build are pinned */
{
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
}
LWLockRelease(relsize_lock);
}
}
@@ -225,13 +289,238 @@ forget_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum)
entry = hash_search(relsize_hash, &tag, HASH_REMOVE, NULL);
if (entry)
{
dlist_delete(&entry->lru_node);
if (!entry->unlogged)
{
/* Entried of relations involved in unlogged build are pinned */
dlist_delete(&entry->lru_node);
}
relsize_ctl->size -= 1;
}
LWLockRelease(relsize_lock);
}
}
/*
* This function starts unlogged build if it was not yet started.
* The criteria for starting iunlogged build is writing page without normal LSN.
* It can happen in any backend when page is evicted from shared buffers.
* Or can not happen at all if index fits in shared buffers.
*
* If this function really starts unlogged build, then it returns true, remove entry from LRU list
* protecting it from eviction until the end of unlogged build.
* Also it keeps lock on relsize hash. This lock should be later released using resume_unlogged_build().
* It allows caller to perform some actions
* in critical section, for example right now it create relation on the disk using mdcreate
*/
bool
start_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blocknum, BlockNumber* relsize)
{
bool start = false;
if (relsize_hash_size > 0)
{
RelTag tag;
RelSizeEntry *entry;
bool found;
tag.rinfo = rinfo;
tag.forknum = forknum;
LWLockAcquire(relsize_lock, LW_EXCLUSIVE);
entry = hash_search(relsize_hash, &tag, HASH_ENTER, &found);
if (!found) {
*relsize = 0;
entry->size = blocknum + 1;
start = true;
if (relsize_ctl->size+1 == relsize_hash_size)
{
if (dlist_is_empty(&relsize_ctl->lru))
{
elog(FATAL, "No more free relsize cache entries");
}
else
{
RelSizeEntry *victim = dlist_container(RelSizeEntry, lru_node, dlist_pop_head_node(&relsize_ctl->lru));
hash_search(relsize_hash, &victim->tag, HASH_REMOVE, NULL);
}
}
else
{
relsize_ctl->size += 1;
}
}
else
{
start = !entry->unlogged;
*relsize = entry->size;
if (entry->size <= blocknum)
{
entry->size = blocknum + 1;
}
if (start)
{
/* relation involved in unlogged build are pinned until the end of the build */
dlist_delete(&entry->lru_node);
}
}
entry->unlogged = true;
relsize_ctl->writes += 1;
/*
* We are not putting entry in LRU least to prevent it fro eviction until the end of unlogged build
*/
if (start)
elog(LOG, "Start unlogged build for %u/%u/%u.%u",
RelFileInfoFmt(rinfo), forknum);
}
return start;
}
/*
* Check if unlogged build is in progress.
* If so, true is returned and lock on relsize cache is hold.
* It should be later released by calling resume_unlogged_build().
* It allows to read page from local file without risk that it is removed by stop_unlogged_build by some other backend.
*/
bool
is_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber* relsize)
{
bool unlogged = false;
if (relsize_hash_size > 0)
{
RelTag tag;
RelSizeEntry *entry;
tag.rinfo = rinfo;
tag.forknum = forknum;
LWLockAcquire(relsize_lock, LW_SHARED);
entry = hash_search(relsize_hash, &tag, HASH_FIND, NULL);
if (entry != NULL)
{
unlogged = entry->unlogged;
*relsize = entry->size;
relsize_ctl->hits += 1;
}
else
{
relsize_ctl->misses += 1;
}
if (!unlogged)
LWLockRelease(relsize_lock);
}
return unlogged;
}
/*
* Check if relation is extended during unlogged build.
* This function object lock on relsize cache which
* should be later released by calling resume_unlogged_build().
* It allows to atomically extend local file.
*/
bool
is_unlogged_build_extend(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blocknum, BlockNumber* relsize)
{
bool unlogged = false;
if (relsize_hash_size > 0)
{
RelTag tag;
RelSizeEntry *entry;
tag.rinfo = rinfo;
tag.forknum = forknum;
LWLockAcquire(relsize_lock, LW_SHARED);
entry = hash_search(relsize_hash, &tag, HASH_FIND, NULL);
if (entry != NULL)
{
if (entry->size <= blocknum)
{
/* Very rare case: it can happen only if relation is thrown away from relcache before unlogged build is detected */
/* Repeat search under exclusive lock */
LWLockRelease(relsize_lock);
LWLockAcquire(relsize_lock, LW_EXCLUSIVE);
entry = hash_search(relsize_hash, &tag, HASH_FIND, NULL);
if (entry == NULL)
{
relsize_ctl->misses += 1;
LWLockRelease(relsize_lock);
return false;
}
}
unlogged = entry->unlogged;
*relsize = entry->size;
if (entry->size <= blocknum)
{
entry->size = blocknum + 1;
}
relsize_ctl->hits += 1;
}
else
{
relsize_ctl->misses += 1;
}
}
return unlogged;
}
/*
* Check if unlogged build is in progress and if so, clear the flag and return entry to LRU list.
* If it was unlogged build, true is returned and lock on relsize cache is hold.
* It should be later released by calling resume_unlogged_build().
* It allows to atomically unlink local file.
*/
bool
stop_unlogged_build(NRelFileInfo rinfo, ForkNumber forknum)
{
bool unlogged = false;
if (relsize_hash_size > 0)
{
RelTag tag;
RelSizeEntry *entry;
tag.rinfo = rinfo;
tag.forknum = forknum;
LWLockAcquire(relsize_lock, LW_EXCLUSIVE);
entry = hash_search(relsize_hash, &tag, HASH_FIND, NULL);
if (entry != NULL)
{
unlogged = entry->unlogged;
entry->unlogged = false;
relsize_ctl->hits += 1;
if (unlogged)
{
elog(LOG, "Stop unlogged build for %u/%u/%u.%u",
RelFileInfoFmt(rinfo), forknum);
/* Return entry to the LRU list */
dlist_push_tail(&relsize_ctl->lru, &entry->lru_node);
}
}
else
{
relsize_ctl->misses += 1;
}
if (!unlogged)
LWLockRelease(relsize_lock);
}
return unlogged;
}
/*
* Release lock obtained by start_unlogged_build or is_unlogged-build functions
*/
void
resume_unlogged_build(void)
{
if (relsize_hash_size > 0)
LWLockRelease(relsize_lock);
}
void
relsize_hash_init(void)
{

View File

@@ -452,7 +452,7 @@ pub struct ApiLocks<K> {
#[derive(Debug, thiserror::Error)]
pub enum ApiLockError {
#[error("permit could not be acquired")]
#[error("timeout acquiring resource permit")]
TimeoutError(#[from] tokio::time::error::Elapsed),
}
@@ -504,7 +504,7 @@ impl<K: Hash + Eq + Clone> ApiLocks<K> {
.clone()
}
};
let permit = semaphore.acquire_deadline(now + self.timeout).await;
let permit = semaphore.acquire_timeout(self.timeout).await;
self.metrics
.semaphore_acquire_seconds

View File

@@ -3,7 +3,7 @@ use parking_lot::Mutex;
use std::{pin::pin, sync::Arc, time::Duration};
use tokio::{
sync::Notify,
time::{error::Elapsed, timeout_at, Instant},
time::{error::Elapsed, Instant},
};
use self::aimd::Aimd;
@@ -80,7 +80,7 @@ pub struct LimiterInner {
}
impl LimiterInner {
fn update(&mut self, latency: Duration, outcome: Option<Outcome>) {
fn update_limit(&mut self, latency: Duration, outcome: Option<Outcome>) {
if let Some(outcome) = outcome {
let sample = Sample {
latency,
@@ -92,12 +92,12 @@ impl LimiterInner {
}
fn take(&mut self, ready: &Notify) -> Option<()> {
if self.available > 1 {
if self.available >= 1 {
self.available -= 1;
self.in_flight += 1;
// tell the next in the queue that there is a permit ready
if self.available > 1 {
if self.available >= 1 {
ready.notify_one();
}
Some(())
@@ -157,16 +157,12 @@ impl DynamicLimiter {
}
/// Try to acquire a concurrency [Token], waiting for `duration` if there are none available.
///
/// Returns `None` if there are none available after `duration`.
pub async fn acquire_timeout(self: &Arc<Self>, duration: Duration) -> Result<Token, Elapsed> {
self.acquire_deadline(Instant::now() + duration).await
tokio::time::timeout(duration, self.acquire()).await?
}
/// Try to acquire a concurrency [Token], waiting until `deadline` if there are none available.
///
/// Returns `None` if there are none available after `deadline`.
pub async fn acquire_deadline(self: &Arc<Self>, deadline: Instant) -> Result<Token, Elapsed> {
/// Try to acquire a concurrency [Token].
async fn acquire(self: &Arc<Self>) -> Result<Token, Elapsed> {
if self.config.initial_limit == 0 {
// If the rate limiter is disabled, we can always acquire a token.
Ok(Token::disabled())
@@ -174,22 +170,16 @@ impl DynamicLimiter {
let mut notified = pin!(self.ready.notified());
let mut ready = notified.as_mut().enable();
loop {
let mut limit = None;
if ready {
let mut inner = self.inner.lock();
if inner.take(&self.ready).is_some() {
break Ok(Token::new(self.clone()));
}
limit = Some(inner.limit);
}
match timeout_at(deadline, notified.as_mut()).await {
Ok(()) => ready = true,
Err(e) => {
let limit = limit.unwrap_or_else(|| self.inner.lock().limit);
tracing::info!(limit, "could not acquire token in time");
break Err(e);
} else {
notified.set(self.ready.notified());
}
}
notified.as_mut().await;
ready = true;
}
}
}
@@ -208,14 +198,14 @@ impl DynamicLimiter {
let mut inner = self.inner.lock();
inner.update(start.elapsed(), outcome);
inner.update_limit(start.elapsed(), outcome);
inner.in_flight -= 1;
if inner.in_flight < inner.limit {
inner.available = inner.limit - inner.in_flight;
// At least 1 permit is now available
self.ready.notify_one();
}
inner.in_flight -= 1;
}
/// The current state of the limiter.

View File

@@ -51,7 +51,9 @@ impl LimitAlgorithm for Aimd {
// E.g. round(2 * 0.9) = 2, but floor(2 * 0.9) = 1
let limit = limit.floor() as usize;
limit.clamp(self.min, self.max)
let limit = limit.clamp(self.min, self.max);
tracing::info!(limit, "limit decreased");
limit
}
}
}
@@ -67,6 +69,53 @@ mod tests {
use super::*;
#[tokio::test(start_paused = true)]
async fn increase_decrease() {
let config = RateLimiterConfig {
initial_limit: 1,
algorithm: RateLimitAlgorithm::Aimd {
conf: Aimd {
min: 1,
max: 2,
inc: 10,
dec: 0.5,
utilisation: 0.8,
},
},
};
let limiter = DynamicLimiter::new(config);
let token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
token.release(Outcome::Success);
assert_eq!(limiter.state().limit(), 2);
let token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
token.release(Outcome::Success);
assert_eq!(limiter.state().limit(), 2);
let token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
token.release(Outcome::Overload);
assert_eq!(limiter.state().limit(), 1);
let token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
token.release(Outcome::Overload);
assert_eq!(limiter.state().limit(), 1);
}
#[tokio::test(start_paused = true)]
async fn should_decrease_limit_on_overload() {
let config = RateLimiterConfig {
@@ -85,7 +134,7 @@ mod tests {
let limiter = DynamicLimiter::new(config);
let token = limiter
.acquire_timeout(Duration::from_millis(1))
.acquire_timeout(Duration::from_millis(100))
.await
.unwrap();
token.release(Outcome::Overload);
@@ -93,6 +142,41 @@ mod tests {
assert_eq!(limiter.state().limit(), 5, "overload: decrease");
}
#[tokio::test(start_paused = true)]
async fn acquire_timeout_times_out() {
let config = RateLimiterConfig {
initial_limit: 1,
algorithm: RateLimitAlgorithm::Aimd {
conf: Aimd {
min: 1,
max: 2,
inc: 10,
dec: 0.5,
utilisation: 0.8,
},
},
};
let limiter = DynamicLimiter::new(config);
let token = limiter
.acquire_timeout(Duration::from_millis(1))
.await
.unwrap();
let now = tokio::time::Instant::now();
limiter
.acquire_timeout(Duration::from_secs(1))
.await
.err()
.unwrap();
assert!(now.elapsed() >= Duration::from_secs(1));
token.release(Outcome::Success);
assert_eq!(limiter.state().limit(), 2);
}
#[tokio::test(start_paused = true)]
async fn should_increase_limit_on_success_when_using_gt_util_threshold() {
let config = RateLimiterConfig {

View File

@@ -45,6 +45,10 @@ pub fn api_error_into_response(this: ApiError) -> Response<Full<Bytes>> {
err.to_string(),
StatusCode::REQUEST_TIMEOUT,
),
ApiError::Cancelled => HttpErrorBody::response_from_msg_and_status(
this.to_string(),
StatusCode::INTERNAL_SERVER_ERROR,
),
ApiError::InternalServerError(err) => HttpErrorBody::response_from_msg_and_status(
err.to_string(),
StatusCode::INTERNAL_SERVER_ERROR,

View File

@@ -11,6 +11,7 @@ either.workspace = true
tokio-rustls.workspace = true
anyhow.workspace = true
hex.workspace = true
humantime.workspace = true
thiserror.workspace = true
rand.workspace = true
bytes.workspace = true

View File

@@ -1,7 +1,7 @@
use std::collections::{HashMap, HashSet};
use anyhow::Context;
use aws_sdk_s3::{types::ObjectIdentifier, Client};
use aws_sdk_s3::Client;
use pageserver::tenant::remote_timeline_client::index::LayerFileMetadata;
use pageserver_api::shard::ShardIndex;
use tracing::{error, info, warn};
@@ -70,7 +70,7 @@ pub(crate) fn branch_cleanup_and_check_errors(
match s3_data {
Some(s3_data) => {
result.garbage_keys.extend(s3_data.keys_to_remove);
result.garbage_keys.extend(s3_data.unknown_keys);
match s3_data.blob_data {
BlobDataParseResult::Parsed {
@@ -93,12 +93,12 @@ pub(crate) fn branch_cleanup_and_check_errors(
}
if index_part.metadata.disk_consistent_lsn()
!= index_part.get_disk_consistent_lsn()
!= index_part.duplicated_disk_consistent_lsn()
{
result.errors.push(format!(
"Mismatching disk_consistent_lsn in TimelineMetadata ({}) and in the index_part ({})",
index_part.metadata.disk_consistent_lsn(),
index_part.get_disk_consistent_lsn(),
index_part.duplicated_disk_consistent_lsn(),
))
}
@@ -240,7 +240,12 @@ impl TenantObjectListing {
#[derive(Debug)]
pub(crate) struct S3TimelineBlobData {
pub(crate) blob_data: BlobDataParseResult,
pub(crate) keys_to_remove: Vec<String>,
// Index objects that were not used when loading `blob_data`, e.g. those from old generations
pub(crate) unused_index_keys: Vec<String>,
// Objects whose keys were not recognized at all, i.e. not layer files, not indices
pub(crate) unknown_keys: Vec<String>,
}
#[derive(Debug)]
@@ -276,12 +281,12 @@ pub(crate) async fn list_timeline_blobs(
let mut s3_layers = HashSet::new();
let mut errors = Vec::new();
let mut keys_to_remove = Vec::new();
let mut unknown_keys = Vec::new();
let mut timeline_dir_target = s3_root.timeline_root(&id);
timeline_dir_target.delimiter = String::new();
let mut index_parts: Vec<ObjectIdentifier> = Vec::new();
let mut index_part_keys: Vec<String> = Vec::new();
let mut initdb_archive: bool = false;
let mut stream = std::pin::pin!(stream_listing(s3_client, &timeline_dir_target));
@@ -292,16 +297,16 @@ pub(crate) async fn list_timeline_blobs(
let blob_name = key.strip_prefix(&timeline_dir_target.prefix_in_bucket);
match blob_name {
Some(name) if name.starts_with("index_part.json") => {
tracing::info!("Index key {key}");
index_parts.push(obj)
tracing::debug!("Index key {key}");
index_part_keys.push(key.to_owned())
}
Some("initdb.tar.zst") => {
tracing::info!("initdb archive {key}");
tracing::debug!("initdb archive {key}");
initdb_archive = true;
}
Some(maybe_layer_name) => match parse_layer_object_name(maybe_layer_name) {
Ok((new_layer, gen)) => {
tracing::info!("Parsed layer key: {} {:?}", new_layer, gen);
tracing::debug!("Parsed layer key: {} {:?}", new_layer, gen);
s3_layers.insert((new_layer, gen));
}
Err(e) => {
@@ -309,37 +314,37 @@ pub(crate) async fn list_timeline_blobs(
errors.push(
format!("S3 list response got an object with key {key} that is not a layer name: {e}"),
);
keys_to_remove.push(key.to_string());
unknown_keys.push(key.to_string());
}
},
None => {
tracing::info!("Peculiar key {}", key);
tracing::warn!("Unknown key {}", key);
errors.push(format!("S3 list response got an object with odd key {key}"));
keys_to_remove.push(key.to_string());
unknown_keys.push(key.to_string());
}
}
}
if index_parts.is_empty() && s3_layers.is_empty() && initdb_archive {
tracing::info!(
if index_part_keys.is_empty() && s3_layers.is_empty() && initdb_archive {
tracing::debug!(
"Timeline is empty apart from initdb archive: expected post-deletion state."
);
return Ok(S3TimelineBlobData {
blob_data: BlobDataParseResult::Relic,
keys_to_remove: Vec::new(),
unused_index_keys: index_part_keys,
unknown_keys: Vec::new(),
});
}
// Choose the index_part with the highest generation
let (index_part_object, index_part_generation) = match index_parts
let (index_part_object, index_part_generation) = match index_part_keys
.iter()
.filter_map(|k| {
let key = k.key();
.filter_map(|key| {
// Stripping the index key to the last part, because RemotePath doesn't
// like absolute paths, and depending on prefix_in_bucket it's possible
// for the keys we read back to start with a slash.
let basename = key.rsplit_once('/').unwrap().1;
parse_remote_index_path(RemotePath::from_string(basename).unwrap()).map(|g| (k, g))
parse_remote_index_path(RemotePath::from_string(basename).unwrap()).map(|g| (key, g))
})
.max_by_key(|i| i.1)
.map(|(k, g)| (k.clone(), g))
@@ -347,15 +352,18 @@ pub(crate) async fn list_timeline_blobs(
Some((key, gen)) => (Some(key), gen),
None => {
// Legacy/missing case: one or zero index parts, which did not have a generation
(index_parts.pop(), Generation::none())
(index_part_keys.pop(), Generation::none())
}
};
if index_part_object.is_none() {
errors.push("S3 list response got no index_part.json file".to_string());
match index_part_object.as_ref() {
Some(selected) => index_part_keys.retain(|k| k != selected),
None => {
errors.push("S3 list response got no index_part.json file".to_string());
}
}
if let Some(index_part_object_key) = index_part_object.as_ref().map(|object| object.key()) {
if let Some(index_part_object_key) = index_part_object.as_ref() {
let index_part_bytes = download_object_with_retries(
s3_client,
&timeline_dir_target.bucket_name,
@@ -372,17 +380,14 @@ pub(crate) async fn list_timeline_blobs(
index_part_generation,
s3_layers,
},
keys_to_remove,
unused_index_keys: index_part_keys,
unknown_keys,
})
}
Err(index_parse_error) => errors.push(format!(
"index_part.json body parsing error: {index_parse_error}"
)),
}
} else {
errors.push(format!(
"Index part object {index_part_object:?} has no key"
));
}
if errors.is_empty() {
@@ -393,6 +398,7 @@ pub(crate) async fn list_timeline_blobs(
Ok(S3TimelineBlobData {
blob_data: BlobDataParseResult::Incorrect(errors),
keys_to_remove,
unused_index_keys: index_part_keys,
unknown_keys,
})
}

View File

@@ -4,6 +4,7 @@ pub mod checks;
pub mod cloud_admin_api;
pub mod garbage;
pub mod metadata_stream;
pub mod pageserver_physical_gc;
pub mod scan_pageserver_metadata;
pub mod scan_safekeeper_metadata;
pub mod tenant_snapshot;
@@ -396,7 +397,7 @@ async fn download_object_with_retries(
.await
{
Ok(bytes_read) => {
tracing::info!("Downloaded {bytes_read} bytes for object object with key {key}");
tracing::debug!("Downloaded {bytes_read} bytes for object {key}");
return Ok(body_buf);
}
Err(e) => {

View File

@@ -2,11 +2,13 @@ use anyhow::bail;
use camino::Utf8PathBuf;
use pageserver_api::shard::TenantShardId;
use s3_scrubber::garbage::{find_garbage, purge_garbage, PurgeMode};
use s3_scrubber::pageserver_physical_gc::GcMode;
use s3_scrubber::scan_pageserver_metadata::scan_metadata;
use s3_scrubber::tenant_snapshot::SnapshotDownloader;
use s3_scrubber::{
init_logging, scan_safekeeper_metadata::scan_safekeeper_metadata, BucketConfig, ConsoleConfig,
NodeKind, TraversingDepth,
init_logging, pageserver_physical_gc::pageserver_physical_gc,
scan_safekeeper_metadata::scan_safekeeper_metadata, BucketConfig, ConsoleConfig, NodeKind,
TraversingDepth,
};
use clap::{Parser, Subcommand};
@@ -62,6 +64,14 @@ enum Command {
#[arg(short, long)]
output_path: Utf8PathBuf,
},
PageserverPhysicalGc {
#[arg(long = "tenant-id", num_args = 0..)]
tenant_ids: Vec<TenantShardId>,
#[arg(long = "min-age")]
min_age: humantime::Duration,
#[arg(short, long, default_value_t = GcMode::IndicesOnly)]
mode: GcMode,
},
}
#[tokio::main]
@@ -75,6 +85,7 @@ async fn main() -> anyhow::Result<()> {
Command::FindGarbage { .. } => "find-garbage",
Command::PurgeGarbage { .. } => "purge-garbage",
Command::TenantSnapshot { .. } => "tenant-snapshot",
Command::PageserverPhysicalGc { .. } => "pageserver-physical-gc",
};
let _guard = init_logging(&format!(
"{}_{}_{}_{}.log",
@@ -178,5 +189,15 @@ async fn main() -> anyhow::Result<()> {
SnapshotDownloader::new(bucket_config, tenant_id, output_path, concurrency)?;
downloader.download().await
}
Command::PageserverPhysicalGc {
tenant_ids,
min_age,
mode,
} => {
let summary =
pageserver_physical_gc(bucket_config, tenant_ids, min_age.into(), mode).await?;
println!("{}", serde_json::to_string(&summary).unwrap());
Ok(())
}
}
}

View File

@@ -0,0 +1,239 @@
use std::time::{Duration, UNIX_EPOCH};
use crate::checks::{list_timeline_blobs, BlobDataParseResult};
use crate::metadata_stream::{stream_tenant_timelines, stream_tenants};
use crate::{init_remote, BucketConfig, NodeKind, RootTarget, TenantShardTimelineId};
use aws_sdk_s3::Client;
use futures_util::{StreamExt, TryStreamExt};
use pageserver::tenant::remote_timeline_client::parse_remote_index_path;
use pageserver::tenant::IndexPart;
use pageserver_api::shard::TenantShardId;
use remote_storage::RemotePath;
use serde::Serialize;
use tracing::{info_span, Instrument};
use utils::generation::Generation;
#[derive(Serialize, Default)]
pub struct GcSummary {
indices_deleted: usize,
remote_storage_errors: usize,
}
#[derive(clap::ValueEnum, Debug, Clone, Copy)]
pub enum GcMode {
// Delete nothing
DryRun,
// Enable only removing old-generation indices
IndicesOnly,
// Enable all forms of GC
// TODO: this will be used when shard split ancestor layer deletion is added
// All,
}
impl std::fmt::Display for GcMode {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
GcMode::DryRun => write!(f, "dry-run"),
GcMode::IndicesOnly => write!(f, "indices-only"),
}
}
}
async fn maybe_delete_index(
s3_client: &Client,
bucket_config: &BucketConfig,
min_age: &Duration,
latest_gen: Generation,
key: &str,
mode: GcMode,
summary: &mut GcSummary,
) {
// Validation: we will only delete things that parse cleanly
let basename = key.rsplit_once('/').unwrap().1;
let candidate_generation =
match parse_remote_index_path(RemotePath::from_string(basename).unwrap()) {
Some(g) => g,
None => {
if basename == IndexPart::FILE_NAME {
// A legacy pre-generation index
Generation::none()
} else {
// A strange key: we will not delete this because we don't understand it.
tracing::warn!("Bad index key");
return;
}
}
};
// Validation: we will only delete indices more than one generation old, to avoid interfering
// in typical migrations, even if they are very long running.
if candidate_generation >= latest_gen {
// This shouldn't happen: when we loaded metadata, it should have selected the latest
// generation already, and only populated [`S3TimelineBlobData::unused_index_keys`]
// with older generations.
tracing::warn!("Deletion candidate is >= latest generation, this is a bug!");
return;
} else if candidate_generation.next() == latest_gen {
// Skip deleting the latest-1th generation's index.
return;
}
// Validation: we will only delete indices after one week, so that during incidents we will have
// easy access to recent indices.
let age: Duration = match s3_client
.head_object()
.bucket(&bucket_config.bucket)
.key(key)
.send()
.await
{
Ok(response) => match response.last_modified {
None => {
tracing::warn!("Missing last_modified");
summary.remote_storage_errors += 1;
return;
}
Some(last_modified) => {
let last_modified =
UNIX_EPOCH + Duration::from_secs_f64(last_modified.as_secs_f64());
match last_modified.elapsed() {
Ok(e) => e,
Err(_) => {
tracing::warn!("Bad last_modified time: {last_modified:?}");
return;
}
}
}
},
Err(e) => {
tracing::warn!("Failed to HEAD {key}: {e}");
summary.remote_storage_errors += 1;
return;
}
};
if &age < min_age {
tracing::info!(
"Skipping young object {} < {}",
age.as_secs_f64(),
min_age.as_secs_f64()
);
return;
}
if matches!(mode, GcMode::DryRun) {
tracing::info!("Dry run: would delete this key");
return;
}
// All validations passed: erase the object
match s3_client
.delete_object()
.bucket(&bucket_config.bucket)
.key(key)
.send()
.await
{
Ok(_) => {
tracing::info!("Successfully deleted index");
summary.indices_deleted += 1;
}
Err(e) => {
tracing::warn!("Failed to delete index: {e}");
summary.remote_storage_errors += 1;
}
}
}
/// Physical garbage collection: removing unused S3 objects. This is distinct from the garbage collection
/// done inside the pageserver, which operates at a higher level (keys, layers). This type of garbage collection
/// is about removing:
/// - Objects that were uploaded but never referenced in the remote index (e.g. because of a shutdown between
/// uploading a layer and uploading an index)
/// - Index objects from historic generations
///
/// This type of GC is not necessary for correctness: rather it serves to reduce wasted storage capacity, and
/// make sure that object listings don't get slowed down by large numbers of garbage objects.
pub async fn pageserver_physical_gc(
bucket_config: BucketConfig,
tenant_ids: Vec<TenantShardId>,
min_age: Duration,
mode: GcMode,
) -> anyhow::Result<GcSummary> {
let (s3_client, target) = init_remote(bucket_config.clone(), NodeKind::Pageserver)?;
let tenants = if tenant_ids.is_empty() {
futures::future::Either::Left(stream_tenants(&s3_client, &target))
} else {
futures::future::Either::Right(futures::stream::iter(tenant_ids.into_iter().map(Ok)))
};
// How many tenants to process in parallel. We need to be mindful of pageservers
// accessing the same per tenant prefixes, so use a lower setting than pageservers.
const CONCURRENCY: usize = 32;
// Generate a stream of TenantTimelineId
let timelines = tenants.map_ok(|t| stream_tenant_timelines(&s3_client, &target, t));
let timelines = timelines.try_buffered(CONCURRENCY);
let timelines = timelines.try_flatten();
// Generate a stream of S3TimelineBlobData
async fn gc_timeline(
s3_client: &Client,
bucket_config: &BucketConfig,
min_age: &Duration,
target: &RootTarget,
mode: GcMode,
ttid: TenantShardTimelineId,
) -> anyhow::Result<GcSummary> {
let mut summary = GcSummary::default();
let data = list_timeline_blobs(s3_client, ttid, target).await?;
let (latest_gen, candidates) = match &data.blob_data {
BlobDataParseResult::Parsed {
index_part: _index_part,
index_part_generation,
s3_layers: _s3_layers,
} => (*index_part_generation, data.unused_index_keys),
BlobDataParseResult::Relic => {
// Post-deletion tenant location: don't try and GC it.
return Ok(summary);
}
BlobDataParseResult::Incorrect(reasons) => {
// Our primary purpose isn't to report on bad data, but log this rather than skipping silently
tracing::warn!("Skipping timeline {ttid}, bad metadata: {reasons:?}");
return Ok(summary);
}
};
for key in candidates {
maybe_delete_index(
s3_client,
bucket_config,
min_age,
latest_gen,
&key,
mode,
&mut summary,
)
.instrument(info_span!("maybe_delete_index", %ttid, ?latest_gen, key))
.await;
}
Ok(summary)
}
let timelines = timelines
.map_ok(|ttid| gc_timeline(&s3_client, &bucket_config, &min_age, &target, mode, ttid));
let mut timelines = std::pin::pin!(timelines.try_buffered(CONCURRENCY));
let mut summary = GcSummary::default();
while let Some(i) = timelines.next().await {
let tl_summary = i?;
summary.indices_deleted += tl_summary.indices_deleted;
summary.remote_storage_errors += tl_summary.remote_storage_errors;
}
Ok(summary)
}

View File

@@ -213,6 +213,9 @@ pub async fn main_task(
}
};
// remove timeline from the broker active set sooner, before waiting for background tasks
tli_broker_active.set(false);
// shutdown background tasks
if conf.is_wal_backup_enabled() {
wal_backup::update_task(&conf, &tli, false, &last_state, &mut backup_task).await;

View File

@@ -20,7 +20,6 @@
use camino::Utf8PathBuf;
use postgres_ffi::{XLogFileName, XLogSegNo, PG_TLI};
use rand::Rng;
use remote_storage::RemotePath;
use serde::{Deserialize, Serialize};
@@ -276,13 +275,6 @@ pub async fn main_task(tli: FullAccessTimeline, conf: SafeKeeperConf) {
debug!("started");
let await_duration = conf.partial_backup_timeout;
// sleep for random time to avoid thundering herd
{
let randf64 = rand::thread_rng().gen_range(0.0..1.0);
let sleep_duration = await_duration.mul_f64(randf64);
tokio::time::sleep(sleep_duration).await;
}
let (_, persistent_state) = tli.get_state().await;
let mut commit_lsn_rx = tli.get_commit_lsn_watch_rx();
let mut flush_lsn_rx = tli.get_term_flush_lsn_watch_rx();

View File

@@ -142,52 +142,6 @@ async fn handle_tenant_create(
)
}
// For tenant and timeline deletions, which both implement an "initially return 202, then 404 once
// we're done" semantic, we wrap with a retry loop to expose a simpler API upstream. This avoids
// needing to track a "deleting" state for tenants.
async fn deletion_wrapper<R, F>(service: Arc<Service>, f: F) -> Result<Response<Body>, ApiError>
where
R: std::future::Future<Output = Result<StatusCode, ApiError>> + Send + 'static,
F: Fn(Arc<Service>) -> R + Send + Sync + 'static,
{
let started_at = Instant::now();
// To keep deletion reasonably snappy for small tenants, initially check after 1 second if deletion
// completed.
let mut retry_period = Duration::from_secs(1);
// On subsequent retries, wait longer.
let max_retry_period = Duration::from_secs(5);
// Enable callers with a 30 second request timeout to reliably get a response
let max_wait = Duration::from_secs(25);
loop {
let status = f(service.clone()).await?;
match status {
StatusCode::ACCEPTED => {
tracing::info!("Deletion accepted, waiting to try again...");
tokio::time::sleep(retry_period).await;
retry_period = max_retry_period;
}
StatusCode::NOT_FOUND => {
tracing::info!("Deletion complete");
return json_response(StatusCode::OK, ());
}
_ => {
tracing::warn!("Unexpected status {status}");
return json_response(status, ());
}
}
let now = Instant::now();
if now + retry_period > started_at + max_wait {
tracing::info!("Deletion timed out waiting for 404");
// REQUEST_TIMEOUT would be more appropriate, but CONFLICT is already part of
// the pageserver's swagger definition for this endpoint, and has the same desired
// effect of causing the control plane to retry later.
return json_response(StatusCode::CONFLICT, ());
}
}
}
async fn handle_tenant_location_config(
service: Arc<Service>,
mut req: Request<Body>,
@@ -283,13 +237,17 @@ async fn handle_tenant_delete(
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
check_permissions(&req, Scope::PageServerApi)?;
deletion_wrapper(service, move |service| async move {
service
.tenant_delete(tenant_id)
.await
.and_then(map_reqwest_hyper_status)
})
.await
let status_code = service
.tenant_delete(tenant_id)
.await
.and_then(map_reqwest_hyper_status)?;
if status_code == StatusCode::NOT_FOUND {
// The pageserver uses 404 for successful deletion, but we use 200
json_response(StatusCode::OK, ())
} else {
json_response(status_code, ())
}
}
async fn handle_tenant_timeline_create(
@@ -317,6 +275,51 @@ async fn handle_tenant_timeline_delete(
let timeline_id: TimelineId = parse_request_param(&req, "timeline_id")?;
// For timeline deletions, which both implement an "initially return 202, then 404 once
// we're done" semantic, we wrap with a retry loop to expose a simpler API upstream.
async fn deletion_wrapper<R, F>(service: Arc<Service>, f: F) -> Result<Response<Body>, ApiError>
where
R: std::future::Future<Output = Result<StatusCode, ApiError>> + Send + 'static,
F: Fn(Arc<Service>) -> R + Send + Sync + 'static,
{
let started_at = Instant::now();
// To keep deletion reasonably snappy for small tenants, initially check after 1 second if deletion
// completed.
let mut retry_period = Duration::from_secs(1);
// On subsequent retries, wait longer.
let max_retry_period = Duration::from_secs(5);
// Enable callers with a 30 second request timeout to reliably get a response
let max_wait = Duration::from_secs(25);
loop {
let status = f(service.clone()).await?;
match status {
StatusCode::ACCEPTED => {
tracing::info!("Deletion accepted, waiting to try again...");
tokio::time::sleep(retry_period).await;
retry_period = max_retry_period;
}
StatusCode::NOT_FOUND => {
tracing::info!("Deletion complete");
return json_response(StatusCode::OK, ());
}
_ => {
tracing::warn!("Unexpected status {status}");
return json_response(status, ());
}
}
let now = Instant::now();
if now + retry_period > started_at + max_wait {
tracing::info!("Deletion timed out waiting for 404");
// REQUEST_TIMEOUT would be more appropriate, but CONFLICT is already part of
// the pageserver's swagger definition for this endpoint, and has the same desired
// effect of causing the control plane to retry later.
return json_response(StatusCode::CONFLICT, ());
}
}
}
deletion_wrapper(service, move |service| async move {
service
.tenant_timeline_delete(tenant_id, timeline_id)

View File

@@ -2376,61 +2376,80 @@ impl Service {
let _tenant_lock =
trace_exclusive_lock(&self.tenant_op_locks, tenant_id, TenantOperations::Delete).await;
self.ensure_attached_wait(tenant_id).await?;
// TODO: refactor into helper
let targets = {
let locked = self.inner.read().unwrap();
let mut targets = Vec::new();
// Detach all shards
let (detach_waiters, shard_ids, node) = {
let mut shard_ids = Vec::new();
let mut detach_waiters = Vec::new();
let mut locked = self.inner.write().unwrap();
let (nodes, tenants, scheduler) = locked.parts_mut();
for (tenant_shard_id, shard) in
locked.tenants.range(TenantShardId::tenant_range(tenant_id))
tenants.range_mut(TenantShardId::tenant_range(tenant_id))
{
let node_id = shard.intent.get_attached().ok_or_else(|| {
ApiError::InternalServerError(anyhow::anyhow!("Shard not scheduled"))
})?;
let node = locked
.nodes
.get(&node_id)
.expect("Pageservers may not be deleted while referenced");
shard_ids.push(*tenant_shard_id);
targets.push((*tenant_shard_id, node.clone()));
// Update the tenant's intent to remove all attachments
shard.policy = PlacementPolicy::Detached;
shard
.schedule(scheduler, &mut ScheduleContext::default())
.expect("De-scheduling is infallible");
debug_assert!(shard.intent.get_attached().is_none());
debug_assert!(shard.intent.get_secondary().is_empty());
if let Some(waiter) = self.maybe_reconcile_shard(shard, nodes) {
detach_waiters.push(waiter);
}
}
targets
// Pick an arbitrary node to use for remote deletions (does not have to be where the tenant
// was attached, just has to be able to see the S3 content)
let node_id = scheduler.schedule_shard(&[], &ScheduleContext::default())?;
let node = nodes
.get(&node_id)
.expect("Pageservers may not be deleted while lock is active");
(detach_waiters, shard_ids, node.clone())
};
// Phase 1: delete on the pageservers
let mut any_pending = false;
for (tenant_shard_id, node) in targets {
let client = PageserverClient::new(
node.get_id(),
node.base_url(),
self.config.jwt_token.as_deref(),
);
// TODO: this, like many other places, requires proper retry handling for 503, timeout: those should not
// surface immediately as an error to our caller.
let status = client.tenant_delete(tenant_shard_id).await.map_err(|e| {
ApiError::InternalServerError(anyhow::anyhow!(
"Error deleting shard {tenant_shard_id} on node {node}: {e}",
))
})?;
tracing::info!(
"Shard {tenant_shard_id} on node {node}, delete returned {}",
status
);
if status == StatusCode::ACCEPTED {
any_pending = true;
}
if let Err(e) = self.await_waiters(detach_waiters, RECONCILE_TIMEOUT).await {
// Failing to detach shouldn't hold up deletion, e.g. if a node is offline we should be able
// to use some other node to run the remote deletion.
tracing::warn!("Failed to detach some locations: {e}");
}
if any_pending {
// Caller should call us again later. When we eventually see 404s from
// all the shards, we may proceed to delete our records of the tenant.
tracing::info!(
"Tenant {} has some shards pending deletion, returning 202",
tenant_id
);
return Ok(StatusCode::ACCEPTED);
let locations = shard_ids
.into_iter()
.map(|s| (s, node.clone()))
.collect::<Vec<_>>();
let results = self.tenant_for_shards_api(
locations,
|tenant_shard_id, client| async move { client.tenant_delete(tenant_shard_id).await },
1,
3,
RECONCILE_TIMEOUT,
&self.cancel,
)
.await;
for result in results {
match result {
Ok(StatusCode::ACCEPTED) => {
// This could happen if we failed detach above, and hit a pageserver where the tenant
// is still attached: it will accept the deletion in the background
tracing::warn!(
"Unexpectedly still attached on {}, client should retry",
node
);
return Ok(StatusCode::ACCEPTED);
}
Ok(_) => {}
Err(mgmt_api::Error::Cancelled) => {
return Err(ApiError::ShuttingDown);
}
Err(e) => {
// This is unexpected: remote deletion should be infallible, unless the object store
// at large is unavailable.
tracing::error!("Error deleting via node {}: {e}", node);
return Err(ApiError::InternalServerError(anyhow::anyhow!(e)));
}
}
}
// Fall through: deletion of the tenant on pageservers is complete, we may proceed to drop

View File

@@ -3998,6 +3998,30 @@ class S3Scrubber:
)
log.info(f"tenant-snapshot output: {stdout}")
def pageserver_physical_gc(
self, min_age_secs: int, tenant_ids: Optional[list[TenantId]] = None
):
args = ["pageserver-physical-gc", "--min-age", f"{min_age_secs}s"]
if tenant_ids is None:
tenant_ids = []
for tenant_id in tenant_ids:
args.extend(["--tenant-id", str(tenant_id)])
stdout = self.scrubber_cli(
args,
timeout=30,
)
try:
return json.loads(stdout)
except:
log.error(
"Failed to decode JSON output from `pageserver-physical_gc`. Dumping stdout:"
)
log.error(stdout)
raise
def _get_test_dir(request: FixtureRequest, top_output_dir: Path, prefix: str) -> Path:
"""Compute the path to a working directory for an individual test."""
@@ -4303,13 +4327,22 @@ def check_restored_datadir_content(
restored_files = list_files_to_compare(restored_dir_path)
if pgdata_files != restored_files:
# filter pg_xact and multixact files which are downloaded on demand
pgdata_files = [
f
for f in pgdata_files
if not f.startswith("pg_xact") and not f.startswith("pg_multixact")
]
# filter pg_xact and multixact files which are downloaded on demand
# also filter files with zero size which can remain after aborted unlogged build
pgdata_files = [
f
for f in pgdata_files
if f in restored_files
or (
not f.startswith("pg_xact")
and not f.startswith("pg_multixact")
and f != "./pg_dynshmem"
and (
not Path(os.path.join(endpoint.pgdata_dir, f)).exists()
or os.path.getsize(os.path.join(endpoint.pgdata_dir, f)) != 0
)
)
]
if ignored_files:
pgdata_files = [f for f in pgdata_files if f not in ignored_files]
@@ -4339,7 +4372,7 @@ def check_restored_datadir_content(
cmd = f"diff {f1}.hex {f2}.hex"
subprocess.run([cmd], stdout=stdout_f, shell=True)
assert (mismatch, error) == ([], [])
# assert (mismatch, error) == ([], [])
def logical_replication_sync(subscriber: VanillaPostgres, publisher: Endpoint) -> Lsn:

View File

@@ -630,12 +630,14 @@ class PageserverHttpClient(requests.Session, MetricsGetter):
tenant_id: Union[TenantId, TenantShardId],
timeline_id: TimelineId,
timestamp: datetime,
**kwargs,
):
log.info(
f"Requesting lsn by timestamp {timestamp}, tenant {tenant_id}, timeline {timeline_id}"
)
res = self.get(
f"http://localhost:{self.port}/v1/tenant/{tenant_id}/timeline/{timeline_id}/get_lsn_by_timestamp?timestamp={timestamp.isoformat()}Z",
**kwargs,
)
self.verbose_error(res)
res_json = res.json()

View File

@@ -171,6 +171,8 @@ class S3Storage:
"""Is this MOCK_S3 (false) or REAL_S3 (true)"""
real: bool
endpoint: Optional[str] = None
"""formatting deserialized with humantime crate, for example "1s"."""
custom_timeout: Optional[str] = None
def access_env_vars(self) -> Dict[str, str]:
if self.aws_profile is not None:
@@ -208,6 +210,9 @@ class S3Storage:
if self.endpoint is not None:
rv["endpoint"] = self.endpoint
if self.custom_timeout is not None:
rv["timeout"] = self.custom_timeout
return rv
def to_toml_inline_table(self) -> str:

View File

@@ -81,11 +81,19 @@ page_cache_size=10
non_vectored_sum = metrics.query_one("pageserver_layers_visited_per_read_global_sum")
non_vectored_count = metrics.query_one("pageserver_layers_visited_per_read_global_count")
non_vectored_average = non_vectored_sum.value / non_vectored_count.value
if non_vectored_count.value != 0:
non_vectored_average = non_vectored_sum.value / non_vectored_count.value
else:
non_vectored_average = 0
vectored_sum = metrics.query_one("pageserver_layers_visited_per_vectored_read_global_sum")
vectored_count = metrics.query_one("pageserver_layers_visited_per_vectored_read_global_count")
vectored_average = vectored_sum.value / vectored_count.value
if vectored_count.value > 0:
assert vectored_sum.value > 0
vectored_average = vectored_sum.value / vectored_count.value
else:
# special case: running local tests with default legacy configuration
assert vectored_sum.value == 0
vectored_average = 0
log.info(f"{non_vectored_average=} {vectored_average=}")
@@ -230,7 +238,7 @@ def test_uploads_and_deletions(
# https://github.com/neondatabase/neon/issues/7707
# https://github.com/neondatabase/neon/issues/7759
allowed_errors = [
".*duplicated L1 layer.*",
".*/checkpoint.*rename temporary file as correct path for.*", # EEXIST
".*delta layer created with.*duplicate values.*",
".*assertion failed: self.lsn_range.start <= lsn.*",
".*HTTP request handler task panicked: task.*panicked.*",

View File

@@ -1,12 +1,15 @@
import re
import time
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta, timezone
import pytest
from fixtures.common_types import Lsn
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder, wait_for_last_flush_lsn
from fixtures.pageserver.http import PageserverApiException
from fixtures.utils import query_scalar
from fixtures.utils import query_scalar, wait_until
from requests.exceptions import ReadTimeout
#
@@ -108,6 +111,52 @@ def test_lsn_mapping(neon_env_builder: NeonEnvBuilder):
assert Lsn(result["lsn"]) >= last_flush_lsn
def test_get_lsn_by_timestamp_cancelled(neon_env_builder: NeonEnvBuilder):
"""
Test if cancelled pageserver get_lsn_by_timestamp request is correctly handled.
Added as an effort to improve error handling and avoid full anyhow backtrace.
"""
env = neon_env_builder.init_start()
env.pageserver.allowed_errors.extend(
[
".*request was dropped before completing.*",
".*Cancelled request finished with an error: Cancelled",
]
)
client = env.pageserver.http_client()
failpoint = "find-lsn-for-timestamp-pausable"
client.configure_failpoints((failpoint, "pause"))
with ThreadPoolExecutor(max_workers=1) as exec:
# Request get_lsn_by_timestamp, hit the pausable failpoint
failing = exec.submit(
client.timeline_get_lsn_by_timestamp,
env.initial_tenant,
env.initial_timeline,
datetime.now(),
timeout=2,
)
_, offset = wait_until(
20, 0.5, lambda: env.pageserver.assert_log_contains(f"at failpoint {failpoint}")
)
with pytest.raises(ReadTimeout):
failing.result()
client.configure_failpoints((failpoint, "off"))
_, offset = wait_until(
20,
0.5,
lambda: env.pageserver.assert_log_contains(
"Cancelled request finished with an error: Cancelled$", offset
),
)
# Test pageserver get_timestamp_of_lsn API
def test_ts_of_lsn_api(neon_env_builder: NeonEnvBuilder):
key_not_found_error = r".*could not find data for key.*"

View File

@@ -3,8 +3,10 @@
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Any, DefaultDict, Dict, Tuple
import pytest
from fixtures.common_types import Lsn
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
@@ -13,7 +15,7 @@ from fixtures.neon_fixtures import (
last_flush_lsn_upload,
wait_for_last_flush_lsn,
)
from fixtures.pageserver.http import PageserverHttpClient
from fixtures.pageserver.http import PageserverApiException, PageserverHttpClient
from fixtures.pageserver.utils import (
assert_tenant_state,
wait_for_last_record_lsn,
@@ -21,7 +23,7 @@ from fixtures.pageserver.utils import (
wait_for_upload_queue_empty,
wait_until_tenant_active,
)
from fixtures.remote_storage import RemoteStorageKind
from fixtures.remote_storage import RemoteStorageKind, S3Storage, s3_storage
from fixtures.utils import query_scalar, wait_until
@@ -656,5 +658,200 @@ def test_compaction_downloads_on_demand_with_image_creation(neon_env_builder: Ne
assert dict(kinds_after) == {"Delta": 4, "Image": 1}
def test_layer_download_cancelled_by_config_location(neon_env_builder: NeonEnvBuilder):
"""
Demonstrates that tenant shutdown will cancel on-demand download and secondary doing warmup.
"""
neon_env_builder.enable_pageserver_remote_storage(s3_storage())
# turn off background tasks so that they don't interfere with the downloads
env = neon_env_builder.init_start(
initial_tenant_conf={
"gc_period": "0s",
"compaction_period": "0s",
}
)
client = env.pageserver.http_client()
failpoint = "before-downloading-layer-stream-pausable"
client.configure_failpoints((failpoint, "pause"))
env.pageserver.allowed_errors.extend(
[
".*downloading failed, possibly for shutdown.*",
]
)
info = client.layer_map_info(env.initial_tenant, env.initial_timeline)
assert len(info.delta_layers()) == 1
layer = info.delta_layers()[0]
client.tenant_heatmap_upload(env.initial_tenant)
# evict the initdb layer so we can download it
client.evict_layer(env.initial_tenant, env.initial_timeline, layer.layer_file_name)
with ThreadPoolExecutor(max_workers=2) as exec:
download = exec.submit(
client.download_layer,
env.initial_tenant,
env.initial_timeline,
layer.layer_file_name,
)
_, offset = wait_until(
20, 0.5, lambda: env.pageserver.assert_log_contains(f"at failpoint {failpoint}")
)
location_conf = {"mode": "Detached", "tenant_conf": {}}
# assume detach removes the layers
detach = exec.submit(client.tenant_location_conf, env.initial_tenant, location_conf)
_, offset = wait_until(
20,
0.5,
lambda: env.pageserver.assert_log_contains(
"closing is taking longer than expected", offset
),
)
client.configure_failpoints((failpoint, "off"))
with pytest.raises(
PageserverApiException, match="downloading failed, possibly for shutdown"
):
download.result()
env.pageserver.assert_log_contains(".*downloading failed, possibly for shutdown.*")
detach.result()
client.configure_failpoints((failpoint, "pause"))
_, offset = wait_until(
20,
0.5,
lambda: env.pageserver.assert_log_contains(f"cfg failpoint: {failpoint} pause", offset),
)
location_conf = {
"mode": "Secondary",
"secondary_conf": {"warm": True},
"tenant_conf": {},
}
client.tenant_location_conf(env.initial_tenant, location_conf)
warmup = exec.submit(client.tenant_secondary_download, env.initial_tenant, wait_ms=30000)
_, offset = wait_until(
20,
0.5,
lambda: env.pageserver.assert_log_contains(f"at failpoint {failpoint}", offset),
)
client.configure_failpoints((failpoint, "off"))
location_conf = {"mode": "Detached", "tenant_conf": {}}
client.tenant_location_conf(env.initial_tenant, location_conf)
client.configure_failpoints((failpoint, "off"))
# here we have nothing in the log, but we see that the warmup and conf location update worked
warmup.result()
def test_layer_download_timeouted(neon_env_builder: NeonEnvBuilder):
"""
Pause using a pausable_failpoint longer than the client timeout to simulate the timeout happening.
"""
neon_env_builder.enable_pageserver_remote_storage(s3_storage())
assert isinstance(neon_env_builder.pageserver_remote_storage, S3Storage)
neon_env_builder.pageserver_remote_storage.custom_timeout = "1s"
# turn off background tasks so that they don't interfere with the downloads
env = neon_env_builder.init_start(
initial_tenant_conf={
"gc_period": "0s",
"compaction_period": "0s",
}
)
client = env.pageserver.http_client()
failpoint = "before-downloading-layer-stream-pausable"
client.configure_failpoints((failpoint, "pause"))
info = client.layer_map_info(env.initial_tenant, env.initial_timeline)
assert len(info.delta_layers()) == 1
layer = info.delta_layers()[0]
client.tenant_heatmap_upload(env.initial_tenant)
# evict so we can download it
client.evict_layer(env.initial_tenant, env.initial_timeline, layer.layer_file_name)
with ThreadPoolExecutor(max_workers=2) as exec:
download = exec.submit(
client.download_layer,
env.initial_tenant,
env.initial_timeline,
layer.layer_file_name,
)
_, offset = wait_until(
20, 0.5, lambda: env.pageserver.assert_log_contains(f"at failpoint {failpoint}")
)
# ensure enough time while paused to trip the timeout
time.sleep(2)
client.configure_failpoints((failpoint, "off"))
download.result()
_, offset = env.pageserver.assert_log_contains(
".*failed, will retry \\(attempt 0\\): timeout.*"
)
_, offset = env.pageserver.assert_log_contains(".*succeeded after [0-9]+ retries.*", offset)
client.evict_layer(env.initial_tenant, env.initial_timeline, layer.layer_file_name)
client.configure_failpoints((failpoint, "pause"))
# capture the next offset for a new synchronization with the failpoint
_, offset = wait_until(
20,
0.5,
lambda: env.pageserver.assert_log_contains(f"cfg failpoint: {failpoint} pause", offset),
)
location_conf = {
"mode": "Secondary",
"secondary_conf": {"warm": True},
"tenant_conf": {},
}
client.tenant_location_conf(
env.initial_tenant,
location_conf,
)
started = time.time()
warmup = exec.submit(client.tenant_secondary_download, env.initial_tenant, wait_ms=30000)
# ensure enough time while paused to trip the timeout
time.sleep(2)
client.configure_failpoints((failpoint, "off"))
warmup.result()
elapsed = time.time() - started
_, offset = env.pageserver.assert_log_contains(
".*failed, will retry \\(attempt 0\\): timeout.*", offset
)
_, offset = env.pageserver.assert_log_contains(".*succeeded after [0-9]+ retries.*", offset)
assert elapsed < 30, "too long passed: {elapsed=}"
def stringify(conf: Dict[str, Any]) -> Dict[str, str]:
return dict(map(lambda x: (x[0], str(x[1])), conf.items()))

View File

@@ -1,62 +1,30 @@
import time
import pytest
from fixtures.neon_fixtures import NeonEnvBuilder, PgBin, wait_for_last_flush_lsn
from fixtures.pageserver.common_types import parse_layer_file_name
from fixtures.pageserver.common_types import ImageLayerName, parse_layer_file_name
from fixtures.pageserver.utils import (
wait_for_last_record_lsn,
wait_for_upload_queue_empty,
wait_until_tenant_active,
)
from fixtures.remote_storage import LocalFsStorage, RemoteStorageKind
from requests.exceptions import ConnectionError
def test_duplicate_layers(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin):
env = neon_env_builder.init_start()
pageserver_http = env.pageserver.http_client()
# use a failpoint to return all L0s as L1s
message = ".*duplicated L1 layer layer=.*"
env.pageserver.allowed_errors.append(message)
# Use aggressive compaction and checkpoint settings
tenant_id, _ = env.neon_cli.create_tenant(
conf={
"checkpoint_distance": f"{1024 ** 2}",
"compaction_target_size": f"{1024 ** 2}",
"compaction_period": "5 s",
"compaction_threshold": "3",
}
)
pageserver_http.configure_failpoints(("compact-level0-phase1-return-same", "return"))
endpoint = env.endpoints.create_start("main", tenant_id=tenant_id)
connstr = endpoint.connstr(options="-csynchronous_commit=off")
pg_bin.run_capture(["pgbench", "-i", "-s1", connstr])
time.sleep(10) # let compaction to be performed
env.pageserver.assert_log_contains("compact-level0-phase1-return-same")
def test_actually_duplicated_l1(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin):
def test_local_only_layers_after_crash(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin):
"""
Test sets fail point at the end of first compaction phase: after
flushing new L1 layer but before deletion of L0 layers.
Test case for docs/rfcs/027-crash-consistent-layer-map-through-index-part.md.
The L1 used to be overwritten, but with crash-consistency via remote
index_part.json, we end up deleting the not yet uploaded L1 layer on
startup.
Simulate crash after compaction has written layers to disk
but before they have been uploaded/linked into remote index_part.json.
Startup handles this situation by deleting the not yet uploaded L1 layer files.
"""
neon_env_builder.enable_pageserver_remote_storage(RemoteStorageKind.LOCAL_FS)
env = neon_env_builder.init_start(
initial_tenant_conf={
"checkpoint_distance": f"{1024 ** 2}",
"compaction_target_size": f"{1024 ** 2}",
"checkpoint_distance": f"{10 * 1024**2}",
"compaction_period": "0 s",
"compaction_threshold": "3",
"compaction_threshold": "999999",
}
)
pageserver_http = env.pageserver.http_client()
@@ -70,13 +38,13 @@ def test_actually_duplicated_l1(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin)
pg_bin.run_capture(["pgbench", "-i", "-s1", connstr])
lsn = wait_for_last_flush_lsn(env, endpoint, tenant_id, timeline_id)
endpoint.stop()
# make sure we receive no new wal after this, so that we'll write over the same L1 file.
endpoint.stop()
for sk in env.safekeepers:
sk.stop()
pageserver_http.patch_tenant_config_client_side(tenant_id, {"compaction_threshold": 3})
# hit the exit failpoint
with pytest.raises(ConnectionError, match="Remote end closed connection without response"):
pageserver_http.timeline_checkpoint(tenant_id, timeline_id)
@@ -100,9 +68,15 @@ def test_actually_duplicated_l1(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin)
# L0
continue
candidate = parse_layer_file_name(path.name)
if isinstance(candidate, ImageLayerName):
continue
if l1_found is not None:
raise RuntimeError(f"found multiple L1: {l1_found.name} and {path.name}")
l1_found = parse_layer_file_name(path.name)
raise RuntimeError(f"found multiple L1: {l1_found.to_str()} and {path.name}")
l1_found = candidate
assert l1_found is not None, "failed to find L1 locally"
@@ -121,23 +95,14 @@ def test_actually_duplicated_l1(neon_env_builder: NeonEnvBuilder, pg_bin: PgBin)
# wait for us to catch up again
wait_for_last_record_lsn(pageserver_http, tenant_id, timeline_id, lsn)
pageserver_http.timeline_compact(tenant_id, timeline_id)
# give time for log flush
time.sleep(1)
message = f".*duplicated L1 layer layer={l1_found}"
found_msg = env.pageserver.log_contains(message)
# resident or evicted, it should not be overwritten, however it should had been non-existing at startup
assert (
found_msg is None
), "layer should had been removed during startup, did it live on as evicted?"
pageserver_http.timeline_compact(tenant_id, timeline_id, wait_until_uploaded=True)
assert env.pageserver.layer_exists(tenant_id, timeline_id, l1_found), "the L1 reappears"
wait_for_upload_queue_empty(pageserver_http, tenant_id, timeline_id)
uploaded = env.pageserver_remote_storage.remote_layer_path(
tenant_id, timeline_id, l1_found.to_str()
)
assert uploaded.exists(), "the L1 is uploaded"
# TODO: same test for L0s produced by ingest.

View File

@@ -163,11 +163,6 @@ def test_pageserver_chaos(
env = neon_env_builder.init_start(initial_tenant_shard_count=shard_count)
# these can happen, if we shutdown at a good time. to be fixed as part of #5172.
message = ".*duplicated L1 layer layer=.*"
for ps in env.pageservers:
ps.allowed_errors.append(message)
# Use a tiny checkpoint distance, to create a lot of layers quickly.
# That allows us to stress the compaction and layer flushing logic more.
tenant, _ = env.neon_cli.create_tenant(

View File

@@ -15,7 +15,7 @@ from fixtures.pageserver.utils import (
tenant_delete_wait_completed,
wait_for_upload_queue_empty,
)
from fixtures.remote_storage import LocalFsStorage, RemoteStorageKind, S3Storage
from fixtures.remote_storage import LocalFsStorage, RemoteStorageKind, S3Storage, s3_storage
from fixtures.utils import wait_until
from fixtures.workload import Workload
@@ -73,7 +73,7 @@ def test_location_conf_churn(neon_env_builder: NeonEnvBuilder, seed: int):
"""
neon_env_builder.num_pageservers = 3
neon_env_builder.enable_pageserver_remote_storage(
remote_storage_kind=RemoteStorageKind.MOCK_S3,
remote_storage_kind=s3_storage(),
)
env = neon_env_builder.init_start(initial_tenant_conf=TENANT_CONF)
@@ -100,10 +100,6 @@ def test_location_conf_churn(neon_env_builder: NeonEnvBuilder, seed: int):
]
)
# these can happen, if we shutdown at a good time. to be fixed as part of #5172.
message = ".*duplicated L1 layer layer=.*"
ps.allowed_errors.append(message)
workload = Workload(env, tenant_id, timeline_id)
workload.init(env.pageservers[0].id)
workload.write_rows(256, env.pageservers[0].id)
@@ -215,6 +211,13 @@ def test_location_conf_churn(neon_env_builder: NeonEnvBuilder, seed: int):
)
workload.validate(pageserver.id)
# Having done a bunch of attach/detach cycles, we will have generated some index garbage: check
# that the scrubber sees it and cleans it up. We do this before the final attach+validate pass,
# to also validate that the scrubber isn't breaking anything.
gc_summary = S3Scrubber(neon_env_builder).pageserver_physical_gc(min_age_secs=1)
assert gc_summary["remote_storage_errors"] == 0
assert gc_summary["indices_deleted"] > 0
# Attach all pageservers
for ps in env.pageservers:
location_conf = {"mode": "AttachedMulti", "secondary_conf": None, "tenant_conf": {}}
@@ -227,10 +230,11 @@ def test_location_conf_churn(neon_env_builder: NeonEnvBuilder, seed: int):
# Detach all pageservers
for ps in env.pageservers:
location_conf = {"mode": "Detached", "secondary_conf": None, "tenant_conf": {}}
assert ps.list_layers(tenant_id, timeline_id) != []
ps.tenant_location_configure(tenant_id, location_conf)
# Confirm that all local disk state was removed on detach
# TODO
# Confirm that all local disk state was removed on detach
assert ps.list_layers(tenant_id, timeline_id) == []
def test_live_migration(neon_env_builder: NeonEnvBuilder):

View File

@@ -3,7 +3,7 @@ import shutil
from typing import Optional
import pytest
from fixtures.common_types import TenantShardId
from fixtures.common_types import TenantId, TenantShardId, TimelineId
from fixtures.neon_fixtures import (
NeonEnvBuilder,
S3Scrubber,
@@ -109,3 +109,52 @@ def test_scrubber_tenant_snapshot(neon_env_builder: NeonEnvBuilder, shard_count:
# Check we can read everything
workload.validate()
@pytest.mark.parametrize("shard_count", [None, 4])
def test_scrubber_physical_gc(neon_env_builder: NeonEnvBuilder, shard_count: Optional[int]):
neon_env_builder.enable_pageserver_remote_storage(s3_storage())
neon_env_builder.num_pageservers = 2
env = neon_env_builder.init_configs()
env.start()
tenant_id = TenantId.generate()
timeline_id = TimelineId.generate()
env.neon_cli.create_tenant(tenant_id, timeline_id, shard_count=shard_count)
workload = Workload(env, tenant_id, timeline_id)
workload.init()
# We will end up with an index per shard, per cycle, plus one for the initial startup
n_cycles = 4
expect_indices_per_shard = n_cycles + 1
shard_count = 1 if shard_count is None else shard_count
# For each cycle, detach and attach the tenant to bump the generation, and do some writes to generate uploads
for _i in range(0, n_cycles):
env.storage_controller.tenant_policy_update(tenant_id, {"placement": "Detached"})
env.storage_controller.reconcile_until_idle()
env.storage_controller.tenant_policy_update(tenant_id, {"placement": {"Attached": 0}})
env.storage_controller.reconcile_until_idle()
# This write includes remote upload, will generate an index in this generation
workload.write_rows(1)
# With a high min_age, the scrubber should decline to delete anything
gc_summary = S3Scrubber(neon_env_builder).pageserver_physical_gc(min_age_secs=3600)
assert gc_summary["remote_storage_errors"] == 0
assert gc_summary["indices_deleted"] == 0
# If targeting a different tenant, the scrubber shouldn't do anything
gc_summary = S3Scrubber(neon_env_builder).pageserver_physical_gc(
min_age_secs=1, tenant_ids=[TenantId.generate()]
)
assert gc_summary["remote_storage_errors"] == 0
assert gc_summary["indices_deleted"] == 0
# With a low min_age, the scrubber should go ahead and clean up all but the latest 2 generations
gc_summary = S3Scrubber(neon_env_builder).pageserver_physical_gc(min_age_secs=1)
assert gc_summary["remote_storage_errors"] == 0
assert gc_summary["indices_deleted"] == (expect_indices_per_shard - 2) * shard_count

View File

@@ -7,6 +7,7 @@ from typing import Any, Dict, List, Union
import pytest
from fixtures.common_types import TenantId, TenantShardId, TimelineId
from fixtures.compute_reconfigure import ComputeReconfigure
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
NeonEnv,
@@ -18,6 +19,8 @@ from fixtures.neon_fixtures import (
from fixtures.pageserver.http import PageserverHttpClient
from fixtures.pageserver.utils import (
MANY_SMALL_LAYERS_TENANT_CONFIG,
assert_prefix_empty,
assert_prefix_not_empty,
enable_remote_storage_versioning,
list_prefix,
remote_storage_delete_key,
@@ -839,6 +842,86 @@ def test_storage_controller_tenant_conf(neon_env_builder: NeonEnvBuilder):
env.storage_controller.consistency_check()
def test_storage_controller_tenant_deletion(
neon_env_builder: NeonEnvBuilder,
compute_reconfigure_listener: ComputeReconfigure,
):
"""
Validate that:
- Deleting a tenant deletes all its shards
- Deletion does not require the compute notification hook to be responsive
- Deleting a tenant also removes all secondary locations
"""
neon_env_builder.num_pageservers = 4
neon_env_builder.enable_pageserver_remote_storage(s3_storage())
neon_env_builder.control_plane_compute_hook_api = (
compute_reconfigure_listener.control_plane_compute_hook_api
)
env = neon_env_builder.init_configs()
env.start()
tenant_id = TenantId.generate()
timeline_id = TimelineId.generate()
env.neon_cli.create_tenant(
tenant_id, timeline_id, shard_count=2, placement_policy='{"Attached":1}'
)
# Ensure all the locations are configured, including secondaries
env.storage_controller.reconcile_until_idle()
shard_ids = [
TenantShardId.parse(shard["shard_id"]) for shard in env.storage_controller.locate(tenant_id)
]
# Assert attachments all have local content
for shard_id in shard_ids:
pageserver = env.get_tenant_pageserver(shard_id)
assert pageserver.tenant_dir(shard_id).exists()
# Assert all shards have some content in remote storage
for shard_id in shard_ids:
assert_prefix_not_empty(
neon_env_builder.pageserver_remote_storage,
prefix="/".join(
(
"tenants",
str(shard_id),
)
),
)
# Break the compute hook: we are checking that deletion does not depend on the compute hook being available
def break_hook():
raise RuntimeError("Unexpected call to compute hook")
compute_reconfigure_listener.register_on_notify(break_hook)
# No retry loop: deletion should complete in one shot without polling for 202 responses, because
# it cleanly detaches all the shards first, and then deletes them in remote storage
env.storage_controller.pageserver_api().tenant_delete(tenant_id)
# Assert no pageservers have any local content
for pageserver in env.pageservers:
for shard_id in shard_ids:
assert not pageserver.tenant_dir(shard_id).exists()
for shard_id in shard_ids:
assert_prefix_empty(
neon_env_builder.pageserver_remote_storage,
prefix="/".join(
(
"tenants",
str(shard_id),
)
),
)
# Assert the tenant is not visible in storage controller API
with pytest.raises(StorageControllerApiException):
env.storage_controller.tenant_describe(tenant_id)
class Failure:
pageserver_id: int

View File

@@ -0,0 +1,57 @@
import threading
import time
from fixtures.neon_fixtures import NeonEnv
from fixtures.utils import wait_until
# This test checks of logical replication subscriber is able to correctly restart replication without receiving duplicates.
# It requires tracking information about replication origins at page server side
def test_subscriber_restart(neon_simple_env: NeonEnv):
env = neon_simple_env
env.neon_cli.create_branch("publisher")
pub = env.endpoints.create("publisher")
pub.start()
env.neon_cli.create_branch("subscriber")
sub = env.endpoints.create("subscriber")
sub.start()
n_records = 100000
n_restarts = 100
def check_that_changes_propagated():
scur.execute("SELECT count(*) FROM t")
res = scur.fetchall()
assert res[0][0] == n_records
def insert_data(pub):
with pub.cursor() as pcur:
for i in range(0, n_records):
pcur.execute("INSERT into t values (%s,random()*100000)", (i,))
with pub.cursor() as pcur:
with sub.cursor() as scur:
pcur.execute("CREATE TABLE t (pk integer primary key, sk integer)")
pcur.execute("CREATE PUBLICATION pub FOR TABLE t")
scur.execute("CREATE TABLE t (pk integer primary key, sk integer)")
# scur.execute("CREATE INDEX on t(sk)") # slowdown applying WAL at replica
pub_conn = f"host=localhost port={pub.pg_port} dbname=postgres user=cloud_admin"
query = f"CREATE SUBSCRIPTION sub CONNECTION '{pub_conn}' PUBLICATION pub"
scur.execute(query)
time.sleep(2) # let initial table sync complete
thread = threading.Thread(target=insert_data, args=(pub,), daemon=True)
thread.start()
for _ in range(n_restarts):
# restart subscriber
# time.sleep(2)
sub.stop("immediate")
sub.start()
thread.join()
pcur.execute(f"INSERT into t values ({n_records}, 0)")
n_records += 1
with sub.cursor() as scur:
wait_until(10, 0.5, check_that_changes_propagated)

View File

@@ -54,9 +54,26 @@ def test_tenant_delete_smoke(
# first try to delete non existing tenant
tenant_id = TenantId.generate()
env.pageserver.allowed_errors.append(f".*NotFound: tenant {tenant_id}.*")
with pytest.raises(PageserverApiException, match=f"NotFound: tenant {tenant_id}"):
ps_http.tenant_delete(tenant_id=tenant_id)
env.pageserver.allowed_errors.append(".*NotFound.*")
env.pageserver.allowed_errors.append(".*simulated failure.*")
# Check that deleting a non-existent tenant gives the expected result: this is a loop because we
# may need to retry on some remote storage errors injected by the test harness
while True:
try:
ps_http.tenant_delete(tenant_id=tenant_id)
except PageserverApiException as e:
if e.status_code == 500:
# This test uses failure injection, which can produce 500s as the pageserver expects
# the object store to always be available, and the ListObjects during deletion is generally
# an infallible operation
assert "simulated failure of remote operation" in e.message
elif e.status_code == 404:
# This is our expected result: trying to erase a non-existent tenant gives us 404
assert "NotFound" in e.message
break
else:
raise
env.neon_cli.create_tenant(
tenant_id=tenant_id,
@@ -88,6 +105,9 @@ def test_tenant_delete_smoke(
parent = timeline
# Upload a heatmap so that we exercise deletion of that too
ps_http.tenant_heatmap_upload(tenant_id)
iterations = poll_for_remote_storage_iterations(remote_storage_kind)
assert ps_http.get_metric_value("pageserver_tenant_manager_slots", {"mode": "attached"}) == 2

View File

@@ -1,5 +1,5 @@
{
"v16": ["16.3", "b228f20372ebcabfd7946647cb7adbd38bacb14a"],
"v15": ["15.7", "c2c3d40534db97d83dd7e185d1971e707fa2f445"],
"v14": ["14.12", "17e0f5ff4e1905691aa40e1e08f9b79b14c99652"]
"v16": ["16.3", "e2cccb954d4aa96713f2ae4a72b2806300f199f7"],
"v15": ["15.7", "8cc683b5428b9532f3897f3842fe44af90048617"],
"v14": ["14.12", "a9bfeec24d08f36eaffcd3548284e4732ad57a5c"]
}