Compare commits

..

162 Commits

Author SHA1 Message Date
Ivan Efremov
0467d88f06 Merge pull request #9756 from neondatabase/rc/proxy/2024-11-14
Proxy release 2024-11-14
2024-11-14 09:46:52 +02:00
Tristan Partin
1280b708f1 Improve error handling for NeonAPI fixture
Move error handling to the common request function and add a debug log.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-13 20:35:48 -06:00
John Spray
b4e00b8b22 pageserver: refuse to load tenants with suspiciously old indices in old generations (#9719)
## Problem

Historically, if a control component passed a pageserver "generation: 1"
this could be a quick way to corrupt a tenant by loading a historic
index.

Follows https://github.com/neondatabase/neon/pull/9383
Closes #6951 

## Summary of changes

- Introduce a Fatal variant to DownloadError, to enable index downloads
to signal when they have encountered a scary enough situation that we
shouldn't proceed to load the tenant.
- Handle this variant by putting the tenant into a broken state (no
matter which timeline within the tenant reported it)
- Add a test for this case

In the event that this behavior fires when we don't want it to, we have
ways to intervene:
- "Touch" an affected index to update its mtime (download+upload S3
object)
- If this behavior is triggered, it indicates we're attaching in some
old generation, so we should be able to fix that by manually bumping
generation numbers in the storage controller database (this should never
happen, but it's an option if it does)
2024-11-13 18:07:39 +00:00
Heikki Linnakangas
10aaa3677d PostgreSQL minor version updates (17.1, 16.5, 15.9, 14.14) (#9727)
This includes a patch to temporarily disable one test in the pg_anon
test suite. It is an upstream issue, the test started failing with the
new PostgreSQL minor versions because of a change in the default
timezone used in tests. We don't want to block the release for this,
so just disable the test for now. See
199f0a392b (note_2148017485)

Corresponding postgres repository PRs:
https://github.com/neondatabase/postgres/pull/524
https://github.com/neondatabase/postgres/pull/525
https://github.com/neondatabase/postgres/pull/526
https://github.com/neondatabase/postgres/pull/527
2024-11-13 15:08:58 +02:00
Heikki Linnakangas
d5435b1a81 tests: Increase timeout in test_create_churn_during_restart (#9736)
This test was seen to be flaky, e.g. at:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9457/11804246485/index.html#suites/ec4311502db344eee91f1354e9dc839b/982bd121ea698414/.
If I _reduce_ the timeout from 10s to 8s on my laptop, it reliably hits
that timeout and fails. That suggests that the test is pretty close to
the edge even when it passes. Let's bump up the timeout to 30 s to make
it more robust.

See also https://github.com/neondatabase/neon/issues/9730, although the
error message is different there.
2024-11-13 12:20:32 +02:00
Anastasia Lubennikova
080d585b22 Add installed_extensions prometheus metric (#9608)
and add /metrics endpoint to compute_ctl to expose such metrics

metric format example for extension pg_rag
with versions 1.2.3 and 1.4.2
installed in 3 and 1 databases respectively:

neon_extensions_installed{extension="pg_rag", version="1.2.3"} = 3
neon_extensions_installed{extension="pg_rag", version="1.4.2"} = 1

------
infra part: https://github.com/neondatabase/flux-fleet/pull/251
---------

Co-authored-by: Tristan Partin <tristan@neon.tech>
2024-11-13 09:36:48 +00:00
John Spray
7595d3afe6 pageserver: add no_sync for use in regression tests (2/2) (#9678)
## Problem

Followup to https://github.com/neondatabase/neon/pull/9677 which enables
`no_sync` in tests. This can be merged once the next release has
happened.

## Summary of changes

- Always run pageserver with `no_sync = true` in tests.
2024-11-13 09:17:26 +00:00
Konstantin Knizhnik
1ff5333a1b Do not wallog AUX files at replica (#9457)
## Problem

Attempt to persist LR stuff at replica cause cannot make new WAL entries
during recovery` error.
See https://neondb.slack.com/archives/C07S7RBFVRA/p1729280401283389

## Summary of changes

Do not wallog AUX files at replica.
Related Postgres PRs:

https://github.com/neondatabase/postgres/pull/517
https://github.com/neondatabase/postgres/pull/516
https://github.com/neondatabase/postgres/pull/515
https://github.com/neondatabase/postgres/pull/514


## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-11-13 08:50:01 +02:00
Tristan Partin
d8f5d43549 Fix autocommit footguns in performance tests
psycopg2 has the following warning related to autocommit:

> By default, any query execution, including a simple SELECT will start
> a transaction: for long-running programs, if no further action is
> taken, the session will remain “idle in transaction”, an undesirable
> condition for several reasons (locks are held by the session, tables
> bloat…). For long lived scripts, either ensure to terminate a
> transaction as soon as possible or use an autocommit connection.

In the 2.9 release notes, psycopg2 also made the following change:

> `with connection` starts a transaction on autocommit transactions too

Some of these connections are indeed long-lived, so we were retaining
tons of WAL on the endpoints because we had a transaction pinned in the
past.

Link: https://www.psycopg.org/docs/news.html#what-s-new-in-psycopg-2-9
Link: https://github.com/psycopg/psycopg2/issues/941
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-12 15:48:19 -06:00
Erik Grinaker
2256a5727a safekeeper: use WAL_SEGMENT_SIZE for empty timeline state (#9734)
## Problem

`TimelinePersistentState::empty()`, used for tests and benchmarks, had a
hardcoded 16 MB WAL segment size. This caused confusion when attempting
to change the global segment size.

## Summary of changes

Inherit from `WAL_SEGMENT_SIZE` in `TimelinePersistentState::empty()`.
2024-11-12 20:35:44 +00:00
Tristan Partin
3f80af8b1d Add neon.logical_replication_max_logicalsnapdir_size
This GUC will drop replication slots if the size of the
pg_logical/snapshots directory (not including temp snapshot files)
becomes larger than the specified size. Keeping the size of this
directory smaller will help with basebackup size from the pageserver.

Part-of: https://github.com/neondatabase/neon/issues/8619
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-12 13:13:28 -06:00
Tristan Partin
a61d81bbc7 Calculate compute_backpressure_throttling_seconds correctly
The original value that we get is measured in microseconds. It comes
from a calculation using Postgres' GetCurrentTimestamp(), whihc is
implemented in terms of gettimeofday(2).

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-12 13:12:08 -06:00
Erik Grinaker
05381a48f0 utils: remove unnecessary fsync in durable_rename() (#9686)
## Problem

WAL segment fsyncs significantly affect WAL ingestion throughput.
`durable_rename()` is used when initializing every 16 MB segment, and
issues 3 fsyncs of which 1 was unnecessary.

## Summary of changes

Remove an fsync in `durable_rename` which is unnecessary with Linux and
ext4 (which we currently use). This improves WAL ingestion throughput by
up to 23% with large appends on my MacBook.
2024-11-12 18:57:31 +01:00
Alex Chi Z.
cef165818c test(pageserver): add gc-compaction tests with delta will_init (#9724)
I had an impression that gc-compaction didn't test the case where the
first record of the key history is will_init because of there are some
code path that will panic in this case. Luckily it got fixed in
https://github.com/neondatabase/neon/pull/9026 so we can now implement
such tests.

Part of https://github.com/neondatabase/neon/issues/9114

## Summary of changes

* Randomly changed some images into will_init neon wal record
* Split `test_simple_bottom_most_compaction_deltas` into two test cases,
one of them has the bottom layer as delta layer with will_init flags,
while the other is the original one with image layers.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-12 10:37:31 -05:00
Erik Grinaker
6b19867410 safekeeper: don't flush control file on WAL ingest path (#9698)
## Problem

The control file is flushed on the WAL ingest path when the commit LSN
advances by one segment, to bound the amount of recovery work in case of
a crash. This involves 3 additional fsyncs, which can have a significant
impact on WAL ingest throughput. This is to some extent mitigated by
`AppendResponse` not being emitted on segment bound flushes, since this
will prevent commit LSN advancement, which will be addressed separately.

## Summary of changes

Don't flush the control file on the WAL ingest path at all. Instead,
leave that responsibility to the timeline manager, but ask it to flush
eagerly if the control file lags the in-memory commit LSN by more than
one segment. This should not cause more than `REFRESH_INTERVAL` (300 ms)
additional latency before flushing the control file, which is
negligible.
2024-11-12 15:17:03 +00:00
Tristan Partin
cc8029c4c8 Update pg_cron to 1.6.4
This comes with PG 17 support.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-11 20:10:53 -06:00
Tristan Partin
5be6b07cf1 Improve typing related to regress/test_logical_replication.py (#9725)
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-11 17:36:45 -06:00
Arpad Müller
b018bc7da8 Add a retain_lsn test (#9599)
Add a test that ensures the `retain_lsn` functionality works. Right now,
there is not a single test that is broken if offloaded or non-offloaded
timelines don't get registered at their parents, preventing gc from
discarding the ancestor_lsns of the children. This PR fills that gap.

The test has four modes:

* `offloaded`: offload the child timeline, run compaction on the parent
timeline, unarchive the child timeline, then try reading from it.
hopefully the data is still there.
* `offloaded-corrupted`: offload the child timeline, corrupts the
manifest in a way that the pageserver believes the timeline was
flattened. This is the closest we can get to pretend the `retain_lsn`
mechanism doesn't exist for offloaded timelines, so we can avoid adding
endpoints to the pageserver that do this manually for tests. The test
then checks that indeed data is corrupted and the endpoint can't be
started. That way we know that the test is actually working, and
actually tests the `retain_lsn` mechanism, instead of say the lsn lease
mechanism, or one of the many other mechanisms that impede gc.
* `archived`: the child timeline gets archived but doesn't get
offloaded. this currently matches the `None` case but we might have
refactors in the future that make archived timelines sufficiently
different from non-archived ones.
* `None`: the child timeline doesn't even get archived. this tests that
normal timelines participate in `retain_lsn`. I've made them locally not
participate in `retain_lsn` (via commenting out the respective
`ancestor_children.push` statement in tenant.rs) and ran the testsuite,
and not a single test failed. So this test is first of its kind.

Part of #8088.
2024-11-11 22:29:21 +00:00
Tristan Partin
4b075db7ea Add a postgres_exporter config file
This exporter logs an ERROR if a file called `postgres_exporter.yml` is
not located in its current working directory. We can silence it by
adding an empty config file and pointing the exporter at it.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-11 14:49:37 -06:00
Fedor Dikarev
fde16f8614 use batch gh-workflow-stats-action with separate table (#9722)
We found that exporting GH Workflow Runs in batch is more efficient due
to
- better utilisation of Github API
- and gh runners usage is rounded to minutes, so even when ad-hoc export
is done in 5-10 seconds, we billed for one minute usage

So now we introduce batch exporting, with version v0.2.x of github
workflow stats exporter.
How it's expected to work now:
- every 15 minutes we query for the workflow runs, created in last 2
hours
- to avoid missing workflows that ran for more than 2 hours, every night
(00:25) we will query workflows created in past 24 hours and export them
as well
- should we have query for even longer periods?
- lets see how it works with current schedule
- for longer periods like for days or weeks, it may require to adjust
logic and concurrency of querying data, so lets for now use simpler
version
2024-11-11 20:33:29 +00:00
Alex Chi Z.
5a138d08a3 feat(pageserver): support partial gc-compaction for delta layers (#9611)
The final patch for partial compaction, part of
https://github.com/neondatabase/neon/issues/9114, close
https://github.com/neondatabase/neon/issues/8921 (note that we didn't
implement parallel compaction or compaction scheduler for partial
compaction -- currently this needs to be scheduled by using a Python
script to split the keyspace, and in the future, automatically split
based on the key partitioning when the pageserver wants to trigger a
gc-compaction)

## Summary of changes

* Update the layer selection algorithm to use the same selection as full
compaction (everything intersect/below gc horizon)
* Update the layer selection algorithm to also generate a list of delta
layers that need to be rewritten
* Add the logic to rewrite delta layers and add them back to the layer
map
* Update test case to do partial compaction on deltas

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-11 20:30:32 +00:00
Tristan Partin
2d9652c434 Clean up C.UTF-8 locale changes
Removes some unnecessary initdb arguments, and fixes Neon for MacOS
since it doesn't seem to ship a C.UTF-8 locale.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-11 13:53:12 -06:00
Alexander Bayandin
e9dcfa2eb2 test_runner: skip more tests using decorator instead of pytest.skip (#9704)
## Problem

Running `pytest.skip(...)` in a test body instead of marking the test
with `@pytest.mark.skipif(...)` makes all fixtures to be initialised,
which is not necessary if the test is going to be skipped anyway.

Also, some tests are unnecessarily skipped (e.g. `test_layer_bloating`
on Postgres 17, or `test_idle_reconnections` at all) or run (e.g.
`test_parse_project_git_version_output_positive` more than on once
configuration) according to comments.

## Summary of changes
- Move `skip_on_postgres` / `xfail_on_postgres` /
`run_only_on_default_postgres` decorators to `fixture.utils`
- Add new `skip_in_debug_build` and `skip_on_ci` decorators
- Replace `pytest.skip(...)` calls with decorators where possible
2024-11-11 18:07:01 +00:00
Peter Bendel
8db84d9964 new ingest benchmark (#9711)
## Problem

We have no specific benchmark testing project migration of postgresql
project with existing data into Neon.
Typical steps of such a project migration are
- schema creation in the neon project
- initial COPY of relations
- creation of indexes and constraints
- vacuum analyze

## Summary of changes

Add a periodic benchmark running 9 AM UTC every day.
In each run:
- copy a 200 GiB project that has realistic schema, data, tables,
indexes and constraints from another project into
  - a new Neon project (7 CU fixed)
- an existing tenant, (but new branch and new database) that already has
4 TiB of data
- use pgcopydb tool to automate all steps and parallelize COPY and index
creation
- parse pgcopydb output and report performance metrics in Neon
performance test database

## Logs

This benchmark has been tested first manually and then as part of
benchmarking.yml workflow, example run see

https://github.com/neondatabase/neon/actions/runs/11757679870
2024-11-11 17:51:15 +00:00
Alexander Bayandin
1aab34715a Remove checklist from the PR template (#9702)
## Problem
Once we enable the merge queue for the `main` branch, it won't be
possible to adjust the commit message right after pressing the "Squash
and merge" button and the PR title + description will be used as is.

To avoid extra noise in the commits in the `main` with the checklist
leftovers, I propose removing the checklist from the PR template and
keeping only the Problem / Summary of changes.

## Summary of changes
- Remove the checklist from the PR template
2024-11-11 17:01:02 +00:00
Erik Grinaker
f63de5f527 safekeeper: add initialize_segment variant of safekeeper_wal_storage_operation_seconds (#9691)
## Problem

We don't have a metric capturing the latency of segment initialization.
This can be significant due to fsyncs.

## Summary of changes

Add an `initialize_segment` variant of
`safekeeper_wal_storage_operation_seconds`.
2024-11-11 17:55:50 +01:00
Alex Chi Z.
54a1676680 rfc: update aux file rfc to reflect latest optimizations (#9681)
Reflects https://github.com/neondatabase/neon/pull/9631 in the RFC.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-11 09:19:03 -05:00
Alex Chi Z.
48c06d9f7b fix(pageserver): increase frozen layer warning threshold; ignore in tests (#9705)
Perf benchmarks produce a lot of layers.

## Summary of changes

Bumping the threshold and ignore the warning.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-11 09:13:46 -05:00
Alexander Bayandin
f510647c7e CI: retry actions/github-script for 5XX errors (#9703)
## Problem

GitHub API can return error 500, and it fails jobs that use
`actions/github-script` action.

## Summary of changes
- Add `retry: 500` to all `actions/github-script` usage
2024-11-11 12:42:32 +00:00
Vlad Lazar
ceaa80ffeb storcon: add peer token for peer to peer communication (#9695)
## Problem

We wish to stop using admin tokens in the infra repo, but step down
requests use the admin token.

## Summary of Changes

Introduce a new "ControllerPeer" scope and use it for step-down requests.
2024-11-11 09:58:41 +00:00
Alexander Bayandin
2fcac0e66b CI(pre-merge-checks): add required checks (#9700)
## Problem
The Merge queue doesn't work because it expects certain jobs, which we
don't have in the `pre-merge-checks` workflow.
But it turns out we can just create jobs/checks with the same names in
any workflow that we run.

## Summary of changes
- Add `conclusion` jobs
- Create `neon-cloud-e2e` status check
- Add a bunch of `if`s to handle cases with no relevant changes found
and prepare the workflow to run rust checks in the future
- List the workflow in `report-workflow-stats` to collect stats about it
2024-11-09 01:02:54 +00:00
Tristan Partin
ecde8d7632 Improve type safety according to pyright
Pyright found many issues that mypy doesn't seem to want to catch or
mypy isn't configured to catch.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-08 14:43:15 -06:00
Alex Chi Z.
af8238ae52 fix(pageserver): drain upload queue before offloading timeline (#9682)
It is possible at the point we shutdown the timeline, there are
still layer files we did not upload.

## Summary of changes

* If the queue is not empty, avoid offloading.
* Shutdown the timeline gracefully using the flush mode to
ensure all local files are uploaded before deleting the timeline
directory.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-08 14:28:55 -05:00
Erik Grinaker
ab47804d00 safekeeper: remove unused WriteGuardSharedState::skip_update (#9699) 2024-11-08 19:25:31 +00:00
Alex Chi Z.
ecca62a45d feat(pageserver): more log lines around frozen layers (#9697)
We saw pageserver OOMs
https://github.com/neondatabase/cloud/issues/19715 for tenants doing
large writes. Add log lines around in-memory layers to hopefully collect
some info during my on-call shift next week.

## Summary of changes

* Estimate in-memory size of an in-mem layer.
* Print frozen layer number if there are too many layers accumulated in
memory.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-08 18:44:00 +00:00
Tristan Partin
34a4eb6f2a Switch compute-related locales to C.UTF-8 by default
Right now, our environments create databases with the C locale, which is
really unfortunate for users who have data stored in other languages
that they want to analyze. For instance, show_trgm on Hebrew text
currently doesn't work in staging or production.

I don't envision this being the final solution. I think this is just a
way to set a known value so the pageserver doesn't use its parent
environment. The final solution to me is exposing initdb parameters to
users in the console. Then they could use a different locale or encoding
if they so chose.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-08 12:19:18 -06:00
Alexander Bayandin
b6bc954c5d CI: move check codestyle python to reusable workflow and run on a merge_group (#9683)
## Problem

To prevent breaking main after Python 3.11 PR get merged 
we need to enable merge queue and run `check-codestyle-python`
job on it

## Summary of changes
- Move `check-codestyle-python` to a reusable workflow
- Run this workflow on `merge_group` event
2024-11-08 17:32:56 +00:00
Vlad Lazar
30680d1f32 tests: use tigther storcon scopes (#9696)
## Problem

https://github.com/neondatabase/neon/pull/9596 did not update tests
because that would've broken the compat tests.

## Summary of Changes

Use infra scope where possible.
2024-11-08 17:00:31 +00:00
Alex Chi Z.
f561cbe1c7 fix(pageserver): drain upload queue before detaching ancestor (#9651)
In INC-317
https://neondb.slack.com/archives/C033RQ5SPDH/p1730815677932209, we saw
an interesting series of operations that would remove valid layer files
existing in the layer map.

* Timeline A starts compaction and generates an image layer Z but not
uploading it yet.
* Timeline B/C starts ancestor detaching (which should not affect
timeline A)
* The tenant gets restarted as part of the ancestor detaching process,
without increasing the generation number.
* Timeline A reloads, discovering the layer Z is a future layer, and
schedules a **deletion into the deletion queue**. This means that the
file will be deleted any time in the future.
* Timeline A starts compaction and generates layer Z again, adding it to
the layer map. Note that because we don't bump generation number during
ancestor detach, it has the same filename + generation number as the
original Z.
* Timeline A deletes layer Z from s3 + disk, and now we have a dangling
reference in the layer map, blocking all
compaction/logical_size_calculation process.

## Summary of changes

* We wait until all layers to be uploaded before shutting down the
tenants in `Flush` mode.
* Ancestor detach restarts now use this mode.
* Ancestor detach also waits for remote queue completion before starting
the detaching process.
* The patch ensures that we don't have any future image layer (or
something similar) after restart, but not fixing the underlying problem
around generation numbers.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-08 10:35:27 -05:00
Tristan Partin
3525d2e381 Update TimescaleDB to 2.17.1 for PG 17
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-08 09:15:38 -06:00
Konstantin Knizhnik
17c002b660 Do not copy logical replicaiton slots to replica (#9458)
## Problem

Replication slots are now persisted using AUX files mechanism and
included in basebackup when replica is launched.
This slots are not somehow used at replica but hold WAL, which may cause
local disk space exhaustion.

## Summary of changes

Add `--replica` parameter to basebackup request and do not include
replication slot state files in basebackup for replica.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-11-08 14:54:58 +02:00
John Spray
aa9112efce pageserver: add no_sync for use in regression tests (1/2) (#9677)
## Problem

In test environments, the `syncfs` that the pageserver does on startup
can take a long time, as other tests running concurrently might have
many gigabytes of dirty pages.

## Summary of changes

- Add a `no_sync` option to the pageserver's config.
- Skip syncfs on startup if this is set
- A subsequent PR (https://github.com/neondatabase/neon/pull/9678) will
enable this by default in tests. We need to wait until after the next
release to avoid breaking compat tests, which would fail if we set
no_sync & use an old pageserver binary.

Q: Why is this a different mechanism than safekeeper, which as a
--no-sync CLI?
A: Because the way we manage pageservers in neon_local depends on the
pageserver.toml containing the full configuration, whereas safekeepers
have a config file which is neon-local-specific and can drive a CLI
flag.

Q: Why is the option no_sync rather than sync?
A: For boolean configs with a dangerous value, it's preferable to make
"false" the safe option, so that any downstream future config tooling
that might have a "booleans are false by default" behavior (e.g. golang
structs) is safe by default.

Q: Why only skip the syncfs, and not all fsyncs?
A: Skipping all fsyncs would require more code changes, and the most
acute problem isn't fsyncs themselves (these just slow down a running
test), it's the syncfs (which makes a pageserver startup slow as a
result of _other_ tests)
2024-11-08 10:16:04 +00:00
JC Grünhage
027889b06c ci: use set-docker-config-dir from dev-actions (#9638)
set-docker-config-dir was replicated over multiple repositories.

The replica of this action was removed from this repository and it's
using the version from github.com/neondatabase/dev-actions instead
2024-11-08 10:44:59 +01:00
Heikki Linnakangas
79929bb1b6 Disable rust_2024_compatibility lint option (#9615)
Compiling with nightly rust compiler, I'm getting a lot of errors like
this:

    error: `if let` assigns a shorter lifetime since Edition 2024
       --> proxy/src/auth/backend/jwt.rs:226:16
        |
    226 |             if let Some(permit) = self.try_acquire_permit() {
        |                ^^^^^^^^^^^^^^^^^^^-------------------------
        |                                   |
| this value has a significant drop implementation which may observe a
major change in drop order and requires your discretion
        |
        = warning: this changes meaning in Rust 2024
= note: for more information, see issue #124085
<https://github.com/rust-lang/rust/issues/124085>
    help: the value is now dropped here in Edition 2024
       --> proxy/src/auth/backend/jwt.rs:241:13
        |
    241 |             } else {
        |             ^
    note: the lint level is defined here
       --> proxy/src/lib.rs:8:5
        |
    8   |     rust_2024_compatibility
        |     ^^^^^^^^^^^^^^^^^^^^^^^
= note: `#[deny(if_let_rescope)]` implied by
`#[deny(rust_2024_compatibility)]`

and this:

error: these values and local bindings have significant drop
implementation that will have a different drop order from that of
Edition 2021
       --> proxy/src/auth/backend/jwt.rs:376:18
        |
    369 |         let client = Client::builder()
| ------ these values have significant drop implementation and will
observe changes in drop order under Edition 2024
    ...
    376 |             map: DashMap::default(),
        |                  ^^^^^^^^^^^^^^^^^^
        |
        = warning: this changes meaning in Rust 2024
= note: for more information, see issue #123739
<https://github.com/rust-lang/rust/issues/123739>
= note: `#[deny(tail_expr_drop_order)]` implied by
`#[deny(rust_2024_compatibility)]`

They are caused by the `rust_2024_compatibility` lint option.

When we actually switch to the 2024 edition, it makes sense to go
through all these and check that the drop order changes don't break
anything, but in the meanwhile, there's no easy way to avoid these
errors. Disable it, to allow compiling with nightly again.

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-11-08 08:35:03 +00:00
Peter Bendel
9132d80aa3 add pgcopydb tool to build tools image (#9658)
## Problem

build-tools image does not provide superuser, so additional packages can
not be installed during GitHub benchmarking workflows but need to be
added to the image

## Summary of changes

install pgcopydb version 0.17-1 or higher into build-tools bookworm
image

```bash
docker run -it neondatabase/build-tools:<tag>-bookworm-arm64 /bin/bash
...
nonroot@c23c6f4901ce:~$ LD_LIBRARY_PATH=/pgcopydb/lib /pgcopydb/bin/pgcopydb --version;
13:58:19.768 8 INFO   Running pgcopydb version 0.17 from "/pgcopydb/bin/pgcopydb"
pgcopydb version 0.17
compiled with PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
compatible with Postgres 11, 12, 13, 14, 15, and 16
```

Example usage of that image in a workflow

https://github.com/neondatabase/neon/actions/runs/11725718371/job/32662681172#step:7:14
2024-11-07 19:00:25 +01:00
Conrad Ludgate
82e3f0ecba [proxy/authorize]: improve JWKS reliability (#9676)
While setting up some tests, I noticed that we didn't support keycloak.
They make use of encryption JWKs as well as signature ones. Our current
jwks crate does not support parsing encryption keys which caused the
entire jwk set to fail to parse. Switching to lazy parsing fixes this.

Also while setting up tests, I couldn't use localhost jwks server as we
require HTTPS and we were using webpki so it was impossible to add a
custom CA. Enabling native roots addresses this possibility.

I saw some of our current e2e tests against our custom JWKS in s3 were
taking a while to fetch. I've added a timeout + retries to address this.
2024-11-07 16:24:38 +00:00
Arpad Müller
75aa19aa2d Don't attach is_archived to debug output (#9679)
We are in branches where we know its value already.
2024-11-07 16:13:50 +00:00
Alex Chi Z.
a8d9939ea9 fix(pageserver): reduce aux compaction threshold (#9647)
ref https://github.com/neondatabase/neon/issues/9441

The metrics from LR publisher testing project: ~300KB aux key deltas per
256MB files. Therefore, I think we can do compaction more aggressively
as these deltas are small and compaction can reduce layer download
latency. We also have a read path perf fix
https://github.com/neondatabase/neon/pull/9631 but I'd still combine the
read path fix with the reduce of the compaction threshold.

## Summary of changes

* reduce metadata compaction threshold
* use num of L1 delta layers as an indicator for metadata compaction
* dump more logs

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-07 10:38:15 -05:00
Erik Grinaker
f18aa04b90 safekeeper: use set_len() to zero out segments (#9665)
## Problem

When we create a new segment, we zero it out in order to avoid changing
the length and fsyncing metadata on every write. However, we zeroed it
out by writing 8 KB zero-pages, and Tokio file writes have non-trivial
overhead.

## Summary of changes

Zero out the segment using
[`File::set_len()`](https://docs.rs/tokio/latest/i686-unknown-linux-gnu/tokio/fs/struct.File.html#method.set_len)
instead. This will typically (depending on the filesystem) just write a
sparse file and omit the 16 MB of data entirely. This improves WAL
append throughput for large messages by over 400% with fsync disabled,
and 100% with fsync enabled.
2024-11-07 15:09:57 +00:00
Erik Grinaker
01265b7bc6 safekeeper: add basic WAL ingestion benchmarks (#9531)
## Problem

We don't have any benchmarks for Safekeeper WAL ingestion.

## Summary of changes

Add some basic benchmarks for WAL ingestion, specifically for
`SafeKeeper::process_msg()` (single append) and `WalAcceptor` (pipelined
batch ingestion). Also add some baseline file write benchmarks.
2024-11-07 13:24:03 +00:00
Arseny Sher
f54f0e8e2d Fix direct reading from WAL buffers. (#9639)
Fix direct reading from WAL buffers.
Pointer wasn't advanced which resulted in sending corrupted WAL if part
of read used WAL buffers and part read from the file. Also move it to
neon_walreader so that e.g. replication could also make use of it.

ref https://github.com/neondatabase/cloud/issues/19567
2024-11-07 11:29:52 +00:00
Erik Grinaker
d6aa26a533 postgres_ffi: make WalGenerator generic over record generator (#9614)
## Problem

Benchmarks need more control over the WAL generated by `WalGenerator`.
In particular, they need to vary the size of logical messages.

## Summary of changes

* Make `WalGenerator` generic over `RecordGenerator`, which constructs
WAL records.
* Add `LogicalMessageGenerator` which emits logical messages, with a
configurable payload.
* Minor tweaks and code reorganization.

There are no changes to the core logic or emitted WAL.
2024-11-07 10:38:39 +00:00
Ivan Efremov
f5eec194e7 Merge pull request #9674 from neondatabase/rc/proxy/2024-11-07
Proxy release 2024-11-07
2024-11-07 12:07:12 +02:00
Cheng Chen
e1d0b73824 chore(compute): Bump pg_mooncake to the latest version 2024-11-06 22:41:18 -06:00
Arpad Müller
011c0a175f Support copying layers in detach_ancestor from before shard splits (#9669)
We need to use the shard associated with the layer file, not the shard
associated with our current tenant shard ID.

Due to shard splits, the shard IDs can refer to older files.

close https://github.com/neondatabase/neon/issues/9667
2024-11-07 01:53:58 +01:00
Alex Chi Z.
2a95a51a0d refactor(pageserver): better pageservice command parsing (#9597)
close https://github.com/neondatabase/neon/issues/9460

## Summary of changes

A full rewrite of pagestream cmdline parsing to make it more robust and
readable.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-11-06 20:41:01 +00:00
Yuchen Liang
11fc1a4c12 fix(test): use layer map dump in test_readonly_node_gc to validate layers protected by leases (#9551)
Fixes #9518.

## Problem

After removing the assertion `layers_removed == 0` in #9506, we could
miss breakage if we solely rely on the successful execution of the
`SELECT` query to check if lease is properly protecting layers. Details
listed in #9518.

Also, in integration tests, we sometimes run into the race condition
where getpage request comes before the lease get renewed (item 2 of
#8817), even if compute_ctl sends a lease renewal as soon as it sees a
`/configure` API calls that updates the `pageserver_connstr`. In this
case, we would observe a getpage request error stating that we `tried to
request a page version that was garbage collected` (as we seen in
[Allure
Report](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8613/11550393107/index.html#suites/3ccffb1d100105b98aed3dc19b717917/d1a1ba47bc180493)).

## Summary of changes

- Use layer map dump to verify if the lease protects what it claimed:
Record all historical layers that has `start_lsn <= lease_lsn` before
and after running timeline gc. This is the same check as
ad79f42460/pageserver/src/tenant/timeline.rs (L5025-L5027)
The set recorded after GC should contain every layer in the set recorded
before GC.
- Wait until log contains another successful lease request before
running the `SELECT` query after GC. We argued in #8817 that the bad
request can only exist within a short period after migration/restart,
and our test shows that as long as a lease renewal is done before the
first getpage request sent after reconfiguration, we will not have bad
request.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-06 20:18:21 +00:00
Tristan Partin
93123f2623 Rename compute_backpressure_throttling_ms to compute_backpressure_throttling_seconds
This is in line with the Prometheus guidance[0]. We also haven't started
using this metric, so renaming is essentially free.

Link: https://prometheus.io/docs/practices/naming/ [0]
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-06 13:28:23 -06:00
Alex Chi Z.
1d3559d4bc feat(pageserver): add fast path for sparse keyspace read (#9631)
In https://github.com/neondatabase/neon/issues/9441, the tenant has a
lot of aux keys spread in multiple aux files. The perf tool shows that a
significant amount of time is spent on remove_overlapping_keys. For
sparse keyspaces, we don't need to report missing key errors anyways,
and it's very likely that we will need to read all layers intersecting
with the key range. Therefore, this patch adds a new fast path for
sparse keyspace reads that we do not track `unmapped_keyspace` in a
fine-grained way. We only modify it when we find an image layer.

In debug mode, it was ~5min to read the aux files for a dump of the
tenant, and now it's only 8s, that's a 60x speedup.

## Summary of changes

* Do not add sparse keys into `keys_done` so that remove_overlapping
does nothing.
* Allow `ValueReconstructSituation::Complete` to be updated again in
`ValuesReconstructState::update_key` for sparse keyspaces.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-06 18:17:02 +00:00
Conrad Ludgate
73bdc9a2d0 [proxy]: minor changes to endpoint-cache handling (#9666)
I think I meant to make these changes over 6 months ago. alas, better
late than never.

1. should_reject doesn't eagerly intern the endpoint string
2. Rate limiter uses a std Mutex instead of a tokio Mutex.
3. Recently I introduced a `-local-proxy` endpoint suffix. I forgot to
add this to normalize.
4. Random but a small cleanup making the ControlPlaneEvent deser
directly to the interned strings.
2024-11-06 17:40:40 +00:00
John Spray
d182ff294c storcon: respect tenant scheduling policy in drain/fill (#9657)
## Problem

Pinning a tenant by setting Pause scheduling policy doesn't work because
drain/fill code moves the tenant around during deploys.

Closes: https://github.com/neondatabase/neon/issues/9612

## Summary of changes

- In drain, only move a tenant if it is in Active or Essential mode
- In fill, only move a tenant if it is in Active mode.

The asymmetry is a bit annoying, but it faithfully respects the purposes
of the modes: Essential is meant to endeavor to keep the tenant
available, which means it needs to be drained but doesn't need to be
migrated during fills.
2024-11-06 15:14:43 +00:00
Vlad Lazar
4dfa0c221b pageserver: ingest pre-serialized batches of values (#9579)
## Problem

https://github.com/neondatabase/neon/pull/9524 split the decoding and
interpretation step from ingestion.
The output of the first phase is a `wal_decoder::models::InterpretedWalRecord`. 
Before this patch set that struct contained a list of `Value` instances.

We wish to lift the decoding and interpretation step to the safekeeper,
but it would be nice if the safekeeper gave us a batch containing the raw data instead of actual values.

## Summary of changes

Main goal here is to make `InterpretedWalRecord` hold a raw buffer which
contains pre-serialized Values.
For this we do:
1. Add a `SerializedValueBatch` type. This is `inmemory_layer::SerializedBatch` with some 
extra functionality for extension, observing values for shard 0 and tests.
2. Replace `inmemory_layer::SerializedBatch` with `SerializedValueBatch`
3. Make `DatadirModification` maintain a `SerializedValueBatch`.


### `DatadirModification` changes

`DatadirModification` now maintains a `SerializedValueBatch` and extends
it as new WAL records come in (to avoid flushing to disk on every
record).
In turn, this cascaded into a number of modifications to
`DatadirModification`:
1. Replace `pending_data_pages` and `pending_zero_data_pages` with `pending_data_batch`.
2. Removal of `pending_zero_data_pages` and its cousin `on_wal_record_end`
3. Rename `pending_bytes` to `pending_metadata_bytes` since this is what it tracks now.
4. Adapting of various utility methods like `len`, `approx_pending_bytes` and `has_dirty_data_pages`.

Removal of `pending_zero_data_pages` and the optimisation associated
with it ((1) and (2)) deserves more detail.

Previously all zero data pages went through `pending_zero_data_pages`.
We wrote zero data pages when filling gaps caused by relation extension
(case A) and when handling special wal records (case B). If it happened
that the same WAL record contained a non zero write for an entry in
`pending_zero_data_pages` we skipped the zero write.

Case A: We handle this differently now. When ingesting the
`SerialiezdValueBatch` associated with one PG WAL record, we identify the gaps and fill the
them in one go. Essentially, we move from a per key process (gaps were filled after each
new key), and replace it with a per record process. Hence, the optimisation is not
required anymore.

Case B: When the handling of a special record needs to zero out a key,
it just adds that to the current batch. I inspected the code, and I
don't think the optimisation kicked in here.
2024-11-06 14:10:32 +00:00
Folke Behrens
bdd492b1d8 proxy: Replace "web(auth)" with "console redirect" everywhere (#9655) 2024-11-06 11:03:38 +00:00
Folke Behrens
5d8284c7fe proxy: Read cplane JWT with clap arg (#9654) 2024-11-06 10:27:55 +00:00
Folke Behrens
ebc43efebc proxy: Refactor cplane types (#9643)
The overall idea of the PR is to rename a few types to make their
purpose more clear, reduce abstraction where not needed, and move types
to to more better suited modules.
2024-11-05 23:03:53 +01:00
Folke Behrens
754d2950a3 proxy: Revert ControlPlaneEvent back to struct (#9649)
Due to neondatabase/cloud#19815 we need to be more tolerant when reading
events.
2024-11-05 21:32:33 +00:00
Conrad Ludgate
fcde40d600 [proxy] use the proxy protocol v2 command to silence some logs (#9620)
The PROXY Protocol V2 offers a "command" concept. It can be of two
different values. "Local" and "Proxy". The spec suggests that "Local" be
used for health-checks. We can thus use this to silence logging for such
health checks such as those from NLB.

This additionally refactors the flow to be a bit more type-safe, self
documenting and using zerocopy deser.
2024-11-05 17:23:00 +00:00
Erik Grinaker
babfeb70ba safekeeper: don't allocate send buffers on stack (#9644)
## Problem

While experimenting with `MAX_SEND_SIZE` for benchmarking, I saw stack
overflows when increasing it to 1 MB. Turns out a few buffers of this
size are stack-allocated rather than heap-allocated. Even at the default
128 KB size, that's a bit large to allocate on the stack.

## Summary of changes

Heap-allocate buffers of size `MAX_SEND_SIZE`.
2024-11-05 17:05:30 +00:00
Ivan Efremov
2f1a56c8f9 proxy: Unify local and remote conn pool client structures (#9604)
Unify client, EndpointConnPool and DbUserConnPool for remote and local
conn.
- Use new ClientDataEnum for additional client data.
- Add ClientInnerCommon client structure.
- Remove Client and EndpointConnPool code from local_conn_pool.rs
2024-11-05 17:33:41 +02:00
John Spray
e30f5fb922 scrubber: remove AWS region assumption, tolerate negative max_project_size (#9636)
## Problem

First issues noticed when trying to run scrubber find-garbage on Azure:
- Azure staging contains projects with -1 set for max_project_size:
apparently the control plane treats this as a signed field.
- Scrubber code assumed that listing projects should filter to
aws-$REGION. This is no longer needed (per comment in the code) because
we know hit region-local APIs.

This PR doesn't make it work all the way (`init_remote` still assumes
S3), but these are necessary precursors.

## Summary of changes

- Change max-project_size from unsigned to signed
- Remove region filtering in favor of simply using the right region's
API (which we already do)
2024-11-05 13:32:50 +00:00
Arpad Müller
70ae8c16da Construct models::TenantConfig only once (#9630)
Since 5f83c9290b482dc90006c400dfc68e85a17af785/#1504 we've had
duplication in construction of models::TenantConfig, where both
constructs contained the same code. This PR removes one of the two
locations to avoid the duplication.
2024-11-05 13:02:49 +00:00
Erik Grinaker
8840f3858c pageserver: return 503 during tenant shutdown (#9635)
## Problem

Tenant operations may return `409 Conflict` if the tenant is shutting
down. This status code is not retried by the control plane, causing
user-facing errors during pageserver restarts. Operations should instead
return `503 Service Unavailable`, which may be retried for idempotent
operations.

## Summary of changes

Convert
`GetActiveTenantError::WillNotBecomeActive(TenantState::Stopping)` to
`ApiError::ShuttingDown` rather than `ApiError::Conflict`. This error is
returned by `Tenant::wait_to_become_active` in most (all?)
tenant/timeline-related HTTP routes.
2024-11-05 13:16:55 +01:00
Tristan Partin
1e16221f82 Update psycopg2 to latest version for complete PG 17 support
Update the types to match. Changes the cursor import to match the
C bindings[0].

Link: https://github.com/python/typeshed/issues/12578 [0]
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-04 18:21:59 -06:00
Tristan Partin
34812a6aab Improve some typing related to performance testing for LR
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-04 15:52:01 -06:00
Arpad Müller
ee68bbf6f5 Add tenant config option to allow timeline_offloading (#9598)
Allow us to enable timeline offloading for single tenants without having
to enable it for the entire pageserver.

Part of #8088.
2024-11-04 21:01:18 +01:00
Folke Behrens
1085fe57d3 proxy: Rewrite ControlPlaneEvent as enum (#9627) 2024-11-04 20:19:26 +01:00
Folke Behrens
59879985b4 proxy: Wrap JWT errors in separate AuthError variant (#9625)
* Also rename `AuthFailed` variant to `PasswordFailed`.
* Before this all JWT errors end up in `AuthError::AuthFailed()`,
  expects a username and also causes cache invalidation.
2024-11-04 19:56:40 +01:00
Conrad Ludgate
81d1bb1941 quieten aws_config logs (#9626)
logs during aws authentication are soooo noisy in staging 🙃
2024-11-04 17:28:10 +00:00
Christian Schwarz
06113e94e6 fix(test_regress): always use storcon virtual pageserver API to set tenant config (#9622)
Problem
-------

Tests that directly call the Pageserver Management API to set tenant
config are flaky if the Pageserver is managed by Storcon because Storcon
is the source of truth and may (theoretically) reconcile a tenant at any
time.

Solution
--------

Switch all users of
`set_tenant_config`/`patch_tenant_config_client_side`
to use the `env.storage_controller.pageserver_api()`

Future Work
-----------

Prevent regressions from creeping in.

And generally clean up up tenant configuration.
Maybe we can avoid the Pageserver having a default tenant config at all
and put the default into Storcon instead?

* => https://github.com/neondatabase/neon/issues/9621

Refs
----

fixes https://github.com/neondatabase/neon/issues/9522
2024-11-04 17:42:08 +01:00
Erik Grinaker
0d5a512825 safekeeper: add walreceiver metrics (#9450)
## Problem

We don't have any observability for Safekeeper WAL receiver queues.

## Summary of changes

Adds a few WAL receiver metrics:

* `safekeeper_wal_receivers`: gauge of currently connected WAL
receivers.
* `safekeeper_wal_receiver_queue_depth`: histogram of queue depths per
receiver, sampled every 5 seconds.
* `safekeeper_wal_receiver_queue_depth_total`: gauge of total queued
messages across all receivers.
* `safekeeper_wal_receiver_queue_size_total`: gauge of total queued
message sizes across all receivers.

There are already metrics for ingested WAL volume: `written_wal_bytes`
counter per timeline, and `safekeeper_write_wal_bytes` per-request
histogram.
2024-11-04 15:22:46 +00:00
Conrad Ludgate
8ad1dbce72 [proxy]: parse proxy protocol TLVs with aws/azure support (#9610)
AWS/azure private link shares extra information in the "TLV" values of
the proxy protocol v2 header. This code doesn't action on it, but it
parses it as appropriate.
2024-11-04 14:04:56 +00:00
Conrad Ludgate
3dcdbcc34d remove aws-lc-rs dep and fix storage_broker tls (#9613)
It seems the ecosystem is not so keen on moving to aws-lc-rs as it's
build setup is more complicated than ring (requiring cmake).

Eventually I expect the ecosystem should pivot to
https://github.com/ctz/graviola/tree/main/rustls-graviola as it
stabilises (it has a very simply build step and license), but for now
let's try not have a headache of juggling two crypto libs.

I also noticed that tonic will just fail with tls without a default
provider, so I added some defensive code for that.
2024-11-04 13:29:13 +00:00
Matthias van de Meent
d5de63c6b8 Fix a time zone issue in a PG17 test case (#9618)
The commit was cherry-picked and thus shouldn't cause issues once we
merge the release tag for PostgreSQL 17.1
2024-11-04 12:10:32 +00:00
John Spray
4534f5cdc6 pageserver: make local timeline deletion infallible (#9594)
## Problem

In https://github.com/neondatabase/neon/pull/9589, timeline offload code
is modified to return an explicit error type rather than propagating
anyhow::Error. One of the 'Other' cases there is I/O errors from local
timeline deletion, which shouldn't need to exist, because our policy is
not to try and continue running if the local disk gives us errors.

## Summary of changes

- Make `delete_local_timeline_directory` and use `.fatal_err(` on I/O
errors

---------

Co-authored-by: Erik Grinaker <erik@neon.tech>
2024-11-04 09:11:52 +00:00
Erik Grinaker
0058eb09df test_runner/performance: add sharded ingest benchmark (#9591)
Adds a Python benchmark for sharded ingestion. This ingests 7 GB of WAL
(100M rows) into a Safekeeper and fans out to 10 shards running on 10
different pageservers. The ingest volume and duration is recorded.
2024-11-02 16:42:10 +00:00
Konstantin Knizhnik
8ac523d2ee Do not assign page LSN to new (uninitialized) page in ClearVisibilityMapFlags redo handler (#9287)
## Problem

https://neondb.slack.com/archives/C04DGM6SMTM/p1727872045252899

See https://github.com/neondatabase/neon/issues/9240

## Summary of changes

Add `!page_is_new` check before assigning page lsn.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-11-01 20:31:29 +02:00
John Spray
3c16bd6e0b storcon: skip non-active projects in chaos injection (#9606)
## Problem

We may sometimes use scheduling modes like `Pause` to pin a tenant in
its current location for operational reasons. It is undesirable for the
chaos task to make any changes to such projects.

## Summary of changes

- Add a check for scheduling mode
- Add a log line when we do choose to do a chaos action for a tenant:
this will help us understand which operations originate from the chaos
task.
2024-11-01 16:47:20 +00:00
Erik Grinaker
123816e99a safekeeper: log slow WalAcceptor sends (#9564)
## Problem

We don't have any observability into full WalAcceptor queues per
timeline.

## Summary of changes

Logs a message when a WalAcceptor send has blocked for 5 seconds, and
another message when the send completes. This implies that the log
frequency is at most once every 5 seconds per timeline, so we don't need
further throttling.
2024-11-01 13:47:03 +01:00
Peter Bendel
8b3bcf71ee revert higher token expiration (#9605)
## Problem

The IAM role associated with our github action runner supports a max
token expiration which is lower than the value we tried.

## Summary of changes

Since we believe to have understood the performance regression we (by
ensuring availability zone affinity of compute and pageserver) the job
should again run in lower than 5 hours and we revert this change instead
of increasing the max session token expiration in the IAM role which
would reduce our security.
2024-11-01 12:46:02 +01:00
Erik Grinaker
4c2c8d6708 test_runner: fix tenant_get_shards with one pageserver (#9603)
## Problem

`tenant_get_shards()` does not work with a sharded tenant on 1
pageserver, as it assumes an unsharded tenant in this case. This special
case appears to have been added to handle e.g. `test_emergency_mode`,
where the storage controller is stopped. This breaks e.g. the sharded
ingest benchmark in #9591 when run with a single shard.

## Summary of changes

Correctly look up shards even with a single pageserver, but add a
special case that assumes an unsharded tenant if the storage controller
is stopped and the caller provides an explicit pageserver, in order to
accomodate `test_emergency_mode`.
2024-11-01 11:25:04 +00:00
Conrad Ludgate
2d1366c8ee fix pre-commit hook with python stubs (#9602)
fix #9601
2024-11-01 11:22:38 +00:00
Vlad Lazar
e589c2e5ec storage_controller: allow deployment infra to use infra token (#9596)
## Problem

We wish for the deployment orchestrator to use infra scoped tokens,
but storcon endpoints it's using require admin scoped tokens.

## Summary of Changes

Switch over all endpoints that are used by the deployment orchestrator
to use an infra scoped token. This causes no breakage during mixed
version scenarios because admin scoped tokens allow access to all
endpoints. The deployment orchestrator can cut over to the infra token
after this commit touches down in prod.

Once this commit is released we should also update the tests code to use
infra scoped tokens where appropriate. Currently it would fail on the
[compat tests](9761b6a64e/test_runner/regress/test_storage_controller.py (L69-L71)).
2024-10-31 18:29:16 +00:00
Conrad Ludgate
9761b6a64e update pg_session_jwt to use pgrx 0.12 for pg17 (#9595)
Updates the extension to use pgrx 0.12. No changes to the extensions
have been made, the only difference is the pgrx version.
2024-10-31 15:50:41 +00:00
Conrad Ludgate
897cffb9d8 auth_broker: fix local_proxy conn count (#9593)
our current metrics for http pool opened connections is always negative
:D oops
2024-10-31 14:57:55 +00:00
John Spray
552088ac16 pageserver: fix spurious error logs in timeline lifecycle (#9589)
## Problem

The final part of https://github.com/neondatabase/neon/issues/9543 will
be a chaos test that creates/deletes/archives/offloads timelines while
restarting pageservers and migrating tenants. Developing that test
showed up a few places where we log errors during normal shutdown.

## Summary of changes

- UninitializedTimeline's drop should log at info severity: this is a
normal code path when some part of timeline creation encounters a
cancellation `?` path.
- When offloading and finding a `RemoteTimelineClient` in a
non-initialized state, this is not an error and should not be logged as
such.
- The `offload_timeline` function returned an anyhow error, so callers
couldn't gracefully pick out cancellation errors from real errors:
update this to have a structured error type and use it throughout.
2024-10-31 14:44:59 +00:00
Peter Bendel
51fda118f6 increase lifetime of AWS session token to 12 hours (#9590)
## Problem

clickbench regression causes clickbench to run >9 hours and the AWS
session token is expired before the run completes

## Summary of changes

extend lifetime of session token for this job to 12 hours
2024-10-31 13:34:50 +00:00
Anastasia Lubennikova
e96398a552 Add support of extensions for v17 (part 4) (#9568)
- pg_jsonschema 0.3.3
- pg_graphql 1.5.9
- rum 65e0a752
- pg_tiktoken a5bc447e

update support of extensions for v14-v16:
- pg_jsonschema 0.3.1 -> 0.3.3
- pg_graphql 1.5.7 -> 1.5.9
- rum 6ab37053 -> 65e0a752
- pg_tiktoken e64e55aa -> a5bc447e
2024-10-31 15:05:24 +02:00
Erik Grinaker
f9d8256d55 pageserver: don't return option from DeletionQueue::new (#9588)
`DeletionQueue::new()` always returns deletion workers, so the returned
`Option` is redundant.
2024-10-31 10:51:58 +00:00
Vlad Lazar
411c3aa0d6 pageserver: lift decoding and interpreting of wal into wal_decoder (#9524)
## Problem

Decoding and ingestion are still coupled in `pageserver::WalIngest`.

## Summary of changes

A new type is added to `wal_decoder::models`, InterpretedWalRecord. This
type contains everything that the pageserver requires in order to ingest
a WAL record. The highlights are the `metadata_record` which is an
optional special record type to be handled and `blocks` which stores
key, value pairs to be persisted to storage.

This type is produced by
`wal_decoder::models::InterpretedWalRecord::from_bytes` from a raw PG
wal record.

The rest of this commit separates decoding and interpretation of the PG
WAL record from its application in `WalIngest::ingest_record`.

Related: https://github.com/neondatabase/neon/issues/9335
Epic: https://github.com/neondatabase/neon/issues/9329
2024-10-31 10:47:43 +00:00
Arpad Müller
65b69392ea Disallow offloaded children during timeline deletion (#9582)
If we delete a timeline that has childen, those children will have their
data corrupted. Therefore, extend the already existing safety check to
offloaded timelines as well.

Part of #8088
2024-10-30 19:37:09 +01:00
Alex Chi Z.
8d70f88b37 refactor(pageserver): use JSON field encoding for consumption metrics cache (#9470)
In https://github.com/neondatabase/neon/issues/9032, I would like to
eventually add a `generation` field to the consumption metrics cache.
The current encoding is not backward compatible and it is hard to add
another field into the cache. Therefore, this patch refactors the format
to store "field -> value", and it's easier to maintain backward/forward
compatibility with this new format.

## Summary of changes

* Add `NewRawMetric` as the new format.
* Add upgrade path. When opening the disk cache, the codepath first
inspects the `version` field, and decide how to decode.
* Refactor metrics generation code and tests.
* Add tests on upgrade / compatibility with the old format.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-10-30 18:13:11 +00:00
Arpad Müller
bcfe013094 Don't keep around the timeline's remote_client (#9583)
Constructing a remote client is no big deal. Yes, it means an extra
download from S3 but it's not that expensive. This simplifies code paths
and scenarios to test. This unifies timelines that have been recently
offloaded with timelines that have been offloaded in an earlier
invocation of the process.

Part of #8088
2024-10-30 18:44:29 +01:00
Arpad Müller
d0a02f3649 Disallow archived timelines to be detached or reparented (#9578)
Disallow a request for timeline ancestor detach if either the to be
detached timeline, or any of the to be reparented timelines are
offloaded or archived.

In theory we could support timelines that are archived but not
offloaded, but archived timelines are at the risk of being offloaded, so
we treat them like offloaded timelines. As for offloaded timelines, any
code to "support" them would amount to unoffloading them, at which point
we can just demand to have the timelines be unarchived.

Part of #8088
2024-10-30 17:04:57 +01:00
Tristan Partin
8af9412eb2 Collect compute backpressure throttling time
This will tell us how much time the compute has spent throttled if
pageserver/safekeeper cannot keep up with WAL generation.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-30 09:58:29 -05:00
Erik Grinaker
96e35e11a6 postgres_ffi: add WAL generator for tests/benchmarks (#9503)
## Problem

We don't have a convenient way to generate WAL records for benchmarks
and tests.

## Summary of changes

Adds a WAL generator, exposed as an iterator. It currently only
generates logical messages (noops), but will be extended to write actual
table rows later.

Some existing code for WAL generation has been replaced with this
generator, to reduce duplication.
2024-10-30 14:46:39 +03:00
Alexey Kondratov
745061ddf8 chore(compute): Bump pg_mooncake to the latest version (#9576)
## Problem

There were some critical breaking changes made in the upstream since Oct
29th morning.

## Summary of changes

Point it to the topmost commit in the `neon` branch at the time of
writing this
https://github.com/Mooncake-Labs/pg_mooncake/commits/neon/
c495cd17d6
2024-10-30 11:07:02 +01:00
Tristan Partin
0c828c57e2 Remove non-gzipped basebackup code path
In July of 2023, Bojan and Chi authored
92aee7e07f. Our in production pageservers
are most definitely at a version where they all support gzipped
basebackups.
2024-10-29 23:03:45 -05:00
John Spray
8e2e9f0fed pageserver: generation-aware storage for TenantManifest (#9555)
## Problem

When tenant manifest objects are written without a generation suffix,
concurrently attached pageservers may stamp on each others writes of the
manifest and cause undefined behavior.

Closes: #9543 

## Summary of changes

- Use download_generation_object helper when reading manifests, to
search for the most recent generation
- Use Tenant::generation as the generation suffix when writing
manifests.
2024-10-29 23:24:04 +01:00
Tristan Partin
b77b9bdc9f Add tests for sql-exporter metrics
Should help us keep non-working metrics from hitting staging or
production.

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Fixes: https://github.com/neondatabase/neon/issues/8569
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-29 15:13:06 -05:00
Alex Chi Z.
81f9aba005 fix(pagectl): layer parsing and image layer dump (#9571)
This patch contains various improvements for the pagectl tool.

## Summary of changes

* Rewrite layer name parsing: LayerName now supports all variants we use
now.
* Drop pagectl's own layer parsing function, use LayerName in the
pageserver crate.
* Support image layer dumping in the layer dump command using
ImageLayer::dump, drop the original implementation.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-10-29 15:16:23 -04:00
Alex Chi Z.
88ff8a7803 feat(pageserver): support partial gc-compaction for lowest retain lsn (#9134)
part of https://github.com/neondatabase/neon/issues/8921,
https://github.com/neondatabase/neon/issues/9114

## Summary of changes

We start the partial compaction implementation with the image layer
partial generation. The partial compaction API now takes a key range. We
will only generate images for that key range for now, and remove layers
fully included in the key range after compaction.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-10-29 18:25:32 +00:00
Konstantin Knizhnik
0c075fab3a Add --replica parameter to basebackup (#9553)
## Problem

See https://github.com/neondatabase/neon/pull/9458
This PR separates PS related changes in #9458 from compute_ctl changes
to enforce that PS is deployed before compute.

## Summary of changes

This PR adds handlings of `--replica` parameters of backebackup to page
server.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-10-29 18:40:10 +02:00
Anastasia Lubennikova
80e1630042 Use pg_mooncake from our fork. (#9565)
Switch to main repo once
https://github.com/Mooncake-Labs/pg_mooncake/pull/3 is merged
2024-10-29 15:57:52 +00:00
Jakub Kołodziejczak
57499640c5 proxy: more granular http status codes for sql-over-http errors (#9549)
closes #9532
2024-10-29 15:44:45 +00:00
Anastasia Lubennikova
793ad50b7d fix allow_unstable_extensions GUC - make it USERSET (#9563)
fix message wording
2024-10-29 14:25:23 +00:00
John Spray
7a1331eee5 pageserver: make concurrent offloaded timeline operations safe wrt manifest uploads (#9557)
## Problem

Uploads of the tenant manifest could race between different tasks,
resulting in unexpected results in remote storage.

Closes: https://github.com/neondatabase/neon/issues/9556

## Summary of changes

- Create a central function for uploads that takes a tokio::sync::Mutex
- Store the latest upload in that Mutex, so that when there is lots of
concurrency (e.g. archive 20 timelines at once) we can coalesce their
manifest writes somewhat.
2024-10-29 13:54:48 +00:00
John Spray
4ef74215e1 pageserver: refactor generation-aware loading code into generic (#9545)
## Problem

Indices used to be the only kind of object where we had to search across
generations to find the most recent one. As of
https://github.com/neondatabase/neon/issues/9543, manifests will need
the same treatment.

## Summary of changes

- Refactor download_index_part to a generic download_generation_object
function, which will be usable for downloading manifest objects as well.
2024-10-29 13:00:03 +00:00
Conrad Ludgate
7e00be391d Merge pull request #9558 from neondatabase/rc/proxy/2024-10-29
Auth broker release 2024-10-29
2024-10-29 12:10:50 +00:00
Conrad Ludgate
d4cbc8cfeb [auth_broker]: regress test (#9541)
python based regression test setup for auth_broker. This uses a http
mock for cplane as well as the JWKs url.

complications:
1. We cannot just use local_proxy binary, as that requires the
pg_session_jwt extension which we don't have available in the current
test suite
2. We cannot use just any old http mock for local_proxy, as auth_broker
requires http2 to local_proxy

as such, I used the h2 library to implement an echo server - copied from
the examples in the h2 docs.
2024-10-29 11:39:09 +00:00
Conrad Ludgate
47c35f67c3 [proxy]: fix JWT handling for AWS cognito. (#9536)
In the base64 payload of an aws cognito jwt, I saw the following:

```
"iss":"https:\/\/cognito-idp.us-west-2.amazonaws.com\/us-west-2_redacted"
```

issuers are supposed to be URLs, and URLs are always valid un-escaped
JSON. However, `\/` is a valid escape character so what AWS is doing is
technically correct... sigh...

This PR refactors the test suite and adds a new regression test for
cognito.
2024-10-29 11:01:09 +00:00
Peter Bendel
45b558f480 temporarily increase timeout for clickbench benchmark until regression is resolved (#9554)
## Problem

click bench job in benchmarking workflow has a performance regression
causing it to run in timeout of max job run.

Suspected root cause:
Project has been migrated from single pageserver to storage controller
managed project on Oct 14th.
Since then the regression shows.

## Summary of changes

Increase timeout of pytest to 12 hours.
Increase job timeout to 12 hours
2024-10-29 10:53:28 +00:00
Arpad Müller
a73402e646 Offloaded timeline deletion (#9519)
As pointed out in
https://github.com/neondatabase/neon/pull/9489#discussion_r1814699683 ,
we currently didn't support deletion for offloaded timelines after the
timeline has been loaded from the manifest instead of having been
offloaded.

This was because the upload queue hasn't been initialized yet. This PR
thus initializes the timeline and shuts it down immediately.

Part of #8088
2024-10-29 10:41:53 +00:00
Vlad Lazar
07b974480c pageserver: move things around to prepare for decoding logic (#9504)
## Problem

We wish to have high level WAL decoding logic in `wal_decoder::decoder`
module.

## Summary of Changes

For this we need the `Value` and `NeonWalRecord` types accessible there, so:
1. Move `Value` and `NeonWalRecord` to `pageserver::value` and
`pageserver::record` respectively.
2. Get rid of `pageserver::repository` (follow up from (1))
3. Move PG specific WAL record types to `postgres_ffi::walrecord`. In
theory they could live in `wal_decoder`, but it would create a circular
dependency between `wal_decoder` and `postgres_ffi`. Long term it makes
sense for those types to be PG version specific, so that will work out nicely.
4. Move higher level WAL record types (to be ingested by pageserver)
into `wal_decoder::models`

Related: https://github.com/neondatabase/neon/issues/9335
Epic: https://github.com/neondatabase/neon/issues/9329
2024-10-29 10:00:34 +00:00
Arpad Müller
62f5d484d9 Assert the tenant to be active in unoffload_timeline (#9539)
Currently, all callers of `unoffload_timeline` ensure that the tenant
the unoffload operation is called on is active. We rely on it being
active as we activate the timeline below and don't want to race with the
activation code of the tenant (in the worst case, activating a timeline
twice).

Therefore, add this assertion.

Part of #8088
2024-10-29 00:36:05 +00:00
Tristan Partin
4df3987054 Get role name when not a C string
We will only have a C string if the specified role is a string.
Otherwise, we need to resolve references to public, current_role,
current_user, and session_user.

Fixes: https://github.com/neondatabase/cloud/issues/19323
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-28 18:21:45 -05:00
Konstantin Knizhnik
0624565617 Create the notion of unstable extensions
As a DBaaS provider, Neon needs to provide a stable platform for
customers to build applications upon. At the same time however, we also
need to enable customers to use the latest and greatest technology, so
they can prototype their work, and we can solicit feedback. If all
extensions are treated the same in terms of stability, it is hard to
meet that goal.

There are now two new GUCs created by the Neon extension:

neon.allow_unstable_extensions: This is a session GUC which allows
a session to install and load unstable extensions.

neon.unstable_extensions: This is a comma-separated list of extension
names. We can check if a CREATE EXTENSION statement is attempting to
install an unstable extension, and if so, deny the request if
neon.allow_unstable_extensions is not set to true.

Signed-off-by: Tristan Partin <tristan@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-10-28 17:47:15 -05:00
George MacKerron
7d5f6b6a52 Build pgrag extensions x3 (#8486)
Build the pgrag extensions (rag, rag_bge_small_en_v15, and
rag_jina_reranker_v1_tiny_en) as part of the compute node Dockerfile.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-10-28 20:06:36 +00:00
Alex Chi Z.
f7c61e856f fix(pageserver): bump tokio-epoll-uring (#9546)
Includes https://github.com/neondatabase/tokio-epoll-uring/pull/58 that
fixes the clippy error.

## Summary of changes

Update the version of tokio-epoll-uring

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-10-28 20:03:02 +00:00
Alex Chi Z.
57c21aff9f refactor(pageserver): remove aux v1 configs (#9494)
## Problem

Part of https://github.com/neondatabase/neon/issues/8623

## Summary of changes

Removed all aux-v1 config processing code. Note that we persisted it
into the index part file, so we cannot really remove the field from
index part. I also kept the config item within the tenant config, but we
will not read it any more.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-10-28 19:51:14 +00:00
Erik Grinaker
248558dee8 safekeeper: refactor WalAcceptor to be event-driven (#9462)
## Problem

The `WalAcceptor` main loop currently uses two nested loops to consume
inbound messages. This makes it hard to slot in periodic events like
metrics collection. It also duplicates the event processing code, and assumes
all messages in steady state are AppendRequests (other messages types may
be dropped if following an AppendRequest).

## Summary of changes

Refactor the `WalAcceptor` loop to be event driven.
2024-10-28 17:18:37 +00:00
Sergey Melnikov
3bad52543f We don't have legacy proxies anymore (#9544)
We don't have legacy scram proxies anymore:
cc: https://github.com/neondatabase/cloud/issues/9745
2024-10-28 16:42:35 +00:00
Tristan Partin
3d64a7ddcd Add pg_mooncake to compute-node.Dockerfile
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-28 11:23:30 -05:00
Conrad Ludgate
25f1e5cfeb [proxy] demote warnings and remove dead-argument (#9512)
fixes https://github.com/neondatabase/cloud/issues/19000
2024-10-28 15:02:20 +00:00
Rahul Patil
8dd555d396 ci(proxy): Update GH action flag on proxy deployment (#9535)
## Problem

Based on a recent proxy deployment issue, we deployed another proxy
version (proxy-scram), which was not needed when deploying a specific
proxy type. we have
[PR](https://github.com/neondatabase/infra/pull/2142) to update on the
infra branch and need to update CI in this repo which triggers proxy
deployment.

## Summary of changes

- Update proxy deployment flag 

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-10-28 13:17:09 +01:00
Arthur Petukhovsky
01b6843e12 Route pgbouncer logs to virtio-serial (#9488)
virtio-serial is much more performant than /dev/console emulation,
therefore, is much more suitable for the verbose logs inside vm. This
commit changes routing for pgbouncer logs, since we've recently noticed
it can emit large volumes of logs.

Manually tested on staging by pinning a compute image to my test
project.

Should help with https://github.com/neondatabase/cloud/issues/19072
2024-10-28 12:09:47 +00:00
John Spray
93987b5a4a tests: add test_storage_controller_onboard_detached (#9431)
## Problem

We haven't historically taken this API route where we would onboard a
tenant to the controller in detached state. It worked, but we didn't
have test coverage.

## Summary of changes

- Add a test that onboards a tenant to the storage controller in
Detached mode, and checks that deleting it without attaching it works as
expected.
2024-10-28 11:11:12 +00:00
John Spray
33baca07b6 storcon: add an API to cancel ongoing reconciler (#9520)
## Problem

If something goes wrong with a live migration, we currently only have
awkward ways to interrupt that:
- Restart the storage controller
- Ask it to do some other modification/migration on the shard, which we
don't really want.

## Summary of changes

- Add a new `/cancel` control API, and storcon_cli wrapper for it, which
fires the Reconciler's cancellation token. This is just for on-call use
and we do not expect it to be used by any other services.
2024-10-28 09:26:01 +00:00
John Spray
923974d4da safekeeper: don't un-evict timelines during snapshot API handler (#9428)
## Problem

When we use pull_timeline API on an evicted timeline, it gets downloaded
to serve the snapshot API request. That means that to evacuate all the
timelines from a node, the node needs enough disk space to download
partial segments from all timelines, which may not be physically the
case.

Closes: #8833 

## Summary of changes

- Add a "try" variant of acquiring a residence guard, that returns None
if the timeline is offloaded
- During snapshot API handler, take a different code path if the
timeline isn't resident, where we just read the checkpoint and don't try
to read any segments.
2024-10-28 08:47:12 +00:00
Arpad Müller
e7277885b3 Don't consider archived timelines for synthetic size calculation (#9497)
Archived timelines should not count towards synthetic size.

Closes #9384.

Part of #8088.
2024-10-26 13:27:57 +00:00
dependabot[bot]
80262e724f build(deps): bump werkzeug from 3.0.3 to 3.0.6 (#9527) 2024-10-26 08:24:15 +01:00
Yuchen Liang
85b954f449 pageserver: add tokio-epoll-uring slots waiters queue depth metrics (#9482)
In complement to
https://github.com/neondatabase/tokio-epoll-uring/pull/56.

## Problem

We want to make tokio-epoll-uring slots waiters queue depth observable
via Prometheus.

## Summary of changes

- Add `pageserver_tokio_epoll_uring_slots_submission_queue_depth`
metrics as a `Histogram`.
- Each thread-local tokio-epoll-uring system is given a `LocalHistogram`
to observe the metrics.
- Keep a list of `Arc<ThreadLocalMetrics>` used on-demand to flush data
to the shared histogram.
- Extend `Collector::collect` to report
`pageserver_tokio_epoll_uring_slots_submission_queue_depth`.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-10-25 21:30:57 +01:00
Arpad Müller
76328ada05 Fix unoffload_timeline races with creation (#9525)
This PR does two things:

1. Obtain a `TimelineCreateGuard` object in `unoffload_timeline`. This
prevents two unoffload tasks from racing with each other. While they
already obtain locks for `timelines` and `offloaded_timelines`, they
aren't sufficient, as we have already constructed an entire timeline at
that point. We shouldn't ever have two `Timeline` objects in the same
process at the same time.
2. don't allow timeline creations for timelines that have been
offloaded. Obviously they already exist, so we should not allow
creation. the previous logic only looked at the timelines list.

Part of #8088
2024-10-25 20:06:27 +00:00
Erik Grinaker
b54b632c6a safekeeper: don't pass conf into storage constructors (#9523)
## Problem

The storage components take an entire `SafekeeperConf` during
construction, but only actually use the `no_sync` field. This makes it
hard to understand the storage inputs (which fields do they actually
care about?), and is also inconvenient for tests and benchmarks that
need to set up a lot of unnecessary boilerplate.

## Summary of changes

* Don't take the entire config, but pass in the `no_sync` field
explicitly.
* Take the timeline dir instead of `ttid` as an input, since it's the
only thing it cares about.
* Fix a couple of tests to not leak tempdirs.
* Various minor tweaks.
2024-10-25 18:19:52 +01:00
Erik Grinaker
9909551f47 safekeeper: fix version in TimelinePersistentState::empty() (#9521)
## Problem

The Postgres version in `TimelinePersistentState::empty()` is incorrect:
the major version should be multiplied by 10000.

## Summary of changes

Multiply the version by 10000.
2024-10-25 16:22:35 +01:00
Arseny Sher
700b102b0f safekeeper: retry eviction. (#9485)
Without this manager may sleep forever after eviction failure without
retries.
2024-10-25 17:48:29 +03:00
Conrad Ludgate
dbadb0f9bb proxy: propagate session IDs (#9509)
fixes #9367 by sending session IDs to local_proxy, and also returns
session IDs to the client for easier debugging.
2024-10-25 14:34:19 +00:00
John Spray
8297f7a181 pageserver: fix N^2 I/O when processing relation drops in transaction abort (#9507)
## Problem

We have some known N^2 behaviors when it comes to large relation counts,
due to the monolithic encoding and full rewrites of of RelDirectory each
time a relation is added. Ordinarily our backpressure mechanisms give
"slow but steady" performance when creating/dropping/truncating
relations. However, in the case of a transaction abort, it is possible
for a single WAL record to drop an unbounded number of relations. The
results in an unavailable compute, as when it sends one of these
records, it can stall the pageserver's ingest for many minutes, even
though the compute only sent a small amount of WAL.

Closes https://github.com/neondatabase/neon/issues/9505

## Summary of changes

- Rewrite relation-dropping code to do one read/modify/write cycle of
RelDirectory, instead of doing it separately for each relation in a
loop.
- Add a test for the bug scenario encountered:
`test_tx_abort_with_many_relations`

The test has ~40s runtime on my workstation. About 1 second of that is
the part where we wait for ingest to catch up after a rollback, the rest
is the slowness of creating and truncating a large number of relations.


---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-10-25 15:09:02 +01:00
Christian Schwarz
2090e928d1 refactor(timeline creation): idempotency checking (#9501)
# Context

In the PGDATA import code
(https://github.com/neondatabase/neon/pull/9218) I add a third way to
create timelines, namely, by importing from a copy of a vanilla PGDATA
directory in object storage.

For idempotency, I'm using the PGDATA object storage location
specification, which is stored in the IndexPart for the entire lifespan
of the timeline. When loading the timeline from remote storage, that
value gets stored inside `struct Timeline` and timeline creation
compares the creation argument with that value to determine idempotency
of the request.

# Changes

This PR refactors the existing idempotency handling of Timeline
bootstrap and branching such that we simply compare the
`CreateTimelineIdempotency` struct, using the derive-generated
`PartialEq` implementation.

Also, by spelling idempotency out in the type names, I find it adds a
lot of clarity.

The pathway to idempotency via requester-provided idempotency key also
becomes very straight-forward, if we ever want to do this in the future.

# Refs
* platform context: https://github.com/neondatabase/neon/pull/9218
* product context: https://github.com/neondatabase/cloud/issues/17507
* stacks on top of https://github.com/neondatabase/neon/pull/9366
2024-10-25 14:44:20 +01:00
Tristan Partin
05eff3a67e Move logical replication slot monitor
neon.c is getting crowded and the logical replication slot monitor is
a good candidate for reorganization. It is very self-contained, and
being in a separate file will make it that much easier to find.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-25 08:41:44 -05:00
Arseny Sher
c6cf5e7c0f Make test_pageserver_lsn_wait_error_safekeeper_stop less aggressive. (#9517)
Previously it inserted ~150MiB of WAL while expecting page fetching to
work in 1s (wait_lsn_timeout=1s). It failed in CI in debug builds.
Instead, just directly wait for the wanted condition, i.e. needed
safekeepers are reported in pageserver timed out waiting for WAL error
message. Also set NEON_COMPUTE_TESTING_BASEBACKUP_RETRIES to 1 in this
test and neighbour one, it reduces execution time from 2.5m to ~10s.
2024-10-25 14:13:46 +01:00
Christian Schwarz
e0c7f1ce15 remote_storage(local_fs): return correct file sizes (#9511)
## Problem

`local_fs` doesn't return file sizes, which I need in PGDATA import
(#9218)

## Solution

Include file sizes in the result.

I would have liked to add a unit test, and started doing that in 

* https://github.com/neondatabase/neon/pull/9510

by extending the common object storage tests
(`libs/remote_storage/tests/common/tests.rs`) to check for sizes as
well.

But it turns out that localfs is not even covered by the common object
storage tests and upon closer inspection, it seems that this area needs
more attention.
=> punt the effort into https://github.com/neondatabase/neon/pull/9510
2024-10-25 12:20:53 +00:00
Christian Schwarz
6f5c262684 pageserver: add testing API to scan layers for disposable keys (#9393)
This PR adds a pageserver mgmt API to scan a layer file for disposable
keys.

It hooks it up to the sharding compaction test, demonstrating that we're
not filtering out all disposable keys.

This is extracted from PGDATA import
(https://github.com/neondatabase/neon/pull/9218)
where I do the filtering of layer files based on `is_key_disposable`.
2024-10-25 14:16:45 +02:00
Jakub Kołodziejczak
9768f09f6b proxy: don't follow redirects for user provided JWKS urls + set custom user agent (#9514)
partially fixes https://github.com/neondatabase/cloud/issues/19249

ref https://docs.rs/reqwest/latest/reqwest/redirect/index.html
> By default, a Client will automatically handle HTTP redirects, having
a maximum redirect chain of 10 hops. To customize this behavior, a
redirect::Policy can be used with a ClientBuilder.
2024-10-25 14:04:41 +02:00
Yuchen Liang
db900ae9d0 fix(test): remove too strict layers_removed==0 check in test_readonly_node_gc (#9506)
Fixes #9098 

## Problem

`test_readonly_node_gc` is flaky. As shown in [Allure
Report](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9469/11444519440/index.html#suites/3ccffb1d100105b98aed3dc19b717917/2c02073738fa2b39),
we would get a `AssertionError: No layers should be removed, old layers
are guarded by leases.` after the test restarts pageservers or after
reconfigure pageservers.

During the investigation, we found that the layers has LSN (`0/1563088`)
greater than the LSN (`0x1562000`) protected by the lease. For instance,


**Layers removed**
<pre>

000000067F00000005000034540100000000-000000067F00000005000040050100000000__000000000<b><i>1563088</i></b>-00000001
(shard 0002)

000000068000000000000017E20000000001-010000000100000001000000000000000001__000000000<b><i>1563088</i></b>-00000001
(shard 0002)
</pre>

**Lsn Lease Granted**
<pre>
handle_make_lsn_lease{lsn=<b><i>0/1562000</i></b> shard_id=0002
shard_id=0002}: lease created, valid until 2024-10-21
</pre>

This means that these layers are not guarded by the leases: they are in
"future", not visible to the static endpoint.

## Summary of changes

- Remove the assertion layers_removed == 0 after trigger timeline GC
while holding the lease. Instead rely on the successful execution of
the`SELECT` query to test lease validity.
- Improve test logging


Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-10-25 12:50:47 +01:00
Arpad Müller
4d9036bf1f Support offloaded timelines during shard split (#9489)
Before, we didn't copy over the `index-part.json` of offloaded timelines
to the new shard's location, resulting in the new shard not knowing the
timeline even exists.

In #9444, we copy over the manifest, but we also need to do this for
`index-part.json`.

As the operations to do are mostly the same between offloaded and
non-offloaded timelines, we can iterate over all of them in the same
loop, after the introduction of a `TimelineOrOffloadedArcRef` type to
generalize over the two cases. This is analogous to the deletion code
added in #8907.

The added test also ensures that the sharded archival config endpoint
works, something that has not yet been ensured by tests.

Part of #8088
2024-10-25 12:32:46 +02:00
Vlad Lazar
b3bedda6fd pageserver/walingest: log on gappy rel extend (#9502)
## Problem

https://github.com/neondatabase/neon/pull/9492 added a metric to track
the total count of block gaps filled on rel extend. More context is
needed to understand when this happens. The current theory is that it
may only happen on pg 14 and pg 15 since they do not WAL log relation extends.

## Summary of Changes

A rate limited log is added.
2024-10-25 11:15:53 +01:00
Christian Schwarz
b782b11b33 refactor(timeline creation): represent bootstrap vs branch using enum (#9366)
# Problem

Timeline creation can either be bootstrap or branch.
The distinction is made based on whether the `ancestor_*` fields are
present or not.

In the PGDATA import code
(https://github.com/neondatabase/neon/pull/9218), I add a third variant
to timeline creation.

# Solution

The above pushed me to refactor the code in Pageserver to distinguish
the different creation requests through enum variants.

There is no externally observable effect from this change.

On the implementation level, a notable change is that the acquisition of
the `TimelineCreationGuard` happens later than before. This is necessary
so that we have everything in place to construct the
`CreateTimelineIdempotency`. Notably, this moves the acquisition of the
creation guard _after_ the acquisition of the `gc_cs` lock in the case
of branching. This might appear as if we're at risk of holding `gc_cs`
longer than before this PR, but, even before this PR, we were holding
`gc_cs` until after the `wait_completion()` that makes the timeline
creation durable in S3 returns. I don't see any deadlock risk with
reversing the lock acquisition order.

As a drive-by change, I found that the `create_timeline()` function in
`neon_local` is unused, so I removed it.

# Refs

* platform context: https://github.com/neondatabase/neon/pull/9218
* product context: https://github.com/neondatabase/cloud/issues/17507
* next PR stacked atop this one:
https://github.com/neondatabase/neon/pull/9501
2024-10-25 10:04:27 +00:00
Vlad Lazar
5069123b6d pageserver: refactor ingest inplace to decouple decoding and handling (#9472)
## Problem

WAL ingest couples decoding of special records with their handling
(updates to the storage engine mostly).
This is a roadblock for our plan to move WAL filtering (and implicitly
decoding) to safekeepers since they cannot
do writes to the storage engine. 

## Summary of changes

This PR decouples the decoding of the special WAL records from their
application. The changes are done in place
and I've done my best to refrain from refactorings and attempted to
preserve the original code as much as possible.

Related: https://github.com/neondatabase/neon/issues/9335
Epic: https://github.com/neondatabase/neon/issues/9329
2024-10-24 17:12:47 +01:00
Alex Chi Z.
fb0406e9d2 refactor(pageserver): refactor split writers using batch layer writer (#9493)
part of https://github.com/neondatabase/neon/issues/9114,
https://github.com/neondatabase/neon/issues/8836,
https://github.com/neondatabase/neon/issues/8362

The split layer writer code can be used in a more general way: the
caller puts unfinished writers into the batch layer writer and let batch
layer writer to ensure the atomicity of the layer produces.

## Summary of changes

* Add batch layer writer, which atomically finishes the layers.
`BatchLayerWriter::finish` is simply a copy-paste from previous split
layer writers.
* Refactor split writers to use the batch layer writer.
* The current split writer tests cover all code path of batch layer
writer.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-10-24 10:49:54 -04:00
Alexander Bayandin
b8a311131e CI: remove git config --add safe.directory hack (#9391)
## Problem

We have `git config --global --add safe.directory ...` leftovers from the
past, but `actions/checkout` does it by default (since v3.0.2, we use v4)

## Summary of changes
- Remove `git config --global --add safe.directory ...` hack
2024-10-24 15:49:26 +01:00
John Spray
d589498c6f storcon: respect Reconciler::cancel during await_lsn (#9486)
## Problem

When a pageserver is misbehaving (e.g. we hit an ingest bug or something
is pathologically slow), the storage controller could get stuck in the
part of live migration that waits for LSNs to catch up. This is a
problem, because it can prevent us migrating the troublesome tenant to
another pageserver.

Closes: https://github.com/neondatabase/cloud/issues/19169

## Summary of changes

- Respect Reconciler::cancel during await_lsn.
2024-10-24 15:23:09 +01:00
Christian Schwarz
6f34f97573 refactor(pageserver(load_remote_timeline)) remove dead code handling absence of IndexPart (#9408)
The code is dead at runtime since we're nowadays always running with
remote storage and treat it as the source of truth during attach.

Clean it up as a preliminary to
https://github.com/neondatabase/neon/pull/9218.

Related: https://github.com/neondatabase/neon/pull/9366
2024-10-24 09:00:22 +01:00
300 changed files with 16406 additions and 7209 deletions

View File

@@ -20,3 +20,4 @@ config-variables:
- REMOTE_STORAGE_AZURE_REGION
- SLACK_UPCOMING_RELEASE_CHANNEL_ID
- DEV_AWS_OIDC_ROLE_ARN
- BENCHMARK_INGEST_TARGET_PROJECTID

View File

@@ -221,6 +221,8 @@ runs:
REPORT_URL: ${{ steps.generate-report.outputs.report-url }}
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
with:
# Retry script for 5XX server errors: https://github.com/actions/github-script#retries
retries: 5
script: |
const { REPORT_URL, COMMIT_SHA } = process.env

View File

@@ -1,36 +0,0 @@
name: "Set custom docker config directory"
description: "Create a directory for docker config and set DOCKER_CONFIG"
# Use custom DOCKER_CONFIG directory to avoid conflicts with default settings
runs:
using: "composite"
steps:
- name: Show warning on GitHub-hosted runners
if: runner.environment == 'github-hosted'
shell: bash -euo pipefail {0}
run: |
# Using the following environment variables to find a path to the workflow file
# ${GITHUB_WORKFLOW_REF} - octocat/hello-world/.github/workflows/my-workflow.yml@refs/heads/my_branch
# ${GITHUB_REPOSITORY} - octocat/hello-world
# ${GITHUB_REF} - refs/heads/my_branch
# From https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/variables
filename_with_ref=${GITHUB_WORKFLOW_REF#"$GITHUB_REPOSITORY/"}
filename=${filename_with_ref%"@$GITHUB_REF"}
# https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#setting-a-warning-message
title='Unnecessary usage of `.github/actions/set-docker-config-dir`'
message='No need to use `.github/actions/set-docker-config-dir` action on GitHub-hosted runners'
echo "::warning file=${filename},title=${title}::${message}"
- uses: pyTooling/Actions/with-post-step@74afc5a42a17a046c90c68cb5cfa627e5c6c5b6b # v1.0.7
env:
DOCKER_CONFIG: .docker-custom-${{ github.run_id }}-${{ github.run_attempt }}
with:
main: |
mkdir -p "${DOCKER_CONFIG}"
echo DOCKER_CONFIG=${DOCKER_CONFIG} | tee -a $GITHUB_ENV
post: |
if [ -d "${DOCKER_CONFIG}" ]; then
rm -r "${DOCKER_CONFIG}"
fi

View File

@@ -1,14 +1,3 @@
## Problem
## Summary of changes
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above checklist

View File

@@ -53,20 +53,6 @@ jobs:
BUILD_TAG: ${{ inputs.build-tag }}
steps:
- name: Fix git ownership
run: |
# Workaround for `fatal: detected dubious ownership in repository at ...`
#
# Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
# Ref https://github.com/actions/checkout/issues/785
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16 17; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- uses: actions/checkout@v4
with:
submodules: true

View File

@@ -0,0 +1,37 @@
name: Check Codestyle Python
on:
workflow_call:
inputs:
build-tools-image:
description: 'build-tools image'
required: true
type: string
defaults:
run:
shell: bash -euxo pipefail {0}
jobs:
check-codestyle-python:
runs-on: [ self-hosted, small ]
container:
image: ${{ inputs.build-tools-image }}
credentials:
username: ${{ secrets.NEON_DOCKERHUB_USERNAME }}
password: ${{ secrets.NEON_DOCKERHUB_PASSWORD }}
options: --init
steps:
- uses: actions/checkout@v4
- uses: actions/cache@v4
with:
path: ~/.cache/pypoetry/virtualenvs
key: v2-${{ runner.os }}-${{ runner.arch }}-python-deps-bookworm-${{ hashFiles('poetry.lock') }}
- run: ./scripts/pysync
- run: poetry run ruff check .
- run: poetry run ruff format --check .
- run: poetry run mypy .

View File

@@ -671,6 +671,10 @@ jobs:
password: ${{ secrets.NEON_DOCKERHUB_PASSWORD }}
options: --init
# Increase timeout to 12h, default timeout is 6h
# we have regression in clickbench causing it to run 2-3x longer
timeout-minutes: 720
steps:
- uses: actions/checkout@v4
@@ -716,7 +720,7 @@ jobs:
test_selection: performance/test_perf_olap.py
run_in_parallel: false
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_clickbench
extra_params: -m remote_cluster --timeout 43200 -k test_clickbench
pg_version: ${{ env.DEFAULT_PG_VERSION }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"

View File

@@ -64,7 +64,7 @@ jobs:
- uses: actions/checkout@v4
- uses: ./.github/actions/set-docker-config-dir
- uses: neondatabase/dev-actions/set-docker-config-dir@6094485bf440001c94a94a3f9e221e81ff6b6193
- uses: docker/setup-buildx-action@v3
with:
cache-binary: false

View File

@@ -90,35 +90,10 @@ jobs:
check-codestyle-python:
needs: [ check-permissions, build-build-tools-image ]
runs-on: [ self-hosted, small ]
container:
image: ${{ needs.build-build-tools-image.outputs.image }}-bookworm
credentials:
username: ${{ secrets.NEON_DOCKERHUB_USERNAME }}
password: ${{ secrets.NEON_DOCKERHUB_PASSWORD }}
options: --init
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Cache poetry deps
uses: actions/cache@v4
with:
path: ~/.cache/pypoetry/virtualenvs
key: v2-${{ runner.os }}-${{ runner.arch }}-python-deps-bookworm-${{ hashFiles('poetry.lock') }}
- name: Install Python deps
run: ./scripts/pysync
- name: Run `ruff check` to ensure code format
run: poetry run ruff check .
- name: Run `ruff format` to ensure code format
run: poetry run ruff format --check .
- name: Run mypy to check types
run: poetry run mypy .
uses: ./.github/workflows/_check-codestyle-python.yml
with:
build-tools-image: ${{ needs.build-build-tools-image.outputs.image }}-bookworm
secrets: inherit
check-codestyle-jsonnet:
needs: [ check-permissions, build-build-tools-image ]
@@ -141,6 +116,7 @@ jobs:
# Check that the vendor/postgres-* submodules point to the
# corresponding REL_*_STABLE_neon branches.
check-submodules:
needs: [ check-permissions ]
runs-on: ubuntu-22.04
steps:
- name: Checkout
@@ -521,6 +497,8 @@ jobs:
REPORT_URL_NEW: ${{ steps.upload-coverage-report-new.outputs.report-url }}
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
with:
# Retry script for 5XX server errors: https://github.com/actions/github-script#retries
retries: 5
script: |
const { REPORT_URL_NEW, COMMIT_SHA } = process.env
@@ -552,7 +530,7 @@ jobs:
with:
submodules: true
- uses: ./.github/actions/set-docker-config-dir
- uses: neondatabase/dev-actions/set-docker-config-dir@6094485bf440001c94a94a3f9e221e81ff6b6193
- uses: docker/setup-buildx-action@v3
with:
cache-binary: false
@@ -643,7 +621,7 @@ jobs:
with:
submodules: true
- uses: ./.github/actions/set-docker-config-dir
- uses: neondatabase/dev-actions/set-docker-config-dir@6094485bf440001c94a94a3f9e221e81ff6b6193
- uses: docker/setup-buildx-action@v3
with:
cache-binary: false
@@ -824,7 +802,7 @@ jobs:
curl -fL https://github.com/neondatabase/autoscaling/releases/download/$VM_BUILDER_VERSION/vm-builder -o vm-builder
chmod +x vm-builder
- uses: ./.github/actions/set-docker-config-dir
- uses: neondatabase/dev-actions/set-docker-config-dir@6094485bf440001c94a94a3f9e221e81ff6b6193
- uses: docker/login-action@v3
with:
username: ${{ secrets.NEON_DOCKERHUB_USERNAME }}
@@ -839,6 +817,7 @@ jobs:
- name: Build vm image
run: |
./vm-builder \
-size=2G \
-spec=compute/vm-image-spec-${{ matrix.version.debian }}.yaml \
-src=neondatabase/compute-node-${{ matrix.version.pg }}:${{ needs.tag.outputs.build-tag }} \
-dst=neondatabase/vm-compute-node-${{ matrix.version.pg }}:${{ needs.tag.outputs.build-tag }}
@@ -859,7 +838,7 @@ jobs:
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/set-docker-config-dir
- uses: neondatabase/dev-actions/set-docker-config-dir@6094485bf440001c94a94a3f9e221e81ff6b6193
- uses: docker/login-action@v3
with:
username: ${{ secrets.NEON_DOCKERHUB_USERNAME }}
@@ -1078,20 +1057,6 @@ jobs:
runs-on: [ self-hosted, small ]
container: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/ansible:latest
steps:
- name: Fix git ownership
run: |
# Workaround for `fatal: detected dubious ownership in repository at ...`
#
# Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
# Ref https://github.com/actions/checkout/issues/785
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16 17; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- uses: actions/checkout@v4
- name: Trigger deploy workflow
@@ -1130,7 +1095,10 @@ jobs:
gh workflow --repo neondatabase/infra run deploy-proxy-prod.yml --ref main \
-f deployPgSniRouter=true \
-f deployProxy=true \
-f deployProxyLink=true \
-f deployPrivatelinkProxy=true \
-f deployProxyScram=true \
-f deployProxyAuthBroker=true \
-f branch=main \
-f dockerTag=${{needs.tag.outputs.build-tag}}
else

372
.github/workflows/ingest_benchmark.yml vendored Normal file
View File

@@ -0,0 +1,372 @@
name: Benchmarking
on:
# uncomment to run on push for debugging your PR
# push:
# branches: [ your branch ]
schedule:
# * is a special character in YAML so you have to quote this string
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
# │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
- cron: '0 9 * * *' # run once a day, timezone is utc
workflow_dispatch: # adds ability to run this manually
defaults:
run:
shell: bash -euxo pipefail {0}
concurrency:
# Allow only one workflow globally because we need dedicated resources which only exist once
group: ingest-bench-workflow
cancel-in-progress: true
jobs:
ingest:
strategy:
matrix:
target_project: [new_empty_project, large_existing_project]
permissions:
contents: write
statuses: write
id-token: write # aws-actions/configure-aws-credentials
env:
PG_CONFIG: /tmp/neon/pg_install/v16/bin/pg_config
PSQL: /tmp/neon/pg_install/v16/bin/psql
PG_16_LIB_PATH: /tmp/neon/pg_install/v16/lib
PGCOPYDB: /pgcopydb/bin/pgcopydb
PGCOPYDB_LIB_PATH: /pgcopydb/lib
runs-on: [ self-hosted, us-east-2, x64 ]
container:
image: neondatabase/build-tools:pinned-bookworm
credentials:
username: ${{ secrets.NEON_DOCKERHUB_USERNAME }}
password: ${{ secrets.NEON_DOCKERHUB_PASSWORD }}
options: --init
timeout-minutes: 1440
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials # necessary to download artefacts
uses: aws-actions/configure-aws-credentials@v4
with:
aws-region: eu-central-1
role-to-assume: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
role-duration-seconds: 18000 # 5 hours is currently max associated with IAM role
- name: Download Neon artifact
uses: ./.github/actions/download
with:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
- name: Create Neon Project
if: ${{ matrix.target_project == 'new_empty_project' }}
id: create-neon-project-ingest-target
uses: ./.github/actions/neon-project-create
with:
region_id: aws-us-east-2
postgres_version: 16
compute_units: '[7, 7]' # we want to test large compute here to avoid compute-side bottleneck
api_key: ${{ secrets.NEON_STAGING_API_KEY }}
- name: Initialize Neon project and retrieve current backpressure seconds
if: ${{ matrix.target_project == 'new_empty_project' }}
env:
NEW_PROJECT_CONNSTR: ${{ steps.create-neon-project-ingest-target.outputs.dsn }}
NEW_PROJECT_ID: ${{ steps.create-neon-project-ingest-target.outputs.project_id }}
run: |
echo "Initializing Neon project with project_id: ${NEW_PROJECT_ID}"
export LD_LIBRARY_PATH=${PG_16_LIB_PATH}
${PSQL} "${NEW_PROJECT_CONNSTR}" -c "CREATE EXTENSION IF NOT EXISTS neon; CREATE EXTENSION IF NOT EXISTS neon_utils;"
BACKPRESSURE_TIME_BEFORE_INGEST=$(${PSQL} "${NEW_PROJECT_CONNSTR}" -t -c "select backpressure_throttling_time()/1000000;")
echo "BACKPRESSURE_TIME_BEFORE_INGEST=${BACKPRESSURE_TIME_BEFORE_INGEST}" >> $GITHUB_ENV
echo "NEW_PROJECT_CONNSTR=${NEW_PROJECT_CONNSTR}" >> $GITHUB_ENV
- name: Create Neon Branch for large tenant
if: ${{ matrix.target_project == 'large_existing_project' }}
id: create-neon-branch-ingest-target
uses: ./.github/actions/neon-branch-create
with:
project_id: ${{ vars.BENCHMARK_INGEST_TARGET_PROJECTID }}
api_key: ${{ secrets.NEON_STAGING_API_KEY }}
- name: Initialize Neon project and retrieve current backpressure seconds
if: ${{ matrix.target_project == 'large_existing_project' }}
env:
NEW_PROJECT_CONNSTR: ${{ steps.create-neon-branch-ingest-target.outputs.dsn }}
NEW_BRANCH_ID: ${{ steps.create-neon-branch-ingest-target.outputs.branch_id }}
run: |
echo "Initializing Neon branch with branch_id: ${NEW_BRANCH_ID}"
export LD_LIBRARY_PATH=${PG_16_LIB_PATH}
# Extract the part before the database name
base_connstr="${NEW_PROJECT_CONNSTR%/*}"
# Extract the query parameters (if any) after the database name
query_params="${NEW_PROJECT_CONNSTR#*\?}"
# Reconstruct the new connection string
if [ "$query_params" != "$NEW_PROJECT_CONNSTR" ]; then
new_connstr="${base_connstr}/neondb?${query_params}"
else
new_connstr="${base_connstr}/neondb"
fi
${PSQL} "${new_connstr}" -c "drop database ludicrous;"
${PSQL} "${new_connstr}" -c "CREATE DATABASE ludicrous;"
if [ "$query_params" != "$NEW_PROJECT_CONNSTR" ]; then
NEW_PROJECT_CONNSTR="${base_connstr}/ludicrous?${query_params}"
else
NEW_PROJECT_CONNSTR="${base_connstr}/ludicrous"
fi
${PSQL} "${NEW_PROJECT_CONNSTR}" -c "CREATE EXTENSION IF NOT EXISTS neon; CREATE EXTENSION IF NOT EXISTS neon_utils;"
BACKPRESSURE_TIME_BEFORE_INGEST=$(${PSQL} "${NEW_PROJECT_CONNSTR}" -t -c "select backpressure_throttling_time()/1000000;")
echo "BACKPRESSURE_TIME_BEFORE_INGEST=${BACKPRESSURE_TIME_BEFORE_INGEST}" >> $GITHUB_ENV
echo "NEW_PROJECT_CONNSTR=${NEW_PROJECT_CONNSTR}" >> $GITHUB_ENV
- name: Create pgcopydb filter file
run: |
cat << EOF > /tmp/pgcopydb_filter.txt
[include-only-table]
public.events
public.emails
public.email_transmissions
public.payments
public.editions
public.edition_modules
public.sp_content
public.email_broadcasts
public.user_collections
public.devices
public.user_accounts
public.lessons
public.lesson_users
public.payment_methods
public.orders
public.course_emails
public.modules
public.users
public.module_users
public.courses
public.payment_gateway_keys
public.accounts
public.roles
public.payment_gateways
public.management
public.event_names
EOF
- name: Invoke pgcopydb
env:
BENCHMARK_INGEST_SOURCE_CONNSTR: ${{ secrets.BENCHMARK_INGEST_SOURCE_CONNSTR }}
run: |
export LD_LIBRARY_PATH=${PGCOPYDB_LIB_PATH}:${PG_16_LIB_PATH}
export PGCOPYDB_SOURCE_PGURI="${BENCHMARK_INGEST_SOURCE_CONNSTR}"
export PGCOPYDB_TARGET_PGURI="${NEW_PROJECT_CONNSTR}"
export PGOPTIONS="-c maintenance_work_mem=8388608 -c max_parallel_maintenance_workers=7"
${PG_CONFIG} --bindir
${PGCOPYDB} --version
${PGCOPYDB} clone --skip-vacuum --no-owner --no-acl --skip-db-properties --table-jobs 4 \
--index-jobs 4 --restore-jobs 4 --split-tables-larger-than 10GB --skip-extensions \
--use-copy-binary --filters /tmp/pgcopydb_filter.txt 2>&1 | tee /tmp/pgcopydb_${{ matrix.target_project }}.log
# create dummy pgcopydb log to test parsing
# - name: create dummy log for parser test
# run: |
# cat << EOF > /tmp/pgcopydb_${{ matrix.target_project }}.log
# 2024-11-04 18:00:53.433 500861 INFO main.c:136 Running pgcopydb version 0.17.10.g8361a93 from "/usr/lib/postgresql/17/bin/pgcopydb"
# 2024-11-04 18:00:53.434 500861 INFO cli_common.c:1225 [SOURCE] Copying database from "postgres://neondb_owner@ep-bitter-shape-w2c1ir0a.us-east-2.aws.neon.build/neondb?sslmode=require&keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
# 2024-11-04 18:00:53.434 500861 INFO cli_common.c:1226 [TARGET] Copying database into "postgres://neondb_owner@ep-icy-union-w25qd5pj.us-east-2.aws.neon.build/ludicrous?sslmode=require&keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
# 2024-11-04 18:00:53.442 500861 INFO copydb.c:105 Using work dir "/tmp/pgcopydb"
# 2024-11-04 18:00:53.541 500861 INFO snapshot.c:107 Exported snapshot "00000008-00000033-1" from the source database
# 2024-11-04 18:00:53.556 500865 INFO cli_clone_follow.c:543 STEP 1: fetch source database tables, indexes, and sequences
# 2024-11-04 18:00:54.570 500865 INFO copydb_schema.c:716 Splitting source candidate tables larger than 10 GB
# 2024-11-04 18:00:54.570 500865 INFO copydb_schema.c:829 Table public.events is 96 GB large which is larger than --split-tables-larger-than 10 GB, and does not have a unique column of type integer: splitting by CTID
# 2024-11-04 18:01:05.538 500865 INFO copydb_schema.c:905 Table public.events is 96 GB large, 10 COPY processes will be used, partitioning on ctid.
# 2024-11-04 18:01:05.564 500865 INFO copydb_schema.c:905 Table public.email_transmissions is 27 GB large, 4 COPY processes will be used, partitioning on id.
# 2024-11-04 18:01:05.584 500865 INFO copydb_schema.c:905 Table public.lessons is 25 GB large, 4 COPY processes will be used, partitioning on id.
# 2024-11-04 18:01:05.605 500865 INFO copydb_schema.c:905 Table public.lesson_users is 16 GB large, 3 COPY processes will be used, partitioning on id.
# 2024-11-04 18:01:05.605 500865 INFO copydb_schema.c:761 Fetched information for 26 tables (including 4 tables split in 21 partitions total), with an estimated total of 907 million tuples and 175 GB on-disk
# 2024-11-04 18:01:05.687 500865 INFO copydb_schema.c:968 Fetched information for 57 indexes (supporting 25 constraints)
# 2024-11-04 18:01:05.753 500865 INFO sequences.c:78 Fetching information for 24 sequences
# 2024-11-04 18:01:05.903 500865 INFO copydb_schema.c:1122 Fetched information for 4 extensions
# 2024-11-04 18:01:06.178 500865 INFO copydb_schema.c:1538 Found 0 indexes (supporting 0 constraints) in the target database
# 2024-11-04 18:01:06.184 500865 INFO cli_clone_follow.c:584 STEP 2: dump the source database schema (pre/post data)
# 2024-11-04 18:01:06.186 500865 INFO pgcmd.c:468 /usr/lib/postgresql/16/bin/pg_dump -Fc --snapshot 00000008-00000033-1 --section=pre-data --section=post-data --file /tmp/pgcopydb/schema/schema.dump 'postgres://neondb_owner@ep-bitter-shape-w2c1ir0a.us-east-2.aws.neon.build/neondb?sslmode=require&keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60'
# 2024-11-04 18:01:06.952 500865 INFO cli_clone_follow.c:592 STEP 3: restore the pre-data section to the target database
# 2024-11-04 18:01:07.004 500865 INFO pgcmd.c:1001 /usr/lib/postgresql/16/bin/pg_restore --dbname 'postgres://neondb_owner@ep-icy-union-w25qd5pj.us-east-2.aws.neon.build/ludicrous?sslmode=require&keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --section pre-data --jobs 4 --no-owner --no-acl --use-list /tmp/pgcopydb/schema/pre-filtered.list /tmp/pgcopydb/schema/schema.dump
# 2024-11-04 18:01:07.438 500874 INFO table-data.c:656 STEP 4: starting 4 table-data COPY processes
# 2024-11-04 18:01:07.451 500877 INFO vacuum.c:139 STEP 8: skipping VACUUM jobs per --skip-vacuum
# 2024-11-04 18:01:07.457 500875 INFO indexes.c:182 STEP 6: starting 4 CREATE INDEX processes
# 2024-11-04 18:01:07.457 500875 INFO indexes.c:183 STEP 7: constraints are built by the CREATE INDEX processes
# 2024-11-04 18:01:07.507 500865 INFO blobs.c:74 Skipping large objects: none found.
# 2024-11-04 18:01:07.509 500865 INFO sequences.c:194 STEP 9: reset sequences values
# 2024-11-04 18:01:07.510 500886 INFO sequences.c:290 Set sequences values on the target database
# 2024-11-04 20:49:00.587 500865 INFO cli_clone_follow.c:608 STEP 10: restore the post-data section to the target database
# 2024-11-04 20:49:00.600 500865 INFO pgcmd.c:1001 /usr/lib/postgresql/16/bin/pg_restore --dbname 'postgres://neondb_owner@ep-icy-union-w25qd5pj.us-east-2.aws.neon.build/ludicrous?sslmode=require&keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60' --section post-data --jobs 4 --no-owner --no-acl --use-list /tmp/pgcopydb/schema/post-filtered.list /tmp/pgcopydb/schema/schema.dump
# 2024-11-05 10:50:58.508 500865 INFO cli_clone_follow.c:639 All step are now done, 16h49m elapsed
# 2024-11-05 10:50:58.508 500865 INFO summary.c:3155 Printing summary for 26 tables and 57 indexes
# OID | Schema | Name | Parts | copy duration | transmitted bytes | indexes | create index duration
# ------+--------+----------------------+-------+---------------+-------------------+---------+----------------------
# 24654 | public | events | 10 | 1d11h | 878 GB | 1 | 1h41m
# 24623 | public | email_transmissions | 4 | 4h46m | 99 GB | 3 | 2h04m
# 24665 | public | lessons | 4 | 4h42m | 161 GB | 4 | 1m11s
# 24661 | public | lesson_users | 3 | 2h46m | 49 GB | 3 | 39m35s
# 24631 | public | emails | 1 | 34m07s | 10 GB | 2 | 17s
# 24739 | public | payments | 1 | 5m47s | 1848 MB | 4 | 4m40s
# 24681 | public | module_users | 1 | 4m57s | 1610 MB | 3 | 1m50s
# 24694 | public | orders | 1 | 2m50s | 835 MB | 3 | 1m05s
# 24597 | public | devices | 1 | 1m45s | 498 MB | 2 | 40s
# 24723 | public | payment_methods | 1 | 1m24s | 548 MB | 2 | 31s
# 24765 | public | user_collections | 1 | 2m17s | 1005 MB | 2 | 968ms
# 24774 | public | users | 1 | 52s | 291 MB | 4 | 27s
# 24760 | public | user_accounts | 1 | 16s | 172 MB | 3 | 16s
# 24606 | public | edition_modules | 1 | 8s983 | 46 MB | 3 | 4s749
# 24583 | public | course_emails | 1 | 8s526 | 26 MB | 2 | 996ms
# 24685 | public | modules | 1 | 1s592 | 21 MB | 3 | 1s696
# 24610 | public | editions | 1 | 2s199 | 7483 kB | 2 | 1s032
# 24755 | public | sp_content | 1 | 1s555 | 4177 kB | 0 | 0ms
# 24619 | public | email_broadcasts | 1 | 744ms | 2645 kB | 2 | 677ms
# 24590 | public | courses | 1 | 387ms | 1540 kB | 2 | 367ms
# 24704 | public | payment_gateway_keys | 1 | 1s972 | 164 kB | 2 | 27ms
# 24576 | public | accounts | 1 | 58ms | 24 kB | 1 | 14ms
# 24647 | public | event_names | 1 | 32ms | 397 B | 1 | 8ms
# 24716 | public | payment_gateways | 1 | 1s675 | 117 B | 1 | 11ms
# 24748 | public | roles | 1 | 71ms | 173 B | 1 | 8ms
# 24676 | public | management | 1 | 33ms | 40 B | 1 | 19ms
# Step Connection Duration Transfer Concurrency
# -------------------------------------------------- ---------- ---------- ---------- ------------
# Catalog Queries (table ordering, filtering, etc) source 12s 1
# Dump Schema source 765ms 1
# Prepare Schema target 466ms 1
# COPY, INDEX, CONSTRAINTS, VACUUM (wall clock) both 2h47m 12
# COPY (cumulative) both 7h46m 1225 GB 4
# CREATE INDEX (cumulative) target 4h36m 4
# CONSTRAINTS (cumulative) target 8s493 4
# VACUUM (cumulative) target 0ms 4
# Reset Sequences both 60ms 1
# Large Objects (cumulative) (null) 0ms 0
# Finalize Schema both 14h01m 4
# -------------------------------------------------- ---------- ---------- ---------- ------------
# Total Wall Clock Duration both 16h49m 20
# EOF
- name: show tables sizes and retrieve current backpressure seconds
run: |
export LD_LIBRARY_PATH=${PG_16_LIB_PATH}
${PSQL} "${NEW_PROJECT_CONNSTR}" -c "\dt+"
BACKPRESSURE_TIME_AFTER_INGEST=$(${PSQL} "${NEW_PROJECT_CONNSTR}" -t -c "select backpressure_throttling_time()/1000000;")
echo "BACKPRESSURE_TIME_AFTER_INGEST=${BACKPRESSURE_TIME_AFTER_INGEST}" >> $GITHUB_ENV
- name: Parse pgcopydb log and report performance metrics
env:
PERF_TEST_RESULT_CONNSTR: ${{ secrets.PERF_TEST_RESULT_CONNSTR }}
run: |
export LD_LIBRARY_PATH=${PG_16_LIB_PATH}
# Define the log file path
LOG_FILE="/tmp/pgcopydb_${{ matrix.target_project }}.log"
# Get the current git commit hash
git config --global --add safe.directory /__w/neon/neon
COMMIT_HASH=$(git rev-parse --short HEAD)
# Define the platform and test suite
PLATFORM="pg16-${{ matrix.target_project }}-us-east-2-staging"
SUIT="pgcopydb_ingest_bench"
# Function to convert time (e.g., "2h47m", "4h36m", "118ms", "8s493") to seconds
convert_to_seconds() {
local duration=$1
local total_seconds=0
# Check for hours (h)
if [[ "$duration" =~ ([0-9]+)h ]]; then
total_seconds=$((total_seconds + ${BASH_REMATCH[1]#0} * 3600))
fi
# Check for seconds (s)
if [[ "$duration" =~ ([0-9]+)s ]]; then
total_seconds=$((total_seconds + ${BASH_REMATCH[1]#0}))
fi
# Check for milliseconds (ms) (if applicable)
if [[ "$duration" =~ ([0-9]+)ms ]]; then
total_seconds=$((total_seconds + ${BASH_REMATCH[1]#0} / 1000))
duration=${duration/${BASH_REMATCH[0]}/} # need to remove it to avoid double counting with m
fi
# Check for minutes (m) - must be checked after ms because m is contained in ms
if [[ "$duration" =~ ([0-9]+)m ]]; then
total_seconds=$((total_seconds + ${BASH_REMATCH[1]#0} * 60))
fi
echo $total_seconds
}
# Calculate the backpressure difference in seconds
BACKPRESSURE_TIME_DIFF=$(awk "BEGIN {print $BACKPRESSURE_TIME_AFTER_INGEST - $BACKPRESSURE_TIME_BEFORE_INGEST}")
# Insert the backpressure time difference into the performance database
if [ -n "$BACKPRESSURE_TIME_DIFF" ]; then
PSQL_CMD="${PSQL} \"${PERF_TEST_RESULT_CONNSTR}\" -c \"
INSERT INTO public.perf_test_results (suit, revision, platform, metric_name, metric_value, metric_unit, metric_report_type, recorded_at_timestamp)
VALUES ('${SUIT}', '${COMMIT_HASH}', '${PLATFORM}', 'backpressure_time', ${BACKPRESSURE_TIME_DIFF}, 'seconds', 'lower_is_better', now());
\""
echo "Inserting backpressure time difference: ${BACKPRESSURE_TIME_DIFF} seconds"
eval $PSQL_CMD
fi
# Extract and process log lines
while IFS= read -r line; do
METRIC_NAME=""
# Match each desired line and extract the relevant information
if [[ "$line" =~ COPY,\ INDEX,\ CONSTRAINTS,\ VACUUM.* ]]; then
METRIC_NAME="COPY, INDEX, CONSTRAINTS, VACUUM (wall clock)"
elif [[ "$line" =~ COPY\ \(cumulative\).* ]]; then
METRIC_NAME="COPY (cumulative)"
elif [[ "$line" =~ CREATE\ INDEX\ \(cumulative\).* ]]; then
METRIC_NAME="CREATE INDEX (cumulative)"
elif [[ "$line" =~ CONSTRAINTS\ \(cumulative\).* ]]; then
METRIC_NAME="CONSTRAINTS (cumulative)"
elif [[ "$line" =~ Finalize\ Schema.* ]]; then
METRIC_NAME="Finalize Schema"
elif [[ "$line" =~ Total\ Wall\ Clock\ Duration.* ]]; then
METRIC_NAME="Total Wall Clock Duration"
fi
# If a metric was matched, insert it into the performance database
if [ -n "$METRIC_NAME" ]; then
DURATION=$(echo "$line" | grep -oP '\d+h\d+m|\d+s|\d+ms|\d{1,2}h\d{1,2}m|\d+\.\d+s' | head -n 1)
METRIC_VALUE=$(convert_to_seconds "$DURATION")
PSQL_CMD="${PSQL} \"${PERF_TEST_RESULT_CONNSTR}\" -c \"
INSERT INTO public.perf_test_results (suit, revision, platform, metric_name, metric_value, metric_unit, metric_report_type, recorded_at_timestamp)
VALUES ('${SUIT}', '${COMMIT_HASH}', '${PLATFORM}', '${METRIC_NAME}', ${METRIC_VALUE}, 'seconds', 'lower_is_better', now());
\""
echo "Inserting ${METRIC_NAME} with value ${METRIC_VALUE} seconds"
eval $PSQL_CMD
fi
done < "$LOG_FILE"
- name: Delete Neon Project
if: ${{ always() && matrix.target_project == 'new_empty_project' }}
uses: ./.github/actions/neon-project-delete
with:
project_id: ${{ steps.create-neon-project-ingest-target.outputs.project_id }}
api_key: ${{ secrets.NEON_STAGING_API_KEY }}
- name: Delete Neon Branch for large tenant
if: ${{ always() && matrix.target_project == 'large_existing_project' }}
uses: ./.github/actions/neon-branch-delete
with:
project_id: ${{ vars.BENCHMARK_INGEST_TARGET_PROJECTID }}
branch_id: ${{ steps.create-neon-branch-ingest-target.outputs.branch_id }}
api_key: ${{ secrets.NEON_STAGING_API_KEY }}

View File

@@ -201,6 +201,8 @@ jobs:
REPORT_URL: ${{ steps.upload-stats.outputs.report-url }}
SHA: ${{ github.event.pull_request.head.sha || github.sha }}
with:
# Retry script for 5XX server errors: https://github.com/actions/github-script#retries
retries: 5
script: |
const { REPORT_URL, SHA } = process.env

94
.github/workflows/pre-merge-checks.yml vendored Normal file
View File

@@ -0,0 +1,94 @@
name: Pre-merge checks
on:
merge_group:
branches:
- main
defaults:
run:
shell: bash -euxo pipefail {0}
# No permission for GITHUB_TOKEN by default; the **minimal required** set of permissions should be granted in each job.
permissions: {}
jobs:
get-changed-files:
runs-on: ubuntu-22.04
outputs:
python-changed: ${{ steps.python-src.outputs.any_changed }}
steps:
- uses: actions/checkout@v4
- uses: tj-actions/changed-files@4edd678ac3f81e2dc578756871e4d00c19191daf # v45.0.4
id: python-src
with:
files: |
.github/workflows/pre-merge-checks.yml
**/**.py
poetry.lock
pyproject.toml
- name: PRINT ALL CHANGED FILES FOR DEBUG PURPOSES
env:
PYTHON_CHANGED_FILES: ${{ steps.python-src.outputs.all_changed_files }}
run: |
echo "${PYTHON_CHANGED_FILES}"
check-build-tools-image:
if: needs.get-changed-files.outputs.python-changed == 'true'
needs: [ get-changed-files ]
uses: ./.github/workflows/check-build-tools-image.yml
build-build-tools-image:
needs: [ check-build-tools-image ]
uses: ./.github/workflows/build-build-tools-image.yml
with:
image-tag: ${{ needs.check-build-tools-image.outputs.image-tag }}
secrets: inherit
check-codestyle-python:
if: needs.get-changed-files.outputs.python-changed == 'true'
needs: [ get-changed-files, build-build-tools-image ]
uses: ./.github/workflows/_check-codestyle-python.yml
with:
build-tools-image: ${{ needs.build-build-tools-image.outputs.image }}-bookworm
secrets: inherit
# To get items from the merge queue merged into main we need to satisfy "Status checks that are required".
# Currently we require 2 jobs (checks with exact name):
# - conclusion
# - neon-cloud-e2e
conclusion:
if: always()
permissions:
statuses: write # for `github.repos.createCommitStatus(...)`
needs:
- get-changed-files
- check-codestyle-python
runs-on: ubuntu-22.04
steps:
- name: Create fake `neon-cloud-e2e` check
uses: actions/github-script@v7
with:
# Retry script for 5XX server errors: https://github.com/actions/github-script#retries
retries: 5
script: |
const { repo, owner } = context.repo;
const targetUrl = `${context.serverUrl}/${owner}/${repo}/actions/runs/${context.runId}`;
await github.rest.repos.createCommitStatus({
owner: owner,
repo: repo,
sha: context.sha,
context: `neon-cloud-e2e`,
state: `success`,
target_url: targetUrl,
description: `fake check for merge queue`,
});
- name: Fail the job if any of the dependencies do not succeed or skipped
run: exit 1
if: |
(contains(needs.check-codestyle-python.result, 'skipped') && needs.get-changed-files.outputs.python-changed == 'true')
|| contains(needs.*.result, 'failure')
|| contains(needs.*.result, 'cancelled')

View File

@@ -0,0 +1,29 @@
name: Report Workflow Stats Batch
on:
schedule:
- cron: '*/15 * * * *'
- cron: '25 0 * * *'
jobs:
gh-workflow-stats-batch:
name: GitHub Workflow Stats Batch
runs-on: ubuntu-22.04
permissions:
actions: read
steps:
- name: Export Workflow Run for the past 2 hours
uses: neondatabase/gh-workflow-stats-action@v0.2.1
with:
db_uri: ${{ secrets.GH_REPORT_STATS_DB_RW_CONNSTR }}
db_table: "gh_workflow_stats_batch_neon"
gh_token: ${{ secrets.GITHUB_TOKEN }}
duration: '2h'
- name: Export Workflow Run for the past 24 hours
if: github.event.schedule == '25 0 * * *'
uses: neondatabase/gh-workflow-stats-action@v0.2.1
with:
db_uri: ${{ secrets.GH_REPORT_STATS_DB_RW_CONNSTR }}
db_table: "gh_workflow_stats_batch_neon"
gh_token: ${{ secrets.GITHUB_TOKEN }}
duration: '24h'

View File

@@ -23,6 +23,7 @@ on:
- Test Postgres client libraries
- Trigger E2E Tests
- cleanup caches by a branch
- Pre-merge checks
types: [completed]
jobs:

2
.gitignore vendored
View File

@@ -6,6 +6,8 @@ __pycache__/
test_output/
.vscode
.idea
*.swp
tags
neon.iml
/.neon
/integration_tests/.neon

181
Cargo.lock generated
View File

@@ -310,33 +310,6 @@ dependencies = [
"zeroize",
]
[[package]]
name = "aws-lc-rs"
version = "1.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2f95446d919226d587817a7d21379e6eb099b97b45110a7f272a444ca5c54070"
dependencies = [
"aws-lc-sys",
"mirai-annotations",
"paste",
"zeroize",
]
[[package]]
name = "aws-lc-sys"
version = "0.21.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b3ddc4a5b231dd6958b140ff3151b6412b3f4321fab354f399eec8f14b06df62"
dependencies = [
"bindgen 0.69.5",
"cc",
"cmake",
"dunce",
"fs_extra",
"libc",
"paste",
]
[[package]]
name = "aws-runtime"
version = "1.4.3"
@@ -942,29 +915,6 @@ dependencies = [
"serde",
]
[[package]]
name = "bindgen"
version = "0.69.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "271383c67ccabffb7381723dea0672a673f292304fcb45c01cc648c7a8d58088"
dependencies = [
"bitflags 2.4.1",
"cexpr",
"clang-sys",
"itertools 0.10.5",
"lazy_static",
"lazycell",
"log",
"prettyplease",
"proc-macro2",
"quote",
"regex",
"rustc-hash",
"shlex",
"syn 2.0.52",
"which",
]
[[package]]
name = "bindgen"
version = "0.70.1"
@@ -974,7 +924,7 @@ dependencies = [
"bitflags 2.4.1",
"cexpr",
"clang-sys",
"itertools 0.10.5",
"itertools 0.12.1",
"log",
"prettyplease",
"proc-macro2",
@@ -1220,15 +1170,6 @@ version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2da6da31387c7e4ef160ffab6d5e7f00c42626fe39aea70a7b0f1773f7dd6c1b"
[[package]]
name = "cmake"
version = "0.1.51"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fb1e43aa7fd152b1f968787f7dbcdeb306d1867ff373c69955211876c053f91a"
dependencies = [
"cc",
]
[[package]]
name = "colorchoice"
version = "1.0.0"
@@ -1288,12 +1229,15 @@ dependencies = [
"flate2",
"futures",
"hyper 0.14.30",
"metrics",
"nix 0.27.1",
"notify",
"num_cpus",
"once_cell",
"opentelemetry",
"opentelemetry_sdk",
"postgres",
"prometheus",
"regex",
"remote_storage",
"reqwest 0.12.4",
@@ -1329,9 +1273,9 @@ dependencies = [
[[package]]
name = "const-oid"
version = "0.9.5"
version = "0.9.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "28c122c3980598d243d63d9a704629a2d748d101f278052ff068be5a4423ab6f"
checksum = "c2459377285ad874054d797f3ccebf984978aa39129f6eafde5cdc8315b612f8"
[[package]]
name = "const-random"
@@ -1815,12 +1759,6 @@ dependencies = [
"syn 2.0.52",
]
[[package]]
name = "dunce"
version = "1.0.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "92773504d58c093f6de2459af4af33faa518c13451eb8f2b5698ed3d36e7c813"
[[package]]
name = "dyn-clone"
version = "1.0.14"
@@ -2125,12 +2063,6 @@ dependencies = [
"tokio-util",
]
[[package]]
name = "fs_extra"
version = "1.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "42703706b716c37f96a77aea830392ad231f44c9e9a67872fa5548707e11b11c"
[[package]]
name = "fsevent-sys"
version = "4.1.0"
@@ -2484,15 +2416,6 @@ dependencies = [
"digest",
]
[[package]]
name = "home"
version = "0.5.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e3d1354bf6b7235cb4a0576c2619fd4ed18183f689b12b006a0ee7329eeff9a5"
dependencies = [
"windows-sys 0.52.0",
]
[[package]]
name = "hostname"
version = "0.4.0"
@@ -2988,12 +2911,6 @@ dependencies = [
"spin",
]
[[package]]
name = "lazycell"
version = "1.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "830d08ce1d1d941e6b30645f1a0eb5643013d835ce3779a5fc208261dbe10f55"
[[package]]
name = "libc"
version = "0.2.150"
@@ -3224,12 +3141,6 @@ dependencies = [
"windows-sys 0.48.0",
]
[[package]]
name = "mirai-annotations"
version = "1.12.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c9be0862c1b3f26a88803c4a49de6889c10e608b3ee9344e6ef5b45fb37ad3d1"
[[package]]
name = "multimap"
version = "0.8.3"
@@ -3749,6 +3660,7 @@ dependencies = [
"tracing",
"url",
"utils",
"wal_decoder",
"walkdir",
"workspace_hack",
]
@@ -4146,7 +4058,7 @@ dependencies = [
"bytes",
"once_cell",
"pq_proto",
"rustls 0.23.7",
"rustls 0.23.16",
"rustls-pemfile 2.1.1",
"serde",
"thiserror",
@@ -4175,7 +4087,7 @@ name = "postgres_ffi"
version = "0.1.0"
dependencies = [
"anyhow",
"bindgen 0.70.1",
"bindgen",
"bytes",
"crc32c",
"env_logger",
@@ -4186,6 +4098,7 @@ dependencies = [
"regex",
"serde",
"thiserror",
"tracing",
"utils",
]
@@ -4312,7 +4225,7 @@ checksum = "0c1318b19085f08681016926435853bbf7858f9c082d0999b80550ff5d9abe15"
dependencies = [
"bytes",
"heck 0.5.0",
"itertools 0.10.5",
"itertools 0.12.1",
"log",
"multimap",
"once_cell",
@@ -4332,7 +4245,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e9552f850d5f0964a4e4d0bf306459ac29323ddfbae05e35a7c0d35cb0803cc5"
dependencies = [
"anyhow",
"itertools 0.10.5",
"itertools 0.12.1",
"proc-macro2",
"quote",
"syn 2.0.52",
@@ -4420,7 +4333,7 @@ dependencies = [
"rsa",
"rstest",
"rustc-hash",
"rustls 0.23.7",
"rustls 0.23.16",
"rustls-native-certs 0.8.0",
"rustls-pemfile 2.1.1",
"scopeguard",
@@ -4431,6 +4344,8 @@ dependencies = [
"smallvec",
"smol_str",
"socket2",
"strum",
"strum_macros",
"subtle",
"thiserror",
"tikv-jemalloc-ctl",
@@ -4453,6 +4368,7 @@ dependencies = [
"walkdir",
"workspace_hack",
"x509-parser",
"zerocopy",
]
[[package]]
@@ -4830,6 +4746,7 @@ dependencies = [
"percent-encoding",
"pin-project-lite",
"rustls 0.22.4",
"rustls-native-certs 0.7.0",
"rustls-pemfile 2.1.1",
"rustls-pki-types",
"serde",
@@ -5104,23 +5021,22 @@ dependencies = [
"log",
"ring",
"rustls-pki-types",
"rustls-webpki 0.102.2",
"rustls-webpki 0.102.8",
"subtle",
"zeroize",
]
[[package]]
name = "rustls"
version = "0.23.7"
version = "0.23.16"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ebbbdb961df0ad3f2652da8f3fdc4b36122f568f968f45ad3316f26c025c677b"
checksum = "eee87ff5d9b36712a58574e12e9f0ea80f915a5b0ac518d322b24a465617925e"
dependencies = [
"aws-lc-rs",
"log",
"once_cell",
"ring",
"rustls-pki-types",
"rustls-webpki 0.102.2",
"rustls-webpki 0.102.8",
"subtle",
"zeroize",
]
@@ -5200,11 +5116,10 @@ dependencies = [
[[package]]
name = "rustls-webpki"
version = "0.102.2"
version = "0.102.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "faaa0a62740bedb9b2ef5afa303da42764c012f743917351dc9a237ea1663610"
checksum = "64ca1bc8749bd4cf37b5ce386cc146580777b4e8572c7b97baf22c83f444bee9"
dependencies = [
"aws-lc-rs",
"ring",
"rustls-pki-types",
"untrusted",
@@ -5235,6 +5150,7 @@ dependencies = [
"chrono",
"clap",
"crc32c",
"criterion",
"desim",
"fail",
"futures",
@@ -5242,6 +5158,7 @@ dependencies = [
"http 1.1.0",
"humantime",
"hyper 0.14.30",
"itertools 0.10.5",
"metrics",
"once_cell",
"parking_lot 0.12.1",
@@ -5821,6 +5738,7 @@ dependencies = [
"once_cell",
"parking_lot 0.12.1",
"prost",
"rustls 0.23.16",
"tokio",
"tonic",
"tonic-build",
@@ -5903,7 +5821,7 @@ dependencies = [
"postgres_ffi",
"remote_storage",
"reqwest 0.12.4",
"rustls 0.23.7",
"rustls 0.23.16",
"rustls-native-certs 0.8.0",
"serde",
"serde_json",
@@ -6272,7 +6190,7 @@ dependencies = [
[[package]]
name = "tokio-epoll-uring"
version = "0.1.0"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#08ccfa94ff5507727bf4d8d006666b5b192e04c6"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#33e00106a268644d02ba0461bbd64476073b0ee1"
dependencies = [
"futures",
"nix 0.26.4",
@@ -6336,7 +6254,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "04fb792ccd6bbcd4bba408eb8a292f70fc4a3589e5d793626f45190e6454b6ab"
dependencies = [
"ring",
"rustls 0.23.7",
"rustls 0.23.16",
"tokio",
"tokio-postgres",
"tokio-rustls 0.26.0",
@@ -6370,7 +6288,7 @@ version = "0.26.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0c7bc40d0e5a97695bb96e27995cd3a08538541b0a846f65bba7a359f36700d4"
dependencies = [
"rustls 0.23.7",
"rustls 0.23.16",
"rustls-pki-types",
"tokio",
]
@@ -6779,7 +6697,7 @@ dependencies = [
"base64 0.22.1",
"log",
"once_cell",
"rustls 0.23.7",
"rustls 0.23.16",
"rustls-pki-types",
"url",
"webpki-roots 0.26.1",
@@ -6788,7 +6706,7 @@ dependencies = [
[[package]]
name = "uring-common"
version = "0.1.0"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#08ccfa94ff5507727bf4d8d006666b5b192e04c6"
source = "git+https://github.com/neondatabase/tokio-epoll-uring.git?branch=main#33e00106a268644d02ba0461bbd64476073b0ee1"
dependencies = [
"bytes",
"io-uring",
@@ -6954,6 +6872,20 @@ dependencies = [
"utils",
]
[[package]]
name = "wal_decoder"
version = "0.1.0"
dependencies = [
"anyhow",
"bytes",
"pageserver_api",
"postgres_ffi",
"serde",
"tracing",
"utils",
"workspace_hack",
]
[[package]]
name = "walkdir"
version = "2.3.3"
@@ -6969,7 +6901,7 @@ name = "walproposer"
version = "0.1.0"
dependencies = [
"anyhow",
"bindgen 0.70.1",
"bindgen",
"postgres_ffi",
"utils",
]
@@ -7144,18 +7076,6 @@ dependencies = [
"rustls-pki-types",
]
[[package]]
name = "which"
version = "4.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "87ba24419a2078cd2b0f2ede2691b6c66d8e47836da3b6db8265ebad47afbfc7"
dependencies = [
"either",
"home",
"once_cell",
"rustix",
]
[[package]]
name = "whoami"
version = "1.5.1"
@@ -7415,7 +7335,7 @@ dependencies = [
"hyper-util",
"indexmap 1.9.3",
"indexmap 2.0.1",
"itertools 0.10.5",
"itertools 0.12.1",
"lazy_static",
"libc",
"log",
@@ -7436,8 +7356,7 @@ dependencies = [
"regex-automata 0.4.3",
"regex-syntax 0.8.2",
"reqwest 0.12.4",
"rustls 0.23.7",
"rustls-webpki 0.102.2",
"rustls 0.23.16",
"scopeguard",
"serde",
"serde_json",
@@ -7462,6 +7381,7 @@ dependencies = [
"tracing",
"tracing-core",
"url",
"zerocopy",
"zeroize",
"zstd",
"zstd-safe",
@@ -7534,6 +7454,7 @@ version = "0.7.31"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1c4061bedbb353041c12f413700357bec76df2c7e2ca8e4df8bac24c6bf68e3d"
dependencies = [
"byteorder",
"zerocopy-derive",
]

View File

@@ -33,6 +33,7 @@ members = [
"libs/postgres_ffi/wal_craft",
"libs/vm_monitor",
"libs/walproposer",
"libs/wal_decoder",
]
[workspace.package]
@@ -142,7 +143,7 @@ reqwest-retry = "0.5"
routerify = "3"
rpds = "0.13"
rustc-hash = "1.1.0"
rustls = "0.23"
rustls = { version = "0.23.16", default-features = false }
rustls-pemfile = "2"
scopeguard = "1.1"
sysinfo = "0.29.2"
@@ -173,7 +174,7 @@ tokio = { version = "1.17", features = ["macros"] }
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.12.0"
tokio-rustls = "0.26"
tokio-rustls = { version = "0.26.0", default-features = false, features = ["tls12", "ring"]}
tokio-stream = "0.1"
tokio-tar = "0.3"
tokio-util = { version = "0.7.10", features = ["io", "rt"] }
@@ -195,6 +196,7 @@ walkdir = "2.3.2"
rustls-native-certs = "0.8"
x509-parser = "0.16"
whoami = "1.5.1"
zerocopy = { version = "0.7", features = ["derive"] }
## TODO replace this with tracing
env_logger = "0.10"
@@ -238,6 +240,7 @@ tracing-utils = { version = "0.1", path = "./libs/tracing-utils/" }
utils = { version = "0.1", path = "./libs/utils/" }
vm_monitor = { version = "0.1", path = "./libs/vm_monitor/" }
walproposer = { version = "0.1", path = "./libs/walproposer/" }
wal_decoder = { version = "0.1", path = "./libs/wal_decoder" }
## Common library dependency
workspace_hack = { version = "0.1", path = "./workspace_hack/" }

View File

@@ -1,12 +1,66 @@
ARG DEBIAN_VERSION=bullseye
FROM debian:${DEBIAN_VERSION}-slim
FROM debian:bookworm-slim AS pgcopydb_builder
ARG DEBIAN_VERSION
RUN if [ "${DEBIAN_VERSION}" = "bookworm" ]; then \
set -e && \
apt update && \
apt install -y --no-install-recommends \
ca-certificates wget gpg && \
wget -qO - https://www.postgresql.org/media/keys/ACCC4CF8.asc | gpg --dearmor -o /usr/share/keyrings/postgresql-keyring.gpg && \
echo "deb [signed-by=/usr/share/keyrings/postgresql-keyring.gpg] http://apt.postgresql.org/pub/repos/apt bookworm-pgdg main" > /etc/apt/sources.list.d/pgdg.list && \
apt-get update && \
apt install -y --no-install-recommends \
build-essential \
autotools-dev \
libedit-dev \
libgc-dev \
libpam0g-dev \
libreadline-dev \
libselinux1-dev \
libxslt1-dev \
libssl-dev \
libkrb5-dev \
zlib1g-dev \
liblz4-dev \
libpq5 \
libpq-dev \
libzstd-dev \
postgresql-16 \
postgresql-server-dev-16 \
postgresql-common \
python3-sphinx && \
wget -O /tmp/pgcopydb.tar.gz https://github.com/dimitri/pgcopydb/archive/refs/tags/v0.17.tar.gz && \
mkdir /tmp/pgcopydb && \
tar -xzf /tmp/pgcopydb.tar.gz -C /tmp/pgcopydb --strip-components=1 && \
cd /tmp/pgcopydb && \
make -s clean && \
make -s -j12 install && \
libpq_path=$(find /lib /usr/lib -name "libpq.so.5" | head -n 1) && \
mkdir -p /pgcopydb/lib && \
cp "$libpq_path" /pgcopydb/lib/; \
else \
# copy command below will fail if we don't have dummy files, so we create them for other debian versions
mkdir -p /usr/lib/postgresql/16/bin && touch /usr/lib/postgresql/16/bin/pgcopydb && \
mkdir -p mkdir -p /pgcopydb/lib && touch /pgcopydb/lib/libpq.so.5; \
fi
FROM debian:${DEBIAN_VERSION}-slim AS build_tools
ARG DEBIAN_VERSION
# Add nonroot user
RUN useradd -ms /bin/bash nonroot -b /home
SHELL ["/bin/bash", "-c"]
RUN mkdir -p /pgcopydb/bin && \
mkdir -p /pgcopydb/lib && \
chmod -R 755 /pgcopydb && \
chown -R nonroot:nonroot /pgcopydb
COPY --from=pgcopydb_builder /usr/lib/postgresql/16/bin/pgcopydb /pgcopydb/bin/pgcopydb
COPY --from=pgcopydb_builder /pgcopydb/lib/libpq.so.5 /pgcopydb/lib/libpq.so.5
# System deps
#
# 'gdb' is included so that we get backtraces of core dumps produced in
@@ -38,7 +92,7 @@ RUN set -e \
libseccomp-dev \
libsqlite3-dev \
libssl-dev \
$([[ "${DEBIAN_VERSION}" = "bullseye" ]] && libstdc++-10-dev || libstdc++-11-dev) \
$([[ "${DEBIAN_VERSION}" = "bullseye" ]] && echo libstdc++-10-dev || echo libstdc++-11-dev) \
libtool \
libxml2-dev \
libxmlsec1-dev \
@@ -57,6 +111,18 @@ RUN set -e \
zstd \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# sql_exporter
# Keep the version the same as in compute/compute-node.Dockerfile and
# test_runner/regress/test_compute_metrics.py.
ENV SQL_EXPORTER_VERSION=0.13.1
RUN curl -fsSL \
"https://github.com/burningalchemist/sql_exporter/releases/download/${SQL_EXPORTER_VERSION}/sql_exporter-${SQL_EXPORTER_VERSION}.linux-$(case "$(uname -m)" in x86_64) echo amd64;; aarch64) echo arm64;; esac).tar.gz" \
--output sql_exporter.tar.gz \
&& mkdir /tmp/sql_exporter \
&& tar xzvf sql_exporter.tar.gz -C /tmp/sql_exporter --strip-components=1 \
&& mv /tmp/sql_exporter/sql_exporter /usr/local/bin/sql_exporter
# protobuf-compiler (protoc)
ENV PROTOC_VERSION=25.1
RUN curl -fsSL "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOC_VERSION}/protoc-${PROTOC_VERSION}-linux-$(uname -m | sed 's/aarch64/aarch_64/g').zip" -o "protoc.zip" \
@@ -223,7 +289,13 @@ RUN whoami \
&& cargo --version --verbose \
&& rustup --version --verbose \
&& rustc --version --verbose \
&& clang --version
&& clang --version
RUN if [ "${DEBIAN_VERSION}" = "bookworm" ]; then \
LD_LIBRARY_PATH=/pgcopydb/lib /pgcopydb/bin/pgcopydb --version; \
else \
echo "pgcopydb is not available for ${DEBIAN_VERSION}"; \
fi
# Set following flag to check in Makefile if its running in Docker
RUN touch /home/nonroot/.docker_build

View File

@@ -22,6 +22,7 @@ sql_exporter.yml: $(jsonnet_files)
--output-file etc/$@ \
--tla-str collector_name=neon_collector \
--tla-str collector_file=neon_collector.yml \
--tla-str 'connection_string=postgresql://cloud_admin@127.0.0.1:5432/postgres?sslmode=disable&application_name=sql_exporter' \
etc/sql_exporter.jsonnet
sql_exporter_autoscaling.yml: $(jsonnet_files)
@@ -29,7 +30,7 @@ sql_exporter_autoscaling.yml: $(jsonnet_files)
--output-file etc/$@ \
--tla-str collector_name=neon_collector_autoscaling \
--tla-str collector_file=neon_collector_autoscaling.yml \
--tla-str application_name=sql_exporter_autoscaling \
--tla-str 'connection_string=postgresql://cloud_admin@127.0.0.1:5432/postgres?sslmode=disable&application_name=sql_exporter_autoscaling' \
etc/sql_exporter.jsonnet
.PHONY: clean

View File

@@ -431,14 +431,11 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY compute/patches/rum.patch /rum.patch
# maybe version-specific
# support for v17 is unknown
# last release 1.3.13 - Sep 19, 2022
RUN case "${PG_VERSION}" in "v17") \
echo "v17 extensions are not supported yet. Quit" && exit 0;; \
esac && \
wget https://github.com/postgrespro/rum/archive/refs/tags/1.3.13.tar.gz -O rum.tar.gz && \
echo "6ab370532c965568df6210bd844ac6ba649f53055e48243525b0b7e5c4d69a7d rum.tar.gz" | sha256sum --check && \
# supports v17 since https://github.com/postgrespro/rum/commit/cb1edffc57736cd2a4455f8d0feab0d69928da25
# doesn't use releases since 1.3.13 - Sep 19, 2022
# use latest commit from the master branch
RUN wget https://github.com/postgrespro/rum/archive/cb1edffc57736cd2a4455f8d0feab0d69928da25.tar.gz -O rum.tar.gz && \
echo "65e0a752e99f4c3226400c9b899f997049e93503db8bf5c8072efa136d32fd83 rum.tar.gz" | sha256sum --check && \
mkdir rum-src && cd rum-src && tar xzf ../rum.tar.gz --strip-components=1 -C . && \
patch -p1 < /rum.patch && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
@@ -562,8 +559,8 @@ RUN case "${PG_VERSION}" in \
export TIMESCALEDB_CHECKSUM=584a351c7775f0e067eaa0e7277ea88cab9077cc4c455cbbf09a5d9723dce95d \
;; \
"v17") \
export TIMESCALEDB_VERSION=2.17.0 \
export TIMESCALEDB_CHECKSUM=155bf64391d3558c42f31ca0e523cfc6252921974f75298c9039ccad1c89811a \
export TIMESCALEDB_VERSION=2.17.1 \
export TIMESCALEDB_CHECKSUM=6277cf43f5695e23dae1c5cfeba00474d730b66ed53665a84b787a6bb1a57e28 \
;; \
esac && \
wget https://github.com/timescale/timescaledb/archive/refs/tags/${TIMESCALEDB_VERSION}.tar.gz -O timescaledb.tar.gz && \
@@ -627,16 +624,12 @@ FROM build-deps AS pg-cron-pg-build
ARG PG_VERSION
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
# 1.6.4 available, supports v17
# This is an experimental extension that we do not support on prod yet.
# !Do not remove!
# We set it in shared_preload_libraries and computes will fail to start if library is not found.
ENV PATH="/usr/local/pgsql/bin/:$PATH"
RUN case "${PG_VERSION}" in "v17") \
echo "v17 extensions are not supported yet. Quit" && exit 0;; \
esac && \
wget https://github.com/citusdata/pg_cron/archive/refs/tags/v1.6.0.tar.gz -O pg_cron.tar.gz && \
echo "383a627867d730222c272bfd25cd5e151c578d73f696d32910c7db8c665cc7db pg_cron.tar.gz" | sha256sum --check && \
RUN wget https://github.com/citusdata/pg_cron/archive/refs/tags/v1.6.4.tar.gz -O pg_cron.tar.gz && \
echo "52d1850ee7beb85a4cb7185731ef4e5a90d1de216709d8988324b0d02e76af61 pg_cron.tar.gz" | sha256sum --check && \
mkdir pg_cron-src && cd pg_cron-src && tar xzf ../pg_cron.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
@@ -666,7 +659,7 @@ RUN apt-get update && \
#
# Use new version only for v17
# because Release_2024_09_1 has some backward incompatible changes
# https://github.com/rdkit/rdkit/releases/tag/Release_2024_09_1
# https://github.com/rdkit/rdkit/releases/tag/Release_2024_09_1
ENV PATH="/usr/local/pgsql/bin/:/usr/local/pgsql/:$PATH"
RUN case "${PG_VERSION}" in \
"v17") \
@@ -860,18 +853,98 @@ ENV PATH="/home/nonroot/.cargo/bin:/usr/local/pgsql/bin/:$PATH"
USER nonroot
WORKDIR /home/nonroot
RUN case "${PG_VERSION}" in "v17") \
echo "v17 is not supported yet by pgrx. Quit" && exit 0;; \
esac && \
curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && \
RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && \
chmod +x rustup-init && \
./rustup-init -y --no-modify-path --profile minimal --default-toolchain stable && \
rm rustup-init && \
case "${PG_VERSION}" in \
'v17') \
echo 'v17 is not supported yet by pgrx. Quit' && exit 0;; \
esac && \
cargo install --locked --version 0.11.3 cargo-pgrx && \
/bin/bash -c 'cargo pgrx init --pg${PG_VERSION:1}=/usr/local/pgsql/bin/pg_config'
USER root
#########################################################################################
#
# Layer "rust extensions pgrx12"
#
# pgrx started to support Postgres 17 since version 12,
# but some older extension aren't compatible with it.
# This layer should be used as a base for new pgrx extensions,
# and eventually get merged with `rust-extensions-build`
#
#########################################################################################
FROM build-deps AS rust-extensions-build-pgrx12
ARG PG_VERSION
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN apt-get update && \
apt-get install --no-install-recommends -y curl libclang-dev && \
useradd -ms /bin/bash nonroot -b /home
ENV HOME=/home/nonroot
ENV PATH="/home/nonroot/.cargo/bin:/usr/local/pgsql/bin/:$PATH"
USER nonroot
WORKDIR /home/nonroot
RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && \
chmod +x rustup-init && \
./rustup-init -y --no-modify-path --profile minimal --default-toolchain stable && \
rm rustup-init && \
cargo install --locked --version 0.12.6 cargo-pgrx && \
/bin/bash -c 'cargo pgrx init --pg${PG_VERSION:1}=/usr/local/pgsql/bin/pg_config'
USER root
#########################################################################################
#
# Layers "pg-onnx-build" and "pgrag-pg-build"
# Compile "pgrag" extensions
#
#########################################################################################
FROM rust-extensions-build-pgrx12 AS pg-onnx-build
# cmake 3.26 or higher is required, so installing it using pip (bullseye-backports has cmake 3.25).
# Install it using virtual environment, because Python 3.11 (the default version on Debian 12 (Bookworm)) complains otherwise
RUN apt-get update && apt-get install -y python3 python3-pip python3-venv && \
python3 -m venv venv && \
. venv/bin/activate && \
python3 -m pip install cmake==3.30.5 && \
wget https://github.com/microsoft/onnxruntime/archive/refs/tags/v1.18.1.tar.gz -O onnxruntime.tar.gz && \
mkdir onnxruntime-src && cd onnxruntime-src && tar xzf ../onnxruntime.tar.gz --strip-components=1 -C . && \
./build.sh --config Release --parallel --skip_submodule_sync --skip_tests --allow_running_as_root
FROM pg-onnx-build AS pgrag-pg-build
RUN apt-get install -y protobuf-compiler && \
wget https://github.com/neondatabase-labs/pgrag/archive/refs/tags/v0.0.0.tar.gz -O pgrag.tar.gz && \
echo "2cbe394c1e74fc8bcad9b52d5fbbfb783aef834ca3ce44626cfd770573700bb4 pgrag.tar.gz" | sha256sum --check && \
mkdir pgrag-src && cd pgrag-src && tar xzf ../pgrag.tar.gz --strip-components=1 -C . && \
\
cd exts/rag && \
sed -i 's/pgrx = "0.12.6"/pgrx = { version = "0.12.6", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgrx install --release && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/rag.control && \
\
cd ../rag_bge_small_en_v15 && \
sed -i 's/pgrx = "0.12.6"/pgrx = { version = "0.12.6", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
ORT_LIB_LOCATION=/home/nonroot/onnxruntime-src/build/Linux \
REMOTE_ONNX_URL=http://pg-ext-s3-gateway/pgrag-data/bge_small_en_v15.onnx \
cargo pgrx install --release --features remote_onnx && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/rag_bge_small_en_v15.control && \
\
cd ../rag_jina_reranker_v1_tiny_en && \
sed -i 's/pgrx = "0.12.6"/pgrx = { version = "0.12.6", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
ORT_LIB_LOCATION=/home/nonroot/onnxruntime-src/build/Linux \
REMOTE_ONNX_URL=http://pg-ext-s3-gateway/pgrag-data/jina_reranker_v1_tiny_en.onnx \
cargo pgrx install --release --features remote_onnx && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/rag_jina_reranker_v1_tiny_en.control
#########################################################################################
#
# Layer "pg-jsonschema-pg-build"
@@ -879,21 +952,31 @@ USER root
#
#########################################################################################
FROM rust-extensions-build AS pg-jsonschema-pg-build
FROM rust-extensions-build-pgrx12 AS pg-jsonschema-pg-build
ARG PG_VERSION
RUN case "${PG_VERSION}" in "v17") \
echo "pg_jsonschema does not yet have a release that supports pg17" && exit 0;; \
# version 0.3.3 supports v17
# last release v0.3.3 - Oct 16, 2024
#
# there were no breaking changes
# so we can use the same version for all postgres versions
RUN case "${PG_VERSION}" in \
"v14" | "v15" | "v16" | "v17") \
export PG_JSONSCHEMA_VERSION=0.3.3 \
export PG_JSONSCHEMA_CHECKSUM=40c2cffab4187e0233cb8c3bde013be92218c282f95f4469c5282f6b30d64eac \
;; \
*) \
echo "unexpected PostgreSQL version" && exit 1 \
;; \
esac && \
wget https://github.com/supabase/pg_jsonschema/archive/refs/tags/v0.3.1.tar.gz -O pg_jsonschema.tar.gz && \
echo "61df3db1ed83cf24f6aa39c826f8818bfa4f0bd33b587fd6b2b1747985642297 pg_jsonschema.tar.gz" | sha256sum --check && \
wget https://github.com/supabase/pg_jsonschema/archive/refs/tags/v${PG_JSONSCHEMA_VERSION}.tar.gz -O pg_jsonschema.tar.gz && \
echo "${PG_JSONSCHEMA_CHECKSUM} pg_jsonschema.tar.gz" | sha256sum --check && \
mkdir pg_jsonschema-src && cd pg_jsonschema-src && tar xzf ../pg_jsonschema.tar.gz --strip-components=1 -C . && \
# see commit 252b3685a27a0f4c31a0f91e983c6314838e89e8
# `unsafe-postgres` feature allows to build pgx extensions
# against postgres forks that decided to change their ABI name (like us).
# With that we can build extensions without forking them and using stock
# pgx. As this feature is new few manual version bumps were required.
sed -i 's/pgrx = "0.11.3"/pgrx = { version = "0.11.3", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
sed -i 's/pgrx = "0.12.6"/pgrx = { version = "0.12.6", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgrx install --release && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/pg_jsonschema.control
@@ -904,16 +987,27 @@ RUN case "${PG_VERSION}" in "v17") \
#
#########################################################################################
FROM rust-extensions-build AS pg-graphql-pg-build
FROM rust-extensions-build-pgrx12 AS pg-graphql-pg-build
ARG PG_VERSION
RUN case "${PG_VERSION}" in "v17") \
echo "pg_graphql does not yet have a release that supports pg17 as of now" && exit 0;; \
# version 1.5.9 supports v17
# last release v1.5.9 - Oct 16, 2024
#
# there were no breaking changes
# so we can use the same version for all postgres versions
RUN case "${PG_VERSION}" in \
"v14" | "v15" | "v16" | "v17") \
export PG_GRAPHQL_VERSION=1.5.9 \
export PG_GRAPHQL_CHECKSUM=cf768385a41278be1333472204fc0328118644ae443182cf52f7b9b23277e497 \
;; \
*) \
echo "unexpected PostgreSQL version" && exit 1 \
;; \
esac && \
wget https://github.com/supabase/pg_graphql/archive/refs/tags/v1.5.7.tar.gz -O pg_graphql.tar.gz && \
echo "2b3e567a5b31019cb97ae0e33263c1bcc28580be5a444ac4c8ece5c4be2aea41 pg_graphql.tar.gz" | sha256sum --check && \
wget https://github.com/supabase/pg_graphql/archive/refs/tags/v${PG_GRAPHQL_VERSION}.tar.gz -O pg_graphql.tar.gz && \
echo "${PG_GRAPHQL_CHECKSUM} pg_graphql.tar.gz" | sha256sum --check && \
mkdir pg_graphql-src && cd pg_graphql-src && tar xzf ../pg_graphql.tar.gz --strip-components=1 -C . && \
sed -i 's/pgrx = "=0.11.3"/pgrx = { version = "0.11.3", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
sed -i 's/pgrx = "=0.12.6"/pgrx = { version = "0.12.6", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgrx install --release && \
# it's needed to enable extension because it uses untrusted C language
sed -i 's/superuser = false/superuser = true/g' /usr/local/pgsql/share/extension/pg_graphql.control && \
@@ -926,15 +1020,13 @@ RUN case "${PG_VERSION}" in "v17") \
#
#########################################################################################
FROM rust-extensions-build AS pg-tiktoken-pg-build
FROM rust-extensions-build-pgrx12 AS pg-tiktoken-pg-build
ARG PG_VERSION
# 26806147b17b60763039c6a6878884c41a262318 made on 26/09/2023
RUN case "${PG_VERSION}" in "v17") \
echo "pg_tiktoken does not have versions, nor support for pg17" && exit 0;; \
esac && \
wget https://github.com/kelvich/pg_tiktoken/archive/26806147b17b60763039c6a6878884c41a262318.tar.gz -O pg_tiktoken.tar.gz && \
echo "e64e55aaa38c259512d3e27c572da22c4637418cf124caba904cd50944e5004e pg_tiktoken.tar.gz" | sha256sum --check && \
# doesn't use releases
# 9118dd4549b7d8c0bbc98e04322499f7bf2fa6f7 - on Oct 29, 2024
RUN wget https://github.com/kelvich/pg_tiktoken/archive/9118dd4549b7d8c0bbc98e04322499f7bf2fa6f7.tar.gz -O pg_tiktoken.tar.gz && \
echo "a5bc447e7920ee149d3c064b8b9f0086c0e83939499753178f7d35788416f628 pg_tiktoken.tar.gz" | sha256sum --check && \
mkdir pg_tiktoken-src && cd pg_tiktoken-src && tar xzf ../pg_tiktoken.tar.gz --strip-components=1 -C . && \
# TODO update pgrx version in the pg_tiktoken repo and remove this line
sed -i 's/pgrx = { version = "=0.10.2",/pgrx = { version = "0.11.3",/g' Cargo.toml && \
@@ -952,6 +1044,8 @@ RUN case "${PG_VERSION}" in "v17") \
FROM rust-extensions-build AS pg-pgx-ulid-build
ARG PG_VERSION
# doesn't support v17 yet
# https://github.com/pksunkara/pgx_ulid/pull/52
RUN case "${PG_VERSION}" in "v17") \
echo "pgx_ulid does not support pg17 as of the latest version (0.1.5)" && exit 0;; \
esac && \
@@ -969,16 +1063,16 @@ RUN case "${PG_VERSION}" in "v17") \
#
#########################################################################################
FROM rust-extensions-build AS pg-session-jwt-build
FROM rust-extensions-build-pgrx12 AS pg-session-jwt-build
ARG PG_VERSION
RUN case "${PG_VERSION}" in "v17") \
echo "pg_session_jwt does not yet have a release that supports pg17" && exit 0;; \
esac && \
wget https://github.com/neondatabase/pg_session_jwt/archive/e1310b08ba51377a19e0559e4d1194883b9b2ba2.tar.gz -O pg_session_jwt.tar.gz && \
echo "837932a077888d5545fd54b0abcc79e5f8e37017c2769a930afc2f5c94df6f4e pg_session_jwt.tar.gz" | sha256sum --check && \
# NOTE: local_proxy depends on the version of pg_session_jwt
# Do not update without approve from proxy team
# Make sure the version is reflected in proxy/src/serverless/local_conn_pool.rs
RUN wget https://github.com/neondatabase/pg_session_jwt/archive/refs/tags/v0.1.2-v17.tar.gz -O pg_session_jwt.tar.gz && \
echo "c8ecbed9cb8c6441bce5134a176002b043018adf9d05a08e457dda233090a86e pg_session_jwt.tar.gz" | sha256sum --check && \
mkdir pg_session_jwt-src && cd pg_session_jwt-src && tar xzf ../pg_session_jwt.tar.gz --strip-components=1 -C . && \
sed -i 's/pgrx = "=0.11.3"/pgrx = { version = "=0.11.3", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
sed -i 's/pgrx = "0.12.6"/pgrx = { version = "=0.12.6", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgrx install --release
#########################################################################################
@@ -1041,6 +1135,34 @@ RUN wget https://github.com/pgpartman/pg_partman/archive/refs/tags/v5.1.0.tar.gz
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_partman.control
#########################################################################################
#
# Layer "pg_mooncake"
# compile pg_mooncake extension
#
#########################################################################################
FROM rust-extensions-build AS pg-mooncake-build
ARG PG_VERSION
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
# The topmost commit in the `neon` branch at the time of writing this
# https://github.com/Mooncake-Labs/pg_mooncake/commits/neon/
# https://github.com/Mooncake-Labs/pg_mooncake/commit/077c92c452bb6896a7b7776ee95f039984f076af
ENV PG_MOONCAKE_VERSION=077c92c452bb6896a7b7776ee95f039984f076af
ENV PATH="/usr/local/pgsql/bin/:$PATH"
RUN case "${PG_VERSION}" in \
'v14') \
echo "pg_mooncake is not supported on Postgres ${PG_VERSION}" && exit 0;; \
esac && \
git clone --depth 1 --branch neon https://github.com/Mooncake-Labs/pg_mooncake.git pg_mooncake-src && \
cd pg_mooncake-src && \
git checkout "${PG_MOONCAKE_VERSION}" && \
git submodule update --init --depth 1 --recursive && \
make BUILD_TYPE=release -j $(getconf _NPROCESSORS_ONLN) && \
make BUILD_TYPE=release -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pg_mooncake.control
#########################################################################################
#
# Layer "neon-pg-ext-build"
@@ -1059,6 +1181,7 @@ COPY --from=h3-pg-build /h3/usr /
COPY --from=unit-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=vector-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pgjwt-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pgrag-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-jsonschema-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-graphql-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-tiktoken-pg-build /usr/local/pgsql/ /usr/local/pgsql/
@@ -1084,6 +1207,7 @@ COPY --from=wal2json-pg-build /usr/local/pgsql /usr/local/pgsql
COPY --from=pg-anon-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-ivm-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-partman-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-mooncake-build /usr/local/pgsql/ /usr/local/pgsql/
COPY pgxn/ pgxn/
RUN make -j $(getconf _NPROCESSORS_ONLN) \
@@ -1191,7 +1315,10 @@ RUN mold -run cargo build --locked --profile release-line-debug-size-lto --bin l
#########################################################################################
FROM quay.io/prometheuscommunity/postgres-exporter:v0.12.1 AS postgres-exporter
FROM burningalchemist/sql_exporter:0.13 AS sql-exporter
# Keep the version the same as in build-tools.Dockerfile and
# test_runner/regress/test_compute_metrics.py.
FROM burningalchemist/sql_exporter:0.13.1 AS sql-exporter
#########################################################################################
#
@@ -1247,6 +1374,7 @@ COPY --from=unit-pg-build /postgresql-unit.tar.gz /ext-src/
COPY --from=vector-pg-build /pgvector.tar.gz /ext-src/
COPY --from=vector-pg-build /pgvector.patch /ext-src/
COPY --from=pgjwt-pg-build /pgjwt.tar.gz /ext-src
#COPY --from=pgrag-pg-build /usr/local/pgsql/ /usr/local/pgsql/
#COPY --from=pg-jsonschema-pg-build /home/nonroot/pg_jsonschema.tar.gz /ext-src
#COPY --from=pg-graphql-pg-build /home/nonroot/pg_graphql.tar.gz /ext-src
#COPY --from=pg-tiktoken-pg-build /home/nonroot/pg_tiktoken.tar.gz /ext-src
@@ -1343,6 +1471,8 @@ RUN mkdir -p /etc/local_proxy && chown postgres:postgres /etc/local_proxy
COPY --from=postgres-exporter /bin/postgres_exporter /bin/postgres_exporter
COPY --from=sql-exporter /bin/sql_exporter /bin/sql_exporter
COPY --chown=postgres compute/etc/postgres_exporter.yml /etc/postgres_exporter.yml
COPY --from=sql_exporter_preprocessor --chmod=0644 /home/nonroot/compute/etc/sql_exporter.yml /etc/sql_exporter.yml
COPY --from=sql_exporter_preprocessor --chmod=0644 /home/nonroot/compute/etc/neon_collector.yml /etc/neon_collector.yml
COPY --from=sql_exporter_preprocessor --chmod=0644 /home/nonroot/compute/etc/sql_exporter_autoscaling.yml /etc/sql_exporter_autoscaling.yml

View File

@@ -3,6 +3,7 @@
metrics: [
import 'sql_exporter/checkpoints_req.libsonnet',
import 'sql_exporter/checkpoints_timed.libsonnet',
import 'sql_exporter/compute_backpressure_throttling_seconds.libsonnet',
import 'sql_exporter/compute_current_lsn.libsonnet',
import 'sql_exporter/compute_logical_snapshot_files.libsonnet',
import 'sql_exporter/compute_receive_lsn.libsonnet',

View File

View File

@@ -1,4 +1,4 @@
function(collector_name, collector_file, application_name='sql_exporter') {
function(collector_name, collector_file, connection_string) {
// Configuration for sql_exporter for autoscaling-agent
// Global defaults.
global: {
@@ -23,7 +23,7 @@ function(collector_name, collector_file, application_name='sql_exporter') {
target: {
// Data source name always has a URI schema that matches the driver name. In some cases (e.g. MySQL)
// the schema gets dropped or replaced to match the driver expected DSN format.
data_source_name: std.format('postgresql://cloud_admin@127.0.0.1:5432/postgres?sslmode=disable&application_name=%s', [application_name]),
data_source_name: connection_string,
// Collectors (referenced by name) to execute on the target.
// Glob patterns are supported (see <https://pkg.go.dev/path/filepath#Match> for syntax).

View File

@@ -0,0 +1,10 @@
{
metric_name: 'compute_backpressure_throttling_seconds',
type: 'gauge',
help: 'Time compute has spent throttled',
key_labels: null,
values: [
'throttled',
],
query: importstr 'sql_exporter/compute_backpressure_throttling_seconds.sql',
}

View File

@@ -0,0 +1 @@
SELECT (neon.backpressure_throttling_time()::float8 / 1000000) AS throttled;

View File

@@ -1,3 +1,45 @@
commit 00aa659afc9c7336ab81036edec3017168aabf40
Author: Heikki Linnakangas <heikki@neon.tech>
Date: Tue Nov 12 16:59:19 2024 +0200
Temporarily disable test that depends on timezone
diff --git a/tests/expected/generalization.out b/tests/expected/generalization.out
index 23ef5fa..9e60deb 100644
--- a/ext-src/pg_anon-src/tests/expected/generalization.out
+++ b/ext-src/pg_anon-src/tests/expected/generalization.out
@@ -284,12 +284,9 @@ SELECT anon.generalize_tstzrange('19041107','century');
["Tue Jan 01 00:00:00 1901 PST","Mon Jan 01 00:00:00 2001 PST")
(1 row)
-SELECT anon.generalize_tstzrange('19041107','millennium');
- generalize_tstzrange
------------------------------------------------------------------
- ["Thu Jan 01 00:00:00 1001 PST","Mon Jan 01 00:00:00 2001 PST")
-(1 row)
-
+-- temporarily disabled, see:
+-- https://gitlab.com/dalibo/postgresql_anonymizer/-/commit/199f0a392b37c59d92ae441fb8f037e094a11a52#note_2148017485
+--SELECT anon.generalize_tstzrange('19041107','millennium');
-- generalize_daterange
SELECT anon.generalize_daterange('19041107');
generalize_daterange
diff --git a/tests/sql/generalization.sql b/tests/sql/generalization.sql
index b868344..b4fc977 100644
--- a/ext-src/pg_anon-src/tests/sql/generalization.sql
+++ b/ext-src/pg_anon-src/tests/sql/generalization.sql
@@ -61,7 +61,9 @@ SELECT anon.generalize_tstzrange('19041107','month');
SELECT anon.generalize_tstzrange('19041107','year');
SELECT anon.generalize_tstzrange('19041107','decade');
SELECT anon.generalize_tstzrange('19041107','century');
-SELECT anon.generalize_tstzrange('19041107','millennium');
+-- temporarily disabled, see:
+-- https://gitlab.com/dalibo/postgresql_anonymizer/-/commit/199f0a392b37c59d92ae441fb8f037e094a11a52#note_2148017485
+--SELECT anon.generalize_tstzrange('19041107','millennium');
-- generalize_daterange
SELECT anon.generalize_daterange('19041107');
commit 7dd414ee75f2875cffb1d6ba474df1f135a6fc6f
Author: Alexey Masterov <alexeymasterov@neon.tech>
Date: Fri May 31 06:34:26 2024 +0000

View File

@@ -18,7 +18,7 @@ commands:
- name: pgbouncer
user: postgres
sysvInitAction: respawn
shell: '/usr/local/bin/pgbouncer /etc/pgbouncer.ini'
shell: '/usr/local/bin/pgbouncer /etc/pgbouncer.ini 2>&1 > /dev/virtio-ports/tech.neon.log.0'
- name: local_proxy
user: postgres
sysvInitAction: respawn
@@ -26,7 +26,7 @@ commands:
- name: postgres-exporter
user: nobody
sysvInitAction: respawn
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter" /bin/postgres_exporter'
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter" /bin/postgres_exporter --config.file=/etc/postgres_exporter.yml'
- name: sql-exporter
user: nobody
sysvInitAction: respawn

View File

@@ -18,7 +18,7 @@ commands:
- name: pgbouncer
user: postgres
sysvInitAction: respawn
shell: '/usr/local/bin/pgbouncer /etc/pgbouncer.ini'
shell: '/usr/local/bin/pgbouncer /etc/pgbouncer.ini 2>&1 > /dev/virtio-ports/tech.neon.log.0'
- name: local_proxy
user: postgres
sysvInitAction: respawn
@@ -26,7 +26,7 @@ commands:
- name: postgres-exporter
user: nobody
sysvInitAction: respawn
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter" /bin/postgres_exporter'
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter" /bin/postgres_exporter --config.file=/etc/postgres_exporter.yml'
- name: sql-exporter
user: nobody
sysvInitAction: respawn

View File

@@ -18,9 +18,11 @@ clap.workspace = true
flate2.workspace = true
futures.workspace = true
hyper0 = { workspace = true, features = ["full"] }
metrics.workspace = true
nix.workspace = true
notify.workspace = true
num_cpus.workspace = true
once_cell.workspace = true
opentelemetry.workspace = true
opentelemetry_sdk.workspace = true
postgres.workspace = true
@@ -39,6 +41,7 @@ tracing-subscriber.workspace = true
tracing-utils.workspace = true
thiserror.workspace = true
url.workspace = true
prometheus.workspace = true
compute_api.workspace = true
utils.workspace = true

View File

@@ -1,7 +1,6 @@
use std::collections::HashMap;
use std::env;
use std::fs;
use std::io::BufRead;
use std::os::unix::fs::{symlink, PermissionsExt};
use std::path::Path;
use std::process::{Command, Stdio};
@@ -365,48 +364,43 @@ impl ComputeNode {
let pageserver_connect_micros = start_time.elapsed().as_micros() as u64;
let basebackup_cmd = match lsn {
// HACK We don't use compression on first start (Lsn(0)) because there's no API for it
Lsn(0) => format!("basebackup {} {}", spec.tenant_id, spec.timeline_id),
_ => format!(
"basebackup {} {} {} --gzip",
spec.tenant_id, spec.timeline_id, lsn
),
Lsn(0) => {
if spec.spec.mode != ComputeMode::Primary {
format!(
"basebackup {} {} --gzip --replica",
spec.tenant_id, spec.timeline_id
)
} else {
format!("basebackup {} {} --gzip", spec.tenant_id, spec.timeline_id)
}
}
_ => {
if spec.spec.mode != ComputeMode::Primary {
format!(
"basebackup {} {} {} --gzip --replica",
spec.tenant_id, spec.timeline_id, lsn
)
} else {
format!(
"basebackup {} {} {} --gzip",
spec.tenant_id, spec.timeline_id, lsn
)
}
}
};
let copyreader = client.copy_out(basebackup_cmd.as_str())?;
let mut measured_reader = MeasuredReader::new(copyreader);
// Check the magic number to see if it's a gzip or not. Even though
// we might explicitly ask for gzip, an old pageserver with no implementation
// of gzip compression might send us uncompressed data. After some time
// passes we can assume all pageservers know how to compress and we can
// delete this check.
//
// If the data is not gzip, it will be tar. It will not be mistakenly
// recognized as gzip because tar starts with an ascii encoding of a filename,
// and 0x1f and 0x8b are unlikely first characters for any filename. Moreover,
// we send the "global" directory first from the pageserver, so it definitely
// won't be recognized as gzip.
let mut bufreader = std::io::BufReader::new(&mut measured_reader);
let gzip = {
let peek = bufreader.fill_buf().unwrap();
peek[0] == 0x1f && peek[1] == 0x8b
};
// Read the archive directly from the `CopyOutReader`
//
// Set `ignore_zeros` so that unpack() reads all the Copy data and
// doesn't stop at the end-of-archive marker. Otherwise, if the server
// sends an Error after finishing the tarball, we will not notice it.
if gzip {
let mut ar = tar::Archive::new(flate2::read::GzDecoder::new(&mut bufreader));
ar.set_ignore_zeros(true);
ar.unpack(&self.pgdata)?;
} else {
let mut ar = tar::Archive::new(&mut bufreader);
ar.set_ignore_zeros(true);
ar.unpack(&self.pgdata)?;
};
let mut ar = tar::Archive::new(flate2::read::GzDecoder::new(&mut bufreader));
ar.set_ignore_zeros(true);
ar.unpack(&self.pgdata)?;
// Report metrics
let mut state = self.state.lock().unwrap();

View File

@@ -73,6 +73,19 @@ pub fn write_postgres_conf(
)?;
}
// Locales
if cfg!(target_os = "macos") {
writeln!(file, "lc_messages='C'")?;
writeln!(file, "lc_monetary='C'")?;
writeln!(file, "lc_time='C'")?;
writeln!(file, "lc_numeric='C'")?;
} else {
writeln!(file, "lc_messages='C.UTF-8'")?;
writeln!(file, "lc_monetary='C.UTF-8'")?;
writeln!(file, "lc_time='C.UTF-8'")?;
writeln!(file, "lc_numeric='C.UTF-8'")?;
}
match spec.mode {
ComputeMode::Primary => {}
ComputeMode::Static(lsn) => {

View File

@@ -9,6 +9,7 @@ use crate::catalog::SchemaDumpError;
use crate::catalog::{get_database_schema, get_dbs_and_roles};
use crate::compute::forward_termination_signal;
use crate::compute::{ComputeNode, ComputeState, ParsedSpec};
use crate::installed_extensions;
use compute_api::requests::{ConfigurationRequest, ExtensionInstallRequest, SetRoleGrantsRequest};
use compute_api::responses::{
ComputeStatus, ComputeStatusResponse, ExtensionInstallResult, GenericAPIError,
@@ -19,6 +20,8 @@ use anyhow::Result;
use hyper::header::CONTENT_TYPE;
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Method, Request, Response, Server, StatusCode};
use metrics::Encoder;
use metrics::TextEncoder;
use tokio::task;
use tracing::{debug, error, info, warn};
use tracing_utils::http::OtelName;
@@ -65,6 +68,28 @@ async fn routes(req: Request<Body>, compute: &Arc<ComputeNode>) -> Response<Body
Response::new(Body::from(serde_json::to_string(&metrics).unwrap()))
}
// Prometheus metrics
(&Method::GET, "/metrics") => {
debug!("serving /metrics GET request");
let mut buffer = vec![];
let metrics = installed_extensions::collect();
let encoder = TextEncoder::new();
encoder.encode(&metrics, &mut buffer).unwrap();
match Response::builder()
.status(StatusCode::OK)
.header(CONTENT_TYPE, encoder.format_type())
.body(Body::from(buffer))
{
Ok(response) => response,
Err(err) => {
let msg = format!("error handling /metrics request: {err}");
error!(msg);
render_json_error(&msg, StatusCode::INTERNAL_SERVER_ERROR)
}
}
}
// Collect Postgres current usage insights
(&Method::GET, "/insights") => {
info!("serving /insights GET request");

View File

@@ -37,6 +37,21 @@ paths:
schema:
$ref: "#/components/schemas/ComputeMetrics"
/metrics
get:
tags:
- Info
summary: Get compute node metrics in text format.
description: ""
operationId: getComputeMetrics
responses:
200:
description: ComputeMetrics
content:
text/plain:
schema:
type: string
description: Metrics in text format.
/insights:
get:
tags:

View File

@@ -1,4 +1,5 @@
use compute_api::responses::{InstalledExtension, InstalledExtensions};
use metrics::proto::MetricFamily;
use std::collections::HashMap;
use std::collections::HashSet;
use tracing::info;
@@ -8,6 +9,10 @@ use anyhow::Result;
use postgres::{Client, NoTls};
use tokio::task;
use metrics::core::Collector;
use metrics::{register_uint_gauge_vec, UIntGaugeVec};
use once_cell::sync::Lazy;
/// We don't reuse get_existing_dbs() just for code clarity
/// and to make database listing query here more explicit.
///
@@ -59,6 +64,12 @@ pub async fn get_installed_extensions(connstr: Url) -> Result<InstalledExtension
for (extname, v) in extensions.iter() {
let version = v.to_string();
// increment the number of databases where the version of extension is installed
INSTALLED_EXTENSIONS
.with_label_values(&[extname, &version])
.inc();
extensions_map
.entry(extname.to_string())
.and_modify(|e| {
@@ -74,9 +85,11 @@ pub async fn get_installed_extensions(connstr: Url) -> Result<InstalledExtension
}
}
Ok(InstalledExtensions {
let res = InstalledExtensions {
extensions: extensions_map.values().cloned().collect(),
})
};
Ok(res)
})
.await?
}
@@ -97,6 +110,18 @@ pub fn get_installed_extensions_sync(connstr: Url) -> Result<()> {
"[NEON_EXT_STAT] {}",
serde_json::to_string(&result).expect("failed to serialize extensions list")
);
Ok(())
}
static INSTALLED_EXTENSIONS: Lazy<UIntGaugeVec> = Lazy::new(|| {
register_uint_gauge_vec!(
"installed_extensions",
"Number of databases where the version of extension is installed",
&["extension_name", "version"]
)
.expect("failed to define a metric")
});
pub fn collect() -> Vec<MetricFamily> {
INSTALLED_EXTENSIONS.collect()
}

View File

@@ -944,6 +944,9 @@ fn handle_init(args: &InitCmdArgs) -> anyhow::Result<LocalEnv> {
pg_auth_type: AuthType::Trust,
http_auth_type: AuthType::Trust,
other: Default::default(),
// Typical developer machines use disks with slow fsync, and we don't care
// about data integrity: disable disk syncs.
no_sync: true,
}
})
.collect(),
@@ -1073,10 +1076,10 @@ async fn handle_tenant(subcmd: &TenantCmd, env: &mut local_env::LocalEnv) -> any
tenant_id,
TimelineCreateRequest {
new_timeline_id,
ancestor_timeline_id: None,
ancestor_start_lsn: None,
existing_initdb_timeline_id: None,
pg_version: Some(args.pg_version),
mode: pageserver_api::models::TimelineCreateRequestMode::Bootstrap {
existing_initdb_timeline_id: None,
pg_version: Some(args.pg_version),
},
},
)
.await?;
@@ -1133,10 +1136,10 @@ async fn handle_timeline(cmd: &TimelineCmd, env: &mut local_env::LocalEnv) -> Re
let storage_controller = StorageController::from_env(env);
let create_req = TimelineCreateRequest {
new_timeline_id,
ancestor_timeline_id: None,
existing_initdb_timeline_id: None,
ancestor_start_lsn: None,
pg_version: Some(args.pg_version),
mode: pageserver_api::models::TimelineCreateRequestMode::Bootstrap {
existing_initdb_timeline_id: None,
pg_version: Some(args.pg_version),
},
};
let timeline_info = storage_controller
.tenant_timeline_create(tenant_id, create_req)
@@ -1189,10 +1192,11 @@ async fn handle_timeline(cmd: &TimelineCmd, env: &mut local_env::LocalEnv) -> Re
let storage_controller = StorageController::from_env(env);
let create_req = TimelineCreateRequest {
new_timeline_id,
ancestor_timeline_id: Some(ancestor_timeline_id),
existing_initdb_timeline_id: None,
ancestor_start_lsn: start_lsn,
pg_version: None,
mode: pageserver_api::models::TimelineCreateRequestMode::Branch {
ancestor_timeline_id,
ancestor_start_lsn: start_lsn,
pg_version: None,
},
};
let timeline_info = storage_controller
.tenant_timeline_create(tenant_id, create_req)

View File

@@ -225,6 +225,7 @@ pub struct PageServerConf {
pub listen_http_addr: String,
pub pg_auth_type: AuthType,
pub http_auth_type: AuthType,
pub no_sync: bool,
}
impl Default for PageServerConf {
@@ -235,6 +236,7 @@ impl Default for PageServerConf {
listen_http_addr: String::new(),
pg_auth_type: AuthType::Trust,
http_auth_type: AuthType::Trust,
no_sync: false,
}
}
}
@@ -249,6 +251,8 @@ pub struct NeonLocalInitPageserverConf {
pub listen_http_addr: String,
pub pg_auth_type: AuthType,
pub http_auth_type: AuthType,
#[serde(default, skip_serializing_if = "std::ops::Not::not")]
pub no_sync: bool,
#[serde(flatten)]
pub other: HashMap<String, toml::Value>,
}
@@ -261,6 +265,7 @@ impl From<&NeonLocalInitPageserverConf> for PageServerConf {
listen_http_addr,
pg_auth_type,
http_auth_type,
no_sync,
other: _,
} = conf;
Self {
@@ -269,6 +274,7 @@ impl From<&NeonLocalInitPageserverConf> for PageServerConf {
listen_http_addr: listen_http_addr.clone(),
pg_auth_type: *pg_auth_type,
http_auth_type: *http_auth_type,
no_sync: *no_sync,
}
}
}
@@ -569,6 +575,8 @@ impl LocalEnv {
listen_http_addr: String,
pg_auth_type: AuthType,
http_auth_type: AuthType,
#[serde(default)]
no_sync: bool,
}
let config_toml_path = dentry.path().join("pageserver.toml");
let config_toml: PageserverConfigTomlSubset = toml_edit::de::from_str(
@@ -591,6 +599,7 @@ impl LocalEnv {
listen_http_addr,
pg_auth_type,
http_auth_type,
no_sync,
} = config_toml;
let IdentityTomlSubset {
id: identity_toml_id,
@@ -607,6 +616,7 @@ impl LocalEnv {
listen_http_addr,
pg_auth_type,
http_auth_type,
no_sync,
};
pageservers.push(conf);
}

View File

@@ -17,7 +17,7 @@ use std::time::Duration;
use anyhow::{bail, Context};
use camino::Utf8PathBuf;
use pageserver_api::models::{self, AuxFilePolicy, TenantInfo, TimelineInfo};
use pageserver_api::models::{self, TenantInfo, TimelineInfo};
use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api;
use postgres_backend::AuthType;
@@ -273,6 +273,7 @@ impl PageServerNode {
)
})?;
let args = vec!["-D", datadir_path_str];
background_process::start_process(
"pageserver",
&datadir,
@@ -334,17 +335,20 @@ impl PageServerNode {
checkpoint_distance: settings
.remove("checkpoint_distance")
.map(|x| x.parse::<u64>())
.transpose()?,
.transpose()
.context("Failed to parse 'checkpoint_distance' as an integer")?,
checkpoint_timeout: settings.remove("checkpoint_timeout").map(|x| x.to_string()),
compaction_target_size: settings
.remove("compaction_target_size")
.map(|x| x.parse::<u64>())
.transpose()?,
.transpose()
.context("Failed to parse 'compaction_target_size' as an integer")?,
compaction_period: settings.remove("compaction_period").map(|x| x.to_string()),
compaction_threshold: settings
.remove("compaction_threshold")
.map(|x| x.parse::<usize>())
.transpose()?,
.transpose()
.context("Failed to parse 'compaction_threshold' as an integer")?,
compaction_algorithm: settings
.remove("compaction_algorithm")
.map(serde_json::from_str)
@@ -353,16 +357,19 @@ impl PageServerNode {
gc_horizon: settings
.remove("gc_horizon")
.map(|x| x.parse::<u64>())
.transpose()?,
.transpose()
.context("Failed to parse 'gc_horizon' as an integer")?,
gc_period: settings.remove("gc_period").map(|x| x.to_string()),
image_creation_threshold: settings
.remove("image_creation_threshold")
.map(|x| x.parse::<usize>())
.transpose()?,
.transpose()
.context("Failed to parse 'image_creation_threshold' as non zero integer")?,
image_layer_creation_check_threshold: settings
.remove("image_layer_creation_check_threshold")
.map(|x| x.parse::<u8>())
.transpose()?,
.transpose()
.context("Failed to parse 'image_creation_check_threshold' as integer")?,
pitr_interval: settings.remove("pitr_interval").map(|x| x.to_string()),
walreceiver_connect_timeout: settings
.remove("walreceiver_connect_timeout")
@@ -399,15 +406,15 @@ impl PageServerNode {
.map(serde_json::from_str)
.transpose()
.context("parse `timeline_get_throttle` from json")?,
switch_aux_file_policy: settings
.remove("switch_aux_file_policy")
.map(|x| x.parse::<AuxFilePolicy>())
.transpose()
.context("Failed to parse 'switch_aux_file_policy'")?,
lsn_lease_length: settings.remove("lsn_lease_length").map(|x| x.to_string()),
lsn_lease_length_for_ts: settings
.remove("lsn_lease_length_for_ts")
.map(|x| x.to_string()),
timeline_offloading: settings
.remove("timeline_offloading")
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'timeline_offloading' as bool")?,
};
if !settings.is_empty() {
bail!("Unrecognized tenant settings: {settings:?}")
@@ -419,102 +426,9 @@ impl PageServerNode {
pub async fn tenant_config(
&self,
tenant_id: TenantId,
mut settings: HashMap<&str, &str>,
settings: HashMap<&str, &str>,
) -> anyhow::Result<()> {
let config = {
// Braces to make the diff easier to read
models::TenantConfig {
checkpoint_distance: settings
.remove("checkpoint_distance")
.map(|x| x.parse::<u64>())
.transpose()
.context("Failed to parse 'checkpoint_distance' as an integer")?,
checkpoint_timeout: settings.remove("checkpoint_timeout").map(|x| x.to_string()),
compaction_target_size: settings
.remove("compaction_target_size")
.map(|x| x.parse::<u64>())
.transpose()
.context("Failed to parse 'compaction_target_size' as an integer")?,
compaction_period: settings.remove("compaction_period").map(|x| x.to_string()),
compaction_threshold: settings
.remove("compaction_threshold")
.map(|x| x.parse::<usize>())
.transpose()
.context("Failed to parse 'compaction_threshold' as an integer")?,
compaction_algorithm: settings
.remove("compactin_algorithm")
.map(serde_json::from_str)
.transpose()
.context("Failed to parse 'compaction_algorithm' json")?,
gc_horizon: settings
.remove("gc_horizon")
.map(|x| x.parse::<u64>())
.transpose()
.context("Failed to parse 'gc_horizon' as an integer")?,
gc_period: settings.remove("gc_period").map(|x| x.to_string()),
image_creation_threshold: settings
.remove("image_creation_threshold")
.map(|x| x.parse::<usize>())
.transpose()
.context("Failed to parse 'image_creation_threshold' as non zero integer")?,
image_layer_creation_check_threshold: settings
.remove("image_layer_creation_check_threshold")
.map(|x| x.parse::<u8>())
.transpose()
.context("Failed to parse 'image_creation_check_threshold' as integer")?,
pitr_interval: settings.remove("pitr_interval").map(|x| x.to_string()),
walreceiver_connect_timeout: settings
.remove("walreceiver_connect_timeout")
.map(|x| x.to_string()),
lagging_wal_timeout: settings
.remove("lagging_wal_timeout")
.map(|x| x.to_string()),
max_lsn_wal_lag: settings
.remove("max_lsn_wal_lag")
.map(|x| x.parse::<NonZeroU64>())
.transpose()
.context("Failed to parse 'max_lsn_wal_lag' as non zero integer")?,
eviction_policy: settings
.remove("eviction_policy")
.map(serde_json::from_str)
.transpose()
.context("Failed to parse 'eviction_policy' json")?,
min_resident_size_override: settings
.remove("min_resident_size_override")
.map(|x| x.parse::<u64>())
.transpose()
.context("Failed to parse 'min_resident_size_override' as an integer")?,
evictions_low_residence_duration_metric_threshold: settings
.remove("evictions_low_residence_duration_metric_threshold")
.map(|x| x.to_string()),
heatmap_period: settings.remove("heatmap_period").map(|x| x.to_string()),
lazy_slru_download: settings
.remove("lazy_slru_download")
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'lazy_slru_download' as bool")?,
timeline_get_throttle: settings
.remove("timeline_get_throttle")
.map(serde_json::from_str)
.transpose()
.context("parse `timeline_get_throttle` from json")?,
switch_aux_file_policy: settings
.remove("switch_aux_file_policy")
.map(|x| x.parse::<AuxFilePolicy>())
.transpose()
.context("Failed to parse 'switch_aux_file_policy'")?,
lsn_lease_length: settings.remove("lsn_lease_length").map(|x| x.to_string()),
lsn_lease_length_for_ts: settings
.remove("lsn_lease_length_for_ts")
.map(|x| x.to_string()),
}
};
if !settings.is_empty() {
bail!("Unrecognized tenant settings: {settings:?}")
}
let config = Self::parse_config(settings)?;
self.http_client
.tenant_config(&models::TenantConfigRequest { tenant_id, config })
.await?;
@@ -529,28 +443,6 @@ impl PageServerNode {
Ok(self.http_client.list_timelines(*tenant_shard_id).await?)
}
pub async fn timeline_create(
&self,
tenant_shard_id: TenantShardId,
new_timeline_id: TimelineId,
ancestor_start_lsn: Option<Lsn>,
ancestor_timeline_id: Option<TimelineId>,
pg_version: Option<u32>,
existing_initdb_timeline_id: Option<TimelineId>,
) -> anyhow::Result<TimelineInfo> {
let req = models::TimelineCreateRequest {
new_timeline_id,
ancestor_start_lsn,
ancestor_timeline_id,
pg_version,
existing_initdb_timeline_id,
};
Ok(self
.http_client
.timeline_create(tenant_shard_id, &req)
.await?)
}
/// Import a basebackup prepared using either:
/// a) `pg_basebackup -F tar`, or
/// b) The `fullbackup` pageserver endpoint

View File

@@ -111,6 +111,11 @@ enum Command {
#[arg(long)]
node: NodeId,
},
/// Cancel any ongoing reconciliation for this shard
TenantShardCancelReconcile {
#[arg(long)]
tenant_shard_id: TenantShardId,
},
/// Modify the pageserver tenant configuration of a tenant: this is the configuration structure
/// that is passed through to pageservers, and does not affect storage controller behavior.
TenantConfig {
@@ -535,6 +540,15 @@ async fn main() -> anyhow::Result<()> {
)
.await?;
}
Command::TenantShardCancelReconcile { tenant_shard_id } => {
storcon_client
.dispatch::<(), ()>(
Method::PUT,
format!("control/v1/tenant/{tenant_shard_id}/cancel_reconcile"),
None,
)
.await?;
}
Command::TenantConfig { tenant_id, config } => {
let tenant_conf = serde_json::from_str(&config)?;

View File

@@ -91,7 +91,7 @@ generating the basebackup by scanning the `REPL_ORIGIN_KEY_PREFIX` keyspace.
There are two places we need to read the aux files from the pageserver:
* On the write path, when the compute node adds an aux file to the pageserver, we will retrieve the key from the storage, append the file to the hashed key, and write it back. The current `get` API already supports that.
* We use the vectored get API to retrieve all aux files during generating the basebackup. Because we need to scan a sparse keyspace, we slightly modified the vectored get path. The vectorized API will attempt to retrieve every single key within the requested key range, and therefore, we modified it in a way that keys within `NON_INHERITED_SPARSE_RANGE` will not trigger missing key error.
* We use the vectored get API to retrieve all aux files during generating the basebackup. Because we need to scan a sparse keyspace, we slightly modified the vectored get path. The vectorized API used to always attempt to retrieve every single key within the requested key range, and therefore, we modified it in a way that keys within `NON_INHERITED_SPARSE_RANGE` will not trigger missing key error. Furthermore, as aux file reads usually need all layer files intersecting with that key range within the branch and cover a big keyspace, it incurs large overhead for tracking keyspaces that have not been read. Therefore, for sparse keyspaces, we [do not track](https://github.com/neondatabase/neon/pull/9631) `ummapped_keyspace`.
## Compaction and Image Layer Generation

View File

@@ -19,6 +19,7 @@ use once_cell::sync::Lazy;
use prometheus::core::{
Atomic, AtomicU64, Collector, GenericCounter, GenericCounterVec, GenericGauge, GenericGaugeVec,
};
pub use prometheus::local::LocalHistogram;
pub use prometheus::opts;
pub use prometheus::register;
pub use prometheus::Error;
@@ -109,6 +110,23 @@ static MAXRSS_KB: Lazy<IntGauge> = Lazy::new(|| {
pub const DISK_FSYNC_SECONDS_BUCKETS: &[f64] =
&[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0];
/// Constructs histogram buckets that are powers of two starting at 1 (i.e. 2^0), covering the end
/// points. For example, passing start=5,end=20 yields 4,8,16,32 as does start=4,end=32.
pub fn pow2_buckets(start: usize, end: usize) -> Vec<f64> {
assert_ne!(start, 0);
assert!(start <= end);
let start = match start.checked_next_power_of_two() {
Some(n) if n == start => n, // start already power of two
Some(n) => n >> 1, // power of two below start
None => panic!("start too large"),
};
let end = end.checked_next_power_of_two().expect("end too large");
std::iter::successors(Some(start), |n| n.checked_mul(2))
.take_while(|n| n <= &end)
.map(|n| n as f64)
.collect()
}
pub struct BuildInfo {
pub revision: &'static str,
pub build_tag: &'static str,
@@ -594,3 +612,67 @@ where
self.dec.collect_into(metadata, labels, name, &mut enc.0)
}
}
#[cfg(test)]
mod tests {
use super::*;
const POW2_BUCKETS_MAX: usize = 1 << (usize::BITS - 1);
#[test]
fn pow2_buckets_cases() {
assert_eq!(pow2_buckets(1, 1), vec![1.0]);
assert_eq!(pow2_buckets(1, 2), vec![1.0, 2.0]);
assert_eq!(pow2_buckets(1, 3), vec![1.0, 2.0, 4.0]);
assert_eq!(pow2_buckets(1, 4), vec![1.0, 2.0, 4.0]);
assert_eq!(pow2_buckets(1, 5), vec![1.0, 2.0, 4.0, 8.0]);
assert_eq!(pow2_buckets(1, 6), vec![1.0, 2.0, 4.0, 8.0]);
assert_eq!(pow2_buckets(1, 7), vec![1.0, 2.0, 4.0, 8.0]);
assert_eq!(pow2_buckets(1, 8), vec![1.0, 2.0, 4.0, 8.0]);
assert_eq!(
pow2_buckets(1, 200),
vec![1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0]
);
assert_eq!(pow2_buckets(1, 8), vec![1.0, 2.0, 4.0, 8.0]);
assert_eq!(pow2_buckets(2, 8), vec![2.0, 4.0, 8.0]);
assert_eq!(pow2_buckets(3, 8), vec![2.0, 4.0, 8.0]);
assert_eq!(pow2_buckets(4, 8), vec![4.0, 8.0]);
assert_eq!(pow2_buckets(5, 8), vec![4.0, 8.0]);
assert_eq!(pow2_buckets(6, 8), vec![4.0, 8.0]);
assert_eq!(pow2_buckets(7, 8), vec![4.0, 8.0]);
assert_eq!(pow2_buckets(8, 8), vec![8.0]);
assert_eq!(pow2_buckets(20, 200), vec![16.0, 32.0, 64.0, 128.0, 256.0]);
// Largest valid values.
assert_eq!(
pow2_buckets(1, POW2_BUCKETS_MAX).len(),
usize::BITS as usize
);
assert_eq!(pow2_buckets(POW2_BUCKETS_MAX, POW2_BUCKETS_MAX).len(), 1);
}
#[test]
#[should_panic]
fn pow2_buckets_zero_start() {
pow2_buckets(0, 1);
}
#[test]
#[should_panic]
fn pow2_buckets_end_lt_start() {
pow2_buckets(2, 1);
}
#[test]
#[should_panic]
fn pow2_buckets_end_overflow_min() {
pow2_buckets(1, POW2_BUCKETS_MAX + 1);
}
#[test]
#[should_panic]
fn pow2_buckets_end_overflow_max() {
pow2_buckets(1, usize::MAX);
}
}

View File

@@ -64,6 +64,7 @@ pub struct ConfigToml {
#[serde(with = "humantime_serde")]
pub wal_redo_timeout: Duration,
pub superuser: String,
pub locale: String,
pub page_cache_size: usize,
pub max_file_descriptors: usize,
pub pg_distrib_dir: Option<Utf8PathBuf>,
@@ -106,6 +107,8 @@ pub struct ConfigToml {
pub ephemeral_bytes_per_memory_kb: usize,
pub l0_flush: Option<crate::models::L0FlushConfig>,
pub virtual_file_io_mode: Option<crate::models::virtual_file::IoMode>,
#[serde(skip_serializing_if = "Option::is_none")]
pub no_sync: Option<bool>,
}
#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
@@ -250,12 +253,6 @@ pub struct TenantConfigToml {
// Expresed in multiples of checkpoint distance.
pub image_layer_creation_check_threshold: u8,
/// Switch to a new aux file policy. Switching this flag requires the user has not written any aux file into
/// the storage before, and this flag cannot be switched back. Otherwise there will be data corruptions.
/// There is a `last_aux_file_policy` flag which gets persisted in `index_part.json` once the first aux
/// file is written.
pub switch_aux_file_policy: crate::models::AuxFilePolicy,
/// The length for an explicit LSN lease request.
/// Layers needed to reconstruct pages at LSN will not be GC-ed during this interval.
#[serde(with = "humantime_serde")]
@@ -265,6 +262,10 @@ pub struct TenantConfigToml {
/// Layers needed to reconstruct pages at LSN will not be GC-ed during this interval.
#[serde(with = "humantime_serde")]
pub lsn_lease_length_for_ts: Duration,
/// Enable auto-offloading of timelines.
/// (either this flag or the pageserver-global one need to be set)
pub timeline_offloading: bool,
}
pub mod defaults {
@@ -276,6 +277,11 @@ pub mod defaults {
pub const DEFAULT_WAL_REDO_TIMEOUT: &str = "60 s";
pub const DEFAULT_SUPERUSER: &str = "cloud_admin";
pub const DEFAULT_LOCALE: &str = if cfg!(target_os = "macos") {
"C"
} else {
"C.UTF-8"
};
pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192;
pub const DEFAULT_MAX_FILE_DESCRIPTORS: usize = 100;
@@ -326,6 +332,7 @@ impl Default for ConfigToml {
wal_redo_timeout: (humantime::parse_duration(DEFAULT_WAL_REDO_TIMEOUT)
.expect("cannot parse default wal redo timeout")),
superuser: (DEFAULT_SUPERUSER.to_string()),
locale: DEFAULT_LOCALE.to_string(),
page_cache_size: (DEFAULT_PAGE_CACHE_SIZE),
max_file_descriptors: (DEFAULT_MAX_FILE_DESCRIPTORS),
pg_distrib_dir: None, // Utf8PathBuf::from("./pg_install"), // TODO: formely, this was std::env::current_dir()
@@ -391,6 +398,7 @@ impl Default for ConfigToml {
l0_flush: None,
virtual_file_io_mode: None,
tenant_config: TenantConfigToml::default(),
no_sync: None,
}
}
}
@@ -475,9 +483,9 @@ impl Default for TenantConfigToml {
lazy_slru_download: false,
timeline_get_throttle: crate::models::ThrottleConfig::disabled(),
image_layer_creation_check_threshold: DEFAULT_IMAGE_LAYER_CREATION_CHECK_THRESHOLD,
switch_aux_file_policy: crate::models::AuxFilePolicy::default_tenant_config(),
lsn_lease_length: LsnLease::DEFAULT_LENGTH,
lsn_lease_length_for_ts: LsnLease::DEFAULT_LENGTH_FOR_TS,
timeline_offloading: false,
}
}
}

View File

@@ -5,9 +5,11 @@ pub mod controller_api;
pub mod key;
pub mod keyspace;
pub mod models;
pub mod record;
pub mod reltag;
pub mod shard;
/// Public API types
pub mod upcall_api;
pub mod value;
pub mod config;

View File

@@ -10,7 +10,6 @@ use std::{
io::{BufRead, Read},
num::{NonZeroU32, NonZeroU64, NonZeroUsize},
str::FromStr,
sync::atomic::AtomicUsize,
time::{Duration, SystemTime},
};
@@ -211,13 +210,30 @@ pub enum TimelineState {
#[derive(Serialize, Deserialize, Clone)]
pub struct TimelineCreateRequest {
pub new_timeline_id: TimelineId,
#[serde(default)]
pub ancestor_timeline_id: Option<TimelineId>,
#[serde(default)]
pub existing_initdb_timeline_id: Option<TimelineId>,
#[serde(default)]
pub ancestor_start_lsn: Option<Lsn>,
pub pg_version: Option<u32>,
#[serde(flatten)]
pub mode: TimelineCreateRequestMode,
}
#[derive(Serialize, Deserialize, Clone)]
#[serde(untagged)]
pub enum TimelineCreateRequestMode {
Branch {
ancestor_timeline_id: TimelineId,
#[serde(default)]
ancestor_start_lsn: Option<Lsn>,
// TODO: cplane sets this, but, the branching code always
// inherits the ancestor's pg_version. Earlier code wasn't
// using a flattened enum, so, it was an accepted field, and
// we continue to accept it by having it here.
pg_version: Option<u32>,
},
// NB: Bootstrap is all-optional, and thus the serde(untagged) will cause serde to stop at Bootstrap.
// (serde picks the first matching enum variant, in declaration order).
Bootstrap {
#[serde(default)]
existing_initdb_timeline_id: Option<TimelineId>,
pg_version: Option<u32>,
},
}
#[derive(Serialize, Deserialize, Clone)]
@@ -292,9 +308,9 @@ pub struct TenantConfig {
pub lazy_slru_download: Option<bool>,
pub timeline_get_throttle: Option<ThrottleConfig>,
pub image_layer_creation_check_threshold: Option<u8>,
pub switch_aux_file_policy: Option<AuxFilePolicy>,
pub lsn_lease_length: Option<String>,
pub lsn_lease_length_for_ts: Option<String>,
pub timeline_offloading: Option<bool>,
}
/// The policy for the aux file storage.
@@ -333,68 +349,6 @@ pub enum AuxFilePolicy {
CrossValidation,
}
impl AuxFilePolicy {
pub fn is_valid_migration_path(from: Option<Self>, to: Self) -> bool {
matches!(
(from, to),
(None, _) | (Some(AuxFilePolicy::CrossValidation), AuxFilePolicy::V2)
)
}
/// If a tenant writes aux files without setting `switch_aux_policy`, this value will be used.
pub fn default_tenant_config() -> Self {
Self::V2
}
}
/// The aux file policy memory flag. Users can store `Option<AuxFilePolicy>` into this atomic flag. 0 == unspecified.
pub struct AtomicAuxFilePolicy(AtomicUsize);
impl AtomicAuxFilePolicy {
pub fn new(policy: Option<AuxFilePolicy>) -> Self {
Self(AtomicUsize::new(
policy.map(AuxFilePolicy::to_usize).unwrap_or_default(),
))
}
pub fn load(&self) -> Option<AuxFilePolicy> {
match self.0.load(std::sync::atomic::Ordering::Acquire) {
0 => None,
other => Some(AuxFilePolicy::from_usize(other)),
}
}
pub fn store(&self, policy: Option<AuxFilePolicy>) {
self.0.store(
policy.map(AuxFilePolicy::to_usize).unwrap_or_default(),
std::sync::atomic::Ordering::Release,
);
}
}
impl AuxFilePolicy {
pub fn to_usize(self) -> usize {
match self {
Self::V1 => 1,
Self::CrossValidation => 2,
Self::V2 => 3,
}
}
pub fn try_from_usize(this: usize) -> Option<Self> {
match this {
1 => Some(Self::V1),
2 => Some(Self::CrossValidation),
3 => Some(Self::V2),
_ => None,
}
}
pub fn from_usize(this: usize) -> Self {
Self::try_from_usize(this).unwrap()
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(tag = "kind")]
pub enum EvictionPolicy {
@@ -1051,6 +1005,12 @@ pub mod virtual_file {
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ScanDisposableKeysResponse {
pub disposable_count: usize,
pub not_disposable_count: usize,
}
// Wrapped in libpq CopyData
#[derive(PartialEq, Eq, Debug)]
pub enum PagestreamFeMessage {
@@ -1610,71 +1570,6 @@ mod tests {
}
}
#[test]
fn test_aux_file_migration_path() {
assert!(AuxFilePolicy::is_valid_migration_path(
None,
AuxFilePolicy::V1
));
assert!(AuxFilePolicy::is_valid_migration_path(
None,
AuxFilePolicy::V2
));
assert!(AuxFilePolicy::is_valid_migration_path(
None,
AuxFilePolicy::CrossValidation
));
// Self-migration is not a valid migration path, and the caller should handle it by itself.
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V1),
AuxFilePolicy::V1
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V2),
AuxFilePolicy::V2
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::CrossValidation),
AuxFilePolicy::CrossValidation
));
// Migrations not allowed
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::CrossValidation),
AuxFilePolicy::V1
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V1),
AuxFilePolicy::V2
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V2),
AuxFilePolicy::V1
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V2),
AuxFilePolicy::CrossValidation
));
assert!(!AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::V1),
AuxFilePolicy::CrossValidation
));
// Migrations allowed
assert!(AuxFilePolicy::is_valid_migration_path(
Some(AuxFilePolicy::CrossValidation),
AuxFilePolicy::V2
));
}
#[test]
fn test_aux_parse() {
assert_eq!(AuxFilePolicy::from_str("V2").unwrap(), AuxFilePolicy::V2);
assert_eq!(AuxFilePolicy::from_str("v2").unwrap(), AuxFilePolicy::V2);
assert_eq!(
AuxFilePolicy::from_str("cross-validation").unwrap(),
AuxFilePolicy::CrossValidation
);
}
#[test]
fn test_image_compression_algorithm_parsing() {
use ImageCompressionAlgorithm::*;

View File

@@ -0,0 +1,113 @@
//! This module defines the WAL record format used within the pageserver.
use bytes::Bytes;
use postgres_ffi::walrecord::{describe_postgres_wal_record, MultiXactMember};
use postgres_ffi::{MultiXactId, MultiXactOffset, TimestampTz, TransactionId};
use serde::{Deserialize, Serialize};
use utils::bin_ser::DeserializeError;
/// Each update to a page is represented by a NeonWalRecord. It can be a wrapper
/// around a PostgreSQL WAL record, or a custom neon-specific "record".
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub enum NeonWalRecord {
/// Native PostgreSQL WAL record
Postgres { will_init: bool, rec: Bytes },
/// Clear bits in heap visibility map. ('flags' is bitmap of bits to clear)
ClearVisibilityMapFlags {
new_heap_blkno: Option<u32>,
old_heap_blkno: Option<u32>,
flags: u8,
},
/// Mark transaction IDs as committed on a CLOG page
ClogSetCommitted {
xids: Vec<TransactionId>,
timestamp: TimestampTz,
},
/// Mark transaction IDs as aborted on a CLOG page
ClogSetAborted { xids: Vec<TransactionId> },
/// Extend multixact offsets SLRU
MultixactOffsetCreate {
mid: MultiXactId,
moff: MultiXactOffset,
},
/// Extend multixact members SLRU.
MultixactMembersCreate {
moff: MultiXactOffset,
members: Vec<MultiXactMember>,
},
/// Update the map of AUX files, either writing or dropping an entry
AuxFile {
file_path: String,
content: Option<Bytes>,
},
/// A testing record for unit testing purposes. It supports append data to an existing image, or clear it.
#[cfg(feature = "testing")]
Test {
/// Append a string to the image.
append: String,
/// Clear the image before appending.
clear: bool,
/// Treat this record as an init record. `clear` should be set to true if this field is set
/// to true. This record does not need the history WALs to reconstruct. See [`NeonWalRecord::will_init`] and
/// its references in `timeline.rs`.
will_init: bool,
},
}
impl NeonWalRecord {
/// Does replaying this WAL record initialize the page from scratch, or does
/// it need to be applied over the previous image of the page?
pub fn will_init(&self) -> bool {
// If you change this function, you'll also need to change ValueBytes::will_init
match self {
NeonWalRecord::Postgres { will_init, rec: _ } => *will_init,
#[cfg(feature = "testing")]
NeonWalRecord::Test { will_init, .. } => *will_init,
// None of the special neon record types currently initialize the page
_ => false,
}
}
#[cfg(feature = "testing")]
pub fn wal_append(s: impl AsRef<str>) -> Self {
Self::Test {
append: s.as_ref().to_string(),
clear: false,
will_init: false,
}
}
#[cfg(feature = "testing")]
pub fn wal_clear(s: impl AsRef<str>) -> Self {
Self::Test {
append: s.as_ref().to_string(),
clear: true,
will_init: false,
}
}
#[cfg(feature = "testing")]
pub fn wal_init(s: impl AsRef<str>) -> Self {
Self::Test {
append: s.as_ref().to_string(),
clear: true,
will_init: true,
}
}
}
/// Build a human-readable string to describe a WAL record
///
/// For debugging purposes
pub fn describe_wal_record(rec: &NeonWalRecord) -> Result<String, DeserializeError> {
match rec {
NeonWalRecord::Postgres { will_init, rec } => Ok(format!(
"will_init: {}, {}",
will_init,
describe_postgres_wal_record(rec)?
)),
_ => Ok(format!("{:?}", rec)),
}
}

View File

@@ -1,13 +1,16 @@
use crate::walrecord::NeonWalRecord;
use anyhow::Result;
//! This module defines the value type used by the storage engine.
//!
//! A [`Value`] represents either a completely new value for one Key ([`Value::Image`]),
//! or a "delta" of how to get from previous version of the value to the new one
//! ([`Value::WalRecord`]])
//!
//! Note that the [`Value`] type is used for the permananent storage format, so any
//! changes to it must be backwards compatible.
use crate::record::NeonWalRecord;
use bytes::Bytes;
use serde::{Deserialize, Serialize};
use std::ops::AddAssign;
use std::time::Duration;
pub use pageserver_api::key::{Key, KEY_SIZE};
/// A 'value' stored for a one Key.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub enum Value {
/// An Image value contains a full copy of the value
@@ -20,10 +23,12 @@ pub enum Value {
}
impl Value {
#[inline(always)]
pub fn is_image(&self) -> bool {
matches!(self, Value::Image(_))
}
#[inline(always)]
pub fn will_init(&self) -> bool {
match self {
Value::Image(_) => true,
@@ -33,17 +38,18 @@ impl Value {
}
#[derive(Debug, PartialEq)]
pub(crate) enum InvalidInput {
pub enum InvalidInput {
TooShortValue,
TooShortPostgresRecord,
}
/// We could have a ValueRef where everything is `serde(borrow)`. Before implementing that, lets
/// use this type for querying if a slice looks some particular way.
pub(crate) struct ValueBytes;
pub struct ValueBytes;
impl ValueBytes {
pub(crate) fn will_init(raw: &[u8]) -> Result<bool, InvalidInput> {
#[inline(always)]
pub fn will_init(raw: &[u8]) -> Result<bool, InvalidInput> {
if raw.len() < 12 {
return Err(InvalidInput::TooShortValue);
}
@@ -79,6 +85,7 @@ impl ValueBytes {
mod test {
use super::*;
use bytes::Bytes;
use utils::bin_ser::BeSer;
macro_rules! roundtrip {
@@ -229,56 +236,3 @@ mod test {
assert!(!ValueBytes::will_init(&expected).unwrap());
}
}
///
/// Result of performing GC
///
#[derive(Default, Serialize, Debug)]
pub struct GcResult {
pub layers_total: u64,
pub layers_needed_by_cutoff: u64,
pub layers_needed_by_pitr: u64,
pub layers_needed_by_branches: u64,
pub layers_needed_by_leases: u64,
pub layers_not_updated: u64,
pub layers_removed: u64, // # of layer files removed because they have been made obsolete by newer ondisk files.
#[serde(serialize_with = "serialize_duration_as_millis")]
pub elapsed: Duration,
/// The layers which were garbage collected.
///
/// Used in `/v1/tenant/:tenant_id/timeline/:timeline_id/do_gc` to wait for the layers to be
/// dropped in tests.
#[cfg(feature = "testing")]
#[serde(skip)]
pub(crate) doomed_layers: Vec<crate::tenant::storage_layer::Layer>,
}
// helper function for `GcResult`, serializing a `Duration` as an integer number of milliseconds
fn serialize_duration_as_millis<S>(d: &Duration, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
d.as_millis().serialize(serializer)
}
impl AddAssign for GcResult {
fn add_assign(&mut self, other: Self) {
self.layers_total += other.layers_total;
self.layers_needed_by_pitr += other.layers_needed_by_pitr;
self.layers_needed_by_cutoff += other.layers_needed_by_cutoff;
self.layers_needed_by_branches += other.layers_needed_by_branches;
self.layers_needed_by_leases += other.layers_needed_by_leases;
self.layers_not_updated += other.layers_not_updated;
self.layers_removed += other.layers_removed;
self.elapsed += other.elapsed;
#[cfg(feature = "testing")]
{
let mut other = other;
self.doomed_layers.append(&mut other.doomed_layers);
}
}
}

View File

@@ -2,7 +2,7 @@
use once_cell::sync::Lazy;
use postgres_backend::{AuthType, Handler, PostgresBackend, QueryError};
use pq_proto::{BeMessage, RowDescriptor};
use rustls::crypto::aws_lc_rs;
use rustls::crypto::ring;
use std::io::Cursor;
use std::sync::Arc;
use tokio::io::{AsyncRead, AsyncWrite};
@@ -94,7 +94,7 @@ async fn simple_select_ssl() {
let (client_sock, server_sock) = make_tcp_pair().await;
let server_cfg =
rustls::ServerConfig::builder_with_provider(Arc::new(aws_lc_rs::default_provider()))
rustls::ServerConfig::builder_with_provider(Arc::new(ring::default_provider()))
.with_safe_default_protocol_versions()
.expect("aws_lc_rs should support the default protocol versions")
.with_no_client_auth()
@@ -110,7 +110,7 @@ async fn simple_select_ssl() {
});
let client_cfg =
rustls::ClientConfig::builder_with_provider(Arc::new(aws_lc_rs::default_provider()))
rustls::ClientConfig::builder_with_provider(Arc::new(ring::default_provider()))
.with_safe_default_protocol_versions()
.expect("aws_lc_rs should support the default protocol versions")
.with_root_certificates({

View File

@@ -15,6 +15,7 @@ memoffset.workspace = true
thiserror.workspace = true
serde.workspace = true
utils.workspace = true
tracing.workspace = true
[dev-dependencies]
env_logger.workspace = true

View File

@@ -36,6 +36,7 @@ macro_rules! postgres_ffi {
pub mod controlfile_utils;
pub mod nonrelfile_utils;
pub mod wal_craft_test_export;
pub mod wal_generator;
pub mod waldecoder_handler;
pub mod xlog_utils;
@@ -217,6 +218,7 @@ macro_rules! enum_pgversion {
pub mod pg_constants;
pub mod relfile_utils;
pub mod walrecord;
// Export some widely used datatypes that are unlikely to change across Postgres versions
pub use v14::bindings::RepOriginId;

View File

@@ -0,0 +1,259 @@
use std::ffi::{CStr, CString};
use bytes::{Bytes, BytesMut};
use crc32c::crc32c_append;
use utils::lsn::Lsn;
use super::bindings::{RmgrId, XLogLongPageHeaderData, XLogPageHeaderData, XLOG_PAGE_MAGIC};
use super::xlog_utils::{
XlLogicalMessage, XLOG_RECORD_CRC_OFFS, XLOG_SIZE_OF_XLOG_RECORD, XLP_BKP_REMOVABLE,
XLP_FIRST_IS_CONTRECORD,
};
use super::XLogRecord;
use crate::pg_constants::{
RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE, XLP_LONG_HEADER, XLR_BLOCK_ID_DATA_LONG,
XLR_BLOCK_ID_DATA_SHORT,
};
use crate::{WAL_SEGMENT_SIZE, XLOG_BLCKSZ};
/// A WAL record payload. Will be prefixed by an XLogRecord header when encoded.
pub struct Record {
pub rmid: RmgrId,
pub info: u8,
pub data: Bytes,
}
impl Record {
/// Encodes the WAL record including an XLogRecord header. prev_lsn is the start position of
/// the previous record in the WAL -- this is ignored by the Safekeeper, but not Postgres.
pub fn encode(&self, prev_lsn: Lsn) -> Bytes {
// Prefix data with block ID and length.
let data_header = Bytes::from(match self.data.len() {
0 => vec![],
1..=255 => vec![XLR_BLOCK_ID_DATA_SHORT, self.data.len() as u8],
256.. => {
let len_bytes = (self.data.len() as u32).to_le_bytes();
[&[XLR_BLOCK_ID_DATA_LONG], len_bytes.as_slice()].concat()
}
});
// Construct the WAL record header.
let mut header = XLogRecord {
xl_tot_len: (XLOG_SIZE_OF_XLOG_RECORD + data_header.len() + self.data.len()) as u32,
xl_xid: 0,
xl_prev: prev_lsn.into(),
xl_info: self.info,
xl_rmid: self.rmid,
__bindgen_padding_0: [0; 2],
xl_crc: 0, // see below
};
// Compute the CRC checksum for the data, and the header up to the CRC field.
let mut crc = 0;
crc = crc32c_append(crc, &data_header);
crc = crc32c_append(crc, &self.data);
crc = crc32c_append(crc, &header.encode().unwrap()[0..XLOG_RECORD_CRC_OFFS]);
header.xl_crc = crc;
// Encode the final header and record.
let header = header.encode().unwrap();
[header, data_header, self.data.clone()].concat().into()
}
}
/// Generates WAL record payloads.
///
/// TODO: currently only provides LogicalMessageGenerator for trivial noop messages. Add a generator
/// that creates a table and inserts rows.
pub trait RecordGenerator: Iterator<Item = Record> {}
impl<I: Iterator<Item = Record>> RecordGenerator for I {}
/// Generates binary WAL for use in tests and benchmarks. The provided record generator constructs
/// the WAL records. It is used as an iterator which yields encoded bytes for a single WAL record,
/// including internal page headers if it spans pages. Concatenating the bytes will yield a
/// complete, well-formed WAL, which can be chunked at segment boundaries if desired. Not optimized
/// for performance.
///
/// The WAL format is version-dependant (see e.g. `XLOG_PAGE_MAGIC`), so make sure to import this
/// for the appropriate Postgres version (e.g. `postgres_ffi::v17::wal_generator::WalGenerator`).
///
/// A WAL is split into 16 MB segments. Each segment is split into 8 KB pages, with headers.
/// Records are arbitrary length, 8-byte aligned, and may span pages. The layout is e.g.:
///
/// | Segment 1 | Segment 2 | Segment 3 |
/// | Page 1 | Page 2 | Page 3 | Page 4 | Page 5 | Page 6 | Page 7 | Page 8 | Page 9 |
/// | R1 | R2 |R3| R4 | R5 | R6 | R7 | R8 |
#[derive(Default)]
pub struct WalGenerator<R: RecordGenerator> {
/// Generates record payloads for the WAL.
pub record_generator: R,
/// Current LSN to append the next record at.
///
/// Callers can modify this (and prev_lsn) to restart generation at a different LSN, but should
/// ensure that the LSN is on a valid record boundary (i.e. we can't start appending in the
/// middle on an existing record or header, or beyond the end of the existing WAL).
pub lsn: Lsn,
/// The starting LSN of the previous record. Used in WAL record headers. The Safekeeper doesn't
/// care about this, unlike Postgres, but we include it for completeness.
pub prev_lsn: Lsn,
}
impl<R: RecordGenerator> WalGenerator<R> {
// Hardcode the sys and timeline ID. We can make them configurable if we care about them.
const SYS_ID: u64 = 0;
const TIMELINE_ID: u32 = 1;
/// Creates a new WAL generator with the given record generator.
pub fn new(record_generator: R) -> WalGenerator<R> {
Self {
record_generator,
lsn: Lsn(0),
prev_lsn: Lsn(0),
}
}
/// Appends a record with an arbitrary payload at the current LSN, then increments the LSN.
/// Returns the WAL bytes for the record, including page headers and padding, and the start LSN.
fn append_record(&mut self, record: Record) -> (Lsn, Bytes) {
let record = record.encode(self.prev_lsn);
let record = Self::insert_pages(record, self.lsn);
let record = Self::pad_record(record, self.lsn);
let lsn = self.lsn;
self.prev_lsn = self.lsn;
self.lsn += record.len() as u64;
(lsn, record)
}
/// Inserts page headers on 8KB page boundaries. Takes the current LSN position where the record
/// is to be appended.
fn insert_pages(record: Bytes, mut lsn: Lsn) -> Bytes {
// Fast path: record fits in current page, and the page already has a header.
if lsn.remaining_in_block() as usize >= record.len() && lsn.block_offset() > 0 {
return record;
}
let mut pages = BytesMut::new();
let mut remaining = record.clone(); // Bytes::clone() is cheap
while !remaining.is_empty() {
// At new page boundary, inject page header.
if lsn.block_offset() == 0 {
let mut page_header = XLogPageHeaderData {
xlp_magic: XLOG_PAGE_MAGIC as u16,
xlp_info: XLP_BKP_REMOVABLE,
xlp_tli: Self::TIMELINE_ID,
xlp_pageaddr: lsn.0,
xlp_rem_len: 0,
__bindgen_padding_0: [0; 4],
};
// If the record was split across page boundaries, mark as continuation.
if remaining.len() < record.len() {
page_header.xlp_rem_len = remaining.len() as u32;
page_header.xlp_info |= XLP_FIRST_IS_CONTRECORD;
}
// At start of segment, use a long page header.
let page_header = if lsn.segment_offset(WAL_SEGMENT_SIZE) == 0 {
page_header.xlp_info |= XLP_LONG_HEADER;
XLogLongPageHeaderData {
std: page_header,
xlp_sysid: Self::SYS_ID,
xlp_seg_size: WAL_SEGMENT_SIZE as u32,
xlp_xlog_blcksz: XLOG_BLCKSZ as u32,
}
.encode()
.unwrap()
} else {
page_header.encode().unwrap()
};
pages.extend_from_slice(&page_header);
lsn += page_header.len() as u64;
}
// Append the record up to the next page boundary, if any.
let page_free = lsn.remaining_in_block() as usize;
let chunk = remaining.split_to(std::cmp::min(page_free, remaining.len()));
pages.extend_from_slice(&chunk);
lsn += chunk.len() as u64;
}
pages.freeze()
}
/// Records must be 8-byte aligned. Take an encoded record (including any injected page
/// boundaries), starting at the given LSN, and add any necessary padding at the end.
fn pad_record(record: Bytes, mut lsn: Lsn) -> Bytes {
lsn += record.len() as u64;
let padding = lsn.calc_padding(8u64) as usize;
if padding == 0 {
return record;
}
[record, Bytes::from(vec![0; padding])].concat().into()
}
}
/// Generates WAL records as an iterator.
impl<R: RecordGenerator> Iterator for WalGenerator<R> {
type Item = (Lsn, Bytes);
fn next(&mut self) -> Option<Self::Item> {
let record = self.record_generator.next()?;
Some(self.append_record(record))
}
}
/// Generates logical message records (effectively noops) with a fixed message.
pub struct LogicalMessageGenerator {
prefix: CString,
message: Vec<u8>,
}
impl LogicalMessageGenerator {
const DB_ID: u32 = 0; // hardcoded for now
const RM_ID: RmgrId = RM_LOGICALMSG_ID;
const INFO: u8 = XLOG_LOGICAL_MESSAGE;
/// Creates a new LogicalMessageGenerator.
pub fn new(prefix: &CStr, message: &[u8]) -> Self {
Self {
prefix: prefix.to_owned(),
message: message.to_owned(),
}
}
/// Encodes a logical message.
fn encode(prefix: &CStr, message: &[u8]) -> Bytes {
let prefix = prefix.to_bytes_with_nul();
let header = XlLogicalMessage {
db_id: Self::DB_ID,
transactional: 0,
prefix_size: prefix.len() as u64,
message_size: message.len() as u64,
};
[&header.encode(), prefix, message].concat().into()
}
}
impl Iterator for LogicalMessageGenerator {
type Item = Record;
fn next(&mut self) -> Option<Self::Item> {
Some(Record {
rmid: Self::RM_ID,
info: Self::INFO,
data: Self::encode(&self.prefix, &self.message),
})
}
}
impl WalGenerator<LogicalMessageGenerator> {
/// Convenience method for appending a WAL record with an arbitrary logical message at the
/// current WAL LSN position. Returns the start LSN and resulting WAL bytes.
pub fn append_logical_message(&mut self, prefix: &CStr, message: &[u8]) -> (Lsn, Bytes) {
let record = Record {
rmid: LogicalMessageGenerator::RM_ID,
info: LogicalMessageGenerator::INFO,
data: LogicalMessageGenerator::encode(prefix, message),
};
self.append_record(record)
}
}

View File

@@ -7,13 +7,12 @@
// have been named the same as the corresponding PostgreSQL functions instead.
//
use crc32c::crc32c_append;
use super::super::waldecoder::WalStreamDecoder;
use super::bindings::{
CheckPoint, ControlFileData, DBState_DB_SHUTDOWNED, FullTransactionId, TimeLineID, TimestampTz,
XLogLongPageHeaderData, XLogPageHeaderData, XLogRecPtr, XLogRecord, XLogSegNo, XLOG_PAGE_MAGIC,
};
use super::wal_generator::LogicalMessageGenerator;
use super::PG_MAJORVERSION;
use crate::pg_constants;
use crate::PG_TLI;
@@ -26,7 +25,7 @@ use bytes::{Buf, Bytes};
use log::*;
use serde::Serialize;
use std::ffi::OsStr;
use std::ffi::{CString, OsStr};
use std::fs::File;
use std::io::prelude::*;
use std::io::ErrorKind;
@@ -39,6 +38,7 @@ use utils::bin_ser::SerializeError;
use utils::lsn::Lsn;
pub const XLOG_FNAME_LEN: usize = 24;
pub const XLP_BKP_REMOVABLE: u16 = 0x0004;
pub const XLP_FIRST_IS_CONTRECORD: u16 = 0x0001;
pub const XLP_REM_LEN_OFFS: usize = 2 + 2 + 4 + 8;
pub const XLOG_RECORD_CRC_OFFS: usize = 4 + 4 + 8 + 1 + 1 + 2;
@@ -489,64 +489,14 @@ impl XlLogicalMessage {
/// Create new WAL record for non-transactional logical message.
/// Used for creating artificial WAL for tests, as LogicalMessage
/// record is basically no-op.
///
/// NOTE: This leaves the xl_prev field zero. The safekeeper and
/// pageserver tolerate that, but PostgreSQL does not.
pub fn encode_logical_message(prefix: &str, message: &str) -> Vec<u8> {
let mut prefix_bytes: Vec<u8> = Vec::with_capacity(prefix.len() + 1);
prefix_bytes.write_all(prefix.as_bytes()).unwrap();
prefix_bytes.push(0);
let message_bytes = message.as_bytes();
let logical_message = XlLogicalMessage {
db_id: 0,
transactional: 0,
prefix_size: prefix_bytes.len() as u64,
message_size: message_bytes.len() as u64,
};
let mainrdata = logical_message.encode();
let mainrdata_len: usize = mainrdata.len() + prefix_bytes.len() + message_bytes.len();
// only short mainrdata is supported for now
assert!(mainrdata_len <= 255);
let mainrdata_len = mainrdata_len as u8;
let mut data: Vec<u8> = vec![pg_constants::XLR_BLOCK_ID_DATA_SHORT, mainrdata_len];
data.extend_from_slice(&mainrdata);
data.extend_from_slice(&prefix_bytes);
data.extend_from_slice(message_bytes);
let total_len = XLOG_SIZE_OF_XLOG_RECORD + data.len();
let mut header = XLogRecord {
xl_tot_len: total_len as u32,
xl_xid: 0,
xl_prev: 0,
xl_info: 0,
xl_rmid: 21,
__bindgen_padding_0: [0u8; 2usize],
xl_crc: 0, // crc will be calculated later
};
let header_bytes = header.encode().expect("failed to encode header");
let crc = crc32c_append(0, &data);
let crc = crc32c_append(crc, &header_bytes[0..XLOG_RECORD_CRC_OFFS]);
header.xl_crc = crc;
let mut wal: Vec<u8> = Vec::new();
wal.extend_from_slice(&header.encode().expect("failed to encode header"));
wal.extend_from_slice(&data);
// WAL start position must be aligned at 8 bytes,
// this will add padding for the next WAL record.
const PADDING: usize = 8;
let padding_rem = wal.len() % PADDING;
if padding_rem != 0 {
wal.resize(wal.len() + PADDING - padding_rem, 0);
}
wal
pub fn encode_logical_message(prefix: &str, message: &str) -> Bytes {
// This function can take untrusted input, so discard any NUL bytes in the prefix string.
let prefix = CString::new(prefix.replace('\0', "")).expect("no NULs");
let message = message.as_bytes();
LogicalMessageGenerator::new(&prefix, message)
.next()
.unwrap()
.encode(Lsn(0))
}
#[cfg(test)]

View File

@@ -15,6 +15,9 @@ pub enum DownloadError {
///
/// Concurrency control is not timed within timeout.
Timeout,
/// Some integrity/consistency check failed during download. This is used during
/// timeline loads to cancel the load of a tenant if some timeline detects fatal corruption.
Fatal(String),
/// The file was found in the remote storage, but the download failed.
Other(anyhow::Error),
}
@@ -29,6 +32,7 @@ impl std::fmt::Display for DownloadError {
DownloadError::Unmodified => write!(f, "File was not modified"),
DownloadError::Cancelled => write!(f, "Cancelled, shutting down"),
DownloadError::Timeout => write!(f, "timeout"),
DownloadError::Fatal(why) => write!(f, "Fatal read error: {why}"),
DownloadError::Other(e) => write!(f, "Failed to download a remote file: {e:?}"),
}
}
@@ -41,7 +45,7 @@ impl DownloadError {
pub fn is_permanent(&self) -> bool {
use DownloadError::*;
match self {
BadInput(_) | NotFound | Unmodified | Cancelled => true,
BadInput(_) | NotFound | Unmodified | Fatal(_) | Cancelled => true,
Timeout | Other(_) => false,
}
}

View File

@@ -357,22 +357,20 @@ impl RemoteStorage for LocalFs {
.list_recursive(prefix)
.await
.map_err(DownloadError::Other)?;
let objects = keys
.into_iter()
.filter_map(|k| {
let path = k.with_base(&self.storage_root);
if path.is_dir() {
None
} else {
Some(ListingObject {
key: k.clone(),
// LocalFs is just for testing, so just specify a dummy time
last_modified: SystemTime::now(),
size: 0,
})
}
})
.collect();
let mut objects = Vec::with_capacity(keys.len());
for key in keys {
let path = key.with_base(&self.storage_root);
let metadata = file_metadata(&path).await?;
if metadata.is_dir() {
continue;
}
objects.push(ListingObject {
key: key.clone(),
last_modified: metadata.modified()?,
size: metadata.len(),
});
}
let objects = objects;
if let ListingMode::NoDelimiter = mode {
result.keys = objects;
@@ -410,9 +408,8 @@ impl RemoteStorage for LocalFs {
} else {
result.keys.push(ListingObject {
key: RemotePath::from_string(&relative_key).unwrap(),
// LocalFs is just for testing
last_modified: SystemTime::now(),
size: 0,
last_modified: object.last_modified,
size: object.size,
});
}
}

View File

@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -euxo pipefail
@@ -6,9 +6,44 @@ PG_BIN=$1
WAL_PATH=$2
DATA_DIR=$3
PORT=$4
PG_VERSION=$5
SYSID=$(od -A n -j 24 -N 8 -t d8 "$WAL_PATH"/000000010000000000000002* | cut -c 3-)
# The way that initdb is invoked must match how the pageserver runs initdb.
function initdb_with_args {
local cmd=(
"$PG_BIN"/initdb
-E utf8
-U cloud_admin
-D "$DATA_DIR"
--locale 'C.UTF-8'
--lc-collate 'C.UTF-8'
--lc-ctype 'C.UTF-8'
--lc-messages 'C.UTF-8'
--lc-monetary 'C.UTF-8'
--lc-numeric 'C.UTF-8'
--lc-time 'C.UTF-8'
--sysid="$SYSID"
)
case "$PG_VERSION" in
14)
# Postgres 14 and below didn't support --locale-provider
;;
15 | 16)
cmd+=(--locale-provider 'libc')
;;
*)
# Postgres 17 added the builtin provider
cmd+=(--locale-provider 'builtin')
;;
esac
eval env -i LD_LIBRARY_PATH="$PG_BIN"/../lib "${cmd[*]}"
}
rm -fr "$DATA_DIR"
env -i LD_LIBRARY_PATH="$PG_BIN"/../lib "$PG_BIN"/initdb -E utf8 -U cloud_admin -D "$DATA_DIR" --sysid="$SYSID"
initdb_with_args
echo "port=$PORT" >> "$DATA_DIR"/postgresql.conf
echo "shared_preload_libraries='\$libdir/neon_rmgr.so'" >> "$DATA_DIR"/postgresql.conf
REDO_POS=0x$("$PG_BIN"/pg_controldata -D "$DATA_DIR" | grep -F "REDO location"| cut -c 42-)

View File

@@ -40,6 +40,11 @@ pub enum Scope {
/// Allows access to storage controller APIs used by the scrubber, to interrogate the state
/// of a tenant & post scrub results.
Scrubber,
/// This scope is used for communication with other storage controller instances.
/// At the time of writing, this is only used for the step down request.
#[serde(rename = "controller_peer")]
ControllerPeer,
}
/// JWT payload. See docs/authentication.md for the format

View File

@@ -123,15 +123,27 @@ pub async fn fsync_async_opt(
Ok(())
}
/// Like postgres' durable_rename, renames file issuing fsyncs do make it
/// durable. After return, file and rename are guaranteed to be persisted.
/// Like postgres' durable_rename, renames a file and issues fsyncs to make it durable. After
/// returning, both the file and rename are guaranteed to be persisted. Both paths must be on the
/// same file system.
///
/// Unlike postgres, it only does fsyncs to 1) file to be renamed to make
/// contents durable; 2) its directory entry to make rename durable 3) again to
/// already renamed file, which is not required by standards but postgres does
/// it, let's stick to that. Postgres additionally fsyncs newpath *before*
/// rename if it exists to ensure that at least one of the files survives, but
/// current callers don't need that.
/// Unlike postgres, it only fsyncs 1) the file to make contents durable, and 2) the directory to
/// make the rename durable. This sequence ensures the target file will never be incomplete.
///
/// Postgres also:
///
/// * Fsyncs the target file, if it exists, before the rename, to ensure either the new or existing
/// file survives a crash. Current callers don't need this as it should already be fsynced if
/// durability is needed.
///
/// * Fsyncs the file after the rename. This can be required with certain OSes or file systems (e.g.
/// NFS), but not on Linux with most common file systems like ext4 (which we currently use).
///
/// An audit of 8 other databases found that none fsynced the file after a rename:
/// <https://github.com/neondatabase/neon/pull/9686#discussion_r1837180535>
///
/// eBPF probes confirmed that this is sufficient with ext4, XFS, and ZFS, but possibly not Btrfs:
/// <https://github.com/neondatabase/neon/pull/9686#discussion_r1837926218>
///
/// virtual_file.rs has similar code, but it doesn't use vfs.
///
@@ -149,9 +161,6 @@ pub async fn durable_rename(
// Time to do the real deal.
tokio::fs::rename(old_path.as_ref(), new_path.as_ref()).await?;
// Postgres'ish fsync of renamed file.
fsync_async_opt(new_path.as_ref(), do_fsync).await?;
// Now fsync the parent
let parent = match new_path.as_ref().parent() {
Some(p) => p,

View File

@@ -12,7 +12,7 @@ use crate::seqwait::MonotonicCounter;
pub const XLOG_BLCKSZ: u32 = 8192;
/// A Postgres LSN (Log Sequence Number), also known as an XLogRecPtr
#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Hash)]
#[derive(Clone, Copy, Default, Eq, Ord, PartialEq, PartialOrd, Hash)]
pub struct Lsn(pub u64);
impl Serialize for Lsn {
@@ -138,6 +138,11 @@ impl Lsn {
self.0.checked_sub(other).map(Lsn)
}
/// Subtract a number, saturating at numeric bounds instead of overflowing.
pub fn saturating_sub<T: Into<u64>>(self, other: T) -> Lsn {
Lsn(self.0.saturating_sub(other.into()))
}
/// Subtract a number, returning the difference as i128 to avoid overflow.
pub fn widening_sub<T: Into<u64>>(self, other: T) -> i128 {
let other: u64 = other.into();

View File

@@ -0,0 +1,18 @@
[package]
name = "wal_decoder"
version = "0.1.0"
edition.workspace = true
license.workspace = true
[features]
testing = ["pageserver_api/testing"]
[dependencies]
anyhow.workspace = true
bytes.workspace = true
pageserver_api.workspace = true
postgres_ffi.workspace = true
serde.workspace = true
tracing.workspace = true
utils.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }

View File

@@ -0,0 +1,893 @@
//! This module contains logic for decoding and interpreting
//! raw bytes which represent a raw Postgres WAL record.
use crate::models::*;
use crate::serialized_batch::SerializedValueBatch;
use bytes::{Buf, Bytes};
use pageserver_api::reltag::{RelTag, SlruKind};
use pageserver_api::shard::ShardIdentity;
use postgres_ffi::pg_constants;
use postgres_ffi::relfile_utils::VISIBILITYMAP_FORKNUM;
use postgres_ffi::walrecord::*;
use utils::lsn::Lsn;
impl InterpretedWalRecord {
/// Decode and interpreted raw bytes which represent one Postgres WAL record.
/// Data blocks which do not match the provided shard identity are filtered out.
/// Shard 0 is a special case since it tracks all relation sizes. We only give it
/// the keys that are being written as that is enough for updating relation sizes.
pub fn from_bytes_filtered(
buf: Bytes,
shard: &ShardIdentity,
record_end_lsn: Lsn,
pg_version: u32,
) -> anyhow::Result<InterpretedWalRecord> {
let mut decoded = DecodedWALRecord::default();
decode_wal_record(buf, &mut decoded, pg_version)?;
let xid = decoded.xl_xid;
let flush_uncommitted = if decoded.is_dbase_create_copy(pg_version) {
FlushUncommittedRecords::Yes
} else {
FlushUncommittedRecords::No
};
let metadata_record = MetadataRecord::from_decoded(&decoded, record_end_lsn, pg_version)?;
let batch = SerializedValueBatch::from_decoded_filtered(
decoded,
shard,
record_end_lsn,
pg_version,
)?;
Ok(InterpretedWalRecord {
metadata_record,
batch,
end_lsn: record_end_lsn,
flush_uncommitted,
xid,
})
}
}
impl MetadataRecord {
fn from_decoded(
decoded: &DecodedWALRecord,
record_end_lsn: Lsn,
pg_version: u32,
) -> anyhow::Result<Option<MetadataRecord>> {
// Note: this doesn't actually copy the bytes since
// the [`Bytes`] type implements it via a level of indirection.
let mut buf = decoded.record.clone();
buf.advance(decoded.main_data_offset);
match decoded.xl_rmid {
pg_constants::RM_HEAP_ID | pg_constants::RM_HEAP2_ID => {
Self::decode_heapam_record(&mut buf, decoded, pg_version)
}
pg_constants::RM_NEON_ID => Self::decode_neonmgr_record(&mut buf, decoded, pg_version),
// Handle other special record types
pg_constants::RM_SMGR_ID => Self::decode_smgr_record(&mut buf, decoded),
pg_constants::RM_DBASE_ID => Self::decode_dbase_record(&mut buf, decoded, pg_version),
pg_constants::RM_TBLSPC_ID => {
tracing::trace!("XLOG_TBLSPC_CREATE/DROP is not handled yet");
Ok(None)
}
pg_constants::RM_CLOG_ID => Self::decode_clog_record(&mut buf, decoded, pg_version),
pg_constants::RM_XACT_ID => Self::decode_xact_record(&mut buf, decoded, record_end_lsn),
pg_constants::RM_MULTIXACT_ID => {
Self::decode_multixact_record(&mut buf, decoded, pg_version)
}
pg_constants::RM_RELMAP_ID => Self::decode_relmap_record(&mut buf, decoded),
// This is an odd duck. It needs to go to all shards.
// Since it uses the checkpoint image (that's initialized from CHECKPOINT_KEY
// in WalIngest::new), we have to send the whole DecodedWalRecord::record to
// the pageserver and decode it there.
//
// Alternatively, one can make the checkpoint part of the subscription protocol
// to the pageserver. This should work fine, but can be done at a later point.
pg_constants::RM_XLOG_ID => Self::decode_xlog_record(&mut buf, decoded, record_end_lsn),
pg_constants::RM_LOGICALMSG_ID => {
Self::decode_logical_message_record(&mut buf, decoded)
}
pg_constants::RM_STANDBY_ID => Self::decode_standby_record(&mut buf, decoded),
pg_constants::RM_REPLORIGIN_ID => Self::decode_replorigin_record(&mut buf, decoded),
_unexpected => {
// TODO: consider failing here instead of blindly doing something without
// understanding the protocol
Ok(None)
}
}
}
fn decode_heapam_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
pg_version: u32,
) -> anyhow::Result<Option<MetadataRecord>> {
// Handle VM bit updates that are implicitly part of heap records.
// First, look at the record to determine which VM bits need
// to be cleared. If either of these variables is set, we
// need to clear the corresponding bits in the visibility map.
let mut new_heap_blkno: Option<u32> = None;
let mut old_heap_blkno: Option<u32> = None;
let mut flags = pg_constants::VISIBILITYMAP_VALID_BITS;
match pg_version {
14 => {
if decoded.xl_rmid == pg_constants::RM_HEAP_ID {
let info = decoded.xl_info & pg_constants::XLOG_HEAP_OPMASK;
if info == pg_constants::XLOG_HEAP_INSERT {
let xlrec = v14::XlHeapInsert::decode(buf);
assert_eq!(0, buf.remaining());
if (xlrec.flags & pg_constants::XLH_INSERT_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_DELETE {
let xlrec = v14::XlHeapDelete::decode(buf);
if (xlrec.flags & pg_constants::XLH_DELETE_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_UPDATE
|| info == pg_constants::XLOG_HEAP_HOT_UPDATE
{
let xlrec = v14::XlHeapUpdate::decode(buf);
// the size of tuple data is inferred from the size of the record.
// we can't validate the remaining number of bytes without parsing
// the tuple data.
if (xlrec.flags & pg_constants::XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks.last().unwrap().blkno);
}
if (xlrec.flags & pg_constants::XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED) != 0 {
// PostgreSQL only uses XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED on a
// non-HOT update where the new tuple goes to different page than
// the old one. Otherwise, only XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED is
// set.
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_LOCK {
let xlrec = v14::XlHeapLock::decode(buf);
if (xlrec.flags & pg_constants::XLH_LOCK_ALL_FROZEN_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks[0].blkno);
flags = pg_constants::VISIBILITYMAP_ALL_FROZEN;
}
}
} else if decoded.xl_rmid == pg_constants::RM_HEAP2_ID {
let info = decoded.xl_info & pg_constants::XLOG_HEAP_OPMASK;
if info == pg_constants::XLOG_HEAP2_MULTI_INSERT {
let xlrec = v14::XlHeapMultiInsert::decode(buf);
let offset_array_len =
if decoded.xl_info & pg_constants::XLOG_HEAP_INIT_PAGE > 0 {
// the offsets array is omitted if XLOG_HEAP_INIT_PAGE is set
0
} else {
size_of::<u16>() * xlrec.ntuples as usize
};
assert_eq!(offset_array_len, buf.remaining());
if (xlrec.flags & pg_constants::XLH_INSERT_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP2_LOCK_UPDATED {
let xlrec = v14::XlHeapLockUpdated::decode(buf);
if (xlrec.flags & pg_constants::XLH_LOCK_ALL_FROZEN_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks[0].blkno);
flags = pg_constants::VISIBILITYMAP_ALL_FROZEN;
}
}
} else {
anyhow::bail!("Unknown RMGR {} for Heap decoding", decoded.xl_rmid);
}
}
15 => {
if decoded.xl_rmid == pg_constants::RM_HEAP_ID {
let info = decoded.xl_info & pg_constants::XLOG_HEAP_OPMASK;
if info == pg_constants::XLOG_HEAP_INSERT {
let xlrec = v15::XlHeapInsert::decode(buf);
assert_eq!(0, buf.remaining());
if (xlrec.flags & pg_constants::XLH_INSERT_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_DELETE {
let xlrec = v15::XlHeapDelete::decode(buf);
if (xlrec.flags & pg_constants::XLH_DELETE_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_UPDATE
|| info == pg_constants::XLOG_HEAP_HOT_UPDATE
{
let xlrec = v15::XlHeapUpdate::decode(buf);
// the size of tuple data is inferred from the size of the record.
// we can't validate the remaining number of bytes without parsing
// the tuple data.
if (xlrec.flags & pg_constants::XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks.last().unwrap().blkno);
}
if (xlrec.flags & pg_constants::XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED) != 0 {
// PostgreSQL only uses XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED on a
// non-HOT update where the new tuple goes to different page than
// the old one. Otherwise, only XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED is
// set.
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_LOCK {
let xlrec = v15::XlHeapLock::decode(buf);
if (xlrec.flags & pg_constants::XLH_LOCK_ALL_FROZEN_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks[0].blkno);
flags = pg_constants::VISIBILITYMAP_ALL_FROZEN;
}
}
} else if decoded.xl_rmid == pg_constants::RM_HEAP2_ID {
let info = decoded.xl_info & pg_constants::XLOG_HEAP_OPMASK;
if info == pg_constants::XLOG_HEAP2_MULTI_INSERT {
let xlrec = v15::XlHeapMultiInsert::decode(buf);
let offset_array_len =
if decoded.xl_info & pg_constants::XLOG_HEAP_INIT_PAGE > 0 {
// the offsets array is omitted if XLOG_HEAP_INIT_PAGE is set
0
} else {
size_of::<u16>() * xlrec.ntuples as usize
};
assert_eq!(offset_array_len, buf.remaining());
if (xlrec.flags & pg_constants::XLH_INSERT_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP2_LOCK_UPDATED {
let xlrec = v15::XlHeapLockUpdated::decode(buf);
if (xlrec.flags & pg_constants::XLH_LOCK_ALL_FROZEN_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks[0].blkno);
flags = pg_constants::VISIBILITYMAP_ALL_FROZEN;
}
}
} else {
anyhow::bail!("Unknown RMGR {} for Heap decoding", decoded.xl_rmid);
}
}
16 => {
if decoded.xl_rmid == pg_constants::RM_HEAP_ID {
let info = decoded.xl_info & pg_constants::XLOG_HEAP_OPMASK;
if info == pg_constants::XLOG_HEAP_INSERT {
let xlrec = v16::XlHeapInsert::decode(buf);
assert_eq!(0, buf.remaining());
if (xlrec.flags & pg_constants::XLH_INSERT_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_DELETE {
let xlrec = v16::XlHeapDelete::decode(buf);
if (xlrec.flags & pg_constants::XLH_DELETE_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_UPDATE
|| info == pg_constants::XLOG_HEAP_HOT_UPDATE
{
let xlrec = v16::XlHeapUpdate::decode(buf);
// the size of tuple data is inferred from the size of the record.
// we can't validate the remaining number of bytes without parsing
// the tuple data.
if (xlrec.flags & pg_constants::XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks.last().unwrap().blkno);
}
if (xlrec.flags & pg_constants::XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED) != 0 {
// PostgreSQL only uses XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED on a
// non-HOT update where the new tuple goes to different page than
// the old one. Otherwise, only XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED is
// set.
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_LOCK {
let xlrec = v16::XlHeapLock::decode(buf);
if (xlrec.flags & pg_constants::XLH_LOCK_ALL_FROZEN_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks[0].blkno);
flags = pg_constants::VISIBILITYMAP_ALL_FROZEN;
}
}
} else if decoded.xl_rmid == pg_constants::RM_HEAP2_ID {
let info = decoded.xl_info & pg_constants::XLOG_HEAP_OPMASK;
if info == pg_constants::XLOG_HEAP2_MULTI_INSERT {
let xlrec = v16::XlHeapMultiInsert::decode(buf);
let offset_array_len =
if decoded.xl_info & pg_constants::XLOG_HEAP_INIT_PAGE > 0 {
// the offsets array is omitted if XLOG_HEAP_INIT_PAGE is set
0
} else {
size_of::<u16>() * xlrec.ntuples as usize
};
assert_eq!(offset_array_len, buf.remaining());
if (xlrec.flags & pg_constants::XLH_INSERT_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP2_LOCK_UPDATED {
let xlrec = v16::XlHeapLockUpdated::decode(buf);
if (xlrec.flags & pg_constants::XLH_LOCK_ALL_FROZEN_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks[0].blkno);
flags = pg_constants::VISIBILITYMAP_ALL_FROZEN;
}
}
} else {
anyhow::bail!("Unknown RMGR {} for Heap decoding", decoded.xl_rmid);
}
}
17 => {
if decoded.xl_rmid == pg_constants::RM_HEAP_ID {
let info = decoded.xl_info & pg_constants::XLOG_HEAP_OPMASK;
if info == pg_constants::XLOG_HEAP_INSERT {
let xlrec = v17::XlHeapInsert::decode(buf);
assert_eq!(0, buf.remaining());
if (xlrec.flags & pg_constants::XLH_INSERT_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_DELETE {
let xlrec = v17::XlHeapDelete::decode(buf);
if (xlrec.flags & pg_constants::XLH_DELETE_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_UPDATE
|| info == pg_constants::XLOG_HEAP_HOT_UPDATE
{
let xlrec = v17::XlHeapUpdate::decode(buf);
// the size of tuple data is inferred from the size of the record.
// we can't validate the remaining number of bytes without parsing
// the tuple data.
if (xlrec.flags & pg_constants::XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks.last().unwrap().blkno);
}
if (xlrec.flags & pg_constants::XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED) != 0 {
// PostgreSQL only uses XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED on a
// non-HOT update where the new tuple goes to different page than
// the old one. Otherwise, only XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED is
// set.
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP_LOCK {
let xlrec = v17::XlHeapLock::decode(buf);
if (xlrec.flags & pg_constants::XLH_LOCK_ALL_FROZEN_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks[0].blkno);
flags = pg_constants::VISIBILITYMAP_ALL_FROZEN;
}
}
} else if decoded.xl_rmid == pg_constants::RM_HEAP2_ID {
let info = decoded.xl_info & pg_constants::XLOG_HEAP_OPMASK;
if info == pg_constants::XLOG_HEAP2_MULTI_INSERT {
let xlrec = v17::XlHeapMultiInsert::decode(buf);
let offset_array_len =
if decoded.xl_info & pg_constants::XLOG_HEAP_INIT_PAGE > 0 {
// the offsets array is omitted if XLOG_HEAP_INIT_PAGE is set
0
} else {
size_of::<u16>() * xlrec.ntuples as usize
};
assert_eq!(offset_array_len, buf.remaining());
if (xlrec.flags & pg_constants::XLH_INSERT_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
} else if info == pg_constants::XLOG_HEAP2_LOCK_UPDATED {
let xlrec = v17::XlHeapLockUpdated::decode(buf);
if (xlrec.flags & pg_constants::XLH_LOCK_ALL_FROZEN_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks[0].blkno);
flags = pg_constants::VISIBILITYMAP_ALL_FROZEN;
}
}
} else {
anyhow::bail!("Unknown RMGR {} for Heap decoding", decoded.xl_rmid);
}
}
_ => {}
}
if new_heap_blkno.is_some() || old_heap_blkno.is_some() {
let vm_rel = RelTag {
forknum: VISIBILITYMAP_FORKNUM,
spcnode: decoded.blocks[0].rnode_spcnode,
dbnode: decoded.blocks[0].rnode_dbnode,
relnode: decoded.blocks[0].rnode_relnode,
};
Ok(Some(MetadataRecord::Heapam(HeapamRecord::ClearVmBits(
ClearVmBits {
new_heap_blkno,
old_heap_blkno,
vm_rel,
flags,
},
))))
} else {
Ok(None)
}
}
fn decode_neonmgr_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
pg_version: u32,
) -> anyhow::Result<Option<MetadataRecord>> {
// Handle VM bit updates that are implicitly part of heap records.
// First, look at the record to determine which VM bits need
// to be cleared. If either of these variables is set, we
// need to clear the corresponding bits in the visibility map.
let mut new_heap_blkno: Option<u32> = None;
let mut old_heap_blkno: Option<u32> = None;
let mut flags = pg_constants::VISIBILITYMAP_VALID_BITS;
assert_eq!(decoded.xl_rmid, pg_constants::RM_NEON_ID);
match pg_version {
16 | 17 => {
let info = decoded.xl_info & pg_constants::XLOG_HEAP_OPMASK;
match info {
pg_constants::XLOG_NEON_HEAP_INSERT => {
let xlrec = v17::rm_neon::XlNeonHeapInsert::decode(buf);
assert_eq!(0, buf.remaining());
if (xlrec.flags & pg_constants::XLH_INSERT_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
}
pg_constants::XLOG_NEON_HEAP_DELETE => {
let xlrec = v17::rm_neon::XlNeonHeapDelete::decode(buf);
if (xlrec.flags & pg_constants::XLH_DELETE_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
}
pg_constants::XLOG_NEON_HEAP_UPDATE
| pg_constants::XLOG_NEON_HEAP_HOT_UPDATE => {
let xlrec = v17::rm_neon::XlNeonHeapUpdate::decode(buf);
// the size of tuple data is inferred from the size of the record.
// we can't validate the remaining number of bytes without parsing
// the tuple data.
if (xlrec.flags & pg_constants::XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks.last().unwrap().blkno);
}
if (xlrec.flags & pg_constants::XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED) != 0 {
// PostgreSQL only uses XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED on a
// non-HOT update where the new tuple goes to different page than
// the old one. Otherwise, only XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED is
// set.
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
}
pg_constants::XLOG_NEON_HEAP_MULTI_INSERT => {
let xlrec = v17::rm_neon::XlNeonHeapMultiInsert::decode(buf);
let offset_array_len =
if decoded.xl_info & pg_constants::XLOG_HEAP_INIT_PAGE > 0 {
// the offsets array is omitted if XLOG_HEAP_INIT_PAGE is set
0
} else {
size_of::<u16>() * xlrec.ntuples as usize
};
assert_eq!(offset_array_len, buf.remaining());
if (xlrec.flags & pg_constants::XLH_INSERT_ALL_VISIBLE_CLEARED) != 0 {
new_heap_blkno = Some(decoded.blocks[0].blkno);
}
}
pg_constants::XLOG_NEON_HEAP_LOCK => {
let xlrec = v17::rm_neon::XlNeonHeapLock::decode(buf);
if (xlrec.flags & pg_constants::XLH_LOCK_ALL_FROZEN_CLEARED) != 0 {
old_heap_blkno = Some(decoded.blocks[0].blkno);
flags = pg_constants::VISIBILITYMAP_ALL_FROZEN;
}
}
info => anyhow::bail!("Unknown WAL record type for Neon RMGR: {}", info),
}
}
_ => anyhow::bail!(
"Neon RMGR has no known compatibility with PostgreSQL version {}",
pg_version
),
}
if new_heap_blkno.is_some() || old_heap_blkno.is_some() {
let vm_rel = RelTag {
forknum: VISIBILITYMAP_FORKNUM,
spcnode: decoded.blocks[0].rnode_spcnode,
dbnode: decoded.blocks[0].rnode_dbnode,
relnode: decoded.blocks[0].rnode_relnode,
};
Ok(Some(MetadataRecord::Neonrmgr(NeonrmgrRecord::ClearVmBits(
ClearVmBits {
new_heap_blkno,
old_heap_blkno,
vm_rel,
flags,
},
))))
} else {
Ok(None)
}
}
fn decode_smgr_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
) -> anyhow::Result<Option<MetadataRecord>> {
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
if info == pg_constants::XLOG_SMGR_CREATE {
let create = XlSmgrCreate::decode(buf);
let rel = RelTag {
spcnode: create.rnode.spcnode,
dbnode: create.rnode.dbnode,
relnode: create.rnode.relnode,
forknum: create.forknum,
};
return Ok(Some(MetadataRecord::Smgr(SmgrRecord::Create(SmgrCreate {
rel,
}))));
} else if info == pg_constants::XLOG_SMGR_TRUNCATE {
let truncate = XlSmgrTruncate::decode(buf);
return Ok(Some(MetadataRecord::Smgr(SmgrRecord::Truncate(truncate))));
}
Ok(None)
}
fn decode_dbase_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
pg_version: u32,
) -> anyhow::Result<Option<MetadataRecord>> {
// TODO: Refactor this to avoid the duplication between postgres versions.
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
tracing::debug!(%info, %pg_version, "handle RM_DBASE_ID");
if pg_version == 14 {
if info == postgres_ffi::v14::bindings::XLOG_DBASE_CREATE {
let createdb = XlCreateDatabase::decode(buf);
tracing::debug!("XLOG_DBASE_CREATE v14");
let record = MetadataRecord::Dbase(DbaseRecord::Create(DbaseCreate {
db_id: createdb.db_id,
tablespace_id: createdb.tablespace_id,
src_db_id: createdb.src_db_id,
src_tablespace_id: createdb.src_tablespace_id,
}));
return Ok(Some(record));
} else if info == postgres_ffi::v14::bindings::XLOG_DBASE_DROP {
let dropdb = XlDropDatabase::decode(buf);
let record = MetadataRecord::Dbase(DbaseRecord::Drop(DbaseDrop {
db_id: dropdb.db_id,
tablespace_ids: dropdb.tablespace_ids,
}));
return Ok(Some(record));
}
} else if pg_version == 15 {
if info == postgres_ffi::v15::bindings::XLOG_DBASE_CREATE_WAL_LOG {
tracing::debug!("XLOG_DBASE_CREATE_WAL_LOG: noop");
} else if info == postgres_ffi::v15::bindings::XLOG_DBASE_CREATE_FILE_COPY {
// The XLOG record was renamed between v14 and v15,
// but the record format is the same.
// So we can reuse XlCreateDatabase here.
tracing::debug!("XLOG_DBASE_CREATE_FILE_COPY");
let createdb = XlCreateDatabase::decode(buf);
let record = MetadataRecord::Dbase(DbaseRecord::Create(DbaseCreate {
db_id: createdb.db_id,
tablespace_id: createdb.tablespace_id,
src_db_id: createdb.src_db_id,
src_tablespace_id: createdb.src_tablespace_id,
}));
return Ok(Some(record));
} else if info == postgres_ffi::v15::bindings::XLOG_DBASE_DROP {
let dropdb = XlDropDatabase::decode(buf);
let record = MetadataRecord::Dbase(DbaseRecord::Drop(DbaseDrop {
db_id: dropdb.db_id,
tablespace_ids: dropdb.tablespace_ids,
}));
return Ok(Some(record));
}
} else if pg_version == 16 {
if info == postgres_ffi::v16::bindings::XLOG_DBASE_CREATE_WAL_LOG {
tracing::debug!("XLOG_DBASE_CREATE_WAL_LOG: noop");
} else if info == postgres_ffi::v16::bindings::XLOG_DBASE_CREATE_FILE_COPY {
// The XLOG record was renamed between v14 and v15,
// but the record format is the same.
// So we can reuse XlCreateDatabase here.
tracing::debug!("XLOG_DBASE_CREATE_FILE_COPY");
let createdb = XlCreateDatabase::decode(buf);
let record = MetadataRecord::Dbase(DbaseRecord::Create(DbaseCreate {
db_id: createdb.db_id,
tablespace_id: createdb.tablespace_id,
src_db_id: createdb.src_db_id,
src_tablespace_id: createdb.src_tablespace_id,
}));
return Ok(Some(record));
} else if info == postgres_ffi::v16::bindings::XLOG_DBASE_DROP {
let dropdb = XlDropDatabase::decode(buf);
let record = MetadataRecord::Dbase(DbaseRecord::Drop(DbaseDrop {
db_id: dropdb.db_id,
tablespace_ids: dropdb.tablespace_ids,
}));
return Ok(Some(record));
}
} else if pg_version == 17 {
if info == postgres_ffi::v17::bindings::XLOG_DBASE_CREATE_WAL_LOG {
tracing::debug!("XLOG_DBASE_CREATE_WAL_LOG: noop");
} else if info == postgres_ffi::v17::bindings::XLOG_DBASE_CREATE_FILE_COPY {
// The XLOG record was renamed between v14 and v15,
// but the record format is the same.
// So we can reuse XlCreateDatabase here.
tracing::debug!("XLOG_DBASE_CREATE_FILE_COPY");
let createdb = XlCreateDatabase::decode(buf);
let record = MetadataRecord::Dbase(DbaseRecord::Create(DbaseCreate {
db_id: createdb.db_id,
tablespace_id: createdb.tablespace_id,
src_db_id: createdb.src_db_id,
src_tablespace_id: createdb.src_tablespace_id,
}));
return Ok(Some(record));
} else if info == postgres_ffi::v17::bindings::XLOG_DBASE_DROP {
let dropdb = XlDropDatabase::decode(buf);
let record = MetadataRecord::Dbase(DbaseRecord::Drop(DbaseDrop {
db_id: dropdb.db_id,
tablespace_ids: dropdb.tablespace_ids,
}));
return Ok(Some(record));
}
}
Ok(None)
}
fn decode_clog_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
pg_version: u32,
) -> anyhow::Result<Option<MetadataRecord>> {
let info = decoded.xl_info & !pg_constants::XLR_INFO_MASK;
if info == pg_constants::CLOG_ZEROPAGE {
let pageno = if pg_version < 17 {
buf.get_u32_le()
} else {
buf.get_u64_le() as u32
};
let segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT;
let rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT;
Ok(Some(MetadataRecord::Clog(ClogRecord::ZeroPage(
ClogZeroPage { segno, rpageno },
))))
} else {
assert!(info == pg_constants::CLOG_TRUNCATE);
let xlrec = XlClogTruncate::decode(buf, pg_version);
Ok(Some(MetadataRecord::Clog(ClogRecord::Truncate(
ClogTruncate {
pageno: xlrec.pageno,
oldest_xid: xlrec.oldest_xid,
oldest_xid_db: xlrec.oldest_xid_db,
},
))))
}
}
fn decode_xact_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
lsn: Lsn,
) -> anyhow::Result<Option<MetadataRecord>> {
let info = decoded.xl_info & pg_constants::XLOG_XACT_OPMASK;
let origin_id = decoded.origin_id;
let xl_xid = decoded.xl_xid;
if info == pg_constants::XLOG_XACT_COMMIT {
let parsed = XlXactParsedRecord::decode(buf, decoded.xl_xid, decoded.xl_info);
return Ok(Some(MetadataRecord::Xact(XactRecord::Commit(XactCommon {
parsed,
origin_id,
xl_xid,
lsn,
}))));
} else if info == pg_constants::XLOG_XACT_ABORT {
let parsed = XlXactParsedRecord::decode(buf, decoded.xl_xid, decoded.xl_info);
return Ok(Some(MetadataRecord::Xact(XactRecord::Abort(XactCommon {
parsed,
origin_id,
xl_xid,
lsn,
}))));
} else if info == pg_constants::XLOG_XACT_COMMIT_PREPARED {
let parsed = XlXactParsedRecord::decode(buf, decoded.xl_xid, decoded.xl_info);
return Ok(Some(MetadataRecord::Xact(XactRecord::CommitPrepared(
XactCommon {
parsed,
origin_id,
xl_xid,
lsn,
},
))));
} else if info == pg_constants::XLOG_XACT_ABORT_PREPARED {
let parsed = XlXactParsedRecord::decode(buf, decoded.xl_xid, decoded.xl_info);
return Ok(Some(MetadataRecord::Xact(XactRecord::AbortPrepared(
XactCommon {
parsed,
origin_id,
xl_xid,
lsn,
},
))));
} else if info == pg_constants::XLOG_XACT_PREPARE {
return Ok(Some(MetadataRecord::Xact(XactRecord::Prepare(
XactPrepare {
xl_xid: decoded.xl_xid,
data: Bytes::copy_from_slice(&buf[..]),
},
))));
}
Ok(None)
}
fn decode_multixact_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
pg_version: u32,
) -> anyhow::Result<Option<MetadataRecord>> {
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
if info == pg_constants::XLOG_MULTIXACT_ZERO_OFF_PAGE
|| info == pg_constants::XLOG_MULTIXACT_ZERO_MEM_PAGE
{
let pageno = if pg_version < 17 {
buf.get_u32_le()
} else {
buf.get_u64_le() as u32
};
let segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT;
let rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT;
let slru_kind = match info {
pg_constants::XLOG_MULTIXACT_ZERO_OFF_PAGE => SlruKind::MultiXactOffsets,
pg_constants::XLOG_MULTIXACT_ZERO_MEM_PAGE => SlruKind::MultiXactMembers,
_ => unreachable!(),
};
return Ok(Some(MetadataRecord::MultiXact(MultiXactRecord::ZeroPage(
MultiXactZeroPage {
slru_kind,
segno,
rpageno,
},
))));
} else if info == pg_constants::XLOG_MULTIXACT_CREATE_ID {
let xlrec = XlMultiXactCreate::decode(buf);
return Ok(Some(MetadataRecord::MultiXact(MultiXactRecord::Create(
xlrec,
))));
} else if info == pg_constants::XLOG_MULTIXACT_TRUNCATE_ID {
let xlrec = XlMultiXactTruncate::decode(buf);
return Ok(Some(MetadataRecord::MultiXact(MultiXactRecord::Truncate(
xlrec,
))));
}
Ok(None)
}
fn decode_relmap_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
) -> anyhow::Result<Option<MetadataRecord>> {
let update = XlRelmapUpdate::decode(buf);
let mut buf = decoded.record.clone();
buf.advance(decoded.main_data_offset);
// skip xl_relmap_update
buf.advance(12);
Ok(Some(MetadataRecord::Relmap(RelmapRecord::Update(
RelmapUpdate {
update,
buf: Bytes::copy_from_slice(&buf[..]),
},
))))
}
fn decode_xlog_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
lsn: Lsn,
) -> anyhow::Result<Option<MetadataRecord>> {
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
Ok(Some(MetadataRecord::Xlog(XlogRecord::Raw(RawXlogRecord {
info,
lsn,
buf: buf.clone(),
}))))
}
fn decode_logical_message_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
) -> anyhow::Result<Option<MetadataRecord>> {
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
if info == pg_constants::XLOG_LOGICAL_MESSAGE {
let xlrec = XlLogicalMessage::decode(buf);
let prefix = std::str::from_utf8(&buf[0..xlrec.prefix_size - 1])?;
#[cfg(feature = "testing")]
if prefix == "neon-test" {
return Ok(Some(MetadataRecord::LogicalMessage(
LogicalMessageRecord::Failpoint,
)));
}
if let Some(path) = prefix.strip_prefix("neon-file:") {
let buf_size = xlrec.prefix_size + xlrec.message_size;
let buf = Bytes::copy_from_slice(&buf[xlrec.prefix_size..buf_size]);
return Ok(Some(MetadataRecord::LogicalMessage(
LogicalMessageRecord::Put(PutLogicalMessage {
path: path.to_string(),
buf,
}),
)));
}
}
Ok(None)
}
fn decode_standby_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
) -> anyhow::Result<Option<MetadataRecord>> {
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
if info == pg_constants::XLOG_RUNNING_XACTS {
let xlrec = XlRunningXacts::decode(buf);
return Ok(Some(MetadataRecord::Standby(StandbyRecord::RunningXacts(
StandbyRunningXacts {
oldest_running_xid: xlrec.oldest_running_xid,
},
))));
}
Ok(None)
}
fn decode_replorigin_record(
buf: &mut Bytes,
decoded: &DecodedWALRecord,
) -> anyhow::Result<Option<MetadataRecord>> {
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
if info == pg_constants::XLOG_REPLORIGIN_SET {
let xlrec = XlReploriginSet::decode(buf);
return Ok(Some(MetadataRecord::Replorigin(ReploriginRecord::Set(
xlrec,
))));
} else if info == pg_constants::XLOG_REPLORIGIN_DROP {
let xlrec = XlReploriginDrop::decode(buf);
return Ok(Some(MetadataRecord::Replorigin(ReploriginRecord::Drop(
xlrec,
))));
}
Ok(None)
}
}

View File

@@ -0,0 +1,3 @@
pub mod decoder;
pub mod models;
pub mod serialized_batch;

View File

@@ -0,0 +1,211 @@
//! This module houses types which represent decoded PG WAL records
//! ready for the pageserver to interpret. They are derived from the original
//! WAL records, so that each struct corresponds closely to one WAL record of
//! a specific kind. They contain the same information as the original WAL records,
//! but the values are already serialized in a [`SerializedValueBatch`], which
//! is the format that the pageserver is expecting them in.
//!
//! The ingestion code uses these structs to help with parsing the WAL records,
//! and it splits them into a stream of modifications to the key-value pairs that
//! are ultimately stored in delta layers. See also the split-out counterparts in
//! [`postgres_ffi::walrecord`].
//!
//! The pipeline which processes WAL records is not super obvious, so let's follow
//! the flow of an example XACT_COMMIT Postgres record:
//!
//! (Postgres XACT_COMMIT record)
//! |
//! |--> pageserver::walingest::WalIngest::decode_xact_record
//! |
//! |--> ([`XactRecord::Commit`])
//! |
//! |--> pageserver::walingest::WalIngest::ingest_xact_record
//! |
//! |--> (NeonWalRecord::ClogSetCommitted)
//! |
//! |--> write to KV store within the pageserver
use bytes::Bytes;
use pageserver_api::reltag::{RelTag, SlruKind};
use postgres_ffi::walrecord::{
XlMultiXactCreate, XlMultiXactTruncate, XlRelmapUpdate, XlReploriginDrop, XlReploriginSet,
XlSmgrTruncate, XlXactParsedRecord,
};
use postgres_ffi::{Oid, TransactionId};
use utils::lsn::Lsn;
use crate::serialized_batch::SerializedValueBatch;
pub enum FlushUncommittedRecords {
Yes,
No,
}
/// An interpreted Postgres WAL record, ready to be handled by the pageserver
pub struct InterpretedWalRecord {
/// Optional metadata record - may cause writes to metadata keys
/// in the storage engine
pub metadata_record: Option<MetadataRecord>,
/// A pre-serialized batch along with the required metadata for ingestion
/// by the pageserver
pub batch: SerializedValueBatch,
/// Byte offset within WAL for the end of the original PG WAL record
pub end_lsn: Lsn,
/// Whether to flush all uncommitted modifications to the storage engine
/// before ingesting this record. This is currently only used for legacy PG
/// database creations which read pages from a template database. Such WAL
/// records require reading data blocks while ingesting, hence the need to flush.
pub flush_uncommitted: FlushUncommittedRecords,
/// Transaction id of the original PG WAL record
pub xid: TransactionId,
}
/// The interpreted part of the Postgres WAL record which requires metadata
/// writes to the underlying storage engine.
pub enum MetadataRecord {
Heapam(HeapamRecord),
Neonrmgr(NeonrmgrRecord),
Smgr(SmgrRecord),
Dbase(DbaseRecord),
Clog(ClogRecord),
Xact(XactRecord),
MultiXact(MultiXactRecord),
Relmap(RelmapRecord),
Xlog(XlogRecord),
LogicalMessage(LogicalMessageRecord),
Standby(StandbyRecord),
Replorigin(ReploriginRecord),
}
pub enum HeapamRecord {
ClearVmBits(ClearVmBits),
}
pub struct ClearVmBits {
pub new_heap_blkno: Option<u32>,
pub old_heap_blkno: Option<u32>,
pub vm_rel: RelTag,
pub flags: u8,
}
pub enum NeonrmgrRecord {
ClearVmBits(ClearVmBits),
}
pub enum SmgrRecord {
Create(SmgrCreate),
Truncate(XlSmgrTruncate),
}
pub struct SmgrCreate {
pub rel: RelTag,
}
pub enum DbaseRecord {
Create(DbaseCreate),
Drop(DbaseDrop),
}
pub struct DbaseCreate {
pub db_id: Oid,
pub tablespace_id: Oid,
pub src_db_id: Oid,
pub src_tablespace_id: Oid,
}
pub struct DbaseDrop {
pub db_id: Oid,
pub tablespace_ids: Vec<Oid>,
}
pub enum ClogRecord {
ZeroPage(ClogZeroPage),
Truncate(ClogTruncate),
}
pub struct ClogZeroPage {
pub segno: u32,
pub rpageno: u32,
}
pub struct ClogTruncate {
pub pageno: u32,
pub oldest_xid: TransactionId,
pub oldest_xid_db: Oid,
}
pub enum XactRecord {
Commit(XactCommon),
Abort(XactCommon),
CommitPrepared(XactCommon),
AbortPrepared(XactCommon),
Prepare(XactPrepare),
}
pub struct XactCommon {
pub parsed: XlXactParsedRecord,
pub origin_id: u16,
// Fields below are only used for logging
pub xl_xid: TransactionId,
pub lsn: Lsn,
}
pub struct XactPrepare {
pub xl_xid: TransactionId,
pub data: Bytes,
}
pub enum MultiXactRecord {
ZeroPage(MultiXactZeroPage),
Create(XlMultiXactCreate),
Truncate(XlMultiXactTruncate),
}
pub struct MultiXactZeroPage {
pub slru_kind: SlruKind,
pub segno: u32,
pub rpageno: u32,
}
pub enum RelmapRecord {
Update(RelmapUpdate),
}
pub struct RelmapUpdate {
pub update: XlRelmapUpdate,
pub buf: Bytes,
}
pub enum XlogRecord {
Raw(RawXlogRecord),
}
pub struct RawXlogRecord {
pub info: u8,
pub lsn: Lsn,
pub buf: Bytes,
}
pub enum LogicalMessageRecord {
Put(PutLogicalMessage),
#[cfg(feature = "testing")]
Failpoint,
}
pub struct PutLogicalMessage {
pub path: String,
pub buf: Bytes,
}
pub enum StandbyRecord {
RunningXacts(StandbyRunningXacts),
}
pub struct StandbyRunningXacts {
pub oldest_running_xid: TransactionId,
}
pub enum ReploriginRecord {
Set(XlReploriginSet),
Drop(XlReploriginDrop),
}

View File

@@ -0,0 +1,862 @@
//! This module implements batch type for serialized [`pageserver_api::value::Value`]
//! instances. Each batch contains a raw buffer (serialized values)
//! and a list of metadata for each (key, LSN) tuple present in the batch.
//!
//! Such batches are created from decoded PG wal records and ingested
//! by the pageserver by writing directly to the ephemeral file.
use std::collections::BTreeSet;
use bytes::{Bytes, BytesMut};
use pageserver_api::key::rel_block_to_key;
use pageserver_api::keyspace::KeySpace;
use pageserver_api::record::NeonWalRecord;
use pageserver_api::reltag::RelTag;
use pageserver_api::shard::ShardIdentity;
use pageserver_api::{key::CompactKey, value::Value};
use postgres_ffi::walrecord::{DecodedBkpBlock, DecodedWALRecord};
use postgres_ffi::{page_is_new, page_set_lsn, pg_constants, BLCKSZ};
use utils::bin_ser::BeSer;
use utils::lsn::Lsn;
use pageserver_api::key::Key;
static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; BLCKSZ as usize]);
/// Accompanying metadata for the batch
/// A value may be serialized and stored into the batch or just "observed".
/// Shard 0 currently "observes" all values in order to accurately track
/// relation sizes. In the case of "observed" values, we only need to know
/// the key and LSN, so two types of metadata are supported to save on network
/// bandwidth.
pub enum ValueMeta {
Serialized(SerializedValueMeta),
Observed(ObservedValueMeta),
}
impl ValueMeta {
pub fn key(&self) -> CompactKey {
match self {
Self::Serialized(ser) => ser.key,
Self::Observed(obs) => obs.key,
}
}
pub fn lsn(&self) -> Lsn {
match self {
Self::Serialized(ser) => ser.lsn,
Self::Observed(obs) => obs.lsn,
}
}
}
/// Wrapper around [`ValueMeta`] that implements ordering by
/// (key, LSN) tuples
struct OrderedValueMeta(ValueMeta);
impl Ord for OrderedValueMeta {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
(self.0.key(), self.0.lsn()).cmp(&(other.0.key(), other.0.lsn()))
}
}
impl PartialOrd for OrderedValueMeta {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl PartialEq for OrderedValueMeta {
fn eq(&self, other: &Self) -> bool {
(self.0.key(), self.0.lsn()) == (other.0.key(), other.0.lsn())
}
}
impl Eq for OrderedValueMeta {}
/// Metadata for a [`Value`] serialized into the batch.
pub struct SerializedValueMeta {
pub key: CompactKey,
pub lsn: Lsn,
/// Starting offset of the value for the (key, LSN) tuple
/// in [`SerializedValueBatch::raw`]
pub batch_offset: u64,
pub len: usize,
pub will_init: bool,
}
/// Metadata for a [`Value`] observed by the batch
pub struct ObservedValueMeta {
pub key: CompactKey,
pub lsn: Lsn,
}
/// Batch of serialized [`Value`]s.
pub struct SerializedValueBatch {
/// [`Value`]s serialized in EphemeralFile's native format,
/// ready for disk write by the pageserver
pub raw: Vec<u8>,
/// Metadata to make sense of the bytes in [`Self::raw`]
/// and represent "observed" values.
///
/// Invariant: Metadata entries for any given key are ordered
/// by LSN. Note that entries for a key do not have to be contiguous.
pub metadata: Vec<ValueMeta>,
/// The highest LSN of any value in the batch
pub max_lsn: Lsn,
/// Number of values encoded by [`Self::raw`]
pub len: usize,
}
impl Default for SerializedValueBatch {
fn default() -> Self {
Self {
raw: Default::default(),
metadata: Default::default(),
max_lsn: Lsn(0),
len: 0,
}
}
}
impl SerializedValueBatch {
/// Build a batch of serialized values from a decoded PG WAL record
///
/// The batch will only contain values for keys targeting the specifiec
/// shard. Shard 0 is a special case, where any keys that don't belong to
/// it are "observed" by the batch (i.e. present in [`SerializedValueBatch::metadata`],
/// but absent from the raw buffer [`SerializedValueBatch::raw`]).
pub(crate) fn from_decoded_filtered(
decoded: DecodedWALRecord,
shard: &ShardIdentity,
record_end_lsn: Lsn,
pg_version: u32,
) -> anyhow::Result<SerializedValueBatch> {
// First determine how big the buffer needs to be and allocate it up-front.
// This duplicates some of the work below, but it's empirically much faster.
let estimated_buffer_size = Self::estimate_buffer_size(&decoded, shard, pg_version);
let mut buf = Vec::<u8>::with_capacity(estimated_buffer_size);
let mut metadata: Vec<ValueMeta> = Vec::with_capacity(decoded.blocks.len());
let mut max_lsn: Lsn = Lsn(0);
let mut len: usize = 0;
for blk in decoded.blocks.iter() {
let relative_off = buf.len() as u64;
let rel = RelTag {
spcnode: blk.rnode_spcnode,
dbnode: blk.rnode_dbnode,
relnode: blk.rnode_relnode,
forknum: blk.forknum,
};
let key = rel_block_to_key(rel, blk.blkno);
if !key.is_valid_key_on_write_path() {
anyhow::bail!("Unsupported key decoded at LSN {}: {}", record_end_lsn, key);
}
let key_is_local = shard.is_key_local(&key);
tracing::debug!(
lsn=%record_end_lsn,
key=%key,
"ingest: shard decision {}",
if !key_is_local { "drop" } else { "keep" },
);
if !key_is_local {
if shard.is_shard_zero() {
// Shard 0 tracks relation sizes. Although we will not store this block, we will observe
// its blkno in case it implicitly extends a relation.
metadata.push(ValueMeta::Observed(ObservedValueMeta {
key: key.to_compact(),
lsn: record_end_lsn,
}))
}
continue;
}
// Instead of storing full-page-image WAL record,
// it is better to store extracted image: we can skip wal-redo
// in this case. Also some FPI records may contain multiple (up to 32) pages,
// so them have to be copied multiple times.
//
let val = if Self::block_is_image(&decoded, blk, pg_version) {
// Extract page image from FPI record
let img_len = blk.bimg_len as usize;
let img_offs = blk.bimg_offset as usize;
let mut image = BytesMut::with_capacity(BLCKSZ as usize);
// TODO(vlad): skip the copy
image.extend_from_slice(&decoded.record[img_offs..img_offs + img_len]);
if blk.hole_length != 0 {
let tail = image.split_off(blk.hole_offset as usize);
image.resize(image.len() + blk.hole_length as usize, 0u8);
image.unsplit(tail);
}
//
// Match the logic of XLogReadBufferForRedoExtended:
// The page may be uninitialized. If so, we can't set the LSN because
// that would corrupt the page.
//
if !page_is_new(&image) {
page_set_lsn(&mut image, record_end_lsn)
}
assert_eq!(image.len(), BLCKSZ as usize);
Value::Image(image.freeze())
} else {
Value::WalRecord(NeonWalRecord::Postgres {
will_init: blk.will_init || blk.apply_image,
rec: decoded.record.clone(),
})
};
val.ser_into(&mut buf)
.expect("Writing into in-memory buffer is infallible");
let val_ser_size = buf.len() - relative_off as usize;
metadata.push(ValueMeta::Serialized(SerializedValueMeta {
key: key.to_compact(),
lsn: record_end_lsn,
batch_offset: relative_off,
len: val_ser_size,
will_init: val.will_init(),
}));
max_lsn = std::cmp::max(max_lsn, record_end_lsn);
len += 1;
}
if cfg!(any(debug_assertions, test)) {
let batch = Self {
raw: buf,
metadata,
max_lsn,
len,
};
batch.validate_lsn_order();
return Ok(batch);
}
Ok(Self {
raw: buf,
metadata,
max_lsn,
len,
})
}
/// Look into the decoded PG WAL record and determine
/// roughly how large the buffer for serialized values needs to be.
fn estimate_buffer_size(
decoded: &DecodedWALRecord,
shard: &ShardIdentity,
pg_version: u32,
) -> usize {
let mut estimate: usize = 0;
for blk in decoded.blocks.iter() {
let rel = RelTag {
spcnode: blk.rnode_spcnode,
dbnode: blk.rnode_dbnode,
relnode: blk.rnode_relnode,
forknum: blk.forknum,
};
let key = rel_block_to_key(rel, blk.blkno);
if !shard.is_key_local(&key) {
continue;
}
if Self::block_is_image(decoded, blk, pg_version) {
// 4 bytes for the Value::Image discriminator
// 8 bytes for encoding the size of the buffer
// BLCKSZ for the raw image
estimate += (4 + 8 + BLCKSZ) as usize;
} else {
// 4 bytes for the Value::WalRecord discriminator
// 4 bytes for the NeonWalRecord::Postgres discriminator
// 1 bytes for NeonWalRecord::Postgres::will_init
// 8 bytes for encoding the size of the buffer
// length of the raw record
estimate += 8 + 1 + 8 + decoded.record.len();
}
}
estimate
}
fn block_is_image(decoded: &DecodedWALRecord, blk: &DecodedBkpBlock, pg_version: u32) -> bool {
blk.apply_image
&& blk.has_image
&& decoded.xl_rmid == pg_constants::RM_XLOG_ID
&& (decoded.xl_info == pg_constants::XLOG_FPI
|| decoded.xl_info == pg_constants::XLOG_FPI_FOR_HINT)
// compression of WAL is not yet supported: fall back to storing the original WAL record
&& !postgres_ffi::bkpimage_is_compressed(blk.bimg_info, pg_version)
// do not materialize null pages because them most likely be soon replaced with real data
&& blk.bimg_len != 0
}
/// Encode a list of values and metadata into a serialized batch
///
/// This is used by the pageserver ingest code to conveniently generate
/// batches for metadata writes.
pub fn from_values(batch: Vec<(CompactKey, Lsn, usize, Value)>) -> Self {
// Pre-allocate a big flat buffer to write into. This should be large but not huge: it is soft-limited in practice by
// [`crate::pgdatadir_mapping::DatadirModification::MAX_PENDING_BYTES`]
let buffer_size = batch.iter().map(|i| i.2).sum::<usize>();
let mut buf = Vec::<u8>::with_capacity(buffer_size);
let mut metadata: Vec<ValueMeta> = Vec::with_capacity(batch.len());
let mut max_lsn: Lsn = Lsn(0);
let len = batch.len();
for (key, lsn, val_ser_size, val) in batch {
let relative_off = buf.len() as u64;
val.ser_into(&mut buf)
.expect("Writing into in-memory buffer is infallible");
metadata.push(ValueMeta::Serialized(SerializedValueMeta {
key,
lsn,
batch_offset: relative_off,
len: val_ser_size,
will_init: val.will_init(),
}));
max_lsn = std::cmp::max(max_lsn, lsn);
}
// Assert that we didn't do any extra allocations while building buffer.
debug_assert!(buf.len() <= buffer_size);
if cfg!(any(debug_assertions, test)) {
let batch = Self {
raw: buf,
metadata,
max_lsn,
len,
};
batch.validate_lsn_order();
return batch;
}
Self {
raw: buf,
metadata,
max_lsn,
len,
}
}
/// Add one value to the batch
///
/// This is used by the pageserver ingest code to include metadata block
/// updates for a single key.
pub fn put(&mut self, key: CompactKey, value: Value, lsn: Lsn) {
let relative_off = self.raw.len() as u64;
value.ser_into(&mut self.raw).unwrap();
let val_ser_size = self.raw.len() - relative_off as usize;
self.metadata
.push(ValueMeta::Serialized(SerializedValueMeta {
key,
lsn,
batch_offset: relative_off,
len: val_ser_size,
will_init: value.will_init(),
}));
self.max_lsn = std::cmp::max(self.max_lsn, lsn);
self.len += 1;
if cfg!(any(debug_assertions, test)) {
self.validate_lsn_order();
}
}
/// Extend with the contents of another batch
///
/// One batch is generated for each decoded PG WAL record.
/// They are then merged to accumulate reasonably sized writes.
pub fn extend(&mut self, mut other: SerializedValueBatch) {
let extend_batch_start_offset = self.raw.len() as u64;
self.raw.extend(other.raw);
// Shift the offsets in the batch we are extending with
other.metadata.iter_mut().for_each(|meta| match meta {
ValueMeta::Serialized(ser) => {
ser.batch_offset += extend_batch_start_offset;
if cfg!(debug_assertions) {
let value_end = ser.batch_offset + ser.len as u64;
assert!((value_end as usize) <= self.raw.len());
}
}
ValueMeta::Observed(_) => {}
});
self.metadata.extend(other.metadata);
self.max_lsn = std::cmp::max(self.max_lsn, other.max_lsn);
self.len += other.len;
if cfg!(any(debug_assertions, test)) {
self.validate_lsn_order();
}
}
/// Add zero images for the (key, LSN) tuples specified
///
/// PG versions below 16 do not zero out pages before extending
/// a relation and may leave gaps. Such gaps need to be identified
/// by the pageserver ingest logic and get patched up here.
///
/// Note that this function does not validate that the gaps have been
/// identified correctly (it does not know relation sizes), so it's up
/// to the call-site to do it properly.
pub fn zero_gaps(&mut self, gaps: Vec<(KeySpace, Lsn)>) {
// Implementation note:
//
// Values within [`SerializedValueBatch::raw`] do not have any ordering requirements,
// but the metadata entries should be ordered properly (see
// [`SerializedValueBatch::metadata`]).
//
// Exploiting this observation we do:
// 1. Drain all the metadata entries into an ordered set.
// The use of a BTreeSet keyed by (Key, Lsn) relies on the observation that Postgres never
// includes more than one update to the same block in the same WAL record.
// 2. For each (key, LSN) gap tuple, append a zero image to the raw buffer
// and add an index entry to the ordered metadata set.
// 3. Drain the ordered set back into a metadata vector
let mut ordered_metas = self
.metadata
.drain(..)
.map(OrderedValueMeta)
.collect::<BTreeSet<_>>();
for (keyspace, lsn) in gaps {
self.max_lsn = std::cmp::max(self.max_lsn, lsn);
for gap_range in keyspace.ranges {
let mut key = gap_range.start;
while key != gap_range.end {
let relative_off = self.raw.len() as u64;
// TODO(vlad): Can we be cheeky and write only one zero image, and
// make all index entries requiring a zero page point to it?
// Alternatively, we can change the index entry format to represent zero pages
// without writing them at all.
Value::Image(ZERO_PAGE.clone())
.ser_into(&mut self.raw)
.unwrap();
let val_ser_size = self.raw.len() - relative_off as usize;
ordered_metas.insert(OrderedValueMeta(ValueMeta::Serialized(
SerializedValueMeta {
key: key.to_compact(),
lsn,
batch_offset: relative_off,
len: val_ser_size,
will_init: true,
},
)));
self.len += 1;
key = key.next();
}
}
}
self.metadata = ordered_metas.into_iter().map(|ord| ord.0).collect();
if cfg!(any(debug_assertions, test)) {
self.validate_lsn_order();
}
}
/// Checks if the batch is empty
///
/// A batch is empty when it contains no serialized values.
/// Note that it may still contain observed values.
pub fn is_empty(&self) -> bool {
let empty = self.raw.is_empty();
if cfg!(debug_assertions) && empty {
assert!(self
.metadata
.iter()
.all(|meta| matches!(meta, ValueMeta::Observed(_))));
}
empty
}
/// Returns the number of values serialized in the batch
pub fn len(&self) -> usize {
self.len
}
/// Returns the size of the buffer wrapped by the batch
pub fn buffer_size(&self) -> usize {
self.raw.len()
}
pub fn updates_key(&self, key: &Key) -> bool {
self.metadata.iter().any(|meta| match meta {
ValueMeta::Serialized(ser) => key.to_compact() == ser.key,
ValueMeta::Observed(_) => false,
})
}
pub fn validate_lsn_order(&self) {
use std::collections::HashMap;
let mut last_seen_lsn_per_key: HashMap<CompactKey, Lsn> = HashMap::default();
for meta in self.metadata.iter() {
let lsn = meta.lsn();
let key = meta.key();
if let Some(prev_lsn) = last_seen_lsn_per_key.insert(key, lsn) {
assert!(
lsn >= prev_lsn,
"Ordering violated by {}: {} < {}",
Key::from_compact(key),
lsn,
prev_lsn
);
}
}
}
}
#[cfg(all(test, feature = "testing"))]
mod tests {
use super::*;
fn validate_batch(
batch: &SerializedValueBatch,
values: &[(CompactKey, Lsn, usize, Value)],
gaps: Option<&Vec<(KeySpace, Lsn)>>,
) {
// Invariant 1: The metadata for a given entry in the batch
// is correct and can be used to deserialize back to the original value.
for (key, lsn, size, value) in values.iter() {
let meta = batch
.metadata
.iter()
.find(|meta| (meta.key(), meta.lsn()) == (*key, *lsn))
.unwrap();
let meta = match meta {
ValueMeta::Serialized(ser) => ser,
ValueMeta::Observed(_) => unreachable!(),
};
assert_eq!(meta.len, *size);
assert_eq!(meta.will_init, value.will_init());
let start = meta.batch_offset as usize;
let end = meta.batch_offset as usize + meta.len;
let value_from_batch = Value::des(&batch.raw[start..end]).unwrap();
assert_eq!(&value_from_batch, value);
}
let mut expected_buffer_size: usize = values.iter().map(|(_, _, size, _)| size).sum();
let mut gap_pages_count: usize = 0;
// Invariant 2: Zero pages were added for identified gaps and their metadata
// is correct.
if let Some(gaps) = gaps {
for (gap_keyspace, lsn) in gaps {
for gap_range in &gap_keyspace.ranges {
let mut gap_key = gap_range.start;
while gap_key != gap_range.end {
let meta = batch
.metadata
.iter()
.find(|meta| (meta.key(), meta.lsn()) == (gap_key.to_compact(), *lsn))
.unwrap();
let meta = match meta {
ValueMeta::Serialized(ser) => ser,
ValueMeta::Observed(_) => unreachable!(),
};
let zero_value = Value::Image(ZERO_PAGE.clone());
let zero_value_size = zero_value.serialized_size().unwrap() as usize;
assert_eq!(meta.len, zero_value_size);
assert_eq!(meta.will_init, zero_value.will_init());
let start = meta.batch_offset as usize;
let end = meta.batch_offset as usize + meta.len;
let value_from_batch = Value::des(&batch.raw[start..end]).unwrap();
assert_eq!(value_from_batch, zero_value);
gap_pages_count += 1;
expected_buffer_size += zero_value_size;
gap_key = gap_key.next();
}
}
}
}
// Invariant 3: The length of the batch is equal to the number
// of values inserted, plus the number of gap pages. This extends
// to the raw buffer size.
assert_eq!(batch.len(), values.len() + gap_pages_count);
assert_eq!(expected_buffer_size, batch.buffer_size());
// Invariant 4: Metadata entries for any given key are sorted in LSN order.
batch.validate_lsn_order();
}
#[test]
fn test_creation_from_values() {
const LSN: Lsn = Lsn(0x10);
let key = Key::from_hex("110000000033333333444444445500000001").unwrap();
let values = vec![
(
key.to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("foo")),
),
(
key.next().to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("bar")),
),
(
key.to_compact(),
Lsn(LSN.0 + 0x10),
Value::WalRecord(NeonWalRecord::wal_append("baz")),
),
(
key.next().next().to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("taz")),
),
];
let values = values
.into_iter()
.map(|(key, lsn, value)| (key, lsn, value.serialized_size().unwrap() as usize, value))
.collect::<Vec<_>>();
let batch = SerializedValueBatch::from_values(values.clone());
validate_batch(&batch, &values, None);
assert!(!batch.is_empty());
}
#[test]
fn test_put() {
const LSN: Lsn = Lsn(0x10);
let key = Key::from_hex("110000000033333333444444445500000001").unwrap();
let values = vec![
(
key.to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("foo")),
),
(
key.next().to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("bar")),
),
];
let mut values = values
.into_iter()
.map(|(key, lsn, value)| (key, lsn, value.serialized_size().unwrap() as usize, value))
.collect::<Vec<_>>();
let mut batch = SerializedValueBatch::from_values(values.clone());
validate_batch(&batch, &values, None);
let value = (
key.to_compact(),
Lsn(LSN.0 + 0x10),
Value::WalRecord(NeonWalRecord::wal_append("baz")),
);
let serialized_size = value.2.serialized_size().unwrap() as usize;
let value = (value.0, value.1, serialized_size, value.2);
values.push(value.clone());
batch.put(value.0, value.3, value.1);
validate_batch(&batch, &values, None);
let value = (
key.next().next().to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("taz")),
);
let serialized_size = value.2.serialized_size().unwrap() as usize;
let value = (value.0, value.1, serialized_size, value.2);
values.push(value.clone());
batch.put(value.0, value.3, value.1);
validate_batch(&batch, &values, None);
}
#[test]
fn test_extension() {
const LSN: Lsn = Lsn(0x10);
let key = Key::from_hex("110000000033333333444444445500000001").unwrap();
let values = vec![
(
key.to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("foo")),
),
(
key.next().to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("bar")),
),
(
key.next().next().to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("taz")),
),
];
let mut values = values
.into_iter()
.map(|(key, lsn, value)| (key, lsn, value.serialized_size().unwrap() as usize, value))
.collect::<Vec<_>>();
let mut batch = SerializedValueBatch::from_values(values.clone());
let other_values = vec![
(
key.to_compact(),
Lsn(LSN.0 + 0x10),
Value::WalRecord(NeonWalRecord::wal_append("foo")),
),
(
key.next().to_compact(),
Lsn(LSN.0 + 0x10),
Value::WalRecord(NeonWalRecord::wal_append("bar")),
),
(
key.next().next().to_compact(),
Lsn(LSN.0 + 0x10),
Value::WalRecord(NeonWalRecord::wal_append("taz")),
),
];
let other_values = other_values
.into_iter()
.map(|(key, lsn, value)| (key, lsn, value.serialized_size().unwrap() as usize, value))
.collect::<Vec<_>>();
let other_batch = SerializedValueBatch::from_values(other_values.clone());
values.extend(other_values);
batch.extend(other_batch);
validate_batch(&batch, &values, None);
}
#[test]
fn test_gap_zeroing() {
const LSN: Lsn = Lsn(0x10);
let rel_foo_base_key = Key::from_hex("110000000033333333444444445500000001").unwrap();
let rel_bar_base_key = {
let mut key = rel_foo_base_key;
key.field4 += 1;
key
};
let values = vec![
(
rel_foo_base_key.to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("foo1")),
),
(
rel_foo_base_key.add(1).to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("foo2")),
),
(
rel_foo_base_key.add(5).to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("foo3")),
),
(
rel_foo_base_key.add(1).to_compact(),
Lsn(LSN.0 + 0x10),
Value::WalRecord(NeonWalRecord::wal_append("foo4")),
),
(
rel_foo_base_key.add(10).to_compact(),
Lsn(LSN.0 + 0x10),
Value::WalRecord(NeonWalRecord::wal_append("foo5")),
),
(
rel_foo_base_key.add(11).to_compact(),
Lsn(LSN.0 + 0x10),
Value::WalRecord(NeonWalRecord::wal_append("foo6")),
),
(
rel_foo_base_key.add(12).to_compact(),
Lsn(LSN.0 + 0x10),
Value::WalRecord(NeonWalRecord::wal_append("foo7")),
),
(
rel_bar_base_key.to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("bar1")),
),
(
rel_bar_base_key.add(4).to_compact(),
LSN,
Value::WalRecord(NeonWalRecord::wal_append("bar2")),
),
];
let values = values
.into_iter()
.map(|(key, lsn, value)| (key, lsn, value.serialized_size().unwrap() as usize, value))
.collect::<Vec<_>>();
let mut batch = SerializedValueBatch::from_values(values.clone());
let gaps = vec![
(
KeySpace {
ranges: vec![
rel_foo_base_key.add(2)..rel_foo_base_key.add(5),
rel_bar_base_key.add(1)..rel_bar_base_key.add(4),
],
},
LSN,
),
(
KeySpace {
ranges: vec![rel_foo_base_key.add(6)..rel_foo_base_key.add(10)],
},
Lsn(LSN.0 + 0x10),
),
];
batch.zero_gaps(gaps.clone());
validate_batch(&batch, &values, Some(&gaps));
}
}

View File

@@ -8,7 +8,7 @@ license.workspace = true
default = []
# Enables test-only APIs, incuding failpoints. In particular, enables the `fail_point!` macro,
# which adds some runtime cost to run tests on outage conditions
testing = ["fail/failpoints", "pageserver_api/testing" ]
testing = ["fail/failpoints", "pageserver_api/testing", "wal_decoder/testing"]
[dependencies]
anyhow.workspace = true
@@ -83,6 +83,7 @@ enum-map.workspace = true
enumset = { workspace = true, features = ["serde"]}
strum.workspace = true
strum_macros.workspace = true
wal_decoder.workspace = true
[target.'cfg(target_os = "linux")'.dependencies]
procfs.workspace = true

View File

@@ -8,17 +8,16 @@ use pageserver::{
context::{DownloadBehavior, RequestContext},
l0_flush::{L0FlushConfig, L0FlushGlobalState},
page_cache,
repository::Value,
task_mgr::TaskKind,
tenant::storage_layer::inmemory_layer::SerializedBatch,
tenant::storage_layer::InMemoryLayer,
virtual_file,
};
use pageserver_api::{key::Key, shard::TenantShardId};
use pageserver_api::{key::Key, shard::TenantShardId, value::Value};
use utils::{
bin_ser::BeSer,
id::{TenantId, TimelineId},
};
use wal_decoder::serialized_batch::SerializedValueBatch;
// A very cheap hash for generating non-sequential keys.
fn murmurhash32(mut h: u32) -> u32 {
@@ -103,13 +102,13 @@ async fn ingest(
batch.push((key.to_compact(), lsn, data_ser_size, data.clone()));
if batch.len() >= BATCH_SIZE {
let this_batch = std::mem::take(&mut batch);
let serialized = SerializedBatch::from_values(this_batch).unwrap();
let serialized = SerializedValueBatch::from_values(this_batch);
layer.put_batch(serialized, &ctx).await?;
}
}
if !batch.is_empty() {
let this_batch = std::mem::take(&mut batch);
let serialized = SerializedBatch::from_values(this_batch).unwrap();
let serialized = SerializedValueBatch::from_values(this_batch);
layer.put_batch(serialized, &ctx).await?;
}
layer.freeze(lsn + 1).await;

View File

@@ -1,9 +1,9 @@
use criterion::measurement::WallTime;
use pageserver::keyspace::{KeyPartitioning, KeySpace};
use pageserver::repository::Key;
use pageserver::tenant::layer_map::LayerMap;
use pageserver::tenant::storage_layer::LayerName;
use pageserver::tenant::storage_layer::PersistentLayerDesc;
use pageserver_api::key::Key;
use pageserver_api::shard::TenantShardId;
use rand::prelude::{SeedableRng, SliceRandom, StdRng};
use std::cmp::{max, min};

View File

@@ -60,7 +60,8 @@ use anyhow::Context;
use bytes::{Buf, Bytes};
use criterion::{BenchmarkId, Criterion};
use once_cell::sync::Lazy;
use pageserver::{config::PageServerConf, walrecord::NeonWalRecord, walredo::PostgresRedoManager};
use pageserver::{config::PageServerConf, walredo::PostgresRedoManager};
use pageserver_api::record::NeonWalRecord;
use pageserver_api::{key::Key, shard::TenantShardId};
use std::{
future::Future,

View File

@@ -35,6 +35,15 @@ pub fn overlaps_with<T: Ord>(a: &Range<T>, b: &Range<T>) -> bool {
!(a.end <= b.start || b.end <= a.start)
}
/// Whether a fully contains b, example as below
/// ```plain
/// | a |
/// | b |
/// ```
pub fn fully_contains<T: Ord>(a: &Range<T>, b: &Range<T>) -> bool {
a.start <= b.start && a.end >= b.end
}
pub fn union_to_keyspace<K: Ord>(a: &mut CompactionKeySpace<K>, b: CompactionKeySpace<K>) {
let x = std::mem::take(a);
let mut all_ranges_iter = [x.into_iter(), b.into_iter()]

View File

@@ -51,7 +51,7 @@
//!
use anyhow::{Context, Result};
use pageserver::repository::Key;
use pageserver_api::key::Key;
use std::cmp::Ordering;
use std::io::{self, BufRead};
use std::path::PathBuf;

View File

@@ -2,7 +2,7 @@
//!
//! Currently it only analyzes holes, which are regions within the layer range that the layer contains no updates for. In the future it might do more analysis (maybe key quantiles?) but it should never return sensitive data.
use anyhow::Result;
use anyhow::{anyhow, Result};
use camino::{Utf8Path, Utf8PathBuf};
use pageserver::context::{DownloadBehavior, RequestContext};
use pageserver::task_mgr::TaskKind;
@@ -11,15 +11,16 @@ use pageserver::virtual_file::api::IoMode;
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::ops::Range;
use std::str::FromStr;
use std::{fs, str};
use pageserver::page_cache::{self, PAGE_SZ};
use pageserver::repository::{Key, KEY_SIZE};
use pageserver::tenant::block_io::FileBlockReader;
use pageserver::tenant::disk_btree::{DiskBtreeReader, VisitDirection};
use pageserver::tenant::storage_layer::delta_layer::{Summary, DELTA_KEY_SIZE};
use pageserver::tenant::storage_layer::range_overlaps;
use pageserver::tenant::storage_layer::{range_overlaps, LayerName};
use pageserver::virtual_file::{self, VirtualFile};
use pageserver_api::key::{Key, KEY_SIZE};
use utils::{bin_ser::BeSer, lsn::Lsn};
@@ -74,35 +75,15 @@ impl LayerFile {
}
}
pub(crate) fn parse_filename(name: &str) -> Option<LayerFile> {
let split: Vec<&str> = name.split("__").collect();
if split.len() != 2 {
return None;
}
let keys: Vec<&str> = split[0].split('-').collect();
let lsn_and_opt_generation: Vec<&str> = split[1].split('v').collect();
let lsns: Vec<&str> = lsn_and_opt_generation[0].split('-').collect();
let the_lsns: [&str; 2];
pub(crate) fn parse_filename(name: &str) -> anyhow::Result<LayerFile> {
let layer_name =
LayerName::from_str(name).map_err(|e| anyhow!("failed to parse layer name: {e}"))?;
/*
* Generations add a -vX-XXXXXX postfix, which causes issues when we try to
* parse 'vX' as an LSN.
*/
let is_delta = if lsns.len() == 1 || lsns[1].is_empty() {
the_lsns = [lsns[0], lsns[0]];
false
} else {
the_lsns = [lsns[0], lsns[1]];
true
};
let key_range = Key::from_hex(keys[0]).unwrap()..Key::from_hex(keys[1]).unwrap();
let lsn_range = Lsn::from_hex(the_lsns[0]).unwrap()..Lsn::from_hex(the_lsns[1]).unwrap();
let holes = Vec::new();
Some(LayerFile {
key_range,
lsn_range,
is_delta,
Ok(LayerFile {
key_range: layer_name.key_range().clone(),
lsn_range: layer_name.lsn_as_range(),
is_delta: layer_name.is_delta(),
holes,
})
}
@@ -179,7 +160,7 @@ pub(crate) async fn main(cmd: &AnalyzeLayerMapCmd) -> Result<()> {
for layer in fs::read_dir(timeline.path())? {
let layer = layer?;
if let Some(mut layer_file) =
if let Ok(mut layer_file) =
parse_filename(&layer.file_name().into_string().unwrap())
{
if layer_file.is_delta {

View File

@@ -5,24 +5,12 @@ use camino::{Utf8Path, Utf8PathBuf};
use clap::Subcommand;
use pageserver::context::{DownloadBehavior, RequestContext};
use pageserver::task_mgr::TaskKind;
use pageserver::tenant::block_io::BlockCursor;
use pageserver::tenant::disk_btree::DiskBtreeReader;
use pageserver::tenant::storage_layer::delta_layer::{BlobRef, Summary};
use pageserver::tenant::storage_layer::{delta_layer, image_layer};
use pageserver::tenant::storage_layer::{DeltaLayer, ImageLayer};
use pageserver::tenant::{TENANTS_SEGMENT_NAME, TIMELINES_SEGMENT_NAME};
use pageserver::virtual_file::api::IoMode;
use pageserver::{page_cache, virtual_file};
use pageserver::{
repository::{Key, KEY_SIZE},
tenant::{
block_io::FileBlockReader, disk_btree::VisitDirection,
storage_layer::delta_layer::DELTA_KEY_SIZE,
},
virtual_file::VirtualFile,
};
use std::fs;
use utils::bin_ser::BeSer;
use std::fs::{self, File};
use utils::id::{TenantId, TimelineId};
use crate::layer_map_analyzer::parse_filename;
@@ -59,44 +47,30 @@ pub(crate) enum LayerCmd {
}
async fn read_delta_file(path: impl AsRef<Path>, ctx: &RequestContext) -> Result<()> {
let path = Utf8Path::from_path(path.as_ref()).expect("non-Unicode path");
virtual_file::init(
10,
virtual_file::api::IoEngineKind::StdFs,
IoMode::preferred(),
);
page_cache::init(100);
let file = VirtualFile::open(path, ctx).await?;
let file_id = page_cache::next_file_id();
let block_reader = FileBlockReader::new(&file, file_id);
let summary_blk = block_reader.read_blk(0, ctx).await?;
let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
actual_summary.index_start_blk,
actual_summary.index_root_blk,
&block_reader,
let path = Utf8Path::from_path(path.as_ref()).expect("non-Unicode path");
let file = File::open(path)?;
let delta_layer = DeltaLayer::new_for_path(path, file)?;
delta_layer.dump(true, ctx).await?;
Ok(())
}
async fn read_image_file(path: impl AsRef<Path>, ctx: &RequestContext) -> Result<()> {
virtual_file::init(
10,
virtual_file::api::IoEngineKind::StdFs,
IoMode::preferred(),
);
// TODO(chi): dedup w/ `delta_layer.rs` by exposing the API.
let mut all = vec![];
tree_reader
.visit(
&[0u8; DELTA_KEY_SIZE],
VisitDirection::Forwards,
|key, value_offset| {
let curr = Key::from_slice(&key[..KEY_SIZE]);
all.push((curr, BlobRef(value_offset)));
true
},
ctx,
)
.await?;
let cursor = BlockCursor::new_fileblockreader(&block_reader);
for (k, v) in all {
let value = cursor.read_blob(v.pos(), ctx).await?;
println!("key:{} value_len:{}", k, value.len());
assert!(k.is_i128_representable(), "invalid key: ");
}
// TODO(chi): special handling for last key?
page_cache::init(100);
let path = Utf8Path::from_path(path.as_ref()).expect("non-Unicode path");
let file = File::open(path)?;
let image_layer = ImageLayer::new_for_path(path, file)?;
image_layer.dump(true, ctx).await?;
Ok(())
}
@@ -133,8 +107,7 @@ pub(crate) async fn main(cmd: &LayerCmd) -> Result<()> {
let mut idx = 0;
for layer in fs::read_dir(timeline_path)? {
let layer = layer?;
if let Some(layer_file) = parse_filename(&layer.file_name().into_string().unwrap())
{
if let Ok(layer_file) = parse_filename(&layer.file_name().into_string().unwrap()) {
println!(
"[{:3}] key:{}-{}\n lsn:{}-{}\n delta:{}",
idx,
@@ -163,8 +136,7 @@ pub(crate) async fn main(cmd: &LayerCmd) -> Result<()> {
let mut idx = 0;
for layer in fs::read_dir(timeline_path)? {
let layer = layer?;
if let Some(layer_file) = parse_filename(&layer.file_name().into_string().unwrap())
{
if let Ok(layer_file) = parse_filename(&layer.file_name().into_string().unwrap()) {
if *id == idx {
// TODO(chi): dedup code
println!(
@@ -180,7 +152,7 @@ pub(crate) async fn main(cmd: &LayerCmd) -> Result<()> {
if layer_file.is_delta {
read_delta_file(layer.path(), &ctx).await?;
} else {
anyhow::bail!("not supported yet :(");
read_image_file(layer.path(), &ctx).await?;
}
break;

View File

@@ -1,4 +1,4 @@
use pageserver_api::models::{AuxFilePolicy, TenantConfig, TenantConfigRequest};
use pageserver_api::models::{TenantConfig, TenantConfigRequest};
use pageserver_api::shard::TenantShardId;
use utils::id::TenantTimelineId;
use utils::lsn::Lsn;
@@ -66,10 +66,7 @@ async fn main_impl(args: Args) -> anyhow::Result<()> {
mgmt_api_client
.tenant_config(&TenantConfigRequest {
tenant_id: timeline.tenant_id,
config: TenantConfig {
switch_aux_file_policy: Some(AuxFilePolicy::V2),
..Default::default()
},
config: TenantConfig::default(),
})
.await?;

View File

@@ -19,7 +19,8 @@ pub fn check_permission(claims: &Claims, tenant_id: Option<TenantId>) -> Result<
| Scope::SafekeeperData
| Scope::GenerationsApi
| Scope::Infra
| Scope::Scrubber,
| Scope::Scrubber
| Scope::ControllerPeer,
_,
) => Err(AuthError(
format!(

View File

@@ -59,6 +59,7 @@ pub async fn send_basebackup_tarball<'a, W>(
req_lsn: Option<Lsn>,
prev_lsn: Option<Lsn>,
full_backup: bool,
replica: bool,
ctx: &'a RequestContext,
) -> Result<(), BasebackupError>
where
@@ -110,8 +111,8 @@ where
};
info!(
"taking basebackup lsn={}, prev_lsn={} (full_backup={})",
backup_lsn, prev_lsn, full_backup
"taking basebackup lsn={}, prev_lsn={} (full_backup={}, replica={})",
backup_lsn, prev_lsn, full_backup, replica
);
let basebackup = Basebackup {
@@ -120,6 +121,7 @@ where
lsn: backup_lsn,
prev_record_lsn: prev_lsn,
full_backup,
replica,
ctx,
};
basebackup
@@ -140,6 +142,7 @@ where
lsn: Lsn,
prev_record_lsn: Lsn,
full_backup: bool,
replica: bool,
ctx: &'a RequestContext,
}
@@ -372,6 +375,10 @@ where
for (path, content) in aux_files {
if path.starts_with("pg_replslot") {
// Do not create LR slots at standby because they are not used but prevent WAL truncation
if self.replica {
continue;
}
let offs = pg_constants::REPL_SLOT_ON_DISK_OFFSETOF_RESTART_LSN;
let restart_lsn = Lsn(u64::from_le_bytes(
content[offs..offs + 8].try_into().unwrap(),

View File

@@ -154,13 +154,17 @@ fn main() -> anyhow::Result<()> {
},
};
let started = Instant::now();
syncfs(dirfd)?;
let elapsed = started.elapsed();
info!(
elapsed_ms = elapsed.as_millis(),
"made tenant directory contents durable"
);
if conf.no_sync {
info!("Skipping syncfs on startup");
} else {
let started = Instant::now();
syncfs(dirfd)?;
let elapsed = started.elapsed();
info!(
elapsed_ms = elapsed.as_millis(),
"made tenant directory contents durable"
);
}
}
// Initialize up failpoints support
@@ -398,9 +402,7 @@ fn start_pageserver(
ControllerUpcallClient::new(conf, &shutdown_pageserver),
conf,
);
if let Some(deletion_workers) = deletion_workers {
deletion_workers.spawn_with(BACKGROUND_RUNTIME.handle());
}
deletion_workers.spawn_with(BACKGROUND_RUNTIME.handle());
// Up to this point no significant I/O has been done: this should have been fast. Record
// duration prior to starting I/O intensive phase of startup.

View File

@@ -69,6 +69,7 @@ pub struct PageServerConf {
pub wal_redo_timeout: Duration,
pub superuser: String,
pub locale: String,
pub page_cache_size: usize,
pub max_file_descriptors: usize,
@@ -178,6 +179,9 @@ pub struct PageServerConf {
/// Direct IO settings
pub virtual_file_io_mode: virtual_file::IoMode,
/// Optionally disable disk syncs (unsafe!)
pub no_sync: bool,
}
/// Token for authentication to safekeepers
@@ -298,6 +302,7 @@ impl PageServerConf {
wait_lsn_timeout,
wal_redo_timeout,
superuser,
locale,
page_cache_size,
max_file_descriptors,
pg_distrib_dir,
@@ -332,6 +337,7 @@ impl PageServerConf {
concurrent_tenant_size_logical_size_queries,
virtual_file_io_engine,
tenant_config,
no_sync,
} = config_toml;
let mut conf = PageServerConf {
@@ -344,6 +350,7 @@ impl PageServerConf {
wait_lsn_timeout,
wal_redo_timeout,
superuser,
locale,
page_cache_size,
max_file_descriptors,
http_auth_type,
@@ -409,6 +416,7 @@ impl PageServerConf {
.map(crate::l0_flush::L0FlushConfig::from)
.unwrap_or_default(),
virtual_file_io_mode: virtual_file_io_mode.unwrap_or(virtual_file::IoMode::preferred()),
no_sync: no_sync.unwrap_or(false),
};
// ------------------------------------------------------------

View File

@@ -14,6 +14,7 @@ use itertools::Itertools as _;
use pageserver_api::models::TenantState;
use remote_storage::{GenericRemoteStorage, RemoteStorageConfig};
use reqwest::Url;
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, SystemTime};
@@ -35,12 +36,62 @@ const DEFAULT_HTTP_REPORTING_TIMEOUT: Duration = Duration::from_secs(60);
/// upload attempts.
type RawMetric = (MetricsKey, (EventType, u64));
/// The new serializable metrics format
#[derive(Serialize, Deserialize)]
struct NewMetricsRoot {
version: usize,
metrics: Vec<NewRawMetric>,
}
impl NewMetricsRoot {
pub fn is_v2_metrics(json_value: &serde_json::Value) -> bool {
if let Some(ver) = json_value.get("version") {
if let Some(2) = ver.as_u64() {
return true;
}
}
false
}
}
/// The new serializable metrics format
#[derive(Serialize)]
struct NewMetricsRefRoot<'a> {
version: usize,
metrics: &'a [NewRawMetric],
}
impl<'a> NewMetricsRefRoot<'a> {
fn new(metrics: &'a [NewRawMetric]) -> Self {
Self {
version: 2,
metrics,
}
}
}
/// The new serializable metrics format
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq)]
struct NewRawMetric {
key: MetricsKey,
kind: EventType,
value: u64,
// TODO: add generation field and check against generations
}
impl NewRawMetric {
#[cfg(test)]
fn to_kv_pair(&self) -> (MetricsKey, NewRawMetric) {
(self.key, self.clone())
}
}
/// Caches the [`RawMetric`]s
///
/// In practice, during startup, last sent values are stored here to be used in calculating new
/// ones. After successful uploading, the cached values are updated to cache. This used to be used
/// for deduplication, but that is no longer needed.
type Cache = HashMap<MetricsKey, (EventType, u64)>;
type Cache = HashMap<MetricsKey, NewRawMetric>;
pub async fn run(
conf: &'static PageServerConf,
@@ -231,11 +282,14 @@ async fn restore_and_reschedule(
// collect_all_metrics
let earlier_metric_at = found_some
.iter()
.map(|(_, (et, _))| et.recorded_at())
.map(|item| item.kind.recorded_at())
.copied()
.next();
let cached = found_some.into_iter().collect::<Cache>();
let cached = found_some
.into_iter()
.map(|item| (item.key, item))
.collect::<Cache>();
(cached, earlier_metric_at)
}

View File

@@ -2,11 +2,33 @@ use anyhow::Context;
use camino::{Utf8Path, Utf8PathBuf};
use std::sync::Arc;
use super::RawMetric;
use crate::consumption_metrics::NewMetricsRefRoot;
use super::{NewMetricsRoot, NewRawMetric, RawMetric};
pub(super) fn read_metrics_from_serde_value(
json_value: serde_json::Value,
) -> anyhow::Result<Vec<NewRawMetric>> {
if NewMetricsRoot::is_v2_metrics(&json_value) {
let root = serde_json::from_value::<NewMetricsRoot>(json_value)?;
Ok(root.metrics)
} else {
let all_metrics = serde_json::from_value::<Vec<RawMetric>>(json_value)?;
let all_metrics = all_metrics
.into_iter()
.map(|(key, (event_type, value))| NewRawMetric {
key,
kind: event_type,
value,
})
.collect();
Ok(all_metrics)
}
}
pub(super) async fn read_metrics_from_disk(
path: Arc<Utf8PathBuf>,
) -> anyhow::Result<Vec<RawMetric>> {
) -> anyhow::Result<Vec<NewRawMetric>> {
// do not add context to each error, callsite will log with full path
let span = tracing::Span::current();
tokio::task::spawn_blocking(move || {
@@ -20,7 +42,8 @@ pub(super) async fn read_metrics_from_disk(
let mut file = std::fs::File::open(&*path)?;
let reader = std::io::BufReader::new(&mut file);
anyhow::Ok(serde_json::from_reader::<_, Vec<RawMetric>>(reader)?)
let json_value = serde_json::from_reader::<_, serde_json::Value>(reader)?;
read_metrics_from_serde_value(json_value)
})
.await
.context("read metrics join error")
@@ -63,7 +86,7 @@ fn scan_and_delete_with_same_prefix(path: &Utf8Path) -> std::io::Result<()> {
}
pub(super) async fn flush_metrics_to_disk(
current_metrics: &Arc<Vec<RawMetric>>,
current_metrics: &Arc<Vec<NewRawMetric>>,
path: &Arc<Utf8PathBuf>,
) -> anyhow::Result<()> {
use std::io::Write;
@@ -93,8 +116,11 @@ pub(super) async fn flush_metrics_to_disk(
// write out all of the raw metrics, to be read out later on restart as cached values
{
let mut writer = std::io::BufWriter::new(&mut tempfile);
serde_json::to_writer(&mut writer, &*current_metrics)
.context("serialize metrics")?;
serde_json::to_writer(
&mut writer,
&NewMetricsRefRoot::new(current_metrics.as_ref()),
)
.context("serialize metrics")?;
writer
.into_inner()
.map_err(|_| anyhow::anyhow!("flushing metrics failed"))?;

View File

@@ -9,7 +9,7 @@ use utils::{
lsn::Lsn,
};
use super::{Cache, RawMetric};
use super::{Cache, NewRawMetric};
/// Name of the metric, used by `MetricsKey` factory methods and `deserialize_cached_events`
/// instead of static str.
@@ -64,11 +64,21 @@ impl MetricsKey {
struct AbsoluteValueFactory(MetricsKey);
impl AbsoluteValueFactory {
const fn at(self, time: DateTime<Utc>, val: u64) -> RawMetric {
#[cfg(test)]
const fn at_old_format(self, time: DateTime<Utc>, val: u64) -> super::RawMetric {
let key = self.0;
(key, (EventType::Absolute { time }, val))
}
const fn at(self, time: DateTime<Utc>, val: u64) -> NewRawMetric {
let key = self.0;
NewRawMetric {
key,
kind: EventType::Absolute { time },
value: val,
}
}
fn key(&self) -> &MetricsKey {
&self.0
}
@@ -84,7 +94,28 @@ impl IncrementalValueFactory {
prev_end: DateTime<Utc>,
up_to: DateTime<Utc>,
val: u64,
) -> RawMetric {
) -> NewRawMetric {
let key = self.0;
// cannot assert prev_end < up_to because these are realtime clock based
let when = EventType::Incremental {
start_time: prev_end,
stop_time: up_to,
};
NewRawMetric {
key,
kind: when,
value: val,
}
}
#[allow(clippy::wrong_self_convention)]
#[cfg(test)]
const fn from_until_old_format(
self,
prev_end: DateTime<Utc>,
up_to: DateTime<Utc>,
val: u64,
) -> super::RawMetric {
let key = self.0;
// cannot assert prev_end < up_to because these are realtime clock based
let when = EventType::Incremental {
@@ -185,7 +216,7 @@ pub(super) async fn collect_all_metrics(
tenant_manager: &Arc<TenantManager>,
cached_metrics: &Cache,
ctx: &RequestContext,
) -> Vec<RawMetric> {
) -> Vec<NewRawMetric> {
use pageserver_api::models::TenantState;
let started_at = std::time::Instant::now();
@@ -220,11 +251,11 @@ pub(super) async fn collect_all_metrics(
res
}
async fn collect<S>(tenants: S, cache: &Cache, ctx: &RequestContext) -> Vec<RawMetric>
async fn collect<S>(tenants: S, cache: &Cache, ctx: &RequestContext) -> Vec<NewRawMetric>
where
S: futures::stream::Stream<Item = (TenantId, Arc<crate::tenant::Tenant>)>,
{
let mut current_metrics: Vec<RawMetric> = Vec::new();
let mut current_metrics: Vec<NewRawMetric> = Vec::new();
let mut tenants = std::pin::pin!(tenants);
@@ -291,7 +322,7 @@ impl TenantSnapshot {
tenant_id: TenantId,
now: DateTime<Utc>,
cached: &Cache,
metrics: &mut Vec<RawMetric>,
metrics: &mut Vec<NewRawMetric>,
) {
let remote_size = MetricsKey::remote_storage_size(tenant_id).at(now, self.remote_size);
@@ -302,9 +333,9 @@ impl TenantSnapshot {
let mut synthetic_size = self.synthetic_size;
if synthetic_size == 0 {
if let Some((_, value)) = cached.get(factory.key()) {
// use the latest value from previous session
synthetic_size = *value;
if let Some(item) = cached.get(factory.key()) {
// use the latest value from previous session, TODO: check generation number
synthetic_size = item.value;
}
}
@@ -381,37 +412,36 @@ impl TimelineSnapshot {
tenant_id: TenantId,
timeline_id: TimelineId,
now: DateTime<Utc>,
metrics: &mut Vec<RawMetric>,
metrics: &mut Vec<NewRawMetric>,
cache: &Cache,
) {
let timeline_written_size = u64::from(self.last_record_lsn);
let written_size_delta_key = MetricsKey::written_size_delta(tenant_id, timeline_id);
let last_stop_time = cache
.get(written_size_delta_key.key())
.map(|(until, _val)| {
until
.incremental_timerange()
.expect("never create EventType::Absolute for written_size_delta")
.end
});
let last_stop_time = cache.get(written_size_delta_key.key()).map(|item| {
item.kind
.incremental_timerange()
.expect("never create EventType::Absolute for written_size_delta")
.end
});
let (key, written_size_now) =
let written_size_now =
MetricsKey::written_size(tenant_id, timeline_id).at(now, timeline_written_size);
// by default, use the last sent written_size as the basis for
// calculating the delta. if we don't yet have one, use the load time value.
let prev = cache
.get(&key)
.map(|(prev_at, prev)| {
let prev: (DateTime<Utc>, u64) = cache
.get(&written_size_now.key)
.map(|item| {
// use the prev time from our last incremental update, or default to latest
// absolute update on the first round.
let prev_at = prev_at
let prev_at = item
.kind
.absolute_time()
.expect("never create EventType::Incremental for written_size");
let prev_at = last_stop_time.unwrap_or(prev_at);
(*prev_at, *prev)
(*prev_at, item.value)
})
.unwrap_or_else(|| {
// if we don't have a previous point of comparison, compare to the load time
@@ -422,24 +452,28 @@ impl TimelineSnapshot {
let up_to = now;
if let Some(delta) = written_size_now.1.checked_sub(prev.1) {
if let Some(delta) = written_size_now.value.checked_sub(prev.1) {
let key_value = written_size_delta_key.from_until(prev.0, up_to, delta);
// written_size_delta
metrics.push(key_value);
// written_size
metrics.push((key, written_size_now));
metrics.push(written_size_now);
} else {
// the cached value was ahead of us, report zero until we've caught up
metrics.push(written_size_delta_key.from_until(prev.0, up_to, 0));
// the cached value was ahead of us, report the same until we've caught up
metrics.push((key, (written_size_now.0, prev.1)));
metrics.push(NewRawMetric {
key: written_size_now.key,
kind: written_size_now.kind,
value: prev.1,
});
}
{
let factory = MetricsKey::timeline_logical_size(tenant_id, timeline_id);
let current_or_previous = self
.current_exact_logical_size
.or_else(|| cache.get(factory.key()).map(|(_, val)| *val));
.or_else(|| cache.get(factory.key()).map(|item| item.value));
if let Some(size) = current_or_previous {
metrics.push(factory.at(now, size));
@@ -452,4 +486,4 @@ impl TimelineSnapshot {
mod tests;
#[cfg(test)]
pub(crate) use tests::metric_examples;
pub(crate) use tests::{metric_examples, metric_examples_old};

View File

@@ -1,3 +1,5 @@
use crate::consumption_metrics::RawMetric;
use super::*;
use std::collections::HashMap;
@@ -50,9 +52,9 @@ fn startup_collected_timeline_metrics_second_round() {
let disk_consistent_lsn = Lsn(initdb_lsn.0 * 2);
let mut metrics = Vec::new();
let cache = HashMap::from([
MetricsKey::written_size(tenant_id, timeline_id).at(before, disk_consistent_lsn.0)
]);
let cache = HashMap::from([MetricsKey::written_size(tenant_id, timeline_id)
.at(before, disk_consistent_lsn.0)
.to_kv_pair()]);
let snap = TimelineSnapshot {
loaded_at: (disk_consistent_lsn, init),
@@ -89,9 +91,13 @@ fn startup_collected_timeline_metrics_nth_round_at_same_lsn() {
let mut metrics = Vec::new();
let cache = HashMap::from([
// at t=before was the last time the last_record_lsn changed
MetricsKey::written_size(tenant_id, timeline_id).at(before, disk_consistent_lsn.0),
MetricsKey::written_size(tenant_id, timeline_id)
.at(before, disk_consistent_lsn.0)
.to_kv_pair(),
// end time of this event is used for the next ones
MetricsKey::written_size_delta(tenant_id, timeline_id).from_until(before, just_before, 0),
MetricsKey::written_size_delta(tenant_id, timeline_id)
.from_until(before, just_before, 0)
.to_kv_pair(),
]);
let snap = TimelineSnapshot {
@@ -138,13 +144,17 @@ fn post_restart_written_sizes_with_rolled_back_last_record_lsn() {
};
let mut cache = HashMap::from([
MetricsKey::written_size(tenant_id, timeline_id).at(before_restart, 100),
MetricsKey::written_size_delta(tenant_id, timeline_id).from_until(
way_before,
before_restart,
// not taken into account, but the timestamps are important
999_999_999,
),
MetricsKey::written_size(tenant_id, timeline_id)
.at(before_restart, 100)
.to_kv_pair(),
MetricsKey::written_size_delta(tenant_id, timeline_id)
.from_until(
way_before,
before_restart,
// not taken into account, but the timestamps are important
999_999_999,
)
.to_kv_pair(),
]);
let mut metrics = Vec::new();
@@ -163,7 +173,7 @@ fn post_restart_written_sizes_with_rolled_back_last_record_lsn() {
);
// now if we cache these metrics, and re-run while "still in recovery"
cache.extend(metrics.drain(..));
cache.extend(metrics.drain(..).map(|x| x.to_kv_pair()));
// "still in recovery", because our snapshot did not change
snap.to_metrics(tenant_id, timeline_id, later, &mut metrics, &cache);
@@ -194,14 +204,14 @@ fn post_restart_current_exact_logical_size_uses_cached() {
current_exact_logical_size: None,
};
let cache = HashMap::from([
MetricsKey::timeline_logical_size(tenant_id, timeline_id).at(before_restart, 100)
]);
let cache = HashMap::from([MetricsKey::timeline_logical_size(tenant_id, timeline_id)
.at(before_restart, 100)
.to_kv_pair()]);
let mut metrics = Vec::new();
snap.to_metrics(tenant_id, timeline_id, now, &mut metrics, &cache);
metrics.retain(|(key, _)| key.metric == Name::LogicalSize);
metrics.retain(|item| item.key.metric == Name::LogicalSize);
assert_eq!(
metrics,
@@ -224,7 +234,9 @@ fn post_restart_synthetic_size_uses_cached_if_available() {
let before_restart = DateTime::<Utc>::from(now - std::time::Duration::from_secs(5 * 60));
let now = DateTime::<Utc>::from(now);
let cached = HashMap::from([MetricsKey::synthetic_size(tenant_id).at(before_restart, 1000)]);
let cached = HashMap::from([MetricsKey::synthetic_size(tenant_id)
.at(before_restart, 1000)
.to_kv_pair()]);
let mut metrics = Vec::new();
ts.to_metrics(tenant_id, now, &cached, &mut metrics);
@@ -278,12 +290,29 @@ fn time_backwards<const N: usize>() -> [std::time::SystemTime; N] {
times
}
pub(crate) const fn metric_examples(
pub(crate) const fn metric_examples_old(
tenant_id: TenantId,
timeline_id: TimelineId,
now: DateTime<Utc>,
before: DateTime<Utc>,
) -> [RawMetric; 6] {
[
MetricsKey::written_size(tenant_id, timeline_id).at_old_format(now, 0),
MetricsKey::written_size_delta(tenant_id, timeline_id)
.from_until_old_format(before, now, 0),
MetricsKey::timeline_logical_size(tenant_id, timeline_id).at_old_format(now, 0),
MetricsKey::remote_storage_size(tenant_id).at_old_format(now, 0),
MetricsKey::resident_size(tenant_id).at_old_format(now, 0),
MetricsKey::synthetic_size(tenant_id).at_old_format(now, 1),
]
}
pub(crate) const fn metric_examples(
tenant_id: TenantId,
timeline_id: TimelineId,
now: DateTime<Utc>,
before: DateTime<Utc>,
) -> [NewRawMetric; 6] {
[
MetricsKey::written_size(tenant_id, timeline_id).at(now, 0),
MetricsKey::written_size_delta(tenant_id, timeline_id).from_until(before, now, 0),

View File

@@ -7,7 +7,7 @@ use tokio::io::AsyncWriteExt;
use tokio_util::sync::CancellationToken;
use tracing::Instrument;
use super::{metrics::Name, Cache, MetricsKey, RawMetric};
use super::{metrics::Name, Cache, MetricsKey, NewRawMetric, RawMetric};
use utils::id::{TenantId, TimelineId};
/// How the metrics from pageserver are identified.
@@ -24,7 +24,7 @@ pub(super) async fn upload_metrics_http(
client: &reqwest::Client,
metric_collection_endpoint: &reqwest::Url,
cancel: &CancellationToken,
metrics: &[RawMetric],
metrics: &[NewRawMetric],
cached_metrics: &mut Cache,
idempotency_keys: &[IdempotencyKey<'_>],
) -> anyhow::Result<()> {
@@ -53,8 +53,8 @@ pub(super) async fn upload_metrics_http(
match res {
Ok(()) => {
for (curr_key, curr_val) in chunk {
cached_metrics.insert(*curr_key, *curr_val);
for item in chunk {
cached_metrics.insert(item.key, item.clone());
}
uploaded += chunk.len();
}
@@ -86,7 +86,7 @@ pub(super) async fn upload_metrics_bucket(
client: &GenericRemoteStorage,
cancel: &CancellationToken,
node_id: &str,
metrics: &[RawMetric],
metrics: &[NewRawMetric],
idempotency_keys: &[IdempotencyKey<'_>],
) -> anyhow::Result<()> {
if metrics.is_empty() {
@@ -140,16 +140,16 @@ pub(super) async fn upload_metrics_bucket(
/// across different metrics sinks), and must have the same length as input.
fn serialize_in_chunks<'a>(
chunk_size: usize,
input: &'a [RawMetric],
input: &'a [NewRawMetric],
idempotency_keys: &'a [IdempotencyKey<'a>],
) -> impl ExactSizeIterator<Item = Result<(&'a [RawMetric], bytes::Bytes), serde_json::Error>> + 'a
) -> impl ExactSizeIterator<Item = Result<(&'a [NewRawMetric], bytes::Bytes), serde_json::Error>> + 'a
{
use bytes::BufMut;
assert_eq!(input.len(), idempotency_keys.len());
struct Iter<'a> {
inner: std::slice::Chunks<'a, RawMetric>,
inner: std::slice::Chunks<'a, NewRawMetric>,
idempotency_keys: std::slice::Iter<'a, IdempotencyKey<'a>>,
chunk_size: usize,
@@ -160,7 +160,7 @@ fn serialize_in_chunks<'a>(
}
impl<'a> Iterator for Iter<'a> {
type Item = Result<(&'a [RawMetric], bytes::Bytes), serde_json::Error>;
type Item = Result<(&'a [NewRawMetric], bytes::Bytes), serde_json::Error>;
fn next(&mut self) -> Option<Self::Item> {
let chunk = self.inner.next()?;
@@ -269,6 +269,58 @@ impl RawMetricExt for RawMetric {
}
}
impl RawMetricExt for NewRawMetric {
fn as_event(&self, key: &IdempotencyKey<'_>) -> Event<Ids, Name> {
let MetricsKey {
metric,
tenant_id,
timeline_id,
} = self.key;
let kind = self.kind;
let value = self.value;
Event {
kind,
metric,
idempotency_key: key.to_string(),
value,
extra: Ids {
tenant_id,
timeline_id,
},
}
}
fn update_in_place(&self, event: &mut Event<Ids, Name>, key: &IdempotencyKey<'_>) {
use std::fmt::Write;
let MetricsKey {
metric,
tenant_id,
timeline_id,
} = self.key;
let kind = self.kind;
let value = self.value;
*event = Event {
kind,
metric,
idempotency_key: {
event.idempotency_key.clear();
write!(event.idempotency_key, "{key}").unwrap();
std::mem::take(&mut event.idempotency_key)
},
value,
extra: Ids {
tenant_id,
timeline_id,
},
};
}
}
pub(crate) trait KeyGen<'a> {
fn generate(&self) -> IdempotencyKey<'a>;
}
@@ -381,6 +433,10 @@ async fn upload(
#[cfg(test)]
mod tests {
use crate::consumption_metrics::{
disk_cache::read_metrics_from_serde_value, NewMetricsRefRoot,
};
use super::*;
use chrono::{DateTime, Utc};
use once_cell::sync::Lazy;
@@ -473,23 +529,49 @@ mod tests {
let idempotency_key = consumption_metrics::IdempotencyKey::for_tests(*SAMPLES_NOW, "1", 0);
let examples = examples.into_iter().zip(metric_samples());
for ((line, expected), (key, (kind, value))) in examples {
for ((line, expected), item) in examples {
let e = consumption_metrics::Event {
kind,
metric: key.metric,
kind: item.kind,
metric: item.key.metric,
idempotency_key: idempotency_key.to_string(),
value,
value: item.value,
extra: Ids {
tenant_id: key.tenant_id,
timeline_id: key.timeline_id,
tenant_id: item.key.tenant_id,
timeline_id: item.key.timeline_id,
},
};
let actual = serde_json::to_string(&e).unwrap();
assert_eq!(expected, actual, "example for {kind:?} from line {line}");
assert_eq!(
expected, actual,
"example for {:?} from line {line}",
item.kind
);
}
}
fn metric_samples() -> [RawMetric; 6] {
#[test]
fn disk_format_upgrade() {
let old_samples_json = serde_json::to_value(metric_samples_old()).unwrap();
let new_samples =
serde_json::to_value(NewMetricsRefRoot::new(metric_samples().as_ref())).unwrap();
let upgraded_samples = read_metrics_from_serde_value(old_samples_json).unwrap();
let new_samples = read_metrics_from_serde_value(new_samples).unwrap();
assert_eq!(upgraded_samples, new_samples);
}
fn metric_samples_old() -> [RawMetric; 6] {
let tenant_id = TenantId::from_array([0; 16]);
let timeline_id = TimelineId::from_array([0xff; 16]);
let before = DateTime::parse_from_rfc3339("2023-09-14T00:00:00.123456789Z")
.unwrap()
.into();
let [now, before] = [*SAMPLES_NOW, before];
super::super::metrics::metric_examples_old(tenant_id, timeline_id, now, before)
}
fn metric_samples() -> [NewRawMetric; 6] {
let tenant_id = TenantId::from_array([0; 16]);
let timeline_id = TimelineId::from_array([0xff; 16]);

View File

@@ -618,13 +618,11 @@ impl DeletionQueue {
/// Caller may use the returned object to construct clients with new_client.
/// Caller should tokio::spawn the background() members of the two worker objects returned:
/// we don't spawn those inside new() so that the caller can use their runtime/spans of choice.
///
/// If remote_storage is None, then the returned workers will also be None.
pub fn new<C>(
remote_storage: GenericRemoteStorage,
controller_upcall_client: Option<C>,
conf: &'static PageServerConf,
) -> (Self, Option<DeletionQueueWorkers<C>>)
) -> (Self, DeletionQueueWorkers<C>)
where
C: ControlPlaneGenerationsApi + Send + Sync,
{
@@ -656,7 +654,7 @@ impl DeletionQueue {
},
cancel: cancel.clone(),
},
Some(DeletionQueueWorkers {
DeletionQueueWorkers {
frontend: ListWriter::new(conf, rx, backend_tx, cancel.clone()),
backend: Validator::new(
conf,
@@ -667,7 +665,7 @@ impl DeletionQueue {
cancel.clone(),
),
executor: Deleter::new(remote_storage, executor_rx, cancel.clone()),
}),
},
)
}
@@ -696,7 +694,7 @@ impl DeletionQueue {
mod test {
use camino::Utf8Path;
use hex_literal::hex;
use pageserver_api::{shard::ShardIndex, upcall_api::ReAttachResponseTenant};
use pageserver_api::{key::Key, shard::ShardIndex, upcall_api::ReAttachResponseTenant};
use std::{io::ErrorKind, time::Duration};
use tracing::info;
@@ -705,7 +703,6 @@ mod test {
use crate::{
controller_upcall_client::RetryForeverError,
repository::Key,
tenant::{harness::TenantHarness, storage_layer::DeltaLayerName},
};
@@ -743,9 +740,7 @@ mod test {
);
tracing::debug!("Spawning worker for new queue queue");
let worker_join = workers
.unwrap()
.spawn_with(&tokio::runtime::Handle::current());
let worker_join = workers.spawn_with(&tokio::runtime::Handle::current());
let old_worker_join = std::mem::replace(&mut self.worker_join, worker_join);
let old_deletion_queue = std::mem::replace(&mut self.deletion_queue, deletion_queue);
@@ -856,7 +851,6 @@ mod test {
harness.conf,
);
let worker = worker.unwrap();
let worker_join = worker.spawn_with(&tokio::runtime::Handle::current());
Ok(TestSetup {

View File

@@ -597,6 +597,10 @@ paths:
Create a timeline. Returns new timeline id on success.
Recreating the same timeline will succeed if the parameters match the existing timeline.
If no pg_version is specified, assume DEFAULT_PG_VERSION hardcoded in the pageserver.
To ensure durability, the caller must retry the creation until success.
Just because the timeline is visible via other endpoints does not mean it is durable.
Future versions may stop showing timelines that are not yet durable.
requestBody:
content:
application/json:

View File

@@ -37,7 +37,9 @@ use pageserver_api::models::TenantShardLocation;
use pageserver_api::models::TenantShardSplitRequest;
use pageserver_api::models::TenantShardSplitResponse;
use pageserver_api::models::TenantSorting;
use pageserver_api::models::TenantState;
use pageserver_api::models::TimelineArchivalConfigRequest;
use pageserver_api::models::TimelineCreateRequestMode;
use pageserver_api::models::TimelinesInfoAndOffloaded;
use pageserver_api::models::TopTenantShardItem;
use pageserver_api::models::TopTenantShardsRequest;
@@ -79,12 +81,14 @@ use crate::tenant::size::ModelInputs;
use crate::tenant::storage_layer::LayerAccessStatsReset;
use crate::tenant::storage_layer::LayerName;
use crate::tenant::timeline::offload::offload_timeline;
use crate::tenant::timeline::offload::OffloadError;
use crate::tenant::timeline::CompactFlags;
use crate::tenant::timeline::CompactionError;
use crate::tenant::timeline::Timeline;
use crate::tenant::GetTimelineError;
use crate::tenant::OffloadedTimeline;
use crate::tenant::{LogicalSizeCalculationCause, PageReconstructError};
use crate::DEFAULT_PG_VERSION;
use crate::{disk_usage_eviction_task, tenant};
use pageserver_api::models::{
StatusResponse, TenantConfigRequest, TenantInfo, TimelineCreateRequest, TimelineGcRequest,
@@ -292,6 +296,9 @@ impl From<GetActiveTenantError> for ApiError {
GetActiveTenantError::Broken(reason) => {
ApiError::InternalServerError(anyhow!("tenant is broken: {}", reason))
}
GetActiveTenantError::WillNotBecomeActive(TenantState::Stopping { .. }) => {
ApiError::ShuttingDown
}
GetActiveTenantError::WillNotBecomeActive(_) => ApiError::Conflict(format!("{}", e)),
GetActiveTenantError::Cancelled => ApiError::ShuttingDown,
GetActiveTenantError::NotFound(gte) => gte.into(),
@@ -547,6 +554,26 @@ async fn timeline_create_handler(
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
let new_timeline_id = request_data.new_timeline_id;
// fill in the default pg_version if not provided & convert request into domain model
let params: tenant::CreateTimelineParams = match request_data.mode {
TimelineCreateRequestMode::Bootstrap {
existing_initdb_timeline_id,
pg_version,
} => tenant::CreateTimelineParams::Bootstrap(tenant::CreateTimelineParamsBootstrap {
new_timeline_id,
existing_initdb_timeline_id,
pg_version: pg_version.unwrap_or(DEFAULT_PG_VERSION),
}),
TimelineCreateRequestMode::Branch {
ancestor_timeline_id,
ancestor_start_lsn,
pg_version: _,
} => tenant::CreateTimelineParams::Branch(tenant::CreateTimelineParamsBranch {
new_timeline_id,
ancestor_timeline_id,
ancestor_start_lsn,
}),
};
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Error);
@@ -559,22 +586,12 @@ async fn timeline_create_handler(
tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT).await?;
if let Some(ancestor_id) = request_data.ancestor_timeline_id.as_ref() {
tracing::info!(%ancestor_id, "starting to branch");
} else {
tracing::info!("bootstrapping");
}
// earlier versions of the code had pg_version and ancestor_lsn in the span
// => continue to provide that information, but, through a log message that doesn't require us to destructure
tracing::info!(?params, "creating timeline");
match tenant
.create_timeline(
new_timeline_id,
request_data.ancestor_timeline_id,
request_data.ancestor_start_lsn,
request_data.pg_version.unwrap_or(crate::DEFAULT_PG_VERSION),
request_data.existing_initdb_timeline_id,
state.broker_client.clone(),
&ctx,
)
.create_timeline(params, state.broker_client.clone(), &ctx)
.await
{
Ok(new_timeline) => {
@@ -625,8 +642,6 @@ async fn timeline_create_handler(
tenant_id = %tenant_shard_id.tenant_id,
shard_id = %tenant_shard_id.shard_slug(),
timeline_id = %new_timeline_id,
lsn=?request_data.ancestor_start_lsn,
pg_version=?request_data.pg_version
))
.await
}
@@ -1283,6 +1298,99 @@ async fn layer_map_info_handler(
json_response(StatusCode::OK, layer_map_info)
}
#[instrument(skip_all, fields(tenant_id, shard_id, timeline_id, layer_name))]
async fn timeline_layer_scan_disposable_keys(
request: Request<Body>,
cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
let tenant_shard_id: TenantShardId = parse_request_param(&request, "tenant_shard_id")?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
let layer_name: LayerName = parse_request_param(&request, "layer_name")?;
tracing::Span::current().record(
"tenant_id",
tracing::field::display(&tenant_shard_id.tenant_id),
);
tracing::Span::current().record(
"shard_id",
tracing::field::display(tenant_shard_id.shard_slug()),
);
tracing::Span::current().record("timeline_id", tracing::field::display(&timeline_id));
tracing::Span::current().record("layer_name", tracing::field::display(&layer_name));
let state = get_state(&request);
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
// technically the timeline need not be active for this scan to complete
let timeline =
active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id)
.await?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Download);
let guard = timeline.layers.read().await;
let Some(layer) = guard.try_get_from_key(&layer_name.clone().into()) else {
return Err(ApiError::NotFound(
anyhow::anyhow!("Layer {tenant_shard_id}/{timeline_id}/{layer_name} not found").into(),
));
};
let resident_layer = layer
.download_and_keep_resident()
.await
.map_err(|err| match err {
tenant::storage_layer::layer::DownloadError::TimelineShutdown
| tenant::storage_layer::layer::DownloadError::DownloadCancelled => {
ApiError::ShuttingDown
}
tenant::storage_layer::layer::DownloadError::ContextAndConfigReallyDeniesDownloads
| tenant::storage_layer::layer::DownloadError::DownloadRequired
| tenant::storage_layer::layer::DownloadError::NotFile(_)
| tenant::storage_layer::layer::DownloadError::DownloadFailed
| tenant::storage_layer::layer::DownloadError::PreStatFailed(_) => {
ApiError::InternalServerError(err.into())
}
#[cfg(test)]
tenant::storage_layer::layer::DownloadError::Failpoint(_) => {
ApiError::InternalServerError(err.into())
}
})?;
let keys = resident_layer
.load_keys(&ctx)
.await
.map_err(ApiError::InternalServerError)?;
let shard_identity = timeline.get_shard_identity();
let mut disposable_count = 0;
let mut not_disposable_count = 0;
let cancel = cancel.clone();
for (i, key) in keys.into_iter().enumerate() {
if shard_identity.is_key_disposable(&key) {
disposable_count += 1;
tracing::debug!(key = %key, key.dbg=?key, "disposable key");
} else {
not_disposable_count += 1;
}
#[allow(clippy::collapsible_if)]
if i % 10000 == 0 {
if cancel.is_cancelled() || timeline.cancel.is_cancelled() || timeline.is_stopping() {
return Err(ApiError::ShuttingDown);
}
}
}
json_response(
StatusCode::OK,
pageserver_api::models::ScanDisposableKeysResponse {
disposable_count,
not_disposable_count,
},
)
}
async fn layer_download_handler(
request: Request<Body>,
_cancel: CancellationToken,
@@ -1894,14 +2002,19 @@ async fn timeline_offload_handler(
"timeline has attached children".into(),
));
}
if !timeline.can_offload() {
if let (false, reason) = timeline.can_offload() {
return Err(ApiError::PreconditionFailed(
"Timeline::can_offload() returned false".into(),
format!("Timeline::can_offload() check failed: {}", reason) .into(),
));
}
offload_timeline(&tenant, &timeline)
.await
.map_err(ApiError::InternalServerError)?;
.map_err(|e| {
match e {
OffloadError::Cancelled => ApiError::ResourceUnavailable("Timeline shutting down".into()),
_ => ApiError::InternalServerError(anyhow!(e))
}
})?;
json_response(StatusCode::OK, ())
}
@@ -1957,6 +2070,7 @@ async fn timeline_checkpoint_handler(
.map_err(|e|
match e {
CompactionError::ShuttingDown => ApiError::ShuttingDown,
CompactionError::Offload(e) => ApiError::InternalServerError(anyhow::anyhow!(e)),
CompactionError::Other(e) => ApiError::InternalServerError(e)
}
)?;
@@ -2055,6 +2169,21 @@ async fn timeline_detach_ancestor_handler(
let ctx = RequestContext::new(TaskKind::DetachAncestor, DownloadBehavior::Download);
let ctx = &ctx;
// Flush the upload queues of all timelines before detaching ancestor. We do the same thing again
// during shutdown. This early upload ensures the pageserver does not need to upload too many
// things and creates downtime during timeline reloads.
for timeline in tenant.list_timelines() {
timeline
.remote_client
.wait_completion()
.await
.map_err(|e| {
ApiError::PreconditionFailed(format!("cannot drain upload queue: {e}").into())
})?;
}
tracing::info!("all timeline upload queues are drained");
let timeline = tenant.get_timeline(timeline_id, true)?;
let progress = timeline
@@ -2129,13 +2258,13 @@ async fn getpage_at_lsn_handler(
check_permission(&request, Some(tenant_shard_id.tenant_id))?;
let state = get_state(&request);
struct Key(crate::repository::Key);
struct Key(pageserver_api::key::Key);
impl std::str::FromStr for Key {
type Err = anyhow::Error;
fn from_str(s: &str) -> std::result::Result<Self, Self::Err> {
crate::repository::Key::from_hex(s).map(Key)
pageserver_api::key::Key::from_hex(s).map(Key)
}
}
@@ -3145,6 +3274,10 @@ pub fn make_router(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/layer/:layer_file_name",
|r| api_handler(r, evict_timeline_layer_handler),
)
.post(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/layer/:layer_name/scan_disposable_keys",
|r| testing_api_handler("timeline_layer_scan_disposable_keys", r, timeline_layer_scan_disposable_keys),
)
.post(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/block_gc",
|r| api_handler(r, timeline_gc_blocking_handler),

View File

@@ -12,6 +12,7 @@ use pageserver_api::key::rel_block_to_key;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio_tar::Archive;
use tracing::*;
use wal_decoder::models::InterpretedWalRecord;
use walkdir::WalkDir;
use crate::context::RequestContext;
@@ -19,8 +20,6 @@ use crate::metrics::WAL_INGEST;
use crate::pgdatadir_mapping::*;
use crate::tenant::Timeline;
use crate::walingest::WalIngest;
use crate::walrecord::decode_wal_record;
use crate::walrecord::DecodedWALRecord;
use pageserver_api::reltag::{RelTag, SlruKind};
use postgres_ffi::pg_constants;
use postgres_ffi::relfile_utils::*;
@@ -313,11 +312,15 @@ async fn import_wal(
let mut modification = tline.begin_modification(last_lsn);
while last_lsn <= endpoint {
if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
let mut decoded = DecodedWALRecord::default();
decode_wal_record(recdata, &mut decoded, tline.pg_version)?;
let interpreted = InterpretedWalRecord::from_bytes_filtered(
recdata,
tline.get_shard_identity(),
lsn,
tline.pg_version,
)?;
walingest
.ingest_record(decoded, lsn, &mut modification, ctx)
.ingest_record(interpreted, &mut modification, ctx)
.await?;
WAL_INGEST.records_committed.inc();
@@ -454,10 +457,15 @@ pub async fn import_wal_from_tar(
let mut modification = tline.begin_modification(last_lsn);
while last_lsn <= end_lsn {
if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
let mut decoded = DecodedWALRecord::default();
decode_wal_record(recdata, &mut decoded, tline.pg_version)?;
let interpreted = InterpretedWalRecord::from_bytes_filtered(
recdata,
tline.get_shard_identity(),
lsn,
tline.pg_version,
)?;
walingest
.ingest_record(decoded, lsn, &mut modification, ctx)
.ingest_record(interpreted, &mut modification, ctx)
.await?;
modification.commit(ctx).await?;
last_lsn = lsn;

View File

@@ -24,7 +24,6 @@ pub mod metrics;
pub mod page_cache;
pub mod page_service;
pub mod pgdatadir_mapping;
pub mod repository;
pub mod span;
pub(crate) mod statvfs;
pub mod task_mgr;
@@ -32,7 +31,6 @@ pub mod tenant;
pub mod utilization;
pub mod virtual_file;
pub mod walingest;
pub mod walrecord;
pub mod walredo;
use camino::Utf8Path;

View File

@@ -3040,13 +3040,111 @@ impl<F: Future<Output = Result<O, E>>, O, E> Future for MeasuredRemoteOp<F> {
}
pub mod tokio_epoll_uring {
use metrics::{register_int_counter, UIntGauge};
use std::{
collections::HashMap,
sync::{Arc, Mutex},
};
use metrics::{register_histogram, register_int_counter, Histogram, LocalHistogram, UIntGauge};
use once_cell::sync::Lazy;
/// Shared storage for tokio-epoll-uring thread local metrics.
pub(crate) static THREAD_LOCAL_METRICS_STORAGE: Lazy<ThreadLocalMetricsStorage> =
Lazy::new(|| {
let slots_submission_queue_depth = register_histogram!(
"pageserver_tokio_epoll_uring_slots_submission_queue_depth",
"The slots waiters queue depth of each tokio_epoll_uring system",
vec![1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, 1024.0],
)
.expect("failed to define a metric");
ThreadLocalMetricsStorage {
observers: Mutex::new(HashMap::new()),
slots_submission_queue_depth,
}
});
pub struct ThreadLocalMetricsStorage {
/// List of thread local metrics observers.
observers: Mutex<HashMap<u64, Arc<ThreadLocalMetrics>>>,
/// A histogram shared between all thread local systems
/// for collecting slots submission queue depth.
slots_submission_queue_depth: Histogram,
}
/// Each thread-local [`tokio_epoll_uring::System`] gets one of these as its
/// [`tokio_epoll_uring::metrics::PerSystemMetrics`] generic.
///
/// The System makes observations into [`Self`] and periodically, the collector
/// comes along and flushes [`Self`] into the shared storage [`THREAD_LOCAL_METRICS_STORAGE`].
///
/// [`LocalHistogram`] is `!Send`, so, we need to put it behind a [`Mutex`].
/// But except for the periodic flush, the lock is uncontended so there's no waiting
/// for cache coherence protocol to get an exclusive cache line.
pub struct ThreadLocalMetrics {
/// Local observer of thread local tokio-epoll-uring system's slots waiters queue depth.
slots_submission_queue_depth: Mutex<LocalHistogram>,
}
impl ThreadLocalMetricsStorage {
/// Registers a new thread local system. Returns a thread local metrics observer.
pub fn register_system(&self, id: u64) -> Arc<ThreadLocalMetrics> {
let per_system_metrics = Arc::new(ThreadLocalMetrics::new(
self.slots_submission_queue_depth.local(),
));
let mut g = self.observers.lock().unwrap();
g.insert(id, Arc::clone(&per_system_metrics));
per_system_metrics
}
/// Removes metrics observer for a thread local system.
/// This should be called before dropping a thread local system.
pub fn remove_system(&self, id: u64) {
let mut g = self.observers.lock().unwrap();
g.remove(&id);
}
/// Flush all thread local metrics to the shared storage.
pub fn flush_thread_local_metrics(&self) {
let g = self.observers.lock().unwrap();
g.values().for_each(|local| {
local.flush();
});
}
}
impl ThreadLocalMetrics {
pub fn new(slots_submission_queue_depth: LocalHistogram) -> Self {
ThreadLocalMetrics {
slots_submission_queue_depth: Mutex::new(slots_submission_queue_depth),
}
}
/// Flushes the thread local metrics to shared aggregator.
pub fn flush(&self) {
let Self {
slots_submission_queue_depth,
} = self;
slots_submission_queue_depth.lock().unwrap().flush();
}
}
impl tokio_epoll_uring::metrics::PerSystemMetrics for ThreadLocalMetrics {
fn observe_slots_submission_queue_depth(&self, queue_depth: u64) {
let Self {
slots_submission_queue_depth,
} = self;
slots_submission_queue_depth
.lock()
.unwrap()
.observe(queue_depth as f64);
}
}
pub struct Collector {
descs: Vec<metrics::core::Desc>,
systems_created: UIntGauge,
systems_destroyed: UIntGauge,
thread_local_metrics_storage: &'static ThreadLocalMetricsStorage,
}
impl metrics::core::Collector for Collector {
@@ -3056,7 +3154,7 @@ pub mod tokio_epoll_uring {
fn collect(&self) -> Vec<metrics::proto::MetricFamily> {
let mut mfs = Vec::with_capacity(Self::NMETRICS);
let tokio_epoll_uring::metrics::Metrics {
let tokio_epoll_uring::metrics::GlobalMetrics {
systems_created,
systems_destroyed,
} = tokio_epoll_uring::metrics::global();
@@ -3064,12 +3162,21 @@ pub mod tokio_epoll_uring {
mfs.extend(self.systems_created.collect());
self.systems_destroyed.set(systems_destroyed);
mfs.extend(self.systems_destroyed.collect());
self.thread_local_metrics_storage
.flush_thread_local_metrics();
mfs.extend(
self.thread_local_metrics_storage
.slots_submission_queue_depth
.collect(),
);
mfs
}
}
impl Collector {
const NMETRICS: usize = 2;
const NMETRICS: usize = 3;
#[allow(clippy::new_without_default)]
pub fn new() -> Self {
@@ -3101,6 +3208,7 @@ pub mod tokio_epoll_uring {
descs,
systems_created,
systems_destroyed,
thread_local_metrics_storage: &THREAD_LOCAL_METRICS_STORAGE,
}
}
}
@@ -3460,6 +3568,7 @@ pub fn preinitialize_metrics() {
Lazy::force(&RECONSTRUCT_TIME);
Lazy::force(&BASEBACKUP_QUERY_TIME);
Lazy::force(&COMPUTE_COMMANDS_COUNTERS);
Lazy::force(&tokio_epoll_uring::THREAD_LOCAL_METRICS_STORAGE);
tenant_throttling::preinitialize_global_metrics();
}

View File

@@ -1,10 +1,11 @@
//! The Page Service listens for client connections and serves their GetPage@LSN
//! requests.
use anyhow::Context;
use anyhow::{bail, Context};
use async_compression::tokio::write::GzipEncoder;
use bytes::Buf;
use futures::FutureExt;
use itertools::Itertools;
use once_cell::sync::OnceCell;
use pageserver_api::models::TenantState;
use pageserver_api::models::{
@@ -1080,6 +1081,7 @@ impl PageServerHandler {
prev_lsn: Option<Lsn>,
full_backup: bool,
gzip: bool,
replica: bool,
ctx: &RequestContext,
) -> Result<(), QueryError>
where
@@ -1132,6 +1134,7 @@ impl PageServerHandler {
lsn,
prev_lsn,
full_backup,
replica,
ctx,
)
.await
@@ -1154,6 +1157,7 @@ impl PageServerHandler {
lsn,
prev_lsn,
full_backup,
replica,
ctx,
)
.await
@@ -1170,6 +1174,7 @@ impl PageServerHandler {
lsn,
prev_lsn,
full_backup,
replica,
ctx,
)
.await
@@ -1217,6 +1222,222 @@ impl PageServerHandler {
}
}
/// `basebackup tenant timeline [lsn] [--gzip] [--replica]`
#[derive(Debug, Clone, Eq, PartialEq)]
struct BaseBackupCmd {
tenant_id: TenantId,
timeline_id: TimelineId,
lsn: Option<Lsn>,
gzip: bool,
replica: bool,
}
/// `fullbackup tenant timeline [lsn] [prev_lsn]`
#[derive(Debug, Clone, Eq, PartialEq)]
struct FullBackupCmd {
tenant_id: TenantId,
timeline_id: TimelineId,
lsn: Option<Lsn>,
prev_lsn: Option<Lsn>,
}
/// `pagestream_v2 tenant timeline`
#[derive(Debug, Clone, Eq, PartialEq)]
struct PageStreamCmd {
tenant_id: TenantId,
timeline_id: TimelineId,
}
/// `lease lsn tenant timeline lsn`
#[derive(Debug, Clone, Eq, PartialEq)]
struct LeaseLsnCmd {
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
lsn: Lsn,
}
#[derive(Debug, Clone, Eq, PartialEq)]
enum PageServiceCmd {
Set,
PageStream(PageStreamCmd),
BaseBackup(BaseBackupCmd),
FullBackup(FullBackupCmd),
LeaseLsn(LeaseLsnCmd),
}
impl PageStreamCmd {
fn parse(query: &str) -> anyhow::Result<Self> {
let parameters = query.split_whitespace().collect_vec();
if parameters.len() != 2 {
bail!(
"invalid number of parameters for pagestream command: {}",
query
);
}
let tenant_id = TenantId::from_str(parameters[0])
.with_context(|| format!("Failed to parse tenant id from {}", parameters[0]))?;
let timeline_id = TimelineId::from_str(parameters[1])
.with_context(|| format!("Failed to parse timeline id from {}", parameters[1]))?;
Ok(Self {
tenant_id,
timeline_id,
})
}
}
impl FullBackupCmd {
fn parse(query: &str) -> anyhow::Result<Self> {
let parameters = query.split_whitespace().collect_vec();
if parameters.len() < 2 || parameters.len() > 4 {
bail!(
"invalid number of parameters for basebackup command: {}",
query
);
}
let tenant_id = TenantId::from_str(parameters[0])
.with_context(|| format!("Failed to parse tenant id from {}", parameters[0]))?;
let timeline_id = TimelineId::from_str(parameters[1])
.with_context(|| format!("Failed to parse timeline id from {}", parameters[1]))?;
// The caller is responsible for providing correct lsn and prev_lsn.
let lsn = if let Some(lsn_str) = parameters.get(2) {
Some(
Lsn::from_str(lsn_str)
.with_context(|| format!("Failed to parse Lsn from {lsn_str}"))?,
)
} else {
None
};
let prev_lsn = if let Some(prev_lsn_str) = parameters.get(3) {
Some(
Lsn::from_str(prev_lsn_str)
.with_context(|| format!("Failed to parse Lsn from {prev_lsn_str}"))?,
)
} else {
None
};
Ok(Self {
tenant_id,
timeline_id,
lsn,
prev_lsn,
})
}
}
impl BaseBackupCmd {
fn parse(query: &str) -> anyhow::Result<Self> {
let parameters = query.split_whitespace().collect_vec();
if parameters.len() < 2 {
bail!(
"invalid number of parameters for basebackup command: {}",
query
);
}
let tenant_id = TenantId::from_str(parameters[0])
.with_context(|| format!("Failed to parse tenant id from {}", parameters[0]))?;
let timeline_id = TimelineId::from_str(parameters[1])
.with_context(|| format!("Failed to parse timeline id from {}", parameters[1]))?;
let lsn;
let flags_parse_from;
if let Some(maybe_lsn) = parameters.get(2) {
if *maybe_lsn == "latest" {
lsn = None;
flags_parse_from = 3;
} else if maybe_lsn.starts_with("--") {
lsn = None;
flags_parse_from = 2;
} else {
lsn = Some(
Lsn::from_str(maybe_lsn)
.with_context(|| format!("Failed to parse lsn from {maybe_lsn}"))?,
);
flags_parse_from = 3;
}
} else {
lsn = None;
flags_parse_from = 2;
}
let mut gzip = false;
let mut replica = false;
for &param in &parameters[flags_parse_from..] {
match param {
"--gzip" => {
if gzip {
bail!("duplicate parameter for basebackup command: {param}")
}
gzip = true
}
"--replica" => {
if replica {
bail!("duplicate parameter for basebackup command: {param}")
}
replica = true
}
_ => bail!("invalid parameter for basebackup command: {param}"),
}
}
Ok(Self {
tenant_id,
timeline_id,
lsn,
gzip,
replica,
})
}
}
impl LeaseLsnCmd {
fn parse(query: &str) -> anyhow::Result<Self> {
let parameters = query.split_whitespace().collect_vec();
if parameters.len() != 3 {
bail!(
"invalid number of parameters for lease lsn command: {}",
query
);
}
let tenant_shard_id = TenantShardId::from_str(parameters[0])
.with_context(|| format!("Failed to parse tenant id from {}", parameters[0]))?;
let timeline_id = TimelineId::from_str(parameters[1])
.with_context(|| format!("Failed to parse timeline id from {}", parameters[1]))?;
let lsn = Lsn::from_str(parameters[2])
.with_context(|| format!("Failed to parse lsn from {}", parameters[2]))?;
Ok(Self {
tenant_shard_id,
timeline_id,
lsn,
})
}
}
impl PageServiceCmd {
fn parse(query: &str) -> anyhow::Result<Self> {
let query = query.trim();
let Some((cmd, other)) = query.split_once(' ') else {
bail!("cannot parse query: {query}")
};
match cmd.to_ascii_lowercase().as_str() {
"pagestream_v2" => Ok(Self::PageStream(PageStreamCmd::parse(other)?)),
"basebackup" => Ok(Self::BaseBackup(BaseBackupCmd::parse(other)?)),
"fullbackup" => Ok(Self::FullBackup(FullBackupCmd::parse(other)?)),
"lease" => {
let Some((cmd2, other)) = other.split_once(' ') else {
bail!("invalid lease command: {cmd}");
};
let cmd2 = cmd2.to_ascii_lowercase();
if cmd2 == "lsn" {
Ok(Self::LeaseLsn(LeaseLsnCmd::parse(other)?))
} else {
bail!("invalid lease command: {cmd}");
}
}
"set" => Ok(Self::Set),
_ => Err(anyhow::anyhow!("unsupported command {cmd} in {query}")),
}
}
}
impl<IO> postgres_backend::Handler<IO> for PageServerHandler
where
IO: AsyncRead + AsyncWrite + Send + Sync + Unpin,
@@ -1273,201 +1494,137 @@ where
fail::fail_point!("ps::connection-start::process-query");
let ctx = self.connection_ctx.attached_child();
debug!("process query {query_string:?}");
let parts = query_string.split_whitespace().collect::<Vec<_>>();
if let Some(params) = parts.strip_prefix(&["pagestream_v2"]) {
if params.len() != 2 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for pagestream command"
)));
}
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
tracing::Span::current()
.record("tenant_id", field::display(tenant_id))
.record("timeline_id", field::display(timeline_id));
self.check_permission(Some(tenant_id))?;
COMPUTE_COMMANDS_COUNTERS
.for_command(ComputeCommandKind::PageStreamV2)
.inc();
self.handle_pagerequests(
pgb,
debug!("process query {query_string}");
let query = PageServiceCmd::parse(query_string)?;
match query {
PageServiceCmd::PageStream(PageStreamCmd {
tenant_id,
timeline_id,
PagestreamProtocolVersion::V2,
ctx,
)
.await?;
} else if let Some(params) = parts.strip_prefix(&["basebackup"]) {
if params.len() < 2 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for basebackup command"
)));
}) => {
tracing::Span::current()
.record("tenant_id", field::display(tenant_id))
.record("timeline_id", field::display(timeline_id));
self.check_permission(Some(tenant_id))?;
COMPUTE_COMMANDS_COUNTERS
.for_command(ComputeCommandKind::PageStreamV2)
.inc();
self.handle_pagerequests(
pgb,
tenant_id,
timeline_id,
PagestreamProtocolVersion::V2,
ctx,
)
.await?;
}
PageServiceCmd::BaseBackup(BaseBackupCmd {
tenant_id,
timeline_id,
lsn,
gzip,
replica,
}) => {
tracing::Span::current()
.record("tenant_id", field::display(tenant_id))
.record("timeline_id", field::display(timeline_id));
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
self.check_permission(Some(tenant_id))?;
tracing::Span::current()
.record("tenant_id", field::display(tenant_id))
.record("timeline_id", field::display(timeline_id));
self.check_permission(Some(tenant_id))?;
COMPUTE_COMMANDS_COUNTERS
.for_command(ComputeCommandKind::Basebackup)
.inc();
let (lsn, gzip) = match (params.get(2), params.get(3)) {
(None, _) => (None, false),
(Some(&"--gzip"), _) => (None, true),
(Some(lsn_str), gzip_str_opt) => {
let lsn = Lsn::from_str(lsn_str)
.with_context(|| format!("Failed to parse Lsn from {lsn_str}"))?;
let gzip = match gzip_str_opt {
Some(&"--gzip") => true,
None => false,
Some(third_param) => {
return Err(QueryError::Other(anyhow::anyhow!(
"Parameter in position 3 unknown {third_param}",
)))
}
};
(Some(lsn), gzip)
COMPUTE_COMMANDS_COUNTERS
.for_command(ComputeCommandKind::Basebackup)
.inc();
let metric_recording = metrics::BASEBACKUP_QUERY_TIME.start_recording(&ctx);
let res = async {
self.handle_basebackup_request(
pgb,
tenant_id,
timeline_id,
lsn,
None,
false,
gzip,
replica,
&ctx,
)
.await?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
Result::<(), QueryError>::Ok(())
}
};
.await;
metric_recording.observe(&res);
res?;
}
// same as basebackup, but result includes relational data as well
PageServiceCmd::FullBackup(FullBackupCmd {
tenant_id,
timeline_id,
lsn,
prev_lsn,
}) => {
tracing::Span::current()
.record("tenant_id", field::display(tenant_id))
.record("timeline_id", field::display(timeline_id));
let metric_recording = metrics::BASEBACKUP_QUERY_TIME.start_recording(&ctx);
let res = async {
self.check_permission(Some(tenant_id))?;
COMPUTE_COMMANDS_COUNTERS
.for_command(ComputeCommandKind::Fullbackup)
.inc();
// Check that the timeline exists
self.handle_basebackup_request(
pgb,
tenant_id,
timeline_id,
lsn,
None,
prev_lsn,
true,
false,
false,
gzip,
&ctx,
)
.await?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
Result::<(), QueryError>::Ok(())
}
.await;
metric_recording.observe(&res);
res?;
}
// same as basebackup, but result includes relational data as well
else if let Some(params) = parts.strip_prefix(&["fullbackup"]) {
if params.len() < 2 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for fullbackup command"
)));
PageServiceCmd::Set => {
// important because psycopg2 executes "SET datestyle TO 'ISO'"
// on connect
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
}
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
tracing::Span::current()
.record("tenant_id", field::display(tenant_id))
.record("timeline_id", field::display(timeline_id));
// The caller is responsible for providing correct lsn and prev_lsn.
let lsn = if let Some(lsn_str) = params.get(2) {
Some(
Lsn::from_str(lsn_str)
.with_context(|| format!("Failed to parse Lsn from {lsn_str}"))?,
)
} else {
None
};
let prev_lsn = if let Some(prev_lsn_str) = params.get(3) {
Some(
Lsn::from_str(prev_lsn_str)
.with_context(|| format!("Failed to parse Lsn from {prev_lsn_str}"))?,
)
} else {
None
};
self.check_permission(Some(tenant_id))?;
COMPUTE_COMMANDS_COUNTERS
.for_command(ComputeCommandKind::Fullbackup)
.inc();
// Check that the timeline exists
self.handle_basebackup_request(
pgb,
tenant_id,
PageServiceCmd::LeaseLsn(LeaseLsnCmd {
tenant_shard_id,
timeline_id,
lsn,
prev_lsn,
true,
false,
&ctx,
)
.await?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.to_ascii_lowercase().starts_with("set ") {
// important because psycopg2 executes "SET datestyle TO 'ISO'"
// on connect
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("lease lsn ") {
let params = &parts[2..];
if params.len() != 3 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number {} for lease lsn command",
params.len()
)));
}) => {
tracing::Span::current()
.record("tenant_id", field::display(tenant_shard_id))
.record("timeline_id", field::display(timeline_id));
self.check_permission(Some(tenant_shard_id.tenant_id))?;
COMPUTE_COMMANDS_COUNTERS
.for_command(ComputeCommandKind::LeaseLsn)
.inc();
match self
.handle_make_lsn_lease(pgb, tenant_shard_id, timeline_id, lsn, &ctx)
.await
{
Ok(()) => {
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?
}
Err(e) => {
error!("error obtaining lsn lease for {lsn}: {e:?}");
pgb.write_message_noflush(&BeMessage::ErrorResponse(
&e.to_string(),
Some(e.pg_error_code()),
))?
}
};
}
let tenant_shard_id = TenantShardId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
tracing::Span::current()
.record("tenant_id", field::display(tenant_shard_id))
.record("timeline_id", field::display(timeline_id));
self.check_permission(Some(tenant_shard_id.tenant_id))?;
COMPUTE_COMMANDS_COUNTERS
.for_command(ComputeCommandKind::LeaseLsn)
.inc();
// The caller is responsible for providing correct lsn.
let lsn = Lsn::from_str(params[2])
.with_context(|| format!("Failed to parse Lsn from {}", params[2]))?;
match self
.handle_make_lsn_lease(pgb, tenant_shard_id, timeline_id, lsn, &ctx)
.await
{
Ok(()) => pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?,
Err(e) => {
error!("error obtaining lsn lease for {lsn}: {e:?}");
pgb.write_message_noflush(&BeMessage::ErrorResponse(
&e.to_string(),
Some(e.pg_error_code()),
))?
}
};
} else {
return Err(QueryError::Other(anyhow::anyhow!(
"unknown command {query_string}"
)));
}
Ok(())
@@ -1516,3 +1673,181 @@ fn set_tracing_field_shard_id(timeline: &Timeline) {
);
debug_assert_current_span_has_tenant_and_timeline_id();
}
#[cfg(test)]
mod tests {
use utils::shard::ShardCount;
use super::*;
#[test]
fn pageservice_cmd_parse() {
let tenant_id = TenantId::generate();
let timeline_id = TimelineId::generate();
let cmd =
PageServiceCmd::parse(&format!("pagestream_v2 {tenant_id} {timeline_id}")).unwrap();
assert_eq!(
cmd,
PageServiceCmd::PageStream(PageStreamCmd {
tenant_id,
timeline_id
})
);
let cmd = PageServiceCmd::parse(&format!("basebackup {tenant_id} {timeline_id}")).unwrap();
assert_eq!(
cmd,
PageServiceCmd::BaseBackup(BaseBackupCmd {
tenant_id,
timeline_id,
lsn: None,
gzip: false,
replica: false
})
);
let cmd =
PageServiceCmd::parse(&format!("basebackup {tenant_id} {timeline_id} --gzip")).unwrap();
assert_eq!(
cmd,
PageServiceCmd::BaseBackup(BaseBackupCmd {
tenant_id,
timeline_id,
lsn: None,
gzip: true,
replica: false
})
);
let cmd =
PageServiceCmd::parse(&format!("basebackup {tenant_id} {timeline_id} latest")).unwrap();
assert_eq!(
cmd,
PageServiceCmd::BaseBackup(BaseBackupCmd {
tenant_id,
timeline_id,
lsn: None,
gzip: false,
replica: false
})
);
let cmd = PageServiceCmd::parse(&format!("basebackup {tenant_id} {timeline_id} 0/16ABCDE"))
.unwrap();
assert_eq!(
cmd,
PageServiceCmd::BaseBackup(BaseBackupCmd {
tenant_id,
timeline_id,
lsn: Some(Lsn::from_str("0/16ABCDE").unwrap()),
gzip: false,
replica: false
})
);
let cmd = PageServiceCmd::parse(&format!(
"basebackup {tenant_id} {timeline_id} --replica --gzip"
))
.unwrap();
assert_eq!(
cmd,
PageServiceCmd::BaseBackup(BaseBackupCmd {
tenant_id,
timeline_id,
lsn: None,
gzip: true,
replica: true
})
);
let cmd = PageServiceCmd::parse(&format!(
"basebackup {tenant_id} {timeline_id} 0/16ABCDE --replica --gzip"
))
.unwrap();
assert_eq!(
cmd,
PageServiceCmd::BaseBackup(BaseBackupCmd {
tenant_id,
timeline_id,
lsn: Some(Lsn::from_str("0/16ABCDE").unwrap()),
gzip: true,
replica: true
})
);
let cmd = PageServiceCmd::parse(&format!("fullbackup {tenant_id} {timeline_id}")).unwrap();
assert_eq!(
cmd,
PageServiceCmd::FullBackup(FullBackupCmd {
tenant_id,
timeline_id,
lsn: None,
prev_lsn: None
})
);
let cmd = PageServiceCmd::parse(&format!(
"fullbackup {tenant_id} {timeline_id} 0/16ABCDE 0/16ABCDF"
))
.unwrap();
assert_eq!(
cmd,
PageServiceCmd::FullBackup(FullBackupCmd {
tenant_id,
timeline_id,
lsn: Some(Lsn::from_str("0/16ABCDE").unwrap()),
prev_lsn: Some(Lsn::from_str("0/16ABCDF").unwrap()),
})
);
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
let cmd = PageServiceCmd::parse(&format!(
"lease lsn {tenant_shard_id} {timeline_id} 0/16ABCDE"
))
.unwrap();
assert_eq!(
cmd,
PageServiceCmd::LeaseLsn(LeaseLsnCmd {
tenant_shard_id,
timeline_id,
lsn: Lsn::from_str("0/16ABCDE").unwrap(),
})
);
let tenant_shard_id = TenantShardId::split(&tenant_shard_id, ShardCount(8))[1];
let cmd = PageServiceCmd::parse(&format!(
"lease lsn {tenant_shard_id} {timeline_id} 0/16ABCDE"
))
.unwrap();
assert_eq!(
cmd,
PageServiceCmd::LeaseLsn(LeaseLsnCmd {
tenant_shard_id,
timeline_id,
lsn: Lsn::from_str("0/16ABCDE").unwrap(),
})
);
let cmd = PageServiceCmd::parse("set a = b").unwrap();
assert_eq!(cmd, PageServiceCmd::Set);
let cmd = PageServiceCmd::parse("SET foo").unwrap();
assert_eq!(cmd, PageServiceCmd::Set);
}
#[test]
fn pageservice_cmd_err_handling() {
let tenant_id = TenantId::generate();
let timeline_id = TimelineId::generate();
let cmd = PageServiceCmd::parse("unknown_command");
assert!(cmd.is_err());
let cmd = PageServiceCmd::parse("pagestream_v2");
assert!(cmd.is_err());
let cmd = PageServiceCmd::parse(&format!("pagestream_v2 {tenant_id}xxx"));
assert!(cmd.is_err());
let cmd = PageServiceCmd::parse(&format!("pagestream_v2 {tenant_id}xxx {timeline_id}xxx"));
assert!(cmd.is_err());
let cmd = PageServiceCmd::parse(&format!(
"basebackup {tenant_id} {timeline_id} --gzip --gzip"
));
assert!(cmd.is_err());
let cmd = PageServiceCmd::parse(&format!(
"basebackup {tenant_id} {timeline_id} --gzip --unknown"
));
assert!(cmd.is_err());
let cmd = PageServiceCmd::parse(&format!(
"basebackup {tenant_id} {timeline_id} --gzip 0/16ABCDE"
));
assert!(cmd.is_err());
let cmd = PageServiceCmd::parse(&format!("lease {tenant_id} {timeline_id} gzip 0/16ABCDE"));
assert!(cmd.is_err());
}
}

View File

@@ -7,14 +7,14 @@
//! Clarify that)
//!
use super::tenant::{PageReconstructError, Timeline};
use crate::aux_file;
use crate::context::RequestContext;
use crate::keyspace::{KeySpace, KeySpaceAccum};
use crate::span::debug_assert_current_span_has_tenant_and_timeline_id_no_shard_id;
use crate::walrecord::NeonWalRecord;
use crate::{aux_file, repository::*};
use anyhow::{ensure, Context};
use bytes::{Buf, Bytes, BytesMut};
use enum_map::Enum;
use pageserver_api::key::Key;
use pageserver_api::key::{
dbdir_key_range, rel_block_to_key, rel_dir_to_key, rel_key_range, rel_size_to_key,
relmap_file_key, repl_origin_key, repl_origin_key_range, slru_block_to_key, slru_dir_to_key,
@@ -22,7 +22,10 @@ use pageserver_api::key::{
CompactKey, AUX_FILES_KEY, CHECKPOINT_KEY, CONTROLFILE_KEY, DBDIR_KEY, TWOPHASEDIR_KEY,
};
use pageserver_api::keyspace::SparseKeySpace;
use pageserver_api::record::NeonWalRecord;
use pageserver_api::reltag::{BlockNumber, RelTag, SlruKind};
use pageserver_api::shard::ShardIdentity;
use pageserver_api::value::Value;
use postgres_ffi::relfile_utils::{FSM_FORKNUM, VISIBILITYMAP_FORKNUM};
use postgres_ffi::BLCKSZ;
use postgres_ffi::{Oid, RepOriginId, TimestampTz, TransactionId};
@@ -36,12 +39,13 @@ use tracing::{debug, trace, warn};
use utils::bin_ser::DeserializeError;
use utils::pausable_failpoint;
use utils::{bin_ser::BeSer, lsn::Lsn};
use wal_decoder::serialized_batch::SerializedValueBatch;
/// Max delta records appended to the AUX_FILES_KEY (for aux v1). The write path will write a full image once this threshold is reached.
pub const MAX_AUX_FILE_DELTAS: usize = 1024;
/// Max number of aux-file-related delta layers. The compaction will create a new image layer once this threshold is reached.
pub const MAX_AUX_FILE_V2_DELTAS: usize = 64;
pub const MAX_AUX_FILE_V2_DELTAS: usize = 16;
#[derive(Debug)]
pub enum LsnForTimestamp {
@@ -168,12 +172,11 @@ impl Timeline {
tline: self,
pending_lsns: Vec::new(),
pending_metadata_pages: HashMap::new(),
pending_data_pages: Vec::new(),
pending_zero_data_pages: Default::default(),
pending_data_batch: None,
pending_deletions: Vec::new(),
pending_nblocks: 0,
pending_directory_entries: Vec::new(),
pending_bytes: 0,
pending_metadata_bytes: 0,
lsn,
}
}
@@ -1023,21 +1026,14 @@ pub struct DatadirModification<'a> {
/// Data writes, ready to be flushed into an ephemeral layer. See [`Self::is_data_key`] for
/// which keys are stored here.
pending_data_pages: Vec<(CompactKey, Lsn, usize, Value)>,
// Sometimes during ingest, for example when extending a relation, we would like to write a zero page. However,
// if we encounter a write from postgres in the same wal record, we will drop this entry.
//
// Unlike other 'pending' fields, this does not last until the next call to commit(): it is flushed
// at the end of each wal record, and all these writes implicitly are at lsn Self::lsn
pending_zero_data_pages: HashSet<CompactKey>,
pending_data_batch: Option<SerializedValueBatch>,
/// For special "directory" keys that store key-value maps, track the size of the map
/// if it was updated in this modification.
pending_directory_entries: Vec<(DirectoryKind, usize)>,
/// An **approximation** of how large our EphemeralFile write will be when committed.
pending_bytes: usize,
/// An **approximation** of how many metadata bytes will be written to the EphemeralFile.
pending_metadata_bytes: usize,
}
impl<'a> DatadirModification<'a> {
@@ -1052,11 +1048,17 @@ impl<'a> DatadirModification<'a> {
}
pub(crate) fn approx_pending_bytes(&self) -> usize {
self.pending_bytes
self.pending_data_batch
.as_ref()
.map_or(0, |b| b.buffer_size())
+ self.pending_metadata_bytes
}
pub(crate) fn has_dirty_data_pages(&self) -> bool {
(!self.pending_data_pages.is_empty()) || (!self.pending_zero_data_pages.is_empty())
pub(crate) fn has_dirty_data(&self) -> bool {
!self
.pending_data_batch
.as_ref()
.map_or(true, |b| b.is_empty())
}
/// Set the current lsn
@@ -1068,9 +1070,6 @@ impl<'a> DatadirModification<'a> {
self.lsn
);
// If we are advancing LSN, then state from previous wal record should have been flushed.
assert!(self.pending_zero_data_pages.is_empty());
if lsn > self.lsn {
self.pending_lsns.push(self.lsn);
self.lsn = lsn;
@@ -1145,6 +1144,107 @@ impl<'a> DatadirModification<'a> {
Ok(())
}
/// Creates a relation if it is not already present.
/// Returns the current size of the relation
pub(crate) async fn create_relation_if_required(
&mut self,
rel: RelTag,
ctx: &RequestContext,
) -> Result<u32, PageReconstructError> {
// Get current size and put rel creation if rel doesn't exist
//
// NOTE: we check the cache first even though get_rel_exists and get_rel_size would
// check the cache too. This is because eagerly checking the cache results in
// less work overall and 10% better performance. It's more work on cache miss
// but cache miss is rare.
if let Some(nblocks) = self.tline.get_cached_rel_size(&rel, self.get_lsn()) {
Ok(nblocks)
} else if !self
.tline
.get_rel_exists(rel, Version::Modified(self), ctx)
.await?
{
// create it with 0 size initially, the logic below will extend it
self.put_rel_creation(rel, 0, ctx)
.await
.context("Relation Error")?;
Ok(0)
} else {
self.tline
.get_rel_size(rel, Version::Modified(self), ctx)
.await
}
}
/// Given a block number for a relation (which represents a newly written block),
/// the previous block count of the relation, and the shard info, find the gaps
/// that were created by the newly written block if any.
fn find_gaps(
rel: RelTag,
blkno: u32,
previous_nblocks: u32,
shard: &ShardIdentity,
) -> Option<KeySpace> {
let mut key = rel_block_to_key(rel, blkno);
let mut gap_accum = None;
for gap_blkno in previous_nblocks..blkno {
key.field6 = gap_blkno;
if shard.get_shard_number(&key) != shard.number {
continue;
}
gap_accum
.get_or_insert_with(KeySpaceAccum::new)
.add_key(key);
}
gap_accum.map(|accum| accum.to_keyspace())
}
pub async fn ingest_batch(
&mut self,
mut batch: SerializedValueBatch,
// TODO(vlad): remove this argument and replace the shard check with is_key_local
shard: &ShardIdentity,
ctx: &RequestContext,
) -> anyhow::Result<()> {
let mut gaps_at_lsns = Vec::default();
for meta in batch.metadata.iter() {
let (rel, blkno) = Key::from_compact(meta.key()).to_rel_block()?;
let new_nblocks = blkno + 1;
let old_nblocks = self.create_relation_if_required(rel, ctx).await?;
if new_nblocks > old_nblocks {
self.put_rel_extend(rel, new_nblocks, ctx).await?;
}
if let Some(gaps) = Self::find_gaps(rel, blkno, old_nblocks, shard) {
gaps_at_lsns.push((gaps, meta.lsn()));
}
}
if !gaps_at_lsns.is_empty() {
batch.zero_gaps(gaps_at_lsns);
}
match self.pending_data_batch.as_mut() {
Some(pending_batch) => {
pending_batch.extend(batch);
}
None if !batch.is_empty() => {
self.pending_data_batch = Some(batch);
}
None => {
// Nothing to initialize the batch with
}
}
Ok(())
}
/// Put a new page version that can be constructed from a WAL record
///
/// NOTE: this will *not* implicitly extend the relation, if the page is beyond the
@@ -1227,8 +1327,13 @@ impl<'a> DatadirModification<'a> {
self.lsn
);
}
self.pending_zero_data_pages.insert(key.to_compact());
self.pending_bytes += ZERO_PAGE.len();
let batch = self
.pending_data_batch
.get_or_insert_with(SerializedValueBatch::default);
batch.put(key.to_compact(), Value::Image(ZERO_PAGE.clone()), self.lsn);
Ok(())
}
@@ -1246,17 +1351,14 @@ impl<'a> DatadirModification<'a> {
self.lsn
);
}
self.pending_zero_data_pages.insert(key.to_compact());
self.pending_bytes += ZERO_PAGE.len();
Ok(())
}
/// Call this at the end of each WAL record.
pub(crate) fn on_record_end(&mut self) {
let pending_zero_data_pages = std::mem::take(&mut self.pending_zero_data_pages);
for key in pending_zero_data_pages {
self.put_data(key, Value::Image(ZERO_PAGE.clone()));
}
let batch = self
.pending_data_batch
.get_or_insert_with(SerializedValueBatch::default);
batch.put(key.to_compact(), Value::Image(ZERO_PAGE.clone()), self.lsn);
Ok(())
}
/// Store a relmapper file (pg_filenode.map) in the repository
@@ -1506,35 +1608,42 @@ impl<'a> DatadirModification<'a> {
Ok(())
}
/// Drop a relation.
pub async fn put_rel_drop(&mut self, rel: RelTag, ctx: &RequestContext) -> anyhow::Result<()> {
anyhow::ensure!(rel.relnode != 0, RelationError::InvalidRelnode);
/// Drop some relations
pub(crate) async fn put_rel_drops(
&mut self,
drop_relations: HashMap<(u32, u32), Vec<RelTag>>,
ctx: &RequestContext,
) -> anyhow::Result<()> {
for ((spc_node, db_node), rel_tags) in drop_relations {
let dir_key = rel_dir_to_key(spc_node, db_node);
let buf = self.get(dir_key, ctx).await?;
let mut dir = RelDirectory::des(&buf)?;
// Remove it from the directory entry
let dir_key = rel_dir_to_key(rel.spcnode, rel.dbnode);
let buf = self.get(dir_key, ctx).await?;
let mut dir = RelDirectory::des(&buf)?;
let mut dirty = false;
for rel_tag in rel_tags {
if dir.rels.remove(&(rel_tag.relnode, rel_tag.forknum)) {
dirty = true;
self.pending_directory_entries
.push((DirectoryKind::Rel, dir.rels.len()));
// update logical size
let size_key = rel_size_to_key(rel_tag);
let old_size = self.get(size_key, ctx).await?.get_u32_le();
self.pending_nblocks -= old_size as i64;
if dir.rels.remove(&(rel.relnode, rel.forknum)) {
self.put(dir_key, Value::Image(Bytes::from(RelDirectory::ser(&dir)?)));
} else {
warn!("dropped rel {} did not exist in rel directory", rel);
// Remove entry from relation size cache
self.tline.remove_cached_rel_size(&rel_tag);
// Delete size entry, as well as all blocks
self.delete(rel_key_range(rel_tag));
}
}
if dirty {
self.put(dir_key, Value::Image(Bytes::from(RelDirectory::ser(&dir)?)));
self.pending_directory_entries
.push((DirectoryKind::Rel, dir.rels.len()));
}
}
// update logical size
let size_key = rel_size_to_key(rel);
let old_size = self.get(size_key, ctx).await?.get_u32_le();
self.pending_nblocks -= old_size as i64;
// Remove enty from relation size cache
self.tline.remove_cached_rel_size(&rel);
// Delete size entry, as well as all blocks
self.delete(rel_key_range(rel));
Ok(())
}
@@ -1741,12 +1850,17 @@ impl<'a> DatadirModification<'a> {
let mut writer = self.tline.writer().await;
// Flush relation and SLRU data blocks, keep metadata.
let pending_data_pages = std::mem::take(&mut self.pending_data_pages);
if let Some(batch) = self.pending_data_batch.take() {
tracing::debug!(
"Flushing batch with max_lsn={}. Last record LSN is {}",
batch.max_lsn,
self.tline.get_last_record_lsn()
);
// This bails out on first error without modifying pending_updates.
// That's Ok, cf this function's doc comment.
writer.put_batch(pending_data_pages, ctx).await?;
self.pending_bytes = 0;
// This bails out on first error without modifying pending_updates.
// That's Ok, cf this function's doc comment.
writer.put_batch(batch, ctx).await?;
}
if pending_nblocks != 0 {
writer.update_current_logical_size(pending_nblocks * i64::from(BLCKSZ));
@@ -1766,9 +1880,6 @@ impl<'a> DatadirModification<'a> {
/// All the modifications in this atomic update are stamped by the specified LSN.
///
pub async fn commit(&mut self, ctx: &RequestContext) -> anyhow::Result<()> {
// Commit should never be called mid-wal-record
assert!(self.pending_zero_data_pages.is_empty());
let mut writer = self.tline.writer().await;
let pending_nblocks = self.pending_nblocks;
@@ -1776,21 +1887,49 @@ impl<'a> DatadirModification<'a> {
// Ordering: the items in this batch do not need to be in any global order, but values for
// a particular Key must be in Lsn order relative to one another. InMemoryLayer relies on
// this to do efficient updates to its index.
let mut write_batch = std::mem::take(&mut self.pending_data_pages);
// this to do efficient updates to its index. See [`wal_decoder::serialized_batch`] for
// more details.
write_batch.extend(
self.pending_metadata_pages
let metadata_batch = {
let pending_meta = self
.pending_metadata_pages
.drain()
.flat_map(|(key, values)| {
values
.into_iter()
.map(move |(lsn, value_size, value)| (key, lsn, value_size, value))
}),
);
})
.collect::<Vec<_>>();
if !write_batch.is_empty() {
writer.put_batch(write_batch, ctx).await?;
if pending_meta.is_empty() {
None
} else {
Some(SerializedValueBatch::from_values(pending_meta))
}
};
let data_batch = self.pending_data_batch.take();
let maybe_batch = match (data_batch, metadata_batch) {
(Some(mut data), Some(metadata)) => {
data.extend(metadata);
Some(data)
}
(Some(data), None) => Some(data),
(None, Some(metadata)) => Some(metadata),
(None, None) => None,
};
if let Some(batch) = maybe_batch {
tracing::debug!(
"Flushing batch with max_lsn={}. Last record LSN is {}",
batch.max_lsn,
self.tline.get_last_record_lsn()
);
// This bails out on first error without modifying pending_updates.
// That's Ok, cf this function's doc comment.
writer.put_batch(batch, ctx).await?;
}
if !self.pending_deletions.is_empty() {
@@ -1800,6 +1939,9 @@ impl<'a> DatadirModification<'a> {
self.pending_lsns.push(self.lsn);
for pending_lsn in self.pending_lsns.drain(..) {
// TODO(vlad): pretty sure the comment below is not valid anymore
// and we can call finish write with the latest LSN
//
// Ideally, we should be able to call writer.finish_write() only once
// with the highest LSN. However, the last_record_lsn variable in the
// timeline keeps track of the latest LSN and the immediate previous LSN
@@ -1815,14 +1957,14 @@ impl<'a> DatadirModification<'a> {
writer.update_directory_entries_count(kind, count as u64);
}
self.pending_bytes = 0;
self.pending_metadata_bytes = 0;
Ok(())
}
pub(crate) fn len(&self) -> usize {
self.pending_metadata_pages.len()
+ self.pending_data_pages.len()
+ self.pending_data_batch.as_ref().map_or(0, |b| b.len())
+ self.pending_deletions.len()
}
@@ -1864,11 +2006,10 @@ impl<'a> DatadirModification<'a> {
// modifications before ingesting DB create operations, which are the only kind that reads
// data pages during ingest.
if cfg!(debug_assertions) {
for (dirty_key, _, _, _) in &self.pending_data_pages {
debug_assert!(&key.to_compact() != dirty_key);
}
debug_assert!(!self.pending_zero_data_pages.contains(&key.to_compact()))
assert!(!self
.pending_data_batch
.as_ref()
.map_or(false, |b| b.updates_key(&key)));
}
}
@@ -1886,18 +2027,10 @@ impl<'a> DatadirModification<'a> {
}
fn put_data(&mut self, key: CompactKey, val: Value) {
let val_serialized_size = val.serialized_size().unwrap() as usize;
// If this page was previously zero'd in the same WalRecord, then drop the previous zero page write. This
// is an optimization that avoids persisting both the zero page generated by us (e.g. during a relation extend),
// and the subsequent postgres-originating write
if self.pending_zero_data_pages.remove(&key) {
self.pending_bytes -= ZERO_PAGE.len();
}
self.pending_bytes += val_serialized_size;
self.pending_data_pages
.push((key, self.lsn, val_serialized_size, val))
let batch = self
.pending_data_batch
.get_or_insert_with(SerializedValueBatch::default);
batch.put(key, val, self.lsn);
}
fn put_metadata(&mut self, key: CompactKey, val: Value) {
@@ -1905,10 +2038,10 @@ impl<'a> DatadirModification<'a> {
// Replace the previous value if it exists at the same lsn
if let Some((last_lsn, last_value_ser_size, last_value)) = values.last_mut() {
if *last_lsn == self.lsn {
// Update the pending_bytes contribution from this entry, and update the serialized size in place
self.pending_bytes -= *last_value_ser_size;
// Update the pending_metadata_bytes contribution from this entry, and update the serialized size in place
self.pending_metadata_bytes -= *last_value_ser_size;
*last_value_ser_size = val.serialized_size().unwrap() as usize;
self.pending_bytes += *last_value_ser_size;
self.pending_metadata_bytes += *last_value_ser_size;
// Use the latest value, this replaces any earlier write to the same (key,lsn), such as much
// have been generated by synthesized zero page writes prior to the first real write to a page.
@@ -1918,8 +2051,12 @@ impl<'a> DatadirModification<'a> {
}
let val_serialized_size = val.serialized_size().unwrap() as usize;
self.pending_bytes += val_serialized_size;
self.pending_metadata_bytes += val_serialized_size;
values.push((self.lsn, val_serialized_size, val));
if key == CHECKPOINT_KEY.to_compact() {
tracing::debug!("Checkpoint key added to pending with size {val_serialized_size}");
}
}
fn delete(&mut self, key_range: Range<Key>) {
@@ -2028,7 +2165,11 @@ static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; BLCKSZ as usize]);
#[cfg(test)]
mod tests {
use hex_literal::hex;
use utils::id::TimelineId;
use pageserver_api::{models::ShardParameters, shard::ShardStripeSize};
use utils::{
id::TimelineId,
shard::{ShardCount, ShardNumber},
};
use super::*;
@@ -2082,6 +2223,93 @@ mod tests {
Ok(())
}
#[test]
fn gap_finding() {
let rel = RelTag {
spcnode: 1663,
dbnode: 208101,
relnode: 2620,
forknum: 0,
};
let base_blkno = 1;
let base_key = rel_block_to_key(rel, base_blkno);
let before_base_key = rel_block_to_key(rel, base_blkno - 1);
let shard = ShardIdentity::unsharded();
let mut previous_nblocks = 0;
for i in 0..10 {
let crnt_blkno = base_blkno + i;
let gaps = DatadirModification::find_gaps(rel, crnt_blkno, previous_nblocks, &shard);
previous_nblocks = crnt_blkno + 1;
if i == 0 {
// The first block we write is 1, so we should find the gap.
assert_eq!(gaps.unwrap(), KeySpace::single(before_base_key..base_key));
} else {
assert!(gaps.is_none());
}
}
// This is an update to an already existing block. No gaps here.
let update_blkno = 5;
let gaps = DatadirModification::find_gaps(rel, update_blkno, previous_nblocks, &shard);
assert!(gaps.is_none());
// This is an update past the current end block.
let after_gap_blkno = 20;
let gaps = DatadirModification::find_gaps(rel, after_gap_blkno, previous_nblocks, &shard);
let gap_start_key = rel_block_to_key(rel, previous_nblocks);
let after_gap_key = rel_block_to_key(rel, after_gap_blkno);
assert_eq!(
gaps.unwrap(),
KeySpace::single(gap_start_key..after_gap_key)
);
}
#[test]
fn sharded_gap_finding() {
let rel = RelTag {
spcnode: 1663,
dbnode: 208101,
relnode: 2620,
forknum: 0,
};
let first_blkno = 6;
// This shard will get the even blocks
let shard = ShardIdentity::from_params(
ShardNumber(0),
&ShardParameters {
count: ShardCount(2),
stripe_size: ShardStripeSize(1),
},
);
// Only keys belonging to this shard are considered as gaps.
let mut previous_nblocks = 0;
let gaps =
DatadirModification::find_gaps(rel, first_blkno, previous_nblocks, &shard).unwrap();
assert!(!gaps.ranges.is_empty());
for gap_range in gaps.ranges {
let mut k = gap_range.start;
while k != gap_range.end {
assert_eq!(shard.get_shard_number(&k), shard.number);
k = k.next();
}
}
previous_nblocks = first_blkno;
let update_blkno = 2;
let gaps = DatadirModification::find_gaps(rel, update_blkno, previous_nblocks, &shard);
assert!(gaps.is_none());
}
/*
fn assert_current_logical_size<R: Repository>(timeline: &DatadirTimeline<R>, lsn: Lsn) {
let incremental = timeline.get_current_logical_size();

File diff suppressed because it is too large Load Diff

View File

@@ -9,7 +9,6 @@
//! may lead to a data loss.
//!
pub(crate) use pageserver_api::config::TenantConfigToml as TenantConf;
use pageserver_api::models::AuxFilePolicy;
use pageserver_api::models::CompactionAlgorithmSettings;
use pageserver_api::models::EvictionPolicy;
use pageserver_api::models::{self, ThrottleConfig};
@@ -341,10 +340,6 @@ pub struct TenantConfOpt {
#[serde(skip_serializing_if = "Option::is_none")]
pub image_layer_creation_check_threshold: Option<u8>,
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(default)]
pub switch_aux_file_policy: Option<AuxFilePolicy>,
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(with = "humantime_serde")]
#[serde(default)]
@@ -354,6 +349,10 @@ pub struct TenantConfOpt {
#[serde(with = "humantime_serde")]
#[serde(default)]
pub lsn_lease_length_for_ts: Option<Duration>,
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(default)]
pub timeline_offloading: Option<bool>,
}
impl TenantConfOpt {
@@ -410,15 +409,15 @@ impl TenantConfOpt {
image_layer_creation_check_threshold: self
.image_layer_creation_check_threshold
.unwrap_or(global_conf.image_layer_creation_check_threshold),
switch_aux_file_policy: self
.switch_aux_file_policy
.unwrap_or(global_conf.switch_aux_file_policy),
lsn_lease_length: self
.lsn_lease_length
.unwrap_or(global_conf.lsn_lease_length),
lsn_lease_length_for_ts: self
.lsn_lease_length_for_ts
.unwrap_or(global_conf.lsn_lease_length_for_ts),
timeline_offloading: self
.lazy_slru_download
.unwrap_or(global_conf.timeline_offloading),
}
}
}
@@ -470,9 +469,9 @@ impl From<TenantConfOpt> for models::TenantConfig {
lazy_slru_download: value.lazy_slru_download,
timeline_get_throttle: value.timeline_get_throttle.map(ThrottleConfig::from),
image_layer_creation_check_threshold: value.image_layer_creation_check_threshold,
switch_aux_file_policy: value.switch_aux_file_policy,
lsn_lease_length: value.lsn_lease_length.map(humantime),
lsn_lease_length_for_ts: value.lsn_lease_length_for_ts.map(humantime),
timeline_offloading: value.timeline_offloading,
}
}
}

View File

@@ -0,0 +1,57 @@
use anyhow::Result;
use serde::Serialize;
use std::ops::AddAssign;
use std::time::Duration;
///
/// Result of performing GC
///
#[derive(Default, Serialize, Debug)]
pub struct GcResult {
pub layers_total: u64,
pub layers_needed_by_cutoff: u64,
pub layers_needed_by_pitr: u64,
pub layers_needed_by_branches: u64,
pub layers_needed_by_leases: u64,
pub layers_not_updated: u64,
pub layers_removed: u64, // # of layer files removed because they have been made obsolete by newer ondisk files.
#[serde(serialize_with = "serialize_duration_as_millis")]
pub elapsed: Duration,
/// The layers which were garbage collected.
///
/// Used in `/v1/tenant/:tenant_id/timeline/:timeline_id/do_gc` to wait for the layers to be
/// dropped in tests.
#[cfg(feature = "testing")]
#[serde(skip)]
pub(crate) doomed_layers: Vec<crate::tenant::storage_layer::Layer>,
}
// helper function for `GcResult`, serializing a `Duration` as an integer number of milliseconds
fn serialize_duration_as_millis<S>(d: &Duration, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
d.as_millis().serialize(serializer)
}
impl AddAssign for GcResult {
fn add_assign(&mut self, other: Self) {
self.layers_total += other.layers_total;
self.layers_needed_by_pitr += other.layers_needed_by_pitr;
self.layers_needed_by_cutoff += other.layers_needed_by_cutoff;
self.layers_needed_by_branches += other.layers_needed_by_branches;
self.layers_needed_by_leases += other.layers_needed_by_leases;
self.layers_not_updated += other.layers_not_updated;
self.layers_removed += other.layers_removed;
self.elapsed += other.elapsed;
#[cfg(feature = "testing")]
{
let mut other = other;
self.doomed_layers.append(&mut other.doomed_layers);
}
}
}

View File

@@ -48,9 +48,9 @@ mod layer_coverage;
use crate::context::RequestContext;
use crate::keyspace::KeyPartitioning;
use crate::repository::Key;
use crate::tenant::storage_layer::InMemoryLayer;
use anyhow::Result;
use pageserver_api::key::Key;
use pageserver_api::keyspace::{KeySpace, KeySpaceAccum};
use range_set_blaze::{CheckSortedDisjoint, RangeSetBlaze};
use std::collections::{HashMap, VecDeque};

View File

@@ -1959,7 +1959,7 @@ impl TenantManager {
attempt.before_reset_tenant();
let (_guard, progress) = utils::completion::channel();
match tenant.shutdown(progress, ShutdownMode::Hard).await {
match tenant.shutdown(progress, ShutdownMode::Flush).await {
Ok(()) => {
slot_guard.drop_old_value().expect("it was just shutdown");
}
@@ -2811,7 +2811,7 @@ where
}
use {
crate::repository::GcResult, pageserver_api::models::TimelineGcRequest,
crate::tenant::gc_result::GcResult, pageserver_api::models::TimelineGcRequest,
utils::http::error::ApiError,
};

View File

@@ -190,6 +190,7 @@ use chrono::{NaiveDateTime, Utc};
pub(crate) use download::download_initdb_tar_zst;
use pageserver_api::models::TimelineArchivalState;
use pageserver_api::shard::{ShardIndex, TenantShardId};
use regex::Regex;
use scopeguard::ScopeGuard;
use tokio_util::sync::CancellationToken;
use utils::backoff::{
@@ -199,7 +200,7 @@ use utils::pausable_failpoint;
use std::collections::{HashMap, VecDeque};
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Mutex};
use std::sync::{Arc, Mutex, OnceLock};
use std::time::Duration;
use remote_storage::{
@@ -245,11 +246,11 @@ use super::upload_queue::{NotInitialized, SetDeletedFlagProgress};
use super::Generation;
pub(crate) use download::{
do_download_tenant_manifest, download_index_part, is_temp_download_file,
download_index_part, download_tenant_manifest, is_temp_download_file,
list_remote_tenant_shards, list_remote_timelines,
};
pub(crate) use index::LayerFileMetadata;
pub(crate) use upload::{upload_initdb_dir, upload_tenant_manifest};
pub(crate) use upload::upload_initdb_dir;
// Occasional network issues and such can cause remote operations to fail, and
// that's expected. If a download fails, we log it at info-level, and retry.
@@ -274,12 +275,6 @@ pub(crate) const BUFFER_SIZE: usize = 32 * 1024;
/// which we warn and skip.
const DELETION_QUEUE_FLUSH_TIMEOUT: Duration = Duration::from_secs(10);
/// Hardcode a generation for the tenant manifest for now so that we don't
/// need to deal with generation-less manifests in the future.
///
/// TODO: add proper generation support to all the places that use this.
pub(crate) const TENANT_MANIFEST_GENERATION: Generation = Generation::new(1);
pub enum MaybeDeletedIndexPart {
IndexPart(IndexPart),
Deleted(IndexPart),
@@ -579,12 +574,18 @@ impl RemoteTimelineClient {
if latest_index_generation > index_generation {
// Unexpected! Why are we loading such an old index if a more recent one exists?
tracing::warn!(
// We will refuse to proceed, as there is no reasonable scenario where this should happen, but
// there _is_ a clear bug/corruption scenario where it would happen (controller sets the generation
// backwards).
tracing::error!(
?index_generation,
?latest_index_generation,
?latest_index_mtime,
"Found a newer index while loading an old one"
);
return Err(DownloadError::Fatal(
"Index age exceeds threshold and a newer index exists".into(),
));
}
}
@@ -1278,10 +1279,14 @@ impl RemoteTimelineClient {
let fut = {
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = match &mut *guard {
UploadQueue::Stopped(_) => return,
UploadQueue::Stopped(_) => {
scopeguard::ScopeGuard::into_inner(sg);
return;
}
UploadQueue::Uninitialized => {
// transition into Stopped state
self.stop_impl(&mut guard);
scopeguard::ScopeGuard::into_inner(sg);
return;
}
UploadQueue::Initialized(ref mut init) => init,
@@ -1446,7 +1451,7 @@ impl RemoteTimelineClient {
let remote_path = remote_layer_path(
&self.tenant_shard_id.tenant_id,
&self.timeline_id,
self.tenant_shard_id.to_index(),
uploaded.metadata().shard,
&uploaded.layer_desc().layer_name(),
uploaded.metadata().generation,
);
@@ -1487,7 +1492,7 @@ impl RemoteTimelineClient {
&adopted
.get_timeline_id()
.expect("Source timeline should be alive"),
self.tenant_shard_id.to_index(),
adopted.metadata().shard,
&adopted.layer_desc().layer_name(),
adopted.metadata().generation,
);
@@ -1495,7 +1500,7 @@ impl RemoteTimelineClient {
let target_remote_path = remote_layer_path(
&self.tenant_shard_id.tenant_id,
&self.timeline_id,
self.tenant_shard_id.to_index(),
adopted_as.metadata().shard,
&adopted_as.layer_desc().layer_name(),
adopted_as.metadata().generation,
);
@@ -2202,6 +2207,18 @@ impl RemoteTimelineClient {
inner.initialized_mut()?;
Ok(UploadQueueAccessor { inner })
}
pub(crate) fn no_pending_work(&self) -> bool {
let inner = self.upload_queue.lock().unwrap();
match &*inner {
UploadQueue::Uninitialized
| UploadQueue::Stopped(UploadQueueStopped::Uninitialized) => true,
UploadQueue::Stopped(UploadQueueStopped::Deletable(x)) => {
x.upload_queue_for_deletion.no_pending_work()
}
UploadQueue::Initialized(x) => x.no_pending_work(),
}
}
}
pub(crate) struct UploadQueueAccessor<'a> {
@@ -2235,6 +2252,12 @@ pub fn remote_tenant_manifest_path(
RemotePath::from_string(&path).expect("Failed to construct path")
}
/// Prefix to all generations' manifest objects in a tenant shard
pub fn remote_tenant_manifest_prefix(tenant_shard_id: &TenantShardId) -> RemotePath {
let path = format!("tenants/{tenant_shard_id}/tenant-manifest",);
RemotePath::from_string(&path).expect("Failed to construct path")
}
pub fn remote_timelines_path(tenant_shard_id: &TenantShardId) -> RemotePath {
let path = format!("tenants/{tenant_shard_id}/{TIMELINES_SEGMENT_NAME}");
RemotePath::from_string(&path).expect("Failed to construct path")
@@ -2329,6 +2352,15 @@ pub fn parse_remote_index_path(path: RemotePath) -> Option<Generation> {
}
}
/// Given the key of a tenant manifest, parse out the generation number
pub(crate) fn parse_remote_tenant_manifest_path(path: RemotePath) -> Option<Generation> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r".+tenant-manifest-([0-9a-f]{8}).json").unwrap());
re.captures(path.get_path().as_str())
.and_then(|c| c.get(1))
.and_then(|m| Generation::parse_suffix(m.as_str()))
}
#[cfg(test)]
mod tests {
use super::*;

View File

@@ -20,7 +20,9 @@ use utils::backoff;
use crate::config::PageServerConf;
use crate::context::RequestContext;
use crate::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::span::{
debug_assert_current_span_has_tenant_and_timeline_id, debug_assert_current_span_has_tenant_id,
};
use crate::tenant::remote_timeline_client::{remote_layer_path, remote_timelines_path};
use crate::tenant::storage_layer::LayerName;
use crate::tenant::Generation;
@@ -36,9 +38,10 @@ use utils::pausable_failpoint;
use super::index::{IndexPart, LayerFileMetadata};
use super::manifest::TenantManifest;
use super::{
parse_remote_index_path, remote_index_path, remote_initdb_archive_path,
remote_initdb_preserved_archive_path, remote_tenant_manifest_path, remote_tenant_path,
FAILED_DOWNLOAD_WARN_THRESHOLD, FAILED_REMOTE_OP_RETRIES, INITDB_PATH,
parse_remote_index_path, parse_remote_tenant_manifest_path, remote_index_path,
remote_initdb_archive_path, remote_initdb_preserved_archive_path, remote_tenant_manifest_path,
remote_tenant_manifest_prefix, remote_tenant_path, FAILED_DOWNLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES, INITDB_PATH,
};
///
@@ -365,32 +368,34 @@ async fn do_download_remote_path_retry_forever(
.await
}
pub async fn do_download_tenant_manifest(
async fn do_download_tenant_manifest(
storage: &GenericRemoteStorage,
tenant_shard_id: &TenantShardId,
_timeline_id: Option<&TimelineId>,
generation: Generation,
cancel: &CancellationToken,
) -> Result<(TenantManifest, Generation), DownloadError> {
// TODO: generation support
let generation = super::TENANT_MANIFEST_GENERATION;
) -> Result<(TenantManifest, Generation, SystemTime), DownloadError> {
let remote_path = remote_tenant_manifest_path(tenant_shard_id, generation);
let (manifest_bytes, _manifest_bytes_mtime) =
let (manifest_bytes, manifest_bytes_mtime) =
do_download_remote_path_retry_forever(storage, &remote_path, cancel).await?;
let tenant_manifest = TenantManifest::from_json_bytes(&manifest_bytes)
.with_context(|| format!("deserialize tenant manifest file at {remote_path:?}"))
.map_err(DownloadError::Other)?;
Ok((tenant_manifest, generation))
Ok((tenant_manifest, generation, manifest_bytes_mtime))
}
async fn do_download_index_part(
storage: &GenericRemoteStorage,
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
timeline_id: Option<&TimelineId>,
index_generation: Generation,
cancel: &CancellationToken,
) -> Result<(IndexPart, Generation, SystemTime), DownloadError> {
let timeline_id =
timeline_id.expect("A timeline ID is always provided when downloading an index");
let remote_path = remote_index_path(tenant_shard_id, timeline_id, index_generation);
let (index_part_bytes, index_part_mtime) =
@@ -403,59 +408,79 @@ async fn do_download_index_part(
Ok((index_part, index_generation, index_part_mtime))
}
/// index_part.json objects are suffixed with a generation number, so we cannot
/// directly GET the latest index part without doing some probing.
/// Metadata objects are "generationed", meaning that they include a generation suffix. This
/// function downloads the object with the highest generation <= `my_generation`.
///
/// In this function we probe for the most recent index in a generation <= our current generation.
/// See "Finding the remote indices for timelines" in docs/rfcs/025-generation-numbers.md
/// Data objects (layer files) also include a generation in their path, but there is no equivalent
/// search process, because their reference from an index includes the generation.
///
/// An expensive object listing operation is only done if necessary: the typical fast path is to issue two
/// GET operations, one to our own generation (stale attachment case), and one to the immediately preceding
/// generation (normal case when migrating/restarting). Only if both of these return 404 do we fall back
/// to listing objects.
///
/// * `my_generation`: the value of `[crate::tenant::Tenant::generation]`
/// * `what`: for logging, what object are we downloading
/// * `prefix`: when listing objects, use this prefix (i.e. the part of the object path before the generation)
/// * `do_download`: a GET of the object in a particular generation, which should **retry indefinitely** unless
/// `cancel`` has fired. This function does not do its own retries of GET operations, and relies
/// on the function passed in to do so.
/// * `parse_path`: parse a fully qualified remote storage path to get the generation of the object.
#[allow(clippy::too_many_arguments)]
#[tracing::instrument(skip_all, fields(generation=?my_generation))]
pub(crate) async fn download_index_part(
storage: &GenericRemoteStorage,
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
pub(crate) async fn download_generation_object<'a, T, DF, DFF, PF>(
storage: &'a GenericRemoteStorage,
tenant_shard_id: &'a TenantShardId,
timeline_id: Option<&'a TimelineId>,
my_generation: Generation,
cancel: &CancellationToken,
) -> Result<(IndexPart, Generation, SystemTime), DownloadError> {
debug_assert_current_span_has_tenant_and_timeline_id();
what: &str,
prefix: RemotePath,
do_download: DF,
parse_path: PF,
cancel: &'a CancellationToken,
) -> Result<(T, Generation, SystemTime), DownloadError>
where
DF: Fn(
&'a GenericRemoteStorage,
&'a TenantShardId,
Option<&'a TimelineId>,
Generation,
&'a CancellationToken,
) -> DFF,
DFF: Future<Output = Result<(T, Generation, SystemTime), DownloadError>>,
PF: Fn(RemotePath) -> Option<Generation>,
T: 'static,
{
debug_assert_current_span_has_tenant_id();
if my_generation.is_none() {
// Operating without generations: just fetch the generation-less path
return do_download_index_part(
storage,
tenant_shard_id,
timeline_id,
my_generation,
cancel,
)
.await;
return do_download(storage, tenant_shard_id, timeline_id, my_generation, cancel).await;
}
// Stale case: If we were intentionally attached in a stale generation, there may already be a remote
// index in our generation.
// Stale case: If we were intentionally attached in a stale generation, the remote object may already
// exist in our generation.
//
// This is an optimization to avoid doing the listing for the general case below.
let res =
do_download_index_part(storage, tenant_shard_id, timeline_id, my_generation, cancel).await;
let res = do_download(storage, tenant_shard_id, timeline_id, my_generation, cancel).await;
match res {
Ok(index_part) => {
tracing::debug!(
"Found index_part from current generation (this is a stale attachment)"
);
return Ok(index_part);
Ok(decoded) => {
tracing::debug!("Found {what} from current generation (this is a stale attachment)");
return Ok(decoded);
}
Err(DownloadError::NotFound) => {}
Err(e) => return Err(e),
};
// Typical case: the previous generation of this tenant was running healthily, and had uploaded
// and index part. We may safely start from this index without doing a listing, because:
// Typical case: the previous generation of this tenant was running healthily, and had uploaded the object
// we are seeking in that generation. We may safely start from this index without doing a listing, because:
// - We checked for current generation case above
// - generations > my_generation are to be ignored
// - any other indices that exist would have an older generation than `previous_gen`, and
// we want to find the most recent index from a previous generation.
// - any other objects that exist would have an older generation than `previous_gen`, and
// we want to find the most recent object from a previous generation.
//
// This is an optimization to avoid doing the listing for the general case below.
let res = do_download_index_part(
let res = do_download(
storage,
tenant_shard_id,
timeline_id,
@@ -464,14 +489,12 @@ pub(crate) async fn download_index_part(
)
.await;
match res {
Ok(index_part) => {
tracing::debug!("Found index_part from previous generation");
return Ok(index_part);
Ok(decoded) => {
tracing::debug!("Found {what} from previous generation");
return Ok(decoded);
}
Err(DownloadError::NotFound) => {
tracing::debug!(
"No index_part found from previous generation, falling back to listing"
);
tracing::debug!("No {what} found from previous generation, falling back to listing");
}
Err(e) => {
return Err(e);
@@ -481,12 +504,10 @@ pub(crate) async fn download_index_part(
// General case/fallback: if there is no index at my_generation or prev_generation, then list all index_part.json
// objects, and select the highest one with a generation <= my_generation. Constructing the prefix is equivalent
// to constructing a full index path with no generation, because the generation is a suffix.
let index_prefix = remote_index_path(tenant_shard_id, timeline_id, Generation::none());
let indices = download_retry(
let paths = download_retry(
|| async {
storage
.list(Some(&index_prefix), ListingMode::NoDelimiter, None, cancel)
.list(Some(&prefix), ListingMode::NoDelimiter, None, cancel)
.await
},
"list index_part files",
@@ -497,22 +518,22 @@ pub(crate) async fn download_index_part(
// General case logic for which index to use: the latest index whose generation
// is <= our own. See "Finding the remote indices for timelines" in docs/rfcs/025-generation-numbers.md
let max_previous_generation = indices
let max_previous_generation = paths
.into_iter()
.filter_map(|o| parse_remote_index_path(o.key))
.filter_map(|o| parse_path(o.key))
.filter(|g| g <= &my_generation)
.max();
match max_previous_generation {
Some(g) => {
tracing::debug!("Found index_part in generation {g:?}");
do_download_index_part(storage, tenant_shard_id, timeline_id, g, cancel).await
tracing::debug!("Found {what} in generation {g:?}");
do_download(storage, tenant_shard_id, timeline_id, g, cancel).await
}
None => {
// Migration from legacy pre-generation state: we have a generation but no prior
// attached pageservers did. Try to load from a no-generation path.
tracing::debug!("No index_part.json* found");
do_download_index_part(
tracing::debug!("No {what}* found");
do_download(
storage,
tenant_shard_id,
timeline_id,
@@ -524,6 +545,57 @@ pub(crate) async fn download_index_part(
}
}
/// index_part.json objects are suffixed with a generation number, so we cannot
/// directly GET the latest index part without doing some probing.
///
/// In this function we probe for the most recent index in a generation <= our current generation.
/// See "Finding the remote indices for timelines" in docs/rfcs/025-generation-numbers.md
pub(crate) async fn download_index_part(
storage: &GenericRemoteStorage,
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
my_generation: Generation,
cancel: &CancellationToken,
) -> Result<(IndexPart, Generation, SystemTime), DownloadError> {
debug_assert_current_span_has_tenant_and_timeline_id();
let index_prefix = remote_index_path(tenant_shard_id, timeline_id, Generation::none());
download_generation_object(
storage,
tenant_shard_id,
Some(timeline_id),
my_generation,
"index_part",
index_prefix,
do_download_index_part,
parse_remote_index_path,
cancel,
)
.await
}
pub(crate) async fn download_tenant_manifest(
storage: &GenericRemoteStorage,
tenant_shard_id: &TenantShardId,
my_generation: Generation,
cancel: &CancellationToken,
) -> Result<(TenantManifest, Generation, SystemTime), DownloadError> {
let manifest_prefix = remote_tenant_manifest_prefix(tenant_shard_id);
download_generation_object(
storage,
tenant_shard_id,
None,
my_generation,
"tenant-manifest",
manifest_prefix,
do_download_tenant_manifest,
parse_remote_tenant_manifest_path,
cancel,
)
.await
}
pub(crate) async fn download_initdb_tar_zst(
conf: &'static PageServerConf,
storage: &GenericRemoteStorage,

View File

@@ -3,7 +3,7 @@ use serde::{Deserialize, Serialize};
use utils::{id::TimelineId, lsn::Lsn};
/// Tenant-shard scoped manifest
#[derive(Clone, Serialize, Deserialize)]
#[derive(Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct TenantManifest {
/// Debugging aid describing the version of this manifest.
/// Can also be used for distinguishing breaking changes later on.
@@ -23,7 +23,7 @@ pub struct TenantManifest {
/// Very similar to [`pageserver_api::models::OffloadedTimelineInfo`],
/// but the two datastructures serve different needs, this is for a persistent disk format
/// that must be backwards compatible, while the other is only for informative purposes.
#[derive(Clone, Serialize, Deserialize, Copy)]
#[derive(Clone, Serialize, Deserialize, Copy, PartialEq, Eq)]
pub struct OffloadedTimelineManifest {
pub timeline_id: TimelineId,
/// Whether the timeline has a parent it has been branched off from or not

View File

@@ -187,6 +187,8 @@ pub(super) async fn gather_inputs(
// but it is unlikely to cause any issues. In the worst case,
// the calculation will error out.
timelines.retain(|t| t.is_active());
// Also filter out archived timelines.
timelines.retain(|t| t.is_archived() != Some(true));
// Build a map of branch points.
let mut branchpoints: HashMap<TimelineId, HashSet<Lsn>> = HashMap::new();

View File

@@ -1,5 +1,6 @@
//! Common traits and structs for layers
pub mod batch_split_writer;
pub mod delta_layer;
pub mod filter_iterator;
pub mod image_layer;
@@ -8,14 +9,13 @@ pub(crate) mod layer;
mod layer_desc;
mod layer_name;
pub mod merge_iterator;
pub mod split_writer;
use crate::context::{AccessStatsBehavior, RequestContext};
use crate::repository::Value;
use crate::walrecord::NeonWalRecord;
use bytes::Bytes;
use pageserver_api::key::Key;
use pageserver_api::key::{Key, NON_INHERITED_SPARSE_RANGE};
use pageserver_api::keyspace::{KeySpace, KeySpaceRandomAccum};
use pageserver_api::record::NeonWalRecord;
use pageserver_api::value::Value;
use std::cmp::{Ordering, Reverse};
use std::collections::hash_map::Entry;
use std::collections::{BinaryHeap, HashMap};
@@ -196,6 +196,9 @@ impl ValuesReconstructState {
/// Returns true if this was the last value needed for the key and false otherwise.
///
/// If the key is done after the update, mark it as such.
///
/// If the key is in the sparse keyspace (i.e., aux files), we do not track them in
/// `key_done`.
pub(crate) fn update_key(
&mut self,
key: &Key,
@@ -206,10 +209,18 @@ impl ValuesReconstructState {
.keys
.entry(*key)
.or_insert(Ok(VectoredValueReconstructState::default()));
let is_sparse_key = NON_INHERITED_SPARSE_RANGE.contains(key);
if let Ok(state) = state {
let key_done = match state.situation {
ValueReconstructSituation::Complete => unreachable!(),
ValueReconstructSituation::Complete => {
if is_sparse_key {
// Sparse keyspace might be visited multiple times because
// we don't track unmapped keyspaces.
return ValueReconstructSituation::Complete;
} else {
unreachable!()
}
}
ValueReconstructSituation::Continue => match value {
Value::Image(img) => {
state.img = Some((lsn, img));
@@ -234,7 +245,9 @@ impl ValuesReconstructState {
if key_done && state.situation == ValueReconstructSituation::Continue {
state.situation = ValueReconstructSituation::Complete;
self.keys_done.add_key(*key);
if !is_sparse_key {
self.keys_done.add_key(*key);
}
}
state.situation

View File

@@ -5,48 +5,162 @@ use pageserver_api::key::{Key, KEY_SIZE};
use utils::{id::TimelineId, lsn::Lsn, shard::TenantShardId};
use crate::tenant::storage_layer::Layer;
use crate::{config::PageServerConf, context::RequestContext, repository::Value, tenant::Timeline};
use crate::{config::PageServerConf, context::RequestContext, tenant::Timeline};
use pageserver_api::value::Value;
use super::layer::S3_UPLOAD_LIMIT;
use super::{
DeltaLayerWriter, ImageLayerWriter, PersistentLayerDesc, PersistentLayerKey, ResidentLayer,
};
pub(crate) enum SplitWriterResult {
pub(crate) enum BatchWriterResult {
Produced(ResidentLayer),
Discarded(PersistentLayerKey),
}
#[cfg(test)]
impl SplitWriterResult {
impl BatchWriterResult {
fn into_resident_layer(self) -> ResidentLayer {
match self {
SplitWriterResult::Produced(layer) => layer,
SplitWriterResult::Discarded(_) => panic!("unexpected discarded layer"),
BatchWriterResult::Produced(layer) => layer,
BatchWriterResult::Discarded(_) => panic!("unexpected discarded layer"),
}
}
fn into_discarded_layer(self) -> PersistentLayerKey {
match self {
SplitWriterResult::Produced(_) => panic!("unexpected produced layer"),
SplitWriterResult::Discarded(layer) => layer,
BatchWriterResult::Produced(_) => panic!("unexpected produced layer"),
BatchWriterResult::Discarded(layer) => layer,
}
}
}
enum LayerWriterWrapper {
Image(ImageLayerWriter),
Delta(DeltaLayerWriter),
}
/// An layer writer that takes unfinished layers and finish them atomically.
#[must_use]
pub struct BatchLayerWriter {
generated_layer_writers: Vec<(LayerWriterWrapper, PersistentLayerKey)>,
conf: &'static PageServerConf,
}
impl BatchLayerWriter {
pub async fn new(conf: &'static PageServerConf) -> anyhow::Result<Self> {
Ok(Self {
generated_layer_writers: Vec::new(),
conf,
})
}
pub fn add_unfinished_image_writer(
&mut self,
writer: ImageLayerWriter,
key_range: Range<Key>,
lsn: Lsn,
) {
self.generated_layer_writers.push((
LayerWriterWrapper::Image(writer),
PersistentLayerKey {
key_range,
lsn_range: PersistentLayerDesc::image_layer_lsn_range(lsn),
is_delta: false,
},
));
}
pub fn add_unfinished_delta_writer(
&mut self,
writer: DeltaLayerWriter,
key_range: Range<Key>,
lsn_range: Range<Lsn>,
) {
self.generated_layer_writers.push((
LayerWriterWrapper::Delta(writer),
PersistentLayerKey {
key_range,
lsn_range,
is_delta: true,
},
));
}
pub(crate) async fn finish_with_discard_fn<D, F>(
self,
tline: &Arc<Timeline>,
ctx: &RequestContext,
discard_fn: D,
) -> anyhow::Result<Vec<BatchWriterResult>>
where
D: Fn(&PersistentLayerKey) -> F,
F: Future<Output = bool>,
{
let Self {
generated_layer_writers,
..
} = self;
let clean_up_layers = |generated_layers: Vec<BatchWriterResult>| {
for produced_layer in generated_layers {
if let BatchWriterResult::Produced(resident_layer) = produced_layer {
let layer: Layer = resident_layer.into();
layer.delete_on_drop();
}
}
};
// BEGIN: catch every error and do the recovery in the below section
let mut generated_layers: Vec<BatchWriterResult> = Vec::new();
for (inner, layer_key) in generated_layer_writers {
if discard_fn(&layer_key).await {
generated_layers.push(BatchWriterResult::Discarded(layer_key));
} else {
let res = match inner {
LayerWriterWrapper::Delta(writer) => {
writer.finish(layer_key.key_range.end, ctx).await
}
LayerWriterWrapper::Image(writer) => {
writer
.finish_with_end_key(layer_key.key_range.end, ctx)
.await
}
};
let layer = match res {
Ok((desc, path)) => {
match Layer::finish_creating(self.conf, tline, desc, &path) {
Ok(layer) => layer,
Err(e) => {
tokio::fs::remove_file(&path).await.ok();
clean_up_layers(generated_layers);
return Err(e);
}
}
}
Err(e) => {
// Image/DeltaLayerWriter::finish will clean up the temporary layer if anything goes wrong,
// so we don't need to remove the layer we just failed to create by ourselves.
clean_up_layers(generated_layers);
return Err(e);
}
};
generated_layers.push(BatchWriterResult::Produced(layer));
}
}
// END: catch every error and do the recovery in the above section
Ok(generated_layers)
}
}
/// An image writer that takes images and produces multiple image layers.
///
/// The interface does not guarantee atomicity (i.e., if the image layer generation
/// fails, there might be leftover files to be cleaned up)
#[must_use]
pub struct SplitImageLayerWriter {
inner: ImageLayerWriter,
target_layer_size: u64,
generated_layer_writers: Vec<(ImageLayerWriter, PersistentLayerKey)>,
lsn: Lsn,
conf: &'static PageServerConf,
timeline_id: TimelineId,
tenant_shard_id: TenantShardId,
lsn: Lsn,
batches: BatchLayerWriter,
start_key: Key,
}
@@ -71,10 +185,10 @@ impl SplitImageLayerWriter {
ctx,
)
.await?,
generated_layer_writers: Vec::new(),
conf,
timeline_id,
tenant_shard_id,
batches: BatchLayerWriter::new(conf).await?,
lsn,
start_key,
})
@@ -102,16 +216,13 @@ impl SplitImageLayerWriter {
ctx,
)
.await?;
let layer_key = PersistentLayerKey {
key_range: self.start_key..key,
lsn_range: PersistentLayerDesc::image_layer_lsn_range(self.lsn),
is_delta: false,
};
let prev_image_writer = std::mem::replace(&mut self.inner, next_image_writer);
self.batches.add_unfinished_image_writer(
prev_image_writer,
self.start_key..key,
self.lsn,
);
self.start_key = key;
self.generated_layer_writers
.push((prev_image_writer, layer_key));
}
self.inner.put_image(key, img, ctx).await
}
@@ -122,64 +233,18 @@ impl SplitImageLayerWriter {
ctx: &RequestContext,
end_key: Key,
discard_fn: D,
) -> anyhow::Result<Vec<SplitWriterResult>>
) -> anyhow::Result<Vec<BatchWriterResult>>
where
D: Fn(&PersistentLayerKey) -> F,
F: Future<Output = bool>,
{
let Self {
mut generated_layer_writers,
inner,
..
mut batches, inner, ..
} = self;
if inner.num_keys() != 0 {
let layer_key = PersistentLayerKey {
key_range: self.start_key..end_key,
lsn_range: PersistentLayerDesc::image_layer_lsn_range(self.lsn),
is_delta: false,
};
generated_layer_writers.push((inner, layer_key));
batches.add_unfinished_image_writer(inner, self.start_key..end_key, self.lsn);
}
let clean_up_layers = |generated_layers: Vec<SplitWriterResult>| {
for produced_layer in generated_layers {
if let SplitWriterResult::Produced(image_layer) = produced_layer {
let layer: Layer = image_layer.into();
layer.delete_on_drop();
}
}
};
// BEGIN: catch every error and do the recovery in the below section
let mut generated_layers = Vec::new();
for (inner, layer_key) in generated_layer_writers {
if discard_fn(&layer_key).await {
generated_layers.push(SplitWriterResult::Discarded(layer_key));
} else {
let layer = match inner
.finish_with_end_key(layer_key.key_range.end, ctx)
.await
{
Ok((desc, path)) => {
match Layer::finish_creating(self.conf, tline, desc, &path) {
Ok(layer) => layer,
Err(e) => {
tokio::fs::remove_file(&path).await.ok();
clean_up_layers(generated_layers);
return Err(e);
}
}
}
Err(e) => {
// ImageLayerWriter::finish will clean up the temporary layer if anything goes wrong,
// so we don't need to remove the layer we just failed to create by ourselves.
clean_up_layers(generated_layers);
return Err(e);
}
};
generated_layers.push(SplitWriterResult::Produced(layer));
}
}
// END: catch every error and do the recovery in the above section
Ok(generated_layers)
batches.finish_with_discard_fn(tline, ctx, discard_fn).await
}
#[cfg(test)]
@@ -188,7 +253,7 @@ impl SplitImageLayerWriter {
tline: &Arc<Timeline>,
ctx: &RequestContext,
end_key: Key,
) -> anyhow::Result<Vec<SplitWriterResult>> {
) -> anyhow::Result<Vec<BatchWriterResult>> {
self.finish_with_discard_fn(tline, ctx, end_key, |_| async { false })
.await
}
@@ -196,9 +261,6 @@ impl SplitImageLayerWriter {
/// A delta writer that takes key-lsn-values and produces multiple delta layers.
///
/// The interface does not guarantee atomicity (i.e., if the delta layer generation fails,
/// there might be leftover files to be cleaned up).
///
/// Note that if updates of a single key exceed the target size limit, all of the updates will be batched
/// into a single file. This behavior might change in the future. For reference, the legacy compaction algorithm
/// will split them into multiple files based on size.
@@ -206,12 +268,12 @@ impl SplitImageLayerWriter {
pub struct SplitDeltaLayerWriter {
inner: Option<(Key, DeltaLayerWriter)>,
target_layer_size: u64,
generated_layer_writers: Vec<(DeltaLayerWriter, PersistentLayerKey)>,
conf: &'static PageServerConf,
timeline_id: TimelineId,
tenant_shard_id: TenantShardId,
lsn_range: Range<Lsn>,
last_key_written: Key,
batches: BatchLayerWriter,
}
impl SplitDeltaLayerWriter {
@@ -225,12 +287,12 @@ impl SplitDeltaLayerWriter {
Ok(Self {
target_layer_size,
inner: None,
generated_layer_writers: Vec::new(),
conf,
timeline_id,
tenant_shard_id,
lsn_range,
last_key_written: Key::MIN,
batches: BatchLayerWriter::new(conf).await?,
})
}
@@ -279,13 +341,11 @@ impl SplitDeltaLayerWriter {
.await?;
let (start_key, prev_delta_writer) =
std::mem::replace(&mut self.inner, Some((key, next_delta_writer))).unwrap();
let layer_key = PersistentLayerKey {
key_range: start_key..key,
lsn_range: self.lsn_range.clone(),
is_delta: true,
};
self.generated_layer_writers
.push((prev_delta_writer, layer_key));
self.batches.add_unfinished_delta_writer(
prev_delta_writer,
start_key..key,
self.lsn_range.clone(),
);
} else if inner.estimated_size() >= S3_UPLOAD_LIMIT {
// We have to produce a very large file b/c a key is updated too often.
anyhow::bail!(
@@ -305,64 +365,25 @@ impl SplitDeltaLayerWriter {
tline: &Arc<Timeline>,
ctx: &RequestContext,
discard_fn: D,
) -> anyhow::Result<Vec<SplitWriterResult>>
) -> anyhow::Result<Vec<BatchWriterResult>>
where
D: Fn(&PersistentLayerKey) -> F,
F: Future<Output = bool>,
{
let Self {
mut generated_layer_writers,
inner,
..
mut batches, inner, ..
} = self;
if let Some((start_key, writer)) = inner {
if writer.num_keys() != 0 {
let end_key = self.last_key_written.next();
let layer_key = PersistentLayerKey {
key_range: start_key..end_key,
lsn_range: self.lsn_range.clone(),
is_delta: true,
};
generated_layer_writers.push((writer, layer_key));
batches.add_unfinished_delta_writer(
writer,
start_key..end_key,
self.lsn_range.clone(),
);
}
}
let clean_up_layers = |generated_layers: Vec<SplitWriterResult>| {
for produced_layer in generated_layers {
if let SplitWriterResult::Produced(delta_layer) = produced_layer {
let layer: Layer = delta_layer.into();
layer.delete_on_drop();
}
}
};
// BEGIN: catch every error and do the recovery in the below section
let mut generated_layers = Vec::new();
for (inner, layer_key) in generated_layer_writers {
if discard_fn(&layer_key).await {
generated_layers.push(SplitWriterResult::Discarded(layer_key));
} else {
let layer = match inner.finish(layer_key.key_range.end, ctx).await {
Ok((desc, path)) => {
match Layer::finish_creating(self.conf, tline, desc, &path) {
Ok(layer) => layer,
Err(e) => {
tokio::fs::remove_file(&path).await.ok();
clean_up_layers(generated_layers);
return Err(e);
}
}
}
Err(e) => {
// DeltaLayerWriter::finish will clean up the temporary layer if anything goes wrong,
// so we don't need to remove the layer we just failed to create by ourselves.
clean_up_layers(generated_layers);
return Err(e);
}
};
generated_layers.push(SplitWriterResult::Produced(layer));
}
}
// END: catch every error and do the recovery in the above section
Ok(generated_layers)
batches.finish_with_discard_fn(tline, ctx, discard_fn).await
}
#[cfg(test)]
@@ -370,7 +391,7 @@ impl SplitDeltaLayerWriter {
self,
tline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<Vec<SplitWriterResult>> {
) -> anyhow::Result<Vec<BatchWriterResult>> {
self.finish_with_discard_fn(tline, ctx, |_| async { false })
.await
}

View File

@@ -30,7 +30,6 @@
use crate::config::PageServerConf;
use crate::context::{PageContentKind, RequestContext, RequestContextBuilder};
use crate::page_cache::{self, FileId, PAGE_SZ};
use crate::repository::{Key, Value, KEY_SIZE};
use crate::tenant::blob_io::BlobWriter;
use crate::tenant::block_io::{BlockBuf, BlockCursor, BlockLease, BlockReader, FileBlockReader};
use crate::tenant::disk_btree::{
@@ -46,7 +45,7 @@ use crate::tenant::PageReconstructError;
use crate::virtual_file::owned_buffers_io::io_buf_ext::{FullSlice, IoBufExt};
use crate::virtual_file::IoBufferMut;
use crate::virtual_file::{self, MaybeFatalIo, VirtualFile};
use crate::{walrecord, TEMP_FILE_SUFFIX};
use crate::TEMP_FILE_SUFFIX;
use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{anyhow, bail, ensure, Context, Result};
use camino::{Utf8Path, Utf8PathBuf};
@@ -54,9 +53,11 @@ use futures::StreamExt;
use itertools::Itertools;
use pageserver_api::config::MaxVectoredReadBytes;
use pageserver_api::key::DBDIR_KEY;
use pageserver_api::key::{Key, KEY_SIZE};
use pageserver_api::keyspace::KeySpace;
use pageserver_api::models::ImageCompressionAlgorithm;
use pageserver_api::shard::TenantShardId;
use pageserver_api::value::Value;
use rand::{distributions::Alphanumeric, Rng};
use serde::{Deserialize, Serialize};
use std::collections::VecDeque;
@@ -269,7 +270,7 @@ impl AsLayerDesc for DeltaLayer {
}
impl DeltaLayer {
pub(crate) async fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()> {
pub async fn dump(&self, verbose: bool, ctx: &RequestContext) -> Result<()> {
self.desc.dump();
if !verbose {
@@ -652,6 +653,10 @@ impl DeltaLayerWriter {
})
}
pub fn is_empty(&self) -> bool {
self.inner.as_ref().unwrap().num_keys == 0
}
///
/// Append a key-value pair to the file.
///
@@ -1084,7 +1089,7 @@ impl DeltaLayerInner {
}
}
pub(super) async fn load_keys<'a>(
pub(crate) async fn index_entries<'a>(
&'a self,
ctx: &RequestContext,
) -> Result<Vec<DeltaEntry<'a>>> {
@@ -1293,7 +1298,7 @@ impl DeltaLayerInner {
// is it an image or will_init walrecord?
// FIXME: this could be handled by threading the BlobRef to the
// VectoredReadBuilder
let will_init = crate::repository::ValueBytes::will_init(&data)
let will_init = pageserver_api::value::ValueBytes::will_init(&data)
.inspect_err(|_e| {
#[cfg(feature = "testing")]
tracing::error!(data=?utils::Hex(&data), err=?_e, %key, %lsn, "failed to parse will_init out of serialized value");
@@ -1346,7 +1351,7 @@ impl DeltaLayerInner {
tree_reader.dump().await?;
let keys = self.load_keys(ctx).await?;
let keys = self.index_entries(ctx).await?;
async fn dump_blob(val: &ValueRef<'_>, ctx: &RequestContext) -> anyhow::Result<String> {
let buf = val.load_raw(ctx).await?;
@@ -1356,7 +1361,7 @@ impl DeltaLayerInner {
format!(" img {} bytes", img.len())
}
Value::WalRecord(rec) => {
let wal_desc = walrecord::describe_wal_record(&rec)?;
let wal_desc = pageserver_api::record::describe_wal_record(&rec)?;
format!(
" rec {} bytes will_init: {} {}",
buf.len(),
@@ -1437,7 +1442,7 @@ impl DeltaLayerInner {
offset
}
pub(crate) fn iter<'a>(&'a self, ctx: &'a RequestContext) -> DeltaLayerIterator<'a> {
pub fn iter<'a>(&'a self, ctx: &'a RequestContext) -> DeltaLayerIterator<'a> {
let block_reader = FileBlockReader::new(&self.file, self.file_id);
let tree_reader =
DiskBtreeReader::new(self.index_start_blk, self.index_root_blk, block_reader);
@@ -1453,6 +1458,16 @@ impl DeltaLayerInner {
),
}
}
/// NB: not super efficient, but not terrible either. Should prob be an iterator.
//
// We're reusing the index traversal logical in plan_reads; would be nice to
// factor that out.
pub(crate) async fn load_keys(&self, ctx: &RequestContext) -> anyhow::Result<Vec<Key>> {
self.index_entries(ctx)
.await
.map(|entries| entries.into_iter().map(|entry| entry.key).collect())
}
}
/// A set of data associated with a delta layer key and its value
@@ -1600,7 +1615,6 @@ pub(crate) mod test {
use rand::RngCore;
use super::*;
use crate::repository::Value;
use crate::tenant::harness::TIMELINE_ID;
use crate::tenant::storage_layer::{Layer, ResidentLayer};
use crate::tenant::vectored_blob_io::StreamingVectoredReadPlanner;
@@ -1612,6 +1626,7 @@ pub(crate) mod test {
DEFAULT_PG_VERSION,
};
use bytes::Bytes;
use pageserver_api::value::Value;
/// Construct an index for a fictional delta layer and and then
/// traverse in order to plan vectored reads for a query. Finally,
@@ -1964,8 +1979,8 @@ pub(crate) mod test {
#[tokio::test]
async fn copy_delta_prefix_smoke() {
use crate::walrecord::NeonWalRecord;
use bytes::Bytes;
use pageserver_api::record::NeonWalRecord;
let h = crate::tenant::harness::TenantHarness::create("truncate_delta_smoke")
.await
@@ -2188,6 +2203,7 @@ pub(crate) mod test {
(k1, l1).cmp(&(k2, l2))
}
#[cfg(feature = "testing")]
pub(crate) fn sort_delta_value(
(k1, l1, v1): &(Key, Lsn, Value),
(k2, l2, v2): &(Key, Lsn, Value),

Some files were not shown because too many files have changed in this diff Show More