Compare commits

..

75 Commits

Author SHA1 Message Date
Bojan Serafimov
80d0378626 Change l0 definition (currently panics) 2023-01-18 14:34:27 -05:00
Heikki Linnakangas
e5cc2f92c4 Switch to 'tracing' for logging, restructure code to make use of spans.
Refactors Compute::prepare_and_run. It's split into subroutines
differently, to make it easier to attach tracing spans to the
different stages. The high-level logic for waiting for Postgres to
exit is moved to the caller.

Replace 'env_logger' with 'tracing', and add `#instrument` directives
to different stages fo the startup process. This is a fairly
mechanical change, except for the changes in 'spec.rs'. 'spec.rs'
contained some complicated formatting, where parts of log messages
were printed directly to stdout with `print`s. That was a bit messed
up because the log normally goes to stderr, but those lines were
printed to stdout. In our docker images, stderr and stdout both go to
the same place so you wouldn't notice, but I don't think it was
intentional.

This changes the log format to the default
'tracing_subscriber::format' format. It's different from the Postgres
log format, however, and because both compute_tools and Postgres print
to the same log, it's now a mix of two different formats.  I'm not
sure how the Grafana log parsing pipeline can handle that. If it's a
problem, we can build custom formatter to change the compute_tools log
format to be the same as Postgres's, like it was before this commit,
or we can change the Postgres log format to match tracing_formatter's,
or we can start printing compute_tool's log output to a different
destination than Postgres
2023-01-18 19:42:47 +02:00
Kirill Bulatov
90f66aa51b Enable logs in unit tests 2023-01-18 17:43:27 +02:00
Kirill Bulatov
826e89b9ce Use actual temporary dir for pageserver unit tests 2023-01-18 17:43:27 +02:00
Vadim Kharitonov
e59d32ac5d Change SENTRY_ENVIRONMENT from "development" to "staging" 2023-01-18 16:34:49 +01:00
Anastasia Lubennikova
506086a3e2 Fix metric_collection_endpoint for prod.
It was incorrectly set to staging url
2023-01-18 16:35:43 +02:00
Heikki Linnakangas
3b58c61b33 If an error happens while checking for core dumps, don't panic.
If we panic, we skip the 30s wait in 'main', and don't give the
console a chance to observe the error. Which is not nice.

Spotted by @ololobus at
https://github.com/neondatabase/neon/pull/3352#discussion_r1072806981
2023-01-18 11:25:47 +02:00
Kirill Bulatov
c6b56d2967 Add more io::Error context when fail to operate on a path (#3254)
I have a test failure that shows 

```
Caused by:
    0: Failed to reconstruct a page image:
    1: Directory not empty (os error 39)
```

but does not really show where exactly that happens.

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3227/release/3823785365/index.html#categories/c0057473fc9ec8fb70876fd29a171ce8/7088dab272f2c7b7/?attachment=60fe6ed2add4d82d

The PR aims to add more context in debugging that issue.
2023-01-17 22:07:38 +02:00
Anastasia Lubennikova
9d3992ef48 Increase metric_collection_interval for proxy on prod
to not owerwhelm the service
2023-01-17 15:50:19 +02:00
Anastasia Lubennikova
7624963e13 Enable metric_collection_endpoint for proxy on prod
in all regions
2023-01-17 13:43:50 +02:00
Anastasia Lubennikova
63e3b815a2 Enable metric_collection_endpoint for pageserver on prod
in all regions
2023-01-17 13:43:50 +02:00
Kirill Bulatov
1ebd145c29 Actualize the comment (#3362)
Follow-up of
https://github.com/neondatabase/neon/pull/3326#issuecomment-1384265759
2023-01-17 13:30:42 +02:00
sharnoff
f8e887830a build: Use curl -f on vm-informant download (#3363)
Without this, we can silently fail
2023-01-17 10:38:33 +01:00
Christian Schwarz
48dd9565ac TaskHandle: tone down sender is dropped while join handle is still alive
Rationale: see comments added as part of this commit.

fixes https://github.com/neondatabase/neon/issues/3339
2023-01-17 09:42:22 +01:00
Anastasia Lubennikova
e067cd2947 Enable metric collection for proxy on staging 2023-01-16 21:15:42 +02:00
Christian Schwarz
58c8c1076c download_all_remote_layers API: require client to specify max_concurrent_downloads
Before this patch, we would start all layer downloads simultaneously.

There is at most one download_all_remote_layers task per timeline.
Hence, the specified limit is per timeline.

There is still no global concurrency limit for layer downloads.
We'll have to revisit that at some point and also prioritize on-demand
initiated downloads over download_all_remote_layers downloads.
But that's for another day.
2023-01-16 19:29:06 +01:00
Alexander Bayandin
4c6b507472 Update Postgres clients we test (#3359)
Update client libraries and runtimes for Postgres libraries we test.
- `pg8000` works with Neon now 🎉 
- `PostgresClientKit` still doesn't support SNI
2023-01-16 17:22:17 +00:00
Stas Kelvich
431e464c1e Consumption metering RFC 2023-01-16 19:15:59 +02:00
danieltprice
424fd0bd63 Update auth.rs (#3349)
Update SNI error message. Users now specify the endpoint ID when making
a connection to Neon. This should be reflected in the error message.
2023-01-16 12:32:00 -04:00
Joonas Koivunen
a8a9bee602 walredo: simple tests and bench updates (#3045)
Separated from #2875.

The microbenchmark has been validated to show similar difference as to
larger scale OLTP benchmark.
2023-01-16 18:24:45 +02:00
Vadim Kharitonov
6ac5656be5 Enable earthdistance extension 2023-01-16 17:04:51 +01:00
Anastasia Lubennikova
3c571ecde8 Update docs/consumption_metrics.md 2023-01-16 17:24:13 +02:00
Anastasia Lubennikova
5f1bd0e8a3 Add documentation for consumption metrics 2023-01-16 17:24:13 +02:00
Anastasia Lubennikova
2cbe84b78f Proxy metrics (#3290)
Implement proxy metrics collection.
Only collect metric for outbound traffic.

Add proxy CLI parameters:
- metric-collection-endpoint
- metric-collection-interval.

Add test_proxy_metric_collection test.

Move shared consumption metrics code to libs/consumption_metrics.
Refactor the code.
2023-01-16 15:17:28 +00:00
sharnoff
5c6a7a17cb Add VM informant to vm-compute-node (#3324)
The general idea is that the VM informant binary is added to the
vm-compute-node images only. `compute_tools` then will run whatever's at
`/bin/vm-informant`, if the path exists.
2023-01-16 07:05:29 -08:00
Arseny Sher
84ffdc8b4f Don't keep FDs open on cancelled timelines in safekeepers.
Since PR #3300 we don't remove timelines completely until next restart, so this
prevents leakage.

fixes https://github.com/neondatabase/neon/issues/3336
2023-01-16 19:03:38 +04:00
Kirill Bulatov
bce4233d3a Rework Cargo.toml dependencies (#3322)
* Use workspace variables from cargo, coming with rustc
[1.64](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1640-2022-09-22)

See
https://doc.rust-lang.org/nightly/cargo/reference/workspaces.html#the-package-table
and
https://doc.rust-lang.org/nightly/cargo/reference/workspaces.html#the-dependencies-table
sections.

Now, all dependencies in all non-root `Cargo.toml` files are defined as 
```
clap.workspace = true
```

sometimes, when extra features are needed, as 
```
bytes = {workspace = true, features = ['serde'] }
```

With the actual declarations (with shared features and version
numbers/file paths/etc.) in the root Cargo.toml.
Features are additive:

https://doc.rust-lang.org/nightly/cargo/reference/specifying-dependencies.html#inheriting-a-dependency-from-a-workspace

* Uses the mechanism above to set common, 2021, edition and license across the
workspace

* Mechanically bumps a few dependencies

* Updates hakari format, as it suggested:
```
work/neon/neon kb/cargo-templated ❯ cargo hakari generate
info: no changes detected
info: new hakari format version available: 3 (current: 2)
(add or update `dep-format-version = "3"` in hakari.toml, then run `cargo hakari generate && cargo hakari manage-deps`)
```
2023-01-13 18:13:34 +02:00
Vadim Kharitonov
16baa91b2b Add more information about cargo deny 2023-01-13 13:24:34 +01:00
Kirill Bulatov
99808558de Avoid duplicate timeline insert (#3326)
`initialize_with_lock` inserts `Arc<Timeline>` before returning it:
c1731bc4f0/pageserver/src/tenant.rs (L222)

but `setup_timeline` function did another insert, which got removed in this PR:
c1731bc4f0/pageserver/src/tenant.rs (L486)


On top, a better comment and function renames are added.
2023-01-13 12:05:54 +00:00
Anastasia Lubennikova
c6d383e239 code cleanup 2023-01-13 11:51:28 +02:00
Anastasia Lubennikova
5e3e0fbf6f remove unneeded Cargo.lock changes 2023-01-13 11:51:28 +02:00
Anastasia Lubennikova
26f39c03f2 review code cleanup:
- handle errors in calculate_synthetic_size_worker. Don't exit the bgworker if one tenant failed.

- add cached_synthetic_tenant_size to cache values calculated by the bgworker

- code cleanup: remove unneeded info! messages, clean comments

- handle collect_metrics_task() error. Don't exit collect_metrics worker if one task failed.

 - add unit test to cover case when we have multiple branches at the same lsn
2023-01-13 11:51:28 +02:00
Anastasia Lubennikova
148e020fb9 Fix logical size calculation:
sort updates in topological order so that the parent timeline always preceeds its children.
    fixes #3179
2023-01-13 11:51:28 +02:00
Anastasia Lubennikova
0675859bb0 Add background worker that periodically spawns
synthetic size calculation.
Add new pageserver config param calculate_synthetic_size_interval
2023-01-13 11:51:28 +02:00
Anastasia Lubennikova
ba0190e3e8 Handle errors in tenant_size_model code 2023-01-13 11:51:28 +02:00
Konstantin Knizhnik
9ce5ada89e Do not report position in SMGR message (#3307)
refer #3277
2023-01-13 10:23:35 +02:00
Alexander Bayandin
c28bfd4c63 Nightly Benchmarks: add user provided example (#3308) 2023-01-12 23:03:21 +00:00
Vadim Kharitonov
dec875fee1 Disable postgis_sfcgal 2023-01-12 21:51:49 +01:00
Kirill Bulatov
fe8cef3427 Use ready! rustc 1.64 macro (#3315)
rustc
[1.64](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1640-2022-09-22)
had brought `ready!` macro:
https://doc.rust-lang.org/stable/std/task/macro.ready.html

Use it to shorten the code slightly.
2023-01-12 21:27:34 +02:00
MMeent
bb406b21a8 Fix issue in compaction code (#3246)
If we ran `compact_prefetch_buffers` with exactly one hole in the
buffers, the code would fail to remove the last, now unused, entry from
the array.

This is now fixed. 

Also, add and adjust some comments in the compaction code so that the
algorithm used is a bit more clear.

Fixes #3192
2023-01-12 19:23:59 +01:00
Heikki Linnakangas
57a6e931ea Comment, formatting and other cosmetic cleanup. 2023-01-12 19:05:13 +02:00
Heikki Linnakangas
0cceb14e48 Add a FIXME on ugly error message parsing. 2023-01-12 19:05:13 +02:00
Konstantin Knizhnik
1983c4d4ad Explain prefetch (#3002)
Co-authored-by: Bojan Serafimov <bojan.serafimov7@gmail.com>
2023-01-12 18:12:40 +02:00
Heikki Linnakangas
d7c41cbbee Replace tokio::watch with CancellationToken.
PR #3228 starts to use CancellationTokens more widely, this is a small
part extracted from that.
2023-01-12 17:37:15 +02:00
Vadim Kharitonov
29a2465276 Update rust version in toolchain 2023-01-12 15:16:52 +01:00
Arthur Petukhovsky
f49e923d87 Keep deleted timelines in memory of safekeeper (#3300)
A temporal fix for https://github.com/neondatabase/neon/issues/3146,
until we come up with a reliable way to create and delete timelines in
all safekeepers.
2023-01-12 15:33:07 +03:00
Vadim Kharitonov
a0ee306c74 Enable safe contrib extensions 2023-01-12 12:41:53 +01:00
Heikki Linnakangas
c1731bc4f0 Push on-demand download into Timeline::get() function itself.
This makes Timeline::get() async, and all functions that call it
directly or indirectly with it. The with_ondemand_download() mechanism
is gone, Timeline::get() now always downloads files, whether you want
it or not. That is what all the current callers want, so even though
this loses the capability to get a page only if it's already in the
pageserver, without downloading, we were not using that capability.
There were some places that used 'no_ondemand_download' in the WAL
ingestion code that would error out if a layer file was not found
locally, but those were dubious. We do actually want to on-demand
download in all of those places.

Per discussion at
https://github.com/neondatabase/neon/pull/3233#issuecomment-1368032358
2023-01-12 11:53:10 +02:00
Sergey Melnikov
95bf19b85a Add --atomic to all helm upgrade operations (#3299)
When number of github actions workers is changed, some jobs get killed.
When helm if killed during the upgrade, release stuck in pending-upgrade
state. --atomic should initiate automatic rollback in this case.
2023-01-10 10:05:27 +00:00
Vadim Kharitonov
80d4afab0c Update tokio version (RUSTSEC-2023-0001) 2023-01-10 09:02:00 +01:00
Arthur Petukhovsky
0807522a64 Enable wss proxy in all regions (#3292)
Follow-up to https://github.com/neondatabase/helm-charts/pull/24 and
#3247
2023-01-09 19:56:12 +00:00
Christian Schwarz
8eebd5f039 run on-demand compaction in a task_mgr task
With this patch, tenant_detach and timeline_delete's
task_mgr::shutdown_tasks() call will wait for on-demand
compaction to finish.
Before this patch, the on-demand compaction would grab the
layer_removal_cs after tenant_detach / timeline_delete had
removed the timeline directory.
This resulted in error

  No such file or directory (os error 2)

NB: I already implemented this pattern for ondemand GC a while back.

fixes https://github.com/neondatabase/neon/issues/3136
2023-01-09 19:08:22 +01:00
Heikki Linnakangas
8c07ef413d Minor cleanup of test_ondemand_download_timetravel test.
- Fix and improve comments
- Rename 'physical_size' local variable to 'resident_size' for clarity.
- Remove one 'unnecessary wait_for_upload' call. The
  'wait_for_sk_commit_lsn_to_reach_remote_storage' call after shutting
  down compute is sufficient.
2023-01-09 18:56:50 +02:00
Sergey Melnikov
14df37c108 Use GHA environments for gradual prod rollout (#3295)
Each release will wait for manual approval for each region
2023-01-09 20:18:16 +04:00
Christian Schwarz
d4d0aa6ed6 gc_iteration_internal: better log message & debug log level if nothing to do
fixes https://github.com/neondatabase/neon/issues/3107
2023-01-09 13:53:59 +01:00
Kirill Bulatov
a457256fef Fix log message matching (#3291)
Spotted
https://neon-github-public-dev.s3.amazonaws.com/reports/main/debug/3871991071/index.html#suites/158be07438eb5188d40b466b6acfaeb3/22966d740e33b677/
failing on `main`, fixes that by using a proper regex match string.

Also removes one clippy lint suppression.
2023-01-09 14:25:12 +02:00
Shany Pozin
3a22e1335d Adding a PR template (#3288)
## Describe your changes
Added a PR template 
## Issue ticket number and link
#3162
## Checklist before requesting a review
- [ ] I have performed a self-review of my code
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-01-09 12:15:53 +00:00
Sergey Melnikov
93c77b0383 Use GHA environment for per-region deploy approvals on staging (#3293)
Each main deploy will wait for manual approval for each region
2023-01-09 15:40:14 +04:00
Shany Pozin
7920b39a27 Adding transition reason to the log when a tenant is moved to Broken state (#3289)
#3160
2023-01-09 10:24:50 +02:00
Kirill Bulatov
23d5e2bdaa Fix common pg port in the CLI basics test (#3283)
Closes https://github.com/neondatabase/neon/issues/3282
2023-01-07 00:46:42 +02:00
Christian Schwarz
3526323bc4 prepare Timeline::get_reconstruct_data for becoming async (#3271)
This patch restructures the code so that PR
https://github.com/neondatabase/neon/pull/3228 can seamlessly
replace the return PageReconstructResult::NeedsDownload with
a download_remote_layer().await.

Background:

PR https://github.com/neondatabase/neon/pull/3228 will turn
get_reconstruct_data() async and do the on-demand
download right in place, instead of returning a
PageReconstructResult::NeedsDownload.

Current rustc requires that the layers lock guard be not in scope
across an await point.

For on-demand download inside get_reconstruct_data(), we need
to do download_remote_layer().await.

Supersedes https://github.com/neondatabase/neon/pull/3260

See my comment there:
https://github.com/neondatabase/neon/pull/3260#issuecomment-1370752407

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-01-06 19:42:25 +02:00
Heikki Linnakangas
af9425394f Print time taken by CREATE/ALTER DATABASE at compute start.
Trying to investigate why the "apply_config" stage is taking longer
than expected. This proves or disproves that it's the CREATE DATABASE
statement.
2023-01-06 17:50:44 +02:00
Arthur Petukhovsky
debd134b15 Implement wss support in proxy (#3247)
This is a hacky implementation of WebSocket server, embedded into our
postgres proxy. The server is used to allow https://github.com/neondatabase/serverless 
to connect to our postgres from browser and serverless javascript functions.

How it will work (general schema):
- browser opens a websocket connection to
`wss://ep-abc-xyz-123.xx-central-1.aws.neon.tech/`
- proxy accepts this connection and terminates TLS (https)
- inside encrypted tunnel (HTTPS), browser initiates plain
(non-encrypted) postgres connection
- proxy performs auth as in usual plain pg connection and forwards
connection to the compute

Related issue: #3225
2023-01-06 18:34:18 +03:00
Heikki Linnakangas
df42213dbb Fix missing COMMIT in handle_role_deletions.
There was no COMMIT, so the DROP ROLE commands were always implicitly
rolled back.

Fixes issue #3279.
2023-01-06 17:07:46 +02:00
Kirill Bulatov
b6237474d2 Fix README and basic startup example (#3275)
Follow-up of https://github.com/neondatabase/neon/pull/3270 which made
an example from main README.md not working.

Fixes that, by adding a way to specify a default tenant now and modifies
the basic neon_local test to start postgres and check branching.
Not all neon_local commands are implemented, so not all README.md
contents is tested yet.
2023-01-06 12:26:14 +02:00
Heikki Linnakangas
8b710b9753 Fix segfault if pageserver connection is lost during backend startup.
It's not OK to return early from within a PG_TRY-CATCH block. The
PG_TRY macro sets the global PG_exception_stack variable, and
PG_END_TRY restores it. If we jump out in between with "return NULL",
the PG_exception_stack is left to point to garbage. (I'm surprised the
comments in PG_TRY_CATCH don't warn about this.)

Add test that re-attaches tenant in pageserver while Postgres is
running. If the tenant is detached while compute is connected and
busy running queries, those queries will fail if they try to fetch any
pages. But when the tenant is re-attached, things should start working
again, without disconnecting the client <-> postgres connections.
Without this fix, this reproduced the segfault.

Fixes issue #3231
2023-01-05 18:51:47 +02:00
Heikki Linnakangas
c187de1101 Copy error message before it's freed.
pageserver_disconnect() call invalidates 'pageserver_conn', including
the error message pointer we got from PQerrorMessage(pageserver_conn).
Copy the message to a temporary variable before disconnecting, like
we do in a few other places.

In the passing, clear 'pageserver_conn_wes' variable in a few places
where it was free'd. I didn't see any live bug from this, but since
pageserver_disconnect() checks if it's NULL, let's not leave it
dangling to already-free'd memory.
2023-01-05 18:51:47 +02:00
Kirill Bulatov
8712e1899e Move initial timeline creation into pytest (#3270)
For every Python test, we start the storage first, and expect that
later, in the test, when we start a compute, it will work without
specific timeline and tenant creation or their IDs specified.

For that, we have a concept of "default" branch that was created on the
control plane level first, but that's not needed at all, given that it's
only Python tests that need it: let them create the initial timeline
during set-up.

Before, control plane started and stopped pageserver for timeline
creation, now Python harness runs an extra tenant creation request on
test env init.

I had to adjust the metrics test, turns out it registered the metrics
from the default tenant after an extra pageserver restart.
New model does not sent the metrics before the collection time happens,
and that was 30s before.
2023-01-05 17:48:27 +02:00
Christian Schwarz
d7f1e30112 remote_timeline_client: more metrics & metrics-related cleanups
- Clean up redundant metric removal in TimelineMetrics::drop.
RemoteTimelineClientMetrics is responsible for cleaning up
REMOTE_OPERATION_TIME andREMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS.

- Rename `pageserver_remote_upload_queue_unfinished_tasks` to
`pageserver_remote_timeline_client_calls_unfinished`. The new name
reflects that the metric is with respect to the entire call to remote
timeline client. This includes wait time in the upload queue and hence
it's a longer span than what `pageserver_remote_OPERATION_seconds`
measures.

- Add the `pageserver_remote_timeline_client_calls_started` histogram.
See the metric description for why we need it.

- Add helper functions `call_begin` etc to `RemoteTimelineClientMetrics`
to centralize the logic for updating the metrics above (they relate to
each other, see comments in code).

- Use these constructs to track ongoing downloads in
`pageserver_remote_timeline_client_calls_unfinished`

refs https://github.com/neondatabase/neon/issues/2029
fixes https://github.com/neondatabase/neon/issues/3249
closes https://github.com/neondatabase/neon/pull/3250
2023-01-05 11:50:17 +01:00
Christian Schwarz
6a9d1030a6 use RemoteTimelineClient for downloading index part during tenant_attach
Before this change, we would not .measure_remote_op for index part
downloads.

And more generally, it's good to pass not just uploads but also
downloads through RemoteTimelineClient, e.g., if we ever want to
implement some timeline-scoped policies there.

Found this while working on https://github.com/neondatabase/neon/pull/3250
where I add a metric to measure the degree of concurrent downloads.
Layer download was missing in a test that I added there.
2023-01-05 11:08:50 +01:00
Heikki Linnakangas
8c6e607327 Refactor send_tarball() (#3259)
The Basebackup struct is really just a convenient place to carry the
various parameters around in send_tarball and its subroutines. Make it
internal to the send_tarball function.
2023-01-04 23:03:16 +02:00
Vadim Kharitonov
f436fb2dfb Fix panics at compute_ctl:monitor 2023-01-04 17:26:42 +01:00
Kirill Bulatov
8932d14d50 Revert "Run Python tests in 8 threads (#3206)" (#3264)
This reverts commit 56a4466d0a.

Seems that flackiness increased after this commit, while the time
decrease was a couple of seconds.
With every regular Python test spawing 1 etcd, 3 safekeepers, 1
pageserver, few CLI commands and post-run cleanup hooks, it might be
hard to run many such tests in parallel.

We could return to this later, after we consider alternative test
structure and/or CI runner structure.
2023-01-04 17:31:51 +02:00
Kirill Bulatov
efad64bc7f Expect compute shutdown test log error (#3262)
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3261/debug/3833043374/index.html#suites/ffbb7f9930a77115316b58ff32b7c719/1f6ebaedc0a113a1/

Spotted a flacky test that appeared after
https://github.com/neondatabase/neon/pull/3227 changes
2023-01-04 10:45:11 +00:00
Kirill Bulatov
10dae79c6d Tone down safekeeper and pageserver walreceiver errors (#3227)
Closes https://github.com/neondatabase/neon/issues/3114

Adds more typization into errors that appear during protocol messages (`FeMessage`), postgres and walreceiver connections.

Socket IO errors are now better detected and logged with lesser (INFO, DEBUG) error level, without traces that they were logged before, when they were wrapped in anyhow context.
2023-01-03 20:42:04 +00:00
162 changed files with 5739 additions and 3145 deletions

View File

@@ -4,7 +4,7 @@
hakari-package = "workspace_hack"
# Format for `workspace-hack = ...` lines in other Cargo.tomls. Requires cargo-hakari 0.9.8 or above.
dep-format-version = "2"
dep-format-version = "3"
# Setting workspace.resolver = "2" in the root Cargo.toml is HIGHLY recommended.
# Hakari works much better with the new feature resolver.

View File

@@ -0,0 +1,10 @@
## Describe your changes
## Issue ticket number and link
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

View File

@@ -123,8 +123,8 @@ runs:
exit 1
fi
if [[ "${{ inputs.run_in_parallel }}" == "true" ]]; then
# -n8 uses eight processes to run tests via pytest-xdist
EXTRA_PARAMS="-n8 $EXTRA_PARAMS"
# -n4 uses four processes to run tests via pytest-xdist
EXTRA_PARAMS="-n4 $EXTRA_PARAMS"
# --dist=loadgroup points tests marked with @pytest.mark.xdist_group
# to the same worker to make @pytest.mark.order work with xdist

View File

@@ -6,6 +6,8 @@ storage:
broker_endpoint: http://storage-broker-lb.epsilon.ap-southeast-1.internal.aws.neon.tech:50051
pageserver_config_stub:
pg_distrib_dir: /usr/local
metric_collection_endpoint: http://console-release.local/billing/api/v1/usage_events
metric_collection_interval: 10min
remote_storage:
bucket_name: "{{ bucket_name }}"
bucket_region: "{{ bucket_region }}"

View File

@@ -6,6 +6,8 @@ storage:
broker_endpoint: http://storage-broker-lb.gamma.eu-central-1.internal.aws.neon.tech:50051
pageserver_config_stub:
pg_distrib_dir: /usr/local
metric_collection_endpoint: http://console-release.local/billing/api/v1/usage_events
metric_collection_interval: 10min
remote_storage:
bucket_name: "{{ bucket_name }}"
bucket_region: "{{ bucket_region }}"

View File

@@ -6,6 +6,8 @@ storage:
broker_endpoint: http://storage-broker-lb.delta.us-east-2.internal.aws.neon.tech:50051
pageserver_config_stub:
pg_distrib_dir: /usr/local
metric_collection_endpoint: http://console-release.local/billing/api/v1/usage_events
metric_collection_interval: 10min
remote_storage:
bucket_name: "{{ bucket_name }}"
bucket_region: "{{ bucket_region }}"
@@ -34,4 +36,4 @@ storage:
ansible_host: i-06d113fb73bfddeb0
safekeeper-2.us-east-2.aws.neon.tech:
ansible_host: i-09f66c8e04afff2e8

View File

@@ -6,6 +6,8 @@ storage:
broker_endpoint: http://storage-broker-lb.eta.us-west-2.internal.aws.neon.tech:50051
pageserver_config_stub:
pg_distrib_dir: /usr/local
metric_collection_endpoint: http://console-release.local/billing/api/v1/usage_events
metric_collection_interval: 10min
remote_storage:
bucket_name: "{{ bucket_name }}"
bucket_region: "{{ bucket_region }}"

View File

@@ -7,6 +7,8 @@ storage:
broker_endpoint: http://storage-broker.prod.local:50051
pageserver_config_stub:
pg_distrib_dir: /usr/local
metric_collection_endpoint: http://console-release.local/billing/api/v1/usage_events
metric_collection_interval: 10min
remote_storage:
bucket_name: "{{ bucket_name }}"
bucket_region: "{{ bucket_region }}"

View File

@@ -18,7 +18,7 @@ storage:
ansible_aws_ssm_region: eu-west-1
ansible_aws_ssm_bucket_name: neon-dev-storage-eu-west-1
console_region_id: aws-eu-west-1
sentry_environment: development
sentry_environment: staging
children:
pageservers:

View File

@@ -18,7 +18,7 @@ storage:
ansible_aws_ssm_region: us-east-2
ansible_aws_ssm_bucket_name: neon-staging-storage-us-east-2
console_region_id: aws-us-east-2
sentry_environment: development
sentry_environment: staging
children:
pageservers:

View File

@@ -8,7 +8,10 @@ settings:
authBackend: "console"
authEndpoint: "http://console-staging.local/management/api/v2"
domain: "*.eu-west-1.aws.neon.build"
sentryEnvironment: "development"
sentryEnvironment: "staging"
wssPort: 8443
metricCollectionEndpoint: "http://console-staging.local/billing/api/v1/usage_events"
metricCollectionInterval: "1min"
# -- Additional labels for neon-proxy pods
podLabels:
@@ -23,6 +26,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: eu-west-1.aws.neon.build
httpsPort: 443
#metrics:
# enabled: true

View File

@@ -49,4 +49,4 @@ extraManifests:
- "{{ .Release.Namespace }}"
settings:
sentryEnvironment: "development"
sentryEnvironment: "staging"

View File

@@ -8,7 +8,9 @@ settings:
authBackend: "link"
authEndpoint: "https://console.stage.neon.tech/authenticate_proxy_request/"
uri: "https://console.stage.neon.tech/psql_session/"
sentryEnvironment: "development"
sentryEnvironment: "staging"
metricCollectionEndpoint: "http://console-staging.local/billing/api/v1/usage_events"
metricCollectionInterval: "1min"
# -- Additional labels for neon-proxy-link pods
podLabels:

View File

@@ -8,7 +8,10 @@ settings:
authBackend: "console"
authEndpoint: "http://console-staging.local/management/api/v2"
domain: "*.cloud.stage.neon.tech"
sentryEnvironment: "development"
sentryEnvironment: "staging"
wssPort: 8443
metricCollectionEndpoint: "http://console-staging.local/billing/api/v1/usage_events"
metricCollectionInterval: "1min"
# -- Additional labels for neon-proxy pods
podLabels:
@@ -23,6 +26,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: neon-proxy-scram-legacy.beta.us-east-2.aws.neon.build
httpsPort: 443
#metrics:
# enabled: true

View File

@@ -8,7 +8,10 @@ settings:
authBackend: "console"
authEndpoint: "http://console-staging.local/management/api/v2"
domain: "*.us-east-2.aws.neon.build"
sentryEnvironment: "development"
sentryEnvironment: "staging"
wssPort: 8443
metricCollectionEndpoint: "http://console-staging.local/billing/api/v1/usage_events"
metricCollectionInterval: "1min"
# -- Additional labels for neon-proxy pods
podLabels:
@@ -23,6 +26,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: us-east-2.aws.neon.build
httpsPort: 443
#metrics:
# enabled: true

View File

@@ -49,4 +49,4 @@ extraManifests:
- "{{ .Release.Namespace }}"
settings:
sentryEnvironment: "development"
sentryEnvironment: "staging"

View File

@@ -9,6 +9,9 @@ settings:
authEndpoint: "http://console-release.local/management/api/v2"
domain: "*.ap-southeast-1.aws.neon.tech"
sentryEnvironment: "production"
wssPort: 8443
metricCollectionEndpoint: "http://console-release.local/billing/api/v1/usage_events"
metricCollectionInterval: "10min"
# -- Additional labels for neon-proxy pods
podLabels:
@@ -23,6 +26,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: ap-southeast-1.aws.neon.tech
httpsPort: 443
#metrics:
# enabled: true

View File

@@ -9,6 +9,9 @@ settings:
authEndpoint: "http://console-release.local/management/api/v2"
domain: "*.eu-central-1.aws.neon.tech"
sentryEnvironment: "production"
wssPort: 8443
metricCollectionEndpoint: "http://console-release.local/billing/api/v1/usage_events"
metricCollectionInterval: "10min"
# -- Additional labels for neon-proxy pods
podLabels:
@@ -23,6 +26,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: eu-central-1.aws.neon.tech
httpsPort: 443
#metrics:
# enabled: true

View File

@@ -9,6 +9,9 @@ settings:
authEndpoint: "http://console-release.local/management/api/v2"
domain: "*.us-east-2.aws.neon.tech"
sentryEnvironment: "production"
wssPort: 8443
metricCollectionEndpoint: "http://console-release.local/billing/api/v1/usage_events"
metricCollectionInterval: "10min"
# -- Additional labels for neon-proxy pods
podLabels:
@@ -23,6 +26,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: us-east-2.aws.neon.tech
httpsPort: 443
#metrics:
# enabled: true

View File

@@ -9,6 +9,9 @@ settings:
authEndpoint: "http://console-release.local/management/api/v2"
domain: "*.us-west-2.aws.neon.tech"
sentryEnvironment: "production"
wssPort: 8443
metricCollectionEndpoint: "http://console-release.local/billing/api/v1/usage_events"
metricCollectionInterval: "10min"
# -- Additional labels for neon-proxy pods
podLabels:
@@ -23,6 +26,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: us-west-2.aws.neon.tech
httpsPort: 443
#metrics:
# enabled: true

View File

@@ -3,6 +3,9 @@ settings:
authEndpoint: "http://console-release.local/management/api/v2"
domain: "*.cloud.neon.tech"
sentryEnvironment: "production"
wssPort: 8443
metricCollectionEndpoint: "http://console-release.local/billing/api/v1/usage_events"
metricCollectionInterval: "10min"
podLabels:
zenith_service: proxy-scram
@@ -16,6 +19,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: '*.cloud.neon.tech'
httpsPort: 443
metrics:
enabled: true

View File

@@ -489,3 +489,108 @@ jobs:
slack-message: "Periodic TPC-H perf testing ${{ matrix.platform }}: ${{ job.status }}\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
env:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
user-examples-compare:
if: success() || failure()
needs: [ tpch-compare ]
strategy:
fail-fast: false
matrix:
# neon-captest-prefetch: We have pre-created projects with prefetch enabled
# rds-aurora: Aurora Postgres Serverless v2 with autoscaling from 0.5 to 2 ACUs
# rds-postgres: RDS Postgres db.m5.large instance (2 vCPU, 8 GiB) with gp3 EBS storage
platform: [ neon-captest-prefetch, rds-postgres, rds-aurora ]
env:
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
DEFAULT_PG_VERSION: 14
TEST_OUTPUT: /tmp/test_output
BUILD_TYPE: remote
SAVE_PERF_REPORT: ${{ github.event.inputs.save_perf_report || ( github.ref == 'refs/heads/main' ) }}
PLATFORM: ${{ matrix.platform }}
runs-on: [ self-hosted, us-east-2, x64 ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
options: --init
timeout-minutes: 360 # 6h
steps:
- uses: actions/checkout@v3
- name: Download Neon artifact
uses: ./.github/actions/download
with:
name: neon-${{ runner.os }}-release-artifact
path: /tmp/neon/
prefix: latest
- name: Add Postgres binaries to PATH
run: |
${POSTGRES_DISTRIB_DIR}/v${DEFAULT_PG_VERSION}/bin/pgbench --version
echo "${POSTGRES_DISTRIB_DIR}/v${DEFAULT_PG_VERSION}/bin" >> $GITHUB_PATH
- name: Set up Connection String
id: set-up-connstr
run: |
case "${PLATFORM}" in
neon-captest-prefetch)
CONNSTR=${{ secrets.BENCHMARK_USER_EXAMPLE_CAPTEST_CONNSTR }}
;;
rds-aurora)
CONNSTR=${{ secrets.BENCHMARK_USER_EXAMPLE_RDS_AURORA_CONNSTR }}
;;
rds-postgres)
CONNSTR=${{ secrets.BENCHMARK_USER_EXAMPLE_RDS_POSTGRES_CONNSTR }}
;;
*)
echo 2>&1 "Unknown PLATFORM=${PLATFORM}. Allowed only 'neon-captest-prefetch', 'rds-aurora', or 'rds-postgres'"
exit 1
;;
esac
echo "connstr=${CONNSTR}" >> $GITHUB_OUTPUT
psql ${CONNSTR} -c "SELECT version();"
- name: Set database options
if: matrix.platform == 'neon-captest-prefetch'
run: |
DB_NAME=$(psql ${BENCHMARK_CONNSTR} --no-align --quiet -t -c "SELECT current_database()")
psql ${BENCHMARK_CONNSTR} -c "ALTER DATABASE ${DB_NAME} SET enable_seqscan_prefetch=on"
psql ${BENCHMARK_CONNSTR} -c "ALTER DATABASE ${DB_NAME} SET effective_io_concurrency=32"
psql ${BENCHMARK_CONNSTR} -c "ALTER DATABASE ${DB_NAME} SET maintenance_io_concurrency=32"
env:
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
- name: Run user examples
uses: ./.github/actions/run-python-test-set
with:
build_type: ${{ env.BUILD_TYPE }}
test_selection: performance/test_perf_olap.py
run_in_parallel: false
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_user_examples
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
- name: Create Allure report
if: success() || failure()
uses: ./.github/actions/allure-report
with:
action: generate
build_type: ${{ env.BUILD_TYPE }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}
uses: slackapi/slack-github-action@v1
with:
channel-id: "C033QLM5P7D" # dev-staging-stream
slack-message: "Periodic TPC-H perf testing ${{ matrix.platform }}: ${{ job.status }}\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
env:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

View File

@@ -595,6 +595,8 @@ jobs:
defaults:
run:
shell: sh -eu {0}
env:
VM_INFORMANT_VERSION: 0.1.1
steps:
- name: Downloading latest vm-builder
@@ -606,9 +608,22 @@ jobs:
run: |
docker pull 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
- name: Downloading VM informant version ${{ env.VM_INFORMANT_VERSION }}
run: |
curl -fL https://github.com/neondatabase/autoscaling/releases/download/${{ env.VM_INFORMANT_VERSION }}/vm-informant -o vm-informant
chmod +x vm-informant
- name: Adding VM informant to compute-node image
run: |
ID=$(docker create 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}})
docker cp vm-informant $ID:/bin/vm-informant
docker commit $ID temp-vm-compute-node
docker rm -f $ID
- name: Build vm image
run: |
./vm-builder -src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} -dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
# note: as of 2023-01-12, vm-builder requires a trailing ":latest" for local images
./vm-builder -src=temp-vm-compute-node:latest -dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
- name: Pushing vm-compute-node image
run: |
@@ -794,6 +809,8 @@ jobs:
strategy:
matrix:
include: ${{fromJSON(needs.calculate-deploy-targets.outputs.matrix-include)}}
environment:
name: prod-old
steps:
- name: Checkout
uses: actions/checkout@v3
@@ -839,7 +856,9 @@ jobs:
shell: bash
strategy:
matrix:
target_region: [ us-east-2 ]
target_region: [ eu-west-1, us-east-2 ]
environment:
name: dev-${{ matrix.target_region }}
steps:
- name: Checkout
uses: actions/checkout@v3
@@ -911,6 +930,8 @@ jobs:
strategy:
matrix:
target_region: [ us-east-2, us-west-2, eu-central-1, ap-southeast-1 ]
environment:
name: prod-${{ matrix.target_region }}
steps:
- name: Checkout
uses: actions/checkout@v3
@@ -950,6 +971,8 @@ jobs:
strategy:
matrix:
include: ${{fromJSON(needs.calculate-deploy-targets.outputs.matrix-include)}}
environment:
name: prod-old
env:
KUBECONFIG: .kubeconfig
steps:
@@ -975,8 +998,8 @@ jobs:
- name: Re-deploy proxy
run: |
DOCKER_TAG=${{needs.tag.outputs.build-tag}}
helm upgrade ${{ matrix.proxy_job }} neondatabase/neon-proxy --namespace neon-proxy --install -f .github/helm-values/${{ matrix.proxy_config }}.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
helm upgrade ${{ matrix.proxy_job }}-scram neondatabase/neon-proxy --namespace neon-proxy --install -f .github/helm-values/${{ matrix.proxy_config }}-scram.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
helm upgrade ${{ matrix.proxy_job }} neondatabase/neon-proxy --namespace neon-proxy --install --atomic -f .github/helm-values/${{ matrix.proxy_config }}.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
helm upgrade ${{ matrix.proxy_job }}-scram neondatabase/neon-proxy --namespace neon-proxy --install --atomic -f .github/helm-values/${{ matrix.proxy_config }}-scram.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
deploy-storage-broker:
name: deploy storage broker on old staging and old prod
@@ -993,6 +1016,8 @@ jobs:
strategy:
matrix:
include: ${{fromJSON(needs.calculate-deploy-targets.outputs.matrix-include)}}
environment:
name: prod-old
env:
KUBECONFIG: .kubeconfig
steps:
@@ -1041,6 +1066,8 @@ jobs:
target_cluster: dev-eu-west-1-zeta
deploy_link_proxy: false
deploy_legacy_scram_proxy: false
environment:
name: dev-${{ matrix.target_region }}
steps:
- name: Checkout
uses: actions/checkout@v3
@@ -1056,19 +1083,19 @@ jobs:
- name: Re-deploy scram proxy
run: |
DOCKER_TAG=${{needs.tag.outputs.build-tag}}
helm upgrade neon-proxy-scram neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install -f .github/helm-values/${{ matrix.target_cluster }}.neon-proxy-scram.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
helm upgrade neon-proxy-scram neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install --atomic -f .github/helm-values/${{ matrix.target_cluster }}.neon-proxy-scram.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
- name: Re-deploy link proxy
if: matrix.deploy_link_proxy
run: |
DOCKER_TAG=${{needs.tag.outputs.build-tag}}
helm upgrade neon-proxy-link neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install -f .github/helm-values/${{ matrix.target_cluster }}.neon-proxy-link.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
helm upgrade neon-proxy-link neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install --atomic -f .github/helm-values/${{ matrix.target_cluster }}.neon-proxy-link.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
- name: Re-deploy legacy scram proxy
if: matrix.deploy_legacy_scram_proxy
run: |
DOCKER_TAG=${{needs.tag.outputs.build-tag}}
helm upgrade neon-proxy-scram-legacy neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install -f .github/helm-values/${{ matrix.target_cluster }}.neon-proxy-scram-legacy.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
helm upgrade neon-proxy-scram-legacy neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install --atomic -f .github/helm-values/${{ matrix.target_cluster }}.neon-proxy-scram-legacy.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
deploy-storage-broker-dev-new:
runs-on: [ self-hosted, dev, x64 ]
@@ -1088,6 +1115,8 @@ jobs:
target_cluster: dev-us-east-2-beta
- target_region: eu-west-1
target_cluster: dev-eu-west-1-zeta
environment:
name: dev-${{ matrix.target_region }}
steps:
- name: Checkout
uses: actions/checkout@v3
@@ -1126,6 +1155,8 @@ jobs:
target_cluster: prod-eu-central-1-gamma
- target_region: ap-southeast-1
target_cluster: prod-ap-southeast-1-epsilon
environment:
name: prod-${{ matrix.target_region }}
steps:
- name: Checkout
uses: actions/checkout@v3
@@ -1141,7 +1172,7 @@ jobs:
- name: Re-deploy proxy
run: |
DOCKER_TAG=${{needs.tag.outputs.build-tag}}
helm upgrade neon-proxy-scram neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install -f .github/helm-values/${{ matrix.target_cluster }}.neon-proxy-scram.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
helm upgrade neon-proxy-scram neondatabase/neon-proxy --namespace neon-proxy --create-namespace --install --atomic -f .github/helm-values/${{ matrix.target_cluster }}.neon-proxy-scram.yaml --set image.tag=${DOCKER_TAG} --set settings.sentryUrl=${{ secrets.SENTRY_URL_PROXY }} --wait --timeout 15m0s
deploy-storage-broker-prod-new:
runs-on: prod
@@ -1165,6 +1196,8 @@ jobs:
target_cluster: prod-eu-central-1-gamma
- target_region: ap-southeast-1
target_cluster: prod-ap-southeast-1-epsilon
environment:
name: prod-${{ matrix.target_region }}
steps:
- name: Checkout
uses: actions/checkout@v3

2
.gitignore vendored
View File

@@ -1,7 +1,5 @@
/pg_install
/target
/tmp_check
/tmp_check_cli
__pycache__/
test_output/
.vscode

584
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,14 +1,3 @@
# 'named-profiles' feature was stabilized in cargo 1.57. This line makes the
# build work with older cargo versions.
#
# We have this because as of this writing, the latest cargo Debian package
# that's available is 1.56. (Confusingly, the Debian package version number
# is 0.57, whereas 'cargo --version' says 1.56.)
#
# See https://tracker.debian.org/pkg/cargo for the current status of the
# package. When that gets updated, we can remove this.
cargo-features = ["named-profiles"]
[workspace]
members = [
"compute_tools",
@@ -21,6 +10,143 @@ members = [
"libs/*",
]
[workspace.package]
edition = "2021"
license = "Apache-2.0"
## All dependency versions, used in the project
[workspace.dependencies]
anyhow = { version = "1.0", features = ["backtrace"] }
async-stream = "0.3"
async-trait = "0.1"
atty = "0.2.14"
aws-config = { version = "0.51.0", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.21.0"
aws-smithy-http = "0.51.0"
aws-types = "0.51.0"
base64 = "0.13.0"
bincode = "1.3"
bindgen = "0.61"
bstr = "1.0"
byteorder = "1.4"
bytes = "1.0"
chrono = { version = "0.4", default-features = false, features = ["clock"] }
clap = "4.0"
close_fds = "0.3.2"
comfy-table = "6.1"
const_format = "0.2"
crc32c = "0.6"
crossbeam-utils = "0.8.5"
fail = "0.5.0"
fs2 = "0.4.3"
futures = "0.3"
futures-core = "0.3"
futures-util = "0.3"
git-version = "0.3"
hashbrown = "0.13"
hex = "0.4"
hex-literal = "0.3"
hmac = "0.12.1"
hostname = "0.3.1"
humantime = "2.1"
humantime-serde = "1.1.1"
hyper = "0.14"
hyper-tungstenite = "0.9"
itertools = "0.10"
jsonwebtoken = "8"
libc = "0.2"
md5 = "0.7.0"
memoffset = "0.8"
nix = "0.26"
notify = "5.0.0"
num-traits = "0.2.15"
once_cell = "1.13"
parking_lot = "0.12"
pin-project-lite = "0.2"
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
prost = "0.11"
rand = "0.8"
regex = "1.4"
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] }
routerify = "3"
rstar = "0.9.3"
rustls = "0.20"
rustls-pemfile = "1"
rustls-split = "0.3"
scopeguard = "1.1"
sentry = { version = "0.29", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "2.0"
sha2 = "0.10.2"
signal-hook = "0.3"
socket2 = "0.4.4"
strum = "0.24"
strum_macros = "0.24"
svg_fmt = "0.4.1"
tar = "0.4"
thiserror = "1.0"
tls-listener = { version = "0.6", features = ["rustls", "hyper-h1"] }
tokio = { version = "1.17", features = ["macros"] }
tokio-postgres-rustls = "0.9.0"
tokio-rustls = "0.23"
tokio-stream = "0.1"
tokio-util = { version = "0.7", features = ["io"] }
toml = "0.5"
toml_edit = { version = "0.17", features = ["easy"] }
tonic = {version = "0.8", features = ["tls", "tls-roots"]}
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
url = "2.2"
uuid = { version = "1.2", features = ["v4", "serde"] }
walkdir = "2.3.2"
webpki-roots = "0.22.5"
x509-parser = "0.14"
## TODO replace this with tracing
env_logger = "0.10"
log = "0.4"
## TODO switch when the new release is made
amplify_num = { git = "https://github.com/rust-amplify/rust-amplify.git", tag = "v4.0.0-beta.1" }
## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
tokio-tar = { git = "https://github.com/neondatabase/tokio-tar.git", rev="404df61437de0feef49ba2ccdbdd94eb8ad6e142" }
## Local libraries
consumption_metrics = { version = "0.1", path = "./libs/consumption_metrics/" }
metrics = { version = "0.1", path = "./libs/metrics/" }
pageserver_api = { version = "0.1", path = "./libs/pageserver_api/" }
postgres_connection = { version = "0.1", path = "./libs/postgres_connection/" }
postgres_ffi = { version = "0.1", path = "./libs/postgres_ffi/" }
pq_proto = { version = "0.1", path = "./libs/pq_proto/" }
remote_storage = { version = "0.1", path = "./libs/remote_storage/" }
safekeeper_api = { version = "0.1", path = "./libs/safekeeper_api" }
storage_broker = { version = "0.1", path = "./storage_broker/" } # Note: main broker code is inside the binary crate, so linking with the library shouldn't be heavy.
tenant_size_model = { version = "0.1", path = "./libs/tenant_size_model/" }
utils = { version = "0.1", path = "./libs/utils/" }
## Common library dependency
workspace_hack = { version = "0.1", path = "./workspace_hack/" }
## Build dependencies
criterion = "0.4"
rcgen = "0.10"
rstest = "0.16"
tempfile = "3.2"
tonic-build = "0.8"
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
[patch.crates-io]
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
################# Binary contents sections
[profile.release]
# This is useful for profiling and, to some extent, debug.
# Besides, debug info should not affect the performance.
@@ -81,9 +207,3 @@ inherits = "release"
debug = false # true = 2 = all symbols, 1 = line only
opt-level = "z"
lto = true
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
[patch.crates-io]
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }

View File

@@ -29,7 +29,13 @@ RUN cd postgres && \
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C contrib/ install && \
# Install headers
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C src/include install && \
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C src/interfaces/libpq install
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C src/interfaces/libpq install && \
# Enable some of contrib extensions
echo 'trusted = true' >> /usr/local/pgsql/share/extension/bloom.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgrowlocks.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/intagg.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgstattuple.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/earthdistance.control
#########################################################################################
#
@@ -55,7 +61,9 @@ RUN wget https://download.osgeo.org/postgis/source/postgis-3.3.1.tar.gz && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_raster.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_tiger_geocoder.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_topology.control
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_topology.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer_data_us.control
#########################################################################################
#

View File

@@ -29,7 +29,13 @@ RUN cd postgres && \
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C contrib/ install && \
# Install headers
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C src/include install && \
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C src/interfaces/libpq install
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C src/interfaces/libpq install && \
# Enable some of contrib extensions
echo 'trusted = true' >> /usr/local/pgsql/share/extension/bloom.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgrowlocks.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/intagg.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgstattuple.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/earthdistance.control
#########################################################################################
#
@@ -55,7 +61,9 @@ RUN wget https://download.osgeo.org/postgis/source/postgis-3.3.1.tar.gz && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_raster.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_tiger_geocoder.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_topology.control
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_topology.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer_data_us.control
#########################################################################################
#

View File

@@ -118,11 +118,8 @@ Python (3.9 or higher), and install python3 packages using `./scripts/pysync` (r
# Later that would be responsibility of a package install script
> ./target/debug/neon_local init
Starting pageserver at '127.0.0.1:64000' in '.neon'.
pageserver started, pid: 2545906
Successfully initialized timeline de200bd42b49cc1814412c7e592dd6e9
Stopped pageserver 1 process with pid 2545906
# start pageserver and safekeeper
# start pageserver, safekeeper, and broker for their intercommunication
> ./target/debug/neon_local start
Starting neon broker at 127.0.0.1:50051
storage_broker started, pid: 2918372
@@ -131,6 +128,12 @@ pageserver started, pid: 2918386
Starting safekeeper at '127.0.0.1:5454' in '.neon/safekeepers/sk1'.
safekeeper 1 started, pid: 2918437
# create initial tenant and use it as a default for every future neon_local invocation
> ./target/debug/neon_local tenant create --set-default
tenant 9ef87a5bf0d92544f6fafeeb3239695c successfully created on the pageserver
Created an initial timeline 'de200bd42b49cc1814412c7e592dd6e9' at Lsn 0/16B5A50 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c
Setting tenant 9ef87a5bf0d92544f6fafeeb3239695c as a default one
# start postgres compute node
> ./target/debug/neon_local pg start main
Starting new postgres (v14) main on timeline de200bd42b49cc1814412c7e592dd6e9 ...

View File

@@ -1,24 +1,25 @@
[package]
name = "compute_tools"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
chrono = { version = "0.4", default-features = false, features = ["clock"] }
clap = "4.0"
env_logger = "0.9"
futures = "0.3.13"
hyper = { version = "0.14", features = ["full"] }
log = { version = "0.4", features = ["std", "serde"] }
notify = "5.0.0"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
regex = "1"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
tar = "0.4"
tokio = { version = "1.17", features = ["macros", "rt", "rt-multi-thread"] }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
url = "2.2.2"
workspace_hack = { version = "0.1", path = "../workspace_hack" }
anyhow.workspace = true
chrono.workspace = true
clap.workspace = true
futures.workspace = true
hyper = { workspace = true, features = ["full"] }
notify.workspace = true
postgres.workspace = true
regex.workspace = true
serde.workspace = true
serde_json.workspace = true
tar.workspace = true
tokio = { workspace = true, features = ["rt", "rt-multi-thread"] }
tokio-postgres.workspace = true
tracing.workspace = true
tracing-subscriber.workspace = true
url.workspace = true
workspace_hack.workspace = true

View File

@@ -19,6 +19,10 @@ Also `compute_ctl` spawns two separate service threads:
- `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
last activity requests.
If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
downscaling and (eventually) will request immediate upscaling under resource pressure.
Usage example:
```sh
compute_ctl -D /var/db/postgres/compute \

View File

@@ -18,6 +18,10 @@
//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
//! last activity requests.
//!
//! If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
//! compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
//! downscaling and (eventually) will request immediate upscaling under resource pressure.
//!
//! Usage example:
//! ```sh
//! compute_ctl -D /var/db/postgres/compute \
@@ -36,10 +40,11 @@ use std::{thread, time::Duration};
use anyhow::{Context, Result};
use chrono::Utc;
use clap::Arg;
use log::{error, info};
use tracing::{error, info};
use compute_tools::compute::{ComputeMetrics, ComputeNode, ComputeState, ComputeStatus};
use compute_tools::http::api::launch_http_server;
use compute_tools::informant::spawn_vm_informant_if_present;
use compute_tools::logger::*;
use compute_tools::monitor::launch_monitor;
use compute_tools::params::*;
@@ -48,7 +53,6 @@ use compute_tools::spec::*;
use url::Url;
fn main() -> Result<()> {
// TODO: re-use `utils::logging` later
init_logger(DEFAULT_LOG_LEVEL)?;
let matches = cli().get_matches();
@@ -114,30 +118,48 @@ fn main() -> Result<()> {
// requests, while configuration is still in progress.
let _http_handle = launch_http_server(&compute).expect("cannot launch http endpoint thread");
let _monitor_handle = launch_monitor(&compute).expect("cannot launch compute monitor thread");
// Also spawn the thread responsible for handling the VM informant -- if it's present
let _vm_informant_handle = spawn_vm_informant_if_present().expect("cannot launch VM informant");
// Run compute (Postgres) and hang waiting on it.
match compute.prepare_and_run() {
Ok(ec) => {
let code = ec.code().unwrap_or(1);
info!("Postgres exited with code {}, shutting down", code);
exit(code)
}
Err(error) => {
error!("could not start the compute node: {:?}", error);
// Start Postgres
let mut delay_exit = false;
let mut exit_code = None;
let pg = match compute.start_compute() {
Ok(pg) => Some(pg),
Err(err) => {
error!("could not start the compute node: {:?}", err);
let mut state = compute.state.write().unwrap();
state.error = Some(format!("{:?}", error));
state.error = Some(format!("{:?}", err));
state.status = ComputeStatus::Failed;
drop(state);
// Keep serving HTTP requests, so the cloud control plane was able to
// get the actual error.
info!("giving control plane 30s to collect the error before shutdown");
thread::sleep(Duration::from_secs(30));
info!("shutting down");
Err(error)
delay_exit = true;
None
}
};
// Wait for the child Postgres process forever. In this state Ctrl+C will
// propagate to Postgres and it will be shut down as well.
if let Some(mut pg) = pg {
let ecode = pg
.wait()
.expect("failed to start waiting on Postgres process");
info!("Postgres exited with code {}, shutting down", ecode);
exit_code = ecode.code()
}
if let Err(err) = compute.check_for_core_dumps() {
error!("error while checking for core dumps: {err:?}");
}
// If launch failed, keep serving HTTP requests for a while, so the cloud
// control plane can get the actual error.
if delay_exit {
info!("giving control plane 30s to collect the error before shutdown");
thread::sleep(Duration::from_secs(30));
info!("shutting down");
}
exit(exit_code.unwrap_or(1))
}
fn cli() -> clap::Command {

View File

@@ -1,10 +1,11 @@
use anyhow::{anyhow, Result};
use log::error;
use postgres::Client;
use tokio_postgres::NoTls;
use tracing::{error, instrument};
use crate::compute::ComputeNode;
#[instrument(skip_all)]
pub fn create_writability_check_data(client: &mut Client) -> Result<()> {
let query = "
CREATE TABLE IF NOT EXISTS health_check (
@@ -21,6 +22,7 @@ pub fn create_writability_check_data(client: &mut Client) -> Result<()> {
Ok(())
}
#[instrument(skip_all)]
pub async fn check_writability(compute: &ComputeNode) -> Result<()> {
let (client, connection) = tokio_postgres::connect(compute.connstr.as_str(), NoTls).await?;
if client.is_closed() {

View File

@@ -17,15 +17,15 @@
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::{Command, ExitStatus, Stdio};
use std::process::{Command, Stdio};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;
use anyhow::{Context, Result};
use chrono::{DateTime, Utc};
use log::{info, warn};
use postgres::{Client, NoTls};
use serde::{Serialize, Serializer};
use tracing::{info, instrument, warn};
use crate::checker::create_writability_check_data;
use crate::config;
@@ -121,6 +121,7 @@ impl ComputeNode {
// Get basebackup from the libpq connection to pageserver using `connstr` and
// unarchive it to `pgdata` directory overriding all its previous content.
#[instrument(skip(self))]
fn get_basebackup(&self, lsn: &str) -> Result<()> {
let start_time = Utc::now();
@@ -154,6 +155,7 @@ impl ComputeNode {
// Run `postgres` in a special mode with `--sync-safekeepers` argument
// and return the reported LSN back to the caller.
#[instrument(skip(self))]
fn sync_safekeepers(&self) -> Result<String> {
let start_time = Utc::now();
@@ -196,6 +198,7 @@ impl ComputeNode {
/// Do all the preparations like PGDATA directory creation, configuration,
/// safekeepers sync, basebackup, etc.
#[instrument(skip(self))]
pub fn prepare_pgdata(&self) -> Result<()> {
let spec = &self.spec;
let pgdata_path = Path::new(&self.pgdata);
@@ -229,9 +232,8 @@ impl ComputeNode {
/// Start Postgres as a child process and manage DBs/roles.
/// After that this will hang waiting on the postmaster process to exit.
pub fn run(&self) -> Result<ExitStatus> {
let start_time = Utc::now();
#[instrument(skip(self))]
pub fn start_postgres(&self) -> Result<std::process::Child> {
let pgdata_path = Path::new(&self.pgdata);
// Run postgres as a child process.
@@ -242,6 +244,11 @@ impl ComputeNode {
wait_for_postgres(&mut pg, pgdata_path)?;
Ok(pg)
}
#[instrument(skip(self))]
pub fn apply_config(&self) -> Result<()> {
// If connection fails,
// it may be the old node with `zenith_admin` superuser.
//
@@ -279,8 +286,34 @@ impl ComputeNode {
// 'Close' connection
drop(client);
let startup_end_time = Utc::now();
info!(
"finished configuration of compute for project {}",
self.spec.cluster.cluster_id
);
Ok(())
}
#[instrument(skip(self))]
pub fn start_compute(&self) -> Result<std::process::Child> {
info!(
"starting compute for project {}, operation {}, tenant {}, timeline {}",
self.spec.cluster.cluster_id,
self.spec.operation_uuid.as_ref().unwrap(),
self.tenant,
self.timeline,
);
self.prepare_pgdata()?;
let start_time = Utc::now();
let pg = self.start_postgres()?;
self.apply_config()?;
let startup_end_time = Utc::now();
self.metrics.config_ms.store(
startup_end_time
.signed_duration_since(start_time)
@@ -300,34 +333,7 @@ impl ComputeNode {
self.set_status(ComputeStatus::Running);
info!(
"finished configuration of compute for project {}",
self.spec.cluster.cluster_id
);
// Wait for child Postgres process basically forever. In this state Ctrl+C
// will propagate to Postgres and it will be shut down as well.
let ecode = pg
.wait()
.expect("failed to start waiting on Postgres process");
self.check_for_core_dumps()
.expect("failed to check for core dumps");
Ok(ecode)
}
pub fn prepare_and_run(&self) -> Result<ExitStatus> {
info!(
"starting compute for project {}, operation {}, tenant {}, timeline {}",
self.spec.cluster.cluster_id,
self.spec.operation_uuid.as_ref().unwrap(),
self.tenant,
self.timeline,
);
self.prepare_pgdata()?;
self.run()
Ok(pg)
}
// Look for core dumps and collect backtraces.
@@ -340,7 +346,7 @@ impl ComputeNode {
//
// Use that as a default location and pattern, except macos where core dumps are written
// to /cores/ directory by default.
fn check_for_core_dumps(&self) -> Result<()> {
pub fn check_for_core_dumps(&self) -> Result<()> {
let core_dump_dir = match std::env::consts::OS {
"macos" => Path::new("/cores/"),
_ => Path::new(&self.pgdata),

View File

@@ -6,8 +6,8 @@ use std::thread;
use anyhow::Result;
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Method, Request, Response, Server, StatusCode};
use log::{error, info};
use serde_json;
use tracing::{error, info};
use crate::compute::ComputeNode;

View File

@@ -0,0 +1,50 @@
use std::path::Path;
use std::process;
use std::thread;
use std::time::Duration;
use tracing::{info, warn};
use anyhow::{Context, Result};
const VM_INFORMANT_PATH: &str = "/bin/vm-informant";
const RESTART_INFORMANT_AFTER_MILLIS: u64 = 5000;
/// Launch a thread to start the VM informant if it's present (and restart, on failure)
pub fn spawn_vm_informant_if_present() -> Result<Option<thread::JoinHandle<()>>> {
let exists = Path::new(VM_INFORMANT_PATH)
.try_exists()
.context("could not check if path exists")?;
if !exists {
return Ok(None);
}
Ok(Some(
thread::Builder::new()
.name("run-vm-informant".into())
.spawn(move || run_informant())?,
))
}
fn run_informant() -> ! {
let restart_wait = Duration::from_millis(RESTART_INFORMANT_AFTER_MILLIS);
info!("starting VM informant");
loop {
let mut cmd = process::Command::new(VM_INFORMANT_PATH);
// Block on subprocess:
let result = cmd.status();
match result {
Err(e) => warn!("failed to run VM informant at {VM_INFORMANT_PATH:?}: {e}"),
Ok(status) if !status.success() => {
warn!("{VM_INFORMANT_PATH} exited with code {status:?}, retrying")
}
Ok(_) => info!("{VM_INFORMANT_PATH} ended gracefully (unexpectedly). Retrying"),
}
// Wait before retrying
thread::sleep(restart_wait);
}
}

View File

@@ -8,6 +8,7 @@ pub mod http;
#[macro_use]
pub mod logger;
pub mod compute;
pub mod informant;
pub mod monitor;
pub mod params;
pub mod pg_helpers;

View File

@@ -1,42 +1,20 @@
use std::io::Write;
use anyhow::Result;
use chrono::Utc;
use env_logger::{Builder, Env};
macro_rules! info_println {
($($tts:tt)*) => {
if log_enabled!(Level::Info) {
println!($($tts)*);
}
}
}
macro_rules! info_print {
($($tts:tt)*) => {
if log_enabled!(Level::Info) {
print!($($tts)*);
}
}
}
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::prelude::*;
/// Initialize `env_logger` using either `default_level` or
/// `RUST_LOG` environment variable as default log level.
pub fn init_logger(default_level: &str) -> Result<()> {
let env = Env::default().filter_or("RUST_LOG", default_level);
let env_filter = tracing_subscriber::EnvFilter::try_from_default_env()
.unwrap_or_else(|_| tracing_subscriber::EnvFilter::new(default_level));
Builder::from_env(env)
.format(|buf, record| {
let thread_handle = std::thread::current();
writeln!(
buf,
"{} [{}] {}: {}",
Utc::now().format("%Y-%m-%d %H:%M:%S%.3f %Z"),
thread_handle.name().unwrap_or("main"),
record.level(),
record.args()
)
})
let fmt_layer = tracing_subscriber::fmt::layer()
.with_target(false)
.with_writer(std::io::stderr);
tracing_subscriber::registry()
.with(env_filter)
.with(fmt_layer)
.init();
Ok(())

View File

@@ -3,8 +3,8 @@ use std::{thread, time};
use anyhow::Result;
use chrono::{DateTime, Utc};
use log::{debug, info};
use postgres::{Client, NoTls};
use tracing::{debug, info};
use crate::compute::ComputeNode;
@@ -52,10 +52,16 @@ fn watch_compute_activity(compute: &ComputeNode) {
let mut idle_backs: Vec<DateTime<Utc>> = vec![];
for b in backs.into_iter() {
let state: String = b.get("state");
let change: String = b.get("state_change");
let state: String = match b.try_get("state") {
Ok(state) => state,
Err(_) => continue,
};
if state == "idle" {
let change: String = match b.try_get("state_change") {
Ok(state_change) => state_change,
Err(_) => continue,
};
let change = DateTime::parse_from_rfc3339(&change);
match change {
Ok(t) => idle_backs.push(t.with_timezone(&Utc)),

View File

@@ -11,6 +11,7 @@ use anyhow::{bail, Result};
use notify::{RecursiveMode, Watcher};
use postgres::{Client, Transaction};
use serde::Deserialize;
use tracing::{debug, instrument};
const POSTGRES_WAIT_TIMEOUT: Duration = Duration::from_millis(60 * 1000); // milliseconds
@@ -229,6 +230,7 @@ pub fn get_existing_dbs(client: &mut Client) -> Result<Vec<Database>> {
/// Wait for Postgres to become ready to accept connections. It's ready to
/// accept connections when the state-field in `pgdata/postmaster.pid` says
/// 'ready'.
#[instrument(skip(pg))]
pub fn wait_for_postgres(pg: &mut Child, pgdata: &Path) -> Result<()> {
let pid_path = pgdata.join("postmaster.pid");
@@ -287,18 +289,18 @@ pub fn wait_for_postgres(pg: &mut Child, pgdata: &Path) -> Result<()> {
}
let res = rx.recv_timeout(Duration::from_millis(100));
log::debug!("woken up by notify: {res:?}");
debug!("woken up by notify: {res:?}");
// If there are multiple events in the channel already, we only need to be
// check once. Swallow the extra events before we go ahead to check the
// pid file.
while let Ok(res) = rx.try_recv() {
log::debug!("swallowing extra event: {res:?}");
debug!("swallowing extra event: {res:?}");
}
// Check that we can open pid file first.
if let Ok(file) = File::open(&pid_path) {
if !postmaster_pid_seen {
log::debug!("postmaster.pid appeared");
debug!("postmaster.pid appeared");
watcher
.unwatch(pgdata)
.expect("Failed to remove pgdata dir watch");
@@ -314,7 +316,7 @@ pub fn wait_for_postgres(pg: &mut Child, pgdata: &Path) -> Result<()> {
// Pid file could be there and we could read it, but it could be empty, for example.
if let Some(Ok(line)) = last_line {
let status = line.trim();
log::debug!("last line of postmaster.pid: {status:?}");
debug!("last line of postmaster.pid: {status:?}");
// Now Postgres is ready to accept connections
if status == "ready" {
@@ -330,7 +332,7 @@ pub fn wait_for_postgres(pg: &mut Child, pgdata: &Path) -> Result<()> {
}
}
log::info!("PostgreSQL is now running, continuing to configure it");
tracing::info!("PostgreSQL is now running, continuing to configure it");
Ok(())
}

View File

@@ -2,10 +2,10 @@ use std::path::Path;
use std::str::FromStr;
use anyhow::Result;
use log::{info, log_enabled, warn, Level};
use postgres::config::Config;
use postgres::{Client, NoTls};
use serde::Deserialize;
use tracing::{info, info_span, instrument, span_enabled, warn, Level};
use crate::compute::ComputeNode;
use crate::config;
@@ -79,23 +79,25 @@ pub fn update_pg_hba(pgdata_path: &Path) -> Result<()> {
/// Given a cluster spec json and open transaction it handles roles creation,
/// deletion and update.
#[instrument(skip_all)]
pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let mut xact = client.transaction()?;
let existing_roles: Vec<Role> = get_existing_roles(&mut xact)?;
// Print a list of existing Postgres roles (only in debug mode)
info!("postgres roles:");
for r in &existing_roles {
info_println!(
"{} - {}:{}",
" ".repeat(27 + 5),
r.name,
if r.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
}
);
if span_enabled!(Level::INFO) {
info!("postgres roles:");
for r in &existing_roles {
info!(
" - {}:{}",
r.name,
if r.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
}
);
}
}
// Process delta operations first
@@ -136,58 +138,68 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
info!("cluster spec roles:");
for role in &spec.cluster.roles {
let name = &role.name;
info_print!(
"{} - {}:{}",
" ".repeat(27 + 5),
name,
if role.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
}
);
// XXX: with a limited number of roles it is fine, but consider making it a HashMap
let pg_role = existing_roles.iter().find(|r| r.name == *name);
if let Some(r) = pg_role {
let mut update_role = false;
enum RoleAction {
None,
Update,
Create,
}
let action = if let Some(r) = pg_role {
if (r.encrypted_password.is_none() && role.encrypted_password.is_some())
|| (r.encrypted_password.is_some() && role.encrypted_password.is_none())
{
update_role = true;
RoleAction::Update
} else if let Some(pg_pwd) = &r.encrypted_password {
// Check whether password changed or not (trim 'md5:' prefix first)
update_role = pg_pwd[3..] != *role.encrypted_password.as_ref().unwrap();
if pg_pwd[3..] != *role.encrypted_password.as_ref().unwrap() {
RoleAction::Update
} else {
RoleAction::None
}
} else {
RoleAction::None
}
} else {
RoleAction::Create
};
if update_role {
match action {
RoleAction::None => {}
RoleAction::Update => {
let mut query: String = format!("ALTER ROLE {} ", name.pg_quote());
info_print!(" -> update");
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
}
} else {
info!("role name: '{}'", &name);
let mut query: String = format!("CREATE ROLE {} ", name.pg_quote());
info!("role create query: '{}'", &query);
info_print!(" -> create");
RoleAction::Create => {
let mut query: String = format!("CREATE ROLE {} ", name.pg_quote());
info!("role create query: '{}'", &query);
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
let grant_query = format!(
"GRANT pg_read_all_data, pg_write_all_data TO {}",
name.pg_quote()
);
xact.execute(grant_query.as_str(), &[])?;
info!("role grant query: '{}'", &grant_query);
let grant_query = format!(
"GRANT pg_read_all_data, pg_write_all_data TO {}",
name.pg_quote()
);
xact.execute(grant_query.as_str(), &[])?;
info!("role grant query: '{}'", &grant_query);
}
}
info_print!("\n");
if span_enabled!(Level::INFO) {
let pwd = if role.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
};
let action_str = match action {
RoleAction::None => "",
RoleAction::Create => " -> create",
RoleAction::Update => " -> update",
};
info!(" - {}:{}{}", name, pwd, action_str);
}
}
xact.commit()?;
@@ -196,23 +208,20 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
}
/// Reassign all dependent objects and delete requested roles.
#[instrument(skip_all)]
pub fn handle_role_deletions(node: &ComputeNode, client: &mut Client) -> Result<()> {
let spec = &node.spec;
// First, reassign all dependent objects to db owners.
if let Some(ops) = &spec.delta_operations {
if let Some(ops) = &node.spec.delta_operations {
// First, reassign all dependent objects to db owners.
info!("reassigning dependent objects of to-be-deleted roles");
for op in ops {
if op.action == "delete_role" {
reassign_owned_objects(node, &op.name)?;
}
}
}
// Second, proceed with role deletions.
let mut xact = client.transaction()?;
if let Some(ops) = &spec.delta_operations {
// Second, proceed with role deletions.
info!("processing role deletions");
let mut xact = client.transaction()?;
for op in ops {
// We do not check either role exists or not,
// Postgres will take care of it for us
@@ -223,6 +232,7 @@ pub fn handle_role_deletions(node: &ComputeNode, client: &mut Client) -> Result<
xact.execute(query.as_str(), &[])?;
}
}
xact.commit()?;
}
Ok(())
@@ -263,13 +273,16 @@ fn reassign_owned_objects(node: &ComputeNode, role_name: &PgIdent) -> Result<()>
/// like `CREATE DATABASE` and `DROP DATABASE` do not support it. Statement-level
/// atomicity should be enough here due to the order of operations and various checks,
/// which together provide us idempotency.
#[instrument(skip_all)]
pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let existing_dbs: Vec<Database> = get_existing_dbs(client)?;
// Print a list of existing Postgres databases (only in debug mode)
info!("postgres databases:");
for r in &existing_dbs {
info_println!("{} - {}:{}", " ".repeat(27 + 5), r.name, r.owner);
if span_enabled!(Level::INFO) {
info!("postgres databases:");
for r in &existing_dbs {
info!(" {}:{}", r.name, r.owner);
}
}
// Process delta operations first
@@ -312,12 +325,15 @@ pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
for db in &spec.cluster.databases {
let name = &db.name;
info_print!("{} - {}:{}", " ".repeat(27 + 5), db.name, db.owner);
// XXX: with a limited number of databases it is fine, but consider making it a HashMap
let pg_db = existing_dbs.iter().find(|r| r.name == *name);
if let Some(r) = pg_db {
enum DatabaseAction {
None,
Update,
Create,
}
let action = if let Some(r) = pg_db {
// XXX: db owner name is returned as quoted string from Postgres,
// when quoting is needed.
let new_owner = if r.owner.starts_with('"') {
@@ -327,24 +343,42 @@ pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
};
if new_owner != r.owner {
// Update the owner
DatabaseAction::Update
} else {
DatabaseAction::None
}
} else {
DatabaseAction::Create
};
match action {
DatabaseAction::None => {}
DatabaseAction::Update => {
let query: String = format!(
"ALTER DATABASE {} OWNER TO {}",
name.pg_quote(),
db.owner.pg_quote()
);
info_print!(" -> update");
let _ = info_span!("executing", query).entered();
client.execute(query.as_str(), &[])?;
}
} else {
let mut query: String = format!("CREATE DATABASE {} ", name.pg_quote());
info_print!(" -> create");
DatabaseAction::Create => {
let mut query: String = format!("CREATE DATABASE {} ", name.pg_quote());
query.push_str(&db.to_pg_options());
let _ = info_span!("executing", query).entered();
client.execute(query.as_str(), &[])?;
}
};
query.push_str(&db.to_pg_options());
client.execute(query.as_str(), &[])?;
if span_enabled!(Level::INFO) {
let action_str = match action {
DatabaseAction::None => "",
DatabaseAction::Create => " -> create",
DatabaseAction::Update => " -> update",
};
info!(" - {}:{}{}", db.name, db.owner, action_str);
}
info_print!("\n");
}
Ok(())
@@ -352,6 +386,7 @@ pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
/// Grant CREATE ON DATABASE to the database owner and do some other alters and grants
/// to allow users creating trusted extensions and re-creating `public` schema, for example.
#[instrument(skip_all)]
pub fn handle_grants(node: &ComputeNode, client: &mut Client) -> Result<()> {
let spec = &node.spec;

View File

@@ -1 +0,0 @@
tmp_check/

View File

@@ -1,32 +1,31 @@
[package]
name = "control_plane"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
clap = "4.0"
comfy-table = "6.1"
git-version = "0.3.5"
nix = "0.25"
once_cell = "1.13.0"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev = "43e6db254a97fdecbce33d8bc0890accfd74495e" }
regex = "1"
reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] }
serde = { version = "1.0", features = ["derive"] }
serde_with = "2.0"
tar = "0.4.38"
thiserror = "1"
toml = "0.5"
url = "2.2.2"
anyhow.workspace = true
clap.workspace = true
comfy-table.workspace = true
git-version.workspace = true
nix.workspace = true
once_cell.workspace = true
postgres.workspace = true
regex.workspace = true
reqwest = { workspace = true, features = ["blocking", "json"] }
serde.workspace = true
serde_with.workspace = true
tar.workspace = true
thiserror.workspace = true
toml.workspace = true
url.workspace = true
# Note: Do not directly depend on pageserver or safekeeper; use pageserver_api or safekeeper_api
# instead, so that recompile times are better.
pageserver_api = { path = "../libs/pageserver_api" }
postgres_connection = { path = "../libs/postgres_connection" }
safekeeper_api = { path = "../libs/safekeeper_api" }
# Note: main broker code is inside the binary crate, so linking with the library shouldn't be heavy.
storage_broker = { version = "0.1", path = "../storage_broker" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
pageserver_api.workspace = true
safekeeper_api.workspace = true
postgres_connection.workspace = true
storage_broker.workspace = true
utils.workspace = true
workspace_hack.workspace = true

View File

@@ -136,22 +136,6 @@ where
anyhow::bail!("{process_name} did not start in {RETRY_UNTIL_SECS} seconds");
}
/// Send SIGTERM to child process
pub fn send_stop_child_process(child: &std::process::Child) -> anyhow::Result<()> {
let pid = child.id();
match kill(
nix::unistd::Pid::from_raw(pid.try_into().unwrap()),
Signal::SIGTERM,
) {
Ok(()) => Ok(()),
Err(Errno::ESRCH) => {
println!("child process with pid {pid} does not exist");
Ok(())
}
Err(e) => anyhow::bail!("Failed to send signal to child process with pid {pid}: {e}"),
}
}
/// Stops the process, using the pid file given. Returns Ok also if the process is already not running.
pub fn stop_process(immediate: bool, process_name: &str, pid_file: &Path) -> anyhow::Result<()> {
let pid = match pid_file::read(pid_file)

View File

@@ -263,7 +263,7 @@ fn get_tenant_id(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow::R
} else if let Some(default_id) = env.default_tenant_id {
Ok(default_id)
} else {
bail!("No tenant id. Use --tenant-id, or set 'default_tenant_id' in the config file");
anyhow::bail!("No tenant id. Use --tenant-id, or set a default tenant");
}
}
@@ -284,8 +284,6 @@ fn parse_timeline_id(sub_match: &ArgMatches) -> anyhow::Result<Option<TimelineId
}
fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
let initial_timeline_id_arg = parse_timeline_id(init_match)?;
// Create config file
let toml_file: String = if let Some(config_path) = init_match.get_one::<PathBuf>("config") {
// load and parse the file
@@ -309,30 +307,16 @@ fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
LocalEnv::parse_config(&toml_file).context("Failed to create neon configuration")?;
env.init(pg_version)
.context("Failed to initialize neon repository")?;
let initial_tenant_id = env
.default_tenant_id
.expect("default_tenant_id should be generated by the `env.init()` call above");
// Initialize pageserver, create initial tenant and timeline.
let pageserver = PageServerNode::from_env(&env);
let initial_timeline_id = pageserver
.initialize(
Some(initial_tenant_id),
initial_timeline_id_arg,
&pageserver_config_overrides(init_match),
pg_version,
)
pageserver
.initialize(&pageserver_config_overrides(init_match))
.unwrap_or_else(|e| {
eprintln!("pageserver init failed: {e:?}");
exit(1);
});
env.register_branch_mapping(
DEFAULT_BRANCH_NAME.to_owned(),
initial_tenant_id,
initial_timeline_id,
)?;
Ok(env)
}
@@ -388,6 +372,17 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
println!(
"Created an initial timeline '{new_timeline_id}' at Lsn {last_record_lsn} for tenant: {new_tenant_id}",
);
if create_match.get_flag("set-default") {
println!("Setting tenant {new_tenant_id} as a default one");
env.default_tenant_id = Some(new_tenant_id);
}
}
Some(("set-default", set_default_match)) => {
let tenant_id =
parse_tenant_id(set_default_match)?.context("No tenant id specified")?;
println!("Setting tenant {tenant_id} as a default one");
env.default_tenant_id = Some(tenant_id);
}
Some(("config", create_match)) => {
let tenant_id = get_tenant_id(create_match, env)?;
@@ -928,9 +923,8 @@ fn cli() -> Command {
.version(GIT_VERSION)
.subcommand(
Command::new("init")
.about("Initialize a new Neon repository")
.about("Initialize a new Neon repository, preparing configs for services to start with")
.arg(pageserver_config_args.clone())
.arg(timeline_id_arg.clone().help("Use a specific timeline id when creating a tenant and its initial timeline"))
.arg(
Arg::new("config")
.long("config")
@@ -992,11 +986,14 @@ fn cli() -> Command {
.arg(timeline_id_arg.clone().help("Use a specific timeline id when creating a tenant and its initial timeline"))
.arg(Arg::new("config").short('c').num_args(1).action(ArgAction::Append).required(false))
.arg(pg_version_arg.clone())
.arg(Arg::new("set-default").long("set-default").action(ArgAction::SetTrue).required(false)
.help("Use this tenant in future CLI commands where tenant_id is needed, but not specified"))
)
.subcommand(Command::new("set-default").arg(tenant_id_arg.clone().required(true))
.about("Set a particular tenant as default in future CLI commands where tenant_id is needed, but not specified"))
.subcommand(Command::new("config")
.arg(tenant_id_arg.clone())
.arg(Arg::new("config").short('c').num_args(1).action(ArgAction::Append).required(false))
)
.arg(Arg::new("config").short('c').num_args(1).action(ArgAction::Append).required(false)))
)
.subcommand(
Command::new("pageserver")

View File

@@ -296,11 +296,6 @@ impl LocalEnv {
env.neon_distrib_dir = env::current_exe()?.parent().unwrap().to_owned();
}
// If no initial tenant ID was given, generate it.
if env.default_tenant_id.is_none() {
env.default_tenant_id = Some(TenantId::generate());
}
env.base_data_dir = base_path();
Ok(env)

View File

@@ -7,7 +7,7 @@ use std::path::PathBuf;
use std::process::{Child, Command};
use std::{io, result};
use anyhow::{bail, ensure, Context};
use anyhow::{bail, Context};
use pageserver_api::models::{
TenantConfigRequest, TenantCreateRequest, TenantInfo, TimelineCreateRequest, TimelineInfo,
};
@@ -130,83 +130,15 @@ impl PageServerNode {
overrides
}
/// Initializes a pageserver node by creating its config with the overrides provided,
/// and creating an initial tenant and timeline afterwards.
pub fn initialize(
&self,
create_tenant: Option<TenantId>,
initial_timeline_id: Option<TimelineId>,
config_overrides: &[&str],
pg_version: u32,
) -> anyhow::Result<TimelineId> {
/// Initializes a pageserver node by creating its config with the overrides provided.
pub fn initialize(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
// First, run `pageserver --init` and wait for it to write a config into FS and exit.
self.pageserver_init(config_overrides).with_context(|| {
format!(
"Failed to run init for pageserver node {}",
self.env.pageserver.id,
)
})?;
// Then, briefly start it fully to run HTTP commands on it,
// to create initial tenant and timeline.
// We disable the remote storage, since we stop pageserver right after the timeline creation,
// hence most of the uploads will either aborted or not started: no point to start them at all.
let disabled_remote_storage_override = "remote_storage={}";
let mut pageserver_process = self
.start_node(
&[disabled_remote_storage_override],
// Previous overrides will be taken from the config created before, don't overwrite them.
false,
)
.with_context(|| {
format!(
"Failed to start a process for pageserver node {}",
self.env.pageserver.id,
)
})?;
let init_result = self
.try_init_timeline(create_tenant, initial_timeline_id, pg_version)
.context("Failed to create initial tenant and timeline for pageserver");
match &init_result {
Ok(initial_timeline_id) => {
println!("Successfully initialized timeline {initial_timeline_id}")
}
Err(e) => eprintln!("{e:#}"),
}
background_process::send_stop_child_process(&pageserver_process)?;
let exit_code = pageserver_process.wait()?;
ensure!(
exit_code.success(),
format!(
"pageserver init failed with exit code {:?}",
exit_code.code()
)
);
println!(
"Stopped pageserver {} process with pid {}",
self.env.pageserver.id,
pageserver_process.id(),
);
init_result
}
fn try_init_timeline(
&self,
new_tenant_id: Option<TenantId>,
new_timeline_id: Option<TimelineId>,
pg_version: u32,
) -> anyhow::Result<TimelineId> {
let initial_tenant_id = self.tenant_create(new_tenant_id, HashMap::new())?;
let initial_timeline_info = self.timeline_create(
initial_tenant_id,
new_timeline_id,
None,
None,
Some(pg_version),
)?;
Ok(initial_timeline_info.timeline_id)
})
}
pub fn repo_path(&self) -> PathBuf {

View File

@@ -52,7 +52,7 @@ name = "ring"
version = "*"
expression = "MIT AND ISC AND OpenSSL"
license-files = [
{ path = "LICENSE", hash = 0xbd0eed23 },
{ path = "LICENSE", hash = 0xbd0eed23 }
]
[licenses.private]

115
docs/consumption_metrics.md Normal file
View File

@@ -0,0 +1,115 @@
### Overview
Pageserver and proxy periodically collect consumption metrics and push them to a HTTP endpoint.
This doc describes current implementation details.
For design details see [the RFC](./rfcs/021-metering.md) and [the discussion on Github](https://github.com/neondatabase/neon/pull/2884).
- The metrics are collected in a separate thread, and the collection interval and endpoint are configurable.
- Metrics are cached, so that we don't send unchanged metrics on every iteration.
- Metrics are sent in batches of 1000 (see CHUNK_SIZE const) metrics max with no particular grouping guarantees.
batch format is
```json
{ "events" : [metric1, metric2, ...]]}
```
See metric format examples below.
- All metrics values are in bytes, unless otherwise specified.
- Currently no retries are implemented.
### Pageserver metrics
#### Configuration
The endpoint and the collection interval are specified in the pageserver config file (or can be passed as command line arguments):
`metric_collection_endpoint` defaults to None, which means that metric collection is disabled by default.
`metric_collection_interval` defaults to 10min
#### Metrics
Currently, the following metrics are collected:
- `written_size`
Amount of WAL produced , by a timeline, i.e. last_record_lsn
This is an absolute, per-timeline metric.
- `resident_size`
Size of all the layer files in the tenant's directory on disk on the pageserver.
This is an absolute, per-tenant metric.
- `remote_storage_size`
Size of the remote storage (S3) directory.
This is an absolute, per-tenant metric.
- `timeline_logical_size`
Logical size of the data in the timeline
This is an absolute, per-timeline metric.
- `synthetic_storage_size`
Size of all tenant's branches including WAL
This is the same metric that `tenant/{tenant_id}/size` endpoint returns.
This is an absolute, per-tenant metric.
Synthetic storage size is calculated in a separate thread, so it might be slightly outdated.
#### Format example
```json
{
"metric": "remote_storage_size",
"type": "absolute",
"time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"tenant_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
"timeline_id": "a03ebb4f5922a1c56ff7485cc8854143",
}
```
`idempotency_key` is a unique key for each metric, so that we can deduplicate metrics.
It is a combination of the time, node_id and a random number.
### Proxy consumption metrics
#### Configuration
The endpoint and the collection interval can be passed as command line arguments for proxy:
`metric_collection_endpoint` no default, which means that metric collection is disabled by default.
`metric_collection_interval` no default
#### Metrics
Currently, only one proxy metric is collected:
- `proxy_io_bytes_per_client`
Outbound traffic per client.
This is an incremental, per-endpoint metric.
#### Format example
```json
{
"metric": "proxy_io_bytes_per_client",
"type": "incremental",
"start_time": "2022-12-28T11:07:19.317310284Z",
"stop_time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"endpoint_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
}
```
The metric is incremental, so the value is the difference between the current and the previous value.
If there is no previous value, the value, the value is the current value and the `start_time` equals `stop_time`.
### TODO
- [ ] Handle errors better: currently if one tenant fails to gather metrics, the whole iteration fails and metrics are not sent for any tenant.
- [ ] Add retries
- [ ] Tune the interval

186
docs/rfcs/021-metering.md Normal file
View File

@@ -0,0 +1,186 @@
# Consumption tracking
# Goals
This proposal is made with two mostly but not entirely overlapping goals:
* Collect info that is needed for consumption-based billing
* Cross-check AWS bills
# Metrics
There are six metrics to collect:
* CPU time. Wall clock seconds * the current number of cores. We have a fixed ratio of memory to cores, so the current memory size is the function of the number of cores. Measured per each `endpoint`.
* Traffic. In/out traffic on the proxy. Measured per each `endpoint`.
* Written size. Amount of data we write. That is different from both traffic and storage size, as only during the writing we
a) occupy some disk bandwidth on safekeepers
b) necessarily cross AZ boundaries delivering WAL to all safekeepers
Each timeline/branch has at most one writer, so the data is collected per branch.
* Synthetic storage size. That is what is exposed now with pageserver's `/v1/tenant/{}/size`. Looks like now it is per-tenant. (Side note: can we make it per branch to show as branch physical size in UI?)
* Real storage size. That is the size of the tenant directory on the pageservers disk. Per-tenant.
* S3 storage size. That is the size of the tenant data on S3. Per-tenant.
That info should be enough to build an internal model that predicts AWS price (hence tracking `written data` and `real storage size`). As for the billing model we probably can get away with mentioning only `CPU time`, `synthetic storage size`, and `traffic` consumption.
# Services participating in metrics collection
## Proxy
For actual implementation details check `/docs/consumption_metrics.md`
Proxy is the only place that knows about traffic flow, so it tracks it and reports it with quite a small interval, let's say 1 minute. A small interval is needed here since the proxy is stateless, and any restart will reset accumulated consumption. Also proxy should report deltas since the last report, not an absolute value of the counter. Such kind of events is easier to integrate over a period of time to get the amount of traffic during some time interval.
Example event:
```json
{
"metric": "proxy_io_bytes_per_client",
"type": "incremental",
"start_time": "2022-12-28T11:07:19.317310284Z",
"stop_time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"endpoint_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
}
```
Since we report deltas over some period of time, it makes sense to include `event_start_time`/`event_stop_time` where `event_start_time` is the time of the previous report. That will allow us to identify metering gaps better (e.g., failed send/delivery).
When there is no active connection proxy can avoid reporting anything. Also, deltas are additive, so several console instances serving the same user and endpoint can report traffic without coordination.
## Console
The console knows about start/stop events, so it knows the amount of CPU time allocated to each endpoint. It also knows about operation successes and failures and can avoid billing clients after unsuccessful 'suspend' events. The console doesn't know the current compute size within the allowed limits on the endpoint. So with CPU time, we do the following:
* While we don't yet have the autoscaling console can report `cpu time` as the number of seconds since the last `start_compute` event.
* When we have autoscaling, `autoscaler-agent` can report `cpu time`*`compute_units_count` in the same increments as the proxy reports traffic.
Example event:
```json
{
"metric": "effective_compute_seconds",
"type": "increment",
"endpoint_id": "blazing-warrior-34",
"event_start_time": ...,
"event_stop_time": ...,
"value": 12345454,
}
```
I'd also suggest reporting one value, `cpu time`*`compute_units_count`, instead of two separate fields as it makes event schema simpler (it is possible to treat it the same way as traffic) and preserves additivity.
## Pageserver
For actual implementation details check `/docs/consumption_metrics.md`
Pageserver knows / has access to / can calculate the rest of the metrics:
* Written size -- that is basically `last_received_lsn`,
* Synthetic storage size -- there is a way to calculate it, albeit a costly one,
* Real storage size -- there is a way to calculate it using a layer map or filesystem,
* S3 storage size -- can calculate it by S3 API calls
Some of those metrics are expensive to calculate, so the reporting period here is driven mainly by implementation details. We can set it to, for example, once per hour. Not a big deal since the pageserver is stateful, and all metrics can be reported as an absolute value, not increments. At the same time, a smaller reporting period improves UX, so it would be good to have something more real-time.
`written size` is primarily a safekeeper-related metric, but since it is available on both pageserver and safekeeper, we can avoid reporting anything from the safekeeper.
Example event:
```json
{
"metric": "remote_storage_size",
"type": "absolute",
"time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"tenant_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
"timeline_id": "a03ebb4f5922a1c56ff7485cc8854143",
}
```
# Data collection
## Push vs. pull
We already have pull-based Prometheus metrics, so it is tempting to use them here too. However, in our setup, it is hard to tell when some metric changes. For example, garbage collection will constantly free some disk space over a week, even if the project is down for that week. We could also iterate through all existing tenants/branches/endpoints, but that means some amount of code to do that properly and most likely we will end up with some per-metric hacks in the collector to cut out some of the tenants that are surely not changing that metric.
With the push model, it is easier to publish data only about actively changing metrics -- pageserver knows when it performs s3 offloads, garbage collection and starts/stops consuming data from the safekeeper; proxy knows about connected clients; console / autoscaler-agent knows about active cpu time.
Hence, let's go with a push-based model.
## Common bus vs. proxying through the console
We can implement such push systems in a few ways:
a. Each component pushes its metrics to the "common bus", namely segment, Kafka, or something similar. That approach scales well, but it would be harder to test it locally, will introduce new dependencies, we will have to distribute secrets for that connection to all of the components, etc. We would also have to loop back some of the events and their aggregates to the console, as we want to show some that metrics to the user in real-time.
b. Each component can call HTTP `POST` with its events to the console, and the console can forward it to the segment for later integration with metronome / orb / onebill / etc. With that approach, only the console has to speak with segment. Also since that data passes through the console, the console can save the latest metrics values, so there is no need for constant feedback of that events back from the segment.
# Implementation
Each (proxy|pageserver|autoscaler-agent) sends consumption events to the single endpoint in the console:
```json
POST /usage_events HTTP/1.1
Content-Type: application/json
[
{
"metric": "remote_storage_size",
"type": "absolute",
"time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"tenant_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
"timeline_id": "a03ebb4f5922a1c56ff7485cc8854143",
},
...
]
```
![data flow](./images/metering.jpg)
Events could be either:
* `incremental` -- change in consumption since the previous event or service restart. That is `effective_cpu_seconds`, `traffic_in_bytes`, and `traffic_out_bytes`.
* `absolute` -- that is the current value of a metric. All of the size-related metrics are absolute.
Each service can post events at its own pace and bundle together data from different tenants/endpoints.
The console algorithm upon receive of events could be the following:
1. Create and send a segment event with the same content (possibly enriching it with tenant/timeline data for endpoint-based events).
2. Update the latest state of per-tenant and per-endpoint metrics in the database.
3. Check whether any of that metrics is above the allowed threshold and stop the project if necessary.
Since all the data comes in batches, we can do the batch update to reduce the number of queries in the database. Proxy traffic is probably the most frequent metric, so with batching, we will have extra `number_of_proxies` requests to the database each minute. This is most likely fine for now but will generate many dead tuples in the console database. If that is the case, we can change step 2 to the following:
2.1. Check if there $tenant_$metric / $endpoint_$metric key in Redis
2.2. If no stored value is found and the metric is incremental, then fetch the current value from DWH (which keeps aggregated value for all the events) and publish it.
2.3. Publish a new value (absolute metric) or add an increment to the stored value (incremental metric)
## Consumption watchdog
Since all the data goes through the console, we don't have to run any background thread/coroutines to check whether consumption is within the allowed limits. We only change consumption with `POST /usage_events`, so limit checks could be applied in the same handler.
## Extensibility
If we need to add a new metric (e.g. s3 traffic or something else), the console code should, by default, process it and publish segment event, even if the metric name is unknown to the console.
## Naming & schema
Each metric name should end up with units -- now `_seconds` and `_bytes`, and segment event should always have `tenant_id` and `timeline_id`/`endpoint_id` where applicable.

Binary file not shown.

After

Width:  |  Height:  |  Size: 232 KiB

View File

@@ -18,10 +18,6 @@ Intended to be used in integration tests and in CLI tools for local installation
Documentation of the Neon features and concepts.
Now it is mostly dev documentation.
`/monitoring`:
TODO
`/pageserver`:
Neon storage service.
@@ -98,6 +94,13 @@ cargo hakari manage-deps
If you don't have hakari installed (`error: no such subcommand: hakari`), install it by running `cargo install cargo-hakari`.
### Checking Rust 3rd-parties
[Cargo deny](https://embarkstudios.github.io/cargo-deny/index.html) is a cargo plugin that lets us lint project's dependency graph to ensure all dependencies conform to requirements. It detects security issues, matches licenses, and ensures crates only come from trusted sources.
```bash
cargo deny check
```
## Using Python
Note that Debian/Ubuntu Python packages are stale, as it commonly happens,
so manual installation of dependencies is not recommended.

View File

@@ -0,0 +1,16 @@
[package]
name = "consumption_metrics"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
anyhow = "1.0.68"
chrono = { version = "0.4", default-features = false, features = ["clock", "serde"] }
rand = "0.8.3"
serde = "1.0.152"
serde_with = "2.1.0"
utils = { version = "0.1.0", path = "../utils" }
workspace_hack = { version = "0.1.0", path = "../../workspace_hack" }

View File

@@ -0,0 +1,50 @@
//!
//! Shared code for consumption metics collection
//!
use chrono::{DateTime, Utc};
use rand::Rng;
use serde::Serialize;
#[derive(Serialize, Debug, Clone, Eq, PartialEq, Ord, PartialOrd)]
#[serde(tag = "type")]
pub enum EventType {
#[serde(rename = "absolute")]
Absolute { time: DateTime<Utc> },
#[serde(rename = "incremental")]
Incremental {
start_time: DateTime<Utc>,
stop_time: DateTime<Utc>,
},
}
#[derive(Serialize, Debug, Clone, Eq, PartialEq, Ord, PartialOrd)]
pub struct Event<Extra> {
#[serde(flatten)]
#[serde(rename = "type")]
pub kind: EventType,
pub metric: &'static str,
pub idempotency_key: String,
pub value: u64,
#[serde(flatten)]
pub extra: Extra,
}
pub fn idempotency_key(node_id: String) -> String {
format!(
"{}-{}-{:04}",
Utc::now(),
node_id,
rand::thread_rng().gen_range(0..=9999)
)
}
pub const CHUNK_SIZE: usize = 1000;
// Just a wrapper around a slice of events
// to serialize it as `{"events" : [ ] }
#[derive(serde::Serialize)]
pub struct EventChunk<'a, T> {
pub events: &'a [T],
}

View File

@@ -1,11 +1,12 @@
[package]
name = "metrics"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
libc = "0.2"
once_cell = "1.13.0"
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
prometheus.workspace = true
libc.workspace = true
once_cell.workspace = true
workspace_hack.workspace = true

View File

@@ -1,17 +1,17 @@
[package]
name = "pageserver_api"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_with = "2.0"
const_format = "0.2.21"
anyhow = { version = "1.0", features = ["backtrace"] }
bytes = "1.0.1"
byteorder = "1.4.3"
serde.workspace = true
serde_with.workspace = true
const_format.workspace = true
anyhow.workspace = true
bytes.workspace = true
byteorder.workspace = true
utils.workspace = true
postgres_ffi.workspace = true
utils = { path = "../utils" }
postgres_ffi = { path = "../postgres_ffi" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
workspace_hack.workspace = true

View File

@@ -1,4 +1,4 @@
use std::num::NonZeroU64;
use std::num::{NonZeroU64, NonZeroUsize};
use byteorder::{BigEndian, ReadBytesExt};
use serde::{Deserialize, Serialize};
@@ -210,6 +210,11 @@ pub struct TimelineInfo {
pub state: TimelineState,
}
#[derive(Debug, Serialize, Deserialize)]
pub struct DownloadRemoteLayersTaskSpawnRequest {
pub max_concurrent_downloads: NonZeroUsize,
}
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct DownloadRemoteLayersTaskInfo {
pub task_id: String,

View File

@@ -1,18 +1,17 @@
[package]
name = "postgres_connection"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
itertools = "0.10.3"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev = "43e6db254a97fdecbce33d8bc0890accfd74495e" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
url = "2.2.2"
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
anyhow.workspace = true
itertools.workspace = true
postgres.workspace = true
tokio-postgres.workspace = true
url.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
once_cell = "1.13.0"
once_cell.workspace = true

View File

@@ -1,30 +1,31 @@
[package]
name = "postgres_ffi"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
rand = "0.8.3"
regex = "1.4.5"
bytes = "1.0.1"
byteorder = "1.4.3"
anyhow = "1.0"
crc32c = "0.6.0"
hex = "0.4.3"
once_cell = "1.13.0"
log = "0.4.14"
memoffset = "0.7"
thiserror = "1.0"
serde = { version = "1.0", features = ["derive"] }
utils = { path = "../utils" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
rand.workspace = true
regex.workspace = true
bytes.workspace = true
byteorder.workspace = true
anyhow.workspace = true
crc32c.workspace = true
hex.workspace = true
once_cell.workspace = true
log.workspace = true
memoffset.workspace = true
thiserror.workspace = true
serde.workspace = true
utils.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
env_logger = "0.9"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
env_logger.workspace = true
postgres.workspace = true
wal_craft = { path = "wal_craft" }
[build-dependencies]
anyhow = "1.0"
bindgen = "0.61"
anyhow.workspace = true
bindgen.workspace = true

View File

@@ -1,17 +1,17 @@
[package]
name = "wal_craft"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
clap = "4.0"
env_logger = "0.9"
log = "0.4"
once_cell = "1.13.0"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres_ffi = { path = "../" }
tempfile = "3.2"
workspace_hack = { version = "0.1", path = "../../../workspace_hack" }
anyhow.workspace = true
clap.workspace = true
env_logger.workspace = true
log.workspace = true
once_cell.workspace = true
postgres.workspace = true
postgres_ffi.workspace = true
tempfile.workspace = true
workspace_hack.workspace = true

View File

@@ -1,17 +1,18 @@
[package]
name = "pq_proto"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
bytes = "1.0.1"
pin-project-lite = "0.2.7"
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
rand = "0.8.3"
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1.17", features = ["macros"] }
tracing = "0.1"
anyhow.workspace = true
bytes.workspace = true
pin-project-lite.workspace = true
postgres-protocol.workspace = true
rand.workspace = true
serde.workspace = true
tokio.workspace = true
tracing.workspace = true
thiserror.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
workspace_hack.workspace = true

View File

@@ -5,7 +5,7 @@
// Tools for calling certain async methods in sync contexts.
pub mod sync;
use anyhow::{bail, ensure, Context, Result};
use anyhow::{ensure, Context, Result};
use bytes::{Buf, BufMut, Bytes, BytesMut};
use postgres_protocol::PG_EPOCH;
use serde::{Deserialize, Serialize};
@@ -194,6 +194,35 @@ macro_rules! retry_read {
};
}
/// An error occured during connection being open.
#[derive(thiserror::Error, Debug)]
pub enum ConnectionError {
/// IO error during writing to or reading from the connection socket.
#[error("Socket IO error: {0}")]
Socket(std::io::Error),
/// Invalid packet was received from client
#[error("Protocol error: {0}")]
Protocol(String),
/// Failed to parse a protocol mesage
#[error("Message parse error: {0}")]
MessageParse(anyhow::Error),
}
impl From<anyhow::Error> for ConnectionError {
fn from(e: anyhow::Error) -> Self {
Self::MessageParse(e)
}
}
impl ConnectionError {
pub fn into_io_error(self) -> io::Error {
match self {
ConnectionError::Socket(io) => io,
other => io::Error::new(io::ErrorKind::Other, other.to_string()),
}
}
}
impl FeMessage {
/// Read one message from the stream.
/// This function returns `Ok(None)` in case of EOF.
@@ -216,7 +245,9 @@ impl FeMessage {
/// }
/// ```
#[inline(never)]
pub fn read(stream: &mut (impl io::Read + Unpin)) -> anyhow::Result<Option<FeMessage>> {
pub fn read(
stream: &mut (impl io::Read + Unpin),
) -> Result<Option<FeMessage>, ConnectionError> {
Self::read_fut(&mut AsyncishRead(stream)).wait()
}
@@ -224,7 +255,7 @@ impl FeMessage {
/// See documentation for `Self::read`.
pub fn read_fut<Reader>(
stream: &mut Reader,
) -> SyncFuture<Reader, impl Future<Output = anyhow::Result<Option<FeMessage>>> + '_>
) -> SyncFuture<Reader, impl Future<Output = Result<Option<FeMessage>, ConnectionError>> + '_>
where
Reader: tokio::io::AsyncRead + Unpin,
{
@@ -238,17 +269,21 @@ impl FeMessage {
let tag = match retry_read!(stream.read_u8().await) {
Ok(b) => b,
Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
Err(e) => return Err(e.into()),
Err(e) => return Err(ConnectionError::Socket(e)),
};
// The message length includes itself, so it better be at least 4.
let len = retry_read!(stream.read_u32().await)?
let len = retry_read!(stream.read_u32().await)
.map_err(ConnectionError::Socket)?
.checked_sub(4)
.context("invalid message length")?;
.ok_or_else(|| ConnectionError::Protocol("invalid message length".to_string()))?;
let body = {
let mut buffer = vec![0u8; len as usize];
stream.read_exact(&mut buffer).await?;
stream
.read_exact(&mut buffer)
.await
.map_err(ConnectionError::Socket)?;
Bytes::from(buffer)
};
@@ -265,7 +300,11 @@ impl FeMessage {
b'c' => Ok(Some(FeMessage::CopyDone)),
b'f' => Ok(Some(FeMessage::CopyFail)),
b'p' => Ok(Some(FeMessage::PasswordMessage(body))),
tag => bail!("unknown message tag: {},'{:?}'", tag, body),
tag => {
return Err(ConnectionError::Protocol(format!(
"unknown message tag: {tag},'{body:?}'"
)))
}
}
})
}
@@ -275,7 +314,9 @@ impl FeStartupPacket {
/// Read startup message from the stream.
// XXX: It's tempting yet undesirable to accept `stream` by value,
// since such a change will cause user-supplied &mut references to be consumed
pub fn read(stream: &mut (impl io::Read + Unpin)) -> anyhow::Result<Option<FeMessage>> {
pub fn read(
stream: &mut (impl io::Read + Unpin),
) -> Result<Option<FeMessage>, ConnectionError> {
Self::read_fut(&mut AsyncishRead(stream)).wait()
}
@@ -284,7 +325,7 @@ impl FeStartupPacket {
// since such a change will cause user-supplied &mut references to be consumed
pub fn read_fut<Reader>(
stream: &mut Reader,
) -> SyncFuture<Reader, impl Future<Output = anyhow::Result<Option<FeMessage>>> + '_>
) -> SyncFuture<Reader, impl Future<Output = Result<Option<FeMessage>, ConnectionError>> + '_>
where
Reader: tokio::io::AsyncRead + Unpin,
{
@@ -302,31 +343,41 @@ impl FeStartupPacket {
let len = match retry_read!(stream.read_u32().await) {
Ok(len) => len as usize,
Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
Err(e) => return Err(e.into()),
Err(e) => return Err(ConnectionError::Socket(e)),
};
#[allow(clippy::manual_range_contains)]
if len < 4 || len > MAX_STARTUP_PACKET_LENGTH {
bail!("invalid message length");
return Err(ConnectionError::Protocol(format!(
"invalid message length {len}"
)));
}
let request_code = retry_read!(stream.read_u32().await)?;
let request_code =
retry_read!(stream.read_u32().await).map_err(ConnectionError::Socket)?;
// the rest of startup packet are params
let params_len = len - 8;
let mut params_bytes = vec![0u8; params_len];
stream.read_exact(params_bytes.as_mut()).await?;
stream
.read_exact(params_bytes.as_mut())
.await
.map_err(ConnectionError::Socket)?;
// Parse params depending on request code
let req_hi = request_code >> 16;
let req_lo = request_code & ((1 << 16) - 1);
let message = match (req_hi, req_lo) {
(RESERVED_INVALID_MAJOR_VERSION, CANCEL_REQUEST_CODE) => {
ensure!(params_len == 8, "expected 8 bytes for CancelRequest params");
if params_len != 8 {
return Err(ConnectionError::Protocol(
"expected 8 bytes for CancelRequest params".to_string(),
));
}
let mut cursor = Cursor::new(params_bytes);
FeStartupPacket::CancelRequest(CancelKeyData {
backend_pid: cursor.read_i32().await?,
cancel_key: cursor.read_i32().await?,
backend_pid: cursor.read_i32().await.map_err(ConnectionError::Socket)?,
cancel_key: cursor.read_i32().await.map_err(ConnectionError::Socket)?,
})
}
(RESERVED_INVALID_MAJOR_VERSION, NEGOTIATE_SSL_CODE) => {
@@ -338,7 +389,9 @@ impl FeStartupPacket {
FeStartupPacket::GssEncRequest
}
(RESERVED_INVALID_MAJOR_VERSION, unrecognized_code) => {
bail!("Unrecognized request code {}", unrecognized_code)
return Err(ConnectionError::Protocol(format!(
"Unrecognized request code {unrecognized_code}"
)));
}
// TODO bail if protocol major_version is not 3?
(major_version, minor_version) => {
@@ -346,15 +399,21 @@ impl FeStartupPacket {
// See `postgres: ProcessStartupPacket, build_startup_packet`.
let mut tokens = str::from_utf8(&params_bytes)
.context("StartupMessage params: invalid utf-8")?
.strip_suffix('\0') // drop packet's own null terminator
.context("StartupMessage params: missing null terminator")?
.strip_suffix('\0') // drop packet's own null
.ok_or_else(|| {
ConnectionError::Protocol(
"StartupMessage params: missing null terminator".to_string(),
)
})?
.split_terminator('\0');
let mut params = HashMap::new();
while let Some(name) = tokens.next() {
let value = tokens
.next()
.context("StartupMessage params: key without value")?;
let value = tokens.next().ok_or_else(|| {
ConnectionError::Protocol(
"StartupMessage params: key without value".to_string(),
)
})?;
params.insert(name.to_owned(), value.to_owned());
}
@@ -458,7 +517,7 @@ pub enum BeMessage<'a> {
CloseComplete,
// None means column is NULL
DataRow(&'a [Option<&'a [u8]>]),
ErrorResponse(&'a str),
ErrorResponse(&'a str, Option<&'a [u8; 5]>),
/// Single byte - used in response to SSLRequest/GSSENCRequest.
EncryptionResponse(bool),
NoData,
@@ -606,7 +665,7 @@ fn write_body<R>(buf: &mut BytesMut, f: impl FnOnce(&mut BytesMut) -> R) -> R {
}
/// Safe write of s into buf as cstring (String in the protocol).
fn write_cstr(s: impl AsRef<[u8]>, buf: &mut BytesMut) -> Result<(), io::Error> {
fn write_cstr(s: impl AsRef<[u8]>, buf: &mut BytesMut) -> io::Result<()> {
let bytes = s.as_ref();
if bytes.contains(&0) {
return Err(io::Error::new(
@@ -626,7 +685,7 @@ fn read_cstr(buf: &mut Bytes) -> anyhow::Result<Bytes> {
Ok(result)
}
const SQLSTATE_INTERNAL_ERROR: &str = "XX000\0";
pub const SQLSTATE_INTERNAL_ERROR: &[u8; 5] = b"XX000";
impl<'a> BeMessage<'a> {
/// Write message to the given buf.
@@ -767,10 +826,7 @@ impl<'a> BeMessage<'a> {
// First byte of each field represents type of this field. Set just enough fields
// to satisfy rust-postgres client: 'S' -- severity, 'C' -- error, 'M' -- error
// message text.
BeMessage::ErrorResponse(error_msg) => {
// For all the errors set Severity to Error and error code to
// 'internal error'.
BeMessage::ErrorResponse(error_msg, pg_error_code) => {
// 'E' signalizes ErrorResponse messages
buf.put_u8(b'E');
write_body(buf, |buf| {
@@ -778,7 +834,9 @@ impl<'a> BeMessage<'a> {
buf.put_slice(b"ERROR\0");
buf.put_u8(b'C'); // SQLSTATE error code
buf.put_slice(SQLSTATE_INTERNAL_ERROR.as_bytes());
buf.put_slice(&terminate_code(
pg_error_code.unwrap_or(SQLSTATE_INTERNAL_ERROR),
));
buf.put_u8(b'M'); // the message
write_cstr(error_msg, buf)?;
@@ -801,7 +859,7 @@ impl<'a> BeMessage<'a> {
buf.put_slice(b"NOTICE\0");
buf.put_u8(b'C'); // SQLSTATE error code
buf.put_slice(SQLSTATE_INTERNAL_ERROR.as_bytes());
buf.put_slice(&terminate_code(SQLSTATE_INTERNAL_ERROR));
buf.put_u8(b'M'); // the message
write_cstr(error_msg.as_bytes(), buf)?;
@@ -1089,3 +1147,12 @@ mod tests {
let _ = FeStartupPacket::read_fut(stream).await;
}
}
fn terminate_code(code: &[u8; 5]) -> [u8; 6] {
let mut terminated = [0; 6];
for (i, &elem) in code.iter().enumerate() {
terminated[i] = elem;
}
terminated
}

View File

@@ -1,28 +1,28 @@
[package]
name = "remote_storage"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = { version = "1.0", features = ["backtrace"] }
async-trait = "0.1"
metrics = { version = "0.1", path = "../metrics" }
utils = { version = "0.1", path = "../utils" }
once_cell = "1.13.0"
aws-smithy-http = "0.51.0"
aws-types = "0.51.0"
aws-config = { version = "0.51.0", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.21.0"
hyper = { version = "0.14", features = ["stream"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
tokio = { version = "1.17", features = ["sync", "macros", "fs", "io-util"] }
tokio-util = { version = "0.7", features = ["io"] }
toml_edit = { version = "0.14", features = ["easy"] }
tracing = "0.1.27"
anyhow.workspace = true
async-trait.workspace = true
once_cell.workspace = true
aws-smithy-http.workspace = true
aws-types.workspace = true
aws-config.workspace = true
aws-sdk-s3.workspace = true
hyper = { workspace = true, features = ["stream"] }
serde.workspace = true
serde_json.workspace = true
tokio = { workspace = true, features = ["sync", "fs", "io-util"] }
tokio-util.workspace = true
toml_edit.workspace = true
tracing.workspace = true
metrics.workspace = true
utils.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
workspace_hack.workspace = true
[dev-dependencies]
tempfile = "3.2"
tempfile.workspace = true

View File

@@ -1,13 +1,13 @@
[package]
name = "safekeeper_api"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_with = "2.0"
const_format = "0.2.21"
serde.workspace = true
serde_with.workspace = true
const_format.workspace = true
utils.workspace = true
utils = { path = "../utils" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
workspace_hack.workspace = true

View File

@@ -1,9 +1,11 @@
[package]
name = "tenant_size_model"
version = "0.1.0"
edition = "2021"
edition.workspace = true
publish = false
license = "Apache-2.0"
license.workspace = true
[dependencies]
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
anyhow.workspace = true
workspace_hack.workspace = true

View File

@@ -1,6 +1,8 @@
use std::borrow::Cow;
use std::collections::HashMap;
use anyhow::Context;
/// Pricing model or history size builder.
///
/// Maintains knowledge of the branches and their modifications. Generic over the branch name key
@@ -134,7 +136,7 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
size: Option<u64>,
) where
K: std::borrow::Borrow<Q>,
Q: std::hash::Hash + Eq,
Q: std::hash::Hash + Eq + std::fmt::Debug,
{
let lastseg_id = *self.branches.get(branch).unwrap();
let newseg_id = self.segments.len();
@@ -214,20 +216,24 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
}
/// Panics if the parent branch cannot be found.
pub fn branch<Q: ?Sized>(&mut self, parent: &Q, name: K)
pub fn branch<Q: ?Sized>(&mut self, parent: &Q, name: K) -> anyhow::Result<()>
where
K: std::borrow::Borrow<Q>,
Q: std::hash::Hash + Eq,
K: std::borrow::Borrow<Q> + std::fmt::Debug,
Q: std::hash::Hash + Eq + std::fmt::Debug,
{
// Find the right segment
let branchseg_id = *self
.branches
.get(parent)
.expect("should had found the parent by key");
let branchseg_id = *self.branches.get(parent).with_context(|| {
format!(
"should had found the parent {:?} by key. in branches {:?}",
parent, self.branches
)
})?;
let _branchseg = &mut self.segments[branchseg_id];
// Create branch name for it
self.branches.insert(name, branchseg_id);
Ok(())
}
pub fn calculate(&mut self, retention_period: u64) -> SegmentSize {

View File

@@ -38,7 +38,7 @@ fn scenario_2() -> (Vec<Segment>, SegmentSize) {
}
// Branch
storage.branch("main", "child");
storage.branch("main", "child").unwrap();
storage.update("child", 1_000);
// More updates on parent
@@ -63,7 +63,7 @@ fn scenario_3() -> (Vec<Segment>, SegmentSize) {
}
// Branch
storage.branch("main", "child");
storage.branch("main", "child").unwrap();
storage.update("child", 1_000);
// More updates on parent
@@ -90,7 +90,7 @@ fn scenario_4() -> (Vec<Segment>, SegmentSize) {
}
// Branch
storage.branch("main", "child");
storage.branch("main", "child").unwrap();
storage.update("child", 1_000);
// More updates on parent
@@ -106,10 +106,10 @@ fn scenario_4() -> (Vec<Segment>, SegmentSize) {
fn scenario_5() -> (Vec<Segment>, SegmentSize) {
let mut storage = Storage::new("a");
storage.insert("a", 5000);
storage.branch("a", "b");
storage.branch("a", "b").unwrap();
storage.update("b", 4000);
storage.update("a", 2000);
storage.branch("a", "c");
storage.branch("a", "c").unwrap();
storage.insert("c", 4000);
storage.insert("a", 2000);
@@ -133,12 +133,12 @@ fn scenario_6() -> (Vec<Segment>, SegmentSize) {
let mut storage = Storage::new(None);
storage.branch(&None, branches[0]); // at 0
storage.branch(&None, branches[0]).unwrap(); // at 0
storage.modify_branch(&branches[0], NO_OP, 108951064, 43696128); // at 108951064
storage.branch(&branches[0], branches[1]); // at 108951064
storage.branch(&branches[0], branches[1]).unwrap(); // at 108951064
storage.modify_branch(&branches[1], NO_OP, 15560408, -1851392); // at 124511472
storage.modify_branch(&branches[0], NO_OP, 174464360, -1531904); // at 283415424
storage.branch(&branches[0], branches[2]); // at 283415424
storage.branch(&branches[0], branches[2]).unwrap(); // at 283415424
storage.modify_branch(&branches[2], NO_OP, 15906192, 8192); // at 299321616
storage.modify_branch(&branches[0], NO_OP, 18909976, 32768); // at 302325400

View File

@@ -1,48 +1,49 @@
[package]
name = "utils"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
sentry = { version = "0.29.0", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
async-trait = "0.1"
anyhow = "1.0"
bincode = "1.3"
bytes = "1.0.1"
hyper = { version = "0.14.7", features = ["full"] }
routerify = "3"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
thiserror = "1.0"
tokio = { version = "1.17", features = ["macros"]}
tokio-rustls = "0.23"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
nix = "0.25"
signal-hook = "0.3.10"
rand = "0.8.3"
jsonwebtoken = "8"
hex = { version = "0.4.3", features = ["serde"] }
rustls = "0.20.2"
rustls-split = "0.3.0"
git-version = "0.3.5"
serde_with = "2.0"
once_cell = "1.13.0"
strum = "0.24"
strum_macros = "0.24"
sentry.workspace = true
async-trait.workspace = true
anyhow.workspace = true
bincode.workspace = true
bytes.workspace = true
hyper = { workspace = true, features = ["full"] }
routerify.workspace = true
serde.workspace = true
serde_json.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-rustls.workspace = true
tracing.workspace = true
tracing-subscriber = { workspace = true, features = ["json"] }
nix.workspace = true
signal-hook.workspace = true
rand.workspace = true
jsonwebtoken.workspace = true
hex = { workspace = true, features = ["serde"] }
rustls.workspace = true
rustls-split.workspace = true
git-version.workspace = true
serde_with.workspace = true
once_cell.workspace = true
strum.workspace = true
strum_macros.workspace = true
metrics = { path = "../metrics" }
pq_proto = { path = "../pq_proto" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
metrics.workspace = true
pq_proto.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
byteorder = "1.4.3"
bytes = "1.0.1"
hex-literal = "0.3"
tempfile = "3.2"
criterion = "0.4"
rustls-pemfile = "1"
byteorder.workspace = true
bytes.workspace = true
hex-literal.workspace = true
tempfile.workspace = true
criterion.workspace = true
rustls-pemfile.workspace = true
[[bench]]
name = "benchmarks"

View File

@@ -8,6 +8,7 @@ use strum_macros::{EnumString, EnumVariantNames};
pub enum LogFormat {
Plain,
Json,
Test,
}
impl LogFormat {
@@ -39,6 +40,7 @@ pub fn init(log_format: LogFormat) -> anyhow::Result<()> {
match log_format {
LogFormat::Json => base_logger.json().init(),
LogFormat::Plain => base_logger.init(),
LogFormat::Test => base_logger.with_test_writer().init(),
}
Ok(())

View File

@@ -3,8 +3,9 @@
//! implementation determining how to process the queries. Currently its API
//! is rather narrow, but we can extend it once required.
use crate::postgres_backend_async::{log_query_error, short_error, QueryError};
use crate::sock_split::{BidiStream, ReadStream, WriteStream};
use anyhow::{bail, ensure, Context, Result};
use anyhow::Context;
use bytes::{Bytes, BytesMut};
use pq_proto::{BeMessage, FeMessage, FeStartupPacket};
use serde::{Deserialize, Serialize};
@@ -21,20 +22,32 @@ pub trait Handler {
/// postgres_backend will issue ReadyForQuery after calling this (this
/// might be not what we want after CopyData streaming, but currently we don't
/// care).
fn process_query(&mut self, pgb: &mut PostgresBackend, query_string: &str) -> Result<()>;
fn process_query(
&mut self,
pgb: &mut PostgresBackend,
query_string: &str,
) -> Result<(), QueryError>;
/// Called on startup packet receival, allows to process params.
///
/// If Ok(false) is returned postgres_backend will skip auth -- that is needed for new users
/// creation is the proxy code. That is quite hacky and ad-hoc solution, may be we could allow
/// to override whole init logic in implementations.
fn startup(&mut self, _pgb: &mut PostgresBackend, _sm: &FeStartupPacket) -> Result<()> {
fn startup(
&mut self,
_pgb: &mut PostgresBackend,
_sm: &FeStartupPacket,
) -> Result<(), QueryError> {
Ok(())
}
/// Check auth jwt
fn check_auth_jwt(&mut self, _pgb: &mut PostgresBackend, _jwt_response: &[u8]) -> Result<()> {
bail!("JWT auth failed")
fn check_auth_jwt(
&mut self,
_pgb: &mut PostgresBackend,
_jwt_response: &[u8],
) -> Result<(), QueryError> {
Err(QueryError::Other(anyhow::anyhow!("JWT auth failed")))
}
fn is_shutdown_requested(&self) -> bool {
@@ -66,7 +79,7 @@ impl FromStr for AuthType {
match s {
"Trust" => Ok(Self::Trust),
"NeonJWT" => Ok(Self::NeonJWT),
_ => bail!("invalid value \"{s}\" for auth type"),
_ => anyhow::bail!("invalid value \"{s}\" for auth type"),
}
}
}
@@ -154,7 +167,7 @@ pub fn is_socket_read_timed_out(error: &anyhow::Error) -> bool {
}
// Cast a byte slice to a string slice, dropping null terminator if there's one.
fn cstr_to_str(bytes: &[u8]) -> Result<&str> {
fn cstr_to_str(bytes: &[u8]) -> anyhow::Result<&str> {
let without_null = bytes.strip_suffix(&[0]).unwrap_or(bytes);
std::str::from_utf8(without_null).map_err(|e| e.into())
}
@@ -188,10 +201,10 @@ impl PostgresBackend {
}
/// Get direct reference (into the Option) to the read stream.
fn get_stream_in(&mut self) -> Result<&mut BidiStream> {
fn get_stream_in(&mut self) -> anyhow::Result<&mut BidiStream> {
match &mut self.stream {
Some(Stream::Bidirectional(stream)) => Ok(stream),
_ => bail!("reader taken"),
_ => anyhow::bail!("reader taken"),
}
}
@@ -215,7 +228,7 @@ impl PostgresBackend {
}
/// Read full message or return None if connection is closed.
pub fn read_message(&mut self) -> Result<Option<FeMessage>> {
pub fn read_message(&mut self) -> Result<Option<FeMessage>, QueryError> {
let (state, stream) = (self.state, self.get_stream_in()?);
use ProtoState::*;
@@ -223,6 +236,7 @@ impl PostgresBackend {
Initialization | Encrypted => FeStartupPacket::read(stream),
Authentication | Established => FeMessage::read(stream),
}
.map_err(QueryError::from)
}
/// Write message into internal output buffer.
@@ -246,7 +260,7 @@ impl PostgresBackend {
}
// Wrapper for run_message_loop() that shuts down socket when we are done
pub fn run(mut self, handler: &mut impl Handler) -> Result<()> {
pub fn run(mut self, handler: &mut impl Handler) -> Result<(), QueryError> {
let ret = self.run_message_loop(handler);
if let Some(stream) = self.stream.as_mut() {
let _ = stream.shutdown(Shutdown::Both);
@@ -254,7 +268,7 @@ impl PostgresBackend {
ret
}
fn run_message_loop(&mut self, handler: &mut impl Handler) -> Result<()> {
fn run_message_loop(&mut self, handler: &mut impl Handler) -> Result<(), QueryError> {
trace!("postgres backend to {:?} started", self.peer_addr);
let mut unnamed_query_string = Bytes::new();
@@ -263,7 +277,7 @@ impl PostgresBackend {
match self.read_message() {
Ok(message) => {
if let Some(msg) = message {
trace!("got message {:?}", msg);
trace!("got message {msg:?}");
match self.process_message(handler, msg, &mut unnamed_query_string)? {
ProcessMsgResult::Continue => continue,
@@ -274,10 +288,12 @@ impl PostgresBackend {
}
}
Err(e) => {
// If it is a timeout error, continue the loop
if !is_socket_read_timed_out(&e) {
return Err(e);
if let QueryError::Other(e) = &e {
if is_socket_read_timed_out(e) {
continue;
}
}
return Err(e);
}
}
}
@@ -295,7 +311,7 @@ impl PostgresBackend {
}
stream => {
self.stream = stream;
bail!("can't start TLs without bidi stream");
anyhow::bail!("can't start TLs without bidi stream");
}
}
}
@@ -305,17 +321,16 @@ impl PostgresBackend {
handler: &mut impl Handler,
msg: FeMessage,
unnamed_query_string: &mut Bytes,
) -> Result<ProcessMsgResult> {
) -> Result<ProcessMsgResult, QueryError> {
// Allow only startup and password messages during auth. Otherwise client would be able to bypass auth
// TODO: change that to proper top-level match of protocol state with separate message handling for each state
if self.state < ProtoState::Established {
ensure!(
matches!(
msg,
FeMessage::PasswordMessage(_) | FeMessage::StartupPacket(_)
),
"protocol violation"
);
if self.state < ProtoState::Established
&& !matches!(
msg,
FeMessage::PasswordMessage(_) | FeMessage::StartupPacket(_)
)
{
return Err(QueryError::Other(anyhow::anyhow!("protocol violation")));
}
let have_tls = self.tls_config.is_some();
@@ -339,8 +354,13 @@ impl PostgresBackend {
}
FeStartupPacket::StartupMessage { .. } => {
if have_tls && !matches!(self.state, ProtoState::Encrypted) {
self.write_message(&BeMessage::ErrorResponse("must connect with TLS"))?;
bail!("client did not connect with TLS");
self.write_message(&BeMessage::ErrorResponse(
"must connect with TLS",
None,
))?;
return Err(QueryError::Other(anyhow::anyhow!(
"client did not connect with TLS"
)));
}
// NB: startup() may change self.auth_type -- we are using that in proxy code
@@ -379,8 +399,11 @@ impl PostgresBackend {
let (_, jwt_response) = m.split_last().context("protocol violation")?;
if let Err(e) = handler.check_auth_jwt(self, jwt_response) {
self.write_message(&BeMessage::ErrorResponse(&e.to_string()))?;
bail!("auth failed: {}", e);
self.write_message(&BeMessage::ErrorResponse(
&e.to_string(),
Some(e.pg_error_code()),
))?;
return Err(e);
}
}
}
@@ -394,33 +417,14 @@ impl PostgresBackend {
// remove null terminator
let query_string = cstr_to_str(&body)?;
trace!("got query {:?}", query_string);
// xxx distinguish fatal and recoverable errors?
trace!("got query {query_string:?}");
if let Err(e) = handler.process_query(self, query_string) {
// ":?" uses the alternate formatting style, which makes anyhow display the
// full cause of the error, not just the top-level context + its trace.
// We don't want to send that in the ErrorResponse though,
// because it's not relevant to the compute node logs.
//
// We also don't want to log full stacktrace when the error is primitive,
// such as usual connection closed.
let short_error = format!("{:#}", e);
let root_cause = e.root_cause().to_string();
if root_cause.contains("connection closed unexpectedly")
|| root_cause.contains("Broken pipe (os error 32)")
{
error!(
"query handler for '{}' failed: {}",
query_string, short_error
);
} else {
error!("query handler for '{}' failed: {:?}", query_string, e);
}
self.write_message_noflush(&BeMessage::ErrorResponse(&short_error))?;
// TODO: untangle convoluted control flow
if e.to_string().contains("failed to run") {
return Ok(ProcessMsgResult::Break);
}
log_query_error(query_string, &e);
let short_error = short_error(&e);
self.write_message_noflush(&BeMessage::ErrorResponse(
&short_error,
Some(e.pg_error_code()),
))?;
}
self.write_message(&BeMessage::ReadyForQuery)?;
}
@@ -445,11 +449,13 @@ impl PostgresBackend {
FeMessage::Execute(_) => {
let query_string = cstr_to_str(unnamed_query_string)?;
trace!("got execute {:?}", query_string);
// xxx distinguish fatal and recoverable errors?
trace!("got execute {query_string:?}");
if let Err(e) = handler.process_query(self, query_string) {
error!("query handler for '{}' failed: {:?}", query_string, e);
self.write_message(&BeMessage::ErrorResponse(&e.to_string()))?;
log_query_error(query_string, &e);
self.write_message(&BeMessage::ErrorResponse(
&e.to_string(),
Some(e.pg_error_code()),
))?;
}
// NOTE there is no ReadyForQuery message. This handler is used
// for basebackup and it uses CopyOut which doesn't require
@@ -468,7 +474,9 @@ impl PostgresBackend {
// We prefer explicit pattern matching to wildcards, because
// this helps us spot the places where new variants are missing
FeMessage::CopyData(_) | FeMessage::CopyDone | FeMessage::CopyFail => {
bail!("unexpected message type: {:?}", msg);
return Err(QueryError::Other(anyhow::anyhow!(
"unexpected message type: {msg:?}"
)));
}
}

View File

@@ -4,39 +4,87 @@
//! is rather narrow, but we can extend it once required.
use crate::postgres_backend::AuthType;
use anyhow::{bail, Context, Result};
use anyhow::Context;
use bytes::{Buf, Bytes, BytesMut};
use pq_proto::{BeMessage, FeMessage, FeStartupPacket};
use std::future::Future;
use pq_proto::{BeMessage, ConnectionError, FeMessage, FeStartupPacket, SQLSTATE_INTERNAL_ERROR};
use std::io;
use std::net::SocketAddr;
use std::pin::Pin;
use std::sync::Arc;
use std::task::Poll;
use tracing::{debug, error, trace};
use std::{future::Future, task::ready};
use tracing::{debug, error, info, trace};
use tokio::io::{AsyncRead, AsyncWrite, AsyncWriteExt, BufReader};
use tokio_rustls::TlsAcceptor;
pub fn is_expected_io_error(e: &io::Error) -> bool {
use io::ErrorKind::*;
matches!(
e.kind(),
ConnectionRefused | ConnectionAborted | ConnectionReset
)
}
/// An error, occurred during query processing:
/// either during the connection ([`ConnectionError`]) or before/after it.
#[derive(thiserror::Error, Debug)]
pub enum QueryError {
/// The connection was lost while processing the query.
#[error(transparent)]
Disconnected(#[from] ConnectionError),
/// Some other error
#[error(transparent)]
Other(#[from] anyhow::Error),
}
impl From<io::Error> for QueryError {
fn from(e: io::Error) -> Self {
Self::Disconnected(ConnectionError::Socket(e))
}
}
impl QueryError {
pub fn pg_error_code(&self) -> &'static [u8; 5] {
match self {
Self::Disconnected(_) => b"08006", // connection failure
Self::Other(_) => SQLSTATE_INTERNAL_ERROR, // internal error
}
}
}
#[async_trait::async_trait]
pub trait Handler {
/// Handle single query.
/// postgres_backend will issue ReadyForQuery after calling this (this
/// might be not what we want after CopyData streaming, but currently we don't
/// care).
async fn process_query(&mut self, pgb: &mut PostgresBackend, query_string: &str) -> Result<()>;
async fn process_query(
&mut self,
pgb: &mut PostgresBackend,
query_string: &str,
) -> Result<(), QueryError>;
/// Called on startup packet receival, allows to process params.
///
/// If Ok(false) is returned postgres_backend will skip auth -- that is needed for new users
/// creation is the proxy code. That is quite hacky and ad-hoc solution, may be we could allow
/// to override whole init logic in implementations.
fn startup(&mut self, _pgb: &mut PostgresBackend, _sm: &FeStartupPacket) -> Result<()> {
fn startup(
&mut self,
_pgb: &mut PostgresBackend,
_sm: &FeStartupPacket,
) -> Result<(), QueryError> {
Ok(())
}
/// Check auth jwt
fn check_auth_jwt(&mut self, _pgb: &mut PostgresBackend, _jwt_response: &[u8]) -> Result<()> {
bail!("JWT auth failed")
fn check_auth_jwt(
&mut self,
_pgb: &mut PostgresBackend,
_jwt_response: &[u8],
) -> Result<(), QueryError> {
Err(QueryError::Other(anyhow::anyhow!("JWT auth failed")))
}
}
@@ -70,17 +118,14 @@ impl AsyncWrite for Stream {
self: Pin<&mut Self>,
cx: &mut std::task::Context<'_>,
buf: &[u8],
) -> Poll<Result<usize, std::io::Error>> {
) -> Poll<io::Result<usize>> {
match self.get_mut() {
Self::Unencrypted(stream) => Pin::new(stream).poll_write(cx, buf),
Self::Tls(stream) => Pin::new(stream).poll_write(cx, buf),
Self::Broken => unreachable!(),
}
}
fn poll_flush(
self: Pin<&mut Self>,
cx: &mut std::task::Context<'_>,
) -> Poll<Result<(), std::io::Error>> {
fn poll_flush(self: Pin<&mut Self>, cx: &mut std::task::Context<'_>) -> Poll<io::Result<()>> {
match self.get_mut() {
Self::Unencrypted(stream) => Pin::new(stream).poll_flush(cx),
Self::Tls(stream) => Pin::new(stream).poll_flush(cx),
@@ -90,7 +135,7 @@ impl AsyncWrite for Stream {
fn poll_shutdown(
self: Pin<&mut Self>,
cx: &mut std::task::Context<'_>,
) -> Poll<Result<(), std::io::Error>> {
) -> Poll<io::Result<()>> {
match self.get_mut() {
Self::Unencrypted(stream) => Pin::new(stream).poll_shutdown(cx),
Self::Tls(stream) => Pin::new(stream).poll_shutdown(cx),
@@ -103,7 +148,7 @@ impl AsyncRead for Stream {
self: Pin<&mut Self>,
cx: &mut std::task::Context<'_>,
buf: &mut tokio::io::ReadBuf<'_>,
) -> Poll<Result<(), std::io::Error>> {
) -> Poll<io::Result<()>> {
match self.get_mut() {
Self::Unencrypted(stream) => Pin::new(stream).poll_read(cx, buf),
Self::Tls(stream) => Pin::new(stream).poll_read(cx, buf),
@@ -139,7 +184,7 @@ pub fn query_from_cstring(query_string: Bytes) -> Vec<u8> {
}
// Cast a byte slice to a string slice, dropping null terminator if there's one.
fn cstr_to_str(bytes: &[u8]) -> Result<&str> {
fn cstr_to_str(bytes: &[u8]) -> anyhow::Result<&str> {
let without_null = bytes.strip_suffix(&[0]).unwrap_or(bytes);
std::str::from_utf8(without_null).map_err(|e| e.into())
}
@@ -149,7 +194,7 @@ impl PostgresBackend {
socket: tokio::net::TcpStream,
auth_type: AuthType,
tls_config: Option<Arc<rustls::ServerConfig>>,
) -> std::io::Result<Self> {
) -> io::Result<Self> {
let peer_addr = socket.peer_addr()?;
Ok(Self {
@@ -167,17 +212,18 @@ impl PostgresBackend {
}
/// Read full message or return None if connection is closed.
pub async fn read_message(&mut self) -> Result<Option<FeMessage>> {
pub async fn read_message(&mut self) -> Result<Option<FeMessage>, QueryError> {
use ProtoState::*;
match self.state {
Initialization | Encrypted => FeStartupPacket::read_fut(&mut self.stream).await,
Authentication | Established => FeMessage::read_fut(&mut self.stream).await,
Closed => Ok(None),
}
.map_err(QueryError::from)
}
/// Flush output buffer into the socket.
pub async fn flush(&mut self) -> std::io::Result<()> {
pub async fn flush(&mut self) -> io::Result<()> {
while self.buf_out.has_remaining() {
let bytes_written = self.stream.write(self.buf_out.chunk()).await?;
self.buf_out.advance(bytes_written);
@@ -187,7 +233,7 @@ impl PostgresBackend {
}
/// Write message into internal output buffer.
pub fn write_message(&mut self, message: &BeMessage<'_>) -> Result<&mut Self, std::io::Error> {
pub fn write_message(&mut self, message: &BeMessage<'_>) -> io::Result<&mut Self> {
BeMessage::write(&mut self.buf_out, message)?;
Ok(self)
}
@@ -207,12 +253,9 @@ impl PostgresBackend {
cx: &mut std::task::Context<'_>,
) -> Poll<Result<(), std::io::Error>> {
while self.buf_out.has_remaining() {
match Pin::new(&mut self.stream).poll_write(cx, self.buf_out.chunk()) {
Poll::Ready(Ok(bytes_written)) => {
self.buf_out.advance(bytes_written);
}
Poll::Ready(Err(err)) => return Poll::Ready(Err(err)),
Poll::Pending => return Poll::Pending,
match ready!(Pin::new(&mut self.stream).poll_write(cx, self.buf_out.chunk())) {
Ok(bytes_written) => self.buf_out.advance(bytes_written),
Err(err) => return Poll::Ready(Err(err)),
}
}
Poll::Ready(Ok(()))
@@ -223,7 +266,11 @@ impl PostgresBackend {
}
// Wrapper for run_message_loop() that shuts down socket when we are done
pub async fn run<F, S>(mut self, handler: &mut impl Handler, shutdown_watcher: F) -> Result<()>
pub async fn run<F, S>(
mut self,
handler: &mut impl Handler,
shutdown_watcher: F,
) -> Result<(), QueryError>
where
F: Fn() -> S,
S: Future,
@@ -237,7 +284,7 @@ impl PostgresBackend {
&mut self,
handler: &mut impl Handler,
shutdown_watcher: F,
) -> Result<()>
) -> Result<(), QueryError>
where
F: Fn() -> S,
S: Future,
@@ -273,7 +320,7 @@ impl PostgresBackend {
return Ok(());
}
}
Ok::<(), anyhow::Error>(())
Ok::<(), QueryError>(())
} => {
// Handshake complete.
result?;
@@ -318,14 +365,14 @@ impl PostgresBackend {
self.stream = Stream::Tls(Box::new(tls_stream));
return Ok(());
};
bail!("TLS already started");
anyhow::bail!("TLS already started");
}
async fn process_handshake_message(
&mut self,
handler: &mut impl Handler,
msg: FeMessage,
) -> Result<ProcessMsgResult> {
) -> Result<ProcessMsgResult, QueryError> {
assert!(self.state < ProtoState::Established);
let have_tls = self.tls_config.is_some();
match msg {
@@ -348,8 +395,13 @@ impl PostgresBackend {
}
FeStartupPacket::StartupMessage { .. } => {
if have_tls && !matches!(self.state, ProtoState::Encrypted) {
self.write_message(&BeMessage::ErrorResponse("must connect with TLS"))?;
bail!("client did not connect with TLS");
self.write_message(&BeMessage::ErrorResponse(
"must connect with TLS",
None,
))?;
return Err(QueryError::Other(anyhow::anyhow!(
"client did not connect with TLS"
)));
}
// NB: startup() may change self.auth_type -- we are using that in proxy code
@@ -389,8 +441,11 @@ impl PostgresBackend {
let (_, jwt_response) = m.split_last().context("protocol violation")?;
if let Err(e) = handler.check_auth_jwt(self, jwt_response) {
self.write_message(&BeMessage::ErrorResponse(&e.to_string()))?;
bail!("auth failed: {}", e);
self.write_message(&BeMessage::ErrorResponse(
&e.to_string(),
Some(e.pg_error_code()),
))?;
return Err(e);
}
}
}
@@ -413,33 +468,28 @@ impl PostgresBackend {
handler: &mut impl Handler,
msg: FeMessage,
unnamed_query_string: &mut Bytes,
) -> Result<ProcessMsgResult> {
) -> Result<ProcessMsgResult, QueryError> {
// Allow only startup and password messages during auth. Otherwise client would be able to bypass auth
// TODO: change that to proper top-level match of protocol state with separate message handling for each state
assert!(self.state == ProtoState::Established);
match msg {
FeMessage::StartupPacket(_) | FeMessage::PasswordMessage(_) => {
bail!("protocol violation");
return Err(QueryError::Other(anyhow::anyhow!("protocol violation")));
}
FeMessage::Query(body) => {
// remove null terminator
let query_string = cstr_to_str(&body)?;
trace!("got query {:?}", query_string);
// xxx distinguish fatal and recoverable errors?
trace!("got query {query_string:?}");
if let Err(e) = handler.process_query(self, query_string).await {
// ":?" uses the alternate formatting style, which makes anyhow display the
// full cause of the error, not just the top-level context + its trace.
// We don't want to send that in the ErrorResponse though,
// because it's not relevant to the compute node logs.
error!("query handler for '{}' failed: {:?}", query_string, e);
self.write_message(&BeMessage::ErrorResponse(&e.to_string()))?;
// TODO: untangle convoluted control flow
if e.to_string().contains("failed to run") {
return Ok(ProcessMsgResult::Break);
}
log_query_error(query_string, &e);
let short_error = short_error(&e);
self.write_message(&BeMessage::ErrorResponse(
&short_error,
Some(e.pg_error_code()),
))?;
}
self.write_message(&BeMessage::ReadyForQuery)?;
}
@@ -464,11 +514,13 @@ impl PostgresBackend {
FeMessage::Execute(_) => {
let query_string = cstr_to_str(unnamed_query_string)?;
trace!("got execute {:?}", query_string);
// xxx distinguish fatal and recoverable errors?
trace!("got execute {query_string:?}");
if let Err(e) = handler.process_query(self, query_string).await {
error!("query handler for '{}' failed: {:?}", query_string, e);
self.write_message(&BeMessage::ErrorResponse(&e.to_string()))?;
log_query_error(query_string, &e);
self.write_message(&BeMessage::ErrorResponse(
&e.to_string(),
Some(e.pg_error_code()),
))?;
}
// NOTE there is no ReadyForQuery message. This handler is used
// for basebackup and it uses CopyOut which doesn't require
@@ -487,7 +539,10 @@ impl PostgresBackend {
// We prefer explicit pattern matching to wildcards, because
// this helps us spot the places where new variants are missing
FeMessage::CopyData(_) | FeMessage::CopyDone | FeMessage::CopyFail => {
bail!("unexpected message type: {:?}", msg);
return Err(QueryError::Other(anyhow::anyhow!(
"unexpected message type: {:?}",
msg
)));
}
}
@@ -515,10 +570,9 @@ impl<'a> AsyncWrite for CopyDataWriter<'a> {
// It's not strictly required to flush between each message, but makes it easier
// to view in wireshark, and usually the messages that the callers write are
// decently-sized anyway.
match this.pgb.poll_write_buf(cx) {
Poll::Ready(Ok(())) => {}
Poll::Ready(Err(err)) => return Poll::Ready(Err(err)),
Poll::Pending => return Poll::Pending,
match ready!(this.pgb.poll_write_buf(cx)) {
Ok(()) => {}
Err(err) => return Poll::Ready(Err(err)),
}
// CopyData
@@ -535,10 +589,9 @@ impl<'a> AsyncWrite for CopyDataWriter<'a> {
cx: &mut std::task::Context<'_>,
) -> Poll<Result<(), std::io::Error>> {
let this = self.get_mut();
match this.pgb.poll_write_buf(cx) {
Poll::Ready(Ok(())) => {}
Poll::Ready(Err(err)) => return Poll::Ready(Err(err)),
Poll::Pending => return Poll::Pending,
match ready!(this.pgb.poll_write_buf(cx)) {
Ok(()) => {}
Err(err) => return Poll::Ready(Err(err)),
}
this.pgb.poll_flush(cx)
}
@@ -547,11 +600,35 @@ impl<'a> AsyncWrite for CopyDataWriter<'a> {
cx: &mut std::task::Context<'_>,
) -> Poll<Result<(), std::io::Error>> {
let this = self.get_mut();
match this.pgb.poll_write_buf(cx) {
Poll::Ready(Ok(())) => {}
Poll::Ready(Err(err)) => return Poll::Ready(Err(err)),
Poll::Pending => return Poll::Pending,
match ready!(this.pgb.poll_write_buf(cx)) {
Ok(()) => {}
Err(err) => return Poll::Ready(Err(err)),
}
this.pgb.poll_flush(cx)
}
}
pub fn short_error(e: &QueryError) -> String {
match e {
QueryError::Disconnected(connection_error) => connection_error.to_string(),
QueryError::Other(e) => format!("{e:#}"),
}
}
pub(super) fn log_query_error(query: &str, e: &QueryError) {
match e {
QueryError::Disconnected(ConnectionError::Socket(io_error)) => {
if is_expected_io_error(io_error) {
info!("query handler for '{query}' failed with expected io error: {io_error}");
} else {
error!("query handler for '{query}' failed with io error: {io_error}");
}
}
QueryError::Disconnected(other_connection_error) => {
error!("query handler for '{query}' failed with connection error: {other_connection_error:?}")
}
QueryError::Other(e) => {
error!("query handler for '{query}' failed: {e:?}");
}
}
}

View File

@@ -9,7 +9,10 @@ use byteorder::{BigEndian, ReadBytesExt, WriteBytesExt};
use bytes::{Buf, BufMut, Bytes, BytesMut};
use once_cell::sync::Lazy;
use utils::postgres_backend::{AuthType, Handler, PostgresBackend};
use utils::{
postgres_backend::{AuthType, Handler, PostgresBackend},
postgres_backend_async::QueryError,
};
fn make_tcp_pair() -> (TcpStream, TcpStream) {
let listener = TcpListener::bind("127.0.0.1:0").unwrap();
@@ -105,7 +108,7 @@ fn ssl() {
&mut self,
_pgb: &mut PostgresBackend,
query_string: &str,
) -> anyhow::Result<()> {
) -> Result<(), QueryError> {
self.got_query = query_string == QUERY;
Ok(())
}
@@ -152,7 +155,7 @@ fn no_ssl() {
&mut self,
_pgb: &mut PostgresBackend,
_query_string: &str,
) -> anyhow::Result<()> {
) -> Result<(), QueryError> {
panic!()
}
}
@@ -212,7 +215,7 @@ fn server_forces_ssl() {
&mut self,
_pgb: &mut PostgresBackend,
_query_string: &str,
) -> anyhow::Result<()> {
) -> Result<(), QueryError> {
panic!()
}
}

View File

@@ -1,8 +1,8 @@
[package]
name = "pageserver"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[features]
default = []
@@ -11,68 +11,68 @@ default = []
testing = ["fail/failpoints"]
[dependencies]
amplify_num = { git = "https://github.com/hlinnaka/rust-amplify.git", branch = "unsigned-int-perf" }
anyhow = { version = "1.0", features = ["backtrace"] }
async-stream = "0.3"
async-trait = "0.1"
byteorder = "1.4.3"
bytes = "1.0.1"
chrono = { version = "0.4.23", default-features = false, features = ["clock", "serde"] }
clap = { version = "4.0", features = ["string"] }
close_fds = "0.3.2"
const_format = "0.2.21"
crc32c = "0.6.0"
crossbeam-utils = "0.8.5"
fail = "0.5.0"
futures = "0.3.13"
git-version = "0.3.5"
hex = "0.4.3"
humantime = "2.1.0"
humantime-serde = "1.1.1"
hyper = "0.14"
itertools = "0.10.3"
nix = "0.25"
num-traits = "0.2.15"
once_cell = "1.13.0"
pin-project-lite = "0.2.7"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
rand = "0.8.3"
regex = "1.4.5"
rstar = "0.9.3"
scopeguard = "1.1.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = { version = "1.0", features = ["raw_value"] }
serde_with = "2.0"
signal-hook = "0.3.10"
svg_fmt = "0.4.1"
tokio-tar = { git = "https://github.com/neondatabase/tokio-tar.git", rev="404df61437de0feef49ba2ccdbdd94eb8ad6e142" }
thiserror = "1.0"
tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
tokio-util = { version = "0.7.3", features = ["io", "io-util"] }
toml_edit = { version = "0.14", features = ["easy"] }
tracing = "0.1.36"
url = "2"
walkdir = "2.3.2"
metrics = { path = "../libs/metrics" }
pageserver_api = { path = "../libs/pageserver_api" }
postgres_connection = { path = "../libs/postgres_connection" }
postgres_ffi = { path = "../libs/postgres_ffi" }
pq_proto = { path = "../libs/pq_proto" }
remote_storage = { path = "../libs/remote_storage" }
storage_broker = { version = "0.1", path = "../storage_broker" }
tenant_size_model = { path = "../libs/tenant_size_model" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] }
amplify_num.workspace = true
anyhow.workspace = true
async-stream.workspace = true
async-trait.workspace = true
byteorder.workspace = true
bytes.workspace = true
chrono = { workspace = true, features = ["serde"] }
clap = { workspace = true, features = ["string"] }
close_fds.workspace = true
const_format.workspace = true
consumption_metrics.workspace = true
crc32c.workspace = true
crossbeam-utils.workspace = true
fail.workspace = true
futures.workspace = true
git-version.workspace = true
hex.workspace = true
humantime.workspace = true
humantime-serde.workspace = true
hyper.workspace = true
itertools.workspace = true
nix.workspace = true
num-traits.workspace = true
once_cell.workspace = true
pin-project-lite.workspace = true
postgres.workspace = true
postgres-protocol.workspace = true
postgres-types.workspace = true
rand.workspace = true
regex.workspace = true
rstar.workspace = true
scopeguard.workspace = true
serde.workspace = true
serde_json = { workspace = true, features = ["raw_value"] }
serde_with.workspace = true
signal-hook.workspace = true
svg_fmt.workspace = true
tokio-tar.workspace = true
thiserror.workspace = true
tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util", "time"] }
tokio-postgres.workspace = true
tokio-util.workspace = true
toml_edit.workspace = true
tracing.workspace = true
url.workspace = true
walkdir.workspace = true
metrics.workspace = true
pageserver_api.workspace = true
postgres_connection.workspace = true
postgres_ffi.workspace = true
pq_proto.workspace = true
remote_storage.workspace = true
storage_broker.workspace = true
tenant_size_model.workspace = true
utils.workspace = true
workspace_hack.workspace = true
reqwest.workspace = true
[dev-dependencies]
criterion = "0.4"
hex-literal = "0.3"
tempfile = "3.2"
criterion.workspace = true
hex-literal.workspace = true
tempfile.workspace = true
[[bench]]
name = "bench_layer_map"

View File

@@ -30,33 +30,44 @@ fn redo_scenarios(c: &mut Criterion) {
let conf = PageServerConf::dummy_conf(repo_dir.path().to_path_buf());
let conf = Box::leak(Box::new(conf));
let tenant_id = TenantId::generate();
// std::fs::create_dir_all(conf.tenant_path(&tenant_id)).unwrap();
let mut manager = PostgresRedoManager::new(conf, tenant_id);
manager.launch_process(14).unwrap();
let manager = PostgresRedoManager::new(conf, tenant_id);
let manager = Arc::new(manager);
tracing::info!("executing first");
short().execute(&manager).unwrap();
tracing::info!("first executed");
let thread_counts = [1, 2, 4, 8, 16];
for thread_count in thread_counts {
c.bench_with_input(
BenchmarkId::new("short-50record", thread_count),
&thread_count,
|b, thread_count| {
add_multithreaded_walredo_requesters(b, *thread_count, &manager, short, 50);
},
);
}
let mut group = c.benchmark_group("short");
group.sampling_mode(criterion::SamplingMode::Flat);
for thread_count in thread_counts {
c.bench_with_input(
BenchmarkId::new("medium-10record", thread_count),
group.bench_with_input(
BenchmarkId::new("short", thread_count),
&thread_count,
|b, thread_count| {
add_multithreaded_walredo_requesters(b, *thread_count, &manager, medium, 10);
add_multithreaded_walredo_requesters(b, *thread_count, &manager, short);
},
);
}
drop(group);
let mut group = c.benchmark_group("medium");
group.sampling_mode(criterion::SamplingMode::Flat);
for thread_count in thread_counts {
group.bench_with_input(
BenchmarkId::new("medium", thread_count),
&thread_count,
|b, thread_count| {
add_multithreaded_walredo_requesters(b, *thread_count, &manager, medium);
},
);
}
drop(group);
}
/// Sets up `threads` number of requesters to `request_redo`, with the given input.
@@ -65,46 +76,66 @@ fn add_multithreaded_walredo_requesters(
threads: u32,
manager: &Arc<PostgresRedoManager>,
input_factory: fn() -> Request,
request_repeats: usize,
) {
b.iter_batched_ref(
|| {
// barrier for all of the threads, and the benchmarked thread
let barrier = Arc::new(Barrier::new(threads as usize + 1));
assert_ne!(threads, 0);
let jhs = (0..threads)
.map(|_| {
std::thread::spawn({
let manager = manager.clone();
let barrier = barrier.clone();
move || {
let input = std::iter::repeat(input_factory())
.take(request_repeats)
.collect::<Vec<_>>();
if threads == 1 {
b.iter_batched_ref(
|| Some(input_factory()),
|input| execute_all(input.take(), manager),
criterion::BatchSize::PerIteration,
);
} else {
let (work_tx, work_rx) = std::sync::mpsc::sync_channel(threads as usize);
barrier.wait();
let work_rx = std::sync::Arc::new(std::sync::Mutex::new(work_rx));
execute_all(input, &manager).unwrap();
let barrier = Arc::new(Barrier::new(threads as usize + 1));
barrier.wait();
let jhs = (0..threads)
.map(|_| {
std::thread::spawn({
let manager = manager.clone();
let barrier = barrier.clone();
let work_rx = work_rx.clone();
move || loop {
// queue up and wait if we want to go another round
if work_rx.lock().unwrap().recv().is_err() {
break;
}
})
let input = Some(input_factory());
barrier.wait();
execute_all(input, &manager).unwrap();
barrier.wait();
}
})
.collect::<Vec<_>>();
})
.collect::<Vec<_>>();
(barrier, JoinOnDrop(jhs))
},
|input| {
let barrier = &input.0;
let _jhs = JoinOnDrop(jhs);
// start the work
barrier.wait();
b.iter_batched(
|| {
for _ in 0..threads {
work_tx.send(()).unwrap()
}
},
|()| {
// start the work
barrier.wait();
// wait for work to complete
barrier.wait();
},
criterion::BatchSize::PerIteration,
);
// wait for work to complete
barrier.wait();
},
criterion::BatchSize::PerIteration,
);
drop(work_tx);
}
}
struct JoinOnDrop(Vec<std::thread::JoinHandle<()>>);
@@ -121,7 +152,10 @@ impl Drop for JoinOnDrop {
}
}
fn execute_all(input: Vec<Request>, manager: &PostgresRedoManager) -> Result<(), WalRedoError> {
fn execute_all<I>(input: I, manager: &PostgresRedoManager) -> Result<(), WalRedoError>
where
I: IntoIterator<Item = Request>,
{
// just fire all requests as fast as possible
input.into_iter().try_for_each(|req| {
let page = req.execute(manager)?;
@@ -143,6 +177,7 @@ macro_rules! lsn {
}};
}
/// Short payload, 1132 bytes.
// pg_records are copypasted from log, where they are put with Debug impl of Bytes, which uses \0
// for null bytes.
#[allow(clippy::octal_escapes)]
@@ -172,6 +207,7 @@ fn short() -> Request {
}
}
/// Medium sized payload, serializes as 26393 bytes.
// see [`short`]
#[allow(clippy::octal_escapes)]
fn medium() -> Request {

Binary file not shown.

View File

@@ -27,7 +27,7 @@ use tracing::*;
///
use tokio_tar::{Builder, EntryType, Header};
use crate::tenant::{with_ondemand_download, Timeline};
use crate::tenant::Timeline;
use pageserver_api::reltag::{RelTag, SlruKind};
use postgres_ffi::pg_constants::{DEFAULTTABLESPACE_OID, GLOBALTABLESPACE_OID};
@@ -38,6 +38,14 @@ use postgres_ffi::PG_TLI;
use postgres_ffi::{BLCKSZ, RELSEG_SIZE, WAL_SEGMENT_SIZE};
use utils::lsn::Lsn;
/// Create basebackup with non-rel data in it.
/// Only include relational data if 'full_backup' is true.
///
/// Currently we use empty 'req_lsn' in two cases:
/// * During the basebackup right after timeline creation
/// * When working without safekeepers. In this situation it is important to match the lsn
/// we are taking basebackup on with the lsn that is used in pageserver's walreceiver
/// to start the replication.
pub async fn send_basebackup_tarball<'a, W>(
write: &'a mut W,
timeline: &'a Timeline,
@@ -123,14 +131,6 @@ where
full_backup: bool,
}
// Create basebackup with non-rel data in it.
// Only include relational data if 'full_backup' is true.
//
// Currently we use empty lsn in two cases:
// * During the basebackup right after timeline creation
// * When working without safekeepers. In this situation it is important to match the lsn
// we are taking basebackup on with the lsn that is used in pageserver's walreceiver
// to start the replication.
impl<'a, W> Basebackup<'a, W>
where
W: AsyncWrite + Send + Sync + Unpin,
@@ -171,30 +171,23 @@ where
SlruKind::MultiXactOffsets,
SlruKind::MultiXactMembers,
] {
for segno in
with_ondemand_download(|| self.timeline.list_slru_segments(kind, self.lsn)).await?
{
for segno in self.timeline.list_slru_segments(kind, self.lsn).await? {
self.add_slru_segment(kind, segno).await?;
}
}
// Create tablespace directories
for ((spcnode, dbnode), has_relmap_file) in
with_ondemand_download(|| self.timeline.list_dbdirs(self.lsn)).await?
{
for ((spcnode, dbnode), has_relmap_file) in self.timeline.list_dbdirs(self.lsn).await? {
self.add_dbdir(spcnode, dbnode, has_relmap_file).await?;
// Gather and send relational files in each database if full backup is requested.
if self.full_backup {
for rel in
with_ondemand_download(|| self.timeline.list_rels(spcnode, dbnode, self.lsn))
.await?
{
for rel in self.timeline.list_rels(spcnode, dbnode, self.lsn).await? {
self.add_rel(rel).await?;
}
}
}
for xid in with_ondemand_download(|| self.timeline.list_twophase_files(self.lsn)).await? {
for xid in self.timeline.list_twophase_files(self.lsn).await? {
self.add_twophase_file(xid).await?;
}
@@ -210,8 +203,7 @@ where
}
async fn add_rel(&mut self, tag: RelTag) -> anyhow::Result<()> {
let nblocks =
with_ondemand_download(|| self.timeline.get_rel_size(tag, self.lsn, false)).await?;
let nblocks = self.timeline.get_rel_size(tag, self.lsn, false).await?;
// If the relation is empty, create an empty file
if nblocks == 0 {
@@ -229,11 +221,10 @@ where
let mut segment_data: Vec<u8> = vec![];
for blknum in startblk..endblk {
let img = with_ondemand_download(|| {
self.timeline
.get_rel_page_at_lsn(tag, blknum, self.lsn, false)
})
.await?;
let img = self
.timeline
.get_rel_page_at_lsn(tag, blknum, self.lsn, false)
.await?;
segment_data.extend_from_slice(&img[..]);
}
@@ -252,17 +243,17 @@ where
// Generate SLRU segment files from repository.
//
async fn add_slru_segment(&mut self, slru: SlruKind, segno: u32) -> anyhow::Result<()> {
let nblocks =
with_ondemand_download(|| self.timeline.get_slru_segment_size(slru, segno, self.lsn))
.await?;
let nblocks = self
.timeline
.get_slru_segment_size(slru, segno, self.lsn)
.await?;
let mut slru_buf: Vec<u8> = Vec::with_capacity(nblocks as usize * BLCKSZ as usize);
for blknum in 0..nblocks {
let img = with_ondemand_download(|| {
self.timeline
.get_slru_page_at_lsn(slru, segno, blknum, self.lsn)
})
.await?;
let img = self
.timeline
.get_slru_page_at_lsn(slru, segno, blknum, self.lsn)
.await?;
if slru == SlruKind::Clog {
ensure!(img.len() == BLCKSZ as usize || img.len() == BLCKSZ as usize + 8);
@@ -294,9 +285,10 @@ where
has_relmap_file: bool,
) -> anyhow::Result<()> {
let relmap_img = if has_relmap_file {
let img =
with_ondemand_download(|| self.timeline.get_relmap_file(spcnode, dbnode, self.lsn))
.await?;
let img = self
.timeline
.get_relmap_file(spcnode, dbnode, self.lsn)
.await?;
ensure!(img.len() == 512);
Some(img)
} else {
@@ -329,7 +321,9 @@ where
// XLOG_TBLSPC_DROP records. But we probably should just
// throw an error on CREATE TABLESPACE in the first place.
if !has_relmap_file
&& with_ondemand_download(|| self.timeline.list_rels(spcnode, dbnode, self.lsn))
&& self
.timeline
.list_rels(spcnode, dbnode, self.lsn)
.await?
.is_empty()
{
@@ -362,7 +356,7 @@ where
// Extract twophase state files
//
async fn add_twophase_file(&mut self, xid: TransactionId) -> anyhow::Result<()> {
let img = with_ondemand_download(|| self.timeline.get_twophase_file(xid, self.lsn)).await?;
let img = self.timeline.get_twophase_file(xid, self.lsn).await?;
let mut buf = BytesMut::new();
buf.extend_from_slice(&img[..]);
@@ -398,10 +392,14 @@ where
)
.await?;
let checkpoint_bytes = with_ondemand_download(|| self.timeline.get_checkpoint(self.lsn))
let checkpoint_bytes = self
.timeline
.get_checkpoint(self.lsn)
.await
.context("failed to get checkpoint bytes")?;
let pg_control_bytes = with_ondemand_download(|| self.timeline.get_control_file(self.lsn))
let pg_control_bytes = self
.timeline
.get_control_file(self.lsn)
.await
.context("failed get control bytes")?;

View File

@@ -336,6 +336,7 @@ fn start_pageserver(conf: &'static PageServerConf) -> anyhow::Result<()> {
pageserver::consumption_metrics::collect_metrics(
metric_collection_endpoint,
conf.metric_collection_interval,
conf.synthetic_size_calculation_interval,
conf.id,
)
.instrument(info_span!("metrics_collection"))

View File

@@ -59,6 +59,8 @@ pub mod defaults {
pub const DEFAULT_METRIC_COLLECTION_INTERVAL: &str = "10 min";
pub const DEFAULT_METRIC_COLLECTION_ENDPOINT: Option<reqwest::Url> = None;
pub const DEFAULT_SYNTHETIC_SIZE_CALCULATION_INTERVAL: &str = "10 min";
///
/// Default built-in configuration file.
///
@@ -83,6 +85,7 @@ pub mod defaults {
#concurrent_tenant_size_logical_size_queries = '{DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES}'
#metric_collection_interval = '{DEFAULT_METRIC_COLLECTION_INTERVAL}'
#synthetic_size_calculation_interval = '{DEFAULT_SYNTHETIC_SIZE_CALCULATION_INTERVAL}'
# [tenant_config]
#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
@@ -152,6 +155,7 @@ pub struct PageServerConf {
// How often to collect metrics and send them to the metrics endpoint.
pub metric_collection_interval: Duration,
pub metric_collection_endpoint: Option<Url>,
pub synthetic_size_calculation_interval: Duration,
pub test_remote_failures: u64,
}
@@ -215,6 +219,7 @@ struct PageServerConfigBuilder {
metric_collection_interval: BuilderValue<Duration>,
metric_collection_endpoint: BuilderValue<Option<Url>>,
synthetic_size_calculation_interval: BuilderValue<Duration>,
test_remote_failures: BuilderValue<u64>,
}
@@ -255,6 +260,10 @@ impl Default for PageServerConfigBuilder {
DEFAULT_METRIC_COLLECTION_INTERVAL,
)
.expect("cannot parse default metric collection interval")),
synthetic_size_calculation_interval: Set(humantime::parse_duration(
DEFAULT_SYNTHETIC_SIZE_CALCULATION_INTERVAL,
)
.expect("cannot parse default synthetic size calculation interval")),
metric_collection_endpoint: Set(DEFAULT_METRIC_COLLECTION_ENDPOINT),
test_remote_failures: Set(0),
@@ -342,6 +351,14 @@ impl PageServerConfigBuilder {
self.metric_collection_endpoint = BuilderValue::Set(metric_collection_endpoint)
}
pub fn synthetic_size_calculation_interval(
&mut self,
synthetic_size_calculation_interval: Duration,
) {
self.synthetic_size_calculation_interval =
BuilderValue::Set(synthetic_size_calculation_interval)
}
pub fn test_remote_failures(&mut self, fail_first: u64) {
self.test_remote_failures = BuilderValue::Set(fail_first);
}
@@ -399,6 +416,9 @@ impl PageServerConfigBuilder {
metric_collection_endpoint: self
.metric_collection_endpoint
.ok_or(anyhow!("missing metric_collection_endpoint"))?,
synthetic_size_calculation_interval: self
.synthetic_size_calculation_interval
.ok_or(anyhow!("missing synthetic_size_calculation_interval"))?,
test_remote_failures: self
.test_remote_failures
.ok_or(anyhow!("missing test_remote_failuers"))?,
@@ -577,7 +597,8 @@ impl PageServerConf {
let endpoint = parse_toml_string(key, item)?.parse().context("failed to parse metric_collection_endpoint")?;
builder.metric_collection_endpoint(Some(endpoint));
},
"synthetic_size_calculation_interval" =>
builder.synthetic_size_calculation_interval(parse_toml_duration(key, item)?),
"test_remote_failures" => builder.test_remote_failures(parse_toml_u64(key, item)?),
_ => bail!("unrecognized pageserver option '{key}'"),
}
@@ -672,11 +693,6 @@ impl PageServerConf {
Ok(t_conf)
}
#[cfg(test)]
pub fn test_repo_dir(test_name: &str) -> PathBuf {
PathBuf::from(format!("../tmp_check/test_{test_name}"))
}
pub fn dummy_conf(repo_dir: PathBuf) -> Self {
let pg_distrib_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("../pg_install");
@@ -701,6 +717,7 @@ impl PageServerConf {
concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
metric_collection_interval: Duration::from_secs(60),
metric_collection_endpoint: defaults::DEFAULT_METRIC_COLLECTION_ENDPOINT,
synthetic_size_calculation_interval: Duration::from_secs(60),
test_remote_failures: 0,
}
}
@@ -834,6 +851,7 @@ id = 10
metric_collection_interval = '222 s'
metric_collection_endpoint = 'http://localhost:80/metrics'
synthetic_size_calculation_interval = '333 s'
log_format = 'json'
"#;
@@ -880,6 +898,9 @@ log_format = 'json'
defaults::DEFAULT_METRIC_COLLECTION_INTERVAL
)?,
metric_collection_endpoint: defaults::DEFAULT_METRIC_COLLECTION_ENDPOINT,
synthetic_size_calculation_interval: humantime::parse_duration(
defaults::DEFAULT_SYNTHETIC_SIZE_CALCULATION_INTERVAL
)?,
test_remote_failures: 0,
},
"Correct defaults should be used when no config values are provided"
@@ -926,6 +947,7 @@ log_format = 'json'
concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
metric_collection_interval: Duration::from_secs(222),
metric_collection_endpoint: Some(Url::parse("http://localhost:80/metrics")?),
synthetic_size_calculation_interval: Duration::from_secs(333),
test_remote_failures: 0,
},
"Should be able to parse all basic config values correctly"

View File

@@ -3,154 +3,74 @@
//! and push them to a HTTP endpoint.
//! Cache metrics to send only the updated ones.
//!
use anyhow;
use tracing::*;
use utils::id::NodeId;
use utils::id::TimelineId;
use crate::task_mgr;
use crate::task_mgr::{self, TaskKind, BACKGROUND_RUNTIME};
use crate::tenant::mgr;
use anyhow;
use chrono::Utc;
use consumption_metrics::{idempotency_key, Event, EventChunk, EventType, CHUNK_SIZE};
use pageserver_api::models::TenantState;
use utils::id::TenantId;
use serde::{Deserialize, Serialize};
use reqwest::Url;
use serde::Serialize;
use serde_with::{serde_as, DisplayFromStr};
use std::collections::HashMap;
use std::fmt;
use std::str::FromStr;
use std::time::Duration;
use tracing::*;
use utils::id::{NodeId, TenantId, TimelineId};
use chrono::{DateTime, Utc};
use rand::Rng;
use reqwest::Url;
const WRITTEN_SIZE: &str = "written_size";
const SYNTHETIC_STORAGE_SIZE: &str = "synthetic_storage_size";
const RESIDENT_SIZE: &str = "resident_size";
const REMOTE_STORAGE_SIZE: &str = "remote_storage_size";
const TIMELINE_LOGICAL_SIZE: &str = "timeline_logical_size";
/// ConsumptionMetric struct that defines the format for one metric entry
/// i.e.
///
/// ```json
/// {
/// "metric": "remote_storage_size",
/// "type": "absolute",
/// "tenant_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
/// "timeline_id": "a03ebb4f5922a1c56ff7485cc8854143",
/// "time": "2022-12-28T11:07:19.317310284Z",
/// "idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
/// "value": 12345454,
/// }
/// ```
#[serde_as]
#[derive(Serialize, Deserialize, Debug, Clone, Eq, PartialEq, Ord, PartialOrd)]
pub struct ConsumptionMetric {
pub metric: ConsumptionMetricKind,
#[serde(rename = "type")]
pub metric_type: &'static str,
#[derive(Serialize)]
struct Ids {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
tenant_id: TenantId,
#[serde_as(as = "Option<DisplayFromStr>")]
#[serde(skip_serializing_if = "Option::is_none")]
pub timeline_id: Option<TimelineId>,
pub time: DateTime<Utc>,
pub idempotency_key: String,
pub value: u64,
}
impl ConsumptionMetric {
pub fn new_absolute<R: Rng + ?Sized>(
metric: ConsumptionMetricKind,
tenant_id: TenantId,
timeline_id: Option<TimelineId>,
value: u64,
node_id: NodeId,
rng: &mut R,
) -> Self {
Self {
metric,
metric_type: "absolute",
tenant_id,
timeline_id,
time: Utc::now(),
// key that allows metric collector to distinguish unique events
idempotency_key: format!("{}-{}-{:04}", Utc::now(), node_id, rng.gen_range(0..=9999)),
value,
}
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Ord, PartialOrd, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ConsumptionMetricKind {
/// Amount of WAL produced , by a timeline, i.e. last_record_lsn
/// This is an absolute, per-timeline metric.
WrittenSize,
/// Size of all tenant branches including WAL
/// This is an absolute, per-tenant metric.
/// This is the same metric that tenant/tenant_id/size endpoint returns.
SyntheticStorageSize,
/// Size of all the layer files in the tenant's directory on disk on the pageserver.
/// This is an absolute, per-tenant metric.
/// See also prometheus metric RESIDENT_PHYSICAL_SIZE.
ResidentSize,
/// Size of the remote storage (S3) directory.
/// This is an absolute, per-tenant metric.
RemoteStorageSize,
/// Logical size of the data in the timeline
/// This is an absolute, per-timeline metric
TimelineLogicalSize,
}
impl FromStr for ConsumptionMetricKind {
type Err = anyhow::Error;
fn from_str(s: &str) -> Result<Self, Self::Err> {
match s {
"written_size" => Ok(Self::WrittenSize),
"synthetic_storage_size" => Ok(Self::SyntheticStorageSize),
"resident_size" => Ok(Self::ResidentSize),
"remote_storage_size" => Ok(Self::RemoteStorageSize),
"timeline_logical_size" => Ok(Self::TimelineLogicalSize),
_ => anyhow::bail!("invalid value \"{s}\" for metric type"),
}
}
}
impl fmt::Display for ConsumptionMetricKind {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.write_str(match self {
ConsumptionMetricKind::WrittenSize => "written_size",
ConsumptionMetricKind::SyntheticStorageSize => "synthetic_storage_size",
ConsumptionMetricKind::ResidentSize => "resident_size",
ConsumptionMetricKind::RemoteStorageSize => "remote_storage_size",
ConsumptionMetricKind::TimelineLogicalSize => "timeline_logical_size",
})
}
}
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ConsumptionMetricsKey {
tenant_id: TenantId,
timeline_id: Option<TimelineId>,
metric: ConsumptionMetricKind,
}
#[derive(serde::Serialize)]
struct EventChunk<'a> {
events: &'a [ConsumptionMetric],
/// Key that uniquely identifies the object, this metric describes.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PageserverConsumptionMetricsKey {
pub tenant_id: TenantId,
pub timeline_id: Option<TimelineId>,
pub metric: &'static str,
}
/// Main thread that serves metrics collection
pub async fn collect_metrics(
metric_collection_endpoint: &Url,
metric_collection_interval: Duration,
synthetic_size_calculation_interval: Duration,
node_id: NodeId,
) -> anyhow::Result<()> {
let mut ticker = tokio::time::interval(metric_collection_interval);
info!("starting collect_metrics");
// spin up background worker that caclulates tenant sizes
task_mgr::spawn(
BACKGROUND_RUNTIME.handle(),
TaskKind::CalculateSyntheticSize,
None,
None,
"synthetic size calculation",
true,
async move {
calculate_synthetic_size_worker(synthetic_size_calculation_interval)
.instrument(info_span!("synthetic_size_worker"))
.await?;
Ok(())
},
);
// define client here to reuse it for all requests
let client = reqwest::Client::new();
let mut cached_metrics: HashMap<ConsumptionMetricsKey, u64> = HashMap::new();
let mut cached_metrics: HashMap<PageserverConsumptionMetricsKey, u64> = HashMap::new();
loop {
tokio::select! {
@@ -159,7 +79,10 @@ pub async fn collect_metrics(
return Ok(());
},
_ = ticker.tick() => {
collect_metrics_task(&client, &mut cached_metrics, metric_collection_endpoint, node_id).await?;
if let Err(err) = collect_metrics_iteration(&client, &mut cached_metrics, metric_collection_endpoint, node_id).await
{
error!("metrics collection failed: {err:?}");
}
}
}
}
@@ -169,15 +92,20 @@ pub async fn collect_metrics(
///
/// Gather per-tenant and per-timeline metrics and send them to the `metric_collection_endpoint`.
/// Cache metrics to avoid sending the same metrics multiple times.
pub async fn collect_metrics_task(
///
/// TODO
/// - refactor this function (chunking+sending part) to reuse it in proxy module;
/// - improve error handling. Now if one tenant fails to collect metrics,
/// the whole iteration fails and metrics for other tenants are not collected.
pub async fn collect_metrics_iteration(
client: &reqwest::Client,
cached_metrics: &mut HashMap<ConsumptionMetricsKey, u64>,
cached_metrics: &mut HashMap<PageserverConsumptionMetricsKey, u64>,
metric_collection_endpoint: &reqwest::Url,
node_id: NodeId,
) -> anyhow::Result<()> {
let mut current_metrics: Vec<(ConsumptionMetricsKey, u64)> = Vec::new();
let mut current_metrics: Vec<(PageserverConsumptionMetricsKey, u64)> = Vec::new();
trace!(
"starting collect_metrics_task. metric_collection_endpoint: {}",
"starting collect_metrics_iteration. metric_collection_endpoint: {}",
metric_collection_endpoint
);
@@ -201,10 +129,10 @@ pub async fn collect_metrics_task(
let timeline_written_size = u64::from(timeline.get_last_record_lsn());
current_metrics.push((
ConsumptionMetricsKey {
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: Some(timeline.timeline_id),
metric: ConsumptionMetricKind::WrittenSize,
metric: WRITTEN_SIZE,
},
timeline_written_size,
));
@@ -213,10 +141,10 @@ pub async fn collect_metrics_task(
// Only send timeline logical size when it is fully calculated.
if is_exact {
current_metrics.push((
ConsumptionMetricsKey {
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: Some(timeline.timeline_id),
metric: ConsumptionMetricKind::TimelineLogicalSize,
metric: TIMELINE_LOGICAL_SIZE,
},
timeline_logical_size,
));
@@ -234,24 +162,34 @@ pub async fn collect_metrics_task(
);
current_metrics.push((
ConsumptionMetricsKey {
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: None,
metric: ConsumptionMetricKind::ResidentSize,
metric: RESIDENT_SIZE,
},
tenant_resident_size,
));
current_metrics.push((
ConsumptionMetricsKey {
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: None,
metric: ConsumptionMetricKind::RemoteStorageSize,
metric: REMOTE_STORAGE_SIZE,
},
tenant_remote_size,
));
// TODO add SyntheticStorageSize metric
// Note that this metric is calculated in a separate bgworker
// Here we only use cached value, which may lag behind the real latest one
let tenant_synthetic_size = tenant.get_cached_synthetic_size();
current_metrics.push((
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: None,
metric: SYNTHETIC_STORAGE_SIZE,
},
tenant_synthetic_size,
));
}
// Filter metrics
@@ -267,35 +205,29 @@ pub async fn collect_metrics_task(
// Send metrics.
// Split into chunks of 1000 metrics to avoid exceeding the max request size
const CHUNK_SIZE: usize = 1000;
let chunks = current_metrics.chunks(CHUNK_SIZE);
let mut chunk_to_send: Vec<ConsumptionMetric> = Vec::with_capacity(1000);
let mut chunk_to_send: Vec<Event<Ids>> = Vec::with_capacity(CHUNK_SIZE);
for chunk in chunks {
chunk_to_send.clear();
// this code block is needed to convince compiler
// that rng is not reused aroung await point
{
// enrich metrics with timestamp and metric_kind before sending
let mut rng = rand::thread_rng();
chunk_to_send.extend(chunk.iter().map(|(curr_key, curr_val)| {
ConsumptionMetric::new_absolute(
curr_key.metric,
curr_key.tenant_id,
curr_key.timeline_id,
*curr_val,
node_id,
&mut rng,
)
}));
}
// enrich metrics with type,timestamp and idempotency key before sending
chunk_to_send.extend(chunk.iter().map(|(curr_key, curr_val)| Event {
kind: EventType::Absolute { time: Utc::now() },
metric: curr_key.metric,
idempotency_key: idempotency_key(node_id.to_string()),
value: *curr_val,
extra: Ids {
tenant_id: curr_key.tenant_id,
timeline_id: curr_key.timeline_id,
},
}));
let chunk_json = serde_json::value::to_raw_value(&EventChunk {
events: &chunk_to_send,
})
.expect("ConsumptionMetric should not fail serialization");
.expect("PageserverConsumptionMetric should not fail serialization");
let res = client
.post(metric_collection_endpoint.clone())
@@ -322,3 +254,39 @@ pub async fn collect_metrics_task(
Ok(())
}
/// Caclculate synthetic size for each active tenant
pub async fn calculate_synthetic_size_worker(
synthetic_size_calculation_interval: Duration,
) -> anyhow::Result<()> {
info!("starting calculate_synthetic_size_worker");
let mut ticker = tokio::time::interval(synthetic_size_calculation_interval);
loop {
tokio::select! {
_ = task_mgr::shutdown_watcher() => {
return Ok(());
},
_ = ticker.tick() => {
let tenants = mgr::list_tenants().await;
// iterate through list of Active tenants and collect metrics
for (tenant_id, tenant_state) in tenants {
if tenant_state != TenantState::Active {
continue;
}
if let Ok(tenant) = mgr::get_tenant(tenant_id, true).await
{
if let Err(e) = tenant.calculate_synthetic_size().await {
error!("failed to calculate synthetic size for tenant {}: {}", tenant_id, e);
}
}
}
}
}
}
}

View File

@@ -3,6 +3,7 @@ use std::sync::Arc;
use anyhow::{anyhow, Context, Result};
use hyper::StatusCode;
use hyper::{Body, Request, Response, Uri};
use pageserver_api::models::DownloadRemoteLayersTaskSpawnRequest;
use remote_storage::GenericRemoteStorage;
use tokio_util::sync::CancellationToken;
use tracing::*;
@@ -13,7 +14,7 @@ use super::models::{
};
use crate::pgdatadir_mapping::LsnForTimestamp;
use crate::tenant::config::TenantConfOpt;
use crate::tenant::{with_ondemand_download, Timeline};
use crate::tenant::{PageReconstructError, Timeline};
use crate::{config::PageServerConf, tenant::mgr};
use utils::{
auth::JwtAuth,
@@ -77,6 +78,15 @@ fn check_permission(request: &Request<Body>, tenant_id: Option<TenantId>) -> Res
})
}
fn apierror_from_prerror(err: PageReconstructError) -> ApiError {
match err {
PageReconstructError::Other(err) => ApiError::InternalServerError(err),
PageReconstructError::WalRedo(err) => {
ApiError::InternalServerError(anyhow::Error::new(err))
}
}
}
// Helper function to construct a TimelineInfo struct for a timeline
async fn build_timeline_info(
timeline: &Arc<Timeline>,
@@ -298,9 +308,10 @@ async fn get_lsn_by_timestamp_handler(request: Request<Body>) -> Result<Response
.await
.and_then(|tenant| tenant.get_timeline(timeline_id, true))
.map_err(ApiError::NotFound)?;
let result = with_ondemand_download(|| timeline.find_lsn_for_timestamp(timestamp_pg))
let result = timeline
.find_lsn_for_timestamp(timestamp_pg)
.await
.map_err(ApiError::InternalServerError)?;
.map_err(apierror_from_prerror)?;
let result = match result {
LsnForTimestamp::Present(lsn) => format!("{lsn}"),
@@ -585,7 +596,7 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
// is Active when this function returns.
if let res @ Err(_) = tenant.wait_to_become_active().await {
// This shouldn't happen because we just created the tenant directory
// in tenant_mgr::create_tenant, and there aren't any remote timelines
// in tenant::mgr::create_tenant, and there aren't any remote timelines
// to load, so, nothing can really fail during load.
// Don't do cleanup because we don't know how we got here.
// The tenant will likely be in `Broken` state and subsequent
@@ -738,17 +749,17 @@ async fn timeline_compact_handler(request: Request<Body>) -> Result<Response<Bod
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
check_permission(&request, Some(tenant_id))?;
let tenant = mgr::get_tenant(tenant_id, true)
.await
.map_err(ApiError::NotFound)?;
let timeline = tenant
.get_timeline(timeline_id, true)
.map_err(ApiError::NotFound)?;
timeline
.compact()
let result_receiver = mgr::immediate_compact(tenant_id, timeline_id)
.await
.context("spawn compaction task")
.map_err(ApiError::InternalServerError)?;
let result: anyhow::Result<()> = result_receiver
.await
.context("receive compaction result")
.map_err(ApiError::InternalServerError)?;
result.map_err(ApiError::InternalServerError)?;
json_response(StatusCode::OK, ())
}
@@ -778,10 +789,11 @@ async fn timeline_checkpoint_handler(request: Request<Body>) -> Result<Response<
}
async fn timeline_download_remote_layers_handler_post(
request: Request<Body>,
mut request: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
let body: DownloadRemoteLayersTaskSpawnRequest = json_request(&mut request).await?;
check_permission(&request, Some(tenant_id))?;
let tenant = mgr::get_tenant(tenant_id, true)
@@ -790,7 +802,7 @@ async fn timeline_download_remote_layers_handler_post(
let timeline = tenant
.get_timeline(timeline_id, true)
.map_err(ApiError::NotFound)?;
match timeline.spawn_download_all_remote_layers().await {
match timeline.spawn_download_all_remote_layers(body).await {
Ok(st) => json_response(StatusCode::ACCEPTED, st),
Err(st) => json_response(StatusCode::CONFLICT, st),
}

View File

@@ -143,7 +143,11 @@ async fn import_rel(
// Call put_rel_creation for every segment of the relation,
// because there is no guarantee about the order in which we are processing segments.
// ignore "relation already exists" error
if let Err(e) = modification.put_rel_creation(rel, nblocks as u32) {
//
// FIXME: use proper error type for this, instead of parsing the error message.
// Or better yet, keep track of which relations we've already created
// https://github.com/neondatabase/neon/issues/3309
if let Err(e) = modification.put_rel_creation(rel, nblocks as u32).await {
if e.to_string().contains("already exists") {
debug!("relation {} already exists. we must be extending it", rel);
} else {
@@ -178,7 +182,7 @@ async fn import_rel(
//
// If we process rel segments out of order,
// put_rel_extend will skip the update.
modification.put_rel_extend(rel, blknum)?;
modification.put_rel_extend(rel, blknum).await?;
Ok(())
}
@@ -206,7 +210,9 @@ async fn import_slru(
ensure!(nblocks <= pg_constants::SLRU_PAGES_PER_SEGMENT as usize);
modification.put_slru_segment_creation(slru, segno, nblocks as u32)?;
modification
.put_slru_segment_creation(slru, segno, nblocks as u32)
.await?;
let mut rpageno = 0;
loop {
@@ -492,7 +498,7 @@ async fn import_file(
}
"pg_filenode.map" => {
let bytes = read_all_bytes(reader).await?;
modification.put_relmap_file(spcnode, dbnode, bytes)?;
modification.put_relmap_file(spcnode, dbnode, bytes).await?;
debug!("imported relmap file")
}
"PG_VERSION" => {
@@ -515,7 +521,7 @@ async fn import_file(
match file_name.as_ref() {
"pg_filenode.map" => {
let bytes = read_all_bytes(reader).await?;
modification.put_relmap_file(spcnode, dbnode, bytes)?;
modification.put_relmap_file(spcnode, dbnode, bytes).await?;
debug!("imported relmap file")
}
"PG_VERSION" => {
@@ -545,7 +551,9 @@ async fn import_file(
let xid = u32::from_str_radix(file_name.as_ref(), 16)?;
let bytes = read_all_bytes(reader).await?;
modification.put_twophase_file(xid, Bytes::copy_from_slice(&bytes[..]))?;
modification
.put_twophase_file(xid, Bytes::copy_from_slice(&bytes[..]))
.await?;
debug!("imported twophase file");
} else if file_path.starts_with("pg_wal") {
debug!("found wal file in base section. ignore it");

View File

@@ -209,15 +209,34 @@ pub static NUM_ONDISK_LAYERS: Lazy<IntGauge> = Lazy::new(|| {
// remote storage metrics
static REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS: Lazy<IntGaugeVec> = Lazy::new(|| {
/// NB: increment _after_ recording the current value into [`REMOTE_TIMELINE_CLIENT_CALLS_STARTED_HIST`].
static REMOTE_TIMELINE_CLIENT_CALLS_UNFINISHED_GAUGE: Lazy<IntGaugeVec> = Lazy::new(|| {
register_int_gauge_vec!(
"pageserver_remote_upload_queue_unfinished_tasks",
"Number of tasks in the upload queue that are not finished yet.",
"pageserver_remote_timeline_client_calls_unfinished",
"Number of ongoing calls to remote timeline client. \
Used to populate pageserver_remote_timeline_client_calls_started. \
This metric is not useful for sampling from Prometheus, but useful in tests.",
&["tenant_id", "timeline_id", "file_kind", "op_kind"],
)
.expect("failed to define a metric")
});
static REMOTE_TIMELINE_CLIENT_CALLS_STARTED_HIST: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_remote_timeline_client_calls_started",
"When calling a remote timeline client method, we record the current value \
of the calls_unfinished gauge in this histogram. Plot the histogram \
over time in a heatmap to visualize how many operations were ongoing \
at a given instant. It gives you a better idea of the queue depth \
than plotting the gauge directly, since operations may complete faster \
than the sampling interval.",
&["tenant_id", "timeline_id", "file_kind", "op_kind"],
// The calls_unfinished gauge is an integer gauge, hence we have integer buckets.
vec![0.0, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 15.0, 20.0, 40.0, 60.0, 80.0, 100.0, 500.0],
)
.expect("failed to define a metric")
});
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum RemoteOpKind {
Upload,
@@ -248,15 +267,12 @@ impl RemoteOpFileKind {
}
}
pub static REMOTE_OPERATION_KINDS: &[&str] = &["upload", "download", "delete"];
pub static REMOTE_OPERATION_FILE_KINDS: &[&str] = &["layer", "index"];
pub static REMOTE_OPERATION_STATUSES: &[&str] = &["success", "failure"];
pub static REMOTE_OPERATION_TIME: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_remote_operation_seconds",
"Time spent on remote storage operations. \
Grouped by tenant, timeline, operation_kind and status",
Grouped by tenant, timeline, operation_kind and status. \
Does not account for time spent waiting in remote timeline client's queues.",
&["tenant_id", "timeline_id", "file_kind", "op_kind", "status"]
)
.expect("failed to define a metric")
@@ -475,21 +491,6 @@ impl Drop for TimelineMetrics {
for op in SMGR_QUERY_TIME_OPERATIONS {
let _ = SMGR_QUERY_TIME.remove_label_values(&[op, tenant_id, timeline_id]);
}
let _ = REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS.remove_label_values(&[tenant_id, timeline_id]);
for file_kind in REMOTE_OPERATION_FILE_KINDS {
for op in REMOTE_OPERATION_KINDS {
for status in REMOTE_OPERATION_STATUSES {
let _ = REMOTE_OPERATION_TIME.remove_label_values(&[
tenant_id,
timeline_id,
file_kind,
op,
status,
]);
}
}
}
}
}
@@ -510,7 +511,8 @@ pub struct RemoteTimelineClientMetrics {
timeline_id: String,
remote_physical_size_gauge: Mutex<Option<UIntGauge>>,
remote_operation_time: Mutex<HashMap<(&'static str, &'static str, &'static str), Histogram>>,
unfinished_tasks: Mutex<HashMap<(&'static str, &'static str), IntGauge>>,
calls_unfinished_gauge: Mutex<HashMap<(&'static str, &'static str), IntGauge>>,
calls_started_hist: Mutex<HashMap<(&'static str, &'static str), Histogram>>,
}
impl RemoteTimelineClientMetrics {
@@ -519,7 +521,8 @@ impl RemoteTimelineClientMetrics {
tenant_id: tenant_id.to_string(),
timeline_id: timeline_id.to_string(),
remote_operation_time: Mutex::new(HashMap::default()),
unfinished_tasks: Mutex::new(HashMap::default()),
calls_unfinished_gauge: Mutex::new(HashMap::default()),
calls_started_hist: Mutex::new(HashMap::default()),
remote_physical_size_gauge: Mutex::new(None),
}
}
@@ -558,16 +561,37 @@ impl RemoteTimelineClientMetrics {
});
metric.clone()
}
pub fn unfinished_tasks(
fn calls_unfinished_gauge(
&self,
file_kind: &RemoteOpFileKind,
op_kind: &RemoteOpKind,
) -> IntGauge {
// XXX would be nice to have an upgradable RwLock
let mut guard = self.unfinished_tasks.lock().unwrap();
let mut guard = self.calls_unfinished_gauge.lock().unwrap();
let key = (file_kind.as_str(), op_kind.as_str());
let metric = guard.entry(key).or_insert_with(move || {
REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS
REMOTE_TIMELINE_CLIENT_CALLS_UNFINISHED_GAUGE
.get_metric_with_label_values(&[
&self.tenant_id.to_string(),
&self.timeline_id.to_string(),
key.0,
key.1,
])
.unwrap()
});
metric.clone()
}
fn calls_started_hist(
&self,
file_kind: &RemoteOpFileKind,
op_kind: &RemoteOpKind,
) -> Histogram {
// XXX would be nice to have an upgradable RwLock
let mut guard = self.calls_started_hist.lock().unwrap();
let key = (file_kind.as_str(), op_kind.as_str());
let metric = guard.entry(key).or_insert_with(move || {
REMOTE_TIMELINE_CLIENT_CALLS_STARTED_HIST
.get_metric_with_label_values(&[
&self.tenant_id.to_string(),
&self.timeline_id.to_string(),
@@ -580,6 +604,58 @@ impl RemoteTimelineClientMetrics {
}
}
/// See [`RemoteTimelineClientMetrics::call_begin`].
#[must_use]
pub(crate) struct RemoteTimelineClientCallMetricGuard(Option<IntGauge>);
impl RemoteTimelineClientCallMetricGuard {
/// Consume this guard object without decrementing the metric.
/// The caller vouches to do this manually, so that the prior increment of the gauge will cancel out.
pub fn will_decrement_manually(mut self) {
self.0 = None; // prevent drop() from decrementing
}
}
impl Drop for RemoteTimelineClientCallMetricGuard {
fn drop(&mut self) {
if let RemoteTimelineClientCallMetricGuard(Some(guard)) = self {
guard.dec();
}
}
}
impl RemoteTimelineClientMetrics {
/// Increment the metrics that track ongoing calls to the remote timeline client instance.
///
/// Drop the returned guard object once the operation is finished to decrement the values.
/// Or, use [`RemoteTimelineClientCallMetricGuard::will_decrement_manually`] and [`call_end`] if that
/// is more suitable.
/// Never do both.
pub(crate) fn call_begin(
&self,
file_kind: &RemoteOpFileKind,
op_kind: &RemoteOpKind,
) -> RemoteTimelineClientCallMetricGuard {
let unfinished_metric = self.calls_unfinished_gauge(file_kind, op_kind);
self.calls_started_hist(file_kind, op_kind)
.observe(unfinished_metric.get() as f64);
unfinished_metric.inc();
RemoteTimelineClientCallMetricGuard(Some(unfinished_metric))
}
/// Manually decrement the metric instead of using the guard object.
/// Using the guard object is generally preferable.
/// See [`call_begin`] for more context.
pub(crate) fn call_end(&self, file_kind: &RemoteOpFileKind, op_kind: &RemoteOpKind) {
let unfinished_metric = self.calls_unfinished_gauge(file_kind, op_kind);
debug_assert!(
unfinished_metric.get() > 0,
"begin and end should cancel out"
);
unfinished_metric.dec();
}
}
impl Drop for RemoteTimelineClientMetrics {
fn drop(&mut self) {
let RemoteTimelineClientMetrics {
@@ -587,13 +663,22 @@ impl Drop for RemoteTimelineClientMetrics {
timeline_id,
remote_physical_size_gauge,
remote_operation_time,
unfinished_tasks,
calls_unfinished_gauge,
calls_started_hist,
} = self;
for ((a, b, c), _) in remote_operation_time.get_mut().unwrap().drain() {
let _ = REMOTE_OPERATION_TIME.remove_label_values(&[tenant_id, timeline_id, a, b, c]);
}
for ((a, b), _) in unfinished_tasks.get_mut().unwrap().drain() {
let _ = REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS.remove_label_values(&[
for ((a, b), _) in calls_unfinished_gauge.get_mut().unwrap().drain() {
let _ = REMOTE_TIMELINE_CLIENT_CALLS_UNFINISHED_GAUGE.remove_label_values(&[
tenant_id,
timeline_id,
a,
b,
]);
}
for ((a, b), _) in calls_started_hist.get_mut().unwrap().drain() {
let _ = REMOTE_TIMELINE_CLIENT_CALLS_STARTED_HIST.remove_label_values(&[
tenant_id,
timeline_id,
a,

View File

@@ -9,7 +9,7 @@
// custom protocol.
//
use anyhow::{bail, ensure, Context, Result};
use anyhow::Context;
use bytes::Buf;
use bytes::Bytes;
use futures::{Stream, StreamExt};
@@ -19,6 +19,8 @@ use pageserver_api::models::{
PagestreamFeMessage, PagestreamGetPageRequest, PagestreamGetPageResponse,
PagestreamNblocksRequest, PagestreamNblocksResponse,
};
use pq_proto::ConnectionError;
use pq_proto::FeStartupPacket;
use pq_proto::{BeMessage, FeMessage, RowDescriptor};
use std::io;
use std::net::TcpListener;
@@ -28,6 +30,7 @@ use std::sync::Arc;
use std::time::Duration;
use tracing::*;
use utils::id::ConnectionId;
use utils::postgres_backend_async::QueryError;
use utils::{
auth::{Claims, JwtAuth, Scope},
id::{TenantId, TimelineId},
@@ -60,8 +63,8 @@ fn copyin_stream(pgb: &mut PostgresBackend) -> impl Stream<Item = io::Result<Byt
_ = task_mgr::shutdown_watcher() => {
// We were requested to shut down.
let msg = format!("pageserver is shutting down");
let _ = pgb.write_message(&BeMessage::ErrorResponse(&msg));
Err(anyhow::anyhow!(msg))
let _ = pgb.write_message(&BeMessage::ErrorResponse(&msg, None));
Err(QueryError::Other(anyhow::anyhow!(msg)))
}
msg = pgb.read_message() => { msg }
@@ -74,14 +77,15 @@ fn copyin_stream(pgb: &mut PostgresBackend) -> impl Stream<Item = io::Result<Byt
FeMessage::CopyDone => { break },
FeMessage::Sync => continue,
FeMessage::Terminate => {
let msg = format!("client terminated connection with Terminate message during COPY");
pgb.write_message(&BeMessage::ErrorResponse(&msg))?;
let msg = "client terminated connection with Terminate message during COPY";
let query_error_error = QueryError::Disconnected(ConnectionError::Socket(io::Error::new(io::ErrorKind::ConnectionReset, msg)));
pgb.write_message(&BeMessage::ErrorResponse(msg, Some(query_error_error.pg_error_code())))?;
Err(io::Error::new(io::ErrorKind::ConnectionReset, msg))?;
break;
}
m => {
let msg = format!("unexpected message {:?}", m);
pgb.write_message(&BeMessage::ErrorResponse(&msg))?;
let msg = format!("unexpected message {m:?}");
pgb.write_message(&BeMessage::ErrorResponse(&msg, None))?;
Err(io::Error::new(io::ErrorKind::Other, msg))?;
break;
}
@@ -91,12 +95,16 @@ fn copyin_stream(pgb: &mut PostgresBackend) -> impl Stream<Item = io::Result<Byt
}
Ok(None) => {
let msg = "client closed connection during COPY";
pgb.write_message(&BeMessage::ErrorResponse(msg))?;
let query_error_error = QueryError::Disconnected(ConnectionError::Socket(io::Error::new(io::ErrorKind::ConnectionReset, msg)));
pgb.write_message(&BeMessage::ErrorResponse(msg, Some(query_error_error.pg_error_code())))?;
pgb.flush().await?;
Err(io::Error::new(io::ErrorKind::ConnectionReset, msg))?;
}
Err(e) => {
Err(io::Error::new(io::ErrorKind::Other, e))?;
Err(QueryError::Disconnected(ConnectionError::Socket(io_error))) => {
Err(io_error)?;
}
Err(other) => {
Err(io::Error::new(io::ErrorKind::Other, other))?;
}
};
}
@@ -194,23 +202,19 @@ async fn page_service_conn_main(
// we've been requested to shut down
Ok(())
}
Err(err) => {
let root_cause_io_err_kind = err
.root_cause()
.downcast_ref::<io::Error>()
.map(|e| e.kind());
Err(QueryError::Disconnected(ConnectionError::Socket(io_error))) => {
// `ConnectionReset` error happens when the Postgres client closes the connection.
// As this disconnection happens quite often and is expected,
// we decided to downgrade the logging level to `INFO`.
// See: https://github.com/neondatabase/neon/issues/1683.
if root_cause_io_err_kind == Some(io::ErrorKind::ConnectionReset) {
if io_error.kind() == io::ErrorKind::ConnectionReset {
info!("Postgres client disconnected");
Ok(())
} else {
Err(err)
Err(io_error).context("Postgres connection error")
}
}
other => other.context("Postgres query error"),
}
}
@@ -247,7 +251,6 @@ impl PageRequestMetrics {
}
}
#[derive(Debug)]
struct PageServerHandler {
_conf: &'static PageServerConf,
auth: Option<Arc<JwtAuth>>,
@@ -312,7 +315,7 @@ impl PageServerHandler {
Some(FeMessage::CopyData(bytes)) => bytes,
Some(FeMessage::Terminate) => break,
Some(m) => {
bail!("unexpected message: {m:?} during COPY");
anyhow::bail!("unexpected message: {m:?} during COPY");
}
None => break, // client disconnected
};
@@ -369,7 +372,7 @@ impl PageServerHandler {
base_lsn: Lsn,
_end_lsn: Lsn,
pg_version: u32,
) -> anyhow::Result<()> {
) -> Result<(), QueryError> {
task_mgr::associate_with(Some(tenant_id), Some(timeline_id));
// Create empty timeline
info!("creating new timeline");
@@ -423,11 +426,16 @@ impl PageServerHandler {
timeline_id: TimelineId,
start_lsn: Lsn,
end_lsn: Lsn,
) -> anyhow::Result<()> {
) -> Result<(), QueryError> {
task_mgr::associate_with(Some(tenant_id), Some(timeline_id));
let timeline = get_active_timeline_with_timeout(tenant_id, timeline_id).await?;
ensure!(timeline.get_last_record_lsn() == start_lsn);
let last_record_lsn = timeline.get_last_record_lsn();
if last_record_lsn != start_lsn {
return Err(QueryError::Other(
anyhow::anyhow!("Cannot import WAL from Lsn {start_lsn} because timeline does not start from the same lsn: {last_record_lsn}"))
);
}
// TODO leave clean state on error. For now you can use detach to clean
// up broken state from a failed import.
@@ -451,7 +459,11 @@ impl PageServerHandler {
}
// TODO Does it make sense to overshoot?
ensure!(timeline.get_last_record_lsn() >= end_lsn);
if timeline.get_last_record_lsn() < end_lsn {
return Err(QueryError::Other(
anyhow::anyhow!("Cannot import WAL from Lsn {start_lsn} because timeline does not start from the same lsn: {last_record_lsn}"))
);
}
// Flush data to disk, then upload to s3. No need for a forced checkpoint.
// We only want to persist the data, and it doesn't matter if it's in the
@@ -480,7 +492,7 @@ impl PageServerHandler {
mut lsn: Lsn,
latest: bool,
latest_gc_cutoff_lsn: &RcuReadGuard<Lsn>,
) -> Result<Lsn> {
) -> anyhow::Result<Lsn> {
if latest {
// Latest page version was requested. If LSN is given, it is a hint
// to the page server that there have been no modifications to the
@@ -511,11 +523,11 @@ impl PageServerHandler {
}
} else {
if lsn == Lsn(0) {
bail!("invalid LSN(0) in request");
anyhow::bail!("invalid LSN(0) in request");
}
timeline.wait_lsn(lsn).await?;
}
ensure!(
anyhow::ensure!(
lsn >= **latest_gc_cutoff_lsn,
"tried to request a page version that was garbage collected. requested at {} gc cutoff {}",
lsn, **latest_gc_cutoff_lsn
@@ -528,15 +540,12 @@ impl PageServerHandler {
&self,
timeline: &Timeline,
req: &PagestreamExistsRequest,
) -> Result<PagestreamBeMessage> {
) -> anyhow::Result<PagestreamBeMessage> {
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)
.await?;
let exists = crate::tenant::with_ondemand_download(|| {
timeline.get_rel_exists(req.rel, lsn, req.latest)
})
.await?;
let exists = timeline.get_rel_exists(req.rel, lsn, req.latest).await?;
Ok(PagestreamBeMessage::Exists(PagestreamExistsResponse {
exists,
@@ -548,15 +557,12 @@ impl PageServerHandler {
&self,
timeline: &Timeline,
req: &PagestreamNblocksRequest,
) -> Result<PagestreamBeMessage> {
) -> anyhow::Result<PagestreamBeMessage> {
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)
.await?;
let n_blocks = crate::tenant::with_ondemand_download(|| {
timeline.get_rel_size(req.rel, lsn, req.latest)
})
.await?;
let n_blocks = timeline.get_rel_size(req.rel, lsn, req.latest).await?;
Ok(PagestreamBeMessage::Nblocks(PagestreamNblocksResponse {
n_blocks,
@@ -568,15 +574,14 @@ impl PageServerHandler {
&self,
timeline: &Timeline,
req: &PagestreamDbSizeRequest,
) -> Result<PagestreamBeMessage> {
) -> anyhow::Result<PagestreamBeMessage> {
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)
.await?;
let total_blocks = crate::tenant::with_ondemand_download(|| {
timeline.get_db_size(DEFAULTTABLESPACE_OID, req.dbnode, lsn, req.latest)
})
.await?;
let total_blocks = timeline
.get_db_size(DEFAULTTABLESPACE_OID, req.dbnode, lsn, req.latest)
.await?;
let db_size = total_blocks as i64 * BLCKSZ as i64;
Ok(PagestreamBeMessage::DbSize(PagestreamDbSizeResponse {
@@ -589,7 +594,7 @@ impl PageServerHandler {
&self,
timeline: &Timeline,
req: &PagestreamGetPageRequest,
) -> Result<PagestreamBeMessage> {
) -> anyhow::Result<PagestreamBeMessage> {
let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)
.await?;
@@ -602,10 +607,9 @@ impl PageServerHandler {
}
*/
let page = crate::tenant::with_ondemand_download(|| {
timeline.get_rel_page_at_lsn(req.rel, req.blkno, lsn, req.latest)
})
.await?;
let page = timeline
.get_rel_page_at_lsn(req.rel, req.blkno, lsn, req.latest)
.await?;
Ok(PagestreamBeMessage::GetPage(PagestreamGetPageResponse {
page,
@@ -638,7 +642,7 @@ impl PageServerHandler {
pgb.write_message(&BeMessage::CopyOutResponse)?;
pgb.flush().await?;
/* Send a tarball of the latest layer on the timeline */
// Send a tarball of the latest layer on the timeline
{
let mut writer = pgb.copyout_writer();
basebackup::send_basebackup_tarball(&mut writer, &timeline, lsn, prev_lsn, full_backup)
@@ -654,7 +658,7 @@ impl PageServerHandler {
// when accessing management api supply None as an argument
// when using to authorize tenant pass corresponding tenant id
fn check_permission(&self, tenant_id: Option<TenantId>) -> Result<()> {
fn check_permission(&self, tenant_id: Option<TenantId>) -> anyhow::Result<()> {
if self.auth.is_none() {
// auth is set to Trust, nothing to check so just return ok
return Ok(());
@@ -676,20 +680,19 @@ impl postgres_backend_async::Handler for PageServerHandler {
&mut self,
_pgb: &mut PostgresBackend,
jwt_response: &[u8],
) -> anyhow::Result<()> {
) -> Result<(), QueryError> {
// this unwrap is never triggered, because check_auth_jwt only called when auth_type is NeonJWT
// which requires auth to be present
let data = self
.auth
.as_ref()
.unwrap()
.decode(str::from_utf8(jwt_response)?)?;
.decode(str::from_utf8(jwt_response).context("jwt response is not UTF-8")?)?;
if matches!(data.claims.scope, Scope::Tenant) {
ensure!(
data.claims.tenant_id.is_some(),
if matches!(data.claims.scope, Scope::Tenant) && data.claims.tenant_id.is_none() {
return Err(QueryError::Other(anyhow::anyhow!(
"jwt token scope is Tenant, but tenant id is missing"
)
)));
}
info!(
@@ -701,22 +704,33 @@ impl postgres_backend_async::Handler for PageServerHandler {
Ok(())
}
fn startup(
&mut self,
_pgb: &mut PostgresBackend,
_sm: &FeStartupPacket,
) -> Result<(), QueryError> {
Ok(())
}
async fn process_query(
&mut self,
pgb: &mut PostgresBackend,
query_string: &str,
) -> anyhow::Result<()> {
debug!("process query {:?}", query_string);
) -> Result<(), QueryError> {
debug!("process query {query_string:?}");
if query_string.starts_with("pagestream ") {
let (_, params_raw) = query_string.split_at("pagestream ".len());
let params = params_raw.split(' ').collect::<Vec<_>>();
ensure!(
params.len() == 2,
"invalid param number for pagestream command"
);
let tenant_id = TenantId::from_str(params[0])?;
let timeline_id = TimelineId::from_str(params[1])?;
if params.len() != 2 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for pagestream command"
)));
}
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
self.check_permission(Some(tenant_id))?;
@@ -726,18 +740,24 @@ impl postgres_backend_async::Handler for PageServerHandler {
let (_, params_raw) = query_string.split_at("basebackup ".len());
let params = params_raw.split_whitespace().collect::<Vec<_>>();
ensure!(
params.len() >= 2,
"invalid param number for basebackup command"
);
if params.len() < 2 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for basebackup command"
)));
}
let tenant_id = TenantId::from_str(params[0])?;
let timeline_id = TimelineId::from_str(params[1])?;
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
self.check_permission(Some(tenant_id))?;
let lsn = if params.len() == 3 {
Some(Lsn::from_str(params[2])?)
Some(
Lsn::from_str(params[2])
.with_context(|| format!("Failed to parse Lsn from {}", params[2]))?,
)
} else {
None
};
@@ -752,13 +772,16 @@ impl postgres_backend_async::Handler for PageServerHandler {
let (_, params_raw) = query_string.split_at("get_last_record_rlsn ".len());
let params = params_raw.split_whitespace().collect::<Vec<_>>();
ensure!(
params.len() == 2,
"invalid param number for get_last_record_rlsn command"
);
if params.len() != 2 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for get_last_record_rlsn command"
)));
}
let tenant_id = TenantId::from_str(params[0])?;
let timeline_id = TimelineId::from_str(params[1])?;
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
self.check_permission(Some(tenant_id))?;
let timeline = get_active_timeline_with_timeout(tenant_id, timeline_id).await?;
@@ -780,22 +803,31 @@ impl postgres_backend_async::Handler for PageServerHandler {
let (_, params_raw) = query_string.split_at("fullbackup ".len());
let params = params_raw.split_whitespace().collect::<Vec<_>>();
ensure!(
params.len() >= 2,
"invalid param number for fullbackup command"
);
if params.len() < 2 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for fullbackup command"
)));
}
let tenant_id = TenantId::from_str(params[0])?;
let timeline_id = TimelineId::from_str(params[1])?;
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
// The caller is responsible for providing correct lsn and prev_lsn.
let lsn = if params.len() > 2 {
Some(Lsn::from_str(params[2])?)
Some(
Lsn::from_str(params[2])
.with_context(|| format!("Failed to parse Lsn from {}", params[2]))?,
)
} else {
None
};
let prev_lsn = if params.len() > 3 {
Some(Lsn::from_str(params[3])?)
Some(
Lsn::from_str(params[3])
.with_context(|| format!("Failed to parse Lsn from {}", params[3]))?,
)
} else {
None
};
@@ -820,12 +852,21 @@ impl postgres_backend_async::Handler for PageServerHandler {
// -c "import basebackup $TENANT $TIMELINE $START_LSN $END_LSN $PG_VERSION"
let (_, params_raw) = query_string.split_at("import basebackup ".len());
let params = params_raw.split_whitespace().collect::<Vec<_>>();
ensure!(params.len() == 5);
let tenant_id = TenantId::from_str(params[0])?;
let timeline_id = TimelineId::from_str(params[1])?;
let base_lsn = Lsn::from_str(params[2])?;
let end_lsn = Lsn::from_str(params[3])?;
let pg_version = u32::from_str(params[4])?;
if params.len() != 5 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for import basebackup command"
)));
}
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
let base_lsn = Lsn::from_str(params[2])
.with_context(|| format!("Failed to parse Lsn from {}", params[2]))?;
let end_lsn = Lsn::from_str(params[3])
.with_context(|| format!("Failed to parse Lsn from {}", params[3]))?;
let pg_version = u32::from_str(params[4])
.with_context(|| format!("Failed to parse pg_version from {}", params[4]))?;
self.check_permission(Some(tenant_id))?;
@@ -843,7 +884,10 @@ impl postgres_backend_async::Handler for PageServerHandler {
Ok(()) => pgb.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?,
Err(e) => {
error!("error importing base backup between {base_lsn} and {end_lsn}: {e:?}");
pgb.write_message(&BeMessage::ErrorResponse(&e.to_string()))?
pgb.write_message(&BeMessage::ErrorResponse(
&e.to_string(),
Some(e.pg_error_code()),
))?
}
};
} else if query_string.starts_with("import wal ") {
@@ -853,11 +897,19 @@ impl postgres_backend_async::Handler for PageServerHandler {
// caller should poll the http api to check when that is done.
let (_, params_raw) = query_string.split_at("import wal ".len());
let params = params_raw.split_whitespace().collect::<Vec<_>>();
ensure!(params.len() == 4);
let tenant_id = TenantId::from_str(params[0])?;
let timeline_id = TimelineId::from_str(params[1])?;
let start_lsn = Lsn::from_str(params[2])?;
let end_lsn = Lsn::from_str(params[3])?;
if params.len() != 4 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for import wal command"
)));
}
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
let start_lsn = Lsn::from_str(params[2])
.with_context(|| format!("Failed to parse Lsn from {}", params[2]))?;
let end_lsn = Lsn::from_str(params[3])
.with_context(|| format!("Failed to parse Lsn from {}", params[3]))?;
self.check_permission(Some(tenant_id))?;
@@ -868,7 +920,10 @@ impl postgres_backend_async::Handler for PageServerHandler {
Ok(()) => pgb.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?,
Err(e) => {
error!("error importing WAL between {start_lsn} and {end_lsn}: {e:?}");
pgb.write_message(&BeMessage::ErrorResponse(&e.to_string()))?
pgb.write_message(&BeMessage::ErrorResponse(
&e.to_string(),
Some(e.pg_error_code()),
))?
}
};
} else if query_string.to_ascii_lowercase().starts_with("set ") {
@@ -879,8 +934,13 @@ impl postgres_backend_async::Handler for PageServerHandler {
// show <tenant_id>
let (_, params_raw) = query_string.split_at("show ".len());
let params = params_raw.split(' ').collect::<Vec<_>>();
ensure!(params.len() == 1, "invalid param number for config command");
let tenant_id = TenantId::from_str(params[0])?;
if params.len() != 1 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for config command"
)));
}
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
self.check_permission(Some(tenant_id))?;
@@ -921,7 +981,9 @@ impl postgres_backend_async::Handler for PageServerHandler {
]))?
.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else {
bail!("unknown command");
return Err(QueryError::Other(anyhow::anyhow!(
"unknown command {query_string}"
)));
}
Ok(())
@@ -933,7 +995,7 @@ impl postgres_backend_async::Handler for PageServerHandler {
/// If the tenant is Loading, waits for it to become Active, for up to 30 s. That
/// ensures that queries don't fail immediately after pageserver startup, because
/// all tenants are still loading.
async fn get_active_tenant_with_timeout(tenant_id: TenantId) -> Result<Arc<Tenant>> {
async fn get_active_tenant_with_timeout(tenant_id: TenantId) -> anyhow::Result<Arc<Tenant>> {
let tenant = mgr::get_tenant(tenant_id, false).await?;
match tokio::time::timeout(Duration::from_secs(30), tenant.wait_to_become_active()).await {
Ok(wait_result) => wait_result
@@ -947,7 +1009,7 @@ async fn get_active_tenant_with_timeout(tenant_id: TenantId) -> Result<Arc<Tenan
async fn get_active_timeline_with_timeout(
tenant_id: TenantId,
timeline_id: TimelineId,
) -> Result<Arc<Timeline>> {
) -> anyhow::Result<Arc<Timeline>> {
get_active_tenant_with_timeout(tenant_id)
.await
.and_then(|tenant| tenant.get_timeline(timeline_id, true))

View File

@@ -6,11 +6,10 @@
//! walingest.rs handles a few things like implicit relation creation and extension.
//! Clarify that)
//!
use super::tenant::PageReconstructResult;
use super::tenant::{PageReconstructError, Timeline};
use crate::keyspace::{KeySpace, KeySpaceAccum};
use crate::tenant::{with_ondemand_download, Timeline};
use crate::repository::*;
use crate::walrecord::NeonWalRecord;
use crate::{repository::*, try_no_ondemand_download};
use anyhow::Context;
use bytes::{Buf, Bytes};
use pageserver_api::reltag::{RelTag, SlruKind};
@@ -92,76 +91,80 @@ impl Timeline {
//------------------------------------------------------------------------------
/// Look up given page version.
pub fn get_rel_page_at_lsn(
pub async fn get_rel_page_at_lsn(
&self,
tag: RelTag,
blknum: BlockNumber,
lsn: Lsn,
latest: bool,
) -> PageReconstructResult<Bytes> {
) -> Result<Bytes, PageReconstructError> {
if tag.relnode == 0 {
return PageReconstructResult::from(anyhow::anyhow!("invalid relnode"));
return Err(PageReconstructError::Other(anyhow::anyhow!(
"invalid relnode"
)));
}
let nblocks = try_no_ondemand_download!(self.get_rel_size(tag, lsn, latest));
let nblocks = self.get_rel_size(tag, lsn, latest).await?;
if blknum >= nblocks {
debug!(
"read beyond EOF at {} blk {} at {}, size is {}: returning all-zeros page",
tag, blknum, lsn, nblocks
);
return PageReconstructResult::Success(ZERO_PAGE.clone());
return Ok(ZERO_PAGE.clone());
}
let key = rel_block_to_key(tag, blknum);
self.get(key, lsn)
self.get(key, lsn).await
}
// Get size of a database in blocks
pub fn get_db_size(
pub async fn get_db_size(
&self,
spcnode: Oid,
dbnode: Oid,
lsn: Lsn,
latest: bool,
) -> PageReconstructResult<usize> {
) -> Result<usize, PageReconstructError> {
let mut total_blocks = 0;
let rels = try_no_ondemand_download!(self.list_rels(spcnode, dbnode, lsn));
let rels = self.list_rels(spcnode, dbnode, lsn).await?;
for rel in rels {
let n_blocks = try_no_ondemand_download!(self.get_rel_size(rel, lsn, latest));
let n_blocks = self.get_rel_size(rel, lsn, latest).await?;
total_blocks += n_blocks as usize;
}
PageReconstructResult::Success(total_blocks)
Ok(total_blocks)
}
/// Get size of a relation file
pub fn get_rel_size(
pub async fn get_rel_size(
&self,
tag: RelTag,
lsn: Lsn,
latest: bool,
) -> PageReconstructResult<BlockNumber> {
) -> Result<BlockNumber, PageReconstructError> {
if tag.relnode == 0 {
return PageReconstructResult::from(anyhow::anyhow!("invalid relnode"));
return Err(PageReconstructError::Other(anyhow::anyhow!(
"invalid relnode"
)));
}
if let Some(nblocks) = self.get_cached_rel_size(&tag, lsn) {
return PageReconstructResult::Success(nblocks);
return Ok(nblocks);
}
if (tag.forknum == FSM_FORKNUM || tag.forknum == VISIBILITYMAP_FORKNUM)
&& !try_no_ondemand_download!(self.get_rel_exists(tag, lsn, latest))
&& !self.get_rel_exists(tag, lsn, latest).await?
{
// FIXME: Postgres sometimes calls smgrcreate() to create
// FSM, and smgrnblocks() on it immediately afterwards,
// without extending it. Tolerate that by claiming that
// any non-existent FSM fork has size 0.
return PageReconstructResult::Success(0);
return Ok(0);
}
let key = rel_size_to_key(tag);
let mut buf = try_no_ondemand_download!(self.get(key, lsn));
let mut buf = self.get(key, lsn).await?;
let nblocks = buf.get_u32_le();
if latest {
@@ -174,47 +177,49 @@ impl Timeline {
// associated with most recent value of LSN.
self.update_cached_rel_size(tag, lsn, nblocks);
}
PageReconstructResult::Success(nblocks)
Ok(nblocks)
}
/// Does relation exist?
pub fn get_rel_exists(
pub async fn get_rel_exists(
&self,
tag: RelTag,
lsn: Lsn,
_latest: bool,
) -> PageReconstructResult<bool> {
) -> Result<bool, PageReconstructError> {
if tag.relnode == 0 {
return PageReconstructResult::from(anyhow::anyhow!("invalid relnode"));
return Err(PageReconstructError::Other(anyhow::anyhow!(
"invalid relnode"
)));
}
// first try to lookup relation in cache
if let Some(_nblocks) = self.get_cached_rel_size(&tag, lsn) {
return PageReconstructResult::Success(true);
return Ok(true);
}
// fetch directory listing
let key = rel_dir_to_key(tag.spcnode, tag.dbnode);
let buf = try_no_ondemand_download!(self.get(key, lsn));
let buf = self.get(key, lsn).await?;
match RelDirectory::des(&buf).context("deserialization failure") {
Ok(dir) => {
let exists = dir.rels.get(&(tag.relnode, tag.forknum)).is_some();
PageReconstructResult::Success(exists)
Ok(exists)
}
Err(e) => PageReconstructResult::from(e),
Err(e) => Err(PageReconstructError::from(e)),
}
}
/// Get a list of all existing relations in given tablespace and database.
pub fn list_rels(
pub async fn list_rels(
&self,
spcnode: Oid,
dbnode: Oid,
lsn: Lsn,
) -> PageReconstructResult<HashSet<RelTag>> {
) -> Result<HashSet<RelTag>, PageReconstructError> {
// fetch directory listing
let key = rel_dir_to_key(spcnode, dbnode);
let buf = try_no_ondemand_download!(self.get(key, lsn));
let buf = self.get(key, lsn).await?;
match RelDirectory::des(&buf).context("deserialization failure") {
Ok(dir) => {
@@ -226,53 +231,53 @@ impl Timeline {
forknum: *forknum,
}));
PageReconstructResult::Success(rels)
Ok(rels)
}
Err(e) => PageReconstructResult::from(e),
Err(e) => Err(PageReconstructError::from(e)),
}
}
/// Look up given SLRU page version.
pub fn get_slru_page_at_lsn(
pub async fn get_slru_page_at_lsn(
&self,
kind: SlruKind,
segno: u32,
blknum: BlockNumber,
lsn: Lsn,
) -> PageReconstructResult<Bytes> {
) -> Result<Bytes, PageReconstructError> {
let key = slru_block_to_key(kind, segno, blknum);
self.get(key, lsn)
self.get(key, lsn).await
}
/// Get size of an SLRU segment
pub fn get_slru_segment_size(
pub async fn get_slru_segment_size(
&self,
kind: SlruKind,
segno: u32,
lsn: Lsn,
) -> PageReconstructResult<BlockNumber> {
) -> Result<BlockNumber, PageReconstructError> {
let key = slru_segment_size_to_key(kind, segno);
let mut buf = try_no_ondemand_download!(self.get(key, lsn));
PageReconstructResult::Success(buf.get_u32_le())
let mut buf = self.get(key, lsn).await?;
Ok(buf.get_u32_le())
}
/// Get size of an SLRU segment
pub fn get_slru_segment_exists(
pub async fn get_slru_segment_exists(
&self,
kind: SlruKind,
segno: u32,
lsn: Lsn,
) -> PageReconstructResult<bool> {
) -> Result<bool, PageReconstructError> {
// fetch directory listing
let key = slru_dir_to_key(kind);
let buf = try_no_ondemand_download!(self.get(key, lsn));
let buf = self.get(key, lsn).await?;
match SlruSegmentDirectory::des(&buf).context("deserialization failure") {
Ok(dir) => {
let exists = dir.segments.get(&segno).is_some();
PageReconstructResult::Success(exists)
Ok(exists)
}
Err(e) => PageReconstructResult::from(e),
Err(e) => Err(PageReconstructError::from(e)),
}
}
@@ -283,10 +288,10 @@ impl Timeline {
/// so it's not well defined which LSN you get if there were multiple commits
/// "in flight" at that point in time.
///
pub fn find_lsn_for_timestamp(
pub async fn find_lsn_for_timestamp(
&self,
search_timestamp: TimestampTz,
) -> PageReconstructResult<LsnForTimestamp> {
) -> Result<LsnForTimestamp, PageReconstructError> {
let gc_cutoff_lsn_guard = self.get_latest_gc_cutoff_lsn();
let min_lsn = *gc_cutoff_lsn_guard;
let max_lsn = self.get_last_record_lsn();
@@ -302,12 +307,14 @@ impl Timeline {
// cannot overflow, high and low are both smaller than u64::MAX / 2
let mid = (high + low) / 2;
let cmp = try_no_ondemand_download!(self.is_latest_commit_timestamp_ge_than(
search_timestamp,
Lsn(mid * 8),
&mut found_smaller,
&mut found_larger,
));
let cmp = self
.is_latest_commit_timestamp_ge_than(
search_timestamp,
Lsn(mid * 8),
&mut found_smaller,
&mut found_larger,
)
.await?;
if cmp {
high = mid;
@@ -319,15 +326,15 @@ impl Timeline {
(false, false) => {
// This can happen if no commit records have been processed yet, e.g.
// just after importing a cluster.
PageReconstructResult::Success(LsnForTimestamp::NoData(max_lsn))
Ok(LsnForTimestamp::NoData(max_lsn))
}
(true, false) => {
// Didn't find any commit timestamps larger than the request
PageReconstructResult::Success(LsnForTimestamp::Future(max_lsn))
Ok(LsnForTimestamp::Future(max_lsn))
}
(false, true) => {
// Didn't find any commit timestamps smaller than the request
PageReconstructResult::Success(LsnForTimestamp::Past(max_lsn))
Ok(LsnForTimestamp::Past(max_lsn))
}
(true, true) => {
// low is the LSN of the first commit record *after* the search_timestamp,
@@ -337,7 +344,7 @@ impl Timeline {
// Otherwise, if you restore to the returned LSN, the database will
// include physical changes from later commits that will be marked
// as aborted, and will need to be vacuumed away.
PageReconstructResult::Success(LsnForTimestamp::Present(Lsn((low - 1) * 8)))
Ok(LsnForTimestamp::Present(Lsn((low - 1) * 8)))
}
}
}
@@ -349,26 +356,21 @@ impl Timeline {
/// Additionally, sets 'found_smaller'/'found_Larger, if encounters any commits
/// with a smaller/larger timestamp.
///
pub fn is_latest_commit_timestamp_ge_than(
pub async fn is_latest_commit_timestamp_ge_than(
&self,
search_timestamp: TimestampTz,
probe_lsn: Lsn,
found_smaller: &mut bool,
found_larger: &mut bool,
) -> PageReconstructResult<bool> {
for segno in try_no_ondemand_download!(self.list_slru_segments(SlruKind::Clog, probe_lsn)) {
let nblocks = try_no_ondemand_download!(self.get_slru_segment_size(
SlruKind::Clog,
segno,
probe_lsn
));
) -> Result<bool, PageReconstructError> {
for segno in self.list_slru_segments(SlruKind::Clog, probe_lsn).await? {
let nblocks = self
.get_slru_segment_size(SlruKind::Clog, segno, probe_lsn)
.await?;
for blknum in (0..nblocks).rev() {
let clog_page = try_no_ondemand_download!(self.get_slru_page_at_lsn(
SlruKind::Clog,
segno,
blknum,
probe_lsn
));
let clog_page = self
.get_slru_page_at_lsn(SlruKind::Clog, segno, blknum, probe_lsn)
.await?;
if clog_page.len() == BLCKSZ as usize + 8 {
let mut timestamp_bytes = [0u8; 8];
@@ -377,76 +379,85 @@ impl Timeline {
if timestamp >= search_timestamp {
*found_larger = true;
return PageReconstructResult::Success(true);
return Ok(true);
} else {
*found_smaller = true;
}
}
}
}
PageReconstructResult::Success(false)
Ok(false)
}
/// Get a list of SLRU segments
pub fn list_slru_segments(
pub async fn list_slru_segments(
&self,
kind: SlruKind,
lsn: Lsn,
) -> PageReconstructResult<HashSet<u32>> {
) -> Result<HashSet<u32>, PageReconstructError> {
// fetch directory entry
let key = slru_dir_to_key(kind);
let buf = try_no_ondemand_download!(self.get(key, lsn));
let buf = self.get(key, lsn).await?;
match SlruSegmentDirectory::des(&buf).context("deserialization failure") {
Ok(dir) => PageReconstructResult::Success(dir.segments),
Err(e) => PageReconstructResult::from(e),
Ok(dir) => Ok(dir.segments),
Err(e) => Err(PageReconstructError::from(e)),
}
}
pub fn get_relmap_file(
pub async fn get_relmap_file(
&self,
spcnode: Oid,
dbnode: Oid,
lsn: Lsn,
) -> PageReconstructResult<Bytes> {
) -> Result<Bytes, PageReconstructError> {
let key = relmap_file_key(spcnode, dbnode);
let buf = try_no_ondemand_download!(self.get(key, lsn));
PageReconstructResult::Success(buf)
self.get(key, lsn).await
}
pub fn list_dbdirs(&self, lsn: Lsn) -> PageReconstructResult<HashMap<(Oid, Oid), bool>> {
pub async fn list_dbdirs(
&self,
lsn: Lsn,
) -> Result<HashMap<(Oid, Oid), bool>, PageReconstructError> {
// fetch directory entry
let buf = try_no_ondemand_download!(self.get(DBDIR_KEY, lsn));
let buf = self.get(DBDIR_KEY, lsn).await?;
match DbDirectory::des(&buf).context("deserialization failure") {
Ok(dir) => PageReconstructResult::Success(dir.dbdirs),
Err(e) => PageReconstructResult::from(e),
Ok(dir) => Ok(dir.dbdirs),
Err(e) => Err(PageReconstructError::from(e)),
}
}
pub fn get_twophase_file(&self, xid: TransactionId, lsn: Lsn) -> PageReconstructResult<Bytes> {
pub async fn get_twophase_file(
&self,
xid: TransactionId,
lsn: Lsn,
) -> Result<Bytes, PageReconstructError> {
let key = twophase_file_key(xid);
let buf = try_no_ondemand_download!(self.get(key, lsn));
PageReconstructResult::Success(buf)
let buf = self.get(key, lsn).await?;
Ok(buf)
}
pub fn list_twophase_files(&self, lsn: Lsn) -> PageReconstructResult<HashSet<TransactionId>> {
pub async fn list_twophase_files(
&self,
lsn: Lsn,
) -> Result<HashSet<TransactionId>, PageReconstructError> {
// fetch directory entry
let buf = try_no_ondemand_download!(self.get(TWOPHASEDIR_KEY, lsn));
let buf = self.get(TWOPHASEDIR_KEY, lsn).await?;
match TwoPhaseDirectory::des(&buf).context("deserialization failure") {
Ok(dir) => PageReconstructResult::Success(dir.xids),
Err(e) => PageReconstructResult::from(e),
Ok(dir) => Ok(dir.xids),
Err(e) => Err(PageReconstructError::from(e)),
}
}
pub fn get_control_file(&self, lsn: Lsn) -> PageReconstructResult<Bytes> {
self.get(CONTROLFILE_KEY, lsn)
pub async fn get_control_file(&self, lsn: Lsn) -> Result<Bytes, PageReconstructError> {
self.get(CONTROLFILE_KEY, lsn).await
}
pub fn get_checkpoint(&self, lsn: Lsn) -> PageReconstructResult<Bytes> {
self.get(CHECKPOINT_KEY, lsn)
pub async fn get_checkpoint(&self, lsn: Lsn) -> Result<Bytes, PageReconstructError> {
self.get(CHECKPOINT_KEY, lsn).await
}
/// Does the same as get_current_logical_size but counted on demand.
@@ -460,20 +471,24 @@ impl Timeline {
cancel: CancellationToken,
) -> Result<u64, CalculateLogicalSizeError> {
// Fetch list of database dirs and iterate them
let buf = self.get_download(DBDIR_KEY, lsn).await?;
let buf = self.get(DBDIR_KEY, lsn).await.context("read dbdir")?;
let dbdir = DbDirectory::des(&buf).context("deserialize db directory")?;
let mut total_size: u64 = 0;
for (spcnode, dbnode) in dbdir.dbdirs.keys() {
for rel in
crate::tenant::with_ondemand_download(|| self.list_rels(*spcnode, *dbnode, lsn))
.await?
for rel in self
.list_rels(*spcnode, *dbnode, lsn)
.await
.context("list rels")?
{
if cancel.is_cancelled() {
return Err(CalculateLogicalSizeError::Cancelled);
}
let relsize_key = rel_size_to_key(rel);
let mut buf = self.get_download(relsize_key, lsn).await?;
let mut buf = self
.get(relsize_key, lsn)
.await
.context("read relation size of {rel:?}")?;
let relsize = buf.get_u32_le();
total_size += relsize as u64;
@@ -494,7 +509,7 @@ impl Timeline {
result.add_key(DBDIR_KEY);
// Fetch list of database dirs and iterate them
let buf = self.get_download(DBDIR_KEY, lsn).await?;
let buf = self.get(DBDIR_KEY, lsn).await?;
let dbdir = DbDirectory::des(&buf).context("deserialization failure")?;
let mut dbs: Vec<(Oid, Oid)> = dbdir.dbdirs.keys().cloned().collect();
@@ -503,15 +518,15 @@ impl Timeline {
result.add_key(relmap_file_key(spcnode, dbnode));
result.add_key(rel_dir_to_key(spcnode, dbnode));
let mut rels: Vec<RelTag> =
with_ondemand_download(|| self.list_rels(spcnode, dbnode, lsn))
.await?
.into_iter()
.collect();
let mut rels: Vec<RelTag> = self
.list_rels(spcnode, dbnode, lsn)
.await?
.into_iter()
.collect();
rels.sort_unstable();
for rel in rels {
let relsize_key = rel_size_to_key(rel);
let mut buf = self.get_download(relsize_key, lsn).await?;
let mut buf = self.get(relsize_key, lsn).await?;
let relsize = buf.get_u32_le();
result.add_range(rel_block_to_key(rel, 0)..rel_block_to_key(rel, relsize));
@@ -527,13 +542,13 @@ impl Timeline {
] {
let slrudir_key = slru_dir_to_key(kind);
result.add_key(slrudir_key);
let buf = self.get_download(slrudir_key, lsn).await?;
let buf = self.get(slrudir_key, lsn).await?;
let dir = SlruSegmentDirectory::des(&buf).context("deserialization failure")?;
let mut segments: Vec<u32> = dir.segments.iter().cloned().collect();
segments.sort_unstable();
for segno in segments {
let segsize_key = slru_segment_size_to_key(kind, segno);
let mut buf = self.get_download(segsize_key, lsn).await?;
let mut buf = self.get(segsize_key, lsn).await?;
let segsize = buf.get_u32_le();
result.add_range(
@@ -545,7 +560,7 @@ impl Timeline {
// Then pg_twophase
result.add_key(TWOPHASEDIR_KEY);
let buf = self.get_download(TWOPHASEDIR_KEY, lsn).await?;
let buf = self.get(TWOPHASEDIR_KEY, lsn).await?;
let twophase_dir = TwoPhaseDirectory::des(&buf).context("deserialization failure")?;
let mut xids: Vec<TransactionId> = twophase_dir.xids.iter().cloned().collect();
xids.sort_unstable();
@@ -703,9 +718,14 @@ impl<'a> DatadirModification<'a> {
}
/// Store a relmapper file (pg_filenode.map) in the repository
pub fn put_relmap_file(&mut self, spcnode: Oid, dbnode: Oid, img: Bytes) -> anyhow::Result<()> {
pub async fn put_relmap_file(
&mut self,
spcnode: Oid,
dbnode: Oid,
img: Bytes,
) -> anyhow::Result<()> {
// Add it to the directory (if it doesn't exist already)
let buf = self.get(DBDIR_KEY).no_ondemand_download()?;
let buf = self.get(DBDIR_KEY).await?;
let mut dbdir = DbDirectory::des(&buf)?;
let r = dbdir.dbdirs.insert((spcnode, dbnode), true);
@@ -731,9 +751,13 @@ impl<'a> DatadirModification<'a> {
Ok(())
}
pub fn put_twophase_file(&mut self, xid: TransactionId, img: Bytes) -> anyhow::Result<()> {
pub async fn put_twophase_file(
&mut self,
xid: TransactionId,
img: Bytes,
) -> anyhow::Result<()> {
// Add it to the directory entry
let buf = self.get(TWOPHASEDIR_KEY).no_ondemand_download()?;
let buf = self.get(TWOPHASEDIR_KEY).await?;
let mut dir = TwoPhaseDirectory::des(&buf)?;
if !dir.xids.insert(xid) {
anyhow::bail!("twophase file for xid {} already exists", xid);
@@ -757,16 +781,16 @@ impl<'a> DatadirModification<'a> {
Ok(())
}
pub fn drop_dbdir(&mut self, spcnode: Oid, dbnode: Oid) -> anyhow::Result<()> {
pub async fn drop_dbdir(&mut self, spcnode: Oid, dbnode: Oid) -> anyhow::Result<()> {
let req_lsn = self.tline.get_last_record_lsn();
let total_blocks = self
.tline
.get_db_size(spcnode, dbnode, req_lsn, true)
.no_ondemand_download()?;
.await?;
// Remove entry from dbdir
let buf = self.get(DBDIR_KEY).no_ondemand_download()?;
let buf = self.get(DBDIR_KEY).await?;
let mut dir = DbDirectory::des(&buf)?;
if dir.dbdirs.remove(&(spcnode, dbnode)).is_some() {
let buf = DbDirectory::ser(&dir)?;
@@ -789,11 +813,15 @@ impl<'a> DatadirModification<'a> {
/// Create a relation fork.
///
/// 'nblocks' is the initial size.
pub fn put_rel_creation(&mut self, rel: RelTag, nblocks: BlockNumber) -> anyhow::Result<()> {
pub async fn put_rel_creation(
&mut self,
rel: RelTag,
nblocks: BlockNumber,
) -> anyhow::Result<()> {
anyhow::ensure!(rel.relnode != 0, "invalid relnode");
// It's possible that this is the first rel for this db in this
// tablespace. Create the reldir entry for it if so.
let mut dbdir = DbDirectory::des(&self.get(DBDIR_KEY).no_ondemand_download()?)?;
let mut dbdir = DbDirectory::des(&self.get(DBDIR_KEY).await?)?;
let rel_dir_key = rel_dir_to_key(rel.spcnode, rel.dbnode);
let mut rel_dir = if dbdir.dbdirs.get(&(rel.spcnode, rel.dbnode)).is_none() {
// Didn't exist. Update dbdir
@@ -805,7 +833,7 @@ impl<'a> DatadirModification<'a> {
RelDirectory::default()
} else {
// reldir already exists, fetch it
RelDirectory::des(&self.get(rel_dir_key).no_ondemand_download()?)?
RelDirectory::des(&self.get(rel_dir_key).await?)?
};
// Add the new relation to the rel directory entry, and write it back
@@ -833,17 +861,17 @@ impl<'a> DatadirModification<'a> {
}
/// Truncate relation
pub fn put_rel_truncation(&mut self, rel: RelTag, nblocks: BlockNumber) -> anyhow::Result<()> {
pub async fn put_rel_truncation(
&mut self,
rel: RelTag,
nblocks: BlockNumber,
) -> anyhow::Result<()> {
anyhow::ensure!(rel.relnode != 0, "invalid relnode");
let last_lsn = self.tline.get_last_record_lsn();
if self
.tline
.get_rel_exists(rel, last_lsn, true)
.no_ondemand_download()?
{
if self.tline.get_rel_exists(rel, last_lsn, true).await? {
let size_key = rel_size_to_key(rel);
// Fetch the old size first
let old_size = self.get(size_key).no_ondemand_download()?.get_u32_le();
let old_size = self.get(size_key).await?.get_u32_le();
// Update the entry with the new size.
let buf = nblocks.to_le_bytes();
@@ -863,12 +891,16 @@ impl<'a> DatadirModification<'a> {
/// Extend relation
/// If new size is smaller, do nothing.
pub fn put_rel_extend(&mut self, rel: RelTag, nblocks: BlockNumber) -> anyhow::Result<()> {
pub async fn put_rel_extend(
&mut self,
rel: RelTag,
nblocks: BlockNumber,
) -> anyhow::Result<()> {
anyhow::ensure!(rel.relnode != 0, "invalid relnode");
// Put size
let size_key = rel_size_to_key(rel);
let old_size = self.get(size_key).no_ondemand_download()?.get_u32_le();
let old_size = self.get(size_key).await?.get_u32_le();
// only extend relation here. never decrease the size
if nblocks > old_size {
@@ -884,12 +916,12 @@ impl<'a> DatadirModification<'a> {
}
/// Drop a relation.
pub fn put_rel_drop(&mut self, rel: RelTag) -> anyhow::Result<()> {
pub async fn put_rel_drop(&mut self, rel: RelTag) -> anyhow::Result<()> {
anyhow::ensure!(rel.relnode != 0, "invalid relnode");
// Remove it from the directory entry
let dir_key = rel_dir_to_key(rel.spcnode, rel.dbnode);
let buf = self.get(dir_key).no_ondemand_download()?;
let buf = self.get(dir_key).await?;
let mut dir = RelDirectory::des(&buf)?;
if dir.rels.remove(&(rel.relnode, rel.forknum)) {
@@ -900,7 +932,7 @@ impl<'a> DatadirModification<'a> {
// update logical size
let size_key = rel_size_to_key(rel);
let old_size = self.get(size_key).no_ondemand_download()?.get_u32_le();
let old_size = self.get(size_key).await?.get_u32_le();
self.pending_nblocks -= old_size as i64;
// Remove enty from relation size cache
@@ -912,7 +944,7 @@ impl<'a> DatadirModification<'a> {
Ok(())
}
pub fn put_slru_segment_creation(
pub async fn put_slru_segment_creation(
&mut self,
kind: SlruKind,
segno: u32,
@@ -920,7 +952,7 @@ impl<'a> DatadirModification<'a> {
) -> anyhow::Result<()> {
// Add it to the directory entry
let dir_key = slru_dir_to_key(kind);
let buf = self.get(dir_key).no_ondemand_download()?;
let buf = self.get(dir_key).await?;
let mut dir = SlruSegmentDirectory::des(&buf)?;
if !dir.segments.insert(segno) {
@@ -956,10 +988,10 @@ impl<'a> DatadirModification<'a> {
}
/// This method is used for marking truncated SLRU files
pub fn drop_slru_segment(&mut self, kind: SlruKind, segno: u32) -> anyhow::Result<()> {
pub async fn drop_slru_segment(&mut self, kind: SlruKind, segno: u32) -> anyhow::Result<()> {
// Remove it from the directory entry
let dir_key = slru_dir_to_key(kind);
let buf = self.get(dir_key).no_ondemand_download()?;
let buf = self.get(dir_key).await?;
let mut dir = SlruSegmentDirectory::des(&buf)?;
if !dir.segments.remove(&segno) {
@@ -983,9 +1015,9 @@ impl<'a> DatadirModification<'a> {
}
/// This method is used for marking truncated SLRU files
pub fn drop_twophase_file(&mut self, xid: TransactionId) -> anyhow::Result<()> {
pub async fn drop_twophase_file(&mut self, xid: TransactionId) -> anyhow::Result<()> {
// Remove it from the directory entry
let buf = self.get(TWOPHASEDIR_KEY).no_ondemand_download()?;
let buf = self.get(TWOPHASEDIR_KEY).await?;
let mut dir = TwoPhaseDirectory::des(&buf)?;
if !dir.xids.remove(&xid) {
@@ -1079,7 +1111,7 @@ impl<'a> DatadirModification<'a> {
// Internal helper functions to batch the modifications
fn get(&self, key: Key) -> PageReconstructResult<Bytes> {
async fn get(&self, key: Key) -> Result<Bytes, PageReconstructError> {
// Have we already updated the same key? Read the pending updated
// version in that case.
//
@@ -1087,18 +1119,20 @@ impl<'a> DatadirModification<'a> {
// value that has been removed, deletion only avoids leaking storage.
if let Some(value) = self.pending_updates.get(&key) {
if let Value::Image(img) = value {
PageReconstructResult::Success(img.clone())
Ok(img.clone())
} else {
// Currently, we never need to read back a WAL record that we
// inserted in the same "transaction". All the metadata updates
// work directly with Images, and we never need to read actual
// data pages. We could handle this if we had to, by calling
// the walredo manager, but let's keep it simple for now.
PageReconstructResult::from(anyhow::anyhow!("unexpected pending WAL record"))
Err(PageReconstructError::from(anyhow::anyhow!(
"unexpected pending WAL record"
)))
}
} else {
let lsn = Lsn::max(self.tline.get_last_record_lsn(), self.lsn);
self.tline.get(key, lsn)
self.tline.get(key, lsn).await
}
}

View File

@@ -220,6 +220,8 @@ pub enum TaskKind {
// task that drives downloading layers
DownloadAllRemoteLayers,
// Task that calculates synthetis size for all active tenants
CalculateSyntheticSize,
}
#[derive(Default)]

View File

@@ -13,11 +13,13 @@
use anyhow::{bail, Context};
use bytes::Bytes;
use futures::FutureExt;
use futures::Stream;
use pageserver_api::models::TimelineState;
use remote_storage::DownloadError;
use remote_storage::GenericRemoteStorage;
use tokio::sync::watch;
use tokio::task::JoinSet;
use tracing::*;
use utils::crashsafe::path_with_suffix_extension;
@@ -36,6 +38,8 @@ use std::path::Path;
use std::path::PathBuf;
use std::process::Command;
use std::process::Stdio;
use std::sync::atomic::AtomicU64;
use std::sync::atomic::Ordering;
use std::sync::Arc;
use std::sync::MutexGuard;
use std::sync::{Mutex, RwLock};
@@ -90,7 +94,7 @@ mod timeline;
pub mod size;
pub use timeline::{with_ondemand_download, PageReconstructError, PageReconstructResult, Timeline};
pub use timeline::{PageReconstructError, Timeline};
// re-export this function so that page_cache.rs can use it.
pub use crate::tenant::ephemeral_file::writeback as writeback_ephemeral_file;
@@ -137,6 +141,7 @@ pub struct Tenant {
/// Cached logical sizes updated updated on each [`Tenant::gather_size_inputs`].
cached_logical_sizes: tokio::sync::Mutex<HashMap<(TimelineId, Lsn), u64>>,
cached_synthetic_tenant_size: Arc<AtomicU64>,
}
/// A timeline with some of its files on disk, being initialized.
@@ -436,8 +441,16 @@ struct RemoteStartupData {
impl Tenant {
/// Yet another helper for timeline initialization.
/// Contains common part for `load_local_timeline` and `load_remote_timeline`
async fn setup_timeline(
/// Contains the common part of `load_local_timeline` and `load_remote_timeline`.
///
/// - Initializes the Timeline struct and inserts it into the tenant's hash map
/// - Scans the local timeline directory for layer files and builds the layer map
/// - Downloads remote index file and adds remote files to the layer map
/// - Schedules remote upload tasks for any files that are present locally but missing from remote storage.
///
/// If the operation fails, the timeline is left in the tenant's hash map in Broken state. On success,
/// it is marked as Active.
async fn timeline_init_and_sync(
&self,
timeline_id: TimelineId,
remote_client: Option<RemoteTimelineClient>,
@@ -480,10 +493,7 @@ impl Tenant {
// But we shouldnt start walreceiver before we have all the data locally, because working walreceiver
// will ingest data which may require looking at the layers which are not yet available locally
match timeline.initialize_with_lock(&mut timelines_accessor, true, false) {
Ok(initialized_timeline) => {
timelines_accessor.insert(timeline_id, initialized_timeline.clone());
Ok(initialized_timeline)
}
Ok(new_timeline) => new_timeline,
Err(e) => {
error!("Failed to initialize timeline {tenant_id}/{timeline_id}: {e:?}");
// FIXME using None is a hack, it wont hurt, just ugly.
@@ -499,16 +509,14 @@ impl Tenant {
None,
)
.with_context(|| {
format!(
"Failed to crate broken timeline data for {tenant_id}/{timeline_id}"
)
format!("creating broken timeline data for {tenant_id}/{timeline_id}")
})?;
broken_timeline.set_state(TimelineState::Broken);
timelines_accessor.insert(timeline_id, broken_timeline);
Err(e)
return Err(e);
}
}
}?;
};
if self.remote_storage.is_some() {
// Reconcile local state with remote storage, downloading anything that's
@@ -594,7 +602,7 @@ impl Tenant {
match tenant_clone.attach().await {
Ok(_) => {}
Err(e) => {
tenant_clone.set_broken();
tenant_clone.set_broken(&e.to_string());
error!("error attaching tenant: {:?}", e);
}
}
@@ -610,7 +618,7 @@ impl Tenant {
#[instrument(skip(self), fields(tenant_id=%self.tenant_id))]
async fn attach(self: &Arc<Tenant>) -> anyhow::Result<()> {
// Create directory with marker file to indicate attaching state.
// The load_local_tenants() function in tenant_mgr relies on the marker file
// The load_local_tenants() function in tenant::mgr relies on the marker file
// to determine whether a tenant has finished attaching.
let tenant_dir = self.conf.tenant_path(&self.tenant_id);
let marker_file = self.conf.tenant_attaching_mark_file_path(&self.tenant_id);
@@ -639,26 +647,62 @@ impl Tenant {
.as_ref()
.ok_or_else(|| anyhow::anyhow!("cannot attach without remote storage"))?;
let remote_timelines = remote_timeline_client::list_remote_timelines(
let remote_timeline_ids = remote_timeline_client::list_remote_timelines(
remote_storage,
self.conf,
self.tenant_id,
)
.await?;
info!("found {} timelines", remote_timelines.len());
info!("found {} timelines", remote_timeline_ids.len());
let mut timeline_ancestors: HashMap<TimelineId, TimelineMetadata> = HashMap::new();
let mut index_parts: HashMap<TimelineId, IndexPart> = HashMap::new();
for (timeline_id, index_part) in remote_timelines {
let remote_metadata = index_part.parse_metadata().with_context(|| {
format!(
"Failed to parse metadata file from remote storage for tenant {} timeline {}",
self.tenant_id, timeline_id
)
})?;
// Download & parse index parts
let mut part_downloads = JoinSet::new();
for timeline_id in remote_timeline_ids {
let client = RemoteTimelineClient::new(
remote_storage.clone(),
self.conf,
self.tenant_id,
timeline_id,
);
part_downloads.spawn(
async move {
debug!("starting index part download");
let index_part = client
.download_index_file()
.await
.context("download index file")?;
let remote_metadata = index_part.parse_metadata().context("parse metadata")?;
debug!("finished index part download");
Result::<_, anyhow::Error>::Ok((
timeline_id,
client,
index_part,
remote_metadata,
))
}
.map(move |res| {
res.with_context(|| format!("download index part for timeline {timeline_id}"))
})
.instrument(info_span!("download_index_part", timeline=%timeline_id)),
);
}
// Wait for all the download tasks to complete & collect results.
let mut remote_clients = HashMap::new();
let mut index_parts = HashMap::new();
let mut timeline_ancestors = HashMap::new();
while let Some(result) = part_downloads.join_next().await {
// NB: we already added timeline_id as context to the error
let result: Result<_, anyhow::Error> = result.context("joinset task join")?;
let (timeline_id, client, index_part, remote_metadata) = result?;
debug!("successfully downloaded index part for timeline {timeline_id}");
timeline_ancestors.insert(timeline_id, remote_metadata);
index_parts.insert(timeline_id, index_part);
remote_clients.insert(timeline_id, client);
}
// For every timeline, download the metadata file, scan the local directory,
@@ -671,7 +715,7 @@ impl Tenant {
timeline_id,
index_parts.remove(&timeline_id).unwrap(),
remote_metadata,
remote_storage.clone(),
remote_clients.remove(&timeline_id).unwrap(),
)
.await
.with_context(|| {
@@ -714,22 +758,19 @@ impl Tenant {
Ok(size)
}
#[instrument(skip(self, index_part, remote_metadata, remote_storage), fields(timeline_id=%timeline_id))]
#[instrument(skip_all, fields(timeline_id=%timeline_id))]
async fn load_remote_timeline(
&self,
timeline_id: TimelineId,
index_part: IndexPart,
remote_metadata: TimelineMetadata,
remote_storage: GenericRemoteStorage,
remote_client: RemoteTimelineClient,
) -> anyhow::Result<()> {
info!("downloading index file for timeline {}", timeline_id);
tokio::fs::create_dir_all(self.conf.timeline_path(&timeline_id, &self.tenant_id))
.await
.context("Failed to create new timeline directory")?;
let remote_client =
RemoteTimelineClient::new(remote_storage, self.conf, self.tenant_id, timeline_id)?;
let ancestor = if let Some(ancestor_id) = remote_metadata.ancestor_timeline() {
let timelines = self.timelines.lock().unwrap();
Some(Arc::clone(timelines.get(&ancestor_id).ok_or_else(
@@ -748,7 +789,7 @@ impl Tenant {
// cannot be older than the local one
let local_metadata = None;
self.setup_timeline(
self.timeline_init_and_sync(
timeline_id,
Some(remote_client),
Some(RemoteStartupData {
@@ -825,7 +866,7 @@ impl Tenant {
match tenant_clone.load().await {
Ok(()) => {}
Err(err) => {
tenant_clone.set_broken();
tenant_clone.set_broken(&err.to_string());
error!("could not load tenant {tenant_id}: {err:?}");
}
}
@@ -986,18 +1027,14 @@ impl Tenant {
None
};
let remote_client = self
.remote_storage
.as_ref()
.map(|remote_storage| {
RemoteTimelineClient::new(
remote_storage.clone(),
self.conf,
self.tenant_id,
timeline_id,
)
})
.transpose()?;
let remote_client = self.remote_storage.as_ref().map(|remote_storage| {
RemoteTimelineClient::new(
remote_storage.clone(),
self.conf,
self.tenant_id,
timeline_id,
)
});
let remote_startup_data = match &remote_client {
Some(remote_client) => match remote_client.download_index_file().await {
@@ -1017,7 +1054,7 @@ impl Tenant {
None => None,
};
self.setup_timeline(
self.timeline_init_and_sync(
timeline_id,
remote_client,
remote_startup_data,
@@ -1465,7 +1502,7 @@ impl Tenant {
});
}
pub fn set_broken(&self) {
pub fn set_broken(&self, reason: &str) {
self.state.send_modify(|current_state| {
match *current_state {
TenantState::Active => {
@@ -1474,18 +1511,22 @@ impl Tenant {
// activated should never be marked as broken. We cope with it the best
// we can, but it shouldn't happen.
*current_state = TenantState::Broken;
warn!("Changing Active tenant to Broken state");
warn!("Changing Active tenant to Broken state, reason: {}", reason);
}
TenantState::Broken => {
// This shouldn't happen either
warn!("Tenant is already broken");
warn!("Tenant is already in Broken state");
}
TenantState::Stopping => {
// This shouldn't happen either
*current_state = TenantState::Broken;
warn!("Marking Stopping tenant as Broken");
warn!(
"Marking Stopping tenant as Broken state, reason: {}",
reason
);
}
TenantState::Loading | TenantState::Attaching => {
info!("Setting tenant as Broken state, reason: {}", reason);
*current_state = TenantState::Broken;
}
}
@@ -1687,6 +1728,7 @@ impl Tenant {
remote_storage,
state,
cached_logical_sizes: tokio::sync::Mutex::new(HashMap::new()),
cached_synthetic_tenant_size: Arc::new(AtomicU64::new(0)),
}
}
@@ -1839,7 +1881,12 @@ impl Tenant {
utils::failpoint_sleep_millis_async!("gc_iteration_internal_after_getting_gc_timelines");
info!("starting on {} timelines", gc_timelines.len());
// If there is nothing to GC, we don't want any messages in the INFO log.
if !gc_timelines.is_empty() {
info!("{} timelines need GC", gc_timelines.len());
} else {
debug!("{} timelines need GC", gc_timelines.len());
}
// Perform GC for each timeline.
//
@@ -2191,7 +2238,7 @@ impl Tenant {
self.conf,
tenant_id,
new_timeline_id,
)?;
);
remote_client.init_upload_queue_for_empty_remote(&new_metadata)?;
Some(remote_client)
} else {
@@ -2319,6 +2366,24 @@ impl Tenant {
size::gather_inputs(self, logical_sizes_at_once, &mut shared_cache).await
}
/// Calculate synthetic tenant size
/// This is periodically called by background worker.
/// result is cached in tenant struct
#[instrument(skip_all, fields(tenant_id=%self.tenant_id))]
pub async fn calculate_synthetic_size(&self) -> anyhow::Result<u64> {
let inputs = self.gather_size_inputs().await?;
let size = inputs.calculate()?;
self.cached_synthetic_tenant_size
.store(size, Ordering::Relaxed);
Ok(size)
}
pub fn get_cached_synthetic_size(&self) -> u64 {
self.cached_synthetic_tenant_size.load(Ordering::Relaxed)
}
}
fn remove_timeline_and_uninit_mark(timeline_dir: &Path, uninit_mark: &Path) -> anyhow::Result<()> {
@@ -2561,9 +2626,11 @@ where
#[cfg(test)]
pub mod harness {
use bytes::{Bytes, BytesMut};
use once_cell::sync::Lazy;
use std::sync::{Arc, RwLock, RwLockReadGuard, RwLockWriteGuard};
use once_cell::sync::OnceCell;
use std::sync::Arc;
use std::{fs, path::PathBuf};
use tempfile::TempDir;
use utils::logging;
use utils::lsn::Lsn;
use crate::{
@@ -2594,8 +2661,6 @@ pub mod harness {
buf.freeze()
}
static LOCK: Lazy<RwLock<()>> = Lazy::new(|| RwLock::new(()));
impl From<TenantConf> for TenantConfOpt {
fn from(tenant_conf: TenantConf) -> Self {
Self {
@@ -2616,36 +2681,31 @@ pub mod harness {
}
}
pub struct TenantHarness<'a> {
/// The harness saves some boilerplate and provides a way to create functional tenant
/// without running pageserver binary. It uses temporary directory to store data in it.
/// Tempdir gets removed on harness drop.
pub struct TenantHarness {
// keep the struct to not to remove tmp dir during the test
_temp_repo_dir: TempDir,
pub conf: &'static PageServerConf,
pub tenant_conf: TenantConf,
pub tenant_id: TenantId,
pub lock_guard: (
Option<RwLockReadGuard<'a, ()>>,
Option<RwLockWriteGuard<'a, ()>>,
),
}
impl<'a> TenantHarness<'a> {
pub fn create(test_name: &'static str) -> anyhow::Result<Self> {
Self::create_internal(test_name, false)
}
pub fn create_exclusive(test_name: &'static str) -> anyhow::Result<Self> {
Self::create_internal(test_name, true)
}
fn create_internal(test_name: &'static str, exclusive: bool) -> anyhow::Result<Self> {
let lock_guard = if exclusive {
(None, Some(LOCK.write().unwrap()))
} else {
(Some(LOCK.read().unwrap()), None)
};
static LOG_HANDLE: OnceCell<()> = OnceCell::new();
let repo_dir = PageServerConf::test_repo_dir(test_name);
let _ = fs::remove_dir_all(&repo_dir);
fs::create_dir_all(&repo_dir)?;
impl TenantHarness {
pub fn new() -> anyhow::Result<Self> {
LOG_HANDLE.get_or_init(|| {
logging::init(logging::LogFormat::Test).expect("Failed to init test logging")
});
let conf = PageServerConf::dummy_conf(repo_dir);
let temp_repo_dir = tempfile::tempdir()?;
// `TempDir` uses a randomly generated subdirectory of a system tmp dir,
// so far it's enough to take care of concurrently running tests.
let repo_dir = temp_repo_dir.path();
let conf = PageServerConf::dummy_conf(repo_dir.to_path_buf());
// Make a static copy of the config. This can never be free'd, but that's
// OK in a test.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
@@ -2663,10 +2723,10 @@ pub mod harness {
fs::create_dir_all(conf.timelines_path(&tenant_id))?;
Ok(Self {
_temp_repo_dir: temp_repo_dir,
conf,
tenant_conf,
tenant_id,
lock_guard,
})
}
@@ -2760,7 +2820,8 @@ mod tests {
#[tokio::test]
async fn test_basic() -> anyhow::Result<()> {
let tenant = TenantHarness::create("test_basic")?.load().await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -2776,15 +2837,15 @@ mod tests {
drop(writer);
assert_eq!(
tline.get(*TEST_KEY, Lsn(0x10)).no_ondemand_download()?,
tline.get(*TEST_KEY, Lsn(0x10)).await?,
TEST_IMG("foo at 0x10")
);
assert_eq!(
tline.get(*TEST_KEY, Lsn(0x1f)).no_ondemand_download()?,
tline.get(*TEST_KEY, Lsn(0x1f)).await?,
TEST_IMG("foo at 0x10")
);
assert_eq!(
tline.get(*TEST_KEY, Lsn(0x20)).no_ondemand_download()?,
tline.get(*TEST_KEY, Lsn(0x20)).await?,
TEST_IMG("foo at 0x20")
);
@@ -2793,9 +2854,8 @@ mod tests {
#[tokio::test]
async fn no_duplicate_timelines() -> anyhow::Result<()> {
let tenant = TenantHarness::create("no_duplicate_timelines")?
.load()
.await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let _ = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -2826,7 +2886,8 @@ mod tests {
///
#[tokio::test]
async fn test_branch() -> anyhow::Result<()> {
let tenant = TenantHarness::create("test_branch")?.load().await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -2863,15 +2924,15 @@ mod tests {
// Check page contents on both branches
assert_eq!(
from_utf8(&tline.get(TEST_KEY_A, Lsn(0x40)).no_ondemand_download()?)?,
from_utf8(&tline.get(TEST_KEY_A, Lsn(0x40)).await?)?,
"foo at 0x40"
);
assert_eq!(
from_utf8(&newtline.get(TEST_KEY_A, Lsn(0x40)).no_ondemand_download()?)?,
from_utf8(&newtline.get(TEST_KEY_A, Lsn(0x40)).await?)?,
"bar at 0x40"
);
assert_eq!(
from_utf8(&newtline.get(TEST_KEY_B, Lsn(0x40)).no_ondemand_download()?)?,
from_utf8(&newtline.get(TEST_KEY_B, Lsn(0x40)).await?)?,
"foobar at 0x20"
);
@@ -2923,10 +2984,8 @@ mod tests {
#[tokio::test]
async fn test_prohibit_branch_creation_on_garbage_collected_data() -> anyhow::Result<()> {
let tenant =
TenantHarness::create("test_prohibit_branch_creation_on_garbage_collected_data")?
.load()
.await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -2961,9 +3020,8 @@ mod tests {
#[tokio::test]
async fn test_prohibit_branch_creation_on_pre_initdb_lsn() -> anyhow::Result<()> {
let tenant = TenantHarness::create("test_prohibit_branch_creation_on_pre_initdb_lsn")?
.load()
.await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0x50), DEFAULT_PG_VERSION)?
@@ -3012,9 +3070,8 @@ mod tests {
#[tokio::test]
async fn test_retain_data_in_parent_which_is_needed_for_child() -> anyhow::Result<()> {
let tenant = TenantHarness::create("test_retain_data_in_parent_which_is_needed_for_child")?
.load()
.await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -3030,18 +3087,14 @@ mod tests {
tenant
.gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO)
.await?;
assert!(newtline
.get(*TEST_KEY, Lsn(0x25))
.no_ondemand_download()
.is_ok());
assert!(newtline.get(*TEST_KEY, Lsn(0x25)).await.is_ok());
Ok(())
}
#[tokio::test]
async fn test_parent_keeps_data_forever_after_branching() -> anyhow::Result<()> {
let tenant = TenantHarness::create("test_parent_keeps_data_forever_after_branching")?
.load()
.await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -3063,7 +3116,7 @@ mod tests {
// Check that the data is still accessible on the branch.
assert_eq!(
newtline.get(*TEST_KEY, Lsn(0x50)).no_ondemand_download()?,
newtline.get(*TEST_KEY, Lsn(0x50)).await?,
TEST_IMG(&format!("foo at {}", Lsn(0x40)))
);
@@ -3072,8 +3125,7 @@ mod tests {
#[tokio::test]
async fn timeline_load() -> anyhow::Result<()> {
const TEST_NAME: &str = "timeline_load";
let harness = TenantHarness::create(TEST_NAME)?;
let harness = TenantHarness::new()?;
{
let tenant = harness.load().await;
let tline = tenant
@@ -3092,8 +3144,7 @@ mod tests {
#[tokio::test]
async fn timeline_load_with_ancestor() -> anyhow::Result<()> {
const TEST_NAME: &str = "timeline_load_with_ancestor";
let harness = TenantHarness::create(TEST_NAME)?;
let harness = TenantHarness::new()?;
// create two timelines
{
let tenant = harness.load().await;
@@ -3131,8 +3182,7 @@ mod tests {
#[tokio::test]
async fn corrupt_metadata() -> anyhow::Result<()> {
const TEST_NAME: &str = "corrupt_metadata";
let harness = TenantHarness::create(TEST_NAME)?;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
tenant
@@ -3173,7 +3223,8 @@ mod tests {
#[tokio::test]
async fn test_images() -> anyhow::Result<()> {
let tenant = TenantHarness::create("test_images")?.load().await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -3211,23 +3262,23 @@ mod tests {
tline.compact().await?;
assert_eq!(
tline.get(*TEST_KEY, Lsn(0x10)).no_ondemand_download()?,
tline.get(*TEST_KEY, Lsn(0x10)).await?,
TEST_IMG("foo at 0x10")
);
assert_eq!(
tline.get(*TEST_KEY, Lsn(0x1f)).no_ondemand_download()?,
tline.get(*TEST_KEY, Lsn(0x1f)).await?,
TEST_IMG("foo at 0x10")
);
assert_eq!(
tline.get(*TEST_KEY, Lsn(0x20)).no_ondemand_download()?,
tline.get(*TEST_KEY, Lsn(0x20)).await?,
TEST_IMG("foo at 0x20")
);
assert_eq!(
tline.get(*TEST_KEY, Lsn(0x30)).no_ondemand_download()?,
tline.get(*TEST_KEY, Lsn(0x30)).await?,
TEST_IMG("foo at 0x30")
);
assert_eq!(
tline.get(*TEST_KEY, Lsn(0x40)).no_ondemand_download()?,
tline.get(*TEST_KEY, Lsn(0x40)).await?,
TEST_IMG("foo at 0x40")
);
@@ -3240,7 +3291,8 @@ mod tests {
//
#[tokio::test]
async fn test_bulk_insert() -> anyhow::Result<()> {
let tenant = TenantHarness::create("test_bulk_insert")?.load().await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -3284,7 +3336,8 @@ mod tests {
#[tokio::test]
async fn test_random_updates() -> anyhow::Result<()> {
let tenant = TenantHarness::create("test_random_updates")?.load().await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -3337,7 +3390,7 @@ mod tests {
for (blknum, last_lsn) in updated.iter().enumerate() {
test_key.field6 = blknum as u32;
assert_eq!(
tline.get(test_key, lsn).no_ondemand_download()?,
tline.get(test_key, lsn).await?,
TEST_IMG(&format!("{} at {}", blknum, last_lsn))
);
}
@@ -3357,9 +3410,8 @@ mod tests {
#[tokio::test]
async fn test_traverse_branches() -> anyhow::Result<()> {
let tenant = TenantHarness::create("test_traverse_branches")?
.load()
.await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let mut tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -3423,7 +3475,7 @@ mod tests {
for (blknum, last_lsn) in updated.iter().enumerate() {
test_key.field6 = blknum as u32;
assert_eq!(
tline.get(test_key, lsn).no_ondemand_download()?,
tline.get(test_key, lsn).await?,
TEST_IMG(&format!("{} at {}", blknum, last_lsn))
);
}
@@ -3443,9 +3495,8 @@ mod tests {
#[tokio::test]
async fn test_traverse_ancestors() -> anyhow::Result<()> {
let tenant = TenantHarness::create("test_traverse_ancestors")?
.load()
.await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let mut tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION)?
.initialize()?;
@@ -3498,7 +3549,7 @@ mod tests {
println!("checking [{idx}][{blknum}] at {lsn}");
test_key.field6 = blknum as u32;
assert_eq!(
tline.get(test_key, *lsn).no_ondemand_download()?,
tline.get(test_key, *lsn).await?,
TEST_IMG(&format!("{idx} {blknum} at {lsn}"))
);
}

View File

@@ -76,7 +76,7 @@ impl EphemeralFile {
})
}
fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), io::Error> {
fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> io::Result<()> {
let mut off = 0;
while off < PAGE_SZ {
let n = self
@@ -277,7 +277,7 @@ impl Drop for EphemeralFile {
}
}
pub fn writeback(file_id: u64, blkno: u32, buf: &[u8]) -> Result<(), io::Error> {
pub fn writeback(file_id: u64, blkno: u32, buf: &[u8]) -> io::Result<()> {
if let Some(file) = EPHEMERAL_FILES.read().unwrap().files.get(&file_id) {
match file.write_all_at(buf, blkno as u64 * PAGE_SZ as u64) {
Ok(_) => Ok(()),
@@ -332,25 +332,17 @@ mod tests {
use super::*;
use crate::tenant::blob_io::{BlobCursor, BlobWriter};
use crate::tenant::block_io::BlockCursor;
use crate::tenant::harness::TenantHarness;
use rand::{seq::SliceRandom, thread_rng, RngCore};
use std::fs;
use std::str::FromStr;
fn harness(
test_name: &str,
) -> Result<(&'static PageServerConf, TenantId, TimelineId), io::Error> {
let repo_dir = PageServerConf::test_repo_dir(test_name);
let _ = fs::remove_dir_all(&repo_dir);
let conf = PageServerConf::dummy_conf(repo_dir);
// Make a static copy of the config. This can never be free'd, but that's
// OK in a test.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
let tenant_id = TenantId::from_str("11000000000000000000000000000000").unwrap();
fn harness() -> Result<(TenantHarness, TimelineId), io::Error> {
let harness = TenantHarness::new().expect("Failed to create tenant harness");
let timeline_id = TimelineId::from_str("22000000000000000000000000000000").unwrap();
fs::create_dir_all(conf.timeline_path(&timeline_id, &tenant_id))?;
fs::create_dir_all(harness.timeline_path(&timeline_id))?;
Ok((conf, tenant_id, timeline_id))
Ok((harness, timeline_id))
}
// Helper function to slurp contents of a file, starting at the current position,
@@ -367,10 +359,10 @@ mod tests {
}
#[test]
fn test_ephemeral_files() -> Result<(), io::Error> {
let (conf, tenant_id, timeline_id) = harness("ephemeral_files")?;
fn test_ephemeral_files() -> io::Result<()> {
let (harness, timeline_id) = harness()?;
let file_a = EphemeralFile::create(conf, tenant_id, timeline_id)?;
let file_a = EphemeralFile::create(harness.conf, harness.tenant_id, timeline_id)?;
file_a.write_all_at(b"foo", 0)?;
assert_eq!("foo", read_string(&file_a, 0, 20)?);
@@ -381,7 +373,7 @@ mod tests {
// Open a lot of files, enough to cause some page evictions.
let mut efiles = Vec::new();
for fileno in 0..100 {
let efile = EphemeralFile::create(conf, tenant_id, timeline_id)?;
let efile = EphemeralFile::create(harness.conf, harness.tenant_id, timeline_id)?;
efile.write_all_at(format!("file {}", fileno).as_bytes(), 0)?;
assert_eq!(format!("file {}", fileno), read_string(&efile, 0, 10)?);
efiles.push((fileno, efile));
@@ -398,10 +390,10 @@ mod tests {
}
#[test]
fn test_ephemeral_blobs() -> Result<(), io::Error> {
let (conf, tenant_id, timeline_id) = harness("ephemeral_blobs")?;
fn test_ephemeral_blobs() -> io::Result<()> {
let (harness, timeline_id) = harness()?;
let mut file = EphemeralFile::create(conf, tenant_id, timeline_id)?;
let mut file = EphemeralFile::create(harness.conf, harness.tenant_id, timeline_id)?;
let pos_foo = file.write_blob(b"foo")?;
assert_eq!(b"foo", file.block_cursor().read_blob(pos_foo)?.as_slice());

View File

@@ -26,7 +26,7 @@ use std::sync::Arc;
use tracing::*;
use utils::lsn::Lsn;
use super::storage_layer::{InMemoryLayer, InMemoryOrHistoricLayer, Layer};
use super::storage_layer::{InMemoryLayer, Layer};
///
/// LayerMap tracks what layers exist on a timeline.
@@ -241,8 +241,7 @@ where
/// Return value of LayerMap::search
pub struct SearchResult<L: ?Sized> {
// FIXME: I wish this could be Arc<dyn Layer>. But I couldn't make that work.
pub layer: InMemoryOrHistoricLayer<L>,
pub layer: Arc<L>,
pub lsn_floor: Lsn,
}
@@ -261,32 +260,10 @@ where
/// contain the version, even if it's missing from the returned
/// layer.
///
/// NOTE: This only searches the 'historic' layers, *not* the
/// 'open' and 'frozen' layers!
///
pub fn search(&self, key: Key, end_lsn: Lsn) -> Option<SearchResult<L>> {
// First check if an open or frozen layer matches
if let Some(open_layer) = &self.open_layer {
let start_lsn = open_layer.get_lsn_range().start;
if end_lsn > start_lsn {
return Some(SearchResult {
layer: InMemoryOrHistoricLayer::InMemory(Arc::clone(open_layer)),
lsn_floor: start_lsn,
});
}
}
for frozen_layer in self.frozen_layers.iter().rev() {
let start_lsn = frozen_layer.get_lsn_range().start;
if end_lsn > start_lsn {
return Some(SearchResult {
layer: InMemoryOrHistoricLayer::InMemory(Arc::clone(frozen_layer)),
lsn_floor: start_lsn,
});
}
}
self.search_historic(key, end_lsn)
}
fn search_historic(&self, key: Key, end_lsn: Lsn) -> Option<SearchResult<L>> {
// linear search
// Find the latest image layer that covers the given key
let mut latest_img: Option<Arc<L>> = None;
let mut latest_img_lsn: Option<Lsn> = None;
@@ -311,7 +288,7 @@ where
if Lsn(img_lsn.0 + 1) == end_lsn {
// found exact match
return Some(SearchResult {
layer: InMemoryOrHistoricLayer::Historic(Arc::clone(l)),
layer: Arc::clone(l),
lsn_floor: img_lsn,
});
}
@@ -374,13 +351,13 @@ where
);
Some(SearchResult {
lsn_floor,
layer: InMemoryOrHistoricLayer::Historic(l),
layer: l,
})
} else if let Some(l) = latest_img {
trace!("found img layer and no deltas for request on {key} at {end_lsn}");
Some(SearchResult {
lsn_floor: latest_img_lsn.unwrap(),
layer: InMemoryOrHistoricLayer::Historic(l),
layer: l,
})
} else {
trace!("no layer found for request on {key} at {end_lsn}");
@@ -388,11 +365,27 @@ where
}
}
/// Whether a layer is L0.
///
/// A layer is L0 if it's sparse enough that compacting it
/// is always more efficient than reimaging over it.
fn is_l0(layer: &L) -> bool {
// 1 TB key range, very conservative
let max_l0_range: i128 = 1024 * 1024 * 1024 * 1024 / (postgres_ffi::BLCKSZ as i128);
let kr = layer.get_key_range();
// TODO Use `collect_keyspace` result to only count used space covered.
// Otherwise there's a chance that with the current `max_l0_range`
// all layers will be classified as l0.
// Divide by 10 to avoid overflow
let size = kr.end.to_i128() / 10 - kr.start.to_i128() / 10;
size > max_l0_range / 10
}
///
/// Insert an on-disk layer
///
pub fn insert_historic(&mut self, layer: Arc<L>) {
if layer.get_key_range() == (Key::MIN..Key::MAX) {
if Self::is_l0(&layer) {
self.l0_delta_layers.push(layer.clone());
}
self.historic_layers.insert(LayerRTreeObject::new(layer));
@@ -405,7 +398,7 @@ where
/// This should be called when the corresponding file on disk has been deleted.
///
pub fn remove_historic(&mut self, layer: Arc<L>) {
if layer.get_key_range() == (Key::MIN..Key::MAX) {
if Self::is_l0(&layer) {
let len_before = self.l0_delta_layers.len();
// FIXME: ptr_eq might fail to return true for 'dyn'
@@ -594,9 +587,7 @@ where
// We ignore level0 delta layers. Unless the whole keyspace fits
// into one partition
if !range_eq(key_range, &(Key::MIN..Key::MAX))
&& range_eq(&l.get_key_range(), &(Key::MIN..Key::MAX))
{
if !range_eq(key_range, &(Key::MIN..Key::MAX)) && Self::is_l0(&l) {
continue;
}

View File

@@ -430,7 +430,7 @@ where
Err(e) => {
let tenants_accessor = TENANTS.read().await;
match tenants_accessor.get(&tenant_id) {
Some(tenant) => tenant.set_broken(),
Some(tenant) => tenant.set_broken(&e.to_string()),
None => warn!("Tenant {tenant_id} got removed from memory"),
}
Err(e)
@@ -492,3 +492,53 @@ pub async fn immediate_gc(
Ok(wait_task_done)
}
#[cfg(feature = "testing")]
pub async fn immediate_compact(
tenant_id: TenantId,
timeline_id: TimelineId,
) -> Result<tokio::sync::oneshot::Receiver<anyhow::Result<()>>, ApiError> {
let guard = TENANTS.read().await;
let tenant = guard
.get(&tenant_id)
.map(Arc::clone)
.with_context(|| format!("Tenant {tenant_id} not found"))
.map_err(ApiError::NotFound)?;
let timeline = tenant
.get_timeline(timeline_id, true)
.map_err(ApiError::NotFound)?;
// Run in task_mgr to avoid race with detach operation
let (task_done, wait_task_done) = tokio::sync::oneshot::channel();
task_mgr::spawn(
&tokio::runtime::Handle::current(),
TaskKind::Compaction,
Some(tenant_id),
Some(timeline_id),
&format!(
"timeline_compact_handler compaction run for tenant {tenant_id} timeline {timeline_id}"
),
false,
async move {
let result = timeline
.compact()
.instrument(
info_span!("manual_compact", tenant = %tenant_id, timeline = %timeline_id),
)
.await;
match task_done.send(result) {
Ok(_) => (),
Err(result) => error!("failed to send compaction result: {result:?}"),
}
Ok(())
},
);
// drop the guard until after we've spawned the task so that timeline shutdown will wait for the task
drop(guard);
Ok(wait_task_done)
}

View File

@@ -16,7 +16,7 @@
//! unless the pageserver is configured without remote storage.
//!
//! We allocate the client instance in [Timeline][`crate::tenant::Timeline`], i.e.,
//! either in [`crate::tenant_mgr`] during startup or when creating a new
//! either in [`crate::tenant::mgr`] during startup or when creating a new
//! timeline.
//! However, the client does not become ready for use until we've initialized its upload queue:
//!
@@ -135,7 +135,7 @@
//! - Initiate upload queue with that [`IndexPart`].
//! - Reschedule all lost operations by comparing the local filesystem state
//! and remote state as per [`IndexPart`]. This is done in
//! [`Timeline::setup_timeline`] and [`Timeline::reconcile_with_remote`].
//! [`Timeline::timeline_init_and_sync`] and [`Timeline::reconcile_with_remote`].
//!
//! Note that if we crash during file deletion between the index update
//! that removes the file from the list of files, and deleting the remote file,
@@ -298,8 +298,8 @@ impl RemoteTimelineClient {
conf: &'static PageServerConf,
tenant_id: TenantId,
timeline_id: TimelineId,
) -> anyhow::Result<RemoteTimelineClient> {
Ok(RemoteTimelineClient {
) -> RemoteTimelineClient {
RemoteTimelineClient {
conf,
runtime: &BACKGROUND_RUNTIME,
tenant_id,
@@ -307,7 +307,7 @@ impl RemoteTimelineClient {
storage_impl: remote_storage,
upload_queue: Mutex::new(UploadQueue::Uninitialized),
metrics: Arc::new(RemoteTimelineClientMetrics::new(&tenant_id, &timeline_id)),
})
}
}
/// Initialize the upload queue for a remote storage that already received
@@ -367,6 +367,10 @@ impl RemoteTimelineClient {
/// Download index file
pub async fn download_index_file(&self) -> Result<IndexPart, DownloadError> {
let _unfinished_gauge_guard = self
.metrics
.call_begin(&RemoteOpFileKind::Index, &RemoteOpKind::Download);
download::download_index_part(
self.conf,
&self.storage_impl,
@@ -393,22 +397,27 @@ impl RemoteTimelineClient {
layer_file_name: &LayerFileName,
layer_metadata: &LayerFileMetadata,
) -> anyhow::Result<u64> {
let downloaded_size = download::download_layer_file(
self.conf,
&self.storage_impl,
self.tenant_id,
self.timeline_id,
layer_file_name,
layer_metadata,
)
.measure_remote_op(
self.tenant_id,
self.timeline_id,
RemoteOpFileKind::Layer,
RemoteOpKind::Download,
Arc::clone(&self.metrics),
)
.await?;
let downloaded_size = {
let _unfinished_gauge_guard = self
.metrics
.call_begin(&RemoteOpFileKind::Layer, &RemoteOpKind::Download);
download::download_layer_file(
self.conf,
&self.storage_impl,
self.tenant_id,
self.timeline_id,
layer_file_name,
layer_metadata,
)
.measure_remote_op(
self.tenant_id,
self.timeline_id,
RemoteOpFileKind::Layer,
RemoteOpKind::Download,
Arc::clone(&self.metrics),
)
.await?
};
// Update the metadata for given layer file. The remote index file
// might be missing some information for the file; this allows us
@@ -517,7 +526,7 @@ impl RemoteTimelineClient {
metadata_bytes,
);
let op = UploadOp::UploadMetadata(index_part, disk_consistent_lsn);
self.update_upload_queue_unfinished_metric(1, &op);
self.calls_unfinished_metric_begin(&op);
upload_queue.queued_operations.push_back(op);
upload_queue.latest_files_changes_since_metadata_upload_scheduled = 0;
@@ -549,7 +558,7 @@ impl RemoteTimelineClient {
upload_queue.latest_files_changes_since_metadata_upload_scheduled += 1;
let op = UploadOp::UploadLayer(layer_file_name.clone(), layer_metadata.clone());
self.update_upload_queue_unfinished_metric(1, &op);
self.calls_unfinished_metric_begin(&op);
upload_queue.queued_operations.push_back(op);
info!(
@@ -601,7 +610,7 @@ impl RemoteTimelineClient {
// schedule the actual deletions
for name in names {
let op = UploadOp::Delete(RemoteOpFileKind::Layer, name.clone());
self.update_upload_queue_unfinished_metric(1, &op);
self.calls_unfinished_metric_begin(&op);
upload_queue.queued_operations.push_back(op);
info!("scheduled layer file deletion {}", name.file_name());
}
@@ -747,13 +756,13 @@ impl RemoteTimelineClient {
// Note: We only check for the shutdown requests between retries, so
// if a shutdown request arrives while we're busy uploading, in the
// upload::upload:*() call below, we will wait not exit until it has
// finisheed. We probably could cancel the upload by simply dropping
// finished. We probably could cancel the upload by simply dropping
// the Future, but we're not 100% sure if the remote storage library
// is cancellation safe, so we don't dare to do that. Hopefully, the
// upload finishes or times out soon enough.
if task_mgr::is_shutdown_requested() {
info!("upload task cancelled by shutdown request");
self.update_upload_queue_unfinished_metric(-1, &task.op);
self.calls_unfinished_metric_end(&task.op);
self.stop();
return;
}
@@ -901,22 +910,40 @@ impl RemoteTimelineClient {
// Launch any queued tasks that were unblocked by this one.
self.launch_queued_tasks(upload_queue);
}
self.update_upload_queue_unfinished_metric(-1, &task.op);
self.calls_unfinished_metric_end(&task.op);
}
fn update_upload_queue_unfinished_metric(&self, delta: i64, op: &UploadOp) {
let (file_kind, op_kind) = match op {
fn calls_unfinished_metric_impl(
&self,
op: &UploadOp,
) -> Option<(RemoteOpFileKind, RemoteOpKind)> {
let res = match op {
UploadOp::UploadLayer(_, _) => (RemoteOpFileKind::Layer, RemoteOpKind::Upload),
UploadOp::UploadMetadata(_, _) => (RemoteOpFileKind::Index, RemoteOpKind::Upload),
UploadOp::Delete(file_kind, _) => (*file_kind, RemoteOpKind::Delete),
UploadOp::Barrier(_) => {
// we do not account these
return;
return None;
}
};
self.metrics
.unfinished_tasks(&file_kind, &op_kind)
.add(delta)
Some(res)
}
fn calls_unfinished_metric_begin(&self, op: &UploadOp) {
let (file_kind, op_kind) = match self.calls_unfinished_metric_impl(op) {
Some(x) => x,
None => return,
};
let guard = self.metrics.call_begin(&file_kind, &op_kind);
guard.will_decrement_manually(); // in unfinished_ops_metric_end()
}
fn calls_unfinished_metric_end(&self, op: &UploadOp) {
let (file_kind, op_kind) = match self.calls_unfinished_metric_impl(op) {
Some(x) => x,
None => return,
};
self.metrics.call_end(&file_kind, &op_kind);
}
fn stop(&self) {
@@ -967,7 +994,7 @@ impl RemoteTimelineClient {
// Tear down queued ops
for op in qi.queued_operations.into_iter() {
self.update_upload_queue_unfinished_metric(-1, &op);
self.calls_unfinished_metric_end(&op);
// Dropping UploadOp::Barrier() here will make wait_completion() return with an Err()
// which is exactly what we want to happen.
drop(op);
@@ -1037,7 +1064,7 @@ mod tests {
// Test scheduling
#[test]
fn upload_scheduling() -> anyhow::Result<()> {
let harness = TenantHarness::create("upload_scheduling")?;
let harness = TenantHarness::new()?;
let timeline_path = harness.timeline_path(&TIMELINE_ID);
std::fs::create_dir_all(&timeline_path)?;

View File

@@ -8,10 +8,9 @@ use std::future::Future;
use std::path::Path;
use anyhow::{anyhow, Context};
use futures::stream::{FuturesUnordered, StreamExt};
use tokio::fs;
use tokio::io::AsyncWriteExt;
use tracing::{debug, error, info, info_span, warn, Instrument};
use tracing::{error, info, warn};
use crate::config::PageServerConf;
use crate::tenant::storage_layer::LayerFileName;
@@ -175,7 +174,7 @@ pub async fn list_remote_timelines<'a>(
storage: &'a GenericRemoteStorage,
conf: &'static PageServerConf,
tenant_id: TenantId,
) -> anyhow::Result<Vec<(TimelineId, IndexPart)>> {
) -> anyhow::Result<HashSet<TimelineId>> {
let tenant_path = conf.timelines_path(&tenant_id);
let tenant_storage_path = conf.remote_path(&tenant_path)?;
@@ -194,7 +193,6 @@ pub async fn list_remote_timelines<'a>(
}
let mut timeline_ids = HashSet::new();
let mut part_downloads = FuturesUnordered::new();
for timeline_remote_storage_key in timelines {
let object_name = timeline_remote_storage_key.object_name().ok_or_else(|| {
@@ -205,35 +203,22 @@ pub async fn list_remote_timelines<'a>(
format!("failed to parse object name into timeline id '{object_name}'")
})?;
// list_prefixes returns all files with the prefix. If we haven't seen this timeline ID
// yet, launch a download task for it.
if !timeline_ids.contains(&timeline_id) {
timeline_ids.insert(timeline_id);
let storage_clone = storage.clone();
part_downloads.push(async move {
(
timeline_id,
download_index_part(conf, &storage_clone, tenant_id, timeline_id)
.instrument(info_span!("download_index_part", timeline=%timeline_id))
.await,
)
});
}
// list_prefixes is assumed to return unique names. Ensure this here.
// NB: it's safer to bail out than warn-log this because the pageserver
// needs to absolutely know about _all_ timelines that exist, so that
// GC knows all the branchpoints. If we skipped over a timeline instead,
// GC could delete a layer that's still needed by that timeline.
anyhow::ensure!(
!timeline_ids.contains(&timeline_id),
"list_prefixes contains duplicate timeline id {timeline_id}"
);
timeline_ids.insert(timeline_id);
}
// Wait for all the download tasks to complete.
let mut timeline_parts = Vec::new();
while let Some((timeline_id, part_upload_result)) = part_downloads.next().await {
let index_part = part_upload_result
.with_context(|| format!("Failed to fetch index part for timeline {timeline_id}"))?;
debug!("Successfully fetched index part for timeline {timeline_id}");
timeline_parts.push((timeline_id, index_part));
}
Ok(timeline_parts)
Ok(timeline_ids)
}
pub async fn download_index_part(
pub(super) async fn download_index_part(
conf: &'static PageServerConf,
storage: &GenericRemoteStorage,
tenant_id: TenantId,

View File

@@ -44,6 +44,116 @@ struct TimelineInputs {
next_gc_cutoff: Lsn,
}
// Adjust BranchFrom sorting so that we always process ancestor
// before descendants. This is needed to correctly calculate size of
// descendant timelines.
//
// Note that we may have multiple BranchFroms at the same LSN, so we
// need to sort them in the tree order.
//
// see updates_sort_with_branches_at_same_lsn test below
fn sort_updates_in_tree_order(updates: Vec<Update>) -> anyhow::Result<Vec<Update>> {
let mut sorted_updates = Vec::with_capacity(updates.len());
let mut known_timelineids = HashSet::new();
let mut i = 0;
while i < updates.len() {
let curr_upd = &updates[i];
if let Command::BranchFrom(parent_id) = curr_upd.command {
let parent_id = match parent_id {
Some(parent_id) if known_timelineids.contains(&parent_id) => {
// we have already processed ancestor
// process this BranchFrom Update normally
known_timelineids.insert(curr_upd.timeline_id);
sorted_updates.push(*curr_upd);
i += 1;
continue;
}
None => {
known_timelineids.insert(curr_upd.timeline_id);
sorted_updates.push(*curr_upd);
i += 1;
continue;
}
Some(parent_id) => parent_id,
};
let mut j = i;
// we have not processed ancestor yet.
// there is a chance that it is at the same Lsn
if !known_timelineids.contains(&parent_id) {
let mut curr_lsn_branchfroms: HashMap<TimelineId, Vec<(TimelineId, usize)>> =
HashMap::new();
// inspect all branchpoints at the same lsn
while j < updates.len() && updates[j].lsn == curr_upd.lsn {
let lookahead_upd = &updates[j];
j += 1;
if let Command::BranchFrom(lookahead_parent_id) = lookahead_upd.command {
match lookahead_parent_id {
Some(lookahead_parent_id)
if !known_timelineids.contains(&lookahead_parent_id) =>
{
// we have not processed ancestor yet
// store it for later
let es =
curr_lsn_branchfroms.entry(lookahead_parent_id).or_default();
es.push((lookahead_upd.timeline_id, j));
}
_ => {
// we have already processed ancestor
// process this BranchFrom Update normally
known_timelineids.insert(lookahead_upd.timeline_id);
sorted_updates.push(*lookahead_upd);
}
}
}
}
// process BranchFroms in the tree order
// check that we don't have a cycle if somet entry is orphan
// (this should not happen, but better to be safe)
let mut processed_some_entry = true;
while processed_some_entry {
processed_some_entry = false;
curr_lsn_branchfroms.retain(|parent_id, branchfroms| {
if known_timelineids.contains(parent_id) {
for (timeline_id, j) in branchfroms {
known_timelineids.insert(*timeline_id);
sorted_updates.push(updates[*j - 1]);
}
processed_some_entry = true;
false
} else {
true
}
});
}
if !curr_lsn_branchfroms.is_empty() {
// orphans are expected to be rare and transient between tenant reloads
// for example, an broken ancestor without the child branch being broken.
anyhow::bail!(
"orphan branch(es) detected in BranchFroms: {curr_lsn_branchfroms:?}"
);
}
}
assert!(j > i);
i = j;
} else {
// not a BranchFrom, keep the same order
sorted_updates.push(*curr_upd);
i += 1;
}
}
Ok(sorted_updates)
}
/// Gathers the inputs for the tenant sizing model.
///
/// Tenant size does not consider the latest state, but only the state until next_gc_cutoff, which
@@ -267,7 +377,11 @@ pub(super) async fn gather_inputs(
// for branch points, which come as multiple updates at the same LSN, the Command::Update
// is needed before a branch is made out of that branch Command::BranchFrom. this is
// handled by the variant order in `Command`.
//
updates.sort_unstable();
// And another sort to handle Command::BranchFrom ordering
// in case when there are multiple branches at the same LSN.
let sorted_updates = sort_updates_in_tree_order(updates)?;
let retention_period = match max_cutoff_distance {
Some(max) => max.0,
@@ -277,7 +391,7 @@ pub(super) async fn gather_inputs(
};
Ok(ModelInputs {
updates,
updates: sorted_updates,
retention_period,
timeline_inputs,
})
@@ -295,6 +409,7 @@ impl ModelInputs {
command: op,
timeline_id,
} = update;
let Lsn(now) = *lsn;
match op {
Command::Update(sz) => {
@@ -304,7 +419,8 @@ impl ModelInputs {
storage.insert_point(&Some(*timeline_id), "".into(), now, None);
}
Command::BranchFrom(parent) => {
storage.branch(parent, Some(*timeline_id));
// This branch command may fail if it cannot find a parent to branch from.
storage.branch(parent, Some(*timeline_id))?;
}
}
}
@@ -372,6 +488,7 @@ async fn calculate_logical_size(
let size_res = timeline
.spawn_ondemand_logical_size_calculation(lsn)
.instrument(info_span!("spawn_ondemand_logical_size_calculation"))
.await?;
Ok(TimelineAtLsnSizeResult(timeline, lsn, size_res))
}
@@ -463,3 +580,137 @@ fn verify_size_for_multiple_branches() {
assert_eq!(inputs.calculate().unwrap(), 36_409_872);
}
#[test]
fn updates_sort_with_branches_at_same_lsn() {
use std::str::FromStr;
use Command::{BranchFrom, EndOfBranch};
macro_rules! lsn {
($e:expr) => {
Lsn::from_str($e).unwrap()
};
}
let ids = [
TimelineId::from_str("00000000000000000000000000000000").unwrap(),
TimelineId::from_str("11111111111111111111111111111111").unwrap(),
TimelineId::from_str("22222222222222222222222222222222").unwrap(),
TimelineId::from_str("33333333333333333333333333333333").unwrap(),
TimelineId::from_str("44444444444444444444444444444444").unwrap(),
];
// issue https://github.com/neondatabase/neon/issues/3179
let commands = vec![
Update {
lsn: lsn!("0/0"),
command: BranchFrom(None),
timeline_id: ids[0],
},
Update {
lsn: lsn!("0/169AD58"),
command: Command::Update(25387008),
timeline_id: ids[0],
},
// next three are wrongly sorted, because
// ids[1] is branched from before ids[1] exists
// and ids[2] is branched from before ids[2] exists
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[1])),
timeline_id: ids[3],
},
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[0])),
timeline_id: ids[2],
},
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[2])),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/1CA85B8"),
command: Command::Update(28925952),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/1CD85B8"),
command: Command::Update(29024256),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/1CD85B8"),
command: BranchFrom(Some(ids[1])),
timeline_id: ids[4],
},
Update {
lsn: lsn!("0/22DCE70"),
command: Command::Update(32546816),
timeline_id: ids[3],
},
Update {
lsn: lsn!("0/230CE70"),
command: EndOfBranch,
timeline_id: ids[3],
},
];
let expected = vec![
Update {
lsn: lsn!("0/0"),
command: BranchFrom(None),
timeline_id: ids[0],
},
Update {
lsn: lsn!("0/169AD58"),
command: Command::Update(25387008),
timeline_id: ids[0],
},
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[0])),
timeline_id: ids[2],
},
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[2])),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[1])),
timeline_id: ids[3],
},
Update {
lsn: lsn!("0/1CA85B8"),
command: Command::Update(28925952),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/1CD85B8"),
command: Command::Update(29024256),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/1CD85B8"),
command: BranchFrom(Some(ids[1])),
timeline_id: ids[4],
},
Update {
lsn: lsn!("0/22DCE70"),
command: Command::Update(32546816),
timeline_id: ids[3],
},
Update {
lsn: lsn!("0/230CE70"),
command: EndOfBranch,
timeline_id: ids[3],
},
];
let sorted_commands = sort_updates_in_tree_order(commands).unwrap();
assert_eq!(sorted_commands, expected);
}

View File

@@ -109,7 +109,7 @@ pub trait Layer: Send + Sync {
/// See PageReconstructResult for possible return values. The collected data
/// is appended to reconstruct_data; the caller should pass an empty struct
/// on first call, or a struct with a cached older image of the page if one
/// is available. If this returns PageReconstructResult::Continue, look up
/// is available. If this returns ValueReconstructResult::Continue, look up
/// the predecessor layer and call again with the same 'reconstruct_data' to
/// collect more data.
fn get_value_reconstruct_data(
@@ -196,38 +196,3 @@ pub fn downcast_remote_layer(
None
}
}
pub enum InMemoryOrHistoricLayer<L: ?Sized> {
InMemory(Arc<InMemoryLayer>),
Historic(Arc<L>),
}
impl<L: ?Sized> InMemoryOrHistoricLayer<L>
where
L: PersistentLayer,
{
pub fn downcast_remote_layer(&self) -> Option<std::sync::Arc<RemoteLayer>> {
match self {
Self::InMemory(_) => None,
Self::Historic(l) => {
if l.is_remote_layer() {
Arc::clone(l).downcast_remote_layer()
} else {
None
}
}
}
}
pub fn get_value_reconstruct_data(
&self,
key: Key,
lsn_range: Range<Lsn>,
reconstruct_data: &mut ValueReconstructState,
) -> Result<ValueReconstructResult> {
match self {
Self::InMemory(l) => l.get_value_reconstruct_data(key, lsn_range, reconstruct_data),
Self::Historic(l) => l.get_value_reconstruct_data(key, lsn_range, reconstruct_data),
}
}
}

View File

@@ -83,7 +83,7 @@ async fn compaction_loop(tenant_id: TenantId) {
tokio::select! {
_ = task_mgr::shutdown_watcher() => {
info!("received cancellation request during idling");
break ;
break;
},
_ = tokio::time::sleep(sleep_duration) => {},
}

View File

@@ -3,12 +3,12 @@
use anyhow::{anyhow, bail, ensure, Context};
use bytes::Bytes;
use fail::fail_point;
use futures::stream::FuturesUnordered;
use futures::StreamExt;
use itertools::Itertools;
use once_cell::sync::OnceCell;
use pageserver_api::models::{
DownloadRemoteLayersTaskInfo, DownloadRemoteLayersTaskState, TimelineState,
DownloadRemoteLayersTaskInfo, DownloadRemoteLayersTaskSpawnRequest,
DownloadRemoteLayersTaskState, TimelineState,
};
use tokio::sync::{oneshot, watch, Semaphore, TryAcquireError};
use tokio_util::sync::CancellationToken;
@@ -25,8 +25,8 @@ use std::time::{Duration, Instant, SystemTime};
use crate::tenant::remote_timeline_client::{self, index::LayerFileMetadata};
use crate::tenant::storage_layer::{
DeltaFileName, DeltaLayerWriter, ImageFileName, ImageLayerWriter, InMemoryLayer,
InMemoryOrHistoricLayer, LayerFileName, RemoteLayer,
DeltaFileName, DeltaLayerWriter, ImageFileName, ImageLayerWriter, InMemoryLayer, LayerFileName,
RemoteLayer,
};
use crate::tenant::{
ephemeral_file::is_ephemeral_file,
@@ -193,22 +193,29 @@ pub struct Timeline {
}
/// Internal structure to hold all data needed for logical size calculation.
/// Calculation consists of two parts:
/// 1. Initial size calculation. That might take a long time, because it requires
/// reading all layers containing relation sizes up to the `initial_part_end`.
///
/// Calculation consists of two stages:
///
/// 1. Initial size calculation. That might take a long time, because it requires
/// reading all layers containing relation sizes at `initial_part_end`.
///
/// 2. Collecting an incremental part and adding that to the initial size.
/// Increments are appended on walreceiver writing new timeline data,
/// which result in increase or decrease of the logical size.
struct LogicalSize {
/// Size, potentially slow to compute, derived from all layers located locally on this node's FS.
/// Might require reading multiple layers, and even ancestor's layers, to collect the size.
/// Size, potentially slow to compute. Calculating this might require reading multiple
/// layers, and even ancestor's layers.
///
/// NOTE: initial size is not a constant and will change between restarts.
/// NOTE: size at a given LSN is constant, but after a restart we will calculate
/// the initial size at a different LSN.
initial_logical_size: OnceCell<u64>,
/// Semaphore to track ongoing calculation of `initial_logical_size`.
initial_size_computation: Arc<tokio::sync::Semaphore>,
/// Latest Lsn that has its size uncalculated, could be absent for freshly created timelines.
initial_part_end: Option<Lsn>,
/// All other size changes after startup, combined together.
///
/// Size shouldn't ever be negative, but this is signed for two reasons:
@@ -335,43 +342,6 @@ pub struct WalReceiverInfo {
pub last_received_msg_ts: u128,
}
/// Like `?`, but for [`PageReconstructResult`].
/// Use it to bubble up the `NeedsDownload` and `Error` to the caller.
///
/// Once `std::ops::Try` is stabilized, we should use it instead of this macro.
#[macro_export]
macro_rules! try_no_ondemand_download {
($result:expr) => {{
let result = $result;
match result {
PageReconstructResult::Success(value) => value,
PageReconstructResult::NeedsDownload(timeline, layer) => {
return PageReconstructResult::NeedsDownload(timeline, layer);
}
PageReconstructResult::Error(e) => return PageReconstructResult::Error(e),
}
}};
}
/// Replacement for `?` in functions that return [`PageReconstructResult`].
///
/// Given an `expr: Result<T, E>`, use `try_page_reconstruct_result!(expr)`
/// instead of `(expr)?`.
/// If `expr` is `Ok(v)`, the macro evaluates to `v`.
/// If `expr` is `Err(e)`, the macro returns `PageReconstructResult::Error(e.into())`.
///
/// Once `std::ops::Try` is stabilized, we should use it instead of this macro.
#[macro_export]
macro_rules! try_page_reconstruct_result {
($result:expr) => {{
let result = $result;
match result {
Ok(v) => v,
Err(e) => return PageReconstructResult::from(e),
}
}};
}
///
/// Information about how much history needs to be retained, needed by
/// Garbage Collection.
@@ -401,21 +371,13 @@ pub struct GcInfo {
pub pitr_cutoff: Lsn,
}
pub enum PageReconstructResult<T> {
Success(T),
/// The given RemoteLayer needs to be downloaded and replaced in the timeline's layer map
/// for the operation to succeed. Use [`Timeline::download_remote_layer`] to do it, then
/// retry the operation that returned this error.
NeedsDownload(Weak<Timeline>, Weak<RemoteLayer>),
Error(PageReconstructError),
}
/// An error happened in a get() operation.
#[derive(thiserror::Error)]
pub enum PageReconstructError {
#[error(transparent)]
Other(#[from] anyhow::Error), // source and Display delegate to anyhow::Error
/// An error happened replaying WAL records
#[error(transparent)]
WalRedo(#[from] crate::walredo::WalRedoError),
}
@@ -429,49 +391,6 @@ impl std::fmt::Debug for PageReconstructError {
}
}
/// This impl makes it so you can substitute return type
/// `Result<T, E>` with `PageReconstructError<T>` in functions
/// and existing `?` will generally continue to work.
/// The reason why thanks to
/// anyhow::Error that `(some error type)ensures that exis
impl<E, T> From<E> for PageReconstructResult<T>
where
E: Into<PageReconstructError>,
{
fn from(e: E) -> Self {
Self::Error(e.into())
}
}
impl<T> PageReconstructResult<T> {
/// Treat the need for on-demand download as an error.
///
/// **Avoid this function in new code** if you can help it,
/// as on-demand download will become the norm in the future,
/// especially once we implement layer file eviction.
///
/// If you are in an async function, use [`with_ondemand_download`]
/// to do the download right here.
///
/// If you are in a sync function, change its return type from
/// `Result<T, E>` to `PageReconstructResult<T>` and bubble up
/// the non-success cases of `PageReconstructResult<T>` to the caller.
/// This gives them a chance to do the download and retry.
/// Consider using [`try_no_ondemand_download`] for convenience.
///
/// For more background, read the comment on [`with_ondemand_download`].
pub fn no_ondemand_download(self) -> anyhow::Result<T> {
match self {
PageReconstructResult::Success(value) => Ok(value),
// TODO print more info about the timeline
PageReconstructResult::NeedsDownload(_, _) => anyhow::bail!("Layer needs downloading"),
PageReconstructResult::Error(e) => {
Err(anyhow::Error::new(e).context("Failed to reconstruct the page"))
}
}
}
}
/// Public interface functions
impl Timeline {
/// Get the LSN where this branch was created
@@ -493,15 +412,19 @@ impl Timeline {
/// Look up given page version.
///
/// NOTE: It is considered an error to 'get' a key that doesn't exist. The abstraction
/// above this needs to store suitable metadata to track what data exists with
/// what keys, in separate metadata entries. If a non-existent key is requested,
/// the Repository implementation may incorrectly return a value from an ancestor
/// branch, for example, or waste a lot of cycles chasing the non-existing key.
/// If a remote layer file is needed, it is downloaded as part of this
/// call.
///
pub fn get(&self, key: Key, lsn: Lsn) -> PageReconstructResult<Bytes> {
/// NOTE: It is considered an error to 'get' a key that doesn't exist. The
/// abstraction above this needs to store suitable metadata to track what
/// data exists with what keys, in separate metadata entries. If a
/// non-existent key is requested, we may incorrectly return a value from
/// an ancestor branch, for example, or waste a lot of cycles chasing the
/// non-existing key.
///
pub async fn get(&self, key: Key, lsn: Lsn) -> Result<Bytes, PageReconstructError> {
if !lsn.is_valid() {
return PageReconstructResult::from(anyhow!("Invalid LSN"));
return Err(PageReconstructError::Other(anyhow::anyhow!("Invalid LSN")));
}
// Check the page cache. We will get back the most recent page with lsn <= `lsn`.
@@ -512,7 +435,7 @@ impl Timeline {
Some((cached_lsn, cached_img)) => {
match cached_lsn.cmp(&lsn) {
Ordering::Less => {} // there might be WAL between cached_lsn and lsn, we need to check
Ordering::Equal => return PageReconstructResult::Success(cached_img), // exact LSN match, return the image
Ordering::Equal => return Ok(cached_img), // exact LSN match, return the image
Ordering::Greater => {
unreachable!("the returned lsn should never be after the requested lsn")
}
@@ -527,18 +450,14 @@ impl Timeline {
img: cached_page_img,
};
try_no_ondemand_download!(self.get_reconstruct_data(key, lsn, &mut reconstruct_state));
self.get_reconstruct_data(key, lsn, &mut reconstruct_state)
.await?;
self.metrics
.reconstruct_time_histo
.observe_closure_duration(|| self.reconstruct_value(key, lsn, reconstruct_state))
}
// Like get(), but if a remote layer file is needed, it is downloaded as part of this call.
pub async fn get_download(&self, key: Key, lsn: Lsn) -> anyhow::Result<Bytes> {
with_ondemand_download(|| self.get(key, lsn)).await
}
/// Get last or prev record separately. Same as get_last_record_rlsn().last/prev.
pub fn get_last_record_lsn(&self) -> Lsn {
self.last_record_lsn.load().last
@@ -1511,7 +1430,8 @@ impl Timeline {
/// Calculate the logical size of the database at the latest LSN.
///
/// NOTE: counted incrementally, includes ancestors, this can be a slow operation.
/// NOTE: counted incrementally, includes ancestors. This can be a slow operation,
/// especially if we need to download remote layers.
async fn calculate_logical_size(
&self,
up_to_lsn: Lsn,
@@ -1591,7 +1511,7 @@ trait TraversalLayerExt {
fn traversal_id(&self) -> TraversalId;
}
impl<T: PersistentLayer + ?Sized> TraversalLayerExt for T {
impl TraversalLayerExt for Arc<dyn PersistentLayer> {
fn traversal_id(&self) -> TraversalId {
match self.local_path() {
Some(local_path) => {
@@ -1621,15 +1541,6 @@ impl TraversalLayerExt for Arc<InMemoryLayer> {
}
}
impl TraversalLayerExt for InMemoryOrHistoricLayer<dyn PersistentLayer> {
fn traversal_id(&self) -> String {
match self {
Self::InMemory(l) => l.traversal_id(),
Self::Historic(l) => l.traversal_id(),
}
}
}
impl Timeline {
///
/// Get a handle to a Layer for reading.
@@ -1639,23 +1550,19 @@ impl Timeline {
///
/// This function takes the current timeline's locked LayerMap as an argument,
/// so callers can avoid potential race conditions.
fn get_reconstruct_data(
async fn get_reconstruct_data(
&self,
key: Key,
request_lsn: Lsn,
reconstruct_state: &mut ValueReconstructState,
) -> PageReconstructResult<()> {
) -> Result<(), PageReconstructError> {
// Start from the current timeline.
let mut timeline_owned;
let mut timeline = self;
// For debugging purposes, collect the path of layers that we traversed
// through. It's included in the error message if we fail to find the key.
let mut traversal_path = Vec::<(
ValueReconstructResult,
Lsn,
Box<dyn TraversalLayerExt>,
)>::new();
let mut traversal_path = Vec::<TraversalPathItem>::new();
let cached_lsn = if let Some((cached_lsn, _)) = &reconstruct_state.img {
*cached_lsn
@@ -1675,34 +1582,34 @@ impl Timeline {
// The function should have updated 'state'
//info!("CALLED for {} at {}: {:?} with {} records, cached {}", key, cont_lsn, result, reconstruct_state.records.len(), cached_lsn);
match result {
ValueReconstructResult::Complete => return PageReconstructResult::Success(()),
ValueReconstructResult::Complete => return Ok(()),
ValueReconstructResult::Continue => {
// If we reached an earlier cached page image, we're done.
if cont_lsn == cached_lsn + 1 {
self.metrics.materialized_page_cache_hit_counter.inc_by(1);
return PageReconstructResult::Success(());
return Ok(());
}
if prev_lsn <= cont_lsn {
// Didn't make any progress in last iteration. Error out to avoid
// getting stuck in the loop.
return layer_traversal_error(format!(
return Err(layer_traversal_error(format!(
"could not find layer with more data for key {} at LSN {}, request LSN {}, ancestor {}",
key,
Lsn(cont_lsn.0 - 1),
request_lsn,
timeline.ancestor_lsn
), &traversal_path);
), traversal_path));
}
prev_lsn = cont_lsn;
}
ValueReconstructResult::Missing => {
return layer_traversal_error(
return Err(layer_traversal_error(
format!(
"could not find data for key {} at LSN {}, for request at LSN {}",
key, cont_lsn, request_lsn
),
&traversal_path,
);
traversal_path,
));
}
}
@@ -1715,7 +1622,7 @@ impl Timeline {
);
let ancestor = match timeline.get_ancestor_timeline() {
Ok(timeline) => timeline,
Err(e) => return PageReconstructResult::from(e),
Err(e) => return Err(PageReconstructError::from(e)),
};
timeline_owned = ancestor;
timeline = &*timeline_owned;
@@ -1723,15 +1630,74 @@ impl Timeline {
continue 'outer;
}
loop {
#[allow(clippy::never_loop)] // see comment at bottom of this loop
'layer_map_search: loop {
let remote_layer = {
let layers = timeline.layers.read().unwrap();
// Check the open and frozen in-memory layers first, in order from newest
// to oldest.
if let Some(open_layer) = &layers.open_layer {
let start_lsn = open_layer.get_lsn_range().start;
if cont_lsn > start_lsn {
//info!("CHECKING for {} at {} on open layer {}", key, cont_lsn, open_layer.filename().display());
// Get all the data needed to reconstruct the page version from this layer.
// But if we have an older cached page image, no need to go past that.
let lsn_floor = max(cached_lsn + 1, start_lsn);
result = match open_layer.get_value_reconstruct_data(
key,
lsn_floor..cont_lsn,
reconstruct_state,
) {
Ok(result) => result,
Err(e) => return Err(PageReconstructError::from(e)),
};
cont_lsn = lsn_floor;
traversal_path.push((
result,
cont_lsn,
Box::new({
let open_layer = Arc::clone(open_layer);
move || open_layer.traversal_id()
}),
));
continue 'outer;
}
}
for frozen_layer in layers.frozen_layers.iter().rev() {
let start_lsn = frozen_layer.get_lsn_range().start;
if cont_lsn > start_lsn {
//info!("CHECKING for {} at {} on frozen layer {}", key, cont_lsn, frozen_layer.filename().display());
let lsn_floor = max(cached_lsn + 1, start_lsn);
result = match frozen_layer.get_value_reconstruct_data(
key,
lsn_floor..cont_lsn,
reconstruct_state,
) {
Ok(result) => result,
Err(e) => return Err(PageReconstructError::from(e)),
};
cont_lsn = lsn_floor;
traversal_path.push((
result,
cont_lsn,
Box::new({
let frozen_layer = Arc::clone(frozen_layer);
move || frozen_layer.traversal_id()
}),
));
continue 'outer;
}
}
if let Some(SearchResult { lsn_floor, layer }) = layers.search(key, cont_lsn) {
// If it's a remote layer, download it and retry.
if let Some(remote_layer) = layer.downcast_remote_layer() {
if let Some(remote_layer) =
super::storage_layer::downcast_remote_layer(&layer)
{
// TODO: push a breadcrumb to 'traversal_path' to record the fact that
// we downloaded / would need to download this.
remote_layer
// we downloaded / would need to download this layer.
remote_layer // download happens outside the scope of `layers` guard object
} else {
// Get all the data needed to reconstruct the page version from this layer.
// But if we have an older cached page image, no need to go past that.
@@ -1742,10 +1708,17 @@ impl Timeline {
reconstruct_state,
) {
Ok(result) => result,
Err(e) => return PageReconstructResult::from(e),
Err(e) => return Err(PageReconstructError::from(e)),
};
cont_lsn = lsn_floor;
traversal_path.push((result, cont_lsn, Box::new(layer)));
traversal_path.push((
result,
cont_lsn,
Box::new({
let layer = Arc::clone(&layer);
move || layer.traversal_id()
}),
));
continue 'outer;
}
} else if timeline.ancestor_timeline.is_some() {
@@ -1759,15 +1732,24 @@ impl Timeline {
continue 'outer;
}
};
// Download the remote_layer and replace it in the layer map.
// For that, we need to release the mutex. Otherwise, we'd deadlock.
//
// The control flow is so weird here because `drop(layers)` inside
// the if stmt above is not enough for current rustc: it requires
// that the layers lock guard is not in scope across the download
// await point.
let remote_layer_as_persistent: Arc<dyn PersistentLayer> =
Arc::clone(&remote_layer) as Arc<dyn PersistentLayer>;
let id = remote_layer_as_persistent.traversal_id();
info!("need remote layer {id}");
// The next layer doesn't exist locally. The caller can do the download and retry.
// The next layer doesn't exist locally. Need to download it.
// (The control flow is a bit complicated here because we must drop the 'layers'
// lock before awaiting on the Future.)
info!("need remote layer {}", remote_layer.traversal_id());
return PageReconstructResult::NeedsDownload(
Weak::clone(&timeline.myself),
Arc::downgrade(&remote_layer),
);
info!("on-demand downloading remote layer {id}");
timeline.download_remote_layer(remote_layer).await?;
continue 'layer_map_search;
}
}
}
@@ -2205,7 +2187,7 @@ impl Timeline {
partitioning: &KeyPartitioning,
lsn: Lsn,
force: bool,
) -> anyhow::Result<HashMap<LayerFileName, LayerFileMetadata>> {
) -> Result<HashMap<LayerFileName, LayerFileMetadata>, PageReconstructError> {
let timer = self.metrics.create_images_time_histo.start_timer();
let mut image_layers: Vec<ImageLayer> = Vec::new();
for partition in partitioning.parts.iter() {
@@ -2221,13 +2203,15 @@ impl Timeline {
)?;
fail_point!("image-layer-writer-fail-before-finish", |_| {
anyhow::bail!("failpoint image-layer-writer-fail-before-finish");
Err(PageReconstructError::Other(anyhow::anyhow!(
"failpoint image-layer-writer-fail-before-finish"
)))
});
for range in &partition.ranges {
let mut key = range.start;
while key < range.end {
let img = match self.get_download(key, lsn).await {
let img = match self.get(key, lsn).await {
Ok(img) => img,
Err(err) => {
// If we fail to reconstruct a VM or FSM page, we can zero the
@@ -2278,7 +2262,7 @@ impl Timeline {
self.conf.timeline_path(&self.timeline_id, &self.tenant_id),
))
.collect::<Vec<_>>();
par_fsync::par_fsync(&all_paths)?;
par_fsync::par_fsync(&all_paths).context("fsync of newly created layer files")?;
let mut layer_paths_to_upload = HashMap::with_capacity(image_layers.len());
@@ -2286,7 +2270,10 @@ impl Timeline {
let timeline_path = self.conf.timeline_path(&self.timeline_id, &self.tenant_id);
for l in image_layers {
let path = l.filename();
let metadata = timeline_path.join(path.file_name()).metadata()?;
let metadata = timeline_path
.join(path.file_name())
.metadata()
.context("reading metadata of layer file {path}")?;
layer_paths_to_upload.insert(path, LayerFileMetadata::new(metadata.len()));
@@ -2687,8 +2674,7 @@ impl Timeline {
if let Some(pitr_cutoff_timestamp) = now.checked_sub(pitr) {
let pitr_timestamp = to_pg_timestamp(pitr_cutoff_timestamp);
match with_ondemand_download(|| self.find_lsn_for_timestamp(pitr_timestamp)).await?
{
match self.find_lsn_for_timestamp(pitr_timestamp).await? {
LsnForTimestamp::Present(lsn) => lsn,
LsnForTimestamp::Future(lsn) => {
// The timestamp is in the future. That sounds impossible,
@@ -2957,7 +2943,7 @@ impl Timeline {
key: Key,
request_lsn: Lsn,
mut data: ValueReconstructState,
) -> PageReconstructResult<Bytes> {
) -> Result<Bytes, PageReconstructError> {
// Perform WAL redo if needed
data.records.reverse();
@@ -2965,15 +2951,16 @@ impl Timeline {
if data.records.is_empty() {
if let Some((img_lsn, img)) = &data.img {
trace!(
"found page image for key {} at {}, no WAL redo required",
"found page image for key {} at {}, no WAL redo required, req LSN {}",
key,
img_lsn
img_lsn,
request_lsn,
);
PageReconstructResult::Success(img.clone())
Ok(img.clone())
} else {
PageReconstructResult::from(anyhow!(
Err(PageReconstructError::from(anyhow!(
"base image for {key} at {request_lsn} not found"
))
)))
}
} else {
// We need to do WAL redo.
@@ -2981,12 +2968,12 @@ impl Timeline {
// If we don't have a base image, then the oldest WAL record better initialize
// the page
if data.img.is_none() && !data.records.first().unwrap().1.will_init() {
PageReconstructResult::from(anyhow!(
Err(PageReconstructError::from(anyhow!(
"Base image for {} at {} not found, but got {} WAL records",
key,
request_lsn,
data.records.len()
))
)))
} else {
if data.img.is_some() {
trace!(
@@ -3007,7 +2994,7 @@ impl Timeline {
.context("Failed to reconstruct a page image:")
{
Ok(img) => img,
Err(e) => return PageReconstructResult::from(e),
Err(e) => return Err(PageReconstructError::from(e)),
};
if img.len() == page_cache::PAGE_SZ {
@@ -3022,11 +3009,11 @@ impl Timeline {
)
.context("Materialized page memoization failed")
{
return PageReconstructResult::from(e);
return Err(PageReconstructError::from(e));
}
}
PageReconstructResult::Success(img)
Ok(img)
}
}
}
@@ -3052,7 +3039,7 @@ impl Timeline {
/// So, the current download attempt will run to completion even if we stop polling.
#[instrument(skip_all, fields(tenant_id=%self.tenant_id, timeline_id=%self.timeline_id, layer=%remote_layer.short_id()))]
pub async fn download_remote_layer(
self: Arc<Self>,
&self,
remote_layer: Arc<RemoteLayer>,
) -> anyhow::Result<()> {
let permit = match Arc::clone(&remote_layer.ongoing_download)
@@ -3068,6 +3055,7 @@ impl Timeline {
let (sender, receiver) = tokio::sync::oneshot::channel();
// Spawn a task so that download does not outlive timeline when we detach tenant / delete timeline.
let self_clone = self.myself.upgrade().expect("timeline is gone");
task_mgr::spawn(
&tokio::runtime::Handle::current(),
TaskKind::RemoteDownloadTask,
@@ -3076,7 +3064,7 @@ impl Timeline {
&format!("download layer {}", remote_layer.short_id()),
false,
async move {
let remote_client = self.remote_client.as_ref().unwrap();
let remote_client = self_clone.remote_client.as_ref().unwrap();
// Does retries + exponential back-off internally.
// When this fails, don't layer further retry attempts here.
@@ -3087,12 +3075,12 @@ impl Timeline {
if let Ok(size) = &result {
// XXX the temp file is still around in Err() case
// and consumes space until we clean up upon pageserver restart.
self.metrics.resident_physical_size_gauge.add(*size);
self_clone.metrics.resident_physical_size_gauge.add(*size);
// Download complete. Replace the RemoteLayer with the corresponding
// Delta- or ImageLayer in the layer map.
let new_layer = remote_layer.create_downloaded_layer(self.conf, *size);
let mut layers = self.layers.write().unwrap();
let new_layer = remote_layer.create_downloaded_layer(self_clone.conf, *size);
let mut layers = self_clone.layers.write().unwrap();
{
let l: Arc<dyn PersistentLayer> = remote_layer.clone();
layers.remove_historic(l);
@@ -3128,6 +3116,7 @@ impl Timeline {
pub async fn spawn_download_all_remote_layers(
self: Arc<Self>,
request: DownloadRemoteLayersTaskSpawnRequest,
) -> Result<DownloadRemoteLayersTaskInfo, DownloadRemoteLayersTaskInfo> {
let mut status_guard = self.download_all_remote_layers_task_info.write().unwrap();
if let Some(st) = &*status_guard {
@@ -3151,7 +3140,7 @@ impl Timeline {
"download all remote layers task",
false,
async move {
self_clone.download_all_remote_layers().await;
self_clone.download_all_remote_layers(request).await;
let mut status_guard = self_clone.download_all_remote_layers_task_info.write().unwrap();
match &mut *status_guard {
None => {
@@ -3183,20 +3172,23 @@ impl Timeline {
Ok(initial_info)
}
async fn download_all_remote_layers(self: &Arc<Self>) {
let mut downloads: FuturesUnordered<_> = {
async fn download_all_remote_layers(
self: &Arc<Self>,
request: DownloadRemoteLayersTaskSpawnRequest,
) {
let mut downloads = Vec::new();
{
let layers = self.layers.read().unwrap();
layers
.iter_historic_layers()
.filter_map(|l| l.downcast_remote_layer())
.map({
|l| {
let self_clone = Arc::clone(self);
self_clone.download_remote_layer(l)
}
})
.collect()
};
.map(|l| self.download_remote_layer(l))
.for_each(|dl| downloads.push(dl))
}
let total_layer_count = downloads.len();
// limit download concurrency as specified in request
let downloads = futures::stream::iter(downloads);
let mut downloads = downloads.buffer_unordered(request.max_concurrent_downloads.get());
macro_rules! lock_status {
($st:ident) => {
@@ -3217,7 +3209,7 @@ impl Timeline {
{
lock_status!(st);
st.total_layer_count = downloads.len().try_into().unwrap();
st.total_layer_count = total_layer_count as u64;
}
loop {
tokio::select! {
@@ -3256,108 +3248,25 @@ impl Timeline {
}
}
/// Helper function to deal with [`PageReconstructResult`].
///
/// Takes a sync closure that returns a [`PageReconstructResult`].
/// If it is [`PageReconstructResult::NeedsDownload`],
/// do the download and retry the closure.
///
/// ### Background
///
/// This is a crutch to make on-demand downloads efficient in
/// our async-sync-async sandwich codebase. Some context:
///
/// - The code that does the downloads uses async Rust.
/// - The code that initiates download is many levels of sync Rust.
/// - The sync code must wait for the download to finish to
/// make further progress.
/// - The sync code is invoked directly from async functions upstack.
///
/// Example (there are also much worse ones where the sandwich is taller)
///
/// async handle_get_page_at_lsn_request page_service.rs
/// sync get_rel_page_at_lsn timeline.rs
/// sync timeline.get timeline.rs
/// sync get_reconstruct_data timeline.rs
/// async download_remote_layer timeline.rs
///
/// It is not possible to Timeline::download_remote_layer().await within
/// get_reconstruct_data, so instead, we return [`PageReconstructResult::NeedsDownload`]
/// which contains references to the [`Timeline`] and [`RemoteLayer`].
/// We bubble that error upstack to the async code, which can then call
/// `Timeline::download_remote_layer().await`.
/// That is _efficient_ because tokio can use the same OS thread to do
/// other work while we're waiting for the download.
///
/// It is a deliberate decision to use a new result type to communicate
/// the need for download instead of adding another variant to [`PageReconstructError`].
/// The reason is that with the latter approach, any place that does
/// `?` on a `Result<T, PageReconstructError>` will implicitly ignore the
/// need for download. We want that to be explicit, so that
/// - the code base becomes greppable for places that don't do a download
/// - future code changes will need to explicilty address for on-demand download
///
/// Alternatives to consider in the future:
///
/// - Inside `get_reconstruct_data`, we can std::thread::spawn a thread
/// and use it to block_on the download_remote_layer future.
/// That is obviously inefficient as it creates one thread per download.
/// - Convert everything to async. The problem here is that the sync
/// functions are used by many other sync functions. So, the scope
/// creep of such a conversion is tremendous.
/// - Compromise between the two: implement async functions for each sync
/// function. Switch over the hot code paths (GetPage()) to use the
/// async path, so that the hot path doesn't spawn threads. Other code
/// paths would remain sync initially, and get converted to async over time.
///
pub async fn with_ondemand_download<F, T>(mut f: F) -> Result<T, anyhow::Error>
where
F: Send + FnMut() -> PageReconstructResult<T>,
T: Send,
{
loop {
let closure_result = f();
match closure_result {
PageReconstructResult::NeedsDownload(weak_timeline, weak_remote_layer) => {
// if the timeline is gone, it has likely been deleted / tenant detached
let tl = weak_timeline.upgrade().context("timeline is gone")?;
// if the remote layer got removed, retry the function, it might succeed now
let remote_layer = match weak_remote_layer.upgrade() {
None => {
info!("remote layer is gone, retrying closure");
continue;
}
Some(l) => l,
};
// Does retries internally
tl.download_remote_layer(remote_layer).await?;
// Download successful, retry the closure
continue;
}
PageReconstructResult::Success(closure_value) => return Ok(closure_value),
PageReconstructResult::Error(e) => {
return Err(anyhow::Error::new(e).context("Failed to reconstruct the page"))
}
}
}
}
type TraversalPathItem = (
ValueReconstructResult,
Lsn,
Box<dyn Send + FnOnce() -> TraversalId>,
);
/// Helper function for get_reconstruct_data() to add the path of layers traversed
/// to an error, as anyhow context information.
fn layer_traversal_error(
msg: String,
path: &[(ValueReconstructResult, Lsn, Box<dyn TraversalLayerExt>)],
) -> PageReconstructResult<()> {
fn layer_traversal_error(msg: String, path: Vec<TraversalPathItem>) -> PageReconstructError {
// We want the original 'msg' to be the outermost context. The outermost context
// is the most high-level information, which also gets propagated to the client.
let mut msg_iter = path
.iter()
.into_iter()
.map(|(r, c, l)| {
format!(
"layer traversal: result {:?}, cont_lsn {}, layer: {}",
r,
c,
l.traversal_id(),
l(),
)
})
.chain(std::iter::once(msg));
@@ -3366,7 +3275,7 @@ fn layer_traversal_error(
// Append all subsequent traversals, and the error message 'msg', as contexts.
let msg = msg_iter.fold(err, |err, msg| err.context(msg));
PageReconstructResult::from(msg)
PageReconstructError::from(msg)
}
/// Various functions to mutate the timeline.

View File

@@ -525,12 +525,13 @@ mod tests {
})
}
fn test_files<OF, FD>(testname: &str, openfunc: OF) -> Result<(), Error>
fn test_files<OF, FD>(test_name: &str, openfunc: OF) -> Result<(), Error>
where
FD: Read + Write + Seek + FileExt,
OF: Fn(&Path, &OpenOptions) -> Result<FD, std::io::Error>,
{
let testdir = crate::config::PageServerConf::test_repo_dir(testname);
let temp_repo_dir = tempfile::tempdir()?;
let testdir = temp_repo_dir.path().join(test_name);
std::fs::create_dir_all(&testdir)?;
let path_a = testdir.join("file_a");
@@ -632,7 +633,8 @@ mod tests {
const THREADS: usize = 100;
const SAMPLE: [u8; SIZE] = [0xADu8; SIZE];
let testdir = crate::config::PageServerConf::test_repo_dir("vfile_concurrency");
let temp_repo_dir = tempfile::tempdir()?;
let testdir = temp_repo_dir.path().join("vfile_concurrency");
std::fs::create_dir_all(&testdir)?;
// Create a test file.

View File

@@ -30,8 +30,8 @@ use bytes::{Buf, Bytes, BytesMut};
use tracing::*;
use crate::pgdatadir_mapping::*;
use crate::tenant::PageReconstructError;
use crate::tenant::Timeline;
use crate::tenant::{with_ondemand_download, PageReconstructError};
use crate::walrecord::*;
use crate::ZERO_PAGE;
use pageserver_api::reltag::{RelTag, SlruKind};
@@ -55,8 +55,7 @@ impl<'a> WalIngest<'a> {
pub async fn new(timeline: &Timeline, startpoint: Lsn) -> anyhow::Result<WalIngest> {
// Fetch the latest checkpoint into memory, so that we can compare with it
// quickly in `ingest_record` and update it when it changes.
let checkpoint_bytes =
with_ondemand_download(|| timeline.get_checkpoint(startpoint)).await?;
let checkpoint_bytes = timeline.get_checkpoint(startpoint).await?;
let checkpoint = CheckPoint::decode(&checkpoint_bytes)?;
trace!("CheckPoint.nextXid = {}", checkpoint.nextXid.value);
@@ -107,7 +106,7 @@ impl<'a> WalIngest<'a> {
== pg_constants::XLOG_SMGR_CREATE
{
let create = XlSmgrCreate::decode(&mut buf);
self.ingest_xlog_smgr_create(modification, &create)?;
self.ingest_xlog_smgr_create(modification, &create).await?;
} else if decoded.xl_rmid == pg_constants::RM_SMGR_ID
&& (decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK)
== pg_constants::XLOG_SMGR_TRUNCATE
@@ -135,7 +134,7 @@ impl<'a> WalIngest<'a> {
let dropdb = XlDropDatabase::decode(&mut buf);
for tablespace_id in dropdb.tablespace_ids {
trace!("Drop db {}, {}", tablespace_id, dropdb.db_id);
modification.drop_dbdir(tablespace_id, dropdb.db_id)?;
modification.drop_dbdir(tablespace_id, dropdb.db_id).await?;
}
}
} else if self.timeline.pg_version == 15 {
@@ -159,7 +158,7 @@ impl<'a> WalIngest<'a> {
let dropdb = XlDropDatabase::decode(&mut buf);
for tablespace_id in dropdb.tablespace_ids {
trace!("Drop db {}, {}", tablespace_id, dropdb.db_id);
modification.drop_dbdir(tablespace_id, dropdb.db_id)?;
modification.drop_dbdir(tablespace_id, dropdb.db_id).await?;
}
}
}
@@ -214,9 +213,11 @@ impl<'a> WalIngest<'a> {
parsed_xact.xid,
lsn,
);
modification.drop_twophase_file(parsed_xact.xid)?;
modification.drop_twophase_file(parsed_xact.xid).await?;
} else if info == pg_constants::XLOG_XACT_PREPARE {
modification.put_twophase_file(decoded.xl_xid, Bytes::copy_from_slice(&buf[..]))?;
modification
.put_twophase_file(decoded.xl_xid, Bytes::copy_from_slice(&buf[..]))
.await?;
}
} else if decoded.xl_rmid == pg_constants::RM_MULTIXACT_ID {
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
@@ -250,11 +251,13 @@ impl<'a> WalIngest<'a> {
self.ingest_multixact_create_record(modification, &xlrec)?;
} else if info == pg_constants::XLOG_MULTIXACT_TRUNCATE_ID {
let xlrec = XlMultiXactTruncate::decode(&mut buf);
self.ingest_multixact_truncate_record(modification, &xlrec)?;
self.ingest_multixact_truncate_record(modification, &xlrec)
.await?;
}
} else if decoded.xl_rmid == pg_constants::RM_RELMAP_ID {
let xlrec = XlRelmapUpdate::decode(&mut buf);
self.ingest_relmap_page(modification, &xlrec, decoded)?;
self.ingest_relmap_page(modification, &xlrec, decoded)
.await?;
} else if decoded.xl_rmid == pg_constants::RM_XLOG_ID {
let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
if info == pg_constants::XLOG_NEXTOID {
@@ -534,23 +537,21 @@ impl<'a> WalIngest<'a> {
// get calls instead.
let req_lsn = modification.tline.get_last_record_lsn();
let rels = with_ondemand_download(|| {
modification
.tline
.list_rels(src_tablespace_id, src_db_id, req_lsn)
})
.await?;
let rels = modification
.tline
.list_rels(src_tablespace_id, src_db_id, req_lsn)
.await?;
debug!("ingest_xlog_dbase_create: {} rels", rels.len());
// Copy relfilemap
let filemap = with_ondemand_download(|| {
modification
.tline
.get_relmap_file(src_tablespace_id, src_db_id, req_lsn)
})
.await?;
modification.put_relmap_file(tablespace_id, db_id, filemap)?;
let filemap = modification
.tline
.get_relmap_file(src_tablespace_id, src_db_id, req_lsn)
.await?;
modification
.put_relmap_file(tablespace_id, db_id, filemap)
.await?;
let mut num_rels_copied = 0;
let mut num_blocks_copied = 0;
@@ -558,9 +559,10 @@ impl<'a> WalIngest<'a> {
assert_eq!(src_rel.spcnode, src_tablespace_id);
assert_eq!(src_rel.dbnode, src_db_id);
let nblocks =
with_ondemand_download(|| modification.tline.get_rel_size(src_rel, req_lsn, true))
.await?;
let nblocks = modification
.tline
.get_rel_size(src_rel, req_lsn, true)
.await?;
let dst_rel = RelTag {
spcnode: tablespace_id,
dbnode: db_id,
@@ -568,19 +570,17 @@ impl<'a> WalIngest<'a> {
forknum: src_rel.forknum,
};
modification.put_rel_creation(dst_rel, nblocks)?;
modification.put_rel_creation(dst_rel, nblocks).await?;
// Copy content
debug!("copying rel {} to {}, {} blocks", src_rel, dst_rel, nblocks);
for blknum in 0..nblocks {
debug!("copying block {} from {} to {}", blknum, src_rel, dst_rel);
let content = with_ondemand_download(|| {
modification
.tline
.get_rel_page_at_lsn(src_rel, blknum, req_lsn, true)
})
.await?;
let content = modification
.tline
.get_rel_page_at_lsn(src_rel, blknum, req_lsn, true)
.await?;
modification.put_rel_page_image(dst_rel, blknum, content)?;
num_blocks_copied += 1;
}
@@ -595,9 +595,9 @@ impl<'a> WalIngest<'a> {
Ok(())
}
fn ingest_xlog_smgr_create(
async fn ingest_xlog_smgr_create(
&mut self,
modification: &mut DatadirModification,
modification: &mut DatadirModification<'_>,
rec: &XlSmgrCreate,
) -> anyhow::Result<()> {
let rel = RelTag {
@@ -606,7 +606,7 @@ impl<'a> WalIngest<'a> {
relnode: rec.rnode.relnode,
forknum: rec.forknum,
};
self.put_rel_creation(modification, rel)?;
self.put_rel_creation(modification, rel).await?;
Ok(())
}
@@ -629,7 +629,8 @@ impl<'a> WalIngest<'a> {
relnode,
forknum: MAIN_FORKNUM,
};
self.put_rel_truncation(modification, rel, rec.blkno)?;
self.put_rel_truncation(modification, rel, rec.blkno)
.await?;
}
if (rec.flags & pg_constants::SMGR_TRUNCATE_FSM) != 0 {
let rel = RelTag {
@@ -650,7 +651,8 @@ impl<'a> WalIngest<'a> {
let nblocks = self.get_relsize(rel, modification.lsn).await?;
if nblocks > fsm_physical_page_no {
// check if something to do: FSM is larger than truncate position
self.put_rel_truncation(modification, rel, fsm_physical_page_no)?;
self.put_rel_truncation(modification, rel, fsm_physical_page_no)
.await?;
}
}
if (rec.flags & pg_constants::SMGR_TRUNCATE_VM) != 0 {
@@ -671,7 +673,8 @@ impl<'a> WalIngest<'a> {
let nblocks = self.get_relsize(rel, modification.lsn).await?;
if nblocks > vm_page_no {
// check if something to do: VM is larger than truncate position
self.put_rel_truncation(modification, rel, vm_page_no)?;
self.put_rel_truncation(modification, rel, vm_page_no)
.await?;
}
}
Ok(())
@@ -740,10 +743,12 @@ impl<'a> WalIngest<'a> {
relnode: xnode.relnode,
};
let last_lsn = self.timeline.get_last_record_lsn();
if with_ondemand_download(|| modification.tline.get_rel_exists(rel, last_lsn, true))
if modification
.tline
.get_rel_exists(rel, last_lsn, true)
.await?
{
self.put_rel_drop(modification, rel)?;
self.put_rel_drop(modification, rel).await?;
}
}
}
@@ -795,16 +800,16 @@ impl<'a> WalIngest<'a> {
// instead.
let req_lsn = modification.tline.get_last_record_lsn();
let slru_segments = with_ondemand_download(|| {
modification
.tline
.list_slru_segments(SlruKind::Clog, req_lsn)
})
.await?;
let slru_segments = modification
.tline
.list_slru_segments(SlruKind::Clog, req_lsn)
.await?;
for segno in slru_segments {
let segpage = segno * pg_constants::SLRU_PAGES_PER_SEGMENT;
if slru_may_delete_clogsegment(segpage, xlrec.pageno) {
modification.drop_slru_segment(SlruKind::Clog, segno)?;
modification
.drop_slru_segment(SlruKind::Clog, segno)
.await?;
trace!("Drop CLOG segment {:>04X}", segno);
}
}
@@ -891,9 +896,9 @@ impl<'a> WalIngest<'a> {
Ok(())
}
fn ingest_multixact_truncate_record(
async fn ingest_multixact_truncate_record(
&mut self,
modification: &mut DatadirModification,
modification: &mut DatadirModification<'_>,
xlrec: &XlMultiXactTruncate,
) -> Result<()> {
self.checkpoint.oldestMulti = xlrec.end_trunc_off;
@@ -909,7 +914,9 @@ impl<'a> WalIngest<'a> {
// Delete all the segments except the last one. The last segment can still
// contain, possibly partially, valid data.
while segment != endsegment {
modification.drop_slru_segment(SlruKind::MultiXactMembers, segment as u32)?;
modification
.drop_slru_segment(SlruKind::MultiXactMembers, segment as u32)
.await?;
/* move to next segment, handling wraparound correctly */
if segment == maxsegment {
@@ -925,9 +932,9 @@ impl<'a> WalIngest<'a> {
Ok(())
}
fn ingest_relmap_page(
async fn ingest_relmap_page(
&mut self,
modification: &mut DatadirModification,
modification: &mut DatadirModification<'_>,
xlrec: &XlRelmapUpdate,
decoded: &DecodedWALRecord,
) -> Result<()> {
@@ -936,17 +943,19 @@ impl<'a> WalIngest<'a> {
// skip xl_relmap_update
buf.advance(12);
modification.put_relmap_file(xlrec.tsid, xlrec.dbid, Bytes::copy_from_slice(&buf[..]))?;
modification
.put_relmap_file(xlrec.tsid, xlrec.dbid, Bytes::copy_from_slice(&buf[..]))
.await?;
Ok(())
}
fn put_rel_creation(
async fn put_rel_creation(
&mut self,
modification: &mut DatadirModification,
modification: &mut DatadirModification<'_>,
rel: RelTag,
) -> Result<()> {
modification.put_rel_creation(rel, 0)?;
modification.put_rel_creation(rel, 0).await?;
Ok(())
}
@@ -974,28 +983,31 @@ impl<'a> WalIngest<'a> {
Ok(())
}
fn put_rel_truncation(
async fn put_rel_truncation(
&mut self,
modification: &mut DatadirModification,
modification: &mut DatadirModification<'_>,
rel: RelTag,
nblocks: BlockNumber,
) -> anyhow::Result<()> {
modification.put_rel_truncation(rel, nblocks)?;
modification.put_rel_truncation(rel, nblocks).await?;
Ok(())
}
fn put_rel_drop(&mut self, modification: &mut DatadirModification, rel: RelTag) -> Result<()> {
modification.put_rel_drop(rel)?;
async fn put_rel_drop(
&mut self,
modification: &mut DatadirModification<'_>,
rel: RelTag,
) -> Result<()> {
modification.put_rel_drop(rel).await?;
Ok(())
}
async fn get_relsize(&mut self, rel: RelTag, lsn: Lsn) -> anyhow::Result<BlockNumber> {
let exists =
with_ondemand_download(|| self.timeline.get_rel_exists(rel, lsn, true)).await?;
let exists = self.timeline.get_rel_exists(rel, lsn, true).await?;
let nblocks = if !exists {
0
} else {
with_ondemand_download(|| self.timeline.get_rel_size(rel, lsn, true)).await?
self.timeline.get_rel_size(rel, lsn, true).await?
};
Ok(nblocks)
}
@@ -1011,19 +1023,17 @@ impl<'a> WalIngest<'a> {
// record.
// TODO: would be nice if to be more explicit about it
let last_lsn = modification.lsn;
let old_nblocks =
if !with_ondemand_download(|| self.timeline.get_rel_exists(rel, last_lsn, true)).await?
{
// create it with 0 size initially, the logic below will extend it
modification.put_rel_creation(rel, 0)?;
0
} else {
with_ondemand_download(|| self.timeline.get_rel_size(rel, last_lsn, true)).await?
};
let old_nblocks = if !self.timeline.get_rel_exists(rel, last_lsn, true).await? {
// create it with 0 size initially, the logic below will extend it
modification.put_rel_creation(rel, 0).await?;
0
} else {
self.timeline.get_rel_size(rel, last_lsn, true).await?
};
if new_nblocks > old_nblocks {
//info!("extending {} {} to {}", rel, old_nblocks, new_nblocks);
modification.put_rel_extend(rel, new_nblocks)?;
modification.put_rel_extend(rel, new_nblocks).await?;
// fill the gap with zeros
for gap_blknum in old_nblocks..blknum {
@@ -1063,16 +1073,19 @@ impl<'a> WalIngest<'a> {
// record.
// TODO: would be nice if to be more explicit about it
let last_lsn = self.timeline.get_last_record_lsn();
let old_nblocks = if !with_ondemand_download(|| {
self.timeline.get_slru_segment_exists(kind, segno, last_lsn)
})
.await?
let old_nblocks = if !self
.timeline
.get_slru_segment_exists(kind, segno, last_lsn)
.await?
{
// create it with 0 size initially, the logic below will extend it
modification.put_slru_segment_creation(kind, segno, 0)?;
modification
.put_slru_segment_creation(kind, segno, 0)
.await?;
0
} else {
with_ondemand_download(|| self.timeline.get_slru_segment_size(kind, segno, last_lsn))
self.timeline
.get_slru_segment_size(kind, segno, last_lsn)
.await?
};
@@ -1124,7 +1137,7 @@ mod tests {
async fn init_walingest_test(tline: &Timeline) -> Result<WalIngest> {
let mut m = tline.begin_modification(Lsn(0x10));
m.put_checkpoint(ZERO_CHECKPOINT.clone())?;
m.put_relmap_file(0, 111, Bytes::from(""))?; // dummy relmapper file
m.put_relmap_file(0, 111, Bytes::from("")).await?; // dummy relmapper file
m.commit()?;
let walingest = WalIngest::new(tline, Lsn(0x10)).await?;
@@ -1133,12 +1146,13 @@ mod tests {
#[tokio::test]
async fn test_relsize() -> Result<()> {
let tenant = TenantHarness::create("test_relsize")?.load().await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = create_test_timeline(&tenant, TIMELINE_ID, DEFAULT_PG_VERSION)?;
let mut walingest = init_walingest_test(&tline).await?;
let mut m = tline.begin_modification(Lsn(0x20));
walingest.put_rel_creation(&mut m, TESTREL_A)?;
walingest.put_rel_creation(&mut m, TESTREL_A).await?;
walingest
.put_rel_page_image(&mut m, TESTREL_A, 0, TEST_IMG("foo blk 0 at 2"))
.await?;
@@ -1163,132 +1177,103 @@ mod tests {
// The relation was created at LSN 2, not visible at LSN 1 yet.
assert_eq!(
tline
.get_rel_exists(TESTREL_A, Lsn(0x10), false)
.no_ondemand_download()?,
tline.get_rel_exists(TESTREL_A, Lsn(0x10), false).await?,
false
);
assert!(tline
.get_rel_size(TESTREL_A, Lsn(0x10), false)
.no_ondemand_download()
.await
.is_err());
assert_eq!(
tline
.get_rel_exists(TESTREL_A, Lsn(0x20), false)
.no_ondemand_download()?,
tline.get_rel_exists(TESTREL_A, Lsn(0x20), false).await?,
true
);
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x20), false)
.no_ondemand_download()?,
1
);
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x50), false)
.no_ondemand_download()?,
3
);
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x20), false).await?, 1);
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x50), false).await?, 3);
// Check page contents at each LSN
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x20), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 0 at 2")
);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x30), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 0 at 3")
);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x40), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 0 at 3")
);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x40), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 1 at 4")
);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x50), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 0 at 3")
);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x50), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 1 at 4")
);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 2, Lsn(0x50), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 2 at 5")
);
// Truncate last block
let mut m = tline.begin_modification(Lsn(0x60));
walingest.put_rel_truncation(&mut m, TESTREL_A, 2)?;
walingest.put_rel_truncation(&mut m, TESTREL_A, 2).await?;
m.commit()?;
assert_current_logical_size(&tline, Lsn(0x60));
// Check reported size and contents after truncation
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x60), false)
.no_ondemand_download()?,
2
);
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x60), false).await?, 2);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x60), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 0 at 3")
);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x60), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 1 at 4")
);
// should still see the truncated block with older LSN
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x50), false)
.no_ondemand_download()?,
3
);
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x50), false).await?, 3);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 2, Lsn(0x50), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 2 at 5")
);
// Truncate to zero length
let mut m = tline.begin_modification(Lsn(0x68));
walingest.put_rel_truncation(&mut m, TESTREL_A, 0)?;
walingest.put_rel_truncation(&mut m, TESTREL_A, 0).await?;
m.commit()?;
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x68), false)
.no_ondemand_download()?,
0
);
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x68), false).await?, 0);
// Extend from 0 to 2 blocks, leaving a gap
let mut m = tline.begin_modification(Lsn(0x70));
@@ -1296,22 +1281,17 @@ mod tests {
.put_rel_page_image(&mut m, TESTREL_A, 1, TEST_IMG("foo blk 1"))
.await?;
m.commit()?;
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x70), false)
.no_ondemand_download()?,
2
);
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x70), false).await?, 2);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x70), false)
.no_ondemand_download()?,
.await?,
ZERO_PAGE
);
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x70), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 1")
);
@@ -1321,24 +1301,19 @@ mod tests {
.put_rel_page_image(&mut m, TESTREL_A, 1500, TEST_IMG("foo blk 1500"))
.await?;
m.commit()?;
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x80), false)
.no_ondemand_download()?,
1501
);
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x80), false).await?, 1501);
for blk in 2..1500 {
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, blk, Lsn(0x80), false)
.no_ondemand_download()?,
.await?,
ZERO_PAGE
);
}
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, 1500, Lsn(0x80), false)
.no_ondemand_download()?,
.await?,
TEST_IMG("foo blk 1500")
);
@@ -1349,7 +1324,8 @@ mod tests {
// and then created it again within the same layer.
#[tokio::test]
async fn test_drop_extend() -> Result<()> {
let tenant = TenantHarness::create("test_drop_extend")?.load().await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = create_test_timeline(&tenant, TIMELINE_ID, DEFAULT_PG_VERSION)?;
let mut walingest = init_walingest_test(&tline).await?;
@@ -1361,28 +1337,19 @@ mod tests {
// Check that rel exists and size is correct
assert_eq!(
tline
.get_rel_exists(TESTREL_A, Lsn(0x20), false)
.no_ondemand_download()?,
tline.get_rel_exists(TESTREL_A, Lsn(0x20), false).await?,
true
);
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x20), false)
.no_ondemand_download()?,
1
);
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x20), false).await?, 1);
// Drop rel
let mut m = tline.begin_modification(Lsn(0x30));
walingest.put_rel_drop(&mut m, TESTREL_A)?;
walingest.put_rel_drop(&mut m, TESTREL_A).await?;
m.commit()?;
// Check that rel is not visible anymore
assert_eq!(
tline
.get_rel_exists(TESTREL_A, Lsn(0x30), false)
.no_ondemand_download()?,
tline.get_rel_exists(TESTREL_A, Lsn(0x30), false).await?,
false
);
@@ -1398,17 +1365,10 @@ mod tests {
// Check that rel exists and size is correct
assert_eq!(
tline
.get_rel_exists(TESTREL_A, Lsn(0x40), false)
.no_ondemand_download()?,
tline.get_rel_exists(TESTREL_A, Lsn(0x40), false).await?,
true
);
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x40), false)
.no_ondemand_download()?,
1
);
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x40), false).await?, 1);
Ok(())
}
@@ -1418,7 +1378,8 @@ mod tests {
// and then extended it again within the same layer.
#[tokio::test]
async fn test_truncate_extend() -> Result<()> {
let tenant = TenantHarness::create("test_truncate_extend")?.load().await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = create_test_timeline(&tenant, TIMELINE_ID, DEFAULT_PG_VERSION)?;
let mut walingest = init_walingest_test(&tline).await?;
@@ -1435,26 +1396,20 @@ mod tests {
// The relation was created at LSN 20, not visible at LSN 1 yet.
assert_eq!(
tline
.get_rel_exists(TESTREL_A, Lsn(0x10), false)
.no_ondemand_download()?,
tline.get_rel_exists(TESTREL_A, Lsn(0x10), false).await?,
false
);
assert!(tline
.get_rel_size(TESTREL_A, Lsn(0x10), false)
.no_ondemand_download()
.await
.is_err());
assert_eq!(
tline
.get_rel_exists(TESTREL_A, Lsn(0x20), false)
.no_ondemand_download()?,
tline.get_rel_exists(TESTREL_A, Lsn(0x20), false).await?,
true
);
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x20), false)
.no_ondemand_download()?,
tline.get_rel_size(TESTREL_A, Lsn(0x20), false).await?,
relsize
);
@@ -1465,7 +1420,7 @@ mod tests {
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, blkno, lsn, false)
.no_ondemand_download()?,
.await?,
TEST_IMG(&data)
);
}
@@ -1473,16 +1428,11 @@ mod tests {
// Truncate relation so that second segment was dropped
// - only leave one page
let mut m = tline.begin_modification(Lsn(0x60));
walingest.put_rel_truncation(&mut m, TESTREL_A, 1)?;
walingest.put_rel_truncation(&mut m, TESTREL_A, 1).await?;
m.commit()?;
// Check reported size and contents after truncation
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x60), false)
.no_ondemand_download()?,
1
);
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x60), false).await?, 1);
for blkno in 0..1 {
let lsn = Lsn(0x20);
@@ -1490,16 +1440,14 @@ mod tests {
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, blkno, Lsn(0x60), false)
.no_ondemand_download()?,
.await?,
TEST_IMG(&data)
);
}
// should still see all blocks with older LSN
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x50), false)
.no_ondemand_download()?,
tline.get_rel_size(TESTREL_A, Lsn(0x50), false).await?,
relsize
);
for blkno in 0..relsize {
@@ -1508,7 +1456,7 @@ mod tests {
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, blkno, Lsn(0x50), false)
.no_ondemand_download()?,
.await?,
TEST_IMG(&data)
);
}
@@ -1526,15 +1474,11 @@ mod tests {
m.commit()?;
assert_eq!(
tline
.get_rel_exists(TESTREL_A, Lsn(0x80), false)
.no_ondemand_download()?,
tline.get_rel_exists(TESTREL_A, Lsn(0x80), false).await?,
true
);
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(0x80), false)
.no_ondemand_download()?,
tline.get_rel_size(TESTREL_A, Lsn(0x80), false).await?,
relsize
);
// Check relation content
@@ -1544,7 +1488,7 @@ mod tests {
assert_eq!(
tline
.get_rel_page_at_lsn(TESTREL_A, blkno, Lsn(0x80), false)
.no_ondemand_download()?,
.await?,
TEST_IMG(&data)
);
}
@@ -1556,7 +1500,8 @@ mod tests {
/// split into multiple 1 GB segments in Postgres.
#[tokio::test]
async fn test_large_rel() -> Result<()> {
let tenant = TenantHarness::create("test_large_rel")?.load().await;
let harness = TenantHarness::new()?;
let tenant = harness.load().await;
let tline = create_test_timeline(&tenant, TIMELINE_ID, DEFAULT_PG_VERSION)?;
let mut walingest = init_walingest_test(&tline).await?;
@@ -1574,21 +1519,19 @@ mod tests {
assert_current_logical_size(&tline, Lsn(lsn));
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(lsn), false)
.no_ondemand_download()?,
tline.get_rel_size(TESTREL_A, Lsn(lsn), false).await?,
RELSEG_SIZE + 1
);
// Truncate one block
lsn += 0x10;
let mut m = tline.begin_modification(Lsn(lsn));
walingest.put_rel_truncation(&mut m, TESTREL_A, RELSEG_SIZE)?;
walingest
.put_rel_truncation(&mut m, TESTREL_A, RELSEG_SIZE)
.await?;
m.commit()?;
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(lsn), false)
.no_ondemand_download()?,
tline.get_rel_size(TESTREL_A, Lsn(lsn), false).await?,
RELSEG_SIZE
);
assert_current_logical_size(&tline, Lsn(lsn));
@@ -1596,12 +1539,12 @@ mod tests {
// Truncate another block
lsn += 0x10;
let mut m = tline.begin_modification(Lsn(lsn));
walingest.put_rel_truncation(&mut m, TESTREL_A, RELSEG_SIZE - 1)?;
walingest
.put_rel_truncation(&mut m, TESTREL_A, RELSEG_SIZE - 1)
.await?;
m.commit()?;
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(lsn), false)
.no_ondemand_download()?,
tline.get_rel_size(TESTREL_A, Lsn(lsn), false).await?,
RELSEG_SIZE - 1
);
assert_current_logical_size(&tline, Lsn(lsn));
@@ -1612,12 +1555,12 @@ mod tests {
while size >= 0 {
lsn += 0x10;
let mut m = tline.begin_modification(Lsn(lsn));
walingest.put_rel_truncation(&mut m, TESTREL_A, size as BlockNumber)?;
walingest
.put_rel_truncation(&mut m, TESTREL_A, size as BlockNumber)
.await?;
m.commit()?;
assert_eq!(
tline
.get_rel_size(TESTREL_A, Lsn(lsn), false)
.no_ondemand_download()?,
tline.get_rel_size(TESTREL_A, Lsn(lsn), false).await?,
size as BlockNumber
);

View File

@@ -31,6 +31,7 @@ use once_cell::sync::OnceCell;
use std::future::Future;
use storage_broker::BrokerClientChannel;
use tokio::sync::watch;
use tokio_util::sync::CancellationToken;
use tracing::*;
pub use connection_manager::spawn_connection_manager_task;
@@ -76,7 +77,7 @@ pub fn is_broker_client_initialized() -> bool {
/// A handle of an asynchronous task.
/// The task has a channel that it can use to communicate its lifecycle events in a certain form, see [`TaskEvent`]
/// and a cancellation channel that it can listen to for earlier interrupts.
/// and a cancellation token that it can listen to for earlier interrupts.
///
/// Note that the communication happens via the `watch` channel, that does not accumulate the events, replacing the old one with the never one on submission.
/// That may lead to certain events not being observed by the listener.
@@ -84,7 +85,7 @@ pub fn is_broker_client_initialized() -> bool {
pub struct TaskHandle<E> {
join_handle: Option<tokio::task::JoinHandle<anyhow::Result<()>>>,
events_receiver: watch::Receiver<TaskStateUpdate<E>>,
cancellation: watch::Sender<()>,
cancellation: CancellationToken,
}
pub enum TaskEvent<E> {
@@ -102,20 +103,23 @@ pub enum TaskStateUpdate<E> {
impl<E: Clone> TaskHandle<E> {
/// Initializes the task, starting it immediately after the creation.
pub fn spawn<Fut>(
task: impl FnOnce(watch::Sender<TaskStateUpdate<E>>, watch::Receiver<()>) -> Fut
+ Send
+ 'static,
task: impl FnOnce(watch::Sender<TaskStateUpdate<E>>, CancellationToken) -> Fut + Send + 'static,
) -> Self
where
Fut: Future<Output = anyhow::Result<()>> + Send,
E: Send + Sync + 'static,
{
let (cancellation, cancellation_receiver) = watch::channel(());
let cancellation = CancellationToken::new();
let (events_sender, events_receiver) = watch::channel(TaskStateUpdate::Started);
let cancellation_clone = cancellation.clone();
let join_handle = WALRECEIVER_RUNTIME.spawn(async move {
events_sender.send(TaskStateUpdate::Started).ok();
task(events_sender, cancellation_receiver).await
task(events_sender, cancellation_clone).await
// events_sender is dropped at some point during the .await above.
// But the task is still running on WALRECEIVER_RUNTIME.
// That is the window when `!jh.is_finished()`
// is true inside `fn next_task_event()` below.
});
TaskHandle {
@@ -132,7 +136,23 @@ impl<E: Clone> TaskHandle<E> {
TaskEvent::End(match self.join_handle.as_mut() {
Some(jh) => {
if !jh.is_finished() {
warn!("sender is dropped while join handle is still alive");
// Barring any implementation errors in this module, we can
// only arrive here while the task that executes the future
// passed to `Self::spawn()` is still execution. Cf the comment
// in Self::spawn().
//
// This was logging at warning level in earlier versions, presumably
// to leave some breadcrumbs in case we had an implementation
// error that would would make us get stuck in `jh.await`.
//
// There hasn't been such a bug so far.
// But in a busy system, e.g., during pageserver restart,
// we arrive here often enough that the warning-level logs
// became a distraction.
// So, tone them down to info-level.
//
// XXX: rewrite this module to eliminate the race condition.
info!("sender is dropped while join handle is still alive");
}
let res = jh
@@ -157,7 +177,7 @@ impl<E: Clone> TaskHandle<E> {
/// Aborts current task, waiting for it to finish.
pub async fn shutdown(self) {
if let Some(jh) = self.join_handle {
self.cancellation.send(()).ok();
self.cancellation.cancel();
match jh.await {
Ok(Ok(())) => debug!("Shutdown success"),
Ok(Err(e)) => error!("Shutdown task error: {e:?}"),

View File

@@ -846,7 +846,7 @@ mod tests {
#[tokio::test]
async fn no_connection_no_candidate() -> anyhow::Result<()> {
let harness = TenantHarness::create("no_connection_no_candidate")?;
let harness = TenantHarness::new()?;
let mut state = dummy_state(&harness).await;
let now = Utc::now().naive_utc();
@@ -879,7 +879,7 @@ mod tests {
#[tokio::test]
async fn connection_no_candidate() -> anyhow::Result<()> {
let harness = TenantHarness::create("connection_no_candidate")?;
let harness = TenantHarness::new()?;
let mut state = dummy_state(&harness).await;
let now = Utc::now().naive_utc();
@@ -942,7 +942,7 @@ mod tests {
#[tokio::test]
async fn no_connection_candidate() -> anyhow::Result<()> {
let harness = TenantHarness::create("no_connection_candidate")?;
let harness = TenantHarness::new()?;
let mut state = dummy_state(&harness).await;
let now = Utc::now().naive_utc();
@@ -1001,7 +1001,7 @@ mod tests {
#[tokio::test]
async fn candidate_with_many_connection_failures() -> anyhow::Result<()> {
let harness = TenantHarness::create("candidate_with_many_connection_failures")?;
let harness = TenantHarness::new()?;
let mut state = dummy_state(&harness).await;
let now = Utc::now().naive_utc();
@@ -1041,7 +1041,7 @@ mod tests {
#[tokio::test]
async fn lsn_wal_over_threshhold_current_candidate() -> anyhow::Result<()> {
let harness = TenantHarness::create("lsn_wal_over_threshcurrent_candidate")?;
let harness = TenantHarness::new()?;
let mut state = dummy_state(&harness).await;
let current_lsn = Lsn(100_000).align();
let now = Utc::now().naive_utc();
@@ -1105,7 +1105,7 @@ mod tests {
#[tokio::test]
async fn timeout_connection_threshhold_current_candidate() -> anyhow::Result<()> {
let harness = TenantHarness::create("timeout_connection_threshhold_current_candidate")?;
let harness = TenantHarness::new()?;
let mut state = dummy_state(&harness).await;
let current_lsn = Lsn(100_000).align();
let now = Utc::now().naive_utc();
@@ -1166,7 +1166,7 @@ mod tests {
#[tokio::test]
async fn timeout_wal_over_threshhold_current_candidate() -> anyhow::Result<()> {
let harness = TenantHarness::create("timeout_wal_over_threshhold_current_candidate")?;
let harness = TenantHarness::new()?;
let mut state = dummy_state(&harness).await;
let current_lsn = Lsn(100_000).align();
let new_lsn = Lsn(100_100).align();
@@ -1232,7 +1232,7 @@ mod tests {
const DUMMY_SAFEKEEPER_HOST: &str = "safekeeper_connstr";
async fn dummy_state(harness: &TenantHarness<'_>) -> WalreceiverState {
async fn dummy_state(harness: &TenantHarness) -> WalreceiverState {
WalreceiverState {
id: TenantTimelineId {
tenant_id: harness.tenant_id,

View File

@@ -1,6 +1,7 @@
//! Actual Postgres connection handler to stream WAL to the server.
use std::{
error::Error,
str::FromStr,
sync::Arc,
time::{Duration, SystemTime},
@@ -11,13 +12,14 @@ use bytes::BytesMut;
use chrono::{NaiveDateTime, Utc};
use fail::fail_point;
use futures::StreamExt;
use postgres::{SimpleQueryMessage, SimpleQueryRow};
use postgres::{error::SqlState, SimpleQueryMessage, SimpleQueryRow};
use postgres_ffi::v14::xlog_utils::normalize_lsn;
use postgres_ffi::WAL_SEGMENT_SIZE;
use postgres_protocol::message::backend::ReplicationMessage;
use postgres_types::PgLsn;
use tokio::{pin, select, sync::watch, time};
use tokio_postgres::{replication::ReplicationStream, Client};
use tokio_util::sync::CancellationToken;
use tracing::{debug, error, info, trace, warn};
use crate::{metrics::LIVE_CONNECTIONS_COUNT, walreceiver::TaskStateUpdate};
@@ -32,7 +34,7 @@ use crate::{
use postgres_connection::PgConnectionConfig;
use postgres_ffi::waldecoder::WalStreamDecoder;
use pq_proto::ReplicationFeedback;
use utils::lsn::Lsn;
use utils::{lsn::Lsn, postgres_backend_async::is_expected_io_error};
/// Status of the connection.
#[derive(Debug, Clone, Copy)]
@@ -58,7 +60,7 @@ pub async fn handle_walreceiver_connection(
timeline: Arc<Timeline>,
wal_source_connconf: PgConnectionConfig,
events_sender: watch::Sender<TaskStateUpdate<WalConnectionStatus>>,
mut cancellation: watch::Receiver<()>,
cancellation: CancellationToken,
connect_timeout: Duration,
) -> anyhow::Result<()> {
// Connect to the database in replication mode.
@@ -68,10 +70,17 @@ pub async fn handle_walreceiver_connection(
let mut config = wal_source_connconf.to_tokio_postgres_config();
config.application_name("pageserver");
config.replication_mode(tokio_postgres::config::ReplicationMode::Physical);
time::timeout(connect_timeout, config.connect(postgres::NoTls))
.await
.context("Timed out while waiting for walreceiver connection to open")?
.context("Failed to open walreceiver connection")?
match time::timeout(connect_timeout, config.connect(postgres::NoTls)).await {
Ok(Ok(client_and_conn)) => client_and_conn,
Ok(Err(conn_err)) => {
let expected_error = ignore_expected_errors(conn_err)?;
info!("DB connection stream finished: {expected_error}");
return Ok(());
}
Err(elapsed) => anyhow::bail!(
"Timed out while waiting {elapsed} for walreceiver connection to open"
),
}
};
info!("connected!");
@@ -90,7 +99,7 @@ pub async fn handle_walreceiver_connection(
// The connection object performs the actual communication with the database,
// so spawn it off to run on its own.
let mut connection_cancellation = cancellation.clone();
let connection_cancellation = cancellation.clone();
task_mgr::spawn(
WALRECEIVER_RUNTIME.handle(),
TaskKind::WalReceiverConnection,
@@ -103,15 +112,13 @@ pub async fn handle_walreceiver_connection(
connection_result = connection => match connection_result{
Ok(()) => info!("Walreceiver db connection closed"),
Err(connection_error) => {
if connection_error.is_closed() {
info!("Connection closed regularly: {connection_error}")
} else {
warn!("Connection aborted: {connection_error}")
if let Err(e) = ignore_expected_errors(connection_error) {
warn!("Connection aborted: {e:#}")
}
}
},
_ = connection_cancellation.changed() => info!("Connection cancelled"),
_ = connection_cancellation.cancelled() => info!("Connection cancelled"),
}
Ok(())
},
@@ -177,7 +184,7 @@ pub async fn handle_walreceiver_connection(
while let Some(replication_message) = {
select! {
_ = cancellation.changed() => {
_ = cancellation.cancelled() => {
info!("walreceiver interrupted");
None
}
@@ -187,14 +194,9 @@ pub async fn handle_walreceiver_connection(
let replication_message = match replication_message {
Ok(message) => message,
Err(replication_error) => {
if replication_error.is_closed() {
info!("Replication stream got closed");
return Ok(());
} else {
return Err(
anyhow::Error::new(replication_error).context("replication stream error")
);
}
let expected_error = ignore_expected_errors(replication_error)?;
info!("Replication stream finished: {expected_error}");
return Ok(());
}
};
@@ -400,3 +402,32 @@ async fn identify_system(client: &mut Client) -> anyhow::Result<IdentifySystem>
Err(IdentifyError.into())
}
}
/// We don't want to report connectivity problems as real errors towards connection manager because
/// 1. they happen frequently enough to make server logs hard to read and
/// 2. the connection manager can retry other safekeeper.
///
/// If this function returns `Ok(pg_error)`, it's such an error.
/// The caller should log it at info level and then report to connection manager that we're done handling this connection.
/// Connection manager will then handle reconnections.
///
/// If this function returns an `Err()`, the caller can bubble it up using `?`.
/// The connection manager will log the error at ERROR level.
fn ignore_expected_errors(pg_error: postgres::Error) -> anyhow::Result<postgres::Error> {
if pg_error.is_closed()
|| pg_error
.source()
.and_then(|source| source.downcast_ref::<std::io::Error>())
.map(is_expected_io_error)
.unwrap_or(false)
{
return Ok(pg_error);
} else if let Some(db_error) = pg_error.as_db_error() {
if db_error.code() == &SqlState::CONNECTION_FAILURE
&& db_error.message().contains("end streaming")
{
return Ok(pg_error);
}
}
Err(pg_error).context("connection error")
}

Some files were not shown because too many files have changed in this diff Show More