Compare commits

...

62 Commits

Author SHA1 Message Date
Bojan Serafimov
5edb0ccfa7 Add assertion 2023-01-24 15:23:27 -05:00
Bojan Serafimov
2bc1324bed Try im (benchmarks panic) 2023-01-24 14:36:22 -05:00
Sergey Melnikov
aabca55d7e Migrate update version to management APIv2 (#3430) 2023-01-24 17:18:16 +01:00
Kirill Bulatov
1c3636d848 Tone down walreceiver connection timeout errors (#3425)
Closes https://github.com/neondatabase/neon/issues/3342
2023-01-24 18:03:33 +02:00
Kirill Bulatov
0c16ad8591 Tone down broker subscription errors 2023-01-24 17:23:33 +02:00
Christian Schwarz
0b673c12d7 timeline: don't transition Active=>Active during pageserver startup
Before this patch, when `initialize_with_lock` was called via
`timeline_init_and_sync`, we would transition the timeline like so:

    load_local_timeline/load_remote_timeline:
        timeline_init_and_sync
            Timeline::new
                () => Loading
            initialize_with_lock:
                set_state(Active)
                    Loading => Active
        timeline.activate()
            Active => Active
2023-01-24 15:56:02 +01:00
Christian Schwarz
7a333cfb12 be noisy about unexpected Timeline state transitions 2023-01-24 15:56:02 +01:00
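
As an aside, a minimal sketch of the idea behind these two commits (the `Timeline`, `set_state`, and `TimelineState` names follow the log above, but the body is hypothetical, not the pageserver's actual code): a state setter that flags redundant transitions such as Active => Active.

```
use tracing::warn;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TimelineState {
    Loading,
    Active,
    Stopping,
    Broken,
}

struct Timeline {
    state: TimelineState,
}

impl Timeline {
    fn set_state(&mut self, new_state: TimelineState) {
        if self.state == new_state {
            // e.g. Active => Active during startup: harmless, but worth flagging.
            warn!("redundant state transition {:?} => {:?}", self.state, new_state);
            return;
        }
        self.state = new_state;
    }
}
```
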
Christian Schwarz
f7ec33970a add doc comment that outlines which tokio tasks walreceiver creates 2023-01-24 15:23:48 +01:00
Joonas Koivunen
98d0a0d242 fix(http): omit needless string allocs (#3421)
Drive-by fix noticed while #3419.
2023-01-24 14:53:39 +02:00
Joonas Koivunen
f74080cbad feat(http): support ?inputs_only=true for tenant_size (#3419)
This makes debugging problematic cases easier in the future, as we can
request just the model inputs and use them locally to reproduce the issue
with the model.
2023-01-24 13:57:13 +02:00
Christian Schwarz
55c184fcd7 fix some anyhow::Context::context calls that should use with_context(format!(...))
Noticed this while combing through some production logs.
2023-01-24 12:22:33 +01:00
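
For context, the difference is that `.context(...)` builds its message eagerly even on the success path, while `.with_context(|| ...)` only formats it if the call actually fails. A minimal stand-alone illustration (not code from this repository; the file name is made up):

```
use anyhow::{Context, Result};
use std::fs;

fn read_config(path: &str) -> Result<String> {
    // `.context(format!(...))` would allocate the message on every call;
    // `.with_context(|| ...)` defers formatting to the error path.
    fs::read_to_string(path).with_context(|| format!("failed to read config file {path}"))
}

fn main() {
    if let Err(err) = read_config("/nonexistent/neon.toml") {
        // Prints the full error chain, including the added context.
        eprintln!("{err:#}");
    }
}
```
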
Kirill Bulatov
fd18692dfb Output coloured pageserver logs for tty stdout 2023-01-24 09:58:08 +02:00
Alexey Kondratov
a4be54d21f [compute_ctl] Stop updating roles on each compute start (#3391)
I noticed that `compute_ctl` updates all roles on each start, search for
rows like

> - web_access:[FILTERED] -> update

in the compute startup log.

It happens because we had an ad-hoc hack for md5 hash comparison, which
doesn't work with the SCRAM hashes stored in `pg_authid`. It doesn't
really hurt, as nothing changes, but we run >= 2 extra queries on
each start, so fix it.
2023-01-23 17:46:22 +01:00
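
The actual fix is in the `spec.rs` hunk near the bottom of this diff; its core is stripping the `md5` prefix that Postgres stores in `pg_authid` (when present) before comparing against the hash from the spec. A reduced stand-alone illustration (the helper name `password_changed` is mine, not the repository's):

```
// Strip the "md5" prefix stored in pg_authid, if any, then compare.
fn password_changed(pg_hash: &str, spec_hash: &str) -> bool {
    let pg_hash = pg_hash.strip_prefix("md5").unwrap_or(pg_hash);
    pg_hash != spec_hash
}

fn main() {
    // md5 hashes: the stored value carries the prefix, the spec value does not.
    assert!(!password_changed("md5deadbeef", "deadbeef"));
    // SCRAM hashes have no "md5" prefix, so they are compared verbatim.
    assert!(!password_changed("SCRAM-SHA-256$4096:salt", "SCRAM-SHA-256$4096:salt"));
    println!("ok");
}
```
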
Christian Schwarz
6b6570b580 remove TimelineState::Suspended, introduce TimelineState::Loading
The TimelineState::Suspended state was dubious to begin with. I suppose
the intention was that timelines could transition back and
forth between the Active and Suspended states.
But in practice, the code before this patch never did that.
The transitions were:

    () ==Timeline::new==> Suspended ==*==> {Active,Broken,Stopping}

One exception: Tenant::set_stopping() could transition timelines like
so:

    !Broken ==Tenant::set_stopping()==> Suspended

But Tenant itself cannot transition from stopping state to any other
state.

Thus, this patch removes TimelineState::Suspended and introduces a new
state Loading. The aforementioned transitions change as follows:

    - () ==Timeline::new==> Suspended ==*==> {Active,Broken,Stopping}
    + () ==Timeline::new==> Loading   ==*==> {Active,Broken,Stopping}

    - !Broken ==Tenant::set_stopping()==> Suspended
    + !Broken ==Tenant::set_stopping()==> Stopping

Walreceiver's connection manager loop watches TimelineState to decide
whether it should retry connecting, or exit.
This patch changes the loop to exit when it observes the transition
into Stopping state.

Walreceiver isn't supposed to be started until the timeline transitions
into Active state. So, this patch also adds some warn!() messages
in case this happens anyways.
2023-01-23 17:22:49 +01:00
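
A rough sketch of the watch-and-exit pattern described above, using a `tokio::sync::watch` channel (the state names match the commit message; the loop body and function name are hypothetical, and the real connection manager is considerably more involved):

```
use tokio::sync::watch;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TimelineState {
    Loading,
    Active,
    Stopping,
    Broken,
}

// Keep (re)trying while the timeline is Active; exit once it is Stopping or Broken.
async fn connection_manager_loop(mut state_rx: watch::Receiver<TimelineState>) {
    loop {
        match *state_rx.borrow() {
            TimelineState::Active => {
                // ... try to (re)connect to a safekeeper here ...
            }
            TimelineState::Stopping | TimelineState::Broken => return,
            TimelineState::Loading => {
                // Walreceiver should not run before activation; warn and wait.
                tracing::warn!("walreceiver running while timeline is still Loading");
            }
        }
        if state_rx.changed().await.is_err() {
            return; // sender dropped, shut down
        }
    }
}
```
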
Joonas Koivunen
7704caa3ac More tenant size fixes (#3410)
Small changes, but hopefully this will help with the panic detected in
staging, for which we cannot get the debugging information right now
(end-of-branch before branch-point).
2023-01-23 17:12:51 +02:00
Shany Pozin
a44e5eda14 Adding pageserver3 to staging (#3403) 2023-01-23 14:08:48 +01:00
Konstantin Knizhnik
5c865f46ba Fix slru_segment_key_range function: segno was assigned to incorrect Key field (#3354) 2023-01-23 10:51:09 +02:00
bojanserafimov
a3d7ad2d52 Implement layer map using immutable BST (#2998) 2023-01-20 16:10:12 -05:00
Anastasia Lubennikova
36f048d6b0 Fix tenant size orphans (#3377)
Before, only the timelines that had passed the `gc_horizon` were
processed, which failed with orphans at the tree_sort phase. Example
input is in the added `test_branched_empty_timeline_size` test case.

The PR changes the iteration to go through all timelines, and in
addition to that, any learned branch points are calculated as they
would have been in the original implementation if the ancestor branch had
been over the `gc_horizon`.

This also changes how tenants where all timelines are below `gc_horizon`
are handled. Previously a tenant_size of 0 was returned, but now they will
have approximately `initdb_lsn` worth of tenant_size.

The PR also adds several new tenant size tests that describe various corner
cases of branching structure and `gc_horizon` setting.
They are currently disabled to not consume time during CI.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2023-01-20 20:21:36 +02:00
Joonas Koivunen
58fb6fe861 fix: dont stop pageserver if we fail to calculate synthetic size 2023-01-20 19:55:19 +02:00
Alexey Kondratov
20b1e26e74 [compute_ctl] Make role deletion spec processing idempotent (#3380)
Previously, we were trying to re-assign owned objects of the already
deleted role. This was causing a crash loop when the compute
was restarted with a spec that includes a delta operation for role
deletion. To avoid such cases, check that the role is still present before
calling `reassign_owned_objects`.

Resolves neondatabase/cloud#3553
2023-01-20 15:37:24 +01:00
Christian Schwarz
8ba1699937 Revert "Use actual temporary dir for pageserver unit tests"
This reverts commit 826e89b9ce.

The problem with that commit was that it deletes the TempDir while
there are still EphemeralFile instances open.

At first I thought this could be fixed by simply adding

  Handle::current().block_on(task_mgr::shutdown(None, Some(tenant_id), None))

to TenantHarness::drop, but it turned out to be insufficient.

So, reverting the commit until we find a proper solution.

refs https://github.com/neondatabase/neon/issues/3385
2023-01-19 20:16:56 +01:00
bojanserafimov
a9bd05760f Improve layer map docstrings (#3382) 2023-01-19 10:29:15 -05:00
Heikki Linnakangas
e5cc2f92c4 Switch to 'tracing' for logging, restructure code to make use of spans.
Refactors Compute::prepare_and_run. It's split into subroutines
differently, to make it easier to attach tracing spans to the
different stages. The high-level logic for waiting for Postgres to
exit is moved to the caller.

Replace 'env_logger' with 'tracing', and add `#instrument` directives
to different stages of the startup process. This is a fairly
mechanical change, except for the changes in 'spec.rs'. 'spec.rs'
contained some complicated formatting, where parts of log messages
were printed directly to stdout with `print`s. That was a bit messed
up because the log normally goes to stderr, but those lines were
printed to stdout. In our docker images, stderr and stdout both go to
the same place so you wouldn't notice, but I don't think it was
intentional.

This changes the log format to the default
'tracing_subscriber::format' format. It's different from the Postgres
log format, however, and because both compute_tools and Postgres print
to the same log, it's now a mix of two different formats. I'm not
sure how the Grafana log parsing pipeline handles that. If it's a
problem, we can build a custom formatter to change the compute_tools log
format to be the same as Postgres's, like it was before this commit,
or we can change the Postgres log format to match tracing_formatter's,
or we can start printing compute_tools' log output to a different
destination than Postgres.
2023-01-18 19:42:47 +02:00
Kirill Bulatov
90f66aa51b Enable logs in unit tests 2023-01-18 17:43:27 +02:00
Kirill Bulatov
826e89b9ce Use actual temporary dir for pageserver unit tests 2023-01-18 17:43:27 +02:00
Vadim Kharitonov
e59d32ac5d Change SENTRY_ENVIRONMENT from "development" to "staging" 2023-01-18 16:34:49 +01:00
Anastasia Lubennikova
506086a3e2 Fix metric_collection_endpoint for prod.
It was incorrectly set to the staging URL.
2023-01-18 16:35:43 +02:00
Heikki Linnakangas
3b58c61b33 If an error happens while checking for core dumps, don't panic.
If we panic, we skip the 30s wait in 'main' and don't give the
console a chance to observe the error, which is not nice.

Spotted by @ololobus at
https://github.com/neondatabase/neon/pull/3352#discussion_r1072806981
2023-01-18 11:25:47 +02:00
Kirill Bulatov
c6b56d2967 Add more io::Error context when fail to operate on a path (#3254)
I have a test failure that shows 

```
Caused by:
    0: Failed to reconstruct a page image:
    1: Directory not empty (os error 39)
```

but does not really show where exactly that happens.

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3227/release/3823785365/index.html#categories/c0057473fc9ec8fb70876fd29a171ce8/7088dab272f2c7b7/?attachment=60fe6ed2add4d82d

The PR aims to add more context in debugging that issue.
2023-01-17 22:07:38 +02:00
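
The pattern the PR adds, illustrated below with `anyhow` (the function and path names are made up), is to attach the offending path to the `io::Error`, so a failure like "Directory not empty (os error 39)" also says where it happened:

```
use anyhow::{Context, Result};
use std::path::Path;

// Attach the path to the io::Error so the error chain names the directory involved.
fn remove_timeline_dir(path: &Path) -> Result<()> {
    std::fs::remove_dir(path)
        .with_context(|| format!("failed to remove directory {}", path.display()))
}

fn main() {
    if let Err(err) = remove_timeline_dir(Path::new("/tmp/nonexistent-timeline")) {
        // Prints the whole chain: the context line plus the underlying os error.
        eprintln!("{err:#}");
    }
}
```
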
Anastasia Lubennikova
9d3992ef48 Increase metric_collection_interval for proxy on prod
to not overwhelm the service
2023-01-17 15:50:19 +02:00
Anastasia Lubennikova
7624963e13 Enable metric_collection_endpoint for proxy on prod
in all regions
2023-01-17 13:43:50 +02:00
Anastasia Lubennikova
63e3b815a2 Enable metric_collection_endpoint for pageserver on prod
in all regions
2023-01-17 13:43:50 +02:00
Kirill Bulatov
1ebd145c29 Actualize the comment (#3362)
Follow-up of
https://github.com/neondatabase/neon/pull/3326#issuecomment-1384265759
2023-01-17 13:30:42 +02:00
sharnoff
f8e887830a build: Use curl -f on vm-informant download (#3363)
Without this, we can silently fail
2023-01-17 10:38:33 +01:00
Christian Schwarz
48dd9565ac TaskHandle: tone down "sender is dropped while join handle is still alive"
Rationale: see comments added as part of this commit.

fixes https://github.com/neondatabase/neon/issues/3339
2023-01-17 09:42:22 +01:00
Anastasia Lubennikova
e067cd2947 Enable metric collection for proxy on staging 2023-01-16 21:15:42 +02:00
Christian Schwarz
58c8c1076c download_all_remote_layers API: require client to specify max_concurrent_downloads
Before this patch, we would start all layer downloads simultaneously.

There is at most one download_all_remote_layers task per timeline.
Hence, the specified limit is per timeline.

There is still no global concurrency limit for layer downloads.
We'll have to revisit that at some point and also prioritize on-demand
initiated downloads over download_all_remote_layers downloads.
But that's for another day.
2023-01-16 19:29:06 +01:00
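
A minimal sketch of the per-timeline concurrency cap described above, using a `tokio::sync::Semaphore` (the function names and the download placeholder are hypothetical; the real endpoint drives remote-storage downloads and error handling):

```
use std::sync::Arc;
use tokio::sync::Semaphore;

// Download all layers of one timeline with at most `max_concurrent_downloads`
// in flight at a time (a per-timeline limit, not a global one).
async fn download_all_layers(layer_names: Vec<String>, max_concurrent_downloads: usize) {
    let semaphore = Arc::new(Semaphore::new(max_concurrent_downloads));
    let mut tasks = Vec::new();
    for name in layer_names {
        let permit = semaphore.clone().acquire_owned().await.expect("semaphore closed");
        tasks.push(tokio::spawn(async move {
            let _permit = permit; // released when this download finishes
            download_layer(&name).await;
        }));
    }
    for task in tasks {
        let _ = task.await;
    }
}

async fn download_layer(name: &str) {
    // placeholder for the actual remote-storage download
    println!("downloading layer {name}");
}
```
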
Alexander Bayandin
4c6b507472 Update Postgres clients we test (#3359)
Update client libraries and runtimes for Postgres libraries we test.
- `pg8000` works with Neon now 🎉 
- `PostgresClientKit` still doesn't support SNI
2023-01-16 17:22:17 +00:00
Stas Kelvich
431e464c1e Consumption metering RFC 2023-01-16 19:15:59 +02:00
danieltprice
424fd0bd63 Update auth.rs (#3349)
Update SNI error message. Users now specify the endpoint ID when making
a connection to Neon. This should be reflected in the error message.
2023-01-16 12:32:00 -04:00
Joonas Koivunen
a8a9bee602 walredo: simple tests and bench updates (#3045)
Separated from #2875.

The microbenchmark has been validated to show a similar difference to the
larger-scale OLTP benchmark.
2023-01-16 18:24:45 +02:00
Vadim Kharitonov
6ac5656be5 Enable earthdistance extension 2023-01-16 17:04:51 +01:00
Anastasia Lubennikova
3c571ecde8 Update docs/consumption_metrics.md 2023-01-16 17:24:13 +02:00
Anastasia Lubennikova
5f1bd0e8a3 Add documentation for consumption metrics 2023-01-16 17:24:13 +02:00
Anastasia Lubennikova
2cbe84b78f Proxy metrics (#3290)
Implement proxy metrics collection.
Only collect metrics for outbound traffic.

Add proxy CLI parameters:
- metric-collection-endpoint
- metric-collection-interval.

Add test_proxy_metric_collection test.

Move shared consumption metrics code to libs/consumption_metrics.
Refactor the code.
2023-01-16 15:17:28 +00:00
sharnoff
5c6a7a17cb Add VM informant to vm-compute-node (#3324)
The general idea is that the VM informant binary is added to the
vm-compute-node images only. `compute_tools` will then run whatever is at
`/bin/vm-informant`, if the path exists.
2023-01-16 07:05:29 -08:00
Arseny Sher
84ffdc8b4f Don't keep FDs open on cancelled timelines in safekeepers.
Since PR #3300 we don't remove timelines completely until next restart, so this
prevents leakage.

fixes https://github.com/neondatabase/neon/issues/3336
2023-01-16 19:03:38 +04:00
Kirill Bulatov
bce4233d3a Rework Cargo.toml dependencies (#3322)
* Use cargo workspace dependency inheritance, which comes with rustc
[1.64](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1640-2022-09-22)

See
https://doc.rust-lang.org/nightly/cargo/reference/workspaces.html#the-package-table
and
https://doc.rust-lang.org/nightly/cargo/reference/workspaces.html#the-dependencies-table
sections.

Now, all dependencies in all non-root `Cargo.toml` files are defined as 
```
clap.workspace = true
```

or, when extra features are needed, as
```
bytes = {workspace = true, features = ['serde'] }
```

The actual declarations (with shared features and version
numbers/file paths/etc.) live in the root Cargo.toml.
Features are additive:

https://doc.rust-lang.org/nightly/cargo/reference/specifying-dependencies.html#inheriting-a-dependency-from-a-workspace

* Uses the mechanism above to set a common edition (2021) and license across the
workspace

* Mechanically bumps a few dependencies

* Updates hakari format, as it suggested:
```
work/neon/neon kb/cargo-templated ❯ cargo hakari generate
info: no changes detected
info: new hakari format version available: 3 (current: 2)
(add or update `dep-format-version = "3"` in hakari.toml, then run `cargo hakari generate && cargo hakari manage-deps`)
```
2023-01-13 18:13:34 +02:00
Vadim Kharitonov
16baa91b2b Add more information about cargo deny 2023-01-13 13:24:34 +01:00
Kirill Bulatov
99808558de Avoid duplicate timeline insert (#3326)
`initialize_with_lock` inserts `Arc<Timeline>` before returning it:
c1731bc4f0/pageserver/src/tenant.rs (L222)

but the `setup_timeline` function did another insert, which is removed in this PR:
c1731bc4f0/pageserver/src/tenant.rs (L486)


On top of that, a better comment and some function renames are added.
2023-01-13 12:05:54 +00:00
Anastasia Lubennikova
c6d383e239 code cleanup 2023-01-13 11:51:28 +02:00
Anastasia Lubennikova
5e3e0fbf6f remove unneeded Cargo.lock changes 2023-01-13 11:51:28 +02:00
Anastasia Lubennikova
26f39c03f2 review code cleanup:
- handle errors in calculate_synthetic_size_worker. Don't exit the bgworker if one tenant failed.

- add cached_synthetic_tenant_size to cache values calculated by the bgworker

- code cleanup: remove unneeded info! messages, clean comments

- handle collect_metrics_task() error. Don't exit collect_metrics worker if one task failed.

- add unit test to cover the case when we have multiple branches at the same lsn
2023-01-13 11:51:28 +02:00
Anastasia Lubennikova
148e020fb9 Fix logical size calculation:
sort updates in topological order so that the parent timeline always precedes its children.
    fixes #3179
2023-01-13 11:51:28 +02:00
Anastasia Lubennikova
0675859bb0 Add background worker that periodically spawns
synthetic size calculation.
Add new pageserver config param calculate_synthetic_size_interval
2023-01-13 11:51:28 +02:00
Anastasia Lubennikova
ba0190e3e8 Handle errors in tenant_size_model code 2023-01-13 11:51:28 +02:00
Konstantin Knizhnik
9ce5ada89e Do not report position in SMGR message (#3307)
refer #3277
2023-01-13 10:23:35 +02:00
Alexander Bayandin
c28bfd4c63 Nightly Benchmarks: add user provided example (#3308) 2023-01-12 23:03:21 +00:00
Vadim Kharitonov
dec875fee1 Disable postgis_sfcgal 2023-01-12 21:51:49 +01:00
Kirill Bulatov
fe8cef3427 Use ready! rustc 1.64 macro (#3315)
rustc
[1.64](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1640-2022-09-22)
brought the `ready!` macro:
https://doc.rust-lang.org/stable/std/task/macro.ready.html

Use it to shorten the code slightly.
2023-01-12 21:27:34 +02:00
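
`std::task::ready!` replaces the hand-written `match ... { Poll::Pending => return Poll::Pending, ... }` boilerplate. A small stand-alone example (the `Doubled` future is made up for illustration, not taken from this change):

```
use std::future::Future;
use std::pin::Pin;
use std::task::{ready, Context, Poll};

// A toy future that doubles the output of an inner future.
struct Doubled<F>(F);

impl<F: Future<Output = u32> + Unpin> Future for Doubled<F> {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        // `ready!` returns Poll::Pending early; before 1.64 this was a manual match.
        let value = ready!(Pin::new(&mut self.0).poll(cx));
        Poll::Ready(value * 2)
    }
}
```
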
MMeent
bb406b21a8 Fix issue in compaction code (#3246)
If we ran `compact_prefetch_buffers` with exactly one hole in the
buffers, the code would fail to remove the last, now unused, entry from
the array.

This is now fixed. 

Also, add and adjust some comments in the compaction code so that the
algorithm used is a bit more clear.

Fixes #3192
2023-01-12 19:23:59 +01:00
122 changed files with 5171 additions and 2255 deletions

View File

@@ -4,7 +4,7 @@
hakari-package = "workspace_hack"
# Format for `workspace-hack = ...` lines in other Cargo.tomls. Requires cargo-hakari 0.9.8 or above.
dep-format-version = "2"
dep-format-version = "3"
# Setting workspace.resolver = "2" in the root Cargo.toml is HIGHLY recommended.
# Hakari works much better with the new feature resolver.

View File

@@ -117,7 +117,8 @@
shell:
cmd: |
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
curl -sfS -d '{"version": {{ current_version }} }' -X PATCH {{ console_mgmt_base_url }}/api/v1/pageservers/$INSTANCE_ID
curl -sfS -H "Authorization: Bearer {{ CONSOLE_API_TOKEN }}" {{ console_mgmt_base_url }}/management/api/v2/pageservers/$INSTANCE_ID | jq '.version = {{ current_version }}' > /tmp/new_version
curl -sfS -H "Authorization: Bearer {{ CONSOLE_API_TOKEN }}" -X POST -d@/tmp/new_version {{ console_mgmt_base_url }}/management/api/v2/pageservers
tags:
- pageserver
@@ -186,6 +187,7 @@
shell:
cmd: |
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
curl -sfS -d '{"version": {{ current_version }} }' -X PATCH {{ console_mgmt_base_url }}/api/v1/safekeepers/$INSTANCE_ID
curl -sfS -H "Authorization: Bearer {{ CONSOLE_API_TOKEN }}" {{ console_mgmt_base_url }}/management/api/v2/safekeepers/$INSTANCE_ID | jq '.version = {{ current_version }}' > /tmp/new_version
curl -sfS -H "Authorization: Bearer {{ CONSOLE_API_TOKEN }}" -X POST -d@/tmp/new_version {{ console_mgmt_base_url }}/management/api/v2/safekeepers
tags:
- safekeeper

View File

@@ -6,6 +6,8 @@ storage:
broker_endpoint: http://storage-broker-lb.epsilon.ap-southeast-1.internal.aws.neon.tech:50051
pageserver_config_stub:
pg_distrib_dir: /usr/local
metric_collection_endpoint: http://console-release.local/billing/api/v1/usage_events
metric_collection_interval: 10min
remote_storage:
bucket_name: "{{ bucket_name }}"
bucket_region: "{{ bucket_region }}"

View File

@@ -6,6 +6,8 @@ storage:
broker_endpoint: http://storage-broker-lb.gamma.eu-central-1.internal.aws.neon.tech:50051
pageserver_config_stub:
pg_distrib_dir: /usr/local
metric_collection_endpoint: http://console-release.local/billing/api/v1/usage_events
metric_collection_interval: 10min
remote_storage:
bucket_name: "{{ bucket_name }}"
bucket_region: "{{ bucket_region }}"

View File

@@ -6,6 +6,8 @@ storage:
broker_endpoint: http://storage-broker-lb.delta.us-east-2.internal.aws.neon.tech:50051
pageserver_config_stub:
pg_distrib_dir: /usr/local
metric_collection_endpoint: http://console-release.local/billing/api/v1/usage_events
metric_collection_interval: 10min
remote_storage:
bucket_name: "{{ bucket_name }}"
bucket_region: "{{ bucket_region }}"
@@ -34,4 +36,4 @@ storage:
ansible_host: i-06d113fb73bfddeb0
safekeeper-2.us-east-2.aws.neon.tech:
ansible_host: i-09f66c8e04afff2e8

View File

@@ -6,6 +6,8 @@ storage:
broker_endpoint: http://storage-broker-lb.eta.us-west-2.internal.aws.neon.tech:50051
pageserver_config_stub:
pg_distrib_dir: /usr/local
metric_collection_endpoint: http://console-release.local/billing/api/v1/usage_events
metric_collection_interval: 10min
remote_storage:
bucket_name: "{{ bucket_name }}"
bucket_region: "{{ bucket_region }}"

View File

@@ -7,6 +7,8 @@ storage:
broker_endpoint: http://storage-broker.prod.local:50051
pageserver_config_stub:
pg_distrib_dir: /usr/local
metric_collection_endpoint: http://console-release.local/billing/api/v1/usage_events
metric_collection_interval: 10min
remote_storage:
bucket_name: "{{ bucket_name }}"
bucket_region: "{{ bucket_region }}"

View File

@@ -18,7 +18,7 @@ storage:
ansible_aws_ssm_region: eu-west-1
ansible_aws_ssm_bucket_name: neon-dev-storage-eu-west-1
console_region_id: aws-eu-west-1
sentry_environment: development
sentry_environment: staging
children:
pageservers:

View File

@@ -18,7 +18,7 @@ storage:
ansible_aws_ssm_region: us-east-2
ansible_aws_ssm_bucket_name: neon-staging-storage-us-east-2
console_region_id: aws-us-east-2
sentry_environment: development
sentry_environment: staging
children:
pageservers:
@@ -29,6 +29,8 @@ storage:
ansible_host: i-0565a8b4008aa3f40
pageserver-2.us-east-2.aws.neon.build:
ansible_host: i-01e31cdf7e970586a
pageserver-3.us-east-2.aws.neon.build:
ansible_host: i-0602a0291365ef7cc
safekeepers:
hosts:

View File

@@ -8,8 +8,10 @@ settings:
authBackend: "console"
authEndpoint: "http://console-staging.local/management/api/v2"
domain: "*.eu-west-1.aws.neon.build"
sentryEnvironment: "development"
sentryEnvironment: "staging"
wssPort: 8443
metricCollectionEndpoint: "http://console-staging.local/billing/api/v1/usage_events"
metricCollectionInterval: "1min"
# -- Additional labels for neon-proxy pods
podLabels:

View File

@@ -49,4 +49,4 @@ extraManifests:
- "{{ .Release.Namespace }}"
settings:
sentryEnvironment: "development"
sentryEnvironment: "staging"

View File

@@ -8,7 +8,9 @@ settings:
authBackend: "link"
authEndpoint: "https://console.stage.neon.tech/authenticate_proxy_request/"
uri: "https://console.stage.neon.tech/psql_session/"
sentryEnvironment: "development"
sentryEnvironment: "staging"
metricCollectionEndpoint: "http://console-staging.local/billing/api/v1/usage_events"
metricCollectionInterval: "1min"
# -- Additional labels for neon-proxy-link pods
podLabels:

View File

@@ -8,8 +8,10 @@ settings:
authBackend: "console"
authEndpoint: "http://console-staging.local/management/api/v2"
domain: "*.cloud.stage.neon.tech"
sentryEnvironment: "development"
sentryEnvironment: "staging"
wssPort: 8443
metricCollectionEndpoint: "http://console-staging.local/billing/api/v1/usage_events"
metricCollectionInterval: "1min"
# -- Additional labels for neon-proxy pods
podLabels:

View File

@@ -8,8 +8,10 @@ settings:
authBackend: "console"
authEndpoint: "http://console-staging.local/management/api/v2"
domain: "*.us-east-2.aws.neon.build"
sentryEnvironment: "development"
sentryEnvironment: "staging"
wssPort: 8443
metricCollectionEndpoint: "http://console-staging.local/billing/api/v1/usage_events"
metricCollectionInterval: "1min"
# -- Additional labels for neon-proxy pods
podLabels:

View File

@@ -49,4 +49,4 @@ extraManifests:
- "{{ .Release.Namespace }}"
settings:
sentryEnvironment: "development"
sentryEnvironment: "staging"

View File

@@ -10,6 +10,8 @@ settings:
domain: "*.ap-southeast-1.aws.neon.tech"
sentryEnvironment: "production"
wssPort: 8443
metricCollectionEndpoint: "http://console-release.local/billing/api/v1/usage_events"
metricCollectionInterval: "10min"
# -- Additional labels for neon-proxy pods
podLabels:

View File

@@ -10,6 +10,8 @@ settings:
domain: "*.eu-central-1.aws.neon.tech"
sentryEnvironment: "production"
wssPort: 8443
metricCollectionEndpoint: "http://console-release.local/billing/api/v1/usage_events"
metricCollectionInterval: "10min"
# -- Additional labels for neon-proxy pods
podLabels:

View File

@@ -10,6 +10,8 @@ settings:
domain: "*.us-east-2.aws.neon.tech"
sentryEnvironment: "production"
wssPort: 8443
metricCollectionEndpoint: "http://console-release.local/billing/api/v1/usage_events"
metricCollectionInterval: "10min"
# -- Additional labels for neon-proxy pods
podLabels:

View File

@@ -10,6 +10,8 @@ settings:
domain: "*.us-west-2.aws.neon.tech"
sentryEnvironment: "production"
wssPort: 8443
metricCollectionEndpoint: "http://console-release.local/billing/api/v1/usage_events"
metricCollectionInterval: "10min"
# -- Additional labels for neon-proxy pods
podLabels:

View File

@@ -4,6 +4,8 @@ settings:
domain: "*.cloud.neon.tech"
sentryEnvironment: "production"
wssPort: 8443
metricCollectionEndpoint: "http://console-release.local/billing/api/v1/usage_events"
metricCollectionInterval: "10min"
podLabels:
zenith_service: proxy-scram

View File

@@ -489,3 +489,108 @@ jobs:
slack-message: "Periodic TPC-H perf testing ${{ matrix.platform }}: ${{ job.status }}\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
env:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
user-examples-compare:
if: success() || failure()
needs: [ tpch-compare ]
strategy:
fail-fast: false
matrix:
# neon-captest-prefetch: We have pre-created projects with prefetch enabled
# rds-aurora: Aurora Postgres Serverless v2 with autoscaling from 0.5 to 2 ACUs
# rds-postgres: RDS Postgres db.m5.large instance (2 vCPU, 8 GiB) with gp3 EBS storage
platform: [ neon-captest-prefetch, rds-postgres, rds-aurora ]
env:
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
DEFAULT_PG_VERSION: 14
TEST_OUTPUT: /tmp/test_output
BUILD_TYPE: remote
SAVE_PERF_REPORT: ${{ github.event.inputs.save_perf_report || ( github.ref == 'refs/heads/main' ) }}
PLATFORM: ${{ matrix.platform }}
runs-on: [ self-hosted, us-east-2, x64 ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
options: --init
timeout-minutes: 360 # 6h
steps:
- uses: actions/checkout@v3
- name: Download Neon artifact
uses: ./.github/actions/download
with:
name: neon-${{ runner.os }}-release-artifact
path: /tmp/neon/
prefix: latest
- name: Add Postgres binaries to PATH
run: |
${POSTGRES_DISTRIB_DIR}/v${DEFAULT_PG_VERSION}/bin/pgbench --version
echo "${POSTGRES_DISTRIB_DIR}/v${DEFAULT_PG_VERSION}/bin" >> $GITHUB_PATH
- name: Set up Connection String
id: set-up-connstr
run: |
case "${PLATFORM}" in
neon-captest-prefetch)
CONNSTR=${{ secrets.BENCHMARK_USER_EXAMPLE_CAPTEST_CONNSTR }}
;;
rds-aurora)
CONNSTR=${{ secrets.BENCHMARK_USER_EXAMPLE_RDS_AURORA_CONNSTR }}
;;
rds-postgres)
CONNSTR=${{ secrets.BENCHMARK_USER_EXAMPLE_RDS_POSTGRES_CONNSTR }}
;;
*)
echo 2>&1 "Unknown PLATFORM=${PLATFORM}. Allowed only 'neon-captest-prefetch', 'rds-aurora', or 'rds-postgres'"
exit 1
;;
esac
echo "connstr=${CONNSTR}" >> $GITHUB_OUTPUT
psql ${CONNSTR} -c "SELECT version();"
- name: Set database options
if: matrix.platform == 'neon-captest-prefetch'
run: |
DB_NAME=$(psql ${BENCHMARK_CONNSTR} --no-align --quiet -t -c "SELECT current_database()")
psql ${BENCHMARK_CONNSTR} -c "ALTER DATABASE ${DB_NAME} SET enable_seqscan_prefetch=on"
psql ${BENCHMARK_CONNSTR} -c "ALTER DATABASE ${DB_NAME} SET effective_io_concurrency=32"
psql ${BENCHMARK_CONNSTR} -c "ALTER DATABASE ${DB_NAME} SET maintenance_io_concurrency=32"
env:
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
- name: Run user examples
uses: ./.github/actions/run-python-test-set
with:
build_type: ${{ env.BUILD_TYPE }}
test_selection: performance/test_perf_olap.py
run_in_parallel: false
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_user_examples
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
- name: Create Allure report
if: success() || failure()
uses: ./.github/actions/allure-report
with:
action: generate
build_type: ${{ env.BUILD_TYPE }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}
uses: slackapi/slack-github-action@v1
with:
channel-id: "C033QLM5P7D" # dev-staging-stream
slack-message: "Periodic TPC-H perf testing ${{ matrix.platform }}: ${{ job.status }}\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
env:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

View File

@@ -595,6 +595,8 @@ jobs:
defaults:
run:
shell: sh -eu {0}
env:
VM_INFORMANT_VERSION: 0.1.1
steps:
- name: Downloading latest vm-builder
@@ -606,9 +608,22 @@ jobs:
run: |
docker pull 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
- name: Downloading VM informant version ${{ env.VM_INFORMANT_VERSION }}
run: |
curl -fL https://github.com/neondatabase/autoscaling/releases/download/${{ env.VM_INFORMANT_VERSION }}/vm-informant -o vm-informant
chmod +x vm-informant
- name: Adding VM informant to compute-node image
run: |
ID=$(docker create 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}})
docker cp vm-informant $ID:/bin/vm-informant
docker commit $ID temp-vm-compute-node
docker rm -f $ID
- name: Build vm image
run: |
./vm-builder -src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} -dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
# note: as of 2023-01-12, vm-builder requires a trailing ":latest" for local images
./vm-builder -src=temp-vm-compute-node:latest -dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
- name: Pushing vm-compute-node image
run: |

690
Cargo.lock generated

File diff suppressed because it is too large

View File

@@ -10,6 +10,140 @@ members = [
"libs/*",
]
[workspace.package]
edition = "2021"
license = "Apache-2.0"
## All dependency versions, used in the project
[workspace.dependencies]
anyhow = { version = "1.0", features = ["backtrace"] }
async-stream = "0.3"
async-trait = "0.1"
atty = "0.2.14"
aws-config = { version = "0.51.0", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.21.0"
aws-smithy-http = "0.51.0"
aws-types = "0.51.0"
base64 = "0.13.0"
bincode = "1.3"
bindgen = "0.61"
bstr = "1.0"
byteorder = "1.4"
bytes = "1.0"
chrono = { version = "0.4", default-features = false, features = ["clock"] }
clap = "4.0"
close_fds = "0.3.2"
comfy-table = "6.1"
const_format = "0.2"
crc32c = "0.6"
crossbeam-utils = "0.8.5"
fail = "0.5.0"
fs2 = "0.4.3"
futures = "0.3"
futures-core = "0.3"
futures-util = "0.3"
git-version = "0.3"
hashbrown = "0.13"
hex = "0.4"
hex-literal = "0.3"
hmac = "0.12.1"
hostname = "0.3.1"
humantime = "2.1"
humantime-serde = "1.1.1"
hyper = "0.14"
hyper-tungstenite = "0.9"
itertools = "0.10"
jsonwebtoken = "8"
libc = "0.2"
md5 = "0.7.0"
memoffset = "0.8"
nix = "0.26"
notify = "5.0.0"
num-traits = "0.2.15"
once_cell = "1.13"
parking_lot = "0.12"
pin-project-lite = "0.2"
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
prost = "0.11"
rand = "0.8"
regex = "1.4"
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] }
routerify = "3"
rpds = "0.12.0"
rustls = "0.20"
rustls-pemfile = "1"
rustls-split = "0.3"
scopeguard = "1.1"
sentry = { version = "0.29", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "2.0"
sha2 = "0.10.2"
signal-hook = "0.3"
socket2 = "0.4.4"
strum = "0.24"
strum_macros = "0.24"
svg_fmt = "0.4.1"
tar = "0.4"
thiserror = "1.0"
tls-listener = { version = "0.6", features = ["rustls", "hyper-h1"] }
tokio = { version = "1.17", features = ["macros"] }
tokio-postgres-rustls = "0.9.0"
tokio-rustls = "0.23"
tokio-stream = "0.1"
tokio-util = { version = "0.7", features = ["io"] }
toml = "0.5"
toml_edit = { version = "0.17", features = ["easy"] }
tonic = {version = "0.8", features = ["tls", "tls-roots"]}
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
url = "2.2"
uuid = { version = "1.2", features = ["v4", "serde"] }
walkdir = "2.3.2"
webpki-roots = "0.22.5"
x509-parser = "0.14"
## TODO replace this with tracing
env_logger = "0.10"
log = "0.4"
## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
tokio-tar = { git = "https://github.com/neondatabase/tokio-tar.git", rev="404df61437de0feef49ba2ccdbdd94eb8ad6e142" }
## Local libraries
consumption_metrics = { version = "0.1", path = "./libs/consumption_metrics/" }
metrics = { version = "0.1", path = "./libs/metrics/" }
pageserver_api = { version = "0.1", path = "./libs/pageserver_api/" }
postgres_connection = { version = "0.1", path = "./libs/postgres_connection/" }
postgres_ffi = { version = "0.1", path = "./libs/postgres_ffi/" }
pq_proto = { version = "0.1", path = "./libs/pq_proto/" }
remote_storage = { version = "0.1", path = "./libs/remote_storage/" }
safekeeper_api = { version = "0.1", path = "./libs/safekeeper_api" }
storage_broker = { version = "0.1", path = "./storage_broker/" } # Note: main broker code is inside the binary crate, so linking with the library shouldn't be heavy.
tenant_size_model = { version = "0.1", path = "./libs/tenant_size_model/" }
utils = { version = "0.1", path = "./libs/utils/" }
## Common library dependency
workspace_hack = { version = "0.1", path = "./workspace_hack/" }
## Build dependencies
criterion = "0.4"
rcgen = "0.10"
rstest = "0.16"
tempfile = "3.2"
tonic-build = "0.8"
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
[patch.crates-io]
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
################# Binary contents sections
[profile.release]
# This is useful for profiling and, to some extent, debug.
# Besides, debug info should not affect the performance.
@@ -70,9 +204,3 @@ inherits = "release"
debug = false # true = 2 = all symbols, 1 = line only
opt-level = "z"
lto = true
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
[patch.crates-io]
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }

View File

@@ -34,7 +34,8 @@ RUN cd postgres && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/bloom.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgrowlocks.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/intagg.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgstattuple.control
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgstattuple.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/earthdistance.control
#########################################################################################
#
@@ -62,8 +63,7 @@ RUN wget https://download.osgeo.org/postgis/source/postgis-3.3.1.tar.gz && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_tiger_geocoder.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_topology.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer_data_us.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_sfcgal.control
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer_data_us.control
#########################################################################################
#

View File

@@ -34,7 +34,8 @@ RUN cd postgres && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/bloom.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgrowlocks.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/intagg.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgstattuple.control
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgstattuple.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/earthdistance.control
#########################################################################################
#
@@ -62,8 +63,7 @@ RUN wget https://download.osgeo.org/postgis/source/postgis-3.3.1.tar.gz && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_tiger_geocoder.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_topology.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer_data_us.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/postgis_sfcgal.control
echo 'trusted = true' >> /usr/local/pgsql/share/extension/address_standardizer_data_us.control
#########################################################################################
#

View File

@@ -1,24 +1,25 @@
[package]
name = "compute_tools"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
chrono = { version = "0.4", default-features = false, features = ["clock"] }
clap = "4.0"
env_logger = "0.9"
futures = "0.3.13"
hyper = { version = "0.14", features = ["full"] }
log = { version = "0.4", features = ["std", "serde"] }
notify = "5.0.0"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
regex = "1"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
tar = "0.4"
tokio = { version = "1.17", features = ["macros", "rt", "rt-multi-thread"] }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
url = "2.2.2"
workspace_hack = { version = "0.1", path = "../workspace_hack" }
anyhow.workspace = true
chrono.workspace = true
clap.workspace = true
futures.workspace = true
hyper = { workspace = true, features = ["full"] }
notify.workspace = true
postgres.workspace = true
regex.workspace = true
serde.workspace = true
serde_json.workspace = true
tar.workspace = true
tokio = { workspace = true, features = ["rt", "rt-multi-thread"] }
tokio-postgres.workspace = true
tracing.workspace = true
tracing-subscriber.workspace = true
url.workspace = true
workspace_hack.workspace = true

View File

@@ -19,6 +19,10 @@ Also `compute_ctl` spawns two separate service threads:
- `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
last activity requests.
If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
downscaling and (eventually) will request immediate upscaling under resource pressure.
Usage example:
```sh
compute_ctl -D /var/db/postgres/compute \

View File

@@ -18,6 +18,10 @@
//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
//! last activity requests.
//!
//! If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
//! compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
//! downscaling and (eventually) will request immediate upscaling under resource pressure.
//!
//! Usage example:
//! ```sh
//! compute_ctl -D /var/db/postgres/compute \
@@ -36,10 +40,11 @@ use std::{thread, time::Duration};
use anyhow::{Context, Result};
use chrono::Utc;
use clap::Arg;
use log::{error, info};
use tracing::{error, info};
use compute_tools::compute::{ComputeMetrics, ComputeNode, ComputeState, ComputeStatus};
use compute_tools::http::api::launch_http_server;
use compute_tools::informant::spawn_vm_informant_if_present;
use compute_tools::logger::*;
use compute_tools::monitor::launch_monitor;
use compute_tools::params::*;
@@ -48,7 +53,6 @@ use compute_tools::spec::*;
use url::Url;
fn main() -> Result<()> {
// TODO: re-use `utils::logging` later
init_logger(DEFAULT_LOG_LEVEL)?;
let matches = cli().get_matches();
@@ -114,30 +118,48 @@ fn main() -> Result<()> {
// requests, while configuration is still in progress.
let _http_handle = launch_http_server(&compute).expect("cannot launch http endpoint thread");
let _monitor_handle = launch_monitor(&compute).expect("cannot launch compute monitor thread");
// Also spawn the thread responsible for handling the VM informant -- if it's present
let _vm_informant_handle = spawn_vm_informant_if_present().expect("cannot launch VM informant");
// Run compute (Postgres) and hang waiting on it.
match compute.prepare_and_run() {
Ok(ec) => {
let code = ec.code().unwrap_or(1);
info!("Postgres exited with code {}, shutting down", code);
exit(code)
}
Err(error) => {
error!("could not start the compute node: {:?}", error);
// Start Postgres
let mut delay_exit = false;
let mut exit_code = None;
let pg = match compute.start_compute() {
Ok(pg) => Some(pg),
Err(err) => {
error!("could not start the compute node: {:?}", err);
let mut state = compute.state.write().unwrap();
state.error = Some(format!("{:?}", error));
state.error = Some(format!("{:?}", err));
state.status = ComputeStatus::Failed;
drop(state);
// Keep serving HTTP requests, so the cloud control plane was able to
// get the actual error.
info!("giving control plane 30s to collect the error before shutdown");
thread::sleep(Duration::from_secs(30));
info!("shutting down");
Err(error)
delay_exit = true;
None
}
};
// Wait for the child Postgres process forever. In this state Ctrl+C will
// propagate to Postgres and it will be shut down as well.
if let Some(mut pg) = pg {
let ecode = pg
.wait()
.expect("failed to start waiting on Postgres process");
info!("Postgres exited with code {}, shutting down", ecode);
exit_code = ecode.code()
}
if let Err(err) = compute.check_for_core_dumps() {
error!("error while checking for core dumps: {err:?}");
}
// If launch failed, keep serving HTTP requests for a while, so the cloud
// control plane can get the actual error.
if delay_exit {
info!("giving control plane 30s to collect the error before shutdown");
thread::sleep(Duration::from_secs(30));
info!("shutting down");
}
exit(exit_code.unwrap_or(1))
}
fn cli() -> clap::Command {

View File

@@ -1,10 +1,11 @@
use anyhow::{anyhow, Result};
use log::error;
use postgres::Client;
use tokio_postgres::NoTls;
use tracing::{error, instrument};
use crate::compute::ComputeNode;
#[instrument(skip_all)]
pub fn create_writability_check_data(client: &mut Client) -> Result<()> {
let query = "
CREATE TABLE IF NOT EXISTS health_check (
@@ -21,6 +22,7 @@ pub fn create_writability_check_data(client: &mut Client) -> Result<()> {
Ok(())
}
#[instrument(skip_all)]
pub async fn check_writability(compute: &ComputeNode) -> Result<()> {
let (client, connection) = tokio_postgres::connect(compute.connstr.as_str(), NoTls).await?;
if client.is_closed() {

View File

@@ -17,15 +17,15 @@
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::{Command, ExitStatus, Stdio};
use std::process::{Command, Stdio};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;
use anyhow::{Context, Result};
use chrono::{DateTime, Utc};
use log::{info, warn};
use postgres::{Client, NoTls};
use serde::{Serialize, Serializer};
use tracing::{info, instrument, warn};
use crate::checker::create_writability_check_data;
use crate::config;
@@ -121,6 +121,7 @@ impl ComputeNode {
// Get basebackup from the libpq connection to pageserver using `connstr` and
// unarchive it to `pgdata` directory overriding all its previous content.
#[instrument(skip(self))]
fn get_basebackup(&self, lsn: &str) -> Result<()> {
let start_time = Utc::now();
@@ -154,6 +155,7 @@ impl ComputeNode {
// Run `postgres` in a special mode with `--sync-safekeepers` argument
// and return the reported LSN back to the caller.
#[instrument(skip(self))]
fn sync_safekeepers(&self) -> Result<String> {
let start_time = Utc::now();
@@ -196,6 +198,7 @@ impl ComputeNode {
/// Do all the preparations like PGDATA directory creation, configuration,
/// safekeepers sync, basebackup, etc.
#[instrument(skip(self))]
pub fn prepare_pgdata(&self) -> Result<()> {
let spec = &self.spec;
let pgdata_path = Path::new(&self.pgdata);
@@ -229,9 +232,8 @@ impl ComputeNode {
/// Start Postgres as a child process and manage DBs/roles.
/// After that this will hang waiting on the postmaster process to exit.
pub fn run(&self) -> Result<ExitStatus> {
let start_time = Utc::now();
#[instrument(skip(self))]
pub fn start_postgres(&self) -> Result<std::process::Child> {
let pgdata_path = Path::new(&self.pgdata);
// Run postgres as a child process.
@@ -242,10 +244,15 @@ impl ComputeNode {
wait_for_postgres(&mut pg, pgdata_path)?;
Ok(pg)
}
#[instrument(skip(self))]
pub fn apply_config(&self) -> Result<()> {
// If connection fails,
// it may be the old node with `zenith_admin` superuser.
//
// In this case we need to connect with old `zenith_admin`name
// In this case we need to connect with old `zenith_admin` name
// and create new user. We cannot simply rename connected user,
// but we can create a new one and grant it all privileges.
let mut client = match Client::connect(self.connstr.as_str(), NoTls) {
@@ -271,6 +278,7 @@ impl ComputeNode {
Ok(client) => client,
};
// Proceed with post-startup configuration. Note, that order of operations is important.
handle_roles(&self.spec, &mut client)?;
handle_databases(&self.spec, &mut client)?;
handle_role_deletions(self, &mut client)?;
@@ -279,8 +287,34 @@ impl ComputeNode {
// 'Close' connection
drop(client);
let startup_end_time = Utc::now();
info!(
"finished configuration of compute for project {}",
self.spec.cluster.cluster_id
);
Ok(())
}
#[instrument(skip(self))]
pub fn start_compute(&self) -> Result<std::process::Child> {
info!(
"starting compute for project {}, operation {}, tenant {}, timeline {}",
self.spec.cluster.cluster_id,
self.spec.operation_uuid.as_ref().unwrap(),
self.tenant,
self.timeline,
);
self.prepare_pgdata()?;
let start_time = Utc::now();
let pg = self.start_postgres()?;
self.apply_config()?;
let startup_end_time = Utc::now();
self.metrics.config_ms.store(
startup_end_time
.signed_duration_since(start_time)
@@ -300,34 +334,7 @@ impl ComputeNode {
self.set_status(ComputeStatus::Running);
info!(
"finished configuration of compute for project {}",
self.spec.cluster.cluster_id
);
// Wait for child Postgres process basically forever. In this state Ctrl+C
// will propagate to Postgres and it will be shut down as well.
let ecode = pg
.wait()
.expect("failed to start waiting on Postgres process");
self.check_for_core_dumps()
.expect("failed to check for core dumps");
Ok(ecode)
}
pub fn prepare_and_run(&self) -> Result<ExitStatus> {
info!(
"starting compute for project {}, operation {}, tenant {}, timeline {}",
self.spec.cluster.cluster_id,
self.spec.operation_uuid.as_ref().unwrap(),
self.tenant,
self.timeline,
);
self.prepare_pgdata()?;
self.run()
Ok(pg)
}
// Look for core dumps and collect backtraces.
@@ -340,7 +347,7 @@ impl ComputeNode {
//
// Use that as a default location and pattern, except macos where core dumps are written
// to /cores/ directory by default.
fn check_for_core_dumps(&self) -> Result<()> {
pub fn check_for_core_dumps(&self) -> Result<()> {
let core_dump_dir = match std::env::consts::OS {
"macos" => Path::new("/cores/"),
_ => Path::new(&self.pgdata),

View File

@@ -6,8 +6,8 @@ use std::thread;
use anyhow::Result;
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Method, Request, Response, Server, StatusCode};
use log::{error, info};
use serde_json;
use tracing::{error, info};
use crate::compute::ComputeNode;

View File

@@ -0,0 +1,50 @@
use std::path::Path;
use std::process;
use std::thread;
use std::time::Duration;
use tracing::{info, warn};
use anyhow::{Context, Result};
const VM_INFORMANT_PATH: &str = "/bin/vm-informant";
const RESTART_INFORMANT_AFTER_MILLIS: u64 = 5000;
/// Launch a thread to start the VM informant if it's present (and restart, on failure)
pub fn spawn_vm_informant_if_present() -> Result<Option<thread::JoinHandle<()>>> {
let exists = Path::new(VM_INFORMANT_PATH)
.try_exists()
.context("could not check if path exists")?;
if !exists {
return Ok(None);
}
Ok(Some(
thread::Builder::new()
.name("run-vm-informant".into())
.spawn(move || run_informant())?,
))
}
fn run_informant() -> ! {
let restart_wait = Duration::from_millis(RESTART_INFORMANT_AFTER_MILLIS);
info!("starting VM informant");
loop {
let mut cmd = process::Command::new(VM_INFORMANT_PATH);
// Block on subprocess:
let result = cmd.status();
match result {
Err(e) => warn!("failed to run VM informant at {VM_INFORMANT_PATH:?}: {e}"),
Ok(status) if !status.success() => {
warn!("{VM_INFORMANT_PATH} exited with code {status:?}, retrying")
}
Ok(_) => info!("{VM_INFORMANT_PATH} ended gracefully (unexpectedly). Retrying"),
}
// Wait before retrying
thread::sleep(restart_wait);
}
}

View File

@@ -8,6 +8,7 @@ pub mod http;
#[macro_use]
pub mod logger;
pub mod compute;
pub mod informant;
pub mod monitor;
pub mod params;
pub mod pg_helpers;

View File

@@ -1,42 +1,20 @@
use std::io::Write;
use anyhow::Result;
use chrono::Utc;
use env_logger::{Builder, Env};
macro_rules! info_println {
($($tts:tt)*) => {
if log_enabled!(Level::Info) {
println!($($tts)*);
}
}
}
macro_rules! info_print {
($($tts:tt)*) => {
if log_enabled!(Level::Info) {
print!($($tts)*);
}
}
}
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::prelude::*;
/// Initialize `env_logger` using either `default_level` or
/// `RUST_LOG` environment variable as default log level.
pub fn init_logger(default_level: &str) -> Result<()> {
let env = Env::default().filter_or("RUST_LOG", default_level);
let env_filter = tracing_subscriber::EnvFilter::try_from_default_env()
.unwrap_or_else(|_| tracing_subscriber::EnvFilter::new(default_level));
Builder::from_env(env)
.format(|buf, record| {
let thread_handle = std::thread::current();
writeln!(
buf,
"{} [{}] {}: {}",
Utc::now().format("%Y-%m-%d %H:%M:%S%.3f %Z"),
thread_handle.name().unwrap_or("main"),
record.level(),
record.args()
)
})
let fmt_layer = tracing_subscriber::fmt::layer()
.with_target(false)
.with_writer(std::io::stderr);
tracing_subscriber::registry()
.with(env_filter)
.with(fmt_layer)
.init();
Ok(())

View File

@@ -3,8 +3,8 @@ use std::{thread, time};
use anyhow::Result;
use chrono::{DateTime, Utc};
use log::{debug, info};
use postgres::{Client, NoTls};
use tracing::{debug, info};
use crate::compute::ComputeNode;

View File

@@ -1,3 +1,9 @@
pub const DEFAULT_LOG_LEVEL: &str = "info";
pub const DEFAULT_CONNSTRING: &str = "host=localhost user=postgres";
// From Postgres docs:
// To ease transition from the md5 method to the newer SCRAM method, if md5 is specified
// as a method in pg_hba.conf but the user's password on the server is encrypted for SCRAM
// (see below), then SCRAM-based authentication will automatically be chosen instead.
// https://www.postgresql.org/docs/15/auth-password.html
//
// So it's safe to set md5 here, as `control-plane` anyway uses SCRAM for all roles.
pub const PG_HBA_ALL_MD5: &str = "host\tall\t\tall\t\t0.0.0.0/0\t\tmd5";

View File

@@ -11,6 +11,7 @@ use anyhow::{bail, Result};
use notify::{RecursiveMode, Watcher};
use postgres::{Client, Transaction};
use serde::Deserialize;
use tracing::{debug, instrument};
const POSTGRES_WAIT_TIMEOUT: Duration = Duration::from_millis(60 * 1000); // milliseconds
@@ -129,8 +130,8 @@ impl Role {
/// Serialize a list of role parameters into a Postgres-acceptable
/// string of arguments.
pub fn to_pg_options(&self) -> String {
// XXX: consider putting LOGIN as a default option somewhere higher, e.g. in Rails.
// For now we do not use generic `options` for roles. Once used, add
// XXX: consider putting LOGIN as a default option somewhere higher, e.g. in control-plane.
// For now, we do not use generic `options` for roles. Once used, add
// `self.options.as_pg_options()` somewhere here.
let mut params: String = "LOGIN".to_string();
@@ -229,6 +230,7 @@ pub fn get_existing_dbs(client: &mut Client) -> Result<Vec<Database>> {
/// Wait for Postgres to become ready to accept connections. It's ready to
/// accept connections when the state-field in `pgdata/postmaster.pid` says
/// 'ready'.
#[instrument(skip(pg))]
pub fn wait_for_postgres(pg: &mut Child, pgdata: &Path) -> Result<()> {
let pid_path = pgdata.join("postmaster.pid");
@@ -287,18 +289,18 @@ pub fn wait_for_postgres(pg: &mut Child, pgdata: &Path) -> Result<()> {
}
let res = rx.recv_timeout(Duration::from_millis(100));
log::debug!("woken up by notify: {res:?}");
debug!("woken up by notify: {res:?}");
// If there are multiple events in the channel already, we only need to be
// check once. Swallow the extra events before we go ahead to check the
// pid file.
while let Ok(res) = rx.try_recv() {
log::debug!("swallowing extra event: {res:?}");
debug!("swallowing extra event: {res:?}");
}
// Check that we can open pid file first.
if let Ok(file) = File::open(&pid_path) {
if !postmaster_pid_seen {
log::debug!("postmaster.pid appeared");
debug!("postmaster.pid appeared");
watcher
.unwatch(pgdata)
.expect("Failed to remove pgdata dir watch");
@@ -314,7 +316,7 @@ pub fn wait_for_postgres(pg: &mut Child, pgdata: &Path) -> Result<()> {
// Pid file could be there and we could read it, but it could be empty, for example.
if let Some(Ok(line)) = last_line {
let status = line.trim();
log::debug!("last line of postmaster.pid: {status:?}");
debug!("last line of postmaster.pid: {status:?}");
// Now Postgres is ready to accept connections
if status == "ready" {
@@ -330,7 +332,7 @@ pub fn wait_for_postgres(pg: &mut Child, pgdata: &Path) -> Result<()> {
}
}
log::info!("PostgreSQL is now running, continuing to configure it");
tracing::info!("PostgreSQL is now running, continuing to configure it");
Ok(())
}

View File

@@ -1,12 +1,11 @@
use std::path::Path;
use std::str::FromStr;
use std::time::Instant;
use anyhow::Result;
use log::{info, log_enabled, warn, Level};
use postgres::config::Config;
use postgres::{Client, NoTls};
use serde::Deserialize;
use tracing::{info, info_span, instrument, span_enabled, warn, Level};
use crate::compute::ComputeNode;
use crate::config;
@@ -80,23 +79,25 @@ pub fn update_pg_hba(pgdata_path: &Path) -> Result<()> {
/// Given a cluster spec json and open transaction it handles roles creation,
/// deletion and update.
#[instrument(skip_all)]
pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let mut xact = client.transaction()?;
let existing_roles: Vec<Role> = get_existing_roles(&mut xact)?;
// Print a list of existing Postgres roles (only in debug mode)
info!("postgres roles:");
for r in &existing_roles {
info_println!(
"{} - {}:{}",
" ".repeat(27 + 5),
r.name,
if r.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
}
);
if span_enabled!(Level::INFO) {
info!("postgres roles:");
for r in &existing_roles {
info!(
" - {}:{}",
r.name,
if r.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
}
);
}
}
// Process delta operations first
@@ -137,58 +138,80 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
info!("cluster spec roles:");
for role in &spec.cluster.roles {
let name = &role.name;
info_print!(
"{} - {}:{}",
" ".repeat(27 + 5),
name,
if role.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
}
);
// XXX: with a limited number of roles it is fine, but consider making it a HashMap
let pg_role = existing_roles.iter().find(|r| r.name == *name);
if let Some(r) = pg_role {
let mut update_role = false;
enum RoleAction {
None,
Update,
Create,
}
let action = if let Some(r) = pg_role {
if (r.encrypted_password.is_none() && role.encrypted_password.is_some())
|| (r.encrypted_password.is_some() && role.encrypted_password.is_none())
{
update_role = true;
RoleAction::Update
} else if let Some(pg_pwd) = &r.encrypted_password {
// Check whether password changed or not (trim 'md5:' prefix first)
update_role = pg_pwd[3..] != *role.encrypted_password.as_ref().unwrap();
// Check whether password changed or not (trim 'md5' prefix first if any)
//
// This is a backward compatibility hack, which comes from the times when we were using
// md5 for everyone and hashes were stored in the console db without md5 prefix. So when
// role comes from the control-plane (json spec) `Role.encrypted_password` doesn't have md5 prefix,
// but when role comes from Postgres (`get_existing_roles` / `existing_roles`) it has this prefix.
// Here is the only place so far where we compare hashes, so it seems to be the best candidate
// to place this compatibility layer.
let pg_pwd = if let Some(stripped) = pg_pwd.strip_prefix("md5") {
stripped
} else {
pg_pwd
};
if pg_pwd != *role.encrypted_password.as_ref().unwrap() {
RoleAction::Update
} else {
RoleAction::None
}
} else {
RoleAction::None
}
} else {
RoleAction::Create
};
if update_role {
match action {
RoleAction::None => {}
RoleAction::Update => {
let mut query: String = format!("ALTER ROLE {} ", name.pg_quote());
info_print!(" -> update");
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
}
} else {
info!("role name: '{}'", &name);
let mut query: String = format!("CREATE ROLE {} ", name.pg_quote());
info!("role create query: '{}'", &query);
info_print!(" -> create");
RoleAction::Create => {
let mut query: String = format!("CREATE ROLE {} ", name.pg_quote());
info!("role create query: '{}'", &query);
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
let grant_query = format!(
"GRANT pg_read_all_data, pg_write_all_data TO {}",
name.pg_quote()
);
xact.execute(grant_query.as_str(), &[])?;
info!("role grant query: '{}'", &grant_query);
let grant_query = format!(
"GRANT pg_read_all_data, pg_write_all_data TO {}",
name.pg_quote()
);
xact.execute(grant_query.as_str(), &[])?;
info!("role grant query: '{}'", &grant_query);
}
}
info_print!("\n");
if span_enabled!(Level::INFO) {
let pwd = if role.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
};
let action_str = match action {
RoleAction::None => "",
RoleAction::Create => " -> create",
RoleAction::Update => " -> update",
};
info!(" - {}:{}{}", name, pwd, action_str);
}
}
xact.commit()?;
@@ -197,12 +220,25 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
}
/// Reassign all dependent objects and delete requested roles.
#[instrument(skip_all)]
pub fn handle_role_deletions(node: &ComputeNode, client: &mut Client) -> Result<()> {
if let Some(ops) = &node.spec.delta_operations {
// First, reassign all dependent objects to db owners.
info!("reassigning dependent objects of to-be-deleted roles");
// Fetch existing roles. We could've exported and used `existing_roles` from
// `handle_roles()`, but we only make this list there before creating new roles.
// Which is probably fine as we never create to-be-deleted roles, but that'd
// just look a bit untidy. Anyway, the entire `pg_roles` should be in shared
// buffers already, so this shouldn't be a big deal.
let mut xact = client.transaction()?;
let existing_roles: Vec<Role> = get_existing_roles(&mut xact)?;
xact.commit()?;
for op in ops {
if op.action == "delete_role" {
// Check that role is still present in Postgres, as this could be a
// restart with the same spec after role deletion.
if op.action == "delete_role" && existing_roles.iter().any(|r| r.name == op.name) {
reassign_owned_objects(node, &op.name)?;
}
}
@@ -261,13 +297,16 @@ fn reassign_owned_objects(node: &ComputeNode, role_name: &PgIdent) -> Result<()>
/// like `CREATE DATABASE` and `DROP DATABASE` do not support it. Statement-level
/// atomicity should be enough here due to the order of operations and various checks,
/// which together provide us idempotency.
#[instrument(skip_all)]
pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let existing_dbs: Vec<Database> = get_existing_dbs(client)?;
// Print a list of existing Postgres databases (only in debug mode)
info!("postgres databases:");
for r in &existing_dbs {
info_println!("{} - {}:{}", " ".repeat(27 + 5), r.name, r.owner);
if span_enabled!(Level::INFO) {
info!("postgres databases:");
for r in &existing_dbs {
info!(" {}:{}", r.name, r.owner);
}
}
// Process delta operations first
@@ -310,13 +349,15 @@ pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
for db in &spec.cluster.databases {
let name = &db.name;
info_print!("{} - {}:{}", " ".repeat(27 + 5), db.name, db.owner);
// XXX: with a limited number of databases it is fine, but consider making it a HashMap
let pg_db = existing_dbs.iter().find(|r| r.name == *name);
let start_time = Instant::now();
if let Some(r) = pg_db {
enum DatabaseAction {
None,
Update,
Create,
}
let action = if let Some(r) = pg_db {
// XXX: db owner name is returned as quoted string from Postgres,
// when quoting is needed.
let new_owner = if r.owner.starts_with('"') {
@@ -326,29 +367,42 @@ pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
};
if new_owner != r.owner {
// Update the owner
DatabaseAction::Update
} else {
DatabaseAction::None
}
} else {
DatabaseAction::Create
};
match action {
DatabaseAction::None => {}
DatabaseAction::Update => {
let query: String = format!(
"ALTER DATABASE {} OWNER TO {}",
name.pg_quote(),
db.owner.pg_quote()
);
info_print!(" -> update");
let _ = info_span!("executing", query).entered();
client.execute(query.as_str(), &[])?;
let elapsed = start_time.elapsed().as_millis();
info_print!(" ({} ms)", elapsed);
}
} else {
let mut query: String = format!("CREATE DATABASE {} ", name.pg_quote());
info_print!(" -> create");
DatabaseAction::Create => {
let mut query: String = format!("CREATE DATABASE {} ", name.pg_quote());
query.push_str(&db.to_pg_options());
let _ = info_span!("executing", query).entered();
client.execute(query.as_str(), &[])?;
}
};
query.push_str(&db.to_pg_options());
client.execute(query.as_str(), &[])?;
let elapsed = start_time.elapsed().as_millis();
info_print!(" ({} ms)", elapsed);
if span_enabled!(Level::INFO) {
let action_str = match action {
DatabaseAction::None => "",
DatabaseAction::Create => " -> create",
DatabaseAction::Update => " -> update",
};
info!(" - {}:{}{}", db.name, db.owner, action_str);
}
info_print!("\n");
}
Ok(())
@@ -356,6 +410,7 @@ pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
/// Grant CREATE ON DATABASE to the database owner and do some other alters and grants
/// to allow users creating trusted extensions and re-creating `public` schema, for example.
#[instrument(skip_all)]
pub fn handle_grants(node: &ComputeNode, client: &mut Client) -> Result<()> {
let spec = &node.spec;


@@ -1,32 +1,31 @@
[package]
name = "control_plane"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
clap = "4.0"
comfy-table = "6.1"
git-version = "0.3.5"
nix = "0.25"
once_cell = "1.13.0"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev = "43e6db254a97fdecbce33d8bc0890accfd74495e" }
regex = "1"
reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] }
serde = { version = "1.0", features = ["derive"] }
serde_with = "2.0"
tar = "0.4.38"
thiserror = "1"
toml = "0.5"
url = "2.2.2"
anyhow.workspace = true
clap.workspace = true
comfy-table.workspace = true
git-version.workspace = true
nix.workspace = true
once_cell.workspace = true
postgres.workspace = true
regex.workspace = true
reqwest = { workspace = true, features = ["blocking", "json"] }
serde.workspace = true
serde_with.workspace = true
tar.workspace = true
thiserror.workspace = true
toml.workspace = true
url.workspace = true
# Note: Do not directly depend on pageserver or safekeeper; use pageserver_api or safekeeper_api
# instead, so that recompile times are better.
pageserver_api = { path = "../libs/pageserver_api" }
postgres_connection = { path = "../libs/postgres_connection" }
safekeeper_api = { path = "../libs/safekeeper_api" }
# Note: main broker code is inside the binary crate, so linking with the library shouldn't be heavy.
storage_broker = { version = "0.1", path = "../storage_broker" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
pageserver_api.workspace = true
safekeeper_api.workspace = true
postgres_connection.workspace = true
storage_broker.workspace = true
utils.workspace = true
workspace_hack.workspace = true


@@ -52,7 +52,7 @@ name = "ring"
version = "*"
expression = "MIT AND ISC AND OpenSSL"
license-files = [
{ path = "LICENSE", hash = 0xbd0eed23 },
{ path = "LICENSE", hash = 0xbd0eed23 }
]
[licenses.private]

docs/consumption_metrics.md (new file)

@@ -0,0 +1,115 @@
### Overview
Pageserver and proxy periodically collect consumption metrics and push them to a HTTP endpoint.
This doc describes current implementation details.
For design details see [the RFC](./rfcs/021-metering.md) and [the discussion on Github](https://github.com/neondatabase/neon/pull/2884).
- The metrics are collected in a separate thread, and the collection interval and endpoint are configurable.
- Metrics are cached, so that we don't send unchanged metrics on every iteration (see the sketch after this list).
- Metrics are sent in batches of at most 1000 metrics (see the `CHUNK_SIZE` const), with no particular grouping guarantees.
The batch format is:
```json
{ "events": [metric1, metric2, ...] }
```
See metric format examples below.
- All metrics values are in bytes, unless otherwise specified.
- Currently no retries are implemented.
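A minimal sketch (not the actual pageserver code) of the cache-and-batch behaviour described in the list above; the string key and the `send` callback are illustrative stand-ins for the real event structs and the HTTP client:
```rust
use std::collections::HashMap;

const CHUNK_SIZE: usize = 1000;

/// Send only the metrics whose values changed since the last iteration,
/// in batches of at most CHUNK_SIZE events. No retries are attempted.
fn push_changed(
    cache: &mut HashMap<String, u64>,  // last value sent, keyed per metric/tenant/timeline
    current: &HashMap<String, u64>,    // freshly collected values
    send: impl Fn(&[(String, u64)]),   // stand-in for the HTTP POST of an `EventChunk`
) {
    let changed: Vec<(String, u64)> = current
        .iter()
        .filter(|&(key, value)| cache.get(key) != Some(value)) // skip unchanged metrics
        .map(|(key, value)| (key.clone(), *value))
        .collect();
    for chunk in changed.chunks(CHUNK_SIZE) {
        send(chunk);
    }
    // Remember what was sent so the next iteration can skip unchanged metrics again.
    cache.extend(changed);
}
```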
### Pageserver metrics
#### Configuration
The endpoint and the collection interval are specified in the pageserver config file (or can be passed as command line arguments):
`metric_collection_endpoint` defaults to None, which means that metric collection is disabled by default.
`metric_collection_interval` defaults to 10min
#### Metrics
Currently, the following metrics are collected:
- `written_size`
Amount of WAL produced by a timeline, i.e. `last_record_lsn`.
This is an absolute, per-timeline metric.
- `resident_size`
Size of all the layer files in the tenant's directory on disk on the pageserver.
This is an absolute, per-tenant metric.
- `remote_storage_size`
Size of the remote storage (S3) directory.
This is an absolute, per-tenant metric.
- `timeline_logical_size`
Logical size of the data in the timeline
This is an absolute, per-timeline metric.
- `synthetic_storage_size`
Size of all the tenant's branches, including WAL.
This is the same metric that `tenant/{tenant_id}/size` endpoint returns.
This is an absolute, per-tenant metric.
Synthetic storage size is calculated in a separate thread, so it might be slightly outdated.
#### Format example
```json
{
"metric": "remote_storage_size",
"type": "absolute",
"time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"tenant_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
"timeline_id": "a03ebb4f5922a1c56ff7485cc8854143",
}
```
`idempotency_key` is a unique key for each metric, so that we can deduplicate metrics.
It is a combination of the time, node_id and a random number.
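The helper that produces these keys is the `idempotency_key` function in the shared `consumption_metrics` crate added in this change; a minimal usage sketch (the node id value `"1"` is just an example):
```rust
use consumption_metrics::idempotency_key;

fn main() {
    // Produces something like "2022-12-28 11:07:19.317310324 UTC-1-4019":
    // current time, node id, and a random 4-digit number.
    let key = idempotency_key("1".to_string());
    println!("{key}");
}
```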
### Proxy consumption metrics
#### Configuration
The endpoint and the collection interval can be passed as command line arguments for proxy:
`metric_collection_endpoint` has no default, which means that metric collection is disabled by default.
`metric_collection_interval` has no default.
#### Metrics
Currently, only one proxy metric is collected:
- `proxy_io_bytes_per_client`
Outbound traffic per client.
This is an incremental, per-endpoint metric.
#### Format example
```json
{
"metric": "proxy_io_bytes_per_client",
"type": "incremental",
"start_time": "2022-12-28T11:07:19.317310284Z",
"stop_time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"endpoint_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
}
```
The metric is incremental, so the value is the difference between the current and the previous value.
If there is no previous value, the reported value is the current value and `start_time` equals `stop_time`.
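A minimal sketch (assumed, not the actual proxy code) of how a cumulative traffic counter could be turned into the incremental value and time range described above:
```rust
use chrono::{DateTime, Utc};

/// What was reported last time, if anything.
struct PrevReport {
    total_bytes: u64,
    reported_at: DateTime<Utc>,
}

/// Returns (value, start_time, stop_time) for the next incremental event.
fn next_increment(
    current_total: u64,
    prev: Option<&PrevReport>,
    now: DateTime<Utc>,
) -> (u64, DateTime<Utc>, DateTime<Utc>) {
    match prev {
        // Difference since the previous report; start_time is the previous report time.
        Some(p) => (current_total - p.total_bytes, p.reported_at, now),
        // First report: the value is the current value and start_time == stop_time.
        None => (current_total, now, now),
    }
}
```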
### TODO
- [ ] Handle errors better: currently if one tenant fails to gather metrics, the whole iteration fails and metrics are not sent for any tenant.
- [ ] Add retries
- [ ] Tune the interval

docs/rfcs/021-metering.md (new file)

@@ -0,0 +1,186 @@
# Consumption tracking
# Goals
This proposal is made with two mostly but not entirely overlapping goals:
* Collect info that is needed for consumption-based billing
* Cross-check AWS bills
# Metrics
There are six metrics to collect:
* CPU time. Wall clock seconds * the current number of cores. We have a fixed ratio of memory to cores, so the current memory size is the function of the number of cores. Measured per each `endpoint`.
* Traffic. In/out traffic on the proxy. Measured per each `endpoint`.
* Written size. Amount of data we write. That is different from both traffic and storage size, as only while writing do we
a) occupy some disk bandwidth on safekeepers, and
b) necessarily cross AZ boundaries delivering WAL to all safekeepers.
Each timeline/branch has at most one writer, so the data is collected per branch.
* Synthetic storage size. That is what is exposed now with pageserver's `/v1/tenant/{}/size`. Looks like now it is per-tenant. (Side note: can we make it per branch to show as branch physical size in UI?)
* Real storage size. That is the size of the tenant directory on the pageserver's disk. Per-tenant.
* S3 storage size. That is the size of the tenant data on S3. Per-tenant.
That info should be enough to build an internal model that predicts AWS price (hence tracking `written data` and `real storage size`). As for the billing model we probably can get away with mentioning only `CPU time`, `synthetic storage size`, and `traffic` consumption.
# Services participating in metrics collection
## Proxy
For actual implementation details check `/docs/consumption_metrics.md`
Proxy is the only place that knows about traffic flow, so it tracks it and reports it with quite a small interval, let's say 1 minute. A small interval is needed here since the proxy is stateless, and any restart will reset accumulated consumption. Also, the proxy should report deltas since the last report, not the absolute value of the counter. Such events are easier to integrate over a period of time to get the amount of traffic during some time interval.
Example event:
```json
{
"metric": "proxy_io_bytes_per_client",
"type": "incremental",
"start_time": "2022-12-28T11:07:19.317310284Z",
"stop_time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"endpoint_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
}
```
Since we report deltas over some period of time, it makes sense to include `event_start_time`/`event_stop_time` where `event_start_time` is the time of the previous report. That will allow us to identify metering gaps better (e.g., failed send/delivery).
When there is no active connection, the proxy can avoid reporting anything. Also, deltas are additive, so several console instances serving the same user and endpoint can report traffic without coordination.
## Console
The console knows about start/stop events, so it knows the amount of CPU time allocated to each endpoint. It also knows about operation successes and failures and can avoid billing clients after unsuccessful 'suspend' events. The console doesn't know the current compute size within the allowed limits on the endpoint. So with CPU time, we do the following:
* While we don't yet have autoscaling, the console can report `cpu time` as the number of seconds since the last `start_compute` event.
* When we have autoscaling, `autoscaler-agent` can report `cpu time`*`compute_units_count` in the same increments as the proxy reports traffic.
Example event:
```json
{
"metric": "effective_compute_seconds",
"type": "increment",
"endpoint_id": "blazing-warrior-34",
"event_start_time": ...,
"event_stop_time": ...,
"value": 12345454,
}
```
I'd also suggest reporting one value, `cpu time`*`compute_units_count`, instead of two separate fields as it makes the event schema simpler (it is possible to treat it the same way as traffic) and preserves additivity.
## Pageserver
For actual implementation details check `/docs/consumption_metrics.md`
Pageserver knows / has access to / can calculate the rest of the metrics:
* Written size -- that is basically `last_received_lsn`,
* Synthetic storage size -- there is a way to calculate it, albeit a costly one,
* Real storage size -- there is a way to calculate it using a layer map or filesystem,
* S3 storage size -- can calculate it by S3 API calls
Some of those metrics are expensive to calculate, so the reporting period here is driven mainly by implementation details. We can set it to, for example, once per hour. Not a big deal since the pageserver is stateful, and all metrics can be reported as an absolute value, not increments. At the same time, a smaller reporting period improves UX, so it would be good to have something more real-time.
`written size` is primarily a safekeeper-related metric, but since it is available on both pageserver and safekeeper, we can avoid reporting anything from the safekeeper.
Example event:
```json
{
"metric": "remote_storage_size",
"type": "absolute",
"time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"tenant_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
"timeline_id": "a03ebb4f5922a1c56ff7485cc8854143",
}
```
# Data collection
## Push vs. pull
We already have pull-based Prometheus metrics, so it is tempting to use them here too. However, in our setup, it is hard to tell when some metric changes. For example, garbage collection will constantly free some disk space over a week, even if the project is down for that week. We could also iterate through all existing tenants/branches/endpoints, but that means a fair amount of code to do properly, and most likely we would end up with per-metric hacks in the collector to cut out tenants that are surely not changing that metric.
With the push model, it is easier to publish data only about actively changing metrics -- pageserver knows when it performs s3 offloads, garbage collection and starts/stops consuming data from the safekeeper; proxy knows about connected clients; console / autoscaler-agent knows about active cpu time.
Hence, let's go with a push-based model.
## Common bus vs. proxying through the console
We can implement such push systems in a few ways:
a. Each component pushes its metrics to the "common bus", namely segment, Kafka, or something similar. That approach scales well, but it would be harder to test locally, would introduce new dependencies, and we would have to distribute secrets for that connection to all of the components, etc. We would also have to loop back some of the events and their aggregates to the console, as we want to show some of those metrics to the user in real time.
b. Each component can call HTTP `POST` with its events to the console, and the console can forward them to the segment for later integration with metronome / orb / onebill / etc. With that approach, only the console has to speak with segment. Also, since that data passes through the console, the console can save the latest metric values, so there is no need for constant feedback of those events back from the segment.
# Implementation
Each (proxy|pageserver|autoscaler-agent) sends consumption events to the single endpoint in the console:
```json
POST /usage_events HTTP/1.1
Content-Type: application/json
[
{
"metric": "remote_storage_size",
"type": "absolute",
"time": "2022-12-28T11:07:19.317310284Z",
"idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
"value": 12345454,
"tenant_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
"timeline_id": "a03ebb4f5922a1c56ff7485cc8854143",
},
...
]
```
![data flow](./images/metering.jpg)
Events could be either:
* `incremental` -- change in consumption since the previous event or service restart. That is `effective_cpu_seconds`, `traffic_in_bytes`, and `traffic_out_bytes`.
* `absolute` -- that is the current value of a metric. All of the size-related metrics are absolute.
Each service can post events at its own pace and bundle together data from different tenants/endpoints.
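A minimal sketch of how a sender could build such a body with the shared `consumption_metrics` types introduced alongside this RFC; the `Ids` struct, the node id `"1"`, and the use of `serde_json` here are illustrative assumptions, not the actual pageserver code:
```rust
use chrono::Utc;
use consumption_metrics::{idempotency_key, Event, EventChunk, EventType};
use serde::Serialize;

// Extra fields flattened into the event body; illustrative only.
#[derive(Serialize)]
struct Ids {
    tenant_id: String,
    timeline_id: String,
}

fn main() -> anyhow::Result<()> {
    let event = Event {
        kind: EventType::Absolute { time: Utc::now() },
        metric: "remote_storage_size",
        idempotency_key: idempotency_key("1".to_string()),
        value: 12_345_454,
        extra: Ids {
            tenant_id: "5d07d9ce9237c4cd845ea7918c0afa7d".to_string(),
            timeline_id: "a03ebb4f5922a1c56ff7485cc8854143".to_string(),
        },
    };
    // EventChunk serializes as `{"events": [...]}`, the body POSTed to /usage_events.
    let body = serde_json::to_string(&EventChunk { events: &[event] })?;
    println!("{body}");
    Ok(())
}
```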
The console algorithm upon receiving events could be the following:
1. Create and send a segment event with the same content (possibly enriching it with tenant/timeline data for endpoint-based events).
2. Update the latest state of per-tenant and per-endpoint metrics in the database.
3. Check whether any of those metrics is above the allowed threshold and stop the project if necessary.
Since all the data comes in batches, we can do the batch update to reduce the number of queries in the database. Proxy traffic is probably the most frequent metric, so with batching, we will have extra `number_of_proxies` requests to the database each minute. This is most likely fine for now but will generate many dead tuples in the console database. If that is the case, we can change step 2 to the following:
2.1. Check if there is a $tenant_$metric / $endpoint_$metric key in Redis.
2.2. If no stored value is found and the metric is incremental, then fetch the current value from DWH (which keeps the aggregated value for all the events) and publish it.
2.3. Publish a new value (absolute metric) or add an increment to the stored value (incremental metric).
## Consumption watchdog
Since all the data goes through the console, we don't have to run any background thread/coroutines to check whether consumption is within the allowed limits. We only change consumption with `POST /usage_events`, so limit checks could be applied in the same handler.
## Extensibility
If we need to add a new metric (e.g. s3 traffic or something else), the console code should, by default, process it and publish a segment event, even if the metric name is unknown to the console.
## Naming & schema
Each metric name should end with its units (currently `_seconds` and `_bytes`), and a segment event should always have `tenant_id` and `timeline_id`/`endpoint_id` where applicable.

(new binary image file, 232 KiB, not shown: the metering data-flow diagram referenced in the RFC above)

@@ -18,10 +18,6 @@ Intended to be used in integration tests and in CLI tools for local installation
Documentation of the Neon features and concepts.
Now it is mostly dev documentation.
`/monitoring`:
TODO
`/pageserver`:
Neon storage service.
@@ -98,6 +94,13 @@ cargo hakari manage-deps
If you don't have hakari installed (`error: no such subcommand: hakari`), install it by running `cargo install cargo-hakari`.
### Checking Rust 3rd-parties
[Cargo deny](https://embarkstudios.github.io/cargo-deny/index.html) is a cargo plugin that lets us lint project's dependency graph to ensure all dependencies conform to requirements. It detects security issues, matches licenses, and ensures crates only come from trusted sources.
```bash
cargo deny check
```
## Using Python
Note that Debian/Ubuntu Python packages are stale, as it commonly happens,
so manual installation of dependencies is not recommended.


@@ -0,0 +1,16 @@
[package]
name = "consumption_metrics"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
anyhow = "1.0.68"
chrono = { version = "0.4", default-features = false, features = ["clock", "serde"] }
rand = "0.8.3"
serde = "1.0.152"
serde_with = "2.1.0"
utils = { version = "0.1.0", path = "../utils" }
workspace_hack = { version = "0.1.0", path = "../../workspace_hack" }


@@ -0,0 +1,50 @@
//!
//! Shared code for consumption metrics collection
//!
use chrono::{DateTime, Utc};
use rand::Rng;
use serde::Serialize;
#[derive(Serialize, Debug, Clone, Eq, PartialEq, Ord, PartialOrd)]
#[serde(tag = "type")]
pub enum EventType {
#[serde(rename = "absolute")]
Absolute { time: DateTime<Utc> },
#[serde(rename = "incremental")]
Incremental {
start_time: DateTime<Utc>,
stop_time: DateTime<Utc>,
},
}
#[derive(Serialize, Debug, Clone, Eq, PartialEq, Ord, PartialOrd)]
pub struct Event<Extra> {
#[serde(flatten)]
#[serde(rename = "type")]
pub kind: EventType,
pub metric: &'static str,
pub idempotency_key: String,
pub value: u64,
#[serde(flatten)]
pub extra: Extra,
}
pub fn idempotency_key(node_id: String) -> String {
format!(
"{}-{}-{:04}",
Utc::now(),
node_id,
rand::thread_rng().gen_range(0..=9999)
)
}
pub const CHUNK_SIZE: usize = 1000;
// Just a wrapper around a slice of events,
// to serialize it as `{"events": [...]}`.
#[derive(serde::Serialize)]
pub struct EventChunk<'a, T> {
pub events: &'a [T],
}


@@ -1,11 +1,12 @@
[package]
name = "metrics"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
libc = "0.2"
once_cell = "1.13.0"
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
prometheus.workspace = true
libc.workspace = true
once_cell.workspace = true
workspace_hack.workspace = true


@@ -1,17 +1,17 @@
[package]
name = "pageserver_api"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_with = "2.0"
const_format = "0.2.21"
anyhow = { version = "1.0", features = ["backtrace"] }
bytes = "1.0.1"
byteorder = "1.4.3"
serde.workspace = true
serde_with.workspace = true
const_format.workspace = true
anyhow.workspace = true
bytes.workspace = true
byteorder.workspace = true
utils.workspace = true
postgres_ffi.workspace = true
utils = { path = "../utils" }
postgres_ffi = { path = "../postgres_ffi" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
workspace_hack.workspace = true


@@ -1,4 +1,4 @@
use std::num::NonZeroU64;
use std::num::{NonZeroU64, NonZeroUsize};
use byteorder::{BigEndian, ReadBytesExt};
use serde::{Deserialize, Serialize};
@@ -44,18 +44,17 @@ impl TenantState {
/// A state of a timeline in pageserver's memory.
#[derive(Debug, Clone, Copy, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub enum TimelineState {
/// Timeline is fully operational. If the containing Tenant is Active, the timeline's
/// background jobs are running otherwise they will be launched when the tenant is activated.
/// The timeline is recognized by the pageserver but is not yet operational.
/// In particular, the walreceiver connection loop is not running for this timeline.
/// It will eventually transition to state Active or Broken.
Loading,
/// The timeline is fully operational.
/// It can be queried, and the walreceiver connection loop is running.
Active,
/// A timeline is recognized by pageserver, but not yet ready to operate.
/// The status indicates, that the timeline could eventually go back to Active automatically:
/// for example, if the owning tenant goes back to Active again.
Suspended,
/// A timeline is recognized by pageserver, but not yet ready to operate and not allowed to
/// automatically become Active after certain events: only a management call can change this status.
/// The timeline was previously Loading or Active but is shutting down.
/// It cannot transition back into any other state.
Stopping,
/// A timeline is recognized by the pageserver, but can no longer be used for
/// any operations, because it failed to be activated.
/// The timeline is broken and not operational (previous states: Loading or Active).
Broken,
}
@@ -210,6 +209,11 @@ pub struct TimelineInfo {
pub state: TimelineState,
}
#[derive(Debug, Serialize, Deserialize)]
pub struct DownloadRemoteLayersTaskSpawnRequest {
pub max_concurrent_downloads: NonZeroUsize,
}
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct DownloadRemoteLayersTaskInfo {
pub task_id: String,


@@ -1,18 +1,17 @@
[package]
name = "postgres_connection"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
itertools = "0.10.3"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev = "43e6db254a97fdecbce33d8bc0890accfd74495e" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
url = "2.2.2"
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
anyhow.workspace = true
itertools.workspace = true
postgres.workspace = true
tokio-postgres.workspace = true
url.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
once_cell = "1.13.0"
once_cell.workspace = true


@@ -1,30 +1,31 @@
[package]
name = "postgres_ffi"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
rand = "0.8.3"
regex = "1.4.5"
bytes = "1.0.1"
byteorder = "1.4.3"
anyhow = "1.0"
crc32c = "0.6.0"
hex = "0.4.3"
once_cell = "1.13.0"
log = "0.4.14"
memoffset = "0.7"
thiserror = "1.0"
serde = { version = "1.0", features = ["derive"] }
utils = { path = "../utils" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
rand.workspace = true
regex.workspace = true
bytes.workspace = true
byteorder.workspace = true
anyhow.workspace = true
crc32c.workspace = true
hex.workspace = true
once_cell.workspace = true
log.workspace = true
memoffset.workspace = true
thiserror.workspace = true
serde.workspace = true
utils.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
env_logger = "0.9"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
env_logger.workspace = true
postgres.workspace = true
wal_craft = { path = "wal_craft" }
[build-dependencies]
anyhow = "1.0"
bindgen = "0.61"
anyhow.workspace = true
bindgen.workspace = true


@@ -1,17 +1,17 @@
[package]
name = "wal_craft"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
clap = "4.0"
env_logger = "0.9"
log = "0.4"
once_cell = "1.13.0"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres_ffi = { path = "../" }
tempfile = "3.2"
workspace_hack = { version = "0.1", path = "../../../workspace_hack" }
anyhow.workspace = true
clap.workspace = true
env_logger.workspace = true
log.workspace = true
once_cell.workspace = true
postgres.workspace = true
postgres_ffi.workspace = true
tempfile.workspace = true
workspace_hack.workspace = true


@@ -1,18 +1,18 @@
[package]
name = "pq_proto"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
bytes = "1.0.1"
pin-project-lite = "0.2.7"
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
rand = "0.8.3"
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1.17", features = ["macros"] }
tracing = "0.1"
thiserror = "1.0"
anyhow.workspace = true
bytes.workspace = true
pin-project-lite.workspace = true
postgres-protocol.workspace = true
rand.workspace = true
serde.workspace = true
tokio.workspace = true
tracing.workspace = true
thiserror.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
workspace_hack.workspace = true


@@ -1,28 +1,28 @@
[package]
name = "remote_storage"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = { version = "1.0", features = ["backtrace"] }
async-trait = "0.1"
metrics = { version = "0.1", path = "../metrics" }
utils = { version = "0.1", path = "../utils" }
once_cell = "1.13.0"
aws-smithy-http = "0.51.0"
aws-types = "0.51.0"
aws-config = { version = "0.51.0", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.21.0"
hyper = { version = "0.14", features = ["stream"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
tokio = { version = "1.17", features = ["sync", "macros", "fs", "io-util"] }
tokio-util = { version = "0.7", features = ["io"] }
toml_edit = { version = "0.14", features = ["easy"] }
tracing = "0.1.27"
anyhow.workspace = true
async-trait.workspace = true
once_cell.workspace = true
aws-smithy-http.workspace = true
aws-types.workspace = true
aws-config.workspace = true
aws-sdk-s3.workspace = true
hyper = { workspace = true, features = ["stream"] }
serde.workspace = true
serde_json.workspace = true
tokio = { workspace = true, features = ["sync", "fs", "io-util"] }
tokio-util.workspace = true
toml_edit.workspace = true
tracing.workspace = true
metrics.workspace = true
utils.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
workspace_hack.workspace = true
[dev-dependencies]
tempfile = "3.2"
tempfile.workspace = true


@@ -1,13 +1,13 @@
[package]
name = "safekeeper_api"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_with = "2.0"
const_format = "0.2.21"
serde.workspace = true
serde_with.workspace = true
const_format.workspace = true
utils.workspace = true
utils = { path = "../utils" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
workspace_hack.workspace = true


@@ -1,9 +1,11 @@
[package]
name = "tenant_size_model"
version = "0.1.0"
edition = "2021"
edition.workspace = true
publish = false
license = "Apache-2.0"
license.workspace = true
[dependencies]
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
anyhow.workspace = true
workspace_hack.workspace = true


@@ -1,6 +1,8 @@
use std::borrow::Cow;
use std::collections::HashMap;
use anyhow::Context;
/// Pricing model or history size builder.
///
/// Maintains knowledge of the branches and their modifications. Generic over the branch name key
@@ -132,22 +134,25 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
op: Cow<'static, str>,
lsn: u64,
size: Option<u64>,
) where
) -> anyhow::Result<()>
where
K: std::borrow::Borrow<Q>,
Q: std::hash::Hash + Eq,
Q: std::hash::Hash + Eq + std::fmt::Debug,
{
let lastseg_id = *self.branches.get(branch).unwrap();
let Some(lastseg_id) = self.branches.get(branch).copied() else { anyhow::bail!("branch not found: {branch:?}") };
let newseg_id = self.segments.len();
let lastseg = &mut self.segments[lastseg_id];
assert!(lsn > lastseg.end_lsn);
let Some(start_size) = lastseg.end_size else { anyhow::bail!("no end_size on latest segment for {branch:?}") };
let newseg = Segment {
op,
parent: Some(lastseg_id),
start_lsn: lastseg.end_lsn,
end_lsn: lsn,
start_size: lastseg.end_size.unwrap(),
start_size,
end_size: size,
children_after: Vec::new(),
needed: false,
@@ -156,6 +161,8 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
self.segments.push(newseg);
*self.branches.get_mut(branch).expect("read already") = newseg_id;
Ok(())
}
/// Advances the branch with the named operation, by the relative LSN and logical size bytes.
@@ -165,21 +172,24 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
op: Cow<'static, str>,
lsn_bytes: u64,
size_bytes: i64,
) where
) -> anyhow::Result<()>
where
K: std::borrow::Borrow<Q>,
Q: std::hash::Hash + Eq,
Q: std::hash::Hash + Eq + std::fmt::Debug,
{
let lastseg_id = *self.branches.get(branch).unwrap();
let Some(lastseg_id) = self.branches.get(branch).copied() else { anyhow::bail!("branch not found: {branch:?}") };
let newseg_id = self.segments.len();
let lastseg = &mut self.segments[lastseg_id];
let Some(last_end_size) = lastseg.end_size else { anyhow::bail!("no end_size on latest segment for {branch:?}") };
let newseg = Segment {
op,
parent: Some(lastseg_id),
start_lsn: lastseg.end_lsn,
end_lsn: lastseg.end_lsn + lsn_bytes,
start_size: lastseg.end_size.unwrap(),
end_size: Some((lastseg.end_size.unwrap() as i64 + size_bytes) as u64),
start_size: last_end_size,
end_size: Some((last_end_size as i64 + size_bytes) as u64),
children_after: Vec::new(),
needed: false,
};
@@ -187,50 +197,54 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
self.segments.push(newseg);
*self.branches.get_mut(branch).expect("read already") = newseg_id;
Ok(())
}
pub fn insert<Q: ?Sized>(&mut self, branch: &Q, bytes: u64)
pub fn insert<Q: ?Sized>(&mut self, branch: &Q, bytes: u64) -> anyhow::Result<()>
where
K: std::borrow::Borrow<Q>,
Q: std::hash::Hash + Eq,
Q: std::hash::Hash + Eq + std::fmt::Debug,
{
self.modify_branch(branch, "insert".into(), bytes, bytes as i64);
self.modify_branch(branch, "insert".into(), bytes, bytes as i64)
}
pub fn update<Q: ?Sized>(&mut self, branch: &Q, bytes: u64)
pub fn update<Q: ?Sized>(&mut self, branch: &Q, bytes: u64) -> anyhow::Result<()>
where
K: std::borrow::Borrow<Q>,
Q: std::hash::Hash + Eq,
Q: std::hash::Hash + Eq + std::fmt::Debug,
{
self.modify_branch(branch, "update".into(), bytes, 0i64);
self.modify_branch(branch, "update".into(), bytes, 0i64)
}
pub fn delete<Q: ?Sized>(&mut self, branch: &Q, bytes: u64)
pub fn delete<Q: ?Sized>(&mut self, branch: &Q, bytes: u64) -> anyhow::Result<()>
where
K: std::borrow::Borrow<Q>,
Q: std::hash::Hash + Eq,
Q: std::hash::Hash + Eq + std::fmt::Debug,
{
self.modify_branch(branch, "delete".into(), bytes, -(bytes as i64));
self.modify_branch(branch, "delete".into(), bytes, -(bytes as i64))
}
/// Panics if the parent branch cannot be found.
pub fn branch<Q: ?Sized>(&mut self, parent: &Q, name: K)
pub fn branch<Q: ?Sized>(&mut self, parent: &Q, name: K) -> anyhow::Result<()>
where
K: std::borrow::Borrow<Q>,
Q: std::hash::Hash + Eq,
K: std::borrow::Borrow<Q> + std::fmt::Debug,
Q: std::hash::Hash + Eq + std::fmt::Debug,
{
// Find the right segment
let branchseg_id = *self
.branches
.get(parent)
.expect("should had found the parent by key");
let branchseg_id = *self.branches.get(parent).with_context(|| {
format!(
"should had found the parent {:?} by key. in branches {:?}",
parent, self.branches
)
})?;
let _branchseg = &mut self.segments[branchseg_id];
// Create branch name for it
self.branches.insert(name, branchseg_id);
Ok(())
}
pub fn calculate(&mut self, retention_period: u64) -> SegmentSize {
pub fn calculate(&mut self, retention_period: u64) -> anyhow::Result<SegmentSize> {
// Phase 1: Mark all the segments that need to be retained
for (_branch, &last_seg_id) in self.branches.iter() {
let last_seg = &self.segments[last_seg_id];
@@ -255,7 +269,7 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
self.size_from_snapshot_later(0)
}
fn size_from_wal(&self, seg_id: usize) -> SegmentSize {
fn size_from_wal(&self, seg_id: usize) -> anyhow::Result<SegmentSize> {
let seg = &self.segments[seg_id];
let this_size = seg.end_lsn - seg.start_lsn;
@@ -266,10 +280,10 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
for &child_id in seg.children_after.iter() {
// try each child both ways
let child = &self.segments[child_id];
let p1 = self.size_from_wal(child_id);
let p1 = self.size_from_wal(child_id)?;
let p = if !child.needed {
let p2 = self.size_from_snapshot_later(child_id);
let p2 = self.size_from_snapshot_later(child_id)?;
if p1.total() < p2.total() {
p1
} else {
@@ -280,15 +294,15 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
};
children.push(p);
}
SegmentSize {
Ok(SegmentSize {
seg_id,
method: if seg.needed { WalNeeded } else { Wal },
this_size,
children,
}
})
}
fn size_from_snapshot_later(&self, seg_id: usize) -> SegmentSize {
fn size_from_snapshot_later(&self, seg_id: usize) -> anyhow::Result<SegmentSize> {
// If this is needed, then it's time to do the snapshot and continue
// with wal method.
let seg = &self.segments[seg_id];
@@ -299,10 +313,10 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
for &child_id in seg.children_after.iter() {
// try each child both ways
let child = &self.segments[child_id];
let p1 = self.size_from_wal(child_id);
let p1 = self.size_from_wal(child_id)?;
let p = if !child.needed {
let p2 = self.size_from_snapshot_later(child_id);
let p2 = self.size_from_snapshot_later(child_id)?;
if p1.total() < p2.total() {
p1
} else {
@@ -313,12 +327,12 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
};
children.push(p);
}
SegmentSize {
Ok(SegmentSize {
seg_id,
method: WalNeeded,
this_size: seg.start_size,
children,
}
})
} else {
// If any of the direct children are "needed", need to be able to reconstruct here
let mut children_needed = false;
@@ -333,7 +347,7 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
let method1 = if !children_needed {
let mut children = Vec::new();
for child in seg.children_after.iter() {
children.push(self.size_from_snapshot_later(*child));
children.push(self.size_from_snapshot_later(*child)?);
}
Some(SegmentSize {
seg_id,
@@ -349,20 +363,25 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
let method2 = if children_needed || seg.children_after.len() >= 2 {
let mut children = Vec::new();
for child in seg.children_after.iter() {
children.push(self.size_from_wal(*child));
children.push(self.size_from_wal(*child)?);
}
let Some(this_size) = seg.end_size else { anyhow::bail!("no end_size at junction {seg_id}") };
Some(SegmentSize {
seg_id,
method: SnapshotAfter,
this_size: seg.end_size.unwrap(),
this_size,
children,
})
} else {
None
};
match (method1, method2) {
(None, None) => panic!(),
Ok(match (method1, method2) {
(None, None) => anyhow::bail!(
"neither method was applicable: children_after={}, children_needed={}",
seg.children_after.len(),
children_needed
),
(Some(method), None) => method,
(None, Some(method)) => method,
(Some(method1), Some(method2)) => {
@@ -372,7 +391,7 @@ impl<K: std::hash::Hash + Eq + 'static> Storage<K> {
method2
}
}
}
})
}
}


@@ -7,118 +7,118 @@
use tenant_size_model::{Segment, SegmentSize, Storage};
// Main branch only. Some updates on it.
fn scenario_1() -> (Vec<Segment>, SegmentSize) {
fn scenario_1() -> anyhow::Result<(Vec<Segment>, SegmentSize)> {
// Create main branch
let mut storage = Storage::new("main");
// Bulk load 5 GB of data to it
storage.insert("main", 5_000);
storage.insert("main", 5_000)?;
// Stream of updates
for _ in 0..5 {
storage.update("main", 1_000);
storage.update("main", 1_000)?;
}
let size = storage.calculate(1000);
let size = storage.calculate(1000)?;
(storage.into_segments(), size)
Ok((storage.into_segments(), size))
}
// Main branch only. Some updates on it.
fn scenario_2() -> (Vec<Segment>, SegmentSize) {
fn scenario_2() -> anyhow::Result<(Vec<Segment>, SegmentSize)> {
// Create main branch
let mut storage = Storage::new("main");
// Bulk load 5 GB of data to it
storage.insert("main", 5_000);
storage.insert("main", 5_000)?;
// Stream of updates
for _ in 0..5 {
storage.update("main", 1_000);
storage.update("main", 1_000)?;
}
// Branch
storage.branch("main", "child");
storage.update("child", 1_000);
storage.branch("main", "child")?;
storage.update("child", 1_000)?;
// More updates on parent
storage.update("main", 1_000);
storage.update("main", 1_000)?;
let size = storage.calculate(1000);
let size = storage.calculate(1000)?;
(storage.into_segments(), size)
Ok((storage.into_segments(), size))
}
// Like 2, but more updates on main
fn scenario_3() -> (Vec<Segment>, SegmentSize) {
fn scenario_3() -> anyhow::Result<(Vec<Segment>, SegmentSize)> {
// Create main branch
let mut storage = Storage::new("main");
// Bulk load 5 GB of data to it
storage.insert("main", 5_000);
storage.insert("main", 5_000)?;
// Stream of updates
for _ in 0..5 {
storage.update("main", 1_000);
storage.update("main", 1_000)?;
}
// Branch
storage.branch("main", "child");
storage.update("child", 1_000);
storage.branch("main", "child")?;
storage.update("child", 1_000)?;
// More updates on parent
for _ in 0..5 {
storage.update("main", 1_000);
storage.update("main", 1_000)?;
}
let size = storage.calculate(1000);
let size = storage.calculate(1000)?;
(storage.into_segments(), size)
Ok((storage.into_segments(), size))
}
// Diverged branches
fn scenario_4() -> (Vec<Segment>, SegmentSize) {
fn scenario_4() -> anyhow::Result<(Vec<Segment>, SegmentSize)> {
// Create main branch
let mut storage = Storage::new("main");
// Bulk load 5 GB of data to it
storage.insert("main", 5_000);
storage.insert("main", 5_000)?;
// Stream of updates
for _ in 0..5 {
storage.update("main", 1_000);
storage.update("main", 1_000)?;
}
// Branch
storage.branch("main", "child");
storage.update("child", 1_000);
storage.branch("main", "child")?;
storage.update("child", 1_000)?;
// More updates on parent
for _ in 0..8 {
storage.update("main", 1_000);
storage.update("main", 1_000)?;
}
let size = storage.calculate(1000);
let size = storage.calculate(1000)?;
(storage.into_segments(), size)
Ok((storage.into_segments(), size))
}
fn scenario_5() -> (Vec<Segment>, SegmentSize) {
fn scenario_5() -> anyhow::Result<(Vec<Segment>, SegmentSize)> {
let mut storage = Storage::new("a");
storage.insert("a", 5000);
storage.branch("a", "b");
storage.update("b", 4000);
storage.update("a", 2000);
storage.branch("a", "c");
storage.insert("c", 4000);
storage.insert("a", 2000);
storage.insert("a", 5000)?;
storage.branch("a", "b")?;
storage.update("b", 4000)?;
storage.update("a", 2000)?;
storage.branch("a", "c")?;
storage.insert("c", 4000)?;
storage.insert("a", 2000)?;
let size = storage.calculate(5000);
let size = storage.calculate(5000)?;
(storage.into_segments(), size)
Ok((storage.into_segments(), size))
}
fn scenario_6() -> (Vec<Segment>, SegmentSize) {
fn scenario_6() -> anyhow::Result<(Vec<Segment>, SegmentSize)> {
use std::borrow::Cow;
const NO_OP: Cow<'static, str> = Cow::Borrowed("");
@@ -133,18 +133,18 @@ fn scenario_6() -> (Vec<Segment>, SegmentSize) {
let mut storage = Storage::new(None);
storage.branch(&None, branches[0]); // at 0
storage.modify_branch(&branches[0], NO_OP, 108951064, 43696128); // at 108951064
storage.branch(&branches[0], branches[1]); // at 108951064
storage.modify_branch(&branches[1], NO_OP, 15560408, -1851392); // at 124511472
storage.modify_branch(&branches[0], NO_OP, 174464360, -1531904); // at 283415424
storage.branch(&branches[0], branches[2]); // at 283415424
storage.modify_branch(&branches[2], NO_OP, 15906192, 8192); // at 299321616
storage.modify_branch(&branches[0], NO_OP, 18909976, 32768); // at 302325400
storage.branch(&None, branches[0])?; // at 0
storage.modify_branch(&branches[0], NO_OP, 108951064, 43696128)?; // at 108951064
storage.branch(&branches[0], branches[1])?; // at 108951064
storage.modify_branch(&branches[1], NO_OP, 15560408, -1851392)?; // at 124511472
storage.modify_branch(&branches[0], NO_OP, 174464360, -1531904)?; // at 283415424
storage.branch(&branches[0], branches[2])?; // at 283415424
storage.modify_branch(&branches[2], NO_OP, 15906192, 8192)?; // at 299321616
storage.modify_branch(&branches[0], NO_OP, 18909976, 32768)?; // at 302325400
let size = storage.calculate(100_000);
let size = storage.calculate(100_000)?;
(storage.into_segments(), size)
Ok((storage.into_segments(), size))
}
fn main() {
@@ -163,7 +163,8 @@ fn main() {
eprintln!("invalid scenario {}", other);
std::process::exit(1);
}
};
}
.unwrap();
graphviz_tree(&segments, &size);
}
@@ -251,7 +252,7 @@ fn graphviz_tree(segments: &[Segment], tree: &SegmentSize) {
#[test]
fn scenarios_return_same_size() {
type ScenarioFn = fn() -> (Vec<Segment>, SegmentSize);
type ScenarioFn = fn() -> anyhow::Result<(Vec<Segment>, SegmentSize)>;
let truths: &[(u32, ScenarioFn, _)] = &[
(line!(), scenario_1, 8000),
(line!(), scenario_2, 9000),
@@ -262,7 +263,7 @@ fn scenarios_return_same_size() {
];
for (line, scenario, expected) in truths {
let (_, size) = scenario();
let (_, size) = scenario().unwrap();
assert_eq!(*expected, size.total_children(), "scenario on line {line}");
}
}


@@ -1,48 +1,50 @@
[package]
name = "utils"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
sentry = { version = "0.29.0", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
async-trait = "0.1"
anyhow = "1.0"
bincode = "1.3"
bytes = "1.0.1"
hyper = { version = "0.14.7", features = ["full"] }
routerify = "3"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
thiserror = "1.0"
tokio = { version = "1.17", features = ["macros"]}
tokio-rustls = "0.23"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
nix = "0.25"
signal-hook = "0.3.10"
rand = "0.8.3"
jsonwebtoken = "8"
hex = { version = "0.4.3", features = ["serde"] }
rustls = "0.20.2"
rustls-split = "0.3.0"
git-version = "0.3.5"
serde_with = "2.0"
once_cell = "1.13.0"
strum = "0.24"
strum_macros = "0.24"
atty.workspace = true
sentry.workspace = true
async-trait.workspace = true
anyhow.workspace = true
bincode.workspace = true
bytes.workspace = true
hyper = { workspace = true, features = ["full"] }
routerify.workspace = true
serde.workspace = true
serde_json.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-rustls.workspace = true
tracing.workspace = true
tracing-subscriber = { workspace = true, features = ["json"] }
nix.workspace = true
signal-hook.workspace = true
rand.workspace = true
jsonwebtoken.workspace = true
hex = { workspace = true, features = ["serde"] }
rustls.workspace = true
rustls-split.workspace = true
git-version.workspace = true
serde_with.workspace = true
once_cell.workspace = true
strum.workspace = true
strum_macros.workspace = true
metrics = { path = "../metrics" }
pq_proto = { path = "../pq_proto" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
metrics.workspace = true
pq_proto.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
byteorder = "1.4.3"
bytes = "1.0.1"
hex-literal = "0.3"
tempfile = "3.2"
criterion = "0.4"
rustls-pemfile = "1"
byteorder.workspace = true
bytes.workspace = true
hex-literal.workspace = true
tempfile.workspace = true
criterion.workspace = true
rustls-pemfile.workspace = true
[[bench]]
name = "benchmarks"


@@ -8,6 +8,7 @@ use strum_macros::{EnumString, EnumVariantNames};
pub enum LogFormat {
Plain,
Json,
Test,
}
impl LogFormat {
@@ -33,12 +34,13 @@ pub fn init(log_format: LogFormat) -> anyhow::Result<()> {
let base_logger = tracing_subscriber::fmt()
.with_env_filter(env_filter)
.with_target(false)
.with_ansi(false)
.with_ansi(atty::is(atty::Stream::Stdout))
.with_writer(std::io::stdout);
match log_format {
LogFormat::Json => base_logger.json().init(),
LogFormat::Plain => base_logger.init(),
LogFormat::Test => base_logger.with_test_writer().init(),
}
Ok(())


@@ -7,12 +7,12 @@ use crate::postgres_backend::AuthType;
use anyhow::Context;
use bytes::{Buf, Bytes, BytesMut};
use pq_proto::{BeMessage, ConnectionError, FeMessage, FeStartupPacket, SQLSTATE_INTERNAL_ERROR};
use std::future::Future;
use std::io;
use std::net::SocketAddr;
use std::pin::Pin;
use std::sync::Arc;
use std::task::Poll;
use std::{future::Future, task::ready};
use tracing::{debug, error, info, trace};
use tokio::io::{AsyncRead, AsyncWrite, AsyncWriteExt, BufReader};
@@ -253,12 +253,9 @@ impl PostgresBackend {
cx: &mut std::task::Context<'_>,
) -> Poll<Result<(), std::io::Error>> {
while self.buf_out.has_remaining() {
match Pin::new(&mut self.stream).poll_write(cx, self.buf_out.chunk()) {
Poll::Ready(Ok(bytes_written)) => {
self.buf_out.advance(bytes_written);
}
Poll::Ready(Err(err)) => return Poll::Ready(Err(err)),
Poll::Pending => return Poll::Pending,
match ready!(Pin::new(&mut self.stream).poll_write(cx, self.buf_out.chunk())) {
Ok(bytes_written) => self.buf_out.advance(bytes_written),
Err(err) => return Poll::Ready(Err(err)),
}
}
Poll::Ready(Ok(()))
@@ -573,10 +570,9 @@ impl<'a> AsyncWrite for CopyDataWriter<'a> {
// It's not strictly required to flush between each message, but makes it easier
// to view in wireshark, and usually the messages that the callers write are
// decently-sized anyway.
match this.pgb.poll_write_buf(cx) {
Poll::Ready(Ok(())) => {}
Poll::Ready(Err(err)) => return Poll::Ready(Err(err)),
Poll::Pending => return Poll::Pending,
match ready!(this.pgb.poll_write_buf(cx)) {
Ok(()) => {}
Err(err) => return Poll::Ready(Err(err)),
}
// CopyData
@@ -593,10 +589,9 @@ impl<'a> AsyncWrite for CopyDataWriter<'a> {
cx: &mut std::task::Context<'_>,
) -> Poll<Result<(), std::io::Error>> {
let this = self.get_mut();
match this.pgb.poll_write_buf(cx) {
Poll::Ready(Ok(())) => {}
Poll::Ready(Err(err)) => return Poll::Ready(Err(err)),
Poll::Pending => return Poll::Pending,
match ready!(this.pgb.poll_write_buf(cx)) {
Ok(()) => {}
Err(err) => return Poll::Ready(Err(err)),
}
this.pgb.poll_flush(cx)
}
@@ -605,10 +600,9 @@ impl<'a> AsyncWrite for CopyDataWriter<'a> {
cx: &mut std::task::Context<'_>,
) -> Poll<Result<(), std::io::Error>> {
let this = self.get_mut();
match this.pgb.poll_write_buf(cx) {
Poll::Ready(Ok(())) => {}
Poll::Ready(Err(err)) => return Poll::Ready(Err(err)),
Poll::Pending => return Poll::Pending,
match ready!(this.pgb.poll_write_buf(cx)) {
Ok(()) => {}
Err(err) => return Poll::Ready(Err(err)),
}
this.pgb.poll_flush(cx)
}


@@ -1,8 +1,8 @@
[package]
name = "pageserver"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[features]
default = []
@@ -11,68 +11,68 @@ default = []
testing = ["fail/failpoints"]
[dependencies]
amplify_num = { git = "https://github.com/hlinnaka/rust-amplify.git", branch = "unsigned-int-perf" }
anyhow = { version = "1.0", features = ["backtrace"] }
async-stream = "0.3"
async-trait = "0.1"
byteorder = "1.4.3"
bytes = "1.0.1"
chrono = { version = "0.4.23", default-features = false, features = ["clock", "serde"] }
clap = { version = "4.0", features = ["string"] }
close_fds = "0.3.2"
const_format = "0.2.21"
crc32c = "0.6.0"
crossbeam-utils = "0.8.5"
fail = "0.5.0"
futures = "0.3.13"
git-version = "0.3.5"
hex = "0.4.3"
humantime = "2.1.0"
humantime-serde = "1.1.1"
hyper = "0.14"
itertools = "0.10.3"
nix = "0.25"
num-traits = "0.2.15"
once_cell = "1.13.0"
pin-project-lite = "0.2.7"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
rand = "0.8.3"
regex = "1.4.5"
rstar = "0.9.3"
scopeguard = "1.1.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = { version = "1.0", features = ["raw_value"] }
serde_with = "2.0"
signal-hook = "0.3.10"
svg_fmt = "0.4.1"
tokio-tar = { git = "https://github.com/neondatabase/tokio-tar.git", rev="404df61437de0feef49ba2ccdbdd94eb8ad6e142" }
thiserror = "1.0"
tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
tokio-util = { version = "0.7.3", features = ["io", "io-util"] }
toml_edit = { version = "0.14", features = ["easy"] }
tracing = "0.1.36"
url = "2"
walkdir = "2.3.2"
metrics = { path = "../libs/metrics" }
pageserver_api = { path = "../libs/pageserver_api" }
postgres_connection = { path = "../libs/postgres_connection" }
postgres_ffi = { path = "../libs/postgres_ffi" }
pq_proto = { path = "../libs/pq_proto" }
remote_storage = { path = "../libs/remote_storage" }
storage_broker = { version = "0.1", path = "../storage_broker" }
tenant_size_model = { path = "../libs/tenant_size_model" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] }
anyhow.workspace = true
async-stream.workspace = true
async-trait.workspace = true
byteorder.workspace = true
bytes.workspace = true
chrono = { workspace = true, features = ["serde"] }
clap = { workspace = true, features = ["string"] }
close_fds.workspace = true
const_format.workspace = true
consumption_metrics.workspace = true
crc32c.workspace = true
crossbeam-utils.workspace = true
fail.workspace = true
futures.workspace = true
git-version.workspace = true
hex.workspace = true
humantime.workspace = true
humantime-serde.workspace = true
hyper.workspace = true
itertools.workspace = true
nix.workspace = true
num-traits.workspace = true
once_cell.workspace = true
pin-project-lite.workspace = true
postgres.workspace = true
postgres-protocol.workspace = true
postgres-types.workspace = true
rand.workspace = true
regex.workspace = true
scopeguard.workspace = true
serde.workspace = true
serde_json = { workspace = true, features = ["raw_value"] }
serde_with.workspace = true
signal-hook.workspace = true
svg_fmt.workspace = true
tokio-tar.workspace = true
thiserror.workspace = true
tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util", "time"] }
tokio-postgres.workspace = true
tokio-util.workspace = true
toml_edit.workspace = true
tracing.workspace = true
url.workspace = true
walkdir.workspace = true
metrics.workspace = true
pageserver_api.workspace = true
postgres_connection.workspace = true
postgres_ffi.workspace = true
pq_proto.workspace = true
remote_storage.workspace = true
storage_broker.workspace = true
tenant_size_model.workspace = true
utils.workspace = true
workspace_hack.workspace = true
reqwest.workspace = true
rpds.workspace = true
im = "15.1.0"
[dev-dependencies]
criterion = "0.4"
hex-literal = "0.3"
tempfile = "3.2"
criterion.workspace = true
hex-literal.workspace = true
tempfile.workspace = true
[[bench]]
name = "bench_layer_map"


@@ -1,13 +1,12 @@
use anyhow::Result;
use pageserver::keyspace::{KeyPartitioning, KeySpace};
use pageserver::repository::Key;
use pageserver::tenant::layer_map::LayerMap;
use pageserver::tenant::storage_layer::{DeltaFileName, ImageFileName, ValueReconstructState};
use pageserver::tenant::storage_layer::{Layer, ValueReconstructResult};
use pageserver::tenant::storage_layer::Layer;
use pageserver::tenant::storage_layer::{DeltaFileName, ImageFileName, LayerDescriptor};
use rand::prelude::{SeedableRng, SliceRandom, StdRng};
use std::cmp::{max, min};
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::ops::Range;
use std::path::PathBuf;
use std::str::FromStr;
use std::sync::Arc;
@@ -17,102 +16,35 @@ use utils::lsn::Lsn;
use criterion::{criterion_group, criterion_main, Criterion};
struct DummyDelta {
key_range: Range<Key>,
lsn_range: Range<Lsn>,
}
impl Layer for DummyDelta {
fn get_key_range(&self) -> Range<Key> {
self.key_range.clone()
}
fn get_lsn_range(&self) -> Range<Lsn> {
self.lsn_range.clone()
}
fn get_value_reconstruct_data(
&self,
_key: Key,
_lsn_range: Range<Lsn>,
_reconstruct_data: &mut ValueReconstructState,
) -> Result<ValueReconstructResult> {
panic!()
}
fn is_incremental(&self) -> bool {
true
}
fn dump(&self, _verbose: bool) -> Result<()> {
unimplemented!()
}
fn short_id(&self) -> String {
unimplemented!()
}
}
struct DummyImage {
key_range: Range<Key>,
lsn: Lsn,
}
impl Layer for DummyImage {
fn get_key_range(&self) -> Range<Key> {
self.key_range.clone()
}
fn get_lsn_range(&self) -> Range<Lsn> {
// End-bound is exclusive
self.lsn..(self.lsn + 1)
}
fn get_value_reconstruct_data(
&self,
_key: Key,
_lsn_range: Range<Lsn>,
_reconstruct_data: &mut ValueReconstructState,
) -> Result<ValueReconstructResult> {
panic!()
}
fn is_incremental(&self) -> bool {
false
}
fn dump(&self, _verbose: bool) -> Result<()> {
unimplemented!()
}
fn short_id(&self) -> String {
unimplemented!()
}
}
fn build_layer_map(filename_dump: PathBuf) -> LayerMap<dyn Layer> {
let mut layer_map = LayerMap::<dyn Layer>::default();
fn build_layer_map(filename_dump: PathBuf) -> LayerMap<LayerDescriptor> {
let mut layer_map = LayerMap::<LayerDescriptor>::default();
let mut min_lsn = Lsn(u64::MAX);
let mut max_lsn = Lsn(0);
let filenames = BufReader::new(File::open(filename_dump).unwrap()).lines();
let mut updates = layer_map.batch_update();
for fname in filenames {
let fname = &fname.unwrap();
if let Some(imgfilename) = ImageFileName::parse_str(fname) {
let layer = DummyImage {
key_range: imgfilename.key_range,
lsn: imgfilename.lsn,
let layer = LayerDescriptor {
key: imgfilename.key_range,
lsn: imgfilename.lsn..(imgfilename.lsn + 1),
is_incremental: false,
short_id: fname.to_string(),
};
layer_map.insert_historic(Arc::new(layer));
updates.insert_historic(Arc::new(layer));
min_lsn = min(min_lsn, imgfilename.lsn);
max_lsn = max(max_lsn, imgfilename.lsn);
} else if let Some(deltafilename) = DeltaFileName::parse_str(fname) {
let layer = DummyDelta {
key_range: deltafilename.key_range,
lsn_range: deltafilename.lsn_range.clone(),
let layer = LayerDescriptor {
key: deltafilename.key_range.clone(),
lsn: deltafilename.lsn_range.clone(),
is_incremental: true,
short_id: fname.to_string(),
};
layer_map.insert_historic(Arc::new(layer));
updates.insert_historic(Arc::new(layer));
min_lsn = min(min_lsn, deltafilename.lsn_range.start);
max_lsn = max(max_lsn, deltafilename.lsn_range.end);
} else {
@@ -122,11 +54,12 @@ fn build_layer_map(filename_dump: PathBuf) -> LayerMap<dyn Layer> {
println!("min: {min_lsn}, max: {max_lsn}");
updates.flush();
layer_map
}
/// Construct a layer map query pattern for benchmarks
fn uniform_query_pattern(layer_map: &LayerMap<dyn Layer>) -> Vec<(Key, Lsn)> {
fn uniform_query_pattern(layer_map: &LayerMap<LayerDescriptor>) -> Vec<(Key, Lsn)> {
// For each image layer we query one of the pages contained, at LSN right
// before the image layer was created. This gives us a somewhat uniform
// coverage of both the lsn and key space because image layers have
@@ -150,6 +83,41 @@ fn uniform_query_pattern(layer_map: &LayerMap<dyn Layer>) -> Vec<(Key, Lsn)> {
.collect()
}
// Construct a partitioning for testing get_difficulty map when we
// don't have an exact result of `collect_keyspace` to work with.
fn uniform_key_partitioning(layer_map: &LayerMap<LayerDescriptor>, _lsn: Lsn) -> KeyPartitioning {
let mut parts = Vec::new();
// We add a partition boundary at the start of each image layer,
// no matter what lsn range it covers. This is just the easiest
// thing to do. A better thing to do would be to get a real
// partitioning from some database. Even better, remove the need
// for key partitions by deciding where to create image layers
// directly based on a coverage-based difficulty map.
let mut keys: Vec<_> = layer_map
.iter_historic_layers()
.filter_map(|l| {
if l.is_incremental() {
None
} else {
let kr = l.get_key_range();
Some(kr.start.next())
}
})
.collect();
keys.sort();
let mut current_key = Key::from_hex("000000000000000000000000000000000000").unwrap();
for key in keys {
parts.push(KeySpace {
ranges: vec![current_key..key],
});
current_key = key;
}
KeyPartitioning { parts }
}
// Benchmark using metadata extracted from our performance test environment, from
// a project where we have run pgbench many times. The pgbench database was initialized
// between each test run.
@@ -183,24 +151,68 @@ fn bench_from_captest_env(c: &mut Criterion) {
// Benchmark using metadata extracted from a real project that was taking
// too long processing layer map queries.
fn bench_from_real_project(c: &mut Criterion) {
// TODO consider compressing this file
// Init layer map
let now = Instant::now();
let layer_map = build_layer_map(PathBuf::from("benches/odd-brook-layernames.txt"));
println!("Finished layer map init in {:?}", now.elapsed());
// Choose uniformly distributed queries
let queries: Vec<(Key, Lsn)> = uniform_query_pattern(&layer_map);
// Test with uniform query pattern
c.bench_function("real_map_uniform_queries", |b| {
// Choose inputs for get_difficulty_map
let latest_lsn = layer_map
.iter_historic_layers()
.map(|l| l.get_lsn_range().end)
.max()
.unwrap();
let partitioning = uniform_key_partitioning(&layer_map, latest_lsn);
// Check correctness of get_difficulty_map
// TODO put this in a dedicated test outside of this mod
{
println!("running correctness check");
let now = Instant::now();
let result_bruteforce = layer_map.get_difficulty_map_bruteforce(latest_lsn, &partitioning);
assert!(result_bruteforce.len() == partitioning.parts.len());
println!("Finished bruteforce in {:?}", now.elapsed());
let now = Instant::now();
let result_fast = layer_map.get_difficulty_map(latest_lsn, &partitioning, None);
assert!(result_fast.len() == partitioning.parts.len());
println!("Finished fast in {:?}", now.elapsed());
// Assert results are equal. Manually iterate for easier debugging.
let zip = std::iter::zip(
&partitioning.parts,
std::iter::zip(result_bruteforce, result_fast),
);
for (_part, (bruteforce, fast)) in zip {
assert_eq!(bruteforce, fast);
}
println!("No issues found");
}
// Define and name the benchmark function
let mut group = c.benchmark_group("real_map");
group.bench_function("uniform_queries", |b| {
b.iter(|| {
for q in queries.clone().into_iter() {
layer_map.search(q.0, q.1);
}
});
});
group.bench_function("get_difficulty_map", |b| {
b.iter(|| {
layer_map.get_difficulty_map(latest_lsn, &partitioning, Some(3));
});
});
group.finish();
}
// Benchmark using synthetic data. Arrange image layers on stacked diagonal lines.
fn bench_sequential(c: &mut Criterion) {
let mut layer_map: LayerMap<dyn Layer> = LayerMap::default();
// Init layer map. Create 100_000 layers arranged in 1000 diagonal lines.
//
// TODO This code is pretty slow and runs even if we're only running other
@@ -208,39 +220,39 @@ fn bench_sequential(c: &mut Criterion) {
// Putting it inside the `bench_function` closure is not a solution
// because then it runs multiple times during warmup.
let now = Instant::now();
let mut layer_map = LayerMap::default();
let mut updates = layer_map.batch_update();
for i in 0..100_000 {
// TODO try inserting a super-wide layer in between every 10 to reflect
// what often happens with L1 layers that include non-rel changes.
// Maybe do that as a separate test.
let i32 = (i as u32) % 100;
let zero = Key::from_hex("000000000000000000000000000000000000").unwrap();
let layer = DummyImage {
key_range: zero.add(10 * i32)..zero.add(10 * i32 + 1),
lsn: Lsn(10 * i),
let layer = LayerDescriptor {
key: zero.add(10 * i32)..zero.add(10 * i32 + 1),
lsn: Lsn(i)..Lsn(i + 1),
is_incremental: false,
short_id: format!("Layer {}", i),
};
layer_map.insert_historic(Arc::new(layer));
updates.insert_historic(Arc::new(layer));
}
// Manually measure runtime without criterion because criterion
// has a minimum sample size of 10 and I don't want to run it 10 times.
println!("Finished init in {:?}", now.elapsed());
updates.flush();
println!("Finished layer map init in {:?}", now.elapsed());
// Choose 100 uniformly random queries
let rng = &mut StdRng::seed_from_u64(1);
let queries: Vec<(Key, Lsn)> = uniform_query_pattern(&layer_map)
.choose_multiple(rng, 1)
.choose_multiple(rng, 100)
.copied()
.collect();
// Define and name the benchmark function
c.bench_function("sequential_uniform_queries", |b| {
// Run the search queries
let mut group = c.benchmark_group("sequential");
group.bench_function("uniform_queries", |b| {
b.iter(|| {
for q in queries.clone().into_iter() {
layer_map.search(q.0, q.1);
}
});
});
group.finish();
}
criterion_group!(group_1, bench_from_captest_env);


@@ -30,33 +30,44 @@ fn redo_scenarios(c: &mut Criterion) {
let conf = PageServerConf::dummy_conf(repo_dir.path().to_path_buf());
let conf = Box::leak(Box::new(conf));
let tenant_id = TenantId::generate();
// std::fs::create_dir_all(conf.tenant_path(&tenant_id)).unwrap();
let mut manager = PostgresRedoManager::new(conf, tenant_id);
manager.launch_process(14).unwrap();
let manager = PostgresRedoManager::new(conf, tenant_id);
let manager = Arc::new(manager);
tracing::info!("executing first");
short().execute(&manager).unwrap();
tracing::info!("first executed");
let thread_counts = [1, 2, 4, 8, 16];
for thread_count in thread_counts {
c.bench_with_input(
BenchmarkId::new("short-50record", thread_count),
&thread_count,
|b, thread_count| {
add_multithreaded_walredo_requesters(b, *thread_count, &manager, short, 50);
},
);
}
let mut group = c.benchmark_group("short");
group.sampling_mode(criterion::SamplingMode::Flat);
for thread_count in thread_counts {
c.bench_with_input(
BenchmarkId::new("medium-10record", thread_count),
group.bench_with_input(
BenchmarkId::new("short", thread_count),
&thread_count,
|b, thread_count| {
add_multithreaded_walredo_requesters(b, *thread_count, &manager, medium, 10);
add_multithreaded_walredo_requesters(b, *thread_count, &manager, short);
},
);
}
drop(group);
let mut group = c.benchmark_group("medium");
group.sampling_mode(criterion::SamplingMode::Flat);
for thread_count in thread_counts {
group.bench_with_input(
BenchmarkId::new("medium", thread_count),
&thread_count,
|b, thread_count| {
add_multithreaded_walredo_requesters(b, *thread_count, &manager, medium);
},
);
}
drop(group);
}
/// Sets up `threads` number of requesters to `request_redo`, with the given input.
@@ -65,46 +76,66 @@ fn add_multithreaded_walredo_requesters(
threads: u32,
manager: &Arc<PostgresRedoManager>,
input_factory: fn() -> Request,
request_repeats: usize,
) {
b.iter_batched_ref(
|| {
// barrier for all of the threads, and the benchmarked thread
let barrier = Arc::new(Barrier::new(threads as usize + 1));
assert_ne!(threads, 0);
let jhs = (0..threads)
.map(|_| {
std::thread::spawn({
let manager = manager.clone();
let barrier = barrier.clone();
move || {
let input = std::iter::repeat(input_factory())
.take(request_repeats)
.collect::<Vec<_>>();
if threads == 1 {
b.iter_batched_ref(
|| Some(input_factory()),
|input| execute_all(input.take(), manager),
criterion::BatchSize::PerIteration,
);
} else {
let (work_tx, work_rx) = std::sync::mpsc::sync_channel(threads as usize);
barrier.wait();
let work_rx = std::sync::Arc::new(std::sync::Mutex::new(work_rx));
execute_all(input, &manager).unwrap();
let barrier = Arc::new(Barrier::new(threads as usize + 1));
barrier.wait();
let jhs = (0..threads)
.map(|_| {
std::thread::spawn({
let manager = manager.clone();
let barrier = barrier.clone();
let work_rx = work_rx.clone();
move || loop {
// queue up and wait if we want to go another round
if work_rx.lock().unwrap().recv().is_err() {
break;
}
})
let input = Some(input_factory());
barrier.wait();
execute_all(input, &manager).unwrap();
barrier.wait();
}
})
.collect::<Vec<_>>();
})
.collect::<Vec<_>>();
(barrier, JoinOnDrop(jhs))
},
|input| {
let barrier = &input.0;
let _jhs = JoinOnDrop(jhs);
// start the work
barrier.wait();
b.iter_batched(
|| {
for _ in 0..threads {
work_tx.send(()).unwrap()
}
},
|()| {
// start the work
barrier.wait();
// wait for work to complete
barrier.wait();
},
criterion::BatchSize::PerIteration,
);
// wait for work to complete
barrier.wait();
},
criterion::BatchSize::PerIteration,
);
drop(work_tx);
}
}
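
The setup above combines an mpsc channel (to hand out one unit of work per thread per criterion iteration) with a Barrier used as a start gate and an end gate. Below is a stripped-down, single-round sketch of that barrier pattern, independent of criterion and walredo; all names are illustrative, and the real benchmark additionally reuses its worker threads across iterations via the channel.

use std::sync::{Arc, Barrier};
use std::thread;

// Single-round sketch: the measuring thread and `workers` helper threads share a
// Barrier of size workers + 1. The first wait releases all workers at once; the
// second wait returns only after every worker has finished its unit of work.
fn run_one_round(workers: usize, work: fn()) {
    let barrier = Arc::new(Barrier::new(workers + 1));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                barrier.wait(); // start gate
                work();
                barrier.wait(); // end gate
            })
        })
        .collect();
    barrier.wait(); // release the workers; timing would start here
    barrier.wait(); // all workers are done once this returns
    for handle in handles {
        handle.join().unwrap();
    }
}

// Example: run_one_round(16, || println!("redo request"));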
struct JoinOnDrop(Vec<std::thread::JoinHandle<()>>);
@@ -121,7 +152,10 @@ impl Drop for JoinOnDrop {
}
}
fn execute_all(input: Vec<Request>, manager: &PostgresRedoManager) -> Result<(), WalRedoError> {
fn execute_all<I>(input: I, manager: &PostgresRedoManager) -> Result<(), WalRedoError>
where
I: IntoIterator<Item = Request>,
{
// just fire all requests as fast as possible
input.into_iter().try_for_each(|req| {
let page = req.execute(manager)?;
@@ -143,6 +177,7 @@ macro_rules! lsn {
}};
}
/// Short payload, 1132 bytes.
// pg_records are copypasted from log, where they are put with Debug impl of Bytes, which uses \0
// for null bytes.
#[allow(clippy::octal_escapes)]
@@ -172,6 +207,7 @@ fn short() -> Request {
}
}
/// Medium sized payload, serializes as 26393 bytes.
// see [`short`]
#[allow(clippy::octal_escapes)]
fn medium() -> Request {

Binary file not shown.


@@ -336,6 +336,7 @@ fn start_pageserver(conf: &'static PageServerConf) -> anyhow::Result<()> {
pageserver::consumption_metrics::collect_metrics(
metric_collection_endpoint,
conf.metric_collection_interval,
conf.synthetic_size_calculation_interval,
conf.id,
)
.instrument(info_span!("metrics_collection"))


@@ -59,6 +59,8 @@ pub mod defaults {
pub const DEFAULT_METRIC_COLLECTION_INTERVAL: &str = "10 min";
pub const DEFAULT_METRIC_COLLECTION_ENDPOINT: Option<reqwest::Url> = None;
pub const DEFAULT_SYNTHETIC_SIZE_CALCULATION_INTERVAL: &str = "10 min";
///
/// Default built-in configuration file.
///
@@ -83,6 +85,7 @@ pub mod defaults {
#concurrent_tenant_size_logical_size_queries = '{DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES}'
#metric_collection_interval = '{DEFAULT_METRIC_COLLECTION_INTERVAL}'
#synthetic_size_calculation_interval = '{DEFAULT_SYNTHETIC_SIZE_CALCULATION_INTERVAL}'
# [tenant_config]
#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
@@ -152,6 +155,7 @@ pub struct PageServerConf {
// How often to collect metrics and send them to the metrics endpoint.
pub metric_collection_interval: Duration,
pub metric_collection_endpoint: Option<Url>,
pub synthetic_size_calculation_interval: Duration,
pub test_remote_failures: u64,
}
@@ -215,6 +219,7 @@ struct PageServerConfigBuilder {
metric_collection_interval: BuilderValue<Duration>,
metric_collection_endpoint: BuilderValue<Option<Url>>,
synthetic_size_calculation_interval: BuilderValue<Duration>,
test_remote_failures: BuilderValue<u64>,
}
@@ -255,6 +260,10 @@ impl Default for PageServerConfigBuilder {
DEFAULT_METRIC_COLLECTION_INTERVAL,
)
.expect("cannot parse default metric collection interval")),
synthetic_size_calculation_interval: Set(humantime::parse_duration(
DEFAULT_SYNTHETIC_SIZE_CALCULATION_INTERVAL,
)
.expect("cannot parse default synthetic size calculation interval")),
metric_collection_endpoint: Set(DEFAULT_METRIC_COLLECTION_ENDPOINT),
test_remote_failures: Set(0),
@@ -342,6 +351,14 @@ impl PageServerConfigBuilder {
self.metric_collection_endpoint = BuilderValue::Set(metric_collection_endpoint)
}
pub fn synthetic_size_calculation_interval(
&mut self,
synthetic_size_calculation_interval: Duration,
) {
self.synthetic_size_calculation_interval =
BuilderValue::Set(synthetic_size_calculation_interval)
}
pub fn test_remote_failures(&mut self, fail_first: u64) {
self.test_remote_failures = BuilderValue::Set(fail_first);
}
@@ -399,6 +416,9 @@ impl PageServerConfigBuilder {
metric_collection_endpoint: self
.metric_collection_endpoint
.ok_or(anyhow!("missing metric_collection_endpoint"))?,
synthetic_size_calculation_interval: self
.synthetic_size_calculation_interval
.ok_or(anyhow!("missing synthetic_size_calculation_interval"))?,
test_remote_failures: self
.test_remote_failures
.ok_or(anyhow!("missing test_remote_failuers"))?,
@@ -577,7 +597,8 @@ impl PageServerConf {
let endpoint = parse_toml_string(key, item)?.parse().context("failed to parse metric_collection_endpoint")?;
builder.metric_collection_endpoint(Some(endpoint));
},
"synthetic_size_calculation_interval" =>
builder.synthetic_size_calculation_interval(parse_toml_duration(key, item)?),
"test_remote_failures" => builder.test_remote_failures(parse_toml_u64(key, item)?),
_ => bail!("unrecognized pageserver option '{key}'"),
}
@@ -701,6 +722,7 @@ impl PageServerConf {
concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
metric_collection_interval: Duration::from_secs(60),
metric_collection_endpoint: defaults::DEFAULT_METRIC_COLLECTION_ENDPOINT,
synthetic_size_calculation_interval: Duration::from_secs(60),
test_remote_failures: 0,
}
}
@@ -834,6 +856,7 @@ id = 10
metric_collection_interval = '222 s'
metric_collection_endpoint = 'http://localhost:80/metrics'
synthetic_size_calculation_interval = '333 s'
log_format = 'json'
"#;
@@ -880,6 +903,9 @@ log_format = 'json'
defaults::DEFAULT_METRIC_COLLECTION_INTERVAL
)?,
metric_collection_endpoint: defaults::DEFAULT_METRIC_COLLECTION_ENDPOINT,
synthetic_size_calculation_interval: humantime::parse_duration(
defaults::DEFAULT_SYNTHETIC_SIZE_CALCULATION_INTERVAL
)?,
test_remote_failures: 0,
},
"Correct defaults should be used when no config values are provided"
@@ -926,6 +952,7 @@ log_format = 'json'
concurrent_tenant_size_logical_size_queries: ConfigurableSemaphore::default(),
metric_collection_interval: Duration::from_secs(222),
metric_collection_endpoint: Some(Url::parse("http://localhost:80/metrics")?),
synthetic_size_calculation_interval: Duration::from_secs(333),
test_remote_failures: 0,
},
"Should be able to parse all basic config values correctly"


@@ -3,154 +3,74 @@
//! and push them to a HTTP endpoint.
//! Cache metrics to send only the updated ones.
//!
use anyhow;
use tracing::*;
use utils::id::NodeId;
use utils::id::TimelineId;
use crate::task_mgr;
use crate::task_mgr::{self, TaskKind, BACKGROUND_RUNTIME};
use crate::tenant::mgr;
use anyhow;
use chrono::Utc;
use consumption_metrics::{idempotency_key, Event, EventChunk, EventType, CHUNK_SIZE};
use pageserver_api::models::TenantState;
use utils::id::TenantId;
use serde::{Deserialize, Serialize};
use reqwest::Url;
use serde::Serialize;
use serde_with::{serde_as, DisplayFromStr};
use std::collections::HashMap;
use std::fmt;
use std::str::FromStr;
use std::time::Duration;
use tracing::*;
use utils::id::{NodeId, TenantId, TimelineId};
use chrono::{DateTime, Utc};
use rand::Rng;
use reqwest::Url;
const WRITTEN_SIZE: &str = "written_size";
const SYNTHETIC_STORAGE_SIZE: &str = "synthetic_storage_size";
const RESIDENT_SIZE: &str = "resident_size";
const REMOTE_STORAGE_SIZE: &str = "remote_storage_size";
const TIMELINE_LOGICAL_SIZE: &str = "timeline_logical_size";
/// ConsumptionMetric struct that defines the format for one metric entry
/// i.e.
///
/// ```json
/// {
/// "metric": "remote_storage_size",
/// "type": "absolute",
/// "tenant_id": "5d07d9ce9237c4cd845ea7918c0afa7d",
/// "timeline_id": "a03ebb4f5922a1c56ff7485cc8854143",
/// "time": "2022-12-28T11:07:19.317310284Z",
/// "idempotency_key": "2022-12-28 11:07:19.317310324 UTC-1-4019",
/// "value": 12345454,
/// }
/// ```
#[serde_as]
#[derive(Serialize, Deserialize, Debug, Clone, Eq, PartialEq, Ord, PartialOrd)]
pub struct ConsumptionMetric {
pub metric: ConsumptionMetricKind,
#[serde(rename = "type")]
pub metric_type: &'static str,
#[derive(Serialize)]
struct Ids {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
tenant_id: TenantId,
#[serde_as(as = "Option<DisplayFromStr>")]
#[serde(skip_serializing_if = "Option::is_none")]
pub timeline_id: Option<TimelineId>,
pub time: DateTime<Utc>,
pub idempotency_key: String,
pub value: u64,
}
impl ConsumptionMetric {
pub fn new_absolute<R: Rng + ?Sized>(
metric: ConsumptionMetricKind,
tenant_id: TenantId,
timeline_id: Option<TimelineId>,
value: u64,
node_id: NodeId,
rng: &mut R,
) -> Self {
Self {
metric,
metric_type: "absolute",
tenant_id,
timeline_id,
time: Utc::now(),
// key that allows metric collector to distinguish unique events
idempotency_key: format!("{}-{}-{:04}", Utc::now(), node_id, rng.gen_range(0..=9999)),
value,
}
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Ord, PartialOrd, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ConsumptionMetricKind {
/// Amount of WAL produced by a timeline, i.e. last_record_lsn
/// This is an absolute, per-timeline metric.
WrittenSize,
/// Size of all tenant branches including WAL
/// This is an absolute, per-tenant metric.
/// This is the same metric that tenant/tenant_id/size endpoint returns.
SyntheticStorageSize,
/// Size of all the layer files in the tenant's directory on disk on the pageserver.
/// This is an absolute, per-tenant metric.
/// See also prometheus metric RESIDENT_PHYSICAL_SIZE.
ResidentSize,
/// Size of the remote storage (S3) directory.
/// This is an absolute, per-tenant metric.
RemoteStorageSize,
/// Logical size of the data in the timeline
/// This is an absolute, per-timeline metric
TimelineLogicalSize,
}
impl FromStr for ConsumptionMetricKind {
type Err = anyhow::Error;
fn from_str(s: &str) -> Result<Self, Self::Err> {
match s {
"written_size" => Ok(Self::WrittenSize),
"synthetic_storage_size" => Ok(Self::SyntheticStorageSize),
"resident_size" => Ok(Self::ResidentSize),
"remote_storage_size" => Ok(Self::RemoteStorageSize),
"timeline_logical_size" => Ok(Self::TimelineLogicalSize),
_ => anyhow::bail!("invalid value \"{s}\" for metric type"),
}
}
}
impl fmt::Display for ConsumptionMetricKind {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.write_str(match self {
ConsumptionMetricKind::WrittenSize => "written_size",
ConsumptionMetricKind::SyntheticStorageSize => "synthetic_storage_size",
ConsumptionMetricKind::ResidentSize => "resident_size",
ConsumptionMetricKind::RemoteStorageSize => "remote_storage_size",
ConsumptionMetricKind::TimelineLogicalSize => "timeline_logical_size",
})
}
}
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ConsumptionMetricsKey {
tenant_id: TenantId,
timeline_id: Option<TimelineId>,
metric: ConsumptionMetricKind,
}
#[derive(serde::Serialize)]
struct EventChunk<'a> {
events: &'a [ConsumptionMetric],
/// Key that uniquely identifies the object this metric describes.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PageserverConsumptionMetricsKey {
pub tenant_id: TenantId,
pub timeline_id: Option<TimelineId>,
pub metric: &'static str,
}
/// Main thread that serves metrics collection
pub async fn collect_metrics(
metric_collection_endpoint: &Url,
metric_collection_interval: Duration,
synthetic_size_calculation_interval: Duration,
node_id: NodeId,
) -> anyhow::Result<()> {
let mut ticker = tokio::time::interval(metric_collection_interval);
info!("starting collect_metrics");
// spin up a background worker that calculates tenant sizes
task_mgr::spawn(
BACKGROUND_RUNTIME.handle(),
TaskKind::CalculateSyntheticSize,
None,
None,
"synthetic size calculation",
false,
async move {
calculate_synthetic_size_worker(synthetic_size_calculation_interval)
.instrument(info_span!("synthetic_size_worker"))
.await?;
Ok(())
},
);
// define client here to reuse it for all requests
let client = reqwest::Client::new();
let mut cached_metrics: HashMap<ConsumptionMetricsKey, u64> = HashMap::new();
let mut cached_metrics: HashMap<PageserverConsumptionMetricsKey, u64> = HashMap::new();
loop {
tokio::select! {
@@ -159,7 +79,10 @@ pub async fn collect_metrics(
return Ok(());
},
_ = ticker.tick() => {
collect_metrics_task(&client, &mut cached_metrics, metric_collection_endpoint, node_id).await?;
if let Err(err) = collect_metrics_iteration(&client, &mut cached_metrics, metric_collection_endpoint, node_id).await
{
error!("metrics collection failed: {err:?}");
}
}
}
}
@@ -169,15 +92,20 @@ pub async fn collect_metrics(
///
/// Gather per-tenant and per-timeline metrics and send them to the `metric_collection_endpoint`.
/// Cache metrics to avoid sending the same metrics multiple times.
pub async fn collect_metrics_task(
///
/// TODO
/// - refactor this function (chunking+sending part) to reuse it in proxy module;
/// - improve error handling. Now if one tenant fails to collect metrics,
/// the whole iteration fails and metrics for other tenants are not collected.
pub async fn collect_metrics_iteration(
client: &reqwest::Client,
cached_metrics: &mut HashMap<ConsumptionMetricsKey, u64>,
cached_metrics: &mut HashMap<PageserverConsumptionMetricsKey, u64>,
metric_collection_endpoint: &reqwest::Url,
node_id: NodeId,
) -> anyhow::Result<()> {
let mut current_metrics: Vec<(ConsumptionMetricsKey, u64)> = Vec::new();
let mut current_metrics: Vec<(PageserverConsumptionMetricsKey, u64)> = Vec::new();
trace!(
"starting collect_metrics_task. metric_collection_endpoint: {}",
"starting collect_metrics_iteration. metric_collection_endpoint: {}",
metric_collection_endpoint
);
@@ -201,10 +129,10 @@ pub async fn collect_metrics_task(
let timeline_written_size = u64::from(timeline.get_last_record_lsn());
current_metrics.push((
ConsumptionMetricsKey {
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: Some(timeline.timeline_id),
metric: ConsumptionMetricKind::WrittenSize,
metric: WRITTEN_SIZE,
},
timeline_written_size,
));
@@ -213,10 +141,10 @@ pub async fn collect_metrics_task(
// Only send timeline logical size when it is fully calculated.
if is_exact {
current_metrics.push((
ConsumptionMetricsKey {
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: Some(timeline.timeline_id),
metric: ConsumptionMetricKind::TimelineLogicalSize,
metric: TIMELINE_LOGICAL_SIZE,
},
timeline_logical_size,
));
@@ -234,24 +162,34 @@ pub async fn collect_metrics_task(
);
current_metrics.push((
ConsumptionMetricsKey {
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: None,
metric: ConsumptionMetricKind::ResidentSize,
metric: RESIDENT_SIZE,
},
tenant_resident_size,
));
current_metrics.push((
ConsumptionMetricsKey {
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: None,
metric: ConsumptionMetricKind::RemoteStorageSize,
metric: REMOTE_STORAGE_SIZE,
},
tenant_remote_size,
));
// TODO add SyntheticStorageSize metric
// Note that this metric is calculated in a separate bgworker
// Here we only use cached value, which may lag behind the real latest one
let tenant_synthetic_size = tenant.get_cached_synthetic_size();
current_metrics.push((
PageserverConsumptionMetricsKey {
tenant_id,
timeline_id: None,
metric: SYNTHETIC_STORAGE_SIZE,
},
tenant_synthetic_size,
));
}
// Filter metrics
@@ -267,35 +205,29 @@ pub async fn collect_metrics_task(
// Send metrics.
// Split into chunks of 1000 metrics to avoid exceeding the max request size
const CHUNK_SIZE: usize = 1000;
let chunks = current_metrics.chunks(CHUNK_SIZE);
let mut chunk_to_send: Vec<ConsumptionMetric> = Vec::with_capacity(1000);
let mut chunk_to_send: Vec<Event<Ids>> = Vec::with_capacity(CHUNK_SIZE);
for chunk in chunks {
chunk_to_send.clear();
// this code block is needed to convince the compiler
// that rng is not reused around an await point
{
// enrich metrics with timestamp and metric_kind before sending
let mut rng = rand::thread_rng();
chunk_to_send.extend(chunk.iter().map(|(curr_key, curr_val)| {
ConsumptionMetric::new_absolute(
curr_key.metric,
curr_key.tenant_id,
curr_key.timeline_id,
*curr_val,
node_id,
&mut rng,
)
}));
}
// enrich metrics with type,timestamp and idempotency key before sending
chunk_to_send.extend(chunk.iter().map(|(curr_key, curr_val)| Event {
kind: EventType::Absolute { time: Utc::now() },
metric: curr_key.metric,
idempotency_key: idempotency_key(node_id.to_string()),
value: *curr_val,
extra: Ids {
tenant_id: curr_key.tenant_id,
timeline_id: curr_key.timeline_id,
},
}));
let chunk_json = serde_json::value::to_raw_value(&EventChunk {
events: &chunk_to_send,
})
.expect("ConsumptionMetric should not fail serialization");
.expect("PageserverConsumptionMetric should not fail serialization");
let res = client
.post(metric_collection_endpoint.clone())
@@ -322,3 +254,39 @@ pub async fn collect_metrics_task(
Ok(())
}
/// Calculate synthetic size for each active tenant
pub async fn calculate_synthetic_size_worker(
synthetic_size_calculation_interval: Duration,
) -> anyhow::Result<()> {
info!("starting calculate_synthetic_size_worker");
let mut ticker = tokio::time::interval(synthetic_size_calculation_interval);
loop {
tokio::select! {
_ = task_mgr::shutdown_watcher() => {
return Ok(());
},
_ = ticker.tick() => {
let tenants = mgr::list_tenants().await;
// iterate through list of Active tenants and collect metrics
for (tenant_id, tenant_state) in tenants {
if tenant_state != TenantState::Active {
continue;
}
if let Ok(tenant) = mgr::get_tenant(tenant_id, true).await
{
if let Err(e) = tenant.calculate_synthetic_size().await {
error!("failed to calculate synthetic size for tenant {}: {}", tenant_id, e);
}
}
}
}
}
}
}


@@ -430,6 +430,13 @@ paths:
schema:
type: string
format: hex
- name: inputs_only
in: query
required: false
schema:
type: boolean
description: |
When true, skip calculation and only provide the model inputs (for debugging). Defaults to false.
get:
description: |
Calculate tenant's size, which is a mixture of WAL (bytes) and logical_size (bytes).
@@ -449,8 +456,9 @@ paths:
format: hex
size:
type: integer
nullable: true
description: |
Size metric in bytes.
Size metric in bytes or null if inputs_only=true was given.
"401":
description: Unauthorized Error
content:


@@ -3,6 +3,7 @@ use std::sync::Arc;
use anyhow::{anyhow, Context, Result};
use hyper::StatusCode;
use hyper::{Body, Request, Response, Uri};
use pageserver_api::models::DownloadRemoteLayersTaskSpawnRequest;
use remote_storage::GenericRemoteStorage;
use tokio_util::sync::CancellationToken;
use tracing::*;
@@ -238,11 +239,7 @@ fn query_param_present(request: &Request<Body>, param: &str) -> bool {
request
.uri()
.query()
.map(|v| {
url::form_urlencoded::parse(v.as_bytes())
.into_owned()
.any(|(p, _)| p == param)
})
.map(|v| url::form_urlencoded::parse(v.as_bytes()).any(|(p, _)| p == param))
.unwrap_or(false)
}
@@ -251,13 +248,12 @@ fn get_query_param(request: &Request<Body>, param_name: &str) -> Result<String,
Err(ApiError::BadRequest(anyhow!("empty query in request"))),
|v| {
url::form_urlencoded::parse(v.as_bytes())
.into_owned()
.find(|(k, _)| k == param_name)
.map_or(
Err(ApiError::BadRequest(anyhow!(
"no {param_name} specified in query parameters"
))),
|(_, v)| Ok(v),
|(_, v)| Ok(v.into_owned()),
)
},
)
@@ -281,7 +277,7 @@ async fn timeline_detail_handler(request: Request<Body>) -> Result<Response<Body
let timeline_info = build_timeline_info(&timeline, include_non_incremental_logical_size)
.await
.context("Failed to get local timeline info: {e:#}")
.context("get local timeline info")
.map_err(ApiError::InternalServerError)?;
Ok::<_, ApiError>(timeline_info)
@@ -452,21 +448,39 @@ async fn tenant_status(request: Request<Body>) -> Result<Response<Body>, ApiErro
json_response(StatusCode::OK, tenant_info)
}
/// HTTP endpoint to query the current tenant_size of a tenant.
///
/// This is not used by consumption metrics under [`crate::consumption_metrics`], but can be used
/// to debug any of the calculations. Requires `tenant_id` request parameter, supports
/// the `inputs_only=true|false` query parameter (default false), which helps debug failures to calculate model
/// values.
async fn tenant_size_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
let inputs_only = if query_param_present(&request, "inputs_only") {
get_query_param(&request, "inputs_only")?
.parse()
.map_err(|_| ApiError::BadRequest(anyhow!("failed to parse inputs_only")))?
} else {
false
};
let tenant = mgr::get_tenant(tenant_id, true)
.await
.map_err(ApiError::InternalServerError)?;
// this can be a long operation, it is currently not backed by any request coalescing or similar
// this can be a long operation
let inputs = tenant
.gather_size_inputs()
.await
.map_err(ApiError::InternalServerError)?;
let size = inputs.calculate().map_err(ApiError::InternalServerError)?;
let size = if !inputs_only {
Some(inputs.calculate().map_err(ApiError::InternalServerError)?)
} else {
None
};
/// Private response type with the additional "unstable" `inputs` field.
///
@@ -478,7 +492,9 @@ async fn tenant_size_handler(request: Request<Body>) -> Result<Response<Body>, A
#[serde_as(as = "serde_with::DisplayFromStr")]
id: TenantId,
/// Size is a mixture of WAL and logical size, so the unit is bytes.
size: u64,
///
/// Will be none if `?inputs_only=true` was given.
size: Option<u64>,
inputs: crate::tenant::size::ModelInputs,
}
@@ -788,10 +804,11 @@ async fn timeline_checkpoint_handler(request: Request<Body>) -> Result<Response<
}
async fn timeline_download_remote_layers_handler_post(
request: Request<Body>,
mut request: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
let body: DownloadRemoteLayersTaskSpawnRequest = json_request(&mut request).await?;
check_permission(&request, Some(tenant_id))?;
let tenant = mgr::get_tenant(tenant_id, true)
@@ -800,7 +817,7 @@ async fn timeline_download_remote_layers_handler_post(
let timeline = tenant
.get_timeline(timeline_id, true)
.map_err(ApiError::NotFound)?;
match timeline.spawn_download_all_remote_layers().await {
match timeline.spawn_download_all_remote_layers(body).await {
Ok(st) => json_response(StatusCode::ACCEPTED, st),
Err(st) => json_response(StatusCode::CONFLICT, st),
}


@@ -488,7 +488,7 @@ impl Timeline {
let mut buf = self
.get(relsize_key, lsn)
.await
.context("read relation size of {rel:?}")?;
.with_context(|| format!("read relation size of {rel:?}"))?;
let relsize = buf.get_u32_le();
total_size += relsize as u64;
@@ -1405,15 +1405,15 @@ fn slru_segment_key_range(kind: SlruKind, segno: u32) -> Range<Key> {
Key {
field1: 0x01,
field2,
field3: segno,
field4: 0,
field3: 1,
field4: segno,
field5: 0,
field6: 0,
}..Key {
field1: 0x01,
field2,
field3: segno,
field4: 0,
field3: 1,
field4: segno,
field5: 1,
field6: 0,
}


@@ -37,6 +37,17 @@ impl Key {
| self.field6 as i128
}
pub fn from_i128(x: i128) -> Self {
Key {
field1: ((x >> 120) & 0xf) as u8,
field2: ((x >> 104) & 0xFFFF) as u32,
field3: (x >> 72) as u32,
field4: (x >> 40) as u32,
field5: (x >> 32) as u8,
field6: x as u32,
}
}
pub fn next(&self) -> Key {
self.add(1)
}
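
As a sanity check on the bit layout used by the new `from_i128` above, here is a hypothetical round-trip test, not part of the diff. It assumes the pre-existing packing method is named `to_i128` (the truncated hunk above only shows its tail), and notes that `from_i128` masks field1 to 4 bits and field2 to 16 bits, so the round trip is only expected to hold for values within those ranges.

#[test]
fn key_i128_roundtrip_sketch() {
    // Arbitrary key whose field1 fits in 4 bits and field2 in 16 bits.
    let key = Key {
        field1: 0x01,
        field2: 0x1234,
        field3: 0x33333333,
        field4: 0x44444444,
        field5: 0x55,
        field6: 0x00000001,
    };
    // Comparing packed representations avoids assuming Key implements PartialEq.
    let packed = key.to_i128();
    assert_eq!(Key::from_i128(packed).to_i128(), packed);
}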


@@ -183,12 +183,29 @@ pub enum TaskKind {
// associated with one later, after receiving a command from the client.
PageRequestHandler,
// Manages the WAL receiver connection for one timeline. It subscribes to
// events from storage_broker, decides which safekeeper to connect to. It spawns a
// separate WalReceiverConnection task to handle each connection.
/// Manages the WAL receiver connection for one timeline.
/// It subscribes to events from storage_broker and decides which safekeeper to connect to.
/// Once the decision has been made, it establishes the connection using the `tokio-postgres` library.
/// There is at most one connection at any given time.
///
/// That `tokio-postgres` library represents a connection as two objects: a `Client` and a `Connection`.
/// The `Client` object is what library users use to make requests & get responses.
/// Internally, `Client` hands over requests to the `Connection` object.
/// The `Connection` object is responsible for speaking the wire protocol.
///
/// Walreceiver uses its own abstraction called `TaskHandle` to represent the activity of establishing and handling a connection.
/// That abstraction doesn't use `task_mgr` and hence, has no `TaskKind`.
/// The [`WalReceiverManager`] task ensures that this `TaskHandle` task does not outlive the [`WalReceiverManager`] task.
///
/// Once the connection is established, the `TaskHandle` task creates a
/// [`WalReceiverConnection`] task_mgr task that is responsible for polling
/// the `Connection` object.
/// A `CancellationToken` created by the `TaskHandle` task ensures
/// that the [`WalReceiverConnection`] task is cancelled soon after the `TaskHandle` is dropped.
WalReceiverManager,
// Handles a connection to a safekeeper, to stream WAL to a timeline.
/// The task that polls the `tokio-postgres::Connection` object.
/// See the comment on [`WalReceiverManager`].
WalReceiverConnection,
// Garbage collection worker. One per tenant
@@ -220,6 +237,8 @@ pub enum TaskKind {
// task that drives downloading layers
DownloadAllRemoteLayers,
// Task that calculates synthetic size for all active tenants
CalculateSyntheticSize,
}
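
To make the Client/Connection split described in the `WalReceiverManager` comment above more concrete, here is a generic tokio-postgres sketch; the connection string, `NoTls`, and the query are placeholders, and this is not the walreceiver code itself.

use tokio_postgres::NoTls;

async fn client_connection_sketch() -> Result<(), tokio_postgres::Error> {
    // `connect` returns both halves: `client` issues requests, while `connection`
    // owns the socket and must be polled for anything to make progress.
    let (client, connection) =
        tokio_postgres::connect("host=localhost user=postgres", NoTls).await?;

    // In the pageserver this polling is what the WalReceiverConnection task does;
    // a plain application usually just spawns the connection onto the runtime.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    // Requests go through `client`, which internally hands them to `connection`.
    let row = client.query_one("SELECT 1::INT4", &[]).await?;
    assert_eq!(row.get::<_, i32>(0), 1);
    Ok(())
}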
#[derive(Default)]


@@ -38,6 +38,8 @@ use std::path::Path;
use std::path::PathBuf;
use std::process::Command;
use std::process::Stdio;
use std::sync::atomic::AtomicU64;
use std::sync::atomic::Ordering;
use std::sync::Arc;
use std::sync::MutexGuard;
use std::sync::{Mutex, RwLock};
@@ -139,6 +141,7 @@ pub struct Tenant {
/// Cached logical sizes, updated on each [`Tenant::gather_size_inputs`].
cached_logical_sizes: tokio::sync::Mutex<HashMap<(TimelineId, Lsn), u64>>,
cached_synthetic_tenant_size: Arc<AtomicU64>,
}
/// A timeline with some of its files on disk, being initialized.
@@ -185,7 +188,7 @@ impl UninitializedTimeline<'_> {
mut self,
timelines: &mut HashMap<TimelineId, Arc<Timeline>>,
load_layer_map: bool,
launch_wal_receiver: bool,
activate: bool,
) -> anyhow::Result<Arc<Timeline>> {
let timeline_id = self.timeline_id;
let tenant_id = self.owning_tenant.tenant_id;
@@ -218,13 +221,12 @@ impl UninitializedTimeline<'_> {
"Failed to remove uninit mark file for timeline {tenant_id}/{timeline_id}"
)
})?;
new_timeline.set_state(TimelineState::Active);
v.insert(Arc::clone(&new_timeline));
new_timeline.maybe_spawn_flush_loop();
if launch_wal_receiver {
new_timeline.launch_wal_receiver();
if activate {
new_timeline.activate();
}
}
}
@@ -438,8 +440,16 @@ struct RemoteStartupData {
impl Tenant {
/// Yet another helper for timeline initialization.
/// Contains common part for `load_local_timeline` and `load_remote_timeline`
async fn setup_timeline(
/// Contains the common part of `load_local_timeline` and `load_remote_timeline`.
///
/// - Initializes the Timeline struct and inserts it into the tenant's hash map
/// - Scans the local timeline directory for layer files and builds the layer map
/// - Downloads remote index file and adds remote files to the layer map
/// - Schedules remote upload tasks for any files that are present locally but missing from remote storage.
///
/// If the operation fails, the timeline is left in the tenant's hash map in Broken state. On success,
/// it is marked as Active.
async fn timeline_init_and_sync(
&self,
timeline_id: TimelineId,
remote_client: Option<RemoteTimelineClient>,
@@ -482,10 +492,7 @@ impl Tenant {
// But we shouldn't start the walreceiver before we have all the data locally, because a working walreceiver
// will ingest data which may require looking at layers that are not yet available locally
match timeline.initialize_with_lock(&mut timelines_accessor, true, false) {
Ok(initialized_timeline) => {
timelines_accessor.insert(timeline_id, initialized_timeline.clone());
Ok(initialized_timeline)
}
Ok(new_timeline) => new_timeline,
Err(e) => {
error!("Failed to initialize timeline {tenant_id}/{timeline_id}: {e:?}");
// FIXME using None is a hack, it wont hurt, just ugly.
@@ -501,16 +508,14 @@ impl Tenant {
None,
)
.with_context(|| {
format!(
"Failed to crate broken timeline data for {tenant_id}/{timeline_id}"
)
format!("creating broken timeline data for {tenant_id}/{timeline_id}")
})?;
broken_timeline.set_state(TimelineState::Broken);
timelines_accessor.insert(timeline_id, broken_timeline);
Err(e)
return Err(e);
}
}
}?;
};
if self.remote_storage.is_some() {
// Reconcile local state with remote storage, downloading anything that's
@@ -783,7 +788,7 @@ impl Tenant {
// cannot be older than the local one
let local_metadata = None;
self.setup_timeline(
self.timeline_init_and_sync(
timeline_id,
Some(remote_client),
Some(RemoteStartupData {
@@ -1048,7 +1053,7 @@ impl Tenant {
None => None,
};
self.setup_timeline(
self.timeline_init_and_sync(
timeline_id,
remote_client,
remote_startup_data,
@@ -1456,8 +1461,7 @@ impl Tenant {
tasks::start_background_loops(self.tenant_id);
for timeline in not_broken_timelines {
timeline.set_state(TimelineState::Active);
timeline.launch_wal_receiver();
timeline.activate();
}
}
}
@@ -1481,7 +1485,7 @@ impl Tenant {
.values()
.filter(|timeline| timeline.current_state() != TimelineState::Broken);
for timeline in not_broken_timelines {
timeline.set_state(TimelineState::Suspended);
timeline.set_state(TimelineState::Stopping);
}
}
TenantState::Broken => {
@@ -1722,6 +1726,7 @@ impl Tenant {
remote_storage,
state,
cached_logical_sizes: tokio::sync::Mutex::new(HashMap::new()),
cached_synthetic_tenant_size: Arc::new(AtomicU64::new(0)),
}
}
@@ -2359,6 +2364,24 @@ impl Tenant {
size::gather_inputs(self, logical_sizes_at_once, &mut shared_cache).await
}
/// Calculate synthetic tenant size
/// This is periodically called by a background worker.
/// The result is cached in the tenant struct.
#[instrument(skip_all, fields(tenant_id=%self.tenant_id))]
pub async fn calculate_synthetic_size(&self) -> anyhow::Result<u64> {
let inputs = self.gather_size_inputs().await?;
let size = inputs.calculate()?;
self.cached_synthetic_tenant_size
.store(size, Ordering::Relaxed);
Ok(size)
}
pub fn get_cached_synthetic_size(&self) -> u64 {
self.cached_synthetic_tenant_size.load(Ordering::Relaxed)
}
}
fn remove_timeline_and_uninit_mark(timeline_dir: &Path, uninit_mark: &Path) -> anyhow::Result<()> {
@@ -2602,8 +2625,10 @@ where
pub mod harness {
use bytes::{Bytes, BytesMut};
use once_cell::sync::Lazy;
use once_cell::sync::OnceCell;
use std::sync::{Arc, RwLock, RwLockReadGuard, RwLockWriteGuard};
use std::{fs, path::PathBuf};
use utils::logging;
use utils::lsn::Lsn;
use crate::{
@@ -2667,6 +2692,8 @@ pub mod harness {
),
}
static LOG_HANDLE: OnceCell<()> = OnceCell::new();
impl<'a> TenantHarness<'a> {
pub fn create(test_name: &'static str) -> anyhow::Result<Self> {
Self::create_internal(test_name, false)
@@ -2681,6 +2708,10 @@ pub mod harness {
(Some(LOCK.read().unwrap()), None)
};
LOG_HANDLE.get_or_init(|| {
logging::init(logging::LogFormat::Test).expect("Failed to init test logging")
});
let repo_dir = PageServerConf::test_repo_dir(test_name);
let _ = fs::remove_dir_all(&repo_dir);
fs::create_dir_all(&repo_dir)?;

File diff suppressed because it is too large


@@ -0,0 +1,583 @@
use std::collections::BTreeMap;
use std::ops::Range;
use tracing::info;
use super::layer_coverage::LayerCoverageTuple;
/// Layers in this module are identified and indexed by this data.
///
/// This is a helper struct to enable sorting layers by lsn.start.
///
/// These three values are enough to uniquely identify a layer, since
/// a layer is obligated to contain all contents within range, so two
/// deltas (or images) with the same range have identical content.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct LayerKey {
// TODO I use i128 and u64 because it was easy for prototyping,
// testing, and benchmarking. If we can use the Lsn and Key
// types without overhead that would be preferable.
pub key: Range<i128>,
pub lsn: Range<u64>,
pub is_image: bool,
}
impl PartialOrd for LayerKey {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl Ord for LayerKey {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
// NOTE we really care about comparing by lsn.start first
self.lsn
.start
.cmp(&other.lsn.start)
.then(self.lsn.end.cmp(&other.lsn.end))
.then(self.key.start.cmp(&other.key.start))
.then(self.key.end.cmp(&other.key.end))
.then(self.is_image.cmp(&other.is_image))
}
}
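
Because `cmp` orders by lsn.start before anything else, a range scan over a BTreeMap<LayerKey, _> starting at `LayerKey { lsn: x..0, key: 0..0, is_image: false }` visits the layers whose lsn.start is at or above x (assuming non-negative key ranges), which is what `rebuild()` further down relies on. A small hypothetical test, not part of the diff, illustrating the ordering with arbitrary values:

#[test]
fn layer_key_orders_by_lsn_start_first_sketch() {
    // `a` sorts before `b` purely because its lsn range starts earlier, even
    // though its key range starts later and it is a delta rather than an image.
    let a = LayerKey {
        key: 10..20,
        lsn: 100..200,
        is_image: false,
    };
    let b = LayerKey {
        key: 0..5,
        lsn: 150..160,
        is_image: true,
    };
    assert!(a < b);
}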
/// Efficiently queryable layer coverage for each LSN.
///
/// Allows answering layer map queries very efficiently,
/// but doesn't allow retroactive insertion, which is
/// sometimes necessary. See BufferedHistoricLayerCoverage.
pub struct HistoricLayerCoverage<Value> {
/// The latest state
head: LayerCoverageTuple<Value>,
/// All previous states
historic: BTreeMap<u64, LayerCoverageTuple<Value>>,
}
impl<T: Clone> Default for HistoricLayerCoverage<T> {
fn default() -> Self {
Self::new()
}
}
impl<Value: Clone> HistoricLayerCoverage<Value> {
pub fn new() -> Self {
Self {
head: LayerCoverageTuple::default(),
historic: BTreeMap::default(),
}
}
/// Add a layer
///
/// Panics if new layer has older lsn.start than an existing layer.
/// See BufferedHistoricLayerCoverage for a more general insertion method.
pub fn insert(&mut self, layer_key: LayerKey, value: Value) {
// It's only a persistent map, not a retroactive one
if let Some(last_entry) = self.historic.iter().next_back() {
let last_lsn = last_entry.0;
if layer_key.lsn.start < *last_lsn {
panic!("unexpected retroactive insert");
}
}
// Insert into data structure
if layer_key.is_image {
self.head
.image_coverage
.insert(layer_key.key, layer_key.lsn.clone(), value);
} else {
self.head
.delta_coverage
.insert(layer_key.key, layer_key.lsn.clone(), value);
}
// Remember history. Clone is O(1)
self.historic.insert(layer_key.lsn.start, self.head.clone());
}
/// Query at a particular LSN, inclusive
pub fn get_version(&self, lsn: u64) -> Option<&LayerCoverageTuple<Value>> {
match self.historic.range(..=lsn).next_back() {
Some((_, v)) => Some(v),
None => None,
}
}
/// Remove all entries after a certain LSN (inclusive)
pub fn trim(&mut self, begin: &u64) {
self.historic.split_off(begin);
self.head = self
.historic
.iter()
.rev()
.next()
.map(|(_, v)| v.clone())
.unwrap_or_default();
}
}
/// This is the most basic test that demonstrates intended usage.
/// All layers in this test have height 1.
#[test]
fn test_persistent_simple() {
let mut map = HistoricLayerCoverage::<String>::new();
map.insert(
LayerKey {
key: 0..5,
lsn: 100..101,
is_image: true,
},
"Layer 1".to_string(),
);
map.insert(
LayerKey {
key: 3..9,
lsn: 110..111,
is_image: true,
},
"Layer 2".to_string(),
);
map.insert(
LayerKey {
key: 5..6,
lsn: 120..121,
is_image: true,
},
"Layer 3".to_string(),
);
// After Layer 1 insertion
let version = map.get_version(105).unwrap();
assert_eq!(version.image_coverage.query(1), Some("Layer 1".to_string()));
assert_eq!(version.image_coverage.query(4), Some("Layer 1".to_string()));
// After Layer 2 insertion
let version = map.get_version(115).unwrap();
assert_eq!(version.image_coverage.query(4), Some("Layer 2".to_string()));
assert_eq!(version.image_coverage.query(8), Some("Layer 2".to_string()));
assert_eq!(version.image_coverage.query(11), None);
// After Layer 3 insertion
let version = map.get_version(125).unwrap();
assert_eq!(version.image_coverage.query(4), Some("Layer 2".to_string()));
assert_eq!(version.image_coverage.query(5), Some("Layer 3".to_string()));
assert_eq!(version.image_coverage.query(7), Some("Layer 2".to_string()));
}
/// Cover simple off-by-one edge cases
#[test]
fn test_off_by_one() {
let mut map = HistoricLayerCoverage::<String>::new();
map.insert(
LayerKey {
key: 3..5,
lsn: 100..110,
is_image: true,
},
"Layer 1".to_string(),
);
// Check different LSNs
let version = map.get_version(99);
assert!(version.is_none());
let version = map.get_version(100).unwrap();
assert_eq!(version.image_coverage.query(4), Some("Layer 1".to_string()));
let version = map.get_version(110).unwrap();
assert_eq!(version.image_coverage.query(4), Some("Layer 1".to_string()));
// Check different keys
let version = map.get_version(105).unwrap();
assert_eq!(version.image_coverage.query(2), None);
assert_eq!(version.image_coverage.query(3), Some("Layer 1".to_string()));
assert_eq!(version.image_coverage.query(4), Some("Layer 1".to_string()));
assert_eq!(version.image_coverage.query(5), None);
}
/// Cover edge cases where layers begin or end on the same key
#[test]
fn test_key_collision() {
let mut map = HistoricLayerCoverage::<String>::new();
map.insert(
LayerKey {
key: 3..5,
lsn: 100..110,
is_image: true,
},
"Layer 10".to_string(),
);
map.insert(
LayerKey {
key: 5..8,
lsn: 100..110,
is_image: true,
},
"Layer 11".to_string(),
);
map.insert(
LayerKey {
key: 3..4,
lsn: 200..210,
is_image: true,
},
"Layer 20".to_string(),
);
// Check after layer 11
let version = map.get_version(105).unwrap();
assert_eq!(version.image_coverage.query(2), None);
assert_eq!(
version.image_coverage.query(3),
Some("Layer 10".to_string())
);
assert_eq!(
version.image_coverage.query(5),
Some("Layer 11".to_string())
);
assert_eq!(
version.image_coverage.query(7),
Some("Layer 11".to_string())
);
assert_eq!(version.image_coverage.query(8), None);
// Check after layer 20
let version = map.get_version(205).unwrap();
assert_eq!(version.image_coverage.query(2), None);
assert_eq!(
version.image_coverage.query(3),
Some("Layer 20".to_string())
);
assert_eq!(
version.image_coverage.query(5),
Some("Layer 11".to_string())
);
assert_eq!(
version.image_coverage.query(7),
Some("Layer 11".to_string())
);
assert_eq!(version.image_coverage.query(8), None);
}
/// Test when rectangles have nontrivial height and possibly overlap
#[test]
fn test_persistent_overlapping() {
let mut map = HistoricLayerCoverage::<String>::new();
// Add 3 key-disjoint layers with varying LSN ranges
map.insert(
LayerKey {
key: 1..2,
lsn: 100..200,
is_image: true,
},
"Layer 1".to_string(),
);
map.insert(
LayerKey {
key: 4..5,
lsn: 110..200,
is_image: true,
},
"Layer 2".to_string(),
);
map.insert(
LayerKey {
key: 7..8,
lsn: 120..300,
is_image: true,
},
"Layer 3".to_string(),
);
// Add wide and short layer
map.insert(
LayerKey {
key: 0..9,
lsn: 130..199,
is_image: true,
},
"Layer 4".to_string(),
);
// Add wide layer taller than some
map.insert(
LayerKey {
key: 0..9,
lsn: 140..201,
is_image: true,
},
"Layer 5".to_string(),
);
// Add wide layer taller than all
map.insert(
LayerKey {
key: 0..9,
lsn: 150..301,
is_image: true,
},
"Layer 6".to_string(),
);
// After layer 4 insertion
let version = map.get_version(135).unwrap();
assert_eq!(version.image_coverage.query(0), Some("Layer 4".to_string()));
assert_eq!(version.image_coverage.query(1), Some("Layer 1".to_string()));
assert_eq!(version.image_coverage.query(2), Some("Layer 4".to_string()));
assert_eq!(version.image_coverage.query(4), Some("Layer 2".to_string()));
assert_eq!(version.image_coverage.query(5), Some("Layer 4".to_string()));
assert_eq!(version.image_coverage.query(7), Some("Layer 3".to_string()));
assert_eq!(version.image_coverage.query(8), Some("Layer 4".to_string()));
// After layer 5 insertion
let version = map.get_version(145).unwrap();
assert_eq!(version.image_coverage.query(0), Some("Layer 5".to_string()));
assert_eq!(version.image_coverage.query(1), Some("Layer 5".to_string()));
assert_eq!(version.image_coverage.query(2), Some("Layer 5".to_string()));
assert_eq!(version.image_coverage.query(4), Some("Layer 5".to_string()));
assert_eq!(version.image_coverage.query(5), Some("Layer 5".to_string()));
assert_eq!(version.image_coverage.query(7), Some("Layer 3".to_string()));
assert_eq!(version.image_coverage.query(8), Some("Layer 5".to_string()));
// After layer 6 insertion
let version = map.get_version(155).unwrap();
assert_eq!(version.image_coverage.query(0), Some("Layer 6".to_string()));
assert_eq!(version.image_coverage.query(1), Some("Layer 6".to_string()));
assert_eq!(version.image_coverage.query(2), Some("Layer 6".to_string()));
assert_eq!(version.image_coverage.query(4), Some("Layer 6".to_string()));
assert_eq!(version.image_coverage.query(5), Some("Layer 6".to_string()));
assert_eq!(version.image_coverage.query(7), Some("Layer 6".to_string()));
assert_eq!(version.image_coverage.query(8), Some("Layer 6".to_string()));
}
/// Wrapper for HistoricLayerCoverage that allows us to hack around the lack
/// of support for retroactive insertion by rebuilding the map since the
/// change.
///
/// Why is this needed? We most often insert new layers with newer LSNs,
/// but during compaction we create layers with non-latest LSN, and during
/// GC we delete historic layers.
///
/// Even though rebuilding is an expensive (N log N) solution to the problem,
/// it's not critical since we do something equally expensive just to decide
/// whether or not to create new image layers.
/// TODO It's not expensive but it's not great to hold a layer map write lock
/// for that long.
///
/// If this becomes an actual bottleneck, one solution would be to build a
/// segment tree that holds PersistentLayerMaps. Though this would mean that
/// we take an additional log(N) performance hit for queries, which will probably
/// still be more critical.
///
/// See this for more on persistent and retroactive techniques:
/// https://www.youtube.com/watch?v=WqCWghETNDc&t=581s
pub struct BufferedHistoricLayerCoverage<Value> {
/// A persistent layer map that we rebuild when we need to retroactively update
historic_coverage: HistoricLayerCoverage<Value>,
/// We buffer insertion into the PersistentLayerMap to decrease the number of rebuilds.
buffer: BTreeMap<LayerKey, Option<Value>>,
/// All current layers. This is not used for search. Only to make rebuilds easier.
layers: BTreeMap<LayerKey, Value>,
}
impl<T: std::fmt::Debug> std::fmt::Debug for BufferedHistoricLayerCoverage<T> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("RetroactiveLayerMap")
.field("buffer", &self.buffer)
.field("layers", &self.layers)
.finish()
}
}
impl<T: Clone> Default for BufferedHistoricLayerCoverage<T> {
fn default() -> Self {
Self::new()
}
}
impl<Value: Clone> BufferedHistoricLayerCoverage<Value> {
pub fn new() -> Self {
Self {
historic_coverage: HistoricLayerCoverage::<Value>::new(),
buffer: BTreeMap::new(),
layers: BTreeMap::new(),
}
}
pub fn insert(&mut self, layer_key: LayerKey, value: Value) {
self.buffer.insert(layer_key, Some(value));
}
pub fn remove(&mut self, layer_key: LayerKey) {
self.buffer.insert(layer_key, None);
}
pub fn rebuild(&mut self) {
// Find the first LSN that needs to be rebuilt
let rebuild_since: u64 = match self.buffer.iter().next() {
Some((LayerKey { lsn, .. }, _)) => lsn.start,
None => return, // No need to rebuild if buffer is empty
};
// Apply buffered updates to self.layers
let num_updates = self.buffer.len();
self.buffer.retain(|layer_key, layer| {
match layer {
Some(l) => {
self.layers.insert(layer_key.clone(), l.clone());
}
None => {
self.layers.remove(layer_key);
}
};
false
});
// Rebuild
let mut num_inserted = 0;
self.historic_coverage.trim(&rebuild_since);
for (layer_key, layer) in self.layers.range(
LayerKey {
lsn: rebuild_since..0,
key: 0..0,
is_image: false,
}..,
) {
self.historic_coverage
.insert(layer_key.clone(), layer.clone());
num_inserted += 1;
}
// TODO maybe only warn if ratio is at least 10
info!(
"Rebuilt layer map. Did {} insertions to process a batch of {} updates.",
num_inserted, num_updates,
)
}
/// Iterate all the layers
pub fn iter(&self) -> impl '_ + Iterator<Item = Value> {
// NOTE we can actually perform this without rebuilding,
// but it's not necessary for now.
if !self.buffer.is_empty() {
panic!("rebuild pls")
}
self.layers.values().cloned()
}
/// Return a reference to a queryable map, assuming all updates
/// have already been processed using self.rebuild()
pub fn get(&self) -> anyhow::Result<&HistoricLayerCoverage<Value>> {
// NOTE we error here instead of implicitly rebuilding because
// rebuilding is somewhat expensive.
// TODO maybe implicitly rebuild and log/sentry an error?
if !self.buffer.is_empty() {
anyhow::bail!("rebuild required")
}
Ok(&self.historic_coverage)
}
}
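A minimal usage sketch (in the spirit of the tests below; keys, LSNs and names are illustrative): updates are staged with insert/remove, and rebuild() trims the historic coverage back to the earliest buffered LSN before re-inserting the affected layers.
let mut map = BufferedHistoricLayerCoverage::new();
map.insert(
    LayerKey { key: 0..10, lsn: 100..200, is_image: true },
    "A".to_string(),
);
map.insert(
    LayerKey { key: 0..10, lsn: 300..400, is_image: true },
    "B".to_string(),
);
map.rebuild();
// Retroactive change: remove "A". rebuild_since is 100, so the historic
// coverage is trimmed back to LSN 100 and only "B" is re-inserted.
map.remove(LayerKey { key: 0..10, lsn: 100..200, is_image: true });
map.rebuild();
let version = map.get().unwrap().get_version(350).unwrap();
assert_eq!(version.image_coverage.query(5), Some("B".to_string()));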
#[test]
fn test_retroactive_regression_1() {
let mut map = BufferedHistoricLayerCoverage::new();
map.insert(
LayerKey {
key: 0..21267647932558653966460912964485513215,
lsn: 23761336..23761457,
is_image: false,
},
"sdfsdfs".to_string(),
);
map.rebuild();
let version = map.get().unwrap().get_version(23761457).unwrap();
assert_eq!(
version.delta_coverage.query(100),
Some("sdfsdfs".to_string())
);
}
#[test]
fn test_retroactive_simple() {
let mut map = BufferedHistoricLayerCoverage::new();
// Append some images in increasing LSN order
map.insert(
LayerKey {
key: 0..5,
lsn: 100..101,
is_image: true,
},
"Image 1".to_string(),
);
map.insert(
LayerKey {
key: 3..9,
lsn: 110..111,
is_image: true,
},
"Image 2".to_string(),
);
map.insert(
LayerKey {
key: 4..6,
lsn: 120..121,
is_image: true,
},
"Image 3".to_string(),
);
map.insert(
LayerKey {
key: 8..9,
lsn: 120..121,
is_image: true,
},
"Image 4".to_string(),
);
// Add a delta layer out of order
map.insert(
LayerKey {
key: 2..5,
lsn: 105..106,
is_image: true,
},
"Delta 1".to_string(),
);
// Rebuild so we can start querying
map.rebuild();
// Query key 4
let version = map.get().unwrap().get_version(90);
assert!(version.is_none());
let version = map.get().unwrap().get_version(102).unwrap();
assert_eq!(version.image_coverage.query(4), Some("Image 1".to_string()));
let version = map.get().unwrap().get_version(107).unwrap();
assert_eq!(version.image_coverage.query(4), Some("Delta 1".to_string()));
let version = map.get().unwrap().get_version(115).unwrap();
assert_eq!(version.image_coverage.query(4), Some("Image 2".to_string()));
let version = map.get().unwrap().get_version(125).unwrap();
assert_eq!(version.image_coverage.query(4), Some("Image 3".to_string()));
// Remove Image 3
map.remove(LayerKey {
key: 4..6,
lsn: 120..121,
is_image: true,
});
map.rebuild();
// Check deletion worked
let version = map.get().unwrap().get_version(125).unwrap();
assert_eq!(version.image_coverage.query(4), Some("Image 2".to_string()));
assert_eq!(version.image_coverage.query(8), Some("Image 4".to_string()));
}

View File

@@ -0,0 +1,229 @@
use std::ops::Range;
use im::OrdMap;
use rpds::RedBlackTreeMapSync;
/// Data structure that can efficiently:
/// - find the latest layer by lsn.end at a given key
/// - iterate the latest layers in a key range
/// - insert layers in non-decreasing lsn.start order
///
/// The struct is parameterized over Value for easier
/// testing, but in practice it's some sort of layer.
pub struct LayerCoverage<Value> {
/// For every change in coverage (as we sweep the key space)
/// we store (lsn.end, value).
///
/// We use an immutable/persistent tree so that we can keep historic
/// versions of this coverage without cloning the whole thing and
/// incurring quadratic memory cost. See HistoricLayerCoverage.
///
/// We use the Sync version of the map because we want Self to
/// be Sync. Using nonsync might be faster, if we can work with
/// that.
nodes: RedBlackTreeMapSync<i128, Option<(u64, Value)>>,
im_nodes: OrdMap<i128, Option<(u64, Value)>>,
}
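For illustration (not part of this patch), this is the property the persistent map buys us: cloning is O(1) and the clone is unaffected by later mutations, which is what makes keeping one coverage snapshot per LSN affordable.
let mut v1: RedBlackTreeMapSync<i128, &str> = RedBlackTreeMapSync::default();
v1.insert_mut(0, "Layer 1");
let snapshot = v1.clone(); // O(1): shares structure with v1
v1.insert_mut(10, "Layer 2"); // does not affect `snapshot`
assert_eq!(snapshot.get(&10), None);
assert_eq!(v1.get(&10), Some(&"Layer 2"));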
impl<T: Clone> Default for LayerCoverage<T> {
fn default() -> Self {
Self::new()
}
}
impl<Value: Clone> LayerCoverage<Value> {
pub fn new() -> Self {
Self {
nodes: RedBlackTreeMapSync::default(),
im_nodes: OrdMap::default(),
}
}
/// Helper function to subdivide the key range without changing any values
///
/// Complexity: O(log N)
fn add_node(&mut self, key: i128) {
let value = match self.nodes.range(..=key).last() {
Some((_, Some(v))) => Some(v.clone()),
Some((_, None)) => None,
None => None,
};
self.nodes.insert_mut(key, value);
let im_value = match self.im_nodes.range(..=key).last() {
Some((_, Some(v))) => Some(v.clone()),
Some((_, None)) => None,
None => None,
};
self.im_nodes.remove(&key);
self.im_nodes.insert(key, im_value);
}
/// Insert a layer.
///
/// Complexity: worst case O(N), in practice O(log N). See NOTE in implementation.
pub fn insert(&mut self, key: Range<i128>, lsn: Range<u64>, value: Value) {
// Add nodes at endpoints
//
// NOTE The order of lines is important. We add nodes at the start
// and end of the key range **before updating any nodes** in order
// to pin down the current coverage outside of the relevant key range.
// Only the coverage inside the layer's key range should change.
self.add_node(key.start);
self.add_node(key.end);
// Raise the height where necessary
//
// NOTE This loop is worst case O(N), but amortized O(log N) in the special
// case when rectangles have no height. In practice I don't think we'll see
// the kind of layer intersections needed to trigger O(N) behavior. The worst
// case is N/2 horizontal layers overlapped with N/2 vertical layers in a
// grid pattern.
let mut to_update = Vec::new();
let mut to_remove = Vec::new();
let mut prev_covered = false;
for (k, node) in self.nodes.range(key.clone()) {
let needs_cover = match node {
None => true,
Some((h, _)) => h < &lsn.end,
};
if needs_cover {
match prev_covered {
true => to_remove.push(*k),
false => to_update.push(*k),
}
}
prev_covered = needs_cover;
}
if !prev_covered {
to_remove.push(key.end);
}
for k in to_update {
self.nodes.insert_mut(k, Some((lsn.end, value.clone())));
}
for k in to_remove {
self.nodes.remove_mut(&k);
}
let mut to_update = Vec::new();
let mut to_remove = Vec::new();
let mut prev_covered = false;
for (k, node) in self.im_nodes.range(key.clone()) {
let needs_cover = match node {
None => true,
Some((h, _)) => h < &lsn.end,
};
if needs_cover {
match prev_covered {
true => to_remove.push(*k),
false => to_update.push(*k),
}
}
prev_covered = needs_cover;
}
if !prev_covered {
to_remove.push(key.end);
}
for k in to_update {
self.im_nodes.remove(&k);
self.im_nodes.insert(k, Some((lsn.end, value.clone())));
}
for k in to_remove {
self.im_nodes.remove(&k);
}
}
fn get_key_1(&self, key: i128) -> Option<u64> {
self.im_nodes
.get_prev(&key)?
.1
.as_ref()
.map(|(k, _)| k.clone())
}
fn get_key_2(&self, key: i128) -> Option<u64> {
self.im_nodes
.range(..=key)
.rev()
.next()?
.1
.as_ref()
.map(|(k, _)| k.clone())
}
/// Get the latest (by lsn.end) layer at a given key
///
/// Complexity: O(log N)
pub fn query(&self, key: i128) -> Option<Value> {
let k1 = self.get_key_1(key);
let k2 = self.get_key_2(key);
assert_eq!(k1, k2);
// self.im_nodes
// .get_prev(&key)?
// .1
// .as_ref()
// .map(|(_, v)| v.clone())
self.im_nodes
.range(..=key)
.rev()
.next()?
.1
.as_ref()
.map(|(_, v)| v.clone())
// self.nodes
// .range(..=key)
// .rev()
// .next()?
// .1
// .as_ref()
// .map(|(_, v)| v.clone())
}
/// Iterate the changes in layer coverage in a given range. You will likely
/// want to start with self.query(key.start), and then follow up with self.range
///
/// Complexity: O(log N + result_size)
pub fn range(&self, key: Range<i128>) -> impl '_ + Iterator<Item = (i128, Option<Value>)> {
self.nodes
.range(key)
.map(|(k, v)| (*k, v.as_ref().map(|x| x.1.clone())))
}
/// O(1) clone
pub fn clone(&self) -> Self {
Self {
nodes: self.nodes.clone(),
im_nodes: self.im_nodes.clone(),
}
}
}
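A small worked example of the coverage representation (illustrative only): after each insert, `nodes` stores one entry per point where the coverage changes, mapping it to the covering layer's lsn.end or to None where nothing is covered.
let mut c: LayerCoverage<String> = LayerCoverage::new();
c.insert(0..10, 100..200, "A".to_string());
// nodes: {0 => Some((200, "A")), 10 => None}
c.insert(5..15, 100..300, "B".to_string());
// nodes: {0 => Some((200, "A")), 5 => Some((300, "B")), 15 => None}
assert_eq!(c.query(3), Some("A".to_string()));
assert_eq!(c.query(7), Some("B".to_string()));
assert_eq!(c.query(20), None);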
/// Image and delta coverage at a specific LSN.
pub struct LayerCoverageTuple<Value> {
pub image_coverage: LayerCoverage<Value>,
pub delta_coverage: LayerCoverage<Value>,
}
impl<T: Clone> Default for LayerCoverageTuple<T> {
fn default() -> Self {
Self {
image_coverage: LayerCoverage::default(),
delta_coverage: LayerCoverage::default(),
}
}
}
impl<Value: Clone> LayerCoverageTuple<Value> {
pub fn clone(&self) -> Self {
Self {
image_coverage: self.image_coverage.clone(),
delta_coverage: self.delta_coverage.clone(),
}
}
}

View File

@@ -135,7 +135,7 @@
//! - Initiate upload queue with that [`IndexPart`].
//! - Reschedule all lost operations by comparing the local filesystem state
//! and remote state as per [`IndexPart`]. This is done in
//! [`Timeline::setup_timeline`] and [`Timeline::reconcile_with_remote`].
//! [`Timeline::timeline_init_and_sync`] and [`Timeline::reconcile_with_remote`].
//!
//! Note that if we crash during file deletion between the index update
//! that removes the file from the list of files, and deleting the remote file,

View File

@@ -23,7 +23,13 @@ use tracing::*;
pub struct ModelInputs {
updates: Vec<Update>,
retention_period: u64,
/// Relevant lsns per timeline.
///
/// This field is not required when deserializing (deserialization is mostly used in tests). The
/// LSNs explain the outcome (`updates`) but are not needed for the size calculation.
#[serde_as(as = "HashMap<serde_with::DisplayFromStr, _>")]
#[serde(default)]
timeline_inputs: HashMap<TimelineId, TimelineInputs>,
}
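For example, with `#[serde(default)]` a serialized document that omits `timeline_inputs` still deserializes; a minimal sketch, assuming it sits next to the other unit tests in this module:
let doc = r#"{"updates":[],"retention_period":0}"#;
let inputs: ModelInputs = serde_json::from_str(doc).unwrap();
assert!(inputs.timeline_inputs.is_empty());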
@@ -32,6 +38,8 @@ pub struct ModelInputs {
#[serde_with::serde_as]
#[derive(Debug, serde::Serialize, serde::Deserialize)]
struct TimelineInputs {
#[serde_as(as = "serde_with::DisplayFromStr")]
ancestor_lsn: Lsn,
#[serde_as(as = "serde_with::DisplayFromStr")]
last_record: Lsn,
#[serde_as(as = "serde_with::DisplayFromStr")]
@@ -44,6 +52,116 @@ struct TimelineInputs {
next_gc_cutoff: Lsn,
}
// Adjust BranchFrom sorting so that we always process ancestors
// before descendants. This is needed to correctly calculate the size of
// descendant timelines.
//
// Note that we may have multiple BranchFroms at the same LSN, so we
// need to sort them in the tree order.
//
// see updates_sort_with_branches_at_same_lsn test below
fn sort_updates_in_tree_order(updates: Vec<Update>) -> anyhow::Result<Vec<Update>> {
let mut sorted_updates = Vec::with_capacity(updates.len());
let mut known_timelineids = HashSet::new();
let mut i = 0;
while i < updates.len() {
let curr_upd = &updates[i];
if let Command::BranchFrom(parent_id) = curr_upd.command {
let parent_id = match parent_id {
Some(parent_id) if known_timelineids.contains(&parent_id) => {
// we have already processed ancestor
// process this BranchFrom Update normally
known_timelineids.insert(curr_upd.timeline_id);
sorted_updates.push(*curr_upd);
i += 1;
continue;
}
None => {
known_timelineids.insert(curr_upd.timeline_id);
sorted_updates.push(*curr_upd);
i += 1;
continue;
}
Some(parent_id) => parent_id,
};
let mut j = i;
// we have not processed ancestor yet.
// there is a chance that it is at the same Lsn
if !known_timelineids.contains(&parent_id) {
let mut curr_lsn_branchfroms: HashMap<TimelineId, Vec<(TimelineId, usize)>> =
HashMap::new();
// inspect all branchpoints at the same lsn
while j < updates.len() && updates[j].lsn == curr_upd.lsn {
let lookahead_upd = &updates[j];
j += 1;
if let Command::BranchFrom(lookahead_parent_id) = lookahead_upd.command {
match lookahead_parent_id {
Some(lookahead_parent_id)
if !known_timelineids.contains(&lookahead_parent_id) =>
{
// we have not processed ancestor yet
// store it for later
let es =
curr_lsn_branchfroms.entry(lookahead_parent_id).or_default();
es.push((lookahead_upd.timeline_id, j));
}
_ => {
// we have already processed ancestor
// process this BranchFrom Update normally
known_timelineids.insert(lookahead_upd.timeline_id);
sorted_updates.push(*lookahead_upd);
}
}
}
}
// process BranchFroms in the tree order
// check that we don't have a cycle if some entry is an orphan
// (this should not happen, but better to be safe)
let mut processed_some_entry = true;
while processed_some_entry {
processed_some_entry = false;
curr_lsn_branchfroms.retain(|parent_id, branchfroms| {
if known_timelineids.contains(parent_id) {
for (timeline_id, j) in branchfroms {
known_timelineids.insert(*timeline_id);
sorted_updates.push(updates[*j - 1]);
}
processed_some_entry = true;
false
} else {
true
}
});
}
if !curr_lsn_branchfroms.is_empty() {
// orphans are expected to be rare and transient between tenant reloads
// for example, a broken ancestor without the child branch being broken.
anyhow::bail!(
"orphan branch(es) detected in BranchFroms: {curr_lsn_branchfroms:?}"
);
}
}
assert!(j > i);
i = j;
} else {
// not a BranchFrom, keep the same order
sorted_updates.push(*curr_upd);
i += 1;
}
}
Ok(sorted_updates)
}
/// Gathers the inputs for the tenant sizing model.
///
/// Tenant size does not consider the latest state, but only the state until next_gc_cutoff, which
@@ -68,19 +186,20 @@ pub(super) async fn gather_inputs(
// our advantage with `?` error handling.
let mut joinset = tokio::task::JoinSet::new();
let timelines = tenant
// refresh is needed to update gc related pitr_cutoff and horizon_cutoff
tenant
.refresh_gc_info()
.await
.context("Failed to refresh gc_info before gathering inputs")?;
let timelines = tenant.list_timelines();
if timelines.is_empty() {
// All timelines are below tenant's gc_horizon; alternative would be to use
// Tenant::list_timelines but then those gc_info's would not be updated yet, possibly
// missing GcInfo::retain_lsns or having obsolete values for cutoff's.
// perhaps the tenant has just been created, and as such doesn't have any data yet
return Ok(ModelInputs {
updates: vec![],
retention_period: 0,
timeline_inputs: HashMap::new(),
timeline_inputs: HashMap::default(),
});
}
@@ -91,13 +210,25 @@ pub(super) async fn gather_inputs(
let mut updates = Vec::new();
// record the per timeline values used to determine `retention_period`
// record the per timeline values useful to debug the model inputs, also used to track
// ancestor_lsn without keeping a hold of Timeline
let mut timeline_inputs = HashMap::with_capacity(timelines.len());
// used to determine the `retention_period` for the size model
let mut max_cutoff_distance = None;
// mapping from (TimelineId, Lsn) => if this branch point has been handled already via
// GcInfo::retain_lsns or if it needs to have its logical_size calculated.
let mut referenced_branch_froms = HashMap::<(TimelineId, Lsn), bool>::new();
for timeline in timelines {
if !timeline.is_active() {
anyhow::bail!(
"timeline {} is not active, cannot calculate tenant_size now",
timeline.timeline_id
);
}
let last_record_lsn = timeline.get_last_record_lsn();
let (interesting_lsns, horizon_cutoff, pitr_cutoff, next_gc_cutoff) = {
@@ -163,13 +294,30 @@ pub(super) async fn gather_inputs(
// all timelines branch from something, because it might be impossible to pinpoint
// which is the tenant_size_model's "default" branch.
let ancestor_lsn = timeline.get_ancestor_lsn();
updates.push(Update {
lsn: timeline.get_ancestor_lsn(),
lsn: ancestor_lsn,
command: Command::BranchFrom(timeline.get_ancestor_timeline_id()),
timeline_id: timeline.timeline_id,
});
if let Some(parent_timeline_id) = timeline.get_ancestor_timeline_id() {
// refresh_gc_info will update branchpoints and pitr_cutoff, but only for branches
// which are over gc_horizon. For example, a "main" branch which never received any
// updates apart from initdb will not have branch points recorded.
referenced_branch_froms
.entry((parent_timeline_id, timeline.get_ancestor_lsn()))
.or_default();
}
for (lsn, _kind) in &interesting_lsns {
// mark this visited so we don't need to re-process this parent
*referenced_branch_froms
.entry((timeline.timeline_id, *lsn))
.or_default() = true;
if let Some(size) = logical_size_cache.get(&(timeline.timeline_id, *lsn)) {
updates.push(Update {
lsn: *lsn,
@@ -185,22 +333,10 @@ pub(super) async fn gather_inputs(
}
}
// all timelines also have an end point if they have made any progress
if last_record_lsn > timeline.get_ancestor_lsn()
&& !interesting_lsns
.iter()
.any(|(lsn, _)| lsn == &last_record_lsn)
{
updates.push(Update {
lsn: last_record_lsn,
command: Command::EndOfBranch,
timeline_id: timeline.timeline_id,
});
}
timeline_inputs.insert(
timeline.timeline_id,
TimelineInputs {
ancestor_lsn,
last_record: last_record_lsn,
// this is not used above, because it might not have been updated recently enough
latest_gc_cutoff: *timeline.get_latest_gc_cutoff_lsn(),
@@ -211,6 +347,80 @@ pub(super) async fn gather_inputs(
);
}
// iterate over discovered branch points and make sure we are getting logical sizes at those
// points.
for ((timeline_id, lsn), handled) in referenced_branch_froms.iter() {
if *handled {
continue;
}
let timeline_id = *timeline_id;
let lsn = *lsn;
match timeline_inputs.get(&timeline_id) {
Some(inputs) if inputs.ancestor_lsn == lsn => {
// we don't need an update at this branch point which is also point where
// timeline_id branch was branched from.
continue;
}
Some(_) => {}
None => {
// we should have this because we have iterated through all of the timelines
anyhow::bail!("missing timeline_input for {timeline_id}")
}
}
if let Some(size) = logical_size_cache.get(&(timeline_id, lsn)) {
updates.push(Update {
lsn,
timeline_id,
command: Command::Update(*size),
});
needed_cache.insert((timeline_id, lsn));
} else {
let timeline = tenant
.get_timeline(timeline_id, false)
.context("find referenced ancestor timeline")?;
let parallel_size_calcs = Arc::clone(limit);
joinset.spawn(calculate_logical_size(
parallel_size_calcs,
timeline.clone(),
lsn,
));
if let Some(parent_id) = timeline.get_ancestor_timeline_id() {
// we should not find new ones because we iterated through all of the tenant's timelines
anyhow::ensure!(
timeline_inputs.contains_key(&parent_id),
"discovered new timeline {parent_id} (parent of {timeline_id})"
);
}
};
}
// finally add in EndOfBranch for all timelines where their last_record_lsn is not a branch
// point. this is needed by the model.
for (timeline_id, inputs) in timeline_inputs.iter() {
let lsn = inputs.last_record;
if referenced_branch_froms.contains_key(&(*timeline_id, lsn)) {
// this means that the (timeline_id, last_record_lsn) represents a branch point
// we do not want to add EndOfBranch updates for these points because it doesn't fit
// into the current tenant_size_model.
continue;
}
if lsn > inputs.ancestor_lsn {
// all timelines also have an end point if they have made any progress
updates.push(Update {
lsn,
command: Command::EndOfBranch,
timeline_id: *timeline_id,
});
}
}
let mut have_any_error = false;
while let Some(res) = joinset.join_next().await {
@@ -267,8 +477,13 @@ pub(super) async fn gather_inputs(
// for branch points, which come as multiple updates at the same LSN, the Command::Update
// is needed before a branch is made out of that branch Command::BranchFrom. this is
// handled by the variant order in `Command`.
//
updates.sort_unstable();
// And another sort to handle Command::BranchFrom ordering
// in case when there are multiple branches at the same LSN.
let sorted_updates = sort_updates_in_tree_order(updates)?;
let retention_period = match max_cutoff_distance {
Some(max) => max.0,
None => {
@@ -277,7 +492,7 @@ pub(super) async fn gather_inputs(
};
Ok(ModelInputs {
updates,
updates: sorted_updates,
retention_period,
timeline_inputs,
})
@@ -295,21 +510,23 @@ impl ModelInputs {
command: op,
timeline_id,
} = update;
let Lsn(now) = *lsn;
match op {
Command::Update(sz) => {
storage.insert_point(&Some(*timeline_id), "".into(), now, Some(*sz));
storage.insert_point(&Some(*timeline_id), "".into(), now, Some(*sz))?;
}
Command::EndOfBranch => {
storage.insert_point(&Some(*timeline_id), "".into(), now, None);
storage.insert_point(&Some(*timeline_id), "".into(), now, None)?;
}
Command::BranchFrom(parent) => {
storage.branch(parent, Some(*timeline_id));
// This branch command may fail if it cannot find a parent to branch from.
storage.branch(parent, Some(*timeline_id))?;
}
}
}
Ok(storage.calculate(self.retention_period).total_children())
Ok(storage.calculate(self.retention_period)?.total_children())
}
}
@@ -372,6 +589,7 @@ async fn calculate_logical_size(
let size_res = timeline
.spawn_ondemand_logical_size_calculation(lsn)
.instrument(info_span!("spawn_ondemand_logical_size_calculation"))
.await?;
Ok(TimelineAtLsnSizeResult(timeline, lsn, size_res))
}
@@ -457,9 +675,146 @@ fn updates_sort() {
fn verify_size_for_multiple_branches() {
// this is generated from integration test test_tenant_size_with_multiple_branches, but this way
// it has the stable lsn's
let doc = r#"{"updates":[{"lsn":"0/0","command":{"branch_from":null},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/176FA40","command":{"update":25763840},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/176FA40","command":{"branch_from":"cd9d9409c216e64bf580904facedb01b"},"timeline_id":"10b532a550540bc15385eac4edde416a"},{"lsn":"0/1819818","command":{"update":26075136},"timeline_id":"10b532a550540bc15385eac4edde416a"},{"lsn":"0/18B5E40","command":{"update":26427392},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/18D3DF0","command":{"update":26492928},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/18D3DF0","command":{"branch_from":"cd9d9409c216e64bf580904facedb01b"},"timeline_id":"230fc9d756f7363574c0d66533564dcc"},{"lsn":"0/220F438","command":{"update":25239552},"timeline_id":"230fc9d756f7363574c0d66533564dcc"}],"retention_period":131072,"timeline_inputs":{"cd9d9409c216e64bf580904facedb01b":{"last_record":"0/18D5E40","latest_gc_cutoff":"0/169ACF0","horizon_cutoff":"0/18B5E40","pitr_cutoff":"0/18B5E40","next_gc_cutoff":"0/18B5E40"},"10b532a550540bc15385eac4edde416a":{"last_record":"0/1839818","latest_gc_cutoff":"0/169ACF0","horizon_cutoff":"0/1819818","pitr_cutoff":"0/1819818","next_gc_cutoff":"0/1819818"},"230fc9d756f7363574c0d66533564dcc":{"last_record":"0/222F438","latest_gc_cutoff":"0/169ACF0","horizon_cutoff":"0/220F438","pitr_cutoff":"0/220F438","next_gc_cutoff":"0/220F438"}}}"#;
//
// timeline_inputs has been left out, because it explains the inputs but doesn't participate
// in further size calculations.
let doc = r#"{"updates":[{"lsn":"0/0","command":{"branch_from":null},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/176FA40","command":{"update":25763840},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/176FA40","command":{"branch_from":"cd9d9409c216e64bf580904facedb01b"},"timeline_id":"10b532a550540bc15385eac4edde416a"},{"lsn":"0/1819818","command":{"update":26075136},"timeline_id":"10b532a550540bc15385eac4edde416a"},{"lsn":"0/18B5E40","command":{"update":26427392},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/18D3DF0","command":{"update":26492928},"timeline_id":"cd9d9409c216e64bf580904facedb01b"},{"lsn":"0/18D3DF0","command":{"branch_from":"cd9d9409c216e64bf580904facedb01b"},"timeline_id":"230fc9d756f7363574c0d66533564dcc"},{"lsn":"0/220F438","command":{"update":25239552},"timeline_id":"230fc9d756f7363574c0d66533564dcc"}],"retention_period":131072}"#;
let inputs: ModelInputs = serde_json::from_str(doc).unwrap();
assert_eq!(inputs.calculate().unwrap(), 36_409_872);
}
#[test]
fn updates_sort_with_branches_at_same_lsn() {
use std::str::FromStr;
use Command::{BranchFrom, EndOfBranch};
macro_rules! lsn {
($e:expr) => {
Lsn::from_str($e).unwrap()
};
}
let ids = [
TimelineId::from_str("00000000000000000000000000000000").unwrap(),
TimelineId::from_str("11111111111111111111111111111111").unwrap(),
TimelineId::from_str("22222222222222222222222222222222").unwrap(),
TimelineId::from_str("33333333333333333333333333333333").unwrap(),
TimelineId::from_str("44444444444444444444444444444444").unwrap(),
];
// issue https://github.com/neondatabase/neon/issues/3179
let commands = vec![
Update {
lsn: lsn!("0/0"),
command: BranchFrom(None),
timeline_id: ids[0],
},
Update {
lsn: lsn!("0/169AD58"),
command: Command::Update(25387008),
timeline_id: ids[0],
},
// the next three are wrongly sorted, because
// a branch from ids[1] appears before ids[1] exists,
// and a branch from ids[2] appears before ids[2] exists
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[1])),
timeline_id: ids[3],
},
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[0])),
timeline_id: ids[2],
},
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[2])),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/1CA85B8"),
command: Command::Update(28925952),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/1CD85B8"),
command: Command::Update(29024256),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/1CD85B8"),
command: BranchFrom(Some(ids[1])),
timeline_id: ids[4],
},
Update {
lsn: lsn!("0/22DCE70"),
command: Command::Update(32546816),
timeline_id: ids[3],
},
Update {
lsn: lsn!("0/230CE70"),
command: EndOfBranch,
timeline_id: ids[3],
},
];
let expected = vec![
Update {
lsn: lsn!("0/0"),
command: BranchFrom(None),
timeline_id: ids[0],
},
Update {
lsn: lsn!("0/169AD58"),
command: Command::Update(25387008),
timeline_id: ids[0],
},
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[0])),
timeline_id: ids[2],
},
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[2])),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/169AD58"),
command: BranchFrom(Some(ids[1])),
timeline_id: ids[3],
},
Update {
lsn: lsn!("0/1CA85B8"),
command: Command::Update(28925952),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/1CD85B8"),
command: Command::Update(29024256),
timeline_id: ids[1],
},
Update {
lsn: lsn!("0/1CD85B8"),
command: BranchFrom(Some(ids[1])),
timeline_id: ids[4],
},
Update {
lsn: lsn!("0/22DCE70"),
command: Command::Update(32546816),
timeline_id: ids[3],
},
Update {
lsn: lsn!("0/230CE70"),
command: EndOfBranch,
timeline_id: ids[3],
},
];
let sorted_commands = sort_updates_in_tree_order(commands).unwrap();
assert_eq!(sorted_commands, expected);
}

View File

@@ -196,3 +196,50 @@ pub fn downcast_remote_layer(
None
}
}
impl std::fmt::Debug for dyn Layer {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("Layer")
.field("short_id", &self.short_id())
.finish()
}
}
/// Holds metadata about a layer without any content. Used mostly for testing.
pub struct LayerDescriptor {
pub key: Range<Key>,
pub lsn: Range<Lsn>,
pub is_incremental: bool,
pub short_id: String,
}
impl Layer for LayerDescriptor {
fn get_key_range(&self) -> Range<Key> {
self.key.clone()
}
fn get_lsn_range(&self) -> Range<Lsn> {
self.lsn.clone()
}
fn is_incremental(&self) -> bool {
self.is_incremental
}
fn get_value_reconstruct_data(
&self,
_key: Key,
_lsn_range: Range<Lsn>,
_reconstruct_data: &mut ValueReconstructState,
) -> Result<ValueReconstructResult> {
todo!("This method shouldn't be part of the Layer trait")
}
fn short_id(&self) -> String {
self.short_id.clone()
}
fn dump(&self, _verbose: bool) -> Result<()> {
todo!()
}
}
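A hedged usage sketch (field values are made up): a test can describe a layer's shape without backing it with any real file.
let fake_image_layer = LayerDescriptor {
    key: Key { field1: 0, field2: 0, field3: 0, field4: 0, field5: 0, field6: 0 }
        ..Key { field1: 0, field2: 0, field3: 0, field4: 0, field5: 0, field6: 100 },
    lsn: Lsn(0x10)..Lsn(0x20),
    is_incremental: false,
    short_id: "fake_image_layer".to_string(),
};
assert!(!fake_image_layer.is_incremental());
assert_eq!(fake_image_layer.short_id(), "fake_image_layer");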

View File

@@ -3,12 +3,12 @@
use anyhow::{anyhow, bail, ensure, Context};
use bytes::Bytes;
use fail::fail_point;
use futures::stream::FuturesUnordered;
use futures::StreamExt;
use itertools::Itertools;
use once_cell::sync::OnceCell;
use pageserver_api::models::{
DownloadRemoteLayersTaskInfo, DownloadRemoteLayersTaskState, TimelineState,
DownloadRemoteLayersTaskInfo, DownloadRemoteLayersTaskSpawnRequest,
DownloadRemoteLayersTaskState, TimelineState,
};
use tokio::sync::{oneshot, watch, Semaphore, TryAcquireError};
use tokio_util::sync::CancellationToken;
@@ -729,16 +729,24 @@ impl Timeline {
Ok(())
}
pub fn activate(self: &Arc<Self>) {
self.set_state(TimelineState::Active);
self.launch_wal_receiver();
}
pub fn set_state(&self, new_state: TimelineState) {
match (self.current_state(), new_state) {
(equal_state_1, equal_state_2) if equal_state_1 == equal_state_2 => {
debug!("Ignoring new state, equal to the existing one: {equal_state_2:?}");
warn!("Ignoring new state, equal to the existing one: {equal_state_2:?}");
}
(st, TimelineState::Loading) => {
error!("ignoring transition from {st:?} into Loading state");
}
(TimelineState::Broken, _) => {
error!("Ignoring state update {new_state:?} for broken tenant");
}
(TimelineState::Stopping, TimelineState::Active) => {
debug!("Not activating a Stopping timeline");
error!("Not activating a Stopping timeline");
}
(_, new_state) => {
self.state.send_replace(new_state);
@@ -812,7 +820,7 @@ impl Timeline {
pg_version: u32,
) -> Arc<Self> {
let disk_consistent_lsn = metadata.disk_consistent_lsn();
let (state, _) = watch::channel(TimelineState::Suspended);
let (state, _) = watch::channel(TimelineState::Loading);
let (layer_flush_start_tx, _) = tokio::sync::watch::channel(0);
let (layer_flush_done_tx, _) = tokio::sync::watch::channel((0, Ok(())));
@@ -970,6 +978,7 @@ impl Timeline {
///
pub(super) fn load_layer_map(&self, disk_consistent_lsn: Lsn) -> anyhow::Result<()> {
let mut layers = self.layers.write().unwrap();
let mut updates = layers.batch_update();
let mut num_layers = 0;
let timer = self.metrics.load_layer_map_histo.start_timer();
@@ -1010,7 +1019,7 @@ impl Timeline {
trace!("found layer {}", layer.path().display());
total_physical_size += file_size;
layers.insert_historic(Arc::new(layer));
updates.insert_historic(Arc::new(layer));
num_layers += 1;
} else if let Some(deltafilename) = DeltaFileName::parse_str(&fname) {
// Create a DeltaLayer struct for each delta file.
@@ -1041,7 +1050,7 @@ impl Timeline {
trace!("found layer {}", layer.path().display());
total_physical_size += file_size;
layers.insert_historic(Arc::new(layer));
updates.insert_historic(Arc::new(layer));
num_layers += 1;
} else if fname == METADATA_FILE_NAME || fname.ends_with(".old") {
// ignore these
@@ -1067,6 +1076,7 @@ impl Timeline {
}
}
updates.flush();
layers.next_open_layer_at = Some(Lsn(disk_consistent_lsn.0) + 1);
info!(
@@ -1091,6 +1101,11 @@ impl Timeline {
// Are we missing some files that are present in remote storage?
// Create RemoteLayer instances for them.
let mut local_only_layers = local_layers;
// We're holding a layer map lock for a while but this
// method is only called during init so it's fine.
let mut layer_map = self.layers.write().unwrap();
let mut updates = layer_map.batch_update();
for remote_layer_name in &index_part.timeline_layers {
let local_layer = local_only_layers.remove(remote_layer_name);
@@ -1129,7 +1144,7 @@ impl Timeline {
anyhow::bail!("could not rename file {local_layer_path:?}: {err:?}");
} else {
self.metrics.resident_physical_size_gauge.sub(local_size);
self.layers.write().unwrap().remove_historic(local_layer);
updates.remove_historic(local_layer);
// fall-through to adding the remote layer
}
} else {
@@ -1171,7 +1186,7 @@ impl Timeline {
);
let remote_layer = Arc::new(remote_layer);
self.layers.write().unwrap().insert_historic(remote_layer);
updates.insert_historic(remote_layer);
}
LayerFileName::Delta(deltafilename) => {
// Create a RemoteLayer for the delta file.
@@ -1194,13 +1209,14 @@ impl Timeline {
&remote_layer_metadata,
);
let remote_layer = Arc::new(remote_layer);
self.layers.write().unwrap().insert_historic(remote_layer);
updates.insert_historic(remote_layer);
}
#[cfg(test)]
LayerFileName::Test(_) => unreachable!(),
}
}
updates.flush();
Ok(local_only_layers)
}
@@ -1392,7 +1408,7 @@ impl Timeline {
TimelineState::Active => continue,
TimelineState::Broken
| TimelineState::Stopping
| TimelineState::Suspended => {
| TimelineState::Loading => {
break format!("aborted because timeline became inactive (new state: {new_state:?})")
}
}
@@ -2099,10 +2115,11 @@ impl Timeline {
])?;
// Add it to the layer map
{
let mut layers = self.layers.write().unwrap();
layers.insert_historic(Arc::new(new_delta));
}
self.layers
.write()
.unwrap()
.batch_update()
.insert_historic(Arc::new(new_delta));
// update the timeline's physical size
let sz = new_delta_path.metadata()?.len();
@@ -2166,13 +2183,15 @@ impl Timeline {
// are some delta layers *later* than current 'lsn', if more WAL was processed and flushed
// after we read last_record_lsn, which is passed here in the 'lsn' argument.
if img_lsn < lsn {
let num_deltas = layers.count_deltas(&img_range, &(img_lsn..lsn))?;
let threshold = self.get_image_creation_threshold();
let num_deltas =
layers.count_deltas(&img_range, &(img_lsn..lsn), Some(threshold))?;
debug!(
"key range {}-{}, has {} deltas on this timeline in LSN range {}..{}",
img_range.start, img_range.end, num_deltas, img_lsn, lsn
);
if num_deltas >= self.get_image_creation_threshold() {
if num_deltas >= threshold {
return Ok(true);
}
}
@@ -2267,21 +2286,23 @@ impl Timeline {
let mut layer_paths_to_upload = HashMap::with_capacity(image_layers.len());
let mut layers = self.layers.write().unwrap();
let mut updates = layers.batch_update();
let timeline_path = self.conf.timeline_path(&self.timeline_id, &self.tenant_id);
for l in image_layers {
let path = l.filename();
let metadata = timeline_path
.join(path.file_name())
.metadata()
.context("reading metadata of layer file {path}")?;
.with_context(|| format!("reading metadata of layer file {}", path.file_name()))?;
layer_paths_to_upload.insert(path, LayerFileMetadata::new(metadata.len()));
self.metrics
.resident_physical_size_gauge
.add(metadata.len());
layers.insert_historic(Arc::new(l));
updates.insert_historic(Arc::new(l));
}
updates.flush();
drop(layers);
timer.stop_and_record();
@@ -2577,6 +2598,7 @@ impl Timeline {
}
let mut layers = self.layers.write().unwrap();
let mut updates = layers.batch_update();
let mut new_layer_paths = HashMap::with_capacity(new_layers.len());
for l in new_layers {
let new_delta_path = l.path();
@@ -2597,7 +2619,7 @@ impl Timeline {
new_layer_paths.insert(new_delta_path, LayerFileMetadata::new(metadata.len()));
let x: Arc<dyn PersistentLayer + 'static> = Arc::new(l);
layers.insert_historic(x);
updates.insert_historic(x);
}
// Now that we have reshuffled the data to set of new delta layers, we can
@@ -2611,8 +2633,9 @@ impl Timeline {
}
layer_names_to_delete.push(l.filename());
l.delete()?;
layers.remove_historic(l);
updates.remove_historic(l);
}
updates.flush();
drop(layers);
// Also schedule the deletions in remote storage
@@ -2812,6 +2835,7 @@ impl Timeline {
// 3. it doesn't need to be retained for 'retain_lsns';
// 4. newer on-disk image layers cover the layer's whole key range
//
// TODO holding a write lock is too aggressive and avoidable
let mut layers = self.layers.write().unwrap();
'outer: for l in layers.iter_historic_layers() {
result.layers_total += 1;
@@ -2843,6 +2867,8 @@ impl Timeline {
// might be referenced by child branches forever.
// We can track this in child timeline GC and delete parent layers when
// they are no longer needed. This might be complicated with long inheritance chains.
//
// TODO Vec is not a great choice for `retain_lsns`
for retain_lsn in &retain_lsns {
// start_lsn is inclusive
if &l.get_lsn_range().start <= retain_lsn {
@@ -2896,6 +2922,7 @@ impl Timeline {
layers_to_remove.push(Arc::clone(&l));
}
let mut updates = layers.batch_update();
if !layers_to_remove.is_empty() {
// Persist the new GC cutoff value in the metadata file, before
// we actually remove anything.
@@ -2913,7 +2940,13 @@ impl Timeline {
}
layer_names_to_delete.push(doomed_layer.filename());
doomed_layer.delete()?; // FIXME: schedule succeeded deletions before returning?
layers.remove_historic(doomed_layer);
// TODO Removing from the bottom of the layer map is expensive.
// Maybe instead discard all layer map historic versions that
// won't be needed for page reconstruction for this timeline,
// and mark what we can't delete yet as deleted from the layer
// map index without actually rebuilding the index.
updates.remove_historic(doomed_layer);
result.layers_removed += 1;
}
@@ -2925,6 +2958,7 @@ impl Timeline {
remote_client.schedule_layer_file_deletion(&layer_names_to_delete)?;
}
}
updates.flush();
info!(
"GC completed removing {} layers, cutoff {}",
@@ -3081,11 +3115,13 @@ impl Timeline {
// Delta- or ImageLayer in the layer map.
let new_layer = remote_layer.create_downloaded_layer(self_clone.conf, *size);
let mut layers = self_clone.layers.write().unwrap();
let mut updates = layers.batch_update();
{
let l: Arc<dyn PersistentLayer> = remote_layer.clone();
layers.remove_historic(l);
updates.remove_historic(l);
}
layers.insert_historic(new_layer);
updates.insert_historic(new_layer);
updates.flush();
drop(layers);
// Now that we've inserted the download into the layer map,
@@ -3116,6 +3152,7 @@ impl Timeline {
pub async fn spawn_download_all_remote_layers(
self: Arc<Self>,
request: DownloadRemoteLayersTaskSpawnRequest,
) -> Result<DownloadRemoteLayersTaskInfo, DownloadRemoteLayersTaskInfo> {
let mut status_guard = self.download_all_remote_layers_task_info.write().unwrap();
if let Some(st) = &*status_guard {
@@ -3139,7 +3176,7 @@ impl Timeline {
"download all remote layers task",
false,
async move {
self_clone.download_all_remote_layers().await;
self_clone.download_all_remote_layers(request).await;
let mut status_guard = self_clone.download_all_remote_layers_task_info.write().unwrap();
match &mut *status_guard {
None => {
@@ -3171,15 +3208,23 @@ impl Timeline {
Ok(initial_info)
}
async fn download_all_remote_layers(self: &Arc<Self>) {
let mut downloads: FuturesUnordered<_> = {
async fn download_all_remote_layers(
self: &Arc<Self>,
request: DownloadRemoteLayersTaskSpawnRequest,
) {
let mut downloads = Vec::new();
{
let layers = self.layers.read().unwrap();
layers
.iter_historic_layers()
.filter_map(|l| l.downcast_remote_layer())
.map(|l| self.download_remote_layer(l))
.collect()
};
.for_each(|dl| downloads.push(dl))
}
let total_layer_count = downloads.len();
// limit download concurrency as specified in request
let downloads = futures::stream::iter(downloads);
let mut downloads = downloads.buffer_unordered(request.max_concurrent_downloads.get());
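The concurrency limit works by turning the collected download futures into a stream and polling at most `max_concurrent_downloads` of them at a time; a self-contained sketch of the same pattern (names are illustrative):
use futures::StreamExt;

async fn run_all<F>(jobs: Vec<F>, max_concurrent: usize) -> Vec<F::Output>
where
    F: std::future::Future,
{
    // At most `max_concurrent` futures are polled concurrently; results
    // arrive in completion order, not submission order.
    futures::stream::iter(jobs)
        .buffer_unordered(max_concurrent)
        .collect()
        .await
}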
macro_rules! lock_status {
($st:ident) => {
@@ -3200,7 +3245,7 @@ impl Timeline {
{
lock_status!(st);
st.total_layer_count = downloads.len().try_into().unwrap();
st.total_layer_count = total_layer_count as u64;
}
loop {
tokio::select! {

View File

@@ -116,6 +116,10 @@ impl<E: Clone> TaskHandle<E> {
let join_handle = WALRECEIVER_RUNTIME.spawn(async move {
events_sender.send(TaskStateUpdate::Started).ok();
task(events_sender, cancellation_clone).await
// events_sender is dropped at some point during the .await above.
// But the task is still running on WALRECEIVER_RUNTIME.
// That is the window when `!jh.is_finished()`
// is true inside `fn next_task_event()` below.
});
TaskHandle {
@@ -132,7 +136,23 @@ impl<E: Clone> TaskHandle<E> {
TaskEvent::End(match self.join_handle.as_mut() {
Some(jh) => {
if !jh.is_finished() {
warn!("sender is dropped while join handle is still alive");
// Barring any implementation errors in this module, we can
// only arrive here while the task that executes the future
// passed to `Self::spawn()` is still executing. Cf. the comment
// in Self::spawn().
//
// This was logging at warning level in earlier versions, presumably
// to leave some breadcrumbs in case we had an implementation
// error that would make us get stuck in `jh.await`.
//
// There hasn't been such a bug so far.
// But in a busy system, e.g., during pageserver restart,
// we arrive here often enough that the warning-level logs
// became a distraction.
// So, tone them down to info-level.
//
// XXX: rewrite this module to eliminate the race condition.
info!("sender is dropped while join handle is still alive");
}
let res = jh

View File

@@ -183,13 +183,23 @@ async fn connection_manager_loop_step(
new_event = async {
loop {
if walreceiver_state.timeline.current_state() == TimelineState::Loading {
warn!("wal connection manager should only be launched after timeline has become active");
}
match timeline_state_updates.changed().await {
Ok(()) => {
let new_state = walreceiver_state.timeline.current_state();
match new_state {
// we're already active as walreceiver, no need to reactivate
TimelineState::Active => continue,
TimelineState::Broken | TimelineState::Stopping | TimelineState::Suspended => return ControlFlow::Continue(new_state),
TimelineState::Broken | TimelineState::Stopping => {
info!("timeline entered terminal state {new_state:?}, stopping wal connection manager loop");
return ControlFlow::Break(());
}
TimelineState::Loading => {
warn!("timeline transitioned back to Loading state, that should not happen");
return ControlFlow::Continue(new_state);
}
}
}
Err(_sender_dropped_error) => return ControlFlow::Break(()),
@@ -197,7 +207,7 @@ async fn connection_manager_loop_step(
}
} => match new_event {
ControlFlow::Continue(new_state) => {
info!("Timeline became inactive (new state: {new_state:?}), dropping current connections until it reactivates");
info!("observed timeline state change, new state is {new_state:?}");
return ControlFlow::Continue(());
}
ControlFlow::Break(()) => {
@@ -289,7 +299,9 @@ async fn subscribe_for_timeline_updates(
return resp.into_inner();
}
Err(e) => {
warn!("Attempt #{attempt}, failed to subscribe for timeline {id} updates in broker: {e:#}");
// Safekeeper nodes can stop pushing timeline updates to the broker when no new writes happen and
// the entire WAL has been streamed. Keep this noticeable with logging, but do not warn/error.
info!("Attempt #{attempt}, failed to subscribe for timeline {id} updates in broker: {e:#}");
continue;
}
}
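The connection manager loop above hinges on the timeline state being a tokio watch channel: every `send_replace` on the sender wakes all `changed()` waiters. A self-contained sketch of that mechanism (the `State` enum here is illustrative, not the real `TimelineState`):
use tokio::sync::watch;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State { Loading, Active, Stopping }

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(State::Loading);
    let observer = tokio::spawn(async move {
        while rx.changed().await.is_ok() {
            let state = *rx.borrow();
            println!("observed state change: {state:?}");
            if state == State::Stopping {
                break;
            }
        }
    });
    tx.send_replace(State::Active);
    tx.send_replace(State::Stopping);
    observer.await.unwrap();
}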

View File

@@ -77,9 +77,13 @@ pub async fn handle_walreceiver_connection(
info!("DB connection stream finished: {expected_error}");
return Ok(());
}
Err(elapsed) => anyhow::bail!(
"Timed out while waiting {elapsed} for walreceiver connection to open"
),
Err(_) => {
// Timing out while connecting to a safekeeper node can happen for many reasons
// that the pageserver cannot control.
// Do not produce an error, but make it visible that timeouts happen, by logging the event.
info!("Timed out while waiting {connect_timeout:?} for walreceiver connection to open");
return Ok(());
}
}
};

View File

@@ -626,24 +626,20 @@ impl PostgresRedoProcess {
// Create empty data directory for wal-redo postgres, deleting old one first.
if datadir.exists() {
info!(
"old temporary datadir {} exists, removing",
datadir.display()
);
fs::remove_dir_all(&datadir)?;
info!("old temporary datadir {datadir:?} exists, removing");
fs::remove_dir_all(&datadir).map_err(|e| {
Error::new(
e.kind(),
format!("Old temporary dir {datadir:?} removal failure: {e}"),
)
})?;
}
let pg_bin_dir_path = conf.pg_bin_dir(pg_version).map_err(|e| {
Error::new(
ErrorKind::Other,
format!("incorrect pg_bin_dir path: {}", e),
)
})?;
let pg_lib_dir_path = conf.pg_lib_dir(pg_version).map_err(|e| {
Error::new(
ErrorKind::Other,
format!("incorrect pg_lib_dir path: {}", e),
)
})?;
let pg_bin_dir_path = conf
.pg_bin_dir(pg_version)
.map_err(|e| Error::new(ErrorKind::Other, format!("incorrect pg_bin_dir path: {e}")))?;
let pg_lib_dir_path = conf
.pg_lib_dir(pg_version)
.map_err(|e| Error::new(ErrorKind::Other, format!("incorrect pg_lib_dir path: {e}")))?;
info!("running initdb in {}", datadir.display());
let initdb = Command::new(pg_bin_dir_path.join("initdb"))
@@ -1010,3 +1006,110 @@ fn build_get_page_msg(tag: BufferTag, buf: &mut Vec<u8>) {
tag.ser_into(buf)
.expect("serialize BufferTag should always succeed");
}
#[cfg(test)]
mod tests {
use super::{PostgresRedoManager, WalRedoManager};
use crate::repository::Key;
use crate::{config::PageServerConf, walrecord::NeonWalRecord};
use bytes::Bytes;
use std::str::FromStr;
use utils::{id::TenantId, lsn::Lsn};
#[test]
fn short_v14_redo() {
let expected = std::fs::read("fixtures/short_v14_redo.page").unwrap();
let h = RedoHarness::new().unwrap();
let page = h
.manager
.request_redo(
Key {
field1: 0,
field2: 1663,
field3: 13010,
field4: 1259,
field5: 0,
field6: 0,
},
Lsn::from_str("0/16E2408").unwrap(),
None,
short_records(),
14,
)
.unwrap();
assert_eq!(&expected, &*page);
}
#[test]
fn short_v14_fails_for_wrong_key_but_returns_zero_page() {
let h = RedoHarness::new().unwrap();
let page = h
.manager
.request_redo(
Key {
field1: 0,
field2: 1663,
// key should be 13010
field3: 13130,
field4: 1259,
field5: 0,
field6: 0,
},
Lsn::from_str("0/16E2408").unwrap(),
None,
short_records(),
14,
)
.unwrap();
// TODO: there will be some stderr printout, which is forwarded to tracing that could
// perhaps be captured as long as it's in the same thread.
assert_eq!(page, crate::ZERO_PAGE);
}
#[allow(clippy::octal_escapes)]
fn short_records() -> Vec<(Lsn, NeonWalRecord)> {
vec![
(
Lsn::from_str("0/16A9388").unwrap(),
NeonWalRecord::Postgres {
will_init: true,
rec: Bytes::from_static(b"j\x03\0\0\0\x04\0\0\xe8\x7fj\x01\0\0\0\0\0\n\0\0\xd0\x16\x13Y\0\x10\0\04\x03\xd4\0\x05\x7f\x06\0\0\xd22\0\0\xeb\x04\0\0\0\0\0\0\xff\x03\0\0\0\0\x80\xeca\x01\0\0\x01\0\xd4\0\xa0\x1d\0 \x04 \0\0\0\0/\0\x01\0\xa0\x9dX\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0.\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\00\x9f\x9a\x01P\x9e\xb2\x01\0\x04\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x02\0!\0\x01\x08 \xff\xff\xff?\0\0\0\0\0\0@\0\0another_table\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x98\x08\0\0\x02@\0\0\0\0\0\0\n\0\0\0\x02\0\0\0\0@\0\0\0\0\0\0\0\0\0\0\0\0\x80\xbf\0\0\0\0\0\0\0\0\0\0pr\x01\0\0\0\0\0\0\0\0\x01d\0\0\0\0\0\0\x04\0\0\x01\0\0\0\0\0\0\0\x0c\x02\0\0\0\0\0\0\0\0\0\0\0\0\0\0/\0!\x80\x03+ \xff\xff\xff\x7f\0\0\0\0\0\xdf\x04\0\0pg_type\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0G\0\0\0\0\0\0\0\n\0\0\0\x02\0\0\0\0\0\0\0\0\0\0\0\x0e\0\0\0\0@\x16D\x0e\0\0\0K\x10\0\0\x01\0pr \0\0\0\0\0\0\0\0\x01n\0\0\0\0\0\xd6\x02\0\0\x01\0\0\0[\x01\0\0\0\0\0\0\0\t\x04\0\0\x02\0\0\0\x01\0\0\0\n\0\0\0\n\0\0\0\x7f\0\0\0\0\0\0\0\n\0\0\0\x02\0\0\0\0\0\0C\x01\0\0\x15\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0.\0!\x80\x03+ \xff\xff\xff\x7f\0\0\0\0\0;\n\0\0pg_statistic\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\xfd.\0\0\0\0\0\0\n\0\0\0\x02\0\0\0;\n\0\0\0\0\0\0\x13\0\0\0\0\0\xcbC\x13\0\0\0\x18\x0b\0\0\x01\0pr\x1f\0\0\0\0\0\0\0\0\x01n\0\0\0\0\0\xd6\x02\0\0\x01\0\0\0C\x01\0\0\0\0\0\0\0\t\x04\0\0\x01\0\0\0\x01\0\0\0\n\0\0\0\n\0\0\0\x7f\0\0\0\0\0\0\x02\0\x01")
}
),
(
Lsn::from_str("0/16D4080").unwrap(),
NeonWalRecord::Postgres {
will_init: false,
rec: Bytes::from_static(b"\xbc\0\0\0\0\0\0\0h?m\x01\0\0\0\0p\n\0\09\x08\xa3\xea\0 \x8c\0\x7f\x06\0\0\xd22\0\0\xeb\x04\0\0\0\0\0\0\xff\x02\0@\0\0another_table\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x98\x08\0\0\x02@\0\0\0\0\0\0\n\0\0\0\x02\0\0\0\0@\0\0\0\0\0\0\x05\0\0\0\0@zD\x05\0\0\0\0\0\0\0\0\0pr\x01\0\0\0\0\0\0\0\0\x01d\0\0\0\0\0\0\x04\0\0\x01\0\0\0\x02\0")
}
)
]
}
struct RedoHarness {
// underscored because unused, except for removal at drop
_repo_dir: tempfile::TempDir,
manager: PostgresRedoManager,
}
impl RedoHarness {
fn new() -> anyhow::Result<Self> {
let repo_dir = tempfile::tempdir()?;
let conf = PageServerConf::dummy_conf(repo_dir.path().to_path_buf());
let conf = Box::leak(Box::new(conf));
let tenant_id = TenantId::generate();
let manager = PostgresRedoManager::new(conf, tenant_id);
Ok(RedoHarness {
_repo_dir: repo_dir,
manager,
})
}
}
}

View File

@@ -52,7 +52,7 @@ typedef struct
#define NEON_TAG "[NEON_SMGR] "
#define neon_log(tag, fmt, ...) ereport(tag, \
(errmsg(NEON_TAG fmt, ##__VA_ARGS__), \
errhidestmt(true), errhidecontext(true)))
errhidestmt(true), errhidecontext(true), internalerrposition(0)))
/*
* supertype of all the Neon*Request structs below

View File

@@ -287,12 +287,13 @@ compact_prefetch_buffers(void)
/*
* Here we have established:
* slots < search_ring_index may be unused (not scanned)
* slots >= search_ring_index and <= empty_ring_index are unused
* slots > empty_ring_index are in use, or outside our buffer's range.
* slots < search_ring_index have an unknown state (not scanned)
* slots >= search_ring_index and <= empty_ring_index are unused
* slots > empty_ring_index are in use, or outside our buffer's range.
* ... unless search_ring_index <= ring_last
*
* Therefore, there is a gap of at least one unused item between
* search_ring_index and empty_ring_index, which grows as we hit
* search_ring_index and empty_ring_index (both inclusive), which grows as we hit
* more unused items while moving backwards through the array.
*/
@@ -302,6 +303,7 @@ compact_prefetch_buffers(void)
PrefetchRequest *target_slot;
bool found;
/* update search index to an unprocessed entry */
search_ring_index--;
source_slot = GetPrfSlot(search_ring_index);
@@ -309,6 +311,7 @@ compact_prefetch_buffers(void)
if (source_slot->status == PRFS_UNUSED)
continue;
/* slot is used -- start moving slot */
target_slot = GetPrfSlot(empty_ring_index);
Assert(source_slot->status == PRFS_RECEIVED);
@@ -328,16 +331,22 @@ compact_prefetch_buffers(void)
/* Adjust the location of our known-empty slot */
empty_ring_index--;
/* empty the moved slot */
source_slot->status = PRFS_UNUSED;
source_slot->buftag = (BufferTag) {0};
source_slot->response = NULL;
source_slot->my_ring_index = 0;
source_slot->effective_request_lsn = 0;
/* update bookkeeping */
n_moved++;
}
if (MyPState->ring_last != empty_ring_index)
/*
* Only when we've moved slots can we expect trailing unused slots,
* so only then do we clean up trailing unused slots.
*/
if (n_moved > 0)
{
prefetch_cleanup_trailing_unused();
return true;

View File

@@ -1,58 +1,63 @@
[package]
name = "proxy"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow = "1.0"
atty = "0.2.14"
base64 = "0.13.0"
bstr = "1.0"
bytes = { version = "1.0.1", features = ['serde'] }
clap = "4.0"
futures = "0.3.13"
git-version = "0.3.5"
hashbrown = "0.12"
hex = "0.4.3"
hmac = "0.12.1"
hyper = "0.14"
hyper-tungstenite = "0.8.1"
itertools = "0.10.3"
md5 = "0.7.0"
once_cell = "1.13.0"
parking_lot = "0.12"
pin-project-lite = "0.2.7"
rand = "0.8.3"
regex = "1.4.5"
reqwest = { version = "0.11", default-features = false, features = [ "json", "rustls-tls" ] }
routerify = "3"
rustls = "0.20.0"
rustls-pemfile = "1"
scopeguard = "1.1.0"
serde = "1"
serde_json = "1"
sha2 = "0.10.2"
socket2 = "0.4.4"
thiserror = "1.0.30"
tokio = { version = "1.17", features = ["macros"] }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
tokio-rustls = "0.23.0"
tls-listener = { version = "0.5.1", features = ["rustls", "hyper-h1"] }
tracing = "0.1.36"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
url = "2.2.2"
uuid = { version = "1.2", features = ["v4", "serde"] }
webpki-roots = "0.22.5"
x509-parser = "0.14"
anyhow.workspace = true
atty.workspace = true
base64.workspace = true
bstr.workspace = true
bytes = {workspace = true, features = ['serde'] }
clap.workspace = true
chrono.workspace = true
consumption_metrics.workspace = true
futures.workspace = true
git-version.workspace = true
hashbrown.workspace = true
hex.workspace = true
hmac.workspace = true
hyper.workspace = true
hyper-tungstenite.workspace = true
itertools.workspace = true
md5.workspace = true
once_cell.workspace = true
parking_lot.workspace = true
pin-project-lite.workspace = true
rand.workspace = true
regex.workspace = true
reqwest = { workspace = true, features = [ "json" ] }
routerify.workspace = true
rustls.workspace = true
rustls-pemfile.workspace = true
scopeguard.workspace = true
serde.workspace = true
serde_json.workspace = true
sha2.workspace = true
socket2.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-postgres.workspace = true
tokio-rustls.workspace = true
tls-listener.workspace = true
tracing.workspace = true
tracing-subscriber.workspace = true
url.workspace = true
uuid.workspace = true
webpki-roots.workspace = true
x509-parser.workspace = true
metrics.workspace = true
pq_proto.workspace = true
utils.workspace = true
prometheus.workspace = true
humantime.workspace = true
hostname.workspace = true
metrics = { path = "../libs/metrics" }
pq_proto = { path = "../libs/pq_proto" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
workspace_hack.workspace = true
[dev-dependencies]
async-trait = "0.1"
rcgen = "0.10"
rstest = "0.15"
tokio-postgres-rustls = "0.9.0"
async-trait.workspace = true
rcgen.workspace = true
rstest.workspace = true
tokio-postgres-rustls.workspace = true

View File

@@ -42,9 +42,9 @@ pub enum AuthErrorImpl {
MalformedPassword(&'static str),
#[error(
"Project ID is not specified. \
"Endpoint ID is not specified. \
Either please upgrade the postgres client library (libpq) for SNI support \
or pass the project ID (first part of the domain name) as a parameter: '?options=project%3D<project-id>'. \
or pass the endpoint ID (first part of the domain name) as a parameter: '?options=project%3D<endpoint-id>'. \
See more at https://neon.tech/sni"
)]
MissingProjectName,

View File

@@ -5,6 +5,12 @@ use std::sync::Arc;
pub struct ProxyConfig {
pub tls_config: Option<TlsConfig>,
pub auth_backend: auth::BackendType<'static, ()>,
pub metric_collection_config: Option<MetricCollectionConfig>,
}
pub struct MetricCollectionConfig {
pub endpoint: reqwest::Url,
pub interval: std::time::Duration,
}
pub struct TlsConfig {

View File

@@ -13,7 +13,7 @@ use std::convert::Infallible;
use std::future::ready;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll};
use std::task::{ready, Context, Poll};
use tls_listener::TlsListener;
use tokio::io::{self, AsyncBufRead, AsyncRead, AsyncWrite, ReadBuf};
@@ -104,10 +104,9 @@ impl AsyncRead for WebSocketRW {
return Poll::Ready(Ok(()));
}
let inner_buf = match self.as_mut().poll_fill_buf(cx) {
Poll::Ready(Ok(buf)) => buf,
Poll::Ready(Err(err)) => return Poll::Ready(Err(err)),
Poll::Pending => return Poll::Pending,
let inner_buf = match ready!(self.as_mut().poll_fill_buf(cx)) {
Ok(buf) => buf,
Err(err) => return Poll::Ready(Err(err)),
};
let len = std::cmp::min(inner_buf.len(), buf.remaining());
buf.put_slice(&inner_buf[..len]);
@@ -124,8 +123,8 @@ impl AsyncBufRead for WebSocketRW {
let buf = self.project().chunk.as_ref().unwrap().chunk();
return Poll::Ready(Ok(buf));
} else {
match self.as_mut().project().stream.poll_next(cx) {
Poll::Ready(Some(Ok(message))) => match message {
match ready!(self.as_mut().project().stream.poll_next(cx)) {
Some(Ok(message)) => match message {
Message::Text(_) => {}
Message::Binary(chunk) => {
*self.as_mut().project().chunk = Some(Bytes::from(chunk));
@@ -142,9 +141,8 @@ impl AsyncBufRead for WebSocketRW {
unreachable!();
}
},
Poll::Ready(Some(Err(err))) => return Poll::Ready(Err(ws_err_into(err))),
Poll::Ready(None) => return Poll::Ready(Ok(&[])),
Poll::Pending => return Poll::Pending,
Some(Err(err)) => return Poll::Ready(Err(ws_err_into(err))),
None => return Poll::Ready(Ok(&[])),
}
}
}

View File

@@ -11,6 +11,7 @@ mod config;
mod console;
mod error;
mod http;
mod metrics;
mod mgmt;
mod parse;
mod proxy;
@@ -20,14 +21,14 @@ mod stream;
mod url;
mod waiters;
use ::metrics::set_build_info_metric;
use anyhow::{bail, Context};
use clap::{self, Arg};
use config::ProxyConfig;
use futures::FutureExt;
use metrics::set_build_info_metric;
use std::{borrow::Cow, future::Future, net::SocketAddr};
use tokio::{net::TcpListener, task::JoinError};
use tracing::info;
use tracing::{info, info_span, Instrument};
use utils::project_git_version;
use utils::sentry_init::{init_sentry, release_name};
@@ -65,6 +66,22 @@ async fn main() -> anyhow::Result<()> {
let mgmt_address: SocketAddr = arg_matches.get_one::<String>("mgmt").unwrap().parse()?;
let http_address: SocketAddr = arg_matches.get_one::<String>("http").unwrap().parse()?;
let metric_collection_config = match
(
arg_matches.get_one::<String>("metric-collection-endpoint"),
arg_matches.get_one::<String>("metric-collection-interval"),
) {
(Some(endpoint), Some(interval)) => {
Some(config::MetricCollectionConfig {
endpoint: endpoint.parse()?,
interval: humantime::parse_duration(interval)?,
})
}
(None, None) => None,
_ => bail!("either both or neither metric-collection-endpoint and metric-collection-interval must be specified"),
};
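For reference, the interval flag goes through humantime::parse_duration, which accepts human-readable values; a quick illustration:
assert_eq!(
    humantime::parse_duration("10s").unwrap(),
    std::time::Duration::from_secs(10)
);
assert_eq!(
    humantime::parse_duration("1min 30s").unwrap(),
    std::time::Duration::from_secs(90)
);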
let auth_backend = match arg_matches
.get_one::<String>("auth-backend")
.unwrap()
@@ -95,6 +112,7 @@ async fn main() -> anyhow::Result<()> {
let config: &ProxyConfig = Box::leak(Box::new(ProxyConfig {
tls_config,
auth_backend,
metric_collection_config,
}));
info!("Version: {GIT_VERSION}");
@@ -126,6 +144,21 @@ async fn main() -> anyhow::Result<()> {
)));
}
if let Some(metric_collection_config) = &config.metric_collection_config {
let hostname = hostname::get()?
.into_string()
.map_err(|e| anyhow::anyhow!("failed to get hostname {e:?}"))?;
tasks.push(tokio::spawn(
metrics::collect_metrics(
&metric_collection_config.endpoint,
metric_collection_config.interval,
hostname,
)
.instrument(info_span!("collect_metrics")),
));
}
let tasks = tasks.into_iter().map(flatten_err);
set_build_info_metric(GIT_VERSION);
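For context, `.instrument(info_span!("collect_metrics"))` in the hunk above attaches a tracing span to the spawned future, so every log line the task emits carries that span. A minimal sketch of the pattern with a stand-in task; the `tracing_subscriber` setup is an assumption and does not appear in this diff:

```rust
use tracing::{info, info_span, Instrument};

// Stand-in for the real collect_metrics loop.
async fn collect_metrics_stub() -> anyhow::Result<()> {
    info!("tick"); // logged inside the `collect_metrics` span
    Ok(())
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();
    let task = tokio::spawn(
        collect_metrics_stub().instrument(info_span!("collect_metrics")),
    );
    // The JoinError and the task's own error both convert into anyhow::Error.
    task.await??;
    Ok(())
}
```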
@@ -199,6 +232,16 @@ fn cli() -> clap::Command {
.alias("ssl-cert") // backwards compatibility
.help("path to TLS cert for client postgres connections"),
)
.arg(
Arg::new("metric-collection-endpoint")
.long("metric-collection-endpoint")
.help("metric collection HTTP endpoint"),
)
.arg(
Arg::new("metric-collection-interval")
.long("metric-collection-interval")
.help("metric collection interval"),
)
}
#[test]

196
proxy/src/metrics.rs Normal file
View File

@@ -0,0 +1,196 @@
//!
//! Periodically collect proxy consumption metrics
//! and push them to an HTTP endpoint.
//!
use chrono::{DateTime, Utc};
use consumption_metrics::{idempotency_key, Event, EventChunk, EventType, CHUNK_SIZE};
use serde::Serialize;
use std::{collections::HashMap, time::Duration};
use tracing::{debug, error, log::info, trace};
const PROXY_IO_BYTES_PER_CLIENT: &str = "proxy_io_bytes_per_client";
///
/// Key that uniquely identifies the object this metric describes.
/// Currently, endpoint_id is enough, but this may change later,
/// so keep it in a named struct.
///
/// Both the proxy and the ingestion endpoint will live in the same region (or cell),
/// so, while the project-id is unique across regions, the whole pipeline will work correctly
/// because we enrich the event with project_id in the control-plane endpoint.
///
#[derive(Eq, Hash, PartialEq, Serialize)]
pub struct Ids {
pub endpoint_id: String,
}
pub async fn collect_metrics(
metric_collection_endpoint: &reqwest::Url,
metric_collection_interval: Duration,
hostname: String,
) -> anyhow::Result<()> {
scopeguard::defer! {
info!("collect_metrics has shut down");
}
let mut ticker = tokio::time::interval(metric_collection_interval);
info!(
"starting collect_metrics. metric_collection_endpoint: {}",
metric_collection_endpoint
);
// define client here to reuse it for all requests
let client = reqwest::Client::new();
let mut cached_metrics: HashMap<Ids, (u64, DateTime<Utc>)> = HashMap::new();
loop {
tokio::select! {
_ = ticker.tick() => {
match collect_metrics_iteration(&client, &mut cached_metrics, metric_collection_endpoint, hostname.clone()).await
{
Err(e) => {
error!("Failed to send consumption metrics: {} ", e);
},
Ok(_) => { trace!("collect_metrics_iteration completed successfully") },
}
}
}
}
}
pub fn gather_proxy_io_bytes_per_client() -> Vec<(Ids, (u64, DateTime<Utc>))> {
let mut current_metrics: Vec<(Ids, (u64, DateTime<Utc>))> = Vec::new();
let metrics = prometheus::default_registry().gather();
for m in metrics {
if m.get_name() == "proxy_io_bytes_per_client" {
for ms in m.get_metric() {
let direction = ms
.get_label()
.iter()
.find(|l| l.get_name() == "direction")
.unwrap()
.get_value();
// Only collect metric for outbound traffic
if direction == "tx" {
let endpoint_id = ms
.get_label()
.iter()
.find(|l| l.get_name() == "endpoint_id")
.unwrap()
.get_value();
let value = ms.get_counter().get_value() as u64;
debug!("endpoint_id:val - {}: {}", endpoint_id, value);
current_metrics.push((
Ids {
endpoint_id: endpoint_id.to_string(),
},
(value, Utc::now()),
));
}
}
}
}
current_metrics
}
pub async fn collect_metrics_iteration(
client: &reqwest::Client,
cached_metrics: &mut HashMap<Ids, (u64, DateTime<Utc>)>,
metric_collection_endpoint: &reqwest::Url,
hostname: String,
) -> anyhow::Result<()> {
info!(
"starting collect_metrics_iteration. metric_collection_endpoint: {}",
metric_collection_endpoint
);
let current_metrics = gather_proxy_io_bytes_per_client();
let metrics_to_send: Vec<Event<Ids>> = current_metrics
.iter()
.filter_map(|(curr_key, (curr_val, curr_time))| {
let mut start_time = *curr_time;
let mut value = *curr_val;
if let Some((prev_val, prev_time)) = cached_metrics.get(curr_key) {
// Only send metrics updates if the metric has changed
if curr_val - prev_val > 0 {
value = curr_val - prev_val;
start_time = *prev_time;
} else {
return None;
}
};
Some(Event {
kind: EventType::Incremental {
start_time,
stop_time: *curr_time,
},
metric: PROXY_IO_BYTES_PER_CLIENT,
idempotency_key: idempotency_key(hostname.clone()),
value,
extra: Ids {
endpoint_id: curr_key.endpoint_id.clone(),
},
})
})
.collect();
if metrics_to_send.is_empty() {
trace!("no new metrics to send");
return Ok(());
}
// Send metrics.
// Split into chunks of CHUNK_SIZE metrics to avoid exceeding the max request size
for chunk in metrics_to_send.chunks(CHUNK_SIZE) {
let chunk_json = serde_json::value::to_raw_value(&EventChunk { events: chunk })
.expect("ProxyConsumptionMetric should not fail serialization");
let res = client
.post(metric_collection_endpoint.clone())
.json(&chunk_json)
.send()
.await;
let res = match res {
Ok(x) => x,
Err(err) => {
error!("failed to send metrics: {:?}", err);
continue;
}
};
if res.status().is_success() {
// update cached metrics after they were sent successfully
for send_metric in chunk {
let stop_time = match send_metric.kind {
EventType::Incremental { stop_time, .. } => stop_time,
_ => unreachable!(),
};
cached_metrics
.entry(Ids {
endpoint_id: send_metric.extra.endpoint_id.clone(),
})
// update cached value (add delta) and time
.and_modify(|e| {
e.0 += send_metric.value;
e.1 = stop_time
})
// cache new metric
.or_insert((send_metric.value, stop_time));
}
} else {
error!("metrics endpoint refused the sent metrics: {:?}", res);
}
}
Ok(())
}
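To make the caching rule above explicit: each iteration sends, per endpoint, either the absolute counter value (first observation, over a zero-length window) or the positive delta since the last successful send, with the window running from the previous send time to the current gather time; after a successful POST the cache is advanced by exactly the delta that was sent. A stand-alone sketch of that rule (the function and its signature are illustrative, not the proxy's API):

```rust
use chrono::{DateTime, Utc};

// Given the cached (value, time) and the freshly gathered (value, time) for one
// endpoint, return (delta, start_time, stop_time) to send, or None if nothing changed.
fn incremental_delta(
    cached: Option<(u64, DateTime<Utc>)>,
    current: (u64, DateTime<Utc>),
) -> Option<(u64, DateTime<Utc>, DateTime<Utc>)> {
    let (curr_val, curr_time) = current;
    match cached {
        // First observation: report the absolute value over a zero-length window.
        None => Some((curr_val, curr_time, curr_time)),
        // Counter grew: report only the delta, spanning prev_time..curr_time.
        Some((prev_val, prev_time)) if curr_val > prev_val => {
            Some((curr_val - prev_val, prev_time, curr_time))
        }
        // No growth since the last send: nothing to report this round.
        Some(_) => None,
    }
}
```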

View File

@@ -1,48 +1,48 @@
[package]
name = "safekeeper"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[dependencies]
async-stream = "0.3"
anyhow = "1.0"
async-trait = "0.1"
byteorder = "1.4.3"
bytes = "1.0.1"
clap = { version = "4.0", features = ["derive"] }
const_format = "0.2.21"
crc32c = "0.6.0"
fs2 = "0.4.3"
git-version = "0.3.5"
hex = "0.4.3"
humantime = "2.1.0"
hyper = "0.14"
nix = "0.25"
once_cell = "1.13.0"
parking_lot = "0.12.1"
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
regex = "1.4.5"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "2.0"
signal-hook = "0.3.10"
thiserror = "1"
tokio = { version = "1.17", features = ["macros", "fs"] }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="43e6db254a97fdecbce33d8bc0890accfd74495e" }
toml_edit = { version = "0.14", features = ["easy"] }
tracing = "0.1.27"
url = "2.2.2"
async-stream.workspace = true
anyhow.workspace = true
async-trait.workspace = true
byteorder.workspace = true
bytes.workspace = true
clap = { workspace = true, features = ["derive"] }
const_format.workspace = true
crc32c.workspace = true
fs2.workspace = true
git-version.workspace = true
hex.workspace = true
humantime.workspace = true
hyper.workspace = true
nix.workspace = true
once_cell.workspace = true
parking_lot.workspace = true
postgres.workspace = true
postgres-protocol.workspace = true
regex.workspace = true
serde.workspace = true
serde_json.workspace = true
serde_with.workspace = true
signal-hook.workspace = true
thiserror.workspace = true
tokio = { workspace = true, features = ["fs"] }
tokio-postgres.workspace = true
toml_edit.workspace = true
tracing.workspace = true
url.workspace = true
metrics.workspace = true
postgres_ffi.workspace = true
pq_proto.workspace = true
remote_storage.workspace = true
safekeeper_api.workspace = true
storage_broker.workspace = true
utils.workspace = true
metrics = { path = "../libs/metrics" }
postgres_ffi = { path = "../libs/postgres_ffi" }
pq_proto = { path = "../libs/pq_proto" }
remote_storage = { path = "../libs/remote_storage" }
safekeeper_api = { path = "../libs/safekeeper_api" }
storage_broker = { version = "0.1", path = "../storage_broker" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
workspace_hack.workspace = true
[dev-dependencies]
tempfile = "3.2"
tempfile.workspace = true

View File

@@ -466,7 +466,7 @@ impl Timeline {
Ok(_) => Ok(()),
Err(e) => {
// Bootstrap failed, cancel timeline and remove timeline directory.
self.cancel();
self.cancel(shared_state);
if let Err(fs_err) = std::fs::remove_dir_all(&self.timeline_dir) {
warn!(
@@ -487,20 +487,23 @@ impl Timeline {
shared_state: &mut MutexGuard<SharedState>,
) -> Result<(bool, bool)> {
let was_active = shared_state.active;
self.cancel();
self.cancel(shared_state);
let dir_existed = delete_dir(&self.timeline_dir)?;
Ok((dir_existed, was_active))
}
/// Cancel timeline to prevent further usage. Background tasks will stop
/// eventually after receiving cancellation signal.
fn cancel(&self) {
info!("Timeline {} is cancelled", self.ttid);
fn cancel(&self, shared_state: &mut MutexGuard<SharedState>) {
info!("timeline {} is cancelled", self.ttid);
let _ = self.cancellation_tx.send(true);
let res = self.wal_backup_launcher_tx.blocking_send(self.ttid);
if let Err(e) = res {
error!("Failed to send stop signal to wal_backup_launcher: {}", e);
}
// Close associated FDs. Nobody will be able to touch timeline data once
// it is cancelled, so WAL storage won't be opened again.
shared_state.sk.wal_store.close();
}
/// Returns if timeline is cancelled.
@@ -537,10 +540,6 @@ impl Timeline {
/// De-register compute connection, shutting down timeline activity if
/// pageserver doesn't need catchup.
pub fn on_compute_disconnect(&self) -> Result<()> {
if self.is_cancelled() {
bail!(TimelineError::Cancelled(self.ttid));
}
let is_wal_backup_action_pending: bool;
{
let mut shared_state = self.write_shared_state();

View File

@@ -55,6 +55,12 @@ pub trait Storage {
/// that without timeline lock.
fn remove_up_to(&self) -> Box<dyn Fn(XLogSegNo) -> Result<()>>;
/// Release resources associated with the storage -- technically, close FDs.
/// Currently we don't remove timelines until restart (#3146), so we need to
/// spare descriptors. This would be useful for temporary tli detach as
/// well.
fn close(&mut self) {}
/// Get metrics for this timeline.
fn get_metrics(&self) -> WalStorageMetrics;
}
@@ -401,6 +407,11 @@ impl Storage for PhysicalStorage {
})
}
fn close(&mut self) {
// Dropping the taken handle closes the underlying file descriptor.
let _open_file = self.file.take();
}
fn get_metrics(&self) -> WalStorageMetrics {
self.metrics.clone()
}
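Taken together, the two hunks in this file add a default no-op `close` to the `Storage` trait and implement it for the file-backed storage by dropping the cached handle. A reduced sketch of that shape with simplified stand-in types (the real trait and struct carry more methods and fields):

```rust
use std::fs::File;

trait Storage {
    /// Release resources associated with the storage; the default is a no-op.
    fn close(&mut self) {}
}

struct PhysicalStorage {
    file: Option<File>,
}

impl Storage for PhysicalStorage {
    fn close(&mut self) {
        // Taking the handle out of the Option and letting it drop closes the FD.
        let _open_file = self.file.take();
    }
}
```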

View File

@@ -1,38 +1,38 @@
[package]
name = "storage_broker"
version = "0.1.0"
edition = "2021"
license = "Apache-2.0"
edition.workspace = true
license.workspace = true
[features]
bench = []
[dependencies]
anyhow = "1.0"
async-stream = "0.3"
bytes = "1.0"
clap = { version = "4.0", features = ["derive"] }
const_format = "0.2.21"
futures = "0.3"
futures-core = "0.3"
futures-util = "0.3"
git-version = "0.3.5"
humantime = "2.1.0"
hyper = {version = "0.14.14", features = ["full"]}
once_cell = "1.13.0"
parking_lot = "0.12"
prost = "0.11"
tonic = {version = "0.8", features = ["tls", "tls-roots"]}
tokio = { version = "1.0", features = ["macros", "rt-multi-thread"] }
tokio-stream = "0.1"
tracing = "0.1.27"
anyhow.workspace = true
async-stream.workspace = true
bytes.workspace = true
clap = { workspace = true, features = ["derive"] }
const_format.workspace = true
futures.workspace = true
futures-core.workspace = true
futures-util.workspace = true
git-version.workspace = true
humantime.workspace = true
hyper = { workspace = true, features = ["full"] }
once_cell.workspace = true
parking_lot.workspace = true
prost.workspace = true
tonic.workspace = true
tokio = { workspace = true, features = ["rt-multi-thread"] }
tokio-stream.workspace = true
tracing.workspace = true
metrics.workspace = true
utils.workspace = true
metrics = { path = "../libs/metrics" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
workspace_hack.workspace = true
[build-dependencies]
tonic-build = "0.8"
tonic-build.workspace = true
[[bench]]
name = "rps"

View File

@@ -22,6 +22,7 @@ from itertools import chain, product
from pathlib import Path
from types import TracebackType
from typing import Any, Dict, Iterator, List, Optional, Tuple, Type, Union, cast
from urllib.parse import urlparse
import asyncpg
import backoff # type: ignore
@@ -1205,6 +1206,9 @@ class PageserverHttpClient(requests.Session):
return res_json
def tenant_size(self, tenant_id: TenantId) -> int:
return self.tenant_size_and_modelinputs(tenant_id)[0]
def tenant_size_and_modelinputs(self, tenant_id: TenantId) -> Tuple[int, Dict[str, Any]]:
"""
Returns the tenant size, together with the model inputs as the second tuple item.
"""
@@ -1215,9 +1219,9 @@ class PageserverHttpClient(requests.Session):
assert TenantId(res["id"]) == tenant_id
size = res["size"]
assert type(size) == int
# there are additional inputs, which are the collected raw information before being fed to the tenant_size_model
# there are no tests for those right now.
return size
inputs = res["inputs"]
assert type(inputs) is dict
return (size, inputs)
def timeline_list(
self,
@@ -1350,11 +1354,18 @@ class PageserverHttpClient(requests.Session):
assert res_json is None
def timeline_spawn_download_remote_layers(
self, tenant_id: TenantId, timeline_id: TimelineId
self,
tenant_id: TenantId,
timeline_id: TimelineId,
max_concurrent_downloads: int,
) -> dict[str, Any]:
body = {
"max_concurrent_downloads": max_concurrent_downloads,
}
res = self.post(
f"http://localhost:{self.port}/v1/tenant/{tenant_id}/timeline/{timeline_id}/download_remote_layers",
json=body,
)
self.verbose_error(res)
res_json = res.json()
@@ -1388,10 +1399,13 @@ class PageserverHttpClient(requests.Session):
self,
tenant_id: TenantId,
timeline_id: TimelineId,
max_concurrent_downloads: int,
errors_ok=False,
at_least_one_download=True,
):
res = self.timeline_spawn_download_remote_layers(tenant_id, timeline_id)
res = self.timeline_spawn_download_remote_layers(
tenant_id, timeline_id, max_concurrent_downloads
)
while True:
completed = self.timeline_poll_download_remote_layers_status(
tenant_id, timeline_id, res, poll_state="Completed"
@@ -2323,6 +2337,8 @@ class NeonProxy(PgProtocol):
http_port: int,
mgmt_port: int,
auth_backend: NeonProxy.AuthBackend,
metric_collection_endpoint: Optional[str] = None,
metric_collection_interval: Optional[str] = None,
):
host = "127.0.0.1"
super().__init__(dsn=auth_backend.default_conn_url, host=host, port=proxy_port)
@@ -2333,6 +2349,8 @@ class NeonProxy(PgProtocol):
self.proxy_port = proxy_port
self.mgmt_port = mgmt_port
self.auth_backend = auth_backend
self.metric_collection_endpoint = metric_collection_endpoint
self.metric_collection_interval = metric_collection_interval
self._popen: Optional[subprocess.Popen[bytes]] = None
def start(self) -> NeonProxy:
@@ -2344,6 +2362,16 @@ class NeonProxy(PgProtocol):
*["--mgmt", f"{self.host}:{self.mgmt_port}"],
*self.auth_backend.extra_args(),
]
if (
self.metric_collection_endpoint is not None
and self.metric_collection_interval is not None
):
args += [
*["--metric-collection-endpoint", self.metric_collection_endpoint],
*["--metric-collection-interval", self.metric_collection_interval],
]
self._popen = subprocess.Popen(args)
self._wait_until_ready()
return self
@@ -2357,6 +2385,25 @@ class NeonProxy(PgProtocol):
request_result.raise_for_status()
return request_result.text
@staticmethod
def get_session_id(uri_prefix, uri_line):
assert uri_prefix in uri_line
url_parts = urlparse(uri_line)
psql_session_id = url_parts.path[1:]
assert psql_session_id.isalnum(), "session_id should only contain alphanumeric chars"
return psql_session_id
@staticmethod
async def find_auth_link(link_auth_uri, proc):
for _ in range(100):
line = (await proc.stderr.readline()).decode("utf-8").strip()
log.info(f"psql line: {line}")
if link_auth_uri in line:
log.info(f"SUCCESS, found auth url: {line}")
return line
def __enter__(self) -> NeonProxy:
return self
@@ -2371,6 +2418,46 @@ class NeonProxy(PgProtocol):
# it's a child process. This is mostly to clean up in between different tests.
self._popen.kill()
@staticmethod
async def activate_link_auth(
local_vanilla_pg, proxy_with_metric_collector, psql_session_id, create_user=True
):
pg_user = "proxy"
if create_user:
log.info("creating a new user for link auth test")
local_vanilla_pg.start()
local_vanilla_pg.safe_psql(f"create user {pg_user} with login superuser")
db_info = json.dumps(
{
"session_id": psql_session_id,
"result": {
"Success": {
"host": local_vanilla_pg.default_options["host"],
"port": local_vanilla_pg.default_options["port"],
"dbname": local_vanilla_pg.default_options["dbname"],
"user": pg_user,
"aux": {
"project_id": "test_project_id",
"endpoint_id": "test_endpoint_id",
"branch_id": "test_branch_id",
},
}
},
}
)
log.info("sending session activation message")
psql = await PSQL(
host=proxy_with_metric_collector.host,
port=proxy_with_metric_collector.mgmt_port,
).run(db_info)
assert psql.stdout is not None
out = (await psql.stdout.read()).decode("utf-8").strip()
assert out == "ok"
@pytest.fixture(scope="function")
def link_proxy(port_distributor: PortDistributor, neon_binpath: Path) -> Iterator[NeonProxy]:

View File

@@ -144,3 +144,42 @@ def test_tpch(query: LabelledQuery, remote_compare: RemoteCompare):
"""
run_psql(remote_compare, query, times=1)
@pytest.mark.remote_cluster
def test_user_examples(remote_compare: RemoteCompare):
query = LabelledQuery(
"Q1",
r"""
SELECT
v20.c2263 AS v1,
v19.c2484 AS v2,
DATE_TRUNC('month', v18.c37)::DATE AS v3,
(ARRAY_AGG(c1840 order by v18.c37))[1] AS v4,
(ARRAY_AGG(c1841 order by v18.c37 DESC))[1] AS v5,
SUM(v17.c1843) AS v6,
SUM(v17.c1844) AS v7,
SUM(v17.c1848) AS v8,
SUM(v17.c1845) AS v9,
SUM(v17.c1846) AS v10,
SUM(v17.c1861) AS v11,
SUM(v17.c1860) AS v12,
SUM(v17.c1869) AS v13,
SUM(v17.c1856) AS v14,
SUM(v17.c1855) AS v15,
SUM(v17.c1854) AS v16
FROM
s3.t266 v17
INNER JOIN s1.t41 v18 ON v18.c34 = v17.c1836
INNER JOIN s3.t571 v19 ON v19.c2482 = v17.c1834
INNER JOIN s3.t331 v20 ON v20.c2261 = v17.c1835
WHERE
(v17.c1835 = 4) AND
(v18.c37 >= '2019-03-01') AND
(v17.c1833 = 2)
GROUP BY v1, v2, v3
ORDER BY v1, v2, v3
LIMIT 199;
""",
)
run_psql(remote_compare, query, times=3)

Some files were not shown because too many files have changed in this diff.