Compare commits

..

405 Commits

Author SHA1 Message Date
github-actions[bot]
98882548d8 Compute release 2025-07-25 07:05 UTC 2025-07-25 07:05:39 +00:00
github-actions[bot]
9ea6db2ee0 Compute release 2025-07-18 07:06 UTC 2025-07-18 07:06:04 +00:00
github-actions[bot]
29aa58f6a7 Compute release 2025-07-11 07:04 UTC 2025-07-11 07:04:29 +00:00
github-actions[bot]
262a00d053 Compute release 2025-07-04 07:04 UTC 2025-07-04 07:04:07 +00:00
github-actions[bot]
2d6b78b98e Compute release 2025-06-27 07:03 UTC 2025-06-27 07:03:41 +00:00
github-actions[bot]
745c0a0868 Compute release 2025-06-20 07:03 UTC 2025-06-20 07:03:23 +00:00
github-actions[bot]
da0cdc0185 Compute release 2025-06-13 07:03 UTC 2025-06-13 07:03:35 +00:00
github-actions[bot]
3aa92ee408 Compute release 2025-06-06 07:03 UTC 2025-06-06 07:03:44 +00:00
github-actions[bot]
60ae50f12b Compute release 2025-06-04 16:52 UTC 2025-06-04 16:53:00 +00:00
Tristan Partin
1823c91525 Use Url::join() when creating the final remote extension URL (#12121)
Url::to_string() adds a trailing slash on the base URL, so when we did
the format!(), we were adding a double forward slash.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-06-04 16:53:00 +00:00
Shockingly Good
c1ae29dea0 impr(compute): Remove the deprecated CLI arg alias for remote-ext-config. (#12087)
Also moves it from `String` to `Url`.
2025-06-04 16:52:59 +00:00
Shockingly Good
098fd689d1 fix(compute) Remove the hardcoded default value for PGXN HTTP URL. (#12030)
Removes the hardcoded value for the Postgres Extensions HTTP gateway URL
as it is always provided by the calling code.
2025-06-04 16:52:59 +00:00
github-actions[bot]
c70c95e293 Compute release 2025-05-30 07:03 UTC 2025-05-30 07:03:34 +00:00
Heikki Linnakangas
0b63de521c Compute release 2025-05-28 19:41 UTC 2025-05-28 22:44:20 +03:00
Heikki Linnakangas
15fd43e0f6 Revert "Add online_advisor extension (#11898)"
This reverts commit 8ff25dca8e.

There was a CI failure on the online-advistor tests. While that's been
investigated, don't include it.
2025-05-28 22:43:56 +03:00
github-actions[bot]
79c4ae0508 Compute release 2025-05-22 19:41 UTC 2025-05-22 19:41:35 +00:00
github-actions[bot]
2edf987d93 Compute release 2025-05-22 15:53 UTC 2025-05-22 15:53:46 +00:00
github-actions[bot]
cb04f050d6 Compute release 2025-05-09 07:03 UTC 2025-05-09 07:03:33 +00:00
github-actions[bot]
cd3e23b326 Compute release 2025-05-07 11:08 UTC 2025-05-07 11:08:27 +00:00
github-actions[bot]
ac993b9dcd Compute release 2025-05-01 11:10 UTC 2025-05-01 11:10:20 +00:00
github-actions[bot]
1c5c25156a Compute release 2025-04-25 2025-04-25 07:01:03 +00:00
github-actions[bot]
e11c4ca77f Compute release 2025-04-24 2025-04-24 15:06:09 +00:00
Tristan Partin
b5deda3e08 Add CPU architecture to the remote extensions object key (#11590)
ARM computes are incoming and we need to account for that in remote
extensions. Previously, we just blindly assumed that all computes were
x86_64.

Note that we use the Go architecture naming convention instead of the
Rust one directly to do our best and be consistent across the stack.

Part-of: https://github.com/neondatabase/cloud/issues/23148

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-24 18:03:52 +03:00
Konstantin Knizhnik
e6b7589ba4 Undo commit d1728a6bcd because it causes problems with creating pg_search extension (#11700)
## Problem

See https://neondb.slack.com/archives/C03H1K0PGKH/p1745489241982209

pg_search extension now can not be created.

## Summary of changes

Undo d1728a6bcd

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-04-24 18:03:52 +03:00
github-actions[bot]
d57a924254 Compute release 2025-04-22 2025-04-22 12:15:11 +00:00
Tristan Partin
38af7f8db7 Compute release 2025-04-16 2025-04-16 13:10:41 -05:00
Tristan Partin
fe4762a9aa Consolidate compute_ctl configuration structures (#11514)
Previously, the structure of the spec file was just the compute spec.
However, the response from the control plane get spec request included
the compute spec and the compute_ctl config. This divergence was
hindering other work such as adding regression tests for compute_ctl
HTTP authorization.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-04-16 13:09:26 -05:00
Anastasia Lubennikova
817ec9979f compute: fix copy-paste typo for neon GUC parameters check (#11610)
fix for commit
[5063151](5063151271)
2025-04-16 11:31:24 -05:00
github-actions[bot]
dbf160dc60 Compute release 2025-04-11 2025-04-11 07:01:04 +00:00
github-actions[bot]
0d1586bab7 Compute release 2025-04-04 2025-04-04 07:01:04 +00:00
github-actions[bot]
aa5c6c9bdd Compute release 2025-04-03 2025-04-03 11:36:20 +00:00
Alexander Bayandin
c7daf2b1e3 Compute: update plv8 patch (#11426)
## Problem

https://github.com/neondatabase/cloud/issues/26866

## Summary of changes
- Update plv8 patch

Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
2025-04-03 12:33:37 +01:00
github-actions[bot]
b347438d0e Compute release 2025-04-02 2025-04-02 12:07:44 +00:00
github-actions[bot]
81a35b67e2 Compute release 2025-03-21 2025-03-21 07:02:05 +00:00
github-actions[bot]
ed21175591 Compute release 2025-03-17 2025-03-17 10:12:32 +00:00
github-actions[bot]
9985dfd26a Compute release 2025-03-16 2025-03-16 20:10:56 +00:00
Matthias van de Meent
a009203816 Compute release 2025-03-07
GitHub PR neondatabase/neon#11129
2025-03-10 14:42:44 +01:00
github-actions[bot]
8542507ee5 Compute release 2025-03-07 2025-03-07 07:00:52 +00:00
Konstantin Knizhnik
a3e9140788 Merge pull request #11039 from neondatabase/rc/release-compute/2025-02-28
Compute release 2025-02-28
2025-03-05 14:12:13 +02:00
github-actions[bot]
0d3f7a2b82 Compute release 2025-02-28 2025-02-28 11:18:46 +00:00
Tristan Partin
bcfc633bfa Merge pull request #10952 from neondatabase/rc/release-compute/2025-02-24
Compute release 2025-02-24
2025-02-24 13:09:01 -06:00
github-actions[bot]
33e5930c97 Compute release 2025-02-24 2025-02-24 15:34:43 +00:00
Heikki Linnakangas
fff386261d Merge pull request #10908 from neondatabase/disable-pg_duckdb
hotfix: Temporarily disable pg_duckdb
2025-02-20 22:18:50 +02:00
Heikki Linnakangas
723f9ad3ee Temporarily disable pg_duckdb
It clashed with pg_mooncake
2025-02-20 18:26:25 +02:00
Heikki Linnakangas
2b7243bd37 Merge pull request #10875 from neondatabase/rc/release-compute/2025-02-18
Compute release 2025-02-18
2025-02-19 20:57:42 +02:00
github-actions[bot]
81367a6bbc Compute release 2025-02-18 2025-02-18 16:48:02 +00:00
Alexey Kondratov
156c18e1ad Merge pull request #10713 from neondatabase/rc/release-compute/2025-02-07
Compute release 2025-02-07
2025-02-10 11:59:01 +01:00
github-actions[bot]
ffc1a81b83 Compute release 2025-02-07 2025-02-07 07:00:57 +00:00
Tristan Partin
dd04e3eb11 Merge pull request #10603 from neondatabase/rc/release-compute/2025-01-31
Compute release 2025-01-31
2025-02-03 16:12:46 -06:00
github-actions[bot]
6d9846a9e5 Compute release 2025-01-31 2025-01-31 07:00:54 +00:00
Alexey Kondratov
3cd601b370 Merge pull request #10501 from neondatabase/rc/release-compute/2025-01-24
Compute release 2025-01-24
2025-01-24 16:22:29 +01:00
github-actions[bot]
44ef8c884f Compute release 2025-01-24 2025-01-24 07:00:43 +00:00
Konstantin Knizhnik
c68b3464da Merge pull request #10467 from neondatabase/compute-rc-2025-01-21
Compute release 2025-01-21
2025-01-22 08:57:01 +02:00
Tristan Partin
045b05cd1b Merge pull request #10408 from neondatabase/rc/release-compute/2025-01-15 2025-01-15 14:27:34 -06:00
github-actions[bot]
6a4d8ec410 Compute release 2025-01-15 2025-01-15 17:59:13 +00:00
Tristan Partin
f23390cf0e Merge pull request #10354 from neondatabase/rc/release-compute/2025-01-10 2025-01-10 22:00:23 -06:00
github-actions[bot]
ebc313c768 Compute release 2025-01-10 2025-01-10 19:55:05 +00:00
Heikki Linnakangas
441517dd7c Merge pull request #10288 from neondatabase/rc/release-compute/2025-01-07
Compute release 2025-01-07
2025-01-08 14:10:28 +02:00
JC Grünhage
31bd2dcdb4 Fix promote-images-prod after splitting it out (#10292)
## Problem
`promote-images` was split into `promote-images-dev` and
`promote-images-prod` in
https://github.com/neondatabase/neon/pull/10267.

`dev` credentials were loaded in `promote-images-dev` and `prod`
credentials were loaded in `promote-images-prod`, but
`promote-images-prod` needs `dev` credentials as well to access the
`dev` images to replicate them from `dev` to `prod`.

## Summary of changes
Load `dev` credentials in `promote-images-prod` as well.
2025-01-07 16:31:29 +00:00
github-actions[bot]
6292d93867 Compute release 2025-01-07 2025-01-07 10:48:11 +00:00
Arpad Müller
671889b0e9 Merge pull request #10133 from neondatabase/rc/release/2024-12-13
Storage release 2024-12-13
2024-12-13 13:08:40 +01:00
github-actions[bot]
aeb79d1bb6 Storage release 2024-12-13 2024-12-13 06:02:24 +00:00
Vlad Lazar
5525abdadb Merge pull request #10087 from neondatabase/vlad/cherry-pick-multixact-truncation-fix
storage: cherry-pick SLRU, metrics and sharded ingest fixes into the release branch
2024-12-11 16:02:54 +00:00
Christian Schwarz
c4ce4ac25a page_service: don't count time spent in Batcher towards smgr latency metrics (#10075)
## Problem

With pipelining enabled, the time a request spends in the batcher stage
counts towards the smgr op latency.

If pipelining is disabled, that time is not accounted for.

In practice, this results in a jump in smgr getpage latencies in various
dashboards and degrades the internal SLO.

## Solution

In a similar vein to #10042 and with a similar rationale, this PR stops
counting the time spent in batcher stage towards smgr op latency.

The smgr op latency metric is reduced to the actual execution time.

Time spent in batcher stage is tracked in a separate histogram.
I expect to remove that histogram after batching rollout is complete,
but it will be helpful in the meantime to reason about the rollout.
2024-12-11 14:48:54 +01:00
Vlad Lazar
fde1046278 wal_decoder: fix compact key protobuf encoding (#10074)
## Problem

Protobuf doesn't support 128 bit integers, so we encode the keys as two
64 bit integers. Issue is that when we split the 128 bit compact key we
use signed 64 bit integers to represent the two halves. This may result
in a negative lower half when relnode is larger than `0x00800000`. When
we convert the lower half to an i128 we get a negative `CompactKey`.

## Summary of Changes

Use unsigned integers when encoding into Protobuf.

## Deployment

* Prod: We disabled the interpreted proto, so no compat concerns.
* Staging: Disable the interpreted proto, do one release, and then
release the fixed version.
We do this because a negative int32 will convert to a large uint32 value
and could give
a key in the actual pageserver space. In production we would around this
by adding new
fields to the proto and deprecating the old ones, but we can make our
lives easy here.
* Pre-prod: Same as staging
2024-12-11 14:48:45 +01:00
Vlad Lazar
fcfd1c7d0a pageserver: don't drop multixact slrus on non zero shards 2024-12-11 13:41:35 +01:00
Alex Chi Z.
2455dca403 Merge pull request #10081 from neondatabase/skyzh/cherry-pick-fix
pageserver: fix CLog truncate walingest
2024-12-10 22:53:46 -05:00
John Spray
bc6354921f pageserver: fix CLog truncate walingest 2024-12-10 22:30:25 -05:00
Vlad Lazar
7ac2a5560f Merge pull request #10060 from neondatabase/vlad/manual-release-2024-12-09
Manual storage release 2024-12-09
2024-12-09 18:14:40 +00:00
Vlad Lazar
5f4559ecd2 Merge pull request #10053 from neondatabase/rc/release/2024-12-09
Storage release 2024-12-09
2024-12-09 12:28:51 +00:00
github-actions[bot]
6c349e76d9 Storage release 2024-12-09 2024-12-09 06:05:40 +00:00
John Spray
73ad44ae25 Merge pull request #9959 from neondatabase/rc/release/2024-12-02
Storage & Compute release 2024-12-02
2024-12-02 12:19:16 +00:00
github-actions[bot]
304af5c9e3 Storage & Compute release 2024-12-02 2024-12-02 06:05:37 +00:00
Heikki Linnakangas
1ca9b56faf Merge pull request #9935 from neondatabase/compute-rc-2024-11-28
Compute release 2024-11-28
2024-11-29 09:58:00 +02:00
Christian Schwarz
23e579d01f Merge pull request #9881 from neondatabase/rc/release/2024-11-25--2
Fixup Storage & Compute Release 2024-11-25
2024-11-25 16:26:02 +01:00
Christian Schwarz
166f33f96b Fixup Storage & Compute Release 2024-11-25 2024-11-25 16:19:36 +01:00
Christian Schwarz
aada2ee61a Merge pull request #9869 from neondatabase/rc/release/2024-11-25
Storage & Compute release 2024-11-25
2024-11-25 12:59:32 +01:00
github-actions[bot]
0fc6f6af8e Storage & Compute release 2024-11-25 2024-11-25 06:05:23 +00:00
Arseny Sher
1388bbae73 Merge pull request #9783 from neondatabase/rc/2024-11-18
Storage & Compute release 2024-11-18
2024-11-18 12:22:58 +03:00
Alexey Kondratov
6dba1a36b8 Merge pull request #9745 from neondatabase/compute-release-2024-11-13
Compute release 2024-11-13

Includes Postgres minor version upgrades and
various other bugfixes and improvements.
2024-11-13 19:11:15 +01:00
Alex Chi Z.
61ff18dbae Merge pull request #9721 from neondatabase/skyzh/locale-changes
cherry-pick Clean up C.UTF-8 locale changes
2024-11-11 14:29:57 -05:00
Tristan Partin
96d66a201d Clean up C.UTF-8 locale changes
Removes some unnecessary initdb arguments, and fixes Neon for MacOS
since it doesn't seem to ship a C.UTF-8 locale.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-11 14:10:30 -05:00
Alex Chi Z.
b24850bdb5 Merge pull request #9710 from neondatabase/rc/2024-11-11
Storage & Compute release 2024-11-11
2024-11-11 11:05:41 -05:00
Alex Chi Z.
04f91eea45 fix(pageserver): increase frozen layer warning threshold; ignore in tests (#9705)
Perf benchmarks produce a lot of layers.

## Summary of changes

Bumping the threshold and ignore the warning.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-11 09:15:15 -05:00
Arpad Müller
8e4161eb94 Merge pull request #9617 from neondatabase/rc/2024-11-04
Storage & Compute release 2024-11-04
2024-11-04 17:50:29 +01:00
Anastasia Lubennikova
e369c58a3c Merge pull request #9577 from neondatabase/compute-hotfix-2024-10-30
Compute hotfix release 2024-10-30
2024-10-30 12:25:46 +00:00
Alexey Kondratov
237d6ffc02 chore(compute): Bump pg_mooncake to the latest version
The topmost commit in the `neon` branch at the time of writing this
https://github.com/Mooncake-Labs/pg_mooncake/commits/neon/
568b5a82b5
2024-10-29 23:12:30 +01:00
Anastasia Lubennikova
93f7f1d10f Merge pull request #9573 from neondatabase/releases/2024-10-29-compute-only-2
Compute release 2024-10-29
2024-10-29 18:53:03 +00:00
Yuchen Liang
cf8646da19 Merge pull request #9528 from neondatabase/rc/2024-10-25
Storage & Compute release 2024-10-25
2024-10-25 16:49:34 -04:00
Yuchen Liang
46e9a472d7 Merge branch 'release' into rc/2024-10-25 2024-10-25 16:41:06 -04:00
Alexey Kondratov
c4e5693145 Merge pull request #9476 from neondatabase/tristan957/auth
Compute release 2024-10-22
2024-10-22 12:07:19 +02:00
David Gomes
2b3cc87a2a chore(compute): bumps pg_session_jwt to latest version (#9474) 2024-10-21 18:17:38 -06:00
Alexey Kondratov
fe1b181fb1 Merge pull request #9459 from neondatabase/compute-rc-2024-10-20
Compute release 2024-10-20
2024-10-20 16:12:37 +02:00
Anastasia Lubennikova
7f080da9d8 Merge pull request #9451 from neondatabase/releases/2024-10-17-compute-kq-only
Releases/2024 10 17 compute kq only
2024-10-18 16:19:33 +01:00
Vlad Lazar
ec94acdf03 Merge pull request #9372 from neondatabase/rc/2024-10-14
Storage & Compute release 2024-10-14
2024-10-14 14:25:09 +01:00
Arseny Sher
2613769ca7 Merge pull request #9291 from neondatabase/rc/2024-10-07
Storage & Compute release 2024-10-07
2024-10-07 18:20:22 +03:00
Anastasia Lubennikova
a33e1d12fb Merge pull request #9249 from neondatabase/releases/2024-10-02-compute-only
Compute release 2024-10-02 (2)
2024-10-03 10:15:52 +01:00
Anastasia Lubennikova
5cabf32dae Merge pull request #9228 from neondatabase/releases/2024-10-01-compute-only
Compute release 2024-10-02
2024-10-01 21:36:14 +01:00
John Spray
d3490dbfea Merge pull request #9196 from neondatabase/rc/2024-09-30
Storage & Compute release 2024-09-30
2024-09-30 10:04:42 +01:00
Anastasia Lubennikova
2b9fb47e64 Merge pull request #9151 from neondatabase/releases/2024-09-25-compute-only-2
Compute release 2024-09-25
2024-09-25 23:37:55 +01:00
Alexander Bayandin
7474790c80 CI(promote-images): fix prod ECR auth (#9131)
## Problem
Login to prod ECR doesn't work anymore:
```
Retrieving registries data through *** SDK...
*** ECR detected with eu-central-1 region
Error: The security token included in the request is invalid.
```
Ref
https://github.com/neondatabase/neon/actions/runs/11015238522/job/30592994281

Tested
https://github.com/neondatabase/neon/actions/runs/11017690614/job/30596213259#step:5:18
(on https://github.com/neondatabase/neon/commit/aae6182ff)

## Summary of changes
- Fix login to prod ECR by using `aws-actions/configure-aws-credentials`
2024-09-24 18:34:56 +02:00
Arpad Müller
db1e3ff9f4 Merge pull request #9095 from neondatabase/rc/2024-09-23
Storage & Compute release 2024-09-23
2024-09-24 15:51:27 +02:00
Christian Schwarz
ec0550e8ce Merge pull request #9085 from neondatabase/releases/2024-09-20-hotfix
storage hotfix release 2024-09-20

This storage hotfix release adds valuable metrics to pageserver.

We will only deploy this hotfix manually to a dedicated pageserver that is currently empty.

Context https://neondb.slack.com/archives/C07MU9ES6NP/p1726827244185729

Created using

```
git switch -c releases/2024-09-20-hotfix
git reset --hard origin/release
git merge ec5dce04eb
```
2024-09-20 21:09:43 +02:00
Christian Schwarz
126cbd2e8b Merge commit 'ec5dce04ebfa51b727dfc9bc04ebb1e68aef6434' into releases/2024-09-20-hotfix 2024-09-20 18:51:08 +00:00
Joonas Koivunen
6ceaca96e5 Merge pull request #9005 from neondatabase/rc/2024-09-16
Storage & Compute release 2024-09-16
2024-09-16 15:35:22 +03:00
Christian Schwarz
2f0b3e7ae2 Merge pull request #8959 from neondatabase/rc/2024-09-07
Storage release 2024-09-07
2024-09-07 15:09:13 +02:00
Alex Chi Z.
b5d41eaff4 Merge pull request #8883 from neondatabase/rc/2024-09-02
Storage & Compute release 2024-09-02
2024-09-02 23:15:52 +08:00
Anastasia Lubennikova
aa8c5d1ee9 Merge pull request #8858 from neondatabase/releases/2024-08-28-compute-only
Compute release 2024-08-28
2024-08-28 20:00:51 +01:00
Christian Schwarz
4355dba46c Merge pull request #8827 from neondatabase/rc/2024-08-26
Storage & Compute release 2024-08-26
2024-08-26 12:10:03 +02:00
Arseny Sher
cdd8014692 Merge pull request #8751 from neondatabase/rc/2024-08-19
Storage & Compute release 2024-08-19
2024-08-21 06:34:17 +03:00
Arseny Sher
c9491a5acb Merge pull request #8765 from neondatabase/rc/2024-08-12-fixed
Merge main into release with merge commit.

This is a no-op PR which will incorporate into release branch last commits from main under their original SHA to prevent merge conflicts when doing release.
2024-08-21 06:31:39 +03:00
John Spray
5090281b4a Merge pull request #8688 from neondatabase/rc/2024-08-12
Storage & Compute release 2024-08-12
2024-08-12 13:12:10 +01:00
dependabot[bot]
d69f79c7eb chore(deps): bump aiohttp from 3.9.4 to 3.10.2 (#8684) 2024-08-12 09:17:55 +01:00
Arpad Müller
c7c58eeab8 Also pass HOME env var in access_env_vars (#8685)
Noticed this while debugging a test failure in #8673 which only occurs
with real S3 instead of mock S3: if you authenticate to S3 via
`AWS_PROFILE`, then it requires the `HOME` env var to be set so that it
can read inside the `~/.aws` directory.

The scrubber abstraction `StorageScrubber::scrubber_cli` in
`neon_fixtures.py` would otherwise not work. My earlier PR #6556 has
done similar things for the `neon_local` wrapper.

You can try:

```
aws sso login --profile dev
export ENABLE_REAL_S3_REMOTE_STORAGE=y REMOTE_STORAGE_S3_BUCKET=neon-github-ci-tests REMOTE_STORAGE_S3_REGION=eu-central-1 AWS_PROFILE=dev
RUST_BACKTRACE=1 BUILD_TYPE=debug DEFAULT_PG_VERSION=16 ./scripts/pytest -vv --tb=short -k test_scrubber_tenant_snapshot
```

before and after this patch: this patch fixes it.
2024-08-12 09:17:55 +01:00
John Spray
66f86f184b Update docs/SUMMARY.md (#8665)
## Problem

This page had many dead links, and was confusing for folks looking for
documentation about our product.

Closes: https://github.com/neondatabase/neon/issues/8535

## Summary of changes

- Add a link to the product docs up top
- Remove dead/placeholder links
2024-08-12 09:17:55 +01:00
Alexander Bayandin
642aa1e160 Dockerfiles: remove cachepot (#8666)
## Problem
We install and try to use `cachepot`. But it is not configured correctly
and doesn't work (after https://github.com/neondatabase/neon/pull/2290)

## Summary of changes
- Remove `cachepot`
2024-08-12 09:17:55 +01:00
Vlad Lazar
494023f5df storcon: skip draining shard if it's secondary is lagging too much (#8644)
## Problem
Migrations of tenant shards with cold secondaries are holding up drains
in during production deployments.

## Summary of changes
If a secondary locations is lagging by more than 256MiB (configurable,
but that's the default), then skip cutting it over to the secondary as part of the node drain.
2024-08-12 09:17:55 +01:00
John Spray
e9a378d1aa pageserver: don't treat NotInitialized::Stopped as unexpected (#8675)
## Problem

This type of error can happen during shutdown & was triggering a circuit
breaker alert.

## Summary of changes

- Map NotIntialized::Stopped to CompactionError::ShuttingDown, so that
we may handle it cleanly
2024-08-12 09:17:55 +01:00
Alexander Bayandin
cbba8e3390 CI(pin-build-tools-image): fix permissions for Azure login (#8671)
## Problem

Azure login fails in `pin-build-tools-image` workflow because the job
doesn't have the required permissions.

```
Error: Please make sure to give write permissions to id-token in the workflow.
Error: Login failed with Error: Error message: Unable to get ACTIONS_ID_TOKEN_REQUEST_URL env variable. Double check if the 'auth-type' is correct. Refer to https://github.com/Azure/login#readme for more information.
```

## Summary of changes
- Add `id-token: write` permission to `pin-build-tools-image`
- Add an input to force image tagging
- Unify pushing to Docker Hub with other registries
- Split the job into two to have less if's
2024-08-12 09:17:55 +01:00
Alex Chi Z.
f8c0da43b5 fix(neon): disable create tablespace stmt (#8657)
part of https://github.com/neondatabase/neon/issues/8653

Disable create tablespace stmt. It turns out it requires much less
effort to do the regress test mode flag than patching the test cases,
and given that we might need to support tablespaces in the future, I
decided to add a new flag `regress_test_mode` to change the behavior of
create tablespace.

Tested manually that without setting regress_test_mode, create
tablespace will be rejected.



---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-08-12 09:17:55 +01:00
Conrad Ludgate
9dfed93f70 Revert "proxy: update tokio-postgres to allow arbitrary config params (#8076)" (#8654)
This reverts #8076 - which was already reverted from the release branch
since forever (it would have been a breaking change to release for all
users who currently set TimeZone options). It's causing conflicts now so
we should revert it here as well.
2024-08-12 09:17:55 +01:00
Peter Bendel
a8eebdb072 Run a subset of benchmarking job steps on GitHub action runners in Azure - closer to the system under test (#8651)
## Problem

Latency from one cloud provider to another one is higher than within the
same cloud provider.
Some of our benchmarks are latency sensitive - we run a pgbench or psql
in the github action runner and the system under test is running in Neon
(database project).
For realistic perf tps and latency results we need to compare apples to
apples and run the database client in the same "latency distance" for
all tests.

## Summary of changes

Move job steps that test Neon databases deployed on Azure into Azure
action runners.
- bench strategy variant using azure database
- pgvector strategy variant using azure database
- pgbench-compare strategy variants using azure database

## Test run

https://github.com/neondatabase/neon/actions/runs/10314848502
2024-08-12 09:17:55 +01:00
Alexander Bayandin
af8c865903 Dockerfiles: fix LegacyKeyValueFormat & JSONArgsRecommended (#8664)
## Problem
CI complains in all PRs:
```
"ENV key=value" should be used instead of legacy "ENV key value" format 
```
https://docs.docker.com/reference/build-checks/legacy-key-value-format/

See 
- https://github.com/neondatabase/neon/pull/8644/files ("Unchanged files
with check annotations" section)
- https://github.com/neondatabase/neon/actions/runs/10304090562?pr=8644
("Annotations" section)


## Summary of changes
- Use `ENV key=value` instead of `ENV key value` in all Dockerfiles
2024-08-12 09:17:55 +01:00
Alexander Bayandin
c725a3e4b1 CI(build-tools): update Rust, Python, Mold (#8667)
## Problem
- Rust 1.80.1 has been released:
https://blog.rust-lang.org/2024/08/08/Rust-1.80.1.html
- Python 3.9.19 has been released:
https://www.python.org/downloads/release/python-3919/
- Mold 2.33.0 has been released:
https://github.com/rui314/mold/releases/tag/v2.33.0
- Unpinned `cargo-deny` in `build-tools` got updated to the latest
version and doesn't work anymore with the current config file

## Summary of changes
- Bump Rust to 1.80.1
- Bump Python to 3.9.19
- Bump Mold to 2.33.0 
- Pin `cargo-deny`, `cargo-hack`, `cargo-hakari`, `cargo-nextest`,
`rustfilt` versions
- Update `deny.toml` to the latest format, see
https://github.com/EmbarkStudios/cargo-deny/pull/611
2024-08-12 09:17:55 +01:00
John Spray
857ad70b71 tests: don't require kafka client for regular tests (#8662)
## Problem

We're adding more third party dependencies to support more diverse +
realistic test cases in `test_runner/logical_repl`. I ❤️ these
tests, they are a good thing.

The slight glitch is that python packaging is hard, and some third party
python packages have issues. For example the current kafka dependency
doesn't work on latest python. We can mitigate that by only importing
these more specialized dependencies in the tests that use them.

## Summary of changes

- Move the `kafka` import into a test body, so that folks running the
regular `test_runner/regress` tests don't have to have a working kafka
client package.
2024-08-12 09:17:55 +01:00
John Spray
56077caaf9 pageserver: remove paranoia double-calculation of retain_lsns (#8617)
## Problem

This code was to mitigate risk in
https://github.com/neondatabase/neon/pull/8427

As expected, we did not hit this code path - the new continuous updates
of gc_info are working fine, we can remove this code now.

## Summary of changes

- Remove block that double-checks retain_lsns
2024-08-12 09:17:55 +01:00
Joonas Koivunen
552832b819 fix: stop leaking BackgroundPurges (#8650)
avoid "leaking" the completions of BackgroundPurges by:

1. switching it to TaskTracker for provided close+wait
2. stop using tokio::fs::remove_dir_all which will consume two units of
memory instead of one blocking task

Additionally, use more graceful shutdown in tests which do actually some
background cleanup.
2024-08-12 09:17:55 +01:00
Joonas Koivunen
48ae1214c5 fix(test): do not fail test for filesystem race (#8643)
evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8632/10287641784/index.html#suites/0e58fb04d9998963e98e45fe1880af7d/c7a46335515142b/
2024-08-12 09:17:55 +01:00
Konstantin Knizhnik
2a210d4c58 Use sycnhronous commit for logical replicaiton worker (#8645)
## Problem

See
https://neondb.slack.com/archives/C03QLRH7PPD/p1723038557449239?thread_ts=1722868375.476789&cid=C03QLRH7PPD


Logical replication subscription by default use `synchronous_commit=off`
which cause problems with safekeeper

## Summary of changes

Set `synchronous_commit=on` for logical replication subscription in
test_subscriber_restart.py

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-08-12 09:17:54 +01:00
John Spray
acaacd4680 pageserver: make bench_ingest build (but panic) on macOS (#8641)
## Problem

Some developers build on MacOS, which doesn't have  io_uring.

## Summary of changes

- Add `io_engine_for_bench`, which on linux will give io_uring or panic
if it's unavailable, and on MacOS will always panic.

We do not want to run such benchmarks with StdFs: the results aren't
interesting, and will actively waste the time of any developers who
start investigating performance before they realize they're using a
known-slow I/O backend.

Why not just conditionally compile this benchmark on linux only? Because
even on linux, I still want it to refuse to run if it can't get
io_uring.
2024-08-12 09:17:54 +01:00
Yuchen Liang
77bb6c4cc4 feat(pageserver): add direct io pageserver config (#8622)
Part of #8130, [RFC: Direct IO For Pageserver](https://github.com/neondatabase/neon/blob/problame/direct-io-rfc/docs/rfcs/034-direct-io-for-pageserver.md)

## Description

Add pageserver config for evaluating/enabling direct I/O. 

- Disabled: current default, uses buffered io as is.
- Evaluate: still uses buffered io, but could do alignment checking and
perf simulation (pad latency by direct io RW to a fake file).
- Enabled: uses direct io, behavior on alignment error is configurable.


Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-08-12 09:17:54 +01:00
Cihan Demirci
e082226a32 cicd: push build-tools image to ACR as well (#8638)
https://github.com/neondatabase/cloud/issues/15899
2024-08-12 09:17:54 +01:00
Joonas Koivunen
40e3c913bb refactor(timeline_detach_ancestor): replace ordered reparented with a hashset (#8629)
Earlier I was thinking we'd need a (ancestor_lsn, timeline_id) ordered
list of reparented. Turns out we did not need it at all. Replace it with
an unordered hashset. Additionally refactor the reparented direct
children query out, it will later be used from more places.

Split off from #8430.

Cc: #6994
2024-08-12 09:17:54 +01:00
Alex Chi Z.
658d763915 fix(pageserver): dump the key when it's invalid (#8633)
We see an assertion error in staging. Dump the key to guess where it was
from, and then we can fix it.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-12 09:17:54 +01:00
Joonas Koivunen
c0776b8724 fix: EphemeralFiles can outlive their Timeline via enum LayerManager (#8229)
Ephemeral files cleanup on drop but did not delay shutdown, leading to
problems with restarting the tenant. The solution is as proposed:
- make ephemeral files carry the gate guard to delay `Timeline::gate`
closing
- flush in-memory layers and strong references to those on
`Timeline::shutdown`

The above are realized by making LayerManager an `enum` with `Open` and
`Closed` variants, and fail requests to modify `LayerMap`.

Additionally:

- fix too eager anyhow conversions in compaction
- unify how we freeze layers and handle errors
- optimize likely_resident_layers to read LayerFileManager hashmap
values instead of bouncing through LayerMap

Fixes: #7830
2024-08-12 09:17:54 +01:00
Conrad Ludgate
1f73dfb842 proxy: random changes (#8602)
## Problem

1. Hard to correlate startup parameters with the endpoint that provided
them.
2. Some configurations are not needed in the `ProxyConfig` struct.

## Summary of changes

Because of some borrow checker fun, I needed to switch to an
interior-mutability implementation of our `RequestMonitoring` context
system. Using https://docs.rs/try-lock/latest/try_lock/ as a cheap lock
for such a use-case (needed to be thread safe).

Removed the lock of each startup message, instead just logging only the
startup params in a successful handshake.

Also removed from values from `ProxyConfig` and kept as arguments.
(needed for local-proxy config)
2024-08-12 09:17:54 +01:00
Arpad Müller
38f184bc91 Add missing colon to ArchivalConfigRequest specification (#8627)
Add a missing colon to the API specification of `ArchivalConfigRequest`.
The `state` field is required. Pointed out by Gleb.
2024-08-12 09:17:54 +01:00
Arpad Müller
c75e6fbc46 Lower level for timeline cancellations during gc (#8626)
Timeline cancellation running in parallel with gc yields error log lines
like:

```
Gc failed 1 times, retrying in 2s: TimelineCancelled
```

They are completely harmless though and normal to occur. Therefore, only
print those messages at an info level. Still print them at all so that
we know what is going on if we focus on a single timeline.
2024-08-12 09:17:54 +01:00
Arpad Müller
9a3bc5556a storage broker: only print one line for version and build tag in init (#8624)
This makes it more consistent with pageserver and safekeeper. Also, it
is easier to collect the two values into one data point.
2024-08-12 09:17:54 +01:00
Yuchen Liang
22790fc907 scrubber: clean up scan_metadata before prod (#8565)
Part of #8128.

## Problem
Currently, scrubber `scan_metadata` command will return with an error
code if the metadata on remote storage is corrupted with fatal errors.
To safely deploy this command in a cronjob, we want to differentiate
between failures while running scrubber command and the erroneous
metadata. At the same time, we also want our regression tests to catch
corrupted metadata using the scrubber command.

## Summary of changes

- Return with error code only when the scrubber command fails
- Uses explicit checks on errors and warnings to determine metadata
health in regression tests.

**Resolve conflict with `tenant-snapshot` command (after shard split):**
[`test_scrubber_tenant_snapshot`](https://github.com/neondatabase/neon/blob/yuchen/scrubber-scan-cleanup-before-prod/test_runner/regress/test_storage_scrubber.py#L23)
failed before applying 422a8443dd
- When taking a snapshot, the old `index_part.json` in the unsharded
tenant directory is not kept.
- The current `list_timeline_blobs` implementation consider no
`index_part.json` as a parse error.
- During the scan, we are only analyzing shards with highest shard
count, so we will not get a parse error. but we do need to add the
layers to tenant object listing, otherwise we will get index is
referencing a layer that is not in remote storage error.
- **Action:** Add s3_layers from `list_timeline_blobs` regardless of
parsing error

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-08-12 09:17:54 +01:00
John Spray
ba4e5b51a0 pageserver: add bench_ingest (#7409)
## Problem

We lack a rust bench for the inmemory layer and delta layer write paths:
it is useful to benchmark these components independent of postgres & WAL
decoding.

Related: https://github.com/neondatabase/neon/issues/8452

## Summary of changes

- Refactor DeltaLayerWriter to avoid carrying a Timeline, so that it can
be cleanly tested + benched without a Tenant/Timeline test harness. It
only needed the Timeline for building `Layer`, so this can be done in a
separate step.
- Add `bench_ingest`, which exercises a variety of workload "shapes"
(big values, small values, sequential keys, random keys)
- Include a small uncontroversial optimization: in `freeze`, only
exhaustively walk values to assert ordering relative to end_lsn in debug
mode.

These benches are limited by drive performance on a lot of machines, but
still useful as a local tool for iterating on CPU/memory improvements
around this code path.

Anecdotal measurements on Hetzner AX102 (Ryzen 7950xd):

```

ingest-small-values/ingest 128MB/100b seq
                        time:   [1.1160 s 1.1230 s 1.1289 s]
                        thrpt:  [113.38 MiB/s 113.98 MiB/s 114.70 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
Benchmarking ingest-small-values/ingest 128MB/100b rand: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 18.9s.
ingest-small-values/ingest 128MB/100b rand
                        time:   [1.9001 s 1.9056 s 1.9110 s]
                        thrpt:  [66.982 MiB/s 67.171 MiB/s 67.365 MiB/s]
Benchmarking ingest-small-values/ingest 128MB/100b rand-1024keys: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 11.0s.
ingest-small-values/ingest 128MB/100b rand-1024keys
                        time:   [1.0715 s 1.0828 s 1.0937 s]
                        thrpt:  [117.04 MiB/s 118.21 MiB/s 119.46 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [425.49 ms 429.07 ms 432.04 ms]
                        thrpt:  [296.27 MiB/s 298.32 MiB/s 300.83 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

ingest-big-values/ingest 128MB/8k seq
                        time:   [373.03 ms 375.84 ms 379.17 ms]
                        thrpt:  [337.58 MiB/s 340.57 MiB/s 343.13 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [81.534 ms 82.811 ms 83.364 ms]
                        thrpt:  [1.4994 GiB/s 1.5095 GiB/s 1.5331 GiB/s]
Found 1 outliers among 10 measurements (10.00%)


```
2024-08-12 09:17:54 +01:00
John Spray
6519f875b9 pageserver: use layer visibility when composing heatmap (#8616)
## Problem

Sometimes, a layer is Covered by hasn't yet been evicted from local disk
(e.g. shortly after image layer generation). It is not good use of
resources to download these to a secondary location, as there's a good
chance they will never be read.

This follows the previous change that added layer visibility:
- #8511 

Part of epic:
- https://github.com/neondatabase/neon/issues/8398

## Summary of changes

- When generating heatmaps, only include Visible layers
- Update test_secondary_downloads to filter to visible layers when
listing layers from an attached location
2024-08-12 09:17:54 +01:00
John Spray
ea7be4152a pageserver: fixes for layer visibility metric (#8603)
## Problem

In staging, we could see that occasionally tenants were wrapping their
pageserver_visible_physical_size metric past zero to 2^64.

This is harmless right now, but will matter more later when we start
using visible size in things like the /utilization endpoint.

## Summary of changes

- Add debug asserts that detect this case. `test_gc_of_remote_layers`
works as a reproducer for this issue once the asserts are added.
- Tighten up the interface around access_stats so that only Layer can
mutate it.
- In Layer, wrap calls to `record_access` in code that will update the
visible size statistic if the access implicitly marks the layer visible
(this was what caused the bug)
- In LayerManager::rewrite_layers, use the proper set_visibility layer
function instead of directly using access_stats (this is an additional
path where metrics could go bad.)
- Removed unused instances of LayerAccessStats in DeltaLayer and
ImageLayer which I noticed while reviewing the code paths that call
record_access.
2024-08-12 09:17:54 +01:00
John Spray
8d8e428d4c tests: improve stability of test_storage_controller_many_tenants (#8607)
## Problem

The controller scale test does random migrations. These mutate secondary
locations, and therefore can cause secondary optimizations to happen in
the background, violating the test's expectation that consistency_check
will work as there are no reconciliations running.

Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10247161379/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/6316beacd3fb3060/

## Summary of changes

- Only migrate to existing secondary locations, not randomly picked
nodes, so that we can do a fast reconcile_until_idle (otherwise
reconcile_until_idle is takes a long time to create new secondary
locations).
- Do a reconcile_until_idle before consistency_check.
2024-08-12 09:17:54 +01:00
a-masterov
0be952fb89 enable rum test (#8380)
## Problem
We need to test the rum extension automatically as a path of the GitHub
workflow

## Summary of changes

rum test is enabled
2024-08-12 09:17:54 +01:00
a-masterov
13e794a35c Add a test using Debezium as a client for the logical replication (#8568)
## Problem
We need to test the logical replication with some external consumers.
## Summary of changes
A test of the logical replication with Debezium as a consumer was added.
---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-08-12 09:17:54 +01:00
Arseny Sher
bd276839ad Add package-mode=false to poetry.
We don't use it for packaging, and 'poetry install' will soon error
otherwise. Also remove name and version fields as these are not required for
non-packaging mode.
2024-08-12 09:17:54 +01:00
Arpad Müller
44d9975799 storage_scrubber: migrate scan_safekeeper_metadata to remote_storage (#8595)
Migrates the safekeeper-specific parts of `ScanMetadata` to
GenericRemoteStorage, making it Azure-ready.
 
Part of https://github.com/neondatabase/neon/issues/7547
2024-08-12 09:17:54 +01:00
Joonas Koivunen
814b090250 chore: bump index part version (#8611)
#8600 missed the hunk changing index_part.json informative version.
Include it in this PR, in addition add more non-warning index_part.json
versions to scrubber.
2024-08-12 09:17:54 +01:00
Vlad Lazar
608c3cedbf pageserver: remove legacy read path (#8601)
## Problem

We have been maintaining two read paths (legacy and vectored) for a
while now. The legacy read-path was only used for cross validation in some tests.

## Summary of changes
* Tweak all tests that were using the legacy read path to use the
vectored read path instead
* Remove the read path dispatching based on the pageserver configs
* Remove the legacy read path code

We will be able to remove the single blob io code in
`pageserver/src/tenant/blob_io.rs` when https://github.com/neondatabase/neon/issues/7386 is complete.

Closes https://github.com/neondatabase/neon/issues/8005
2024-08-12 09:17:54 +01:00
Joonas Koivunen
b2bc5795be feat: persistent gc blocking (#8600)
Currently, we do not have facilities to persistently block GC on a
tenant for whatever reason. We could do a tenant configuration update,
but that is risky for generation numbers and would also be transient.
Introduce a `gc_block` facility in the tenant, which manages per
timeline blocking reasons.

Additionally, add HTTP endpoints for enabling/disabling manual gc
blocking for a specific timeline. For debugging, individual tenant
status now includes a similar string representation logged when GC is
skipped.

Cc: #6994
2024-08-12 09:17:54 +01:00
Joonas Koivunen
c89ee814e1 fix: make Timeline::set_disk_consistent_lsn use fetch_max (#8311)
now it is safe to use from multiple callers, as we have two callers.
2024-08-12 09:17:54 +01:00
Alex Chi Z.
83afea3edb feat(pageserver): support dry-run for gc-compaction, add statistics (#8557)
Add dry-run mode that does not produce any image layer + delta layer. I
will use this code to do some experiments and see how much space we can
reclaim for tenants on staging. Part of
https://github.com/neondatabase/neon/issues/8002

* Add dry-run mode that runs the full compaction process without
updating the layer map. (We never call finish on the writers and the
files will be removed before exiting the function).
* Add compaction statistics and print them at the end of compaction.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-12 09:17:54 +01:00
Alexander Bayandin
3b4b9c1d0b CI(benchmarking): set pub/sub projects for LR tests (#8483)
## Problem

> Currently, long-running LR tests recreate endpoints every night. We'd
like to have along-running buildup of history to exercise the pageserver
in this case (instead of "unit-testing" the same behavior everynight).

Closes #8317

## Summary of changes
- Update Postgres version for replication tests
- Set `BENCHMARK_PROJECT_ID_PUB`/`BENCHMARK_PROJECT_ID_SUB` env vars to
projects that were created for this purpose

---------

Co-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>
2024-08-12 09:17:54 +01:00
Joonas Koivunen
e1339ac915 fix: allow awaiting logical size for root timelines (#8604)
Currently if `GET
/v1/tenant/x/timeline/y?force-await-initial-logical-size=true` is
requested for a root timeline created within the current pageserver
session, the request handler panics hitting the debug assertion. These
timelines will always have an accurate (at initdb import) calculated
logical size. Fix is to never attempt prioritizing timeline size
calculation if we already have an exact value.

Split off from #8528.
2024-08-12 09:17:54 +01:00
Alexander Bayandin
6564afb822 CI(trigger-e2e-tests): fix deadlock with Build and Test workflow (#8606)
## Problem

In some cases, a deadlock between `build-and-test` and
`trigger-e2e-tests` workflows can happen:

```
Build and Test

Canceling since a deadlock for concurrency group 'Build and Test-8600/merge-anysha' was detected between 'top level workflow' and 'trigger-e2e-tests'
```

I don't understand the reason completely, probably `${{ github.workflow
}}` got evaluated to the same value and somehow caused the issue.
We don't need to limit concurrency for `trigger-e2e-tests`
workflow.

See
https://neondb.slack.com/archives/C059ZC138NR/p1722869486708179?thread_ts=1722869027.960029&cid=C059ZC138NR
2024-08-12 09:17:54 +01:00
Alexander Bayandin
274c2c40b9 CI(trigger-e2e-tests): wait for promote-images job from the last commit (#8592)
## Problem

We don't trigger e2e tests for draft PRs, but we do trigger them once a
PR is in the "Ready for review" state.
Sometimes, a PR can be marked as "Ready for review" before we finish
image building. In such cases, triggering e2e tests fails.

## Summary of changes
- Make `trigger-e2e-tests` job poll status of `promote-images` job from
the build-and-test workflow for the last commit. And trigger only if the
status is `success`
- Remove explicit image checking from the workflow
- Add `concurrency` for `triggere-e2e-tests` workflow to make it
possible to cancel jobs in progress (if PR moves from "Draft" to "Ready
for review" several times in a row)
2024-08-12 09:17:54 +01:00
Konstantin Knizhnik
afdbe0a7d0 Update Postgres versions to use smgrexists() instead of access() to check if Oid is used (#8597)
## Problem

PR #7992 was merged without correspondent changes in Postgres submodules
and this is why test_oid_overflow.py is failed now.

## Summary of changes

Bump Postgres versions

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-08-12 09:17:54 +01:00
Alex Chi Z.
5945eadd42 feat(pageserver): support split delta layers (#8599)
part of https://github.com/neondatabase/neon/issues/8002

Similar to https://github.com/neondatabase/neon/pull/8574, we add
auto-split support for delta layers. Tests are reused from image layer
split writers.


---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-12 09:17:54 +01:00
dotdister
b76ab45cbe safekeeper: remove unused partial_backup_enabled option (#8547)
## Problem
There is an unused safekeeper option `partial_backup_enabled`.

`partial_backup_enabled` was implemented in #6530, but this option was
always turned into enabled in #8022.

If you intended to keep this option for a specific reason, I will close
this PR.

## Summary of changes
I removed an unused safekeeper option `partial_backup_enabled`.
2024-08-12 09:17:54 +01:00
Arpad Müller
7b7d77c817 Merge pull request #8642 from neondatabase/arpad/release-ram-hot-fix
Storage release 2024-08-07
2024-08-07 20:00:43 +02:00
Joonas Koivunen
7ec831c956 fix: drain completed page_service connections (#8632)
We've noticed increased memory usage with the latest release. Drain the
joinset of `page_service` connection handlers to avoid leaking them
until shutdown. An alternative would be to use a TaskTracker.
TaskTracker was not discussed in original PR #8339 review, so not hot
fixing it in here either.
2024-08-07 19:17:40 +02:00
Arpad Müller
1a36516d75 Merge pull request #8598 from neondatabase/rc/2024-08-05
Storage & Compute release 2024-08-05
2024-08-05 14:21:20 +02:00
Alex Chi Z.
fde8aa103e feat(pageserver): support auto split layers based on size (#8574)
part of https://github.com/neondatabase/neon/issues/8002

## Summary of changes

Add a `SplitImageWriter` that automatically splits image layer based on
estimated target image layer size. This does not consider compression
and we might need a better metrics.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-08-05 08:56:00 +02:00
Alex Chi Z.
8624aabc98 fix(pageserver): deadlock in gc-compaction (#8590)
We need both compaction and gc lock for gc-compaction. The lock order
should be the same everywhere, otherwise there could be a deadlock where
A waits for B and B waits for A.

We also had a double-lock issue. The compaction lock gets acquired in
the outer `compact` function. Note that the unit tests directly call
`compact_with_gc`, and therefore not triggering the issue.

## Summary of changes

Ensure all places acquire compact lock and then gc lock. Remove an extra
compact lock acqusition.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-05 08:55:59 +02:00
John Spray
3a10bf8c82 tests: add test_historic_storage_formats (#8423)
## Problem

Currently, our backward compatibility tests only look one release back.
That means, for example, that when we switch on image layer compression
by default, we'll test reading of uncompressed layers for one release,
and then stop doing it. When we make an index_part.json format change,
we'll test against the old format for a week, then stop (unless we write
separate unit tests for each old format).

The reality in the field is that data in old formats will continue to
exist for weeks/months/years. When we make major format changes, we
should retain examples of the old format data, and continuously verify
that the latest code can still read them.

This test uses contents from a new path in the public S3 bucket,
`compatibility-data-snapshots/`. It is populated by hand. The first
important artifact is one from before we switch on compression, so that
we will keep testing reads of uncompressed data. We will generate more
artifacts ahead of other key changes, like when we update remote storage
format for archival timelines.

Closes: https://github.com/neondatabase/cloud/issues/15576
2024-08-05 08:55:59 +02:00
Arthur Petukhovsky
1758c10dec Improve safekeepers eviction rate limiting (#8456)
This commit tries to fix regular load spikes on staging, caused by too
many eviction and partial upload operations running at the same time.
Usually it was hapenning after restart, for partial backup the load was
delayed.
- Add a semaphore for evictions (2 permits by default)
- Rename `resident_since` to `evict_not_before` and smooth out the curve
by using random duration
- Use random duration in partial uploads as well

related to https://github.com/neondatabase/neon/issues/6338
some discussion in
https://neondb.slack.com/archives/C033RQ5SPDH/p1720601531744029
2024-08-05 08:55:59 +02:00
Arpad Müller
7eb3d6bb2d Wait for completion of the upload queue in flush_frozen_layer (#8550)
Makes `flush_frozen_layer` add a barrier to the upload queue and makes
it wait for that barrier to be reached until it lets the flushing be
completed.

This gives us backpressure and ensures that writes can't build up in an
unbounded fashion.

Fixes #7317
2024-08-05 08:55:59 +02:00
John Spray
3833e30d44 storage_controller: start adding chaos hooks (#7946)
Chaos injection bridges the gap between automated testing (where we do
lots of different things with small, short-lived tenants), and staging
(where we do many fewer things, but with larger, long-lived tenants).

This PR adds a first type of chaos which isn't really very chaotic: it's
live migration of tenants between healthy pageservers. This nevertheless
provides continuous checks that things like clean, prompt shutdown of
tenants works for realistically deployed pageservers with realistically
large tenants.
2024-08-05 08:55:59 +02:00
John Spray
4631179320 pageserver: refine how we delete timelines after shard split (#8436)
## Problem

Previously, when we do a timeline deletion, shards will delete layers
that belong to an ancestor. That is not a correctness issue, because
when we delete a timeline, we're always deleting it from all shards, and
destroying data for that timeline is clearly fine.

However, there exists a race where one shard might start doing this
deletion while another shard has not yet received the deletion request,
and might try to access an ancestral layer. This creates ambiguity over
the "all layers referenced by my index should always exist" invariant,
which is important to detecting and reporting corruption.

Now that we have a GC mode for clearing up ancestral layers, we can rely
on that to clean up such layers, and avoid deleting them right away.
This makes things easier to reason about: there are now no cases where a
shard will delete a layer that belongs to a ShardIndex other than
itself.

## Summary of changes

- Modify behavior of RemoteTimelineClient::delete_all
- Add `test_scrubber_physical_gc_timeline_deletion` to exercise this
case
- Tweak AWS SDK config in the scrubber to enable retries. Motivated by
seeing the test for this feature encounter some transient "service
error" S3 errors (which are probably nothing to do with the changes in
this PR)
2024-08-05 08:55:59 +02:00
Alexander Bayandin
4eea3ce705 test_runner: don't create artifacts if Allure is not enabled (#8580)
## Problem

`allure_attach_from_dir` method might create `tar.zst` archives even
if `--alluredir` is not set (i.e. Allure results collection is disabled)

## Summary of changes
- Don't run `allure_attach_from_dir` if `--alluredir`  is not set
2024-08-05 08:55:59 +02:00
Alex Chi Z.
a9bcabe503 fix(pageserver): skip existing layers for btm-gc-compaction (#8498)
part of https://github.com/neondatabase/neon/issues/8002

Due to the limitation of the current layer map implementation, we cannot
directly replace a layer. It's interpreted as an insert and a deletion,
and there will be file exist error when renaming the newly-created layer
to replace the old layer. We work around that by changing the end key of
the image layer. A long-term fix would involve a refactor around the
layer file naming. For delta layers, we simply skip layers with the same
key range produced, though it is possible to add an extra key as an
alternative solution.

* The image layer range for the layers generated from gc-compaction will
be Key::MIN..(Key..MAX-1), to avoid being recognized as an L0 delta
layer.
* Skip existing layers if it turns out that we need to generate a layer
with the same persistent key in the same generation.

Note that it is possible that the newly-generated layer has different
content from the existing layer. For example, when the user drops a
retain_lsn, the compaction could have combined or dropped some records,
therefore creating a smaller layer than the existing one. We discard the
"optimized" layer for now because we cannot deal with such rewrites
within the same generation.


---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-08-05 08:55:59 +02:00
Alex Chi Z.
7a2625b803 storage-scrubber: log version on start (#8571)
Helps us better identify which version of storage scrubber is running.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-05 08:55:59 +02:00
John Spray
f51dc6a44e pageserver: add layer visibility calculation (#8511)
## Problem

We recently added a "visibility" state to layers, but nothing
initializes it.

Part of:
- #8398 

## Summary of changes

- Add a dependency on `range-set-blaze`, which is used as a fast
incrementally updated alternative to KeySpace. We could also use this to
replace the internals of KeySpaceRandomAccum if we wanted to. Writing a
type that does this kind of "BtreeMap & merge overlapping entries" thing
isn't super complicated, but no reason to write this ourselves when
there's a third party impl available.
- Add a function to layermap to calculate visibilities for each layer
- Add a function to Timeline to call into layermap and then apply these
visibilities to the Layer objects.
- Invoke the calculation during startup, after image layer creations,
and when removing branches. Branch removal and image layer creation are
the two ways that a layer can go from Visible to Covered.
- Add unit test & benchmark for the visibility calculation
- Expose `pageserver_visible_physical_size` metric, which should always
be <= `pageserver_remote_physical_size`.
- This metric will feed into the /v1/utilization endpoint later: the
visible size indicates how much space we would like to use on this
pageserver for this tenant.
- When `pageserver_visible_physical_size` is greater than
`pageserver_resident_physical_size`, this is a sign that the tenant has
long-idle branches, which result in layers that are visible in
principle, but not used in practice.

This does not keep visibility hints up to date in all cases:
particularly, when creating a child timeline, any previously covered
layers will not get marked Visible until they are accessed.

Updates after image layer creation could be implemented as more of a
special case, but this would require more new code: the existing depth
calculation code doesn't maintain+yield the list of deltas that would be
covered by an image layer.

## Performance

This operation is done rarely (at startup and at timeline deletion), so
needs to be efficient but not ultra-fast.

There is a new `visibility` bench that measures runtime for a synthetic
100k layers case (`sequential`) and a real layer map (`real_map`) with
~26k layers.

The benchmark shows runtimes of single digit milliseconds (on a ryzen
7950). This confirms that the runtime shouldn't be a problem at startup
(as we already incur S3-level latencies there), but that it's slow
enough that we definitely shouldn't call it more often than necessary,
and it may be worthwhile to optimize further later (things like: when
removing a branch, only bother scanning layers below the branchpoint)

```
visibility/sequential   time:   [4.5087 ms 4.5894 ms 4.6775 ms]
                        change: [+2.0826% +3.9097% +5.8995%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 24 outliers among 100 measurements (24.00%)
  2 (2.00%) high mild
  22 (22.00%) high severe
min: 0/1696070, max: 93/1C0887F0
visibility/real_map     time:   [7.0796 ms 7.0832 ms 7.0871 ms]
                        change: [+0.3900% +0.4505% +0.5164%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
min: 0/1696070, max: 93/1C0887F0
visibility/real_map_many_branches
                        time:   [4.5285 ms 4.5355 ms 4.5434 ms]
                        change: [-1.0012% -0.8004% -0.5969%] (p = 0.00 < 0.05)
                        Change within noise threshold.
```
2024-08-05 08:55:59 +02:00
Arpad Müller
a22361b57b Reduce linux-raw-sys duplication (#8577)
Before, we had four versions of linux-raw-sys in our dependency graph:

```
  linux-raw-sys@0.1.4
  linux-raw-sys@0.3.8
  linux-raw-sys@0.4.13
  linux-raw-sys@0.6.4
```

now it's only two:

```
  linux-raw-sys@0.4.13
  linux-raw-sys@0.6.4
```

The changes in this PR are minimal. In order to get to its state one
only has to update procfs in Cargo.toml to 0.16 and do `cargo update -p
tempfile -p is-terminal -p prometheus`.
2024-08-05 08:55:59 +02:00
Christian Schwarz
1e6a1ac9fa pageserver: shutdown all walredo managers 8s into shutdown (#8572)
# Motivation

The working theory for hung systemd during PS deploy
(https://github.com/neondatabase/cloud/issues/11387) is that leftover
walredo processes trigger a race condition.

In https://github.com/neondatabase/neon/pull/8150 I arranged that a
clean Tenant shutdown does actually kill its walredo processes.

But many prod machines don't manage to shut down all their tenants until
the 10s systemd timeout hits and, presumably, triggers the race
condition in systemd / the Linux kernel that causes the frozen systemd

# Solution

This PR bolts on a rather ugly mechanism to shut down tenant managers
out of order 8s after we've received the SIGTERM from systemd.

# Changes

- add a global registry of `Weak<WalRedoManager>`
- add a special thread spawned during `shutdown_pageserver` that sleeps
for 8s, then shuts down all redo managers in the registry and prevents
new redo managers from being created
- propagate the new failure mode of tenant spawning throughout the code
base
- make sure shut down tenant manager results in
PageReconstructError::Cancelled so that if Timeline::get calls come in
after the shutdown, they do the right thing
2024-08-05 08:55:59 +02:00
Alex Chi Z.
02e8fd0b52 test(pageserver): add test_gc_feedback_with_snapshots (#8474)
should be working after https://github.com/neondatabase/neon/pull/8328
gets merged. Part of https://github.com/neondatabase/neon/issues/8002

adds a new perf benchmark case that ensures garbages can be collected
with branches

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-05 08:55:59 +02:00
Alexander Bayandin
8adc4031d0 CI(create-test-report): fix missing benchmark results in Allure report (#8540)
## Problem

In https://github.com/neondatabase/neon/pull/8241 I've accidentally
removed `create-test-report` dependency on `benchmarks` job

## Summary of changes
- Run `create-test-report` after `benchmarks` job
2024-08-05 08:55:59 +02:00
Arpad Müller
46379cd3f2 storage_scrubber: migrate FindGarbage to remote_storage (#8548)
Uses the newly added APIs from #8541 named `stream_tenants_generic` and
`stream_objects_with_retries` and extends them with
`list_objects_with_retries_generic` and
`stream_tenant_timelines_generic` to migrate the `find-garbage` command
of the scrubber to `GenericRemoteStorage`.

Part of https://github.com/neondatabase/neon/issues/7547
2024-08-05 08:55:59 +02:00
John Spray
b3a76d9601 controller: simplify reconciler generation increment logic (#8560)
## Problem

This code was confusing, untested and covered:
- an impossible case, where intent state is AttacheStale (we never do
this)
- a rare edge case (going from AttachedMulti to Attached), which we were
not testing, and in any case the pageserver internally does the same
Tenant reset in this transition as it would do if we incremented
generation.

Closes: https://github.com/neondatabase/neon/issues/8367

## Summary of changes

- Simplify the logic to only skip incrementing the generation if the
location already has the expected generation and the exact same mode.
2024-08-05 08:55:59 +02:00
Cihan Demirci
6c1bbe8434 cicd: change Azure storage details [2/2] (#8562)
Change Azure storage configuration to point to updated variables/secrets.

Also update subscription id variable.
2024-08-05 08:55:59 +02:00
Tristan Partin
a006f7656e Fix negative replication delay metric
In some cases, we can get a negative metric for replication_delay_bytes.
My best guess from all the research I've done is that we evaluate
pg_last_wal_receive_lsn() before pg_last_wal_replay_lsn(), and that by
the time everything is said and done, the replay LSN has advanced past
the receive LSN. In this case, our lag can effectively be modeled as
0 due to the speed of the WAL reception and replay.
2024-08-05 08:55:59 +02:00
Christian Schwarz
31122adee3 refactor(page_service): Timeline gate guard holding + cancellation + shutdown (#8339)
Since the introduction of sharding, the protocol handling loop in
`handle_pagerequests` cannot know anymore which concrete
`Tenant`/`Timeline` object any of the incoming `PagestreamFeMessage`
resolves to.
In fact, one message might resolve to one `Tenant`/`Timeline` while
the next one may resolve to another one.

To avoid going to tenant manager, we added the `shard_timelines` which
acted as an ever-growing cache that held timeline gate guards open for
the lifetime of the connection.
The consequence of holding the gate guards open was that we had to be
sensitive to every cached `Timeline::cancel` on each interaction with
the network connection, so that Timeline shutdown would not have to wait
for network connection interaction.

We can do better than that, meaning more efficiency & better
abstraction.
I proposed a sketch for it in

* https://github.com/neondatabase/neon/pull/8286

and this PR implements an evolution of that sketch.

The main idea is is that `mod page_service` shall be solely concerned
with the following:
1. receiving requests by speaking the protocol / pagestream subprotocol
2. dispatching the request to a corresponding method on the correct
shard/`Timeline` object
3. sending response by speaking the protocol / pagestream subprotocol.

The cancellation sensitivity responsibilities are clear cut:
* while in `page_service` code, sensitivity to page_service cancellation
is sufficient
* while in `Timeline` code, sensitivity to `Timeline::cancel` is
sufficient

To enforce these responsibilities, we introduce the notion of a
`timeline::handle::Handle` to a `Timeline` object that is checked out
from a `timeline::handle::Cache` for **each request**.
The `Handle` derefs to `Timeline` and is supposed to be used for a
single async method invocation on `Timeline`.
See the lengthy doc comment in `mod handle` for details of the design.
2024-08-05 08:55:59 +02:00
Alex Chi Z.
311cc71b08 feat(pageserver): support btm-gc-compaction for child branches (#8519)
part of https://github.com/neondatabase/neon/issues/8002

For child branches, we will pull the image of the modified keys from the
parant into the child branch, which creates a full history for
generating key retention. If there are not enough delta keys, the image
won't be wrote eventually, and we will only keep the deltas inside the
child branch. We could avoid the wasteful work to pull the image from
the parent if we can know the number of deltas in advance, in the future
(currently we always pull image for all modified keys in the child
branch)


---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-05 08:55:59 +02:00
Alexander Bayandin
0356fc426b CI(regress-tests): run less regression tests (#8561)
## Problem
We run regression tests on `release` & `debug` builds for each of the
three supported Postgres versions (6 in total).
With upcoming ARM support and Postgres 17, the number of jobs will jump
to 16, which is a lot.

See the internal discussion here:
https://neondb.slack.com/archives/C033A2WE6BZ/p1722365908404329

## Summary of changes
- Run `regress-tests` job in debug builds only with the latest Postgres
version
- Do not do `debug` builds on release branches
2024-08-05 08:55:59 +02:00
Christian Schwarz
35738ca37f compaction_level0_phase1: bypass PS PageCache for data blocks (#8543)
part of https://github.com/neondatabase/neon/issues/8184

# Problem

We want to bypass PS PageCache for all data block reads, but
`compact_level0_phase1` currently uses `ValueRef::load` to load the WAL
records from delta layers.
Internally, that maps to `FileBlockReader:read_blk` which hits the
PageCache
[here](e78341e1c2/pageserver/src/tenant/block_io.rs (L229-L236)).

# Solution

This PR adds a mode for `compact_level0_phase1` that uses the
`MergeIterator` for reading the `Value`s from the delta layer files.

`MergeIterator` is a streaming k-merge that uses vectored blob_io under
the hood, which bypasses the PS PageCache for data blocks.

Other notable changes:
* change the `DiskBtreeReader::into_stream` to buffer the node, instead
of holding a `PageCache` `PageReadGuard`.
* Without this, we run out of page cache slots in
`test_pageserver_compaction_smoke`.
* Generally, `PageReadGuard`s aren't supposed to be held across await
points, so, this is a general bugfix.

# Testing / Validation / Performance

`MergeIterator` has not yet been used in production; it's being
developed as part of
* https://github.com/neondatabase/neon/issues/8002

Therefore, this PR adds a validation mode that compares the existing
approach's value iterator with the new approach's stream output, item by
item.
If they're not identical, we log a warning / fail the unit/regression
test.
To avoid flooding the logs, we apply a global rate limit of once per 10
seconds.
In any case, we use the existing approach's value.

Expected performance impact that will be monitored in staging / nightly
benchmarks / eventually pre-prod:
* with validation:
  * increased CPU usage
  * ~doubled VirtualFile read bytes/second metric
* no change in disk IO usage because the kernel page cache will likely
have the pages buffered on the second read
* without validation:
* slightly higher DRAM usage because each iterator participating in the
k-merge has a dedicated buffer (as opposed to before, where compactions
would rely on the PS PageCaceh as a shared evicting buffer)
* less disk IO if previously there were repeat PageCache misses (likely
case on a busy production Pageserver)
* lower CPU usage: PageCache out of the picture, fewer syscalls are made
(vectored blob io batches reads)

# Rollout

The new code is used with validation mode enabled-by-default.
This gets us validation everywhere by default, specifically in
- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging

Before the next release, I'll merge the following aws.git PR that
configures prod to continue using the existing behavior:

* https://github.com/neondatabase/aws/pull/1663

# Interactions With Other Features

This work & rollout should complete before Direct IO is enabled because
Direct IO would double the IOPS & latency for each compaction read
(#8240).

# Future Work

The streaming k-merge's memory usage is proportional to the amount of
memory per participating layer.

But `compact_level0_phase1` still loads all keys into memory for
`all_keys_iter`.
Thus, it continues to have active memory usage proportional to the
number of keys involved in the compaction.

Future work should replace `all_keys_iter` with a streaming keys
iterator.
This PR has a draft in its first commit, which I later reverted because
it's not necessary to achieve the goal of this PR / issue #8184.
2024-08-05 08:55:59 +02:00
Cihan Demirci
fa24d27d38 cicd: change Azure storage details [1/2] (#8553)
Change Azure storage configuration to point to new variables/secrets. They have
the `_NEW` suffix in order not to disrupt any tests while we complete the
switch.
2024-08-05 08:55:59 +02:00
Christian Schwarz
fb6c1e9390 cleanup(compact_level0_phase1): some commentary and wrapping into block expressions (#8544)
Byproduct of scouting done for
https://github.com/neondatabase/neon/issues/8184

refs https://github.com/neondatabase/neon/issues/8184
2024-08-05 08:55:59 +02:00
Yuchen Liang
d1d4631c8f feat(scrubber): post scan_metadata results to storage controller (#8502)
Part of #8128, followup to #8480. closes #8421. 

Enable scrubber to optionally post metadata scan health results to
storage controller.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-08-05 08:55:59 +02:00
Yuchen Liang
b87a1384f0 feat(storcon): store scrubber metadata scan result (#8480)
Part of #8128, followed by #8502.

## Problem

Currently we lack mechanism to alert unhealthy `scan_metadata` status if
we start running this scrubber command as part of a cronjob. With the
storage controller client introduced to storage scrubber in #8196, it is
viable to set up alert by storing health status in the storage
controller database.

We intentionally do not store the full output to the database as the
json blobs potentially makes the table really huge. Instead, only a
health status and a timestamp recording the last time metadata health
status is posted on a tenant shard.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-08-05 08:55:59 +02:00
Anton Chaporgin
5702e1cb46 [neon/acr] impr: push to ACR while building images (#8545)
This tests the ability to push into ACR using OIDC. Proved it worked by running slightly modified YAML.
In `promote-images` we push the following images `neon compute-tools {vm-,}compute-node-{v14,v15,v16}` into `neoneastus2`.

https://github.com/neondatabase/cloud/issues/14640
2024-08-05 08:55:59 +02:00
Alexander Bayandin
5be3e09082 CI(benchmarking): make neonvm default provisioner (#8538)
## Problem

We don't allow regular end-users to use `k8s-pod` provisioner, 
but we still use it in nightly benchmarks

## Summary of changes
- Remove `provisioner` input from `neon-create-project` action, use
`k8s-neonvm` as a default provioner
- Change `neon-` platform prefix to `neonvm-`
- Remove `neon-captest-freetier` and `neon-captest-new` as we already
have their `neonvm` counterparts
2024-08-05 08:55:59 +02:00
Arpad Müller
cd3f4b3a53 scrubber: add remote_storage based listing APIs and use them in find-large-objects (#8541)
Add two new functions `stream_objects_with_retries` and
`stream_tenants_generic` and use them in the `find-large-objects`
subcommand, migrating it to `remote_storage`.

Also adds the `size` field to the `ListingObject` struct.

Part of #7547
2024-08-05 08:55:59 +02:00
Arpad Müller
57f22178d7 Add metrics for input data considered and taken for compression (#8522)
If compression is enabled, we currently try compressing each image
larger than a specific size and if the compressed version is smaller, we
write that one, otherwise we use the uncompressed image. However, this
might sometimes be a wasteful process, if there is a substantial amount
of images that don't compress well.

The compression metrics added in #8420
`pageserver_compression_image_in_bytes_total` and
`pageserver_compression_image_out_bytes_total` are well designed for
answering the question how space efficient the total compression process
is end-to-end, which helps one to decide whether to enable it or not.

To answer the question of how much waste there is in terms of trial
compression, so CPU time, we add two metrics:

* one about the images that have been trial-compressed (considered), and
* one about the images where the compressed image has actually been
written (chosen).

There is different ways of weighting them, like for example one could
look at the count, or the compressed data. But the main contributor to
compression CPU usage is amount of data processed, so we weight the
images by their *uncompressed* size. In other words, the two metrics
are:

* `pageserver_compression_image_in_bytes_considered`
* `pageserver_compression_image_in_bytes_chosen`

Part of #5431
2024-08-05 08:55:59 +02:00
John Spray
3f05758d09 scrubber: enable cleaning up garbage tenants from known deletion bugs, add object age safety check (#8461)
## Problem

Old storage buckets can contain a lot of tenants that aren't known to
the control plane at all, because they belonged to test jobs that get
their control plane state cleaned up shortly after running.

In general, it's somewhat unsafe to purge these, as it's hard to
distinguish "control plane doesn't know about this, so it's garbage"
from "control plane said it didn't know about this, which is a bug in
the scrubber, control plane, or API URL configured".

However, the most common case is that we see only a small husk of a
tenant in S3 from a specific old behavior of the software, for example:
- We had a bug where heatmaps weren't deleted on tenant delete
- When WAL DR was first deployed, we didn't delete initdb.tar.zst on
tenant deletion

## Summary of changes

- Add a KnownBug variant for the garbage reason
- Include such cases in the "safe" deletion mode (`--mode=deleted`)
- Add code that inspects tenants missing in control plane to identify
cases of known bugs (this is kind of slow, but should go away once we've
cleaned all these up)
- Add an additional `-min-age` safety check similar to physical GC,
where even if everything indicates objects aren't needed, we won't
delete something that has been modified too recently.

---------

Co-authored-by: Yuchen Liang <70461588+yliang412@users.noreply.github.com>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-08-05 08:55:59 +02:00
Christian Schwarz
010203a49e l0_flush: use mode=direct by default => coverage in automated tests (#8534)
Testing in staging and pre-prod has been [going

well](https://github.com/neondatabase/neon/issues/7418#issuecomment-2255474917).

This PR enables mode=direct by default, thereby providing additional
coverage in the automated tests:
- Rust tests
- Integration tests
- Nightly pagebench (likely irrelevant because it's read-only)

Production deployments continue to use `mode=page-cache` for the time
being: https://github.com/neondatabase/aws/pull/1655

refs https://github.com/neondatabase/neon/issues/7418
2024-08-05 08:55:59 +02:00
John Spray
7c40266c82 pageserver: fix return code from secondary_download_handler (#8508)
## Problem

The secondary download HTTP API is meant to return 200 if the download
is complete, and 202 if it is still in progress. In #8198 the download
implementation was changed to drop out with success early if it
over-runs a time budget, which resulted in 200 responses for incomplete
downloads.

This breaks storcon_cli's "tenant-warmup" command, which uses the OK
status to indicate download complete.

## Summary of changes

- Only return 200 if we get an Ok() _and_ the progress stats indicate
the download is complete.
2024-08-05 08:55:59 +02:00
Joonas Koivunen
7b3f94c1f0 test: deflake test_duplicate_creation (#8536)
By including comparison of `remote_consistent_lsn_visible` we risk
flakyness coming from outside of timeline creation. Mask out the
`remote_consistent_lsn_visible` for the comparison.

Evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8489/10142336315/index.html#suites/ffbb7f9930a77115316b58ff32b7c719/89ff0270bf58577a
2024-08-05 08:55:59 +02:00
a-masterov
d8205248e2 Add a test for clickhouse as a logical replication consumer (#8408)
## Problem

We need to test logical replication with 3rd-party tools regularly. 

## Summary of changes

Added a test using ClickHouse as a client

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-08-05 08:55:59 +02:00
Arpad Müller
a4d3e0c747 Adopt list_streaming in tenant deletion (#8504)
Uses the Stream based `list_streaming` function added by #8457 in tenant
deletion, as suggested in https://github.com/neondatabase/neon/pull/7932#issuecomment-2150480180 .

We don't have to worry about retries, as the function is wrapped inside
an outer retry block. If there is a retryable error either during the
listing or during deletion, we just do a fresh start.

Also adds `+ Send` bounds as they are required by the
`delete_tenant_remote` function.
2024-08-05 08:55:59 +02:00
Joonas Koivunen
df0748289b Merge pull request #8533 from neondatabase/rc/2024-07-29
Storage & Compute release 2024-07-29
2024-07-29 19:14:29 +03:00
Joonas Koivunen
407bf968c1 Merge remote-tracking branch 'origin/release' into rc/2024-07-29 2024-07-29 15:15:04 +00:00
Christian Schwarz
e0a5bb17ed pageserver: fail if id is present in pageserver.toml (#8489)
Overall plan:
https://www.notion.so/neondatabase/Rollout-Plan-simplified-pageserver-initialization-f935ae02b225444e8a41130b7d34e4ea?pvs=4

---

`identity.toml` is the authoritative place for `id` as of
https://github.com/neondatabase/neon/pull/7766

refs https://github.com/neondatabase/neon/issues/7736
2024-07-29 15:08:15 +00:00
Stas Kelvich
6026cbfb63 Merge pull request #8530 from neondatabase/releases/2024-07-26-compute-only-sk
Compute release 2024-06-26
2024-07-26 17:32:22 +01:00
Em Sharnoff
3a0ee16ed5 Fix sql-exporter-autoscaling for pg < 16 (#8523)
The lfc_approximate_working_set_size_windows query was failing on pg14
and pg15 with

  pq: subquery in FROM must have an alias

Because aliases in that position became optional only in pg16.

Some context here: https://neondb.slack.com/archives/C04DGM6SMTM/p1721970322601679?thread_ts=1721921122.528849
2024-07-26 16:35:16 +01:00
Stas Kelvich
dbcfc01471 Merge pull request #8514 from neondatabase/releases/2024-07-25-compute-only
Compute release 2024-07-25
2024-07-25 22:42:17 +01:00
Anastasia Lubennikova
8bf597c4d7 Update pgrx to v 0.11.3 (#8515)
update pg_jsonschema extension to v 0.3.1
update pg_graphql extension to v1.5.7
update pgx_ulid extension to v0.1.5
update pg_tiktoken extension, patch Cargo.toml to use new pgrx
2024-07-25 13:22:53 -07:00
Em Sharnoff
138ae15a91 vm-image: Expose new LFC working set size metrics (#8298)
In general, replace:

* 'lfc_approximate_working_set_size' with
* 'lfc_approximate_working_set_size_windows'

For the "main" metrics that are actually scraped and used internally,
the old one is just marked as deprecated.
For the "autoscaling" metrics, we're not currently using the old one, so
we can get away with just replacing it.

Also, for the user-visible metrics we'll only store & expose a few
different time windows, to avoid making the UI overly busy or bloating
our internal metrics storage.

But for the autoscaling-related scraper, we aren't storing the metrics,
and it's useful to be able to programmatically operate on the trendline
of how WSS increases (or doesn't!) with window size. So there, we can
just output datapoints for each minute.

Part of neondatabase/autoscaling#872
See also https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6
2024-07-25 16:34:29 +01:00
Konstantin Knizhnik
59eeadabe9 Change default version of Neon extensio to 1.4 2024-07-25 16:33:49 +01:00
Christian Schwarz
daf8edd986 Merge pull request #8468 from neondatabase/rc/2024-07-23-manual
Storage release 2024-07-23

We did not deploy yesterday's
* https://github.com/neondatabase/neon/pull/8451
because of CICD troubles with pre-prod.

Also, it was missing

* https://github.com/neondatabase/neon/pull/7766

which is low-risk and unblocks more cleanup work that would otherwise have to wait until after next week's release.

So, this PR cherry-picks #7766 and creates a new storage release.

Compute will release separately later this week.

Back pointer to Slack thread: https://neondb.slack.com/archives/C03H1K0PGKH/p1721650191019099
2024-07-24 12:02:14 +02:00
Vlad Lazar
a1272b6ed8 pageserver: use identity file as node id authority and remove init command and config-override flags (#7766)
Ansible will soon write the node id to `identity.toml` in the work dir
for new pageservers. On the pageserver side, we read the node id from
the identity file if it is present and use that as the source of truth.
If the identity file is missing, cannot be read, or does not
deserialise, start-up is aborted.
 
This PR also removes the `--init` mode and the `--config-override` flag
from the `pageserver` binary.
The neon_local is already not using these flags anymore.

Ansible still uses them until the linked change is merged & deployed,
so, this PR has to land simultaneously or after the Ansible change due
to that.

Related Ansible change: https://github.com/neondatabase/aws/pull/1322
Cplane change to remove config-override usages:
https://github.com/neondatabase/cloud/pull/13417
Closes: https://github.com/neondatabase/neon/issues/7736
Overall plan:
https://www.notion.so/neondatabase/Rollout-Plan-simplified-pageserver-initialization-f935ae02b225444e8a41130b7d34e4ea?pvs=4

Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-07-23 12:55:46 +02:00
Christian Schwarz
28ee7cdede Merge pull request #8451 from neondatabase/rc/2024-07-22
## Storage & Compute release 2024-07-22

This PR has so many commits because the release branch diverged from `main`.

Details https://neondb.slack.com/archives/C033A2WE6BZ/p1721650938949059?thread_ts=1721308848.034069&cid=C033A2WE6BZ

The commit range that is truly new since the last storage release are the the `main` commit which I cherry-picked using this command

```
git cherry-pick 8a8b83df27383a07bb7dbba519325c15d2f46357..4e547e6
```
2024-07-22 19:17:01 +02:00
Christian Schwarz
7b63092958 Merge commit '4e547e6' into rc/2024-07-22
See https://neondb.slack.com/archives/C033A2WE6BZ/p1721650938949059?thread_ts=1721308848.034069&cid=C033A2WE6BZ
2024-07-22 14:40:55 +02:00
Arpad Müller
31bfeaf934 Use DefaultCredentialsChain AWS authentication in remote_storage (#8440)
PR #8299 has switched the storage scrubber to use
`DefaultCredentialsChain`. Now we do this for `remote_storage`, as it
allows us to use `remote_storage` from inside kubernetes. Most of the
diff is due to `GenericRemoteStorage::from_config` becoming `async fn`.
2024-07-22 14:36:56 +02:00
Arpad Müller
21b3a191bf Add archival_config endpoint to pageserver (#8414)
This adds an archival_config endpoint to the pageserver. Currently it
has no effect, and always "works", but later the intent is that it will
make a timeline archived/unarchived.

- [x] add yml spec
- [x] add endpoint handler

Part of https://github.com/neondatabase/neon/issues/8088
2024-07-22 14:36:56 +02:00
Shinya Kato
f7f9b4aaec Fix openapi specification (#8273)
## Problem

There are some swagger errors in `pageserver/src/http/openapi_spec.yml`
```
Error	431	15000	Object includes not allowed fields
Error	569	3100401	should always have a 'required'
Error	569	15000	Object includes not allowed fields
Error	1111	10037	properties members must be schemas
```

## Summary of changes

Fixed the above errors.
2024-07-22 14:36:56 +02:00
John Spray
bba062e262 tests: longer timeouts in test_timeline_deletion_with_files_stuck_in_upload_queue (#8438)
## Problem

This test had two locations with 2 second timeouts, which is rather low
when we run on a highly contended test machine running lots of tests in
parallel. It usually passes, but today I've seen both of these locations
time out on separate PRs.

Example failure:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8432/10007868041/index.html#suites/837740b64a53e769572c4ed7b7a7eeeb/6c6a092be083d27c

## Summary of changes

- Change 2 second timeouts to 20 second timeouts
2024-07-22 14:36:56 +02:00
Shinya Kato
067363fe95 safekeeper: remove unused safekeeper runtimes (#8433)
There are unused safekeeper runtimes `WAL_REMOVER_RUNTIME` and
`METRICS_SHIFTER_RUNTIME`.

`WAL_REMOVER_RUNTIME` was implemented in
[#4119](https://github.com/neondatabase/neon/pull/4119) and removed in
[#7887](https://github.com/neondatabase/neon/pull/7887).
`METRICS_SHIFTER_RUNTIME` was also implemented in
[#4119](https://github.com/neondatabase/neon/pull/4119) but has never
been used.

I removed unused safekeeper runtimes `WAL_REMOVER_RUNTIME` and
`METRICS_SHIFTER_RUNTIME`.
2024-07-22 14:36:56 +02:00
John Spray
affe408433 storage scrubber: GC ancestor shard layers (#8196)
## Problem

After a shard split, the pageserver leaves the ancestor shard's content
in place. It may be referenced by child shards, but eventually child
shards will de-reference most ancestor layers as they write their own
data and do GC. We would like to eventually clean up those ancestor
layers to reclaim space.

## Summary of changes

- Extend the physical GC command with `--mode=full`, which includes
cleaning up unreferenced ancestor shard layers
- Add test `test_scrubber_physical_gc_ancestors`
- Remove colored log output: in testing this is irritating ANSI code
spam in logs, and in interactive use doesn't add much.
- Refactor storage controller API client code out of storcon_client into
a `storage_controller/client` crate
- During physical GC of ancestors, call into the storage controller to
check that the latest shards seen in S3 reflect the latest state of the
tenant, and there is no shard split in progress.
2024-07-22 14:36:56 +02:00
Christian Schwarz
9b883e4651 pageserver: remove obsolete cached_metric_collection_interval (#8370)
We're removing the usage of this long-meaningless config field in
https://github.com/neondatabase/aws/pull/1599

Once that PR has been deployed to staging and prod, we can merge this
PR.
2024-07-22 14:36:56 +02:00
Peter Bendel
b98b301d56 Bodobolero/fix root permissions (#8429)
## Problem

My prior PR https://github.com/neondatabase/neon/pull/8422
caused leftovers in the GitHub action runner work directory with root
permission.
As an example see here
https://github.com/neondatabase/neon/actions/runs/10001857641/job/27646237324#step:3:37
To work-around we install vanilla postgres as non-root using deb
packages in /home/nonroot user directory

## Summary of changes

- since we cannot use root we install the deb pkgs directly and create
symbolic links for psql, pgbench and libs in expected places
- continue jobs an aws even if azure jobs fail (because this region is
currently unreliable)
2024-07-22 14:36:56 +02:00
Arpad Müller
ed7ee73cba Enable zstd in tests (#8368)
Successor of #8288 , just enable zstd in tests. Also adds a test that
creates easily compressable data.

Part of #5431

---------

Co-authored-by: John Spray <john@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-07-22 14:36:56 +02:00
Arthur Petukhovsky
fceace835b Change log level for GuardDrop error (#8305)
The error means that manager exited earlier than `ResidenceGuard` and
it's not unexpected with current deletion implementation. This commit
changes log level to reduse noise.
2024-07-22 14:36:56 +02:00
Peter Bendel
1b508a6082 Temporarily use vanilla pgbench and psql (client) for running pgvector benchmark (#8422)
## Problem

https://github.com/neondatabase/neon/issues/8275 is not yet fixed

Periodic benchmarking fails with SIGABRT in pgvector step, see
https://github.com/neondatabase/neon/actions/runs/9967453263/job/27541159738#step:7:393

## Summary of changes

Instead of using pgbench and psql from Neon artifacts, download vanilla
postgres binaries into the container and use those to run the client
side of the test.
2024-07-22 14:36:56 +02:00
Alex Chi Z.
f87b031876 pageserver: integrate k-merge with bottom-most compaction (#8415)
Use the k-merge iterator in the compaction process to reduce memory
footprint.

part of https://github.com/neondatabase/neon/issues/8002

## Summary of changes

* refactor the bottom-most compaction code to use k-merge iterator
* add Send bound on some structs as it is used across the await points

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-07-22 14:36:56 +02:00
Arthur Petukhovsky
9f1ba2c4bf Fix partial upload bug with invalid remote state (#8383)
We have an issue that some partial uploaded segments can be actually
missing in remote storage. I found this issue when was looking at the
logs in staging, and it can be triggered by failed uploads:
1. Code tries to upload `SEG_TERM_LSN_LSN_sk5.partial`, but receives
error from S3
2. The failed attempt is saved to `segments` vec
3. After some time, the code tries to upload
`SEG_TERM_LSN_LSN_sk5.partial` again
4. This time the upload is successful and code calls `gc()` to delete
previous uploads
5. Since new object and old object share the same name, uploaded data
gets deleted from remote storage

This commit fixes the issue by patching `gc()` not to delete objects
with the same name as currently uploaded.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-07-22 14:36:56 +02:00
John Spray
9868bb3346 tests: turn on safekeeper eviction by default (#8352)
## Problem

Ahead of enabling eviction in the field, where it will become the
normal/default mode, let's enable it by default throughout our tests in
case any issues become visible there.

## Summary of changes

- Make default `extra_opts` for safekeepers enable offload & deletion
- Set low timeouts in `extra_opts` so that tests running for tens of
seconds have a chance to hit some of these background operations.
2024-07-22 14:36:56 +02:00
John Spray
27da0e9cf5 tests: increase test_pg_regress and test_isolation timeouts (#8418)
## Problem

These tests time out ~1 in 50 runs when in debug mode.

There is no indication of a real issue: they're just wrappers that have
large numbers of individual tests contained within on pytest case.

## Summary of changes

- Bump pg_regress timeout from 600 to 900s
- Bump test_isolation timeout from 300s (default) to 600s

In future it would be nice to break out these tests to run individual
cases (or batches thereof) as separate tests, rather than this monolith.
2024-07-22 14:36:56 +02:00
John Spray
de9bf2af6c tests: fix metrics check in test_s3_eviction (#8419)
## Problem

This test would occasionally fail its metric check. This could happen in
the rare case that the nodes had all been restarted before their most
recent eviction.

The metric check was added in
https://github.com/neondatabase/neon/pull/8348

## Summary of changes

- Check metrics before each restart, accumulate into a bool that we
assert on at the end of the test
2024-07-22 14:36:56 +02:00
Christian Schwarz
3d2c2ce139 NeonEnv.from_repo_dir: use storage_controller_db instead of attachments.json (#8382)
When `NeonEnv.from_repo_dir` was introduced, storage controller stored
its
state exclusively `attachments.json`.
Since then, it has moved to using Postgres, which stores its state in
`storage_controller_db`.

But `NeonEnv.from_repo_dir` wasn't adjusted to do this.
This PR rectifies the situation.

Context for this is failures in
`test_pageserver_characterize_throughput_with_n_tenants`
CF:
https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH

Notably, `from_repo_dir` is also used by the backwards- and
forwards-compatibility.
Thus, the changes in this PR affect those tests as well.
However, it turns out that the compatibility snapshot already contains
the `storage_controller_db`.
Thus, it should just work and in fact we can remove hacks like
`fixup_storage_controller`.

Follow-ups created as part of this work:
* https://github.com/neondatabase/neon/issues/8399
* https://github.com/neondatabase/neon/issues/8400
2024-07-22 14:36:56 +02:00
dotdister
82a2081d61 Fix comment in Control Plane (#8406)
## Problem
There are something wrong in the comment of
`control_plane/src/broker.rs` and `control_plane/src/pageserver.rs`

## Summary of changes
Fixed the comment about component name and their data path in
`control_plane/src/broker.rs` and `control_plane/src/pageserver.rs`.
2024-07-22 14:36:56 +02:00
Joonas Koivunen
ff174a88c0 test: allow requests to any pageserver get cancelled (#8413)
Fix flakyness on `test_sharded_timeline_detach_ancestor` which does not
reproduce on a fast enough runner by allowing cancelled request before
completing on all pageservers. It was only allowed on half of the
pageservers.

Failure evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8352/9972357040/index.html#suites/a1c2be32556270764423c495fad75d47/7cca3e3d94fe12f2
2024-07-22 14:36:56 +02:00
John Spray
ef3ebfaf67 pageserver: layer count & size metrics (#8410)
## Problem

We lack insight into:
- How much of a tenant's physical size is image vs. delta layers
- Average sizes of image vs. delta layers
- Total layer counts per timeline, indicating size of index_part object

As well as general observability love, this is motivated by
https://github.com/neondatabase/neon/issues/6738, where we need to
define some sensible thresholds for storage amplification, and using
total physical size may not work well (if someone does a lot of DROPs
then it's legitimate for the physical-synthetic ratio to be huge), but
the ratio between image layer size and delta layer size may be a better
indicator of whether we're generating unreasonable quantities of image
layers.

## Summary of changes

- Add pageserver_layer_bytes and pageserver_layer_count metrics,
labelled by timeline and `kind` (delta or image)
- Add & subtract these with LayerInner's lifetime.

I'm intentionally avoiding using a generic metric RAII guard object, to
avoid bloating LayerInner: it already has all the information it needs
to update metric on new+drop.
2024-07-22 14:36:56 +02:00
Yuchen Liang
ae1af558b4 docs: update storage controller db name in doc (#8411)
The db name was renamed to storage_controller from attachment_service.
Doc was stale.
2024-07-22 14:36:56 +02:00
John Spray
c150ad4ee2 tests: add test_compaction_l0_memory (#8403)
This test reproduces the case of a writer creating a deep stack of L0
layers. It uses realistic layer sizes and writes several gigabytes of
data, therefore runs as a performance test although it is validating
memory footprint rather than performance per se.

It acts a regression test for two recent fixes:
- https://github.com/neondatabase/neon/pull/8401
- https://github.com/neondatabase/neon/pull/8391

In future it will demonstrate the larger improvement of using a k-merge
iterator for L0 compaction (#8184)

This test can be extended to enforce limits on the memory consumption of
other housekeeping steps, by restarting the pageserver and then running
other things to do the same "how much did RSS increase" measurement.
2024-07-22 14:36:56 +02:00
Alex Chi Z.
a98ccd185b test(pageserver): more k-merge tests on duplicated keys (#8404)
Existing tenants and some selection of layers might produce duplicated
keys. Add tests to ensure the k-merge iterator handles it correctly. We
also enforced ordering of the k-merge iterator to put images before
deltas.

part of https://github.com/neondatabase/neon/issues/8002

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-07-22 14:36:56 +02:00
Peter Bendel
9f796ebba9 Bodobolero/pgbench compare azure (#8409)
## Problem

We want to run performance tests on all supported cloud providers.
We want to run most tests on the postgres version which is default for
new projects in production, currently (July 24) this is postgres version
16

## Summary of changes

- change default postgres version for some (performance) tests to 16
(which is our default for new projects in prod anyhow)
- add azure region to pgbench_compare jobs

- add azure region to pgvector benchmarking jobs
- re-used project `weathered-snowflake-88107345` was prepared with 1
million embeddings running on 7 minCU 7 maxCU in azure region to compare
with AWS region (pgvector indexing and hnsw queries)
  - see job pgbench-pgvector 

- Note we now have a 11 environments combinations where we run
pgbench-compare and 5 are for k8s-pod (deprecated) which we can remove
in the future once auto-scaling team approves.

## Logs

A current run with the changes from this pull request is running here
https://github.com/neondatabase/neon/actions/runs/9972096222

Note that we currently expect some failures due to
- https://github.com/neondatabase/neon/issues/8275
- instability of projects on azure region
2024-07-22 14:36:56 +02:00
John Spray
d51ca338c4 docs/rfcs: timeline ancestor detach API (#6888)
## Problem

When a tenant creates a new timeline that they will treat as their
'main' history,
it is awkward to permanently retain an 'old main' timeline as its
ancestor. Currently
this is necessary because it is forbidden to delete a timeline which has
descendents.

## Summary of changes

A new pageserver API is proposed to 'adopt' data from a parent timeline
into
one of its children, such that the link between ancestor and child can
be severed,
leaving the parent in a state where it may then be deleted.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-07-22 14:36:56 +02:00
John Spray
07e78102bf pageserver: reduce size of delta layer ValueRef (#8401)
## Problem

ValueRef is an unnecessarily large structure, because it carries a
cursor. L0 compaction currently instantiates gigabytes of these under
some circumstances.

## Summary of changes

- Carry a ref to the parent layer instead of a cursor, and construct a
cursor on demand.

This reduces RSS high watermark during L0 compaction by about 20%.
2024-07-22 14:36:56 +02:00
John Spray
b21e131d11 pageserver: exclude un-read layers from short residence statistic (#8396)
## Problem

The `evictions_with_low_residence_duration` is used as an indicator of
cache thrashing. However, there are situations where it is quite
legitimate to only have a short residence during compaction, where a
delta is downloaded, used to generate an image layer, and then
discarded. This can lead to false positive alerts.

## Summary of changes

- Only track low residence duration for layers that have been accessed
at least once (compaction doesn't count as an access). This will give us
a metric that indicates thrashing on layers that the _user_ is using,
rather than those we're downloading for housekeeping purposes.

Once we add "layer visibility" as an explicit property of layers, this
can also be used as a cleaner condition (residence of non-visible layers
should never be alertable)
2024-07-22 14:36:56 +02:00
Alex Chi Z.
abe3b4e005 fix(pageserver): limit num of delta layers for l0 compaction (#8391)
## Problem

close https://github.com/neondatabase/neon/issues/8389

## Summary of changes

A quick mitigation for tenants with fast writes. We compact at most 60
delta layers at a time, expecting a memory footprint of 15GB. We will
pick the oldest 60 L0 layers.

This should be a relatively safe change so no test is added. Question is
whether to make this parameter configurable via tenant config.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2024-07-22 14:36:56 +02:00
Tristan Partin
18e7c2b7a1 Add some typing to Endpoint.respec() 2024-07-22 14:36:56 +02:00
Tristan Partin
ad5d784fb7 Hide import behind TYPE_CHECKING 2024-07-22 14:36:56 +02:00
Tristan Partin
85d47637ee Run each migration in its own transaction
Previously, every migration was run in the same transaction. This
is preparatory work for fixing CVE-2024-4317.
2024-07-22 14:36:56 +02:00
Tristan Partin
7e818ee390 Rename compute migrations to start at 1
This matches what we put into the neon_migration.migration_id table.
2024-07-22 14:36:56 +02:00
John Spray
bff505426e pageserver: clean up GcCutoffs names (#8379)
- `horizon` is a confusing term, it's not at all obvious that this means
space-based retention limit, rather than the total GC history limit.
Rename to `GcCutoffs::space`.
- `pitr` is less confusing, but still an unecessary level of indirection
from what we really mean: a time-based condition. The fact that we use
that that time-history for Point In Time Recovery doesn't mean we have
to refer to time as "pitr" everywhere. Rename to `GcCutoffs::time`.
2024-07-22 14:36:56 +02:00
dependabot[bot]
bf7de92dc2 build(deps): bump setuptools from 65.5.1 to 70.0.0 (#8387)
Bumps [setuptools](https://github.com/pypa/setuptools) from 65.5.1 to
70.0.0.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: a-masterov <72613290+a-masterov@users.noreply.github.com>
2024-07-22 14:36:56 +02:00
Arpad Müller
9dc71f5a88 Avoid the storage controller in test_tenant_creation_fails (#8392)
As described in #8385, the likely source for flakiness in
test_tenant_creation_fails is the following sequence of events:

1. test instructs the storage controller to create the tenant
2. storage controller adds the tenant and persists it to the database.
issues a creation request
3. the pageserver restarts with the failpoint disabled
4. storage controller's background reconciliation still wants to create
the tenant
5. pageserver gets new request to create the tenant from background
reconciliation

This commit just avoids the storage controller entirely. It has its own
set of issues, as the re-attach request will obviously not include the
tenant, but it's still useful to test for non-existence of the tenant.

The generation is also not optional any more during tenant attachment.
If you omit it, the pageserver yields an error. We change the signature
of `tenant_attach` to reflect that.

Alternative to #8385
Fixes #8266
2024-07-22 14:36:56 +02:00
Anastasia Lubennikova
2ede9d7a25 Compute: add compatibility patch for rum
Fixes #8251
2024-07-22 14:36:56 +02:00
John Spray
ea5460843c pageserver: un-Arc Timeline::layers (#8386)
## Problem

This structure was in an Arc<> unnecessarily, making it harder to reason
about its lifetime (i.e. it was superficially possible for LayerManager
to outlive timeline, even though no code used it that way)

## Summary of changes

- Remove the Arc<>
2024-07-22 14:36:56 +02:00
Arpad Müller
5b16624bcc Allow the new clippy::doc_lazy_continuation lint (#8388)
The `doc_lazy_continuation` lint of clippy is still unknown on latest
rust stable.

Fixes fall-out from #8151.
2024-07-22 14:36:56 +02:00
Sasha Krassovsky
349373cb11 Allow reusing projects between runs of logical replication benchmarks (#8393) 2024-07-22 14:36:56 +02:00
Joonas Koivunen
957f99cad5 feat(timeline_detach_ancestor): success idempotency (#8354)
Right now timeline detach ancestor reports an error (409, "no ancestor")
on a new attempt after successful completion. This makes it troublesome
for storage controller retries. Fix it to respond with `200 OK` as if
the operation had just completed quickly.

Additionally, the returned timeline identifiers in the 200 OK response
are now ordered so that responses between different nodes for error
comparison are done by the storage controller added in #8353.

Design-wise, this PR introduces a new strategy for accessing the latest
uploaded IndexPart:
`RemoteTimelineClient::initialized_upload_queue(&self) ->
Result<UploadQueueAccessor<'_>, NotInitialized>`. It should be a more
scalable way to query the latest uploaded `IndexPart` than to add a
query method for each question directly on `RemoteTimelineClient`.

GC blocking will need to be introduced to make the operation fully
idempotent. However, it is idempotent for the cases demonstrated by
tests.

Cc: #6994
2024-07-22 14:36:56 +02:00
John Spray
2a3a136474 pageserver: use PITR GC cutoffs as authoritative (#8365)
## Problem

Pageserver GC uses a size-based condition (GC "horizon" in addition to
time-based "PITR").

Eventually we plan to retire the size-based condition:
https://github.com/neondatabase/neon/issues/6374

Currently, we always apply the more conservative of the two, meaning
that tenants always retain at least 64MB of history (default horizon),
even after a very long time has passed. This is particularly acute in
cases where someone has dropped tables/databases, and then leaves a
database idle: the horizon can prevent GCing very large quantities of
historical data (we already account for this in synthetic size by
ignoring gc horizon).

We're not entirely removing GC horizon right now because we don't want
to 100% rely on standby_horizon for robustness of physical replication,
but we can tweak our logic to avoid retaining that 64MB LSN length
indefinitely.

## Summary of changes

- Rework `Timeline::find_gc_cutoffs`, with new logic:
- If there is no PITR set, then use `DEFAULT_PITR_INTERVAL` (1 week) to
calculate a time threshold. Retain either the horizon or up to that
thresholds, whichever requires less data.
- When there is a PITR set, and we have unambiguously resolved the
timestamp to an LSN, then ignore the GC horizon entirely. For typical
PITRs (1 day, 1 week), this will still easily retain enough data to
avoid stressing read only replicas.

The key property we end up with, whether a PITR is set or not, is that
after enough time has passed, our GC cutoff on an idle timeline will
catch up with the last_record_lsn.

Using `DEFAULT_PITR_INTERVAL` is a bit of an arbitrary hack, but this
feels like it isn't really worth the noise of exposing in TenantConfig.
We could just make it a different named constant though. The end-end
state will be that there is no gc_horizon at all, and that tenants with
pitr_interval=0 would truly retain no history, so this constant would go
away.
2024-07-22 14:36:56 +02:00
Joonas Koivunen
cfaf30f5e8 feat(storcon): timeline detach ancestor passthrough (#8353)
Currently storage controller does not support forwarding timeline detach
ancestor requests to pageservers. Add support for forwarding `PUT
.../:tenant_id/timelines/:timeline_id/detach_ancestor`. Implement the
support mostly as is, because the timeline detach ancestor will be made
(mostly) idempotent in future PR.

Cc: #6994
2024-07-22 14:36:56 +02:00
Christian Schwarz
72c2d0812e remove page_service show <tenant_id> (#8372)
This operation isn't used in practice, so let's remove it.

Context: in https://github.com/neondatabase/neon/pull/8339
2024-07-22 14:36:56 +02:00
Arseny Sher
537ecf45f8 Fix test_timeline_copy flakiness.
fixes https://github.com/neondatabase/neon/issues/8355
2024-07-22 14:31:12 +02:00
Luca Bruno
1637a6ee05 proxy/http: switch to typed_json (#8377)
## Summary of changes

This switches JSON rendering logic to `typed_json` in order to
reduce the number of allocations in the HTTP responder path.

Followup from
https://github.com/neondatabase/neon/pull/8319#issuecomment-2216991760.

---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2024-07-22 14:30:53 +02:00
Alex Chi Z
d74fb7b879 Merge pull request #8374 from neondatabase/rc/2024-07-15
Storage & Compute release 2024-07-15
2024-07-15 11:02:18 -04:00
Konstantin Knizhnik
7973c3e941 Add neon.running_xacts_overflow_policy to make it possible for RO replica to startup without primary even in case running xacts overflow (#8323)
## Problem

Right now if there are too many running xacts to be restored from CLOG
at replica startup,
then replica is not trying to restore them and wait for non-overflown
running-xacs WAL record from primary.
But if primary is not active, then replica will not start at all.

Too many running xacts can be caused by transactions with large number
of subtractions.
But right now it can be also cause by two reasons:
- Lack of shutdown checkpoint which updates `oldestRunningXid` (because
of immediate shutdown)
- nextXid alignment on 1024 boundary (which cause loosing ~1k XIDs on
each restart)

Both problems are somehow addressed now.
But we have existed customers with "sparse" CLOG and lack of
checkpoints.
To be able to start RO replicas for such customers I suggest to add GUC
which allows replica to start even in case of subxacts overflow.

## Summary of changes

Add `neon.running_xacts_overflow_policy` with the following values:
- ignore: restore from CLOG last N XIDs and accept connections
- skip: do not restore any XIDs from CXLOGbut still accept connections
- wait: wait non-overflown running xacts record from primary node

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-15 09:34:35 -04:00
Vlad Lazar
085bbaf5f8 tests: allow list breaching min resident size in statvfs test (#8358)
## Problem
This test would sometimes violate the min resident size during disk
eviction and fail due to the generate warning log.

Disk usage candidate collection only takes into account active tenants.
However, the statvfs call takes into account the entire tenants
directory, which includes tenants which haven't become active yet.

After re-starting the pageserver, disk usage eviction may kick in
*before* both tenants have become active. Hence, the logic will try to satisfy
thedisk usage requirements by evicting everything belonging to the active
tenant, and hence violating the tenant minimum resident size.

## Summary of changes

Allow the warning
2024-07-15 09:28:35 -04:00
Alex Chi Z
85b5219861 fix(pageserver): unique test harness name for merge_in_between (#8366)
As title, there should be a way to detect duplicated harness names in
the future :(

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-07-15 09:28:35 -04:00
Conrad Ludgate
7472c69954 Fix nightly warnings 2024 june (#8151)
## Problem

new clippy warnings on nightly.

## Summary of changes

broken up each commit by warning type.
1. Remove some unnecessary refs.
2. In edition 2024, inference will default to `!` and not `()`.
3. Clippy complains about doc comment indentation
4. Fix `Trait + ?Sized` where `Trait: Sized`.
5. diesel_derives triggering `non_local_defintions`
2024-07-15 09:28:35 -04:00
John Spray
3f8819827c pageserver: circuit breaker on compaction (#8359)
## Problem

We already back off on compaction retries, but the impact of a failing
compaction can be so great that backing off up to 300s isn't enough. The
impact is consuming a lot of I/O+CPU in the case of image layer
generation for large tenants, and potentially also leaking disk space.

Compaction failures are extremely rare and almost always indicate a bug,
frequently a bug that will not let compaction to proceed until it is
fixed.

Related: https://github.com/neondatabase/neon/issues/6738

## Summary of changes

- Introduce a CircuitBreaker type
- Add a circuit breaker for compaction, with a policy that after 5
failures, compaction will not be attempted again for 24 hours.
- Add metrics that we can alert on: any >0 value for
`pageserver_circuit_breaker_broken_total` should generate an alert.
- Add a test that checks this works as intended.

Couple notes to reviewers:
- Circuit breakers are intrinsically a defense-in-depth measure: this is
not the solution to any underlying issues, it is just a general
mitigation for "unknown unknowns" that might be encountered in future.
- This PR isn't primarily about writing a perfect CircuitBreaker type:
the one in this PR is meant to be just enough to mitigate issues in
compaction, and make it easy to monitor/alert on these failures. We can
refine this type in future as/when we want to use it elsewhere.
2024-07-15 09:28:35 -04:00
Japin Li
c440756410 Remove fs2 dependency (#8350)
The fs2 dependency is not needed anymore after commit d42700280.
2024-07-15 09:28:35 -04:00
Arpad Müller
0e600eb921 Implement decompression for vectored reads (#8302)
Implement decompression of images for vectored reads.

This doesn't implement support for still treating blobs as uncompressed
with the bits we reserved for compression, as we have removed that
functionality in #8300 anyways.

Part of #5431
2024-07-15 09:28:35 -04:00
Arpad Müller
a1df835e28 Pass configured compression param to image generation (#8363)
We need to pass on the configured compression param during image layer
generation.

This was an oversight of #8106, and the likely cause why #8288 didn't
bring any interesting regressions.

Part of https://github.com/neondatabase/neon/issues/5431
2024-07-15 09:28:35 -04:00
Sasha Krassovsky
119ddf6ccf Grant execute on snapshot functions to neon_superuser (#8346)
## Problem
I need `neon_superuser` to be allowed to create snapshots for
replication tests

## Summary of changes
Adds a migration that grants these functions to neon_superuser
2024-07-15 09:28:35 -04:00
Joonas Koivunen
90f447b79d test: limit test_layer_download_timeouted to MOCK_S3 (#8331)
Requests against REAL_S3 on CI can consistently take longer than 1s;
testing the short timeouts against it made no sense in hindsight, as
MOCK_S3 works just as well.

evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8229/9857994025/index.html#suites/b97efae3a617afb71cb8142f5afa5224/6828a50921660a32
2024-07-15 09:28:35 -04:00
Alex Chi Z
7dd71f4126 feat(pageserver): rewrite streaming vectored read planner (#8242)
Rewrite streaming vectored read planner to be a separate struct. The API
is designed to produce batches around `max_read_size` instead of exactly
less than that so that `handle_XX` returns one batch a time.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-07-15 09:28:35 -04:00
Arseny Sher
8532d72276 Fix memory context of NeonWALReader allocation.
Allocating it in short living context is wrong because it is reused during
backend lifetime.
2024-07-15 09:28:35 -04:00
John Spray
d3ff47f572 storage controller: add node deletion API (#8226)
## Problem

In anticipation of later adding a really nice drain+delete API, I
initially only added an intentionally basic `/drop` API that is just
about usable for deleting nodes in a pinch, but requires some ugly
storage controller restarts to persuade it to restart secondaries.

## Summary of changes

I started making a few tiny fixes, and ended up writing the delete
API...

- Quality of life nit: ordering of node + tenant listings in storcon_cli
- Papercut: Fix the attach_hook using the wrong operation type for
reporting slow locks
- Make Service::spawn tolerate `generation_pageserver` columns that
point to nonexistent node IDs. I started out thinking of this as a
general resilience thing, but when implementing the delete API I
realized it was actually a legitimate end state after the delete API is
called (as that API doesn't wait for all reconciles to succeed).
- Add a `DELETE` API for nodes, which does not gracefully drain, but
does reschedule everything. This becomes safe to use when the system is
in any state, but will incur availability gaps for any tenants that
weren't already live-migrated away. If tenants have already been
drained, this becomes a totally clean + safe way to decom a node.
- Add a test and a storcon_cli wrapper for it

This is meant to be a robust initial API that lets us remove nodes
without doing ugly things like restarting the storage controller -- it's
not quite a totally graceful node-draining routine yet. There's more
work in https://github.com/neondatabase/neon/issues/8333 to get to our
end-end state.
2024-07-15 09:28:35 -04:00
John Spray
8cc768254f safekeeper: eviction metrics (#8348)
## Problem

Follow up to https://github.com/neondatabase/neon/pull/8335, to improve
observability of how many evict/restores we are doing.

## Summary of changes

- Add `safekeeper_eviction_events_started_total` and
`safekeeper_eviction_events_completed_total`, with a "kind" label of
evict or restore. This gives us rates, and also ability to calculate how
many are in progress.
- Generalize SafekeeperMetrics test type to use the same helpers as
pageserver, and enable querying any metric.
- Read the new metrics at the end of the eviction test.
2024-07-15 09:28:35 -04:00
Vlad Lazar
5c80743c9c storage_controller: fix ReconcilerWaiter::get_status (#8341)
## Problem
SeqWait::would_wait_for returns Ok in the case when we would not wait
for the sequence number and Err otherwise.
ReconcilerWaiter::get_status uses it the wrong way around. This can
cause the storage controller to go into a busy loop
and make it look unavailable to the k8s controller.

## Summary of changes
Use `SeqWait::would_wait_for` correctly.
2024-07-15 09:28:35 -04:00
Christian Schwarz
5bba3e3c75 pageserver: remove trace_read_requests (#8338)
`trace_read_requests` is a per `Tenant`-object option.
But the `handle_pagerequests` loop doesn't know which
`Tenant` object (i.e., which shard) the request is for.

The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple
shards](https://github.com/neondatabase/neon/issues/7427#issuecomment-2220577518).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](https://github.com/neondatabase/neon/pull/8286) for
the
high level idea.

Note that we can always bring the tracing functionality if we need it.
But since it's actually about logging the `page_service` wire bytes,
it should be a `page_service`-level config option, not per-Tenant.
And for enabling tracing on a single connection, we can implement
a `set pageserver_trace_connection;` option.
2024-07-15 09:28:35 -04:00
Peter Bendel
6caf702417 Run Performance bench on more platforms (#8312)
## Problem

https://github.com/neondatabase/cloud/issues/14721

## Summary of changes

add one more platform to benchmarking job 


57535c039c/.github/workflows/benchmarking.yml (L57C3-L126)

Run with pg 16, provisioner k8-neonvm by default on the new platform.

Adjust some test cases to

- not depend on database client <-> database server latency by pushing
loops into server side pl/pgSQL functions
- increase statement and test timeouts

First successful run of these job steps 

https://github.com/neondatabase/neon/actions/runs/9869817756/job/27254280428
2024-07-15 09:28:35 -04:00
John Spray
32f668f5e7 rfcs: add RFC for timeline archival (#8221)
A design for a cheap low-resource state for idle timelines:
- #8088
2024-07-15 09:28:35 -04:00
Stas Kelvich
a91f9d5832 Enable core dumps for postgres (#8272)
Set core rmilit to ulimited in compute_ctl, so that all child processes
inherit it. We could also set rlimit in relevant startup script, but
that way we would depend on external setup and might inadvertently
disable it again (core dumping worked in pods, but not in VMs with
inittab-based startup).
2024-07-15 09:28:35 -04:00
John Spray
547acde6cd safekeeper: add eviction_min_resident to stop evictions thrashing (#8335)
## Problem

- The condition for eviction is not time-based: it is possible for a
timeline to be restored in response to a client, that client times out,
and then as soon as the timeline is restored it is immediately evicted
again.
- There is no delay on eviction at startup of the safekeeper, so when it
starts up and sees many idle timelines, it does many evictions which
will likely be immediately restored when someone uses the timeline.

## Summary of changes

- Add `eviction_min_resident` parameter, and use it in
`ready_for_eviction` to avoid evictions if the timeline has been
resident for less than this period.
- This also implicitly delays evictions at startup for
`eviction_min_resident`
- Set this to a very low number for the existing eviction test, which
expects immediate eviction.

The default period is 15 minutes. The general reasoning for that is that
in the worst case where we thrash ~10k timelines on one safekeeper,
downloading 16MB for each one, we should set a period that would not
overwhelm the node's bandwidth.
2024-07-15 09:28:35 -04:00
Alex Chi Z
bea6532881 feat(pageserver): add k-merge layer iterator with lazy loading (#8053)
Part of https://github.com/neondatabase/neon/issues/8002. This pull
request adds a k-merge iterator for bottom-most compaction.

## Summary of changes

* Added back lsn_range / key_range in delta layer inner. This was
removed due to https://github.com/neondatabase/neon/pull/8050, but added
back because iterators need that information to process lazy loading.
* Added lazy-loading k-merge iterator.
* Added iterator wrapper as a unified iterator type for image+delta
iterator.

The current status and test should cover the use case for L0 compaction
so that the L0 compaction process can bypass page cache and have a fixed
amount of memory usage. The next step is to integrate this with the new
bottom-most compaction.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-07-15 09:28:35 -04:00
Arpad Müller
8e2fe6b22e Remove ImageCompressionAlgorithm::DisabledNoDecompress (#8300)
Removes the `ImageCompressionAlgorithm::DisabledNoDecompress` variant.
We now assume any blob with the specific bits set is actually a
compressed blob.

The `ImageCompressionAlgorithm::Disabled` variant still remains and is
the new default.

Reverts large parts of #8238 , as originally intended in that PR.

Part of #5431
2024-07-15 09:28:35 -04:00
dependabot[bot]
4d75e1ef81 build(deps-dev): bump zipp from 3.8.1 to 3.19.1
Bumps [zipp](https://github.com/jaraco/zipp) from 3.8.1 to 3.19.1.
- [Release notes](https://github.com/jaraco/zipp/releases)
- [Changelog](https://github.com/jaraco/zipp/blob/main/NEWS.rst)
- [Commits](https://github.com/jaraco/zipp/compare/v3.8.1...v3.19.1)

---
updated-dependencies:
- dependency-name: zipp
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-07-15 09:28:35 -04:00
Conrad Ludgate
4c7c00268c proxy: remove some trace logs (#8334) 2024-07-15 09:28:35 -04:00
John Spray
f28abb953d tests: stabilize test_sharding_split_compaction (#8318)
## Problem

This test incorrectly assumed that a post-split compaction would only
drop content. This was easily destabilized by any changes to image
generation rules.

## Summary of changes

- Before split, do a full image layer generation pass, to guarantee that
post-split compaction should only drop data, never create it.
- Fix the force_image_layer_creation mode of compaction that we use from
tests like this: previously it would try and generate image layers even
if one already existed with the same layer key, which caused compaction
to fail.
2024-07-15 09:28:35 -04:00
Conrad Ludgate
4df39d7304 proxy: pg17 fixes (#8321)
## Problem

#7809 - we do not support sslnegotiation=direct
#7810 - we do not support negotiating down the protocol extensions.

## Summary of changes

1. Same as postgres, check the first startup packet byte for tls header
`0x16`, and check the ALPN.
2. Tell clients using protocol >3.0 to downgrade
2024-07-15 09:28:35 -04:00
Christian Schwarz
bfc7338246 pageserver: move page_service's import basebackup / import wal to mgmt API (#8292)
I want to fix bugs in `page_service`
([issue](https://github.com/neondatabase/neon/issues/7427)) and the
`import basebackup` / `import wal` stand in the way / make the
refactoring more complicated.

We don't use these methods anyway in practice, but, there have been some
objections to removing the functionality completely.

So, this PR preserves the existing functionality but moves it into the
HTTP management API.

Note that I don't try to fix existing bugs in the code, specifically not
fixing
* it only ever worked correctly for unsharded tenants
* it doesn't clean up on error

All errors are mapped to `ApiError::InternalServerError`.
2024-07-15 09:28:35 -04:00
Christian Schwarz
35dac6e6c8 fix(l0_flush): drops permit before fsync, potential cause for OOMs (#8327)
## Problem

Slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1720511577862519

We're seeing OOMs in staging on a pageserver that has
l0_flush.mode=Direct enabled.

There's a strong correlation between jumps in `maxrss_kb` and
`pageserver_timeline_ephemeral_bytes`, so, it's quite likely that
l0_flush.mode=Direct is the culprit.

Notably, the expected max memory usage on that staging server by the
l0_flush.mode=Direct is ~2GiB but we're seeing as much as 24GiB max RSS
before the OOM kill.

One hypothesis is that we're dropping the semaphore permit before all
the dirtied pages have been flushed to disk. (The flushing to disk
likely happens in the fsync inside the `.finish()` call, because we're
using ext4 in data=ordered mode).

## Summary of changes

Hold the permit until after we're done with `.finish()`.
2024-07-15 09:28:35 -04:00
Christian Schwarz
e619e8703e refactor: postgres_backend: replace abstract shutdown_watcher with CancellationToken (#8295)
Preliminary refactoring while working on
https://github.com/neondatabase/neon/issues/7427
and specifically https://github.com/neondatabase/neon/pull/8286
2024-07-15 09:28:35 -04:00
Tristan Partin
6fd35bfe32 Add an application_name to more Neon connections
Helps identify connections in the logs.
2024-07-15 09:28:35 -04:00
Tristan Partin
547a431b0d Refactor how migrations are ran
Just a small improvement I noticed while looking at fixing CVE-2024-4317
in Neon.
2024-07-15 09:28:35 -04:00
Alex Chi Z
f8c01c6341 fix(storage-scrubber): use default AWS authentication (#8299)
part of https://github.com/neondatabase/cloud/issues/14024
close https://github.com/neondatabase/neon/issues/7665

Things running in k8s container use this authentication:
https://docs.aws.amazon.com/sdkref/latest/guide/feature-container-credentials.html
while we did not configure the client to use it. This pull request
simply uses the default s3 client credential chain for storage scrubber.
It might break compatibility with minio.

## Summary of changes

* Use default AWS credential provider chain.
* Improvements for s3 errors, we now have detailed errors and correct
backtrace on last trial of the operation.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-07-15 09:28:35 -04:00
Conrad Ludgate
1145700f87 chore: fix nightly build (#8142)
## Problem

`cargo +nightly check` fails

## Summary of changes

Updates `measured`, `time`, and `crc32c`.

* `measured`: updated to fix
https://github.com/rust-lang/rust/issues/125763.
* `time`: updated to fix https://github.com/rust-lang/rust/issues/125319
* `crc32c`: updated to remove some nightly feature detection with a
removed nightly feature
2024-07-15 09:28:35 -04:00
Alex Chi Z
44339f5b70 chore(storage-scrubber): allow disable file logging (#8297)
part of https://github.com/neondatabase/cloud/issues/14024, k8s does not
always have a volume available for logging, and I'm running into weird
permission errors... While I could spend time figuring out how to create
temp directories for logging, I think it would be better to just disable
file logging as k8s containers are ephemeral and we cannot retrieve
anything on the fs after the container gets removed.
  
## Summary of changes

`PAGESERVER_DISABLE_FILE_LOGGING=1` -> file logging disabled

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-07-15 09:28:35 -04:00
Luca BRUNO
7b4a9c1d82 proxy/http: avoid spurious vector reallocations
This tweaks the rows-to-JSON rendering logic in order to avoid
allocating 0-sized temporary vectors and later growing them
to insert elements.
As the exact size is known in advance, both vectors can be built
with an exact capacity upfront. This will avoid further vector
growing/reallocation in the rendering hotpath.

Signed-off-by: Luca BRUNO <lucab@lucabruno.net>
2024-07-15 09:28:35 -04:00
Alexander Bayandin
3b2fc27de4 CI(promote-compatibility-data): take into account commit sha (#8283)
## Problem

In https://github.com/neondatabase/neon/pull/8161, we changed the path
to Neon artefacts by adding commit sha to it, but we missed adding these
changes to `promote-compatibility-data` job that we use for
backward/forward- compatibility testing.

## Summary of changes
- Add commit sha to `promote-compatibility-data`
2024-07-15 09:28:35 -04:00
Yuchen Liang
0b6492e7d3 tests: increase approx size equal threshold to avoid test_lsn_lease_size flakiness (#8282)
## Summary of changes

Increase the `assert_size_approx_equal` threshold to avoid flakiness of
`test_lsn_lease_size`. Still needs more investigation to fully resolve
#8293.

- Also set `autovacuum=off` for the endpoint we are running in the test.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-07-15 09:28:35 -04:00
John Spray
7cfaecbeb6 tests: stabilize test_timeline_size_quota_on_startup (#8255)
## Problem

`test_timeline_size_quota_on_startup` assumed that writing data beyond
the size limit would always be blocked. This is not so: the limit is
only enforced if feedback makes it back from the pageserver to the
safekeeper + compute.

Closes: https://github.com/neondatabase/neon/issues/6562

## Summary of changes

- Modify the test to wait for the pageserver to catch up. The size limit
was never actually being enforced robustly, the original version of this
test was just writing much more than 30MB and about 98% of the time
getting lucky such that the feedback happened to arrive before the tests
for loop was done.
- If the test fails, log the logical size as seen by the pageserver.
2024-07-15 09:28:35 -04:00
Alex Chi Z
472acae615 fix(pageserver): write to both v1+v2 for aux tenant import (#8316)
close https://github.com/neondatabase/neon/issues/8202 ref
https://github.com/neondatabase/neon/pull/6560

For tenant imports, we now write the aux files into both v1+v2 storage,
so that the test case can pick either one for testing. Given the API is
only used for testing, this looks like a safe change.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-07-15 09:28:35 -04:00
John Spray
108bf56e44 tests: use smaller layers in test_pg_regress (#8232)
## Problem

Debug-mode runs of test_pg_regress are rather slow since
https://github.com/neondatabase/neon/pull/8105, and occasionally exceed
their 600s timeout.

## Summary of changes

- Use 8MiB layer files, avoiding large ephemeral layers

On a hetzner AX102, this takes the runtime from 230s to 190s. Which
hopefully will be enough to get the runtime on github runners more
reliably below its 600s timeout.

This has the side benefit of exercising more of the pageserver stack
(including compaction) under a workload that exercises a more diverse
set of postgres functionality than most of our tests.
2024-07-15 09:28:35 -04:00
Alexey Kondratov
e83a499ab4 compute_ctl: Use 'fast' shutdown for Postgres termination (#8289)
## Problem

We currently use 'immediate' mode in the most commonly used shutdown
path, when the control plane calls a `compute_ctl` API to terminate
Postgres inside compute without waiting for the actual pod / VM
termination. Yet, 'immediate' shutdown doesn't create a shutdown
checkpoint and ROs have bad times figuring out the list of running xacts
during next start.

## Summary of changes

Use 'fast' mode, which creates a shutdown checkpoint that is important
for ROs to get a list of running xacts faster instead of going through
the CLOG. On the control plane side, we poll this `compute_ctl`
termination API for 10s, it should be enough as we don't really write
any data at checkpoint time. If it times out, we anyway switch to the
slow k8s-based termination.

See https://www.postgresql.org/docs/current/server-shutdown.html for the
list of modes and signals.

The default VM shutdown hook already uses `fast` mode, see [1]

[1]
c9fd8d7693/vm-image-spec.yaml (L30-L31)

Related to #6211
2024-07-15 09:28:35 -04:00
Yuchen Liang
ebf3bfadde refactor: move part of sharding API from pageserver_api to utils (#8254)
## Problem

LSN Leases introduced in #8084 is a new API that is made shard-aware
from day 1. To support ephemeral endpoint in #7994 without linking
Postgres C API against `compute_ctl`, part of the sharding needs to
reside in `utils`.

## Summary of changes

- Create a new `shard` module in utils crate.
- Move more interface related part of tenant sharding API to utils and
re-export them in pageserver_api.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-07-15 09:28:35 -04:00
John Spray
ab06240fae pageserver: respect has_relmap_file in collect_keyspace (#8276)
## Problem

Rarely, a dbdir entry can exist with no `relmap_file_key` data. This
causes compaction to fail, because it assumes that if the database
exists, then so does the relmap file.

Basebackup already handled this using a boolean to record whether such a
key exists, but `collect_keyspace` didn't.

## Summary of changes

- Respect the flag for whether a relfilemap exists in collect_keyspace
- The reproducer for this issue will merge separately in
https://github.com/neondatabase/neon/pull/8232
2024-07-15 09:28:35 -04:00
Tristan Partin
cec216c5c0 Add long running replication tests
These tests will help verify that replication, both physical and
logical, works as expected in Neon.

Co-authored-by: Sasha Krassovsky <sasha@neon.tech>
2024-07-15 09:28:35 -04:00
Tristan Partin
930201e033 Add PgBin.run_nonblocking()
Allows a process to run without blocking program execution, which can be
useful for certain test scenarios.

Co-authored-by: Sasha Krassovsky <sasha@neon.tech>
2024-07-15 09:28:35 -04:00
Tristan Partin
8328580dc2 Log PG environment variables when a PgBin runs
Useful for debugging situations like connecting to databases.

Co-authored-by: Sasha Krassovsky <sasha@neon.tech>
2024-07-15 09:28:35 -04:00
Tristan Partin
8d9b632f2a Add Neon HTTP API test fixture
This is a Python binding to the Neon HTTP API. It isn't complete, but
can be extended as necessary.

Co-authored-by: Sasha Krassovsky <sasha@neon.tech>
2024-07-15 09:28:35 -04:00
Tristan Partin
55d37c77b9 Hide import behind TYPE_CHECKING
No need to import it if we aren't type checking anything.
2024-07-15 09:28:35 -04:00
John Spray
0948fb6bf1 pageserver: switch to jemalloc (#8307)
## Problem

- Resident memory on long running pageserver processes tends to climb:
memory fragmentation is suspected.
- Total resident memory may be a limiting factor for running on smaller
nodes.

## Summary of changes

- As a low-energy experiment, switch the pageserver to use jemalloc (not
a net-new dependency, proxy already use it)
- Decide at end of week whether to revert before next release.
2024-07-15 09:28:35 -04:00
Alex Chi Z
285c6d2974 fix(pageserver): ensure sparse keyspace is ordered (#8285)
## Problem

Sparse keyspaces were constructed with ranges out of order: this didn't break things obviously, but meant that users of KeySpace functions that assume ordering would assert out.

Closes https://github.com/neondatabase/neon/issues/8277

## Summary of changes

make sure the sparse keyspace has ordered keyspace parts
2024-07-15 09:28:35 -04:00
Vlad Lazar
a5491463e1 Merge pull request #8304 from neondatabase/rc/2024-07-08
Storage & Compute release 2024-07-08
2024-07-08 20:25:54 +01:00
dependabot[bot]
a58827f952 build(deps): bump certifi from 2023.7.22 to 2024.7.4 (#8301) 2024-07-08 17:22:36 +01:00
Arpad Müller
36b790f282 Add concurrency to the find-large-objects scrubber subcommand (#8291)
The find-large-objects scrubber subcommand is quite fast if you run it
in an environment with low latency to the S3 bucket (say an EC2 instance
in the same region). However, the higher the latency gets, the slower
the command becomes. Therefore, add a concurrency param and make it
parallelized. This doesn't change that general relationship, but at
least lets us do multiple requests in parallel and therefore hopefully
faster.

Running with concurrency of 64 (default):

```
2024-07-05T17:30:22.882959Z  INFO lazy_load_identity [...]
[...]
2024-07-05T17:30:28.289853Z  INFO Scanned 500 shards. [...]
```

With concurrency of 1, simulating state before this PR:

```
2024-07-05T17:31:43.375153Z  INFO lazy_load_identity [...]
[...]
2024-07-05T17:33:51.987092Z  INFO Scanned 500 shards. [...]
```

In other words, to list 500 shards, speed is increased from 2:08 minutes
to 6 seconds.

Follow-up of  #8257, part of #5431
2024-07-08 17:22:36 +01:00
Arpad Müller
3ef7748e6b Improve parsing of ImageCompressionAlgorithm (#8281)
Improve parsing of the `ImageCompressionAlgorithm` enum to allow level
customization like `zstd(1)`, as strum only takes `Default::default()`,
i.e. `None` as the level.

Part of #5431
2024-07-08 17:22:36 +01:00
Christian Schwarz
f3310143e4 pageserver_live_connections: track as counter pair (#8227)
Generally counter pairs are preferred over gauges.
In this case, I found myself asking what the typical rate of accepted
page_service connections on a pageserver is, and I couldn't answer it
with the gauge metric.

There are a few dashboards using this metric:

https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_live_connections&type=code

I'll convert them to use the new metric once this PR reaches prod.

refs https://github.com/neondatabase/neon/issues/7427
2024-07-08 17:22:36 +01:00
Konstantin Knizhnik
05b4169644 Increase timeout for wating subscriber caught-up (#8118)
## Problem

test_subscriber_restart has quit large failure rate'

https://neonprod.grafana.net/d/fddp4rvg7k2dcf/regression-test-failures?orgId=1&var-test_name=test_subscriber_restart&var-max_count=100&var-restrict=false

I can be caused by too small timeout (5 seconds) to wait until changes
are propagated.

Related to #8097

## Summary of changes

Increase timeout to 30 seconds.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-08 17:22:36 +01:00
Alexander Bayandin
d1495755e7 SELECT 💣(); (#8270)
## Problem
We want to be able to test how our infrastructure reacts on segfaults in
Postgres (for example, we collect cores, and get some required
logs/metrics, etc)

## Summary of changes
- Add `trigger_segfauls` function to `neon_test_utils` to trigger a
segfault in Postgres
- Add `trigger_panic` function to `neon_test_utils` to trigger SIGABRT
(by using `elog(PANIC, ...))
- Fix cleanup logic in regression tests in endpoint crashed
2024-07-08 17:22:36 +01:00
Vlad Lazar
c8dd78c6c8 pageserver: add time based image layer creation check (#8247)
## Problem
Assume a timeline with the following workload: very slow ingest of
updates to a small number of keys that fit within the same partition (as decided by
`KeySpace::partition`). These tenants will create small L0 layers since due to time 
based rolling, and, consequently, the L1 layers will also be small.

Currently, by default, we need to ingest 512 MiB of WAL before checking
if an image layer is required. This scheme works fine under the assumption that L1s are roughly of
checkpoint distance size, but as the first paragraph explained, that's not the case for all workloads.

## Summary of changes
Check if new image layers are required at least once every checkpoint timeout interval.
2024-07-08 17:22:36 +01:00
John Spray
b44ee3950a safekeeper: add separate tombstones map for deleted timelines (#8253)
## Problem

Safekeepers left running for a long time use a lot of memory (up to the
point of OOMing, on small nodes) for deleted timelines, because the
`Timeline` struct is kept alive as a guard against recreating deleted
timelines.

Closes: https://github.com/neondatabase/neon/issues/6810

## Summary of changes

- Create separate tombstones that just record a ttid and when the
timeline was deleted.
- Add a periodic housekeeping task that cleans up tombstones older than
a hardcoded TTL (24h)

I think this also makes https://github.com/neondatabase/neon/pull/6766
un-needed, as the tombstone is also checked during deletion.

I considered making the overall timeline map use an enum type containing
active or deleted, but having a separate map of tombstones avoids
bloating that map, so that calls like `get()` can still go straight to a
timeline without having to walk a hashmap that also contains tombstones.
2024-07-08 17:22:36 +01:00
John Spray
64334f497d tests: make location_conf_churn more robust (#8271)
## Problem

This test directly manages locations on pageservers and configuration of
an endpoint. However, it did not switch off the parts of the storage
controller that attempt to do the same: occasionally, the test would
fail in a strange way such as a compute failing to accept a
reconfiguration request.

## Summary of changes

- Wire up the storage controller's compute notification hook to a no-op
handler
- Configure the tenant's scheduling policy to Stop.
2024-07-08 17:22:35 +01:00
Peter Bendel
5ffcb688cc correct error handling for periodic pagebench runner status (#8274)
## Problem

the following periodic pagebench run was failed but was still shown as
successful


https://github.com/neondatabase/neon/actions/runs/9798909458/job/27058179993#step:9:47

## Summary of changes

if the ec2 test runner reports a failure fail the job step and thus the
workflow

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-07-08 17:22:35 +01:00
John Spray
32fc2dd683 tests: extend allow list in deletion test (#8268)
## Problem

1ea5d8b132 tolerated this as an error
message, but it can show up in logs as well.

Example failure:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8201/9780147712/index.html#testresult/263422f5f5f292ea/retries

## Summary of changes

- Tolerate "failed to delete 1 objects" in pageserver logs, this occurs
occasionally when injected failures exhaust deletion's retries.
2024-07-08 17:22:35 +01:00
Peter Bendel
d35ddfbab7 add checkout depth1 to workflow to access local github actions like generate allure report (#8259)
## Problem

job step to create allure report fails


https://github.com/neondatabase/neon/actions/runs/9781886710/job/27006997416#step:11:1

## Summary of changes

Shallow checkout of sources to get access to local github action needed
in the job step

## Example run 
example run with this change
https://github.com/neondatabase/neon/actions/runs/9790647724
do not merge this PR until the job is clean

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-07-08 17:22:35 +01:00
Konstantin Knizhnik
3ee82a9895 implement rolling hyper-log-log algorithm (#8068)
## Problem

See #7466

## Summary of changes

Implement algorithm descried in
https://hal.science/hal-00465313/document

Now new GUC is added:
`neon.wss_max_duration` which specifies size of sliding window (in
seconds). Default value is 1 hour.

It is possible to request estimation of working set sizes (within this
window using new function
`approximate_working_set_size_seconds`. Old function
`approximate_working_set_size` is preserved for backward compatibility.
But its scope is also limited by `neon.wss_max_duration`.

Version of Neon extension is changed to 1.4

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
2024-07-08 17:22:35 +01:00
Arpad Müller
e770aeee92 Flatten compression algorithm setting (#8265)
This flattens the compression algorithm setting, removing the
`Option<_>` wrapping layer and making handling of the setting easier.

It also adds a specific setting for *disabled* compression with the
continued ability to read copmressed data, giving us the option to
more easily back out of a compression rollout, should the need arise,
which was one of the limitations of #8238.

Implements my suggestion from
https://github.com/neondatabase/neon/pull/8238#issuecomment-2206181594 ,
inspired by Christian's review in
https://github.com/neondatabase/neon/pull/8238#pullrequestreview-2156460268 .

Part of #5431
2024-07-08 17:22:35 +01:00
Yuchen Liang
32828cddd6 feat(pageserver): integrate lsn lease into synthetic size (#8220)
Part of #7497, closes #8071. (accidentally closed #8208, reopened here)

## Problem

After the changes in #8084, we need synthetic size to also account for
leased LSNs so that users do not get free retention by running a small
ephemeral endpoint for a long time.

## Summary of changes

This PR integrates LSN leases into the synthetic size calculation. We
model leases as read-only branches started at the leased LSN (except it
does not have a timeline id).

Other changes:
- Add new unit tests testing whether a lease behaves like a read-only
branch.
- Change `/size_debug` response to include lease point in the SVG
visualization.
- Fix `/lsn_lease` HTTP API to do proper parsing for POST.



Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-07-08 17:22:35 +01:00
Arpad Müller
bd2046e1ab Add find-large-objects subcommand to scrubber (#8257)
Adds a find-large-objects subcommand to the scrubber to allow listing
layer objects larger than a specific size.

To be used like:

```
AWS_PROFILE=dev REGION=us-east-2 BUCKET=neon-dev-storage-us-east-2 cargo run -p storage_scrubber -- find-large-objects --min-size 250000000 --ignore-deltas
```

Part of #5431
2024-07-08 17:22:35 +01:00
John Spray
7e2a3d2728 pageserver: downgrade stale generation messages to INFO (#8256)
## Problem

When generations were new, these messages were an important way of
noticing if something unexpected was going on. We found some real issues
when investigating tests that unexpectedly tripped them.

At time has gone on, this code is now pretty battle-tested, and as we do
more live migrations etc, it's fairly normal to see the occasional
message from a node with a stale generation.

At this point the cognitive load on developers to selectively allow-list
these logs outweighs the benefit of having them at warn severity.

Closes: https://github.com/neondatabase/neon/issues/8080

## Summary of changes

- Downgrade "Dropped remote consistent LSN updates" and "Dropping stale
deletions" messages to INFO
- Remove all the allow-list entries for these logs.
2024-07-08 17:22:35 +01:00
Alexander Bayandin
0e4832308d CI(pg-clients): unify workflow with build-and-test (#8160)
## Problem

`pg-clients` workflow looks different from the main `build-and-test`
workflow for historical reasons (it was my very first task at Neon, and 
back then I wasn't really familiar with the rest of the CI pipelines).
This PR unifies `pg-clients` workflow with `build-and-test`

## Summary of changes
- Rename `pg_clients.yml` to `pg-clients.yml`
- Run the workflow on changes in relevant files
- Create Allure report for tests
- Send slack notifications to `#on-call-qa-staging-stream` channel
(instead of `#on-call-staging-stream`)
- Update Client libraries once we're here
2024-07-08 17:22:35 +01:00
Arpad Müller
0a63bc4818 Use bool param for round_trip_test_compressed (#8252)
As per @koivunej 's request in
https://github.com/neondatabase/neon/pull/8238#discussion_r1663892091 ,
use a runtime param instead of monomorphizing the function based on the value.

Part of https://github.com/neondatabase/neon/issues/5431
2024-07-08 17:22:35 +01:00
Vlad Lazar
2897dcc9aa pageserver: increase rate limit duration for layer visit log (#8263)
## Problem
I'd like to keep this in the tree since it might be useful in prod as
well. It's a bit too noisy as is and missing the lsn.

## Summary of changes
Add an lsn field and and increase the rate limit duration.
2024-07-08 17:22:35 +01:00
Alexander Bayandin
1d0ec50ddb CI(build-and-test): add conclusion job (#8246)
## Problem

Currently, if you need to rename a job and the job is listed in [branch
protection
rules](https://github.com/neondatabase/neon/settings/branch_protection_rules),
the PR won't be allowed to merge.

## Summary of changes
- Add `conclusion` job that fails if any of its dependencies don't
finish successfully
2024-07-08 17:22:35 +01:00
Conrad Ludgate
a86b43fcd7 proxy: cache certain non-retriable console errors for a short time (#8201)
## Problem

If there's a quota error, it makes sense to cache it for a short window
of time. Many clients do not handle database connection errors
gracefully, so just spam retry 🤡

## Summary of changes

Updates the node_info cache to support storing console errors. Store
console errors if they cannot be retried (using our own heuristic.
should only trigger for quota exceeded errors).
2024-07-08 17:22:35 +01:00
Vlad Lazar
b917868ada tests: perform graceful rolling restarts in storcon scale test (#8173)
## Problem
Scale test doesn't exercise drain & fill.

## Summary of changes
Make scale test exercise drain & fill
2024-07-08 17:22:35 +01:00
John Spray
7b7d16f52e pageserver: add supplementary branch usage stats (#8131)
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: https://github.com/neondatabase/neon/issues/8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in https://github.com/neondatabase/neon/pull/8245
than this PR is adding.
2024-07-08 17:22:35 +01:00
Alex Chi Z
fee4169b6b fix(pageserver): ensure test creates valid layer map (#8191)
I'd like to add some constraints to the layer map we generate in tests.

(1) is the layer map that the current compaction algorithm will produce.
There is a property that for all delta layer, all delta layer overlaps
with it on the LSN axis will have the same LSN range.
(2) is the layer map that cannot be produced with the legacy compaction
algorithm.
(3) is the layer map that will be produced by the future
tiered-compaction algorithm. The current validator does not allow that
but we can modify the algorithm to allow it in the future.

## Summary of changes

Add a validator to check if the layer map is valid and refactor the test
cases to include delta layer start/end LSN.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-07-08 17:22:35 +01:00
Christian Schwarz
47e06a2cc6 page_service: stop exposing get_last_record_rlsn (#8244)
Compute doesn't use it, let's eliminate it.

Ref to Slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1719920261995529
2024-07-08 17:22:35 +01:00
Japin Li
c4423c0623 Fix outdated comment (#8149)
Commit 97b48c23f changes the log wait timeout from 1 second to 100
milliseconds but forgets to update the comment.
2024-07-08 17:22:35 +01:00
John Spray
a11cf03123 pageserver: reduce ops tracked at per-timeline detail (#8245)
## Problem

We record detailed histograms for all page_service op types, which
mostly aren't very interesting, but make our prometheus scrapes huge.

Closes: #8223 

## Summary of changes

- Only track GetPageAtLsn histograms on a per-timeline granularity. For
all other operation types, rely on existing node-wide histograms.
2024-07-08 17:22:35 +01:00
Peter Bendel
08b33adfee add pagebench test cases for periodic pagebench on dedicated hardware (#8233)
we want to run some specific pagebench test cases on dedicated hardware
to get reproducible results

run1: 1 client per tenant => characterize throughput with n tenants.
-  500 tenants
- scale 13 (200 MB database)
- 1 hour duration
- ca 380 GB layer snapshot files

run2.singleclient: 1 client per tenant => characterize latencies
run2.manyclient: N clients per tenant => characterize throughput
scalability within one tenant.
- 1 tenant with 1 client for latencies
- 1 tenant with 64 clients because typically for a high number of
connections we recommend the connection pooler
which by default uses 64 connections (for scalability)
- scale 136 (2048 MB database)
- 20 minutes each
2024-07-08 17:22:35 +01:00
Arpad Müller
4fb50144dd Only support compressed reads if the compression setting is present (#8238)
PR #8106 was created with the assumption that no blob is larger than
`256 MiB`. Due to #7852 we have checking for *writes* of blobs larger
than that limit, but we didn't have checking for *reads* of such large
blobs: in theory, we could be reading these blobs every day but we just
don't happen to write the blobs for some reason.

Therefore, we now add a warning for *reads* of such large blobs as well.

To make deploying compression less dangerous, we therefore only assume a
blob is compressed if the compression setting is present in the config.
This also means that we can't back out of compression once we enabled
it.

Part of https://github.com/neondatabase/neon/issues/5431
2024-07-08 17:22:35 +01:00
John Spray
c500137ca9 pageserver: don't try to flush if shutdown during attach (#8235)
## Problem

test_location_conf_churn fails on log errors when it tries to shutdown a
pageserver immediately after starting a tenant attach, like this:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8224/9761000525/index.html#/testresult/15fb6beca5c7327c

```
shutdown:shutdown{tenant_id=35f5c55eb34e7e5e12288c5d8ab8b909 shard_id=0000}:timeline_shutdown{timeline_id=30936747043353a98661735ad09cbbfe shutdown_mode=FreezeAndFlush}: failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited\n')
```

This is happening because Tenant::shutdown fires its cancellation token
early if the tenant is not fully attached by the time shutdown is
called, so the flush loop is shutdown by the time we try and flush.

## Summary of changes

- In the early-cancellation case, also set the shutdown mode to Hard to
skip trying to do a flush that will fail.
2024-07-08 17:22:35 +01:00
Alexander Bayandin
252c4acec9 CI: update docker/* actions to latest versions (#7694)
## Problem

GitHub Actions complain that we use actions that depend on deprecated
Node 16:

```
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: docker/setup-buildx-action@v2
```

But also, the latest `docker/setup-buildx-action` fails with the following
error:
```
/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:175
            throw new Error(`Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.`);
^
Error: Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.
    at Object.rejected (/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:175:1)
    at Generator.next (<anonymous>)
    at fulfilled (/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:29:1)
```

We can work this around by setting `cache-binary: false` for `uses:
docker/setup-buildx-action@v3`

## Summary of changes
- Update `docker/setup-buildx-action` from `v2` to `v3`, set
`cache-binary: false`
- Update `docker/login-action` from `v2` to `v3`
- Update `docker/build-push-action` from `v4`/`v5` to `v6`
2024-07-08 17:22:35 +01:00
Heikki Linnakangas
db70c175e6 Simplify test_wal_page_boundary_start test (#8214)
All the code to ensure the WAL record lands at a page boundary was
unnecessary for reproducing the original problem. In fact, it's a pretty
basic test that checks that outbound replication (= neon as publisher)
still works after restarting the endpoint. It just used to be very
broken before commit 5ceccdc7de, which also added this test.

To verify that:

1. Check out commit f3af5f4660 (because the next commit, 7dd58e1449,
fixed the same bug in a different way, making it infeasible to revert
the bug fix in an easy way)
2. Revert the bug fix from commit 5ceccdc7de with this:

```
diff --git a/pgxn/neon/walproposer_pg.c b/pgxn/neon/walproposer_pg.c
index 7debb6325..9f03bbd99 100644
--- a/pgxn/neon/walproposer_pg.c
+++ b/pgxn/neon/walproposer_pg.c
@@ -1437,8 +1437,10 @@ XLogWalPropWrite(WalProposer *wp, char *buf, Size nbytes, XLogRecPtr recptr)
 	 *
 	 * https://github.com/neondatabase/neon/issues/5749
 	 */
+#if 0
 	if (!wp->config->syncSafekeepers)
 		XLogUpdateWalBuffers(buf, recptr, nbytes);
+#endif

 	while (nbytes > 0)
 	{
```

3. Run the test_wal_page_boundary_start regression test. It fails, as
expected

4. Apply this commit to the test, and run it again. It still fails, with
the same error mentioned in issue #5749:

```
PG:2024-06-30 20:49:08.805 GMT [1248196] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.567 GMT [1467972] LOG:  starting logical decoding for slot "sub1"
PG:2024-06-30 21:37:52.567 GMT [1467972] DETAIL:  Streaming transactions committing after 0/1532330, reading WAL from 0/1531C78.
PG:2024-06-30 21:37:52.567 GMT [1467972] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.567 GMT [1467972] LOG:  logical decoding found consistent point at 0/1531C78
PG:2024-06-30 21:37:52.567 GMT [1467972] DETAIL:  There are no running transactions.
PG:2024-06-30 21:37:52.567 GMT [1467972] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.568 GMT [1467972] ERROR:  could not find record while sending logically-decoded data: invalid contrecord length 312 (expected 6) at 0/1533FD8
```
2024-07-08 17:22:35 +01:00
Alex Chi Z
ed3b4a58b4 docker: add storage_scrubber into the docker image (#8239)
## Problem

We will run this tool in the k8s cluster. To make it accessible from
k8s, we need to package it into the docker image.

part of https://github.com/neondatabase/cloud/issues/14024
2024-07-08 17:22:35 +01:00
Konstantin Knizhnik
2863d1df63 Add test for proper handling of connection failure to avoid 'cannot wait on socket event without a socket' error (#8231)
## Problem

See https://github.com/neondatabase/cloud/issues/14289
and PR #8210 

## Summary of changes

Add test for problems fixed in #8210

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-08 17:22:35 +01:00
Alex Chi Z
320b24eab3 fix(pageserver): comments about metadata key range (#8236)
Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-07-08 17:22:35 +01:00
John Spray
13a8a5b09b tense of errors (#8234)
I forgot a commit when merging
https://github.com/neondatabase/neon/pull/8177
2024-07-08 17:22:35 +01:00
Alexander Bayandin
64ccdf65e0 CI(benchmarking): move psql queries to actions/run-python-test-set (#8230)
## Problem

Some of the Nightly benchmarks fail with the error
```
+ /tmp/neon/pg_install/v14/bin/pgbench --version
/tmp/neon/pg_install/v14/bin/pgbench: error while loading shared libraries: libpq.so.5: cannot open shared object file: No such file or directory
```
Originally, we added the `pgbench --version` call to check that
`pgbench` is installed and to fail earlier if it's not.
The failure happens because we don't have `LD_LIBRARY_PATH` set for
every job, and it also affects `psql` command.
We can move it to `actions/run-python-test-set` so as not to duplicate
code (as it already have `LD_LIBRARY_PATH` set).

## Summary of changes
- Remove `pgbench --version` call
- Move `psql` commands to common `actions/run-python-test-set`
2024-07-08 17:22:35 +01:00
Christian Schwarz
1ae6aa09dd L0 flush: opt-in mechanism to bypass PageCache reads and writes (#8190)
part of https://github.com/neondatabase/neon/issues/7418

# Motivation

(reproducing #7418)

When we do an `InMemoryLayer::write_to_disk`, there is a tremendous
amount of random read I/O, as deltas from the ephemeral file (written in
LSN order) are written out to the delta layer in key order.

In benchmarks (https://github.com/neondatabase/neon/pull/7409) we can
see that this delta layer writing phase is substantially more expensive
than the initial ingest of data, and that within the delta layer write a
significant amount of the CPU time is spent traversing the page cache.

# High-Level Changes

Add a new mode for L0 flush that works as follows:

* Read the full ephemeral file into memory -- layers are much smaller
than total memory, so this is afforable
* Do all the random reads directly from this in memory buffer instead of
using blob IO/page cache/disk reads.
* Add a semaphore to limit how many timelines may concurrently do this
(limit peak memory).
* Make the semaphore configurable via PS config.

# Implementation Details

The new `BlobReaderRef::Slice` is a temporary hack until we can ditch
`blob_io` for `InMemoryLayer` => Plan for this is laid out in
https://github.com/neondatabase/neon/issues/8183

# Correctness

The correctness of this change is quite obvious to me: we do what we did
before (`blob_io`) but read from memory instead of going to disk.

The highest bug potential is in doing owned-buffers IO. I refactored the
API a bit in preliminary PR
https://github.com/neondatabase/neon/pull/8186 to make it less
error-prone, but still, careful review is requested.

# Performance

I manually measured single-client ingest performance from `pgbench -i
...`.

Full report:
https://neondatabase.notion.site/2024-06-28-benchmarking-l0-flush-performance-e98cff3807f94cb38f2054d8c818fe84?pvs=4

tl;dr:

* no speed improvements during ingest,  but
* significantly lower pressure on PS PageCache (eviction rate drops to
1/3)
  * (that's why I'm working on this)
* noticable but modestly lower CPU time

This is good enough for merging this PR because the changes require
opt-in.

We'll do more testing in staging & pre-prod.

# Stability / Monitoring

**memory consumption**: there's no _hard_ limit on max `InMemoryLayer`
size (aka "checkpoint distance") , hence there's no hard limit on the
memory allocation we do for flushing. In practice, we a) [log a
warning](23827c6b0d/pageserver/src/tenant/timeline.rs (L5741-L5743))
when we flush oversized layers, so we'd know which tenant is to blame
and b) if we were to put a hard limit in place, we would have to decide
what to do if there is an InMemoryLayer that exceeds the limit.
It seems like a better option to guarantee a max size for frozen layer,
dependent on `checkpoint_distance`. Then limit concurrency based on
that.

**metrics**: we do have the
[flush_time_histo](23827c6b0d/pageserver/src/tenant/timeline.rs (L3725-L3726)),
but that includes the wait time for the semaphore. We could add a
separate metric for the time spent after acquiring the semaphore, so one
can infer the wait time. Seems unnecessary at this point, though.
2024-07-08 17:22:35 +01:00
Arpad Müller
aeb68e51df Add support for reading and writing compressed blobs (#8106)
Add support for reading and writing zstd-compressed blobs for use in
image layer generation, but maybe one day useful also for delta layers.
The reading of them is unconditional while the writing is controlled by
the `image_compression` config variable allowing for experiments.

For the on-disk format, we re-use some of the bitpatterns we currently
keep reserved for blobs larger than 256 MiB. This assumes that we have
never ever written any such large blobs to image layers.

After the preparation in #7852, we now are unable to read blobs with a
size larger than 256 MiB (or write them).

A non-goal of this PR is to come up with good heuristics of when to
compress a bitpattern. This is left for future work.

Parts of the PR were inspired by #7091.

cc  #7879

Part of #5431
2024-07-08 17:22:35 +01:00
Vlad Lazar
c3e5223a5d pageserver: rate limit log for loads of layers visited (#8228)
## Problem
At high percentiles we see more than 800 layers being visited by the
read path. We need the tenant/timeline to investigate.

## Summary of changes
Add a rate limited log line when the average number of layers visited
per key is in the last specified histogram bucket.
I plan to use this to identify tenants in us-east-2 staging that exhibit
this behaviour. Will revert before next week's release.
2024-07-08 17:22:35 +01:00
Christian Schwarz
daaa3211a4 fix: noisy logging when download gets cancelled during shutdown (#8224)
Before this PR, during timeline shutdown, we'd occasionally see
log lines like this one:

```
2024-06-26T18:28:11.063402Z  INFO initial_size_calculation{tenant_id=$TENANT,shard_id=0000 timeline_id=$TIMELINE}:logical_size_calculation_task:get_or_maybe_download{layer=000000000000000000000000000000000000-000000067F0001A3950001C1630100000000__0000000D88265898}: layer file download failed, and caller has been cancelled: Cancelled, shutting down
Stack backtrace:
   0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1964:27
      pageserver::tenant::remote_timeline_client::RemoteTimelineClient::download_layer_file::{{closure}}
             at /home/nonroot/pageserver/src/tenant/remote_timeline_client.rs:531:13
      pageserver::tenant::storage_layer::layer::LayerInner::download_and_init::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1136:14
      pageserver::tenant::storage_layer::layer::LayerInner::download_init_and_wait::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1082:74
```

We can eliminate the anyhow backtrace with no loss of information
because the conversion to anyhow::Error happens in exactly one place.

refs #7427
2024-07-08 17:22:35 +01:00
John Spray
7ff9989dd5 pageserver: simpler, stricter config error handling (#8177)
## Problem

Tenant attachment has error paths for failures to write local
configuration, but these types of local storage I/O errors should be
considered fatal for the process. Related thread on an earlier PR that
touched this code:
https://github.com/neondatabase/neon/pull/7947#discussion_r1655134114

## Summary of changes

- Make errors writing tenant config fatal (abort process)
- When reading tenant config, make all I/O errors except ENOENT fatal
- Replace use of bare anyhow errors with `LoadConfigError`
2024-07-08 17:22:35 +01:00
Christian Schwarz
ed3b97604c remote_storage config: move handling of empty inline table {} to callers (#8193)
Before this PR, `RemoteStorageConfig::from_toml` would support
deserializing an
empty `{}` TOML inline table to a `None`, otherwise try `Some()`.

We can instead let
* in proxy: let clap derive handle the Option
* in PS & SK: assume that if the field is specified, it must be a valid
  RemtoeStorageConfig

(This PR started with a much simpler goal of factoring out the
`deserialize_item` function because I need that in another PR).
2024-07-08 17:22:35 +01:00
Konstantin Knizhnik
47c50ec460 Check status of connection after PQconnectStartParams (#8210)
## Problem

See https://github.com/neondatabase/cloud/issues/14289

## Summary of changes

Check connection status after calling PQconnectStartParams

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-08 17:22:35 +01:00
Vlad Lazar
8c0ec2f681 docs: Graceful storage controller cluster restarts RFC (#7704)
RFC for "Graceful Restarts of Storage Controller Managed Clusters". 
Related https://github.com/neondatabase/neon/issues/7387
2024-07-08 17:22:35 +01:00
Heikki Linnakangas
588bda98e7 tests: Make neon_xlogflush() flush all WAL, if you omit the LSN arg (#8215)
This makes it much more convenient to use in the common case that you
want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the
argument doesn't work for the same reasons as explained in the comments:
we need to be back off to the beginning of a page if the previous record
ended at page boundary.)

I plan to use this to fix the issue that Arseny Sher called out at
https://github.com/neondatabase/neon/pull/7288#discussion_r1660063852
2024-07-08 17:22:35 +01:00
Alexander Bayandin
504ca7720f CI(gather-rust-build-stats): fix build with libpq (#8219)
## Problem
I've missed setting `PQ_LIB_DIR` in
https://github.com/neondatabase/neon/pull/8206 in
`gather-rust-build-stats` job and it fails now:
```
  = note: /usr/bin/ld: cannot find -lpq
          collect2: error: ld returned 1 exit status
          

error: could not compile `storage_controller` (bin "storage_controller") due to 1 previous error
```

https://github.com/neondatabase/neon/actions/runs/9743960062/job/26888597735

## Summary of changes
- Set `PQ_LIB_DIR` for `gather-rust-build-stats` job
2024-07-08 17:22:35 +01:00
Alex Chi Z
cf4ea92aad fix(pageserver): include aux file in basebackup only once (#8207)
Extracted from https://github.com/neondatabase/neon/pull/6560, currently
we include multiple copies of aux files in the basebackup.

## Summary of changes

Fix the loop.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-08 17:22:35 +01:00
Alexander Bayandin
325294bced CI(build-tools): Remove libpq from build image (#8206)
## Problem
We use `build-tools` image as a base image to build other images, and it
has a pretty old `libpq-dev` installed (v13; it wasn't that old until I
removed system Postgres 14 from `build-tools` image in
https://github.com/neondatabase/neon/pull/6540)

## Summary of changes
- Remove `libpq-dev` from `build-tools` image
- Set `LD_LIBRARY_PATH` for tests (for different Postgres binaries that
we use, like psql and pgbench)
- Set `PQ_LIB_DIR` to build Storage Controller
- Set `LD_LIBRARY_PATH`/`DYLD_LIBRARY_PATH` in the Storage Controller
where it calls Postgres binaries
2024-07-08 17:22:35 +01:00
John Spray
86c8ba2563 pageserver: add metric pageserver_secondary_resident_physical_size (#8204)
## Problem

We lack visibility of how much local disk space is used by secondary
tenant locations

Close: https://github.com/neondatabase/neon/issues/8181

## Summary of changes

- Add `pageserver_secondary_resident_physical_size`, tagged by tenant
- Register & de-register label sets from SecondaryTenant
- Add+use wrappers in SecondaryDetail that update metrics when
adding+removing layers/timelines
2024-07-08 17:22:35 +01:00
Arseny Sher
feeb2dc6fa Merge pull request #8217 from neondatabase/rc/2024-07-01
Storage & Compute release 2024-07-01
2024-07-04 20:22:51 +03:00
Heikki Linnakangas
57f476ff5a Restore running xacts from CLOG on replica startup (#7288)
We have one pretty serious MVCC visibility bug with hot standby
replicas. We incorrectly treat any transactions that are in progress
in the primary, when the standby is started, as aborted. That can
break MVCC for queries running concurrently in the standby. It can
also lead to hint bits being set incorrectly, and that damage can last
until the replica is restarted.

The fundamental bug was that we treated any replica start as starting
from a shut down server. The fix for that is straightforward: we need
to set 'wasShutdown = false' in InitWalRecovery() (see changes in the
postgres repo).

However, that introduces a new problem: with wasShutdown = false, the
standby will not open up for queries until it receives a running-xacts
WAL record from the primary. That's correct, and that's how Postgres
hot standby always works. But it's a problem for Neon, because:

* It changes the historical behavior for existing users. Currently,
  the standby immediately opens up for queries, so if they now need to
  wait, we can breka existing use cases that were working fine
  (assuming you don't hit the MVCC issues).

* The problem is much worse for Neon than it is for standalone
  PostgreSQL, because in Neon, we can start a replica from an
  arbitrary LSN. In standalone PostgreSQL, the replica always starts
  WAL replay from a checkpoint record, and the primary arranges things
  so that there is always a running-xacts record soon after each
  checkpoint record. You can still hit this issue with PostgreSQL if
  you have a transaction with lots of subtransactions running in the
  primary, but it's pretty rare in practice.

To mitigate that, we introduce another way to collect the
running-xacts information at startup, without waiting for the
running-xacts WAL record: We can the CLOG for XIDs that haven't been
marked as committed or aborted. It has limitations with
subtransactions too, but should mitigate the problem for most users.

See https://github.com/neondatabase/neon/issues/7236.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-04 18:58:34 +03:00
Heikki Linnakangas
7ee2bebdb7 tests: Make neon_xlogflush() flush all WAL, if you omit the LSN arg
This makes it much more convenient to use in the common case that you
want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the
argument doesn't work for the same reasons as explained in the
comments: we need to be back off to the beginning of a page if the
previous record ended at page boundary.)

I plan to use this to fix the issue that Arseny Sher called out at
https://github.com/neondatabase/neon/pull/7288#discussion_r1660063852
2024-07-04 18:58:28 +03:00
Heikki Linnakangas
be598f1bf4 tests: remove a leftover 'running' flag (#8216)
The 'running' boolean was replaced with a semaphore in commit
f0e2bb79b2, but this initialization was missed. Remove it so that if a
test tries to access it, you get an error rather than always claiming
that the endpoint is not running.

Spotted by Arseny at
https://github.com/neondatabase/neon/pull/7288#discussion_r1660068657
2024-07-04 18:58:20 +03:00
John Spray
939b5954a5 Merge pull request #8138 from neondatabase/rc/2024-06-24
Storage & Compute release 2024-06-24
2024-06-24 10:57:45 +01:00
Arpad Müller
371020fe6a Merge pull request #8069 from neondatabase/rc/2024-06-17
Release 2024-06-17
2024-06-17 15:29:35 +02:00
Christian Schwarz
f45818abed Merge pull request #7999 from neondatabase/rc/2024-06-10
Release 2024-06-10
2024-06-10 19:08:03 +02:00
Christian Schwarz
0384267d58 Revert "Include openssl and ICU statically linked" (#8003)
Reverts neondatabase/neon#7956

Rationale: compute incompatibilties

Slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1718011276665839?thread_ts=1718008160.431869&cid=C033RQ5SPDH

Relevant quotes from @hlinnaka 

> If we go through with the current release candidate, but the compute
is pinned, people who create new projects will get that warning, which
is silly. To them, it looks like the ICU version was downgraded, because
initdb was run with newer version.

> We should upgrade the ICU version eventually. And when we do that,
users with old projects that use ICU will start to see that warning. I
think that's acceptable, as long as we do homework, notify users, and
communicate that properly.
> When do that, we should to try to upgrade the storage and compute
versions at roughly the same time.
2024-06-10 14:35:50 +02:00
Arseny Sher
62b3bd968a Merge pull request #7936 from neondatabase/rc/2024-06-03
Release 2024-06-03
2024-06-04 05:41:36 +03:00
Anastasia Lubennikova
e3e3bc3542 Merge pull request #7920 from neondatabase/compute-only-may-31
Compute release 2024-05-31
2024-05-31 12:47:05 +01:00
Konstantin Knizhnik
be014a2222 Do not produce error if gin page is not restored in redo (#7876)
## Problem

See https://github.com/neondatabase/cloud/issues/10845

## Summary of changes

Do not report error if GIN page is not restored

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-05-31 09:21:40 +01:00
Joonas Koivunen
2e1fe71cc0 Merge pull request #7888 from neondatabase/rc/2024-05-27
Release 2024-05-27
2024-05-27 20:30:48 +03:00
Konstantin Knizhnik
068c158ca5 Fix connect to PS on MacOS/X (#7885)
## Problem

After [0e4f182680] which introduce async
connect
Neon is not able to connect to page server.

## Summary of changes

Perform sync commit at MacOS/X

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-05-27 13:09:44 +00:00
Sasha Krassovsky
b16e4f689f Merge pull request #7869 from neondatabase/rc/2024-05-23
Metrics hotfix release
2024-05-23 14:05:30 -07:00
Sasha Krassovsky
dbff725a0c Remove apostrophe (#7868)
## Problem

## Summary of changes

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-05-23 13:47:16 -07:00
Andreas Scherbaum
7fa4628434 Merge pull request #7837 from neondatabase/rc/2024-05-22
Compute-Only Release 2024-05-22
2024-05-22 19:34:39 +02:00
Arthur Petukhovsky
fc538a38b9 Merge pull request #7807 from neondatabase/rc/2024-05-20
Release 2024-05-20
2024-05-20 12:16:00 +01:00
Vlad Lazar
c2e7cb324f Merge pull request #7735 from neondatabase/vlad/release-2024-05-13
Handmade Release 2024-05-13
2024-05-13 16:27:38 +01:00
Vlad Lazar
101043122e Revert protocol version upgrade (#7727)
## Problem

"John pointed out that the switch to protocol version 2 made
test_gc_aggressive test flaky:
https://github.com/neondatabase/neon/issues/7692.
I tracked it down, and that is indeed an issue. Conditions for hitting
the issue:
The problem occurs in the primary
GC horizon is set to a very low value, e.g. 0.
If the primary is actively writing WAL, and GC runs in the pageserver at
the same time that the primary sends a GetPage request, it's possible
that the GC advances the GC horizon past the GetPage request's LSN. I'm
working on a fix here: https://github.com/neondatabase/neon/pull/7708."
- Heikki

## Summary of changes
Use protocol version 1 as default.
2024-05-13 14:17:36 +01:00
Christian Schwarz
c4d7d59825 Merge pull request #7615 from neondatabase/rc/2024-05-06
Release 2024-05-06
2024-05-07 09:41:02 +02:00
Arpad Müller
0de1e1d664 Merge pull request #7530 from neondatabase/rc/2024-04-29
Release 2024-04-29
2024-04-29 15:09:58 +02:00
Joonas Koivunen
271598b77f Merge pull request #7447 from neondatabase/rc/2024-04-22
Release 2024-04-22
2024-04-22 16:10:03 +03:00
John Spray
459bc479dc pageserver: fix unlogged relations with sharding (#7454)
## Problem

- #7451 

INIT_FORKNUM blocks must be stored on shard 0 to enable including them
in basebackup.

This issue can be missed in simple tests because creating an unlogged
table isn't sufficient -- to repro I had to create an _index_ on an
unlogged table (then restart the endpoint).

Closes: #7451 

## Summary of changes

- Add a reproducer for the issue.
- Tweak the condition for `key_is_shard0` to include anything that isn't
a normal relation block _and_ any normal relation block whose forknum is
INIT_FORKNUM.
- To enable existing databases to recover from the issue, add a special
case that omits relations if they were stored on the wrong INITFORK.
This enables postgres to start and the user to drop the table and
recreate it.
2024-04-22 11:55:24 +00:00
Christian Schwarz
c213373a59 Merge pull request #7378 from neondatabase/rc/2024-04-15
Release 2024-04-15
2024-04-15 15:48:14 +03:00
Em Sharnoff
e0addc100d Merge pull request #7356 from neondatabase/rc/2024-04-11-#7348
Release 2024-04-11 (cherry-pick #7348 only)

See here for more: https://neondb.slack.com/archives/C04DGM6SMTM/p1712776981582679
2024-04-11 09:46:34 -07:00
Em Sharnoff
0519138b04 compute_ctl: Auto-set dynamic_shared_memory_type (#7348)
Part of neondatabase/cloud#12047.

The basic idea is that for our VMs, we want to enable swap and disable
Linux memory overcommit. Alongside these, we should set postgres'
dynamic_shared_memory_type to mmap, but we want to avoid setting it to
mmap if swap is not enabled.

Implementing this in the control plane would be fiddly, but it's
relatively straightforward to add to compute_ctl.
2024-04-10 13:13:08 -07:00
Vlad Lazar
5da39b469c Merge pull request #7338 from neondatabase/rc/2024-04-08
Release 2024-04-08
2024-04-08 13:10:24 +01:00
Arseny Sher
82027e22dd Merge pull request #7284 from neondatabase/rc/2024-04-01
Release 2024-04-01
2024-04-02 18:15:28 +03:00
Alex Chi Z
c431e2f1c5 Merge pull request #7263 from neondatabase/rc/2024-03-27
Release 2024-03-27 - compute only release
2024-03-27 14:52:38 -04:00
John Spray
4e5724d9c3 Merge pull request #7248 from neondatabase/rc/2024-03-26
Release 2024-03-26
2024-03-26 15:17:00 +00:00
John Spray
0d3e499059 Merge pull request #7219 from neondatabase/rc/2024-03-25
Release 2024-03-25
2024-03-25 12:28:09 +00:00
Arpad Müller
7b860b837c Merge pull request #7154 from neondatabase/rc/2024-03-18
Release 2024-03-18
2024-03-19 12:07:14 +01:00
Christian Schwarz
41fc96e20f fixup(#7160 / tokio_epoll_uring_ext): double-panic caused by info! in thread-local's drop() (#7164)
Manual testing of the changes in #7160 revealed that, if the
thread-local destructor ever runs (it apparently doesn't in our test
suite runs, otherwise #7160 would not have auto-merged), we can
encounter an `abort()` due to a double-panic in the tracing code.

This github comment here contains the stack trace:
https://github.com/neondatabase/neon/pull/7160#issuecomment-2003778176

This PR reverts #7160 and uses a atomic counter to identify the
thread-local in log messages, instead of the memory address of the
thread local, which may be re-used.
2024-03-18 16:28:17 +01:00
Christian Schwarz
fb2b1ce57b fixup(#7141 / tokio_epoll_uring_ext): high frequency log message
The PR #7141 added log message

```
ThreadLocalState is being dropped and id might be re-used in the future
```

which was supposed to be emitted when the thread-local is destroyed.
Instead, it was emitted on _each_ call to `thread_local_system()`,
ie.., on each tokio-epoll-uring operation.
2024-03-18 13:01:17 +01:00
Joonas Koivunen
464717451b build: make procfs linux only dependency (#7156)
the dependency refuses to build on macos so builds on `main` are broken
right now, including the `release` PR.
2024-03-18 09:32:49 +00:00
Joonas Koivunen
c6ed86d3d0 Merge pull request #7081 from neondatabase/rc/2024-03-11
Release 2024-03-11
2024-03-11 14:41:39 +02:00
Roman Zaynetdinov
f0a9017008 Export db size, deadlocks and changed row metrics (#7050)
## Problem

We want to report metrics for the oldest user database.
2024-03-11 11:55:06 +00:00
Christian Schwarz
bb7949ba00 Merge pull request #6993 from neondatabase/rc/2024-03-04
Release 2024-03-04
2024-03-04 13:08:44 +01:00
Arthur Petukhovsky
1df0f69664 Merge pull request #6973 from neondatabase/rc/2024-02-29-manual
Release 2024-02-29
2024-02-29 17:26:33 +00:00
Vlad Lazar
970066a914 libs: fix expired token in auth decode test (#6963)
The test token expired earlier today (1709200879). I regenerated the
token, but without an expiration date this time.
2024-02-29 17:23:25 +00:00
Arthur Petukhovsky
1ebd3897c0 Merge pull request #6956 from neondatabase/rc/2024-02-28
Release 2024-02-28
2024-02-29 16:39:52 +00:00
Arthur Petukhovsky
6460beffcd Merge pull request #6901 from neondatabase/rc/2024-02-26
Release 2024-02-26
2024-02-26 17:08:19 +00:00
John Spray
6f7f8958db pageserver: only write out legacy tenant config if no generation (#6891)
## Problem

Previously we always wrote out both legacy and modern tenant config
files. The legacy write enabled rollbacks, but we are long past the
point where that is needed.

We still need the legacy format for situations where someone is running
tenants without generations (that will be yanked as well eventually),
but we can avoid writing it out at all if we do have a generation number
set. We implicitly also avoid writing the legacy config if our mode is
Secondary (secondary mode is newer than generations).

## Summary of changes

- Make writing legacy tenant config conditional on there being no
generation number set.
2024-02-26 10:25:25 +00:00
Christian Schwarz
936a00e077 pageserver: remove two obsolete/unused per-timeline metrics (#6893)
over-compensating the addition of a new per-timeline metric in
https://github.com/neondatabase/neon/pull/6834

part of https://github.com/neondatabase/neon/issues/6737
2024-02-26 09:16:24 +00:00
75 changed files with 966 additions and 1406 deletions

View File

@@ -31,7 +31,7 @@ config-variables:
- NEON_PROD_AWS_ACCOUNT_ID
- PGREGRESS_PG16_PROJECT_ID
- PGREGRESS_PG17_PROJECT_ID
- PREWARM_PROJECT_ID
- PREWARM_PGBENCH_SIZE
- REMOTE_STORAGE_AZURE_CONTAINER
- REMOTE_STORAGE_AZURE_REGION
- SLACK_CICD_CHANNEL_ID

View File

@@ -418,7 +418,7 @@ jobs:
statuses: write
id-token: write # aws-actions/configure-aws-credentials
env:
PROJECT_ID: ${{ vars.PREWARM_PROJECT_ID }}
PGBENCH_SIZE: ${{ vars.PREWARM_PGBENCH_SIZE }}
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
DEFAULT_PG_VERSION: 17
TEST_OUTPUT: /tmp/test_output

View File

@@ -48,20 +48,8 @@ jobs:
uses: ./.github/workflows/build-build-tools-image.yml
secrets: inherit
generate-ch-tmppw:
runs-on: ubuntu-22.04
outputs:
tmp_val: ${{ steps.pwgen.outputs.tmp_val }}
steps:
- name: Generate a random password
id: pwgen
run: |
set +x
p=$(dd if=/dev/random bs=14 count=1 2>/dev/null | base64)
echo tmp_val="${p//\//}" >> "${GITHUB_OUTPUT}"
test-logical-replication:
needs: [ build-build-tools-image, generate-ch-tmppw ]
needs: [ build-build-tools-image ]
runs-on: ubuntu-22.04
container:
@@ -72,20 +60,16 @@ jobs:
options: --init --user root
services:
clickhouse:
image: clickhouse/clickhouse-server:24.8
env:
CLICKHOUSE_PASSWORD: ${{ needs.generate-ch-tmppw.outputs.tmp_val }}
image: clickhouse/clickhouse-server:24.6.3.64
ports:
- 9000:9000
- 8123:8123
zookeeper:
image: quay.io/debezium/zookeeper:3.1.3.Final
image: quay.io/debezium/zookeeper:2.7
ports:
- 2181:2181
- 2888:2888
- 3888:3888
kafka:
image: quay.io/debezium/kafka:3.1.3.Final
image: quay.io/debezium/kafka:2.7
env:
ZOOKEEPER_CONNECT: "zookeeper:2181"
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
@@ -95,7 +79,7 @@ jobs:
ports:
- 9092:9092
debezium:
image: quay.io/debezium/connect:3.1.3.Final
image: quay.io/debezium/connect:2.7
env:
BOOTSTRAP_SERVERS: kafka:9092
GROUP_ID: 1
@@ -141,7 +125,6 @@ jobs:
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.create-neon-project.outputs.dsn }}
CLICKHOUSE_PASSWORD: ${{ needs.generate-ch-tmppw.outputs.tmp_val }}
- name: Delete Neon Project
if: always()

200
Cargo.lock generated
View File

@@ -211,11 +211,11 @@ dependencies = [
[[package]]
name = "async-lock"
version = "3.4.0"
version = "3.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ff6e472cdea888a4bd64f342f09b3f50e1886d32afe8df3d663c01140b811b18"
checksum = "7125e42787d53db9dd54261812ef17e937c95a51e4d291373b670342fa44310c"
dependencies = [
"event-listener 5.4.0",
"event-listener 4.0.0",
"event-listener-strategy",
"pin-project-lite",
]
@@ -1404,9 +1404,9 @@ dependencies = [
[[package]]
name = "concurrent-queue"
version = "2.5.0"
version = "2.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4ca0197aee26d1ae37445ee532fefce43251d24cc7c166799f4d46817f1d3973"
checksum = "f057a694a54f12365049b0958a1685bb52d567f5593b355fbf685838e873d400"
dependencies = [
"crossbeam-utils",
]
@@ -2232,9 +2232,9 @@ checksum = "0206175f82b8d6bf6652ff7d71a1e27fd2e4efde587fd368662814d6ec1d9ce0"
[[package]]
name = "event-listener"
version = "5.4.0"
version = "4.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3492acde4c3fc54c845eaab3eed8bd00c7a7d881f78bfc801e43a93dec1331ae"
checksum = "770d968249b5d99410d61f5bf89057f3199a077a04d087092f58e7d10692baae"
dependencies = [
"concurrent-queue",
"parking",
@@ -2243,11 +2243,11 @@ dependencies = [
[[package]]
name = "event-listener-strategy"
version = "0.5.4"
version = "0.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8be9f3dfaaffdae2972880079a491a1a8bb7cbed0b8dd7a347f668b4150a3b93"
checksum = "958e4d70b6d5e81971bebec42271ec641e7ff4e170a6fa605f2b8a8b65cb97d3"
dependencies = [
"event-listener 5.4.0",
"event-listener 4.0.0",
"pin-project-lite",
]
@@ -2516,20 +2516,6 @@ version = "0.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "304de19db7028420975a296ab0fcbbc8e69438c4ed254a1e41e2a7f37d5f0e0a"
[[package]]
name = "generator"
version = "0.8.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d18470a76cb7f8ff746cf1f7470914f900252ec36bbc40b569d74b1258446827"
dependencies = [
"cc",
"cfg-if",
"libc",
"log",
"rustversion",
"windows 0.61.3",
]
[[package]]
name = "generic-array"
version = "0.14.7"
@@ -2848,7 +2834,7 @@ checksum = "f9c7c7c8ac16c798734b8a24560c1362120597c40d5e1459f09498f8f6c8f2ba"
dependencies = [
"cfg-if",
"libc",
"windows 0.52.0",
"windows",
]
[[package]]
@@ -3119,7 +3105,7 @@ dependencies = [
"iana-time-zone-haiku",
"js-sys",
"wasm-bindgen",
"windows-core 0.52.0",
"windows-core",
]
[[package]]
@@ -3670,19 +3656,6 @@ version = "0.4.26"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "30bde2b3dc3671ae49d8e2e9f044c7c005836e7a023ee57cffa25ab82764bb9e"
[[package]]
name = "loom"
version = "0.7.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "419e0dc8046cb947daa77eb95ae174acfbddb7673b4151f56d1eed8e93fbfaca"
dependencies = [
"cfg-if",
"generator",
"scoped-tls",
"tracing",
"tracing-subscriber",
]
[[package]]
name = "lru"
version = "0.12.3"
@@ -3899,25 +3872,6 @@ dependencies = [
"windows-sys 0.52.0",
]
[[package]]
name = "moka"
version = "0.12.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a9321642ca94a4282428e6ea4af8cc2ca4eac48ac7a6a4ea8f33f76d0ce70926"
dependencies = [
"crossbeam-channel",
"crossbeam-epoch",
"crossbeam-utils",
"loom",
"parking_lot 0.12.1",
"portable-atomic",
"rustc_version",
"smallvec",
"tagptr",
"thiserror 1.0.69",
"uuid",
]
[[package]]
name = "multimap"
version = "0.8.3"
@@ -5431,6 +5385,7 @@ dependencies = [
"futures",
"gettid",
"hashbrown 0.14.5",
"hashlink",
"hex",
"hmac",
"hostname",
@@ -5452,7 +5407,6 @@ dependencies = [
"lasso",
"measured",
"metrics",
"moka",
"once_cell",
"opentelemetry",
"ouroboros",
@@ -6466,12 +6420,6 @@ dependencies = [
"pin-project-lite",
]
[[package]]
name = "scoped-tls"
version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e1cf6437eb19a8f4a6cc0f7dca544973b0b78843adbfeb3683d1a94a0024a294"
[[package]]
name = "scopeguard"
version = "1.1.0"
@@ -7321,12 +7269,6 @@ dependencies = [
"winapi",
]
[[package]]
name = "tagptr"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7b2093cf4c8eb1e67749a6762251bc9cd836b6fc171623bd0a9d324d37af2417"
[[package]]
name = "tar"
version = "0.4.40"
@@ -8696,32 +8638,10 @@ version = "0.52.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e48a53791691ab099e5e2ad123536d0fff50652600abaf43bbf952894110d0be"
dependencies = [
"windows-core 0.52.0",
"windows-core",
"windows-targets 0.52.6",
]
[[package]]
name = "windows"
version = "0.61.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9babd3a767a4c1aef6900409f85f5d53ce2544ccdfaa86dad48c91782c6d6893"
dependencies = [
"windows-collections",
"windows-core 0.61.2",
"windows-future",
"windows-link",
"windows-numerics",
]
[[package]]
name = "windows-collections"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3beeceb5e5cfd9eb1d76b381630e82c4241ccd0d27f1a39ed41b2760b255c5e8"
dependencies = [
"windows-core 0.61.2",
]
[[package]]
name = "windows-core"
version = "0.52.0"
@@ -8731,86 +8651,6 @@ dependencies = [
"windows-targets 0.52.6",
]
[[package]]
name = "windows-core"
version = "0.61.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c0fdd3ddb90610c7638aa2b3a3ab2904fb9e5cdbecc643ddb3647212781c4ae3"
dependencies = [
"windows-implement",
"windows-interface",
"windows-link",
"windows-result",
"windows-strings",
]
[[package]]
name = "windows-future"
version = "0.2.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fc6a41e98427b19fe4b73c550f060b59fa592d7d686537eebf9385621bfbad8e"
dependencies = [
"windows-core 0.61.2",
"windows-link",
"windows-threading",
]
[[package]]
name = "windows-implement"
version = "0.60.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a47fddd13af08290e67f4acabf4b459f647552718f683a7b415d290ac744a836"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "windows-interface"
version = "0.59.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bd9211b69f8dcdfa817bfd14bf1c97c9188afa36f4750130fcdf3f400eca9fa8"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "windows-link"
version = "0.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5e6ad25900d524eaabdbbb96d20b4311e1e7ae1699af4fb28c17ae66c80d798a"
[[package]]
name = "windows-numerics"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9150af68066c4c5c07ddc0ce30421554771e528bde427614c61038bc2c92c2b1"
dependencies = [
"windows-core 0.61.2",
"windows-link",
]
[[package]]
name = "windows-result"
version = "0.3.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "56f42bd332cc6c8eac5af113fc0c1fd6a8fd2aa08a0119358686e5160d0586c6"
dependencies = [
"windows-link",
]
[[package]]
name = "windows-strings"
version = "0.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "56e6c93f3a0c3b36176cb1327a4958a0353d5d166c2a35cb268ace15e91d3b57"
dependencies = [
"windows-link",
]
[[package]]
name = "windows-sys"
version = "0.48.0"
@@ -8869,15 +8709,6 @@ dependencies = [
"windows_x86_64_msvc 0.52.6",
]
[[package]]
name = "windows-threading"
version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b66463ad2e0ea3bbf808b7f1d371311c80e115c0b71d60efc142cafbcfb057a6"
dependencies = [
"windows-link",
]
[[package]]
name = "windows_aarch64_gnullvm"
version = "0.48.0"
@@ -9014,8 +8845,6 @@ dependencies = [
"clap",
"clap_builder",
"const-oid",
"crossbeam-epoch",
"crossbeam-utils",
"crypto-bigint 0.5.5",
"der 0.7.8",
"deranged",
@@ -9061,7 +8890,6 @@ dependencies = [
"once_cell",
"p256 0.13.2",
"parquet",
"portable-atomic",
"prettyplease",
"proc-macro2",
"prost 0.13.5",

View File

@@ -46,10 +46,10 @@ members = [
"libs/proxy/json",
"libs/proxy/postgres-protocol2",
"libs/proxy/postgres-types2",
"libs/proxy/subzero_core",
"libs/proxy/tokio-postgres2",
"endpoint_storage",
"pgxn/neon/communicator",
"proxy/subzero_core",
]
[workspace.package]
@@ -136,7 +136,6 @@ md5 = "0.7.0"
measured = { version = "0.0.22", features=["lasso"] }
measured-process = { version = "0.0.22" }
memoffset = "0.9"
moka = { version = "0.12", features = ["sync"] }
nix = { version = "0.30.1", features = ["dir", "fs", "mman", "process", "socket", "signal", "poll"] }
# Do not update to >= 7.0.0, at least. The update will have a significant impact
# on compute startup metrics (start_postgres_ms), >= 25% degradation.

View File

@@ -39,13 +39,13 @@ COPY build-tools/patches/pgcopydbv017.patch /pgcopydbv017.patch
RUN if [ "${DEBIAN_VERSION}" = "bookworm" ]; then \
set -e && \
apt-get update && \
apt-get install -y --no-install-recommends \
apt update && \
apt install -y --no-install-recommends \
ca-certificates wget gpg && \
wget -qO - https://www.postgresql.org/media/keys/ACCC4CF8.asc | gpg --dearmor -o /usr/share/keyrings/postgresql-keyring.gpg && \
echo "deb [signed-by=/usr/share/keyrings/postgresql-keyring.gpg] http://apt.postgresql.org/pub/repos/apt bookworm-pgdg main" > /etc/apt/sources.list.d/pgdg.list && \
apt-get update && \
apt-get install -y --no-install-recommends \
apt install -y --no-install-recommends \
build-essential \
autotools-dev \
libedit-dev \
@@ -89,7 +89,8 @@ RUN useradd -ms /bin/bash nonroot -b /home
# Use strict mode for bash to catch errors early
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]
RUN mkdir -p /pgcopydb/{bin,lib} && \
RUN mkdir -p /pgcopydb/bin && \
mkdir -p /pgcopydb/lib && \
chmod -R 755 /pgcopydb && \
chown -R nonroot:nonroot /pgcopydb
@@ -105,8 +106,8 @@ RUN echo 'Acquire::Retries "5";' > /etc/apt/apt.conf.d/80-retries && \
# 'gdb' is included so that we get backtraces of core dumps produced in
# regression tests
RUN set -e \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
&& apt update \
&& apt install -y \
autoconf \
automake \
bison \
@@ -182,22 +183,22 @@ RUN curl -sL "https://github.com/peak/s5cmd/releases/download/v${S5CMD_VERSION}/
ENV LLVM_VERSION=20
RUN curl -fsSL 'https://apt.llvm.org/llvm-snapshot.gpg.key' | apt-key add - \
&& echo "deb http://apt.llvm.org/${DEBIAN_VERSION}/ llvm-toolchain-${DEBIAN_VERSION}-${LLVM_VERSION} main" > /etc/apt/sources.list.d/llvm.stable.list \
&& apt-get update \
&& apt-get install -y --no-install-recommends clang-${LLVM_VERSION} llvm-${LLVM_VERSION} \
&& apt update \
&& apt install -y clang-${LLVM_VERSION} llvm-${LLVM_VERSION} \
&& bash -c 'for f in /usr/bin/clang*-${LLVM_VERSION} /usr/bin/llvm*-${LLVM_VERSION}; do ln -s "${f}" "${f%-${LLVM_VERSION}}"; done' \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Install node
ENV NODE_VERSION=24
RUN curl -fsSL https://deb.nodesource.com/setup_${NODE_VERSION}.x | bash - \
&& apt-get install -y --no-install-recommends nodejs \
&& apt install -y nodejs \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Install docker
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian ${DEBIAN_VERSION} stable" > /etc/apt/sources.list.d/docker.list \
&& apt-get update \
&& apt-get install -y --no-install-recommends docker-ce docker-ce-cli \
&& apt update \
&& apt install -y docker-ce docker-ce-cli \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Configure sudo & docker
@@ -214,11 +215,12 @@ RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "aws
# Mold: A Modern Linker
ENV MOLD_VERSION=v2.37.1
RUN set -e \
&& git clone -b "${MOLD_VERSION}" --depth 1 https://github.com/rui314/mold.git \
&& git clone https://github.com/rui314/mold.git \
&& mkdir mold/build \
&& cd mold/build \
&& cd mold/build \
&& git checkout ${MOLD_VERSION} \
&& cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=clang++ .. \
&& cmake --build . -j "$(nproc)" \
&& cmake --build . -j $(nproc) \
&& cmake --install . \
&& cd .. \
&& rm -rf mold
@@ -252,7 +254,7 @@ ENV ICU_VERSION=67.1
ENV ICU_PREFIX=/usr/local/icu
# Download and build static ICU
RUN wget -O "/tmp/libicu-${ICU_VERSION}.tgz" https://github.com/unicode-org/icu/releases/download/release-${ICU_VERSION//./-}/icu4c-${ICU_VERSION//./_}-src.tgz && \
RUN wget -O /tmp/libicu-${ICU_VERSION}.tgz https://github.com/unicode-org/icu/releases/download/release-${ICU_VERSION//./-}/icu4c-${ICU_VERSION//./_}-src.tgz && \
echo "94a80cd6f251a53bd2a997f6f1b5ac6653fe791dfab66e1eb0227740fb86d5dc /tmp/libicu-${ICU_VERSION}.tgz" | sha256sum --check && \
mkdir /tmp/icu && \
pushd /tmp/icu && \
@@ -263,7 +265,8 @@ RUN wget -O "/tmp/libicu-${ICU_VERSION}.tgz" https://github.com/unicode-org/icu/
make install && \
popd && \
rm -rf icu && \
rm -f /tmp/libicu-${ICU_VERSION}.tgz
rm -f /tmp/libicu-${ICU_VERSION}.tgz && \
popd
# Switch to nonroot user
USER nonroot:nonroot
@@ -276,19 +279,19 @@ ENV PYTHON_VERSION=3.11.12 \
PYENV_ROOT=/home/nonroot/.pyenv \
PATH=/home/nonroot/.pyenv/shims:/home/nonroot/.pyenv/bin:/home/nonroot/.poetry/bin:$PATH
RUN set -e \
&& cd "$HOME" \
&& cd $HOME \
&& curl -sSO https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer \
&& chmod +x pyenv-installer \
&& ./pyenv-installer \
&& export PYENV_ROOT=/home/nonroot/.pyenv \
&& export PATH="$PYENV_ROOT/bin:$PATH" \
&& export PATH="$PYENV_ROOT/shims:$PATH" \
&& pyenv install "${PYTHON_VERSION}" \
&& pyenv global "${PYTHON_VERSION}" \
&& pyenv install ${PYTHON_VERSION} \
&& pyenv global ${PYTHON_VERSION} \
&& python --version \
&& pip install --no-cache-dir --upgrade pip \
&& pip install --upgrade pip \
&& pip --version \
&& pip install --no-cache-dir pipenv wheel poetry
&& pip install pipenv wheel poetry
# Switch to nonroot user (again)
USER nonroot:nonroot
@@ -314,13 +317,13 @@ RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux
. "$HOME/.cargo/env" && \
cargo --version && rustup --version && \
rustup component add llvm-tools rustfmt clippy && \
cargo install rustfilt --locked --version "${RUSTFILT_VERSION}" && \
cargo install cargo-hakari --locked --version "${CARGO_HAKARI_VERSION}" && \
cargo install cargo-deny --locked --version "${CARGO_DENY_VERSION}" && \
cargo install cargo-hack --locked --version "${CARGO_HACK_VERSION}" && \
cargo install cargo-nextest --locked --version "${CARGO_NEXTEST_VERSION}" && \
cargo install cargo-chef --locked --version "${CARGO_CHEF_VERSION}" && \
cargo install diesel_cli --locked --version "${CARGO_DIESEL_CLI_VERSION}" \
cargo install rustfilt --locked --version ${RUSTFILT_VERSION} && \
cargo install cargo-hakari --locked --version ${CARGO_HAKARI_VERSION} && \
cargo install cargo-deny --locked --version ${CARGO_DENY_VERSION} && \
cargo install cargo-hack --locked --version ${CARGO_HACK_VERSION} && \
cargo install cargo-nextest --locked --version ${CARGO_NEXTEST_VERSION} && \
cargo install cargo-chef --locked --version ${CARGO_CHEF_VERSION} && \
cargo install diesel_cli --locked --version ${CARGO_DIESEL_CLI_VERSION} \
--features postgres-bundled --no-default-features && \
rm -rf /home/nonroot/.cargo/registry && \
rm -rf /home/nonroot/.cargo/git

View File

@@ -413,52 +413,6 @@ struct StartVmMonitorResult {
vm_monitor: Option<JoinHandle<Result<()>>>,
}
/// Databricks-specific environment variables to be passed to the `postgres` sub-process.
pub struct DatabricksEnvVars {
/// The Databricks "endpoint ID" of the compute instance. Used by `postgres` to check
/// the token scopes of internal auth tokens.
pub endpoint_id: String,
/// Hostname of the Databricks workspace URL this compute instance belongs to.
/// Used by postgres to verify Databricks PAT tokens.
pub workspace_host: String,
}
impl DatabricksEnvVars {
pub fn new(compute_spec: &ComputeSpec, compute_id: Option<&String>) -> Self {
// compute_id is a string format of "{endpoint_id}/{compute_idx}"
// endpoint_id is a uuid. We only need to pass down endpoint_id to postgres.
// Panics if compute_id is not set or not in the expected format.
let endpoint_id = compute_id.unwrap().split('/').next().unwrap().to_string();
let workspace_host = compute_spec
.databricks_settings
.as_ref()
.map(|s| s.databricks_workspace_host.clone())
.unwrap_or("".to_string());
Self {
endpoint_id,
workspace_host,
}
}
/// Constants for the names of Databricks-specific postgres environment variables.
const DATABRICKS_ENDPOINT_ID_ENVVAR: &'static str = "DATABRICKS_ENDPOINT_ID";
const DATABRICKS_WORKSPACE_HOST_ENVVAR: &'static str = "DATABRICKS_WORKSPACE_HOST";
/// Convert DatabricksEnvVars to a list of string pairs that can be passed as env vars. Consumes `self`.
pub fn to_env_var_list(self) -> Vec<(String, String)> {
vec![
(
Self::DATABRICKS_ENDPOINT_ID_ENVVAR.to_string(),
self.endpoint_id.clone(),
),
(
Self::DATABRICKS_WORKSPACE_HOST_ENVVAR.to_string(),
self.workspace_host.clone(),
),
]
}
}
impl ComputeNode {
pub fn new(params: ComputeNodeParams, config: ComputeConfig) -> Result<Self> {
let connstr = params.connstr.as_str();
@@ -1457,8 +1411,6 @@ impl ComputeNode {
let pgdata_path = Path::new(&self.params.pgdata);
let tls_config = self.tls_config(&pspec.spec);
let databricks_settings = spec.databricks_settings.as_ref();
let postgres_port = self.params.connstr.port();
// Remove/create an empty pgdata directory and put configuration there.
self.create_pgdata()?;
@@ -1466,11 +1418,8 @@ impl ComputeNode {
pgdata_path,
&self.params,
&pspec.spec,
postgres_port,
self.params.internal_http_port,
tls_config,
databricks_settings,
self.params.lakebase_mode,
)?;
// Syncing safekeepers is only safe with primary nodes: if a primary
@@ -1510,28 +1459,8 @@ impl ComputeNode {
)
})?;
if let Some(settings) = databricks_settings {
copy_tls_certificates(
&settings.pg_compute_tls_settings.key_file,
&settings.pg_compute_tls_settings.cert_file,
pgdata_path,
)?;
// Update pg_hba.conf received with basebackup including additional databricks settings.
update_pg_hba(pgdata_path, Some(&settings.databricks_pg_hba))?;
update_pg_ident(pgdata_path, Some(&settings.databricks_pg_ident))?;
} else {
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path, None)?;
}
if let Some(databricks_settings) = spec.databricks_settings.as_ref() {
copy_tls_certificates(
&databricks_settings.pg_compute_tls_settings.key_file,
&databricks_settings.pg_compute_tls_settings.cert_file,
pgdata_path,
)?;
}
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path, None)?;
// Place pg_dynshmem under /dev/shm. This allows us to use
// 'dynamic_shared_memory_type = mmap' so that the files are placed in
@@ -1644,31 +1573,14 @@ impl ComputeNode {
pub fn start_postgres(&self, storage_auth_token: Option<String>) -> Result<PostgresHandle> {
let pgdata_path = Path::new(&self.params.pgdata);
let env_vars: Vec<(String, String)> = if self.params.lakebase_mode {
let databricks_env_vars = {
let state = self.state.lock().unwrap();
let spec = &state.pspec.as_ref().unwrap().spec;
DatabricksEnvVars::new(spec, Some(&self.params.compute_id))
};
info!(
"Starting Postgres for databricks endpoint id: {}",
&databricks_env_vars.endpoint_id
);
let mut env_vars = databricks_env_vars.to_env_var_list();
env_vars.extend(storage_auth_token.map(|t| ("NEON_AUTH_TOKEN".to_string(), t)));
env_vars
} else if let Some(storage_auth_token) = &storage_auth_token {
vec![("NEON_AUTH_TOKEN".to_owned(), storage_auth_token.to_owned())]
} else {
vec![]
};
// Run postgres as a child process.
let mut pg = maybe_cgexec(&self.params.pgbin)
.args(["-D", &self.params.pgdata])
.envs(env_vars)
.envs(if let Some(storage_auth_token) = &storage_auth_token {
vec![("NEON_AUTH_TOKEN", storage_auth_token)]
} else {
vec![]
})
.stderr(Stdio::piped())
.spawn()
.expect("cannot start postgres process");
@@ -1971,16 +1883,12 @@ impl ComputeNode {
// Write new config
let pgdata_path = Path::new(&self.params.pgdata);
let postgres_port = self.params.connstr.port();
config::write_postgres_conf(
pgdata_path,
&self.params,
&spec,
postgres_port,
self.params.internal_http_port,
tls_config,
spec.databricks_settings.as_ref(),
self.params.lakebase_mode,
)?;
self.pg_reload_conf()?;

View File

@@ -7,14 +7,11 @@ use std::io::prelude::*;
use std::path::Path;
use compute_api::responses::TlsConfig;
use compute_api::spec::{
ComputeAudit, ComputeMode, ComputeSpec, DatabricksSettings, GenericOption,
};
use compute_api::spec::{ComputeAudit, ComputeMode, ComputeSpec, GenericOption};
use crate::compute::ComputeNodeParams;
use crate::pg_helpers::{
DatabricksSettingsExt as _, GenericOptionExt, GenericOptionsSearch, PgOptionsSerialize,
escape_conf_value,
GenericOptionExt, GenericOptionsSearch, PgOptionsSerialize, escape_conf_value,
};
use crate::tls::{self, SERVER_CRT, SERVER_KEY};
@@ -43,16 +40,12 @@ pub fn line_in_file(path: &Path, line: &str) -> Result<bool> {
}
/// Create or completely rewrite configuration file specified by `path`
#[allow(clippy::too_many_arguments)]
pub fn write_postgres_conf(
pgdata_path: &Path,
params: &ComputeNodeParams,
spec: &ComputeSpec,
postgres_port: Option<u16>,
extension_server_port: u16,
tls_config: &Option<TlsConfig>,
databricks_settings: Option<&DatabricksSettings>,
lakebase_mode: bool,
) -> Result<()> {
let path = pgdata_path.join("postgresql.conf");
// File::create() destroys the file content if it exists.
@@ -292,24 +285,6 @@ pub fn write_postgres_conf(
writeln!(file, "log_destination='stderr,syslog'")?;
}
if lakebase_mode {
// Explicitly set the port based on the connstr, overriding any previous port setting.
// Note: It is important that we don't specify a different port again after this.
let port = postgres_port.expect("port must be present in connstr");
writeln!(file, "port = {port}")?;
// This is databricks specific settings.
// This should be at the end of the file but before `compute_ctl_temp_override.conf` below
// so that it can override any settings above.
// `compute_ctl_temp_override.conf` is intended to override any settings above during specific operations.
// To prevent potential breakage in the future, we keep it above `compute_ctl_temp_override.conf`.
writeln!(file, "# Databricks settings start")?;
if let Some(settings) = databricks_settings {
writeln!(file, "{}", settings.as_pg_settings())?;
}
writeln!(file, "# Databricks settings end")?;
}
// This is essential to keep this line at the end of the file,
// because it is intended to override any settings above.
writeln!(file, "include_if_exists = 'compute_ctl_temp_override.conf'")?;

View File

@@ -142,7 +142,7 @@ pub fn update_pg_hba(pgdata_path: &Path, databricks_pg_hba: Option<&String>) ->
// Update pg_hba to contains databricks specfic settings before adding neon settings
// PG uses the first record that matches to perform authentication, so we need to have
// our rules before the default ones from neon.
// See https://www.postgresql.org/docs/current/auth-pg-hba-conf.html
// See https://www.postgresql.org/docs/16/auth-pg-hba-conf.html
if let Some(databricks_pg_hba) = databricks_pg_hba {
if config::line_in_file(
&pghba_path,

View File

@@ -793,7 +793,6 @@ impl Endpoint {
autoprewarm: args.autoprewarm,
offload_lfc_interval_seconds: args.offload_lfc_interval_seconds,
suspend_timeout_seconds: -1, // Only used in neon_local.
databricks_settings: None,
};
// this strange code is needed to support respec() in tests

View File

@@ -193,9 +193,6 @@ pub struct ComputeSpec {
///
/// We use this value to derive other values, such as the installed extensions metric.
pub suspend_timeout_seconds: i64,
// Databricks specific options for compute instance.
pub databricks_settings: Option<DatabricksSettings>,
}
/// Feature flag to signal `compute_ctl` to enable certain experimental functionality.

View File

@@ -558,11 +558,11 @@ async fn add_request_id_header_to_response(
mut res: Response<Body>,
req_info: RequestInfo,
) -> Result<Response<Body>, ApiError> {
if let Some(request_id) = req_info.context::<RequestId>()
&& let Ok(request_header_value) = HeaderValue::from_str(&request_id.0)
{
res.headers_mut()
.insert(&X_REQUEST_ID_HEADER, request_header_value);
if let Some(request_id) = req_info.context::<RequestId>() {
if let Ok(request_header_value) = HeaderValue::from_str(&request_id.0) {
res.headers_mut()
.insert(&X_REQUEST_ID_HEADER, request_header_value);
};
};
Ok(res)

View File

@@ -72,10 +72,10 @@ impl Server {
if err.is_incomplete_message() || err.is_closed() || err.is_timeout() {
return true;
}
if let Some(inner) = err.source()
&& let Some(io) = inner.downcast_ref::<std::io::Error>()
{
return suppress_io_error(io);
if let Some(inner) = err.source() {
if let Some(io) = inner.downcast_ref::<std::io::Error>() {
return suppress_io_error(io);
}
}
false
}

View File

@@ -129,12 +129,6 @@ impl<L: LabelGroup> InfoMetric<L> {
}
}
impl<L: LabelGroup + Default> Default for InfoMetric<L, GaugeState> {
fn default() -> Self {
InfoMetric::new(L::default())
}
}
impl<L: LabelGroup, M: MetricType<Metadata = ()>> InfoMetric<L, M> {
pub fn with_metric(label: L, metric: M) -> Self {
Self {

View File

@@ -363,7 +363,7 @@ where
// TODO: An Iterator might be nicer. The communicator's clock algorithm needs to
// _slowly_ iterate through all buckets with its clock hand, without holding a lock.
// If we switch to an Iterator, it must not hold the lock.
pub fn get_at_bucket(&self, pos: usize) -> Option<ValueReadGuard<'_, (K, V)>> {
pub fn get_at_bucket(&self, pos: usize) -> Option<ValueReadGuard<(K, V)>> {
let map = unsafe { self.shared_ptr.as_ref() }.unwrap().read();
if pos >= map.buckets.len() {
return None;

View File

@@ -185,7 +185,6 @@ impl Client {
ssl_mode: SslMode,
process_id: i32,
secret_key: i32,
write_buf: BytesMut,
) -> Client {
Client {
inner: InnerClient {
@@ -196,7 +195,7 @@ impl Client {
waiting: 0,
received: 0,
},
buffer: write_buf,
buffer: Default::default(),
},
cached_typeinfo: Default::default(),

View File

@@ -47,7 +47,14 @@ impl Encoder<BytesMut> for PostgresCodec {
type Error = io::Error;
fn encode(&mut self, item: BytesMut, dst: &mut BytesMut) -> io::Result<()> {
dst.unsplit(item);
// When it comes to request/response workflows, we usually flush the entire write
// buffer in order to wait for the response before we send a new request.
// Therefore we can avoid the copy and just replace the buffer.
if dst.is_empty() {
*dst = item;
} else {
dst.extend_from_slice(&item);
}
Ok(())
}
}

View File

@@ -77,9 +77,6 @@ where
connect_timeout,
};
let mut stream = stream.into_framed();
let write_buf = std::mem::take(stream.write_buffer_mut());
let (client_tx, conn_rx) = mpsc::unbounded_channel();
let (conn_tx, client_rx) = mpsc::channel(4);
let client = Client::new(
@@ -89,9 +86,9 @@ where
ssl_mode,
process_id,
secret_key,
write_buf,
);
let stream = stream.into_framed();
let connection = Connection::new(stream, conn_tx, conn_rx);
Ok((client, connection))

View File

@@ -229,11 +229,8 @@ where
Poll::Ready(()) => {
trace!("poll_flush: flushed");
// Since our codec prefers to share the buffer with the `Client`,
// if we don't release our share, then the `Client` would have to re-alloc
// the buffer when they next use it.
debug_assert!(self.stream.write_buffer().is_empty());
*self.stream.write_buffer_mut() = BytesMut::new();
// GC the write buffer if we managed to flush
gc_bytesmut(self.stream.write_buffer_mut());
Poll::Ready(Ok(()))
}

View File

@@ -49,7 +49,7 @@ impl PerfSpan {
}
}
pub fn enter(&self) -> PerfSpanEntered<'_> {
pub fn enter(&self) -> PerfSpanEntered {
if let Some(ref id) = self.inner.id() {
self.dispatch.enter(id);
}

View File

@@ -715,7 +715,7 @@ fn start_pageserver(
disk_usage_eviction_state,
deletion_queue.new_client(),
secondary_controller,
feature_resolver.clone(),
feature_resolver,
)
.context("Failed to initialize router state")?,
);
@@ -841,14 +841,14 @@ fn start_pageserver(
} else {
None
},
feature_resolver.clone(),
);
// Spawn a Pageserver gRPC server task. It will spawn separate tasks for each request/stream.
// It uses a separate compute request Tokio runtime (COMPUTE_REQUEST_RUNTIME).
// Spawn a Pageserver gRPC server task. It will spawn separate tasks for
// each stream/request.
//
// NB: this port is exposed to computes. It should only provide services that we're okay with
// computes accessing. Internal services should use a separate port.
// TODO: this uses a separate Tokio runtime for the page service. If we want
// other gRPC services, they will need their own port and runtime. Is this
// necessary?
let mut page_service_grpc = None;
if let Some(grpc_listener) = grpc_listener {
page_service_grpc = Some(GrpcPageServiceHandler::spawn(

View File

@@ -2005,10 +2005,6 @@ async fn put_tenant_location_config_handler(
let state = get_state(&request);
let conf = state.conf;
fail::fail_point!("put-location-conf-handler", |_| {
Err(ApiError::ResourceUnavailable("failpoint".into()))
});
// The `Detached` state is special, it doesn't upsert a tenant, it removes
// its local disk content and drops it from memory.
if let LocationConfigMode::Detached = request_data.config.mode {

View File

@@ -68,7 +68,6 @@ use crate::config::PageServerConf;
use crate::context::{
DownloadBehavior, PerfInstrumentFutureExt, RequestContext, RequestContextBuilder,
};
use crate::feature_resolver::FeatureResolver;
use crate::metrics::{
self, COMPUTE_COMMANDS_COUNTERS, ComputeCommandKind, GetPageBatchBreakReason, LIVE_CONNECTIONS,
MISROUTED_PAGESTREAM_REQUESTS, PAGESTREAM_HANDLER_RESULTS_TOTAL, SmgrOpTimer, TimelineMetrics,
@@ -140,7 +139,6 @@ pub fn spawn(
perf_trace_dispatch: Option<Dispatch>,
tcp_listener: tokio::net::TcpListener,
tls_config: Option<Arc<rustls::ServerConfig>>,
feature_resolver: FeatureResolver,
) -> Listener {
let cancel = CancellationToken::new();
let libpq_ctx = RequestContext::todo_child(
@@ -162,7 +160,6 @@ pub fn spawn(
conf.pg_auth_type,
tls_config,
conf.page_service_pipelining.clone(),
feature_resolver,
libpq_ctx,
cancel.clone(),
)
@@ -221,7 +218,6 @@ pub async fn libpq_listener_main(
auth_type: AuthType,
tls_config: Option<Arc<rustls::ServerConfig>>,
pipelining_config: PageServicePipeliningConfig,
feature_resolver: FeatureResolver,
listener_ctx: RequestContext,
listener_cancel: CancellationToken,
) -> Connections {
@@ -265,7 +261,6 @@ pub async fn libpq_listener_main(
auth_type,
tls_config.clone(),
pipelining_config.clone(),
feature_resolver.clone(),
connection_ctx,
connections_cancel.child_token(),
gate_guard,
@@ -308,7 +303,6 @@ async fn page_service_conn_main(
auth_type: AuthType,
tls_config: Option<Arc<rustls::ServerConfig>>,
pipelining_config: PageServicePipeliningConfig,
feature_resolver: FeatureResolver,
connection_ctx: RequestContext,
cancel: CancellationToken,
gate_guard: GateGuard,
@@ -376,7 +370,6 @@ async fn page_service_conn_main(
perf_span_fields,
connection_ctx,
cancel.clone(),
feature_resolver.clone(),
gate_guard,
);
let pgbackend =
@@ -428,8 +421,6 @@ struct PageServerHandler {
pipelining_config: PageServicePipeliningConfig,
get_vectored_concurrent_io: GetVectoredConcurrentIo,
feature_resolver: FeatureResolver,
gate_guard: GateGuard,
}
@@ -596,15 +587,6 @@ impl timeline::handle::TenantManager<TenantManagerTypes> for TenantManagerWrappe
}
}
/// Whether to hold the applied GC cutoff guard when processing GetPage requests.
/// This is determined once at the start of pagestream subprotocol handling based on
/// feature flags, configuration, and test conditions.
#[derive(Debug, Clone, Copy)]
enum HoldAppliedGcCutoffGuard {
Yes,
No,
}
#[derive(thiserror::Error, Debug)]
enum PageStreamError {
/// We encountered an error that should prompt the client to reconnect:
@@ -748,7 +730,6 @@ enum BatchedFeMessage {
GetPage {
span: Span,
shard: WeakHandle<TenantManagerTypes>,
applied_gc_cutoff_guard: Option<RcuReadGuard<Lsn>>,
pages: SmallVec<[BatchedGetPageRequest; 1]>,
batch_break_reason: GetPageBatchBreakReason,
},
@@ -928,7 +909,6 @@ impl PageServerHandler {
perf_span_fields: ConnectionPerfSpanFields,
connection_ctx: RequestContext,
cancel: CancellationToken,
feature_resolver: FeatureResolver,
gate_guard: GateGuard,
) -> Self {
PageServerHandler {
@@ -940,7 +920,6 @@ impl PageServerHandler {
cancel,
pipelining_config,
get_vectored_concurrent_io,
feature_resolver,
gate_guard,
}
}
@@ -980,7 +959,6 @@ impl PageServerHandler {
ctx: &RequestContext,
protocol_version: PagestreamProtocolVersion,
parent_span: Span,
hold_gc_cutoff_guard: HoldAppliedGcCutoffGuard,
) -> Result<Option<BatchedFeMessage>, QueryError>
where
IO: AsyncRead + AsyncWrite + Send + Sync + Unpin + 'static,
@@ -1218,27 +1196,19 @@ impl PageServerHandler {
})
.await?;
let applied_gc_cutoff_guard = shard.get_applied_gc_cutoff_lsn(); // hold guard
// We're holding the Handle
let effective_lsn = match Self::effective_request_lsn(
&shard,
shard.get_last_record_lsn(),
req.hdr.request_lsn,
req.hdr.not_modified_since,
&applied_gc_cutoff_guard,
&shard.get_applied_gc_cutoff_lsn(),
) {
Ok(lsn) => lsn,
Err(e) => {
return respond_error!(span, e);
}
};
let applied_gc_cutoff_guard = match hold_gc_cutoff_guard {
HoldAppliedGcCutoffGuard::Yes => Some(applied_gc_cutoff_guard),
HoldAppliedGcCutoffGuard::No => {
drop(applied_gc_cutoff_guard);
None
}
};
let batch_wait_ctx = if ctx.has_perf_span() {
Some(
@@ -1259,7 +1229,6 @@ impl PageServerHandler {
BatchedFeMessage::GetPage {
span,
shard: shard.downgrade(),
applied_gc_cutoff_guard,
pages: smallvec![BatchedGetPageRequest {
req,
timer,
@@ -1360,28 +1329,13 @@ impl PageServerHandler {
match (eligible_batch, this_msg) {
(
BatchedFeMessage::GetPage {
pages: accum_pages,
applied_gc_cutoff_guard: accum_applied_gc_cutoff_guard,
..
pages: accum_pages, ..
},
BatchedFeMessage::GetPage {
pages: this_pages,
applied_gc_cutoff_guard: this_applied_gc_cutoff_guard,
..
pages: this_pages, ..
},
) => {
accum_pages.extend(this_pages);
// the minimum of the two guards will keep data for both alive
match (&accum_applied_gc_cutoff_guard, this_applied_gc_cutoff_guard) {
(None, None) => (),
(None, Some(this)) => *accum_applied_gc_cutoff_guard = Some(this),
(Some(_), None) => (),
(Some(accum), Some(this)) => {
if **accum > *this {
*accum_applied_gc_cutoff_guard = Some(this);
}
}
};
Ok(())
}
#[cfg(feature = "testing")]
@@ -1696,7 +1650,6 @@ impl PageServerHandler {
BatchedFeMessage::GetPage {
span,
shard,
applied_gc_cutoff_guard,
pages,
batch_break_reason,
} => {
@@ -1716,7 +1669,6 @@ impl PageServerHandler {
.instrument(span.clone())
.await;
assert_eq!(res.len(), npages);
drop(applied_gc_cutoff_guard);
res
},
span,
@@ -1798,7 +1750,7 @@ impl PageServerHandler {
/// Coding discipline within this function: all interaction with the `pgb` connection
/// needs to be sensitive to connection shutdown, currently signalled via [`Self::cancel`].
/// This is so that we can shutdown page_service quickly.
#[instrument(skip_all, fields(hold_gc_cutoff_guard))]
#[instrument(skip_all)]
async fn handle_pagerequests<IO>(
&mut self,
pgb: &mut PostgresBackend<IO>,
@@ -1844,30 +1796,6 @@ impl PageServerHandler {
.take()
.expect("implementation error: timeline_handles should not be locked");
// Evaluate the expensive feature resolver check once per pagestream subprotocol handling
// instead of once per GetPage request. This is shared between pipelined and serial paths.
let hold_gc_cutoff_guard = if cfg!(test) || cfg!(feature = "testing") {
HoldAppliedGcCutoffGuard::Yes
} else {
// Use the global feature resolver with the tenant ID directly, avoiding the need
// to get a timeline/shard which might not be available on this pageserver node.
let empty_properties = std::collections::HashMap::new();
match self.feature_resolver.evaluate_boolean(
"page-service-getpage-hold-applied-gc-cutoff-guard",
tenant_id,
&empty_properties,
) {
Ok(()) => HoldAppliedGcCutoffGuard::Yes,
Err(_) => HoldAppliedGcCutoffGuard::No,
}
};
// record it in the span of handle_pagerequests so that both the request_span
// and the pipeline implementation spans contains the field.
Span::current().record(
"hold_gc_cutoff_guard",
tracing::field::debug(&hold_gc_cutoff_guard),
);
let request_span = info_span!("request");
let ((pgb_reader, timeline_handles), result) = match self.pipelining_config.clone() {
PageServicePipeliningConfig::Pipelined(pipelining_config) => {
@@ -1881,7 +1809,6 @@ impl PageServerHandler {
pipelining_config,
protocol_version,
io_concurrency,
hold_gc_cutoff_guard,
&ctx,
)
.await
@@ -1896,7 +1823,6 @@ impl PageServerHandler {
request_span,
protocol_version,
io_concurrency,
hold_gc_cutoff_guard,
&ctx,
)
.await
@@ -1925,7 +1851,6 @@ impl PageServerHandler {
request_span: Span,
protocol_version: PagestreamProtocolVersion,
io_concurrency: IoConcurrency,
hold_gc_cutoff_guard: HoldAppliedGcCutoffGuard,
ctx: &RequestContext,
) -> (
(PostgresBackendReader<IO>, TimelineHandles),
@@ -1947,7 +1872,6 @@ impl PageServerHandler {
ctx,
protocol_version,
request_span.clone(),
hold_gc_cutoff_guard,
)
.await;
let msg = match msg {
@@ -1995,7 +1919,6 @@ impl PageServerHandler {
pipelining_config: PageServicePipeliningConfigPipelined,
protocol_version: PagestreamProtocolVersion,
io_concurrency: IoConcurrency,
hold_gc_cutoff_guard: HoldAppliedGcCutoffGuard,
ctx: &RequestContext,
) -> (
(PostgresBackendReader<IO>, TimelineHandles),
@@ -2099,7 +2022,6 @@ impl PageServerHandler {
&ctx,
protocol_version,
request_span.clone(),
hold_gc_cutoff_guard,
)
.await;
let Some(read_res) = read_res.transpose() else {
@@ -2146,7 +2068,6 @@ impl PageServerHandler {
pages,
span: _,
shard: _,
applied_gc_cutoff_guard: _,
batch_break_reason: _,
} = &mut batch
{
@@ -3508,6 +3429,8 @@ impl GrpcPageServiceHandler {
/// NB: errors returned from here are intercepted in get_pages(), and may be converted to a
/// GetPageResponse with an appropriate status code to avoid terminating the stream.
///
/// TODO: verify that the requested pages belong to this shard.
///
/// TODO: get_vectored() currently enforces a batch limit of 32. Postgres will typically send
/// batches up to effective_io_concurrency = 100. Either we have to accept large batches, or
/// split them up in the client or server.
@@ -3533,19 +3456,6 @@ impl GrpcPageServiceHandler {
lsn = %req.read_lsn,
);
for &blkno in &req.block_numbers {
let shard = timeline.get_shard_identity();
let key = rel_block_to_key(req.rel, blkno);
if !shard.is_key_local(&key) {
return Err(tonic::Status::invalid_argument(format!(
"block {blkno} of relation {} requested on wrong shard {} (is on {})",
req.rel,
timeline.get_shard_index(),
ShardIndex::new(shard.get_shard_number(&key), shard.count),
)));
}
}
let latest_gc_cutoff_lsn = timeline.get_applied_gc_cutoff_lsn(); // hold guard
let effective_lsn = PageServerHandler::effective_request_lsn(
&timeline,

View File

@@ -70,7 +70,7 @@ use tracing::*;
use utils::generation::Generation;
use utils::guard_arc_swap::GuardArcSwap;
use utils::id::TimelineId;
use utils::logging::{MonitorSlowFutureCallback, log_slow, monitor_slow_future};
use utils::logging::{MonitorSlowFutureCallback, monitor_slow_future};
use utils::lsn::{AtomicLsn, Lsn, RecordLsn};
use utils::postgres_client::PostgresClientProtocol;
use utils::rate_limit::RateLimit;
@@ -6898,13 +6898,7 @@ impl Timeline {
write_guard.store_and_unlock(new_gc_cutoff)
};
let waitlist_wait_fut = std::pin::pin!(waitlist.wait());
log_slow(
"applied_gc_cutoff waitlist wait",
Duration::from_secs(30),
waitlist_wait_fut,
)
.await;
waitlist.wait().await;
info!("GC starting");

View File

@@ -52,7 +52,7 @@ pub(crate) fn regenerate(
};
// Express a static value for how many shards we may schedule on one node
const MAX_SHARDS: u32 = 2500;
const MAX_SHARDS: u32 = 5000;
let mut doc = PageserverUtilization {
disk_usage_bytes: used,

View File

@@ -79,6 +79,10 @@
#include "access/xlogrecovery.h"
#endif
#if PG_VERSION_NUM < 160000
typedef PGAlignedBlock PGIOAlignedBlock;
#endif
#define NEON_PANIC_CONNECTION_STATE(shard_no, elvl, message, ...) \
neon_shard_log(shard_no, elvl, "Broken connection state: " message, \
##__VA_ARGS__)

View File

@@ -635,11 +635,6 @@ lfc_init(void)
NULL);
}
/*
* Dump a list of pages that are currently in the LFC
*
* This is used to get a snapshot that can be used to prewarm the LFC later.
*/
FileCacheState*
lfc_get_state(size_t max_entries)
{
@@ -2272,3 +2267,4 @@ get_prewarm_info(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}

View File

@@ -1,7 +1,7 @@
/*-------------------------------------------------------------------------
*
* neon.c
* Main entry point into the neon extension
* Main entry point into the neon exension
*
*-------------------------------------------------------------------------
*/
@@ -508,7 +508,7 @@ _PG_init(void)
DefineCustomBoolVariable(
"neon.disable_logical_replication_subscribers",
"Disable incoming logical replication",
"Disables incomming logical replication",
NULL,
&disable_logical_replication_subscribers,
false,
@@ -567,7 +567,7 @@ _PG_init(void)
DefineCustomEnumVariable(
"neon.debug_compare_local",
"Debug mode for comparing content of pages in prefetch ring/LFC/PS and local disk",
"Debug mode for compaing content of pages in prefetch ring/LFC/PS and local disk",
NULL,
&debug_compare_local,
DEBUG_COMPARE_LOCAL_NONE,
@@ -735,6 +735,7 @@ neon_shmem_request_hook(void)
static void
neon_shmem_startup_hook(void)
{
/* Initialize */
if (prev_shmem_startup_hook)
prev_shmem_startup_hook();

View File

@@ -167,7 +167,11 @@ extern neon_per_backend_counters *neon_per_backend_counters_shared;
*/
#define NUM_NEON_PERF_COUNTER_SLOTS (MaxBackends + NUM_AUXILIARY_PROCS)
#if PG_VERSION_NUM >= 170000
#define MyNeonCounters (&neon_per_backend_counters_shared[MyProcNumber])
#else
#define MyNeonCounters (&neon_per_backend_counters_shared[MyProc->pgprocno])
#endif
extern void inc_getpage_wait(uint64 latency);
extern void inc_page_cache_read_wait(uint64 latency);

View File

@@ -9,10 +9,6 @@
#include "fmgr.h"
#include "storage/buf_internals.h"
#if PG_MAJORVERSION_NUM < 16
typedef PGAlignedBlock PGIOAlignedBlock;
#endif
#if PG_MAJORVERSION_NUM < 17
#define NRelFileInfoBackendIsTemp(rinfo) (rinfo.backend != InvalidBackendId)
#else
@@ -162,10 +158,6 @@ InitBufferTag(BufferTag *tag, const RelFileNode *rnode,
#define AmAutoVacuumWorkerProcess() (IsAutoVacuumWorkerProcess())
#endif
#if PG_MAJORVERSION_NUM < 17
#define MyProcNumber (MyProc - &ProcGlobal->allProcs[0])
#endif
#if PG_MAJORVERSION_NUM < 15
extern void InitMaterializedSRF(FunctionCallInfo fcinfo, bits32 flags);
extern TimeLineID GetWALInsertionTimeLine(void);

View File

@@ -72,6 +72,10 @@
#include "access/xlogrecovery.h"
#endif
#if PG_VERSION_NUM < 160000
typedef PGAlignedBlock PGIOAlignedBlock;
#endif
#include "access/nbtree.h"
#include "storage/bufpage.h"
#include "access/xlog_internal.h"

View File

@@ -13,7 +13,6 @@
#include "neon.h"
#include "neon_pgversioncompat.h"
#include "miscadmin.h"
#include "pagestore_client.h"
#include RELFILEINFO_HDR
#include "storage/smgr.h"
@@ -24,6 +23,10 @@
#include "utils/dynahash.h"
#include "utils/guc.h"
#if PG_VERSION_NUM >= 150000
#include "miscadmin.h"
#endif
typedef struct
{
NRelFileInfo rinfo;

View File

@@ -33,6 +33,7 @@ env_logger.workspace = true
framed-websockets.workspace = true
futures.workspace = true
hashbrown.workspace = true
hashlink.workspace = true
hex.workspace = true
hmac.workspace = true
hostname.workspace = true
@@ -53,7 +54,6 @@ json = { path = "../libs/proxy/json" }
lasso = { workspace = true, features = ["multi-threaded"] }
measured = { workspace = true, features = ["lasso"] }
metrics.workspace = true
moka.workspace = true
once_cell.workspace = true
opentelemetry = { workspace = true, features = ["trace"] }
papaya = "0.2.0"
@@ -110,7 +110,7 @@ zerocopy.workspace = true
# uncomment this to use the real subzero-core crate
# subzero-core = { git = "https://github.com/neondatabase/subzero", rev = "396264617e78e8be428682f87469bb25429af88a", features = ["postgresql"], optional = true }
# this is a stub for the subzero-core crate
subzero-core = { path = "../libs/proxy/subzero_core", features = ["postgresql"], optional = true}
subzero-core = { path = "./subzero_core", features = ["postgresql"], optional = true}
ouroboros = { version = "0.18", optional = true }
# jwt stuff

View File

@@ -8,12 +8,11 @@ use tracing::{info, info_span};
use crate::auth::backend::ComputeUserInfo;
use crate::cache::Cached;
use crate::cache::node_info::CachedNodeInfo;
use crate::compute::AuthInfo;
use crate::config::AuthenticationConfig;
use crate::context::RequestContext;
use crate::control_plane::client::cplane_proxy_v1;
use crate::control_plane::{self, NodeInfo};
use crate::control_plane::{self, CachedNodeInfo, NodeInfo};
use crate::error::{ReportableError, UserFacingError};
use crate::pqproto::BeMessage;
use crate::proxy::NeonOptions;

View File

@@ -16,14 +16,14 @@ use tracing::{debug, info};
use crate::auth::{self, ComputeUserInfoMaybeEndpoint, validate_password_and_exchange};
use crate::cache::Cached;
use crate::cache::node_info::CachedNodeInfo;
use crate::config::AuthenticationConfig;
use crate::context::RequestContext;
use crate::control_plane::client::ControlPlaneClient;
use crate::control_plane::errors::GetAuthInfoError;
use crate::control_plane::messages::EndpointRateLimitConfig;
use crate::control_plane::{
self, AccessBlockerFlags, AuthSecret, ControlPlaneApi, EndpointAccessControl, RoleAccessControl,
self, AccessBlockerFlags, AuthSecret, CachedNodeInfo, ControlPlaneApi, EndpointAccessControl,
RoleAccessControl,
};
use crate::intern::EndpointIdInt;
use crate::pqproto::BeMessage;
@@ -433,12 +433,11 @@ mod tests {
use super::auth_quirks;
use super::jwt::JwkCache;
use crate::auth::{ComputeUserInfoMaybeEndpoint, IpPattern};
use crate::cache::node_info::CachedNodeInfo;
use crate::config::AuthenticationConfig;
use crate::context::RequestContext;
use crate::control_plane::messages::EndpointRateLimitConfig;
use crate::control_plane::{
self, AccessBlockerFlags, EndpointAccessControl, RoleAccessControl,
self, AccessBlockerFlags, CachedNodeInfo, EndpointAccessControl, RoleAccessControl,
};
use crate::proxy::NeonOptions;
use crate::rate_limiter::EndpointRateLimiter;

View File

@@ -29,7 +29,7 @@ use crate::config::{
};
use crate::control_plane::locks::ApiLocks;
use crate::http::health_server::AppMetrics;
use crate::metrics::{Metrics, ServiceInfo, ThreadPoolMetrics};
use crate::metrics::{Metrics, ThreadPoolMetrics};
use crate::rate_limiter::{EndpointRateLimiter, LeakyBucketConfig, RateBucketInfo};
use crate::scram::threadpool::ThreadPool;
use crate::serverless::cancel_set::CancelSet;
@@ -207,11 +207,6 @@ pub async fn run() -> anyhow::Result<()> {
endpoint_rate_limiter,
);
Metrics::get()
.service
.info
.set_label(ServiceInfo::running());
match futures::future::select(pin!(maintenance_tasks.join_next()), pin!(task)).await {
// exit immediately on maintenance task completion
Either::Left((Some(res), _)) => match crate::error::flatten_err(res)? {},

View File

@@ -26,7 +26,7 @@ use utils::project_git_version;
use utils::sentry_init::init_sentry;
use crate::context::RequestContext;
use crate::metrics::{Metrics, ServiceInfo, ThreadPoolMetrics};
use crate::metrics::{Metrics, ThreadPoolMetrics};
use crate::pglb::TlsRequired;
use crate::pqproto::FeStartupPacket;
use crate::protocol2::ConnectionInfo;
@@ -135,12 +135,6 @@ pub async fn run() -> anyhow::Result<()> {
cancellation_token.clone(),
))
.map(crate::error::flatten_err);
Metrics::get()
.service
.info
.set_label(ServiceInfo::running());
let signals_task = tokio::spawn(crate::signals::handle(cancellation_token, || {}));
// the signal task cant ever succeed.

View File

@@ -40,7 +40,7 @@ use crate::config::{
};
use crate::context::parquet::ParquetUploadArgs;
use crate::http::health_server::AppMetrics;
use crate::metrics::{Metrics, ServiceInfo};
use crate::metrics::Metrics;
use crate::rate_limiter::{EndpointRateLimiter, RateBucketInfo, WakeComputeRateLimiter};
use crate::redis::connection_with_credentials_provider::ConnectionWithCredentialsProvider;
use crate::redis::kv_ops::RedisKVClient;
@@ -535,7 +535,12 @@ pub async fn run() -> anyhow::Result<()> {
// add a task to flush the db_schema cache every 10 minutes
#[cfg(feature = "rest_broker")]
if let Some(db_schema_cache) = &config.rest_config.db_schema_cache {
maintenance_tasks.spawn(db_schema_cache.maintain());
maintenance_tasks.spawn(async move {
loop {
tokio::time::sleep(Duration::from_secs(600)).await;
db_schema_cache.flush();
}
});
}
if let Some(metrics_config) = &config.metric_collection {
@@ -585,11 +590,6 @@ pub async fn run() -> anyhow::Result<()> {
}
}
Metrics::get()
.service
.info
.set_label(ServiceInfo::running());
let maintenance = loop {
// get one complete task
match futures::future::select(
@@ -711,7 +711,12 @@ fn build_config(args: &ProxyCliArgs) -> anyhow::Result<&'static ProxyConfig> {
info!("Using DbSchemaCache with options={db_schema_cache_config:?}");
let db_schema_cache = if args.is_rest_broker {
Some(DbSchemaCache::new(db_schema_cache_config))
Some(DbSchemaCache::new(
"db_schema_cache",
db_schema_cache_config.size,
db_schema_cache_config.ttl,
true,
))
} else {
None
};

View File

@@ -1,16 +1,4 @@
use std::ops::{Deref, DerefMut};
use std::time::{Duration, Instant};
use moka::Expiry;
use moka::notification::RemovalCause;
use crate::control_plane::messages::ControlPlaneErrorMessage;
use crate::metrics::{
CacheEviction, CacheKind, CacheOutcome, CacheOutcomeGroup, CacheRemovalCause, Metrics,
};
/// Default TTL used when caching errors from control plane.
pub const DEFAULT_ERROR_TTL: Duration = Duration::from_secs(30);
/// A generic trait which exposes types of cache's key and value,
/// as well as the notion of cache entry invalidation.
@@ -22,16 +10,20 @@ pub(crate) trait Cache {
/// Entry's value.
type Value;
/// Used for entry invalidation.
type LookupInfo<Key>;
/// Invalidate an entry using a lookup info.
/// We don't have an empty default impl because it's error-prone.
fn invalidate(&self, _: &Self::Key);
fn invalidate(&self, _: &Self::LookupInfo<Self::Key>);
}
impl<C: Cache> Cache for &C {
type Key = C::Key;
type Value = C::Value;
type LookupInfo<Key> = C::LookupInfo<Key>;
fn invalidate(&self, info: &Self::Key) {
fn invalidate(&self, info: &Self::LookupInfo<Self::Key>) {
C::invalidate(self, info);
}
}
@@ -39,7 +31,7 @@ impl<C: Cache> Cache for &C {
/// Wrapper for convenient entry invalidation.
pub(crate) struct Cached<C: Cache, V = <C as Cache>::Value> {
/// Cache + lookup info.
pub(crate) token: Option<(C, C::Key)>,
pub(crate) token: Option<(C, C::LookupInfo<C::Key>)>,
/// The value itself.
pub(crate) value: V,
@@ -51,6 +43,23 @@ impl<C: Cache, V> Cached<C, V> {
Self { token: None, value }
}
pub(crate) fn take_value(self) -> (Cached<C, ()>, V) {
(
Cached {
token: self.token,
value: (),
},
self.value,
)
}
pub(crate) fn map<U>(self, f: impl FnOnce(V) -> U) -> Cached<C, U> {
Cached {
token: self.token,
value: f(self.value),
}
}
/// Drop this entry from a cache if it's still there.
pub(crate) fn invalidate(self) -> V {
if let Some((cache, info)) = &self.token {
@@ -78,91 +87,3 @@ impl<C: Cache, V> DerefMut for Cached<C, V> {
&mut self.value
}
}
pub type ControlPlaneResult<T> = Result<T, Box<ControlPlaneErrorMessage>>;
#[derive(Clone, Copy)]
pub struct CplaneExpiry {
pub error: Duration,
}
impl Default for CplaneExpiry {
fn default() -> Self {
Self {
error: DEFAULT_ERROR_TTL,
}
}
}
impl CplaneExpiry {
pub fn expire_early<V>(
&self,
value: &ControlPlaneResult<V>,
updated: Instant,
) -> Option<Duration> {
match value {
Ok(_) => None,
Err(err) => Some(self.expire_err_early(err, updated)),
}
}
pub fn expire_err_early(&self, err: &ControlPlaneErrorMessage, updated: Instant) -> Duration {
err.status
.as_ref()
.and_then(|s| s.details.retry_info.as_ref())
.map_or(self.error, |r| r.retry_at.into_std() - updated)
}
}
impl<K, V> Expiry<K, ControlPlaneResult<V>> for CplaneExpiry {
fn expire_after_create(
&self,
_key: &K,
value: &ControlPlaneResult<V>,
created_at: Instant,
) -> Option<Duration> {
self.expire_early(value, created_at)
}
fn expire_after_update(
&self,
_key: &K,
value: &ControlPlaneResult<V>,
updated_at: Instant,
_duration_until_expiry: Option<Duration>,
) -> Option<Duration> {
self.expire_early(value, updated_at)
}
}
pub fn eviction_listener(kind: CacheKind, cause: RemovalCause) {
let cause = match cause {
RemovalCause::Expired => CacheRemovalCause::Expired,
RemovalCause::Explicit => CacheRemovalCause::Explicit,
RemovalCause::Replaced => CacheRemovalCause::Replaced,
RemovalCause::Size => CacheRemovalCause::Size,
};
Metrics::get()
.cache
.evicted_total
.inc(CacheEviction { cache: kind, cause });
}
#[inline]
pub fn count_cache_outcome<T>(kind: CacheKind, cache_result: Option<T>) -> Option<T> {
let outcome = if cache_result.is_some() {
CacheOutcome::Hit
} else {
CacheOutcome::Miss
};
Metrics::get().cache.request_total.inc(CacheOutcomeGroup {
cache: kind,
outcome,
});
cache_result
}
#[inline]
pub fn count_cache_insert(kind: CacheKind) {
Metrics::get().cache.inserted_total.inc(kind);
}

View File

@@ -1,5 +1,6 @@
pub(crate) mod common;
pub(crate) mod node_info;
pub(crate) mod project_info;
mod timed_lru;
pub(crate) use common::{Cached, ControlPlaneResult, CplaneExpiry};
pub(crate) use common::{Cache, Cached};
pub(crate) use timed_lru::TimedLru;

View File

@@ -1,60 +0,0 @@
use crate::cache::common::{Cache, count_cache_insert, count_cache_outcome, eviction_listener};
use crate::cache::{Cached, ControlPlaneResult, CplaneExpiry};
use crate::config::CacheOptions;
use crate::control_plane::NodeInfo;
use crate::metrics::{CacheKind, Metrics};
use crate::types::EndpointCacheKey;
pub(crate) struct NodeInfoCache(moka::sync::Cache<EndpointCacheKey, ControlPlaneResult<NodeInfo>>);
pub(crate) type CachedNodeInfo = Cached<&'static NodeInfoCache, NodeInfo>;
impl Cache for NodeInfoCache {
type Key = EndpointCacheKey;
type Value = ControlPlaneResult<NodeInfo>;
fn invalidate(&self, info: &EndpointCacheKey) {
self.0.invalidate(info);
}
}
impl NodeInfoCache {
pub fn new(config: CacheOptions) -> Self {
let builder = moka::sync::Cache::builder()
.name("node_info")
.expire_after(CplaneExpiry::default());
let builder = config.moka(builder);
if let Some(size) = config.size {
Metrics::get()
.cache
.capacity
.set(CacheKind::NodeInfo, size as i64);
}
let builder = builder
.eviction_listener(|_k, _v, cause| eviction_listener(CacheKind::NodeInfo, cause));
Self(builder.build())
}
pub fn insert(&self, key: EndpointCacheKey, value: ControlPlaneResult<NodeInfo>) {
count_cache_insert(CacheKind::NodeInfo);
self.0.insert(key, value);
}
pub fn get(&self, key: &EndpointCacheKey) -> Option<ControlPlaneResult<NodeInfo>> {
count_cache_outcome(CacheKind::NodeInfo, self.0.get(key))
}
pub fn get_entry(
&'static self,
key: &EndpointCacheKey,
) -> Option<ControlPlaneResult<CachedNodeInfo>> {
self.get(key).map(|res| {
res.map(|value| Cached {
token: Some((self, key.clone())),
value,
})
})
}
}

View File

@@ -1,20 +1,84 @@
use std::collections::HashSet;
use std::collections::{HashMap, HashSet, hash_map};
use std::convert::Infallible;
use std::time::Duration;
use async_trait::async_trait;
use clashmap::ClashMap;
use moka::sync::Cache;
use clashmap::mapref::one::Ref;
use rand::Rng;
use tokio::time::Instant;
use tracing::{debug, info};
use crate::cache::common::{
ControlPlaneResult, CplaneExpiry, count_cache_insert, count_cache_outcome, eviction_listener,
};
use crate::config::ProjectInfoCacheOptions;
use crate::control_plane::messages::{ControlPlaneErrorMessage, Reason};
use crate::control_plane::{EndpointAccessControl, RoleAccessControl};
use crate::intern::{AccountIdInt, EndpointIdInt, ProjectIdInt, RoleNameInt};
use crate::metrics::{CacheKind, Metrics};
use crate::types::{EndpointId, RoleName};
#[async_trait]
pub(crate) trait ProjectInfoCache {
fn invalidate_endpoint_access(&self, endpoint_id: EndpointIdInt);
fn invalidate_endpoint_access_for_project(&self, project_id: ProjectIdInt);
fn invalidate_endpoint_access_for_org(&self, account_id: AccountIdInt);
fn invalidate_role_secret_for_project(&self, project_id: ProjectIdInt, role_name: RoleNameInt);
}
struct Entry<T> {
expires_at: Instant,
value: T,
}
impl<T> Entry<T> {
pub(crate) fn new(value: T, ttl: Duration) -> Self {
Self {
expires_at: Instant::now() + ttl,
value,
}
}
pub(crate) fn get(&self) -> Option<&T> {
(!self.is_expired()).then_some(&self.value)
}
fn is_expired(&self) -> bool {
self.expires_at <= Instant::now()
}
}
struct EndpointInfo {
role_controls: HashMap<RoleNameInt, Entry<ControlPlaneResult<RoleAccessControl>>>,
controls: Option<Entry<ControlPlaneResult<EndpointAccessControl>>>,
}
type ControlPlaneResult<T> = Result<T, Box<ControlPlaneErrorMessage>>;
impl EndpointInfo {
pub(crate) fn get_role_secret_with_ttl(
&self,
role_name: RoleNameInt,
) -> Option<(ControlPlaneResult<RoleAccessControl>, Duration)> {
let entry = self.role_controls.get(&role_name)?;
let ttl = entry.expires_at - Instant::now();
Some((entry.get()?.clone(), ttl))
}
pub(crate) fn get_controls_with_ttl(
&self,
) -> Option<(ControlPlaneResult<EndpointAccessControl>, Duration)> {
let entry = self.controls.as_ref()?;
let ttl = entry.expires_at - Instant::now();
Some((entry.get()?.clone(), ttl))
}
pub(crate) fn invalidate_endpoint(&mut self) {
self.controls = None;
}
pub(crate) fn invalidate_role_secret(&mut self, role_name: RoleNameInt) {
self.role_controls.remove(&role_name);
}
}
/// Cache for project info.
/// This is used to cache auth data for endpoints.
/// Invalidation is done by console notifications or by TTL (if console notifications are disabled).
@@ -22,9 +86,8 @@ use crate::types::{EndpointId, RoleName};
/// We also store endpoint-to-project mapping in the cache, to be able to access per-endpoint data.
/// One may ask, why the data is stored per project, when on the user request there is only data about the endpoint available?
/// On the cplane side updates are done per project (or per branch), so it's easier to invalidate the whole project cache.
pub struct ProjectInfoCache {
role_controls: Cache<(EndpointIdInt, RoleNameInt), ControlPlaneResult<RoleAccessControl>>,
ep_controls: Cache<EndpointIdInt, ControlPlaneResult<EndpointAccessControl>>,
pub struct ProjectInfoCacheImpl {
cache: ClashMap<EndpointIdInt, EndpointInfo>,
project2ep: ClashMap<ProjectIdInt, HashSet<EndpointIdInt>>,
// FIXME(stefan): we need a way to GC the account2ep map.
@@ -33,13 +96,16 @@ pub struct ProjectInfoCache {
config: ProjectInfoCacheOptions,
}
impl ProjectInfoCache {
pub fn invalidate_endpoint_access(&self, endpoint_id: EndpointIdInt) {
#[async_trait]
impl ProjectInfoCache for ProjectInfoCacheImpl {
fn invalidate_endpoint_access(&self, endpoint_id: EndpointIdInt) {
info!("invalidating endpoint access for `{endpoint_id}`");
self.ep_controls.invalidate(&endpoint_id);
if let Some(mut endpoint_info) = self.cache.get_mut(&endpoint_id) {
endpoint_info.invalidate_endpoint();
}
}
pub fn invalidate_endpoint_access_for_project(&self, project_id: ProjectIdInt) {
fn invalidate_endpoint_access_for_project(&self, project_id: ProjectIdInt) {
info!("invalidating endpoint access for project `{project_id}`");
let endpoints = self
.project2ep
@@ -47,11 +113,13 @@ impl ProjectInfoCache {
.map(|kv| kv.value().clone())
.unwrap_or_default();
for endpoint_id in endpoints {
self.ep_controls.invalidate(&endpoint_id);
if let Some(mut endpoint_info) = self.cache.get_mut(&endpoint_id) {
endpoint_info.invalidate_endpoint();
}
}
}
pub fn invalidate_endpoint_access_for_org(&self, account_id: AccountIdInt) {
fn invalidate_endpoint_access_for_org(&self, account_id: AccountIdInt) {
info!("invalidating endpoint access for org `{account_id}`");
let endpoints = self
.account2ep
@@ -59,15 +127,13 @@ impl ProjectInfoCache {
.map(|kv| kv.value().clone())
.unwrap_or_default();
for endpoint_id in endpoints {
self.ep_controls.invalidate(&endpoint_id);
if let Some(mut endpoint_info) = self.cache.get_mut(&endpoint_id) {
endpoint_info.invalidate_endpoint();
}
}
}
pub fn invalidate_role_secret_for_project(
&self,
project_id: ProjectIdInt,
role_name: RoleNameInt,
) {
fn invalidate_role_secret_for_project(&self, project_id: ProjectIdInt, role_name: RoleNameInt) {
info!(
"invalidating role secret for project_id `{}` and role_name `{}`",
project_id, role_name,
@@ -78,73 +144,47 @@ impl ProjectInfoCache {
.map(|kv| kv.value().clone())
.unwrap_or_default();
for endpoint_id in endpoints {
self.role_controls.invalidate(&(endpoint_id, role_name));
if let Some(mut endpoint_info) = self.cache.get_mut(&endpoint_id) {
endpoint_info.invalidate_role_secret(role_name);
}
}
}
}
impl ProjectInfoCache {
impl ProjectInfoCacheImpl {
pub(crate) fn new(config: ProjectInfoCacheOptions) -> Self {
Metrics::get().cache.capacity.set(
CacheKind::ProjectInfoRoles,
(config.size * config.max_roles) as i64,
);
Metrics::get()
.cache
.capacity
.set(CacheKind::ProjectInfoEndpoints, config.size as i64);
// we cache errors for 30 seconds, unless retry_at is set.
let expiry = CplaneExpiry::default();
Self {
role_controls: Cache::builder()
.name("project_info_roles")
.eviction_listener(|_k, _v, cause| {
eviction_listener(CacheKind::ProjectInfoRoles, cause);
})
.max_capacity(config.size * config.max_roles)
.time_to_live(config.ttl)
.expire_after(expiry)
.build(),
ep_controls: Cache::builder()
.name("project_info_endpoints")
.eviction_listener(|_k, _v, cause| {
eviction_listener(CacheKind::ProjectInfoEndpoints, cause);
})
.max_capacity(config.size)
.time_to_live(config.ttl)
.expire_after(expiry)
.build(),
cache: ClashMap::new(),
project2ep: ClashMap::new(),
account2ep: ClashMap::new(),
config,
}
}
pub(crate) fn get_role_secret(
fn get_endpoint_cache(
&self,
endpoint_id: &EndpointId,
) -> Option<Ref<'_, EndpointIdInt, EndpointInfo>> {
let endpoint_id = EndpointIdInt::get(endpoint_id)?;
self.cache.get(&endpoint_id)
}
pub(crate) fn get_role_secret_with_ttl(
&self,
endpoint_id: &EndpointId,
role_name: &RoleName,
) -> Option<ControlPlaneResult<RoleAccessControl>> {
let endpoint_id = EndpointIdInt::get(endpoint_id)?;
) -> Option<(ControlPlaneResult<RoleAccessControl>, Duration)> {
let role_name = RoleNameInt::get(role_name)?;
count_cache_outcome(
CacheKind::ProjectInfoRoles,
self.role_controls.get(&(endpoint_id, role_name)),
)
let endpoint_info = self.get_endpoint_cache(endpoint_id)?;
endpoint_info.get_role_secret_with_ttl(role_name)
}
pub(crate) fn get_endpoint_access(
pub(crate) fn get_endpoint_access_with_ttl(
&self,
endpoint_id: &EndpointId,
) -> Option<ControlPlaneResult<EndpointAccessControl>> {
let endpoint_id = EndpointIdInt::get(endpoint_id)?;
count_cache_outcome(
CacheKind::ProjectInfoEndpoints,
self.ep_controls.get(&endpoint_id),
)
) -> Option<(ControlPlaneResult<EndpointAccessControl>, Duration)> {
let endpoint_info = self.get_endpoint_cache(endpoint_id)?;
endpoint_info.get_controls_with_ttl()
}
pub(crate) fn insert_endpoint_access(
@@ -163,17 +203,34 @@ impl ProjectInfoCache {
self.insert_project2endpoint(project_id, endpoint_id);
}
if self.cache.len() >= self.config.size {
// If there are too many entries, wait until the next gc cycle.
return;
}
debug!(
key = &*endpoint_id,
"created a cache entry for endpoint access"
);
count_cache_insert(CacheKind::ProjectInfoEndpoints);
count_cache_insert(CacheKind::ProjectInfoRoles);
let controls = Some(Entry::new(Ok(controls), self.config.ttl));
let role_controls = Entry::new(Ok(role_controls), self.config.ttl);
self.ep_controls.insert(endpoint_id, Ok(controls));
self.role_controls
.insert((endpoint_id, role_name), Ok(role_controls));
match self.cache.entry(endpoint_id) {
clashmap::Entry::Vacant(e) => {
e.insert(EndpointInfo {
role_controls: HashMap::from_iter([(role_name, role_controls)]),
controls,
});
}
clashmap::Entry::Occupied(mut e) => {
let ep = e.get_mut();
ep.controls = controls;
if ep.role_controls.len() < self.config.max_roles {
ep.role_controls.insert(role_name, role_controls);
}
}
}
}
pub(crate) fn insert_endpoint_access_err(
@@ -181,34 +238,55 @@ impl ProjectInfoCache {
endpoint_id: EndpointIdInt,
role_name: RoleNameInt,
msg: Box<ControlPlaneErrorMessage>,
ttl: Option<Duration>,
) {
if self.cache.len() >= self.config.size {
// If there are too many entries, wait until the next gc cycle.
return;
}
debug!(
key = &*endpoint_id,
"created a cache entry for an endpoint access error"
);
// RoleProtected is the only role-specific error that control plane can give us.
// If a given role name does not exist, it still returns a successful response,
// just with an empty secret.
if msg.get_reason() != Reason::RoleProtected {
// We can cache all the other errors in ep_controls because they don't
// depend on what role name we pass to control plane.
self.ep_controls
.entry(endpoint_id)
.and_compute_with(|entry| match entry {
// leave the entry alone if it's already Ok
Some(entry) if entry.value().is_ok() => moka::ops::compute::Op::Nop,
// replace the entry
_ => {
count_cache_insert(CacheKind::ProjectInfoEndpoints);
moka::ops::compute::Op::Put(Err(msg.clone()))
}
});
}
let ttl = ttl.unwrap_or(self.config.ttl);
count_cache_insert(CacheKind::ProjectInfoRoles);
self.role_controls
.insert((endpoint_id, role_name), Err(msg));
let controls = if msg.get_reason() == Reason::RoleProtected {
// RoleProtected is the only role-specific error that control plane can give us.
// If a given role name does not exist, it still returns a successful response,
// just with an empty secret.
None
} else {
// We can cache all the other errors in EndpointInfo.controls,
// because they don't depend on what role name we pass to control plane.
Some(Entry::new(Err(msg.clone()), ttl))
};
let role_controls = Entry::new(Err(msg), ttl);
match self.cache.entry(endpoint_id) {
clashmap::Entry::Vacant(e) => {
e.insert(EndpointInfo {
role_controls: HashMap::from_iter([(role_name, role_controls)]),
controls,
});
}
clashmap::Entry::Occupied(mut e) => {
let ep = e.get_mut();
if let Some(entry) = &ep.controls
&& !entry.is_expired()
&& entry.value.is_ok()
{
// If we have cached non-expired, non-error controls, keep them.
} else {
ep.controls = controls;
}
if ep.role_controls.len() < self.config.max_roles {
ep.role_controls.insert(role_name, role_controls);
}
}
}
}
fn insert_project2endpoint(&self, project_id: ProjectIdInt, endpoint_id: EndpointIdInt) {
@@ -229,35 +307,73 @@ impl ProjectInfoCache {
}
}
pub fn maybe_invalidate_role_secret(&self, _endpoint_id: &EndpointId, _role_name: &RoleName) {
// TODO: Expire the value early if the key is idle.
// Currently not an issue as we would just use the TTL to decide, which is what already happens.
pub fn maybe_invalidate_role_secret(&self, endpoint_id: &EndpointId, role_name: &RoleName) {
let Some(endpoint_id) = EndpointIdInt::get(endpoint_id) else {
return;
};
let Some(role_name) = RoleNameInt::get(role_name) else {
return;
};
let Some(mut endpoint_info) = self.cache.get_mut(&endpoint_id) else {
return;
};
let entry = endpoint_info.role_controls.entry(role_name);
let hash_map::Entry::Occupied(role_controls) = entry else {
return;
};
if role_controls.get().is_expired() {
role_controls.remove();
}
}
pub async fn gc_worker(&self) -> anyhow::Result<Infallible> {
let mut interval = tokio::time::interval(self.config.gc_interval);
let mut interval =
tokio::time::interval(self.config.gc_interval / (self.cache.shards().len()) as u32);
loop {
interval.tick().await;
self.ep_controls.run_pending_tasks();
self.role_controls.run_pending_tasks();
if self.cache.len() < self.config.size {
// If there are not too many entries, wait until the next gc cycle.
continue;
}
self.gc();
}
}
fn gc(&self) {
let shard = rand::rng().random_range(0..self.project2ep.shards().len());
debug!(shard, "project_info_cache: performing epoch reclamation");
// acquire a random shard lock
let mut removed = 0;
let shard = self.project2ep.shards()[shard].write();
for (_, endpoints) in shard.iter() {
for endpoint in endpoints {
self.cache.remove(endpoint);
removed += 1;
}
}
// We can drop this shard only after making sure that all endpoints are removed.
drop(shard);
info!("project_info_cache: removed {removed} endpoints");
}
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
use std::time::Duration;
use super::*;
use crate::control_plane::messages::{Details, EndpointRateLimitConfig, ErrorInfo, Status};
use crate::control_plane::{AccessBlockerFlags, AuthSecret};
use crate::scram::ServerSecret;
use std::sync::Arc;
#[tokio::test]
async fn test_project_info_cache_settings() {
let cache = ProjectInfoCache::new(ProjectInfoCacheOptions {
size: 1,
tokio::time::pause();
let cache = ProjectInfoCacheImpl::new(ProjectInfoCacheOptions {
size: 2,
max_roles: 2,
ttl: Duration::from_secs(1),
gc_interval: Duration::from_secs(600),
@@ -307,17 +423,22 @@ mod tests {
},
);
let cached = cache.get_role_secret(&endpoint_id, &user1).unwrap();
let (cached, ttl) = cache
.get_role_secret_with_ttl(&endpoint_id, &user1)
.unwrap();
assert_eq!(cached.unwrap().secret, secret1);
assert_eq!(ttl, cache.config.ttl);
let cached = cache.get_role_secret(&endpoint_id, &user2).unwrap();
let (cached, ttl) = cache
.get_role_secret_with_ttl(&endpoint_id, &user2)
.unwrap();
assert_eq!(cached.unwrap().secret, secret2);
assert_eq!(ttl, cache.config.ttl);
// Shouldn't add more than 2 roles.
let user3: RoleName = "user3".into();
let secret3 = Some(AuthSecret::Scram(ServerSecret::mock([3; 32])));
cache.role_controls.run_pending_tasks();
cache.insert_endpoint_access(
account_id,
project_id,
@@ -334,18 +455,31 @@ mod tests {
},
);
cache.role_controls.run_pending_tasks();
assert_eq!(cache.role_controls.entry_count(), 2);
assert!(
cache
.get_role_secret_with_ttl(&endpoint_id, &user3)
.is_none()
);
tokio::time::sleep(Duration::from_secs(2)).await;
let cached = cache
.get_endpoint_access_with_ttl(&endpoint_id)
.unwrap()
.0
.unwrap();
assert_eq!(cached.allowed_ips, allowed_ips);
cache.role_controls.run_pending_tasks();
assert_eq!(cache.role_controls.entry_count(), 0);
tokio::time::advance(Duration::from_secs(2)).await;
let cached = cache.get_role_secret_with_ttl(&endpoint_id, &user1);
assert!(cached.is_none());
let cached = cache.get_role_secret_with_ttl(&endpoint_id, &user2);
assert!(cached.is_none());
let cached = cache.get_endpoint_access_with_ttl(&endpoint_id);
assert!(cached.is_none());
}
#[tokio::test]
async fn test_caching_project_info_errors() {
let cache = ProjectInfoCache::new(ProjectInfoCacheOptions {
let cache = ProjectInfoCacheImpl::new(ProjectInfoCacheOptions {
size: 10,
max_roles: 10,
ttl: Duration::from_secs(1),
@@ -385,23 +519,34 @@ mod tests {
status: None,
});
let get_role_secret =
|endpoint_id, role_name| cache.get_role_secret(endpoint_id, role_name).unwrap();
let get_endpoint_access = |endpoint_id| cache.get_endpoint_access(endpoint_id).unwrap();
let get_role_secret = |endpoint_id, role_name| {
cache
.get_role_secret_with_ttl(endpoint_id, role_name)
.unwrap()
.0
};
let get_endpoint_access =
|endpoint_id| cache.get_endpoint_access_with_ttl(endpoint_id).unwrap().0;
// stores role-specific errors only for get_role_secret
cache.insert_endpoint_access_err((&endpoint_id).into(), (&user1).into(), role_msg.clone());
cache.insert_endpoint_access_err(
(&endpoint_id).into(),
(&user1).into(),
role_msg.clone(),
None,
);
assert_eq!(
get_role_secret(&endpoint_id, &user1).unwrap_err().error,
role_msg.error
);
assert!(cache.get_endpoint_access(&endpoint_id).is_none());
assert!(cache.get_endpoint_access_with_ttl(&endpoint_id).is_none());
// stores non-role specific errors for both get_role_secret and get_endpoint_access
cache.insert_endpoint_access_err(
(&endpoint_id).into(),
(&user1).into(),
generic_msg.clone(),
None,
);
assert_eq!(
get_role_secret(&endpoint_id, &user1).unwrap_err().error,
@@ -413,7 +558,11 @@ mod tests {
);
// error isn't returned for other roles in the same endpoint
assert!(cache.get_role_secret(&endpoint_id, &user2).is_none());
assert!(
cache
.get_role_secret_with_ttl(&endpoint_id, &user2)
.is_none()
);
// success for a role does not overwrite errors for other roles
cache.insert_endpoint_access(
@@ -441,6 +590,7 @@ mod tests {
(&endpoint_id).into(),
(&user2).into(),
generic_msg.clone(),
None,
);
assert!(get_role_secret(&endpoint_id, &user2).is_err());
assert!(get_endpoint_access(&endpoint_id).is_ok());

262
proxy/src/cache/timed_lru.rs vendored Normal file
View File

@@ -0,0 +1,262 @@
use std::borrow::Borrow;
use std::hash::Hash;
use std::time::{Duration, Instant};
// This seems to make more sense than `lru` or `cached`:
//
// * `near/nearcore` ditched `cached` in favor of `lru`
// (https://github.com/near/nearcore/issues?q=is%3Aissue+lru+is%3Aclosed).
//
// * `lru` methods use an obscure `KeyRef` type in their contraints (which is deliberately excluded from docs).
// This severely hinders its usage both in terms of creating wrappers and supported key types.
//
// On the other hand, `hashlink` has good download stats and appears to be maintained.
use hashlink::{LruCache, linked_hash_map::RawEntryMut};
use tracing::debug;
use super::Cache;
use super::common::Cached;
/// An implementation of timed LRU cache with fixed capacity.
/// Key properties:
///
/// * Whenever a new entry is inserted, the least recently accessed one is evicted.
/// The cache also keeps track of entry's insertion time (`created_at`) and TTL (`expires_at`).
///
/// * If `update_ttl_on_retrieval` is `true`. When the entry is about to be retrieved, we check its expiration timestamp.
/// If the entry has expired, we remove it from the cache; Otherwise we bump the
/// expiration timestamp (e.g. +5mins) and change its place in LRU list to prolong
/// its existence.
///
/// * There's an API for immediate invalidation (removal) of a cache entry;
/// It's useful in case we know for sure that the entry is no longer correct.
/// See [`Cached`] for more information.
///
/// * Expired entries are kept in the cache, until they are evicted by the LRU policy,
/// or by a successful lookup (i.e. the entry hasn't expired yet).
/// There is no background job to reap the expired records.
///
/// * It's possible for an entry that has not yet expired entry to be evicted
/// before expired items. That's a bit wasteful, but probably fine in practice.
pub(crate) struct TimedLru<K, V> {
/// Cache's name for tracing.
name: &'static str,
/// The underlying cache implementation.
cache: parking_lot::Mutex<LruCache<K, Entry<V>>>,
/// Default time-to-live of a single entry.
ttl: Duration,
update_ttl_on_retrieval: bool,
}
impl<K: Hash + Eq, V> Cache for TimedLru<K, V> {
type Key = K;
type Value = V;
type LookupInfo<Key> = Key;
fn invalidate(&self, info: &Self::LookupInfo<K>) {
self.invalidate_raw(info);
}
}
struct Entry<T> {
created_at: Instant,
expires_at: Instant,
ttl: Duration,
update_ttl_on_retrieval: bool,
value: T,
}
impl<K: Hash + Eq, V> TimedLru<K, V> {
/// Construct a new LRU cache with timed entries.
pub(crate) fn new(
name: &'static str,
capacity: usize,
ttl: Duration,
update_ttl_on_retrieval: bool,
) -> Self {
Self {
name,
cache: LruCache::new(capacity).into(),
ttl,
update_ttl_on_retrieval,
}
}
/// Drop an entry from the cache if it's outdated.
#[tracing::instrument(level = "debug", fields(cache = self.name), skip_all)]
fn invalidate_raw(&self, key: &K) {
// Do costly things before taking the lock.
let mut cache = self.cache.lock();
let entry = match cache.raw_entry_mut().from_key(key) {
RawEntryMut::Vacant(_) => return,
RawEntryMut::Occupied(x) => x.remove(),
};
drop(cache); // drop lock before logging
let Entry {
created_at,
expires_at,
..
} = entry;
debug!(
?created_at,
?expires_at,
"processed a cache entry invalidation event"
);
}
/// Try retrieving an entry by its key, then execute `extract` if it exists.
#[tracing::instrument(level = "debug", fields(cache = self.name), skip_all)]
fn get_raw<Q, R>(&self, key: &Q, extract: impl FnOnce(&K, &Entry<V>) -> R) -> Option<R>
where
K: Borrow<Q>,
Q: Hash + Eq + ?Sized,
{
let now = Instant::now();
// Do costly things before taking the lock.
let mut cache = self.cache.lock();
let mut raw_entry = match cache.raw_entry_mut().from_key(key) {
RawEntryMut::Vacant(_) => return None,
RawEntryMut::Occupied(x) => x,
};
// Immeditely drop the entry if it has expired.
let entry = raw_entry.get();
if entry.expires_at <= now {
raw_entry.remove();
return None;
}
let value = extract(raw_entry.key(), entry);
let (created_at, expires_at) = (entry.created_at, entry.expires_at);
// Update the deadline and the entry's position in the LRU list.
let deadline = now.checked_add(raw_entry.get().ttl).expect("time overflow");
if raw_entry.get().update_ttl_on_retrieval {
raw_entry.get_mut().expires_at = deadline;
}
raw_entry.to_back();
drop(cache); // drop lock before logging
debug!(
created_at = format_args!("{created_at:?}"),
old_expires_at = format_args!("{expires_at:?}"),
new_expires_at = format_args!("{deadline:?}"),
"accessed a cache entry"
);
Some(value)
}
/// Insert an entry to the cache. If an entry with the same key already
/// existed, return the previous value and its creation timestamp.
#[tracing::instrument(level = "debug", fields(cache = self.name), skip_all)]
fn insert_raw(&self, key: K, value: V) -> (Instant, Option<V>) {
self.insert_raw_ttl(key, value, self.ttl, self.update_ttl_on_retrieval)
}
/// Insert an entry to the cache. If an entry with the same key already
/// existed, return the previous value and its creation timestamp.
#[tracing::instrument(level = "debug", fields(cache = self.name), skip_all)]
fn insert_raw_ttl(
&self,
key: K,
value: V,
ttl: Duration,
update: bool,
) -> (Instant, Option<V>) {
let created_at = Instant::now();
let expires_at = created_at.checked_add(ttl).expect("time overflow");
let entry = Entry {
created_at,
expires_at,
ttl,
update_ttl_on_retrieval: update,
value,
};
// Do costly things before taking the lock.
let old = self
.cache
.lock()
.insert(key, entry)
.map(|entry| entry.value);
debug!(
created_at = format_args!("{created_at:?}"),
expires_at = format_args!("{expires_at:?}"),
replaced = old.is_some(),
"created a cache entry"
);
(created_at, old)
}
}
impl<K: Hash + Eq + Clone, V: Clone> TimedLru<K, V> {
pub(crate) fn insert_ttl(&self, key: K, value: V, ttl: Duration) {
self.insert_raw_ttl(key, value, ttl, false);
}
#[cfg(feature = "rest_broker")]
pub(crate) fn insert(&self, key: K, value: V) {
self.insert_raw_ttl(key, value, self.ttl, self.update_ttl_on_retrieval);
}
pub(crate) fn insert_unit(&self, key: K, value: V) -> (Option<V>, Cached<&Self, ()>) {
let (_, old) = self.insert_raw(key.clone(), value);
let cached = Cached {
token: Some((self, key)),
value: (),
};
(old, cached)
}
#[cfg(feature = "rest_broker")]
pub(crate) fn flush(&self) {
let now = Instant::now();
let mut cache = self.cache.lock();
// Collect keys of expired entries first
let expired_keys: Vec<_> = cache
.iter()
.filter_map(|(key, entry)| {
if entry.expires_at <= now {
Some(key.clone())
} else {
None
}
})
.collect();
// Remove expired entries
for key in expired_keys {
cache.remove(&key);
}
}
}
impl<K: Hash + Eq, V: Clone> TimedLru<K, V> {
/// Retrieve a cached entry in convenient wrapper, alongside timing information.
pub(crate) fn get_with_created_at<Q>(
&self,
key: &Q,
) -> Option<Cached<&Self, (<Self as Cache>::Value, Instant)>>
where
K: Borrow<Q> + Clone,
Q: Hash + Eq + ?Sized,
{
self.get_raw(key, |key, entry| Cached {
token: Some((self, key.clone())),
value: (entry.value.clone(), entry.created_at),
})
}
}

View File

@@ -107,23 +107,20 @@ pub fn remote_storage_from_toml(s: &str) -> anyhow::Result<RemoteStorageConfig>
#[derive(Debug)]
pub struct CacheOptions {
/// Max number of entries.
pub size: Option<u64>,
pub size: usize,
/// Entry's time-to-live.
pub absolute_ttl: Option<Duration>,
/// Entry's time-to-idle.
pub idle_ttl: Option<Duration>,
pub ttl: Duration,
}
impl CacheOptions {
/// Default options for [`crate::cache::node_info::NodeInfoCache`].
pub const CACHE_DEFAULT_OPTIONS: &'static str = "size=4000,idle_ttl=4m";
/// Default options for [`crate::control_plane::NodeInfoCache`].
pub const CACHE_DEFAULT_OPTIONS: &'static str = "size=4000,ttl=4m";
/// Parse cache options passed via cmdline.
/// Example: [`Self::CACHE_DEFAULT_OPTIONS`].
fn parse(options: &str) -> anyhow::Result<Self> {
let mut size = None;
let mut absolute_ttl = None;
let mut idle_ttl = None;
let mut ttl = None;
for option in options.split(',') {
let (key, value) = option
@@ -132,33 +129,20 @@ impl CacheOptions {
match key {
"size" => size = Some(value.parse()?),
"absolute_ttl" | "ttl" => absolute_ttl = Some(humantime::parse_duration(value)?),
"idle_ttl" | "tti" => idle_ttl = Some(humantime::parse_duration(value)?),
"ttl" => ttl = Some(humantime::parse_duration(value)?),
unknown => bail!("unknown key: {unknown}"),
}
}
Ok(Self {
size,
absolute_ttl,
idle_ttl,
})
}
// TTL doesn't matter if cache is always empty.
if let Some(0) = size {
ttl.get_or_insert(Duration::default());
}
pub fn moka<K, V, C>(
&self,
mut builder: moka::sync::CacheBuilder<K, V, C>,
) -> moka::sync::CacheBuilder<K, V, C> {
if let Some(size) = self.size {
builder = builder.max_capacity(size);
}
if let Some(ttl) = self.absolute_ttl {
builder = builder.time_to_live(ttl);
}
if let Some(tti) = self.idle_ttl {
builder = builder.time_to_idle(tti);
}
builder
Ok(Self {
size: size.context("missing `size`")?,
ttl: ttl.context("missing `ttl`")?,
})
}
}
@@ -175,17 +159,17 @@ impl FromStr for CacheOptions {
#[derive(Debug)]
pub struct ProjectInfoCacheOptions {
/// Max number of entries.
pub size: u64,
pub size: usize,
/// Entry's time-to-live.
pub ttl: Duration,
/// Max number of roles per endpoint.
pub max_roles: u64,
pub max_roles: usize,
/// Gc interval.
pub gc_interval: Duration,
}
impl ProjectInfoCacheOptions {
/// Default options for [`crate::cache::project_info::ProjectInfoCache`].
/// Default options for [`crate::control_plane::NodeInfoCache`].
pub const CACHE_DEFAULT_OPTIONS: &'static str =
"size=10000,ttl=4m,max_roles=10,gc_interval=60m";
@@ -512,37 +496,21 @@ mod tests {
#[test]
fn test_parse_cache_options() -> anyhow::Result<()> {
let CacheOptions {
size,
absolute_ttl,
idle_ttl: _,
} = "size=4096,ttl=5min".parse()?;
assert_eq!(size, Some(4096));
assert_eq!(absolute_ttl, Some(Duration::from_secs(5 * 60)));
let CacheOptions { size, ttl } = "size=4096,ttl=5min".parse()?;
assert_eq!(size, 4096);
assert_eq!(ttl, Duration::from_secs(5 * 60));
let CacheOptions {
size,
absolute_ttl,
idle_ttl: _,
} = "ttl=4m,size=2".parse()?;
assert_eq!(size, Some(2));
assert_eq!(absolute_ttl, Some(Duration::from_secs(4 * 60)));
let CacheOptions { size, ttl } = "ttl=4m,size=2".parse()?;
assert_eq!(size, 2);
assert_eq!(ttl, Duration::from_secs(4 * 60));
let CacheOptions {
size,
absolute_ttl,
idle_ttl: _,
} = "size=0,ttl=1s".parse()?;
assert_eq!(size, Some(0));
assert_eq!(absolute_ttl, Some(Duration::from_secs(1)));
let CacheOptions { size, ttl } = "size=0,ttl=1s".parse()?;
assert_eq!(size, 0);
assert_eq!(ttl, Duration::from_secs(1));
let CacheOptions {
size,
absolute_ttl,
idle_ttl: _,
} = "size=0".parse()?;
assert_eq!(size, Some(0));
assert_eq!(absolute_ttl, None);
let CacheOptions { size, ttl } = "size=0".parse()?;
assert_eq!(size, 0);
assert_eq!(ttl, Duration::default());
Ok(())
}

View File

@@ -3,6 +3,7 @@
use std::net::IpAddr;
use std::str::FromStr;
use std::sync::Arc;
use std::time::Duration;
use ::http::HeaderName;
use ::http::header::AUTHORIZATION;
@@ -16,8 +17,6 @@ use tracing::{Instrument, debug, info, info_span, warn};
use super::super::messages::{ControlPlaneErrorMessage, GetEndpointAccessControl, WakeCompute};
use crate::auth::backend::ComputeUserInfo;
use crate::auth::backend::jwt::AuthRule;
use crate::cache::Cached;
use crate::cache::node_info::CachedNodeInfo;
use crate::context::RequestContext;
use crate::control_plane::caches::ApiCaches;
use crate::control_plane::errors::{
@@ -26,7 +25,8 @@ use crate::control_plane::errors::{
use crate::control_plane::locks::ApiLocks;
use crate::control_plane::messages::{ColdStartInfo, EndpointJwksResponse};
use crate::control_plane::{
AccessBlockerFlags, AuthInfo, AuthSecret, EndpointAccessControl, NodeInfo, RoleAccessControl,
AccessBlockerFlags, AuthInfo, AuthSecret, CachedNodeInfo, EndpointAccessControl, NodeInfo,
RoleAccessControl,
};
use crate::metrics::Metrics;
use crate::proxy::retry::CouldRetry;
@@ -118,6 +118,7 @@ impl NeonControlPlaneClient {
cache_key.into(),
role.into(),
msg.clone(),
retry_info.map(|r| Duration::from_millis(r.retry_delay_ms)),
);
Err(err)
@@ -346,11 +347,18 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
) -> Result<RoleAccessControl, GetAuthInfoError> {
let key = endpoint.normalize();
if let Some(role_control) = self.caches.project_info.get_role_secret(&key, role) {
if let Some((role_control, ttl)) = self
.caches
.project_info
.get_role_secret_with_ttl(&key, role)
{
return match role_control {
Err(msg) => {
Err(mut msg) => {
info!(key = &*key, "found cached get_role_access_control error");
// if retry_delay_ms is set change it to the remaining TTL
replace_retry_delay_ms(&mut msg, |_| ttl.as_millis() as u64);
Err(GetAuthInfoError::ApiError(ControlPlaneError::Message(msg)))
}
Ok(role_control) => {
@@ -375,14 +383,17 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
) -> Result<EndpointAccessControl, GetAuthInfoError> {
let key = endpoint.normalize();
if let Some(control) = self.caches.project_info.get_endpoint_access(&key) {
if let Some((control, ttl)) = self.caches.project_info.get_endpoint_access_with_ttl(&key) {
return match control {
Err(msg) => {
Err(mut msg) => {
info!(
key = &*key,
"found cached get_endpoint_access_control error"
);
// if retry_delay_ms is set change it to the remaining TTL
replace_retry_delay_ms(&mut msg, |_| ttl.as_millis() as u64);
Err(GetAuthInfoError::ApiError(ControlPlaneError::Message(msg)))
}
Ok(control) => {
@@ -415,11 +426,17 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
macro_rules! check_cache {
() => {
if let Some(info) = self.caches.node_info.get_entry(&key) {
if let Some(cached) = self.caches.node_info.get_with_created_at(&key) {
let (cached, (info, created_at)) = cached.take_value();
return match info {
Err(msg) => {
Err(mut msg) => {
info!(key = &*key, "found cached wake_compute error");
// if retry_delay_ms is set, reduce it by the amount of time it spent in cache
replace_retry_delay_ms(&mut msg, |delay| {
delay.saturating_sub(created_at.elapsed().as_millis() as u64)
});
Err(WakeComputeError::ControlPlane(ControlPlaneError::Message(
msg,
)))
@@ -427,7 +444,7 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
Ok(info) => {
debug!(key = &*key, "found cached compute node info");
ctx.set_project(info.aux.clone());
Ok(info)
Ok(cached.map(|()| info))
}
};
}
@@ -466,12 +483,10 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
let mut stored_node = node.clone();
// store the cached node as 'warm_cached'
stored_node.aux.cold_start_info = ColdStartInfo::WarmCached;
self.caches.node_info.insert(key.clone(), Ok(stored_node));
Ok(Cached {
token: Some((&self.caches.node_info, key)),
value: node,
})
let (_, cached) = self.caches.node_info.insert_unit(key, Ok(stored_node));
Ok(cached.map(|()| node))
}
Err(err) => match err {
WakeComputeError::ControlPlane(ControlPlaneError::Message(ref msg)) => {
@@ -488,7 +503,11 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
"created a cache entry for the wake compute error"
);
self.caches.node_info.insert(key, Err(msg.clone()));
let ttl = retry_info.map_or(Duration::from_secs(30), |r| {
Duration::from_millis(r.retry_delay_ms)
});
self.caches.node_info.insert_ttl(key, Err(msg.clone()), ttl);
Err(err)
}
@@ -498,6 +517,14 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
}
}
fn replace_retry_delay_ms(msg: &mut ControlPlaneErrorMessage, f: impl FnOnce(u64) -> u64) {
if let Some(status) = &mut msg.status
&& let Some(retry_info) = &mut status.details.retry_info
{
retry_info.retry_delay_ms = f(retry_info.retry_delay_ms);
}
}
/// Parse http response body, taking status code into account.
fn parse_body<T: for<'a> serde::Deserialize<'a>>(
status: StatusCode,

View File

@@ -15,7 +15,6 @@ use crate::auth::IpPattern;
use crate::auth::backend::ComputeUserInfo;
use crate::auth::backend::jwt::AuthRule;
use crate::cache::Cached;
use crate::cache::node_info::CachedNodeInfo;
use crate::compute::ConnectInfo;
use crate::context::RequestContext;
use crate::control_plane::errors::{
@@ -23,7 +22,8 @@ use crate::control_plane::errors::{
};
use crate::control_plane::messages::{EndpointRateLimitConfig, MetricsAuxInfo};
use crate::control_plane::{
AccessBlockerFlags, AuthInfo, AuthSecret, EndpointAccessControl, NodeInfo, RoleAccessControl,
AccessBlockerFlags, AuthInfo, AuthSecret, CachedNodeInfo, EndpointAccessControl, NodeInfo,
RoleAccessControl,
};
use crate::intern::RoleNameInt;
use crate::scram;

View File

@@ -13,11 +13,10 @@ use tracing::{debug, info};
use super::{EndpointAccessControl, RoleAccessControl};
use crate::auth::backend::ComputeUserInfo;
use crate::auth::backend::jwt::{AuthRule, FetchAuthRules, FetchAuthRulesError};
use crate::cache::node_info::{CachedNodeInfo, NodeInfoCache};
use crate::cache::project_info::ProjectInfoCache;
use crate::cache::project_info::ProjectInfoCacheImpl;
use crate::config::{CacheOptions, ProjectInfoCacheOptions};
use crate::context::RequestContext;
use crate::control_plane::{ControlPlaneApi, errors};
use crate::control_plane::{CachedNodeInfo, ControlPlaneApi, NodeInfoCache, errors};
use crate::error::ReportableError;
use crate::metrics::ApiLockMetrics;
use crate::rate_limiter::{DynamicLimiter, Outcome, RateLimiterConfig, Token};
@@ -120,7 +119,7 @@ pub struct ApiCaches {
/// Cache for the `wake_compute` API method.
pub(crate) node_info: NodeInfoCache,
/// Cache which stores project_id -> endpoint_ids mapping.
pub project_info: Arc<ProjectInfoCache>,
pub project_info: Arc<ProjectInfoCacheImpl>,
}
impl ApiCaches {
@@ -129,8 +128,13 @@ impl ApiCaches {
project_info_cache_config: ProjectInfoCacheOptions,
) -> Self {
Self {
node_info: NodeInfoCache::new(wake_compute_cache_config),
project_info: Arc::new(ProjectInfoCache::new(project_info_cache_config)),
node_info: NodeInfoCache::new(
"node_info_cache",
wake_compute_cache_config.size,
wake_compute_cache_config.ttl,
true,
),
project_info: Arc::new(ProjectInfoCacheImpl::new(project_info_cache_config)),
}
}
}

View File

@@ -1,10 +1,8 @@
use std::fmt::{self, Display};
use std::time::Duration;
use measured::FixedCardinalityLabel;
use serde::{Deserialize, Serialize};
use smol_str::SmolStr;
use tokio::time::Instant;
use crate::auth::IpPattern;
use crate::intern::{AccountIdInt, BranchIdInt, EndpointIdInt, ProjectIdInt, RoleNameInt};
@@ -233,13 +231,7 @@ impl Reason {
#[derive(Copy, Clone, Debug, Deserialize)]
#[allow(dead_code)]
pub(crate) struct RetryInfo {
#[serde(rename = "retry_delay_ms", deserialize_with = "milliseconds_from_now")]
pub(crate) retry_at: Instant,
}
fn milliseconds_from_now<'de, D: serde::Deserializer<'de>>(d: D) -> Result<Instant, D::Error> {
let millis = u64::deserialize(d)?;
Ok(Instant::now() + Duration::from_millis(millis))
pub(crate) retry_delay_ms: u64,
}
#[derive(Debug, Deserialize, Clone)]

View File

@@ -16,13 +16,13 @@ use messages::EndpointRateLimitConfig;
use crate::auth::backend::ComputeUserInfo;
use crate::auth::backend::jwt::AuthRule;
use crate::auth::{AuthError, IpPattern, check_peer_addr_is_in_list};
use crate::cache::node_info::CachedNodeInfo;
use crate::cache::{Cached, TimedLru};
use crate::context::RequestContext;
use crate::control_plane::messages::MetricsAuxInfo;
use crate::control_plane::messages::{ControlPlaneErrorMessage, MetricsAuxInfo};
use crate::intern::{AccountIdInt, EndpointIdInt, ProjectIdInt};
use crate::protocol2::ConnectionInfoExtra;
use crate::rate_limiter::{EndpointRateLimiter, LeakyBucketConfig};
use crate::types::{EndpointId, RoleName};
use crate::types::{EndpointCacheKey, EndpointId, RoleName};
use crate::{compute, scram};
/// Various cache-related types.
@@ -77,6 +77,10 @@ pub(crate) struct AccessBlockerFlags {
pub vpc_access_blocked: bool,
}
pub(crate) type NodeInfoCache =
TimedLru<EndpointCacheKey, Result<NodeInfo, Box<ControlPlaneErrorMessage>>>;
pub(crate) type CachedNodeInfo = Cached<&'static NodeInfoCache, NodeInfo>;
#[derive(Clone, Debug)]
pub struct RoleAccessControl {
pub secret: Option<AuthSecret>,

View File

@@ -2,16 +2,15 @@ use std::sync::{Arc, OnceLock};
use lasso::ThreadedRodeo;
use measured::label::{
FixedCardinalitySet, LabelGroupSet, LabelGroupVisitor, LabelName, LabelSet, LabelValue,
StaticLabelSet,
FixedCardinalitySet, LabelGroupSet, LabelName, LabelSet, LabelValue, StaticLabelSet,
};
use measured::metric::histogram::Thresholds;
use measured::metric::name::MetricName;
use measured::{
Counter, CounterVec, FixedCardinalityLabel, Gauge, GaugeVec, Histogram, HistogramVec,
LabelGroup, MetricGroup,
Counter, CounterVec, FixedCardinalityLabel, Gauge, Histogram, HistogramVec, LabelGroup,
MetricGroup,
};
use metrics::{CounterPairAssoc, CounterPairVec, HyperLogLogVec, InfoMetric};
use metrics::{CounterPairAssoc, CounterPairVec, HyperLogLogVec};
use tokio::time::{self, Instant};
use crate::control_plane::messages::ColdStartInfo;
@@ -26,12 +25,6 @@ pub struct Metrics {
#[metric(namespace = "wake_compute_lock")]
pub wake_compute_lock: ApiLockMetrics,
#[metric(namespace = "service")]
pub service: ServiceMetrics,
#[metric(namespace = "cache")]
pub cache: CacheMetrics,
}
static SELF: OnceLock<Metrics> = OnceLock::new();
@@ -222,6 +215,13 @@ pub enum Bool {
False,
}
#[derive(FixedCardinalityLabel, Copy, Clone)]
#[label(singleton = "outcome")]
pub enum CacheOutcome {
Hit,
Miss,
}
#[derive(LabelGroup)]
#[label(set = ConsoleRequestSet)]
pub struct ConsoleRequest<'a> {
@@ -553,6 +553,14 @@ impl From<bool> for Bool {
}
}
#[derive(LabelGroup)]
#[label(set = InvalidEndpointsSet)]
pub struct InvalidEndpointsGroup {
pub protocol: Protocol,
pub rejected: Bool,
pub outcome: ConnectOutcome,
}
#[derive(LabelGroup)]
#[label(set = RetriesMetricSet)]
pub struct RetriesMetricGroup {
@@ -652,99 +660,3 @@ pub struct ThreadPoolMetrics {
#[metric(init = CounterVec::with_label_set(ThreadPoolWorkers(workers)))]
pub worker_task_skips_total: CounterVec<ThreadPoolWorkers>,
}
#[derive(MetricGroup, Default)]
pub struct ServiceMetrics {
pub info: InfoMetric<ServiceInfo>,
}
#[derive(Default)]
pub struct ServiceInfo {
pub state: ServiceState,
}
impl ServiceInfo {
pub const fn running() -> Self {
ServiceInfo {
state: ServiceState::Running,
}
}
pub const fn terminating() -> Self {
ServiceInfo {
state: ServiceState::Terminating,
}
}
}
impl LabelGroup for ServiceInfo {
fn visit_values(&self, v: &mut impl LabelGroupVisitor) {
const STATE: &LabelName = LabelName::from_str("state");
v.write_value(STATE, &self.state);
}
}
#[derive(FixedCardinalityLabel, Clone, Copy, Debug, Default)]
#[label(singleton = "state")]
pub enum ServiceState {
#[default]
Init,
Running,
Terminating,
}
#[derive(MetricGroup)]
#[metric(new())]
pub struct CacheMetrics {
/// The capacity of the cache
pub capacity: GaugeVec<StaticLabelSet<CacheKind>>,
/// The total number of entries inserted into the cache
pub inserted_total: CounterVec<StaticLabelSet<CacheKind>>,
/// The total number of entries removed from the cache
pub evicted_total: CounterVec<CacheEvictionSet>,
/// The total number of cache requests
pub request_total: CounterVec<CacheOutcomeSet>,
}
impl Default for CacheMetrics {
fn default() -> Self {
Self::new()
}
}
#[derive(FixedCardinalityLabel, Clone, Copy, Debug)]
#[label(singleton = "cache")]
pub enum CacheKind {
NodeInfo,
ProjectInfoEndpoints,
ProjectInfoRoles,
Schema,
}
#[derive(FixedCardinalityLabel, Clone, Copy, Debug)]
pub enum CacheRemovalCause {
Expired,
Explicit,
Replaced,
Size,
}
#[derive(LabelGroup)]
#[label(set = CacheEvictionSet)]
pub struct CacheEviction {
pub cache: CacheKind,
pub cause: CacheRemovalCause,
}
#[derive(FixedCardinalityLabel, Copy, Clone)]
pub enum CacheOutcome {
Hit,
Miss,
}
#[derive(LabelGroup)]
#[label(set = CacheOutcomeSet)]
pub struct CacheOutcomeGroup {
pub cache: CacheKind,
pub outcome: CacheOutcome,
}

View File

@@ -2,7 +2,7 @@ use thiserror::Error;
use crate::auth::Backend;
use crate::auth::backend::ComputeUserInfo;
use crate::cache::common::Cache;
use crate::cache::Cache;
use crate::compute::{AuthInfo, ComputeConnection, ConnectionError, PostgresError};
use crate::config::ProxyConfig;
use crate::context::RequestContext;

View File

@@ -1,12 +1,11 @@
use tokio::time;
use tracing::{debug, info, warn};
use crate::cache::node_info::CachedNodeInfo;
use crate::compute::{self, COULD_NOT_CONNECT, ComputeConnection};
use crate::config::{ComputeConfig, ProxyConfig, RetryConfig};
use crate::context::RequestContext;
use crate::control_plane::NodeInfo;
use crate::control_plane::locks::ApiLocks;
use crate::control_plane::{self, NodeInfo};
use crate::metrics::{
ConnectOutcome, ConnectionFailureKind, Metrics, RetriesMetricGroup, RetryType,
};
@@ -18,7 +17,7 @@ use crate::types::Host;
/// (e.g. the compute node's address might've changed at the wrong time).
/// Invalidate the cache entry (if any) to prevent subsequent errors.
#[tracing::instrument(skip_all)]
pub(crate) fn invalidate_cache(node_info: CachedNodeInfo) -> NodeInfo {
pub(crate) fn invalidate_cache(node_info: control_plane::CachedNodeInfo) -> NodeInfo {
let is_cached = node_info.cached();
if is_cached {
warn!("invalidating stalled compute node info cache entry");
@@ -38,7 +37,7 @@ pub(crate) trait ConnectMechanism {
async fn connect_once(
&self,
ctx: &RequestContext,
node_info: &CachedNodeInfo,
node_info: &control_plane::CachedNodeInfo,
config: &ComputeConfig,
) -> Result<Self::Connection, compute::ConnectionError>;
}
@@ -67,7 +66,7 @@ impl ConnectMechanism for TcpMechanism<'_> {
async fn connect_once(
&self,
ctx: &RequestContext,
node_info: &CachedNodeInfo,
node_info: &control_plane::CachedNodeInfo,
config: &ComputeConfig,
) -> Result<ComputeConnection, compute::ConnectionError> {
let permit = self.locks.get_permit(&node_info.conn_info.host).await?;

View File

@@ -15,17 +15,15 @@ use rstest::rstest;
use rustls::crypto::ring;
use rustls::pki_types;
use tokio::io::{AsyncRead, AsyncWrite, DuplexStream};
use tokio::time::Instant;
use tracing_test::traced_test;
use super::retry::CouldRetry;
use crate::auth::backend::{ComputeUserInfo, MaybeOwned};
use crate::cache::node_info::{CachedNodeInfo, NodeInfoCache};
use crate::config::{CacheOptions, ComputeConfig, RetryConfig, TlsConfig};
use crate::config::{ComputeConfig, RetryConfig, TlsConfig};
use crate::context::RequestContext;
use crate::control_plane::client::{ControlPlaneClient, TestControlPlaneClient};
use crate::control_plane::messages::{ControlPlaneErrorMessage, Details, MetricsAuxInfo, Status};
use crate::control_plane::{self, NodeInfo};
use crate::control_plane::{self, CachedNodeInfo, NodeInfo, NodeInfoCache};
use crate::error::ErrorKind;
use crate::pglb::ERR_INSECURE_CONNECTION;
use crate::pglb::handshake::{HandshakeData, handshake};
@@ -419,11 +417,12 @@ impl TestConnectMechanism {
Self {
counter: Arc::new(std::sync::Mutex::new(0)),
sequence,
cache: Box::leak(Box::new(NodeInfoCache::new(CacheOptions {
size: Some(1),
absolute_ttl: Some(Duration::from_secs(100)),
idle_ttl: None,
}))),
cache: Box::leak(Box::new(NodeInfoCache::new(
"test",
1,
Duration::from_secs(100),
false,
))),
}
}
}
@@ -437,7 +436,7 @@ impl ConnectMechanism for TestConnectMechanism {
async fn connect_once(
&self,
_ctx: &RequestContext,
_node_info: &CachedNodeInfo,
_node_info: &control_plane::CachedNodeInfo,
_config: &ComputeConfig,
) -> Result<Self::Connection, compute::ConnectionError> {
let mut counter = self.counter.lock().unwrap();
@@ -502,7 +501,7 @@ impl TestControlPlaneClient for TestConnectMechanism {
details: Details {
error_info: None,
retry_info: Some(control_plane::messages::RetryInfo {
retry_at: Instant::now() + Duration::from_millis(1),
retry_delay_ms: 1,
}),
user_facing_message: None,
},
@@ -547,11 +546,8 @@ fn helper_create_uncached_node_info() -> NodeInfo {
fn helper_create_cached_node_info(cache: &'static NodeInfoCache) -> CachedNodeInfo {
let node = helper_create_uncached_node_info();
cache.insert("key".into(), Ok(node.clone()));
CachedNodeInfo {
token: Some((cache, "key".into())),
value: node,
}
let (_, node2) = cache.insert_unit("key".into(), Ok(node.clone()));
node2.map(|()| node)
}
fn helper_create_connect_info(

View File

@@ -1,9 +1,9 @@
use async_trait::async_trait;
use tracing::{error, info};
use crate::cache::node_info::CachedNodeInfo;
use crate::config::RetryConfig;
use crate::context::RequestContext;
use crate::control_plane::CachedNodeInfo;
use crate::control_plane::errors::{ControlPlaneError, WakeComputeError};
use crate::error::ReportableError;
use crate::metrics::{

View File

@@ -131,11 +131,11 @@ where
Ok(())
}
struct MessageHandler<C: Send + Sync + 'static> {
struct MessageHandler<C: ProjectInfoCache + Send + Sync + 'static> {
cache: Arc<C>,
}
impl<C: Send + Sync + 'static> Clone for MessageHandler<C> {
impl<C: ProjectInfoCache + Send + Sync + 'static> Clone for MessageHandler<C> {
fn clone(&self) -> Self {
Self {
cache: self.cache.clone(),
@@ -143,8 +143,8 @@ impl<C: Send + Sync + 'static> Clone for MessageHandler<C> {
}
}
impl MessageHandler<ProjectInfoCache> {
pub(crate) fn new(cache: Arc<ProjectInfoCache>) -> Self {
impl<C: ProjectInfoCache + Send + Sync + 'static> MessageHandler<C> {
pub(crate) fn new(cache: Arc<C>) -> Self {
Self { cache }
}
@@ -224,7 +224,7 @@ impl MessageHandler<ProjectInfoCache> {
}
}
fn invalidate_cache(cache: Arc<ProjectInfoCache>, msg: Notification) {
fn invalidate_cache<C: ProjectInfoCache>(cache: Arc<C>, msg: Notification) {
match msg {
Notification::EndpointSettingsUpdate(ids) => ids
.iter()
@@ -247,8 +247,8 @@ fn invalidate_cache(cache: Arc<ProjectInfoCache>, msg: Notification) {
}
}
async fn handle_messages(
handler: MessageHandler<ProjectInfoCache>,
async fn handle_messages<C: ProjectInfoCache + Send + Sync + 'static>(
handler: MessageHandler<C>,
redis: ConnectionWithCredentialsProvider,
cancellation_token: CancellationToken,
) -> anyhow::Result<()> {
@@ -284,10 +284,13 @@ async fn handle_messages(
/// Handle console's invalidation messages.
#[tracing::instrument(name = "redis_notifications", skip_all)]
pub async fn task_main(
pub async fn task_main<C>(
redis: ConnectionWithCredentialsProvider,
cache: Arc<ProjectInfoCache>,
) -> anyhow::Result<Infallible> {
cache: Arc<C>,
) -> anyhow::Result<Infallible>
where
C: ProjectInfoCache + Send + Sync + 'static,
{
let handler = MessageHandler::new(cache);
// 6h - 1m.
// There will be 1 minute overlap between two tasks. But at least we can be sure that no message is lost.

View File

@@ -1,6 +1,5 @@
use std::borrow::Cow;
use std::collections::HashMap;
use std::convert::Infallible;
use std::sync::Arc;
use bytes::Bytes;
@@ -13,7 +12,6 @@ use hyper::body::Incoming;
use hyper::http::{HeaderName, HeaderValue};
use hyper::{Request, Response, StatusCode};
use indexmap::IndexMap;
use moka::sync::Cache;
use ouroboros::self_referencing;
use serde::de::DeserializeOwned;
use serde::{Deserialize, Deserializer};
@@ -55,12 +53,12 @@ use super::http_util::{
};
use super::json::JsonConversionError;
use crate::auth::backend::ComputeCredentialKeys;
use crate::cache::common::{count_cache_insert, count_cache_outcome, eviction_listener};
use crate::cache::{Cached, TimedLru};
use crate::config::ProxyConfig;
use crate::context::RequestContext;
use crate::error::{ErrorKind, ReportableError, UserFacingError};
use crate::http::read_body_with_limit;
use crate::metrics::{CacheKind, Metrics};
use crate::metrics::Metrics;
use crate::serverless::sql_over_http::HEADER_VALUE_TRUE;
use crate::types::EndpointCacheKey;
use crate::util::deserialize_json_string;
@@ -140,31 +138,8 @@ pub struct ApiConfig {
}
// The DbSchemaCache is a cache of the ApiConfig and DbSchemaOwned for each endpoint
pub(crate) struct DbSchemaCache(Cache<EndpointCacheKey, Arc<(ApiConfig, DbSchemaOwned)>>);
pub(crate) type DbSchemaCache = TimedLru<EndpointCacheKey, Arc<(ApiConfig, DbSchemaOwned)>>;
impl DbSchemaCache {
pub fn new(config: crate::config::CacheOptions) -> Self {
let builder = Cache::builder().name("schema");
let builder = config.moka(builder);
let metrics = &Metrics::get().cache;
if let Some(size) = config.size {
metrics.capacity.set(CacheKind::Schema, size as i64);
}
let builder =
builder.eviction_listener(|_k, _v, cause| eviction_listener(CacheKind::Schema, cause));
Self(builder.build())
}
pub async fn maintain(&self) -> Result<Infallible, anyhow::Error> {
let mut ticker = tokio::time::interval(std::time::Duration::from_secs(60));
loop {
ticker.tick().await;
self.0.run_pending_tasks();
}
}
pub async fn get_cached_or_remote(
&self,
endpoint_id: &EndpointCacheKey,
@@ -174,9 +149,8 @@ impl DbSchemaCache {
ctx: &RequestContext,
config: &'static ProxyConfig,
) -> Result<Arc<(ApiConfig, DbSchemaOwned)>, RestError> {
let cache_result = count_cache_outcome(CacheKind::Schema, self.0.get(endpoint_id));
match cache_result {
Some(v) => Ok(v),
match self.get_with_created_at(endpoint_id) {
Some(Cached { value: (v, _), .. }) => Ok(v),
None => {
info!("db_schema cache miss for endpoint: {:?}", endpoint_id);
let remote_value = self
@@ -199,8 +173,7 @@ impl DbSchemaCache {
db_extra_search_path: None,
};
let value = Arc::new((api_config, schema_owned));
count_cache_insert(CacheKind::Schema);
self.0.insert(endpoint_id.clone(), value);
self.insert(endpoint_id.clone(), value);
return Err(e);
}
Err(e) => {
@@ -208,8 +181,7 @@ impl DbSchemaCache {
}
};
let value = Arc::new((api_config, schema_owned));
count_cache_insert(CacheKind::Schema);
self.0.insert(endpoint_id.clone(), value.clone());
self.insert(endpoint_id.clone(), value.clone());
Ok(value)
}
}

View File

@@ -4,8 +4,6 @@ use anyhow::bail;
use tokio_util::sync::CancellationToken;
use tracing::{info, warn};
use crate::metrics::{Metrics, ServiceInfo};
/// Handle unix signals appropriately.
pub async fn handle<F>(
token: CancellationToken,
@@ -30,12 +28,10 @@ where
// Shut down the whole application.
_ = interrupt.recv() => {
warn!("received SIGINT, exiting immediately");
Metrics::get().service.info.set_label(ServiceInfo::terminating());
bail!("interrupted");
}
_ = terminate.recv() => {
warn!("received SIGTERM, shutting down once all existing connections have closed");
Metrics::get().service.info.set_label(ServiceInfo::terminating());
token.cancel();
}
}

View File

@@ -102,7 +102,7 @@ pub struct ReportedError {
}
impl ReportedError {
pub fn new(e: impl UserFacingError + Into<anyhow::Error>) -> Self {
pub fn new(e: (impl UserFacingError + Into<anyhow::Error>)) -> Self {
let error_kind = e.get_error_kind();
Self {
source: e.into(),

View File

@@ -981,7 +981,6 @@ impl Reconciler {
));
}
let mut first_err = None;
for (node, conf) in changes {
if self.cancel.is_cancelled() {
return Err(ReconcileError::Cancel);
@@ -991,12 +990,7 @@ impl Reconciler {
// shard _available_ (the attached location), and configuring secondary locations
// can be done lazily when the node becomes available (via background reconciliation).
if node.is_available() {
let res = self.location_config(&node, conf, None, false).await;
if let Err(err) = res {
if first_err.is_none() {
first_err = Some(err);
}
}
self.location_config(&node, conf, None, false).await?;
} else {
// If the node is unavailable, we skip and consider the reconciliation successful: this
// is a common case where a pageserver is marked unavailable: we demote a location on
@@ -1008,10 +1002,6 @@ impl Reconciler {
}
}
if let Some(err) = first_err {
return Err(err);
}
// The condition below identifies a detach. We must have no attached intent and
// must have been attached to something previously. Pass this information to
// the [`ComputeHook`] such that it can update its tenant-wide state.

View File

@@ -1530,19 +1530,10 @@ impl Service {
// so that waiters will see the correct error after waiting.
tenant.set_last_error(result.sequence, e);
// If the reconciliation failed, don't clear the observed state for places where we
// detached. Instead, mark the observed state as uncertain.
let failed_reconcile_deltas = deltas.map(|delta| {
if let ObservedStateDelta::Delete(node_id) = delta {
ObservedStateDelta::Upsert(Box::new((
node_id,
ObservedStateLocation { conf: None },
)))
} else {
delta
}
});
tenant.apply_observed_deltas(failed_reconcile_deltas);
// Skip deletions on reconcile failures
let upsert_deltas =
deltas.filter(|delta| matches!(delta, ObservedStateDelta::Upsert(_)));
tenant.apply_observed_deltas(upsert_deltas);
}
}

View File

@@ -16,7 +16,6 @@ from typing_extensions import override
from fixtures.benchmark_fixture import MetricReport, NeonBenchmarker
from fixtures.log_helper import log
from fixtures.neon_fixtures import (
Endpoint,
NeonEnv,
PgBin,
PgProtocol,
@@ -130,10 +129,6 @@ class NeonCompare(PgCompare):
# Start pg
self._pg = self.env.endpoints.create_start("main", "main", self.tenant)
@property
def endpoint(self) -> Endpoint:
return self._pg
@property
@override
def pg(self) -> PgProtocol:

View File

@@ -79,28 +79,18 @@ class EndpointHttpClient(requests.Session):
return json
def prewarm_lfc(self, from_endpoint_id: str | None = None):
"""
Prewarm LFC cache from given endpoint and wait till it finishes or errors
"""
params = {"from_endpoint": from_endpoint_id} if from_endpoint_id else dict()
self.post(self.prewarm_url, params=params).raise_for_status()
self.prewarm_lfc_wait()
def prewarm_lfc_wait(self):
"""
Wait till LFC prewarm returns with error or success.
If prewarm was not requested before calling this function, it will error
"""
statuses = "failed", "completed", "skipped"
def prewarmed():
json = self.prewarm_lfc_status()
status, err = json["status"], json.get("error")
assert status in statuses, f"{status}, {err=}"
assert status in ["failed", "completed", "skipped"], f"{status}, {err=}"
wait_until(prewarmed, timeout=60)
res = self.prewarm_lfc_status()
assert res["status"] != "failed", res
assert self.prewarm_lfc_status()["status"] != "failed"
def offload_lfc_status(self) -> dict[str, str]:
res = self.get(self.offload_url)
@@ -109,26 +99,17 @@ class EndpointHttpClient(requests.Session):
return json
def offload_lfc(self):
"""
Offload LFC cache to endpoint storage and wait till offload finishes or errors
"""
self.post(self.offload_url).raise_for_status()
self.offload_lfc_wait()
def offload_lfc_wait(self):
"""
Wait till LFC offload returns with error or success.
If offload was not requested before calling this function, it will error
"""
def offloaded():
json = self.offload_lfc_status()
status, err = json["status"], json.get("error")
assert status in ["failed", "completed"], f"{status}, {err=}"
wait_until(offloaded, timeout=60)
res = self.offload_lfc_status()
assert res["status"] != "failed", res
wait_until(offloaded)
assert self.offload_lfc_status()["status"] != "failed"
def promote(self, promote_spec: dict[str, Any], disconnect: bool = False):
url = f"http://localhost:{self.external_port}/promote"

View File

@@ -1,6 +1,5 @@
from __future__ import annotations
import re
import time
from typing import TYPE_CHECKING, cast, final
@@ -14,17 +13,6 @@ if TYPE_CHECKING:
from fixtures.pg_version import PgVersion
def connstr_to_env(connstr: str) -> dict[str, str]:
# postgresql://neondb_owner:npg_kuv6Rqi1cB@ep-old-silence-w26pxsvz-pooler.us-east-2.aws.neon.build/neondb?sslmode=require&channel_binding=...'
parts = re.split(r":|@|\/|\?", connstr.removeprefix("postgresql://"))
return {
"PGUSER": parts[0],
"PGPASSWORD": parts[1],
"PGHOST": parts[2],
"PGDATABASE": parts[3],
}
def connection_parameters_to_env(params: dict[str, str]) -> dict[str, str]:
return {
"PGHOST": params["host"],

View File

@@ -9,10 +9,9 @@
```bash
export BENCHMARK_CONNSTR=postgres://user:pass@ep-abc-xyz-123.us-east-2.aws.neon.build/neondb
export CLICKHOUSE_PASSWORD=ch_password123
docker compose -f test_runner/logical_repl/clickhouse/docker-compose.yml up -d
./scripts/pytest -m remote_cluster -k 'test_clickhouse[release-pg17]'
./scripts/pytest -m remote_cluster -k test_clickhouse
docker compose -f test_runner/logical_repl/clickhouse/docker-compose.yml down
```
@@ -22,6 +21,6 @@ docker compose -f test_runner/logical_repl/clickhouse/docker-compose.yml down
export BENCHMARK_CONNSTR=postgres://user:pass@ep-abc-xyz-123.us-east-2.aws.neon.build/neondb
docker compose -f test_runner/logical_repl/debezium/docker-compose.yml up -d
./scripts/pytest -m remote_cluster -k 'test_debezium[release-pg17]'
./scripts/pytest -m remote_cluster -k test_debezium
docker compose -f test_runner/logical_repl/debezium/docker-compose.yml down
```

View File

@@ -1,11 +1,9 @@
services:
clickhouse:
image: clickhouse/clickhouse-server:25.6
image: clickhouse/clickhouse-server
user: "101:101"
container_name: clickhouse
hostname: clickhouse
environment:
- CLICKHOUSE_PASSWORD=${CLICKHOUSE_PASSWORD:-ch_password123}
ports:
- 127.0.0.1:8123:8123
- 127.0.0.1:9000:9000

View File

@@ -1,28 +1,18 @@
services:
zookeeper:
image: quay.io/debezium/zookeeper:3.1.3.Final
ports:
- 127.0.0.1:2181:2181
- 127.0.0.1:2888:2888
- 127.0.0.1:3888:3888
image: quay.io/debezium/zookeeper:2.7
kafka:
image: quay.io/debezium/kafka:3.1.3.Final
depends_on: [zookeeper]
image: quay.io/debezium/kafka:2.7
environment:
ZOOKEEPER_CONNECT: "zookeeper:2181"
KAFKA_LISTENERS: INTERNAL://:9092,EXTERNAL://:29092
KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka:9092,EXTERNAL://localhost:29092
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
KAFKA_BROKER_ID: 1
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
KAFKA_JMX_PORT: 9991
ports:
- 9092:9092
- 29092:29092
- 127.0.0.1:9092:9092
debezium:
image: quay.io/debezium/connect:3.1.3.Final
depends_on: [kafka]
image: quay.io/debezium/connect:2.7
environment:
BOOTSTRAP_SERVERS: kafka:9092
GROUP_ID: 1

View File

@@ -53,13 +53,8 @@ def test_clickhouse(remote_pg: RemotePostgres):
cur.execute("CREATE TABLE table1 (id integer primary key, column1 varchar(10));")
cur.execute("INSERT INTO table1 (id, column1) VALUES (1, 'abc'), (2, 'def');")
conn.commit()
if "CLICKHOUSE_PASSWORD" not in os.environ:
raise RuntimeError("CLICKHOUSE_PASSWORD is not set")
client = clickhouse_connect.get_client(
host=clickhouse_host, password=os.environ["CLICKHOUSE_PASSWORD"]
)
client = clickhouse_connect.get_client(host=clickhouse_host)
client.command("SET allow_experimental_database_materialized_postgresql=1")
client.command("DROP DATABASE IF EXISTS db1_postgres")
client.command(
"CREATE DATABASE db1_postgres ENGINE = "
f"MaterializedPostgreSQL('{conn_options['host']}', "

View File

@@ -17,7 +17,6 @@ from fixtures.utils import wait_until
if TYPE_CHECKING:
from fixtures.neon_fixtures import RemotePostgres
from kafka import KafkaConsumer
class DebeziumAPI:
@@ -102,13 +101,9 @@ def debezium(remote_pg: RemotePostgres):
assert len(dbz.list_connectors()) == 1
from kafka import KafkaConsumer
kafka_host = "kafka" if (os.getenv("CI", "false") == "true") else "127.0.0.1"
kafka_port = 9092 if (os.getenv("CI", "false") == "true") else 29092
log.info("Connecting to Kafka: %s:%s", kafka_host, kafka_port)
consumer = KafkaConsumer(
"dbserver1.inventory.customers",
bootstrap_servers=[f"{kafka_host}:{kafka_port}"],
bootstrap_servers=["kafka:9092"],
auto_offset_reset="earliest",
enable_auto_commit=False,
)
@@ -117,7 +112,7 @@ def debezium(remote_pg: RemotePostgres):
assert resp.status_code == 204
def get_kafka_msg(consumer: KafkaConsumer, ts_ms, before=None, after=None) -> None:
def get_kafka_msg(consumer, ts_ms, before=None, after=None) -> None:
"""
Gets the message from Kafka and checks its validity
Arguments:
@@ -129,7 +124,6 @@ def get_kafka_msg(consumer: KafkaConsumer, ts_ms, before=None, after=None) -> No
after: a dictionary, if not None, the after field from the kafka message must
have the same values for the same keys
"""
log.info("Bootstrap servers: %s", consumer.config["bootstrap_servers"])
msg = consumer.poll()
assert msg, "Empty message"
for val in msg.values():

View File

@@ -2,48 +2,45 @@ from __future__ import annotations
import os
import timeit
import traceback
from concurrent.futures import ThreadPoolExecutor as Exec
from pathlib import Path
from threading import Thread
from time import sleep
from typing import TYPE_CHECKING, cast
from typing import TYPE_CHECKING, Any, cast
import pytest
from fixtures.benchmark_fixture import NeonBenchmarker, PgBenchRunResult
from fixtures.log_helper import log
from fixtures.neon_api import NeonAPI, connstr_to_env
from performance.test_perf_pgbench import utc_now_timestamp
from fixtures.neon_api import NeonAPI, connection_parameters_to_env
if TYPE_CHECKING:
from fixtures.compare_fixtures import NeonCompare
from fixtures.neon_fixtures import Endpoint, PgBin
from fixtures.pg_version import PgVersion
from performance.test_perf_pgbench import utc_now_timestamp
# These tests compare performance for a write-heavy and read-heavy workloads of an ordinary endpoint
# compared to the endpoint which saves its LFC and prewarms using it on startup
# compared to the endpoint which saves its LFC and prewarms using it on startup.
def test_compare_prewarmed_pgbench_perf(neon_compare: NeonCompare):
env = neon_compare.env
env.create_branch("normal")
env.create_branch("prewarmed")
pg_bin = neon_compare.pg_bin
ep_ordinary: Endpoint = neon_compare.endpoint
ep_prewarmed: Endpoint = env.endpoints.create_start("prewarmed")
ep_normal: Endpoint = env.endpoints.create_start("normal")
ep_prewarmed: Endpoint = env.endpoints.create_start("prewarmed", autoprewarm=True)
for ep in [ep_ordinary, ep_prewarmed]:
for ep in [ep_normal, ep_prewarmed]:
connstr: str = ep.connstr()
pg_bin.run(["pgbench", "-i", "-I", "dtGvp", connstr, "-s100"])
ep.safe_psql("CREATE SCHEMA neon; CREATE EXTENSION neon WITH SCHEMA neon")
if ep == ep_prewarmed:
client = ep.http_client()
client.offload_lfc()
ep.stop()
ep.start(autoprewarm=True)
client.prewarm_lfc_wait()
else:
ep.stop()
ep.start()
ep.safe_psql("CREATE EXTENSION neon")
client = ep.http_client()
client.offload_lfc()
ep.stop()
ep.start()
client.prewarm_lfc_wait()
run_start_timestamp = utc_now_timestamp()
t0 = timeit.default_timer()
@@ -62,36 +59,6 @@ def test_compare_prewarmed_pgbench_perf(neon_compare: NeonCompare):
neon_compare.zenbenchmark.record_pg_bench_result(name, res)
def test_compare_prewarmed_read_perf(neon_compare: NeonCompare):
env = neon_compare.env
env.create_branch("prewarmed")
ep_ordinary: Endpoint = neon_compare.endpoint
ep_prewarmed: Endpoint = env.endpoints.create_start("prewarmed")
sql = [
"CREATE SCHEMA neon",
"CREATE EXTENSION neon WITH SCHEMA neon",
"CREATE TABLE foo(key serial primary key, t text default 'foooooooooooooooooooooooooooooooooooooooooooooooooooo')",
"INSERT INTO foo SELECT FROM generate_series(1,1000000)",
]
sql_check = "SELECT count(*) from foo"
ep_ordinary.safe_psql_many(sql)
ep_ordinary.stop()
ep_ordinary.start()
with neon_compare.record_duration("ordinary_run_duration"):
ep_ordinary.safe_psql(sql_check)
ep_prewarmed.safe_psql_many(sql)
client = ep_prewarmed.http_client()
client.offload_lfc()
ep_prewarmed.stop()
ep_prewarmed.start(autoprewarm=True)
client.prewarm_lfc_wait()
with neon_compare.record_duration("prewarmed_run_duration"):
ep_prewarmed.safe_psql(sql_check)
@pytest.mark.remote_cluster
@pytest.mark.timeout(2 * 60 * 60)
def test_compare_prewarmed_pgbench_perf_benchmark(
@@ -100,66 +67,67 @@ def test_compare_prewarmed_pgbench_perf_benchmark(
pg_version: PgVersion,
zenbenchmark: NeonBenchmarker,
):
"""
Prewarm API is not public, so this test relies on a pre-created project
with pgbench size of 3424, pgbench -i -IdtGvp -s3424. Sleeping and
offloading constants are hardcoded to this size as well
"""
project_id = os.getenv("PROJECT_ID")
assert project_id
name = f"Test prewarmed pgbench performance, GITHUB_RUN_ID={os.getenv('GITHUB_RUN_ID')}"
project = neon_api.create_project(pg_version, name)
project_id = project["project"]["id"]
neon_api.wait_for_operation_to_finish(project_id)
err = False
try:
benchmark_impl(pg_bin, neon_api, project, zenbenchmark)
except Exception as e:
err = True
log.error(f"Caught exception: {e}")
log.error(traceback.format_exc())
finally:
assert not err
neon_api.delete_project(project_id)
ordinary_branch_id = ""
prewarmed_branch_id = ""
for branch in neon_api.get_branches(project_id)["branches"]:
if branch["name"] == "ordinary":
ordinary_branch_id = branch["id"]
if branch["name"] == "prewarmed":
prewarmed_branch_id = branch["id"]
assert len(ordinary_branch_id) > 0
assert len(prewarmed_branch_id) > 0
ep_ordinary = None
ep_prewarmed = None
for ep in neon_api.get_endpoints(project_id)["endpoints"]:
if ep["branch_id"] == ordinary_branch_id:
ep_ordinary = ep
if ep["branch_id"] == prewarmed_branch_id:
ep_prewarmed = ep
assert ep_ordinary
assert ep_prewarmed
ordinary_id = ep_ordinary["id"]
prewarmed_id = ep_prewarmed["id"]
def benchmark_impl(
pg_bin: PgBin, neon_api: NeonAPI, project: dict[str, Any], zenbenchmark: NeonBenchmarker
):
pgbench_size = int(os.getenv("PGBENCH_SIZE") or "3424") # 50GB
offload_secs = 20
test_duration_min = 3
test_duration_min = 5
pgbench_duration = f"-T{test_duration_min * 60}"
pgbench_init_cmd = ["pgbench", "-P10", "-n", "-c10", pgbench_duration, "-Mprepared"]
pgbench_perf_cmd = pgbench_init_cmd + ["-S"]
# prewarm API is not publicly exposed. In order to test performance of a
# fully prewarmed endpoint, wait after it restarts.
# The number here is empirical, based on manual runs on staging
prewarmed_sleep_secs = 180
ordinary_uri = neon_api.get_connection_uri(project_id, ordinary_branch_id, ordinary_id)["uri"]
prewarmed_uri = neon_api.get_connection_uri(project_id, prewarmed_branch_id, prewarmed_id)[
"uri"
]
branch_id = project["branch"]["id"]
project_id = project["project"]["id"]
normal_env = connection_parameters_to_env(
project["connection_uris"][0]["connection_parameters"]
)
normal_id = project["endpoints"][0]["id"]
prewarmed_branch_id = neon_api.create_branch(
project_id, "prewarmed", parent_id=branch_id, add_endpoint=False
)["branch"]["id"]
neon_api.wait_for_operation_to_finish(project_id)
ep_prewarmed = neon_api.create_endpoint(
project_id,
prewarmed_branch_id,
endpoint_type="read_write",
settings={"autoprewarm": True, "offload_lfc_interval_seconds": offload_secs},
)
neon_api.wait_for_operation_to_finish(project_id)
prewarmed_env = normal_env.copy()
prewarmed_env["PGHOST"] = ep_prewarmed["endpoint"]["host"]
prewarmed_id = ep_prewarmed["endpoint"]["id"]
def bench(endpoint_name, endpoint_id, env):
log.info(f"Running pgbench for {pgbench_duration}s to warm up the cache")
pg_bin.run_capture(pgbench_init_cmd, env) # capture useful for debugging
pg_bin.run(["pgbench", "-i", "-I", "dtGvp", f"-s{pgbench_size}"], env)
sleep(offload_secs * 2) # ensure LFC is offloaded after pgbench finishes
neon_api.restart_endpoint(project_id, endpoint_id)
sleep(prewarmed_sleep_secs)
log.info(f"Initialized {endpoint_name}")
if endpoint_name == "prewarmed":
log.info(f"sleeping {offload_secs * 2} to ensure LFC is offloaded")
sleep(offload_secs * 2)
neon_api.restart_endpoint(project_id, endpoint_id)
log.info(f"sleeping {prewarmed_sleep_secs} to ensure LFC is prewarmed")
sleep(prewarmed_sleep_secs)
else:
neon_api.restart_endpoint(project_id, endpoint_id)
log.info(f"Starting benchmark for {endpoint_name}")
run_start_timestamp = utc_now_timestamp()
t0 = timeit.default_timer()
out = pg_bin.run_capture(pgbench_perf_cmd, env)
out = pg_bin.run_capture(["pgbench", "-c10", pgbench_duration, "-Mprepared"], env)
run_duration = timeit.default_timer() - t0
run_end_timestamp = utc_now_timestamp()
@@ -172,9 +140,29 @@ def test_compare_prewarmed_pgbench_perf_benchmark(
)
zenbenchmark.record_pg_bench_result(endpoint_name, res)
prewarmed_args = ("prewarmed", prewarmed_id, connstr_to_env(prewarmed_uri))
prewarmed_thread = Thread(target=bench, args=prewarmed_args)
prewarmed_thread.start()
with Exec(max_workers=2) as exe:
exe.submit(bench, "normal", normal_id, normal_env)
exe.submit(bench, "prewarmed", prewarmed_id, prewarmed_env)
bench("ordinary", ordinary_id, connstr_to_env(ordinary_uri))
prewarmed_thread.join()
def test_compare_prewarmed_read_perf(neon_compare: NeonCompare):
env = neon_compare.env
env.create_branch("normal")
env.create_branch("prewarmed")
ep_normal: Endpoint = env.endpoints.create_start("normal")
ep_prewarmed: Endpoint = env.endpoints.create_start("prewarmed", autoprewarm=True)
sql = [
"CREATE EXTENSION neon",
"CREATE TABLE foo(key serial primary key, t text default 'foooooooooooooooooooooooooooooooooooooooooooooooooooo')",
"INSERT INTO foo SELECT FROM generate_series(1,1000000)",
]
for ep in [ep_normal, ep_prewarmed]:
ep.safe_psql_many(sql)
client = ep.http_client()
client.offload_lfc()
ep.stop()
ep.start()
client.prewarm_lfc_wait()
with neon_compare.record_duration(f"{ep.branch_name}_run_duration"):
ep.safe_psql("SELECT count(*) from foo")

View File

@@ -3309,7 +3309,6 @@ def test_ps_unavailable_after_delete(
ps.allowed_errors.append(".*request was dropped before completing.*")
env.storage_controller.node_delete(ps.id, force=True)
wait_until(lambda: assert_nodes_count(2))
env.storage_controller.reconcile_until_idle()
elif deletion_api == DeletionAPIKind.OLD:
env.storage_controller.node_delete_old(ps.id)
assert_nodes_count(2)
@@ -4960,49 +4959,3 @@ def test_storage_controller_forward_404(neon_env_builder: NeonEnvBuilder):
env.storage_controller.configure_failpoints(
("reconciler-live-migrate-post-generation-inc", "off")
)
def test_re_attach_with_stuck_secondary(neon_env_builder: NeonEnvBuilder):
"""
This test assumes that the secondary location cannot be configured for whatever reason.
It then attempts to detach and and attach the tenant back again and, finally, checks
for observed state consistency by attempting to create a timeline.
See LKB-204 for more details.
"""
neon_env_builder.num_pageservers = 2
env = neon_env_builder.init_configs()
env.start()
env.storage_controller.allowed_errors.append(".*failpoint.*")
tenant_id, _ = env.create_tenant(shard_count=1, placement_policy='{"Attached":1}')
env.storage_controller.reconcile_until_idle()
locations = env.storage_controller.locate(tenant_id)
assert len(locations) == 1
primary: int = locations[0]["node_id"]
not_primary = [ps.id for ps in env.pageservers if ps.id != primary]
assert len(not_primary) == 1
secondary = not_primary[0]
env.get_pageserver(secondary).http_client().configure_failpoints(
("put-location-conf-handler", "return(1)")
)
env.storage_controller.tenant_policy_update(tenant_id, {"placement": "Detached"})
with pytest.raises(Exception, match="failpoint"):
env.storage_controller.reconcile_all()
env.storage_controller.tenant_policy_update(tenant_id, {"placement": {"Attached": 1}})
with pytest.raises(Exception, match="failpoint"):
env.storage_controller.reconcile_all()
env.storage_controller.pageserver_api().timeline_create(
pg_version=PgVersion.NOT_SET, tenant_id=tenant_id, new_timeline_id=TimelineId.generate()
)

View File

@@ -298,26 +298,15 @@ def test_pageserver_metrics_removed_after_detach(neon_env_builder: NeonEnvBuilde
assert post_detach_samples == set()
@pytest.mark.parametrize("compaction", ["compaction_enabled", "compaction_disabled"])
def test_pageserver_metrics_removed_after_offload(
neon_env_builder: NeonEnvBuilder, compaction: str
):
def test_pageserver_metrics_removed_after_offload(neon_env_builder: NeonEnvBuilder):
"""Tests that when a timeline is offloaded, the tenant specific metrics are not left behind"""
neon_env_builder.enable_pageserver_remote_storage(RemoteStorageKind.MOCK_S3)
neon_env_builder.num_safekeepers = 3
env = neon_env_builder.init_start()
tenant_1, _ = env.create_tenant(
conf={
# disable background compaction and GC so that we don't have leftover tasks
# after offloading.
"gc_period": "0s",
"compaction_period": "0s",
}
if compaction == "compaction_disabled"
else None
)
tenant_1, _ = env.create_tenant()
timeline_1 = env.create_timeline("test_metrics_removed_after_offload_1", tenant_id=tenant_1)
timeline_2 = env.create_timeline("test_metrics_removed_after_offload_2", tenant_id=tenant_1)
@@ -362,23 +351,6 @@ def test_pageserver_metrics_removed_after_offload(
state=TimelineArchivalState.ARCHIVED,
)
env.pageserver.http_client().timeline_offload(tenant_1, timeline)
# We need to wait until all background jobs are finished before we can check the metrics.
# There're many of them: compaction, GC, etc.
wait_until(
lambda: all(
sample.value == 0
for sample in env.pageserver.http_client()
.get_metrics()
.query_all("pageserver_background_loop_semaphore_waiting_tasks")
)
and all(
sample.value == 0
for sample in env.pageserver.http_client()
.get_metrics()
.query_all("pageserver_background_loop_semaphore_running_tasks")
)
)
post_offload_samples = set(
[x.name for x in get_ps_metric_samples_for_timeline(tenant_1, timeline)]
)

View File

@@ -28,8 +28,6 @@ chrono = { version = "0.4", default-features = false, features = ["clock", "serd
clap = { version = "4", features = ["derive", "env", "string"] }
clap_builder = { version = "4", default-features = false, features = ["color", "env", "help", "std", "string", "suggestions", "usage"] }
const-oid = { version = "0.9", default-features = false, features = ["db", "std"] }
crossbeam-epoch = { version = "0.9" }
crossbeam-utils = { version = "0.8" }
crypto-bigint = { version = "0.5", features = ["generic-array", "zeroize"] }
der = { version = "0.7", default-features = false, features = ["derive", "flagset", "oid", "pem", "std"] }
deranged = { version = "0.3", default-features = false, features = ["powerfmt", "serde", "std"] }
@@ -75,7 +73,6 @@ num-traits = { version = "0.2", features = ["i128", "libm"] }
once_cell = { version = "1" }
p256 = { version = "0.13", features = ["jwk"] }
parquet = { version = "53", default-features = false, features = ["zstd"] }
portable-atomic = { version = "1", features = ["require-cas"] }
prost = { version = "0.13", features = ["no-recursion-limit", "prost-derive"] }
rand = { version = "0.9" }
regex = { version = "1" }