Compare commits

...

199 Commits

Author SHA1 Message Date
Shany Pozin
7d6ec16166 Merge pull request #5296 from neondatabase/releases/2023-09-13
Release 2023-09-13
2023-09-13 13:49:14 +03:00
Konstantin Knizhnik
1697e7b319 Fix lfc_ensure_function which now disables LFC (#5294)
## Problem

There was a bug in lfc_ensure_opened which actually disables LFC

## Summary of changes

Return true ifLFC file is normally opened

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-09-13 08:56:03 +03:00
bojanserafimov
8556d94740 proxy http: reproduce issue with transactions in pool (#5293)
xfail test reproducing issue https://github.com/neondatabase/neon/issues/4698
2023-09-12 17:13:25 -04:00
MMeent
3b6b847d76 Fixes for Pg16: (#5292)
- pagestore_smgr.c had unnecessary WALSync() (see #5287 )
- Compute node dockerfile didn't build the neon_rmgr extension
- Add PostgreSQL 16 image to docker-compose tests
- Fix issue with high CPU usage in Safekeeper due to a bug in WALSender

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-09-12 22:02:03 +03:00
Alexander Bayandin
2641ff3d1a Use CI_ACCESS_TOKEN to create release PR (#5286)
## Problem

If @github-actions creates release PR, the CI pipeline is not triggered
(but we have `release-notify.yml` workflow that we expect to run on this
event).
I suspect this happened because @github-actions is not a repository
member.

Ref
https://github.com/neondatabase/neon/pull/5283#issuecomment-1715209291

## Summary of changes
- Use `CI_ACCESS_TOKEN` to create a PR
- Use `gh` instead of `thomaseizinger/create-pull-request`
- Restrict permissions for GITHUB_TOKEN to `contents: write` only
(required for `git push`)
2023-09-12 20:01:21 +01:00
Alexander Bayandin
e1661c3c3c approved-for-ci-run.yml: fix ci-run/pr-* branch deletion (#5278)
## Problem

`ci-run/pr-*` branches (and attached PRs) should be deleted
automatically when their parent PRs get closed.
But there are not

## Summary of changes
- Fix if-condition
2023-09-12 19:29:26 +03:00
Alexander Bayandin
9c3f38e10f Document how to run CI for external contributors (#5279)
## Problem
We don't have this instruction written anywhere but in internal Slack

## Summary of changes
- Add `How to run a CI pipeline on Pull Requests from external
contributors` section to `CONTRIBUTING.md`

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-09-12 16:53:13 +01:00
Christian Schwarz
ab1f37e908 revert recent VirtualFile asyncification changes (#5291)
Motivation
==========

We observed two "indigestion" events on staging, each shortly after
restarting `pageserver-0.eu-west-1.aws.neon.build`. It has ~8k tenants.

The indigestion manifests as `Timeline::get` calls failing with
`exceeded evict iter limit` .
The error is from `page_cache.rs`; it was unable to find a free page and
hence failed with the error.

The indigestion events started occuring after we started deploying
builds that contained the following commits:

```
[~/src/neon]: git log --oneline c0ed362790caa368aa65ba57d352a2f1562fd6bf..15eaf78083ecff62b7669
091da1a1c8b4f60ebf8
15eaf7808 Disallow block_in_place and Handle::block_on (#5101)
a18d6d9ae Make File opening in VirtualFile async-compatible (#5280)
76cc87398 Use tokio locks in VirtualFile and turn with_file into macro (#5247)
```

The second and third commit are interesting.
They add .await points to the VirtualFile code.

Background
==========

On the read path, which is the dominant user of page cache & VirtualFile
during pageserver restart, `Timeline::get` `page_cache` and VirtualFile
interact as follows:

1. Timeline::get tries to read from a layer
2. This read goes through the page cache.
3. If we have a page miss (which is known to be common after restart),
page_cache uses `find_victim` to find an empty slot, and once it has
found a slot, it gives exclusive ownership of it to the caller through a
`PageWriteGuard`.
4. The caller is supposed to fill the write guard with data from the
underlying backing store, i.e., the layer `VirtualFile`.
5. So, we call into `VirtualFile::read_at`` to fill the write guard.

The `find_victim` method finds an empty slot using a basic
implementation of clock page replacement algorithm.
Slots that are currently in use (`PageReadGuard` / `PageWriteGuard`)
cannot become victims.
If there have been too many iterations, `find_victim` gives up with
error `exceeded evict iter limit`.

Root Cause For Indigestion
==========================

The second and third commit quoted in the "Motivation" section
introduced `.await` points in the VirtualFile code.
These enable tokio to preempt us and schedule another future __while__
we hold the `PageWriteGuard` and are calling `VirtualFile::read_at`.
This was not possible before these commits, because there simply were no
await points that weren't Poll::Ready immediately.
With the offending commits, there is now actual usage of
`tokio::sync::RwLock` to protect the VirtualFile file descriptor cache.
And we __know__ from other experiments that, during the post-restart
"rush", the VirtualFile fd cache __is__ too small, i.e., all slots are
taken by _ongoing_ VirtualFile operations and cannot be victims.
So, assume that VirtualFile's `find_victim_slot`'s
`RwLock::write().await` calls _will_ yield control to the executor.

The above can lead to the pathological situation if we have N runnable
tokio tasks, each wanting to do `Timeline::get`, but only M slots, N >>
M.
Suppose M of the N tasks win a PageWriteGuard and get preempted at some
.await point inside `VirtualFile::read_at`.
Now suppose tokio schedules the remaining N-M tasks for fairness, then
schedules the first M tasks again.
Each of the N-M tasks will run `find_victim()` until it hits the
`exceeded evict iter limit`.
Why? Because the first M tasks took all the slots and are still holding
them tight through their `PageWriteGuard`.

The result is massive wastage of CPU time in `find_victim()`.
The effort to find a page is futile, but each of the N-M tasks still
attempts it.

This delays the time when tokio gets around to schedule the first M
tasks again.
Eventually, tokio will schedule them, they will make progress, fill the
`PageWriteGuard`, release it.
But in the meantime, the N-M tasks have already bailed with error
`exceeded evict iter limit`.

Eventually, higher level mechanisms will retry for the N-M tasks, and
this time, there won't be as many concurrent tasks wanting to do
`Timeline::get`.
So, it will shake out.

But, it's a massive indigestion until then.

This PR
=======

This PR reverts the offending commits until we find a proper fix.

```
    Revert "Use tokio locks in VirtualFile and turn with_file into macro (#5247)"
    
    This reverts commit 76cc87398c.


    Revert "Make File opening in VirtualFile async-compatible (#5280)"
    
    This reverts commit a18d6d9ae3.
```
2023-09-12 17:38:31 +02:00
MMeent
83e7e5dbbd Feat/postgres 16 (#4761)
This adds PostgreSQL 16 as a vendored postgresql version, and adapts the
code to support this version. 
The important changes to PostgreSQL 16 compared to the PostgreSQL 15
changeset include the addition of a neon_rmgr instead of altering Postgres's
original WAL format.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-09-12 15:11:32 +02:00
Shany Pozin
0e6fdc8a58 Merge pull request #5283 from neondatabase/releases/2023-09-12
Release 2023-09-12
2023-09-12 14:56:47 +03:00
Christian Schwarz
521438a5c6 fix deadlock around TENANTS (#5285)
The sequence that can lead to a deadlock:

1. DELETE request gets all the way to `tenant.shutdown(progress,
false).await.is_err() ` , while holding TENANTS.read()
2. POST request for tenant creation comes in, calls `tenant_map_insert`,
it does `let mut guard = TENANTS.write().await;`
3. Something that `tenant.shutdown()` needs to wait for needs a
`TENANTS.read().await`.
The only case identified in exhaustive manual scanning of the code base
is this one:
Imitate size access does `get_tenant().await`, which does
`TENANTS.read().await` under the hood.

In the above case (1) waits for (3), (3)'s read-lock request is queued
behind (2)'s write-lock, and (2) waits for (1).
Deadlock.

I made a reproducer/proof-that-above-hypothesis-holds in
https://github.com/neondatabase/neon/pull/5281 , but, it's not ready for
merge yet and we want the fix _now_.

fixes https://github.com/neondatabase/neon/issues/5284
2023-09-12 14:13:13 +03:00
Christian Schwarz
5be8d38a63 fix deadlock around TENANTS (#5285)
The sequence that can lead to a deadlock:

1. DELETE request gets all the way to `tenant.shutdown(progress,
false).await.is_err() ` , while holding TENANTS.read()
2. POST request for tenant creation comes in, calls `tenant_map_insert`,
it does `let mut guard = TENANTS.write().await;`
3. Something that `tenant.shutdown()` needs to wait for needs a
`TENANTS.read().await`.
The only case identified in exhaustive manual scanning of the code base
is this one:
Imitate size access does `get_tenant().await`, which does
`TENANTS.read().await` under the hood.

In the above case (1) waits for (3), (3)'s read-lock request is queued
behind (2)'s write-lock, and (2) waits for (1).
Deadlock.

I made a reproducer/proof-that-above-hypothesis-holds in
https://github.com/neondatabase/neon/pull/5281 , but, it's not ready for
merge yet and we want the fix _now_.

fixes https://github.com/neondatabase/neon/issues/5284
2023-09-12 11:23:46 +02:00
John Spray
36c261851f s3_scrubber: remove atty dependency (#5171)
## Problem

- https://github.com/neondatabase/neon/security/dependabot/28

## Summary of changes

Remove atty, and remove the `with_ansi` arg to scrubber's stdout logger.
2023-09-12 10:11:41 +01:00
Arpad Müller
15eaf78083 Disallow block_in_place and Handle::block_on (#5101)
## Problem

`block_in_place` is a quite expensive operation, and if it is used, we
should explicitly have to opt into it by allowing the
`clippy::disallowed_methods` lint.

For more, see
https://github.com/neondatabase/neon/pull/5023#discussion_r1304194495.

Similar arguments exist for `Handle::block_on`, but we don't do this yet
as there is still usages.

## Summary of changes

Adds a clippy.toml file, configuring the [`disallowed_methods`
lint](https://rust-lang.github.io/rust-clippy/master/#/disallowed_method).
2023-09-12 00:11:16 +00:00
Arpad Müller
a18d6d9ae3 Make File opening in VirtualFile async-compatible (#5280)
## Problem

Previously, we were using `observe_closure_duration` in `VirtualFile`
file opening code, but this doesn't support async open operations, which
we want to use as part of #4743.

## Summary of changes

* Move the duration measurement from the `with_file` macro into a
`observe_duration` macro.
* Some smaller drive-by fixes to replace the old strings with the new
variant names introduced by #5273

Part of #4743, follow-up of #5247.
2023-09-11 18:41:08 +02:00
Arpad Müller
76cc87398c Use tokio locks in VirtualFile and turn with_file into macro (#5247)
## Problem

For #4743, we want to convert everything up to the actual I/O operations
of `VirtualFile` to `async fn`.

## Summary of changes

This PR is the last change in a series of changes to `VirtualFile`:
#5189, #5190, #5195, #5203, and #5224.

It does the last preparations before the I/O operations are actually
made async. We are doing the following things:

* First, we change the locks for the file descriptor cache to tokio's
locks that support Send. This is important when one wants to hold locks
across await points (which we want to do), otherwise the Future won't be
Send. Also, one shouldn't generally block in async code as executors
don't like that.
* Due to the lock change, we now take an approach for the `VirtualFile`
destructors similar to the one proposed by #5122 for the page cache, to
use `try_write`. Similarly to the situation in the linked PR, one can
make an argument that if we are in the destructor and the slot has not
been reused yet, we are the only user accessing the slot due to owning
the lock mutably. It is still possible that we are not obtaining the
lock, but the only cause for that is the clock algorithm touching the
slot, which should be quite an unlikely occurence. For the instance of
`try_write` failing, we spawn an async task to destroy the lock. As just
argued however, most of the time the code path where we spawn the task
should not be visited.
* Lastly, we split `with_file` into a macro part, and a function part
that contains most of the logic. The function part returns a lock
object, that the macro uses. The macro exists to perform the operation
in a more compact fashion, saving code from putting the lock into a
variable and then doing the operation while measuring the time to run
it. We take the locks approach because Rust has no support for async
closures. One can make normal closures return a future, but that
approach gets into lifetime issues the moment you want to pass data to
these closures via parameters that has a lifetime (captures work). For
details, see
[this](https://smallcultfollowing.com/babysteps/blog/2023/03/29/thoughts-on-async-closures/)
and
[this](https://users.rust-lang.org/t/function-that-takes-an-async-closure/61663)
link. In #5224, we ran into a similar problem with the `test_files`
function, and we ended up passing the path and the `OpenOptions`
by-value instead of by-ref, at the expense of a few extra copies. This
can be done as the data is cheaply copyable, and we are in test code.
But here, we are not, and while `File::try_clone` exists, it [issues
system calls
internally](1e746d7741/library/std/src/os/fd/owned.rs (L94-L111)).
Also, it would allocate an entirely new file descriptor, something that
the fd cache was built to prevent.
* We change the `STORAGE_IO_TIME` metrics to support async.

Part of #4743.
2023-09-11 17:35:05 +02:00
bojanserafimov
c0ed362790 Measure pageserver wal recovery time and fix flush() method (#5240) 2023-09-11 09:46:06 -04:00
duguorong009
d7fa2dba2d fix(pageserver): update the STORAGE_IO_TIME metrics to avoid expensive operations (#5273)
Introduce the `StorageIoOperation` enum, `StorageIoTime` struct, and
`STORAGE_IO_TIME_METRIC` static which provides lockless access to
histograms consumed by `VirtualFile`.

Closes #5131

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-11 14:58:15 +03:00
Joonas Koivunen
a55a78a453 Misc test flakyness fixes (#5233)
Assorted flakyness fixes from #5198, might not be flaky on `main`.

Migrate some tests using neon_simple_env to just neon_env_builder and
using initial_tenant to make flakyness understanding easier. (Did not
understand the flakyness of
`test_timeline_create_break_after_uninit_mark`.)

`test_download_remote_layers_api` is flaky because we have no atomic
"wait for WAL, checkpoint, wait for upload and do not receive any more
WAL".

`test_tenant_size` fixes are just boilerplate which should had always
existed; we should wait for the tenant to be active. similarly for
`test_timeline_delete`.

`test_timeline_size_post_checkpoint` fails often for me with reading
zero from metrics. Give it a few attempts.
2023-09-11 11:42:49 +03:00
Rahul Modpur
999fe668e7 Ack tenant detach before local files are deleted (#5211)
## Problem

Detaching a tenant can involve many thousands of local filesystem
metadata writes, but the control plane would benefit from us not
blocking detach/delete responses on these.

## Summary of changes

After rename of local tenant directory ack tenant detach and delete
tenant directory in background

#5183 

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
2023-09-10 22:59:51 +03:00
Alexander Bayandin
d33e1b1b24 approved-for-ci-run.yml: use token to checkout the repo (#5266)
## Problem

Another thing I overlooked regarding'approved-for-ci-run`:
- When we create a PR, the action is associated with @vipvap and this
triggers the pipeline — this is good.
- When we update the PR by force-pushing to the branch, the action is
associated with @github-actions, which doesn't trigger a pipeline — this
is bad.

Initially spotted in #5239 / #5211
([link](https://github.com/neondatabase/neon/actions/runs/6122249456/job/16633919558?pr=5239))
— `check-permissions` should not fail.


## Summary of changes
- Use `CI_ACCESS_TOKEN` to check out the repo (I expect this token will
be reused in the following `git push`)
2023-09-10 20:12:38 +01:00
Alexander Bayandin
15fd188fd6 Fix GitHub Autocomment for ci-run/prs (#5268)
## Problem

When PR `ci-run/pr-*` is created the GitHub Autocomment with test
results are supposed to be posted to the original PR, currently, this
doesn't work.

I created this PR from a personal fork to debug and fix the issue. 

## Summary of changes
- `scripts/comment-test-report.js`: use `pull_request.head` instead of
`pull_request.base` 🤦
2023-09-10 20:06:10 +01:00
Alexander Bayandin
34e39645c4 GitHub Workflows: add actionlint (#5265)
## Problem

Add a CI pipeline that checks GitHub Workflows with
https://github.com/rhysd/actionlint (it uses `shellcheck` for shell
scripts in steps)

To run it locally: `SHELLCHECK_OPTS=--exclude=SC2046,SC2086 actionlint`

## Summary of changes
- Add `.github/workflows/actionlint.yml`
- Fix actionlint warnings
2023-09-10 20:05:07 +01:00
Em Sharnoff
1cac923af8 vm-monitor: Rate-limit upscale requests (#5263)
Some VMs, when already scaled up as much as possible, end up spamming
the autoscaler-agent with upscale requests that will never be fulfilled.
If postgres is using memory greater than the cgroup's memory.high, it
can emit new memory.high events 1000 times per second, which... just
means unnecessary load on the rest of the system.

This changes the vm-monitor so that we skip sending upscale requests if
we already sent one within the last second, to avoid spamming the
autoscaler-agent. This matches previous behavior that the vm-informant
hand.
2023-09-10 20:33:53 +03:00
Em Sharnoff
853552dcb4 vm-monitor: Don't include Args in top-level span (#5264)
It makes the logs too verbose.

ref https://neondb.slack.com/archives/C03F5SM1N02/p1694281232874719?thread_ts=1694272777.207109&cid=C03F5SM1N02
2023-09-10 20:15:53 +03:00
Alexander Bayandin
1ea93af56c Create GitHub release from release tag (#5246)
## Problem

This PR creates a GitHub release from a release tag with an
autogenerated changelog: https://github.com/neondatabase/neon/releases

## Summary of changes
- Call GitHub API to create a release
2023-09-09 22:02:28 +01:00
Konstantin Knizhnik
f64b338ce3 Ingore DISK_FULL error when performing availability check for client (#5010)
See #5001

No space is what's expected if we're at size limit.
Of course if SK incorrectly returned "no space", the availability check
wouldn't fire.
But  users would notice such a bug quite soon anyways.
So  ignoring "no space" is the right trade-off.


## Problem

## Summary of changes

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-09 21:51:04 +03:00
Konstantin Knizhnik
ba06ea26bb Fix issues with reanabling LFC (#5209)
refer #5208

## Problem

See
https://neondb.slack.com/archives/C03H1K0PGKH/p1693938336062439?thread_ts=1693928260.704799&cid=C03H1K0PGKH

#5208 disable LFC forever in case of error. It is not good because the
problem causing this error (for example ENOSPC) can be resolved anti
will be nice to reenable it after fixing.

Also #5208 disables LFC locally in one backend. But other backends may
still see corrupted data.
It should not cause problems right now with "permission denied" error
because there should be no backend which is able to normally open LFC.
But in case of out-of-disk-space error, other backend can read corrupted
data.

## Summary of changes

1. Cleanup hash table after error to prevent access to stale or
corrupted data
2. Perform disk write under exclusive lock (hoping it will not affect
performance because usually write just copy data from user to system
space)
3. Use generations to prevent access to stale data in lfc_read

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-09-09 17:51:16 +03:00
Joonas Koivunen
6f28da1737 fix: LocalFs root in test_compatibility is PosixPath('...') (#5261)
I forgot a `str(...)` conversion in #5243. This lead to log lines such
as:

```
Using fs root 'PosixPath('/tmp/test_output/test_backward_compatibility[debug-pg14]/compatibility_snapshot/repo/local_fs_remote_storage/pageserver')' as a remote storage
```

This surprisingly works, creating hierarchy of under current working
directory (`repo_dir` for tests):
- `PosixPath('`
  - `tmp` .. up until .. `local_fs_remote_storage`
    - `pageserver')`

It should not work but right now test_compatibility.py tests finds local
metadata and layers, which end up used. After #5172 when remote storage
is the source of truth it will no longer work.
2023-09-08 20:27:00 +03:00
Heikki Linnakangas
60050212e1 Update rdkit to version 2023_03_03. (#5260)
It includes PostgreSQL 16 support.
2023-09-08 19:40:29 +03:00
Joonas Koivunen
66633ef2a9 rust-toolchain: use 1.72.0, same as CI (#5256)
Switches everyone without an `rustup override` to 1.72.0.

Code changes required already done in #5255.
Depends on https://github.com/neondatabase/build/pull/65.
2023-09-08 19:36:02 +03:00
Alexander Bayandin
028fbae161 Miscellaneous fixes for tests-related things (#5259)
## Problem

A bunch of fixes for different test-related things 

## Summary of changes
- Fix test_runner/pg_clients (`subprocess_capture` return value has
changed)
- Do not run create-test-report if check-permissions failed for not
cancelled jobs
- Fix Code Coverage comment layout after flaky tests. Add another
healing "\n"
- test_compatibility: add an instruction for local run


Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-08 16:28:09 +01:00
John Spray
7b6337db58 tests: enable multiple pageservers in neon_local and neon_fixture (#5231)
## Problem

Currently our testing environment only supports running a single
pageserver at a time. This is insufficient for testing failover and
migrations.
- Dependency of writing tests for #5207 

## Summary of changes

- `neon_local` and `neon_fixture` now handle multiple pageservers
- This is a breaking change to the `.neon/config` format: any local
environments will need recreating
- Existing tests continue to work unchanged:
  - The default number of pageservers is 1
- `NeonEnv.pageserver` is now a helper property that retrieves the first
pageserver if there is only one, else throws.
- Pageserver data directories are now at `.neon/pageserver_{n}` where n
is 1,2,3...
- Compatibility tests get some special casing to migrate neon_local
configs: these are not meant to be backward/forward compatible, but they
were treated that way by the test.
2023-09-08 16:19:57 +01:00
Konstantin Knizhnik
499d0707d2 Perform throttling for concurrent build index which is done outside transaction (#5048)
See 
https://neondb.slack.com/archives/C03H1K0PGKH/p1692550646191429

## Problem

Build index concurrently is writing WAL outside transaction.
`backpressure_throttling_impl` doesn't perform throttling for read-only
transactions (not assigned XID).
It cause huge write lag which can cause large delay of accessing the
table.

## Summary of changes

Looks at `PROC_IN_SAFE_IC` in process state set during concurrent index
build.
 
## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-09-08 18:05:08 +03:00
Joonas Koivunen
720d59737a rust-1.72.0 changes (#5255)
Prepare to upgrade rust version to latest stable.

- `rustfmt` has learned to format `let irrefutable = $expr else { ...
};` blocks
- There's a new warning about virtual (workspace) crate resolver, picked
the latest resolver as I suspect everyone would expect it to be the
latest; should not matter anyways
- Some new clippies, which seem alright
2023-09-08 16:28:41 +03:00
Joonas Koivunen
ff87fc569d test: Remote storage refactorings (#5243)
Remote storage cleanup split from #5198:
- pageserver, extensions, and safekeepers now have their separate remote
storage
- RemoteStorageKind has the configuration code
- S3Storage has the cleanup code
- with MOCK_S3, pageserver, extensions, safekeepers use different
buckets
- with LOCAL_FS, `repo_dir / "local_fs_remote_storage" / $user` is used
as path, where $user is `pageserver`, `safekeeper`
- no more `NeonEnvBuilder.enable_xxx_remote_storage` but one
`enable_{pageserver,extensions,safekeeper}_remote_storage`

Should not have any real changes. These will allow us to default to
`LOCAL_FS` for pageserver on the next PR, remove
`RemoteStorageKind.NOOP`, work towards #5172.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-09-08 13:54:23 +03:00
Heikki Linnakangas
cdc65c1857 Update pg_cron to version 1.6.0 (#5252)
This includes PostgreSQL 16 support. There are no catalog changes, so
this is a drop-in replacement, no need to run "ALTER EXTENSION UPDATE".
2023-09-08 12:42:46 +03:00
Heikki Linnakangas
dac995e7e9 Update plpgsql_check extension to version v2.4.0 (#5249)
This brings v16 support.
2023-09-08 10:46:02 +03:00
Alexander Bayandin
b80740bf9f test_startup: increase timeout (#5238)
## Problem

`test_runner/performance/test_startup.py::test_startup` started to fail
more frequently because of the timeout.
Let's increase the timeout to see the failures on the perf dashboard.

## Summary of changes
- Increase timeout for`test_startup` from 600 to 900 seconds
2023-09-08 01:57:38 +01:00
Heikki Linnakangas
57c1ea49b3 Update hypopg extension to version 1.4.0 (#5245)
The v1.4.0 includes changes to make it compile with PostgreSQL 16. The
commit log doesn't call it out explicitly, but I tested it manually.

v1.4.0 includes some new functions, but I tested manually that the the
v1.3.1 functionality works with the v1.4.0 version of the library. That
means that this doesn't break existing installations. Users can do
"ALTER EXTENSION hypopg UPDATE" if they want to use the new v1.4.0
functionality, but they don't have to.
2023-09-08 03:30:11 +03:00
Heikki Linnakangas
6c31a2d342 Upgrade prefix extension to version 1.2.10 (#5244)
This version includes trivial changes to make it compile with PostgreSQL
16. No functional changes.
2023-09-08 02:10:01 +03:00
Heikki Linnakangas
252b953f18 Upgrade postgresql-hll to version 2.18. (#5241)
This includes PostgreSQL 16 support. No other changes, really.

The extension version in the upstream was changed from 2.17 to 2.18,
however, there is no difference between the catalog objects. So if you
had installed 2.17 previously, it will continue to work. You can run
"ALTER EXTENSION hll UPDATE", but all it will do is update the version
number in the pg_extension table.
2023-09-08 02:07:17 +03:00
Heikki Linnakangas
b414360afb Upgrade ip4r to version 2.4.2 (#5242)
Includes PostgreSQL v16 support. No functional changes.
2023-09-08 02:06:53 +03:00
Arpad Müller
d206655a63 Make VirtualFile::{open, open_with_options, create,sync_all,with_file} async fn (#5224)
## Problem

Once we use async file system APIs for `VirtualFile`, these functions
will also need to be async fn.

## Summary of changes

Makes the functions `open, open_with_options, create,sync_all,with_file`
of `VirtualFile` async fn, including all functions that call it. Like in
the prior PRs, the actual I/O operations are not using async APIs yet,
as per request in the #4743 epic.

We switch towards not using `VirtualFile` in the par_fsync module,
hopefully this is only temporary until we can actually do fully async
I/O in `VirtualFile`. This might cause us to exhaust fd limits in the
tests, but it should only be an issue for the local developer as we have
high ulimits in prod.

This PR is a follow-up of #5189, #5190, #5195, and #5203. Part of #4743.
2023-09-08 00:50:50 +02:00
Heikki Linnakangas
e5adc4efb9 Upgrade h3-pg to version 4.1.3. (#5237)
This includes v16 support.
2023-09-07 21:39:12 +03:00
Heikki Linnakangas
c202f0ba10 Update PostGIS to version 3.3.3 (#5236)
It's a good idea to keep up-to-date in general. One noteworthy change is
that PostGIS 3.3.3 adds support for PostgreSQL v16. We'll need that.

PostGIS 3.4.0 has already been released, and we should consider
upgrading to that. However, it's a major upgrade and requires running
"SELECT postgis_extensions_upgrade();" in each database, to upgrade the
catalogs. I don't want to deal with that right now.
2023-09-07 21:38:55 +03:00
Alexander Bayandin
d15563f93b Misc workflows: fix quotes in bash (#5235) 2023-09-07 19:39:42 +03:00
Rahul Modpur
485a2cfdd3 Fix pg_config version parsing (#5200)
## Problem
Fix pg_config version parsing

## Summary of changes
Use regex to capture major version of postgres
#5146
2023-09-07 15:34:22 +02:00
Alexander Bayandin
1fee69371b Update plv8 to 3.1.8 (#5230)
## Problem

We likely need this to support Postgres 16
It's also been asked by a user
https://github.com/neondatabase/neon/discussions/5042

The latest version is 3.2.0, but it requires some changes in the build
script (which I haven't checked, but it didn't work right away)

## Summary of changes
```
3.1.8       2023-08-01
            - force v8 to compile in release mode

3.1.7       2023-06-26
            - fix byteoffset issue with arraybuffers
            - support postgres 16 beta

3.1.6       2023-04-08
            - fix crash issue on fetch apply
            - fix interrupt issue
```
From https://github.com/plv8/plv8/blob/v3.1.8/Changes
2023-09-07 14:21:38 +01:00
Alexander Bayandin
f8a91e792c Even better handling of approved-for-ci-run label (#5227)
## Problem

We've got `approved-for-ci-run` to work 🎉 
But it's still a bit rough, this PR should improve the UX for external
contributors.

## Summary of changes
- `build_and_test.yml`: add `check-permissions` job, which fails if PR is
created from a fork. Make all jobs in the workflow to be dependant on
`check-permission` to fail fast
- `approved-for-ci-run.yml`: add `cleanup` job to close `ci-run/pr-*`
PRs and delete linked branches when the parent PR is closed
- `approved-for-ci-run.yml`: fix the layout for the `ci-run/pr-*` PR
description
- GitHub Autocomment: add a comment with tests result to the original PR
(instead of a PR from `ci-run/pr-*` )
2023-09-07 14:21:01 +01:00
duguorong009
706977fb77 fix(pageserver): add the walreceiver state to tenant timeline GET api endpoint (#5196)
Add a `walreceiver_state` field to `TimelineInfo` (response of `GET /v1/tenant/:tenant_id/timeline/:timeline_id`) and while doing that, refactor out a common `Timeline::walreceiver_state(..)`. No OpenAPI changes, because this is an internal debugging addition.

Fixes #3115.

Co-authored-by: Joonas Koivunen <joonas.koivunen@gmail.com>
2023-09-07 14:17:18 +03:00
Arpad Müller
7ba0f5c08d Improve comment in page cache (#5220)
It was easy to interpret comment in the page cache initialization code
to be about justifying why we leak here at all, not just why this
specific type of leaking is done (which the comment was actually meant
to describe).

See
https://github.com/neondatabase/neon/pull/5125#discussion_r1308445993

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-06 21:44:54 +02:00
Arpad Müller
6243b44dea Remove Virtual from FileBlockReaderVirtual variant name (#5225)
With #5181, the generics for `FileBlockReader` have been removed, so
having a `Virtual` postfix makes less sense now.
2023-09-06 20:54:57 +02:00
Joonas Koivunen
3a966852aa doc: tests expect lsof (#5226)
On a clean system `lsof` needs to be installed. Add it to the list just
to keep things nice and copy-pasteable (except for poetry).
2023-09-06 21:40:00 +03:00
duguorong009
31e1568dee refactor(pageserver): refactor pageserver router state creation (#5165)
Fixes #3894 by:
- Refactor the pageserver router creation flow
- Create the router state in `pageserver/src/bin/pageserver.rs`
2023-09-06 21:31:49 +03:00
Chengpeng Yan
9a9187b81a Complete the missing metrics for files_created/bytes_written (#5120) 2023-09-06 14:00:15 -04:00
Chengpeng Yan
dfe2e5159a remove the duplicate entries in postgresql.conf (#5090) 2023-09-06 13:57:03 -04:00
Alexander Bayandin
e4b1d6b30a Misc post-merge fixes (#5219)
## Problem
- `SCALE: unbound variable` from
https://github.com/neondatabase/neon/pull/5079
- The layout of the GitHub auto-comment is broken if the code coverage
section follows flaky test section from
https://github.com/neondatabase/neon/pull/4999

## Summary of changes
- `benchmarking.yml`: Rename `SCALE` to `TEST_OLAP_SCALE` 
- `comment-test-report.js`: Add an extra new-line before Code coverage
section
2023-09-06 20:11:44 +03:00
Alexander Bayandin
76a96b0745 Notify Slack channel about upcoming releases (#5197)
## Problem

When the next release is coming, we want to let everyone know about it by
posting a message to the Slack channel with a list of commits.

## Summary of changes
- `.github/workflows/release-notify.yml` is added
- the workflow sends a message to
`vars.SLACK_UPCOMING_RELEASE_CHANNEL_ID` (or
[#test-release-notifications](https://neondb.slack.com/archives/C05QQ9J1BRC)
if not configured)
- On each PR update, the workflow updates the list of commits in the
message (it doesn't send additional messages)
2023-09-06 17:52:21 +01:00
Arpad Müller
5e00c44169 Add WriteBlobWriter buffering and make VirtualFile::{write,write_all} async (#5203)
## Problem

We want to convert the `VirtualFile` APIs to async fn so that we can
adopt one of the async I/O solutions.

## Summary of changes

This PR is a follow-up of #5189, #5190, and #5195, and does the
following:

* Move the used `Write` trait functions of `VirtualFile` into inherent
functions
* Add optional buffering to `WriteBlobWriter`. The buffer is discarded
on drop, similarly to how tokio's `BufWriter` does it: drop is neither
async nor does it support errors.
* Remove the generics by `Write` impl of `WriteBlobWriter`, alwaays
using `VirtualFile`
* Rename `WriteBlobWriter` to `BlobWriter`
* Make various functions in the write path async, like
`VirtualFile::{write,write_all}`.

Part of #4743.
2023-09-06 18:17:12 +02:00
Alexander Bayandin
d5f1858f78 approved-for-ci-run.yml: use different tokens (#5218)
## Problem

`CI_ACCESS_TOKEN` has quite limited access (which is good), but this
doesn't allow it to remove labels from PRs (which is bad)

## Summary of changes
- Use `GITHUB_TOKEN` to remove labels
- Use `CI_ACCESS_TOKEN` to create PRs
2023-09-06 18:50:59 +03:00
John Spray
61d661a6c3 pageserver: generation number fetch on startup and use in /attach (#5163)
## Problem

- #5050 

Closes: https://github.com/neondatabase/neon/issues/5136

## Summary of changes

- A new configuration property `control_plane_api` controls other
functionality in this PR: if it is unset (default) then everything still
works as it does today.
- If `control_plane_api` is set, then on startup we call out to control
plane `/re-attach` endpoint to discover our attachments and their
generations. If an attachment is missing from the response we implicitly
detach the tenant.
- Calls to pageserver `/attach` API may include a `generation`
parameter. If `control_plane_api` is set, then this parameter is
mandatory.
- RemoteTimelineClient's loading of index_part.json is generation-aware,
and will try to load the index_part with the most recent generation <=
its own generation.
- The `neon_local` testing environment now includes a new binary
`attachment_service` which implements the endpoints that the pageserver
requires to operate. This is on by default if running `cargo neon` by
hand. In `test_runner/` tests, it is off by default: existing tests
continue to run with in the legacy generation-less mode.

Caveats:
- The re-attachment during startup assumes that we are only re-attaching
tenants that have previously been attached, and not totally new tenants
-- this relies on the control plane's attachment logic to keep retrying
so that we should eventually see the attach API call. That's important
because the `/re-attach` API doesn't tell us which timelines we should
attach -- we still use local disk state for that. Ref:
https://github.com/neondatabase/neon/issues/5173
- Testing: generations are only enabled for one integration test right
now (test_pageserver_restart), as a smoke test that all the machinery
basically works. Writing fuller tests that stress tenant migration will
come later, and involve extending our test fixtures to deal with
multiple pageservers.
- I'm not in love with "attachment_service" as a name for the neon_local
component, but it's not very important because we can easily rename
these test bits whenever we want.
- Limited observability when in re-attach on startup: when I add
generation validation for deletions in a later PR, I want to wrap up the
control plane API calls in some small client class that will expose
metrics for things like errors calling the control plane API, which will
act as a strong red signal that something is not right.

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-06 14:44:48 +01:00
Alexander Bayandin
da60f69909 approved-for-ci-run.yml: use our bot (#5216)
## Problem

Pull Requests created by GitHub Actions bot doesn't have access to
secrets, so we need to use our bot for it to auto-trigger a tests run

See previous PRs  #4663, #5210, #5212

## Summary of changes
- Use our bot to create PRs
2023-09-06 14:55:11 +03:00
John Spray
743933176e scrubber: add scan-metadata and hook into integration tests (#5176)
## Problem

- Scrubber's `tidy` command requires presence of a control plane
- Scrubber has no tests at all 

## Summary of changes

- Add re-usable async streams for reading metadata from a bucket
- Add a `scan-metadata` command that reads from those streams and calls
existing `checks.rs` code to validate metadata, then returns a summary
struct for the bucket. Command returns nonzero status if errors are
found.
- Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that
tests using remote storage can request to have the scrubber run after
they finish
- Enable remote storarge and scrub_on_exit in test_pageserver_restart
and test_pageserver_chaos

This is a "toe in the water" of the overall space of validating the
scrubber. Later, we should:
- Enable scrubbing at end of tests using remote storage by default
- Make the success condition stricter than "no errors": tests should
declare what tenants+timelines they expect to see in the bucket (or
sniff these from the functions tests use to create them) and we should
require that the scrubber reports on these particular tenants/timelines.

The `tidy` command is untouched in this PR, but it should be refactored
later to use similar async streaming interface instead of the current
batch-reading approach (the streams are faster with large buckets), and
to also be covered by some tests.


---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2023-09-06 11:55:24 +01:00
Alexander Bayandin
8e25d3e79e test_runner: add scale parameter to tpc-h tests (#5079)
## Problem

It's hard to find out which DB size we use for OLAP benchmarks (TPC-H in
particular).
This PR adds handling of `TEST_OLAP_SCALE` env var, which is get added
to a test name as a parameter.

This is required for performing larger periodic benchmarks. 

## Summary of changes
- Handle `TEST_OLAP_SCALE` in
`test_runner/performance/test_perf_olap.py`
- Set `TEST_OLAP_SCALE` in `.github/workflows/benchmarking.yml` to a
TPC-H scale
2023-09-06 13:22:57 +03:00
duguorong009
4fec48f2b5 chore(pageserver): remove unnecessary logging in tenant task loops (#5188)
Fixes #3830 by adding the `#[cfg(not(feature = "testing"))]` attribute
to unnecessary loggings in `pageserver/src/tenant/tasks.rs`.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-06 13:19:19 +03:00
Vadim Kharitonov
88b1ac48bd Create Release PR at 7:00 UTC every Tuesday (#5213) 2023-09-06 13:17:52 +03:00
Alexander Bayandin
15ff4e5fd1 approved-for-ci-run.yml: trigger on pull_request_target (#5212)
## Problem

Continuation of #4663, #5210

We're still getting an error:
```
GraphQL: Resource not accessible by integration (removeLabelsFromLabelable)
```

## Summary of changes
- trigger `approved-for-ci-run.yml` workflow on `pull_request_target`
instead of `pull_request`
2023-09-06 13:14:07 +03:00
Alexander Bayandin
dbfb4ea7b8 Make CI more friendly for external contributors. Second try (#5210)
## Problem

`approved-for-ci-run` label logic doesn't work as expected:
- https://github.com/neondatabase/neon/pull/4722#issuecomment-1636742145
- https://github.com/neondatabase/neon/pull/4722#issuecomment-1636755394

Continuation of https://github.com/neondatabase/neon/pull/4663
Closes #2222 (hopefully)

## Summary of changes
- Create a twin PR automatically
- Allow `GITHUB_TOKEN` to manipulate with labels
2023-09-06 10:06:55 +01:00
Alexander Bayandin
c222320a2a Generate lcov coverage report (#4999)
## Problem

We want to display coverage information for each PR.

- an example of a full coverage report:
https://neon-github-public-dev.s3.amazonaws.com/code-coverage/abea64800fb390c32a3efe6795d53d8621115c83/lcov/index.html
- an example of GitHub auto-comment with coverage information:
https://github.com/neondatabase/neon/pull/4999#issuecomment-1679344658

## Summary of changes
- Use
patched[*](426e7e7a22)
lcov to generate coverage report
- Upload HTML coverage report to S3
- `scripts/comment-test-report.js`: add coverage information
2023-09-06 00:48:15 +01:00
MMeent
89c64e179e Fix corruption issue in Local File Cache (#5208)
Fix issue where updating the size of the Local File Cache could lead to
invalid reads:

## Problem

LFC cache can get re-enabled when lfc_max_size is set, e.g. through an
autoscaler configuration, or PostgreSQL not liking us setting the
variable.

1. initialize: LFC enabled (lfc_size_limit > 0; lfc_desc = 0)
2. Open LFC file fails, lfc_desc = -1. lfc_size_limit is set to 0;
3. lfc_size_limit is updated by autoscaling to >0
4. read() now thinks LFC is enabled (size_limit > 0) and lfc_desc is
valid, but doesn't try to read from the invalid file handle and thus
doesn't update the buffer content with the page's data, but does think
the data was read...
Any buffer we try to read from local file cache is essentially
uninitialized memory. Those are likely 0-bytes, but might potentially be
any old buffer that was previously read from or flushed to disk.

Fix this by adding a more definitive disable flag, plus better invalid state handling.
2023-09-05 20:00:47 +02:00
Alexander Bayandin
7ceddadb37 Merge custom extension CI jobs (#5194)
## Problem

When a remote custom extension build fails, it looks a bit confusing on
neon CI:
- `trigger-custom-extensions-build` is green
- `wait-for-extensions-build` is red
- `build-and-upload-extensions` is red

But to restart the build (to get everything green), you need to restart
the only passed `trigger-custom-extensions-build`.

## Summary of changes
- Merge `trigger-custom-extensions-build` and
`wait-for-extensions-build` jobs into
`trigger-custom-extensions-build-and-wait`
2023-09-05 14:02:37 +01:00
Arpad Müller
4904613aaa Convert VirtualFile::{seek,metadata} to async (#5195)
## Problem

We want to convert the `VirtualFile` APIs to async fn so that we can
adopt one of the async I/O solutions.

## Summary of changes

Convert the following APIs of `VirtualFile` to async fn (as well as all
of the APIs calling it):

* `VirtualFile::seek`
* `VirtualFile::metadata`
* Also, prepare for deletion of the write impl by writing the summary to
a buffer before writing it to disk, as suggested in
https://github.com/neondatabase/neon/issues/4743#issuecomment-1700663864
. This change adds an additional warning for the case when the summary
exceeds a block. Previously, we'd have silently corrupted data in this
(unlikely) case.
* `WriteBlobWriter::write_blob`, in preparation for making
`VirtualFile::write_all` async.
2023-09-05 12:55:45 +02:00
Vadim Kharitonov
07d7874bc8 Merge pull request #5202 from neondatabase/releases/2023-09-05
Release 2023-09-05
2023-09-05 12:16:06 +02:00
Nikita Kalyanov
77658a155b support deploying in IPv6-only environments (#4135)
A set of changes to enable neon to work in IPv6 environments. The
changes are backward-compatible but allow to deploy neon even to
IPv6-only environments:
- bind to both IPv4 and IPv6 interfaces
- allow connections to Postgres from IPv6 interface
- parse the address from control plane that could also be IPv6
2023-09-05 12:45:46 +03:00
Arpad Müller
128a85ba5e Convert many VirtualFile APIs to async (#5190)
## Problem

`VirtualFile` does both reading and writing, and it would be nice if
both could be converted to async, so that it doesn't have to support an
async read path and a blocking write path (especially for the locks this
is annoying as none of the lock implementations in std, tokio or
parking_lot have support for both async and blocking access).

## Summary of changes

This PR is some initial work on making the `VirtualFile` APIs async. It
can be reviewed commit-by-commit.

* Introduce the `MaybeVirtualFile` enum to be generic in a test that
compares real files with virtual files.
* Make various APIs of `VirtualFile` async, including `write_all_at`,
`read_at`, `read_exact_at`.

Part of #4743 , successor of #5180.

Co-authored-by: Christian Schwarz <me@cschwarz.com>
2023-09-04 17:05:20 +02:00
Arpad Müller
6cd497bb44 Make VirtualFile::crashsafe_overwrite async fn (#5189)
## Problem

The `VirtualFile::crashsafe_overwrite` function was introduced by #5186
but it was not turned `async fn` yet. We want to make these functions
async fn as part of #4743.

## Summary of changes

Make `VirtualFile::crashsafe_overwrite` async fn, as well as all the
functions calling it. Don't make anything inside `crashsafe_overwrite`
use async functionalities, as per #4743 instructions.

Also, add rustdoc to `crashsafe_overwrite`.

Part of #4743.
2023-09-04 12:52:35 +02:00
John Spray
80f10d5ced pageserver: safe deletion for tenant directories (#5182)
## Problem

If a pageserver crashes partway through deleting a tenant's directory,
it might leave a partial state that confuses a subsequent
startup/attach.

## Summary of changes

Rename tenant directory to a temporary path before deleting.

Timeline deletions already have deletion markers to provide safety.

In future, it would be nice to exploit this to send responses to detach
requests earlier: https://github.com/neondatabase/neon/issues/5183
2023-09-04 08:31:55 +01:00
Christian Schwarz
7e817789d5 VirtualFile: add crash-safe overwrite abstraction & use it (#5186)
(part of #4743)
(preliminary to #5180)
 
This PR adds a special-purpose API to `VirtualFile` for write-once
files.
It adopts it for `save_metadata` and `persist_tenant_conf`.

This is helpful for the asyncification efforts (#4743) and specifically
asyncification of `VirtualFile` because above two functions were the
only ones that needed the VirtualFile to be an `std::io::Write`.
(There was also `manifest.rs` that needed the `std::io::Write`, but, it
isn't used right now, and likely won't be used because we're taking a
different route for crash consistency, see #5172. So, let's remove it.
It'll be in Git history if we need to re-introduce it when picking up
the compaction work again; that's why it was introduced in the first
place).

We can't remove the `impl std::io::Write for VirtualFile` just yet
because of the `BufWriter` in

```rust
struct DeltaLayerWriterInner {
...
    blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>,
}
```

But, @arpad-m and I have a plan to get rid of that by extracting the
append-only-ness-on-top-of-VirtualFile that #4994 added to
`EphemeralFile` into an abstraction that can be re-used in the
`DeltaLayerWriterInner` and `ImageLayerWriterInner`.
That'll be another PR.


### Performance Impact

This PR adds more fsyncs compared to before because we fsync the parent
directory every time.

1. For `save_metadata`, the additional fsyncs are unnecessary because we
know that `metadata` fits into a kernel page, and hence the write won't
be torn on the way into the kernel. However, the `metadata` file in
general is going to lose signficance very soon (=> see #5172), and the
NVMes in prod can definitely handle the additional fsync. So, let's not
worry about it.
2. For `persist_tenant_conf`, which we don't check to fit into a single
kernel page, this PR makes it actually crash-consistent. Before, we
could crash while writing out the tenant conf, leaving a prefix of the
tenant conf on disk.
2023-09-02 10:06:14 +02:00
John Spray
41aa627ec0 tests: get test name automatically for remote storage (#5184)
## Problem

Tests using remote storage have manually entered `test_name` parameters,
which:
- Are easy to accidentally duplicate when copying code to make a new
test
- Omit parameters, so don't actually create unique S3 buckets when
running many tests concurrently.

## Summary of changes

- Use the `request` fixture in neon_env_builder fixture to get the test
name, then munge that into an S3 compatible bucket name.
- Remove the explicit `test_name` parameters to enable_remote_storage
2023-09-01 17:29:38 +01:00
Conrad Ludgate
44da9c38e0 proxy: error typo (#5187)
## Problem

https://github.com/neondatabase/neon/pull/5162#discussion_r1311853491
2023-09-01 19:21:33 +03:00
Christian Schwarz
cfc0fb573d pageserver: run all Rust tests with remote storage enabled (#5164)
For
[#5086](https://github.com/neondatabase/neon/pull/5086#issuecomment-1701331777)
we will require remote storage to be configured in pageserver.

This PR enables `localfs`-based storage for all Rust unit tests.

Changes:

- In `TenantHarness`, set up localfs remote storage for the tenant.
- `create_test_timeline` should mimic what real timeline creation does,
and real timeline creation waits for the timeline to reach remote
storage. With this PR, `create_test_timeline` now does that as well.
- All the places that create the harness tenant twice need to shut down
the tenant before the re-create through a second call to `try_load` or
`load`.
- Without shutting down, upload tasks initiated by/through the first
incarnation of the harness tenant might still be ongoing when the second
incarnation of the harness tenant is `try_load`/`load`ed. That doesn't
make sense in the tests that do that, they generally try to set up a
scenario similar to pageserver stop & start.
- There was one test that recreates a timeline, not the tenant. For that
case, I needed to create a `Timeline::shutdown` method. It's a
refactoring of the existing `Tenant::shutdown` method.
- The remote_timeline_client tests previously set up their own
`GenericRemoteStorage` and `RemoteTimelineClient`. Now they re-use the
one that's pre-created by the TenantHarness. Some adjustments to the
assertions were needed because the assertions now need to account for
the initial image layer that's created by `create_test_timeline` to be
present.
2023-09-01 18:10:40 +02:00
Christian Schwarz
aa22000e67 FileBlockReader<File> is never used (#5181)
part of #4743

preliminary to #5180
2023-09-01 17:30:22 +02:00
Christian Schwarz
5edae96a83 rfc: Crash-Consistent Layer Map Updates By Leveraging index_part.json (#5086)
This RFC describes a simple scheme to make layer map updates crash
consistent by leveraging the index_part.json in remote storage. Without
such a mechanism, crashes can induce certain edge cases in which broadly
held assumptions about system invariants don't hold.
2023-09-01 15:24:58 +02:00
Christian Schwarz
40ce520c07 remote_timeline_client: tests: run upload ops on the tokio::test runtime (#5177)
The `remote_timeline_client` tests use `#[tokio::test]` and rely on the
fact that the test runtime that is set up by this macro is
single-threaded.

In PR https://github.com/neondatabase/neon/pull/5164, we observed
interesting flakiness with the `upload_scheduling` test case:
it would observe the upload of the third layer (`layer_file_name_3`)
before we did `wait_completion`.

Under the single-threaded-runtime assumption, that wouldn't be possible,
because the test code doesn't await inbetween scheduling the upload
and calling `wait_completion`.

However, RemoteTimelineClient was actually using `BACKGROUND_RUNTIME`.
That means there was parallelism where the tests didn't expect it,
leading to flakiness such as execution of an UploadOp task before
the test calls `wait_completion`.

The most confusing scenario is code like this:

```
schedule upload(A);
wait_completion.await; // B
schedule_upload(C);
wait_completion.await; // D
```

On a single-threaded executor, it is guaranteed that the upload up C
doesn't run before D, because we (the test) don't relinquish control
to the executor before D's `await` point.

However, RemoteTimelineClient actually scheduled onto the
BACKGROUND_RUNTIME, so, `A` could start running before `B` and
`C` could start running before `D`.

This would cause flaky tests when making assertions about the state
manipulated by the operations. The concrete issue that led to discover
of this bug was an assertion about `remote_fs_dir` state in #5164.
2023-09-01 16:24:04 +03:00
Alexander Bayandin
e9f2c64322 Wait for custom extensions build before deploy (#5170)
## Problem

Currently, the `deploy` job doesn't wait for the custom extension job
(in another repo) and can be started even with failed extensions build.
This PR adds another job that polls the status of the extension build job
and fails if the extension build fails.

## Summary of changes
- Add `wait-for-extensions-build` job, which waits for a custom
extension build in another repo.
2023-09-01 12:59:19 +01:00
John Spray
715077ab5b tests: broaden a log allow regex in test_ignored_tenant_stays_broken_without_metadata (#5168)
## Problem

- https://github.com/neondatabase/neon/issues/5167

## Summary of changes

Accept "will not become active" log line with _either_ Broken or
Stopping state, because we may hit it while in the process of doing the
`/ignore` (earlier in the test than the test expects to see the same
line with Broken)
2023-09-01 08:36:38 +01:00
John Spray
616e7046c7 s3_scrubber: import into the main neon repository (#5141)
## Problem

The S3 scrubber currently lives at
https://github.com/neondatabase/s3-scrubber

We don't have tests that use it, and it has copies of some data
structures that can get stale.

## Summary of changes

- Import the s3-scrubber as `s3_scrubber/
- Replace copied_definitions/ in the scrubber with direct access to the
`utils` and `pageserver` crates
- Modify visibility of a few definitions in `pageserver` to allow the
scrubber to use them
- Update scrubber code for recent changes to `IndexPart`
- Update `KNOWN_VERSIONS` for IndexPart and move the definition into
index.rs so that it is easier to keep up to date

As a future refinement, it would be good to pull the remote persistence
types (like IndexPart) out of `pageserver` into a separate library so
that the scrubber doesn't have to link against the whole pageserver, and
so that it's clearer which types need to be public.

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-31 19:01:39 +01:00
Anastasia Lubennikova
1804111a02 Merge pull request #5161 from neondatabase/rc-2023-08-31
Release 2023-08-31
2023-08-31 16:53:17 +03:00
Conrad Ludgate
1b916a105a proxy: locked is not retriable (#5162)
## Problem

Management service returns Locked when quotas are exhausted. We cannot
retry on those

## Summary of changes

Makes Locked status unretriable
2023-08-31 15:50:15 +03:00
Conrad Ludgate
d11621d904 Proxy: proxy protocol v2 (#5028)
## Problem

We need to log the client IP, not the IP of the NLB.

## Summary of changes

Parse the proxy [protocol version
2](https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt) if
possible
2023-08-31 14:30:25 +03:00
John Spray
43bb8bfdbb pageserver: fix flake in test_timeline_deletion_with_files_stuck_in_upload_queue (#5149)
## Problem

Test failing on a different ERROR log than it anticipated.

Closes: https://github.com/neondatabase/neon/issues/5148

## Summary of changes

Add the "could not flush frozen layer" error log to the permitted
errors.
2023-08-31 10:42:32 +01:00
John Spray
300a5aa05e pageserver: fix test v4_indexpart_is_parsed (#5157)
## Problem

Two recent PRs raced:
- https://github.com/neondatabase/neon/pull/5153
- https://github.com/neondatabase/neon/pull/5140

## Summary of changes

Add missing `generation` argument to IndexLayerMetadata construction
2023-08-31 10:40:46 +01:00
Nikita Kalyanov
b9c111962f pass JWT to management API (#5151)
support authentication with JWT from env for proxy calls to mgmt API
2023-08-31 12:23:51 +03:00
John Spray
83ae2bd82c pageserver: generation number support in keys and indices (#5140)
## Problem

To implement split brain protection, we need tenants and timelines to be
aware of their current generation, and use it when composing S3 keys.


## Summary of changes

- A `Generation` type is introduced in the `utils` crate -- it is in
this broadly-visible location because it will later be used from
`control_plane/` as well as `pageserver/`. Generations can be a number,
None, or Broken, to support legacy content (None), and Tenants in the
broken state (Broken).
- Tenant, Timeline, and RemoteTimelineClient all get a generation
attribute
- IndexPart's IndexLayerMetadata has a new `generation` attribute.
Legacy layers' metadata will deserialize to Generation::none().
- Remote paths are composed with a trailing generation suffix. If a
generation is equal to Generation::none() (as it currently always is),
then this suffix is an empty string.
- Functions for composing remote storage paths added in
remote_timeline_client: these avoid the way that we currently always
compose a local path and then strip the prefix, and avoid requiring a
PageserverConf reference on functions that want to create remote paths
(the conf is only needed for local paths). These are less DRY than the
old functions, but remote storage paths are a very rarely changing
thing, so it's better to write out our paths clearly in the functions
than to compose timeline paths from tenant paths, etc.
- Code paths that construct a Tenant take a `generation` argument in
anticipation that we will soon load generations on startup before
constructing Tenant.

Until the whole feature is done, we don't want any generation-ful keys
though: so initially we will carry this everywhere with the special
Generation::none() value.

Closes: https://github.com/neondatabase/neon/issues/5135

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-08-31 09:19:34 +01:00
Alexey Kondratov
f2c21447ce [compute_ctl] Create check availability data during full configuration (#5084)
I've moved it to the API handler in the 589cf1ed2 to do not delay
compute start. Yet, we now skip full configuration and catalog updates
in the most hot path -- waking up suspended compute, and only do it at:

- first start
- start with applying new configuration
- start for availability check

so it doesn't really matter anymore.

The problem with creating the table and test record in the API handler
is that someone can fill up timeline till the logical limit. Then it's
suspended and availability check is scheduled, so it fails.

If table + test row are created at the very beginning, we reserve a 8 KB
page for future checks, which theoretically will last almost forever.
For example, my ~1y old branch still has 8 KB sized test table:
```sql
cloud_admin@postgres=# select pg_relation_size('health_check');
 pg_relation_size
------------------
             8192
(1 row)
```

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2023-08-30 17:44:28 +02:00
Conrad Ludgate
93dcdb293a proxy: password hack hack (#5126)
## Problem

fixes #4881 

## Summary of changes
2023-08-30 16:20:27 +01:00
John Spray
a93274b389 pageserver: remove vestigial timeline_layers attribute (#5153)
## Problem

`timeline_layers` was write-only since
b95addddd5

We deployed the version that no longer requires it for deserializing, so
now we can stop including it when serializing.

## Summary of changes

Fully remove `timeline_layers`.
2023-08-30 16:14:04 +01:00
Anastasia Lubennikova
a7c0e4dcd0 Check if custiom extension is enabled.
This check was lost in the latest refactoring.

If extension is not present in 'public_extensions' or 'custom_extensions' don't download it
2023-08-30 17:47:06 +03:00
Conrad Ludgate
3b81e0c86d chore: remove webpki (#5069)
## Problem

webpki is unmaintained

Closes https://github.com/neondatabase/neon/security/dependabot/33

## Summary of changes

Update all dependents of webpki.
2023-08-30 15:14:03 +01:00
Anastasia Lubennikova
e5a397cf96 Form archive_path for remote extensions on the fly 2023-08-30 13:56:51 +03:00
Arthur Petukhovsky
cd0178efed Merge pull request #5150 from neondatabase/release-sk-fix-active-timeline
Release 2023-08-30
2023-08-30 11:43:39 +02:00
Joonas Koivunen
05773708d3 fix: add context for ancestor lsn wait (#5143)
In logs it is confusing to see seqwait timeouts which seemingly arise
from the branched lsn but actually are about the ancestor, leading to
questions like "has the last_record_lsn went back".

Noticed by @problame.
2023-08-30 12:21:41 +03:00
John Spray
382473d9a5 docs: add RFC for remote storage generation numbers (#4919)
## Summary

A scheme of logical "generation numbers" for pageservers and their
attachments is proposed, along with
changes to the remote storage format to include these generation numbers
in S3 keys.

Using the control plane as the issuer of these generation numbers
enables strong anti-split-brain
properties in the pageserver cluster without implementing a consensus
mechanism directly
in the pageservers.

## Motivation

Currently, the pageserver's remote storage format does not provide a
mechanism for addressing
split brain conditions that may happen when replacing a node during
failover or when migrating
a tenant from one pageserver to another. From a remote storage
perspective, a split brain condition
occurs whenever two nodes both think they have the same tenant attached,
and both can write to S3. This
can happen in the case of a network partition, pathologically long
delays (e.g. suspended VM), or software
bugs.

This blocks robust implementation of failover from unresponsive
pageservers, due to the risk that
the unresponsive pageserver is still writing to S3.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-08-30 09:49:55 +01:00
Arpad Müller
eb0a698adc Make page cache and read_blk async (#5023)
## Problem

`read_blk` does I/O and thus we would like to make it async. We can't
make the function async as long as the `PageReadGuard` returned by
`read_blk` isn't `Send`. The page cache is called by `read_blk`, and
thus it can't be async without `read_blk` being async. Thus, we have a
circular dependency.

## Summary of changes

Due to the circular dependency, we convert both the page cache and
`read_blk` to async at the same time:

We make the page cache use `tokio::sync` synchronization primitives as
those are `Send`. This makes all the places that acquire a lock require
async though, which we then also do. This includes also asyncification
of the `read_blk` function.

Builds upon #4994, #5015, #5056, and #5129.

Part of #4743.
2023-08-30 09:04:31 +02:00
Arseny Sher
81b6578c44 Allow walsender in recovery mode give WAL till dynamic flush_lsn.
Instead of fixed during the start of replication. To this end, create
term_flush_lsn watch channel similar to commit_lsn one. This allows to continue
recovery streaming if new data appears.
2023-08-29 23:19:40 +03:00
Arseny Sher
bc49c73fee Move wal_stream_connection_config to utils.
It will be used by safekeeper as well.
2023-08-29 23:19:40 +03:00
Arseny Sher
e98580b092 Add term and http endpoint to broker messaged SkTimelineInfo.
We need them for safekeeper peer recovery
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
804ef23043 Rename TermSwitchEntry to TermLsn.
Add derive Ord for easy comparison of <term, lsn> pairs.

part of https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
87f7d6bce3 Start and stop per timeline recovery task.
Slightly refactors init: now load_tenant_timelines is also async to properly
init the timeline, but to keep global map lock sync we just acquire it anew for
each timeline.

Recovery task itself is just a stub here.

part of
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
39e3fbbeb0 Add safekeeper peers to TimelineInfo.
Now available under GET /tenant/xxx/timeline/yyy for inspection.
2023-08-29 23:19:40 +03:00
Em Sharnoff
8d2a4aa5f8 vm-monitor: Add flag for when file cache on disk (#5130)
Part 1 of 2, for moving the file cache onto disk.

Because VMs are created by the control plane (and that's where the
filesystem for the file cache is defined), we can't rely on any kind of
synchronization between releases, so the change needs to be
feature-gated (kind of), with the default remaining the same for now.

See also: neondatabase/cloud#6593
2023-08-29 12:44:48 -07:00
Joonas Koivunen
d1fcdf75b3 test: enhanced logging for curious mock_s3 (#5134)
Possible flakyness with mock_s3. Add logging in hopes this will happen
again.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-08-29 14:48:50 +03:00
Shany Pozin
333574be57 Merge pull request #5133 from neondatabase/releases/2023-08-29
Release 2023-08-29
2023-08-29 14:02:58 +03:00
Alexander Bayandin
7e39a96441 scripts/flaky_tests.py: Improve flaky tests detection (#5094)
## Problem

We still need to rerun some builds manually because flaky tests weren't
detected automatically.
I found two reasons for it:
- If a test is flaky on a particular build type, on a particular
Postgres version, there's a high chance that this test is flaky on all
configurations, but we don't automatically detect such cases.
- We detect flaky tests only on the main branch, which requires manual
retrigger runs for freshly made flaky tests.
Both of them are fixed in the PR.

## Summary of changes
- Spread flakiness of a single test to all configurations
- Detect flaky tests in all branches (not only in the main)
- Look back only at  7 days of test history (instead of 10)
2023-08-29 11:53:24 +01:00
Alexander Bayandin
79a799a143 Merge branch 'release' into releases/2023-08-29 2023-08-29 11:17:57 +01:00
Vadim Kharitonov
babefdd3f9 Upgrade pgvector to 0.5.0 (#5132) 2023-08-29 12:53:50 +03:00
Arpad Müller
805fee1483 page cache: small code cleanups (#5125)
## Problem

I saw these things while working on #5111.

## Summary of changes

* Add a comment explaining why we use `Vec::leak` instead of
`Vec::into_boxed_slice` plus `Box::leak`.
* Add another comment explaining what `valid` is doing, it wasn't very
clear before.
* Add a function `set_usage_count` to not set it directly.
2023-08-29 11:49:04 +03:00
Felix Prasanna
85d6d9dc85 monitor/compute_ctl: remove references to the informant (#5115)
Also added some docs to the monitor :)

Co-authored-by: Em Sharnoff <sharnoff@neon.tech>
2023-08-29 02:59:27 +03:00
Em Sharnoff
e40ee7c3d1 remove unused file 'vm-cgconfig.conf' (#5127)
Honestly no clue why it's still here, should have been removed ages ago.
This is handled by vm-builder now.
2023-08-28 13:04:57 -07:00
Christian Schwarz
0fe3b3646a page cache: don't proactively evict EphemeralFile pages (#5129)
Before this patch, when dropping an EphemeralFile, we'd scan the entire
`slots` to proactively evict its pages (`drop_buffers_for_immutable`).

This was _necessary_ before #4994 because the page cache was a
write-back cache: we'd be deleting the EphemeralFile from disk after,
so, if we hadn't evicted its pages before that, write-back in
`find_victim` wouldhave failed.

But, since #4994, the page cache is a read-only cache, so, it's safe
to keep read-only data cached. It's never going to get accessed again
and eventually, `find_victim` will evict it.

The only remaining advantage of `drop_buffers_for_immutable` over
relying on `find_victim` is that `find_victim` has to do the clock
page replacement iterations until the count reaches 0,
whereas `drop_buffers_for_immutable` can kick the page out right away.

However, weigh that against the cost of `drop_buffers_for_immutable`,
which currently scans the entire `slots` array to find the
EphemeralFile's pages.

Alternatives have been proposed in #5122 and #5128, but, they come
with their own overheads & trade-offs.

Also, the real reason why we're looking into this piece of code is
that we want to make the slots rwlock async in #5023.
Since `drop_buffers_for_immutable` is called from drop, and there
is no async drop, it would be nice to not have to deal with this.

So, let's just stop doing `drop_buffers_for_immutable` and observe
the performance impact in benchmarks.
2023-08-28 20:42:18 +02:00
Em Sharnoff
529f8b5016 compute_ctl: Fix switched vm-monitor args (#5117)
Small switcheroo from #4946.
2023-08-28 14:55:41 +02:00
Joonas Koivunen
fbcd174489 load_layer_map: schedule deletions for any future layers (#5103)
Unrelated fixes noticed while integrating #4938.

- Stop leaking future layers in remote storage
- We schedule extra index_part uploads if layer name to be removed was
not actually present
2023-08-28 10:51:49 +03:00
Felix Prasanna
7b5489a0bb compute_ctl: start pg in cgroup for vms (#4920)
Starts `postgres` in cgroup directly from `compute_ctl` instead of from
`vm-builder`. This is required because the `vm-monitor` cannot be in the
cgroup it is managing. Otherwise, it itself would be frozen when
freezing the cgroup.

Requires https://github.com/neondatabase/cloud/pull/6331, which adds the
`AUTOSCALING` environment variable letting `compute_ctl` know to start
`postgres` in the cgroup.

Requires https://github.com/neondatabase/autoscaling/pull/468, which
prevents `vm-builder` from starting the monitor and putting postgres in
a cgroup. This will require a `VM_BUILDER_VERSION` bump.
2023-08-25 15:59:12 -04:00
Felix Prasanna
40268dcd8d monitor: fix filecache calculations (#5112)
## Problem
An underflow bug in the filecache calculations.

## Summary of changes
Fixed the bug, cleaned up calculations in general.
2023-08-25 13:29:10 -04:00
Vadim Kharitonov
4436c84751 Change codeowners (#5109) 2023-08-25 19:48:16 +03:00
Conrad Ludgate
9da06af6c9 Merge pull request #5113 from neondatabase/release-http-connection-fix
Release 2023-08-25
2023-08-25 17:21:35 +01:00
Conrad Ludgate
ce1753d036 proxy: dont return connection pending (#5107)
## Problem

We were returning Pending when a connection had a notice/notification
(introduced recently in #5020). When returning pending, the runtime
assumes you will call `cx.waker().wake()` in order to continue
processing.

We weren't doing that, so the connection task would get stuck

## Summary of changes

Don't return pending. Loop instead
2023-08-25 16:42:30 +01:00
Alek Westover
67db8432b4 Fix cargo deny errors (#5068)
## Problem
cargo deny lint broken

Links to the CVEs:

[rustsec.org/advisories/RUSTSEC-2023-0052](https://rustsec.org/advisories/RUSTSEC-2023-0052)

[rustsec.org/advisories/RUSTSEC-2023-0053](https://rustsec.org/advisories/RUSTSEC-2023-0053)
One is fixed, the other one isn't so we allow it (for now), to unbreak
CI. Then later we'll try to get rid of webpki in favour of the rustls
fork.

## Summary of changes
```
+ignore = ["RUSTSEC-2023-0052"]
```
2023-08-25 16:42:30 +01:00
John Spray
b758bf47ca pageserver: refactor TimelineMetadata serialization in IndexPart (#5091)
## Problem

The `metadata_bytes` field of IndexPart required explicit
deserialization & error checking everywhere it was used -- there isn't
anything special about this structure that should prevent it from being
serialized & deserialized along with the rest of the structure.

## Summary of changes

- Implement Serialize and Deserialize for TimelineMetadata
- Replace IndexPart::metadata_bytes with a simpler `metadata`, that can
be used directly.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-25 16:16:20 +01:00
Felix Prasanna
024e306f73 monitor: improve logging (#5099) 2023-08-25 10:09:53 -04:00
Alek Westover
f71c82e5de remove obsolete need dependency (#5087) 2023-08-25 09:10:26 -04:00
Conrad Ludgate
faf070f288 proxy: dont return connection pending (#5107)
## Problem

We were returning Pending when a connection had a notice/notification
(introduced recently in #5020). When returning pending, the runtime
assumes you will call `cx.waker().wake()` in order to continue
processing.

We weren't doing that, so the connection task would get stuck

## Summary of changes

Don't return pending. Loop instead
2023-08-25 15:08:45 +03:00
Arpad Müller
8c13296add Remove BlockReader::read_blk in favour of BlockCursor (#5015)
## Problem

We want to make `read_blk` an async function, but outside of
`async_trait`, which allocates, and nightly features, we can't use async
fn's in traits.

## Summary of changes

* Remove all uses of `BlockReader::read_blk` in favour of using block
  cursors, at least where the type of the `BlockReader` is behind a
  generic
* Introduce a `BlockReaderRef` enum that lists all implementors of
  `BlockReader::read_blk`.
* Remove `BlockReader::read_blk` and move its implementations into
  inherent functions on the types instead.

We don't turn `read_blk` into an async fn yet, for that we also need to
modify the page cache. So this is a preparatory PR, albeit an important
one.

Part of #4743.
2023-08-25 12:28:01 +02:00
Felix Prasanna
18537be298 monitor: listen on correct port to accept agent connections (#5100)
## Problem
The previous arguments have the monitor listen on `localhost`, which the
informant can connect to since it's also in the VM, but which the agent
cannot. Also, the port is wrong.

## Summary of changes
Listen on `0.0.0.0:10301`
2023-08-24 17:32:46 -04:00
Felix Prasanna
3128eeff01 compute_ctl: add vm-monitor (#4946)
Co-authored-by: Em Sharnoff <sharnoff@neon.tech>
2023-08-24 15:54:37 -04:00
Arpad Müller
227c87e333 Make EphemeralFile::write_blob function async (#5056)
## Problem

The `EphemeralFile::write_blob` function accesses the page cache
internally. We want to require `async` for these accesses in #5023.

## Summary of changes

This removes the implementaiton of the `BlobWriter` trait for
`EphemeralFile` and turns the `write_blob` function into an inherent
function. We can then make it async as well as the `push_bytes`
function. We move the `SER_BUFFER` thread-local into the
`InMemoryLayerInner` so that the same buffer can be accessed by
different threads as the async is (potentially) moved between threads.

Part of #4743, preparation for #5023.
2023-08-24 19:18:30 +02:00
Alek Westover
e8f9aaf78c Don't use non-existent docker tags (#5096) 2023-08-24 19:45:23 +03:00
Chengpeng Yan
fa74d5649e rename EphmeralFile::size to EphemeralFile::len (#5076)
## Problem
close https://github.com/neondatabase/neon/issues/5034

## Summary of changes
Based on the
[comment](https://github.com/neondatabase/neon/pull/4994#discussion_r1297277922).
Just rename the `EphmeralFile::size` to `EphemeralFile::len`.
2023-08-24 16:41:57 +02:00
Joonas Koivunen
f70871dfd0 internal-devx: pageserver future layers (#5092)
I've personally forgotten why/how can we have future layers during
reconciliation. Adds `#[cfg(feature = "testing")]` logging when we
upload such index_part.json, with a cross reference to where the cleanup
happens.

Latest private slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1692879032573809?thread_ts=1692792276.173979&cid=C033RQ5SPDH

Builds upon #5074. Should had been considered on #4837.
2023-08-24 17:22:36 +03:00
Alek Westover
99a1be6c4e remove upload step from neon, it is in private repo now (#5085) 2023-08-24 17:14:40 +03:00
Joonas Koivunen
76aa01c90f refactor: single phase Timeline::load_layer_map (#5074)
Current implementation first calls `load_layer_map`, which loads all
local layers, cleans up files, leave cleaning up stuff to "second
function". Then the "second function" is finally called, it does not do
the cleanup and some of the first functions setup can torn down. "Second
function" is actually both `reconcile_with_remote` and
`create_remote_layers`.

This change makes it a bit more verbose but in one phase with the
following sub-steps:
1. scan the timeline directory
2. delete extra files
    - now including on-demand download files
    - fixes #3660
3. recoincile the two sources of layers (directory, index_part)
4. rename_to_backup future layers, short layers
5. create the remaining as layers

Needed by #4938.

It was also noticed that this is blocking code in an `async fn` so just
do it in a `spawn_blocking`, which should be healthy for our startup
times. Other effects includes hopefully halving of `stat` calls; extra
calls which were not done previously are now done for the future layers.

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2023-08-24 16:07:40 +03:00
John Spray
3e2f0ffb11 libs: make backoff::retry() take a cancellation token (#5065)
## Problem

Currently, anything that uses backoff::retry will delay the join of its
task by however long its backoff sleep is, multiplied by its max
retries.

Whenever we call a function that sleeps, we should be passing in a
CancellationToken.

## Summary of changes

- Add a `Cancel` type to backoff::retry that wraps a CancellationToken
and an error `Fn` to generate an error if the cancellation token fires.
- In call sites that already run in a `task_mgr` task, use
`shutdown_token()` to provide the token. In other locations, use a dead
`CancellationToken` to satisfy the interface, and leave a TODO to fix it
up when we broaden the use of explicit cancellation tokens.
2023-08-24 14:54:46 +03:00
Arseny Sher
d597e6d42b Track list of walreceivers and their voting/streaming state in shmem.
Also add both walsenders and walreceivers to TimelineStatus (available under
v1/tenant/xxx/timeline/yyy).

Prepares for
https://github.com/neondatabase/neon/pull/4875
2023-08-23 16:04:08 +03:00
Christian Schwarz
71ccb07a43 ci: fix upload-postgres-extensions-to-s3 job (#5063)
This is cherry-picked-then-improved version of release branch commit
4204960942 PR #4861)

The commit

	commit 5f8fd640bf
	Author: Alek Westover <alek.westover@gmail.com>
	Date:   Wed Jul 26 08:24:03 2023 -0400

	    Upload Test Remote Extensions (#4792)

switched to using the release tag instead of `latest`, but,
the `promote-images` job only uploads `latest` to the prod ECR.

The switch to using release tag was good in principle, but,
it broke the release pipeline. So, switch release pipeline
back to using `latest`.

Note that a proper fix should abandon use of `:latest` tag
at all: currently, if a `main` pipeline runs concurrently
with a `release` pipeline, the `release` pipeline may end
up using the `main` pipeline's images.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-08-22 22:45:25 +03:00
Joonas Koivunen
ad8d777c1c refactor: remove is_incremental=true for ImageLayers footgun (#5061)
Accidentially giving is_incremental=true for ImageLayers costs a lot of
debugging time. Removes all API which would allow to do that. They can
easily be restored later *when needed*.

Split off from #4938.
2023-08-22 22:12:05 +03:00
Joonas Koivunen
2f97b43315 build: update tar, get rid of duplicate xattr (#5071)
`tar` recently pushed to 0.4.40. No big changes, but less Cargo.lock and
one less nagging from `cargo-deny`.

The diff:
https://github.com/alexcrichton/tar-rs/compare/0.4.38...0.4.40.
2023-08-22 21:21:44 +03:00
Joonas Koivunen
533a92636c refactor: pre-cleanup Layer, PersistentLayer and impls (#5059)
Remove pub but dead code, move trait methods as inherent methods, remove
unnecessary. Split off from #4938.
2023-08-22 21:14:28 +03:00
Alek Westover
bf303a6575 Trigger workflow in remote (private) repo to build and upload private extensions (#4944) 2023-08-22 13:32:29 -04:00
Christian Schwarz
8cd20485f8 metrics: smgr query time: add a pre-aggregated histogram (#5064)
When doing global queries in VictoriaMetrics, the per-timeline
histograms make us run into cardinality limits.

We don't want to give them up just yet because we don't
have an alternative for drilling down on timeline-specific
performance issues.

So, add a pre-aggregated histogram and add observations to it
whenever we add observations to the per-timeline histogram.

While we're at it, switch to using a strummed enum for the operation
type names.
2023-08-22 20:08:31 +03:00
Joonas Koivunen
933a869f00 refactor: compaction becomes async again (#5058)
#4938 will make on-demand download of layers in compaction possible, so
it's not suitable for our "policy" of no `spawn_blocking(|| ...
Handle::block_on(async { spawn_blocking(...).await })` because this
poses a clear deadlock risk. Nested spawn_blockings are because of the
download using `tokio::fs::File`.

- Remove `spawn_blocking` from caller of `compact_level0_phase1`
- Remove `Handle::block_on` from `compact_level0_phase1` (indentation
change)
- Revert to `AsLayerDesc::layer_desc` usage temporarily (until it
becomes field access in #4938)
2023-08-22 20:03:14 +03:00
Conrad Ludgate
8c6541fea9 chore: add supported targets to deny (#5070)
## Problem

many duplicate windows crates pollute the cargo deny output

## Summary of changes

we don't build those crates, so remove those targets from being checked
2023-08-22 19:44:31 +03:00
Alek Westover
5cf75d92d8 Fix cargo deny errors (#5068)
## Problem
cargo deny lint broken

Links to the CVEs:

[rustsec.org/advisories/RUSTSEC-2023-0052](https://rustsec.org/advisories/RUSTSEC-2023-0052)

[rustsec.org/advisories/RUSTSEC-2023-0053](https://rustsec.org/advisories/RUSTSEC-2023-0053)
One is fixed, the other one isn't so we allow it (for now), to unbreak
CI. Then later we'll try to get rid of webpki in favour of the rustls
fork.

## Summary of changes
```
+ignore = ["RUSTSEC-2023-0052"]
```
2023-08-22 18:41:32 +03:00
Vadim Kharitonov
4e2e44e524 Enable neon-pool-opt-in (#5062) 2023-08-22 09:06:14 +01:00
Vadim Kharitonov
ed786104f3 Merge pull request #5060 from neondatabase/releases/2023-08-22
Release 2023-08-22
2023-08-22 09:41:02 +02:00
Conrad Ludgate
0b001a0001 proxy: remove connections on shutdown (#5051)
## Problem

On shutdown, proxy connections are staying open.

## Summary of changes

Remove the connections on shutdown
2023-08-21 19:20:58 +01:00
Felix Prasanna
4a8bd866f6 bump vm-builder version to v0.16.3 (#5055)
This change to autoscaling allows agents to connect directly to the
monitor, completely removing the informant.
2023-08-21 13:29:16 -04:00
John Spray
615a490239 pageserver: refactor Tenant/Timeline args into structs (#5053)
## Problem

There are some common types that we pass into tenants and timelines as
we construct them, such as remote storage and the broker client.
Currently the list is small, but this is likely to grow -- the deletion
queue PR (#4960) pushed some methods to the point of clippy complaining
they had too many args, because of the extra deletion queue client being
passed around.

There are some shared objects that currently aren't passed around
explicitly because they use a static `once_cell` (e.g.
CONCURRENT_COMPACTIONS), but as we add more resource management and
concurreny control over time, it will be more readable & testable to
pass a type around in the respective Resources object, rather than to
coordinate via static objects. The `Resources` structures in this PR
will make it easier to add references to central coordination functions,
without having to rely on statics.

## Summary of changes

- For `Tenant`, the `broker_client` and `remote_storage` are bundled
into `TenantSharedResources`
- For `Timeline`, the `remote_client` is wrapped into
`TimelineResources`.

Both of these structures will get an additional deletion queue member in
#4960.
2023-08-21 17:30:28 +01:00
John Spray
b95addddd5 pageserver: do not read redundant timeline_layers from IndexPart, so that we can remove it later (#4972)
## Problem

IndexPart contains two redundant lists of layer names: a set of the
names, and then a map of name to metadata.

We already required that all the layers in `timeline_layers` are also in
`layers_metadata`, in `initialize_with_current_remote_index_part`, so if
there were any index_part.json files in the field that relied on these
sets being different, they would already be broken.

## Summary of changes

`timeline_layers` is made private and no longer read at runtime. It is
still serialized, but not deserialized.

`disk_consistent_lsn` is also made private, as this field only exists
for convenience of humans reading the serialized JSON.

This prepares us to entirely remove `timeline_layers` in a future
release, once this change is fully deployed, and therefore no
pageservers are trying to read the field.
2023-08-21 14:29:36 +03:00
Joonas Koivunen
130ccb4b67 Remove initial timeline id troubles (#5044)
I made a mistake when I adding `env.initial_timeline:
Optional[TimelineId]` in the #3839, should had just generated it and
used it to create a specific timeline. This PR fixes those mistakes, and
some extra calling into psql which must be slower than python field
access.
2023-08-20 12:33:19 +03:00
Dmitry Rodionov
9140a950f4 Resume tenant deletion on attach (#5039)
I'm still a bit nervous about attach -> crash case. But it should work.
(unlike case with timeline). Ideally would be cool to cover this with
test.

This continues tradition of adding bool flags for Tenant::set_stopping.
Probably lifecycle project will help with fixing it.
2023-08-20 12:28:50 +03:00
Arpad Müller
a23b0773f1 Fix DeltaLayer dumping (#5045)
## Problem

Before, DeltaLayer dumping (via `cargo run --release -p pagectl --
print-layer-file` ) would crash as one can't call `Handle::block_on` in
an async executor thread.

## Summary of changes

Avoid the problem by using `DeltaLayerInner::load_keys` to load the keys
into RAM (which we already do during compaction), and then load the
values one by one during dumping.
2023-08-19 00:56:03 +02:00
Joonas Koivunen
368ee6c8ca refactor: failpoint support (#5033)
- move them to pageserver which is the only dependant on the crate fail
- "move" the exported macro to the new module
- support at init time the same failpoints as runtime

Found while debugging test failures and making tests more repeatable by
allowing "exit" from pageserver start via environment variables. Made
those changes to `test_gc_cutoff.py`.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-08-19 01:01:44 +03:00
Felix Prasanna
5c6a692cf1 bump VM_BUILDER_VERSION to v0.16.2 (#5031)
A very slight change that allows us to configure the UID of the
neon-postgres cgroup owner. We start postgres in this cgroup so we can
scale it with the cgroups v2 api. Currently, the control plane
overwrites the entrypoint set by `vm-builder`, so `compute_ctl` (and
thus postgres), is not started in the neon-postgres cgroup. Having
`compute_ctl` start postgres in the cgroup should fix this. However, at
the moment appears like it does not have the correct permissions.
Configuring the neon-postgres UID to `postgres` (which is the UID
`compute_ctl` runs under) should hopefully fix this.


See #4920 - the PR to modify `compute_ctl` to start postgres in the
cgorup.
See: neondatabase/autoscaling#480, neondatabase/autoscaling#477. Both
these PR's are part of an effort to increase `vm-builder`'s
configurability and allow us to adjust it as we integrate in the
monitor.
2023-08-18 14:29:20 -04:00
Dmitry Rodionov
30888a24d9 Avoid flakiness in test_timeline_delete_fail_before_local_delete (#5032)
The problem was that timeline detail can return timelines in not only
active state. And by the time request comes timeline deletion can still
be in progress if we're unlucky (test execution happened to be slower
for some reason)

Reference for failed test run

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5022/5891420105/index.html#suites/f588e0a787c49e67b29490359c589fae/dab036e9bd673274

The error was `Exception: detail succeeded (it should return 404)`

reported by @koivunej
2023-08-18 20:49:11 +03:00
Dmitry Rodionov
f6c671c140 resume timeline deletions on attach (#5030)
closes [#5036](https://github.com/neondatabase/neon/issues/5036)
2023-08-18 20:48:33 +03:00
Christian Schwarz
ed5bce7cba rfcs: archive my MVCC S3 Notion Proposal (#5040)
This is a copy from the [original Notion
page](https://www.notion.so/neondatabase/Proposal-Pageserver-MVCC-S3-Storage-8a424c0c7ec5459e89d3e3f00e87657c?pvs=4),
taken on 2023-08-16.

This is for archival mostly.
The RFC that we're likely to go with is
https://github.com/neondatabase/neon/pull/4919.
2023-08-18 19:34:29 +02:00
Christian Schwarz
7a63685cde simplify page-caching of EphemeralFile (#4994)
(This PR is the successor of https://github.com/neondatabase/neon/pull/4984 )

## Summary

The current way in which `EphemeralFile` uses `PageCache` complicates
the Pageserver code base to a degree that isn't worth it.
This PR refactors how we cache `EphemeralFile` contents, by exploiting
the append-only nature of `EphemeralFile`.

The result is that `PageCache` only holds `ImmutableFilePage` and
`MaterializedPage`.
These types of pages are read-only and evictable without write-back.
This allows us to remove the writeback code from `PageCache`, also
eliminating an entire failure mode.

Futher, many great open-source libraries exist to solve the problem of a
read-only cache,
much better than our `page_cache.rs` (e.g., better replacement policy,
less global locking).
With this PR, we can now explore using them.

## Problem & Analysis

Before this PR, `PageCache` had three types of pages:

* `ImmutableFilePage`: caches Delta / Image layer file contents
* `MaterializedPage`: caches results of Timeline::get (page
materialization)
* `EphemeralPage`: caches `EphemeralFile` contents

`EphemeralPage` is quite different from `ImmutableFilePage` and
`MaterializedPage`:

* Immutable and materialized pages are for the acceleration of (future)
reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of
DRAM.
* Ephemeral pages are a write-back cache of `EphemeralFile` contents,
i.e., if there is pressure in the page cache, we spill `EphemeralFile`
contents to disk.

`EphemeralFile` is only used by `InMemoryLayer`, for the following
purposes:
* **write**: when filling up the `InMemoryLayer`, via `impl BlobWriter
for EphemeralFile`
* **read**: when doing **page reconstruction** for a page@lsn that isn't
written to disk
* **read**: when writing L0 layer files, we re-read the `InMemoryLayer`
and put the contents into the L0 delta writer
(**`create_delta_layer`**). This happens every 10min or when
InMemoryLayer reaches 256MB in size.

The access patterns of the `InMemoryLayer` use case are as follows:

* **write**: via `BlobWriter`, strictly append-only
* **read for page reconstruction**: via `BlobReader`, random
* **read for `create_delta_layer`**: via `BlobReader`, dependent on
data, but generally random. Why?
* in classical LSM terms, this function is what writes the
memory-resident `C0` tree into the disk-resident `C1` tree
* in our system, though, the values of InMemoryLayer are stored in an
EphemeralFile, and hence they are not guaranteed to be memory-resident
* the function reads `Value`s in `Key, LSN` order, which is `!=` insert
order

What do these `EphemeralFile`-level access patterns mean for the page
cache?

* **write**:
* the common case is that `Value` is a WAL record, and if it isn't a
full-page-image WAL record, then it's smaller than `PAGE_SIZE`
* So, the `EphemeralPage` pages act as a buffer for these `< PAGE_CACHE`
sized writes.
* If there's no page cache eviction between subsequent
`InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident,
so the page cache avoids doing a `write` system call.
* In practice, a busy page server will have page cache evictions because
we only configure 64MB of page cache size.
* **reads for page reconstruction**: read acceleration, just as for the
other page types.
* **reads for `create_delta_layer`**:
* The `Value` reads happen through a `BlockCursor`, which optimizes the
case of repeated reads from the same page.
* So, the best case is that subsequent values are located on the same
page; hence `BlockCursor`s buffer is maximally effective.
* The worst case is that each `Value` is on a different page; hence the
`BlockCursor`'s 1-page-sized buffer is ineffective.
* The best case translates into `256MB/PAGE_SIZE` page cache accesses,
one per page.
    * the worst case translates into `#Values` page cache accesses
* again, the page cache accesses must be assumed to be random because
the `Value`s aren't accessed in insertion order but `Key, LSN` order.

## Summary of changes

Preliminaries for this PR were:

- #5003
- #5004 
- #5005 
  - uncommitted microbenchmark in #5011 

Based on the observations outlined above, this PR makes the following
changes:

* Rip out `EphemeralPage` from `page_cache.rs`
* Move the `block_io::FileId` to `page_cache::FileId`
* Add a `PAGE_SIZE`d buffer to the `EphemeralPage` struct.
  It's called `mutable_tail`.
* Change `write_blob` to use `mutable_tail` for the write buffering
instead of a page cache page.
* if `mutable_tail` is full, it writes it out to disk, zeroes it out,
and re-uses it.
* There is explicitly no double-buffering, so that memory allocation per
`EphemeralFile` instance is fixed.
* Change `read_blob` to return different `BlockLease` variants depending
on `blknum`
* for the `blknum` that corresponds to the `mutable_tail`, return a ref
to it
* Rust borrowing rules prevent `write_blob` calls while refs are
outstanding.
  * for all non-tail blocks, return a page-cached `ImmutablePage`
* It is safe to page-cache these as ImmutablePage because EphemeralFile
is append-only.

## Performance 

How doe the changes above affect performance?
M claim is: not significantly.

* **write path**:
* before this PR, the `EphemeralFile::write_blob` didn't issue its own
`write` system calls.
* If there were enough free pages, it didn't issue *any* `write` system
calls.
* If it had to evict other `EphemeralPage`s to get pages a page for its
writes (`get_buf_for_write`), the page cache code would implicitly issue
the writeback of victim pages as needed.
* With this PR, `EphemeralFile::write_blob` *always* issues *all* of its
*own* `write` system calls.
* Also, the writes are explicit instead of implicit through page cache
write back, which will help #4743
* The perf impact of always doing the writes is the CPU overhead and
syscall latency.
* Before this PR, we might have never issued them if there were enough
free pages.
* We don't issue `fsync` and can expect the writes to only hit the
kernel page cache.
* There is also an advantage in issuing the writes directly: the perf
impact is paid by the tenant that caused the writes, instead of whatever
tenant evicts the `EphemeralPage`.
* **reads for page reconstruction**: no impact.
* The `write_blob` function pre-warms the page cache when it writes the
`mutable_tail` to disk.
* So, the behavior is the same as with the EphemeralPages before this
PR.
* **reads for `create_delta_layer`**: no impact.
  * Same argument as for page reconstruction.
  * Note for the future:
* going through the page cache likely causes read amplification here.
Why?
* Due to the `Key,Lsn`-ordered access pattern, we don't read all the
values in the page before moving to the next page. In the worst case, we
might read the same page multiple times to read different `Values` from
it.
    * So, it might be better to bypass the page cache here.
    * Idea drafts:
      * bypass PS page cache + prefetch pipeline + iovec-based IO
* bypass PS page cache + use `copy_file_range` to copy from ephemeral
file into the L0 delta file, without going through user space
2023-08-18 20:31:03 +03:00
Joonas Koivunen
0a082aee77 test: allow race with flush and stopped queue (#5027)
A lucky race can happen with the shutdown order I guess right now. Seen
in [test_tenant_delete_smoke].

The message is not the greatest to match against.

[test_tenant_delete_smoke]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/5892262320/index.html#suites/3556ed71f2d69272a7014df6dcb02317/189a0d1245fb5a8c
2023-08-18 19:36:25 +03:00
Arthur Petukhovsky
0b90411380 Fix safekeeper recovery with auth (#5035)
Fix missing a password in walrcv_connect for a safekeeper recovery. Add
a test which restarts endpoint and triggers a recovery.
2023-08-18 16:48:55 +01:00
Arpad Müller
f4da010aee Make the compaction warning more tolerant (#5024)
## Problem

The performance benchmark in `test_runner/performance/test_layer_map.py`
is currently failing due to the warning added in #4888.

## Summary of changes

The test mentioned has a `compaction_target_size` of 8192, which is just
one page size. This is an unattainable goal, as we generate at least
three pages: one for the header, one for the b-tree (minimally sized
ones have just the root node in a single page), one for the data.

Therefore, we add two pages to the warning limit. The warning text
becomes a bit less accurate but I think this is okay.
2023-08-18 16:36:31 +02:00
Conrad Ludgate
ec10838aa4 proxy: pool connection logs (#5020)
## Problem

Errors and notices that happen during a pooled connection lifecycle have
no session identifiers

## Summary of changes

Using a watch channel, we set the session ID whenever it changes. This
way we can see the status of a connection for that session

Also, adding a connection id to be able to search the entire connection
lifecycle
2023-08-18 11:44:08 +01:00
Joonas Koivunen
67af24191e test: cleanup remote_timeline_client tests (#5013)
I will have to change these as I change remote_timeline_client api in
#4938. So a bit of cleanup, handle my comments which were just resolved
during initial review.

Cleanup:
- use unwrap in tests instead of mixed `?` and `unwrap`
- use `Handle` instead of `&'static Reactor` to make the
RemoteTimelineClient more natural
- use arrays in tests
- use plain `#[tokio::test]`
2023-08-17 19:27:30 +03:00
Joonas Koivunen
6af5f9bfe0 fix: format context (#5022)
We return an error with unformatted `{timeline_id}`.
2023-08-17 14:30:25 +00:00
Dmitry Rodionov
64fc7eafcd Increase timeout once again. (#5021)
When failpoint is early in deletion process it takes longer to complete
after failpoint is removed.

Example was: https://neon-github-public-dev.s3.amazonaws.com/reports/main/5889544346/index.html#suites/3556ed71f2d69272a7014df6dcb02317/49826c68ce8492b1
2023-08-17 15:37:28 +03:00
Conrad Ludgate
3e4710c59e proxy: add more sasl logs (#5012)
## Problem

A customer is having trouble connecting to neon from their production
environment. The logs show a mix of "Internal error" and "authentication
protocol violation" but not the full error

## Summary of changes

Make sure we don't miss any logs during SASL/SCRAM
2023-08-17 12:05:54 +01:00
Dmitry Rodionov
d8b0a298b7 Do not attach deleted tenants (#5008)
Rather temporary solution before proper:
https://github.com/neondatabase/neon/issues/5006

It requires more plumbing so lets not attach deleted tenants first and
then implement resume.

Additionally fix `assert_prefix_empty`. It had a buggy prefix calculation,
and since we always asserted for absence of stuff it worked. Here I
started to assert for presence of stuff too and it failed. Added more
"presence" asserts to other places to be confident that it works.

Resolves [#5016](https://github.com/neondatabase/neon/issues/5016)
2023-08-17 13:46:49 +03:00
Alexander Bayandin
c8094ee51e test_compatibility: run amcheck unconditionally (#4985)
## Problem

The previous version of neon (that we use in the forward compatibility test)
has installed `amcheck` extension now. We can run `pg_amcheck`
unconditionally.

## Summary of changes
- Run `pg_amcheck` in compatibility tests unconditionally
2023-08-17 11:46:00 +01:00
Christian Schwarz
957af049c2 ephemeral file: refactor write_blob impl to concentrate mutable state (#5004)
Before this patch, we had the `off` and `blknum` as function-wide
mutable state. Now it's contained in the `Writer` struct.

The use of `push_bytes` instead of index-based filling of the buffer
also makes it easier to reason about what's going on.

This is prep for https://github.com/neondatabase/neon/pull/4994
2023-08-17 13:07:25 +03:00
Anastasia Lubennikova
786c7b3708 Refactor remote extensions index download.
Don't download ext_index.json from s3, but instead receive it as a part of spec from control plane.
This eliminates s3 access for most compute starts,
and also allows us to update extensions spec on the fly
2023-08-17 12:48:33 +03:00
Joonas Koivunen
d3612ce266 delta_layer: Restore generic from last week (#5014)
Restores #4937 work relating to the ability to use `ResidentDeltaLayer`
(which is an Arc wrapper) in #4938 for the ValueRef's by removing the
borrow from `ValueRef` and providing it from an upper layer.

This should not have any functional changes, most importantly, the
`main` will continue to use the borrowed `DeltaLayerInner`. It might be
that I can change #4938 to be like this. If that is so, I'll gladly rip
out the `Ref` and move the borrow back. But I'll first want to look at
the current test failures.
2023-08-17 11:47:31 +03:00
Christian Schwarz
994411f5c2 page cache: newtype the blob_io and ephemeral_file file ids (#5005)
This makes it more explicit that these are different u64-sized
namespaces.
Re-using one in place of the other would be catastrophic.

Prep for https://github.com/neondatabase/neon/pull/4994
which will eliminate the ephemeral_file::FileId and move the
blob_io::FileId into page_cache.
It makes sense to have this preliminary commit though,
to minimize amount of new concept in #4994 and other
preliminaries that depend on that work.
2023-08-16 18:33:47 +02:00
Conrad Ludgate
25934ec1ba proxy: reduce global conn pool contention (#4747)
## Problem

As documented, the global connection pool will be high contention.

## Summary of changes

Use DashMap rather than Mutex<HashMap>.

Of note, DashMap currently uses a RwLock internally, but it's partially
sharded to reduce contention by a factor of N. We could potentially use
flurry which is a port of Java's concurrent hashmap, but I have no good
understanding of it's performance characteristics. Dashmap is at least
equivalent to hashmap but less contention.

See the read heavy benchmark to analyse our expected performance
<https://github.com/xacrimon/conc-map-bench#ready-heavy>

I also spoke with the developer of dashmap recently, and they are
working on porting the implementation to use concurrent HAMT FWIW
2023-08-16 17:20:28 +01:00
Arpad Müller
0bdbc39cb1 Compaction: unify key and value reference vecs (#4888)
## Problem

PR #4839 has already reduced the number of b-tree traversals and vec
creations from 3 to 2, but as pointed out in
https://github.com/neondatabase/neon/pull/4839#discussion_r1279167815 ,
we would ideally just traverse the b-tree once during compaction.

Afer #4836, the two vecs created are one for the list of keys, lsns and
sizes, and one for the list of `(key, lsn, value reference)`. However,
they are not equal, as pointed out in
https://github.com/neondatabase/neon/pull/4839#issuecomment-1660418012
and the following comment: the key vec creation combines multiple
entries for which the lsn is changing but the key stays the same into
one, with the size being the sum of the sub-sizes. In SQL, this would
correspond to something like `SELECT key, lsn, SUM(size) FROM b_tree
GROUP BY key;` and `SELECT key, lsn, val_ref FROM b_tree;`. Therefore,
the join operation is non-trivial.

## Summary of changes

This PR merges the two lists of keys and value references into one. It's
not a trivial change and affects the size pattern of the resulting
files, which is why this is in a separate PR from #4839 .

The key vec is used in compaction for determining when to start a new
layer file. The loop uses various thresholds to come to this conclusion,
but the grouping via the key has led to the behaviour that regardless of
the threshold, it only starts a new file when either a new key is
encountered, or a new delta file.

The new code now does the combination after the merging and sorting of
the various keys from the delta files. This *mostly* does the same as
the old code, except for a detail: with the grouping done on a
per-delta-layer basis, the sorted and merged vec would still have
multiple entries for multiple delta files, but now, we don't have an
easy way to tell when a new input delta layer file is encountered, so we
cannot create multiple entries on that basis easily.

To prevent possibly infinite growth, our new grouping code compares the
combined size with the threshold, and if it is exceeded, it cuts a new
entry so that the downstream code can cut a new output file. Here, we
perform a tradeoff however, as if the threshold is too small, we risk
putting entries for the same key into multiple layer files, but if the
threshold is too big, we can in some instances exceed the target size.

Currently, we set the threshold to the target size, so in theory we
would stay below or roughly at double the `target_file_size`.

We also fix the way the size was calculated for the last key. The calculation
was wrong and accounted for the old layer's btree, even though we
already account for the overhead of the in-construction btree.

Builds on top of #4839 .
2023-08-16 18:27:18 +03:00
Dmitry Rodionov
96b84ace89 Correctly remove orphaned objects in RemoteTimelineClient::delete_all (#5000)
Previously list_prefixes was incorrectly used for that purpose. Change
to use list_files. Add a test.

Some drive by refactorings on python side to move helpers out of
specific test file to be widely accessible

resolves https://github.com/neondatabase/neon/issues/4499
2023-08-16 17:31:16 +03:00
Christian Schwarz
368b783ada ephemeral_file: remove FileExt impl (was only used by tests) (#5003)
Extracted from https://github.com/neondatabase/neon/pull/4994
2023-08-16 15:41:25 +02:00
Dmitry Rodionov
0f47bc03eb Fix delete_objects in UnreliableWrapper (#5002)
For `delete_objects` it was injecting failures for whole delete_objects operation
and then for every delete it contains. Make it fail once for the whole operation.
2023-08-16 14:08:53 +03:00
Arseny Sher
fdbe8dc8e0 Fix test_s3_wal_replay flakiness.
ref https://github.com/neondatabase/neon/issues/4466
2023-08-16 12:57:43 +03:00
Arthur Petukhovsky
1b97a3074c Disable neon-pool-opt-in (#4995) 2023-08-15 20:57:56 +03:00
John Spray
5c836ee5b4 tests: extend timeout in timeline deletion test (#4992)
## Problem

This was set to 5 seconds, which was very close to how long a compaction
took on my workstation, and when deletion is blocked on compaction the
test would fail.

We will fix this to make compactions drop out on deletion, but for the
moment let's stabilize the test.

## Summary of changes

Change timeout on timeline deletion in
`test_timeline_deletion_with_files_stuck_in_upload_queue` from 5 seconds
to 30 seconds.
2023-08-15 20:14:03 +03:00
Arseny Sher
4687b2e597 Test that auth on pg/http services can be enabled separately in sks.
To this end add
1) -e option to 'neon_local safekeeper start' command appending extra options
   to safekeeper invocation;
2) Allow multiple occurrences of the same option in safekeepers, the last
   value is taken.
3) Allow to specify empty string for *-auth-public-key-path opts, it
   disables auth for the service.
2023-08-15 19:31:20 +03:00
Arseny Sher
13adc83fc3 Allow to enable http/pg/pg tenant only auth separately in safekeeper.
The same option enables auth and specifies public key, so this allows to use
different public keys as well. The motivation is to
1) Allow to e.g. change pageserver key/token without replacing all compute
  tokens.
2) Enable auth gradually.
2023-08-15 19:31:20 +03:00
Dmitry Rodionov
52c2c69351 fsync directory before mark file removal (#4986)
## Problem

Deletions can be possibly reordered. Use fsync to avoid the case when
mark file doesnt exist but other tenant/timeline files do.

See added comments.

resolves #4987
2023-08-15 19:24:23 +03:00
Stas Kelvich
84b74f2bd1 Merge pull request #4997 from neondatabase/sk/proxy-release-23-07-15
Fix lint
2023-08-15 18:54:20 +03:00
Arthur Petukhovsky
fec2ad6283 Fix lint 2023-08-15 18:49:02 +03:00
Stas Kelvich
98eebd4682 Merge pull request #4996 from neondatabase/sk/proxy_release
Disable neon-pool-opt-in
2023-08-15 18:37:50 +03:00
Arthur Petukhovsky
2f74287c9b Disable neon-pool-opt-in 2023-08-15 18:34:17 +03:00
Alexander Bayandin
207919f5eb Upload test results to DB right after generation (#4967)
## Problem

While adding new test results format, I've also changed the way we
upload Allure reports to S3
(722c7956bb)
to avoid duplicated results from previous runs. But it broke links at
earlier results (results are still available but on different URLs).

This PR fixes this (by reverting logic in
722c7956bb
changes), and moves the logic for storing test results into db to allure
generate step. It allows us to avoid test results duplicates in the db
and saves some time on extra s3 downloads that happened in a different
job before the PR.

Ref https://neondb.slack.com/archives/C059ZC138NR/p1691669522160229

## Summary of changes
- Move test results storing logic from a workflow to
`actions/allure-report-generate`
2023-08-15 15:32:30 +01:00
George MacKerron
218be9eb32 Added deferrable transaction option to http batch queries (#4993)
## Problem

HTTP batch queries currently allow us to set the isolation level and
read only, but not deferrable.

## Summary of changes

Add support for deferrable.

Echo deferrable status in response headers only if true.

Likewise, now echo read-only status in response headers only if true.
2023-08-15 14:52:00 +01:00
249 changed files with 20086 additions and 5966 deletions

View File

@@ -14,10 +14,12 @@
!pgxn/
!proxy/
!safekeeper/
!s3_scrubber/
!storage_broker/
!trace/
!vendor/postgres-v14/
!vendor/postgres-v15/
!vendor/postgres-v16/
!workspace_hack/
!neon_local/
!scripts/ninstall.sh

8
.github/actionlint.yml vendored Normal file
View File

@@ -0,0 +1,8 @@
self-hosted-runner:
labels:
- gen3
- large
- small
- us-east-2
config-variables:
- SLACK_UPCOMING_RELEASE_CHANNEL_ID

View File

@@ -1,6 +1,13 @@
name: 'Create Allure report'
description: 'Generate Allure report from uploaded by actions/allure-report-store tests results'
inputs:
store-test-results-into-db:
description: 'Whether to store test results into the database. TEST_RESULT_CONNSTR/TEST_RESULT_CONNSTR_NEW should be set'
type: boolean
required: false
default: false
outputs:
base-url:
description: 'Base URL for Allure report'
@@ -139,9 +146,11 @@ runs:
sed -i 's|<a href="." class=|<a href="https://'${BUCKET}'.s3.amazonaws.com/'${REPORT_PREFIX}'/latest/index.html?nocache='"'+Date.now()+'"'" class=|g' ${WORKDIR}/report/app.js
# Upload a history and the final report (in this particular order to not to have duplicated history in 2 places)
# Use sync for the final report to delete files from previous runs
time aws s3 mv --recursive --only-show-errors "${WORKDIR}/report/history" "s3://${BUCKET}/${REPORT_PREFIX}/latest/history"
time aws s3 sync --delete --only-show-errors "${WORKDIR}/report" "s3://${BUCKET}/${REPORT_PREFIX}/${GITHUB_RUN_ID}"
# Use aws s3 cp (instead of aws s3 sync) to keep files from previous runs to make old URLs work,
# and to keep files on the host to upload them to the database
time aws s3 cp --recursive --only-show-errors "${WORKDIR}/report" "s3://${BUCKET}/${REPORT_PREFIX}/${GITHUB_RUN_ID}"
# Generate redirect
cat <<EOF > ${WORKDIR}/index.html
@@ -170,6 +179,41 @@ runs:
aws s3 rm "s3://${BUCKET}/${LOCK_FILE}"
fi
- name: Store Allure test stat in the DB
if: ${{ !cancelled() && inputs.store-test-results-into-db == 'true' }}
shell: bash -euxo pipefail {0}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
REPORT_JSON_URL: ${{ steps.generate-report.outputs.report-json-url }}
run: |
export DATABASE_URL=${REGRESS_TEST_RESULT_CONNSTR}
./scripts/pysync
poetry run python3 scripts/ingest_regress_test_result.py \
--revision ${COMMIT_SHA} \
--reference ${GITHUB_REF} \
--build-type unified \
--ingest ${WORKDIR}/report/data/suites.json
- name: Store Allure test stat in the DB (new)
if: ${{ !cancelled() && inputs.store-test-results-into-db == 'true' }}
shell: bash -euxo pipefail {0}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_S3_URL: ${{ steps.generate-report.outputs.base-s3-url }}
run: |
export DATABASE_URL=${REGRESS_TEST_RESULT_CONNSTR_NEW}
./scripts/pysync
poetry run python3 scripts/ingest_regress_test_result-new-format.py \
--reference ${GITHUB_REF} \
--revision ${COMMIT_SHA} \
--run-id ${GITHUB_RUN_ID} \
--run-attempt ${GITHUB_RUN_ATTEMPT} \
--test-cases-dir ${WORKDIR}/report/data/test-cases
- name: Cleanup
if: always()
shell: bash -euxo pipefail {0}

View File

@@ -70,6 +70,9 @@ runs:
name: compatibility-snapshot-${{ inputs.build_type }}-pg${{ inputs.pg_version }}
path: /tmp/compatibility_snapshot_pg${{ inputs.pg_version }}
prefix: latest
# The lack of compatibility snapshot (for example, for the new Postgres version)
# shouldn't fail the whole job. Only relevant test should fail.
skip-if-does-not-exist: true
- name: Checkout
if: inputs.needs_postgres_source == 'true'
@@ -145,7 +148,11 @@ runs:
if [ "${RERUN_FLAKY}" == "true" ]; then
mkdir -p $TEST_OUTPUT
poetry run ./scripts/flaky_tests.py "${TEST_RESULT_CONNSTR}" --days 10 --output "$TEST_OUTPUT/flaky.json"
poetry run ./scripts/flaky_tests.py "${TEST_RESULT_CONNSTR}" \
--days 7 \
--output "$TEST_OUTPUT/flaky.json" \
--pg-version "${DEFAULT_PG_VERSION}" \
--build-type "${BUILD_TYPE}"
EXTRA_PARAMS="--flaky-tests-json $TEST_OUTPUT/flaky.json $EXTRA_PARAMS"
fi

31
.github/workflows/actionlint.yml vendored Normal file
View File

@@ -0,0 +1,31 @@
name: Lint GitHub Workflows
on:
push:
branches:
- main
- release
paths:
- '.github/workflows/*.ya?ml'
pull_request:
paths:
- '.github/workflows/*.ya?ml'
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
actionlint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: reviewdog/action-actionlint@v1
env:
# SC2046 - Quote this to prevent word splitting. - https://www.shellcheck.net/wiki/SC2046
# SC2086 - Double quote to prevent globbing and word splitting. - https://www.shellcheck.net/wiki/SC2086
SHELLCHECK_OPTS: --exclude=SC2046,SC2086
with:
fail_on_error: true
filter_mode: nofilter
level: error

View File

@@ -2,7 +2,9 @@ name: Handle `approved-for-ci-run` label
# This workflow helps to run CI pipeline for PRs made by external contributors (from forks).
on:
pull_request:
pull_request_target:
branches:
- main
types:
# Default types that triggers a workflow ([1]):
# - [1] https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request
@@ -17,39 +19,83 @@ on:
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
BRANCH: "ci-run/pr-${{ github.event.pull_request.number }}"
permissions: write-all
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number }}
jobs:
remove-label:
# Remove `approved-for-ci-run` label if the workflow is triggered by changes in a PR.
# The PR should be reviewed and labelled manually again.
runs-on: [ ubuntu-latest ]
if: |
contains(fromJSON('["opened", "synchronize", "reopened", "closed"]'), github.event.action) &&
contains(github.event.pull_request.labels.*.name, 'approved-for-ci-run')
runs-on: ubuntu-latest
steps:
- run: gh pr --repo "${GITHUB_REPOSITORY}" edit "${PR_NUMBER}" --remove-label "approved-for-ci-run"
create-branch:
# Create a local branch for an `approved-for-ci-run` labelled PR to run CI pipeline in it.
runs-on: [ ubuntu-latest ]
create-or-update-pr-for-ci-run:
# Create local PR for an `approved-for-ci-run` labelled PR to run CI pipeline in it.
if: |
github.event.action == 'labeled' &&
contains(github.event.pull_request.labels.*.name, 'approved-for-ci-run')
runs-on: ubuntu-latest
steps:
- run: gh pr --repo "${GITHUB_REPOSITORY}" edit "${PR_NUMBER}" --remove-label "approved-for-ci-run"
- uses: actions/checkout@v3
with:
ref: main
token: ${{ secrets.CI_ACCESS_TOKEN }}
- run: gh pr checkout "${PR_NUMBER}"
- run: git checkout -b "ci-run/pr-${PR_NUMBER}"
- run: git checkout -b "${BRANCH}"
- run: git push --force origin "ci-run/pr-${PR_NUMBER}"
- run: git push --force origin "${BRANCH}"
- name: Create a Pull Request for CI run (if required)
env:
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: |
cat << EOF > body.md
This Pull Request is created automatically to run the CI pipeline for #${PR_NUMBER}
Please do not alter or merge/close it.
Feel free to review/comment/discuss the original PR #${PR_NUMBER}.
EOF
ALREADY_CREATED="$(gh pr --repo ${GITHUB_REPOSITORY} list --head ${HEAD} --base main --json number --jq '.[].number')"
if [ -z "${ALREADY_CREATED}" ]; then
gh pr --repo "${GITHUB_REPOSITORY}" create --title "CI run for PR #${PR_NUMBER}" \
--body-file "body.md" \
--head "${BRANCH}" \
--base "main" \
--draft
fi
cleanup:
# Close PRs and delete branchs if the original PR is closed.
if: |
github.event.action == 'closed' &&
github.event.pull_request.head.repo.full_name != github.repository
runs-on: ubuntu-latest
steps:
- run: |
CLOSED="$(gh pr --repo ${GITHUB_REPOSITORY} list --head ${HEAD} --json 'closed' --jq '.[].closed')"
if [ "${CLOSED}" == "false" ]; then
gh pr --repo "${GITHUB_REPOSITORY}" close "${BRANCH}" --delete-branch
fi

View File

@@ -117,6 +117,7 @@ jobs:
outputs:
pgbench-compare-matrix: ${{ steps.pgbench-compare-matrix.outputs.matrix }}
olap-compare-matrix: ${{ steps.olap-compare-matrix.outputs.matrix }}
tpch-compare-matrix: ${{ steps.tpch-compare-matrix.outputs.matrix }}
steps:
- name: Generate matrix for pgbench benchmark
@@ -136,11 +137,11 @@ jobs:
}'
if [ "$(date +%A)" = "Saturday" ]; then
matrix=$(echo $matrix | jq '.include += [{ "platform": "rds-postgres", "db_size": "10gb"},
matrix=$(echo "$matrix" | jq '.include += [{ "platform": "rds-postgres", "db_size": "10gb"},
{ "platform": "rds-aurora", "db_size": "50gb"}]')
fi
echo "matrix=$(echo $matrix | jq --compact-output '.')" >> $GITHUB_OUTPUT
echo "matrix=$(echo "$matrix" | jq --compact-output '.')" >> $GITHUB_OUTPUT
- name: Generate matrix for OLAP benchmarks
id: olap-compare-matrix
@@ -152,11 +153,30 @@ jobs:
}'
if [ "$(date +%A)" = "Saturday" ]; then
matrix=$(echo $matrix | jq '.include += [{ "platform": "rds-postgres" },
matrix=$(echo "$matrix" | jq '.include += [{ "platform": "rds-postgres" },
{ "platform": "rds-aurora" }]')
fi
echo "matrix=$(echo $matrix | jq --compact-output '.')" >> $GITHUB_OUTPUT
echo "matrix=$(echo "$matrix" | jq --compact-output '.')" >> $GITHUB_OUTPUT
- name: Generate matrix for TPC-H benchmarks
id: tpch-compare-matrix
run: |
matrix='{
"platform": [
"neon-captest-reuse"
],
"scale": [
"10"
]
}'
if [ "$(date +%A)" = "Saturday" ]; then
matrix=$(echo "$matrix" | jq '.include += [{ "platform": "rds-postgres", "scale": "10" },
{ "platform": "rds-aurora", "scale": "10" }]')
fi
echo "matrix=$(echo "$matrix" | jq --compact-output '.')" >> $GITHUB_OUTPUT
pgbench-compare:
needs: [ generate-matrices ]
@@ -233,7 +253,11 @@ jobs:
echo "connstr=${CONNSTR}" >> $GITHUB_OUTPUT
psql ${CONNSTR} -c "SELECT version();"
QUERY="SELECT version();"
if [[ "${PLATFORM}" = "neon"* ]]; then
QUERY="${QUERY} SHOW neon.tenant_id; SHOW neon.timeline_id;"
fi
psql ${CONNSTR} -c "${QUERY}"
- name: Benchmark init
uses: ./.github/actions/run-python-test-set
@@ -358,7 +382,11 @@ jobs:
echo "connstr=${CONNSTR}" >> $GITHUB_OUTPUT
psql ${CONNSTR} -c "SELECT version();"
QUERY="SELECT version();"
if [[ "${PLATFORM}" = "neon"* ]]; then
QUERY="${QUERY} SHOW neon.tenant_id; SHOW neon.timeline_id;"
fi
psql ${CONNSTR} -c "${QUERY}"
- name: ClickBench benchmark
uses: ./.github/actions/run-python-test-set
@@ -372,6 +400,7 @@ jobs:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
TEST_OLAP_SCALE: 10
- name: Create Allure report
if: ${{ !cancelled() }}
@@ -398,7 +427,7 @@ jobs:
strategy:
fail-fast: false
matrix: ${{ fromJson(needs.generate-matrices.outputs.olap-compare-matrix) }}
matrix: ${{ fromJson(needs.generate-matrices.outputs.tpch-compare-matrix) }}
env:
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
@@ -407,6 +436,7 @@ jobs:
BUILD_TYPE: remote
SAVE_PERF_REPORT: ${{ github.event.inputs.save_perf_report || ( github.ref_name == 'main' ) }}
PLATFORM: ${{ matrix.platform }}
TEST_OLAP_SCALE: ${{ matrix.scale }}
runs-on: [ self-hosted, us-east-2, x64 ]
container:
@@ -428,18 +458,17 @@ jobs:
${POSTGRES_DISTRIB_DIR}/v${DEFAULT_PG_VERSION}/bin/pgbench --version
echo "${POSTGRES_DISTRIB_DIR}/v${DEFAULT_PG_VERSION}/bin" >> $GITHUB_PATH
- name: Set up Connection String
id: set-up-connstr
- name: Get Connstring Secret Name
run: |
case "${PLATFORM}" in
neon-captest-reuse)
CONNSTR=${{ secrets.BENCHMARK_CAPTEST_TPCH_S10_CONNSTR }}
ENV_PLATFORM=CAPTEST_TPCH
;;
rds-aurora)
CONNSTR=${{ secrets.BENCHMARK_RDS_AURORA_TPCH_S10_CONNSTR }}
ENV_PLATFORM=RDS_AURORA_TPCH
;;
rds-postgres)
CONNSTR=${{ secrets.BENCHMARK_RDS_POSTGRES_TPCH_S10_CONNSTR }}
ENV_PLATFORM=RDS_AURORA_TPCH
;;
*)
echo >&2 "Unknown PLATFORM=${PLATFORM}. Allowed only 'neon-captest-reuse', 'rds-aurora', or 'rds-postgres'"
@@ -447,9 +476,21 @@ jobs:
;;
esac
CONNSTR_SECRET_NAME="BENCHMARK_${ENV_PLATFORM}_S${TEST_OLAP_SCALE}_CONNSTR"
echo "CONNSTR_SECRET_NAME=${CONNSTR_SECRET_NAME}" >> $GITHUB_ENV
- name: Set up Connection String
id: set-up-connstr
run: |
CONNSTR=${{ secrets[env.CONNSTR_SECRET_NAME] }}
echo "connstr=${CONNSTR}" >> $GITHUB_OUTPUT
psql ${CONNSTR} -c "SELECT version();"
QUERY="SELECT version();"
if [[ "${PLATFORM}" = "neon"* ]]; then
QUERY="${QUERY} SHOW neon.tenant_id; SHOW neon.timeline_id;"
fi
psql ${CONNSTR} -c "${QUERY}"
- name: Run TPC-H benchmark
uses: ./.github/actions/run-python-test-set
@@ -463,6 +504,7 @@ jobs:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
TEST_OLAP_SCALE: ${{ matrix.scale }}
- name: Create Allure report
if: ${{ !cancelled() }}
@@ -534,7 +576,11 @@ jobs:
echo "connstr=${CONNSTR}" >> $GITHUB_OUTPUT
psql ${CONNSTR} -c "SELECT version();"
QUERY="SELECT version();"
if [[ "${PLATFORM}" = "neon"* ]]; then
QUERY="${QUERY} SHOW neon.tenant_id; SHOW neon.timeline_id;"
fi
psql ${CONNSTR} -c "${QUERY}"
- name: Run user examples
uses: ./.github/actions/run-python-test-set

View File

@@ -5,7 +5,6 @@ on:
branches:
- main
- release
- ci-run/pr-*
pull_request:
defaults:
@@ -24,7 +23,30 @@ env:
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY_DEV }}
jobs:
check-permissions:
runs-on: ubuntu-latest
steps:
- name: Disallow PRs from forks
if: |
github.event_name == 'pull_request' &&
github.event.pull_request.head.repo.full_name != github.repository
run: |
if [ "${{ contains(fromJSON('["OWNER", "MEMBER", "COLLABORATOR"]'), github.event.pull_request.author_association) }}" = "true" ]; then
MESSAGE="Please create a PR from a branch of ${GITHUB_REPOSITORY} instead of a fork"
else
MESSAGE="The PR should be reviewed and labelled with 'approved-for-ci-run' to trigger a CI run"
fi
echo >&2 "We don't run CI for PRs from forks"
echo >&2 "${MESSAGE}"
exit 1
tag:
needs: [ check-permissions ]
runs-on: [ self-hosted, gen3, small ]
container: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/base:pinned
outputs:
@@ -53,6 +75,7 @@ jobs:
id: build-tag
check-codestyle-python:
needs: [ check-permissions ]
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
@@ -85,6 +108,7 @@ jobs:
run: poetry run mypy .
check-codestyle-rust:
needs: [ check-permissions ]
runs-on: [ self-hosted, gen3, large ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
@@ -151,6 +175,7 @@ jobs:
run: cargo deny check
build-neon:
needs: [ check-permissions ]
runs-on: [ self-hosted, gen3, large ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
@@ -187,7 +212,7 @@ jobs:
# Eventually it will be replaced by a regression test https://github.com/neondatabase/neon/pull/4603
FAILED=false
for postgres in postgres-v14 postgres-v15; do
for postgres in postgres-v14 postgres-v15 postgres-v16; do
expected=$(cat vendor/revisions.json | jq --raw-output '."'"${postgres}"'"')
actual=$(git rev-parse "HEAD:vendor/${postgres}")
if [ "${expected}" != "${actual}" ]; then
@@ -209,6 +234,10 @@ jobs:
id: pg_v15_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v15) >> $GITHUB_OUTPUT
- name: Set pg 16 revision for caching
id: pg_v16_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v16) >> $GITHUB_OUTPUT
# Set some environment variables used by all the steps.
#
# CARGO_FLAGS is extra options to pass to "cargo build", "cargo test" etc.
@@ -229,10 +258,12 @@ jobs:
cov_prefix=""
CARGO_FLAGS="--locked --release"
fi
echo "cov_prefix=${cov_prefix}" >> $GITHUB_ENV
echo "CARGO_FEATURES=${CARGO_FEATURES}" >> $GITHUB_ENV
echo "CARGO_FLAGS=${CARGO_FLAGS}" >> $GITHUB_ENV
echo "CARGO_HOME=${GITHUB_WORKSPACE}/.cargo" >> $GITHUB_ENV
{
echo "cov_prefix=${cov_prefix}"
echo "CARGO_FEATURES=${CARGO_FEATURES}"
echo "CARGO_FLAGS=${CARGO_FLAGS}"
echo "CARGO_HOME=${GITHUB_WORKSPACE}/.cargo"
} >> $GITHUB_ENV
# Disabled for now
# Don't include the ~/.cargo/registry/src directory. It contains just
@@ -267,6 +298,13 @@ jobs:
path: pg_install/v15
key: v1-${{ runner.os }}-${{ matrix.build_type }}-pg-${{ steps.pg_v15_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v16 build
id: cache_pg_16
uses: actions/cache@v3
with:
path: pg_install/v16
key: v1-${{ runner.os }}-${{ matrix.build_type }}-pg-${{ steps.pg_v16_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Build postgres v14
if: steps.cache_pg_14.outputs.cache-hit != 'true'
run: mold -run make postgres-v14 -j$(nproc)
@@ -275,6 +313,10 @@ jobs:
if: steps.cache_pg_15.outputs.cache-hit != 'true'
run: mold -run make postgres-v15 -j$(nproc)
- name: Build postgres v16
if: steps.cache_pg_16.outputs.cache-hit != 'true'
run: mold -run make postgres-v16 -j$(nproc)
- name: Build neon extensions
run: mold -run make neon-pg-ext -j$(nproc)
@@ -348,17 +390,17 @@ jobs:
uses: ./.github/actions/save-coverage-data
regress-tests:
needs: [ check-permissions, build-neon ]
runs-on: [ self-hosted, gen3, large ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
# Default shared memory is 64mb
options: --init --shm-size=512mb
needs: [ build-neon ]
strategy:
fail-fast: false
matrix:
build_type: [ debug, release ]
pg_version: [ v14, v15 ]
pg_version: [ v14, v15, v16 ]
steps:
- name: Checkout
uses: actions/checkout@v3
@@ -386,12 +428,12 @@ jobs:
uses: ./.github/actions/save-coverage-data
benchmarks:
needs: [ check-permissions, build-neon ]
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
# Default shared memory is 64mb
options: --init --shm-size=512mb
needs: [ build-neon ]
if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-benchmarks')
strategy:
fail-fast: false
@@ -418,12 +460,13 @@ jobs:
# while coverage is currently collected for the debug ones
create-test-report:
needs: [ check-permissions, regress-tests, coverage-report, benchmarks ]
if: ${{ !cancelled() && contains(fromJSON('["skipped", "success"]'), needs.check-permissions.result) }}
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
options: --init
needs: [ regress-tests, benchmarks ]
if: ${{ !cancelled() }}
steps:
- uses: actions/checkout@v3
@@ -432,6 +475,11 @@ jobs:
if: ${{ !cancelled() }}
id: create-allure-report
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
env:
REGRESS_TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR }}
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
- uses: actions/github-script@v6
if: ${{ !cancelled() }}
@@ -444,81 +492,40 @@ jobs:
reportJsonUrl: "${{ steps.create-allure-report.outputs.report-json-url }}",
}
const coverage = {
coverageUrl: "${{ needs.coverage-report.outputs.coverage-html }}",
summaryJsonUrl: "${{ needs.coverage-report.outputs.coverage-json }}",
}
const script = require("./scripts/comment-test-report.js")
await script({
github,
context,
fetch,
report,
coverage,
})
- name: Store Allure test stat in the DB
if: ${{ !cancelled() && steps.create-allure-report.outputs.report-json-url }}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
REPORT_JSON_URL: ${{ steps.create-allure-report.outputs.report-json-url }}
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR }}
run: |
./scripts/pysync
curl --fail --output suites.json "${REPORT_JSON_URL}"
export BUILD_TYPE=unified
export DATABASE_URL="$TEST_RESULT_CONNSTR"
poetry run python3 scripts/ingest_regress_test_result.py \
--revision ${COMMIT_SHA} \
--reference ${GITHUB_REF} \
--build-type ${BUILD_TYPE} \
--ingest suites.json
- name: Store Allure test stat in the DB (new)
if: ${{ !cancelled() && steps.create-allure-report.outputs.report-json-url }}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
REPORT_JSON_URL: ${{ steps.create-allure-report.outputs.report-json-url }}
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
BASE_S3_URL: ${{ steps.create-allure-report.outputs.base-s3-url }}
run: |
aws s3 cp --only-show-errors --recursive ${BASE_S3_URL}/data/test-cases ./test-cases
./scripts/pysync
export DATABASE_URL="$TEST_RESULT_CONNSTR"
poetry run python3 scripts/ingest_regress_test_result-new-format.py \
--reference ${GITHUB_REF} \
--revision ${COMMIT_SHA} \
--run-id ${GITHUB_RUN_ID} \
--run-attempt ${GITHUB_RUN_ATTEMPT} \
--test-cases-dir ./test-cases
coverage-report:
needs: [ check-permissions, regress-tests ]
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
options: --init
needs: [ regress-tests ]
strategy:
fail-fast: false
matrix:
build_type: [ debug ]
outputs:
coverage-html: ${{ steps.upload-coverage-report-new.outputs.report-url }}
coverage-json: ${{ steps.upload-coverage-report-new.outputs.summary-json }}
steps:
- name: Checkout
uses: actions/checkout@v3
with:
submodules: true
fetch-depth: 1
# Disabled for now
# - name: Restore cargo deps cache
# id: cache_cargo
# uses: actions/cache@v3
# with:
# path: |
# ~/.cargo/registry/
# !~/.cargo/registry/src
# ~/.cargo/git/
# target/
# key: v1-${{ runner.os }}-${{ matrix.build_type }}-cargo-${{ hashFiles('rust-toolchain.toml') }}-${{ hashFiles('Cargo.lock') }}
fetch-depth: 0
- name: Get Neon artifact
uses: ./.github/actions/download
@@ -561,13 +568,45 @@ jobs:
REPORT_URL=https://${BUCKET}.s3.amazonaws.com/code-coverage/${COMMIT_SHA}/index.html
echo "report-url=${REPORT_URL}" >> $GITHUB_OUTPUT
- name: Build coverage report NEW
id: upload-coverage-report-new
env:
BUCKET: neon-github-public-dev
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
run: |
BASELINE="$(git merge-base HEAD origin/main)"
CURRENT="${COMMIT_SHA}"
cp /tmp/coverage/report/lcov.info ./${CURRENT}.info
GENHTML_ARGS="--ignore-errors path,unmapped,empty --synthesize-missing --demangle-cpp rustfilt --output-directory lcov-html ${CURRENT}.info"
# Use differential coverage if the baseline coverage exists.
# It can be missing if the coverage repoer wasn't uploaded yet or tests has failed on BASELINE commit.
if aws s3 cp --only-show-errors s3://${BUCKET}/code-coverage/${BASELINE}/lcov.info ./${BASELINE}.info; then
git diff ${BASELINE} ${CURRENT} -- '*.rs' > baseline-current.diff
GENHTML_ARGS="--baseline-file ${BASELINE}.info --diff-file baseline-current.diff ${GENHTML_ARGS}"
fi
genhtml ${GENHTML_ARGS}
aws s3 cp --only-show-errors --recursive ./lcov-html/ s3://${BUCKET}/code-coverage/${COMMIT_SHA}/lcov
REPORT_URL=https://${BUCKET}.s3.amazonaws.com/code-coverage/${COMMIT_SHA}/lcov/index.html
echo "report-url=${REPORT_URL}" >> $GITHUB_OUTPUT
REPORT_URL=https://${BUCKET}.s3.amazonaws.com/code-coverage/${COMMIT_SHA}/lcov/summary.json
echo "summary-json=${REPORT_URL}" >> $GITHUB_OUTPUT
- uses: actions/github-script@v6
env:
REPORT_URL: ${{ steps.upload-coverage-report.outputs.report-url }}
REPORT_URL_NEW: ${{ steps.upload-coverage-report-new.outputs.report-url }}
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
with:
script: |
const { REPORT_URL, COMMIT_SHA } = process.env
const { REPORT_URL, REPORT_URL_NEW, COMMIT_SHA } = process.env
await github.rest.repos.createCommitStatus({
owner: context.repo.owner,
@@ -578,12 +617,21 @@ jobs:
context: 'Code coverage report',
})
await github.rest.repos.createCommitStatus({
owner: context.repo.owner,
repo: context.repo.repo,
sha: `${COMMIT_SHA}`,
state: 'success',
target_url: `${REPORT_URL_NEW}`,
context: 'Code coverage report NEW',
})
trigger-e2e-tests:
needs: [ check-permissions, promote-images, tag ]
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/base:pinned
options: --init
needs: [ promote-images, tag ]
steps:
- name: Set PR's status to pending and request a remote CI test
run: |
@@ -624,8 +672,8 @@ jobs:
}"
neon-image:
needs: [ check-permissions, tag ]
runs-on: [ self-hosted, gen3, large ]
needs: [ tag ]
container: gcr.io/kaniko-project/executor:v1.9.2-debug
defaults:
run:
@@ -672,7 +720,7 @@ jobs:
compute-tools-image:
runs-on: [ self-hosted, gen3, large ]
needs: [ tag ]
needs: [ check-permissions, tag ]
container: gcr.io/kaniko-project/executor:v1.9.2-debug
defaults:
run:
@@ -717,17 +765,17 @@ jobs:
run: rm -rf ~/.ecr
compute-node-image:
needs: [ check-permissions, tag ]
runs-on: [ self-hosted, gen3, large ]
container:
image: gcr.io/kaniko-project/executor:v1.9.2-debug
# Workaround for "Resolving download.osgeo.org (download.osgeo.org)... failed: Temporary failure in name resolution.""
# Should be prevented by https://github.com/neondatabase/neon/issues/4281
options: --add-host=download.osgeo.org:140.211.15.30
needs: [ tag ]
strategy:
fail-fast: false
matrix:
version: [ v14, v15 ]
version: [ v14, v15, v16 ]
defaults:
run:
shell: sh -eu {0}
@@ -771,50 +819,22 @@ jobs:
--destination neondatabase/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
--cleanup
# Due to a kaniko bug, we can't use cache for extensions image, thus it takes about the same amount of time as compute-node image to build (~10 min)
# During the transition period we need to have extensions in both places (in S3 and in compute-node image),
# so we won't build extension twice, but extract them from compute-node.
#
# For now we use extensions image only for new custom extensitons
- name: Kaniko build extensions only
run: |
# Kaniko is suposed to clean up after itself if --cleanup flag is set, but it doesn't.
# Despite some fixes were made in https://github.com/GoogleContainerTools/kaniko/pull/2504 (in kaniko v1.11.0),
# it still fails with error:
# error building image: could not save file: copying file: symlink postgres /kaniko/1/usr/local/pgsql/bin/postmaster: file exists
#
# Ref https://github.com/GoogleContainerTools/kaniko/issues/1406
find /kaniko -maxdepth 1 -mindepth 1 -type d -regex "/kaniko/[0-9]*" -exec rm -rv {} \;
/kaniko/executor --reproducible --snapshot-mode=redo --skip-unused-stages --cache=true \
--cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache \
--context . \
--build-arg GIT_VERSION=${{ github.event.pull_request.head.sha || github.sha }} \
--build-arg PG_VERSION=${{ matrix.version }} \
--build-arg BUILD_TAG=${{needs.tag.outputs.build-tag}} \
--build-arg REPOSITORY=369495373322.dkr.ecr.eu-central-1.amazonaws.com \
--dockerfile Dockerfile.compute-node \
--destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
--destination neondatabase/extensions-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
--cleanup \
--target postgres-extensions
# Cleanup script fails otherwise - rm: cannot remove '/nvme/actions-runner/_work/_temp/_github_home/.ecr': Permission denied
- name: Cleanup ECR folder
run: rm -rf ~/.ecr
vm-compute-node-image:
needs: [ check-permissions, tag, compute-node-image ]
runs-on: [ self-hosted, gen3, large ]
needs: [ tag, compute-node-image ]
strategy:
fail-fast: false
matrix:
version: [ v14, v15 ]
version: [ v14, v15, v16 ]
defaults:
run:
shell: sh -eu {0}
env:
VM_BUILDER_VERSION: v0.15.4
VM_BUILDER_VERSION: v0.17.5
steps:
- name: Checkout
@@ -835,14 +855,18 @@ jobs:
- name: Build vm image
run: |
./vm-builder -enable-file-cache -src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} -dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
./vm-builder \
-enable-file-cache \
-cgroup-uid=postgres \
-src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
-dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
- name: Pushing vm-compute-node image
run: |
docker push 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
test-images:
needs: [ tag, neon-image, compute-node-image, compute-tools-image ]
needs: [ check-permissions, tag, neon-image, compute-node-image, compute-tools-image ]
runs-on: [ self-hosted, gen3, small ]
steps:
@@ -885,8 +909,8 @@ jobs:
docker compose -f ./docker-compose/docker-compose.yml down
promote-images:
needs: [ check-permissions, tag, test-images, vm-compute-node-image ]
runs-on: [ self-hosted, gen3, small ]
needs: [ tag, test-images, vm-compute-node-image ]
container: golang:1.19-bullseye
# Don't add if-condition here.
# The job should always be run because we have dependant other jobs that shouldn't be skipped
@@ -906,6 +930,7 @@ jobs:
run: |
crane pull 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v14:${{needs.tag.outputs.build-tag}} vm-compute-node-v14
crane pull 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v15:${{needs.tag.outputs.build-tag}} vm-compute-node-v15
crane pull 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v16:${{needs.tag.outputs.build-tag}} vm-compute-node-v16
- name: Add latest tag to images
if: |
@@ -916,10 +941,10 @@ jobs:
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-tools:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v14:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-v14:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v15:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-v15:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v16:${{needs.tag.outputs.build-tag}} latest
crane tag 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v16:${{needs.tag.outputs.build-tag}} latest
- name: Push images to production ECR
if: |
@@ -930,10 +955,10 @@ jobs:
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-tools:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-tools:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v14:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v14:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-v14:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/extensions-v14:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v15:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v15:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/extensions-v15:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/extensions-v15:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v16:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v16:latest
crane copy 369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v16:${{needs.tag.outputs.build-tag}} 093970136003.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-v16:latest
- name: Configure Docker Hub login
run: |
@@ -945,6 +970,7 @@ jobs:
run: |
crane push vm-compute-node-v14 neondatabase/vm-compute-node-v14:${{needs.tag.outputs.build-tag}}
crane push vm-compute-node-v15 neondatabase/vm-compute-node-v15:${{needs.tag.outputs.build-tag}}
crane push vm-compute-node-v16 neondatabase/vm-compute-node-v16:${{needs.tag.outputs.build-tag}}
- name: Push latest tags to Docker Hub
if: |
@@ -955,66 +981,94 @@ jobs:
crane tag neondatabase/compute-tools:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/compute-node-v14:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/vm-compute-node-v14:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/extensions-v14:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/compute-node-v15:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/vm-compute-node-v15:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/extensions-v15:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/compute-node-v16:${{needs.tag.outputs.build-tag}} latest
crane tag neondatabase/vm-compute-node-v16:${{needs.tag.outputs.build-tag}} latest
- name: Cleanup ECR folder
run: rm -rf ~/.ecr
upload-postgres-extensions-to-s3:
if: |
(github.ref_name == 'main' || github.ref_name == 'release') &&
github.event_name != 'workflow_dispatch'
runs-on: ${{ github.ref_name == 'release' && fromJSON('["self-hosted", "prod", "x64"]') || fromJSON('["self-hosted", "gen3", "small"]') }}
needs: [ tag, promote-images ]
strategy:
fail-fast: false
matrix:
version: [ v14, v15 ]
env:
EXTENSIONS_IMAGE: ${{ github.ref_name == 'release' && '093970136003' || '369495373322'}}.dkr.ecr.eu-central-1.amazonaws.com/extensions-${{ matrix.version }}:latest
AWS_ACCESS_KEY_ID: ${{ github.ref_name == 'release' && secrets.AWS_ACCESS_KEY_PROD || secrets.AWS_ACCESS_KEY_DEV }}
AWS_SECRET_ACCESS_KEY: ${{ github.ref_name == 'release' && secrets.AWS_SECRET_KEY_PROD || secrets.AWS_SECRET_KEY_DEV }}
S3_BUCKETS: ${{ github.ref_name == 'release' && vars.S3_EXTENSIONS_BUCKETS_PROD || vars.S3_EXTENSIONS_BUCKETS_DEV }}
trigger-custom-extensions-build-and-wait:
needs: [ check-permissions, tag ]
runs-on: ubuntu-latest
steps:
- name: Pull postgres-extensions image
- name: Set PR's status to pending and request a remote CI test
run: |
docker pull ${EXTENSIONS_IMAGE}
COMMIT_SHA=${{ github.event.pull_request.head.sha || github.sha }}
REMOTE_REPO="${{ github.repository_owner }}/build-custom-extensions"
- name: Create postgres-extensions container
id: create-container
curl -f -X POST \
https://api.github.com/repos/${{ github.repository }}/statuses/$COMMIT_SHA \
-H "Accept: application/vnd.github.v3+json" \
--user "${{ secrets.CI_ACCESS_TOKEN }}" \
--data \
"{
\"state\": \"pending\",
\"context\": \"build-and-upload-extensions\",
\"description\": \"[$REMOTE_REPO] Remote CI job is about to start\"
}"
curl -f -X POST \
https://api.github.com/repos/$REMOTE_REPO/actions/workflows/build_and_upload_extensions.yml/dispatches \
-H "Accept: application/vnd.github.v3+json" \
--user "${{ secrets.CI_ACCESS_TOKEN }}" \
--data \
"{
\"ref\": \"main\",
\"inputs\": {
\"ci_job_name\": \"build-and-upload-extensions\",
\"commit_hash\": \"$COMMIT_SHA\",
\"remote_repo\": \"${{ github.repository }}\",
\"compute_image_tag\": \"${{ needs.tag.outputs.build-tag }}\",
\"remote_branch_name\": \"${{ github.ref_name }}\"
}
}"
- name: Wait for extension build to finish
env:
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: |
EID=$(docker create ${EXTENSIONS_IMAGE} true)
echo "EID=${EID}" >> $GITHUB_OUTPUT
TIMEOUT=1800 # 30 minutes, usually it takes ~2-3 minutes, but if runners are busy, it might take longer
INTERVAL=15 # try each N seconds
- name: Extract postgres-extensions from container
run: |
rm -rf ./extensions-to-upload # Just in case
mkdir -p extensions-to-upload
last_status="" # a variable to carry the last status of the "build-and-upload-extensions" context
docker cp ${{ steps.create-container.outputs.EID }}:/extensions/ ./extensions-to-upload/
docker cp ${{ steps.create-container.outputs.EID }}:/ext_index.json ./extensions-to-upload/
for ((i=0; i <= TIMEOUT; i+=INTERVAL)); do
sleep $INTERVAL
- name: Upload postgres-extensions to S3
run: |
for BUCKET in $(echo ${S3_BUCKETS:-[]} | jq --raw-output '.[]'); do
aws s3 cp --recursive --only-show-errors ./extensions-to-upload s3://${BUCKET}/${{ needs.tag.outputs.build-tag }}/${{ matrix.version }}
# Get statuses for the latest commit in the PR / branch
gh api \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"/repos/${{ github.repository }}/statuses/${{ github.event.pull_request.head.sha || github.sha }}" > statuses.json
# Get the latest status for the "build-and-upload-extensions" context
last_status=$(jq --raw-output '[.[] | select(.context == "build-and-upload-extensions")] | sort_by(.created_at)[-1].state' statuses.json)
if [ "${last_status}" = "pending" ]; then
# Extension build is still in progress.
continue
elif [ "${last_status}" = "success" ]; then
# Extension build is successful.
exit 0
else
# Status is neither "pending" nor "success", exit the loop and fail the job.
break
fi
done
- name: Cleanup
if: ${{ always() && steps.create-container.outputs.EID }}
run: |
docker rm ${{ steps.create-container.outputs.EID }} || true
# Extension build failed, print `statuses.json` for debugging and fail the job.
jq '.' statuses.json
echo >&2 "Status of extension build is '${last_status}' != 'success'"
exit 1
deploy:
needs: [ check-permissions, promote-images, tag, regress-tests, trigger-custom-extensions-build-and-wait ]
if: ( github.ref_name == 'main' || github.ref_name == 'release' ) && github.event_name != 'workflow_dispatch'
runs-on: [ self-hosted, gen3, small ]
container: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/ansible:latest
needs: [ upload-postgres-extensions-to-s3, promote-images, tag, regress-tests ]
if: ( github.ref_name == 'main' || github.ref_name == 'release' ) && github.event_name != 'workflow_dispatch'
steps:
- name: Fix git ownership
run: |
@@ -1052,20 +1106,35 @@ jobs:
# Retry script for 5XX server errors: https://github.com/actions/github-script#retries
retries: 5
script: |
github.rest.git.createRef({
await github.rest.git.createRef({
owner: context.repo.owner,
repo: context.repo.repo,
ref: "refs/tags/${{ needs.tag.outputs.build-tag }}",
sha: context.sha,
})
- name: Create GitHub release
if: github.ref_name == 'release'
uses: actions/github-script@v6
with:
# Retry script for 5XX server errors: https://github.com/actions/github-script#retries
retries: 5
script: |
await github.rest.repos.createRelease({
owner: context.repo.owner,
repo: context.repo.repo,
tag_name: "${{ needs.tag.outputs.build-tag }}",
generate_release_notes: true,
})
promote-compatibility-data:
needs: [ check-permissions, promote-images, tag, regress-tests ]
if: github.ref_name == 'release'
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/base:pinned
options: --init
needs: [ promote-images, tag, regress-tests ]
if: github.ref_name == 'release' && github.event_name != 'workflow_dispatch'
steps:
- name: Promote compatibility snapshot for the release
env:
@@ -1073,7 +1142,7 @@ jobs:
PREFIX: artifacts/latest
run: |
# Update compatibility snapshot for the release
for pg_version in v14 v15; do
for pg_version in v14 v15 v16; do
for build_type in debug release; do
OLD_FILENAME=compatibility-snapshot-${build_type}-pg${pg_version}-${GITHUB_RUN_ID}.tar.zst
NEW_FILENAME=compatibility-snapshot-${build_type}-pg${pg_version}.tar.zst

View File

@@ -4,7 +4,6 @@ on:
push:
branches:
- main
- ci-run/pr-*
pull_request:
defaults:
@@ -39,7 +38,7 @@ jobs:
fetch-depth: 1
- name: Install macOS postgres dependencies
run: brew install flex bison openssl protobuf
run: brew install flex bison openssl protobuf icu4c pkg-config
- name: Set pg 14 revision for caching
id: pg_v14_rev
@@ -49,6 +48,10 @@ jobs:
id: pg_v15_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v15) >> $GITHUB_OUTPUT
- name: Set pg 16 revision for caching
id: pg_v16_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v16) >> $GITHUB_OUTPUT
- name: Cache postgres v14 build
id: cache_pg_14
uses: actions/cache@v3
@@ -63,6 +66,13 @@ jobs:
path: pg_install/v15
key: v1-${{ runner.os }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v15_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v16 build
id: cache_pg_16
uses: actions/cache@v3
with:
path: pg_install/v16
key: v1-${{ runner.os }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v16_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Set extra env for macOS
run: |
echo 'LDFLAGS=-L/usr/local/opt/openssl@3/lib' >> $GITHUB_ENV
@@ -86,6 +96,10 @@ jobs:
if: steps.cache_pg_15.outputs.cache-hit != 'true'
run: make postgres-v15 -j$(nproc)
- name: Build postgres v16
if: steps.cache_pg_16.outputs.cache-hit != 'true'
run: make postgres-v16 -j$(nproc)
- name: Build neon extensions
run: make neon-pg-ext -j$(nproc)

29
.github/workflows/release-notify.yml vendored Normal file
View File

@@ -0,0 +1,29 @@
name: Notify Slack channel about upcoming release
concurrency:
group: ${{ github.workflow }}-${{ github.event.number }}
cancel-in-progress: true
on:
pull_request:
branches:
- release
types:
# Default types that triggers a workflow:
# - https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request
- opened
- synchronize
- reopened
# Additional types that we want to handle:
- closed
jobs:
notify:
runs-on: [ ubuntu-latest ]
steps:
- uses: neondatabase/dev-actions/release-pr-notify@main
with:
slack-token: ${{ secrets.SLACK_BOT_TOKEN }}
slack-channel-id: ${{ vars.SLACK_UPCOMING_RELEASE_CHANNEL_ID || 'C05QQ9J1BRC' }} # if not set, then `#test-release-notifications`
github-token: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -2,16 +2,19 @@ name: Create Release Branch
on:
schedule:
- cron: '0 10 * * 2'
- cron: '0 7 * * 2'
workflow_dispatch:
jobs:
create_release_branch:
runs-on: [ubuntu-latest]
runs-on: [ ubuntu-latest ]
permissions:
contents: write # for `git push`
steps:
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@v4
with:
ref: main
@@ -26,9 +29,16 @@ jobs:
run: git push origin releases/${{ steps.date.outputs.date }}
- name: Create pull request into release
uses: thomaseizinger/create-pull-request@e3972219c86a56550fb70708d96800d8e24ba862 # 1.3.0
with:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
head: releases/${{ steps.date.outputs.date }}
base: release
title: Release ${{ steps.date.outputs.date }}
env:
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: |
cat << EOF > body.md
## Release ${{ steps.date.outputs.date }}
**Please merge this PR using 'Create a merge commit'!**
EOF
gh pr create --title "Release ${{ steps.date.outputs.date }}" \
--body-file "body.md" \
--head "releases/${{ steps.date.outputs.date }}" \
--base "release"

4
.gitmodules vendored
View File

@@ -6,3 +6,7 @@
path = vendor/postgres-v15
url = https://github.com/neondatabase/postgres.git
branch = REL_15_STABLE_neon
[submodule "vendor/postgres-v16"]
path = vendor/postgres-v16
url = https://github.com/neondatabase/postgres.git
branch = REL_16_STABLE_neon

View File

@@ -1,11 +1,12 @@
/compute_tools/ @neondatabase/control-plane
/compute_tools/ @neondatabase/control-plane @neondatabase/compute
/control_plane/ @neondatabase/compute @neondatabase/storage
/libs/pageserver_api/ @neondatabase/compute @neondatabase/storage
/libs/postgres_ffi/ @neondatabase/compute
/libs/remote_storage/ @neondatabase/storage
/libs/safekeeper_api/ @neondatabase/safekeepers
/pageserver/ @neondatabase/compute @neondatabase/storage
/libs/postgres_ffi/ @neondatabase/compute
/libs/remote_storage/ @neondatabase/storage
/libs/safekeeper_api/ @neondatabase/safekeepers
/libs/vm_monitor/ @neondatabase/autoscaling @neondatabase/compute
/pageserver/ @neondatabase/compute @neondatabase/storage
/pgxn/ @neondatabase/compute
/proxy/ @neondatabase/control-plane
/proxy/ @neondatabase/proxy
/safekeeper/ @neondatabase/safekeepers
/vendor/ @neondatabase/compute

View File

@@ -27,3 +27,28 @@ your patch's fault. Help to fix the root cause if something else has
broken the CI, before pushing.
*Happy Hacking!*
# How to run a CI pipeline on Pull Requests from external contributors
_An instruction for maintainers_
## TL;DR:
- Review the PR
- If and only if it looks **safe** (i.e. it doesn't contain any malicious code which could expose secrets or harm the CI), then:
- Press the "Approve and run" button in GitHub UI
- Add the `approved-for-ci-run` label to the PR
Repeat all steps after any change to the PR.
- When the changes are ready to get merged — merge the original PR (not the internal one)
## Longer version:
GitHub Actions triggered by the `pull_request` event don't share repository secrets with the forks (for security reasons).
So, passing the CI pipeline on Pull Requests from external contributors is impossible.
We're using the following approach to make it work:
- After the review, assign the `approved-for-ci-run` label to the PR if changes look safe
- A GitHub Action will create an internal branch and a new PR with the same changes (for example, for a PR `#1234`, it'll be a branch `ci-run/pr-1234`)
- Because the PR is created from the internal branch, it is able to access repository secrets (that's why it's crucial to make sure that the PR doesn't contain any malicious code that could expose our secrets or intentionally harm the CI)
- The label gets removed automatically, so to run CI again with new changes, the label should be added again (after the review)
For details see [`approved-for-ci-run.yml`](.github/workflows/approved-for-ci-run.yml)

808
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,4 +1,5 @@
[workspace]
resolver = "2"
members = [
"compute_tools",
"control_plane",
@@ -7,6 +8,7 @@ members = [
"proxy",
"safekeeper",
"storage_broker",
"s3_scrubber",
"workspace_hack",
"trace",
"libs/compute_api",
@@ -23,6 +25,7 @@ members = [
"libs/remote_storage",
"libs/tracing-utils",
"libs/postgres_ffi/wal_craft",
"libs/vm_monitor",
]
[workspace.package]
@@ -36,17 +39,19 @@ async-compression = { version = "0.4.0", features = ["tokio", "gzip"] }
flate2 = "1.0.26"
async-stream = "0.3"
async-trait = "0.1"
aws-config = { version = "0.55", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.27"
aws-smithy-http = "0.55"
aws-credential-types = "0.55"
aws-types = "0.55"
aws-config = { version = "0.56", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.29"
aws-smithy-http = "0.56"
aws-credential-types = "0.56"
aws-types = "0.56"
axum = { version = "0.6.20", features = ["ws"] }
base64 = "0.13.0"
bincode = "1.3"
bindgen = "0.65"
bstr = "1.0"
byteorder = "1.4"
bytes = "1.0"
cfg-if = "1.0.0"
chrono = { version = "0.4", default-features = false, features = ["clock"] }
clap = { version = "4.0", features = ["derive"] }
close_fds = "0.3.2"
@@ -54,6 +59,7 @@ comfy-table = "6.1"
const_format = "0.2"
crc32c = "0.6"
crossbeam-utils = "0.8.5"
dashmap = "5.5.0"
either = "1.8"
enum-map = "2.4.2"
enumset = "1.0.12"
@@ -73,6 +79,7 @@ humantime = "2.1"
humantime-serde = "1.1.1"
hyper = "0.14"
hyper-tungstenite = "0.9"
inotify = "0.10.2"
itertools = "0.10"
jsonwebtoken = "8"
libc = "0.2"
@@ -88,7 +95,7 @@ opentelemetry = "0.19.0"
opentelemetry-otlp = { version = "0.12.0", default_features=false, features = ["http-proto", "trace", "http", "reqwest-client"] }
opentelemetry-semantic-conventions = "0.11.0"
parking_lot = "0.12"
pbkdf2 = "0.12.1"
pbkdf2 = { version = "0.12.1", features = ["simple", "std"] }
pin-project-lite = "0.2"
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
prost = "0.11"
@@ -100,16 +107,18 @@ reqwest-middleware = "0.2.0"
reqwest-retry = "0.2.2"
routerify = "3"
rpds = "0.13"
rustls = "0.20"
rustls = "0.21"
rustls-pemfile = "1"
rustls-split = "0.3"
scopeguard = "1.1"
sentry = { version = "0.30", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
sysinfo = "0.29.2"
sentry = { version = "0.31", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "2.0"
sha2 = "0.10.2"
signal-hook = "0.3"
smallvec = "1.11"
socket2 = "0.5"
strum = "0.24"
strum_macros = "0.24"
@@ -118,11 +127,11 @@ sync_wrapper = "0.1.2"
tar = "0.4"
test-context = "0.1"
thiserror = "1.0"
tls-listener = { version = "0.6", features = ["rustls", "hyper-h1"] }
tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] }
tokio = { version = "1.17", features = ["macros"] }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.9.0"
tokio-rustls = "0.23"
tokio-postgres-rustls = "0.10.0"
tokio-rustls = "0.24"
tokio-stream = "0.1"
tokio-tar = "0.3"
tokio-util = { version = "0.7", features = ["io"] }
@@ -132,11 +141,11 @@ tonic = {version = "0.9", features = ["tls", "tls-roots"]}
tracing = "0.1"
tracing-error = "0.2.0"
tracing-opentelemetry = "0.19.0"
tracing-subscriber = { version = "0.3", default_features = false, features = ["smallvec", "fmt", "tracing-log", "std", "env-filter"] }
tracing-subscriber = { version = "0.3", default_features = false, features = ["smallvec", "fmt", "tracing-log", "std", "env-filter", "json"] }
url = "2.2"
uuid = { version = "1.2", features = ["v4", "serde"] }
walkdir = "2.3.2"
webpki-roots = "0.23"
webpki-roots = "0.25"
x509-parser = "0.15"
## TODO replace this with tracing
@@ -168,14 +177,15 @@ storage_broker = { version = "0.1", path = "./storage_broker/" } # Note: main br
tenant_size_model = { version = "0.1", path = "./libs/tenant_size_model/" }
tracing-utils = { version = "0.1", path = "./libs/tracing-utils/" }
utils = { version = "0.1", path = "./libs/utils/" }
vm_monitor = { version = "0.1", path = "./libs/vm_monitor/" }
## Common library dependency
workspace_hack = { version = "0.1", path = "./workspace_hack/" }
## Build dependencies
criterion = "0.5.1"
rcgen = "0.10"
rstest = "0.17"
rcgen = "0.11"
rstest = "0.18"
tempfile = "3.4"
tonic-build = "0.9"

View File

@@ -12,6 +12,7 @@ WORKDIR /home/nonroot
COPY --chown=nonroot vendor/postgres-v14 vendor/postgres-v14
COPY --chown=nonroot vendor/postgres-v15 vendor/postgres-v15
COPY --chown=nonroot vendor/postgres-v16 vendor/postgres-v16
COPY --chown=nonroot pgxn pgxn
COPY --chown=nonroot Makefile Makefile
COPY --chown=nonroot scripts/ninstall.sh scripts/ninstall.sh
@@ -39,6 +40,7 @@ ARG CACHEPOT_BUCKET=neon-github-dev
COPY --from=pg-build /home/nonroot/pg_install/v14/include/postgresql/server pg_install/v14/include/postgresql/server
COPY --from=pg-build /home/nonroot/pg_install/v15/include/postgresql/server pg_install/v15/include/postgresql/server
COPY --from=pg-build /home/nonroot/pg_install/v16/include/postgresql/server pg_install/v16/include/postgresql/server
COPY --chown=nonroot . .
# Show build caching stats to check if it was used in the end.
@@ -65,6 +67,7 @@ RUN set -e \
&& apt install -y \
libreadline-dev \
libseccomp-dev \
libicu67 \
openssl \
ca-certificates \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
@@ -81,6 +84,7 @@ COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local
COPY --from=pg-build /home/nonroot/pg_install/v14 /usr/local/v14/
COPY --from=pg-build /home/nonroot/pg_install/v15 /usr/local/v15/
COPY --from=pg-build /home/nonroot/pg_install/v16 /usr/local/v16/
COPY --from=pg-build /home/nonroot/postgres_install.tar.gz /data/
# By default, pageserver uses `.neon/` working directory in WORKDIR, so create one and fill it with the dummy config.

View File

@@ -74,8 +74,8 @@ RUN wget https://gitlab.com/Oslandia/SFCGAL/-/archive/v1.3.10/SFCGAL-v1.3.10.tar
ENV PATH "/usr/local/pgsql/bin:$PATH"
RUN wget https://download.osgeo.org/postgis/source/postgis-3.3.2.tar.gz -O postgis.tar.gz && \
echo "9a2a219da005a1730a39d1959a1c7cec619b1efb009b65be80ffc25bad299068 postgis.tar.gz" | sha256sum --check && \
RUN wget https://download.osgeo.org/postgis/source/postgis-3.3.3.tar.gz -O postgis.tar.gz && \
echo "74eb356e3f85f14233791013360881b6748f78081cc688ff9d6f0f673a762d13 postgis.tar.gz" | sha256sum --check && \
mkdir postgis-src && cd postgis-src && tar xvzf ../postgis.tar.gz --strip-components=1 -C . && \
find /usr/local/pgsql -type f | sed 's|^/usr/local/pgsql/||' > /before.txt &&\
./autogen.sh && \
@@ -124,8 +124,8 @@ COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN apt update && \
apt install -y ninja-build python3-dev libncurses5 binutils clang
RUN wget https://github.com/plv8/plv8/archive/refs/tags/v3.1.5.tar.gz -O plv8.tar.gz && \
echo "1e108d5df639e4c189e1c5bdfa2432a521c126ca89e7e5a969d46899ca7bf106 plv8.tar.gz" | sha256sum --check && \
RUN wget https://github.com/plv8/plv8/archive/refs/tags/v3.1.8.tar.gz -O plv8.tar.gz && \
echo "92b10c7db39afdae97ff748c9ec54713826af222c459084ad002571b79eb3f49 plv8.tar.gz" | sha256sum --check && \
mkdir plv8-src && cd plv8-src && tar xvzf ../plv8.tar.gz --strip-components=1 -C . && \
export PATH="/usr/local/pgsql/bin:$PATH" && \
make DOCKER=1 -j $(getconf _NPROCESSORS_ONLN) install && \
@@ -172,8 +172,8 @@ RUN wget https://github.com/uber/h3/archive/refs/tags/v4.1.0.tar.gz -O h3.tar.gz
cp -R /h3/usr / && \
rm -rf build
RUN wget https://github.com/zachasme/h3-pg/archive/refs/tags/v4.1.2.tar.gz -O h3-pg.tar.gz && \
echo "c135aa45999b2ad1326d2537c1cadef96d52660838e4ca371706c08fdea1a956 h3-pg.tar.gz" | sha256sum --check && \
RUN wget https://github.com/zachasme/h3-pg/archive/refs/tags/v4.1.3.tar.gz -O h3-pg.tar.gz && \
echo "5c17f09a820859ffe949f847bebf1be98511fb8f1bd86f94932512c00479e324 h3-pg.tar.gz" | sha256sum --check && \
mkdir h3-pg-src && cd h3-pg-src && tar xvzf ../h3-pg.tar.gz --strip-components=1 -C . && \
export PATH="/usr/local/pgsql/bin:$PATH" && \
make -j $(getconf _NPROCESSORS_ONLN) && \
@@ -211,8 +211,8 @@ RUN wget https://github.com/df7cb/postgresql-unit/archive/refs/tags/7.7.tar.gz -
FROM build-deps AS vector-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.4.4.tar.gz -O pgvector.tar.gz && \
echo "1cb70a63f8928e396474796c22a20be9f7285a8a013009deb8152445b61b72e6 pgvector.tar.gz" | sha256sum --check && \
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.5.0.tar.gz -O pgvector.tar.gz && \
echo "d8aa3504b215467ca528525a6de12c3f85f9891b091ce0e5864dd8a9b757f77b pgvector.tar.gz" | sha256sum --check && \
mkdir pgvector-src && cd pgvector-src && tar xvzf ../pgvector.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
@@ -243,8 +243,8 @@ RUN wget https://github.com/michelp/pgjwt/archive/9742dab1b2f297ad3811120db7b214
FROM build-deps AS hypopg-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/HypoPG/hypopg/archive/refs/tags/1.3.1.tar.gz -O hypopg.tar.gz && \
echo "e7f01ee0259dc1713f318a108f987663d60f3041948c2ada57a94b469565ca8e hypopg.tar.gz" | sha256sum --check && \
RUN wget https://github.com/HypoPG/hypopg/archive/refs/tags/1.4.0.tar.gz -O hypopg.tar.gz && \
echo "0821011743083226fc9b813c1f2ef5897a91901b57b6bea85a78e466187c6819 hypopg.tar.gz" | sha256sum --check && \
mkdir hypopg-src && cd hypopg-src && tar xvzf ../hypopg.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
@@ -307,8 +307,8 @@ RUN wget https://github.com/theory/pgtap/archive/refs/tags/v1.2.0.tar.gz -O pgta
FROM build-deps AS ip4r-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/RhodiumToad/ip4r/archive/refs/tags/2.4.1.tar.gz -O ip4r.tar.gz && \
echo "78b9f0c1ae45c22182768fe892a32d533c82281035e10914111400bf6301c726 ip4r.tar.gz" | sha256sum --check && \
RUN wget https://github.com/RhodiumToad/ip4r/archive/refs/tags/2.4.2.tar.gz -O ip4r.tar.gz && \
echo "0f7b1f159974f49a47842a8ab6751aecca1ed1142b6d5e38d81b064b2ead1b4b ip4r.tar.gz" | sha256sum --check && \
mkdir ip4r-src && cd ip4r-src && tar xvzf ../ip4r.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
@@ -323,8 +323,8 @@ RUN wget https://github.com/RhodiumToad/ip4r/archive/refs/tags/2.4.1.tar.gz -O i
FROM build-deps AS prefix-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/dimitri/prefix/archive/refs/tags/v1.2.9.tar.gz -O prefix.tar.gz && \
echo "38d30a08d0241a8bbb8e1eb8f0152b385051665a8e621c8899e7c5068f8b511e prefix.tar.gz" | sha256sum --check && \
RUN wget https://github.com/dimitri/prefix/archive/refs/tags/v1.2.10.tar.gz -O prefix.tar.gz && \
echo "4342f251432a5f6fb05b8597139d3ccde8dcf87e8ca1498e7ee931ca057a8575 prefix.tar.gz" | sha256sum --check && \
mkdir prefix-src && cd prefix-src && tar xvzf ../prefix.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
@@ -339,8 +339,8 @@ RUN wget https://github.com/dimitri/prefix/archive/refs/tags/v1.2.9.tar.gz -O pr
FROM build-deps AS hll-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/citusdata/postgresql-hll/archive/refs/tags/v2.17.tar.gz -O hll.tar.gz && \
echo "9a18288e884f197196b0d29b9f178ba595b0dfc21fbf7a8699380e77fa04c1e9 hll.tar.gz" | sha256sum --check && \
RUN wget https://github.com/citusdata/postgresql-hll/archive/refs/tags/v2.18.tar.gz -O hll.tar.gz && \
echo "e2f55a6f4c4ab95ee4f1b4a2b73280258c5136b161fe9d059559556079694f0e hll.tar.gz" | sha256sum --check && \
mkdir hll-src && cd hll-src && tar xvzf ../hll.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
@@ -355,8 +355,8 @@ RUN wget https://github.com/citusdata/postgresql-hll/archive/refs/tags/v2.17.tar
FROM build-deps AS plpgsql-check-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/okbob/plpgsql_check/archive/refs/tags/v2.3.2.tar.gz -O plpgsql_check.tar.gz && \
echo "9d81167c4bbeb74eebf7d60147b21961506161addc2aee537f95ad8efeae427b plpgsql_check.tar.gz" | sha256sum --check && \
RUN wget https://github.com/okbob/plpgsql_check/archive/refs/tags/v2.4.0.tar.gz -O plpgsql_check.tar.gz && \
echo "9ba58387a279b35a3bfa39ee611e5684e6cddb2ba046ddb2c5190b3bd2ca254a plpgsql_check.tar.gz" | sha256sum --check && \
mkdir plpgsql_check-src && cd plpgsql_check-src && tar xvzf ../plpgsql_check.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
@@ -371,12 +371,21 @@ RUN wget https://github.com/okbob/plpgsql_check/archive/refs/tags/v2.3.2.tar.gz
FROM build-deps AS timescaledb-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ARG PG_VERSION
ENV PATH "/usr/local/pgsql/bin:$PATH"
RUN apt-get update && \
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
export TIMESCALEDB_VERSION=2.10.1 \
export TIMESCALEDB_CHECKSUM=6fca72a6ed0f6d32d2b3523951ede73dc5f9b0077b38450a029a5f411fdb8c73 \
;; \
*) \
echo "TimescaleDB not supported on this PostgreSQL version. See https://github.com/timescale/timescaledb/issues/5752" && exit 0;; \
esac && \
apt-get update && \
apt-get install -y cmake && \
wget https://github.com/timescale/timescaledb/archive/refs/tags/2.10.1.tar.gz -O timescaledb.tar.gz && \
echo "6fca72a6ed0f6d32d2b3523951ede73dc5f9b0077b38450a029a5f411fdb8c73 timescaledb.tar.gz" | sha256sum --check && \
wget https://github.com/timescale/timescaledb/archive/refs/tags/${TIMESCALEDB_VERSION}.tar.gz -O timescaledb.tar.gz && \
echo "${TIMESCALEDB_CHECKSUM} timescaledb.tar.gz" | sha256sum --check && \
mkdir timescaledb-src && cd timescaledb-src && tar xvzf ../timescaledb.tar.gz --strip-components=1 -C . && \
./bootstrap -DSEND_TELEMETRY_DEFAULT:BOOL=OFF -DUSE_TELEMETRY:BOOL=OFF -DAPACHE_ONLY:BOOL=ON -DCMAKE_BUILD_TYPE=Release && \
cd build && \
@@ -405,6 +414,10 @@ RUN case "${PG_VERSION}" in \
export PG_HINT_PLAN_VERSION=15_1_5_0 \
export PG_HINT_PLAN_CHECKSUM=564cbbf4820973ffece63fbf76e3c0af62c4ab23543142c7caaa682bc48918be \
;; \
"v16") \
export PG_HINT_PLAN_VERSION=16_1_6_0 \
export PG_HINT_PLAN_CHECKSUM=ce6a8040c78012000f5da7240caf6a971401412f41d33f930f09291e6c304b99 \
;; \
*) \
echo "Export the valid PG_HINT_PLAN_VERSION variable" && exit 1 \
;; \
@@ -452,8 +465,8 @@ FROM build-deps AS pg-cron-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/citusdata/pg_cron/archive/refs/tags/v1.5.2.tar.gz -O pg_cron.tar.gz && \
echo "6f7f0980c03f1e2a6a747060e67bf4a303ca2a50e941e2c19daeed2b44dec744 pg_cron.tar.gz" | sha256sum --check && \
RUN wget https://github.com/citusdata/pg_cron/archive/refs/tags/v1.6.0.tar.gz -O pg_cron.tar.gz && \
echo "383a627867d730222c272bfd25cd5e151c578d73f696d32910c7db8c665cc7db pg_cron.tar.gz" | sha256sum --check && \
mkdir pg_cron-src && cd pg_cron-src && tar xvzf ../pg_cron.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
@@ -479,8 +492,8 @@ RUN apt-get update && \
libfreetype6-dev
ENV PATH "/usr/local/pgsql/bin/:/usr/local/pgsql/:$PATH"
RUN wget https://github.com/rdkit/rdkit/archive/refs/tags/Release_2023_03_1.tar.gz -O rdkit.tar.gz && \
echo "db346afbd0ba52c843926a2a62f8a38c7b774ffab37eaf382d789a824f21996c rdkit.tar.gz" | sha256sum --check && \
RUN wget https://github.com/rdkit/rdkit/archive/refs/tags/Release_2023_03_3.tar.gz -O rdkit.tar.gz && \
echo "bdbf9a2e6988526bfeb8c56ce3cdfe2998d60ac289078e2215374288185e8c8d rdkit.tar.gz" | sha256sum --check && \
mkdir rdkit-src && cd rdkit-src && tar xvzf ../rdkit.tar.gz --strip-components=1 -C . && \
cmake \
-D RDK_BUILD_CAIRO_SUPPORT=OFF \
@@ -551,8 +564,16 @@ FROM build-deps AS pg-embedding-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/neondatabase/pg_embedding/archive/refs/tags/0.3.5.tar.gz -O pg_embedding.tar.gz && \
echo "0e95b27b8b6196e2cf0a0c9ec143fe2219b82e54c5bb4ee064e76398cbe69ae9 pg_embedding.tar.gz" | sha256sum --check && \
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
export PG_EMBEDDING_VERSION=0.3.5 \
export PG_EMBEDDING_CHECKSUM=0e95b27b8b6196e2cf0a0c9ec143fe2219b82e54c5bb4ee064e76398cbe69ae9 \
;; \
*) \
echo "pg_embedding not supported on this PostgreSQL version. Use pgvector instead." && exit 0;; \
esac && \
wget https://github.com/neondatabase/pg_embedding/archive/refs/tags/${PG_EMBEDDING_VERSION}.tar.gz -O pg_embedding.tar.gz && \
echo "${PG_EMBEDDING_CHECKSUM} pg_embedding.tar.gz" | sha256sum --check && \
mkdir pg_embedding-src && cd pg_embedding-src && tar xvzf ../pg_embedding.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
@@ -584,6 +605,10 @@ RUN wget https://gitlab.com/dalibo/postgresql_anonymizer/-/archive/1.1.0/postgre
# Layer "rust extensions"
# This layer is used to build `pgx` deps
#
# FIXME: This needs to be updated to latest version of 'pgrx' (it was renamed from
# 'pgx' to 'pgrx') for PostgreSQL 16. And that in turn requires bumping the pgx
# dependency on all the rust extension that depend on it, too.
#
#########################################################################################
FROM build-deps AS rust-extensions-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
@@ -598,7 +623,17 @@ USER nonroot
WORKDIR /home/nonroot
ARG PG_VERSION
RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && \
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "TODO: Not yet supported for PostgreSQL 16. Need to update pgrx dependencies" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version ${PG_VERSION}" && exit 1 \
;; \
esac && \
curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && \
chmod +x rustup-init && \
./rustup-init -y --no-modify-path --profile minimal --default-toolchain stable && \
rm rustup-init && \
@@ -615,10 +650,21 @@ USER root
#########################################################################################
FROM rust-extensions-build AS pg-jsonschema-pg-build
ARG PG_VERSION
# caeab60d70b2fd3ae421ec66466a3abbb37b7ee6 made on 06/03/2023
# there is no release tag yet, but we need it due to the superuser fix in the control file, switch to git tag after release >= 0.1.5
RUN wget https://github.com/supabase/pg_jsonschema/archive/caeab60d70b2fd3ae421ec66466a3abbb37b7ee6.tar.gz -O pg_jsonschema.tar.gz && \
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "TODO: Not yet supported for PostgreSQL 16. Need to update pgrx dependencies" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version \"${PG_VERSION}\"" && exit 1 \
;; \
esac && \
wget https://github.com/supabase/pg_jsonschema/archive/caeab60d70b2fd3ae421ec66466a3abbb37b7ee6.tar.gz -O pg_jsonschema.tar.gz && \
echo "54129ce2e7ee7a585648dbb4cef6d73f795d94fe72f248ac01119992518469a4 pg_jsonschema.tar.gz" | sha256sum --check && \
mkdir pg_jsonschema-src && cd pg_jsonschema-src && tar xvzf ../pg_jsonschema.tar.gz --strip-components=1 -C . && \
sed -i 's/pgx = "0.7.1"/pgx = { version = "0.7.3", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
@@ -633,12 +679,23 @@ RUN wget https://github.com/supabase/pg_jsonschema/archive/caeab60d70b2fd3ae421e
#########################################################################################
FROM rust-extensions-build AS pg-graphql-pg-build
ARG PG_VERSION
# b4988843647450a153439be367168ed09971af85 made on 22/02/2023 (from remove-pgx-contrib-spiext branch)
# Currently pgx version bump to >= 0.7.2 causes "call to unsafe function" compliation errors in
# pgx-contrib-spiext. There is a branch that removes that dependency, so use it. It is on the
# same 1.1 version we've used before.
RUN wget https://github.com/yrashk/pg_graphql/archive/b4988843647450a153439be367168ed09971af85.tar.gz -O pg_graphql.tar.gz && \
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "TODO: Not yet supported for PostgreSQL 16. Need to update pgrx dependencies" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version" && exit 1 \
;; \
esac && \
wget https://github.com/yrashk/pg_graphql/archive/b4988843647450a153439be367168ed09971af85.tar.gz -O pg_graphql.tar.gz && \
echo "0c7b0e746441b2ec24187d0e03555faf935c2159e2839bddd14df6dafbc8c9bd pg_graphql.tar.gz" | sha256sum --check && \
mkdir pg_graphql-src && cd pg_graphql-src && tar xvzf ../pg_graphql.tar.gz --strip-components=1 -C . && \
sed -i 's/pgx = "~0.7.1"/pgx = { version = "0.7.3", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
@@ -656,9 +713,20 @@ RUN wget https://github.com/yrashk/pg_graphql/archive/b4988843647450a153439be367
#########################################################################################
FROM rust-extensions-build AS pg-tiktoken-pg-build
ARG PG_VERSION
# 801f84f08c6881c8aa30f405fafbf00eec386a72 made on 10/03/2023
RUN wget https://github.com/kelvich/pg_tiktoken/archive/801f84f08c6881c8aa30f405fafbf00eec386a72.tar.gz -O pg_tiktoken.tar.gz && \
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "TODO: Not yet supported for PostgreSQL 16. Need to update pgrx dependencies" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version" && exit 1 \
;; \
esac && \
wget https://github.com/kelvich/pg_tiktoken/archive/801f84f08c6881c8aa30f405fafbf00eec386a72.tar.gz -O pg_tiktoken.tar.gz && \
echo "52f60ac800993a49aa8c609961842b611b6b1949717b69ce2ec9117117e16e4a pg_tiktoken.tar.gz" | sha256sum --check && \
mkdir pg_tiktoken-src && cd pg_tiktoken-src && tar xvzf ../pg_tiktoken.tar.gz --strip-components=1 -C . && \
cargo pgx install --release && \
@@ -672,8 +740,19 @@ RUN wget https://github.com/kelvich/pg_tiktoken/archive/801f84f08c6881c8aa30f405
#########################################################################################
FROM rust-extensions-build AS pg-pgx-ulid-build
ARG PG_VERSION
RUN wget https://github.com/pksunkara/pgx_ulid/archive/refs/tags/v0.1.0.tar.gz -O pgx_ulid.tar.gz && \
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "TODO: Not yet supported for PostgreSQL 16. Need to update pgrx dependencies" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version" && exit 1 \
;; \
esac && \
wget https://github.com/pksunkara/pgx_ulid/archive/refs/tags/v0.1.0.tar.gz -O pgx_ulid.tar.gz && \
echo "908b7358e6f846e87db508ae5349fb56a88ee6305519074b12f3d5b0ff09f791 pgx_ulid.tar.gz" | sha256sum --check && \
mkdir pgx_ulid-src && cd pgx_ulid-src && tar xvzf ../pgx_ulid.tar.gz --strip-components=1 -C . && \
sed -i 's/pgx = "=0.7.3"/pgx = { version = "0.7.3", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
@@ -726,6 +805,20 @@ RUN make -j $(getconf _NPROCESSORS_ONLN) \
PG_CONFIG=/usr/local/pgsql/bin/pg_config \
-C pgxn/neon_utils \
-s install && \
make -j $(getconf _NPROCESSORS_ONLN) \
PG_CONFIG=/usr/local/pgsql/bin/pg_config \
-C pgxn/neon_rmgr \
-s install && \
case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "Skipping HNSW for PostgreSQL 16" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version" && exit 1 \
;; \
esac && \
make -j $(getconf _NPROCESSORS_ONLN) \
PG_CONFIG=/usr/local/pgsql/bin/pg_config \
-C pgxn/hnsw \
@@ -764,29 +857,6 @@ RUN rm -r /usr/local/pgsql/include
# if they were to be used by other libraries.
RUN rm /usr/local/pgsql/lib/lib*.a
#########################################################################################
#
# Extenstion only
#
#########################################################################################
FROM python:3.9-slim-bullseye AS generate-ext-index
ARG PG_VERSION
ARG BUILD_TAG
RUN apt update && apt install -y zstd
# copy the control files here
COPY --from=kq-imcx-pg-build /extensions/ /extensions/
COPY --from=pg-anon-pg-build /extensions/ /extensions/
COPY --from=postgis-build /extensions/ /extensions/
COPY scripts/combine_control_files.py ./combine_control_files.py
RUN python3 ./combine_control_files.py ${PG_VERSION} ${BUILD_TAG} --public_extensions="anon,postgis"
FROM scratch AS postgres-extensions
# After the transition this layer will include all extensitons.
# As for now, it's only a couple for testing purposses
COPY --from=generate-ext-index /extensions/*.tar.zst /extensions/
COPY --from=generate-ext-index /ext_index.json /ext_index.json
#########################################################################################
#
# Final layer

View File

@@ -29,6 +29,7 @@ else ifeq ($(UNAME_S),Darwin)
# It can be configured with OPENSSL_PREFIX variable
OPENSSL_PREFIX ?= $(shell brew --prefix openssl@3)
PG_CONFIGURE_OPTS += --with-includes=$(OPENSSL_PREFIX)/include --with-libraries=$(OPENSSL_PREFIX)/lib
PG_CONFIGURE_OPTS += PKG_CONFIG_PATH=$(shell brew --prefix icu4c)/lib/pkgconfig
# macOS already has bison and flex in the system, but they are old and result in postgres-v14 target failure
# brew formulae are keg-only and not symlinked into HOMEBREW_PREFIX, force their usage
EXTRA_PATH_OVERRIDES += $(shell brew --prefix bison)/bin/:$(shell brew --prefix flex)/bin/:
@@ -83,6 +84,8 @@ $(POSTGRES_INSTALL_DIR)/build/%/config.status:
# I'm not sure why it wouldn't work, but this is the only place (apart from
# the "build-all-versions" entry points) where direct mention of PostgreSQL
# versions is used.
.PHONY: postgres-configure-v16
postgres-configure-v16: $(POSTGRES_INSTALL_DIR)/build/v16/config.status
.PHONY: postgres-configure-v15
postgres-configure-v15: $(POSTGRES_INSTALL_DIR)/build/v15/config.status
.PHONY: postgres-configure-v14
@@ -118,6 +121,10 @@ postgres-clean-%:
$(MAKE) -C $(POSTGRES_INSTALL_DIR)/build/$*/contrib/pageinspect clean
$(MAKE) -C $(POSTGRES_INSTALL_DIR)/build/$*/src/interfaces/libpq clean
.PHONY: postgres-check-%
postgres-check-%: postgres-%
$(MAKE) -C $(POSTGRES_INSTALL_DIR)/build/$* MAKELEVEL=0 check
.PHONY: neon-pg-ext-%
neon-pg-ext-%: postgres-%
+@echo "Compiling neon $*"
@@ -130,6 +137,11 @@ neon-pg-ext-%: postgres-%
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/$*/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
-C $(POSTGRES_INSTALL_DIR)/build/neon-walredo-$* \
-f $(ROOT_PROJECT_DIR)/pgxn/neon_walredo/Makefile install
+@echo "Compiling neon_rmgr $*"
mkdir -p $(POSTGRES_INSTALL_DIR)/build/neon-rmgr-$*
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/$*/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
-C $(POSTGRES_INSTALL_DIR)/build/neon-rmgr-$* \
-f $(ROOT_PROJECT_DIR)/pgxn/neon_rmgr/Makefile install
+@echo "Compiling neon_test_utils $*"
mkdir -p $(POSTGRES_INSTALL_DIR)/build/neon-test-utils-$*
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/$*/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
@@ -140,6 +152,13 @@ neon-pg-ext-%: postgres-%
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/$*/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
-C $(POSTGRES_INSTALL_DIR)/build/neon-utils-$* \
-f $(ROOT_PROJECT_DIR)/pgxn/neon_utils/Makefile install
# pg_embedding was temporarily released as hnsw from this repo, when we only
# supported PostgreSQL 14 and 15
neon-pg-ext-v14: neon-pg-ext-hnsw-v14
neon-pg-ext-v15: neon-pg-ext-hnsw-v15
neon-pg-ext-hnsw-%: postgres-headers-% postgres-%
+@echo "Compiling hnsw $*"
mkdir -p $(POSTGRES_INSTALL_DIR)/build/hnsw-$*
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/$*/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
@@ -167,28 +186,39 @@ neon-pg-ext-clean-%:
.PHONY: neon-pg-ext
neon-pg-ext: \
neon-pg-ext-v14 \
neon-pg-ext-v15
neon-pg-ext-v15 \
neon-pg-ext-v16
.PHONY: neon-pg-ext-clean
neon-pg-ext-clean: \
neon-pg-ext-clean-v14 \
neon-pg-ext-clean-v15
neon-pg-ext-clean-v15 \
neon-pg-ext-clean-v16
# shorthand to build all Postgres versions
.PHONY: postgres
postgres: \
postgres-v14 \
postgres-v15
postgres-v15 \
postgres-v16
.PHONY: postgres-headers
postgres-headers: \
postgres-headers-v14 \
postgres-headers-v15
postgres-headers-v15 \
postgres-headers-v16
.PHONY: postgres-clean
postgres-clean: \
postgres-clean-v14 \
postgres-clean-v15
postgres-clean-v15 \
postgres-clean-v16
.PHONY: postgres-check
postgres-check: \
postgres-check-v14 \
postgres-check-v15 \
postgres-check-v16
# This doesn't remove the effects of 'configure'.
.PHONY: clean

View File

@@ -29,18 +29,18 @@ See developer documentation in [SUMMARY.md](/docs/SUMMARY.md) for more informati
```bash
apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \
libcurl4-openssl-dev openssl python-poetry
libcurl4-openssl-dev openssl python-poetry lsof
```
* On Fedora, these packages are needed:
```bash
dnf install flex bison readline-devel zlib-devel openssl-devel \
libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \
protobuf-devel libcurl-devel openssl poetry
protobuf-devel libcurl-devel openssl poetry lsof
```
* On Arch based systems, these packages are needed:
```bash
pacman -S base-devel readline zlib libseccomp openssl clang \
postgresql-libs cmake postgresql protobuf curl
postgresql-libs cmake postgresql protobuf curl lsof
```
Building Neon requires 3.15+ version of `protoc` (protobuf-compiler). If your distribution provides an older version, you can install a newer version from [here](https://github.com/protocolbuffers/protobuf/releases).
@@ -55,7 +55,7 @@ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
1. Install XCode and dependencies
```
xcode-select --install
brew install protobuf openssl flex bison
brew install protobuf openssl flex bison icu4c pkg-config
# add openssl to PATH, required for ed25519 keys generation in neon_local
echo 'export PATH="$(brew --prefix openssl)/bin:$PATH"' >> ~/.zshrc

5
clippy.toml Normal file
View File

@@ -0,0 +1,5 @@
disallowed-methods = [
"tokio::task::block_in_place",
# Allow this for now, to deny it later once we stop using Handle::block_on completely
# "tokio::runtime::Handle::block_on",
]

View File

@@ -8,6 +8,7 @@ license.workspace = true
anyhow.workspace = true
async-compression.workspace = true
chrono.workspace = true
cfg-if.workspace = true
clap.workspace = true
flate2.workspace = true
futures.workspace = true
@@ -23,6 +24,7 @@ tar.workspace = true
reqwest = { workspace = true, features = ["json"] }
tokio = { workspace = true, features = ["rt", "rt-multi-thread"] }
tokio-postgres.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-opentelemetry.workspace = true
tracing-subscriber.workspace = true
@@ -34,4 +36,5 @@ utils.workspace = true
workspace_hack.workspace = true
toml_edit.workspace = true
remote_storage = { version = "0.1", path = "../libs/remote_storage/" }
vm_monitor = { version = "0.1", path = "../libs/vm_monitor/" }
zstd = "0.12.4"

View File

@@ -19,9 +19,10 @@ Also `compute_ctl` spawns two separate service threads:
- `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
last activity requests.
If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
downscaling and (eventually) will request immediate upscaling under resource pressure.
If `AUTOSCALING` environment variable is set, `compute_ctl` will start the
`vm-monitor` located in [`neon/libs/vm_monitor`]. For VM compute nodes,
`vm-monitor` communicates with the VM autoscaling system. It coordinates
downscaling and requests immediate upscaling under resource pressure.
Usage example:
```sh

View File

@@ -20,9 +20,10 @@
//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
//! last activity requests.
//!
//! If the `vm-informant` binary is present at `/bin/vm-informant`, it will also be started. For VM
//! compute nodes, `vm-informant` communicates with the VM autoscaling system. It coordinates
//! downscaling and (eventually) will request immediate upscaling under resource pressure.
//! If `AUTOSCALING` environment variable is set, `compute_ctl` will start the
//! `vm-monitor` located in [`neon/libs/vm_monitor`]. For VM compute nodes,
//! `vm-monitor` communicates with the VM autoscaling system. It coordinates
//! downscaling and requests immediate upscaling under resource pressure.
//!
//! Usage example:
//! ```sh
@@ -35,10 +36,9 @@
//!
use std::collections::HashMap;
use std::fs::File;
use std::panic;
use std::path::Path;
use std::process::exit;
use std::sync::{mpsc, Arc, Condvar, Mutex, OnceLock, RwLock};
use std::sync::{mpsc, Arc, Condvar, Mutex, RwLock};
use std::{thread, time::Duration};
use anyhow::{Context, Result};
@@ -147,6 +147,7 @@ fn main() -> Result<()> {
match spec_json {
// First, try to get cluster spec from the cli argument
Some(json) => {
info!("got spec from cli argument {}", json);
spec = Some(serde_json::from_str(json)?);
}
None => {
@@ -182,6 +183,7 @@ fn main() -> Result<()> {
if let Some(spec) = spec {
let pspec = ParsedSpec::try_from(spec).map_err(|msg| anyhow::anyhow!(msg))?;
info!("new pspec.spec: {:?}", pspec.spec);
new_state.pspec = Some(pspec);
spec_set = true;
} else {
@@ -196,9 +198,7 @@ fn main() -> Result<()> {
state: Mutex::new(new_state),
state_changed: Condvar::new(),
ext_remote_storage,
ext_remote_paths: OnceLock::new(),
ext_download_progress: RwLock::new(HashMap::new()),
library_index: OnceLock::new(),
build_tag,
};
let compute = Arc::new(compute_node);
@@ -271,6 +271,57 @@ fn main() -> Result<()> {
}
};
// Start the vm-monitor if directed to. The vm-monitor only runs on linux
// because it requires cgroups.
cfg_if::cfg_if! {
if #[cfg(target_os = "linux")] {
use std::env;
use tokio_util::sync::CancellationToken;
use tracing::warn;
let vm_monitor_addr = matches.get_one::<String>("vm-monitor-addr");
let file_cache_connstr = matches.get_one::<String>("filecache-connstr");
let cgroup = matches.get_one::<String>("cgroup");
let file_cache_on_disk = matches.get_flag("file-cache-on-disk");
// Only make a runtime if we need to.
// Note: it seems like you can make a runtime in an inner scope and
// if you start a task in it it won't be dropped. However, make it
// in the outermost scope just to be safe.
let rt = match (env::var_os("AUTOSCALING"), vm_monitor_addr) {
(None, None) => None,
(None, Some(_)) => {
warn!("--vm-monitor-addr option set but AUTOSCALING env var not present");
None
}
(Some(_), None) => {
panic!("AUTOSCALING env var present but --vm-monitor-addr option not set")
}
(Some(_), Some(_)) => Some(
tokio::runtime::Builder::new_multi_thread()
.worker_threads(4)
.enable_all()
.build()
.expect("failed to create tokio runtime for monitor"),
),
};
// This token is used internally by the monitor to clean up all threads
let token = CancellationToken::new();
let vm_monitor = &rt.as_ref().map(|rt| {
rt.spawn(vm_monitor::start(
Box::leak(Box::new(vm_monitor::Args {
cgroup: cgroup.cloned(),
pgconnstr: file_cache_connstr.cloned(),
addr: vm_monitor_addr.cloned().unwrap(),
file_cache_on_disk,
})),
token.clone(),
))
});
}
}
// Wait for the child Postgres process forever. In this state Ctrl+C will
// propagate to Postgres and it will be shut down as well.
if let Some(mut pg) = pg {
@@ -284,6 +335,24 @@ fn main() -> Result<()> {
exit_code = ecode.code()
}
// Terminate the vm_monitor so it releases the file watcher on
// /sys/fs/cgroup/neon-postgres.
// Note: the vm-monitor only runs on linux because it requires cgroups.
cfg_if::cfg_if! {
if #[cfg(target_os = "linux")] {
if let Some(handle) = vm_monitor {
// Kills all threads spawned by the monitor
token.cancel();
// Kills the actual task running the monitor
handle.abort();
// If handle is some, rt must have been used to produce it, and
// hence is also some
rt.unwrap().shutdown_timeout(Duration::from_secs(2));
}
}
}
// Maybe sync safekeepers again, to speed up next startup
let compute_state = compute.state.lock().unwrap().clone();
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
@@ -393,6 +462,34 @@ fn cli() -> clap::Command {
.long("remote-ext-config")
.value_name("REMOTE_EXT_CONFIG"),
)
// TODO(fprasx): we currently have default arguments because the cloud PR
// to pass them in hasn't been merged yet. We should get rid of them once
// the PR is merged.
.arg(
Arg::new("vm-monitor-addr")
.long("vm-monitor-addr")
.default_value("0.0.0.0:10301")
.value_name("VM_MONITOR_ADDR"),
)
.arg(
Arg::new("cgroup")
.long("cgroup")
.default_value("neon-postgres")
.value_name("CGROUP"),
)
.arg(
Arg::new("filecache-connstr")
.long("filecache-connstr")
.default_value(
"host=localhost port=5432 dbname=postgres user=cloud_admin sslmode=disable",
)
.value_name("FILECACHE_CONNSTR"),
)
.arg(
Arg::new("file-cache-on-disk")
.long("file-cache-on-disk")
.action(clap::ArgAction::SetTrue),
)
}
#[test]

View File

@@ -1,12 +1,39 @@
use anyhow::{anyhow, Result};
use anyhow::{anyhow, Ok, Result};
use postgres::Client;
use tokio_postgres::NoTls;
use tracing::{error, instrument};
use tracing::{error, instrument, warn};
use crate::compute::ComputeNode;
/// Create a special service table for availability checks
/// only if it does not exist already.
pub fn create_availability_check_data(client: &mut Client) -> Result<()> {
let query = "
DO $$
BEGIN
IF NOT EXISTS(
SELECT 1
FROM pg_catalog.pg_tables
WHERE tablename = 'health_check'
)
THEN
CREATE TABLE health_check (
id serial primary key,
updated_at timestamptz default now()
);
INSERT INTO health_check VALUES (1, now())
ON CONFLICT (id) DO UPDATE
SET updated_at = now();
END IF;
END
$$;";
client.execute(query, &[])?;
Ok(())
}
/// Update timestamp in a row in a special service table to check
/// that we can actually write some data in this particular timeline.
/// Create table if it's missing.
#[instrument(skip_all)]
pub async fn check_writability(compute: &ComputeNode) -> Result<()> {
// Connect to the database.
@@ -24,21 +51,28 @@ pub async fn check_writability(compute: &ComputeNode) -> Result<()> {
});
let query = "
CREATE TABLE IF NOT EXISTS health_check (
id serial primary key,
updated_at timestamptz default now()
);
INSERT INTO health_check VALUES (1, now())
ON CONFLICT (id) DO UPDATE
SET updated_at = now();";
let result = client.simple_query(query).await?;
if result.len() != 2 {
return Err(anyhow::format_err!(
"expected 2 query results, but got {}",
result.len()
));
match client.simple_query(query).await {
Result::Ok(result) => {
if result.len() != 1 {
return Err(anyhow::anyhow!(
"expected 1 query results, but got {}",
result.len()
));
}
}
Err(err) => {
if let Some(state) = err.code() {
if state == &tokio_postgres::error::SqlState::DISK_FULL {
warn!("Tenant disk is full");
return Ok(());
}
}
return Err(err.into());
}
}
Ok(())

View File

@@ -1,11 +1,12 @@
use std::collections::HashMap;
use std::env;
use std::fs;
use std::io::BufRead;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::{Command, Stdio};
use std::str::FromStr;
use std::sync::{Condvar, Mutex, OnceLock, RwLock};
use std::sync::{Condvar, Mutex, RwLock};
use std::time::Instant;
use anyhow::{Context, Result};
@@ -14,7 +15,6 @@ use futures::future::join_all;
use futures::stream::FuturesUnordered;
use futures::StreamExt;
use postgres::{Client, NoTls};
use regex::Regex;
use tokio;
use tokio_postgres;
use tracing::{error, info, instrument, warn};
@@ -27,6 +27,7 @@ use utils::measured_stream::MeasuredReader;
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use crate::checker::create_availability_check_data;
use crate::pg_helpers::*;
use crate::spec::*;
use crate::sync_sk::{check_if_synced, ping_safekeeper};
@@ -60,10 +61,6 @@ pub struct ComputeNode {
pub state_changed: Condvar,
/// the S3 bucket that we search for extensions in
pub ext_remote_storage: Option<GenericRemoteStorage>,
// (key: extension name, value: path to extension archive in remote storage)
pub ext_remote_paths: OnceLock<HashMap<String, RemotePath>>,
// (key: library name, value: name of extension containing this library)
pub library_index: OnceLock<HashMap<String, String>>,
// key: ext_archive_name, value: started download time, download_completed?
pub ext_download_progress: RwLock<HashMap<String, (DateTime<Utc>, bool)>>,
pub build_tag: String,
@@ -75,7 +72,6 @@ pub struct RemoteExtensionMetrics {
num_ext_downloaded: u64,
largest_ext_size: u64,
total_ext_download_size: u64,
prep_extensions_ms: u64,
}
#[derive(Clone, Debug)]
@@ -181,6 +177,27 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
}
}
/// If we are a VM, returns a [`Command`] that will run in the `neon-postgres`
/// cgroup. Otherwise returns the default `Command::new(cmd)`
///
/// This function should be used to start postgres, as it will start it in the
/// neon-postgres cgroup if we are a VM. This allows autoscaling to control
/// postgres' resource usage. The cgroup will exist in VMs because vm-builder
/// creates it during the sysinit phase of its inittab.
fn maybe_cgexec(cmd: &str) -> Command {
// The cplane sets this env var for autoscaling computes.
// use `var_os` so we don't have to worry about the variable being valid
// unicode. Should never be an concern . . . but just in case
if env::var_os("AUTOSCALING").is_some() {
let mut command = Command::new("cgexec");
command.args(["-g", "memory:neon-postgres"]);
command.arg(cmd);
command
} else {
Command::new(cmd)
}
}
/// Create special neon_superuser role, that's a slightly nerfed version of a real superuser
/// that we give to customers
fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
@@ -457,7 +474,7 @@ impl ComputeNode {
pub fn sync_safekeepers(&self, storage_auth_token: Option<String>) -> Result<Lsn> {
let start_time = Utc::now();
let sync_handle = Command::new(&self.pgbin)
let sync_handle = maybe_cgexec(&self.pgbin)
.args(["--sync-safekeepers"])
.env("PGDATA", &self.pgdata) // we cannot use -D in this mode
.envs(if let Some(storage_auth_token) = &storage_auth_token {
@@ -592,7 +609,7 @@ impl ComputeNode {
// Start postgres
info!("starting postgres");
let mut pg = Command::new(&self.pgbin)
let mut pg = maybe_cgexec(&self.pgbin)
.args(["-D", pgdata])
.spawn()
.expect("cannot start postgres process");
@@ -620,7 +637,7 @@ impl ComputeNode {
let pgdata_path = Path::new(&self.pgdata);
// Run postgres as a child process.
let mut pg = Command::new(&self.pgbin)
let mut pg = maybe_cgexec(&self.pgbin)
.args(["-D", &self.pgdata])
.envs(if let Some(storage_auth_token) = &storage_auth_token {
vec![("NEON_AUTH_TOKEN", storage_auth_token)]
@@ -680,6 +697,7 @@ impl ComputeNode {
handle_role_deletions(spec, self.connstr.as_str(), &mut client)?;
handle_grants(spec, self.connstr.as_str())?;
handle_extensions(spec, &mut client)?;
create_availability_check_data(&mut client)?;
// 'Close' connection
drop(client);
@@ -745,11 +763,19 @@ impl ComputeNode {
pspec.timeline_id,
);
info!(
"start_compute spec.remote_extensions {:?}",
pspec.spec.remote_extensions
);
// This part is sync, because we need to download
// remote shared_preload_libraries before postgres start (if any)
{
if let Some(remote_extensions) = &pspec.spec.remote_extensions {
// First, create control files for all availale extensions
extension_server::create_control_files(remote_extensions, &self.pgbin);
let library_load_start_time = Utc::now();
let remote_ext_metrics = self.prepare_preload_libraries(&compute_state)?;
let remote_ext_metrics = self.prepare_preload_libraries(&pspec.spec)?;
let library_load_time = Utc::now()
.signed_duration_since(library_load_start_time)
@@ -761,7 +787,6 @@ impl ComputeNode {
state.metrics.num_ext_downloaded = remote_ext_metrics.num_ext_downloaded;
state.metrics.largest_ext_size = remote_ext_metrics.largest_ext_size;
state.metrics.total_ext_download_size = remote_ext_metrics.total_ext_download_size;
state.metrics.prep_extensions_ms = remote_ext_metrics.prep_extensions_ms;
info!(
"Loading shared_preload_libraries took {:?}ms",
library_load_time
@@ -918,38 +943,11 @@ LIMIT 100",
}
}
// If remote extension storage is configured,
// download extension control files
pub async fn prepare_external_extensions(&self, compute_state: &ComputeState) -> Result<()> {
if let Some(ref ext_remote_storage) = self.ext_remote_storage {
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
let spec = &pspec.spec;
let custom_ext = spec.custom_extensions.clone().unwrap_or(Vec::new());
info!("custom extensions: {:?}", &custom_ext);
let (ext_remote_paths, library_index) = extension_server::get_available_extensions(
ext_remote_storage,
&self.pgbin,
&self.pgversion,
&custom_ext,
&self.build_tag,
)
.await?;
self.ext_remote_paths
.set(ext_remote_paths)
.expect("this is the only time we set ext_remote_paths");
self.library_index
.set(library_index)
.expect("this is the only time we set library_index");
}
Ok(())
}
// download an archive, unzip and place files in correct locations
pub async fn download_extension(
&self,
ext_name: &str,
is_library: bool,
real_ext_name: String,
ext_path: RemotePath,
) -> Result<u64, DownloadError> {
let remote_storage = self
.ext_remote_storage
@@ -958,35 +956,6 @@ LIMIT 100",
"Remote extensions storage is not configured",
)))?;
let mut real_ext_name = ext_name;
if is_library {
// sometimes library names might have a suffix like
// library.so or library.so.3. We strip this off
// because library_index is based on the name without the file extension
let strip_lib_suffix = Regex::new(r"\.so.*").unwrap();
let lib_raw_name = strip_lib_suffix.replace(real_ext_name, "").to_string();
real_ext_name = self
.library_index
.get()
.expect("must have already downloaded the library_index")
.get(&lib_raw_name)
.ok_or(DownloadError::BadInput(anyhow::anyhow!(
"library {} is not found",
lib_raw_name
)))?;
}
let ext_path = &self
.ext_remote_paths
.get()
.expect("error accessing ext_remote_paths")
.get(real_ext_name)
.ok_or(DownloadError::BadInput(anyhow::anyhow!(
"real_ext_name {} is not found",
real_ext_name
)))?;
let ext_archive_name = ext_path.object_name().expect("bad path");
let mut first_try = false;
@@ -1039,8 +1008,8 @@ LIMIT 100",
info!("downloading new extension {ext_archive_name}");
let download_size = extension_server::download_extension(
real_ext_name,
ext_path,
&real_ext_name,
&ext_path,
remote_storage,
&self.pgbin,
)
@@ -1058,18 +1027,19 @@ LIMIT 100",
#[tokio::main]
pub async fn prepare_preload_libraries(
&self,
compute_state: &ComputeState,
spec: &ComputeSpec,
) -> Result<RemoteExtensionMetrics> {
if self.ext_remote_storage.is_none() {
return Ok(RemoteExtensionMetrics {
num_ext_downloaded: 0,
largest_ext_size: 0,
total_ext_download_size: 0,
prep_extensions_ms: 0,
});
}
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
let spec = &pspec.spec;
let remote_extensions = spec
.remote_extensions
.as_ref()
.ok_or(anyhow::anyhow!("Remote extensions are not configured",))?;
info!("parse shared_preload_libraries from spec.cluster.settings");
let mut libs_vec = Vec::new();
@@ -1081,6 +1051,7 @@ LIMIT 100",
.collect();
}
info!("parse shared_preload_libraries from provided postgresql.conf");
// that is used in neon_local and python tests
if let Some(conf) = &spec.cluster.postgresql_conf {
let conf_lines = conf.split('\n').collect::<Vec<&str>>();
@@ -1101,30 +1072,17 @@ LIMIT 100",
libs_vec.extend(preload_libs_vec);
}
info!("Download ext_index.json, find the extension paths");
let prep_ext_start_time = Utc::now();
self.prepare_external_extensions(compute_state).await?;
let prep_ext_time_delta = Utc::now()
.signed_duration_since(prep_ext_start_time)
.to_std()
.unwrap()
.as_millis() as u64;
info!("Prepare extensions took {prep_ext_time_delta}ms");
// Don't try to download libraries that are not in the index.
// Assume that they are already present locally.
libs_vec.retain(|lib| {
self.library_index
.get()
.expect("error accessing ext_remote_paths")
.contains_key(lib)
});
libs_vec.retain(|lib| remote_extensions.library_index.contains_key(lib));
info!("Downloading to shared preload libraries: {:?}", &libs_vec);
let mut download_tasks = Vec::new();
for library in &libs_vec {
download_tasks.push(self.download_extension(library, true));
let (ext_name, ext_path) =
remote_extensions.get_ext(library, true, &self.build_tag, &self.pgversion)?;
download_tasks.push(self.download_extension(ext_name, ext_path));
}
let results = join_all(download_tasks).await;
@@ -1132,7 +1090,6 @@ LIMIT 100",
num_ext_downloaded: 0,
largest_ext_size: 0,
total_ext_download_size: 0,
prep_extensions_ms: prep_ext_time_delta,
};
for result in results {
let download_size = match result {

View File

@@ -46,8 +46,6 @@ pub fn write_postgres_conf(
writeln!(file, "{}", conf)?;
}
write!(file, "{}", &spec.cluster.settings.as_pg_settings())?;
// Add options for connecting to storage
writeln!(file, "# Neon storage settings")?;
if let Some(s) = &spec.pageserver_connstring {

View File

@@ -73,10 +73,10 @@ More specifically, here is an example ext_index.json
*/
use anyhow::Context;
use anyhow::{self, Result};
use futures::future::join_all;
use compute_api::spec::RemoteExtSpec;
use regex::Regex;
use remote_storage::*;
use serde_json;
use std::collections::HashMap;
use std::io::Read;
use std::num::{NonZeroU32, NonZeroUsize};
use std::path::Path;
@@ -107,89 +107,69 @@ fn get_pg_config(argument: &str, pgbin: &str) -> String {
pub fn get_pg_version(pgbin: &str) -> String {
// pg_config --version returns a (platform specific) human readable string
// such as "PostgreSQL 15.4". We parse this to v14/v15
// such as "PostgreSQL 15.4". We parse this to v14/v15/v16 etc.
let human_version = get_pg_config("--version", pgbin);
if human_version.contains("15") {
return "v15".to_string();
} else if human_version.contains("14") {
return "v14".to_string();
return parse_pg_version(&human_version).to_string();
}
fn parse_pg_version(human_version: &str) -> &str {
// Normal releases have version strings like "PostgreSQL 15.4". But there
// are also pre-release versions like "PostgreSQL 17devel" or "PostgreSQL
// 16beta2" or "PostgreSQL 17rc1". And with the --with-extra-version
// configure option, you can tack any string to the version number,
// e.g. "PostgreSQL 15.4foobar".
match Regex::new(r"^PostgreSQL (?<major>\d+).+")
.unwrap()
.captures(human_version)
{
Some(captures) if captures.len() == 2 => match &captures["major"] {
"14" => return "v14",
"15" => return "v15",
"16" => return "v16",
_ => {}
},
_ => {}
}
panic!("Unsuported postgres version {human_version}");
}
// download control files for enabled_extensions
// return Hashmaps converting library names to extension names (library_index)
// and specifying the remote path to the archive for each extension name
pub async fn get_available_extensions(
remote_storage: &GenericRemoteStorage,
pgbin: &str,
pg_version: &str,
custom_extensions: &[String],
build_tag: &str,
) -> Result<(HashMap<String, RemotePath>, HashMap<String, String>)> {
let local_sharedir = Path::new(&get_pg_config("--sharedir", pgbin)).join("extension");
let index_path = format!("{build_tag}/{pg_version}/ext_index.json");
let index_path = RemotePath::new(Path::new(&index_path)).context("error forming path")?;
info!("download ext_index.json from: {:?}", &index_path);
#[cfg(test)]
mod tests {
use super::parse_pg_version;
let mut download = remote_storage.download(&index_path).await?;
let mut ext_idx_buffer = Vec::new();
download
.download_stream
.read_to_end(&mut ext_idx_buffer)
.await?;
info!("ext_index downloaded");
#[test]
fn test_parse_pg_version() {
assert_eq!(parse_pg_version("PostgreSQL 15.4"), "v15");
assert_eq!(parse_pg_version("PostgreSQL 15.14"), "v15");
assert_eq!(
parse_pg_version("PostgreSQL 15.4 (Ubuntu 15.4-0ubuntu0.23.04.1)"),
"v15"
);
#[derive(Debug, serde::Deserialize)]
struct Index {
public_extensions: Vec<String>,
library_index: HashMap<String, String>,
extension_data: HashMap<String, ExtensionData>,
assert_eq!(parse_pg_version("PostgreSQL 14.15"), "v14");
assert_eq!(parse_pg_version("PostgreSQL 14.0"), "v14");
assert_eq!(
parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
"v14"
);
assert_eq!(parse_pg_version("PostgreSQL 16devel"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16beta1"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16rc2"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16extra"), "v16");
}
#[derive(Debug, serde::Deserialize)]
struct ExtensionData {
control_data: HashMap<String, String>,
archive_path: String,
#[test]
#[should_panic]
fn test_parse_pg_unsupported_version() {
parse_pg_version("PostgreSQL 13.14");
}
let ext_index_full = serde_json::from_slice::<Index>(&ext_idx_buffer)?;
let mut enabled_extensions = ext_index_full.public_extensions;
enabled_extensions.extend_from_slice(custom_extensions);
let mut library_index = ext_index_full.library_index;
let all_extension_data = ext_index_full.extension_data;
info!("library_index: {:?}", library_index);
info!("enabled_extensions: {:?}", enabled_extensions);
let mut ext_remote_paths = HashMap::new();
let mut file_create_tasks = Vec::new();
for extension in enabled_extensions {
let ext_data = &all_extension_data[&extension];
for (control_file, control_contents) in &ext_data.control_data {
let extension_name = control_file
.strip_suffix(".control")
.expect("control files must end in .control");
let control_path = local_sharedir.join(control_file);
if !control_path.exists() {
ext_remote_paths.insert(
extension_name.to_string(),
RemotePath::from_string(&ext_data.archive_path)?,
);
info!("writing file {:?}{:?}", control_path, control_contents);
file_create_tasks.push(tokio::fs::write(control_path, control_contents));
} else {
warn!("control file {:?} exists both locally and remotely. ignoring the remote version.", control_file);
// also delete this from library index
library_index.retain(|_, value| value != extension_name);
}
}
#[test]
#[should_panic]
fn test_parse_pg_incorrect_version_format() {
parse_pg_version("PostgreSQL 14");
}
let results = join_all(file_create_tasks).await;
for result in results {
result?;
}
info!("ext_remote_paths {:?}", ext_remote_paths);
Ok((ext_remote_paths, library_index))
}
// download the archive for a given extension,
@@ -253,6 +233,34 @@ pub async fn download_extension(
Ok(download_size)
}
// Create extension control files from spec
pub fn create_control_files(remote_extensions: &RemoteExtSpec, pgbin: &str) {
let local_sharedir = Path::new(&get_pg_config("--sharedir", pgbin)).join("extension");
for (ext_name, ext_data) in remote_extensions.extension_data.iter() {
// Check if extension is present in public or custom.
// If not, then it is not allowed to be used by this compute.
if let Some(public_extensions) = &remote_extensions.public_extensions {
if !public_extensions.contains(ext_name) {
if let Some(custom_extensions) = &remote_extensions.custom_extensions {
if !custom_extensions.contains(ext_name) {
continue; // skip this extension, it is not allowed
}
}
}
}
for (control_name, control_content) in &ext_data.control_data {
let control_path = local_sharedir.join(control_name);
if !control_path.exists() {
info!("writing file {:?}{:?}", control_path, control_content);
std::fs::write(control_path, control_content).unwrap();
} else {
warn!("control file {:?} exists both locally and remotely. ignoring the remote version.", control_path);
}
}
}
}
// This function initializes the necessary structs to use remote storage
pub fn init_remote_storage(remote_ext_config: &str) -> anyhow::Result<GenericRemoteStorage> {
#[derive(Debug, serde::Deserialize)]

View File

@@ -1,4 +1,6 @@
use std::convert::Infallible;
use std::net::IpAddr;
use std::net::Ipv6Addr;
use std::net::SocketAddr;
use std::sync::Arc;
use std::thread;
@@ -13,7 +15,7 @@ use hyper::{Body, Method, Request, Response, Server, StatusCode};
use num_cpus;
use serde_json;
use tokio::task;
use tracing::{error, info};
use tracing::{error, info, warn};
use tracing_utils::http::OtelName;
fn status_response_from_state(state: &ComputeState) -> ComputeStatusResponse {
@@ -126,6 +128,15 @@ async fn routes(req: Request<Body>, compute: &Arc<ComputeNode>) -> Response<Body
info!("serving {:?} POST request", route);
info!("req.uri {:?}", req.uri());
// don't even try to download extensions
// if no remote storage is configured
if compute.ext_remote_storage.is_none() {
info!("no extensions remote storage configured");
let mut resp = Response::new(Body::from("no remote storage configured"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
return resp;
}
let mut is_library = false;
if let Some(params) = req.uri().query() {
info!("serving {:?} POST request with params: {}", route, params);
@@ -137,24 +148,52 @@ async fn routes(req: Request<Body>, compute: &Arc<ComputeNode>) -> Response<Body
return resp;
}
}
let filename = route.split('/').last().unwrap().to_string();
info!("serving /extension_server POST request, filename: {filename:?} is_library: {is_library}");
// don't even try to download extensions
// if no remote storage is configured
if compute.ext_remote_storage.is_none() {
info!("no extensions remote storage configured");
let mut resp = Response::new(Body::from("no remote storage configured"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
return resp;
}
// get ext_name and path from spec
// don't lock compute_state for too long
let ext = {
let compute_state = compute.state.lock().unwrap();
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
let spec = &pspec.spec;
match compute.download_extension(&filename, is_library).await {
Ok(_) => Response::new(Body::from("OK")),
// debug only
info!("spec: {:?}", spec);
let remote_extensions = match spec.remote_extensions.as_ref() {
Some(r) => r,
None => {
info!("no remote extensions spec was provided");
let mut resp = Response::new(Body::from("no remote storage configured"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
return resp;
}
};
remote_extensions.get_ext(
&filename,
is_library,
&compute.build_tag,
&compute.pgversion,
)
};
match ext {
Ok((ext_name, ext_path)) => {
match compute.download_extension(ext_name, ext_path).await {
Ok(_) => Response::new(Body::from("OK")),
Err(e) => {
error!("extension download failed: {}", e);
let mut resp = Response::new(Body::from(e.to_string()));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
resp
}
}
}
Err(e) => {
error!("extension download failed: {}", e);
let mut resp = Response::new(Body::from(e.to_string()));
warn!("extension download failed to find extension: {}", e);
let mut resp = Response::new(Body::from("failed to find file"));
*resp.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
resp
}
@@ -261,7 +300,9 @@ fn render_json_error(e: &str, status: StatusCode) -> Response<Body> {
// Main Hyper HTTP server function that runs it and blocks waiting on it forever.
#[tokio::main]
async fn serve(port: u16, state: Arc<ComputeNode>) {
let addr = SocketAddr::from(([0, 0, 0, 0], port));
// this usually binds to both IPv4 and IPv6 on linux
// see e.g. https://github.com/rust-lang/rust/pull/34440
let addr = SocketAddr::new(IpAddr::from(Ipv6Addr::UNSPECIFIED), port);
let make_service = make_service_fn(move |_conn| {
let state = state.clone();

View File

@@ -6,4 +6,4 @@ pub const DEFAULT_LOG_LEVEL: &str = "info";
// https://www.postgresql.org/docs/15/auth-password.html
//
// So it's safe to set md5 here, as `control-plane` anyway uses SCRAM for all roles.
pub const PG_HBA_ALL_MD5: &str = "host\tall\t\tall\t\t0.0.0.0/0\t\tmd5";
pub const PG_HBA_ALL_MD5: &str = "host\tall\t\tall\t\tall\t\tmd5";

View File

@@ -12,6 +12,8 @@ git-version.workspace = true
nix.workspace = true
once_cell.workspace = true
postgres.workspace = true
hex.workspace = true
hyper.workspace = true
regex.workspace = true
reqwest = { workspace = true, features = ["blocking", "json"] }
serde.workspace = true
@@ -20,6 +22,7 @@ serde_with.workspace = true
tar.workspace = true
thiserror.workspace = true
toml.workspace = true
tokio.workspace = true
url.workspace = true
# Note: Do not directly depend on pageserver or safekeeper; use pageserver_api or safekeeper_api
# instead, so that recompile times are better.

View File

@@ -1,6 +1,7 @@
# Minimal neon environment with one safekeeper. This is equivalent to the built-in
# defaults that you get with no --config
[pageserver]
[[pageservers]]
id=1
listen_pg_addr = '127.0.0.1:64000'
listen_http_addr = '127.0.0.1:9898'
pg_auth_type = 'Trust'

View File

@@ -0,0 +1,105 @@
use crate::{background_process, local_env::LocalEnv};
use anyhow::anyhow;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::{path::PathBuf, process::Child};
use utils::id::{NodeId, TenantId};
pub struct AttachmentService {
env: LocalEnv,
listen: String,
path: PathBuf,
}
const COMMAND: &str = "attachment_service";
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct AttachHookRequest {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
pub pageserver_id: Option<NodeId>,
}
#[derive(Serialize, Deserialize)]
pub struct AttachHookResponse {
pub gen: Option<u32>,
}
impl AttachmentService {
pub fn from_env(env: &LocalEnv) -> Self {
let path = env.base_data_dir.join("attachments.json");
// Makes no sense to construct this if pageservers aren't going to use it: assume
// pageservers have control plane API set
let listen_url = env.control_plane_api.clone().unwrap();
let listen = format!(
"{}:{}",
listen_url.host_str().unwrap(),
listen_url.port().unwrap()
);
Self {
env: env.clone(),
path,
listen,
}
}
fn pid_file(&self) -> PathBuf {
self.env.base_data_dir.join("attachment_service.pid")
}
pub fn start(&self) -> anyhow::Result<Child> {
let path_str = self.path.to_string_lossy();
background_process::start_process(
COMMAND,
&self.env.base_data_dir,
&self.env.attachment_service_bin(),
["-l", &self.listen, "-p", &path_str],
[],
background_process::InitialPidFile::Create(&self.pid_file()),
// TODO: a real status check
|| Ok(true),
)
}
pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
background_process::stop_process(immediate, COMMAND, &self.pid_file())
}
/// Call into the attach_hook API, for use before handing out attachments to pageservers
pub fn attach_hook(
&self,
tenant_id: TenantId,
pageserver_id: NodeId,
) -> anyhow::Result<Option<u32>> {
use hyper::StatusCode;
let url = self
.env
.control_plane_api
.clone()
.unwrap()
.join("attach_hook")
.unwrap();
let client = reqwest::blocking::ClientBuilder::new()
.build()
.expect("Failed to construct http client");
let request = AttachHookRequest {
tenant_id,
pageserver_id: Some(pageserver_id),
};
let response = client.post(url).json(&request).send()?;
if response.status() != StatusCode::OK {
return Err(anyhow!("Unexpected status {}", response.status()));
}
let response = response.json::<AttachHookResponse>()?;
Ok(response.gen)
}
}

View File

@@ -0,0 +1,273 @@
/// The attachment service mimics the aspects of the control plane API
/// that are required for a pageserver to operate.
///
/// This enables running & testing pageservers without a full-blown
/// deployment of the Neon cloud platform.
///
use anyhow::anyhow;
use clap::Parser;
use hex::FromHex;
use hyper::StatusCode;
use hyper::{Body, Request, Response};
use serde::{Deserialize, Serialize};
use std::path::{Path, PathBuf};
use std::{collections::HashMap, sync::Arc};
use utils::logging::{self, LogFormat};
use utils::{
http::{
endpoint::{self},
error::ApiError,
json::{json_request, json_response},
RequestExt, RouterBuilder,
},
id::{NodeId, TenantId},
tcp_listener,
};
use pageserver_api::control_api::{
ReAttachRequest, ReAttachResponse, ReAttachResponseTenant, ValidateRequest, ValidateResponse,
ValidateResponseTenant,
};
use control_plane::attachment_service::{AttachHookRequest, AttachHookResponse};
#[derive(Parser)]
#[command(author, version, about, long_about = None)]
#[command(arg_required_else_help(true))]
struct Cli {
/// Host and port to listen on, like `127.0.0.1:1234`
#[arg(short, long)]
listen: std::net::SocketAddr,
/// Path to the .json file to store state (will be created if it doesn't exist)
#[arg(short, long)]
path: PathBuf,
}
// The persistent state of each Tenant
#[derive(Serialize, Deserialize, Clone)]
struct TenantState {
// Currently attached pageserver
pageserver: Option<NodeId>,
// Latest generation number: next time we attach, increment this
// and use the incremented number when attaching
generation: u32,
}
fn to_hex_map<S, V>(input: &HashMap<TenantId, V>, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
V: Clone + Serialize,
{
let transformed = input.iter().map(|(k, v)| (hex::encode(k), v.clone()));
transformed
.collect::<HashMap<String, V>>()
.serialize(serializer)
}
fn from_hex_map<'de, D, V>(deserializer: D) -> Result<HashMap<TenantId, V>, D::Error>
where
D: serde::de::Deserializer<'de>,
V: Deserialize<'de>,
{
let hex_map = HashMap::<String, V>::deserialize(deserializer)?;
hex_map
.into_iter()
.map(|(k, v)| {
TenantId::from_hex(k)
.map(|k| (k, v))
.map_err(serde::de::Error::custom)
})
.collect()
}
// Top level state available to all HTTP handlers
#[derive(Serialize, Deserialize)]
struct PersistentState {
#[serde(serialize_with = "to_hex_map", deserialize_with = "from_hex_map")]
tenants: HashMap<TenantId, TenantState>,
#[serde(skip)]
path: PathBuf,
}
impl PersistentState {
async fn save(&self) -> anyhow::Result<()> {
let bytes = serde_json::to_vec(self)?;
tokio::fs::write(&self.path, &bytes).await?;
Ok(())
}
async fn load(path: &Path) -> anyhow::Result<Self> {
let bytes = tokio::fs::read(path).await?;
let mut decoded = serde_json::from_slice::<Self>(&bytes)?;
decoded.path = path.to_owned();
Ok(decoded)
}
async fn load_or_new(path: &Path) -> Self {
match Self::load(path).await {
Ok(s) => {
tracing::info!("Loaded state file at {}", path.display());
s
}
Err(e)
if e.downcast_ref::<std::io::Error>()
.map(|e| e.kind() == std::io::ErrorKind::NotFound)
.unwrap_or(false) =>
{
tracing::info!("Will create state file at {}", path.display());
Self {
tenants: HashMap::new(),
path: path.to_owned(),
}
}
Err(e) => {
panic!("Failed to load state from '{}': {e:#} (maybe your .neon/ dir was written by an older version?)", path.display())
}
}
}
}
/// State available to HTTP request handlers
#[derive(Clone)]
struct State {
inner: Arc<tokio::sync::RwLock<PersistentState>>,
}
impl State {
fn new(persistent_state: PersistentState) -> State {
Self {
inner: Arc::new(tokio::sync::RwLock::new(persistent_state)),
}
}
}
#[inline(always)]
fn get_state(request: &Request<Body>) -> &State {
request
.data::<Arc<State>>()
.expect("unknown state type")
.as_ref()
}
/// Pageserver calls into this on startup, to learn which tenants it should attach
async fn handle_re_attach(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let reattach_req = json_request::<ReAttachRequest>(&mut req).await?;
let state = get_state(&req).inner.clone();
let mut locked = state.write().await;
let mut response = ReAttachResponse {
tenants: Vec::new(),
};
for (t, state) in &mut locked.tenants {
if state.pageserver == Some(reattach_req.node_id) {
state.generation += 1;
response.tenants.push(ReAttachResponseTenant {
id: *t,
generation: state.generation,
});
}
}
locked.save().await.map_err(ApiError::InternalServerError)?;
json_response(StatusCode::OK, response)
}
/// Pageserver calls into this before doing deletions, to confirm that it still
/// holds the latest generation for the tenants with deletions enqueued
async fn handle_validate(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let validate_req = json_request::<ValidateRequest>(&mut req).await?;
let locked = get_state(&req).inner.read().await;
let mut response = ValidateResponse {
tenants: Vec::new(),
};
for req_tenant in validate_req.tenants {
if let Some(tenant_state) = locked.tenants.get(&req_tenant.id) {
let valid = tenant_state.generation == req_tenant.gen;
response.tenants.push(ValidateResponseTenant {
id: req_tenant.id,
valid,
});
}
}
json_response(StatusCode::OK, response)
}
/// Call into this before attaching a tenant to a pageserver, to acquire a generation number
/// (in the real control plane this is unnecessary, because the same program is managing
/// generation numbers and doing attachments).
async fn handle_attach_hook(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let attach_req = json_request::<AttachHookRequest>(&mut req).await?;
let state = get_state(&req).inner.clone();
let mut locked = state.write().await;
let tenant_state = locked
.tenants
.entry(attach_req.tenant_id)
.or_insert_with(|| TenantState {
pageserver: attach_req.pageserver_id,
generation: 0,
});
if attach_req.pageserver_id.is_some() {
tenant_state.generation += 1;
}
let generation = tenant_state.generation;
locked.save().await.map_err(ApiError::InternalServerError)?;
json_response(
StatusCode::OK,
AttachHookResponse {
gen: attach_req.pageserver_id.map(|_| generation),
},
)
}
fn make_router(persistent_state: PersistentState) -> RouterBuilder<hyper::Body, ApiError> {
endpoint::make_router()
.data(Arc::new(State::new(persistent_state)))
.post("/re-attach", handle_re_attach)
.post("/validate", handle_validate)
.post("/attach_hook", handle_attach_hook)
}
#[tokio::main]
async fn main() -> anyhow::Result<()> {
logging::init(
LogFormat::Plain,
logging::TracingErrorLayerEnablement::Disabled,
)?;
let args = Cli::parse();
tracing::info!(
"Starting, state at {}, listening on {}",
args.path.to_string_lossy(),
args.listen
);
let persistent_state = PersistentState::load_or_new(&args.path).await;
let http_listener = tcp_listener::bind(args.listen)?;
let router = make_router(persistent_state)
.build()
.map_err(|err| anyhow!(err))?;
let service = utils::http::RouterService::new(router).unwrap();
let server = hyper::Server::from_tcp(http_listener)?.serve(service);
tracing::info!("Serving on {0}", args.listen);
server.await?;
Ok(())
}

View File

@@ -8,6 +8,7 @@
use anyhow::{anyhow, bail, Context, Result};
use clap::{value_parser, Arg, ArgAction, ArgMatches, Command};
use compute_api::spec::ComputeMode;
use control_plane::attachment_service::AttachmentService;
use control_plane::endpoint::ComputeControlPlane;
use control_plane::local_env::LocalEnv;
use control_plane::pageserver::PageServerNode;
@@ -43,14 +44,18 @@ project_git_version!(GIT_VERSION);
const DEFAULT_PG_VERSION: &str = "15";
const DEFAULT_PAGESERVER_CONTROL_PLANE_API: &str = "http://127.0.0.1:1234/";
fn default_conf() -> String {
format!(
r#"
# Default built-in configuration, defined in main.rs
control_plane_api = '{DEFAULT_PAGESERVER_CONTROL_PLANE_API}'
[broker]
listen_addr = '{DEFAULT_BROKER_ADDR}'
[pageserver]
[[pageservers]]
id = {DEFAULT_PAGESERVER_ID}
listen_pg_addr = '{DEFAULT_PAGESERVER_PG_ADDR}'
listen_http_addr = '{DEFAULT_PAGESERVER_HTTP_ADDR}'
@@ -61,6 +66,7 @@ http_auth_type = '{trust_auth}'
id = {DEFAULT_SAFEKEEPER_ID}
pg_port = {DEFAULT_SAFEKEEPER_PG_PORT}
http_port = {DEFAULT_SAFEKEEPER_HTTP_PORT}
"#,
trust_auth = AuthType::Trust,
)
@@ -107,6 +113,7 @@ fn main() -> Result<()> {
"start" => handle_start_all(sub_args, &env),
"stop" => handle_stop_all(sub_args, &env),
"pageserver" => handle_pageserver(sub_args, &env),
"attachment_service" => handle_attachment_service(sub_args, &env),
"safekeeper" => handle_safekeeper(sub_args, &env),
"endpoint" => handle_endpoint(sub_args, &env),
"pg" => bail!("'pg' subcommand has been renamed to 'endpoint'"),
@@ -252,7 +259,7 @@ fn get_timeline_infos(
env: &local_env::LocalEnv,
tenant_id: &TenantId,
) -> Result<HashMap<TimelineId, TimelineInfo>> {
Ok(PageServerNode::from_env(env)
Ok(get_default_pageserver(env)
.timeline_list(tenant_id)?
.into_iter()
.map(|timeline_info| (timeline_info.timeline_id, timeline_info))
@@ -313,17 +320,30 @@ fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
.context("Failed to initialize neon repository")?;
// Initialize pageserver, create initial tenant and timeline.
let pageserver = PageServerNode::from_env(&env);
pageserver
.initialize(&pageserver_config_overrides(init_match))
.unwrap_or_else(|e| {
eprintln!("pageserver init failed: {e:?}");
exit(1);
});
for ps_conf in &env.pageservers {
PageServerNode::from_env(&env, ps_conf)
.initialize(&pageserver_config_overrides(init_match))
.unwrap_or_else(|e| {
eprintln!("pageserver init failed: {e:?}");
exit(1);
});
}
Ok(env)
}
/// The default pageserver is the one where CLI tenant/timeline operations are sent by default.
/// For typical interactive use, one would just run with a single pageserver. Scenarios with
/// tenant/timeline placement across multiple pageservers are managed by python test code rather
/// than this CLI.
fn get_default_pageserver(env: &local_env::LocalEnv) -> PageServerNode {
let ps_conf = env
.pageservers
.first()
.expect("Config is validated to contain at least one pageserver");
PageServerNode::from_env(env, ps_conf)
}
fn pageserver_config_overrides(init_match: &ArgMatches) -> Vec<&str> {
init_match
.get_many::<String>("pageserver-config-override")
@@ -334,7 +354,7 @@ fn pageserver_config_overrides(init_match: &ArgMatches) -> Vec<&str> {
}
fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> anyhow::Result<()> {
let pageserver = PageServerNode::from_env(env);
let pageserver = get_default_pageserver(env);
match tenant_match.subcommand() {
Some(("list", _)) => {
for t in pageserver.tenant_list()? {
@@ -342,13 +362,25 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
}
}
Some(("create", create_match)) => {
let initial_tenant_id = parse_tenant_id(create_match)?;
let tenant_conf: HashMap<_, _> = create_match
.get_many::<String>("config")
.map(|vals| vals.flat_map(|c| c.split_once(':')).collect())
.unwrap_or_default();
let new_tenant_id = pageserver.tenant_create(initial_tenant_id, tenant_conf)?;
println!("tenant {new_tenant_id} successfully created on the pageserver");
// If tenant ID was not specified, generate one
let tenant_id = parse_tenant_id(create_match)?.unwrap_or_else(TenantId::generate);
let generation = if env.control_plane_api.is_some() {
// We must register the tenant with the attachment service, so
// that when the pageserver restarts, it will be re-attached.
let attachment_service = AttachmentService::from_env(env);
attachment_service.attach_hook(tenant_id, pageserver.conf.id)?
} else {
None
};
pageserver.tenant_create(tenant_id, generation, tenant_conf)?;
println!("tenant {tenant_id} successfully created on the pageserver");
// Create an initial timeline for the new tenant
let new_timeline_id = parse_timeline_id(create_match)?;
@@ -358,7 +390,7 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
.context("Failed to parse postgres version from the argument string")?;
let timeline_info = pageserver.timeline_create(
new_tenant_id,
tenant_id,
new_timeline_id,
None,
None,
@@ -369,17 +401,17 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
env.register_branch_mapping(
DEFAULT_BRANCH_NAME.to_string(),
new_tenant_id,
tenant_id,
new_timeline_id,
)?;
println!(
"Created an initial timeline '{new_timeline_id}' at Lsn {last_record_lsn} for tenant: {new_tenant_id}",
"Created an initial timeline '{new_timeline_id}' at Lsn {last_record_lsn} for tenant: {tenant_id}",
);
if create_match.get_flag("set-default") {
println!("Setting tenant {new_tenant_id} as a default one");
env.default_tenant_id = Some(new_tenant_id);
println!("Setting tenant {tenant_id} as a default one");
env.default_tenant_id = Some(tenant_id);
}
}
Some(("set-default", set_default_match)) => {
@@ -407,7 +439,7 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an
}
fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Result<()> {
let pageserver = PageServerNode::from_env(env);
let pageserver = get_default_pageserver(env);
match timeline_match.subcommand() {
Some(("list", list_match)) => {
@@ -484,6 +516,7 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) -
None,
pg_version,
ComputeMode::Primary,
DEFAULT_PAGESERVER_ID,
)?;
println!("Done");
}
@@ -537,7 +570,6 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
Some(ep_subcommand_data) => ep_subcommand_data,
None => bail!("no endpoint subcommand provided"),
};
let mut cplane = ComputeControlPlane::load(env.clone())?;
// All subcommands take an optional --tenant-id option
@@ -634,6 +666,13 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
.copied()
.unwrap_or(false);
let pageserver_id =
if let Some(id_str) = sub_args.get_one::<String>("endpoint-pageserver-id") {
NodeId(id_str.parse().context("while parsing pageserver id")?)
} else {
DEFAULT_PAGESERVER_ID
};
let mode = match (lsn, hot_standby) {
(Some(lsn), false) => ComputeMode::Static(lsn),
(None, true) => ComputeMode::Replica,
@@ -649,6 +688,7 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
http_port,
pg_version,
mode,
pageserver_id,
)?;
}
"start" => {
@@ -658,6 +698,13 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
.get_one::<String>("endpoint_id")
.ok_or_else(|| anyhow!("No endpoint ID was provided to start"))?;
let pageserver_id =
if let Some(id_str) = sub_args.get_one::<String>("endpoint-pageserver-id") {
NodeId(id_str.parse().context("while parsing pageserver id")?)
} else {
DEFAULT_PAGESERVER_ID
};
let remote_ext_config = sub_args.get_one::<String>("remote-ext-config");
// If --safekeepers argument is given, use only the listed safekeeper nodes.
@@ -677,7 +724,8 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
let endpoint = cplane.endpoints.get(endpoint_id.as_str());
let auth_token = if matches!(env.pageserver.pg_auth_type, AuthType::NeonJWT) {
let ps_conf = env.get_pageserver_conf(pageserver_id)?;
let auth_token = if matches!(ps_conf.pg_auth_type, AuthType::NeonJWT) {
let claims = Claims::new(Some(tenant_id), Scope::Tenant);
Some(env.generate_auth_token(&claims)?)
@@ -744,6 +792,7 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
http_port,
pg_version,
mode,
pageserver_id,
)?;
ep.start(&auth_token, safekeepers, remote_ext_config)?;
}
@@ -768,51 +817,94 @@ fn handle_endpoint(ep_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<(
}
fn handle_pageserver(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let pageserver = PageServerNode::from_env(env);
fn get_pageserver(env: &local_env::LocalEnv, args: &ArgMatches) -> Result<PageServerNode> {
let node_id = if let Some(id_str) = args.get_one::<String>("pageserver-id") {
NodeId(id_str.parse().context("while parsing pageserver id")?)
} else {
DEFAULT_PAGESERVER_ID
};
Ok(PageServerNode::from_env(
env,
env.get_pageserver_conf(node_id)?,
))
}
match sub_match.subcommand() {
Some(("start", start_match)) => {
if let Err(e) = pageserver.start(&pageserver_config_overrides(start_match)) {
Some(("start", subcommand_args)) => {
if let Err(e) = get_pageserver(env, subcommand_args)?
.start(&pageserver_config_overrides(subcommand_args))
{
eprintln!("pageserver start failed: {e}");
exit(1);
}
}
Some(("stop", subcommand_args)) => {
let immediate = subcommand_args
.get_one::<String>("stop-mode")
.map(|s| s.as_str())
== Some("immediate");
if let Err(e) = get_pageserver(env, subcommand_args)?.stop(immediate) {
eprintln!("pageserver stop failed: {}", e);
exit(1);
}
}
Some(("restart", subcommand_args)) => {
let pageserver = get_pageserver(env, subcommand_args)?;
//TODO what shutdown strategy should we use here?
if let Err(e) = pageserver.stop(false) {
eprintln!("pageserver stop failed: {}", e);
exit(1);
}
if let Err(e) = pageserver.start(&pageserver_config_overrides(subcommand_args)) {
eprintln!("pageserver start failed: {e}");
exit(1);
}
}
Some(("status", subcommand_args)) => {
match get_pageserver(env, subcommand_args)?.check_status() {
Ok(_) => println!("Page server is up and running"),
Err(err) => {
eprintln!("Page server is not available: {}", err);
exit(1);
}
}
}
Some((sub_name, _)) => bail!("Unexpected pageserver subcommand '{}'", sub_name),
None => bail!("no pageserver subcommand provided"),
}
Ok(())
}
fn handle_attachment_service(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let svc = AttachmentService::from_env(env);
match sub_match.subcommand() {
Some(("start", _start_match)) => {
if let Err(e) = svc.start() {
eprintln!("start failed: {e}");
exit(1);
}
}
Some(("stop", stop_match)) => {
let immediate = stop_match
.get_one::<String>("stop-mode")
.map(|s| s.as_str())
== Some("immediate");
if let Err(e) = pageserver.stop(immediate) {
eprintln!("pageserver stop failed: {}", e);
if let Err(e) = svc.stop(immediate) {
eprintln!("stop failed: {}", e);
exit(1);
}
}
Some(("restart", restart_match)) => {
//TODO what shutdown strategy should we use here?
if let Err(e) = pageserver.stop(false) {
eprintln!("pageserver stop failed: {}", e);
exit(1);
}
if let Err(e) = pageserver.start(&pageserver_config_overrides(restart_match)) {
eprintln!("pageserver start failed: {e}");
exit(1);
}
}
Some(("status", _)) => match PageServerNode::from_env(env).check_status() {
Ok(_) => println!("Page server is up and running"),
Err(err) => {
eprintln!("Page server is not available: {}", err);
exit(1);
}
},
Some((sub_name, _)) => bail!("Unexpected pageserver subcommand '{}'", sub_name),
None => bail!("no pageserver subcommand provided"),
Some((sub_name, _)) => bail!("Unexpected attachment_service subcommand '{}'", sub_name),
None => bail!("no attachment_service subcommand provided"),
}
Ok(())
}
@@ -825,6 +917,16 @@ fn get_safekeeper(env: &local_env::LocalEnv, id: NodeId) -> Result<SafekeeperNod
}
}
// Get list of options to append to safekeeper command invocation.
fn safekeeper_extra_opts(init_match: &ArgMatches) -> Vec<String> {
init_match
.get_many::<String>("safekeeper-extra-opt")
.into_iter()
.flatten()
.map(|s| s.to_owned())
.collect()
}
fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
let (sub_name, sub_args) = match sub_match.subcommand() {
Some(safekeeper_command_data) => safekeeper_command_data,
@@ -841,7 +943,9 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
match sub_name {
"start" => {
if let Err(e) = safekeeper.start() {
let extra_opts = safekeeper_extra_opts(sub_args);
if let Err(e) = safekeeper.start(extra_opts) {
eprintln!("safekeeper start failed: {}", e);
exit(1);
}
@@ -866,7 +970,8 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
exit(1);
}
if let Err(e) = safekeeper.start() {
let extra_opts = safekeeper_extra_opts(sub_args);
if let Err(e) = safekeeper.start(extra_opts) {
eprintln!("safekeeper start failed: {}", e);
exit(1);
}
@@ -884,16 +989,28 @@ fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow
broker::start_broker_process(env)?;
let pageserver = PageServerNode::from_env(env);
if let Err(e) = pageserver.start(&pageserver_config_overrides(sub_match)) {
eprintln!("pageserver {} start failed: {:#}", env.pageserver.id, e);
try_stop_all(env, true);
exit(1);
// Only start the attachment service if the pageserver is configured to need it
if env.control_plane_api.is_some() {
let attachment_service = AttachmentService::from_env(env);
if let Err(e) = attachment_service.start() {
eprintln!("attachment_service start failed: {:#}", e);
try_stop_all(env, true);
exit(1);
}
}
for ps_conf in &env.pageservers {
let pageserver = PageServerNode::from_env(env, ps_conf);
if let Err(e) = pageserver.start(&pageserver_config_overrides(sub_match)) {
eprintln!("pageserver {} start failed: {:#}", ps_conf.id, e);
try_stop_all(env, true);
exit(1);
}
}
for node in env.safekeepers.iter() {
let safekeeper = SafekeeperNode::from_env(env, node);
if let Err(e) = safekeeper.start() {
if let Err(e) = safekeeper.start(vec![]) {
eprintln!("safekeeper {} start failed: {:#}", safekeeper.id, e);
try_stop_all(env, false);
exit(1);
@@ -912,8 +1029,6 @@ fn handle_stop_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<
}
fn try_stop_all(env: &local_env::LocalEnv, immediate: bool) {
let pageserver = PageServerNode::from_env(env);
// Stop all endpoints
match ComputeControlPlane::load(env.clone()) {
Ok(cplane) => {
@@ -928,8 +1043,11 @@ fn try_stop_all(env: &local_env::LocalEnv, immediate: bool) {
}
}
if let Err(e) = pageserver.stop(immediate) {
eprintln!("pageserver {} stop failed: {:#}", env.pageserver.id, e);
for ps_conf in &env.pageservers {
let pageserver = PageServerNode::from_env(env, ps_conf);
if let Err(e) = pageserver.stop(immediate) {
eprintln!("pageserver {} stop failed: {:#}", ps_conf.id, e);
}
}
for node in env.safekeepers.iter() {
@@ -942,6 +1060,13 @@ fn try_stop_all(env: &local_env::LocalEnv, immediate: bool) {
if let Err(e) = broker::stop_broker_process(env) {
eprintln!("neon broker stop failed: {e:#}");
}
if env.control_plane_api.is_some() {
let attachment_service = AttachmentService::from_env(env);
if let Err(e) = attachment_service.stop(immediate) {
eprintln!("attachment service stop failed: {e:#}");
}
}
}
fn cli() -> Command {
@@ -956,6 +1081,24 @@ fn cli() -> Command {
let safekeeper_id_arg = Arg::new("id").help("safekeeper id").required(false);
// --id, when using a pageserver command
let pageserver_id_arg = Arg::new("pageserver-id")
.long("id")
.help("pageserver id")
.required(false);
// --pageserver-id when using a non-pageserver command
let endpoint_pageserver_id_arg = Arg::new("endpoint-pageserver-id")
.long("pageserver-id")
.required(false);
let safekeeper_extra_opt_arg = Arg::new("safekeeper-extra-opt")
.short('e')
.long("safekeeper-extra-opt")
.num_args(1)
.action(ArgAction::Append)
.help("Additional safekeeper invocation options, e.g. -e=--http-auth-public-key-path=foo")
.required(false);
let tenant_id_arg = Arg::new("tenant-id")
.long("tenant-id")
.help("Tenant id. Represented as a hexadecimal string 32 symbols length")
@@ -1112,10 +1255,24 @@ fn cli() -> Command {
.arg_required_else_help(true)
.about("Manage pageserver")
.subcommand(Command::new("status"))
.arg(pageserver_id_arg.clone())
.subcommand(Command::new("start").about("Start local pageserver")
.arg(pageserver_id_arg.clone())
.arg(pageserver_config_args.clone()))
.subcommand(Command::new("stop").about("Stop local pageserver")
.arg(pageserver_id_arg.clone())
.arg(stop_mode_arg.clone()))
.subcommand(Command::new("restart").about("Restart local pageserver")
.arg(pageserver_id_arg.clone())
.arg(pageserver_config_args.clone()))
)
.subcommand(
Command::new("attachment_service")
.arg_required_else_help(true)
.about("Manage attachment_service")
.subcommand(Command::new("start").about("Start local pageserver").arg(pageserver_config_args.clone()))
.subcommand(Command::new("stop").about("Stop local pageserver")
.arg(stop_mode_arg.clone()))
.subcommand(Command::new("restart").about("Restart local pageserver").arg(pageserver_config_args.clone()))
)
.subcommand(
Command::new("safekeeper")
@@ -1124,6 +1281,7 @@ fn cli() -> Command {
.subcommand(Command::new("start")
.about("Start local safekeeper")
.arg(safekeeper_id_arg.clone())
.arg(safekeeper_extra_opt_arg.clone())
)
.subcommand(Command::new("stop")
.about("Stop local safekeeper")
@@ -1134,6 +1292,7 @@ fn cli() -> Command {
.about("Restart local safekeeper")
.arg(safekeeper_id_arg)
.arg(stop_mode_arg.clone())
.arg(safekeeper_extra_opt_arg)
)
)
.subcommand(
@@ -1149,6 +1308,7 @@ fn cli() -> Command {
.arg(lsn_arg.clone())
.arg(pg_port_arg.clone())
.arg(http_port_arg.clone())
.arg(endpoint_pageserver_id_arg.clone())
.arg(
Arg::new("config-only")
.help("Don't do basebackup, create endpoint directory with only config files")
@@ -1166,6 +1326,7 @@ fn cli() -> Command {
.arg(lsn_arg)
.arg(pg_port_arg)
.arg(http_port_arg)
.arg(endpoint_pageserver_id_arg.clone())
.arg(pg_version_arg)
.arg(hot_standby_arg)
.arg(safekeepers_arg)

View File

@@ -70,6 +70,7 @@ pub struct EndpointConf {
http_port: u16,
pg_version: u32,
skip_pg_catalog_updates: bool,
pageserver_id: NodeId,
}
//
@@ -82,19 +83,16 @@ pub struct ComputeControlPlane {
pub endpoints: BTreeMap<String, Arc<Endpoint>>,
env: LocalEnv,
pageserver: Arc<PageServerNode>,
}
impl ComputeControlPlane {
// Load current endpoints from the endpoints/ subdirectories
pub fn load(env: LocalEnv) -> Result<ComputeControlPlane> {
let pageserver = Arc::new(PageServerNode::from_env(&env));
let mut endpoints = BTreeMap::default();
for endpoint_dir in std::fs::read_dir(env.endpoints_path())
.with_context(|| format!("failed to list {}", env.endpoints_path().display()))?
{
let ep = Endpoint::from_dir_entry(endpoint_dir?, &env, &pageserver)?;
let ep = Endpoint::from_dir_entry(endpoint_dir?, &env)?;
endpoints.insert(ep.endpoint_id.clone(), Arc::new(ep));
}
@@ -102,7 +100,6 @@ impl ComputeControlPlane {
base_port: 55431,
endpoints,
env,
pageserver,
})
}
@@ -125,20 +122,29 @@ impl ComputeControlPlane {
http_port: Option<u16>,
pg_version: u32,
mode: ComputeMode,
pageserver_id: NodeId,
) -> Result<Arc<Endpoint>> {
let pg_port = pg_port.unwrap_or_else(|| self.get_port());
let http_port = http_port.unwrap_or_else(|| self.get_port() + 1);
let pageserver =
PageServerNode::from_env(&self.env, self.env.get_pageserver_conf(pageserver_id)?);
let ep = Arc::new(Endpoint {
endpoint_id: endpoint_id.to_owned(),
pg_address: SocketAddr::new("127.0.0.1".parse().unwrap(), pg_port),
http_address: SocketAddr::new("127.0.0.1".parse().unwrap(), http_port),
env: self.env.clone(),
pageserver: Arc::clone(&self.pageserver),
pageserver,
timeline_id,
mode,
tenant_id,
pg_version,
skip_pg_catalog_updates: false,
// We don't setup roles and databases in the spec locally, so we don't need to
// do catalog updates. Catalog updates also include check availability
// data creation. Yet, we have tests that check that size and db dump
// before and after start are the same. So, skip catalog updates,
// with this we basically test a case of waking up an idle compute, where
// we also skip catalog updates in the cloud.
skip_pg_catalog_updates: true,
});
ep.create_endpoint_dir()?;
@@ -152,7 +158,8 @@ impl ComputeControlPlane {
http_port,
pg_port,
pg_version,
skip_pg_catalog_updates: false,
skip_pg_catalog_updates: true,
pageserver_id,
})?,
)?;
std::fs::write(
@@ -187,18 +194,14 @@ pub struct Endpoint {
// These are not part of the endpoint as such, but the environment
// the endpoint runs in.
pub env: LocalEnv,
pageserver: Arc<PageServerNode>,
pageserver: PageServerNode,
// Optimizations
skip_pg_catalog_updates: bool,
}
impl Endpoint {
fn from_dir_entry(
entry: std::fs::DirEntry,
env: &LocalEnv,
pageserver: &Arc<PageServerNode>,
) -> Result<Endpoint> {
fn from_dir_entry(entry: std::fs::DirEntry, env: &LocalEnv) -> Result<Endpoint> {
if !entry.file_type()?.is_dir() {
anyhow::bail!(
"Endpoint::from_dir_entry failed: '{}' is not a directory",
@@ -214,12 +217,15 @@ impl Endpoint {
let conf: EndpointConf =
serde_json::from_slice(&std::fs::read(entry.path().join("endpoint.json"))?)?;
let pageserver =
PageServerNode::from_env(env, env.get_pageserver_conf(conf.pageserver_id)?);
Ok(Endpoint {
pg_address: SocketAddr::new("127.0.0.1".parse().unwrap(), conf.pg_port),
http_address: SocketAddr::new("127.0.0.1".parse().unwrap(), conf.http_port),
endpoint_id,
env: env.clone(),
pageserver: Arc::clone(pageserver),
pageserver,
timeline_id: conf.timeline_id,
mode: conf.mode,
tenant_id: conf.tenant_id,
@@ -493,7 +499,7 @@ impl Endpoint {
pageserver_connstring: Some(pageserver_connstring),
safekeeper_connstrings,
storage_auth_token: auth_token.clone(),
custom_extensions: Some(vec![]),
remote_extensions: None,
};
let spec_path = self.endpoint_path().join("spec.json");
std::fs::write(spec_path, serde_json::to_string_pretty(&spec)?)?;

View File

@@ -7,6 +7,7 @@
// local installations.
//
pub mod attachment_service;
mod background_process;
pub mod broker;
pub mod endpoint;

View File

@@ -68,11 +68,17 @@ pub struct LocalEnv {
pub broker: NeonBroker,
pub pageserver: PageServerConf,
/// This Vec must always contain at least one pageserver
pub pageservers: Vec<PageServerConf>,
#[serde(default)]
pub safekeepers: Vec<SafekeeperConf>,
// Control plane location: if None, we will not run attachment_service. If set, this will
// be propagated into each pageserver's configuration.
#[serde(default)]
pub control_plane_api: Option<Url>,
/// Keep human-readable aliases in memory (and persist them to config), to hide ZId hex strings from the user.
#[serde(default)]
// A `HashMap<String, HashMap<TenantId, TimelineId>>` would be more appropriate here,
@@ -176,32 +182,28 @@ impl LocalEnv {
pub fn pg_distrib_dir(&self, pg_version: u32) -> anyhow::Result<PathBuf> {
let path = self.pg_distrib_dir.clone();
#[allow(clippy::manual_range_patterns)]
match pg_version {
14 => Ok(path.join(format!("v{pg_version}"))),
15 => Ok(path.join(format!("v{pg_version}"))),
14 | 15 | 16 => Ok(path.join(format!("v{pg_version}"))),
_ => bail!("Unsupported postgres version: {}", pg_version),
}
}
pub fn pg_bin_dir(&self, pg_version: u32) -> anyhow::Result<PathBuf> {
match pg_version {
14 => Ok(self.pg_distrib_dir(pg_version)?.join("bin")),
15 => Ok(self.pg_distrib_dir(pg_version)?.join("bin")),
_ => bail!("Unsupported postgres version: {}", pg_version),
}
Ok(self.pg_distrib_dir(pg_version)?.join("bin"))
}
pub fn pg_lib_dir(&self, pg_version: u32) -> anyhow::Result<PathBuf> {
match pg_version {
14 => Ok(self.pg_distrib_dir(pg_version)?.join("lib")),
15 => Ok(self.pg_distrib_dir(pg_version)?.join("lib")),
_ => bail!("Unsupported postgres version: {}", pg_version),
}
Ok(self.pg_distrib_dir(pg_version)?.join("lib"))
}
pub fn pageserver_bin(&self) -> PathBuf {
self.neon_distrib_dir.join("pageserver")
}
pub fn attachment_service_bin(&self) -> PathBuf {
self.neon_distrib_dir.join("attachment_service")
}
pub fn safekeeper_bin(&self) -> PathBuf {
self.neon_distrib_dir.join("safekeeper")
}
@@ -214,15 +216,23 @@ impl LocalEnv {
self.base_data_dir.join("endpoints")
}
// TODO: move pageserver files into ./pageserver
pub fn pageserver_data_dir(&self) -> PathBuf {
self.base_data_dir.clone()
pub fn pageserver_data_dir(&self, pageserver_id: NodeId) -> PathBuf {
self.base_data_dir
.join(format!("pageserver_{pageserver_id}"))
}
pub fn safekeeper_data_dir(&self, data_dir_name: &str) -> PathBuf {
self.base_data_dir.join("safekeepers").join(data_dir_name)
}
pub fn get_pageserver_conf(&self, id: NodeId) -> anyhow::Result<&PageServerConf> {
if let Some(conf) = self.pageservers.iter().find(|node| node.id == id) {
Ok(conf)
} else {
bail!("could not find pageserver {id}")
}
}
pub fn register_branch_mapping(
&mut self,
branch_name: String,
@@ -299,6 +309,10 @@ impl LocalEnv {
env.neon_distrib_dir = env::current_exe()?.parent().unwrap().to_owned();
}
if env.pageservers.is_empty() {
anyhow::bail!("Configuration must contain at least one pageserver");
}
env.base_data_dir = base_path();
Ok(env)
@@ -331,7 +345,7 @@ impl LocalEnv {
// We read that in, in `create_config`, and fill any missing defaults. Then it's saved
// to .neon/config. TODO: We lose any formatting and comments along the way, which is
// a bit sad.
let mut conf_content = r#"# This file describes a locale deployment of the page server
let mut conf_content = r#"# This file describes a local deployment of the page server
# and safekeeeper node. It is read by the 'neon_local' command-line
# utility.
"#
@@ -461,9 +475,9 @@ impl LocalEnv {
}
fn auth_keys_needed(&self) -> bool {
self.pageserver.pg_auth_type == AuthType::NeonJWT
|| self.pageserver.http_auth_type == AuthType::NeonJWT
|| self.safekeepers.iter().any(|sk| sk.auth_enabled)
self.pageservers.iter().any(|ps| {
ps.pg_auth_type == AuthType::NeonJWT || ps.http_auth_type == AuthType::NeonJWT
}) || self.safekeepers.iter().any(|sk| sk.auth_enabled)
}
}

View File

@@ -27,6 +27,7 @@ use utils::{
lsn::Lsn,
};
use crate::local_env::PageServerConf;
use crate::{background_process, local_env::LocalEnv};
#[derive(Error, Debug)]
@@ -76,43 +77,40 @@ impl ResponseErrorMessageExt for Response {
#[derive(Debug)]
pub struct PageServerNode {
pub pg_connection_config: PgConnectionConfig,
pub conf: PageServerConf,
pub env: LocalEnv,
pub http_client: Client,
pub http_base_url: String,
}
impl PageServerNode {
pub fn from_env(env: &LocalEnv) -> PageServerNode {
let (host, port) = parse_host_port(&env.pageserver.listen_pg_addr)
.expect("Unable to parse listen_pg_addr");
pub fn from_env(env: &LocalEnv, conf: &PageServerConf) -> PageServerNode {
let (host, port) =
parse_host_port(&conf.listen_pg_addr).expect("Unable to parse listen_pg_addr");
let port = port.unwrap_or(5432);
Self {
pg_connection_config: PgConnectionConfig::new_host_port(host, port),
conf: conf.clone(),
env: env.clone(),
http_client: Client::new(),
http_base_url: format!("http://{}/v1", env.pageserver.listen_http_addr),
http_base_url: format!("http://{}/v1", conf.listen_http_addr),
}
}
// pageserver conf overrides defined by neon_local configuration.
fn neon_local_overrides(&self) -> Vec<String> {
let id = format!("id={}", self.env.pageserver.id);
let id = format!("id={}", self.conf.id);
// FIXME: the paths should be shell-escaped to handle paths with spaces, quotas etc.
let pg_distrib_dir_param = format!(
"pg_distrib_dir='{}'",
self.env.pg_distrib_dir_raw().display()
);
let http_auth_type_param =
format!("http_auth_type='{}'", self.env.pageserver.http_auth_type);
let listen_http_addr_param = format!(
"listen_http_addr='{}'",
self.env.pageserver.listen_http_addr
);
let http_auth_type_param = format!("http_auth_type='{}'", self.conf.http_auth_type);
let listen_http_addr_param = format!("listen_http_addr='{}'", self.conf.listen_http_addr);
let pg_auth_type_param = format!("pg_auth_type='{}'", self.env.pageserver.pg_auth_type);
let listen_pg_addr_param =
format!("listen_pg_addr='{}'", self.env.pageserver.listen_pg_addr);
let pg_auth_type_param = format!("pg_auth_type='{}'", self.conf.pg_auth_type);
let listen_pg_addr_param = format!("listen_pg_addr='{}'", self.conf.listen_pg_addr);
let broker_endpoint_param = format!("broker_endpoint='{}'", self.env.broker.client_url());
@@ -126,10 +124,18 @@ impl PageServerNode {
broker_endpoint_param,
];
if self.env.pageserver.http_auth_type != AuthType::Trust
|| self.env.pageserver.pg_auth_type != AuthType::Trust
if let Some(control_plane_api) = &self.env.control_plane_api {
overrides.push(format!(
"control_plane_api='{}'",
control_plane_api.as_str()
));
}
if self.conf.http_auth_type != AuthType::Trust || self.conf.pg_auth_type != AuthType::Trust
{
overrides.push("auth_validation_public_key_path='auth_public_key.pem'".to_owned());
// Keys are generated in the toplevel repo dir, pageservers' workdirs
// are one level below that, so refer to keys with ../
overrides.push("auth_validation_public_key_path='../auth_public_key.pem'".to_owned());
}
overrides
}
@@ -137,16 +143,12 @@ impl PageServerNode {
/// Initializes a pageserver node by creating its config with the overrides provided.
pub fn initialize(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
// First, run `pageserver --init` and wait for it to write a config into FS and exit.
self.pageserver_init(config_overrides).with_context(|| {
format!(
"Failed to run init for pageserver node {}",
self.env.pageserver.id,
)
})
self.pageserver_init(config_overrides)
.with_context(|| format!("Failed to run init for pageserver node {}", self.conf.id,))
}
pub fn repo_path(&self) -> PathBuf {
self.env.pageserver_data_dir()
self.env.pageserver_data_dir(self.conf.id)
}
/// The pid file is created by the pageserver process, with its pid stored inside.
@@ -162,7 +164,7 @@ impl PageServerNode {
fn pageserver_init(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
let datadir = self.repo_path();
let node_id = self.env.pageserver.id;
let node_id = self.conf.id;
println!(
"Initializing pageserver node {} at '{}' in {:?}",
node_id,
@@ -171,6 +173,10 @@ impl PageServerNode {
);
io::stdout().flush()?;
if !datadir.exists() {
std::fs::create_dir(&datadir)?;
}
let datadir_path_str = datadir.to_str().with_context(|| {
format!("Cannot start pageserver node {node_id} in path that has no string representation: {datadir:?}")
})?;
@@ -201,7 +207,7 @@ impl PageServerNode {
let datadir = self.repo_path();
print!(
"Starting pageserver node {} at '{}' in {:?}",
self.env.pageserver.id,
self.conf.id,
self.pg_connection_config.raw_address(),
datadir
);
@@ -210,7 +216,7 @@ impl PageServerNode {
let datadir_path_str = datadir.to_str().with_context(|| {
format!(
"Cannot start pageserver node {} in path that has no string representation: {:?}",
self.env.pageserver.id, datadir,
self.conf.id, datadir,
)
})?;
let mut args = self.pageserver_basic_args(config_overrides, datadir_path_str);
@@ -254,7 +260,7 @@ impl PageServerNode {
// FIXME: why is this tied to pageserver's auth type? Whether or not the safekeeper
// needs a token, and how to generate that token, seems independent to whether
// the pageserver requires a token in incoming requests.
Ok(if self.env.pageserver.http_auth_type != AuthType::Trust {
Ok(if self.conf.http_auth_type != AuthType::Trust {
// Generate a token to connect from the pageserver to a safekeeper
let token = self
.env
@@ -279,7 +285,7 @@ impl PageServerNode {
pub fn page_server_psql_client(&self) -> anyhow::Result<postgres::Client> {
let mut config = self.pg_connection_config.clone();
if self.env.pageserver.pg_auth_type == AuthType::NeonJWT {
if self.conf.pg_auth_type == AuthType::NeonJWT {
let token = self
.env
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
@@ -290,7 +296,7 @@ impl PageServerNode {
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> anyhow::Result<RequestBuilder> {
let mut builder = self.http_client.request(method, url);
if self.env.pageserver.http_auth_type == AuthType::NeonJWT {
if self.conf.http_auth_type == AuthType::NeonJWT {
let token = self
.env
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
@@ -316,7 +322,8 @@ impl PageServerNode {
pub fn tenant_create(
&self,
new_tenant_id: Option<TenantId>,
new_tenant_id: TenantId,
generation: Option<u32>,
settings: HashMap<&str, &str>,
) -> anyhow::Result<TenantId> {
let mut settings = settings.clone();
@@ -382,11 +389,9 @@ impl PageServerNode {
.context("Failed to parse 'gc_feedback' as bool")?,
};
// If tenant ID was not specified, generate one
let new_tenant_id = new_tenant_id.unwrap_or(TenantId::generate());
let request = models::TenantCreateRequest {
new_tenant_id,
generation,
config,
};
if !settings.is_empty() {

View File

@@ -101,7 +101,7 @@ impl SafekeeperNode {
self.datadir_path().join("safekeeper.pid")
}
pub fn start(&self) -> anyhow::Result<Child> {
pub fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<Child> {
print!(
"Starting safekeeper at '{}' in '{}'",
self.pg_connection_config.raw_address(),
@@ -161,17 +161,28 @@ impl SafekeeperNode {
let key_path = self.env.base_data_dir.join("auth_public_key.pem");
if self.conf.auth_enabled {
let key_path_string = key_path
.to_str()
.with_context(|| {
format!("Key path {key_path:?} cannot be represented as a unicode string")
})?
.to_owned();
args.extend([
"--auth-validation-public-key-path".to_owned(),
key_path
.to_str()
.with_context(|| {
format!("Key path {key_path:?} cannot be represented as a unicode string")
})?
.to_owned(),
"--pg-auth-public-key-path".to_owned(),
key_path_string.clone(),
]);
args.extend([
"--pg-tenant-only-auth-public-key-path".to_owned(),
key_path_string.clone(),
]);
args.extend([
"--http-auth-public-key-path".to_owned(),
key_path_string.clone(),
]);
}
args.extend(extra_opts);
background_process::start_process(
&format!("safekeeper-{id}"),
&datadir,

View File

@@ -4,7 +4,12 @@
# to your expectations and requirements.
# Root options
targets = []
targets = [
{ triple = "x86_64-unknown-linux-gnu" },
{ triple = "aarch64-unknown-linux-gnu" },
{ triple = "aarch64-apple-darwin" },
{ triple = "x86_64-apple-darwin" },
]
all-features = false
no-default-features = false
feature-depth = 1
@@ -18,7 +23,7 @@ vulnerability = "deny"
unmaintained = "warn"
yanked = "warn"
notice = "warn"
ignore = []
ignore = ["RUSTSEC-2023-0052"]
# This section is considered when running `cargo deny check licenses`
# More documentation for the licenses section can be found here:

View File

@@ -30,7 +30,7 @@ cleanup() {
echo "clean up containers if exists"
cleanup
for pg_version in 14 15; do
for pg_version in 14 15 16; do
echo "start containers (pg_version=$pg_version)."
PG_VERSION=$pg_version docker compose -f $COMPOSE_FILE up --build -d

View File

@@ -0,0 +1,957 @@
# Pageserver: split-brain safety for remote storage through generation numbers
## Summary
A scheme of logical "generation numbers" for tenant attachment to pageservers is proposed, along with
changes to the remote storage format to include these generation numbers in S3 keys.
Using the control plane as the issuer of these generation numbers enables strong anti-split-brain
properties in the pageserver cluster without implementing a consensus mechanism directly
in the pageservers.
## Motivation
Currently, the pageserver's remote storage format does not provide a mechanism for addressing
split brain conditions that may happen when replacing a node or when migrating
a tenant from one pageserver to another.
From a remote storage perspective, a split brain condition occurs whenever two nodes both think
they have the same tenant attached, and both can write to S3. This can happen in the case of a
network partition, pathologically long delays (e.g. suspended VM), or software bugs.
In the current deployment model, control plane guarantees that a tenant is attached to one
pageserver at a time, thereby ruling out split-brain conditions resulting from dual
attachment (however, there is always the risk of a control plane bug). This control
plane guarantee prevents robust response to failures, as if a pageserver is unresponsive
we may not detach from it. The mechanism in this RFC fixes this, by making it safe to
attach to a new, different pageserver even if an unresponsive pageserver may be running.
Futher, lack of safety during split-brain conditions blocks two important features where occasional
split-brain conditions are part of the design assumptions:
- seamless tenant migration ([RFC PR](https://github.com/neondatabase/neon/pull/5029))
- automatic pageserver instance failure handling (aka "failover") (RFC TBD)
### Prior art
- 020-pageserver-s3-coordination.md
- 023-the-state-of-pageserver-tenant-relocation.md
- 026-pageserver-s3-mvcc.md
This RFC has broad similarities to the proposal to implement a MVCC scheme in
S3 object names, but this RFC avoids a general purpose transaction scheme in
favour of more specialized "generations" that work like a transaction ID that
always has the same lifetime as a pageserver process or tenant attachment, whichever
is shorter.
## Requirements
- Accommodate storage backends with no atomic or fencing capability (i.e. work within
S3's limitation that there are no atomics and clients can't be fenced)
- Don't depend on any STONITH or node fencing in the compute layer (i.e. we will not
assume that we can reliably kill and EC2 instance and have it die)
- Scoped per-tenant, not per-pageserver; for _seamless tenant migration_, we need
per-tenant granularity, and for _failover_, we likely want to spread the workload
of the failed pageserver instance to a number of peers, rather than monolithically
moving the entire workload to another machine.
We do not rule out the latter case, but should not constrain ourselves to it.
## Design Tenets
These are not requirements, but are ideas that guide the following design:
- Avoid implementing another consensus system: we already have a strongly consistent
database in the control plane that can do atomic operations where needed, and we also
have a Paxos implementation in the safekeeper.
- Avoiding locking in to specific models of how failover will work (e.g. do not assume that
all the tenants on a pageserver will fail over as a unit).
- Be strictly correct when it comes to data integrity. Occasional failures of availability
are tolerable, occasional data loss is not.
## Non Goals
The changes in this RFC intentionally isolate the design decision of how to define
logical generations numbers and object storage format in a way that is somewhat flexible with
respect to how actual orchestration of failover works.
This RFC intentionally does not cover:
- Failure detection
- Orchestration of failover
- Standby modes to keep data ready for fast migration
- Intentional multi-writer operation on tenants (multi-writer scenarios are assumed to be transient split-brain situations).
- Sharding.
The interaction between this RFC and those features is discussed in [Appendix B](#appendix-b-interoperability-with-other-features)
## Impacted Components
pageserver, control plane, safekeeper (optional)
## Implementation Part 1: Correctness
### Summary
- A per-tenant **generation number** is introduced to uniquely identifying tenant attachments to pageserver processes.
- This generation number increments each time the control plane modifies a tenant (`Project`)'s assigned pageserver, or when the assigned pageserver restarts.
- the control plane is the authority for generation numbers: only it may
increment a generation number.
- **Object keys are suffixed** with the generation number
- **Safety for multiply-attached tenants** is provided by the
generation number in the object key: the competing pageservers will not
try to write to the same keys.
- **Safety in split brain for multiple nodes running with
the same node ID** is provided by the pageserver calling out to the control plane
on startup, to re-attach and thereby increment the generations of any attached tenants
- **Safety for deletions** is achieved by deferring the DELETE from S3 to a point in time where the deleting node has validated with control plane that no attachment with a higher generation has a reference to the to-be-DELETEd key.
- **The control plane is used to issue generation numbers** to avoid the need for
a built-in consensus system in the pageserver, although this could in principle
be changed without changing the storage format.
### Generation numbers
A generation number is associated with each tenant in the control plane,
and each time the attachment status of the tenant changes, this is incremented.
Changes in attachment status include:
- Attaching the tenant to a different pageserver
- A pageserver restarting, and "re-attaching" its tenants on startup
These increments of attachment generation provide invariants we need to avoid
split-brain issues in storage:
- If two pageservers have the same tenant attached, the attachments are guaranteed to have different generation numbers, because the generation would increment
while attaching the second one.
- If there are multiple pageservers running with the same node ID, all the attachments on all pageservers are guaranteed to have different generation numbers, because the generation would increment
when the second node started and re-attached its tenants.
As long as the infrastructure does not transparently replace an underlying
physical machine, we are totally safe. See the later [unsafe case](#unsafe-case-on-badly-behaved-infrastructure) section for details.
### Object Key Changes
#### Generation suffix
All object keys (layer objects and index objects) will contain the attachment
generation as a [suffix](#why-a-generation-suffix-rather-than-prefix).
This suffix is the primary mechanism for protecting against split-brain situations, and
enabling safe multi-attachment of tenants:
- Two pageservers running with the same node ID (e.g. after a failure, where there is
some rogue pageserver still running) will not try to write to the same objects, because at startup they will have re-attached tenants and thereby incremented
generation numbers.
- Multiple attachments (to different pageservers) of the same tenant will not try to write to the same objects, as each attachment would have a distinct generation.
The generation is appended in hex format (8 byte string representing
u32), to all our existing key names. A u32's range limit would permit
27 restarts _per second_ over a 5 year system lifetime: orders of magnitude more than
is realistic.
The exact meaning of the generation suffix can evolve over time if necessary, for
example if we chose to implement a failover mechanism internally to the pageservers
rather than going via the control plane. The storage format just sees it as a number,
with the only semantic property being that the highest numbered index is the latest.
#### Index changes
Since object keys now include a generation suffix, the index of these keys must also be updated. IndexPart currently stores keys and LSNs sufficient to reconstruct key names: this would be extended to store the generation as well.
This will increase the size of the file, but only modestly: layers are already encoded as
their string-ized form, so the overhead is about 10 bytes per layer. This will be less if/when
the index storage format is migrated to a binary format from JSON.
#### Visibility
_This section doesn't describe code changes, but extends on the consequences of the
object key changes given above_
##### Visibility of objects to pageservers
Pageservers can of course list objects in S3 at any time, but in practice their
visible set is based on the contents of their LayerMap, which is initialized
from the `index_part.json.???` that they load.
Starting with the `index_part` from the most recent previous generation
(see [loading index_part](#finding-the-remote-indices-for-timelines)), a pageserver
initially has visibility of all the objects that were referenced in the loaded index.
These objects are guaranteed to remain visible until the current generation is
superseded, via pageservers in older generations avoiding deletions (see [deletion](#deletion)).
The "most recent previous generation" is _not_ necessarily the most recent
in terms of walltime, it is the one that is readable at the time a new generation
starts. Consider the following sequence of a tenant being re-attached to different
pageserver nodes:
- Create + attach on PS1 in generation 1
- PS1 Do some work, write out index_part.json-0001
- Attach to PS2 in generation 2
- Read index_part.json-0001
- PS2 starts doing some work...
- Attach to PS3 in generation 3
- Read index_part.json-0001
- **...PS2 finishes its work: now it writes index_part.json-0002**
- PS3 writes out index_part.json-0003
In the above sequence, the ancestry of indices is:
```
0001 -> 0002
|
-> 0003
```
This is not an issue for safety: if the 0002 references some object that is
not in 0001, then 0003 simply does not see it, and will re-do whatever
work was required (e.g. ingesting WAL or doing compaction). Objects referenced
by only the 0002 index will never be read by future attachment generations, and
will eventually be cleaned up by a scrub (see [scrubbing](#cleaning-up-orphan-objects-scrubbing)).
##### Visibility of LSNs to clients
Because index_part.json is now written with a generation suffix, which data
is visible depends on which generation the reader is operating in:
- If one was passively reading from S3 from outside of a pageserver, the
visibility of data would depend on which index_part.json-<generation> file
one had chosen to read from.
- If two pageservers have the same tenant attached, they may have different
data visible as they're independently replaying the WAL, and maintaining
independent LayerMaps that are written to independent index_part.json files.
Data does not have to be remotely committed to be visible.
- For a pageserver writing with a stale generation, historic LSNs
remain readable until another pageserver (with a higher generation suffix)
decides to execute GC deletions. At this point, we may think of the stale
attachment's generation as having logically ended: during its existence
the generation had a consistent view of the world.
- For a newly attached pageserver, its highest visible LSN may appears to
go backwards with respect to an earlier attachment, if that earlier
attachment had not uploaded all data to S3 before the new attachment.
### Deletion
#### Generation number validation
While writes are de-conflicted by writers always using their own generation number in the key,
deletions are slightly more challenging: if a pageserver A is isolated, and the true active node is
pageserver B, then it is dangerous for A to do any object deletions, even of objects that it wrote
itself, because pageserver's B metadata might reference those objects.
We solve this by inserting a "generation validation" step between the write of a remote index
that un-links a particular object from the index, and the actual deletion of the object, such
that deletions strictly obey the following ordering:
1. Write out index_part.json: this guarantees that any subsequent reader of the metadata will
not try and read the object we unlinked.
2. Call out to control plane to validate that the generation which we use for our attachment is still the latest.
3. If step 2 passes, it is safe to delete the object. Why? The check-in with control plane
together with our visibility rules guarantees that any later generation
will use either the exact `index_part.json` that we uploaded in step 1, or a successor
of it; not an earlier one. In both cases, the `index_part.json` doesn't reference the
key we are deleting anymore, so, the key is invisible to any later attachment generation.
Hence it's safe to delete it.
Note that at step 2 we are only confirming that deletions of objects _no longer referenced
by the specific `index_part.json` written in step 1_ are safe. If we were attempting other deletions concurrently,
these would need their own generation validation step.
If step 2 fails, we may leak the object. This is safe, but has a cost: see [scrubbing](#cleaning-up-orphan-objects-scrubbing). We may avoid this entirely outside of node
failures, if we do proper flushing of deletions on clean shutdown and clean migration.
To avoid doing a huge number of control plane requests to perform generation validation,
validation of many tenants will be done in a single request, and deletions will be queued up
prior to validation: see [Persistent deletion queue](#persistent-deletion-queue) for more.
#### `remote_consistent_lsn` updates
Remote objects are not the only kind of deletion the pageserver does: it also indirectly deletes
WAL data, by feeding back remote_consistent_lsn to safekeepers, as a signal to the safekeepers that
they may drop data below this LSN.
For the same reasons that deletion of objects must be guarded by an attachment generation number
validation step, updates to `remote_consistent_lsn` are subject to the same rules, using
an ordering as follows:
1. upload the index_part that covers data up to LSN `L0` to S3
2. Call out to control plane to validate that the generation which we use for our attachment is still the latest.
3. advance the `remote_consistent_lsn` that we advertise to the safekeepers to `L0`
If step 2 fails, then the `remote_consistent_lsn` advertised
to safekeepers will not advance again until a pageserver
with the latest generation is ready to do so.
**Note:** at step 3 we are not advertising the _latest_ remote_consistent_lsn, we are
advertising the value in the index_part that we uploaded in step 1. This provides
a strong ordering guarantee.
Internally to the pageserver, each timeline will have two remote_consistent_lsn values: the one that
reflects its latest write to remote storage, and the one that reflects the most
recent validation of generation number. It is only the latter value that may
be advertised to the outside world (i.e. to the safekeeper).
The control plane remains unaware of `remote_consistent_lsn`: it only has to validate
the freshness of generation numbers, thereby granting the pageserver permission to
share the information with the safekeeper.
For convenience, in subsequent sections and RFCs we will use "deletion" to mean both deletion
of objects in S3, and updates to the `remote_consistent_lsn`, as updates to the remote consistent
LSN are de-facto deletions done via the safekeeper, and both kinds of deletion are subject to
the same generation validation requirement.
### Pageserver attach/startup changes
#### Attachment
Calls to `/v1/tenant/{tenant_id}/attach` are augmented with an additional
`generation` field in the body.
The pageserver does not persist this: a generation is only good for the lifetime
of a process.
#### Finding the remote indices for timelines
Because index files are now suffixed with generation numbers, the pageserver
cannot always GET the remote index in one request, because it can't always
know a-priori what the latest remote index is.
Typically, the most recent generation to write an index would be our own
generation minus 1. However, this might not be the case: the previous
node might have started and acquired a generation number, and then crashed
before writing out a remote index.
In the general case and as a fallback, the pageserver may list all the `index_part.json`
files for a timeline, sort them by generation, and pick the highest that is `<=`
its current generation for this attachment. The tenant should never load an index
with an attachment generation _newer_ than its own.
These two rules combined ensure that objects written by later generations are never visible to earlier generations.
Note that if a given attachment picks an index part from an earlier generation (say n-2), but crashes & restarts before it writes its own generation's index part, next time it tries to pick an index part there may be an index part from generation n-1.
It would pick the n-1 index part in that case, because it's sorted higher than the previous one from generation n-2.
So, above rules guarantee no determinism in selecting the index part.
are allowed to be attached with stale attachment generations during a multiply-attached
phase in a migration, and in this instance if the old location's pageserver restarts,
it should not try and load the newer generation's index.
To summarize, on starting a timeline, the pageserver will:
1. Issue a GET for index_part.json-<my generation - 1>
2. If 1 failed, issue a ListObjectsv2 request for index_part.json\* and
pick the newest.
One could optimize this further by using the control plane to record specifically
which generation most recently wrote an index_part.json, if necessary, to increase
the probability of finding the index_part.json in one GET. One could also improve
the chances by having pageservers proactively write out index_part.json after they
get a new generation ID.
#### Re-attachment on startup
On startup, the pageserver will call out to an new control plane `/re-attach`
API (see [Generation API](#generation-api)). This returns a list of
tenants that should be attached to the pageserver, and their generation numbers, which
the control plane will increment before returning.
The pageserver should still scan its local disk on startup, but should _delete_
any local content for tenants not indicated in the `/re-attach` response: their
absence is an implicit detach operation.
**Note** if a tenant is omitted from the re-attach response, its local disk content
will be deleted. This will change in subsequent work, when the control plane gains
the concept of a secondary/standby location: a node with local content may revert
to this status and retain some local content.
#### Cleaning up previous generations' remote indices
Deletion of old indices is not necessary for correctness, although it is necessary
to avoid the ListObjects fallback in the previous section becoming ever more expensive.
Once the new attachment has written out its index_part.json, it may asynchronously clean up historic index_part.json
objects that were found.
We may choose to implement this deletion either as an explicit step after we
write out index_part for the first time in a pageserver's lifetime, or for
simplicity just do it periodically as part of the background scrub (see [scrubbing](#cleaning-up-orphan-objects-scrubbing));
### Control Plane Changes
#### Store generations for attaching tenants
- The `Project` table must store the generation number for use when
attaching the tenant to a new pageserver.
- The `/v1/tenant/:tenant_id/attach` pageserver API will require the generation number,
which the control plane can supply by simply incrementing the `Project`'s
generation number each time the tenant is attached to a different server: the same database
transaction that changes the assigned pageserver should also change the generation number.
#### Generation API
This section describes an API that could be provided directly by the control plane,
or built as a separate microservice. In earlier parts of the RFC, when we
discuss the control plane providing generation numbers, we are referring to this API.
The API endpoints used by the pageserver to acquire and validate generation
numbers are quite simple, and only require access to some persistent and
linerizable storage (such as a database).
Building this into the control plane is proposed as a least-effort option to exploit existing infrastructure and implement generation number issuance in the same transaction that mandates it (i.e., the transaction that updates the `Project` assignment to another pageserver).
However, this is not mandatory: this "Generation Number Issuer" could
be built as a microservice. In practice, we will write such a miniature service
anyway, to enable E2E pageserver/compute testing without control plane.
The endpoints required by pageservers are:
##### `/re-attach`
- Request: `{node_id: <u32>}`
- Response:
- 200 `{tenants: [{id: <TenantId>, gen: <u32>}]}`
- 404: unknown node_id
- (Future: 429: flapping detected, perhaps nodes are fighting for the same node ID,
or perhaps this node was in a retry loop)
- (On unknown tenants, omit tenant from `tenants` array)
- Server behavior: query database for which tenants should be attached to this pageserver.
- for each tenant that should be attached, increment the attachment generation and
include the new generation in the response
- Client behavior:
- for all tenants in the response, activate with the new generation number
- for any local disk content _not_ referenced in the response, act as if we
had been asked to detach it (i.e. delete local files)
**Note** the `node_id` in this request will change in future if we move to ephemeral
node IDs, to be replaced with some correlation ID that helps the control plane realize
if a process is running with the same storage as a previous pageserver process (e.g.
we might use EC instance ID, or we might just write some UUID to the disk the first
time we use it)
##### `/validate`
- Request: `{'tenants': [{tenant: <tenant id>, attach_gen: <gen>}, ...]}'`
- Response:
- 200 `{'tenants': [{tenant: <tenant id>, status: <bool>}...]}`
- (On unknown tenants, omit tenant from `tenants` array)
- Purpose: enable the pageserver to discover for the given attachments whether they are still the latest.
- Server behavior: this is a read-only operation: simply compare the generations in the request with
the generations known to the server, and set status to `true` if they match.
- Client behavior: clients must not do deletions within a tenant's remote data until they have
received a response indicating the generation they hold for the attachment is current.
#### Use of `/load` and `/ignore` APIs
Because the pageserver will be changed to only attach tenants on startup
based on the control plane's response to a `/re-attach` request, the load/ignore
APIs no longer make sense in their current form.
The `/load` API becomes functionally equivalent to attach, and will be removed:
any location that used `/load` before should just attach instead.
The `/ignore` API is equivalent to detaching, but without deleting local files.
### Timeline/Branch creation & deletion
All of the previous arguments for safety have described operations within
a timeline, where we may describe a sequence that includes updates to
index_part.json, and where reads and writes are coming from a postgres
endpoint (writes via the safekeeper).
Creating or destroying timeline is a bit different, because writes
are coming from the control plane.
We must be safe against scenarios such as:
- A tenant is attached to pageserver B while pageserver A is
in the middle of servicing an RPC from the control plane to
create or delete a tenant.
- A pageserver A has been sent a timeline creation request
but becomes unresponsive. The tenant is attached to a
different pageserver B, and the timeline creation request
is sent there too.
#### Timeline Creation
If some very slow node tries to do a timeline creation _after_
a more recent generation node has already created the timeline
and written some data into it, that must not cause harm. This
is provided in timeline creations by the way all the objects
within the timeline's remote path include a generation suffix:
a slow node in an old generation that attempts to "create" a timeline
that already exists will just emit an index_part.json with
an old generation suffix.
Timeline IDs are never reused, so we don't have
to worry about the case of create/delete/create cycles. If they
were re-used during a disaster recovery "un-delete" of a timeline,
that special case can be handled by calling out to all available pageservers
to check that they return 404 for the timeline, and to flush their
deletion queues in case they had any deletions pending from the
timeline.
The above makes it safe for control plane to change the assignment of
tenant to pageserver in control plane while a timeline creation is ongoing.
The reason is that the creation request against the new assigned pageserver
uses a new generation number. However, care must be taken by control plane
to ensure that a "timeline creation successul" response from some pageserver
is checked for the pageserver's generation for that timeline's tenant still being the latest.
If it is not the latest, the response does not constitute a successful timeline creation.
It is acceptable to discard such responses, the scrubber will clean up the S3 state.
It is better to issue a timelien deletion request to the stale attachment.
#### Timeline Deletion
Tenant/timeline deletion operations are exempt from generation validation
on deletes, and therefore don't have to go through the same deletion
queue as GC/compaction layer deletions. This is because once a
delete is issued by the control plane, it is a promise that the
control plane will keep trying until the deletion is done, so even stale
pageservers are permitted to go ahead and delete the objects.
The implications of this for control plane are:
- During timeline/tenant deletion, the control plane must wait for the deletion to
be truly complete (status 404) and also handle the case where the pageserver
becomes unavailable, either by waiting for a replacement with the same node_id,
or by *re-attaching the tenant elsewhere.
- The control plane must persist its intent to delete
a timeline/tenant before issuing any RPCs, and then once it starts, it must
keep retrying until the tenant/timeline is gone. This is already handled
by using a persistent `Operation` record that is retried indefinitely.
Timeline deletion may result in a special kind of object leak, where
the latest generation attachment completes a deletion (including erasing
all objects in the timeline path), but some slow/partitioned node is
writing into the timeline path with a stale generation number. This would
not be caught by any per-timeline scrubbing (see [scrubbing](#cleaning-up-orphan-objects-scrubbing)), since scrubbing happens on the
attached pageserver, and once the timeline is deleted it isn't attached anywhere.
This scenario should be pretty rare, and the control plane can make it even
rarer by ensuring that if a tenant is in a multi-attached state (e.g. during
migration), we wait for that to complete before processing the deletion. Beyond
that, we may implement some other top-level scrub of timelines in
an external tool, to identify any tenant/timeline paths that are not found
in the control plane database.
#### Examples
- Deletion, node restarts partway through:
- By the time we returned 202, we have written a remote delete marker
- Any subsequent incarnation of the same node_id will see the remote
delete marker and continue to process the deletion
- If the original pageserver is lost permanently and no replacement
with the same node_id is available, then the control plane must recover
by re-attaching the tenant to a different node.
- Creation, node becomes unresponsive partway through.
- Control plane will see HTTP request timeout, keep re-issuing
request to whoever is the latest attachment point for the tenant
until it succeeds.
- Stale nodes may be trying to execute timeline creation: they will
write out index_part.json files with
stale attachment generation: these will be eventually cleaned up
by the same mechanism as other old indices.
### Unsafe case on badly behaved infrastructure
This section is only relevant if running on a different environment
than EC2 machines with ephemeral disks.
If we ever run pageservers on infrastructure that might transparently restart
a pageserver while leaving an old process running (e.g. a VM gets rescheduled
without the old one being fenced), then there is a risk of corruption, when
the control plane attaches the tenant, as follows:
- If the control plane sends an `/attach` request to node A, then node A dies
and is replaced, and the control plane's retries the request without
incrementing that attachment ID, then it could end up with two physical nodes
both using the same generation number.
- This is not an issue when using EC2 instances with ephemeral storage, as long
as the control plane never re-uses a node ID, but it would need re-examining
if running on different infrastructure.
- To robustly protect against this class of issue, we would either:
- add a "node generation" to distinguish between different processes holding the
same node_id.
- or, dispense with static node_id entirely and issue an ephemeral ID to each
pageserver process when it starts.
## Implementation Part 2: Optimizations
### Persistent deletion queue
Between writing our a new index_part.json that doesn't reference an object,
and executing the deletion, an object passes through a window where it is
only referenced in memory, and could be leaked if the pageserver is stopped
uncleanly. That introduces conflicting incentives: on the one hand, we would
like to delay and batch deletions to
1. minimize the cost of the mandatory validations calls to control plane, and
2. minimize cost for DeleteObjects requests.
On the other hand we would also like to minimize leakage by executing
deletions promptly.
To resolve this, we may make the deletion queue persistent
and then executing these in the background at a later time.
_Note: The deletion queue's reason for existence is optimization rather than correctness,
so there is a lot of flexibility in exactly how the it should work,
as long as it obeys the rule to validate generations before executing deletions,
so the following details are not essential to the overall RFC._
#### Scope
The deletion queue will be global per pageserver, not per-tenant. There
are several reasons for this choice:
- Use the queue as a central point to coalesce validation requests to the
control plane: this avoids individual `Timeline` objects ever touching
the control plane API, and avoids them having to know the rules about
validating deletions. This separation of concerns will avoid burdening
the already many-LoC `Timeline` type with even more responsibility.
- Decouple the deletion queue from Tenant attachment lifetime: we may
"hibernate" an inactive tenant by tearing down its `Tenant`/`Timeline`
objects in the pageserver, without having to wait for deletions to be done.
- Amortize the cost of I/O for the persistent queue, instead of having many
tiny queues.
- Coalesce deletions into a smaller number of larger DeleteObjects calls
Because of the cost of doing I/O for persistence, and the desire to coalesce
generation validation requests across tenants, and coalesce deletions into
larger DeleteObjects requests, there will be one deletion queue per pageserver
rather than one per tenant. This has the added benefit that when deactivating
a tenant, we do not have to drain their deletion queue: deletions can proceed
for a tenant whose main `Tenant` object has been torn down.
#### Flow of deletion
The flow of a deletion is becomes:
1. Need for deletion of an object (=> layer file) is identified.
2. Unlink the object from all the places that reference it (=> `index_part.json`).
3. Enqueue the deletion to a persistent queue.
Each entry is `tenant_id, attachment_generation, S3 key`.
4. Validate & execute in batches:
4.1 For a batch of entries, call into control plane.
4.2 For the subset of entries that passed validation, execute a `DeleteObjects` S3 DELETE request for their S3 keys.
As outlined in the Part 1 on correctness, it is critical that deletions are only
executed once the key is not referenced anywhere in S3.
This property is obviously upheld by the scheme above.
#### We Accept Object Leakage In Acceptable Circumcstances
If we crash in the flow above between (2) and (3), we lose track of unreferenced object.
Further, enqueuing a single to the persistent queue may not be durable immediately to amortize cost of flush to disk.
This is acceptable for now, it can be caught by [the scrubber](#cleaning-up-orphan-objects-scrubbing).
There are various measures we can take to improve this in the future.
1. Cap amount of time until enqueued entry becomes durable (timeout for flush-to-tisk)
2. Proactively flush:
- On graceful shutdown, as we anticipate that some or
all of our attachments may be re-assigned while we are offline.
- On tenant detach.
3. For each entry, keep track of whether it has passed (2).
Only admit entries to (4) one they have passed (2).
This requires re-writing / two queue entries (intent, commit) per deletion.
The important take-away with any of the above is that it's not
disastrous to leak objects in exceptional circumstances.
#### Operations that may skip the queue
Deletions of an entire timeline are [exempt](#Timeline-Deletion) from generation number validation. Once the
control plane sends the deletion request, there is no requirement to retain the readability
of any data within the timeline, and all objects within the timeline path may be deleted
at any time from the control plane's deletion request onwards.
Since deletions of smaller timelines won't have enough objects to compose a full sized
DeleteObjects request, it is still useful to send these through the last part of the
deletion pipeline to coalesce with other executing deletions: to enable this, the
deletion queue should expose two input channels: one for deletions that must be
processed in a generation-aware way, and a fast path for timeline deletions, where
that fast path may skip validation and the persistent queue.
### Cleaning up orphan objects (scrubbing)
An orphan object is any object which is no longer referenced by a running node or by metadata.
Examples of how orphan objects arise:
- A node PUTs a layer object, then crashes before it writes the
index_part.json that references that layer.
- A stale node carries on running for some time, and writes out an unbounded number of
objects while it believes itself to be the rightful writer for a tenant.
- A pageserver crashes between un-linking an object from the index, and persisting
the object to its deletion queue.
Orphan objects are functionally harmless, but have a small cost due to S3 capacity consumed. We
may clean them up at some time in the future, but doing a ListObjectsv2 operation and cross
referencing with the latest metadata to identify objects which are not referenced.
Scrubbing will be done only by an attached pageserver (not some third party process), and deletions requested during scrub will go through the same
validation as all other deletions: the attachment generation must be
fresh. This avoids the possibility of a stale pageserver incorrectly
thinking than an object written by a newer generation is stale, and deleting
it.
It is not strictly necessary that scrubbing be done by an attached
pageserver: it could also be done externally. However, an external
scrubber would still require the same validation procedure that
a pageserver's deletion queue performs, before actually erasing
objects.
## Operational impact
### Availability
Coordination of generation numbers via the control plane introduce a dependency for certain
operations:
1. Starting new pageservers (or activating pageservers after a restart)
2. Executing enqueued deletions
3. Advertising updated `remote_consistent_lsn` to enable WAL trimming
Item 1. would mean that some in-place restarts that previously would have resumed service even if the control plane were
unavailable, will now not resume service to users until the control plane is available. We could
avoid this by having a timeout on communication with the control plane, and after some timeout,
resume service with the previous generation numbers (assuming this was persisted to disk). However,
this is unlikely to be needed as the control plane is already an essential & highly available component. Also, having a node re-use an old generation number would complicate
reasoning about the system, as it would break the invariant that a generation number uniquely identifies
a tenant's attachment to a given pageserver _process_: it would merely identify the tenant's attachment
to the pageserver _machine_ or its _on-disk-state_.
Item 2. is a non-issue operationally: it's harmless to delay deletions, the only impact of objects pending deletion is
the S3 capacity cost.
Item 3. could be an issue if safekeepers are low on disk space and the control plane is unavailable for a long time. If this became an issue,
we could adjust the safekeeper to delete segments from local disk sooner, as soon as they're uploaded to S3, rather than waiting for
remote_consistent_lsn to advance.
For a managed service, the general approach should be to make sure we are monitoring & respond fast enough
that control plane outages are bounded in time.
There is also the fact that control plane runs in a single region.
The latency for distant regions is not a big concern for us because all request types added by this RFC are either infrequent or not in the way of the data path.
However, we lose region isolation for the operations listed above.
The ongoing work to split console and control will give us per-region control plane, and all operations in this RFC can be handled by these per-region control planes.
With that in mind, we accept the trade-offs outlined in this paragraph.
We will also implement an "escape hatch" config generation numbers, where in a major disaster outage,
we may manually run pageservers with a hand-selected generation number, so that we can bring them online
independently of a control plane.
### Rollout
Although there is coupling between components, we may deploy most of the new data plane components
independently of the control plane: initially they can just use a static generation number.
#### Phase 1
The pageserver is deployed with some special config to:
- Always act like everything is generation 1 and do not wait for a control plane issued generation on attach
- Skip the places in deletion and remote_consistent_lsn updates where we would call into control plane
#### Phase 2
The control plane changes are deployed: control plane will now track and increment generation numbers.
#### Phase 3
The pageserver is deployed with its control-plane-dependent changes enabled: it will now require
the control plane to service re-attach requests on startup, and handle generation
validation requests.
### On-disk backward compatibility
Backward compatibility with existing data is straightforward:
- When reading the index, we may assume that any layer whose metadata doesn't include
generations will have a path without generation suffix.
- When locating the index file on attachment, we may use the "fallback" listing path
and if there is only an index without generation suffix, that is the one we load.
It is not necessary to re-write existing layers: even new index files will be able
to represent generation-less layers.
### On-disk forward compatibility
We will do a two phase rollout, probably over multiple releases because we will naturally
have some of the read-side code ready before the overall functionality is ready:
1. Deploy pageservers which understand the new index format and generation suffixes
in keys, but do not write objects with generation numbers in the keys.
2. Deploy pageservers that write objects with generation numbers in the keys.
Old pageservers will be oblivious to generation numbers. That means that they can't
read objects with generation numbers in the name. This is why we must
first step must deploy the ability to read, before the second step
starts writing them.
# Frequently Asked Questions
## Why a generation _suffix_ rather than _prefix_?
The choice is motivated by object listing, since one can list by prefix but not
suffix.
In [finding remote indices](#finding-the-remote-indices-for-timelines), we rely
on being able to do a prefix listing for `<tenant>/<timeline>/index_part.json*`.
That relies on the prefix listing.
The converse case of using a generation prefix and listing by generation is
not needed: one could imagine listing by generation while scrubbing (so that
a particular generation's layers could be scrubbed), but this is not part
of normal operations, and the [scrubber](#cleaning-up-orphan-objects-scrubbing) probably won't work that way anyway.
## Wouldn't it be simpler to have a separate deletion queue per timeline?
Functionally speaking, we could. That's how RemoteTimelineClient currently works,
but this approach does not map well to a long-lived persistent queue with
generation validation.
Anything we do per-timeline generates tiny random I/O, on a pageserver with
tens of thousands of timelines operating: to be ready for high scale, we should:
- A) Amortize costs where we can (e.g. a shared deletion queue)
- B) Expect to put tenants into a quiescent state while they're not
busy: i.e. we shouldn't keep a tenant alive to service its deletion queue.
This was discussed in the [scope](#scope) part of the deletion queue section.
# Appendix A: Examples of use in high availability/failover
The generation numbers proposed in this RFC are adaptable to a variety of different
failover scenarios and models. The sections below sketch how they would work in practice.
### In-place restart of a pageserver
"In-place" here means that the restart is done before any other element in the system
has taken action in response to the node being down.
- After restart, the node issues a re-attach request to the control plane, and
receives new generation numbers for all its attached tenants.
- Tenants may be activated with the generation number in the re-attach response.
- If any of its attachments were in fact stale (i.e. had be reassigned to another
node while this node was offline), then
- the re-attach response will inform the tenant about this by not including
the tenant of this by _not_ incrementing the generation for that attachment.
- This will implicitly block deletions in the tenant, but as an optimization
the pageserver should also proactively stop doing S3 uploads when it notices this stale-generation state.
- The control plane is expected to eventually detach this tenant from the
pageserver.
If the control plane does not include a tenant in the re-attach response,
but there is still local state for the tenant in the filesystem, the pageserver
deletes the local state in response and does not load/active the tenant.
See the [earlier section on pageserver startup](#pageserver-attachstartup-changes) for details.
Control plane can use this mechanism to clean up a pageserver that has been
down for so long that all its tenants were migrated away before it came back
up again and asked for re-attach.
### Failure of a pageserver
In this context, read "failure" as the most ambiguous possible case, where
a pageserver is unavailable to clients and control plane, but may still be executing and talking
to S3.
#### Case A: re-attachment to other nodes
1. Let's say node 0 becomes unresponsive in a cluster of three nodes 0, 1, 2.
2. Some external mechanism notices that the node is unavailable and initiates
movement of all tenants attached to that node to a different node according
to some distribution rule.
In this example, it would mean incrementing the generation
of all tenants that were attached to node 0, as each tenant's assigned pageserver changes.
3. A tenant which is now attached to node 1 will _also_ still be attached to node
0, from the perspective of node 0. Node 0 will still be using its old generation,
node 1 will be using a newer generation.
4. S3 writes will continue from nodes 0 and 1: there will be an index_part.json-00000001
\_and\* an index_part.json-00000002. Objects written under the old suffix
after the new attachment was created do not matter from the rest of the system's
perspective: the endpoints are reading from the new attachment location. Objects
written by node 0 are just garbage that can be cleaned up at leisure. Node 0 will
not do any deletions because it can't synchronize with control plane, or if it could,
its deletion queue processing would get errors for the validation requests.
#### Case B: direct node replacement with same node_id and drive
This is the scenario we would experience if running pageservers in some dynamic
VM/container environment that would auto-replace a given node_id when it became
unresponsive, with the node's storage supplied by some network block device
that is attached to the replacement VM/container.
1. Let's say node 0 fails, and there may be some other peers but they aren't relevant.
2. Some external mechanism notices that the node is unavailable, and creates
a "new node 0" (Node 0b) which is a physically separate server. The original node 0
(Node 0a) may still be running, because we do not assume the environment fences nodes.
3. On startup, node 0b re-attaches and gets higher generation numbers for
all tenants.
4. S3 writes continue from nodes 0a and 0b, but the writes do not collide due to different
generation in the suffix, and the writes from node 0a are not visible to the rest
of the system because endpoints are reading only from node 0b.
# Appendix B: interoperability with other features
## Sharded Keyspace
The design in this RFC maps neatly to a sharded keyspace design where subsets of the key space
for a tenant are assigned to different pageservers:
- the "unit of work" for attachments becomes something like a TenantShard rather than a Tenant
- TenantShards get generation numbers just as Tenants do.
- Write workload (ingest, compaction) for a tenant is spread out across pageservers via
TenantShards, but each TenantShard still has exactly one valid writer at a time.
## Read replicas
_This section is about a passive reader of S3 pageserver state, not a postgres
read replica_
For historical reads to LSNs below the remote persistent LSN, any node may act as a reader at any
time: remote data is logically immutable data, and the use of deferred deletion in this RFC helps
mitigate the fact that remote data is not _physically_ immutable (i.e. the actual data for a given
page moves around as compaction happens).
A read replica needs to be aware of generations in remote data in order to read the latest
metadata (find the index_part.json with the latest suffix). It may either query this
from the control plane, or find it with ListObjectsv2 request
## Seamless migration
To make tenant migration totally seamless, we will probably want to intentionally double-attach
a tenant briefly, serving reads from the old node while waiting for the new node to be ready.
This RFC enables that double-attachment: two nodes may be attached at the same time, with the migration destination
having a higher generation number. The old node will be able to ingest and serve reads, but not
do any deletes. The new node's attachment must also avoid deleting layers that the old node may
still use. A new piece of state
will be needed for this in the control plane's definition of an attachment.
## Warm secondary locations
To enable faster tenant movement after a pageserver is lost, we will probably want to spend some
disk capacity on keeping standby locations populated with local disk data.
There's no conflict between this RFC and that: implementing warm secondary locations on a per-tenant basis
would be a separate change to the control plane to store standby location(s) for a tenant. Because
the standbys do not write to S3, they do not need to be assigned generation numbers. When a tenant is
re-attached to a standby location, that would increment the tenant attachment generation and this
would work the same as any other attachment change, but with a warm cache.
## Ephemeral node IDs
This RFC intentionally avoids changing anything fundamental about how pageservers are identified
and registered with the control plane, to avoid coupling the implementation of pageserver split
brain protection with more fundamental changes in the management of the pageservers.
Moving to ephemeral node IDs would provide an extra layer of
resilience in the system, as it would prevent the control plane
accidentally attaching to two physical nodes with the same
generation, if somehow there were two physical nodes with
the same node IDs (currently we rely on EC2 guarantees to
eliminate this scenario). With ephemeral node IDs, there would be
no possibility of that happening, no matter the behavior of
underlying infrastructure.
Nothing fundamental in the pageserver's handling of generations needs to change to handle ephemeral node IDs, since we hardly use the
`node_id` anywhere. The `/re-attach` API would be extended
to enable the pageserver to obtain its ephemeral ID, and provide
some correlation identifier (e.g. EC instance ID), to help the
control plane re-attach tenants to the same physical server that
previously had them attached.

View File

@@ -0,0 +1,316 @@
This is a copy from the [original Notion page](https://www.notion.so/neondatabase/Proposal-Pageserver-MVCC-S3-Storage-8a424c0c7ec5459e89d3e3f00e87657c?pvs=4), taken on 2023-08-16.
This is for archival mostly.
The RFC that we're likely to go with is https://github.com/neondatabase/neon/pull/4919.
---
# Proposal: Pageserver MVCC S3 Storage
tl;dr: this proposal enables Control Plane to attach a tenant to a new pageserver without being 100% certain that it has been detached from the old pageserver. This enables us to automate failover if a pageserver dies (no human in the loop).
# Problem Statement
The current Neon architecture requires the Control Plane to guarantee that a tenant is only attached to one pageserver at a time. If a tenant is attached to multiple pageservers simultaneously, the pageservers will overwrite each others changes in S3 for that tenant, resulting in data loss for that tenant.
The above imposes limitations on tenant relocation and future designs for high availability. For instance, Control Plane cannot relocate a tenant to another pageserver before it is 100% certain that the tenant is detached from the source pageserver. If the source pageserver is unresponsive, the tenant detach procedure cannot proceed, and Control Plane has no choice but to wait for either the source to become responsive again, or rely on a node failure detection mechanism to detect that the source pageserver is dead, and give permission to skip the detachment step. Either way, the tenant is unavailable for an extended period, and we have no means to improve it in the current architecture.
Note that there is no 100% correct node failure detection mechanism, and even techniques to accelerate failure detection, such as ********************************shoot-the-other-node-in-the-head,******************************** have their limits. So, we currently rely on humans as node failure detectors: they get alerted via PagerDuty, assess the situation under high stress, and make the decision. If they make the wrong call, or the apparent dead pageserver somehow resurrects later, well have data loss.
Also, by relying on humans, were [incurring needless unscalable toil](https://sre.google/sre-book/eliminating-toil/): as Neon grows, pageserver failures will become more and more frequent because our fleet grows. Each instance will need quick response time to minimize downtime for the affected tenants, which implies higher toil, higher resulting attrition, and/or higher personnel cost.
Lastly, there are foreseeable needs by operation and product such as zero-downtime relocation and automatic failover/HA. For such features, the ability to have a tenant purposefully or accidentally attached to more than one pageserver will greatly reduce risk of data loss, and improve availability.
# High-Level Idea
The core idea is to evolve the per-Tenant S3 state to an MVCC-like scheme, allowing multiple pageservers to operate on the same tenant S3 state without interference. To make changes to S3, pageservers acquire long-running transactions from Control Plane. After opening a transaction, Pageservers make PUTs directly against S3, but they keys include the transaction ID, so overwrites never happen. Periodically, pageservers talk back to Control Plane to commit their transaction. This is where Control Plane enforces strict linearizability, favoring availability over work-conservation: commit is only granted if no transaction started after the one thats requesting commit. Garbage collection is done through deadlists, and its simplified tremendously by above commit grant/reject policy.
Minimal changes are required for safekeepers to allow WAL for a single timeline be consumed by more than one pageserver without premature truncation.
**Above scheme makes it safe to attach tenants without a 100% correct node failure detection mechanism. Further, it makes it safe to interleave tenant-attachment to pageservers, unlocking new capabilities for (internal) product features:**
- **Fast, Zero-Toil Failover on Network Partitions or Instance Failure**: if a pageserver is not reachable (network partition, hardware failure, overload) we want to spread its attached tenants to new pageservers to restore availability, within the range of *seconds*. We cannot afford gracious timeouts to maximize the probability that the unreachable pageserver has ceased writing to S3. This proposal enables us to attach the tenants to the replacement pageservers, and redirect their computes, without having to wait for confirmation that the unreachable pageserver has ceased writing to S3.
- **************************************Zero-Downtime Relocation:************************************** we want to be able to relocate tenants to different pageservers with minimized availability or a latency impact. This proposal enables us to attach the relocating Tenant to the destination Pageserver before detaching it from the source Pageserver. This can help minimize downtime because we can wait for the destination to catch up on WAL processing before redirecting Computes.
# Design
The core idea is to evolve the per-Tenant S3 state to a per-tenant MVCC-like scheme.
To make S3 changes for a given tenant, Pageserver requests a transaction ID from control plane for that tenant. Without a transaction ID, Pageserver does not write to S3.
Once Pageserver received a transaction ID it is allowed to produce new objects and overwrite objects created in this transaction. Pageserver is not allowed to delete any objects; instead, it marks the object as deleted by appending the key to the transactions deadlist for later deletion. Commits of transactions are serialized through Control Plane: when Pageserver wants to commit a transaction, it sends an RPC to Control Plane. Control Plane responds with a commit grant or commit reject message. Commit grant means that the transactions changes are now visible to subsequent transactions. Commit reject means that the transactions changes are not and never will be visible to another Pageserver instance, and the rejected Pageserver is to cease further activity on that tenant.
## ****************************************************Commit grant/reject policy****************************************************
For the purposes of Pageserver, we want **linearizability** of a tenants S3 state. Since our transactions are scoped per tenant, it is sufficient for linearizability to grant commit if and only if no other transaction has been started since the commit-requesting transaction started.
For example, consider the case of a single tenant, attached to Pageserver A. Pageserver A has an open transaction but becomes unresponsive. Control Plane decides to relocate the tenant to another Pageserver B. It need *not* wait for A to be 100%-certainly down before B can start uploading to S3 for that tenant. Instead, B can start a new transaction right away, make progress, and get commit grants; What about A? The transaction is RejectPending in Control Plane until A eventually becomes responsive again, tries to commit, gets a rejection, acknowledges it, and thus its transaction becomes RejectAcknowledge. If A is definitively dead, operator can also force-transition from state RejectPending to RejectAcknowledged. But critically, Control Plane doesnt have for As transaction to become RejectAcknowledge before attaching the tenant to B.
```mermaid
sequenceDiagram
participant CP
participant A
participant S3
participant B
CP -->> A: attach tenant
activate A
A -->> CP: start txn
CP -->> A: txn=23, last_committed_txn=22
Note over CP,A: network partition
CP --x A: heartbeat
CP --x A: heartbeat
Note over CP: relocate tenant to avoid downtime
CP -->> B: attach tenant
activate B
B -->> CP: start txn
Note over CP: mark A's txn 23 as RejectPending
CP -->> B: txn=24, last-committed txn is 22
B -->> S3: PUT X.layer.24<br>PUT index_part.json.24 referencing X.layer.24
B -->> CP: request commit
CP -->> B: granted
B -->> CP: start txn
CP -->> B: txn=25, last_committed_txn=22
A -->> S3: PUT Y.layer.23 <br> PUT index_part.json.23 referencing Y.layer.23
A --x CP: request commit
A --x CP: request commit
Note over CP,A: partition is over
A -->> CP: request commit
Note over CP: most recently started txn is 25, not 23, reject
CP -->> A: reject
A -->> CP: acknowledge reject
Note over CP: mark A's txn 23 as RejectAcknowledged
deactivate A
B -->> S3: PUT 000-FFF_X-Y.layer.**************25**************<br>...
deactivate B
```
If a Pageserver gets a rejection to a commit request, it acknowledges rejection and cedes further S3 uploads for the tenant, until it receives a `/detach` request for the tenant (control plane has most likely attached the tenant to another pageserver in the meantime).
In practice, Control Plane will probably extend the commit grant/reject schema above, taking into account the pageserver to which it last attached the tenant. In the above example, Control Plane could remember that the pageserver that is supposed to host the tenant is pageserver B, and reject start-txn and commit requests from pageserver A. It would also use such requests from A as a signal that A is reachable again, and retry the `/detach` .
<aside>
💡 A commit failure causes the tenant to become effectively `Broken`. Pageserver should persist this locally so it doesnt bother ControlPlane for a new txn when Pageserver is restarted.
</aside>
## ********************Visibility********************
We mentioned earlier that once a transaction commits, its changes are visible to subsequent transactions. But how does a given transaction know where to look for the data? There is no longer a single `index_part.json` per timeline, or a single `timelines/:timeline_id` prefix to look for; theyre all multi-versioned, suffixed by the txn number.
The solution is: at transaction start, Pageserver receives the last-committed transaction ID from Control Plane (`last_committed_txn` in the diagram). last_commited_txn is the upper bound for what is visible for the current transaction. Control Plane keeps track of each open transactions last_committed_txn for purposes of garbage collection (see later paragraph).
Equipped with last_committed_txn, Pageserver then discovers
- the current index part of a timeline at `tenants/:tenant_id/timelines/:timeline_id/index_part.json.$last_committed_txn`. The `index_part.json.$last_committed_txn` has the exact same contents as the current architectures index_part.json, i.e. full list of layers.
- the list of existent timelines as part of the `attach` RPC from CP;
There is no other S3 state per tenant, so, thats all the visibility required.
An alternative to receiving the list of existent timelines from CP is to introduce a proper **********SetOfTimelines********** object in S3, and multi-version it just like above. For example, we could have a `tenants/:tenant_id/timelines.json.$txn` file that references `index_part.json.$last_committed_txn` . It can be added later if more separation between CP and PS is desired.
So, the only MVCCed object types in this proposal are LayerFile and IndexPart (=individual timeline), but not the SetOfTimelines in a given tenant. Is this a problem? For example, the Pageservers garbage collection code needs to know the full set of timelines of a tenant. Otherwise itll make incorrect decisions. What if Pageserver A knows about timelines {R,S}, but another Pageserver B created an additional branch T, so, its set of timelines is {R,S,T}. Both pageservers will run GC code, and so, PS A may decide to delete a layer thats still needed for branch T. Not a problem with this propsoal, because the effect of GC (i.e., layer deletion) is properly MVCCed.
## Longevity Of Transactions & Availability
Pageserver depends on Control Plane to start a new transaction. If ControlPlane is down, no new transactions can be started.
Pageservers commit transactions based on a maximum amount of uncommitted changes that have accumulated in S3. A lower maximum increases dependence and load on ControlPlane which decreases availability. A higher maximum risks losing more work in the event of failover; the work will have to be re-done in a new transaction on the new node.
Pageservers are persist the open txn id in local storage, so that they can resume the transaction after restart, without dependence on Control Plane.
## **Operations**
********PUTs:********
- **layer files**
- current architecture: layer files are supposed to be write-once, but actually, there are edge-cases where we PUT the same layer file name twice; namely if we PUT the file to S3 but crash before uploading the index part that references it; then detach + attach, and re-run compaction, which is non-deterministic.
- this proposal: with transactions, we can now upload layers and index_part.json concurrently, just need to make sure layer file upload is done before we request txn commit.
- **index part** upload: `index_part.json.$txn` may be created and subsequently overwritten multiple times in a transaction; it is an availability/work-loss trade-off how often to request a commit from CP.
**************DELETEs**************: for deletion, we maintain a deadlist per transaction. It is located at `tenants/:tenant_id/deadlist/deadlist.json.$txn`. It is PUT once before the pageserver requests requests commit, and not changed after sending request to commit. An object created in the current txn need not (but can) be on the deadlist — it can be DELETEd immediately because its not visible to other transactions. An example use case would be an L0 layer that gets compacted within one transaction; or, if we ever start MVCCing the set of timelines of a tenant, a short-lived branch that is created & destroyed within one transaction.
<aside>
**Deadlist Invariant:** if a an object is on a deadlist of transaction T, it is not referenced from anywhere else in the full state visible to T or any later started transaction > T.
</aside>
### Rationale For Deadlist.json
Given that this proposal only MVCCs layers and indexparts, one may ask why the deadlist isnt part of indexpart. The reason is to not lose generality: the deadlist is just a list of keys; it is not necessary to understand the data format of the versioned object to process the deadlist. This is important for garbage collection / vacuuming, which well come to in the next section.
## Garbage Collection / Vacuuming
After a transaction has reached reject-acknowledged state, Control Plane initiates a garbage collection procedure for the aborted transaction.
Control Plane is in the unique position about transaction states. Here is a sketch of the exact transaction states and what Control Plane keeps track of.
```
struct Tenant {
...
txns: HashMap<TxnId, Transaction>,
// the most recently started txn's id; only most recently sarted can win
next_winner_txn: Option<TxnId>,
}
struct Transaction {
id: TxnId, // immutable
last_committed_txn: TxnId, // immutable; the most recent txn in state `Committed`
// when self was started
pageserver_id: PageserverId,
state: enum {
Open,
Committed,
RejectPending,
RejectAcknowledged, // invariant: we know all S3 activity has ceded
GarbageCollected,
}
}
```
Object creations & deletions by a rejected transaction have never been visible to other transactions. That is true for both RejectPending and RejectAcknowledged states. The difference is that, in RejectPending, the pageserver may still be uploading to S3, whereas in RejectAcknowledged, Control Plane can be certain that all S3 activity in the name of that transaction has ceded. So, once a transaction reaches state RejectAcknowledged state, it is safe to DELETE all objects created by that transaction, and discard the transactions deadlists.
A transaction T in state Committed has subsequent transactions that may or may not reference the objects it created. None of the subsequent transaction can reference the objects on Ts deadlist, though, as per the Deadlist Invariant (see previous section).
So, for garbage collection, we need to assess transactions in state Committed and RejectAcknowledged:
- Commited: delete objects on the deadlist.
- We dont need a LIST request here, the deadlist is sufficient. So, its really cheap.
- This is **not true MVCC garbage collection**; by deleting the objects on Committed transaction T s deadlist, we might delete data referenced by other transactions that were concurrent with T, i.e., they started while T was still open. However, the fact that T is committed means that the other transactions are RejectPending or RejectAcknowledged, so, they dont matter. Pageservers executing these doomed RejectPending transactions must handle 404 for GETs gracefully, e.g., by trying to commit txn so they observe the rejection theyre destined to get anyways. 404s for RejectAcknowledged is handled below.
- RejectAcknowledged: delete all objects created in that txn, and discard deadlists.
- 404s / object-already-deleted type messages must be expected because of Committed garbage collection (see above)
- How to get this list of objects created in a txn? Open but solvable design question; Ideas:
- **Brute force**: within tenant prefix, search for all keys ending in `.$txn` and delete them.
- **WAL for PUTs**: before a txn PUTs an object, it logs to S3, or some other equivalently durable storage, that its going to do it. If we log to S3, this means we have to do an additional WAL PUT per “readl” PUT.
- ******************************LIST with reorged S3 layout (preferred one right now):****************************** layout S3 key space such that `$txn` comes first, i.e., `tenants/:tenant_id/$txn/timelines/:timeline_id/*.json.$txn` . That way, when we need to GC a RejectAcknowledged txn, we just LIST the entire `tenants/:tenant_id/$txn` prefix and delete it. The cost of GC for RejectAcknowledged transactions is thus proportional to the number of objects created in that transaction.
## Branches
This proposal only MVCCs layer files and and index_part.json, but leaves the tenant object not-MVCCed. We argued earlier that its fine to ignore this for now, because
1. Control Plane can act as source-of-truth for the set of timelines, and
2. The only operation that makes decision based on “set of timelines” is GC, which in turn only does layer deletions, and layer deletions ***are*** properly MVCCed.
Now that weve introduced garbage collection, lets elaborate a little more on (2). Recall our example from earlier: Pageserver A knows about timelines {R,S}, but another Pageserver B created an additional branch T, so, its set of timelines is {R,S,T}. Both pageservers will run GC code, and so, PS A may decide to delete a layer thats still needed for branch T.
How does the MVCCing of layer files protect us here? If A decides to delete that layer, its just on As transactions deadlist, but still present in S3 and usable by B. If A commits first, B wont be able to commit and the layers in timeline T will be vacuumed. If B commits first, As deadlist is discarded and the layer continues to exist.
## Safekeeper Changes
We need to teach the safekeepers that there can be multiple pageservers requesting WAL for the same timeline, in order to prevent premature WAL truncation.
In the current architecture, the Safekeeper service currently assumes only one Pageserver and is allowed to prune WAL older than that Pageservers `remote_consistent_lsn`. Safekeeper currently learns the `remote_consistent_lsn` through the walreceiver protocol.
So, if we have a tenant attached to two pageservers at the same time, they will both try to stream WAL and the Safekeeper will get confused about which connections `remote_consistent_lsn` to use as a basis for WAL pruning.
What do we need to change to make it work? We need to make sure that the Safekeepers only prune WAL up to the `remote_consistent_lsn` of the last-committed transaction.
The straight-forward way to get it is to re-design WAL pruning as follows:
1. Pageserver reports remote_consistent_lsn as part of transaction commit to Control Plane.
2. Control Plane makes sure transaction state update is persisted.
3. Control Plane (asynchronous to transaction commit) reconciles with Safekeepers to ensure WAL pruning happens.
The above requires non-trivial changes, but, in the light of other planned projects such as restore-tenant-from-safekeeper-wal-backups, I think Control Plane will need to get involved in WAL pruning anyways.
# How This Proposal Unlocks Future Features
Let us revisit the example from the introduction where we were thinking about handling network partitions. Network partitions need to be solved first, because theyre unavoidable in distributed systems. We did that. Now lets see how we can solve actual product problems:
## **Fast, Zero-Toil Failover on Network Partitions or Instance Failure**
The “Problem Statement” section outlined the current architectures problems with regards to network partitions or instance failure: it requires a 100% correct node-dead detector to make decisions, which doesnt exist in reality. We rely instead on human toil: an oncall engineer has to inspect the situation and make a decision, which may be incorrect and in any case take time in the order of minutes, which means equivalent downtime for users.
With this proposal, automatic failover for pageservers is trivial:
If a pageserver is unresponsive from Control Planes / Computes perspective, Control Plane does the following:
- attach all tenants of the unresponsive pageserver to new pageservers
- switch over these tenants computes immediately;
At this point, availability is restored and user pain relieved.
Whats left is to somehow close the doomed transaction of the unresponsive pageserver, so that it beomes RejectAcknowledged, and GC can make progress. Since S3 is cheap, we can afford to wait a really long time here, especially if we put a soft bound on the amount of data a transaction may produce before it must commit. Procedure:
1. Ensure the unresponsive pageserver is taken out of rotation for new attachments. That probably should happen as part of the routine above.
2. Make a human operator investigate decide what to do (next morning, NO ONCALL ALERT):
1. Inspect the instance, investigate logs, understand root cause.
2. Try to re-establish connectivity between pageserver and Control Plane so that pageserver can retry commits, get rejected, ack rejection ⇒ enable GC.
3. Use below procedure to decomission pageserver.
### Decomissioning A Pageserver (Dead or Alive-but-Unrespsonive)
The solution, enabled by this proposal:
1. Ensure that pageservers S3 credentials are revoked so that it cannot make new uploads, which wouldnt be tracked anywhere.
2. Let enough time pass for the S3 credential revocation to propagate. Amazon doesnt give a guarantee here. As stated earlier, we can easily afford to wait here.
3. Mark all Open and RejectPending transactions of that pageserver as RejectAcknowledge.
Revocation of the S3 credentials is required so that, once we transition all the transactions of that pageserver to RejectAcknowledge, once garbage-collection pass is guaranteed to delete all objects that will ever exist for that pageserver. That way, we need not check *****GarbageCollected***** transactions every again.
## Workflow: Zero-Downtime Relocation
With zero-downtime relocation, the goal is to have the target pageserver warmed up, i.e., at the same `last_record_lsn` as the source pageserver, before switching over Computes from source to target pageserver.
With this proposal, it works like so:
1. Grant source pageserver its last open transaction. This one is doomed to be rejected later, unless the relocation fails.
2. Grant target pageserver its first open transaction.
3. Have target pageserver catch up on WAL, streaming from last-committed-txns remote_consistent_lsn onwards.
4. Once target pageserver reports `last_record_lsn` close enough to source pageserver, target pageserver requests commit.
5. Drain compute traffic from source to target pageserver. (Source can still answer requests until it tries to commit and gets reject, so, this will be quite smooth).
Note that as soon as we complete step (4), the source pageservers transaction is doomed to be rejected later. Conversely, if the target cant catch up fast enough, the source will make a transaction commit earlier. This will generally happen if there is a lot of write traffic coming in. The design space to make thing smooth here is large, but well explored in other areas of computing, e.g., VM live migration. We have all the important policy levers at hand, e.g.,
- delaying source commits if we see target making progress
- slowing down source consumption (need some signalling mechanism for it)
- slowing down compute wal generation
-
It doesnt really matter, whats important is that two pageservers can overlap.
# Additional Trade-Offs / Remarks Brought Up During Peer Review
This proposal was read by and discussed @Stas and @Dmitry Rodionov prior to publishing it with the broader team. (This does not mean they endorse this proposal!).
Issues that we discussed:
1. **Frequency of transactions:** If even idle tenants commit every 10min or so, thats quite a lot of load on Control Plane. Can we minimize it by Equating Transaction Commit Period to Attachment Period? I.e. start txn on attach, commit on detach?
1. Would be nice, but, if a tenant is attached for 1 month, then PS dies, we lose 1 month of work.
2. ⇒ my solution to this problem: Adjusted this proposal to make transaction commit frequency proportional to amount of uncommitted data.
1. Its ok to spend resources on active users, they pay us money to do it!
2. The amount of work per transaction is minimal.
1. In current Control Plane, its a small database transaction that is super unlikely to conflict with other transactions.
2. I have very little concerns about scalability of the commit workload on CP side because it's trivially horizontally scalable by sharding by tenant.
3. There's no super stringent availability requirement on control plane; if a txn can't commit because it can't reach the CP, PS can continue & retry in the background, speculating that it's CP downtime and not PS-partitioned-off scenario.
4. Without stringent availability requirement, there's flexibility for future changes to CP-side-implementation.
2. ************************************************Does this proposal address mirroring / no-performance-degradation failover ?************************************************
1. No it doesnt. It only provides the building block for attaching a tenant to a new pageserver without having to worry that the tenant is detached on the old pageserver.
2. A simple scheme to build no-performance-degradation failover on top of this proposal is to have an asynchronous read-only replica of a tenant on another pageserver in the same region.
3. Another more ambitious scheme to get no-performance-degradation would be [One-Pager: Layer File Spreading (Christian)](https://www.notion.so/One-Pager-Layer-File-Spreading-Christian-eb6b64182a214e11b3fceceee688d843?pvs=21); this proposal would be used in layer file spreading for risk-free automation of TenantLeader failover, which hasnt been addressed Ithere.
4. In any way, failover would restart from an older S3 state, and need to re-ingest WAL before being able to server recently written pages.
1. Is that a show-stopper? I think not.
2. Is it suboptimal? Absolutely: if a pageserver instance fails, all its tenants will be distributed among the remaining pageservers (OK), and all these tenants will ask the safekeepers for WAL at the same time (BAD). So, pageserver instance failure will cause a load spike in safekeepers.
1. Personally I think thats an OK trade-off to make.
2. There are countless options to avoid / mitigate the load spike. E.g., pro-actively streaming WAL to the standby read-only replica.
3. ********************************************Does this proposal allow multiple writers for a tenant?********************************************
1. In abstract terms, this proposal provides a linearized history for a given S3 prefix.
2. In concrete terms, this proposal provides a linearized history per tenant.
3. There can be multiple writers at a given time, but only one of them will win to become part of the linearized history.
4. ************************************************************************************Alternative ideas mentioned during meetings that should be turned into a written prospoal like this one:************************************************************************************
1. @Dmitry Rodionov : having linearized storage of index_part.json in some database that allows serializable transactions / atomic compare-and-swap PUT
2. @Dmitry Rodionov :
3. @Stas : something like this scheme, but somehow find a way to equate attachment duration with transaction duration, without losing work if pageserver dies months after attachment.

View File

@@ -0,0 +1,281 @@
# Crash-Consistent Layer Map Updates By Leveraging `index_part.json`
* Created on: Aug 23, 2023
* Author: Christian Schwarz
## Summary
This RFC describes a simple scheme to make layer map updates crash consistent by leveraging the `index_part.json` in remote storage.
Without such a mechanism, crashes can induce certain edge cases in which broadly held assumptions about system invariants don't hold.
## Motivation
### Background
We can currently easily make complex, atomic updates to the layer map by means of an RwLock.
If we crash or restart pageserver, we reconstruct the layer map from:
1. local timeline directory contents
2. remote `index_part.json` contents.
The function that is responsible for this is called `Timeline::load_layer_map()`.
The reconciliation process's behavior is the following:
* local-only files will become part of the layer map as local-only layers and rescheduled for upload
* For a file name that, by its name, is present locally and in the remote `index_part.json`, but where the local file has a different size (future: checksum) than the remote file, we will delete the local file and leave the remote file as a `RemoteLayer` in the layer map.
### The Problem
There are are cases where we need to make an atomic update to the layer map that involves **more than one layer**.
The best example is compaction, where we need to insert the L1 layers generated from the L0 layers, and remove the L0 layers.
As stated above, making the update to the layer map in atomic way is trivial.
But, there is no system call API to make an atomic update to a directory that involves more than one file rename and deletion.
Currently, we issue the system calls one by one and hope we don't crash.
What happens if we crash and restart in the middle of that system call sequence?
We will reconstruct the layer map according to the reconciliation process, taking as input whatever transitory state the timeline directory ended up in.
We cannot roll back or complete the timeline directory update during which we crashed, because we keep no record of the changes we plan to make.
### Problem's Implications For Compaction
The implications of the above are primarily problematic for compaction.
Specifically, the part of it that compacts L0 layers into L1 layers.
Remember that compaction takes a set of L0 layers and reshuffles the delta records in them into L1 layer files.
Once the L1 layer files are written to disk, it atomically removes the L0 layers from the layer map and adds the L1 layers to the layer map.
It then deletes the L0 layers locally, and schedules an upload of the L1 layers and and updated index part.
If we crash before deleting L0s, but after writing out L1s, the next compaction after restart will re-digest the L0s and produce new L1s.
This means the compaction after restart will **overwrite** the previously written L1s.
Currently we also schedule an S3 upload of the overwritten L1.
If the compaction algorithm doesn't change between the two compaction runs, is deterministic, and uses the same set of L0s as input, then the second run will produce identical L1s and the overwrites will go unnoticed.
*However*:
1. the file size of the overwritten L1s may not be identical, and
2. the bit pattern of the overwritten L1s may not be identical, and,
3. in the future, we may want to make the compaction code non-determinstic, influenced by past access patterns, or otherwise change it, resulting in L1 overwrites with a different set of delta records than before the overwrite
The items above are a problem for the [split-brain protection RFC](https://github.com/neondatabase/neon/pull/4919) because it assumes that layer files in S3 are only ever deleted, but never replaced (overPUTted).
For example, if an unresponsive node A becomes active again after control plane has relocated the tenant to a new node B, the node A may overwrite some L1s.
But node B based its world view on the version of node A's `index_part.json` from _before_ the overwrite.
That earlier `index_part.json`` contained the file size of the pre-overwrite L1.
If the overwritten L1 has a different file size, node B will refuse to read data from the overwritten L1.
Effectively, the data in the L1 has become inaccessible to node B.
If node B already uploaded an index part itself, all subsequent attachments will use node B's index part, and run into the same probem.
If we ever introduce checksums instead of checking just the file size, then a mismatching bit pattern (2) will cause similar problems.
In case of (1) and (2), where we know that the logical content of the layers is still the same, we can recover by manually patching the `index_part.json` of the new node to the overwritten L1's file size / checksum.
But if (3) ever happens, the logical content may be different, and, we could have truly lost data.
Given the above considerations, we should avoid making correctness of split-brain protection dependent on overwrites preserving _logical_ layer file contents.
**It is a much cleaner separation of concerns to require that layer files are truly immutable in S3, i.e., PUT once and then only DELETEd, never overwritten (overPUTted).**
## Design
Instead of reconciling a layer map from local timeline directory contents and remote index part, this RFC proposes to view the remote index part as authoritative during timeline load.
Local layer files will be recognized if they match what's listed in remote index part, and removed otherwise.
During **timeline load**, the only thing that matters is the remote index part content.
Essentially, timeline load becomes much like attach, except we don't need to prefix-list the remote timelines.
The local timeline dir's `metadata` file does not matter.
The layer files in the local timeline dir are seen as a nice-to-have cache of layer files that are in the remote index part.
Any layer files in the local timeline dir that aren't in the remote index part are removed during startup.
The `Timeline::load_layer_map()` no longer "merges" local timeline dir contents with the remote index part.
Instead, it treats the remote index part as the authoritative layer map.
If the local timeline dir contains a layer that is in the remote index part, that's nice, and we'll re-use it if file size (and in the future, check sum) match what's stated in the index part.
If it doesn't match, we remove the file from the local timeline dir.
After load, **at runtime**, nothing changes compared to what we did before this RFC.
The procedure for single- and multi-object changes is reproduced here for reference:
* For any new layers that the change adds:
* Write them to a temporary location.
* While holding layer map lock:
* Move them to the final location.
* Insert into layer map.
* Make the S3 changes.
We won't reproduce the remote timeline client method calls here because these are subject to change.
Instead we reproduce the sequence of s3 changes that must result for a given single-/multi-object change:
* PUT layer files inserted by the change.
* PUT an index part that has insertions and deletions of the change.
* DELETE the layer files that are deleted by the change.
Note that it is safe for the DELETE to be deferred arbitrarily.
* If it never happens, we leak the object, but, that's not a correctness concern.
* As of #4938, we don't schedule the remote timeline client operation for deletion immediately, but, only when we drop the `LayerInner`.
* With the [split-brain protection RFC](https://github.com/neondatabase/neon/pull/4919), the deletions will be written to deletion queue for processing when it's safe to do so (see the RFC for details).
## How This Solves The Problem
If we crash before we've finished the S3 changes, then timeline load will reset layer map to the state that's in the S3 index part.
The S3 change sequence above is obviously crash-consistent.
If we crash before the index part PUT, then we leak the inserted layer files to S3.
If we crash after the index part PUT, we leak the to-be-DELETEd layer files to S3.
Leaking is fine, it's a pre-existing condition and not addressed in this RFC.
Multi-object changes that previously created and removed files in timeline dir are now atomic because the layer map updates are atomic and crash consistent:
* atomic layer map update at runtime, currently by using an RwLock in write mode
* atomic `index_part.json` update in S3, as per guarantee that S3 PUT is atomic
* local timeline dir state:
* irrelevant for layer map content => irrelevant for atomic updates / crash consistency
* if we crash after index part PUT, local layer files will be used, so, no on-demand downloads neede for them
* if we crash before index part PUT, local layer files will be deleted
## Trade-Offs
### Fundamental
If we crash before finishing the index part PUT, we lose all the work that hasn't reached the S3 `index_part.json`:
* wal ingest: we lose not-yet-uploaded L0s; load on the **safekeepers** + work for pageserver
* compaction: we lose the entire compaction iteration work; need to re-do it again
* gc: no change to what we have today
If the work is still deemed necessary after restart, the restarted restarted pageserver will re-do this work.
The amount of work to be re-do is capped to the lag of S3 changes to the local changes.
Assuming upload queue allows for unlimited queue depth (that's what it does today), this means:
* on-demand downloads that were needed to do the work: are likely still present, not lost
* wal ingest: currently unbounded
* L0 => L1 compaction: CPU time proportional to `O(sum(L0 size))` and upload work proportional to `O()`
* Compaction threshold is 10 L0s and each L0 can be up to 256M in size. Target size for L1 is 128M.
* In practive, most L0s are tiny due to 10minute `DEFAULT_CHECKPOINT_TIMEOUT`.
* image layer generation: CPU time `O(sum(input data))` + upload work `O(sum(new image layer size))`
* I have no intuition how expensive / long-running it is in reality.
* gc: `update_gc_info`` work (not substantial, AFAIK)
To limit the amount of lost upload work, and ingest work, we can limit the upload queue depth (see suggestions in the next sub-section).
However, to limit the amount of lost CPU work, we would need a way to make make the compaction/image-layer-generation algorithms interruptible & resumable.
We aren't there yet, the need for it is tracked by ([#4580](https://github.com/neondatabase/neon/issues/4580)).
However, this RFC is not constraining the design space either.
### Practical
#### Pageserver Restarts
Pageserver crashes are very rare ; it would likely be acceptable to re-do the lost work in that case.
However, regular pageserver restart happen frequently, e.g., during weekly deploys.
In general, pageserver restart faces the problem of tenants that "take too long" to shut down.
They are a problem because other tenants that shut down quickly are unavailble while we wait for the slow tenants to shut down.
We currently allot 10 seconds for graceful shutdown until we SIGKILL the pageserver process (as per `pageserver.service` unit file).
A longer budget would expose tenants that are done early to a longer downtime.
A short budget would risk throwing away more work that'd have to be re-done after restart.
In the context of this RFC, killing the process would mean losing the work that hasn't made it to S3.
We can mitigate this problem as follows:
0. initially, by accepting that we need to do the work again
1. short-term, introducing measures to cap the amount of in-flight work:
- cap upload queue length, use backpressure to slow down compaction
- disabling compaction/image-layer-generation X minutes before `systemctl restart pageserver`
- introducing a read-only shutdown state for tenants that are fast to shut down;
that state would be equivalent to the state of a tenant in hot standby / readonly mode.
2. mid term, by not restarting pageserver in place, but using [*seamless tenant migration*](https://github.com/neondatabase/neon/pull/5029) to drain a pageserver's tenants before we restart it.
#### `disk_consistent_lsn` can go backwards
`disk_consistent_lsn` can go backwards across restarts if we crash before we've finished the index part PUT.
Nobody should care about it, because the only thing that matters is `remote_consistent_lsn`.
Compute certainly doesn't care about `disk_consistent_lsn`.
## Side-Effects Of This Design
* local `metadata` is basically reduced to a cache of which timelines exist for this tenant; i.e., we can avoid a `ListObjects` requests for a tenant's timelines during tenant load.
## Limitations
Multi-object changes that span multiple timelines aren't covered by this RFC.
That's fine because we currently don't need them, as evidenced by the absence
of a Pageserver operation that holds multiple timelines' layer map lock at a time.
## Impacted components
Primarily pageservers.
Safekeepers will experience more load when we need to re-ingest WAL because we've thrown away work.
No changes to safekeepers are needed.
## Alternatives considered
### Alternative 1: WAL
We could have a local WAL for timeline dir changes, as proposed here https://github.com/neondatabase/neon/issues/4418 and partially implemented here https://github.com/neondatabase/neon/pull/4422 .
The WAL would be used to
1. make multi-object changes atomic
2. replace `reconcile_with_remote()` reconciliation: scheduling of layer upload would be part of WAL replay.
The WAL is appealing in a local-first world, but, it's much more complex than the design described above:
* New on-disk state to get right.
* Forward- and backward-compatibility development costs in the future.
### Alternative 2: Flow Everything Through `index_part.json`
We could have gone to the other extreme and **only** update the layer map whenever we've PUT `index_part.json`.
I.e., layer map would always be the last-persisted S3 state.
That's axiomatically beautiful, not least because it fully separates the layer file production and consumption path (=> [layer file spreading proposal](https://www.notion.so/neondatabase/One-Pager-Layer-File-Spreading-Christian-eb6b64182a214e11b3fceceee688d843?pvs=4)).
And it might make hot standbys / read-only pageservers less of a special case in the future.
But, I have some uncertainties with regard to WAL ingestion, because it needs to be able to do some reads for the logical size feedback to safekeepers.
And it's silly that we wouldn't be able to use the results of compaction or image layer generation before we're done with the upload.
Lastly, a temporarily clogged-up upload queue (e.g. S3 is down) shouldn't immediately render ingestion unavailable.
### Alternative 3: Sequence Numbers For Layers
Instead of what's proposed in this RFC, we could use unique numbers to identify layer files:
```
# before
tenants/$tenant/timelines/$timeline/$key_and_lsn_range
# after
tenants/$tenant/timelines/$timeline/$layer_file_id-$key_and_lsn_range
```
To guarantee uniqueness, the unqiue number is a sequence number, stored in `index_part.json`.
This alternative does not solve atomic layer map updates.
In our crash-during-compaction scenario above, the compaction run after the crash will not overwrite the L1s, but write/PUT new files with new sequence numbers.
In fact, this alternative makes it worse because the data is now duplicated in the not-overwritten and overwritten L1 layer files.
We'd need to write a deduplication pass that checks if perfectly overlapping layers have identical contents.
However, this alternative is appealing because it systematically prevents overwrites at a lower level than this RFC.
So, this alternative is sufficient for the needs of the split-brain safety RFC (immutable layer files locally and in S3).
But it doesn't solve the problems with crash-during-compaction outlined earlier in this RFC, and in fact, makes it much more accute.
The proposed design in this RFC addresses both.
So, if this alternative sounds appealing, we should implement the proposal in this RFC first, then implement this alternative on top.
That way, we avoid a phase where the crash-during-compaction problem is accute.
## Related issues
- https://github.com/neondatabase/neon/issues/4749
- https://github.com/neondatabase/neon/issues/4418
- https://github.com/neondatabase/neon/pull/4422
- https://github.com/neondatabase/neon/issues/5077
- https://github.com/neondatabase/neon/issues/4088
- (re)resolutions:
- https://github.com/neondatabase/neon/pull/4696
- https://github.com/neondatabase/neon/pull/4094
- https://neondb.slack.com/archives/C033QLM5P7D/p1682519017949719
Note that the test case introduced in https://github.com/neondatabase/neon/pull/4696/files#diff-13114949d1deb49ae394405d4c49558adad91150ba8a34004133653a8a5aeb76 will produce L1s with the same logical content, but, as outlined in the last paragraph of the _Problem Statement_ section above, we don't want to make that assumption in order to fix the problem.
## Implementation Plan
1. Remove support for `remote_storage=None`, because we now rely on the existence of an index part.
- The nasty part here is to fix all the tests that fiddle with the local timeline directory.
Possibly they are just irrelevant with this change, but, each case will require inspection.
2. Implement the design above.
- Initially, ship without the mitigations for restart and accept we will do some work twice.
- Measure the impact and implement one of the mitigations.

View File

@@ -10,6 +10,9 @@ chrono.workspace = true
serde.workspace = true
serde_with.workspace = true
serde_json.workspace = true
regex.workspace = true
utils = { path = "../utils" }
remote_storage = { version = "0.1", path = "../remote_storage/" }
workspace_hack.workspace = true

View File

@@ -107,7 +107,6 @@ pub struct ComputeMetrics {
pub num_ext_downloaded: u64,
pub largest_ext_size: u64, // these are measured in bytes
pub total_ext_download_size: u64,
pub prep_extensions_ms: u64,
}
/// Response of the `/computes/{compute_id}/spec` control-plane API.

View File

@@ -3,11 +3,16 @@
//! The spec.json file is used to pass information to 'compute_ctl'. It contains
//! all the information needed to start up the right version of PostgreSQL,
//! and connect it to the storage nodes.
use std::collections::HashMap;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
use regex::Regex;
use remote_storage::RemotePath;
/// String type alias representing Postgres identifier and
/// intended to be used for DB / role names.
pub type PgIdent = String;
@@ -61,8 +66,78 @@ pub struct ComputeSpec {
/// the pageserver and safekeepers.
pub storage_auth_token: Option<String>,
// list of prefixes to search for custom extensions in remote extension storage
// information about available remote extensions
pub remote_extensions: Option<RemoteExtSpec>,
}
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
pub struct RemoteExtSpec {
pub public_extensions: Option<Vec<String>>,
pub custom_extensions: Option<Vec<String>>,
pub library_index: HashMap<String, String>,
pub extension_data: HashMap<String, ExtensionData>,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct ExtensionData {
pub control_data: HashMap<String, String>,
pub archive_path: String,
}
impl RemoteExtSpec {
pub fn get_ext(
&self,
ext_name: &str,
is_library: bool,
build_tag: &str,
pg_major_version: &str,
) -> anyhow::Result<(String, RemotePath)> {
let mut real_ext_name = ext_name;
if is_library {
// sometimes library names might have a suffix like
// library.so or library.so.3. We strip this off
// because library_index is based on the name without the file extension
let strip_lib_suffix = Regex::new(r"\.so.*").unwrap();
let lib_raw_name = strip_lib_suffix.replace(real_ext_name, "").to_string();
real_ext_name = self
.library_index
.get(&lib_raw_name)
.ok_or(anyhow::anyhow!("library {} is not found", lib_raw_name))?;
}
// Check if extension is present in public or custom.
// If not, then it is not allowed to be used by this compute.
if let Some(public_extensions) = &self.public_extensions {
if !public_extensions.contains(&real_ext_name.to_string()) {
if let Some(custom_extensions) = &self.custom_extensions {
if !custom_extensions.contains(&real_ext_name.to_string()) {
return Err(anyhow::anyhow!("extension {} is not found", real_ext_name));
}
}
}
}
match self.extension_data.get(real_ext_name) {
Some(_ext_data) => {
// Construct the path to the extension archive
// BUILD_TAG/PG_MAJOR_VERSION/extensions/EXTENSION_NAME.tar.zst
//
// Keep it in sync with path generation in
// https://github.com/neondatabase/build-custom-extensions/tree/main
let archive_path_str =
format!("{build_tag}/{pg_major_version}/extensions/{real_ext_name}.tar.zst");
Ok((
real_ext_name.to_string(),
RemotePath::from_string(&archive_path_str)?,
))
}
None => Err(anyhow::anyhow!(
"real_ext_name {} is not found",
real_ext_name
)),
}
}
}
#[serde_as]

View File

@@ -205,5 +205,43 @@
"name": "zenith new",
"new_name": "zenith \"new\""
}
]
],
"remote_extensions": {
"library_index": {
"anon": "anon",
"postgis-3": "postgis",
"libpgrouting-3.4": "postgis",
"postgis_raster-3": "postgis",
"postgis_sfcgal-3": "postgis",
"postgis_topology-3": "postgis",
"address_standardizer-3": "postgis"
},
"extension_data": {
"anon": {
"archive_path": "5834329303/v15/extensions/anon.tar.zst",
"control_data": {
"anon.control": "# PostgreSQL Anonymizer (anon) extension\ncomment = ''Data anonymization tools''\ndefault_version = ''1.1.0''\ndirectory=''extension/anon''\nrelocatable = false\nrequires = ''pgcrypto''\nsuperuser = false\nmodule_pathname = ''$libdir/anon''\ntrusted = true\n"
}
},
"postgis": {
"archive_path": "5834329303/v15/extensions/postgis.tar.zst",
"control_data": {
"postgis.control": "# postgis extension\ncomment = ''PostGIS geometry and geography spatial types and functions''\ndefault_version = ''3.3.2''\nmodule_pathname = ''$libdir/postgis-3''\nrelocatable = false\ntrusted = true\n",
"pgrouting.control": "# pgRouting Extension\ncomment = ''pgRouting Extension''\ndefault_version = ''3.4.2''\nmodule_pathname = ''$libdir/libpgrouting-3.4''\nrelocatable = true\nrequires = ''plpgsql''\nrequires = ''postgis''\ntrusted = true\n",
"postgis_raster.control": "# postgis_raster extension\ncomment = ''PostGIS raster types and functions''\ndefault_version = ''3.3.2''\nmodule_pathname = ''$libdir/postgis_raster-3''\nrelocatable = false\nrequires = postgis\ntrusted = true\n",
"postgis_sfcgal.control": "# postgis topology extension\ncomment = ''PostGIS SFCGAL functions''\ndefault_version = ''3.3.2''\nrelocatable = true\nrequires = postgis\ntrusted = true\n",
"postgis_topology.control": "# postgis topology extension\ncomment = ''PostGIS topology spatial types and functions''\ndefault_version = ''3.3.2''\nrelocatable = false\nschema = topology\nrequires = postgis\ntrusted = true\n",
"address_standardizer.control": "# address_standardizer extension\ncomment = ''Used to parse an address into constituent elements. Generally used to support geocoding address normalization step.''\ndefault_version = ''3.3.2''\nrelocatable = true\ntrusted = true\n",
"postgis_tiger_geocoder.control": "# postgis tiger geocoder extension\ncomment = ''PostGIS tiger geocoder and reverse geocoder''\ndefault_version = ''3.3.2''\nrelocatable = false\nschema = tiger\nrequires = ''postgis,fuzzystrmatch''\nsuperuser= false\ntrusted = true\n",
"address_standardizer_data_us.control": "# address standardizer us dataset\ncomment = ''Address Standardizer US dataset example''\ndefault_version = ''3.3.2''\nrelocatable = true\ntrusted = true\n"
}
}
},
"custom_extensions": [
"anon"
],
"public_extensions": [
"postgis"
]
}
}

View File

@@ -0,0 +1,52 @@
//! Types in this file are for pageserver's upward-facing API calls to the control plane,
//! required for acquiring and validating tenant generation numbers.
//!
//! See docs/rfcs/025-generation-numbers.md
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::id::{NodeId, TenantId};
#[derive(Serialize, Deserialize)]
pub struct ReAttachRequest {
pub node_id: NodeId,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct ReAttachResponseTenant {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
pub generation: u32,
}
#[derive(Serialize, Deserialize)]
pub struct ReAttachResponse {
pub tenants: Vec<ReAttachResponseTenant>,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct ValidateRequestTenant {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
pub gen: u32,
}
#[derive(Serialize, Deserialize)]
pub struct ValidateRequest {
pub tenants: Vec<ValidateRequestTenant>,
}
#[derive(Serialize, Deserialize)]
pub struct ValidateResponse {
pub tenants: Vec<ValidateResponseTenant>,
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct ValidateResponseTenant {
#[serde_as(as = "DisplayFromStr")]
pub id: TenantId,
pub valid: bool,
}

View File

@@ -1,6 +1,7 @@
use const_format::formatcp;
/// Public API types
pub mod control_api;
pub mod models;
pub mod reltag;

View File

@@ -194,10 +194,22 @@ pub struct TimelineCreateRequest {
pub struct TenantCreateRequest {
#[serde_as(as = "DisplayFromStr")]
pub new_tenant_id: TenantId,
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub generation: Option<u32>,
#[serde(flatten)]
pub config: TenantConfig, // as we have a flattened field, we should reject all unknown fields in it
}
#[serde_as]
#[derive(Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct TenantLoadRequest {
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub generation: Option<u32>,
}
impl std::ops::Deref for TenantCreateRequest {
type Target = TenantConfig;
@@ -241,15 +253,6 @@ pub struct StatusResponse {
pub id: NodeId,
}
impl TenantCreateRequest {
pub fn new(new_tenant_id: TenantId) -> TenantCreateRequest {
TenantCreateRequest {
new_tenant_id,
config: TenantConfig::default(),
}
}
}
#[serde_as]
#[derive(Serialize, Deserialize, Debug)]
#[serde(deny_unknown_fields)]
@@ -293,9 +296,11 @@ impl TenantConfigRequest {
}
}
#[derive(Debug, Serialize, Deserialize)]
#[derive(Debug, Deserialize)]
pub struct TenantAttachRequest {
pub config: TenantAttachConfig,
#[serde(default)]
pub generation: Option<u32>,
}
/// Newtype to enforce deny_unknown_fields on TenantConfig for
@@ -376,6 +381,8 @@ pub struct TimelineInfo {
pub pg_version: u32,
pub state: TimelineState,
pub walreceiver_status: String,
}
#[derive(Debug, Clone, Serialize)]

View File

@@ -10,9 +10,11 @@ should be auto-generated too, but that's a TODO.
The PostgreSQL on-disk file format is not portable across different
CPU architectures and operating systems. It is also subject to change
in each major PostgreSQL version. Currently, this module supports
PostgreSQL v14 and v15: bindings and code that depends on them are version-specific.
This code is organized in modules: `postgres_ffi::v14` and `postgres_ffi::v15`
Version independend code is explicitly exported into shared `postgres_ffi`.
PostgreSQL v14, v15 and v16: bindings and code that depends on them are
version-specific.
This code is organized in modules `postgres_ffi::v14`, `postgres_ffi::v15` and
`postgres_ffi::v16`. Version independent code is explicitly exported into
shared `postgres_ffi`.
TODO: Currently, there is also some code that deals with WAL records

View File

@@ -56,7 +56,7 @@ fn main() -> anyhow::Result<()> {
PathBuf::from("pg_install")
};
for pg_version in &["v14", "v15"] {
for pg_version in &["v14", "v15", "v16"] {
let mut pg_install_dir_versioned = pg_install_dir.join(pg_version);
if pg_install_dir_versioned.is_relative() {
let cwd = env::current_dir().context("Failed to get current_dir")?;
@@ -125,6 +125,7 @@ fn main() -> anyhow::Result<()> {
.allowlist_var("PG_CONTROLFILEDATA_OFFSETOF_CRC")
.allowlist_type("PageHeaderData")
.allowlist_type("DBState")
.allowlist_type("RelMapFile")
// Because structs are used for serialization, tell bindgen to emit
// explicit padding fields.
.explicit_padding(true)

View File

@@ -51,11 +51,59 @@ macro_rules! for_all_postgres_versions {
($macro:tt) => {
$macro!(v14);
$macro!(v15);
$macro!(v16);
};
}
for_all_postgres_versions! { postgres_ffi }
/// dispatch_pgversion
///
/// Run a code block in a context where the postgres_ffi bindings for a
/// specific (supported) PostgreSQL version are `use`-ed in scope under the pgv
/// identifier.
/// If the provided pg_version is not supported, we panic!(), unless the
/// optional third argument was provided (in which case that code will provide
/// the default handling instead).
///
/// Use like
///
/// dispatch_pgversion!(my_pgversion, { pgv::constants::XLOG_DBASE_CREATE })
/// dispatch_pgversion!(my_pgversion, pgv::constants::XLOG_DBASE_CREATE)
///
/// Other uses are for macro-internal purposes only and strictly unsupported.
///
#[macro_export]
macro_rules! dispatch_pgversion {
($version:expr, $code:expr) => {
dispatch_pgversion!($version, $code, panic!("Unknown PostgreSQL version {}", $version))
};
($version:expr, $code:expr, $invalid_pgver_handling:expr) => {
dispatch_pgversion!(
$version => $code,
default = $invalid_pgver_handling,
pgversions = [
14 : v14,
15 : v15,
16 : v16,
]
)
};
($pgversion:expr => $code:expr,
default = $default:expr,
pgversions = [$($sv:literal : $vsv:ident),+ $(,)?]) => {
match ($pgversion) {
$($sv => {
use $crate::$vsv as pgv;
$code
},)+
_ => {
$default
}
}
};
}
pub mod pg_constants;
pub mod relfile_utils;
@@ -90,13 +138,7 @@ pub use v14::xlog_utils::XLogFileName;
pub use v14::bindings::DBState_DB_SHUTDOWNED;
pub fn bkpimage_is_compressed(bimg_info: u8, version: u32) -> anyhow::Result<bool> {
match version {
14 => Ok(bimg_info & v14::bindings::BKPIMAGE_IS_COMPRESSED != 0),
15 => Ok(bimg_info & v15::bindings::BKPIMAGE_COMPRESS_PGLZ != 0
|| bimg_info & v15::bindings::BKPIMAGE_COMPRESS_LZ4 != 0
|| bimg_info & v15::bindings::BKPIMAGE_COMPRESS_ZSTD != 0),
_ => anyhow::bail!("Unknown version {}", version),
}
dispatch_pgversion!(version, Ok(pgv::bindings::bkpimg_is_compressed(bimg_info)))
}
pub fn generate_wal_segment(
@@ -107,11 +149,11 @@ pub fn generate_wal_segment(
) -> Result<Bytes, SerializeError> {
assert_eq!(segno, lsn.segment_number(WAL_SEGMENT_SIZE));
match pg_version {
14 => v14::xlog_utils::generate_wal_segment(segno, system_id, lsn),
15 => v15::xlog_utils::generate_wal_segment(segno, system_id, lsn),
_ => Err(SerializeError::BadInput),
}
dispatch_pgversion!(
pg_version,
pgv::xlog_utils::generate_wal_segment(segno, system_id, lsn),
Err(SerializeError::BadInput)
)
}
pub fn generate_pg_control(
@@ -120,11 +162,11 @@ pub fn generate_pg_control(
lsn: Lsn,
pg_version: u32,
) -> anyhow::Result<(Bytes, u64)> {
match pg_version {
14 => v14::xlog_utils::generate_pg_control(pg_control_bytes, checkpoint_bytes, lsn),
15 => v15::xlog_utils::generate_pg_control(pg_control_bytes, checkpoint_bytes, lsn),
_ => anyhow::bail!("Unknown version {}", pg_version),
}
dispatch_pgversion!(
pg_version,
pgv::xlog_utils::generate_pg_control(pg_control_bytes, checkpoint_bytes, lsn),
anyhow::bail!("Unknown version {}", pg_version)
)
}
// PG timeline is always 1, changing it doesn't have any useful meaning in Neon.
@@ -196,8 +238,6 @@ pub fn fsm_logical_to_physical(addr: BlockNumber) -> BlockNumber {
}
pub mod waldecoder {
use crate::{v14, v15};
use bytes::{Buf, Bytes, BytesMut};
use std::num::NonZeroU32;
use thiserror::Error;
@@ -248,22 +288,17 @@ pub mod waldecoder {
}
pub fn poll_decode(&mut self) -> Result<Option<(Lsn, Bytes)>, WalDecodeError> {
match self.pg_version {
// This is a trick to support both versions simultaneously.
// See WalStreamDecoderHandler comments.
14 => {
use self::v14::waldecoder_handler::WalStreamDecoderHandler;
dispatch_pgversion!(
self.pg_version,
{
use pgv::waldecoder_handler::WalStreamDecoderHandler;
self.poll_decode_internal()
}
15 => {
use self::v15::waldecoder_handler::WalStreamDecoderHandler;
self.poll_decode_internal()
}
_ => Err(WalDecodeError {
},
Err(WalDecodeError {
msg: format!("Unknown version {}", self.pg_version),
lsn: self.lsn,
}),
}
})
)
}
}
}

View File

@@ -163,6 +163,20 @@ pub const RM_HEAP2_ID: u8 = 9;
pub const RM_HEAP_ID: u8 = 10;
pub const RM_LOGICALMSG_ID: u8 = 21;
// from neon_rmgr.h
pub const RM_NEON_ID: u8 = 134;
pub const XLOG_NEON_HEAP_INIT_PAGE: u8 = 0x80;
pub const XLOG_NEON_HEAP_INSERT: u8 = 0x00;
pub const XLOG_NEON_HEAP_DELETE: u8 = 0x10;
pub const XLOG_NEON_HEAP_UPDATE: u8 = 0x20;
pub const XLOG_NEON_HEAP_HOT_UPDATE: u8 = 0x30;
pub const XLOG_NEON_HEAP_LOCK: u8 = 0x40;
pub const XLOG_NEON_HEAP_MULTI_INSERT: u8 = 0x50;
pub const XLOG_NEON_HEAP_VISIBLE: u8 = 0x40;
// from xlogreader.h
pub const XLR_INFO_MASK: u8 = 0x0F;
pub const XLR_RMGR_INFO_MASK: u8 = 0xF0;

View File

@@ -3,3 +3,8 @@ pub const XLOG_DBASE_DROP: u8 = 0x10;
pub const BKPIMAGE_IS_COMPRESSED: u8 = 0x02; /* page image is compressed */
pub const BKPIMAGE_APPLY: u8 = 0x04; /* page image should be restored during replay */
pub const SIZEOF_RELMAPFILE: usize = 512; /* sizeof(RelMapFile) in relmapper.c */
pub fn bkpimg_is_compressed(bimg_info: u8) -> bool {
(bimg_info & BKPIMAGE_IS_COMPRESSED) != 0
}

View File

@@ -1,10 +1,18 @@
pub const XACT_XINFO_HAS_DROPPED_STATS: u32 = 1u32 << 8;
pub const XLOG_DBASE_CREATE_FILE_COPY: u8 = 0x00;
pub const XLOG_DBASE_CREATE_WAL_LOG: u8 = 0x00;
pub const XLOG_DBASE_CREATE_WAL_LOG: u8 = 0x10;
pub const XLOG_DBASE_DROP: u8 = 0x20;
pub const BKPIMAGE_APPLY: u8 = 0x02; /* page image should be restored during replay */
pub const BKPIMAGE_COMPRESS_PGLZ: u8 = 0x04; /* page image is compressed */
pub const BKPIMAGE_COMPRESS_LZ4: u8 = 0x08; /* page image is compressed */
pub const BKPIMAGE_COMPRESS_ZSTD: u8 = 0x10; /* page image is compressed */
pub const SIZEOF_RELMAPFILE: usize = 512; /* sizeof(RelMapFile) in relmapper.c */
pub fn bkpimg_is_compressed(bimg_info: u8) -> bool {
const ANY_COMPRESS_FLAG: u8 = BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_LZ4 | BKPIMAGE_COMPRESS_ZSTD;
(bimg_info & ANY_COMPRESS_FLAG) != 0
}

View File

@@ -0,0 +1,18 @@
pub const XACT_XINFO_HAS_DROPPED_STATS: u32 = 1u32 << 8;
pub const XLOG_DBASE_CREATE_FILE_COPY: u8 = 0x00;
pub const XLOG_DBASE_CREATE_WAL_LOG: u8 = 0x10;
pub const XLOG_DBASE_DROP: u8 = 0x20;
pub const BKPIMAGE_APPLY: u8 = 0x02; /* page image should be restored during replay */
pub const BKPIMAGE_COMPRESS_PGLZ: u8 = 0x04; /* page image is compressed */
pub const BKPIMAGE_COMPRESS_LZ4: u8 = 0x08; /* page image is compressed */
pub const BKPIMAGE_COMPRESS_ZSTD: u8 = 0x10; /* page image is compressed */
pub const SIZEOF_RELMAPFILE: usize = 524; /* sizeof(RelMapFile) in relmapper.c */
pub fn bkpimg_is_compressed(bimg_info: u8) -> bool {
const ANY_COMPRESS_FLAG: u8 = BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_LZ4 | BKPIMAGE_COMPRESS_ZSTD;
(bimg_info & ANY_COMPRESS_FLAG) != 0
}

View File

@@ -49,9 +49,9 @@ impl Conf {
pub fn pg_distrib_dir(&self) -> anyhow::Result<PathBuf> {
let path = self.pg_distrib_dir.clone();
#[allow(clippy::manual_range_patterns)]
match self.pg_version {
14 => Ok(path.join(format!("v{}", self.pg_version))),
15 => Ok(path.join(format!("v{}", self.pg_version))),
14 | 15 | 16 => Ok(path.join(format!("v{}", self.pg_version))),
_ => bail!("Unsupported postgres version: {}", self.pg_version),
}
}
@@ -250,11 +250,18 @@ fn craft_internal<C: postgres::GenericClient>(
let (mut intermediate_lsns, last_lsn) = f(client, initial_lsn)?;
let last_lsn = match last_lsn {
None => client.pg_current_wal_insert_lsn()?,
Some(last_lsn) => match last_lsn.cmp(&client.pg_current_wal_insert_lsn()?) {
Ordering::Less => bail!("Some records were inserted after the crafted WAL"),
Ordering::Equal => last_lsn,
Ordering::Greater => bail!("Reported LSN is greater than insert_lsn"),
},
Some(last_lsn) => {
let insert_lsn = client.pg_current_wal_insert_lsn()?;
match last_lsn.cmp(&insert_lsn) {
Ordering::Less => bail!(
"Some records were inserted after the crafted WAL: {} vs {}",
last_lsn,
insert_lsn
),
Ordering::Equal => last_lsn,
Ordering::Greater => bail!("Reported LSN is greater than insert_lsn"),
}
}
};
if !intermediate_lsns.starts_with(&[initial_lsn]) {
intermediate_lsns.insert(0, initial_lsn);
@@ -363,8 +370,9 @@ impl Crafter for LastWalRecordXlogSwitchEndsOnPageBoundary {
);
ensure!(
u64::from(after_xlog_switch) as usize % XLOG_BLCKSZ == XLOG_SIZE_OF_XLOG_SHORT_PHD,
"XLOG_SWITCH message ended not on page boundary: {}",
after_xlog_switch
"XLOG_SWITCH message ended not on page boundary: {}, offset = {}",
after_xlog_switch,
u64::from(after_xlog_switch) as usize % XLOG_BLCKSZ
);
Ok((vec![before_xlog_switch, after_xlog_switch], next_segment))
}

View File

@@ -959,7 +959,7 @@ mod tests {
let make_params = |options| StartupMessageParams::new([("options", options)]);
let params = StartupMessageParams::new([]);
assert!(matches!(params.options_escaped(), None));
assert!(params.options_escaped().is_none());
let params = make_params("");
assert!(split_options(&params).is_empty());

View File

@@ -148,21 +148,55 @@ impl RemoteStorage for LocalFs {
Some(folder) => folder.with_base(&self.storage_root),
None => self.storage_root.clone(),
};
let mut files = vec![];
let mut directory_queue = vec![full_path.clone()];
// If we were given a directory, we may use it as our starting point.
// Otherwise, we must go up to the parent directory. This is because
// S3 object list prefixes can be arbitrary strings, but when reading
// the local filesystem we need a directory to start calling read_dir on.
let mut initial_dir = full_path.clone();
match fs::metadata(full_path.clone()).await {
Ok(meta) => {
if !meta.is_dir() {
// It's not a directory: strip back to the parent
initial_dir.pop();
}
}
Err(e) if e.kind() == ErrorKind::NotFound => {
// It's not a file that exists: strip the prefix back to the parent directory
initial_dir.pop();
}
Err(e) => {
// Unexpected I/O error
anyhow::bail!(e)
}
}
// Note that PathBuf starts_with only considers full path segments, but
// object prefixes are arbitrary strings, so we need the strings for doing
// starts_with later.
let prefix = full_path.to_string_lossy();
let mut files = vec![];
let mut directory_queue = vec![initial_dir.clone()];
while let Some(cur_folder) = directory_queue.pop() {
let mut entries = fs::read_dir(cur_folder.clone()).await?;
while let Some(entry) = entries.next_entry().await? {
let file_name: PathBuf = entry.file_name().into();
let full_file_name = cur_folder.clone().join(&file_name);
let file_remote_path = self.local_file_to_relative_path(full_file_name.clone());
files.push(file_remote_path.clone());
if full_file_name.is_dir() {
directory_queue.push(full_file_name);
if full_file_name
.to_str()
.map(|s| s.starts_with(prefix.as_ref()))
.unwrap_or(false)
{
let file_remote_path = self.local_file_to_relative_path(full_file_name.clone());
files.push(file_remote_path.clone());
if full_file_name.is_dir() {
directory_queue.push(full_file_name);
}
}
}
}
Ok(files)
}

View File

@@ -573,7 +573,7 @@ mod tests {
#[test]
fn relative_path() {
let all_paths = vec!["", "some/path", "some/path/"];
let all_paths = ["", "some/path", "some/path/"];
let all_paths: Vec<RemotePath> = all_paths
.iter()
.map(|x| RemotePath::new(Path::new(x)).expect("bad path"))

View File

@@ -71,6 +71,13 @@ impl UnreliableWrapper {
}
}
}
async fn delete_inner(&self, path: &RemotePath, attempt: bool) -> anyhow::Result<()> {
if attempt {
self.attempt(RemoteOp::Delete(path.clone()))?;
}
self.inner.delete(path).await
}
}
#[async_trait::async_trait]
@@ -122,15 +129,15 @@ impl RemoteStorage for UnreliableWrapper {
}
async fn delete(&self, path: &RemotePath) -> anyhow::Result<()> {
self.attempt(RemoteOp::Delete(path.clone()))?;
self.inner.delete(path).await
self.delete_inner(path, true).await
}
async fn delete_objects<'a>(&self, paths: &'a [RemotePath]) -> anyhow::Result<()> {
self.attempt(RemoteOp::DeleteObjects(paths.to_vec()))?;
let mut error_counter = 0;
for path in paths {
if (self.delete(path).await).is_err() {
// Dont record attempt because it was already recorded above
if (self.delete_inner(path, false).await).is_err() {
error_counter += 1;
}
}

View File

@@ -31,6 +31,8 @@ fn lsn_invalid() -> Lsn {
#[serde_as]
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct SkTimelineInfo {
/// Term.
pub term: Option<u64>,
/// Term of the last entry.
pub last_log_term: Option<u64>,
/// LSN of the last record.
@@ -58,4 +60,6 @@ pub struct SkTimelineInfo {
/// A connection string to use for WAL receiving.
#[serde(default)]
pub safekeeper_connstr: Option<String>,
#[serde(default)]
pub http_connstr: Option<String>,
}

View File

@@ -26,6 +26,7 @@ serde_json.workspace = true
signal-hook.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-error.workspace = true
tracing-subscriber = { workspace = true, features = ["json", "registry"] }
@@ -37,6 +38,7 @@ url.workspace = true
uuid.workspace = true
pq_proto.workspace = true
postgres_connection.workspace = true
metrics.workspace = true
workspace_hack.workspace = true

View File

@@ -9,11 +9,12 @@ PORT=$4
SYSID=$(od -A n -j 24 -N 8 -t d8 "$WAL_PATH"/000000010000000000000002* | cut -c 3-)
rm -fr "$DATA_DIR"
env -i LD_LIBRARY_PATH="$PG_BIN"/../lib "$PG_BIN"/initdb -E utf8 -U cloud_admin -D "$DATA_DIR" --sysid="$SYSID"
echo port="$PORT" >> "$DATA_DIR"/postgresql.conf
echo "port=$PORT" >> "$DATA_DIR"/postgresql.conf
echo "shared_preload_libraries='\$libdir/neon_rmgr.so'" >> "$DATA_DIR"/postgresql.conf
REDO_POS=0x$("$PG_BIN"/pg_controldata -D "$DATA_DIR" | grep -F "REDO location"| cut -c 42-)
declare -i WAL_SIZE=$REDO_POS+114
"$PG_BIN"/pg_ctl -D "$DATA_DIR" -l logfile start
"$PG_BIN"/pg_ctl -D "$DATA_DIR" -l logfile stop -m immediate
"$PG_BIN"/pg_ctl -D "$DATA_DIR" -l "$DATA_DIR/logfile.log" start
"$PG_BIN"/pg_ctl -D "$DATA_DIR" -l "$DATA_DIR/logfile.log" stop -m immediate
cp "$DATA_DIR"/pg_wal/000000010000000000000001 .
cp "$WAL_PATH"/* "$DATA_DIR"/pg_wal/
for partial in "$DATA_DIR"/pg_wal/*.partial ; do mv "$partial" "${partial%.partial}" ; done

View File

@@ -1,18 +1,31 @@
use std::fmt::{Debug, Display};
use futures::Future;
use tokio_util::sync::CancellationToken;
pub const DEFAULT_BASE_BACKOFF_SECONDS: f64 = 0.1;
pub const DEFAULT_MAX_BACKOFF_SECONDS: f64 = 3.0;
pub async fn exponential_backoff(n: u32, base_increment: f64, max_seconds: f64) {
pub async fn exponential_backoff(
n: u32,
base_increment: f64,
max_seconds: f64,
cancel: &CancellationToken,
) {
let backoff_duration_seconds =
exponential_backoff_duration_seconds(n, base_increment, max_seconds);
if backoff_duration_seconds > 0.0 {
tracing::info!(
"Backoff: waiting {backoff_duration_seconds} seconds before processing with the task",
);
tokio::time::sleep(std::time::Duration::from_secs_f64(backoff_duration_seconds)).await;
drop(
tokio::time::timeout(
std::time::Duration::from_secs_f64(backoff_duration_seconds),
cancel.cancelled(),
)
.await,
)
}
}
@@ -24,28 +37,57 @@ pub fn exponential_backoff_duration_seconds(n: u32, base_increment: f64, max_sec
}
}
/// Configure cancellation for a retried operation: when to cancel (the token), and
/// what kind of error to return on cancellation
pub struct Cancel<E, CF>
where
E: Display + Debug + 'static,
CF: Fn() -> E,
{
token: CancellationToken,
on_cancel: CF,
}
impl<E, CF> Cancel<E, CF>
where
E: Display + Debug + 'static,
CF: Fn() -> E,
{
pub fn new(token: CancellationToken, on_cancel: CF) -> Self {
Self { token, on_cancel }
}
}
/// retries passed operation until one of the following conditions are met:
/// Encountered error is considered as permanent (non-retryable)
/// Retries have been exhausted.
/// `is_permanent` closure should be used to provide distinction between permanent/non-permanent errors
/// When attempts cross `warn_threshold` function starts to emit log warnings.
/// `description` argument is added to log messages. Its value should identify the `op` is doing
pub async fn retry<T, O, F, E>(
/// `cancel` argument is required: any time we are looping on retry, we should be using a CancellationToken
/// to drop out promptly on shutdown.
pub async fn retry<T, O, F, E, CF>(
mut op: O,
is_permanent: impl Fn(&E) -> bool,
warn_threshold: u32,
max_retries: u32,
description: &str,
cancel: Cancel<E, CF>,
) -> Result<T, E>
where
// Not std::error::Error because anyhow::Error doesnt implement it.
// For context see https://github.com/dtolnay/anyhow/issues/63
E: Display + Debug,
E: Display + Debug + 'static,
O: FnMut() -> F,
F: Future<Output = Result<T, E>>,
CF: Fn() -> E,
{
let mut attempts = 0;
loop {
if cancel.token.is_cancelled() {
return Err((cancel.on_cancel)());
}
let result = op().await;
match result {
Ok(_) => {
@@ -80,6 +122,7 @@ where
attempts,
DEFAULT_BASE_BACKOFF_SECONDS,
DEFAULT_MAX_BACKOFF_SECONDS,
&cancel.token,
)
.await;
attempts += 1;
@@ -132,6 +175,7 @@ mod tests {
1,
1,
"work",
Cancel::new(CancellationToken::new(), || -> io::Error { unreachable!() }),
)
.await;
@@ -157,6 +201,7 @@ mod tests {
2,
2,
"work",
Cancel::new(CancellationToken::new(), || -> io::Error { unreachable!() }),
)
.await
.unwrap();
@@ -179,6 +224,7 @@ mod tests {
2,
2,
"work",
Cancel::new(CancellationToken::new(), || -> io::Error { unreachable!() }),
)
.await
.unwrap_err();

View File

@@ -111,6 +111,10 @@ pub fn fsync(path: &Path) -> io::Result<()> {
.map_err(|e| io::Error::new(e.kind(), format!("Failed to fsync file {path:?}: {e}")))
}
pub async fn fsync_async(path: impl AsRef<std::path::Path>) -> Result<(), std::io::Error> {
tokio::fs::File::open(path).await?.sync_all().await
}
#[cfg(test)]
mod tests {
use tempfile::tempdir;

View File

@@ -0,0 +1,138 @@
use std::fmt::Debug;
use serde::{Deserialize, Serialize};
/// Tenant generations are used to provide split-brain safety and allow
/// multiple pageservers to attach the same tenant concurrently.
///
/// See docs/rfcs/025-generation-numbers.md for detail on how generation
/// numbers are used.
#[derive(Copy, Clone, Eq, PartialEq, PartialOrd, Ord)]
pub enum Generation {
// Generations with this magic value will not add a suffix to S3 keys, and will not
// be included in persisted index_part.json. This value is only to be used
// during migration from pre-generation metadata to generation-aware metadata,
// and should eventually go away.
//
// A special Generation is used rather than always wrapping Generation in an Option,
// so that code handling generations doesn't have to be aware of the legacy
// case everywhere it touches a generation.
None,
// Generations with this magic value may never be used to construct S3 keys:
// we will panic if someone tries to. This is for Tenants in the "Broken" state,
// so that we can satisfy their constructor with a Generation without risking
// a code bug using it in an S3 write (broken tenants should never write)
Broken,
Valid(u32),
}
/// The Generation type represents a number associated with a Tenant, which
/// increments every time the tenant is attached to a new pageserver, or
/// an attached pageserver restarts.
///
/// It is included as a suffix in S3 keys, as a protection against split-brain
/// scenarios where pageservers might otherwise issue conflicting writes to
/// remote storage
impl Generation {
/// Create a new Generation that represents a legacy key format with
/// no generation suffix
pub fn none() -> Self {
Self::None
}
// Create a new generation that will panic if you try to use get_suffix
pub fn broken() -> Self {
Self::Broken
}
pub fn new(v: u32) -> Self {
Self::Valid(v)
}
pub fn is_none(&self) -> bool {
matches!(self, Self::None)
}
#[track_caller]
pub fn get_suffix(&self) -> String {
match self {
Self::Valid(v) => {
format!("-{:08x}", v)
}
Self::None => "".into(),
Self::Broken => {
panic!("Tried to use a broken generation");
}
}
}
/// `suffix` is the part after "-" in a key
///
/// Returns None if parsing was unsuccessful
pub fn parse_suffix(suffix: &str) -> Option<Generation> {
u32::from_str_radix(suffix, 16).map(Generation::new).ok()
}
#[track_caller]
pub fn previous(&self) -> Generation {
match self {
Self::Valid(n) => {
if *n == 0 {
// Since a tenant may be upgraded from a pre-generations state, interpret the "previous" generation
// to 0 as being "no generation".
Self::None
} else {
Self::Valid(n - 1)
}
}
Self::None => Self::None,
Self::Broken => panic!("Attempted to use a broken generation"),
}
}
}
impl Serialize for Generation {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if let Self::Valid(v) = self {
v.serialize(serializer)
} else {
// We should never be asked to serialize a None or Broken. Structures
// that include an optional generation should convert None to an
// Option<Generation>::None
Err(serde::ser::Error::custom(
"Tried to serialize invalid generation ({self})",
))
}
}
}
impl<'de> Deserialize<'de> for Generation {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
Ok(Self::Valid(u32::deserialize(deserializer)?))
}
}
// We intentionally do not implement Display for Generation, to reduce the
// risk of a bug where the generation is used in a format!() string directly
// instead of using get_suffix().
impl Debug for Generation {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::Valid(v) => {
write!(f, "{:08x}", v)
}
Self::None => {
write!(f, "<none>")
}
Self::Broken => {
write!(f, "<broken>")
}
}
}
}

View File

@@ -27,6 +27,9 @@ pub mod id;
// http endpoint utils
pub mod http;
// definition of the Generation type for pageserver attachment APIs
pub mod generation;
// common log initialisation routine
pub mod logging;
@@ -58,6 +61,8 @@ pub mod serde_regex;
pub mod pageserver_feedback;
pub mod postgres_client;
pub mod tracing_span_assert;
pub mod rate_limit;
@@ -68,44 +73,6 @@ pub mod completion;
/// Reporting utilities
pub mod error;
mod failpoint_macro_helpers {
/// use with fail::cfg("$name", "return(2000)")
///
/// The effect is similar to a "sleep(2000)" action, i.e. we sleep for the
/// specified time (in milliseconds). The main difference is that we use async
/// tokio sleep function. Another difference is that we print lines to the log,
/// which can be useful in tests to check that the failpoint was hit.
#[macro_export]
macro_rules! failpoint_sleep_millis_async {
($name:literal) => {{
// If the failpoint is used with a "return" action, set should_sleep to the
// returned value (as string). Otherwise it's set to None.
let should_sleep = (|| {
::fail::fail_point!($name, |x| x);
::std::option::Option::None
})();
// Sleep if the action was a returned value
if let ::std::option::Option::Some(duration_str) = should_sleep {
$crate::failpoint_sleep_helper($name, duration_str).await
}
}};
}
// Helper function used by the macro. (A function has nicer scoping so we
// don't need to decorate everything with "::")
pub async fn failpoint_sleep_helper(name: &'static str, duration_str: String) {
let millis = duration_str.parse::<u64>().unwrap();
let d = std::time::Duration::from_millis(millis);
tracing::info!("failpoint {:?}: sleeping for {:?}", name, d);
tokio::time::sleep(d).await;
tracing::info!("failpoint {:?}: sleep done", name);
}
}
pub use failpoint_macro_helpers::failpoint_sleep_helper;
/// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages
///
/// we have several cases:

View File

@@ -0,0 +1,37 @@
//! Postgres client connection code common to other crates (safekeeper and
//! pageserver) which depends on tenant/timeline ids and thus not fitting into
//! postgres_connection crate.
use anyhow::Context;
use postgres_connection::{parse_host_port, PgConnectionConfig};
use crate::id::TenantTimelineId;
/// Create client config for fetching WAL from safekeeper on particular timeline.
/// listen_pg_addr_str is in form host:\[port\].
pub fn wal_stream_connection_config(
TenantTimelineId {
tenant_id,
timeline_id,
}: TenantTimelineId,
listen_pg_addr_str: &str,
auth_token: Option<&str>,
availability_zone: Option<&str>,
) -> anyhow::Result<PgConnectionConfig> {
let (host, port) =
parse_host_port(listen_pg_addr_str).context("Unable to parse listen_pg_addr_str")?;
let port = port.unwrap_or(5432);
let mut connstr = PgConnectionConfig::new_host_port(host, port)
.extend_options([
"-c".to_owned(),
format!("timeline_id={}", timeline_id),
format!("tenant_id={}", tenant_id),
])
.set_password(auth_token.map(|s| s.to_owned()));
if let Some(availability_zone) = availability_zone {
connstr = connstr.extend_options([format!("availability_zone={}", availability_zone)]);
}
Ok(connstr)
}

View File

@@ -0,0 +1,31 @@
[package]
name = "vm_monitor"
version = "0.1.0"
edition.workspace = true
license.workspace = true
[[bin]]
name = "vm-monitor"
path = "./src/bin/monitor.rs"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
anyhow.workspace = true
axum.workspace = true
clap.workspace = true
futures.workspace = true
inotify.workspace = true
serde.workspace = true
serde_json.workspace = true
sysinfo.workspace = true
tokio.workspace = true
tokio-postgres.workspace = true
tokio-stream.workspace = true
tokio-util.workspace = true
tracing.workspace = true
tracing-subscriber.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
[target.'cfg(target_os = "linux")'.dependencies]
cgroups-rs = "0.3.3"

34
libs/vm_monitor/README.md Normal file
View File

@@ -0,0 +1,34 @@
# `vm-monitor`
The `vm-monitor` (or just monitor) is a core component of the autoscaling system,
along with the `autoscale-scheduler` and the `autoscaler-agent`s. The monitor has
two primary roles: 1) notifying agents when immediate upscaling is necessary due
to memory conditions and 2) managing Postgres' file cache and a cgroup to carry
out upscaling and downscaling decisions.
## More on scaling
We scale CPU and memory using NeonVM, our in-house QEMU tool for use with Kubernetes.
To control thresholds for receiving memory usage notifications, we start Postgres
in the `neon-postgres` cgroup and set its `memory.{max,high}`.
* See also: [`neondatabase/autoscaling`](https://github.com/neondatabase/autoscaling/)
* See also: [`neondatabase/vm-monitor`](https://github.com/neondatabase/vm-monitor/),
where initial development of the monitor happened. The repository is no longer
maintained but the commit history may be useful for debugging.
## Structure
The `vm-monitor` is loosely comprised of a few systems. These are:
* the server: this is just a simple `axum` server that accepts requests and
upgrades them to websocket connections. The server only allows one connection at
a time. This means that upon receiving a new connection, the server will terminate
and old one if it exists.
* the filecache: a struct that allows communication with the Postgres file cache.
On startup, we connect to the filecache and hold on to the connection for the
entire monitor lifetime.
* the cgroup watcher: the `CgroupWatcher` manages the `neon-postgres` cgroup by
listening for `memory.high` events and setting its `memory.{high,max}` values.
* the runner: the runner marries the filecache and cgroup watcher together,
communicating with the agent throught the `Dispatcher`, and then calling filecache
and cgroup watcher functions as needed to upscale and downscale

View File

@@ -0,0 +1,33 @@
// We expose a standalone binary _and_ start the monitor in `compute_ctl` so that
// we can test the monitor as part of the entire autoscaling system in
// neondatabase/autoscaling.
//
// The monitor was previously started by vm-builder, and for testing purposes,
// we can mimic that setup with this binary.
#[cfg(target_os = "linux")]
#[tokio::main]
async fn main() -> anyhow::Result<()> {
use clap::Parser;
use tokio_util::sync::CancellationToken;
use tracing_subscriber::EnvFilter;
use vm_monitor::Args;
let subscriber = tracing_subscriber::fmt::Subscriber::builder()
.json()
.with_file(true)
.with_line_number(true)
.with_span_list(true)
.with_env_filter(EnvFilter::from_default_env())
.finish();
tracing::subscriber::set_global_default(subscriber)?;
let args: &'static Args = Box::leak(Box::new(Args::parse()));
let token = CancellationToken::new();
vm_monitor::start(args, token).await
}
#[cfg(not(target_os = "linux"))]
fn main() {
panic!("the monitor requires cgroups, which are only available on linux")
}

View File

@@ -0,0 +1,693 @@
use std::{
fmt::{Debug, Display},
fs,
pin::pin,
sync::atomic::{AtomicU64, Ordering},
};
use anyhow::{anyhow, bail, Context};
use cgroups_rs::{
freezer::FreezerController,
hierarchies::{self, is_cgroup2_unified_mode, UNIFIED_MOUNTPOINT},
memory::MemController,
MaxValue,
Subsystem::{Freezer, Mem},
};
use inotify::{EventStream, Inotify, WatchMask};
use tokio::sync::mpsc::{self, error::TryRecvError};
use tokio::time::{Duration, Instant};
use tokio_stream::{Stream, StreamExt};
use tracing::{info, warn};
use crate::protocol::Resources;
use crate::MiB;
/// Monotonically increasing counter of the number of memory.high events
/// the cgroup has experienced.
///
/// We use this to determine if a modification to the `memory.events` file actually
/// changed the `high` field. If not, we don't care about the change. When we
/// read the file, we check the `high` field in the file against `MEMORY_EVENT_COUNT`
/// to see if it changed since last time.
pub static MEMORY_EVENT_COUNT: AtomicU64 = AtomicU64::new(0);
/// Monotonically increasing counter that gives each cgroup event a unique id.
///
/// This allows us to answer questions like "did this upscale arrive before this
/// memory.high?". This static is also used by the `Sequenced` type to "tag" values
/// with a sequence number. As such, prefer to used the `Sequenced` type rather
/// than this static directly.
static EVENT_SEQUENCE_NUMBER: AtomicU64 = AtomicU64::new(0);
/// A memory event type reported in memory.events.
#[derive(Debug, Eq, PartialEq, Copy, Clone)]
pub enum MemoryEvent {
Low,
High,
Max,
Oom,
OomKill,
OomGroupKill,
}
impl MemoryEvent {
fn as_str(&self) -> &str {
match self {
MemoryEvent::Low => "low",
MemoryEvent::High => "high",
MemoryEvent::Max => "max",
MemoryEvent::Oom => "oom",
MemoryEvent::OomKill => "oom_kill",
MemoryEvent::OomGroupKill => "oom_group_kill",
}
}
}
impl Display for MemoryEvent {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.write_str(self.as_str())
}
}
/// Configuration for a `CgroupWatcher`
#[derive(Debug, Clone)]
pub struct Config {
// The target difference between the total memory reserved for the cgroup
// and the value of the cgroup's memory.high.
//
// In other words, memory.high + oom_buffer_bytes will equal the total memory that the cgroup may
// use (equal to system memory, minus whatever's taken out for the file cache).
oom_buffer_bytes: u64,
// The amount of memory, in bytes, below a proposed new value for
// memory.high that the cgroup's memory usage must be for us to downscale
//
// In other words, we can downscale only when:
//
// memory.current + memory_high_buffer_bytes < (proposed) memory.high
//
// TODO: there's some minor issues with this approach -- in particular, that we might have
// memory in use by the kernel's page cache that we're actually ok with getting rid of.
pub(crate) memory_high_buffer_bytes: u64,
// The maximum duration, in milliseconds, that we're allowed to pause
// the cgroup for while waiting for the autoscaler-agent to upscale us
max_upscale_wait: Duration,
// The required minimum time, in milliseconds, that we must wait before re-freezing
// the cgroup while waiting for the autoscaler-agent to upscale us.
do_not_freeze_more_often_than: Duration,
// The amount of memory, in bytes, that we should periodically increase memory.high
// by while waiting for the autoscaler-agent to upscale us.
//
// This exists to avoid the excessive throttling that happens when a cgroup is above its
// memory.high for too long. See more here:
// https://github.com/neondatabase/autoscaling/issues/44#issuecomment-1522487217
memory_high_increase_by_bytes: u64,
// The period, in milliseconds, at which we should repeatedly increase the value
// of the cgroup's memory.high while we're waiting on upscaling and memory.high
// is still being hit.
//
// Technically speaking, this actually serves as a rate limit to moderate responding to
// memory.high events, but these are roughly equivalent if the process is still allocating
// memory.
memory_high_increase_every: Duration,
}
impl Config {
/// Calculate the new value for the cgroups memory.high based on system memory
pub fn calculate_memory_high_value(&self, total_system_mem: u64) -> u64 {
total_system_mem.saturating_sub(self.oom_buffer_bytes)
}
}
impl Default for Config {
fn default() -> Self {
Self {
oom_buffer_bytes: 100 * MiB,
memory_high_buffer_bytes: 100 * MiB,
// while waiting for upscale, don't freeze for more than 20ms every 1s
max_upscale_wait: Duration::from_millis(20),
do_not_freeze_more_often_than: Duration::from_millis(1000),
// while waiting for upscale, increase memory.high by 10MiB every 25ms
memory_high_increase_by_bytes: 10 * MiB,
memory_high_increase_every: Duration::from_millis(25),
}
}
}
/// Used to represent data that is associated with a certain point in time, such
/// as an upscale request or memory.high event.
///
/// Internally, creating a `Sequenced` uses a static atomic counter to obtain
/// a unique sequence number. Sequence numbers are monotonically increasing,
/// allowing us to answer questions like "did this upscale happen after this
/// memory.high event?" by comparing the sequence numbers of the two events.
#[derive(Debug, Clone)]
pub struct Sequenced<T> {
seqnum: u64,
data: T,
}
impl<T> Sequenced<T> {
pub fn new(data: T) -> Self {
Self {
seqnum: EVENT_SEQUENCE_NUMBER.fetch_add(1, Ordering::AcqRel),
data,
}
}
}
/// Responds to `MonitorEvents` to manage the cgroup: preventing it from being
/// OOM killed or throttling.
///
/// The `CgroupWatcher` primarily achieves this by reading from a stream of
/// `MonitorEvent`s. See `main_signals_loop` for details on how to keep the
/// cgroup happy.
#[derive(Debug)]
pub struct CgroupWatcher {
pub config: Config,
/// The sequence number of the last upscale.
///
/// If we receive a memory.high event that has a _lower_ sequence number than
/// `last_upscale_seqnum`, then we know it occured before the upscale, and we
/// can safely ignore it.
///
/// Note: Like the `events` field, this doesn't _need_ interior mutability but we
/// use it anyways so that methods take `&self`, not `&mut self`.
last_upscale_seqnum: AtomicU64,
/// A channel on which we send messages to request upscale from the dispatcher.
upscale_requester: mpsc::Sender<()>,
/// The actual cgroup we are watching and managing.
cgroup: cgroups_rs::Cgroup,
}
/// Read memory.events for the desired event type.
///
/// `path` specifies the path to the desired `memory.events` file.
/// For more info, see the `memory.events` section of the [kernel docs]
/// <https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files>
fn get_event_count(path: &str, event: MemoryEvent) -> anyhow::Result<u64> {
let contents = fs::read_to_string(path)
.with_context(|| format!("failed to read memory.events from {path}"))?;
// Then contents of the file look like:
// low 42
// high 101
// ...
contents
.lines()
.filter_map(|s| s.split_once(' '))
.find(|(e, _)| *e == event.as_str())
.ok_or_else(|| anyhow!("failed to find entry for memory.{event} events in {path}"))
.and_then(|(_, count)| {
count
.parse::<u64>()
.with_context(|| format!("failed to parse memory.{event} as u64"))
})
}
/// Create an event stream that produces events whenever the file at the provided
/// path is modified.
fn create_file_watcher(path: &str) -> anyhow::Result<EventStream<[u8; 1024]>> {
info!("creating file watcher for {path}");
let inotify = Inotify::init().context("failed to initialize file watcher")?;
inotify
.watches()
.add(path, WatchMask::MODIFY)
.with_context(|| format!("failed to start watching {path}"))?;
inotify
// The inotify docs use [0u8; 1024] so we'll just copy them. We only need
// to store one event at a time - if the event gets written over, that's
// ok. We still see that there is an event. For more information, see:
// https://man7.org/linux/man-pages/man7/inotify.7.html
.into_event_stream([0u8; 1024])
.context("failed to start inotify event stream")
}
impl CgroupWatcher {
/// Create a new `CgroupWatcher`.
#[tracing::instrument(skip_all, fields(%name))]
pub fn new(
name: String,
// A channel on which to send upscale requests
upscale_requester: mpsc::Sender<()>,
) -> anyhow::Result<(Self, impl Stream<Item = Sequenced<u64>>)> {
// TODO: clarify exactly why we need v2
// Make sure cgroups v2 (aka unified) are supported
if !is_cgroup2_unified_mode() {
anyhow::bail!("cgroups v2 not supported");
}
let cgroup = cgroups_rs::Cgroup::load(hierarchies::auto(), &name);
// Start monitoring the cgroup for memory events. In general, for
// cgroups v2 (aka unified), metrics are reported in files like
// > `/sys/fs/cgroup/{name}/{metric}`
// We are looking for `memory.high` events, which are stored in the
// file `memory.events`. For more info, see the `memory.events` section
// of https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files
let path = format!("{}/{}/memory.events", UNIFIED_MOUNTPOINT, &name);
let memory_events = create_file_watcher(&path)
.with_context(|| format!("failed to create event watcher for {path}"))?
// This would be nice with with .inspect_err followed by .ok
.filter_map(move |_| match get_event_count(&path, MemoryEvent::High) {
Ok(high) => Some(high),
Err(error) => {
// TODO: Might want to just panic here
warn!(?error, "failed to read high events count from {}", &path);
None
}
})
// Only report the event if the memory.high count increased
.filter_map(|high| {
if MEMORY_EVENT_COUNT.fetch_max(high, Ordering::AcqRel) < high {
Some(high)
} else {
None
}
})
.map(Sequenced::new);
let initial_count = get_event_count(
&format!("{}/{}/memory.events", UNIFIED_MOUNTPOINT, &name),
MemoryEvent::High,
)?;
info!(initial_count, "initial memory.high event count");
// Hard update `MEMORY_EVENT_COUNT` since there could have been processes
// running in the cgroup before that caused it to be non-zero.
MEMORY_EVENT_COUNT.fetch_max(initial_count, Ordering::AcqRel);
Ok((
Self {
cgroup,
upscale_requester,
last_upscale_seqnum: AtomicU64::new(0),
config: Default::default(),
},
memory_events,
))
}
/// The entrypoint for the `CgroupWatcher`.
#[tracing::instrument(skip_all)]
pub async fn watch<E>(
&self,
// These are ~dependency injected~ (fancy, I know) because this function
// should never return.
// -> therefore: when we tokio::spawn it, we don't await the JoinHandle.
// -> therefore: if we want to stick it in an Arc so many threads can access
// it, methods can never take mutable access.
// - note: we use the Arc strategy so that a) we can call this function
// right here and b) the runner can call the set/get_memory methods
// -> since calling recv() on a tokio::sync::mpsc::Receiver takes &mut self,
// we just pass them in here instead of holding them in fields, as that
// would require this method to take &mut self.
mut upscales: mpsc::Receiver<Sequenced<Resources>>,
events: E,
) -> anyhow::Result<()>
where
E: Stream<Item = Sequenced<u64>>,
{
// There are several actions might do when receiving a `memory.high`,
// such as freezing the cgroup, or increasing its `memory.high`. We don't
// want to do these things too often (because postgres needs to run, and
// we only have so much memory). These timers serve as rate limits for this.
let mut wait_to_freeze = pin!(tokio::time::sleep(Duration::ZERO));
let mut wait_to_increase_memory_high = pin!(tokio::time::sleep(Duration::ZERO));
let mut events = pin!(events);
// Are we waiting to be upscaled? Could be true if we request upscale due
// to a memory.high event and it does not arrive in time.
let mut waiting_on_upscale = false;
loop {
tokio::select! {
upscale = upscales.recv() => {
let Sequenced { seqnum, data } = upscale
.context("failed to listen on upscale notification channel")?;
self.last_upscale_seqnum.store(seqnum, Ordering::Release);
info!(cpu = data.cpu, mem_bytes = data.mem, "received upscale");
}
event = events.next() => {
let Some(Sequenced { seqnum, .. }) = event else {
bail!("failed to listen for memory.high events")
};
// The memory.high came before our last upscale, so we consider
// it resolved
if self.last_upscale_seqnum.fetch_max(seqnum, Ordering::AcqRel) > seqnum {
info!(
"received memory.high event, but it came before our last upscale -> ignoring it"
);
continue;
}
// The memory.high came after our latest upscale. We don't
// want to do anything yet, so peek the next event in hopes
// that it's an upscale.
if let Some(upscale_num) = self
.upscaled(&mut upscales)
.context("failed to check if we were upscaled")?
{
if upscale_num > seqnum {
info!(
"received memory.high event, but it came before our last upscale -> ignoring it"
);
continue;
}
}
// If it's been long enough since we last froze, freeze the
// cgroup and request upscale
if wait_to_freeze.is_elapsed() {
info!("received memory.high event -> requesting upscale");
waiting_on_upscale = self
.handle_memory_high_event(&mut upscales)
.await
.context("failed to handle upscale")?;
wait_to_freeze
.as_mut()
.reset(Instant::now() + self.config.do_not_freeze_more_often_than);
continue;
}
// Ok, we can't freeze, just request upscale
if !waiting_on_upscale {
info!("received memory.high event, but too soon to refreeze -> requesting upscale");
// Make check to make sure we haven't been upscaled in the
// meantine (can happen if the agent independently decides
// to upscale us again)
if self
.upscaled(&mut upscales)
.context("failed to check if we were upscaled")?
.is_some()
{
info!("no need to request upscaling because we got upscaled");
continue;
}
self.upscale_requester
.send(())
.await
.context("failed to request upscale")?;
continue;
}
// Shoot, we can't freeze or and we're still waiting on upscale,
// increase memory.high to reduce throttling
if wait_to_increase_memory_high.is_elapsed() {
info!(
"received memory.high event, \
but too soon to refreeze and already requested upscale \
-> increasing memory.high"
);
// Make check to make sure we haven't been upscaled in the
// meantine (can happen if the agent independently decides
// to upscale us again)
if self
.upscaled(&mut upscales)
.context("failed to check if we were upscaled")?
.is_some()
{
info!("no need to increase memory.high because got upscaled");
continue;
}
// Request upscale anyways (the agent will handle deduplicating
// requests)
self.upscale_requester
.send(())
.await
.context("failed to request upscale")?;
let memory_high =
self.get_high_bytes().context("failed to get memory.high")?;
let new_high = memory_high + self.config.memory_high_increase_by_bytes;
info!(
current_high_bytes = memory_high,
new_high_bytes = new_high,
"updating memory.high"
);
self.set_high_bytes(new_high)
.context("failed to set memory.high")?;
wait_to_increase_memory_high
.as_mut()
.reset(Instant::now() + self.config.memory_high_increase_every)
}
// we can't do anything
}
};
}
}
/// Handle a `memory.high`, returning whether we are still waiting on upscale
/// by the time the function returns.
///
/// The general plan for handling a `memory.high` event is as follows:
/// 1. Freeze the cgroup
/// 2. Start a timer for `self.config.max_upscale_wait`
/// 3. Request upscale
/// 4. After the timer elapses or we receive upscale, thaw the cgroup.
/// 5. Return whether or not we are still waiting for upscale. If we are,
/// we'll increase the cgroups memory.high to avoid getting oom killed
#[tracing::instrument(skip_all)]
async fn handle_memory_high_event(
&self,
upscales: &mut mpsc::Receiver<Sequenced<Resources>>,
) -> anyhow::Result<bool> {
// Immediately freeze the cgroup before doing anything else.
info!("received memory.high event -> freezing cgroup");
self.freeze().context("failed to freeze cgroup")?;
// We'll use this for logging durations
let start_time = Instant::now();
// Await the upscale until we have to unfreeze
let timed =
tokio::time::timeout(self.config.max_upscale_wait, self.await_upscale(upscales));
// Request the upscale
info!(
wait = ?self.config.max_upscale_wait,
"sending request for immediate upscaling",
);
self.upscale_requester
.send(())
.await
.context("failed to request upscale")?;
let waiting_on_upscale = match timed.await {
Ok(Ok(())) => {
info!(elapsed = ?start_time.elapsed(), "received upscale in time");
false
}
// **important**: unfreeze the cgroup before ?-reporting the error
Ok(Err(e)) => {
info!("error waiting for upscale -> thawing cgroup");
self.thaw()
.context("failed to thaw cgroup after errored waiting for upscale")?;
Err(e.context("failed to await upscale"))?
}
Err(_) => {
info!(elapsed = ?self.config.max_upscale_wait, "timed out waiting for upscale");
true
}
};
info!("thawing cgroup");
self.thaw().context("failed to thaw cgroup")?;
Ok(waiting_on_upscale)
}
/// Checks whether we were just upscaled, returning the upscale's sequence
/// number if so.
#[tracing::instrument(skip_all)]
fn upscaled(
&self,
upscales: &mut mpsc::Receiver<Sequenced<Resources>>,
) -> anyhow::Result<Option<u64>> {
let Sequenced { seqnum, data } = match upscales.try_recv() {
Ok(upscale) => upscale,
Err(TryRecvError::Empty) => return Ok(None),
Err(TryRecvError::Disconnected) => {
bail!("upscale notification channel was disconnected")
}
};
// Make sure to update the last upscale sequence number
self.last_upscale_seqnum.store(seqnum, Ordering::Release);
info!(cpu = data.cpu, mem_bytes = data.mem, "received upscale");
Ok(Some(seqnum))
}
/// Await an upscale event, discarding any `memory.high` events received in
/// the process.
///
/// This is used in `handle_memory_high_event`, where we need to listen
/// for upscales in particular so we know if we can thaw the cgroup early.
#[tracing::instrument(skip_all)]
async fn await_upscale(
&self,
upscales: &mut mpsc::Receiver<Sequenced<Resources>>,
) -> anyhow::Result<()> {
let Sequenced { seqnum, .. } = upscales
.recv()
.await
.context("error listening for upscales")?;
self.last_upscale_seqnum.store(seqnum, Ordering::Release);
Ok(())
}
/// Get the cgroup's name.
pub fn path(&self) -> &str {
self.cgroup.path()
}
}
/// Represents a set of limits we apply to a cgroup to control memory usage.
///
/// Setting these values also affects the thresholds for receiving usage alerts.
#[derive(Debug)]
pub struct MemoryLimits {
high: u64,
max: u64,
}
impl MemoryLimits {
pub fn new(high: u64, max: u64) -> Self {
Self { max, high }
}
}
// Methods for manipulating the actual cgroup
impl CgroupWatcher {
/// Get a handle on the freezer subsystem.
fn freezer(&self) -> anyhow::Result<&FreezerController> {
if let Some(Freezer(freezer)) = self
.cgroup
.subsystems()
.iter()
.find(|sub| matches!(sub, Freezer(_)))
{
Ok(freezer)
} else {
anyhow::bail!("could not find freezer subsystem")
}
}
/// Attempt to freeze the cgroup.
pub fn freeze(&self) -> anyhow::Result<()> {
self.freezer()
.context("failed to get freezer subsystem")?
.freeze()
.context("failed to freeze")
}
/// Attempt to thaw the cgroup.
pub fn thaw(&self) -> anyhow::Result<()> {
self.freezer()
.context("failed to get freezer subsystem")?
.thaw()
.context("failed to thaw")
}
/// Get a handle on the memory subsystem.
///
/// Note: this method does not require `self.memory_update_lock` because
/// getting a handle to the subsystem does not access any of the files we
/// care about, such as memory.high and memory.events
fn memory(&self) -> anyhow::Result<&MemController> {
if let Some(Mem(memory)) = self
.cgroup
.subsystems()
.iter()
.find(|sub| matches!(sub, Mem(_)))
{
Ok(memory)
} else {
anyhow::bail!("could not find memory subsystem")
}
}
/// Get cgroup current memory usage.
pub fn current_memory_usage(&self) -> anyhow::Result<u64> {
Ok(self
.memory()
.context("failed to get memory subsystem")?
.memory_stat()
.usage_in_bytes)
}
/// Set cgroup memory.high threshold.
pub fn set_high_bytes(&self, bytes: u64) -> anyhow::Result<()> {
self.memory()
.context("failed to get memory subsystem")?
.set_mem(cgroups_rs::memory::SetMemory {
low: None,
high: Some(MaxValue::Value(u64::min(bytes, i64::MAX as u64) as i64)),
min: None,
max: None,
})
.context("failed to set memory.high")
}
/// Set cgroup memory.high and memory.max.
pub fn set_limits(&self, limits: &MemoryLimits) -> anyhow::Result<()> {
info!(
limits.high,
limits.max,
path = self.path(),
"writing new memory limits",
);
self.memory()
.context("failed to get memory subsystem while setting memory limits")?
.set_mem(cgroups_rs::memory::SetMemory {
min: None,
low: None,
high: Some(MaxValue::Value(
u64::min(limits.high, i64::MAX as u64) as i64
)),
max: Some(MaxValue::Value(u64::min(limits.max, i64::MAX as u64) as i64)),
})
.context("failed to set memory limits")
}
/// Given some amount of available memory, set the desired cgroup memory limits
pub fn set_memory_limits(&mut self, available_memory: u64) -> anyhow::Result<()> {
let new_high = self.config.calculate_memory_high_value(available_memory);
let limits = MemoryLimits::new(new_high, available_memory);
info!(
path = self.path(),
memory = ?limits,
"setting cgroup memory",
);
self.set_limits(&limits)
.context("failed to set cgroup memory limits")?;
Ok(())
}
/// Get memory.high threshold.
pub fn get_high_bytes(&self) -> anyhow::Result<u64> {
let high = self
.memory()
.context("failed to get memory subsystem while getting memory statistics")?
.get_mem()
.map(|mem| mem.high)
.context("failed to get memory statistics from subsystem")?;
match high {
Some(MaxValue::Max) => Ok(i64::MAX as u64),
Some(MaxValue::Value(high)) => Ok(high as u64),
None => anyhow::bail!("failed to read memory.high from memory subsystem"),
}
}
}

View File

@@ -0,0 +1,153 @@
//! Managing the websocket connection and other signals in the monitor.
//!
//! Contains types that manage the interaction (not data interchange, see `protocol`)
//! between agent and monitor, allowing us to to process and send messages in a
//! straightforward way. The dispatcher also manages that signals that come from
//! the cgroup (requesting upscale), and the signals that go to the cgroup
//! (notifying it of upscale).
use anyhow::{bail, Context};
use axum::extract::ws::{Message, WebSocket};
use futures::{
stream::{SplitSink, SplitStream},
SinkExt, StreamExt,
};
use tokio::sync::mpsc;
use tracing::info;
use crate::cgroup::Sequenced;
use crate::protocol::{
OutboundMsg, ProtocolRange, ProtocolResponse, ProtocolVersion, Resources, PROTOCOL_MAX_VERSION,
PROTOCOL_MIN_VERSION,
};
/// The central handler for all communications in the monitor.
///
/// The dispatcher has two purposes:
/// 1. Manage the connection to the agent, sending and receiving messages.
/// 2. Communicate with the cgroup manager, notifying it when upscale is received,
/// and sending a message to the agent when the cgroup manager requests
/// upscale.
#[derive(Debug)]
pub struct Dispatcher {
/// We read agent messages of of `source`
pub(crate) source: SplitStream<WebSocket>,
/// We send messages to the agent through `sink`
sink: SplitSink<WebSocket, Message>,
/// Used to notify the cgroup when we are upscaled.
pub(crate) notify_upscale_events: mpsc::Sender<Sequenced<Resources>>,
/// When the cgroup requests upscale it will send on this channel. In response
/// we send an `UpscaleRequst` to the agent.
pub(crate) request_upscale_events: mpsc::Receiver<()>,
/// The protocol version we have agreed to use with the agent. This is negotiated
/// during the creation of the dispatcher, and should be the highest shared protocol
/// version.
///
// NOTE: currently unused, but will almost certainly be used in the futures
// as the protocol changes
#[allow(unused)]
pub(crate) proto_version: ProtocolVersion,
}
impl Dispatcher {
/// Creates a new dispatcher using the passed-in connection.
///
/// Performs a negotiation with the agent to determine the highest protocol
/// version that both support. This consists of two steps:
/// 1. Wait for the agent to sent the range of protocols it supports.
/// 2. Send a protocol version that works for us as well, or an error if there
/// is no compatible version.
pub async fn new(
stream: WebSocket,
notify_upscale_events: mpsc::Sender<Sequenced<Resources>>,
request_upscale_events: mpsc::Receiver<()>,
) -> anyhow::Result<Self> {
let (mut sink, mut source) = stream.split();
// Figure out the highest protocol version we both support
info!("waiting for agent to send protocol version range");
let Some(message) = source.next().await else {
bail!("websocket connection closed while performing protocol handshake")
};
let message = message.context("failed to read protocol version range off connection")?;
let Message::Text(message_text) = message else {
// All messages should be in text form, since we don't do any
// pinging/ponging. See nhooyr/websocket's implementation and the
// agent for more info
bail!("received non-text message during proocol handshake: {message:?}")
};
let monitor_range = ProtocolRange {
min: PROTOCOL_MIN_VERSION,
max: PROTOCOL_MAX_VERSION,
};
let agent_range: ProtocolRange = serde_json::from_str(&message_text)
.context("failed to deserialize protocol version range")?;
info!(range = ?agent_range, "received protocol version range");
let highest_shared_version = match monitor_range.highest_shared_version(&agent_range) {
Ok(version) => {
sink.send(Message::Text(
serde_json::to_string(&ProtocolResponse::Version(version)).unwrap(),
))
.await
.context("failed to notify agent of negotiated protocol version")?;
version
}
Err(e) => {
sink.send(Message::Text(
serde_json::to_string(&ProtocolResponse::Error(format!(
"Received protocol version range {} which does not overlap with {}",
agent_range, monitor_range
)))
.unwrap(),
))
.await
.context("failed to notify agent of no overlap between protocol version ranges")?;
Err(e).context("error determining suitable protocol version range")?
}
};
Ok(Self {
sink,
source,
notify_upscale_events,
request_upscale_events,
proto_version: highest_shared_version,
})
}
/// Notify the cgroup manager that we have received upscale and wait for
/// the acknowledgement.
#[tracing::instrument(skip_all, fields(?resources))]
pub async fn notify_upscale(&self, resources: Sequenced<Resources>) -> anyhow::Result<()> {
self.notify_upscale_events
.send(resources)
.await
.context("failed to send resources and oneshot sender across channel")
}
/// Send a message to the agent.
///
/// Although this function is small, it has one major benefit: it is the only
/// way to send data accross the connection, and you can only pass in a proper
/// `MonitorMessage`. Without safeguards like this, it's easy to accidentally
/// serialize the wrong thing and send it, since `self.sink.send` will take
/// any string.
pub async fn send(&mut self, message: OutboundMsg) -> anyhow::Result<()> {
info!(?message, "sending message");
let json = serde_json::to_string(&message).context("failed to serialize message")?;
self.sink
.send(Message::Text(json))
.await
.context("stream error sending message")
}
}

View File

@@ -0,0 +1,316 @@
//! Logic for configuring and scaling the Postgres file cache.
use std::num::NonZeroU64;
use crate::MiB;
use anyhow::{anyhow, Context};
use tokio_postgres::{types::ToSql, Client, NoTls, Row};
use tokio_util::sync::CancellationToken;
use tracing::{error, info};
/// Manages Postgres' file cache by keeping a connection open.
#[derive(Debug)]
pub struct FileCacheState {
client: Client,
conn_str: String,
pub(crate) config: FileCacheConfig,
/// A token for cancelling spawned threads during shutdown.
token: CancellationToken,
}
#[derive(Debug)]
pub struct FileCacheConfig {
/// Whether the file cache is *actually* stored in memory (e.g. by writing to
/// a tmpfs or shmem file). If true, the size of the file cache will be counted against the
/// memory available for the cgroup.
pub(crate) in_memory: bool,
/// The size of the file cache, in terms of the size of the resource it consumes
/// (currently: only memory)
///
/// For example, setting `resource_multipler = 0.75` gives the cache a target size of 75% of total
/// resources.
///
/// This value must be strictly between 0 and 1.
resource_multiplier: f64,
/// The required minimum amount of memory, in bytes, that must remain available
/// after subtracting the file cache.
///
/// This value must be non-zero.
min_remaining_after_cache: NonZeroU64,
/// Controls the rate of increase in the file cache's size as it grows from zero
/// (when total resources equals min_remaining_after_cache) to the desired size based on
/// `resource_multiplier`.
///
/// A `spread_factor` of zero means that all additional resources will go to the cache until it
/// reaches the desired size. Setting `spread_factor` to N roughly means "for every 1 byte added to
/// the cache's size, N bytes are reserved for the rest of the system, until the cache gets to
/// its desired size".
///
/// This value must be >= 0, and must retain an increase that is more than what would be given by
/// `resource_multiplier`. For example, setting `resource_multiplier` = 0.75 but `spread_factor` = 1
/// would be invalid, because `spread_factor` would induce only 50% usage - never reaching the 75%
/// as desired by `resource_multiplier`.
///
/// `spread_factor` is too large if `(spread_factor + 1) * resource_multiplier >= 1`.
spread_factor: f64,
}
impl FileCacheConfig {
pub fn default_in_memory() -> Self {
Self {
in_memory: true,
// 75 %
resource_multiplier: 0.75,
// 640 MiB; (512 + 128)
min_remaining_after_cache: NonZeroU64::new(640 * MiB).unwrap(),
// ensure any increase in file cache size is split 90-10 with 10% to other memory
spread_factor: 0.1,
}
}
pub fn default_on_disk() -> Self {
Self {
in_memory: false,
resource_multiplier: 0.75,
// 256 MiB - lower than when in memory because overcommitting is safe; if we don't have
// memory, the kernel will just evict from its page cache, rather than e.g. killing
// everything.
min_remaining_after_cache: NonZeroU64::new(256 * MiB).unwrap(),
spread_factor: 0.1,
}
}
/// Make sure fields of the config are consistent.
pub fn validate(&self) -> anyhow::Result<()> {
// Single field validity
anyhow::ensure!(
0.0 < self.resource_multiplier && self.resource_multiplier < 1.0,
"resource_multiplier must be between 0.0 and 1.0 exclusive, got {}",
self.resource_multiplier
);
anyhow::ensure!(
self.spread_factor >= 0.0,
"spread_factor must be >= 0, got {}",
self.spread_factor
);
// Check that `resource_multiplier` and `spread_factor` are valid w.r.t. each other.
//
// As shown in `calculate_cache_size`, we have two lines resulting from `resource_multiplier` and
// `spread_factor`, respectively. They are:
//
// `total` `min_remaining_after_cache`
// size = ————————————————————— - —————————————————————————————
// `spread_factor` + 1 `spread_factor` + 1
//
// and
//
// size = `resource_multiplier` × total
//
// .. where `total` is the total resources. These are isomorphic to the typical 'y = mx + b'
// form, with y = "size" and x = "total".
//
// These lines intersect at:
//
// `min_remaining_after_cache`
// ———————————————————————————————————————————————————
// 1 - `resource_multiplier` × (`spread_factor` + 1)
//
// We want to ensure that this value (a) exists, and (b) is >= `min_remaining_after_cache`. This is
// guaranteed when '`resource_multiplier` × (`spread_factor` + 1)' is less than 1.
// (We also need it to be >= 0, but that's already guaranteed.)
let intersect_factor = self.resource_multiplier * (self.spread_factor + 1.0);
anyhow::ensure!(
intersect_factor < 1.0,
"incompatible resource_multipler and spread_factor"
);
Ok(())
}
/// Calculate the desired size of the cache, given the total memory
pub fn calculate_cache_size(&self, total: u64) -> u64 {
// *Note*: all units are in bytes, until the very last line.
let available = total.saturating_sub(self.min_remaining_after_cache.get());
if available == 0 {
return 0;
}
// Conversions to ensure we don't overflow from floating-point ops
let size_from_spread =
i64::max(0, (available as f64 / (1.0 + self.spread_factor)) as i64) as u64;
let size_from_normal = (total as f64 * self.resource_multiplier) as u64;
let byte_size = u64::min(size_from_spread, size_from_normal);
// The file cache operates in units of mebibytes, so the sizes we produce should
// be rounded to a mebibyte. We round down to be conservative.
byte_size / MiB * MiB
}
}
impl FileCacheState {
/// Connect to the file cache.
#[tracing::instrument(skip_all, fields(%conn_str, ?config))]
pub async fn new(
conn_str: &str,
config: FileCacheConfig,
token: CancellationToken,
) -> anyhow::Result<Self> {
config.validate().context("file cache config is invalid")?;
info!(conn_str, "connecting to Postgres file cache");
let client = FileCacheState::connect(conn_str, token.clone())
.await
.context("failed to connect to postgres file cache")?;
let conn_str = conn_str.to_string();
Ok(Self {
client,
config,
conn_str,
token,
})
}
/// Connect to Postgres.
///
/// Aborts the spawned thread if the kill signal is received. This is not
/// a method as it is called in [`FileCacheState::new`].
#[tracing::instrument(skip_all, fields(%conn_str))]
async fn connect(conn_str: &str, token: CancellationToken) -> anyhow::Result<Client> {
let (client, conn) = tokio_postgres::connect(conn_str, NoTls)
.await
.context("failed to connect to pg client")?;
// The connection object performs the actual communication with the database,
// so spawn it off to run on its own. See tokio-postgres docs.
crate::spawn_with_cancel(
token,
|res| {
if let Err(error) = res {
error!(%error, "postgres error")
}
},
conn,
);
Ok(client)
}
/// Execute a query with a retry if necessary.
///
/// If the initial query fails, we restart the database connection and attempt
/// if again.
#[tracing::instrument(skip_all, fields(%statement))]
pub async fn query_with_retry(
&mut self,
statement: &str,
params: &[&(dyn ToSql + Sync)],
) -> anyhow::Result<Vec<Row>> {
match self
.client
.query(statement, params)
.await
.context("failed to execute query")
{
Ok(rows) => Ok(rows),
Err(e) => {
error!(error = ?e, "postgres error: {e} -> retrying");
let client = FileCacheState::connect(&self.conn_str, self.token.clone())
.await
.context("failed to connect to postgres file cache")?;
info!("successfully reconnected to postgres client");
// Replace the old client and attempt the query with the new one
self.client = client;
self.client
.query(statement, params)
.await
.context("failed to execute query a second time")
}
}
}
/// Get the current size of the file cache.
#[tracing::instrument(skip_all)]
pub async fn get_file_cache_size(&mut self) -> anyhow::Result<u64> {
self.query_with_retry(
// The file cache GUC variable is in MiB, but the conversion with
// pg_size_bytes means that the end result we get is in bytes.
"SELECT pg_size_bytes(current_setting('neon.file_cache_size_limit'));",
&[],
)
.await
.context("failed to query pg for file cache size")?
.first()
.ok_or_else(|| anyhow!("file cache size query returned no rows"))?
// pg_size_bytes returns a bigint which is the same as an i64.
.try_get::<_, i64>(0)
// Since the size of the table is not negative, the cast is sound.
.map(|bytes| bytes as u64)
.context("failed to extract file cache size from query result")
}
/// Attempt to set the file cache size, returning the size it was actually
/// set to.
#[tracing::instrument(skip_all, fields(%num_bytes))]
pub async fn set_file_cache_size(&mut self, num_bytes: u64) -> anyhow::Result<u64> {
let max_bytes = self
// The file cache GUC variable is in MiB, but the conversion with pg_size_bytes
// means that the end result we get is in bytes.
.query_with_retry(
"SELECT pg_size_bytes(current_setting('neon.max_file_cache_size'));",
&[],
)
.await
.context("failed to query pg for max file cache size")?
.first()
.ok_or_else(|| anyhow!("max file cache size query returned no rows"))?
.try_get::<_, i64>(0)
.map(|bytes| bytes as u64)
.context("failed to extract max file cache size from query result")?;
let max_mb = max_bytes / MiB;
let num_mb = u64::min(num_bytes, max_bytes) / MiB;
let capped = if num_bytes > max_bytes {
" (capped by maximum size)"
} else {
""
};
info!(
size = num_mb,
max = max_mb,
"updating file cache size {capped}",
);
// note: even though the normal ways to get the cache size produce values with trailing "MB"
// (hence why we call pg_size_bytes in `get_file_cache_size`'s query), the format
// it expects to set the value is "integer number of MB" without trailing units.
// For some reason, this *really* wasn't working with normal arguments, so that's
// why we're constructing the query here.
self.client
.query(
&format!("ALTER SYSTEM SET neon.file_cache_size_limit = {};", num_mb),
&[],
)
.await
.context("failed to change file cache size limit")?;
// must use pg_reload_conf to have the settings change take effect
self.client
.execute("SELECT pg_reload_conf();", &[])
.await
.context("failed to reload config")?;
Ok(num_mb * MiB)
}
}

218
libs/vm_monitor/src/lib.rs Normal file
View File

@@ -0,0 +1,218 @@
#![cfg(target_os = "linux")]
use anyhow::Context;
use axum::{
extract::{ws::WebSocket, State, WebSocketUpgrade},
response::Response,
};
use axum::{routing::get, Router, Server};
use clap::Parser;
use futures::Future;
use std::{fmt::Debug, time::Duration};
use sysinfo::{RefreshKind, System, SystemExt};
use tokio::{sync::broadcast, task::JoinHandle};
use tokio_util::sync::CancellationToken;
use tracing::{error, info};
use runner::Runner;
// Code that interfaces with agent
pub mod dispatcher;
pub mod protocol;
pub mod cgroup;
pub mod filecache;
pub mod runner;
/// The vm-monitor is an autoscaling component started by compute_ctl.
///
/// It carries out autoscaling decisions (upscaling/downscaling) and responds to
/// memory pressure by making requests to the autoscaler-agent.
#[derive(Debug, Parser)]
pub struct Args {
/// The name of the cgroup we should monitor for memory.high events. This
/// is the cgroup that postgres should be running in.
#[arg(short, long)]
pub cgroup: Option<String>,
/// The connection string for the Postgres file cache we should manage.
#[arg(short, long)]
pub pgconnstr: Option<String>,
/// Flag to signal that the Postgres file cache is on disk (i.e. not in memory aside from the
/// kernel's page cache), and therefore should not count against available memory.
//
// NB: Ideally this flag would directly refer to whether the file cache is in memory (rather
// than a roundabout way, via whether it's on disk), but in order to be backwards compatible
// during the switch away from an in-memory file cache, we had to default to the previous
// behavior.
#[arg(long)]
pub file_cache_on_disk: bool,
/// The address we should listen on for connection requests. For the
/// agent, this is 0.0.0.0:10301. For the informant, this is 127.0.0.1:10369.
#[arg(short, long)]
pub addr: String,
}
impl Args {
pub fn addr(&self) -> &str {
&self.addr
}
}
/// The number of bytes in one mebibyte.
#[allow(non_upper_case_globals)]
const MiB: u64 = 1 << 20;
/// Convert a quantity in bytes to a quantity in mebibytes, generally for display
/// purposes. (Most calculations in this crate use bytes directly)
pub fn bytes_to_mebibytes(bytes: u64) -> f32 {
(bytes as f32) / (MiB as f32)
}
pub fn get_total_system_memory() -> u64 {
System::new_with_specifics(RefreshKind::new().with_memory()).total_memory()
}
/// Global app state for the Axum server
#[derive(Debug, Clone)]
pub struct ServerState {
/// Used to close old connections.
///
/// When a new connection is made, we send a message signalling to the old
/// connection to close.
pub sender: broadcast::Sender<()>,
/// Used to cancel all spawned threads in the monitor.
pub token: CancellationToken,
// The CLI args
pub args: &'static Args,
}
/// Spawn a thread that may get cancelled by the provided [`CancellationToken`].
///
/// This is mainly meant to be called with futures that will be pending for a very
/// long time, or are not mean to return. If it is not desirable for the future to
/// ever resolve, such as in the case of [`cgroup::CgroupWatcher::watch`], the error can
/// be logged with `f`.
pub fn spawn_with_cancel<T, F>(
token: CancellationToken,
f: F,
future: T,
) -> JoinHandle<Option<T::Output>>
where
T: Future + Send + 'static,
T::Output: Send + 'static,
F: FnOnce(&T::Output) + Send + 'static,
{
tokio::spawn(async move {
tokio::select! {
_ = token.cancelled() => {
info!("received global kill signal");
None
}
res = future => {
f(&res);
Some(res)
}
}
})
}
/// The entrypoint to the binary.
///
/// Set up tracing, parse arguments, and start an http server.
pub async fn start(args: &'static Args, token: CancellationToken) -> anyhow::Result<()> {
// This channel is used to close old connections. When a new connection is
// made, we send a message signalling to the old connection to close.
let (sender, _) = tokio::sync::broadcast::channel::<()>(1);
let app = Router::new()
// This route gets upgraded to a websocket connection. We only support
// one connection at a time, which we enforce by killing old connections
// when we receive a new one.
.route("/monitor", get(ws_handler))
.with_state(ServerState {
sender,
token,
args,
});
let addr = args.addr();
let bound = Server::try_bind(&addr.parse().expect("parsing address should not fail"))
.with_context(|| format!("failed to bind to {addr}"))?;
info!(addr, "server bound");
bound
.serve(app.into_make_service())
.await
.context("server exited")?;
Ok(())
}
/// Handles incoming websocket connections.
///
/// If we are already to connected to an agent, we kill that old connection
/// and accept the new one.
#[tracing::instrument(name = "/monitor", skip_all, fields(?args))]
pub async fn ws_handler(
ws: WebSocketUpgrade,
State(ServerState {
sender,
token,
args,
}): State<ServerState>,
) -> Response {
// Kill the old monitor
info!("closing old connection if there is one");
let _ = sender.send(());
// Start the new one. Wow, the cycle of death and rebirth
let closer = sender.subscribe();
ws.on_upgrade(|ws| start_monitor(ws, args, closer, token))
}
/// Starts the monitor. If startup fails or the monitor exits, an error will
/// be logged and our internal state will be reset to allow for new connections.
#[tracing::instrument(skip_all)]
async fn start_monitor(
ws: WebSocket,
args: &Args,
kill: broadcast::Receiver<()>,
token: CancellationToken,
) {
info!(
?args,
"accepted new websocket connection -> starting monitor"
);
let timeout = Duration::from_secs(4);
let monitor = tokio::time::timeout(
timeout,
Runner::new(Default::default(), args, ws, kill, token),
)
.await;
let mut monitor = match monitor {
Ok(Ok(monitor)) => monitor,
Ok(Err(error)) => {
error!(?error, "failed to create monitor");
return;
}
Err(_) => {
error!(
?timeout,
"creating monitor timed out (probably waiting to receive protocol range)"
);
return;
}
};
info!("connected to agent");
match monitor.run().await {
Ok(()) => info!("monitor was killed due to new connection"),
Err(e) => error!(error = ?e, "monitor terminated unexpectedly"),
}
}

View File

@@ -0,0 +1,241 @@
//! Types representing protocols and actual agent-monitor messages.
//!
//! The pervasive use of serde modifiers throughout this module is to ease
//! serialization on the go side. Because go does not have enums (which model
//! messages well), it is harder to model messages, and we accomodate that with
//! serde.
//!
//! *Note*: the agent sends and receives messages in different ways.
//!
//! The agent serializes messages in the form and then sends them. The use
//! of `#[serde(tag = "type", content = "content")]` allows us to use `Type`
//! to determine how to deserialize `Content`.
//! ```ignore
//! struct {
//! Content any
//! Type string
//! Id uint64
//! }
//! ```
//! and receives messages in the form:
//! ```ignore
//! struct {
//! {fields embedded}
//! Type string
//! Id uint64
//! }
//! ```
//! After reading the type field, the agent will decode the entire message
//! again, this time into the correct type using the embedded fields.
//! Because the agent cannot just extract the json contained in a certain field
//! (it initially deserializes to `map[string]interface{}`), we keep the fields
//! at the top level, so the entire piece of json can be deserialized into a struct,
//! such as a `DownscaleResult`, with the `Type` and `Id` fields ignored.
use core::fmt;
use std::cmp;
use serde::{de::Error, Deserialize, Serialize};
/// A Message we send to the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct OutboundMsg {
#[serde(flatten)]
pub(crate) inner: OutboundMsgKind,
pub(crate) id: usize,
}
impl OutboundMsg {
pub fn new(inner: OutboundMsgKind, id: usize) -> Self {
Self { inner, id }
}
}
/// The different underlying message types we can send to the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
#[serde(tag = "type")]
pub enum OutboundMsgKind {
/// Indicates that the agent sent an invalid message, i.e, we couldn't
/// properly deserialize it.
InvalidMessage { error: String },
/// Indicates that we experienced an internal error while processing a message.
/// For example, if a cgroup operation fails while trying to handle an upscale,
/// we return `InternalError`.
InternalError { error: String },
/// Returned to the agent once we have finished handling an upscale. If the
/// handling was unsuccessful, an `InternalError` will get returned instead.
/// *Note*: this is a struct variant because of the way go serializes struct{}
UpscaleConfirmation {},
/// Indicates to the monitor that we are urgently requesting resources.
/// *Note*: this is a struct variant because of the way go serializes struct{}
UpscaleRequest {},
/// Returned to the agent once we have finished attempting to downscale. If
/// an error occured trying to do so, an `InternalError` will get returned instead.
/// However, if we are simply unsuccessful (for example, do to needing the resources),
/// that gets included in the `DownscaleResult`.
DownscaleResult {
// FIXME for the future (once the informant is deprecated)
// As of the time of writing, the agent/informant version of this struct is
// called api.DownscaleResult. This struct has uppercase fields which are
// serialized as such. Thus, we serialize using uppercase names so we don't
// have to make a breaking change to the agent<->informant protocol. Once
// the informant has been superseded by the monitor, we can add the correct
// struct tags to api.DownscaleResult without causing a breaking change,
// since we don't need to support the agent<->informant protocol anymore.
#[serde(rename = "Ok")]
ok: bool,
#[serde(rename = "Status")]
status: String,
},
/// Part of the bidirectional heartbeat. The heartbeat is initiated by the
/// agent.
/// *Note*: this is a struct variant because of the way go serializes struct{}
HealthCheck {},
}
/// A message received form the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct InboundMsg {
#[serde(flatten)]
pub(crate) inner: InboundMsgKind,
pub(crate) id: usize,
}
/// The different underlying message types we can receive from the agent.
#[derive(Serialize, Deserialize, Debug, Clone)]
#[serde(tag = "type", content = "content")]
pub enum InboundMsgKind {
/// Indicates that the we sent an invalid message, i.e, we couldn't
/// properly deserialize it.
InvalidMessage { error: String },
/// Indicates that the informan experienced an internal error while processing
/// a message. For example, if it failed to request upsacle from the agent, it
/// would return an `InternalError`.
InternalError { error: String },
/// Indicates to us that we have been granted more resources. We should respond
/// with an `UpscaleConfirmation` when done handling the resources (increasins
/// file cache size, cgorup memory limits).
UpscaleNotification { granted: Resources },
/// A request to reduce resource usage. We should response with a `DownscaleResult`,
/// when done.
DownscaleRequest { target: Resources },
/// Part of the bidirectional heartbeat. The heartbeat is initiated by the
/// agent.
/// *Note*: this is a struct variant because of the way go serializes struct{}
HealthCheck {},
}
/// Represents the resources granted to a VM.
#[derive(Serialize, Deserialize, Debug, Clone, Copy)]
// Renamed because the agent has multiple resources types:
// `Resources` (milliCPU/memory slots)
// `Allocation` (vCPU/bytes) <- what we correspond to
#[serde(rename(serialize = "Allocation", deserialize = "Allocation"))]
pub struct Resources {
/// Number of vCPUs
pub(crate) cpu: f64,
/// Bytes of memory
pub(crate) mem: u64,
}
impl Resources {
pub fn new(cpu: f64, mem: u64) -> Self {
Self { cpu, mem }
}
}
pub const PROTOCOL_MIN_VERSION: ProtocolVersion = ProtocolVersion::V1_0;
pub const PROTOCOL_MAX_VERSION: ProtocolVersion = ProtocolVersion::V1_0;
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd, Ord, Eq, Serialize, Deserialize)]
pub struct ProtocolVersion(u8);
impl ProtocolVersion {
/// Represents v1.0 of the agent<-> monitor protocol - the initial version
///
/// Currently the latest version.
const V1_0: ProtocolVersion = ProtocolVersion(1);
}
impl fmt::Display for ProtocolVersion {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match *self {
ProtocolVersion(0) => f.write_str("<invalid: zero>"),
ProtocolVersion::V1_0 => f.write_str("v1.0"),
other => write!(f, "<unknown: {other}>"),
}
}
}
/// A set of protocol bounds that determines what we are speaking.
///
/// These bounds are inclusive.
#[derive(Debug)]
pub struct ProtocolRange {
pub min: ProtocolVersion,
pub max: ProtocolVersion,
}
// Use a custom deserialize impl to ensure that `self.min <= self.max`
impl<'de> Deserialize<'de> for ProtocolRange {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
#[derive(Deserialize)]
struct InnerProtocolRange {
min: ProtocolVersion,
max: ProtocolVersion,
}
let InnerProtocolRange { min, max } = InnerProtocolRange::deserialize(deserializer)?;
if min > max {
Err(D::Error::custom(format!(
"min version = {min} is greater than max version = {max}",
)))
} else {
Ok(ProtocolRange { min, max })
}
}
}
impl fmt::Display for ProtocolRange {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
if self.min == self.max {
f.write_fmt(format_args!("{}", self.max))
} else {
f.write_fmt(format_args!("{} to {}", self.min, self.max))
}
}
}
impl ProtocolRange {
/// Find the highest shared version between two `ProtocolRange`'s
pub fn highest_shared_version(&self, other: &Self) -> anyhow::Result<ProtocolVersion> {
// We first have to make sure the ranges are overlapping. Once we know
// this, we can merge the ranges by taking the max of the mins and the
// mins of the maxes.
if self.min > other.max {
anyhow::bail!(
"Non-overlapping bounds: other.max = {} was less than self.min = {}",
other.max,
self.min,
)
} else if self.max < other.min {
anyhow::bail!(
"Non-overlappinng bounds: self.max = {} was less than other.min = {}",
self.max,
other.min
)
} else {
Ok(cmp::min(self.max, other.max))
}
}
}
/// We send this to the monitor after negotiating which protocol to use
#[derive(Serialize, Debug)]
#[serde(rename_all = "camelCase")]
pub enum ProtocolResponse {
Error(String),
Version(ProtocolVersion),
}

View File

@@ -0,0 +1,478 @@
//! Exposes the `Runner`, which handles messages received from agent and
//! sends upscale requests.
//!
//! This is the "Monitor" part of the monitor binary and is the main entrypoint for
//! all functionality.
use std::sync::Arc;
use std::time::{Duration, Instant};
use std::{fmt::Debug, mem};
use anyhow::{bail, Context};
use axum::extract::ws::{Message, WebSocket};
use futures::StreamExt;
use tokio::sync::broadcast;
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;
use tracing::{error, info, warn};
use crate::cgroup::{CgroupWatcher, MemoryLimits, Sequenced};
use crate::dispatcher::Dispatcher;
use crate::filecache::{FileCacheConfig, FileCacheState};
use crate::protocol::{InboundMsg, InboundMsgKind, OutboundMsg, OutboundMsgKind, Resources};
use crate::{bytes_to_mebibytes, get_total_system_memory, spawn_with_cancel, Args, MiB};
/// Central struct that interacts with agent, dispatcher, and cgroup to handle
/// signals from the agent.
#[derive(Debug)]
pub struct Runner {
config: Config,
filecache: Option<FileCacheState>,
cgroup: Option<Arc<CgroupWatcher>>,
dispatcher: Dispatcher,
/// We "mint" new message ids by incrementing this counter and taking the value.
///
/// **Note**: This counter is always odd, so that we avoid collisions between the IDs generated
/// by us vs the autoscaler-agent.
counter: usize,
last_upscale_request_at: Option<Instant>,
/// A signal to kill the main thread produced by `self.run()`. This is triggered
/// when the server receives a new connection. When the thread receives the
/// signal off this channel, it will gracefully shutdown.
kill: broadcast::Receiver<()>,
}
/// Configuration for a `Runner`
#[derive(Debug)]
pub struct Config {
/// `sys_buffer_bytes` gives the estimated amount of memory, in bytes, that the kernel uses before
/// handing out the rest to userspace. This value is the estimated difference between the
/// *actual* physical memory and the amount reported by `grep MemTotal /proc/meminfo`.
///
/// For more information, refer to `man 5 proc`, which defines MemTotal as "Total usable RAM
/// (i.e., physical RAM minus a few reserved bits and the kernel binary code)".
///
/// We only use `sys_buffer_bytes` when calculating the system memory from the *external* memory
/// size, rather than the self-reported memory size, according to the kernel.
///
/// TODO: this field is only necessary while we still have to trust the autoscaler-agent's
/// upscale resource amounts (because we might not *actually* have been upscaled yet). This field
/// should be removed once we have a better solution there.
sys_buffer_bytes: u64,
}
impl Default for Config {
fn default() -> Self {
Self {
sys_buffer_bytes: 100 * MiB,
}
}
}
impl Runner {
/// Create a new monitor.
#[tracing::instrument(skip_all, fields(?config, ?args))]
pub async fn new(
config: Config,
args: &Args,
ws: WebSocket,
kill: broadcast::Receiver<()>,
token: CancellationToken,
) -> anyhow::Result<Runner> {
anyhow::ensure!(
config.sys_buffer_bytes != 0,
"invalid monitor Config: sys_buffer_bytes cannot be 0"
);
// *NOTE*: the dispatcher and cgroup manager talk through these channels
// so make sure they each get the correct half, nothing is droppped, etc.
let (notified_send, notified_recv) = mpsc::channel(1);
let (requesting_send, requesting_recv) = mpsc::channel(1);
let dispatcher = Dispatcher::new(ws, notified_send, requesting_recv)
.await
.context("error creating new dispatcher")?;
let mut state = Runner {
config,
filecache: None,
cgroup: None,
dispatcher,
counter: 1, // NB: must be odd, see the comment about the field for more.
last_upscale_request_at: None,
kill,
};
let mut file_cache_reserved_bytes = 0;
let mem = get_total_system_memory();
// We need to process file cache initialization before cgroup initialization, so that the memory
// allocated to the file cache is appropriately taken into account when we decide the cgroup's
// memory limits.
if let Some(connstr) = &args.pgconnstr {
info!("initializing file cache");
let config = match args.file_cache_on_disk {
true => FileCacheConfig::default_on_disk(),
false => FileCacheConfig::default_in_memory(),
};
let mut file_cache = FileCacheState::new(connstr, config, token.clone())
.await
.context("failed to create file cache")?;
let size = file_cache
.get_file_cache_size()
.await
.context("error getting file cache size")?;
let new_size = file_cache.config.calculate_cache_size(mem);
info!(
initial = bytes_to_mebibytes(size),
new = bytes_to_mebibytes(new_size),
"setting initial file cache size",
);
// note: even if size == new_size, we want to explicitly set it, just
// to make sure that we have the permissions to do so
let actual_size = file_cache
.set_file_cache_size(new_size)
.await
.context("failed to set file cache size, possibly due to inadequate permissions")?;
if actual_size != new_size {
info!("file cache size actually got set to {actual_size}")
}
// Mark the resources given to the file cache as reserved, but only if it's in memory.
if !args.file_cache_on_disk {
file_cache_reserved_bytes = actual_size;
}
state.filecache = Some(file_cache);
}
if let Some(name) = &args.cgroup {
let (mut cgroup, cgroup_event_stream) =
CgroupWatcher::new(name.clone(), requesting_send)
.context("failed to create cgroup manager")?;
let available = mem - file_cache_reserved_bytes;
cgroup
.set_memory_limits(available)
.context("failed to set cgroup memory limits")?;
let cgroup = Arc::new(cgroup);
// Some might call this . . . cgroup v2
let cgroup_clone = Arc::clone(&cgroup);
spawn_with_cancel(token, |_| error!("cgroup watcher terminated"), async move {
cgroup_clone.watch(notified_recv, cgroup_event_stream).await
});
state.cgroup = Some(cgroup);
} else {
// *NOTE*: We need to forget the sender so that its drop impl does not get ran.
// This allows us to poll it in `Monitor::run` regardless of whether we
// are managing a cgroup or not. If we don't forget it, all receives will
// immediately return an error because the sender is droped and it will
// claim all select! statements, effectively turning `Monitor::run` into
// `loop { fail to receive }`.
mem::forget(requesting_send);
}
Ok(state)
}
/// Attempt to downscale filecache + cgroup
#[tracing::instrument(skip_all, fields(?target))]
pub async fn try_downscale(&mut self, target: Resources) -> anyhow::Result<(bool, String)> {
// Nothing to adjust
if self.cgroup.is_none() && self.filecache.is_none() {
info!("no action needed for downscale (no cgroup or file cache enabled)");
return Ok((
true,
"monitor is not managing cgroup or file cache".to_string(),
));
}
let requested_mem = target.mem;
let usable_system_memory = requested_mem.saturating_sub(self.config.sys_buffer_bytes);
let expected_file_cache_mem_usage = self
.filecache
.as_ref()
.map(|file_cache| file_cache.config.calculate_cache_size(usable_system_memory))
.unwrap_or(0);
let mut new_cgroup_mem_high = 0;
if let Some(cgroup) = &self.cgroup {
new_cgroup_mem_high = cgroup
.config
.calculate_memory_high_value(usable_system_memory - expected_file_cache_mem_usage);
let current = cgroup
.current_memory_usage()
.context("failed to fetch cgroup memory")?;
if new_cgroup_mem_high < current + cgroup.config.memory_high_buffer_bytes {
let status = format!(
"{}: {} MiB (new high) < {} (current usage) + {} (buffer)",
"calculated memory.high too low",
bytes_to_mebibytes(new_cgroup_mem_high),
bytes_to_mebibytes(current),
bytes_to_mebibytes(cgroup.config.memory_high_buffer_bytes)
);
info!(status, "discontinuing downscale");
return Ok((false, status));
}
}
// The downscaling has been approved. Downscale the file cache, then the cgroup.
let mut status = vec![];
let mut file_cache_mem_usage = 0;
if let Some(file_cache) = &mut self.filecache {
let actual_usage = file_cache
.set_file_cache_size(expected_file_cache_mem_usage)
.await
.context("failed to set file cache size")?;
if file_cache.config.in_memory {
file_cache_mem_usage = actual_usage;
}
let message = format!(
"set file cache size to {} MiB (in memory = {})",
bytes_to_mebibytes(actual_usage),
file_cache.config.in_memory,
);
info!("downscale: {message}");
status.push(message);
}
if let Some(cgroup) = &self.cgroup {
let available_memory = usable_system_memory - file_cache_mem_usage;
if file_cache_mem_usage != expected_file_cache_mem_usage {
new_cgroup_mem_high = cgroup.config.calculate_memory_high_value(available_memory);
}
let limits = MemoryLimits::new(
// new_cgroup_mem_high is initialized to 0 but it is guarancontextd to not be here
// since it is properly initialized in the previous cgroup if let block
new_cgroup_mem_high,
available_memory,
);
cgroup
.set_limits(&limits)
.context("failed to set cgroup memory limits")?;
let message = format!(
"set cgroup memory.high to {} MiB, of new max {} MiB",
bytes_to_mebibytes(new_cgroup_mem_high),
bytes_to_mebibytes(available_memory)
);
info!("downscale: {message}");
status.push(message);
}
// TODO: make this status thing less jank
let status = status.join("; ");
Ok((true, status))
}
/// Handle new resources
#[tracing::instrument(skip_all, fields(?resources))]
pub async fn handle_upscale(&mut self, resources: Resources) -> anyhow::Result<()> {
if self.filecache.is_none() && self.cgroup.is_none() {
info!("no action needed for upscale (no cgroup or file cache enabled)");
return Ok(());
}
let new_mem = resources.mem;
let usable_system_memory = new_mem.saturating_sub(self.config.sys_buffer_bytes);
// Get the file cache's expected contribution to the memory usage
let mut file_cache_mem_usage = 0;
if let Some(file_cache) = &mut self.filecache {
let expected_usage = file_cache.config.calculate_cache_size(usable_system_memory);
info!(
target = bytes_to_mebibytes(expected_usage),
total = bytes_to_mebibytes(new_mem),
"updating file cache size",
);
let actual_usage = file_cache
.set_file_cache_size(expected_usage)
.await
.context("failed to set file cache size")?;
if file_cache.config.in_memory {
file_cache_mem_usage = actual_usage;
}
if actual_usage != expected_usage {
warn!(
"file cache was set to a different size that we wanted: target = {} Mib, actual= {} Mib",
bytes_to_mebibytes(expected_usage),
bytes_to_mebibytes(actual_usage)
)
}
}
if let Some(cgroup) = &self.cgroup {
let available_memory = usable_system_memory - file_cache_mem_usage;
let new_cgroup_mem_high = cgroup.config.calculate_memory_high_value(available_memory);
info!(
target = bytes_to_mebibytes(new_cgroup_mem_high),
total = bytes_to_mebibytes(new_mem),
name = cgroup.path(),
"updating cgroup memory.high",
);
let limits = MemoryLimits::new(new_cgroup_mem_high, available_memory);
cgroup
.set_limits(&limits)
.context("failed to set file cache size")?;
}
Ok(())
}
/// Take in a message and perform some action, such as downscaling or upscaling,
/// and return a message to be send back.
#[tracing::instrument(skip_all, fields(%id, message = ?inner))]
pub async fn process_message(
&mut self,
InboundMsg { inner, id }: InboundMsg,
) -> anyhow::Result<Option<OutboundMsg>> {
match inner {
InboundMsgKind::UpscaleNotification { granted } => {
self.handle_upscale(granted)
.await
.context("failed to handle upscale")?;
self.dispatcher
.notify_upscale(Sequenced::new(granted))
.await
.context("failed to notify notify cgroup of upscale")?;
Ok(Some(OutboundMsg::new(
OutboundMsgKind::UpscaleConfirmation {},
id,
)))
}
InboundMsgKind::DownscaleRequest { target } => self
.try_downscale(target)
.await
.context("failed to downscale")
.map(|(ok, status)| {
Some(OutboundMsg::new(
OutboundMsgKind::DownscaleResult { ok, status },
id,
))
}),
InboundMsgKind::InvalidMessage { error } => {
warn!(
%error, id, "received notification of an invalid message we sent"
);
Ok(None)
}
InboundMsgKind::InternalError { error } => {
warn!(error, id, "agent experienced an internal error");
Ok(None)
}
InboundMsgKind::HealthCheck {} => {
Ok(Some(OutboundMsg::new(OutboundMsgKind::HealthCheck {}, id)))
}
}
}
// TODO: don't propagate errors, probably just warn!?
#[tracing::instrument(skip_all)]
pub async fn run(&mut self) -> anyhow::Result<()> {
info!("starting dispatcher");
loop {
tokio::select! {
signal = self.kill.recv() => {
match signal {
Ok(()) => return Ok(()),
Err(e) => bail!("failed to receive kill signal: {e}")
}
}
// we need to propagate an upscale request
request = self.dispatcher.request_upscale_events.recv() => {
if request.is_none() {
bail!("failed to listen for upscale event from cgroup")
}
// If it's been less than 1 second since the last time we requested upscaling,
// ignore the event, to avoid spamming the agent (otherwise, this can happen
// ~1k times per second).
if let Some(t) = self.last_upscale_request_at {
let elapsed = t.elapsed();
if elapsed < Duration::from_secs(1) {
info!(elapsed_millis = elapsed.as_millis(), "cgroup asked for upscale but too soon to forward the request, ignoring");
continue;
}
}
self.last_upscale_request_at = Some(Instant::now());
info!("cgroup asking for upscale; forwarding request");
self.counter += 2; // Increment, preserving parity (i.e. keep the
// counter odd). See the field comment for more.
self.dispatcher
.send(OutboundMsg::new(OutboundMsgKind::UpscaleRequest {}, self.counter))
.await
.context("failed to send message")?;
}
// there is a message from the agent
msg = self.dispatcher.source.next() => {
if let Some(msg) = msg {
// Don't use 'message' as a key as the string also uses
// that for its key
info!(?msg, "received message");
match msg {
Ok(msg) => {
let message: InboundMsg = match msg {
Message::Text(text) => {
serde_json::from_str(&text).context("failed to deserialize text message")?
}
other => {
warn!(
// Don't use 'message' as a key as the
// string also uses that for its key
msg = ?other,
"agent should only send text messages but received different type"
);
continue
},
};
let out = match self.process_message(message.clone()).await {
Ok(Some(out)) => out,
Ok(None) => continue,
Err(e) => {
let error = e.to_string();
warn!(?error, "error handling message");
OutboundMsg::new(
OutboundMsgKind::InternalError {
error
},
message.id
)
}
};
self.dispatcher
.send(out)
.await
.context("failed to send message")?;
}
Err(e) => warn!("{e}"),
}
} else {
anyhow::bail!("dispatcher connection closed")
}
}
}
}
}
}

View File

@@ -51,6 +51,7 @@ serde.workspace = true
serde_json = { workspace = true, features = ["raw_value"] }
serde_with.workspace = true
signal-hook.workspace = true
smallvec = { workspace = true, features = ["write"] }
svg_fmt.workspace = true
sync_wrapper.workspace = true
tokio-tar.workspace = true

View File

@@ -215,7 +215,6 @@ fn bench_sequential(c: &mut Criterion) {
TimelineId::generate(),
zero.add(10 * i32)..zero.add(10 * i32 + 1),
Lsn(i),
false,
0,
);
updates.insert_historic(layer);

View File

@@ -3,6 +3,7 @@
//! Currently it only analyzes holes, which are regions within the layer range that the layer contains no updates for. In the future it might do more analysis (maybe key quantiles?) but it should never return sensitive data.
use anyhow::Result;
use pageserver::tenant::{TENANTS_SEGMENT_NAME, TIMELINES_SEGMENT_NAME};
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::ops::Range;
@@ -10,7 +11,7 @@ use std::{fs, path::Path, str};
use pageserver::page_cache::PAGE_SZ;
use pageserver::repository::{Key, KEY_SIZE};
use pageserver::tenant::block_io::{BlockReader, FileBlockReader};
use pageserver::tenant::block_io::FileBlockReader;
use pageserver::tenant::disk_btree::{DiskBtreeReader, VisitDirection};
use pageserver::tenant::storage_layer::delta_layer::{Summary, DELTA_KEY_SIZE};
use pageserver::tenant::storage_layer::range_overlaps;
@@ -96,8 +97,8 @@ pub(crate) fn parse_filename(name: &str) -> Option<LayerFile> {
// Finds the max_holes largest holes, ignoring any that are smaller than MIN_HOLE_LENGTH"
async fn get_holes(path: &Path, max_holes: usize) -> Result<Vec<Hole>> {
let file = FileBlockReader::new(VirtualFile::open(path)?);
let summary_blk = file.read_blk(0)?;
let file = FileBlockReader::new(VirtualFile::open(path).await?);
let summary_blk = file.read_blk(0).await?;
let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
actual_summary.index_start_blk,
@@ -142,12 +143,12 @@ pub(crate) async fn main(cmd: &AnalyzeLayerMapCmd) -> Result<()> {
let mut total_delta_layers = 0usize;
let mut total_image_layers = 0usize;
let mut total_excess_layers = 0usize;
for tenant in fs::read_dir(storage_path.join("tenants"))? {
for tenant in fs::read_dir(storage_path.join(TENANTS_SEGMENT_NAME))? {
let tenant = tenant?;
if !tenant.file_type()?.is_dir() {
continue;
}
for timeline in fs::read_dir(tenant.path().join("timelines"))? {
for timeline in fs::read_dir(tenant.path().join(TIMELINES_SEGMENT_NAME))? {
let timeline = timeline?;
if !timeline.file_type()?.is_dir() {
continue;

View File

@@ -5,6 +5,7 @@ use clap::Subcommand;
use pageserver::tenant::block_io::BlockCursor;
use pageserver::tenant::disk_btree::DiskBtreeReader;
use pageserver::tenant::storage_layer::delta_layer::{BlobRef, Summary};
use pageserver::tenant::{TENANTS_SEGMENT_NAME, TIMELINES_SEGMENT_NAME};
use pageserver::{page_cache, virtual_file};
use pageserver::{
repository::{Key, KEY_SIZE},
@@ -44,13 +45,11 @@ pub(crate) enum LayerCmd {
}
async fn read_delta_file(path: impl AsRef<Path>) -> Result<()> {
use pageserver::tenant::block_io::BlockReader;
let path = path.as_ref();
virtual_file::init(10);
page_cache::init(100);
let file = FileBlockReader::new(VirtualFile::open(path)?);
let summary_blk = file.read_blk(0)?;
let file = FileBlockReader::new(VirtualFile::open(path).await?);
let summary_blk = file.read_blk(0).await?;
let actual_summary = Summary::des_prefix(summary_blk.as_ref())?;
let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
actual_summary.index_start_blk,
@@ -70,7 +69,7 @@ async fn read_delta_file(path: impl AsRef<Path>) -> Result<()> {
},
)
.await?;
let cursor = BlockCursor::new(&file);
let cursor = BlockCursor::new_fileblockreader(&file);
for (k, v) in all {
let value = cursor.read_blob(v.pos()).await?;
println!("key:{} value_len:{}", k, value.len());
@@ -82,13 +81,13 @@ async fn read_delta_file(path: impl AsRef<Path>) -> Result<()> {
pub(crate) async fn main(cmd: &LayerCmd) -> Result<()> {
match cmd {
LayerCmd::List { path } => {
for tenant in fs::read_dir(path.join("tenants"))? {
for tenant in fs::read_dir(path.join(TENANTS_SEGMENT_NAME))? {
let tenant = tenant?;
if !tenant.file_type()?.is_dir() {
continue;
}
println!("tenant {}", tenant.file_name().to_string_lossy());
for timeline in fs::read_dir(tenant.path().join("timelines"))? {
for timeline in fs::read_dir(tenant.path().join(TIMELINES_SEGMENT_NAME))? {
let timeline = timeline?;
if !timeline.file_type()?.is_dir() {
continue;
@@ -103,9 +102,9 @@ pub(crate) async fn main(cmd: &LayerCmd) -> Result<()> {
timeline,
} => {
let timeline_path = path
.join("tenants")
.join(TENANTS_SEGMENT_NAME)
.join(tenant)
.join("timelines")
.join(TIMELINES_SEGMENT_NAME)
.join(timeline);
let mut idx = 0;
for layer in fs::read_dir(timeline_path)? {

View File

@@ -25,6 +25,7 @@ use crate::context::RequestContext;
use crate::tenant::Timeline;
use pageserver_api::reltag::{RelTag, SlruKind};
use postgres_ffi::dispatch_pgversion;
use postgres_ffi::pg_constants::{DEFAULTTABLESPACE_OID, GLOBALTABLESPACE_OID};
use postgres_ffi::pg_constants::{PGDATA_SPECIAL_FILES, PGDATA_SUBDIRS, PG_HBA};
use postgres_ffi::relfile_utils::{INIT_FORKNUM, MAIN_FORKNUM};
@@ -323,14 +324,25 @@ where
.timeline
.get_relmap_file(spcnode, dbnode, self.lsn, self.ctx)
.await?;
ensure!(img.len() == 512);
ensure!(
img.len()
== dispatch_pgversion!(
self.timeline.pg_version,
pgv::bindings::SIZEOF_RELMAPFILE
)
);
Some(img)
} else {
None
};
if spcnode == GLOBALTABLESPACE_OID {
let pg_version_str = self.timeline.pg_version.to_string();
let pg_version_str = match self.timeline.pg_version {
14 | 15 => self.timeline.pg_version.to_string(),
ver => format!("{ver}\x0A"),
};
let header = new_tar_header("PG_VERSION", pg_version_str.len() as u64)?;
self.ar.append(&header, pg_version_str.as_bytes()).await?;
@@ -374,7 +386,10 @@ where
if let Some(img) = relmap_img {
let dst_path = format!("base/{}/PG_VERSION", dbnode);
let pg_version_str = self.timeline.pg_version.to_string();
let pg_version_str = match self.timeline.pg_version {
14 | 15 => self.timeline.pg_version.to_string(),
ver => format!("{ver}\x0A"),
};
let header = new_tar_header(&dst_path, pg_version_str.len() as u64)?;
self.ar.append(&header, pg_version_str.as_bytes()).await?;

View File

@@ -6,11 +6,12 @@ use std::{env, ops::ControlFlow, path::Path, str::FromStr};
use anyhow::{anyhow, Context};
use clap::{Arg, ArgAction, Command};
use fail::FailScenario;
use metrics::launch_timestamp::{set_launch_timestamp_metric, LaunchTimestamp};
use pageserver::disk_usage_eviction_task::{self, launch_disk_usage_global_eviction_task};
use pageserver::metrics::{STARTUP_DURATION, STARTUP_IS_LOADING};
use pageserver::task_mgr::WALRECEIVER_RUNTIME;
use pageserver::tenant::TenantSharedResources;
use remote_storage::GenericRemoteStorage;
use tokio::time::Instant;
use tracing::*;
@@ -121,7 +122,7 @@ fn main() -> anyhow::Result<()> {
}
// Initialize up failpoints support
let scenario = FailScenario::setup();
let scenario = pageserver::failpoint_support::init();
// Basic initialization of things that don't change after startup
virtual_file::init(conf.max_file_descriptors);
@@ -382,9 +383,12 @@ fn start_pageserver(
BACKGROUND_RUNTIME.block_on(mgr::init_tenant_mgr(
conf,
broker_client.clone(),
remote_storage.clone(),
TenantSharedResources {
broker_client: broker_client.clone(),
remote_storage: remote_storage.clone(),
},
order,
shutdown_pageserver.clone(),
))?;
BACKGROUND_RUNTIME.spawn({
@@ -473,16 +477,19 @@ fn start_pageserver(
{
let _rt_guard = MGMT_REQUEST_RUNTIME.enter();
let router = http::make_router(
conf,
launch_ts,
http_auth,
broker_client.clone(),
remote_storage,
disk_usage_eviction_state,
)?
.build()
.map_err(|err| anyhow!(err))?;
let router_state = Arc::new(
http::routes::State::new(
conf,
http_auth.clone(),
remote_storage,
broker_client.clone(),
disk_usage_eviction_state,
)
.context("Failed to initialize router state")?,
);
let router = http::make_router(router_state, launch_ts, http_auth.clone())?
.build()
.map_err(|err| anyhow!(err))?;
let service = utils::http::RouterService::new(router).unwrap();
let server = hyper::Server::from_tcp(http_listener)?
.serve(service)

View File

@@ -32,7 +32,8 @@ use crate::disk_usage_eviction_task::DiskUsageEvictionTaskConfig;
use crate::tenant::config::TenantConf;
use crate::tenant::config::TenantConfOpt;
use crate::tenant::{
TENANT_ATTACHING_MARKER_FILENAME, TENANT_DELETED_MARKER_FILE_NAME, TIMELINES_SEGMENT_NAME,
TENANTS_SEGMENT_NAME, TENANT_ATTACHING_MARKER_FILENAME, TENANT_DELETED_MARKER_FILE_NAME,
TIMELINES_SEGMENT_NAME,
};
use crate::{
IGNORED_TENANT_FILE_NAME, METADATA_FILE_NAME, TENANT_CONFIG_NAME, TIMELINE_DELETE_MARK_SUFFIX,
@@ -72,7 +73,7 @@ pub mod defaults {
/// Default built-in configuration file.
///
pub const DEFAULT_CONFIG_FILE: &str = formatcp!(
r###"
r#"
# Initial configuration file created by 'pageserver --init'
#listen_pg_addr = '{DEFAULT_PG_LISTEN_ADDR}'
#listen_http_addr = '{DEFAULT_HTTP_LISTEN_ADDR}'
@@ -117,7 +118,7 @@ pub mod defaults {
[remote_storage]
"###
"#
);
}
@@ -204,6 +205,8 @@ pub struct PageServerConf {
/// has it's initial logical size calculated. Not running background tasks for some seconds is
/// not terrible.
pub background_task_maximum_delay: Duration,
pub control_plane_api: Option<Url>,
}
/// We do not want to store this in a PageServerConf because the latter may be logged
@@ -278,6 +281,8 @@ struct PageServerConfigBuilder {
ondemand_download_behavior_treat_error_as_warn: BuilderValue<bool>,
background_task_maximum_delay: BuilderValue<Duration>,
control_plane_api: BuilderValue<Option<Url>>,
}
impl Default for PageServerConfigBuilder {
@@ -340,6 +345,8 @@ impl Default for PageServerConfigBuilder {
DEFAULT_BACKGROUND_TASK_MAXIMUM_DELAY,
)
.unwrap()),
control_plane_api: Set(None),
}
}
}
@@ -468,6 +475,10 @@ impl PageServerConfigBuilder {
self.background_task_maximum_delay = BuilderValue::Set(delay);
}
pub fn control_plane_api(&mut self, api: Url) {
self.control_plane_api = BuilderValue::Set(Some(api))
}
pub fn build(self) -> anyhow::Result<PageServerConf> {
let concurrent_tenant_size_logical_size_queries = self
.concurrent_tenant_size_logical_size_queries
@@ -553,6 +564,9 @@ impl PageServerConfigBuilder {
background_task_maximum_delay: self
.background_task_maximum_delay
.ok_or(anyhow!("missing background_task_maximum_delay"))?,
control_plane_api: self
.control_plane_api
.ok_or(anyhow!("missing control_plane_api"))?,
})
}
}
@@ -563,7 +577,7 @@ impl PageServerConf {
//
pub fn tenants_path(&self) -> PathBuf {
self.workdir.join("tenants")
self.workdir.join(TENANTS_SEGMENT_NAME)
}
pub fn tenant_path(&self, tenant_id: &TenantId) -> PathBuf {
@@ -643,23 +657,6 @@ impl PageServerConf {
.join(METADATA_FILE_NAME)
}
/// Files on the remote storage are stored with paths, relative to the workdir.
/// That path includes in itself both tenant and timeline ids, allowing to have a unique remote storage path.
///
/// Errors if the path provided does not start from pageserver's workdir.
pub fn remote_path(&self, local_path: &Path) -> anyhow::Result<RemotePath> {
local_path
.strip_prefix(&self.workdir)
.context("Failed to strip workdir prefix")
.and_then(RemotePath::new)
.with_context(|| {
format!(
"Failed to resolve remote part of path {:?} for base {:?}",
local_path, self.workdir
)
})
}
/// Turns storage remote path of a file into its local path.
pub fn local_path(&self, remote_path: &RemotePath) -> PathBuf {
remote_path.with_base(&self.workdir)
@@ -671,26 +668,18 @@ impl PageServerConf {
pub fn pg_distrib_dir(&self, pg_version: u32) -> anyhow::Result<PathBuf> {
let path = self.pg_distrib_dir.clone();
#[allow(clippy::manual_range_patterns)]
match pg_version {
14 => Ok(path.join(format!("v{pg_version}"))),
15 => Ok(path.join(format!("v{pg_version}"))),
14 | 15 | 16 => Ok(path.join(format!("v{pg_version}"))),
_ => bail!("Unsupported postgres version: {}", pg_version),
}
}
pub fn pg_bin_dir(&self, pg_version: u32) -> anyhow::Result<PathBuf> {
match pg_version {
14 => Ok(self.pg_distrib_dir(pg_version)?.join("bin")),
15 => Ok(self.pg_distrib_dir(pg_version)?.join("bin")),
_ => bail!("Unsupported postgres version: {}", pg_version),
}
Ok(self.pg_distrib_dir(pg_version)?.join("bin"))
}
pub fn pg_lib_dir(&self, pg_version: u32) -> anyhow::Result<PathBuf> {
match pg_version {
14 => Ok(self.pg_distrib_dir(pg_version)?.join("lib")),
15 => Ok(self.pg_distrib_dir(pg_version)?.join("lib")),
_ => bail!("Unsupported postgres version: {}", pg_version),
}
Ok(self.pg_distrib_dir(pg_version)?.join("lib"))
}
/// Parse a configuration file (pageserver.toml) into a PageServerConf struct,
@@ -758,6 +747,7 @@ impl PageServerConf {
},
"ondemand_download_behavior_treat_error_as_warn" => builder.ondemand_download_behavior_treat_error_as_warn(parse_toml_bool(key, item)?),
"background_task_maximum_delay" => builder.background_task_maximum_delay(parse_toml_duration(key, item)?),
"control_plane_api" => builder.control_plane_api(parse_toml_string(key, item)?.parse().context("failed to parse control plane URL")?),
_ => bail!("unrecognized pageserver option '{key}'"),
}
}
@@ -926,6 +916,7 @@ impl PageServerConf {
test_remote_failures: 0,
ondemand_download_behavior_treat_error_as_warn: false,
background_task_maximum_delay: Duration::ZERO,
control_plane_api: None,
}
}
}
@@ -1149,6 +1140,7 @@ background_task_maximum_delay = '334 s'
background_task_maximum_delay: humantime::parse_duration(
defaults::DEFAULT_BACKGROUND_TASK_MAXIMUM_DELAY
)?,
control_plane_api: None
},
"Correct defaults should be used when no config values are provided"
);
@@ -1204,6 +1196,7 @@ background_task_maximum_delay = '334 s'
test_remote_failures: 0,
ondemand_download_behavior_treat_error_as_warn: false,
background_task_maximum_delay: Duration::from_secs(334),
control_plane_api: None
},
"Should be able to parse all basic config values correctly"
);

View File

@@ -0,0 +1,119 @@
use std::collections::HashMap;
use hyper::StatusCode;
use pageserver_api::control_api::{ReAttachRequest, ReAttachResponse};
use tokio_util::sync::CancellationToken;
use url::Url;
use utils::{
backoff,
generation::Generation,
id::{NodeId, TenantId},
};
use crate::config::PageServerConf;
// Backoffs when control plane requests do not succeed: compromise between reducing load
// on control plane, and retrying frequently when we are blocked on a control plane
// response to make progress.
const BACKOFF_INCREMENT: f64 = 0.1;
const BACKOFF_MAX: f64 = 10.0;
/// The Pageserver's client for using the control plane API: this is a small subset
/// of the overall control plane API, for dealing with generations (see docs/rfcs/025-generation-numbers.md)
pub(crate) struct ControlPlaneClient {
http_client: reqwest::Client,
base_url: Url,
node_id: NodeId,
cancel: CancellationToken,
}
impl ControlPlaneClient {
/// A None return value indicates that the input `conf` object does not have control
/// plane API enabled.
pub(crate) fn new(conf: &'static PageServerConf, cancel: &CancellationToken) -> Option<Self> {
let mut url = match conf.control_plane_api.as_ref() {
Some(u) => u.clone(),
None => return None,
};
if let Ok(mut segs) = url.path_segments_mut() {
// This ensures that `url` ends with a slash if it doesn't already.
// That way, we can subsequently use join() to safely attach extra path elements.
segs.pop_if_empty().push("");
}
let client = reqwest::ClientBuilder::new()
.build()
.expect("Failed to construct http client");
Some(Self {
http_client: client,
base_url: url,
node_id: conf.id,
cancel: cancel.clone(),
})
}
async fn try_re_attach(
&self,
url: Url,
request: &ReAttachRequest,
) -> anyhow::Result<ReAttachResponse> {
match self.http_client.post(url).json(request).send().await {
Err(e) => Err(anyhow::Error::from(e)),
Ok(r) => {
if r.status() == StatusCode::OK {
r.json::<ReAttachResponse>()
.await
.map_err(anyhow::Error::from)
} else {
Err(anyhow::anyhow!("Unexpected status {}", r.status()))
}
}
}
}
/// Block until we get a successful response
pub(crate) async fn re_attach(&self) -> anyhow::Result<HashMap<TenantId, Generation>> {
let re_attach_path = self
.base_url
.join("re-attach")
.expect("Failed to build re-attach path");
let request = ReAttachRequest {
node_id: self.node_id,
};
let mut attempt = 0;
loop {
let result = self.try_re_attach(re_attach_path.clone(), &request).await;
match result {
Ok(res) => {
tracing::info!(
"Received re-attach response with {} tenants",
res.tenants.len()
);
return Ok(res
.tenants
.into_iter()
.map(|t| (t.id, Generation::new(t.generation)))
.collect::<HashMap<_, _>>());
}
Err(e) => {
tracing::error!("Error re-attaching tenants, retrying: {e:#}");
backoff::exponential_backoff(
attempt,
BACKOFF_INCREMENT,
BACKOFF_MAX,
&self.cancel,
)
.await;
if self.cancel.is_cancelled() {
return Err(anyhow::anyhow!("Shutting down"));
}
attempt += 1;
}
}
}
}
}

View File

@@ -0,0 +1,86 @@
/// use with fail::cfg("$name", "return(2000)")
///
/// The effect is similar to a "sleep(2000)" action, i.e. we sleep for the
/// specified time (in milliseconds). The main difference is that we use async
/// tokio sleep function. Another difference is that we print lines to the log,
/// which can be useful in tests to check that the failpoint was hit.
#[macro_export]
macro_rules! __failpoint_sleep_millis_async {
($name:literal) => {{
// If the failpoint is used with a "return" action, set should_sleep to the
// returned value (as string). Otherwise it's set to None.
let should_sleep = (|| {
::fail::fail_point!($name, |x| x);
::std::option::Option::None
})();
// Sleep if the action was a returned value
if let ::std::option::Option::Some(duration_str) = should_sleep {
$crate::failpoint_support::failpoint_sleep_helper($name, duration_str).await
}
}};
}
pub use __failpoint_sleep_millis_async as sleep_millis_async;
// Helper function used by the macro. (A function has nicer scoping so we
// don't need to decorate everything with "::")
#[doc(hidden)]
pub(crate) async fn failpoint_sleep_helper(name: &'static str, duration_str: String) {
let millis = duration_str.parse::<u64>().unwrap();
let d = std::time::Duration::from_millis(millis);
tracing::info!("failpoint {:?}: sleeping for {:?}", name, d);
tokio::time::sleep(d).await;
tracing::info!("failpoint {:?}: sleep done", name);
}
pub fn init() -> fail::FailScenario<'static> {
// The failpoints lib provides support for parsing the `FAILPOINTS` env var.
// We want non-default behavior for `exit`, though, so, we handle it separately.
//
// Format for FAILPOINTS is "name=actions" separated by ";".
let actions = std::env::var("FAILPOINTS");
if actions.is_ok() {
std::env::remove_var("FAILPOINTS");
} else {
// let the library handle non-utf8, or nothing for not present
}
let scenario = fail::FailScenario::setup();
if let Ok(val) = actions {
val.split(';')
.enumerate()
.map(|(i, s)| s.split_once('=').ok_or((i, s)))
.for_each(|res| {
let (name, actions) = match res {
Ok(t) => t,
Err((i, s)) => {
panic!(
"startup failpoints: missing action on the {}th failpoint; try `{s}=return`",
i + 1,
);
}
};
if let Err(e) = apply_failpoint(name, actions) {
panic!("startup failpoints: failed to apply failpoint {name}={actions}: {e}");
}
});
}
scenario
}
pub(crate) fn apply_failpoint(name: &str, actions: &str) -> Result<(), String> {
if actions == "exit" {
fail::cfg_callback(name, exit_failpoint)
} else {
fail::cfg(name, actions)
}
}
#[inline(never)]
fn exit_failpoint() {
tracing::info!("Exit requested by failpoint");
std::process::exit(1);
}

View File

@@ -383,7 +383,6 @@ paths:
schema:
type: string
format: hex
post:
description: |
Schedules attach operation to happen in the background for the given tenant.
@@ -1020,6 +1019,9 @@ components:
properties:
config:
$ref: '#/components/schemas/TenantConfig'
generation:
type: integer
description: Attachment generation number.
TenantConfigRequest:
allOf:
- $ref: '#/components/schemas/TenantConfig'

View File

@@ -8,9 +8,10 @@ use anyhow::{anyhow, Context, Result};
use hyper::StatusCode;
use hyper::{Body, Request, Response, Uri};
use metrics::launch_timestamp::LaunchTimestamp;
use pageserver_api::models::{DownloadRemoteLayersTaskSpawnRequest, TenantAttachRequest};
use pageserver_api::models::{
DownloadRemoteLayersTaskSpawnRequest, TenantAttachRequest, TenantLoadRequest,
};
use remote_storage::GenericRemoteStorage;
use storage_broker::BrokerClientChannel;
use tenant_size_model::{SizeResult, StorageModel};
use tokio_util::sync::CancellationToken;
use tracing::*;
@@ -32,11 +33,13 @@ use crate::tenant::mgr::{
};
use crate::tenant::size::ModelInputs;
use crate::tenant::storage_layer::LayerAccessStatsReset;
use crate::tenant::{LogicalSizeCalculationCause, PageReconstructError, Timeline};
use crate::tenant::timeline::Timeline;
use crate::tenant::{LogicalSizeCalculationCause, PageReconstructError};
use crate::{config::PageServerConf, tenant::mgr};
use crate::{disk_usage_eviction_task, tenant};
use utils::{
auth::JwtAuth,
generation::Generation,
http::{
endpoint::{self, attach_openapi_ui, auth_middleware, check_permission_with},
error::{ApiError, HttpErrorBody},
@@ -51,7 +54,7 @@ use utils::{
// Imports only used for testing APIs
use super::models::ConfigureFailpointsRequest;
struct State {
pub struct State {
conf: &'static PageServerConf,
auth: Option<Arc<JwtAuth>>,
allowlist_routes: Vec<Uri>,
@@ -61,7 +64,7 @@ struct State {
}
impl State {
fn new(
pub fn new(
conf: &'static PageServerConf,
auth: Option<Arc<JwtAuth>>,
remote_storage: Option<GenericRemoteStorage>,
@@ -282,6 +285,8 @@ async fn build_timeline_info_common(
let state = timeline.current_state();
let remote_consistent_lsn = timeline.get_remote_consistent_lsn().unwrap_or(Lsn(0));
let walreceiver_status = timeline.walreceiver_status();
let info = TimelineInfo {
tenant_id: timeline.tenant_id,
timeline_id: timeline.timeline_id,
@@ -302,6 +307,8 @@ async fn build_timeline_info_common(
pg_version: timeline.pg_version,
state,
walreceiver_status,
};
Ok(info)
}
@@ -472,7 +479,7 @@ async fn tenant_attach_handler(
check_permission(&request, Some(tenant_id))?;
let maybe_body: Option<TenantAttachRequest> = json_request_or_empty_body(&mut request).await?;
let tenant_conf = match maybe_body {
let tenant_conf = match &maybe_body {
Some(request) => TenantConfOpt::try_from(&*request.config).map_err(ApiError::BadRequest)?,
None => TenantConfOpt::default(),
};
@@ -483,10 +490,13 @@ async fn tenant_attach_handler(
let state = get_state(&request);
let generation = get_request_generation(state, maybe_body.as_ref().and_then(|r| r.generation))?;
if let Some(remote_storage) = &state.remote_storage {
mgr::attach_tenant(
state.conf,
tenant_id,
generation,
tenant_conf,
state.broker_client.clone(),
remote_storage.clone(),
@@ -517,7 +527,6 @@ async fn timeline_delete_handler(
.instrument(info_span!("timeline_delete", %tenant_id, %timeline_id))
.await?;
// FIXME: needs to be an error for console to retry it. Ideally Accepted should be used and retried until 404.
json_response(StatusCode::ACCEPTED, ())
}
@@ -539,7 +548,7 @@ async fn tenant_detach_handler(
}
async fn tenant_load_handler(
request: Request<Body>,
mut request: Request<Body>,
_cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
@@ -547,10 +556,18 @@ async fn tenant_load_handler(
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);
let maybe_body: Option<TenantLoadRequest> = json_request_or_empty_body(&mut request).await?;
let state = get_state(&request);
// The /load request is only usable when control_plane_api is not set. Once it is set, callers
// should always use /attach instead.
let generation = get_request_generation(state, maybe_body.as_ref().and_then(|r| r.generation))?;
mgr::load_tenant(
state.conf,
tenant_id,
generation,
state.broker_client.clone(),
state.remote_storage.clone(),
&ctx,
@@ -852,6 +869,21 @@ pub fn html_response(status: StatusCode, data: String) -> Result<Response<Body>,
Ok(response)
}
/// Helper for requests that may take a generation, which is mandatory
/// when control_plane_api is set, but otherwise defaults to Generation::none()
fn get_request_generation(state: &State, req_gen: Option<u32>) -> Result<Generation, ApiError> {
if state.conf.control_plane_api.is_some() {
req_gen
.map(Generation::new)
.ok_or(ApiError::BadRequest(anyhow!(
"generation attribute missing"
)))
} else {
// Legacy mode: all tenants operate with no generation
Ok(Generation::none())
}
}
async fn tenant_create_handler(
mut request: Request<Body>,
_cancel: CancellationToken,
@@ -868,14 +900,17 @@ async fn tenant_create_handler(
let tenant_conf =
TenantConfOpt::try_from(&request_data.config).map_err(ApiError::BadRequest)?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);
let state = get_state(&request);
let generation = get_request_generation(state, request_data.generation)?;
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Warn);
let new_tenant = mgr::create_tenant(
state.conf,
tenant_conf,
target_tenant_id,
generation,
state.broker_client.clone(),
state.remote_storage.clone(),
&ctx,
@@ -980,14 +1015,7 @@ async fn failpoints_handler(
// We recognize one extra "action" that's not natively recognized
// by the failpoints crate: exit, to immediately kill the process
let cfg_result = if fp.actions == "exit" {
fail::cfg_callback(fp.name, || {
info!("Exit requested by failpoint");
std::process::exit(1);
})
} else {
fail::cfg(fp.name, &fp.actions)
};
let cfg_result = crate::failpoint_support::apply_failpoint(&fp.name, &fp.actions);
if let Err(err_msg) = cfg_result {
return Err(ApiError::BadRequest(anyhow!(
@@ -1329,12 +1357,9 @@ where
}
pub fn make_router(
conf: &'static PageServerConf,
state: Arc<State>,
launch_ts: &'static LaunchTimestamp,
auth: Option<Arc<JwtAuth>>,
broker_client: BrokerClientChannel,
remote_storage: Option<GenericRemoteStorage>,
disk_usage_eviction_state: Arc<disk_usage_eviction_task::State>,
) -> anyhow::Result<RouterBuilder<hyper::Body, ApiError>> {
let spec = include_bytes!("openapi_spec.yml");
let mut router = attach_openapi_ui(endpoint::make_router(), spec, "/swagger.yml", "/v1/doc");
@@ -1358,16 +1383,7 @@ pub fn make_router(
);
Ok(router
.data(Arc::new(
State::new(
conf,
auth,
remote_storage,
broker_client,
disk_usage_eviction_state,
)
.context("Failed to initialize router state")?,
))
.data(state)
.get("/v1/status", |r| api_handler(r, status_handler))
.put("/v1/failpoints", |r| {
testing_api_handler("manage failpoints", r, failpoints_handler)

View File

@@ -3,6 +3,7 @@ pub mod basebackup;
pub mod config;
pub mod consumption_metrics;
pub mod context;
mod control_plane_client;
pub mod disk_usage_eviction_task;
pub mod http;
pub mod import_datadir;
@@ -21,6 +22,8 @@ pub mod walingest;
pub mod walrecord;
pub mod walredo;
pub mod failpoint_support;
use std::path::Path;
use crate::task_mgr::TaskKind;

View File

@@ -6,7 +6,7 @@ use metrics::{
HistogramVec, IntCounter, IntCounterVec, IntGauge, IntGaugeVec, UIntGauge, UIntGaugeVec,
};
use once_cell::sync::Lazy;
use strum::VariantNames;
use strum::{EnumCount, IntoEnumIterator, VariantNames};
use strum_macros::{EnumVariantNames, IntoStaticStr};
use utils::id::{TenantId, TimelineId};
@@ -537,7 +537,7 @@ const STORAGE_IO_TIME_BUCKETS: &[f64] = &[
30.000, // 30000 ms
];
/// Tracks time taken by fs operations near VirtualFile.
/// VirtualFile fs operation variants.
///
/// Operations:
/// - open ([`std::fs::OpenOptions::open`])
@@ -548,15 +548,66 @@ const STORAGE_IO_TIME_BUCKETS: &[f64] = &[
/// - seek (modify internal position or file length query)
/// - fsync ([`std::fs::File::sync_all`])
/// - metadata ([`std::fs::File::metadata`])
pub(crate) static STORAGE_IO_TIME: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_io_operations_seconds",
"Time spent in IO operations",
&["operation"],
STORAGE_IO_TIME_BUCKETS.into()
)
.expect("failed to define a metric")
});
#[derive(
Debug, Clone, Copy, strum_macros::EnumCount, strum_macros::EnumIter, strum_macros::FromRepr,
)]
pub(crate) enum StorageIoOperation {
Open,
Close,
CloseByReplace,
Read,
Write,
Seek,
Fsync,
Metadata,
}
impl StorageIoOperation {
pub fn as_str(&self) -> &'static str {
match self {
StorageIoOperation::Open => "open",
StorageIoOperation::Close => "close",
StorageIoOperation::CloseByReplace => "close-by-replace",
StorageIoOperation::Read => "read",
StorageIoOperation::Write => "write",
StorageIoOperation::Seek => "seek",
StorageIoOperation::Fsync => "fsync",
StorageIoOperation::Metadata => "metadata",
}
}
}
/// Tracks time taken by fs operations near VirtualFile.
#[derive(Debug)]
pub(crate) struct StorageIoTime {
metrics: [Histogram; StorageIoOperation::COUNT],
}
impl StorageIoTime {
fn new() -> Self {
let storage_io_histogram_vec = register_histogram_vec!(
"pageserver_io_operations_seconds",
"Time spent in IO operations",
&["operation"],
STORAGE_IO_TIME_BUCKETS.into()
)
.expect("failed to define a metric");
let metrics = std::array::from_fn(|i| {
let op = StorageIoOperation::from_repr(i).unwrap();
let metric = storage_io_histogram_vec
.get_metric_with_label_values(&[op.as_str()])
.unwrap();
metric
});
Self { metrics }
}
pub(crate) fn get(&self, op: StorageIoOperation) -> &Histogram {
&self.metrics[op as usize]
}
}
pub(crate) static STORAGE_IO_TIME_METRIC: Lazy<StorageIoTime> = Lazy::new(StorageIoTime::new);
const STORAGE_IO_SIZE_OPERATIONS: &[&str] = &["read", "write"];
@@ -570,23 +621,160 @@ pub(crate) static STORAGE_IO_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
.expect("failed to define a metric")
});
const SMGR_QUERY_TIME_OPERATIONS: &[&str] = &[
"get_rel_exists",
"get_rel_size",
"get_page_at_lsn",
"get_db_size",
];
#[derive(Debug)]
struct GlobalAndPerTimelineHistogram {
global: Histogram,
per_tenant_timeline: Histogram,
}
pub static SMGR_QUERY_TIME: Lazy<HistogramVec> = Lazy::new(|| {
impl GlobalAndPerTimelineHistogram {
fn observe(&self, value: f64) {
self.global.observe(value);
self.per_tenant_timeline.observe(value);
}
}
struct GlobalAndPerTimelineHistogramTimer<'a> {
h: &'a GlobalAndPerTimelineHistogram,
start: std::time::Instant,
}
impl<'a> Drop for GlobalAndPerTimelineHistogramTimer<'a> {
fn drop(&mut self) {
let elapsed = self.start.elapsed();
self.h.observe(elapsed.as_secs_f64());
}
}
#[derive(
Debug,
Clone,
Copy,
IntoStaticStr,
strum_macros::EnumCount,
strum_macros::EnumIter,
strum_macros::FromRepr,
)]
#[strum(serialize_all = "snake_case")]
pub enum SmgrQueryType {
GetRelExists,
GetRelSize,
GetPageAtLsn,
GetDbSize,
}
#[derive(Debug)]
pub struct SmgrQueryTimePerTimeline {
metrics: [GlobalAndPerTimelineHistogram; SmgrQueryType::COUNT],
}
static SMGR_QUERY_TIME_PER_TENANT_TIMELINE: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_smgr_query_seconds",
"Time spent on smgr query handling",
"Time spent on smgr query handling, aggegated by query type and tenant/timeline.",
&["smgr_query_type", "tenant_id", "timeline_id"],
CRITICAL_OP_BUCKETS.into(),
)
.expect("failed to define a metric")
});
static SMGR_QUERY_TIME_GLOBAL: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(
"pageserver_smgr_query_seconds_global",
"Time spent on smgr query handling, aggregated by query type.",
&["smgr_query_type"],
CRITICAL_OP_BUCKETS.into(),
)
.expect("failed to define a metric")
});
impl SmgrQueryTimePerTimeline {
pub(crate) fn new(tenant_id: &TenantId, timeline_id: &TimelineId) -> Self {
let tenant_id = tenant_id.to_string();
let timeline_id = timeline_id.to_string();
let metrics = std::array::from_fn(|i| {
let op = SmgrQueryType::from_repr(i).unwrap();
let global = SMGR_QUERY_TIME_GLOBAL
.get_metric_with_label_values(&[op.into()])
.unwrap();
let per_tenant_timeline = SMGR_QUERY_TIME_PER_TENANT_TIMELINE
.get_metric_with_label_values(&[op.into(), &tenant_id, &timeline_id])
.unwrap();
GlobalAndPerTimelineHistogram {
global,
per_tenant_timeline,
}
});
Self { metrics }
}
pub(crate) fn start_timer(&self, op: SmgrQueryType) -> impl Drop + '_ {
let metric = &self.metrics[op as usize];
GlobalAndPerTimelineHistogramTimer {
h: metric,
start: std::time::Instant::now(),
}
}
}
#[cfg(test)]
mod smgr_query_time_tests {
use strum::IntoEnumIterator;
use utils::id::{TenantId, TimelineId};
// Regression test, we used hard-coded string constants before using an enum.
#[test]
fn op_label_name() {
use super::SmgrQueryType::*;
let expect: [(super::SmgrQueryType, &'static str); 4] = [
(GetRelExists, "get_rel_exists"),
(GetRelSize, "get_rel_size"),
(GetPageAtLsn, "get_page_at_lsn"),
(GetDbSize, "get_db_size"),
];
for (op, expect) in expect {
let actual: &'static str = op.into();
assert_eq!(actual, expect);
}
}
#[test]
fn basic() {
let ops: Vec<_> = super::SmgrQueryType::iter().collect();
for op in &ops {
let tenant_id = TenantId::generate();
let timeline_id = TimelineId::generate();
let metrics = super::SmgrQueryTimePerTimeline::new(&tenant_id, &timeline_id);
let get_counts = || {
let global: u64 = ops
.iter()
.map(|op| metrics.metrics[*op as usize].global.get_sample_count())
.sum();
let per_tenant_timeline: u64 = ops
.iter()
.map(|op| {
metrics.metrics[*op as usize]
.per_tenant_timeline
.get_sample_count()
})
.sum();
(global, per_tenant_timeline)
};
let (pre_global, pre_per_tenant_timeline) = get_counts();
assert_eq!(pre_per_tenant_timeline, 0);
let timer = metrics.start_timer(*op);
drop(timer);
let (post_global, post_per_tenant_timeline) = get_counts();
assert_eq!(post_per_tenant_timeline, 1);
assert!(post_global > pre_global);
}
}
}
// keep in sync with control plane Go code so that we can validate
// compute's basebackup_ms metric with our perspective in the context of SLI/SLO.
static COMPUTE_STARTUP_BUCKETS: Lazy<[f64; 28]> = Lazy::new(|| {
@@ -1028,6 +1216,12 @@ impl TimelineMetrics {
),
}
}
pub fn record_new_file_metrics(&self, sz: u64) {
self.resident_physical_size_gauge.add(sz);
self.num_persistent_files_created.inc_by(1);
self.persistent_bytes_written.inc_by(sz);
}
}
impl Drop for TimelineMetrics {
@@ -1045,6 +1239,12 @@ impl Drop for TimelineMetrics {
.write()
.unwrap()
.remove(tenant_id, timeline_id);
// The following metrics are born outside of the TimelineMetrics lifecycle but still
// removed at the end of it. The idea is to have the metrics outlive the
// entity during which they're observed, e.g., the smgr metrics shall
// outlive an individual smgr connection, but not the timeline.
for op in StorageTimeOperation::VARIANTS {
let _ =
STORAGE_TIME_SUM_PER_TIMELINE.remove_label_values(&[op, tenant_id, timeline_id]);
@@ -1056,8 +1256,12 @@ impl Drop for TimelineMetrics {
let _ = STORAGE_IO_SIZE.remove_label_values(&[op, tenant_id, timeline_id]);
}
for op in SMGR_QUERY_TIME_OPERATIONS {
let _ = SMGR_QUERY_TIME.remove_label_values(&[op, tenant_id, timeline_id]);
for op in SmgrQueryType::iter() {
let _ = SMGR_QUERY_TIME_PER_TENANT_TIMELINE.remove_label_values(&[
op.into(),
tenant_id,
timeline_id,
]);
}
}
}

View File

@@ -10,6 +10,42 @@
//! PostgreSQL buffer size, and a Slot struct for each buffer to contain
//! information about what's stored in the buffer.
//!
//! # Types Of Pages
//!
//! [`PageCache`] only supports immutable pages.
//! Hence there is no need to worry about coherency.
//!
//! Two types of pages are supported:
//!
//! * **Materialized pages**, filled & used by page reconstruction
//! * **Immutable File pages**, filled & used by [`crate::tenant::block_io`] and [`crate::tenant::ephemeral_file`].
//!
//! Note that [`crate::tenant::ephemeral_file::EphemeralFile`] is generally mutable, but, it's append-only.
//! It uses the page cache only for the blocks that are already fully written and immutable.
//!
//! # Filling The Page Cache
//!
//! Page cache maps from a cache key to a buffer slot.
//! The cache key uniquely identifies the piece of data that is being cached.
//!
//! The cache key for **materialized pages** is [`TenantId`], [`TimelineId`], [`Key`], and [`Lsn`].
//! Use [`PageCache::memorize_materialized_page`] and [`PageCache::lookup_materialized_page`] for fill & access.
//!
//! The cache key for **immutable file** pages is [`FileId`] and a block number.
//! Users of page cache that wish to page-cache an arbitrary (immutable!) on-disk file do the following:
//! * Have a mechanism to deterministically associate the on-disk file with a [`FileId`].
//! * Get a [`FileId`] using [`next_file_id`].
//! * Use the mechanism to associate the on-disk file with the returned [`FileId`].
//! * Use [`PageCache::read_immutable_buf`] to get a [`ReadBufResult`].
//! * If the page was already cached, it'll be the [`ReadBufResult::Found`] variant that contains
//! a read guard for the page. Just use it.
//! * If the page was not cached, it'll be the [`ReadBufResult::NotFound`] variant that contains
//! a write guard for the page. Fill the page with the contents of the on-disk file.
//! Then call [`PageWriteGuard::mark_valid`] to mark the page as valid.
//! Then try again to [`PageCache::read_immutable_buf`].
//! Unless there's high cache pressure, the page should now be cached.
//! (TODO: allow downgrading the write guard to a read guard to ensure forward progress.)
//!
//! # Locking
//!
//! There are two levels of locking involved: There's one lock for the "mapping"
@@ -39,21 +75,16 @@
use std::{
collections::{hash_map::Entry, HashMap},
convert::TryInto,
sync::{
atomic::{AtomicU8, AtomicUsize, Ordering},
RwLock, RwLockReadGuard, RwLockWriteGuard, TryLockError,
},
sync::atomic::{AtomicU64, AtomicU8, AtomicUsize, Ordering},
};
use anyhow::Context;
use once_cell::sync::OnceCell;
use tracing::error;
use utils::{
id::{TenantId, TimelineId},
lsn::Lsn,
};
use crate::tenant::writeback_ephemeral_file;
use crate::{metrics::PageCacheSizeMetrics, repository::Key};
static PAGE_CACHE: OnceCell<PageCache> = OnceCell::new();
@@ -87,6 +118,17 @@ pub fn get() -> &'static PageCache {
pub const PAGE_SZ: usize = postgres_ffi::BLCKSZ as usize;
const MAX_USAGE_COUNT: u8 = 5;
/// See module-level comment.
#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
pub struct FileId(u64);
static NEXT_ID: AtomicU64 = AtomicU64::new(1);
/// See module-level comment.
pub fn next_file_id() -> FileId {
FileId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
}
///
/// CacheKey uniquely identifies a "thing" to cache in the page cache.
///
@@ -97,12 +139,8 @@ enum CacheKey {
hash_key: MaterializedPageHashKey,
lsn: Lsn,
},
EphemeralPage {
file_id: u64,
blkno: u32,
},
ImmutableFilePage {
file_id: u64,
file_id: FileId,
blkno: u32,
},
}
@@ -121,14 +159,13 @@ struct Version {
}
struct Slot {
inner: RwLock<SlotInner>,
inner: tokio::sync::RwLock<SlotInner>,
usage_count: AtomicU8,
}
struct SlotInner {
key: Option<CacheKey>,
buf: &'static mut [u8; PAGE_SZ],
dirty: bool,
}
impl Slot {
@@ -163,6 +200,11 @@ impl Slot {
Err(usage_count) => usage_count,
}
}
/// Sets the usage count to a specific value.
fn set_usage_count(&self, count: u8) {
self.usage_count.store(count, Ordering::Relaxed);
}
}
pub struct PageCache {
@@ -175,11 +217,9 @@ pub struct PageCache {
///
/// If you add support for caching different kinds of objects, each object kind
/// can have a separate mapping map, next to this field.
materialized_page_map: RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,
materialized_page_map: std::sync::RwLock<HashMap<MaterializedPageHashKey, Vec<Version>>>,
ephemeral_page_map: RwLock<HashMap<(u64, u32), usize>>,
immutable_page_map: RwLock<HashMap<(u64, u32), usize>>,
immutable_page_map: std::sync::RwLock<HashMap<(FileId, u32), usize>>,
/// The actual buffers with their metadata.
slots: Box<[Slot]>,
@@ -195,7 +235,7 @@ pub struct PageCache {
/// PageReadGuard is a "lease" on a buffer, for reading. The page is kept locked
/// until the guard is dropped.
///
pub struct PageReadGuard<'i>(RwLockReadGuard<'i, SlotInner>);
pub struct PageReadGuard<'i>(tokio::sync::RwLockReadGuard<'i, SlotInner>);
impl std::ops::Deref for PageReadGuard<'_> {
type Target = [u8; PAGE_SZ];
@@ -222,9 +262,10 @@ impl AsRef<[u8; PAGE_SZ]> for PageReadGuard<'_> {
/// to initialize.
///
pub struct PageWriteGuard<'i> {
inner: RwLockWriteGuard<'i, SlotInner>,
inner: tokio::sync::RwLockWriteGuard<'i, SlotInner>,
// Are the page contents currently valid?
// Used to mark pages as invalid that are assigned but not yet filled with data.
valid: bool,
}
@@ -258,14 +299,6 @@ impl PageWriteGuard<'_> {
);
self.valid = true;
}
pub fn mark_dirty(&mut self) {
// only ephemeral pages can be dirty ATM.
assert!(matches!(
self.inner.key,
Some(CacheKey::EphemeralPage { .. })
));
self.inner.dirty = true;
}
}
impl Drop for PageWriteGuard<'_> {
@@ -280,7 +313,6 @@ impl Drop for PageWriteGuard<'_> {
let self_key = self.inner.key.as_ref().unwrap();
PAGE_CACHE.get().unwrap().remove_mapping(self_key);
self.inner.key = None;
self.inner.dirty = false;
}
}
}
@@ -308,7 +340,7 @@ impl PageCache {
/// The 'lsn' is an upper bound, this will return the latest version of
/// the given block, but not newer than 'lsn'. Returns the actual LSN of the
/// returned page.
pub fn lookup_materialized_page(
pub async fn lookup_materialized_page(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -328,7 +360,7 @@ impl PageCache {
lsn,
};
if let Some(guard) = self.try_lock_for_read(&mut cache_key) {
if let Some(guard) = self.try_lock_for_read(&mut cache_key).await {
if let CacheKey::MaterializedPage {
hash_key: _,
lsn: available_lsn,
@@ -355,7 +387,7 @@ impl PageCache {
///
/// Store an image of the given page in the cache.
///
pub fn memorize_materialized_page(
pub async fn memorize_materialized_page(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -372,7 +404,7 @@ impl PageCache {
lsn,
};
match self.lock_for_write(&cache_key)? {
match self.lock_for_write(&cache_key).await? {
WriteBufResult::Found(write_guard) => {
// We already had it in cache. Another thread must've put it there
// concurrently. Check that it had the same contents that we
@@ -388,68 +420,16 @@ impl PageCache {
Ok(())
}
// Section 1.2: Public interface functions for working with Ephemeral pages.
// Section 1.2: Public interface functions for working with immutable file pages.
pub fn read_ephemeral_buf(&self, file_id: u64, blkno: u32) -> anyhow::Result<ReadBufResult> {
let mut cache_key = CacheKey::EphemeralPage { file_id, blkno };
self.lock_for_read(&mut cache_key)
}
pub fn write_ephemeral_buf(&self, file_id: u64, blkno: u32) -> anyhow::Result<WriteBufResult> {
let cache_key = CacheKey::EphemeralPage { file_id, blkno };
self.lock_for_write(&cache_key)
}
/// Immediately drop all buffers belonging to given file, without writeback
pub fn drop_buffers_for_ephemeral(&self, drop_file_id: u64) {
for slot_idx in 0..self.slots.len() {
let slot = &self.slots[slot_idx];
let mut inner = slot.inner.write().unwrap();
if let Some(key) = &inner.key {
match key {
CacheKey::EphemeralPage { file_id, blkno: _ } if *file_id == drop_file_id => {
// remove mapping for old buffer
self.remove_mapping(key);
inner.key = None;
inner.dirty = false;
}
_ => {}
}
}
}
}
// Section 1.3: Public interface functions for working with immutable file pages.
pub fn read_immutable_buf(&self, file_id: u64, blkno: u32) -> anyhow::Result<ReadBufResult> {
pub async fn read_immutable_buf(
&self,
file_id: FileId,
blkno: u32,
) -> anyhow::Result<ReadBufResult> {
let mut cache_key = CacheKey::ImmutableFilePage { file_id, blkno };
self.lock_for_read(&mut cache_key)
}
/// Immediately drop all buffers belonging to given file, without writeback
pub fn drop_buffers_for_immutable(&self, drop_file_id: u64) {
for slot_idx in 0..self.slots.len() {
let slot = &self.slots[slot_idx];
let mut inner = slot.inner.write().unwrap();
if let Some(key) = &inner.key {
match key {
CacheKey::ImmutableFilePage { file_id, blkno: _ }
if *file_id == drop_file_id =>
{
// remove mapping for old buffer
self.remove_mapping(key);
inner.key = None;
inner.dirty = false;
}
_ => {}
}
}
}
self.lock_for_read(&mut cache_key).await
}
//
@@ -469,14 +449,14 @@ impl PageCache {
///
/// If no page is found, returns None and *cache_key is left unmodified.
///
fn try_lock_for_read(&self, cache_key: &mut CacheKey) -> Option<PageReadGuard> {
async fn try_lock_for_read(&self, cache_key: &mut CacheKey) -> Option<PageReadGuard> {
let cache_key_orig = cache_key.clone();
if let Some(slot_idx) = self.search_mapping(cache_key) {
// The page was found in the mapping. Lock the slot, and re-check
// that it's still what we expected (because we released the mapping
// lock already, another thread could have evicted the page)
let slot = &self.slots[slot_idx];
let inner = slot.inner.read().unwrap();
let inner = slot.inner.read().await;
if inner.key.as_ref() == Some(cache_key) {
slot.inc_usage_count();
return Some(PageReadGuard(inner));
@@ -517,15 +497,11 @@ impl PageCache {
/// }
/// ```
///
fn lock_for_read(&self, cache_key: &mut CacheKey) -> anyhow::Result<ReadBufResult> {
async fn lock_for_read(&self, cache_key: &mut CacheKey) -> anyhow::Result<ReadBufResult> {
let (read_access, hit) = match cache_key {
CacheKey::MaterializedPage { .. } => {
unreachable!("Materialized pages use lookup_materialized_page")
}
CacheKey::EphemeralPage { .. } => (
&crate::metrics::PAGE_CACHE.read_accesses_ephemeral,
&crate::metrics::PAGE_CACHE.read_hits_ephemeral,
),
CacheKey::ImmutableFilePage { .. } => (
&crate::metrics::PAGE_CACHE.read_accesses_immutable,
&crate::metrics::PAGE_CACHE.read_hits_immutable,
@@ -536,7 +512,7 @@ impl PageCache {
let mut is_first_iteration = true;
loop {
// First check if the key already exists in the cache.
if let Some(read_guard) = self.try_lock_for_read(cache_key) {
if let Some(read_guard) = self.try_lock_for_read(cache_key).await {
if is_first_iteration {
hit.inc();
}
@@ -566,8 +542,7 @@ impl PageCache {
// Make the slot ready
let slot = &self.slots[slot_idx];
inner.key = Some(cache_key.clone());
inner.dirty = false;
slot.usage_count.store(1, Ordering::Relaxed);
slot.set_usage_count(1);
return Ok(ReadBufResult::NotFound(PageWriteGuard {
inner,
@@ -580,13 +555,13 @@ impl PageCache {
/// found, returns None.
///
/// When locking a page for writing, the search criteria is always "exact".
fn try_lock_for_write(&self, cache_key: &CacheKey) -> Option<PageWriteGuard> {
async fn try_lock_for_write(&self, cache_key: &CacheKey) -> Option<PageWriteGuard> {
if let Some(slot_idx) = self.search_mapping_for_write(cache_key) {
// The page was found in the mapping. Lock the slot, and re-check
// that it's still what we expected (because we don't released the mapping
// lock already, another thread could have evicted the page)
let slot = &self.slots[slot_idx];
let inner = slot.inner.write().unwrap();
let inner = slot.inner.write().await;
if inner.key.as_ref() == Some(cache_key) {
slot.inc_usage_count();
return Some(PageWriteGuard { inner, valid: true });
@@ -599,10 +574,10 @@ impl PageCache {
///
/// Similar to lock_for_read(), but the returned buffer is write-locked and
/// may be modified by the caller even if it's already found in the cache.
fn lock_for_write(&self, cache_key: &CacheKey) -> anyhow::Result<WriteBufResult> {
async fn lock_for_write(&self, cache_key: &CacheKey) -> anyhow::Result<WriteBufResult> {
loop {
// First check if the key already exists in the cache.
if let Some(write_guard) = self.try_lock_for_write(cache_key) {
if let Some(write_guard) = self.try_lock_for_write(cache_key).await {
return Ok(WriteBufResult::Found(write_guard));
}
@@ -628,8 +603,7 @@ impl PageCache {
// Make the slot ready
let slot = &self.slots[slot_idx];
inner.key = Some(cache_key.clone());
inner.dirty = false;
slot.usage_count.store(1, Ordering::Relaxed);
slot.set_usage_count(1);
return Ok(WriteBufResult::NotFound(PageWriteGuard {
inner,
@@ -667,10 +641,6 @@ impl PageCache {
*lsn = version.lsn;
Some(version.slot_idx)
}
CacheKey::EphemeralPage { file_id, blkno } => {
let map = self.ephemeral_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let map = self.immutable_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
@@ -694,10 +664,6 @@ impl PageCache {
None
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let map = self.ephemeral_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let map = self.immutable_page_map.read().unwrap();
Some(*map.get(&(*file_id, *blkno))?)
@@ -731,12 +697,6 @@ impl PageCache {
panic!("could not find old key in mapping")
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let mut map = self.ephemeral_page_map.write().unwrap();
map.remove(&(*file_id, *blkno))
.expect("could not find old key in mapping");
self.size_metrics.current_bytes_ephemeral.sub_page_sz(1);
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let mut map = self.immutable_page_map.write().unwrap();
map.remove(&(*file_id, *blkno))
@@ -776,17 +736,7 @@ impl PageCache {
}
}
}
CacheKey::EphemeralPage { file_id, blkno } => {
let mut map = self.ephemeral_page_map.write().unwrap();
match map.entry((*file_id, *blkno)) {
Entry::Occupied(entry) => Some(*entry.get()),
Entry::Vacant(entry) => {
entry.insert(slot_idx);
self.size_metrics.current_bytes_ephemeral.add_page_sz(1);
None
}
}
}
CacheKey::ImmutableFilePage { file_id, blkno } => {
let mut map = self.immutable_page_map.write().unwrap();
match map.entry((*file_id, *blkno)) {
@@ -808,7 +758,7 @@ impl PageCache {
/// Find a slot to evict.
///
/// On return, the slot is empty and write-locked.
fn find_victim(&self) -> anyhow::Result<(usize, RwLockWriteGuard<SlotInner>)> {
fn find_victim(&self) -> anyhow::Result<(usize, tokio::sync::RwLockWriteGuard<SlotInner>)> {
let iter_limit = self.slots.len() * 10;
let mut iters = 0;
loop {
@@ -820,10 +770,7 @@ impl PageCache {
if slot.dec_usage_count() == 0 {
let mut inner = match slot.inner.try_write() {
Ok(inner) => inner,
Err(TryLockError::Poisoned(err)) => {
anyhow::bail!("buffer lock was poisoned: {err:?}")
}
Err(TryLockError::WouldBlock) => {
Err(_err) => {
// If we have looped through the whole buffer pool 10 times
// and still haven't found a victim buffer, something's wrong.
// Maybe all the buffers were in locked. That could happen in
@@ -837,25 +784,8 @@ impl PageCache {
}
};
if let Some(old_key) = &inner.key {
if inner.dirty {
if let Err(err) = Self::writeback(old_key, inner.buf) {
// Writing the page to disk failed.
//
// FIXME: What to do here, when? We could propagate the error to the
// caller, but victim buffer is generally unrelated to the original
// call. It can even belong to a different tenant. Currently, we
// report the error to the log and continue the clock sweep to find
// a different victim. But if the problem persists, the page cache
// could fill up with dirty pages that we cannot evict, and we will
// loop retrying the writebacks indefinitely.
error!("writeback of buffer {:?} failed: {}", old_key, err);
continue;
}
}
// remove mapping for old buffer
self.remove_mapping(old_key);
inner.dirty = false;
inner.key = None;
}
return Ok((slot_idx, inner));
@@ -863,39 +793,19 @@ impl PageCache {
}
}
fn writeback(cache_key: &CacheKey, buf: &[u8]) -> Result<(), std::io::Error> {
match cache_key {
CacheKey::MaterializedPage {
hash_key: _,
lsn: _,
} => Err(std::io::Error::new(
std::io::ErrorKind::Other,
"unexpected dirty materialized page",
)),
CacheKey::EphemeralPage { file_id, blkno } => {
writeback_ephemeral_file(*file_id, *blkno, buf)
}
CacheKey::ImmutableFilePage {
file_id: _,
blkno: _,
} => Err(std::io::Error::new(
std::io::ErrorKind::Other,
"unexpected dirty immutable page",
)),
}
}
/// Initialize a new page cache
///
/// This should be called only once at page server startup.
fn new(num_pages: usize) -> Self {
assert!(num_pages > 0, "page cache size must be > 0");
// We could use Vec::leak here, but that potentially also leaks
// uninitialized reserved capacity. With into_boxed_slice and Box::leak
// this is avoided.
let page_buffer = Box::leak(vec![0u8; num_pages * PAGE_SZ].into_boxed_slice());
let size_metrics = &crate::metrics::PAGE_CACHE_SIZE;
size_metrics.max_bytes.set_page_sz(num_pages);
size_metrics.current_bytes_ephemeral.set_page_sz(0);
size_metrics.current_bytes_immutable.set_page_sz(0);
size_metrics.current_bytes_materialized_page.set_page_sz(0);
@@ -905,11 +815,7 @@ impl PageCache {
let buf: &mut [u8; PAGE_SZ] = chunk.try_into().unwrap();
Slot {
inner: RwLock::new(SlotInner {
key: None,
buf,
dirty: false,
}),
inner: tokio::sync::RwLock::new(SlotInner { key: None, buf }),
usage_count: AtomicU8::new(0),
}
})
@@ -917,7 +823,6 @@ impl PageCache {
Self {
materialized_page_map: Default::default(),
ephemeral_page_map: Default::default(),
immutable_page_map: Default::default(),
slots,
next_evict_slot: AtomicUsize::new(0),

View File

@@ -50,7 +50,8 @@ use crate::basebackup;
use crate::config::PageServerConf;
use crate::context::{DownloadBehavior, RequestContext};
use crate::import_datadir::import_wal_from_tar;
use crate::metrics::{LIVE_CONNECTIONS_COUNT, SMGR_QUERY_TIME};
use crate::metrics;
use crate::metrics::LIVE_CONNECTIONS_COUNT;
use crate::task_mgr;
use crate::task_mgr::TaskKind;
use crate::tenant;
@@ -306,39 +307,6 @@ async fn page_service_conn_main(
}
}
struct PageRequestMetrics {
get_rel_exists: metrics::Histogram,
get_rel_size: metrics::Histogram,
get_page_at_lsn: metrics::Histogram,
get_db_size: metrics::Histogram,
}
impl PageRequestMetrics {
fn new(tenant_id: &TenantId, timeline_id: &TimelineId) -> Self {
let tenant_id = tenant_id.to_string();
let timeline_id = timeline_id.to_string();
let get_rel_exists =
SMGR_QUERY_TIME.with_label_values(&["get_rel_exists", &tenant_id, &timeline_id]);
let get_rel_size =
SMGR_QUERY_TIME.with_label_values(&["get_rel_size", &tenant_id, &timeline_id]);
let get_page_at_lsn =
SMGR_QUERY_TIME.with_label_values(&["get_page_at_lsn", &tenant_id, &timeline_id]);
let get_db_size =
SMGR_QUERY_TIME.with_label_values(&["get_db_size", &tenant_id, &timeline_id]);
Self {
get_rel_exists,
get_rel_size,
get_page_at_lsn,
get_db_size,
}
}
}
struct PageServerHandler {
_conf: &'static PageServerConf,
broker_client: storage_broker::BrokerClientChannel,
@@ -406,7 +374,7 @@ impl PageServerHandler {
pgb.write_message_noflush(&BeMessage::CopyBothResponse)?;
pgb.flush().await?;
let metrics = PageRequestMetrics::new(&tenant_id, &timeline_id);
let metrics = metrics::SmgrQueryTimePerTimeline::new(&tenant_id, &timeline_id);
loop {
let msg = tokio::select! {
@@ -446,21 +414,21 @@ impl PageServerHandler {
let response = match neon_fe_msg {
PagestreamFeMessage::Exists(req) => {
let _timer = metrics.get_rel_exists.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetRelExists);
self.handle_get_rel_exists_request(&timeline, &req, &ctx)
.await
}
PagestreamFeMessage::Nblocks(req) => {
let _timer = metrics.get_rel_size.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetRelSize);
self.handle_get_nblocks_request(&timeline, &req, &ctx).await
}
PagestreamFeMessage::GetPage(req) => {
let _timer = metrics.get_page_at_lsn.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetPageAtLsn);
self.handle_get_page_at_lsn_request(&timeline, &req, &ctx)
.await
}
PagestreamFeMessage::DbSize(req) => {
let _timer = metrics.get_db_size.start_timer();
let _timer = metrics.start_timer(metrics::SmgrQueryType::GetDbSize);
self.handle_db_size_request(&timeline, &req, &ctx).await
}
};
@@ -501,7 +469,9 @@ impl PageServerHandler {
// Create empty timeline
info!("creating new timeline");
let tenant = get_active_tenant_with_timeout(tenant_id, &ctx).await?;
let timeline = tenant.create_empty_timeline(timeline_id, base_lsn, pg_version, &ctx)?;
let timeline = tenant
.create_empty_timeline(timeline_id, base_lsn, pg_version, &ctx)
.await?;
// TODO mark timeline as not ready until it reaches end_lsn.
// We might have some wal to import as well, and we should prevent compute
@@ -984,8 +954,8 @@ where
false
};
metrics::metric_vec_duration::observe_async_block_duration_by_result(
&*crate::metrics::BASEBACKUP_QUERY_TIME,
::metrics::metric_vec_duration::observe_async_block_duration_by_result(
&*metrics::BASEBACKUP_QUERY_TIME,
async move {
self.handle_basebackup_request(
pgb,

File diff suppressed because it is too large Load Diff

View File

@@ -12,14 +12,12 @@
//! len >= 128: 1XXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
//!
use crate::page_cache::PAGE_SZ;
use crate::tenant::block_io::{BlockCursor, BlockReader};
use crate::tenant::block_io::BlockCursor;
use crate::virtual_file::VirtualFile;
use std::cmp::min;
use std::io::{Error, ErrorKind};
impl<R> BlockCursor<R>
where
R: BlockReader,
{
impl<'a> BlockCursor<'a> {
/// Read a blob into a new buffer.
pub async fn read_blob(&self, offset: u64) -> Result<Vec<u8>, std::io::Error> {
let mut buf = Vec::new();
@@ -36,7 +34,7 @@ where
let mut blknum = (offset / PAGE_SZ as u64) as u32;
let mut off = (offset % PAGE_SZ as u64) as usize;
let mut buf = self.read_blk(blknum)?;
let mut buf = self.read_blk(blknum).await?;
// peek at the first byte, to determine if it's a 1- or 4-byte length
let first_len_byte = buf[off];
@@ -52,7 +50,7 @@ where
// it is split across two pages
len_buf[..thislen].copy_from_slice(&buf[off..PAGE_SZ]);
blknum += 1;
buf = self.read_blk(blknum)?;
buf = self.read_blk(blknum).await?;
len_buf[thislen..].copy_from_slice(&buf[0..4 - thislen]);
off = 4 - thislen;
} else {
@@ -73,7 +71,7 @@ where
if page_remain == 0 {
// continue on next page
blknum += 1;
buf = self.read_blk(blknum)?;
buf = self.read_blk(blknum).await?;
off = 0;
page_remain = PAGE_SZ;
}
@@ -86,35 +84,24 @@ where
}
}
/// A wrapper of `VirtualFile` that allows users to write blobs.
///
/// Abstract trait for a data sink that you can write blobs to.
///
pub trait BlobWriter {
/// Write a blob of data. Returns the offset that it was written to,
/// which can be used to retrieve the data later.
fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error>;
}
///
/// An implementation of BlobWriter to write blobs to anything that
/// implements std::io::Write.
///
pub struct WriteBlobWriter<W>
where
W: std::io::Write,
{
inner: W,
/// If a `BlobWriter` is dropped, the internal buffer will be
/// discarded. You need to call [`flush_buffer`](Self::flush_buffer)
/// manually before dropping.
pub struct BlobWriter<const BUFFERED: bool> {
inner: VirtualFile,
offset: u64,
/// A buffer to save on write calls, only used if BUFFERED=true
buf: Vec<u8>,
}
impl<W> WriteBlobWriter<W>
where
W: std::io::Write,
{
pub fn new(inner: W, start_offset: u64) -> Self {
WriteBlobWriter {
impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
pub fn new(inner: VirtualFile, start_offset: u64) -> Self {
Self {
inner,
offset: start_offset,
buf: Vec::with_capacity(Self::CAPACITY),
}
}
@@ -122,28 +109,79 @@ where
self.offset
}
/// Access the underlying Write object.
///
/// NOTE: WriteBlobWriter keeps track of the current write offset. If
/// you write something directly to the inner Write object, it makes the
/// internally tracked 'offset' to go out of sync. So don't do that.
pub fn into_inner(self) -> W {
self.inner
}
}
const CAPACITY: usize = if BUFFERED { PAGE_SZ } else { 0 };
impl<W> BlobWriter for WriteBlobWriter<W>
where
W: std::io::Write,
{
fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error> {
#[inline(always)]
/// Writes the given buffer directly to the underlying `VirtualFile`.
/// You need to make sure that the internal buffer is empty, otherwise
/// data will be written in wrong order.
async fn write_all_unbuffered(&mut self, src_buf: &[u8]) -> Result<(), Error> {
self.inner.write_all(src_buf).await?;
self.offset += src_buf.len() as u64;
Ok(())
}
#[inline(always)]
/// Flushes the internal buffer to the underlying `VirtualFile`.
pub async fn flush_buffer(&mut self) -> Result<(), Error> {
self.inner.write_all(&self.buf).await?;
self.buf.clear();
Ok(())
}
#[inline(always)]
/// Writes as much of `src_buf` into the internal buffer as it fits
fn write_into_buffer(&mut self, src_buf: &[u8]) -> usize {
let remaining = Self::CAPACITY - self.buf.len();
let to_copy = src_buf.len().min(remaining);
self.buf.extend_from_slice(&src_buf[..to_copy]);
self.offset += to_copy as u64;
to_copy
}
/// Internal, possibly buffered, write function
async fn write_all(&mut self, mut src_buf: &[u8]) -> Result<(), Error> {
if !BUFFERED {
assert!(self.buf.is_empty());
self.write_all_unbuffered(src_buf).await?;
return Ok(());
}
let remaining = Self::CAPACITY - self.buf.len();
// First try to copy as much as we can into the buffer
if remaining > 0 {
let copied = self.write_into_buffer(src_buf);
src_buf = &src_buf[copied..];
}
// Then, if the buffer is full, flush it out
if self.buf.len() == Self::CAPACITY {
self.flush_buffer().await?;
}
// Finally, write the tail of src_buf:
// If it wholly fits into the buffer without
// completely filling it, then put it there.
// If not, write it out directly.
if !src_buf.is_empty() {
assert_eq!(self.buf.len(), 0);
if src_buf.len() < Self::CAPACITY {
let copied = self.write_into_buffer(src_buf);
// We just verified above that src_buf fits into our internal buffer.
assert_eq!(copied, src_buf.len());
} else {
self.write_all_unbuffered(src_buf).await?;
}
}
Ok(())
}
/// Write a blob of data. Returns the offset that it was written to,
/// which can be used to retrieve the data later.
pub async fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error> {
let offset = self.offset;
if srcbuf.len() < 128 {
// Short blob. Write a 1-byte length header
let len_buf = srcbuf.len() as u8;
self.inner.write_all(&[len_buf])?;
self.offset += 1;
self.write_all(&[len_buf]).await?;
} else {
// Write a 4-byte length header
if srcbuf.len() > 0x7fff_ffff {
@@ -154,11 +192,153 @@ where
}
let mut len_buf = ((srcbuf.len()) as u32).to_be_bytes();
len_buf[0] |= 0x80;
self.inner.write_all(&len_buf)?;
self.offset += 4;
self.write_all(&len_buf).await?;
}
self.inner.write_all(srcbuf)?;
self.offset += srcbuf.len() as u64;
self.write_all(srcbuf).await?;
Ok(offset)
}
}
impl BlobWriter<true> {
/// Access the underlying `VirtualFile`.
///
/// This function flushes the internal buffer before giving access
/// to the underlying `VirtualFile`.
pub async fn into_inner(mut self) -> Result<VirtualFile, Error> {
self.flush_buffer().await?;
Ok(self.inner)
}
/// Access the underlying `VirtualFile`.
///
/// Unlike [`into_inner`](Self::into_inner), this doesn't flush
/// the internal buffer before giving access.
pub fn into_inner_no_flush(self) -> VirtualFile {
self.inner
}
}
impl BlobWriter<false> {
/// Access the underlying `VirtualFile`.
pub fn into_inner(self) -> VirtualFile {
self.inner
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::tenant::block_io::BlockReaderRef;
use rand::{Rng, SeedableRng};
async fn round_trip_test<const BUFFERED: bool>(blobs: &[Vec<u8>]) -> Result<(), Error> {
let temp_dir = tempfile::tempdir()?;
let path = temp_dir.path().join("file");
// Write part (in block to drop the file)
let mut offsets = Vec::new();
{
let file = VirtualFile::create(&path).await?;
let mut wtr = BlobWriter::<BUFFERED>::new(file, 0);
for blob in blobs.iter() {
let offs = wtr.write_blob(blob).await?;
offsets.push(offs);
}
// Write out one page worth of zeros so that we can
// read again with read_blk
let offs = wtr.write_blob(&vec![0; PAGE_SZ]).await?;
println!("Writing final blob at offs={offs}");
wtr.flush_buffer().await?;
}
let file = VirtualFile::open(&path).await?;
let rdr = BlockReaderRef::VirtualFile(&file);
let rdr = BlockCursor::new(rdr);
for (idx, (blob, offset)) in blobs.iter().zip(offsets.iter()).enumerate() {
let blob_read = rdr.read_blob(*offset).await?;
assert_eq!(
blob, &blob_read,
"mismatch for idx={idx} at offset={offset}"
);
}
Ok(())
}
fn random_array(len: usize) -> Vec<u8> {
let mut rng = rand::thread_rng();
(0..len).map(|_| rng.gen()).collect::<_>()
}
#[tokio::test]
async fn test_one() -> Result<(), Error> {
let blobs = &[vec![12, 21, 22]];
round_trip_test::<false>(blobs).await?;
round_trip_test::<true>(blobs).await?;
Ok(())
}
#[tokio::test]
async fn test_hello_simple() -> Result<(), Error> {
let blobs = &[
vec![0, 1, 2, 3],
b"Hello, World!".to_vec(),
Vec::new(),
b"foobar".to_vec(),
];
round_trip_test::<false>(blobs).await?;
round_trip_test::<true>(blobs).await?;
Ok(())
}
#[tokio::test]
async fn test_really_big_array() -> Result<(), Error> {
let blobs = &[
b"test".to_vec(),
random_array(10 * PAGE_SZ),
b"foobar".to_vec(),
];
round_trip_test::<false>(blobs).await?;
round_trip_test::<true>(blobs).await?;
Ok(())
}
#[tokio::test]
async fn test_arrays_inc() -> Result<(), Error> {
let blobs = (0..PAGE_SZ / 8)
.map(|v| random_array(v * 16))
.collect::<Vec<_>>();
round_trip_test::<false>(&blobs).await?;
round_trip_test::<true>(&blobs).await?;
Ok(())
}
#[tokio::test]
async fn test_arrays_random_size() -> Result<(), Error> {
let mut rng = rand::rngs::StdRng::seed_from_u64(42);
let blobs = (0..1024)
.map(|_| {
let mut sz: u16 = rng.gen();
// Make 50% of the arrays small
if rng.gen() {
sz |= 63;
}
random_array(sz.into())
})
.collect::<Vec<_>>();
round_trip_test::<false>(&blobs).await?;
round_trip_test::<true>(&blobs).await?;
Ok(())
}
#[tokio::test]
async fn test_arrays_page_boundary() -> Result<(), Error> {
let blobs = &[
random_array(PAGE_SZ - 4),
random_array(PAGE_SZ - 4),
random_array(PAGE_SZ - 4),
];
round_trip_test::<false>(blobs).await?;
round_trip_test::<true>(blobs).await?;
Ok(())
}
}

View File

@@ -2,11 +2,12 @@
//! Low-level Block-oriented I/O functions
//!
use super::ephemeral_file::EphemeralFile;
use super::storage_layer::delta_layer::{Adapter, DeltaLayerInner};
use crate::page_cache::{self, PageReadGuard, ReadBufResult, PAGE_SZ};
use crate::virtual_file::VirtualFile;
use bytes::Bytes;
use std::ops::{Deref, DerefMut};
use std::os::unix::fs::FileExt;
use std::sync::atomic::AtomicU64;
/// This is implemented by anything that can read 8 kB (PAGE_SZ)
/// blocks, using the page cache
@@ -14,68 +15,83 @@ use std::sync::atomic::AtomicU64;
/// There are currently two implementations: EphemeralFile, and FileBlockReader
/// below.
pub trait BlockReader {
///
/// Read a block. Returns a "lease" object that can be used to
/// access to the contents of the page. (For the page cache, the
/// lease object represents a lock on the buffer.)
///
fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error>;
///
/// Create a new "cursor" for reading from this reader.
///
/// A cursor caches the last accessed page, allowing for faster
/// access if the same block is accessed repeatedly.
fn block_cursor(&self) -> BlockCursor<&Self>
where
Self: Sized,
{
BlockCursor::new(self)
}
fn block_cursor(&self) -> BlockCursor<'_>;
}
impl<B> BlockReader for &B
where
B: BlockReader,
{
fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
(*self).read_blk(blknum)
fn block_cursor(&self) -> BlockCursor<'_> {
(*self).block_cursor()
}
}
/// A block accessible for reading
///
/// During builds with `#[cfg(test)]`, this is a proper enum
/// with two variants to support testing code. During normal
/// builds, it just has one variant and is thus a cheap newtype
/// wrapper of [`PageReadGuard`]
pub enum BlockLease {
/// Reference to an in-memory copy of an immutable on-disk block.
pub enum BlockLease<'a> {
PageReadGuard(PageReadGuard<'static>),
EphemeralFileMutableTail(&'a [u8; PAGE_SZ]),
#[cfg(test)]
Rc(std::rc::Rc<[u8; PAGE_SZ]>),
Arc(std::sync::Arc<[u8; PAGE_SZ]>),
}
impl From<PageReadGuard<'static>> for BlockLease {
fn from(value: PageReadGuard<'static>) -> Self {
impl From<PageReadGuard<'static>> for BlockLease<'static> {
fn from(value: PageReadGuard<'static>) -> BlockLease<'static> {
BlockLease::PageReadGuard(value)
}
}
#[cfg(test)]
impl From<std::rc::Rc<[u8; PAGE_SZ]>> for BlockLease {
fn from(value: std::rc::Rc<[u8; PAGE_SZ]>) -> Self {
BlockLease::Rc(value)
impl<'a> From<std::sync::Arc<[u8; PAGE_SZ]>> for BlockLease<'a> {
fn from(value: std::sync::Arc<[u8; PAGE_SZ]>) -> Self {
BlockLease::Arc(value)
}
}
impl Deref for BlockLease {
impl<'a> Deref for BlockLease<'a> {
type Target = [u8; PAGE_SZ];
fn deref(&self) -> &Self::Target {
match self {
BlockLease::PageReadGuard(v) => v.deref(),
BlockLease::EphemeralFileMutableTail(v) => v,
#[cfg(test)]
BlockLease::Rc(v) => v.deref(),
BlockLease::Arc(v) => v.deref(),
}
}
}
/// Provides the ability to read blocks from different sources,
/// similar to using traits for this purpose.
///
/// Unlike traits, we also support the read function to be async though.
pub(crate) enum BlockReaderRef<'a> {
FileBlockReader(&'a FileBlockReader),
EphemeralFile(&'a EphemeralFile),
Adapter(Adapter<&'a DeltaLayerInner>),
#[cfg(test)]
TestDisk(&'a super::disk_btree::tests::TestDisk),
#[cfg(test)]
VirtualFile(&'a VirtualFile),
}
impl<'a> BlockReaderRef<'a> {
#[inline(always)]
async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
use BlockReaderRef::*;
match self {
FileBlockReader(r) => r.read_blk(blknum).await,
EphemeralFile(r) => r.read_blk(blknum).await,
Adapter(r) => r.read_blk(blknum).await,
#[cfg(test)]
TestDisk(r) => r.read_blk(blknum),
#[cfg(test)]
VirtualFile(r) => r.read_blk(blknum).await,
}
}
}
@@ -89,7 +105,7 @@ impl Deref for BlockLease {
///
/// ```no_run
/// # use pageserver::tenant::block_io::{BlockReader, FileBlockReader};
/// # let reader: FileBlockReader<std::fs::File> = unimplemented!("stub");
/// # let reader: FileBlockReader = unimplemented!("stub");
/// let cursor = reader.block_cursor();
/// let buf = cursor.read_blk(1);
/// // do stuff with 'buf'
@@ -97,65 +113,68 @@ impl Deref for BlockLease {
/// // do stuff with 'buf'
/// ```
///
pub struct BlockCursor<R>
where
R: BlockReader,
{
reader: R,
pub struct BlockCursor<'a> {
reader: BlockReaderRef<'a>,
}
impl<R> BlockCursor<R>
where
R: BlockReader,
{
pub fn new(reader: R) -> Self {
impl<'a> BlockCursor<'a> {
pub(crate) fn new(reader: BlockReaderRef<'a>) -> Self {
BlockCursor { reader }
}
// Needed by cli
pub fn new_fileblockreader(reader: &'a FileBlockReader) -> Self {
BlockCursor {
reader: BlockReaderRef::FileBlockReader(reader),
}
}
pub fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
self.reader.read_blk(blknum)
/// Read a block.
///
/// Returns a "lease" object that can be used to
/// access to the contents of the page. (For the page cache, the
/// lease object represents a lock on the buffer.)
#[inline(always)]
pub async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
self.reader.read_blk(blknum).await
}
}
static NEXT_ID: AtomicU64 = AtomicU64::new(1);
/// An adapter for reading a (virtual) file using the page cache.
///
/// The file is assumed to be immutable. This doesn't provide any functions
/// for modifying the file, nor for invalidating the cache if it is modified.
pub struct FileBlockReader<F> {
pub file: F,
pub struct FileBlockReader {
pub file: VirtualFile,
/// Unique ID of this file, used as key in the page cache.
file_id: u64,
file_id: page_cache::FileId,
}
impl<F> FileBlockReader<F>
where
F: FileExt,
{
pub fn new(file: F) -> Self {
let file_id = NEXT_ID.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
impl FileBlockReader {
pub fn new(file: VirtualFile) -> Self {
let file_id = page_cache::next_file_id();
FileBlockReader { file_id, file }
}
/// Read a page from the underlying file into given buffer.
fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), std::io::Error> {
async fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), std::io::Error> {
assert!(buf.len() == PAGE_SZ);
self.file.read_exact_at(buf, blkno as u64 * PAGE_SZ as u64)
self.file
.read_exact_at(buf, blkno as u64 * PAGE_SZ as u64)
.await
}
}
impl<F> BlockReader for FileBlockReader<F>
where
F: FileExt,
{
fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
// Look up the right page
/// Read a block.
///
/// Returns a "lease" object that can be used to
/// access to the contents of the page. (For the page cache, the
/// lease object represents a lock on the buffer.)
pub async fn read_blk(&self, blknum: u32) -> Result<BlockLease, std::io::Error> {
let cache = page_cache::get();
loop {
match cache
.read_immutable_buf(self.file_id, blknum)
.await
.map_err(|e| {
std::io::Error::new(
std::io::ErrorKind::Other,
@@ -165,7 +184,7 @@ where
ReadBufResult::Found(guard) => break Ok(guard.into()),
ReadBufResult::NotFound(mut write_guard) => {
// Read the page from disk into the buffer
self.fill_buffer(write_guard.deref_mut(), blknum)?;
self.fill_buffer(write_guard.deref_mut(), blknum).await?;
write_guard.mark_valid();
// Swap for read lock
@@ -176,6 +195,12 @@ where
}
}
impl BlockReader for FileBlockReader {
fn block_cursor(&self) -> BlockCursor<'_> {
BlockCursor::new(BlockReaderRef::FileBlockReader(self))
}
}
///
/// Trait for block-oriented output
///

View File

@@ -7,6 +7,7 @@ use anyhow::Context;
use pageserver_api::models::TenantState;
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use tokio::sync::OwnedMutexGuard;
use tokio_util::sync::CancellationToken;
use tracing::{error, info, instrument, warn, Instrument, Span};
use utils::{
@@ -82,6 +83,8 @@ async fn create_remote_delete_mark(
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"mark_upload",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await
.context("mark_upload")?;
@@ -171,6 +174,8 @@ async fn remove_tenant_remote_delete_mark(
FAILED_UPLOAD_WARN_THRESHOLD,
FAILED_REMOTE_OP_RETRIES,
"remove_tenant_remote_delete_mark",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await
.context("remove_tenant_remote_delete_mark")?;
@@ -212,6 +217,19 @@ async fn cleanup_remaining_fs_traces(
))?
});
// Make sure previous deletions are ordered before mark removal.
// Otherwise there is no guarantee that they reach the disk before mark deletion.
// So its possible for mark to reach disk first and for other deletions
// to be reordered later and thus missed if a crash occurs.
// Note that we dont need to sync after mark file is removed
// because we can tolerate the case when mark file reappears on startup.
let tenant_path = &conf.tenant_path(tenant_id);
if tenant_path.exists() {
crashsafe::fsync_async(&conf.tenant_path(tenant_id))
.await
.context("fsync_pre_mark_remove")?;
}
rm(conf.tenant_deleted_mark_file_path(tenant_id), false).await?;
fail::fail_point!("tenant-delete-before-remove-tenant-dir", |_| {
@@ -225,6 +243,32 @@ async fn cleanup_remaining_fs_traces(
Ok(())
}
pub(crate) async fn remote_delete_mark_exists(
conf: &PageServerConf,
tenant_id: &TenantId,
remote_storage: &GenericRemoteStorage,
) -> anyhow::Result<bool> {
// If remote storage is there we rely on it
let remote_mark_path = remote_tenant_delete_mark_path(conf, tenant_id).context("path")?;
let result = backoff::retry(
|| async { remote_storage.download(&remote_mark_path).await },
|e| matches!(e, DownloadError::NotFound),
SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS,
SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS,
"fetch_tenant_deletion_mark",
// TODO: use a cancellation token (https://github.com/neondatabase/neon/issues/5066)
backoff::Cancel::new(CancellationToken::new(), || unreachable!()),
)
.await;
match result {
Ok(_) => Ok(true),
Err(DownloadError::NotFound) => Ok(false),
Err(e) => Err(anyhow::anyhow!(e)).context("remote_delete_mark_exists")?,
}
}
/// Orchestrates tenant shut down of all tasks, removes its in-memory structures,
/// and deletes its data from both disk and s3.
/// The sequence of steps:
@@ -238,8 +282,9 @@ async fn cleanup_remaining_fs_traces(
/// It is resumable from any step in case a crash/restart occurs.
/// There are three entrypoints to the process:
/// 1. [`DeleteTenantFlow::run`] this is the main one called by a management api handler.
/// 2. [`DeleteTenantFlow::resume`] is called during restarts when local or remote deletion marks are still there.
/// Note the only other place that messes around timeline delete mark is the `Tenant::spawn_load` function.
/// 2. [`DeleteTenantFlow::resume_from_load`] is called during restarts when local or remote deletion marks are still there.
/// 3. [`DeleteTenantFlow::resume_from_attach`] is called when deletion is resumed tenant is found to be deleted during attach process.
/// Note the only other place that messes around timeline delete mark is the `Tenant::spawn_load` function.
#[derive(Default)]
pub enum DeleteTenantFlow {
#[default]
@@ -359,26 +404,14 @@ impl DeleteTenantFlow {
None => return Ok(None),
};
// If remote storage is there we rely on it
let remote_mark_path = remote_tenant_delete_mark_path(conf, &tenant_id)?;
let result = backoff::retry(
|| async { remote_storage.download(&remote_mark_path).await },
|e| matches!(e, DownloadError::NotFound),
SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS,
SHOULD_RESUME_DELETION_FETCH_MARK_ATTEMPTS,
"fetch_tenant_deletion_mark",
)
.await;
match result {
Ok(_) => Ok(acquire(tenant)),
Err(DownloadError::NotFound) => Ok(None),
Err(e) => Err(anyhow::anyhow!(e)).context("should_resume_deletion")?,
if remote_delete_mark_exists(conf, &tenant_id, remote_storage).await? {
Ok(acquire(tenant))
} else {
Ok(None)
}
}
pub(crate) async fn resume(
pub(crate) async fn resume_from_load(
guard: DeletionGuard,
tenant: &Arc<Tenant>,
init_order: Option<&InitializationOrder>,
@@ -388,7 +421,7 @@ impl DeleteTenantFlow {
let (_, progress) = completion::channel();
tenant
.set_stopping(progress, true)
.set_stopping(progress, true, false)
.await
.expect("cant be stopping or broken");
@@ -416,6 +449,31 @@ impl DeleteTenantFlow {
.await
}
pub(crate) async fn resume_from_attach(
guard: DeletionGuard,
tenant: &Arc<Tenant>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
ctx: &RequestContext,
) -> Result<(), DeleteTenantError> {
let (_, progress) = completion::channel();
tenant
.set_stopping(progress, false, true)
.await
.expect("cant be stopping or broken");
tenant.attach(ctx).await.context("attach")?;
Self::background(
guard,
tenant.conf,
tenant.remote_storage.clone(),
tenants,
tenant,
)
.await
}
async fn prepare(
tenants: &tokio::sync::RwLock<TenantsMap>,
tenant_id: TenantId,

Some files were not shown because too many files have changed in this diff Show More