Compare commits

...

770 Commits

Author SHA1 Message Date
Konstantin Knizhnik
13b1699518 Specify statement_timeout within safe_pgsql command 2024-02-06 21:51:12 +02:00
John Spray
4f57dc6cc6 control_plane/attachment_service: take public key as value (#6651)
It's awkward to point to a file when doing some kinds of ad-hoc
deployment (like right now, when I'm hacking on a helm chart and haven't
quite hooked up secrets properly yet). We take all the rest of the
secrets as CLI args directly, so let's do the same for the public key.
2024-02-06 19:08:39 +00:00
Heikki Linnakangas
dc811d1923 Add a span to 'create_neon_superuser' for better OpenTelemetry traces (#6644)
create_neon_superuser runs the first queries in the database after cold
start. Traces suggest that those first queries can make up a significant
fraction of the cold start time. Make it more visible by adding an
explicit tracing span to it; currently you have to deduce it by
subtracting all the other child spans from the time spent in the parent
'apply_config' span.
2024-02-06 20:37:35 +02:00
Alexander Bayandin
e65f0fe874 CI(benchmarks): make job split consistent across reruns (#6614)
## Problem

We've got several issues with the current `benchmarks` job setup:
- The `benchmark_durations.json` file (that we generate at runtime to
split tests into several jobs[0]) is not consistent between these
jobs (and not consistent at all with the file generated if we rerun the
job), i.e. the test selection for each job can be different, which could
result in missed tests in a test run.
- `scripts/benchmark_durations` doesn't fetch all tests from the
database (it doesn't expect any extra directories inside
`test_runner/performance`)
- For some reason, the current split into 4 groups ends up with the 4th
group having no tests to run, which fails the job[1]

- [0] https://github.com/neondatabase/neon/pull/4683
- [1] https://github.com/neondatabase/neon/issues/6629

## Summary of changes
- Generate the `benchmark_durations.json` file once before we start the
`benchmarks` jobs (this makes it consistent across the jobs) and pass
the file content through a GitHub Actions input (this makes it
consistent for reruns)
- Fix the SQL query in `scripts/benchmark_durations` so it fetches all
required tests
- Split benchmarks into 5 jobs instead of 4.
2024-02-06 17:00:55 +00:00
Joonas Koivunen
bb92721168 build: migrate check-style-rust to small runners (#6588)
We have more small runners than large runners, and often a shortage of
large runners. Migrate `check-style-rust` to run on small runners.
2024-02-06 15:53:04 +00:00
Christian Schwarz
d7b29aace7 refactor(walredo): don't create WalRedoManager for broken tenants (#6597)
When we later introduce a global pool of pre-spawned walredo
processes (https://github.com/neondatabase/neon/issues/6581), this
refactoring avoids plumbing the reference to the pool through to all the
places where we create a broken tenant.

Builds atop the refactoring in #6583
2024-02-06 16:20:02 +01:00
Christian Schwarz
53a3ed0a7e debug_assert presence of shard_id tracing field (#6572)
also:
fixes https://github.com/neondatabase/neon/issues/6638
2024-02-06 14:43:33 +00:00
dependabot[bot]
27a3c9ecbe build(deps): bump cryptography from 41.0.6 to 42.0.0 (#6643) 2024-02-06 13:15:07 +00:00
John Spray
6297843317 tests: flakiness fixes in pageserver tests (#6632)
Fix several test flakes:
- test_sharding_service_smoke had log failures on "Dropped LSN updates"
- test_emergency_mode had log failures on a deletion queue shutdown
check, where the check was incorrect because it was expecting channel
receiver to stay alive after cancellation token was fired.
- test_secondary_mode_eviction had racing heatmap uploads because the
test was using a live migration hook to set up locations, where that
migration was itself uploading heatmaps and generally making the
situation more complex than it needed to be.

These are the failure modes that I saw when spot checking the last few
failures of each test.

This will mostly/completely address #6511, but I'll leave that ticket
open for a couple of days and then check whether either of the tests named
in that ticket is still flaky.

Related #6511
2024-02-06 12:49:41 +00:00
Vadim Kharitonov
dae56ef60c Do not suspend compute if there is an active logical replication subscription. (#6570)
## Problem

The idea is to keep the compute up and running if there are any active
logical replication subscriptions.

### Rationale

- The Write-Ahead Logging (WAL) files, which contain the data changes,
will need to be retained on the publisher side until the subscriber is
able to connect again and apply these changes. This could potentially
lead to increased disk usage on the publisher, and we do not want to
disrupt the source: I think it is more pain for our customer to resolve
storage issues on the source than to pay for the compute at the target.
- Upon resuming the compute resources, the subscriber will start
consuming and applying the changes from the retained WAL files. The time
taken to catch up will depend on the volume of changes and the
configured vCPUs. Keeping the compute running avoids having to explain
complex situations where we lag behind (in extreme cases we could lag
behind by hours, days or even months).
- I think an important use case for logical replication from a source is
a one-time migration or release upgrade. In this case the customer would
not mind the compute staying unsuspended for the duration of the migration.

We need to document this in the release notes and the documentation in
the context of logical replication where Neon is the target (subscriber)

### See internal discussion here

https://neondb.slack.com/archives/C04DGM6SMTM/p1706793400746539?thread_ts=1706792628.701279&cid=C04DGM6SMTM
2024-02-06 12:15:42 +00:00
Christian Schwarz
0de46fd6f2 heavier_once_cell: switch to tokio::sync::RwLock (#6589)
Using the RwLock reduces contention on the hot path.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-02-06 14:04:15 +02:00
Joonas Koivunen
53743991de uploader: avoid cloning vecs just to get Bytes (#6645)
Avoid cloning the serialized heatmap on every upload attempt by turning it
into `bytes::Bytes` before the retry loop, so each clone is just a refcount
bump instead of a copy of the whole vec.

Also fixes one cancellation token clone I had missed in #6618.
Cc: #6096
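
For illustration, a minimal sketch of the pattern described above (not the
uploader's actual code; the function name is made up): converting the
serialized buffer into `bytes::Bytes` once means each retry clones a
refcounted handle rather than the whole buffer.

```
use bytes::Bytes;

// Hypothetical sketch: serialize once, convert once, clone cheaply per attempt.
fn upload_with_retries(serialized: Vec<u8>, attempts: usize) {
    // Moving the Vec into Bytes does not copy; later clones only bump a refcount.
    let body = Bytes::from(serialized);
    for _attempt in 0..attempts {
        let attempt_body = body.clone(); // cheap, refcounted clone
        // ... hand `attempt_body` to the upload call and retry on failure ...
        drop(attempt_body);
    }
}
```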
2024-02-06 11:34:13 +00:00
John Spray
431f4234d4 storage controller: embed database migrations in binary (#6637)
## Problem

We don't have a neat way to carry around migration .sql files during
deploy, and in any case would prefer to avoid depending on diesel CLI to
deploy.

## Summary of changes

- Use `diesel_migrations` crate to embed migrations in our binary
- Run migrations on startup
- Drop the diesel dependency in the `neon_local` binary, as the
attachment_service binary just needs the database to exist. Do database
creation with a simple `createdb`.


Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-02-06 10:07:10 +00:00
Christian Schwarz
edcde05c1c refactor(walredo): split up the massive walredo.rs (#6583)
Part of https://github.com/neondatabase/neon/issues/6581
2024-02-06 09:44:49 +00:00
Christian Schwarz
e196d974cc pagebench: actually implement --num_clients (#6640)
Will need this to validate per-tenant throttling in
https://github.com/neondatabase/neon/issues/5899
2024-02-06 10:34:16 +01:00
Joonas Koivunen
947165788d refactor: needless cancellation token cloning (#6618)
The solution we ended up with for `backoff::retry` always requires cloning
cancellation tokens, even when they are only `.await`ed. Fix that, and
also turn the return type into `Option<Result<T, E>>`, avoiding the need
for the `E::cancelled()` fn that had to be passed in.

Cc: #6096
2024-02-06 09:39:06 +02:00
John Spray
8e114bd610 control_plane/attachment_service: make --database-url optional (#6636)
## Problem

This change was left out of #6585 accidentally -- just forgot to push
the very last version of my branch.

Now that we can load database url from Secrets Manager, we don't always
need it on the CLI any more. We should let the user omit it instead of
passing `--database-url ""`

## Summary of changes

- Make `--database-url` optional
2024-02-05 20:31:55 +01:00
John Spray
cb7c89332f control_plane: fix tenant GET, clean up endpoints (#6553)
Cleanups from https://github.com/neondatabase/neon/pull/6394

- There was a rogue `*` breaking the `GET /tenant/:tenant_id`, which
passes through to shard zero
- There was a duplicate migrate endpoint
- There are un-prefixed API endpoints that were only needed for compat
tests and can now be removed.
2024-02-05 14:29:05 +00:00
Conrad Ludgate
74c5e3d9b8 use string interner for project cache (#6578)
## Problem

Running some memory profiling with a high concurrent request rate shows
what appears to be some memory fragmentation.

## Summary of changes

Eventually, we will want to separate global memory (caches) from local
memory (per connection handshake and per passthrough).

Using a string interner for the project info cache helps reduce some of
the fragmentation of the global cache by keeping project strings in a
single dedicated heap instead of scattering them across all requests.

At the same time, the interned key is 4 bytes vs the 24 bytes that
`SmolStr` offers.

Important: we should only store verified strings in the interner because
there's no way to remove them afterwards. Good for caching responses
from console.
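
As an illustration of the approach (a minimal sketch, not proxy's actual
interner; the type and method names are made up), an interner hands out small
integer keys for strings that are stored exactly once:

```
use std::collections::HashMap;

// Hypothetical sketch of a string interner with 4-byte keys.
// Entries are never removed, which is why only verified strings should go in.
#[derive(Default)]
struct ProjectInterner {
    lookup: HashMap<String, u32>,
    strings: Vec<String>,
}

impl ProjectInterner {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.lookup.get(s) {
            return id; // already stored once, reuse the 4-byte key
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_owned());
        self.lookup.insert(s.to_owned(), id);
        id
    }

    fn resolve(&self, id: u32) -> &str {
        &self.strings[id as usize]
    }
}
```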
2024-02-05 14:27:25 +00:00
Joonas Koivunen
5e8deca268 metrics: remove broken tenants (#6586)
Before tenant migration it made sense to leak broken tenants in the
metrics until restart. Nowadays it makes less sense because we set the
tenant broken on cancellation. The set metric still allows
filterable alerting.

Fixes: #6507
2024-02-05 14:49:35 +02:00
Joonas Koivunen
db89b13aaa fix: use the shared constant download buffer size (#6620)
Noticed that we had forgotten to use
`remote_timeline_client.rs::BUFFER_SIZE` in one instance.
2024-02-05 13:10:08 +01:00
Abhijeet Patil
01c57ec547 Removed Uploading of perf result to git repo 'zenith-perf-data' (#6590)
## Problem
We were archiving the perf benchmarks to

- neon DB
- git repo `zenith-perf-data`

As the perf batch ran in parallel, uploading the results to the
`zenith-perf-data` git repo resulted in merge conflicts, which made the
run flaky and, as a side effect, caused the build to start failing.

The problem is described in
https://github.com/neondatabase/neon/issues/5160

## Summary of changes
As the results in the git repo were not used, uploading them was redundant,
so this PR removes the uploading of perf results to the git repo.
The shell script `generate_and_push_perf_report.sh` was using a py
script
[git-upload](https://github.com/neondatabase/neon/compare/remove-perf-benchmark-git-upload?expand=1#diff-c6d938e7f060e487367d9dc8055245c82b51a73c1f97956111a495a8a86e9a33)
and
[scripts/generate_perf_report_page.py](https://github.com/neondatabase/neon/pull/6590/files#diff-81af2147e72d07e4cf8ee4395632596d805d6168ba75c71cab58db2659956ef8)
which are not used anywhere else in the repo, so those are cleaned up as well.

2024-02-05 10:08:20 +00:00
Arpad Müller
56cf360439 Don't preserve temp files on creation errors of delta layers (#6612)
There is currently no cleanup done after a delta layer creation error,
so the temporary files of unfinished delta layers can accumulate. The
problem gets worse as the operation gets retried. Therefore, delete them
from disk (if something has been written to disk).
2024-02-05 09:53:37 +00:00
Heikki Linnakangas
df7bee7cfa Fix compilation with recent glibc headers with close_range(2).
I was getting an error:

    /home/heikki/git-sandbox/neon//pgxn/neon_walredo/walredoproc.c:161:5: error: conflicting types for ‘close_range’; have ‘int(unsigned int,  unsigned int,  unsigned int)’
      161 | int close_range(unsigned int start_fd, unsigned int count, unsigned int flags) {
          |     ^~~~~~~~~~~
    In file included from /usr/include/x86_64-linux-gnu/bits/sigstksz.h:24,
                     from /usr/include/signal.h:328,
                     from /home/heikki/git-sandbox/neon//pgxn/neon_walredo/walredoproc.c:50:
    /usr/include/unistd.h:1208:12: note: previous declaration of ‘close_range’ with type ‘int(unsigned int,  unsigned int,  int)’
     1208 | extern int close_range (unsigned int __fd, unsigned int __max_fd,
          |            ^~~~~~~~~~~

The discrepancy is in the 3rd argument. Apparently in the glibc
wrapper it's signed.

As a quick fix, rename our close_range() function, the one that calls
syscall() directly, to avoid the clash with the glibc wrapper. In the
long term, an autoconf test would be nice, and some equivalent on
macOS, see issue #6580.
2024-02-05 11:50:45 +02:00
Joonas Koivunen
70f646ffe2 More logging fixes (#6584)
I was on-call this week; these changes would have helped me understand
more of the system, faster:
- move the stray attaching-start logging inside the span it starts, and add
the generation
- log the ancestor timeline_id, or bootstrapping, at the beginning of
timeline creation
2024-02-05 09:34:03 +02:00
Vadim Kharitonov
7e8529bec1 Revert "Update pgvector to v0.6.0, third attempt" (#6610)
The issue is still unsolved because of shmem size in VMs. Need to figure it out before applying this patch.

For more details:

```
ERROR:  could not resize shared memory segment "/PostgreSQL.2892504480" to 16774205952 bytes: No space left on device
```

As an example, the same issue in community pgvector/pgvector#453.
2024-02-04 22:27:07 +00:00
Clarence
09519c1773 chore: update wording in docs to improve readability (#6607)
## Problem
 Found typos while reading the docs

## Summary of changes
Fixed the typos found
2024-02-04 19:33:38 +00:00
Joonas Koivunen
9dd69194d4 refactor(proxy): std::io::Write for BytesMut exists (#6606)
Replace the TODO with an existing implementation via `BufMut::writer`.
2024-02-03 22:15:59 +00:00
Heikki Linnakangas
647b85fc15 Update pgvector to v0.6.0, third attempt
This includes a compatibility patch that is needed because pgvector
now skips WAL-logging during the index build, and WAL-logs the index
only in one go at the end. That's how GIN, GiST and SP-GIST index
builds work in core PostgreSQL too, but we need some Neon-specific
calls to mark the beginning and end of those build phases.

pgvector is the first index AM that does that with parallel workers,
so I had to modify those functions in the Neon extension to be aware
of parallel workers. Only the leader needs to create the underlying
file and perform the WAL-logging. (In principle, the parallel workers
could participate in the WAL-logging too, but pgvector doesn't do
that. This will need some further work if that changes).

The previous attempt at this (#6592) missed that parallel workers
needed those changes, and segfaulted in parallel build that spilled to
disk.

Testing
-------

We don't have a place for regression tests of extensions at the
moment. I tested this manually with the following script:

```
CREATE EXTENSION IF NOT EXISTS vector;

DROP TABLE IF EXISTS tst;
CREATE TABLE tst (i serial, v vector(3));

INSERT INTO tst (v) SELECT ARRAY[random(), random(), random()] FROM generate_series(1, 15000) g;

-- Serial build, in memory
ALTER TABLE tst SET (parallel_workers=0);
SET maintenance_work_mem='50 MB';
CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops);

-- Test that the index works. (The table contents are random, and the
-- search is approximate anyway, so we cannot check the exact values.
-- For now, just eyeball that they look reasonable)
set enable_seqscan=off;
explain SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5;
SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5;

DROP INDEX idx;

-- Serial build, spills to disk

ALTER TABLE tst SET (parallel_workers=0);
SET maintenance_work_mem='5 MB';
CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops);
SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5;
DROP INDEX idx;

-- Parallel build, in memory

ALTER TABLE tst SET (parallel_workers=4);
SET maintenance_work_mem='50 MB';
CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops);
SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5;
DROP INDEX idx;

-- Parallel build, spills to disk

ALTER TABLE tst SET (parallel_workers=4);
SET maintenance_work_mem='5 MB';
CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops);
SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5;
DROP INDEX idx;
```
2024-02-03 09:19:37 +02:00
Heikki Linnakangas
c96aead502 Reorganize .dockerignore
Author: Alexander Bayandin <alexander@neon.tech>
2024-02-03 09:19:37 +02:00
Arpad Müller
aac8eb2c36 Minor logging improvements (#6593)
* log when `lsn_by_timestamp` finished together with its result
* add back logging of the layer name as suggested in
https://github.com/neondatabase/neon/pull/6549#discussion_r1475756808
2024-02-03 02:16:20 +01:00
Clarence
3d1b08496a Update words in docs for better readability (#6600)
## Problem
 Found typos while reading the docs

## Summary of changes
Fixed the typos found
2024-02-03 00:59:39 +00:00
Arpad Müller
0ac2606c8a S3 restore test: Use a workaround to enable moto's self-copy support (#6594)
While working on https://github.com/getmoto/moto/pull/7303 I discovered
that if you enable bucket encryption, moto allows self-copies. So we can
un-ignore the test. I tried it out locally, it works great.

Followup of #6533, part of
https://github.com/neondatabase/cloud/issues/8233
2024-02-02 23:45:57 +01:00
Em Sharnoff
d820d64e38 Bump vm-builder v0.21.0 -> v0.23.2 (#6480)
Relevant changes were all from v0.23.0:

- neondatabase/autoscaling#724
- neondatabase/autoscaling#726
- neondatabase/autoscaling#732

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-02-02 22:39:20 +00:00
Arthur Petukhovsky
f2aa96f003 Console split RFC (#1997)
[Rendered](https://github.com/neondatabase/neon/blob/rfc-console-split/docs/rfcs/017-console-split.md)

Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com>
2024-02-02 23:41:55 +02:00
Sasha Krassovsky
2fd8e24c8f Switch sleeps to wait_until (#6575)
## Problem
I didn't know about `wait_until` and was relying on `sleep` to wait for
stuff. This caused some tests to be flaky.
https://github.com/neondatabase/neon/issues/6561
## Summary of changes
Switch to `wait_until`; this should make the tests less flaky.
2024-02-02 21:32:40 +00:00
Heikki Linnakangas
c9876b0993 Fix double-free bug in walredo process. (#6534)
At the end of ApplyRecord(), we called pfree on the decoded record, if
it was "oversized". However, we had already linked it to the "decode
queue" list in XLogReaderState. If we later called XLogBeginRead(), it
called ResetDecoder and tried to free the same record again.

The conditions to hit this are:

- a large WAL record (larger than about 64 kB I think, per
DEFAULT_DECODE_BUFFER_SIZE), and
- another WAL record processed by the same WAL redo process after the
large one.

I think the reason we haven't seen this earlier is that you don't get
WAL records that large that are sent to the WAL redo process, except
when logical replication is enabled. Logical replication adds data to
the WAL records, making them larger.

To fix, allocate the buffer ourselves, and don't link it to the decode
queue. Alternatively, we could perhaps have just removed the pfree(),
but frankly I'm a bit scared about the whole queue thing.
2024-02-02 21:49:11 +02:00
John Spray
786e9cf75b control_plane: implement HTTP compute hook for attachment service (#6471)
## Problem

When we change which physical pageservers a tenant is attached to, we
must update the control plane so that it can update computes. This will
be done via an HTTP hook, as described in
https://www.notion.so/neondatabase/Sharding-Service-Control-Plane-interface-6de56dd310a043bfa5c2f5564fa98365#1fe185a35d6d41f0a54279ac1a41bc94

## Summary of changes

- Optional CLI args `--control-plane-jwt-token` and `--compute-hook-url`
are added. If these are set, then we will use this HTTP endpoint,
instead of trying to use the neon_local LocalEnv to update compute
configuration.
- Implement an HTTP-driven version of ComputeHook that calls into the
configured URL
- Notify for all tenants on startup, to ensure that we don't miss
notifications if we crash partway through a change, and carry a
`pending_compute_notification` flag at runtime to allow notifications to
fail without risking never sending the update.
- Add a test for all this

One might wonder: why not do a "forever" retry for compute hook
notifications, rather than carrying a flag on the shard to call
reconcile() again later. The reason is that we will later limit
concurrency of reconciles, when dealing with larger numbers of shards,
and if reconcile is stuck waiting for the control plane to accept a
notification request, it could jam up the whole system and prevent us
making other changes. Anyway: from the perspective of the outside world,
we _do_ retry forever, but we don't retry forever within a given
Reconciler lifetime.

The `pending_compute_notification` logic is predicated on later adding a
background task that just calls `Service::reconcile_all` on a schedule
to make sure that anything+everything that can fail a
Reconciler::reconcile call will eventually be retried.
2024-02-02 19:22:03 +00:00
Vadim Kharitonov
0b91edb943 Revert pgvector 0.6.0 (#6592)
It doesn't work in our VMs. Need more time to investigate
2024-02-02 18:36:31 +00:00
John Spray
2e5eab69c6 tests: remove test_gc_cutoff (#6587)
This test became flaky when postgres retry handling was fixed to use
backoff delays -- each iteration in this test's loop was taking much
longer because pgbench doesn't fail until postgres has given up on
retrying to the pageserver.

We are just removing it, because the condition it tests is no longer
risky: we reload all metadata from remote storage on restart, so
crashing directly between making local changes and doing remote uploads
isn't interesting any more.

Closes:  https://github.com/neondatabase/neon/issues/2856
Closes: https://github.com/neondatabase/neon/issues/5329
2024-02-02 18:20:18 +00:00
Joonas Koivunen
caf868e274 test: assert we eventually free space (#6536)
In `test_statvfs_pressure_{usage,min_avail_bytes}` we now race against
the initial logical size calculation on-demand downloading the layers. First
wait out the initial logical sizes, then change the final asserts to be
"eventual", which is not great but it is faster than failing and
retrying.

This issue seems to happen only in debug-mode tests.

Fixes: #6510
2024-02-02 19:46:47 +02:00
John Spray
7e2436695d storage controller: use AWS Secrets Manager for database URL, etc (#6585)
## Problem

Passing secrets in via CLI/environment is awkward when using helm for
deployment, and not ideal for security (secrets may show up in ps,
/proc).

We can bypass these issues by simply connecting directly to the AWS
Secrets Manager service at runtime.

## Summary of changes

- Add dependency on aws-sdk-secretsmanager
- Update other aws dependencies to latest, to match transitive
dependency versions
- Add `Secrets` type in attachment service, using AWS SDK to load if
secrets are not provided on the command line.
2024-02-02 16:57:11 +00:00
Conrad Ludgate
6506fd14c4 proxy: more refactors (#6526)
## Problem

not really any problem, just some drive-by changes

## Summary of changes

1. move wake compute
2. move json processing
3. move handle_try_wake
4. move test backend to api provider
5. reduce wake-compute concerns
6. remove duplicate wake-compute loop
2024-02-02 16:07:35 +00:00
John Spray
46fb1a90ce pageserver: avoid calculating/sending logical sizes on shard !=0 (#6567)
## Problem

Sharded tenants only maintain accurate relation sizes on shard 0.
Therefore logical size can only be calculated on shard 0. Fortunately it
is also only _needed_ on shard 0, to provide Safekeeper feedback and to
send consumption metrics.

Closes: #6307

## Summary of changes

- Send 0 for logical size to safekeepers on shards !=0
- Skip logical size warmup task on shards !=0
- Skip imitate_layer_accesses on shards !=0
2024-02-02 15:52:03 +00:00
John Spray
56171cbe8c pageserver: more permissive activation timeout when testing (#6564)
## Problem

The 5 second activation timeout is appropriate for production
environments, where we want to give a prompt response to the cloud
control plane, and if we fail it will retry the call. In tests however,
we don't want every call to e.g. timeline create to have to come with a
retry wrapper.

This issue has always been there, but it is more apparent in sharding
tests that concurrently attach several tenant shards.

Closes: https://github.com/neondatabase/neon/issues/6563

## Summary of changes

When `testing` feature is enabled, make `ACTIVE_TENANT_TIMEOUT` 30
seconds instead of 5 seconds.
2024-02-02 15:14:42 +01:00
Arpad Müller
48b05b7c50 Add a time_travel_remote_storage http endpoint (#6533)
Adds an endpoint to the pageserver to S3-recover an entire tenant to a
specific given timestamp.

Required input parameters:
* `travel_to`: the target timestamp to recover the S3 state to
* `done_if_after`: a timestamp that marks the beginning of the recovery
process. Retries of the request should keep this value constant. It *must*
be after `travel_to`, and also after any changes we want to revert, and
must represent a point in time before the endpoint is called, all
of these time points in terms of the time source used by S3. These
criteria need to hold even in the face of clock differences, so I
recommend waiting a specific amount of time, then taking
`done_if_after`, then waiting some amount of time again, and only then
issuing the request.

Also important to note: the timestamps in S3 work at second accuracy, so
one needs to add generous waits before and after for the process to work
smoothly (at least 2-3 seconds).

We ignore the added test for the mocked S3 for now due to a limitation
in moto: https://github.com/getmoto/moto/issues/7300 .

Part of https://github.com/neondatabase/cloud/issues/8233
2024-02-02 14:52:12 +01:00
Conrad Ludgate
0856fe6676 proxy: remove per client bytes (#5466)
## Problem

Follow up to #5461

In my memory usage/fragmentation measurements, these metrics came up as
a large source of small allocations. The replacement metric has been in
use for a long time now so I think it's good to finally remove this.
Per-endpoint data is still tracked elsewhere

## Summary of changes

remove the per-client bytes metrics
2024-02-02 12:28:48 +00:00
Alexander Bayandin
4133d14a77 Compute: pgbouncer 1.22.0 (#6582)
## Problem
Update pgbouncer from 1.21 (and patches[0][1]) to 1.22 (which includes
these patches)
- [0] https://github.com/pgbouncer/pgbouncer/pull/972
- [1] https://github.com/pgbouncer/pgbouncer/pull/998

## Summary of changes
- Build pgbouncer 1.22.0 for neonVMs from upstream
2024-02-02 11:49:11 +00:00
Alexander Bayandin
30c9e145d7 check-macos-build: switch job to macos-14 (M1) (#6539)
## Problem
- GitHub has made `macos-14` runners available, and they run on M1
processors[0]
- The price is the same as for Intel-based runners ("macOS | 3 or 4 (M1 or
Intel) | $0.08"[1]), but runners on Apple Silicon should be significantly
faster than their Intel counterparts.
- Most developers who use macOS use Apple Silicon-based Macs nowadays.

- [0] https://github.blog/changelog/2024-01-30-github-actions-introducing-the-new-m1-macos-runner-available-to-open-source/
- [1] https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions#per-minute-rates

## Summary of changes
- Run `check-macos-build` on `macos-14`
2024-02-02 10:51:20 +00:00
John Spray
24e916d37f pageserver: fix a syntax error in swagger (#6566)
A description was written as a follow-on to a section line, rather than
in the proper `description:` part. This caused swagger parsers to
rightly reject it.
2024-02-02 10:35:09 +00:00
Andreas Scherbaum
23f58145ed Update wording for better readability (#6559)
Update wording, add spaces in commandline arguments

Co-authored-by: Andreas Scherbaum <andreas@neon.tech>
2024-02-02 11:22:32 +01:00
Heikki Linnakangas
350865392c Print checkpoint key contents with "pagectl print-layer-file" (#6541)
This was very useful in debugging the bugs fixed in #6410 and #6502.

There's a lot more we could do. This only adds the printing to delta
layers, not image layers, for example, and it might be useful to print
details of more record types. But this is a good start.
2024-02-02 01:35:31 +02:00
Christian Schwarz
1be5e564ce feat(walredo): use posix_spawn by moving close_fds() work to walredo C code (#6574)
The rust stdlib uses the efficient `posix_spawn` by default.
However, before this PR, pageserver used `pre_exec()` in our
`close_fds()` ext trait.

This PR moves the work that `close_fds()` did to the walredo C code.
I verified manually using `gdb` that we're now forking out the walredo
process using `posix_spawn`.

refs https://github.com/neondatabase/neon/issues/6565
2024-02-01 22:38:34 +01:00
Christian Schwarz
7a70ef991f feat(walredo): various observability improvements (#6573)
- log when we start walredo process
- include tenant shard id in walredo argv
- dump some basic walredo state in tenant details api
- more suitable walredo process launch histogram buckets
- avoid duplicate tracing labels in walredo launch spans
2024-02-01 21:59:40 +01:00
Sasha Krassovsky
be30388901 Add retry to fetching basebackup (#6537)
## Problem
Currently we have no retry mechanism for fetching basebackup. If there's
an unstable connection, starting compute will just fail.

## Summary of changes
Adds an exponential backoff with 7 retries to get the basebackup.
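
A minimal sketch of the retry idea (illustrative only; the function name,
delays and error handling are assumptions, not compute_ctl's actual code):

```
use std::time::Duration;

// Hypothetical sketch: retry a fallible fetch with exponentially growing sleeps.
fn fetch_with_retries<T, E>(
    mut fetch: impl FnMut() -> Result<T, E>,
    max_retries: u32,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(500);
    let mut attempt = 0;
    loop {
        match fetch() {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= max_retries => return Err(e),
            Err(_) => {
                attempt += 1;
                std::thread::sleep(delay);
                delay *= 2; // exponential backoff between attempts
            }
        }
    }
}
```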
2024-02-01 20:50:04 +00:00
Heikki Linnakangas
3525080031 Fix pgvector 0.6.0 with Neon. (#6571)
The previous patch was broken: rd_smgr was not open yet; we need to use
RelationGetSmgr() to access it.
2024-02-01 20:48:31 +00:00
Arpad Müller
527cdbc010 Don't require AWS access keys for S3 pytests (#6556)
Don't require AWS access keys (AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY) for S3 usage in the pytests, and also allow
AWS_PROFILE to be passed.

One of the two methods is required however.

This allows local development like:

```
aws sso login --profile dev
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty REMOTE_STORAGE_S3_REGION=eu-central-1 REMOTE_STORAGE_S3_BUCKET=neon-github-ci-tests AWS_PROFILE=dev
cargo build_testing && RUST_BACKTRACE=1 ./scripts/pytest -k debug-pg16 test_runner/regress/test_tenant_delete.py::test_tenant_delete_smoke
```

related earlier PR for the cargo unit tests of the `remote_storage` crate: #6202

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-02-01 20:18:07 +00:00
Alexander Bayandin
39be2b0108 Makefile: set PQ_LIB_DIR to avoid linkage with system libpq (#6538)
## Problem

Initially spotted on macOS. When building `attachment_service`, it might
get linked with system `libpq`:
```
$ otool -L target/debug/attachment_service
target/debug/attachment_service:
	/opt/homebrew/opt/libpq/lib/libpq.5.dylib (compatibility version 5.0.0, current version 5.16.0)
	/System/Library/Frameworks/Security.framework/Versions/A/Security (compatibility version 1.0.0, current version 61040.61.1)
	/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 2202.0.0)
	/usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1336.61.1)
```

After this PR:
```
$ otool -L target/debug/attachment_service 
target/debug/attachment_service:
	/Users/bayandin/work/neon/pg_install/v16/lib/libpq.5.dylib (compatibility version 5.0.0, current version 5.16.0)
	/System/Library/Frameworks/Security.framework/Versions/A/Security (compatibility version 1.0.0, current version 61040.61.1)
	/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 2202.0.0)
	/usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1336.61.1)
```

## Summary of changes
- Set `PQ_LIB_DIR` to bundled Postgres 16 lib dir
2024-02-01 17:34:48 +00:00
Alexander Bayandin
fa52cd575e Remove old tests results and old coverage collection (#6376)
## Problem
We have switched to new test results and new coverage results, so no
need to collect these data in old formats.

## Summary of changes
- Remove "Upload coverage report" for old coverage report
- Remove "Store Allure test stat in the DB" for old test results format
2024-02-01 13:36:55 +00:00
Vlad Lazar
d2c410c748 pageserver_api: remove overlaps from KeySpace (#6544)
This commit adds a function to `KeySpace` which updates a key space
by removing all overlaps with a second key space. This can involve
splitting or removing existing ranges.

The implementation is not particularly efficient: O(M * N * log(N))
where N is the number of ranges in the current key space and M is the
number of ranges in the key space we are checking against. In practice,
this shouldn't matter much since, in the short term, the only caller of
this function will be the vectored read path and the number of key
spaces involved will be small. This follows from the upper bound placed
on the number of keys accepted by the vectored read path.

A couple other small utility functions are added. They'll be used by the
vectored search path as well.
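
For intuition, a simplified sketch of the overlap-removal idea on plain `u64`
ranges (the real `KeySpace` works on pageserver key types and is implemented
differently; this is illustrative only):

```
use std::ops::Range;

// Hypothetical sketch: remove from `ranges` everything covered by `other`,
// splitting ranges where necessary.
fn remove_overlaps(ranges: &[Range<u64>], other: &[Range<u64>]) -> Vec<Range<u64>> {
    let mut result = Vec::new();
    for r in ranges {
        // Start with the whole range and carve out every overlapping piece.
        let mut remaining = vec![r.clone()];
        for o in other {
            let mut next = Vec::new();
            for piece in remaining {
                if piece.end <= o.start || o.end <= piece.start {
                    next.push(piece); // no overlap, keep as-is
                    continue;
                }
                if piece.start < o.start {
                    next.push(piece.start..o.start); // left remainder
                }
                if o.end < piece.end {
                    next.push(o.end..piece.end); // right remainder
                }
            }
            remaining = next;
        }
        result.extend(remaining);
    }
    result
}
```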
2024-02-01 13:14:35 +00:00
Vlad Lazar
221531c9db pageserver: lift ancestor timeline logic from read path (#6543)
When the read path needs to follow a key into the ancestor timeline, it
needs to wait for said ancestor to become active and aware of its
branching lsn. The logic is lifted into a separate function with its
own new error type.

This is done because the vectored read path needs the same logic. It's
also the reason for the newly introduced error type.

When we'll switch the read path to proxy into `get_vectored`, we can
remove the duplicated variants from `PageReconstructError`.
2024-02-01 10:35:18 +00:00
Christian Schwarz
4c173456dc pagebench: fix percentiles reporting (#6547)
Before this patch, pagebench was always showing the same value.

refs https://github.com/neondatabase/neon/issues/6509
2024-01-31 23:29:48 +00:00
Christian Schwarz
e82625b77d refactor(pageserver main): signal handling (#6554)
This refactoring makes it easier to experimentally replace
BACKGROUND_RUNTIME with a single-threaded runtime. Found this useful
[during benchmarking](https://github.com/neondatabase/neon/pull/6555).
2024-01-31 23:25:57 +00:00
Christian Schwarz
0ac1e71524 update tokio-epoll-uring (#6558)
to pull in fixes for
https://github.com/neondatabase/tokio-epoll-uring/issues/37
2024-01-31 22:54:54 +00:00
Anna Khanova
271133d960 Proxy: reduce number of get role secret calls (#6557)
## Problem

Right now if get_role_secret response wasn't cached (e.g. cache already
reached max size) it will send the second (exactly the same request).

## Summary of changes

Avoid needless request.
2024-01-31 22:16:56 +00:00
Joonas Koivunen
3d5fab127a rewrite Gate impl for better observability (#6542)
Changes:
- two messages instead of a message every second when the gate was closing
- replace the gate name string by using a pointer
- slow GateGuards are likely to log who they were (see example)

example found in regress tests: <https://github.com/neondatabase/neon/pull/6542#issuecomment-1919009256>
2024-01-31 22:15:58 +00:00
Joonas Koivunen
66719d7eaf logging: fix span usage (#6549)
Fixes some duplication due to extra or misconfigured `#[instrument]`,
while filling in the `timeline_id` for delete-timeline flow calls.
2024-01-31 20:52:00 +00:00
Konstantin Knizhnik
9a9d9beaee Download SLRU segments on demand (#6151)
## Problem

See https://github.com/neondatabase/cloud/issues/8673

## Summary of changes


Download missing SLRU segments from the pageserver.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-31 21:39:18 +02:00
John Spray
2bfc831c60 control_plane/attachment_service: make --path optional (#6545)
## Problem

The `--path` argument is only used in testing, for compat tests that use
a JSON snapshot of state rather than the postgres database. In regular
deployments, it should be omitted (currently one has to specify `--path
""`)

## Summary of changes

Make `--path` optional.
2024-01-31 17:02:41 +00:00
Joonas Koivunen
799db161d3 tests: support for running on single pg version, use in one place (#6525)
Some tests that are essentially unit tests do not need to run on different pg
versions. The logging test is one of them, which I found for unrelated
reasons.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-01-31 17:37:25 +02:00
Arpad Müller
47380be12d Remove version param from get_lsn_by_timestamp (#6551)
This removes the last remnants of the version param added by #5608 ,
concluding the transition plan laid out in
https://github.com/neondatabase/cloud/pull/7553#discussion_r1370473911 .
It follows PR https://github.com/neondatabase/cloud/pull/9202, which we
now assume has been deployed to all environments.

Full history:

* https://github.com/neondatabase/neon/pull/5608 
* https://github.com/neondatabase/cloud/pull/7553
* https://github.com/neondatabase/neon/pull/6178
* https://github.com/neondatabase/cloud/pull/9202
2024-01-31 15:30:19 +01:00
Conrad Ludgate
c7b02ce8ec proxy: use jemalloc (#6531)
## Summary of changes

Experiment with jemalloc in proxy
2024-01-31 14:51:11 +01:00
John Spray
4010adf653 control_plane/attachment_service: complete APIs (#6394)
Depends on: https://github.com/neondatabase/neon/pull/6468

## Problem

The sharding service will be used as a "virtual pageserver" by the
control plane -- so it needs the set of pageserver APIs that the control
plane uses, and to present them under identical URLs, including prefix
(/v1).

## Summary of changes

- Add missing APIs:
  - Tenant deletion
  - Timeline deletion
  - Node list (used in test now, later in tools)
- `/location_config` API (for migrating tenants into the sharding
service)
- Rework attachment service URLs:
  - `/v1` prefix is used for pageserver-compatible APIs
- `/upcall/v1` prefix is used for APIs that are called by the pageserver
(re-attach and validate)
  - `/debug/v1` prefix is used for endpoints that are for testing
- `/control/v1` prefix is used for new sharding service APIs that do not
mimic a pageserver API, such as registering and configuring nodes.
- Add test_sharding_service. The sharding service already had some
collateral coverage from its use in general tests, but this is the first
dedicated testing for it.
2024-01-31 12:23:06 +00:00
Konstantin Knizhnik
e10a7ee391 Prevent to frequent reconnects in case of race condition errors returned by PS (tenant not found) (#6522)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1706531433057289

## Summary of changes

1. Do not decrease the reconnect timeout until the maximal interval value (1
second) is reached.
2. Compute the reconnect time after the connection attempt is made, to
exclude the connect time itself from the interval measurement.

So now a backend should not perform more than 4 reconnect attempts per
second. But please note that backoff is performed locally in each backend,
so if there are many active backends, the connection (and so error) rate
may be much higher.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-01-31 09:17:32 +02:00
Sasha Krassovsky
e8c9a51273 Allow creating subscriptions as neon_superuser (#6484)
## Problem
We currently can't create subscriptions in PG14 and PG15 because only
superusers can, and PG16 requires adding roles to
pg_create_subscription.

## Summary of changes
I added changes to PG14 and PG15 that allow neon_superuser to bypass the
superuser requirement. For PG16, I didn't do that but added a migration
that adds neon_superuser to pg_create_subscription. Also added a test to
make sure it works.
2024-01-30 22:32:33 -08:00
Alexander Bayandin
3c3ee8f3e8 Compute: add compatibility patch for pgvector (#6527)
## Problem

`pgvector` requires a patch to work well with Neon (a patch created by
@hlinnaka)

## Summary of changes
- Apply the patch to `pgvector`
2024-01-30 17:33:24 +00:00
Arpad Müller
6928a34f59 S3 DR: Large prefix improvements (#6515)
## Problem

PR #6500 removed the limit on the number of versions/deletions for
time travel calls. We never get informed about how many versions there
are, and thus the call would just hang without any indication of
progress.

## Summary of changes

We improve the pageserver's behaviour with large prefixes, i.e. those
with many keys, removed or currently still available.

* Add a hard limit of 100k versions/deletions. For the reasoning see
https://github.com/neondatabase/cloud/issues/8233#issuecomment-1915021625
, but TLDR it will roughly support tenants of 2 TiB size, of course
depending on general write activity and duration of the s3 retention
window. The goal is to have a limit at all so that the process doesn't
accumulate increasing numbers of versions until an eventual crash.
* Lower the RAM footprint for the `VerOrDelete` datastructure. This
means we now don't cache a lot of redundant metadata in RAM like the
owner ID. The top level datastructure's footprint goes down from 264
bytes to 80 (but it contains strings that are not counted in there).

Follow-up of #6500, part of https://github.com/neondatabase/cloud/issues/8233

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-01-30 15:57:27 +00:00
Arseny Sher
bc684e9d3b Make WAL segment init atomic.
Since fdatasync is used for flushing WAL, changing file size is unsafe. Make
segment creation atomic by using tmp file + rename to avoid using partially
initialized segments.

fixes https://github.com/neondatabase/neon/issues/6402
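
A minimal sketch of the tmp-file + rename pattern described above (file names
and the segment size are illustrative; this is not the safekeeper's actual
code):

```
use std::fs;
use std::io::Write;
use std::path::Path;

// Hypothetical sketch: initialize the segment under a temporary name, make it
// durable, then atomically rename it into place so readers never see a short segment.
fn init_segment_atomically(dir: &Path, name: &str, segment_size: usize) -> std::io::Result<()> {
    let tmp_path = dir.join(format!("{name}.tmp"));
    let final_path = dir.join(name);

    let mut file = fs::File::create(&tmp_path)?;
    file.write_all(&vec![0u8; segment_size])?; // pre-fill to the full segment size
    file.sync_all()?; // both size and contents are durable before the rename

    fs::rename(&tmp_path, &final_path)?; // atomic replacement
    fs::File::open(dir)?.sync_all()?; // persist the directory entry as well
    Ok(())
}
```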
2024-01-30 18:05:22 +04:00
Arseny Sher
08532231ee Fix find_end_of_wal busy loop.
It hung if the file size is less than that of a normal segment. Normally that
doesn't happen, but it might in case of a crash during segment init. We're
going to fix that half-initialized segment by durably renaming it after
cooking, so this fix won't be needed, but better to avoid the busy loop anyway.

fixes https://github.com/neondatabase/neon/issues/6401
2024-01-30 18:05:22 +04:00
Christian Schwarz
79137a089f fix(#6366): pageserver: incorrect log level for Tenant not found during basebackup (#6400)
Before this patch, when requesting basebackup for a not-found tenant or
timeline, we'd emit an ERROR-level log entry with a huge stack trace.
See #6366 "Details" section for an example

With this patch, we log at INFO level and only a single line.
Example:

```
2024-01-19T14:16:11.479800Z  INFO page_service_conn_main{peer_addr=127.0.0.1:43448}: query handler for 'basebackup d69a536d529a68fcf85bc070030cdf4b 035484e9c28d8d0138a492caadd03ffd 0/2204340 --gzip' entity not found: Tenant d69a536d529a68fcf85bc070030cdf4b not found
2024-01-19T14:19:35.807819Z  INFO page_service_conn_main{peer_addr=127.0.0.1:48862}: query handler for 'basebackup d69a536d529a68fcf85bc070030cdf4a 035484e9c28d8d0138a492caadd03ffd 0/2204340 --gzip' entity not found: Timeline d69a536d529a68fcf85bc070030cdf4a/035484e9c28d8d0138a492caadd03ffd was not found
```

fixes https://github.com/neondatabase/neon/issues/6366

Changes
-------

- Change `handle_basebackup_request` to return a `QueryError`
- The new `impl From<WaitLsnError> for QueryError` is needed so the `?`
at `wait_lsn()` call in `handle_basebackup_request` works again. It's
duplicating `impl From<WaitLsnError> for PageStreamError`.
- Remove the hard-to-spot conversion of the `handle_basebackup_request` return
value to anyhow::Result (the place where I replaced `anyhow::Ok` with
`Result::<(), QueryError>::Ok(())`)
- Add forgotten distinguished handling for "Tenant not found" case in
`impl From<GetActiveTenantError> for QueryError`

This was not at all pleasant, and I find it very hard to follow the
various error conversions.
It took me a while to spot the hard-to-spot `anyhow::Ok` thing above.
It would have been caught by the compiler if we weren't auto-converting
`anyhow::Error` into `QueryError::Other`.
We should move away from that, in my opinion, instead forcing each
`.context()` site to become `.context().map_err(QueryError::Other)`.
But that's for a future PR.
2024-01-30 13:10:48 +00:00
Joonas Koivunen
e3cb715e8a fix: capture initdb stderr, discard others (#6524)
When using spawn + wait_with_output instead of
std::process::Command::output or tokio::process::Command::output we must
configure the redirection ourselves.

Fixes: #6523 by discarding the stdout completely; we only care about
stderr, if any.
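
A hedged sketch of the redirection point above (the command and arguments are
placeholders, not the pageserver's actual invocation): with `spawn` +
`wait_with_output`, stdout and stderr must be configured explicitly.

```
use std::process::{Command, Stdio};

// Hypothetical sketch: discard stdout entirely, capture stderr for error reporting.
fn run_initdb_like_command() -> std::io::Result<()> {
    let child = Command::new("initdb")
        .arg("--version") // placeholder argument
        .stdout(Stdio::null()) // we do not care about stdout
        .stderr(Stdio::piped()) // but we want stderr if the command fails
        .spawn()?;
    let output = child.wait_with_output()?;
    if !output.status.success() {
        eprintln!("command failed: {}", String::from_utf8_lossy(&output.stderr));
    }
    Ok(())
}
```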
2024-01-30 14:07:58 +01:00
dependabot[bot]
c70bf9150f build(deps): bump aiohttp from 3.9.0 to 3.9.2 (#6518) 2024-01-30 10:46:49 +00:00
Alexander Bayandin
8e4da52069 Compute: pgvector 0.6.0 (#6517)
Update pgvector extension from 0.5.1 to 0.6.0
2024-01-30 09:29:45 +00:00
Arthur Petukhovsky
2ff1a5cecd Patch safekeeper control file on HTTP request (#6455)
Closes #6397
2024-01-29 18:20:57 +00:00
Conrad Ludgate
ec8dcc2231 flatten proxy flow (#6447)
## Problem

Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and
making somewhat less radical changes, in smaller commits.

Proxy flow was quite deeply nested, which makes adding more interesting
error handling quite tricky.

## Summary of changes

I recommend reviewing commit by commit.

1. move handshake logic into a separate file
2. move passthrough logic into a separate file
3. no longer accept a closure in CancelMap session logic
4. Remove connect_to_db, copy logic into handle_client
5. flatten auth_and_wake_compute in authenticate
6. record info for link auth
2024-01-29 17:38:03 +00:00
Arpad Müller
b844c6f0c7 Do pagination in list_object_versions call (#6500)
## Problem

The tenants we want to recover might have tens of thousands of keys, or
more. At that point, the AWS API returns a paginated response.

## Summary of changes

Support paginated responses for `list_object_versions` requests.

Follow-up of #6155, part of https://github.com/neondatabase/cloud/issues/8233
2024-01-29 17:59:26 +01:00
Alexander Bayandin
6a85a06e1b Compute: build rdkit without freetype support (#6495)
## Problem
The `rdkit` extension is built with `RDK_BUILD_FREETYPE_SUPPORT=ON` (by
default), which requires a bunch of additional dependencies, but support
for freetype fonts isn't required for Postgres.


With `RDK_BUILD_FREETYPE_SUPPORT=ON`:
```
ldd /usr/local/pgsql/lib/rdkit.so
	linux-vdso.so.1 (0x0000ffff82ea8000)
	libfreetype.so.6 => /usr/lib/aarch64-linux-gnu/libfreetype.so.6 (0x0000ffff825e5000)
	libboost_serialization.so.1.74.0 => /usr/lib/aarch64-linux-gnu/libboost_serialization.so.1.74.0 (0x0000ffff82590000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff8255f000)
	libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffff82387000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff822dc000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff822b8000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff82144000)
	libpng16.so.16 => /usr/lib/aarch64-linux-gnu/libpng16.so.16 (0x0000ffff820fd000)
	libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000ffff820d3000)
	libbrotlidec.so.1 => /usr/lib/aarch64-linux-gnu/libbrotlidec.so.1 (0x0000ffff820b8000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffff82e78000)
	libbrotlicommon.so.1 => /usr/lib/aarch64-linux-gnu/libbrotlicommon.so.1 (0x0000ffff82087000)
```

With `RDK_BUILD_FREETYPE_SUPPORT=OFF`:
```
ldd /usr/local/pgsql/lib/rdkit.so
	linux-vdso.so.1 (0x0000ffffbba75000)
	libboost_serialization.so.1.74.0 => /usr/lib/aarch64-linux-gnu/libboost_serialization.so.1.74.0 (0x0000ffffbb259000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffbb228000)
	libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffffbb050000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffffbafa5000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffbaf81000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffbae0d000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffffbba45000)
```

## Summary of changes
- Build `rdkit` with `RDK_BUILD_FREETYPE_SUPPORT=OFF`
- Remove extra dependencies from the Compute image
2024-01-29 16:16:37 +00:00
John Spray
b04a6acd6c docker: add attachment_service binary (#6506)
## Problem

Creating sharded tenants will require an instance of the sharding
service -- the initial goal is to deploy one of these in a staging
region (https://github.com/neondatabase/cloud/issues/9718). It will run
as a kubernetes container, similar to the storage broker, so needs to be
built into the container image.

## Summary of changes

Add `attachment_service` binary to container image
2024-01-29 13:31:56 +00:00
Vlad Lazar
0c7b89235c pageserver: add range layer map search implementation (#6469)
## Problem
There's no efficient way of querying the layer map for a range.

## Summary of changes
Introduce a range query for the layer map (`LayerMap::range_search`).
There are two broad steps to it:
1. Find all coverage changes for layers that intersect the queried range
(see `LayerCoverage::range_overlaps`).
The slightly tricky part is dealing with the start of the range. We can
either be aligned with a layer or not and we need
to treat these cases differently.
2. Iterate over the coverage changes and collect the result. For this we
use a two pointer approach: the trailing pointer tracks the start of the
current range (current location in the key space) and the forward
pointer tracks the next coverage change.

Plugging the range search into the read path is deferred to a future PR.

## Performance
I adapted the layer map benchmarks on a local branch. Range searches are
between 2x and 2.5x slower than point searches. That's in line with what I
expected, since we query the layer map twice.

Since `Timeline::get` will proxy to `Timeline::get_vectored` we can
special case the one element layer map range search
at that point.
2024-01-29 09:47:12 +00:00
Joonas Koivunen
1e9a50bca8 disk_usage_eviction_task: cleanup summaries (#6490)
This is the "partial revert" of #6384. The summaries turned out to be
expensive due to naive vec usage, but also inconclusive because of the
additional context required. In addition to removing summary traces,
small refactoring is done.
2024-01-29 10:38:40 +02:00
Conrad Ludgate
511e730cc0 hll experiment (#6312)
## Problem

Measuring cardinality using logs is expensive and slow.

## Summary of changes

Implement a pre-aggregated HyperLogLog-based cardinality estimate.
HyperLogLog estimates the cardinality of a set by using the fact that the
probability of the uniform hash of a value having a run of n 0s at the end
is `1/2^n`; therefore, having observed a run of `n` 0s suggests we have
measured `2^n` distinct values. By using multiple shards, we can use the
harmonic mean to get a more accurate estimate.

We record this into a Prometheus time-series. HyperLogLog counts can be
merged by taking the `max` of each shard. We can apply a `max_over_time`
in order to find the estimate of the cardinality of distinct values over
time.
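
For illustration, a toy HyperLogLog sketch along the lines described above
(proxy's real implementation differs; the constants and names here are made
up):

```
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SHARDS: usize = 64;

// Hypothetical sketch: each shard remembers the longest run of trailing zeros observed.
fn observe(registers: &mut [u8; SHARDS], value: &str) {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    let h = hasher.finish();
    let shard = (h % SHARDS as u64) as usize; // shard chosen from the low bits
    let zeros = (h >> 6).trailing_zeros() as u8 + 1; // run length in the remaining bits
    registers[shard] = registers[shard].max(zeros);
}

// Harmonic mean of 2^register across shards (bias corrections omitted for brevity).
fn estimate(registers: &[u8; SHARDS]) -> f64 {
    let m = SHARDS as f64;
    let denom: f64 = registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
    0.709 * m * m / denom
}
```

Merging counters from several processes then amounts to taking the per-shard
maximum before calling `estimate`, which is what the `max_over_time` query
does on the Prometheus side.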
2024-01-29 07:26:20 +00:00
Konstantin Knizhnik
c1148dc9ac Fix calculation of maximal multixact in ingest_multixact_create_record (#6502)
## Problem

See https://neondb.slack.com/archives/C06F5UJH601/p1706373716661439

## Summary of changes

Use None instead of 0 as initial accumulator value for calculating
maximal multixact XID.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-29 07:39:16 +02:00
Anna Khanova
8253cf1931 proxy: Relax endpoint check (#6503)
## Problem

http-over-sql allows the host to be in the format api.aws.... However, that's
not the case for the websocket flow.

## Summary of changes

Relax the endpoint check for serverless websocket connections.
2024-01-28 21:27:14 +00:00
Christian Schwarz
3a82430432 fixup(#6492): also switch the benchmarks that runs on merge-to-main back to std-fs (#6501) 2024-01-28 00:15:11 +01:00
Arpad Müller
734755eaca Enable nextest retries for the arm build (#6496)
Also make the NEXTEST_RETRIES declaration more local.

Requested in https://github.com/neondatabase/neon/pull/6493#issuecomment-1912110202
2024-01-27 05:16:11 +01:00
Christian Schwarz
e34166a28f CI: switch back to std-fs io engine for soak time before next release (#6492)
PR #5824 introduced the concept of io engines in pageserver and
implemented `tokio-epoll-uring` in addition to our current method,
`std-fs`.

We used `tokio-epoll-uring` in CI for a day to get more exposure to
the code.  Now it's time to switch CI back so that we test with `std-fs`
as well, because that's what we're (still) using in production.
2024-01-26 22:48:34 +01:00
Christian Schwarz
3a36a0a227 fix(test suite): some tests leak child processes (#6497) 2024-01-26 18:23:53 +00:00
John Spray
58f6cb649e control_plane: database persistence for attachment_service (#6468)
## Problem

Spun off from https://github.com/neondatabase/neon/pull/6394 -- this PR
is just the persistence parts and the changes that enable it to work
nicely


## Summary of changes

- Revert #6444 and #6450
- In neon_local, start a vanilla postgres instance for the attachment
service to use.
- Adopt `diesel` crate for database access in attachment service. This
uses raw SQL migrations as the source of truth for the schema, so it's a
soft dependency: we can switch libraries pretty easily.
- Rewrite persistence.rs to use postgres (via diesel) instead of JSON.
- Preserve JSON read+write at startup and shutdown: this enables using
the JSON format in compatibility tests, so that we don't have to commit
to our DB schema yet.
- In neon_local, run database creation + migrations before starting
attachment service
- Run the initial reconciliation in Service::spawn in the background, so
that the pageserver + attachment service don't get stuck waiting for
each other to start, when restarting both together in a test.
2024-01-26 17:20:44 +00:00
Arpad Müller
dcc7610ad6 Do backoff::retry in s3 timetravel test (#6493)
The top level retries weren't enough, probably because we do so many
network requests. Fine grained retries ensure that there is higher
potential for the entire test to succeed.

To demonstrate this, consider the following example: let's assume that
each request has 5% chance of failing and we do 10 requests. Then
chances of success without any retries is 0.95^10 = 0.6. With 3 top
level retries it is 1-0.4^3 = 0.936. With 3 fine grained retries it is
(1-0.05^3)^10 = 0.9988 (roundings implicit). So chances of failure are
6.4% for the top level retry vs 0.12% for the fine grained retry.
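
A quick check of the arithmetic above (illustrative only):

```
// Hypothetical check of the retry-probability numbers quoted above.
fn main() {
    let p_no_retries: f64 = 0.95f64.powi(10);              // ~0.599, rounded to 0.6 above
    let p_top_level = 1.0 - (1.0 - 0.6f64).powi(3);        // = 0.936
    let p_fine_grained = (1.0 - 0.05f64.powi(3)).powi(10); // ~0.9988
    println!("{p_no_retries:.3} {p_top_level:.3} {p_fine_grained:.4}");
}
```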

Follow-up of #6155
2024-01-26 16:43:56 +00:00
Alexander Bayandin
4c245b0f5a update_build_tools_image.yml: Push build-tools image to Docker Hub (#6481)
## Problem

- `docker.io/neondatabase/build-tools:pinned` image is frequently
outdated on Docker Hub because there's no automated way to update it.
- `update_build_tools_image.yml` workflow contains legacy roll-back
logic, which is not required anymore because it updates only a single
image.

## Summary of changes
- Make `update_build_tools_image.yml` workflow push images to both ECR
and Docker Hub
- Remove unneeded roll-back logic
2024-01-26 16:12:49 +00:00
John Spray
55b7cde665 tests: add basic coverage for sharding (#6380)
## Problem

The support for sharding in the pageserver was written before
https://github.com/neondatabase/neon/pull/6205 landed, so when it landed
we couldn't directly test sharding.

## Summary of changes

- Add `test_sharding_smoke` which tests the basics of creating a
sharding tenant, creating a timeline within it, checking that data
within it is distributed.
- Add modes to pg_regress tests for running with 4 shards as well as
with 1.
2024-01-26 14:40:47 +00:00
Vlad Lazar
5b34d5f561 pageserver: add vectored get latency histogram (#6461)
This patch introduces a new set of grafana metrics for a histogram:
pageserver_get_vectored_seconds_bucket{task_kind="Compaction|PageRequestHandler"}.

While it has a `task_kind` label, only compaction and SLRU fetches are
tracked. This reduces the increase in cardinality to 24.

The metric should allow us to isolate performance regressions while the
vectorized get is being implemented. Once the implementation is
complete, it'll also allow us to quantify the improvements.
2024-01-26 13:40:03 +00:00
Alexander Bayandin
26c55b0255 Compute: fix rdkit extension build (#6488)
## Problem

`rdkit` extension build started to fail because of the changed checksum
of the Comic Neue font:

```
Downloading https://fonts.google.com/download?family=Comic%20Neue...
CMake Error at Code/cmake/Modules/RDKitUtils.cmake:257 (MESSAGE):
  The md5 checksum for /rdkit-src/Code/GraphMol/MolDraw2D/Comic_Neue.zip is
  incorrect; expected: 850b0df852f1cda4970887b540f8f333, found:
  b7fd0df73ad4637504432d72a0accb8f
```

https://github.com/neondatabase/neon/actions/runs/7666530536/job/20895534826

Ref https://neondb.slack.com/archives/C059ZC138NR/p1706265392422469

## Summary of changes
- Disable comic fonts for `rdkit` extension
2024-01-26 12:39:20 +00:00
Vadim Kharitonov
12e9b2a909 Update plv8 (#6465) 2024-01-26 09:56:11 +00:00
Christian Schwarz
918b03b3b0 integrate tokio-epoll-uring as alternative VirtualFile IO engine (#5824) 2024-01-26 09:25:07 +01:00
Alexander Bayandin
d36623ad74 CI: cancel old e2e-tests on new commits (#6463)
## Problem

The triggered `e2e-tests` job is not cancelled along with other jobs in a PR
if the PR gets new commits. We can improve the situation by setting
`concurrency_group` for the remote workflow
(https://github.com/neondatabase/cloud/pull/9622 adds
`concurrency_group` group input to the remote workflow).

Ref https://neondb.slack.com/archives/C059ZC138NR/p1706087124297569

Cloud's part added in https://github.com/neondatabase/cloud/pull/9622

## Summary of changes
- Set `concurrency_group` parameter when triggering `e2e-tests`
- At the beginning of a CI pipeline, trigger Cloud's
`cancel-previous-in-concurrency-group.yml` workflow which cancels
previously triggered e2e-tests
2024-01-25 19:25:29 +00:00
Christian Schwarz
689ad72e92 fix(neon_local): leaks child process if it fails to start & pass checks (#6474)
refs https://github.com/neondatabase/neon/issues/6473

Before this PR, if process_started() didn't return Ok(true) until we
ran out of retries, we'd return an error but leave the process running.

Try it by adding a 20s sleep to the pageserver `main()`, e.g., right
before we claim the pidfile.

Without this PR, output looks like so:

```
(.venv) cs@devvm-mbp:[~/src/neon-work-2]: ./target/debug/neon_local start
Starting neon broker at 127.0.0.1:50051.
storage_broker started, pid: 2710939
.
attachment_service started, pid: 2710949
Starting pageserver node 1 at '127.0.0.1:64000' in ".neon/pageserver_1".....
pageserver has not started yet, continuing to wait.....
pageserver 1 start failed: pageserver did not start in 10 seconds
No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/pageserver_1/pageserver.pid"
No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/safekeepers/sk1/safekeeper.pid"
Stopping storage_broker with pid 2710939 immediately.......
storage_broker has not stopped yet, continuing to wait.....
neon broker stop failed: storage_broker with pid 2710939 did not stop in 10 seconds
Stopping attachment_service with pid 2710949 immediately.......
attachment_service has not stopped yet, continuing to wait.....
attachment service stop failed: attachment_service with pid 2710949 did not stop in 10 seconds
```

and we leak the pageserver process

```
(.venv) cs@devvm-mbp:[~/src/neon-work-2]: ps aux | grep pageserver
cs       2710959  0.0  0.2 2377960 47616 pts/4   Sl   14:36   0:00 /home/cs/src/neon-work-2/target/debug/pageserver -D .neon/pageserver_1 -c id=1 -c pg_distrib_dir='/home/cs/src/neon-work-2/pg_install' -c http_auth_type='Trust' -c pg_auth_type='Trust' -c listen_http_addr='127.0.0.1:9898' -c listen_pg_addr='127.0.0.1:64000' -c broker_endpoint='http://127.0.0.1:50051/' -c control_plane_api='http://127.0.0.1:1234/' -c remote_storage={local_path='../local_fs_remote_storage/pageserver'}
```

After this PR, there is no leaked process.
2024-01-25 19:20:02 +01:00
Christian Schwarz
fd4cce9417 test_pageserver_max_throughput_getpage_at_latest_lsn: remove n_tenants=100 combination (#6477)
Need to fix the neon_local timeouts first
(https://github.com/neondatabase/neon/issues/6473)
and also not run them on every merge, but only nightly:
https://github.com/neondatabase/neon/issues/6476
2024-01-25 18:17:53 +00:00
Arpad Müller
d52b81340f S3 based recovery (#6155)
Adds a new `time_travel_recover` function to the `RemoteStorage` trait
that allows time travel like functionality for S3 buckets, regardless of
their content (it is not even pageserver related). It takes a different
approach from [this
post](https://aws.amazon.com/blogs/storage/point-in-time-restore-for-amazon-s3-buckets/)
that is more complicated.

It takes as input a prefix, a target timestamp, and a limit timestamp:

* executes [`ListObjectVersions`](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectVersions.html)
* obtains the latest version that comes before the target timestamp
* copies that latest version to the same prefix
* if there are versions newer than the limit timestamp, it doesn't do
anything for the file

The limit timestamp is meant to be some timestamp before the start of
the recovery operation and after any changes that one wants to revert.
For example, it might be the time point after a tenant was detached from
all involved pageservers. The limiting mechanism ensures that the
operation is idempotent and can be retried without causing additional
writes/copies.
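
A minimal, in-memory sketch of the per-key decision described above (the `Version` struct and the integer timestamps are illustrative stand-ins; the real code walks `ListObjectVersions` output and issues S3 copies):

```rust
#[derive(Debug, Clone)]
struct Version {
    last_modified: u64, // epoch seconds; stand-in for a real timestamp type
    id: String,
}

/// Decide what to restore for one key, or `None` if the key should be skipped.
fn pick_restore_version(
    versions: &[Version], // all versions of one key
    target_ts: u64,       // restore the state as of this time
    limit_ts: u64,        // set before the recovery started, after the bad writes
) -> Option<Version> {
    // A version newer than the limit means the key was already touched after
    // the recovery started (e.g. by an earlier attempt): skip it, which makes
    // the whole operation idempotent.
    if versions.iter().any(|v| v.last_modified > limit_ts) {
        return None;
    }
    // Otherwise restore the newest version at or before the target timestamp.
    versions
        .iter()
        .filter(|v| v.last_modified <= target_ts)
        .max_by_key(|v| v.last_modified)
        .cloned()
}

fn main() {
    let versions = vec![
        Version { last_modified: 100, id: "good".into() },
        Version { last_modified: 200, id: "bad write to undo".into() },
    ];
    // Recover to t=150 with a limit of t=300: the t=100 version gets copied back.
    println!("{:?}", pick_restore_version(&versions, 150, 300));
}
```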

The approach fulfills all the requirements laid out in 8233, and is a
recoverable operation. Nothing is deleted permanently, only new entries
added to the version log.

I also enable [nextest retries](https://nexte.st/book/retries.html) to
help with some general S3 flakiness (on top of low level retries).

Part of https://github.com/neondatabase/cloud/issues/8233
2024-01-25 18:23:18 +01:00
Joonas Koivunen
8dee9908f8 fix(compaction_task): wrong log levels (#6442)
Filter what we log on compaction task. Per discussion in last triage
call, fixing these by introducing and inspecting the root cause within
anyhow::Error instead of rolling out proper conversions.

Fixes: #6365
Fixes: #6367
2024-01-25 18:45:17 +02:00
Konstantin Knizhnik
19ed230708 Add support for PS sharding in compute (#6205)
refer #5508

replaces #5837

## Problem

This PR implements sharding support on the compute side. Relations are
split into stripes, and `get_page` requests are redirected to the
particular shard where the stripe is located. All other requests (i.e. get
relation or database size) are always sent to shard 0.

## Summary of changes

Support for sharding on the compute side includes three things:
1. Make it possible to specify, and change at runtime, connections to more
than one pageserver
2. Send each `get_page` request to the particular shard (determined by a hash
of the page key), as sketched below
3. Support multiple servers in prefetch ring requests
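
A hedged Rust sketch of that routing (the real logic lives in the C extension; the hash choice, stripe math, and names here are illustrative assumptions):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct PageKey {
    rel_node: u32,     // which relation
    block_number: u32, // which page within it
}

/// Pages are grouped into stripes of `stripe_size` consecutive blocks, and
/// stripes are spread across shards; non-page requests always go to shard 0.
fn shard_for_page(key: &PageKey, stripe_size: u32, shard_count: u32) -> u32 {
    if shard_count <= 1 {
        return 0;
    }
    let stripe = (key.rel_node, key.block_number / stripe_size);
    let mut hasher = DefaultHasher::new();
    stripe.hash(&mut hasher);
    (hasher.finish() % shard_count as u64) as u32
}

fn main() {
    let key = PageKey { rel_node: 16384, block_number: 4242 };
    println!("get_page goes to shard {}", shard_for_page(&key, 32768, 4));
}
```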

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-25 15:53:31 +02:00
Joonas Koivunen
463b6a26b5 test: show relative order eviction with "fast growing tenant" (#6377)
Refactor out test_disk_usage_eviction tenant creation and add a custom
case with 4 tenants, 3 made with pgbench scale=1 and 1 made with pgbench
scale=4.

Because the tenants are created in order of scales [1, 1, 1, 4] this is
simple enough to demonstrate the problem with using absolute access
times, because on a disk usage based eviction run we will
disproportionally target the *first* scale=1 tenant(s), and the later
larger tenant does not lose anything.

This test is not enough to show the difference between `relative_equal`
and `relative_spare` (the fudge factor); much larger scale will be
needed for "the large tenant", but that will make debug mode tests
slower.

Cc: #5304
2024-01-25 15:38:28 +02:00
John Spray
c9b1657e4c pageserver: fixes for creation operations overlapping with shutdown/startup (#6436)
## Problem

For #6423, creating a reproducer turned out to be very easy, as an
extension to test_ondemand_activation.

However, before I had diagnosed the issue, I was starting with a more
brute force approach of running creation API calls in the background
while restarting a pageserver, and that shows up a bunch of other
interesting issues.

In this PR:
- Add the reproducer for #6423 by extending `test_ondemand_activation`
(confirmed that this test fails if I revert the fix from
https://github.com/neondatabase/neon/pull/6430)
- In timeline creation, return 503 responses when we get an error and
the tenant's cancellation token is set: this covers the cases where we
get an anyhow::Error from something during timeline creation as a result
of shutdown.
- While waiting for tenants to become active during creation, don't
.map_err() the result to a 500: instead let the `From` impl map the
result to something appropriate (this includes mapping shutdown to 503)
- During tenant creation, we were calling `Tenant::load_local` because
no Preload object is provided. This is usually harmless because the
tenant dir is empty, but if there are some half-created timelines in
there, bad things can happen. Propagate the SpawnMode into
Tenant::attach, so that it can properly skip _any_ attempt to load
timelines if creating.
- When we call upsert_location, there's a SpawnMode that tells us
whether to load from remote storage or not. But if the operation is a
retry and we already have the tenant, it is not correct to skip loading
from remote storage: there might be a timeline there. This isn't
strictly a correctness issue as long as the caller behaves correctly
(does not assume that any timelines are persistent until the creation is
acked), but it's a more defensive position.
- If we shut down while the task in Tenant::attach is running, it can
end up spawning rogue tasks. Fix this by holding a GateGuard through
here, and in upsert_location shutting down a tenant after calling
tenant_spawn if we can't insert it into tenants_map. This fixes the
expected behavior that after shutdown_all_tenants returns, no tenant
tasks are running.
- Add `test_create_churn_during_restart`, which runs tenant & timeline
creations across pageserver restarts.
- Update a couple of tests that covered cancellation, to reflect the
cleaner errors we now return.
2024-01-25 12:35:52 +00:00
Arpad Müller
b92be77e19 Make RemoteStorage not use async_trait (#6464)
Makes the `RemoteStorage` trait not be based on `async_trait` any more.

To avoid recursion in async (not supported by Rust), we made
`GenericRemoteStorage` generic on the "Unreliable" variant. That allows
us to have the unreliable wrapper never contain/call itself.

related earlier work: #6305
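
A hedged, compilable sketch of the "generic over the unreliable variant" idea (type names are simplified stand-ins for the remote_storage types):

```rust
// Concrete backends (stand-ins).
struct S3Storage;
struct LocalFs;

// An uninhabited marker: "no unreliable variant possible at this level".
enum NoWrapper {}

// The failure-injecting wrapper holds a storage that *cannot* itself be
// wrapped again, so the type never recurses and async methods on it never
// call themselves.
struct UnreliableWrapper(GenericRemoteStorage<NoWrapper>);

enum GenericRemoteStorage<U = UnreliableWrapper> {
    S3(S3Storage),
    LocalFs(LocalFs),
    Unreliable(U),
}

fn main() {
    // The outer storage may be unreliable, but the wrapped one can only be a
    // concrete backend: its `Unreliable` arm is uninhabited.
    let storage: GenericRemoteStorage = GenericRemoteStorage::Unreliable(
        UnreliableWrapper(GenericRemoteStorage::LocalFs(LocalFs)),
    );
    let _ = storage;
}
```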
2024-01-24 21:27:54 +01:00
Arthur Petukhovsky
8cb8c8d7b5 Allow remove_wal.rs to run on inactive timelines (#6462)
Temporarily enable it on staging to help with
https://github.com/neondatabase/neon/issues/6403
It can also be deployed to prod if it works well on staging.
2024-01-24 16:48:56 +00:00
Conrad Ludgate
210700d0d9 proxy: add newtype wrappers for string based IDs (#6445)
## Problem

Too many string-based IDs; it's easy to mix up ID types.

## Summary of changes

Add a bunch of `SmolStr` wrappers that provide convenience methods but
are type safe
2024-01-24 16:38:10 +00:00
Joonas Koivunen
a0a3ba85e7 fix(page_service): walredo logging problem (#6460)
Fixes: #6459 by formatting the full cause chain of an error to the log, while keeping
the top-level string for the end user.

Changes user visible error detail from:

```
-DETAIL:  page server returned error: Read error: Failed to reconstruct a page image:
+DETAIL:  page server returned error: Read error
```

However on pageserver logs:

```
-ERROR page_service_conn_main{...}: error reading relation or page version: Read error: Failed to reconstruct a page image:
+ERROR page_service_conn_main{...}: error reading relation or page version: Read error: reconstruct a page image: launch walredo process: spawn process: Permission denied (os error 13)
```
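
The user-facing vs. log-facing split can be illustrated with anyhow's normal vs. alternate Display formatting (a hedged sketch of the idea, not the page_service code; assumes the `anyhow` crate):

```rust
use anyhow::{anyhow, Context};

fn main() {
    let err = Err::<(), _>(anyhow!("Permission denied (os error 13)"))
        .context("spawn process")
        .context("launch walredo process")
        .context("reconstruct a page image")
        .context("Read error")
        .unwrap_err();

    // What the client sees in DETAIL: only the top-level message.
    println!("user: {err}");
    // What goes into the pageserver log: the full cause chain, colon-separated.
    println!("log:  {err:#}");
}
```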
2024-01-24 15:47:17 +00:00
Arpad Müller
d820aa1d08 Disable initdb cancellation (#6451)
## Problem

The initdb cancellation added in #5921 is not sufficient to reliably
abort the entire initdb process. Initdb also spawns children. The tests
added by #6310 (#6385) and #6436 now do initdb cancellations on a more
regular basis.

In #6385, I attempted to issue `killpg` (after giving it a new process
group ID) to kill not just the initdb but all its spawned subprocesses,
but this didn't work. Initdb doesn't take *that* long in the end either,
so we just wait until it concludes.

## Summary of changes

* revert initdb cancellation support added in #5921
* still return `Err(Cancelled)` upon cancellation, but this is just to
not have to remove the cancellation infrastructure
* fixes to the `test_tenant_delete_races_timeline_creation` test to make
it reliably pass

Fixes #6385
2024-01-24 13:06:05 +01:00
Christian Schwarz
996abc9563 pagebench-based GetPage@LSN performance test (#6214) 2024-01-24 12:51:53 +01:00
John Spray
a72af29d12 control_plane/attachment_service: implement PlacementPolicy::Detached (#6458)
## Problem

The API for detaching things wasn't implemented yet, but one could hit
this case indirectly from tests when using attach-hook, and find tenants
unexpectedly attached again because their policy remained Single.

## Summary of changes

Add PlacementPolicy::Detached, and:
- add the behavior for it in schedule()
- in tenant_migrate, refuse if the policy is detached
- automatically set this policy in attach-hook if the caller has
specified pageserver=null.
2024-01-24 12:49:30 +01:00
Sasha Krassovsky
4f51824820 Fix creating publications for all tables 2024-01-23 22:41:00 -08:00
Christian Schwarz
743f6dfb9b fix(attachment_service): corrupted attachments.json when parallel requests (#6450)
The pagebench integration PR (#6214) issues attachment requests in
parallel.
We observed corrupted attachments.json from time to time, especially in
the test cases with high tenant counts.

The atomic overwrite added in #6444 exposed the root cause cleanly:
the `.commit()` calls of two request handlers could interleave or
be reordered.
See also:
https://github.com/neondatabase/neon/pull/6444#issuecomment-1906392259

This PR makes changes to the `persistence` module to fix the above race:
- mpsc queue for PendingWrites
- one writer task performs the writes in mpsc queue order
- request handlers that need to do writes do it using the
  new `mutating_transaction` function.

`mutating_transaction`, while holding the lock, does the modifications,
serializes the post-modification state, and pushes that as a
`PendingWrite` into the mpsc queue.
It then releases the lock and `await`s the completion of the write.
The writer task executes the `PendingWrite`s in queue order.
Once a write has been executed, it wakes the writing tokio task.
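
A hedged sketch of that write path using plain tokio primitives (the names follow the PR text, but the bodies are illustrative, not the actual persistence module; assumes the `tokio` crate):

```rust
use tokio::sync::{mpsc, oneshot, Mutex};

struct PendingWrite {
    serialized: String,        // post-modification state, already serialized
    done: oneshot::Sender<()>, // signalled once the write has been executed
}

struct Persistence {
    state: Mutex<String>, // in-memory state (stand-in)
    writes: mpsc::UnboundedSender<PendingWrite>,
}

impl Persistence {
    async fn mutating_transaction(&self, mutate: impl FnOnce(&mut String)) {
        let (done_tx, done_rx) = oneshot::channel();
        {
            // Hold the lock only while mutating, serializing, and enqueueing,
            // so writes enter the queue in a well-defined order.
            let mut state = self.state.lock().await;
            mutate(&mut state);
            let write = PendingWrite { serialized: state.clone(), done: done_tx };
            if self.writes.send(write).is_err() {
                panic!("writer task gone");
            }
        }
        // Await completion outside the lock.
        let _ = done_rx.await;
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<PendingWrite>();
    // The single writer task executes writes strictly in queue order.
    tokio::spawn(async move {
        while let Some(write) = rx.recv().await {
            // stand-in for the atomic overwrite of attachments.json
            println!("persisting {} bytes", write.serialized.len());
            let _ = write.done.send(());
        }
    });
    let p = Persistence { state: Mutex::new(String::new()), writes: tx };
    p.mutating_transaction(|s| s.push_str("tenant attached")).await;
}
```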
2024-01-23 19:14:32 +00:00
Arpad Müller
faf275d4a2 Remove initdb on timeline delete (#6387)
This PR:

* makes `initdb.tar.zst` be deleted by default on timeline deletion
(#6226), mirroring the safekeeper:
https://github.com/neondatabase/neon/pull/6381
* adds a new `preserve_initdb_archive` endpoint for a timeline, to be
used during the disaster recovery process, see reasoning
[here](https://github.com/neondatabase/neon/issues/6226#issuecomment-1894574778)
* makes the creation code look for `initdb-preserved.tar.zst` in
addition to `initdb.tar.zst`.
* makes the tests use the new endpoint

fixes #6226
2024-01-23 18:22:59 +00:00
Vlad Lazar
001f0d6db7 pageserver: fix import failure caused by merge race (#6448)
PR #6406 raced with #6372 and broke main.
2024-01-23 18:07:01 +01:00
Christian Schwarz
42c17a6fc6 attachment_service: use atomic overwrite to persist attachments.json (#6444)
The pagebench integration PR (#6214) is the first to SIGQUIT & then
restart attachment_service.

With many tenants (100), we have found frequent failures on restart in
the CI[^1].

[^1]:
[Allure](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6214/7615750160/index.html#suites/e26265675583c610f99af77084ae58f1/851ff709578c4452/)

```
2024-01-22T19:07:57.932021Z  INFO request{method=POST path=/attach-hook request_id=2697503c-7b3e-4529-b8c1-d12ef912d3eb}: Request handled, status: 200 OK
2024-01-22T19:07:58.898213Z  INFO Got SIGQUIT. Terminating
2024-01-22T19:08:02.176588Z  INFO version: git-env:d56f31639356ed8e8ce832097f132f27ee19ac8a, launch_timestamp: 2024-01-22 19:08:02.174634554 UTC, build_tag build_tag-env:7615750160, state at /tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json, listening on 127.0.0.1:15048
thread 'main' panicked at /__w/neon/neon/control_plane/attachment_service/src/persistence.rs:95:17:
Failed to load state from '/tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json': trailing characters at line 1 column 8957 (maybe your .neon/ dir was written by an older version?)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: attachment_service::persistence::PersistentState::load_or_new::{{closure}}
             at ./control_plane/attachment_service/src/persistence.rs:95:17
   3: attachment_service::persistence::Persistence::new::{{closure}}
             at ./control_plane/attachment_service/src/persistence.rs:103:56
   4: attachment_service::main::{{closure}}
             at ./control_plane/attachment_service/src/main.rs:69:61
   5: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
   6: tokio::runtime::coop::with_budget
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
   7: tokio::runtime::coop::budget
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
   8: tokio::runtime::park::CachedParkThread::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
   9: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
  10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  11: tokio::runtime::context::runtime::enter_runtime
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
  12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  13: tokio::runtime::runtime::Runtime::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:50
  14: attachment_service::main
             at ./control_plane/attachment_service/src/main.rs:99:5
  15: core::ops::function::FnOnce::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```

The attachment_service handles SIGQUIT by just exiting the process.
In theory, the SIGQUIT could come in while we're writing out the
`attachments.json`.

Now, in the above log output, there's a 1-second gap between the last
request completing
and the SIGQUIT coming in. So, there must be some other issue.

But, let's have this change anyways, maybe it helps uncover the real
cause for the test failure.
2024-01-23 17:21:06 +01:00
Vlad Lazar
37638fce79 pageserver: introduce vectored Timeline::get interface (#6372)
1. Introduce a naive  `Timeline::get_vectored` implementation

The return type is intended to be flexible enough for various types of
callers. We return the pages in a map keyed by `Key` such that the
caller doesn't have to map back to the key if it needs to know it. Some
callers can ignore errors
for specific pages, so we return a separate `Result<Bytes,
PageReconstructError>` for each page and an overarching
`GetVectoredError` for API misuse. The overhead of the mapping will be
small and bounded since we enforce a maximum key count for the
operation.

2. Use the `get_vectored` API for SLRU segment reconstruction and image
layer creation.
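
A hedged sketch of the shape of that return type (`Key`, `Bytes`, and the error enums here are simplified stand-ins for the pageserver types):

```rust
use std::collections::BTreeMap;

type Key = u64;       // stand-in for the pageserver Key type
type Bytes = Vec<u8>; // stand-in for bytes::Bytes
type Lsn = u64;

#[derive(Debug)]
enum PageReconstructError { Other(String) }

#[derive(Debug)]
enum GetVectoredError { TooManyKeys(usize) }

const MAX_GET_VECTORED_KEYS: usize = 32; // illustrative bound

/// Per-key results, so callers can ignore errors on individual pages.
type VectoredResult = BTreeMap<Key, Result<Bytes, PageReconstructError>>;

fn get_vectored(keys: &[Key], _lsn: Lsn) -> Result<VectoredResult, GetVectoredError> {
    // API misuse (e.g. too many keys) fails the whole call...
    if keys.len() > MAX_GET_VECTORED_KEYS {
        return Err(GetVectoredError::TooManyKeys(keys.len()));
    }
    // ...while reconstruction failures would be reported per key.
    Ok(keys.iter().map(|&k| (k, Ok(vec![0u8; 8192]))).collect())
}

fn main() {
    let pages = get_vectored(&[1, 2, 3], 42).unwrap();
    println!("reconstructed {} pages", pages.len());
}
```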
2024-01-23 14:23:53 +00:00
Christian Schwarz
50288c16b1 fix(pagebench): avoid CopyFail error in success case (#6443)
PR #6392 fixed CopyFail in the case where we get cancelled.
But, we also want to use `client.shutdown()` if we don't get cancelled.
2024-01-23 15:11:32 +01:00
Conrad Ludgate
e03f8abba9 eager parsing of ip addr (#6446)
## Problem

Parsing the IP address at check time is a little wasteful. 

## Summary of changes

Parse the IP when we get it from cplane. Add a `None` variant to
still allow malformed patterns.
2024-01-23 13:25:01 +00:00
Anna Khanova
1905f0bced proxy: store role not found in cache (#6439)
## Problem

There are a lot of responses with a 404 "role not found" error, which are
not getting cached in the proxy.

## Summary of changes

If an empty secret is returned but with a project_id, store it
in the cache.
2024-01-23 13:15:05 +01:00
Conrad Ludgate
72de1cb511 remove some duped deps (#6422)
## Problem

duplicated deps

## Summary of changes

little bit of fiddling with deps to reduce duplicates

needs consideration:
https://github.com/notify-rs/notify/blob/main/CHANGELOG.md#notify-600-2023-05-17
2024-01-23 11:17:15 +00:00
Konstantin Knizhnik
00d9bf5b61 Implement lockless update of pageserver_connstring GUC in shared memory (#6314)
## Problem

There is "neon.pageserver_connstring" GUC with PGC_SIGHUP option,
allowing to change it using
pg_reload_conf(). It is used by control plane to update pageserver
connection string if page server is crashed,
relocated or new shards are added.
It is copied to shared memory because config can not be loaded during
query execution and we need to
reestablish connection to page server.

## Summary of changes

Copying the connection string to shared memory is done by the postmaster, and
other backends
should check an update counter to determine whether the connection URL has changed
and the connection needs to be reestablished.
We cannot use standard Postgres LW-locks, because the postmaster has no proc
entry and so cannot wait
on this primitive. This is why a lockless access algorithm is implemented,
using two atomic counters to enforce
consistent reading of the connection string value from shared memory.
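
The two-counter scheme can be sketched as follows (a hedged Rust illustration of the algorithm only; the real implementation is C in the neon extension, and memory orderings are simplified here):

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

const BUF_SIZE: usize = 256;

struct SharedConnstring {
    write_started: AtomicU64,  // bumped by the writer before touching the buffer
    write_finished: AtomicU64, // bumped by the writer after it is done
    buf: [AtomicU8; BUF_SIZE], // stand-in for the fixed shared-memory buffer
}

impl SharedConnstring {
    /// Only the postmaster calls this, so writes never race each other.
    fn update(&self, s: &str) {
        self.write_started.fetch_add(1, Ordering::SeqCst);
        for (i, slot) in self.buf.iter().enumerate() {
            slot.store(*s.as_bytes().get(i).unwrap_or(&0), Ordering::SeqCst);
        }
        self.write_finished.fetch_add(1, Ordering::SeqCst);
    }

    /// Backends retry until they observe a buffer no write overlapped with.
    fn read(&self) -> String {
        loop {
            let finished = self.write_finished.load(Ordering::SeqCst);
            let bytes: Vec<u8> = self
                .buf
                .iter()
                .map(|b| b.load(Ordering::SeqCst))
                .take_while(|&b| b != 0)
                .collect();
            let started = self.write_started.load(Ordering::SeqCst);
            // Equal counters mean no write began or completed during the read.
            if finished == started {
                return String::from_utf8_lossy(&bytes).into_owned();
            }
        }
    }
}

fn main() {
    let shared = SharedConnstring {
        write_started: AtomicU64::new(0),
        write_finished: AtomicU64::new(0),
        buf: std::array::from_fn(|_| AtomicU8::new(0)),
    };
    shared.update("postgresql://pageserver-1:6400");
    println!("{}", shared.read());
}
```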


---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-23 07:55:05 +02:00
Sasha Krassovsky
71f495c7f7 Gate it behind feature flags 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
0a7e050144 Fix test one last time 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
55bfa91bd7 Fix test again again 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
d90b2b99df Fix test again 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
27587e155d Fix test 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
55aede2762 Prevent duplicate insertions 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
9f186b4d3e Fix query 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
585687d563 Fix syntax error 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
65a98e425d Switch to bigint 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
b2e7249979 Sleep 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
844303255a Cargo fmt 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
6d8df2579b Fix dumb thing 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
3c3b53f8ad Update test 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
30064eb197 Add scary comment 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
869acfe29b Make migrations transactional 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
11a91eaf7b Uncomment the thread 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
394ef013d0 Push the migrations test 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
a718287902 Make migrations happen on a separate thread 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
2eac1adcb9 Make clippy happy 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
3f90b2d337 Fix test_ddl_forwarding 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
a40ed86d87 Add test for migrations, add initial migration 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
1bf8bb88c5 Add support for migrations within compute_ctl 2024-01-22 14:53:29 -08:00
Vlad Lazar
f1901833a6 pageserver_api: migrate keyspace related functions from pgdatadir_mapping (#6406)
The idea is to achieve separation between keyspace layout definition
and operating on said keyspace. I've inlined all these functions since
they're small and we don't use LTO in the storage release builds
at the moment.

Closes https://github.com/neondatabase/neon/issues/6347
2024-01-22 19:16:38 +00:00
Arthur Petukhovsky
b41ee81308 Log warning on slow WAL removal (#6432)
Also add `safekeeper_active_timelines` metric.
Should help investigating #6403
2024-01-22 18:38:05 +00:00
Christian Schwarz
205b6111e6 attachment_service: /attach-hook: correctly handle detach (#6433)
Before this patch, we would update the `tenant_state.intent` in memory
but not persist the detachment to disk.

I noticed this in https://github.com/neondatabase/neon/pull/6214 where
we stop, then restart, the attachment service.
2024-01-22 18:27:05 +00:00
John Spray
93572a3e99 pageserver: mark tenant broken when cancelling attach (#6430)
## Problem

When a tenant is in Attaching state, and waiting for the
`concurrent_tenant_warmup` semaphore, it also listens for the tenant
cancellation token. When that token fires, Tenant::attach drops out.
Meanwhile, Tenant::set_stopping waits forever for the tenant to exit
Attaching state.

Fixes: https://github.com/neondatabase/neon/issues/6423

## Summary of changes

- In the absence of a valid state for the tenant, it is set to Broken in
this path. A more elegant solution will require more refactoring, beyond
this minimal fix.
2024-01-22 15:50:32 +00:00
Christian Schwarz
15c0df4de7 fixup(#6037): actually fix the issue, #6388 failed to do so (#6429)
Before this patch, the select! still returned immediately if `futs` was
empty. Must have tested a stale build in my manual testing of #6388.
2024-01-22 14:27:29 +00:00
Anna Khanova
3290fb09bf Proxy: fix gc (#6426)
## Problem

Gc currently doesn't work properly.

## Summary of changes

Change statement on running gc.
2024-01-22 13:24:10 +00:00
hamishc
efdb2bf948 Added missing PG_VERSION arg into compute node dockerfile (#6382)
## Problem

If you build the compute-node dockerfile with the PG_VERSION argument
passed in (e.g. `docker build -f Dockerfile.compute-node --build-arg
PG_VERSION=v15 .`), it fails, as some of the stages don't have the
PG_VERSION arg defined.

## Summary of changes

Added the PG_VERSION arg to the plv8-build, neon-pg-ext-build, and 
pg-embedding-pg-build stages of Dockerfile.compute-node
2024-01-22 11:05:27 +00:00
Conrad Ludgate
5559b16953 bump shlex (#6421)
## Problem

https://rustsec.org/advisories/RUSTSEC-2024-0006

## Summary of changes

`cargo update -p shlex`
2024-01-22 09:14:30 +00:00
Konstantin Knizhnik
1aea65eb9d Fix potential overflow in update_next_xid (#6412)
## Problem

See https://neondb.slack.com/archives/C06F5UJH601/p1705731304237889

Adding 1 to xid in `update_next_xid` can cause overflow in debug mode.
0xffffffff is a valid transaction ID.

## Summary of changes

Use `wrapping_add` 
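
A tiny illustration of the difference:

```rust
fn main() {
    let xid: u32 = 0xFFFF_FFFF; // a perfectly valid transaction ID
    // let next = xid + 1;      // panics with "attempt to add with overflow" in debug builds
    let next = xid.wrapping_add(1); // wraps around to 0 instead
    println!("{xid} -> {next}");
}
```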

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-21 22:11:00 +02:00
Conrad Ludgate
34ddec67d9 proxy small tweaks (#6398)
## Problem

In https://github.com/neondatabase/neon/pull/6283 I did a couple changes
that weren't directly related to the goal of extracting the state
machine, so I'm putting them here

## Summary of changes

- move postgres vs console provider into another enum
- reduce error cases for link auth
- slightly refactor link flow
2024-01-21 09:58:42 +01:00
Anna Khanova
9ace36d93c Proxy: do not store empty key (#6415)
## Problem

Currently we store the entry in the cache even if the project is undefined.
That makes invalidation impossible.

## Summary of changes

Do not store if project id is empty.
2024-01-20 16:14:53 +00:00
Heikki Linnakangas
e4898a6e60 Don't pass InvalidTransactionId to update_next_xid. (#6410)
update_next_xid() doesn't have any special treatment for the invalid or
other special XIDs, so it will treat InvalidTransactionId (0) as a
regular XID. If the old nextXid is smaller than 2^31, 0 will look like a
very old XID, and nothing happens. But if nextXid is greater than 2^31, 0
will look like a very new XID, and update_next_xid() will incorrectly
bump up nextXid.
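
A hedged illustration of why 0 flips from "very old" to "very new" once nextXid passes 2^31 (`follows` mirrors the usual circular, TransactionIdFollows-style comparison; it is not the pageserver's actual code):

```rust
/// `a` logically follows `b` if the wrapping distance from `b` to `a` is < 2^31.
fn follows(a: u32, b: u32) -> bool {
    (a.wrapping_sub(b) as i32) > 0
}

fn main() {
    let invalid_xid = 0u32;
    // nextXid below 2^31: 0 looks ancient, so nothing happens.
    assert!(!follows(invalid_xid, 100));
    // nextXid above 2^31: 0 suddenly looks brand new and would bump nextXid.
    assert!(follows(invalid_xid, 0x8000_0010));
    println!("0 only misbehaves once nextXid is past 2^31");
}
```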
2024-01-20 18:04:16 +02:00
Joonas Koivunen
c77981289c build: terminate long running tests (#6389)
Configures nextest to kill tests after 1 minute. The slow period is set to
20s, which is how long our tests currently take in total; there will be 2
warnings and then the test will be killed and its output logged.

Cc: #6361
Cc: #6368 -- likely this will be enough for a long time, but it will be
counterproductive when we want to attach and debug; the added line
would have to be commented out.
2024-01-20 17:41:55 +02:00
Anna Khanova
f003dd6ad5 Remove rename in parameters (#6411)
## Problem

Name in notifications is not compatible with console name.

## Summary of changes

Rename fields to make it compatible.
2024-01-20 10:20:53 +00:00
Conrad Ludgate
7e7e9f5191 proxy: add more columns to parquet upload (#6405)
## Problem

Some fields were missed in the initial spec.

## Summary of changes

Adds a success boolean (defaults to false unless specifically marked as
successful).
Adds a duration_us integer that tracks how many microseconds were taken
from session start through to request completion.
2024-01-20 09:38:11 +00:00
Christian Schwarz
760a48207d fixup(#6037): page_service hangs up within 10ms if there's no message (#6388)
From #6037 on, until this patch, if the client opens the connection but
doesn't send a `PagestreamFeMessage` within the first 10ms, we'd close
the connection because `self.timeline_cancelled()` returns.
It returns because `self.shard_timelines` is still empty at that point:
it gets filled lazily within the handlers for the incoming messages.

Changes
-------

The question is: if we can't check for timeline cancellation, what else
do we need to be cancellable for? `tenant.cancel` is also a bad choice
because the `tenant` (shard) we pick at the top of handle_pagerequests
might indeed go away over the course of the connection lifetime, but
other shards may still be there.

The correct solution, I think, is to be responsive to task_mgr
cancellation, because the connection handler runs in a task_mgr task and
it is already the canonical way we shut down a tenant's /
timeline's page_service connections (see `Tenant::shutdown` /
`Timeline::shutdown`).

So, rename the function and make it sensitive to task_mgr cancellation.
2024-01-19 19:16:01 +00:00
Arseny Sher
88df057531 Delete WAL segments from s3 when timeline is deleted.
In the most straightforward way; safekeeper performs it in DELETE endpoint
implementation, with no coordination between sks.

The delete_force endpoint in the code is renamed to delete, as there is only one way
to delete.
2024-01-19 20:11:24 +04:00
Alexander Bayandin
c65ac37a6d zenbenchmark: attach perf results to allure report (#6395)
## Problem

For PRs with the `run-benchmarks` label, we don't upload results to the db,
making it harder to debug such tests. The only way to see some
numbers is by examining GitHub Actions output, which is really
inconvenient.
This PR adds zenbenchmark metrics to Allure reports.

## Summary of changes
- Create a json file with zenbenchmark results and attach it to allure
report
2024-01-18 20:59:43 +00:00
Arthur Petukhovsky
a092127b17 Fix truncateLsn initialization (#6396)
In
7f828890cf
we changed the logic for persisting control_files. Previously it was
updated if `peer_horizon_lsn` jumped more than one segment, which meant
`peer_horizon_lsn` was initialized on disk as soon as the safekeeper had
received its first `AppendRequest`.

This caused an issue with `truncateLsn`, which could now sometimes be
zero. This PR fixes it, and now `truncateLsn/peer_horizon_lsn` can
never be zero once we know `timeline_start_lsn`.

Closes https://github.com/neondatabase/neon/issues/6248
2024-01-18 18:55:24 +00:00
Christian Schwarz
e8f773387d pagebench: avoid noise about CopyFail in PS logs (#6392)
Before this patch, pagebench get-page-latest-lsn would sometimes cause
noisy errors in pageserver log about `CopyFail` protocol message.

refs https://github.com/neondatabase/neon/issues/6390
2024-01-18 18:50:42 +00:00
Christian Schwarz
00936d19e1 pagebench: use tracing panic hook (#6393) 2024-01-18 18:39:38 +00:00
Joonas Koivunen
57155ada77 temp: human readable summaries for relative access time compared to absolute (#6384)
When testing the new eviction order, there is a problem: all of the
(currently rare) disk-usage-based eviction runs are rare and unique; this
PR adds a human-readable summary of what the absolute order would have done
and what the relative order does. The assumption is that this logging will
make the few eviction runs in staging more useful.

Cc: #5304 for allowing testing in the staging
2024-01-18 17:21:08 +02:00
Konstantin Knizhnik
02b916d3c9 Use [NEON_SMGR] tag for all messages in neon extension (#6313)
## Problem

Use [NEON_SMGR] for all log messages produced by neon extension.

## Summary of changes

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-01-18 17:08:34 +02:00
Anastasia Lubennikova
e6e013b3b7 Fix pgbouncer settings update:
- Start pgbouncer in VM from postgres user, to allow connection to
pgbouncer admin console.
- Remove unused compute_ctl options --pgbouncer-connstr
and --pgbouncer-ini-path.
- Fix and cleanup code of connection to pgbouncer, add retries
because pgbouncer may not be instantly ready when compute_ctl starts.
2024-01-18 11:27:12 +00:00
John Spray
bd19290d9f pageserver: add shard_id to metric labels (#6308)
## Problem

tenant_id/timeline_id is no longer a full identifier for metrics from a
`Tenant` or `Timeline` object.

Closes: https://github.com/neondatabase/neon/issues/5953

## Summary of changes

Include `shard_id` label everywhere we have `tenant_id`/`timeline_id`
label.
2024-01-18 10:52:18 +00:00
Joonas Koivunen
a584e300d1 test: figure out the relative eviction order assertions (#6375)
I just failed to see this earlier on #6136. Layer counts are used as an
abstraction, and each of the two tenants loses proportionally about the
same number of layers. Sadly, there is no difference between
`relative_spare` and `relative_equal`, as both of these end up evicting
the exact same number of layers, but I'll try to add another test for
those later.

Cc: #5304
2024-01-18 12:39:45 +02:00
Joonas Koivunen
e247ddbddc build: update h2 (#6383)
Notes: https://github.com/hyperium/h2/releases/tag/v0.3.24

Related: https://rustsec.org/advisories/RUSTSEC-2024-0003
2024-01-18 09:54:15 +00:00
Konstantin Knizhnik
0dc4c9b0b8 Relsize hash lru eviction (#6353)
## Problem


Currently the relation hash size is limited by the "neon.relsize_hash_size" GUC
with a default value of 64k.
64k relations is not such a small number... but creating 376 databases is
enough to exhaust it.

## Summary of changes

Use LRU replacement algorithm to prevent hash overflow

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-01-17 20:34:30 +02:00
John Spray
b6ec11ad78 control_plane: generalize attachment_service to handle sharding (#6251)
## Problem

To test sharding, we need something to control it. We could write python
code for doing this from the test runner, but this wouldn't be usable
with neon_local run directly, and when we want to write tests with a large
number of shards/tenants, Rust is a better fit for efficiently handling all
the required state.

This service enables automated tests to easily get a system with
sharding/HA without the test itself having to set this all up by hand:
existing tests can be run against sharded tenants just by setting a
shard count when creating the tenant.

## Summary of changes

Attachment service was previously a map of TenantId->TenantState, where
the principal state stored for each tenant was the generation and the
last attached pageserver. This enabled it to serve the re-attach and
validate requests that the pageserver requires.

In this PR, the scope of the service is extended substantially to do
overall management of tenants in the pageserver, including
tenant/timeline creation, live migration, evacuation of offline
pageservers etc. This is done using synchronous code to make declarative
changes to the tenant's intended state (`TenantState.policy` and
`TenantState.intent`), which are then translated into calls into the
pageserver by the `Reconciler`.

Top level summary of modules within
`control_plane/attachment_service/src`:
- `tenant_state`: structure that represents one tenant shard.
- `service`: implements the main high-level operations such as tenant/timeline
creation, marking a node offline, etc.
- `scheduler`: for operations that need to pick a pageserver for a
tenant, construct a scheduler and call into it.
- `compute_hook`: receive notifications when a tenant shard is attached
somewhere new. Once we have locations for all the shards in a tenant,
emit an update to postgres configuration via the neon_local `LocalEnv`.
- `http`: HTTP stubs. These mostly map to methods on `Service`, but are
separated for readability and so that it'll be easier to adapt if/when
we switch to another RPC layer.
- `node`: structure that describes a pageserver node. The most important
attribute of a node is its availability: marking a node offline causes
tenant shards to reschedule away from it.

This PR is a precursor to implementing the full sharding service for
prod (#6342). What's the difference between this and a production-ready
controller for pageservers?
- JSON file persistence to be replaced with a database
- Limited observability.
- No concurrency limits. Marking a pageserver offline will try and
migrate every tenant to a new pageserver concurrently, even if there are
thousands.
- Very simple scheduler that only knows to pick the pageserver with
fewest tenants, and place secondary locations on a different pageserver
than attached locations: it does not try to place shards for the same
tenant on different pageservers. This matters little in tests, because
picking the least-used pageserver usually results in round-robin
placement.
- Scheduler state is rebuilt exhaustively for each operation that
requires a scheduler.
- Relies on neon_local mechanisms for updating postgres: in production
this would be something that flows through the real control plane.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-01-17 18:01:08 +00:00
John Spray
4cec95ba13 pageserver: add list API for LocationConf (#6329)
## Problem

The `/v1/tenant` listing API only applies to attached tenants.

For an external service to implement a global reconciliation of its list
of shards vs. what's on the pageserver, we need a full view of what's in
TenantManager, including secondary tenant locations, and InProgress
locations.

Dependency of https://github.com/neondatabase/neon/pull/6251

## Summary of changes

- Add methods to Tenant and SecondaryTenant to reconstruct the
LocationConf used to create them.
- Add `GET /v1/location_config` API
2024-01-17 13:34:51 +00:00
Arpad Müller
ab86060d97 Copy initdb if loading from different timeline ID (#6363)
Previously, if we:

1. created a new timeline B from a different timeline's A initdb
2. deleted timeline A

the initdb for timeline B would be gone, at least in a world where we
are deleting initdbs upon timeline deletion. This world is imminent
(#6226).

Therefore, if the pageserver is instructed to load the initdb from a
different timeline ID, copy it to the newly created timeline's directory
in S3. This ensures that we can disaster recover the new timeline as
well, regardless of whether the original timeline was deleted or not.

Part of https://github.com/neondatabase/neon/issues/5282.
2024-01-17 12:42:42 +01:00
Arpad Müller
6ffdcfe6a4 remote_storage: unify azure and S3 tests (#6364)
The remote_storage crate contains two copies of each test, one for azure
and one for S3. The repetition is not necessary and makes the tests more
prone to drift, so we remove it by moving the tests into a shared
module.

The module has a different name depending on where it is included, so
that each test still has "s3" or "azure" in its full path, allowing you
to just test the S3 test or just the azure tests.

Earlier PR that removed some duplication already: #6176

Fixes #6146.
2024-01-16 18:45:19 +01:00
Arpad Müller
4b0204ede5 Add copy operation tests and implement them for azure blobs (#6362)
This implements the `copy` operation for azure blobs, added to S3 by
#6091, and adds tests both to s3 and azure ensuring that the copy
operation works.
2024-01-16 12:07:20 +00:00
John Spray
bf4e708646 pageserver: eviction for secondary mode tenants (#6225)
Follows #6123 

Closes: https://github.com/neondatabase/neon/issues/5342

The approach here is to avoid using `Layer` from secondary tenants, and
instead make the eviction types (e.g. `EvictionCandidate`) have a
variant that carries a Layer for attached tenants, and a different
variant for secondary tenants.

Other changes:
- EvictionCandidate no longer carries a `Timeline`: this was only used
for providing a witness reference to remote timeline client.
- The types for returning eviction candidates are all in
disk_usage_eviction_task.rs now, whereas some of them were in
timeline.rs before.
- The EvictionCandidate type replaces LocalLayerInfoForDiskUsageEviction
type, which was basically the same thing.
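
A hedged sketch of that split (the names follow the PR text, while the payload types are simplified stand-ins):

```rust
use std::path::PathBuf;

struct Layer; // stand-in for the attached-tenant layer handle

struct SecondaryLayerInfo {
    // a secondary location only needs enough to delete the file on disk
    path: PathBuf,
    size: u64,
}

enum EvictionCandidate {
    Attached(Layer),
    Secondary(SecondaryLayerInfo),
}

fn evict(candidate: EvictionCandidate) {
    match candidate {
        EvictionCandidate::Attached(_layer) => {
            // goes through the attached tenant's layer map / remote client
            println!("evicting an attached-tenant layer");
        }
        EvictionCandidate::Secondary(info) => {
            // secondary locations just remove the on-disk file
            println!("would remove {} ({} bytes)", info.path.display(), info.size);
        }
    }
}

fn main() {
    evict(EvictionCandidate::Secondary(SecondaryLayerInfo {
        path: PathBuf::from("tenants/t/timelines/tl/layer-000"),
        size: 8192,
    }));
}
```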
2024-01-16 10:29:26 +00:00
John Spray
887e94d7da page_service: more efficient page_service -> shard lookup (#6037)
## Problem

In #5980 the page service connection handler gets a simple piece of
logic for finding the right Timeline: at connection time, it picks an
arbitrary Timeline, and then when handling individual page requests it
checks if the original timeline is the correct shard, and if not looks
one up.

This is pretty slow in the case where we have to go look up the other
timeline, because we take the big tenants manager lock.

## Summary of changes

- Add a `shard_timelines` map of ShardIndex to Timeline on the page
service connection handler
- When looking up a Timeline for a particular ShardIndex, consult
`shard_timelines` to avoid hitting the TenantsManager unless we really
need to.
- Re-work the CancellationToken handling, because the handler now holds
gateguards on multiple timelines, and so must respect cancellation of
_any_ timeline it has in its cache, not just the timeline related to the
request it is currently servicing.
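
A hedged sketch of the per-connection cache lookup (simplified stand-in types; the real handler also holds gate guards and respects cancellation):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ShardIndex(u8);

#[derive(Clone, Debug)]
struct TimelineHandle(String); // stand-in for an Arc<Timeline> plus gate guard

struct PageServerHandler {
    shard_timelines: HashMap<ShardIndex, TimelineHandle>,
}

impl PageServerHandler {
    /// Cheap path first: only fall back to the (locked) tenant manager when
    /// this connection has not resolved that shard yet.
    fn timeline_for_shard(&mut self, shard: ShardIndex) -> &TimelineHandle {
        self.shard_timelines.entry(shard).or_insert_with(|| {
            // stand-in for the expensive TenantsManager lookup
            TimelineHandle(format!("timeline for {shard:?}"))
        })
    }
}

fn main() {
    let mut handler = PageServerHandler { shard_timelines: HashMap::new() };
    handler.timeline_for_shard(ShardIndex(0)); // miss: hits the "manager"
    let cached = handler.timeline_for_shard(ShardIndex(0)).clone(); // hit
    println!("{cached:?}");
}
```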

---------

Co-authored-by: Vlad Lazar <vlad@neon.tech>
2024-01-16 09:39:19 +00:00
John Spray
df9e9de541 pageserver: API updates for sharding (#6330)
The theme of the changes in this PR is that they're enablers for #6251
which are superficial struct/api changes.

This is a spinoff from #6251:
- Various APIs + clients thereof take TenantShardId rather than TenantId
- The creation API gets a ShardParameters member, which may be used to
configure shard count and stripe size. This enables the attachment
service to present a "virtual pageserver" creation endpoint that creates
multiple shards.
- The attachment service will use tenant size information to drive shard
splitting. Make a version of `TenantHistorySize` that is usable for
decoding these API responses.
- ComputeSpec includes a shard stripe size.
2024-01-16 09:21:00 +00:00
Anna Khanova
3f2187eb92 Proxy relax sni check (#6323)
## Problem

Using the same domain name () for serverless driver can help with
connection caching.
https://github.com/neondatabase/neon/issues/6290

## Summary of changes

Relax SNI check.
2024-01-16 08:42:13 +00:00
John Khvatov
2a3cfc9665 Remove PAGE_CACHE_ACQUIRE_PINNED_SLOT_TIME histogram. (#6356)
Fixes #6343.

## Problem

PAGE_CACHE_ACQUIRE_PINNED_SLOT_TIME is used on hot path and it adds
noticeable latency to GetPage@LSN.

## Refs

https://discordapp.com/channels/1176467419317940276/1195022264115151001/1196370689268125716
2024-01-15 17:19:19 +01:00
Cihan Demirci
d34adf46b4 do not provide disclaimer input for the deploy-prod workflow (#6360)
We've removed this input from the deploy-prod workflow.
2024-01-15 16:15:34 +00:00
Conrad Ludgate
0bac8ddd76 proxy: fix serverless error message info (#6279)
## Problem


https://github.com/neondatabase/serverless/issues/51#issuecomment-1878677318

## Summary of changes

1. When we have a db_error, use db_error.message() as the message.
2. include error position.
3. line should be a string (weird?)
4. `datatype` -> `dataType`
2024-01-15 16:43:19 +01:00
Christian Schwarz
0e1ef3713e fix(pagebench): #6325 broke running without --runtime (#6351)
After PR #6325, when running without --runtime, we wouldn't wait for
start_work_barrier, causing the benchmark to not start at all.
2024-01-15 08:54:19 +00:00
Konstantin Knizhnik
31a4eb40b2 Do not suspend compute if autovacuum is active (#6322)
## Problem

See
https://github.com/orgs/neondatabase/projects/49/views/13?pane=issue&itemId=48282912

## Summary of changes


Do not suspend compute if there are active auto vacuum workers

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-01-14 09:33:57 +02:00
Arpad Müller
60ced06586 Fix timeline creation and tenant deletion race (#6310)
Fixes the race condition between timeline creation and tenant deletion
outlined in #6255.

Related: #5914, which is a similar race condition about the uninit
marker file.

Fixes #6255
2024-01-13 09:15:58 +01:00
Christian Schwarz
b76454ae41 add script to set up EC2 storage-optimized instance store for benchmarking (#6350)
Been using this all the time in
https://github.com/neondatabase/neon/pull/6214

Part of https://github.com/neondatabase/neon/issues/5771

Should consider this in https://github.com/neondatabase/neon/issues/6297
2024-01-12 19:25:17 +00:00
Arthur Petukhovsky
97b48c23f8 Compact some compute_ctl logs (#6346)
Print postgres roles in a single line and add some info.
2024-01-12 18:24:22 +00:00
Christian Schwarz
cd48ea784f TenantInfo: expose generation number (#6348)
Generally useful when debugging / troubleshooting.

I found this useful when manually duplicating a tenant from a script[^1]
where I can't use `neon_fixtures.Pageserver.tenant_attach`'s automatic
integration with the neon_local's attachment_service.

[^1]: https://github.com/neondatabase/neon/pull/6349
2024-01-12 18:27:11 +01:00
Alexey Kondratov
1c432d5492 [compute_ctl] Do not miss short-living connections (#6008)
## Problem

Currently, the activity monitor in `compute_ctl` has a 500 ms polling
interval. It also looks at the list of current client backends, looking
for an active one or one with the most recent state change. This means
we can miss short-living connections.

Yet, while testing this PR I realized that it's usually not a problem
with pooled connections, as pgbouncer maintains connections to Postgres
even though client connections are short-living. We can still miss direct
connections.

## Summary of changes

This commit introduces another way to detect user activity on the
compute. It polls the sum of `active_time` and the sum of `sessions` from all
non-system databases in `pg_stat_database` [1]. If the user runs some
queries or just opens a direct connection, it will rise; if the user
drops a db, it can go down, but it's still a change and will be detected as
activity.

New statistic-based logic seems to be working fine. Yet, after having it
running for a couple of hours I've seen several odd cases with
connections via pgbouncer:

1. Sometimes, if you run just `psql pooler_connstr -c 'select 1;'`,
   `active_time` might not be updated immediately, and it may take a couple
   dozen seconds. This doesn't seem critical, though.
2. With the same query through the pooler, `active_time` can be bumped a bit, then
   pgbouncer keeps the connection to Postgres open for ~10 minutes, then it
   disconnects, and `active_time` *could be* bumped a bit again. 'Could be'
   because I've seen it once, but it didn't reproduce on a second try.

I think this can create false positives (hopefully rare), where we will
not suspend some computes because of a lagged statistics update OR because
some non-user processes try to connect to user databases.
Currently, we don't touch them outside of startup, and
`postgres_exporter` is configured not to discover other databases,
but this can change in the future.

New behavior is covered by feature flag `activity_monitor_experimental`,
which should be provided by control plane via neondatabase/cloud#9171

[1] https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-VIEW

Related to neondatabase/cloud#7966, neondatabase/cloud#7198
2024-01-12 18:15:41 +01:00
Vlad Lazar
02c6abadf0 pageserver: remove dependency of pagebench on pageserver (#6334)
To achieve this I had to lift the BlockNumber and key_to_rel_block
definitions to pageserver_api (similar to a change in #5980).

Closes #6299
2024-01-12 17:11:19 +00:00
John Spray
7af4c676c0 pageserver: only upload initdb from shard 0 (#6331)
## Problem

When creating a timeline on a sharded tenant, we call into each shard.
We don't need to upload the initdb from every shard: only do it on shard
zero.

## Summary of changes

- Move the initdb upload into a function, and only call it on shard
zero.
2024-01-12 15:32:27 +01:00
John Spray
aafe79873c page_service: handle GetActiveTenantError::Cancelled (#6344)
## Problem

Occasional test failures with QueryError::Other errors saying
"cancelled" that get logged at error severity.

## Summary of changes

Avoid casting GetActiveTenantError::Cancelled into QueryError::Other --
it should be QueryError::Shutdown, which is not logged as an error.
2024-01-12 12:43:14 +00:00
Christian Schwarz
eae74383c1 pageserver client: mgmt_api: expose reset API (#6326)
By-product of some hack work that will be thrown away.
2024-01-12 11:07:16 +00:00
Christian Schwarz
8b657a1481 pagebench: getpage: cancellation & better logging (#6325)
Needed these while working on
https://github.com/neondatabase/neon/issues/5479
2024-01-12 11:53:18 +01:00
Christian Schwarz
42613d4c30 refactor(NeonEnv): shutdown of child processes (#6327)
Also shuts down `Broker`, which, before this PR, we did start in
`start()` but relied on the fixture to stop. Do it a bit earlier so
that, after `NeonEnv.stop()` returns, there are no child processes using
`repo_dir`.


Also, drive-by-fixes inverted logic around `ps_assert_metric_no_errors`,
missed during https://github.com/neondatabase/neon/pull/6295

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-01-12 10:23:21 +01:00
Arseny Sher
7f828890cf Extract safekeeper per timeline state from safekeeper.rs
safekeeper.rs is mostly about consensus, but state is wider. Also form
SafekeeperState which encapsulates persistent part + in memory layer with API
for atomic updates.

Moves remote_consistent_lsn back to SafekeeperMemState, fixes its absence from
the memory dump.

Also renames SafekeeperState to TimelinePersistentState, as TimelineMemState and
TimelinePersistentState are created.
2024-01-12 10:58:22 +04:00
Sasha Krassovsky
1eb30b40af Bump postgres version to support CREATE PUBLICATION FOR ALL TABLES 2024-01-11 15:30:33 -08:00
dependabot[bot]
8551a61014 build(deps): bump jinja2 from 3.1.2 to 3.1.3 (#6333) 2024-01-11 19:49:28 +00:00
Christian Schwarz
087526b81b neon_local init: add --force mode that allows an empty dir (#6328)
Need this in https://github.com/neondatabase/neon/pull/6214
refs https://github.com/neondatabase/neon/issues/5771
2024-01-11 18:11:44 +00:00
Christian Schwarz
915fba146d pagebench: getpage: optional keyspace cache file (#6324)
Proved useful when benchmarking 20k tenant setup when validating
https://github.com/neondatabase/neon/issues/5479
2024-01-11 17:42:11 +00:00
Vlad Lazar
da7a7c867e pageserver: do not bump priority of background task for timeline status requests (#6301)
## Problem

Previously, `GET /v1/tenant/:tenant_id/timeline` and `GET
/v1/tenant/:tenant_id/timeline/:timeline_id`
would bump the priority of the background task which computes the
initial logical size by cancelling
the wait on the synchronisation semaphore. However, the request would
still return an approximate
logical size. It's undesirable to force background work for a status
request.

## Summary of changes
This PR updates the priority used by the timeline status request such
that they don't do priority boosting
by default anymore. An optional query parameter,
`force-await-initial-logical-size`, is added for both
mentioned endpoints. When set to true, it will skip the concurrency
limiting semaphore and wait
for the background task to complete before returning the exact logical
size.
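
For illustration, a hedged usage sketch of the new parameter (the port and the IDs are placeholders I made up, and `reqwest` with the `blocking` feature is assumed; this is not part of the PR itself):

```rust
fn main() -> Result<(), reqwest::Error> {
    // Placeholders: substitute a real tenant/timeline ID; 9898 is assumed to
    // be the pageserver's HTTP listen port.
    let tenant_id = "aaaabbbbccccddddaaaabbbbccccdddd";
    let timeline_id = "1111222233334444aaaabbbbccccdddd";
    let url = format!("http://127.0.0.1:9898/v1/tenant/{tenant_id}/timeline/{timeline_id}");
    let resp = reqwest::blocking::Client::new()
        .get(url)
        // Ask the endpoint to wait for the background calculation and return
        // the exact logical size instead of an approximate one.
        .query(&[("force-await-initial-logical-size", "true")])
        .send()?;
    println!("{}", resp.text()?);
    Ok(())
}
```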

In order to exercise this behaviour in a test I had to add an extra
failpoint. If you think it's too intrusive,
it can be removed.

Also fixed a small bug where the cancellation of a download is reported as an
opaque download failure upstream. This caused `test_location_conf_churn`
to fail at teardown due to a WARN log line.

Closes https://github.com/neondatabase/neon/issues/6168
2024-01-11 15:55:32 +00:00
Conrad Ludgate
551f0cc097 proxy: refactor how neon-options are handled (#6306)
## Problem

HTTP connection pool was not respecting the PitR options.

## Summary of changes

1. refactor neon_options a bit to allow easier access to cache_key
2. make HTTP not go through `StartupMessageParams`
3. expose SNI processing to replace what was removed in step 2.
2024-01-11 14:58:31 +00:00
Anna Khanova
a84935d266 Extend unsupported startup parameter error message (#6318)
## Problem

An unsupported startup parameter error happens with pooled connections.
However, the reason for this error might not be obvious to the user.

## Summary of changes

Send a more descriptive message with a link to our troubleshooting page:
https://neon.tech/docs/connect/connection-errors#unsupported-startup-parameter.

Resolves: https://github.com/neondatabase/neon/issues/6291
2024-01-11 12:09:26 +00:00
Christian Schwarz
3ee981889f compaction: avoid no-op timeline dir fsync (#6311)
Random find while looking at an idle 20k tenant pageserver where each
tenant
has 9 tiny L0 layers and compaction produces no new L1s / image layers.

The aggregate CPU cost of running this every 20s for 20k tenants is
actually substantial, due to the use of `spawn_blocking`.
2024-01-11 10:32:39 +00:00
Christian Schwarz
fc66ba43c4 Revert "revert recent VirtualFile asyncification changes (#5291)" (#6309)
This reverts commit ab1f37e908.
Thereby
fixes #5479

Updated Analysis
================

The problem with the original patch was that it, for the first time,
exposed the `VirtualFile` code to tokio task concurrency instead of just
thread-based concurrency. That caused the VirtualFile file descriptor
cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in
the pageserver.
The symptom that prompted the revert (now being reverted in this PR) is
that our production systems fell into a valley of zero goodput, high
CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent
initial logical size calculations.
Initial size calculation was initiated for all timelines within the
first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the `VirtualFile` code never returned
`Poll::Pending`.
That meant that once we entered it, the calling tokio task would not
yield to the tokio executor until we were done performing the
VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's
synchronization primitives to those from `tokio::sync`.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we have tokio executor threads.
So, these primitives never actually needed to return `Poll::Pending`.
But, the tokio scheduler makes tokio sync primitives return `Pending`
*artificially*, as a mechanism for the scheduler to get back into
control more often
([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).

So, the new reality was that VirtualFile calls could now yield to the
tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing,
most notably the VirtualFile file descriptor cache: each time a task
would be rescheduled, it would want to do its next VirtualFile
operation. For that, it would first need to evict another (task's)
VirtualFile fd from the cache to make room for its own fd. It would then
do one VirtualFile operation before hitting an await point and yielding
to the executor again. The executor would run the other 19999 tasks for
fairness before circling back to the first task, which would find its fd
evicted.

The other cache that would theoretically be impacted in a similar way is
the pageserver's `PageCache`.
However, for initial logical size calculation, it seems much less
relevant in experiments, likely because of the random access nature of
initial logical size calculation.

Fixes
=====

We fixed the above problems by
- raising VirtualFile cache sizes
  - https://github.com/neondatabase/cloud/issues/8351
- changing code to ensure forward-progress once cache slots have been
acquired
  - https://github.com/neondatabase/neon/pull/5480
  - https://github.com/neondatabase/neon/pull/5482
  - tbd: https://github.com/neondatabase/neon/issues/6065
- reducing the amount of runnable tokio tasks
  - https://github.com/neondatabase/neon/pull/5578
  - https://github.com/neondatabase/neon/pull/6000
- fix bugs that caused unnecessary concurrency induced by connection
handlers
  - https://github.com/neondatabase/neon/issues/5993

I manually verified that this PR doesn't negatively affect startup
performance as follows:
create a pageserver in production configuration, with 20k
tenants/timelines, 9 tiny L0 layer files each; Start it, and observe

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s
getpage benchmark at as-fast-as-possible request rate against 5k of the
20k tenants, the achieved throughput is identical. The VirtualFile cache
isn't thrashing in that case.

Future Work
===========

We are still exposed to the cache thrashing risk from outside factors,
e.g., request concurrency is unbounded, and initial size calculation
skips the concurrency limiter when we establish a walreceiver
connection.

Once we start thrashing, we will degrade non-gracefully, i.e., encounter
a valley as was seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:
1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by
manually shedding load through tenant relocation

Potential systematic solutions are future work:
* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous
disk IO ([Epic](#4744)).
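
As a purely illustrative aside, the thrashing dynamic described above can be reproduced with a toy model (this is not the VirtualFile code, just a made-up simulation): once the number of interleaved users exceeds the number of cache slots, nearly every access has to evict someone else's entry.

```rust
use std::collections::HashSet;

// Toy model: a fixed-size "fd cache" shared by `tasks` users that are
// scheduled round-robin. Returns (hits, misses) over `rounds` passes.
fn simulate(slots: usize, tasks: usize, rounds: usize) -> (u64, u64) {
    let mut cache: HashSet<usize> = HashSet::new(); // which task's fd is cached
    let (mut hits, mut misses) = (0u64, 0u64);
    for _ in 0..rounds {
        for task in 0..tasks {
            if cache.contains(&task) {
                hits += 1;
            } else {
                misses += 1;
                if cache.len() >= slots {
                    // Cache is full: evict an arbitrary victim to make room.
                    let victim = *cache.iter().next().unwrap();
                    cache.remove(&victim);
                }
                cache.insert(task);
            }
        }
    }
    (hits, misses)
}

fn main() {
    // Thread-based concurrency: fewer users than slots, mostly hits after warm-up.
    println!("{:?}", simulate(100, 64, 10));
    // Unbounded task concurrency: 20k interleaved users, almost pure misses.
    println!("{:?}", simulate(100, 20_000, 10));
}
```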
2024-01-11 11:29:14 +01:00
Arthur Petukhovsky
544284cce0 Collapse multiline queries in compute_ctl (#6316) 2024-01-10 22:25:28 +04:00
Arthur Petukhovsky
71beabf82d Join multiline postgres logs in compute_ctl (#5903)
Postgres can write multiline logs, and they are difficult to handle
after they are mixed with other logs. This PR combines multiline logs
from postgres into a single line, where previous line breaks are
replaced with unicode zero-width spaces. Then postgres logs are written
to stderr with `PG:` prefix.

It makes it easy to distinguish postgres logs from all other compute
logs with a simple grep, e.g. `|= "PG:"`
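
A minimal sketch of the joining idea (the function name is invented; the real compute_ctl code reads the Postgres log stream incrementally):

```rust
// Collapse one multiline Postgres log entry into a single stderr line: inner
// line breaks become U+200B zero-width spaces and the entry gets a "PG:"
// prefix so it can be selected with a simple grep.
fn join_pg_log_entry(entry: &str) -> String {
    const ZWSP: &str = "\u{200B}";
    let joined = entry.trim_end_matches('\n').lines().collect::<Vec<_>>().join(ZWSP);
    format!("PG:{joined}")
}

fn main() {
    let entry = "ERROR:  relation \"t\" does not exist\nLINE 1: select * from t;\n";
    eprintln!("{}", join_pg_log_entry(entry));
}
```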
2024-01-10 15:11:43 +00:00
Anna Khanova
76372ce002 Added auth info cache with notifiations to redis. (#6208)
## Problem

Current cache doesn't support any updates from the cplane.

## Summary of changes

* Added redis notifier listener.
* Added cache which can be invalidated with the notifier. If the
notifier is not available, it's just a normal ttl cache.
* Updated cplane api.

The motivation behind this organization of the data is the following:
* In the Neon data model there are projects. Projects could have
multiple branches and each branch could have more than one endpoint.
* Also there is one special `main` branch.
* Password reset works per branch.
* Allowed IPs are the same for every branch in the project (except,
maybe, the main one).
* The main branch can be changed to the other branch.
* The endpoint can be moved between branches.

Every event described above requires some special processing on the
proxy (or cplane) side.

The idea of invalidating for the project is that whenever one of the
events above is happening with the project, proxy can invalidate all
entries for the entire project.

This approach also requires some additional API change (returning
project_id inside the auth info).
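
A hedged sketch of the caching idea with invented names (a plain in-memory map standing in for the real cache): every entry remembers its project_id so one notification can drop all entries for a project, and without notifications it behaves like an ordinary TTL cache.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct AuthInfoCache<V> {
    ttl: Duration,
    // endpoint -> (project_id, inserted_at, value)
    entries: HashMap<String, (String, Instant, V)>,
}

impl<V> AuthInfoCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }
    fn insert(&mut self, endpoint: String, project_id: String, value: V) {
        self.entries.insert(endpoint, (project_id, Instant::now(), value));
    }
    // Plain TTL behaviour when no notification ever arrives.
    fn get(&self, endpoint: &str) -> Option<&V> {
        self.entries
            .get(endpoint)
            .filter(|(_, inserted, _)| inserted.elapsed() < self.ttl)
            .map(|(_, _, v)| v)
    }
    // Called from the redis notification listener when anything in the
    // project changes (password reset, main branch change, endpoint move...).
    fn invalidate_project(&mut self, project_id: &str) {
        self.entries.retain(|_, (p, _, _)| p.as_str() != project_id);
    }
}
```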
2024-01-10 11:51:05 +00:00
Christian Schwarz
4e1b0b84eb pagebench: fixup after is_rel_block_key changes in #6266 (#6303)
PR #6266 broke the getpage_latest_lsn benchmark.

Before this patch, we'd fail with

```
not implemented: split up range
```

because `r.start = rel size key` and `r.end = rel size key + 1`.

The filtering of the key ranges in that loop is a bit ugly, but I measured:
* setup with 180k layer files (20k tenants * 9 layers).
* total physical size is 463GiB
* for 5k tenants, the range filtering takes `0.6 seconds` on an
i3en.3xlarge.
That's a tiny fraction of the overall time it takes for pagebench to get
ready to send requests. So, this is good enough for now / there are
other bottlenecks that are bigger.
2024-01-09 19:00:37 +01:00
John Spray
f94abbab95 pageserver: clean up a redundant tenant_id attribute (#6280)
This was a small TODO(sharding) thing in TenantHarness.
2024-01-09 12:10:15 +00:00
John Spray
4b9b4c2c36 pageserver: cleanup redundant create/attach code, fix detach while attaching (#6277)
## Problem

The code for tenant create and tenant attach was just a special case of
what upsert_location does.

## Summary of changes

- Use `upsert_location` for create and attach APIs
- Clean up error handling in upsert_location so that it can generate
appropriate HTTP response codes
- Update tests that asserted the old non-idempotent behavior of attach
- Rework the `test_ignore_while_attaching` test, and fix tenant shutdown
during activation, which this test was supposed to cover, but it was
actually just waiting for activation to complete.
2024-01-09 10:37:54 +00:00
Arpad Müller
8186f6b6f9 Drop async_trait usage from three internal traits (#6305)
This uses the [newly stable](https://blog.rust-lang.org/2023/12/21/async-fn-rpit-in-traits.html)
async trait feature for three internal traits. One requires `Send`
bounds to be present, so it uses `impl Future<...> + Send` instead.

Advantages:
* less macro usage
* no extra boxing

Disadvantages:
* impl syntax needed for `Send` bounds is a bit more verbose (but only
required in one place)
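
A small sketch of the two spellings (trait and method names are invented, not the ones changed in this PR):

```rust
use std::future::Future;

// Plain `async fn` in a trait: stable since Rust 1.75, no #[async_trait] macro
// and no boxing of the returned future.
trait Fetch {
    async fn fetch(&self, key: &str) -> Option<Vec<u8>>;
}

// When callers need the returned future to be Send, the bound has to be
// spelled out with return-position impl Trait, which is the more verbose form
// mentioned above.
trait FetchSend {
    fn fetch(&self, key: &str) -> impl Future<Output = Option<Vec<u8>>> + Send;
}
```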
2024-01-09 11:20:08 +01:00
Christian Schwarz
90e0219b29 python tests: support overlayfs for NeonEnvBuilder.from_repo_dir (#6295)
Part of #5771
Extracted from https://github.com/neondatabase/neon/pull/6214

This PR makes the test suite sensitive to the new env var
`NEON_ENV_BUILDER_FROM_REPO_DIR_USE_OVERLAYFS`.
If it is set, `NeonEnvBuilder.from_repo_dir` uses overlayfs
to duplicate the snapshot repo dir contents.

Since mounting requires root privileges, we use sudo to perform
the mounts. That, and macOS support, is also why copytree remains
the default.

If we ever run on a filesystem with copy reflink support, we should
consider that as an alternative.

This PR can be tried on a Linux machine on the
`test_backward_compatiblity` test, which uses `from_repo_dir`.
2024-01-09 10:15:46 +00:00
Christian Schwarz
4b6004e8c9 fix(page_service client): correctly deserialize pagestream errors (#6302)
Before this PR, we wouldn't advance the underlying `Bytes`'s cursor.

fixes https://github.com/neondatabase/neon/issues/6298
2024-01-09 10:22:43 +01:00
Em Sharnoff
9bf7664049 vm-monitor: Remove spammy log line (#6284)
During a previous incident, we noticed that this particular line can be
repeatedly logged every 100ms if the memory usage is persistently high
enough to warrant upscaling.

Per the added comment: Ideally we'd still like to include this log line,
because it's useful information, but the simple way to include it
produces far too many log lines, and the more complex ways to
deduplicate the log lines while still including the information are
probably not worth the effort right now.
2024-01-08 21:12:39 -08:00
Arpad Müller
d5e3434371 Also allow unnecessary_fallible_conversions lint (#6294)
This fixes the clippy lint firing on macOS on the conversion which
is needed for portability. For some reason, the logic in
https://github.com/rust-lang/rust-clippy/pull/11669 to avoid an overlap
is not working.
2024-01-09 04:22:36 +00:00
Christian Schwarz
66c52a629a RFC: vectored Timeline::get (#6250) 2024-01-08 15:00:01 +00:00
Conrad Ludgate
8a646cb750 proxy: add request context for observability and blocking (#6160)
## Summary of changes

### RequestMonitoring

We want to add an event stream with information on each request for
easier analysis than what we can do with diagnostic logs alone
(https://github.com/neondatabase/cloud/issues/8807). This
RequestMonitoring will keep a record of the final state of a request. On
drop it will be pushed into a queue to be uploaded.

Because this context is a bag of data, I don't want this information to
impact logic of request handling. I personally think that weakly typed
data (such as all these options) makes for spaghetti code. I will
however allow for this data to impact rate-limiting and blocking of
requests, as this does not _really_ change how a request is handled.

### Parquet

Each `RequestMonitoring` is flushed into a channel where it is converted
into `RequestData`, which is accumulated into parquet files. Each file
will have a certain number of rows per row group, and several row groups
will eventually fill up the file, which we then upload to S3.

We will also upload smaller files if they take too long to construct.
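
Very roughly, the shape of that pipeline looks like the sketch below (types and fields are invented and heavily simplified; the real proxy batches rows into parquet row groups rather than printing them):

```rust
use std::sync::mpsc;

#[derive(Clone, Debug)]
struct RequestData {
    endpoint: Option<String>,
    error: Option<String>,
}

struct RequestMonitoring {
    data: RequestData,
    sink: mpsc::Sender<RequestData>,
}

impl RequestMonitoring {
    fn new(sink: mpsc::Sender<RequestData>) -> Self {
        Self { data: RequestData { endpoint: None, error: None }, sink }
    }
    fn set_endpoint(&mut self, ep: &str) {
        self.data.endpoint = Some(ep.to_owned());
    }
    fn set_error(&mut self, err: &str) {
        self.data.error = Some(err.to_owned());
    }
}

impl Drop for RequestMonitoring {
    // The final state of the request is recorded exactly once, on drop.
    fn drop(&mut self) {
        let _ = self.sink.send(self.data.clone()); // best effort if the writer is gone
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    {
        let mut ctx = RequestMonitoring::new(tx);
        ctx.set_endpoint("ep-example-123456");
        ctx.set_error("password authentication failed");
    } // dropped here -> row is queued for the batch writer
    println!("{:?}", rx.recv().unwrap());
}
```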
2024-01-08 11:42:43 +00:00
Arpad Müller
a4ac8e26e8 Update Rust to 1.75.0 (#6285)
[Release notes](https://github.com/rust-lang/rust/releases/tag/1.75.0).
2024-01-08 11:46:16 +01:00
John Spray
b3a681d121 s3_scrubber: updates for sharding (#6281)
This is a lightweight change to keep the scrubber providing sensible
output when using sharding.

- The timeline count was wrong when using sharding
- When checking for tenant existence, we didn't re-use results between
different shards in the same tenant

Closes: https://github.com/neondatabase/neon/issues/5929
2024-01-08 09:19:10 +00:00
John Spray
b5ed6f22ae pageserver: clean up a TODO comment (#6282)
These functions don't need updating for sharding: it's fine for them to
remain shard-naive, as they're only used in the context of dumping a
layer file. The sharding metadata doesn't live in the layer file, it
lives in the index.
2024-01-08 09:19:00 +00:00
John Spray
d1c0232e21 pageserver: use pub(crate) in metrics.rs, and clean up unused items (#6275)
## Problem

Noticed while making other changes that there were `pub` items that were
unused.

## Summary of changes

- Make everything `pub(crate)` in metrics.rs, apart from items used from
`bin/`
- Fix the timelines eviction metric: it was never being incremented
- Remove an unused ephemeral_bytes counter.
2024-01-08 03:53:15 +00:00
Arseny Sher
a41c4122e3 Don't suspend compute if there is active LR subscriber.
https://github.com/neondatabase/neon/issues/6258
2024-01-06 01:24:44 +04:00
Alexander Bayandin
7de829e475 test_runner: replace black with ruff format (#6268)
## Problem

`black` is slow sometimes, we can replace it with `ruff format` (a new
feature in 0.1.2 [0]), which produces a style pretty similar to black's [1].

On my local machine (MacBook M1 Pro 16GB):
```
# `black` on main
$ hyperfine "BLACK_CACHE_DIR=/dev/null poetry run black ."
Benchmark 1: BLACK_CACHE_DIR=/dev/null poetry run black .
  Time (mean ± σ):      3.131 s ±  0.090 s    [User: 5.194 s, System: 0.859 s]
  Range (min … max):    3.047 s …  3.354 s    10 runs
```
```
# `ruff format` on the current PR
$ hyperfine "RUFF_NO_CACHE=true poetry run ruff format"      
Benchmark 1: RUFF_NO_CACHE=true poetry run ruff format
  Time (mean ± σ):     300.7 ms ±  50.2 ms    [User: 259.5 ms, System: 76.1 ms]
  Range (min … max):   267.5 ms … 420.2 ms    10 runs
```

## Summary of changes
- Replace `black` with `ruff format` everywhere

- [0] https://docs.astral.sh/ruff/formatter/
- [1] https://docs.astral.sh/ruff/formatter/#black-compatibility
2024-01-05 15:35:07 +00:00
John Spray
3c560d27a8 pageserver: implement secondary-mode downloads (#6123)
Follows on from #6050 , in which we upload heatmaps. Secondary locations
will now poll those heatmaps and download layers mentioned in the
heatmap.

TODO:
- [X] ~Unify/reconcile stats for behind-schedule execution with
warn_when_period_overrun
(https://github.com/neondatabase/neon/pull/6050#discussion_r1426560695)~
- [x] Give downloads their own concurrency config independent of uploads

Deferred optimizations:
- https://github.com/neondatabase/neon/issues/6199
- https://github.com/neondatabase/neon/issues/6200

Eviction will be the next PR:
- #5342
2024-01-05 12:29:20 +00:00
Christian Schwarz
d260426a14 is_rel_block_key: exclude the relsize key (#6266)
Before this PR, `is_rel_block_key` returns true for the blknum
`0xffffffff`,
which is a blknum that's actually never written by Postgres, but used by
Neon Pageserver to store the relsize.

Quoting @MMeent:

> PostgreSQL can't extend the relation beyond size of 0xFFFFFFFF blocks,
> so block number 0xFFFFFFFE is the last valid block number.

This PR changes the definition of the function to exclude blknum
0xffffffff.

My motivation for doing this change is to fix the `pagebench` getpage
benchmark, which uses `is_rel_block_key` to filter the keyspace for
valid pages to request from page_service.
fixes https://github.com/neondatabase/neon/issues/6210

I checked other users of the function.

The first one is `key_is_shard0`, which already had added an exemption
for 0xffffffff. So, there's no functional change with this PR.

The second one is `DatadirModification::flush`[^1]. With this PR,
`.flush()` will skip the relsize key, whereas it didn't
before. This means we will pile up all the relsize key-value pairs
`(Key,u32)`
in `DatadirModification::pending_updates` until `.commit()` is called.

The only place I can think of where that would be a problem is if we
import from a full basebackup, and don't `.commit()` regularly,
like we currently don't do in `import_basebackup_from_tar`.
It exposes us to input-controlled allocations.
However, that was already the case for the other keys that are skipped,
so, one can argue that this change is not making the situation much
worse.

[^1]: That type's `flush()` and `commit()` methods are terribly named, but
      that's for another time.
2024-01-05 11:48:06 +01:00
Arthur Petukhovsky
f3b5db1443 Add API for safekeeper timeline copy (#6091)
Implement API for cloning a single timeline inside a safekeeper. Also
add API for calculating a sha256 hash of WAL, which is used in tests.

`/copy` API works by copying objects inside S3 for all but the last
segments, and the last segments are copied on-disk. A special temporary
directory is created for a timeline, because copy can take a lot of
time, especially for large timelines. After all file segments have been
prepared, this directory is mounted to the main tree and the timeline is
loaded into memory.

Some caveats:
- large timelines can take a lot of time to copy, because we need to
copy many S3 segments
- the caller should wait for the HTTP call to finish indefinitely and not
close the HTTP connection, because closing it will stop the process, which is
not continued in the background
- `until_lsn` must be a valid LSN, otherwise bad things can happen
- API will return 200 if specified `timeline_id` already exists, even if
it's not a copy
- each safekeeper will try to copy S3 segments, so it's better to not
call this API in-parallel on different safekeepers
2024-01-04 17:40:38 +00:00
John Spray
18e9208158 pageserver: improved error handling for shard routing error, timeline not found (#6262)
## Problem

- When a client requests a key that isn't found in any shard on the node
(edge case that only happens if a compute's config is out of date), we
should prompt them to reconnect (as this includes a backoff), since they
will not be able to complete the request until they eventually get a
correct pageserver connection string.
- QueryError::Other is used excessively: this contains a type-ambiguous
anyhow::Error and is logged very verbosely (including backtrace).

## Summary of changes

- Introduce PageStreamError to replace use of anyhow::Error in request
handlers for getpage, etc.
- Introduce Reconnect and NotFound variants to QueryError
- Map the "shard routing error" case to PageStreamError::Reconnect ->
QueryError::Reconnect
- Update type conversions for LSN timeouts and tenant/timeline not found
errors to use PageStreamError::NotFound->QueryError::NotFound
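
The mapping can be sketched with simplified, made-up variants (the real enums carry richer payloads and more cases):

```rust
enum PageStreamError {
    // Shard routing error: the compute should reconnect with a fresh config.
    Reconnect(String),
    // Tenant/timeline not found, LSN timeout, etc.
    NotFound(String),
    Other(String),
}

enum QueryError {
    Reconnect,
    NotFound(String),
    Other(String), // logged verbosely; the point of the PR is to use it less
}

impl From<PageStreamError> for QueryError {
    fn from(e: PageStreamError) -> Self {
        match e {
            PageStreamError::Reconnect(_) => QueryError::Reconnect,
            PageStreamError::NotFound(msg) => QueryError::NotFound(msg),
            PageStreamError::Other(msg) => QueryError::Other(msg),
        }
    }
}
```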
2024-01-04 10:40:03 +00:00
Sasha Krassovsky
7662df6ca0 Fix minimum backoff to 1ms 2024-01-03 21:09:19 -08:00
John Spray
c119af8ddd pageserver: run at least 2 background task threads
Otherwise an assertion in CONCURRENT_BACKGROUND_TASKS will
trip if you try to run the pageserver on a single core.
2024-01-03 14:22:40 +00:00
John Spray
a2e083ebe0 pageserver: make walredo shard-aware
This does not have a functional impact, but enables all
the logging in this code to include the shard_id
label.
2024-01-03 14:22:40 +00:00
John Spray
73a944205b pageserver: log details on shard routing error 2024-01-03 14:22:40 +00:00
John Spray
34ebfbdd6f pageserver: fix handling getpage with multiple shards on one node
Previously, we would wait for the LSN to be visible on whichever
timeline we happened to load at the start of the connection, then
proceed to look up the correct timeline for the key and do the read.

If the timeline holding the key was behind the timeline we used
for the LSN wait, then we might serve an apparently-successful read result
that actually contains data from behind the requested lsn.
2024-01-03 14:22:40 +00:00
John Spray
ef7c9c2ccc pageserver: fix active tenant lookup hitting secondaries with sharding
If there is some secondary shard for a tenant on the same
node as an attached shard, the secondary shard could trip up
this code and cause page_service to incorrectly
get an error instead of finding the attached shard.
2024-01-03 14:22:40 +00:00
John Spray
6c79e12630 pageserver: drop unwanted keys during compaction after split 2024-01-03 14:22:40 +00:00
John Spray
753d97bd77 pageserver: don't delete ancestor shard layers 2024-01-03 14:22:40 +00:00
John Spray
edc962f1d7 test_runner: test_issue_5878 log allow list (#6259)
## Problem


https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6254/7388706419/index.html#suites/5a4b8734277a9878cb429b80c314f470/e54c4f6f6ed22672

## Summary of changes

Permit the log message: because the test helper's detach function
increments the generation number, a detach/attach cycle can cause the
error if the test runner node is slow enough for the opportunistic
deletion queue flush on detach not to complete by the time we call
attach.
2024-01-03 14:22:17 +00:00
Arseny Sher
65b4e6e7d6 Remove empty safekeeper init since truncateLsn.
It has caveats such as creating a half-empty segment which can't be
offloaded. Instead we'll pursue the pull_timeline approach, seeding new state
from some peer.
2024-01-03 18:20:19 +04:00
Alexander Bayandin
17b256679b vm-image-spec: build pgbouncer from Neon's fork (#6249)
## Problem

We need to add one more patch to pgbouncer (for
https://github.com/neondatabase/neon/issues/5801). I've decided to
cherry-pick all required patches to a pgbouncer fork
(`neondatabase/pgbouncer`) and use it instead.

See
https://github.com/neondatabase/pgbouncer/releases/tag/pgbouncer_1_21_0-neon-1

## Summary of changes
- Revert the previous patch (for deallocate/discard all) — the fork
already contains it.
- Remove `libssl-dev` dependency — we build pgbouncer without `openssl`
support.
- Clone git tag and build pgbouncer from source code.
2024-01-03 13:02:04 +00:00
John Spray
673a865055 tests: tolerate 304 when evicting layers (#6261)
In tests that evict layers, explicit eviction can race with automatic
eviction of the same layer and result in a 304
2024-01-03 11:50:58 +00:00
Cuong Nguyen
fb518aea0d Add batch ingestion mechanism to avoid high contention (#5886)
## Problem
For context, this problem was observed in a research project where we
try to make neon run in multiple regions and I was asked by @hlinnaka to
make this PR.

In our project, we use the pageserver in a non-conventional way such
that we would send a larger number of requests to the pageserver than
normal (imagine postgres without the buffer pool). I measured the time
from the moment a WAL record left the safekeeper to when it reached the
pageserver
([code](e593db1f5a/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L282-L287)))
and observed that when the number of get_page_at_lsn requests was high,
the wal receiving time increased significantly (see the left side of the
graphs below).

Upon further investigation, I found that the delay was caused by this
line


d2ca410919/pageserver/src/tenant/timeline.rs (L2348)

The `get_layer_for_write` method is called for every value during WAL
ingestion and it tries to acquire the layers write lock every time; this
results in high contention when the read lock is acquired more frequently.


![Untitled](https://github.com/neondatabase/neon/assets/6244849/85460f4d-ead1-4532-bc64-736d0bfd7f16)

![Untitled2](https://github.com/neondatabase/neon/assets/6244849/84199ab7-5f0e-413b-a42b-f728f2225218)

## Summary of changes

It is unnecessary to call `get_layer_for_write` repeatedly for all
values in a WAL message since they would end up in the same memory layer
anyway, so I created the batched versions of `InMemoryLayer::put_value`,
`InMemoryLayer::put_tombstone`, `Timeline::put_value`, and
`Timeline::put_tombstone`, that acquire the locks once for a batch of
values.
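
In toy form (invented, heavily simplified types: the real APIs deal in keys, LSNs and serialized values, and the lock is async), the batched variant amounts to one write-lock acquisition per batch instead of one per value:

```rust
use std::sync::RwLock;

struct InMemoryLayer {
    values: Vec<(u64, Vec<u8>)>, // (key, value) pairs, vastly simplified
}

struct Timeline {
    open_layer: RwLock<InMemoryLayer>,
}

impl Timeline {
    // One lock acquisition per value: contends with readers on every call.
    fn put_value(&self, key: u64, value: Vec<u8>) {
        self.open_layer.write().unwrap().values.push((key, value));
    }
    // One lock acquisition for a whole batch of ingested records.
    fn put_values(&self, batch: Vec<(u64, Vec<u8>)>) {
        let mut layer = self.open_layer.write().unwrap();
        layer.values.extend(batch);
    }
}
```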

Additionally, `DatadirModification` is changed to store multiple
versions of uncommitted values, and `WalIngest::ingest_record()` can now
ingest records without immediately committing them.

With these new APIs, the new ingestion loop can be changed to commit for
every `ingest_batch_size` records. The `ingest_batch_size` variable is
exposed as a config. If it is set to 1, we get the same behavior as before
this change. I found that setting this value to 100 seems to work
the best, and you can see its effect on the right side of the above
graphs.

---------

Co-authored-by: John Spray <john@neon.tech>
2024-01-03 10:41:58 +00:00
John Spray
42f41afcbd tests: update pytest and boto3 dependencies (#6253)
## Problem

The version of pytest we were using emits a number of
DeprecationWarnings on the latest Python; these are fixed in the latest release.

boto3 and python-dateutil also have deprecation warnings, but
unfortunately these aren't fixed upstream yet.



## Summary of changes

- Update pytest
- Update boto3 (this doesn't fix deprecation warnings, but by the time I
figured that out I had already done the update, and it's good hygiene
anyway)
2024-01-03 10:36:53 +00:00
Arseny Sher
f71110383c Remove second check for max_slot_wal_keep_size download size.
Already checked in GetLogRepRestartLSN, a rebase artifact.
2024-01-03 13:13:32 +04:00
Arseny Sher
ae3eaf9995 Add [WP] prefix to all walproposer logging.
- rename walpop_log to wp_log
- also create wpg_log, which is used in postgres-specific code
- in passing, format messages to start with lower case
2024-01-03 11:10:27 +04:00
Christian Schwarz
aa9f1d4b69 pagebench get-page: default to latest=true, make configurable via flag (#6252)
fixes https://github.com/neondatabase/neon/issues/6209
2024-01-02 16:57:29 +00:00
Joonas Koivunen
946c6a0006 scrubber: use adaptive config with retries, check subset of tenants (#6219)
The tool still needs a lot of work. These are the easiest fix and
feature:
- use similar adaptive config with s3 as remote_storage, use retries
- process only particular tenants

Tenants need to be from the correct region, they are not deduplicated,
but the feature is useful for re-checking a small number of tenants after
a large run.
2024-01-02 15:22:16 +00:00
Sasha Krassovsky
ce13281d54 MIN not MAX 2024-01-02 06:28:49 -08:00
Sasha Krassovsky
4e1d16f311 Switch to exponential rate-limiting 2024-01-02 06:28:49 -08:00
Sasha Krassovsky
091a0cda9d Switch to rate-limiting strategy 2024-01-02 06:28:49 -08:00
Sasha Krassovsky
ea9fad419e Add exponential backoff to page_server->send 2024-01-02 06:28:49 -08:00
Arseny Sher
e92c9f42c0 Don't split WAL record across two XLogData's when sending from safekeepers.
As the protocol demands. Not following this makes the standby complain about
corrupted WAL in various ways.

https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719
closes https://github.com/neondatabase/cloud/issues/9057
2024-01-02 10:50:20 +04:00
Arseny Sher
aaaa39d9f5 Add large insertion and slow WAL sending to test_hot_standby.
To exercise MAX_SEND_SIZE sending from safekeeper; we've had a bug with WAL
records torn across several XLogData messages. Add failpoint to safekeeper to
slow down sending. Also check for corrupted WAL complaints in the standby log.

Make the test a bit simpler in passing, e.g. we don't need explicit commits as
autocommit is enabled by default.

https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719
https://github.com/neondatabase/cloud/issues/9057
2024-01-02 10:50:20 +04:00
Arseny Sher
e79a19339c Add failpoint support to safekeeper.
Just a copy paste from pageserver.
2024-01-02 10:50:20 +04:00
Arseny Sher
dbd36e40dc Move failpoint support code to utils.
To enable them in safekeeper as well.
2024-01-02 10:50:20 +04:00
Arseny Sher
90ef48aab8 Fix safekeeper START_REPLICATION (term=n).
It was giving WAL only up to commit_lsn instead of flush_lsn, so recovery of
uncommitted WAL since cdb08f03 hung. Add a test for this.
2024-01-01 20:44:05 +04:00
Arseny Sher
9a43c04a19 compute_ctl: kill postgres and sync-safekeepers on exit.
Otherwise they are left orphaned when compute_ctl is terminated with a
signal. It was invisible most of the time because normally neon_local or k8s
kills postgres directly and then compute_ctl finishes gracefully. However, in
some tests compute_ctl gets stuck waiting for sync-safekeepers which
intentionally never ends because safekeepers are offline, and we want to stop
compute_ctl without leaving orphans behind.

This is quite a rough approach which doesn't wait for child termination. A
better way would be to convert compute_ctl to async, which would make waiting
easy.
2024-01-01 20:44:05 +04:00
Abhijeet Patil
f28bdb6528 Use nextest for rust unittests (#6223)
## Problem
`cargo test` doesn't support timeouts 
or junit output format

## Summary of changes
- Add `nextest` to `build-tools` image
- Switch `cargo test` with `cargo nextest` on CI
- Set timeout
2023-12-30 13:45:31 +00:00
Conrad Ludgate
1c037209c7 proxy: fix compute addr parsing (#6237)
## Problem

control plane should be able to return domain names and not just IP
addresses.

## Summary of changes

1. add regression tests
2. use rsplit to split the port from the back, then trim the IPv6
brackets (see the sketch below)
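
A hedged sketch of that parsing order (the helper name is invented):

```rust
// Split the port off from the right, so plain hostnames, IPv4 addresses and
// bracketed IPv6 addresses all work, then strip the IPv6 brackets.
fn split_host_port(addr: &str) -> Option<(&str, u16)> {
    let (host, port) = addr.rsplit_once(':')?;
    let port = port.parse().ok()?;
    let host = host
        .strip_prefix('[')
        .and_then(|h| h.strip_suffix(']'))
        .unwrap_or(host);
    Some((host, port))
}

fn main() {
    assert_eq!(split_host_port("compute.example.com:5432"), Some(("compute.example.com", 5432)));
    assert_eq!(split_host_port("[::1]:5432"), Some(("::1", 5432)));
}
```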
2023-12-29 09:32:24 +00:00
Bodobolero
e5a3b6dfd8 Pg stat statements reset for neon superuser (#6232)
## Problem

Extension pg_stat_statements has function pg_stat_statements_reset().
In vanilla Postgres this function can only be called by a superuser role
or other users/roles explicitly granted access.
In Neon no end user can use the superuser role.
Instead we have the neon_superuser role.
We need to grant execute on pg_stat_statements_reset() to neon_superuser.

## Summary of changes

Modify the Postgres v14, v15, v16 contrib in our compute docker file to
grant execute on pg_stat_statements_reset() to neon_superuser.
(Modifying it in our docker file is preferable to changes in
neondatabase/postgres because we want to limit the changes in our fork
that we have to carry with each new version of Postgres).

Note that the interface of proc/function pg_stat_statements_reset
changed in pg_stat_statements version 1.7

So for versions up to and including 1.6 we must

`GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO
neon_superuser;`

and for versions starting from 1.7 we must

`GRANT EXECUTE ON FUNCTION pg_stat_statements_reset(Oid, Oid, bigint) TO
neon_superuser;`

If we just use `GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO
neon_superuser;` for all version this results in the following error for
versions 1.7+:

```sql
neondb=> create extension pg_stat_statements;
ERROR:  function pg_stat_statements_reset() does not exist
```



## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [x] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

## I have run the following test and could now invoke
pg_stat_statements_reset() using default user

```bash
(neon) peterbendel@Peters-MBP neon % kubectl get pods | grep compute-quiet-mud-88416983       
compute-quiet-mud-88416983-74f4bf67db-crl4c            3/3     Running     0          7m26s
(neon) peterbendel@Peters-MBP neon % kubectl set image deploy/compute-quiet-mud-88416983 compute-node=neondatabase/compute-node-v15:7307610371
deployment.apps/compute-quiet-mud-88416983 image updated
(neon) peterbendel@Peters-MBP neon % psql postgresql://peterbendel:<secret>@ep-bitter-sunset-73589702.us-east-2.aws.neon.build/neondb     
psql (16.1, server 15.5)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

neondb=> select version();
                                              version                                              
---------------------------------------------------------------------------------------------------
 PostgreSQL 15.5 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 row)

neondb=> create extension pg_stat_statements;
CREATE EXTENSION

neondb=> select pg_stat_statements_reset();
 pg_stat_statements_reset 
--------------------------
 
(1 row)
```
2023-12-27 18:15:17 +01:00
Sasha Krassovsky
136aab5479 Bump postgres submodule versions 2023-12-27 08:39:00 -08:00
Anastasia Lubennikova
6e40900569 Manage pgbouncer configuration from compute_ctl:
- add pgbouncer_settings section to compute spec;
- add pgbouncer-connstr option to compute_ctl.
- add pgbouncer-ini-path option to compute_ctl. Default: /etc/pgbouncer/pgbouncer.ini

Apply pgbouncer config on compute start and respec to override default spec.

Save pgbouncer config updates to pgbouncer.ini to preserve them across pgbouncer restarts.
2023-12-26 15:17:09 +00:00
Arseny Sher
ddc431fc8f pgindent walproposer condvar comment 2023-12-26 14:12:53 +04:00
Arseny Sher
bfc98f36e3 Refactor handling responses in walproposer.
Remove confirm_wal_streamed; we already apply both write and flush positions of
the slot to commit_lsn which is fine because 1) we need to wake up waiters 2)
committed WAL can be fetched from safekeepers by neon_walreader now.
2023-12-26 14:12:53 +04:00
Arseny Sher
d5fbfe2399 Remove test_wal_deleted_after_broadcast.
It is superseded by stronger test_lagging_sk.
2023-12-26 14:12:53 +04:00
Arseny Sher
1f1c50e8c7 Don't re-add neon_walreader socket to waiteventset if possible.
Should make recovery slightly more efficient (likely negligibly).
2023-12-26 14:12:53 +04:00
Arseny Sher
854df0f566 Do PQgetCopyData before PQconsumeInput in libpqwp_async_read.
To avoid a lot of redundant memmoves and bloated input buffer.

fixes https://github.com/neondatabase/neon/issues/6055
2023-12-26 14:12:53 +04:00
Arseny Sher
9c493869c7 Perform synchronous WAL download in wp only for logical replication.
wp -> sk communication now uses neon_walreader which will fetch missing WAL on
demand from safekeepers, so doesn't need this anymore. Also, cap WAL download by
max_slot_wal_keep_size to be able to start compute if lag is too high.
2023-12-26 14:12:53 +04:00
Arseny Sher
df760e6de5 Add test_lagging_sk. 2023-12-26 14:12:53 +04:00
Arseny Sher
14913c6443 Adapt rust walproposer to neon_walreader. 2023-12-26 14:12:53 +04:00
Arseny Sher
cdb08f0362 Introduce NeonWALReader downloading sk -> compute WAL on demand.
It is similar to XLogReader, but when either the requested segment is missing
locally or the requested LSN is before basebackup_lsn, NeonWALReader
asynchronously fetches WAL from one of the safekeepers.

The patch includes the walproposer switch to NeonWALReader; splitting it out
wouldn't make much sense as it is hard to test otherwise. This finally removes
the risk of pg_wal explosion (as well as slow start time) when one safekeeper
is lagging, while at the same time allowing it to recover.

In the future reader should also be used by logical walsender for similar
reasons (currently we download the tail on compute start synchronously).

The main test is test_lagging_sk. However, I also ran it manually a lot,
varying MAX_SEND_SIZE on both sides (on the safekeeper and on the walproposer),
testing various fragmentations (one side having a small buffer, the other, or
both), which brought up
https://github.com/neondatabase/neon/issues/6055

closes https://github.com/neondatabase/neon/issues/1012
2023-12-26 14:12:53 +04:00
Konstantin Knizhnik
572bc06011 Do not copy WAL for lagged slots (#6221)
## Problem

See https://neondb.slack.com/archives/C026T7K2YP9/p1702813041997959

## Summary of changes

Do not take into account invalidated slots when calculating the restart_lsn
position for basebackup at the pageserver

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-12-22 20:47:55 +02:00
Arpad Müller
a7342b3897 remote_storage: store last_modified and etag in Download (#6227)
Store the content of the `last-modified` and `etag` HTTP headers in
`Download`.

This serves both as the first step towards #6199 and as a preparation
for tests in #6155 .
2023-12-22 14:13:20 +01:00
John Spray
e68ae2888a pageserver: expedite tenant activation on delete (#6190)
## Problem

During startup, a tenant delete request might have to retry for many
minutes waiting for a tenant to enter Active state.

## Summary of changes

- Refactor delete_tenant into TenantManager: this is not a functional
change, but will avoid merge conflicts with
https://github.com/neondatabase/neon/pull/6105 later
- Add 412 responses to the swagger definition of this endpoint.
- Use Tenant::wait_to_become_active in `TenantManager::delete_tenant`

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-12-22 10:22:22 +00:00
Arpad Müller
83000b3824 buildtools: update protoc and mold (#6222)
These updates aren't very important but I would like to try out the new
process as of #6195
2023-12-21 18:07:21 +01:00
Arpad Müller
a21b719770 Use neon-github-ci-tests S3 bucket for remote_storage tests (#6216)
This bucket is already used by the pytests. The current bucket
github-public-dev is meant more for longer-living artifacts.

slack thread:
https://neondb.slack.com/archives/C039YKBRZB4/p1703124944669009

Part of https://github.com/neondatabase/cloud/issues/8233 / #6155
2023-12-21 17:28:28 +01:00
Alexander Bayandin
1dff98be84 CI: fix build-tools image tag for PRs (#6217)
## Problem

Fix build-tools image tag calculation for PRs.
Broken in https://github.com/neondatabase/neon/pull/6195

## Summary of changes
- Use the `pinned` tag instead of `$GITHUB_RUN_ID` if there are no changes in
the dockerfile (and we don't build such an image)
2023-12-21 14:55:24 +00:00
Arpad Müller
7d6fc3c826 Use pre-generated initdb.tar.zst in test_ingest_real_wal (#6206)
This implements the TODO mentioned in the test added by #5892.
2023-12-21 14:23:09 +00:00
Abhijeet Patil
61b6c4cf30 Build dockerfile from neon repo (#6195)
## Fixing GitHub workflow issue related to build and push images

## Summary of changes
Followup of PR#608 [move docker file from build repo to neon] to solve
some issues.

The build started failing because it missed a validation in the logic that
determines changes in the docker file. Also, all the dependent jobs were
skipped because of the build-and-push image job. To address the above issue,
the following changes were made:

- We are adding validation to generate the image tag even if it's a merge to
the repo.
- All the dependent jobs won't be skipped even if the build-and-push image job
is skipped.
- We have moved the tag-generation logic into the sub-workflow. As the tag
name had to be passed to the sub-workflow, it made sense to abstract that away
where it was needed and store it as an output variable so that downstream
dependent jobs could access the value.
- This made the dependency logic easier and we don't need complex expressions
to check the condition on which it will run.
- An earlier PR that tried to solve a similar problem was closed; it has some
feedback and context from before this PR was created:
https://github.com/neondatabase/neon/pull/6175

## Checklist before requesting a review

- [x] Move the tag generation logic from the main workflow to the
sub-workflow of build and push the image
- [x] Add a condition to generate an image tag for a non-PR-related run 
- [x] remove complex expressions from the jobs' if-conditions

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Abhijeet Patil <abhijeet@neon.tech>
2023-12-21 12:46:51 +00:00
Bodobolero
f93d15f781 add comment to run vacuum for clickbench (#6212)
## Problem

This is a comment-only change.
To ensure that our benchmarking results are fair we need to have correct
stats in the catalog. Otherwise the optimizer chooses a seq scan instead of an
index-only scan for some queries. Added a comment to run vacuum after data prep.
2023-12-21 13:34:31 +01:00
Christian Schwarz
5385791ca6 add pageserver component-level benchmark (pagebench) (#6174)
This PR adds a component-level benchmarking utility for pageserver.
Its name is `pagebench`.

The problem solved by `pagebench` is that we want to put Pageserver
under high load.

This isn't easily achieved with `pgbench` because it needs to go through
a compute, which has significant performance overhead compared to
accessing Pageserver directly.

Further, compute has its own performance optimizations (most
importantly: caches). Instead of designing a compute-facing workload
that defeats those internal optimizations, `pagebench` simply bypasses
them by accessing pageserver directly.

Supported benchmarks:

* getpage@latest_lsn
* basebackup
* triggering logical size calculation

This code has no automated users yet.
A performance regression test for getpage@latest_lsn will be added in a
later PR.

part of https://github.com/neondatabase/neon/issues/5771
2023-12-21 13:07:23 +01:00
Conrad Ludgate
2df3602a4b Add GC to http connection pool (#6196)
## Problem

HTTP connection pool will grow without being pruned

## Summary of changes

Remove connection clients from pools once idle, or once they exit.
Periodically clear pool shards.

GC Logic:

Each shard contains a hashmap of `Arc<EndpointPool>`s.
Each connection stores a `Weak<EndpointPool>`.

During a GC sweep, we take a random shard's write lock and check whether
any of the `Arc<EndpointPool>`s are unique (using `Arc::get_mut`).
- If they are unique, then we check that the endpoint-pool is empty, and
sweep if it is.
- If they are not unique, then the endpoint-pool is in active use and we
don't sweep.
- Idle connections will self-clear from the endpoint-pool after 5
minutes.

Technically, the uniqueness of the endpoint-pool should be enough to
consider it empty, but the connection count check is done for
completeness sake.
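
The sweep rule can be sketched with made-up, simplified types (the real pool uses async locks and tracks per-endpoint connection counts): because connections hold `Weak` references, `Arc::get_mut` only succeeds when nothing references the endpoint pool any more.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Made-up, simplified types: each connection holds a Weak<EndpointPool>, so
// Arc::get_mut (which requires no other strong *or* weak references) only
// succeeds when nothing is using the endpoint pool any more.
struct EndpointPool {
    idle_conns: Vec<u64>, // placeholder for pooled client handles
}

type Shard = RwLock<HashMap<String, Arc<EndpointPool>>>;

fn sweep_shard(shard: &Shard) {
    let mut shard = shard.write().unwrap();
    shard.retain(|_endpoint, pool| match Arc::get_mut(pool) {
        // Unique and empty: safe to sweep. Unique but non-empty: keep for now.
        Some(p) => !p.idle_conns.is_empty(),
        // Not unique: some connection still references it, keep it.
        None => true,
    });
}
```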
2023-12-21 12:00:10 +00:00
Arpad Müller
48890d206e Simplify inject_index_part test function (#6207)
Instead of manually constructing the directory's path, we can just use
the `parent()` function.

This is a drive-by improvement from #6206
2023-12-21 12:52:38 +01:00
Arpad Müller
baa1323b4a Use ProfileFileCredentialsProvider for AWS SDK configuration (#6202)
Allows usage via `aws sso login --profile=<p>; AWS_PROFILE=<p>`. There is no
longer any need to manually configure things via `SSO_ACCOUNT_ID` and others.
One can now run the tests locally (given Neon employee access to AWS):

```
aws sso login --profile dev
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty REMOTE_STORAGE_S3_REGION=eu-central-1 REMOTE_STORAGE_S3_BUCKET=neon-github-public-dev AWS_PROFILE=dev
cargo test -p remote_storage -j 1 s3 -- --nocapture
```

Also makes the scrubber use the same region for auth as it does its
operations in (not touching the hard-coded role name and start_url
values here; they are not ideal, though).
2023-12-20 22:38:58 +00:00
Joonas Koivunen
48f156b8a2 feat: relative last activity based eviction (#6136)
Adds a new disk usage based eviction option, EvictionOrder, which
selects whether to use the current `AbsoluteAccessed` or this new
proposed but not yet tested `RelativeAccessed`. Additionally a fudge
factor was noticed while implementing this, which might help spare
smaller tenants at the expense of targeting larger tenants.

Cc: #5304

Co-authored-by: Arpad Müller <arpad@neon.tech>
2023-12-20 18:44:19 +00:00
John Spray
ac38d3a88c remote_storage: don't count 404s as errors (#6201)
## Problem

Currently a chart of S3 error rate is misleading: it can show errors any
time we are attaching a tenant (probing for index_part generation,
checking for remote delete marker).

Considering 404 successful isn't perfectly elegant, but it enables the
error rate to be used as a more meaningful alert signal: it would
indicate if we were having auth issues, sending bad requests, getting
throttled, etc.

## Summary of changes

Track 404 requests in the AttemptOutcome::Ok bucket instead of the
AttemptOutcome::Err bucket.
2023-12-20 17:00:29 +00:00
Arthur Petukhovsky
0f56104a61 Make sk_collect_dumps also possible with teleport (#4739)
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-12-20 15:06:55 +00:00
John Spray
f260f1565e pageserver: fixes + test updates for sharding (#6186)
This is a precursor to:
- https://github.com/neondatabase/neon/pull/6185

While that PR contains big changes to neon_local and attachment_service,
this PR contains a few unrelated standalone changes generated while
working on that branch:
- Fix restarting a pageserver when it contains multiple shards for the
same tenant
- When using location_config api to attach a tenant, create its
timelines dir
- Update test paths where generations were previously optional to make
them always-on: this avoids tests having to spuriously assert that
attachment_service is not None in order to make the linter happy.
- Add a TenantShardId python implementation for subsequent use in test
helpers that will be made shard-aware
- Teach scrubber to read across shards when checking for layer
existence: this is a refactor to track the list of existent layers at
tenant-level rather than locally to each timeline. This is a precursor
to testing shard splitting.
2023-12-20 12:26:20 +00:00
Joonas Koivunen
c29df80634 fix(layer): move backoff to spawned task (#5746)
Move the backoff to spawned task as it can still be useful; make the
sleep cancellable.
2023-12-20 10:26:06 +02:00
Em Sharnoff
58dbca6ce3 Bump vm-builder v0.19.0 -> v0.21.0 (#6197)
Only applicable change was neondatabase/autoscaling#650, reducing the
vector scrape interval (inside the VM) from 15 seconds to 1 second.
2023-12-19 23:48:41 +00:00
Arthur Petukhovsky
613906acea Support custom types in broker (#5761)
Old methods are unchanged for backwards compatibility. Added
`SafekeeperDiscoveryRequest` and `SafekeeperDiscoveryResponse` types to
serve as example, and also as a prerequisite for
https://github.com/neondatabase/neon/issues/5471
2023-12-19 17:06:43 +00:00
Christian Schwarz
82809d2ec2 fix metric pageserver_initial_logical_size_start_calculation (#6191)
It wasn't being incremented.

Fixup of

    commit 1c88824ed0
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Fri Dec 1 12:52:59 2023 +0100

        initial logical size calculation: add a bunch of metrics (#5995)
2023-12-19 17:44:49 +01:00
Anastasia Lubennikova
0bd79eb063 Handle role deletion when project has no databases. (#6170)
There is still the default 'postgres' database, which may contain objects
owned by the role or some ACLs. We need to reassign objects in this
database too.

## Problem
If the customer deleted all databases and then tries to delete a role that
has some non-standard ACLs, the `apply_config` operation will get stuck
because of the failing role deletion.
2023-12-19 16:27:47 +00:00
Konstantin Knizhnik
8ff5387da1 eliminate GCC warning for unchecked result of fread (#6167)
## Problem


GCC produces a warning that the fread result is not checked. It doesn't affect
program logic, but it's better to live without warnings.

## Summary of changes

Check read result.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2023-12-19 18:17:11 +02:00
Arpad Müller
8b91bbc38e Update jsonwebtoken to 9 and sct to 0.7.1 (#6189)
This increases the list of crates that are based on `ring` 0.17.
2023-12-19 15:45:17 +00:00
Christian Schwarz
e6bf6952b8 higher resolution histograms for getpage@lsn (#6177)
part of https://github.com/neondatabase/cloud/issues/7811
2023-12-19 14:46:17 +01:00
Arpad Müller
a2fab34371 Update zstd to 0.13 (#6187)
This updates the `zstd` crate to 0.13, and `zstd-sys` with it (it
contains C so we should always run the newest version of that).
2023-12-19 13:16:53 +00:00
Vadim Kharitonov
c52384752e Compile pg_semver extension (#6184)
Closes #6183
2023-12-19 15:10:07 +02:00
Bodobolero
73d247c464 Analyze clickbench performance with explain plans and pg_stat_statements (#6161)
## Problem

To understand differences in performance between neon, aurora and rds we
want to collect explain analyze plans and pg_stat_statements for
selected benchmarking runs

## Summary of changes

Add workflow input options to collect explain and pg_stat_statements for
benchmarking workflow

Co-authored-by: BodoBolero <bodobolero@gmail.com>
2023-12-19 11:44:25 +00:00
Arseny Sher
b701394d7a Fix WAL waiting in walproposer for v16.
Just preparing the cv right before waiting is not enough, as we might have
already missed the flushptr change & wakeup, so re-check before sleeping.

https://neondb.slack.com/archives/C03QLRH7PPD/p1702830965396619?thread_ts=1702756761.836649&cid=C03QLRH7PPD
2023-12-19 15:34:14 +04:00
John Spray
d89af4cf8e pageserver: downgrade 'connection reset' WAL errors (#6181)
This squashes a particularly noisy warn-level log that occurs when
safekeepers are restarted.

Unfortunately the error type from `tonic` doesn't provide a neat way of
matching this, so we use a string comparison
2023-12-19 10:38:00 +00:00
Christian Schwarz
6ffbbb2e02 include timeline ids in tenant details response (#6166)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771

This allows getting the list of tenants and timelines without triggering
initial logical size calculation by requesting the timeline details API
response, which would skew our results.
2023-12-19 10:32:51 +00:00
Arpad Müller
fbb979d5e3 remote_storage: move shared utilities for S3 and Azure into common module (#6176)
The PR does two things:

* move the util functions present in the remote_storage Azure and S3
test files into a shared one, deduplicating them.
* add a `s3_upload_download_works` test as a copy of the Azure test

The goal is mainly to fight duplication and make the code a little bit
more generic (like removing mentions of s3 and azure from function
names).

This is a first step towards #6146.
2023-12-19 11:29:50 +01:00
Arpad Müller
a89d6dc76e Always send a json response for timeline_get_lsn_by_timestamp (#6178)
As part of the transition laid out in
[this](https://github.com/neondatabase/cloud/pull/7553#discussion_r1370473911)
comment, don't read the `version` query parameter in
`timeline_get_lsn_by_timestamp`, but always return the structured json
response.

Follow-up of https://github.com/neondatabase/neon/pull/5608
2023-12-19 11:29:16 +01:00
Christian Schwarz
c272c68e5c RFC: Per-Tenant GetPage@LSN Throttling (#5648)
Implementation epic: https://github.com/neondatabase/neon/issues/5899
2023-12-19 11:20:56 +01:00
Anna Khanova
6e6e40dd7f Invalidate credentials on auth failure (#6171)
## Problem

If the user resets the password, the cache could receive this information only
after `ttl` minutes.

## Summary of changes

Invalidate password on auth failure.
2023-12-18 23:24:22 +01:00
Heikki Linnakangas
6939fc3db6 Remove declarations of non-existent global variables and functions
FileCacheMonitorMain was removed in commit b497d0094e.
2023-12-18 21:05:31 +02:00
Heikki Linnakangas
c4c48cfd63 Clean up #includes
- No need to include c.h, port.h or pg_config.h, they are included in
  postgres.h
- No need to include postgres.h in header files. Instead, the
  assumption in PostgreSQL is that all .c files include postgres.h.
- Reorder includes to alphabetical order, and system headers before
  pgsql headers
- Remove bunch of other unnecessary includes that got copy-pasted from
  one source file to another
2023-12-18 21:05:29 +02:00
Heikki Linnakangas
82215d20b0 Mark some variables 'static'
Move initialization of neon_redo_read_buffer_filter. This allows
marking it 'static', too.
2023-12-18 21:05:24 +02:00
Sasha Krassovsky
62737f3776 Grant BYPASSRLS and REPLICATION explicitly to neon_superuser roles 2023-12-18 10:54:14 -08:00
Christian Schwarz
1f9a7d1cd0 add a Rust client for Pageserver page_service (#6128)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771

Stacked atop https://github.com/neondatabase/neon/pull/6145
2023-12-18 18:17:19 +00:00
John Spray
4ea4812ab2 tests: update python dependencies (#6164)
## Problem

Existing dependencies didn't work on Fedora 39 (python 3.12)

## Summary of changes

- Update pyyaml 6.0 -> 6.0.1
- Update yarl 1.8.2->1.9.4
- Update the `dnf install` line in README to include dependencies of
python packages (unrelated to upgrades, just noticed absences while
doing fresh pysync run)
2023-12-18 15:47:09 +00:00
Anna Khanova
00d90ce76a Added cache for get role secret (#6165)
## Problem

Currently if we are getting many consecutive connections to the same
user/ep we will send a lot of traffic to the console.

## Summary of changes

Cache the proxy_get_role_secret response with ttl=4min.

Note: this is a temporary hack; the notifier listener is WIP.
2023-12-18 16:04:47 +01:00
John Khvatov
33cb9a68f7 pageserver: Reduce tracing overhead in timeline::get (#6115)
## Problem

The compaction process (specifically the image layer reconstruction part)
is lagging behind WAL ingest (at a speed of ~10-15MB/s) for medium-sized
tenants (30-50GB). A CPU profile shows that a significant amount of time
(see flamegraph) is being spent in `tracing::span::Span::new`.

mainline (commit: 0ba4cae491):

![reconstruct-mainline-0ba4cae491c2](https://github.com/neondatabase/neon/assets/289788/ebfd262e-5c97-4858-80c7-664a1dbcc59d)

## Summary of changes

By lowering the tracing level in `get_value_reconstruct_data` and `get_or_maybe_download` from info to debug, we can reduce the overhead of span creation in prod environments. On my system, this sped up the image reconstruction process by 60% (from 14500 to 23160 page reconstructions per second).

pr:

![reconstruct-opt-2](https://github.com/neondatabase/neon/assets/289788/563a159b-8f2f-4300-b0a1-6cd66e7df769)


`create_image_layers()` (it's 1 CPU bound here) mainline vs pr:

![image](https://github.com/neondatabase/neon/assets/289788/a981e3cb-6df9-4882-8a94-95e99c35aa83)
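For illustration, the kind of change involved looks roughly like this (hypothetical function names, using `tracing`'s `instrument` attribute):

```rust
use tracing::instrument;

// Hypothetical stand-ins for the two functions named above.

// Before: an info-level span is created on every call on the hot path.
#[instrument(level = "info")]
fn reconstruct_value_before() { /* ... */ }

// After: with the usual info-level filter in prod, this debug span is
// disabled at the callsite, so most of the Span::new cost goes away.
#[instrument(level = "debug")]
fn reconstruct_value_after() { /* ... */ }
```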
2023-12-18 13:33:23 +00:00
Conrad Ludgate
17bde7eda5 proxy refactor large files (#6153)
## Problem

The `src/proxy.rs` file is far too large

## Summary of changes

Creates 3 new files:
```
src/metrics.rs
src/proxy/retry.rs
src/proxy/connect_compute.rs
```
2023-12-18 10:59:49 +00:00
John Spray
dbdb1d21f2 pageserver: on-demand activation cleanups (#6157)
## Problem

#6112 added some logs and metrics: clean these up a bit:
- Avoid counting startup completions for tenants launched after startup
- exclude no-op cases from timing histograms 
- remove a rogue log message
2023-12-18 10:29:19 +00:00
Arseny Sher
e1935f42a1 Don't generate core dump when walproposer intentionally panics.
Walproposer sometimes intentionally PANICs when its term is defeated as the
basebackup is likely spoiled by that time. We don't want core dumped in this
case.
2023-12-18 11:03:34 +04:00
Alexander Bayandin
9bdc25f0af Revert "CI: build build-tools image" (#6156)
It turns out the issue with skipped jobs is not so trivial (because GitHub checks jobs transitively), and a possible workaround with `if:
always() && contains(fromJSON('["success", "skipped"]'),
needs.build-buildtools-image.result)` would tangle the workflow really badly. We'll need to come up with a better solution.

To unblock the main I'm going to revert
https://github.com/neondatabase/neon/pull/6082.
2023-12-16 12:32:00 +00:00
Christian Schwarz
47873470db pageserver: add method to dump keyspace in mgmt api client (#6145)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771
2023-12-16 10:52:48 +00:00
Abhijeet Patil
8619e6295a CI: build build-tools image (#6082)
## Problem

Currently our build Dockerfile is located in the build repo; it makes sense to have it as part of the neon repo.

## Summary of changes

The Dockerfile that we use to build our binary and other tools used to reside in the build repo. It makes sense to bring it into the repo where it is used, so that contributors can also view it and amend it if required. It also reduces maintenance: Dockerfile changes and code changes can be accommodated in the same PR. Also, building the image and pushing it to ECR is abstracted into a reusable workflow; ideally that should be used for any other such jobs too.

## Checklist before requesting a review

- [x] Moved the Dockerfile used to build the binary from the build repo to the neon repo
- [x] Added a GitHub workflow to build and push the image
- [x] Added a GitHub workflow to tag the pushed image
- [x] Updated the README file

---------

Co-authored-by: Abhijeet Patil <abhijeet@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-12-16 10:33:52 +00:00
Conrad Ludgate
83811491da update zerocopy (#6148)
## Problem

https://github.com/neondatabase/neon/security/dependabot/48

```
$ cargo tree -i zerocopy
zerocopy v0.7.3
└── ahash v0.8.5
    └── hashbrown v0.13.2
```

ahash doesn't use the affected APIs, so we are not vulnerable, but it's best to update anyway to silence the alert.

## Summary of changes

```
$ cargo update -p zerocopy --precise 0.7.31
    Updating crates.io index
    Updating syn v2.0.28 -> v2.0.32
    Updating zerocopy v0.7.3 -> v0.7.31
    Updating zerocopy-derive v0.7.3 -> v0.7.31
```
2023-12-16 09:06:00 +00:00
John Spray
d066dad84b pageserver: prioritize activation of tenants with client requests (#6112)
## Problem

During startup, a client request might have to wait a long time while
the system is busy initializing all the attached tenants, even though
most of the attached tenants probably don't have any client requests to
service, and could wait a bit.

## Summary of changes

- Add a semaphore to limit how many Tenant::spawn()s may concurrently do
I/O to attach their tenant (i.e. read indices from remote storage, scan
local layer files, etc).
- Add Tenant::activate_now, a hook for kicking a tenant in its spawn()
method to skip waiting for the warmup semaphore
- For tenants that attached via warmup semaphore units, wait for logical
size calculation to complete before dropping the warmup units
- Set Tenant::activate_now in `get_active_tenant_with_timeout` (the page
service's path for getting a reference to a tenant).
- Wait for tenant activation in HTTP handlers for timeline creation and
deletion: like page service requests, these require an active tenant and
should prioritize activation if called.
2023-12-15 20:37:47 +00:00
John Spray
56f7d55ba7 pageserver: basic cancel/timeout for remote storage operations (#6097)
## Problem

Various places in remote storage were not subject to a timeout (thereby
stuck TCP connections could hold things up), and did not respect a
cancellation token (so things like timeline deletion or tenant detach
would have to wait arbitrarily long).



## Summary of changes

- Add download_cancellable and upload_cancellable helpers, and use them
in all the places we wait for remote storage operations (with the
exception of initdb downloads, where it would not have been safe).
- Add a cancellation token arg to `download_retry`.
- Use cancellation token args in various places that were missing one
per #5066

Closes: #5066 

Why is this only "basic" handling?
- Doesn't express difference between shutdown and errors in return
types, to avoid refactoring all the places that use an anyhow::Error
(these should all eventually return a more structured error type)
- Implements timeouts on top of remote storage, rather than within it:
this means that operations hitting their timeout will lose their
semaphore permit and thereby go to the back of the queue for their
retry.
- Doing a nicer job is tracked in
https://github.com/neondatabase/neon/issues/6096
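A minimal sketch of the shape of such a helper (assumed/hypothetical signature, not the exact remote storage code):

```rust
use std::future::Future;
use std::time::Duration;
use tokio_util::sync::CancellationToken;

enum RemoteOpError {
    Cancelled,
    Timeout,
    Other(anyhow::Error),
}

/// Run a remote storage future, giving up on cancellation or after a timeout.
async fn download_cancellable<F, T>(
    cancel: &CancellationToken,
    timeout: Duration,
    fut: F,
) -> Result<T, RemoteOpError>
where
    F: Future<Output = Result<T, anyhow::Error>>,
{
    tokio::select! {
        _ = cancel.cancelled() => Err(RemoteOpError::Cancelled),
        res = tokio::time::timeout(timeout, fut) => match res {
            Ok(Ok(v)) => Ok(v),
            Ok(Err(e)) => Err(RemoteOpError::Other(e)),
            Err(_elapsed) => Err(RemoteOpError::Timeout),
        },
    }
}
```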
2023-12-15 17:43:02 +00:00
Christian Schwarz
1a9854bfb7 add a Rust client for Pageserver management API (#6127)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771

This PR moves the control plane's spread-all-over-the-place client for
the pageserver management API into a separate module within the
pageserver crate.

I need that client to be async in my benchmarking work, so, this PR
switches to the async version of `reqwest`.
That is also the right direction generally IMO.

The switch to async in turn mandated converting most of the
`control_plane/` code to async.

Note that some of the client methods should be taking `TenantShardId`
instead of `TenantId`, but, none of the callers seem to be
sharding-aware.
Leaving that for another time:
https://github.com/neondatabase/neon/issues/6154
2023-12-15 18:33:45 +01:00
John Spray
de1a9c6e3b s3_scrubber: basic support for sharding (#6119)
This doesn't make the scrubber smart enough to understand that many shards are part of the same tenant, but it makes it understand paths
well enough to scrub the individual shards without thinking they're
malformed.

This is a prerequisite to being able to run tests with sharding enabled.

Related: #5929
2023-12-15 15:48:55 +00:00
Arseny Sher
e62569a878 A few comments on rust walproposer build. 2023-12-15 19:31:51 +04:00
John Spray
bd1cb1b217 tests: update allow list for negative_env (#6144)
Tests attaching the tenant immediately after the fixture detaches it
could result in LSN updates failing validation

e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6142/7211196140/index.html#suites/7745dadbd815ab87f5798aa881796f47/32b12ccc0b01b122
2023-12-15 15:08:28 +00:00
Conrad Ludgate
98629841e0 improve proxy code cov (#6141)
## Summary of changes

Saw some low-hanging codecov improvements. Even if code coverage is somewhat of a pointless game, we might as well add tests where we can and delete code if it's unused.
2023-12-15 12:11:50 +00:00
Arpad Müller
215cdd18c4 Make initdb upload retries cancellable and seek to beginning (#6147)
* initdb uploads had no cancellation token, which means that when we were stuck in upload retries, we wouldn't be able to delete the timeline. In general, the combination of retrying forever and not having cancellation tokens is quite dangerous.
* initdb uploads wouldn't rewind the file. This wasn't discovered by the purposefully unreliable test S3 in pytest because that always fails on the first byte, not somewhere during the connection. We'd be getting errors from the AWS SDK that the file ended unexpectedly (see the sketch below).

slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1702632247784079
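The rewind part amounts to something like this sketch (hypothetical surrounding retry loop):

```rust
use tokio::fs::File;
use tokio::io::AsyncSeekExt;

// Hypothetical retry loop: rewind before every attempt, otherwise a retried
// upload starts mid-file and the SDK reports an unexpected end of body.
async fn upload_with_rewind(file: &mut File) -> anyhow::Result<()> {
    loop {
        file.seek(std::io::SeekFrom::Start(0)).await?;
        match try_upload(file).await {
            Ok(()) => return Ok(()),
            // backoff and cancellation checks omitted for brevity
            Err(e) => tracing::warn!("initdb upload failed, retrying: {e:#}"),
        }
    }
}

async fn try_upload(_file: &mut File) -> anyhow::Result<()> {
    Ok(()) // placeholder for the actual remote storage upload call
}
```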
2023-12-15 12:11:25 +00:00
Joonas Koivunen
0fd80484a9 fix: Timeline deletion during busy startup (#6133)
Compaction was holding back timeline deletion because the compaction lock had been acquired while the semaphore was still being waited on. Timeline deletion was waiting on the same lock for 1500s.

This replaces `pageserver::tenant::tasks::concurrent_background_tasks_rate_limit` (which looks correct) with a simpler `..._permit`, which is just an infallible acquire and makes it easier to spot that it needs to be raced with cancellation tokens.

Ref: https://neondb.slack.com/archives/C03F5SM1N02/p1702496912904719
Ref: https://neondb.slack.com/archives/C03F5SM1N02/p1702578093497779
2023-12-15 11:59:24 +00:00
Joonas Koivunen
07508fb110 fix: better Json parsing errors (#6135)
Before, any JSON parsing from the HTTP API only returned per-field errors. Now it is done using `serde_path_to_error`, which at least helped greatly with the `disk_usage_eviction_run` endpoint used for testing. I don't think this can conflict with anything added in #5310.
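For reference, the crate is used roughly like this (sketch with a hypothetical request type):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct EvictionRunRequest {
    // hypothetical field, for illustration only
    evict_bytes: u64,
}

fn parse_body(body: &[u8]) -> Result<EvictionRunRequest, String> {
    let mut de = serde_json::Deserializer::from_slice(body);
    serde_path_to_error::deserialize(&mut de)
        // the error now names the offending field, e.g. `evict_bytes`
        .map_err(|e| format!("malformed request at `{}`: {}", e.path(), e.inner()))
}
```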
2023-12-15 12:18:22 +02:00
Arseny Sher
5bb9ba37cc Fix python list_segments of sk.
Fixes rare test_peer_recovery flakiness, as we started to compare the tmp control file.

https://neondb.slack.com/archives/C04KGFVUWUQ/p1702310929657179
2023-12-15 13:43:11 +04:00
John Spray
f1cd1a2122 pageserver: improved handling of concurrent timeline creations on the same ID (#6139)
## Problem

Historically, the pageserver used an "uninit mark" file on disk for two
purposes:
- Track which timeline dirs are incomplete for handling on restart
- Avoid trying to create the same timeline twice at the same time.

The original purpose of handling restarts is now defunct, as we use
remote storage as the source of truth and clean up any trash timeline
dirs on startup. Using the file to mutually exclude creation operations
is error prone compared with just doing it in memory, and the existing
checks happened some way into the creation operation, and could expose
errors as 500s (anyhow::Errors) rather than something clean.

## Summary of changes

- Creations are now mutually excluded in memory (using
`Tenant::timelines_creating`), rather than relying on a file on disk for
coordination.
- Acquiring unique access to the timeline ID now happens earlier in the
request.
- Creating the same timeline which already exists is now a 201: this
simplifies retry handling for clients.
- 409 is still returned if a timeline with the same ID is still being
created: if this happens it is probably because the client timed out an
earlier request and has retried.
- Colliding timeline creation requests should no longer return 500
errors

This paves the way to entirely removing uninit markers in a subsequent
change.
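A minimal sketch of the in-memory exclusion (hypothetical names; the real state lives in `Tenant::timelines_creating`):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

type TimelineId = u128; // stand-in for the real id type

struct CreateGuard<'a> {
    creating: &'a Mutex<HashSet<TimelineId>>,
    id: TimelineId,
}

impl Drop for CreateGuard<'_> {
    fn drop(&mut self) {
        // creation finished (or failed): allow the next attempt for this id
        self.creating.lock().unwrap().remove(&self.id);
    }
}

/// Returns a guard if we are the only in-flight creation for `id`;
/// otherwise the caller answers 409 Conflict.
fn start_creating(
    creating: &Mutex<HashSet<TimelineId>>,
    id: TimelineId,
) -> Option<CreateGuard<'_>> {
    if creating.lock().unwrap().insert(id) {
        Some(CreateGuard { creating, id })
    } else {
        None
    }
}
```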

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-15 08:51:23 +00:00
Joonas Koivunen
f010479107 feat(layer): pageserver_layer_redownloaded_after histogram (#6132)
This is aimed at replacing the current mtime-only-based thrashing alerting later.

Cc: #5331
2023-12-14 21:32:54 +02:00
Conrad Ludgate
cc633585dc gauge guards (#6138)
## Problem

The websockets gauge for active db connections seems to be growing more
than the gauge for client connections over websockets, which does not
make sense.

## Summary of changes

Refactor how our counter-pair gauges are represented. Not sure if this will fix the problem, but it should be harder to mess up the counters. The API is much nicer now and doesn't require scopeguard::defer hacks.
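A sketch of the guard pattern (assuming the `prometheus` crate's `IntGauge`; hypothetical helper, not the exact proxy API):

```rust
use prometheus::IntGauge;

/// Increments the gauge on construction and decrements it on drop, so a
/// connection can never leak a decrement on an early return or panic.
struct GaugeGuard {
    gauge: IntGauge,
}

impl GaugeGuard {
    fn new(gauge: IntGauge) -> Self {
        gauge.inc();
        Self { gauge }
    }
}

impl Drop for GaugeGuard {
    fn drop(&mut self) {
        self.gauge.dec();
    }
}
```

A connection handler holds the guard for its whole lifetime; every exit path then decrements exactly once.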
2023-12-14 17:21:39 +00:00
Christian Schwarz
aa5581d14f utils::logging: TracingEventCountLayer: don't use with_label_values() on hot path (#6129)
fixes #6126
2023-12-14 16:31:41 +01:00
John Spray
c4e0ef507f pageserver: heatmap uploads (#6050)
Dependency (commits inline):
https://github.com/neondatabase/neon/pull/5842

## Problem

Secondary mode tenants need a manifest of what to download. Ultimately
this will be some kind of heat-scored set of layers, but as a robust
first step we will simply use the set of resident layers: secondary
tenant locations will aim to match the on-disk content of the attached
location.

## Summary of changes

- Add heatmap types representing the remote structure
- Add hooks to Tenant/Timeline for generating these heatmaps
- Create a new `HeatmapUploader` type that is external to `Tenant`, and
responsible for walking the list of attached tenants and scheduling
heatmap uploads.

Notes to reviewers:
- Putting the logic for uploads (and later, secondary mode downloads)
outside of `Tenant` is an opinionated choice, motivated by:
- Enable future smarter scheduling of operations, e.g. uploading the
stalest tenant first, rather than having all tenants compete for a fair
semaphore on a first-come-first-served basis. Similarly for downloads,
we may wish to schedule the tenants with the hottest un-downloaded
layers first.
- Enable accessing upload-related state without synchronization (it
belongs to HeatmapUploader, rather than being some Mutex<>'d part of
Tenant)
- Avoid further expanding the scope of Tenant/Timeline types, which are
already among the largest in the codebase
- You might reasonably wonder how much of the uploader code could be a
generic job manager thing. Probably some of it: but let's defer pulling
that out until we have at least two users (perhaps secondary downloads
will be the second one) to highlight which bits are really generic.

Compromises:
- Later, instead of using digests of heatmaps to decide whether anything
changed, I would prefer to avoid walking the layers in tenants that
don't have changes: tracking that will be a bit invasive, as it needs
input from both remote_timeline_client and Layer.
2023-12-14 13:09:24 +00:00
Conrad Ludgate
6987b5c44e proxy: add more rates to endpoint limiter (#6130)
## Problem

Single rate bucket is limited in usefulness

## Summary of changes

Introduce a secondary bucket allowing an average of 200 requests per
second over 1 minute, and a tertiary bucket allowing an average of 100
requests per second over 10 minutes.

Configured by using a format like

```sh
proxy --endpoint-rps-limit 300@1s --endpoint-rps-limit 100@10s --endpoint-rps-limit 50@1m
```

If the bucket limits are inconsistent, an error is returned on startup

```
$ proxy --endpoint-rps-limit 300@1s --endpoint-rps-limit 10@10s
Error: invalid endpoint RPS limits. 10@10s allows fewer requests per bucket than 300@1s (100 vs 300)
```
2023-12-13 21:43:49 +00:00
Alexander Bayandin
0cd49cac84 test_compatibility: make it use initdb.tar.zst 2023-12-13 15:04:25 -06:00
Alexander Bayandin
904dff58b5 test_wal_restore_http: cleanup test 2023-12-13 15:04:25 -06:00
Arthur Petukhovsky
f401a21cf6 Fix test_simple_sync_safekeepers
There is a postgres 16 version encoded in a binary message.
2023-12-13 15:04:25 -06:00
Tristan Partin
158adf602e Update Postgres 16 series to 16.1 2023-12-13 15:04:25 -06:00
Tristan Partin
c94db6adbb Update Postgres 15 series to 15.5 2023-12-13 15:04:25 -06:00
Tristan Partin
85720616b1 Update Postgres 14 series to 14.10 2023-12-13 15:04:25 -06:00
George MacKerron
d6fcc18eb2 Add Neon-Batch- headers to OPTIONS response for SQL-over-HTTP requests (#6116)
This is needed to allow use of batch queries from browsers.

## Problem

SQL-over-HTTP batch queries fail from web browsers because the relevant
headers, `Neon-Batch-Isolation-Level` and `Neon-Batch-Read-Only`, are
not included in the server's OPTIONS response. I think we simply forgot
to add them when implementing the batch query feature.

## Summary of changes

Added `Neon-Batch-Isolation-Level` and `Neon-Batch-Read-Only` to the
OPTIONS response.
2023-12-13 17:18:20 +00:00
Vadim Kharitonov
c2528ae671 Increase pgbouncer pool size to 64 for VMs (#6124)
The pool size was changed for pods (https://github.com/neondatabase/cloud/pull/8057). The idea is to increase it for VMs too.
2023-12-13 16:23:24 +00:00
Joonas Koivunen
a919b863d1 refactor: remove eviction batching (#6060)
We no longer have `layer_removal_cs` since #5108, we no longer need
batching.
2023-12-13 18:05:33 +02:00
Joonas Koivunen
2d22661061 refactor: calculate_synthetic_size_worker, remove PRE::NeedsDownload (#6111)
Changes I wanted to make in #6106 but decided to leave out to keep that commit clean for inclusion in #6090. Finally remove `PageReconstructionError::NeedsDownload`.
2023-12-13 14:23:19 +00:00
John Spray
e3778381a8 tests: make test_bulk_insert recreate tenant in same generation (#6113)
## Problem

The test deletes a tenant and recreates it with the same ID. The recreation bumps the generation number. This could lead to stale generation warnings in the logs.

## Summary of changes

Handle this more gracefully by re-creating in the same generation that the tenant was previously attached in.

We could also update the tenant delete path to have the attachment service drop tenant state on delete, but I like having it there: it makes debugging easier, and the only time it's a problem is when a test is re-using a tenant ID after deletion.

2023-12-13 14:14:38 +00:00
Conrad Ludgate
c8316b7a3f simplify endpoint limiter (#6122)
## Problem

1. Using chrono for durations only is wasteful
2. The arc/mutex was not being utilised
3. Locking every shard in the dashmap every GC could cause latency
spikes
4. More buckets

## Summary of changes

1. Use `Instant` instead of `NaiveTime`.
2. Remove the `Arc<Mutex<_>>` wrapper, utilising that dashmap entry
returns mut access
3. Clear only a random shard, update gc interval accordingly
4. Multiple buckets can be checked before allowing access

When I benchmarked the check function, it took on average 811ns when
multithreaded over the course of 10 million checks.
2023-12-13 13:53:23 +00:00
Stas Kelvich
8460654f61 Add per-endpoint rate limiter to proxy 2023-12-13 07:03:21 +02:00
Arpad Müller
7c2c87a5ab Update azure SDK to 0.18 and use open range support (#6103)
* Update `azure-*` crates to 0.18
* Use new open ranges support added by upstream in
https://github.com/Azure/azure-sdk-for-rust/pull/1482

Part of #5567. Prior update PR: #6081
2023-12-12 18:20:12 +01:00
Arpad Müller
5820faaa87 Use extend instead of groups of append calls in tests (#6109)
Repeated calls to `.append` don't line up as nicely as they might get
formatted in different ways. Also, it is more characters and the lines
might be longer.

Saw this while working on #5912.
2023-12-12 18:00:37 +01:00
John Spray
dfb0a6fdaf scrubber: handle initdb files, fix an issue with prefixes (#6079)
- The code for calculating the prefix in the bucket was expecting a
trailing slash (as it is in the tests), but that's an awkward
expectation to impose for use in the field: make the code more flexible
by only trimming a trailing character if it is indeed a slash.
- initdb archives were detected by the scrubber as malformed layer
files. Teach it to recognize and ignore them.
2023-12-12 16:53:08 +00:00
Alexander Bayandin
6acbee2368 test_runner: add from_repo_dir method (#6087)
## Problem

We need a reliable way to restore a project state (in this context, I
mean data on pageservers, safekeepers, and remote storage) from a
snapshot. The existing method (that we use in `test_compatibility`)
heavily relies on config files, which makes it harder to add/change
fields in the config.
The proposed solution uses the config file only to get `default_tenant_id` and `branch_name_mappings`.

## Summary of changes
- Add `NeonEnvBuilder#from_repo_dir` method, which allows using the
`neon_env_builder` fixture with data from a snapshot.
- Use `NeonEnvBuilder#from_repo_dir` in compatibility tests

Requires for https://github.com/neondatabase/neon/issues/6033
2023-12-12 16:24:13 +00:00
Konstantin Knizhnik
aec1acdbac Do not inherite replication slots in branch (#5898)
## Problem
 
See 
https://github.com/neondatabase/company_projects/issues/111
https://neondb.slack.com/archives/C03H1K0PGKH/p1700166126954079


## Summary of changes

Do not search for AUX_FILES_KEY in parent timelines

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-12-12 14:24:21 +02:00
Konstantin Knizhnik
8bb4a13192 Do not materialize null images in PS (#5979)
## Problem

PG16 is writing null images during relation extension.
And page server implements optimisation which replace WAL record with
FPI with page image.
So instead of WAL record ~30 bytes we store 8kb null page image.
Ans this image is almost useless, because most likely it will be shortly
rewritten with actual page content.

## Summary of changes

Do not materialize wal records with null page FPI.
 

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-12-12 14:23:45 +02:00
Anna Khanova
9e071e4458 Propagate information about the protocol to console (#6102)
## Problem

Currently there is no information in the snowflake logs about the protocol that the client uses.

## Summary of changes

Propagate the information about the protocol together with the app_name, in the format `{app_name}/{sql_over_http/tcp/ws}`.

This will give @stepashka more observability into what our clients are using.
2023-12-12 11:42:51 +00:00
John Spray
fead836f26 swagger: remove 'format: hex' from tenant IDs (#6099)
## Problem

TenantId is changing to TenantShardId in many APIs. The swagger had
`format: hex` attributes on some of these IDs. That isn't formally
defined anywhere, but a reasonable person might think it means "hex
digits only", which will no longer be the case once we start using
shard-aware IDs (they're like `<tenant_id>-0001`).



## Summary of changes

- Remove these `format` attributes from all `tenant_id` fields in the
swagger definition
2023-12-12 10:39:34 +00:00
John Spray
20e9cf7d31 pageserver: tweaks to slow/hung task logging (#6098)
## Problem

- `shutdown_tasks` would log when a particular task was taking a long
time to shut down, but not when it eventually completed. That left one
uncertain as to whether the slow task was the source of a hang, or just
a precursor.

## Summary of changes

- Add a log line after a slow task shutdown
- Add an equivalent in Gate's `warn_if_stuck`, in case we ever need it.
This isn't related to the original issue but was noticed when checking
through these logging paths.
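The shape of the change, as a sketch (hypothetical helper, not the real `shutdown_tasks` code):

```rust
use std::time::{Duration, Instant};
use tokio::task::JoinHandle;

async fn wait_logging_slow_shutdown(name: &str, mut handle: JoinHandle<()>) {
    let started = Instant::now();
    match tokio::time::timeout(Duration::from_secs(10), &mut handle).await {
        Ok(_) => {} // finished promptly, nothing to report
        Err(_elapsed) => {
            tracing::warn!("task {name} is slow to shut down, still waiting");
            let _ = handle.await;
            // the new bit: also report when the slow task eventually finished,
            // so a hang can be told apart from mere slowness
            tracing::info!("slow task {name} shut down after {:?}", started.elapsed());
        }
    }
}
```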
2023-12-12 07:19:59 +00:00
Joonas Koivunen
3b04f3a749 fix: accidental return Ok (#6106)
An error indicating request cancellation OR timeline shutdown was deemed a reason to exit the background worker that calculates synthetic size. Fix it so that such errors are only considered for the purpose of avoiding logging them.
2023-12-11 21:27:53 +00:00
Arpad Müller
c49fd69bd6 Add initdb_lsn to TimelineInfo (#6104)
This way, we can query it.

Background: I want to do statistics for how reproducible `initdb_lsn`
really is, see https://github.com/neondatabase/cloud/issues/8284 and
https://neondb.slack.com/archives/C036U0GRMRB/p1701895218280269
2023-12-11 21:08:14 +00:00
Tristan Partin
5ab9592a2d Add submodule paths as safe directories as a precaution
The check-codestyle-rust-arm job requires this for some reason, so let's
just add them everywhere we do this workaround.
2023-12-11 13:08:37 -06:00
Tristan Partin
036558c956 Fix git ownership issue in check-codestyle-rust-arm
We have this workaround for other jobs. Looks like this one was
forgotten about.
2023-12-11 13:08:37 -06:00
John Spray
6a922b1a75 tests: start adding tests for secondary mode, live migration (#5842)
These tests have been loitering on a branch of mine for a while: they
already provide value even without all the secondary mode bits landed
yet, and the Workload helper is handy for other tests too.

- `Workload` is a re-usable test workload that replaces some of the
arbitrary "write a few rows" SQL that I've found my self repeating, and
adds a systematic way to append data and check that reads properly
reflect the changes. This append+validate stuff is important when doing
migrations, as we want to detect situations where we might be reading
from a pageserver that has not properly seen latest changes.
- test_multi_attach is a validation of how the pageserver handles
attaching the same tenant to multiple pageservers, from a safety point
of view. This is intentionally separate from the larger testing of
migration, to provide an isolated environment for multi-attachment.
- test_location_conf_churn is a pseudo-random walk through the various
states that TenantSlot can be put into, with validation that attached
tenants remain externally readable when they should, and as a side
effect validating that the compute endpoint's online configuration
changes work as expected.
- test_live_migration is the reference implementation of how to drive a
pair of pageservers through a zero-downtime migration of a tenant.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-12-11 16:55:43 +00:00
John Spray
f1fc1fd639 pageserver: further refactoring from TenantId to TenantShardId (#6059)
## Problem

In https://github.com/neondatabase/neon/pull/5957, the most essential
types were updated to use TenantShardId rather than TenantId. That
unblocked other work, but didn't fully enable running multiple shards
from the same tenant on the same pageserver.

## Summary of changes

- Use TenantShardId in page cache key for materialized pages
- Update mgr.rs get_tenant() and list_tenants() functions to use a shard
id, and update all callers.
- Eliminate the exactly_one_or_none helper in mgr.rs and all code that
used it
- Convert timeline HTTP routes to use tenant_shard_id

Note on page cache:
```
struct MaterializedPageHashKey {
    /// Why is this TenantShardId rather than TenantId?
    ///
    /// Usually, the materialized value of a page@lsn is identical on any shard in the same tenant.  However, this
    /// is not the case for certain internally-generated pages (e.g. relation sizes).  In future, we may make this
    /// key smaller by omitting the shard, if we ensure that reads to such pages always skip the cache, or are
    /// special-cased in some other way.
    tenant_shard_id: TenantShardId,
    timeline_id: TimelineId,
    key: Key,
}
```
2023-12-11 15:52:33 +00:00
Alexander Bayandin
66a7a226f8 test_runner: use toml instead of formatted strings (#6088)
## Problem

A bunch of refactorings extracted from
https://github.com/neondatabase/neon/pull/6087 (not required for it); 
the most significant one is using toml instead of formatted strings.

## Summary of changes 
- Use toml instead of formatted strings for config
- Skip pageserver log check if `pageserver.log` doesn't exist
- `chmod -x test_runner/regress/test_config.py`
2023-12-11 15:13:27 +00:00
Joonas Koivunen
f0d15cee6f build: update azure-* to 0.17 (#6081)
This is a drive-by upgrade while we refresh the access tokens at the same time.
2023-12-11 12:21:02 +01:00
Sasha Krassovsky
0ba4cae491 Fix RLS/REPLICATION granting (#6083)
2023-12-08 12:55:44 -08:00
Andrew Rudenko
df1f8e13c4 proxy: pass neon options in deep object format (#6068)
---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2023-12-08 19:58:36 +01:00
John Spray
e640bc7dba tests: allow-lists for occasional failures (#6074)
test_creating_tenant_conf_after...
- Test detaches a tenant and then re-attaches immediately: this causes a
race between pending remote LSN update and the generation bump in the
attachment.

test_gc_cutoff:
- Test rapidly restarts a pageserver before one generation has had the
chance to process deletions from the previous generation
2023-12-08 17:32:16 +00:00
Christian Schwarz
cf024de202 virtual_file metrics: expose max size of the fd cache (#6078)
And also leave a comment on how to determine current size.

Kind of follow-up to #6066

refs https://github.com/neondatabase/cloud/issues/8351
refs https://github.com/neondatabase/neon/issues/5479
2023-12-08 17:23:50 +00:00
Conrad Ludgate
e1a564ace2 proxy simplify cancellation (#5916)
## Problem

The cancellation code was confusing and error prone (as seen before in
our memory leaks).

## Summary of changes

* Use the new `TaskTracker` primitive instead of JoinSet to gracefully
wait for tasks to shutdown.
* Updated libs/utils/completion to use `TaskTracker`
* Remove `tokio::select` in favour of `futures::future::select` in a
specialised `run_until_cancelled()` helper function
2023-12-08 16:21:17 +00:00
Christian Schwarz
f5b9af6ac7 page cache: improve eviction-related metrics (#6077)
These changes help with identifying thrashing.

The existing `pageserver_page_cache_find_victim_iters_total` is already useful, but it doesn't tell us how many individual find_victim() calls are happening, only how many clock-LRU steps happened in the entire system, without info about whether we actually needed to evict other data vs. just scan for a long time, e.g., because the cache is large.

The changes in this PR allow us to
1. count each possible outcome separately, esp evictions
2. compute mean iterations/outcome

I don't think anyone except me was paying close attention to
`pageserver_page_cache_find_victim_iters_total` before, so,
I think the slight behavior change of also counting iterations
for the 'iters exceeded' case is fine.

refs https://github.com/neondatabase/cloud/issues/8351
refs https://github.com/neondatabase/neon/issues/5479
2023-12-08 15:27:21 +00:00
John Spray
5e98855d80 tests: update tests that used local_fs&mock_s3 to use one or the other (#6015)
## Problem

This was wasting resources: if we run a test with mock s3 we don't then
need to run it again with local fs. When we're running in CI, we don't
need to run with the mock/local storage as well as real S3. There is
some value in having CI notice/spot issues that might otherwise only
happen when running locally, but that doesn't justify the cost of
running the tests so many more times on every PR.

## Summary of changes

- For tests that used available_remote_storages or
available_s3_storages, update them to either specify no remote storage
(therefore inherit the default, which is currently local fs), or to
specify s3_storage() for the tests that actually want an S3 API.
2023-12-08 14:52:37 +00:00
Conrad Ludgate
699049b8f3 proxy: make auth more type safe (#5689)
## Problem

a5292f7e67/proxy/src/auth/backend.rs (L146-L148)

a5292f7e67/proxy/src/console/provider/neon.rs (L90)

a5292f7e67/proxy/src/console/provider/neon.rs (L154)

## Summary of changes

1. Test backend is only enabled on `cfg(test)`.
2. Postgres mock backend + MD5 auth keys are only enabled on
`cfg(feature = testing)`
3. Password hack and cleartext flow will have their passwords validated
before proceeding.
4. Distinguish between ClientCredentials with endpoint and without,
removing many panics in the process
2023-12-08 11:48:37 +00:00
John Spray
2c544343e0 pageserver: filtered WAL ingest for sharding (#6024)
## Problem

Currently, if one creates many shards they will all ingest all the data:
not much use! We want them to ingest a proportional share of the data
each.

Closes: #6025

## Summary of changes

- WalIngest object gets a copy of the ShardIdentity for the Tenant it
was created by.
- While iterating the `blocks` part of a decoded record, blocks that do
not match the current shard are ignored, apart from on shard zero where
they are used to update relation sizes in `observe_decoded_block` (but
not stored).
- Before committing a `DatadirModification` from a WAL record, we check
if it's empty, and drop the record if so. This check is necessary
(rather than just looking at the `blocks` part) because certain record
types may modify blocks in non-obvious ways (e.g.
`ingest_heapam_record`).
- Add WAL ingest metrics to record the total received, total committed,
and total filtered out
- Behaviour for unsharded tenants is unchanged: they will continue to
ingest all blocks, and will take the fast path through `is_key_local`
that doesn't bother calculating any hashes.

After this change, shards store a subset of the tenant's total data, and
accurate relation sizes are only maintained on shard zero.
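A deliberately simplified sketch of the per-block filtering (hypothetical types; the real mapping uses `ShardIdentity`'s key hashing and striping, not a plain modulo):

```rust
#[derive(Clone, Copy, PartialEq)]
struct ShardNumber(u8);

struct ShardIdentity {
    number: ShardNumber,
    count: u8, // total shard count; 1 means unsharded
}

impl ShardIdentity {
    /// Unsharded tenants take the fast path and skip hashing entirely.
    fn is_key_local(&self, key_hash: u64) -> bool {
        self.count <= 1 || ShardNumber((key_hash % self.count as u64) as u8) == self.number
    }
}

/// Keep only the blocks this shard is responsible for ingesting.
fn filter_blocks(shard: &ShardIdentity, block_key_hashes: &[u64]) -> Vec<u64> {
    block_key_hashes
        .iter()
        .copied()
        .filter(|h| shard.is_key_local(*h))
        .collect()
}
```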

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-12-08 10:12:37 +00:00
Arseny Sher
193e60e2b8 Fix/edit pgindent confusing places in neon. 2023-12-08 14:03:13 +04:00
Arseny Sher
1bbd6cae24 pgindent pgxn/neon 2023-12-08 14:03:13 +04:00
Arseny Sher
65f48c7002 Make targets to run pgindent on core and neon extension. 2023-12-08 14:03:13 +04:00
Alexander Bayandin
d9d8e9afc7 test_tenant_reattach: fix reattach mode names (#6070)
## Problem

Ref
https://neondb.slack.com/archives/C033QLM5P7D/p1701987609146109?thread_ts=1701976393.757279&cid=C033QLM5P7D

## Summary of changes
- Make reattach mode names unique for `test_tenant_reattach`
2023-12-08 08:39:45 +00:00
Arpad Müller
7914eaf1e6 Buffer initdb.tar.zst to a temporary file before upload (#5944)
In https://github.com/neondatabase/neon/pull/5912#pullrequestreview-1749982732 , Christian liked the idea of using files instead of buffering the
archive to RAM for the *download* path. This is for the upload path,
which is a very similar situation.
2023-12-08 03:33:44 +01:00
Joonas Koivunen
37fdbc3aaa fix: use larger buffers for remote storage (#6069)
Currently we use 8kB buffers; raise that to 32kB, hopefully cutting `spawn_blocking` usage to 1/4. Also a drive-by fix of the last `tokio::io::copy` to `tokio::io::copy_buf`.
2023-12-07 19:36:44 +00:00
Tristan Partin
7aa1e58301 Add support for Python 3.12 2023-12-07 12:30:42 -06:00
Christian Schwarz
f2892d3798 virtual_file metrics: distinguish first and subsequent open() syscalls (#6066)
This helps with identifying thrashing.

I don't love the name, but, there is already "close-by-replace".

While reading the code, I also found a case where we waste
work in a cache pressure situation:
https://github.com/neondatabase/neon/issues/6065

refs https://github.com/neondatabase/cloud/issues/8351
2023-12-07 16:17:33 +00:00
Joonas Koivunen
b492cedf51 fix(remote_storage): buffering, by using streams for upload and download (#5446)
There is 8KiB double buffering in remote_storage and in the pageserver, from using `tokio::io::copy` to read a `BufReader<ReaderStream<_>>`.

Switches downloads and uploads to use `Stream<Item = std::io::Result<Bytes>>`. The caller, and only the caller, now handles setting up buffering. For reading, the `Stream<Item = ...>` is also an `AsyncBufRead`, so when writing to a file we now have `tokio::io::copy_buf` reading full buffers and writing them to a `tokio::io::BufWriter`, which handles the buffering before dispatching over to `tokio::fs::File`.

Additionally implements streaming uploads for azure. With azure
downloads are a bit nicer than before, but not much; instead of one huge
vec they just hold on to N allocations we got over the wire.

This PR will also make it trivial to switch reading and writing to
io-uring based methods.

Cc: #5563.
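A sketch of the download side under this scheme (assuming `tokio_util::io::StreamReader` to turn the byte stream into an `AsyncBufRead`):

```rust
use bytes::Bytes;
use futures::Stream;
use tokio::io::{AsyncWriteExt, BufWriter};
use tokio_util::io::StreamReader;

async fn write_stream_to_file(
    stream: impl Stream<Item = std::io::Result<Bytes>> + Unpin,
    path: &std::path::Path,
) -> std::io::Result<()> {
    // the stream is the only upstream buffering ...
    let mut reader = StreamReader::new(stream);
    // ... and the caller alone decides how much to buffer on the write side
    let file = tokio::fs::File::create(path).await?;
    let mut writer = BufWriter::with_capacity(32 * 1024, file);
    tokio::io::copy_buf(&mut reader, &mut writer).await?;
    writer.flush().await?;
    Ok(())
}
```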
2023-12-07 15:52:22 +00:00
John Spray
880663f6bc tests: use tenant_create() helper in test_bulk_insert (#6064)
## Problem

Since #5449 we enable generations in tests by default. Running
benchmarks was missed while merging that PR, and there was one that
needed updating.

## Summary of changes

Make test_bulk_insert use the proper generation-aware helper for tenant
creation.
2023-12-07 14:52:16 +00:00
John Spray
e89e41f8ba tests: update for tenant generations (#5449)
## Problem

Some existing tests are written in a way that's incompatible with tenant
generations.

## Summary of changes

Update all the tests that need updating: this is things like calling
through the NeonPageserver.tenant_attach helper to get a generation
number, instead of calling directly into the pageserver API. There are
various more subtle cases.
2023-12-07 12:27:16 +00:00
Conrad Ludgate
f9401fdd31 proxy: fix channel binding error messages (#6054)
## Problem

For channel-binding failure messages we were still saying "channel binding not supported" in the errors.

## Summary of changes

Fix error messages
2023-12-07 11:47:16 +00:00
Joonas Koivunen
b7ffe24426 build: update tokio to 1.34.0, tokio-utils 0.7.10 (#6061)
We should still remember to bump the minimum crate versions for libraries beginning to use TaskTracker.
2023-12-07 11:31:38 +00:00
Joonas Koivunen
52718bb8ff fix(layer): metric splitting, span rename (#5902)
Per [feedback], split the Layer metrics, also finally account for lost
and [re-submitted feedback] on `layer_gc` by renaming it to
`layer_delete`, `Layer::garbage_collect_on_drop` renamed to
`Layer::delete_on_drop`. References to "gc" dropped from metric names
and elsewhere.

Also fixes how the cancellations were tracked: there was one rare
counter. Now there is a top level metric for cancelled inits, and the
rare "download failed but failed to communicate" counter is kept.

Fixes: #6027

[feedback]: https://github.com/neondatabase/neon/pull/5809#pullrequestreview-1720043251
[re-submitted feedback]: https://github.com/neondatabase/neon/pull/5108#discussion_r1401867311
2023-12-07 11:39:40 +02:00
Joonas Koivunen
10c77cb410 temp: increase the wait tenant activation timeout (#6058)
5s is causing way too much noise; this is of course a temporary fix. We should give the highest priority to tenants for which there are pagestream openings, and the second highest to basebackups.

Deployment thread for context:
https://neondb.slack.com/archives/C03H1K0PGKH/p1701935048144479?thread_ts=1701765158.926659&cid=C03H1K0PGKH
2023-12-07 09:01:08 +00:00
Heikki Linnakangas
31be301ef3 Make simple_rcu::RcuWaitList::wait() async (#6046)
The gc_timeline() function is async, but it calls the synchronous wait()
function. In the worst case, that could lead to a deadlock by using up
all tokio executor threads.

In passing, fix a few typos in comments.

Fixes issue #6045.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-07 10:20:40 +02:00
Joonas Koivunen
a3c7d400b4 fix: avoid allocations with logging a slug (#6047)
to_string forces allocating a less-than-pointer-sized string (costing 4 usize on the stack); using a Display-formattable slug saves that. The difference seems small, but at the same time, we log these a lot.
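Illustration of the difference (hypothetical slug type):

```rust
use std::fmt;

struct Slug(&'static str);

impl fmt::Display for Slug {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.0)
    }
}

fn log_outcome(slug: Slug) {
    // before: allocates a String just to log it
    // tracing::info!(outcome = slug.to_string(), "operation finished");
    // after: the `%` sigil records the value via Display, no allocation
    tracing::info!(outcome = %slug, "operation finished");
}
```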
2023-12-07 07:25:22 +00:00
Vadim Kharitonov
7501ca6efb Revert timescaledb for pg14 and pg15 (#6056)
```
could not start the compute node: compute is in state "failed": db error: ERROR: could not access file "$libdir/timescaledb-2.10.1": No such file or directory Caused by: ERROR: could not access file "$libdir/timescaledb-2.10.1": No such file or directory
```
2023-12-06 15:12:36 +00:00
Christian Schwarz
987c9aaea0 virtual_file: fix the metric for close() calls done by VirtualFile::drop (#6051)
Before this PR we would inc() the counter for `Close` even though the
slot's FD had already been closed.

Especially visible when subtracting `open` from `close+close-by-replace`
on a system that does a lot of attach and detach.

refs https://github.com/neondatabase/cloud/issues/8440
refs https://github.com/neondatabase/cloud/issues/8351
2023-12-06 12:05:28 +00:00
Konstantin Knizhnik
7fab731f65 Track size of FSM fork while applying records at replica (#5901)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1700560921471619

## Summary of changes

Update relation size cache for FSM fork in WAL records filter


Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-12-05 18:49:24 +02:00
John Spray
483caa22c6 pageserver: logging tweaks (#6039)
- The `Attaching tenant` log message omitted some useful information
like the generation and mode
- info-level messages about writing configuration files were
unnecessarily verbose
- During process shutdown, we don't emit logs about the various phases:
this is very cheap to log since we do it once per process lifetime, and
is helpful when figuring out where something got stuck during a hang.
2023-12-05 16:11:15 +00:00
John Spray
da5e03b0d8 pageserver: add a /reset API for tenants (#6014)
## Problem

Traditionally we would detach/attach directly with curl if we wanted to
"reboot" a single tenant. That's kind of inconvenient these days,
because one needs to know a generation number to issue an attach
request.

Closes: https://github.com/neondatabase/neon/issues/6011

## Summary of changes

- Introduce a new `/reset` API, which remembers the LocationConf from
the current attachment so that callers do not have to work out the
correct configuration/generation to use.
- As an additional support tool, allow an optional `drop_cache` query
parameter, for situations where we are concerned that some on-disk state
might be bad and want to clear that as well as the in-memory state.

One might wonder why I didn't call this "reattach" -- it's because
there's already a PS->CP API of that name and it could get confusing.
2023-12-05 15:38:27 +00:00
John Spray
be885370f6 pageserver: remove redundant unsafe_create_dir_all (#6040)
This non-fsyncing analog to our safe directory creation function was
just duplicating what tokio's fs::create_dir_all does.
2023-12-05 15:03:07 +00:00
Alexey Kondratov
bc1020f965 compute_ctl: Notify waiters when Postgres failed to start (#6034)
When configuring an empty compute, the API handler waits on a condvar for a compute state change. Yet previously, if Postgres failed to start we were just setting the compute status to `Failed` without notifying. That causes a timeout on the control plane side, although the compute could return a proper error earlier.

With this commit the API handler is properly notified.
2023-12-05 13:38:45 +01:00
John Spray
61fe9d360d pageserver: add Key->Shard mapping logic & use it in page service (#5980)
## Problem

When a pageserver receives a page service request identified by
TenantId, it must decide which `Tenant` object to route it to.

As in earlier PRs, this stuff is all a no-op for tenants with a single
shard: calls to `is_key_local` always return true without doing any
hashing on a single-shard ShardIdentity.

Closes: https://github.com/neondatabase/neon/issues/6026

## Summary of changes

- Carry immutable `ShardIdentity` objects in Tenant and Timeline. These
provide the information that Tenants/Timelines need to figure out which
shard is responsible for which Key.
- Augment `get_active_tenant_with_timeout` to take a `ShardSelector`
specifying how the shard should be resolved for this tenant. This mode
depends on the kind of request (e.g. basebackups always go to shard
zero).
- In `handle_get_page_at_lsn_request`, handle the case where the
Timeline we looked up at connection time is not the correct shard for
the page being requested. This can happen whenever one node holds
multiple shards for the same tenant. This is currently written as a
"slow path" with the optimistic expectation that usually we'll run with
one shard per pageserver, and the Timeline resolved at connection time
will be the one serving page requests. There is scope for optimization
here later, to avoid doing the full shard lookup for each page.
- Omit consumption metrics from nonzero shards: only the 0th shard is
responsible for tracing accurate relation sizes.

Note to reviewers:
- Testing of these changes is happening separately on the
`jcsp/sharding-pt1` branch, where we have hacked neon_local etc needed
to run a test_pg_regress.
- The main caveat to this implementation is that page service
connections still look up one Timeline when the connection is opened,
before they know which pages are going to be read. If there is one shard
per pageserver then this will always also be the Timeline that serves
page requests. However, if multiple shards are on one pageserver then
get page requests will incur the cost of looking up the correct Timeline
on each getpage request. We may look to improve this in future with a
"sticky" timeline per connection handler so that subsequent requests for
the same Timeline don't have to look up again, and/or by having postgres
pass a shard hint when connecting. This is tracked in the "Loose ends"
section of https://github.com/neondatabase/neon/issues/5507
2023-12-05 12:01:55 +00:00
Conrad Ludgate
f60e49fe8e proxy: fix panic in startup packet (#6032)
## Problem

Panic when less than 8 bytes is presented in a startup packet.

## Summary of changes

We need there to be a 4-byte length and a 4-byte message code, so the expected minimum length is 8.
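The guard amounts to something like this sketch:

```rust
// a startup packet begins with a 4-byte length and a 4-byte protocol/message code
const MIN_STARTUP_PACKET_LEN: usize = 8;

fn check_startup_packet_len(buf: &[u8]) -> Result<(), &'static str> {
    if buf.len() < MIN_STARTUP_PACKET_LEN {
        // previously a shorter buffer fell through to a slice-index panic
        return Err("startup packet too short");
    }
    Ok(())
}
```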
2023-12-05 11:24:16 +01:00
Anna Khanova
c48918d329 Rename metric (#6030)
## Problem

It looks like the metric is broken in Grafana because of the reallocation of the buckets in a previous PR.

## Summary of changes

Renamed the metric.
2023-12-05 10:03:07 +00:00
Sasha Krassovsky
bad686bb71 Remove trusted from wal2json (#6035)
## Problem

## Summary of changes
2023-12-04 21:10:23 +00:00
Alexey Kondratov
85d08581ed [compute_ctl] Introduce feature flags in the compute spec (#6016)
## Problem

In the past we've rolled out all new `compute_ctl` functionality right
to all users, which could be risky. I want to have more fine-grained control over what we enable, in which env and to which users.

## Summary of changes

Add an option to pass a list of feature flags to `compute_ctl`. If not
passed, it defaults to an empty list. Any unknown flags are ignored.

This allows us to release new experimental features more safely, as we can
then flip the flag for one specific user, only Neon employees, free /
pro / etc. users and so on. Or control it per environment.

In the current implementation feature flags are passed via compute spec,
so they do not allow controlling behavior of `empty` computes. For them,
we can either stick with the previous approach, i.e. add separate cli
args or introduce a more generic `--features` cli argument.
2023-12-04 19:54:18 +01:00
Christian Schwarz
c7f1143e57 concurrency-limit low-priority initial logical size calculation [v2] (#6000)
Problem
-------

Before this PR, there was no concurrency limit on initial logical size
computations.

While logical size computations are lazy in theory, in practice
(production), they happen in a short timeframe after restart.

This means that on a PS with 20k tenants, we'd have up to 20k concurrent
initial logical size calculation requests.

This is self-inflicted needless overload.

This hasn't been a problem so far because the `.await` points on the
logical size calculation path never return `Pending`, hence we have a
natural concurrency limit of the number of executor threads.
But, as soon as we return `Pending` somewhere in the logical size
calculation path, other concurrent tasks get scheduled by tokio.
If these other tasks are also logical size calculations, they eventually
pound on the same bottleneck.

For example, in #5479, we want to switch the VirtualFile descriptor
cache to a `tokio::sync::RwLock`, which makes us return `Pending`, and
without measures like this patch, after PS restart, VirtualFile
descriptor cache thrashes heavily for 2 hours until all the logical size
calculations have been computed and the degree of concurrency /
concurrent VirtualFile operations is down to regular levels.
See the *Experiment* section below for details.


Background
----------

Before this PR, initial logical size calculation was spawned lazily on
first call to `Timeline::get_current_logical_size()`.

In practice (prod), the lazy calculation is triggered by
`WalReceiverConnectionHandler` if the timeline is active according to
storage broker, or by the first iteration of consumption metrics worker
after restart (`MetricsCollection`).

The spawns by walreceiver are high-priority because logical size is
needed by Safekeepers (via walreceiver `PageserverFeedback`) to enforce
the project logical size limit.
The spawns by metrics collection are not on the user-critical path and
hence low-priority. [^consumption_metrics_slo]

[^consumption_metrics_slo]: We can't delay metrics collection indefinitely because there are TBD internal SLOs tied to metrics collection happening in a timely manner (https://github.com/neondatabase/cloud/issues/7408). But let's ignore
that in this issue.

The ratio of walreceiver-initiated spawns vs
consumption-metrics-initiated spawns can be reconstructed from logs
(`spawning logical size computation from context of task kind {:?}"`).
PR #5995 and #6018 adds metrics for this.

First investigation of the ratio lead to the discovery that walreceiver
spawns 75% of init logical size computations.
That's because of two bugs:
- In Safekeepers: https://github.com/neondatabase/neon/issues/5993
- In interaction between Pageservers and Safekeepers:
https://github.com/neondatabase/neon/issues/5962

The safekeeper bug is likely primarily responsible but we don't have the
data yet. The metrics will hopefully provide some insights.

When assessing production-readiness of this PR, please assume that
neither of these bugs are fixed yet.


Changes In This PR
------------------

With this PR, initial logical size calculation is reworked as follows:

First, all initial logical size calculation task_mgr tasks are started
early, as part of timeline activation, and run a retry loop with long
back-off until success. This removes the lazy computation; it was
needless complexity because in practice, we compute all logical sizes
anyways, because consumption metrics collects it.

Second, within the initial logical size calculation task, each attempt
queues behind the background loop concurrency limiter semaphore. This
fixes the performance issue that we pointed out in the "Problem" section
earlier.

Third, there is a twist to queuing behind the background loop
concurrency limiter semaphore. Logical size is needed by Safekeepers
(via walreceiver `PageserverFeedback`) to enforce the project logical
size limit. However, we currently do open walreceiver connections even
before we have an exact logical size. That's bad, and I'll build on top
of this PR to fix that
(https://github.com/neondatabase/neon/issues/5963). But, for the
purposes of this PR, we don't want to introduce a regression, i.e., we
don't want to provide an exact value later than before this PR. The
solution is to introduce a priority-boosting mechanism
(`GetLogicalSizePriority`), allowing callers of
`Timeline::get_current_logical_size` to specify how urgently they need
an exact value. The effect of specifying high urgency is that the
initial logical size calculation task for the timeline will skip the
concurrency limiting semaphore. This should yield effectively the same
behavior as we had before this PR with lazy spawning.

Last, the priority-boosting mechanism obsoletes the `init_order`'s grace
period for initial logical size calculations. It's a separate commit to
reduce the churn during review. We can drop that commit if people think
it's too much churn, and commit it later once we know this PR here
worked as intended.
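A sketch of the queuing and priority-boosting behaviour described above (hypothetical names for everything except `GetLogicalSizePriority`, which this PR introduces):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[derive(Clone, Copy)]
enum GetLogicalSizePriority {
    /// walreceiver / safekeeper feedback path: needs an exact value soon
    User,
    /// consumption metrics: can queue behind other background work
    Background,
}

async fn initial_logical_size_attempt(
    background_limiter: Arc<Semaphore>,
    priority: GetLogicalSizePriority,
) -> anyhow::Result<u64> {
    // high-priority callers skip the background-loop concurrency limit,
    // low-priority ones wait for a permit
    let _permit = match priority {
        GetLogicalSizePriority::User => None,
        GetLogicalSizePriority::Background => {
            Some(background_limiter.acquire_owned().await?)
        }
    };
    calculate_logical_size().await
}

async fn calculate_logical_size() -> anyhow::Result<u64> {
    Ok(0) // placeholder for the actual walk over the timeline's layers
}
```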

Experiment With #5479 
---------------------

I validated this PR combined with #5479 to assess whether we're making
forward progress towards asyncification.

The setup is an `i3en.3xlarge` instance with 20k tenants, each with one
timeline that has 9 layers.
All tenants are inactive, i.e., not known to SKs nor storage broker.
This means all initial logical size calculations are spawned by
consumption metrics `MetricsCollection` task kind.
The consumption metrics worker starts requesting logical sizes at low
priority immediately after restart. This is achieved by deleting the
consumption metrics cache file on disk before starting
PS.[^consumption_metrics_cache_file]

[^consumption_metrics_cache_file]: The consumption metrics worker persists
its interval across restarts to achieve persistent reporting intervals
across PS restarts; delete the state file on disk to get predictable
(and I believe worst-case in terms of concurrency during PS restart)
behavior.

Before this patch, all of these timelines would all do their initial
logical size calculation in parallel, leading to extreme thrashing in
page cache and virtual file cache.

With this patch, the virtual file cache thrashing is reduced
significantly (from 80k `open`-system-calls/second to ~500
`open`-system-calls/second during loading).


### Critique

The obvious critique with above experiment is that there's no skipping
of the semaphore, i.e., the priority-boosting aspect of this PR is not
exercised.

If even just 1% of our 20k tenants in the setup were active in
SK/storage_broker, then 200 logical size calculations would skip the
limiting semaphore immediately after restart and run concurrently.

Further critique: given the two bugs wrt timeline inactive vs active
state that were mentioned in the Background section, we could have 75%
of our 20k tenants being (falsely) active on restart.

So... (next section)

This Doesn't Make Us Ready For Async VirtualFile
------------------------------------------------

This PR is a step towards asynchronous `VirtualFile`, aka, #5479 or even
#4744.

But it doesn't yet enable us to ship #5479.

The reason is that this PR doesn't limit the amount of high-priority
logical size computations.
If there are many high-priority logical size calculations requested,
we'll fall over like we did if #5479 is applied without this PR.
And currently, at very least due to the bugs mentioned in the Background
section, we run thousands of high-priority logical size calculations on
PS startup in prod.

So, at a minimum, we need to fix these bugs.

Then we can ship #5479 and #4744, and things will likely be fine under
normal operation.

But in high-traffic situations, overload problems will still be more
likely to happen, e.g., VirtualFile cache descriptor thrashing.
The solution candidates for that are orthogonal to this PR though:
* global concurrency limiting
* per-tenant rate limiting => #5899
* load shedding
* scaling bottleneck resources (fd cache size (neondatabase/cloud#8351),
page cache size(neondatabase/cloud#8351), spread load across more PSes,
etc)

Conclusion
----------

Even with the remarks in the previous section, we should merge this
PR because:
1. it's an improvement over the status quo (esp. if the aforementioned
bugs wrt timeline active / inactive are fixed)
2. it prepares the way for
https://github.com/neondatabase/neon/pull/6010
3. it gets us close to shipping #5479 and #4744
2023-12-04 17:22:26 +00:00
Christian Schwarz
7403d55013 walredo: stderr cleanup & make explicitly cancel safe (#6031)
# Problem

I need walredo to be cancellation-safe for
https://github.com/neondatabase/neon/pull/6000#discussion_r1412049728

# Solution

We are only `async fn` because of
`wait_for(stderr_logger_task_done).await`, added in #5560 .

The `stderr_logger_cancel` and `stderr_logger_task_done` were there as a
precaution, in case the stderr logger task for some reason did not stop
when the walredo process terminates.
That hasn't been a problem in practice.
So, simplify things:
- remove `stderr_logger_cancel` and the
`wait_for(...stderr_logger_task_done...)`
- use `tokio::process::ChildStderr` in the stderr logger task
- add metrics to track the number of running stderr logger tasks so that, in
case I'm wrong here, we can use these metrics to identify the issue (not
planning to put them into a dashboard or anything); a rough sketch of the
task follows this list
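
A rough sketch of what the stderr logger task can look like (assumed shape
only; an atomic counter stands in for the real metric):

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;
use tokio::io::{AsyncBufReadExt, BufReader};

// Drain the walredo child's stderr line by line and log it; the counter
// stands in for the "running stderr logger tasks" metric mentioned above.
fn spawn_stderr_logger(stderr: tokio::process::ChildStderr, running: Arc<AtomicI64>) {
    running.fetch_add(1, Ordering::Relaxed);
    tokio::spawn(async move {
        let mut lines = BufReader::new(stderr).lines();
        while let Ok(Some(line)) = lines.next_line().await {
            tracing::error!("walredo stderr: {}", line);
        }
        // reaches here once the walredo process exits and closes its stderr
        running.fetch_sub(1, Ordering::Relaxed);
    });
}
```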
2023-12-04 16:06:41 +00:00
Anna Khanova
12f02523a4 Enable dynamic rate limiter (#6029)
## Problem

Limit the number of open connections between the control plane and
proxy.

## Summary of changes

Enable dynamic rate limiter in prod.

Unfortunately the latency metrics are a bit broken, but from the logs I see
that on staging over the past 7 days the latency for acquiring was greater
than 1ms only 2 times (in most cases it's insignificant).
2023-12-04 15:00:24 +00:00
Arseny Sher
207c527270 Safekeepers: persist state before timeline deactivation.
Without it, sometimes on restart we lose the latest remote_consistent_lsn,
which leads to excessive ps -> sk reconnections.

https://github.com/neondatabase/neon/issues/5993
2023-12-04 18:22:36 +04:00
John Khvatov
eae49ff598 Perform L0 compaction before creating new image layers (#5950)
If there are too many L0 layers before compaction, the compaction
process becomes slow because of slow `Timeline::get`. As a result of the
slowdown, the pageserver will generate even more L0 layers for the next
iteration, further exacerbating the slow performance.

Change the order to perform L0 -> L1 compaction before creating new image
layers. This simple change speeds up compaction and `Timeline::get` by up to
5x: `Timeline::get` is faster on top of L1 layers.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-04 12:35:09 +00:00
Alexander Bayandin
e6b2f89fec test_pg_clients: fix test that reads from stdout (#6021)
## Problem

`test_pg_clients` reads the actual result from a *.stdout file;
https://github.com/neondatabase/neon/pull/5977 added a header to
such files, so `test_pg_clients` started to fail.

## Summary of changes
- Use `capture_stdout` and compare the expected result with the output
instead of *.stdout file content
2023-12-04 11:18:41 +00:00
John Spray
1d81e70d60 pageserver: tweak logs for index_part loading (#6005)
## Problem

On pageservers upgraded to enable generations, these INFO level logs
were rather frequent. If a tenant timeline hasn't written new layers
since the upgrade, it will emit the "No index_part.json*" log every time
it starts.

## Summary of changes

- Downgrade two log lines from info to debug
- Add a tiny unit test that I wrote for sanity-checking that there
wasn't something wrong with our Generation-comparing logic when loading
index parts.
2023-12-04 09:57:47 +00:00
Anastasia Lubennikova
e3512340c1 Override neon.max_cluster_size for the time of compute_ctl (#5998)
Temporarily reset neon.max_cluster_size to avoid
the possibility of hitting the limit, while we are applying config:
creating new extensions, roles, etc...
2023-12-03 15:21:44 +00:00
Christian Schwarz
e43cde7aba initial logical size: remove CALLS metric from hot path (#6018)
The metric was only introduced a few hours ago (#5995). Taking a look at the
numbers from staging, I realized that `get_current_logical_size()` is on the
walingest hot path: we call it for every `ReplicationMessage::XLogData`
that we receive.

Since the metric is global, it would be quite a busy cache line.

This PR replaces it with a new metric purpose-built for what's most
interesting right now.
2023-12-01 22:45:04 +01:00
Alexey Kondratov
c1295bfb3a [compute_ctl] Use correct HTTP code in the /configure errors (#6017)
It was using `PRECONDITION_FAILED` for errors during `ComputeSpec` to
`ParsedSpec` conversion, but this disobeys the OpenAPI spec [1], and the
correct code should be `BAD_REQUEST` for any spec processing errors.

While on it, I also noticed that `compute_ctl` OpenAPI spec has an
invalid format and fixed it.

[1] fd81945a60/compute_tools/src/http/openapi_spec.yaml (L119-L120)
2023-12-01 18:19:55 +01:00
Joonas Koivunen
711425cc47 fix: use create_new instead of create for mutex file (#6012)
Using create_new makes the uninit marker work as a mutual exclusion
primitive. Temporary hopefully.
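
A minimal sketch of the idea (illustrative, not the exact pageserver code):

```rust
use std::fs::OpenOptions;
use std::io::ErrorKind;
use std::path::Path;

// create_new fails with AlreadyExists if the marker is already there, so the
// uninit marker doubles as a mutual-exclusion primitive between concurrent
// timeline creation attempts.
fn try_create_uninit_marker(path: &Path) -> std::io::Result<bool> {
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(_file) => Ok(true), // we created it: we own the initialization
        Err(e) if e.kind() == ErrorKind::AlreadyExists => Ok(false), // someone else did
        Err(e) => Err(e),
    }
}
```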
2023-12-01 18:30:51 +02:00
bojanserafimov
fd81945a60 Use TEST_OUTPUT envvar in pageserver (#5984) 2023-12-01 09:16:24 -05:00
bojanserafimov
e49c21a3cd Speed up rel extend (#5983) 2023-12-01 09:11:41 -05:00
Anastasia Lubennikova
92e7cd40e8 add sql_exporter to vm-image (#5949)
expose LFC metrics
2023-12-01 13:40:49 +00:00
Alexander Bayandin
7eabfc40ee test_runner: use separate directory for each rerun (#6004)
## Problem

While investigating https://github.com/neondatabase/neon/issues/5854, we
hypothesised that logs/repo-dir from the initial failure might leak into
reruns. Use different directories for each run to avoid such a
possibility.

## Summary of changes
- make each test rerun use different directories
- update `pytest-rerunfailure` plugin from 11.1.2 to 13.0
2023-12-01 13:26:19 +00:00
Christian Schwarz
ce1652990d logical size: better represent level of accuracy in the type system (#5999)
I would love to not expose the inaccurate value in the mgmt API at all,
and in fact the control plane doesn't use it [^1].
But our tests do, and I have no desire to change them at this time.

[^1]: https://github.com/neondatabase/cloud/pull/8317
2023-12-01 14:16:29 +01:00
Christian Schwarz
8cd28e1718 logical size calculation: make .current_size() infallible (#5999)
... by panicking on overflow;

It was made fallible initially due to low confidence in the logical size
calculation. However, the error has never happened in the time I've been at Neon.

Let's stop worrying about this by converting the overflow check into a panic.
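
Schematically, assuming the size is tracked as an absolute base plus a
signed increment (a sketch, not the real accessor):

```rust
// Overflow now panics instead of bubbling up a Result that nobody ever acted on.
fn current_size(absolute_base: u64, incremental_delta: i64) -> u64 {
    u64::try_from(i128::from(absolute_base) + i128::from(incremental_delta))
        .expect("logical size overflowed u64")
}
```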
2023-12-01 14:16:29 +01:00
Christian Schwarz
1c88824ed0 initial logical size calculation: add a bunch of metrics (#5995)
These will help us answer questions such as:
- when & at what do calculations get started after PS restart?
- how often is the api to get current incrementally-computed logical
  size called, and does it return Exact vs Approximate?

I'd also be interested in a histogram of how much wall clock
time size calculations take, but, I don't know good bucket sizes,
and, logging it would introduce yet another per-timeline log
message during startup; don't think that's worth it just yet.

Context

- https://neondb.slack.com/archives/C033RQ5SPDH/p1701197668789769
- https://github.com/neondatabase/neon/issues/5962
- https://github.com/neondatabase/neon/issues/5963
- https://github.com/neondatabase/neon/pull/5955
- https://github.com/neondatabase/cloud/issues/7408
2023-12-01 12:52:59 +01:00
Arpad Müller
1ce1c82d78 Clean up local state if index_part.json request gives 404 (#6009)
If `index_part.json` is (verifiably) not present on remote storage, we
should regard the timeline as nonexistent. This lets `clean_up_timelines`
purge the partial local disk state, which is important in the case of
incomplete creations leaving behind state that hinders retries. For
incomplete deletions, we also want the timeline's local disk content to be
gone completely.

The PR removes the allowed warnings added by #5390 and #5912, as we now
are only supposed to issue info level messages. It also adds a
reproducer for #6007, by parametrizing the
`test_timeline_init_break_before_checkpoint_recreate` test added by
#5390. If one reverts the .rs changes, the "cannot create its uninit
mark file" log line occurs once one comments out the failing checks for
the local disk state being actually empty.

Closes #6007

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-01 10:58:06 +00:00
Vadim Kharitonov
f784e59b12 Update timescaledb to 2.13.0 (#5975)
TimescaleDB has released 2.13.0. This version is compatible with
Postgres16
2023-11-30 17:12:52 -06:00
Arpad Müller
b71b8ecfc2 Add existing_initdb_timeline_id param to timeline creation (#5912)
This PR adds an `existing_initdb_timeline_id` option to timeline
creation APIs, taking an optional timeline ID.

Follow-up of  #5390.

If the `existing_initdb_timeline_id` option is specified via the HTTP
API, the pageserver downloads the existing initdb archive from the given
timeline ID and extracts it, instead of running initdb itself.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-30 22:32:04 +01:00
Arpad Müller
3842773546 Correct RFC number for Pageserver WAL DR RFC (#5997)
When I opened #5248, 27 was an unused RFC number. Since then, two RFCs
have been merged, so now 27 is taken. 29 is free though, so move it
there.
2023-11-30 21:01:25 +00:00
Conrad Ludgate
f39fca0049 proxy: chore: replace strings with SmolStr (#5786)
## Problem

no problem

## Summary of changes

Replaces boxstr with arcstr as it's cheaper to clone; a mild perf
improvement.

We should probably look into other small-string optimisations too, as they
will likely be even better. The longest endpoint name I was able to construct
is something like `ep-weathered-wildflower-12345678`, which is 32 bytes.
Most string optimisations top out at 23 bytes.
2023-11-30 20:52:30 +00:00
Joonas Koivunen
b451e75dc6 test: include cmdline in captured output (#5977)
aiming for faster to understand a bunch of `.stdout` and `.stderr`
files, see example echo_1.stdout differences:

```
+# echo foobar abbacd
+
 foobar abbacd
```

it can be disabled and is disabled in this PR for some tests; use
`pg_bin.run_capture(..., with_command_header=False)` for that.

as a bonus this cleans up the echoed newlines from s3_scrubber
output which are also saved to file but echoed to test log.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-11-30 17:31:03 +00:00
Anna Khanova
3657a3c76e Proxy fix metrics record (#5996)
## Problem

Some latency metrics are recorded in inconsistent way.

## Summary of changes

Make sure that everything is recorded in seconds.
2023-11-30 16:33:54 +00:00
Joonas Koivunen
eba3bfc57e test: python needs thread safety as well (#5992)
We have test cases which launch processes from threads, and they capture
output assuming this counter is thread-safe. At least according to my
understanding, this operation in Python requires a lock to be
thread-safe.
2023-11-30 15:48:40 +00:00
John Spray
57ae9cd07f pageserver: add flush_ms and document /location_config API (#5860)
- During migration of tenants, it is useful for callers to
`/location_conf` to flush a tenant's layers while transitioning to
AttachedStale: this optimization reduces the redundant WAL replay work
that the tenant's new attached pageserver will have to do. Test coverage
for this will come as part of the larger tests for live migration in
#5745 #5842
- Flushing is controlled with the `flush_ms` query parameter: it is the
caller's job to decide how long they want to wait for a flush to
complete. If the flush does not complete within the time limit, the
pageserver proceeds and succeeds anyway: flushing is only an optimization.
- Add swagger definitions for all this: the location_config API is the
primary interface for driving tenant migration as described in
docs/rfcs/028-pageserver-migration.md, and will eventually replace the
various /attach /detach /load /ignore APIs.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-30 14:22:07 +00:00
Christian Schwarz
3bb1030f5d walingest: refactor if-cascade on decoded.xl_rmid into match statement (#5974)
refs https://github.com/neondatabase/neon/issues/5962

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-30 14:07:41 +00:00
John Spray
5d3c3636fc tests: add a log allow list entry in test_timeline_deletion_with_files_stuck_in_upload_queue (#5981)
Test failure seen here:

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5860/7032903218/index.html#suites/837740b64a53e769572c4ed7b7a7eeeb/c0f1c79a70a3b9ab

```
E   AssertionError: assert not [(302, '2023-11-29T13:23:51.046801Z ERROR request{method=PUT path=/v1/tenant/f6b845de60cb0e92f4426e0d6af1d2ea/timeline/69a8c98004abe71a281cff8642a45274/checkpoint request_id=eca33d8a-7af2-46e7-92ab-c28629feb42c}: Error processing HTTP request: InternalServerError(queue is in state Stopped\n')]
```

This appears to be a legitimate log: the test is issuing checkpoint
requests in the background, and deleting (therefore shutting down) a
timeline.
2023-11-30 13:44:14 +00:00
Conrad Ludgate
0c87d1866b proxy: fix wake_compute error prop (#5989)
## Problem

fixes #5654 - WakeComputeErrors occurring during connect_to_compute got
propagated as IO errors, which get forwarded to the user as "Couldn't
connect to compute node" with no helpful message.

## Summary of changes

Handle WakeComputeError during ConnectionError properly
2023-11-30 13:43:21 +00:00
Arpad Müller
8ec6033ed8 Pageserver disaster recovery RFC (#5248)
Enable the pageserver to recover from data corruption events by
implementing a feature to re-apply historic WAL records in parallel to the already
occurring WAL replay.

The feature is outside of the user-visible backup and history story, and
only serves as a second-level backup for the case that there is a bug in the
pageservers that corrupted the served pages.

The RFC proposes the addition of two new features:
* recover a broken branch from WAL (downtime is allowed)
* a test recovery system to recover random branches to make sure
recovery works
2023-11-30 14:30:17 +01:00
Anna Khanova
e12e2681e9 IP allowlist on the proxy side (#5906)
## Problem

Per-project IP allowlist:
https://github.com/neondatabase/cloud/issues/8116

## Summary of changes

Implemented IP filtering on the proxy side.

To retrieve the IP allowlist in all scenarios, added a `get_auth_info` call
to the control plane for:
* sql-over-http
* password_hack
* cleartext_hack

Added a cache with a TTL for the sql-over-http path.

This might slow things down a bit; consider using Redis in the future.

---------

Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2023-11-30 13:14:33 +00:00
Joonas Koivunen
1e57ddaabc fix: flush loop should also keep the gate open (#5987)
I was expecting this to already be in place, because this should not
conflict how we shutdown (0. cancel, 1. shutdown_tasks, 2. close gate).
2023-11-30 14:26:11 +02:00
John Khvatov
3e094e90d7 update aws sdk to 1.0.x (#5976)
This change will be useful for experimenting with S3 performance.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-30 14:17:58 +02:00
Christian Schwarz
292281c9df pagectl: add subcommand to rewrite layer file summary (#5933)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771
2023-11-30 11:34:30 +00:00
Rahul Modpur
50d959fddc refactor: use serde for TenantConf deserialization Fixes: #5300 (#5310)
Remove handcrafted TenantConf deserialization code. Use
`serde_path_to_error` to include the field which failed parsing. Leaves
the duplicated TenantConf in pageserver and models, does not touch
PageserverConf handcrafted deserialization.

Error change:
- before change: "configure option `checkpoint_distance` cannot be
negative"
- after change: "`checkpoint_distance`: invalid value: integer `-1`,
expected u64"
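
The pattern, roughly (illustrative struct and JSON input; the real code
deserializes the actual TenantConf):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct ExampleConf {
    checkpoint_distance: u64,
}

fn parse(input: &str) -> Result<ExampleConf, String> {
    let de = &mut serde_json::Deserializer::from_str(input);
    // serde_path_to_error records which field was being deserialized when the
    // error occurred, giving messages like the "after change" example above.
    serde_path_to_error::deserialize(de)
        .map_err(|e| format!("`{}`: {}", e.path(), e.inner()))
}
```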

Fixes: #5300
Cc: #3682

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Co-authored-by: Shany Pozin <shany@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-30 12:47:13 +02:00
Conrad Ludgate
fc77c42c57 proxy: add flag to enable http pool for all users (#5959)
## Problem

#5123

## Summary of changes

Add `--sql-over-http-pool-opt-in true` default cli arg. Allows us to set
`--sql-over-http-pool-opt-in false` region-by-region
2023-11-30 10:19:30 +00:00
Conrad Ludgate
f05d1b598a proxy: add more db error info (#5951)
## Problem

https://github.com/neondatabase/serverless/issues/51

## Summary of changes

include more error fields in the json response
2023-11-30 10:18:59 +00:00
Christian Schwarz
ca597206b8 walredo: latency histogram for spawn duration (#5925)
fixes https://github.com/neondatabase/neon/issues/5891
2023-11-29 18:44:37 +00:00
Rahul Modpur
46f20faa0d neon_local: fix endpoint api to prevent two primary endpoints (#5520)
`neon_local endpoint` subcommand currently allows creating two primary
endpoints for the same branch which leads to shutdown of both endpoints

`neon_local endpoint start` new behavior:
1. Fail if endpoint doesn't exist
2. Fail if two primary conflict detected

Fixes #4959
Closes #5426

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-29 19:38:03 +02:00
John Spray
9e55ad4796 pageserver: refactor TenantId to TenantShardId in Tenant & Timeline (#5957)
(includes two preparatory commits from
https://github.com/neondatabase/neon/pull/5960)

## Problem

To accommodate multiple shards in the same tenant on the same
pageserver, we must include the full TenantShardId in local paths. That
means that all code touching local storage needs to see the
TenantShardId.

## Summary of changes

- Replace `tenant_id: TenantId` with `tenant_shard_id: TenantShardId` on
Tenant, Timeline and RemoteTimelineClient.
- Use TenantShardId in helpers for building local paths.
- Update all the relevant call sites.

This doesn't update absolutely everything: things like PageCache,
TaskMgr, WalRedo are still shard-naive. The purpose of this PR is to
update the core types so that others code can be added/updated
incrementally without churning the most central shared types.
2023-11-29 14:52:35 +00:00
John Spray
70b5646fba pageserver: remove redundant serialization helpers on DeletionList (#5960)
Precursor for https://github.com/neondatabase/neon/pull/5957

## Problem

When DeletionList was written, TenantId/TimelineId didn't have
human-friendly modes in their serde. #5335 added those, such that the
helpers used in serialization of HashMaps are no longer necessary.

## Summary of changes

- Add a unit test to ensure that this change isn't changing anything
about the serialized form
- Remove the serialization helpers for maps of Id
2023-11-29 10:39:12 +00:00
Konstantin Knizhnik
64890594a5 Optimize storing of null page in WAL (#5910)
## Problem

PG16 (https://github.com/neondatabase/postgres/pull/327) adds a new
function to SMGR: zeroextend.
Its implementation in Neon wal-logs zero pages of the extended
relation.
This zero page is wal-logged using XLOG_FPI.
Since the page is zero, the hole optimization (excluding from the image
everything between pd_lower and pd_upper) doesn't work.

## Summary of changes

In the case of a zero page (`PageIsNull()` returns true), assume
`hole_size=BLCKSZ`.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-11-29 12:08:20 +02:00
Arseny Sher
78e73b20e1 Notify safekeeper readiness with systemd.
To avoid downtime during deploys, as in busy regions the initial load can
currently take ~30s.
2023-11-29 14:07:06 +04:00
John Spray
c48cc020bd pageserver: fix race between deletion completion and incoming requests (#5941)
## Problem

This is a narrow race that can leave a stuck Stopping tenant behind,
while emitting a log error "Missing InProgress marker during tenant
upsert, this is a bug"

- Deletion request 1 puts tenant into Stopping state, and fires off
background part of DeleteTenantFlow
- Deletion request 2 acquires a SlotGuard for the same tenant ID, leaves
a TenantSlot::InProgress in place while it checks if the tenant's state
is acceptable.
- DeleteTenantFlow finishes, calls TenantsMap::remove, which removes the
InProgress marker.
- Deletion request 2 calls SlotGuard::revert, which upserts the old
value (the Tenant in Stopping state), and emits the telltale log
message.

Closes: #5936 

## Summary of changes

- Add a regression test which uses pausable failpoints to reproduce this
scenario.
- TenantsMap::remove is only called by DeleteTenantFlow. Its behavior is
tweaked to express the different possible states, especially
`InProgress`, which carries a barrier.
- In DeleteTenantFlow, if we see such a barrier result from remove(),
wait for the barrier and then try removing again.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-29 09:32:26 +00:00
dependabot[bot]
a15969714c build(deps): bump openssl from 0.10.57 to 0.10.60 in /test_runner/pg_clients/rust/tokio-postgres (#5966)
Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.57 to 0.10.60.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-29 02:17:15 +01:00
dependabot[bot]
8c195d8214 build(deps): bump cryptography from 41.0.4 to 41.0.6 (#5970)
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-29 02:16:35 +01:00
dependabot[bot]
0d16874960 build(deps): bump openssl from 0.10.55 to 0.10.60 (#5965)
Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.55 to 0.10.60.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-29 01:24:02 +01:00
Alexander Bayandin
fd440e7d79 neonvm: add pgbouncer patch to support DEALLOCATE/DISCARD ALL (#5958)
pgbouncer 1.21.0 doesn't play nicely with DEALLOCATE/DISCARD ALL if
prepared statement support is enabled (max_prepared_statements > 0).
There's a patch[0] that improves this (it will be included in the next
release of pgbouncer).

This PR applies this patch on top of the 1.21.0 release tarball.
For some reason, the tarball doesn't include `test/test_prepared.py`
(which is modified by the patch as well), so the patch can't be applied
cleanly. I use `filterdiff` (from the `patchutils` package) to apply
the required changes.

[0] a7b3c0a5f4
2023-11-28 23:43:24 +00:00
bojanserafimov
65160650da Add walingest test (#5892) 2023-11-28 12:50:53 -05:00
dependabot[bot]
12dd6b61df build(deps): bump aiohttp from 3.8.6 to 3.9.0 (#5946) 2023-11-28 17:47:15 +00:00
bojanserafimov
5345c1c21b perf readme fix (#5956) 2023-11-28 17:31:42 +00:00
Joonas Koivunen
105edc265c fix: remove layer_removal_cs (#5108)
Quest: https://github.com/neondatabase/neon/issues/4745. Follow-up to
#4938.

- add in locks for compaction and gc, so we don't have multiple
executions at the same time in tests
- remove layer_removal_cs
- remove waiting for uploads in eviction/gc/compaction
    - #4938 will keep the file resident until upload completes

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-28 19:15:21 +02:00
Shany Pozin
8625466144 Move run_initdb to be async and guarded by max of 8 running tasks. Fixes #5895. Use tenant.cancel for cancellation (#5921)
## Problem
https://github.com/neondatabase/neon/issues/5895
2023-11-28 14:49:31 +00:00
John Spray
1ab0cfc8cb pageserver: add sharding metadata to LocationConf (#5932)
## Problem

The TenantShardId in API URLs is sufficient to uniquely identify a
tenant shard, but not for it to function: it also needs to know its full
sharding configuration (stripe size, layout version) in order to map
keys to shards.

## Summary of changes

- Introduce ShardIdentity: this is the superset of ShardIndex (#5924 )
that is required for translating keys to shard numbers.
- Include ShardIdentity as an optional attribute of LocationConf
- Extend the public `LocationConfig` API structure with a flat
representation of shard attributes.

The net result is that at the point we construct a `Tenant`, we have a
`ShardIdentity` (inside LocationConf). This enables the next steps to
actually use the ShardIdentity to split WAL and validate that page
service requires are reaching the correct shard.
2023-11-28 13:14:51 +00:00
John Spray
ca469be1cf pageserver: add shard indices to layer metadata (#5928)
## Problem

For sharded tenants, the layer keys must include the shard number and
shard count, to disambiguate keys written by different shards in the
same tenant (shard number), and disambiguate layers written before and
after splits (shard count).

Closes: https://github.com/neondatabase/neon/issues/5924

## Summary of changes

There are no functional changes in this PR: everything behaves the same
for the default ShardIndex::unsharded() value. Actual construct of
sharded tenants will come next.

- Add a ShardIndex type: this is just a wrapper for a ShardCount and
ShardNumber. This is a subset of ShardIdentity: whereas ShardIdentity
contains enough information to filter page keys, ShardIndex contains just
enough information to construct a remote key. ShardIndex has a compact
encoding, the same as the shard part of TenantShardId (a rough sketch
follows this list).
- Store the ShardIndex as part of IndexLayerMetadata, if it is set to a
different value than ShardIndex::unsharded.
- Update RemoteTimelineClient and DeletionQueue to construct paths using
the layer metadata. Deletion code paths that previously just passed a
`Generation` now pass a full `LayerFileMetadata` to capture the shard as
well.
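
A rough sketch of the shape of ShardIndex as described above (illustrative
field types; the real definitions live in the repository crates):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct ShardNumber(pub u8);

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct ShardCount(pub u8);

// Just enough information to construct a remote layer key; the full
// ShardIdentity additionally carries what is needed to map page keys to shards.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct ShardIndex {
    pub shard_number: ShardNumber,
    pub shard_count: ShardCount,
}

impl ShardIndex {
    pub fn unsharded() -> Self {
        Self {
            shard_number: ShardNumber(0),
            shard_count: ShardCount(0),
        }
    }

    pub fn is_unsharded(self) -> bool {
        self.shard_count.0 == 0
    }
}
```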

Notes to reviewers:
- In deletion code paths, I could have used a (Generation, ShardIndex)
instead of the full LayerFileMetadata. I opted for the full object
partly for brevity, and partly because in future when we add checksums
the deletion code really will care about the full metadata in order to
validate that it is deleting what was intended.
- While ShardIdentity and TenantShardId could both use a ShardIndex, I
find that they read more cleanly as "flat" structs that spell out the
shard count and number field separately. Serialization code would need
writing out by hand anyway, because TenantShardId's serialized form is
not a serde struct-style serialization.
- ShardIndex doesn't _have_ to exist (we could use ShardIdentity
everywhere), but it is a worthwhile optimization, as we will have many
copies of this as part of layer metadata. In future the size difference
between ShardIndex and ShardIdentity may become larger if we implement
more sophisticated key distribution mechanisms (i.e. new values of
ShardIdentity::layout).

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-28 11:47:25 +00:00
Christian Schwarz
286f34dfce test suite: add method for generation-aware detachment of a tenant (#5939)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771
2023-11-28 09:51:37 +00:00
Sasha Krassovsky
f290b27378 Fix check for if shmem is valid to take into account detached shmem (#5937)
## Problem
We can segfault if we update connstr inside of a process that has
detached from shmem (e.g. inside stats collector)
## Summary of changes
Add a check to make sure we're not detached
2023-11-28 03:14:42 +00:00
Sasha Krassovsky
4cd18fcebd Compile wal2json (#5893)
Add wal2json extension
2023-11-27 18:17:26 -08:00
Anastasia Lubennikova
4c29e0594e Update neon extension relocatable for existing installations (#5943) 2023-11-27 23:29:24 +00:00
Anastasia Lubennikova
3c56a4dd18 Make neon extension relocatable to allow SET SCHEMA (#5942) 2023-11-27 21:45:41 +00:00
Conrad Ludgate
316309c85b channel binding (#5683)
## Problem

channel binding protects scram from sophisticated MITM attacks where the
attacker is able to produce 'valid' TLS certificates.

## Summary of changes

get the tls-server-end-point channel binding, and verify it is correct
for the SCRAM-SHA-256-PLUS authentication flow
2023-11-27 21:45:15 +00:00
Arpad Müller
e09bb9974c bootstrap_timeline: rename initdb_path to pgdata_path (#5931)
This is a rename without functional changes, in preparation for #5912.

Split off from #5912 as per review request.
2023-11-27 20:14:39 +00:00
Anastasia Lubennikova
5289f341ce Use test specific directory in test_remote_extensions (#5938) 2023-11-27 18:57:58 +00:00
Joonas Koivunen
683ec2417c deflake: test_live_reconfig_get_evictions_low_residence_... (#5926)
- disable extra tenant
- disable compaction which could try to repartition while we assert

Split from #5108.
2023-11-27 15:20:54 +02:00
Christian Schwarz
a76a503b8b remove confusing no-op .take() of init_tenant_load_remote (#5923)
The `Tenant::spawn()` method already `.take()`s it.

I think this was an oversight in
https://github.com/neondatabase/neon/pull/5580 .
2023-11-27 12:50:19 +00:00
Anastasia Lubennikova
92bc2bb132 Refactor remote extensions feature to request extensions from proxy (#5836)
instead of direct S3 request.

Pros:
- simplify code a lot (no need to provide AWS credentials and paths);
- reduce latency of downloading extension data, as the proxy resides near
computes;
- reduce AWS costs, as the proxy has a cache and 1000 computes asking for
the same extension will not generate 1000 downloads from S3;
- we can use only one S3 bucket to store extensions (and get rid of the
regional buckets which were introduced to reduce latency);

Changes:
- deprecate the remote-ext-config compute_ctl parameter; use
http://pg-ext-s3-gateway if any old-format remote-ext-config is provided;
- refactor tests to use a mock http server;
2023-11-27 12:10:23 +00:00
John Spray
b80b9e1c4c pageserver: remove defunct local timeline delete markers (#5699)
## Problem

Historically, we treated the presence of a timeline on local disk as
evidence that it logically exists. Since #5580 that is no longer the
case, so we can always rely on remote storage. If we restart and the
timeline is gone in remote storage, we will also purge it from local
disk: no need for a marker.

Reference on why this PR is for timeline markers and not tenant markers:
https://github.com/neondatabase/neon/issues/5080#issuecomment-1783187807

## Summary of changes

Remove code paths that read + write deletion marker for timelines.

Leave code path that deletes these markers, just in case we deploy while
there are some in existence. This can be cleaned up later.
(https://github.com/neondatabase/neon/issues/5718)
2023-11-27 09:31:20 +00:00
Anastasia Lubennikova
87b8ac3ec3 Only create neon extension in postgres database; (#5918)
Create neon extension in neon schema.
2023-11-26 08:37:01 +00:00
Joonas Koivunen
6b1c4cc983 fix: long timeline create cancelled by tenant delete (#5917)
Fix the fallible vs. infallible check order with
`UninitTimeline::finish_creation` so that the incomplete timeline can be
removed. Currently the order of drop guard unwrapping causes uninit
files to be left on pageserver, blocking the tenant deletion.

Cc: #5914
Cc: #investigation-2023-11-23-stuck-tenant-deletion
2023-11-24 16:17:56 +00:00
Joonas Koivunen
831fad46d5 tests: fix allowed_error for compaction detecting a shutdown (#5919)
This has been causing flaky tests, [example evidence].

Follow-up to #5883 where I forgot to fix this.

[example evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5917/6981540065/index.html#suites/9d2450a537238135fd4007859e09aca7/6fd3556a879fa3d1
2023-11-24 16:14:32 +00:00
Joonas Koivunen
53851ea8ec fix: log cancelled request handler errors (#5915)
Noticed during an [investigation] with @problame: a major point of lost
error logging which would have sped up the investigation.
Cc: #5815

[investigation]:
https://neondb.slack.com/archives/C066ZFAJU85/p1700751858049319
2023-11-24 15:54:06 +02:00
Joonas Koivunen
044375732a test: support validating allowed_errors against a logfile (#5905)
this will make it easier to test if an added allowed_error does in fact
match for example against a log file from an allure report.

```
$ python3 test_runner/fixtures/pageserver/allowed_errors.py --help
usage: allowed_errors.py [-h] [-i INPUT]

check input against pageserver global allowed_errors

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Pageserver logs file. Reads from stdin if no file is provided.
```

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-11-24 12:43:25 +00:00
Konstantin Knizhnik
ea63b43009 Check if LFC was initialized in local_cache_pages function (#5911)
## Problem

There is no check that LFC is initialised (`lfc_max_size != 0`) in the
`local_cache_pages` function.

## Summary of changes

Add proper check.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-11-24 08:23:00 +02:00
Conrad Ludgate
a56fd45f56 proxy: fix memory leak again (#5909)
## Problem

The `connections.join_next()` call helped, but it wasn't enough. The way I
implemented the improvement before was still faulty, but it mostly worked,
so it looked like it was working correctly.

From [`tokio::select`
docs](https://docs.rs/tokio/latest/tokio/macro.select.html):
> 4. Once an <async expression> returns a value, attempt to apply the
value to the provided <pattern>, if the pattern matches, evaluate
<handler> and return. If the pattern does not match, disable the current
branch and for the remainder of the current call to select!. Continue
from step 3.

The `connections.join_next()` future would complete and `Some(Err(e))`
branch would be evaluated but not match (as the future would complete
without panicking, we would hope). Since the branch doesn't match, it's
disabled. The select continues but never attempts to call `join_next`
again. Getting unlucky, more TCP connections are created than we attempt
to join_next.

## Summary of changes

Replace the `Some(Err(e))` pattern with `Some(e)`. Because of the
auto-disabling feature, we don't need the `if !connections.is_empty()`
step as the `None` pattern will disable it for us.
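
Schematically, the fixed loop looks like this (illustrative, not the proxy's
exact code):

```rust
use tokio::net::TcpListener;
use tokio::sync::watch;
use tokio::task::JoinSet;

async fn accept_loop(listener: TcpListener, mut shutdown: watch::Receiver<bool>) {
    let mut connections: JoinSet<()> = JoinSet::new();
    loop {
        tokio::select! {
            Ok((socket, _peer)) = listener.accept() => {
                connections.spawn(async move {
                    // handle the client connection here
                    drop(socket);
                });
            }
            // Before the fix this branch used the pattern `Some(Err(e))`: a task
            // finishing with Ok(()) failed the pattern match, tokio disabled the
            // branch for the rest of the select!, and completed tasks piled up.
            // Matching `Some(res)` and checking for a panic inside avoids that.
            Some(res) = connections.join_next() => {
                if let Err(e) = res {
                    tracing::error!("connection task panicked: {}", e);
                }
            }
            _ = shutdown.changed() => break,
        }
    }
}
```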
2023-11-23 19:11:24 +00:00
Anastasia Lubennikova
582a42762b update extension version in test_neon_extension 2023-11-23 18:53:03 +00:00
Anastasia Lubennikova
f5dfa6f140 Create extension neon in existing databases too 2023-11-23 18:53:03 +00:00
Anastasia Lubennikova
f8d9bd8d14 Add extension neon to all databases.
- Run CREATE EXTENSION neon for template1, so that it gets created in all databases.
- Run ALTER EXTENSION neon in all databases, to always have the newest version of the extension in computes.
- Add test_neon_extension test
2023-11-23 18:53:03 +00:00
Anastasia Lubennikova
04e6c09f14 Add pgxn/neon/README.md 2023-11-23 18:53:03 +00:00
Arpad Müller
54327bbeec Upload initdb results to S3 (#5390)
## Problem

See #2592

## Summary of changes

Compresses the result of initdb into a .tar.zst file and uploads it
to S3, to enable its use in recovery from an LSN.

Generations should not be involved, I think, because we do this only once
at the very beginning of a timeline.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-23 18:11:52 +00:00
Shany Pozin
35f243e787 Move weekly release PR trigger to Monday morning (#5908) 2023-11-23 19:09:34 +02:00
Shany Pozin
b7a988ba46 Support cancellation for find_lsn_for_timestamp API (#5904)
## Problem
#5900
## Summary of changes
Added cancellation token as param in all relevant code paths and actually used it in the find_lsn_for_timestamp main loop
2023-11-23 17:08:32 +02:00
Christian Schwarz
a0e61145c8 fix: cleanup of layers from the future can race with their re-creation (#5890)
fixes https://github.com/neondatabase/neon/issues/5878
obsoletes https://github.com/neondatabase/neon/issues/5879

Before this PR, it could happen that `load_layer_map` schedules removal of a
"future" image layer. A later compaction run could then re-create the same
image layer, scheduling a PUT.
Due to the lack of an upload queue barrier, the PUT and DELETE could be
re-ordered.
The result was an IndexPart referencing a non-existent object.

## Summary of changes

* Add support to `pagectl` / Python tests to decode `IndexPart`
  * Rust
    * new `pagectl` Subcommand
* `IndexPart::{from,to}_s3_bytes()` methods to internalize knowledge
about encoding of `IndexPart`
  * Python
    * new `NeonCli` subclass
* Add regression test
  * Rust
* Ability to force repartitioning; required to ensure image layer
creation at last_record_lsn
  * Python
    * The regression test.
* Fix the issue
  * Insert an `UploadOp::Barrier` after scheduling the deletions.
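
Schematically (generic queue types for illustration; the real upload queue
lives in the remote timeline client):

```rust
use std::collections::VecDeque;

// Operations queued before a Barrier must all complete before anything queued
// after it may start, so a DELETE scheduled at load time can no longer be
// reordered past a later PUT of a re-created layer with the same name.
#[allow(dead_code)]
enum UploadOp {
    Put(String),
    Delete(String),
    Barrier,
}

fn schedule_future_layer_cleanup(queue: &mut VecDeque<UploadOp>, future_layers: Vec<String>) {
    for layer_name in future_layers {
        queue.push_back(UploadOp::Delete(layer_name));
    }
    // the fix: fence off the deletions from any uploads scheduled later
    queue.push_back(UploadOp::Barrier);
}
```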
2023-11-23 13:33:41 +00:00
Konstantin Knizhnik
6afbadc90e LFC fixes + statistics (#5727)
## Problem

## Summary of changes

See #5500

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-11-23 08:59:19 +02:00
Anastasia Lubennikova
2a12e9c46b Add documentation for our sample pre-commit hook (#5868) 2023-11-22 12:04:36 +00:00
Christian Schwarz
9e3c07611c logging: support output to stderr (#5896)
(part of the getpage benchmarking epic #5771)

The plan is to make the benchmarking tool log on stderr and emit results
as JSON on stdout. That way, the test suite can simply take captures
stdout and json.loads() it, while interactive users of the benchmarking
tool have a reasonable experience as well.

Existing logging users continue to print to stdout, so, this change
should be a no-op functionally and performance-wise.
2023-11-22 11:08:35 +00:00
Christian Schwarz
d353fa1998 refer to our rust-postgres.git fork by branch name (#5894)
This way, `cargo update -p tokio-postgres` just works. The `Cargo.toml`
communicates more clearly that we're referring to the `main` branch. And
the git revision is still pinned in `Cargo.lock`.
2023-11-22 10:58:27 +00:00
Joonas Koivunen
0d10992e46 Cleanup compact_level0_phase1 fsyncing (#5852)
While reviewing code I noticed a scary `layer_paths.pop().unwrap()` and then
realized this should be further asyncified, something I forgot to do
when I switched `compact_level0_phase1` back to async in #4938.

This keeps the double-fsync for new deltas as #4749 is still unsolved.
2023-11-21 15:30:40 +02:00
Arpad Müller
3e131bb3d7 Update Rust to 1.74.0 (#5873)
[Release notes](https://github.com/rust-lang/rust/releases/tag/1.74.0).
2023-11-21 11:41:41 +01:00
Sasha Krassovsky
81b2cefe10 Disallow CREATE DATABASE WITH OWNER neon_superuser (#5887)
## Problem
Currently, control plane doesn't know about neon_superuser, so if a user
creates a database with owner neon_superuser it causes an exception when
it tries to forward it. It is also currently possible to ALTER ROLE
neon_superuser.

## Summary of changes
Disallow creating a database with owner neon_superuser. This is probably
fine, since I don't think you can create a database owned by a normal
superuser. Also forbid altering neon_superuser.
2023-11-20 22:39:47 +00:00
Christian Schwarz
d2ca410919 build: back to opt-level=0 in debug builds, for faster compile times (#5751)
This change brings down incremental compilation for me
from > 1min to 10s (and this is a pretty old Ryzen 1700X).

More details: "incremental compilation" here means to change one
character
in the `failed to read value from offset` string in `image_layer.rs`.
The command for incremental compilation is `cargo build_testing`.
The system on which I got these numbers uses `mold` via
`~/.cargo/config.toml`.

As a bonus, `rust-gdb` is now at least a little fun again.

Some tests are timing out in debug builds due to these changes.
This PR makes them skip for debug builds.
We run both with debug and release build, so, the loss of coverage is
marginal.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-11-20 15:41:37 +01:00
Joonas Koivunen
d98ac04136 chore(background_tasks): missed allowed_error change, logging change (#5883)
- I am always confused by the log for the error wait time; now it will
be `2s` or `2.0s`, not `2.0`
- fix missed string change introduced in #5881 [evidence]

[evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/6921062837/index.html#suites/f9eba3cfdb71aa6e2b54f6466222829b/87897fe1ddee3825
2023-11-20 07:33:17 +00:00
Joonas Koivunen
ac08072d2e fix(layer): VirtualFile opening and read errors can be caused by contention (#5880)
A very low number of layer loads have been wrongly marked as permanent
failures, as I did not remember that `VirtualFile::open` or reading could
fail transiently due to contention. Return separate errors for transient and
permanent failures from `{Delta,Image}LayerInner::load`.

Includes drive-by comment changes.

The implementation looks quite ugly because the same type serves as both
the inner (operation error) and the outer (critical error), but with the
alternatives I tried I did not find a better way.
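
A minimal sketch of the split (illustrative types, not the actual ones):

```rust
#[derive(Debug)]
enum LoadError {
    // e.g. VirtualFile::open or a read failed because of fd-cache contention;
    // the caller may simply retry the layer access.
    Transient(std::io::Error),
    // e.g. the layer file fails validation; retrying will not help.
    Permanent(anyhow::Error),
}

// Example caller policy: retry a bounded number of times on Transient errors.
async fn load_with_retries<F, Fut, T>(mut attempt: F) -> Result<T, LoadError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, LoadError>>,
{
    let mut tries = 0;
    loop {
        match attempt().await {
            Err(LoadError::Transient(_)) if tries < 3 => tries += 1,
            other => return other,
        }
    }
}
```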
2023-11-19 14:57:39 +00:00
John Spray
d22dce2e31 pageserver: shut down idle walredo processes (#5877)
The longer a pageserver runs, the more walredo processes it accumulates
from tenants that are touched intermittently (e.g. by availability
checks). This can lead to getting OOM killed.

Changes:
- Add an Instant recording the last use of the walredo process for a
tenant
- After each compaction iteration in the background task, check for idleness
and stop the walredo process if it has been idle for more than 10x the
compaction period (a rough sketch follows this list).
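
A rough sketch of the idle check (field and function names are illustrative,
not the real ones):

```rust
use std::time::{Duration, Instant};

struct WalRedoProcessHandle {
    last_used: Instant,
}

// After a compaction iteration, shut down the walredo process if it has been
// unused for more than 10 compaction periods.
fn maybe_stop_idle_walredo(
    process: Option<&WalRedoProcessHandle>,
    compaction_period: Duration,
    stop: impl FnOnce(),
) {
    if let Some(p) = process {
        if p.last_used.elapsed() > compaction_period * 10 {
            stop();
        }
    }
}
```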

Cc: #3620

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Shany Pozin <shany@neon.tech>
2023-11-19 14:21:16 +00:00
Joonas Koivunen
3b3f040be3 fix(background_tasks): first backoff, compaction error stacktraces (#5881)
The first compaction/gc error backoff starts from 0, which is less than the
2s it was before #5672. This is now fixed to be the intended 2**n.
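
The intended schedule, as a sketch (the cap value is illustrative):

```rust
use std::time::Duration;

// The n-th consecutive error (n >= 1) waits 2^n seconds: 2s, 4s, 8s, ...,
// capped here at an illustrative 300s.
fn error_backoff(consecutive_errors: u32) -> Duration {
    let n = consecutive_errors.max(1).min(30);
    Duration::from_secs((1u64 << n).min(300))
}
```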

Additionally, noticed that `compaction_iteration` creating an
`anyhow::Error` via `into()` always captures a stacktrace, even if we already
had a stacktrace-carrying anyhow error within the CompactionError, because
there is no stable API for querying that.
Em Sharnoff
cad0dca4b8 compute_ctl: Remove deprecated flag --file-cache-on-disk (#5622)
See neondatabase/cloud#7516 for more.
2023-11-18 12:43:54 +01:00
Em Sharnoff
5d13a2e426 Improve error message when neon.max_cluster_size reached (#4173)
Changes the error message encountered when the `neon.max_cluster_size`
limit is reached. Reasoning is that this is user-visible, and so should
*probably* use language that's closer to what users are familiar with.
2023-11-16 21:51:26 +00:00
khanova
0c243faf96 Proxy log pid hack (#5869)
## Problem

Improve observability for the compute node.

## Summary of changes

Log pid from the compute node. Doesn't work with pgbouncer.
2023-11-16 20:46:23 +00:00
Em Sharnoff
d0a842a509 Update vm-builder to v0.19.0 and move its customization here (#5783)
ref neondatabase/autoscaling#600 for more
2023-11-16 18:17:42 +01:00
khanova
6b82f22ada Collect number of connections by sni type (#5867)
## Problem

We don't know the number of users with the different kind of
authentication: ["sni", "endpoint in options" (A and B from
[here](https://neon.tech/docs/connect/connection-errors)),
"password_hack"]

## Summary of changes

Collect metrics by sni kind.
2023-11-16 12:19:13 +00:00
John Spray
ab631e6792 pageserver: make TenantsMap shard-aware (#5819)
## Problem

When using TenantId as the key, we are unable to handle multiple tenant
shards attached to the same pageserver for the same tenant ID. This is
an expected scenario if we have e.g. 8 shards and 5 pageservers.

## Summary of changes

- TenantsMap is now a BTreeMap instead of a HashMap: this enables
looking up by range. In future, we will need this for page_service, as
incoming requests will just specify the Key, and we'll have to figure
out which shard to route it to.
- A new key type TenantShardId is introduced, to act as the key in
TenantsMap, and as the id type in external APIs. Its human readable
serialization is backward compatible with TenantId, and also
forward-compatible as long as sharding is not actually used (when we
construct a TenantShardId with ShardCount(0), it serializes to an
old-fashioned TenantId).
- Essential tenant APIs are updated to accept TenantShardIds:
tenant/timeline create, tenant delete, and /location_conf. These are the
APIs that will enable driving sharded tenants. Other apis like /attach
/detach /load /ignore will not work with sharding: those will soon be
deprecated and replaced with /location_conf as part of the live
migration work.

Closes: #5787
2023-11-15 23:20:21 +02:00
Alexander Bayandin
f84ac2b98d Fix baseline commit and branch for code coverage (#5769)
## Problem

`HEAD` commit for a PR is a phantom merge commit which skews the baseline
commit for coverage reports.

See
https://github.com/neondatabase/neon/pull/5751#issuecomment-1790717867

## Summary of changes
- Use commit hash instead of `HEAD` for finding baseline commits for
code coverage
- Use the base branch for PRs or the current branch for pushes
2023-11-15 12:40:21 +01:00
dependabot[bot]
5cd5b93066 build(deps): bump aiohttp from 3.8.5 to 3.8.6 (#5864) 2023-11-15 11:08:49 +00:00
khanova
2f0d245c2a Proxy control plane rate limiter (#5785)
## Problem

Proxy might overload the control plane.

## Summary of changes

Implement rate limiter for proxy<->control plane connection.
Resolves https://github.com/neondatabase/neon/issues/5707

Used implementation ideas from https://github.com/conradludgate/squeeze/
2023-11-15 09:15:59 +00:00
Joonas Koivunen
462f04d377 Smaller test addition and change (#5858)
- trivial serialization roundtrip test for
`pageserver::repository::Value`
- add missing `start_paused = true` to 15s test making it <0s test
- completely unrelated future clippy lint avoidance (helps beta channel
users)
2023-11-14 18:04:34 +01:00
Arpad Müller
31a54d663c Migrate links from wiki to notion (#5862)
See the slack discussion:
https://neondb.slack.com/archives/C033A2WE6BZ/p1696429688621489?thread_ts=1695647103.117499
2023-11-14 15:36:47 +00:00
John Spray
7709c91fe5 neon_local: use remote storage by default, add cargo neon tenant migrate (#5760)
## Problem

Currently the only way to exercise tenant migration is via python test
code. We need a convenient way for developers to do it directly in a
neon local environment.

## Summary of changes

- Add a `--num-pageservers` argument to `cargo neon init` so that it's
easy to run with multiple pageservers
- Modify default pageserver overrides in neon_local to set up `LocalFs`
remote storage, as any migration/attach/detach stuff doesn't work in the
legacy local storage mode. This also unblocks removing the pageserver's
support for the legacy local mode.
- Add a new `cargo neon tenant migrate` command that orchestrates tenant
migration, including endpoints.
2023-11-14 09:51:51 +00:00
Arpad Müller
f7249b9018 Fix comment in find_lsn_for_timestamp (#5855)
We still subtract 1 from `low` to compute `commit_lsn`. The comment
moved/added by #5844 should point this out.
2023-11-11 00:32:00 +00:00
Joonas Koivunen
74d150ba45 build: upgrade ahash (#5851)
`cargo deny` was complaining the version 0.8.3 was yanked (for possible
DoS attack [wiki]), but the latest version (0.8.5) also includes aarch64
fixes which may or may not be relevant. Our usage of ahash is limited to
proxy, but I don't think we are at any risk.

[wiki]: https://github.com/tkaitchuck/aHash/wiki/Yanked-versions
2023-11-10 19:10:54 +00:00
Joonas Koivunen
b7f45204a2 build: deny async-std and friends (#5849)
rationale: some crates pull these in as default; hopefully these hints
will require less cleanup-after and Cargo.lock file watching.

follow-up to #5848.
2023-11-10 18:02:22 +01:00
Joonas Koivunen
a05f104cce build: remove async-std dependency (#5848)
Introduced by accident (missing `default-features = false`) in
e09d5ada6a. We directly need only `http_types::StatusCode`.
2023-11-10 16:05:21 +02:00
John Spray
d672e44eee pageserver: error type for collect_keyspace (#5846)
## Problem

This is a log hygiene fix, for an occasional test failure.

warn-level logging in imitate_timeline_cached_layer_accesses can't
distinguish actual errors from shutdown cases.

## Summary of changes

Replaced anyhow::Error with an explicit CollectKeySpaceError type, that
includes conversion from PageReconstructError::Cancelled.
2023-11-10 13:58:18 +00:00
Rahul Modpur
a6f892e200 metric: add started and killed walredo processes counter (#5809)
In OOM situations, knowing exactly how many walredo processes there were
at a time would help afterwards to understand why the pageserver was OOM
killed. Add a `pageserver_wal_redo_process_total` metric to keep track of the
total number of walredo processes started, shut down, and killed since
pageserver start.

Closes #5722

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <me@cschwarz.com>
2023-11-10 15:05:22 +02:00
Alexander Bayandin
71b380f90a Set BUILD_TAG for build-neon job (#5847)
## Problem

I've added `BUILD_TAG` to docker images.
(https://github.com/neondatabase/neon/pull/5812), but forgot to add it
to services that we build for tests

## Summary of changes
- Set `BUILD_TAG` in `build-neon` job
2023-11-10 12:49:52 +00:00
Alexander Bayandin
6e145a44fa workflows/neon_extra_builds: run check-codestyle-rust & build-neon on arm64 (#5832)
## Problem

Some developers use workstations with arm CPUs, and sometimes x86-64
code is not fully compatible with them (for example,
https://github.com/neondatabase/neon/pull/5827).
Although we don't have arm CPUs in prod (yet?), it is worth having
some basic checks for this architecture to have a better developer
experience.

Closes https://github.com/neondatabase/neon/issues/5829

## Summary of changes
- Run `check-codestyle-rust`-like & `build-neon`-like jobs on Arm runner
- Add `run-extra-build-*` label to run all available extra builds
2023-11-10 12:45:41 +00:00
Arpad Müller
8e5e3971ba find_lsn_for_timestamp fixes (#5844)
Includes the changes of #3689 that address point 1 of #3689, plus some
further improvements. In particular, this PR does:

* set `min_lsn` to a safe value to create branches from (and verify it
in tests)
* return `min_lsn` instead of `max_lsn` for `NoData` and `Past` (verify
it in test for `Past`, `NoData` is harder and not as important)
* return `commit_lsn` instead of `max_lsn` for Future (and verify it in
the tests)
* add some comments

Split out of #5686 to get something more minimal out to users.
2023-11-10 13:38:44 +01:00
Joonas Koivunen
8dd29f1e27 fix(pageserver): spawn all kinds of tenant shutdowns (#5841)
Minor bugfix, something noticed during manual code review. Use the same
joinset for in-progress tenants so we get the benefit of the buffered
logging just as we do for attached tenants, and no single in-progress task
can hold up shutdown of other tenants.
2023-11-09 21:36:57 +00:00
Joonas Koivunen
f5344fb85a temp: log all layer loading errors while we lose them (#5816)
Temporary workaround while some errors are not being logged.

Cc: #5815.
2023-11-09 21:31:53 +00:00
Arpad Müller
f95f001b8b Lsn for get_timestamp_of_lsn should be string, not integer (#5840)
The `get_timestamp_of_lsn` pageserver endpoint has been added in #5497,
but the yml it added was wrong: the lsn is expected in hex format, not
in integer (decimal) format.
2023-11-09 16:12:18 +00:00
John Spray
e0821e1eab pageserver: refined Timeline shutdown (#5833)
## Problem

We have observed the shutdown of a timeline taking a long time when a
deletion arrives at a busy time for the system. This suggests that we
are not respecting cancellation tokens promptly enough.

## Summary of changes

- Refactor timeline shutdown so that rather than having a shutdown()
function that takes a flag for optionally flushing, there are two
distinct functions, one for graceful flushing shutdown, and another that
does the "normal" shutdown where we're just setting a cancellation token
and then tearing down as fast as we can. This makes things a bit easier
to reason about, and enables us to remove the hand-written variant of
shutdown that was maintained in `delete.rs`
- Layer flush task checks cancellation token more carefully
- Logical size calculation's handling of cancellation tokens is
simplified: rather than passing one in, it respects the Timeline's
cancellation token.

This PR doesn't touch RemoteTimelineClient, which will be a key thing to
fix as well, so that a slow remote storage op doesn't hold up shutdown.
2023-11-09 16:02:59 +00:00
bojanserafimov
4469b1a62c Fix blob_io test (#5818) 2023-11-09 10:47:03 -05:00
Joonas Koivunen
842223b47f fix(metric): remove pageserver_wal_redo_wait_seconds (#5791)
the meaning of the values recorded in this histogram changed with #5560
and we never had it visualized as a histogram, just the
`increase(_sum)`. The histogram is not too interesting to look at, so
remove it per discussion in [slack
thread](https://neondb.slack.com/archives/C063LJFF26S/p1699008316109999?thread_ts=1698852436.637559&cid=C063LJFF26S).
2023-11-09 16:40:52 +02:00
Anna Stepanyan
893616051d Update epic-template.md (#5709)
replace the checkbox list with a proper task list in the epic template

NB: this PR does not change the code, it only touches the github issue
templates
2023-11-09 15:24:43 +01:00
Conrad Ludgate
7cdde285a5 proxy: limit concurrent wake_compute requests per endpoint (#5799)
## Problem

A user can perform many database connections at the same instant of time
- these will all cache miss and materialise as requests to the control
plane. #5705

## Summary of changes

I am using a `DashMap` (a sharded `RwLock<HashMap>`) of endpoints ->
semaphores to apply a limiter. If the limiter is enabled (permits > 0),
the semaphore will be retrieved per endpoint and a permit will be
awaited before continuing to call the wake_compute endpoint.
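
A rough sketch of the limiter described above (types and names are
illustrative, not the proxy's actual implementation):

```rust
use std::sync::Arc;

use dashmap::DashMap;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

struct WakeComputeLimiter {
    permits_per_endpoint: usize,
    semaphores: DashMap<String, Arc<Semaphore>>,
}

impl WakeComputeLimiter {
    // Look up (or lazily create) the endpoint's semaphore and wait for a
    // permit; the permit is held for the duration of the wake_compute call.
    async fn acquire(&self, endpoint: &str) -> OwnedSemaphorePermit {
        let semaphore = Arc::clone(
            self.semaphores
                .entry(endpoint.to_string())
                .or_insert_with(|| Arc::new(Semaphore::new(self.permits_per_endpoint)))
                .value(),
        );
        // note: the DashMap guard is dropped before we await the permit
        semaphore.acquire_owned().await.expect("semaphore is never closed")
    }
}
```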

### Important details

This dashmap would grow uncontrollably without maintenance. It's not a
cache so I don't think an LRU-based reclamation makes sense. Instead,
I've made use of the sharding functionality of DashMap to lock a single
shard and clear out unused semaphores periodically.

I ran a test in release, using 128 tokio tasks among 12 threads each
pushing 1000 entries into the map per second, clearing a shard every 2
seconds (64 second epoch with 32 shards). The endpoint names were
sampled from a gamma distribution to make sure some overlap would occur,
and each permit was held for 1ms. The histogram for time to clear each
shard settled between 256-512us without any variance in my testing.

Holding a lock for under a millisecond for 1 of the shards does not
concern me as blocking
2023-11-09 14:14:30 +00:00
John Spray
9c30883c4b remote_storage: use S3 SDK's adaptive retry policy (#5813)
## Problem

Currently, we aren't doing any explicit slowdown in response to 429
responses. Recently, as we hit remote storage a bit harder (pageserver
does more ListObjectsv2 requests than it used to since #5580 ), we're
seeing storms of 429 responses that may be the result of not just doing
too may requests, but continuing to do those extra requests without
backing off any more than our usual backoff::exponential.

## Summary of changes

Switch from AWS's "Standard" retry policy to "Adaptive" -- docs describe
this as experimental but it has been around for a long time. The main
difference between Standard and Adaptive is that Adaptive rate-limits
the client in response to feedback from the server, which is meant to
avoid scenarios where the client would otherwise repeatedly hit
throttling responses.
2023-11-09 13:50:13 +00:00
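Roughly, the switch looks like the snippet below, assuming the `aws-config` and
`aws-sdk-s3` crates; the exact builder calls can differ between SDK versions:

```rust
use aws_config::retry::RetryConfig;

#[tokio::main]
async fn main() {
    // Adaptive retries rate-limit the client based on throttling feedback (429s)
    // from the server, instead of only backing off per request.
    let sdk_config = aws_config::from_env()
        .retry_config(RetryConfig::adaptive().with_max_attempts(10))
        .load()
        .await;
    let _s3 = aws_sdk_s3::Client::new(&sdk_config);
}
```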
Arthur Petukhovsky
0495798591 Fix walproposer build on aarch64 (#5827)
There was a compilation error due to `std::ffi::c_char` being a different type on different platforms. Clippy also complained for a similar reason.
2023-11-09 13:05:17 +00:00
Sasha Krassovsky
87389bc933 Add test simulating bad connection between pageserver and compute (#5728)
## Problem
We have a funny 3-day timeout for connections between the compute and
pageserver. We want to get rid of it, so to do that we need to make sure
the compute is resilient to connection failures.

Closes: https://github.com/neondatabase/neon/issues/5518

## Summary of changes
This test makes the pageserver randomly drop the connection if the
failpoint is enabled, and ensures we can keep querying the pageserver.

This PR also reduces the default timeout to 10 minutes from 3 days.
2023-11-08 19:48:57 +00:00
Arpad Müller
ea118a238a JWT logging improvements (#5823)
* lower level on auth success from info to debug (fixes #5820)
* don't log stacktraces on auth errors (as requested on slack). we do this by introducing an `AuthError` type instead of using `anyhow` and `bail`.
* return errors that have been censored for improved security.
2023-11-08 16:56:53 +00:00
Christian Schwarz
e9b227a11e cleanup unused RemoteStorage fields (#5830)
Found this while working on #5771
2023-11-08 16:54:33 +00:00
John Spray
40441f8ada pageserver: use Gate for stronger safety check in SlotGuard (#5793)
## Problem

#5711 and #5367 raced -- the `SlotGuard` type needs `Gate` to properly
enforce its invariant that we may not drop an `Arc<Tenant>` from a slot.

## Summary of changes

Replace the TODO with the intended check of Gate.
2023-11-08 13:00:11 +00:00
John Spray
a8a39cd464 test: de-flake test_deletion_queue_recovery (#5822)
## Problem

This test could fail if timing is unlucky, and the deletions in the test
land in two deletion lists instead of one.

## Summary of changes

We await _some_ validations instead of _all_ validations, because our
execution failpoint
will prevent validation proceeding for any but the first DeletionList.
Usually the workload
just generates one, but if it generates two due to timing, then we must
not expect that the
second one will be validated.
2023-11-08 12:41:48 +00:00
John Spray
b989ad1922 extend test_change_pageserver for failure case, rework changing pageserver (#5693)
Reproducer for https://github.com/neondatabase/neon/issues/5692

The test change in this PR intentionally fails, to demonstrate the
issue.

---------

Co-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>
2023-11-08 11:26:56 +00:00
Em Sharnoff
acef742a6e vm-monitor: Remove dependency on workspace_hack (#5752)
neondatabase/autoscaling builds libs/vm-monitor during CI because it's a
necessary component of autoscaling.

workspace_hack includes a lot of crates that are not necessary for
vm-monitor, which artificially inflates the build time on the
autoscaling side, so hopefully removing the dependency should speed
things up.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 09:41:20 -08:00
duguorong009
11d9d801b5 pageserver: improve the shutdown log error (#5792)
## Problem
- Close #5784 

## Summary of changes
- Update the `GetActiveTenantError` -> `QueryError` conversion process
in `pageserver/src/page_service.rs`
- Update the pytest logging exceptions in
`./test_runner/regress/test_tenant_detach.py`
2023-11-07 16:57:26 +00:00
Andrew Rudenko
fc47af156f Passing neon options to the console (#5781)
The idea is to pass neon_*-prefixed options to the control plane. They can be
used by cplane to dynamically create timelines and computes. Such
options should also be excluded from what is passed to compute. Another issue
is how connection caching works now: because a compute instance now
depends not only on the hostname but probably on such options too, I included
them in the cache key.
2023-11-07 16:49:26 +01:00
Arpad Müller
e310533ed3 Support JWT key reload in pageserver (#5594)
## Problem

For quickly rotating JWT secrets, we want to be able to reload the JWT
public key file in the pageserver, and also support multiple JWT keys.

See #4897.

## Summary of changes

* Allow directories for the `auth_validation_public_key_path` config
param instead of just files. For the safekeepers, all of their config options
also support multiple JWT keys.
* For the pageservers, make the JWT public keys easily globally swappable
by using the `arc-swap` crate.
* Add an endpoint to the pageserver, triggered by a POST to
`/v1/reload_auth_validation_keys`, that reloads the JWT public keys from
the pre-configured path (for security reasons, you cannot upload any
keys yourself).

Fixes #4897

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 15:43:29 +01:00
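A minimal sketch of the `arc-swap` pattern described above; `JwtKey`, the
global, and the reload trigger are illustrative, not the pageserver's actual
types:

```rust
use std::sync::Arc;

use arc_swap::ArcSwap;
use once_cell::sync::Lazy;

/// Illustrative key type; the real code holds parsed public keys.
struct JwtKey(String);

static PUBLIC_KEYS: Lazy<ArcSwap<Vec<JwtKey>>> =
    Lazy::new(|| ArcSwap::from_pointee(Vec::new()));

/// Called e.g. from a handler for POST /v1/reload_auth_validation_keys after
/// re-reading the configured key directory.
fn swap_in_new_keys(new_keys: Vec<JwtKey>) {
    PUBLIC_KEYS.store(Arc::new(new_keys));
}

fn have_any_key() -> bool {
    // Readers take a cheap, lock-free snapshot of the current key set.
    !PUBLIC_KEYS.load().is_empty()
}

fn main() {
    assert!(!have_any_key());
    swap_in_new_keys(vec![JwtKey("-----BEGIN PUBLIC KEY-----".to_string())]);
    assert!(have_any_key());
}
```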
John Spray
1d68f52b57 pageserver: move deletion failpoint inside backoff (#5814)
## Problem

When enabled, this failpoint would busy-spin in a loop that emits log
messages.

## Summary of changes

Move the failpoint inside a backoff::exponential block: it will still
spam the log, but at much lower rate.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 14:25:51 +00:00
Alexander Bayandin
4cd47b7d4b Dockerfile: Set BUILD_TAG for storage services (#5812)
## Problem

https://github.com/neondatabase/neon/pull/5576 added `build-tag`
reporting to `libmetrics_build_info`, but it's not reported because we
didn't set the corresponding env variable in the build process.

## Summary of changes
- Add `BUILD_TAG` env var while building services
2023-11-07 13:45:59 +00:00
Fernando Luz
0141c95788 build: Add warning when missing postgres submodule during the build (#5614)
I forked the project and, in my local repo, I wasn't able to compile it; in my
search, I found the solution in the Neon forum. After a PR
discussion, I made a change in the Makefile to warn about the missing `git
submodule update` step.

---------

Signed-off-by: Fernando Luz <prof.fernando.luz@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 12:13:05 +00:00
Shany Pozin
0ac4cf67a6 Use self.tenants instead of TENANTS (#5811) 2023-11-07 11:38:02 +00:00
Joonas Koivunen
4be6bc7251 refactor: remove unnecessary unsafe (#5802)
unsafe impls for `Send` and `Sync` should not be added by default. in
the case of `SlotGuard` removing them does not cause any issues, as the
compiler automatically derives those.

This PR adds requirement to document the unsafety (see
[clippy::undocumented_unsafe_blocks]) and opportunistically adds
`#![deny(unsafe_code)]` to most places where we don't have unsafe code
right now.

TRPL on Send and Sync:
https://doc.rust-lang.org/book/ch16-04-extensible-concurrency-sync-and-send.html

[clippy::undocumented_unsafe_blocks]:
https://rust-lang.github.io/rust-clippy/master/#/undocumented_unsafe_blocks
2023-11-07 10:26:25 +00:00
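A tiny sketch of the lint setup this PR describes, as it might appear at a
crate root:

```rust
// Crate root (e.g. lib.rs): forbid new unsafe code by default and require
// SAFETY comments where unsafe is explicitly allowed.
#![deny(unsafe_code)]
#![warn(clippy::undocumented_unsafe_blocks)]

fn main() {
    // With #![deny(unsafe_code)], an `unsafe { .. }` block or an
    // `unsafe impl Send/Sync` fails to compile unless the item opts out
    // with #[allow(unsafe_code)] and documents why it is sound.
    println!("no unsafe code in this crate");
}
```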
John Spray
a394f49e0d pageserver: avoid converting an error to anyhow::Error (#5803)
This was preventing it getting cleanly converted to a
CalculateLogicalSizeError::Cancelled, resulting in "Logical size
calculation failed" errors in logs.
2023-11-07 09:35:45 +00:00
John Spray
c00651ff9b pageserver: start refactoring into TenantManager (#5797)
## Problem

See: https://github.com/neondatabase/neon/issues/5796

## Summary of changes

Completing the refactor is quite verbose and can be done in stages: each
interface that is currently called directly from a top-level mgr.rs
function can be moved into TenantManager once the relevant subsystems
have access to it.

Landing the initial change to create of TenantManager is useful because
it enables new code to use it without having to be altered later, and
sets us up to incrementally fix the existing code to use an explicit
Arc<TenantManager> instead of relying on the static TENANTS.
2023-11-07 09:06:53 +00:00
Richy Wang
bea8efac24 Fix comments in 'receive_wal.rs'. (#5807)
## Problem
Some comments in 'receive_wal.rs' are not accurate. They may have been copied
from 'send_wal.rs' and left unchanged.
## Summary of changes
This commit fixes two comments in the code:
Changed "/// Unregister walsender." to "/// Unregister walreceiver."
Changed "///Scope guard to access slot in WalSenders registry" to
"///Scope guard to access slot in WalReceivers registry."
2023-11-07 09:13:01 +01:00
Conrad Ludgate
ad5b02e175 proxy: remove unsafe (#5805)
## Problem

`unsafe {}`

## Summary of changes

`CStr` has a method to parse the bytes up to a null byte, so we don't
have to do it ourselves.
2023-11-06 17:44:44 +00:00
Arpad Müller
b09a851705 Make azure blob storage not do extra metadata requests (#5777)
Load the metadata from the returned `GetBlobResponse` and avoid
downloading it via a separate request.
As it turns out, the SDK does return the metadata:
https://github.com/Azure/azure-sdk-for-rust/issues/1439 .

This PR will reduce the number of requests to Azure caused by downloads.

Fixes #5571
2023-11-06 15:16:55 +00:00
John Spray
85cd97af61 pageserver: add InProgress tenant map state, use a sync lock for the map (#5367)
## Problem

Follows on from #5299 
- We didn't have a generic way to protect a tenant undergoing changes:
`Tenant` had states, but for our arbitrary transitions between
secondary/attached, we need a general way to say "reserve this tenant
ID, and don't allow any other ops on it, but don't try and report it as
being in any particular state".
- The TenantsMap structure was behind an async RwLock, but it was never
correct to hold it across await points: that would block any other
changes for all tenants.


## Summary of changes

- Add the `TenantSlot::InProgress` value.  This means:
  - Incoming administrative operations on the tenant should retry later
- Anything trying to read the live state of the tenant (e.g. a page
service reader) should retry later or block.
- Store TenantsMap in `std::sync::RwLock`
- Provide an extended `get_active_tenant_with_timeout` for page_service
to use, which will wait on InProgress slots as well as non-active
tenants.

Closes: https://github.com/neondatabase/neon/issues/5378

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-06 14:03:22 +00:00
Arpad Müller
e6470ee92e Add API description for safekeeper copy endpoint (#5770)
Adds a yaml API description for a new endpoint that allows creation of a
new timeline as the copy of an existing one.
 
Part of #5282
2023-11-06 15:00:07 +01:00
bojanserafimov
dc72567288 Layer flush minor speedup (#5765)
Convert keys to `i128` before sorting
2023-11-06 08:58:20 -05:00
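As an illustration only (with a made-up two-part key type), precomputing a
single integer sort key looks roughly like this:

```rust
fn main() {
    // A made-up two-part key; the real layer keys are more structured.
    let mut keys: Vec<(u64, u64)> = vec![(2, 1), (1, 9), (1, 2)];
    // Sort on a single flattened i128 instead of comparing fields one by one.
    keys.sort_by_key(|&(hi, lo)| ((hi as i128) << 64) | lo as i128);
    assert_eq!(keys, vec![(1, 2), (1, 9), (2, 1)]);
}
```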
John Spray
6defa2b5d5 pageserver: add Gate as a partner to CancellationToken for safe shutdown of Tenant & Timeline (#5711)
## Problem

When shutting down a Tenant, it isn't just important to cause any
background tasks to stop. It's also important to wait until they have
stopped before declaring shutdown complete, in cases where we may re-use
the tenant's local storage for something else, such as running in
secondary mode, or creating a new tenant with the same ID.

## Summary of changes

A `Gate` class is added, inspired by
[seastar::gate](https://docs.seastar.io/master/classseastar_1_1gate.html).
For types that have an important lifetime that corresponds to some
physical resource, use of a Gate as well as a CancellationToken provides
a robust pattern for async requests & shutdown:
- Requests must always acquire the gate as long as they are using the
object
- Shutdown must set the cancellation token, and then `close()` the gate
to wait for requests in progress before returning.

This is not for memory safety: it's for expressing the difference
between "Arc<Tenant> exists", and "This tenant's files on disk are
eligible to be read/written".

- Both Tenant and Timeline get a Gate & CancellationToken.
- The Timeline gate is held during eviction of layers, and during
page_service requests.
- Existing cancellation support in page_service is refined to use the
timeline-scope cancellation token instead of a process-scope
cancellation token. This replaces the use of `task_mgr::associate_with`:
tasks no longer change their tenant/timeline identity after being
spawned.

The Tenant's Gate is not yet used, but will be important for
Tenant-scoped operations in secondary mode, where we must ensure that
our secondary-mode downloads for a tenant are gated wrt the activity of
an attached Tenant.

This is part of a broader move away from using the global-state driven
`task_mgr` shutdown tokens:
- less global state where we rely on implicit knowledge of what task a
given function is running in, and more explicit references to the
cancellation token that a particular function/type will respect, making
shutdown easier to reason about.
- eventually avoid the big global TASKS mutex.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-06 12:39:20 +00:00
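A condensed sketch of the gate-plus-cancellation-token pattern described
above; the `Gate` here is approximated with a large tokio `Semaphore` and is
not Neon's actual utility:

```rust
use std::sync::Arc;

use tokio::sync::{OwnedSemaphorePermit, Semaphore};
use tokio_util::sync::CancellationToken;

const GATE_PERMITS: u32 = 1 << 20;

struct Gate {
    sem: Arc<Semaphore>,
}

impl Gate {
    fn new() -> Self {
        Self { sem: Arc::new(Semaphore::new(GATE_PERMITS as usize)) }
    }

    /// Requests hold the returned permit while they use the timeline's files.
    async fn enter(&self) -> OwnedSemaphorePermit {
        self.sem.clone().acquire_owned().await.expect("gate never closed here")
    }

    /// Shutdown: wait until every outstanding request has dropped its permit.
    async fn close(&self) {
        let _all = self.sem.acquire_many(GATE_PERMITS).await.expect("gate poisoned");
    }
}

#[tokio::main]
async fn main() {
    let gate = Gate::new();
    let cancel = CancellationToken::new();

    // Request path: enter the gate, do I/O, drop the permit when done.
    let permit = gate.enter().await;
    drop(permit);

    // Shutdown path: set the cancellation token first, then close the gate,
    // which waits for any requests still inside.
    cancel.cancel();
    gate.close().await;
    println!("all requests drained, safe to reuse local storage");
}
```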
duguorong009
b3d3a2587d feat: improve the serde impl for several types(Lsn, TenantId, TimelineId ...) (#5335)
Improve the serde impl for several types (`Lsn`, `TenantId`,
`TimelineId`) by making them sensitive to
`Serializer::is_human_readable` (true for JSON, false for bincode).

Fixes #3511 by:
- Implement the custom serde for `Lsn`
- Implement the custom serde for `Id`
- Add the helper module `serde_as_u64` in `libs/utils/src/lsn.rs`
- Remove the unnecessary attr `#[serde_as(as = "DisplayFromStr")]` in
all possible structs

Additionally some safekeeper types gained serde tests.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-06 11:40:03 +02:00
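A minimal sketch of an `Lsn`-like newtype whose representation depends on
`Serializer::is_human_readable` (hex string for JSON, raw `u64` for bincode);
it assumes `serde`, `serde_json` and `bincode` as dependencies and is not the
exact implementation from the repo:

```rust
use serde::{Deserialize, Deserializer, Serialize, Serializer};

#[derive(Debug, PartialEq)]
struct Lsn(u64);

impl Serialize for Lsn {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        if serializer.is_human_readable() {
            // JSON and friends get the usual "hi/lo" hex form.
            serializer.serialize_str(&format!("{:X}/{:X}", self.0 >> 32, self.0 & 0xffff_ffff))
        } else {
            // Binary formats like bincode get the raw u64.
            serializer.serialize_u64(self.0)
        }
    }
}

impl<'de> Deserialize<'de> for Lsn {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        use serde::de::Error;
        if deserializer.is_human_readable() {
            let s = String::deserialize(deserializer)?;
            let (hi, lo) = s
                .split_once('/')
                .ok_or_else(|| D::Error::custom("expected hi/lo hex LSN"))?;
            let hi = u64::from_str_radix(hi, 16).map_err(D::Error::custom)?;
            let lo = u64::from_str_radix(lo, 16).map_err(D::Error::custom)?;
            Ok(Lsn((hi << 32) | lo))
        } else {
            u64::deserialize(deserializer).map(Lsn)
        }
    }
}

fn main() {
    let lsn = Lsn(0x16A9388);
    assert_eq!(serde_json::to_string(&lsn).unwrap(), "\"0/16A9388\"");
    assert_eq!(bincode::serialize(&lsn).unwrap().len(), 8); // fixed-width u64
}
```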
Heikki Linnakangas
b85fc39bdb Update control plane API path for getting compute spec. (#5357)
We changed the path in the control plane. The old path is still accepted
for compatibility with existing computes, but we'd like to phase it out.
2023-11-06 09:26:09 +02:00
duguorong009
09b5954526 refactor: use streaming in safekeeper /v1/debug_dump http response (#5731)
- Update the handler for `/v1/debug_dump` http response in safekeeper
- Update the `debug_dump::build()` to use the streaming in JSON build
process
2023-11-05 10:16:54 +00:00
John Spray
306c4f9967 s3_scrubber: prepare for scrubbing buckets with generation-aware content (#5700)
## Problem

The scrubber didn't know how to find the latest index_part when
generations were in use.

## Summary of changes

- Teach the scrubber to do the same dance that pageserver does when
finding the latest index_part.json
- Teach the scrubber how to understand layer files with generation
suffixes.
- General improvement to testability: scan_metadata has a machine
readable output that the testing `S3Scrubber` wrapper can read.
- Existing test coverage of scrubber was false-passing because it just
didn't see any data due to prefixing of data in the bucket. Fix that.

This is incremental improvement: the more confidence we can have in the
scrubber, the more we can use it in integration tests to validate the
state of remote storage.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-11-03 17:36:02 +00:00
Konstantin Knizhnik
5ceccdc7de Logical replication startup fixes (#5750)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1698226491736459

## Summary of changes

Update WAL affected buffers when restoring WAL from safekeeper

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-11-03 18:40:27 +02:00
Conrad Ludgate
cdcaa329bf proxy: no more statements (#5747)
## Problem

my prepared statements change in tokio-postgres landed in the latest
release. it didn't work as we intended

## Summary of changes

https://github.com/neondatabase/rust-postgres/pull/24
2023-11-03 08:30:58 +00:00
Joonas Koivunen
27bdbf5e36 chore(layer): restore logging, doc changes (#5766)
Some of the log messages were lost with the #4938. This PR adds some of
them back, most notably:

- starting to on-demand download
- successful completion of on-demand download
- ability to see when there were many waiters for the layer download
- "unexpectedly on-demand downloading ..." is now `info!`

Additionally some rare events are logged as error, which should never
happen.
2023-11-02 19:05:33 +00:00
khanova
4c7fa12a2a Proxy introduce allowed ips (#5729)
## Problem

Proxy doesn't accept wake_compute responses with the allowed IPs.

## Summary of changes

Extend wake_compute api to be able to return allowed_ips.
2023-11-02 16:26:15 +00:00
Em Sharnoff
367971a0e9 vm-monitor: Remove support for file cache in tmpfs (#5617)
ref neondatabase/cloud#7516.

We switched everything over to file cache on disk, now time to remove
support for having it in tmpfs.
2023-11-02 16:06:16 +00:00
bojanserafimov
51570114ea Remove outdated and flaky perf test (#5762) 2023-11-02 10:43:59 -04:00
Joonas Koivunen
098d3111a5 fix(layer): get_and_upgrade and metrics (#5767)
when introducing `get_and_upgrade` I forgot that an `evict_and_wait`
would had already incremented the counter for started evictions, but an
upgrade would just "silently" cancel the eviction as no drop would ever
run. these metrics are likely sources for alerts with the next release,
so it's important to keep them correct.
2023-11-02 13:06:14 +00:00
Joonas Koivunen
3737fe3a4b fix(layer): error out early if layer path is non-file (#5756)
In an earlier PR
https://github.com/neondatabase/neon/pull/5743#discussion_r1378625244 I
added a FIXME and there's a simple solution suggested by @jcsp, so
implement it. I was wondering why I did not implement this originally: there
is no concept of a permanent failure, so this failure will happen quite
often. I don't think the frequency is a problem, however.

Sadly for std::fs::FileType there is only decimal and hex formatting, no
octal.
2023-11-02 11:03:38 +00:00
John Spray
5650138532 pageserver: helpers for explicitly dying on fatal I/O errors (#5651)
Following from discussion on
https://github.com/neondatabase/neon/pull/5436 where hacking an implicit
die-on-fatal-io behavior into an Error type was a source of disagreement
-- in this PR, dying on fatal I/O errors is explicit, with `fatal_err`
and `maybe_fatal_err` helpers in the `MaybeFatalIo` trait, which is
implemented for std::io::Result.

To enable this approach with `crashsafe_overwrite`, the return type of
that function is changed to std::io::Result -- the previous error enum
for this function was not used for any logic, and the utility of saying
exactly which step in the function failed is outweighed by the hygiene
of having an I/O function return an io::Result.

The initial use case for these helpers is the deletion queue.
2023-11-02 09:14:26 +00:00
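A rough sketch of what such helpers can look like; the real trait likely
differs in naming, signatures, and in which `ErrorKind`s count as fatal:

```rust
use std::io;

trait MaybeFatalIo<T> {
    /// Treat any I/O error as fatal: log and exit the process.
    fn fatal_err(self, context: &str) -> T;
    /// Exit only on error kinds considered unrecoverable (an illustrative
    /// choice below), pass everything else back to the caller.
    fn maybe_fatal_err(self, context: &str) -> io::Result<T>;
}

impl<T> MaybeFatalIo<T> for io::Result<T> {
    fn fatal_err(self, context: &str) -> T {
        match self {
            Ok(v) => v,
            Err(e) => {
                eprintln!("fatal I/O error: {context}: {e}");
                std::process::exit(1);
            }
        }
    }

    fn maybe_fatal_err(self, context: &str) -> io::Result<T> {
        if let Err(e) = &self {
            // Just one illustrative kind; a real helper would pick the set of
            // kinds it treats as fatal (disk failures, etc.).
            if matches!(e.kind(), io::ErrorKind::OutOfMemory) {
                eprintln!("fatal I/O error: {context}: {e}");
                std::process::exit(1);
            }
        }
        self
    }
}

fn main() {
    std::fs::write("/tmp/fatal_io_demo.txt", b"hello").fatal_err("writing demo file");
    let data = std::fs::read("/tmp/fatal_io_demo.txt").fatal_err("reading demo file");
    println!("read {} bytes", data.len());
}
```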
Joonas Koivunen
2dca4c03fc feat(layer): cancellable get_or_maybe_download (#5744)
With the layer implementation as was done in #4938, it is possible via
cancellation to cause two concurrent downloads on the same path, due to
how `RemoteTimelineClient::download_remote_layer` does tempfiles. Thread
the init semaphore through the spawned task of downloading to make this
impossible to happen.
2023-11-02 08:06:32 +00:00
bojanserafimov
0b790b6d00 Record wal size in import benchmark (#5755) 2023-11-01 17:02:58 -04:00
Joonas Koivunen
e82d1ad6b8 fix(layer): reinit on access before eviction happens (#5743)
Right before merging, I added a loop to `fn
LayerInner::get_or_maybe_download`, which was always supposed to be
there. However, I had forgotten to restart initialization instead of
waiting for the eviction to happen, which was needed to support the original
design goal of "eviction should always lose to redownload (or init)". This was wrong.
After this fix, if `spawn_blocking` queue is blocked on something,
nothing bad will happen.

Part of #5737.
2023-11-01 17:38:32 +02:00
Muhammet Yazici
4f0a8e92ad fix: Add bearer prefix to Authorization header (#5740)
## Problem

Some requests with `Authorization` header did not properly set the
`Bearer ` prefix. Problem explained here
https://github.com/neondatabase/cloud/issues/6390.

## Summary of changes

Added `Bearer ` prefix to missing requests.
2023-11-01 09:41:48 +03:00
Konstantin Knizhnik
5952f350cb Always handle POLLHUP in walredo error poll loop (#5716)
## Problem

test_stderr hangs on MacOS.

See https://neondb.slack.com/archives/C036U0GRMRB/p1698438997903919

## Summary of changes

Always handle POLLHUP to prevent infinite loop.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-10-31 20:57:03 +02:00
Tristan Partin
726c8e6730 Add docs for updating Postgres for new minor versions 2023-10-31 12:31:14 -05:00
Em Sharnoff
f7067a38b7 compute_ctl: Assume --vm-monitor-addr arg is always present (#5611)
It has a default value, so this should be sound.
Treating its presence as semantically significant was leading to
spurious warnings.
2023-10-31 10:00:23 -07:00
Joonas Koivunen
896347f307 refactor(layer): remove version checking with atomics (#5742)
The `LayerInner::version` never needed to be read in more than one
place. Clarified while fixing #5737 of which this is the first step.
This decrements possible wrong atomics usage in Layer, but does not
really fix anything.
2023-10-31 18:40:08 +02:00
John Spray
e5c81fef86 tests: minor improvements (#5674)
Minor changes from while I have been working on HA tests:
- Manual pytest executions came with some warnings from `log.warn()`
usage
- When something fails in a generations-enabled test, it is useful to
have a log from the attachment service of what attached when, and with
which generation.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-31 11:44:35 +00:00
Christian Schwarz
7ebe9ca1ac pageserver: /attach: clarify semantics of 409 (#5698)
context: https://app.incident.io/neondb/incidents/75
specifically:
https://neondb.slack.com/archives/C0634NXQ6E7/p1698422852902959?thread_ts=1698419362.155059&cid=C0634NXQ6E7
2023-10-31 09:32:58 +01:00
Shany Pozin
1588601503 Move release PR creation to Friday (#5721)
Prepare for a new release workflow
* Release PR is created on Fridays
* The discussion/approval happens during Friday
* Sunday morning the deployment will be done in central-il and perf
tests will be run
* On Monday early IST morning gradually start rolling (starting from US
regions as they are still in weekend time)

See slack for discussion:
https://neondb.slack.com/archives/C04P81J55LK/p1698565305607839?thread_ts=1698428241.031979&cid=C04P81J55LK
2023-10-30 22:10:24 +01:00
John Spray
9c35e1e6e5 pageserver: downgrade slow task warnings from warn to info (#5724)
## Problem

In #5658 we suppressed the first-iteration output from these logs, but
the volume of warnings is still problematic.

## Summary of changes

- Downgrade all slow task warnings to INFO. The information is still
there if we actively want to know about which tasks are running slowly,
without polluting the overall stream of warnings with situations that
are unsurprising to us.
- Revert the previous change so that we output on the first iteration as
we used to do. There is no reason to suppress these, now that the
severity is just info.
2023-10-30 18:32:30 +00:00
Conrad Ludgate
d8c21ec70d fix nightly 1.75 (#5719)
## Problem

Neon doesn't compile on nightly and had numerous clippy complaints.

## Summary of changes

1. Fixed troublesome dependency
2. Fixed or ignored the lints where appropriate
2023-10-30 16:43:06 +00:00
Konstantin Knizhnik
ad99fa5f03 Grant BYPASSRLS and REPLICATION to existing roles (#5657)
## Problem

A role needs to have the REPLICATION privilege to be usable for logical
replication.
New roles are created with this option.
This PR tries to update existing roles.

## Summary of changes

Update roles in `handle_roles` method

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-10-30 15:29:25 +00:00
John Spray
e675f4cec8 tests: fix missing comma in test_timeline_deletion_with_files_stuck_… (#5713)
…in_upload_queue

This was a syntax mistake in
https://github.com/neondatabase/neon/pull/5149

We didn't notice because the situation the log allow list covers is a
relatively rare race.
2023-10-30 15:18:32 +00:00
Joonas Koivunen
4db8efb2cf Layer: logging fixes (#5676)
- include Layer generation in the default display, with
Generation::Broken as `-broken`
- omit layer from `layer_gc` span because the api it works with needs to
support N layers, so the api needs to log each layer
2023-10-30 16:21:30 +02:00
John Spray
07c2b29895 pageserver: fix error logging on stray timeline files (#5712)
## Problem

If there were stray files in the timelines/ dir after tenant deletion,
pageserver could panic on out of range.

## Summary of changes

Use iterator `take()`, which doesn't care if the number of elements
available is less than requested.
2023-10-30 13:24:52 +00:00
Konstantin Knizhnik
9cdffd164a Prevent SIGSEGV in apply_error_callback when record was not decoded (#5703)
## Problem

See 

https://neondb.slack.com/archives/C036U0GRMRB/p1698652221399419?thread_ts=1698438997.903919&cid=C036U0GRMRB

## Summary of changes

Check if record pointer is not NULL before trying to print record
descriptor

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-10-30 12:06:08 +02:00
John Spray
87db4b441c pageserver: cleaner shutdown in timeline delete (#5701)
The flush task logs a backtrace if it tries to upload and remote
timeline client is already in stopped state.

Therefore we cannot shut them down concurrently: flush task must be shut
down first.

This wasn't more obvious because:
- Timeline deletions IRL usually happen when not much is being written
- In tests, there is a global allow-list for this log

It's not obvious whether removing the global log allow list is safe,
this PR was prompted by how the log spam got in my way when testing
deletion changes.
2023-10-30 09:18:40 +00:00
Conrad Ludgate
964c5c56b7 proxy: dont retry server errors (#5694)
## Problem

accidental spam

## Summary of changes

don't spam control plane if control plane is down :)

2023-10-30 08:38:56 +00:00
Arpad Müller
bd59349af3 Fix Rust 1.74 warnings (#5702)
Fixes new warnings and clippy changes introduced by version 1.74 of the
rust compiler toolchain.
2023-10-28 03:47:26 +02:00
Joonas Koivunen
2bd79906d9 fix: possible page_service hang on cancel (#5696)
Fixes #5341, one more suspected case, see:
https://github.com/neondatabase/neon/issues/5341#issuecomment-1783052379

- races `MaybeWriteOnly::shutdown` with cancellation
- switches to using `AsyncWriteExt::write_buf`
- notes cancellation safety for shutdown
2023-10-27 19:09:34 +01:00
Conrad Ludgate
493b47e1da proxy: exclude client latencies in metrics (#5688)
## Problem

In #5539, I moved the connect_to_compute latency to start counting
before authentication - this is because authentication will perform some
calls to the control plane in order to get credentials and to eagerly
wake a compute server. It felt important to include these times in the
latency metric as these are times we should definitely care about
reducing.

What is not interesting to record in this metric is the roundtrip time
during authentication when we wait for the client to respond.

## Summary of changes

Implement a mechanism to pause the latency timer, resuming on drop of
the pause struct. We pause the timer right before we send the
authentication message to the client, and we resume the timer right
after we complete the authentication flow.
2023-10-27 17:17:39 +00:00
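A small sketch of a pause guard that excludes time from a latency metric and
resumes on drop; the names are illustrative, not the proxy's actual types:

```rust
use std::time::{Duration, Instant};

struct LatencyTimer {
    start: Instant,
    paused: Duration,
}

struct PauseGuard<'a> {
    timer: &'a mut LatencyTimer,
    since: Instant,
}

impl LatencyTimer {
    fn new() -> Self {
        Self { start: Instant::now(), paused: Duration::ZERO }
    }

    /// Stop counting until the returned guard is dropped.
    fn pause(&mut self) -> PauseGuard<'_> {
        PauseGuard { since: Instant::now(), timer: self }
    }

    fn elapsed(&self) -> Duration {
        self.start.elapsed() - self.paused
    }
}

impl Drop for PauseGuard<'_> {
    fn drop(&mut self) {
        // Resuming: everything since pause() is excluded from the metric.
        self.timer.paused += self.since.elapsed();
    }
}

fn main() {
    let mut timer = LatencyTimer::new();
    {
        let _pause = timer.pause();
        std::thread::sleep(Duration::from_millis(50)); // waiting on the client
    }
    println!("recorded latency: {:?}", timer.elapsed());
}
```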
John Spray
c13e932c3b pageserver: add generation fields in openapi spec (#5690)
These optional fields have existed for as while, but weren't mentioned
in `openapi_spec.yaml` yet.
2023-10-27 14:20:04 +01:00
Gleb Novikov
a5292f7e67 Some minor renames in attachment service API (#5687)
2023-10-27 12:36:34 +01:00
Arthur Petukhovsky
262348e41b Fix safekeeper log spans (#5643)
We were missing spans with ttid in "WAL backup" and several other
places, this commit should fix it.

Here are the examples of logs before and after:
https://gist.github.com/petuhovskiy/711a4a4e7ddde3cab3fa6419b2f70fb9
2023-10-27 12:09:02 +01:00
Joonas Koivunen
68f15cf967 fix: schedule_compaction_update must only unlink (#5675)
#5649 added the concept of dangling layers, which #4938 uses but only
partially. I forgot to change `schedule_compaction_update` to not
schedule deletions, so as to uphold the "if you have a layer, you can read it"
invariant.

With this now-remembered fix, I don't think these checks should ever fail
except for a mistake like the one I already made. These changes might be useful
for protecting future changes, even though the Layer carrying the generation
AND the `schedule_(gc|compaction)_update` require strong Arcs.

Rationale for keeping the `#[cfg(feature = "testing")]` is worsening any
leak situation which might come up.
2023-10-27 11:16:01 +01:00
duguorong009
39f8fd6945 feat: add build_tag env support for set_build_info_metric (#5576)
- Add a new util `project_build_tag` macro, similar to
`project_git_version`
- Update the `set_build_info_metric` to accept and make use of
`build_tag` info
- Update all code that uses `set_build_info_metric`
2023-10-27 10:47:11 +01:00
John Spray
83567f9e4e tests: revise perf test that interfered with local disk state (#5682)
This benchmark started failing after #5580 merged.

It was manually deleting some local content on a pageserver, and
expecting the behavior that the pageserver would "forget" about the
timeline on startup as a result. That is no longer our behavior:
pageservers use the remote storage as the source of truth.

Rather than having the test go manually delete things at all, we can
just delete the whole tenant via the pageserver API, and thereby start
from a clean situation.
2023-10-27 09:23:49 +01:00
Conrad Ludgate
71611f4ab3 proxy: prepare to remove high cardinality metrics (#5461)
## Problem

High cardinality metrics are bad

## Summary of changes

Preparing to remove high cardinality metrics. Will actually remove in
#5466
2023-10-26 22:54:37 +01:00
John Spray
7c16b5215e scrubber: add separate find/purge garbage commands (#5409)
## Problem

The previous garbage cleanup functionality relied on doing a dry run,
inspecting logs, and then doing a deletion. This isn't ideal, because
what one actually deletes might not be the same as what one saw in the
dry run. It's also risky UX to rely on presence/absence of one CLI flag
to control deletion: ideally the deletion command should be totally
separate from the one that scans the bucket.

Related: https://github.com/neondatabase/neon/issues/5037

## Summary of changes

This is a major re-work of the code, which results in a net decrease in
line count of about 600. The old code for removing garbage was built
around the idea of doing discovery and purging together: a
"delete_batch_producer" sent batches into a deleter. The new code writes
out both procedures separately, in functions that use the async streams
introduced in https://github.com/neondatabase/neon/pull/5176 to achieve
fast concurrent access to S3 while retaining the readability of a single
function.

- Add `find-garbage`, which writes out a JSON file of tenants/timelines
to purge
- Add `purge-garbage` which consumes the garbage JSON file, applies some
extra validations, and does deletions.
- The purge command will refuse to execute if the garbage file indicates
that only garbage was found: this guards against classes of bugs where
the scrubber might incorrectly deem everything garbage.
- The purge command defaults to only deleting tenants that were found in
"deleted" state in the control plane. This guards against the risk that
using the wrong console API endpoint could cause all tenants to appear
to be missing.

Outstanding work for a future PR:
- Make whatever changes are needed to adapt to the Console/Control Plane
separation.
- Make purge even safer by checking S3 `Modified` times for
index_part.json files (not doing this here, because it will depend on
the generation-aware changes for finding index_part.json files)

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Co-authored-by: Shany Pozin <shany@neon.tech>
2023-10-26 20:36:28 +01:00
Em Sharnoff
39b148b74e Bump vm-builder v0.18.2 -> v0.18.4 (#5666)
Only applicable change was neondatabase/autoscaling#584, setting
pgbouncer auth_dbname=postgres in order to fix superuser connections
from preventing dropping databases.
2023-10-26 20:04:57 +01:00
Sasha Krassovsky
116c342cad Support changing pageserver dynamically (#5542)
## Problem
We currently require full restart of compute if we change the pageserver
url
## Summary of changes
Makes it so that we don't have to do a full restart, but can just send
SIGHUP
2023-10-26 10:56:07 -07:00
John Spray
ba4fe9e10f pageserver: fix the second "AUX files" warning (#5673)
In https://github.com/neondatabase/neon/pull/5669 I didn't notice that
the same warning is logged in two places: fix the other one.
2023-10-26 13:54:52 +00:00
John Spray
de90bf4663 pageserver: always load remote metadata (no more spawn_load) (#5580)
## Problem

The pageserver had two ways of loading a tenant:
- `spawn_load` would trust on-disk content to reflect all existing
timelines
- `spawn_attach` would list timelines in remote storage.

It was incorrect for `spawn_load` to trust local disk content, because
it doesn't know if the tenant might have been attached and written
somewhere else. To make this correct would requires some generation
number checks, but the payoff is to avoid one S3 op per tenant at
startup, so it's not worth the complexity -- it is much simpler to have
one way to load a tenant.

## Summary of changes

- `Tenant` objects are always created with `Tenant::spawn`: there is no
more distinction between "load" and "attach".
- The ability to run without remote storage (for `neon_local`) is
preserved by adding a branch inside `attach` that uses a fallback
`load_local` if no remote_storage is present.
- Fix attaching a tenant when it has a timeline with no IndexPart: this
can occur if a newly created timeline manages to upload a layer before
it has uploaded an index.
- The attach marker file that used to indicate whether a tenant should
be "loaded" or "attached" is no longer needed, and is removed.
- The GenericRemoteStorage interface gets a `list()` method that maps
more directly to what ListObjects does, returning both keys and common
prefixes. The existing `list_files` and `list_prefixes` methods are just
calls into `list()` now -- these can be removed later if we would like
to shrink the interface a bit.
- The remote deletion marker is moved into `timelines/` and detected as
part of listing timelines rather than as a separate GET request. If any
existing tenants have a marker in the old location (unlikely, only
happens if something crashes mid-delete), then they will rely on the
control plane retrying to complete their deletion.
- Revise S3 calls for timeline listing and tenant load to take a
cancellation token, and retry forever: it never makes sense to make a
Tenant broken because of a transient S3 issue.

## Breaking changes

- The remote deletion marker is moved from `deleted` to
`timelines/deleted` within the tenant prefix. Markers in the old
location will be ignored: it is the control plane's responsibility to
retry deletions until they succeed. Markers in the new location will be
tolerated by the previous release of pageserver via
https://github.com/neondatabase/neon/pull/5632
- The local `attaching` marker file is no longer written. Therefore, if
the pageserver is downgraded after running this code, the old pageserver
will not be able to distinguish between partially attached tenants and
fully attached tenants. This would only impact tenants that were partway
through attaching at the moment of downgrade. In the unlikely event
that we do experience an incident that prompts us to roll back, we
may check for attach operations in flight, and manually insert
`attaching` marker files as needed.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-10-26 14:48:44 +01:00
John Spray
8360307ea0 pageserver: exponential backoff on compaction/GC failures (#5672)
Previously, if walredo process crashed we would try to spawn a fresh one
every 2 seconds, which is expensive in itself, but also results in a
high I/O load from the part of the compaction prior to the failure,
which we re-run every 2 seconds.

Closes: https://github.com/neondatabase/neon/issues/5671
2023-10-26 14:00:26 +01:00
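A minimal sketch of exponential backoff after repeated background-task
failures, in the spirit of this change; the constants and the retry loop are
made up:

```rust
use std::time::Duration;

fn backoff_delay(failed_attempts: u32) -> Duration {
    let base = Duration::from_secs(2);
    let max = Duration::from_secs(300);
    // 2s, 4s, 8s, ... capped at 5 minutes.
    std::cmp::min(base * 2u32.saturating_pow(failed_attempts.min(16)), max)
}

#[tokio::main]
async fn main() {
    let mut failures = 0u32;
    for _ in 0..5 {
        let ok = false; // pretend compaction keeps failing
        if ok {
            failures = 0;
        } else {
            failures += 1;
            let delay = backoff_delay(failures - 1);
            println!("compaction failed, retrying in {delay:?}");
            // Real code would sleep(delay); kept short so the demo finishes quickly.
            tokio::time::sleep(Duration::from_millis(1)).await;
        }
    }
}
```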
MMeent
6129077d31 WALRedo: Limit logging to log_level = ERROR and above (#5587)
This fixes issues in pageserver's walredo process where WALRedo
logs of loglevel=LOG are interpreted as errors.

## Problem

See #5560

## Summary of changes

Set the log level to something that doesn't include LOG.
2023-10-26 12:21:41 +01:00
John Spray
e0ebdfc7ce pageserver: suppress compaction/gc errors while stopping (#5670)
## Problem

Tenant deletions would sometimes be accompanied by compaction stack
traces, because `shutdown()` puts the tenant into stopping state before
it joins background tasks.

## Summary of changes

Treat GC+Compaction as no-ops on a Stopping tenant.
2023-10-26 10:59:24 +01:00
Joonas Koivunen
c508d3b5fa reimpl Layer, remove remote layer, trait Layer, trait PersistentLayer (#4938)
Implement a new `struct Layer` abstraction which manages downloadness
internally, requiring no LayerMap locking or rewriting to download or
evict, providing the property "you have a layer, you can read it". The new
`struct Layer` provides the ability to keep the file resident via a RAII
structure for new layers which still need to be uploaded. The previous
solution solved this via `RemoteTimelineClient::wait_completion`, which led
to bugs like #5639. Eviction, and the final local deletion after garbage
collection, is done via the Arc'd value's `Drop`.

With a single `struct Layer`, the open-ended `trait Layer`, `trait
PersistentLayer` and `struct RemoteLayer` are removed, following the
observation that compaction could be simplified by simply not using any of the
traits in between: #4839.

The new `struct Layer` is a preliminary to remove
`Timeline::layer_removal_cs` documented in #4745.

Preliminaries: #4936, #4937, #5013, #5014, #5022, #5033, #5044, #5058,
#5059, #5061, #5074, #5103, epic #5172, #5645, #5649. Related split off:
#5057, #5134.
2023-10-26 12:36:38 +03:00
John Spray
acda65d7d4 pageserver: quieten "Failed to get info about AUX files" (#5669)
## Problem

This line caused lots of errors to be emitted for healthy tenants.

## Summary of changes

Downgrade to debug, since it is an expected code path we'll take for
tenants at startup.
2023-10-26 09:53:18 +01:00
dependabot[bot]
378daa358b build(deps): bump werkzeug from 2.2.3 to 3.0.1 (#5665) 2023-10-25 22:50:35 +00:00
Alexander Bayandin
85f4514e7d Get env var for real Azure tests from GitHub (#5662)
## Problem

We'll need to switch `REMOTE_STORAGE_AZURE_REGION` from the current
`eastus2` region to something `eu-central-1`-like. This may require
changing `AZURE_STORAGE_ACCESS_KEY`.
To make it possible to switch from one place (not to break a lot of
builds on CI), move `REMOTE_STORAGE_AZURE_CONTAINER` and
`REMOTE_STORAGE_AZURE_REGION` to GitHub Variables.

See https://github.com/neondatabase/neon/settings/variables/actions

## Summary of changes
- Get values for `REMOTE_STORAGE_AZURE_CONTAINER` &
`REMOTE_STORAGE_AZURE_REGION` from GitHub Variables
2023-10-25 22:54:23 +01:00
Joonas Koivunen
f70019797c refactor(rtc): schedule compaction update (#5649)
a single operation instead of N uploads and 1 deletion scheduling, with the
write(layer_map) lock being released in between. Compaction update will
make for a much better place to change how the operation evolves in the
future, compared to more general file-based operations.

Builds upon #5645. Solves the problem of hard-to-see, hoped-for
correctness w.r.t. other `index_part.json`-changing operations.

Co-authored-by: Shany Pozin <shany@neon.tech>
2023-10-25 22:25:43 +01:00
Joonas Koivunen
325258413a fix: trampling on global physical size metric (#5663)
All loading (attached, or from disk) timelines overwrite the global
gauge for physical size. The `_set` method cannot be used safely, so
remove it and just "add" the physical size.
2023-10-25 19:29:12 +01:00
Konstantin Knizhnik
4ddbc0e46d Ignore missed AUX_FILES_KEY when generating image layer (#5660)
## Problem

Logical replication requires the new AUX_FILES_KEY, which is definitely
absent in existing databases.
We do not have a function to check if a key exists in our KV storage,
so I have to handle the error in the `list_aux_files` method.
But this key is also included in the key space range and accessed by the
`create_image_layer` method.

## Summary of changes

Check if AUX_FILES_KEY exists before including it in the keyspace.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Shany Pozin <shany@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-10-25 18:35:23 +01:00
Arpad Müller
a673e4e7a9 Optionally return json from get_lsn_by_timestamp (#5608)
This does two things: first a minor refactor to not use HTTP/1.x
style header names and also to not panic if certain requests had no
"Accept" header. As a second thing, it addresses the third bullet point
from #3689:

> Change `get_lsn_by_timestamp` API method to return LSN even if we only
found commit before the specified timestamp.

This is done by adding a version parameter to the `get_lsn_by_timestamp`
API call and making its behaviour depend on the version number.

Part of #3414 (but doesn't address it in its entirety).

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-25 18:46:34 +02:00
bojanserafimov
c155cc0c3f Fix test instructions readme (#5644) 2023-10-25 11:53:04 -04:00
Conrad Ludgate
32126d705b proxy refactor serverless (#4685)
## Problem

Our serverless backend was a bit jumbled. As a comment indicated, we
were handling SQL-over-HTTP in our `websocket.rs` file.

I've extracted out the `sql_over_http` and `websocket` files from the
`http` module and put them into a new module called `serverless`.

## Summary of changes

```sh
mkdir proxy/src/serverless
mv proxy/src/http/{conn_pool,sql_over_http,websocket}.rs proxy/src/serverless/
mv proxy/src/http/server.rs proxy/src/http/health_server.rs
mv proxy/src/metrics proxy/src/usage_metrics.rs
```

I have also extracted the hyper server and handler from websocket.rs
into `serverless.rs`
2023-10-25 15:43:03 +01:00
John Spray
5683ae9eab pageserver: suppress some of the most common spurious warnings (#5658)
Two of the most common spurious log messages:
- broker connections terminate & we log at error severity. Unfortunately
tonic gives us an "Unknown" error so to suppress these we're doing
string matching. It's hacky but worthwhile for operations.
- the first iteration of tenant background tasks tends to over-run its
schedule and emit a warning. Ultimately we should fix these to run on
time, but for now we are not benefiting from polluting our logs with the
warnings.
2023-10-25 14:55:37 +01:00
Alexander Bayandin
4778b6a12e Switch to querying new tests results DB (#5616)
## Problem

We started to store test results in a new format in
https://github.com/neondatabase/neon/pull/4549.
This PR switches scripts to query this db.

(we can completely remove the old DB/ingestion scripts a couple of
weeks after the PR is merged)

## Summary of changes
- `scripts/benchmark_durations.py` query new database
- `scripts/flaky_tests.py` query new database
2023-10-25 14:25:13 +01:00
John Spray
8b8be7bed4 tests: don't fail tests on torn log lines (#5655)
## Problem

Tests that force-kill and restart a service can generate torn log lines
that might match WARN|ERROR, but not match the allow expression that a
test has loaded, e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5651/6638398772/index.html#suites/7538959189f4501983ddd9e167836c8b/d272ba8a73e6945c

## Summary of changes

Ignore log lines which match a regex for torn log lines on restart: they
have two timestamps and the second line is an "INFO version"... message.
2023-10-25 13:29:30 +01:00
Conrad Ludgate
a461c459d8 fix http pool test (#5653)
## Problem

We defer the returning of connections to the connection pool. It's
possible for our test to be faster than the returning of connections -
which then gets a differing process ID because it opens a new
connection.

## Summary of changes

1. Delay the tests just a little (20ms) to give more chance for
connections to return.
2. Correlate connection IDs with the connection logs a bit more
2023-10-25 13:20:45 +01:00
Joonas Koivunen
4ae2d1390d refactor(remote_timeline_client): Split deletion into unlinking + deletion (#5645)
Quest: #4745. Prerequisite for #4938. Original
https://github.com/neondatabase/neon/pull/4938#issuecomment-1777150665.

The new Layer implementation has so far been using
`RemoteTimelineClient::schedule_layer_file_deletion` from `Layer::drop`
but it was noticed that this could mean that the L0s compaction wanted
to remove could linger in the index part for longer time or be left
there for longer time. Solution is to split the
`RemoteTimelineClient::schedule_layer_file_deletion` into two parts:
- unlinking from index_part.json, to be called from end of compaction
and gc
- scheduling of actual deletions, to be called from `Layer::drop`

The added methods are added unused.
2023-10-25 15:01:19 +03:00
Joonas Koivunen
c5949e1fd6 misc smaller improvements (#5527)
- finally add an `#[instrument]` to Timeline::create_image_layers,
making it easier to see that something is happening because we create
image layers
- format some macro context code
- add a warning not to create new validation functions, à la "parse, don't
validate"

Split off from #5198.
2023-10-25 14:59:43 +03:00
John Spray
127837abb0 tests: de-flake test_eviction_across_generations (#5650)
## Problem

There was an edge case where initial logical size calculation can be
downloading a layer that wasn't hit by the test's `SELECT`, and it's
on-disk but still marked as remote in the pageserver's internal state,
so evicting it fails.


https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5648/6630099807/index.html#categories/dee044ec96f666edb90a77c01099a941/e38e97a2735ffa8c/

## Summary of changes

Use pageserver API to learn about layers, instead of inspecting local
disk, so that we will always agree with the pageserver about which layer
are local.
2023-10-25 10:55:45 +01:00
Conrad Ludgate
b2c96047d0 move wake compute after the auth quirks logic (#5642)
## Problem

https://github.com/neondatabase/neon/issues/5568#issuecomment-1777015606

## Summary of changes

Make the auth_quirks_creds return the authentication information, and
push the wake_compute loop to after, inside `auth_quirks`
2023-10-25 08:30:47 +01:00
Em Sharnoff
44202eeb3b Bump vm-builder v0.18.1 -> v0.18.2 (#5646)
Only applicable change was neondatabase/autoscaling#571, removing the
postgres_exporter flags `--auto-discover-databases` and
`--exclude-databases=...`
2023-10-24 16:04:28 -07:00
Arpad Müller
4bef977c56 Use tuples instead of manual comparison chain (#5637)
Makes code a little bit simpler
2023-10-24 17:16:23 +00:00
John Spray
a0b862a8bd pageserver: schedule frozen layer uploads inside the layers lock (#5639)
## Problem

Compaction's source of truth for what layers exist is the LayerManager.
`flush_frozen_layer` updates LayerManager before it has scheduled upload
of the frozen layer.

Compaction can then "see" the new layer, decide to delete it, schedule
uploads of replacement layers, all before `flush_frozen_layer` wakes up
again and schedules the upload. When the upload is scheduled, the local
layer file may be gone, in which case we end up with no such layer in
remote storage, but an entry still added to IndexPart pointing to the
missing layer.

## Summary of changes

Schedule layer uploads inside the `self.layers` lock, so that whenever a
frozen layer is present in LayerManager, it is also present in
RemoteTimelineClient's metadata.

Closes: #5635
2023-10-24 13:57:01 +01:00
Conrad Ludgate
767ef29390 proxy: filter out more quota exceeded errors (#5640)
## Problem

Looking at logs, I saw more retries being performed for other quota
exceeded errors

## Summary of changes

Filter out all quota exceeded family of errors
2023-10-24 13:13:23 +01:00
Alexander Bayandin
a8a800af51 Run real Azure tests on CI (#5627)
## Problem
We do not run real Azure-related tests on CI 

## Summary of changes
- Set required env variables to run real Azure blob storage tests on CI
2023-10-24 12:12:11 +01:00
Arpad Müller
1e250cd90a Cleanup in azure_upload_download_works test (#5636)
The `azure_upload_download_works` test is not cleaning up after itself,
leaving behind the files it is uploading. I found these files when
looking at the contents of the bucket in #5627.

We now clean up the file we uploaded before, like the other tests do it
as well.

Follow-up of #5546
2023-10-23 19:08:56 +01:00
John Spray
eaaa18f6ed attachment_service: graceful SIGQUIT (#5626)
`attachment_service` doesn't explicitly handle signals, which causes a
backtrace when `neon_local` kills it with SIGQUIT.

Closes: https://github.com/neondatabase/neon/issues/5613
2023-10-23 17:30:25 +01:00
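A minimal sketch of explicit Unix signal handling with tokio (the `signal`
feature enabled), which is roughly what a small service needs to exit
gracefully on SIGQUIT instead of dumping core:

```rust
use tokio::signal::unix::{signal, SignalKind};

#[tokio::main]
async fn main() {
    let mut sigquit = signal(SignalKind::quit()).expect("install SIGQUIT handler");
    let mut sigterm = signal(SignalKind::terminate()).expect("install SIGTERM handler");

    tokio::select! {
        _ = sigquit.recv() => println!("got SIGQUIT, shutting down cleanly"),
        _ = sigterm.recv() => println!("got SIGTERM, shutting down cleanly"),
    }
    // ... flush/persist state here, then return instead of the default
    // core-dumping SIGQUIT behaviour ...
}
```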
John Spray
188f67e1df pageserver: forward compat: be tolerant of deletion marker in timelines/ (#5632)
## Problem

https://github.com/neondatabase/neon/pull/5580 will move the remote
deletion marker into the `timelines/` path.

This would cause old pageserver code to fail loading the tenant due to
an apparently invalid timeline ID. That would be a problem if we had to
roll back after deploying #5580

## Summary of changes

If a `deleted` file is in `timelines/` just ignore it.
2023-10-23 17:51:38 +02:00
John Spray
7e805200bb pageserver: parallel load of configs (#5607)
## Problem

When the number of tenants is large, sequentially issuing the open/read
calls for their config files is a ~1000ms delay during startup. It's not
a lot, but it's simple to fix.

## Summary of changes

Put all the config loads into spawn_blocking() tasks and run them in a
JoinSet. We can simplify this a bit later when we have full async disk
I/O.

---------

Co-authored-by: Shany Pozin <shany@neon.tech>
2023-10-23 15:32:34 +01:00
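A condensed sketch of loading many small files concurrently with
`spawn_blocking` tasks in a `JoinSet` (assumes a reasonably recent tokio); the
paths here are made up:

```rust
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    let paths: Vec<std::path::PathBuf> =
        (0..4).map(|i| format!("/tmp/tenant-{i}/config").into()).collect();

    let mut configs = JoinSet::new();
    for path in paths {
        // Each blocking read runs on the blocking thread pool.
        configs.spawn_blocking(move || {
            let contents = std::fs::read_to_string(&path);
            (path, contents)
        });
    }

    while let Some(joined) = configs.join_next().await {
        let (path, contents) = joined.expect("config-loading task panicked");
        match contents {
            Ok(c) => println!("{}: {} bytes", path.display(), c.len()),
            Err(e) => println!("{}: {e}", path.display()),
        }
    }
}
```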
Christian Schwarz
c6ca1d76d2 consumption_metrics: fix periodicness behavior & reporting (#5625)
Before this PR, the ticker was running at default miss behavior `Delay`.
For example, here is the startup output with 25k tenants:

```
2023-10-19T09:57:21.682466Z  INFO synthetic_size_worker: starting calculate_synthetic_size_worker
2023-10-19T10:50:44.678202Z  WARN synthetic_size_worker: task iteration took longer than the configured period elapsed=3202.995707156s period=10m task=ConsumptionMetricsSyntheticSizeWorker
2023-10-19T10:52:17.408056Z  WARN synthetic_size_worker: task iteration took longer than the configured period elapsed=2695.72556035s period=10m task=ConsumptionMetricsSyntheticSizeWorker
```

The first message's `elapsed` value is correct. It matches the
delta between the log line timestamps.

The second one is logged only ca. 1.5 min later, yet reports an `elapsed`
much larger than 1.5 min.

This PR fixes the behavior by copying what `eviction_task.rs` does.
2023-10-23 16:31:38 +02:00
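The commit copies whatever `eviction_task.rs` does; one plausible shape of the
fix, shown as a sketch, is to set a non-default `MissedTickBehavior` such as
`Skip` so a slow iteration does not produce a burst of catch-up ticks:

```rust
use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

#[tokio::main]
async fn main() {
    let mut ticker = interval(Duration::from_millis(100));
    // Skip drops any missed ticks and resumes on the original schedule, so a
    // slow iteration does not distort subsequent timing.
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);

    for i in 0..3 {
        ticker.tick().await;
        let started = std::time::Instant::now();
        // Pretend one iteration overruns its 100ms period.
        tokio::time::sleep(Duration::from_millis(250)).await;
        println!("iteration {i} took {:?}", started.elapsed());
    }
}
```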
Conrad Ludgate
94b4e76e13 proxy: latency connect outcome (#5588)
## Problem

I recently updated the latency timers to include cache miss and pool
miss, as well as connection protocol. By moving the latency timer to
start before authentication, we count a lot more failures and it's
messed up the latency dashboard.

## Summary of changes

Add another label to LatencyTimer metrics for outcome. Explicitly report
on success
2023-10-23 15:17:28 +01:00
khanova
b514da90cb Set up timeout for scram protocol execution (#5551)
## Problem
Context:
https://github.com/neondatabase/neon/issues/5511#issuecomment-1759649679

Some of our SCRAM protocol executions timed out only after 17 minutes.
## Summary of changes
Make timeout for scram execution meaningful and configurable.
2023-10-23 15:11:05 +01:00
Conrad Ludgate
7d17f1719f reduce cancel map contention (#5555)
## Problem

Every database request locks this cancel map rwlock. At high requests
per second this would have high contention.

## Summary of changes

Switch to dashmap which has a sharded rwlock to reduce contention
2023-10-23 14:12:41 +01:00
John Spray
41ee75bc71 pageserver: do config writes in a spawn_blocking (#5603)
## Problem

We now persist tenant configuration every time we spawn a tenant. The
persist_tenant_config function is doing a series of non-async filesystem
I/O, because `crashsafe::` isn't async yet. This isn't a demonstrated
problem, but is a source of uncertainty when reasoning about what's
happening with our startup times.

## Summary of changes

- Wrap `crashsafe_overwrite` in `spawn_blocking`.
- Although I think this change makes sense, it does not have a
measurable impact on load time when testing with 10k tenants.
- This can be reverted when we have full async I/O
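
For illustration, a generic sketch of the pattern (this is *not* the actual
`crashsafe::` implementation; file names and layout are assumptions): the
synchronous, crash-safe overwrite runs on the blocking pool so the async caller
isn't stalled by filesystem syscalls.

```rust
async fn persist_config(final_path: std::path::PathBuf, bytes: Vec<u8>) -> anyhow::Result<()> {
    tokio::task::spawn_blocking(move || -> anyhow::Result<()> {
        use std::io::Write;
        let tmp_path = final_path.with_extension("___temp");
        let mut tmp = std::fs::File::create(&tmp_path)?;
        tmp.write_all(&bytes)?;
        tmp.sync_all()?; // durably write the temp file first
        std::fs::rename(&tmp_path, &final_path)?; // atomically replace the old file
        // fsync the parent directory so the rename itself is durable
        std::fs::File::open(final_path.parent().expect("config file has a parent"))?.sync_all()?;
        Ok(())
    })
    .await??;
    Ok(())
}
```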
2023-10-23 09:19:01 +01:00
Christian Schwarz
11e523f503 walredo: fix EAGAIN/"os error 11" false page reconstruction failures (#5560)
Stacked atop https://github.com/neondatabase/neon/pull/5559

Before this PR, there was the following race condition:

```
T1: polls for writeable stdin
T1: writes to stdin
T1: enters poll for stdout/stderr
T2: enters poll for stdin write
WALREDO: writes to stderr
KERNEL: wakes up T1 and T2
Tx: reads stderr and prints it
Ty: reads stderr and gets EAGAIN
(valid values for (x, y) are (1, 2) or (2, 1))
```

The concrete symptom that we observed repeatedly was with PG16,
which always logs `registered custom resource manager`
to stderr during startup, thereby giving us repeated
opportunities to hit the above race condition. PG14 and PG15 didn't log
anything to stderr, so we could only have hit this race condition
if an actual error happened.

This PR fixes the race by moving the reading of stderr into a tokio
task. It exits when the stderr is closed by the child process, which
in turn happens when the child exits, either by itself or because
we killed it.

The downside is that the async scheduling can reorder the log messages,
which can be seen in the new `test_stderr`, which runs in a
single-threaded runtime. I included the output below.

Overall I think we should move the entire walredo to async, as Joonas
proposed many months ago. This PR's asyncification is just the first
step to resolve these false page reconstruction errors.

After this is fixed, we should stop printing that annoying stderr
message on walredo startup; it causes noise in the pageserver logs.
That work is tracked in #5399.

```
2023-10-13T19:05:21.878858Z ERROR apply_wal_records{tenant_id=d546fb76ba529195392fb4d19e243991 pid=753986}: failed to write out the walredo errored input: No such file or directory (os error 2) target=walredo-1697223921878-1132-0.walredo length=1132
2023-10-13T19:05:21.878932Z DEBUG postgres applied 2 WAL records (1062 bytes) in 114666 us to reconstruct page image at LSN 0/0
2023-10-13T19:05:21.878942Z ERROR error applying 2 WAL records 0/16A9388..0/16D4080 (1062 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/0 n_attempts=0: apply_wal_records

Caused by:
    WAL redo process closed its stdout unexpectedly
2023-10-13T19:05:21.879027Z  INFO kill_and_wait_impl{pid=753986}: wait successful exit_status=signal: 11 (SIGSEGV) (core dumped)
2023-10-13T19:05:21.879079Z DEBUG wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task started
2023-10-13T19:05:21.879104Z ERROR wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: received output output="2023-10-13 19:05:21.769 GMT [753986] LOG:  registered custom resource manager \"neon\" with ID 134\n"
2023-10-13T19:05:21.879116Z DEBUG wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task finished
2023-10-13T19:05:22.004439Z ERROR apply_wal_records{tenant_id=d546fb76ba529195392fb4d19e243991 pid=754000}: failed to write out the walredo errored input: No such file or directory (os error 2) target=walredo-1697223922004-1132-0.walredo length=1132
2023-10-13T19:05:22.004493Z DEBUG postgres applied 2 WAL records (1062 bytes) in 125344 us to reconstruct page image at LSN 0/0
2023-10-13T19:05:22.004501Z ERROR error applying 2 WAL records 0/16A9388..0/16D4080 (1062 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/0 n_attempts=1: apply_wal_records

Caused by:
    WAL redo process closed its stdout unexpectedly
2023-10-13T19:05:22.004588Z  INFO kill_and_wait_impl{pid=754000}: wait successful exit_status=signal: 11 (SIGSEGV) (core dumped)
2023-10-13T19:05:22.004624Z DEBUG wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task started
2023-10-13T19:05:22.004653Z ERROR wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: received output output="2023-10-13 19:05:21.884 GMT [754000] LOG:  registered custom resource manager \"neon\" with ID 134\n"
2023-10-13T19:05:22.004666Z DEBUG wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task finished
```
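
For illustration, a minimal sketch of such a stderr-draining task (assumed
function shape, not the actual pageserver code): only one reader ever touches
the pipe, and the task ends when the child closes its stderr, i.e. when it
exits.

```rust
use tokio::io::{AsyncBufReadExt, BufReader};
use tokio::process::ChildStderr;

fn spawn_stderr_logger(pid: u32, stderr: ChildStderr) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut lines = BufReader::new(stderr).lines();
        // next_line() returns Ok(None) once the write end is closed.
        while let Ok(Some(line)) = lines.next_line().await {
            tracing::error!(pid, output = %line, "received output");
        }
        tracing::debug!(pid, "wal-redo-postgres stderr_logger_task finished");
    })
}
```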
2023-10-23 09:00:13 +01:00
Konstantin Knizhnik
b1a1126152 Grant replication permission to newly created users (#5615)

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-10-20 21:29:17 +03:00
John Spray
a8899e1e0f pageserver: apply timeout when waiting for tenant loads (#5601)
## Problem

Loading tenants shouldn't hang. However, if it does, we shouldn't let
one hung tenant prevent the entire process from starting background
jobs.

## Summary of changes

Generalize the timeout mechanism that we already applied to loading
initial logical sizes: each phase in startup where we wait for a barrier
is subject to a timeout, and startup will proceed if the phase doesn't
complete within the timeout.

Startup metrics will still reflect the time when a phase actually
completed, rather than when we skipped it.

The code isn't the most beautiful, but that kind of reflects the
awkwardness of await'ing a future and then stashing it to await again
later if we time out. I could imagine making this cleaner in the future by
waiting on a structure that doesn't self-destruct on wait() the way
Barrier does, and then making InitializationOrder into a structure that
manages the series of waits, etc.
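
For illustration, a rough sketch of the "wait with a timeout, keep the future
for later" pattern (names and types are assumptions, not the actual startup
code):

```rust
use std::future::Future;
use std::pin::Pin;
use std::time::Duration;

// Wait for a startup phase for at most `timeout`. If the phase doesn't finish
// in time we proceed, but the pinned future is kept by the caller so a later
// point can still await its completion.
async fn wait_phase_with_timeout(
    phase: &str,
    timeout: Duration,
    fut: &mut Pin<Box<dyn Future<Output = ()> + Send>>,
) -> bool {
    match tokio::time::timeout(timeout, fut.as_mut()).await {
        Ok(()) => true,
        Err(_) => {
            tracing::info!(phase, "startup phase did not complete within timeout, proceeding");
            false
        }
    }
}
```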
2023-10-20 09:15:34 +01:00
Arseny Sher
2fbd5ab075 Add safekeeper test_late_init. 2023-10-20 10:57:59 +03:00
Arseny Sher
702382e99a Add check that WAL segments are identical after recovery. 2023-10-20 10:57:59 +03:00
Arseny Sher
1b53b3e200 Make test_pageserver_http_get_wal_receiver_success not wait for keepalive. 2023-10-20 10:57:59 +03:00
Arseny Sher
b332268cec Introduce safekeeper peer recovery.
Implements fetching of WAL by a safekeeper from another safekeeper by imitating
the behaviour of the last elected leader. This avoids WAL accumulation on
compute and enables faster compute startup, as the compute doesn't need to download
any WAL. Actually removing the WAL download in walproposer is a matter of another
patch, though.

There is a per-timeline task which always runs, regularly checking whether it should
start recovery from someone, meaning there is something to fetch and there is
no streaming compute. It then proceeds with fetching, finishing when there is
nothing more to receive.

Implements https://github.com/neondatabase/neon/pull/4875
2023-10-20 10:57:59 +03:00
Arseny Sher
76c702219c Don't use AppendRequestHeader.epoch_start_lsn.
It is simpler to get it once from ProposerElected. 2023-10-20 10:57:59 +03:00
2023-10-20 10:57:59 +03:00
Arthur Petukhovsky
ba856140e7 Fix neon_extra_build.yml (#5605)
Build walproposer-lib in gather-rust-build-stats, fix nproc usage, and fix
walproposer-lib on macOS.
2023-10-19 22:20:39 +01:00
Em Sharnoff
2cf6a47cca vm-monitor: Deny not fail downscale if no memory stats yet (#5606)
Fixes an issue we observed on staging that happens when the
autoscaler-agent attempts to immediately downscale the VM after binding,
which is typical for pooled computes.

The issue was occurring because the autoscaler-agent was requesting
downscaling before the vm-monitor had gathered sufficient cgroup memory
stats to be confident in approving it. When the vm-monitor returned an
internal error instead of denying downscaling, the autoscaler-agent
retried the connection and immediately hit the same issue (in part
because cgroup stats are collected per-connection, rather than
globally).
2023-10-19 19:09:37 +01:00
Konstantin Knizhnik
5a8bcdccb0 Fix elog format error in wallog_mapping_file (#5602)
## Problem

Fix elog format error in wallog_mapping_file 

## Summary of changes

Use the proper case to avoid a compilation warning (treated as an error) when building the C code on macOS.


Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-10-19 20:24:35 +03:00
Em Sharnoff
2c8741a5ed vm-monitor: Log full error on message handling failure (#5604)
There's currently an issue with the vm-monitor on staging that's not
really feasible to debug because the current display impl gives no
context to the errors (just says "failed to downscale").

Logging the full error should help.

For communications with the autoscaler-agent, it's ok to only provide
the outermost cause, because we can cross-reference with the VM logs.
At some point in the future, we may want to change that.
2023-10-19 18:10:33 +02:00
Shany Pozin
893b7bac9a Fix neon_extra_builds.yml : nproc is not supported in mac os (#5598)
## Problem
nproc is not supported on macOS; use `sysctl -n hw.ncpu` instead.
2023-10-19 15:24:23 +01:00
Arthur Petukhovsky
66f8f5f1c8 Call walproposer from Rust (#5403)
Create Rust bindings for C functions from walproposer. This allows writing
better tests with real walproposer code without spawning multiple
processes and starting up the whole environment.

A `make walproposer-lib` stage was added to build the static libraries
`libwalproposer.a`, `libpgport.a`, `libpgcommon.a`. These libraries can
be statically linked to any executable to call walproposer functions.

`libs/walproposer/src/walproposer.rs` contains
`test_simple_sync_safekeepers` to test that walproposer can be called
from Rust to emulate sync_safekeepers logic. It can also be used as a
usage example.
2023-10-19 14:17:15 +01:00
Alexander Bayandin
3a19da1066 build(deps): bump rustix from 0.37.19 to 0.37.25 (#5596)
## Problem

@dependabot has bumped the `rustix` 0.36 line to the latest version in
https://github.com/neondatabase/neon/pull/5591, but didn't bump 0.37.

Also, update all Rust dependencies for
`test_runner/pg_clients/rust/tokio-postgres`.

Fixes
- https://github.com/neondatabase/neon/security/dependabot/39
- https://github.com/neondatabase/neon/security/dependabot/40

## Summary of changes
- `cargo update -p rustix@0.37.19`
- Update all dependencies for
`test_runner/pg_clients/rust/tokio-postgres`
2023-10-19 13:49:06 +01:00
Conrad Ludgate
572eda44ee update tokio-postgres (#5597)
https://github.com/neondatabase/rust-postgres/pull/23
2023-10-19 14:32:19 +02:00
Arpad Müller
b1d6af5ebe Azure blobs: Simplify error conversion by addition of to_download_error (#5575)
There is a bunch of duplication and manual Result handling that can be
simplified by moving the error conversion into a shared function, using
`map_err`, and the question mark operator.
2023-10-19 14:31:09 +02:00
Arpad Müller
f842b22b90 Add endpoint for querying time info for lsn (#5497)
## Problem

See #5468.

## Summary of changes

Add a new `get_timestamp_of_lsn` endpoint, returning the timestamp
associated with the given lsn.

Fixes #5468.

---------

Co-authored-by: Shany Pozin <shany@neon.tech>
2023-10-19 04:50:49 +02:00
dependabot[bot]
d444d4dcea build(deps): bump rustix from 0.36.14 to 0.36.16 (#5591)
Bumps [rustix](https://github.com/bytecodealliance/rustix) from 0.36.14
to 0.36.16.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-19 03:43:49 +01:00
Tristan Partin
c8637f3736 Remove specific file references in NOTICE
Seems like a burden to update this file with each major release.
2023-10-18 14:58:48 -05:00
John Spray
ecf759be6d tests: allow-list S3 500 on DeleteObjects key (#5586)
## Problem

S3 can give us a 500 whenever it likes: when this happens at request
level we eat it in `backoff::retry`, but when it happens for a key
inside a DeleteObjects request, we log it at warn level.

## Summary of changes

Allow-list this class of log message in all tests.
2023-10-18 15:16:58 +00:00
Arthur Petukhovsky
9a9d9eba42 Add test_idle_reconnections 2023-10-18 17:09:26 +03:00
Arseny Sher
1f4805baf8 Remove remnants of num_computes field.
Fixes https://github.com/neondatabase/neon/issues/5581
2023-10-18 17:09:26 +03:00
Konstantin Knizhnik
5c88213eaf Logical replication (#5271)
## Problem

See https://github.com/neondatabase/company_projects/issues/111

## Summary of changes

Save logical replication files in the WAL at compute and include them in
the basebackup at the page server.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-10-18 16:42:22 +03:00
John Spray
607d19f0e0 pageserver: clean up page service Result handling for shutdown/disconnect (#5504)
## Problem

- QueryError always logged at error severity, even though disconnections
are not true errors.
- QueryError type is not expressive enough to distinguish actual errors
from shutdowns.
- In some functions we're returning Ok(()) on shutdown, in others we're
returning an error

## Summary of changes

- Add QueryError::Shutdown and use it in places we check for
cancellation
- Adopt consistent Result behavior: disconnects and shutdowns are always
QueryError, not ok
- Transform shutdown+disconnect errors to Ok(()) at the very top of the
task that runs query handler
- Use the postgres protocol error code for "admin shutdown" in responses
to clients when we are shutting down.

Closes: #5517
2023-10-18 13:28:38 +01:00
dependabot[bot]
1fa0478980 build(deps): bump urllib3 from 1.26.17 to 1.26.18 (#5582) 2023-10-18 12:21:54 +01:00
Christian Schwarz
9da67c4f19 walredo: make request_redo() an async fn (#5559)
Stacked atop https://github.com/neondatabase/neon/pull/5557
Prep work for https://github.com/neondatabase/neon/pull/5560

These changes have a 2% impact on `bench_walredo`.
That's likely because of the `block_on()` in the innermost piece of
benchmark-only code.
So, it doesn't affect production code.
The use of closures in the benchmarking code prevents a straightforward
conversion of the whole benchmarking code to async.

before:

```
    $ cargo bench --features testing --bench bench_walredo
       Compiling pageserver v0.1.0 (/home/cs/src/neon/pageserver)
        Finished bench [optimized + debuginfo] target(s) in 2m 11s
         Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70)
    Gnuplot not found, using plotters backend
    short/short/1           time:   [26.363 µs 27.451 µs 28.573 µs]
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
    short/short/2           time:   [64.340 µs 64.927 µs 65.485 µs]
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) low mild
    short/short/4           time:   [101.98 µs 104.06 µs 106.13 µs]
    short/short/8           time:   [151.42 µs 152.74 µs 154.03 µs]
    short/short/16          time:   [296.30 µs 297.53 µs 298.88 µs]
    Found 14 outliers among 100 measurements (14.00%)
      10 (10.00%) high mild
      4 (4.00%) high severe

    medium/medium/1         time:   [225.12 µs 225.90 µs 226.66 µs]
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) low mild
    medium/medium/2         time:   [490.80 µs 491.64 µs 492.49 µs]
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) low mild
    medium/medium/4         time:   [934.47 µs 936.49 µs 938.52 µs]
    Found 5 outliers among 100 measurements (5.00%)
      3 (3.00%) low mild
      1 (1.00%) high mild
      1 (1.00%) high severe
    medium/medium/8         time:   [1.8364 ms 1.8412 ms 1.8463 ms]
    Found 4 outliers among 100 measurements (4.00%)
      4 (4.00%) high mild
    medium/medium/16        time:   [3.6694 ms 3.6896 ms 3.7104 ms]

```

after:

```

    $ cargo bench --features testing --bench bench_walredo
       Compiling pageserver v0.1.0 (/home/cs/src/neon/pageserver)
        Finished bench [optimized + debuginfo] target(s) in 2m 11s
         Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70)
    Gnuplot not found, using plotters backend
    short/short/1           time:   [28.345 µs 28.529 µs 28.699 µs]
                            change: [-0.2201% +3.9276% +8.2451%] (p = 0.07 > 0.05)
                            No change in performance detected.
    Found 17 outliers among 100 measurements (17.00%)
      4 (4.00%) low severe
      5 (5.00%) high mild
      8 (8.00%) high severe
    short/short/2           time:   [66.145 µs 66.719 µs 67.274 µs]
                            change: [+1.5467% +2.7605% +3.9927%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 5 outliers among 100 measurements (5.00%)
      5 (5.00%) low mild
    short/short/4           time:   [105.51 µs 107.52 µs 109.49 µs]
                            change: [+0.5023% +3.3196% +6.1986%] (p = 0.02 < 0.05)
                            Change within noise threshold.
    short/short/8           time:   [151.90 µs 153.16 µs 154.41 µs]
                            change: [-1.0001% +0.2779% +1.4221%] (p = 0.65 > 0.05)
                            No change in performance detected.
    short/short/16          time:   [297.38 µs 298.26 µs 299.20 µs]
                            change: [-0.2953% +0.2462% +0.7763%] (p = 0.37 > 0.05)
                            No change in performance detected.
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) high mild

    medium/medium/1         time:   [229.76 µs 230.72 µs 231.69 µs]
                            change: [+1.5804% +2.1354% +2.6635%] (p = 0.00 < 0.05)
                            Performance has regressed.
    medium/medium/2         time:   [501.14 µs 502.31 µs 503.64 µs]
                            change: [+1.8730% +2.1709% +2.5199%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 7 outliers among 100 measurements (7.00%)
      1 (1.00%) low mild
      1 (1.00%) high mild
      5 (5.00%) high severe
    medium/medium/4         time:   [954.15 µs 956.74 µs 959.33 µs]
                            change: [+1.7962% +2.1627% +2.4905%] (p = 0.00 < 0.05)
                            Performance has regressed.
    medium/medium/8         time:   [1.8726 ms 1.8785 ms 1.8848 ms]
                            change: [+1.5858% +2.0240% +2.4626%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 6 outliers among 100 measurements (6.00%)
      1 (1.00%) low mild
      3 (3.00%) high mild
      2 (2.00%) high severe
    medium/medium/16        time:   [3.7565 ms 3.7746 ms 3.7934 ms]
                            change: [+1.5503% +2.3044% +3.0818%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 3 outliers among 100 measurements (3.00%)
      3 (3.00%) high mild
```
2023-10-18 11:23:06 +01:00
Em Sharnoff
16c87b5bda Bump vm-builder v0.17.12 -> v0.18.1 (#5583)
The only applicable change was neondatabase/autoscaling#566, which updates
pgbouncer to 1.21.0 and enables support for prepared statements.
2023-10-18 11:10:01 +02:00
Em Sharnoff
9fe5cc6a82 vm-monitor: Switch from memory.high to polling memory.stat (#5524)
tl;dr it's really hard to avoid throttling from memory.high, and it
counts tmpfs & page cache usage, so it's also hard to make sense of.

In the interest of fixing things quickly with something that should be
*good enough*, this PR switches to periodically fetching memory
statistics from the cgroup's memory.stat and using that data to determine
if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties
we've hit with memory.high. This PR also supersedes #5488.
2023-10-17 15:30:40 -07:00
Conrad Ludgate
543b8153c6 proxy: add flag to reject requests without proxy protocol client ip (#5417)
## Problem

We need a flag to require proxy protocol (prerequisite for #5416)

## Summary of changes

Add a CLI flag to require client IP addresses. Error if the IP address is
missing when the flag is active.
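
A minimal sketch of such a flag (names are assumptions, not the proxy's actual
CLI):

```rust
use clap::Parser;
use std::net::IpAddr;

#[derive(Parser)]
struct ProxyArgs {
    /// Reject connections that don't carry a client IP via the PROXY protocol.
    #[arg(long)]
    require_client_ip: bool,
}

fn check_client_ip(args: &ProxyArgs, peer_ip: Option<IpAddr>) -> anyhow::Result<Option<IpAddr>> {
    match (args.require_client_ip, peer_ip) {
        // Flag is active and no IP was supplied: hard error instead of a
        // silent pass-through.
        (true, None) => anyhow::bail!("missing required client IP from proxy protocol"),
        (_, ip) => Ok(ip),
    }
}
```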
2023-10-17 16:59:35 +01:00
Christian Schwarz
3a8959a4c4 page_cache: remove dead code (#5493) 2023-10-17 15:56:16 +01:00
Christian Schwarz
4a50483861 docs: error handling: document preferred anyhow context & logging style (#5178)
We already had strong support for this many months ago on Slack:
https://neondb.slack.com/archives/C0277TKAJCA/p1673453329770429
2023-10-17 15:41:47 +01:00
Conrad Ludgate
f775928dfc proxy: refactor how and when connections are returned to the pool (#5095)
## Problem

Transactions break connections in the pool

fixes #4698 

## Summary of changes

* Pool `Client`s are smart objects that return themselves to the pool (see the sketch below)
* Pool `Client`s can be explicitly 'discard'ed
* Pool `Client`s are discarded when certain errors are encountered.
* Pool `Client`s are discarded when ReadyForQuery returns a non-idle
state.
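
A minimal sketch of the "client returns itself to the pool on drop" idea
(stand-in types, not the proxy's real ones):

```rust
use std::sync::{Arc, Mutex};

struct Conn; // stand-in for an actual database connection

struct Pool {
    idle: Mutex<Vec<Conn>>,
}

struct PooledClient {
    conn: Option<Conn>,
    pool: Arc<Pool>,
}

impl PooledClient {
    /// Drop the connection instead of returning it to the pool, e.g. after a
    /// protocol error or a non-idle ReadyForQuery.
    fn discard(mut self) {
        self.conn.take();
    }
}

impl Drop for PooledClient {
    fn drop(&mut self) {
        // A healthy client goes back into the shared pool automatically.
        if let Some(conn) = self.conn.take() {
            self.pool.idle.lock().unwrap().push(conn);
        }
    }
}
```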
2023-10-17 13:55:52 +00:00
John Spray
ea648cfbc6 tests: fix test_eviction_across_generations trying to evict temp files (#5579)
This test lists files in a timeline and then evicts them: if the
test runs slowly, it can encounter temp files from unfinished
downloads. Fix by filtering these out in evict_all_layers.
2023-10-17 13:26:11 +01:00
Arpad Müller
093f8c5f45 Update rust to 1.73.0 (#5574)
[Release notes](https://blog.rust-lang.org/2023/10/05/Rust-1.73.0.html)
2023-10-17 13:13:12 +01:00
Arpad Müller
00c71bb93a Also try to login to Azure via SDK provided methods (#5573)
## Problem

We should ideally use the Azure SDK's way of obtaining authorization, as
pointed out in
https://github.com/neondatabase/neon/pull/5546#discussion_r1360619178.

## Summary of changes

This PR adds support for Azure SDK based authentication, using
[DefaultAzureCredential](https://docs.rs/azure_identity/0.16.1/azure_identity/struct.DefaultAzureCredential.html),
which tries the following credentials:

* [EnvironmentCredential](https://docs.rs/azure_identity/0.16.1/azure_identity/struct.EnvironmentCredential.html),
reading from various env vars
* [ImdsManagedIdentityCredential](https://docs.rs/azure_identity/0.16.1/azure_identity/struct.ImdsManagedIdentityCredential.html),
using managed identity
* [AzureCliCredential](https://docs.rs/azure_identity/0.16.1/azure_identity/struct.AzureCliCredential.html),
using Azure CLI

closes #5566.
2023-10-17 11:59:57 +01:00
Christian Schwarz
9256788273 limit imitate accesses concurrency, using same semaphore as compactions (#5578)
Before this PR, when we restarted pageserver, we'd see a rush of
`$number_of_tenants` concurrent eviction tasks starting to do imitate
accesses building up in the period of `[init_order allows activations,
$random_access_delay + EvictionPolicyLayerAccessThreshold::period]`.

We simply cannot handle that degree of concurrent IO.

We already solved the problem for compactions by adding a semaphore.
So, this PR shares that semaphore for use by evictions.
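
For illustration, a sketch of the shared-semaphore pattern (the constant,
limit, and names here are assumptions, not the actual `tasks.rs` code):

```rust
use once_cell::sync::Lazy;
use tokio::sync::Semaphore;

// One process-wide semaphore shared by compactions and eviction imitate
// accesses, bounding the total number of concurrent IO-heavy background tasks.
static CONCURRENT_BACKGROUND_TASKS: Lazy<Semaphore> = Lazy::new(|| Semaphore::new(4));

async fn run_background_task<F, Fut, T>(kind: &'static str, work: F) -> T
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = T>,
{
    // The queue depth behind this semaphore is what the new metrics expose.
    let _permit = CONCURRENT_BACKGROUND_TASKS
        .acquire()
        .await
        .expect("semaphore is never closed");
    tracing::debug!(kind, "acquired background task permit");
    work().await
}
```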

Part of https://github.com/neondatabase/neon/issues/5479

Which is again part of https://github.com/neondatabase/neon/issues/4743

Risks / Changes In System Behavior
==================================

* we don't do evictions as timely as we currently do
* we log a bunch of warnings about eviction taking too long
* imitate accesses and compactions compete for the same concurrency
limit, so they'll slow each other down through this shared semaphore


Changes
=======

- Move the `CONCURRENT_COMPACTIONS` semaphore into `tasks.rs`
- Rename it to `CONCURRENT_BACKGROUND_TASKS`
- Use it also for the eviction imitate accesses:
    - Imitate accesses are both per-TIMELINE and per-TENANT
    - The per-TENANT is done through coalescing all the per-TIMELINE
      tasks via a tokio mutex `eviction_task_tenant_state`.
    - We acquire the CONCURRENT_BACKGROUND_TASKS permit early, at the
      beginning of the eviction iteration, well before the imitate
      accesses start (and they may not even start at all in the given
      iteration, as they happen only every $threshold).
    - Acquiring early is **sub-optimal** because when the per-timeline
      tasks coalesce on the `eviction_task_tenant_state` mutex,
      they are already holding a CONCURRENT_BACKGROUND_TASKS permit.
    - It's also unfair because tenants with many timelines win
      the CONCURRENT_BACKGROUND_TASKS more often.
    - I don't think there's another way though, without refactoring
      more of the imitate accesses logic, e.g, making it all per-tenant.
- Add metrics for queue depth behind the semaphore.
I found these very useful to understand what work is queued in the
system.

    - The metrics are tagged by the new `BackgroundLoopKind`.
    - On a green slate, I would have used `TaskKind`, but we already had
      pre-existing labels whose names didn't map exactly to task kind.
      Also the task kind is kind of a lower-level detail, so, I think
it's fine to have a separate enum to identify background work kinds.


Future Work
===========

I guess I could move the eviction tasks from a ticker to "sleep for
$period".
The benefit would be that the semaphore automatically "smears" the
eviction task scheduling over time, so, we only have the rush on restart
but a smeared-out rush afterward.

The downside is that this perverts the meaning of "$period", as we'd
actually not run the eviction at a fixed period. It also means the
"took too long" warning & metric become meaningless.

Then again, that is already the case for the compaction and gc tasks,
which do sleep for `$period` instead of using a ticker.
2023-10-17 11:29:48 +02:00
Joonas Koivunen
9e1449353d crash-consistent layer map through index_part.json (#5198)
Fixes #5172 as it:
- removes reconciliation with the remote index_part.json and accepts the remote
index_part.json as the truth, deleting any local progress which is yet
to be reflected in the remote
- moves to prefer remote metadata

Additionally:
- tests with single LOCAL_FS parametrization are cleaned up
- adds a test case for branched (non-bootstrap) local only timeline
availability after restart

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2023-10-17 10:04:56 +01:00
John Spray
b06dffe3dc pageserver: fixes to /location_config API (#5548)
## Problem

I found some issues with the `/location_config` API when writing new
tests.

## Summary of changes

- Calling the API with the "Detached" state is now idempotent.
- `Tenant::spawn_attach` now takes a boolean to indicate whether to
expect a marker file. Marker files are used in the old attach path, but
not in the new location conf API. They aren't needed because in the New
World, the choice of whether to attach via remote state ("attach") or to
trust local state ("load") will be revised to cope with the transitions
between secondary & attached (see
https://github.com/neondatabase/neon/issues/5550). It is okay to merge
this change ahead of that ticket, because the API is not used in the
wild yet.
- Instead of using `schedule_local_tenant_processing`, the location conf
API handler does its own directory creation and calls `spawn_attach`
directly.
- A new `unsafe_create_dir_all` is added. This differs from
crashsafe::create_dir_all in two ways:
- It is intentionally not crashsafe, because in the location conf API we
are no longer using directory or config existence as the signal for any
important business logic.
   - It is async and uses `tokio::fs`.
2023-10-17 10:21:31 +02:00
Christian Schwarz
b08a0ee186 walredo: fix race condition where shutdown kills the wrong process (#5557)
Before this PR, the following race condition existed:

```
T1: does the apply_wal_records() call and gets back an error
T2: does the apply_wal_records() call and gets back an error
T2: does the kill_and_shutdown
T2: new loop iteration
T2: launches new walredo process
T1: does the kill_and_shutdown of the new process
```

That last step is wrong, T2 already did the kill_and_shutdown.

The symptom of this race condition was that T2 would observe an error
when it tried to do something with the process after T1 killed it.
For example, but not limited to:
`POLLHUP` /  `"WAL redo process closed its stderr unexpectedly"`.

The fix in this PR is the following:

* Use Arc to represent walredo processes.
  The Arc lives at least as long as the walredo process.
* Use Arc::ptr_eq to determine whether to kill the process or not.
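
For illustration, a simplified sketch of the Arc::ptr_eq guard (stand-in types,
not the actual walredo manager): only the caller whose error came from *this*
process instance gets to tear it down; a racing caller that already swapped in
a new process leaves the slot alone.

```rust
use std::sync::{Arc, RwLock};

struct Process; // stand-in for the real walredo process handle

struct WalRedoManager {
    redo_process: RwLock<Option<Arc<Process>>>,
}

impl WalRedoManager {
    fn kill_if_current(&self, observed: &Arc<Process>) {
        let mut slot = self.redo_process.write().unwrap();
        let is_same = match slot.as_ref() {
            // Pointer equality: the slot still holds exactly the process we
            // observed the error on.
            Some(current) => Arc::ptr_eq(current, observed),
            None => false,
        };
        if is_same {
            // Remove it from the slot; the process is torn down once the last
            // Arc (including `observed`) is dropped.
            *slot = None;
        }
        // Otherwise another caller already killed it (and possibly started a
        // new process), so we must not touch the current slot.
    }
}
```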

The price is an additional RwLock to protect the new `redo_process`
field
that holds the Arc. I guess that could perhaps be an atomic pointer
swap some day. But, let's get one race fixed without risking introducing
a new one.

The use of Arc/drop is also not super great here because it now allows
for an unlimited number of to-be-killed processes to exist concurrently.
See the various `NB` comments above `drop(proc)` for why it's "ok" right
now due to the blocking `wait` inside `drop`.

Note: an earlier fix attempt was
https://github.com/neondatabase/neon/pull/5545
where apply_batch_postgres would compare stdout_fd for equality.
That's incorrect because the kernel can reuse the file descriptor when
T2 launches the new process.
Details:
https://github.com/neondatabase/neon/pull/5545#pullrequestreview-1676589373
2023-10-17 09:55:39 +02:00
Arpad Müller
3666df6342 azure_blob.rs: use division instead of left shift (#5572)
It should have been a right shift, but I did a left shift. It's constant-folded
anyway, so we just use a division.
2023-10-16 19:52:07 +01:00
Alexey Kondratov
0ca342260c [compute_ctl+pgxn] Handle invalid databases after failed drop (#5561)
## Problem

In 89275f6c1e we fixed an issue where we were dropping the db in Postgres
even though the cplane request failed. Yet, it introduced a new problem:
we now de-register the db in cplane even if we didn't actually drop it in
Postgres.

## Summary of changes

Here we revert the extension change, so we may once again leave the db in an
invalid state after a failed drop. Instead, `compute_ctl` is now responsible for
cleaning up invalid databases during full configuration. Thus, there are
two ways of recovering from a failed DROP DATABASE:
1. The user can just repeat DROP DATABASE, same as in vanilla Postgres.
2. If they didn't, then on the next full configuration (dbs / roles changes
   in the API; password reset; or data availability check) the invalid db
   will be cleaned up in Postgres and re-created by `compute_ctl`. So
   again it follows pretty much the same semantics as vanilla Postgres --
   you need to drop it again after the failed drop.

That way, we have a recovery trajectory for both problems.

See this commit for info about `invalid` db state:
  a4b4cc1d60

According to it:
> An invalid database cannot be connected to anymore, but can still be
dropped.

While at it, this commit also fixes another issue where `compute_ctl`
was trying to connect to databases with `ALLOW CONNECTIONS false`. Now
it will just skip them.

Fixes #5435
2023-10-16 20:46:45 +02:00
John Spray
ded7f48565 pageserver: measure startup duration spent fetching remote indices (#5564)
## Problem

Currently it's unclear how much of the `initial_tenant_load` period is
spent fetching S3 objects, and therefore how impactful it is to make changes to
remote operations during startup.

## Summary of changes

- `Tenant::load` is refactored to load remote indices in parallel and to
wait for all these remote downloads to finish before it proceeds to
construct any `Timeline` objects.
- `pageserver_startup_duration_seconds` gets a new `phase` value of
`initial_tenant_load_remote` which counts the time from startup to when
the last tenant finishes loading remote content.
- `test_pageserver_restart` is extended to validate this phase. The
previous version of the test was relying on order of dict entries, which
stopped working when adding a phase, so this is refactored a bit.
- `test_pageserver_restart` used to explicitly create a branch, now it
uses the default initial_timeline. This avoids startup getting held up
waiting for logical sizes, when one of the branches is not in use.
2023-10-16 18:21:37 +01:00
Arpad Müller
e09d5ada6a Azure blob storage support (#5546)
Adds prototype-level support for [Azure blob storage](https://azure.microsoft.com/en-us/products/storage/blobs). Some corners were cut, see the TODOs and the followup issue #5567 for details.

Steps to try it out:

* Create a storage account with block blobs (this is a per-storage
account setting).
* Create a container inside that storage account.
* Set the appropriate env vars: `AZURE_STORAGE_ACCOUNT,
AZURE_STORAGE_ACCESS_KEY, REMOTE_STORAGE_AZURE_CONTAINER,
REMOTE_STORAGE_AZURE_REGION`
* Set the env var `ENABLE_REAL_AZURE_REMOTE_STORAGE=y` and run `cargo
test -p remote_storage azure`

Fixes  #5562
2023-10-16 17:37:09 +02:00
Conrad Ludgate
8c522ea034 proxy: count cache-miss for compute latency (#5539)
## Problem

Would be good to view latency for hot-path vs cold-path

## Summary of changes

add some labels to latency metrics
2023-10-16 16:31:04 +01:00
John Spray
44b1c4c456 pageserver: fix eviction across generations (#5538)
## Problem

Bug was introduced by me in 83ae2bd82c

When eviction constructs a RemoteLayer to replace the layer it just
evicted, it is building a LayerFileMetadata using its _current_
generation, rather than the generation of the layer.

## Summary of changes

- Retrieve Generation from RemoteTimelineClient when evicting. This will
no longer be necessary when #4938 lands.
- Add a test for the scenario in question (this fails without the fix).
2023-10-15 20:23:18 +01:00
Christian Schwarz
99c15907c1 walredo: trim public interfaces (#5556)
Stacked atop https://github.com/neondatabase/neon/pull/5554.
2023-10-13 19:35:53 +01:00
Christian Schwarz
c3626e3432 walredo: remove legacy wal-redo-datadir cleanup code (#5554)
It says it in the comment.
2023-10-13 19:16:15 +01:00
Christian Schwarz
dd6990567f walredo: apply_batch_postgres: get a backtrace whenever it encounters an error (#5541)
For 2 weeks we've seen rare, spurious, non-reproducible page reconstruction
failures with PG16 in prod.

One of the commits we deployed this week was

    commit fc467941f9
    Author: Joonas Koivunen <joonas@neon.tech>
    Date:   Wed Oct 4 16:19:19 2023 +0300

        walredo: log retryed error (#546)

With the logs from that commit, we learned that some read() or write()
system call that walredo does fails with `EAGAIN`, aka
`Resource temporarily unavailable (os error 11)`.

But we have no idea where exactly in the code we get back that error.

So, use anyhow instead of fake std::io::Error's as an easy way to get
a backtrace when the error happens, and change the logging to print
that backtrace (i.e., use `{:?}` instead of
`utils::error::report_compact_sources(e)`).

The `WalRedoError` type had to go because we add additional `.context()`
further up the call chain before we `{:?}`-print it. That additional
`.context()` further up doesn't see that there's already an
anyhow::Error
inside the `WalRedoError::ApplyWalRecords` variant, and hence captures
another backtrace and prints that one on `{:?}`-print instead of the
original one inside `WalRedoError::ApplyWalRecords`.

If we ever switch back to `report_compact_sources`, we should make sure
we have some other way to uniquely identify the places where we return
an error in the error message.
2023-10-13 14:08:23 +00:00
khanova
21deb81acb Fix case for array of jsons (#5523)
## Problem

Currently proxy doesn't handle array of json parameters correctly.

## Summary of changes

Added one more level of quote escaping for the array-of-JSONs case.
Resolves: https://github.com/neondatabase/neon/issues/5515
2023-10-12 14:32:49 +02:00
khanova
dbb21d6592 Make http timeout configurable (#5532)
## Problem

Currently the HTTP timeout is hardcoded to 15 seconds.

## Summary of changes

Added an option to configure it via CLI args.

Context: https://neondb.slack.com/archives/C04DGM6SMTM/p1696941726151899
2023-10-12 11:41:07 +02:00
Joonas Koivunen
ddceb9e6cd fix(branching): read last record lsn only after Tenant::gc_cs (#5535)
Fixes #5531, at least the latest error of not being able to create a
branch from the head under write and gc pressure.
2023-10-11 16:24:36 +01:00
John Spray
0fc3708de2 pageserver: use a backoff::retry in Deleter (#5534)
## Problem

The `Deleter` currently doesn't use a backoff::retry because it doesn't
need to: it is already inside a loop when doing the deletion, so can
just let the loop go around.

However, this is a problem for logging, because we log on errors, which
includes things like 503/429 cases that would usually be swallowed by a
backoff::retry in most places we use the RemoteStorage interface.

The underlying problem is that RemoteStorage doesn't have a proper error
type, and an anyhow::Error can't easily be interrogated for its original
S3 SdkError because downcast_ref requires a concrete type, but SdkError
is parametrized on response type.

## Summary of changes

Wrap remote deletions in Deleter in a backoff::retry to avoid logging
warnings on transient 429/503 conditions, and for symmetry with how
RemoteStorage is used in other places.
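
For illustration, a generic sketch of the retry-with-backoff idea (this is
*not* the signature of the repo's `backoff::retry` helper): transient failures
are retried with exponential backoff instead of surfacing a warning on the
first attempt.

```rust
use std::time::Duration;

async fn retry_with_backoff<T, E, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_millis(100);
    let mut attempt = 0;
    loop {
        match op().await {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                attempt += 1;
                tokio::time::sleep(delay).await;
                // Exponential backoff, capped at 10 seconds.
                delay = (delay * 2).min(Duration::from_secs(10));
            }
        }
    }
}
```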
2023-10-11 15:25:08 +01:00
John Spray
e0c8ad48d4 remote_storage: log detail errors in delete_objects (#5530)
## Problem

When we got an error in the payload of a DeleteObjects response, we only
logged how many errors there were, not what they were.

## Summary of changes

Log up to 10 specific errors. We do not log all of them because that
would be up to 1000 log lines per request.
2023-10-11 13:22:00 +01:00
John Spray
39e144696f pageserver: clean up mgr.rs types that needn't be public (#5529)
## Problem

These types/functions are public, which prevents clippy from catching
unused items.

## Summary of changes

Move to `pub(crate)` and remove the error enum that becomes clearly
unused as a result.
2023-10-11 11:50:16 +00:00
Alexander Bayandin
653044f754 test_runners: increase some timeouts to make tests less flaky (#5521)
## Problem
- `test_heavy_write_workload` is flaky, and fails because of the
statement timeout
- `test_wal_lagging` is flaky and fails because of the default pytest
timeout (see https://github.com/neondatabase/neon/issues/5305)

## Summary of changes
- `test_heavy_write_workload`: increase statement timeout to 5 minutes
(from default 2 minutes)
- `test_wal_lagging`: increase pytest timeout to 600s (from default
300s)
2023-10-11 10:49:15 +01:00
Vadim Kharitonov
80dcdfa8bf Update pgvector to 0.5.1 (#5525) 2023-10-11 09:47:19 +01:00
Arseny Sher
685add2009 Enable /metrics without auth.
To enable auth faster.
2023-10-10 20:06:25 +03:00
Conrad Ludgate
d4dc86f8e3 proxy: more connection metrics (#5464)
## Problem

It's hard to tell:
1. How many clients are connected to proxy
2. How many requests clients are making
3. How many connections are made to a database

1 and 2 are different because of the properties of HTTP.

We have 2 already tracked through `proxy_accepted_connections_total` and
`proxy_closed_connections_total`, but nothing for 1 and 3.

## Summary of changes

Adds 2 new counter gauges.

*
`proxy_opened_client_connections_total`,`proxy_closed_client_connections_total`
- how many client connections are open to proxy
*
`proxy_opened_db_connections_total`,`proxy_closed_db_connections_total`
- how many active connections are made through to a database.

For TCP and Websockets, we expect all 3 of these quantities to be
roughly the same, barring users connecting but with invalid details.

For HTTP:
* client_connections/connections can differ because the client
connections can be reused.
* connections/db_connections can differ because of connection pooling.
2023-10-10 16:33:20 +01:00
Alex Chi Z
5158de70f3 proxy: breakdown wake up failure metrics (#4933)
## Problem

Closes https://github.com/neondatabase/neon/issues/4702

## Summary of changes

This PR adds a new metric for wake-up errors and breaks it down by the most
common reasons (mostly following the `could_retry` implementation).
2023-10-10 13:17:37 +01:00
khanova
aec9188d36 Added timeout for http requests (#5514)
## Problem
Proxy timeout for HTTP requests

## Summary of changes
If an HTTP request exceeds 15s, it is killed.

Resolves: https://github.com/neondatabase/neon/issues/4847
2023-10-10 13:39:38 +02:00
John Spray
acefee9a32 pageserver: flush deletion queue on detach (#5452)
## Problem

If a caller detaches a tenant and then attaches it again, pending
deletions from the old attachment might not have happened yet. This is
not a correctness problem, but it causes:
- Risk of leaking some objects in S3
- Some warnings from the deletion queue when pending LSN updates and
pending deletions don't pass validation.

## Summary of changes

- Deletion queue now uses UnboundedChannel so that the push interfaces
don't have to be async.
- This was pulled out of https://github.com/neondatabase/neon/pull/5397,
where it is also useful to be able to drive the queue from non-async
contexts.
- Why is it okay for this to be unbounded? The only way the
unbounded-ness of the channel can become a problem is if writing out
deletion lists can't keep up, but if the system were that overloaded
then the code generating deletions (GC, compaction) would also be
impacted.
- DeletionQueueClient gets a new `flush_advisory` function, which is
like flush_execute, but doesn't wait for completion: this is appropriate
for use in contexts where we would like to encourage the deletion queue
to flush, but don't need to block on it.
- This function is also expected to be useful in next steps for seamless
migration, where the option to flush to S3 while transitioning into
AttachedStale will also include flushing deletion queue, but we wouldn't
want to block on that flush.
- The tenant_detach code in mgr.rs invokes flush_advisory after stopping
the `Tenant` object.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-10-10 10:46:24 +01:00
Conrad Ludgate
bf065aabdf proxy: update locked error retry filter (#5376)
## Problem

We don't want to retry customer quota exhaustion errors.

## Summary of changes

Make sure both types of quota exhaustion errors are not retried
2023-10-10 08:59:16 +01:00
Konstantin Knizhnik
fe74fac276 Fix handling flush error in prefetch (#5473)
## Problem

See https://neondb.slack.com/archives/C05U648A9NJ

If a flush fails in prefetch, the prefetch state is reset. We need to retry the
register-buffer attempt, otherwise we will get an assertion failure.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-10-10 07:43:37 +03:00
Alexander Bayandin
b91ac670e1 Update plpgsql_check extension to 2.5.3 (#5437) 2023-10-09 17:07:43 +01:00
John Spray
b3195afd20 tests: fix a race in test_deletion_queue_recovery on loaded nodes (#5495)
## Problem

Seen in CI for https://github.com/neondatabase/neon/pull/5453 -- the
time gap between validation completing and the header getting written is
long enough to fail the test, where it was doing a cheeky 1 second
sleep.

## Summary of changes

- Replace 1 second sleep with a wait_until to see the header file get
written
- Use enums as test params to make the results more readable (instead of
True-False parameters)
- Fix the temp suffix used for deletion queue headers: this worked fine,
but resulted in `..tmp` extension.
2023-10-09 16:28:28 +01:00
John Spray
7eaa7a496b pageserver: cancellation handling in writes to postgres client socket (#5503)
## Problem

Writes to the postgres client socket from the page server were not
wrapped in cancellation handling, so a stuck client connection could
prevent tenant shutdown.

## Summary of changes

All the places we call flush() to write to the socket, we should be
respecting the cancellation token for the task.

In this PR, I explicitly pass around a CancellationToken rather than
doing inline `task_mgr::shutdown_token` calls, to avoid coupling it to
the global task_mgr state and make it easier to refactor later.

I have some follow-on commits that add a Shutdown variant to QueryError
and use it more extensively, but that's pure refactor so will keep
separate from this bug fix PR.

Closes: https://github.com/neondatabase/neon/issues/5341
2023-10-09 15:54:17 +01:00
Joonas Koivunen
4772cd6c93 fix: deny branching, starting compute from not yet uploaded timelines (#5484)
Part of #5172. First commits show that we used to allow starting up a
compute or creating a branch off a not yet uploaded timeline. This PR
moves activation of a timeline to happen **after** initial layer file(s)
(if any) and `index_part.json` have been uploaded. Simply moving
activation to be *after* downloads have finished works because we now
spawn a task per http request handler.

The current behaviour of uploading the timelines on the next startup is kept,
to be removed later as part of #5172.

Adds:
- `NeonCli.map_branch` and corresponding `neon_local` implementation:
allow creating computes for timelines managed via pageserver http
client/api
- possibly duplicate tests (I did not want to search for them; will clean up
in a follow-up if they are duplicated)

Changes:
- make `wait_until_tenant_state` return immediately on `Broken` and not
wait any longer
2023-10-09 17:03:38 +03:00
Shany Pozin
010b4d0d5c Move ApiError 404 to info level (#5501)
## Problem
Moving ApiError 404 to info level logging (see
https://github.com/neondatabase/neon/pull/5489#issuecomment-1750211212)
2023-10-09 13:54:46 +03:00
Rahul Modpur
477cb3717b Fix neon_local pageserver status command (#5475)
## Problem
The `neon_local pageserver status` command was broken; see #5430.

## Summary of changes
Fix clap config for pageserver status subcommand

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
2023-10-09 09:13:57 +01:00
John Spray
ea5a97e7b4 pageserver: implement emergency mode for operating without control plane (#5469)
## Problem

Pageservers with `control_plane_api` configured require a control plane
to start up: in an incident this might be a problem.

## Summary of changes

Note to reviewers: most of the code churn in mgr.rs is the refactor
commit that enables the later emergency mode commit: you may want to
review commits separately.

- Add `control_plane_emergency_mode` configuration property
- Refactor init_tenant_mgr to separate loading configurations from the
main loop where we construct Tenant, so that the generations fetch can
peek at the configs in emergency mode.
- During startup, in emergency mode, attach any tenants that were
attached on their last run, using the same generation number.

Closes: #5381 
Closes: https://github.com/neondatabase/neon/issues/5492
2023-10-06 17:25:21 +01:00
John Spray
547914fe19 pageserver: adjust timeline deletion for generations (#5453)
## Problem

Spun off from https://github.com/neondatabase/neon/pull/5449

Timeline deletion does the following:
1. Delete layers referenced in the index
2. Delete everything else in the timeline prefix, except the index
3. Delete the index.

When generations were added, the filter in step 2 got outdated, such
that the index objects were deleted along with everything else at step
2. That didn't really break anything, but it makes an automated test
unhappy and is a violation of the original intent of the code, which
presumably intends to upload an invariant that as long as any objects
for a timeline exist, the index exists.

(Eventually, this index-object-last complexity can go away: when we do
https://github.com/neondatabase/neon/issues/5080, there is no need to
keep the index_part around, as deletions can always be retried any time
any where.)

## Summary of changes

After object listing, split the listed objects into layers and index
objects. Delete the layers first, then the index objects.
2023-10-06 16:15:18 +00:00
Arpad Müller
607b185a49 Fix 1.73.0 clippy lints (#5494)
Doesn't do an upgrade of rustc to 1.73.0 as we want to wait for the
cargo response to the curl CVE before updating. In preparation for an
update, we address the clippy lints that are newly firing in 1.73.0.
2023-10-06 14:17:19 +01:00
Christian Schwarz
bfba5e3aca page_cache: ensure forward progress on miss (#5482)
Problem
=======

Prior to this PR, when we had a cache miss, we'd get back a write guard,
fill it, then drop it and retry the read from the cache.

If there's severe contention for the cache, it could happen that the
just-filled data gets evicted before our retry, resulting in lost work
and no forward progress.

Solution
========

This PR leverages the now-available `tokio::sync::RwLockWriteGuard`'s
`downgrade()` functionality to turn the filled slot write guard into a
read guard.
We don't drop the guard at any point, so, forward progress is ensured.
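
For illustration, a minimal sketch of the downgrade pattern (stand-in types,
not the real page cache): fill the slot under the write guard, then atomically
downgrade to a read guard so the freshly filled data cannot be evicted before
we read it.

```rust
use tokio::sync::RwLock;

async fn read_or_fill(slot: &RwLock<Option<Vec<u8>>>, fill: impl FnOnce() -> Vec<u8>) -> Vec<u8> {
    {
        let guard = slot.read().await;
        if let Some(data) = guard.as_ref() {
            return data.clone(); // cache hit
        }
    }
    // Cache miss: take the write lock, fill, and downgrade without ever
    // releasing the lock, guaranteeing forward progress under contention.
    let mut write_guard = slot.write().await;
    if write_guard.is_none() {
        *write_guard = Some(fill());
    }
    let read_guard = tokio::sync::RwLockWriteGuard::downgrade(write_guard);
    read_guard.as_ref().expect("just filled").clone()
}
```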


Refs
====

Stacked atop https://github.com/neondatabase/neon/pull/5480 

part of https://github.com/neondatabase/neon/issues/4743
specifically part of https://github.com/neondatabase/neon/issues/5479
2023-10-06 13:41:13 +01:00
Christian Schwarz
ecc7a9567b page_cache: inline {,try_}lock_for_write into memorize_materialized_page (#5480)
Motivation
==========

It's the only user, and the name of `_for_write` is wrong as of

    commit 7a63685cde
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Fri Aug 18 19:31:03 2023 +0200

        simplify page-caching of EphemeralFile (#4994)

Notes
=====

This also allows us to get rid of the WriteBufResult type.

Also rename `search_mapping_for_write` to `search_mapping_exact`. It
makes more sense that way because there is no `_for_write` locking anymore.

Refs
====

part of https://github.com/neondatabase/neon/issues/4743
specifically https://github.com/neondatabase/neon/issues/5479

this is prep work for https://github.com/neondatabase/neon/pull/5482
2023-10-06 13:38:02 +02:00
Joonas Koivunen
45f98dd018 debug_tool: get page at lsn and keyspace via http api (#5057)
If there are any layermap or layer file related problems, having a
reproducible `get_page@lsn` easily usable for fast debugging iteration
is helpful.

Split off from #4938.

Later evolved to add http apis for:
- `get_page@lsn` at
`/v1/tenant/:tenant_id/timeline/:timeline_id/get?key=<hex>&lsn=<lsn
string>`
- collecting the keyspace at
`/v1/tenant/:tenant_id/timeline/:timeline_id/keyspace?[at_lsn=<lsn
string>]`
    - defaults to `last_record_lsn`

Collecting the keyspace seems to yield some ranges for which there is no
key.
2023-10-06 12:17:38 +01:00
John Spray
bdfe27f3ac swagger: add a 503 definition to each endpoint (#5476)
## Problem

The control plane doesn't have generic handling for this.

## Summary of changes

Add a 503 response to every endpoint.
2023-10-06 11:31:49 +01:00
Joonas Koivunen
a15f9b3baa pageserver: Tune 503 Resource unavailable (#5489)
503 Resource Unavailable appears as an error in logs, but it is not really an
error that should ever fail a test, or even log at error level in prod; see
[evidence].

Changes:
- log 503 as `info!` level
- use `Cow<'static, str>` instead of `String`
- add an additional `wait_until_tenant_active` in
`test_actually_duplicate_l1`
 
We ought to have a "wait for tenants to complete loading" helper in tests, but
this is easier to implement for now.

[evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5485/6423110295/index.html#/testresult/182de66203864fc0
2023-10-06 09:59:14 +01:00
Alexander Bayandin
ce92638185 test_runner: allow race in test_tenant_delete_is_resumed_on_attach (#5478)
## Problem

`test_tenant_delete_is_resumed_on_attach` is flaky

## Summary of changes
- Allow race in `test_tenant_delete_is_resumed_on_attach`
- Cleanup `allowed_errors` in the file a bit
2023-10-06 09:49:31 +01:00
Joonas Koivunen
a3c82f19b8 tests: prettier subprocess output in test log (#5485)
Clean subprocess output so that:
- one line of output is just one line without a linebreak
    - like shells handle `echo subshell says: $(echo foo)`
- multiple lines are indented like other pytest output
- error output is dedented and then indented to be like other pytest
output

Minor readability changes remove friction.
2023-10-05 20:15:55 +00:00
Arthur Petukhovsky
8b15252f98 Move walproposer state into struct (#5364)
This patch extracts all postgres-dependent functions into a separate
`walproposer_api` struct of functions. It allows walproposer to be compiled as a
static library without compiling all the other postgres server code. This is
useful for calling walproposer C code from Rust, or linking this
library with anything else.

All global variables containing walproposer state were extracted to a
separate `WalProposer` struct. This makes it possible to run several
walproposers in the same process, in separate threads.

There were no logic changes, and the PR mostly consists of shuffling
functions between several files. We have a good test coverage for
walproposer code and I've seen no issues with tests while I was
refactoring it, so I don't expect any issues after merge.

ref https://github.com/neondatabase/neon/issues/547

---------

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-10-05 18:48:01 +01:00
Alexander Bayandin
522aaca718 Temporary deploy staging preprod region from main (#5477)
## Problem

The staging preprod region can't use `release-XXX` right now; the config is
unified across all regions and supports only `XXX`.

Ref
https://neondb.slack.com/archives/C03H1K0PGKH/p1696506459720909?thread_ts=1696437812.365249&cid=C03H1K0PGKH

## Summary of changes
- Deploy staging-preprod from main
2023-10-05 14:02:20 +00:00
John Spray
7cbb39063a tests: stabilize + extend deletion queue recovery test (#5457)
## Problem

This test was unstable when run in parallel with lots of others: if the
pageserver stayed up long enough for some of the deletions to get
validated, they wouldn't be discarded on restart the way the test expects
when keep_attachment=True. This was a test bug, not a pageserver bug.

## Summary of changes

- Add failpoints to control plane api client
- Use failpoint to pause validation in the test to cover the case where
it had been flaky
- Add a metric for the number of deleted keys validated
- Add a permutation to the test to additionally exercise the case where
we _do_ validate lists before restart: this is a coverage enhancement
that seemed sensible when realizing that the test was relying on nothing
being validated before restart.
- the test will now always enter the restart with nothing or everything
validated.
2023-10-05 11:22:05 +01:00
John Spray
baa5fa1e77 pageserver: location configuration API, attachment modes, secondary locations (#5299)
## Problem

These changes are part of building seamless tenant migration, as
described in the RFC:
- https://github.com/neondatabase/neon/pull/5029

## Summary of changes

- A new configuration type `LocationConf` supersedes `TenantConfOpt` for
storing a tenant's configuration in the pageserver repo dir. It contains
`TenantConfOpt`, as well as a new `mode` attribute that describes what
kind of location this is (secondary, attached, attachment mode etc). It
is written to a file called `config-v1` instead of `config` -- this
prepares us for neatly making any other profound changes to the format
of the file in future. Forward compat for existing pageserver code is
achieved by writing out both old and new style files. Backward compat is
achieved by checking for the old-style file if the new one isn't found.
- The `TenantMap` type changes, to hold `TenantSlot` instead of just
`Tenant`. The `Tenant` type continues to be used for attached tenants
only. Tenants in other states (such as secondaries) are represented by a
different variant of `TenantSlot`.
- Where `Tenant` & `Timeline` used to hold an `Arc<Mutex<TenantConfOpt>>`,
they now hold a reference to an `AttachedTenantConf`, which includes the
extra information from `LocationConf`. This enables them to know the
current attachment mode.
- The attachment mode is used as an advisory input to decide whether to
do compaction and GC (AttachedStale is meant to avoid doing uploads,
AttachedMulti is meant to avoid doing deletions).
- A new HTTP API is added at `PUT /tenants/<tenant_id>/location_config`
to drive new location configuration. This provides a superset of the
functionality of attach/detach/load/ignore:
  - Attaching a tenant is just configuring it in an attached state
  - Detaching a tenant is configuring it to a detached state
  - Loading a tenant is just the same as attaching it
- Ignoring a tenant is the same as configuring it into Secondary with
warm=false (i.e. retain the files on disk but do nothing else).

Caveats:
- AttachedMulti tenants don't do compaction in this PR, but they do in
the follow on #5397
- Concurrent updates to the `location_config` API are not handled
elegantly in this PR, a better mechanism is added in the follow on
https://github.com/neondatabase/neon/pull/5367
- Secondary mode is just a placeholder in this PR: the code to upload
heatmaps and do downloads on secondary locations will be added in a
later PR (but that shouldn't change any external interfaces)

Closes: https://github.com/neondatabase/neon/issues/5379

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-10-05 09:55:10 +01:00
Conrad Ludgate
c216b16b0f proxy: fix memory leak (#5472)
## Problem

These JoinSets live for the duration of the process. They might have
many millions of connections spawned on them, and they never get cleared.

Fixes #4672 

## Summary of changes

Drain the connections as we go
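A minimal sketch of the approach, assuming tokio's `JoinSet` (this is not the proxy's actual accept loop): completed tasks are reaped as new connections arrive, so finished entries don't accumulate.

```rust
use tokio::net::TcpListener;
use tokio::task::JoinSet;

// Illustrative accept loop: the JoinSet is drained as we go instead of
// only ever accumulating spawned connection tasks.
async fn accept_loop(listener: TcpListener) -> std::io::Result<()> {
    let mut connections: JoinSet<()> = JoinSet::new();
    loop {
        tokio::select! {
            accepted = listener.accept() => {
                let (socket, _peer) = accepted?;
                connections.spawn(async move {
                    // handle_connection(socket).await would live here
                    drop(socket);
                });
            }
            // Reap a finished task, releasing its result and bookkeeping.
            Some(_result) = connections.join_next() => {}
        }
    }
}
```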
2023-10-05 07:30:28 +01:00
John Spray
c5ea91f831 pageserver: fix loading control plane JWT token (#5470)
## Problem

In #5383 this configuration was added, but it missed the parts of the
Builder class that let it actually be used.

## Summary of changes

Add `control_plane_api_token` hooks to PageserverConfigBuilder
2023-10-05 01:31:17 +01:00
Em Sharnoff
6489a4ea40 vm-monitor: Remove mem::forget of tokio::sync::mpsc::Sender (#5441)
If the cgroup integration was not enabled, this would cause compute_ctl
to leak memory.

Thankfully, we never use vm-monitor *without* the cgroup handling
enabled, so this wasn't actually impacting us, but... it still looked
suspicious, so I figured it was worth changing.
2023-10-04 15:08:10 -07:00
Arthur Petukhovsky
f8a7498965 Wait for sk tli init in test_timeline_status (#5467)
Fix #5447
2023-10-04 22:53:34 +01:00
Joonas Koivunen
7dce62a9ee test: duplicate L1 layer (#5412)
We overwrite L1 layers if compaction gets interrupted. We did not have a
test showing that we do in fact do this.

The test might be a bit flaky due to timestamp usage, but it is separated
out here to keep the diff smaller, as part of #5172.

Also removes an unrelated 200s pgbench from the test suite.
2023-10-04 16:52:32 +01:00
Alexander Bayandin
7a2cafb34d Use zstd to compress large allure artifacts (#5458)
## Problem

- Because we compress artifacts file by file, we don't need to put them
into `tar` containers (i.e. instead of `tar.gz` we can use just `gz`).
- Python's gzip is single-threaded and pretty slow.

A benchmark showed a ~20x speedup (19.876176291 vs
0.8748335830000009) on my laptop (for a pageserver.log of size 1.3M).

## Summary of changes
- Replace tarfile with zstandard
- Update allure to 2.24.0
2023-10-04 16:20:16 +01:00
duguorong009
25a37215f3 fix: replace all std::PathBufs with camino::Utf8PathBuf (#5352)
Fixes #4689 by replacing all of `std::Path`, `std::PathBuf` with
`camino::Utf8Path`, `camino::Utf8PathBuf` in
- pageserver
- safekeeper
- control_plane
- libs/remote_storage
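
A minimal illustration of the camino types used in those crates (not code from the PR itself; the directory layout shown is illustrative):

```rust
// Minimal illustration only: camino paths are guaranteed UTF-8, so they
// display and serialize without the lossy conversions std::path needs.
use camino::{Utf8Path, Utf8PathBuf};

fn timeline_dir(tenants_dir: &Utf8Path, tenant_id: &str, timeline_id: &str) -> Utf8PathBuf {
    // `join` works like std's, but stays in Utf8PathBuf land.
    tenants_dir.join(tenant_id).join("timelines").join(timeline_id)
}
```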

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-04 17:52:23 +03:00
Conrad Ludgate
f002b1a219 proxy: http limits (#5460)
## Problem

The 1 MB request body limit is apparently too small for some clients.

## Summary of changes

Raise the request body limit to 10 MB. Also revert the removal of response limits
while we don't have streaming support.
2023-10-04 15:01:05 +01:00
Joonas Koivunen
fc467941f9 walredo: log retryed error (#5462)
We currently lose the actual reason the first walredo attempt failed.
Together with the implicit retry, that makes it difficult to eyeball what is
happening.

This PR keeps logging the same error message twice, which is what
we've been doing all along. However, correlating the retried case with
the finally returned error used to be difficult, because the actual error
message was left out before this PR.

Lastly, log the final error we present to postgres *in the same span*,
not outside it. Additionally, suppress the stacktrace as the comment
suggested.
2023-10-04 14:19:19 +01:00
Christian Schwarz
25bf791568 metrics: distinguish page reconstruction success & failure (#5463)
Here are the existing dashboards that use the metric:


https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_getpage_reconstruct_seconds&type=code

Looks like only `_count` and `_sum` values are used currently.
We can fix them up easily post merge.

I think the histogram is worth keeping, though.
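
One way to separate success and failure latencies is a label on the histogram; here is a hedged sketch assuming the `prometheus` and `once_cell` crates (the actual change may be structured differently):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

static RECONSTRUCT_TIME: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "pageserver_getpage_reconstruct_seconds",
        "Time spent reconstructing a page image",
        &["result"] // e.g. "ok" / "err" -- label name is an assumption
    )
    .expect("failed to register metric")
});

fn observe_reconstruct(outcome: &str, seconds: f64) {
    RECONSTRUCT_TIME
        .with_label_values(&[outcome])
        .observe(seconds);
}
```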

follow-up to
https://github.com/neondatabase/neon/pull/5459#pullrequestreview-1657072882

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-04 13:40:00 +01:00
Joonas Koivunen
dee2bcca44 fix: time the reconstruction, not future creation (#5459)
The `pageserver_getpage_reconstruct_seconds` histogram had only been
recording the time it takes to create the future, not to await it. Since
eb0a698adc.
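
A hedged illustration of the bug class, with made-up names:

```rust
use std::time::Instant;

async fn reconstruct() { /* the actual work being measured */ }

async fn timed() {
    // Buggy: measures only how long it takes to *create* the future (~ns).
    let started = Instant::now();
    let fut = reconstruct();
    let _creation_only = started.elapsed();

    // Fixed: measure across the await, i.e. the actual reconstruction work.
    let started = Instant::now();
    fut.await;
    let _reconstruction = started.elapsed();
}
```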
2023-10-04 11:01:07 +01:00
Joonas Koivunen
db8ff9d64b testing: record walredo failures to test reports (#5451)
We have rare walredo failures with pg16.

Let's introduce recording of failing walredo input under `#[cfg(feature =
"testing")]`. There is also additional logging (the value reconstruction path
logging usually shown for not-found keys), which is likewise kept under
`#[cfg(feature = "testing")]`.

Cc: #5404.
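
A hedged sketch of gating such diagnostics behind the `testing` feature (function and path names are made up):

```rust
// Only compiled in when the `testing` cargo feature is enabled.
#[cfg(feature = "testing")]
fn record_walredo_failure(input: &[u8]) {
    // Persist the failing input so the test report can include it.
    // The path here is illustrative.
    let _ = std::fs::write("/tmp/walredo-failed-input.bin", input);
}

#[cfg(not(feature = "testing"))]
fn record_walredo_failure(_input: &[u8]) {}
```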
2023-10-04 11:24:30 +03:00
Rahul Modpur
af6a20dfc2 Improve CrashsafeOverwriteError source printing (#5410)
## Problem

Duplication of error in log

Fixes #5366 

## Summary of changes

Removed `{0}` from the error description above each enum variant; `#[source]`
is already present, so this avoids the duplication.

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
2023-10-04 02:38:42 +02:00
Alexander Bayandin
fec94ad5b3 Update checksums for pg_jsonschema & pg_graphql (#5455)
## Problem

Folks have re-tagged releases for `pg_jsonschema` and `pg_graphql` (to
increase timeouts on their CI). For us these are no-op changes,
but unfortunately they will cause our builds to fail due to checksum
mismatches (this might not strike right away because of the build cache).
- 8ba7c7be9d
- aa7509370a

## Summary of changes
- `pg_jsonschema` update checksum
- `pg_graphql` update checksum
2023-10-03 18:42:39 +01:00
John Spray
ace0c775fc pageserver: prefer 503 to 500 for transient unavailability (#5439)
## Problem

The 500 status code should only be used for bugs or unrecoverable
failures: situations we did not expect. Currently, the pageserver is
misusing this response code for some situations that are totally normal,
like requests targeting tenants that are in the process of activating.

The 503 response is a convenient catch-all for "I can't right now, but I
will be able to".

## Summary of changes

- Change some transient availability error conditions to return 503
instead of 500
- Update the HTTP client configuration in integration tests to retry on
503

After these changes, things like creating a tenant and then trying to
create a timeline within it will no longer require carefully checking
its status first, or retrying on 500s. Instead, a client which is
properly configured to retry on 503 can quietly handle such situations.
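
As a hedged sketch of the mapping (using the `http` crate's status codes and an illustrative error enum, not the pageserver's real types):

```rust
// Illustrative only: map "not ready yet" conditions to 503 so clients retry,
// and keep 500 for genuinely unexpected failures.
use http::StatusCode;

enum ApiError {
    ResourceUnavailable(String), // e.g. tenant still activating
    InternalServerError(String), // bugs / unexpected failures
}

fn status_for(err: &ApiError) -> StatusCode {
    match err {
        ApiError::ResourceUnavailable(_) => StatusCode::SERVICE_UNAVAILABLE, // 503
        ApiError::InternalServerError(_) => StatusCode::INTERNAL_SERVER_ERROR, // 500
    }
}
```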
2023-10-03 17:00:55 +01:00
dependabot[bot]
78dde31827 build(deps): bump urllib3 from 1.26.11 to 1.26.17 (#5442) 2023-10-03 11:50:27 +01:00
Christian Schwarz
de0e96d2be remote_storage: separate semaphores for read and write ops (#5440)
Before this PR, a compaction that queues a lot of uploads could grab all
the semaphore permits.

Any readers that need on-demand downloads would queue up, causing
getpage@lsn outliers.

Internal context:
https://neondb.slack.com/archives/C05NXJFNRPA/p1696264359425419?thread_ts=1696250393.840899&cid=C05NXJFNRPA
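
A minimal sketch of the idea, assuming tokio semaphores (type names and permit counts are illustrative):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Illustrative limiter: reads and writes no longer share permits, so a burst
// of queued uploads cannot block on-demand downloads.
#[derive(Clone)]
struct RemoteOpLimiter {
    read: Arc<Semaphore>,
    write: Arc<Semaphore>,
}

impl RemoteOpLimiter {
    fn new(read_permits: usize, write_permits: usize) -> Self {
        Self {
            read: Arc::new(Semaphore::new(read_permits)),
            write: Arc::new(Semaphore::new(write_permits)),
        }
    }

    async fn download<T>(&self, op: impl std::future::Future<Output = T>) -> T {
        let _permit = self.read.acquire().await.expect("semaphore not closed");
        op.await
    }

    async fn upload<T>(&self, op: impl std::future::Future<Output = T>) -> T {
        let _permit = self.write.acquire().await.expect("semaphore not closed");
        op.await
    }
}
```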
2023-10-03 11:22:11 +03:00
Alexander Bayandin
00369c8c2a Update pg_jsonschema & pg_grapgql extensions (#5438)
- Update `pg_jsonschema` to 0.2.0 with Postgres 16 support
- Update `pg_graphql` to 1.4.0 with Postgres 16 support
- Remove `pgx` (old name of `pgrx`) layer from Dockerfile
2023-10-02 23:50:27 +01:00
Vadim Kharitonov
c1dcf61ca2 Update pgx-ulid extension (#5382)
- Update `pgx-ulid` from 0.1.0 to 0.1.3, and add it to Postgres 16
- Add `pg_tiktoken` to Postgres 16 image

Closes #5374

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-10-02 15:52:45 +01:00
Sasha Krassovsky
89275f6c1e Fix invalid database resulting from failed DROP DB (#5423)
## Problem
If the control plane happened to respond to a DROP DATABASE request with
a non-200 response, we'd abort the DROP DATABASE transaction in the
usual spot. However, Postgres for some reason actually performs the drop
inside of `standard_ProcessUtility`. As such, the database was left in a
weird state after aborting the transaction. We had test coverage of a
failed CREATE DATABASE but not a failed DROP DATABASE.
 
## Summary of changes
Since DROP DATABASE can't run inside a transaction block, we can just
forward the DDL changes to the control plane inside
`ProcessUtility_hook`, and if the control plane responds with a 500, bail out of
`ProcessUtility` before we perform the drop. This change also adds a
test, which reproduced the invalid database issue before the fix was
applied.
2023-09-29 19:39:28 +01:00
Christian Schwarz
c07eef8ea5 page_cache: find_victim: don't spin while there's no chance for a slot (#5319)
It is wasteful to cycle through the page cache slots trying to find a
victim slot if all the slots are currently un-evictable because a read /
write guard is alive.

We suspect this wasteful cycling to be the root cause for an
"indigestion" we observed in staging (#5291).
The hypothesis is that we `.await` after we get ahold of a read / write
guard, and that tokio actually deschedules us in favor of another
future.
If that other future then needs a page slot, it can't get ours because
we're holding the guard.
Repeat this, and eventually, the other future(s) will find themselves
doing `find_victim` until they hit `exceeded evict iter limit`.

The `find_victim` loop is wasteful and CPU-starves the futures that are
already holding the read/write guard. A `yield` inside `find_victim`
could mitigate the starvation, but wouldn't fix the wasting of CPU
cycles.

So instead, this PR queues waiters behind a tokio semaphore that counts
evictable slots.
The downside is that this stops the clock page replacement if we have 0
evictable slots.

Also, as explained by the big block comment in `find_victims`, the
semaphore doesn't fully prevent starvation, because we can't make
tokio prioritize those tasks executing `find_victim` that have been
trying the longest.

Implementation
===============
We need to acquire the semaphore permit before locking the slot.
Otherwise, we could deadlock / discover that all permits are gone and
would have to relinquish the slot, having moved forward the Clock LRU
without making progress.

The downside is that we never get full throughput for read-heavy
workloads, because until a reader coalesces onto an existing permit,
it holds its own permit.
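
A hedged sketch of the mechanism (illustrative types, not the pageserver's actual page cache): a permit is acquired before the clock sweep, so waiters park on the semaphore instead of spinning.

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

struct PageCache {
    // One permit per slot that is not pinned by a read/write guard.
    evictable: Arc<Semaphore>,
}

impl PageCache {
    async fn acquire_victim_permit(&self) -> OwnedSemaphorePermit {
        // Waiters queue here instead of burning CPU cycling through slots
        // that are all pinned.
        self.evictable
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore is never closed")
        // With a permit in hand, the clock sweep is guaranteed to find at
        // least one unpinned slot (modulo racing acquirers).
    }
}
```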


Addendum To Root-Cause Analysis In #5291
========================================

Since merging that PR, @arpad-m pointed out that we couldn't have
reached the `slot.write().await` with his patches: the
VirtualFile slots can't have all been write-locked, because we only hold
them locked while the IO is ongoing, and the IO is still done with
synchronous system calls in that patch set, so we can have had at most
$number_of_executor_threads slots locked at any given time.
I count 3 tokio runtimes that do `Timeline::get`, each with 8 executor
threads in our deployment, so $number_of_executor_threads = 3*8 = 24.
But the virtual file cache has 100 slots.

We both agree that nothing changed about the core hypothesis, i.e.,
additional await points inside VirtualFile caused higher concurrency
resulting in exhaustion of page cache slots.
But we'll need to reproduce the issue and investigate further to truly
understand the root cause, or find out that & why we were indeed using
100 VirtualFile slots.

TODO: could it be compaction that needs to hold guards of many
VirtualFile's in its iterators?
2023-09-29 20:03:56 +02:00
Alexander Bayandin
86dd28d4fb Bump hermit-abi & num_cpus packages (#5427)
## Problem

I've noticed that `hermit-abi`
0.3.1 [1] has been yanked from crates.io (looks like nothing too
bad [2]).
Also, we have 2 versions of `hermit-abi` in the dependencies (0.3.* and
0.2.*); update `num_cpus` to use the latest `hermit-abi`, 0.3.3.

- [1] https://crates.io/crates/hermit-abi/0.3.1
- [2] https://github.com/hermit-os/hermit-rs/issues/436

## Summary of changes
- `cargo update -p num-cpus`
- `cargo update -p hermit-abi`
- Unignore `RUSTSEC-2023-0052` in `deny.toml` (it has been fixed in
https://github.com/neondatabase/neon/pull/5069)
2023-09-29 12:57:45 +01:00
Conrad Ludgate
fd20bbc6cb proxy: log params when no endpoint (#5418)
## Problem

Our SNI error dashboard features IP addresses, but it's still not immediately
clear who they belong to (#5369).

## Summary of changes

Log some startup params with this error
2023-09-29 09:40:27 +01:00
John Spray
6a1903987a tests: use approximate equality in test_get_tenant_size_with_multiple_branches (#5411)
## Problem

This test has been flaky for a long time.

As far as I can tell, the test was simply wrong to expect postgres
activity to result in deterministic sizes: making the match fuzzy is not
a hack, it's just matching the reality that postgres doesn't promise to
write exactly the same number of pages every time it runs a given query.

## Summary of changes

Equalities now tolerate a difference of up to 4 pages. This is big enough to
tolerate the deltas we've seen in practice.

Closes: https://github.com/neondatabase/neon/issues/2962
2023-09-29 09:15:43 +01:00
John Spray
1881373ec4 Update CODEOWNERS (#5421)
It is usually not intended to notify a random member of the compute team
for pageserver PRs.

Leaving the notification of the storage team in place, because this
serves a purpose when some external contributor opens a PR and isn't
sure who to ask.
2023-09-28 17:34:51 +01:00
John Spray
ca3ca2bb9c pageserver: don't try and recover deletion queue if no remote storage (#5419)
## Problem

Because `neon_local` by default runs with no remote storage, it was not
running the deletion queue workers, and the attempt to call into
`recover()` was failing.

This is a bogus configuration that will go away when we make remote
storage mandatory.

## Summary of changes

Don't try and do deletion queue recovery when remote storage is
disabled.

The reason we don't just unset `control_plane_api` to avoid this is that
generations will soon become mandatory, irrespective of when we make
remote storage mandatory.
2023-09-28 17:20:34 +01:00
Em Sharnoff
b497d0094e file cache: Remove free space monitor (#5406)
This effectively reverts #3832.

There are a couple of issues we just discovered with the free space monitor,
and to my knowledge, the fact that we're putting the file cache on a separate
filesystem (even when on disk) that's guaranteed to have more room than
the maximum cache size means that the free space monitor should have no
effect.

More details:

1. The control plane sets the maximum file cache size based on max CU
2. The control plane sets the size of the filesystem underlying the file
cache based on the maximum user selectable CU (or, if the endpoint is
larger, then that size), so that there's always enough room
3. If postmaster gets SIGKILL'd, then the free space monitor process
does not exit
4. If the free space monitor is acting on the cache file but not subject
to locking or up-to-date metadata from a newer postgres instance, then
this could lead to data corruption.

So, in practice I believe the risk of data corruption is *low* but not
nothing; given the issues we hit because of (3), and given that
the free space monitor shouldn't be necessary because of (1) and (2),
it's best to just remove it outright.

See also: neondatabase/autoscaling#534, #5405
2023-09-28 06:47:44 -07:00
Conrad Ludgate
528fb1bd81 proxy: metrics2 (#5179)
## Problem

We need to count metrics whenever a connection is open, even when
the transfer is 0.

We also need to count bytes usage for HTTP.

## Summary of changes

New structure for usage metrics. A `DashMap<Ids, Arc<Counters>>`.

If the Arc has 1 owner (the map), then I can conclude that no connections
are open.
If the counters' "open_connections" is non-zero, then I can conclude a
new connection was opened in the last interval and should be reported
on.

Also, keep count of how many bytes are processed for HTTP and report that
here.
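
A hedged sketch of that bookkeeping, assuming the `dashmap` crate (all names are illustrative):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use dashmap::DashMap;

#[derive(Hash, PartialEq, Eq, Clone)]
struct Ids { endpoint: String, branch: String }

#[derive(Default)]
struct Counters {
    open_connections: AtomicU64,
    bytes_transferred: AtomicU64,
}

struct Metrics {
    per_endpoint: DashMap<Ids, Arc<Counters>>,
}

impl Metrics {
    fn report(&self) {
        self.per_endpoint.retain(|_ids, counters| {
            // Map holds the only Arc and no connection was seen this
            // interval -> nothing is open, the entry can be dropped.
            let still_open = Arc::strong_count(counters) > 1
                || counters.open_connections.load(Ordering::Relaxed) > 0;
            // ... emit usage events for `counters` here ...
            still_open
        });
    }
}
```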
2023-09-28 11:38:26 +01:00
Joonas Koivunen
af28362a47 tests: Default to LOCAL_FS for pageserver remote storage (#5402)
Part of #5172. Builds upon #5243, #5298. Includes the test changes:
- no more RemoteStorageKind.NOOP
- no more testing of pageserver without remote storage
- benchmarks now use LOCAL_FS as well

Support for running without RemoteStorage is still kept, but in practice
there are no tests for it, and there should not be any.

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-09-28 12:25:20 +03:00
John Spray
6b4bb91d0a docs/rfcs: add RFC for fast tenant migration/failover (#5029)
## Problem

Currently we don't have a way to migrate tenants from one pageserver to
another without risking a gap in availability.

## Summary of changes

This follows on from https://github.com/neondatabase/neon/pull/4919

Migrating tenants between pageservers is essential to operating a service
at scale, in several contexts:

1. Responding to a pageserver node failure by migrating tenants to other
pageservers
2. Balancing load and capacity across pageservers, for example when a
user expands their
   database and they need to migrate to a pageserver with more capacity.
3. Restarting pageservers for upgrades and maintenance

Currently, a tenant may be migrated by attaching to a new node,
re-configuring endpoints to use the new node, and then later detaching
from the old node. This is safe once [generation
numbers](025-generation-numbers.md) are implemented, but does not meet
our seamless/fast/efficient goals:

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-09-28 10:07:11 +01:00
Em Sharnoff
5fdc80db03 Bump vm-builder v0.17.11 -> v0.17.12 (#5407)
The only relevant change is neondatabase/autoscaling#534; refer there for
more details.
2023-09-28 09:52:39 +02:00
Em Sharnoff
48e85460fc vm-monitor: Unset memory.high on start + refactor cgroup handling (#5348)
## Problem

Over the past couple of days, we've had a couple of VMs hit issues with
postgres being throttled by memory.high, even after #5303 was
supposed to fix that. The tl;dr of those issues is that because
vm-monitor startup sets the file cache size first, before interacting
with the cgroup, cgroup throttling can mean we time out connecting to the
file cache and never reset the cgroup, even if memory has been upscaled
since then.

See e.g.:

- https://neondb.slack.com/archives/C03F5SM1N02/p1695218132208249
- https://neondb.slack.com/archives/C03F5SM1N02/p1695314613696659

## Summary of changes

This PR adds an additional step to vm-monitor startup, where we first
set the cgroup's memory.high value to 'max', removing the capacity for
throttling. This is preferable to just setting memory.high before the file
cache, because it's theoretically possible that the new value of
memory.high could still be less than the current memory usage, in which
case postgres could continue to be throttled without sufficient memory
events to relieve that.

Implementing this properly involved adding a method to our internal
cgroup interface, and it seemed like there was duplicated functionality
there, so this PR unifies that as well, making things a bit more
consistent.
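
A minimal sketch of the new startup step, assuming cgroup v2 (the cgroup path and function name are illustrative):

```rust
use std::io;

// Writing the literal string "max" clears the memory.high throttling
// threshold in cgroup v2, regardless of current usage.
fn unset_memory_high(cgroup: &str) -> io::Result<()> {
    std::fs::write(format!("/sys/fs/cgroup/{cgroup}/memory.high"), "max")
}
```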
2023-09-27 21:27:23 -07:00
Christian Schwarz
090a644392 metrics for resident & remote physical size without tenant/timeline dimension (#5389)
So that we can compute worst-case /storage size dashboard panel more
cheaply.
2023-09-27 13:18:05 +01:00
John Spray
2cced770da pageserver: add control_plane_api_token config (#5383)
## Problem

Control plane API calls in prod will need authentication.

## Summary of changes

`control_plane_api_token` config is loaded and set as HTTP
`Authorization` header.

Closes: https://github.com/neondatabase/neon/issues/5139
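
A minimal sketch of the described behaviour, assuming a `reqwest`-based client (the endpoint path is illustrative):

```rust
use reqwest::Client;

// Attach the configured token as a bearer Authorization header, if present.
async fn call_control_plane(
    client: &Client,
    base_url: &str,
    token: Option<&str>,
) -> Result<reqwest::Response, reqwest::Error> {
    let mut req = client.get(format!("{base_url}/re-attach"));
    if let Some(token) = token {
        req = req.bearer_auth(token); // -> "Authorization: Bearer <token>"
    }
    req.send().await
}
```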
2023-09-27 13:12:13 +01:00
MMeent
7038ce40ce Fix neon_zeroextend's WAL logging (#5387)
When you log more than a few blocks, you need to reserve the space in
advance. We didn't do that, so we got errors. Now we do that, and
shouldn't get errors.
2023-09-27 13:48:30 +02:00
Joonas Koivunen
ce45fd4cc7 test_pageserver_metric_collection: allowed synthetic size to be cancelled at shutdown (#5398)
There is [evidence] of these messages during shutdown. They can happen if we are
unlucky enough.

[evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/6323709725/index.html#suites/e557ea0d920cfebd45c1921296031273/4120269a64eed172
2023-09-27 12:00:49 +01:00
Joonas Koivunen
6cc8c31fd8 disk_usage_based_eviction: switch warmup to use full table scans (#5384)
Fixes #3978. `test_partial_evict_tenant` can fail multiple times, so even
though we retry it as flaky, it will still haunt us.

I was originally going to just relax the comparison, but ended up
replacing the warm-up with full table scans instead of `pgbench
--select-only`. This seems to help by producing the expected layer
accesses. There might be something off with how many layers pg16
produces compared to pg14 and pg15. Created #5392.
2023-09-27 10:00:21 +01:00
552 changed files with 78160 additions and 29774 deletions

View File

@@ -1,17 +1,3 @@
# The binaries are really slow, if you compile them in 'dev' mode with the defaults.
# Enable some optimizations even in 'dev' mode, to make tests faster. The basic
# optimizations enabled by "opt-level=1" don't affect debuggability too much.
#
# See https://www.reddit.com/r/rust/comments/gvrgca/this_is_a_neat_trick_for_getting_good_runtime/
#
[profile.dev.package."*"]
# Set the default for dependencies in Development mode.
opt-level = 3
[profile.dev]
# Turn on a small amount of optimization in Development mode.
opt-level = 1
[build]
# This is only present for local builds, as it will be overridden
# by the RUSTDOCFLAGS env var in CI.

View File

@@ -22,5 +22,11 @@ platforms = [
# "x86_64-pc-windows-msvc",
]
[final-excludes]
# vm_monitor benefits from the same Cargo.lock as the rest of our artifacts, but
# it is built primarly in separate repo neondatabase/autoscaling and thus is excluded
# from depending on workspace-hack because most of the dependencies are not used.
workspace-members = ["vm_monitor"]
# Write out exact versions rather than a semver range. (Defaults to false.)
# exact-versions = true

2
.config/nextest.toml Normal file
View File

@@ -0,0 +1,2 @@
[profile.default]
slow-timeout = { period = "20s", terminate-after = 3 }

View File

@@ -1,27 +1,27 @@
*
!rust-toolchain.toml
!Cargo.toml
# Files
!Cargo.lock
!Cargo.toml
!Makefile
!rust-toolchain.toml
!scripts/combine_control_files.py
!scripts/ninstall.sh
!vm-cgconfig.conf
# Directories
!.cargo/
!.config/
!control_plane/
!compute_tools/
!control_plane/
!libs/
!neon_local/
!pageserver/
!pgxn/
!proxy/
!safekeeper/
!s3_scrubber/
!safekeeper/
!storage_broker/
!trace/
!vendor/postgres-v14/
!vendor/postgres-v15/
!vendor/postgres-v16/
!vendor/postgres-*/
!workspace_hack/
!neon_local/
!scripts/ninstall.sh
!scripts/combine_control_files.py
!vm-cgconfig.conf

View File

@@ -17,8 +17,9 @@ assignees: ''
## Implementation ideas
## Tasks
- [ ]
```[tasklist]
### Tasks
```
## Other related tasks and Epics

View File

@@ -3,7 +3,7 @@
**NB: this PR must be merged only by 'Create a merge commit'!**
### Checklist when preparing for release
- [ ] Read or refresh [the release flow guide](https://github.com/neondatabase/cloud/wiki/Release:-general-flow)
- [ ] Read or refresh [the release flow guide](https://www.notion.so/neondatabase/Release-general-flow-61f2e39fd45d4d14a70c7749604bd70b)
- [ ] Ask in the [cloud Slack channel](https://neondb.slack.com/archives/C033A2WE6BZ) that you are going to rollout the release. Any blockers?
- [ ] Does this release contain any db migrations? Destructive ones? What is the rollback plan?

View File

@@ -1,8 +1,14 @@
self-hosted-runner:
labels:
- arm64
- dev
- gen3
- large
# Remove `macos-14` from the list after https://github.com/rhysd/actionlint/pull/392 is merged.
- macos-14
- small
- us-east-2
config-variables:
- REMOTE_STORAGE_AZURE_CONTAINER
- REMOTE_STORAGE_AZURE_REGION
- SLACK_UPCOMING_RELEASE_CHANNEL_ID

View File

@@ -76,8 +76,8 @@ runs:
rm -f ${ALLURE_ZIP}
fi
env:
ALLURE_VERSION: 2.23.1
ALLURE_ZIP_SHA256: 11141bfe727504b3fd80c0f9801eb317407fd0ac983ebb57e671f14bac4bcd86
ALLURE_VERSION: 2.24.0
ALLURE_ZIP_SHA256: 60b1d6ce65d9ef24b23cf9c2c19fd736a123487c38e54759f1ed1a7a77353c90
# Potentially we could have several running build for the same key (for example, for the main branch), so we use improvised lock for this
- name: Acquire lock
@@ -179,23 +179,6 @@ runs:
aws s3 rm "s3://${BUCKET}/${LOCK_FILE}"
fi
- name: Store Allure test stat in the DB
if: ${{ !cancelled() && inputs.store-test-results-into-db == 'true' }}
shell: bash -euxo pipefail {0}
env:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
REPORT_JSON_URL: ${{ steps.generate-report.outputs.report-json-url }}
run: |
export DATABASE_URL=${REGRESS_TEST_RESULT_CONNSTR}
./scripts/pysync
poetry run python3 scripts/ingest_regress_test_result.py \
--revision ${COMMIT_SHA} \
--reference ${GITHUB_REF} \
--build-type unified \
--ingest ${WORKDIR}/report/data/suites.json
- name: Store Allure test stat in the DB (new)
if: ${{ !cancelled() && inputs.store-test-results-into-db == 'true' }}
shell: bash -euxo pipefail {0}
@@ -203,6 +186,10 @@ runs:
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_S3_URL: ${{ steps.generate-report.outputs.base-s3-url }}
run: |
if [ ! -d "${WORKDIR}/report/data/test-cases" ]; then
exit 0
fi
export DATABASE_URL=${REGRESS_TEST_RESULT_CONNSTR_NEW}
./scripts/pysync

View File

@@ -44,6 +44,10 @@ inputs:
description: 'Postgres version to use for tests'
required: false
default: 'v14'
benchmark_durations:
description: 'benchmark durations JSON'
required: false
default: '{}'
runs:
using: "composite"
@@ -160,7 +164,7 @@ runs:
# We use pytest-split plugin to run benchmarks in parallel on different CI runners
if [ "${TEST_SELECTION}" = "test_runner/performance" ] && [ "${{ inputs.build_type }}" != "remote" ]; then
mkdir -p $TEST_OUTPUT
poetry run ./scripts/benchmark_durations.py "${TEST_RESULT_CONNSTR}" --days 10 --output "$TEST_OUTPUT/benchmark_durations.json"
echo '${{ inputs.benchmark_durations || '{}' }}' > $TEST_OUTPUT/benchmark_durations.json
EXTRA_PARAMS="--durations-path $TEST_OUTPUT/benchmark_durations.json $EXTRA_PARAMS"
fi

View File

@@ -11,7 +11,7 @@ on:
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
# │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
- cron: '0 3 * * *' # run once a day, timezone is utc
- cron: '0 3 * * *' # run once a day, timezone is utc
workflow_dispatch: # adds ability to run this manually
inputs:
@@ -23,6 +23,21 @@ on:
type: boolean
description: 'Publish perf report. If not set, the report will be published only for the main branch'
required: false
collect_olap_explain:
type: boolean
description: 'Collect EXPLAIN ANALYZE for OLAP queries. If not set, EXPLAIN ANALYZE will not be collected'
required: false
default: false
collect_pg_stat_statements:
type: boolean
description: 'Collect pg_stat_statements for OLAP queries. If not set, pg_stat_statements will not be collected'
required: false
default: false
run_AWS_RDS_AND_AURORA:
type: boolean
description: 'AWS-RDS and AWS-AURORA normally only run on Saturday. Set this to true to run them on every workflow_dispatch'
required: false
default: false
defaults:
run:
@@ -113,6 +128,8 @@ jobs:
# - neon-captest-reuse: Reusing existing project
# - rds-aurora: Aurora Postgres Serverless v2 with autoscaling from 0.5 to 2 ACUs
# - rds-postgres: RDS Postgres db.m5.large instance (2 vCPU, 8 GiB) with gp3 EBS storage
env:
RUN_AWS_RDS_AND_AURORA: ${{ github.event.inputs.run_AWS_RDS_AND_AURORA || 'false' }}
runs-on: ubuntu-latest
outputs:
pgbench-compare-matrix: ${{ steps.pgbench-compare-matrix.outputs.matrix }}
@@ -152,7 +169,7 @@ jobs:
]
}'
if [ "$(date +%A)" = "Saturday" ]; then
if [ "$(date +%A)" = "Saturday" ] || [ ${RUN_AWS_RDS_AND_AURORA} = "true" ]; then
matrix=$(echo "$matrix" | jq '.include += [{ "platform": "rds-postgres" },
{ "platform": "rds-aurora" }]')
fi
@@ -171,9 +188,9 @@ jobs:
]
}'
if [ "$(date +%A)" = "Saturday" ]; then
if [ "$(date +%A)" = "Saturday" ] || [ ${RUN_AWS_RDS_AND_AURORA} = "true" ]; then
matrix=$(echo "$matrix" | jq '.include += [{ "platform": "rds-postgres", "scale": "10" },
{ "platform": "rds-aurora", "scale": "10" }]')
{ "platform": "rds-aurora", "scale": "10" }]')
fi
echo "matrix=$(echo "$matrix" | jq --compact-output '.')" >> $GITHUB_OUTPUT
@@ -337,6 +354,8 @@ jobs:
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
DEFAULT_PG_VERSION: 14
TEST_OUTPUT: /tmp/test_output
TEST_OLAP_COLLECT_EXPLAIN: ${{ github.event.inputs.collect_olap_explain }}
TEST_OLAP_COLLECT_PG_STAT_STATEMENTS: ${{ github.event.inputs.collect_pg_stat_statements }}
BUILD_TYPE: remote
SAVE_PERF_REPORT: ${{ github.event.inputs.save_perf_report || ( github.ref_name == 'main' ) }}
PLATFORM: ${{ matrix.platform }}
@@ -399,6 +418,8 @@ jobs:
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
TEST_OLAP_COLLECT_EXPLAIN: ${{ github.event.inputs.collect_olap_explain || 'false' }}
TEST_OLAP_COLLECT_PG_STAT_STATEMENTS: ${{ github.event.inputs.collect_pg_stat_statements || 'false' }}
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
TEST_OLAP_SCALE: 10

View File

@@ -0,0 +1,124 @@
name: Build and Push Docker Image
on:
workflow_call:
inputs:
dockerfile-path:
required: true
type: string
image-name:
required: true
type: string
outputs:
build-tools-tag:
description: "tag generated for build tools"
value: ${{ jobs.tag.outputs.build-tools-tag }}
jobs:
check-if-build-tools-dockerfile-changed:
runs-on: ubuntu-latest
outputs:
docker_file_changed: ${{ steps.dockerfile.outputs.docker_file_changed }}
steps:
- name: Check if Dockerfile.buildtools has changed
id: dockerfile
run: |
if [[ "$GITHUB_EVENT_NAME" != "pull_request" ]]; then
echo "docker_file_changed=false" >> $GITHUB_OUTPUT
exit
fi
updated_files=$(gh pr --repo neondatabase/neon diff ${{ github.event.pull_request.number }} --name-only)
if [[ $updated_files == *"Dockerfile.buildtools"* ]]; then
echo "docker_file_changed=true" >> $GITHUB_OUTPUT
fi
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
tag:
runs-on: ubuntu-latest
needs: [ check-if-build-tools-dockerfile-changed ]
outputs:
build-tools-tag: ${{steps.buildtools-tag.outputs.image_tag}}
steps:
- name: Get buildtools tag
env:
DOCKERFILE_CHANGED: ${{ needs.check-if-build-tools-dockerfile-changed.outputs.docker_file_changed }}
run: |
if [[ "$GITHUB_EVENT_NAME" == "pull_request" ]] && [[ "${DOCKERFILE_CHANGED}" == "true" ]]; then
IMAGE_TAG=$GITHUB_RUN_ID
else
IMAGE_TAG=pinned
fi
echo "image_tag=${IMAGE_TAG}" >> $GITHUB_OUTPUT
shell: bash
id: buildtools-tag
kaniko:
if: needs.check-if-build-tools-dockerfile-changed.outputs.docker_file_changed == 'true'
needs: [ tag, check-if-build-tools-dockerfile-changed ]
runs-on: [ self-hosted, dev, x64 ]
container: gcr.io/kaniko-project/executor:v1.7.0-debug
steps:
- name: Checkout
uses: actions/checkout@v1
- name: Configure ECR login
run: echo "{\"credsStore\":\"ecr-login\"}" > /kaniko/.docker/config.json
- name: Kaniko build
run: |
/kaniko/executor \
--reproducible \
--snapshotMode=redo \
--skip-unused-stages \
--dockerfile ${{ inputs.dockerfile-path }} \
--cache=true \
--cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache \
--destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-amd64
kaniko-arm:
if: needs.check-if-build-tools-dockerfile-changed.outputs.docker_file_changed == 'true'
needs: [ tag, check-if-build-tools-dockerfile-changed ]
runs-on: [ self-hosted, dev, arm64 ]
container: gcr.io/kaniko-project/executor:v1.7.0-debug
steps:
- name: Checkout
uses: actions/checkout@v1
- name: Configure ECR login
run: echo "{\"credsStore\":\"ecr-login\"}" > /kaniko/.docker/config.json
- name: Kaniko build
run: |
/kaniko/executor \
--reproducible \
--snapshotMode=redo \
--skip-unused-stages \
--dockerfile ${{ inputs.dockerfile-path }} \
--cache=true \
--cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache \
--destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-arm64
manifest:
if: needs.check-if-build-tools-dockerfile-changed.outputs.docker_file_changed == 'true'
name: 'manifest'
runs-on: [ self-hosted, dev, x64 ]
needs:
- tag
- kaniko
- kaniko-arm
- check-if-build-tools-dockerfile-changed
steps:
- name: Create manifest
run: |
docker manifest create 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }} \
--amend 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-amd64 \
--amend 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}-arm64
- name: Push manifest
run: docker manifest push 369495373322.dkr.ecr.eu-central-1.amazonaws.com/${{ inputs.image-name }}:${{ needs.tag.outputs.build-tools-tag }}

View File

@@ -21,6 +21,8 @@ env:
COPT: '-Werror'
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_DEV }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY_DEV }}
# A concurrency group that we use for e2e-tests runs, matches `concurrency.group` above with `github.repository` as a prefix
E2E_CONCURRENCY_GROUP: ${{ github.repository }}-${{ github.workflow }}-${{ github.ref_name }}-${{ github.ref_name == 'main' && github.sha || 'anysha' }}
jobs:
check-permissions:
@@ -44,6 +46,19 @@ jobs:
exit 1
cancel-previous-e2e-tests:
needs: [ check-permissions ]
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
steps:
- name: Cancel previous e2e-tests runs for this PR
env:
GH_TOKEN: ${{ secrets.CI_ACCESS_TOKEN }}
run: |
gh workflow --repo neondatabase/cloud \
run cancel-previous-in-concurrency-group.yml \
--field concurrency_group="${{ env.E2E_CONCURRENCY_GROUP }}"
tag:
needs: [ check-permissions ]
@@ -74,11 +89,19 @@ jobs:
shell: bash
id: build-tag
check-codestyle-python:
build-buildtools-image:
needs: [ check-permissions ]
uses: ./.github/workflows/build_and_push_docker_image.yml
with:
dockerfile-path: Dockerfile.buildtools
image-name: build-tools
secrets: inherit
check-codestyle-python:
needs: [ check-permissions, build-buildtools-image ]
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
options: --init
steps:
@@ -98,20 +121,20 @@ jobs:
- name: Install Python deps
run: ./scripts/pysync
- name: Run ruff to ensure code format
run: poetry run ruff .
- name: Run `ruff check` to ensure code format
run: poetry run ruff check .
- name: Run black to ensure code format
run: poetry run black --diff --check .
- name: Run `ruff format` to ensure code format
run: poetry run ruff format --check .
- name: Run mypy to check types
run: poetry run mypy .
check-codestyle-rust:
needs: [ check-permissions ]
runs-on: [ self-hosted, gen3, large ]
needs: [ check-permissions, build-buildtools-image ]
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
options: --init
steps:
@@ -172,14 +195,18 @@ jobs:
# https://github.com/EmbarkStudios/cargo-deny
- name: Check rust licenses/bans/advisories/sources
if: ${{ !cancelled() }}
run: cargo deny check
run: cargo deny check --hide-inclusion-graph
build-neon:
needs: [ check-permissions ]
needs: [ check-permissions, tag, build-buildtools-image ]
runs-on: [ self-hosted, gen3, large ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
options: --init
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
# Raise locked memory limit for tokio-epoll-uring.
# On 5.10 LTS kernels < 5.10.162 (and generally mainline kernels < 5.12),
# io_uring will account the memory of the CQ and SQ as locked.
# More details: https://github.com/neondatabase/neon/issues/6373#issuecomment-1905814391
options: --init --shm-size=512mb --ulimit memlock=67108864:67108864
strategy:
fail-fast: false
matrix:
@@ -187,6 +214,7 @@ jobs:
env:
BUILD_TYPE: ${{ matrix.build_type }}
GIT_VERSION: ${{ github.event.pull_request.head.sha || github.sha }}
BUILD_TAG: ${{ needs.tag.outputs.build-tag }}
steps:
- name: Fix git ownership
@@ -198,6 +226,10 @@ jobs:
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- name: Checkout
uses: actions/checkout@v3
@@ -320,20 +352,37 @@ jobs:
- name: Build neon extensions
run: mold -run make neon-pg-ext -j$(nproc)
- name: Build walproposer-lib
run: mold -run make walproposer-lib -j$(nproc)
- name: Run cargo build
run: |
${cov_prefix} mold -run cargo build $CARGO_FLAGS $CARGO_FEATURES --bins --tests
- name: Run cargo test
- name: Run rust tests
env:
NEXTEST_RETRIES: 3
run: |
${cov_prefix} cargo test $CARGO_FLAGS $CARGO_FEATURES
for io_engine in std-fs tokio-epoll-uring ; do
NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOENGINE=$io_engine ${cov_prefix} cargo nextest run $CARGO_FLAGS $CARGO_FEATURES
done
# Run separate tests for real S3
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty
export REMOTE_STORAGE_S3_BUCKET=neon-github-public-dev
export REMOTE_STORAGE_S3_BUCKET=neon-github-ci-tests
export REMOTE_STORAGE_S3_REGION=eu-central-1
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
${cov_prefix} cargo test $CARGO_FLAGS --package remote_storage --test test_real_s3
${cov_prefix} cargo nextest run $CARGO_FLAGS -E 'package(remote_storage)' -E 'test(test_real_s3)'
# Run separate tests for real Azure Blob Storage
# XXX: replace region with `eu-central-1`-like region
export ENABLE_REAL_AZURE_REMOTE_STORAGE=y
export AZURE_STORAGE_ACCOUNT="${{ secrets.AZURE_STORAGE_ACCOUNT_DEV }}"
export AZURE_STORAGE_ACCESS_KEY="${{ secrets.AZURE_STORAGE_ACCESS_KEY_DEV }}"
export REMOTE_STORAGE_AZURE_CONTAINER="${{ vars.REMOTE_STORAGE_AZURE_CONTAINER }}"
export REMOTE_STORAGE_AZURE_REGION="${{ vars.REMOTE_STORAGE_AZURE_REGION }}"
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
${cov_prefix} cargo nextest run $CARGO_FLAGS -E 'package(remote_storage)' -E 'test(test_real_azure)'
- name: Install rust binaries
run: |
@@ -390,12 +439,12 @@ jobs:
uses: ./.github/actions/save-coverage-data
regress-tests:
needs: [ check-permissions, build-neon ]
needs: [ check-permissions, build-neon, build-buildtools-image, tag ]
runs-on: [ self-hosted, gen3, large ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
# Default shared memory is 64mb
options: --init --shm-size=512mb
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
# for changed limits, see comments on `options:` earlier in this file
options: --init --shm-size=512mb --ulimit memlock=67108864:67108864
strategy:
fail-fast: false
matrix:
@@ -420,25 +469,60 @@ jobs:
rerun_flaky: true
pg_version: ${{ matrix.pg_version }}
env:
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR }}
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
CHECK_ONDISK_DATA_COMPATIBILITY: nonempty
BUILD_TAG: ${{ needs.tag.outputs.build-tag }}
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: std-fs
- name: Merge and upload coverage data
if: matrix.build_type == 'debug' && matrix.pg_version == 'v14'
uses: ./.github/actions/save-coverage-data
benchmarks:
needs: [ check-permissions, build-neon ]
get-benchmarks-durations:
outputs:
json: ${{ steps.get-benchmark-durations.outputs.json }}
needs: [ check-permissions, build-buildtools-image ]
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
# Default shared memory is 64mb
options: --init --shm-size=512mb
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
options: --init
if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-benchmarks')
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Cache poetry deps
uses: actions/cache@v3
with:
path: ~/.cache/pypoetry/virtualenvs
key: v1-${{ runner.os }}-python-deps-${{ hashFiles('poetry.lock') }}
- name: Install Python deps
run: ./scripts/pysync
- name: get benchmark durations
id: get-benchmark-durations
env:
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
run: |
poetry run ./scripts/benchmark_durations.py "${TEST_RESULT_CONNSTR}" \
--days 10 \
--output /tmp/benchmark_durations.json
echo "json=$(jq --compact-output '.' /tmp/benchmark_durations.json)" >> $GITHUB_OUTPUT
benchmarks:
needs: [ check-permissions, build-neon, build-buildtools-image, get-benchmarks-durations ]
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
# for changed limits, see comments on `options:` earlier in this file
options: --init --shm-size=512mb --ulimit memlock=67108864:67108864
if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-benchmarks')
strategy:
fail-fast: false
matrix:
pytest_split_group: [ 1, 2, 3, 4 ]
# the amount of groups (N) should be reflected in `extra_params: --splits N ...`
pytest_split_group: [ 1, 2, 3, 4, 5 ]
build_type: [ release ]
steps:
- name: Checkout
@@ -451,21 +535,23 @@ jobs:
test_selection: performance
run_in_parallel: false
save_perf_report: ${{ github.ref_name == 'main' }}
extra_params: --splits ${{ strategy.job-total }} --group ${{ matrix.pytest_split_group }}
extra_params: --splits 5 --group ${{ matrix.pytest_split_group }}
benchmark_durations: ${{ needs.get-benchmarks-durations.outputs.json }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
TEST_RESULT_CONNSTR: "${{ secrets.REGRESS_TEST_RESULT_CONNSTR }}"
TEST_RESULT_CONNSTR: "${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}"
PAGESERVER_VIRTUAL_FILE_IO_ENGINE: std-fs
# XXX: no coverage data handling here, since benchmarks are run on release builds,
# while coverage is currently collected for the debug ones
create-test-report:
needs: [ check-permissions, regress-tests, coverage-report, benchmarks ]
needs: [ check-permissions, regress-tests, coverage-report, benchmarks, build-buildtools-image ]
if: ${{ !cancelled() && contains(fromJSON('["skipped", "success"]'), needs.check-permissions.result) }}
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
options: --init
steps:
@@ -478,7 +564,6 @@ jobs:
with:
store-test-results-into-db: true
env:
REGRESS_TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR }}
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
- uses: actions/github-script@v6
@@ -507,11 +592,10 @@ jobs:
})
coverage-report:
needs: [ check-permissions, regress-tests ]
needs: [ check-permissions, regress-tests, build-buildtools-image ]
runs-on: [ self-hosted, gen3, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:${{ needs.build-buildtools-image.outputs.build-tools-tag }}
options: --init
strategy:
fail-fast: false
@@ -557,25 +641,17 @@ jobs:
--input-objects=/tmp/coverage/binaries.list \
--format=lcov
- name: Upload coverage report
id: upload-coverage-report
env:
BUCKET: neon-github-public-dev
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
run: |
aws s3 cp --only-show-errors --recursive /tmp/coverage/report s3://${BUCKET}/code-coverage/${COMMIT_SHA}
REPORT_URL=https://${BUCKET}.s3.amazonaws.com/code-coverage/${COMMIT_SHA}/index.html
echo "report-url=${REPORT_URL}" >> $GITHUB_OUTPUT
- name: Build coverage report NEW
id: upload-coverage-report-new
env:
BUCKET: neon-github-public-dev
# A differential coverage report is available only for PRs.
# (i.e. for pushes into main/release branches we have a regular coverage report)
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_SHA: ${{ github.event.pull_request.base.sha || github.sha }}
run: |
BASELINE="$(git merge-base HEAD origin/main)"
CURRENT="${COMMIT_SHA}"
BASELINE="$(git merge-base $BASE_SHA $CURRENT)"
cp /tmp/coverage/report/lcov.info ./${CURRENT}.info
@@ -601,21 +677,11 @@ jobs:
- uses: actions/github-script@v6
env:
REPORT_URL: ${{ steps.upload-coverage-report.outputs.report-url }}
REPORT_URL_NEW: ${{ steps.upload-coverage-report-new.outputs.report-url }}
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
with:
script: |
const { REPORT_URL, REPORT_URL_NEW, COMMIT_SHA } = process.env
await github.rest.repos.createCommitStatus({
owner: context.repo.owner,
repo: context.repo.repo,
sha: `${COMMIT_SHA}`,
state: 'success',
target_url: `${REPORT_URL}`,
context: 'Code coverage report',
})
const { REPORT_URL_NEW, COMMIT_SHA } = process.env
await github.rest.repos.createCommitStatus({
owner: context.repo.owner,
@@ -667,12 +733,13 @@ jobs:
\"commit_hash\": \"$COMMIT_SHA\",
\"remote_repo\": \"${{ github.repository }}\",
\"storage_image_tag\": \"${{ needs.tag.outputs.build-tag }}\",
\"compute_image_tag\": \"${{ needs.tag.outputs.build-tag }}\"
\"compute_image_tag\": \"${{ needs.tag.outputs.build-tag }}\",
\"concurrency_group\": \"${{ env.E2E_CONCURRENCY_GROUP }}\"
}
}"
neon-image:
needs: [ check-permissions, tag ]
needs: [ check-permissions, build-buildtools-image, tag ]
runs-on: [ self-hosted, gen3, large ]
container: gcr.io/kaniko-project/executor:v1.9.2-debug
defaults:
@@ -710,6 +777,8 @@ jobs:
--cache-repo 369495373322.dkr.ecr.eu-central-1.amazonaws.com/cache
--context .
--build-arg GIT_VERSION=${{ github.event.pull_request.head.sha || github.sha }}
--build-arg BUILD_TAG=${{ needs.tag.outputs.build-tag }}
--build-arg TAG=${{ needs.build-buildtools-image.outputs.build-tools-tag }}
--build-arg REPOSITORY=369495373322.dkr.ecr.eu-central-1.amazonaws.com
--destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/neon:${{needs.tag.outputs.build-tag}}
--destination neondatabase/neon:${{needs.tag.outputs.build-tag}}
@@ -720,7 +789,7 @@ jobs:
compute-tools-image:
runs-on: [ self-hosted, gen3, large ]
needs: [ check-permissions, tag ]
needs: [ check-permissions, build-buildtools-image, tag ]
container: gcr.io/kaniko-project/executor:v1.9.2-debug
defaults:
run:
@@ -755,6 +824,7 @@ jobs:
--context .
--build-arg GIT_VERSION=${{ github.event.pull_request.head.sha || github.sha }}
--build-arg BUILD_TAG=${{needs.tag.outputs.build-tag}}
--build-arg TAG=${{needs.build-buildtools-image.outputs.build-tools-tag}}
--build-arg REPOSITORY=369495373322.dkr.ecr.eu-central-1.amazonaws.com
--dockerfile Dockerfile.compute-tools
--destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-tools:${{needs.tag.outputs.build-tag}}
@@ -765,7 +835,7 @@ jobs:
run: rm -rf ~/.ecr
compute-node-image:
needs: [ check-permissions, tag ]
needs: [ check-permissions, build-buildtools-image, tag ]
runs-on: [ self-hosted, gen3, large ]
container:
image: gcr.io/kaniko-project/executor:v1.9.2-debug
@@ -813,6 +883,7 @@ jobs:
--build-arg GIT_VERSION=${{ github.event.pull_request.head.sha || github.sha }}
--build-arg PG_VERSION=${{ matrix.version }}
--build-arg BUILD_TAG=${{needs.tag.outputs.build-tag}}
--build-arg TAG=${{needs.build-buildtools-image.outputs.build-tools-tag}}
--build-arg REPOSITORY=369495373322.dkr.ecr.eu-central-1.amazonaws.com
--dockerfile Dockerfile.compute-node
--destination 369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
@@ -834,7 +905,7 @@ jobs:
run:
shell: sh -eu {0}
env:
VM_BUILDER_VERSION: v0.17.11
VM_BUILDER_VERSION: v0.23.2
steps:
- name: Checkout
@@ -856,8 +927,7 @@ jobs:
- name: Build vm image
run: |
./vm-builder \
-enable-file-cache \
-cgroup-uid=postgres \
-spec=vm-image-spec.yaml \
-src=369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}} \
-dst=369495373322.dkr.ecr.eu-central-1.amazonaws.com/vm-compute-node-${{ matrix.version }}:${{needs.tag.outputs.build-tag}}
@@ -1079,6 +1149,10 @@ jobs:
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- name: Checkout
uses: actions/checkout@v3
@@ -1092,9 +1166,11 @@ jobs:
run: |
if [[ "$GITHUB_REF_NAME" == "main" ]]; then
gh workflow --repo neondatabase/aws run deploy-dev.yml --ref main -f branch=main -f dockerTag=${{needs.tag.outputs.build-tag}} -f deployPreprodRegion=false
elif [[ "$GITHUB_REF_NAME" == "release" ]]; then
# TODO: move deployPreprodRegion to release (`"$GITHUB_REF_NAME" == "release"` block), once Staging support different compute tag prefixes for different regions
gh workflow --repo neondatabase/aws run deploy-dev.yml --ref main -f branch=main -f dockerTag=${{needs.tag.outputs.build-tag}} -f deployPreprodRegion=true
gh workflow --repo neondatabase/aws run deploy-prod.yml --ref main -f branch=main -f dockerTag=${{needs.tag.outputs.build-tag}} -f disclamerAcknowledged=true
elif [[ "$GITHUB_REF_NAME" == "release" ]]; then
gh workflow --repo neondatabase/aws run deploy-prod.yml --ref main -f branch=main -f dockerTag=${{needs.tag.outputs.build-tag}}
else
echo "GITHUB_REF_NAME (value '$GITHUB_REF_NAME') is not set to either 'main' or 'release'"
exit 1

View File

@@ -21,9 +21,12 @@ env:
jobs:
check-macos-build:
if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos')
if: |
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
timeout-minutes: 90
runs-on: macos-latest
runs-on: macos-14
env:
# Use release build only, to have less debug info around
@@ -32,7 +35,7 @@ jobs:
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 1
@@ -57,21 +60,21 @@ jobs:
uses: actions/cache@v3
with:
path: pg_install/v14
key: v1-${{ runner.os }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v14_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v14_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v15 build
id: cache_pg_15
uses: actions/cache@v3
with:
path: pg_install/v15
key: v1-${{ runner.os }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v15_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v15_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v16 build
id: cache_pg_16
uses: actions/cache@v3
with:
path: pg_install/v16
key: v1-${{ runner.os }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v16_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v16_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Set extra env for macOS
run: |
@@ -86,31 +89,228 @@ jobs:
!~/.cargo/registry/src
~/.cargo/git
target
key: v1-${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }}-${{ hashFiles('./rust-toolchain.toml') }}-rust
key: v1-${{ runner.os }}-${{ runner.arch }}-cargo-${{ hashFiles('./Cargo.lock') }}-${{ hashFiles('./rust-toolchain.toml') }}-rust
- name: Build postgres v14
if: steps.cache_pg_14.outputs.cache-hit != 'true'
run: make postgres-v14 -j$(nproc)
run: make postgres-v14 -j$(sysctl -n hw.ncpu)
- name: Build postgres v15
if: steps.cache_pg_15.outputs.cache-hit != 'true'
run: make postgres-v15 -j$(nproc)
run: make postgres-v15 -j$(sysctl -n hw.ncpu)
- name: Build postgres v16
if: steps.cache_pg_16.outputs.cache-hit != 'true'
run: make postgres-v16 -j$(nproc)
run: make postgres-v16 -j$(sysctl -n hw.ncpu)
- name: Build neon extensions
run: make neon-pg-ext -j$(nproc)
run: make neon-pg-ext -j$(sysctl -n hw.ncpu)
- name: Build walproposer-lib
run: make walproposer-lib -j$(sysctl -n hw.ncpu)
- name: Run cargo build
run: cargo build --all --release
run: PQ_LIB_DIR=$(pwd)/pg_install/v16/lib cargo build --all --release
- name: Check that no warnings are produced
run: ./run_clippy.sh
check-linux-arm-build:
timeout-minutes: 90
runs-on: [ self-hosted, dev, arm64 ]
env:
# Use release build only, to have less debug info around
# Hence keeping target/ (and general cache size) smaller
BUILD_TYPE: release
CARGO_FEATURES: --features testing
CARGO_FLAGS: --release
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_DEV }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY_DEV }}
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools:pinned
options: --init
steps:
- name: Fix git ownership
run: |
# Workaround for `fatal: detected dubious ownership in repository at ...`
#
# Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
# Ref https://github.com/actions/checkout/issues/785
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- name: Checkout
uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 1
- name: Set pg 14 revision for caching
id: pg_v14_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v14) >> $GITHUB_OUTPUT
- name: Set pg 15 revision for caching
id: pg_v15_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v15) >> $GITHUB_OUTPUT
- name: Set pg 16 revision for caching
id: pg_v16_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v16) >> $GITHUB_OUTPUT
- name: Set env variables
run: |
echo "CARGO_HOME=${GITHUB_WORKSPACE}/.cargo" >> $GITHUB_ENV
- name: Cache postgres v14 build
id: cache_pg_14
uses: actions/cache@v3
with:
path: pg_install/v14
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v14_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v15 build
id: cache_pg_15
uses: actions/cache@v3
with:
path: pg_install/v15
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v15_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v16 build
id: cache_pg_16
uses: actions/cache@v3
with:
path: pg_install/v16
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ steps.pg_v16_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Build postgres v14
if: steps.cache_pg_14.outputs.cache-hit != 'true'
run: mold -run make postgres-v14 -j$(nproc)
- name: Build postgres v15
if: steps.cache_pg_15.outputs.cache-hit != 'true'
run: mold -run make postgres-v15 -j$(nproc)
- name: Build postgres v16
if: steps.cache_pg_16.outputs.cache-hit != 'true'
run: mold -run make postgres-v16 -j$(nproc)
- name: Build neon extensions
run: mold -run make neon-pg-ext -j$(nproc)
- name: Build walproposer-lib
run: mold -run make walproposer-lib -j$(nproc)
- name: Run cargo build
run: |
mold -run cargo build --locked $CARGO_FLAGS $CARGO_FEATURES --bins --tests
- name: Run cargo test
env:
NEXTEST_RETRIES: 3
run: |
cargo nextest run $CARGO_FEATURES
# Run separate tests for real S3
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty
export REMOTE_STORAGE_S3_BUCKET=neon-github-ci-tests
export REMOTE_STORAGE_S3_REGION=eu-central-1
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
cargo nextest run --package remote_storage --test test_real_s3
# Run separate tests for real Azure Blob Storage
# XXX: replace region with `eu-central-1`-like region
export ENABLE_REAL_AZURE_REMOTE_STORAGE=y
export AZURE_STORAGE_ACCOUNT="${{ secrets.AZURE_STORAGE_ACCOUNT_DEV }}"
export AZURE_STORAGE_ACCESS_KEY="${{ secrets.AZURE_STORAGE_ACCESS_KEY_DEV }}"
export REMOTE_STORAGE_AZURE_CONTAINER="${{ vars.REMOTE_STORAGE_AZURE_CONTAINER }}"
export REMOTE_STORAGE_AZURE_REGION="${{ vars.REMOTE_STORAGE_AZURE_REGION }}"
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
cargo nextest run --package remote_storage --test test_real_azure
check-codestyle-rust-arm:
timeout-minutes: 90
runs-on: [ self-hosted, dev, arm64 ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
options: --init
steps:
- name: Fix git ownership
run: |
# Workaround for `fatal: detected dubious ownership in repository at ...`
#
# Use both ${{ github.workspace }} and ${GITHUB_WORKSPACE} because they're different on host and in containers
# Ref https://github.com/actions/checkout/issues/785
#
git config --global --add safe.directory ${{ github.workspace }}
git config --global --add safe.directory ${GITHUB_WORKSPACE}
for r in 14 15 16; do
git config --global --add safe.directory "${{ github.workspace }}/vendor/postgres-v$r"
git config --global --add safe.directory "${GITHUB_WORKSPACE}/vendor/postgres-v$r"
done
- name: Checkout
uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 1
# Some of our rust modules use FFI and need those to be checked
- name: Get postgres headers
run: make postgres-headers -j$(nproc)
# cargo hack runs the given cargo subcommand (clippy in this case) for all feature combinations.
# This will catch compiler & clippy warnings in all feature combinations.
# TODO: use cargo hack for build and test as well, but, that's quite expensive.
# NB: keep clippy args in sync with ./run_clippy.sh
- run: |
CLIPPY_COMMON_ARGS="$( source .neon_clippy_args; echo "$CLIPPY_COMMON_ARGS")"
if [ "$CLIPPY_COMMON_ARGS" = "" ]; then
echo "No clippy args found in .neon_clippy_args"
exit 1
fi
echo "CLIPPY_COMMON_ARGS=${CLIPPY_COMMON_ARGS}" >> $GITHUB_ENV
- name: Run cargo clippy (debug)
run: cargo hack --feature-powerset clippy $CLIPPY_COMMON_ARGS
- name: Run cargo clippy (release)
run: cargo hack --feature-powerset clippy --release $CLIPPY_COMMON_ARGS
- name: Check documentation generation
run: cargo doc --workspace --no-deps --document-private-items
env:
RUSTDOCFLAGS: "-Dwarnings -Arustdoc::private_intra_doc_links"
# Use `${{ !cancelled() }}` to run quick tests after the longer clippy run
- name: Check formatting
if: ${{ !cancelled() }}
run: cargo fmt --all -- --check
# https://github.com/facebookincubator/cargo-guppy/tree/bec4e0eb29dcd1faac70b1b5360267fc02bf830e/tools/cargo-hakari#2-keep-the-workspace-hack-up-to-date-in-ci
- name: Check rust dependencies
if: ${{ !cancelled() }}
run: |
cargo hakari generate --diff # workspace-hack Cargo.toml is up-to-date
cargo hakari manage-deps --dry-run # all workspace crates depend on workspace-hack
# https://github.com/EmbarkStudios/cargo-deny
- name: Check rust licenses/bans/advisories/sources
if: ${{ !cancelled() }}
run: cargo deny check
gather-rust-build-stats:
if: github.ref_name == 'main' || contains(github.event.pull_request.labels.*.name, 'run-extra-build-stats')
if: |
contains(github.event.pull_request.labels.*.name, 'run-extra-build-stats') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
runs-on: [ self-hosted, gen3, large ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/rust:pinned
@@ -126,7 +326,7 @@ jobs:
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 1
@@ -135,6 +335,9 @@ jobs:
- name: Get postgres headers
run: make postgres-headers -j$(nproc)
- name: Build walproposer-lib
run: make walproposer-lib -j$(nproc)
- name: Produce the build stats
run: cargo build --all --release --timings


@@ -2,7 +2,7 @@ name: Create Release Branch
on:
schedule:
- cron: '0 7 * * 2'
- cron: '0 6 * * 1'
workflow_dispatch:
jobs:


@@ -0,0 +1,70 @@
name: 'Update build tools image tag'
# This workflow is used to update the tag of the build-tools image in ECR.
# The most common use case is adding/moving the `pinned` tag to the `${GITHUB_RUN_ID}` image.
on:
workflow_dispatch:
inputs:
from-tag:
description: 'Source tag'
required: true
type: string
to-tag:
description: 'Destination tag'
required: true
type: string
default: 'pinned'
defaults:
run:
shell: bash -euo pipefail {0}
permissions: {}
jobs:
tag-image:
runs-on: [ self-hosted, gen3, small ]
env:
ECR_IMAGE: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/build-tools
DOCKER_HUB_IMAGE: docker.io/neondatabase/build-tools
FROM_TAG: ${{ inputs.from-tag }}
TO_TAG: ${{ inputs.to-tag }}
steps:
# Use custom DOCKER_CONFIG directory to avoid conflicts with default settings
# The default value is ~/.docker
- name: Set custom docker config directory
run: |
mkdir -p .docker-custom
echo DOCKER_CONFIG=$(pwd)/.docker-custom >> $GITHUB_ENV
- uses: docker/login-action@v2
with:
username: ${{ secrets.NEON_DOCKERHUB_USERNAME }}
password: ${{ secrets.NEON_DOCKERHUB_PASSWORD }}
- uses: docker/login-action@v2
with:
registry: 369495373322.dkr.ecr.eu-central-1.amazonaws.com
username: ${{ secrets.AWS_ACCESS_KEY_DEV }}
password: ${{ secrets.AWS_SECRET_KEY_DEV }}
- uses: actions/setup-go@v5
with:
go-version: '1.21'
- name: Install crane
run: |
go install github.com/google/go-containerregistry/cmd/crane@a0658aa1d0cc7a7f1bcc4a3af9155335b6943f40 # v0.18.0
- name: Copy images
run: |
crane copy "${ECR_IMAGE}:${FROM_TAG}" "${ECR_IMAGE}:${TO_TAG}"
crane copy "${ECR_IMAGE}:${FROM_TAG}" "${DOCKER_HUB_IMAGE}:${TO_TAG}"
- name: Remove custom docker config directory
if: always()
run: |
rm -rf .docker-custom

.gitignore

@@ -6,6 +6,7 @@ __pycache__/
test_output/
.vscode
.idea
neon.iml
/.neon
/integration_tests/.neon
@@ -18,3 +19,6 @@ test_output/
*.o
*.so
*.Po
# pgindent typedef lists
*.list


@@ -5,7 +5,7 @@
/libs/remote_storage/ @neondatabase/storage
/libs/safekeeper_api/ @neondatabase/safekeepers
/libs/vm_monitor/ @neondatabase/autoscaling @neondatabase/compute
/pageserver/ @neondatabase/compute @neondatabase/storage
/pageserver/ @neondatabase/storage
/pgxn/ @neondatabase/compute
/proxy/ @neondatabase/proxy
/safekeeper/ @neondatabase/safekeepers


@@ -9,6 +9,24 @@ refactoring, additional comments, and so forth. Let's try to raise the
bar, and clean things up as we go. Try to leave code in a better shape
than it was before.
## Pre-commit hook
We have a sample pre-commit hook in `pre-commit.py`.
To set it up, run:
```bash
ln -s ../../pre-commit.py .git/hooks/pre-commit
```
This will run the following checks on staged files before each commit:
- `rustfmt`
- checks for Python files, see [obligatory checks](/docs/sourcetree.md#obligatory-checks).
There is also a separate script `./run_clippy.sh` that runs `cargo clippy` on the whole project
and `./scripts/reformat` that runs all formatting tools to ensure the project is up to date.
If you want to skip the hook, run `git commit` with the `--no-verify` option.
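For example, a one-off commit that bypasses the hook could look like this (purely illustrative; CI still runs equivalent checks):
```bash
git commit --no-verify -m "WIP: skip the pre-commit hook for this commit"
```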
## Submitting changes
1. Get at least one +1 on your PR before you push.
@@ -52,3 +70,17 @@ We're using the following approach to make it work:
- The label gets removed automatically, so to run CI again with new changes, the label should be added again (after the review)
For details see [`approved-for-ci-run.yml`](.github/workflows/approved-for-ci-run.yml)
## How do I add the "pinned" tag to an buildtools image?
We use the `pinned` tag for `Dockerfile.buildtools` build images in our CI/CD setup, currently adding the `pinned` tag is a manual operation.
You can call it from GitHub UI: https://github.com/neondatabase/neon/actions/workflows/update_build_tools_image.yml,
or using GitHub CLI:
```bash
gh workflow -R neondatabase/neon run update_build_tools_image.yml \
-f from-tag=6254913013 \
-f to-tag=pinned \
# Default `-f to-tag` is `pinned`, so the parameter can be omitted.
```
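To confirm that the dispatched run started and completed successfully, the GitHub CLI can list recent runs of the workflow (a convenience suggestion rather than part of the documented procedure):
```bash
gh run list -R neondatabase/neon --workflow=update_build_tools_image.yml --limit 5
```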

Cargo.lock (generated): diff suppressed because it is too large.


@@ -3,8 +3,11 @@ resolver = "2"
members = [
"compute_tools",
"control_plane",
"control_plane/attachment_service",
"pageserver",
"pageserver/ctl",
"pageserver/client",
"pageserver/pagebench",
"proxy",
"safekeeper",
"storage_broker",
@@ -26,6 +29,7 @@ members = [
"libs/tracing-utils",
"libs/postgres_ffi/wal_craft",
"libs/vm_monitor",
"libs/walproposer",
]
[workspace.package]
@@ -35,15 +39,21 @@ license = "Apache-2.0"
## All dependency versions, used in the project
[workspace.dependencies]
anyhow = { version = "1.0", features = ["backtrace"] }
async-compression = { version = "0.4.0", features = ["tokio", "gzip"] }
arc-swap = "1.6"
async-compression = { version = "0.4.0", features = ["tokio", "gzip", "zstd"] }
azure_core = "0.18"
azure_identity = "0.18"
azure_storage = "0.18"
azure_storage_blobs = "0.18"
flate2 = "1.0.26"
async-stream = "0.3"
async-trait = "0.1"
aws-config = { version = "0.56", default-features = false, features=["rustls"] }
aws-sdk-s3 = "0.29"
aws-smithy-http = "0.56"
aws-credential-types = "0.56"
aws-types = "0.56"
aws-config = { version = "1.1.4", default-features = false, features=["rustls"] }
aws-sdk-s3 = "1.14"
aws-sdk-secretsmanager = { version = "1.14.0" }
aws-smithy-async = { version = "1.1.4", default-features = false, features=["rt-tokio"] }
aws-smithy-types = "1.1.4"
aws-credential-types = "1.1.4"
axum = { version = "0.6.20", features = ["ws"] }
base64 = "0.13.0"
bincode = "1.3"
@@ -51,15 +61,15 @@ bindgen = "0.65"
bstr = "1.0"
byteorder = "1.4"
bytes = "1.0"
camino = "1.1.6"
cfg-if = "1.0.0"
chrono = { version = "0.4", default-features = false, features = ["clock"] }
clap = { version = "4.0", features = ["derive"] }
close_fds = "0.3.2"
comfy-table = "6.1"
const_format = "0.2"
crc32c = "0.6"
crossbeam-utils = "0.8.5"
dashmap = "5.5.0"
dashmap = { version = "5.5.0", features = ["raw-api"] }
either = "1.8"
enum-map = "2.4.2"
enumset = "1.0.12"
@@ -71,79 +81,96 @@ futures-util = "0.3"
git-version = "0.3"
hashbrown = "0.13"
hashlink = "0.8.1"
hdrhistogram = "7.5.2"
hex = "0.4"
hex-literal = "0.4"
hmac = "0.12.1"
hostname = "0.3.1"
http-types = { version = "2", default-features = false }
humantime = "2.1"
humantime-serde = "1.1.1"
hyper = "0.14"
hyper-tungstenite = "0.11"
inotify = "0.10.2"
ipnet = "2.9.0"
itertools = "0.10"
jsonwebtoken = "8"
jsonwebtoken = "9"
lasso = "0.7"
libc = "0.2"
md5 = "0.7.0"
memoffset = "0.8"
native-tls = "0.2"
nix = "0.26"
notify = "5.0.0"
nix = { version = "0.27", features = ["fs", "process", "socket", "signal", "poll"] }
notify = "6.0.0"
num_cpus = "1.15"
num-traits = "0.2.15"
once_cell = "1.13"
opentelemetry = "0.19.0"
opentelemetry-otlp = { version = "0.12.0", default_features=false, features = ["http-proto", "trace", "http", "reqwest-client"] }
opentelemetry-semantic-conventions = "0.11.0"
opentelemetry = "0.20.0"
opentelemetry-otlp = { version = "0.13.0", default_features=false, features = ["http-proto", "trace", "http", "reqwest-client"] }
opentelemetry-semantic-conventions = "0.12.0"
parking_lot = "0.12"
parquet = { version = "49.0.0", default-features = false, features = ["zstd"] }
parquet_derive = "49.0.0"
pbkdf2 = { version = "0.12.1", features = ["simple", "std"] }
pin-project-lite = "0.2"
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
prost = "0.11"
rand = "0.8"
regex = "1.4"
redis = { version = "0.24.0", features = ["tokio-rustls-comp", "keep-alive"] }
regex = "1.10.2"
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls"] }
reqwest-tracing = { version = "0.4.0", features = ["opentelemetry_0_19"] }
reqwest-tracing = { version = "0.4.7", features = ["opentelemetry_0_20"] }
reqwest-middleware = "0.2.0"
reqwest-retry = "0.2.2"
routerify = "3"
rpds = "0.13"
rustc-hash = "1.1.0"
rustls = "0.21"
rustls-pemfile = "1"
rustls-split = "0.3"
scopeguard = "1.1"
sysinfo = "0.29.2"
sd-notify = "0.4.1"
sentry = { version = "0.31", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_path_to_error = "0.1"
serde_with = "2.0"
serde_assert = "0.5.0"
sha2 = "0.10.2"
signal-hook = "0.3"
smallvec = "1.11"
smol_str = { version = "0.2.0", features = ["serde"] }
socket2 = "0.5"
strum = "0.24"
strum_macros = "0.24"
svg_fmt = "0.4.1"
sync_wrapper = "0.1.2"
tar = "0.4"
task-local-extensions = "0.1.4"
test-context = "0.1"
thiserror = "1.0"
tikv-jemallocator = "0.5"
tikv-jemalloc-ctl = "0.5"
tls-listener = { version = "0.7", features = ["rustls", "hyper-h1"] }
tokio = { version = "1.17", features = ["macros"] }
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.10.0"
tokio-rustls = "0.24"
tokio-stream = "0.1"
tokio-tar = "0.3"
tokio-util = { version = "0.7", features = ["io"] }
tokio-util = { version = "0.7.10", features = ["io", "rt"] }
toml = "0.7"
toml_edit = "0.19"
tonic = {version = "0.9", features = ["tls", "tls-roots"]}
tracing = "0.1"
tracing-error = "0.2.0"
tracing-opentelemetry = "0.19.0"
tracing-opentelemetry = "0.20.0"
tracing-subscriber = { version = "0.3", default_features = false, features = ["smallvec", "fmt", "tracing-log", "std", "env-filter", "json"] }
twox-hash = { version = "1.6.3", default-features = false }
url = "2.2"
uuid = { version = "1.2", features = ["v4", "serde"] }
uuid = { version = "1.6.1", features = ["v4", "v7", "serde"] }
walkdir = "2.3.2"
webpki-roots = "0.25"
x509-parser = "0.15"
@@ -153,11 +180,11 @@ env_logger = "0.10"
log = "0.4"
## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
postgres-native-tls = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
postgres-native-tls = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
postgres-types = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
## Other git libraries
heapless = { default-features=false, features=[], git = "https://github.com/japaric/heapless.git", rev = "644653bf3b831c6bb4963be2de24804acf5e5001" } # upstream release pending
@@ -167,6 +194,7 @@ compute_api = { version = "0.1", path = "./libs/compute_api/" }
consumption_metrics = { version = "0.1", path = "./libs/consumption_metrics/" }
metrics = { version = "0.1", path = "./libs/metrics/" }
pageserver_api = { version = "0.1", path = "./libs/pageserver_api/" }
pageserver_client = { path = "./pageserver/client" }
postgres_backend = { version = "0.1", path = "./libs/postgres_backend/" }
postgres_connection = { version = "0.1", path = "./libs/postgres_connection/" }
postgres_ffi = { version = "0.1", path = "./libs/postgres_ffi/" }
@@ -178,6 +206,7 @@ tenant_size_model = { version = "0.1", path = "./libs/tenant_size_model/" }
tracing-utils = { version = "0.1", path = "./libs/tracing-utils/" }
utils = { version = "0.1", path = "./libs/utils/" }
vm_monitor = { version = "0.1", path = "./libs/vm_monitor/" }
walproposer = { version = "0.1", path = "./libs/walproposer/" }
## Common library dependency
workspace_hack = { version = "0.1", path = "./workspace_hack/" }
@@ -186,14 +215,18 @@ workspace_hack = { version = "0.1", path = "./workspace_hack/" }
criterion = "0.5.1"
rcgen = "0.11"
rstest = "0.18"
tempfile = "3.4"
camino-tempfile = "1.0.2"
tonic-build = "0.9"
[patch.crates-io]
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", rev="9011f7110db12b5e15afaf98f8ac834501d50ddc" }
tokio-postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch="neon" }
# bug fixes for UUID
parquet = { git = "https://github.com/neondatabase/arrow-rs", branch = "neon-fix-bugs" }
parquet_derive = { git = "https://github.com/neondatabase/arrow-rs", branch = "neon-fix-bugs" }
################# Binary contents sections


@@ -3,7 +3,7 @@
### By default, the binaries inside the image have some mock parameters and can start, but are not intended to be used
### inside this image in the real deployments.
ARG REPOSITORY=neondatabase
ARG IMAGE=rust
ARG IMAGE=build-tools
ARG TAG=pinned
# Build Postgres
@@ -27,6 +27,7 @@ RUN set -e \
FROM $REPOSITORY/$IMAGE:$TAG AS build
WORKDIR /home/nonroot
ARG GIT_VERSION=local
ARG BUILD_TAG
# Enable https://github.com/paritytech/cachepot to cache Rust crates' compilation results in Docker builds.
# Set up cachepot to use an AWS S3 bucket for cache results, to reuse it between `docker build` invocations.
@@ -52,6 +53,7 @@ RUN set -e \
--bin pagectl \
--bin safekeeper \
--bin storage_broker \
--bin attachment_service \
--bin proxy \
--bin neon_local \
--locked --release \
@@ -78,9 +80,10 @@ COPY --from=build --chown=neon:neon /home/nonroot/target/release/pg_sni_router
COPY --from=build --chown=neon:neon /home/nonroot/target/release/pageserver /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/pagectl /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/safekeeper /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_broker /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_broker /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/attachment_service /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/proxy /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin
COPY --from=pg-build /home/nonroot/pg_install/v14 /usr/local/v14/
COPY --from=pg-build /home/nonroot/pg_install/v15 /usr/local/v15/

Dockerfile.buildtools (new file)

@@ -0,0 +1,166 @@
FROM debian:bullseye-slim
# Add nonroot user
RUN useradd -ms /bin/bash nonroot -b /home
SHELL ["/bin/bash", "-c"]
# System deps
RUN set -e \
&& apt update \
&& apt install -y \
autoconf \
automake \
bison \
build-essential \
ca-certificates \
cmake \
curl \
flex \
git \
gnupg \
gzip \
jq \
libcurl4-openssl-dev \
libbz2-dev \
libffi-dev \
liblzma-dev \
libncurses5-dev \
libncursesw5-dev \
libpq-dev \
libreadline-dev \
libseccomp-dev \
libsqlite3-dev \
libssl-dev \
libstdc++-10-dev \
libtool \
libxml2-dev \
libxmlsec1-dev \
libxxhash-dev \
lsof \
make \
netcat \
net-tools \
openssh-client \
parallel \
pkg-config \
unzip \
wget \
xz-utils \
zlib1g-dev \
zstd \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# protobuf-compiler (protoc)
ENV PROTOC_VERSION 25.1
RUN curl -fsSL "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOC_VERSION}/protoc-${PROTOC_VERSION}-linux-$(uname -m | sed 's/aarch64/aarch_64/g').zip" -o "protoc.zip" \
&& unzip -q protoc.zip -d protoc \
&& mv protoc/bin/protoc /usr/local/bin/protoc \
&& mv protoc/include/google /usr/local/include/google \
&& rm -rf protoc.zip protoc
# LLVM
ENV LLVM_VERSION=17
RUN curl -fsSL 'https://apt.llvm.org/llvm-snapshot.gpg.key' | apt-key add - \
&& echo "deb http://apt.llvm.org/bullseye/ llvm-toolchain-bullseye-${LLVM_VERSION} main" > /etc/apt/sources.list.d/llvm.stable.list \
&& apt update \
&& apt install -y clang-${LLVM_VERSION} llvm-${LLVM_VERSION} \
&& bash -c 'for f in /usr/bin/clang*-${LLVM_VERSION} /usr/bin/llvm*-${LLVM_VERSION}; do ln -s "${f}" "${f%-${LLVM_VERSION}}"; done' \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# PostgreSQL 14
RUN curl -fsSL 'https://www.postgresql.org/media/keys/ACCC4CF8.asc' | apt-key add - \
&& echo 'deb http://apt.postgresql.org/pub/repos/apt bullseye-pgdg main' > /etc/apt/sources.list.d/pgdg.list \
&& apt update \
&& apt install -y postgresql-client-14 \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip" \
&& unzip -q awscliv2.zip \
&& ./aws/install \
&& rm awscliv2.zip
# Mold: A Modern Linker
ENV MOLD_VERSION v2.4.0
RUN set -e \
&& git clone https://github.com/rui314/mold.git \
&& mkdir mold/build \
&& cd mold/build \
&& git checkout ${MOLD_VERSION} \
&& cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=clang++ .. \
&& cmake --build . -j $(nproc) \
&& cmake --install . \
&& cd .. \
&& rm -rf mold
# LCOV
# Build lcov from a fork:
# It includes several bug fixes on top of the v2.0 release (https://github.com/linux-test-project/lcov/compare/v2.0...master)
# And patches from us:
# - Generates json file with code coverage summary (https://github.com/neondatabase/lcov/commit/426e7e7a22f669da54278e9b55e6d8caabd00af0.tar.gz)
RUN for package in Capture::Tiny DateTime Devel::Cover Digest::MD5 File::Spec JSON::XS Memory::Process Time::HiRes JSON; do yes | perl -MCPAN -e "CPAN::Shell->notest('install', '$package')"; done \
&& wget https://github.com/neondatabase/lcov/archive/426e7e7a22f669da54278e9b55e6d8caabd00af0.tar.gz -O lcov.tar.gz \
&& echo "61a22a62e20908b8b9e27d890bd0ea31f567a7b9668065589266371dcbca0992 lcov.tar.gz" | sha256sum --check \
&& mkdir -p lcov && tar -xzf lcov.tar.gz -C lcov --strip-components=1 \
&& cd lcov \
&& make install \
&& rm -rf ../lcov.tar.gz
# Switch to nonroot user
USER nonroot:nonroot
WORKDIR /home/nonroot
# Python
ENV PYTHON_VERSION=3.9.2 \
PYENV_ROOT=/home/nonroot/.pyenv \
PATH=/home/nonroot/.pyenv/shims:/home/nonroot/.pyenv/bin:/home/nonroot/.poetry/bin:$PATH
RUN set -e \
&& cd $HOME \
&& curl -sSO https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer \
&& chmod +x pyenv-installer \
&& ./pyenv-installer \
&& export PYENV_ROOT=/home/nonroot/.pyenv \
&& export PATH="$PYENV_ROOT/bin:$PATH" \
&& export PATH="$PYENV_ROOT/shims:$PATH" \
&& pyenv install ${PYTHON_VERSION} \
&& pyenv global ${PYTHON_VERSION} \
&& python --version \
&& pip install --upgrade pip \
&& pip --version \
&& pip install pipenv wheel poetry
# Switch to nonroot user (again)
USER nonroot:nonroot
WORKDIR /home/nonroot
# Rust
# Please keep the version of llvm (installed above) in sync with rust llvm (`rustc --version --verbose | grep LLVM`)
ENV RUSTC_VERSION=1.75.0
ENV RUSTUP_HOME="/home/nonroot/.rustup"
ENV PATH="/home/nonroot/.cargo/bin:${PATH}"
RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && whoami && \
chmod +x rustup-init && \
./rustup-init -y --default-toolchain ${RUSTC_VERSION} && \
rm rustup-init && \
export PATH="$HOME/.cargo/bin:$PATH" && \
. "$HOME/.cargo/env" && \
cargo --version && rustup --version && \
rustup component add llvm-tools-preview rustfmt clippy && \
cargo install --git https://github.com/paritytech/cachepot && \
cargo install rustfilt && \
cargo install cargo-hakari && \
cargo install cargo-deny && \
cargo install cargo-hack && \
cargo install cargo-nextest && \
rm -rf /home/nonroot/.cargo/registry && \
rm -rf /home/nonroot/.cargo/git
ENV RUSTC_WRAPPER=cachepot
# Show versions
RUN whoami \
&& python --version \
&& pip --version \
&& cargo --version --verbose \
&& rustup --version --verbose \
&& rustc --version --verbose \
&& clang --version


@@ -1,6 +1,6 @@
ARG PG_VERSION
ARG REPOSITORY=neondatabase
ARG IMAGE=rust
ARG IMAGE=build-tools
ARG TAG=pinned
ARG BUILD_TAG
@@ -48,7 +48,29 @@ RUN cd postgres && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgrowlocks.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/pgstattuple.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/refint.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/xml2.control
echo 'trusted = true' >> /usr/local/pgsql/share/extension/xml2.control && \
# We need to grant EXECUTE on pg_stat_statements_reset() to neon_superuser.
# In vanilla postgres this function is limited to Postgres role superuser.
# In neon we have neon_superuser role that is not a superuser but replaces superuser in some cases.
# We could add the additional grant statements to the postgres repository, but they would be hard to maintain
# whenever we pick up a new postgres version, and we want to limit the changes in our postgres fork,
# so we do it here.
old_list="pg_stat_statements--1.0--1.1.sql pg_stat_statements--1.1--1.2.sql pg_stat_statements--1.2--1.3.sql pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.4--1.5.sql pg_stat_statements--1.4.sql pg_stat_statements--1.5--1.6.sql"; \
# the first loop is for pg_stat_statements extension versions <= 1.6
for file in /usr/local/pgsql/share/extension/pg_stat_statements--*.sql; do \
filename=$(basename "$file"); \
if echo "$old_list" | grep -q -F "$filename"; then \
echo 'GRANT EXECUTE ON FUNCTION pg_stat_statements_reset() TO neon_superuser;' >> $file; \
fi; \
done; \
# the second loop is for pg_stat_statements extension versions >= 1.7,
# where pg_stat_statements_reset() got 3 additional arguments
for file in /usr/local/pgsql/share/extension/pg_stat_statements--*.sql; do \
filename=$(basename "$file"); \
if ! echo "$old_list" | grep -q -F "$filename"; then \
echo 'GRANT EXECUTE ON FUNCTION pg_stat_statements_reset(Oid, Oid, bigint) TO neon_superuser;' >> $file; \
fi; \
done
#########################################################################################
#
@@ -121,29 +143,24 @@ RUN wget https://github.com/pgRouting/pgrouting/archive/v3.4.2.tar.gz -O pgrouti
#########################################################################################
FROM build-deps AS plv8-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN apt update && \
apt install -y ninja-build python3-dev libncurses5 binutils clang
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
export PLV8_VERSION=3.1.5 \
export PLV8_CHECKSUM=1e108d5df639e4c189e1c5bdfa2432a521c126ca89e7e5a969d46899ca7bf106 \
;; \
"v16") \
export PLV8_VERSION=3.1.8 \
export PLV8_CHECKSUM=92b10c7db39afdae97ff748c9ec54713826af222c459084ad002571b79eb3f49 \
;; \
*) \
echo "Export the valid PG_VERSION variable" && exit 1 \
;; \
esac && \
wget https://github.com/plv8/plv8/archive/refs/tags/v${PLV8_VERSION}.tar.gz -O plv8.tar.gz && \
echo "${PLV8_CHECKSUM} plv8.tar.gz" | sha256sum --check && \
RUN wget https://github.com/plv8/plv8/archive/refs/tags/v3.1.10.tar.gz -O plv8.tar.gz && \
echo "7096c3290928561f0d4901b7a52794295dc47f6303102fae3f8e42dd575ad97d plv8.tar.gz" | sha256sum --check && \
mkdir plv8-src && cd plv8-src && tar xvzf ../plv8.tar.gz --strip-components=1 -C . && \
# generate and copy upgrade scripts
mkdir -p upgrade && ./generate_upgrade.sh 3.1.10 && \
cp upgrade/* /usr/local/pgsql/share/extension/ && \
export PATH="/usr/local/pgsql/bin:$PATH" && \
make DOCKER=1 -j $(getconf _NPROCESSORS_ONLN) install && \
rm -rf /plv8-* && \
find /usr/local/pgsql/ -name "plv8-*.so" | xargs strip && \
# don't break computes with installed old version of plv8
cd /usr/local/pgsql/lib/ && \
ln -s plv8-3.1.10.so plv8-3.1.5.so && \
ln -s plv8-3.1.10.so plv8-3.1.8.so && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/plv8.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/plcoffee.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/plls.control
@@ -224,8 +241,8 @@ RUN wget https://github.com/df7cb/postgresql-unit/archive/refs/tags/7.7.tar.gz -
FROM build-deps AS vector-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.5.0.tar.gz -O pgvector.tar.gz && \
echo "d8aa3504b215467ca528525a6de12c3f85f9891b091ce0e5864dd8a9b757f77b pgvector.tar.gz" | sha256sum --check && \
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.5.1.tar.gz -O pgvector.tar.gz && \
echo "cc7a8e034a96e30a819911ac79d32f6bc47bdd1aa2de4d7d4904e26b83209dc8 pgvector.tar.gz" | sha256sum --check && \
mkdir pgvector-src && cd pgvector-src && tar xvzf ../pgvector.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config && \
@@ -368,8 +385,8 @@ RUN wget https://github.com/citusdata/postgresql-hll/archive/refs/tags/v2.18.tar
FROM build-deps AS plpgsql-check-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
RUN wget https://github.com/okbob/plpgsql_check/archive/refs/tags/v2.4.0.tar.gz -O plpgsql_check.tar.gz && \
echo "9ba58387a279b35a3bfa39ee611e5684e6cddb2ba046ddb2c5190b3bd2ca254a plpgsql_check.tar.gz" | sha256sum --check && \
RUN wget https://github.com/okbob/plpgsql_check/archive/refs/tags/v2.5.3.tar.gz -O plpgsql_check.tar.gz && \
echo "6631ec3e7fb3769eaaf56e3dfedb829aa761abf163d13dba354b4c218508e1c0 plpgsql_check.tar.gz" | sha256sum --check && \
mkdir plpgsql_check-src && cd plpgsql_check-src && tar xvzf ../plpgsql_check.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
make -j $(getconf _NPROCESSORS_ONLN) install PG_CONFIG=/usr/local/pgsql/bin/pg_config USE_PGXS=1 && \
@@ -393,7 +410,9 @@ RUN case "${PG_VERSION}" in \
export TIMESCALEDB_CHECKSUM=6fca72a6ed0f6d32d2b3523951ede73dc5f9b0077b38450a029a5f411fdb8c73 \
;; \
*) \
echo "TimescaleDB not supported on this PostgreSQL version. See https://github.com/timescale/timescaledb/issues/5752" && exit 0;; \
export TIMESCALEDB_VERSION=2.13.0 \
export TIMESCALEDB_CHECKSUM=584a351c7775f0e067eaa0e7277ea88cab9077cc4c455cbbf09a5d9723dce95d \
;; \
esac && \
apt-get update && \
apt-get install -y cmake && \
@@ -501,8 +520,7 @@ RUN apt-get update && \
libboost-regex1.74-dev \
libboost-serialization1.74-dev \
libboost-system1.74-dev \
libeigen3-dev \
libfreetype6-dev
libeigen3-dev
ENV PATH "/usr/local/pgsql/bin/:/usr/local/pgsql/:$PATH"
RUN wget https://github.com/rdkit/rdkit/archive/refs/tags/Release_2023_03_3.tar.gz -O rdkit.tar.gz && \
@@ -527,6 +545,8 @@ RUN wget https://github.com/rdkit/rdkit/archive/refs/tags/Release_2023_03_3.tar.
-D PostgreSQL_TYPE_INCLUDE_DIR=`pg_config --includedir-server` \
-D PostgreSQL_LIBRARY_DIR=`pg_config --libdir` \
-D RDK_INSTALL_INTREE=OFF \
-D RDK_INSTALL_COMIC_FONTS=OFF \
-D RDK_BUILD_FREETYPE_SUPPORT=OFF \
-D CMAKE_BUILD_TYPE=Release \
. && \
make -j $(getconf _NPROCESSORS_ONLN) && \
@@ -567,6 +587,23 @@ RUN wget https://github.com/ChenHuajun/pg_roaringbitmap/archive/refs/tags/v0.5.4
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/roaringbitmap.control
#########################################################################################
#
# Layer "pg-semver-pg-build"
# compile pg_semver extension
#
#########################################################################################
FROM build-deps AS pg-semver-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/theory/pg-semver/archive/refs/tags/v0.32.1.tar.gz -O pg_semver.tar.gz && \
echo "fbdaf7512026d62eec03fad8687c15ed509b6ba395bff140acd63d2e4fbe25d7 pg_semver.tar.gz" | sha256sum --check && \
mkdir pg_semver-src && cd pg_semver-src && tar xvzf ../pg_semver.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/semver.control
#########################################################################################
#
# Layer "pg-embedding-pg-build"
@@ -576,6 +613,7 @@ RUN wget https://github.com/ChenHuajun/pg_roaringbitmap/archive/refs/tags/v0.5.4
FROM build-deps AS pg-embedding-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ARG PG_VERSION
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
@@ -615,11 +653,7 @@ RUN wget https://gitlab.com/dalibo/postgresql_anonymizer/-/archive/1.1.0/postgre
#########################################################################################
#
# Layer "rust extensions"
# This layer is used to build `pgx` deps
#
# FIXME: This needs to be updated to latest version of 'pgrx' (it was renamed from
# 'pgx' to 'pgrx') for PostgreSQL 16. And that in turn requires bumping the pgx
# dependency on all the rust extension that depend on it, too.
# This layer is used to build `pgrx` deps
#
#########################################################################################
FROM build-deps AS rust-extensions-build
@@ -635,22 +669,12 @@ USER nonroot
WORKDIR /home/nonroot
ARG PG_VERSION
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "TODO: Not yet supported for PostgreSQL 16. Need to update pgrx dependencies" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version ${PG_VERSION}" && exit 1 \
;; \
esac && \
curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && \
RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux-gnu/rustup-init && \
chmod +x rustup-init && \
./rustup-init -y --no-modify-path --profile minimal --default-toolchain stable && \
rm rustup-init && \
cargo install --locked --version 0.7.3 cargo-pgx && \
/bin/bash -c 'cargo pgx init --pg${PG_VERSION:1}=/usr/local/pgsql/bin/pg_config'
cargo install --locked --version 0.10.2 cargo-pgrx && \
/bin/bash -c 'cargo pgrx init --pg${PG_VERSION:1}=/usr/local/pgsql/bin/pg_config'
USER root
@@ -664,23 +688,11 @@ USER root
FROM rust-extensions-build AS pg-jsonschema-pg-build
ARG PG_VERSION
# caeab60d70b2fd3ae421ec66466a3abbb37b7ee6 made on 06/03/2023
# there is no release tag yet, but we need it due to the superuser fix in the control file, switch to git tag after release >= 0.1.5
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "TODO: Not yet supported for PostgreSQL 16. Need to update pgrx dependencies" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version \"${PG_VERSION}\"" && exit 1 \
;; \
esac && \
wget https://github.com/supabase/pg_jsonschema/archive/caeab60d70b2fd3ae421ec66466a3abbb37b7ee6.tar.gz -O pg_jsonschema.tar.gz && \
echo "54129ce2e7ee7a585648dbb4cef6d73f795d94fe72f248ac01119992518469a4 pg_jsonschema.tar.gz" | sha256sum --check && \
RUN wget https://github.com/supabase/pg_jsonschema/archive/refs/tags/v0.2.0.tar.gz -O pg_jsonschema.tar.gz && \
echo "9118fc508a6e231e7a39acaa6f066fcd79af17a5db757b47d2eefbe14f7794f0 pg_jsonschema.tar.gz" | sha256sum --check && \
mkdir pg_jsonschema-src && cd pg_jsonschema-src && tar xvzf ../pg_jsonschema.tar.gz --strip-components=1 -C . && \
sed -i 's/pgx = "0.7.1"/pgx = { version = "0.7.3", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgx install --release && \
sed -i 's/pgrx = "0.10.2"/pgrx = { version = "0.10.2", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgrx install --release && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/pg_jsonschema.control
#########################################################################################
@@ -693,26 +705,11 @@ RUN case "${PG_VERSION}" in \
FROM rust-extensions-build AS pg-graphql-pg-build
ARG PG_VERSION
# b4988843647450a153439be367168ed09971af85 made on 22/02/2023 (from remove-pgx-contrib-spiext branch)
# Currently pgx version bump to >= 0.7.2 causes "call to unsafe function" compilation errors in
# pgx-contrib-spiext. There is a branch that removes that dependency, so use it. It is on the
# same 1.1 version we've used before.
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "TODO: Not yet supported for PostgreSQL 16. Need to update pgrx dependencies" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version" && exit 1 \
;; \
esac && \
wget https://github.com/yrashk/pg_graphql/archive/b4988843647450a153439be367168ed09971af85.tar.gz -O pg_graphql.tar.gz && \
echo "0c7b0e746441b2ec24187d0e03555faf935c2159e2839bddd14df6dafbc8c9bd pg_graphql.tar.gz" | sha256sum --check && \
RUN wget https://github.com/supabase/pg_graphql/archive/refs/tags/v1.4.0.tar.gz -O pg_graphql.tar.gz && \
echo "bd8dc7230282b3efa9ae5baf053a54151ed0e66881c7c53750e2d0c765776edc pg_graphql.tar.gz" | sha256sum --check && \
mkdir pg_graphql-src && cd pg_graphql-src && tar xvzf ../pg_graphql.tar.gz --strip-components=1 -C . && \
sed -i 's/pgx = "~0.7.1"/pgx = { version = "0.7.3", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
sed -i 's/pgx-tests = "~0.7.1"/pgx-tests = "0.7.3"/g' Cargo.toml && \
cargo pgx install --release && \
sed -i 's/pgrx = "=0.10.2"/pgrx = { version = "0.10.2", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgrx install --release && \
# it's needed to enable extension because it uses untrusted C language
sed -i 's/superuser = false/superuser = true/g' /usr/local/pgsql/share/extension/pg_graphql.control && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/pg_graphql.control
@@ -727,21 +724,11 @@ RUN case "${PG_VERSION}" in \
FROM rust-extensions-build AS pg-tiktoken-pg-build
ARG PG_VERSION
# 801f84f08c6881c8aa30f405fafbf00eec386a72 made on 10/03/2023
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "TODO: Not yet supported for PostgreSQL 16. Need to update pgrx dependencies" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version" && exit 1 \
;; \
esac && \
wget https://github.com/kelvich/pg_tiktoken/archive/801f84f08c6881c8aa30f405fafbf00eec386a72.tar.gz -O pg_tiktoken.tar.gz && \
echo "52f60ac800993a49aa8c609961842b611b6b1949717b69ce2ec9117117e16e4a pg_tiktoken.tar.gz" | sha256sum --check && \
# 26806147b17b60763039c6a6878884c41a262318 made on 26/09/2023
RUN wget https://github.com/kelvich/pg_tiktoken/archive/26806147b17b60763039c6a6878884c41a262318.tar.gz -O pg_tiktoken.tar.gz && \
echo "e64e55aaa38c259512d3e27c572da22c4637418cf124caba904cd50944e5004e pg_tiktoken.tar.gz" | sha256sum --check && \
mkdir pg_tiktoken-src && cd pg_tiktoken-src && tar xvzf ../pg_tiktoken.tar.gz --strip-components=1 -C . && \
cargo pgx install --release && \
cargo pgrx install --release && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/pg_tiktoken.control
#########################################################################################
@@ -754,23 +741,34 @@ RUN case "${PG_VERSION}" in \
FROM rust-extensions-build AS pg-pgx-ulid-build
ARG PG_VERSION
RUN case "${PG_VERSION}" in \
"v14" | "v15") \
;; \
"v16") \
echo "TODO: Not yet supported for PostgreSQL 16. Need to update pgrx dependencies" && exit 0 \
;; \
*) \
echo "unexpected PostgreSQL version" && exit 1 \
;; \
esac && \
wget https://github.com/pksunkara/pgx_ulid/archive/refs/tags/v0.1.0.tar.gz -O pgx_ulid.tar.gz && \
echo "908b7358e6f846e87db508ae5349fb56a88ee6305519074b12f3d5b0ff09f791 pgx_ulid.tar.gz" | sha256sum --check && \
RUN wget https://github.com/pksunkara/pgx_ulid/archive/refs/tags/v0.1.3.tar.gz -O pgx_ulid.tar.gz && \
echo "ee5db82945d2d9f2d15597a80cf32de9dca67b897f605beb830561705f12683c pgx_ulid.tar.gz" | sha256sum --check && \
mkdir pgx_ulid-src && cd pgx_ulid-src && tar xvzf ../pgx_ulid.tar.gz --strip-components=1 -C . && \
sed -i 's/pgx = "=0.7.3"/pgx = { version = "0.7.3", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgx install --release && \
echo "******************* Apply a patch for Postgres 16 support; delete in the next release ******************" && \
wget https://github.com/pksunkara/pgx_ulid/commit/f84954cf63fc8c80d964ac970d9eceed3c791196.patch && \
patch -p1 < f84954cf63fc8c80d964ac970d9eceed3c791196.patch && \
echo "********************************************************************************************************" && \
sed -i 's/pgrx = "=0.10.2"/pgrx = { version = "=0.10.2", features = [ "unsafe-postgres" ] }/g' Cargo.toml && \
cargo pgrx install --release && \
echo "trusted = true" >> /usr/local/pgsql/share/extension/ulid.control
#########################################################################################
#
# Layer "wal2json-build"
# Compile "wal2json" extension
#
#########################################################################################
FROM build-deps AS wal2json-pg-build
COPY --from=pg-build /usr/local/pgsql/ /usr/local/pgsql/
ENV PATH "/usr/local/pgsql/bin/:$PATH"
RUN wget https://github.com/eulerto/wal2json/archive/refs/tags/wal2json_2_5.tar.gz && \
echo "b516653575541cf221b99cf3f8be9b6821f6dbcfc125675c85f35090f824f00e wal2json_2_5.tar.gz" | sha256sum --check && \
mkdir wal2json-src && cd wal2json-src && tar xvzf ../wal2json_2_5.tar.gz --strip-components=1 -C . && \
make -j $(getconf _NPROCESSORS_ONLN) && \
make -j $(getconf _NPROCESSORS_ONLN) install
#########################################################################################
#
# Layer "neon-pg-ext-build"
@@ -778,6 +776,8 @@ RUN case "${PG_VERSION}" in \
#
#########################################################################################
FROM build-deps AS neon-pg-ext-build
ARG PG_VERSION
# Public extensions
COPY --from=postgis-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=postgis-build /sfcgal/* /
@@ -806,7 +806,9 @@ COPY --from=pg-pgx-ulid-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=rdkit-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-uuidv7-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-roaringbitmap-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-semver-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=pg-embedding-pg-build /usr/local/pgsql/ /usr/local/pgsql/
COPY --from=wal2json-pg-build /usr/local/pgsql /usr/local/pgsql
COPY pgxn/ pgxn/
RUN make -j $(getconf _NPROCESSORS_ONLN) \
@@ -880,8 +882,10 @@ FROM debian:bullseye-slim
RUN mkdir /var/db && useradd -m -d /var/db/postgres postgres && \
echo "postgres:test_console_pass" | chpasswd && \
mkdir /var/db/postgres/compute && mkdir /var/db/postgres/specs && \
mkdir /var/db/postgres/pgbouncer && \
chown -R postgres:postgres /var/db/postgres && \
chmod 0750 /var/db/postgres/compute && \
chmod 0750 /var/db/postgres/pgbouncer && \
echo '/usr/local/lib' >> /etc/ld.so.conf && /sbin/ldconfig && \
# create folder for file cache
mkdir -p -m 777 /neon/cache
@@ -897,7 +901,7 @@ COPY --from=compute-tools --chown=postgres /home/nonroot/target/release-line-deb
# libgeos, libgdal, libsfcgal1, libproj and libprotobuf-c1 for PostGIS
# libxml2, libxslt1.1 for xml2
# libzstd1 for zstd
# libboost*, libfreetype6, and zlib1g for rdkit
# libboost* for rdkit
# ca-certificates for communicating with s3 by compute_ctl
RUN apt update && \
apt install --no-install-recommends -y \
@@ -910,7 +914,6 @@ RUN apt update && \
libboost-serialization1.74.0 \
libboost-system1.74.0 \
libossp-uuid16 \
libfreetype6 \
libgeos-c1v5 \
libgdal28 \
libproj19 \
@@ -922,7 +925,6 @@ RUN apt update && \
libcurl4-openssl-dev \
locales \
procps \
zlib1g \
ca-certificates && \
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8


@@ -1,7 +1,7 @@
# First transient image to build compute_tools binaries
# NB: keep in sync with rust image version in .github/workflows/build_and_test.yml
ARG REPOSITORY=neondatabase
ARG IMAGE=rust
ARG IMAGE=build-tools
ARG TAG=pinned
ARG BUILD_TAG


@@ -51,6 +51,8 @@ CARGO_BUILD_FLAGS += $(filter -j1,$(MAKEFLAGS))
CARGO_CMD_PREFIX += $(if $(filter n,$(MAKEFLAGS)),,+)
# Force cargo not to print progress bar
CARGO_CMD_PREFIX += CARGO_TERM_PROGRESS_WHEN=never CI=1
# Set PQ_LIB_DIR to make sure `attachment_service` gets linked with the bundled libpq (through diesel)
CARGO_CMD_PREFIX += PQ_LIB_DIR=$(POSTGRES_INSTALL_DIR)/v16/lib
#
# Top level Makefile to build Neon and PostgreSQL
@@ -62,7 +64,7 @@ all: neon postgres neon-pg-ext
#
# The 'postgres_ffi' depends on the Postgres headers.
.PHONY: neon
neon: postgres-headers
neon: postgres-headers walproposer-lib
+@echo "Compiling Neon"
$(CARGO_CMD_PREFIX) cargo build $(CARGO_BUILD_FLAGS)
@@ -72,6 +74,10 @@ neon: postgres-headers
#
$(POSTGRES_INSTALL_DIR)/build/%/config.status:
+@echo "Configuring Postgres $* build"
@test -s $(ROOT_PROJECT_DIR)/vendor/postgres-$*/configure || { \
echo "\nPostgres submodule not found in $(ROOT_PROJECT_DIR)/vendor/postgres-$*/, execute "; \
echo "'git submodule update --init --recursive --depth 2 --progress .' in project root.\n"; \
exit 1; }
mkdir -p $(POSTGRES_INSTALL_DIR)/build/$*
(cd $(POSTGRES_INSTALL_DIR)/build/$* && \
env PATH="$(EXTRA_PATH_OVERRIDES):$$PATH" $(ROOT_PROJECT_DIR)/vendor/postgres-$*/configure \
@@ -168,6 +174,42 @@ neon-pg-ext-clean-%:
-C $(POSTGRES_INSTALL_DIR)/build/neon-utils-$* \
-f $(ROOT_PROJECT_DIR)/pgxn/neon_utils/Makefile clean
# Build walproposer as a static library. walproposer source code is located
# in the pgxn/neon directory.
#
# We also need to include libpgport.a and libpgcommon.a, because walproposer
# uses some functions from those libraries.
#
# Some object files are removed from libpgport.a and libpgcommon.a because
# they depend on openssl and other libraries that are not included in our
# Rust build.
.PHONY: walproposer-lib
walproposer-lib: neon-pg-ext-v16
+@echo "Compiling walproposer-lib"
mkdir -p $(POSTGRES_INSTALL_DIR)/build/walproposer-lib
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/v16/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
-C $(POSTGRES_INSTALL_DIR)/build/walproposer-lib \
-f $(ROOT_PROJECT_DIR)/pgxn/neon/Makefile walproposer-lib
cp $(POSTGRES_INSTALL_DIR)/v16/lib/libpgport.a $(POSTGRES_INSTALL_DIR)/build/walproposer-lib
cp $(POSTGRES_INSTALL_DIR)/v16/lib/libpgcommon.a $(POSTGRES_INSTALL_DIR)/build/walproposer-lib
ifeq ($(UNAME_S),Linux)
$(AR) d $(POSTGRES_INSTALL_DIR)/build/walproposer-lib/libpgport.a \
pg_strong_random.o
$(AR) d $(POSTGRES_INSTALL_DIR)/build/walproposer-lib/libpgcommon.a \
pg_crc32c.o \
hmac_openssl.o \
cryptohash_openssl.o \
scram-common.o \
md5_common.o \
checksum_helper.o
endif
.PHONY: walproposer-lib-clean
walproposer-lib-clean:
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/v16/bin/pg_config \
-C $(POSTGRES_INSTALL_DIR)/build/walproposer-lib \
-f $(ROOT_PROJECT_DIR)/pgxn/neon/Makefile clean
.PHONY: neon-pg-ext
neon-pg-ext: \
neon-pg-ext-v14 \
@@ -220,6 +262,44 @@ distclean:
fmt:
./pre-commit.py --fix-inplace
postgres-%-pg-bsd-indent: postgres-%
+@echo "Compiling pg_bsd_indent"
$(MAKE) -C $(POSTGRES_INSTALL_DIR)/build/$*/src/tools/pg_bsd_indent/
# Create typedef list for the core. Note that generally it should be combined with
# the buildfarm one to cover platform-specific stuff.
# https://wiki.postgresql.org/wiki/Running_pgindent_on_non-core_code_or_development_code
postgres-%-typedefs.list: postgres-%
$(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/find_typedef $(POSTGRES_INSTALL_DIR)/$*/bin > $@
# Indent postgres. See src/tools/pgindent/README for details.
.PHONY: postgres-%-pgindent
postgres-%-pgindent: postgres-%-pg-bsd-indent postgres-%-typedefs.list
+@echo merge with buildfarm typedef to cover all platforms
+@echo note: I first tried to download from pgbuildfarm.org, but for unclear reason e.g. \
REL_16_STABLE list misses PGSemaphoreData
# wget -q -O - "http://www.pgbuildfarm.org/cgi-bin/typedefs.pl?branch=REL_16_STABLE" |\
# cat - postgres-$*-typedefs.list | sort | uniq > postgres-$*-typedefs-full.list
cat $(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/pgindent/typedefs.list |\
cat - postgres-$*-typedefs.list | sort | uniq > postgres-$*-typedefs-full.list
+@echo note: you might want to run it on selected files/dirs instead.
INDENT=$(POSTGRES_INSTALL_DIR)/build/$*/src/tools/pg_bsd_indent/pg_bsd_indent \
$(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/pgindent/pgindent --typedefs postgres-$*-typedefs-full.list \
$(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/ \
--excludes $(ROOT_PROJECT_DIR)/vendor/postgres-$*/src/tools/pgindent/exclude_file_patterns
rm -f pg*.BAK
# Indent pgxn/neon.
.PHONY: pgindent
neon-pgindent: postgres-v16-pg-bsd-indent neon-pg-ext-v16
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/v16/bin/pg_config CFLAGS='$(PG_CFLAGS) $(COPT)' \
FIND_TYPEDEF=$(ROOT_PROJECT_DIR)/vendor/postgres-v16/src/tools/find_typedef \
INDENT=$(POSTGRES_INSTALL_DIR)/build/v16/src/tools/pg_bsd_indent/pg_bsd_indent \
PGINDENT_SCRIPT=$(ROOT_PROJECT_DIR)/vendor/postgres-v16/src/tools/pgindent/pgindent \
-C $(POSTGRES_INSTALL_DIR)/build/neon-v16 \
-f $(ROOT_PROJECT_DIR)/pgxn/neon/Makefile pgindent
.PHONY: setup-pre-commit-hook
setup-pre-commit-hook:
ln -s -f $(ROOT_PROJECT_DIR)/pre-commit.py .git/hooks/pre-commit

NOTICE

@@ -1,5 +1,5 @@
Neon
Copyright 2022 Neon Inc.
The PostgreSQL submodules in vendor/postgres-v14 and vendor/postgres-v15 are licensed under the
PostgreSQL license. See vendor/postgres-v14/COPYRIGHT and vendor/postgres-v15/COPYRIGHT.
The PostgreSQL submodules in vendor/ are licensed under the PostgreSQL license.
See vendor/postgres-vX/COPYRIGHT for details.


@@ -14,8 +14,8 @@ Alternatively, compile and run the project [locally](#running-local-installation
A Neon installation consists of compute nodes and the Neon storage engine. Compute nodes are stateless PostgreSQL nodes backed by the Neon storage engine.
The Neon storage engine consists of two major components:
- Pageserver. Scalable storage backend for the compute nodes.
- Safekeepers. The safekeepers form a redundant WAL service that received WAL from the compute node, and stores it durably until it has been processed by the pageserver and uploaded to cloud storage.
- Pageserver: Scalable storage backend for the compute nodes.
- Safekeepers: The safekeepers form a redundant WAL service that receives WAL from the compute node and stores it durably until it has been processed by the pageserver and uploaded to cloud storage.
See developer documentation in [SUMMARY.md](/docs/SUMMARY.md) for more information.
@@ -29,13 +29,14 @@ See developer documentation in [SUMMARY.md](/docs/SUMMARY.md) for more informati
```bash
apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
libssl-dev clang pkg-config libpq-dev cmake postgresql-client protobuf-compiler \
libcurl4-openssl-dev openssl python-poetry lsof libicu-dev
libcurl4-openssl-dev openssl python3-poetry lsof libicu-dev
```
* On Fedora, these packages are needed:
```bash
dnf install flex bison readline-devel zlib-devel openssl-devel \
libseccomp-devel perl clang cmake postgresql postgresql-contrib protobuf-compiler \
protobuf-devel libcurl-devel openssl poetry lsof libicu-devel
protobuf-devel libcurl-devel openssl poetry lsof libicu-devel libpq-devel python3-devel \
libffi-devel
```
* On Arch based systems, these packages are needed:
```bash
@@ -80,9 +81,9 @@ The project uses [rust toolchain file](./rust-toolchain.toml) to define the vers
This file is automatically picked up by [`rustup`](https://rust-lang.github.io/rustup/overrides.html#the-toolchain-file) that installs (if absent) and uses the toolchain version pinned in the file.
rustup users who want to build with another toolchain can use [`rustup override`](https://rust-lang.github.io/rustup/overrides.html#directory-overrides) command to set a specific toolchain for the project's directory.
rustup users who want to build with another toolchain can use the [`rustup override`](https://rust-lang.github.io/rustup/overrides.html#directory-overrides) command to set a specific toolchain for the project's directory.
non-rustup users most probably are not getting the same toolchain automatically from the file, so are responsible to manually verify their toolchain matches the version in the file.
non-rustup users most probably are not getting the same toolchain automatically from the file, so are responsible to manually verify that their toolchain matches the version in the file.
Newer rustc versions most probably will work fine, yet older ones might not be supported due to some new features used by the project or the crates.
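For example, a rustup user could temporarily pin a different toolchain for this checkout; the version below is only an illustration, not a project requirement:
```bash
rustup override set 1.76.0   # try a specific toolchain in this directory
rustup override unset        # return to the version pinned in rust-toolchain.toml
```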
#### Building on Linux
@@ -123,7 +124,7 @@ make -j`sysctl -n hw.logicalcpu` -s
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `pg_install/bin` and `pg_install/lib`, respectively.
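A minimal sketch of that environment tweak, assuming the `pg_install/bin` and `pg_install/lib` layout mentioned above (if your build produces versioned subdirectories such as `pg_install/v16/bin`, point at those instead):
```bash
export PATH="$(pwd)/pg_install/bin:$PATH"
export LD_LIBRARY_PATH="$(pwd)/pg_install/lib:$LD_LIBRARY_PATH"
```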
To run the integration tests or Python scripts (not required to use the code), install
Python (3.9 or higher), and install python3 packages using `./scripts/pysync` (requires [poetry>=1.3](https://python-poetry.org/)) in the project directory.
Python (3.9 or higher), and install the python3 packages using `./scripts/pysync` (requires [poetry>=1.3](https://python-poetry.org/)) in the project directory.
#### Running neon database
@@ -149,6 +150,9 @@ tenant 9ef87a5bf0d92544f6fafeeb3239695c successfully created on the pageserver
Created an initial timeline 'de200bd42b49cc1814412c7e592dd6e9' at Lsn 0/16B5A50 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c
Setting tenant 9ef87a5bf0d92544f6fafeeb3239695c as a default one
# create postgres compute node
> cargo neon endpoint create main
# start postgres compute node
> cargo neon endpoint start main
Starting new endpoint main (PostgreSQL v14) on timeline de200bd42b49cc1814412c7e592dd6e9 ...
@@ -162,7 +166,7 @@ Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55432/postgres'
2. Now, it is possible to connect to postgres and run some queries:
```text
> psql -p55432 -h 127.0.0.1 -U cloud_admin postgres
> psql -p 55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# CREATE TABLE t(key int primary key, value text);
CREATE TABLE
postgres=# insert into t values(1,1);
@@ -185,8 +189,11 @@ Created timeline 'b3b863fa45fa9e57e615f9f2d944e601' at Lsn 0/16F9A00 for tenant:
(L) main [de200bd42b49cc1814412c7e592dd6e9]
(L) ┗━ @0/16F9A00: migration_check [b3b863fa45fa9e57e615f9f2d944e601]
# create postgres on that branch
> cargo neon endpoint create migration_check --branch-name migration_check
# start postgres on that branch
> cargo neon endpoint start migration_check --branch-name migration_check
> cargo neon endpoint start migration_check
Starting new endpoint migration_check (PostgreSQL v14) on timeline b3b863fa45fa9e57e615f9f2d944e601 ...
Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55434/postgres'
@@ -198,7 +205,7 @@ Starting postgres at 'postgresql://cloud_admin@127.0.0.1:55434/postgres'
# this new postgres instance will have all the data from 'main' postgres,
# but all modifications would not affect data in original postgres
> psql -p55434 -h 127.0.0.1 -U cloud_admin postgres
> psql -p 55434 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
key | value
-----+-------
@@ -209,7 +216,7 @@ postgres=# insert into t values(2,2);
INSERT 0 1
# check that the new change doesn't affect the 'main' postgres
> psql -p55432 -h 127.0.0.1 -U cloud_admin postgres
> psql -p 55432 -h 127.0.0.1 -U cloud_admin postgres
postgres=# select * from t;
key | value
-----+-------
@@ -217,7 +224,7 @@ postgres=# select * from t;
(1 row)
```
4. If you want to run tests afterward (see below), you must stop all the running of the pageserver, safekeeper, and postgres instances
4. If you want to run tests afterwards (see below), you must stop all the running pageserver, safekeeper, and postgres instances
you have just started. You can terminate them all with one command:
```sh
> cargo neon stop
@@ -236,7 +243,7 @@ CARGO_BUILD_FLAGS="--features=testing" make
```
By default, this runs both debug and release modes, and all supported postgres versions. When
testing locally, it is convenient to run just run one set of permutations, like this:
testing locally, it is convenient to run just one set of permutations, like this:
```sh
DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest

View File

@@ -13,6 +13,7 @@ clap.workspace = true
flate2.workspace = true
futures.workspace = true
hyper = { workspace = true, features = ["full"] }
nix.workspace = true
notify.workspace = true
num_cpus.workspace = true
opentelemetry.workspace = true
@@ -20,6 +21,7 @@ postgres.workspace = true
regex.workspace = true
serde.workspace = true
serde_json.workspace = true
signal-hook.workspace = true
tar.workspace = true
reqwest = { workspace = true, features = ["json"] }
tokio = { workspace = true, features = ["rt", "rt-multi-thread"] }
@@ -37,4 +39,6 @@ workspace_hack.workspace = true
toml_edit.workspace = true
remote_storage = { version = "0.1", path = "../libs/remote_storage/" }
vm_monitor = { version = "0.1", path = "../libs/vm_monitor/" }
zstd = "0.12.4"
zstd = "0.13"
bytes = "1.0"
rust-ini = "0.20.0"

View File

@@ -31,27 +31,31 @@
//! -C 'postgresql://cloud_admin@localhost/postgres' \
//! -S /var/db/postgres/specs/current.json \
//! -b /usr/local/bin/postgres \
//! -r {"bucket": "neon-dev-extensions-eu-central-1", "region": "eu-central-1"}
//! -r http://pg-ext-s3-gateway \
//! ```
//!
use std::collections::HashMap;
use std::fs::File;
use std::path::Path;
use std::process::exit;
use std::sync::atomic::Ordering;
use std::sync::{mpsc, Arc, Condvar, Mutex, RwLock};
use std::{thread, time::Duration};
use anyhow::{Context, Result};
use chrono::Utc;
use clap::Arg;
use nix::sys::signal::{kill, Signal};
use signal_hook::consts::{SIGQUIT, SIGTERM};
use signal_hook::{consts::SIGINT, iterator::Signals};
use tracing::{error, info};
use url::Url;
use compute_api::responses::ComputeStatus;
use compute_tools::compute::{ComputeNode, ComputeState, ParsedSpec};
use compute_tools::compute::{ComputeNode, ComputeState, ParsedSpec, PG_PID, SYNC_SAFEKEEPERS_PID};
use compute_tools::configurator::launch_configurator;
use compute_tools::extension_server::{get_pg_version, init_remote_storage};
use compute_tools::extension_server::get_pg_version;
use compute_tools::http::api::launch_http_server;
use compute_tools::logger::*;
use compute_tools::monitor::launch_monitor;
@@ -60,11 +64,18 @@ use compute_tools::spec::*;
// this is an arbitrary build tag. Fine as a default / for testing purposes
// in case the BUILD_TAG environment variable is not set
const BUILD_TAG_DEFAULT: &str = "5670669815";
const BUILD_TAG_DEFAULT: &str = "latest";
fn main() -> Result<()> {
init_tracing_and_logging(DEFAULT_LOG_LEVEL)?;
let mut signals = Signals::new([SIGINT, SIGTERM, SIGQUIT])?;
thread::spawn(move || {
for sig in signals.forever() {
handle_exit_signal(sig);
}
});
let build_tag = option_env!("BUILD_TAG")
.unwrap_or(BUILD_TAG_DEFAULT)
.to_string();
@@ -74,10 +85,18 @@ fn main() -> Result<()> {
let pgbin_default = String::from("postgres");
let pgbin = matches.get_one::<String>("pgbin").unwrap_or(&pgbin_default);
let remote_ext_config = matches.get_one::<String>("remote-ext-config");
let ext_remote_storage = remote_ext_config.map(|x| {
init_remote_storage(x).expect("cannot initialize remote extension storage from config")
});
let ext_remote_storage = matches
.get_one::<String>("remote-ext-config")
// Compatibility hack: if the control plane specified any remote-ext-config
// use the default value for extension storage proxy gateway.
// Remove this once the control plane is updated to pass the gateway URL
.map(|conf| {
if conf.starts_with("http") {
conf.trim_end_matches('/')
} else {
"http://pg-ext-s3-gateway"
}
});
let http_port = *matches
.get_one::<u16>("http-port")
@@ -156,6 +175,7 @@ fn main() -> Result<()> {
let path = Path::new(sp);
let file = File::open(path)?;
spec = Some(serde_json::from_reader(file)?);
live_config_allowed = true;
} else if let Some(id) = compute_id {
if let Some(cp_base) = control_plane_uri {
live_config_allowed = true;
@@ -197,16 +217,16 @@ fn main() -> Result<()> {
live_config_allowed,
state: Mutex::new(new_state),
state_changed: Condvar::new(),
ext_remote_storage,
ext_remote_storage: ext_remote_storage.map(|s| s.to_string()),
ext_download_progress: RwLock::new(HashMap::new()),
build_tag,
};
let compute = Arc::new(compute_node);
// If this is a pooled VM, prewarm before starting HTTP server and becoming
// available for binding. Prewarming helps postgres start quicker later,
// available for binding. Prewarming helps Postgres start quicker later,
// because QEMU will already have its memory allocated from the host, and
// the necessary binaries will alreaady be cached.
// the necessary binaries will already be cached.
if !spec_set {
compute.prewarm_postgres()?;
}
@@ -249,6 +269,11 @@ fn main() -> Result<()> {
state.status = ComputeStatus::Init;
compute.state_changed.notify_all();
info!(
"running compute with features: {:?}",
state.pspec.as_ref().unwrap().spec.features
);
drop(state);
// Launch remaining service threads
@@ -261,11 +286,17 @@ fn main() -> Result<()> {
let pg = match compute.start_compute(extension_server_port) {
Ok(pg) => Some(pg),
Err(err) => {
error!("could not start the compute node: {:?}", err);
error!("could not start the compute node: {:#}", err);
let mut state = compute.state.lock().unwrap();
state.error = Some(format!("{:?}", err));
state.status = ComputeStatus::Failed;
drop(state);
// Notify others that Postgres failed to start. When configuring an
// empty compute, the API handler is likely still waiting for a compute
// state change. With this we notify it that the compute is in the Failed state,
// so the control plane learns about it earlier and records a proper error instead
// of a timeout.
compute.state_changed.notify_all();
drop(state); // unlock
delay_exit = true;
None
}
@@ -277,32 +308,26 @@ fn main() -> Result<()> {
if #[cfg(target_os = "linux")] {
use std::env;
use tokio_util::sync::CancellationToken;
use tracing::warn;
let vm_monitor_addr = matches.get_one::<String>("vm-monitor-addr");
let vm_monitor_addr = matches
.get_one::<String>("vm-monitor-addr")
.expect("--vm-monitor-addr should always be set because it has a default arg");
let file_cache_connstr = matches.get_one::<String>("filecache-connstr");
let cgroup = matches.get_one::<String>("cgroup");
let file_cache_on_disk = matches.get_flag("file-cache-on-disk");
// Only make a runtime if we need to.
// Note: it seems like you can make a runtime in an inner scope and
// if you start a task in it it won't be dropped. However, make it
// in the outermost scope just to be safe.
let rt = match (env::var_os("AUTOSCALING"), vm_monitor_addr) {
(None, None) => None,
(None, Some(_)) => {
warn!("--vm-monitor-addr option set but AUTOSCALING env var not present");
None
}
(Some(_), None) => {
panic!("AUTOSCALING env var present but --vm-monitor-addr option not set")
}
(Some(_), Some(_)) => Some(
let rt = if env::var_os("AUTOSCALING").is_some() {
Some(
tokio::runtime::Builder::new_multi_thread()
.worker_threads(4)
.enable_all()
.build()
.expect("failed to create tokio runtime for monitor"),
),
.expect("failed to create tokio runtime for monitor")
)
} else {
None
};
// This token is used internally by the monitor to clean up all threads
@@ -313,8 +338,7 @@ fn main() -> Result<()> {
Box::leak(Box::new(vm_monitor::Args {
cgroup: cgroup.cloned(),
pgconnstr: file_cache_connstr.cloned(),
addr: vm_monitor_addr.cloned().unwrap(),
file_cache_on_disk,
addr: vm_monitor_addr.clone(),
})),
token.clone(),
))
@@ -324,13 +348,20 @@ fn main() -> Result<()> {
// Wait for the child Postgres process forever. In this state Ctrl+C will
// propagate to Postgres and it will be shut down as well.
if let Some(mut pg) = pg {
if let Some((mut pg, logs_handle)) = pg {
// Startup is finished, exit the startup tracing span
drop(startup_context_guard);
let ecode = pg
.wait()
.expect("failed to start waiting on Postgres process");
PG_PID.store(0, Ordering::SeqCst);
// Process has exited, so we can join the logs thread.
let _ = logs_handle
.join()
.map_err(|e| tracing::error!("log thread panicked: {:?}", e));
info!("Postgres exited with code {}, shutting down", ecode);
exit_code = ecode.code()
}
@@ -485,11 +516,24 @@ fn cli() -> clap::Command {
)
.value_name("FILECACHE_CONNSTR"),
)
.arg(
Arg::new("file-cache-on-disk")
.long("file-cache-on-disk")
.action(clap::ArgAction::SetTrue),
)
}
/// When compute_ctl is killed, also send a termination signal to sync-safekeepers
/// to prevent leakage. TODO: it would be better to convert compute_ctl to async and
/// wait for termination, which would then be easy.
fn handle_exit_signal(sig: i32) {
info!("received {sig} termination signal");
let ss_pid = SYNC_SAFEKEEPERS_PID.load(Ordering::SeqCst);
if ss_pid != 0 {
let ss_pid = nix::unistd::Pid::from_raw(ss_pid as i32);
kill(ss_pid, Signal::SIGTERM).ok();
}
let pg_pid = PG_PID.load(Ordering::SeqCst);
if pg_pid != 0 {
let pg_pid = nix::unistd::Pid::from_raw(pg_pid as i32);
kill(pg_pid, Signal::SIGTERM).ok();
}
exit(1);
}
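
The TODO in the doc comment above points at an async rewrite. As a rough illustration only (not part of this change), a minimal sketch of that direction, assuming a tokio runtime built with the `signal` feature and a stand-in atomic instead of the real `PG_PID`/`SYNC_SAFEKEEPERS_PID` statics, might look like this:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

use nix::sys::signal::{kill, Signal};
use tokio::signal::unix::{signal, SignalKind};

// Hypothetical stand-in for the PG_PID / SYNC_SAFEKEEPERS_PID atomics from compute.rs.
static CHILD_PID: AtomicU32 = AtomicU32::new(0);

// Wait for SIGTERM asynchronously and forward it to the registered child process,
// instead of spawning a dedicated signal-hook iterator thread.
async fn forward_sigterm_to_child() {
    let mut term = signal(SignalKind::terminate()).expect("failed to install SIGTERM handler");
    term.recv().await;

    let pid = CHILD_PID.load(Ordering::SeqCst);
    if pid != 0 {
        // Best-effort: the child may have already exited.
        let _ = kill(nix::unistd::Pid::from_raw(pid as i32), Signal::SIGTERM);
    }
}
```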
#[test]

View File

@@ -6,7 +6,10 @@ use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::{Command, Stdio};
use std::str::FromStr;
use std::sync::atomic::AtomicU32;
use std::sync::atomic::Ordering;
use std::sync::{Condvar, Mutex, RwLock};
use std::thread;
use std::time::Instant;
use anyhow::{Context, Result};
@@ -17,22 +20,26 @@ use futures::StreamExt;
use postgres::{Client, NoTls};
use tokio;
use tokio_postgres;
use tracing::{error, info, instrument, warn};
use tracing::{debug, error, info, instrument, warn};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
use compute_api::responses::{ComputeMetrics, ComputeStatus};
use compute_api::spec::{ComputeMode, ComputeSpec};
use compute_api::spec::{ComputeFeature, ComputeMode, ComputeSpec};
use utils::measured_stream::MeasuredReader;
use remote_storage::{DownloadError, GenericRemoteStorage, RemotePath};
use remote_storage::{DownloadError, RemotePath};
use crate::checker::create_availability_check_data;
use crate::logger::inlinify;
use crate::pg_helpers::*;
use crate::spec::*;
use crate::sync_sk::{check_if_synced, ping_safekeeper};
use crate::{config, extension_server};
pub static SYNC_SAFEKEEPERS_PID: AtomicU32 = AtomicU32::new(0);
pub static PG_PID: AtomicU32 = AtomicU32::new(0);
/// Compute node info shared across several `compute_ctl` threads.
pub struct ComputeNode {
// Url type maintains proper escaping
@@ -59,8 +66,8 @@ pub struct ComputeNode {
pub state: Mutex<ComputeState>,
/// `Condvar` to allow notifying waiters about state changes.
pub state_changed: Condvar,
/// the S3 bucket that we search for extensions in
pub ext_remote_storage: Option<GenericRemoteStorage>,
/// the address of extension storage proxy gateway
pub ext_remote_storage: Option<String>,
// key: ext_archive_name, value: started download time, download_completed?
pub ext_download_progress: RwLock<HashMap<String, (DateTime<Utc>, bool)>>,
pub build_tag: String,
@@ -200,6 +207,7 @@ fn maybe_cgexec(cmd: &str) -> Command {
/// Create the special neon_superuser role, a slightly nerfed version of a real superuser,
/// that we give to customers
#[instrument(skip_all)]
fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let roles = spec
.cluster
@@ -252,7 +260,7 @@ fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()>
IF NOT EXISTS (
SELECT FROM pg_catalog.pg_roles WHERE rolname = 'neon_superuser')
THEN
CREATE ROLE neon_superuser CREATEDB CREATEROLE NOLOGIN IN ROLE pg_read_all_data, pg_write_all_data;
CREATE ROLE neon_superuser CREATEDB CREATEROLE NOLOGIN REPLICATION BYPASSRLS IN ROLE pg_read_all_data, pg_write_all_data;
IF array_length(roles, 1) IS NOT NULL THEN
EXECUTE format('GRANT neon_superuser TO %s',
array_to_string(ARRAY(SELECT quote_ident(x) FROM unnest(roles) as x), ', '));
@@ -269,7 +277,7 @@ fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()>
$$;"#,
roles_decl, database_decl,
);
info!("Neon superuser created:\n{}", &query);
info!("Neon superuser created: {}", inlinify(&query));
client
.simple_query(&query)
.map_err(|e| anyhow::anyhow!(e).context(query))?;
@@ -277,6 +285,17 @@ fn create_neon_superuser(spec: &ComputeSpec, client: &mut Client) -> Result<()>
}
impl ComputeNode {
/// Check that compute node has corresponding feature enabled.
pub fn has_feature(&self, feature: ComputeFeature) -> bool {
let state = self.state.lock().unwrap();
if let Some(s) = state.pspec.as_ref() {
s.spec.features.contains(&feature)
} else {
false
}
}
pub fn set_status(&self, status: ComputeStatus) {
let mut state = self.state.lock().unwrap();
state.status = status;
@@ -301,7 +320,7 @@ impl ComputeNode {
// Get basebackup from the libpq connection to pageserver using `connstr` and
// unarchive it to `pgdata` directory overriding all its previous content.
#[instrument(skip_all, fields(%lsn))]
fn get_basebackup(&self, compute_state: &ComputeState, lsn: Lsn) -> Result<()> {
fn try_get_basebackup(&self, compute_state: &ComputeState, lsn: Lsn) -> Result<()> {
let spec = compute_state.pspec.as_ref().expect("spec must be set");
let start_time = Instant::now();
@@ -372,6 +391,34 @@ impl ComputeNode {
Ok(())
}
// Gets the basebackup in a retry loop
#[instrument(skip_all, fields(%lsn))]
pub fn get_basebackup(&self, compute_state: &ComputeState, lsn: Lsn) -> Result<()> {
let mut retry_period_ms = 500;
let mut attempts = 0;
let max_attempts = 5;
loop {
let result = self.try_get_basebackup(compute_state, lsn);
match result {
Ok(_) => {
return result;
}
Err(ref e) if attempts < max_attempts => {
warn!(
"Failed to get basebackup: {} (attempt {}/{})",
e, attempts, max_attempts
);
std::thread::sleep(std::time::Duration::from_millis(retry_period_ms));
retry_period_ms *= 2;
}
Err(_) => {
return result;
}
}
attempts += 1;
}
}
pub async fn check_safekeepers_synced_async(
&self,
compute_state: &ComputeState,
@@ -474,7 +521,7 @@ impl ComputeNode {
pub fn sync_safekeepers(&self, storage_auth_token: Option<String>) -> Result<Lsn> {
let start_time = Utc::now();
let sync_handle = maybe_cgexec(&self.pgbin)
let mut sync_handle = maybe_cgexec(&self.pgbin)
.args(["--sync-safekeepers"])
.env("PGDATA", &self.pgdata) // we cannot use -D in this mode
.envs(if let Some(storage_auth_token) = &storage_auth_token {
@@ -483,15 +530,29 @@ impl ComputeNode {
vec![]
})
.stdout(Stdio::piped())
.stderr(Stdio::piped())
.spawn()
.expect("postgres --sync-safekeepers failed to start");
SYNC_SAFEKEEPERS_PID.store(sync_handle.id(), Ordering::SeqCst);
// `postgres --sync-safekeepers` will print all log output to stderr and
// final LSN to stdout. So we pipe only stdout, while stderr will be automatically
// redirected to the caller output.
// final LSN to stdout. So we leave stdout to collect LSN, while stderr logs
// will be collected in a child thread.
let stderr = sync_handle
.stderr
.take()
.expect("stderr should be captured");
let logs_handle = handle_postgres_logs(stderr);
let sync_output = sync_handle
.wait_with_output()
.expect("postgres --sync-safekeepers failed");
SYNC_SAFEKEEPERS_PID.store(0, Ordering::SeqCst);
// Process has exited, so we can join the logs thread.
let _ = logs_handle
.join()
.map_err(|e| tracing::error!("log thread panicked: {:?}", e));
if !sync_output.status.success() {
anyhow::bail!(
@@ -629,11 +690,12 @@ impl ComputeNode {
/// Start Postgres as a child process and manage DBs/roles.
/// After that this will hang waiting on the postmaster process to exit.
/// Returns a handle to the child process and a handle to the logs thread.
#[instrument(skip_all)]
pub fn start_postgres(
&self,
storage_auth_token: Option<String>,
) -> Result<std::process::Child> {
) -> Result<(std::process::Child, std::thread::JoinHandle<()>)> {
let pgdata_path = Path::new(&self.pgdata);
// Run postgres as a child process.
@@ -644,12 +706,18 @@ impl ComputeNode {
} else {
vec![]
})
.stderr(Stdio::piped())
.spawn()
.expect("cannot start postgres process");
PG_PID.store(pg.id(), Ordering::SeqCst);
// Start a thread to collect logs from stderr.
let stderr = pg.stderr.take().expect("stderr should be captured");
let logs_handle = handle_postgres_logs(stderr);
wait_for_postgres(&mut pg, pgdata_path)?;
Ok(pg)
Ok((pg, logs_handle))
}
/// Do initial configuration of the already started Postgres.
@@ -661,13 +729,14 @@ impl ComputeNode {
// In this case we need to connect with old `zenith_admin` name
// and create new user. We cannot simply rename connected user,
// but we can create a new one and grant it all privileges.
let mut client = match Client::connect(self.connstr.as_str(), NoTls) {
let connstr = self.connstr.clone();
let mut client = match Client::connect(connstr.as_str(), NoTls) {
Err(e) => {
info!(
"cannot connect to postgres: {}, retrying with `zenith_admin` username",
e
);
let mut zenith_admin_connstr = self.connstr.clone();
let mut zenith_admin_connstr = connstr.clone();
zenith_admin_connstr
.set_username("zenith_admin")
@@ -680,8 +749,8 @@ impl ComputeNode {
client.simple_query("GRANT zenith_admin TO cloud_admin")?;
drop(client);
// reconnect with connsting with expected name
Client::connect(self.connstr.as_str(), NoTls)?
// reconnect with connstring with expected name
Client::connect(connstr.as_str(), NoTls)?
}
Ok(client) => client,
};
@@ -692,16 +761,24 @@ impl ComputeNode {
// Proceed with post-startup configuration. Note, that order of operations is important.
let spec = &compute_state.pspec.as_ref().expect("spec must be set").spec;
create_neon_superuser(spec, &mut client)?;
cleanup_instance(&mut client)?;
handle_roles(spec, &mut client)?;
handle_databases(spec, &mut client)?;
handle_role_deletions(spec, self.connstr.as_str(), &mut client)?;
handle_grants(spec, self.connstr.as_str())?;
handle_role_deletions(spec, connstr.as_str(), &mut client)?;
handle_grants(spec, &mut client, connstr.as_str())?;
handle_extensions(spec, &mut client)?;
handle_extension_neon(&mut client)?;
create_availability_check_data(&mut client)?;
// 'Close' connection
drop(client);
if self.has_feature(ComputeFeature::Migrations) {
thread::spawn(move || {
let mut client = Client::connect(connstr.as_str(), NoTls)?;
handle_migrations(&mut client)
});
}
Ok(())
}
@@ -709,8 +786,12 @@ impl ComputeNode {
// `pg_ctl` for start / stop, so this just seems much easier to do as we already
// have opened connection to Postgres and superuser access.
#[instrument(skip_all)]
fn pg_reload_conf(&self, client: &mut Client) -> Result<()> {
client.simple_query("SELECT pg_reload_conf()")?;
fn pg_reload_conf(&self) -> Result<()> {
let pgctl_bin = Path::new(&self.pgbin).parent().unwrap().join("pg_ctl");
Command::new(pgctl_bin)
.args(["reload", "-D", &self.pgdata])
.output()
.expect("cannot run pg_ctl process");
Ok(())
}
@@ -720,27 +801,61 @@ impl ComputeNode {
pub fn reconfigure(&self) -> Result<()> {
let spec = self.state.lock().unwrap().pspec.clone().unwrap().spec;
if let Some(ref pgbouncer_settings) = spec.pgbouncer_settings {
info!("tuning pgbouncer");
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.expect("failed to create rt");
// Spawn a thread to do the tuning,
// so that we don't block the main thread that starts Postgres.
let pgbouncer_settings = pgbouncer_settings.clone();
let _handle = thread::spawn(move || {
let res = rt.block_on(tune_pgbouncer(pgbouncer_settings));
if let Err(err) = res {
error!("error while tuning pgbouncer: {err:?}");
}
});
}
// Write new config
let pgdata_path = Path::new(&self.pgdata);
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), &spec, None)?;
let postgresql_conf_path = pgdata_path.join("postgresql.conf");
config::write_postgres_conf(&postgresql_conf_path, &spec, None)?;
// temporarily reset max_cluster_size in config
// to avoid the possibility of hitting the limit, while we are reconfiguring:
// creating new extensions, roles, etc...
config::compute_ctl_temp_override_create(pgdata_path, "neon.max_cluster_size=-1")?;
self.pg_reload_conf()?;
let mut client = Client::connect(self.connstr.as_str(), NoTls)?;
self.pg_reload_conf(&mut client)?;
// Proceed with post-startup configuration. Note, that order of operations is important.
// Disable DDL forwarding because control plane already knows about these roles/databases.
if spec.mode == ComputeMode::Primary {
client.simple_query("SET neon.forward_ddl = false")?;
cleanup_instance(&mut client)?;
handle_roles(&spec, &mut client)?;
handle_databases(&spec, &mut client)?;
handle_role_deletions(&spec, self.connstr.as_str(), &mut client)?;
handle_grants(&spec, self.connstr.as_str())?;
handle_grants(&spec, &mut client, self.connstr.as_str())?;
handle_extensions(&spec, &mut client)?;
handle_extension_neon(&mut client)?;
// We can skip handle_migrations here because a new migration can only appear
// if we have a new version of the compute_ctl binary, which can only happen
// if compute got restarted, in which case we'll end up inside of apply_config
// instead of reconfigure.
}
// 'Close' connection
drop(client);
// reset max_cluster_size in config back to original value and reload config
config::compute_ctl_temp_override_remove(pgdata_path)?;
self.pg_reload_conf()?;
let unknown_op = "unknown".to_string();
let op_id = spec.operation_uuid.as_ref().unwrap_or(&unknown_op);
info!(
@@ -752,7 +867,10 @@ impl ComputeNode {
}
#[instrument(skip_all)]
pub fn start_compute(&self, extension_server_port: u16) -> Result<std::process::Child> {
pub fn start_compute(
&self,
extension_server_port: u16,
) -> Result<(std::process::Child, std::thread::JoinHandle<()>)> {
let compute_state = self.state.lock().unwrap().clone();
let pspec = compute_state.pspec.as_ref().expect("spec must be set");
info!(
@@ -763,6 +881,26 @@ impl ComputeNode {
pspec.timeline_id,
);
// tune pgbouncer
if let Some(pgbouncer_settings) = &pspec.spec.pgbouncer_settings {
info!("tuning pgbouncer");
let rt = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.expect("failed to create rt");
// Spawn a thread to do the tuning,
// so that we don't block the main thread that starts Postgres.
let pgbouncer_settings = pgbouncer_settings.clone();
let _handle = thread::spawn(move || {
let res = rt.block_on(tune_pgbouncer(pgbouncer_settings));
if let Err(err) = res {
error!("error while tuning pgbouncer: {err:?}");
}
});
}
info!(
"start_compute spec.remote_extensions {:?}",
pspec.spec.remote_extensions
@@ -797,11 +935,21 @@ impl ComputeNode {
self.prepare_pgdata(&compute_state, extension_server_port)?;
let start_time = Utc::now();
let pg = self.start_postgres(pspec.storage_auth_token.clone())?;
let pg_process = self.start_postgres(pspec.storage_auth_token.clone())?;
let config_time = Utc::now();
if pspec.spec.mode == ComputeMode::Primary && !pspec.spec.skip_pg_catalog_updates {
let pgdata_path = Path::new(&self.pgdata);
// temporarily reset max_cluster_size in config
// to avoid the possibility of hitting the limit, while we are applying config:
// creating new extensions, roles, etc...
config::compute_ctl_temp_override_create(pgdata_path, "neon.max_cluster_size=-1")?;
self.pg_reload_conf()?;
self.apply_config(&compute_state)?;
config::compute_ctl_temp_override_remove(pgdata_path)?;
self.pg_reload_conf()?;
}
let startup_end_time = Utc::now();
@@ -837,7 +985,17 @@ impl ComputeNode {
};
info!(?metrics, "compute start finished");
Ok(pg)
Ok(pg_process)
}
/// Update the `last_active` in the shared state, but ensure that it's a more recent one.
pub fn update_last_active(&self, last_active: Option<DateTime<Utc>>) {
let mut state = self.state.lock().unwrap();
// NB: `Some(<DateTime>)` is always greater than `None`.
if last_active > state.last_active {
state.last_active = last_active;
debug!("set the last compute activity time to: {:?}", last_active);
}
}
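
As a quick check of the `NB` above (an illustration, not part of the change): `Option`'s derived ordering puts `None` before any `Some`, so the first real activity timestamp always replaces "never active".

```rust
use chrono::{DateTime, Utc};

fn main() {
    let never_active: Option<DateTime<Utc>> = None;
    let just_now = Some(Utc::now());

    // None < Some(_) for Option's derived Ord, which is exactly what the
    // comparison in update_last_active() relies on.
    assert!(just_now > never_active);
}
```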
// Look for core dumps and collect backtraces.
@@ -949,12 +1107,12 @@ LIMIT 100",
real_ext_name: String,
ext_path: RemotePath,
) -> Result<u64, DownloadError> {
let remote_storage = self
.ext_remote_storage
.as_ref()
.ok_or(DownloadError::BadInput(anyhow::anyhow!(
"Remote extensions storage is not configured",
)))?;
let ext_remote_storage =
self.ext_remote_storage
.as_ref()
.ok_or(DownloadError::BadInput(anyhow::anyhow!(
"Remote extensions storage is not configured",
)))?;
let ext_archive_name = ext_path.object_name().expect("bad path");
@@ -1010,7 +1168,7 @@ LIMIT 100",
let download_size = extension_server::download_extension(
&real_ext_name,
&ext_path,
remote_storage,
ext_remote_storage,
&self.pgbin,
)
.await
@@ -1039,7 +1197,7 @@ LIMIT 100",
let remote_extensions = spec
.remote_extensions
.as_ref()
.ok_or(anyhow::anyhow!("Remote extensions are not configured",))?;
.ok_or(anyhow::anyhow!("Remote extensions are not configured"))?;
info!("parse shared_preload_libraries from spec.cluster.settings");
let mut libs_vec = Vec::new();

View File

@@ -93,5 +93,25 @@ pub fn write_postgres_conf(
writeln!(file, "neon.extension_server_port={}", port)?;
}
// It is essential to keep this line at the end of the file,
// because it is intended to override any settings above.
writeln!(file, "include_if_exists = 'compute_ctl_temp_override.conf'")?;
Ok(())
}
/// create file compute_ctl_temp_override.conf in pgdata_dir
/// add provided options to this file
pub fn compute_ctl_temp_override_create(pgdata_path: &Path, options: &str) -> Result<()> {
let path = pgdata_path.join("compute_ctl_temp_override.conf");
let mut file = File::create(path)?;
write!(file, "{}", options)?;
Ok(())
}
/// remove file compute_ctl_temp_override.conf in pgdata_dir
pub fn compute_ctl_temp_override_remove(pgdata_path: &Path) -> Result<()> {
let path = pgdata_path.join("compute_ctl_temp_override.conf");
std::fs::remove_file(path)?;
Ok(())
}
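
As a usage note (a sketch only, with hypothetical stand-ins for the real helpers; the actual pairing lives in `start_compute`/`reconfigure` above): the override file is created before catalog-heavy work and removed afterwards, and because `include_if_exists` is the last line of `postgresql.conf`, the temporary setting wins only while the file exists.

```rust
use std::path::Path;

use anyhow::Result;

// Hypothetical stand-ins for config::compute_ctl_temp_override_{create,remove}
// and the pg_ctl-based reload; the real implementations are shown above.
fn compute_ctl_temp_override_create(_pgdata: &Path, _options: &str) -> Result<()> { Ok(()) }
fn compute_ctl_temp_override_remove(_pgdata: &Path) -> Result<()> { Ok(()) }
fn pg_reload_conf() -> Result<()> { Ok(()) }
fn apply_catalog_changes() -> Result<()> { Ok(()) }

fn main() -> Result<()> {
    let pgdata = Path::new("/var/db/postgres/pgdata");

    // Relax the size limit only for the duration of catalog updates.
    compute_ctl_temp_override_create(pgdata, "neon.max_cluster_size=-1")?;
    pg_reload_conf()?;

    apply_catalog_changes()?;

    // Restore the original limit and reload again.
    compute_ctl_temp_override_remove(pgdata)?;
    pg_reload_conf()?;
    Ok(())
}
```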

View File

@@ -71,18 +71,16 @@ More specifically, here is an example ext_index.json
}
}
*/
use anyhow::Context;
use anyhow::{self, Result};
use anyhow::{bail, Context};
use bytes::Bytes;
use compute_api::spec::RemoteExtSpec;
use regex::Regex;
use remote_storage::*;
use serde_json;
use std::io::Read;
use std::num::{NonZeroU32, NonZeroUsize};
use reqwest::StatusCode;
use std::path::Path;
use std::str;
use tar::Archive;
use tokio::io::AsyncReadExt;
use tracing::info;
use tracing::log::warn;
use zstd::stream::read::Decoder;
@@ -133,67 +131,36 @@ fn parse_pg_version(human_version: &str) -> &str {
panic!("Unsuported postgres version {human_version}");
}
#[cfg(test)]
mod tests {
use super::parse_pg_version;
#[test]
fn test_parse_pg_version() {
assert_eq!(parse_pg_version("PostgreSQL 15.4"), "v15");
assert_eq!(parse_pg_version("PostgreSQL 15.14"), "v15");
assert_eq!(
parse_pg_version("PostgreSQL 15.4 (Ubuntu 15.4-0ubuntu0.23.04.1)"),
"v15"
);
assert_eq!(parse_pg_version("PostgreSQL 14.15"), "v14");
assert_eq!(parse_pg_version("PostgreSQL 14.0"), "v14");
assert_eq!(
parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
"v14"
);
assert_eq!(parse_pg_version("PostgreSQL 16devel"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16beta1"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16rc2"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16extra"), "v16");
}
#[test]
#[should_panic]
fn test_parse_pg_unsupported_version() {
parse_pg_version("PostgreSQL 13.14");
}
#[test]
#[should_panic]
fn test_parse_pg_incorrect_version_format() {
parse_pg_version("PostgreSQL 14");
}
}
// download the archive for a given extension,
// unzip it, and place files in the appropriate locations (share/lib)
pub async fn download_extension(
ext_name: &str,
ext_path: &RemotePath,
remote_storage: &GenericRemoteStorage,
ext_remote_storage: &str,
pgbin: &str,
) -> Result<u64> {
info!("Download extension {:?} from {:?}", ext_name, ext_path);
let mut download = remote_storage.download(ext_path).await?;
let mut download_buffer = Vec::new();
download
.download_stream
.read_to_end(&mut download_buffer)
.await?;
// TODO add retry logic
let download_buffer =
match download_extension_tar(ext_remote_storage, &ext_path.to_string()).await {
Ok(buffer) => buffer,
Err(error_message) => {
return Err(anyhow::anyhow!(
"error downloading extension {:?}: {:?}",
ext_name,
error_message
));
}
};
let download_size = download_buffer.len() as u64;
info!("Download size {:?}", download_size);
// it's unclear whether it is more performant to decompress into memory or not
// TODO: decompressing into memory can be avoided
let mut decoder = Decoder::new(download_buffer.as_slice())?;
let mut decompress_buffer = Vec::new();
decoder.read_to_end(&mut decompress_buffer)?;
let mut archive = Archive::new(decompress_buffer.as_slice());
let decoder = Decoder::new(download_buffer.as_ref())?;
let mut archive = Archive::new(decoder);
let unzip_dest = pgbin
.strip_suffix("/bin/postgres")
.expect("bad pgbin")
@@ -261,29 +228,69 @@ pub fn create_control_files(remote_extensions: &RemoteExtSpec, pgbin: &str) {
}
}
// This function initializes the necessary structs to use remote storage
pub fn init_remote_storage(remote_ext_config: &str) -> anyhow::Result<GenericRemoteStorage> {
#[derive(Debug, serde::Deserialize)]
struct RemoteExtJson {
bucket: String,
region: String,
endpoint: Option<String>,
prefix: Option<String>,
}
let remote_ext_json = serde_json::from_str::<RemoteExtJson>(remote_ext_config)?;
// Do a request to the extension storage proxy, e.g.
// curl http://pg-ext-s3-gateway/latest/v15/extensions/anon.tar.zst
// using HTTP GET,
// and return the response body as bytes
//
async fn download_extension_tar(ext_remote_storage: &str, ext_path: &str) -> Result<Bytes> {
let uri = format!("{}/{}", ext_remote_storage, ext_path);
let config = S3Config {
bucket_name: remote_ext_json.bucket,
bucket_region: remote_ext_json.region,
prefix_in_bucket: remote_ext_json.prefix,
endpoint: remote_ext_json.endpoint,
concurrency_limit: NonZeroUsize::new(100).expect("100 != 0"),
max_keys_per_list_response: None,
};
let config = RemoteStorageConfig {
max_concurrent_syncs: NonZeroUsize::new(100).expect("100 != 0"),
max_sync_errors: NonZeroU32::new(100).expect("100 != 0"),
storage: RemoteStorageKind::AwsS3(config),
};
GenericRemoteStorage::from_config(&config)
info!("Download extension {:?} from uri {:?}", ext_path, uri);
let resp = reqwest::get(uri).await?;
match resp.status() {
StatusCode::OK => match resp.bytes().await {
Ok(resp) => {
info!("Download extension {:?} completed successfully", ext_path);
Ok(resp)
}
Err(e) => bail!("could not deserialize remote extension response: {}", e),
},
StatusCode::SERVICE_UNAVAILABLE => bail!("remote extension is temporarily unavailable"),
_ => bail!(
"unexpected remote extension response status code: {}",
resp.status()
),
}
}
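
The `// TODO add retry logic` above is left open by this change. A minimal sketch of what it could look like, assuming the same `reqwest`, `tokio`, `anyhow`, `bytes`, and `tracing` dependencies (`download_with_retries` and its parameters are hypothetical):

```rust
use std::time::Duration;

use anyhow::{bail, Result};
use bytes::Bytes;
use reqwest::StatusCode;

// Retry an HTTP GET against the extension gateway with exponential backoff.
async fn download_with_retries(uri: &str, max_attempts: u32) -> Result<Bytes> {
    let mut delay = Duration::from_millis(500);
    for attempt in 1..=max_attempts {
        match reqwest::get(uri).await {
            Ok(resp) if resp.status() == StatusCode::OK => return Ok(resp.bytes().await?),
            Ok(resp) => tracing::warn!("attempt {}: unexpected status {}", attempt, resp.status()),
            Err(e) => tracing::warn!("attempt {}: request failed: {}", attempt, e),
        }
        tokio::time::sleep(delay).await;
        delay *= 2;
    }
    bail!("failed to download {} after {} attempts", uri, max_attempts)
}
```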
#[cfg(test)]
mod tests {
use super::parse_pg_version;
#[test]
fn test_parse_pg_version() {
assert_eq!(parse_pg_version("PostgreSQL 15.4"), "v15");
assert_eq!(parse_pg_version("PostgreSQL 15.14"), "v15");
assert_eq!(
parse_pg_version("PostgreSQL 15.4 (Ubuntu 15.4-0ubuntu0.23.04.1)"),
"v15"
);
assert_eq!(parse_pg_version("PostgreSQL 14.15"), "v14");
assert_eq!(parse_pg_version("PostgreSQL 14.0"), "v14");
assert_eq!(
parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
"v14"
);
assert_eq!(parse_pg_version("PostgreSQL 16devel"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16beta1"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16rc2"), "v16");
assert_eq!(parse_pg_version("PostgreSQL 16extra"), "v16");
}
#[test]
#[should_panic]
fn test_parse_pg_unsupported_version() {
parse_pg_version("PostgreSQL 13.14");
}
#[test]
#[should_panic]
fn test_parse_pg_incorrect_version_format() {
parse_pg_version("PostgreSQL 14");
}
}

View File

@@ -123,7 +123,7 @@ async fn routes(req: Request<Body>, compute: &Arc<ComputeNode>) -> Response<Body
}
}
// download extension files from S3 on demand
// download extension files from remote extension storage on demand
(&Method::POST, route) if route.starts_with("/extension_server/") => {
info!("serving {:?} POST request", route);
info!("req.uri {:?}", req.uri());
@@ -227,7 +227,7 @@ async fn handle_configure_request(
let parsed_spec = match ParsedSpec::try_from(spec) {
Ok(ps) => ps,
Err(msg) => return Err((msg, StatusCode::PRECONDITION_FAILED)),
Err(msg) => return Err((msg, StatusCode::BAD_REQUEST)),
};
// XXX: wrap state update under lock in code blocks. Otherwise,

View File

@@ -156,17 +156,17 @@ paths:
description: Error text or 'OK' if download succeeded.
example: "OK"
400:
description: Request is invalid.
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
description: Request is invalid.
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
500:
description: Extension download request failed.
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
description: Extension download request failed.
content:
application/json:
schema:
$ref: "#/components/schemas/GenericError"
components:
securitySchemes:

View File

@@ -1,7 +1,7 @@
//!
//! Various tools and helpers to handle cluster / compute node (Postgres)
//! configuration.
//!
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod checker;
pub mod config;
pub mod configurator;

View File

@@ -38,3 +38,9 @@ pub fn init_tracing_and_logging(default_log_level: &str) -> anyhow::Result<()> {
Ok(())
}
/// Replace all newline characters with a special character to make it
/// easier to grep for log messages.
pub fn inlinify(s: &str) -> String {
s.replace('\n', "\u{200B}")
}
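
A small illustration of the helper above (the zero-width space is invisible in most terminals, so the effect is easiest to see by counting separators; the function body is repeated here only to keep the example self-contained):

```rust
// Same behavior as the inlinify() helper above.
fn inlinify(s: &str) -> String {
    s.replace('\n', "\u{200B}")
}

fn main() {
    let query = "DO $$\nBEGIN\nNULL;\nEND\n$$;";
    let one_line = inlinify(query);

    // No raw newlines remain, so grep matches the whole statement as one log line.
    assert!(!one_line.contains('\n'));
    assert_eq!(one_line.matches('\u{200B}').count(), 4);
    println!("{}", one_line);
}
```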

View File

@@ -1,100 +1,195 @@
use std::sync::Arc;
use std::{thread, time};
use std::{thread, time::Duration};
use chrono::{DateTime, Utc};
use postgres::{Client, NoTls};
use tracing::{debug, info};
use tracing::{debug, error, info, warn};
use crate::compute::ComputeNode;
use compute_api::responses::ComputeStatus;
use compute_api::spec::ComputeFeature;
const MONITOR_CHECK_INTERVAL: u64 = 500; // milliseconds
const MONITOR_CHECK_INTERVAL: Duration = Duration::from_millis(500);
// Spin in a loop and figure out the last activity time in the Postgres.
// Then update it in the shared state. This function never errors out.
// XXX: the only expected panic is at `RwLock` unwrap().
// NB: the only expected panic is at `Mutex` unwrap(), all other errors
// should be handled gracefully.
fn watch_compute_activity(compute: &ComputeNode) {
// Suppose that `connstr` doesn't change
let connstr = compute.connstr.as_str();
// During startup and configuration we connect to every Postgres database,
// but we don't want to count this as user activity. So wait until
// the compute has fully started before monitoring activity.
wait_for_postgres_start(compute);
// Define `client` outside of the loop to reuse existing connection if it's active.
let mut client = Client::connect(connstr, NoTls);
let timeout = time::Duration::from_millis(MONITOR_CHECK_INTERVAL);
info!("watching Postgres activity at {}", connstr);
let mut sleep = false;
let mut prev_active_time: Option<f64> = None;
let mut prev_sessions: Option<i64> = None;
if compute.has_feature(ComputeFeature::ActivityMonitorExperimental) {
info!("starting experimental activity monitor for {}", connstr);
} else {
info!("starting activity monitor for {}", connstr);
}
loop {
// Should be outside of the write lock to allow others to read while we sleep.
thread::sleep(timeout);
// We use `continue` a lot, so it's more convenient to sleep at the top of the loop.
// But skip the first sleep, so we can connect to Postgres immediately.
if sleep {
// Should be outside of the mutex lock to allow others to read while we sleep.
thread::sleep(MONITOR_CHECK_INTERVAL);
} else {
sleep = true;
}
match &mut client {
Ok(cli) => {
if cli.is_closed() {
info!("connection to postgres closed, trying to reconnect");
info!("connection to Postgres is closed, trying to reconnect");
// Connection is closed, reconnect and try again.
client = Client::connect(connstr, NoTls);
continue;
}
// Get all running client backends except ourself, use RFC3339 DateTime format.
let backends = cli
.query(
"SELECT state, to_char(state_change, 'YYYY-MM-DD\"T\"HH24:MI:SS.US\"Z\"') AS state_change
FROM pg_stat_activity
WHERE backend_type = 'client backend'
AND pid != pg_backend_pid()
AND usename != 'cloud_admin';", // XXX: find a better way to filter other monitors?
&[],
);
let mut last_active = compute.state.lock().unwrap().last_active;
// This is new logic, only enabled if the feature flag is set.
// TODO: remove this once we are sure that it works OR drop it altogether.
if compute.has_feature(ComputeFeature::ActivityMonitorExperimental) {
// First, check whether the total active time or number of sessions across all databases has changed.
// If it did, it means that the user executed some queries. In theory, it can even go down if
// some databases were dropped, but that still counts as user activity.
match get_database_stats(cli) {
Ok((active_time, sessions)) => {
let mut detected_activity = false;
if let Ok(backs) = backends {
let mut idle_backs: Vec<DateTime<Utc>> = vec![];
for b in backs.into_iter() {
let state: String = match b.try_get("state") {
Ok(state) => state,
Err(_) => continue,
};
if state == "idle" {
let change: String = match b.try_get("state_change") {
Ok(state_change) => state_change,
Err(_) => continue,
};
let change = DateTime::parse_from_rfc3339(&change);
match change {
Ok(t) => idle_backs.push(t.with_timezone(&Utc)),
Err(e) => {
info!("cannot parse backend state_change DateTime: {}", e);
continue;
prev_active_time = match prev_active_time {
Some(prev_active_time) => {
if active_time != prev_active_time {
detected_activity = true;
}
Some(active_time)
}
}
} else {
// Found non-idle backend, so the last activity is NOW.
// Save it and exit the for loop. Also clear the idle backend
// `state_change` timestamps array as it doesn't matter now.
last_active = Some(Utc::now());
idle_backs.clear();
break;
}
}
None => Some(active_time),
};
prev_sessions = match prev_sessions {
Some(prev_sessions) => {
if sessions != prev_sessions {
detected_activity = true;
}
Some(sessions)
}
None => Some(sessions),
};
// Get idle backend `state_change` with the max timestamp.
if let Some(last) = idle_backs.iter().max() {
last_active = Some(*last);
if detected_activity {
// Update the last active time and continue, we don't need to
// check backends state change.
compute.update_last_active(Some(Utc::now()));
continue;
}
}
Err(e) => {
error!("could not get database statistics: {}", e);
continue;
}
}
}
// Update the last activity in the shared state if we got a more recent one.
let mut state = compute.state.lock().unwrap();
// NB: `Some(<DateTime>)` is always greater than `None`.
if last_active > state.last_active {
state.last_active = last_active;
debug!("set the last compute activity time to: {:?}", last_active);
// Second, if the database statistics are the same, check all backends' state changes;
// maybe some backend has more recent activity. `get_backends_state_change()`
// can return None or a stale timestamp, so it's `compute.update_last_active()`'s
// responsibility to check whether the new timestamp is more recent than the current one.
// This helps us discover new sessions that haven't done anything yet.
match get_backends_state_change(cli) {
Ok(last_active) => {
compute.update_last_active(last_active);
}
Err(e) => {
error!("could not get backends state change: {}", e);
}
}
// Finally, if there are existing (logical) walsenders, do not suspend.
//
// walproposer doesn't currently show up in pg_stat_replication,
// but guard against it in case it ever does
let ws_count_query = "select count(*) from pg_stat_replication where application_name != 'walproposer';";
match cli.query_one(ws_count_query, &[]) {
Ok(r) => match r.try_get::<&str, i64>("count") {
Ok(num_ws) => {
if num_ws > 0 {
compute.update_last_active(Some(Utc::now()));
continue;
}
}
Err(e) => {
warn!("failed to parse walsenders count: {:?}", e);
continue;
}
},
Err(e) => {
warn!("failed to get list of walsenders: {:?}", e);
continue;
}
}
//
// Don't suspend compute if there is an active logical replication subscription
//
// `where pid is not null` to filter out read-only computes and subscriptions on branches
//
let logical_subscriptions_query =
"select count(*) from pg_stat_subscription where pid is not null;";
match cli.query_one(logical_subscriptions_query, &[]) {
Ok(row) => match row.try_get::<&str, i64>("count") {
Ok(num_subscribers) => {
if num_subscribers > 0 {
compute.update_last_active(Some(Utc::now()));
continue;
}
}
Err(e) => {
warn!("failed to parse `pg_stat_subscription` count: {:?}", e);
continue;
}
},
Err(e) => {
warn!(
"failed to get list of active logical replication subscriptions: {:?}",
e
);
continue;
}
}
//
// Do not suspend compute if autovacuum is running
//
let autovacuum_count_query = "select count(*) from pg_stat_activity where backend_type = 'autovacuum worker'";
match cli.query_one(autovacuum_count_query, &[]) {
Ok(r) => match r.try_get::<&str, i64>("count") {
Ok(num_workers) => {
if num_workers > 0 {
compute.update_last_active(Some(Utc::now()));
continue;
}
}
Err(e) => {
warn!("failed to parse autovacuum workers count: {:?}", e);
continue;
}
},
Err(e) => {
warn!("failed to get list of autovacuum workers: {:?}", e);
continue;
}
}
}
Err(e) => {
debug!("cannot connect to postgres: {}, retrying", e);
debug!("could not connect to Postgres: {}, retrying", e);
// Establish a new connection and try again.
client = Client::connect(connstr, NoTls);
@@ -103,12 +198,124 @@ fn watch_compute_activity(compute: &ComputeNode) {
}
}
// Hang on condition variable waiting until the compute status is `Running`.
fn wait_for_postgres_start(compute: &ComputeNode) {
let mut state = compute.state.lock().unwrap();
while state.status != ComputeStatus::Running {
info!("compute is not running, waiting before monitoring activity");
state = compute.state_changed.wait(state).unwrap();
if state.status == ComputeStatus::Running {
break;
}
}
}
// Figure out the total active time and sessions across all non-system databases.
// Returned tuple is `(active_time, sessions)`.
// It can return `0.0` active time or `0` sessions, which means that either no user databases exist OR
// the compute started with skipped `pg_catalog` updates and the user hasn't run any queries
// (or opened any sessions) yet.
fn get_database_stats(cli: &mut Client) -> anyhow::Result<(f64, i64)> {
// Filter out `postgres` database as `compute_ctl` and other monitoring tools
// like `postgres_exporter` use it to query Postgres statistics.
// Use explicit 8-byte type casts to match the Rust types.
let stats = cli.query_one(
"SELECT coalesce(sum(active_time), 0.0)::float8 AS total_active_time,
coalesce(sum(sessions), 0)::bigint AS total_sessions
FROM pg_stat_database
WHERE datname NOT IN (
'postgres',
'template0',
'template1'
);",
&[],
);
let stats = match stats {
Ok(stats) => stats,
Err(e) => {
return Err(anyhow::anyhow!("could not query active_time: {}", e));
}
};
let active_time: f64 = match stats.try_get("total_active_time") {
Ok(active_time) => active_time,
Err(e) => return Err(anyhow::anyhow!("could not get total_active_time: {}", e)),
};
let sessions: i64 = match stats.try_get("total_sessions") {
Ok(sessions) => sessions,
Err(e) => return Err(anyhow::anyhow!("could not get total_sessions: {}", e)),
};
Ok((active_time, sessions))
}
// Figure out the most recent state change time across all client backends.
// If there is a currently active backend, the timestamp will be `Utc::now()`.
// It can return `None`, which means no client backends exist or we were
// unable to parse the timestamp.
fn get_backends_state_change(cli: &mut Client) -> anyhow::Result<Option<DateTime<Utc>>> {
let mut last_active: Option<DateTime<Utc>> = None;
// Get all running client backends except ourself, use RFC3339 DateTime format.
let backends = cli.query(
"SELECT state, to_char(state_change, 'YYYY-MM-DD\"T\"HH24:MI:SS.US\"Z\"') AS state_change
FROM pg_stat_activity
WHERE backend_type = 'client backend'
AND pid != pg_backend_pid()
AND usename != 'cloud_admin';", // XXX: find a better way to filter other monitors?
&[],
);
match backends {
Ok(backs) => {
let mut idle_backs: Vec<DateTime<Utc>> = vec![];
for b in backs.into_iter() {
let state: String = match b.try_get("state") {
Ok(state) => state,
Err(_) => continue,
};
if state == "idle" {
let change: String = match b.try_get("state_change") {
Ok(state_change) => state_change,
Err(_) => continue,
};
let change = DateTime::parse_from_rfc3339(&change);
match change {
Ok(t) => idle_backs.push(t.with_timezone(&Utc)),
Err(e) => {
info!("cannot parse backend state_change DateTime: {}", e);
continue;
}
}
} else {
// Found non-idle backend, so the last activity is NOW.
// Return immediately, no need to check other backends.
return Ok(Some(Utc::now()));
}
}
// Get idle backend `state_change` with the max timestamp.
if let Some(last) = idle_backs.iter().max() {
last_active = Some(*last);
}
}
Err(e) => {
return Err(anyhow::anyhow!("could not query backends: {}", e));
}
}
Ok(last_active)
}
/// Launch a separate compute monitor thread and return its `JoinHandle`.
pub fn launch_monitor(state: &Arc<ComputeNode>) -> thread::JoinHandle<()> {
let state = Arc::clone(state);
pub fn launch_monitor(compute: &Arc<ComputeNode>) -> thread::JoinHandle<()> {
let compute = Arc::clone(compute);
thread::Builder::new()
.name("compute-monitor".into())
.spawn(move || watch_compute_activity(&state))
.spawn(move || watch_compute_activity(&compute))
.expect("cannot launch compute monitor thread")
}

View File

@@ -1,3 +1,4 @@
use std::collections::HashMap;
use std::fmt::Write;
use std::fs;
use std::fs::File;
@@ -5,12 +6,17 @@ use std::io::{BufRead, BufReader};
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::Child;
use std::thread::JoinHandle;
use std::time::{Duration, Instant};
use anyhow::{bail, Result};
use ini::Ini;
use notify::{RecursiveMode, Watcher};
use postgres::{Client, Transaction};
use tracing::{debug, instrument};
use tokio::io::AsyncBufReadExt;
use tokio::time::timeout;
use tokio_postgres::NoTls;
use tracing::{debug, error, info, instrument};
use compute_api::spec::{Database, GenericOption, GenericOptions, PgIdent, Role};
@@ -205,22 +211,37 @@ pub fn get_existing_roles(xact: &mut Transaction<'_>) -> Result<Vec<Role>> {
}
/// Build a list of existing Postgres databases
pub fn get_existing_dbs(client: &mut Client) -> Result<Vec<Database>> {
let postgres_dbs = client
pub fn get_existing_dbs(client: &mut Client) -> Result<HashMap<String, Database>> {
// `pg_database.datconnlimit = -2` means that the database is in the
// invalid state. See:
// https://github.com/postgres/postgres/commit/a4b4cc1d60f7e8ccfcc8ff8cb80c28ee411ad9a9
let postgres_dbs: Vec<Database> = client
.query(
"SELECT datname, datdba::regrole::text as owner
FROM pg_catalog.pg_database;",
"SELECT
datname AS name,
datdba::regrole::text AS owner,
NOT datallowconn AS restrict_conn,
datconnlimit = - 2 AS invalid
FROM
pg_catalog.pg_database;",
&[],
)?
.iter()
.map(|row| Database {
name: row.get("datname"),
name: row.get("name"),
owner: row.get("owner"),
restrict_conn: row.get("restrict_conn"),
invalid: row.get("invalid"),
options: None,
})
.collect();
Ok(postgres_dbs)
let dbs_map = postgres_dbs
.iter()
.map(|db| (db.name.clone(), db.clone()))
.collect::<HashMap<_, _>>();
Ok(dbs_map)
}
/// Wait for Postgres to become ready to accept connections. It's ready to
@@ -343,3 +364,172 @@ pub fn create_pgdata(pgdata: &str) -> Result<()> {
Ok(())
}
/// Update pgbouncer.ini with provided options
fn update_pgbouncer_ini(
pgbouncer_config: HashMap<String, String>,
pgbouncer_ini_path: &str,
) -> Result<()> {
let mut conf = Ini::load_from_file(pgbouncer_ini_path)?;
let section = conf.section_mut(Some("pgbouncer")).unwrap();
for (option_name, value) in pgbouncer_config.iter() {
section.insert(option_name, value);
debug!(
"Updating pgbouncer.ini with new values {}={}",
option_name, value
);
}
conf.write_to_file(pgbouncer_ini_path)?;
Ok(())
}
/// Tune pgbouncer.
/// 1. Apply new config using pgbouncer admin console
/// 2. Add new values to pgbouncer.ini to preserve them after restart
pub async fn tune_pgbouncer(pgbouncer_config: HashMap<String, String>) -> Result<()> {
let pgbouncer_connstr = if std::env::var_os("AUTOSCALING").is_some() {
// for VMs use pgbouncer specific way to connect to
// pgbouncer admin console without password
// when pgbouncer is running under the same user.
"host=/tmp port=6432 dbname=pgbouncer user=pgbouncer".to_string()
} else {
// for k8s use normal connection string with password
// to connect to pgbouncer admin console
let mut pgbouncer_connstr =
"host=localhost port=6432 dbname=pgbouncer user=postgres sslmode=disable".to_string();
if let Ok(pass) = std::env::var("PGBOUNCER_PASSWORD") {
pgbouncer_connstr.push_str(format!(" password={}", pass).as_str());
}
pgbouncer_connstr
};
info!(
"Connecting to pgbouncer with connection string: {}",
pgbouncer_connstr
);
// connect to pgbouncer, retrying several times
// because pgbouncer may not be ready yet
let mut retries = 3;
let client = loop {
match tokio_postgres::connect(&pgbouncer_connstr, NoTls).await {
Ok((client, connection)) => {
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
}
});
break client;
}
Err(e) => {
if retries == 0 {
return Err(e.into());
}
error!("Failed to connect to pgbouncer: pgbouncer_connstr {}", e);
retries -= 1;
tokio::time::sleep(Duration::from_secs(1)).await;
}
}
};
// Apply new config
for (option_name, value) in pgbouncer_config.iter() {
let query = format!("SET {}={}", option_name, value);
// keep this log line for debugging purposes
info!("Applying pgbouncer setting change: {}", query);
if let Err(err) = client.simple_query(&query).await {
// Don't fail on error, just print it into log
error!(
"Failed to apply pgbouncer setting change: {}, {}",
query, err
);
};
}
// save values to pgbouncer.ini
// so that they are preserved after pgbouncer restart
let pgbouncer_ini_path = if std::env::var_os("AUTOSCALING").is_some() {
// in VMs we use /etc/pgbouncer.ini
"/etc/pgbouncer.ini".to_string()
} else {
// in pods we use /var/db/postgres/pgbouncer/pgbouncer.ini
// this is a shared volume between pgbouncer and postgres containers
// FIXME: fix permissions for this file
"/var/db/postgres/pgbouncer/pgbouncer.ini".to_string()
};
update_pgbouncer_ini(pgbouncer_config, &pgbouncer_ini_path)?;
Ok(())
}
/// Spawn a thread that will read Postgres logs from `stderr`, join multiline logs
/// and send them to the logger. In the future we may also want to add context to
/// these logs.
pub fn handle_postgres_logs(stderr: std::process::ChildStderr) -> JoinHandle<()> {
std::thread::spawn(move || {
let runtime = tokio::runtime::Builder::new_current_thread()
.enable_all()
.build()
.expect("failed to build tokio runtime");
let res = runtime.block_on(async move {
let stderr = tokio::process::ChildStderr::from_std(stderr)?;
handle_postgres_logs_async(stderr).await
});
if let Err(e) = res {
tracing::error!("error while processing postgres logs: {}", e);
}
})
}
/// Read Postgres logs from `stderr` until EOF. Buffer is flushed on one of the following conditions:
/// - next line starts with timestamp
/// - EOF
/// - no new lines were written within the flush timeout (100 ms)
async fn handle_postgres_logs_async(stderr: tokio::process::ChildStderr) -> Result<()> {
let mut lines = tokio::io::BufReader::new(stderr).lines();
let timeout_duration = Duration::from_millis(100);
let ts_regex =
regex::Regex::new(r"^\d+-\d{2}-\d{2} \d{2}:\d{2}:\d{2}").expect("regex is valid");
let mut buf = vec![];
loop {
let next_line = timeout(timeout_duration, lines.next_line()).await;
// we should flush lines from the buffer if we cannot continue reading multiline message
let should_flush_buf = match next_line {
// Flushing if new line starts with timestamp
Ok(Ok(Some(ref line))) => ts_regex.is_match(line),
// Flushing on EOF, timeout or error
_ => true,
};
if !buf.is_empty() && should_flush_buf {
// join multiline message into a single line, separated by unicode Zero Width Space.
// "PG:" suffix is used to distinguish postgres logs from other logs.
let combined = format!("PG:{}\n", buf.join("\u{200B}"));
buf.clear();
// sync write to stderr to avoid interleaving with other logs
use std::io::Write;
let res = std::io::stderr().lock().write_all(combined.as_bytes());
if let Err(e) = res {
tracing::error!("error while writing to stderr: {}", e);
}
}
// if not timeout, append line to the buffer
if next_line.is_ok() {
match next_line?? {
Some(line) => buf.push(line),
// EOF
None => break,
};
}
}
Ok(())
}

View File

@@ -9,11 +9,12 @@ use reqwest::StatusCode;
use tracing::{error, info, info_span, instrument, span_enabled, warn, Level};
use crate::config;
use crate::logger::inlinify;
use crate::params::PG_HBA_ALL_MD5;
use crate::pg_helpers::*;
use compute_api::responses::{ControlPlaneComputeStatus, ControlPlaneSpecResponse};
use compute_api::spec::{ComputeSpec, Database, PgIdent, Role};
use compute_api::spec::{ComputeSpec, PgIdent, Role};
// Do control plane request and return response if any. In case of error it
// returns a bool flag indicating whether it makes sense to retry the request
@@ -24,7 +25,7 @@ fn do_control_plane_request(
) -> Result<ControlPlaneSpecResponse, (bool, String)> {
let resp = reqwest::blocking::Client::new()
.get(uri)
.header("Authorization", jwt)
.header("Authorization", format!("Bearer {}", jwt))
.send()
.map_err(|e| {
(
@@ -68,7 +69,7 @@ pub fn get_spec_from_control_plane(
base_uri: &str,
compute_id: &str,
) -> Result<Option<ComputeSpec>> {
let cp_uri = format!("{base_uri}/management/api/v2/computes/{compute_id}/spec");
let cp_uri = format!("{base_uri}/compute/api/v2/computes/{compute_id}/spec");
let jwt: String = match std::env::var("NEON_CONTROL_PLANE_TOKEN") {
Ok(v) => v,
Err(_) => "".to_string(),
@@ -118,19 +119,6 @@ pub fn get_spec_from_control_plane(
spec
}
/// It takes cluster specification and does the following:
/// - Serialize cluster config and put it into `postgresql.conf` completely rewriting the file.
/// - Update `pg_hba.conf` to allow external connections.
pub fn handle_configuration(spec: &ComputeSpec, pgdata_path: &Path) -> Result<()> {
// File `postgresql.conf` is no longer included into `basebackup`, so just
// always write all config into it creating new file.
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec, None)?;
update_pg_hba(pgdata_path)?;
Ok(())
}
/// Check `pg_hba.conf` and update if needed to allow external connections.
pub fn update_pg_hba(pgdata_path: &Path) -> Result<()> {
// XXX: consider making it a part of spec.json
@@ -161,6 +149,38 @@ pub fn add_standby_signal(pgdata_path: &Path) -> Result<()> {
Ok(())
}
/// The compute could be unexpectedly shut down, for example, while a
/// database is being dropped. This leaves the database in an invalid state,
/// which prevents creating a new db with the same name. This function
/// cleans that up before proceeding with catalog updates. All
/// possible future cleanup operations may go here too.
#[instrument(skip_all)]
pub fn cleanup_instance(client: &mut Client) -> Result<()> {
let existing_dbs = get_existing_dbs(client)?;
for (_, db) in existing_dbs {
if db.invalid {
// After a recent commit in Postgres, an interrupted DROP DATABASE
// leaves the database in the invalid state. According to the
// commit message, the only option for the user is to drop it again.
// See:
// https://github.com/postgres/postgres/commit/a4b4cc1d60f7e8ccfcc8ff8cb80c28ee411ad9a9
//
// The Neon Postgres extension is designed such that a db is de-registered
// in the control plane metadata only after it is dropped. So there is
// a chance that the control plane still thinks the db should exist. This means
// that it will be re-created by `handle_databases()`. Yet, that's fine,
// as the user can just repeat the drop (in vanilla Postgres they would need
// to do the same, btw).
let query = format!("DROP DATABASE IF EXISTS {}", db.name.pg_quote());
info!("dropping invalid database {}", db.name);
client.execute(query.as_str(), &[])?;
}
}
Ok(())
}
/// Given a cluster spec json and open transaction it handles roles creation,
/// deletion and update.
#[instrument(skip_all)]
@@ -170,18 +190,20 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
// Print a list of existing Postgres roles (only in debug mode)
if span_enabled!(Level::INFO) {
info!("postgres roles:");
let mut vec = Vec::new();
for r in &existing_roles {
info!(
" - {}:{}",
vec.push(format!(
"{}:{}",
r.name,
if r.encrypted_password.is_some() {
"[FILTERED]"
} else {
"(null)"
}
);
));
}
info!("postgres roles (total {}): {:?}", vec.len(), vec);
}
// Process delta operations first
@@ -219,7 +241,10 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
// Refresh Postgres roles info to handle possible roles renaming
let existing_roles: Vec<Role> = get_existing_roles(&mut xact)?;
info!("cluster spec roles:");
info!(
"handling cluster spec roles (total {})",
spec.cluster.roles.len()
);
for role in &spec.cluster.roles {
let name = &role.name;
// XXX: with a limited number of roles it is fine, but consider making it a HashMap
@@ -264,16 +289,25 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
match action {
RoleAction::None => {}
RoleAction::Update => {
// This can be run on /every/ role! Not just ones created through the console.
// This means that if you add some funny ALTER here that adds a permission,
// this will get run even on user-created roles! This will result in different
// behavior before and after a spec gets reapplied. The below ALTER as it stands
// now only grants LOGIN and changes the password. Please do not allow this branch
// to do anything silly.
let mut query: String = format!("ALTER ROLE {} ", name.pg_quote());
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
}
RoleAction::Create => {
// This branch only runs when roles are created through the console, so it is
// safe to add more permissions here. BYPASSRLS and REPLICATION are inherited
// from neon_superuser.
let mut query: String = format!(
"CREATE ROLE {} CREATEROLE CREATEDB BYPASSRLS IN ROLE neon_superuser",
"CREATE ROLE {} INHERIT CREATEROLE CREATEDB BYPASSRLS REPLICATION IN ROLE neon_superuser",
name.pg_quote()
);
info!("role create query: '{}'", &query);
info!("running role create query: '{}'", &query);
query.push_str(&role.to_pg_options());
xact.execute(query.as_str(), &[])?;
}
@@ -290,7 +324,7 @@ pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
RoleAction::Create => " -> create",
RoleAction::Update => " -> update",
};
info!(" - {}:{}{}", name, pwd, action_str);
info!(" - {}:{}{}", name, pwd, action_str);
}
}
@@ -342,33 +376,49 @@ pub fn handle_role_deletions(spec: &ComputeSpec, connstr: &str, client: &mut Cli
Ok(())
}
fn reassign_owned_objects_in_one_db(
conf: Config,
role_name: &PgIdent,
db_owner: &PgIdent,
) -> Result<()> {
let mut client = conf.connect(NoTls)?;
// This will reassign all dependent objects to the db owner
let reassign_query = format!(
"REASSIGN OWNED BY {} TO {}",
role_name.pg_quote(),
db_owner.pg_quote()
);
info!(
"reassigning objects owned by '{}' in db '{}' to '{}'",
role_name,
conf.get_dbname().unwrap_or(""),
db_owner
);
client.simple_query(&reassign_query)?;
// This now will only drop privileges of the role
let drop_query = format!("DROP OWNED BY {}", role_name.pg_quote());
client.simple_query(&drop_query)?;
Ok(())
}
// Reassign all owned objects in all databases to the owner of the database.
fn reassign_owned_objects(spec: &ComputeSpec, connstr: &str, role_name: &PgIdent) -> Result<()> {
for db in &spec.cluster.databases {
if db.owner != *role_name {
let mut conf = Config::from_str(connstr)?;
conf.dbname(&db.name);
let mut client = conf.connect(NoTls)?;
// This will reassign all dependent objects to the db owner
let reassign_query = format!(
"REASSIGN OWNED BY {} TO {}",
role_name.pg_quote(),
db.owner.pg_quote()
);
info!(
"reassigning objects owned by '{}' in db '{}' to '{}'",
role_name, &db.name, &db.owner
);
client.simple_query(&reassign_query)?;
// This now will only drop privileges of the role
let drop_query = format!("DROP OWNED BY {}", role_name.pg_quote());
client.simple_query(&drop_query)?;
reassign_owned_objects_in_one_db(conf, role_name, &db.owner)?;
}
}
// Also handle case when there are no databases in the spec.
// In this case we need to reassign objects in the default database.
let conf = Config::from_str(connstr)?;
let db_owner = PgIdent::from_str("cloud_admin")?;
reassign_owned_objects_in_one_db(conf, role_name, &db_owner)?;
Ok(())
}
@@ -379,14 +429,15 @@ fn reassign_owned_objects(spec: &ComputeSpec, connstr: &str, role_name: &PgIdent
/// which together provide us idempotency.
#[instrument(skip_all)]
pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let existing_dbs: Vec<Database> = get_existing_dbs(client)?;
let existing_dbs = get_existing_dbs(client)?;
// Print a list of existing Postgres databases (only in debug mode)
if span_enabled!(Level::INFO) {
info!("postgres databases:");
for r in &existing_dbs {
info!(" {}:{}", r.name, r.owner);
let mut vec = Vec::new();
for (dbname, db) in &existing_dbs {
vec.push(format!("{}:{}", dbname, db.owner));
}
info!("postgres databases (total {}): {:?}", vec.len(), vec);
}
// Process delta operations first
@@ -439,8 +490,7 @@ pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
"rename_db" => {
let new_name = op.new_name.as_ref().unwrap();
// XXX: with a limited number of roles it is fine, but consider making it a HashMap
if existing_dbs.iter().any(|r| r.name == op.name) {
if existing_dbs.get(&op.name).is_some() {
let query: String = format!(
"ALTER DATABASE {} RENAME TO {}",
op.name.pg_quote(),
@@ -457,14 +507,15 @@ pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
}
// Refresh Postgres databases info to handle possible renames
let existing_dbs: Vec<Database> = get_existing_dbs(client)?;
let existing_dbs = get_existing_dbs(client)?;
info!("cluster spec databases:");
info!(
"handling cluster spec databases (total {})",
spec.cluster.databases.len()
);
for db in &spec.cluster.databases {
let name = &db.name;
// XXX: with a limited number of databases it is fine, but consider making it a HashMap
let pg_db = existing_dbs.iter().find(|r| r.name == *name);
let pg_db = existing_dbs.get(name);
enum DatabaseAction {
None,
@@ -520,7 +571,7 @@ pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
DatabaseAction::Create => " -> create",
DatabaseAction::Update => " -> update",
};
info!(" - {}:{}{}", db.name, db.owner, action_str);
info!(" - {}:{}{}", db.name, db.owner, action_str);
}
}
@@ -530,13 +581,32 @@ pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
/// Grant CREATE ON DATABASE to the database owner and do some other alters and grants
/// to allow users creating trusted extensions and re-creating `public` schema, for example.
#[instrument(skip_all)]
pub fn handle_grants(spec: &ComputeSpec, connstr: &str) -> Result<()> {
info!("cluster spec grants:");
pub fn handle_grants(spec: &ComputeSpec, client: &mut Client, connstr: &str) -> Result<()> {
info!("modifying database permissions");
let existing_dbs = get_existing_dbs(client)?;
// Do some per-database access adjustments. Ideally we'd do this at db creation time,
// but CREATE DATABASE isn't transactional, so we cannot create the db and do the grants
// atomically.
for db in &spec.cluster.databases {
match existing_dbs.get(&db.name) {
Some(pg_db) => {
if pg_db.restrict_conn || pg_db.invalid {
info!(
"skipping grants for db {} (invalid: {}, connections not allowed: {})",
db.name, pg_db.invalid, pg_db.restrict_conn
);
continue;
}
}
None => {
bail!(
"database {} doesn't exist in Postgres after handle_databases()",
db.name
);
}
}
let mut conf = Config::from_str(connstr)?;
conf.dbname(&db.name);
@@ -575,6 +645,11 @@ pub fn handle_grants(spec: &ComputeSpec, connstr: &str) -> Result<()> {
// Explicitly grant CREATE ON SCHEMA PUBLIC to the web_access user.
// This is needed because since postgres 15 this privilege is removed by default.
// TODO: web_access hasn't been created for almost a year. There could still be
// active users of year-old projects, but hopefully not, so check it and
// remove this code if possible. The worst that could happen is that a
// user won't be able to use the public schema in NEW databases created in a
// very OLD project.
let grant_query = "DO $$\n\
BEGIN\n\
IF EXISTS(\n\
@@ -597,7 +672,11 @@ pub fn handle_grants(spec: &ComputeSpec, connstr: &str) -> Result<()> {
$$;"
.to_string();
info!("grant query for db {} : {}", &db.name, &grant_query);
info!(
"grant query for db {} : {}",
&db.name,
inlinify(&grant_query)
);
db_client.simple_query(&grant_query)?;
}
@@ -618,3 +697,117 @@ pub fn handle_extensions(spec: &ComputeSpec, client: &mut Client) -> Result<()>
Ok(())
}
/// Run CREATE and ALTER EXTENSION neon UPDATE for postgres database
#[instrument(skip_all)]
pub fn handle_extension_neon(client: &mut Client) -> Result<()> {
info!("handle extension neon");
let mut query = "CREATE SCHEMA IF NOT EXISTS neon";
client.simple_query(query)?;
query = "CREATE EXTENSION IF NOT EXISTS neon WITH SCHEMA neon";
info!("create neon extension with query: {}", query);
client.simple_query(query)?;
query = "UPDATE pg_extension SET extrelocatable = true WHERE extname = 'neon'";
client.simple_query(query)?;
query = "ALTER EXTENSION neon SET SCHEMA neon";
info!("alter neon extension schema with query: {}", query);
client.simple_query(query)?;
// This will be a no-op if the extension is already up to date,
// which may happen in two cases:
// - the extension was just installed
// - the extension was already installed and is up to date
let query = "ALTER EXTENSION neon UPDATE";
info!("update neon extension schema with query: {}", query);
client.simple_query(query)?;
Ok(())
}
#[instrument(skip_all)]
pub fn handle_migrations(client: &mut Client) -> Result<()> {
info!("handle migrations");
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// !BE SURE TO ONLY ADD MIGRATIONS TO THE END OF THIS ARRAY. IF YOU DO NOT, VERY VERY BAD THINGS MAY HAPPEN!
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
let migrations = [
"ALTER ROLE neon_superuser BYPASSRLS",
r#"
DO $$
DECLARE
role_name text;
BEGIN
FOR role_name IN SELECT rolname FROM pg_roles WHERE pg_has_role(rolname, 'neon_superuser', 'member')
LOOP
RAISE NOTICE 'EXECUTING ALTER ROLE % INHERIT', quote_ident(role_name);
EXECUTE 'ALTER ROLE ' || quote_ident(role_name) || ' INHERIT';
END LOOP;
FOR role_name IN SELECT rolname FROM pg_roles
WHERE
NOT pg_has_role(rolname, 'neon_superuser', 'member') AND NOT starts_with(rolname, 'pg_')
LOOP
RAISE NOTICE 'EXECUTING ALTER ROLE % NOBYPASSRLS', quote_ident(role_name);
EXECUTE 'ALTER ROLE ' || quote_ident(role_name) || ' NOBYPASSRLS';
END LOOP;
END $$;
"#,
r#"
DO $$
BEGIN
IF (SELECT setting::numeric >= 160000 FROM pg_settings WHERE name = 'server_version_num') THEN
EXECUTE 'GRANT pg_create_subscription TO neon_superuser';
END IF;
END
$$;"#,
];
let mut query = "CREATE SCHEMA IF NOT EXISTS neon_migration";
client.simple_query(query)?;
query = "CREATE TABLE IF NOT EXISTS neon_migration.migration_id (key INT NOT NULL PRIMARY KEY, id bigint NOT NULL DEFAULT 0)";
client.simple_query(query)?;
query = "INSERT INTO neon_migration.migration_id VALUES (0, 0) ON CONFLICT DO NOTHING";
client.simple_query(query)?;
query = "ALTER SCHEMA neon_migration OWNER TO cloud_admin";
client.simple_query(query)?;
query = "REVOKE ALL ON SCHEMA neon_migration FROM PUBLIC";
client.simple_query(query)?;
query = "SELECT id FROM neon_migration.migration_id";
let row = client.query_one(query, &[])?;
let mut current_migration: usize = row.get::<&str, i64>("id") as usize;
let starting_migration_id = current_migration;
query = "BEGIN";
client.simple_query(query)?;
while current_migration < migrations.len() {
info!("Running migration:\n{}\n", migrations[current_migration]);
client.simple_query(migrations[current_migration])?;
current_migration += 1;
}
let setval = format!(
"UPDATE neon_migration.migration_id SET id={}",
migrations.len()
);
client.simple_query(&setval)?;
query = "COMMIT";
client.simple_query(query)?;
info!(
"Ran {} migrations",
(migrations.len() - starting_migration_id)
);
Ok(())
}
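The bookkeeping above reduces to a simple rule: only array entries at indices greater than or equal to the stored `neon_migration.migration_id` value are executed, and the id is then set to the array length. A minimal, self-contained sketch of that rule (hypothetical values, not tied to the real table):
// Hypothetical values: 3 migrations already applied, the array has grown to 4,
// so exactly one entry (index 3) runs; a rerun with the stored id at 4 runs nothing.
let stored_id: usize = 3;
let migrations = ["m0", "m1", "m2", "m3_new"];
for idx in stored_id..migrations.len() {
    println!("running migration {idx}: {}", migrations[idx]);
}
let new_stored_id = migrations.len(); // persisted back to neon_migration.migration_id
assert_eq!(new_stored_id, 4);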

View File

@@ -28,7 +28,7 @@ mod pg_helpers_tests {
assert_eq!(
spec.cluster.settings.as_pg_settings(),
r#"fsync = off
wal_level = replica
wal_level = logical
hot_standby = on
neon.safekeepers = '127.0.0.1:6502,127.0.0.1:6503,127.0.0.1:6501'
wal_log_hints = on

View File

@@ -6,8 +6,13 @@ license.workspace = true
[dependencies]
anyhow.workspace = true
async-trait.workspace = true
camino.workspace = true
clap.workspace = true
comfy-table.workspace = true
diesel = { version = "2.1.4", features = ["postgres"]}
diesel_migrations = { version = "2.1.0", features = ["postgres"]}
futures.workspace = true
git-version.workspace = true
nix.workspace = true
once_cell.workspace = true
@@ -16,6 +21,7 @@ hex.workspace = true
hyper.workspace = true
regex.workspace = true
reqwest = { workspace = true, features = ["blocking", "json"] }
scopeguard.workspace = true
serde.workspace = true
serde_json.workspace = true
serde_with.workspace = true
@@ -23,10 +29,11 @@ tar.workspace = true
thiserror.workspace = true
toml.workspace = true
tokio.workspace = true
tokio-postgres.workspace = true
tokio-util.workspace = true
url.workspace = true
# Note: Do not directly depend on pageserver or safekeeper; use pageserver_api or safekeeper_api
# instead, so that recompile times are better.
pageserver_api.workspace = true
pageserver_client.workspace = true
postgres_backend.workspace = true
safekeeper_api.workspace = true
postgres_connection.workspace = true

View File

@@ -0,0 +1,34 @@
[package]
name = "attachment_service"
version = "0.1.0"
edition.workspace = true
license.workspace = true
[dependencies]
anyhow.workspace = true
aws-config.workspace = true
aws-sdk-secretsmanager.workspace = true
camino.workspace = true
clap.workspace = true
futures.workspace = true
git-version.workspace = true
hyper.workspace = true
pageserver_api.workspace = true
pageserver_client.workspace = true
postgres_connection.workspace = true
reqwest.workspace = true
serde.workspace = true
serde_json.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-util.workspace = true
tracing.workspace = true
diesel = { version = "2.1.4", features = ["serde_json", "postgres"] }
diesel_migrations = { version = "2.1.0" }
utils = { path = "../../libs/utils/" }
metrics = { path = "../../libs/metrics/" }
control_plane = { path = ".." }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }

View File

@@ -0,0 +1,6 @@
-- This file was automatically created by Diesel to setup helper functions
-- and other internal bookkeeping. This file is safe to edit, any future
-- changes will be added to existing projects as new migrations.
DROP FUNCTION IF EXISTS diesel_manage_updated_at(_tbl regclass);
DROP FUNCTION IF EXISTS diesel_set_updated_at();

View File

@@ -0,0 +1,36 @@
-- This file was automatically created by Diesel to setup helper functions
-- and other internal bookkeeping. This file is safe to edit, any future
-- changes will be added to existing projects as new migrations.
-- Sets up a trigger for the given table to automatically set a column called
-- `updated_at` whenever the row is modified (unless `updated_at` was included
-- in the modified columns)
--
-- # Example
--
-- ```sql
-- CREATE TABLE users (id SERIAL PRIMARY KEY, updated_at TIMESTAMP NOT NULL DEFAULT NOW());
--
-- SELECT diesel_manage_updated_at('users');
-- ```
CREATE OR REPLACE FUNCTION diesel_manage_updated_at(_tbl regclass) RETURNS VOID AS $$
BEGIN
EXECUTE format('CREATE TRIGGER set_updated_at BEFORE UPDATE ON %s
FOR EACH ROW EXECUTE PROCEDURE diesel_set_updated_at()', _tbl);
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION diesel_set_updated_at() RETURNS trigger AS $$
BEGIN
IF (
NEW IS DISTINCT FROM OLD AND
NEW.updated_at IS NOT DISTINCT FROM OLD.updated_at
) THEN
NEW.updated_at := current_timestamp;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

View File

@@ -0,0 +1 @@
DROP TABLE tenant_shards;

View File

@@ -0,0 +1,12 @@
CREATE TABLE tenant_shards (
tenant_id VARCHAR NOT NULL,
shard_number INTEGER NOT NULL,
shard_count INTEGER NOT NULL,
PRIMARY KEY(tenant_id, shard_number, shard_count),
shard_stripe_size INTEGER NOT NULL,
generation INTEGER NOT NULL,
generation_pageserver BIGINT NOT NULL,
placement_policy VARCHAR NOT NULL,
-- config is JSON encoded, opaque to the database.
config TEXT NOT NULL
);

View File

@@ -0,0 +1 @@
DROP TABLE nodes;

View File

@@ -0,0 +1,10 @@
CREATE TABLE nodes (
node_id BIGINT PRIMARY KEY NOT NULL,
scheduling_policy VARCHAR NOT NULL,
listen_http_addr VARCHAR NOT NULL,
listen_http_port INTEGER NOT NULL,
listen_pg_addr VARCHAR NOT NULL,
listen_pg_port INTEGER NOT NULL
);

View File

@@ -0,0 +1,318 @@
use std::{collections::HashMap, time::Duration};
use control_plane::endpoint::{ComputeControlPlane, EndpointStatus};
use control_plane::local_env::LocalEnv;
use hyper::{Method, StatusCode};
use pageserver_api::shard::{ShardCount, ShardIndex, ShardNumber, TenantShardId};
use postgres_connection::parse_host_port;
use serde::{Deserialize, Serialize};
use tokio_util::sync::CancellationToken;
use utils::{
backoff::{self},
id::{NodeId, TenantId},
};
use crate::service::Config;
const BUSY_DELAY: Duration = Duration::from_secs(1);
const SLOWDOWN_DELAY: Duration = Duration::from_secs(5);
pub(crate) const API_CONCURRENCY: usize = 32;
pub(super) struct ComputeHookTenant {
shards: Vec<(ShardIndex, NodeId)>,
}
#[derive(Serialize, Deserialize, Debug)]
struct ComputeHookNotifyRequestShard {
node_id: NodeId,
shard_number: ShardNumber,
}
/// Request body that we send to the control plane to notify it of where a tenant is attached
#[derive(Serialize, Deserialize, Debug)]
struct ComputeHookNotifyRequest {
tenant_id: TenantId,
shards: Vec<ComputeHookNotifyRequestShard>,
}
/// Error type for attempts to call into the control plane compute notification hook
#[derive(thiserror::Error, Debug)]
pub(crate) enum NotifyError {
// Request was not sent successfully, e.g. a transport error
#[error("Sending request: {0}")]
Request(#[from] reqwest::Error),
// Request could not be serviced right now due to an ongoing operation in the control plane, but should be possible soon.
#[error("Control plane tenant busy")]
Busy,
// Explicit 429 response asking us to retry less frequently
#[error("Control plane overloaded")]
SlowDown,
// A 503 response indicates the control plane can't handle the request right now
#[error("Control plane unavailable (status {0})")]
Unavailable(StatusCode),
// API returned unexpected non-success status. We will retry, but log a warning.
#[error("Control plane returned unexpected status {0}")]
Unexpected(StatusCode),
// We shut down while sending
#[error("Shutting down")]
ShuttingDown,
// A response indicates we will never succeed, such as 400 or 404
#[error("Non-retryable error {0}")]
Fatal(StatusCode),
}
impl ComputeHookTenant {
async fn maybe_reconfigure(&mut self, tenant_id: TenantId) -> Option<ComputeHookNotifyRequest> {
// Find the highest shard count and drop any shards that aren't
// for that shard count.
let shard_count = self.shards.iter().map(|(k, _v)| k.shard_count).max();
let Some(shard_count) = shard_count else {
// No shards, nothing to do.
tracing::info!("ComputeHookTenant::maybe_reconfigure: no shards");
return None;
};
self.shards.retain(|(k, _v)| k.shard_count == shard_count);
self.shards
.sort_by_key(|(shard, _node_id)| shard.shard_number);
if self.shards.len() == shard_count.0 as usize || shard_count == ShardCount(0) {
// We have pageservers for all the shards: emit a configuration update
return Some(ComputeHookNotifyRequest {
tenant_id,
shards: self
.shards
.iter()
.map(|(shard, node_id)| ComputeHookNotifyRequestShard {
shard_number: shard.shard_number,
node_id: *node_id,
})
.collect(),
});
} else {
tracing::info!(
"ComputeHookTenant::maybe_reconfigure: not enough shards ({}/{})",
self.shards.len(),
shard_count.0
);
}
None
}
}
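A sketch of how this aggregation behaves (a hypothetical test, not part of the diff; it assumes `TenantId::generate()` from `utils::id` and a tokio test runtime): no request is produced until a pageserver is known for every shard of the highest shard count.
#[tokio::test]
async fn sketch_maybe_reconfigure_waits_for_all_shards() {
    let tenant_id = TenantId::generate();
    let mut tenant = ComputeHookTenant { shards: Vec::new() };
    // Only shard 0 of 2 is known: not enough information to notify yet.
    let shard0 = ShardIndex { shard_number: ShardNumber(0), shard_count: ShardCount(2) };
    tenant.shards.push((shard0, NodeId(1)));
    assert!(tenant.maybe_reconfigure(tenant_id).await.is_none());
    // Once shard 1 of 2 is known too, a request covering both shards is emitted.
    let shard1 = ShardIndex { shard_number: ShardNumber(1), shard_count: ShardCount(2) };
    tenant.shards.push((shard1, NodeId(2)));
    let request = tenant.maybe_reconfigure(tenant_id).await.expect("all shards known");
    assert_eq!(request.shards.len(), 2);
}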
/// The compute hook is a destination for notifications about changes to tenant:pageserver
/// mapping. It aggregates updates for the shards in a tenant, and when appropriate reconfigures
/// the compute connection string.
pub(super) struct ComputeHook {
config: Config,
state: tokio::sync::Mutex<HashMap<TenantId, ComputeHookTenant>>,
authorization_header: Option<String>,
}
impl ComputeHook {
pub(super) fn new(config: Config) -> Self {
let authorization_header = config
.control_plane_jwt_token
.clone()
.map(|jwt| format!("Bearer {}", jwt));
Self {
state: Default::default(),
config,
authorization_header,
}
}
/// For test environments: use neon_local's LocalEnv to update compute
async fn do_notify_local(
&self,
reconfigure_request: ComputeHookNotifyRequest,
) -> anyhow::Result<()> {
let env = match LocalEnv::load_config() {
Ok(e) => e,
Err(e) => {
tracing::warn!("Couldn't load neon_local config, skipping compute update ({e})");
return Ok(());
}
};
let cplane =
ComputeControlPlane::load(env.clone()).expect("Error loading compute control plane");
let ComputeHookNotifyRequest { tenant_id, shards } = reconfigure_request;
let compute_pageservers = shards
.into_iter()
.map(|shard| {
let ps_conf = env
.get_pageserver_conf(shard.node_id)
.expect("Unknown pageserver");
let (pg_host, pg_port) = parse_host_port(&ps_conf.listen_pg_addr)
.expect("Unable to parse listen_pg_addr");
(pg_host, pg_port.unwrap_or(5432))
})
.collect::<Vec<_>>();
for (endpoint_name, endpoint) in &cplane.endpoints {
if endpoint.tenant_id == tenant_id && endpoint.status() == EndpointStatus::Running {
tracing::info!("🔁 Reconfiguring endpoint {}", endpoint_name,);
endpoint.reconfigure(compute_pageservers.clone()).await?;
}
}
Ok(())
}
async fn do_notify_iteration(
&self,
client: &reqwest::Client,
url: &String,
reconfigure_request: &ComputeHookNotifyRequest,
cancel: &CancellationToken,
) -> Result<(), NotifyError> {
let req = client.request(Method::POST, url);
let req = if let Some(value) = &self.authorization_header {
req.header(reqwest::header::AUTHORIZATION, value)
} else {
req
};
tracing::debug!(
"Sending notify request to {} ({:?})",
url,
reconfigure_request
);
let send_result = req.json(&reconfigure_request).send().await;
let response = match send_result {
Ok(r) => r,
Err(e) => return Err(e.into()),
};
// Treat all 2xx responses as success
if response.status() >= StatusCode::OK && response.status() < StatusCode::MULTIPLE_CHOICES {
if response.status() != StatusCode::OK {
// Non-200 2xx response: it doesn't make sense to retry, but this is unexpected, so
// log a warning.
tracing::warn!(
"Unexpected 2xx response code {} from control plane",
response.status()
);
}
return Ok(());
}
// Error response codes
match response.status() {
StatusCode::TOO_MANY_REQUESTS => {
// TODO: 429 handling should be global: set some state visible to other requests
// so that they will delay before starting, rather than all notifications trying
// once before backing off.
tokio::time::timeout(SLOWDOWN_DELAY, cancel.cancelled())
.await
.ok();
Err(NotifyError::SlowDown)
}
StatusCode::LOCKED => {
// Delay our retry if busy: the usual fast exponential backoff in backoff::retry
// is not appropriate
tokio::time::timeout(BUSY_DELAY, cancel.cancelled())
.await
.ok();
Err(NotifyError::Busy)
}
StatusCode::SERVICE_UNAVAILABLE
| StatusCode::GATEWAY_TIMEOUT
| StatusCode::BAD_GATEWAY => Err(NotifyError::Unavailable(response.status())),
StatusCode::BAD_REQUEST | StatusCode::UNAUTHORIZED | StatusCode::FORBIDDEN => {
Err(NotifyError::Fatal(response.status()))
}
_ => Err(NotifyError::Unexpected(response.status())),
}
}
async fn do_notify(
&self,
url: &String,
reconfigure_request: ComputeHookNotifyRequest,
cancel: &CancellationToken,
) -> Result<(), NotifyError> {
let client = reqwest::Client::new();
backoff::retry(
|| self.do_notify_iteration(&client, url, &reconfigure_request, cancel),
|e| matches!(e, NotifyError::Fatal(_)),
3,
10,
"Send compute notification",
cancel,
)
.await
.ok_or_else(|| NotifyError::ShuttingDown)
.and_then(|x| x)
}
/// Call this to notify the compute (postgres) tier of new pageservers to use
/// for a tenant. notify() is called by each shard individually, and this function
/// will decide whether an update to the tenant is sent. An update is sent on the
/// condition that:
/// - We know a pageserver for every shard.
/// - All the shards have the same shard_count (i.e. we are not mid-split)
///
/// Cancellation token enables callers to drop out, e.g. if calling from a Reconciler
/// that is cancelled.
///
/// This function is fallible, including in the case that the control plane is transiently
/// unavailable. A limited number of retries are done internally to efficiently hide short unavailability
/// periods, but we don't retry forever. The **caller** is responsible for handling failures and
/// ensuring that they eventually call again to ensure that the compute is eventually notified of
/// the proper pageserver nodes for a tenant.
#[tracing::instrument(skip_all, fields(tenant_shard_id, node_id))]
pub(super) async fn notify(
&self,
tenant_shard_id: TenantShardId,
node_id: NodeId,
cancel: &CancellationToken,
) -> Result<(), NotifyError> {
let mut locked = self.state.lock().await;
let entry = locked
.entry(tenant_shard_id.tenant_id)
.or_insert_with(|| ComputeHookTenant { shards: Vec::new() });
let shard_index = ShardIndex {
shard_count: tenant_shard_id.shard_count,
shard_number: tenant_shard_id.shard_number,
};
let mut set = false;
for (existing_shard, existing_node) in &mut entry.shards {
if *existing_shard == shard_index {
*existing_node = node_id;
set = true;
}
}
if !set {
entry.shards.push((shard_index, node_id));
}
let reconfigure_request = entry.maybe_reconfigure(tenant_shard_id.tenant_id).await;
let Some(reconfigure_request) = reconfigure_request else {
// The tenant doesn't yet have pageservers for all its shards: we won't notify anything
// until it does.
tracing::debug!("Tenant isn't yet ready to emit a notification",);
return Ok(());
};
if let Some(notify_url) = &self.config.compute_hook_url {
self.do_notify(notify_url, reconfigure_request, cancel)
.await
} else {
self.do_notify_local(reconfigure_request)
.await
.map_err(|e| {
// This path is for testing only, so munge the error into our prod-style error type.
tracing::error!("Local notification hook failed: {e}");
NotifyError::Fatal(StatusCode::INTERNAL_SERVER_ERROR)
})
}
}
}
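As the comment on `notify()` stresses, retrying is the caller's responsibility. A hypothetical caller loop (a sketch only; `compute_hook`, `tenant_shard_id`, `node_id` and `cancel` are assumed bindings, not from the diff) might look like:
// Keep re-notifying until the control plane accepts the update or we are shutting down.
loop {
    match compute_hook.notify(tenant_shard_id, node_id, &cancel).await {
        Ok(()) => break,
        Err(NotifyError::ShuttingDown) => break,
        Err(e) => {
            tracing::warn!("Compute notification failed, will retry: {e}");
            tokio::time::sleep(std::time::Duration::from_secs(5)).await;
        }
    }
}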

View File

@@ -0,0 +1,422 @@
use crate::reconciler::ReconcileError;
use crate::service::{Service, STARTUP_RECONCILE_TIMEOUT};
use hyper::{Body, Request, Response};
use hyper::{StatusCode, Uri};
use pageserver_api::models::{
TenantCreateRequest, TenantLocationConfigRequest, TimelineCreateRequest,
};
use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api;
use std::sync::Arc;
use std::time::{Duration, Instant};
use utils::auth::SwappableJwtAuth;
use utils::http::endpoint::{auth_middleware, request_span};
use utils::http::request::parse_request_param;
use utils::id::{TenantId, TimelineId};
use utils::{
http::{
endpoint::{self},
error::ApiError,
json::{json_request, json_response},
RequestExt, RouterBuilder,
},
id::NodeId,
};
use pageserver_api::control_api::{ReAttachRequest, ValidateRequest};
use control_plane::attachment_service::{
AttachHookRequest, InspectRequest, NodeConfigureRequest, NodeRegisterRequest,
TenantShardMigrateRequest,
};
/// State available to HTTP request handlers
#[derive(Clone)]
pub struct HttpState {
service: Arc<crate::service::Service>,
auth: Option<Arc<SwappableJwtAuth>>,
allowlist_routes: Vec<Uri>,
}
impl HttpState {
pub fn new(service: Arc<crate::service::Service>, auth: Option<Arc<SwappableJwtAuth>>) -> Self {
let allowlist_routes = ["/status"]
.iter()
.map(|v| v.parse().unwrap())
.collect::<Vec<_>>();
Self {
service,
auth,
allowlist_routes,
}
}
}
#[inline(always)]
fn get_state(request: &Request<Body>) -> &HttpState {
request
.data::<Arc<HttpState>>()
.expect("unknown state type")
.as_ref()
}
/// Pageserver calls into this on startup, to learn which tenants it should attach
async fn handle_re_attach(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let reattach_req = json_request::<ReAttachRequest>(&mut req).await?;
let state = get_state(&req);
json_response(
StatusCode::OK,
state
.service
.re_attach(reattach_req)
.await
.map_err(ApiError::InternalServerError)?,
)
}
/// Pageserver calls into this before doing deletions, to confirm that it still
/// holds the latest generation for the tenants with deletions enqueued
async fn handle_validate(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let validate_req = json_request::<ValidateRequest>(&mut req).await?;
let state = get_state(&req);
json_response(StatusCode::OK, state.service.validate(validate_req))
}
/// Call into this before attaching a tenant to a pageserver, to acquire a generation number
/// (in the real control plane this is unnecessary, because the same program is managing
/// generation numbers and doing attachments).
async fn handle_attach_hook(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let attach_req = json_request::<AttachHookRequest>(&mut req).await?;
let state = get_state(&req);
json_response(
StatusCode::OK,
state
.service
.attach_hook(attach_req)
.await
.map_err(ApiError::InternalServerError)?,
)
}
async fn handle_inspect(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let inspect_req = json_request::<InspectRequest>(&mut req).await?;
let state = get_state(&req);
json_response(StatusCode::OK, state.service.inspect(inspect_req))
}
async fn handle_tenant_create(
service: Arc<Service>,
mut req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let create_req = json_request::<TenantCreateRequest>(&mut req).await?;
json_response(StatusCode::OK, service.tenant_create(create_req).await?)
}
// For tenant and timeline deletions, which both implement an "initially return 202, then 404 once
// we're done" semantic, we wrap with a retry loop to expose a simpler API upstream. This avoids
// needing to track a "deleting" state for tenants.
async fn deletion_wrapper<R, F>(service: Arc<Service>, f: F) -> Result<Response<Body>, ApiError>
where
R: std::future::Future<Output = Result<StatusCode, ApiError>> + Send + 'static,
F: Fn(Arc<Service>) -> R + Send + Sync + 'static,
{
let started_at = Instant::now();
// To keep deletion reasonably snappy for small tenants, initially check after 1 second if deletion
// completed.
let mut retry_period = Duration::from_secs(1);
// On subsequent retries, wait longer.
let max_retry_period = Duration::from_secs(5);
// Enable callers with a 30 second request timeout to reliably get a response
let max_wait = Duration::from_secs(25);
loop {
let status = f(service.clone()).await?;
match status {
StatusCode::ACCEPTED => {
tracing::info!("Deletion accepted, waiting to try again...");
tokio::time::sleep(retry_period).await;
retry_period = max_retry_period;
}
StatusCode::NOT_FOUND => {
tracing::info!("Deletion complete");
return json_response(StatusCode::OK, ());
}
_ => {
tracing::warn!("Unexpected status {status}");
return json_response(status, ());
}
}
let now = Instant::now();
if now + retry_period > started_at + max_wait {
tracing::info!("Deletion timed out waiting for 404");
// REQUEST_TIMEOUT would be more appropriate, but CONFLICT is already part of
// the pageserver's swagger definition for this endpoint, and has the same desired
// effect of causing the control plane to retry later.
return json_response(StatusCode::CONFLICT, ());
}
}
}
async fn handle_tenant_location_config(
service: Arc<Service>,
mut req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
let config_req = json_request::<TenantLocationConfigRequest>(&mut req).await?;
json_response(
StatusCode::OK,
service
.tenant_location_config(tenant_id, config_req)
.await?,
)
}
async fn handle_tenant_delete(
service: Arc<Service>,
req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
deletion_wrapper(service, move |service| async move {
service.tenant_delete(tenant_id).await
})
.await
}
async fn handle_tenant_timeline_create(
service: Arc<Service>,
mut req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
let create_req = json_request::<TimelineCreateRequest>(&mut req).await?;
json_response(
StatusCode::OK,
service
.tenant_timeline_create(tenant_id, create_req)
.await?,
)
}
async fn handle_tenant_timeline_delete(
service: Arc<Service>,
req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
let timeline_id: TimelineId = parse_request_param(&req, "timeline_id")?;
deletion_wrapper(service, move |service| async move {
service.tenant_timeline_delete(tenant_id, timeline_id).await
})
.await
}
async fn handle_tenant_timeline_passthrough(
service: Arc<Service>,
req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
let Some(path) = req.uri().path_and_query() else {
// This should never happen, our request router only calls us if there is a path
return Err(ApiError::BadRequest(anyhow::anyhow!("Missing path")));
};
tracing::info!("Proxying request for tenant {} ({})", tenant_id, path);
// Find the node that holds shard zero
let (base_url, tenant_shard_id) = service.tenant_shard0_baseurl(tenant_id)?;
// Callers will always pass an unsharded tenant ID. Before proxying, we must
// rewrite this to a shard-aware shard zero ID.
let path = format!("{}", path);
let tenant_str = tenant_id.to_string();
let tenant_shard_str = format!("{}", tenant_shard_id);
let path = path.replace(&tenant_str, &tenant_shard_str);
let client = mgmt_api::Client::new(base_url, service.get_config().jwt_token.as_deref());
let resp = client.get_raw(path).await.map_err(|_e|
// FIXME: give ApiError a proper Unavailable variant. We return 503 here because
// if we can't successfully send a request to the pageserver, we aren't available.
ApiError::ShuttingDown)?;
// We have a reqwest::Response but would like an http::Response
let mut builder = hyper::Response::builder()
.status(resp.status())
.version(resp.version());
for (k, v) in resp.headers() {
builder = builder.header(k, v);
}
let response = builder
.body(Body::wrap_stream(resp.bytes_stream()))
.map_err(|e| ApiError::InternalServerError(e.into()))?;
Ok(response)
}
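The rewrite above is a plain string substitution of the unsharded tenant ID with the ID of shard zero. A standalone sketch with made-up values (the `-0004` suffix assumes the usual `<shard_number><shard_count>` hex display of TenantShardId):
// Hypothetical path: the client addresses the tenant without a shard suffix,
// and we substitute the shard-zero id before proxying to the pageserver.
let path = "/v1/tenant/3fa85f6457174562b3fc2c963f66afa6/timeline".to_string();
let tenant_str = "3fa85f6457174562b3fc2c963f66afa6";
let tenant_shard_str = "3fa85f6457174562b3fc2c963f66afa6-0004";
assert_eq!(
    path.replace(tenant_str, tenant_shard_str),
    "/v1/tenant/3fa85f6457174562b3fc2c963f66afa6-0004/timeline"
);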
async fn handle_tenant_locate(
service: Arc<Service>,
req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_id: TenantId = parse_request_param(&req, "tenant_id")?;
json_response(StatusCode::OK, service.tenant_locate(tenant_id)?)
}
async fn handle_node_register(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let register_req = json_request::<NodeRegisterRequest>(&mut req).await?;
let state = get_state(&req);
state.service.node_register(register_req).await?;
json_response(StatusCode::OK, ())
}
async fn handle_node_list(req: Request<Body>) -> Result<Response<Body>, ApiError> {
let state = get_state(&req);
json_response(StatusCode::OK, state.service.node_list().await?)
}
async fn handle_node_configure(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let node_id: NodeId = parse_request_param(&req, "node_id")?;
let config_req = json_request::<NodeConfigureRequest>(&mut req).await?;
if node_id != config_req.node_id {
return Err(ApiError::BadRequest(anyhow::anyhow!(
"Path and body node_id differ"
)));
}
let state = get_state(&req);
json_response(StatusCode::OK, state.service.node_configure(config_req)?)
}
async fn handle_tenant_shard_migrate(
service: Arc<Service>,
mut req: Request<Body>,
) -> Result<Response<Body>, ApiError> {
let tenant_shard_id: TenantShardId = parse_request_param(&req, "tenant_shard_id")?;
let migrate_req = json_request::<TenantShardMigrateRequest>(&mut req).await?;
json_response(
StatusCode::OK,
service
.tenant_shard_migrate(tenant_shard_id, migrate_req)
.await?,
)
}
/// Status endpoint is just used for checking that our HTTP listener is up
async fn handle_status(_req: Request<Body>) -> Result<Response<Body>, ApiError> {
json_response(StatusCode::OK, ())
}
impl From<ReconcileError> for ApiError {
fn from(value: ReconcileError) -> Self {
ApiError::Conflict(format!("Reconciliation error: {}", value))
}
}
/// Common wrapper for request handlers that call into Service and will operate on tenants: they must only
/// be allowed to run if Service has finished its initial reconciliation.
async fn tenant_service_handler<R, H>(request: Request<Body>, handler: H) -> R::Output
where
R: std::future::Future<Output = Result<Response<Body>, ApiError>> + Send + 'static,
H: FnOnce(Arc<Service>, Request<Body>) -> R + Send + Sync + 'static,
{
let state = get_state(&request);
let service = state.service.clone();
let startup_complete = service.startup_complete.clone();
if tokio::time::timeout(STARTUP_RECONCILE_TIMEOUT, startup_complete.wait())
.await
.is_err()
{
// This shouldn't happen: it is the responsibility of [`Service::startup_reconcile`] to use appropriate
// timeouts around its remote calls, to bound its runtime.
return Err(ApiError::Timeout(
"Timed out waiting for service readiness".into(),
));
}
request_span(
request,
|request| async move { handler(service, request).await },
)
.await
}
pub fn make_router(
service: Arc<Service>,
auth: Option<Arc<SwappableJwtAuth>>,
) -> RouterBuilder<hyper::Body, ApiError> {
let mut router = endpoint::make_router();
if auth.is_some() {
router = router.middleware(auth_middleware(|request| {
let state = get_state(request);
if state.allowlist_routes.contains(request.uri()) {
None
} else {
state.auth.as_deref()
}
}))
}
router
.data(Arc::new(HttpState::new(service, auth)))
// Non-prefixed generic endpoints (status, metrics)
.get("/status", |r| request_span(r, handle_status))
// Upcalls for the pageserver: point the pageserver's `control_plane_api` config to this prefix
.post("/upcall/v1/re-attach", |r| {
request_span(r, handle_re_attach)
})
.post("/upcall/v1/validate", |r| request_span(r, handle_validate))
// Test/dev/debug endpoints
.post("/debug/v1/attach-hook", |r| {
request_span(r, handle_attach_hook)
})
.post("/debug/v1/inspect", |r| request_span(r, handle_inspect))
.get("/control/v1/tenant/:tenant_id/locate", |r| {
tenant_service_handler(r, handle_tenant_locate)
})
// Node operations
.post("/control/v1/node", |r| {
request_span(r, handle_node_register)
})
.get("/control/v1/node", |r| request_span(r, handle_node_list))
.put("/control/v1/node/:node_id/config", |r| {
request_span(r, handle_node_configure)
})
// Tenant Shard operations
.put("/control/v1/tenant/:tenant_shard_id/migrate", |r| {
tenant_service_handler(r, handle_tenant_shard_migrate)
})
// Tenant operations
// The ^/v1/ endpoints act as a "Virtual Pageserver", enabling shard-naive clients to call into
// this service to manage tenants that actually consist of many tenant shards, as if they are a single entity.
.post("/v1/tenant", |r| {
tenant_service_handler(r, handle_tenant_create)
})
.delete("/v1/tenant/:tenant_id", |r| {
tenant_service_handler(r, handle_tenant_delete)
})
.put("/v1/tenant/:tenant_id/location_config", |r| {
tenant_service_handler(r, handle_tenant_location_config)
})
// Timeline operations
.delete("/v1/tenant/:tenant_id/timeline/:timeline_id", |r| {
tenant_service_handler(r, handle_tenant_timeline_delete)
})
.post("/v1/tenant/:tenant_id/timeline", |r| {
tenant_service_handler(r, handle_tenant_timeline_create)
})
// Tenant detail GET passthrough to shard zero
.get("/v1/tenant/:tenant_id", |r| {
tenant_service_handler(r, handle_tenant_timeline_passthrough)
})
// Timeline GET passthrough to shard zero. Note that the `*` in the URL is a wildcard: any future
// timeline GET APIs will be implicitly included.
.get("/v1/tenant/:tenant_id/timeline*", |r| {
tenant_service_handler(r, handle_tenant_timeline_passthrough)
})
}

View File

@@ -0,0 +1,60 @@
use serde::{Deserialize, Serialize};
use utils::seqwait::MonotonicCounter;
mod compute_hook;
pub mod http;
mod node;
pub mod persistence;
mod reconciler;
mod scheduler;
mod schema;
pub mod service;
mod tenant_state;
#[derive(Clone, Serialize, Deserialize)]
enum PlacementPolicy {
/// Cheapest way to attach a tenant: just one pageserver, no secondary
Single,
/// Production-ready way to attach a tenant: one attached pageserver and
/// some number of secondaries.
Double(usize),
/// Do not attach to any pageservers
Detached,
}
#[derive(Ord, PartialOrd, Eq, PartialEq, Copy, Clone)]
struct Sequence(u64);
impl Sequence {
fn initial() -> Self {
Self(0)
}
}
impl std::fmt::Display for Sequence {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(f, "{}", self.0)
}
}
impl MonotonicCounter<Sequence> for Sequence {
fn cnt_advance(&mut self, v: Sequence) {
assert!(*self <= v);
*self = v;
}
fn cnt_value(&self) -> Sequence {
*self
}
}
impl Sequence {
fn next(&self) -> Sequence {
Sequence(self.0 + 1)
}
}
impl Default for PlacementPolicy {
fn default() -> Self {
PlacementPolicy::Double(1)
}
}
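A quick sketch of the `Sequence`/`MonotonicCounter` contract defined above (everything used here is local to this file): the counter only ever advances, and `cnt_advance` asserts that.
// Sketch: Sequence moves forward only.
let mut seq = Sequence::initial();     // Sequence(0)
let next = seq.next();                 // Sequence(1)
seq.cnt_advance(next);
assert!(seq.cnt_value() == Sequence(1));
// seq.cnt_advance(Sequence(0)) would panic, because the counter must be monotonic.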

View File

@@ -0,0 +1,259 @@
/// The attachment service mimics the aspects of the control plane API
/// that are required for a pageserver to operate.
///
/// This enables running & testing pageservers without a full-blown
/// deployment of the Neon cloud platform.
///
use anyhow::{anyhow, Context};
use attachment_service::http::make_router;
use attachment_service::persistence::Persistence;
use attachment_service::service::{Config, Service};
use aws_config::{self, BehaviorVersion, Region};
use camino::Utf8PathBuf;
use clap::Parser;
use diesel::Connection;
use metrics::launch_timestamp::LaunchTimestamp;
use std::sync::Arc;
use tokio::signal::unix::SignalKind;
use utils::auth::{JwtAuth, SwappableJwtAuth};
use utils::logging::{self, LogFormat};
use utils::{project_build_tag, project_git_version, tcp_listener};
project_git_version!(GIT_VERSION);
project_build_tag!(BUILD_TAG);
use diesel_migrations::{embed_migrations, EmbeddedMigrations};
pub const MIGRATIONS: EmbeddedMigrations = embed_migrations!("./migrations");
#[derive(Parser)]
#[command(author, version, about, long_about = None)]
#[command(arg_required_else_help(true))]
struct Cli {
/// Host and port to listen on, like `127.0.0.1:1234`
#[arg(short, long)]
listen: std::net::SocketAddr,
/// Public key for JWT authentication of clients
#[arg(long)]
public_key: Option<String>,
/// Token for authenticating this service with the pageservers it controls
#[arg(long)]
jwt_token: Option<String>,
/// Token for authenticating this service with the control plane, when calling
/// the compute notification endpoint
#[arg(long)]
control_plane_jwt_token: Option<String>,
/// URL to control plane compute notification endpoint
#[arg(long)]
compute_hook_url: Option<String>,
/// Path to the .json file to store state (will be created if it doesn't exist)
#[arg(short, long)]
path: Option<Utf8PathBuf>,
/// URL to connect to postgres, like postgresql://localhost:1234/attachment_service
#[arg(long)]
database_url: Option<String>,
}
/// Secrets may either be provided on the command line (for testing) or loaded from AWS Secrets Manager: this
/// type encapsulates the logic to decide which, and does the loading.
struct Secrets {
database_url: String,
public_key: Option<JwtAuth>,
jwt_token: Option<String>,
control_plane_jwt_token: Option<String>,
}
impl Secrets {
const DATABASE_URL_SECRET: &'static str = "rds-neon-storage-controller-url";
const PAGESERVER_JWT_TOKEN_SECRET: &'static str =
"neon-storage-controller-pageserver-jwt-token";
const CONTROL_PLANE_JWT_TOKEN_SECRET: &'static str =
"neon-storage-controller-control-plane-jwt-token";
const PUBLIC_KEY_SECRET: &'static str = "neon-storage-controller-public-key";
async fn load(args: &Cli) -> anyhow::Result<Self> {
match &args.database_url {
Some(url) => Self::load_cli(url, args),
None => Self::load_aws_sm().await,
}
}
async fn load_aws_sm() -> anyhow::Result<Self> {
let Ok(region) = std::env::var("AWS_REGION") else {
anyhow::bail!("AWS_REGION is not set, cannot load secrets automatically: either set this, or use CLI args to supply secrets");
};
let config = aws_config::defaults(BehaviorVersion::v2023_11_09())
.region(Region::new(region.clone()))
.load()
.await;
let asm = aws_sdk_secretsmanager::Client::new(&config);
let Some(database_url) = asm
.get_secret_value()
.secret_id(Self::DATABASE_URL_SECRET)
.send()
.await?
.secret_string()
.map(str::to_string)
else {
anyhow::bail!(
"Database URL secret not found at {region}/{}",
Self::DATABASE_URL_SECRET
)
};
let jwt_token = asm
.get_secret_value()
.secret_id(Self::PAGESERVER_JWT_TOKEN_SECRET)
.send()
.await?
.secret_string()
.map(str::to_string);
if jwt_token.is_none() {
tracing::warn!("No pageserver JWT token set: this will only work if authentication is disabled on the pageserver");
}
let control_plane_jwt_token = asm
.get_secret_value()
.secret_id(Self::CONTROL_PLANE_JWT_TOKEN_SECRET)
.send()
.await?
.secret_string()
.map(str::to_string);
if control_plane_jwt_token.is_none() {
tracing::warn!("No control plane JWT token set: this will only work if authentication is disabled in the control plane");
}
let public_key = asm
.get_secret_value()
.secret_id(Self::PUBLIC_KEY_SECRET)
.send()
.await?
.secret_string()
.map(str::to_string);
let public_key = match public_key {
Some(key) => Some(JwtAuth::from_key(key)?),
None => {
tracing::warn!(
"No public key set: inccoming HTTP requests will not be authenticated"
);
None
}
};
Ok(Self {
database_url,
public_key,
jwt_token,
control_plane_jwt_token,
})
}
fn load_cli(database_url: &str, args: &Cli) -> anyhow::Result<Self> {
let public_key = match &args.public_key {
None => None,
Some(key) => Some(JwtAuth::from_key(key.clone()).context("Loading public key")?),
};
Ok(Self {
database_url: database_url.to_owned(),
public_key,
jwt_token: args.jwt_token.clone(),
control_plane_jwt_token: args.control_plane_jwt_token.clone(),
})
}
}
async fn migration_run(database_url: &str) -> anyhow::Result<()> {
use diesel::PgConnection;
use diesel_migrations::{HarnessWithOutput, MigrationHarness};
let mut conn = PgConnection::establish(database_url)?;
HarnessWithOutput::write_to_stdout(&mut conn)
.run_pending_migrations(MIGRATIONS)
.map(|_| ())
.map_err(|e| anyhow::anyhow!(e))?;
Ok(())
}
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let launch_ts = Box::leak(Box::new(LaunchTimestamp::generate()));
logging::init(
LogFormat::Plain,
logging::TracingErrorLayerEnablement::Disabled,
logging::Output::Stdout,
)?;
let args = Cli::parse();
tracing::info!(
"version: {}, launch_timestamp: {}, build_tag {}, state at {}, listening on {}",
GIT_VERSION,
launch_ts.to_string(),
BUILD_TAG,
args.path.as_ref().unwrap_or(&Utf8PathBuf::from("<none>")),
args.listen
);
let secrets = Secrets::load(&args).await?;
let config = Config {
jwt_token: secrets.jwt_token,
control_plane_jwt_token: secrets.control_plane_jwt_token,
compute_hook_url: args.compute_hook_url,
};
// After loading secrets & config, but before starting anything else, apply database migrations
migration_run(&secrets.database_url)
.await
.context("Running database migrations")?;
let json_path = args.path;
let persistence = Arc::new(Persistence::new(secrets.database_url, json_path.clone()));
let service = Service::spawn(config, persistence.clone()).await?;
let http_listener = tcp_listener::bind(args.listen)?;
let auth = secrets
.public_key
.map(|jwt_auth| Arc::new(SwappableJwtAuth::new(jwt_auth)));
let router = make_router(service, auth)
.build()
.map_err(|err| anyhow!(err))?;
let router_service = utils::http::RouterService::new(router).unwrap();
let server = hyper::Server::from_tcp(http_listener)?.serve(router_service);
tracing::info!("Serving on {0}", args.listen);
tokio::task::spawn(server);
// Wait until we receive a signal
let mut sigint = tokio::signal::unix::signal(SignalKind::interrupt())?;
let mut sigquit = tokio::signal::unix::signal(SignalKind::quit())?;
let mut sigterm = tokio::signal::unix::signal(SignalKind::terminate())?;
tokio::select! {
_ = sigint.recv() => {},
_ = sigterm.recv() => {},
_ = sigquit.recv() => {},
}
tracing::info!("Terminating on signal");
if json_path.is_some() {
// Write out a JSON dump on shutdown: this is used in compat tests to avoid passing
// full postgres dumps around.
if let Err(e) = persistence.write_tenants_json().await {
tracing::error!("Failed to write JSON on shutdown: {e}")
}
}
std::process::exit(0);
}

View File

@@ -0,0 +1,50 @@
use control_plane::attachment_service::{NodeAvailability, NodeSchedulingPolicy};
use utils::id::NodeId;
use crate::persistence::NodePersistence;
#[derive(Clone)]
pub(crate) struct Node {
pub(crate) id: NodeId,
pub(crate) availability: NodeAvailability,
pub(crate) scheduling: NodeSchedulingPolicy,
pub(crate) listen_http_addr: String,
pub(crate) listen_http_port: u16,
pub(crate) listen_pg_addr: String,
pub(crate) listen_pg_port: u16,
}
impl Node {
pub(crate) fn base_url(&self) -> String {
format!("http://{}:{}", self.listen_http_addr, self.listen_http_port)
}
/// Is this node eligible to have work scheduled onto it?
pub(crate) fn may_schedule(&self) -> bool {
match self.availability {
NodeAvailability::Active => {}
NodeAvailability::Offline => return false,
}
match self.scheduling {
NodeSchedulingPolicy::Active => true,
NodeSchedulingPolicy::Draining => false,
NodeSchedulingPolicy::Filling => true,
NodeSchedulingPolicy::Pause => false,
}
}
pub(crate) fn to_persistent(&self) -> NodePersistence {
NodePersistence {
node_id: self.id.0 as i64,
scheduling_policy: self.scheduling.into(),
listen_http_addr: self.listen_http_addr.clone(),
listen_http_port: self.listen_http_port as i32,
listen_pg_addr: self.listen_pg_addr.clone(),
listen_pg_port: self.listen_pg_port as i32,
}
}
}
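A small sketch of the scheduling gate above (hypothetical node values, not part of the diff): only a node that is Active and whose policy is Active or Filling can receive new work.
// Hypothetical node: available and in the Filling policy, so it may take shards.
let node = Node {
    id: NodeId(1),
    availability: NodeAvailability::Active,
    scheduling: NodeSchedulingPolicy::Filling,
    listen_http_addr: "127.0.0.1".to_string(),
    listen_http_port: 9898,
    listen_pg_addr: "127.0.0.1".to_string(),
    listen_pg_port: 64000,
};
assert!(node.may_schedule());
let draining = Node { scheduling: NodeSchedulingPolicy::Draining, ..node.clone() };
assert!(!draining.may_schedule());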

View File

@@ -0,0 +1,398 @@
use std::collections::HashMap;
use std::str::FromStr;
use camino::Utf8Path;
use camino::Utf8PathBuf;
use control_plane::attachment_service::{NodeAvailability, NodeSchedulingPolicy};
use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::Connection;
use pageserver_api::models::TenantConfig;
use pageserver_api::shard::{ShardCount, ShardNumber, TenantShardId};
use serde::{Deserialize, Serialize};
use utils::generation::Generation;
use utils::id::{NodeId, TenantId};
use crate::node::Node;
use crate::PlacementPolicy;
/// ## What do we store?
///
/// The attachment service does not store most of its state durably.
///
/// The essential things to store durably are:
/// - generation numbers, as these must always advance monotonically to ensure data safety.
/// - Tenant's PlacementPolicy and TenantConfig, as the source of truth for these is something external.
/// - Node's scheduling policies, as the source of truth for these is something external.
///
/// Other things we store durably as an implementation detail:
/// - Node's host/port: this could be avoided if we made nodes emit a self-registering heartbeat,
/// but it is operationally simpler to make this service the authority for which nodes
/// it talks to.
///
/// ## Performance/efficiency
///
/// The attachment service does not go via the database for most things: there are
/// a couple of places where we must, and where efficiency matters:
/// - Incrementing generation numbers: the Reconciler has to wait for this to complete
/// before it can attach a tenant, so this acts as a bound on how fast things like
/// failover can happen.
/// - Pageserver re-attach: we will increment many shards' generations when this happens,
/// so it is important to avoid e.g. issuing O(N) queries.
///
/// Database calls relating to nodes have low performance requirements, as they are very rarely
/// updated, and reads of nodes are always from memory, not the database. We only require that
/// we can UPDATE a node's scheduling mode reasonably quickly to mark a bad node offline.
pub struct Persistence {
database_url: String,
// In test environments, we support loading+saving a JSON file. This is temporary, for the benefit of
// test_compatibility.py, so that we don't have to commit to making the database contents fully backward/forward
// compatible just yet.
json_path: Option<Utf8PathBuf>,
}
/// Legacy format, for use in JSON compat objects in test environment
#[derive(Serialize, Deserialize)]
struct JsonPersistence {
tenants: HashMap<TenantShardId, TenantShardPersistence>,
}
#[derive(thiserror::Error, Debug)]
pub(crate) enum DatabaseError {
#[error(transparent)]
Query(#[from] diesel::result::Error),
#[error(transparent)]
Connection(#[from] diesel::result::ConnectionError),
#[error("Logical error: {0}")]
Logical(String),
}
pub(crate) type DatabaseResult<T> = Result<T, DatabaseError>;
impl Persistence {
pub fn new(database_url: String, json_path: Option<Utf8PathBuf>) -> Self {
Self {
database_url,
json_path,
}
}
/// Call the provided function in a tokio blocking thread, with a Diesel database connection.
async fn with_conn<F, R>(&self, func: F) -> DatabaseResult<R>
where
F: Fn(&mut PgConnection) -> DatabaseResult<R> + Send + 'static,
R: Send + 'static,
{
let database_url = self.database_url.clone();
tokio::task::spawn_blocking(move || -> DatabaseResult<R> {
// TODO: connection pooling, such as via diesel::r2d2
let mut conn = PgConnection::establish(&database_url)?;
func(&mut conn)
})
.await
.expect("Task panic")
}
/// When a node is first registered, persist it before using it for anything
pub(crate) async fn insert_node(&self, node: &Node) -> DatabaseResult<()> {
let np = node.to_persistent();
self.with_conn(move |conn| -> DatabaseResult<()> {
diesel::insert_into(crate::schema::nodes::table)
.values(&np)
.execute(conn)?;
Ok(())
})
.await
}
/// At startup, populate the list of nodes which our shards may be placed on
pub(crate) async fn list_nodes(&self) -> DatabaseResult<Vec<Node>> {
let nodes: Vec<Node> = self
.with_conn(move |conn| -> DatabaseResult<_> {
Ok(crate::schema::nodes::table
.load::<NodePersistence>(conn)?
.into_iter()
.map(|n| Node {
id: NodeId(n.node_id as u64),
// At startup we consider a node offline until proven otherwise.
availability: NodeAvailability::Offline,
scheduling: NodeSchedulingPolicy::from_str(&n.scheduling_policy)
.expect("Bad scheduling policy in DB"),
listen_http_addr: n.listen_http_addr,
listen_http_port: n.listen_http_port as u16,
listen_pg_addr: n.listen_pg_addr,
listen_pg_port: n.listen_pg_port as u16,
})
.collect::<Vec<Node>>())
})
.await?;
tracing::info!("list_nodes: loaded {} nodes", nodes.len());
Ok(nodes)
}
/// At startup, load the high level state for shards, such as their config + policy. This will
/// be enriched at runtime with state discovered on pageservers.
pub(crate) async fn list_tenant_shards(&self) -> DatabaseResult<Vec<TenantShardPersistence>> {
let loaded = self
.with_conn(move |conn| -> DatabaseResult<_> {
Ok(crate::schema::tenant_shards::table.load::<TenantShardPersistence>(conn)?)
})
.await?;
if loaded.is_empty() {
if let Some(path) = &self.json_path {
if tokio::fs::try_exists(path)
.await
.map_err(|e| DatabaseError::Logical(format!("Error stat'ing JSON file: {e}")))?
{
tracing::info!("Importing from legacy JSON format at {path}");
return self.list_tenant_shards_json(path).await;
}
}
}
Ok(loaded)
}
/// Shim for automated compatibility tests: load tenants from a JSON file instead of database
pub(crate) async fn list_tenant_shards_json(
&self,
path: &Utf8Path,
) -> DatabaseResult<Vec<TenantShardPersistence>> {
let bytes = tokio::fs::read(path)
.await
.map_err(|e| DatabaseError::Logical(format!("Failed to load JSON: {e}")))?;
let mut decoded = serde_json::from_slice::<JsonPersistence>(&bytes)
.map_err(|e| DatabaseError::Logical(format!("Deserialization error: {e}")))?;
for (tenant_id, tenant) in &mut decoded.tenants {
// Backward compat: an old attachments.json from before PR #6251, replace
// empty strings with proper defaults.
if tenant.tenant_id.is_empty() {
tenant.tenant_id = tenant_id.to_string();
tenant.config = serde_json::to_string(&TenantConfig::default())
.map_err(|e| DatabaseError::Logical(format!("Serialization error: {e}")))?;
tenant.placement_policy = serde_json::to_string(&PlacementPolicy::default())
.map_err(|e| DatabaseError::Logical(format!("Serialization error: {e}")))?;
}
}
let tenants: Vec<TenantShardPersistence> = decoded.tenants.into_values().collect();
// Synchronize database with what is in the JSON file
self.insert_tenant_shards(tenants.clone()).await?;
Ok(tenants)
}
/// For use in testing environments, where we dump out JSON on shutdown.
pub async fn write_tenants_json(&self) -> anyhow::Result<()> {
let Some(path) = &self.json_path else {
anyhow::bail!("Cannot write JSON if path isn't set (test environment bug)");
};
tracing::info!("Writing state to {path}...");
let tenants = self.list_tenant_shards().await?;
let mut tenants_map = HashMap::new();
for tsp in tenants {
let tenant_shard_id = TenantShardId {
tenant_id: TenantId::from_str(tsp.tenant_id.as_str())?,
shard_number: ShardNumber(tsp.shard_number as u8),
shard_count: ShardCount(tsp.shard_count as u8),
};
tenants_map.insert(tenant_shard_id, tsp);
}
let json = serde_json::to_string(&JsonPersistence {
tenants: tenants_map,
})?;
tokio::fs::write(path, &json).await?;
tracing::info!("Wrote {} bytes to {path}...", json.len());
Ok(())
}
/// Tenants must be persisted before we schedule them for the first time. This enables us
/// to correctly retain generation monotonicity, and the externally provided placement policy & config.
pub(crate) async fn insert_tenant_shards(
&self,
shards: Vec<TenantShardPersistence>,
) -> DatabaseResult<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| -> DatabaseResult<()> {
conn.transaction(|conn| -> QueryResult<()> {
for tenant in &shards {
diesel::insert_into(tenant_shards)
.values(tenant)
.execute(conn)?;
}
Ok(())
})?;
Ok(())
})
.await
}
/// Ordering: call this _after_ deleting the tenant on pageservers, but _before_ dropping state for
/// the tenant from memory on this server.
#[allow(unused)]
pub(crate) async fn delete_tenant(&self, del_tenant_id: TenantId) -> DatabaseResult<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| -> DatabaseResult<()> {
diesel::delete(tenant_shards)
.filter(tenant_id.eq(del_tenant_id.to_string()))
.execute(conn)?;
Ok(())
})
.await
}
/// When a tenant invokes the /re-attach API, this function is responsible for doing an efficient
/// batched increment of the generations of all tenants whose generation_pageserver is equal to
/// the node that called /re-attach.
#[tracing::instrument(skip_all, fields(node_id))]
pub(crate) async fn re_attach(
&self,
node_id: NodeId,
) -> DatabaseResult<HashMap<TenantShardId, Generation>> {
use crate::schema::tenant_shards::dsl::*;
let updated = self
.with_conn(move |conn| {
let rows_updated = diesel::update(tenant_shards)
.filter(generation_pageserver.eq(node_id.0 as i64))
.set(generation.eq(generation + 1))
.execute(conn)?;
tracing::info!("Incremented {} tenants' generations", rows_updated);
// TODO: UPDATE+SELECT in one query
let updated = tenant_shards
.filter(generation_pageserver.eq(node_id.0 as i64))
.select(TenantShardPersistence::as_select())
.load(conn)?;
Ok(updated)
})
.await?;
let mut result = HashMap::new();
for tsp in updated {
let tenant_shard_id = TenantShardId {
tenant_id: TenantId::from_str(tsp.tenant_id.as_str())
.map_err(|e| DatabaseError::Logical(format!("Malformed tenant id: {e}")))?,
shard_number: ShardNumber(tsp.shard_number as u8),
shard_count: ShardCount(tsp.shard_count as u8),
};
result.insert(tenant_shard_id, Generation::new(tsp.generation as u32));
}
Ok(result)
}
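// Illustrative sketch (not part of this change; the name and free-standing signature are
// assumptions): the TODO above about doing the UPDATE and SELECT in one query could use
// diesel's `returning()`, mirroring how `increment_generation` below already does this for
// a single shard.
#[allow(dead_code)]
fn re_attach_single_query(
    conn: &mut diesel::PgConnection,
    node_id: NodeId,
) -> DatabaseResult<Vec<TenantShardPersistence>> {
    use crate::schema::tenant_shards::dsl::*;
    let rows = diesel::update(tenant_shards)
        .filter(generation_pageserver.eq(node_id.0 as i64))
        .set(generation.eq(generation + 1))
        .returning(TenantShardPersistence::as_returning())
        .get_results(conn)?;
    Ok(rows)
}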
/// Reconciler calls this immediately before attaching to a new pageserver, to acquire a unique, monotonically
/// advancing generation number. We also store the NodeId for which the generation was issued, so that in
/// [`Self::re_attach`] we can do a bulk UPDATE on the generations for that node.
pub(crate) async fn increment_generation(
&self,
tenant_shard_id: TenantShardId,
node_id: NodeId,
) -> anyhow::Result<Generation> {
use crate::schema::tenant_shards::dsl::*;
let updated = self
.with_conn(move |conn| {
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(tenant_shard_id.tenant_id.to_string()))
.filter(shard_number.eq(tenant_shard_id.shard_number.0 as i32))
.filter(shard_count.eq(tenant_shard_id.shard_count.0 as i32))
.set((
generation.eq(generation + 1),
generation_pageserver.eq(node_id.0 as i64),
))
// TODO: only returning() the generation column
.returning(TenantShardPersistence::as_returning())
.get_result(conn)?;
Ok(updated)
})
.await?;
Ok(Generation::new(updated.generation as u32))
}
pub(crate) async fn detach(&self, tenant_shard_id: TenantShardId) -> anyhow::Result<()> {
use crate::schema::tenant_shards::dsl::*;
self.with_conn(move |conn| {
let updated = diesel::update(tenant_shards)
.filter(tenant_id.eq(tenant_shard_id.tenant_id.to_string()))
.filter(shard_number.eq(tenant_shard_id.shard_number.0 as i32))
.filter(shard_count.eq(tenant_shard_id.shard_count.0 as i32))
.set((
generation_pageserver.eq(i64::MAX),
placement_policy.eq(serde_json::to_string(&PlacementPolicy::Detached).unwrap()),
))
.execute(conn)?;
Ok(updated)
})
.await?;
Ok(())
}
// TODO: when we start shard splitting, we must durably mark the tenant so that
// on restart, we know that we must go through recovery (list shards that exist
// and pick up where we left off and/or revert to parent shards).
#[allow(dead_code)]
pub(crate) async fn begin_shard_split(&self, _tenant_id: TenantId) -> anyhow::Result<()> {
todo!();
}
// TODO: when we finish shard splitting, we must atomically clean up the old shards
// and insert the new shards, and clear the splitting marker.
#[allow(dead_code)]
pub(crate) async fn complete_shard_split(&self, _tenant_id: TenantId) -> anyhow::Result<()> {
todo!();
}
}
/// Parts of [`crate::tenant_state::TenantState`] that are stored durably
#[derive(Queryable, Selectable, Insertable, Serialize, Deserialize, Clone)]
#[diesel(table_name = crate::schema::tenant_shards)]
pub(crate) struct TenantShardPersistence {
#[serde(default)]
pub(crate) tenant_id: String,
#[serde(default)]
pub(crate) shard_number: i32,
#[serde(default)]
pub(crate) shard_count: i32,
#[serde(default)]
pub(crate) shard_stripe_size: i32,
// Latest generation number: next time we attach, increment this
// and use the incremented number when attaching
pub(crate) generation: i32,
// Currently attached pageserver
#[serde(rename = "pageserver")]
pub(crate) generation_pageserver: i64,
#[serde(default)]
pub(crate) placement_policy: String,
#[serde(default)]
pub(crate) config: String,
}
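// Illustrative sketch (hypothetical helper; `parse_legacy_shard_entry` is a made-up name):
// because most fields above carry #[serde(default)] and `generation_pageserver` is exposed
// as "pageserver", a legacy attachments.json entry that only carries a generation and a
// pageserver id still deserializes; `list_tenant_shards_json` then backfills the
// empty-string fields. The literal values here are made up.
#[allow(dead_code)]
fn parse_legacy_shard_entry() -> Result<TenantShardPersistence, serde_json::Error> {
    serde_json::from_str(r#"{"generation": 5, "pageserver": 1}"#)
}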
/// Parts of [`crate::node::Node`] that are stored durably
#[derive(Serialize, Deserialize, Queryable, Selectable, Insertable)]
#[diesel(table_name = crate::schema::nodes)]
pub(crate) struct NodePersistence {
pub(crate) node_id: i64,
pub(crate) scheduling_policy: String,
pub(crate) listen_http_addr: String,
pub(crate) listen_http_port: i32,
pub(crate) listen_pg_addr: String,
pub(crate) listen_pg_port: i32,
}


@@ -0,0 +1,528 @@
use crate::persistence::Persistence;
use crate::service;
use control_plane::attachment_service::NodeAvailability;
use pageserver_api::models::{
LocationConfig, LocationConfigMode, LocationConfigSecondary, TenantConfig,
};
use pageserver_api::shard::{ShardIdentity, TenantShardId};
use pageserver_client::mgmt_api;
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;
use tokio_util::sync::CancellationToken;
use utils::generation::Generation;
use utils::id::{NodeId, TimelineId};
use utils::lsn::Lsn;
use crate::compute_hook::{ComputeHook, NotifyError};
use crate::node::Node;
use crate::tenant_state::{IntentState, ObservedState, ObservedStateLocation};
/// Object with the lifetime of the background reconcile task that is created
/// for tenants which have a difference between their intent and observed states.
pub(super) struct Reconciler {
/// See [`crate::tenant_state::TenantState`] for the meanings of these fields: they are a snapshot
/// of a tenant's state from when we spawned a reconcile task.
pub(super) tenant_shard_id: TenantShardId,
pub(crate) shard: ShardIdentity,
pub(crate) generation: Generation,
pub(crate) intent: IntentState,
pub(crate) config: TenantConfig,
pub(crate) observed: ObservedState,
pub(crate) service_config: service::Config,
/// A snapshot of the pageservers as they were when we were asked
/// to reconcile.
pub(crate) pageservers: Arc<HashMap<NodeId, Node>>,
/// A hook to notify the running postgres instances when we change the location
/// of a tenant. Use this via [`Self::compute_notify`] to update our failure flag
/// and guarantee eventual retries.
pub(crate) compute_hook: Arc<ComputeHook>,
/// To avoid stalling if the cloud control plane is unavailable, we may proceed
/// past failures in [`ComputeHook::notify`], but we _must_ remember that we failed
/// so that we can set [`crate::tenant_state::TenantState::pending_compute_notification`] to ensure a later retry.
pub(crate) compute_notify_failure: bool,
/// A means to abort background reconciliation: it is essential to
/// call this when something changes in the original TenantState that
/// will make this reconciliation impossible or unnecessary, for
/// example when a pageserver node goes offline, or the PlacementPolicy for
/// the tenant is changed.
pub(crate) cancel: CancellationToken,
/// Access to persistent storage for updating generation numbers
pub(crate) persistence: Arc<Persistence>,
}
#[derive(thiserror::Error, Debug)]
pub(crate) enum ReconcileError {
#[error(transparent)]
Notify(#[from] NotifyError),
#[error(transparent)]
Other(#[from] anyhow::Error),
}
impl Reconciler {
async fn location_config(
&mut self,
node_id: NodeId,
config: LocationConfig,
flush_ms: Option<Duration>,
) -> anyhow::Result<()> {
let node = self
.pageservers
.get(&node_id)
.expect("Pageserver may not be removed while referenced");
self.observed
.locations
.insert(node.id, ObservedStateLocation { conf: None });
tracing::info!("location_config({}) calling: {:?}", node_id, config);
let client =
mgmt_api::Client::new(node.base_url(), self.service_config.jwt_token.as_deref());
client
.location_config(self.tenant_shard_id, config.clone(), flush_ms)
.await?;
tracing::info!("location_config({}) complete: {:?}", node_id, config);
self.observed
.locations
.insert(node.id, ObservedStateLocation { conf: Some(config) });
Ok(())
}
async fn maybe_live_migrate(&mut self) -> Result<(), ReconcileError> {
let destination = if let Some(node_id) = self.intent.attached {
match self.observed.locations.get(&node_id) {
Some(conf) => {
// We will do a live migration only if the intended destination is not
// currently in an attached state.
match &conf.conf {
Some(conf) if conf.mode == LocationConfigMode::Secondary => {
// Fall through to do a live migration
node_id
}
None | Some(_) => {
// Attached or uncertain: don't do a live migration, proceed
// with a general-case reconciliation
tracing::info!("maybe_live_migrate: destination is None or attached");
return Ok(());
}
}
}
None => {
// Our destination is not attached: maybe live migrate if some other
// node is currently attached. Fall through.
node_id
}
}
} else {
// No intent to be attached
tracing::info!("maybe_live_migrate: no attached intent");
return Ok(());
};
let mut origin = None;
for (node_id, state) in &self.observed.locations {
if let Some(observed_conf) = &state.conf {
if observed_conf.mode == LocationConfigMode::AttachedSingle {
let node = self
.pageservers
.get(node_id)
.expect("Nodes may not be removed while referenced");
// We will only attempt live migration if the origin is not offline: this
// avoids trying to do it while reconciling after responding to an HA failover.
if !matches!(node.availability, NodeAvailability::Offline) {
origin = Some(*node_id);
break;
}
}
}
}
let Some(origin) = origin else {
tracing::info!("maybe_live_migrate: no origin found");
return Ok(());
};
// We have an origin and a destination: proceed to do the live migration
tracing::info!("Live migrating {}->{}", origin, destination);
self.live_migrate(origin, destination).await?;
Ok(())
}
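// Illustrative sketch (hypothetical helper; `is_live_migration_destination` is a made-up
// name): the destination rule applied above, written as a pure predicate. A node is only
// treated as a live-migration target when we either know nothing about it, or know it to
// be in Secondary mode; an attached or uncertain observed state falls back to the
// general-case reconciliation.
#[allow(dead_code)]
fn is_live_migration_destination(observed: Option<&ObservedStateLocation>) -> bool {
    match observed {
        None => true,
        Some(loc) => matches!(&loc.conf, Some(conf) if conf.mode == LocationConfigMode::Secondary),
    }
}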
async fn get_lsns(
&self,
tenant_shard_id: TenantShardId,
node_id: &NodeId,
) -> anyhow::Result<HashMap<TimelineId, Lsn>> {
let node = self
.pageservers
.get(node_id)
.expect("Pageserver may not be removed while referenced");
let client =
mgmt_api::Client::new(node.base_url(), self.service_config.jwt_token.as_deref());
let timelines = client.timeline_list(&tenant_shard_id).await?;
Ok(timelines
.into_iter()
.map(|t| (t.timeline_id, t.last_record_lsn))
.collect())
}
async fn secondary_download(&self, tenant_shard_id: TenantShardId, node_id: &NodeId) {
let node = self
.pageservers
.get(node_id)
.expect("Pageserver may not be removed while referenced");
let client =
mgmt_api::Client::new(node.base_url(), self.service_config.jwt_token.as_deref());
match client.tenant_secondary_download(tenant_shard_id).await {
Ok(()) => {}
Err(_) => {
tracing::info!(" (skipping, destination wasn't in secondary mode)")
}
}
}
async fn await_lsn(
&self,
tenant_shard_id: TenantShardId,
pageserver_id: &NodeId,
baseline: HashMap<TimelineId, Lsn>,
) -> anyhow::Result<()> {
loop {
let latest = match self.get_lsns(tenant_shard_id, pageserver_id).await {
Ok(l) => l,
Err(e) => {
println!(
"🕑 Can't get LSNs on pageserver {} yet, waiting ({e})",
pageserver_id
);
tokio::time::sleep(Duration::from_millis(500)).await;
continue;
}
};
let mut any_behind: bool = false;
for (timeline_id, baseline_lsn) in &baseline {
match latest.get(timeline_id) {
Some(latest_lsn) => {
println!("🕑 LSN origin {baseline_lsn} vs destination {latest_lsn}");
if latest_lsn < baseline_lsn {
any_behind = true;
}
}
None => {
// Expected timeline isn't yet visible on migration destination.
// (IRL we would have to account for timeline deletion, but this
// is just a test helper)
any_behind = true;
}
}
}
if !any_behind {
println!("✅ LSN caught up. Proceeding...");
break;
} else {
tokio::time::sleep(Duration::from_millis(500)).await;
}
}
Ok(())
}
pub async fn live_migrate(
&mut self,
origin_ps_id: NodeId,
dest_ps_id: NodeId,
) -> anyhow::Result<()> {
// `maybe_live_migrate` is responsible for the sanity of the inputs
assert!(origin_ps_id != dest_ps_id);
fn build_location_config(
shard: &ShardIdentity,
config: &TenantConfig,
mode: LocationConfigMode,
generation: Option<Generation>,
secondary_conf: Option<LocationConfigSecondary>,
) -> LocationConfig {
LocationConfig {
mode,
generation: generation.map(|g| g.into().unwrap()),
secondary_conf,
tenant_conf: config.clone(),
shard_number: shard.number.0,
shard_count: shard.count.0,
shard_stripe_size: shard.stripe_size.0,
}
}
tracing::info!(
"🔁 Switching origin pageserver {} to stale mode",
origin_ps_id
);
// FIXME: it is incorrect to use self.generation here, we should use the generation
// from the ObservedState of the origin pageserver (it might be older than self.generation)
let stale_conf = build_location_config(
&self.shard,
&self.config,
LocationConfigMode::AttachedStale,
Some(self.generation),
None,
);
self.location_config(origin_ps_id, stale_conf, Some(Duration::from_secs(10)))
.await?;
let baseline_lsns = Some(self.get_lsns(self.tenant_shard_id, &origin_ps_id).await?);
// If we are migrating to a destination that has a secondary location, warm it up first
if let Some(destination_conf) = self.observed.locations.get(&dest_ps_id) {
if let Some(destination_conf) = &destination_conf.conf {
if destination_conf.mode == LocationConfigMode::Secondary {
tracing::info!(
"🔁 Downloading latest layers to destination pageserver {}",
dest_ps_id,
);
self.secondary_download(self.tenant_shard_id, &dest_ps_id)
.await;
}
}
}
// Increment generation before attaching to new pageserver
self.generation = self
.persistence
.increment_generation(self.tenant_shard_id, dest_ps_id)
.await?;
let dest_conf = build_location_config(
&self.shard,
&self.config,
LocationConfigMode::AttachedMulti,
Some(self.generation),
None,
);
tracing::info!("🔁 Attaching to pageserver {}", dest_ps_id);
self.location_config(dest_ps_id, dest_conf, None).await?;
if let Some(baseline) = baseline_lsns {
tracing::info!("🕑 Waiting for LSN to catch up...");
self.await_lsn(self.tenant_shard_id, &dest_ps_id, baseline)
.await?;
}
tracing::info!("🔁 Notifying compute to use pageserver {}", dest_ps_id);
// During a live migration it is unhelpful to proceed if we couldn't notify compute: if we detach
// the origin without notifying compute, we will render the tenant unavailable.
while let Err(e) = self.compute_notify().await {
match e {
NotifyError::Fatal(_) => return Err(anyhow::anyhow!(e)),
_ => {
tracing::warn!(
"Live migration blocked by compute notification error, retrying: {e}"
);
}
}
}
// Downgrade the origin to secondary. If the tenant's policy is PlacementPolicy::Single, then
// this location will be deleted in the general case reconciliation that runs after this.
let origin_secondary_conf = build_location_config(
&self.shard,
&self.config,
LocationConfigMode::Secondary,
None,
Some(LocationConfigSecondary { warm: true }),
);
self.location_config(origin_ps_id, origin_secondary_conf.clone(), None)
.await?;
// TODO: we should also be setting the ObservedState on earlier API calls, in case we fail
// partway through. In fact, all location conf API calls should be in a wrapper that sets
// the observed state to None, then runs, then sets it to what we wrote.
self.observed.locations.insert(
origin_ps_id,
ObservedStateLocation {
conf: Some(origin_secondary_conf),
},
);
println!(
"🔁 Switching to AttachedSingle mode on pageserver {}",
dest_ps_id
);
let dest_final_conf = build_location_config(
&self.shard,
&self.config,
LocationConfigMode::AttachedSingle,
Some(self.generation),
None,
);
self.location_config(dest_ps_id, dest_final_conf.clone(), None)
.await?;
self.observed.locations.insert(
dest_ps_id,
ObservedStateLocation {
conf: Some(dest_final_conf),
},
);
println!("✅ Migration complete");
Ok(())
}
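// Illustrative summary (for orientation only; `LIVE_MIGRATION_MODE_SEQUENCE` is a made-up
// name and is not used by the code): the order of location modes that `live_migrate` above
// applies, interleaved with warming up the secondary, waiting for LSNs to catch up, and
// notifying compute.
#[allow(dead_code)]
const LIVE_MIGRATION_MODE_SEQUENCE: &[(&str, LocationConfigMode)] = &[
    // Step 1: flush and stop ingest on the origin.
    ("origin", LocationConfigMode::AttachedStale),
    // Step 2: attach the destination with a freshly incremented generation.
    ("destination", LocationConfigMode::AttachedMulti),
    // Step 3: once compute has been notified, downgrade the origin.
    ("origin", LocationConfigMode::Secondary),
    // Step 4: promote the destination to the steady state.
    ("destination", LocationConfigMode::AttachedSingle),
];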
/// Reconciling a tenant makes API calls to pageservers until the observed state
/// matches the intended state.
///
/// First we apply special case handling (e.g. for live migrations), and then a
/// general case reconciliation where we walk through the intent by pageserver
/// and call out to the pageserver to apply the desired state.
pub(crate) async fn reconcile(&mut self) -> Result<(), ReconcileError> {
// TODO: if any of self.observed is None, call to remote pageservers
// to learn correct state.
// Special case: live migration
self.maybe_live_migrate().await?;
// If the attached pageserver is not attached, do so now.
if let Some(node_id) = self.intent.attached {
let mut wanted_conf =
attached_location_conf(self.generation, &self.shard, &self.config);
match self.observed.locations.get(&node_id) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {
// Nothing to do
tracing::info!("Observed configuration already correct.")
}
_ => {
// In all cases other than a matching observed configuration, we will
// reconcile this location. This includes locations with different configurations, as well
// as locations with unknown (None) observed state.
self.generation = self
.persistence
.increment_generation(self.tenant_shard_id, node_id)
.await?;
wanted_conf.generation = self.generation.into();
tracing::info!("Observed configuration requires update.");
self.location_config(node_id, wanted_conf, None).await?;
self.compute_notify().await?;
}
}
}
// Configure secondary locations: if these were previously attached this
// implicitly downgrades them from attached to secondary.
let mut changes = Vec::new();
for node_id in &self.intent.secondary {
let wanted_conf = secondary_location_conf(&self.shard, &self.config);
match self.observed.locations.get(node_id) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {
// Nothing to do
tracing::info!(%node_id, "Observed configuration already correct.")
}
_ => {
// In all cases other than a matching observed configuration, we will
// reconcile this location.
tracing::info!(%node_id, "Observed configuration requires update.");
changes.push((*node_id, wanted_conf))
}
}
}
// Detach any extraneous pageservers that are no longer referenced
// by our intent.
let all_pageservers = self.intent.all_pageservers();
for node_id in self.observed.locations.keys() {
if all_pageservers.contains(node_id) {
// We are only detaching pageservers that aren't used at all.
continue;
}
changes.push((
*node_id,
LocationConfig {
mode: LocationConfigMode::Detached,
generation: None,
secondary_conf: None,
shard_number: self.shard.number.0,
shard_count: self.shard.count.0,
shard_stripe_size: self.shard.stripe_size.0,
tenant_conf: self.config.clone(),
},
));
}
for (node_id, conf) in changes {
self.location_config(node_id, conf, None).await?;
}
Ok(())
}
pub(crate) async fn compute_notify(&mut self) -> Result<(), NotifyError> {
// Whenever a particular Reconciler emits a notification, it is always notifying for the intended
// destination.
if let Some(node_id) = self.intent.attached {
let result = self
.compute_hook
.notify(self.tenant_shard_id, node_id, &self.cancel)
.await;
if let Err(e) = &result {
// It is up to the caller whether they want to drop out on this error, but they don't have to:
// in general we should avoid letting unavailability of the cloud control plane stop us from
// making progress.
tracing::warn!("Failed to notify compute of attached pageserver {node_id}: {e}");
// Set this flag so that in our ReconcileResult we will set the flag on the shard that it
// needs to retry at some point.
self.compute_notify_failure = true;
}
result
} else {
Ok(())
}
}
}
pub(crate) fn attached_location_conf(
generation: Generation,
shard: &ShardIdentity,
config: &TenantConfig,
) -> LocationConfig {
LocationConfig {
mode: LocationConfigMode::AttachedSingle,
generation: generation.into(),
secondary_conf: None,
shard_number: shard.number.0,
shard_count: shard.count.0,
shard_stripe_size: shard.stripe_size.0,
tenant_conf: config.clone(),
}
}
pub(crate) fn secondary_location_conf(
shard: &ShardIdentity,
config: &TenantConfig,
) -> LocationConfig {
LocationConfig {
mode: LocationConfigMode::Secondary,
generation: None,
secondary_conf: Some(LocationConfigSecondary { warm: true }),
shard_number: shard.number.0,
shard_count: shard.count.0,
shard_stripe_size: shard.stripe_size.0,
tenant_conf: config.clone(),
}
}


@@ -0,0 +1,89 @@
use pageserver_api::shard::TenantShardId;
use std::collections::{BTreeMap, HashMap};
use utils::{http::error::ApiError, id::NodeId};
use crate::{node::Node, tenant_state::TenantState};
/// Scenarios in which we cannot find a suitable location for a tenant shard
#[derive(thiserror::Error, Debug)]
pub enum ScheduleError {
#[error("No pageservers found")]
NoPageservers,
#[error("No pageserver found matching constraint")]
ImpossibleConstraint,
}
impl From<ScheduleError> for ApiError {
fn from(value: ScheduleError) -> Self {
ApiError::Conflict(format!("Scheduling error: {}", value))
}
}
pub(crate) struct Scheduler {
tenant_counts: HashMap<NodeId, usize>,
}
impl Scheduler {
pub(crate) fn new(
tenants: &BTreeMap<TenantShardId, TenantState>,
nodes: &HashMap<NodeId, Node>,
) -> Self {
let mut tenant_counts = HashMap::new();
for node_id in nodes.keys() {
tenant_counts.insert(*node_id, 0);
}
for tenant in tenants.values() {
if let Some(ps) = tenant.intent.attached {
let entry = tenant_counts.entry(ps).or_insert(0);
*entry += 1;
}
}
for (node_id, node) in nodes {
if !node.may_schedule() {
tenant_counts.remove(node_id);
}
}
Self { tenant_counts }
}
pub(crate) fn schedule_shard(
&mut self,
hard_exclude: &[NodeId],
) -> Result<NodeId, ScheduleError> {
if self.tenant_counts.is_empty() {
return Err(ScheduleError::NoPageservers);
}
let mut tenant_counts: Vec<(NodeId, usize)> = self
.tenant_counts
.iter()
.filter_map(|(k, v)| {
if hard_exclude.contains(k) {
None
} else {
Some((*k, *v))
}
})
.collect();
// Sort by tenant count. Nodes with the same tenant count are sorted by ID.
tenant_counts.sort_by_key(|i| (i.1, i.0));
if tenant_counts.is_empty() {
// After applying constraints, no pageservers were left
return Err(ScheduleError::ImpossibleConstraint);
}
for (node_id, count) in &tenant_counts {
tracing::info!("tenant_counts[{node_id}]={count}");
}
let node_id = tenant_counts.first().unwrap().0;
tracing::info!("scheduler selected node {node_id}");
*self.tenant_counts.get_mut(&node_id).unwrap() += 1;
Ok(node_id)
}
}
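// Illustrative sketch (hypothetical test; the module, test name and counts are made up, and
// it assumes it lives in this file so the private `tenant_counts` field is visible): with
// nodes 1..=3 holding 5, 2 and 2 attached shards, `schedule_shard` picks node 2 (lowest
// count, ties broken by lower NodeId) and bumps its count; excluding node 2 then yields
// node 3.
#[cfg(test)]
mod scheduling_sketch {
    use super::*;
    use std::collections::HashMap;

    #[test]
    fn picks_least_loaded_node() {
        let mut scheduler = Scheduler {
            tenant_counts: HashMap::from([(NodeId(1), 5), (NodeId(2), 2), (NodeId(3), 2)]),
        };
        assert_eq!(scheduler.schedule_shard(&[]).unwrap(), NodeId(2));
        assert_eq!(scheduler.schedule_shard(&[NodeId(2)]).unwrap(), NodeId(3));
    }
}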


@@ -0,0 +1,27 @@
// @generated automatically by Diesel CLI.
diesel::table! {
nodes (node_id) {
node_id -> Int8,
scheduling_policy -> Varchar,
listen_http_addr -> Varchar,
listen_http_port -> Int4,
listen_pg_addr -> Varchar,
listen_pg_port -> Int4,
}
}
diesel::table! {
tenant_shards (tenant_id, shard_number, shard_count) {
tenant_id -> Varchar,
shard_number -> Int4,
shard_count -> Int4,
shard_stripe_size -> Int4,
generation -> Int4,
generation_pageserver -> Int8,
placement_policy -> Varchar,
config -> Text,
}
}
diesel::allow_tables_to_appear_in_same_query!(nodes, tenant_shards,);
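// Illustrative sketch (hypothetical helper; `count_shards_on_node` is a made-up name and
// not used by the service): the generated DSL above gives typed column handles, so ad-hoc
// queries against these tables stay checked at compile time.
#[allow(dead_code)]
fn count_shards_on_node(conn: &mut diesel::PgConnection, node: i64) -> diesel::QueryResult<i64> {
    use diesel::prelude::*;
    use self::tenant_shards::dsl::*;

    tenant_shards
        .filter(generation_pageserver.eq(node))
        .count()
        .get_result(conn)
}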

File diff suppressed because it is too large


@@ -0,0 +1,527 @@
use std::{collections::HashMap, sync::Arc, time::Duration};
use control_plane::attachment_service::NodeAvailability;
use pageserver_api::{
models::{LocationConfig, LocationConfigMode, TenantConfig},
shard::{ShardIdentity, TenantShardId},
};
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;
use utils::{
generation::Generation,
id::NodeId,
seqwait::{SeqWait, SeqWaitError},
};
use crate::{
compute_hook::ComputeHook,
node::Node,
persistence::Persistence,
reconciler::{attached_location_conf, secondary_location_conf, ReconcileError, Reconciler},
scheduler::{ScheduleError, Scheduler},
service, PlacementPolicy, Sequence,
};
pub(crate) struct TenantState {
pub(crate) tenant_shard_id: TenantShardId,
pub(crate) shard: ShardIdentity,
// Runtime only: sequence used to coordinate updates to this object while
// background reconcilers may be running. A reconciler runs to a particular
// sequence.
pub(crate) sequence: Sequence,
// Latest generation number: next time we attach, increment this
// and use the incremented number when attaching
pub(crate) generation: Generation,
// High level description of how the tenant should be set up. Provided
// externally.
pub(crate) policy: PlacementPolicy,
// Low level description of exactly which pageservers should fulfil
// which role. Generated by `Self::schedule`.
pub(crate) intent: IntentState,
// Low level description of how the tenant is configured on pageservers:
// if this does not match `Self::intent` then the tenant needs reconciliation
// with `Self::reconcile`.
pub(crate) observed: ObservedState,
// Tenant configuration, passed through opaquely to the pageserver. Identical
// for all shards in a tenant.
pub(crate) config: TenantConfig,
/// If a reconcile task is currently in flight, it may be joined here (it is
/// only safe to join if either the result has been received or the reconciler's
/// cancellation token has been fired)
pub(crate) reconciler: Option<ReconcilerHandle>,
/// Optionally wait for reconciliation to complete up to a particular
/// sequence number.
pub(crate) waiter: std::sync::Arc<SeqWait<Sequence, Sequence>>,
/// Indicates sequence number for which we have encountered an error reconciling. If
/// this advances ahead of [`Self::waiter`] then a reconciliation error has occurred,
/// and callers should stop waiting for `waiter` and propagate the error.
pub(crate) error_waiter: std::sync::Arc<SeqWait<Sequence, Sequence>>,
/// The most recent error from a reconcile on this tenant
/// TODO: generalize to an array of recent events
/// TODO: use an ArcSwap instead of a mutex for faster reads?
pub(crate) last_error: std::sync::Arc<std::sync::Mutex<String>>,
/// If we have a pending compute notification that for some reason we weren't able to send,
/// set this to true. If this is set, calls to [`Self::maybe_reconcile`] will run a task to retry
/// sending it. This is the mechanism by which compute notifications are included in the scope
/// of state that we publish externally in an eventually consistent way.
pub(crate) pending_compute_notification: bool,
}
#[derive(Default, Clone, Debug)]
pub(crate) struct IntentState {
pub(crate) attached: Option<NodeId>,
pub(crate) secondary: Vec<NodeId>,
}
#[derive(Default, Clone)]
pub(crate) struct ObservedState {
pub(crate) locations: HashMap<NodeId, ObservedStateLocation>,
}
/// Our latest knowledge of how this tenant is configured in the outside world.
///
/// Meaning:
/// * No instance of this type exists for a node: we are certain that we have nothing configured on that
/// node for this shard.
/// * Instance exists with conf==None: we *might* have some state on that node, but we don't know
/// what it is (e.g. we failed partway through configuring it)
/// * Instance exists with conf==Some: this tells us what we last successfully configured on this node,
/// and that configuration will still be present unless something external interfered.
#[derive(Clone)]
pub(crate) struct ObservedStateLocation {
/// If None, it means we do not know the status of this shard's location on this node, but
/// we know that we might have some state on this node.
pub(crate) conf: Option<LocationConfig>,
}
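// Illustrative sketch (hypothetical helper; `NodeKnowledge` and `classify` are made-up
// names): the three cases described above, written out explicitly.
#[allow(dead_code)]
enum NodeKnowledge {
    /// No ObservedStateLocation entry exists for the node: nothing is configured there.
    NothingConfigured,
    /// An entry exists with conf == None: there may be some state, but we don't know what.
    Uncertain,
    /// An entry exists with conf == Some: the last configuration we successfully applied.
    Known(LocationConfig),
}

#[allow(dead_code)]
fn classify(observed: Option<&ObservedStateLocation>) -> NodeKnowledge {
    match observed {
        None => NodeKnowledge::NothingConfigured,
        Some(ObservedStateLocation { conf: None }) => NodeKnowledge::Uncertain,
        Some(ObservedStateLocation { conf: Some(c) }) => NodeKnowledge::Known(c.clone()),
    }
}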
pub(crate) struct ReconcilerWaiter {
// For observability purposes, remember the ID of the shard we're
// waiting for.
pub(crate) tenant_shard_id: TenantShardId,
seq_wait: std::sync::Arc<SeqWait<Sequence, Sequence>>,
error_seq_wait: std::sync::Arc<SeqWait<Sequence, Sequence>>,
error: std::sync::Arc<std::sync::Mutex<String>>,
seq: Sequence,
}
#[derive(thiserror::Error, Debug)]
pub enum ReconcileWaitError {
#[error("Timeout waiting for shard {0}")]
Timeout(TenantShardId),
#[error("shutting down")]
Shutdown,
#[error("Reconcile error on shard {0}: {1}")]
Failed(TenantShardId, String),
}
impl ReconcilerWaiter {
pub(crate) async fn wait_timeout(&self, timeout: Duration) -> Result<(), ReconcileWaitError> {
tokio::select! {
result = self.seq_wait.wait_for_timeout(self.seq, timeout)=> {
result.map_err(|e| match e {
SeqWaitError::Timeout => ReconcileWaitError::Timeout(self.tenant_shard_id),
SeqWaitError::Shutdown => ReconcileWaitError::Shutdown
})?;
},
result = self.error_seq_wait.wait_for(self.seq) => {
result.map_err(|e| match e {
SeqWaitError::Shutdown => ReconcileWaitError::Shutdown,
SeqWaitError::Timeout => unreachable!()
})?;
return Err(ReconcileWaitError::Failed(self.tenant_shard_id, self.error.lock().unwrap().clone()))
}
}
Ok(())
}
}
/// Having spawned a reconciler task, the tenant shard's state will carry enough
/// information to optionally cancel & await it later.
pub(crate) struct ReconcilerHandle {
sequence: Sequence,
handle: JoinHandle<()>,
cancel: CancellationToken,
}
/// When a reconcile task completes, it sends this result object
/// to be applied to the primary TenantState.
pub(crate) struct ReconcileResult {
pub(crate) sequence: Sequence,
/// On errors, `observed` should be treated as an incomplete description
/// of state (i.e. any nodes present in the result should override nodes
/// present in the parent tenant state, but any unmentioned nodes should
/// not be removed from parent tenant state)
pub(crate) result: Result<(), ReconcileError>,
pub(crate) tenant_shard_id: TenantShardId,
pub(crate) generation: Generation,
pub(crate) observed: ObservedState,
/// Set [`TenantState::pending_compute_notification`] from this flag
pub(crate) pending_compute_notification: bool,
}
impl IntentState {
pub(crate) fn new() -> Self {
Self {
attached: None,
secondary: vec![],
}
}
pub(crate) fn all_pageservers(&self) -> Vec<NodeId> {
let mut result = Vec::new();
if let Some(p) = self.attached {
result.push(p)
}
result.extend(self.secondary.iter().copied());
result
}
/// When a node goes offline, we update intents to avoid using it
/// as their attached pageserver.
///
/// Returns true if a change was made
pub(crate) fn notify_offline(&mut self, node_id: NodeId) -> bool {
if self.attached == Some(node_id) {
self.attached = None;
self.secondary.push(node_id);
true
} else {
false
}
}
}
impl ObservedState {
pub(crate) fn new() -> Self {
Self {
locations: HashMap::new(),
}
}
}
impl TenantState {
pub(crate) fn new(
tenant_shard_id: TenantShardId,
shard: ShardIdentity,
policy: PlacementPolicy,
) -> Self {
Self {
tenant_shard_id,
policy,
intent: IntentState::default(),
generation: Generation::new(0),
shard,
observed: ObservedState::default(),
config: TenantConfig::default(),
reconciler: None,
sequence: Sequence(1),
waiter: Arc::new(SeqWait::new(Sequence(0))),
error_waiter: Arc::new(SeqWait::new(Sequence(0))),
last_error: Arc::default(),
pending_compute_notification: false,
}
}
/// For use on startup when learning state from pageservers: generate my [`IntentState`] from my
/// [`ObservedState`], even if it violates my [`PlacementPolicy`]. Call [`Self::schedule`] next,
/// to get an intent state that complies with placement policy. The overall goal is to do scheduling
/// in a way that makes use of any configured locations that already exist in the outside world.
pub(crate) fn intent_from_observed(&mut self) {
// Choose an attached location by filtering observed locations, and then sorting to get the highest
// generation
let mut attached_locs = self
.observed
.locations
.iter()
.filter_map(|(node_id, l)| {
if let Some(conf) = &l.conf {
if conf.mode == LocationConfigMode::AttachedMulti
|| conf.mode == LocationConfigMode::AttachedSingle
|| conf.mode == LocationConfigMode::AttachedStale
{
Some((node_id, conf.generation))
} else {
None
}
} else {
None
}
})
.collect::<Vec<_>>();
attached_locs.sort_by_key(|i| i.1);
if let Some((node_id, _gen)) = attached_locs.into_iter().last() {
self.intent.attached = Some(*node_id);
}
// All remaining observed locations generate secondary intents. This includes None
// observations, as these may well have some local content on disk that is usable (this
// is an edge case that might occur if we restarted during a migration or other change)
self.observed.locations.keys().for_each(|node_id| {
if Some(*node_id) != self.intent.attached {
self.intent.secondary.push(*node_id);
}
});
}
pub(crate) fn schedule(&mut self, scheduler: &mut Scheduler) -> Result<(), ScheduleError> {
// TODO: before scheduling new nodes, check if any existing content in
// self.intent refers to pageservers that are offline, and pick other
// pageservers if so.
// Build the set of pageservers already in use by this tenant, to avoid scheduling
// more work on the same pageservers we're already using.
let mut used_pageservers = self.intent.all_pageservers();
let mut modified = false;
use PlacementPolicy::*;
match self.policy {
Single => {
// Should have exactly one attached, and zero secondaries
if self.intent.attached.is_none() {
let node_id = scheduler.schedule_shard(&used_pageservers)?;
self.intent.attached = Some(node_id);
used_pageservers.push(node_id);
modified = true;
}
if !self.intent.secondary.is_empty() {
self.intent.secondary.clear();
modified = true;
}
}
Double(secondary_count) => {
// Should have exactly one attached, and N secondaries
if self.intent.attached.is_none() {
let node_id = scheduler.schedule_shard(&used_pageservers)?;
self.intent.attached = Some(node_id);
used_pageservers.push(node_id);
modified = true;
}
while self.intent.secondary.len() < secondary_count {
let node_id = scheduler.schedule_shard(&used_pageservers)?;
self.intent.secondary.push(node_id);
used_pageservers.push(node_id);
modified = true;
}
}
Detached => {
// Should have no attached or secondary pageservers
if self.intent.attached.is_some() {
self.intent.attached = None;
modified = true;
}
if !self.intent.secondary.is_empty() {
self.intent.secondary.clear();
modified = true;
}
}
}
if modified {
self.sequence.0 += 1;
}
Ok(())
}
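// Illustrative sketch (hypothetical helper; `target_shape` is a made-up name): the shape
// that `schedule` above converges on for each placement policy, expressed as
// (attached count, secondary count).
#[allow(dead_code)]
fn target_shape(policy: &PlacementPolicy) -> (usize, usize) {
    match policy {
        PlacementPolicy::Single => (1, 0),
        PlacementPolicy::Double(secondary_count) => (1, *secondary_count),
        PlacementPolicy::Detached => (0, 0),
    }
}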
/// Query whether the tenant's observed state for attached node matches its intent state, and if so,
/// yield the node ID. This is appropriate for emitting compute hook notifications: we are checking that
/// the node in question is not only where we intend to attach, but that the tenant is indeed already attached there.
///
/// Reconciliation may still be needed for other aspects of state such as secondaries (see [`Self::dirty`]): this
/// function should not be used to decide whether to reconcile.
pub(crate) fn stably_attached(&self) -> Option<NodeId> {
if let Some(attach_intent) = self.intent.attached {
match self.observed.locations.get(&attach_intent) {
Some(loc) => match &loc.conf {
Some(conf) => match conf.mode {
LocationConfigMode::AttachedMulti
| LocationConfigMode::AttachedSingle
| LocationConfigMode::AttachedStale => {
// Our intent and observed state agree that this node is in an attached state.
Some(attach_intent)
}
// Our observed config is not an attached state
_ => None,
},
// Our observed state is None, i.e. in flux
None => None,
},
// We have no observed state for this node
None => None,
}
} else {
// Our intent is not to attach
None
}
}
fn dirty(&self) -> bool {
if let Some(node_id) = self.intent.attached {
let wanted_conf = attached_location_conf(self.generation, &self.shard, &self.config);
match self.observed.locations.get(&node_id) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {}
Some(_) | None => {
return true;
}
}
}
for node_id in &self.intent.secondary {
let wanted_conf = secondary_location_conf(&self.shard, &self.config);
match self.observed.locations.get(node_id) {
Some(conf) if conf.conf.as_ref() == Some(&wanted_conf) => {}
Some(_) | None => {
return true;
}
}
}
// Even if there is no pageserver work to be done, if we have a pending notification to computes,
// wake up a reconciler to send it.
if self.pending_compute_notification {
return true;
}
false
}
pub(crate) fn maybe_reconcile(
&mut self,
result_tx: tokio::sync::mpsc::UnboundedSender<ReconcileResult>,
pageservers: &Arc<HashMap<NodeId, Node>>,
compute_hook: &Arc<ComputeHook>,
service_config: &service::Config,
persistence: &Arc<Persistence>,
) -> Option<ReconcilerWaiter> {
// If there are any ambiguous observed states, and the nodes they refer to are available,
// we should reconcile to clean them up.
let mut dirty_observed = false;
for (node_id, observed_loc) in &self.observed.locations {
let node = pageservers
.get(node_id)
.expect("Nodes may not be removed while referenced");
if observed_loc.conf.is_none()
&& !matches!(node.availability, NodeAvailability::Offline)
{
dirty_observed = true;
break;
}
}
if !self.dirty() && !dirty_observed {
tracing::info!("Not dirty, no reconciliation needed.");
return None;
}
// Reconcile already in flight for the current sequence?
if let Some(handle) = &self.reconciler {
if handle.sequence == self.sequence {
return Some(ReconcilerWaiter {
tenant_shard_id: self.tenant_shard_id,
seq_wait: self.waiter.clone(),
error_seq_wait: self.error_waiter.clone(),
error: self.last_error.clone(),
seq: self.sequence,
});
}
}
// Reconcile in flight for a stale sequence? Our sequence's task will wait for it before
// doing our sequence's work.
let old_handle = self.reconciler.take();
let cancel = CancellationToken::new();
let mut reconciler = Reconciler {
tenant_shard_id: self.tenant_shard_id,
shard: self.shard,
generation: self.generation,
intent: self.intent.clone(),
config: self.config.clone(),
observed: self.observed.clone(),
pageservers: pageservers.clone(),
compute_hook: compute_hook.clone(),
service_config: service_config.clone(),
cancel: cancel.clone(),
persistence: persistence.clone(),
compute_notify_failure: false,
};
let reconcile_seq = self.sequence;
tracing::info!("Spawning Reconciler for sequence {}", self.sequence);
let must_notify = self.pending_compute_notification;
let join_handle = tokio::task::spawn(async move {
// Wait for any previous reconcile task to complete before we start
if let Some(old_handle) = old_handle {
old_handle.cancel.cancel();
if let Err(e) = old_handle.handle.await {
// We can't do much with this other than log it: the task is done, so
// we may proceed with our work.
tracing::error!("Unexpected join error waiting for reconcile task: {e}");
}
}
// Early check for cancellation before doing any work
// TODO: wrap all remote API operations in cancellation check
// as well.
if reconciler.cancel.is_cancelled() {
return;
}
// Attempt to make observed state match intent state
let result = reconciler.reconcile().await;
// If we know we had a pending compute notification from some previous action, send a notification irrespective
// of whether the above reconcile() did any work
if result.is_ok() && must_notify {
// If this fails we will send the need to retry in [`ReconcileResult::pending_compute_notification`]
reconciler.compute_notify().await.ok();
}
result_tx
.send(ReconcileResult {
sequence: reconcile_seq,
result,
tenant_shard_id: reconciler.tenant_shard_id,
generation: reconciler.generation,
observed: reconciler.observed,
pending_compute_notification: reconciler.compute_notify_failure,
})
.ok();
});
self.reconciler = Some(ReconcilerHandle {
sequence: self.sequence,
handle: join_handle,
cancel,
});
Some(ReconcilerWaiter {
tenant_shard_id: self.tenant_shard_id,
seq_wait: self.waiter.clone(),
error_seq_wait: self.error_waiter.clone(),
error: self.last_error.clone(),
seq: self.sequence,
})
}
}
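// Illustrative sketch (hypothetical caller; variable names are assumed): the waiter
// returned by `maybe_reconcile` is typically awaited with a timeout, and a reconcile
// failure surfaces through `error_seq_wait` as ReconcileWaitError::Failed rather than
// through the spawned task's join handle:
//
//     if let Some(waiter) = shard.maybe_reconcile(
//         result_tx, &pageservers, &compute_hook, &service_config, &persistence,
//     ) {
//         waiter.wait_timeout(Duration::from_secs(30)).await?;
//     }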


@@ -1,24 +1,46 @@
use crate::{background_process, local_env::LocalEnv};
use anyhow::anyhow;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::{path::PathBuf, process::Child};
use utils::id::{NodeId, TenantId};
use camino::{Utf8Path, Utf8PathBuf};
use diesel::{
backend::Backend,
query_builder::{AstPass, QueryFragment, QueryId},
Connection, PgConnection, QueryResult, RunQueryDsl,
};
use diesel_migrations::{HarnessWithOutput, MigrationHarness};
use hyper::Method;
use pageserver_api::{
models::{ShardParameters, TenantCreateRequest, TimelineCreateRequest, TimelineInfo},
shard::TenantShardId,
};
use pageserver_client::mgmt_api::ResponseErrorMessageExt;
use postgres_backend::AuthType;
use serde::{de::DeserializeOwned, Deserialize, Serialize};
use std::{env, str::FromStr};
use tokio::process::Command;
use tracing::instrument;
use url::Url;
use utils::{
auth::{Claims, Scope},
id::{NodeId, TenantId},
};
pub struct AttachmentService {
env: LocalEnv,
listen: String,
path: PathBuf,
path: Utf8PathBuf,
jwt_token: Option<String>,
public_key: Option<String>,
postgres_port: u16,
client: reqwest::Client,
}
const COMMAND: &str = "attachment_service";
#[serde_as]
const ATTACHMENT_SERVICE_POSTGRES_VERSION: u32 = 16;
#[derive(Serialize, Deserialize)]
pub struct AttachHookRequest {
#[serde_as(as = "DisplayFromStr")]
pub tenant_id: TenantId,
pub pageserver_id: Option<NodeId>,
pub tenant_shard_id: TenantShardId,
pub node_id: Option<NodeId>,
}
#[derive(Serialize, Deserialize)]
@@ -26,9 +48,141 @@ pub struct AttachHookResponse {
pub gen: Option<u32>,
}
#[derive(Serialize, Deserialize)]
pub struct InspectRequest {
pub tenant_shard_id: TenantShardId,
}
#[derive(Serialize, Deserialize)]
pub struct InspectResponse {
pub attachment: Option<(u32, NodeId)>,
}
#[derive(Serialize, Deserialize)]
pub struct TenantCreateResponseShard {
pub shard_id: TenantShardId,
pub node_id: NodeId,
pub generation: u32,
}
#[derive(Serialize, Deserialize)]
pub struct TenantCreateResponse {
pub shards: Vec<TenantCreateResponseShard>,
}
#[derive(Serialize, Deserialize)]
pub struct NodeRegisterRequest {
pub node_id: NodeId,
pub listen_pg_addr: String,
pub listen_pg_port: u16,
pub listen_http_addr: String,
pub listen_http_port: u16,
}
#[derive(Serialize, Deserialize)]
pub struct NodeConfigureRequest {
pub node_id: NodeId,
pub availability: Option<NodeAvailability>,
pub scheduling: Option<NodeSchedulingPolicy>,
}
#[derive(Serialize, Deserialize, Debug)]
pub struct TenantLocateResponseShard {
pub shard_id: TenantShardId,
pub node_id: NodeId,
pub listen_pg_addr: String,
pub listen_pg_port: u16,
pub listen_http_addr: String,
pub listen_http_port: u16,
}
#[derive(Serialize, Deserialize)]
pub struct TenantLocateResponse {
pub shards: Vec<TenantLocateResponseShard>,
pub shard_params: ShardParameters,
}
/// Explicitly migrating a particular shard is a low level operation
/// TODO: higher level "Reschedule tenant" operation where the request
/// specifies some constraints, e.g. asking it to get off particular node(s)
#[derive(Serialize, Deserialize, Debug)]
pub struct TenantShardMigrateRequest {
pub tenant_shard_id: TenantShardId,
pub node_id: NodeId,
}
#[derive(Serialize, Deserialize, Clone, Copy)]
pub enum NodeAvailability {
// Normal, happy state
Active,
// Offline: Tenants shouldn't try to attach here, but they may assume that their
// secondary locations on this node still exist. Newly added nodes are in this
// state until we successfully contact them.
Offline,
}
impl FromStr for NodeAvailability {
type Err = anyhow::Error;
fn from_str(s: &str) -> Result<Self, Self::Err> {
match s {
"active" => Ok(Self::Active),
"offline" => Ok(Self::Offline),
_ => Err(anyhow::anyhow!("Unknown availability state '{s}'")),
}
}
}
/// FIXME: this is a duplicate of the type in the attachment_service crate, because the
/// type needs to be defined with diesel traits in there.
#[derive(Serialize, Deserialize, Clone, Copy)]
pub enum NodeSchedulingPolicy {
Active,
Filling,
Pause,
Draining,
}
impl FromStr for NodeSchedulingPolicy {
type Err = anyhow::Error;
fn from_str(s: &str) -> Result<Self, Self::Err> {
match s {
"active" => Ok(Self::Active),
"filling" => Ok(Self::Filling),
"pause" => Ok(Self::Pause),
"draining" => Ok(Self::Draining),
_ => Err(anyhow::anyhow!("Unknown scheduling state '{s}'")),
}
}
}
impl From<NodeSchedulingPolicy> for String {
fn from(value: NodeSchedulingPolicy) -> String {
use NodeSchedulingPolicy::*;
match value {
Active => "active",
Filling => "filling",
Pause => "pause",
Draining => "draining",
}
.to_string()
}
}
#[derive(Serialize, Deserialize, Debug)]
pub struct TenantShardMigrateResponse {}
impl AttachmentService {
pub fn from_env(env: &LocalEnv) -> Self {
let path = env.base_data_dir.join("attachments.json");
let path = Utf8PathBuf::from_path_buf(env.base_data_dir.clone())
.unwrap()
.join("attachments.json");
// Makes no sense to construct this if pageservers aren't going to use it: assume
// pageservers have control plane API set
@@ -40,66 +194,502 @@ impl AttachmentService {
listen_url.port().unwrap()
);
// Convention: NeonEnv in python tests reserves the next port after the control_plane_api
// port, for use by our captive postgres.
let postgres_port = listen_url
.port()
.expect("Control plane API setting should always have a port")
+ 1;
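// For example (port values are made up): if control_plane_api listens on port 1234, the
// captive postgres used by this service listens on port 1235.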
// Assume all pageservers have symmetric auth configuration: this service
// expects to use one JWT token to talk to all of them.
let ps_conf = env
.pageservers
.first()
.expect("Config is validated to contain at least one pageserver");
let (jwt_token, public_key) = match ps_conf.http_auth_type {
AuthType::Trust => (None, None),
AuthType::NeonJWT => {
let jwt_token = env
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))
.unwrap();
// If pageserver auth is enabled, this implicitly enables auth for this service,
// using the same credentials.
let public_key_path =
camino::Utf8PathBuf::try_from(env.base_data_dir.join("auth_public_key.pem"))
.unwrap();
// This service takes keys as a string rather than as a path to a file/dir: read the key into memory.
let public_key = if std::fs::metadata(&public_key_path)
.expect("Can't stat public key")
.is_dir()
{
// Our config may specify a directory: this is for the pageserver's ability to handle multiple
// keys. We only use one key at a time, so arbitrarily load the first one in the directory.
let mut dir =
std::fs::read_dir(&public_key_path).expect("Can't readdir public key path");
let dent = dir
.next()
.expect("Empty key dir")
.expect("Error reading key dir");
std::fs::read_to_string(dent.path()).expect("Can't read public key")
} else {
std::fs::read_to_string(&public_key_path).expect("Can't read public key")
};
(Some(jwt_token), Some(public_key))
}
};
Self {
env: env.clone(),
path,
listen,
jwt_token,
public_key,
postgres_port,
client: reqwest::ClientBuilder::new()
.build()
.expect("Failed to construct http client"),
}
}
fn pid_file(&self) -> PathBuf {
self.env.base_data_dir.join("attachment_service.pid")
fn pid_file(&self) -> Utf8PathBuf {
Utf8PathBuf::from_path_buf(self.env.base_data_dir.join("attachment_service.pid"))
.expect("non-Unicode path")
}
pub fn start(&self) -> anyhow::Result<Child> {
let path_str = self.path.to_string_lossy();
/// PIDFile for the postgres instance used to store attachment service state
fn postgres_pid_file(&self) -> Utf8PathBuf {
Utf8PathBuf::from_path_buf(
self.env
.base_data_dir
.join("attachment_service_postgres.pid"),
)
.expect("non-Unicode path")
}
/// In order to access database migrations, we need to find the Neon source tree
async fn find_source_root(&self) -> anyhow::Result<Utf8PathBuf> {
// We assume that either our current working directory or our binary is in the source tree. The former is usually
// true for automated test runners, the latter is usually true for developer workstations. Often
// both are true, which is fine.
let candidate_start_points = [
// Current working directory
Utf8PathBuf::from_path_buf(std::env::current_dir()?).unwrap(),
// Directory containing the binary we're running inside
Utf8PathBuf::from_path_buf(env::current_exe()?.parent().unwrap().to_owned()).unwrap(),
];
// For each candidate start point, search through ancestors looking for a neon.git source tree root
for start_point in &candidate_start_points {
// Start from the build dir: assumes we are running out of a built neon source tree
for path in start_point.ancestors() {
// A crude approximation: the root of the source tree is whatever contains a "control_plane"
// subdirectory.
let control_plane = path.join("control_plane");
if tokio::fs::try_exists(&control_plane).await? {
return Ok(path.to_owned());
}
}
}
// Fall-through
Err(anyhow::anyhow!(
"Could not find control_plane src dir, after searching ancestors of {candidate_start_points:?}"
))
}
/// Find the directory containing postgres binaries, such as `initdb` and `pg_ctl`
///
/// This usually uses ATTACHMENT_SERVICE_POSTGRES_VERSION of postgres, but will fall back
/// to other versions if that one isn't found. Some automated tests create circumstances
/// where only one version is available in pg_distrib_dir, such as `test_remote_extensions`.
pub async fn get_pg_bin_dir(&self) -> anyhow::Result<Utf8PathBuf> {
let prefer_versions = [ATTACHMENT_SERVICE_POSTGRES_VERSION, 15, 14];
for v in prefer_versions {
let path = Utf8PathBuf::from_path_buf(self.env.pg_bin_dir(v)?).unwrap();
if tokio::fs::try_exists(&path).await? {
return Ok(path);
}
}
// Fall through
anyhow::bail!(
"Postgres binaries not found in {}",
self.env.pg_distrib_dir.display()
);
}
/// Readiness check for our postgres process
async fn pg_isready(&self, pg_bin_dir: &Utf8Path) -> anyhow::Result<bool> {
let bin_path = pg_bin_dir.join("pg_isready");
let args = ["-h", "localhost", "-p", &format!("{}", self.postgres_port)];
let exitcode = Command::new(bin_path).args(args).spawn()?.wait().await?;
Ok(exitcode.success())
}
/// Create our database if it doesn't exist, and run migrations.
///
/// This function is equivalent to the `diesel setup` command in the diesel CLI. We implement
/// the same steps by hand to avoid imposing a dependency on installing diesel-cli for developers
/// who just want to run `cargo neon_local` without knowing about diesel.
///
/// Returns the database url
pub async fn setup_database(&self) -> anyhow::Result<String> {
let database_url = format!(
"postgresql://localhost:{}/attachment_service",
self.postgres_port
);
println!("Running attachment service database setup...");
fn change_database_of_url(database_url: &str, default_database: &str) -> (String, String) {
let base = ::url::Url::parse(database_url).unwrap();
let database = base.path_segments().unwrap().last().unwrap().to_owned();
let mut new_url = base.join(default_database).unwrap();
new_url.set_query(base.query());
(database, new_url.into())
}
#[derive(Debug, Clone)]
pub struct CreateDatabaseStatement {
db_name: String,
}
impl CreateDatabaseStatement {
pub fn new(db_name: &str) -> Self {
CreateDatabaseStatement {
db_name: db_name.to_owned(),
}
}
}
impl<DB: Backend> QueryFragment<DB> for CreateDatabaseStatement {
fn walk_ast<'b>(&'b self, mut out: AstPass<'_, 'b, DB>) -> QueryResult<()> {
out.push_sql("CREATE DATABASE ");
out.push_identifier(&self.db_name)?;
Ok(())
}
}
impl<Conn> RunQueryDsl<Conn> for CreateDatabaseStatement {}
impl QueryId for CreateDatabaseStatement {
type QueryId = ();
const HAS_STATIC_QUERY_ID: bool = false;
}
if PgConnection::establish(&database_url).is_err() {
let (database, postgres_url) = change_database_of_url(&database_url, "postgres");
println!("Creating database: {database}");
let mut conn = PgConnection::establish(&postgres_url)?;
CreateDatabaseStatement::new(&database).execute(&mut conn)?;
}
let mut conn = PgConnection::establish(&database_url)?;
let migrations_dir = self
.find_source_root()
.await?
.join("control_plane/attachment_service/migrations");
let migrations = diesel_migrations::FileBasedMigrations::from_path(migrations_dir)?;
println!("Running migrations in {}", migrations.path().display());
HarnessWithOutput::write_to_stdout(&mut conn)
.run_pending_migrations(migrations)
.map(|_| ())
.map_err(|e| anyhow::anyhow!(e))?;
println!("Migrations complete");
Ok(database_url)
}
pub async fn start(&self) -> anyhow::Result<()> {
// Start a vanilla Postgres process used by the attachment service for persistence.
let pg_data_path = Utf8PathBuf::from_path_buf(self.env.base_data_dir.clone())
.unwrap()
.join("attachment_service_db");
let pg_bin_dir = self.get_pg_bin_dir().await?;
let pg_log_path = pg_data_path.join("postgres.log");
if !tokio::fs::try_exists(&pg_data_path).await? {
// Initialize empty database
let initdb_path = pg_bin_dir.join("initdb");
let mut child = Command::new(&initdb_path)
.args(["-D", pg_data_path.as_ref()])
.spawn()
.expect("Failed to spawn initdb");
let status = child.wait().await?;
if !status.success() {
anyhow::bail!("initdb failed with status {status}");
}
tokio::fs::write(
&pg_data_path.join("postgresql.conf"),
format!("port = {}", self.postgres_port),
)
.await?;
};
println!("Starting attachment service database...");
let db_start_args = [
"-w",
"-D",
pg_data_path.as_ref(),
"-l",
pg_log_path.as_ref(),
"start",
];
background_process::start_process(
"attachment_service_db",
&self.env.base_data_dir,
pg_bin_dir.join("pg_ctl").as_std_path(),
db_start_args,
[],
background_process::InitialPidFile::Create(self.postgres_pid_file()),
|| self.pg_isready(&pg_bin_dir),
)
.await?;
// Run migrations on every startup, in case something changed.
let database_url = self.setup_database().await?;
let mut args = vec![
"-l",
&self.listen,
"-p",
self.path.as_ref(),
"--database-url",
&database_url,
]
.into_iter()
.map(|s| s.to_string())
.collect::<Vec<_>>();
if let Some(jwt_token) = &self.jwt_token {
args.push(format!("--jwt-token={jwt_token}"));
}
if let Some(public_key) = &self.public_key {
args.push(format!("--public-key=\"{public_key}\""));
}
if let Some(control_plane_compute_hook_api) = &self.env.control_plane_compute_hook_api {
args.push(format!(
"--compute-hook-url={control_plane_compute_hook_api}"
));
}
background_process::start_process(
COMMAND,
&self.env.base_data_dir,
&self.env.attachment_service_bin(),
["-l", &self.listen, "-p", &path_str],
[],
background_process::InitialPidFile::Create(&self.pid_file()),
// TODO: a real status check
|| Ok(true),
args,
[(
"NEON_REPO_DIR".to_string(),
self.env.base_data_dir.to_string_lossy().to_string(),
)],
background_process::InitialPidFile::Create(self.pid_file()),
|| async {
match self.status().await {
Ok(_) => Ok(true),
Err(_) => Ok(false),
}
},
)
.await?;
Ok(())
}
pub fn stop(&self, immediate: bool) -> anyhow::Result<()> {
background_process::stop_process(immediate, COMMAND, &self.pid_file())
pub async fn stop(&self, immediate: bool) -> anyhow::Result<()> {
background_process::stop_process(immediate, COMMAND, &self.pid_file())?;
let pg_data_path = self.env.base_data_dir.join("attachment_service_db");
let pg_bin_dir = self.get_pg_bin_dir().await?;
println!("Stopping attachment service database...");
let pg_stop_args = ["-D", &pg_data_path.to_string_lossy(), "stop"];
let stop_status = Command::new(pg_bin_dir.join("pg_ctl"))
.args(pg_stop_args)
.spawn()?
.wait()
.await?;
if !stop_status.success() {
let pg_status_args = ["-D", &pg_data_path.to_string_lossy(), "status"];
let status_exitcode = Command::new(pg_bin_dir.join("pg_ctl"))
.args(pg_status_args)
.spawn()?
.wait()
.await?;
// pg_ctl status returns this exit code if postgres is not running: in this case it is
// fine that stop failed. Otherwise it is an error that stop failed.
const PG_STATUS_NOT_RUNNING: i32 = 3;
if Some(PG_STATUS_NOT_RUNNING) == status_exitcode.code() {
println!("Attachment service data base is already stopped");
return Ok(());
} else {
anyhow::bail!("Failed to stop attachment service database: {stop_status}")
}
}
Ok(())
}
/// Simple HTTP request wrapper for calling into attachment service
async fn dispatch<RQ, RS>(
&self,
method: hyper::Method,
path: String,
body: Option<RQ>,
) -> anyhow::Result<RS>
where
RQ: Serialize + Sized,
RS: DeserializeOwned + Sized,
{
// The configured URL has the /upcall path prefix for pageservers to use: we will strip that out
// for general purpose API access.
let listen_url = self.env.control_plane_api.clone().unwrap();
let url = Url::from_str(&format!(
"http://{}:{}/{path}",
listen_url.host_str().unwrap(),
listen_url.port().unwrap()
))
.unwrap();
let mut builder = self.client.request(method, url);
if let Some(body) = body {
builder = builder.json(&body)
}
if let Some(jwt_token) = &self.jwt_token {
builder = builder.header(
reqwest::header::AUTHORIZATION,
format!("Bearer {jwt_token}"),
);
}
let response = builder.send().await?;
let response = response.error_from_body().await?;
Ok(response
.json()
.await
.map_err(pageserver_client::mgmt_api::Error::ReceiveBody)?)
}
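As the comment in `dispatch` notes, only the host and port of the configured control_plane_api URL are reused; its `/upcall` path prefix is dropped before the request path is appended. A small sketch of that derivation using the `url` crate (the helper name and example values are made up for illustration):

use url::Url;

/// Rebuild a general-purpose API URL from the upcall URL's host and port,
/// discarding the upcall path prefix.
fn api_url(upcall_url: &Url, path: &str) -> Url {
    let host = upcall_url.host_str().expect("upcall URL must have a host");
    let port = upcall_url.port().expect("upcall URL must have an explicit port");
    Url::parse(&format!("http://{host}:{port}/{path}")).expect("constructed URL is valid")
}

fn main() {
    let upcall = Url::parse("http://127.0.0.1:1234/upcall/v1/").unwrap();
    let url = api_url(&upcall, "debug/v1/inspect");
    assert_eq!(url.as_str(), "http://127.0.0.1:1234/debug/v1/inspect");
}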
/// Call into the attach_hook API, for use before handing out attachments to pageservers
pub fn attach_hook(
#[instrument(skip(self))]
pub async fn attach_hook(
&self,
tenant_id: TenantId,
tenant_shard_id: TenantShardId,
pageserver_id: NodeId,
) -> anyhow::Result<Option<u32>> {
use hyper::StatusCode;
let url = self
.env
.control_plane_api
.clone()
.unwrap()
.join("attach_hook")
.unwrap();
let client = reqwest::blocking::ClientBuilder::new()
.build()
.expect("Failed to construct http client");
let request = AttachHookRequest {
tenant_id,
pageserver_id: Some(pageserver_id),
tenant_shard_id,
node_id: Some(pageserver_id),
};
let response = client.post(url).json(&request).send()?;
if response.status() != StatusCode::OK {
return Err(anyhow!("Unexpected status {}", response.status()));
}
let response = self
.dispatch::<_, AttachHookResponse>(
Method::POST,
"debug/v1/attach-hook".to_string(),
Some(request),
)
.await?;
let response = response.json::<AttachHookResponse>()?;
Ok(response.gen)
}
#[instrument(skip(self))]
pub async fn inspect(
&self,
tenant_shard_id: TenantShardId,
) -> anyhow::Result<Option<(u32, NodeId)>> {
let request = InspectRequest { tenant_shard_id };
let response = self
.dispatch::<_, InspectResponse>(
Method::POST,
"debug/v1/inspect".to_string(),
Some(request),
)
.await?;
Ok(response.attachment)
}
#[instrument(skip(self))]
pub async fn tenant_create(
&self,
req: TenantCreateRequest,
) -> anyhow::Result<TenantCreateResponse> {
self.dispatch(Method::POST, "v1/tenant".to_string(), Some(req))
.await
}
#[instrument(skip(self))]
pub async fn tenant_locate(&self, tenant_id: TenantId) -> anyhow::Result<TenantLocateResponse> {
self.dispatch::<(), _>(
Method::GET,
format!("control/v1/tenant/{tenant_id}/locate"),
None,
)
.await
}
#[instrument(skip(self))]
pub async fn tenant_migrate(
&self,
tenant_shard_id: TenantShardId,
node_id: NodeId,
) -> anyhow::Result<TenantShardMigrateResponse> {
self.dispatch(
Method::PUT,
format!("tenant/{tenant_shard_id}/migrate"),
Some(TenantShardMigrateRequest {
tenant_shard_id,
node_id,
}),
)
.await
}
#[instrument(skip_all, fields(node_id=%req.node_id))]
pub async fn node_register(&self, req: NodeRegisterRequest) -> anyhow::Result<()> {
self.dispatch::<_, ()>(Method::POST, "control/v1/node".to_string(), Some(req))
.await
}
#[instrument(skip_all, fields(node_id=%req.node_id))]
pub async fn node_configure(&self, req: NodeConfigureRequest) -> anyhow::Result<()> {
self.dispatch::<_, ()>(
Method::PUT,
format!("control/v1/node/{}/config", req.node_id),
Some(req),
)
.await
}
#[instrument(skip(self))]
pub async fn status(&self) -> anyhow::Result<()> {
self.dispatch::<(), ()>(Method::GET, "status".to_string(), None)
.await
}
#[instrument(skip_all, fields(%tenant_id, timeline_id=%req.new_timeline_id))]
pub async fn tenant_timeline_create(
&self,
tenant_id: TenantId,
req: TimelineCreateRequest,
) -> anyhow::Result<TimelineInfo> {
self.dispatch(
Method::POST,
format!("v1/tenant/{tenant_id}/timeline"),
Some(req),
)
.await
}
}


@@ -16,12 +16,13 @@ use std::ffi::OsStr;
use std::io::Write;
use std::os::unix::prelude::AsRawFd;
use std::os::unix::process::CommandExt;
use std::path::{Path, PathBuf};
use std::process::{Child, Command};
use std::path::Path;
use std::process::Command;
use std::time::Duration;
use std::{fs, io, thread};
use anyhow::Context;
use camino::{Utf8Path, Utf8PathBuf};
use nix::errno::Errno;
use nix::fcntl::{FcntlArg, FdFlag};
use nix::sys::signal::{kill, Signal};
@@ -43,15 +44,15 @@ const NOTICE_AFTER_RETRIES: u64 = 50;
/// Argument to `start_process`, to indicate whether it should create the pidfile or whether the
/// process creates it itself.
pub enum InitialPidFile<'t> {
pub enum InitialPidFile {
/// Create a pidfile, to allow future CLI invocations to manipulate the process.
Create(&'t Path),
Create(Utf8PathBuf),
/// The process will create the pidfile itself, need to wait for that event.
Expect(&'t Path),
Expect(Utf8PathBuf),
}
/// Start a background child process using the parameters given.
pub fn start_process<F, AI, A, EI>(
pub async fn start_process<F, Fut, AI, A, EI>(
process_name: &str,
datadir: &Path,
command: &Path,
@@ -59,9 +60,10 @@ pub fn start_process<F, AI, A, EI>(
envs: EI,
initial_pid_file: InitialPidFile,
process_status_check: F,
) -> anyhow::Result<Child>
) -> anyhow::Result<()>
where
F: Fn() -> anyhow::Result<bool>,
F: Fn() -> Fut,
Fut: std::future::Future<Output = anyhow::Result<bool>>,
AI: IntoIterator<Item = A>,
A: AsRef<OsStr>,
// Not generic AsRef<OsStr>, otherwise empty `envs` prevents type inference
@@ -85,10 +87,10 @@ where
.stdout(process_log_file)
.stderr(same_file_for_stderr)
.args(args);
let filled_cmd = fill_aws_secrets_vars(fill_rust_env_vars(background_command));
let filled_cmd = fill_remote_storage_secrets_vars(fill_rust_env_vars(background_command));
filled_cmd.envs(envs);
let pid_file_to_check = match initial_pid_file {
let pid_file_to_check = match &initial_pid_file {
InitialPidFile::Create(path) => {
pre_exec_create_pidfile(filled_cmd, path);
path
@@ -96,7 +98,7 @@ where
InitialPidFile::Expect(path) => path,
};
let mut spawned_process = filled_cmd.spawn().with_context(|| {
let spawned_process = filled_cmd.spawn().with_context(|| {
format!("Could not spawn {process_name}, see console output and log files for details.")
})?;
let pid = spawned_process.id();
@@ -104,12 +106,26 @@ where
i32::try_from(pid)
.with_context(|| format!("Subprocess {process_name} has invalid pid {pid}"))?,
);
// set up a scopeguard to kill & wait for the child in case we panic or bail below
let spawned_process = scopeguard::guard(spawned_process, |mut spawned_process| {
println!("SIGKILL & wait the started process");
(|| {
// TODO: use another signal that can be caught by the child so it can clean up any children it spawned (e.g., walredo).
spawned_process.kill().context("SIGKILL child")?;
spawned_process.wait().context("wait() for child process")?;
anyhow::Ok(())
})()
.with_context(|| format!("scopeguard kill&wait child {process_name:?}"))
.unwrap();
});
for retries in 0..RETRIES {
match process_started(pid, Some(pid_file_to_check), &process_status_check) {
match process_started(pid, pid_file_to_check, &process_status_check).await {
Ok(true) => {
println!("\n{process_name} started, pid: {pid}");
return Ok(spawned_process);
println!("\n{process_name} started and passed status check, pid: {pid}");
// leak the child process, it'll outlive this neon_local invocation
drop(scopeguard::ScopeGuard::into_inner(spawned_process));
return Ok(());
}
Ok(false) => {
if retries == NOTICE_AFTER_RETRIES {
@@ -124,20 +140,23 @@ where
thread::sleep(Duration::from_millis(RETRY_INTERVAL_MILLIS));
}
Err(e) => {
println!("{process_name} failed to start: {e:#}");
if let Err(e) = spawned_process.kill() {
println!("Could not stop {process_name} subprocess: {e:#}")
};
println!("error starting process {process_name:?}: {e:#}");
return Err(e);
}
}
}
println!();
anyhow::bail!("{process_name} did not start in {RETRY_UNTIL_SECS} seconds");
anyhow::bail!(
"{process_name} did not start+pass status checks within {RETRY_UNTIL_SECS} seconds"
);
}
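The loop above reduces to a simple pattern: poll the status check, succeed as soon as it returns true, fail hard on an error, and give up once the retry budget is spent. A stripped-down async sketch of that pattern, assuming the tokio runtime the rest of the diff already uses (constants are illustrative; the real loop also consults the pidfile and prints a notice after a while):

use std::time::Duration;

/// Poll `check` until it returns Ok(true), a hard error, or the budget runs out.
async fn wait_until_ready<F, Fut>(what: &str, check: F) -> anyhow::Result<()>
where
    F: Fn() -> Fut,
    Fut: std::future::Future<Output = anyhow::Result<bool>>,
{
    const RETRIES: u64 = 150;
    const RETRY_INTERVAL: Duration = Duration::from_millis(100);
    for _ in 0..RETRIES {
        match check().await {
            Ok(true) => return Ok(()),
            Ok(false) => tokio::time::sleep(RETRY_INTERVAL).await,
            Err(e) => anyhow::bail!("{what} failed to start: {e:#}"),
        }
    }
    anyhow::bail!("{what} did not pass its status check in time")
}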
/// Stops the process, using the pid file given. Returns Ok also if the process is already not running.
pub fn stop_process(immediate: bool, process_name: &str, pid_file: &Path) -> anyhow::Result<()> {
pub fn stop_process(
immediate: bool,
process_name: &str,
pid_file: &Utf8Path,
) -> anyhow::Result<()> {
let pid = match pid_file::read(pid_file)
.with_context(|| format!("read pid_file {pid_file:?}"))?
{
@@ -233,11 +252,15 @@ fn fill_rust_env_vars(cmd: &mut Command) -> &mut Command {
filled_cmd
}
fn fill_aws_secrets_vars(mut cmd: &mut Command) -> &mut Command {
fn fill_remote_storage_secrets_vars(mut cmd: &mut Command) -> &mut Command {
for env_key in [
"AWS_ACCESS_KEY_ID",
"AWS_SECRET_ACCESS_KEY",
"AWS_SESSION_TOKEN",
"AWS_PROFILE",
// HOME is needed in combination with `AWS_PROFILE` to pick up the SSO sessions.
"HOME",
"AZURE_STORAGE_ACCOUNT",
"AZURE_STORAGE_ACCESS_KEY",
] {
if let Ok(value) = std::env::var(env_key) {
cmd = cmd.env(env_key, value);
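The renamed `fill_remote_storage_secrets_vars` just forwards an allow-list of environment variables from neon_local into the spawned child, so locally configured AWS/Azure credentials (and `HOME`, for SSO profiles) reach the background process. A std-only sketch of the same pattern with a made-up variable list:

use std::process::Command;

/// Copy the listed environment variables from the current process into `cmd`,
/// skipping any that are unset.
fn forward_env_vars<'a>(cmd: &'a mut Command, keys: &[&str]) -> &'a mut Command {
    for &key in keys {
        if let Ok(value) = std::env::var(key) {
            cmd.env(key, value);
        }
    }
    cmd
}

fn main() {
    let mut cmd = Command::new("env");
    forward_env_vars(&mut cmd, &["AWS_PROFILE", "HOME"]);
    // `cmd` now carries only the variables that were actually set in this shell.
}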
@@ -252,10 +275,10 @@ fn fill_aws_secrets_vars(mut cmd: &mut Command) -> &mut Command {
/// will remain held until the cmd exits.
fn pre_exec_create_pidfile<P>(cmd: &mut Command, path: P) -> &mut Command
where
P: Into<PathBuf>,
P: Into<Utf8PathBuf>,
{
let path: PathBuf = path.into();
// SAFETY
let path: Utf8PathBuf = path.into();
// SAFETY:
// pre_exec is marked unsafe because it runs between fork and exec.
// Why is that dangerous in various ways?
// Long answer: https://github.com/rust-lang/rust/issues/39575
@@ -309,22 +332,20 @@ where
cmd
}
fn process_started<F>(
async fn process_started<F, Fut>(
pid: Pid,
pid_file_to_check: Option<&Path>,
pid_file_to_check: &Utf8Path,
status_check: &F,
) -> anyhow::Result<bool>
where
F: Fn() -> anyhow::Result<bool>,
F: Fn() -> Fut,
Fut: std::future::Future<Output = anyhow::Result<bool>>,
{
match status_check() {
Ok(true) => match pid_file_to_check {
Some(pid_file_path) => match pid_file::read(pid_file_path)? {
PidFileRead::NotExist => Ok(false),
PidFileRead::LockedByOtherProcess(pid_in_file) => Ok(pid_in_file == pid),
PidFileRead::NotHeldByAnyProcess(_) => Ok(false),
},
None => Ok(true),
match status_check().await {
Ok(true) => match pid_file::read(pid_file_to_check)? {
PidFileRead::NotExist => Ok(false),
PidFileRead::LockedByOtherProcess(pid_in_file) => Ok(pid_in_file == pid),
PidFileRead::NotHeldByAnyProcess(_) => Ok(false),
},
Ok(false) => Ok(false),
Err(e) => anyhow::bail!("process failed to start: {e}"),


@@ -1,274 +0,0 @@
/// The attachment service mimics the aspects of the control plane API
/// that are required for a pageserver to operate.
///
/// This enables running & testing pageservers without a full-blown
/// deployment of the Neon cloud platform.
///
use anyhow::anyhow;
use clap::Parser;
use hex::FromHex;
use hyper::StatusCode;
use hyper::{Body, Request, Response};
use serde::{Deserialize, Serialize};
use std::path::{Path, PathBuf};
use std::{collections::HashMap, sync::Arc};
use utils::logging::{self, LogFormat};
use utils::{
http::{
endpoint::{self},
error::ApiError,
json::{json_request, json_response},
RequestExt, RouterBuilder,
},
id::{NodeId, TenantId},
tcp_listener,
};
use pageserver_api::control_api::{
ReAttachRequest, ReAttachResponse, ReAttachResponseTenant, ValidateRequest, ValidateResponse,
ValidateResponseTenant,
};
use control_plane::attachment_service::{AttachHookRequest, AttachHookResponse};
#[derive(Parser)]
#[command(author, version, about, long_about = None)]
#[command(arg_required_else_help(true))]
struct Cli {
/// Host and port to listen on, like `127.0.0.1:1234`
#[arg(short, long)]
listen: std::net::SocketAddr,
/// Path to the .json file to store state (will be created if it doesn't exist)
#[arg(short, long)]
path: PathBuf,
}
// The persistent state of each Tenant
#[derive(Serialize, Deserialize, Clone)]
struct TenantState {
// Currently attached pageserver
pageserver: Option<NodeId>,
// Latest generation number: next time we attach, increment this
// and use the incremented number when attaching
generation: u32,
}
fn to_hex_map<S, V>(input: &HashMap<TenantId, V>, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
V: Clone + Serialize,
{
let transformed = input.iter().map(|(k, v)| (hex::encode(k), v.clone()));
transformed
.collect::<HashMap<String, V>>()
.serialize(serializer)
}
fn from_hex_map<'de, D, V>(deserializer: D) -> Result<HashMap<TenantId, V>, D::Error>
where
D: serde::de::Deserializer<'de>,
V: Deserialize<'de>,
{
let hex_map = HashMap::<String, V>::deserialize(deserializer)?;
hex_map
.into_iter()
.map(|(k, v)| {
TenantId::from_hex(k)
.map(|k| (k, v))
.map_err(serde::de::Error::custom)
})
.collect()
}
// Top level state available to all HTTP handlers
#[derive(Serialize, Deserialize)]
struct PersistentState {
#[serde(serialize_with = "to_hex_map", deserialize_with = "from_hex_map")]
tenants: HashMap<TenantId, TenantState>,
#[serde(skip)]
path: PathBuf,
}
impl PersistentState {
async fn save(&self) -> anyhow::Result<()> {
let bytes = serde_json::to_vec(self)?;
tokio::fs::write(&self.path, &bytes).await?;
Ok(())
}
async fn load(path: &Path) -> anyhow::Result<Self> {
let bytes = tokio::fs::read(path).await?;
let mut decoded = serde_json::from_slice::<Self>(&bytes)?;
decoded.path = path.to_owned();
Ok(decoded)
}
async fn load_or_new(path: &Path) -> Self {
match Self::load(path).await {
Ok(s) => {
tracing::info!("Loaded state file at {}", path.display());
s
}
Err(e)
if e.downcast_ref::<std::io::Error>()
.map(|e| e.kind() == std::io::ErrorKind::NotFound)
.unwrap_or(false) =>
{
tracing::info!("Will create state file at {}", path.display());
Self {
tenants: HashMap::new(),
path: path.to_owned(),
}
}
Err(e) => {
panic!("Failed to load state from '{}': {e:#} (maybe your .neon/ dir was written by an older version?)", path.display())
}
}
}
}
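`load_or_new` is the usual load-or-create pattern: a missing state file means "start empty", while any other I/O or decode failure is fatal. A simplified synchronous, std-only sketch of that decision (the real code deserializes JSON and panics with a hint about stale `.neon/` directories):

use std::io::ErrorKind;
use std::path::Path;

/// Read the state file if it exists; treat "not found" as an empty state and
/// propagate every other error to the caller.
fn load_or_new(path: &Path) -> std::io::Result<String> {
    match std::fs::read_to_string(path) {
        Ok(contents) => Ok(contents),
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(String::new()),
        Err(e) => Err(e),
    }
}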
/// State available to HTTP request handlers
#[derive(Clone)]
struct State {
inner: Arc<tokio::sync::RwLock<PersistentState>>,
}
impl State {
fn new(persistent_state: PersistentState) -> State {
Self {
inner: Arc::new(tokio::sync::RwLock::new(persistent_state)),
}
}
}
#[inline(always)]
fn get_state(request: &Request<Body>) -> &State {
request
.data::<Arc<State>>()
.expect("unknown state type")
.as_ref()
}
/// Pageserver calls into this on startup, to learn which tenants it should attach
async fn handle_re_attach(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let reattach_req = json_request::<ReAttachRequest>(&mut req).await?;
let state = get_state(&req).inner.clone();
let mut locked = state.write().await;
let mut response = ReAttachResponse {
tenants: Vec::new(),
};
for (t, state) in &mut locked.tenants {
if state.pageserver == Some(reattach_req.node_id) {
state.generation += 1;
response.tenants.push(ReAttachResponseTenant {
id: *t,
generation: state.generation,
});
}
}
locked.save().await.map_err(ApiError::InternalServerError)?;
json_response(StatusCode::OK, response)
}
/// Pageserver calls into this before doing deletions, to confirm that it still
/// holds the latest generation for the tenants with deletions enqueued
async fn handle_validate(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let validate_req = json_request::<ValidateRequest>(&mut req).await?;
let locked = get_state(&req).inner.read().await;
let mut response = ValidateResponse {
tenants: Vec::new(),
};
for req_tenant in validate_req.tenants {
if let Some(tenant_state) = locked.tenants.get(&req_tenant.id) {
let valid = tenant_state.generation == req_tenant.gen;
response.tenants.push(ValidateResponseTenant {
id: req_tenant.id,
valid,
});
}
}
json_response(StatusCode::OK, response)
}
/// Call into this before attaching a tenant to a pageserver, to acquire a generation number
/// (in the real control plane this is unnecessary, because the same program is managing
/// generation numbers and doing attachments).
async fn handle_attach_hook(mut req: Request<Body>) -> Result<Response<Body>, ApiError> {
let attach_req = json_request::<AttachHookRequest>(&mut req).await?;
let state = get_state(&req).inner.clone();
let mut locked = state.write().await;
let tenant_state = locked
.tenants
.entry(attach_req.tenant_id)
.or_insert_with(|| TenantState {
pageserver: attach_req.pageserver_id,
generation: 0,
});
if attach_req.pageserver_id.is_some() {
tenant_state.generation += 1;
}
tenant_state.pageserver = attach_req.pageserver_id;
let generation = tenant_state.generation;
locked.save().await.map_err(ApiError::InternalServerError)?;
json_response(
StatusCode::OK,
AttachHookResponse {
gen: attach_req.pageserver_id.map(|_| generation),
},
)
}
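The handler above encodes the attach-hook contract: a request that names a pageserver bumps the tenant's generation and returns the new value, while a request with no pageserver (a detach) only clears the assignment and returns no generation. A minimal std-only sketch of that rule, with `NodeId` simplified to a plain integer:

type NodeId = u64;

struct TenantState {
    pageserver: Option<NodeId>,
    generation: u32,
}

/// Apply one attach_hook request and return the generation to hand out, if any.
/// Attaching increments the generation; detaching leaves it untouched.
fn apply_attach_hook(state: &mut TenantState, pageserver_id: Option<NodeId>) -> Option<u32> {
    if pageserver_id.is_some() {
        state.generation += 1;
    }
    state.pageserver = pageserver_id;
    pageserver_id.map(|_| state.generation)
}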
fn make_router(persistent_state: PersistentState) -> RouterBuilder<hyper::Body, ApiError> {
endpoint::make_router()
.data(Arc::new(State::new(persistent_state)))
.post("/re-attach", handle_re_attach)
.post("/validate", handle_validate)
.post("/attach_hook", handle_attach_hook)
}
#[tokio::main]
async fn main() -> anyhow::Result<()> {
logging::init(
LogFormat::Plain,
logging::TracingErrorLayerEnablement::Disabled,
)?;
let args = Cli::parse();
tracing::info!(
"Starting, state at {}, listening on {}",
args.path.to_string_lossy(),
args.listen
);
let persistent_state = PersistentState::load_or_new(&args.path).await;
let http_listener = tcp_listener::bind(args.listen)?;
let router = make_router(persistent_state)
.build()
.map_err(|err| anyhow!(err))?;
let service = utils::http::RouterService::new(router).unwrap();
let server = hyper::Server::from_tcp(http_listener)?.serve(service);
tracing::info!("Serving on {0}", args.listen);
server.await?;
Ok(())
}

File diff suppressed because it is too large


@@ -7,11 +7,11 @@
//! ```
use anyhow::Context;
use std::path::PathBuf;
use camino::Utf8PathBuf;
use crate::{background_process, local_env};
pub fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
pub async fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
let broker = &env.broker;
let listen_addr = &broker.listen_addr;
@@ -19,29 +19,30 @@ pub fn start_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
let args = [format!("--listen-addr={listen_addr}")];
let client = reqwest::blocking::Client::new();
let client = reqwest::Client::new();
background_process::start_process(
"storage_broker",
&env.base_data_dir,
&env.storage_broker_bin(),
args,
[],
background_process::InitialPidFile::Create(&storage_broker_pid_file_path(env)),
|| {
background_process::InitialPidFile::Create(storage_broker_pid_file_path(env)),
|| async {
let url = broker.client_url();
let status_url = url.join("status").with_context(|| {
format!("Failed to append /status path to broker endpoint {url}",)
format!("Failed to append /status path to broker endpoint {url}")
})?;
let request = client
.get(status_url)
.build()
.with_context(|| format!("Failed to construct request to broker endpoint {url}"))?;
match client.execute(request) {
match client.execute(request).await {
Ok(resp) => Ok(resp.status().is_success()),
Err(_) => Ok(false),
}
},
)
.await
.context("Failed to spawn storage_broker subprocess")?;
Ok(())
}
@@ -50,6 +51,7 @@ pub fn stop_broker_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
background_process::stop_process(true, "storage_broker", &storage_broker_pid_file_path(env))
}
fn storage_broker_pid_file_path(env: &local_env::LocalEnv) -> PathBuf {
env.base_data_dir.join("storage_broker.pid")
fn storage_broker_pid_file_path(env: &local_env::LocalEnv) -> Utf8PathBuf {
Utf8PathBuf::from_path_buf(env.base_data_dir.join("storage_broker.pid"))
.expect("non-Unicode path")
}


@@ -45,32 +45,32 @@ use std::sync::Arc;
use std::time::Duration;
use anyhow::{anyhow, bail, Context, Result};
use compute_api::spec::RemoteExtSpec;
use nix::sys::signal::kill;
use nix::sys::signal::Signal;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use url::Host;
use utils::id::{NodeId, TenantId, TimelineId};
use crate::attachment_service::AttachmentService;
use crate::local_env::LocalEnv;
use crate::pageserver::PageServerNode;
use crate::postgresql_conf::PostgresConf;
use compute_api::responses::{ComputeState, ComputeStatus};
use compute_api::spec::{Cluster, ComputeMode, ComputeSpec};
use compute_api::spec::{Cluster, ComputeFeature, ComputeMode, ComputeSpec};
// contents of a endpoint.json file
#[serde_as]
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
pub struct EndpointConf {
endpoint_id: String,
#[serde_as(as = "DisplayFromStr")]
tenant_id: TenantId,
#[serde_as(as = "DisplayFromStr")]
timeline_id: TimelineId,
mode: ComputeMode,
pg_port: u16,
http_port: u16,
pg_version: u32,
skip_pg_catalog_updates: bool,
pageserver_id: NodeId,
features: Vec<ComputeFeature>,
}
//
@@ -122,18 +122,14 @@ impl ComputeControlPlane {
http_port: Option<u16>,
pg_version: u32,
mode: ComputeMode,
pageserver_id: NodeId,
) -> Result<Arc<Endpoint>> {
let pg_port = pg_port.unwrap_or_else(|| self.get_port());
let http_port = http_port.unwrap_or_else(|| self.get_port() + 1);
let pageserver =
PageServerNode::from_env(&self.env, self.env.get_pageserver_conf(pageserver_id)?);
let ep = Arc::new(Endpoint {
endpoint_id: endpoint_id.to_owned(),
pg_address: SocketAddr::new("127.0.0.1".parse().unwrap(), pg_port),
http_address: SocketAddr::new("127.0.0.1".parse().unwrap(), http_port),
env: self.env.clone(),
pageserver,
timeline_id,
mode,
tenant_id,
@@ -145,6 +141,7 @@ impl ComputeControlPlane {
// with this we basically test a case of waking up an idle compute, where
// we also skip catalog updates in the cloud.
skip_pg_catalog_updates: true,
features: vec![],
});
ep.create_endpoint_dir()?;
@@ -159,7 +156,7 @@ impl ComputeControlPlane {
pg_port,
pg_version,
skip_pg_catalog_updates: true,
pageserver_id,
features: vec![],
})?,
)?;
std::fs::write(
@@ -172,6 +169,30 @@ impl ComputeControlPlane {
Ok(ep)
}
pub fn check_conflicting_endpoints(
&self,
mode: ComputeMode,
tenant_id: TenantId,
timeline_id: TimelineId,
) -> Result<()> {
if matches!(mode, ComputeMode::Primary) {
// this check is not complete, as you could have a concurrent attempt at
// creating another primary, both reading the state before checking it here,
// but it's better than nothing.
let mut duplicates = self.endpoints.iter().filter(|(_k, v)| {
v.tenant_id == tenant_id
&& v.timeline_id == timeline_id
&& v.mode == mode
&& v.status() != EndpointStatus::Stopped
});
if let Some((key, _)) = duplicates.next() {
bail!("attempting to create a duplicate primary endpoint on tenant {tenant_id}, timeline {timeline_id}: endpoint {key:?} exists already. please don't do this, it is not supported.");
}
}
Ok(())
}
}
///////////////////////////////////////////////////////////////////////////////
@@ -194,10 +215,32 @@ pub struct Endpoint {
// These are not part of the endpoint as such, but the environment
// the endpoint runs in.
pub env: LocalEnv,
pageserver: PageServerNode,
// Optimizations
skip_pg_catalog_updates: bool,
// Feature flags
features: Vec<ComputeFeature>,
}
#[derive(PartialEq, Eq)]
pub enum EndpointStatus {
Running,
Stopped,
Crashed,
RunningNoPidfile,
}
impl std::fmt::Display for EndpointStatus {
fn fmt(&self, writer: &mut std::fmt::Formatter) -> std::fmt::Result {
let s = match self {
Self::Running => "running",
Self::Stopped => "stopped",
Self::Crashed => "crashed",
Self::RunningNoPidfile => "running, no pidfile",
};
write!(writer, "{}", s)
}
}
impl Endpoint {
@@ -217,20 +260,17 @@ impl Endpoint {
let conf: EndpointConf =
serde_json::from_slice(&std::fs::read(entry.path().join("endpoint.json"))?)?;
let pageserver =
PageServerNode::from_env(env, env.get_pageserver_conf(conf.pageserver_id)?);
Ok(Endpoint {
pg_address: SocketAddr::new("127.0.0.1".parse().unwrap(), conf.pg_port),
http_address: SocketAddr::new("127.0.0.1".parse().unwrap(), conf.http_port),
endpoint_id,
env: env.clone(),
pageserver,
timeline_id: conf.timeline_id,
mode: conf.mode,
tenant_id: conf.tenant_id,
pg_version: conf.pg_version,
skip_pg_catalog_updates: conf.skip_pg_catalog_updates,
features: conf.features,
})
}
@@ -253,7 +293,7 @@ impl Endpoint {
conf.append("shared_buffers", "1MB");
conf.append("fsync", "off");
conf.append("max_connections", "100");
conf.append("wal_level", "replica");
conf.append("wal_level", "logical");
// wal_sender_timeout is the maximum time to wait for WAL replication.
// It also defines how often the walreceiver will send a feedback message to the wal sender.
conf.append("wal_sender_timeout", "5s");
@@ -360,16 +400,16 @@ impl Endpoint {
self.endpoint_path().join("pgdata")
}
pub fn status(&self) -> &str {
pub fn status(&self) -> EndpointStatus {
let timeout = Duration::from_millis(300);
let has_pidfile = self.pgdata().join("postmaster.pid").exists();
let can_connect = TcpStream::connect_timeout(&self.pg_address, timeout).is_ok();
match (has_pidfile, can_connect) {
(true, true) => "running",
(false, false) => "stopped",
(true, false) => "crashed",
(false, true) => "running, no pidfile",
(true, true) => EndpointStatus::Running,
(false, false) => EndpointStatus::Stopped,
(true, false) => EndpointStatus::Crashed,
(false, true) => EndpointStatus::RunningNoPidfile,
}
}
@@ -414,43 +454,58 @@ impl Endpoint {
);
}
// Also wait for the compute_ctl process to die. It might have some cleanup
// work to do after postgres stops, like syncing safekeepers, etc.
//
// TODO use background_process::stop_process instead
let pidfile_path = self.endpoint_path().join("compute_ctl.pid");
let pid: u32 = std::fs::read_to_string(pidfile_path)?.parse()?;
let pid = nix::unistd::Pid::from_raw(pid as i32);
crate::background_process::wait_until_stopped("compute_ctl", pid)?;
Ok(())
}
pub fn start(
&self,
auth_token: &Option<String>,
safekeepers: Vec<NodeId>,
remote_ext_config: Option<&String>,
) -> Result<()> {
if self.status() == "running" {
anyhow::bail!("The endpoint is already running");
fn wait_for_compute_ctl_to_exit(&self, send_sigterm: bool) -> Result<()> {
// TODO use background_process::stop_process instead: https://github.com/neondatabase/neon/pull/6482
let pidfile_path = self.endpoint_path().join("compute_ctl.pid");
let pid: u32 = std::fs::read_to_string(pidfile_path)?.parse()?;
let pid = nix::unistd::Pid::from_raw(pid as i32);
if send_sigterm {
kill(pid, Signal::SIGTERM).ok();
}
crate::background_process::wait_until_stopped("compute_ctl", pid)?;
Ok(())
}
fn read_postgresql_conf(&self) -> Result<String> {
// Slurp the endpoints/<endpoint id>/postgresql.conf file into
// memory. We will include it in the spec file that we pass to
// `compute_ctl`, and `compute_ctl` will write it to the postgresql.conf
// in the data directory.
let postgresql_conf_path = self.endpoint_path().join("postgresql.conf");
let postgresql_conf = match std::fs::read(&postgresql_conf_path) {
Ok(content) => String::from_utf8(content)?,
Err(e) if e.kind() == std::io::ErrorKind::NotFound => "".to_string(),
Err(e) => {
return Err(anyhow::Error::new(e).context(format!(
"failed to read config file in {}",
postgresql_conf_path.to_str().unwrap()
)))
}
};
match std::fs::read(&postgresql_conf_path) {
Ok(content) => Ok(String::from_utf8(content)?),
Err(e) if e.kind() == std::io::ErrorKind::NotFound => Ok("".to_string()),
Err(e) => Err(anyhow::Error::new(e).context(format!(
"failed to read config file in {}",
postgresql_conf_path.to_str().unwrap()
))),
}
}
fn build_pageserver_connstr(pageservers: &[(Host, u16)]) -> String {
pageservers
.iter()
.map(|(host, port)| format!("postgresql://no_user@{host}:{port}"))
.collect::<Vec<_>>()
.join(",")
}
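`build_pageserver_connstr` joins the per-shard pageservers into the single comma-separated connection string that is stored in the compute spec. A tiny illustrative example using plain `(host, port)` string pairs instead of `url::Host`:

fn build_connstr(pageservers: &[(&str, u16)]) -> String {
    pageservers
        .iter()
        .map(|(host, port)| format!("postgresql://no_user@{host}:{port}"))
        .collect::<Vec<_>>()
        .join(",")
}

fn main() {
    let s = build_connstr(&[("ps1.local", 6400), ("ps2.local", 6400)]);
    assert_eq!(
        s,
        "postgresql://no_user@ps1.local:6400,postgresql://no_user@ps2.local:6400"
    );
}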
pub async fn start(
&self,
auth_token: &Option<String>,
safekeepers: Vec<NodeId>,
pageservers: Vec<(Host, u16)>,
remote_ext_config: Option<&String>,
shard_stripe_size: usize,
) -> Result<()> {
if self.status() == EndpointStatus::Running {
anyhow::bail!("The endpoint is already running");
}
let postgresql_conf = self.read_postgresql_conf()?;
// We always start the compute node from scratch, so if the Postgres
// data dir exists from a previous launch, remove it first.
@@ -458,13 +513,9 @@ impl Endpoint {
std::fs::remove_dir_all(self.pgdata())?;
}
let pageserver_connstring = {
let config = &self.pageserver.pg_connection_config;
let (host, port) = (config.host(), config.port());
let pageserver_connstring = Self::build_pageserver_connstr(&pageservers);
assert!(!pageserver_connstring.is_empty());
// NOTE: avoid spaces in connection string, because it is less error prone if we forward it somewhere.
format!("postgresql://no_user@{host}:{port}")
};
let mut safekeeper_connstrings = Vec::new();
if self.mode == ComputeMode::Primary {
for sk_id in safekeepers {
@@ -478,11 +529,24 @@ impl Endpoint {
}
}
// check for file remote_extensions_spec.json
// if it is present, read it and pass to compute_ctl
let remote_extensions_spec_path = self.endpoint_path().join("remote_extensions_spec.json");
let remote_extensions_spec = std::fs::File::open(remote_extensions_spec_path);
let remote_extensions: Option<RemoteExtSpec>;
if let Ok(spec_file) = remote_extensions_spec {
remote_extensions = serde_json::from_reader(spec_file).ok();
} else {
remote_extensions = None;
};
// Create spec file
let spec = ComputeSpec {
skip_pg_catalog_updates: self.skip_pg_catalog_updates,
format_version: 1.0,
operation_uuid: None,
features: self.features.clone(),
cluster: Cluster {
cluster_id: None, // project ID: not used
name: None, // project name: not used
@@ -499,7 +563,9 @@ impl Endpoint {
pageserver_connstring: Some(pageserver_connstring),
safekeeper_connstrings,
storage_auth_token: auth_token.clone(),
remote_extensions: None,
remote_extensions,
pgbouncer_settings: None,
shard_stripe_size: Some(shard_stripe_size),
};
let spec_path = self.endpoint_path().join("spec.json");
std::fs::write(spec_path, serde_json::to_string_pretty(&spec)?)?;
@@ -537,9 +603,21 @@ impl Endpoint {
}
let child = cmd.spawn()?;
// set up a scopeguard to kill & wait for the child in case we panic or bail below
let child = scopeguard::guard(child, |mut child| {
println!("SIGKILL & wait the started process");
(|| {
// TODO: use another signal that can be caught by the child so it can clean up any children it spawned
child.kill().context("SIGKILL child")?;
child.wait().context("wait() for child process")?;
anyhow::Ok(())
})()
.with_context(|| format!("scopeguard kill&wait child {child:?}"))
.unwrap();
});
// Write down the pid so we can wait for it when we want to stop
// TODO use background_process::start_process instead
// TODO use background_process::start_process instead: https://github.com/neondatabase/neon/pull/6482
let pid = child.id();
let pidfile_path = self.endpoint_path().join("compute_ctl.pid");
std::fs::write(pidfile_path, pid.to_string())?;
@@ -550,7 +628,7 @@ impl Endpoint {
const MAX_ATTEMPTS: u32 = 10 * 30; // Wait up to 30 s
loop {
attempt += 1;
match self.get_status() {
match self.get_status().await {
Ok(state) => {
match state.status {
ComputeStatus::Init => {
@@ -588,12 +666,15 @@ impl Endpoint {
std::thread::sleep(ATTEMPT_INTERVAL);
}
// disarm the scopeguard, let the child outlive this function (and neon_local invocation)
drop(scopeguard::ScopeGuard::into_inner(child));
Ok(())
}
// Call the /status HTTP API
pub fn get_status(&self) -> Result<ComputeState> {
let client = reqwest::blocking::Client::new();
pub async fn get_status(&self) -> Result<ComputeState> {
let client = reqwest::Client::new();
let response = client
.request(
@@ -604,16 +685,75 @@ impl Endpoint {
self.http_address.port()
),
)
.send()?;
.send()
.await?;
// Interpret the response
let status = response.status();
if !(status.is_client_error() || status.is_server_error()) {
Ok(response.json()?)
Ok(response.json().await?)
} else {
// reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = response.url().to_owned();
let msg = match response.text() {
let msg = match response.text().await {
Ok(err_body) => format!("Error: {}", err_body),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
};
Err(anyhow::anyhow!(msg))
}
}
pub async fn reconfigure(&self, mut pageservers: Vec<(Host, u16)>) -> Result<()> {
let mut spec: ComputeSpec = {
let spec_path = self.endpoint_path().join("spec.json");
let file = std::fs::File::open(spec_path)?;
serde_json::from_reader(file)?
};
let postgresql_conf = self.read_postgresql_conf()?;
spec.cluster.postgresql_conf = Some(postgresql_conf);
// If we weren't given explicit pageservers, query the attachment service
if pageservers.is_empty() {
let attachment_service = AttachmentService::from_env(&self.env);
let locate_result = attachment_service.tenant_locate(self.tenant_id).await?;
pageservers = locate_result
.shards
.into_iter()
.map(|shard| {
(
Host::parse(&shard.listen_pg_addr)
.expect("Attachment service reported bad hostname"),
shard.listen_pg_port,
)
})
.collect::<Vec<_>>();
}
let pageserver_connstr = Self::build_pageserver_connstr(&pageservers);
assert!(!pageserver_connstr.is_empty());
spec.pageserver_connstring = Some(pageserver_connstr);
let client = reqwest::Client::new();
let response = client
.post(format!(
"http://{}:{}/configure",
self.http_address.ip(),
self.http_address.port()
))
.body(format!(
"{{\"spec\":{}}}",
serde_json::to_string_pretty(&spec)?
))
.send()
.await?;
let status = response.status();
if !(status.is_client_error() || status.is_server_error()) {
Ok(())
} else {
let url = response.url().to_owned();
let msg = match response.text().await {
Ok(err_body) => format!("Error: {}", err_body),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
};
@@ -629,15 +769,30 @@ impl Endpoint {
// Postgres is always started from scratch, so stop
// without destroy is only used for testing and debugging.
//
self.pg_ctl(
if destroy {
&["-m", "immediate", "stop"]
} else {
&["stop"]
},
&None,
)?;
// Also wait for the compute_ctl process to die. It might have some
// cleanup work to do after postgres stops, like syncing safekeepers,
// etc.
//
// If destroying, send it SIGTERM before waiting. Sometimes we do *not*
// want this cleanup: tests intentionally stop when a majority of
// safekeepers is down, so sync-safekeepers would hang otherwise. This
// could be a separate flag though.
self.wait_for_compute_ctl_to_exit(destroy)?;
if destroy {
self.pg_ctl(&["-m", "immediate", "stop"], &None)?;
println!(
"Destroying postgres data directory '{}'",
self.pgdata().to_str().unwrap()
);
std::fs::remove_dir_all(self.endpoint_path())?;
} else {
self.pg_ctl(&["stop"], &None)?;
}
Ok(())
}


@@ -1,11 +1,10 @@
//
// Local control plane.
//
// Can start, configure and stop postgres instances running as a local processes.
//
// Intended to be used in integration tests and in CLI tools for
// local installations.
//
//! Local control plane.
//!
//! Can start, configure and stop postgres instances running as a local processes.
//!
//! Intended to be used in integration tests and in CLI tools for
//! local installations.
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod attachment_service;
mod background_process;


@@ -5,10 +5,10 @@
use anyhow::{bail, ensure, Context};
use clap::ValueEnum;
use postgres_backend::AuthType;
use reqwest::Url;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::collections::HashMap;
use std::env;
use std::fs;
@@ -33,7 +33,6 @@ pub const DEFAULT_PG_VERSION: u32 = 15;
// to 'neon_local init --config=<path>' option. See control_plane/simple.conf for
// an example.
//
#[serde_as]
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
pub struct LocalEnv {
// Base directory for all the nodes (the pageserver, safekeepers and
@@ -59,7 +58,6 @@ pub struct LocalEnv {
// Default tenant ID to use with the 'neon_local' command line utility, when
// --tenant_id is not explicitly specified.
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub default_tenant_id: Option<TenantId>,
// used to issue tokens during e.g pg start
@@ -74,17 +72,21 @@ pub struct LocalEnv {
#[serde(default)]
pub safekeepers: Vec<SafekeeperConf>,
// Control plane location: if None, we will not run attachment_service. If set, this will
// Control plane upcall API for pageserver: if None, we will not run attachment_service. If set, this will
// be propagated into each pageserver's configuration.
#[serde(default)]
pub control_plane_api: Option<Url>,
// Control plane upcall API for attachment service. If set, this will be propagated into the
// attachment service's configuration.
#[serde(default)]
pub control_plane_compute_hook_api: Option<Url>,
/// Keep human-readable aliases in memory (and persist them to config), to hide ZId hex strings from the user.
#[serde(default)]
// A `HashMap<String, HashMap<TenantId, TimelineId>>` would be more appropriate here,
// but deserialization into a generic toml object as `toml::Value::try_from` fails with an error.
// https://toml.io/en/v1.0.0 does not contain a concept of "a table inside another table".
#[serde_as(as = "HashMap<_, Vec<(DisplayFromStr, DisplayFromStr)>>")]
branch_name_mappings: HashMap<String, Vec<(TenantId, TimelineId)>>,
}
@@ -166,6 +168,31 @@ impl Default for SafekeeperConf {
}
}
#[derive(Clone, Copy)]
pub enum InitForceMode {
MustNotExist,
EmptyDirOk,
RemoveAllContents,
}
impl ValueEnum for InitForceMode {
fn value_variants<'a>() -> &'a [Self] {
&[
Self::MustNotExist,
Self::EmptyDirOk,
Self::RemoveAllContents,
]
}
fn to_possible_value(&self) -> Option<clap::builder::PossibleValue> {
Some(clap::builder::PossibleValue::new(match self {
InitForceMode::MustNotExist => "must-not-exist",
InitForceMode::EmptyDirOk => "empty-dir-ok",
InitForceMode::RemoveAllContents => "remove-all-contents",
}))
}
}
impl SafekeeperConf {
/// Compute is served by a port on which only tenant-scoped tokens are allowed, if
/// it is configured.
@@ -201,7 +228,11 @@ impl LocalEnv {
}
pub fn attachment_service_bin(&self) -> PathBuf {
self.neon_distrib_dir.join("attachment_service")
// Irrespective of configuration, attachment service binary is always
// run from the same location as neon_local. This means that compatibility
// tests that run an old pageserver/safekeeper still run the latest attachment service.
let neon_local_bin_dir = env::current_exe().unwrap().parent().unwrap().to_owned();
neon_local_bin_dir.join("attachment_service")
}
pub fn safekeeper_bin(&self) -> PathBuf {
@@ -229,7 +260,13 @@ impl LocalEnv {
if let Some(conf) = self.pageservers.iter().find(|node| node.id == id) {
Ok(conf)
} else {
bail!("could not find pageserver {id}")
let have_ids = self
.pageservers
.iter()
.map(|node| format!("{}:{}", node.id, node.listen_http_addr))
.collect::<Vec<_>>();
let joined = have_ids.join(",");
bail!("could not find pageserver {id}, have ids {joined}")
}
}
@@ -388,7 +425,7 @@ impl LocalEnv {
//
// Initialize a new Neon repository
//
pub fn init(&mut self, pg_version: u32, force: bool) -> anyhow::Result<()> {
pub fn init(&mut self, pg_version: u32, force: &InitForceMode) -> anyhow::Result<()> {
// check if config already exists
let base_path = &self.base_data_dir;
ensure!(
@@ -397,25 +434,34 @@ impl LocalEnv {
);
if base_path.exists() {
if force {
println!("removing all contents of '{}'", base_path.display());
// instead of directly calling `remove_dir_all`, we keep the original dir but remove
// all contents inside. This helps if the developer symlinks another directory (e.g.,
// S3 local SSD) to the `.neon` base directory.
for entry in std::fs::read_dir(base_path)? {
let entry = entry?;
let path = entry.path();
if path.is_dir() {
fs::remove_dir_all(&path)?;
} else {
fs::remove_file(&path)?;
match force {
InitForceMode::MustNotExist => {
bail!(
"directory '{}' already exists. Perhaps already initialized?",
base_path.display()
);
}
InitForceMode::EmptyDirOk => {
if let Some(res) = std::fs::read_dir(base_path)?.next() {
res.context("check if directory is empty")?;
anyhow::bail!("directory not empty: {base_path:?}");
}
}
InitForceMode::RemoveAllContents => {
println!("removing all contents of '{}'", base_path.display());
// instead of directly calling `remove_dir_all`, we keep the original dir but remove
// all contents inside. This helps if the developer symlinks another directory (e.g.,
// S3 local SSD) to the `.neon` base directory.
for entry in std::fs::read_dir(base_path)? {
let entry = entry?;
let path = entry.path();
if path.is_dir() {
fs::remove_dir_all(&path)?;
} else {
fs::remove_file(&path)?;
}
}
}
} else {
bail!(
"directory '{}' already exists. Perhaps already initialized? (Hint: use --force to remove all contents)",
base_path.display()
);
}
}
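The three `InitForceMode` variants give `neon_local init` distinct behaviours when the base directory already exists: refuse outright, accept only an empty directory, or wipe the contents while keeping the directory itself so a symlinked `.neon` stays valid. A condensed std-only sketch of that decision (error handling simplified):

use std::path::Path;

enum InitForceMode {
    MustNotExist,
    EmptyDirOk,
    RemoveAllContents,
}

fn prepare_base_dir(base: &Path, force: &InitForceMode) -> anyhow::Result<()> {
    if !base.exists() {
        return Ok(());
    }
    match force {
        InitForceMode::MustNotExist => anyhow::bail!("{} already exists", base.display()),
        InitForceMode::EmptyDirOk => {
            if std::fs::read_dir(base)?.next().is_some() {
                anyhow::bail!("{} is not empty", base.display());
            }
        }
        InitForceMode::RemoveAllContents => {
            // Remove the children but keep `base` itself, so a symlinked dir stays usable.
            for entry in std::fs::read_dir(base)? {
                let path = entry?.path();
                if path.is_dir() {
                    std::fs::remove_dir_all(&path)?;
                } else {
                    std::fs::remove_file(&path)?;
                }
            }
        }
    }
    Ok(())
}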


@@ -6,68 +6,36 @@
//!
use std::borrow::Cow;
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufReader, Write};
use std::io;
use std::io::Write;
use std::num::NonZeroU64;
use std::path::PathBuf;
use std::process::{Child, Command};
use std::{io, result};
use std::process::Command;
use std::time::Duration;
use anyhow::{bail, Context};
use pageserver_api::models::{self, TenantInfo, TimelineInfo};
use camino::Utf8PathBuf;
use futures::SinkExt;
use pageserver_api::models::{
self, LocationConfig, ShardParameters, TenantHistorySize, TenantInfo, TimelineInfo,
};
use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api;
use postgres_backend::AuthType;
use postgres_connection::{parse_host_port, PgConnectionConfig};
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use thiserror::Error;
use utils::auth::{Claims, Scope};
use utils::{
http::error::HttpErrorBody,
id::{TenantId, TimelineId},
lsn::Lsn,
};
use crate::attachment_service::{AttachmentService, NodeRegisterRequest};
use crate::local_env::PageServerConf;
use crate::{background_process, local_env::LocalEnv};
#[derive(Error, Debug)]
pub enum PageserverHttpError {
#[error("Reqwest error: {0}")]
Transport(#[from] reqwest::Error),
#[error("Error: {0}")]
Response(String),
}
impl From<anyhow::Error> for PageserverHttpError {
fn from(e: anyhow::Error) -> Self {
Self::Response(e.to_string())
}
}
type Result<T> = result::Result<T, PageserverHttpError>;
pub trait ResponseErrorMessageExt: Sized {
fn error_from_body(self) -> Result<Self>;
}
impl ResponseErrorMessageExt for Response {
fn error_from_body(self) -> Result<Self> {
let status = self.status();
if !(status.is_client_error() || status.is_server_error()) {
return Ok(self);
}
// reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = self.url().to_owned();
Err(PageserverHttpError::Response(
match self.json::<HttpErrorBody>() {
Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
},
))
}
}
/// Directory within .neon which will be used by default for LocalFs remote storage.
pub const PAGESERVER_REMOTE_STORAGE_DIR: &str = "local_fs_remote_storage/pageserver";
//
// Control routines for pageserver.
@@ -79,8 +47,7 @@ pub struct PageServerNode {
pub pg_connection_config: PgConnectionConfig,
pub conf: PageServerConf,
pub env: LocalEnv,
pub http_client: Client,
pub http_base_url: String,
pub http_client: mgmt_api::Client,
}
impl PageServerNode {
@@ -92,13 +59,26 @@ impl PageServerNode {
pg_connection_config: PgConnectionConfig::new_host_port(host, port),
conf: conf.clone(),
env: env.clone(),
http_client: Client::new(),
http_base_url: format!("http://{}/v1", conf.listen_http_addr),
http_client: mgmt_api::Client::new(
format!("http://{}", conf.listen_http_addr),
{
match conf.http_auth_type {
AuthType::Trust => None,
AuthType::NeonJWT => Some(
env.generate_auth_token(&Claims::new(None, Scope::PageServerApi))
.unwrap(),
),
}
}
.as_deref(),
),
}
}
// pageserver conf overrides defined by neon_local configuration.
fn neon_local_overrides(&self) -> Vec<String> {
/// Merge overrides provided by the user on the command line with our default overrides derived from neon_local configuration.
///
/// These all end up on the command line of the `pageserver` binary.
fn neon_local_overrides(&self, cli_overrides: &[&str]) -> Vec<String> {
let id = format!("id={}", self.conf.id);
// FIXME: the paths should be shell-escaped to handle paths with spaces, quotes, etc.
let pg_distrib_dir_param = format!(
@@ -129,6 +109,25 @@ impl PageServerNode {
"control_plane_api='{}'",
control_plane_api.as_str()
));
// Attachment service uses the same auth as pageserver: if JWT is enabled
// for us, we will also need it to talk to them.
if matches!(self.conf.http_auth_type, AuthType::NeonJWT) {
let jwt_token = self
.env
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))
.unwrap();
overrides.push(format!("control_plane_api_token='{}'", jwt_token));
}
}
if !cli_overrides
.iter()
.any(|c| c.starts_with("remote_storage"))
{
overrides.push(format!(
"remote_storage={{local_path='../{PAGESERVER_REMOTE_STORAGE_DIR}'}}"
));
}
if self.conf.http_auth_type != AuthType::Trust || self.conf.pg_auth_type != AuthType::Trust
@@ -137,6 +136,10 @@ impl PageServerNode {
// are one level below that, so refer to keys with ../
overrides.push("auth_validation_public_key_path='../auth_public_key.pem'".to_owned());
}
// Apply the user-provided overrides
overrides.extend(cli_overrides.iter().map(|&c| c.to_owned()));
overrides
}
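`neon_local_overrides` now assembles all pageserver config overrides in one place: locally derived defaults first, a fallback `remote_storage` setting only if the user has not supplied one, and the user's CLI overrides appended at the end. A small sketch of that assembly (the fallback value mirrors the diff's `PAGESERVER_REMOTE_STORAGE_DIR` default):

/// Assemble pageserver `-c` overrides: computed defaults, an optional
/// remote_storage fallback, then the user-provided CLI overrides.
fn assemble_overrides(defaults: &[&str], cli_overrides: &[&str]) -> Vec<String> {
    let mut overrides: Vec<String> = defaults.iter().map(|s| s.to_string()).collect();
    if !cli_overrides.iter().any(|c| c.starts_with("remote_storage")) {
        overrides
            .push("remote_storage={local_path='../local_fs_remote_storage/pageserver'}".to_string());
    }
    overrides.extend(cli_overrides.iter().map(|s| s.to_string()));
    overrides
}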
@@ -144,7 +147,7 @@ impl PageServerNode {
pub fn initialize(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
// First, run `pageserver --init` and wait for it to write a config into FS and exit.
self.pageserver_init(config_overrides)
.with_context(|| format!("Failed to run init for pageserver node {}", self.conf.id,))
.with_context(|| format!("Failed to run init for pageserver node {}", self.conf.id))
}
pub fn repo_path(&self) -> PathBuf {
@@ -154,12 +157,13 @@ impl PageServerNode {
/// The pid file is created by the pageserver process, with its pid stored inside.
/// Other pageservers cannot lock the same file and overwrite it for as long as the current
/// pageserver runs. (Unless someone removes the file manually; never do that!)
fn pid_file(&self) -> PathBuf {
self.repo_path().join("pageserver.pid")
fn pid_file(&self) -> Utf8PathBuf {
Utf8PathBuf::from_path_buf(self.repo_path().join("pageserver.pid"))
.expect("non-Unicode path")
}
pub fn start(&self, config_overrides: &[&str]) -> anyhow::Result<Child> {
self.start_node(config_overrides, false)
pub async fn start(&self, config_overrides: &[&str], register: bool) -> anyhow::Result<()> {
self.start_node(config_overrides, false, register).await
}
fn pageserver_init(&self, config_overrides: &[&str]) -> anyhow::Result<()> {
@@ -200,10 +204,13 @@ impl PageServerNode {
Ok(())
}
fn start_node(&self, config_overrides: &[&str], update_config: bool) -> anyhow::Result<Child> {
let mut overrides = self.neon_local_overrides();
overrides.extend(config_overrides.iter().map(|&c| c.to_owned()));
async fn start_node(
&self,
config_overrides: &[&str],
update_config: bool,
register: bool,
) -> anyhow::Result<()> {
// TODO: using a thread here because start_process() is not async but we need to call check_status()
let datadir = self.repo_path();
print!(
"Starting pageserver node {} at '{}' in {:?}",
@@ -211,7 +218,7 @@ impl PageServerNode {
self.pg_connection_config.raw_address(),
datadir
);
io::stdout().flush()?;
io::stdout().flush().context("flush stdout")?;
let datadir_path_str = datadir.to_str().with_context(|| {
format!(
@@ -223,20 +230,42 @@ impl PageServerNode {
if update_config {
args.push(Cow::Borrowed("--update-config"));
}
background_process::start_process(
"pageserver",
&datadir,
&self.env.pageserver_bin(),
args.iter().map(Cow::as_ref),
self.pageserver_env_variables()?,
background_process::InitialPidFile::Expect(&self.pid_file()),
|| match self.check_status() {
Ok(()) => Ok(true),
Err(PageserverHttpError::Transport(_)) => Ok(false),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
background_process::InitialPidFile::Expect(self.pid_file()),
|| async {
let st = self.check_status().await;
match st {
Ok(()) => Ok(true),
Err(mgmt_api::Error::ReceiveBody(_)) => Ok(false),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
}
},
)
.await?;
if register {
let attachment_service = AttachmentService::from_env(&self.env);
let (pg_host, pg_port) =
parse_host_port(&self.conf.listen_pg_addr).expect("Unable to parse listen_pg_addr");
let (http_host, http_port) = parse_host_port(&self.conf.listen_http_addr)
.expect("Unable to parse listen_http_addr");
attachment_service
.node_register(NodeRegisterRequest {
node_id: self.conf.id,
listen_pg_addr: pg_host.to_string(),
listen_pg_port: pg_port.unwrap_or(5432),
listen_http_addr: http_host.to_string(),
listen_http_port: http_port.unwrap_or(80),
})
.await?;
}
Ok(())
}
fn pageserver_basic_args<'a>(
@@ -246,8 +275,7 @@ impl PageServerNode {
) -> Vec<Cow<'a, str>> {
let mut args = vec![Cow::Borrowed("-D"), Cow::Borrowed(datadir_path_str)];
let mut overrides = self.neon_local_overrides();
overrides.extend(config_overrides.iter().map(|&c| c.to_owned()));
let overrides = self.neon_local_overrides(config_overrides);
for config_override in overrides {
args.push(Cow::Borrowed("-c"));
args.push(Cow::Owned(config_override));
@@ -283,7 +311,12 @@ impl PageServerNode {
background_process::stop_process(immediate, "pageserver", &self.pid_file())
}
pub fn page_server_psql_client(&self) -> anyhow::Result<postgres::Client> {
pub async fn page_server_psql_client(
&self,
) -> anyhow::Result<(
tokio_postgres::Client,
tokio_postgres::Connection<tokio_postgres::Socket, tokio_postgres::tls::NoTlsStream>,
)> {
let mut config = self.pg_connection_config.clone();
if self.conf.pg_auth_type == AuthType::NeonJWT {
let token = self
@@ -291,44 +324,18 @@ impl PageServerNode {
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
config = config.set_password(Some(token));
}
Ok(config.connect_no_tls()?)
Ok(config.connect_no_tls().await?)
}
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> anyhow::Result<RequestBuilder> {
let mut builder = self.http_client.request(method, url);
if self.conf.http_auth_type == AuthType::NeonJWT {
let token = self
.env
.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?;
builder = builder.bearer_auth(token)
}
Ok(builder)
pub async fn check_status(&self) -> mgmt_api::Result<()> {
self.http_client.status().await
}
pub fn check_status(&self) -> Result<()> {
self.http_request(Method::GET, format!("{}/status", self.http_base_url))?
.send()?
.error_from_body()?;
Ok(())
pub async fn tenant_list(&self) -> mgmt_api::Result<Vec<TenantInfo>> {
self.http_client.list_tenants().await
}
pub fn tenant_list(&self) -> Result<Vec<TenantInfo>> {
Ok(self
.http_request(Method::GET, format!("{}/tenant", self.http_base_url))?
.send()?
.error_from_body()?
.json()?)
}
pub fn tenant_create(
&self,
new_tenant_id: TenantId,
generation: Option<u32>,
settings: HashMap<&str, &str>,
) -> anyhow::Result<TenantId> {
let mut settings = settings.clone();
let config = models::TenantConfig {
pub fn parse_config(mut settings: HashMap<&str, &str>) -> anyhow::Result<models::TenantConfig> {
let result = models::TenantConfig {
checkpoint_distance: settings
.remove("checkpoint_distance")
.map(|x| x.parse::<u64>())
@@ -387,33 +394,41 @@ impl PageServerNode {
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'gc_feedback' as bool")?,
heatmap_period: settings.remove("heatmap_period").map(|x| x.to_string()),
lazy_slru_download: settings
.remove("lazy_slru_download")
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'lazy_slru_download' as bool")?,
};
if !settings.is_empty() {
bail!("Unrecognized tenant settings: {settings:?}")
} else {
Ok(result)
}
}
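`parse_config` consumes the settings map one key at a time (remove, parse, `transpose`, add context) and then rejects any keys that are left over, so a mistyped setting name fails loudly instead of being silently ignored. A reduced sketch of the same pattern for two illustrative settings (the struct and field names are stand-ins, not the real `models::TenantConfig`):

use std::collections::HashMap;

use anyhow::{bail, Context};

#[derive(Debug)]
struct ExampleConfig {
    checkpoint_distance: Option<u64>,
    lazy_download: Option<bool>,
}

fn parse_settings(mut settings: HashMap<&str, &str>) -> anyhow::Result<ExampleConfig> {
    let config = ExampleConfig {
        checkpoint_distance: settings
            .remove("checkpoint_distance")
            .map(|x| x.parse::<u64>())
            .transpose()
            .context("Failed to parse 'checkpoint_distance' as u64")?,
        lazy_download: settings
            .remove("lazy_download")
            .map(|x| x.parse::<bool>())
            .transpose()
            .context("Failed to parse 'lazy_download' as bool")?,
    };
    if !settings.is_empty() {
        bail!("Unrecognized tenant settings: {settings:?}")
    }
    Ok(config)
}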
pub async fn tenant_create(
&self,
new_tenant_id: TenantId,
generation: Option<u32>,
settings: HashMap<&str, &str>,
) -> anyhow::Result<TenantId> {
let config = Self::parse_config(settings.clone())?;
let request = models::TenantCreateRequest {
new_tenant_id,
new_tenant_id: TenantShardId::unsharded(new_tenant_id),
generation,
config,
shard_parameters: ShardParameters::default(),
};
if !settings.is_empty() {
bail!("Unrecognized tenant settings: {settings:?}")
}
self.http_request(Method::POST, format!("{}/tenant", self.http_base_url))?
.json(&request)
.send()?
.error_from_body()?
.json::<Option<String>>()
.with_context(|| {
format!("Failed to parse tenant creation response for tenant id: {new_tenant_id:?}")
})?
.context("No tenant id was found in the tenant creation response")
.and_then(|tenant_id_string| {
tenant_id_string.parse().with_context(|| {
format!("Failed to parse response string as tenant id: '{tenant_id_string}'")
})
})
Ok(self.http_client.tenant_create(&request).await?)
}
pub fn tenant_config(
pub async fn tenant_config(
&self,
tenant_id: TenantId,
mut settings: HashMap<&str, &str>,
@@ -484,6 +499,12 @@ impl PageServerNode {
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'gc_feedback' as bool")?,
heatmap_period: settings.remove("heatmap_period").map(|x| x.to_string()),
lazy_slru_download: settings
.remove("lazy_slru_download")
.map(|x| x.parse::<bool>())
.transpose()
.context("Failed to parse 'lazy_slru_download' as bool")?,
}
};
@@ -491,59 +512,59 @@ impl PageServerNode {
bail!("Unrecognized tenant settings: {settings:?}")
}
self.http_request(Method::PUT, format!("{}/tenant/config", self.http_base_url))?
.json(&models::TenantConfigRequest { tenant_id, config })
.send()?
.error_from_body()?;
self.http_client
.tenant_config(&models::TenantConfigRequest { tenant_id, config })
.await?;
Ok(())
}
pub fn timeline_list(&self, tenant_id: &TenantId) -> anyhow::Result<Vec<TimelineInfo>> {
let timeline_infos: Vec<TimelineInfo> = self
.http_request(
Method::GET,
format!("{}/tenant/{}/timeline", self.http_base_url, tenant_id),
)?
.send()?
.error_from_body()?
.json()?;
Ok(timeline_infos)
pub async fn location_config(
&self,
tenant_shard_id: TenantShardId,
config: LocationConfig,
flush_ms: Option<Duration>,
) -> anyhow::Result<()> {
Ok(self
.http_client
.location_config(tenant_shard_id, config, flush_ms)
.await?)
}
pub fn timeline_create(
pub async fn timeline_list(
&self,
tenant_id: TenantId,
new_timeline_id: Option<TimelineId>,
tenant_shard_id: &TenantShardId,
) -> anyhow::Result<Vec<TimelineInfo>> {
Ok(self.http_client.list_timelines(*tenant_shard_id).await?)
}
pub async fn tenant_secondary_download(&self, tenant_id: &TenantShardId) -> anyhow::Result<()> {
Ok(self
.http_client
.tenant_secondary_download(*tenant_id)
.await?)
}
pub async fn timeline_create(
&self,
tenant_shard_id: TenantShardId,
new_timeline_id: TimelineId,
ancestor_start_lsn: Option<Lsn>,
ancestor_timeline_id: Option<TimelineId>,
pg_version: Option<u32>,
existing_initdb_timeline_id: Option<TimelineId>,
) -> anyhow::Result<TimelineInfo> {
// If timeline ID was not specified, generate one
let new_timeline_id = new_timeline_id.unwrap_or(TimelineId::generate());
self.http_request(
Method::POST,
format!("{}/tenant/{}/timeline", self.http_base_url, tenant_id),
)?
.json(&models::TimelineCreateRequest {
let req = models::TimelineCreateRequest {
new_timeline_id,
ancestor_start_lsn,
ancestor_timeline_id,
pg_version,
})
.send()?
.error_from_body()?
.json::<Option<TimelineInfo>>()
.with_context(|| {
format!("Failed to parse timeline creation response for tenant id: {tenant_id}")
})?
.with_context(|| {
format!(
"No timeline id was found in the timeline creation response for tenant {tenant_id}"
)
})
existing_initdb_timeline_id,
};
Ok(self
.http_client
.timeline_create(tenant_shard_id, &req)
.await?)
}
/// Import a basebackup prepared using either:
@@ -555,7 +576,7 @@ impl PageServerNode {
/// * `timeline_id` - id to assign to imported timeline
/// * `base` - (start lsn of basebackup, path to `base.tar` file)
/// * `pg_wal` - if there's any wal to import: (end lsn, path to `pg_wal.tar`)
pub fn timeline_import(
pub async fn timeline_import(
&self,
tenant_id: TenantId,
timeline_id: TimelineId,
@@ -563,38 +584,72 @@ impl PageServerNode {
pg_wal: Option<(Lsn, PathBuf)>,
pg_version: u32,
) -> anyhow::Result<()> {
let mut client = self.page_server_psql_client()?;
let (client, conn) = self.page_server_psql_client().await?;
// The connection object performs the actual communication with the database,
// so spawn it off to run on its own.
tokio::spawn(async move {
if let Err(e) = conn.await {
eprintln!("connection error: {}", e);
}
});
tokio::pin!(client);
// Init base reader
let (start_lsn, base_tarfile_path) = base;
let base_tarfile = File::open(base_tarfile_path)?;
let mut base_reader = BufReader::new(base_tarfile);
let base_tarfile = tokio::fs::File::open(base_tarfile_path).await?;
let base_tarfile = tokio_util::io::ReaderStream::new(base_tarfile);
// Init wal reader if necessary
let (end_lsn, wal_reader) = if let Some((end_lsn, wal_tarfile_path)) = pg_wal {
let wal_tarfile = File::open(wal_tarfile_path)?;
let wal_reader = BufReader::new(wal_tarfile);
let wal_tarfile = tokio::fs::File::open(wal_tarfile_path).await?;
let wal_reader = tokio_util::io::ReaderStream::new(wal_tarfile);
(end_lsn, Some(wal_reader))
} else {
(start_lsn, None)
};
// Import base
let import_cmd = format!(
"import basebackup {tenant_id} {timeline_id} {start_lsn} {end_lsn} {pg_version}"
);
let mut writer = client.copy_in(&import_cmd)?;
io::copy(&mut base_reader, &mut writer)?;
writer.finish()?;
let copy_in = |reader, cmd| {
let client = &client;
async move {
let writer = client.copy_in(&cmd).await?;
let writer = std::pin::pin!(writer);
let mut writer = writer.sink_map_err(|e| {
std::io::Error::new(std::io::ErrorKind::Other, format!("{e}"))
});
let mut reader = std::pin::pin!(reader);
writer.send_all(&mut reader).await?;
writer.into_inner().finish().await?;
anyhow::Ok(())
}
};
// Import base
copy_in(
base_tarfile,
format!(
"import basebackup {tenant_id} {timeline_id} {start_lsn} {end_lsn} {pg_version}"
),
)
.await?;
// Import wal if necessary
if let Some(mut wal_reader) = wal_reader {
let import_cmd = format!("import wal {tenant_id} {timeline_id} {start_lsn} {end_lsn}");
let mut writer = client.copy_in(&import_cmd)?;
io::copy(&mut wal_reader, &mut writer)?;
writer.finish()?;
if let Some(wal_reader) = wal_reader {
copy_in(
wal_reader,
format!("import wal {tenant_id} {timeline_id} {start_lsn} {end_lsn}"),
)
.await?;
}
Ok(())
}
pub async fn tenant_synthetic_size(
&self,
tenant_shard_id: TenantShardId,
) -> anyhow::Result<TenantHistorySize> {
Ok(self
.http_client
.tenant_synthetic_size(tenant_shard_id)
.await?)
}
}

View File

@@ -7,12 +7,11 @@
//! ```
use std::io::Write;
use std::path::PathBuf;
use std::process::Child;
use std::{io, result};
use anyhow::Context;
use camino::Utf8PathBuf;
use postgres_connection::PgConnectionConfig;
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use thiserror::Error;
use utils::{http::error::HttpErrorBody, id::NodeId};
@@ -33,12 +32,14 @@ pub enum SafekeeperHttpError {
type Result<T> = result::Result<T, SafekeeperHttpError>;
#[async_trait::async_trait]
pub trait ResponseErrorMessageExt: Sized {
fn error_from_body(self) -> Result<Self>;
async fn error_from_body(self) -> Result<Self>;
}
impl ResponseErrorMessageExt for Response {
fn error_from_body(self) -> Result<Self> {
#[async_trait::async_trait]
impl ResponseErrorMessageExt for reqwest::Response {
async fn error_from_body(self) -> Result<Self> {
let status = self.status();
if !(status.is_client_error() || status.is_server_error()) {
return Ok(self);
@@ -47,7 +48,7 @@ impl ResponseErrorMessageExt for Response {
// reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = self.url().to_owned();
Err(SafekeeperHttpError::Response(
match self.json::<HttpErrorBody>() {
match self.json::<HttpErrorBody>().await {
Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
},
@@ -68,7 +69,7 @@ pub struct SafekeeperNode {
pub pg_connection_config: PgConnectionConfig,
pub env: LocalEnv,
pub http_client: Client,
pub http_client: reqwest::Client,
pub http_base_url: String,
}
@@ -79,7 +80,7 @@ impl SafekeeperNode {
conf: conf.clone(),
pg_connection_config: Self::safekeeper_connection_config(conf.pg_port),
env: env.clone(),
http_client: Client::new(),
http_client: reqwest::Client::new(),
http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port),
}
}
@@ -97,11 +98,12 @@ impl SafekeeperNode {
SafekeeperNode::datadir_path_by_id(&self.env, self.id)
}
pub fn pid_file(&self) -> PathBuf {
self.datadir_path().join("safekeeper.pid")
pub fn pid_file(&self) -> Utf8PathBuf {
Utf8PathBuf::from_path_buf(self.datadir_path().join("safekeeper.pid"))
.expect("non-Unicode path")
}
pub fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<Child> {
pub async fn start(&self, extra_opts: Vec<String>) -> anyhow::Result<()> {
print!(
"Starting safekeeper at '{}' in '{}'",
self.pg_connection_config.raw_address(),
@@ -189,13 +191,16 @@ impl SafekeeperNode {
&self.env.safekeeper_bin(),
&args,
[],
background_process::InitialPidFile::Expect(&self.pid_file()),
|| match self.check_status() {
Ok(()) => Ok(true),
Err(SafekeeperHttpError::Transport(_)) => Ok(false),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
background_process::InitialPidFile::Expect(self.pid_file()),
|| async {
match self.check_status().await {
Ok(()) => Ok(true),
Err(SafekeeperHttpError::Transport(_)) => Ok(false),
Err(e) => Err(anyhow::anyhow!("Failed to check node status: {e}")),
}
},
)
.await
}
///
@@ -214,7 +219,7 @@ impl SafekeeperNode {
)
}
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> RequestBuilder {
fn http_request<U: IntoUrl>(&self, method: Method, url: U) -> reqwest::RequestBuilder {
// TODO: authentication
//if self.env.auth_type == AuthType::NeonJWT {
// builder = builder.bearer_auth(&self.env.safekeeper_auth_token)
@@ -222,10 +227,12 @@ impl SafekeeperNode {
self.http_client.request(method, url)
}
pub fn check_status(&self) -> Result<()> {
pub async fn check_status(&self) -> Result<()> {
self.http_request(Method::GET, format!("{}/{}", self.http_base_url, "status"))
.send()?
.error_from_body()?;
.send()
.await?
.error_from_body()
.await?;
Ok(())
}
}

View File

@@ -23,7 +23,7 @@ vulnerability = "deny"
unmaintained = "warn"
yanked = "warn"
notice = "warn"
ignore = ["RUSTSEC-2023-0052"]
ignore = []
# This section is considered when running `cargo deny check licenses`
# More documentation for the licenses section can be found here:
@@ -35,6 +35,7 @@ allow = [
"Artistic-2.0",
"BSD-2-Clause",
"BSD-3-Clause",
"CC0-1.0",
"ISC",
"MIT",
"MPL-2.0",
@@ -74,10 +75,30 @@ highlight = "all"
workspace-default-features = "allow"
external-default-features = "allow"
allow = []
deny = []
skip = []
skip-tree = []
[[bans.deny]]
# we use tokio, the same rationale applies for async-{io,waker,global-executor,executor,channel,lock}, smol
# if you find yourself here while adding a dependency, try "default-features = false", ask around on #rust
name = "async-std"
[[bans.deny]]
name = "async-io"
[[bans.deny]]
name = "async-waker"
[[bans.deny]]
name = "async-global-executor"
[[bans.deny]]
name = "async-executor"
[[bans.deny]]
name = "smol"
# This section is considered when running `cargo deny check sources`.
# More documentation about the 'sources' section can be found here:
# https://embarkstudios.github.io/cargo-deny/checks/sources/cfg.html

9
diesel.toml Normal file
View File

@@ -0,0 +1,9 @@
# For documentation on how to configure this file,
# see https://diesel.rs/guides/configuring-diesel-cli
[print_schema]
file = "control_plane/attachment_service/src/schema.rs"
custom_type_derives = ["diesel::query_builder::QueryId"]
[migrations_directory]
dir = "control_plane/attachment_service/migrations"

View File

@@ -25,7 +25,7 @@
},
{
"name": "wal_level",
"value": "replica",
"value": "logical",
"vartype": "enum"
},
{

View File

@@ -21,7 +21,7 @@ We build all images after a successful `release` tests run and push automaticall
## Docker Compose example
You can see a [docker compose](https://docs.docker.com/compose/) example to create a neon cluster in [/docker-compose/docker-compose.yml](/docker-compose/docker-compose.yml). It creates the following conatainers.
You can see a [docker compose](https://docs.docker.com/compose/) example to create a neon cluster in [/docker-compose/docker-compose.yml](/docker-compose/docker-compose.yml). It creates the following containers.
- pageserver x 1
- safekeeper x 3
@@ -38,7 +38,7 @@ You can specify version of neon cluster using following environment values.
- TAG: the tag version of [docker image](https://registry.hub.docker.com/r/neondatabase/neon/tags) (default is latest), which is tagged in [CI test](/.github/workflows/build_and_test.yml)
```
$ cd docker-compose/
$ docker-compose down # remove the conainers if exists
$ docker-compose down # remove the containers if exists
$ PG_VERSION=15 TAG=2937 docker-compose up --build -d # You can specify the postgres and image version
Creating network "dockercompose_default" with the default driver
Creating docker-compose_storage_broker_1 ... done

View File

@@ -188,11 +188,60 @@ that.
## Error message style
### PostgreSQL extensions
PostgreSQL has a style guide for writing error messages:
https://www.postgresql.org/docs/current/error-style-guide.html
Follow that guide when writing error messages in the PostgreSQL
extension. We don't follow it strictly in the pageserver and
safekeeper, but the advice in the PostgreSQL style guide is generally
good, and you can't go wrong by following it.
extensions.
### Neon Rust code
#### Anyhow Context
When adding anyhow `context()`, use the form `present-tense-verb+action`.
Example:
- Bad: `file.metadata().context("could not get file metadata")?;`
- Good: `file.metadata().context("get file metadata")?;`
#### Logging Errors
When logging any error `e`, use `could not {e:#}` or `failed to {e:#}`.
If `e` is an `anyhow` error and you want to log the backtrace that it contains,
use `{e:?}` instead of `{e:#}`.
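For illustration, here is a minimal sketch contrasting the two forms (the message text and function are made up for the example):
```rust
fn log_error(e: &anyhow::Error) {
    // `{e:#}` prints the concatenated context chain, e.g. "list users: load users from server: invalid json".
    tracing::error!("failed to {e:#}");
    // `{e:?}` additionally prints the backtrace captured by the anyhow error.
    tracing::error!("failed to {e:?}");
}
```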
#### Rationale
The `{:#}` ("alternate Display") of an `anyhow` error chain is the concatenation of the contexts, joined with `: `.
For example, the following Rust code will result in the output
```
ERROR failed to list users: load users from server: parse response: invalid json
```
This is more concise / less noisy than what happens if you do `.context("could not ...")?` at each level, i.e.:
```
ERROR could not list users: could not load users from server: could not parse response: invalid json
```
```rust
fn main() {
match list_users().context("list users") {
Ok(_) => ...,
Err(e) => tracing::error!("failed to {e:#}"),
}
}
fn list_users() {
http_get_users().context("load users from server")?;
}
fn http_get_users() {
let response = client....?;
response.parse().context("parse response")?; // fails with serde error "invalid json"
}
```

View File

@@ -96,6 +96,16 @@ prefix_in_bucket = '/test_prefix/'
`AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env variables can be used to specify the S3 credentials if needed.
or
```toml
[remote_storage]
container_name = 'some-container-name'
container_region = 'us-east'
prefix_in_container = '/test-prefix/'
```
`AZURE_STORAGE_ACCOUNT` and `AZURE_STORAGE_ACCESS_KEY` env variables can be used to specify the Azure credentials if needed.
## Repository background tasks

View File

@@ -64,7 +64,7 @@ Storage.
The LayerMap tracks what layers exist in a timeline.
Currently, the layer map is just a resizeable array (Vec). On a GetPage@LSN or
Currently, the layer map is just a resizable array (Vec). On a GetPage@LSN or
other read request, the layer map scans through the array to find the right layer
that contains the data for the requested page. The read-code in LayeredTimeline
is aware of the ancestor, and returns data from the ancestor timeline if it's

View File

@@ -22,7 +22,7 @@ timeline to shutdown. It will also wait for them to finish.
A task registered in the task registry can check if it has been
requested to shut down, by calling `is_shutdown_requested()`. There's
also a `shudown_watcher()` Future that can be used with `tokio::select!`
also a `shutdown_watcher()` Future that can be used with `tokio::select!`
or similar, to wake up on shutdown.
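Purely as an illustration of that pattern, a minimal sketch, assuming `shutdown_watcher()` resolves when shutdown is requested for the current task (as described above) and using a made-up `do_one_iteration()` placeholder for the task's real work:
```rust
async fn background_loop() {
    loop {
        tokio::select! {
            _ = shutdown_watcher() => {
                // Shutdown was requested: stop promptly instead of starting more work.
                break;
            }
            _ = do_one_iteration() => {
                // One unit of work finished; loop around and start the next one.
            }
        }
    }
}
```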

View File

@@ -74,4 +74,4 @@ somewhat wasteful, but because most WAL records only affect one page,
the overhead is acceptable.
The WAL redo always happens for one particular page. If the WAL record
coantains changes to other pages, they are ignored.
contains changes to other pages, they are ignored.

View File

@@ -0,0 +1,420 @@
# Splitting cloud console
Created on 17.06.2022
## Summary
Currently we have a `cloud` repository that contains the code implementing the public API for our clients as well as the code for managing storage and internal infrastructure services. We can split everything user-related from everything storage-related to make it easier to test and maintain.
This RFC proposes to introduce a new control-plane service with an HTTP API. The overall architecture will look like this:
```markup
. x
external area x internal area
(our clients) x (our services)
x
x ┌───────────────────────┐
x ┌───────────────┐ > ┌─────────────────────┐ │ Storage (EC2) │
x │ console db │ > │ control-plane db │ │ │
x └───────────────┘ > └─────────────────────┘ │ - safekeepers │
x ▲ > ▲ │ - pageservers │
x │ > │ │ │
┌──────────────────┐ x ┌───────┴───────┐ > │ │ Dependencies │
│ browser UI ├──►│ │ > ┌──────────┴──────────┐ │ │
└──────────────────┘ x │ │ > │ │ │ - etcd │
x │ console ├───────►│ control-plane ├────►│ - S3 │
┌──────────────────┐ x │ │ > │ (deployed in k8s) │ │ - more? │
│public API clients├──►│ │ > │ │ │ │
└──────────────────┘ x └───────┬───────┘ > └──────────┬──────────┘ └───────────────────────┘
x │ > ▲ │ ▲
x │ > │ │ │
x ┌───────┴───────┐ > │ │ ┌───────────┴───────────┐
x │ dependencies │ > │ │ │ │
x │- analytics │ > │ └───────────────►│ computes │
x │- auth │ > │ │ (deployed in k8s) │
x │- billing │ > │ │ │
x └───────────────┘ > │ └───────────────────────┘
x > │ ▲
x > ┌─────┴───────────────┐ │
┌──────────────────┐ x > │ │ │
│ │ x > │ proxy ├─────────────────┘
│ postgres ├───────────────────────────►│ (deployed in k8s) │
│ users │ x > │ │
│ │ x > └─────────────────────┘
└──────────────────┘ x >
>
>
closed-source > open-source
>
>
```
Notes:
- diagram is simplified in the less-important places
- directed arrows are strict and mean that connections in the reverse direction are forbidden
This split is quite complex and this RFC proposes several smaller steps to achieve the larger goal:
1. Start by refactoring the console code; the goal is to have console and control-plane code in different directories without dependencies on each other.
2. Do similar refactoring for tables in the console database, remove queries selecting data from both console and control-plane; move control-plane tables to a separate database.
3. Implement control-plane HTTP API serving on a separate TCP port; make all console→control-plane calls to go through that HTTP API.
4. Move control-plane source code to the neon repo; start control-plane as a separate service.
## Motivation
These are the two most important problems we want to solve:
- Publish open-source implementation of all our cloud/storage features
- Make a unified control-plane that is used in all cloud (serverless) and local (tests) setups
Right now we have some closed-source code in the cloud repo. That code contains the implementation for running Neon computes in k8s, and without it it's impossible to automatically scale PostgreSQL computes. That means we don't have an open-source serverless PostgreSQL at the moment.
After splitting out and open-sourcing the control-plane service we will have source code and Docker images for all storage services. That control-plane service should expose an HTTP API for creating and managing tenants (including all our storage features), while the proxy will listen for incoming connections and create computes on demand.
Improving our test suite is an important task, but it requires a lot of prerequisites and may need a separate RFC. A possible implementation is described in the section [Next steps](#next-steps).
Another piece of motivation is better involvement of the storage development team in the control-plane. By splitting the control-plane from the console, it becomes more convenient to test and develop the control-plane while paying less attention to "business" features, such as user management, billing and analytics.
For example, the console currently requires authentication providers such as GitHub OAuth to work at all, as well as nodejs to build locally. It will be more convenient to build and run it locally without these requirements.
## Proposed implementation
### Current state of things
Let's start by defining the current state of things at the moment of this proposal. We have three repositories containing source code:
- open-source `postgres` — our fork of postgres
- open-source `neon` — our main repository for storage source code
- closed-source `cloud` — mostly console backend and UI frontend
This proposal does not aim to change any existing code in the `neon` and `postgres` repositories, but to create the control-plane service and move its source code from `cloud` to the `neon` repository. That means we only need to split code in the `cloud` repo, so we will consider only this repository when exploring source code.
Let's look at the miscellaneous things in the `cloud` repo which are NOT part of the console application, i.e. NOT the Go source code that is compiled into the `./console` binary. There we have:
- command-line tools, such as cloudbench, neonadmin
- markdown documentation
- cloud operations scripts (helm, terraform, ansible)
- configs and other things
- e2e python tests
- incidents playbooks
- UI frontend
- Make build scripts, code generation scripts
- database migrations
- swagger definitions
Let's also take a look at what we have in the console source code, which is the service we'd like to split:
- API Servers
- Public API v2
- Management API v2
- Public API v1
- Admin API v1 (same port as Public API v1)
- Management API v1
- Workers
- Monitor Compute Activity
- Watch Failed Operations
- Availability Checker
- Business Metrics Collector
- Internal Services
- Auth Middleware, UserIsAdmin, Cookies
- Cable Websocket Server
- Admin Services
- Global Settings, Operations, Pageservers, Platforms, Projects, Safekeepers, Users
- Authenticate Proxy
- API Keys
- App Controller, serving UI HTML
- Auth Controller
- Branches
- Projects
- Psql Connect + Passwordless login
- Users
- Cloud Metrics
- User Metrics
- Invites
- Pageserver/Safekeeper management
- Operations, k8s/docker/common logic
- Platforms, Regions
- Project State
- Projects Roles, SCRAM
- Global Settings
- Other things
- segment analytics integration
- sentry integration
- other common utilities packages
### Drawing the splitting line
The most challenging and most important thing is to define the line that will split the new control-plane service from the existing cloud service. If we don't get it right, we can end up with a lot more issues without many benefits.
We propose to define that line as follows:
- everything user-related stays in the console service
- everything storage-related should be in the control-plane service
- something that falls in between should be decided where to go, but most likely should stay in the console service
- some similar parts should be in both services, such as admin/management/db_migrations
We call user-related all requests that can be connected to some user. The general idea is to not have any user_id in the control-plane service and to operate exclusively on tenant_id+timeline_id, the same way existing storage services (compute, safekeeper, pageserver) work now.
Storage-related things can be defined as doing any of the following:
- using k8s API
- doing requests to any of the storage services (proxy, compute, safekeeper, pageserver, etc..)
- tracking current status of tenants/timelines, managing lifetime of computes
Based on that idea, we can say that new control-plane service should have the following components:
- single HTTP API for everything
- Create and manage tenants and timelines
- Manage global settings and storage configuration (regions, platforms, safekeepers, pageservers)
- Admin API for storage health inspection and debugging
- Workers
- Monitor Compute Activity
- Watch Failed Operations
- Availability Checker
- Internal Services
- Admin Services
- Global Settings, Operations, Pageservers, Platforms, Tenants, Safekeepers
- Authenticate Proxy
- Branches
- Psql Connect
- Cloud Metrics
- Pageserver/Safekeeper management
- Operations, k8s/docker/common logic
- Platforms, Regions
- Tenant State
- Compute Roles, SCRAM
- Global Settings
---
And other components should probably stay in the console service:
- API Servers (no changes here)
- Public API v2
- Management API v2
- Public API v1
- Admin API v1 (same port as Public API v1)
- Management API v1
- Workers
- Business Metrics Collector
- Internal Services
- Auth Middleware, UserIsAdmin, Cookies
- Cable Websocket Server
- Admin Services
- Users admin stays the same
- Other admin services can redirect requests to the control-plane
- API Keys
- App Controller, serving UI HTML
- Auth Controller
- Projects
- User Metrics
- Invites
- Users
- Passwordless login
- Other things
- segment analytics integration
- sentry integration
- other common utilities packages
There are also miscellaneous things that are useful for all kinds of services. So we can say that these things can be in both services:
- markdown documentation
- e2e python tests
- make build scripts, code generation scripts
- database migrations
- swagger definitions
The single entrypoint to the storage should be the control-plane API. After we define that API, we can have code-generated implementations for the client and for the server. The general idea is to move the code implementing storage components from the console into the API implementation inside the new control-plane service.
After the code is moved to the new service, we can fill the created void by making API calls to the new service:
- authorization of the client
- mapping user_id + project_id to the tenant_id
- calling the control-plane API
### control-plane API
Currently we have the following projects API in the console:
```
GET /projects/{project_id}
PATCH /projects/{project_id}
POST /projects/{project_id}/branches
GET /projects/{project_id}/databases
POST /projects/{project_id}/databases
GET /projects/{project_id}/databases/{database_id}
PUT /projects/{project_id}/databases/{database_id}
DELETE /projects/{project_id}/databases/{database_id}
POST /projects/{project_id}/delete
GET /projects/{project_id}/issue_token
GET /projects/{project_id}/operations
GET /projects/{project_id}/operations/{operation_id}
POST /projects/{project_id}/query
GET /projects/{project_id}/roles
POST /projects/{project_id}/roles
GET /projects/{project_id}/roles/{role_name}
DELETE /projects/{project_id}/roles/{role_name}
POST /projects/{project_id}/roles/{role_name}/reset_password
POST /projects/{project_id}/start
POST /projects/{project_id}/stop
POST /psql_session/{psql_session_id}
```
It looks fine and we probably already have clients relying on it, so we should not change it, at least for now. But most of these endpoints (if not all) are related to storage, which suggests what the control-plane API should look like:
```
GET /tenants/{tenant_id}
PATCH /tenants/{tenant_id}
POST /tenants/{tenant_id}/branches
GET /tenants/{tenant_id}/databases
POST /tenants/{tenant_id}/databases
GET /tenants/{tenant_id}/databases/{database_id}
PUT /tenants/{tenant_id}/databases/{database_id}
DELETE /tenants/{tenant_id}/databases/{database_id}
POST /tenants/{tenant_id}/delete
GET /tenants/{tenant_id}/issue_token
GET /tenants/{tenant_id}/operations
GET /tenants/{tenant_id}/operations/{operation_id}
POST /tenants/{tenant_id}/query
GET /tenants/{tenant_id}/roles
POST /tenants/{tenant_id}/roles
GET /tenants/{tenant_id}/roles/{role_name}
DELETE /tenants/{tenant_id}/roles/{role_name}
POST /tenants/{tenant_id}/roles/{role_name}/reset_password
POST /tenants/{tenant_id}/start
POST /tenants/{tenant_id}/stop
POST /psql_session/{psql_session_id}
```
One option here is to use gRPC instead of HTTP, which has some useful features, but there are some strong points in favor of plain HTTP:
- HTTP API is easier to use for the clients
- we already have HTTP API in pageserver/safekeeper/console
- we probably want control-plane API to be similar to the console API, available in the cloud
### Getting updates from the storage
There are valid cases where we would like to know what has changed in the storage. For example, the console might want to know when a user queried and started a compute and when that compute was scaled to zero afterwards, to know how much the user should pay for the service. Another example is getting info about reaching disk space limits. Yet another example is analytics, such as how many users had at least one active project in a month.
All of the above can happen without using the console, just by accessing a compute through the proxy.
To solve this, we can keep a log of events occurring in the storage (event log). That is very similar to the operations table we have right now; the only difference is that events are immutable and we cannot change them after saving them to the database. For example, we might want to have events for the following activities:
- We finished processing some HTTP API query, such as resetting the password
- We changed some state, such as started or stopped a compute
- Operation is created
- Operation is started for the first time
- Operation is failed for the first time
- Operation is finished
Once we save these events to the database, we can create an HTTP API to subscribe to them. That API could look like this:
```
GET /events/<cursor>
{
"events": [...],
"next_cursor": 123
}
```
It should be possible to replay the event log from some point in time to reconstruct the state of almost anything in the storage services. That means that if we maintain some state in the control-plane database and have a reason to keep the same state in the console database, we can do so by polling events from the control-plane API and updating the console database according to the events.
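As an illustration only, a minimal sketch of such a polling consumer, assuming the `GET /events/<cursor>` endpoint above; `EventsPage`, `follow_events` and `apply_event` are made-up names, not an existing interface:
```rust
use serde::Deserialize;
use std::time::Duration;

// One page of the `GET /events/<cursor>` response shown above.
// Event payloads are kept opaque here; real schemas are out of scope.
#[derive(Deserialize)]
struct EventsPage {
    events: Vec<serde_json::Value>,
    next_cursor: u64,
}

// Poll the control-plane event log and hand every event to `apply_event`
// (e.g. a function that updates the console database).
async fn follow_events(
    client: &reqwest::Client,
    base_url: &str,
    mut cursor: u64,
    apply_event: impl Fn(&serde_json::Value),
) -> anyhow::Result<()> {
    loop {
        let page: EventsPage = client
            .get(format!("{}/events/{}", base_url, cursor))
            .send()
            .await?
            .error_for_status()?
            .json()
            .await?;
        for event in &page.events {
            apply_event(event);
        }
        cursor = page.next_cursor;
        // Pause between polls; a real consumer might only back off when no new events arrived.
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```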
### Next steps
After implementing the control-plane HTTP API and starting the control-plane as a separate service, we might want to think about exploiting the benefits of the new architecture, such as reorganizing the test infrastructure. Possible options are listed in [Next steps](#next-steps-1).
## Non Goals
This RFC doesn't cover the actual cloud deployment scripts and schemas, such as terraform, ansible, k8s yamls and so on.
## Impacted components
Mostly the console, but it can also affect some storage services.
## Scalability
We should support starting several instances of the new control-plane service at the same time.
At the same time, it should be possible to run only a single instance of the control-plane, which is useful for local tests.
## Security implications
The new control-plane service is an internal service, so no external requests can reach it. At the same time, it contains an API to do absolutely anything with any of the tenants. That means a bad internal actor could potentially read and write all of the tenants. To make this safer, we can use one of these options:
- A simple option is to protect all requests with a single private key, so that no one can make requests without having that one key.
- Another option is to have a separate token for every tenant and store these tokens in another secure place. This way it's harder to access all tenants at once, because they have different tokens.
## Alternative implementation
There was an idea to create a k8s operator for managing storage services and computes, but the author of this RFC is not really familiar with it.
As for less radical alternatives, there are other options for the name of the new control-plane service:
- storage-ctl
- cloud
- cloud-ctl
## Pros/cons of proposed approaches (TODO)
Pros:
- All storage features are completely open-source
- Better tests coverage, less difference between cloud and local setups
- Easier to develop storage and cloud features, because there is no need to set up the console for that
- Easier to deploy storage-only services to any cloud
Cons:
- All storage features are completely open-source
- Distributed services mean more code to connect different services and potential network issues
- The console needs to depend on the storage API; there can be complications when developing a new feature in a branch
- More code to JOIN data from different services (console and control-plane)
## Definition of Done
We have a new control-plane service running in k8s. The source code for that control-plane service is located in the open-source neon repo.
## Next steps
After we've reached the DoD, we can make further improvements.
The first thing that can benefit from the split is local testing. The same control-plane service can implement starting computes as local processes instead of k8s deployments. If it also supports starting pageservers/safekeepers/proxy for the local setup, then it can completely replace the `./neon_local` binary, which is currently used for testing. The local testing environment can look like this:
```
┌─────────────────────┐ ┌───────────────────────┐
│ │ │ Storage (local) │
│ control-plane db │ │ │
│ (local process) │ │ - safekeepers │
│ │ │ - pageservers │
└──────────▲──────────┘ │ │
│ │ Dependencies │
┌──────────┴──────────┐ │ │
│ │ │ - etcd │
│ control-plane ├────►│ - S3 │
│ (local process) │ │ - more? │
│ │ │ │
└──────────┬──────────┘ └───────────────────────┘
▲ │ ▲
│ │ │
│ │ ┌───────────┴───────────┐
│ │ │ │
│ └───────────────►│ computes │
│ │ (local processes) │
│ │ │
┌──────┴──────────────┐ └───────────────────────┘
│ │ ▲
│ proxy │ │
│ (local process) ├─────────────────┘
│ │
└─────────────────────┘
```
The key thing here is that the local control-plane service has the same API and almost the same implementation as the one deployed in k8s. This allows us to run the same e2e tests against both cloud and local setups.
For the python test_runner tests, everything can stay mostly the same. To do that, we just need to replace `./neon_local` CLI commands with API calls to the control-plane.
The benefit here is fast local tests that are really close to our cloud setup. Bugs in k8s queries still cannot be found when running computes as local processes, but it should be really easy to start k8s locally (for example with k3s) and run the same tests with the control-plane connected to the local k8s.
As for console and UI tests, after the split there should be a way to test these without spinning up all the storage locally. The new control-plane service has a well-defined API, allowing us to mock it. This way we can create UI tests that verify the right calls are issued after specific UI interactions and that we render correct messages when the API returns errors.

View File

@@ -78,7 +78,7 @@ with grpc streams and tokio mpsc channels. The implementation description is at
It is just 500 lines of code and core functionality is complete. 1-1 pub sub
gives about 120k received messages per second; having multiple subscribers in
different connecitons quickly scales to 1 million received messages per second.
different connections quickly scales to 1 million received messages per second.
I had concerns about many concurrent streams in singe connection, but 2^20
subscribers still work (though eat memory, with 10 publishers 20GB are consumed;
in this implementation each publisher holds full copy of all subscribers). There
@@ -95,12 +95,12 @@ other members, with best-effort this is simple.
### Security implications
Communication happens in a private network that is not exposed to users;
additionaly we can add auth to the broker.
additionally we can add auth to the broker.
## Alternative: get existing pub-sub
We could take some existing pub sub solution, e.g. RabbitMQ, Redis. But in this
case IMV simplicity of our own outweights external dependency costs (RabbitMQ is
case IMV simplicity of our own outweighs external dependency costs (RabbitMQ is
much more complicated and needs VM; Redis Rust client maintenance is not
ideal...). Also note that projects like CockroachDB and TiDB are based on gRPC
as well.

View File

@@ -74,7 +74,7 @@ TenantMaintenanceGuard: Like ActiveTenantGuard, but can be held even when the
tenant is not in Active state. Used for operations like attach/detach. Perhaps
allow only one such guard on a Tenant at a time.
Similarly for Timelines. We don't currentl have a "state" on Timeline, but I think
Similarly for Timelines. We don't currently have a "state" on Timeline, but I think
we need at least two states: Active and Stopping. The Stopping state is used at
deletion, to prevent new TimelineActiveGuards from appearing, while you wait for
existing TimelineActiveGuards to die out.
@@ -85,7 +85,7 @@ have a TenantActiveGuard, and the tenant's state changes from Active to
Stopping, the is_shutdown_requested() function should return true, and
shutdown_watcher() future should return.
This signaling doesn't neessarily need to cover all cases. For example, if you
This signaling doesn't necessarily need to cover all cases. For example, if you
have a block of code in spawn_blocking(), it might be acceptable if
is_shutdown_requested() doesn't return true even though the tenant is in
Stopping state, as long as the code finishes reasonably fast.

View File

@@ -37,7 +37,7 @@ sequenceDiagram
```
At this point it is not possible to restore from index, it contains L2 which
is no longer available in s3 and doesnt contain L3 added by compaction by the
is no longer available in s3 and doesn't contain L3 added by compaction by the
first pageserver. So if any of the pageservers restart initial sync will fail
(or in on-demand world it will fail a bit later during page request from
missing layer)
@@ -74,7 +74,7 @@ One possible solution for relocation case is to orchestrate background jobs
from outside. The oracle who runs migration can turn off background jobs on
PS1 before migration and then run migration -> enable them on PS2. The problem
comes if migration fails. In this case in order to resume background jobs
oracle needs to guarantee that PS2 doesnt run background jobs and if it doesnt
oracle needs to guarantee that PS2 doesn't run background jobs and if it doesn't
respond then PS1 is stuck unable to run compaction/gc. This cannot be solved
without human ensuring that no upload from PS2 can happen. In order to be able
to resolve this automatically CAS is required on S3 side so pageserver can
@@ -128,7 +128,7 @@ During discussion it seems that we converged on the approach consisting of:
whether we need to apply change to the index state or not.
- Responsibility for running background jobs is assigned externally. Pageserver
keeps locally persistent flag for each tenant that indicates whether this
pageserver is considered as primary one or not. TODO what happends if we
pageserver is considered as primary one or not. TODO what happens if we
crash and cannot start for some extended period of time? Control plane can
assign ownership to some other pageserver. Pageserver needs some way to check
if its still the blessed one. Maybe by explicit request to control plane on
@@ -138,7 +138,7 @@ Requirement for deterministic layer generation was considered overly strict
because of two reasons:
- It can limit possible optimizations e g when pageserver wants to reshuffle
some data locally and doesnt want to coordinate this
some data locally and doesn't want to coordinate this
- The deterministic algorithm itself can change so during deployments for some
time there will be two different version running at the same time which can
cause non determinism
@@ -164,7 +164,7 @@ sequenceDiagram
CP->>PS1: Yes
deactivate CP
PS1->>S3: Fetch PS1 index.
note over PS1: Continue operations, start backround jobs
note over PS1: Continue operations, start background jobs
note over PS1,PS2: PS1 starts up and still and is not a leader anymore
PS1->>CP: Am I still the leader for Tenant X?
CP->>PS1: No
@@ -203,7 +203,7 @@ sequenceDiagram
### Eviction
When two pageservers operate on a tenant for extended period of time follower
doesnt perform write operations in s3. When layer is evicted follower relies
doesn't perform write operations in s3. When layer is evicted follower relies
on updates from primary to get info about layers it needs to cover range for
evicted layer.

View File

@@ -4,7 +4,7 @@ Created on 08.03.23
## Motivation
Currently we dont delete pageserver part of the data from s3 when project is deleted. (The same is true for safekeepers, but this outside of the scope of this RFC).
Currently we don't delete pageserver part of the data from s3 when project is deleted. (The same is true for safekeepers, but this outside of the scope of this RFC).
This RFC aims to spin a discussion to come to a robust deletion solution that wont put us in into a corner for features like postponed deletion (when we keep data for user to be able to restore a project if it was deleted by accident)
@@ -75,9 +75,9 @@ Remote one is needed for cases when pageserver is lost during deletion so other
Why local mark file is needed?
If we dont have one, we have two choices, delete local data before deleting the remote part or do that after.
If we don't have one, we have two choices, delete local data before deleting the remote part or do that after.
If we delete local data before remote then during restart pageserver wont pick up remote tenant at all because nothing is available locally (pageserver looks for remote conuterparts of locally available tenants).
If we delete local data before remote then during restart pageserver wont pick up remote tenant at all because nothing is available locally (pageserver looks for remote counterparts of locally available tenants).
If we delete local data after remote then at the end of the sequence when remote mark file is deleted if pageserver restart happens then the state is the same to situation when pageserver just missing data on remote without knowing the fact that this data is intended to be deleted. In this case the current behavior is upload everything local-only to remote.
@@ -145,7 +145,7 @@ sequenceDiagram
CP->>PS: Retry delete tenant
PS->>CP: Not modified
else Mark is missing
note over PS: Continue to operate the tenant as if deletion didnt happen
note over PS: Continue to operate the tenant as if deletion didn't happen
note over CP: Eventually console should <br> retry delete request
@@ -168,7 +168,7 @@ sequenceDiagram
PS->>CP: True
```
Similar sequence applies when both local and remote marks were persisted but Control Plane still didnt receive a response.
Similar sequence applies when both local and remote marks were persisted but Control Plane still didn't receive a response.
If pageserver crashes after both mark files were deleted then it will reply to control plane status poll request with 404 which should be treated by control plane as success.
@@ -187,7 +187,7 @@ If pageseserver is lost then the deleted tenant should be attached to different
##### Restrictions for tenant that is in progress of being deleted
I propose to add another state to tenant/timeline - PendingDelete. This state shouldnt allow executing any operations aside from polling the deletion status.
I propose to add another state to tenant/timeline - PendingDelete. This state shouldn't allow executing any operations aside from polling the deletion status.
#### Summary
@@ -237,7 +237,7 @@ New branch gets created
PS1 starts up (is it possible or we just recycle it?)
PS1 is unaware of the new branch. It can either fall back to s3 ls, or ask control plane.
So here comes the dependency of storage on control plane. During restart storage needs to know which timelines are valid for operation. If there is nothing on s3 that can answer that question storage neeeds to ask control plane.
So here comes the dependency of storage on control plane. During restart storage needs to know which timelines are valid for operation. If there is nothing on s3 that can answer that question storage needs to ask control plane.
### Summary
@@ -250,7 +250,7 @@ Cons:
Pros:
- Easier to reason about if you dont have to account for pageserver restarts
- Easier to reason about if you don't have to account for pageserver restarts
### Extra notes
@@ -262,7 +262,7 @@ Delayed deletion can be done with both approaches. As discussed with Anna (@step
After discussion in comments I see that we settled on two options (though a bit different from ones described in rfc). First one is the same - pageserver owns as much as possible. The second option is that pageserver owns markers thing, but actual deletion happens in control plane by repeatedly calling ls + delete.
To my mind the only benefit of the latter approach is possible code reuse between safekeepers and pageservers. Otherwise poking around integrating s3 library into control plane, configuring shared knowledge abouth paths in s3 - are the downsides. Another downside of relying on control plane is the testing process. Control plane resides in different repository so it is quite hard to test pageserver related changes there. e2e test suite there doesnt support shutting down pageservers, which are separate docker containers there instead of just processes.
To my mind the only benefit of the latter approach is possible code reuse between safekeepers and pageservers. Otherwise poking around integrating s3 library into control plane, configuring shared knowledge about paths in s3 - are the downsides. Another downside of relying on control plane is the testing process. Control plane resides in different repository so it is quite hard to test pageserver related changes there. e2e test suite there doesn't support shutting down pageservers, which are separate docker containers there instead of just processes.
With pageserver owning everything we still give the retry logic to control plane but its easier to duplicate if needed compared to sharing inner s3 workings. We will have needed tests for retry logic in neon repo.

View File

@@ -75,7 +75,7 @@ sequenceDiagram
```
At this point it is not possible to restore the state from index, it contains L2 which
is no longer available in s3 and doesnt contain L3 added by compaction by the
is no longer available in s3 and doesn't contain L3 added by compaction by the
first pageserver. So if any of the pageservers restart, initial sync will fail
(or in on-demand world it will fail a bit later during page request from
missing layer)
@@ -171,13 +171,13 @@ sequenceDiagram
Another problem is a possibility of concurrent branch creation calls.
I e during migration create_branch can be called on old pageserver and newly created branch wont be seen on new pageserver. Prior art includes prototyping an approach of trying to mirror such branches, but currently it lost its importance, because now attach is fast because we dont need to download all data, and additionally to the best of my knowledge of control plane internals (cc @ololobus to confirm) operations on one project are executed sequentially, so it is not possible to have such case. So branch create operation will be executed only when relocation is completed. As a safety measure we can forbid branch creation for tenants that are in readonly remote state.
I e during migration create_branch can be called on old pageserver and newly created branch wont be seen on new pageserver. Prior art includes prototyping an approach of trying to mirror such branches, but currently it lost its importance, because now attach is fast because we don't need to download all data, and additionally to the best of my knowledge of control plane internals (cc @ololobus to confirm) operations on one project are executed sequentially, so it is not possible to have such case. So branch create operation will be executed only when relocation is completed. As a safety measure we can forbid branch creation for tenants that are in readonly remote state.
## Simplistic approach
The difference of simplistic approach from one described above is that it calls ignore on source tenant first and then calls attach on target pageserver. Approach above does it in opposite order thus opening a possibility for race conditions we strive to avoid.
The approach largely follows this guide: <https://github.com/neondatabase/cloud/wiki/Cloud:-Ad-hoc-tenant-relocation>
The approach largely follows this guide: <https://www.notion.so/neondatabase/Cloud-Ad-hoc-tenant-relocation-f687474f7bfc42269e6214e3acba25c7>
The happy path sequence:

View File

@@ -55,7 +55,7 @@ When PostgreSQL requests a file, `compute_ctl` downloads it.
PostgreSQL requests files in the following cases:
- When loading a preload library set in `local_preload_libraries`
- When explicitly loading a library with `LOAD`
- Wnen creating extension with `CREATE EXTENSION` (download sql scripts, (optional) extension data files and (optional) library files)))
- When creating extension with `CREATE EXTENSION` (download sql scripts, (optional) extension data files and (optional) library files)))
#### Summary

View File

@@ -26,7 +26,7 @@ plane guarantee prevents robust response to failures, as if a pageserver is unre
we may not detach from it. The mechanism in this RFC fixes this, by making it safe to
attach to a new, different pageserver even if an unresponsive pageserver may be running.
Futher, lack of safety during split-brain conditions blocks two important features where occasional
Further lack of safety during split-brain conditions blocks two important features where occasional
split-brain conditions are part of the design assumptions:
- seamless tenant migration ([RFC PR](https://github.com/neondatabase/neon/pull/5029))
@@ -490,11 +490,11 @@ The above makes it safe for control plane to change the assignment of
tenant to pageserver in control plane while a timeline creation is ongoing.
The reason is that the creation request against the new assigned pageserver
uses a new generation number. However, care must be taken by control plane
to ensure that a "timeline creation successul" response from some pageserver
to ensure that a "timeline creation successful" response from some pageserver
is checked for the pageserver's generation for that timeline's tenant still being the latest.
If it is not the latest, the response does not constitute a successful timeline creation.
It is acceptable to discard such responses, the scrubber will clean up the S3 state.
It is better to issue a timelien deletion request to the stale attachment.
It is better to issue a timeline deletion request to the stale attachment.
#### Timeline Deletion
@@ -633,7 +633,7 @@ As outlined in the Part 1 on correctness, it is critical that deletions are only
executed once the key is not referenced anywhere in S3.
This property is obviously upheld by the scheme above.
#### We Accept Object Leakage In Acceptable Circumcstances
#### We Accept Object Leakage In Acceptable Circumstances
If we crash in the flow above between (2) and (3), we lose track of unreferenced object.
Further, enqueuing a single to the persistent queue may not be durable immediately to amortize cost of flush to disk.

View File

@@ -162,7 +162,7 @@ struct Tenant {
...
txns: HashMap<TxnId, Transaction>,
// the most recently started txn's id; only most recently sarted can win
// the most recently started txn's id; only most recently started can win
next_winner_txn: Option<TxnId>,
}
struct Transaction {
@@ -186,7 +186,7 @@ A transaction T in state Committed has subsequent transactions that may or may n
So, for garbage collection, we need to assess transactions in state Committed and RejectAcknowledged:
- Commited: delete objects on the deadlist.
- Committed: delete objects on the deadlist.
- We dont need a LIST request here, the deadlist is sufficient. So, its really cheap.
- This is **not true MVCC garbage collection**; by deleting the objects on Committed transaction T s deadlist, we might delete data referenced by other transactions that were concurrent with T, i.e., they started while T was still open. However, the fact that T is committed means that the other transactions are RejectPending or RejectAcknowledged, so, they dont matter. Pageservers executing these doomed RejectPending transactions must handle 404 for GETs gracefully, e.g., by trying to commit txn so they observe the rejection theyre destined to get anyways. 404s for RejectAcknowledged is handled below.
- RejectAcknowledged: delete all objects created in that txn, and discard deadlists.
@@ -242,15 +242,15 @@ If a pageserver is unresponsive from Control Planes / Computes perspective
At this point, availability is restored and user pain relieved.
Whats left is to somehow close the doomed transaction of the unresponsive pageserver, so that it beomes RejectAcknowledged, and GC can make progress. Since S3 is cheap, we can afford to wait a really long time here, especially if we put a soft bound on the amount of data a transaction may produce before it must commit. Procedure:
Whats left is to somehow close the doomed transaction of the unresponsive pageserver, so that it becomes RejectAcknowledged, and GC can make progress. Since S3 is cheap, we can afford to wait a really long time here, especially if we put a soft bound on the amount of data a transaction may produce before it must commit. Procedure:
1. Ensure the unresponsive pageserver is taken out of rotation for new attachments. That probably should happen as part of the routine above.
2. Make a human operator investigate decide what to do (next morning, NO ONCALL ALERT):
1. Inspect the instance, investigate logs, understand root cause.
2. Try to re-establish connectivity between pageserver and Control Plane so that pageserver can retry commits, get rejected, ack rejection ⇒ enable GC.
3. Use below procedure to decomission pageserver.
3. Use below procedure to decommission pageserver.
### Decomissioning A Pageserver (Dead or Alive-but-Unrespsonive)
### Decommissioning A Pageserver (Dead or Alive-but-Unresponsive)
The solution, enabled by this proposal:
@@ -310,7 +310,7 @@ Issues that we discussed:
1. In abstract terms, this proposal provides a linearized history for a given S3 prefix.
2. In concrete terms, this proposal provides a linearized history per tenant.
3. There can be multiple writers at a given time, but only one of them will win to become part of the linearized history.
4. ************************************************************************************Alternative ideas mentioned during meetings that should be turned into a written prospoal like this one:************************************************************************************
4. ************************************************************************************Alternative ideas mentioned during meetings that should be turned into a written proposal like this one:************************************************************************************
1. @Dmitry Rodionov : having linearized storage of index_part.json in some database that allows serializable transactions / atomic compare-and-swap PUT
2. @Dmitry Rodionov :
3. @Stas : something like this scheme, but somehow find a way to equate attachment duration with transaction duration, without losing work if pageserver dies months after attachment.

View File

@@ -54,7 +54,7 @@ If the compaction algorithm doesn't change between the two compaction runs, is d
*However*:
1. the file size of the overwritten L1s may not be identical, and
2. the bit pattern of the overwritten L1s may not be identical, and,
3. in the future, we may want to make the compaction code non-determinstic, influenced by past access patterns, or otherwise change it, resulting in L1 overwrites with a different set of delta records than before the overwrite
3. in the future, we may want to make the compaction code non-deterministic, influenced by past access patterns, or otherwise change it, resulting in L1 overwrites with a different set of delta records than before the overwrite
The items above are a problem for the [split-brain protection RFC](https://github.com/neondatabase/neon/pull/4919) because it assumes that layer files in S3 are only ever deleted, but never replaced (overPUTted).
@@ -63,7 +63,7 @@ But node B based its world view on the version of node A's `index_part.json` fro
That earlier `index_part.json`` contained the file size of the pre-overwrite L1.
If the overwritten L1 has a different file size, node B will refuse to read data from the overwritten L1.
Effectively, the data in the L1 has become inaccessible to node B.
If node B already uploaded an index part itself, all subsequent attachments will use node B's index part, and run into the same probem.
If node B already uploaded an index part itself, all subsequent attachments will use node B's index part, and run into the same problem.
If we ever introduce checksums instead of checking just the file size, then a mismatching bit pattern (2) will cause similar problems.
@@ -121,7 +121,7 @@ Multi-object changes that previously created and removed files in timeline dir a
* atomic `index_part.json` update in S3, as per guarantee that S3 PUT is atomic
* local timeline dir state:
* irrelevant for layer map content => irrelevant for atomic updates / crash consistency
* if we crash after index part PUT, local layer files will be used, so, no on-demand downloads neede for them
* if we crash after index part PUT, local layer files will be used, so, no on-demand downloads needed for them
* if we crash before index part PUT, local layer files will be deleted
## Trade-Offs
@@ -140,7 +140,7 @@ Assuming upload queue allows for unlimited queue depth (that's what it does toda
* wal ingest: currently unbounded
* L0 => L1 compaction: CPU time proportional to `O(sum(L0 size))` and upload work proportional to `O()`
* Compaction threshold is 10 L0s and each L0 can be up to 256M in size. Target size for L1 is 128M.
* In practive, most L0s are tiny due to 10minute `DEFAULT_CHECKPOINT_TIMEOUT`.
* In practice, most L0s are tiny due to 10minute `DEFAULT_CHECKPOINT_TIMEOUT`.
* image layer generation: CPU time `O(sum(input data))` + upload work `O(sum(new image layer size))`
* I have no intuition how expensive / long-running it is in reality.
* gc: `update_gc_info`` work (not substantial, AFAIK)
@@ -158,7 +158,7 @@ Pageserver crashes are very rare ; it would likely be acceptable to re-do the lo
However, regular pageserver restart happen frequently, e.g., during weekly deploys.
In general, pageserver restart faces the problem of tenants that "take too long" to shut down.
They are a problem because other tenants that shut down quickly are unavailble while we wait for the slow tenants to shut down.
They are a problem because other tenants that shut down quickly are unavailable while we wait for the slow tenants to shut down.
We currently allot 10 seconds for graceful shutdown until we SIGKILL the pageserver process (as per `pageserver.service` unit file).
A longer budget would expose tenants that are done early to a longer downtime.
A short budget would risk throwing away more work that'd have to be re-done after restart.
@@ -236,7 +236,7 @@ tenants/$tenant/timelines/$timeline/$key_and_lsn_range
tenants/$tenant/timelines/$timeline/$layer_file_id-$key_and_lsn_range
```
To guarantee uniqueness, the unqiue number is a sequence number, stored in `index_part.json`.
To guarantee uniqueness, the unique number is a sequence number, stored in `index_part.json`.
This alternative does not solve atomic layer map updates.
In our crash-during-compaction scenario above, the compaction run after the crash will not overwrite the L1s, but write/PUT new files with new sequence numbers.
@@ -246,11 +246,11 @@ We'd need to write a deduplication pass that checks if perfectly overlapping lay
However, this alternative is appealing because it systematically prevents overwrites at a lower level than this RFC.
So, this alternative is sufficient for the needs of the split-brain safety RFC (immutable layer files locally and in S3).
But it doesn't solve the problems with crash-during-compaction outlined earlier in this RFC, and in fact, makes it much more accute.
But it doesn't solve the problems with crash-during-compaction outlined earlier in this RFC, and in fact, makes it much more acute.
The proposed design in this RFC addresses both.
So, if this alternative sounds appealing, we should implement the proposal in this RFC first, then implement this alternative on top.
That way, we avoid a phase where the crash-during-compaction problem is accute.
That way, we avoid a phase where the crash-during-compaction problem is acute.
## Related issues

View File

@@ -0,0 +1,599 @@
# Seamless tenant migration
- Author: john@neon.tech
- Created on 2023-08-11
- Implemented on ..
## Summary
The preceding [generation numbers RFC](025-generation-numbers.md) may be thought of as "making tenant
migration safe". Following that,
this RFC is about how those migrations are to be done:
1. Seamlessly (without interruption to client availability)
2. Quickly (enabling faster operations)
3. Efficiently (minimizing I/O and $ cost)
These points are in priority order: if we have to sacrifice
efficiency to make a migration seamless for clients, we will
do so, etc.
This is accomplished by introducing two high level changes:
- A dual-attached state for tenants, used in a control-plane-orchestrated
migration procedure that preserves availability during a migration.
- Warm secondary locations for tenants, where on-disk content is primed
for a fast migration of the tenant from its current attachment to this
secondary location.
## Motivation
Migrating tenants between pageservers is essential to operating a service
at scale, in several contexts:
1. Responding to a pageserver node failure by migrating tenants to other pageservers
2. Balancing load and capacity across pageservers, for example when a user expands their
database and they need to migrate to a pageserver with more capacity.
3. Restarting pageservers for upgrades and maintenance
The current migration steps are:
- detach from old node; skip if old node is dead; (the [skip part is still WIP](https://github.com/neondatabase/cloud/issues/5426)).
- attach to new node
- re-configure endpoints to use the new node
Once [generation numbers](025-generation-numbers.md) are implemented,
the detach step is no longer critical for correctness. So, we can
- attach to a new node,
- re-configure endpoints to use the new node, and then
- detach from the old node.
However, this still does not meet our seamless/fast/efficient goals:
- Not fast: The new node will have to download potentially large amounts
of data from S3, which may take many minutes.
- Not seamless: If we attach to a new pageserver before detaching an old one,
the new one might delete some objects that interrupt availability of reads on the old one.
- Not efficient: the old pageserver will continue uploading
S3 content during the migration that will never be read.
The user expectations for availability are:
- For planned maintenance, there should be zero availability
gap. This expectation is fulfilled by this RFC.
- For unplanned changes (e.g. node failures), there should be
minimal availability gap. This RFC provides the _mechanism_
to fail over quickly, but does not provide the failure _detection_
nor failover _policy_.
## Non Goals
- Defining service tiers with different storage strategies: the same
level of HA & overhead will apply to all tenants. This doesn't rule out
adding such tiers in future.
- Enabling pageserver failover in the absence of a control plane: the control
plane will remain the source of truth for what should be attached where.
- Totally avoiding availability gaps on unplanned migrations during
a failure (we expect a small, bounded window of
read unavailability of very recent LSNs)
- Workload balancing: this RFC defines the mechanism for moving tenants
around, not the higher level logic for deciding who goes where.
- Defining all possible configuration flows for tenants: the migration process
defined in this RFC demonstrates the sufficiency of the pageserver API, but
is not the only kind of configuration change the control plane will ever do.
The APIs defined here should let the control plane move tenants around in
whatever way is needed while preserving data safety and read availability.
## Impacted components
Pageserver, control plane
## Terminology
- **Attachment**: a tenant is _attached_ to a pageserver if it has
been issued a generation number, and is running an instance of
the `Tenant` type, ingesting the WAL, and available to serve
page reads.
- **Location**: locations are a superset of attachments. A location
is a combination of a tenant and a pageserver. We may _attach_ at a _location_.
- **Secondary location**: a location which is not currently attached.
- **Warm secondary location**: a location which is not currently attached, but is endeavoring to maintain a warm local cache of layers. We avoid calling this a _warm standby_ to avoid confusion with similar postgres features.
## Implementation (high level)
### Warm secondary locations
To enable faster migrations, we will identify at least one _secondary location_
for each tenant. This secondary location will keep a warm cache of layers
for the tenant, so that if it is later attached, it can catch up with the
latest LSN quickly: rather than downloading everything, it only has to replay
the recent part of the WAL to advance from the remote_consistent_offset to the
most recent LSN in the WAL.
The control plane is responsible for selecting secondary locations, and
calling into pageservers to configure tenants into a secondary mode at this
new location, as well as attaching the tenant in its existing primary location.
The attached pageserver for a tenant will publish a [layer heatmap](#layer-heatmap)
to advise secondaries of which layers should be downloaded.
### Location modes
Currently, we consider a tenant to be in one of two states on a pageserver:
- Attached: active `Tenant` object, and layers on local disk
- Detached: no layers on local disk, no runtime state.
We will extend this with finer-grained modes, whose purpose will become
clear in later sections:
- **AttachedSingle**: equivalent to the existing attached state.
- **AttachedMulti**: like AttachedSingle, holds an up to date generation, but
does not do deletions.
- **AttachedStale**: like AttachedSingle, holds a stale generation,
and does not do any remote storage operations.
- **Secondary**: keep local state on disk, periodically update from S3.
- **Detached**: equivalent to existing detached state.
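For illustration, these modes could be modeled as a simple enum; this is a sketch, and the variant and method names are assumptions rather than the pageserver's actual types:
```rust
/// Illustrative sketch of the per-location modes described above.
/// Variant and method names are assumptions, not the pageserver's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LocationMode {
    /// Attached with the latest generation; may upload and delete remote layers.
    AttachedSingle,
    /// Attached with an up-to-date generation, but remote deletions are deferred.
    AttachedMulti,
    /// Attached with a stale generation; no remote storage operations at all.
    AttachedStale,
    /// Not attached; keeps local layers warm by periodically syncing from S3.
    Secondary,
    /// No local state, no runtime state.
    Detached,
}

impl LocationMode {
    /// Whether this mode may delete layers in remote storage.
    pub fn may_delete_remote(self) -> bool {
        matches!(self, LocationMode::AttachedSingle)
    }
    /// Whether this mode uploads layers to remote storage.
    pub fn may_upload(self) -> bool {
        matches!(self, LocationMode::AttachedSingle | LocationMode::AttachedMulti)
    }
}
```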
To control these finer grained states, a new pageserver API endpoint will be added.
### Cutover procedure
Define old location and new location as "Node A" and "Node B". Consider
the case where both nodes are available, and Node B was previously configured
as a secondary location for the tenant we are migrating.
The cutover procedure is orchestrated by the control plane, calling into
the pageservers' APIs:
1. Call to Node A requesting it to flush to S3 and enter AttachedStale state
2. Increment generation, and call to Node B requesting it to enter AttachedMulti
state with the new generation.
3. Call to Node B, requesting it to download the latest hot layers from remote storage,
according to the latest heatmap flushed by Node A.
4. Wait for Node B's WAL ingestion to catch up with node A's
5. Update endpoints to use node B instead of node A
6. Call to node B requesting it to enter state AttachedSingle.
7. Call to node A requesting it to enter state Secondary
The following table summarizes how the state of the system advances:
| Step | Node A | Node B | Node used by endpoints |
| :-----------: | :------------: | :------------: | :--------------------: |
| 1 (_initial_) | AttachedSingle | Secondary | A |
| 2 | AttachedStale | AttachedMulti | A |
| 3 | AttachedStale | AttachedMulti | A |
| 4 | AttachedStale | AttachedMulti | A |
| 5 (_cutover_) | AttachedStale | AttachedMulti | B |
| 6 | AttachedStale | AttachedSingle | B |
| 7 (_final_) | Secondary | AttachedSingle | B |
The procedure described for a clean handover from a live node to a secondary
is also used for failure cases and for migrations to a location that is not
configured as a secondary, by simply skipping irrelevant steps, as described in
the following sections.
#### Migration from an unresponsive node
If node A is unavailable, then all calls into
node A are skipped and we don't wait for B to catch up before
updating the endpoints to use B.
#### Migration to a location that is not a secondary
If node B is initially in Detached state, the procedure is identical. Since Node B
is coming from a Detached state rather than Secondary, the download of layers and
catch up with WAL will take much longer.
We might do this if:
- Attached and secondary locations are both critically low on disk, and we need
to migrate to a third node with more resources available.
- We are migrating a tenant which does not use secondary locations to save on cost.
#### Permanent migration away from a node
In the final step of the migration, we generally request the original node to enter a Secondary
state. This is typical if we are doing a planned migration during maintenance, or to
balance CPU/network load away from a node.
One might also want to permanently migrate away: this can be done by simply removing the secondary
location after the migration is complete, or as an optimization by substituting the Detached state
for the Secondary state in the final step.
#### Cutover diagram
```mermaid
sequenceDiagram
participant CP as Control plane
participant A as Node A
participant B as Node B
participant E as Endpoint
CP->>A: PUT Flush & go to AttachedStale
note right of A: A continues to ingest WAL
CP->>B: PUT AttachedMulti
CP->>B: PUT Download layers from latest heatmap
note right of B: B downloads from S3
loop Poll until download complete
CP->>B: GET download status
end
activate B
note right of B: B ingests WAL
loop Poll until catch up
CP->>B: GET visible WAL
CP->>A: GET visible WAL
end
deactivate B
CP->>E: Configure to use Node B
E->>B: Connect for reads
CP->>B: PUT AttachedSingle
CP->>A: PUT Secondary
```
#### Cutover from an unavailable pageserver
This case is far simpler: we may skip straight to our intended
end state.
```mermaid
sequenceDiagram
participant A as Node A
participant CP as Control plane
participant B as Node B
participant E as Endpoint
note right of A: Node A offline
activate A
CP->>B: PUT AttachedSingle
CP->>E: Configure to use Node B
E->>B: Connect for reads
deactivate A
```
## Implementation (detail)
### Purpose of AttachedMulti, AttachedStale
#### AttachedMulti
Ordinarily, an attached pageserver whose generation is the latest may delete
layers at will (e.g. during compaction). If a previous generation pageserver
is also still attached, and in use by endpoints, then this layer deletion could
lead to a loss of availability for the endpoint when reading from the previous
generation pageserver.
The _AttachedMulti_ state simply disables deletions. These will be enqueued
in `RemoteTimelineClient` until the control plane transitions the
node into AttachedSingle, which unblocks deletions. Other remote storage operations
such as uploads are not blocked.
AttachedMulti is not required for data safety, only to preserve availability
on pageservers running with stale generations.
A node enters AttachedMulti only when explicitly asked to by the control plane. It should
only remain in this state for the duration of a migration.
If a control plane bug leaves
the node in AttachedMulti for a long time, then we must avoid unbounded memory use from enqueued
deletions. This may be accomplished simply, by dropping enqueued deletions when some modest
threshold of delayed deletions (e.g. 10k layers per tenant) is reached. As with all deletions,
it is safe to skip them, and the leaked objects will be eventually cleaned up by scrub or
by timeline deletion.
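A minimal sketch of such a bounded deferral; the names and the threshold are illustrative only and do not reflect the actual `RemoteTimelineClient` interface:
```rust
/// Illustrative sketch of capping deferred deletions in AttachedMulti.
/// The threshold and all names are assumptions for illustration only.
const MAX_DEFERRED_DELETIONS: usize = 10_000;

#[derive(Default)]
struct DeferredDeletions {
    layers: Vec<String>, // remote layer paths awaiting deletion
    dropped: u64,        // deletions skipped because the cap was hit
}

impl DeferredDeletions {
    /// Enqueue a deletion, or drop it (leaking the object) if the cap is hit.
    /// Dropping is safe: leaked objects are later removed by scrubbing or
    /// timeline deletion, as described above.
    fn defer(&mut self, layer_path: String) {
        if self.layers.len() < MAX_DEFERRED_DELETIONS {
            self.layers.push(layer_path);
        } else {
            self.dropped += 1;
        }
    }

    /// Called when the location transitions to AttachedSingle:
    /// returns everything that should now actually be deleted.
    fn drain(&mut self) -> Vec<String> {
        std::mem::take(&mut self.layers)
    }
}
```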
During AttachedMulti, the Tenant is free to drop layers from local disk in response to
disk pressure: only the deletion of remote layers is blocked.
#### AttachedStale
Currently, a pageserver with a stale generation number will continue to
upload layers, but be prevented from completing deletions. This is safe, but inefficient: layers uploaded by this stale generation
will not be read back by future generations of pageservers.
The _AttachedStale_ state disables S3 uploads. The stale pageserver
will continue to ingest the WAL and write layers to local disk, but not to
do any uploads to S3.
A node may enter AttachedStale in two ways:
- Explicitly, when control plane calls into the node at the start of a migration.
- Implicitly, when the node tries to validate some deletions and discovers
that its generation is stale.
The AttachedStale state also disables sending consumption metrics from
that location: it is interpreted as an indication that some other pageserver
is already attached or is about to be attached, and that new pageserver will
be responsible for sending consumption metrics.
#### Disk Pressure & AttachedStale
Over long periods of time, a tenant location in AttachedStale will accumulate data
on local disk, as it cannot evict any layers written since it entered the
AttachedStale state. We rely on the control plane to revert the location to
Secondary or Detached at the end of a migration.
This scenario is particularly noteworthy when evacuating all tenants on a pageserver:
since _all_ the attached tenants will go into AttachedStale, we will be doing no
uploads at all, therefore ingested data will cause disk usage to increase continuously.
Under nominal conditions, the available disk space on pageservers should be sufficient
to complete the evacuation before this becomes a problem, but we must also handle
the case where we hit a low disk situation while in this state.
The concept of disk pressure already exists in the pageserver: the `disk_usage_eviction_task`
touches each Tenant when it determines that a low-disk condition requires
some layer eviction. Having selected layers for eviction, the eviction
task calls `Timeline::evict_layers`.
**Safety**: If evict_layers is called while in AttachedStale state, and some of the to-be-evicted
layers are not yet uploaded to S3, then the block on uploads will be lifted. This
will result in leaking some objects once a migration is complete, but will enable
the node to manage its disk space properly: if a node is left with some tenants
in AttachedStale indefinitely due to a network partition or control plane bug,
these tenants will not cause a full disk condition.
### Warm secondary updates
#### Layer heatmap
The secondary location's job is to serve reads **with the same quality of service as the original location
was serving them around the time of a migration**. This does not mean the secondary
location needs the whole set of layers: inactive layers that might soon
be evicted on the attached pageserver need not be downloaded by the
secondary. A totally idle tenant only needs to maintain enough on-disk
state to enable a fast cold start (i.e. the most recent image layers are
typically sufficient).
To enable this, we introduce the concept of a _layer heatmap_, which
acts as an advisory input to secondary locations to decide which
layers to download from S3.
#### Attached pageserver
The attached pageserver, if in state AttachedSingle, periodically
uploads a serialized heat map to S3. It may skip this if there
is no change since the last time it uploaded (e.g. if the tenant
is totally idle).
Additionally, when the tenant is flushed to remote storage prior to a migration
(the first step in [cutover procedure](#cutover-procedure)),
the heatmap is written out. This enables a future attached pageserver
to get an up to date view when deciding which layers to download.
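As a rough illustration, the serialized heatmap could look something like the following; the field names and layout are assumptions, not a finalized format:
```rust
use serde::{Deserialize, Serialize};

/// Illustrative shape of an uploaded heatmap; the real format may differ.
#[derive(Serialize, Deserialize)]
struct HeatMap {
    /// Unix timestamp at which the attached pageserver generated this heatmap.
    generated_at: u64,
    timelines: Vec<TimelineHeatMap>,
}

#[derive(Serialize, Deserialize)]
struct TimelineHeatMap {
    timeline_id: String,
    layers: Vec<HeatMapLayer>,
}

#[derive(Serialize, Deserialize)]
struct HeatMapLayer {
    /// Layer file name as it appears in remote storage.
    name: String,
    /// Last access time on the attached pageserver (unix timestamp);
    /// secondaries use this instead of the local atime for eviction decisions.
    last_access: u64,
    /// Whether the layer is currently resident on the attached node's disk.
    resident: bool,
}
```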
#### Secondary location behavior
Secondary warm locations run a simple loop, implemented separately from
the main `Tenant` type, which represents attached tenants:
- Download the layer heatmap
- Select any "hot enough" layers to download, if there is sufficient
free disk space.
- Download layers, if they were not previously evicted (see below)
- Download the latest index_part.json
- Check if any layers currently on disk are no longer referenced by
IndexPart & delete them
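A simplified sketch of one pass of this loop; the scalar "heat" values, names, and thresholds are illustrative stand-ins, not the actual implementation:
```rust
use std::collections::{HashMap, HashSet};

/// One pass of the (simplified) secondary update loop described above.
/// All names and the scalar "heat" values are illustrative assumptions.
/// Returns (layers to download, layers to delete).
fn secondary_pass(
    heatmap: &HashMap<String, f64>,  // layer name -> heat, from the attached node
    index_layers: &HashSet<String>,  // layers referenced by the latest index_part.json
    on_disk: &HashSet<String>,       // layers currently on the secondary's disk
    eviction_watermark: Option<f64>, // heat of the most recently evicted layer, if any
    have_free_space: bool,
) -> (Vec<String>, Vec<String>) {
    let mut to_download = Vec::new();
    let mut to_delete = Vec::new();

    // Download: hot, still referenced by the index, not already local, and not
    // below the eviction watermark (avoids re-download flapping).
    if have_free_space {
        for (layer, heat) in heatmap {
            let hot_enough = eviction_watermark.map_or(true, |w| *heat > w);
            if hot_enough && index_layers.contains(layer) && !on_disk.contains(layer) {
                to_download.push(layer.clone());
            }
        }
    }

    // Delete: on disk but no longer referenced by the IndexPart.
    for layer in on_disk {
        if !index_layers.contains(layer) {
            to_delete.push(layer.clone());
        }
    }

    (to_download, to_delete)
}
```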
Note that the heatmap is only advisory: if a secondary location has plenty
of disk space, it may choose to retain layers that aren't referenced
by the heatmap, as long as they are still referenced by the IndexPart. Conversely,
if a node is very low on disk space, it might opt to raise the heat threshold required
to download a layer until more disk space is available.
#### Secondary locations & disk pressure
Secondary locations are subject to eviction on disk pressure, just as
attached locations are. For eviction purposes, the access time of a
layer in a secondary location will be the access time given in the heatmap,
rather than the literal time at which the local layer file was accessed.
The heatmap will indicate which layers are in local storage on the attached
location. The secondary will always attempt to get back to having that
set of layers on disk, but to avoid flapping, it will remember the access
time of the layer it was most recently asked to evict, and layers whose
access time is below that will not be re-downloaded.
The resulting behavior is that after a layer is evicted from a secondary
location, it is only re-downloaded once the attached pageserver accesses
the layer and uploads a heatmap reflecting that access time. On a pageserver
restart, the secondary location will attempt to download all layers in
the heatmap again, if they are not on local disk.
This behavior will be slightly different when secondary locations are
used for "low energy tenants", but that is beyond the scope of this RFC.
### Location configuration API
Currently, the `/tenant/<tenant_id>/config` API defines various
tunables like compaction settings, which apply to the tenant irrespective
of which pageserver it is running on.
A new "location config" structure will be introduced, which defines
configuration which is per-tenant, but local to a particular pageserver,
such as the attachment mode and whether it is a secondary.
The pageserver will expose a new per-tenant API for setting
the state: `/tenant/<tenant_id>/location/config`.
Body content:
```
{
  state: 'enum{Detached, Secondary, AttachedSingle, AttachedMulti, AttachedStale}',
  generation: Option<u32>,
  configuration: Option<TenantConfig>,
  flush: bool
}
```
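On the pageserver side, this body could deserialize into something like the following sketch; the type and field names are assumptions, and `TenantConfig` is represented loosely here:
```rust
use serde::{Deserialize, Serialize};

/// Sketch of the `/tenant/<tenant_id>/location/config` request body;
/// names are illustrative and `TenantConfig` is represented loosely.
#[derive(Serialize, Deserialize)]
enum LocationStateParam {
    Detached,
    Secondary,
    AttachedSingle,
    AttachedMulti,
    AttachedStale,
}

#[derive(Serialize, Deserialize)]
struct LocationConfigRequest {
    state: LocationStateParam,
    /// Mandatory when entering AttachedSingle or AttachedMulti.
    generation: Option<u32>,
    /// Same shape as the body of the existing `/tenant/<tenant_id>/config`
    /// endpoint; mandatory for any state other than Detached.
    configuration: Option<serde_json::Value>,
    /// Flush to S3 before proceeding; only meaningful when the location is
    /// currently AttachedSingle or AttachedMulti.
    flush: bool,
}
```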
The existing `/attach` and `/detach` endpoints will have the same
behavior as calling `/location/config` with `AttachedSingle` and `Detached`
states respectively. These endpoints will be deprecated and later
removed.
The generation attribute is mandatory for entering `AttachedSingle` or
`AttachedMulti`.
The configuration attribute is mandatory when entering any state other
than `Detached`. This configuration is the same as the body for
the existing `/tenant/<tenant_id>/config` endpoint.
The `flush` argument indicates whether the pageservers should flush
to S3 before proceeding: this only has any effect if the node is
currently in AttachedSingle or AttachedMulti. This is used
during the first phase of migration, when transitioning the
old pageserver to AttachedStale.
The `/re-attach` API response will be extended to include a `state` as
well as a `generation`, enabling the pageserver to enter the
correct state for each tenant on startup.
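One tenant entry in the extended `/re-attach` response could then carry something along these lines; this is a sketch, not the final wire format:
```rust
use serde::{Deserialize, Serialize};

/// Sketch of one tenant entry in the extended `/re-attach` response;
/// field names and the string-typed state are assumptions for illustration.
#[derive(Serialize, Deserialize)]
struct ReAttachResponseTenant {
    id: String,
    /// Present when the tenant should start in an attached state.
    generation: Option<u32>,
    /// Desired startup state, e.g. "AttachedSingle" or "Secondary".
    state: String,
}
```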
### Database schema for locations
A new table `ProjectLocation`:
- pageserver_id: int
- tenant_id: TenantId
- generation: Option<int>
- state: `enum(Secondary, AttachedSingle, AttachedMulti)`
Notes:
- It is legacy for a Project to have zero `ProjectLocation`s
- The `pageserver` column in `Project` now means "to which pageserver should
endpoints connect", rather than simply which pageserver is attached.
- The `generation` column in `Project` remains, and is incremented and used
to set the generation of `ProjectLocation` rows when they are set into
an attached state.
- The `Detached` state is implicitly represented as the absence of
a `ProjectLocation`.
### Executing migrations
Migrations will be implemented as Go functions, within the
existing `Operation` framework in the control plane. These
operations are persistent, such that they will always keep
trying until completion: this property is important to avoid
leaving garbage behind on pageservers, such as AttachedStale
locations.
### Recovery from failures during migration
During migration, the control plane may encounter failures of either
the original or new pageserver, or both:
- If the original fails, skip past waiting for the new pageserver
to catch up, and put it into AttachedSingle immediately.
- If the new node fails, put the old pageserver into Secondary
and then back into AttachedSingle (this has the effect of
retaining on-disk state and granting it a fresh generation number).
- If both nodes fail, keep trying until one of them is available
again.
### Control plane -> Pageserver reconciliation
A migration may be done while the old node is unavailable,
in which case the old node may still be running in an AttachedStale
state.
In this case, it is undesirable to have the migration `Operation`
stay alive until the old node eventually comes back online
and can be cleaned up. To handle this, the control plane
should run a background reconciliation process to compare
a pageserver's attachments with the database, and clean up
any that shouldn't be there any more.
Note that there will be no work to do if the old node was really
offline, as during startup it will call into `/re-attach` and
be updated that way. The reconciliation will only be needed
if the node was unavailable but still running.
## Alternatives considered
### Only enabling secondary locations for tenants on a higher service tier
This will make sense in future, especially for tiny databases that may be
downloaded from S3 in milliseconds when needed.
However, it is not wise to do it immediately, because pageservers contain
a mixture of higher and lower tier workloads. If we had 1 tenant with
a secondary location and 9 without, then those other 9 tenants will do
a lot of I/O as they try to recover from S3, which may degrade the
service of the tenant which had a secondary location.
Until we segregate tenants of different service tiers onto different pageserver
nodes, or implement & test QoS to ensure that tenants with secondaries are
not harmed by tenants without, we should use the same failover approach
for all the tenants.
### Hot secondary locations (continuous WAL replay)
Instead of secondary locations populating their caches from S3, we could
have them consume the WAL from safekeepers. The downsides of this would be:
- Double load on safekeepers, which are a less scalable service than S3
- Secondary locations' on-disk state would end up subtly different to
the remote state, which would make synchronizing with S3 more complex/expensive
when going into attached state.
The downside of only updating secondary locations from S3 is that we will
have a delay during migration from replaying the LSN range between what's
in S3 and what's in the pageserver. This range will be very small on
planned migrations, as we have the old pageserver flush to S3 immediately
before attaching the new pageserver. On unplanned migrations (old pageserver
is unavailable), the range of LSNs to replay is bounded by the flush frequency
on the old pageserver. However, the migration doesn't have to wait for the
replay: it's just that not-yet-replayed LSNs will be unavailable for read
until the new pageserver catches up.
We expect that pageserver reads of the most recent LSNs will be relatively
rare, as for an active endpoint those pages will usually still be in the postgres
page cache: this leads us to prefer synchronizing from S3 on secondary
locations, rather than consuming the WAL from safekeepers.
### Cold secondary locations
It is not functionally necessary to keep warm caches on secondary locations at all. However, if we do not, then
we would experience a de-facto availability loss in unplanned migrations, as reads to the new node would take an extremely long time (many seconds, perhaps minutes).
Warm caches on secondary locations are necessary to meet
our availability goals.
### Pageserver-granularity failover
Instead of migrating tenants individually, we could have entire spare nodes,
and on a node death, move all its work to one of these spares.
This approach is avoided for several reasons:
- we would still need fine-grained tenant migration for other
purposes such as balancing load
- by sharing the spare capacity over many peers rather than one spare node,
these peers may use the capacity for other purposes, until it is needed
to handle migrated tenants. e.g. for keeping a deeper cache of their
attached tenants.
### Readonly during migration
We could simplify migrations by making both previous and new nodes go into a
readonly state, then flush remote content from the previous node, then activate
attachment on the secondary node.
The downside to this approach is a potentially large gap in readability of
recent LSNs while loading data onto the new node. To avoid this, it is worthwhile
to incur the extra cost of double-replaying the WAL onto old and new nodes' local
storage during a migration.
### Peer-to-peer pageserver communication
Rather than uploading the heatmap to S3, attached pageservers could make it
available to peers.
Currently, pageservers have no peer to peer communication, so adding this
for heatmaps would incur significant overhead in deployment and configuration
of the service, and ensuring that when a new pageserver is deployed, other
pageservers are updated to be aware of it.
As well as simplifying implementation, putting heatmaps in S3 will be useful
for future analytics purposes -- gathering aggregated statistics on activity
patterns across many tenants may be done directly from data in S3.

View File

@@ -0,0 +1,197 @@
# Per-Tenant GetPage@LSN Throttling
Author: Christian Schwarz
Date: Oct 24, 2023
## Summary
This RFC proposes per-tenant throttling of GetPage@LSN requests inside Pageserver
and the interactions with its client, i.e., the neon_smgr component in Compute.
The result of implementing & executing this RFC will be a fleet-wide upper limit for
**"the highest GetPage/second that Pageserver can support for a single tenant/shard"**.
## Background
### GetPage@LSN Request Flow
Pageserver exposes its `page_service.rs` as a libpq listener.
The Computes' `neon_smgr` module connects to that libpq listener.
Once a connection is established, the protocol allows Compute to request page images at a given LSN.
We call these requests GetPage@LSN requests, or GetPage requests for short.
Other request types can be sent, but these are low traffic compared to GetPage requests
and are not the concern of this RFC.
Pageserver associates one libpq connection with one tokio task.
Per connection/task, the pq protocol is handled by the common `postgres_backend` crate.
Its `run_message_loop` function invokes the `page_service` specific `impl<IO> postgres_backend::Handler<IO> for PageServerHandler`.
Requests are processed in the order in which they arrive via the TCP-based pq protocol.
So, there is no concurrent request processing within one connection/task.
There is a degree of natural pipelining:
Compute can "fill the pipe" by sending more than one GetPage request into the libpq TCP stream.
And Pageserver can fill the pipe with responses in the other direction.
Both directions are subject to the limit of tx/rx buffers, nodelay, TCP flow control, etc.
### GetPage@LSN Access Pattern
The Compute has its own hierarchy of caches, specifically `shared_buffers` and the `local file cache` (LFC).
Compute only issues GetPage requests to Pageserver if it encounters a miss in these caches.
If the working set stops fitting into Compute's caches, requests to Pageserver increase sharply -- the Compute starts *thrashing*.
## Motivation
In INC-69, a tenant issued 155k GetPage/second for a period of 10 minutes and 60k GetPage/second for a period of 3h,
then dropped to ca. 18k GetPage/second for a period of 9h.
We noticed this because of an internal GetPage latency SLO burn rate alert, i.e.,
the request latency profile during this period significantly exceeded what was acceptable according to the internal SLO.
Sadly, we do not have the observability data to determine the impact of this tenant on other tenants on the same pageserver.
However, here are some illustrative data points for the 155k period:
The tenant was responsible for >= 99% of the GetPage traffic and, frankly, the overall activity on this Pageserver instance.
We were serving pages at 10 Gb/s (`155k x 8 kbyte (PAGE_SZ) per second is 1.12GiB/s = 9.4Gb/s.`)
The CPU utilization of the instance was 75% user+system.
Pageserver page cache served 1.75M accesses/second at a hit rate of ca 90%.
The hit rate for materialized pages was ca. 40%.
Curiously, IOPS to the Instance Store NVMe were very low, rarely exceeding 100.
The fact that the IOPS were so low / the materialized page cache hit rate was so high suggests that **this tenant's compute's caches were thrashing**.
The compute was of type `k8s-pod`; hence, auto-scaling could/would not have helped remediate the thrashing by provisioning more RAM.
The consequence was that the **thrashing translated into excessive GetPage requests against Pageserver**.
My claim is that it was **unhealthy to serve this workload at the pace we did**:
* it is likely that other tenants were/would have experienced high latencies (again, we sadly don't have per-tenant latency data to confirm this)
* more importantly, it was **unsustainable** to serve traffic at this pace for multiple reasons:
* **predictability of performance**: when the working set grows, the pageserver materialized page cache hit rate drops.
At some point, we're bound by the EC2 Instance Store NVMe drive's IOPS limit.
The result is an **uneven** performance profile from the Compute perspective.
* **economics**: Neon currently does not charge for IOPS, only capacity.
**We cannot afford to undercut the market in IOPS/$ this drastically; it leads to adverse selection and perverse incentives.**
For example, the 155k IOPS, which we served for 10min, would cost ca. 6.5k$/month when provisioned as an io2 EBS volume.
Even the 18k IOPS, which we served for 9h, would cost ca. 1.1k$/month when provisioned as an io2 EBS volume.
We charge 0$.
It could be economically advantageous to keep using a low-DRAM compute because Pageserver IOPS are fast enough and free.
Note: It is helpful to think of Pageserver as a disk, because it's precisely where `neon_smgr` sits:
vanilla Postgres gets its pages from disk, Neon Postgres gets them from Pageserver.
So, regarding the above performance & economic arguments, it is fair to say that we currently provide an "as-fast-as-possible-IOPS" disk that we charge for only by capacity.
## Solution: Throttling GetPage Requests
**The consequence of the above analysis must be that Pageserver throttles GetPage@LSN requests**.
That is, unless we want to start charging for provisioned GetPage@LSN/second.
Throttling sets the correct incentive for a thrashing Compute to scale up its DRAM to the working set size.
Neon Autoscaling will make this easy, [eventually](https://github.com/neondatabase/neon/pull/3913).
## The Design Space
What remains is the question of *policy* and *mechanism*:
**Policy** concerns itself with the question of what limit applies to a given connection|timeline|tenant.
Candidates are:
* hard limit, same limit value per connection|timeline|tenant
* Per-tenant will provide an upper bound for the impact of a tenant on a given Pageserver instance.
This is a major operational pain point / risk right now.
* hard limit, configurable per connection|timeline|tenant
* This outsources policy to console/control plane, with obvious advantages for flexible structuring of what service we offer to customers.
* Note that this is not a mechanism to guarantee a minimum provisioned rate, i.e., this is not a mechanism to guarantee a certain QoS for a tenant.
* fair share among active connections|timelines|tenants per instance
* example: each connection|timeline|tenant gets a fair fraction of the machine's GetPage/second capacity
* NB: needs definition of "active", and knowledge of available GetPage/second capacity in advance
* ...
Regarding **mechanism**, it's clear that **backpressure** is the way to go.
However, we must choose between
* **implicit** backpressure through pq/TCP and
* **explicit** rejection of requests + retries with exponential backoff
Further, there is the question of how throttling GetPage@LSN will affect the **internal GetPage latency SLO**:
where do we measure the SLI for Pageserver's internal getpage latency SLO? Before or after the throttling?
And when we eventually move the measurement point into the Computes (to avoid coordinated omission),
how do we avoid counting throttling-induced latency toward the internal getpage latency SLI/SLO?
## Scope Of This RFC
**This RFC proposes introducing a hard GetPage@LSN/second limit per tenant, with the same value applying to each tenant on a Pageserver**.
This proposal is easy to implement and significantly de-risks operating large Pageservers,
based on the assumption that extremely-high-GetPage-rate-episodes like the one from the "Motivation" section are uncorrelated between tenants.
For example, suppose we pick a limit that allows up to 10 tenants to go at limit rate.
Suppose our Pageserver can serve 100k GetPage/second total at a 100% page cache miss rate.
If each tenant gets a hard limit of 10k GetPage/second, we can serve up to 10 tenants at limit speed without latency degradation.
The mechanism for backpressure will be TCP-based implicit backpressure.
The compute team isn't concerned about prefetch queue depth.
Pageserver will implement it by delaying the reading of requests from the libpq connection(s).
The rate limit will be implemented using a per-tenant token bucket.
The bucket will be shared among all connections to the tenant.
The bucket implementation supports starvation-preventing `await`ing.
The current candidate for the implementation is [`leaky_bucket`](https://docs.rs/leaky-bucket/).
The getpage@lsn benchmark that's being added in https://github.com/neondatabase/neon/issues/5771
can be used to evaluate the overhead of sharing the bucket among connections of a tenant.
A possible technique to mitigate the impact of sharing the bucket would be to maintain a buffer of a few tokens per connection handler.
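As a sketch of the intended mechanism, assuming the builder-style API currently exposed by the `leaky-bucket` crate (parameters, names, and refill granularity are illustrative, not the final implementation):
```rust
use std::time::Duration;

use leaky_bucket::RateLimiter;

/// Sketch of a per-tenant GetPage throttle, to be shared (e.g. behind an
/// `Arc`) by all page_service connections of one tenant.
/// Names and numbers are illustrative, not the final implementation.
pub struct TenantThrottle {
    limiter: RateLimiter,
}

impl TenantThrottle {
    /// Allow roughly `rate` GetPage requests per second, refilled in small
    /// steps to smooth bursts (illustrative parameters; assumes `rate >= 100`).
    pub fn new(rate: usize) -> Self {
        let limiter = RateLimiter::builder()
            .max(rate)                           // bucket capacity (burst allowance)
            .initial(rate)                       // start with a full bucket
            .refill(rate / 100)                  // refill a small batch of tokens ...
            .interval(Duration::from_millis(10)) // ... every 10ms, i.e. ~`rate` per second
            .build();
        Self { limiter }
    }

    /// Called by the connection handler before reading/serving the next
    /// GetPage request; awaiting here is what ultimately turns into
    /// TCP-level backpressure toward the compute.
    pub async fn acquire(&self) {
        self.limiter.acquire_one().await;
    }
}
```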
Regarding metrics / the internal GetPage latency SLO:
we will measure the GetPage latency SLO _after_ the throttler and introduce a new metric to measure the amount of throttling, quantified by:
- histogram that records the tenants' observations of queue depth before they start waiting (one such histogram per pageserver)
- histogram that records the tenants' observations of time spent waiting (one such histogram per pageserver)
Further observability measures:
- an INFO log message at frequency 1/min if the tenant/timeline/connection was throttled in that last minute.
The message will identify the tenant/timeline/connection to allow correlation with compute logs/stats.
Rollout will happen as follows:
- deploy 1: implementation + config: disabled by default, ability to enable it per tenant through tenant_conf
- experimentation in staging and later production to study impact & interaction with auto-scaling
- determination of a sensible global default value
- the value will be chosen as high as possible ...
- ... but low enough to work towards this RFC's goal that one tenant should not be able to dominate a pageserver instance.
- deploy 2: implementation fixes if any + config: enabled by default with the aforementioned global default
- reset of the experimental per-tenant overrides
- gain experience & lower the limit over time
- we stop lowering the limit as soon as this RFC's goal is achieved, i.e.,
once we decide that in practice the chosen value sufficiently de-risks operating large pageservers
The per-tenant override will remain for emergencies and testing.
But since Console doesn't preserve it during tenant migrations, it isn't durably configurable for the tenant.
Toward the upper layers of the Neon stack, the resulting limit will be
**"the highest GetPage/second that Pageserver can support for a single tenant"**.
### Rationale
We decided against error + retry because of worries about starvation.
## Future Work
Enable per-tenant emergency override of the limit via Console.
Should be part of a more general framework to specify tenant config overrides.
**NB:** this is **not** the right mechanism to _sell_ different max GetPage/second levels to users,
or _auto-scale_ the GetPage/second levels. Such functionality will require a separate RFC that
concerns itself with GetPage/second capacity planning.
Compute-side metrics for GetPage latency.
Back-channel to inform Compute/Autoscaling/ControlPlane that the project is being throttled.
Compute-side neon_smgr improvements to avoid sending the same GetPage request multiple times if multiple backends experience a cache miss.
Dealing with read-only endpoints: users use read-only endpoints to scale reads for a single tenant.
Possibly there are also assumptions around read-only endpoints not affecting the primary read-write endpoint's performance.
With per-tenant rate limiting, we will not meet that expectation.
However, we can currently only scale per tenant.
Soon, we will have sharding (#5505), which will apply the throttling on a per-shard basis.
But, that's orthogonal to scaling reads: if many endpoints hit one shard, they share the same throttling limit.
To solve this properly, I think we'll need replicas for tenants / shard.
To performance-isolate a tenant's endpoints from each other, we'd then route them to different replicas.

View File

@@ -0,0 +1,205 @@
# Name
Created on: 2023-09-08
Author: Arpad Müller
## Summary
Enable the pageserver to recover from data corruption events by implementing
a feature to re-apply historic WAL records in parallel to the already occurring
WAL replay.
The feature is outside of the user-visible backup and history story, and only
serves as a second-level backup for the case that there is a bug in the
pageservers that corrupted the served pages.
The RFC proposes the addition of two new features:
* recover a broken branch from WAL (downtime is allowed)
* a test recovery system to recover random branches to make sure recovery works
## Motivation
The historic WAL is currently stored in S3 even after it has been replayed by
the pageserver and thus been integrated into the pageserver's storage system.
This is done to defend from data corruption failures inside the pageservers.
However, application of this WAL in the disaster recovery setting is currently
very manual and we want to automate this to make it easier.
### Use cases
There are various use cases for this feature, like:
* The main motivation is replaying in the instance of pageservers corrupting
data.
* Beyond the user-visible history features, we might, in select instances and upon customer request through our support channels, want to recover historic versions beyond the range of history that we officially support.
* Running the recovery process in the background for random tenant timelines
to figure out if there was a corruption of data (we would compare with what
the pageserver stores for the "official" timeline).
* Using the WAL to arrive at historic pages we can then back up to S3 so that
WAL itself can be discarded, or at least not used for future replays.
Again, this sounds a lot like what the pageserver is already doing, but the
point is to provide a fallback to the service provided by the pageserver.
## Design
### Design constraints
The main design constraint is that the feature needs to be *simple* enough that
the number of bugs is as low, and the reliability as high, as possible: the main
goal of this endeavour is to achieve higher correctness than the pageserver.
For the background process, we cannot afford a downtime of the timeline that is
being cloned, as we don't want to restrict ourselves to offline tenants only.
In the scenario where we want to recover from disasters or roll back to a
historic lsn through support staff, downtimes are more affordable, and
inevitable if the original had been subject to the corruption. Ideally, the
two code paths would share code, so the solution would be designed for not
requiring downtimes.
### API endpoint changes
This RFC proposes two API endpoint changes in the safekeeper and the
pageserver.
Remember, the pageserver timeline API creation endpoint is to this URL:
```
/v1/tenant/{tenant_id}/timeline/
```
Where `{tenant_id}` is the ID of the tenant the timeline is created for,
and specified as part of the URL. The timeline ID is passed via the POST
request body as the only required parameter `new_timeline_id`.
This proposal adds one optional parameter called
`existing_initdb_timeline_id` to the request's json body. If the parameter
is not specified, the behaviour remains as it is today: the pageserver runs
initdb.
If the parameter is specified, it is expected to point to a timeline ID.
In fact, that ID might match `new_timeline_id`; what's important is that
S3 storage contains a matching initdb under the URL matching the given
tenant and timeline.
Having both `ancestor_timeline_id` and `existing_initdb_timeline_id`
specified is illegal and will result in an HTTP error. This feature is
only meant for the "main" branch that doesn't have any ancestors
of its own, as only here initdb is relevant.
For the safekeeper, we propose the addition of the following copy endpoint:
```
/v1/tenant/{tenant_id}/timeline/{source_timeline_id}/copy
```
it is meant for POST requests with json, and the two URL parameters
`tenant_id` and `source_timeline_id`. The json request body contains
the two required parameters `target_timeline_id` and `until_lsn`.
After invoking, the copy endpoint starts a copy process of the WAL from
the source ID to the target ID. The lsn is updated according to the
progress of the API call.
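For illustration, the two request bodies could be modeled as follows; this is a sketch, and the types and field representations are assumptions:
```rust
use serde::{Deserialize, Serialize};

/// Sketch of the extended pageserver timeline-creation body.
/// Only the fields discussed here are shown; names are illustrative.
#[derive(Serialize, Deserialize)]
struct TimelineCreateRequest {
    new_timeline_id: String,
    /// Existing optional ancestor; must not be combined with
    /// `existing_initdb_timeline_id`.
    ancestor_timeline_id: Option<String>,
    /// New: reuse the initdb snapshot stored in S3 under this timeline ID
    /// instead of running initdb.
    existing_initdb_timeline_id: Option<String>,
}

/// Sketch of the proposed safekeeper copy-endpoint body
/// (POST /v1/tenant/{tenant_id}/timeline/{source_timeline_id}/copy).
#[derive(Serialize, Deserialize)]
struct TimelineCopyRequest {
    target_timeline_id: String,
    until_lsn: String,
}
```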
### Higher level features
We want the API changes to support the following higher level features:
* recovery-after-corruption DR of the main timeline of a tenant. This
feature allows for downtime.
* test DR of the main timeline into a special copy timeline. this feature
is meant to run against selected production tenants in the background,
without the user noticing, so it does not allow for downtime.
The recovery-after-corruption DR only needs the pageserver changes.
It works as follows:
* delete the timeline from the pageservers via timeline deletion API
* re-create it via timeline creation API (same ID as before) and set
`existing_initdb_timeline_id` to the same timeline ID
The test DR requires also the copy primitive and works as follows:
* copy the WAL of the timeline to a new place
* create a new timeline for the tenant
## Non Goals
At the danger of being repetitive, the main goal of this feature is to be a
backup method, so reliability is very important. This implies that other
aspects like performance or space reduction are less important.
### Corrupt WAL
The process suggested by this RFC assumes that the WAL is free of corruption.
In some instances, corruption can make it into WAL, like for example when
higher level components like postgres or the application first read corrupt
data, and then execute a write with data derived from that earlier read. That
written data might then contain the corruption.
Common use cases can hit this quite easily. For example, an application reads
some counter, increments it, and then writes the new counter value to the
database.
On a lower level, the compute might put FPIs (Full Page Images) into the WAL,
which have corrupt data for rows unrelated to the write operation at hand.
Separating corrupt writes from non-corrupt ones is a hard problem in general,
and if the application was involved in making the corrupt write, a recovery
would also involve the application. Therefore, corruption that has made it into
the WAL is outside of the scope of this feature. However, the WAL replay can be
issued to right before the point in time where the corruption occurred. Then the
data loss is isolated to post-corruption writes only.
## Impacted components (e.g. pageserver, safekeeper, console, etc)
Most changes would happen to the pageservers.
For the higher level features, maybe other components like the console would
be involved.
We need to make sure that the shadow timelines are not subject to the usual
limits and billing we apply to existing timelines.
## Proposed implementation
The first problem to keep in mind is the reproducibility of `initdb`.
So an initial step would be to upload `initdb` snapshots to S3.
After that, we'd have the endpoint spawn a background process which
performs the replay of the WAL to that new timeline. This process should
follow the existing workflows as closely as possible, just using the
WAL records of a different timeline.
The timeline created will be in a special state that solely looks for WAL
entries of the timeline it is trying to copy. Once the target LSN is reached,
it turns into a normal timeline that also accepts writes to its own
timeline ID.
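A minimal sketch of how that special state might be modeled; this is purely illustrative, and no such type exists today:
```rust
/// Illustrative sketch of the replay state for a timeline created with
/// `existing_initdb_timeline_id`; names are assumptions, not existing types.
enum ShadowTimelineState {
    /// Only ingest WAL records belonging to the source timeline,
    /// up to the requested LSN.
    ReplayingSource {
        source_timeline_id: String,
        until_lsn: u64,
    },
    /// Target LSN reached: behave like a normal timeline and accept
    /// WAL addressed to its own timeline ID.
    Normal,
}
```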
### Scalability
For now we want to run this entire process on a single node, and as
it is by nature linear, it's hard to parallelize. However, for the
verification workloads, we can easily start the WAL replay in parallel
for different points in time. This is valuable especially for tenants
with large WAL records.
Compare this with carry-select adders, which lower latency by performing the
addition for both possible values of the carry bit and then, in a second step,
selecting the result for the carry value that was actually obtained.
The other scalability dimension to consider is the WAL length, which
is a growing question as tenants accumulate changes. There are
possible approaches to this, including creating snapshots of the
page files and uploading them to S3, but if we do this for every single
branch, we lose the cheap branching property.
### Implementation by component
The proposed changes for the various components of the neon architecture
are written up in this notion page:
https://www.notion.so/neondatabase/Pageserver-disaster-recovery-one-pager-4ecfb5df16ce4f6bbfc3817ed1a6cbb2
### Unresolved questions
none known (outside of the mentioned ones).

View File

@@ -0,0 +1,142 @@
# Vectored Timeline Get
Created on: 2024-01-02
Author: Christian Schwarz
# Summary
A brief RFC / GitHub Epic describing a vectored version of the `Timeline::get` method that is at the heart of Pageserver.
# Motivation
During basebackup, we issue many `Timeline::get` calls for SLRU pages that are *adjacent* in key space.
For an example, see
https://github.com/neondatabase/neon/blob/5c88213eaf1b1e29c610a078d0b380f69ed49a7e/pageserver/src/basebackup.rs#L281-L302.
Each of these `Timeline::get` calls must traverse the layer map to gather reconstruct data (`Timeline::get_reconstruct_data`) for the requested page number (`blknum` in the example).
For each layer visited by layer map traversal, we do a `DiskBtree` point lookup.
If it's negative (no entry), we resume layer map traversal.
If it's positive, we collect the result in our reconstruct data bag.
If the reconstruct data bag contents suffice to reconstruct the page, we're done with `get_reconstruct_data` and move on to walredo.
Otherwise, we resume layer map traversal.
Doing this many `Timeline::get` calls is quite inefficient because:
1. We do the layer map traversal repeatedly, even if, e.g., all the data sits in the same image layer at the bottom of the stack.
2. We may visit many DiskBtree inner pages multiple times for point lookup of different keys.
This is likely particularly bad for L0s which span the whole key space and hence must be visited by layer map traversal, but
may not contain the data we're looking for.
3. Anecdotally, keys adjacent in keyspace and written simultaneously also end up physically adjacent in the layer files [^1].
So, to provide the reconstruct data for N adjacent keys, we would actually only _need_ to issue a single large read to the filesystem, instead of the N reads we currently do.
The filesystem, in turn, ideally stores the layer file physically contiguously, so our large read will turn into one IOP toward the disk.
[^1]: https://www.notion.so/neondatabase/Christian-Investigation-Slow-Basebackups-Early-2023-12-34ea5c7dcdc1485d9ac3731da4d2a6fc?pvs=4#15ee4e143392461fa64590679c8f54c9
# Solution
We should have a vectored aka batched aka scatter-gather style alternative API for `Timeline::get`. Having such an API unlocks:
* more efficient basebackup
* batched IO during compaction (useful for strides of unchanged pages)
* page_service: expose vectored get_page_at_lsn for compute (=> good for seqscan / prefetch)
* if [on-demand SLRU downloads](https://github.com/neondatabase/neon/pull/6151) land before vectored Timeline::get, on-demand SLRU downloads will still benefit from this API
# DoD
There is a new variant of `Timeline::get`, called `Timeline::get_vectored`.
It takes as arguments an `lsn: Lsn` and a `src: &[KeyVec]` where `struct KeyVec { base: Key, count: usize }`.
It is up to the implementor to figure out a suitable and efficient way to return the reconstructed page images.
It is sufficient to simply return a `Vec<Bytes>`, but, likely more efficient solutions can be found after studying all the callers of `Timeline::get`.
Functionally, the behavior of `Timeline::get_vectored` is equivalent to
```rust
let mut keys_iter: impl Iterator<Item=Key>
= src.map(|KeyVec{ base, count }| (base..base+count)).flatten();
let mut out = Vec::new();
for key in keys_iter {
let data = Timeline::get(key, lsn)?;
out.push(data);
}
return out;
```
However, unlike above, an ideal solution will
* Visit each `struct Layer` at most once.
* For each visited layer, call `Layer::get_value_reconstruct_data` at most once.
* This means, read each `DiskBtree` page at most once.
* Facilitate merging of the reads we issue to the OS and eventually NVMe.
Each of these items above represents a significant amount of work.
## Performance
Ideally, the **base performance** of a vectored get of a single page should be identical to the current `Timeline::get`.
A reasonable constant overhead over current `Timeline::get` is acceptable.
The performance improvement for the vectored use case is demonstrated in some way, e.g., using the `pagebench` basebackup benchmark against a tenant with a lot of SLRU segments.
# Implementation
High-level set of tasks / changes to be made:
- **Get clarity on API**:
- Define naive `Timeline::get_vectored` implementation & adopt it across pageserver.
- The tricky thing here will be the return type (e.g. `Vec<Bytes>` vs `impl Stream`).
- Start with something simple to explore the different usages of the API.
Then iterate with peers until we have something that is good enough.
- **Vectored Layer Map traversal**
- Vectored `LayerMap::search` (take 1 LSN and N `Key`s instead of just 1 LSN and 1 `Key`)
- Refactor `Timeline::get_reconstruct_data` to hold & return state for N `Key`s instead of 1
- The slightly tricky part here is what to do about `cont_lsn` [after we've found some reconstruct data for some keys](https://github.com/neondatabase/neon/blob/d066dad84b076daf3781cdf9a692098889d3974e/pageserver/src/tenant/timeline.rs#L2378-L2385)
but need more.
Likely we'll need to keep track of `cont_lsn` per key and continue the next iteration at `max(cont_lsn)` of all keys that still need data (see the sketch below).
- **Vectored `Layer::get_value_reconstruct_data` / `DiskBtree`**
- Current code calls it [here](https://github.com/neondatabase/neon/blob/d066dad84b076daf3781cdf9a692098889d3974e/pageserver/src/tenant/timeline.rs#L2378-L2384).
- Delta layers use `DiskBtreeReader::visit()` to collect the `(offset,len)` pairs for delta record blobs to load.
- Image layers use `DiskBtreeReader::get` to get the offset of the image blob to load. Underneath, that's just a `::visit()` call.
- What needs to happen to `DiskBtree::visit()`?
* Minimally
* take a single `KeyVec` instead of a single `Key` as argument, i.e., take a single contiguous key range to visit.
* Change the visit code to invoke the callback for all values in the `KeyVec`'s key range
* This should be good enough for what we've seen when investigating basebackup slowness, because there, the key ranges are contiguous.
* Ideally:
* Take a `&[KeyVec]`, sort it;
* during Btree traversal, peek at the next `KeyVec` range to determine whether we need to descend or back out.
* NB: this should be a straight-forward extension of the minimal solution above, as we'll already be checking for "is there more key range in the requested `KeyVec`".
- **Facilitate merging of the reads we issue to the OS and eventually NVMe.**
- The `DiskBtree::visit` produces a set of offsets which we then read from a `VirtualFile` [here](https://github.com/neondatabase/neon/blob/292281c9dfb24152b728b1a846cc45105dac7fe0/pageserver/src/tenant/storage_layer/delta_layer.rs#L772-L804)
- [Delta layer reads](https://github.com/neondatabase/neon/blob/292281c9dfb24152b728b1a846cc45105dac7fe0/pageserver/src/tenant/storage_layer/delta_layer.rs#L772-L804)
- We hit (and rely) on `PageCache` and `VirtualFile` here (not great under pressure)
- [Image layer reads](https://github.com/neondatabase/neon/blob/292281c9dfb24152b728b1a846cc45105dac7fe0/pageserver/src/tenant/storage_layer/image_layer.rs#L429-L435)
- What needs to happen is the **vectorization of the `blob_io` interface and then the `VirtualFile` API**.
- That is tricky because
- the `VirtualFile` API, which sits underneath `blob_io`, is being touched by ongoing [io_uring work](https://github.com/neondatabase/neon/pull/5824)
- there's the question how IO buffers will be managed; currently this area relies heavily on `PageCache`, but there's controversy around the future of `PageCache`.
- The guiding principle here should be to avoid coupling this work to the `PageCache`.
- I.e., treat `PageCache` as an extra hop in the I/O chain, rather than as an integral part of buffer management.
Let's see how we can improve by doing the first three items in above list first, then revisit.
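As a sketch of the per-key `cont_lsn` bookkeeping mentioned under "Vectored Layer Map traversal" above (simplified stand-in types, not the actual pageserver code):
```rust
use std::collections::HashMap;

/// Illustrative bookkeeping for the per-key `cont_lsn` idea; `u64` stands in
/// for the real `Key` and `Lsn` types, and the names are assumptions.
struct VectoredTraversalState {
    /// Keys that still need more reconstruct data, and the LSN at which
    /// the next layer map traversal should continue for each of them.
    cont_lsn_by_key: HashMap<u64, u64>,
}

impl VectoredTraversalState {
    /// The next traversal iteration starts at the maximum cont_lsn over all
    /// keys that still need data; keys with a lower cont_lsn simply skip
    /// the layers above it.
    fn next_iteration_lsn(&self) -> Option<u64> {
        self.cont_lsn_by_key.values().copied().max()
    }

    /// Called once a layer provided enough data to finish reconstructing `key`.
    fn complete(&mut self, key: u64) {
        self.cont_lsn_by_key.remove(&key);
    }
}
```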
## Rollout / Feature Flags
No feature flags are required for this epic.
At the end of this epic, `Timeline::get` forwards to `Timeline::get_vectored`, i.e., it's an all-or-nothing type of change.
It is encouraged to deliver this feature incrementally, i.e., do many small PRs over multiple weeks.
That will help isolate performance regressions across weekly releases.
# Interaction With Sharding
[Sharding](https://github.com/neondatabase/neon/pull/5432) splits up the key space, see functions `is_key_local` / `key_to_shard_number`.
Just as with `Timeline::get`, callers of `Timeline::get_vectored` are responsible for ensuring that they only ask for blocks of the given `struct Timeline`'s shard.
Given that this is already the case, there shouldn't be significant interaction/interference with sharding.
However, let's have a safety check for this constraint (error or assertion) because there are currently few affordances at the higher layers of Pageserver for sharding<=>keyspace interaction.
For example, `KeySpace` is not broken up by shard stripe, so if someone naively converted the compaction code to issue a vectored get for a keyspace range it would violate this constraint.

View File

@@ -129,13 +129,13 @@ Run `poetry shell` to activate the virtual environment.
Alternatively, use `poetry run` to run a single command in the venv, e.g. `poetry run pytest`.
### Obligatory checks
We force code formatting via `ruff`, and type hints via `mypy`.
Run the following commands in the repository's root (next to `pyproject.toml`):
```bash
poetry run ruff format . # All code is reformatted
poetry run ruff check . # Python linter
poetry run mypy . # Ensure there are no typing errors
```
**WARNING**: do not run `mypy` from a directory other than the root of the repository.

View File

@@ -21,7 +21,7 @@ implementation where we keep more data than we would need to, do not
change the synthetic size or incur any costs to the user.
The synthetic size is calculated for the whole project. It is not
straightforward to attribute size to individual branches. See "What is
the size of an individual branch?" for discussion on those
difficulties.
@@ -248,7 +248,7 @@ and truncate the WAL.
Synthetic size is calculated for the whole project, and includes all
branches. There is no such thing as the size of a branch, because it
is not straightforward to attribute the parts of size to individual
branches.
## Example: attributing size to branches

docs/updating-postgres.md Normal file
View File

@@ -0,0 +1,108 @@
# Updating Postgres
## Minor Versions
When upgrading to a new minor version of Postgres, please follow these steps:
_Example: 15.4 is the new minor version to upgrade to from 15.3._
1. Clone the Neon Postgres repository if you have not done so already.
```shell
git clone git@github.com:neondatabase/postgres.git
```
1. Add the Postgres upstream remote.
```shell
git remote add upstream https://git.postgresql.org/git/postgresql.git
```
1. Create a new branch based on the stable branch you are updating.
```shell
git checkout -b my-branch REL_15_STABLE_neon
```
1. Tag the last commit on the stable branch you are updating.
```shell
git tag REL_15_3_neon
```
1. Push the new tag to the Neon Postgres repository.
```shell
git push origin REL_15_3_neon
```
1. Find the release tags you're looking for. They are of the form `REL_X_Y`.
1. Rebase the branch you created on the tag and resolve any conflicts.
```shell
git fetch upstream REL_15_4
git rebase REL_15_4
```
1. Run the Postgres test suite to make sure our commits have not affected
Postgres in a negative way.
```shell
make check
# OR
meson test -C builddir
```
1. Push your branch to the Neon Postgres repository.
```shell
git push origin my-branch
```
1. Clone the Neon repository if you have not done so already.
```shell
git clone git@github.com:neondatabase/neon.git
```
1. Create a new branch.
1. Change the `revisions.json` file to point at the HEAD of your Postgres
branch.
1. Update the Git submodule.
```shell
git submodule set-branch --branch my-branch vendor/postgres-v15
git submodule update --remote vendor/postgres-v15
```
1. Run the Neon test suite to make sure that Neon is still good to go on this
minor Postgres release.
```shell
./scripts/pytest -k pg15
```
1. Commit your changes.
1. Create a pull request, and wait for CI to go green.
1. Force push the rebased Postgres branches into the Neon Postgres repository.
```shell
git push --force origin my-branch:REL_15_STABLE_neon
```
It may require disabling various branch protections.
1. Update your Neon PR to point at the branches.
```shell
git submodule set-branch --branch REL_15_STABLE_neon vendor/postgres-v15
git commit --amend --no-edit
git push --force origin
```
1. Merge the pull request after getting approval(s) and CI completion.


@@ -1,3 +1,5 @@
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
pub mod requests;
pub mod responses;
pub mod spec;
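(A side note, not part of the diff above.) A minimal standalone sketch of what these two crate-level lints enforce together: `deny(unsafe_code)` rejects `unsafe` outright, and any explicitly allowed exception must carry a `// SAFETY:` comment or `clippy::undocumented_unsafe_blocks` fails the build. The `first_byte` helper below is hypothetical, purely for illustration.
```rust
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]

/// Hypothetical helper: the crate denies `unsafe` by default, so the one
/// exception is opted in explicitly and must document its safety argument.
#[allow(unsafe_code)]
fn first_byte(bytes: &[u8]) -> Option<u8> {
    if bytes.is_empty() {
        return None;
    }
    // SAFETY: we just checked that `bytes` is non-empty, so index 0 is in bounds.
    let byte = unsafe { *bytes.get_unchecked(0) };
    Some(byte)
}

fn main() {
    assert_eq!(first_byte(b"abc"), Some(b'a'));
    assert_eq!(first_byte(b""), None);
}
```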


@@ -6,7 +6,6 @@
use std::collections::HashMap;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
@@ -19,7 +18,6 @@ pub type PgIdent = String;
/// Cluster spec or configuration represented as an optional number of
/// delta operations + final cluster state description.
#[serde_as]
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
pub struct ComputeSpec {
pub format_version: f32,
@@ -28,6 +26,13 @@ pub struct ComputeSpec {
// but we don't use it for anything. Serde will ignore missing fields when
// deserializing it.
pub operation_uuid: Option<String>,
/// Compute features to enable. These feature flags are provided when we
/// know all the details about the client's compute, so they cannot be used
/// to change `Empty` compute behavior.
#[serde(default)]
pub features: Vec<ComputeFeature>,
/// Expected cluster state at the end of transition process.
pub cluster: Cluster,
pub delta_operations: Option<Vec<DeltaOp>>,
@@ -50,12 +55,12 @@ pub struct ComputeSpec {
// these, and instead set the "neon.tenant_id", "neon.timeline_id",
// etc. GUCs in cluster.settings. TODO: Once the control plane has been
// updated to fill these fields, we can make these non optional.
#[serde_as(as = "Option<DisplayFromStr>")]
pub tenant_id: Option<TenantId>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub timeline_id: Option<TimelineId>,
#[serde_as(as = "Option<DisplayFromStr>")]
pub pageserver_connstring: Option<String>,
#[serde(default)]
pub safekeeper_connstrings: Vec<String>,
@@ -68,6 +73,31 @@ pub struct ComputeSpec {
// information about available remote extensions
pub remote_extensions: Option<RemoteExtSpec>,
pub pgbouncer_settings: Option<HashMap<String, String>>,
// Stripe size for pageserver sharding, in pages
#[serde(default)]
pub shard_stripe_size: Option<usize>,
}
/// Feature flag to signal `compute_ctl` to enable certain experimental functionality.
#[derive(Serialize, Clone, Copy, Debug, Deserialize, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum ComputeFeature {
// XXX: Add more feature flags here.
/// Enable the experimental activity monitor logic, which uses `pg_stat_database` to
/// track short-lived connections as user activity.
ActivityMonitorExperimental,
/// Enable running migrations
Migrations,
/// Special feature flag used to represent feature flags unknown to this enum:
/// any unrecognized flag is deserialized as this variant. See the unit test
/// `parse_unknown_features()` for more details.
#[serde(other)]
UnknownFeature,
}
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
@@ -140,14 +170,13 @@ impl RemoteExtSpec {
}
}
#[serde_as]
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq, Deserialize, Serialize)]
pub enum ComputeMode {
/// A read-write node
#[default]
Primary,
/// A read-only node, pinned at a particular LSN
Static(#[serde_as(as = "DisplayFromStr")] Lsn),
Static(Lsn),
/// A read-only node that follows the tip of the branch in hot standby mode
///
/// Future versions may want to distinguish between replicas with hot standby
@@ -200,6 +229,12 @@ pub struct Database {
pub name: PgIdent,
pub owner: PgIdent,
pub options: GenericOptions,
// These are derived flags, not present in the spec file.
// They are never set by the control plane.
#[serde(skip_deserializing, default)]
pub restrict_conn: bool,
#[serde(skip_deserializing, default)]
pub invalid: bool,
}
/// Common type representing both SQL statement params with or without value,
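(A side note, not part of the diff.) A minimal standalone sketch, assuming `serde` and `serde_json`, of what `#[serde(skip_deserializing, default)]` does for the derived `restrict_conn`/`invalid` flags above: even if a spec file tries to set them, parsing always falls back to the default, so only compute-side code can flip them.
```rust
use serde::Deserialize;

// Trimmed-down stand-in for the `Database` struct above.
#[derive(Debug, Deserialize)]
struct Database {
    name: String,
    // Derived flags: never read from the spec file, always defaulted on parse.
    #[serde(skip_deserializing, default)]
    restrict_conn: bool,
    #[serde(skip_deserializing, default)]
    invalid: bool,
}

fn main() {
    // The incoming JSON tries to set `restrict_conn`, but the field is skipped
    // during deserialization, so it stays at its default (`false`).
    let db: Database =
        serde_json::from_str(r#"{"name": "neondb", "restrict_conn": true}"#).unwrap();
    assert_eq!(db.name, "neondb");
    assert!(!db.restrict_conn);
    assert!(!db.invalid);
}
```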
@@ -224,7 +259,10 @@ mod tests {
#[test]
fn parse_spec_file() {
let file = File::open("tests/cluster_spec.json").unwrap();
let _spec: ComputeSpec = serde_json::from_reader(file).unwrap();
let spec: ComputeSpec = serde_json::from_reader(file).unwrap();
// Features list defaults to empty vector.
assert!(spec.features.is_empty());
}
#[test]
@@ -236,4 +274,41 @@ mod tests {
ob.insert("unknown_field_123123123".into(), "hello".into());
let _spec: ComputeSpec = serde_json::from_value(json).unwrap();
}
#[test]
fn parse_unknown_features() {
// Test that unknown feature flags do not cause any errors.
let file = File::open("tests/cluster_spec.json").unwrap();
let mut json: serde_json::Value = serde_json::from_reader(file).unwrap();
let ob = json.as_object_mut().unwrap();
// Add unknown feature flags.
let features = vec!["foo_bar_feature", "baz_feature"];
ob.insert("features".into(), features.into());
let spec: ComputeSpec = serde_json::from_value(json).unwrap();
assert!(spec.features.len() == 2);
assert!(spec.features.contains(&ComputeFeature::UnknownFeature));
assert_eq!(spec.features, vec![ComputeFeature::UnknownFeature; 2]);
}
#[test]
fn parse_known_features() {
// Test that we can properly parse known feature flags.
let file = File::open("tests/cluster_spec.json").unwrap();
let mut json: serde_json::Value = serde_json::from_reader(file).unwrap();
let ob = json.as_object_mut().unwrap();
// Add known feature flags.
let features = vec!["activity_monitor_experimental"];
ob.insert("features".into(), features.into());
let spec: ComputeSpec = serde_json::from_value(json).unwrap();
assert_eq!(
spec.features,
vec![ComputeFeature::ActivityMonitorExperimental]
);
}
}
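(A side note, not part of the diff.) The tests above rely on serde's catch-all variant mechanism. A minimal standalone sketch, assuming `serde` and `serde_json`, of the same pattern: known snake_case strings map to their variants, and anything unrecognized folds into the `#[serde(other)]` variant instead of failing deserialization.
```rust
use serde::Deserialize;

// Trimmed-down stand-in for the `ComputeFeature` enum above.
#[derive(Debug, Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
enum Feature {
    ActivityMonitorExperimental,
    Migrations,
    // Catch-all: any flag this build does not know about lands here.
    #[serde(other)]
    UnknownFeature,
}

fn main() {
    let features: Vec<Feature> =
        serde_json::from_str(r#"["migrations", "some_future_flag"]"#).unwrap();
    // The known flag parses to its variant; the unknown one is preserved as
    // `UnknownFeature` rather than causing an error, so unfamiliar flags never
    // break spec parsing.
    assert_eq!(features, vec![Feature::Migrations, Feature::UnknownFeature]);
}
```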


@@ -76,7 +76,7 @@
},
{
"name": "wal_level",
"value": "replica",
"value": "logical",
"vartype": "enum"
},
{
@@ -243,5 +243,9 @@
"public_extensions": [
"postgis"
]
},
"pgbouncer_settings": {
"default_pool_size": "42",
"pool_mode": "session"
}
}


@@ -1,6 +1,6 @@
//!
//! Shared code for consumption metrics collection
//!
#![deny(unsafe_code)]
#![deny(clippy::undocumented_unsafe_blocks)]
use chrono::{DateTime, Utc};
use rand::Rng;
use serde::{Deserialize, Serialize};
@@ -107,7 +107,7 @@ pub const CHUNK_SIZE: usize = 1000;
// Just a wrapper around a slice of events
// to serialize it as `{"events" : [ ] }
#[derive(serde::Serialize)]
#[derive(serde::Serialize, serde::Deserialize)]
pub struct EventChunk<'a, T: Clone> {
pub events: std::borrow::Cow<'a, [T]>,
}
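(A side note, not part of the diff.) With the `Deserialize` derive added here, the wrapper round-trips cleanly. A minimal sketch, assuming `serde` and `serde_json`, of what the `{"events": [...]}` envelope looks like in practice; the numeric event type is a placeholder for the real event structs.
```rust
use std::borrow::Cow;

use serde::{Deserialize, Serialize};

// Same shape as the `EventChunk` wrapper above, with `u32` standing in for a
// real event type.
#[derive(Serialize, Deserialize)]
struct EventChunk<'a, T: Clone> {
    events: Cow<'a, [T]>,
}

fn main() {
    let events = vec![1u32, 2, 3];

    // Borrow the slice for serialization: no copy of the events is needed.
    let chunk = EventChunk { events: Cow::Borrowed(&events[..]) };
    let json = serde_json::to_string(&chunk).unwrap();
    assert_eq!(json, r#"{"events":[1,2,3]}"#);

    // Deserializing produces an owned `Cow`, since nothing borrows from `json`.
    let parsed: EventChunk<'_, u32> = serde_json::from_str(&json).unwrap();
    assert_eq!(parsed.events.into_owned(), vec![1, 2, 3]);
}
```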

Some files were not shown because too many files have changed in this diff.