Compare commits

...

185 Commits

Author SHA1 Message Date
quantumish
44ec0e6f06 Document problems and pitfalls with fine-grained hashmap impl 2025-08-12 13:42:48 -07:00
quantumish
d33071a386 Fix concurrency bugs in resizing (WIP) 2025-08-12 10:34:29 -07:00
David Freifeld
add1a0ad78 Add initial work in implementing incremental resizing (WIP) 2025-07-14 09:02:56 -07:00
David Freifeld
282b90df28 Fix freelist bug, clean up interface to BucketIdx 2025-07-11 12:47:27 -07:00
David Freifeld
7f655ddb9a Heavily WIP rewrite of hashmap for concurrency + perf 2025-07-10 17:34:30 -07:00
David Freifeld
da6f4dbafe Merge branch 'communicator-rewrite' into quantumish/lfc-resizable-map 2025-07-10 17:33:56 -07:00
Heikki Linnakangas
a79fd3bda7 Move logic for picking request slot to the C code
With this refactoring, the Rust code deals with one giant array of
requests, and doesn't know that it's sliced up per backend
process. The C code is now responsible for slicing it up.

This also adds code to complete old IOs at backends start that were
started and left behind by a previous session. That was a little more
straightforward to do with the refactoring, which is why I tackled it
now.
2025-07-07 12:59:08 +03:00
Heikki Linnakangas
e1b58d5d69 Don't segfault if one of the unimplemented functions are called
We'll need to implement these, but let's stop the crashing for now
2025-07-07 11:33:44 +03:00
Erik Grinaker
9ae004f3bc Rename ShardMap to ShardSpec 2025-07-06 19:13:59 +02:00
Erik Grinaker
341c5f53d8 Restructure get_page retries 2025-07-06 18:35:47 +02:00
Erik Grinaker
4b06b547c1 pageserver/client_grpc: add shard map updates 2025-07-06 13:27:17 +02:00
Heikki Linnakangas
74e0d85a04 fix: Don't lose track of in-progress request if query is cancelled 2025-07-06 13:04:03 +03:00
Erik Grinaker
23ba42446b Fix accidental 1ms sleeps for GetPages 2025-07-06 11:09:58 +02:00
Heikki Linnakangas
71a83daac2 Revert crate dependencies to the versions in main branch
Some tests were failing with "Only request bodies with a known size
can be checksum validated." erros. This is a known issue with more
recent aws client versions, see
https://github.com/neondatabase/neon/issues/11363.
2025-07-05 18:03:19 +03:00
Heikki Linnakangas
1b8355a9f9 put back option lost in merge 2025-07-05 17:36:27 +03:00
Heikki Linnakangas
e14bb4be39 Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-05 16:59:51 +03:00
Heikki Linnakangas
f3a6c0d8ff cargo fmt 2025-07-05 16:26:24 +03:00
Heikki Linnakangas
17ec37aab2 Close gRPC getpage streams on shutdown
Some tests were failing, because pageserver didn't shut down promptly.
Tonic server graceful shutdown was a little too graceful; any open
streams linger until they're closed. Check the cancellation token
while waiting for next request, and close the stream if
shutdown/cancellation was requested.
2025-07-05 16:26:24 +03:00
Heikki Linnakangas
d6ec1f1a1c Skip legacy LFC initialization when communicator is used
It clashes with the initialization of the LFC file
2025-07-05 16:26:24 +03:00
Erik Grinaker
6f3fb4433f Add TODO 2025-07-05 14:15:34 +02:00
Erik Grinaker
d7678df445 Reap idle pool resources 2025-07-05 13:35:28 +02:00
Erik Grinaker
03d9f0ec41 Comment tweaks 2025-07-05 11:16:40 +02:00
Erik Grinaker
56845f2da2 Add GetPageClass::is_bulk 2025-07-05 11:15:28 +02:00
Heikki Linnakangas
b568189f7b Build dummy libcommunicator into the 'neon' extension (#12266)
This doesn't do anything interesting yet, but demonstrates linking Rust
code to the neon Postgres extension, so that we can review and test
drive just the build process changes independently.
2025-07-04 23:27:28 +00:00
Heikki Linnakangas
9a37bfdf63 Fix re-finding an entry in bucket chain 2025-07-05 00:44:46 +03:00
Arpad Müller
b94a5ce119 Don't await the walreceiver on timeline shutdown (#12402)
Mostly a revert of https://github.com/neondatabase/neon/pull/11851 and
https://github.com/neondatabase/neon/pull/12330 .

Christian suggested reverting his PR to fix the issue
https://github.com/neondatabase/neon/issues/12369 .

Alternatives considered:

1. I have originally wanted to introduce cancellation tokens to
`RequestContext`, but in the end I gave up on them because I didn't find
a select-free way of preventing
`test_layer_download_cancelled_by_config_location` from hanging.

Namely if I put a select around the `get_or_maybe_download` invocation
in `get_values_reconstruct_data`, it wouldn't hang, but if I put it
around the `download_init_and_wait` invocation in
`get_or_maybe_download`, the test would still hang. Not sure why, even
though I made the attached child function of the `RequestContext` create
a child token.

2. Introduction of a `download_cancel` cancellation token as a child of
a timeline token, putting it into `RemoteTimelineClient` together with
the main token, and then putting it into the whole
`RemoteTimelineClient` read path.

3. Greater refactorings, like to make cancellation tokens follow a DAG
structure so you can have tokens cancelled either by say timeline
shutting down or a request ending. It doesn't just represent an effort
that we don't have the engineering budget for, it also causes
interesting questions like what to do about batching (do you cancel the
entire request if only some requests get cancelled?).

We might see a reemergence of
https://github.com/neondatabase/neon/issues/11762, but given that we
have https://github.com/neondatabase/neon/pull/11853 and
https://github.com/neondatabase/neon/pull/12376 now, it is possible that
it will not come back. Looking at some code, it might actually fix the
locations where the error pops up. Let's see.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2025-07-04 20:12:10 +00:00
Heikki Linnakangas
4c916552e8 Reduce logging noise
These are very useful while debugging, but also very noisy; let's dial
it down a little.
2025-07-04 23:11:36 +03:00
Heikki Linnakangas
50fbf4ac53 Fix hash table initialization across forked processes
attach_writer()/reader() are called from each forked process. It's too
late to do initialization there, in fact we used to overwrite the
contents of the hash table (or at least the freelist?) every time a
new process attached to it. The initialization must be done earlier,
in the HashMapInit() constructors.
2025-07-04 23:08:34 +03:00
Erik Grinaker
cb698a3951 Add dedicated client pools for bulk requests 2025-07-04 21:52:25 +02:00
Mikhail
7ed4530618 offload_lfc_interval_seconds in ComputeSpec (#12447)
- Add ComputeSpec flag `offload_lfc_interval_seconds` controlling
  whether LFC should be offloaded to endpoint storage. Default value
  (None) means "don't offload".
- Add glue code around it for `neon_local` and integration tests.
- Add `autoprewarm` mode for `test_lfc_prewarm` testing
  `offload_lfc_interval_seconds` and `autoprewarm` flags in conjunction.
- Rename `compute_ctl_lfc_prewarm_requests_total` and
`compute_ctl_lfc_offload_requests_total` to
`compute_ctl_lfc_prewarms_total`
  and `compute_ctl_lfc_offloads_total` to reflect we count prewarms and
  offloads, not `compute_ctl` requests of those.
  Don't count request in metrics if there is a prewarm/offload already
  ongoing.

https://github.com/neondatabase/cloud/issues/19011
Resolves: https://github.com/neondatabase/cloud/issues/30770
2025-07-04 18:49:57 +00:00
Erik Grinaker
f6cc5cbd0c Split out retry handler to separate module 2025-07-04 20:20:09 +02:00
Heikki Linnakangas
00affada26 Add request ID to all communicator log lines as context information 2025-07-04 20:34:26 +03:00
Heikki Linnakangas
90d3c09c24 Minor cleanup
Tidy up and add some comments. Rename a few things for clarity.
2025-07-04 20:32:59 +03:00
Heikki Linnakangas
6c398aeae7 Fix dependency in Makefile 2025-07-04 20:24:21 +03:00
Heikki Linnakangas
3a44774227 impr(ci): Simplify build-macos workflow, prepare for rust communicator (#12357)
Don't build walproposer-lib as a separate job. It only takes a few
seconds, after you have built all its dependencies.

Don't cache the Neon Pg extensions in the per-postgres-version caches.
This is in preparation for the communicator project, which will
introduce Rust parts to the Neon Pg extension, which complicates the
build process. With that, the 'make neon-pg-ext' step requires some of
the Rust bits to be built already, or it will build them on the spot,
which in turn requires all the Rust sources to be present, and we don't
want to repeat that part for each Postgres version anyway. To prepare
for that, rely on "make all" to build the neon extension and the rust
bits in the correct order instead. Building the neon extension doesn't
currently take very long anyway after you have built Postgres itself, so
you don't gain much by caching it. See
https://github.com/neondatabase/neon/pull/12266.

Add an explicit "rustup update" step to update the toolchain. It's not
strictly necessary right now, because currently "make all" will only
invoke "cargo build" once and the race condition described in the
comment doesn't happen. But prepare for the future.

To further simplify the build, get rid of the separate 'build-postgres'
jobs too, and just build Postgres as a step in the main job. That makes
the overall workflow run longer, because we no longer build all the
postgres versions in parallel (although you still get intra-runner
parallelism thanks to `make -j`), but that's acceptable. In the
cache-hit case, it might even be a little faster because there is less
overhead from launching jobs, and in the cache-miss case, it's maybe
5-10 minutes slower altogether.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2025-07-04 15:34:58 +00:00
Heikki Linnakangas
1856bbbb9f Minor cleanup and commenting 2025-07-04 18:28:34 +03:00
Aleksandr Sarantsev
b2705cfee6 storcon: Make node deletion process cancellable (#12320)
## Problem

The current deletion operation is synchronous and blocking, which is
unsuitable for potentially long-running tasks like. In such cases, the
standard HTTP request-response pattern is not a good fit.

## Summary of Changes

- Added new `storcon_cli` commands: `NodeStartDelete` and
`NodeCancelDelete` to initiate and cancel deletion asynchronously.
- Added corresponding `storcon` HTTP handlers to support the new
start/cancel deletion flow.
- Introduced a new type of background operation: `Delete`, to track and
manage the deletion process outside the request lifecycle.

---------

Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
2025-07-04 14:08:09 +00:00
Heikki Linnakangas
bd46dd60a0 Add a temporary timeout to handling an IO request in the communicator
It's nicer to timeout in the communicator and return an error to the
backend, than PANIC the backend.
2025-07-04 16:08:22 +03:00
Heikki Linnakangas
5f2d476a58 Add request ID to io-in-progress locking table, to ease debugging
I also added INFO messages for when a backend blocks on the
io-in-progress lock. It's probably too noisy for production, but
useful now to get a picture of how much it happens.
2025-07-04 15:55:57 +03:00
Heikki Linnakangas
3231cb6138 Await the io-in-progress locking futures
Otherwise they don't do anything. Oops.
2025-07-04 15:55:57 +03:00
Heikki Linnakangas
e558e0da5c Assign request_id earlier, in the originating backend
Makes it more useful for stitching together logs etc. for a specific
request.
2025-07-04 15:55:55 +03:00
Heikki Linnakangas
70bf2e088d Request multiple block numbers in a single GetPageV request
That's how it was always intended to be used
2025-07-04 15:49:04 +03:00
Trung Dinh
225267b3ae Make disk eviction run by default (#12464)
## Problem

## Summary of changes
Provide a sane set of default values for disk_usage_based_eviction.

Closes https://github.com/neondatabase/neon/issues/12301.
2025-07-04 12:06:10 +00:00
Vlad Lazar
d378726e38 pageserver: reset the broker subscription if it's been idle for a while (#12436)
## Problem

I suspect that the pageservers get stuck on receiving broker updates.

## Summary of changes

This is a an opportunistic (staging only) patch that resets the
susbscription
stream if it's been idle for a while. This won't go to prod in this
form.
I'll revert or update it before Friday.
2025-07-04 10:25:03 +00:00
Konstantin Knizhnik
436a117c15 Do not allocate anything in subtransaction memory context (#12176)
## Problem

See https://github.com/neondatabase/neon/issues/12173

## Summary of changes

Allocate table in TopTransactionMemoryContext

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-07-04 10:24:39 +00:00
Heikki Linnakangas
da3f9ee72d cargo fmt 2025-07-04 12:39:41 +03:00
Alex Chi Z.
cc699f6f85 fix(pageserver): do not log no-route-to-host errors (#12468)
## Problem

close https://github.com/neondatabase/neon/issues/12344

## Summary of changes

Add `HostUnreachable` and `NetworkUnreachable` to expected I/O error.
This was new in Rust 1.83.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-03 21:57:42 +00:00
Erik Grinaker
88d1127bf4 Tweak GetPageSplitter 2025-07-03 21:12:26 +02:00
David Freifeld
794bb7a9e8 Merge branch 'quantumish/comm-lfc-integration' into communicator-rewrite 2025-07-03 10:52:29 -07:00
Konstantin Knizhnik
495112ca50 Add GUC for dynamically enable compare local mode (#12424)
## Problem

DEBUG_LOCAL_COMPARE mode allows to detect data corruption.
But it requires rebuild of neon extension (and so requires special
image) and significantly slowdown execution because always fetch pages
from page server.

## Summary of changes

Introduce new GUC `neon.debug_compare_local`, accepting the following
values: " none", "prefetch", "lfc", "all" (by default it is definitely
disabled).
In mode less than "all", neon SMGR will not fetch page from PS if it is
found in local caches.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-07-03 17:37:05 +00:00
Suhas Thalanki
46158ee63f fix(compute): background installed extensions worker would collect data without waiting for interval (#12465)
## Problem

The background installed extensions worker relied on `interval.tick()`
to go to sleep for a period of time. This can lead to bugs due to the
interval being updated at the end of the loop as the first tick is
[instantaneous](https://docs.rs/tokio/latest/tokio/time/struct.Interval.html#method.tick).

## Summary of changes

Changed it to a `tokio::time::sleep` to prevent this issue. Now it puts
the thread to sleep and only wakes up after the specified duration
2025-07-03 17:10:30 +00:00
Alex Chi Z.
305fe61ac1 fix(pageserver): also print open layer size in backpressure (#12440)
## Problem

Better investigate memory usage during backpressure

## Summary of changes

Print open layer size if backpressure is activated

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-03 16:37:11 +00:00
Vlad Lazar
f95fdf5b44 pageserver: fix duplicate tombstones in ancestor detach (#12460)
## Problem

Ancestor detach from a previously detached parent when there were no
writes panics since it tries to upload the tombstone layer twice.

## Summary of Changes

If we're gonna copy the tombstone from the ancestor, don't bother
creating it.

Fixes https://github.com/neondatabase/neon/issues/12458
2025-07-03 16:35:46 +00:00
Erik Grinaker
42e4e5a418 Add GetPage request splitting 2025-07-03 18:31:12 +02:00
Arpad Müller
a852bc5e39 Add new activating scheduling policy for safekeepers (#12441)
When deploying new safekeepers, we don't immediately want to send
traffic to them. Maybe they are not ready yet by the time the deploy
script is registering them with the storage controller.

For pageservers, the storcon solves the problem by not scheduling stuff
to them unless there has been a positive heartbeat response. We can't do
the same for safekeepers though, otherwise a single down safekeeper
would mean we can't create new timelines in smaller regions where there
is only three safekeepers in total.

So far we have created safekeepers as `pause` but this adds a manual
step to safekeeper deployment which is prone to oversight. We want
things to be automatted. So we introduce a new state `activating` that
acts just like `pause`, except that we automatically transition the
policy to `active` once we get a positive heartbeat from the safekeeper.
For `pause`, we always keep the safekeeper paused.
2025-07-03 16:27:43 +00:00
Aleksandr Sarantsev
b96983a31c storcon: Ignore keep-failing reconciles (#12391)
## Problem

Currently, if `storcon` (storage controller) reconciliations repeatedly
fail, the system will indefinitely freeze optimizations. This can result
in optimization starvation for several days until the reconciliation
issues are manually resolved. To mitigate this, we should detect
persistently failing reconciliations and exclude them from influencing
the optimization decision.

## Summary of Changes

- A tenant shard reconciliation is now considered "keep-failing" if it
fails 5 consecutive times. These failures are excluded from the
optimization readiness check.
- Added a new metric: `storage_controller_keep_failing_reconciles` to
monitor such cases.
- Added a warning log message when a reconciliation is marked as
"keep-failing".

---------

Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
2025-07-03 16:21:36 +00:00
Heikki Linnakangas
96a817fa2b Fix the case that storage auth token is _not_ used
I broke that in previous commit while fixing the case of using a token.
2025-07-03 18:39:06 +03:00
Heikki Linnakangas
e7b057f2e8 Fix passing storage JWT token to the communicator process
Makes the 'test_compute_auth_to_pageserver' test pass
2025-07-03 18:14:22 +03:00
Dmitrii Kovalkov
3ed28661b1 storcon: remote feature testing safekeeper quorum checks (#12459)
## Problem
Previous PR didn't fix the creation of timeline in neon_local with <3
safekeepers because there is one more check down the stack.

- Closes: https://github.com/neondatabase/neon/issues/12298
- Follow up on https://github.com/neondatabase/neon/pull/12378

## Summary of changes
- Remove feature `testing` safekeeper quorum checks from storcon

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2025-07-03 15:02:30 +00:00
Conrad Ludgate
03e604e432 Nightly lints and small tweaks (#12456)
Let chains available in 1.88 :D new clippy lints coming up in future
releases.
2025-07-03 14:47:12 +00:00
HaoyuHuang
4db934407a SK changes #1 (#12448)
## TLDR
This PR is a no-op. The changes are disabled by default. 

## Problem
I. Currently we don't have a way to detect disk I/O failures from WAL
operations.

II.
We observe that the offloader fails to upload a segment due to race
conditions on XLOG SWITCH and PG start streaming WALs. wal_backup task
continously failing to upload a full segment while the segment remains
partial on the disk.

The consequence is that commit_lsn for all SKs move forward but
backup_lsn stays the same. Then, all SKs run out of disk space.

III.
We have discovered SK bugs where the WAL offload owner cannot keep up
with WAL backup/upload to S3, which results in an unbounded accumulation
of WAL segment files on the Safekeeper's disk until the disk becomes
full. This is a somewhat dangerous operation that is hard to recover
from because the Safekeeper cannot write its control files when it is
out of disk space. There are actually 2 problems here:

1. A single problematic timeline can take over the entire disk for the
SK
2. Once out of disk, it's difficult to recover SK


IV. 
Neon reports certain storage errors as "critical" errors using a marco,
which will increment a counter/metric that can be used to raise alerts.
However, this metric isn't sliced by tenant and/or timeline today. We
need the tenant/timeline dimension to better respond to incidents and
for blast radius analysis.

## Summary of changes
I. 
The PR adds a `safekeeper_wal_disk_io_errors ` which is incremented when
SK fails to create or flush WALs.

II. 
To mitigate this issue, we will re-elect a new offloader if the current
offloader is lagging behind too much.
Each SK makes the decision locally but they are aware of each other's
commit and backup lsns.

The new algorithm is
- determine_offloader will pick a SK. say SK-1.
- Each SK checks
-- if commit_lsn - back_lsn > threshold,
-- -- remove SK-1 from the candidate and call determine_offloader again.

SK-1 will step down and all SKs will elect the same leader again.
After the backup is caught up, the leader will become SK-1 again.

This also helps when SK-1 is slow to backup. 

I'll set the reelect backup lag to 4 GB later. Setting to 128 MB in dev
to trigger the code more frequently.

III. 
This change addresses problem no. 1 by having the Safekeeper perform a
timeline disk utilization check check when processing WAL proposal
messages from Postgres/compute. The Safekeeper now rejects the WAL
proposal message, effectively stops writing more WAL for the timeline to
disk, if the existing WAL files for the timeline on the SK disk exceeds
a certain size (the default threshold is 100GB). The disk utilization is
calculated based on a `last_removed_segno` variable tracked by the
background task removing WAL files, which produces an accurate and
conservative estimate (>= than actual disk usage) of the actual disk
usage.


IV.
* Add a new metric `hadron_critical_storage_event_count` that has the
`tenant_shard_id` and `timeline_id` as dimensions.
* Modified the `crtitical!` marco to include tenant_id and timeline_id
as additional arguments and adapted existing call sites to populate the
tenant shard and timeline ID fields. The `critical!` marco invocation
now increments the `hadron_critical_storage_event_count` with the extra
dimensions. (In SK there isn't the notion of a tenant-shard, so just the
tenant ID is recorded in lieu of tenant shard ID.)

I considered adding a separate marco to avoid merge conflicts, but I
think in this case (detecting critical errors) conflicts are probably
more desirable so that we can be aware whenever Neon adds another
`critical!` invocation in their code.

---------

Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-03 14:32:53 +00:00
Heikki Linnakangas
956c2f4378 cargo fmt 2025-07-03 16:16:42 +03:00
Heikki Linnakangas
3293e4685e Fix cases where pageserver gets stuck waiting for LSN
The compute might make a request with an LSN that it hasn't even
flushed yet.
2025-07-03 16:14:45 +03:00
Erik Grinaker
6f8650782f Client tweaks 2025-07-03 14:54:23 +02:00
Erik Grinaker
14214eb853 Add client shard routing 2025-07-03 14:42:35 +02:00
Erik Grinaker
d4b4724921 Sanity-check Pageserver URLs 2025-07-03 14:18:14 +02:00
Erik Grinaker
9aba9550dd Instrument client methods 2025-07-03 14:11:53 +02:00
Erik Grinaker
375e8e5592 Improve retries and logging 2025-07-03 14:02:43 +02:00
Ruslan Talpa
95e1011cd6 subzero pre-integration refactor (#12416)
## Problem
integrating subzero requires a bit of refactoring. To make the
integration PR a bit more manageable, the refactoring is done in this
separate PR.
 
## Summary of changes
* move common types/functions used in sql_over_http to errors.rs and
http_util.rs
* add the "Local" auth backend to proxy (similar to local_proxy), useful
in local testing
* change the Connect and Send type for the http client to allow for
custom body when making post requests to local_proxy from the proxy

---------

Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com>
2025-07-03 11:04:08 +00:00
Conrad Ludgate
1bc1eae5e8 fix redis credentials check (#12455)
## Problem

`keep_connection` does not exit, so it was never setting
`credentials_refreshed`.

## Summary of changes

Set `credentials_refreshed` to true when we first establish a
connection, and after we re-authenticate the connection.
2025-07-03 09:51:35 +00:00
Erik Grinaker
52c586f678 Restructure shard management 2025-07-03 11:51:19 +02:00
Matthias van de Meent
e12d4f356a Work around Clap's incorrect usage of Display for default_value_t (#12454)
## Problem

#12450 

## Summary of changes

Instead of `#[arg(default_value_t = typed_default_value)]`, we use
`#[arg(default_value = "str that deserializes into the value")]`,
because apparently you can't convince clap to _not_ deserialize from the
Display implementation of an imported enum.
2025-07-03 09:41:09 +00:00
Erik Grinaker
de97b73d6e Lint fixes 2025-07-03 10:38:14 +02:00
Folke Behrens
3415b90e88 proxy/logging: Add "ep" and "query_id" to list of extracted fields (#12437)
Extract two more interesting fields from spans: ep (endpoint) and
query_id.
Useful for reliable filtering in logging.
2025-07-03 08:09:10 +00:00
Conrad Ludgate
e01c8f238c [proxy] update noisy error logging (#12438)
Health checks for pg-sni-router open a TCP connection and immediately
close it again. This is noisy. We will filter out any EOF errors on the
first message.

"acquired permit" debug log is incorrect since it logs when we timedout
as well. This fixes the debug log.
2025-07-03 07:46:48 +00:00
Conrad Ludgate
45607cbe0c [local_proxy]: ignore TLS for endpoint (#12316)
## Problem

When local proxy is configured with TLS, the certificate does not match
the endpoint string. This currently returns an error.

## Summary of changes

I don't think this code is necessary anymore, taking the prefix from the
hostname is good enough (and is equivalent to what `endpoint_sni` was
doing) and we ignore checking the domain suffix.
2025-07-03 07:35:57 +00:00
Heikki Linnakangas
d8556616c9 Fix running Postgres in "vanilla mode", without neon storage
Some tests do that
2025-07-03 00:32:40 +03:00
Heikki Linnakangas
d8296e60e6 Fix caching of newly extended pages
This fixes read errors e.g. in test_compute_catalog.py test (and
probably many others).
2025-07-02 23:21:42 +03:00
Heikki Linnakangas
7263d6e2e5 Clarify error message if not_modified_lsn > request_lsn
I'm seeing this error from some python tests. Which means there's a
bug in the compute side of course, but it took me a while to figure
that out.
2025-07-02 23:21:42 +03:00
David Freifeld
b018b0ff60 Fix cargo clippy lints for neon-shmem 2025-07-02 13:01:26 -07:00
Tristan Partin
8b4fbefc29 Patch pgaudit to disable logging in parallel workers (#12325)
We want to turn logging in parallel workers off to reduce log
amplification in queries which use parallel workers.

Part-of: https://github.com/neondatabase/cloud/issues/28483

Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
2025-07-02 19:54:47 +00:00
Alex Chi Z.
a9a51c038b rfc: storage feature flags (#11805)
## Problem

Part of https://github.com/neondatabase/neon/issues/11813

## Summary of changes

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-02 17:41:36 +00:00
Alexey Kondratov
44121cc175 docs(compute): RFC for compute rolling restart with prewarm (#11294)
## Problem

Neon currently implements several features that guarantee high uptime of
compute nodes:

1. Storage high-availability (HA), i.e. each tenant shard has a
secondary pageserver location, so we can quickly switch over compute to
it in case of primary pageserver failure.
2. Fast compute provisioning, i.e. we have a fleet of pre-created empty
computes, that are ready to serve workload, so restarting unresponsive
compute is very fast.
3. Preemptive NeonVM compute provisioning in case of k8s node
unavailability.

This helps us to be well-within the uptime SLO of 99.95% most of the
time. Problems begin when we go up to multi-TB workloads and 32-64 CU
computes. During restart, compute looses all caches: LFC, shared
buffers, file system cache. Depending on the workload, it can take a lot
of time to warm up the caches, so that performance could be degraded and
might be even unacceptable for certain workloads. The latter means that
although current approach works well for small to
medium workloads, we still have to do some additional work to avoid
performance degradation after restart of large instances.

[Rendered
version](https://github.com/neondatabase/neon/blob/alexk/pg-prewarm-rfc/docs/rfcs/2025-03-17-compute-prewarm.md)

Part of https://github.com/neondatabase/cloud/issues/19011
2025-07-02 17:16:00 +00:00
Dmitry Savelev
0429a0db16 Switch the billing metrics storage format to ndjson. (#12427)
## Problem
The billing team wants to change the billing events pipeline and use a
common events format in S3 buckets across different event producers.

## Summary of changes
Change the events storage format for billing events from JSON to NDJSON.
Also partition files by hours, rather than days.

Resolves: https://github.com/neondatabase/cloud/issues/29995
2025-07-02 16:30:47 +00:00
Erik Grinaker
12dade35fa Comment tweaks 2025-07-02 14:47:27 +02:00
Conrad Ludgate
d6beb3ffbb [proxy] rewrite pg-text to json routines (#12413)
We would like to move towards an arena system for JSON encoding the
responses. This change pushes an "out" parameter into the pg-test to
json routines to make swapping in an arena system easier in the future.
(see #11992)

This additionally removes the redundant `column: &[Type]` argument, as
well as rewriting the pg_array parser.

---

I rewrote the pg_array parser since while making these changes I found
it hard to reason about. I went back to the specification and rewrote it
from scratch. There's 4 separate routines:
1. pg_array_parse - checks for any prelude (multidimensional array
ranges)
2. pg_array_parse_inner - only deals with the arrays themselves
3. pg_array_parse_item - parses a single item from the array, this might
be quoted, unquoted, or another nested array.
4. pg_array_parse_quoted - parses a quoted string, following the
relevant string escaping rules.
2025-07-02 12:46:11 +00:00
Erik Grinaker
1ec63bd6bc Misc pool improvements 2025-07-02 14:42:06 +02:00
Heikki Linnakangas
7012b4aa90 Remove --grpc options from neon_local endpoint reconfigure and start calls
They don't exist in neon_local anymore, and aren't actually used in
tests either.
2025-07-02 15:10:18 +03:00
Arpad Müller
efd7e52812 Don't error if timeline offload is already in progress (#12428)
Don't print errors like:
```
Compaction failed 1 times, retrying in 2s: Failed to offload timeline: Unexpected offload error: Timeline deletion is already in progress
```

Print it at info log level instead.

https://github.com/neondatabase/cloud/issues/30666
2025-07-02 12:06:55 +00:00
Heikki Linnakangas
2cc28c75be Fix "ERROR: could not read size of rel ..." in many regression tests.
We were incorrectly skipping the call to communicator_new_rel_create(),
which resulted in an error during index build, when the btree build code
tried to check the size of the newly-created relation.
2025-07-02 14:10:11 +03:00
Erik Grinaker
bf01145ae4 Remove some old code 2025-07-02 11:46:54 +02:00
Erik Grinaker
8ab8fc11a3 Use new PageserverClient 2025-07-02 11:27:56 +02:00
Erik Grinaker
6f0af96a54 Add new PageserverClient 2025-07-02 10:59:40 +02:00
Ivan Efremov
0f879a2e8f [proxy]: Fix redis IRSA expiration failure errors (#12430)
Relates to the
[#30688](https://github.com/neondatabase/cloud/issues/30688)
2025-07-02 08:55:44 +00:00
Dmitrii Kovalkov
8e7ce42229 tests: start primary compute on not-readonly branches (#12408)
## Problem

https://github.com/neondatabase/neon/pull/11712 changed how computes are
started in the test: the lsn is specified, making them read-only static
replicas. Lsn is `last_record_lsn` from pageserver. It works fine with
read-only branches (because their `last_record_lsn` is equal to
`start_lsn` and always valid). But with writable timelines, the
`last_record_lsn` on the pageserver might be stale.

Particularly in this test, after the `detach_branch` operation, the
tenant is reset on the pagesever. It leads to `last_record_lsn` going
back to `disk_consistent_lsn`, so basically rolling back some recent
writes.

If we start a primary compute, it will start at safekeepers' commit Lsn,
which is the correct one , and will wait till pageserver catches up with
this Lsn after reset.

- Closes: https://github.com/neondatabase/neon/issues/12365

## Summary of changes
- Start `primary` compute for writable timelines.
2025-07-02 05:41:17 +00:00
Heikki Linnakangas
9913d2668a print retried pageserver requests to log
Not sure how verbose we want this to be in production, but for now,
more is better.

This shows that many tests are failing with errors like these:

    PG:2025-07-01 23:02:34.311 GMT [1456523] LOG:  [COMMUNICATOR] send_process_get_rel_size_request: got error status: NotFound, message: "Read error", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 01 Jul 2025 23:02:34 GMT"} }, retrying​

I haven't debugged why that is yet. Did the compute make a bogus request?
2025-07-02 02:04:04 +03:00
Heikki Linnakangas
2fefece77d temporary hack to make regression tests fail faster 2025-07-02 01:42:39 +03:00
Heikki Linnakangas
471191e64e Fix updating relsize cache during WAL replay
This makes some of the test_runner/regress/test_hot_standby.py tests
pass, (Others are still failing..)
2025-07-01 21:22:04 +03:00
Alex Chi Z.
5ec8881c0b feat(pageserver): resolve feature flag based on remote size (#12400)
## Problem

Part of #11813 

## Summary of changes

* Compute tenant remote size in the housekeeping loop.
* Add a new `TenantFeatureResolver` struct to cache the tenant-specific
properties.
* Evaluate feature flag based on the remote size.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-01 18:11:24 +00:00
Alex Chi Z.
b254dce8a1 feat(pageserver): report compaction progress (#12401)
## Problem

close https://github.com/neondatabase/neon/issues/11528

## Summary of changes

Gives us better observability of compaction progress.

- Image creation: num of partition processed / total partition
- Gc-compaction: index of the in the queue / total items for a full
compaction
- Shard ancestor compaction: layers to rewrite / total layers

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-01 17:00:27 +00:00
Alex Chi Z.
3815e3b2b5 feat(pageserver): reduce lock contention in l0 compaction (#12360)
## Problem

L0 compaction currently holds the read lock for a long region while it
doesn't need to.

## Summary of changes

This patch reduces the one long contention region into 2 short ones:
gather the layers to compact at the beginning, and several short read
locks when querying the image coverage.

Co-Authored-By: Chen Luo

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-01 16:58:41 +00:00
Suhas Thalanki
bbcd70eab3 Dynamic Masking Support for anon v2 (#11733)
## Problem

This PR works on adding dynamic masking support for `anon` v2. It
currently only supports static masking.

## Summary of changes

Added a security definer function that sets the dynamic masking guc to
`true` with superuser permissions.
Added a security definer function that adds `anon` to
`session_preload_libraries` if it's not already present.

Related to: https://github.com/neondatabase/cloud/issues/20456
2025-07-01 16:50:27 +00:00
Erik Grinaker
f6761760a2 Documentation and tweaks 2025-07-01 17:54:41 +02:00
Erik Grinaker
0bce818d5e Add stream pool 2025-07-01 17:54:41 +02:00
Erik Grinaker
48be1da6ef Add initial client pool 2025-07-01 17:54:41 +02:00
Erik Grinaker
d2efc80e40 Add initial ChannelPool 2025-07-01 17:54:41 +02:00
Erik Grinaker
958c2577f5 pageserver: tighten up page_api::Client 2025-07-01 17:54:41 +02:00
Suhas Thalanki
0934ce9bce compute: metrics for autovacuum (mxid, postgres) (#12294)
## Problem

Currently we do not have metrics for autovacuum.

## Summary of changes

Added a metric that extracts the top 5 DBs with oldest mxid and frozen
xid. Tables that were vacuumed recently should have younger value (or
younger age).

Related Issue: https://github.com/neondatabase/cloud/issues/27296
2025-07-01 15:33:23 +00:00
Heikki Linnakangas
175c2e11e3 Add assertions that the legacy relsize cache is not used with new communicator
And fix a few cases where it was being called
2025-07-01 16:44:25 +03:00
Heikki Linnakangas
efdb07e7b6 Implement function to check if page is in local cache
This is needed for read replicas. There's one more TODO that needs to
implemented before read replicas work though, in
neon_extend_rel_size()
2025-07-01 16:22:51 +03:00
Conrad Ludgate
4932963bac [proxy]: dont log user errors from postgres (#12412)
## Problem

#8843 

User initiated sql queries are being classified as "postgres" errors,
whereas they're really user errors.

## Summary of changes

Classify user-initiated postgres errors as user errors if they are
related to a sql query that we ran on their behalf. Do not log those
errors.
2025-07-01 13:03:34 +00:00
Lassi Pölönen
6d73cfa608 Support audit syslog over TLS (#12124)
Add support to transport syslogs over TLS. Since TLS params essentially
require passing host and port separately, add a boolean flag to the
configuration template and also use the same `action` format for
plaintext logs. This allows seamless transition.

The plaintext host:port is picked from `AUDIT_LOGGING_ENDPOINT` (as
earlier) and from `AUDIT_LOGGING_TLS_ENDPOINT`. The TLS host:port is
used when defined and non-empty.

`remote_endpoint` is split separately to hostname and port as required
by `omfwd` module.

Also the address parsing and config content generation are split to more
testable functions with basic tests added.
2025-07-01 12:53:46 +00:00
Heikki Linnakangas
b0970b415c Don't call legacy lfc function when new communicator is used 2025-07-01 15:47:26 +03:00
Dmitrii Kovalkov
d2d9946bab tests: override safekeeper ports in storcon DB (#12410)
## Problem
We persist safekeeper host/port in the storcon DB after
https://github.com/neondatabase/neon/pull/11712, so the storcon fails to
ping safekeepers in the compatibility tests, where we start the cluster
from the snapshot.

PR also adds some small code improvements related to the test failure.

- Closes: https://github.com/neondatabase/neon/issues/12339

## Summary of changes
- Update safekeeper ports in the storcon DB when starting the neon from
the dir (snapshot)
- Fail the response on all not-success codes (e.g. 3xx). Should not
happen, but just to be more safe.
- Add `neon_previous/` to .gitignore to make it easier to run compat
tests.
- Add missing EXPORT to the instruction for running compat tests
2025-07-01 12:47:16 +00:00
Trung Dinh
daa402f35a pageserver: Make ImageLayerWriter sync, infallible and lazy (#12403)
## Problem

## Summary of changes
Make ImageLayerWriter sync, infallible and lazy.


Address https://github.com/neondatabase/neon/issues/12389.

All unit tests passed.
2025-07-01 09:53:11 +00:00
Suhas Thalanki
5f3532970e [compute] fix: background worker that collects installed extension metrics now updates collection interval (#12277)
## Problem

Previously, the background worker that collects the list of installed
extensions across DBs had a timeout set to 1 hour. This cause a problem
with computes that had a `suspend_timeout` > 1 hour as this collection
was treated as activity, preventing compute shutdown.

Issue: https://github.com/neondatabase/cloud/issues/30147

## Summary of changes

Passing the `suspend_timeout` as part of the `ComputeSpec` so that any
updates to this are taken into account by the background worker and
updates its collection interval.
2025-06-30 22:12:37 +00:00
Arpad Müller
2e681e0ef8 detach_ancestor: delete the right layer when hardlink fails (#12397)
If a hardlink operation inside `detach_ancestor` fails due to the layer
already existing, we delete the layer to make sure the source is one we
know about, and then retry.

But we deleted the wrong file, namely, the one we wanted to use as the
source of the hardlink. As a result, the follow up hard link operation
failed. Our PR corrects this mistake.
2025-06-30 21:36:15 +00:00
Heikki Linnakangas
7429dd711c fix the .metrics.socket filename in the ignore list 2025-06-30 23:41:09 +03:00
Heikki Linnakangas
88ac1e356b Ignore the metrics unix domain socket in tests 2025-06-30 23:39:01 +03:00
Erik Grinaker
c3cb1ab98d Merge branch 'main' into communicator-rewrite 2025-06-30 21:07:01 +02:00
Dmitrii Kovalkov
8e216a3a59 storcon: notify cplane on safekeeper membership change (#12390)
## Problem
We don't notify cplane about safekeeper membership change yet. Without
the notification the compute needs to know all the safekeepers on the
cluster to be able to speak to them. Change notifications will allow to
avoid it.

- Closes: https://github.com/neondatabase/neon/issues/12188

## Summary of changes
- Implement `notify_safekeepers` method in `ComputeHook`
- Notify cplane about safekeepers in `safekeeper_migrate` handler.
- Update the test to make sure notifications work.

## Out of scope
- There is `cplane_notified_generation` field in `timelines` table in
strocon's database. It's not needed now, so it's not updated in the PR.
Probably we can remove it.
- e2e tests to make sure it works with a production cplane
2025-06-30 14:09:50 +00:00
Erik Grinaker
81ac4ef43a Add a generic pool prototype 2025-06-30 14:49:34 +02:00
Erik Grinaker
d0a4ae3e8f pageserver: add gRPC LSN lease support (#12384)
## Problem

The gRPC API does not provide LSN leases.

## Summary of changes

* Add LSN lease support to the gRPC API.
* Use gRPC LSN leases for static computes with `grpc://` connstrings.
* Move `PageserverProtocol` into the `compute_api::spec` module and
reuse it.
2025-06-30 12:44:17 +00:00
Erik Grinaker
a384d7d501 pageserver: assert no changes to shard identity (#12379)
## Problem

Location config changes can currently result in changes to the shard
identity. Such changes will cause data corruption, as seen with #12217.

Resolves #12227.
Requires #12377.

## Summary of changes

Assert that the shard identity does not change on location config
updates and on (re)attach.

This is currently asserted with `critical!`, in case it misfires in
production. Later, we should reject such requests with an error and turn
this into a proper assertion.
2025-06-30 12:36:45 +00:00
Christian Schwarz
66f53d9d34 refactor(pageserver): force explicit mapping to CreateImageLayersError::Other (#12382)
Implicit mapping to an `anyhow::Error` when we do `?` is discouraged
because tooling to find those places isn't great.

As a drive-by, also make SplitImageLayerWriter::new infallible and sync.
I think we should also make ImageLayerWriter::new completely lazy,
then `BatchLayerWriter:new` infallible and async.
2025-06-30 11:03:48 +00:00
Erik Grinaker
a5b0fc560c Fix/allow remaining clippy lints 2025-06-30 12:36:20 +02:00
Busra Kugler
2af9380962 Revert "Replace step-security maintained actions" (#12386)
Reverts neondatabase/neon#11663 and
https://github.com/neondatabase/neon/pull/11265/

Step Security is not yet approved by Databricks team, in order to
prevent issues during Github org migration, I'll revert this PR to use
the previous action instead of Step Security maintained action.
2025-06-30 10:15:10 +00:00
Ivan Efremov
620d50432c Fix path issue in the proxy-bernch CI workflow (#12388) 2025-06-30 09:33:57 +00:00
Erik Grinaker
67b04f8ab3 Fix a bunch of linter warnings 2025-06-30 11:10:02 +02:00
Erik Grinaker
1d43f3bee8 pageserver: fix stripe size persistence in legacy HTTP handlers (#12377)
## Problem

Similarly to #12217, the following endpoints may result in a stripe size
mismatch between the storage controller and Pageserver if an unsharded
tenant has a different stripe size set than the default. This can lead
to data corruption if the tenant is later manually split without
specifying an explicit stripe size, since the storage controller and
Pageserver will apply different defaults. This commonly happens with
tenants that were created before the default stripe size was changed
from 32k to 2k.

* `PUT /v1/tenant/config`
* `PATCH /v1/tenant/config`

These endpoints are no longer in regular production use (they were used
when cplane still managed Pageserver directly), but can still be called
manually or by tests.

## Summary of changes

Retain the current shard parameters when updating the location config in
`PUT | PATCH /v1/tenant/config`.

Also opportunistically derive `Copy` for `ShardParameters`.
2025-06-30 09:08:44 +00:00
Dmitrii Kovalkov
c746678bbc storcon: implement safekeeper_migrate handler (#11849)
This PR implements a safekeeper migration algorithm from RFC-035


https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#change-algorithm

- Closes: https://github.com/neondatabase/neon/issues/11823

It is not production-ready yet, but I think it's good enough to commit
and start testing.

There are some known issues which will be addressed in later PRs:
- https://github.com/neondatabase/neon/issues/12186
- https://github.com/neondatabase/neon/issues/12187
- https://github.com/neondatabase/neon/issues/12188
- https://github.com/neondatabase/neon/issues/12189
- https://github.com/neondatabase/neon/issues/12190
- https://github.com/neondatabase/neon/issues/12191
- https://github.com/neondatabase/neon/issues/12192

## Summary of changes
- Implement `tenant_timeline_safekeeper_migrate` handler to drive the
migration
- Add possibility to specify number of safekeepers per timeline in tests
(`timeline_safekeeper_count`)
- Add `term` and `flush_lsn` to `TimelineMembershipSwitchResponse`
- Implement compare-and-swap (CAS) operation over timeline in DB for
updating membership configuration safely.
- Write simple test to verify that migration code works
2025-06-30 08:30:05 +00:00
Erik Grinaker
9d9e3cd08a Fix test_normal_work grpc param 2025-06-30 10:13:46 +02:00
Aleksandr Sarantsev
9bb4688c54 storcon: Remove testing feature from kick_secondary_downloads (#12383)
## Problem

Some of the design decisions in PR #12256 were influenced by the
requirements of consistency tests. These decisions introduced
intermediate logic that is no longer needed and should be cleaned up.

## Summary of Changes
- Remove the `feature("testing")` flag related to
`kick_secondary_download`.
- Set the default value of `kick_secondary_download` back to false,
reflecting the intended production behavior.

Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
2025-06-30 05:41:05 +00:00
Heikki Linnakangas
97a8f4ef85 Handle unexpected EOF while doing an LFC read more gracefully
There's a bug somewhere because this happens in python regression
tests. We need to hunt that down, but in any case, let's not get stuck
in an infinite loop if it happens.
2025-06-30 00:59:53 +03:00
Heikki Linnakangas
39f31957e3 Handle pageserver response with different number of pages gracefully
Some tests are hitting this case, where pageserver returns 0 page
images in the response to a GetPage request. I suspect it's because
the code doesn't handle sharding correclty? In any case, let's not
panic on it, but return an IO error to the originating backend.
2025-06-29 23:44:28 +03:00
Heikki Linnakangas
924c6a6fdf Fix handling the case that server closes the stream
- avoid panic by checking for Ok(None) response from
  tonic::Streaming::message() instead of just using unwrap()
- There was a race condition, if the caller sent the message, but the
  receiver task concurrently received Ok(None) indicating the stream
  was closed. (I didn't see that in action, but I think it could happen
  by reading the code)
2025-06-29 22:53:39 +03:00
Heikki Linnakangas
7020476bf5 Run cargo fmt 2025-06-29 22:53:09 +03:00
Heikki Linnakangas
80e948db93 Remove ununused mock factory
After reading the code a few times, I didn't quite understand what it
was, to be honest, or how it was going to be used. Remove it now to
reduce noise, but we can resurrect it from git history if we need it
in the future.
2025-06-29 22:52:48 +03:00
Heikki Linnakangas
bfb30d434c minor code tidy-up 2025-06-29 22:51:34 +03:00
Heikki Linnakangas
f3ba201800 Run cargo fmt 2025-06-29 21:21:07 +03:00
Heikki Linnakangas
8b7796cbfa wip 2025-06-29 21:20:48 +03:00
Heikki Linnakangas
fdc7e9c2a4 Extract repeated code to look up RequestTracker into a helper function 2025-06-29 21:20:14 +03:00
Heikki Linnakangas
a352d290eb Plumb through both libpq and grpc connection strings to the compute
Add a new 'pageserver_connection_info' field in the compute spec. It
replaces the old 'pageserver_connstring' field with a more complicated
struct that includes both libpq and grpc URLs, for each shard (or only
one of the the URLs, depending on the configuration). It also includes
a flag suggesting which one to use; compute_ctl now uses it to decide
which protocol to use for the basebackup.

This is compatible with everything that's in production, because the
control plane never used the 'pageserver_connstring' field. That was
added a long time ago with the idea that it would replace the code
that digs the 'neon.pageserver_connstring' GUC from the list of
Postgres settings, but we never got around to do that in the control
plane. Hence, it was only used with neon_local. But the plan now is to
pass the 'pageserver_connection_info' from the control plane, and once
that's fully deployed everywhere, the code to parse
'neon.pageserver_connstring' in compute_ctl can be removed.

The 'grpc' flag on an endpoint in endpoint config is now more of a
suggestion. Compute_ctl gets both URLs, so it can choose to use libpq
or grpc as it wishes. It currently always obeys the 'prefer_grpc' flag
that's part of the connection info though. Postgres however uses grpc
iff the new rust-based communicator is enabled.

TODO/plan for the control plane:

- Start to pass `pageserver_connection_info` in the spec file.
- Also keep the current `neon.pageserver_connstring` setting for now,
  for backwards compatibility with old computes

After that, the `pageserver_connection_info.prefer_grpc` flag in the
spec file can be used to control whether compute_ctl uses grpc or
libpq.  The actual compute's grpc usage will be controlled by the
`neon.enable_new_communicator` GUC. It can be set separately from
'prefer_grpc'.

Later:

- Once all old computes are gone, remove the code to pass
  `neon.pageserver_connstring`
2025-06-29 18:16:49 +03:00
Heikki Linnakangas
8c122a1c98 Don't call into the old LFC when using the new communicator
This fixes errors like `index "pg_class_relname_nsp_index" contains
unexpected zero page at block 2` when running the python tests

smgrzeroextend() still called into the old LFC's lfc_write() function,
even when using the new communicator, which zeroed some arbitrary
pages in the LFC file, overwriting pages managed by the new LFC
implementation managed by `integrated_cache.rs`
2025-06-29 17:40:46 +03:00
Dmitrii Kovalkov
47553dbaf9 neon_local: set timeline_safekeeper_count if we have less than 3 safekeepers (#12378)
## Problem
- Closes: https://github.com/neondatabase/neon/issues/12298

## Summary of changes
- Set `timeline_safekeeper_count` in `neon_local` if we have less than 3
safekeepers
- Remove `cfg!(feature = "testing")` code from
`safekeepers_for_new_timeline`
- Change `timeline_safekeeper_count` type to `usize`
2025-06-28 12:59:29 +00:00
Erik Grinaker
e50b914a8e compute_tools: support gRPC base backups in compute_ctl (#12244)
## Problem

`compute_ctl` should support gRPC base backups.

Requires #12111.
Requires #12243.
Touches #11926.

## Summary of changes

Support `grpc://` connstrings for `compute_ctl` base backups.
2025-06-27 16:39:00 +00:00
Christian Schwarz
e33e109403 fix(pageserver): buffered writer cancellation error handling (#12376)
## Problem

The problem has been well described in already-commited PR #11853.
tl;dr: BufferedWriter is sensitive to cancellation, which the previous
approach was not.

The write path was most affected (ingest & compaction), which was mostly
fixed in #11853:
it introduced `PutError` and mapped instances of `PutError` that were
due to cancellation of underlying buffered writer into
`CreateImageLayersError::Cancelled`.

However, there is a long tail of remaining errors that weren't caught by
#11853 that result in `CompactionError::Other`s, which we log with great
noise.

## Solution

The stack trace logging for CompactionError::Other added in #11853
allows us to chop away at that long tail using the following pattern:
- look at the stack trace
- from leaf up, identify the place where we incorrectly map from the
distinguished variant X indicating cancellation to an `anyhow::Error`
- follow that anyhow further up, ensuring it stays the same anyhow all
the way up in the `CompactionError::Other`
- since it stayed one anyhow chain all the way up, root_cause() will
yield us X
- so, in `log_compaction_error`, add an additional `downcast_ref` check
for X

This PR specifically adds checks for
- the flush task cancelling (FlushTaskError, BlobWriterError)
- opening of the layer writer (GateError)

That should cover all the reports in issues 
- https://github.com/neondatabase/cloud/issues/29434
- https://github.com/neondatabase/neon/issues/12162

## Refs
- follow-up to #11853
- fixup of / fixes https://github.com/neondatabase/neon/issues/11762
- fixes https://github.com/neondatabase/neon/issues/12162
- refs https://github.com/neondatabase/cloud/issues/29434
2025-06-27 15:26:00 +00:00
Folke Behrens
0ee15002fc proxy: Move client connection accept and handshake to pglb (#12380)
* This must be a no-op.
* Move proxy::task_main to pglb::task_main.
* Move client accept, TLS and handshake to pglb.
* Keep auth and wake in proxy.
2025-06-27 15:20:23 +00:00
Arpad Müller
4c7956fa56 Fix hang deleting offloaded timelines (#12366)
We don't have cancellation support for timeline deletions. In other
words, timeline deletion might still go on in an older generation while
we are attaching it in a newer generation already, because the
cancellation simply hasn't reached the deletion code.

This has caused us to hit a situation with offloaded timelines in which
the timeline was in an unrecoverable state: always returning an accepted
response, but never a 404 like it should be.

The detailed description can be found in
[here](https://github.com/neondatabase/cloud/issues/30406#issuecomment-3008667859)
(private repo link).

TLDR:

1. we ask to delete timeline on old pageserver/generation, starts
process in background
2. the storcon migrates the tenant to a different pageserver.
- during attach, the pageserver still finds an index part, so it adds it
to `offloaded_timelines`
4. the timeline deletion finishes, removing the index part in S3
5. there is a retry of the timeline deletion endpoint, sent to the new
pageserver location. it is bound to fail however:
- as the index part is gone, we print `Timeline already deleted in
remote storage`.
- the problem is that we then return an accepted response code, and not
a 404.
- this confuses the code calling us. it thinks the timeline is not
deleted, so keeps retrying.
- this state never gets recovered from until a reset/detach, because of
the `offloaded_timelines` entry staying there.

This is where this PR fixes things: if no index part can be found, we
can safely assume that the timeline is gone in S3 (it's the last thing
to be deleted), so we can remove it from `offloaded_timelines` and
trigger a reupload of the manifest. Subsequent retries will pick that
up.

Why not improve the cancellation support? It is a more disruptive code
change, that might have its own risks. So we don't do it for now.

Fixes https://github.com/neondatabase/cloud/issues/30406
2025-06-27 15:14:55 +00:00
Heikki Linnakangas
5a82182c48 impr(ci): Refactor postgres Makefile targets to a separate makefile (#12363)
Mainly for general readability. Some notable changes:

- Postgres can be built without the rest of the repository, and in
particular without any of the Rust bits. Some CI scripts took advantage
of that, so let's make that more explicit by separating those parts.
Also add an explicit comment about that in the new postgres.mk file.

- Add a new PG_INSTALL_CACHED variable. If it's set, `make all` and
other top-Makefile targets skip checking if Postgres is up-to-date. This
is also to be used in CI scripts that build and cache Postgres as
separate steps. (It is currently only used in the macos walproposer-lib
rule, but stay tuned for more.)

- Introduce a POSTGRES_VERSIONS variable that lists all supported
PostgreSQL versions. Refactor a few Makefile rules to use that.
2025-06-27 14:49:52 +00:00
Arpad Müller
37e181af8a Update rust to 1.88.0 (#12364)
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.

[Announcement blog
post](https://blog.rust-lang.org/2025/06/26/Rust-1.88.0/)

Prior update was in https://github.com/neondatabase/neon/pull/11938
2025-06-27 13:51:59 +00:00
Peter Bendel
6f4198c78a treat strategy flag test_maintenance as boolean data type (#12373)
## Problem

In large oltp test run
https://github.com/neondatabase/neon/actions/runs/15905488707/job/44859116742
we see that the `Benchmark database maintenance` step is skipped in all
3 strategy variants, however it should be executed in two.

This is due to treating the `test_maintenance` boolean type in the
strategy in the condition of the `Benchmark database maintenance` step

## Summary of changes
Use a boolean condition instead of a string comparison

## Test run from this pull request branch

https://github.com/neondatabase/neon/actions/runs/15923605412
2025-06-27 13:49:26 +00:00
Vlad Lazar
cc1664ef93 pageserver: allow flush task cancelled error in sharding autosplit test (#12374)
## Problem

Test is failing due to compaction shutdown noise (see
https://github.com/neondatabase/neon/issues/12162).

## Summary of changes

Allow list the noise.
2025-06-27 13:13:11 +00:00
Vlad Lazar
ebb6e26a64 pageserver: handle multiple attached children in shard resolution (#12336)
## Problem

When resolving a shard during a split we might have multiple attached
shards with the old shard count (i.e. not all of them are marked in
progress and ignored). Hence, we can compute the desired shard number
based on the old shard count and misroute the request.

## Summary of Changes

Recompute the desired shard every time the shard count changes during
the iteration
2025-06-27 12:46:18 +00:00
Mikhail
ebc12a388c fix: endpoint_storage_addr as String (#12359)
It's not a SocketAddr as we use k8s DNS
https://github.com/neondatabase/cloud/issues/19011
2025-06-27 11:06:27 +00:00
Conrad Ludgate
abc1efd5a6 [proxy] fix connect_to_compute retry handling (#12351)
# Problem

In #12335 I moved the `authenticate` method outside of the
`connect_to_compute` loop. This triggered [e2e tests to become
flaky](https://github.com/neondatabase/cloud/pull/30533). This
highlighted an edge case we forgot to consider with that change.

When we connect to compute, the compute IP might be cached. This cache
hit might however be stale. Because we can't validate the IP is
associated with a specific compute-id☨, we will succeed the
connect_to_compute operation and fail when it comes to password
authentication☨☨. Before the change, we were invalidating the cache and
triggering wake_compute if the authentication failed.

Additionally, I noticed some faulty logic I introduced 1 year ago
https://github.com/neondatabase/neon/pull/8141/files#diff-5491e3afe62d8c5c77178149c665603b29d88d3ec2e47fc1b3bb119a0a970afaL145-R147

☨ We can when we roll out TLS, as the certificate common name includes
the compute-id.

☨☨ Technically password authentication could pass for the wrong compute,
but I think this would only happen in the very very rare event that the
IP got reused **and** the compute's endpoint happened to be a
branch/replica.

# Solution

1. Fix the broken logic
2. Simplify cache invalidation (I don't know why it was so convoluted)
3. Add a loop around connect_to_compute + authenticate to re-introduce
the wake_compute invalidation we accidentally removed.

I went with this approach to try and avoid interfering with
https://github.com/neondatabase/neon/compare/main...cloneable/proxy-pglb-connect-compute-split.
The changes made in commit 3 will move into `handle_client_request` I
suspect,
2025-06-27 10:36:27 +00:00
Dmitrii Kovalkov
6fa1562b57 pageserver: increase default max_size_entries limit for basebackup cache (#12343)
## Problem
Some pageservers hit `max_size_entries` limit in staging with only ~25
MiB storage used by basebackup cache. The limit is too strict. It should
be safe to relax it.

- Part of https://github.com/neondatabase/cloud/issues/29353

## Summary of changes
- Increase the default `max_size_entries` from 1000 to 10000
2025-06-27 09:18:18 +00:00
Heikki Linnakangas
10afac87e7 impr(ci): Remove unnecessary 'make postgres-headers' build step (#12354)
The 'make postgres' step includes installation of the headers, no need
to do that separately.
2025-06-26 16:45:34 +00:00
Vlad Lazar
72b3c9cd11 pageserver: fix wal receiver hang on remote client shutdown (#12348)
## Problem

Druing shard splits we shut down the remote client early and allow the
parent shard to keep ingesting data. While ingesting data, the wal
receiver task may wait for the current flush to complete in order to
apply backpressure. Notifications are delivered via
`Timeline::layer_flush_done_tx`.

When the remote client was being shut down the flush loop exited
whithout delivering a notification. This left
`Timeline::wait_flush_completion` hanging indefinitely which blocked the
shutdown of the wal receiver task, and, hence, the shard split.

## Summary of Changes

Deliver a final notification when the flush loop is shutting down
without the timeline cancel cancellation token having fired. I tried
writing a test for this, but got stuck in failpoint hell and decided
it's not worth it.

`test_sharding_autosplit`, which reproduces this reliably in CI, passed
with the proposed fix in
https://github.com/neondatabase/neon/pull/12304.

Closes https://github.com/neondatabase/neon/issues/12060
2025-06-26 16:35:34 +00:00
Arpad Müller
232f2447d4 Support pull_timeline of timelines without writes (#12028)
Make the safekeeper `pull_timeline` endpoint support timelines that
haven't had any writes yet. In the storcon managed sk timelines world,
if a safekeeper goes down temporarily, the storcon will schedule a
`pull_timeline` call. There is no guarantee however that by when the
safekeeper is online again, there have been writes to the timeline yet.

The `snapshot` endpoint gives an error if the timeline hasn't had
writes, so we avoid calling it if `timeline_start_lsn` indicates a
freshly created timeline.

Fixes #11422
Part of #11670
2025-06-26 16:29:03 +00:00
Erik Grinaker
a2d2108e6a pageserver: use base backup cache with gRPC (#12352)
## Problem

gRPC base backups do not use the base backup cache.

Touches https://github.com/neondatabase/neon/issues/11728.

## Summary of changes

Integrate gRPC base backups with the base backup cache.

Also fixes a bug where the base backup cache did not differentiate
between primary/replica base backups (at least I think that's a bug?).
2025-06-26 15:52:15 +00:00
Alex Chi Z.
33c0d5e2f4 fix(pageserver): make posthog config parsing more robust (#12356)
## Problem

In our infra config, we have to split server_api_key and other fields in
two files: the former one in the sops file, and the latter one in the
normal config. It creates the situation that we might misconfigure some
regions that it only has part of the fields available, causing
storcon/pageserver refuse to start.

## Summary of changes

Allow PostHog config to have part of the fields available. Parse it
later.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-26 15:49:08 +00:00
Dmitrii Kovalkov
605fb04f89 pageserver: use bounded sender for basebackup cache (#12342)
## Problem
Basebackup cache now uses unbounded channel for prepare requests. In
theory it can grow large if the cache is hung and does not process the
requests.

- Part of https://github.com/neondatabase/cloud/issues/29353

## Summary of changes
- Replace an unbounded channel with a bounded one, the size is
configurable.
- Add `pageserver_basebackup_cache_prepare_queue_size` to observe the
size of the queue.
- Refactor a bit to move all metrics logic to `basebackup_cache.rs`
2025-06-26 13:26:24 +00:00
Conrad Ludgate
fd1e8ec257 [proxy] review and cleanup CLI args (#12167)
I was looking at how we could expose our proxy config as toml again, and
as I was writing out the schema format, I noticed some cruft in our CLI
args that no longer seem to be in use.

The redis change is the most complex, but I am pretty sure it's sound.
Since https://github.com/neondatabase/cloud/pull/15613 cplane longer
publishes to the global redis instance.
2025-06-26 11:25:41 +00:00
Konstantin Knizhnik
be23eae3b6 Mark pages as avaiable in LFC only after generation check (#12350)
## Problem

If LFC generation is changed then `lfc_readv_select` will return -1 but
pages are still marked as available in bitmap.

## Summary of changes

Update bitmap after generation check.

Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com>
2025-06-26 07:06:27 +00:00
Alex Chi Z.
6f70885e11 fix(pageserver): allow refresh_interval to be empty (#12349)
## Problem

Fix for https://github.com/neondatabase/neon/pull/12324

## Summary of changes

Need `serde(default)` to allow this field not present in the config,
otherwise there will be a config deserialization error.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-25 22:15:03 +00:00
Erik Grinaker
f755979102 pageserver: payload compression for gRPC base backups (#12346)
## Problem

gRPC base backups use gRPC compression. However, this has two problems:

* Base backup caching will cache compressed base backups (making gRPC
compression pointless).
* Tonic does not support varying the compression level, and zstd default
level is 10% slower than gzip fastest level.

Touches https://github.com/neondatabase/neon/issues/11728.
Touches https://github.com/neondatabase/cloud/issues/29353.

## Summary of changes

This patch adds a gRPC parameter `BaseBackupRequest::compression`
specifying the compression algorithm. It also moves compression into
`send_basebackup_tarball` to reduce code duplication.

A follow-up PR will integrate the base backup cache with gRPC.
2025-06-25 18:16:23 +00:00
Matthias van de Meent
1d49eefbbb RFC: Endpoint Persistent Unlogged Files Storage (#9661)
## Summary
A design for a storage system that allows storage of files required to
make
Neon's Endpoints have a better experience at or after a reboot.

## Motivation
Several systems inside PostgreSQL (and Neon) need some persistent
storage for
optimal workings across reboots and restarts, but still work without.
Examples are the cumulative statistics file in `pg_stat/global.stat`,
`pg_stat_statements`' `pg_stat/pg_stat_statements.stat`, and
`pg_prewarm`'s
`autoprewarm.blocks`. We need a storage system that can store and manage
these files for each Endpoint.

[GH rendered
file](https://github.com/neondatabase/neon/blob/MMeent/rfc-unlogged-file/docs/rfcs/040-Endpoint-Persistent-Unlogged-Files-Storage.md)

Part of https://github.com/neondatabase/cloud/issues/24225
2025-06-25 16:25:57 +00:00
Alex Chi Z.
6c77638ea1 feat(storcon): retrieve feature flag and pass to pageservers (#12324)
## Problem

part of https://github.com/neondatabase/neon/issues/11813

## Summary of changes

It costs $$$ to directly retrieve the feature flags from the pageserver.
Therefore, this patch adds new APIs to retrieve the spec from the
storcon and updates it via pageserver.

* Storcon retrieves the feature flag and send it to the pageservers.
* If the feature flag gets updated outside of the normal refresh loop of
the pageserver, pageserver won't fetch the flags on its own as long as
the last updated time <= refresh_period.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-25 14:58:18 +00:00
Conrad Ludgate
517a3d0d86 [proxy]: BatchQueue::call is not cancel safe - make it directly cancellation aware (#12345)
## Problem

https://github.com/neondatabase/cloud/issues/30539

If the current leader cancels the `call` function, then it has removed
the jobs from the queue, but will never finish sending the responses.
Because of this, it is not cancellation safe.

## Summary of changes

Document these functions as not cancellation safe. Move cancellation of
the queued jobs into the queue itself.

## Alternatives considered

1. We could spawn the task that runs the batch, since that won't get
cancelled.
* This requires `fn call(self: Arc<Self>)` or `fn call(&'static self)`.
2. We could add another scopeguard and return the requests back to the
queue.
* This requires that requests are always retry safe, and also requires
requests to be `Clone`.
2025-06-25 14:19:20 +00:00
Conrad Ludgate
27ca1e21be [console_redirect_proxy]: fix channel binding (#12238)
## Problem

While working more on TLS to compute, I realised that Console Redirect
-> pg-sni-router -> compute would break if channel binding was set to
prefer. This is because the channel binding data would differ between
Console Redirect -> pg-sni-router vs pg-sni-router -> compute.

I also noticed that I actually disabled channel binding in #12145, since
`connect_raw` would think that the connection didn't support TLS.

## Summary of changes

Make sure we specify the channel binding.
Make sure that `connect_raw` can see if we have TLS support.
2025-06-25 13:41:30 +00:00
Arpad Müller
1dc01c9bed Support cancellations of timelines with hanging ondemand downloads (#12330)
In `test_layer_download_cancelled_by_config_location`, we simulate hung
downloads via the `before-downloading-layer-stream-pausable` failpoint.
Then, we cancel a timeline via the `location_config` endpoint.

With the new default as of
https://github.com/neondatabase/neon/pull/11712, we would be creating
the timeline on safekeepers regardless if there have been writes or not,
and it turns out the test relied on the timeline not existing on
safekeepers, due to a cancellation bug:

* as established before, the test makes the read path hang
* the timeline cancellation function first cancels the walreceiver, and
only then cancels the timeline's token
* `WalIngest::new` is requesting a checkpoint, which hits the read path
* at cancellation time, we'd be hanging inside the read, not seeing the
cancellation of the walreceiver
* the test would time out due to the hang

This is probably also reproducible in the wild when there is S3
unavailabilies or bottlenecks. So we thought that it's worthwhile to fix
the hang issue. The approach chosen in the end involves the
`tokio::select` macro.

In PR 11712, we originally punted on the test due to the hang and opted
it out from the new default, but now we can use the new default.

Part of https://github.com/neondatabase/neon/issues/12299
2025-06-25 13:40:38 +00:00
Heikki Linnakangas
7c4c36f5ac Remove unnecessary separate installation of libpq (#12287)
`make install` compiles and installs libpq. Remove redundant separate
step to compile and install it.
2025-06-25 10:47:56 +00:00
Tristan Partin
a2d623696c Update pgaudit to latest versions (#12328)
These updates contain some bug fixes and are completely backwards
compatible with what we currently support in Neon.

Link: https://github.com/pgaudit/pgaudit/compare/1.6.2...1.6.3
Link: https://github.com/pgaudit/pgaudit/compare/1.7.0...1.7.1
Link: https://github.com/pgaudit/pgaudit/compare/16.0...16.1
Link: https://github.com/pgaudit/pgaudit/compare/17.0...17.1
Signed-off-by: Tristan Partin <tristan.partin@databricks.com>

Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
2025-06-25 09:03:02 +00:00
Tristan Partin
aa75722010 Set pgaudit.log=none for monitoring connections (#12137)
pgaudit can spam logs due to all the monitoring that we do. Logs from
these connections are not necessary for HIPPA compliance, so we can stop
logging from those connections.

Part-of: https://github.com/neondatabase/cloud/issues/29574

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-06-24 17:42:23 +00:00
Matthias van de Meent
6c6de6382a Use enum-typed PG versions (#12317)
This makes it possible for the compiler to validate that a match block
matched all PostgreSQL versions we support.

## Problem
We did not have a complete picture about which places we had to test
against PG versions, and what format these versions were: The full PG
version ID format (Major/minor/bugfix `MMmmbb`) as transfered in
protocol messages, or only the Major release version (`MM`). This meant
type confusion was rampant.

With this change, it becomes easier to develop new version-dependent
features, by making type and niche confusion impossible.

## Summary of changes
Every use of `pg_version` is now typed as either `PgVersionId` (u32,
valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for
every major version we support, serialized and stored like a u32 with
the value of that major version)

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2025-06-24 17:25:31 +00:00
Dmitry Savelev
158d84ea30 Switch the billing metrics storage format to ndjson. (#12338)
## Problem

The billing team wants to change the billing events pipeline and use a
common events format in S3 buckets across different event producers.

## Summary of changes

Change the events storage format for billing events from JSON to NDJSON.

Resolves: https://github.com/neondatabase/cloud/issues/29994
2025-06-24 15:36:36 +00:00
Conrad Ludgate
4dd9ca7b04 [proxy]: authenticate to compute after connect_to_compute (#12335)
## Problem

PGLB will do the connect_to_compute logic, neonkeeper will do the
session establishment logic. We should split it.

## Summary of changes

Moves postgres authentication to compute to a separate routine that
happens after connect_to_compute.
2025-06-24 14:15:36 +00:00
Arpad Müller
552249607d apply clippy fixes for 1.88.0 beta (#12331)
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.

This is therefore a preparation PR (like similar PRs before it).

There is a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).

The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.

---------

Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2025-06-24 10:12:42 +00:00
Ivan Efremov
a29772bf6e Create proxy-bench periodic run in CI (#12242)
Currently run for test only via pushing to the test-proxy-bench branch.

Relates to the #22681
2025-06-24 09:54:43 +00:00
Arpad Müller
0efff1db26 Allow cancellation errors in tests that allow timeline deletion errors (#12315)
After merging of PR https://github.com/neondatabase/neon/pull/11712 we
saw some tests be flaky, with errors showing up about the timeline
having been cancelled instead of having been deleted. This is an outcome
that is inherently racy with the "has been deleted" error.

In some instances, https://github.com/neondatabase/neon/pull/11712 has
already added the error about the timeline having been cancelled. This
PR adds them to the remaining instances of
https://github.com/neondatabase/neon/pull/11712, fixing the flakiness.
2025-06-23 22:26:38 +00:00
Aleksandr Sarantsev
5eecde461d storcon: Fix migration for Attached(0) tenants (#12256)
## Problem

`Attached(0)` tenant migrations can get stuck if the heatmap file has
not been uploaded.

## Summary of Changes

- Added a test to reproduce the issue.
- Introduced a `kick_secondary_downloads` config flag:
  - Enabled in testing environments.
  - Disabled in production (and in the new test).
- Updated `Attached(0)` locations to consider the number of secondaries
in their intent when deciding whether to download the heatmap.
2025-06-23 18:55:26 +00:00
Alex Chi Z.
85164422d0 feat(pageserver): support force overriding feature flags (#12233)
## Problem

Part of #11813 

## Summary of changes

Add a test API to make it easier to manipulate the feature flags within
tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-23 17:31:53 +00:00
John Spray
6c3aba7c44 storcon: adjust AZ selection for heterogenous AZs (#12296)
## Problem

The scheduler uses total shards per AZ to select the AZ for newly
created or attached tenants.

This makes bad decisions when we have different node counts per AZ -- we
might have 2 very busy pageservers in one AZ, and 4 more lightly loaded
pageservers in other AZs, and the scheduler picks the busy pageservers
because the total shard count in their AZ is lower.

## Summary of changes

- Divide the shard count by the number of nodes in the AZ when scoring
in `get_az_for_new_tenant`

---------

Co-authored-by: John Spray <john.spray@databricks.com>
2025-06-23 15:50:31 +00:00
Erik Grinaker
68a175d545 test_runner: fix test_basebackup_with_high_slru_count gzip param (#12319)
The `--gzip-probability` parameter was removed in #12250. However,
`test_basebackup_with_high_slru_count` still uses it, and keeps failing.

This patch removes the use of the parameter (gzip is enabled by
default).
2025-06-23 15:33:45 +00:00
381 changed files with 17319 additions and 10420 deletions

View File

@@ -4,6 +4,7 @@
!Cargo.lock
!Cargo.toml
!Makefile
!postgres.mk
!rust-toolchain.toml
!scripts/ninstall.sh
!docker-compose/run-tests.sh

View File

@@ -32,161 +32,14 @@ permissions:
contents: read
jobs:
build-pgxn:
if: |
inputs.pg_versions != '[]' || inputs.rebuild_everything ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
timeout-minutes: 30
runs-on: macos-15
strategy:
matrix:
postgres-version: ${{ inputs.rebuild_everything && fromJSON('["v14", "v15", "v16", "v17"]') || fromJSON(inputs.pg_versions) }}
env:
# Use release build only, to have less debug info around
# Hence keeping target/ (and general cache size) smaller
BUILD_TYPE: release
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@4d991eb9b905ef189e4c376166672c3f2f230481 # v2.11.0
with:
egress-policy: audit
- name: Checkout main repo
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set pg ${{ matrix.postgres-version }} for caching
id: pg_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-${{ matrix.postgres-version }}) | tee -a "${GITHUB_OUTPUT}"
- name: Cache postgres ${{ matrix.postgres-version }} build
id: cache_pg
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
with:
path: pg_install/${{ matrix.postgres-version }}
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-${{ matrix.postgres-version }}-${{ steps.pg_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Checkout submodule vendor/postgres-${{ matrix.postgres-version }}
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
git submodule init vendor/postgres-${{ matrix.postgres-version }}
git submodule update --depth 1 --recursive
- name: Install build dependencies
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
brew install flex bison openssl protobuf icu4c
- name: Set extra env for macOS
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
echo 'LDFLAGS=-L/usr/local/opt/openssl@3/lib' >> $GITHUB_ENV
echo 'CPPFLAGS=-I/usr/local/opt/openssl@3/include' >> $GITHUB_ENV
- name: Build Postgres ${{ matrix.postgres-version }}
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
make postgres-${{ matrix.postgres-version }} -j$(sysctl -n hw.ncpu)
- name: Build Neon Pg Ext ${{ matrix.postgres-version }}
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
make "neon-pg-ext-${{ matrix.postgres-version }}" -j$(sysctl -n hw.ncpu)
- name: Get postgres headers ${{ matrix.postgres-version }}
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
make postgres-headers-${{ matrix.postgres-version }} -j$(sysctl -n hw.ncpu)
- name: Upload "pg_install/${{ matrix.postgres-version }}" artifact
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:
name: pg_install--${{ matrix.postgres-version }}
path: pg_install/${{ matrix.postgres-version }}
# The artifact is supposed to be used by the next job in the same workflow,
# so theres no need to store it for too long.
retention-days: 1
build-walproposer-lib:
if: |
contains(inputs.pg_versions, 'v17') || inputs.rebuild_everything ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
timeout-minutes: 30
runs-on: macos-15
needs: [build-pgxn]
env:
# Use release build only, to have less debug info around
# Hence keeping target/ (and general cache size) smaller
BUILD_TYPE: release
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@4d991eb9b905ef189e4c376166672c3f2f230481 # v2.11.0
with:
egress-policy: audit
- name: Checkout main repo
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set pg v17 for caching
id: pg_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v17) | tee -a "${GITHUB_OUTPUT}"
- name: Download "pg_install/v17" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--v17
path: pg_install/v17
- name: Cache walproposer-lib
id: cache_walproposer_lib
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
with:
path: build/walproposer-lib
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-walproposer_lib-v17-${{ steps.pg_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Checkout submodule vendor/postgres-v17
if: steps.cache_walproposer_lib.outputs.cache-hit != 'true'
run: |
git submodule init vendor/postgres-v17
git submodule update --depth 1 --recursive
- name: Install build dependencies
if: steps.cache_walproposer_lib.outputs.cache-hit != 'true'
run: |
brew install flex bison openssl protobuf icu4c
- name: Set extra env for macOS
if: steps.cache_walproposer_lib.outputs.cache-hit != 'true'
run: |
echo 'LDFLAGS=-L/usr/local/opt/openssl@3/lib' >> $GITHUB_ENV
echo 'CPPFLAGS=-I/usr/local/opt/openssl@3/include' >> $GITHUB_ENV
- name: Build walproposer-lib (only for v17)
if: steps.cache_walproposer_lib.outputs.cache-hit != 'true'
run:
make walproposer-lib -j$(sysctl -n hw.ncpu)
- name: Upload "build/walproposer-lib" artifact
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:
name: build--walproposer-lib
path: build/walproposer-lib
# The artifact is supposed to be used by the next job in the same workflow,
# so theres no need to store it for too long.
retention-days: 1
cargo-build:
make-all:
if: |
inputs.pg_versions != '[]' || inputs.rebuild_rust_code || inputs.rebuild_everything ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
timeout-minutes: 30
timeout-minutes: 60
runs-on: macos-15
needs: [build-pgxn, build-walproposer-lib]
env:
# Use release build only, to have less debug info around
# Hence keeping target/ (and general cache size) smaller
@@ -202,41 +55,53 @@ jobs:
with:
submodules: true
- name: Download "pg_install/v14" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--v14
path: pg_install/v14
- name: Download "pg_install/v15" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--v15
path: pg_install/v15
- name: Download "pg_install/v16" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--v16
path: pg_install/v16
- name: Download "pg_install/v17" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--v17
path: pg_install/v17
- name: Download "build/walproposer-lib" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: build--walproposer-lib
path: build/walproposer-lib
# `actions/download-artifact` doesn't preserve permissions:
# https://github.com/actions/download-artifact?tab=readme-ov-file#permission-loss
- name: Make pg_install/v*/bin/* executable
- name: Install build dependencies
run: |
chmod +x pg_install/v*/bin/*
brew install flex bison openssl protobuf icu4c
- name: Set extra env for macOS
run: |
echo 'LDFLAGS=-L/usr/local/opt/openssl@3/lib' >> $GITHUB_ENV
echo 'CPPFLAGS=-I/usr/local/opt/openssl@3/include' >> $GITHUB_ENV
- name: Restore "pg_install/" cache
id: cache_pg
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
with:
path: pg_install
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-install-v14-${{ hashFiles('Makefile', 'postgres.mk', 'vendor/revisions.json') }}
- name: Checkout vendor/postgres submodules
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
git submodule init
git submodule update --depth 1 --recursive
- name: Build Postgres
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
make postgres -j$(sysctl -n hw.ncpu)
# This isn't strictly necessary, but it makes the cached and non-cached builds more similar,
# When pg_install is restored from cache, there is no 'build/' directory. By removing it
# in a non-cached build too, we enforce that the rest of the steps don't depend on it,
# so that we notice any build caching bugs earlier.
- name: Remove build artifacts
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
rm -rf build
# Explicitly update the rust toolchain before running 'make'. The parallel make build can
# invoke 'cargo build' more than once in parallel, for different crates. That's OK, 'cargo'
# does its own locking to prevent concurrent builds from stepping on each other's
# toes. However, it will first try to update the toolchain, and that step is not locked the
# same way. To avoid two toolchain updates running in parallel and stepping on each other's
# toes, ensure that the toolchain is up-to-date beforehand.
- name: Update rust toolchain
run: |
rustup --version &&
rustup update &&
rustup show
- name: Cache cargo deps
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
@@ -248,17 +113,12 @@ jobs:
target
key: v1-${{ runner.os }}-${{ runner.arch }}-cargo-${{ hashFiles('./Cargo.lock') }}-${{ hashFiles('./rust-toolchain.toml') }}-rust
- name: Install build dependencies
run: |
brew install flex bison openssl protobuf icu4c
- name: Set extra env for macOS
run: |
echo 'LDFLAGS=-L/usr/local/opt/openssl@3/lib' >> $GITHUB_ENV
echo 'CPPFLAGS=-I/usr/local/opt/openssl@3/include' >> $GITHUB_ENV
- name: Run cargo build
run: cargo build --all --release -j$(sysctl -n hw.ncpu)
# Build the neon-specific postgres extensions, and all the Rust bits.
#
# Pass PG_INSTALL_CACHED=1 because PostgreSQL was already built and cached
# separately.
- name: Build all
run: PG_INSTALL_CACHED=1 BUILD_TYPE=release make -j$(sysctl -n hw.ncpu) all
- name: Check that no warnings are produced
run: ./run_clippy.sh

View File

@@ -69,7 +69,7 @@ jobs:
submodules: true
- name: Check for file changes
uses: step-security/paths-filter@v3
uses: dorny/paths-filter@de90cc6fb38fc0963ad72b210f1f284cd68cea36 # v3.0.2
id: files-changed
with:
token: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -153,7 +153,7 @@ jobs:
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
- name: Benchmark database maintenance
if: ${{ matrix.test_maintenance == 'true' }}
if: ${{ matrix.test_maintenance }}
uses: ./.github/actions/run-python-test-set
with:
build_type: ${{ env.BUILD_TYPE }}

View File

@@ -53,7 +53,7 @@ jobs:
submodules: true
- name: Check for Postgres changes
uses: step-security/paths-filter@v3
uses: dorny/paths-filter@1441771bbfdd59dcd748680ee64ebd8faab1a242 #v3
id: files_changed
with:
token: ${{ github.token }}

View File

@@ -34,7 +34,7 @@ jobs:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- uses: step-security/changed-files@3dbe17c78367e7d60f00d78ae6781a35be47b4a1 # v45.0.1
- uses: tj-actions/changed-files@ed68ef82c095e0d48ec87eccea555d944a631a4c # v46.0.5
id: python-src
with:
files: |
@@ -45,7 +45,7 @@ jobs:
poetry.lock
pyproject.toml
- uses: step-security/changed-files@3dbe17c78367e7d60f00d78ae6781a35be47b4a1 # v45.0.1
- uses: tj-actions/changed-files@ed68ef82c095e0d48ec87eccea555d944a631a4c # v46.0.5
id: rust-src
with:
files: |

84
.github/workflows/proxy-benchmark.yml vendored Normal file
View File

@@ -0,0 +1,84 @@
name: Periodic proxy performance test on unit-perf hetzner runner
on:
push: # TODO: remove after testing
branches:
- test-proxy-bench # Runs on pushes to branches starting with test-proxy-bench
# schedule:
# * is a special character in YAML so you have to quote this string
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
# │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
# - cron: '0 5 * * *' # Runs at 5 UTC once a day
workflow_dispatch: # adds an ability to run this manually
defaults:
run:
shell: bash -euo pipefail {0}
concurrency:
group: ${{ github.workflow }}
cancel-in-progress: false
permissions:
contents: read
jobs:
run_periodic_proxybench_test:
permissions:
id-token: write # aws-actions/configure-aws-credentials
statuses: write
contents: write
pull-requests: write
runs-on: [self-hosted, unit-perf]
timeout-minutes: 60 # 1h timeout
container:
image: ghcr.io/neondatabase/build-tools:pinned-bookworm
credentials:
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
options: --init
steps:
- name: Checkout proxy-bench Repo
uses: actions/checkout@v4
with:
repository: neondatabase/proxy-bench
path: proxy-bench
- name: Set up the environment which depends on $RUNNER_TEMP on nvme drive
id: set-env
shell: bash -euxo pipefail {0}
run: |
PROXY_BENCH_PATH=$(realpath ./proxy-bench)
{
echo "PROXY_BENCH_PATH=$PROXY_BENCH_PATH"
echo "NEON_DIR=${RUNNER_TEMP}/neon"
echo "TEST_OUTPUT=${PROXY_BENCH_PATH}/test_output"
echo ""
} >> "$GITHUB_ENV"
- name: Run proxy-bench
run: ${PROXY_BENCH_PATH}/run.sh
- name: Ingest Bench Results # neon repo script
if: always()
run: |
mkdir -p $TEST_OUTPUT
python $NEON_DIR/scripts/proxy_bench_results_ingest.py --out $TEST_OUTPUT
- name: Push Metrics to Proxy perf database
if: always()
env:
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PROXY_TEST_RESULT_CONNSTR }}"
REPORT_FROM: $TEST_OUTPUT
run: $NEON_DIR/scripts/generate_and_push_perf_report.sh
- name: Docker cleanup
if: always()
run: docker compose down
- name: Notify Failure
if: failure()
run: echo "Proxy bench job failed" && exit 1

1
.gitignore vendored
View File

@@ -6,6 +6,7 @@
/tmp_check_cli
__pycache__/
test_output/
neon_previous/
.vscode
.idea
*.swp

2988
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -24,6 +24,7 @@ members = [
"libs/pageserver_api",
"libs/postgres_ffi",
"libs/postgres_ffi_types",
"libs/postgres_versioninfo",
"libs/safekeeper_api",
"libs/desim",
"libs/neon-shmem",
@@ -156,6 +157,7 @@ pprof = { version = "0.14", features = ["criterion", "flamegraph", "frame-pointe
procfs = "0.16"
prometheus = {version = "0.13", default-features=false, features = ["process"]} # removes protobuf dependency
prost = "0.13.5"
prost-types = "0.13.5"
rand = "0.8"
redis = { version = "0.29.2", features = ["tokio-rustls-comp", "keep-alive"] }
regex = "1.10.2"
@@ -179,6 +181,7 @@ serde_json = "1"
serde_path_to_error = "0.1"
serde_with = { version = "3", features = [ "base64" ] }
serde_assert = "0.5.0"
serde_repr = "0.1.20"
sha2 = "0.10.2"
signal-hook = "0.3"
smallvec = "1.11"
@@ -260,7 +263,6 @@ desim = { version = "0.1", path = "./libs/desim" }
endpoint_storage = { version = "0.0.1", path = "./endpoint_storage/" }
http-utils = { version = "0.1", path = "./libs/http-utils/" }
metrics = { version = "0.1", path = "./libs/metrics/" }
neonart = { version = "0.1", path = "./libs/neonart/" }
neon-shmem = { version = "0.1", path = "./libs/neon-shmem/" }
pageserver = { path = "./pageserver" }
pageserver_api = { version = "0.1", path = "./libs/pageserver_api/" }
@@ -272,6 +274,7 @@ postgres_backend = { version = "0.1", path = "./libs/postgres_backend/" }
postgres_connection = { version = "0.1", path = "./libs/postgres_connection/" }
postgres_ffi = { version = "0.1", path = "./libs/postgres_ffi/" }
postgres_ffi_types = { version = "0.1", path = "./libs/postgres_ffi_types/" }
postgres_versioninfo = { version = "0.1", path = "./libs/postgres_versioninfo/" }
postgres_initdb = { path = "./libs/postgres_initdb" }
posthog_client_lite = { version = "0.1", path = "./libs/posthog_client_lite" }
pq_proto = { version = "0.1", path = "./libs/pq_proto/" }
@@ -291,7 +294,7 @@ walproposer = { version = "0.1", path = "./libs/walproposer/" }
workspace_hack = { version = "0.1", path = "./workspace_hack/" }
## Build dependencies
cbindgen = "0.28.0"
cbindgen = "0.29.0"
criterion = "0.5.1"
rcgen = "0.13"
rstest = "0.18"

View File

@@ -30,7 +30,18 @@ ARG BASE_IMAGE_SHA=debian:${DEBIAN_FLAVOR}
ARG BASE_IMAGE_SHA=${BASE_IMAGE_SHA/debian:bookworm-slim/debian@$BOOKWORM_SLIM_SHA}
ARG BASE_IMAGE_SHA=${BASE_IMAGE_SHA/debian:bullseye-slim/debian@$BULLSEYE_SLIM_SHA}
# Build Postgres
# Naive way:
#
# 1. COPY . .
# 1. make neon-pg-ext
# 2. cargo build <storage binaries>
#
# But to enable docker to cache intermediate layers, we perform a few preparatory steps:
#
# - Build all postgres versions, depending on just the contents of vendor/
# - Use cargo chef to build all rust dependencies
# 1. Build all postgres versions
FROM $REPOSITORY/$IMAGE:$TAG AS pg-build
WORKDIR /home/nonroot
@@ -38,16 +49,15 @@ COPY --chown=nonroot vendor/postgres-v14 vendor/postgres-v14
COPY --chown=nonroot vendor/postgres-v15 vendor/postgres-v15
COPY --chown=nonroot vendor/postgres-v16 vendor/postgres-v16
COPY --chown=nonroot vendor/postgres-v17 vendor/postgres-v17
COPY --chown=nonroot pgxn pgxn
COPY --chown=nonroot Makefile Makefile
COPY --chown=nonroot postgres.mk postgres.mk
COPY --chown=nonroot scripts/ninstall.sh scripts/ninstall.sh
ENV BUILD_TYPE=release
RUN set -e \
&& mold -run make -j $(nproc) -s neon-pg-ext \
&& tar -C pg_install -czf /home/nonroot/postgres_install.tar.gz .
&& mold -run make -j $(nproc) -s postgres
# Prepare cargo-chef recipe
# 2. Prepare cargo-chef recipe
FROM $REPOSITORY/$IMAGE:$TAG AS plan
WORKDIR /home/nonroot
@@ -55,23 +65,22 @@ COPY --chown=nonroot . .
RUN cargo chef prepare --recipe-path recipe.json
# Build neon binaries
# Main build image
FROM $REPOSITORY/$IMAGE:$TAG AS build
WORKDIR /home/nonroot
ARG GIT_VERSION=local
ARG BUILD_TAG
COPY --from=pg-build /home/nonroot/pg_install/v14/include/postgresql/server pg_install/v14/include/postgresql/server
COPY --from=pg-build /home/nonroot/pg_install/v15/include/postgresql/server pg_install/v15/include/postgresql/server
COPY --from=pg-build /home/nonroot/pg_install/v16/include/postgresql/server pg_install/v16/include/postgresql/server
COPY --from=pg-build /home/nonroot/pg_install/v17/include/postgresql/server pg_install/v17/include/postgresql/server
COPY --from=plan /home/nonroot/recipe.json recipe.json
ARG ADDITIONAL_RUSTFLAGS=""
# 3. Build cargo dependencies. Note that this step doesn't depend on anything else than
# `recipe.json`, so the layer can be reused as long as none of the dependencies change.
COPY --from=plan /home/nonroot/recipe.json recipe.json
RUN set -e \
&& RUSTFLAGS="-Clinker=clang -Clink-arg=-fuse-ld=mold -Clink-arg=-Wl,--no-rosegment -Cforce-frame-pointers=yes ${ADDITIONAL_RUSTFLAGS}" cargo chef cook --locked --release --recipe-path recipe.json
# Perform the main build. We reuse the Postgres build artifacts from the intermediate 'pg-build'
# layer, and the cargo dependencies built in the previous step.
COPY --chown=nonroot --from=pg-build /home/nonroot/pg_install/ pg_install
COPY --chown=nonroot . .
RUN set -e \
@@ -86,10 +95,10 @@ RUN set -e \
--bin endpoint_storage \
--bin neon_local \
--bin storage_scrubber \
--locked --release
--locked --release \
&& mold -run make -j $(nproc) -s neon-pg-ext
# Build final image
#
# Assemble the final image
FROM $BASE_IMAGE_SHA
WORKDIR /data
@@ -129,12 +138,15 @@ COPY --from=build --chown=neon:neon /home/nonroot/target/release/proxy
COPY --from=build --chown=neon:neon /home/nonroot/target/release/endpoint_storage /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/neon_local /usr/local/bin
COPY --from=build --chown=neon:neon /home/nonroot/target/release/storage_scrubber /usr/local/bin
COPY --from=build /home/nonroot/pg_install/v14 /usr/local/v14/
COPY --from=build /home/nonroot/pg_install/v15 /usr/local/v15/
COPY --from=build /home/nonroot/pg_install/v16 /usr/local/v16/
COPY --from=build /home/nonroot/pg_install/v17 /usr/local/v17/
COPY --from=pg-build /home/nonroot/pg_install/v14 /usr/local/v14/
COPY --from=pg-build /home/nonroot/pg_install/v15 /usr/local/v15/
COPY --from=pg-build /home/nonroot/pg_install/v16 /usr/local/v16/
COPY --from=pg-build /home/nonroot/pg_install/v17 /usr/local/v17/
COPY --from=pg-build /home/nonroot/postgres_install.tar.gz /data/
# Deprecated: Old deployment scripts use this tarball which contains all the Postgres binaries.
# That's obsolete, since all the same files are also present under /usr/local/v*. But to keep the
# old scripts working for now, create the tarball.
RUN tar -C /usr/local -cvzf /data/postgres_install.tar.gz v14 v15 v16 v17
# By default, pageserver uses `.neon/` working directory in WORKDIR, so create one and fill it with the dummy config.
# Now, when `docker run ... pageserver` is run, it can start without errors, yet will have some default dummy values.

132
Makefile
View File

@@ -4,11 +4,14 @@ ROOT_PROJECT_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
# managers.
POSTGRES_INSTALL_DIR ?= $(ROOT_PROJECT_DIR)/pg_install/
# Supported PostgreSQL versions
POSTGRES_VERSIONS = v17 v16 v15 v14
# CARGO_BUILD_FLAGS: Extra flags to pass to `cargo build`. `--locked`
# and `--features testing` are popular examples.
#
# CARGO_PROFILE: You can also set to override the cargo profile to
# use. By default, it is derived from BUILD_TYPE.
# CARGO_PROFILE: Set to override the cargo profile to use. By default,
# it is derived from BUILD_TYPE.
# All intermediate build artifacts are stored here.
BUILD_DIR := build
@@ -102,13 +105,13 @@ CACHEDIR_TAG_CONTENTS := "Signature: 8a477f597d28d172789f06886806bc55"
# Top level Makefile to build Neon and PostgreSQL
#
.PHONY: all
all: neon postgres neon-pg-ext
all: neon postgres-install neon-pg-ext
### Neon Rust bits
#
# The 'postgres_ffi' depends on the Postgres headers.
# The 'postgres_ffi' crate depends on the Postgres headers.
.PHONY: neon
neon: postgres-headers walproposer-lib cargo-target-dir
neon: postgres-headers-install walproposer-lib cargo-target-dir
+@echo "Compiling Neon"
$(CARGO_CMD_PREFIX) cargo build $(CARGO_BUILD_FLAGS) $(CARGO_PROFILE)
@@ -118,78 +121,8 @@ cargo-target-dir:
mkdir -p target
test -e target/CACHEDIR.TAG || echo "$(CACHEDIR_TAG_CONTENTS)" > target/CACHEDIR.TAG
### PostgreSQL parts
# Some rules are duplicated for Postgres v14 and 15. We may want to refactor
# to avoid the duplication in the future, but it's tolerable for now.
#
$(BUILD_DIR)/%/config.status:
mkdir -p $(BUILD_DIR)
test -e $(BUILD_DIR)/CACHEDIR.TAG || echo "$(CACHEDIR_TAG_CONTENTS)" > $(BUILD_DIR)/CACHEDIR.TAG
+@echo "Configuring Postgres $* build"
@test -s $(ROOT_PROJECT_DIR)/vendor/postgres-$*/configure || { \
echo "\nPostgres submodule not found in $(ROOT_PROJECT_DIR)/vendor/postgres-$*/, execute "; \
echo "'git submodule update --init --recursive --depth 2 --progress .' in project root.\n"; \
exit 1; }
mkdir -p $(BUILD_DIR)/$*
VERSION=$*; \
EXTRA_VERSION=$$(cd $(ROOT_PROJECT_DIR)/vendor/postgres-$$VERSION && git rev-parse HEAD); \
(cd $(BUILD_DIR)/$$VERSION && \
env PATH="$(EXTRA_PATH_OVERRIDES):$$PATH" $(ROOT_PROJECT_DIR)/vendor/postgres-$$VERSION/configure \
CFLAGS='$(PG_CFLAGS)' LDFLAGS='$(PG_LDFLAGS)' \
$(PG_CONFIGURE_OPTS) --with-extra-version=" ($$EXTRA_VERSION)" \
--prefix=$(abspath $(POSTGRES_INSTALL_DIR))/$$VERSION > configure.log)
# nicer alias to run 'configure'
# Note: I've been unable to use templates for this part of our configuration.
# I'm not sure why it wouldn't work, but this is the only place (apart from
# the "build-all-versions" entry points) where direct mention of PostgreSQL
# versions is used.
.PHONY: postgres-configure-v17
postgres-configure-v17: $(BUILD_DIR)/v17/config.status
.PHONY: postgres-configure-v16
postgres-configure-v16: $(BUILD_DIR)/v16/config.status
.PHONY: postgres-configure-v15
postgres-configure-v15: $(BUILD_DIR)/v15/config.status
.PHONY: postgres-configure-v14
postgres-configure-v14: $(BUILD_DIR)/v14/config.status
# Install the PostgreSQL header files into $(POSTGRES_INSTALL_DIR)/<version>/include
.PHONY: postgres-headers-%
postgres-headers-%: postgres-configure-%
+@echo "Installing PostgreSQL $* headers"
$(MAKE) -C $(BUILD_DIR)/$*/src/include MAKELEVEL=0 install
# Compile and install PostgreSQL
.PHONY: postgres-%
postgres-%: postgres-configure-% \
postgres-headers-% # to prevent `make install` conflicts with neon's `postgres-headers`
+@echo "Compiling PostgreSQL $*"
$(MAKE) -C $(BUILD_DIR)/$* MAKELEVEL=0 install
+@echo "Compiling libpq $*"
$(MAKE) -C $(BUILD_DIR)/$*/src/interfaces/libpq install
+@echo "Compiling pg_prewarm $*"
$(MAKE) -C $(BUILD_DIR)/$*/contrib/pg_prewarm install
+@echo "Compiling pg_buffercache $*"
$(MAKE) -C $(BUILD_DIR)/$*/contrib/pg_buffercache install
+@echo "Compiling pg_visibility $*"
$(MAKE) -C $(BUILD_DIR)/$*/contrib/pg_visibility install
+@echo "Compiling pageinspect $*"
$(MAKE) -C $(BUILD_DIR)/$*/contrib/pageinspect install
+@echo "Compiling pg_trgm $*"
$(MAKE) -C $(BUILD_DIR)/$*/contrib/pg_trgm install
+@echo "Compiling amcheck $*"
$(MAKE) -C $(BUILD_DIR)/$*/contrib/amcheck install
+@echo "Compiling test_decoding $*"
$(MAKE) -C $(BUILD_DIR)/$*/contrib/test_decoding install
.PHONY: postgres-check-%
postgres-check-%: postgres-%
$(MAKE) -C $(BUILD_DIR)/$* MAKELEVEL=0 check
.PHONY: neon-pg-ext-%
neon-pg-ext-%: postgres-% cargo-target-dir
neon-pg-ext-%: postgres-install-% cargo-target-dir
+@echo "Compiling neon-specific Postgres extensions for $*"
mkdir -p $(BUILD_DIR)/pgxn-$*
$(MAKE) PG_CONFIG="$(POSTGRES_INSTALL_DIR)/$*/bin/pg_config" COPT='$(COPT)' \
@@ -231,39 +164,14 @@ ifeq ($(UNAME_S),Linux)
pg_crc32c.o
endif
# Shorthand to call neon-pg-ext-% target for all Postgres versions
.PHONY: neon-pg-ext
neon-pg-ext: \
neon-pg-ext-v14 \
neon-pg-ext-v15 \
neon-pg-ext-v16 \
neon-pg-ext-v17
# shorthand to build all Postgres versions
.PHONY: postgres
postgres: \
postgres-v14 \
postgres-v15 \
postgres-v16 \
postgres-v17
.PHONY: postgres-headers
postgres-headers: \
postgres-headers-v14 \
postgres-headers-v15 \
postgres-headers-v16 \
postgres-headers-v17
.PHONY: postgres-check
postgres-check: \
postgres-check-v14 \
postgres-check-v15 \
postgres-check-v16 \
postgres-check-v17
neon-pg-ext: $(foreach pg_version,$(POSTGRES_VERSIONS),neon-pg-ext-$(pg_version))
# This removes everything
.PHONY: distclean
distclean:
$(RM) -r $(POSTGRES_INSTALL_DIR)
$(RM) -r $(POSTGRES_INSTALL_DIR) $(BUILD_DIR)
$(CARGO_CMD_PREFIX) cargo clean
.PHONY: fmt
@@ -311,3 +219,19 @@ neon-pgindent: postgres-v17-pg-bsd-indent neon-pg-ext-v17
.PHONY: setup-pre-commit-hook
setup-pre-commit-hook:
ln -s -f $(ROOT_PROJECT_DIR)/pre-commit.py .git/hooks/pre-commit
# Targets for building PostgreSQL are defined in postgres.mk.
#
# But if the caller has indicated that PostgreSQL is already
# installed, by setting the PG_INSTALL_CACHED variable, skip it.
ifdef PG_INSTALL_CACHED
postgres-install: skip-install
$(foreach pg_version,$(POSTGRES_VERSIONS),postgres-install-$(pg_version)): skip-install
postgres-headers-install:
+@echo "Skipping installation of PostgreSQL headers because PG_INSTALL_CACHED is set"
skip-install:
+@echo "Skipping PostgreSQL installation because PG_INSTALL_CACHED is set"
else
include postgres.mk
endif

View File

@@ -165,6 +165,7 @@ RUN curl -fsSL \
&& rm sql_exporter.tar.gz
# protobuf-compiler (protoc)
# Keep the version the same as in compute/compute-node.Dockerfile
ENV PROTOC_VERSION=25.1
RUN curl -fsSL "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOC_VERSION}/protoc-${PROTOC_VERSION}-linux-$(uname -m | sed 's/aarch64/aarch_64/g').zip" -o "protoc.zip" \
&& unzip -q protoc.zip -d protoc \
@@ -179,7 +180,7 @@ RUN curl -sL "https://github.com/peak/s5cmd/releases/download/v${S5CMD_VERSION}/
&& mv s5cmd /usr/local/bin/s5cmd
# LLVM
ENV LLVM_VERSION=19
ENV LLVM_VERSION=20
RUN curl -fsSL 'https://apt.llvm.org/llvm-snapshot.gpg.key' | apt-key add - \
&& echo "deb http://apt.llvm.org/${DEBIAN_VERSION}/ llvm-toolchain-${DEBIAN_VERSION}-${LLVM_VERSION} main" > /etc/apt/sources.list.d/llvm.stable.list \
&& apt update \
@@ -292,7 +293,7 @@ WORKDIR /home/nonroot
# Rust
# Please keep the version of llvm (installed above) in sync with rust llvm (`rustc --version --verbose | grep LLVM`)
ENV RUSTC_VERSION=1.87.0
ENV RUSTC_VERSION=1.88.0
ENV RUSTUP_HOME="/home/nonroot/.rustup"
ENV PATH="/home/nonroot/.cargo/bin:${PATH}"
ARG RUSTFILT_VERSION=0.2.1

View File

@@ -22,7 +22,7 @@ sql_exporter.yml: $(jsonnet_files)
--output-file etc/$@ \
--tla-str collector_name=neon_collector \
--tla-str collector_file=neon_collector.yml \
--tla-str 'connection_string=postgresql://cloud_admin@127.0.0.1:5432/postgres?sslmode=disable&application_name=sql_exporter' \
--tla-str 'connection_string=postgresql://cloud_admin@127.0.0.1:5432/postgres?sslmode=disable&application_name=sql_exporter&pgaudit.log=none' \
etc/sql_exporter.jsonnet
sql_exporter_autoscaling.yml: $(jsonnet_files)
@@ -30,7 +30,7 @@ sql_exporter_autoscaling.yml: $(jsonnet_files)
--output-file etc/$@ \
--tla-str collector_name=neon_collector_autoscaling \
--tla-str collector_file=neon_collector_autoscaling.yml \
--tla-str 'connection_string=postgresql://cloud_admin@127.0.0.1:5432/postgres?sslmode=disable&application_name=sql_exporter_autoscaling' \
--tla-str 'connection_string=postgresql://cloud_admin@127.0.0.1:5432/postgres?sslmode=disable&application_name=sql_exporter_autoscaling&pgaudit.log=none' \
etc/sql_exporter.jsonnet
.PHONY: clean

View File

@@ -115,6 +115,9 @@ ARG EXTENSIONS=all
FROM $BASE_IMAGE_SHA AS build-deps
ARG DEBIAN_VERSION
# Keep in sync with build-tools.Dockerfile
ENV PROTOC_VERSION=25.1
# Use strict mode for bash to catch errors early
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]
@@ -149,8 +152,14 @@ RUN case $DEBIAN_VERSION in \
libclang-dev \
jsonnet \
$VERSION_INSTALLS \
&& apt clean && rm -rf /var/lib/apt/lists/* && \
useradd -ms /bin/bash nonroot -b /home
&& apt clean && rm -rf /var/lib/apt/lists/* \
&& useradd -ms /bin/bash nonroot -b /home \
# Install protoc from binary release, since Debian's versions are too old.
&& curl -fsSL "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOC_VERSION}/protoc-${PROTOC_VERSION}-linux-$(uname -m | sed 's/aarch64/aarch_64/g').zip" -o "protoc.zip" \
&& unzip -q protoc.zip -d protoc \
&& mv protoc/bin/protoc /usr/local/bin/protoc \
&& mv protoc/include/google /usr/local/include/google \
&& rm -rf protoc.zip protoc
#########################################################################################
#
@@ -171,9 +180,6 @@ RUN cd postgres && \
eval $CONFIGURE_CMD && \
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s install && \
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C contrib/ install && \
# Install headers
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C src/include install && \
make MAKELEVEL=0 -j $(getconf _NPROCESSORS_ONLN) -s -C src/interfaces/libpq install && \
# Enable some of contrib extensions
echo 'trusted = true' >> /usr/local/pgsql/share/extension/autoinc.control && \
echo 'trusted = true' >> /usr/local/pgsql/share/extension/dblink.control && \
@@ -1173,7 +1179,7 @@ COPY --from=pgrag-src /ext-src/ /ext-src/
# Install it using virtual environment, because Python 3.11 (the default version on Debian 12 (Bookworm)) complains otherwise
WORKDIR /ext-src/onnxruntime-src
RUN apt update && apt install --no-install-recommends --no-install-suggests -y \
python3 python3-pip python3-venv protobuf-compiler && \
python3 python3-pip python3-venv && \
apt clean && rm -rf /var/lib/apt/lists/* && \
python3 -m venv venv && \
. venv/bin/activate && \
@@ -1566,29 +1572,31 @@ RUN make -j $(getconf _NPROCESSORS_ONLN) && \
FROM build-deps AS pgaudit-src
ARG PG_VERSION
WORKDIR /ext-src
COPY "compute/patches/pgaudit-parallel_workers-${PG_VERSION}.patch" .
RUN case "${PG_VERSION}" in \
"v14") \
export PGAUDIT_VERSION=1.6.2 \
export PGAUDIT_CHECKSUM=1f350d70a0cbf488c0f2b485e3a5c9b11f78ad9e3cbb95ef6904afa1eb3187eb \
export PGAUDIT_VERSION=1.6.3 \
export PGAUDIT_CHECKSUM=37a8f5a7cc8d9188e536d15cf0fdc457fcdab2547caedb54442c37f124110919 \
;; \
"v15") \
export PGAUDIT_VERSION=1.7.0 \
export PGAUDIT_CHECKSUM=8f4a73e451c88c567e516e6cba7dc1e23bc91686bb6f1f77f8f3126d428a8bd8 \
export PGAUDIT_VERSION=1.7.1 \
export PGAUDIT_CHECKSUM=e9c8e6e092d82b2f901d72555ce0fe7780552f35f8985573796cd7e64b09d4ec \
;; \
"v16") \
export PGAUDIT_VERSION=16.0 \
export PGAUDIT_CHECKSUM=d53ef985f2d0b15ba25c512c4ce967dce07b94fd4422c95bd04c4c1a055fe738 \
export PGAUDIT_VERSION=16.1 \
export PGAUDIT_CHECKSUM=3bae908ab70ba0c6f51224009dbcfff1a97bd6104c6273297a64292e1b921fee \
;; \
"v17") \
export PGAUDIT_VERSION=17.0 \
export PGAUDIT_CHECKSUM=7d0d08d030275d525f36cd48b38c6455f1023da863385badff0cec44965bfd8c \
export PGAUDIT_VERSION=17.1 \
export PGAUDIT_CHECKSUM=9c5f37504d393486cc75d2ced83f75f5899be64fa85f689d6babb833b4361e6c \
;; \
*) \
echo "pgaudit is not supported on this PostgreSQL version" && exit 1;; \
esac && \
wget https://github.com/pgaudit/pgaudit/archive/refs/tags/${PGAUDIT_VERSION}.tar.gz -O pgaudit.tar.gz && \
echo "${PGAUDIT_CHECKSUM} pgaudit.tar.gz" | sha256sum --check && \
mkdir pgaudit-src && cd pgaudit-src && tar xzf ../pgaudit.tar.gz --strip-components=1 -C .
mkdir pgaudit-src && cd pgaudit-src && tar xzf ../pgaudit.tar.gz --strip-components=1 -C . && \
patch -p1 < "/ext-src/pgaudit-parallel_workers-${PG_VERSION}.patch"
FROM pg-build AS pgaudit-build
COPY --from=pgaudit-src /ext-src/ /ext-src/
@@ -1628,11 +1636,14 @@ RUN make install USE_PGXS=1 -j $(getconf _NPROCESSORS_ONLN)
# compile neon extensions
#
#########################################################################################
FROM pg-build AS neon-ext-build
FROM pg-build-with-cargo AS neon-ext-build
ARG PG_VERSION
COPY pgxn/ pgxn/
RUN make -j $(getconf _NPROCESSORS_ONLN) -C pgxn -s install-compute
USER root
COPY . .
RUN make -j $(getconf _NPROCESSORS_ONLN) -C pgxn -s install-compute \
BUILD_TYPE=release CARGO_BUILD_FLAGS="--locked --release" NEON_CARGO_ARTIFACT_TARGET_DIR="$(pwd)/target/release"
#########################################################################################
#
@@ -1977,7 +1988,7 @@ RUN apt update && \
locales \
lsof \
procps \
rsyslog \
rsyslog-gnutls \
screen \
tcpdump \
$VERSION_INSTALLS && \

View File

@@ -8,6 +8,8 @@
import 'sql_exporter/compute_logical_snapshot_files.libsonnet',
import 'sql_exporter/compute_logical_snapshots_bytes.libsonnet',
import 'sql_exporter/compute_max_connections.libsonnet',
import 'sql_exporter/compute_pg_oldest_frozen_xid_age.libsonnet',
import 'sql_exporter/compute_pg_oldest_mxid_age.libsonnet',
import 'sql_exporter/compute_receive_lsn.libsonnet',
import 'sql_exporter/compute_subscriptions_count.libsonnet',
import 'sql_exporter/connection_counts.libsonnet',

View File

@@ -0,0 +1,13 @@
{
metric_name: 'compute_pg_oldest_frozen_xid_age',
type: 'gauge',
help: 'Age of oldest XIDs that have not been frozen by VACUUM. An indicator of how long it has been since VACUUM last ran.',
key_labels: [
'database_name',
],
value_label: 'metric',
values: [
'frozen_xid_age',
],
query: importstr 'sql_exporter/compute_pg_oldest_frozen_xid_age.sql',
}

View File

@@ -0,0 +1,4 @@
SELECT datname database_name,
age(datfrozenxid) frozen_xid_age
FROM pg_database
ORDER BY frozen_xid_age DESC LIMIT 10;

View File

@@ -0,0 +1,13 @@
{
metric_name: 'compute_pg_oldest_mxid_age',
type: 'gauge',
help: 'Age of oldest MXIDs that have not been replaced by VACUUM. An indicator of how long it has been since VACUUM last ran.',
key_labels: [
'database_name',
],
value_label: 'metric',
values: [
'min_mxid_age',
],
query: importstr 'sql_exporter/compute_pg_oldest_mxid_age.sql',
}

View File

@@ -0,0 +1,4 @@
SELECT datname database_name,
mxid_age(datminmxid) min_mxid_age
FROM pg_database
ORDER BY min_mxid_age DESC LIMIT 10;

View File

@@ -1,8 +1,8 @@
diff --git a/sql/anon.sql b/sql/anon.sql
index 0cdc769..f6cc950 100644
index 0cdc769..b450327 100644
--- a/sql/anon.sql
+++ b/sql/anon.sql
@@ -1141,3 +1141,8 @@ $$
@@ -1141,3 +1141,15 @@ $$
-- TODO : https://en.wikipedia.org/wiki/L-diversity
-- TODO : https://en.wikipedia.org/wiki/T-closeness
@@ -11,6 +11,13 @@ index 0cdc769..f6cc950 100644
+
+GRANT ALL ON SCHEMA anon to neon_superuser;
+GRANT ALL ON ALL TABLES IN SCHEMA anon TO neon_superuser;
+
+DO $$
+BEGIN
+ IF current_setting('server_version_num')::int >= 150000 THEN
+ GRANT SET ON PARAMETER anon.transparent_dynamic_masking TO neon_superuser;
+ END IF;
+END $$;
diff --git a/sql/init.sql b/sql/init.sql
index 7da6553..9b6164b 100644
--- a/sql/init.sql

View File

@@ -0,0 +1,143 @@
commit 7220bb3a3f23fa27207d77562dcc286f9a123313
Author: Tristan Partin <tristan.partin@databricks.com>
Date: 2025-06-23 02:09:31 +0000
Disable logging in parallel workers
When a query uses parallel workers, pgaudit will log the same query for
every parallel worker. This is undesireable since it can result in log
amplification for queries that use parallel workers.
Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
diff --git a/expected/pgaudit.out b/expected/pgaudit.out
index baa8011..a601375 100644
--- a/expected/pgaudit.out
+++ b/expected/pgaudit.out
@@ -2563,6 +2563,37 @@ COMMIT;
NOTICE: AUDIT: SESSION,12,4,MISC,COMMIT,,,COMMIT;,<not logged>
DROP TABLE part_test;
NOTICE: AUDIT: SESSION,13,1,DDL,DROP TABLE,,,DROP TABLE part_test;,<not logged>
+--
+-- Test logging in parallel workers
+SET pgaudit.log = 'read';
+SET pgaudit.log_client = on;
+SET pgaudit.log_level = 'notice';
+-- Force parallel execution for testing
+SET max_parallel_workers_per_gather = 2;
+SET parallel_tuple_cost = 0;
+SET parallel_setup_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET min_parallel_index_scan_size = 0;
+-- Create table with enough data to trigger parallel execution
+CREATE TABLE parallel_test (id int, data text);
+INSERT INTO parallel_test SELECT generate_series(1, 1000), 'test data';
+SELECT count(*) FROM parallel_test;
+NOTICE: AUDIT: SESSION,14,1,READ,SELECT,,,SELECT count(*) FROM parallel_test;,<not logged>
+ count
+-------
+ 1000
+(1 row)
+
+-- Cleanup parallel test
+DROP TABLE parallel_test;
+RESET max_parallel_workers_per_gather;
+RESET parallel_tuple_cost;
+RESET parallel_setup_cost;
+RESET min_parallel_table_scan_size;
+RESET min_parallel_index_scan_size;
+RESET pgaudit.log;
+RESET pgaudit.log_client;
+RESET pgaudit.log_level;
-- Cleanup
-- Set client_min_messages up to warning to avoid noise
SET client_min_messages = 'warning';
diff --git a/pgaudit.c b/pgaudit.c
index 5e6fd38..ac9ded2 100644
--- a/pgaudit.c
+++ b/pgaudit.c
@@ -11,6 +11,7 @@
#include "postgres.h"
#include "access/htup_details.h"
+#include "access/parallel.h"
#include "access/sysattr.h"
#include "access/xact.h"
#include "access/relation.h"
@@ -1303,7 +1304,7 @@ pgaudit_ExecutorStart_hook(QueryDesc *queryDesc, int eflags)
{
AuditEventStackItem *stackItem = NULL;
- if (!internalStatement)
+ if (!internalStatement && !IsParallelWorker())
{
/* Push the audit even onto the stack */
stackItem = stack_push();
@@ -1384,7 +1385,7 @@ pgaudit_ExecutorCheckPerms_hook(List *rangeTabls, bool abort)
/* Log DML if the audit role is valid or session logging is enabled */
if ((auditOid != InvalidOid || auditLogBitmap != 0) &&
- !IsAbortedTransactionBlockState())
+ !IsAbortedTransactionBlockState() && !IsParallelWorker())
{
/* If auditLogRows is on, wait for rows processed to be set */
if (auditLogRows && auditEventStack != NULL)
@@ -1438,7 +1439,7 @@ pgaudit_ExecutorRun_hook(QueryDesc *queryDesc, ScanDirection direction, uint64 c
else
standard_ExecutorRun(queryDesc, direction, count, execute_once);
- if (auditLogRows && !internalStatement)
+ if (auditLogRows && !internalStatement && !IsParallelWorker())
{
/* Find an item from the stack by the query memory context */
stackItem = stack_find_context(queryDesc->estate->es_query_cxt);
@@ -1458,7 +1459,7 @@ pgaudit_ExecutorEnd_hook(QueryDesc *queryDesc)
AuditEventStackItem *stackItem = NULL;
AuditEventStackItem *auditEventStackFull = NULL;
- if (auditLogRows && !internalStatement)
+ if (auditLogRows && !internalStatement && !IsParallelWorker())
{
/* Find an item from the stack by the query memory context */
stackItem = stack_find_context(queryDesc->estate->es_query_cxt);
diff --git a/sql/pgaudit.sql b/sql/pgaudit.sql
index cc1374a..1870a60 100644
--- a/sql/pgaudit.sql
+++ b/sql/pgaudit.sql
@@ -1612,6 +1612,36 @@ COMMIT;
DROP TABLE part_test;
+--
+-- Test logging in parallel workers
+SET pgaudit.log = 'read';
+SET pgaudit.log_client = on;
+SET pgaudit.log_level = 'notice';
+
+-- Force parallel execution for testing
+SET max_parallel_workers_per_gather = 2;
+SET parallel_tuple_cost = 0;
+SET parallel_setup_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET min_parallel_index_scan_size = 0;
+
+-- Create table with enough data to trigger parallel execution
+CREATE TABLE parallel_test (id int, data text);
+INSERT INTO parallel_test SELECT generate_series(1, 1000), 'test data';
+
+SELECT count(*) FROM parallel_test;
+
+-- Cleanup parallel test
+DROP TABLE parallel_test;
+RESET max_parallel_workers_per_gather;
+RESET parallel_tuple_cost;
+RESET parallel_setup_cost;
+RESET min_parallel_table_scan_size;
+RESET min_parallel_index_scan_size;
+RESET pgaudit.log;
+RESET pgaudit.log_client;
+RESET pgaudit.log_level;
+
-- Cleanup
-- Set client_min_messages up to warning to avoid noise
SET client_min_messages = 'warning';

View File

@@ -0,0 +1,143 @@
commit 29dc2847f6255541992f18faf8a815dfab79631a
Author: Tristan Partin <tristan.partin@databricks.com>
Date: 2025-06-23 02:09:31 +0000
Disable logging in parallel workers
When a query uses parallel workers, pgaudit will log the same query for
every parallel worker. This is undesireable since it can result in log
amplification for queries that use parallel workers.
Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
diff --git a/expected/pgaudit.out b/expected/pgaudit.out
index b22560b..73f0327 100644
--- a/expected/pgaudit.out
+++ b/expected/pgaudit.out
@@ -2563,6 +2563,37 @@ COMMIT;
NOTICE: AUDIT: SESSION,12,4,MISC,COMMIT,,,COMMIT;,<not logged>
DROP TABLE part_test;
NOTICE: AUDIT: SESSION,13,1,DDL,DROP TABLE,,,DROP TABLE part_test;,<not logged>
+--
+-- Test logging in parallel workers
+SET pgaudit.log = 'read';
+SET pgaudit.log_client = on;
+SET pgaudit.log_level = 'notice';
+-- Force parallel execution for testing
+SET max_parallel_workers_per_gather = 2;
+SET parallel_tuple_cost = 0;
+SET parallel_setup_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET min_parallel_index_scan_size = 0;
+-- Create table with enough data to trigger parallel execution
+CREATE TABLE parallel_test (id int, data text);
+INSERT INTO parallel_test SELECT generate_series(1, 1000), 'test data';
+SELECT count(*) FROM parallel_test;
+NOTICE: AUDIT: SESSION,14,1,READ,SELECT,,,SELECT count(*) FROM parallel_test;,<not logged>
+ count
+-------
+ 1000
+(1 row)
+
+-- Cleanup parallel test
+DROP TABLE parallel_test;
+RESET max_parallel_workers_per_gather;
+RESET parallel_tuple_cost;
+RESET parallel_setup_cost;
+RESET min_parallel_table_scan_size;
+RESET min_parallel_index_scan_size;
+RESET pgaudit.log;
+RESET pgaudit.log_client;
+RESET pgaudit.log_level;
-- Cleanup
-- Set client_min_messages up to warning to avoid noise
SET client_min_messages = 'warning';
diff --git a/pgaudit.c b/pgaudit.c
index 5e6fd38..ac9ded2 100644
--- a/pgaudit.c
+++ b/pgaudit.c
@@ -11,6 +11,7 @@
#include "postgres.h"
#include "access/htup_details.h"
+#include "access/parallel.h"
#include "access/sysattr.h"
#include "access/xact.h"
#include "access/relation.h"
@@ -1303,7 +1304,7 @@ pgaudit_ExecutorStart_hook(QueryDesc *queryDesc, int eflags)
{
AuditEventStackItem *stackItem = NULL;
- if (!internalStatement)
+ if (!internalStatement && !IsParallelWorker())
{
/* Push the audit even onto the stack */
stackItem = stack_push();
@@ -1384,7 +1385,7 @@ pgaudit_ExecutorCheckPerms_hook(List *rangeTabls, bool abort)
/* Log DML if the audit role is valid or session logging is enabled */
if ((auditOid != InvalidOid || auditLogBitmap != 0) &&
- !IsAbortedTransactionBlockState())
+ !IsAbortedTransactionBlockState() && !IsParallelWorker())
{
/* If auditLogRows is on, wait for rows processed to be set */
if (auditLogRows && auditEventStack != NULL)
@@ -1438,7 +1439,7 @@ pgaudit_ExecutorRun_hook(QueryDesc *queryDesc, ScanDirection direction, uint64 c
else
standard_ExecutorRun(queryDesc, direction, count, execute_once);
- if (auditLogRows && !internalStatement)
+ if (auditLogRows && !internalStatement && !IsParallelWorker())
{
/* Find an item from the stack by the query memory context */
stackItem = stack_find_context(queryDesc->estate->es_query_cxt);
@@ -1458,7 +1459,7 @@ pgaudit_ExecutorEnd_hook(QueryDesc *queryDesc)
AuditEventStackItem *stackItem = NULL;
AuditEventStackItem *auditEventStackFull = NULL;
- if (auditLogRows && !internalStatement)
+ if (auditLogRows && !internalStatement && !IsParallelWorker())
{
/* Find an item from the stack by the query memory context */
stackItem = stack_find_context(queryDesc->estate->es_query_cxt);
diff --git a/sql/pgaudit.sql b/sql/pgaudit.sql
index 8052426..7f0667b 100644
--- a/sql/pgaudit.sql
+++ b/sql/pgaudit.sql
@@ -1612,6 +1612,36 @@ COMMIT;
DROP TABLE part_test;
+--
+-- Test logging in parallel workers
+SET pgaudit.log = 'read';
+SET pgaudit.log_client = on;
+SET pgaudit.log_level = 'notice';
+
+-- Force parallel execution for testing
+SET max_parallel_workers_per_gather = 2;
+SET parallel_tuple_cost = 0;
+SET parallel_setup_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET min_parallel_index_scan_size = 0;
+
+-- Create table with enough data to trigger parallel execution
+CREATE TABLE parallel_test (id int, data text);
+INSERT INTO parallel_test SELECT generate_series(1, 1000), 'test data';
+
+SELECT count(*) FROM parallel_test;
+
+-- Cleanup parallel test
+DROP TABLE parallel_test;
+RESET max_parallel_workers_per_gather;
+RESET parallel_tuple_cost;
+RESET parallel_setup_cost;
+RESET min_parallel_table_scan_size;
+RESET min_parallel_index_scan_size;
+RESET pgaudit.log;
+RESET pgaudit.log_client;
+RESET pgaudit.log_level;
+
-- Cleanup
-- Set client_min_messages up to warning to avoid noise
SET client_min_messages = 'warning';

View File

@@ -0,0 +1,143 @@
commit cc708dde7ef2af2a8120d757102d2e34c0463a0f
Author: Tristan Partin <tristan.partin@databricks.com>
Date: 2025-06-23 02:09:31 +0000
Disable logging in parallel workers
When a query uses parallel workers, pgaudit will log the same query for
every parallel worker. This is undesireable since it can result in log
amplification for queries that use parallel workers.
Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
diff --git a/expected/pgaudit.out b/expected/pgaudit.out
index 8772054..9b66ac6 100644
--- a/expected/pgaudit.out
+++ b/expected/pgaudit.out
@@ -2556,6 +2556,37 @@ DROP SERVER fdw_server;
NOTICE: AUDIT: SESSION,11,1,DDL,DROP SERVER,,,DROP SERVER fdw_server;,<not logged>
DROP EXTENSION postgres_fdw;
NOTICE: AUDIT: SESSION,12,1,DDL,DROP EXTENSION,,,DROP EXTENSION postgres_fdw;,<not logged>
+--
+-- Test logging in parallel workers
+SET pgaudit.log = 'read';
+SET pgaudit.log_client = on;
+SET pgaudit.log_level = 'notice';
+-- Force parallel execution for testing
+SET max_parallel_workers_per_gather = 2;
+SET parallel_tuple_cost = 0;
+SET parallel_setup_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET min_parallel_index_scan_size = 0;
+-- Create table with enough data to trigger parallel execution
+CREATE TABLE parallel_test (id int, data text);
+INSERT INTO parallel_test SELECT generate_series(1, 1000), 'test data';
+SELECT count(*) FROM parallel_test;
+NOTICE: AUDIT: SESSION,13,1,READ,SELECT,,,SELECT count(*) FROM parallel_test;,<not logged>
+ count
+-------
+ 1000
+(1 row)
+
+-- Cleanup parallel test
+DROP TABLE parallel_test;
+RESET max_parallel_workers_per_gather;
+RESET parallel_tuple_cost;
+RESET parallel_setup_cost;
+RESET min_parallel_table_scan_size;
+RESET min_parallel_index_scan_size;
+RESET pgaudit.log;
+RESET pgaudit.log_client;
+RESET pgaudit.log_level;
-- Cleanup
-- Set client_min_messages up to warning to avoid noise
SET client_min_messages = 'warning';
diff --git a/pgaudit.c b/pgaudit.c
index 004d1f9..f061164 100644
--- a/pgaudit.c
+++ b/pgaudit.c
@@ -11,6 +11,7 @@
#include "postgres.h"
#include "access/htup_details.h"
+#include "access/parallel.h"
#include "access/sysattr.h"
#include "access/xact.h"
#include "access/relation.h"
@@ -1339,7 +1340,7 @@ pgaudit_ExecutorStart_hook(QueryDesc *queryDesc, int eflags)
{
AuditEventStackItem *stackItem = NULL;
- if (!internalStatement)
+ if (!internalStatement && !IsParallelWorker())
{
/* Push the audit even onto the stack */
stackItem = stack_push();
@@ -1420,7 +1421,7 @@ pgaudit_ExecutorCheckPerms_hook(List *rangeTabls, List *permInfos, bool abort)
/* Log DML if the audit role is valid or session logging is enabled */
if ((auditOid != InvalidOid || auditLogBitmap != 0) &&
- !IsAbortedTransactionBlockState())
+ !IsAbortedTransactionBlockState() && !IsParallelWorker())
{
/* If auditLogRows is on, wait for rows processed to be set */
if (auditLogRows && auditEventStack != NULL)
@@ -1475,7 +1476,7 @@ pgaudit_ExecutorRun_hook(QueryDesc *queryDesc, ScanDirection direction, uint64 c
else
standard_ExecutorRun(queryDesc, direction, count, execute_once);
- if (auditLogRows && !internalStatement)
+ if (auditLogRows && !internalStatement && !IsParallelWorker())
{
/* Find an item from the stack by the query memory context */
stackItem = stack_find_context(queryDesc->estate->es_query_cxt);
@@ -1495,7 +1496,7 @@ pgaudit_ExecutorEnd_hook(QueryDesc *queryDesc)
AuditEventStackItem *stackItem = NULL;
AuditEventStackItem *auditEventStackFull = NULL;
- if (auditLogRows && !internalStatement)
+ if (auditLogRows && !internalStatement && !IsParallelWorker())
{
/* Find an item from the stack by the query memory context */
stackItem = stack_find_context(queryDesc->estate->es_query_cxt);
diff --git a/sql/pgaudit.sql b/sql/pgaudit.sql
index 6aae88b..de6d7fd 100644
--- a/sql/pgaudit.sql
+++ b/sql/pgaudit.sql
@@ -1631,6 +1631,36 @@ DROP USER MAPPING FOR regress_user1 SERVER fdw_server;
DROP SERVER fdw_server;
DROP EXTENSION postgres_fdw;
+--
+-- Test logging in parallel workers
+SET pgaudit.log = 'read';
+SET pgaudit.log_client = on;
+SET pgaudit.log_level = 'notice';
+
+-- Force parallel execution for testing
+SET max_parallel_workers_per_gather = 2;
+SET parallel_tuple_cost = 0;
+SET parallel_setup_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET min_parallel_index_scan_size = 0;
+
+-- Create table with enough data to trigger parallel execution
+CREATE TABLE parallel_test (id int, data text);
+INSERT INTO parallel_test SELECT generate_series(1, 1000), 'test data';
+
+SELECT count(*) FROM parallel_test;
+
+-- Cleanup parallel test
+DROP TABLE parallel_test;
+RESET max_parallel_workers_per_gather;
+RESET parallel_tuple_cost;
+RESET parallel_setup_cost;
+RESET min_parallel_table_scan_size;
+RESET min_parallel_index_scan_size;
+RESET pgaudit.log;
+RESET pgaudit.log_client;
+RESET pgaudit.log_level;
+
-- Cleanup
-- Set client_min_messages up to warning to avoid noise
SET client_min_messages = 'warning';

View File

@@ -0,0 +1,143 @@
commit 8d02e4c6c5e1e8676251b0717a46054267091cb4
Author: Tristan Partin <tristan.partin@databricks.com>
Date: 2025-06-23 02:09:31 +0000
Disable logging in parallel workers
When a query uses parallel workers, pgaudit will log the same query for
every parallel worker. This is undesireable since it can result in log
amplification for queries that use parallel workers.
Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
diff --git a/expected/pgaudit.out b/expected/pgaudit.out
index d696287..4b1059a 100644
--- a/expected/pgaudit.out
+++ b/expected/pgaudit.out
@@ -2568,6 +2568,37 @@ DROP SERVER fdw_server;
NOTICE: AUDIT: SESSION,11,1,DDL,DROP SERVER,,,DROP SERVER fdw_server,<not logged>
DROP EXTENSION postgres_fdw;
NOTICE: AUDIT: SESSION,12,1,DDL,DROP EXTENSION,,,DROP EXTENSION postgres_fdw,<not logged>
+--
+-- Test logging in parallel workers
+SET pgaudit.log = 'read';
+SET pgaudit.log_client = on;
+SET pgaudit.log_level = 'notice';
+-- Force parallel execution for testing
+SET max_parallel_workers_per_gather = 2;
+SET parallel_tuple_cost = 0;
+SET parallel_setup_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET min_parallel_index_scan_size = 0;
+-- Create table with enough data to trigger parallel execution
+CREATE TABLE parallel_test (id int, data text);
+INSERT INTO parallel_test SELECT generate_series(1, 1000), 'test data';
+SELECT count(*) FROM parallel_test;
+NOTICE: AUDIT: SESSION,13,1,READ,SELECT,,,SELECT count(*) FROM parallel_test,<not logged>
+ count
+-------
+ 1000
+(1 row)
+
+-- Cleanup parallel test
+DROP TABLE parallel_test;
+RESET max_parallel_workers_per_gather;
+RESET parallel_tuple_cost;
+RESET parallel_setup_cost;
+RESET min_parallel_table_scan_size;
+RESET min_parallel_index_scan_size;
+RESET pgaudit.log;
+RESET pgaudit.log_client;
+RESET pgaudit.log_level;
-- Cleanup
-- Set client_min_messages up to warning to avoid noise
SET client_min_messages = 'warning';
diff --git a/pgaudit.c b/pgaudit.c
index 1764af1..0e48875 100644
--- a/pgaudit.c
+++ b/pgaudit.c
@@ -11,6 +11,7 @@
#include "postgres.h"
#include "access/htup_details.h"
+#include "access/parallel.h"
#include "access/sysattr.h"
#include "access/xact.h"
#include "access/relation.h"
@@ -1406,7 +1407,7 @@ pgaudit_ExecutorStart_hook(QueryDesc *queryDesc, int eflags)
{
AuditEventStackItem *stackItem = NULL;
- if (!internalStatement)
+ if (!internalStatement && !IsParallelWorker())
{
/* Push the audit event onto the stack */
stackItem = stack_push();
@@ -1489,7 +1490,7 @@ pgaudit_ExecutorCheckPerms_hook(List *rangeTabls, List *permInfos, bool abort)
/* Log DML if the audit role is valid or session logging is enabled */
if ((auditOid != InvalidOid || auditLogBitmap != 0) &&
- !IsAbortedTransactionBlockState())
+ !IsAbortedTransactionBlockState() && !IsParallelWorker())
{
/* If auditLogRows is on, wait for rows processed to be set */
if (auditLogRows && auditEventStack != NULL)
@@ -1544,7 +1545,7 @@ pgaudit_ExecutorRun_hook(QueryDesc *queryDesc, ScanDirection direction, uint64 c
else
standard_ExecutorRun(queryDesc, direction, count, execute_once);
- if (auditLogRows && !internalStatement)
+ if (auditLogRows && !internalStatement && !IsParallelWorker())
{
/* Find an item from the stack by the query memory context */
stackItem = stack_find_context(queryDesc->estate->es_query_cxt);
@@ -1564,7 +1565,7 @@ pgaudit_ExecutorEnd_hook(QueryDesc *queryDesc)
AuditEventStackItem *stackItem = NULL;
AuditEventStackItem *auditEventStackFull = NULL;
- if (auditLogRows && !internalStatement)
+ if (auditLogRows && !internalStatement && !IsParallelWorker())
{
/* Find an item from the stack by the query memory context */
stackItem = stack_find_context(queryDesc->estate->es_query_cxt);
diff --git a/sql/pgaudit.sql b/sql/pgaudit.sql
index e161f01..c873098 100644
--- a/sql/pgaudit.sql
+++ b/sql/pgaudit.sql
@@ -1637,6 +1637,36 @@ DROP USER MAPPING FOR regress_user1 SERVER fdw_server;
DROP SERVER fdw_server;
DROP EXTENSION postgres_fdw;
+--
+-- Test logging in parallel workers
+SET pgaudit.log = 'read';
+SET pgaudit.log_client = on;
+SET pgaudit.log_level = 'notice';
+
+-- Force parallel execution for testing
+SET max_parallel_workers_per_gather = 2;
+SET parallel_tuple_cost = 0;
+SET parallel_setup_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET min_parallel_index_scan_size = 0;
+
+-- Create table with enough data to trigger parallel execution
+CREATE TABLE parallel_test (id int, data text);
+INSERT INTO parallel_test SELECT generate_series(1, 1000), 'test data';
+
+SELECT count(*) FROM parallel_test;
+
+-- Cleanup parallel test
+DROP TABLE parallel_test;
+RESET max_parallel_workers_per_gather;
+RESET parallel_tuple_cost;
+RESET parallel_setup_cost;
+RESET min_parallel_table_scan_size;
+RESET min_parallel_index_scan_size;
+RESET pgaudit.log;
+RESET pgaudit.log_client;
+RESET pgaudit.log_level;
+
-- Cleanup
-- Set client_min_messages up to warning to avoid noise
SET client_min_messages = 'warning';

View File

@@ -26,7 +26,7 @@ commands:
- name: postgres-exporter
user: nobody
sysvInitAction: respawn
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter" /bin/postgres_exporter --config.file=/etc/postgres_exporter.yml'
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter pgaudit.log=none" /bin/postgres_exporter --config.file=/etc/postgres_exporter.yml'
- name: pgbouncer-exporter
user: postgres
sysvInitAction: respawn
@@ -59,7 +59,7 @@ files:
# the rules use ALL as the hostname. Avoid the pointless lookups and the "unable to
# resolve host" log messages that they generate.
Defaults !fqdn
# Allow postgres user (which is what compute_ctl runs as) to run /neonvm/bin/resize-swap
# and /neonvm/bin/set-disk-quota as root without requiring entering a password (NOPASSWD),
# regardless of hostname (ALL)

View File

@@ -26,7 +26,7 @@ commands:
- name: postgres-exporter
user: nobody
sysvInitAction: respawn
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter" /bin/postgres_exporter --config.file=/etc/postgres_exporter.yml'
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter pgaudit.log=none" /bin/postgres_exporter --config.file=/etc/postgres_exporter.yml'
- name: pgbouncer-exporter
user: postgres
sysvInitAction: respawn
@@ -59,7 +59,7 @@ files:
# the rules use ALL as the hostname. Avoid the pointless lookups and the "unable to
# resolve host" log messages that they generate.
Defaults !fqdn
# Allow postgres user (which is what compute_ctl runs as) to run /neonvm/bin/resize-swap
# and /neonvm/bin/set-disk-quota as root without requiring entering a password (NOPASSWD),
# regardless of hostname (ALL)

View File

@@ -27,6 +27,7 @@ fail.workspace = true
flate2.workspace = true
futures.workspace = true
http.workspace = true
hostname-validator = "1.1"
indexmap.workspace = true
itertools.workspace = true
jsonwebtoken.workspace = true
@@ -66,6 +67,7 @@ uuid.workspace = true
walkdir.workspace = true
x509-cert.workspace = true
postgres_versioninfo.workspace = true
postgres_initdb.workspace = true
compute_api.workspace = true
utils.workspace = true

View File

@@ -36,6 +36,8 @@
use std::ffi::OsString;
use std::fs::File;
use std::process::exit;
use std::sync::Arc;
use std::sync::atomic::AtomicU64;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;
@@ -190,7 +192,9 @@ fn main() -> Result<()> {
cgroup: cli.cgroup,
#[cfg(target_os = "linux")]
vm_monitor_addr: cli.vm_monitor_addr,
installed_extensions_collection_interval: cli.installed_extensions_collection_interval,
installed_extensions_collection_interval: Arc::new(AtomicU64::new(
cli.installed_extensions_collection_interval,
)),
},
config,
)?;

View File

@@ -29,7 +29,7 @@ use anyhow::{Context, bail};
use aws_config::BehaviorVersion;
use camino::{Utf8Path, Utf8PathBuf};
use clap::{Parser, Subcommand};
use compute_tools::extension_server::{PostgresMajorVersion, get_pg_version};
use compute_tools::extension_server::get_pg_version;
use nix::unistd::Pid;
use std::ops::Not;
use tracing::{Instrument, error, info, info_span, warn};
@@ -179,12 +179,8 @@ impl PostgresProcess {
.await
.context("create pgdata directory")?;
let pg_version = match get_pg_version(self.pgbin.as_ref()) {
PostgresMajorVersion::V14 => 14,
PostgresMajorVersion::V15 => 15,
PostgresMajorVersion::V16 => 16,
PostgresMajorVersion::V17 => 17,
};
let pg_version = get_pg_version(self.pgbin.as_ref());
postgres_initdb::do_run_initdb(postgres_initdb::RunInitdbArgs {
superuser: initdb_user,
locale: DEFAULT_LOCALE, // XXX: this shouldn't be hard-coded,
@@ -486,10 +482,8 @@ async fn cmd_pgdata(
};
let superuser = "cloud_admin";
let destination_connstring = format!(
"host=localhost port={} user={} dbname=neondb",
pg_port, superuser
);
let destination_connstring =
format!("host=localhost port={pg_port} user={superuser} dbname=neondb");
let pgdata_dir = workdir.join("pgdata");
let mut proc = PostgresProcess::new(pgdata_dir.clone(), pg_bin_dir.clone(), pg_lib_dir.clone());

View File

@@ -69,7 +69,7 @@ impl clap::builder::TypedValueParser for S3Uri {
S3Uri::from_str(value_str).map_err(|e| {
clap::Error::raw(
clap::error::ErrorKind::InvalidValue,
format!("Failed to parse S3 URI: {}", e),
format!("Failed to parse S3 URI: {e}"),
)
})
}

View File

@@ -22,7 +22,7 @@ pub async fn get_dbs_and_roles(compute: &Arc<ComputeNode>) -> anyhow::Result<Cat
spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
eprintln!("connection error: {e}");
}
});
@@ -119,7 +119,7 @@ pub async fn get_database_schema(
_ => {
let mut lines = stderr_reader.lines();
if let Some(line) = lines.next_line().await? {
if line.contains(&format!("FATAL: database \"{}\" does not exist", dbname)) {
if line.contains(&format!("FATAL: database \"{dbname}\" does not exist")) {
return Err(SchemaDumpError::DatabaseDoesNotExist);
}
warn!("pg_dump stderr: {}", line)

View File

@@ -6,7 +6,8 @@ use compute_api::responses::{
LfcPrewarmState, TlsConfig,
};
use compute_api::spec::{
ComputeAudit, ComputeFeature, ComputeMode, ComputeSpec, ExtVersion, PgIdent,
ComputeAudit, ComputeFeature, ComputeMode, ComputeSpec, ExtVersion, PageserverConnectionInfo,
PageserverShardConnectionInfo, PgIdent,
};
use futures::StreamExt;
use futures::future::join_all;
@@ -15,29 +16,29 @@ use itertools::Itertools;
use nix::sys::signal::{Signal, kill};
use nix::unistd::Pid;
use once_cell::sync::Lazy;
use pageserver_page_api as page_api;
use pageserver_page_api::{self as page_api, BaseBackupCompression};
use postgres;
use postgres::NoTls;
use postgres::error::SqlState;
use remote_storage::{DownloadError, RemotePath};
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;
use std::os::unix::fs::{PermissionsExt, symlink};
use std::path::Path;
use std::process::{Command, Stdio};
use std::str::FromStr;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::sync::{Arc, Condvar, Mutex, RwLock};
use std::time::{Duration, Instant};
use std::{env, fs};
use tokio::spawn;
use tokio_util::io::StreamReader;
use tokio::task::JoinHandle;
use tokio::{spawn, time};
use tracing::{Instrument, debug, error, info, instrument, warn};
use url::Url;
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
use utils::measured_stream::MeasuredReader;
use utils::pid_file;
use utils::shard::{ShardCount, ShardIndex, ShardNumber};
use crate::configurator::launch_configurator;
use crate::disk_quota::set_disk_quota;
@@ -71,6 +72,7 @@ pub static BUILD_TAG: Lazy<String> = Lazy::new(|| {
.unwrap_or(BUILD_TAG_DEFAULT)
.to_string()
});
const DEFAULT_INSTALLED_EXTENSIONS_COLLECTION_INTERVAL: u64 = 3600;
/// Static configuration params that don't change after startup. These mostly
/// come from the CLI args, or are derived from them.
@@ -104,9 +106,11 @@ pub struct ComputeNodeParams {
pub remote_ext_base_url: Option<Url>,
/// Interval for installed extensions collection
pub installed_extensions_collection_interval: u64,
pub installed_extensions_collection_interval: Arc<AtomicU64>,
}
type TaskHandle = Mutex<Option<JoinHandle<()>>>;
/// Compute node info shared across several `compute_ctl` threads.
pub struct ComputeNode {
pub params: ComputeNodeParams,
@@ -127,6 +131,10 @@ pub struct ComputeNode {
// key: ext_archive_name, value: started download time, download_completed?
pub ext_download_progress: RwLock<HashMap<String, (DateTime<Utc>, bool)>>,
pub compute_ctl_config: ComputeCtlConfig,
/// Handle to the extension stats collection task
extension_stats_task: TaskHandle,
lfc_offload_task: TaskHandle,
}
// store some metrics about download size that might impact startup time
@@ -217,10 +225,11 @@ pub struct ParsedSpec {
pub spec: ComputeSpec,
pub tenant_id: TenantId,
pub timeline_id: TimelineId,
pub pageserver_connstr: String,
pub pageserver_conninfo: PageserverConnectionInfo,
pub safekeeper_connstrings: Vec<String>,
pub storage_auth_token: Option<String>,
pub endpoint_storage_addr: Option<SocketAddr>,
/// k8s dns name and port
pub endpoint_storage_addr: Option<String>,
pub endpoint_storage_token: Option<String>,
}
@@ -252,8 +261,7 @@ impl ParsedSpec {
// duplicate entry?
if current == previous {
return Err(format!(
"duplicate entry in safekeeper_connstrings: {}!",
current,
"duplicate entry in safekeeper_connstrings: {current}!",
));
}
@@ -264,6 +272,27 @@ impl ParsedSpec {
}
}
fn extract_pageserver_conninfo_from_guc(
pageserver_connstring_guc: &str,
) -> PageserverConnectionInfo {
PageserverConnectionInfo {
shards: pageserver_connstring_guc
.split(',')
.enumerate()
.map(|(i, connstr)| {
(
i as u32,
PageserverShardConnectionInfo {
libpq_url: Some(connstr.to_string()),
grpc_url: None,
},
)
})
.collect(),
prefer_grpc: false,
}
}
impl TryFrom<ComputeSpec> for ParsedSpec {
type Error = String;
fn try_from(spec: ComputeSpec) -> Result<Self, String> {
@@ -273,11 +302,17 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
// For backwards-compatibility, the top-level fields in the spec file
// may be empty. In that case, we need to dig them from the GUCs in the
// cluster.settings field.
let pageserver_connstr = spec
.pageserver_connstring
.clone()
.or_else(|| spec.cluster.settings.find("neon.pageserver_connstring"))
.ok_or("pageserver connstr should be provided")?;
let pageserver_conninfo = match &spec.pageserver_connection_info {
Some(x) => x.clone(),
None => {
if let Some(guc) = spec.cluster.settings.find("neon.pageserver_connstring") {
extract_pageserver_conninfo_from_guc(&guc)
} else {
return Err("pageserver connstr should be provided".to_string());
}
}
};
let safekeeper_connstrings = if spec.safekeeper_connstrings.is_empty() {
if matches!(spec.mode, ComputeMode::Primary) {
spec.cluster
@@ -316,13 +351,10 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
.or(Err("invalid timeline id"))?
};
let endpoint_storage_addr: Option<SocketAddr> = spec
let endpoint_storage_addr: Option<String> = spec
.endpoint_storage_addr
.clone()
.or_else(|| spec.cluster.settings.find("neon.endpoint_storage_addr"))
.unwrap_or_default()
.parse()
.ok();
.or_else(|| spec.cluster.settings.find("neon.endpoint_storage_addr"));
let endpoint_storage_token = spec
.endpoint_storage_token
.clone()
@@ -330,7 +362,7 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
let res = ParsedSpec {
spec,
pageserver_connstr,
pageserver_conninfo,
safekeeper_connstrings,
storage_auth_token,
tenant_id,
@@ -368,7 +400,7 @@ fn maybe_cgexec(cmd: &str) -> Command {
struct PostgresHandle {
postgres: std::process::Child,
log_collector: tokio::task::JoinHandle<Result<()>>,
log_collector: JoinHandle<Result<()>>,
}
impl PostgresHandle {
@@ -382,7 +414,7 @@ struct StartVmMonitorResult {
#[cfg(target_os = "linux")]
token: tokio_util::sync::CancellationToken,
#[cfg(target_os = "linux")]
vm_monitor: Option<tokio::task::JoinHandle<Result<()>>>,
vm_monitor: Option<JoinHandle<Result<()>>>,
}
impl ComputeNode {
@@ -408,11 +440,11 @@ impl ComputeNode {
// that can affect `compute_ctl` and prevent it from properly configuring the database schema.
// Unset them via connection string options before connecting to the database.
// N.B. keep it in sync with `ZENITH_OPTIONS` in `get_maintenance_client()`.
const EXTRA_OPTIONS: &str = "-c role=cloud_admin -c default_transaction_read_only=off -c search_path=public -c statement_timeout=0";
const EXTRA_OPTIONS: &str = "-c role=cloud_admin -c default_transaction_read_only=off -c search_path=public -c statement_timeout=0 -c pgaudit.log=none";
let options = match conn_conf.get_options() {
// Allow the control plane to override any options set by the
// compute
Some(options) => format!("{} {}", EXTRA_OPTIONS, options),
Some(options) => format!("{EXTRA_OPTIONS} {options}"),
None => EXTRA_OPTIONS.to_string(),
};
conn_conf.options(&options);
@@ -432,6 +464,8 @@ impl ComputeNode {
state_changed: Condvar::new(),
ext_download_progress: RwLock::new(HashMap::new()),
compute_ctl_config: config.compute_ctl_config,
extension_stats_task: Mutex::new(None),
lfc_offload_task: Mutex::new(None),
})
}
@@ -519,6 +553,9 @@ impl ComputeNode {
None
};
this.terminate_extension_stats_task();
this.terminate_lfc_offload_task();
// Terminate the vm_monitor so it releases the file watcher on
// /sys/fs/cgroup/neon-postgres.
// Note: the vm-monitor only runs on linux because it requires cgroups.
@@ -755,10 +792,15 @@ impl ComputeNode {
// Configure and start rsyslog for compliance audit logging
match pspec.spec.audit_log_level {
ComputeAudit::Hipaa | ComputeAudit::Extended | ComputeAudit::Full => {
let remote_endpoint =
let remote_tls_endpoint =
std::env::var("AUDIT_LOGGING_TLS_ENDPOINT").unwrap_or("".to_string());
let remote_plain_endpoint =
std::env::var("AUDIT_LOGGING_ENDPOINT").unwrap_or("".to_string());
if remote_endpoint.is_empty() {
anyhow::bail!("AUDIT_LOGGING_ENDPOINT is empty");
if remote_plain_endpoint.is_empty() && remote_tls_endpoint.is_empty() {
anyhow::bail!(
"AUDIT_LOGGING_ENDPOINT and AUDIT_LOGGING_TLS_ENDPOINT are both empty"
);
}
let log_directory_path = Path::new(&self.params.pgdata).join("log");
@@ -774,7 +816,8 @@ impl ComputeNode {
log_directory_path.clone(),
endpoint_id,
project_id,
&remote_endpoint,
&remote_plain_endpoint,
&remote_tls_endpoint,
)?;
// Launch a background task to clean up the audit logs
@@ -841,12 +884,15 @@ impl ComputeNode {
// Log metrics so that we can search for slow operations in logs
info!(?metrics, postmaster_pid = %postmaster_pid, "compute start finished");
// Spawn the extension stats background task
self.spawn_extension_stats_task();
if pspec.spec.autoprewarm {
info!("autoprewarming on startup as requested");
self.prewarm_lfc(None);
}
if let Some(seconds) = pspec.spec.offload_lfc_interval_seconds {
self.spawn_lfc_offload_task(Duration::from_secs(seconds.into()));
};
Ok(())
}
@@ -1001,84 +1047,87 @@ impl ComputeNode {
Ok(())
}
// Get basebackup from the libpq connection to pageserver using `connstr` and
// unarchive it to `pgdata` directory overriding all its previous content.
/// Fetches a basebackup from the Pageserver using the compute state's Pageserver connstring and
/// unarchives it to `pgdata` directory, replacing any existing contents.
#[instrument(skip_all, fields(%lsn))]
fn try_get_basebackup(&self, compute_state: &ComputeState, lsn: Lsn) -> Result<()> {
let spec = compute_state.pspec.as_ref().expect("spec must be set");
let shard0_connstr = spec.pageserver_connstr.split(',').next().unwrap();
match Url::parse(shard0_connstr)?.scheme() {
"postgres" | "postgresql" => self.try_get_basebackup_libpq(spec, lsn),
"grpc" => self.try_get_basebackup_grpc(spec, lsn),
scheme => return Err(anyhow!("unknown URL scheme {scheme}")),
}
}
let started = Instant::now();
let (connected, size) = if spec.pageserver_conninfo.prefer_grpc {
self.try_get_basebackup_grpc(spec, lsn)?
} else {
self.try_get_basebackup_libpq(spec, lsn)?
};
fn try_get_basebackup_grpc(&self, spec: &ParsedSpec, lsn: Lsn) -> Result<()> {
let start_time = Instant::now();
let shard0_connstr = spec
.pageserver_connstr
.split(',')
.next()
.unwrap()
.to_string();
let chunks = tokio::runtime::Handle::current().block_on(async move {
let mut client = page_api::proto::PageServiceClient::connect(shard0_connstr).await?;
let req = page_api::proto::GetBaseBackupRequest {
lsn: lsn.0,
replica: false, // TODO: handle replicas, with LSN 0
full: false,
};
let mut req = tonic::Request::new(req);
let metadata = req.metadata_mut();
metadata.insert("neon-tenant-id", spec.tenant_id.to_string().parse()?);
metadata.insert("neon-timeline-id", spec.timeline_id.to_string().parse()?);
metadata.insert("neon-shard-id", "0000".to_string().parse()?); // TODO: shard count
if let Some(auth) = spec.storage_auth_token.as_ref() {
metadata.insert("authorization", format!("Bearer {auth}").parse()?);
}
let chunks = client.get_base_backup(req).await?.into_inner();
anyhow::Ok(chunks)
})?;
let pageserver_connect_micros = start_time.elapsed().as_micros() as u64;
// Convert the chunks stream into an AsyncRead
let stream_reader = StreamReader::new(
chunks.map(|chunk| chunk.map(|c| c.chunk).map_err(std::io::Error::other)),
);
// Wrap the AsyncRead into a blocking reader for compatibility with tar::Archive
let reader = tokio_util::io::SyncIoBridge::new(stream_reader);
let mut measured_reader = MeasuredReader::new(reader);
let mut bufreader = std::io::BufReader::new(&mut measured_reader);
// Read the archive directly from the `CopyOutReader`
//
// Set `ignore_zeros` so that unpack() reads all the Copy data and
// doesn't stop at the end-of-archive marker. Otherwise, if the server
// sends an Error after finishing the tarball, we will not notice it.
let mut ar = tar::Archive::new(&mut bufreader);
ar.set_ignore_zeros(true);
ar.unpack(&self.params.pgdata)?;
// Report metrics
let mut state = self.state.lock().unwrap();
state.metrics.pageserver_connect_micros = pageserver_connect_micros;
state.metrics.basebackup_bytes = measured_reader.get_byte_count() as u64;
state.metrics.basebackup_ms = start_time.elapsed().as_millis() as u64;
state.metrics.pageserver_connect_micros =
connected.duration_since(started).as_micros() as u64;
state.metrics.basebackup_bytes = size as u64;
state.metrics.basebackup_ms = started.elapsed().as_millis() as u64;
Ok(())
}
fn try_get_basebackup_libpq(&self, spec: &ParsedSpec, lsn: Lsn) -> Result<()> {
let start_time = Instant::now();
/// Fetches a basebackup via gRPC. The connstring must use grpc://. Returns the timestamp when
/// the connection was established, and the (compressed) size of the basebackup.
fn try_get_basebackup_grpc(&self, spec: &ParsedSpec, lsn: Lsn) -> Result<(Instant, usize)> {
let shard0 = spec
.pageserver_conninfo
.shards
.get(&0)
.expect("shard 0 connection info missing");
let shard0_url = shard0.grpc_url.clone().expect("no grpc_url for shard 0");
let shard0_connstr = spec.pageserver_connstr.split(',').next().unwrap();
let mut config = postgres::Config::from_str(shard0_connstr)?;
let shard_index = match spec.pageserver_conninfo.shards.len() as u8 {
0 | 1 => ShardIndex::unsharded(),
count => ShardIndex::new(ShardNumber(0), ShardCount(count)),
};
let (reader, connected) = tokio::runtime::Handle::current().block_on(async move {
let mut client = page_api::Client::connect(
shard0_url,
spec.tenant_id,
spec.timeline_id,
shard_index,
spec.storage_auth_token.clone(),
None, // NB: base backups use payload compression
)
.await?;
let connected = Instant::now();
let reader = client
.get_base_backup(page_api::GetBaseBackupRequest {
lsn: (lsn != Lsn(0)).then_some(lsn),
compression: BaseBackupCompression::Gzip,
replica: spec.spec.mode != ComputeMode::Primary,
full: false,
})
.await?;
anyhow::Ok((reader, connected))
})?;
let mut reader = MeasuredReader::new(tokio_util::io::SyncIoBridge::new(reader));
// Set `ignore_zeros` so that unpack() reads the entire stream and doesn't just stop at the
// end-of-archive marker. If the server errors, the tar::Builder drop handler will write an
// end-of-archive marker before the error is emitted, and we would not see the error.
let mut ar = tar::Archive::new(flate2::read::GzDecoder::new(&mut reader));
ar.set_ignore_zeros(true);
ar.unpack(&self.params.pgdata)?;
Ok((connected, reader.get_byte_count()))
}
/// Fetches a basebackup via libpq. The connstring must use postgresql://. Returns the timestamp
/// when the connection was established, and the (compressed) size of the basebackup.
fn try_get_basebackup_libpq(&self, spec: &ParsedSpec, lsn: Lsn) -> Result<(Instant, usize)> {
let shard0 = spec
.pageserver_conninfo
.shards
.get(&0)
.expect("shard 0 connection info missing");
let shard0_connstr = shard0.libpq_url.clone().expect("no libpq_url for shard 0");
let mut config = postgres::Config::from_str(&shard0_connstr)?;
// Use the storage auth token from the config file, if given.
// Note: this overrides any password set in the connection string.
@@ -1097,7 +1146,7 @@ impl ComputeNode {
// Connect to pageserver
let mut client = config.connect(NoTls)?;
let pageserver_connect_micros = start_time.elapsed().as_micros() as u64;
let connected = Instant::now();
let basebackup_cmd = match lsn {
Lsn(0) => {
@@ -1134,16 +1183,13 @@ impl ComputeNode {
// Set `ignore_zeros` so that unpack() reads all the Copy data and
// doesn't stop at the end-of-archive marker. Otherwise, if the server
// sends an Error after finishing the tarball, we will not notice it.
// The tar::Builder drop handler will write an end-of-archive marker
// before emitting the error, and we would not see it otherwise.
let mut ar = tar::Archive::new(flate2::read::GzDecoder::new(&mut bufreader));
ar.set_ignore_zeros(true);
ar.unpack(&self.params.pgdata)?;
// Report metrics
let mut state = self.state.lock().unwrap();
state.metrics.pageserver_connect_micros = pageserver_connect_micros;
state.metrics.basebackup_bytes = measured_reader.get_byte_count() as u64;
state.metrics.basebackup_ms = start_time.elapsed().as_millis() as u64;
Ok(())
Ok((connected, measured_reader.get_byte_count()))
}
// Gets the basebackup in a retry loop
@@ -1193,7 +1239,7 @@ impl ComputeNode {
let sk_configs = sk_connstrs.into_iter().map(|connstr| {
// Format connstr
let id = connstr.clone();
let connstr = format!("postgresql://no_user@{}", connstr);
let connstr = format!("postgresql://no_user@{connstr}");
let options = format!(
"-c timeline_id={} tenant_id={}",
pspec.timeline_id, pspec.tenant_id
@@ -1376,16 +1422,8 @@ impl ComputeNode {
}
};
info!(
"getting basebackup@{} from pageserver {}",
lsn, &pspec.pageserver_connstr
);
self.get_basebackup(compute_state, lsn).with_context(|| {
format!(
"failed to get basebackup@{} from pageserver {}",
lsn, &pspec.pageserver_connstr
)
})?;
self.get_basebackup(compute_state, lsn)
.with_context(|| format!("failed to get basebackup@{lsn}"))?;
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path)?;
@@ -1556,7 +1594,7 @@ impl ComputeNode {
let (mut client, connection) = conf.connect(NoTls).await?;
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
eprintln!("connection error: {e}");
}
});
@@ -1677,6 +1715,8 @@ impl ComputeNode {
tls_config = self.compute_ctl_config.tls.clone();
}
self.update_installed_extensions_collection_interval(&spec);
let max_concurrent_connections = self.max_service_connections(compute_state, &spec);
// Merge-apply spec & changes to PostgreSQL state.
@@ -1699,7 +1739,7 @@ impl ComputeNode {
Ok((mut client, connection)) => {
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
eprintln!("connection error: {e}");
}
});
if let Err(e) = handle_migrations(&mut client).await {
@@ -1741,6 +1781,8 @@ impl ComputeNode {
let tls_config = self.tls_config(&spec);
self.update_installed_extensions_collection_interval(&spec);
if let Some(ref pgbouncer_settings) = spec.pgbouncer_settings {
info!("tuning pgbouncer");
@@ -2003,7 +2045,7 @@ impl ComputeNode {
let (client, connection) = connect_result.unwrap();
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
eprintln!("connection error: {e}");
}
});
let result = client
@@ -2172,7 +2214,7 @@ LIMIT 100",
db_client
.simple_query(&query)
.await
.with_context(|| format!("Failed to execute query: {}", query))?;
.with_context(|| format!("Failed to execute query: {query}"))?;
}
Ok(())
@@ -2199,7 +2241,7 @@ LIMIT 100",
let version: Option<ExtVersion> = db_client
.query_opt(version_query, &[&ext_name])
.await
.with_context(|| format!("Failed to execute query: {}", version_query))?
.with_context(|| format!("Failed to execute query: {version_query}"))?
.map(|row| row.get(0));
// sanitize the inputs as postgres idents.
@@ -2214,14 +2256,14 @@ LIMIT 100",
db_client
.simple_query(&query)
.await
.with_context(|| format!("Failed to execute query: {}", query))?;
.with_context(|| format!("Failed to execute query: {query}"))?;
} else {
let query =
format!("CREATE EXTENSION IF NOT EXISTS {ext_name} WITH VERSION {quoted_version}");
db_client
.simple_query(&query)
.await
.with_context(|| format!("Failed to execute query: {}", query))?;
.with_context(|| format!("Failed to execute query: {query}"))?;
}
Ok(ext_version)
@@ -2320,22 +2362,22 @@ LIMIT 100",
/// The operation will time out after a specified duration.
pub fn wait_timeout_while_pageserver_connstr_unchanged(&self, duration: Duration) {
let state = self.state.lock().unwrap();
let old_pageserver_connstr = state
let old_pageserver_conninfo = state
.pspec
.as_ref()
.expect("spec must be set")
.pageserver_connstr
.pageserver_conninfo
.clone();
let mut unchanged = true;
let _ = self
.state_changed
.wait_timeout_while(state, duration, |s| {
let pageserver_connstr = &s
let pageserver_conninfo = &s
.pspec
.as_ref()
.expect("spec must be set")
.pageserver_connstr;
unchanged = pageserver_connstr == &old_pageserver_connstr;
.pageserver_conninfo;
unchanged = pageserver_conninfo == &old_pageserver_conninfo;
unchanged
})
.unwrap();
@@ -2345,24 +2387,92 @@ LIMIT 100",
}
pub fn spawn_extension_stats_task(&self) {
self.terminate_extension_stats_task();
let conf = self.tokio_conn_conf.clone();
let installed_extensions_collection_interval =
self.params.installed_extensions_collection_interval;
tokio::spawn(async move {
// An initial sleep is added to ensure that two collections don't happen at the same time.
// The first collection happens during compute startup.
tokio::time::sleep(tokio::time::Duration::from_secs(
installed_extensions_collection_interval,
))
.await;
let mut interval = tokio::time::interval(tokio::time::Duration::from_secs(
installed_extensions_collection_interval,
));
let atomic_interval = self.params.installed_extensions_collection_interval.clone();
let mut installed_extensions_collection_interval =
2 * atomic_interval.load(std::sync::atomic::Ordering::SeqCst);
info!(
"[NEON_EXT_SPAWN] Spawning background installed extensions worker with Timeout: {}",
installed_extensions_collection_interval
);
let handle = tokio::spawn(async move {
loop {
interval.tick().await;
info!(
"[NEON_EXT_INT_SLEEP]: Interval: {}",
installed_extensions_collection_interval
);
// Sleep at the start of the loop to ensure that two collections don't happen at the same time.
// The first collection happens during compute startup.
tokio::time::sleep(tokio::time::Duration::from_secs(
installed_extensions_collection_interval,
))
.await;
let _ = installed_extensions(conf.clone()).await;
// Acquire a read lock on the compute spec and then update the interval if necessary
installed_extensions_collection_interval = std::cmp::max(
installed_extensions_collection_interval,
2 * atomic_interval.load(std::sync::atomic::Ordering::SeqCst),
);
}
});
// Store the new task handle
*self.extension_stats_task.lock().unwrap() = Some(handle);
}
fn terminate_extension_stats_task(&self) {
if let Some(h) = self.extension_stats_task.lock().unwrap().take() {
h.abort()
}
}
pub fn spawn_lfc_offload_task(self: &Arc<Self>, interval: Duration) {
self.terminate_lfc_offload_task();
let secs = interval.as_secs();
info!("spawning lfc offload worker with {secs}s interval");
let this = self.clone();
let handle = spawn(async move {
let mut interval = time::interval(interval);
interval.tick().await; // returns immediately
loop {
interval.tick().await;
this.offload_lfc_async().await;
}
});
*self.lfc_offload_task.lock().unwrap() = Some(handle);
}
fn terminate_lfc_offload_task(&self) {
if let Some(h) = self.lfc_offload_task.lock().unwrap().take() {
h.abort()
}
}
fn update_installed_extensions_collection_interval(&self, spec: &ComputeSpec) {
// Update the interval for collecting installed extensions statistics
// If the value is -1, we never suspend so set the value to default collection.
// If the value is 0, it means default, we will just continue to use the default.
if spec.suspend_timeout_seconds == -1 || spec.suspend_timeout_seconds == 0 {
info!(
"[NEON_EXT_INT_UPD] Spec Timeout: {}, New Timeout: {}",
spec.suspend_timeout_seconds, DEFAULT_INSTALLED_EXTENSIONS_COLLECTION_INTERVAL
);
self.params.installed_extensions_collection_interval.store(
DEFAULT_INSTALLED_EXTENSIONS_COLLECTION_INTERVAL,
std::sync::atomic::Ordering::SeqCst,
);
} else {
info!(
"[NEON_EXT_INT_UPD] Spec Timeout: {}",
spec.suspend_timeout_seconds
);
self.params.installed_extensions_collection_interval.store(
spec.suspend_timeout_seconds as u64,
std::sync::atomic::Ordering::SeqCst,
);
}
}
}

View File

@@ -5,6 +5,7 @@ use compute_api::responses::LfcOffloadState;
use compute_api::responses::LfcPrewarmState;
use http::StatusCode;
use reqwest::Client;
use std::mem::replace;
use std::sync::Arc;
use tokio::{io::AsyncReadExt, spawn};
use tracing::{error, info};
@@ -88,17 +89,15 @@ impl ComputeNode {
self.state.lock().unwrap().lfc_offload_state.clone()
}
/// Returns false if there is a prewarm request ongoing, true otherwise
/// If there is a prewarm request ongoing, return false, true otherwise
pub fn prewarm_lfc(self: &Arc<Self>, from_endpoint: Option<String>) -> bool {
crate::metrics::LFC_PREWARM_REQUESTS.inc();
{
let state = &mut self.state.lock().unwrap().lfc_prewarm_state;
if let LfcPrewarmState::Prewarming =
std::mem::replace(state, LfcPrewarmState::Prewarming)
{
if let LfcPrewarmState::Prewarming = replace(state, LfcPrewarmState::Prewarming) {
return false;
}
}
crate::metrics::LFC_PREWARMS.inc();
let cloned = self.clone();
spawn(async move {
@@ -152,32 +151,41 @@ impl ComputeNode {
.map(|_| ())
}
/// Returns false if there is an offload request ongoing, true otherwise
/// If offload request is ongoing, return false, true otherwise
pub fn offload_lfc(self: &Arc<Self>) -> bool {
crate::metrics::LFC_OFFLOAD_REQUESTS.inc();
{
let state = &mut self.state.lock().unwrap().lfc_offload_state;
if let LfcOffloadState::Offloading =
std::mem::replace(state, LfcOffloadState::Offloading)
{
if replace(state, LfcOffloadState::Offloading) == LfcOffloadState::Offloading {
return false;
}
}
let cloned = self.clone();
spawn(async move {
let Err(err) = cloned.offload_lfc_impl().await else {
cloned.state.lock().unwrap().lfc_offload_state = LfcOffloadState::Completed;
return;
};
error!(%err);
cloned.state.lock().unwrap().lfc_offload_state = LfcOffloadState::Failed {
error: err.to_string(),
};
});
spawn(async move { cloned.offload_lfc_with_state_update().await });
true
}
pub async fn offload_lfc_async(self: &Arc<Self>) {
{
let state = &mut self.state.lock().unwrap().lfc_offload_state;
if replace(state, LfcOffloadState::Offloading) == LfcOffloadState::Offloading {
return;
}
}
self.offload_lfc_with_state_update().await
}
async fn offload_lfc_with_state_update(&self) {
crate::metrics::LFC_OFFLOADS.inc();
let Err(err) = self.offload_lfc_impl().await else {
self.state.lock().unwrap().lfc_offload_state = LfcOffloadState::Completed;
return;
};
error!(%err);
self.state.lock().unwrap().lfc_offload_state = LfcOffloadState::Failed {
error: err.to_string(),
};
}
async fn offload_lfc_impl(&self) -> Result<()> {
let EndpointStoragePair { url, token } = self.endpoint_storage_pair(None)?;
info!(%url, "requesting LFC state from postgres");

View File

@@ -51,14 +51,56 @@ pub fn write_postgres_conf(
// Write the postgresql.conf content from the spec file as is.
if let Some(conf) = &spec.cluster.postgresql_conf {
writeln!(file, "{}", conf)?;
writeln!(file, "{conf}")?;
}
// Add options for connecting to storage
writeln!(file, "# Neon storage settings")?;
if let Some(s) = &spec.pageserver_connstring {
writeln!(file, "neon.pageserver_connstring={}", escape_conf_value(s))?;
if let Some(conninfo) = &spec.pageserver_connection_info {
let mut libpq_urls: Option<Vec<String>> = Some(Vec::new());
let mut grpc_urls: Option<Vec<String>> = Some(Vec::new());
for shardno in 0..conninfo.shards.len() {
let info = conninfo.shards.get(&(shardno as u32)).ok_or_else(|| {
anyhow::anyhow!("shard {shardno} missing from pageserver_connection_info shard map")
})?;
if let Some(url) = &info.libpq_url {
if let Some(ref mut urls) = libpq_urls {
urls.push(url.clone());
}
} else {
libpq_urls = None
}
if let Some(url) = &info.grpc_url {
if let Some(ref mut urls) = grpc_urls {
urls.push(url.clone());
}
} else {
grpc_urls = None
}
}
if let Some(libpq_urls) = libpq_urls {
writeln!(
file,
"neon.pageserver_connstring={}",
escape_conf_value(&libpq_urls.join(","))
)?;
} else {
writeln!(file, "# no neon.pageserver_connstring")?;
}
if let Some(grpc_urls) = grpc_urls {
writeln!(
file,
"neon.pageserver_grpc_urls={}",
escape_conf_value(&grpc_urls.join(","))
)?;
} else {
writeln!(file, "# no neon.pageserver_grpc_urls")?;
}
}
if let Some(stripe_size) = spec.shard_stripe_size {
writeln!(file, "neon.stripe_size={stripe_size}")?;
}
@@ -70,7 +112,7 @@ pub fn write_postgres_conf(
);
// If generation is given, prepend sk list with g#number:
if let Some(generation) = spec.safekeepers_generation {
write!(neon_safekeepers_value, "g#{}:", generation)?;
write!(neon_safekeepers_value, "g#{generation}:")?;
}
neon_safekeepers_value.push_str(&spec.safekeeper_connstrings.join(","));
writeln!(
@@ -109,8 +151,8 @@ pub fn write_postgres_conf(
tls::update_key_path_blocking(pgdata_path, tls_config);
// these are the default, but good to be explicit.
writeln!(file, "ssl_cert_file = '{}'", SERVER_CRT)?;
writeln!(file, "ssl_key_file = '{}'", SERVER_KEY)?;
writeln!(file, "ssl_cert_file = '{SERVER_CRT}'")?;
writeln!(file, "ssl_key_file = '{SERVER_KEY}'")?;
}
// Locales
@@ -191,8 +233,7 @@ pub fn write_postgres_conf(
}
writeln!(
file,
"shared_preload_libraries='{}{}'",
libs, extra_shared_preload_libraries
"shared_preload_libraries='{libs}{extra_shared_preload_libraries}'"
)?;
} else {
// Typically, this should be unreacheable,
@@ -244,8 +285,7 @@ pub fn write_postgres_conf(
}
writeln!(
file,
"shared_preload_libraries='{}{}'",
libs, extra_shared_preload_libraries
"shared_preload_libraries='{libs}{extra_shared_preload_libraries}'"
)?;
} else {
// Typically, this should be unreacheable,
@@ -263,7 +303,7 @@ pub fn write_postgres_conf(
}
}
writeln!(file, "neon.extension_server_port={}", extension_server_port)?;
writeln!(file, "neon.extension_server_port={extension_server_port}")?;
if spec.drop_subscriptions_before_start {
writeln!(file, "neon.disable_logical_replication_subscribers=true")?;
@@ -291,7 +331,7 @@ where
{
let path = pgdata_path.join("compute_ctl_temp_override.conf");
let mut file = File::create(path)?;
write!(file, "{}", options)?;
write!(file, "{options}")?;
let res = exec();

View File

@@ -10,7 +10,13 @@ input(type="imfile" File="{log_directory}/*.log"
startmsg.regex="^[[:digit:]]{{4}}-[[:digit:]]{{2}}-[[:digit:]]{{2}} [[:digit:]]{{2}}:[[:digit:]]{{2}}:[[:digit:]]{{2}}.[[:digit:]]{{3}} GMT,")
# the directory to store rsyslog state files
global(workDirectory="/var/log/rsyslog")
global(
workDirectory="/var/log/rsyslog"
DefaultNetstreamDriverCAFile="/etc/ssl/certs/ca-certificates.crt"
)
# Whether the remote syslog receiver uses tls
set $.remote_syslog_tls = "{remote_syslog_tls}";
# Construct json, endpoint_id and project_id as additional metadata
set $.json_log!endpoint_id = "{endpoint_id}";
@@ -21,5 +27,29 @@ set $.json_log!msg = $msg;
template(name="PgAuditLog" type="string"
string="<%PRI%>1 %TIMESTAMP:::date-rfc3339% %HOSTNAME% - - - - %$.json_log%")
# Forward to remote syslog receiver (@@<hostname>:<port>;format
local5.info @@{remote_endpoint};PgAuditLog
# Forward to remote syslog receiver (over TLS)
if ( $syslogtag == 'pgaudit_log' ) then {{
if ( $.remote_syslog_tls == 'true' ) then {{
action(type="omfwd" target="{remote_syslog_host}" port="{remote_syslog_port}" protocol="tcp"
template="PgAuditLog"
queue.type="linkedList"
queue.size="1000"
action.ResumeRetryCount="10"
StreamDriver="gtls"
StreamDriverMode="1"
StreamDriverAuthMode="x509/name"
StreamDriverPermittedPeers="{remote_syslog_host}"
StreamDriver.CheckExtendedKeyPurpose="on"
StreamDriver.PermitExpiredCerts="off"
)
stop
}} else {{
action(type="omfwd" target="{remote_syslog_host}" port="{remote_syslog_port}" protocol="tcp"
template="PgAuditLog"
queue.type="linkedList"
queue.size="1000"
action.ResumeRetryCount="10"
)
stop
}}
}}

View File

@@ -74,9 +74,11 @@ More specifically, here is an example ext_index.json
use std::path::Path;
use std::str;
use crate::metrics::{REMOTE_EXT_REQUESTS_TOTAL, UNKNOWN_HTTP_STATUS};
use anyhow::{Context, Result, bail};
use bytes::Bytes;
use compute_api::spec::RemoteExtSpec;
use postgres_versioninfo::PgMajorVersion;
use regex::Regex;
use remote_storage::*;
use reqwest::StatusCode;
@@ -86,8 +88,6 @@ use tracing::log::warn;
use url::Url;
use zstd::stream::read::Decoder;
use crate::metrics::{REMOTE_EXT_REQUESTS_TOTAL, UNKNOWN_HTTP_STATUS};
fn get_pg_config(argument: &str, pgbin: &str) -> String {
// gives the result of `pg_config [argument]`
// where argument is a flag like `--version` or `--sharedir`
@@ -106,7 +106,7 @@ fn get_pg_config(argument: &str, pgbin: &str) -> String {
.to_string()
}
pub fn get_pg_version(pgbin: &str) -> PostgresMajorVersion {
pub fn get_pg_version(pgbin: &str) -> PgMajorVersion {
// pg_config --version returns a (platform specific) human readable string
// such as "PostgreSQL 15.4". We parse this to v14/v15/v16 etc.
let human_version = get_pg_config("--version", pgbin);
@@ -114,25 +114,11 @@ pub fn get_pg_version(pgbin: &str) -> PostgresMajorVersion {
}
pub fn get_pg_version_string(pgbin: &str) -> String {
match get_pg_version(pgbin) {
PostgresMajorVersion::V14 => "v14",
PostgresMajorVersion::V15 => "v15",
PostgresMajorVersion::V16 => "v16",
PostgresMajorVersion::V17 => "v17",
}
.to_owned()
get_pg_version(pgbin).v_str()
}
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum PostgresMajorVersion {
V14,
V15,
V16,
V17,
}
fn parse_pg_version(human_version: &str) -> PostgresMajorVersion {
use PostgresMajorVersion::*;
fn parse_pg_version(human_version: &str) -> PgMajorVersion {
use PgMajorVersion::*;
// Normal releases have version strings like "PostgreSQL 15.4". But there
// are also pre-release versions like "PostgreSQL 17devel" or "PostgreSQL
// 16beta2" or "PostgreSQL 17rc1". And with the --with-extra-version
@@ -143,10 +129,10 @@ fn parse_pg_version(human_version: &str) -> PostgresMajorVersion {
.captures(human_version)
{
Some(captures) if captures.len() == 2 => match &captures["major"] {
"14" => return V14,
"15" => return V15,
"16" => return V16,
"17" => return V17,
"14" => return PG14,
"15" => return PG15,
"16" => return PG16,
"17" => return PG17,
_ => {}
},
_ => {}
@@ -310,10 +296,7 @@ async fn download_extension_tar(remote_ext_base_url: &Url, ext_path: &str) -> Re
async fn do_extension_server_request(uri: Url) -> Result<Bytes, (String, String)> {
let resp = reqwest::get(uri).await.map_err(|e| {
(
format!(
"could not perform remote extensions server request: {:?}",
e
),
format!("could not perform remote extensions server request: {e:?}"),
UNKNOWN_HTTP_STATUS.to_string(),
)
})?;
@@ -323,7 +306,7 @@ async fn do_extension_server_request(uri: Url) -> Result<Bytes, (String, String)
StatusCode::OK => match resp.bytes().await {
Ok(resp) => Ok(resp),
Err(e) => Err((
format!("could not read remote extensions server response: {:?}", e),
format!("could not read remote extensions server response: {e:?}"),
// It's fine to return and report error with status as 200 OK,
// because we still failed to read the response.
status.to_string(),
@@ -334,10 +317,7 @@ async fn do_extension_server_request(uri: Url) -> Result<Bytes, (String, String)
status.to_string(),
)),
_ => Err((
format!(
"unexpected remote extensions server response status code: {}",
status
),
format!("unexpected remote extensions server response status code: {status}"),
status.to_string(),
)),
}
@@ -349,25 +329,25 @@ mod tests {
#[test]
fn test_parse_pg_version() {
use super::PostgresMajorVersion::*;
assert_eq!(parse_pg_version("PostgreSQL 15.4"), V15);
assert_eq!(parse_pg_version("PostgreSQL 15.14"), V15);
use postgres_versioninfo::PgMajorVersion::*;
assert_eq!(parse_pg_version("PostgreSQL 15.4"), PG15);
assert_eq!(parse_pg_version("PostgreSQL 15.14"), PG15);
assert_eq!(
parse_pg_version("PostgreSQL 15.4 (Ubuntu 15.4-0ubuntu0.23.04.1)"),
V15
PG15
);
assert_eq!(parse_pg_version("PostgreSQL 14.15"), V14);
assert_eq!(parse_pg_version("PostgreSQL 14.0"), V14);
assert_eq!(parse_pg_version("PostgreSQL 14.15"), PG14);
assert_eq!(parse_pg_version("PostgreSQL 14.0"), PG14);
assert_eq!(
parse_pg_version("PostgreSQL 14.9 (Debian 14.9-1.pgdg120+1"),
V14
PG14
);
assert_eq!(parse_pg_version("PostgreSQL 16devel"), V16);
assert_eq!(parse_pg_version("PostgreSQL 16beta1"), V16);
assert_eq!(parse_pg_version("PostgreSQL 16rc2"), V16);
assert_eq!(parse_pg_version("PostgreSQL 16extra"), V16);
assert_eq!(parse_pg_version("PostgreSQL 16devel"), PG16);
assert_eq!(parse_pg_version("PostgreSQL 16beta1"), PG16);
assert_eq!(parse_pg_version("PostgreSQL 16rc2"), PG16);
assert_eq!(parse_pg_version("PostgreSQL 16extra"), PG16);
}
#[test]

View File

@@ -65,7 +65,7 @@ pub(in crate::http) async fn configure(
if state.status == ComputeStatus::Failed {
let err = state.error.as_ref().map_or("unknown error", |x| x);
let msg = format!("compute configuration failed: {:?}", err);
let msg = format!("compute configuration failed: {err:?}");
return Err(msg);
}
}

View File

@@ -43,7 +43,7 @@ pub async fn get_installed_extensions(mut conf: Config) -> Result<InstalledExten
let (mut client, connection) = conf.connect(NoTls).await?;
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
eprintln!("connection error: {e}");
}
});
@@ -57,7 +57,7 @@ pub async fn get_installed_extensions(mut conf: Config) -> Result<InstalledExten
let (client, connection) = conf.connect(NoTls).await?;
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
eprintln!("connection error: {e}");
}
});

View File

@@ -4,7 +4,8 @@ use std::thread;
use std::time::{Duration, SystemTime};
use anyhow::{Result, bail};
use compute_api::spec::ComputeMode;
use compute_api::spec::{ComputeMode, PageserverConnectionInfo};
use pageserver_page_api as page_api;
use postgres::{NoTls, SimpleQueryMessage};
use tracing::{info, warn};
use utils::id::{TenantId, TimelineId};
@@ -76,25 +77,16 @@ fn acquire_lsn_lease_with_retry(
loop {
// Note: List of pageservers is dynamic, need to re-read configs before each attempt.
let configs = {
let (conninfo, auth) = {
let state = compute.state.lock().unwrap();
let spec = state.pspec.as_ref().expect("spec must be set");
let conn_strings = spec.pageserver_connstr.split(',');
conn_strings
.map(|connstr| {
let mut config = postgres::Config::from_str(connstr).expect("Invalid connstr");
if let Some(storage_auth_token) = &spec.storage_auth_token {
config.password(storage_auth_token.clone());
}
config
})
.collect::<Vec<_>>()
(
spec.pageserver_conninfo.clone(),
spec.storage_auth_token.clone(),
)
};
let result = try_acquire_lsn_lease(tenant_id, timeline_id, lsn, &configs);
let result = try_acquire_lsn_lease(conninfo, auth.as_deref(), tenant_id, timeline_id, lsn);
match result {
Ok(Some(res)) => {
return Ok(res);
@@ -116,68 +108,112 @@ fn acquire_lsn_lease_with_retry(
}
}
/// Tries to acquire an LSN lease through PS page_service API.
/// Tries to acquire LSN leases on all Pageserver shards.
fn try_acquire_lsn_lease(
conninfo: PageserverConnectionInfo,
auth: Option<&str>,
tenant_id: TenantId,
timeline_id: TimelineId,
lsn: Lsn,
configs: &[postgres::Config],
) -> Result<Option<SystemTime>> {
fn get_valid_until(
config: &postgres::Config,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
lsn: Lsn,
) -> Result<Option<SystemTime>> {
let mut client = config.connect(NoTls)?;
let cmd = format!("lease lsn {} {} {} ", tenant_shard_id, timeline_id, lsn);
let res = client.simple_query(&cmd)?;
let msg = match res.first() {
Some(msg) => msg,
None => bail!("empty response"),
};
let row = match msg {
SimpleQueryMessage::Row(row) => row,
_ => bail!("error parsing lsn lease response"),
let shard_count = conninfo.shards.len();
let mut leases = Vec::new();
for (shard_number, shard) in conninfo.shards.into_iter() {
let tenant_shard_id = match shard_count {
0 | 1 => TenantShardId::unsharded(tenant_id),
shard_count => TenantShardId {
tenant_id,
shard_number: ShardNumber(shard_number as u8),
shard_count: ShardCount::new(shard_count as u8),
},
};
// Note: this will be None if a lease is explicitly not granted.
let valid_until_str = row.get("valid_until");
let valid_until = valid_until_str.map(|s| {
SystemTime::UNIX_EPOCH
.checked_add(Duration::from_millis(u128::from_str(s).unwrap() as u64))
.expect("Time larger than max SystemTime could handle")
});
Ok(valid_until)
let lease = if conninfo.prefer_grpc {
acquire_lsn_lease_grpc(
&shard.grpc_url.unwrap(),
auth,
tenant_shard_id,
timeline_id,
lsn,
)?
} else {
acquire_lsn_lease_libpq(
&shard.libpq_url.unwrap(),
auth,
tenant_shard_id,
timeline_id,
lsn,
)?
};
leases.push(lease);
}
let shard_count = configs.len();
Ok(leases.into_iter().min().flatten())
}
let valid_until = if shard_count > 1 {
configs
.iter()
.enumerate()
.map(|(shard_number, config)| {
let tenant_shard_id = TenantShardId {
tenant_id,
shard_count: ShardCount::new(shard_count as u8),
shard_number: ShardNumber(shard_number as u8),
};
get_valid_until(config, tenant_shard_id, timeline_id, lsn)
})
.collect::<Result<Vec<Option<SystemTime>>>>()?
.into_iter()
.min()
.unwrap()
} else {
get_valid_until(
&configs[0],
TenantShardId::unsharded(tenant_id),
timeline_id,
lsn,
)?
/// Acquires an LSN lease on a single shard, using the libpq API. The connstring must use a
/// postgresql:// scheme.
fn acquire_lsn_lease_libpq(
connstring: &str,
auth: Option<&str>,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
lsn: Lsn,
) -> Result<Option<SystemTime>> {
let mut config = postgres::Config::from_str(connstring)?;
if let Some(auth) = auth {
config.password(auth);
}
let mut client = config.connect(NoTls)?;
let cmd = format!("lease lsn {tenant_shard_id} {timeline_id} {lsn} ");
let res = client.simple_query(&cmd)?;
let msg = match res.first() {
Some(msg) => msg,
None => bail!("empty response"),
};
let row = match msg {
SimpleQueryMessage::Row(row) => row,
_ => bail!("error parsing lsn lease response"),
};
// Note: this will be None if a lease is explicitly not granted.
let valid_until_str = row.get("valid_until");
let valid_until = valid_until_str.map(|s| {
SystemTime::UNIX_EPOCH
.checked_add(Duration::from_millis(u128::from_str(s).unwrap() as u64))
.expect("Time larger than max SystemTime could handle")
});
Ok(valid_until)
}
/// Acquires an LSN lease on a single shard, using the gRPC API. The connstring must use a
/// grpc:// scheme.
fn acquire_lsn_lease_grpc(
connstring: &str,
auth: Option<&str>,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
lsn: Lsn,
) -> Result<Option<SystemTime>> {
tokio::runtime::Handle::current().block_on(async move {
let mut client = page_api::Client::connect(
connstring.to_string(),
tenant_shard_id.tenant_id,
timeline_id,
tenant_shard_id.to_index(),
auth.map(String::from),
None,
)
.await?;
let req = page_api::LeaseLsnRequest { lsn };
match client.lease_lsn(req).await {
Ok(expires) => Ok(Some(expires)),
// Lease couldn't be acquired because the LSN has been garbage collected.
Err(err) if err.code() == tonic::Code::FailedPrecondition => Ok(None),
Err(err) => Err(err.into()),
}
})
}

View File

@@ -97,20 +97,18 @@ pub(crate) static PG_TOTAL_DOWNTIME_MS: Lazy<GenericCounter<AtomicU64>> = Lazy::
.expect("failed to define a metric")
});
/// Needed as neon.file_cache_prewarm_batch == 0 doesn't mean we never tried to prewarm.
/// On the other hand, LFC_PREWARMED_PAGES is excessive as we can GET /lfc/prewarm
pub(crate) static LFC_PREWARM_REQUESTS: Lazy<IntCounter> = Lazy::new(|| {
pub(crate) static LFC_PREWARMS: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"compute_ctl_lfc_prewarm_requests_total",
"Total number of LFC prewarm requests made by compute_ctl",
"compute_ctl_lfc_prewarms_total",
"Total number of LFC prewarms requested by compute_ctl or autoprewarm option",
)
.expect("failed to define a metric")
});
pub(crate) static LFC_OFFLOAD_REQUESTS: Lazy<IntCounter> = Lazy::new(|| {
pub(crate) static LFC_OFFLOADS: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"compute_ctl_lfc_offload_requests_total",
"Total number of LFC offload requests made by compute_ctl",
"compute_ctl_lfc_offloads_total",
"Total number of LFC offloads requested by compute_ctl or lfc_offload_period_seconds option",
)
.expect("failed to define a metric")
});
@@ -124,7 +122,7 @@ pub fn collect() -> Vec<MetricFamily> {
metrics.extend(AUDIT_LOG_DIR_SIZE.collect());
metrics.extend(PG_CURR_DOWNTIME_MS.collect());
metrics.extend(PG_TOTAL_DOWNTIME_MS.collect());
metrics.extend(LFC_PREWARM_REQUESTS.collect());
metrics.extend(LFC_OFFLOAD_REQUESTS.collect());
metrics.extend(LFC_PREWARMS.collect());
metrics.extend(LFC_OFFLOADS.collect());
metrics
}

View File

@@ -36,9 +36,9 @@ pub fn escape_literal(s: &str) -> String {
let res = s.replace('\'', "''").replace('\\', "\\\\");
if res.contains('\\') {
format!("E'{}'", res)
format!("E'{res}'")
} else {
format!("'{}'", res)
format!("'{res}'")
}
}
@@ -46,7 +46,7 @@ pub fn escape_literal(s: &str) -> String {
/// with `'{}'` is not required, as it returns a ready-to-use config string.
pub fn escape_conf_value(s: &str) -> String {
let res = s.replace('\'', "''").replace('\\', "\\\\");
format!("'{}'", res)
format!("'{res}'")
}
pub trait GenericOptionExt {
@@ -446,7 +446,7 @@ pub async fn tune_pgbouncer(
let mut pgbouncer_connstr =
"host=localhost port=6432 dbname=pgbouncer user=postgres sslmode=disable".to_string();
if let Ok(pass) = std::env::var("PGBOUNCER_PASSWORD") {
pgbouncer_connstr.push_str(format!(" password={}", pass).as_str());
pgbouncer_connstr.push_str(format!(" password={pass}").as_str());
}
pgbouncer_connstr
};
@@ -464,7 +464,7 @@ pub async fn tune_pgbouncer(
Ok((client, connection)) => {
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
eprintln!("connection error: {e}");
}
});
break client;

View File

@@ -4,8 +4,10 @@ use std::path::Path;
use std::process::Command;
use std::time::Duration;
use std::{fs::OpenOptions, io::Write};
use url::{Host, Url};
use anyhow::{Context, Result, anyhow};
use hostname_validator;
use tracing::{error, info, instrument, warn};
const POSTGRES_LOGS_CONF_PATH: &str = "/etc/rsyslog.d/postgres_logs.conf";
@@ -82,18 +84,84 @@ fn restart_rsyslog() -> Result<()> {
Ok(())
}
fn parse_audit_syslog_address(
remote_plain_endpoint: &str,
remote_tls_endpoint: &str,
) -> Result<(String, u16, String)> {
let tls;
let remote_endpoint = if !remote_tls_endpoint.is_empty() {
tls = "true".to_string();
remote_tls_endpoint
} else {
tls = "false".to_string();
remote_plain_endpoint
};
// Urlify the remote_endpoint, so parsing can be done with url::Url.
let url_str = format!("http://{remote_endpoint}");
let url = Url::parse(&url_str).map_err(|err| {
anyhow!("Error parsing {remote_endpoint}, expected host:port, got {err:?}")
})?;
let is_valid = url.scheme() == "http"
&& url.path() == "/"
&& url.query().is_none()
&& url.fragment().is_none()
&& url.username() == ""
&& url.password().is_none();
if !is_valid {
return Err(anyhow!(
"Invalid address format {remote_endpoint}, expected host:port"
));
}
let host = match url.host() {
Some(Host::Domain(h)) if hostname_validator::is_valid(h) => h.to_string(),
Some(Host::Ipv4(ip4)) => ip4.to_string(),
Some(Host::Ipv6(ip6)) => ip6.to_string(),
_ => return Err(anyhow!("Invalid host")),
};
let port = url
.port()
.ok_or_else(|| anyhow!("Invalid port in {remote_endpoint}"))?;
Ok((host, port, tls))
}
fn generate_audit_rsyslog_config(
log_directory: String,
endpoint_id: &str,
project_id: &str,
remote_syslog_host: &str,
remote_syslog_port: u16,
remote_syslog_tls: &str,
) -> String {
format!(
include_str!("config_template/compute_audit_rsyslog_template.conf"),
log_directory = log_directory,
endpoint_id = endpoint_id,
project_id = project_id,
remote_syslog_host = remote_syslog_host,
remote_syslog_port = remote_syslog_port,
remote_syslog_tls = remote_syslog_tls
)
}
pub fn configure_audit_rsyslog(
log_directory: String,
endpoint_id: &str,
project_id: &str,
remote_endpoint: &str,
remote_tls_endpoint: &str,
) -> Result<()> {
let config_content: String = format!(
include_str!("config_template/compute_audit_rsyslog_template.conf"),
log_directory = log_directory,
endpoint_id = endpoint_id,
project_id = project_id,
remote_endpoint = remote_endpoint
let (remote_syslog_host, remote_syslog_port, remote_syslog_tls) =
parse_audit_syslog_address(remote_endpoint, remote_tls_endpoint).unwrap();
let config_content = generate_audit_rsyslog_config(
log_directory,
endpoint_id,
project_id,
&remote_syslog_host,
remote_syslog_port,
&remote_syslog_tls,
);
info!("rsyslog config_content: {}", config_content);
@@ -258,6 +326,8 @@ pub fn launch_pgaudit_gc(log_directory: String) {
mod tests {
use crate::rsyslog::PostgresLogsRsyslogConfig;
use super::{generate_audit_rsyslog_config, parse_audit_syslog_address};
#[test]
fn test_postgres_logs_config() {
{
@@ -287,4 +357,146 @@ mod tests {
assert!(res.is_err());
}
}
#[test]
fn test_parse_audit_syslog_address() {
{
// host:port format (plaintext)
let parsed = parse_audit_syslog_address("collector.host.tld:5555", "");
assert!(parsed.is_ok());
assert_eq!(
parsed.unwrap(),
(
String::from("collector.host.tld"),
5555,
String::from("false")
)
);
}
{
// host:port format with ipv4 ip address (plaintext)
let parsed = parse_audit_syslog_address("10.0.0.1:5555", "");
assert!(parsed.is_ok());
assert_eq!(
parsed.unwrap(),
(String::from("10.0.0.1"), 5555, String::from("false"))
);
}
{
// host:port format with ipv6 ip address (plaintext)
let parsed =
parse_audit_syslog_address("[7e60:82ed:cb2e:d617:f904:f395:aaca:e252]:5555", "");
assert_eq!(
parsed.unwrap(),
(
String::from("7e60:82ed:cb2e:d617:f904:f395:aaca:e252"),
5555,
String::from("false")
)
);
}
{
// Only TLS host:port defined
let parsed = parse_audit_syslog_address("", "tls.host.tld:5556");
assert_eq!(
parsed.unwrap(),
(String::from("tls.host.tld"), 5556, String::from("true"))
);
}
{
// tls host should take precedence, when both defined
let parsed = parse_audit_syslog_address("plaintext.host.tld:5555", "tls.host.tld:5556");
assert_eq!(
parsed.unwrap(),
(String::from("tls.host.tld"), 5556, String::from("true"))
);
}
{
// host without port (plaintext)
let parsed = parse_audit_syslog_address("collector.host.tld", "");
assert!(parsed.is_err());
}
{
// port without host
let parsed = parse_audit_syslog_address(":5555", "");
assert!(parsed.is_err());
}
{
// valid host with invalid port
let parsed = parse_audit_syslog_address("collector.host.tld:90001", "");
assert!(parsed.is_err());
}
{
// invalid hostname with valid port
let parsed = parse_audit_syslog_address("-collector.host.tld:5555", "");
assert!(parsed.is_err());
}
{
// parse error
let parsed = parse_audit_syslog_address("collector.host.tld:::5555", "");
assert!(parsed.is_err());
}
}
#[test]
fn test_generate_audit_rsyslog_config() {
{
// plaintext version
let log_directory = "/tmp/log".to_string();
let endpoint_id = "ep-test-endpoint-id";
let project_id = "test-project-id";
let remote_syslog_host = "collector.host.tld";
let remote_syslog_port = 5555;
let remote_syslog_tls = "false";
let conf_str = generate_audit_rsyslog_config(
log_directory,
endpoint_id,
project_id,
remote_syslog_host,
remote_syslog_port,
remote_syslog_tls,
);
assert!(conf_str.contains(r#"set $.remote_syslog_tls = "false";"#));
assert!(conf_str.contains(r#"type="omfwd""#));
assert!(conf_str.contains(r#"target="collector.host.tld""#));
assert!(conf_str.contains(r#"port="5555""#));
assert!(conf_str.contains(r#"StreamDriverPermittedPeers="collector.host.tld""#));
}
{
// TLS version
let log_directory = "/tmp/log".to_string();
let endpoint_id = "ep-test-endpoint-id";
let project_id = "test-project-id";
let remote_syslog_host = "collector.host.tld";
let remote_syslog_port = 5556;
let remote_syslog_tls = "true";
let conf_str = generate_audit_rsyslog_config(
log_directory,
endpoint_id,
project_id,
remote_syslog_host,
remote_syslog_port,
remote_syslog_tls,
);
assert!(conf_str.contains(r#"set $.remote_syslog_tls = "true";"#));
assert!(conf_str.contains(r#"type="omfwd""#));
assert!(conf_str.contains(r#"target="collector.host.tld""#));
assert!(conf_str.contains(r#"port="5556""#));
assert!(conf_str.contains(r#"StreamDriverPermittedPeers="collector.host.tld""#));
}
}
}

View File

@@ -23,12 +23,12 @@ fn do_control_plane_request(
) -> Result<ControlPlaneConfigResponse, (bool, String, String)> {
let resp = reqwest::blocking::Client::new()
.get(uri)
.header("Authorization", format!("Bearer {}", jwt))
.header("Authorization", format!("Bearer {jwt}"))
.send()
.map_err(|e| {
(
true,
format!("could not perform request to control plane: {:?}", e),
format!("could not perform request to control plane: {e:?}"),
UNKNOWN_HTTP_STATUS.to_string(),
)
})?;
@@ -39,7 +39,7 @@ fn do_control_plane_request(
Ok(spec_resp) => Ok(spec_resp),
Err(e) => Err((
true,
format!("could not deserialize control plane response: {:?}", e),
format!("could not deserialize control plane response: {e:?}"),
status.to_string(),
)),
},
@@ -62,7 +62,7 @@ fn do_control_plane_request(
// or some internal failure happened. Doesn't make much sense to retry in this case.
_ => Err((
false,
format!("unexpected control plane response status code: {}", status),
format!("unexpected control plane response status code: {status}"),
status.to_string(),
)),
}

View File

@@ -933,56 +933,53 @@ async fn get_operations<'a>(
PerDatabasePhase::DeleteDBRoleReferences => {
let ctx = ctx.read().await;
let operations =
spec.delta_operations
.iter()
.flatten()
.filter(|op| op.action == "delete_role")
.filter_map(move |op| {
if db.is_owned_by(&op.name) {
return None;
}
if !ctx.roles.contains_key(&op.name) {
return None;
}
let quoted = op.name.pg_quote();
let new_owner = match &db {
DB::SystemDB => PgIdent::from("cloud_admin").pg_quote(),
DB::UserDB(db) => db.owner.pg_quote(),
};
let (escaped_role, outer_tag) = op.name.pg_quote_dollar();
let operations = spec
.delta_operations
.iter()
.flatten()
.filter(|op| op.action == "delete_role")
.filter_map(move |op| {
if db.is_owned_by(&op.name) {
return None;
}
if !ctx.roles.contains_key(&op.name) {
return None;
}
let quoted = op.name.pg_quote();
let new_owner = match &db {
DB::SystemDB => PgIdent::from("cloud_admin").pg_quote(),
DB::UserDB(db) => db.owner.pg_quote(),
};
let (escaped_role, outer_tag) = op.name.pg_quote_dollar();
Some(vec![
// This will reassign all dependent objects to the db owner
Operation {
query: format!(
"REASSIGN OWNED BY {} TO {}",
quoted, new_owner,
),
comment: None,
},
// Revoke some potentially blocking privileges (Neon-specific currently)
Operation {
query: format!(
include_str!("sql/pre_drop_role_revoke_privileges.sql"),
// N.B. this has to be properly dollar-escaped with `pg_quote_dollar()`
role_name = escaped_role,
outer_tag = outer_tag,
),
comment: None,
},
// This now will only drop privileges of the role
// TODO: this is obviously not 100% true because of the above case,
// there could be still some privileges that are not revoked. Maybe this
// only drops privileges that were granted *by this* role, not *to this* role,
// but this has to be checked.
Operation {
query: format!("DROP OWNED BY {}", quoted),
comment: None,
},
])
})
.flatten();
Some(vec![
// This will reassign all dependent objects to the db owner
Operation {
query: format!("REASSIGN OWNED BY {quoted} TO {new_owner}",),
comment: None,
},
// Revoke some potentially blocking privileges (Neon-specific currently)
Operation {
query: format!(
include_str!("sql/pre_drop_role_revoke_privileges.sql"),
// N.B. this has to be properly dollar-escaped with `pg_quote_dollar()`
role_name = escaped_role,
outer_tag = outer_tag,
),
comment: None,
},
// This now will only drop privileges of the role
// TODO: this is obviously not 100% true because of the above case,
// there could be still some privileges that are not revoked. Maybe this
// only drops privileges that were granted *by this* role, not *to this* role,
// but this has to be checked.
Operation {
query: format!("DROP OWNED BY {quoted}"),
comment: None,
},
])
})
.flatten();
Ok(Box::new(operations))
}

View File

@@ -27,7 +27,7 @@ pub async fn ping_safekeeper(
let (client, conn) = config.connect(tokio_postgres::NoTls).await?;
tokio::spawn(async move {
if let Err(e) = conn.await {
eprintln!("connection error: {}", e);
eprintln!("connection error: {e}");
}
});

View File

@@ -3,7 +3,8 @@
"timestamp": "2021-05-23T18:25:43.511Z",
"operation_uuid": "0f657b36-4b0f-4a2d-9c2e-1dcd615e7d8b",
"suspend_timeout_seconds": 3600,
"cluster": {
"cluster_id": "test-cluster-42",
"name": "Zenith Test",

View File

@@ -31,6 +31,7 @@ mod pg_helpers_tests {
wal_level = logical
hot_standby = on
autoprewarm = off
offload_lfc_interval_seconds = 20
neon.safekeepers = '127.0.0.1:6502,127.0.0.1:6503,127.0.0.1:6501'
wal_log_hints = on
log_connections = on

View File

@@ -16,9 +16,9 @@ use std::time::Duration;
use anyhow::{Context, Result, anyhow, bail};
use clap::Parser;
use compute_api::requests::ComputeClaimsScope;
use compute_api::spec::ComputeMode;
use compute_api::spec::{ComputeMode, PageserverConnectionInfo, PageserverShardConnectionInfo};
use control_plane::broker::StorageBroker;
use control_plane::endpoint::{ComputeControlPlane, EndpointTerminateMode, PageserverProtocol};
use control_plane::endpoint::{ComputeControlPlane, EndpointTerminateMode};
use control_plane::endpoint_storage::{ENDPOINT_STORAGE_DEFAULT_ADDR, EndpointStorage};
use control_plane::local_env;
use control_plane::local_env::{
@@ -48,7 +48,7 @@ use postgres_connection::parse_host_port;
use safekeeper_api::membership::{SafekeeperGeneration, SafekeeperId};
use safekeeper_api::{
DEFAULT_HTTP_LISTEN_PORT as DEFAULT_SAFEKEEPER_HTTP_PORT,
DEFAULT_PG_LISTEN_PORT as DEFAULT_SAFEKEEPER_PG_PORT,
DEFAULT_PG_LISTEN_PORT as DEFAULT_SAFEKEEPER_PG_PORT, PgMajorVersion, PgVersionId,
};
use storage_broker::DEFAULT_LISTEN_ADDR as DEFAULT_BROKER_ADDR;
use tokio::task::JoinSet;
@@ -64,7 +64,9 @@ const DEFAULT_PAGESERVER_ID: NodeId = NodeId(1);
const DEFAULT_BRANCH_NAME: &str = "main";
project_git_version!(GIT_VERSION);
const DEFAULT_PG_VERSION: u32 = 17;
#[allow(dead_code)]
const DEFAULT_PG_VERSION: PgMajorVersion = PgMajorVersion::PG17;
const DEFAULT_PG_VERSION_NUM: &str = "17";
const DEFAULT_PAGESERVER_CONTROL_PLANE_API: &str = "http://127.0.0.1:1234/upcall/v1/";
@@ -167,9 +169,9 @@ struct TenantCreateCmdArgs {
#[clap(short = 'c')]
config: Vec<String>,
#[arg(default_value_t = DEFAULT_PG_VERSION)]
#[arg(default_value = DEFAULT_PG_VERSION_NUM)]
#[clap(long, help = "Postgres version to use for the initial timeline")]
pg_version: u32,
pg_version: PgMajorVersion,
#[clap(
long,
@@ -290,9 +292,9 @@ struct TimelineCreateCmdArgs {
#[clap(long, help = "Human-readable alias for the new timeline")]
branch_name: String,
#[arg(default_value_t = DEFAULT_PG_VERSION)]
#[arg(default_value = DEFAULT_PG_VERSION_NUM)]
#[clap(long, help = "Postgres version")]
pg_version: u32,
pg_version: PgMajorVersion,
}
#[derive(clap::Args)]
@@ -322,9 +324,9 @@ struct TimelineImportCmdArgs {
#[clap(long, help = "Lsn the basebackup ends at")]
end_lsn: Option<Lsn>,
#[arg(default_value_t = DEFAULT_PG_VERSION)]
#[arg(default_value = DEFAULT_PG_VERSION_NUM)]
#[clap(long, help = "Postgres version of the backup being imported")]
pg_version: u32,
pg_version: PgMajorVersion,
}
#[derive(clap::Subcommand)]
@@ -601,9 +603,9 @@ struct EndpointCreateCmdArgs {
)]
config_only: bool,
#[arg(default_value_t = DEFAULT_PG_VERSION)]
#[arg(default_value = DEFAULT_PG_VERSION_NUM)]
#[clap(long, help = "Postgres version")]
pg_version: u32,
pg_version: PgMajorVersion,
/// Use gRPC to communicate with Pageservers, by generating grpc:// connstrings.
///
@@ -673,6 +675,16 @@ struct EndpointStartCmdArgs {
#[arg(default_value = "90s")]
start_timeout: Duration,
#[clap(
long,
help = "Download LFC cache from endpoint storage on endpoint startup",
default_value = "false"
)]
autoprewarm: bool,
#[clap(long, help = "Upload LFC cache to endpoint storage periodically")]
offload_lfc_interval_seconds: Option<std::num::NonZeroU64>,
#[clap(
long,
help = "Run in development mode, skipping VM-specific operations like process termination",
@@ -919,7 +931,7 @@ fn print_timeline(
br_sym = "┗━";
}
print!("{} @{}: ", br_sym, ancestor_lsn);
print!("{br_sym} @{ancestor_lsn}: ");
}
// Finally print a timeline id and name with new line
@@ -1295,7 +1307,7 @@ async fn handle_timeline(cmd: &TimelineCmd, env: &mut local_env::LocalEnv) -> Re
},
new_members: None,
};
let pg_version = args.pg_version * 10000;
let pg_version = PgVersionId::from(args.pg_version);
let req = safekeeper_api::models::TimelineCreateRequest {
tenant_id,
timeline_id,
@@ -1504,29 +1516,35 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
)?;
}
let (pageservers, stripe_size) = if let Some(pageserver_id) = pageserver_id {
let conf = env.get_pageserver_conf(pageserver_id).unwrap();
// Use gRPC if requested.
let pageserver = if endpoint.grpc {
let grpc_addr = conf.listen_grpc_addr.as_ref().expect("bad config");
let (host, port) = parse_host_port(grpc_addr)?;
let port = port.unwrap_or(DEFAULT_PAGESERVER_GRPC_PORT);
(PageserverProtocol::Grpc, host, port)
} else {
let (shards, stripe_size) = if let Some(ps_id) = pageserver_id {
let conf = env.get_pageserver_conf(ps_id).unwrap();
let libpq_url = Some({
let (host, port) = parse_host_port(&conf.listen_pg_addr)?;
let port = port.unwrap_or(5432);
(PageserverProtocol::Libpq, host, port)
format!("postgres://no_user@{host}:{port}")
});
let grpc_url = if let Some(grpc_addr) = &conf.listen_grpc_addr {
let (host, port) = parse_host_port(grpc_addr)?;
let port = port.unwrap_or(DEFAULT_PAGESERVER_GRPC_PORT);
Some(format!("grpc://no_user@{host}:{port}"))
} else {
None
};
let pageserver = PageserverShardConnectionInfo {
libpq_url,
grpc_url,
};
// If caller is telling us what pageserver to use, this is not a tenant which is
// fully managed by storage controller, therefore not sharded.
(vec![pageserver], DEFAULT_STRIPE_SIZE)
(vec![(0, pageserver)], DEFAULT_STRIPE_SIZE)
} else {
// Look up the currently attached location of the tenant, and its striping metadata,
// to pass these on to postgres.
let storage_controller = StorageController::from_env(env);
let locate_result = storage_controller.tenant_locate(endpoint.tenant_id).await?;
let pageservers = futures::future::try_join_all(
locate_result.shards.into_iter().map(|shard| async move {
let shards = futures::future::try_join_all(locate_result.shards.into_iter().map(
|shard| async move {
if let ComputeMode::Static(lsn) = endpoint.mode {
// Initialize LSN leases for static computes.
let conf = env.get_pageserver_conf(shard.node_id).unwrap();
@@ -1538,28 +1556,34 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
.await?;
}
let pageserver = if endpoint.grpc {
(
PageserverProtocol::Grpc,
Host::parse(&shard.listen_grpc_addr.expect("no gRPC address"))?,
shard.listen_grpc_port.expect("no gRPC port"),
)
let libpq_host = Host::parse(&shard.listen_pg_addr)?;
let libpq_port = shard.listen_pg_port;
let libpq_url =
Some(format!("postgres://no_user@{libpq_host}:{libpq_port}"));
let grpc_url = if let Some(grpc_host) = shard.listen_grpc_addr {
let grpc_port = shard.listen_grpc_port.expect("no gRPC port");
Some(format!("grpc://no_user@{grpc_host}:{grpc_port}"))
} else {
(
PageserverProtocol::Libpq,
Host::parse(&shard.listen_pg_addr)?,
shard.listen_pg_port,
)
None
};
anyhow::Ok(pageserver)
}),
)
let pageserver = PageserverShardConnectionInfo {
libpq_url,
grpc_url,
};
anyhow::Ok((shard.shard_id.shard_number.0 as u32, pageserver))
},
))
.await?;
let stripe_size = locate_result.shard_params.stripe_size;
(pageservers, stripe_size)
(shards, stripe_size)
};
assert!(!shards.is_empty());
let pageserver_conninfo = PageserverConnectionInfo {
shards: shards.into_iter().collect(),
prefer_grpc: endpoint.grpc,
};
assert!(!pageservers.is_empty());
let ps_conf = env.get_pageserver_conf(DEFAULT_PAGESERVER_ID)?;
let auth_token = if matches!(ps_conf.pg_auth_type, AuthType::NeonJWT) {
@@ -1583,22 +1607,24 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
let endpoint_storage_token = env.generate_auth_token(&claims)?;
let endpoint_storage_addr = env.endpoint_storage.listen_addr.to_string();
let args = control_plane::endpoint::EndpointStartArgs {
auth_token,
endpoint_storage_token,
endpoint_storage_addr,
safekeepers_generation,
safekeepers,
pageserver_conninfo,
remote_ext_base_url: remote_ext_base_url.clone(),
shard_stripe_size: stripe_size.0 as usize,
create_test_user: args.create_test_user,
start_timeout: args.start_timeout,
autoprewarm: args.autoprewarm,
offload_lfc_interval_seconds: args.offload_lfc_interval_seconds,
dev: args.dev,
};
println!("Starting existing endpoint {endpoint_id}...");
endpoint
.start(
&auth_token,
endpoint_storage_token,
endpoint_storage_addr,
safekeepers_generation,
safekeepers,
pageservers,
remote_ext_base_url.as_ref(),
stripe_size.0 as usize,
args.create_test_user,
args.start_timeout,
args.dev,
)
.await?;
endpoint.start(args).await?;
}
EndpointCmd::Reconfigure(args) => {
let endpoint_id = &args.endpoint_id;
@@ -1606,20 +1632,27 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
.endpoints
.get(endpoint_id.as_str())
.with_context(|| format!("postgres endpoint {endpoint_id} is not found"))?;
let pageservers = if let Some(ps_id) = args.endpoint_pageserver_id {
let shards = if let Some(ps_id) = args.endpoint_pageserver_id {
let conf = env.get_pageserver_conf(ps_id)?;
// Use gRPC if requested.
let pageserver = if endpoint.grpc {
let grpc_addr = conf.listen_grpc_addr.as_ref().expect("bad config");
let (host, port) = parse_host_port(grpc_addr)?;
let port = port.unwrap_or(DEFAULT_PAGESERVER_GRPC_PORT);
(PageserverProtocol::Grpc, host, port)
} else {
let libpq_url = Some({
let (host, port) = parse_host_port(&conf.listen_pg_addr)?;
let port = port.unwrap_or(5432);
(PageserverProtocol::Libpq, host, port)
format!("postgres://no_user@{host}:{port}")
});
let grpc_url = if let Some(grpc_addr) = &conf.listen_grpc_addr {
let (host, port) = parse_host_port(grpc_addr)?;
let port = port.unwrap_or(DEFAULT_PAGESERVER_GRPC_PORT);
Some(format!("grpc://no_user@{host}:{port}"))
} else {
None
};
vec![pageserver]
let pageserver = PageserverShardConnectionInfo {
libpq_url,
grpc_url,
};
// If caller is telling us what pageserver to use, this is not a tenant which is
// fully managed by storage controller, therefore not sharded.
vec![(0, pageserver)]
} else {
let storage_controller = StorageController::from_env(env);
storage_controller
@@ -1629,27 +1662,37 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
.into_iter()
.map(|shard| {
// Use gRPC if requested.
if endpoint.grpc {
(
PageserverProtocol::Grpc,
Host::parse(&shard.listen_grpc_addr.expect("no gRPC address"))
.expect("bad hostname"),
shard.listen_grpc_port.expect("no gRPC port"),
)
let libpq_host = Host::parse(&shard.listen_pg_addr).expect("bad hostname");
let libpq_port = shard.listen_pg_port;
let libpq_url =
Some(format!("postgres://no_user@{libpq_host}:{libpq_port}"));
let grpc_url = if let Some(grpc_host) = shard.listen_grpc_addr {
let grpc_port = shard.listen_grpc_port.expect("no gRPC port");
Some(format!("grpc://no_user@{grpc_host}:{grpc_port}"))
} else {
(
PageserverProtocol::Libpq,
Host::parse(&shard.listen_pg_addr).expect("bad hostname"),
shard.listen_pg_port,
)
}
None
};
(
shard.shard_id.shard_number.0 as u32,
PageserverShardConnectionInfo {
libpq_url,
grpc_url,
},
)
})
.collect::<Vec<_>>()
};
let pageserver_conninfo = PageserverConnectionInfo {
shards: shards.into_iter().collect(),
prefer_grpc: endpoint.grpc,
};
// If --safekeepers argument is given, use only the listed
// safekeeper nodes; otherwise all from the env.
let safekeepers = parse_safekeepers(&args.safekeepers)?;
endpoint.reconfigure(pageservers, None, safekeepers).await?;
endpoint
.reconfigure(Some(pageserver_conninfo), None, safekeepers, None)
.await?;
}
EndpointCmd::Stop(args) => {
let endpoint_id = &args.endpoint_id;
@@ -1742,7 +1785,7 @@ async fn handle_pageserver(subcmd: &PageserverCmd, env: &local_env::LocalEnv) ->
StopMode::Immediate => true,
};
if let Err(e) = get_pageserver(env, args.pageserver_id)?.stop(immediate) {
eprintln!("pageserver stop failed: {}", e);
eprintln!("pageserver stop failed: {e}");
exit(1);
}
}
@@ -1751,7 +1794,7 @@ async fn handle_pageserver(subcmd: &PageserverCmd, env: &local_env::LocalEnv) ->
let pageserver = get_pageserver(env, args.pageserver_id)?;
//TODO what shutdown strategy should we use here?
if let Err(e) = pageserver.stop(false) {
eprintln!("pageserver stop failed: {}", e);
eprintln!("pageserver stop failed: {e}");
exit(1);
}
@@ -1768,7 +1811,7 @@ async fn handle_pageserver(subcmd: &PageserverCmd, env: &local_env::LocalEnv) ->
{
Ok(_) => println!("Page server is up and running"),
Err(err) => {
eprintln!("Page server is not available: {}", err);
eprintln!("Page server is not available: {err}");
exit(1);
}
}
@@ -1805,7 +1848,7 @@ async fn handle_storage_controller(
},
};
if let Err(e) = svc.stop(stop_args).await {
eprintln!("stop failed: {}", e);
eprintln!("stop failed: {e}");
exit(1);
}
}
@@ -1827,7 +1870,7 @@ async fn handle_safekeeper(subcmd: &SafekeeperCmd, env: &local_env::LocalEnv) ->
let safekeeper = get_safekeeper(env, args.id)?;
if let Err(e) = safekeeper.start(&args.extra_opt, &args.start_timeout).await {
eprintln!("safekeeper start failed: {}", e);
eprintln!("safekeeper start failed: {e}");
exit(1);
}
}
@@ -1839,7 +1882,7 @@ async fn handle_safekeeper(subcmd: &SafekeeperCmd, env: &local_env::LocalEnv) ->
StopMode::Immediate => true,
};
if let Err(e) = safekeeper.stop(immediate) {
eprintln!("safekeeper stop failed: {}", e);
eprintln!("safekeeper stop failed: {e}");
exit(1);
}
}
@@ -1852,12 +1895,12 @@ async fn handle_safekeeper(subcmd: &SafekeeperCmd, env: &local_env::LocalEnv) ->
};
if let Err(e) = safekeeper.stop(immediate) {
eprintln!("safekeeper stop failed: {}", e);
eprintln!("safekeeper stop failed: {e}");
exit(1);
}
if let Err(e) = safekeeper.start(&args.extra_opt, &args.start_timeout).await {
eprintln!("safekeeper start failed: {}", e);
eprintln!("safekeeper start failed: {e}");
exit(1);
}
}
@@ -2113,7 +2156,7 @@ async fn try_stop_all(env: &local_env::LocalEnv, immediate: bool) {
let storage = EndpointStorage::from_env(env);
if let Err(e) = storage.stop(immediate) {
eprintln!("endpoint_storage stop failed: {:#}", e);
eprintln!("endpoint_storage stop failed: {e:#}");
}
for ps_conf in &env.pageservers {

View File

@@ -59,6 +59,10 @@ use compute_api::spec::{
Cluster, ComputeAudit, ComputeFeature, ComputeMode, ComputeSpec, Database, PgIdent,
RemoteExtSpec, Role,
};
// re-export these, because they're used in the reconfigure() function
pub use compute_api::spec::{PageserverConnectionInfo, PageserverShardConnectionInfo};
use jsonwebtoken::jwk::{
AlgorithmParameters, CommonParameters, EllipticCurve, Jwk, JwkSet, KeyAlgorithm, KeyOperations,
OctetKeyPairParameters, OctetKeyPairType, PublicKeyUse,
@@ -67,13 +71,13 @@ use nix::sys::signal::{Signal, kill};
use pageserver_api::shard::ShardStripeSize;
use pem::Pem;
use reqwest::header::CONTENT_TYPE;
use safekeeper_api::PgMajorVersion;
use safekeeper_api::membership::SafekeeperGeneration;
use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};
use spki::der::Decode;
use spki::{SubjectPublicKeyInfo, SubjectPublicKeyInfoRef};
use tracing::debug;
use url::Host;
use utils::id::{NodeId, TenantId, TimelineId};
use crate::local_env::LocalEnv;
@@ -89,7 +93,7 @@ pub struct EndpointConf {
pg_port: u16,
external_http_port: u16,
internal_http_port: u16,
pg_version: u32,
pg_version: PgMajorVersion,
grpc: bool,
skip_pg_catalog_updates: bool,
reconfigure_concurrency: usize,
@@ -192,7 +196,7 @@ impl ComputeControlPlane {
pg_port: Option<u16>,
external_http_port: Option<u16>,
internal_http_port: Option<u16>,
pg_version: u32,
pg_version: PgMajorVersion,
mode: ComputeMode,
grpc: bool,
skip_pg_catalog_updates: bool,
@@ -312,7 +316,7 @@ pub struct Endpoint {
pub internal_http_address: SocketAddr,
// postgres major version in the format: 14, 15, etc.
pg_version: u32,
pg_version: PgMajorVersion,
// These are not part of the endpoint as such, but the environment
// the endpoint runs in.
@@ -372,27 +376,20 @@ impl std::fmt::Display for EndpointTerminateMode {
}
}
/// Protocol used to connect to a Pageserver.
#[derive(Clone, Copy, Debug)]
pub enum PageserverProtocol {
Libpq,
Grpc,
}
impl PageserverProtocol {
/// Returns the URL scheme for the protocol, used in connstrings.
pub fn scheme(&self) -> &'static str {
match self {
Self::Libpq => "postgresql",
Self::Grpc => "grpc",
}
}
}
impl Display for PageserverProtocol {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.write_str(self.scheme())
}
pub struct EndpointStartArgs {
pub auth_token: Option<String>,
pub endpoint_storage_token: String,
pub endpoint_storage_addr: String,
pub safekeepers_generation: Option<SafekeeperGeneration>,
pub safekeepers: Vec<NodeId>,
pub pageserver_conninfo: PageserverConnectionInfo,
pub remote_ext_base_url: Option<String>,
pub shard_stripe_size: usize,
pub create_test_user: bool,
pub start_timeout: Duration,
pub autoprewarm: bool,
pub offload_lfc_interval_seconds: Option<std::num::NonZeroU64>,
pub dev: bool,
}
impl Endpoint {
@@ -557,7 +554,7 @@ impl Endpoint {
conf.append("hot_standby", "on");
// prefetching of blocks referenced in WAL doesn't make sense for us
// Neon hot standby ignores pages that are not in the shared_buffers
if self.pg_version >= 15 {
if self.pg_version >= PgMajorVersion::PG15 {
conf.append("recovery_prefetch", "off");
}
}
@@ -659,14 +656,6 @@ impl Endpoint {
}
}
fn build_pageserver_connstr(pageservers: &[(PageserverProtocol, Host, u16)]) -> String {
pageservers
.iter()
.map(|(scheme, host, port)| format!("{scheme}://no_user@{host}:{port}"))
.collect::<Vec<_>>()
.join(",")
}
/// Map safekeepers ids to the actual connection strings.
fn build_safekeepers_connstrs(&self, sk_ids: Vec<NodeId>) -> Result<Vec<String>> {
let mut safekeeper_connstrings = Vec::new();
@@ -699,21 +688,7 @@ impl Endpoint {
})
}
#[allow(clippy::too_many_arguments)]
pub async fn start(
&self,
auth_token: &Option<String>,
endpoint_storage_token: String,
endpoint_storage_addr: String,
safekeepers_generation: Option<SafekeeperGeneration>,
safekeepers: Vec<NodeId>,
pageservers: Vec<(PageserverProtocol, Host, u16)>,
remote_ext_base_url: Option<&String>,
shard_stripe_size: usize,
create_test_user: bool,
start_timeout: Duration,
dev: bool,
) -> Result<()> {
pub async fn start(&self, args: EndpointStartArgs) -> Result<()> {
if self.status() == EndpointStatus::Running {
anyhow::bail!("The endpoint is already running");
}
@@ -726,10 +701,7 @@ impl Endpoint {
std::fs::remove_dir_all(self.pgdata())?;
}
let pageserver_connstring = Self::build_pageserver_connstr(&pageservers);
assert!(!pageserver_connstring.is_empty());
let safekeeper_connstrings = self.build_safekeepers_connstrs(safekeepers)?;
let safekeeper_connstrings = self.build_safekeepers_connstrs(args.safekeepers)?;
// check for file remote_extensions_spec.json
// if it is present, read it and pass to compute_ctl
@@ -757,7 +729,7 @@ impl Endpoint {
cluster_id: None, // project ID: not used
name: None, // project name: not used
state: None,
roles: if create_test_user {
roles: if args.create_test_user {
vec![Role {
name: PgIdent::from_str("test").unwrap(),
encrypted_password: None,
@@ -766,7 +738,7 @@ impl Endpoint {
} else {
Vec::new()
},
databases: if create_test_user {
databases: if args.create_test_user {
vec![Database {
name: PgIdent::from_str("neondb").unwrap(),
owner: PgIdent::from_str("test").unwrap(),
@@ -787,21 +759,23 @@ impl Endpoint {
branch_id: None,
endpoint_id: Some(self.endpoint_id.clone()),
mode: self.mode,
pageserver_connstring: Some(pageserver_connstring),
safekeepers_generation: safekeepers_generation.map(|g| g.into_inner()),
pageserver_connection_info: Some(args.pageserver_conninfo),
safekeepers_generation: args.safekeepers_generation.map(|g| g.into_inner()),
safekeeper_connstrings,
storage_auth_token: auth_token.clone(),
storage_auth_token: args.auth_token.clone(),
remote_extensions,
pgbouncer_settings: None,
shard_stripe_size: Some(shard_stripe_size),
shard_stripe_size: Some(args.shard_stripe_size),
local_proxy_config: None,
reconfigure_concurrency: self.reconfigure_concurrency,
drop_subscriptions_before_start: self.drop_subscriptions_before_start,
audit_log_level: ComputeAudit::Disabled,
logs_export_host: None::<String>,
endpoint_storage_addr: Some(endpoint_storage_addr),
endpoint_storage_token: Some(endpoint_storage_token),
autoprewarm: false,
endpoint_storage_addr: Some(args.endpoint_storage_addr),
endpoint_storage_token: Some(args.endpoint_storage_token),
autoprewarm: args.autoprewarm,
offload_lfc_interval_seconds: args.offload_lfc_interval_seconds,
suspend_timeout_seconds: -1, // Only used in neon_local.
};
// this strange code is needed to support respec() in tests
@@ -812,7 +786,7 @@ impl Endpoint {
debug!("spec.cluster {:?}", spec.cluster);
// fill missing fields again
if create_test_user {
if args.create_test_user {
spec.cluster.roles.push(Role {
name: PgIdent::from_str("test").unwrap(),
encrypted_password: None,
@@ -846,10 +820,10 @@ impl Endpoint {
// Launch compute_ctl
let conn_str = self.connstr("cloud_admin", "postgres");
println!("Starting postgres node at '{}'", conn_str);
if create_test_user {
println!("Starting postgres node at '{conn_str}'");
if args.create_test_user {
let conn_str = self.connstr("test", "neondb");
println!("Also at '{}'", conn_str);
println!("Also at '{conn_str}'");
}
let mut cmd = Command::new(self.env.neon_distrib_dir.join("compute_ctl"));
cmd.args([
@@ -879,11 +853,11 @@ impl Endpoint {
.stderr(logfile.try_clone()?)
.stdout(logfile);
if let Some(remote_ext_base_url) = remote_ext_base_url {
cmd.args(["--remote-ext-base-url", remote_ext_base_url]);
if let Some(remote_ext_base_url) = args.remote_ext_base_url {
cmd.args(["--remote-ext-base-url", &remote_ext_base_url]);
}
if dev {
if args.dev {
cmd.arg("--dev");
}
@@ -915,10 +889,11 @@ impl Endpoint {
Ok(state) => {
match state.status {
ComputeStatus::Init => {
if Instant::now().duration_since(start_at) > start_timeout {
let timeout = args.start_timeout;
if Instant::now().duration_since(start_at) > timeout {
bail!(
"compute startup timed out {:?}; still in Init state",
start_timeout
timeout
);
}
// keep retrying
@@ -946,10 +921,10 @@ impl Endpoint {
}
}
Err(e) => {
if Instant::now().duration_since(start_at) > start_timeout {
if Instant::now().duration_since(start_at) > args.start_timeout {
return Err(e).context(format!(
"timed out {:?} waiting to connect to compute_ctl HTTP",
start_timeout,
args.start_timeout
));
}
}
@@ -988,7 +963,7 @@ impl Endpoint {
// reqwest does not export its error construction utility functions, so let's craft the message ourselves
let url = response.url().to_owned();
let msg = match response.text().await {
Ok(err_body) => format!("Error: {}", err_body),
Ok(err_body) => format!("Error: {err_body}"),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
};
Err(anyhow::anyhow!(msg))
@@ -997,12 +972,11 @@ impl Endpoint {
pub async fn reconfigure(
&self,
pageservers: Vec<(PageserverProtocol, Host, u16)>,
pageserver_conninfo: Option<PageserverConnectionInfo>,
stripe_size: Option<ShardStripeSize>,
safekeepers: Option<Vec<NodeId>>,
safekeeper_generation: Option<SafekeeperGeneration>,
) -> Result<()> {
anyhow::ensure!(!pageservers.is_empty(), "no pageservers provided");
let (mut spec, compute_ctl_config) = {
let config_path = self.endpoint_path().join("config.json");
let file = std::fs::File::open(config_path)?;
@@ -1014,8 +988,15 @@ impl Endpoint {
let postgresql_conf = self.read_postgresql_conf()?;
spec.cluster.postgresql_conf = Some(postgresql_conf);
let pageserver_connstr = Self::build_pageserver_connstr(&pageservers);
spec.pageserver_connstring = Some(pageserver_connstr);
if let Some(pageserver_conninfo) = pageserver_conninfo {
// If pageservers are provided, we need to ensure that they are not empty.
// This is a requirement for the compute_ctl configuration.
anyhow::ensure!(
!pageserver_conninfo.shards.is_empty(),
"no pageservers provided"
);
spec.pageserver_connection_info = Some(pageserver_conninfo);
}
if stripe_size.is_some() {
spec.shard_stripe_size = stripe_size.map(|s| s.0 as usize);
}
@@ -1024,6 +1005,9 @@ impl Endpoint {
if let Some(safekeepers) = safekeepers {
let safekeeper_connstrings = self.build_safekeepers_connstrs(safekeepers)?;
spec.safekeeper_connstrings = safekeeper_connstrings;
if let Some(g) = safekeeper_generation {
spec.safekeepers_generation = Some(g.into_inner());
}
}
let client = reqwest::Client::builder()
@@ -1054,13 +1038,31 @@ impl Endpoint {
} else {
let url = response.url().to_owned();
let msg = match response.text().await {
Ok(err_body) => format!("Error: {}", err_body),
Ok(err_body) => format!("Error: {err_body}"),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
};
Err(anyhow::anyhow!(msg))
}
}
pub async fn reconfigure_pageservers(
&self,
pageservers: PageserverConnectionInfo,
stripe_size: Option<ShardStripeSize>,
) -> Result<()> {
self.reconfigure(Some(pageservers), stripe_size, None, None)
.await
}
pub async fn reconfigure_safekeepers(
&self,
safekeepers: Vec<NodeId>,
generation: SafekeeperGeneration,
) -> Result<()> {
self.reconfigure(None, None, Some(safekeepers), Some(generation))
.await
}
pub async fn stop(
&self,
mode: EndpointTerminateMode,

View File

@@ -12,9 +12,11 @@ use std::{env, fs};
use anyhow::{Context, bail};
use clap::ValueEnum;
use pageserver_api::config::PostHogConfig;
use pem::Pem;
use postgres_backend::AuthType;
use reqwest::{Certificate, Url};
use safekeeper_api::PgMajorVersion;
use serde::{Deserialize, Serialize};
use utils::auth::encode_from_key_file;
use utils::id::{NodeId, TenantId, TenantTimelineId, TimelineId};
@@ -210,7 +212,11 @@ pub struct NeonStorageControllerConf {
pub use_local_compute_notifications: bool,
pub timeline_safekeeper_count: Option<i64>,
pub timeline_safekeeper_count: Option<usize>,
pub posthog_config: Option<PostHogConfig>,
pub kick_secondary_downloads: Option<bool>,
}
impl NeonStorageControllerConf {
@@ -242,6 +248,8 @@ impl Default for NeonStorageControllerConf {
use_https_safekeeper_api: false,
use_local_compute_notifications: true,
timeline_safekeeper_count: None,
posthog_config: None,
kick_secondary_downloads: None,
}
}
}
@@ -257,7 +265,7 @@ impl Default for EndpointStorageConf {
impl NeonBroker {
pub fn client_url(&self) -> Url {
let url = if let Some(addr) = self.listen_https_addr {
format!("https://{}", addr)
format!("https://{addr}")
} else {
format!(
"http://{}",
@@ -421,25 +429,21 @@ impl LocalEnv {
self.pg_distrib_dir.clone()
}
pub fn pg_distrib_dir(&self, pg_version: u32) -> anyhow::Result<PathBuf> {
pub fn pg_distrib_dir(&self, pg_version: PgMajorVersion) -> anyhow::Result<PathBuf> {
let path = self.pg_distrib_dir.clone();
#[allow(clippy::manual_range_patterns)]
match pg_version {
14 | 15 | 16 | 17 => Ok(path.join(format!("v{pg_version}"))),
_ => bail!("Unsupported postgres version: {}", pg_version),
}
Ok(path.join(pg_version.v_str()))
}
pub fn pg_dir(&self, pg_version: u32, dir_name: &str) -> anyhow::Result<PathBuf> {
pub fn pg_dir(&self, pg_version: PgMajorVersion, dir_name: &str) -> anyhow::Result<PathBuf> {
Ok(self.pg_distrib_dir(pg_version)?.join(dir_name))
}
pub fn pg_bin_dir(&self, pg_version: u32) -> anyhow::Result<PathBuf> {
pub fn pg_bin_dir(&self, pg_version: PgMajorVersion) -> anyhow::Result<PathBuf> {
self.pg_dir(pg_version, "bin")
}
pub fn pg_lib_dir(&self, pg_version: u32) -> anyhow::Result<PathBuf> {
pub fn pg_lib_dir(&self, pg_version: PgMajorVersion) -> anyhow::Result<PathBuf> {
self.pg_dir(pg_version, "lib")
}
@@ -730,7 +734,7 @@ impl LocalEnv {
let config_toml_path = dentry.path().join("pageserver.toml");
let config_toml: PageserverConfigTomlSubset = toml_edit::de::from_str(
&std::fs::read_to_string(&config_toml_path)
.with_context(|| format!("read {:?}", config_toml_path))?,
.with_context(|| format!("read {config_toml_path:?}"))?,
)
.context("parse pageserver.toml")?;
let identity_toml_path = dentry.path().join("identity.toml");
@@ -740,7 +744,7 @@ impl LocalEnv {
}
let identity_toml: IdentityTomlSubset = toml_edit::de::from_str(
&std::fs::read_to_string(&identity_toml_path)
.with_context(|| format!("read {:?}", identity_toml_path))?,
.with_context(|| format!("read {identity_toml_path:?}"))?,
)
.context("parse identity.toml")?;
let PageserverConfigTomlSubset {

View File

@@ -22,6 +22,7 @@ use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api;
use postgres_backend::AuthType;
use postgres_connection::{PgConnectionConfig, parse_host_port};
use safekeeper_api::PgMajorVersion;
use utils::auth::{Claims, Scope};
use utils::id::{NodeId, TenantId, TimelineId};
use utils::lsn::Lsn;
@@ -121,7 +122,7 @@ impl PageServerNode {
.env
.generate_auth_token(&Claims::new(None, Scope::GenerationsApi))
.unwrap();
overrides.push(format!("control_plane_api_token='{}'", jwt_token));
overrides.push(format!("control_plane_api_token='{jwt_token}'"));
}
if !conf.other.contains_key("remote_storage") {
@@ -607,7 +608,7 @@ impl PageServerNode {
timeline_id: TimelineId,
base: (Lsn, PathBuf),
pg_wal: Option<(Lsn, PathBuf)>,
pg_version: u32,
pg_version: PgMajorVersion,
) -> anyhow::Result<()> {
// Init base reader
let (start_lsn, base_tarfile_path) = base;

View File

@@ -143,7 +143,7 @@ impl SafekeeperNode {
let id_string = id.to_string();
// TODO: add availability_zone to the config.
// Right now we just specify any value here and use it to check metrics in tests.
let availability_zone = format!("sk-{}", id_string);
let availability_zone = format!("sk-{id_string}");
let mut args = vec![
"-D".to_owned(),

View File

@@ -6,6 +6,8 @@ use std::str::FromStr;
use std::sync::OnceLock;
use std::time::{Duration, Instant};
use crate::background_process;
use crate::local_env::{LocalEnv, NeonStorageControllerConf};
use camino::{Utf8Path, Utf8PathBuf};
use hyper0::Uri;
use nix::unistd::Pid;
@@ -22,6 +24,7 @@ use pageserver_client::mgmt_api::ResponseErrorMessageExt;
use pem::Pem;
use postgres_backend::AuthType;
use reqwest::{Method, Response};
use safekeeper_api::PgMajorVersion;
use serde::de::DeserializeOwned;
use serde::{Deserialize, Serialize};
use tokio::process::Command;
@@ -31,9 +34,6 @@ use utils::auth::{Claims, Scope, encode_from_key_file};
use utils::id::{NodeId, TenantId};
use whoami::username;
use crate::background_process;
use crate::local_env::{LocalEnv, NeonStorageControllerConf};
pub struct StorageController {
env: LocalEnv,
private_key: Option<Pem>,
@@ -48,7 +48,7 @@ pub struct StorageController {
const COMMAND: &str = "storage_controller";
const STORAGE_CONTROLLER_POSTGRES_VERSION: u32 = 16;
const STORAGE_CONTROLLER_POSTGRES_VERSION: PgMajorVersion = PgMajorVersion::PG16;
const DB_NAME: &str = "storage_controller";
@@ -167,7 +167,7 @@ impl StorageController {
fn storage_controller_instance_dir(&self, instance_id: u8) -> PathBuf {
self.env
.base_data_dir
.join(format!("storage_controller_{}", instance_id))
.join(format!("storage_controller_{instance_id}"))
}
fn pid_file(&self, instance_id: u8) -> Utf8PathBuf {
@@ -184,9 +184,15 @@ impl StorageController {
/// to other versions if that one isn't found. Some automated tests create circumstances
/// where only one version is available in pg_distrib_dir, such as `test_remote_extensions`.
async fn get_pg_dir(&self, dir_name: &str) -> anyhow::Result<Utf8PathBuf> {
let prefer_versions = [STORAGE_CONTROLLER_POSTGRES_VERSION, 16, 15, 14];
const PREFER_VERSIONS: [PgMajorVersion; 5] = [
STORAGE_CONTROLLER_POSTGRES_VERSION,
PgMajorVersion::PG16,
PgMajorVersion::PG15,
PgMajorVersion::PG14,
PgMajorVersion::PG17,
];
for v in prefer_versions {
for v in PREFER_VERSIONS {
let path = Utf8PathBuf::from_path_buf(self.env.pg_dir(v, dir_name)?).unwrap();
if tokio::fs::try_exists(&path).await? {
return Ok(path);
@@ -220,7 +226,7 @@ impl StorageController {
"-d",
DB_NAME,
"-p",
&format!("{}", postgres_port),
&format!("{postgres_port}"),
];
let pg_lib_dir = self.get_pg_lib_dir().await.unwrap();
let envs = [
@@ -263,7 +269,7 @@ impl StorageController {
"-h",
"localhost",
"-p",
&format!("{}", postgres_port),
&format!("{postgres_port}"),
"-U",
&username(),
"-O",
@@ -425,7 +431,7 @@ impl StorageController {
// from `LocalEnv`'s config file (`.neon/config`).
tokio::fs::write(
&pg_data_path.join("postgresql.conf"),
format!("port = {}\nfsync=off\n", postgres_port),
format!("port = {postgres_port}\nfsync=off\n"),
)
.await?;
@@ -477,7 +483,7 @@ impl StorageController {
self.setup_database(postgres_port).await?;
}
let database_url = format!("postgresql://localhost:{}/{DB_NAME}", postgres_port);
let database_url = format!("postgresql://localhost:{postgres_port}/{DB_NAME}");
// We support running a startup SQL script to fiddle with the database before we launch storcon.
// This is used by the test suite.
@@ -508,7 +514,7 @@ impl StorageController {
drop(client);
conn.await??;
let addr = format!("{}:{}", host, listen_port);
let addr = format!("{host}:{listen_port}");
let address_for_peers = Uri::builder()
.scheme(scheme)
.authority(addr.clone())
@@ -557,6 +563,10 @@ impl StorageController {
args.push("--use-local-compute-notifications".to_string());
}
if let Some(value) = self.config.kick_secondary_downloads {
args.push(format!("--kick-secondary-downloads={value}"));
}
if let Some(ssl_ca_file) = self.env.ssl_ca_cert_path() {
args.push(format!("--ssl-ca-file={}", ssl_ca_file.to_str().unwrap()));
}
@@ -628,10 +638,28 @@ impl StorageController {
args.push("--timelines-onto-safekeepers".to_string());
}
if let Some(sk_cnt) = self.config.timeline_safekeeper_count {
// neon_local is used in test environments where we often have less than 3 safekeepers.
if self.config.timeline_safekeeper_count.is_some() || self.env.safekeepers.len() < 3 {
let sk_cnt = self
.config
.timeline_safekeeper_count
.unwrap_or(self.env.safekeepers.len());
args.push(format!("--timeline-safekeeper-count={sk_cnt}"));
}
let mut envs = vec![
("LD_LIBRARY_PATH".to_owned(), pg_lib_dir.to_string()),
("DYLD_LIBRARY_PATH".to_owned(), pg_lib_dir.to_string()),
];
if let Some(posthog_config) = &self.config.posthog_config {
envs.push((
"POSTHOG_CONFIG".to_string(),
serde_json::to_string(posthog_config)?,
));
}
println!("Starting storage controller");
background_process::start_process(
@@ -639,10 +667,7 @@ impl StorageController {
&instance_dir,
&self.env.storage_controller_bin(),
args,
vec![
("LD_LIBRARY_PATH".to_owned(), pg_lib_dir.to_string()),
("DYLD_LIBRARY_PATH".to_owned(), pg_lib_dir.to_string()),
],
envs,
background_process::InitialPidFile::Create(self.pid_file(start_args.instance_id)),
&start_args.start_timeout,
|| async {
@@ -806,9 +831,9 @@ impl StorageController {
builder = builder.json(&body)
}
if let Some(private_key) = &self.private_key {
println!("Getting claims for path {}", path);
println!("Getting claims for path {path}");
if let Some(required_claims) = Self::get_claims_for_path(&path)? {
println!("Got claims {:?} for path {}", required_claims, path);
println!("Got claims {required_claims:?} for path {path}");
let jwt_token = encode_from_key_file(&required_claims, private_key)?;
builder = builder.header(
reqwest::header::AUTHORIZATION,

View File

@@ -65,12 +65,27 @@ enum Command {
#[arg(long)]
scheduling: Option<NodeSchedulingPolicy>,
},
// Set a node status as deleted.
/// Exists for backup usage and will be removed in future.
/// Use [`Command::NodeStartDelete`] instead, if possible.
NodeDelete {
#[arg(long)]
node_id: NodeId,
},
/// Start deletion of the specified pageserver.
NodeStartDelete {
#[arg(long)]
node_id: NodeId,
},
/// Cancel deletion of the specified pageserver and wait for `timeout`
/// for the operation to be canceled. May be retried.
NodeCancelDelete {
#[arg(long)]
node_id: NodeId,
#[arg(long)]
timeout: humantime::Duration,
},
/// Delete a tombstone of node from the storage controller.
/// This is used when we want to allow the node to be re-registered.
NodeDeleteTombstone {
#[arg(long)]
node_id: NodeId,
@@ -649,7 +664,7 @@ async fn main() -> anyhow::Result<()> {
response
.new_shards
.iter()
.map(|s| format!("{:?}", s))
.map(|s| format!("{s:?}"))
.collect::<Vec<_>>()
.join(",")
);
@@ -771,8 +786,8 @@ async fn main() -> anyhow::Result<()> {
println!("Tenant {tenant_id}");
let mut table = comfy_table::Table::new();
table.add_row(["Policy", &format!("{:?}", policy)]);
table.add_row(["Stripe size", &format!("{:?}", stripe_size)]);
table.add_row(["Policy", &format!("{policy:?}")]);
table.add_row(["Stripe size", &format!("{stripe_size:?}")]);
table.add_row(["Config", &serde_json::to_string_pretty(&config).unwrap()]);
println!("{table}");
println!("Shards:");
@@ -789,7 +804,7 @@ async fn main() -> anyhow::Result<()> {
let secondary = shard
.node_secondary
.iter()
.map(|n| format!("{}", n))
.map(|n| format!("{n}"))
.collect::<Vec<_>>()
.join(",");
@@ -863,7 +878,7 @@ async fn main() -> anyhow::Result<()> {
}
} else {
// Make it obvious to the user that since they've omitted an AZ, we're clearing it
eprintln!("Clearing preferred AZ for tenant {}", tenant_id);
eprintln!("Clearing preferred AZ for tenant {tenant_id}");
}
// Construct a request that modifies all the tenant's shards
@@ -912,10 +927,43 @@ async fn main() -> anyhow::Result<()> {
.await?;
}
Command::NodeDelete { node_id } => {
eprintln!("Warning: This command is obsolete and will be removed in a future version");
eprintln!("Use `NodeStartDelete` instead, if possible");
storcon_client
.dispatch::<(), ()>(Method::DELETE, format!("control/v1/node/{node_id}"), None)
.await?;
}
Command::NodeStartDelete { node_id } => {
storcon_client
.dispatch::<(), ()>(
Method::PUT,
format!("control/v1/node/{node_id}/delete"),
None,
)
.await?;
println!("Delete started for {node_id}");
}
Command::NodeCancelDelete { node_id, timeout } => {
storcon_client
.dispatch::<(), ()>(
Method::DELETE,
format!("control/v1/node/{node_id}/delete"),
None,
)
.await?;
println!("Waiting for node {node_id} to quiesce on scheduling policy ...");
let final_policy =
wait_for_scheduling_policy(storcon_client, node_id, *timeout, |sched| {
!matches!(sched, NodeSchedulingPolicy::Deleting)
})
.await?;
println!(
"Delete was cancelled for node {node_id}. Schedulling policy is now {final_policy:?}"
);
}
Command::NodeDeleteTombstone { node_id } => {
storcon_client
.dispatch::<(), ()>(
@@ -1134,8 +1182,7 @@ async fn main() -> anyhow::Result<()> {
Err((tenant_shard_id, from, to, error)) => {
failure += 1;
println!(
"Failed to migrate {} from node {} to node {}: {}",
tenant_shard_id, from, to, error
"Failed to migrate {tenant_shard_id} from node {from} to node {to}: {error}"
);
}
}
@@ -1277,8 +1324,7 @@ async fn main() -> anyhow::Result<()> {
concurrency,
} => {
let mut path = format!(
"/v1/tenant/{}/timeline/{}/download_heatmap_layers",
tenant_shard_id, timeline_id,
"/v1/tenant/{tenant_shard_id}/timeline/{timeline_id}/download_heatmap_layers",
);
if let Some(c) = concurrency {
@@ -1303,8 +1349,7 @@ async fn watch_tenant_shard(
) -> anyhow::Result<()> {
if let Some(until_migrated_to) = until_migrated_to {
println!(
"Waiting for tenant shard {} to be migrated to node {}",
tenant_shard_id, until_migrated_to
"Waiting for tenant shard {tenant_shard_id} to be migrated to node {until_migrated_to}"
);
}
@@ -1327,7 +1372,7 @@ async fn watch_tenant_shard(
"attached: {} secondary: {} {}",
shard
.node_attached
.map(|n| format!("{}", n))
.map(|n| format!("{n}"))
.unwrap_or("none".to_string()),
shard
.node_secondary
@@ -1341,15 +1386,12 @@ async fn watch_tenant_shard(
"(reconciler idle)"
}
);
println!("{}", summary);
println!("{summary}");
// Maybe drop out if we finished migration
if let Some(until_migrated_to) = until_migrated_to {
if shard.node_attached == Some(until_migrated_to) && !shard.is_reconciling {
println!(
"Tenant shard {} is now on node {}",
tenant_shard_id, until_migrated_to
);
println!("Tenant shard {tenant_shard_id} is now on node {until_migrated_to}");
break;
}
}

View File

@@ -4,6 +4,7 @@
"timestamp": "2022-10-12T18:00:00.000Z",
"operation_uuid": "0f657b36-4b0f-4a2d-9c2e-1dcd615e7d8c",
"suspend_timeout_seconds": -1,
"cluster": {
"cluster_id": "docker_compose",

View File

@@ -0,0 +1,396 @@
# Memo: Endpoint Persistent Unlogged Files Storage
Created on 2024-11-05
Implemented on N/A
## Summary
A design for a storage system that allows storage of files required to make
Neon's Endpoints have a better experience at or after a reboot.
## Motivation
Several systems inside PostgreSQL (and Neon) need some persistent storage for
optimal workings across reboots and restarts, but still work without.
Examples are the query-level statistics files of `pg_stat_statements` in
`pg_stat/pg_stat_statements.stat`, and `pg_prewarm`'s `autoprewarm.blocks`.
We need a storage system that can store and manage these files for each
Endpoint, without necessarily granting users access to an unlimited storage
device.
## Goals
- Store known files for Endpoints with reasonable persistence.
_Data loss in this service, while annoying and bad for UX, won't lose any
customer's data._
## Non Goals (if relevant)
- This storage system does not need branching, file versioning, or other such
features. The files are as ephemeral to the timeline of the data as the
Endpoints that host the data.
- This storage system does not need to store _all_ user files, only 'known'
user files.
- This storage system does not need to be hosted fully inside Computes.
_Instead, this will be a separate component similar to Pageserver,
SafeKeeper, the S3 proxy used for dynamically loaded extensions, etc._
## Impacted components
- Compute needs new code to load and store these files in its lifetime.
- Control Plane needs to consider this new storage system when signalling
the deletion of an Endpoint, Timeline, or Tenant.
- Control Plane needs to consider this new storage system when it resets
or re-assigns an endpoint's timeline/branch state.
A new service is created: the Endpoint Persistent Unlogged Files Storage
service. This could be integrated in e.g. Pageserver or Control Plane, or a
separately hosted service.
## Proposed implementation
Endpoint-related data files are managed by a newly designed service (which
optionally is integrated in an existing service like Pageserver or Control
Plane), which stores data directly into S3 or any blob storage of choice.
Upon deletion of the Endpoint, or reassignment of the endpoint to a different
branch, this ephemeral data is dropped: the data stored may not match the
state of the branch's data after reassignment, and on endpoint deletion the
data won't have any use to the user.
Compute gets credentials (JWT token with Tenant, Timeline & Endpoint claims)
which it can use to authenticate to this new service and retrieve and store
data associated with this endpoint. This limited scope reduces leaks of data
across endpoints and timeline resets, and limits the ability of endpoints to
mess with other endpoints' data.
The path of this endpoint data in S3 is initially as follows:
s3://<regional-epufs-bucket>/
tenants/
<hex-tenant-id>/
tenants/
<hex-timeline-id>/
endpoints/
<endpoint-id>/
pgdata/
<file_path_in_pgdatadir>
For other blob storages an equivalent or similar path can be constructed.
### Reliability, failure modes and corner cases (if relevant)
Reliability is important, but not critical to the workings of Neon. The data
stored in this service will, when lost, reduce performance, but won't be a
cause of permanent data loss - only operational metadata is stored.
Most, if not all, blob storage services have sufficiently high persistence
guarantees to cater our need for persistence and uptime. The only concern with
blob storages is that the access latency is generally higher than local disk,
but for the object types stored (cache state, ...) I don't think this will be
much of an issue.
### Interaction/Sequence diagram (if relevant)
In these diagrams you can replace S3 with any persistent storage device of
choice, but S3 is chosen as representative name: The well-known and short name
of AWS' blob storage. Azure Blob Storage should work too, but it has a much
longer name making it less practical for the diagrams.
Write data:
```http
POST /tenants/<tenant-id>/timelines/<tl-id>/endpoints/<endpoint-id>/pgdata/<the-pgdata-path>
Host: epufs.svc.neon.local
<<<
200 OK
{
"version": "<opaque>", # opaque file version token, changes when the file contents change
"size": <bytes>,
}
```
```mermaid
sequenceDiagram
autonumber
participant co as Compute
participant ep as EPUFS
participant s3 as Blob Storage
co-->ep: Connect with credentials
co->>+ep: Store Unlogged Persistent File
opt is authenticated
ep->>s3: Write UPF to S3
end
ep->>-co: OK / Failure / Auth Failure
co-->ep: Cancel connection
```
Read data: (optional with cache-relevant request parameters, e.g. If-Modified-Since)
```http
GET /tenants/<tenant-id>/timelines/<tl-id>/endpoints/<endpoint-id>/pgdata/<the-pgdata-path>
Host: epufs.svc.neon.local
<<<
200 OK
<file data>
```
```mermaid
sequenceDiagram
autonumber
participant co as Compute
participant ep as EPUFS
participant s3 as Blob Storage
co->>+ep: Read Unlogged Persistent File
opt is authenticated
ep->>+s3: Request UPF from storage
s3->>-ep: Receive UPF from storage
end
ep->>-co: OK(response) / Failure(storage, auth, ...)
```
Compute Startup:
```mermaid
sequenceDiagram
autonumber
participant co as Compute
participant ps as Pageserver
participant ep as EPUFS
participant es as Extension server
note over co: Bind endpoint ep-xxx
par Get basebackup
co->>+ps: Request basebackup @ LSN
ps-)ps: Construct basebackup
ps->>-co: Receive basebackup TAR @ LSN
and Get startup-critical Unlogged Persistent Files
co->>+ep: Get all UPFs of endpoint ep-xxx
ep-)ep: Retrieve and gather all UPFs
ep->>-co: TAR of UPFs
and Get startup-critical extensions
loop For every startup-critical extension
co->>es: Get critical extension
es->>co: Receive critical extension
end
end
note over co: Start compute
```
CPlane ops:
```http
DELETE /tenants/<tenant-id>/timelines/<timeline-id>/endpoints/<endpoint-id>
Host: epufs.svc.neon.local
<<<
200 OK
{
"tenant": "<tenant-id>",
"timeline": "<timeline-id>",
"endpoint": "<endpoint-id>",
"deleted": {
"files": <count>,
"bytes": <count>,
},
}
```
```http
DELETE /tenants/<tenant-id>/timelines/<timeline-id>
Host: epufs.svc.neon.local
<<<
200 OK
{
"tenant": "<tenant-id>",
"timeline": "<timeline-id>",
"deleted": {
"files": <count>,
"bytes": <count>,
},
}
```
```http
DELETE /tenants/<tenant-id>
Host: epufs.svc.neon.local
<<<
200 OK
{
"tenant": "<tenant-id>",
"deleted": {
"files": <count>,
"bytes": <count>,
},
}
```
```mermaid
sequenceDiagram
autonumber
participant cp as Control Plane
participant ep as EPUFS
participant s3 as Blob Storage
alt Tenant deleted
cp-)ep: Tenant deleted
loop For every object associated with removed tenant
ep->>s3: Remove data of deleted tenant from Storage
end
opt
ep-)cp: Tenant cleanup complete
end
alt Timeline deleted
cp-)ep: Timeline deleted
loop For every object associated with removed timeline
ep->>s3: Remove data of deleted timeline from Storage
end
opt
ep-)cp: Timeline cleanup complete
end
else Endpoint reassigned or removed
cp->>+ep: Endpoint reassigned
loop For every object associated with reassigned/removed endpoint
ep->>s3: Remove data from Storage
end
ep->>-cp: Cleanup complete
end
```
### Scalability (if relevant)
Provisionally: As this service is going to be part of compute startup, this
service should be able to quickly respond to all requests. Therefore this
service is deployed to every AZ we host Computes in, and Computes communicate
(generally) only to the EPUFS endpoint of the AZ they're hosted in.
Local caching of frequently restarted endpoints' data or metadata may be
needed for best performance. However, due to the regional nature of stored
data but zonal nature of the service deployment, we should be careful when we
implement any local caching, as it is possible that computes in AZ 1 will
update data originally written and thus cached by AZ 2. Cache version tests
and invalidation is therefore required if we want to roll out caching to this
service, which is too broad a scope for an MVC. This is why caching is left
out of scope for this RFC, and should be considered separately after this RFC
is implemented.
### Security implications (if relevant)
This service must be able to authenticate users at least by Tenant ID,
Timeline ID and Endpoint ID. This will use the existing JWT infrastructure of
Compute, which will be upgraded to the extent needed to support Timeline- and
Endpoint-based claims.
The service requires unlimited access to (a prefix of) a blob storage bucket,
and thus must be hosted outside the Compute VM sandbox.
A service that generates pre-signed request URLs for Compute to download the
data from that URL is likely problematic, too: Compute would be able to write
unlimited data to the bucket, or exfiltrate this signed URL to get read/write
access to specific objects in this bucket, which would still effectively give
users access to the S3 bucket (but with improved access logging).
There may be a use case for transferring data associated with one endpoint to
another endpoint (e.g. to make one endpoint warm its caches with the state of
another endpoint), but that's not currently in scope, and specific needs may
be solved through out-of-line communication of data or pre-signed URLs.
### Unresolved questions (if relevant)
Caching of files is not in the implementation scope of the document, but
should at some future point be considered to maximize performance.
## Alternative implementation (if relevant)
Several ideas have come up to solve this issue:
### Use AUXfile
One prevalent idea was to WAL-log the files using our AUXfile mechanism.
Benefits:
+ We already have this storage mechanism
Demerits:
- It isn't available on read replicas
- Additional WAL will be consumed during shutdown and after the shutdown
checkpoint, which needs PG modifications to work without panics.
- It increases the data we need to manage in our versioned storage, thus
causing higher storage costs with higher retention due to duplication at
the storage layer.
### Sign URLs for read/write operations, instead of proxying them
Benefits:
+ The service can be implemented with a much reduced IO budget
Demerits:
- Users could get access to these signed credentials
- Not all blob storage services may implement URL signing
### Give endpoints each their own directly accessed block volume
Benefits:
+ Easier to integrate for PostgreSQL
Demerits:
- Little control on data size and contents
- Potentially problematic as we'd need to store data all across the pgdata
directory.
- EBS is not a good candidate
- Attaches in 10s of seconds, if not more; i.e. too cold to start
- Shared EBS volumes are a no-go, as you'd have to schedule the endpoint
with users of the same EBS volumes, which can't work with VM migration
- EBS storage costs are very high (>80$/kilotenant when using a
volume/tenant)
- EBS volumes can't be mounted across AZ boundaries
- Bucket per endpoint is unfeasible
- S3 buckets are priced at $20/month per 1k, which we could better spend
on developers.
- Allocating service accounts takes time (100s of ms), and service accounts
are a limited resource, too; so they're not a good candidate to allocate
on a per-endpoint basis.
- Giving credentials limited to prefix has similar issues as the pre-signed
URL approach.
- Bucket DNS lookup will fill DNS caches and put pressure on DNS lookup
much more than our current systems would.
- Volumes bound by hypervisor are unlikely
- This requires significant investment and increased software on the
hypervisor.
- It is unclear if we can attach volumes after boot, i.e. for pooled
instances.
### Put the files into a table
Benefits:
+ Mostly already available in PostgreSQL
Demerits:
- Uses WAL
- Can't be used after shutdown checkpoint
- Needs a RW endpoint, and table & catalog access to write to this data
- Gets hit with DB size limitations
- Depending on user acces:
- Inaccessible:
The user doesn't have control over database size caused by
these systems.
- Accessible:
The user can corrupt these files and cause the system to crash while
user-corrupted files are present, thus increasing on-call overhead.
## Definition of Done (if relevant)
This project is done if we have:
- One S3 bucket equivalent per region, which stores this per-endpoint data.
- A new service endpoint in at least every AZ, which indirectly grants
endpoints access to the data stored for these endpoints in these buckets.
- Compute writes & reads temp-data at shutdown and startup, respectively, for
at least the pg_prewarm or lfc_prewarm state files.
- Cleanup of endpoint data is triggered when the endpoint is deleted or is
detached from its current timeline.

View File

@@ -0,0 +1,179 @@
# Storage Feature Flags
In this RFC, we will describe how we will implement per-tenant feature flags.
## PostHog as Feature Flag Service
Before we start, let's talk about how current feature flag services work. PostHog is the feature flag service we are currently using across multiple user-facing components in the company. PostHog has two modes of operation: HTTP evaluation and server-side local evaluation.
Let's assume we have a storage feature flag called gc-compaction and we want to roll it out to scale-tier users with resident size >= 10GB and <= 100GB.
### Define User Profiles
The first step is to synchronize our user profiles to the PostHog service. We can simply assume that each tenant is a user in PostHog. Each user profile has some properties associated with it. In our case, it will be: plan type (free, scale, enterprise, etc); resident size (in bytes); primary pageserver (string); region (string).
### Define Feature Flags
We would create a feature flag called gc-compaction in PostHog with 4 variants: disabled, stage-1, stage-2, fully-enabled. We will flip the feature flags from disabled to fully-enabled stage by stage for some percentage of our users.
### Option 1: HTTP Evaluation Mode
When using PostHog's HTTP evaluation mode, the client will make request to the PostHog service, asking for the value of a feature flag for a specific user.
* Control plane will report the plan type to PostHog each time it attaches a tenant to the storcon or when the user upgrades/downgrades. It calls the PostHog profile API to associate tenant ID with the plan type. Assume we have X active tenants and such attach or plan change event happens each week, that would be 4X profile update requests per month.
* Pageservers will report the resident size and the primary pageserver to the PostHog service. Assume we report resident size every 24 hours, that would be 30X requests per month.
* Each tenant will request the state of the feature flag every 1 hour, that's 720X requests per month.
* The Rust client would be easy to implement as we only need to call the `/decide` API on PostHog.
Using the HTTP evaluation mode we will issue 754X requests a month.
### Option 2: Local Evaluation Mode
When using PostHog's HTTP evaluation mode, the client (usually the server in a browser/server architecture) will poll the feature flag configuration every 30s (default in the Python client) from PostHog. Such configuration contains data like:
<details>
<summary>Example JSON response from the PostHog local evaluation API</summary>
```
[
{
"id": 1,
"name": "Beta Feature",
"key": "person-flag",
"is_simple_flag": True,
"active": True,
"filters": {
"groups": [
{
"properties": [
{
"key": "location",
"operator": "exact",
"value": ["Straße"],
"type": "person",
}
],
"rollout_percentage": 100,
},
{
"properties": [
{
"key": "star",
"operator": "exact",
"value": ["ſun"],
"type": "person",
}
],
"rollout_percentage": 100,
},
],
},
}
]
```
</details>
Note that the API only contains information like "under what condition => rollout percentage". The user is responsible to provide the properties required to the client for local evaluation, and the PostHog service (web UI) cannot know if a feature is enabled for the tenant or not until the client uses the `capture` API to report the result back. To control the rollout percentage, the user ID gets mapped to a float number in `[0, 1)` on a consistent hash ring. All values <= the percentage will get the feature enabled or set to the desired value.
To use the local evaluation mode, the system needs:
* Assume each pageserver will poll PostHog for the local evaluation JSON every 5 minutes (instead of the 30s default as it's too frequent). That's 8640Y per month, Y is the number of pageservers. Local evaluation requests cost 10x more than the normal decide request, so that's 86400Y request units to bill.
* Storcon needs to store the plan type in the database and pass that information to the pageserver when attaching the tenant.
* Storcon also needs to update PostHog with the active tenants, for example, when the tenant gets detached/attached. Assume each active tenant gets detached/attached every week, that would be 4X requests per month.
* We do not need to update bill type or resident size to PostHog as all these are evaluated locally.
* After each local evaluation of the feature flag, we need to call PostHog's capture event API to update the result of the evaluation that the feature is enabled. We can do this when the flag gets changed compared with the last cached state in memory. That would be at least 4X (assume we do deployment every week so the cache gets cleared) and maybe an additional multiplifier of 10 assume we have 10 active features.
In this case, we will issue 86400Y + 40X requests per month.
Assume X = 1,000,000 and Y = 100,
| | HTTP Evaluation | Local Evaluation |
|---|---|---|
| Latency of propagating the conditions/properties for feature flag | 24 hours | available locally |
| Latency of applying the feature flag | 1 hour | 5 minutes |
| Can properties be reported from different services | Yes | No |
| Do we need to sync billing info etc to pageserver | No | Yes |
| Cost | 75400$ / month | 4864$ / month |
# Our Solution
We will use PostHog _only_ as an UI to configure the feature flags. Whether a feature is enabled or not can only be queried through storcon/pageserver instead of using the PostHog UI. (We could report it back to PostHog via `capture_event` but it costs $$$.) This allows us to ramp up the feature flag functionality fast at first. At the same time, it would also give us the option to migrate to our own solution once we want to have more properties and more complex evaluation rules in our system.
* We will create several fake users (tenants) in PostHog that contains all the properties we will use for evaluating a feature flag (i.e., resident size, billing type, pageserver id, etc.)
* We will use PostHog's local evaluation API to poll the configuration of the feature flags and evaluate them locally on each of the pageserver.
* The evaluation result will not be reported back to PostHog.
* Storcon needs to pull some information from cplane database.
* To know if a feature is currently enabled or not, we need to call the storcon/pageserver API; and we won't be able to know if a feature has been enabled on a tenant before easily: we need to look at the Grafana logs.
We only need to pay for the 86400Y local evaluation requests (that would be setting Y=0 in solution 2 => $864/month, and even less if we proxy it through storcon).
## Implementation
* Pageserver: implement a PostHog local evaluation client. The client will be shared across all tenants on the pageserver with a single API: `evaluate(tenant_id, feature_flag, properties) -> json`.
* Storcon: if we need plan type as the evaluation condition, pull it from cplane database.
* Storcon/Pageserver: implement an HTTP API `:tenant_id/feature/:feature` to retrieve the current feature flag status.
* Storcon/Pageserver: a loop to update the feature flag spec on both storcon and pageserver. Pageserver loop will only be activated if storcon does not push the specs to the pageserver.
## Difference from Tenant Config
* Feature flags can be modified by percentage, and the default config for each feature flag can be modified in UI without going through the release process.
* Feature flags are more flexible and won't be persisted anywhere and will be passed as plain JSON over the wire so that do not need to handle backward/forward compatibility as in tenant config.
* The expectation of tenant config is that once we add a flag we cannot remove it (or it will be hard to remove), but feature flags are more flexible.
# Final Implementation
* We added a new crate `posthog_lite_client` that supports local feature evaluations.
* We set up two projects "Storage (staging)" and "Storage (production)" in the PostHog console.
* Each pageserver reports 10 fake tenants to PostHog so that we can get all combinations of regions (and other properties) in the PostHog UI.
* Supported properties: AZ, neon_region, pageserver, tenant_id.
* You may use "Pageserver Feature Flags" dashboard to see the evaluation status.
* The feature flag spec is polled on storcon every 30s (in each of the region) and storcon will propagate the spec to the pageservers.
* The pageserver housekeeping loop updates the tenant-specific properties (e.g., remote size) for evaluation.
Each tenant has a `feature_resolver` object. After you add a feature flag in the PostHog console, you can retrieve it with:
```rust
// Boolean flag
self
.feature_resolver
.evaluate_boolean("flag")
.is_ok()
// Multivariate flag
self
.feature_resolver
.evaluate_multivariate("gc-comapction-strategy")
.ok();
```
The user needs to handle the case where the evaluation result is an error. This can occur in a variety of cases:
* During the pageserver start, the feature flag spec has not been retrieved.
* No condition group is matched.
* The feature flag spec contains an operand/operation not supported by the lite PostHog library.
For boolean flags, the return value is `Result<(), Error>`. `Ok(())` means the flag is evaluated to true. Otherwise,
there is either an error in evaluation or it does not match any groups.
For multivariate flags, the return value is `Result<String, Error>`. `Ok(variant)` indicates the flag is evaluated
to a variant. Otherwise, there is either an error in evaluation or it does not match any groups.
The evaluation logic is documented in the PostHog lite library. It compares the consistent hash of a flag key + tenant_id
with the rollout percentage and determines which tenant to roll out a specific feature.
Users can use the feature flag evaluation API to get the flag evaluation result of a specific tenant for debugging purposes.
```
curl http://localhost:9898/v1/tenant/:tenant_id/feature_flag?flag=:key&as=multivariate/boolean"
```
By default, the storcon pushes the feature flag specs to the pageservers every 30 seconds, which means that a change in feature flag in the
PostHog UI will propagate to the pageservers within 30 seconds.
# Future Works
* Support dynamic tenant properties like logical size as the evaluation condition.
* Support properties like `plan_type` (needs cplane to pass it down).
* Report feature flag evaluation result back to PostHog (if the cost is okay).
* Fast feature flag evaluation cache on critical paths (e.g., cache a feature flag result in `AtomicBool` and use it on the read path).

View File

@@ -0,0 +1,399 @@
# Compute rolling restart with prewarm
Created on 2025-03-17
Implemented on _TBD_
Author: Alexey Kondratov (@ololobus)
## Summary
This RFC describes an approach to reduce performance degradation due to missing caches after compute node restart, i.e.:
1. Rolling restart of the running instance via 'warm' replica.
2. Auto-prewarm compute caches after unplanned restart or scale-to-zero.
## Motivation
Neon currently implements several features that guarantee high uptime of compute nodes:
1. Storage high-availability (HA), i.e. each tenant shard has a secondary pageserver location, so we can quickly switch over compute to it in case of primary pageserver failure.
2. Fast compute provisioning, i.e. we have a fleet of pre-created empty computes, that are ready to serve workload, so restarting unresponsive compute is very fast.
3. Preemptive NeonVM compute provisioning in case of k8s node unavailability.
This helps us to be well-within the uptime SLO of 99.95% most of the time. Problems begin when we go up to multi-TB workloads and 32-64 CU computes.
During restart, compute loses all caches: LFC, shared buffers, file system cache. Depending on the workload, it can take a lot of time to warm up the caches,
so that performance could be degraded and might be even unacceptable for certain workloads. The latter means that although current approach works well for small to
medium workloads, we still have to do some additional work to avoid performance degradation after restart of large instances.
## Non Goals
- Details of the persistence storage for prewarm data are out of scope, there is a separate RFC for that: <https://github.com/neondatabase/neon/pull/9661>.
- Complete compute/Postgres HA setup and flow. Although it was originally in scope of this RFC, during preliminary research it appeared to be a rabbit hole, so it's worth of a separate RFC.
- Low-level implementation details for Postgres replica-to-primary promotion. There are a lot of things to think and care about: how to start walproposer, [logical replication failover](https://www.postgresql.org/docs/current/logical-replication-failover.html), and so on, but it's worth of at least a separate one-pager design document if not RFC.
## Impacted components
Postgres, compute_ctl, Control plane, Endpoint storage for unlogged storage of compute files.
For the latter, we will need to implement a uniform abstraction layer on top of S3, ABS, etc., but
S3 is used in text interchangeably with 'endpoint storage' for simplicity.
## Proposed implementation
### compute_ctl spec changes and auto-prewarm
We are going to extend the current compute spec with the following attributes
```rust
struct ComputeSpec {
/// [All existing attributes]
...
/// Whether to do auto-prewarm at start or not.
/// Default to `false`.
pub lfc_auto_prewarm: bool
/// Interval in seconds between automatic dumps of
/// LFC state into S3. Default `None`, which means 'off'.
pub lfc_dump_interval_sec: Option<i32>
}
```
When `lfc_dump_interval_sec` is set to `N`, `compute_ctl` will periodically dump the LFC state
and store it in S3, so that it could be used either for auto-prewarm after restart or by replica
during the rolling restart. For enabling periodic dumping, we should consider the following value
`lfc_dump_interval_sec=300` (5 minutes), same as in the upstream's `pg_prewarm.autoprewarm_interval`.
When `lfc_auto_prewarm` is set to `true`, `compute_ctl` will start prewarming the LFC upon restart
iif some of the previous states is present in S3.
### compute_ctl API
1. `POST /store_lfc_state` -- dump LFC state using Postgres SQL interface and store result in S3.
This has to be a blocking call, i.e. it will return only after the state is stored in S3.
If there is any concurrent request in progress, we should return `429 Too Many Requests`,
and let the caller to retry.
2. `GET /dump_lfc_state` -- dump LFC state using Postgres SQL interface and return it as is
in text format suitable for the future restore/prewarm. This API is not strictly needed at
the end state, but could be useful for a faster prototyping of a complete rolling restart flow
with prewarm, as it doesn't require persistent for LFC state storage.
3. `POST /restore_lfc_state` -- restore/prewarm LFC state with request
```yaml
RestoreLFCStateRequest:
oneOf:
- type: object
required:
- lfc_state
properties:
lfc_state:
type: string
description: Raw LFC content dumped with GET `/dump_lfc_state`
- type: object
required:
- lfc_cache_key
properties:
lfc_cache_key:
type: string
description: |
endpoint_id of the source endpoint on the same branch
to use as a 'donor' for LFC content. Compute will look up
LFC content dump in S3 using this key and do prewarm.
```
where `lfc_state` and `lfc_cache_key` are mutually exclusive.
The actual prewarming will happen asynchronously, so the caller need to check the
prewarm status using the compute's standard `GET /status` API.
4. `GET /status` -- extend existing API with following attributes
```rust
struct ComputeStatusResponse {
// [All existing attributes]
...
pub prewarm_state: PrewarmState
}
/// Compute prewarm state. Will be stored in the shared Compute state
/// in compute_ctl
struct PrewarmState {
pub status: PrewarmStatus
/// Total number of pages to prewarm
pub pages_total: i64
/// Number of pages prewarmed so far
pub pages_processed: i64
/// Optional prewarm error
pub error: Option<String>
}
pub enum PrewarmStatus {
/// Prewarming was never requested on this compute
Off,
/// Prewarming was requested, but not started yet
Pending,
/// Prewarming is in progress. The caller should follow
/// `PrewarmState::progress`.
InProgress,
/// Prewarming has been successfully completed
Completed,
/// Prewarming failed. The caller should look at
/// `PrewarmState::error` for the reason.
Failed,
/// It is intended to be used by auto-prewarm if none of
/// the previous LFC states is available in S3.
/// This is a distinct state from the `Failed` because
/// technically it's not a failure and could happen if
/// compute was restart before it dumped anything into S3,
/// or just after the initial rollout of the feature.
Skipped,
}
```
5. `POST /promote` -- this is a **blocking** API call to promote compute replica into primary.
This API should be very similar to the existing `POST /configure` API, i.e. accept the
spec (primary spec, because originally compute was started as replica). It's a distinct
API method because semantics and response codes are different:
- If promotion is done successfully, it will return `200 OK`.
- If compute is already primary, the call will be no-op and `compute_ctl`
will return `412 Precondition Failed`.
- If, for some reason, second request reaches compute that is in progress of promotion,
it will respond with `429 Too Many Requests`.
- If compute hit any permanent failure during promotion `500 Internal Server Error`
will be returned.
### Control plane operations
The complete flow will be present as a sequence diagram in the next section, but here
we just want to list some important steps that have to be done by control plane during
the rolling restart via warm replica, but without much of low-level implementation details.
1. Register the 'intent' of the instance restart, but not yet interrupt any workload at
primary and also accept new connections. This may require some endpoint state machine
changes, e.g. introduction of the `pending_restart` state. Being in this state also
**mustn't prevent any other operations except restart**: suspend, live-reconfiguration
(e.g. due to notify-attach call from the storage controller), deletion.
2. Start new replica compute on the same timeline and start prewarming it. This process
may take quite a while, so the same concurrency considerations as in 1. should be applied
here as well.
3. When warm replica is ready, control plane should:
3.1. Terminate the primary compute. Starting from here, **this is a critical section**,
if anything goes off, the only option is to start the primary normally and proceed
with auto-prewarm.
3.2. Send cache invalidation message to all proxies, notifying them that all new connections
should request and wait for the new connection details. At this stage, proxy has to also
drop any existing connections to the old primary, so they didn't do stale reads.
3.3. Attach warm replica compute to the primary endpoint inside control plane metadata
database.
3.4. Promote replica to primary.
3.5. When everything is done, finalize the endpoint state to be just `active`.
### Complete rolling restart flow
```mermaid
sequenceDiagram
autonumber
participant proxy as Neon proxy
participant cplane as Control plane
participant primary as Compute (primary)
box Compute (replica)
participant ctl as compute_ctl
participant pg as Postgres
end
box Endpoint unlogged storage
participant s3proxy as Endpoint storage service
participant s3 as S3/ABS/etc.
end
cplane ->> primary: POST /store_lfc_state
primary -->> cplane: 200 OK
cplane ->> ctl: POST /restore_lfc_state
activate ctl
ctl -->> cplane: 202 Accepted
activate cplane
cplane ->> ctl: GET /status: poll prewarm status
ctl ->> s3proxy: GET /read_file
s3proxy ->> s3: read file
s3 -->> s3proxy: file content
s3proxy -->> ctl: 200 OK: file content
proxy ->> cplane: GET /proxy_wake_compute
cplane -->> proxy: 200 OK: old primary conninfo
ctl ->> pg: prewarm LFC
activate pg
pg -->> ctl: prewarm is completed
deactivate pg
ctl -->> cplane: 200 OK: prewarm is completed
deactivate ctl
deactivate cplane
cplane -->> cplane: reassign replica compute to endpoint,<br>start terminating the old primary compute
activate cplane
cplane ->> proxy: invalidate caches
proxy ->> cplane: GET /proxy_wake_compute
cplane -x primary: POST /terminate
primary -->> cplane: 200 OK
note over primary: old primary<br>compute terminated
cplane ->> ctl: POST /promote
activate ctl
ctl ->> pg: pg_ctl promote
activate pg
pg -->> ctl: done
deactivate pg
ctl -->> cplane: 200 OK
deactivate ctl
cplane -->> cplane: finalize operation
cplane -->> proxy: 200 OK: new primary conninfo
deactivate cplane
```
### Network bandwidth and prewarm speed
It's currently known that pageserver can sustain about 3000 RPS per shard for a few running computes.
Large tenants are usually split into 8 shards, so the final formula may look like this:
```text
8 shards * 3000 RPS * 8 KB =~ 190 MB/s
```
so depending on the LFC size, prewarming will take at least:
- ~5s for 1 GB
- ~50s for 10 GB
- ~5m for 100 GB
- \>1h for 1 TB
In total, one pageserver is normally capped by 30k RPS, so it obviously can't sustain many computes
doing prewarm at the same time. Later, we may need an additional mechanism for computes to throttle
the prewarming requests gracefully.
### Reliability, failure modes and corner cases
We consider following failures while implementing this RFC:
1. Compute got interrupted/crashed/restarted during prewarm. The caller -- control plane -- should
detect that and start prewarm from the beginning.
2. Control plane promotion request timed out or hit network issues. If it never reached the
compute, control plane should just repeat it. If it did reach the compute, then during
retry control plane can hit `409` as previous request triggered the promotion already.
In this case, control plane need to retry until either `200` or
permanent error `500` is returned.
3. Compute got interrupted/crashed/restarted during promotion. At restart it will ask for
a spec from control plane, and its content should signal compute to start as **primary**,
so it's expected that control plane will continue polling for certain period of time and
will discover that compute is ready to accept connections if restart is fast enough.
4. Any other unexpected failure or timeout during prewarming. This **failure mustn't be fatal**,
control plane has to report failure, terminate replica and keep primary running.
5. Any other unexpected failure or timeout during promotion. Unfortunately, at this moment
we already have the primary node stopped, so the only option is to start primary again
and proceed with auto-prewarm.
6. Any unexpected failure during auto-prewarm. This **failure mustn't be fatal**,
`compute_ctl` has to report the failure, but do not crash the compute.
7. Control plane failed to confirm that old primary has terminated. This can happen, especially
in the future HA setup. In this case, control plane has to ensure that it sent VM deletion
and pod termination requests to k8s, so long-term we do not have two running primaries
on the same timeline.
### Security implications
There are two security implications to consider:
1. Access to `compute_ctl` API. It has to be accessible from the outside of compute, so all
new API methods have to be exposed on the **external** HTTP port and **must** be authenticated
with JWT.
2. Read/write only your own LFC state data in S3. Although it's not really a security concern,
since LFC state is just a mapping of blocks present in LFC at certain moment in time;
it still has to be highly restricted, so that i) only computes on the same timeline can
read S3 state; ii) each compute can only write to the path that contains it's `endpoint_id`.
Both of this must be validated by Endpoint storage service using the JWT token provided by `compute_ctl`.
### Unresolved questions
#### Billing, metrics and monitoring
Currently, we only label computes with `endpoint_id` after attaching them to the endpoint.
In this proposal, this means that temporary replica will remain unlabelled until it's promoted
to primary. We can also hide it from users in the control plane API, but what to do with
billing and monitoring is still unclear.
We can probably mark it as 'billable' and tag with `project_id`, so it will be billed, but
not interfere in any way with the current primary monitoring.
Another thing to consider is how logs and metrics export will switch to the new compute.
It's expected that OpenTelemetry collector will auto-discover the new compute and start
scraping metrics from it.
#### Auto-prewarm
It's still an open question whether we need auto-prewarm at all. The author's gut-feeling is
that yes, we need it, but might be not for all workloads, so it could end up exposed as a
user-controllable knob on the endpoint. There are two arguments for that:
1. Auto-prewarm existing in upstream's `pg_prewarm`, _probably for a reason_.
2. There are still could be 2 flows when we cannot perform the rolling restart via the warm
replica: i) any failure or interruption during promotion; ii) wake up after scale-to-zero.
The latter might be challenged as well, i.e. one can argue that auto-prewarm may and will
compete with user-workload for storage resources. This is correct, but it might as well
reduce the time to get warm LFC and good performance.
#### Low-level details of the replica promotion
There are many things to consider here, but three items just off the top of my head:
1. How to properly start the `walproposer` inside Postgres.
2. What to do with logical replication. Currently, we do not include logical replication slots
inside basebackup, because nobody advances them at replica, so they just prevent the WAL
deletion. Yet, we do need to have them at primary after promotion. Starting with Postgres 17,
there is a new feature called
[logical replication failover](https://www.postgresql.org/docs/current/logical-replication-failover.html)
and `synchronized_standby_slots` setting, but we need a plan for the older versions. Should we
request a new basebackup during promotion?
3. How do we guarantee that replica will receive all the latest WAL from safekeepers? Do some
'shallow' version of sync safekeepers without data copying? Or just a standard version of
sync safekeepers?
## Alternative implementation
The proposal already assumes one of the alternatives -- do not have any persistent storage for
LFC state. This is possible to implement faster with the proposed API, but it means that
we do not implement auto-prewarm yet.
## Definition of Done
At the end of implementing this RFC we should have two high-level settings that enable:
1. Auto-prewarm of user computes upon restart.
2. Perform primary compute restart via the warm replica promotion.
It also has to be decided what's the criteria for enabling one or both of these flows for
certain clients.

View File

@@ -374,7 +374,7 @@ MC4CAQAwBQYDK2VwBCIEID/Drmc1AA6U/znNRWpF3zEGegOATQxfkdWxitcOMsIH
let request = Request::builder()
.uri(format!("/{tenant}/{timeline}/{endpoint}/sub/path/key"))
.method(method)
.header("Authorization", format!("Bearer {}", token))
.header("Authorization", format!("Bearer {token}"))
.body(Body::empty())
.unwrap();
let status = ServiceExt::ready(&mut app)

View File

@@ -12,6 +12,7 @@ jsonwebtoken.workspace = true
serde.workspace = true
serde_json.workspace = true
regex.workspace = true
url.workspace = true
utils = { path = "../utils" }
remote_storage = { version = "0.1", path = "../remote_storage/" }

View File

@@ -58,7 +58,7 @@ pub enum LfcPrewarmState {
},
}
#[derive(Serialize, Default, Debug, Clone)]
#[derive(Serialize, Default, Debug, Clone, PartialEq)]
#[serde(tag = "status", rename_all = "snake_case")]
pub enum LfcOffloadState {
#[default]

View File

@@ -6,10 +6,12 @@
use std::collections::HashMap;
use std::fmt::Display;
use anyhow::anyhow;
use indexmap::IndexMap;
use regex::Regex;
use remote_storage::RemotePath;
use serde::{Deserialize, Serialize};
use url::Url;
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
@@ -103,7 +105,11 @@ pub struct ComputeSpec {
// updated to fill these fields, we can make these non optional.
pub tenant_id: Option<TenantId>,
pub timeline_id: Option<TimelineId>,
pub pageserver_connstring: Option<String>,
// Pageserver information can be passed in two different ways:
// 1. Here
// 2. in cluster.settings. This is legacy, we are switching to method 1.
pub pageserver_connection_info: Option<PageserverConnectionInfo>,
// More neon ids that we expose to the compute_ctl
// and to postgres as neon extension GUCs.
@@ -179,9 +185,18 @@ pub struct ComputeSpec {
/// JWT for authorizing requests to endpoint storage service
pub endpoint_storage_token: Option<String>,
/// Download LFC state from endpoint_storage and pass it to Postgres on startup
#[serde(default)]
/// Download LFC state from endpoint storage and pass it to Postgres on compute startup
pub autoprewarm: bool,
#[serde(default)]
/// Upload LFC state to endpoint storage periodically. Default value (None) means "don't upload"
pub offload_lfc_interval_seconds: Option<std::num::NonZeroU64>,
/// Suspend timeout in seconds.
///
/// We use this value to derive other values, such as the installed extensions metric.
pub suspend_timeout_seconds: i64,
}
/// Feature flag to signal `compute_ctl` to enable certain experimental functionality.
@@ -203,6 +218,20 @@ pub enum ComputeFeature {
UnknownFeature,
}
/// Feature flag to signal `compute_ctl` to enable certain experimental functionality.
#[derive(Clone, Debug, Default, Deserialize, Serialize, Eq, PartialEq)]
pub struct PageserverConnectionInfo {
pub shards: HashMap<u32, PageserverShardConnectionInfo>,
pub prefer_grpc: bool,
}
#[derive(Clone, Debug, Default, Deserialize, Serialize, Eq, PartialEq)]
pub struct PageserverShardConnectionInfo {
pub libpq_url: Option<String>,
pub grpc_url: Option<String>,
}
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
pub struct RemoteExtSpec {
pub public_extensions: Option<Vec<String>>,
@@ -436,6 +465,47 @@ pub struct JwksSettings {
pub jwt_audience: Option<String>,
}
/// Protocol used to connect to a Pageserver. Parsed from the connstring scheme.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum PageserverProtocol {
/// The original protocol based on libpq and COPY. Uses postgresql:// or postgres:// scheme.
#[default]
Libpq,
/// A newer, gRPC-based protocol. Uses grpc:// scheme.
Grpc,
}
impl PageserverProtocol {
/// Parses the protocol from a connstring scheme. Defaults to Libpq if no scheme is given.
/// Errors if the connstring is an invalid URL.
pub fn from_connstring(connstring: &str) -> anyhow::Result<Self> {
let scheme = match Url::parse(connstring) {
Ok(url) => url.scheme().to_lowercase(),
Err(url::ParseError::RelativeUrlWithoutBase) => return Ok(Self::default()),
Err(err) => return Err(anyhow!("invalid connstring URL: {err}")),
};
match scheme.as_str() {
"postgresql" | "postgres" => Ok(Self::Libpq),
"grpc" => Ok(Self::Grpc),
scheme => Err(anyhow!("invalid protocol scheme: {scheme}")),
}
}
/// Returns the URL scheme for the protocol, for use in connstrings.
pub fn scheme(&self) -> &'static str {
match self {
Self::Libpq => "postgresql",
Self::Grpc => "grpc",
}
}
}
impl Display for PageserverProtocol {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.write_str(self.scheme())
}
}
#[cfg(test)]
mod tests {
use std::fs::File;

View File

@@ -3,6 +3,7 @@
"timestamp": "2021-05-23T18:25:43.511Z",
"operation_uuid": "0f657b36-4b0f-4a2d-9c2e-1dcd615e7d8b",
"suspend_timeout_seconds": 3600,
"cluster": {
"cluster_id": "test-cluster-42",
@@ -89,6 +90,11 @@
"value": "off",
"vartype": "bool"
},
{
"name": "offload_lfc_interval_seconds",
"value": "20",
"vartype": "integer"
},
{
"name": "neon.safekeepers",
"value": "127.0.0.1:6502,127.0.0.1:6503,127.0.0.1:6501",

View File

@@ -71,7 +71,7 @@ impl Runtime {
debug!("thread panicked: {:?}", e);
let mut result = ctx.result.lock();
if result.0 == -1 {
*result = (256, format!("thread panicked: {:?}", e));
*result = (256, format!("thread panicked: {e:?}"));
}
});
}

View File

@@ -47,8 +47,8 @@ impl Debug for AnyMessage {
match self {
AnyMessage::None => write!(f, "None"),
AnyMessage::InternalConnect => write!(f, "InternalConnect"),
AnyMessage::Just32(v) => write!(f, "Just32({})", v),
AnyMessage::ReplCell(v) => write!(f, "ReplCell({:?})", v),
AnyMessage::Just32(v) => write!(f, "Just32({v})"),
AnyMessage::ReplCell(v) => write!(f, "ReplCell({v:?})"),
AnyMessage::Bytes(v) => write!(f, "Bytes({})", hex::encode(v)),
AnyMessage::LSN(v) => write!(f, "LSN({})", Lsn(*v)),
}

View File

@@ -582,14 +582,14 @@ pub fn attach_openapi_ui(
deepLinking: true,
showExtensions: true,
showCommonExtensions: true,
url: "{}",
url: "{spec_mount_path}",
}})
window.ui = ui;
}};
</script>
</body>
</html>
"#, spec_mount_path))).unwrap())
"#))).unwrap())
})
)
}
@@ -696,7 +696,7 @@ mod tests {
let remote_addr = SocketAddr::new(IpAddr::from_str("127.0.0.1").unwrap(), 80);
let mut service = builder.build(remote_addr);
if let Err(e) = poll_fn(|ctx| service.poll_ready(ctx)).await {
panic!("request service is not ready: {:?}", e);
panic!("request service is not ready: {e:?}");
}
let mut req: Request<Body> = Request::default();
@@ -716,7 +716,7 @@ mod tests {
let remote_addr = SocketAddr::new(IpAddr::from_str("127.0.0.1").unwrap(), 80);
let mut service = builder.build(remote_addr);
if let Err(e) = poll_fn(|ctx| service.poll_ready(ctx)).await {
panic!("request service is not ready: {:?}", e);
panic!("request service is not ready: {e:?}");
}
let req: Request<Body> = Request::default();

View File

@@ -12,6 +12,8 @@ rustc-hash = { version = "2.1.1" }
rand = "0.9.1"
libc.workspace = true
lock_api = "0.4.13"
atomic = "0.6.1"
bytemuck = { version = "1.23.1", features = ["derive"] }
[dev-dependencies]
criterion = { workspace = true, features = ["html_reports"] }

View File

@@ -1,12 +1,12 @@
use criterion::{criterion_group, criterion_main, BatchSize, Criterion, BenchmarkId};
use criterion::{BatchSize, BenchmarkId, Criterion, criterion_group, criterion_main};
use neon_shmem::hash::HashMapAccess;
use neon_shmem::hash::HashMapInit;
use neon_shmem::hash::entry::Entry;
use rand::prelude::*;
use rand::distr::{Distribution, StandardUniform};
use std::hash::BuildHasher;
use rand::prelude::*;
use std::default::Default;
use std::hash::BuildHasher;
// Taken from bindings to C code
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
@@ -20,15 +20,15 @@ pub struct FileCacheKey {
}
impl Distribution<FileCacheKey> for StandardUniform {
// questionable, but doesn't need to be good randomness
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> FileCacheKey {
FileCacheKey {
_spc_id: rng.random(),
_db_id: rng.random(),
_rel_number: rng.random(),
_fork_num: rng.random(),
_block_num: rng.random()
}
// questionable, but doesn't need to be good randomness
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> FileCacheKey {
FileCacheKey {
_spc_id: rng.random(),
_db_id: rng.random(),
_rel_number: rng.random(),
_fork_num: rng.random(),
_block_num: rng.random(),
}
}
}
@@ -43,240 +43,288 @@ pub struct FileCacheEntry {
}
impl FileCacheEntry {
fn dummy() -> Self {
Self {
_offset: 0,
_access_count: 0,
_prev: std::ptr::null_mut(),
_next: std::ptr::null_mut(),
_state: [0; 8]
}
}
fn dummy() -> Self {
Self {
_offset: 0,
_access_count: 0,
_prev: std::ptr::null_mut(),
_next: std::ptr::null_mut(),
_state: [0; 8],
}
}
}
// Utilities for applying operations.
#[derive(Clone, Debug)]
struct TestOp<K,V>(K, Option<V>);
struct TestOp<K, V>(K, Option<V>);
fn apply_op<K: Clone + std::hash::Hash + Eq, V, S: std::hash::BuildHasher>(
op: TestOp<K,V>,
map: &mut HashMapAccess<K,V,S>,
op: TestOp<K, V>,
map: &mut HashMapAccess<K, V, S>,
) {
let entry = map.entry(op.0);
let entry = map.entry(op.0);
match op.1 {
Some(new) => {
match entry {
Entry::Occupied(mut e) => Some(e.insert(new)),
Entry::Vacant(e) => { _ = e.insert(new).unwrap(); None },
}
},
None => {
match entry {
Entry::Occupied(e) => Some(e.remove()),
Entry::Vacant(_) => None,
}
},
};
Some(new) => match entry {
Entry::Occupied(mut e) => Some(e.insert(new)),
Entry::Vacant(e) => {
_ = e.insert(new).unwrap();
None
}
},
None => match entry {
Entry::Occupied(e) => Some(e.remove()),
Entry::Vacant(_) => None,
},
};
}
// Hash utilities
struct SeaRandomState {
k1: u64,
k2: u64,
k3: u64,
k4: u64
k1: u64,
k2: u64,
k3: u64,
k4: u64,
}
impl std::hash::BuildHasher for SeaRandomState {
type Hasher = seahash::SeaHasher;
fn build_hasher(&self) -> Self::Hasher {
seahash::SeaHasher::with_seeds(self.k1, self.k2, self.k3, self.k4)
}
type Hasher = seahash::SeaHasher;
fn build_hasher(&self) -> Self::Hasher {
seahash::SeaHasher::with_seeds(self.k1, self.k2, self.k3, self.k4)
}
}
impl SeaRandomState {
fn new() -> Self {
let mut rng = rand::rng();
Self { k1: rng.random(), k2: rng.random(), k3: rng.random(), k4: rng.random() }
}
fn new() -> Self {
let mut rng = rand::rng();
Self {
k1: rng.random(),
k2: rng.random(),
k3: rng.random(),
k4: rng.random(),
}
}
}
fn small_benchs(c: &mut Criterion) {
let mut group = c.benchmark_group("Small maps");
let mut group = c.benchmark_group("Small maps");
group.sample_size(10);
group.bench_function("small_rehash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_xxhash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(twox_hash::xxhash64::RandomState::default())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_ahash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(ahash::RandomState::default())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_xxhash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(twox_hash::xxhash64::RandomState::default())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_seahash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(SeaRandomState::new())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_ahash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(ahash::RandomState::default())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.finish();
group.bench_function("small_rehash_seahash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(SeaRandomState::new())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.finish();
}
fn real_benchs(c: &mut Criterion) {
let mut group = c.benchmark_group("Realistic workloads");
group.sample_size(10);
let mut group = c.benchmark_group("Realistic workloads");
group.sample_size(10);
group.bench_function("real_bulk_insert", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut rng = rand::rng();
b.iter_batched(
|| HashMapInit::new_resizeable(size, size * 2).attach_writer(),
|writer| {
for _ in 0..ideal_filled {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
let entry = writer.entry(key);
std::hint::black_box(match entry {
Entry::Occupied(mut e) => { e.insert(val); },
Entry::Vacant(e) => { _ = e.insert(val).unwrap(); },
})
}
},
BatchSize::SmallInput,
)
});
group.bench_function("real_rehash", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut writer = HashMapInit::new_resizeable(size, size).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("real_rehash_hashbrown", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut writer = hashbrown::raw::RawTable::new();
let mut rng = rand::rng();
let hasher = rustc_hash::FxBuildHasher::default();
unsafe {
writer.resize(size, |(k,_)| hasher.hash_one(&k),
hashbrown::raw::Fallibility::Infallible).unwrap();
}
while writer.len() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
writer.insert(hasher.hash_one(&key), (key, val), |(k,_)| hasher.hash_one(&k));
}
b.iter(|| unsafe { writer.table.rehash_in_place(
&|table, index| hasher.hash_one(&table.bucket::<(FileCacheKey, FileCacheEntry)>(index).as_ref().0),
std::mem::size_of::<(FileCacheKey, FileCacheEntry)>(),
if std::mem::needs_drop::<(FileCacheKey, FileCacheEntry)>() {
Some(|ptr| std::ptr::drop_in_place(ptr as *mut (FileCacheKey, FileCacheEntry)))
} else {
None
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut rng = rand::rng();
b.iter_batched(
|| HashMapInit::new_resizeable(size, size * 2).attach_writer(),
|writer| {
for _ in 0..ideal_filled {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
let entry = writer.entry(key);
std::hint::black_box(match entry {
Entry::Occupied(mut e) => {
e.insert(val);
}
Entry::Vacant(e) => {
_ = e.insert(val).unwrap();
}
})
}
},
) });
});
BatchSize::SmallInput,
)
});
for elems in [2, 4, 8, 16, 32, 64, 96, 112] {
group.bench_with_input(BenchmarkId::new("real_rehash_varied", elems), &elems, |b, &size| {
let ideal_filled = size * 1_000_000;
let size = 125_000_000;
let mut writer = HashMapInit::new_resizeable(size, size).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_with_input(BenchmarkId::new("real_rehash_varied_hashbrown", elems), &elems, |b, &size| {
let ideal_filled = size * 1_000_000;
let size = 125_000_000;
let mut writer = hashbrown::raw::RawTable::new();
let mut rng = rand::rng();
let hasher = rustc_hash::FxBuildHasher::default();
unsafe {
writer.resize(size, |(k,_)| hasher.hash_one(&k),
hashbrown::raw::Fallibility::Infallible).unwrap();
}
while writer.len() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
writer.insert(hasher.hash_one(&key), (key, val), |(k,_)| hasher.hash_one(&k));
}
b.iter(|| unsafe { writer.table.rehash_in_place(
&|table, index| hasher.hash_one(&table.bucket::<(FileCacheKey, FileCacheEntry)>(index).as_ref().0),
std::mem::size_of::<(FileCacheKey, FileCacheEntry)>(),
if std::mem::needs_drop::<(FileCacheKey, FileCacheEntry)>() {
Some(|ptr| std::ptr::drop_in_place(ptr as *mut (FileCacheKey, FileCacheEntry)))
} else {
None
},
) });
});
}
group.finish();
group.bench_function("real_rehash", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut writer = HashMapInit::new_resizeable(size, size).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("real_rehash_hashbrown", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut writer = hashbrown::raw::RawTable::new();
let mut rng = rand::rng();
let hasher = rustc_hash::FxBuildHasher::default();
unsafe {
writer
.resize(
size,
|(k, _)| hasher.hash_one(&k),
hashbrown::raw::Fallibility::Infallible,
)
.unwrap();
}
while writer.len() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
writer.insert(hasher.hash_one(&key), (key, val), |(k, _)| {
hasher.hash_one(&k)
});
}
b.iter(|| unsafe {
writer.table.rehash_in_place(
&|table, index| {
hasher.hash_one(
&table
.bucket::<(FileCacheKey, FileCacheEntry)>(index)
.as_ref()
.0,
)
},
std::mem::size_of::<(FileCacheKey, FileCacheEntry)>(),
if std::mem::needs_drop::<(FileCacheKey, FileCacheEntry)>() {
Some(|ptr| std::ptr::drop_in_place(ptr as *mut (FileCacheKey, FileCacheEntry)))
} else {
None
},
)
});
});
for elems in [2, 4, 8, 16, 32, 64, 96, 112] {
group.bench_with_input(
BenchmarkId::new("real_rehash_varied", elems),
&elems,
|b, &size| {
let ideal_filled = size * 1_000_000;
let size = 125_000_000;
let mut writer = HashMapInit::new_resizeable(size, size).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
},
);
group.bench_with_input(
BenchmarkId::new("real_rehash_varied_hashbrown", elems),
&elems,
|b, &size| {
let ideal_filled = size * 1_000_000;
let size = 125_000_000;
let mut writer = hashbrown::raw::RawTable::new();
let mut rng = rand::rng();
let hasher = rustc_hash::FxBuildHasher::default();
unsafe {
writer
.resize(
size,
|(k, _)| hasher.hash_one(&k),
hashbrown::raw::Fallibility::Infallible,
)
.unwrap();
}
while writer.len() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
writer.insert(hasher.hash_one(&key), (key, val), |(k, _)| {
hasher.hash_one(&k)
});
}
b.iter(|| unsafe {
writer.table.rehash_in_place(
&|table, index| {
hasher.hash_one(
&table
.bucket::<(FileCacheKey, FileCacheEntry)>(index)
.as_ref()
.0,
)
},
std::mem::size_of::<(FileCacheKey, FileCacheEntry)>(),
if std::mem::needs_drop::<(FileCacheKey, FileCacheEntry)>() {
Some(|ptr| {
std::ptr::drop_in_place(ptr as *mut (FileCacheKey, FileCacheEntry))
})
} else {
None
},
)
});
},
);
}
group.finish();
}
criterion_group!(benches, small_benchs, real_benchs);
criterion_main!(benches);

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,301 @@
//! Lock-free stable array of buckets managed with a freelist.
//!
//! Since the positions of entries in the dictionary and the bucket array are not correlated,
//! we either had to separately shard both and deal with the overhead of two lock acquisitions
//! per read/write, or make the bucket array lock free. This is *generally* fine since most
//! accesses of the bucket array are done while holding the lock on the corresponding dict shard
//! and thus synchronized. May not hold up to the removals done by the LFC which is a problem.
//!
//! Routines are pretty closely adapted from https://timharris.uk/papers/2001-disc.pdf
//!
//! Notable caveats:
//! - Can only store around 2^30 entries, which is actually only 10x our current workload.
//! - This is because we need two tag bits to distinguish full/empty and marked/unmarked entries.
//! - Has not been seriously tested.
//!
//! Full entries also store the index to their corresponding dictionary entry in order
//! to enable .entry_at_bucket() which is needed for the clock eviction algo in the LFC.
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};
use atomic::Atomic;
#[derive(bytemuck::NoUninit, Clone, Copy, PartialEq, Eq)]
#[repr(transparent)]
pub(crate) struct BucketIdx(pub(super) u32);
// This should always be true as `BucketIdx` is a simple newtype.
const _: () = assert!(Atomic::<BucketIdx>::is_lock_free());
impl BucketIdx {
/// Tag for next pointers in free entries.
const NEXT_TAG: u32 = 0b00 << 30;
/// Tag for marked next pointers in free entries.
const MARK_TAG: u32 = 0b01 << 30;
/// Tag for full entries.
const FULL_TAG: u32 = 0b10 << 30;
/// Reserved. Don't use me.
const RSVD_TAG: u32 = 0b11 << 30;
/// Invalid index within the bucket array (can be mixed with any tag).
pub const INVALID: Self = Self(0x3FFFFFFF);
/// Max index within the bucket array (can be mixed with any tag).
pub const MAX: usize = Self::INVALID.0 as usize - 1;
pub(super) fn is_marked(&self) -> bool {
self.0 & Self::RSVD_TAG == Self::MARK_TAG
}
pub(super) fn as_marked(self) -> Self {
Self((self.0 & Self::INVALID.0) | Self::MARK_TAG)
}
pub(super) fn get_unmarked(self) -> Self {
Self(self.0 & Self::INVALID.0)
}
pub fn new(val: usize) -> Self {
debug_assert!(val < Self::MAX);
Self(val as u32)
}
pub fn new_full(val: usize) -> Self {
debug_assert!(val < Self::MAX);
Self(val as u32 | Self::FULL_TAG)
}
/// Try to extract a valid index if the tag is NEXT.
pub fn next_checked(&self) -> Option<usize> {
if self.0 & Self::RSVD_TAG == Self::NEXT_TAG && *self != Self::INVALID {
Some(self.0 as usize)
} else {
None
}
}
/// Try to extract an index if the tag is FULL.
pub fn full_checked(&self) -> Option<usize> {
if self.0 & Self::RSVD_TAG == Self::FULL_TAG {
Some((self.0 & Self::INVALID.0) as usize)
} else {
None
}
}
}
/// Entry within the bucket array. Value is only initialized if you
pub(crate) struct Bucket<V> {
// Only initialized if `next` field is tagged with FULL.
pub val: MaybeUninit<V>,
// Either points to next entry in freelist if empty or points
// to the corresponding entry in dictionary if full.
pub next: Atomic<BucketIdx>,
}
impl<V> Bucket<V> {
pub fn empty(next: BucketIdx) -> Self {
Self {
val: MaybeUninit::uninit(),
next: Atomic::new(next)
}
}
pub fn as_ref(&self) -> &V {
unsafe { self.val.assume_init_ref() }
}
pub fn as_mut(&mut self) -> &mut V {
unsafe { self.val.assume_init_mut() }
}
pub fn replace(&mut self, new_val: V) -> V {
unsafe { std::mem::replace(self.val.assume_init_mut(), new_val) }
}
}
pub(crate) struct BucketArray<'a, V> {
/// Buckets containing values.
pub(crate) buckets: &'a UnsafeCell<[Bucket<V>]>,
/// Head of the freelist.
pub(crate) free_head: Atomic<BucketIdx>,
/// Maximum index of a bucket allowed to be allocated.
pub(crate) alloc_limit: Atomic<BucketIdx>,
/// The number of currently occupied buckets.
pub(crate) buckets_in_use: AtomicUsize,
// Unclear what the purpose of this is.
pub(crate) _user_list_head: Atomic<BucketIdx>,
}
impl <'a, V> std::ops::Index<usize> for BucketArray<'a, V> {
type Output = Bucket<V>;
fn index(&self, index: usize) -> &Self::Output {
let buckets: &[_] = unsafe { &*(self.buckets.get() as *mut _) };
&buckets[index]
}
}
impl <'a, V> std::ops::IndexMut<usize> for BucketArray<'a, V> {
fn index_mut(&mut self, index: usize) -> &mut Self::Output {
let buckets: &mut [_] = unsafe { &mut *(self.buckets.get() as *mut _) };
&mut buckets[index]
}
}
impl<'a, V> BucketArray<'a, V> {
pub fn new(buckets: &'a UnsafeCell<[Bucket<V>]>) -> Self {
Self {
buckets,
free_head: Atomic::new(BucketIdx(0)),
_user_list_head: Atomic::new(BucketIdx(0)),
alloc_limit: Atomic::new(BucketIdx::INVALID),
buckets_in_use: 0.into(),
}
}
pub fn as_mut_ptr(&self) -> *mut Bucket<V> {
unsafe { (&mut *self.buckets.get()).as_mut_ptr() }
}
pub fn get_mut(&self, index: usize) -> &mut Bucket<V> {
let buckets: &mut [_] = unsafe { &mut *(self.buckets.get() as *mut _) };
&mut buckets[index]
}
pub fn len(&self) -> usize {
unsafe { (&*self.buckets.get()).len() }
}
/// Deallocate a bucket, adding it to the free list.
// Adapted from List::insert in https://timharris.uk/papers/2001-disc.pdf
pub fn dealloc_bucket(&self, pos: usize) -> V {
loop {
let free = self.free_head.load(Ordering::Relaxed);
self[pos].next.store(free, Ordering::Relaxed);
if self.free_head.compare_exchange_weak(
free, BucketIdx::new(pos), Ordering::Relaxed, Ordering::Relaxed
).is_ok() {
self.buckets_in_use.fetch_sub(1, Ordering::Relaxed);
return unsafe { self[pos].val.assume_init_read() };
}
}
}
/// Find a usable bucket at the front of the free list.
// Adapted from List::search in https://timharris.uk/papers/2001-disc.pdf
#[allow(unused_assignments)]
fn find_bucket(&self) -> (BucketIdx, BucketIdx) {
let mut left_node = BucketIdx::INVALID;
let mut right_node = BucketIdx::INVALID;
let mut left_node_next = BucketIdx::INVALID;
loop {
let mut t = BucketIdx::INVALID;
let mut t_next = self.free_head.load(Ordering::Relaxed);
let alloc_limit = self.alloc_limit.load(Ordering::Relaxed).next_checked();
while t_next.is_marked() || t.next_checked()
.map_or(true, |v| alloc_limit.map_or(false, |l| v > l))
{
if !t_next.is_marked() {
left_node = t;
left_node_next = t_next;
}
t = t_next.get_unmarked();
if t == BucketIdx::INVALID { break }
t_next = self[t.0 as usize].next.load(Ordering::Relaxed);
}
right_node = t;
if left_node_next == right_node {
if right_node != BucketIdx::INVALID && self[right_node.0 as usize]
.next.load(Ordering::Relaxed).is_marked()
{
continue;
} else {
return (left_node, right_node);
}
}
let left_ref = if left_node != BucketIdx::INVALID {
&self[left_node.0 as usize].next
} else { &self.free_head };
if left_ref.compare_exchange_weak(
left_node_next, right_node, Ordering::Relaxed, Ordering::Relaxed
).is_ok() {
if right_node != BucketIdx::INVALID && self[right_node.0 as usize]
.next.load(Ordering::Relaxed).is_marked()
{
continue;
} else {
return (left_node, right_node);
}
}
}
}
/// Pop a bucket from the free list.
// Adapted from List::delete in https://timharris.uk/papers/2001-disc.pdf
#[allow(unused_assignments)]
pub(crate) fn alloc_bucket(&self, value: V, key_pos: usize) -> Option<BucketIdx> {
let mut right_node_next = BucketIdx::INVALID;
let mut left_idx = BucketIdx::INVALID;
let mut right_idx = BucketIdx::INVALID;
loop {
(left_idx, right_idx) = self.find_bucket();
if right_idx == BucketIdx::INVALID {
return None;
}
let right = &self[right_idx.0 as usize];
right_node_next = right.next.load(Ordering::Relaxed);
if !right_node_next.is_marked() {
if right.next.compare_exchange_weak(
right_node_next, right_node_next.as_marked(),
Ordering::Relaxed, Ordering::Relaxed
).is_ok() {
break;
}
}
}
let left_ref = if left_idx != BucketIdx::INVALID {
&self[left_idx.0 as usize].next
} else {
&self.free_head
};
if left_ref.compare_exchange_weak(
right_idx, right_node_next,
Ordering::Relaxed, Ordering::Relaxed
).is_err() {
todo!()
}
self.buckets_in_use.fetch_add(1, Ordering::Relaxed);
self[right_idx.0 as usize].next.store(
BucketIdx::new_full(key_pos), Ordering::Relaxed
);
self.get_mut(right_idx.0 as usize).val.write(value);
Some(right_idx)
}
pub fn clear(&mut self) {
for i in 0..self.len() {
self[i] = Bucket::empty(
if i < self.len() - 1 {
BucketIdx::new(i + 1)
} else {
BucketIdx::INVALID
}
);
}
self.free_head.store(BucketIdx(0), Ordering::Relaxed);
self.buckets_in_use.store(0, Ordering::Relaxed);
}
}

View File

@@ -1,37 +1,85 @@
//! Simple hash table with chaining.
//! Sharded linear probing hash table.
//! NOTE/FIXME: one major bug with this design is that the current hashmap DOES NOT TRACK
//! the previous size of the hashmap and thus does lookups incorrectly/badly. This should
//! be a reasonably minor fix?
use std::cell::UnsafeCell;
use std::hash::Hash;
use std::mem::MaybeUninit;
use std::sync::atomic::{Ordering, AtomicUsize};
use crate::hash::entry::*;
use crate::sync::*;
use crate::hash::{
entry::*,
bucket::{BucketArray, Bucket, BucketIdx}
};
/// Invalid position within the map (either within the dictionary or bucket array).
pub(crate) const INVALID_POS: u32 = u32::MAX;
/// Metadata tag for the type of an entry in the hashmap.
#[derive(PartialEq, Eq, Clone, Copy)]
pub(crate) enum EntryTag {
/// An occupied entry inserted after a resize operation.
Occupied,
/// An occupied entry inserted before a resize operation
/// a.k.a. an entry that needs to be rehashed at some point.
Rehash,
/// An entry that was once `Occupied`.
Tombstone,
/// An entry that was once `Rehash`.
RehashTombstone,
/// An empty entry.
Empty,
}
/// Fundamental storage unit within the hash table. Either empty or contains a key-value pair.
/// Always part of a chain of some kind (either a freelist if empty or a hash chain if full).
pub(crate) struct Bucket<K, V> {
/// Index of next bucket in the chain.
pub(crate) next: u32,
/// Key-value pair contained within bucket.
pub(crate) inner: Option<(K, V)>,
/// Searching the chains of a hashmap oftentimes requires interpreting
/// a set of metadata tags differently. This enum encodes the ways a
/// metadata tag can be treated during a lookup.
pub(crate) enum MapEntryType {
/// Should be treated as if it were occupied.
Occupied,
/// Should be treated as if it were a tombstone.
Tombstone,
/// Should be treated as if it were empty.
Empty,
/// Should be ignored.
Skip
}
/// A key within the dictionary component of the hashmap.
pub(crate) struct EntryKey<K> {
// NOTE: This could be split out to save 3 bytes per entry!
// Wasn't sure it was worth the penalty of another shmem area.
pub(crate) tag: EntryTag,
pub(crate) val: MaybeUninit<K>,
}
/// A shard of the dictionary.
pub(crate) struct DictShard<'a, K> {
pub(crate) keys: &'a mut [EntryKey<K>],
pub(crate) idxs: &'a mut [BucketIdx],
}
impl<'a, K> DictShard<'a, K> {
fn len(&self) -> usize {
self.keys.len()
}
}
pub(crate) struct MaybeUninitDictShard<'a, K> {
pub(crate) keys: &'a mut [MaybeUninit<EntryKey<K>>],
pub(crate) idxs: &'a mut [MaybeUninit<BucketIdx>],
}
/// Core hash table implementation.
pub(crate) struct CoreHashMap<'a, K, V> {
/// Dictionary used to map hashes to bucket indices.
pub(crate) dictionary: &'a mut [u32],
/// Buckets containing key-value pairs.
pub(crate) buckets: &'a mut [Bucket<K, V>],
/// Head of the freelist.
pub(crate) free_head: u32,
/// Maximum index of a bucket allowed to be allocated. [`INVALID_POS`] if no limit.
pub(crate) alloc_limit: u32,
/// The number of currently occupied buckets.
pub(crate) buckets_in_use: u32,
// pub(crate) lock: libc::pthread_mutex_t,
// Unclear what the purpose of this is.
pub(crate) _user_list_head: u32,
pub(crate) dict_shards: &'a mut [RwLock<DictShard<'a, K>>],
/// Stable bucket array used to store the values.
pub(crate) bucket_arr: BucketArray<'a, V>,
/// Index of the next entry to process for rehashing.
pub(crate) rehash_index: AtomicUsize,
/// Index of the end of the range to be rehashed.
pub(crate) rehash_end: AtomicUsize,
}
/// Error for when there are no empty buckets left but one is needed.
@@ -39,139 +87,249 @@ pub(crate) struct CoreHashMap<'a, K, V> {
pub struct FullError();
impl<'a, K: Clone + Hash + Eq, V> CoreHashMap<'a, K, V> {
const FILL_FACTOR: f32 = 0.60;
/// Estimate the size of data contained within the the hash map.
pub fn estimate_size(num_buckets: u32) -> usize {
let mut size = 0;
// buckets
size += size_of::<Bucket<K, V>>() * num_buckets as usize;
// dictionary
size += (f32::ceil((size_of::<u32>() * num_buckets as usize) as f32 / Self::FILL_FACTOR))
as usize;
size
}
pub fn new(
buckets: &'a mut [MaybeUninit<Bucket<K, V>>],
dictionary: &'a mut [MaybeUninit<u32>],
buckets_cell: &'a UnsafeCell<[MaybeUninit<Bucket<V>>]>,
dict_shards: &'a mut [RwLock<MaybeUninitDictShard<'a, K>>],
) -> Self {
let buckets = unsafe { &mut *buckets_cell.get() };
// Initialize the buckets
for i in 0..buckets.len() {
buckets[i].write(Bucket {
next: if i < buckets.len() - 1 {
i as u32 + 1
} else {
INVALID_POS
},
inner: None,
});
for i in 0..buckets.len() {
buckets[i].write(Bucket::empty(
if i < buckets.len() - 1 {
BucketIdx::new(i + 1)
} else {
BucketIdx::INVALID
})
);
}
// Initialize the dictionary
for e in dictionary.iter_mut() {
e.write(INVALID_POS);
}
for shard in dict_shards.iter_mut() {
let mut dicts = shard.write();
for e in dicts.keys.iter_mut() {
e.write(EntryKey {
tag: EntryTag::Empty,
val: MaybeUninit::uninit(),
});
}
for e in dicts.idxs.iter_mut() {
e.write(BucketIdx::INVALID);
}
}
let buckets_cell = unsafe {
&*(buckets_cell as *const _ as *const UnsafeCell<_>)
};
// TODO: use std::slice::assume_init_mut() once it stabilizes
let buckets =
unsafe { std::slice::from_raw_parts_mut(buckets.as_mut_ptr().cast(), buckets.len()) };
let dictionary = unsafe {
std::slice::from_raw_parts_mut(dictionary.as_mut_ptr().cast(), dictionary.len())
let dict_shards = unsafe {
std::slice::from_raw_parts_mut(dict_shards.as_mut_ptr().cast(),
dict_shards.len())
};
Self {
dictionary,
buckets,
free_head: 0,
buckets_in_use: 0,
_user_list_head: INVALID_POS,
alloc_limit: INVALID_POS,
dict_shards,
rehash_index: buckets.len().into(),
rehash_end: buckets.len().into(),
bucket_arr: BucketArray::new(buckets_cell),
}
}
/// Get the value associated with a key (if it exists) given its hash.
pub fn get_with_hash(&self, key: &K, hash: u64) -> Option<&V> {
let mut next = self.dictionary[hash as usize % self.dictionary.len()];
loop {
if next == INVALID_POS {
return None;
}
/// Get the value associated with a key (if it exists) given its hash.
pub fn get_with_hash(&'a self, key: &K, hash: u64) -> Option<ValueReadGuard<'a, V>> {
let ind = self.rehash_index.load(Ordering::Relaxed);
let end = self.rehash_end.load(Ordering::Relaxed);
let bucket = &self.buckets[next as usize];
let (bucket_key, bucket_value) = bucket.inner.as_ref().expect("entry is in use");
if bucket_key == key {
return Some(bucket_value);
}
next = bucket.next;
}
}
// First search the chains from the current context (thus treat
// to-be-rehashed entries as tombstones within a current chain).
let res = self.get(key, hash, |tag| match tag {
EntryTag::Empty => MapEntryType::Empty,
EntryTag::Occupied => MapEntryType::Occupied,
_ => MapEntryType::Tombstone,
});
if res.is_some() {
return res;
}
/// Get number of buckets in map.
pub fn get_num_buckets(&self) -> usize {
self.buckets.len()
}
if ind < end {
// Search chains from the previous size of the map if a rehash is in progress.
// Ignore any entries inserted since the resize operation occurred.
self.get(key, hash, |tag| match tag {
EntryTag::Empty => MapEntryType::Empty,
EntryTag::Rehash => MapEntryType::Occupied,
_ => MapEntryType::Tombstone,
})
} else {
None
}
}
pub fn entry_with_hash(&'a mut self, key: K, hash: u64) -> Result<Entry<'a, K, V>, FullError> {
let ind = self.rehash_index.load(Ordering::Relaxed);
let end = self.rehash_end.load(Ordering::Relaxed);
/// Clears all entries from the hashmap.
///
/// Does not reset any allocation limits, but does clear any entries beyond them.
pub fn clear(&mut self) {
for i in 0..self.buckets.len() {
self.buckets[i] = Bucket {
next: if i < self.buckets.len() - 1 {
i as u32 + 1
} else {
INVALID_POS
},
inner: None,
}
}
for i in 0..self.dictionary.len() {
self.dictionary[i] = INVALID_POS;
}
self.free_head = 0;
self.buckets_in_use = 0;
let res = self.entry(key.clone(), hash, |tag| match tag {
EntryTag::Empty => MapEntryType::Empty,
EntryTag::Occupied => MapEntryType::Occupied,
// We can't treat old entries as tombstones here, as we definitely can't
// insert over them! Instead we can just skip directly over them.
EntryTag::Rehash => MapEntryType::Skip,
_ => MapEntryType::Tombstone,
});
if ind < end {
if let Ok(Entry::Occupied(_)) = res {
res
} else {
self.entry(key, hash, |tag| match tag {
EntryTag::Empty => MapEntryType::Empty,
EntryTag::Occupied => MapEntryType::Skip,
EntryTag::Rehash => MapEntryType::Occupied,
_ => MapEntryType::Tombstone
})
}
} else {
res
}
}
fn get<F>(&'a self, key: &K, hash: u64, f: F) -> Option<ValueReadGuard<'a, V>>
where F: Fn(EntryTag) -> MapEntryType
{
let num_buckets = self.get_num_buckets();
let shard_size = num_buckets / self.dict_shards.len();
let bucket_pos = hash as usize % num_buckets;
let shard_start = bucket_pos / shard_size;
for off in 0..self.dict_shards.len() {
let shard_idx = (shard_start + off) % self.dict_shards.len();
let shard = self.dict_shards[shard_idx].read();
let entry_start = if off == 0 { bucket_pos % shard_size } else { 0 };
for entry_idx in entry_start..shard.len() {
match f(shard.keys[entry_idx].tag) {
MapEntryType::Empty => return None,
MapEntryType::Tombstone | MapEntryType::Skip => continue,
MapEntryType::Occupied => {
let cand_key = unsafe { shard.keys[entry_idx].val.assume_init_ref() };
if cand_key == key {
let bucket_idx = shard.idxs[entry_idx].next_checked()
.expect("position is valid");
return Some(RwLockReadGuard::map(
shard, |_| self.bucket_arr[bucket_idx].as_ref()
));
}
},
}
}
}
None
}
/// Find the position of an unused bucket via the freelist and initialize it.
pub(crate) fn alloc_bucket(&mut self, key: K, value: V) -> Result<u32, FullError> {
let mut pos = self.free_head;
pub fn entry<F>(&'a self, key: K, hash: u64, f: F) -> Result<Entry<'a, K, V>, FullError>
where F: Fn(EntryTag) -> MapEntryType
{
// We need to keep holding on the locks for each shard we process since if we don't find the
// key anywhere, we want to insert it at the earliest possible position (which may be several
// shards away). Ideally cross-shard chains are quite rare, so this shouldn't be a big deal.
//
// NB: Somewhat real chance of a deadlock! E.g. one thread has a ridiculously long chain that
// starts at block N and wraps around the hashmap to N-1, yet another thread begins a lookup at
// N-1 during this and has a chain that lasts a few shards. Then thread 1 is blocked on thread 2
// to get to shard N-1 but thread 2 is blocked on thread 1 to get to shard N. Pretty fringe case
// since chains shouldn't last very long, but still a problem with this somewhat naive sharding
// mechanism.
//
// We could fix this by either refusing to hold locks and only inserting into the earliest entry
// within the current shard (which effectively means after a while we forget about certain open
// entries at the end of shards) or by pivoting to a more involved concurrency setup?
let mut shards = Vec::new();
let mut insert_pos = None;
let mut insert_shard = None;
// Find the first bucket we're *allowed* to use.
let mut prev = PrevPos::First(self.free_head);
while pos != INVALID_POS && pos >= self.alloc_limit {
let bucket = &mut self.buckets[pos as usize];
prev = PrevPos::Chained(pos);
pos = bucket.next;
}
if pos == INVALID_POS {
return Err(FullError());
}
// Repair the freelist.
match prev {
PrevPos::First(_) => {
let next_pos = self.buckets[pos as usize].next;
self.free_head = next_pos;
let num_buckets = self.get_num_buckets();
let shard_size = num_buckets / self.dict_shards.len();
let mut entry_pos = hash as usize % num_buckets;
let shard_start = entry_pos / shard_size;
for off in 0..self.dict_shards.len() {
let shard_idx = (shard_start + off) % self.dict_shards.len();
let shard = self.dict_shards[shard_idx].write();
let mut inserted = false;
let entry_start = if off == 0 { entry_pos % shard_size } else { 0 };
for entry_idx in entry_start..shard.len() {
entry_pos += 1;
match f(shard.keys[entry_idx].tag) {
MapEntryType::Skip => continue,
MapEntryType::Empty => {
let ((shard, idx), shard_pos) = match (insert_shard, insert_pos) {
(Some((s, i)), Some(p)) => ((s, i), p),
(None, Some(p)) => ((shard, shard_idx), p),
(None, None) => ((shard, shard_idx), entry_idx),
_ => unreachable!()
};
return Ok(Entry::Vacant(VacantEntry {
_key: key,
shard,
shard_pos,
key_pos: (shard_size * idx) + shard_pos,
bucket_arr: &self.bucket_arr,
}))
},
MapEntryType::Tombstone => {
if insert_pos.is_none() {
insert_pos = Some(entry_idx);
inserted = true;
}
},
MapEntryType::Occupied => {
let cand_key = unsafe { shard.keys[entry_idx].val.assume_init_ref() };
if *cand_key == key {
let bucket_pos = shard.idxs[entry_idx].next_checked().unwrap();
return Ok(Entry::Occupied(OccupiedEntry {
shard,
shard_pos: entry_idx,
bucket_pos,
bucket_arr: &self.bucket_arr,
key_pos: entry_pos,
}));
}
}
}
}
if inserted {
insert_shard = Some((shard, shard_idx));
} else {
shards.push(shard);
}
}
if let (Some((shard, idx)), Some(shard_pos)) = (insert_shard, insert_pos) {
Ok(Entry::Vacant(VacantEntry {
_key: key,
shard,
shard_pos,
key_pos: (shard_size * idx) + shard_pos,
bucket_arr: &self.bucket_arr,
}))
} else {
Err(FullError{})
}
}
/// Get number of buckets in map.
pub fn get_num_buckets(&self) -> usize {
self.bucket_arr.len()
}
pub fn clear(&mut self) {
let mut shards: Vec<_> = self.dict_shards.iter().map(|x| x.write()).collect();
for shard in shards.iter_mut() {
for e in shard.keys.iter_mut() {
e.tag = EntryTag::Empty;
}
for e in shard.idxs.iter_mut() {
*e = BucketIdx::INVALID;
}
PrevPos::Chained(p) => if p != INVALID_POS {
let next_pos = self.buckets[pos as usize].next;
self.buckets[p as usize].next = next_pos;
},
_ => unreachable!()
}
// Initialize the bucket.
let bucket = &mut self.buckets[pos as usize];
self.buckets_in_use += 1;
bucket.next = INVALID_POS;
bucket.inner = Some((key, value));
Ok(pos)
self.bucket_arr.clear();
}
}

View File

@@ -1,139 +1,81 @@
//! Equivalent of [`std::collections::hash_map::Entry`] for this hashmap.
use crate::hash::core::{CoreHashMap, FullError, INVALID_POS};
use crate::hash::{
core::{DictShard, EntryTag},
bucket::{BucketArray, BucketIdx}
};
use crate::sync::{RwLockWriteGuard, ValueWriteGuard};
use std::hash::Hash;
use std::mem;
pub enum Entry<'a, 'b, K, V> {
Occupied(OccupiedEntry<'a, 'b, K, V>),
Vacant(VacantEntry<'a, 'b, K, V>),
pub enum Entry<'a, K, V> {
Occupied(OccupiedEntry<'a, K, V>),
Vacant(VacantEntry<'a, K, V>),
}
/// Enum representing the previous position within a chain.
#[derive(Clone, Copy)]
pub(crate) enum PrevPos {
/// Starting index within the dictionary.
First(u32),
/// Regular index within the buckets.
Chained(u32),
/// Unknown - e.g. the associated entry was retrieved by index instead of chain.
Unknown(u64),
pub struct OccupiedEntry<'a, K, V> {
/// Mutable reference to the shard of the map the entry is in.
pub(crate) shard: RwLockWriteGuard<'a, DictShard<'a, K>>,
/// The position of the entry in the shard.
pub(crate) shard_pos: usize,
/// True logical position of the entry in the map.
pub(crate) key_pos: usize,
/// Mutable reference to the bucket array containing entry.
pub(crate) bucket_arr: &'a BucketArray<'a, V>,
/// The position of the bucket in the [`CoreHashMap`] bucket array.
pub(crate) bucket_pos: usize,
}
pub struct OccupiedEntry<'a, 'b, K, V> {
/// Mutable reference to the map containing this entry.
pub(crate) map: RwLockWriteGuard<'b, CoreHashMap<'a, K, V>>,
/// The key of the occupied entry
pub(crate) _key: K,
/// The index of the previous entry in the chain.
pub(crate) prev_pos: PrevPos,
/// The position of the bucket in the [`CoreHashMap`] bucket array.
pub(crate) bucket_pos: u32,
}
impl<K, V> OccupiedEntry<'_, '_, K, V> {
impl<K, V> OccupiedEntry<'_, K, V> {
pub fn get(&self) -> &V {
&self.map.buckets[self.bucket_pos as usize]
.inner
.as_ref()
.unwrap()
.1
self.bucket_arr[self.bucket_pos].as_ref()
}
pub fn get_mut(&mut self) -> &mut V {
&mut self.map.buckets[self.bucket_pos as usize]
.inner
.as_mut()
.unwrap()
.1
self.bucket_arr.get_mut(self.bucket_pos).as_mut()
}
/// Inserts a value into the entry, replacing (and returning) the existing value.
/// Inserts a value into the entry, replacing (and returning) the existing value.
pub fn insert(&mut self, value: V) -> V {
let bucket = &mut self.map.buckets[self.bucket_pos as usize];
// This assumes inner is Some, which it must be for an OccupiedEntry
mem::replace(&mut bucket.inner.as_mut().unwrap().1, value)
self.bucket_arr.get_mut(self.bucket_pos).replace(value)
}
/// Removes the entry from the hash map, returning the value originally stored within it.
///
/// This may result in multiple bucket accesses if the entry was obtained by index as the
/// previous chain entry needs to be discovered in this case.
///
/// # Panics
/// Panics if the `prev_pos` field is equal to [`PrevPos::Unknown`]. In practice, this means
/// the entry was obtained via calling something like [`CoreHashMap::entry_at_bucket`].
pub fn remove(mut self) -> V {
// If this bucket was queried by index, go ahead and follow its chain from the start.
let prev = if let PrevPos::Unknown(hash) = self.prev_pos {
let dict_idx = hash as usize % self.map.dictionary.len();
let mut prev = PrevPos::First(dict_idx as u32);
let mut curr = self.map.dictionary[dict_idx];
while curr != self.bucket_pos {
curr = self.map.buckets[curr as usize].next;
prev = PrevPos::Chained(curr);
}
prev
} else {
self.prev_pos
};
// CoreHashMap::remove returns Option<(K, V)>. We know it's Some for an OccupiedEntry.
let bucket = &mut self.map.buckets[self.bucket_pos as usize];
// unlink it from the chain
match prev {
PrevPos::First(dict_pos) => {
self.map.dictionary[dict_pos as usize] = bucket.next;
},
PrevPos::Chained(bucket_pos) => {
// println!("we think prev of {} is {bucket_pos}", self.bucket_pos);
self.map.buckets[bucket_pos as usize].next = bucket.next;
},
_ => unreachable!(),
}
// and add it to the freelist
let free = self.map.free_head;
let bucket = &mut self.map.buckets[self.bucket_pos as usize];
let old_value = bucket.inner.take();
bucket.next = free;
self.map.free_head = self.bucket_pos;
self.map.buckets_in_use -= 1;
old_value.unwrap().1
/// Removes the entry from the hash map, returning the value originally stored within it.
pub fn remove(&mut self) -> V {
self.shard.idxs[self.shard_pos] = BucketIdx::INVALID;
self.shard.keys[self.shard_pos].tag = EntryTag::Tombstone;
self.bucket_arr.dealloc_bucket(self.bucket_pos)
}
}
/// An abstract view into a vacant entry within the map.
pub struct VacantEntry<'a, 'b, K, V> {
/// Mutable reference to the map containing this entry.
pub(crate) map: RwLockWriteGuard<'b, CoreHashMap<'a, K, V>>,
/// The key to be inserted into this entry.
pub(crate) key: K,
/// The position within the dictionary corresponding to the key's hash.
pub(crate) dict_pos: u32,
pub struct VacantEntry<'a, K, V> {
/// The key of the occupied entry
pub(crate) _key: K,
/// Mutable reference to the shard of the map the entry is in.
pub(crate) shard: RwLockWriteGuard<'a, DictShard<'a, K>>,
/// The position of the entry in the shard.
pub(crate) shard_pos: usize,
/// True logical position of the entry in the map.
pub(crate) key_pos: usize,
/// Mutable reference to the bucket array containing entry.
pub(crate) bucket_arr: &'a BucketArray<'a, V>,
}
impl<'b, K: Clone + Hash + Eq, V> VacantEntry<'_, 'b, K, V> {
/// Insert a value into the vacant entry, finding and populating an empty bucket in the process.
///
/// # Errors
/// Will return [`FullError`] if there are no unoccupied buckets in the map.
pub fn insert(mut self, value: V) -> Result<ValueWriteGuard<'b, V>, FullError> {
let pos = self.map.alloc_bucket(self.key, value)?;
if pos == INVALID_POS {
return Err(FullError());
}
self.map.buckets[pos as usize].next = self.map.dictionary[self.dict_pos as usize];
self.map.dictionary[self.dict_pos as usize] = pos;
impl<'a, K: Clone + Hash + Eq, V> VacantEntry<'a, K, V> {
/// Insert a value into the vacant entry, finding and populating an empty bucket in the process.
pub fn insert(mut self, value: V) -> ValueWriteGuard<'a, V> {
let pos = self.bucket_arr.alloc_bucket(value, self.key_pos)
.expect("bucket is available if entry is");
self.shard.keys[self.shard_pos].tag = EntryTag::Occupied;
self.shard.keys[self.shard_pos].val.write(self._key);
let idx = pos.next_checked().expect("position is valid");
self.shard.idxs[self.shard_pos] = pos;
Ok(RwLockWriteGuard::map(
self.map,
|m| &mut m.buckets[pos as usize].inner.as_mut().unwrap().1
))
RwLockWriteGuard::map(self.shard, |_| {
self.bucket_arr.get_mut(idx).as_mut()
})
}
}

View File

@@ -3,9 +3,9 @@ use std::collections::HashSet;
use std::fmt::Debug;
use std::mem::MaybeUninit;
use crate::hash::Entry;
use crate::hash::HashMapAccess;
use crate::hash::HashMapInit;
use crate::hash::Entry;
use crate::hash::core::FullError;
use rand::seq::SliceRandom;
@@ -35,20 +35,20 @@ impl<'a> From<&'a [u8]> for TestKey {
}
}
fn test_inserts<K: Into<TestKey> + Copy>(keys: &[K]) {
let w = HashMapInit::<TestKey, usize>::new_resizeable_named(
100000, 120000, "test_inserts"
).attach_writer();
fn test_inserts<K: Into<TestKey> + Copy>(keys: &[K]) {
let w = HashMapInit::<TestKey, usize>::new_resizeable_named(100000, 120000, 100, "test_inserts")
.attach_writer();
for (idx, k) in keys.iter().enumerate() {
let res = w.entry((*k).into());
match res {
Entry::Occupied(mut e) => { e.insert(idx); }
Entry::Vacant(e) => {
let res = e.insert(idx);
assert!(res.is_ok());
},
};
let res = w.entry((*k).into());
match res.unwrap() {
Entry::Occupied(mut e) => {
e.insert(idx);
}
Entry::Vacant(e) => {
_ = e.insert(idx);
}
};
}
for (idx, k) in keys.iter().enumerate() {
@@ -83,7 +83,7 @@ fn sparse() {
for _ in 0..10000 {
loop {
let key = rand::random::<u128>();
if used_keys.get(&key).is_some() {
if used_keys.contains(&key) {
continue;
}
used_keys.insert(key);
@@ -109,79 +109,85 @@ fn apply_op(
shadow.remove(&op.0)
};
let entry = map.entry(op.0);
let entry = map.entry(op.0);
let hash_existing = match op.1 {
Some(new) => {
match entry {
Entry::Occupied(mut e) => Some(e.insert(new)),
Entry::Vacant(e) => { _ = e.insert(new).unwrap(); None },
}
},
None => {
match entry {
Entry::Occupied(e) => Some(e.remove()),
Entry::Vacant(_) => None,
}
},
};
Some(new) => match entry.unwrap() {
Entry::Occupied(mut e) => Some(e.insert(new)),
Entry::Vacant(e) => {
_ = e.insert(new);
None
}
},
None => match entry.unwrap() {
Entry::Occupied(mut e) => Some(e.remove()),
Entry::Vacant(_) => None,
},
};
assert_eq!(shadow_existing, hash_existing);
assert_eq!(shadow_existing, hash_existing);
}
fn do_random_ops(
num_ops: usize,
size: u32,
del_prob: f64,
writer: &mut HashMapAccess<TestKey, usize>,
shadow: &mut BTreeMap<TestKey, usize>,
rng: &mut rand::rngs::ThreadRng,
num_ops: usize,
size: u32,
del_prob: f64,
writer: &mut HashMapAccess<TestKey, usize>,
shadow: &mut BTreeMap<TestKey, usize>,
rng: &mut rand::rngs::ThreadRng,
) {
for i in 0..num_ops {
for i in 0..num_ops {
let key: TestKey = ((rng.next_u32() % size) as u128).into();
let op = TestOp(key, if rng.random_bool(del_prob) { Some(i) } else { None });
let op = TestOp(
key,
if rng.random_bool(del_prob) {
Some(i)
} else {
None
},
);
apply_op(&op, writer, shadow);
}
}
fn do_deletes(
num_ops: usize,
writer: &mut HashMapAccess<TestKey, usize>,
shadow: &mut BTreeMap<TestKey, usize>,
num_ops: usize,
writer: &mut HashMapAccess<TestKey, usize>,
shadow: &mut BTreeMap<TestKey, usize>,
) {
for _ in 0..num_ops {
let (k, _) = shadow.pop_first().unwrap();
writer.remove(&k);
}
for _ in 0..num_ops {
let (k, _) = shadow.pop_first().unwrap();
writer.remove(&k);
}
}
fn do_shrink(
writer: &mut HashMapAccess<TestKey, usize>,
shadow: &mut BTreeMap<TestKey, usize>,
to: u32
writer: &mut HashMapAccess<TestKey, usize>,
shadow: &mut BTreeMap<TestKey, usize>,
to: usize,
) {
assert!(writer.shrink_goal().is_none());
writer.begin_shrink(to);
assert_eq!(writer.shrink_goal(), Some(to as usize));
while writer.get_num_buckets_in_use() > to as usize {
let (k, _) = shadow.pop_first().unwrap();
let entry = writer.entry(k);
if let Entry::Occupied(e) = entry {
e.remove();
}
}
let old_usage = writer.get_num_buckets_in_use();
writer.finish_shrink().unwrap();
assert!(writer.shrink_goal().is_none());
assert_eq!(writer.get_num_buckets_in_use(), old_usage);
assert!(writer.shrink_goal().is_none());
writer.begin_shrink(to);
assert_eq!(writer.shrink_goal(), Some(to as usize));
while writer.get_num_buckets_in_use() > to as usize {
let (k, _) = shadow.pop_first().unwrap();
let entry = writer.entry(k).unwrap();
if let Entry::Occupied(mut e) = entry {
e.remove();
}
}
let old_usage = writer.get_num_buckets_in_use();
writer.finish_shrink().unwrap();
assert!(writer.shrink_goal().is_none());
assert_eq!(writer.get_num_buckets_in_use(), old_usage);
}
#[test]
fn random_ops() {
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
100000, 120000, "test_random"
).attach_writer();
let mut writer =
HashMapInit::<TestKey, usize>::new_resizeable_named(100000, 120000, 10, "test_random")
.attach_writer();
let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
let distribution = Zipf::new(u128::MAX as f64, 1.1).unwrap();
let mut rng = rand::rng();
for i in 0..100000 {
@@ -193,234 +199,230 @@ fn random_ops() {
}
}
// #[test]
// fn test_shuffle() {
// let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(1000, 1200, 10, "test_shuf")
// .attach_writer();
// let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
// let mut rng = rand::rng();
#[test]
fn test_shuffle() {
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1000, 1200, "test_shuf"
).attach_writer();
let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
let mut rng = rand::rng();
// do_random_ops(10000, 1000, 0.75, &mut writer, &mut shadow, &mut rng);
// writer.shuffle();
// do_random_ops(10000, 1000, 0.75, &mut writer, &mut shadow, &mut rng);
// }
do_random_ops(10000, 1000, 0.75, &mut writer, &mut shadow, &mut rng);
writer.shuffle();
do_random_ops(10000, 1000, 0.75, &mut writer, &mut shadow, &mut rng);
}
// #[test]
// fn test_grow() {
// let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(1000, 2000, 10, "test_grow")
// .attach_writer();
// let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
// let mut rng = rand::rng();
#[test]
fn test_grow() {
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1000, 2000, "test_grow"
).attach_writer();
let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
let mut rng = rand::rng();
do_random_ops(10000, 1000, 0.75, &mut writer, &mut shadow, &mut rng);
let old_usage = writer.get_num_buckets_in_use();
writer.grow(1500).unwrap();
assert_eq!(writer.get_num_buckets_in_use(), old_usage);
assert_eq!(writer.get_num_buckets(), 1500);
do_random_ops(10000, 1500, 0.75, &mut writer, &mut shadow, &mut rng);
}
// do_random_ops(10000, 1000, 0.75, &mut writer, &mut shadow, &mut rng);
// let old_usage = writer.get_num_buckets_in_use();
// writer.grow(1500).unwrap();
// assert_eq!(writer.get_num_buckets_in_use(), old_usage);
// assert_eq!(writer.get_num_buckets(), 1500);
// do_random_ops(10000, 1500, 0.75, &mut writer, &mut shadow, &mut rng);
// }
#[test]
fn test_clear() {
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1500, 2000, "test_clear"
).attach_writer();
let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
let mut rng = rand::rng();
do_random_ops(2000, 1500, 0.75, &mut writer, &mut shadow, &mut rng);
writer.clear();
assert_eq!(writer.get_num_buckets_in_use(), 0);
assert_eq!(writer.get_num_buckets(), 1500);
while let Some((key, _)) = shadow.pop_first() {
assert!(writer.get(&key).is_none());
}
do_random_ops(2000, 1500, 0.75, &mut writer, &mut shadow, &mut rng);
for i in 0..(1500 - writer.get_num_buckets_in_use()) {
writer.insert((1500 + i as u128).into(), 0).unwrap();
}
assert_eq!(writer.insert(5000.into(), 0), Err(FullError {}));
writer.clear();
assert!(writer.insert(5000.into(), 0).is_ok());
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(1500, 2000, 10, "test_clear")
.attach_writer();
// let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
// let mut rng = rand::rng();
// do_random_ops(2000, 1500, 0.75, &mut writer, &mut shadow, &mut rng);
// writer.clear();
// assert_eq!(writer.get_num_buckets_in_use(), 0);
// assert_eq!(writer.get_num_buckets(), 1500);
// while let Some((key, _)) = shadow.pop_first() {
// assert!(writer.get(&key).is_none());
// }
// do_random_ops(2000, 1500, 0.75, &mut writer, &mut shadow, &mut rng);
// for i in 0..(1500 - writer.get_num_buckets_in_use()) {
// writer.insert((1500 + i as u128).into(), 0).unwrap();
// }
// assert_eq!(writer.insert(5000.into(), 0), Err(FullError {}));
// writer.clear();
// assert!(writer.insert(5000.into(), 0).is_ok());
}
#[test]
fn test_idx_remove() {
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1500, 2000, "test_clear"
).attach_writer();
let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
let mut rng = rand::rng();
do_random_ops(2000, 1500, 0.25, &mut writer, &mut shadow, &mut rng);
for _ in 0..100 {
let idx = (rng.next_u32() % 1500) as usize;
if let Some(e) = writer.entry_at_bucket(idx) {
shadow.remove(&e._key);
e.remove();
}
}
while let Some((key, val)) = shadow.pop_first() {
assert_eq!(*writer.get(&key).unwrap(), val);
}
}
// #[test]
// fn test_idx_remove() {
// let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(1500, 2000, 10, "test_clear")
// .attach_writer();
// let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
// let mut rng = rand::rng();
// do_random_ops(2000, 1500, 0.25, &mut writer, &mut shadow, &mut rng);
// for _ in 0..100 {
// let idx = (rng.next_u32() % 1500) as usize;
// if let Some(e) = writer.entry_at_bucket(idx) {
// shadow.remove(&e._key);
// e.remove();
// }
// }
// while let Some((key, val)) = shadow.pop_first() {
// assert_eq!(*writer.get(&key).unwrap(), val);
// }
// }
#[test]
fn test_idx_get() {
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1500, 2000, "test_clear"
).attach_writer();
let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
let mut rng = rand::rng();
do_random_ops(2000, 1500, 0.25, &mut writer, &mut shadow, &mut rng);
for _ in 0..100 {
let idx = (rng.next_u32() % 1500) as usize;
if let Some(pair) = writer.get_at_bucket(idx) {
{
let v: *const usize = &pair.1;
assert_eq!(writer.get_bucket_for_value(v), idx);
}
{
let v: *const usize = &pair.1;
assert_eq!(writer.get_bucket_for_value(v), idx);
}
}
}
}
// #[test]
// fn test_idx_get() {
// let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(1500, 2000, "test_clear")
// .attach_writer();
// let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
// let mut rng = rand::rng();
// do_random_ops(2000, 1500, 0.25, &mut writer, &mut shadow, &mut rng);
// for _ in 0..100 {
// let idx = (rng.next_u32() % 1500) as usize;
// if let Some(pair) = writer.get_at_bucket(idx) {
// {
// let v: *const usize = &pair.1;
// assert_eq!(writer.get_bucket_for_value(v), idx);
// }
// {
// let v: *const usize = &pair.1;
// assert_eq!(writer.get_bucket_for_value(v), idx);
// }
// }
// }
// }
#[test]
fn test_shrink() {
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1500, 2000, "test_shrink"
).attach_writer();
let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
let mut rng = rand::rng();
do_random_ops(10000, 1500, 0.75, &mut writer, &mut shadow, &mut rng);
do_shrink(&mut writer, &mut shadow, 1000);
assert_eq!(writer.get_num_buckets(), 1000);
do_deletes(500, &mut writer, &mut shadow);
do_random_ops(10000, 500, 0.75, &mut writer, &mut shadow, &mut rng);
assert!(writer.get_num_buckets_in_use() <= 1000);
}
// #[test]
// fn test_shrink() {
// let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(1500, 2000, "test_shrink")
// .attach_writer();
// let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
// let mut rng = rand::rng();
#[test]
fn test_shrink_grow_seq() {
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1000, 20000, "test_grow_seq"
).attach_writer();
let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
let mut rng = rand::rng();
// do_random_ops(10000, 1500, 0.75, &mut writer, &mut shadow, &mut rng);
// do_shrink(&mut writer, &mut shadow, 1000);
// assert_eq!(writer.get_num_buckets(), 1000);
// do_deletes(500, &mut writer, &mut shadow);
// do_random_ops(10000, 500, 0.75, &mut writer, &mut shadow, &mut rng);
// assert!(writer.get_num_buckets_in_use() <= 1000);
// }
do_random_ops(500, 1000, 0.1, &mut writer, &mut shadow, &mut rng);
eprintln!("Shrinking to 750");
do_shrink(&mut writer, &mut shadow, 750);
do_random_ops(200, 1000, 0.5, &mut writer, &mut shadow, &mut rng);
eprintln!("Growing to 1500");
writer.grow(1500).unwrap();
do_random_ops(600, 1500, 0.1, &mut writer, &mut shadow, &mut rng);
eprintln!("Shrinking to 200");
while shadow.len() > 100 {
do_deletes(1, &mut writer, &mut shadow);
}
do_shrink(&mut writer, &mut shadow, 200);
do_random_ops(50, 1500, 0.25, &mut writer, &mut shadow, &mut rng);
eprintln!("Growing to 10k");
writer.grow(10000).unwrap();
do_random_ops(10000, 5000, 0.25, &mut writer, &mut shadow, &mut rng);
}
// #[test]
// fn test_shrink_grow_seq() {
// let mut writer =
// HashMapInit::<TestKey, usize>::new_resizeable_named(1000, 20000, "test_grow_seq")
// .attach_writer();
// let mut shadow: std::collections::BTreeMap<TestKey, usize> = BTreeMap::new();
// let mut rng = rand::rng();
// do_random_ops(500, 1000, 0.1, &mut writer, &mut shadow, &mut rng);
// eprintln!("Shrinking to 750");
// do_shrink(&mut writer, &mut shadow, 750);
// do_random_ops(200, 1000, 0.5, &mut writer, &mut shadow, &mut rng);
// eprintln!("Growing to 1500");
// writer.grow(1500).unwrap();
// do_random_ops(600, 1500, 0.1, &mut writer, &mut shadow, &mut rng);
// eprintln!("Shrinking to 200");
// while shadow.len() > 100 {
// do_deletes(1, &mut writer, &mut shadow);
// }
// do_shrink(&mut writer, &mut shadow, 200);
// do_random_ops(50, 1500, 0.25, &mut writer, &mut shadow, &mut rng);
// eprintln!("Growing to 10k");
// writer.grow(10000).unwrap();
// do_random_ops(10000, 5000, 0.25, &mut writer, &mut shadow, &mut rng);
// }
#[test]
fn test_bucket_ops() {
let writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1000, 1200, "test_bucket_ops"
).attach_writer();
match writer.entry(1.into()) {
Entry::Occupied(mut e) => { e.insert(2); },
Entry::Vacant(e) => { _ = e.insert(2).unwrap(); },
}
assert_eq!(writer.get_num_buckets_in_use(), 1);
assert_eq!(writer.get_num_buckets(), 1000);
assert_eq!(*writer.get(&1.into()).unwrap(), 2);
let pos = match writer.entry(1.into()) {
Entry::Occupied(e) => {
assert_eq!(e._key, 1.into());
let pos = e.bucket_pos as usize;
pos
},
Entry::Vacant(_) => { panic!("Insert didn't affect entry"); },
};
assert_eq!(writer.entry_at_bucket(pos).unwrap()._key, 1.into());
assert_eq!(*writer.get_at_bucket(pos).unwrap(), (1.into(), 2));
{
let ptr: *const usize = &*writer.get(&1.into()).unwrap();
assert_eq!(writer.get_bucket_for_value(ptr), pos);
}
writer.remove(&1.into());
assert!(writer.get(&1.into()).is_none());
let writer = HashMapInit::<TestKey, usize>::new_resizeable_named(1000, 1200, 10, "test_bucket_ops")
.attach_writer();
match writer.entry(1.into()).unwrap() {
Entry::Occupied(mut e) => {
e.insert(2);
}
Entry::Vacant(e) => {
_ = e.insert(2);
},
}
assert_eq!(writer.get_num_buckets_in_use(), 1);
assert_eq!(writer.get_num_buckets(), 1000);
assert_eq!(*writer.get(&1.into()).unwrap(), 2);
let pos = match writer.entry(1.into()).unwrap() {
Entry::Occupied(e) => {
assert_eq!(e._key, 1.into());
let pos = e.bucket_pos as usize;
pos
}
Entry::Vacant(_) => {
panic!("Insert didn't affect entry");
}
};
assert_eq!(unsafe { writer.get_at_bucket(pos).unwrap() }, &2);
{
let ptr: *const usize = &*writer.get(&1.into()).unwrap();
assert_eq!(writer.get_bucket_for_value(ptr), pos);
}
writer.remove(&1.into());
assert!(writer.get(&1.into()).is_none());
}
#[test]
fn test_shrink_zero() {
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1500, 2000, "test_shrink_zero"
).attach_writer();
writer.begin_shrink(0);
for i in 0..1500 {
writer.entry_at_bucket(i).map(|x| x.remove());
}
writer.finish_shrink().unwrap();
assert_eq!(writer.get_num_buckets_in_use(), 0);
let entry = writer.entry(1.into());
if let Entry::Vacant(v) = entry {
assert!(v.insert(2).is_err());
} else {
panic!("Somehow got non-vacant entry in empty map.")
}
writer.grow(50).unwrap();
let entry = writer.entry(1.into());
if let Entry::Vacant(v) = entry {
assert!(v.insert(2).is_ok());
} else {
panic!("Somehow got non-vacant entry in empty map.")
}
assert_eq!(writer.get_num_buckets_in_use(), 1);
}
// #[test]
// fn test_shrink_zero() {
// let mut writer =
// HashMapInit::<TestKey, usize>::new_resizeable_named(1500, 2000, "test_shrink_zero")
// .attach_writer();
// writer.begin_shrink(0);
// for i in 0..1500 {
// writer.entry_at_bucket(i).map(|x| x.remove());
// }
// writer.finish_shrink().unwrap();
// assert_eq!(writer.get_num_buckets_in_use(), 0);
// let entry = writer.entry(1.into());
// if let Entry::Vacant(v) = entry {
// assert!(v.insert(2).is_err());
// } else {
// panic!("Somehow got non-vacant entry in empty map.")
// }
// writer.grow(50).unwrap();
// let entry = writer.entry(1.into());
// if let Entry::Vacant(v) = entry {
// assert!(v.insert(2).is_ok());
// } else {
// panic!("Somehow got non-vacant entry in empty map.")
// }
// assert_eq!(writer.get_num_buckets_in_use(), 1);
// }
#[test]
#[should_panic]
fn test_grow_oom() {
let writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1500, 2000, "test_grow_oom"
).attach_writer();
writer.grow(20000).unwrap();
}
// #[test]
// #[should_panic]
// fn test_grow_oom() {
// let writer = HashMapInit::<TestKey, usize>::new_resizeable_named(1500, 2000, "test_grow_oom")
// .attach_writer();
// writer.grow(20000).unwrap();
// }
#[test]
#[should_panic]
fn test_shrink_bigger() {
let mut writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1500, 2500, "test_shrink_bigger"
).attach_writer();
writer.begin_shrink(2000);
}
// #[test]
// #[should_panic]
// fn test_shrink_bigger() {
// let mut writer =
// HashMapInit::<TestKey, usize>::new_resizeable_named(1500, 2500, "test_shrink_bigger")
// .attach_writer();
// writer.begin_shrink(2000);
// }
#[test]
#[should_panic]
fn test_shrink_early_finish() {
let writer = HashMapInit::<TestKey, usize>::new_resizeable_named(
1500, 2500, "test_shrink_early_finish"
).attach_writer();
writer.finish_shrink().unwrap();
}
// #[test]
// #[should_panic]
// fn test_shrink_early_finish() {
// let writer =
// HashMapInit::<TestKey, usize>::new_resizeable_named(1500, 2500, "test_shrink_early_finish")
// .attach_writer();
// writer.finish_shrink().unwrap();
// }
#[test]
#[should_panic]
fn test_shrink_fixed_size() {
let mut area = [MaybeUninit::uninit(); 10000];
let init_struct = HashMapInit::<TestKey, usize>::with_fixed(3, &mut area);
let mut writer = init_struct.attach_writer();
writer.begin_shrink(1);
}
// #[test]
// #[should_panic]
// fn test_shrink_fixed_size() {
// let mut area = [MaybeUninit::uninit(); 10000];
// let init_struct = HashMapInit::<TestKey, usize>::with_fixed(3, &mut area);
// let mut writer = init_struct.attach_writer();
// writer.begin_shrink(1);
// }

View File

@@ -76,19 +76,15 @@ impl ShmemHandle {
Self::new_with_fd(fd, initial_size, max_size)
}
fn new_with_fd(
fd: OwnedFd,
initial_size: usize,
max_size: usize,
) -> Result<Self, Error> {
fn new_with_fd(fd: OwnedFd, initial_size: usize, max_size: usize) -> Result<Self, Error> {
// We reserve the high-order bit for the `RESIZE_IN_PROGRESS` flag, and the actual size
// is a little larger than this because of the SharedStruct header. Make the upper limit
// somewhat smaller than that, because with anything close to that, you'll run out of
// memory anyway.
assert!(max_size < 1 << 48, "max size {max_size} too large");
assert!(
initial_size <= max_size,
initial_size <= max_size,
"initial size {initial_size} larger than max size {max_size}"
);
@@ -150,12 +146,12 @@ impl ShmemHandle {
let shared = self.shared();
assert!(
new_size <= self.max_size,
new_size <= self.max_size,
"new size ({new_size}) is greater than max size ({})",
self.max_size
self.max_size
);
assert_eq!(self.max_size, shared.max_size);
assert_eq!(self.max_size, shared.max_size);
// Lock the area by setting the bit in `current_size`
//
@@ -187,9 +183,8 @@ impl ShmemHandle {
let result = {
use std::cmp::Ordering::{Equal, Greater, Less};
match new_size.cmp(&old_size) {
Less => nix_ftruncate(&self.fd, new_size as i64).map_err(|e| {
Error::new("could not shrink shmem segment, ftruncate failed", e)
}),
Less => nix_ftruncate(&self.fd, new_size as i64)
.map_err(|e| Error::new("could not shrink shmem segment, ftruncate failed", e)),
Equal => Ok(()),
Greater => enlarge_file(self.fd.as_fd(), new_size as u64),
}
@@ -207,7 +202,7 @@ impl ShmemHandle {
/// Returns the current user-visible size of the shared memory segment.
///
/// NOTE: a concurrent [`ShmemHandle::set_size()`] call can change the size at any time.
/// It is the caller's responsibility not to access the area beyond the current size.
/// It is the caller's responsibility not to access the area beyond the current size.
pub fn current_size(&self) -> usize {
let total_current_size =
self.shared().current_size.load(Ordering::Relaxed) & !RESIZE_IN_PROGRESS;
@@ -253,12 +248,8 @@ fn enlarge_file(fd: BorrowedFd, size: u64) -> Result<(), Error> {
// we don't get a segfault later when trying to actually use it.
#[cfg(not(target_os = "macos"))]
{
nix::fcntl::posix_fallocate(fd, 0, size as i64).map_err(|e| {
Error::new(
"could not grow shmem segment, posix_fallocate failed",
e,
)
})
nix::fcntl::posix_fallocate(fd, 0, size as i64)
.map_err(|e| Error::new("could not grow shmem segment, posix_fallocate failed", e))
}
// As a fallback on macos, which doesn't have posix_fallocate, use plain 'fallocate'
#[cfg(target_os = "macos")]
@@ -279,7 +270,7 @@ mod tests {
fn assert_range(ptr: *const u8, expected: u8, range: Range<usize>) {
for i in range {
let b = unsafe { *(ptr.add(i)) };
assert_eq!(expected, b, "unexpected byte at offset {}", i);
assert_eq!(expected, b, "unexpected byte at offset {i}");
}
}

View File

@@ -6,16 +6,24 @@ use std::ptr::NonNull;
use nix::errno::Errno;
pub type RwLock<T> = lock_api::RwLock<PthreadRwLock, T>;
pub type Mutex<T> = lock_api::Mutex<PthreadMutex, T>;
pub(crate) type RwLockReadGuard<'a, T> = lock_api::RwLockReadGuard<'a, PthreadRwLock, T>;
pub type RwLockWriteGuard<'a, T> = lock_api::RwLockWriteGuard<'a, PthreadRwLock, T>;
pub type ValueReadGuard<'a, T> = lock_api::MappedRwLockReadGuard<'a, PthreadRwLock, T>;
pub type ValueWriteGuard<'a, T> = lock_api::MappedRwLockWriteGuard<'a, PthreadRwLock, T>;
/// Shared memory read-write lock.
/// Wrapper around a pointer to a [`libc::pthread_rwlock_t`].
///
/// `PthreadRwLock(None)` is an invalid state for this type. It only exists because the
/// [`lock_api::RawRwLock`] trait has a mandatory `INIT` const member to allow for static
/// initialization of the lock. Unfortunately, pthread seemingly does not support any way
/// to statically initialize a `pthread_rwlock_t` with `PTHREAD_PROCESS_SHARED` set. However,
/// `lock_api` allows manual construction and seemingly doesn't use `INIT` itself so for
/// now it's set to this invalid value to satisfy the trait constraints.
pub struct PthreadRwLock(Option<NonNull<libc::pthread_rwlock_t>>);
impl PthreadRwLock {
pub fn new(lock: *mut libc::pthread_rwlock_t) -> Self {
pub fn new(lock: NonNull<libc::pthread_rwlock_t>) -> Self {
unsafe {
let mut attrs = MaybeUninit::uninit();
// Ignoring return value here - only possible error is OOM.
@@ -24,34 +32,40 @@ impl PthreadRwLock {
attrs.as_mut_ptr(),
libc::PTHREAD_PROCESS_SHARED
);
// TODO(quantumish): worth making this function return Result?
libc::pthread_rwlock_init(lock, attrs.as_mut_ptr());
// TODO(quantumish): worth making this function fallible?
libc::pthread_rwlock_init(lock.as_ptr(), attrs.as_mut_ptr());
// Safety: POSIX specifies that "any function affecting the attributes
// object (including destruction) shall not affect any previously
// initialized read-write locks".
libc::pthread_rwlockattr_destroy(attrs.as_mut_ptr());
Self(Some(NonNull::new_unchecked(lock)))
Self(Some(lock))
}
}
fn inner(&self) -> NonNull<libc::pthread_rwlock_t> {
match self.0 {
None => panic!("PthreadRwLock constructed badly - something likely used RawMutex::INIT"),
Some(x) => x,
self.0.unwrap_or_else(
|| panic!("PthreadRwLock constructed badly - something likely used RawRwLock::INIT")
)
}
fn unlock(&self) {
unsafe {
let res = libc::pthread_rwlock_unlock(self.inner().as_ptr());
assert!(res == 0, "unlock failed with {}", Errno::from_raw(res));
}
}
}
unsafe impl lock_api::RawRwLock for PthreadRwLock {
type GuardMarker = lock_api::GuardSend;
/// *DO NOT USE THIS.* See [`PthreadRwLock`] for the full explanation.
const INIT: Self = Self(None);
fn lock_shared(&self) {
unsafe {
let res = libc::pthread_rwlock_rdlock(self.inner().as_ptr());
if res != 0 {
panic!("rdlock failed with {}", Errno::from_raw(res));
}
assert!(res == 0, "rdlock failed with {}", Errno::from_raw(res));
}
}
@@ -61,7 +75,7 @@ unsafe impl lock_api::RawRwLock for PthreadRwLock {
match res {
0 => true,
libc::EAGAIN => false,
o => panic!("try_rdlock failed with {}", Errno::from_raw(res)),
o => panic!("try_rdlock failed with {}", Errno::from_raw(o)),
}
}
}
@@ -69,9 +83,7 @@ unsafe impl lock_api::RawRwLock for PthreadRwLock {
fn lock_exclusive(&self) {
unsafe {
let res = libc::pthread_rwlock_wrlock(self.inner().as_ptr());
if res != 0 {
panic!("wrlock failed with {}", Errno::from_raw(res));
}
assert!(res == 0, "wrlock failed with {}", Errno::from_raw(res));
}
}
@@ -81,25 +93,77 @@ unsafe impl lock_api::RawRwLock for PthreadRwLock {
match res {
0 => true,
libc::EAGAIN => false,
o => panic!("try_wrlock failed with {}", Errno::from_raw(res)),
o => panic!("try_wrlock failed with {}", Errno::from_raw(o)),
}
}
}
unsafe fn unlock_exclusive(&self) {
unsafe {
let res = libc::pthread_rwlock_unlock(self.inner().as_ptr());
if res != 0 {
panic!("unlock failed with {}", Errno::from_raw(res));
self.unlock();
}
unsafe fn unlock_shared(&self) {
self.unlock();
}
}
pub struct PthreadMutex(Option<NonNull<libc::pthread_mutex_t>>);
impl PthreadMutex {
pub fn new(lock: NonNull<libc::pthread_mutex_t>) -> Self {
unsafe {
let mut attrs = MaybeUninit::uninit();
// Ignoring return value here - only possible error is OOM.
libc::pthread_mutexattr_init(attrs.as_mut_ptr());
libc::pthread_mutexattr_setpshared(
attrs.as_mut_ptr(),
libc::PTHREAD_PROCESS_SHARED
);
libc::pthread_mutex_init(lock.as_ptr(), attrs.as_mut_ptr());
// Safety: POSIX specifies that "any function affecting the attributes
// object (including destruction) shall not affect any previously
// initialized read-write locks".
libc::pthread_mutexattr_destroy(attrs.as_mut_ptr());
Self(Some(lock))
}
}
fn inner(&self) -> NonNull<libc::pthread_mutex_t> {
self.0.unwrap_or_else(
|| panic!("PthreadMutex constructed badly - something likely used RawMutex::INIT")
)
}
}
unsafe impl lock_api::RawMutex for PthreadMutex {
type GuardMarker = lock_api::GuardSend;
/// *DO NOT USE THIS.* See [`PthreadRwLock`] for the full explanation.
const INIT: Self = Self(None);
fn lock(&self) {
unsafe {
let res = libc::pthread_mutex_lock(self.inner().as_ptr());
assert!(res == 0, "lock failed with {}", Errno::from_raw(res));
}
}
fn try_lock(&self) -> bool {
unsafe {
let res = libc::pthread_mutex_trylock(self.inner().as_ptr());
match res {
0 => true,
libc::EAGAIN => false,
o => panic!("try_rdlock failed with {}", Errno::from_raw(o)),
}
}
}
unsafe fn unlock_shared(&self) {
unsafe fn unlock(&self) {
unsafe {
let res = libc::pthread_rwlock_unlock(self.inner().as_ptr());
if res != 0 {
panic!("unlock failed with {}", Errno::from_raw(res));
}
let res = libc::pthread_mutex_unlock(self.inner().as_ptr());
assert!(res == 0, "unlock failed with {}", Errno::from_raw(res));
}
}
}

View File

@@ -115,6 +115,7 @@ where
// Error means you must retry.
//
// This corresponds to the 'lookupOpt' function in the paper
#[allow(clippy::only_used_in_recursion)]
fn lookup_recurse<'e, V: Value>(
key: &[u8],
node: NodeRef<'e, V>,
@@ -155,6 +156,7 @@ fn lookup_recurse<'e, V: Value>(
}
}
#[allow(clippy::only_used_in_recursion)]
fn next_recurse<'e, V: Value>(
min_key: &[u8],
path: &mut Vec<u8>,
@@ -163,7 +165,7 @@ fn next_recurse<'e, V: Value>(
) -> Result<Option<&'e V>, ConcurrentUpdateError> {
let rnode = node.read_lock_or_restart()?;
let prefix = rnode.get_prefix();
if prefix.len() != 0 {
if !prefix.is_empty() {
path.extend_from_slice(prefix);
}
@@ -213,13 +215,15 @@ fn next_recurse<'e, V: Value>(
}
// This corresponds to the 'insertOpt' function in the paper
pub(crate) fn update_recurse<'e, 'g, K: Key, V: Value, A: ArtAllocator<V>, F>(
#[allow(clippy::only_used_in_recursion)]
#[allow(clippy::too_many_arguments)]
pub(crate) fn update_recurse<'e, K: Key, V: Value, A: ArtAllocator<V>, F>(
key: &[u8],
value_fn: F,
node: NodeRef<'e, V>,
rparent: Option<(ReadLockedNodeRef<V>, u8)>,
rgrandparent: Option<(ReadLockedNodeRef<V>, u8)>,
guard: &'g mut TreeWriteGuard<'e, K, V, A>,
guard: &'_ mut TreeWriteGuard<'e, K, V, A>,
level: usize,
orig_key: &[u8],
) -> Result<(), ArtError>
@@ -248,8 +252,8 @@ where
return Ok(());
}
let prefix_match_len = prefix_match_len.unwrap();
let key = &key[prefix_match_len as usize..];
let level = level + prefix_match_len as usize;
let key = &key[prefix_match_len..];
let level = level + prefix_match_len;
if rnode.is_leaf() {
assert_eq!(key.len(), 0);
@@ -321,7 +325,7 @@ where
};
wnode.write_unlock();
}
return Ok(());
Ok(())
} else {
let next_child = next_node.unwrap(); // checked above it's not None
if let Some((ref rparent, _)) = rparent {
@@ -351,23 +355,24 @@ enum PathElement {
impl std::fmt::Debug for PathElement {
fn fmt(&self, fmt: &mut std::fmt::Formatter<'_>) -> Result<(), std::fmt::Error> {
match self {
PathElement::Prefix(prefix) => write!(fmt, "{:?}", prefix),
PathElement::KeyByte(key_byte) => write!(fmt, "{}", key_byte),
PathElement::Prefix(prefix) => write!(fmt, "{prefix:?}"),
PathElement::KeyByte(key_byte) => write!(fmt, "{key_byte}"),
}
}
}
pub(crate) fn dump_tree<'e, V: Value + std::fmt::Debug>(
pub(crate) fn dump_tree<V: Value + std::fmt::Debug>(
root: RootPtr<V>,
epoch_pin: &'e EpochPin,
epoch_pin: &'_ EpochPin,
dst: &mut dyn std::io::Write,
) {
let root_ref = NodeRef::from_root_ptr(root);
let _ = dump_recurse(&[], root_ref, &epoch_pin, 0, dst);
let _ = dump_recurse(&[], root_ref, epoch_pin, 0, dst);
}
// TODO: return an Err if writeln!() returns error, instead of unwrapping
#[allow(clippy::only_used_in_recursion)]
fn dump_recurse<'e, V: Value + std::fmt::Debug>(
path: &[PathElement],
node: NodeRef<'e, V>,
@@ -380,7 +385,7 @@ fn dump_recurse<'e, V: Value + std::fmt::Debug>(
let rnode = node.read_lock_or_restart()?;
let mut path = Vec::from(path);
let prefix = rnode.get_prefix();
if prefix.len() != 0 {
if !prefix.is_empty() {
path.push(PathElement::Prefix(Vec::from(prefix)));
}
@@ -390,7 +395,7 @@ fn dump_recurse<'e, V: Value + std::fmt::Debug>(
// and the lifetime of 'epoch_pin' enforces that the reference is only accessible
// as long as the epoch is pinned.
let val = unsafe { vptr.as_ref().unwrap() };
writeln!(dst, "{} {:?}: {:?}", indent, path, val).unwrap();
writeln!(dst, "{indent} {path:?}: {val:?}").unwrap();
return Ok(());
}
@@ -426,13 +431,13 @@ fn dump_recurse<'e, V: Value + std::fmt::Debug>(
/// [foo]b -> [a]r -> value
/// e -> [ls]e -> value
///```
fn insert_split_prefix<'e, K: Key, V: Value, A: ArtAllocator<V>>(
fn insert_split_prefix<K: Key, V: Value, A: ArtAllocator<V>>(
key: &[u8],
value: V,
node: &mut WriteLockedNodeRef<V>,
parent: &mut WriteLockedNodeRef<V>,
parent_key: u8,
guard: &'e TreeWriteGuard<K, V, A>,
guard: &'_ TreeWriteGuard<K, V, A>,
) -> Result<(), OutOfMemoryError> {
let old_node = node;
let old_prefix = old_node.get_prefix();
@@ -463,11 +468,11 @@ fn insert_split_prefix<'e, K: Key, V: Value, A: ArtAllocator<V>>(
Ok(())
}
fn insert_to_node<'e, K: Key, V: Value, A: ArtAllocator<V>>(
fn insert_to_node<K: Key, V: Value, A: ArtAllocator<V>>(
wnode: &mut WriteLockedNodeRef<V>,
key: &[u8],
value: V,
guard: &'e TreeWriteGuard<K, V, A>,
guard: &'_ TreeWriteGuard<K, V, A>,
) -> Result<(), OutOfMemoryError> {
let value_child = allocate_node_for_value(&key[1..], value, guard.tree_writer.allocator)?;
wnode.insert_child(key[0], value_child.into_ptr());

View File

@@ -105,13 +105,13 @@ impl AtomicLockAndVersion {
}
fn set_locked_bit(version: u64) -> u64 {
return version + 2;
version + 2
}
fn is_obsolete(version: u64) -> bool {
return (version & 1) == 1;
(version & 1) == 1
}
fn is_locked(version: u64) -> bool {
return (version & 2) == 2;
(version & 2) == 2
}

View File

@@ -45,6 +45,7 @@ impl<V> std::fmt::Debug for NodePtr<V> {
impl<V> Copy for NodePtr<V> {}
impl<V> Clone for NodePtr<V> {
#[allow(clippy::non_canonical_clone_impl)]
fn clone(&self) -> NodePtr<V> {
NodePtr {
ptr: self.ptr,
@@ -305,14 +306,13 @@ impl<V: Value> NodePtr<V> {
&self,
allocator: &impl ArtAllocator<V>,
) -> Result<NodePtr<V>, OutOfMemoryError> {
let bigger = match self.variant() {
match self.variant() {
NodeVariant::Internal4(n) => n.grow(allocator),
NodeVariant::Internal16(n) => n.grow(allocator),
NodeVariant::Internal48(n) => n.grow(allocator),
NodeVariant::Internal256(_) => panic!("cannot grow Internal256 node"),
NodeVariant::Leaf(_) => panic!("cannot grow Leaf node"),
};
bigger
}
}
pub(crate) fn insert_child(&mut self, key_byte: u8, child: NodePtr<V>) {
@@ -464,7 +464,7 @@ impl<V: Value> NodeInternal4<V> {
new.extend_from_slice(prefix);
new.push(prefix_byte);
new.extend_from_slice(&self.prefix[0..self.prefix_len as usize]);
(&mut self.prefix[0..new.len()]).copy_from_slice(&new);
self.prefix[0..new.len()].copy_from_slice(&new);
self.prefix_len = new.len() as u8;
}
@@ -515,7 +515,7 @@ impl<V: Value> NodeInternal4<V> {
return;
}
}
panic!("could not re-find parent with key {}", key_byte);
panic!("could not re-find parent with key {key_byte}");
}
fn delete_child(&mut self, key_byte: u8) {
@@ -529,7 +529,7 @@ impl<V: Value> NodeInternal4<V> {
return;
}
}
panic!("could not re-find parent with key {}", key_byte);
panic!("could not re-find parent with key {key_byte}");
}
fn is_full(&self) -> bool {
@@ -558,7 +558,7 @@ impl<V: Value> NodeInternal4<V> {
tag: NodeTag::Internal16,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix: self.prefix,
prefix_len: self.prefix_len,
num_children: self.num_children,
@@ -585,7 +585,7 @@ impl<V: Value> NodeInternal16<V> {
new.extend_from_slice(prefix);
new.push(prefix_byte);
new.extend_from_slice(&self.prefix[0..self.prefix_len as usize]);
(&mut self.prefix[0..new.len()]).copy_from_slice(&new);
self.prefix[0..new.len()].copy_from_slice(&new);
self.prefix_len = new.len() as u8;
}
@@ -636,7 +636,7 @@ impl<V: Value> NodeInternal16<V> {
return;
}
}
panic!("could not re-find parent with key {}", key_byte);
panic!("could not re-find parent with key {key_byte}");
}
fn delete_child(&mut self, key_byte: u8) {
@@ -650,7 +650,7 @@ impl<V: Value> NodeInternal16<V> {
return;
}
}
panic!("could not re-find parent with key {}", key_byte);
panic!("could not re-find parent with key {key_byte}");
}
fn is_full(&self) -> bool {
@@ -679,7 +679,7 @@ impl<V: Value> NodeInternal16<V> {
tag: NodeTag::Internal48,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix: self.prefix,
prefix_len: self.prefix_len,
num_children: self.num_children,
@@ -706,7 +706,7 @@ impl<V: Value> NodeInternal16<V> {
tag: NodeTag::Internal4,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix: self.prefix,
prefix_len: self.prefix_len,
num_children: self.num_children,
@@ -736,7 +736,7 @@ impl<V: Value> NodeInternal48<V> {
idx,
self.num_children
);
assert!(shadow_indexes.get(&idx).is_none());
assert!(!shadow_indexes.contains(&idx));
shadow_indexes.insert(idx);
count += 1;
}
@@ -750,7 +750,7 @@ impl<V: Value> NodeInternal48<V> {
new.extend_from_slice(prefix);
new.push(prefix_byte);
new.extend_from_slice(&self.prefix[0..self.prefix_len as usize]);
(&mut self.prefix[0..new.len()]).copy_from_slice(&new);
self.prefix[0..new.len()].copy_from_slice(&new);
self.prefix_len = new.len() as u8;
}
@@ -790,7 +790,7 @@ impl<V: Value> NodeInternal48<V> {
fn replace_child(&mut self, key_byte: u8, replacement: NodePtr<V>) {
let idx = self.child_indexes[key_byte as usize];
if idx == INVALID_CHILD_INDEX {
panic!("could not re-find parent with key {}", key_byte);
panic!("could not re-find parent with key {key_byte}");
}
self.child_ptrs[idx as usize] = replacement;
self.validate();
@@ -799,7 +799,7 @@ impl<V: Value> NodeInternal48<V> {
fn delete_child(&mut self, key_byte: u8) {
let idx = self.child_indexes[key_byte as usize] as usize;
if idx == INVALID_CHILD_INDEX as usize {
panic!("could not re-find parent with key {}", key_byte);
panic!("could not re-find parent with key {key_byte}");
}
// Compact the child_ptrs array
@@ -816,10 +816,7 @@ impl<V: Value> NodeInternal48<V> {
return;
}
}
panic!(
"could not re-find last index {} on Internal48 node",
removed_idx
);
panic!("could not re-find last index {removed_idx} on Internal48 node");
} else {
self.child_indexes[key_byte as usize] = INVALID_CHILD_INDEX;
self.num_children -= 1;
@@ -853,7 +850,7 @@ impl<V: Value> NodeInternal48<V> {
tag: NodeTag::Internal256,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix: self.prefix,
prefix_len: self.prefix_len,
num_children: self.num_children as u16,
@@ -879,7 +876,7 @@ impl<V: Value> NodeInternal48<V> {
tag: NodeTag::Internal16,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix: self.prefix,
prefix_len: self.prefix_len,
num_children: self.num_children,
@@ -912,7 +909,7 @@ impl<V: Value> NodeInternal256<V> {
new.extend_from_slice(prefix);
new.push(prefix_byte);
new.extend_from_slice(&self.prefix[0..self.prefix_len as usize]);
(&mut self.prefix[0..new.len()]).copy_from_slice(&new);
self.prefix[0..new.len()].copy_from_slice(&new);
self.prefix_len = new.len() as u8;
}
@@ -949,14 +946,14 @@ impl<V: Value> NodeInternal256<V> {
if !self.child_ptrs[idx].is_null() {
self.child_ptrs[idx] = replacement
} else {
panic!("could not re-find parent with key {}", key_byte);
panic!("could not re-find parent with key {key_byte}");
}
}
fn delete_child(&mut self, key_byte: u8) {
let idx = key_byte as usize;
if self.child_ptrs[idx].is_null() {
panic!("could not re-find parent with key {}", key_byte);
panic!("could not re-find parent with key {key_byte}");
}
self.num_children -= 1;
self.child_ptrs[idx] = NodePtr::null();
@@ -987,7 +984,7 @@ impl<V: Value> NodeInternal256<V> {
tag: NodeTag::Internal48,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix: self.prefix,
prefix_len: self.prefix_len,
num_children: self.num_children as u8,
@@ -1019,7 +1016,7 @@ impl<V: Value> NodeLeaf<V> {
new.extend_from_slice(prefix);
new.push(prefix_byte);
new.extend_from_slice(&self.prefix[0..self.prefix_len as usize]);
(&mut self.prefix[0..new.len()]).copy_from_slice(&new);
self.prefix[0..new.len()].copy_from_slice(&new);
self.prefix_len = new.len() as u8;
}

View File

@@ -61,13 +61,11 @@ impl<'t, V: crate::Value> ArtMultiSlabAllocator<'t, V> {
let (allocator_area, remain) = alloc_from_slice::<ArtMultiSlabAllocator<V>>(area);
let (tree_area, remain) = alloc_from_slice::<Tree<V>>(remain);
let allocator = allocator_area.write(ArtMultiSlabAllocator {
allocator_area.write(ArtMultiSlabAllocator {
tree_area: spin::Mutex::new(Some(tree_area)),
inner: MultiSlabAllocator::new(remain, &Self::LAYOUTS),
phantom_val: PhantomData,
});
allocator
})
}
}

View File

@@ -119,7 +119,7 @@ impl<'t> BlockAllocator<'t> {
}
// out of blocks
return INVALID_BLOCK;
INVALID_BLOCK
}
// TODO: this is currently unused. The slab allocator never releases blocks

View File

@@ -370,15 +370,16 @@ mod tests {
for i in 0..11 {
all.push(slab.alloc(i));
}
#[allow(clippy::needless_range_loop)]
for i in 0..11 {
assert!(unsafe { (*all[i]).val == i });
}
let distribution = Zipf::new(10 as f64, 1.1).unwrap();
let distribution = Zipf::new(10.0, 1.1).unwrap();
let mut rng = rand::rng();
for _ in 0..100000 {
slab.0.dump();
let idx = (rng.sample(distribution) as usize).into();
let idx = rng.sample(distribution) as usize;
let ptr: *mut TestObject = all[idx];
if !ptr.is_null() {
assert_eq!(unsafe { (*ptr).val }, idx);

View File

@@ -3,7 +3,6 @@
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use crossbeam_utils::CachePadded;
use spin;
const NUM_SLOTS: usize = 1000;
@@ -62,10 +61,8 @@ impl EpochShared {
pub(crate) fn advance(&self) -> u64 {
// Advance the global epoch
let old_epoch = self.global_epoch.fetch_add(2, Ordering::Relaxed);
let new_epoch = old_epoch + 2;
// Anyone that release their pin after this will update their slot.
new_epoch
old_epoch + 2
}
pub(crate) fn broadcast(&self) {
@@ -99,10 +96,8 @@ impl EpochShared {
let delta = now.wrapping_sub(this_epoch);
if delta > u64::MAX / 2 {
// this is very recent
} else {
if delta > now.wrapping_sub(oldest) {
oldest = this_epoch;
}
} else if delta > now.wrapping_sub(oldest) {
oldest = this_epoch;
}
}
oldest

View File

@@ -239,7 +239,7 @@ where
phantom_key: PhantomData<K>,
}
impl<'a, 't: 'a, K: Key, V: Value, A: ArtAllocator<V>> TreeInitStruct<'t, K, V, A> {
impl<'t, K: Key, V: Value, A: ArtAllocator<V>> TreeInitStruct<'t, K, V, A> {
pub fn new(allocator: &'t A) -> TreeInitStruct<'t, K, V, A> {
let tree_ptr = allocator.alloc_tree();
let tree_ptr = NonNull::new(tree_ptr).expect("out of memory");
@@ -295,7 +295,7 @@ impl<'t, K: Key, V: Value, A: ArtAllocator<V>> TreeWriteAccess<'t, K, V, A> {
pub fn start_read(&'t self) -> TreeReadGuard<'t, K, V> {
TreeReadGuard {
tree: &self.tree,
tree: self.tree,
epoch_pin: self.epoch_handle.pin(),
phantom_key: PhantomData,
}
@@ -305,7 +305,7 @@ impl<'t, K: Key, V: Value, A: ArtAllocator<V>> TreeWriteAccess<'t, K, V, A> {
impl<'t, K: Key, V: Value> TreeReadAccess<'t, K, V> {
pub fn start_read(&'t self) -> TreeReadGuard<'t, K, V> {
TreeReadGuard {
tree: &self.tree,
tree: self.tree,
epoch_pin: self.epoch_handle.pin(),
phantom_key: PhantomData,
}
@@ -360,7 +360,7 @@ impl<'e, K: Key, V: Value, A: ArtAllocator<V>> TreeWriteGuard<'e, K, V, A> {
let mut success = None;
self.update_with_fn(key, |existing| {
if let Some(_) = existing {
if existing.is_some() {
success = Some(false);
UpdateAction::Nothing
} else {
@@ -461,11 +461,9 @@ where
K: Key + for<'a> From<&'a [u8]>,
{
pub fn new_wrapping() -> TreeIterator<K> {
let mut next_key = Vec::new();
next_key.resize(K::KEY_LEN, 0);
TreeIterator {
done: false,
next_key,
next_key: vec![0; K::KEY_LEN],
max_key: None,
phantom_key: PhantomData,
}
@@ -495,11 +493,9 @@ where
let mut wrapped_around = false;
loop {
assert_eq!(self.next_key.len(), K::KEY_LEN);
if let Some((k, v)) = algorithm::iter_next(
&mut self.next_key,
read_guard.tree.root,
&read_guard.epoch_pin,
) {
if let Some((k, v)) =
algorithm::iter_next(&self.next_key, read_guard.tree.root, &read_guard.epoch_pin)
{
assert_eq!(k.len(), K::KEY_LEN);
assert_eq!(self.next_key.len(), K::KEY_LEN);

View File

@@ -102,7 +102,7 @@ fn sparse() {
for _ in 0..10000 {
loop {
let key = rand::random::<u128>();
if used_keys.get(&key).is_some() {
if used_keys.contains(&key) {
continue;
}
used_keys.insert(key);
@@ -182,26 +182,19 @@ fn test_iter<A: ArtAllocator<TestValue>>(
let mut iter = TreeIterator::new(&(TestKey::MIN..TestKey::MAX));
loop {
let shadow_item = shadow_iter.next().map(|(k, v)| (k.clone(), v.clone()));
let shadow_item = shadow_iter.next().map(|(k, v)| (*k, *v));
let r = tree.start_read();
let item = iter.next(&r);
if shadow_item != item.map(|(k, v)| (k, v.load())) {
eprintln!(
"FAIL: iterator returned {:?}, expected {:?}",
item, shadow_item
);
eprintln!("FAIL: iterator returned {item:?}, expected {shadow_item:?}");
tree.start_read().dump(&mut std::io::stderr());
eprintln!("SHADOW:");
let mut si = shadow.iter();
while let Some(si) = si.next() {
for si in shadow {
eprintln!("key: {:?}, val: {}", si.0, si.1);
}
panic!(
"FAIL: iterator returned {:?}, expected {:?}",
item, shadow_item
);
panic!("FAIL: iterator returned {item:?}, expected {shadow_item:?}");
}
if item.is_none() {
break;

View File

@@ -18,6 +18,8 @@ bytes.workspace = true
byteorder.workspace = true
utils.workspace = true
postgres_ffi_types.workspace = true
postgres_versioninfo.workspace = true
posthog_client_lite.workspace = true
enum-map.workspace = true
strum.workspace = true
strum_macros.workspace = true
@@ -28,12 +30,13 @@ humantime-serde.workspace = true
chrono = { workspace = true, features = ["serde"] }
itertools.workspace = true
storage_broker.workspace = true
camino = {workspace = true, features = ["serde1"]}
camino = { workspace = true, features = ["serde1"] }
remote_storage.workspace = true
postgres_backend.workspace = true
nix = {workspace = true, optional = true}
nix = { workspace = true, optional = true }
reqwest.workspace = true
rand.workspace = true
tracing.workspace = true
tracing-utils.workspace = true
once_cell.workspace = true

View File

@@ -4,6 +4,7 @@ use camino::Utf8PathBuf;
mod tests;
use const_format::formatcp;
use posthog_client_lite::PostHogClientConfig;
pub const DEFAULT_PG_LISTEN_PORT: u16 = 64000;
pub const DEFAULT_PG_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_PG_LISTEN_PORT}");
pub const DEFAULT_HTTP_LISTEN_PORT: u16 = 9898;
@@ -63,25 +64,66 @@ impl Display for NodeMetadata {
}
}
/// PostHog integration config.
/// PostHog integration config. This is used in pageserver, storcon, and neon_local.
/// Ensure backward compatibility when adding new fields.
#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub struct PostHogConfig {
/// PostHog project ID
pub project_id: String,
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub project_id: Option<String>,
/// Server-side (private) API key
pub server_api_key: String,
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub server_api_key: Option<String>,
/// Client-side (public) API key
pub client_api_key: String,
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub client_api_key: Option<String>,
/// Private API URL
pub private_api_url: String,
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub private_api_url: Option<String>,
/// Public API URL
pub public_api_url: String,
/// Refresh interval for the feature flag spec
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
pub public_api_url: Option<String>,
/// Refresh interval for the feature flag spec.
/// The storcon will push the feature flag spec to the pageserver. If the pageserver does not receive
/// the spec for `refresh_interval`, it will fetch the spec from the PostHog API.
#[serde(default)]
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(with = "humantime_serde")]
pub refresh_interval: Option<Duration>,
}
impl PostHogConfig {
pub fn try_into_posthog_config(self) -> Result<PostHogClientConfig, &'static str> {
let Some(project_id) = self.project_id else {
return Err("project_id is required");
};
let Some(server_api_key) = self.server_api_key else {
return Err("server_api_key is required");
};
let Some(client_api_key) = self.client_api_key else {
return Err("client_api_key is required");
};
let Some(private_api_url) = self.private_api_url else {
return Err("private_api_url is required");
};
let Some(public_api_url) = self.public_api_url else {
return Err("public_api_url is required");
};
Ok(PostHogClientConfig {
project_id,
server_api_key,
client_api_key,
private_api_url,
public_api_url,
})
}
}
/// `pageserver.toml`
///
/// We use serde derive with `#[serde(default)]` to generate a deserializer
@@ -367,6 +409,9 @@ pub struct BasebackupCacheConfig {
// TODO(diko): support max_entry_size_bytes.
// pub max_entry_size_bytes: u64,
pub max_size_entries: usize,
/// Size of the channel used to send prepare requests to the basebackup cache worker.
/// If exceeded, new prepare requests will be dropped.
pub prepare_channel_size: usize,
}
impl Default for BasebackupCacheConfig {
@@ -375,7 +420,8 @@ impl Default for BasebackupCacheConfig {
cleanup_period: Duration::from_secs(60),
max_total_size_bytes: 1024 * 1024 * 1024, // 1 GiB
// max_entry_size_bytes: 16 * 1024 * 1024, // 16 MiB
max_size_entries: 1000,
max_size_entries: 10000,
prepare_channel_size: 100,
}
}
}

View File

@@ -386,6 +386,7 @@ pub enum NodeSchedulingPolicy {
Pause,
PauseForRestart,
Draining,
Deleting,
}
impl FromStr for NodeSchedulingPolicy {
@@ -398,6 +399,7 @@ impl FromStr for NodeSchedulingPolicy {
"pause" => Ok(Self::Pause),
"pause_for_restart" => Ok(Self::PauseForRestart),
"draining" => Ok(Self::Draining),
"deleting" => Ok(Self::Deleting),
_ => Err(anyhow::anyhow!("Unknown scheduling state '{s}'")),
}
}
@@ -412,6 +414,7 @@ impl From<NodeSchedulingPolicy> for String {
Pause => "pause",
PauseForRestart => "pause_for_restart",
Draining => "draining",
Deleting => "deleting",
}
.to_string()
}
@@ -420,6 +423,7 @@ impl From<NodeSchedulingPolicy> for String {
#[derive(Serialize, Deserialize, Clone, Copy, Eq, PartialEq, Debug)]
pub enum SkSchedulingPolicy {
Active,
Activating,
Pause,
Decomissioned,
}
@@ -430,6 +434,7 @@ impl FromStr for SkSchedulingPolicy {
fn from_str(s: &str) -> Result<Self, Self::Err> {
Ok(match s {
"active" => Self::Active,
"activating" => Self::Activating,
"pause" => Self::Pause,
"decomissioned" => Self::Decomissioned,
_ => {
@@ -446,6 +451,7 @@ impl From<SkSchedulingPolicy> for String {
use SkSchedulingPolicy::*;
match value {
Active => "active",
Activating => "activating",
Pause => "pause",
Decomissioned => "decomissioned",
}
@@ -546,6 +552,11 @@ pub struct TimelineImportRequest {
pub sk_set: Vec<NodeId>,
}
#[derive(serde::Serialize, serde::Deserialize, Clone)]
pub struct TimelineSafekeeperMigrateRequest {
pub new_sk_set: Vec<NodeId>,
}
#[cfg(test)]
mod test {
use serde_json;
@@ -577,8 +588,7 @@ mod test {
let err = serde_json::from_value::<TenantCreateRequest>(create_request).unwrap_err();
assert!(
err.to_string().contains("unknown field `unknown_field`"),
"expect unknown field `unknown_field` error, got: {}",
err
"expect unknown field `unknown_field` error, got: {err}"
);
}

View File

@@ -334,8 +334,7 @@ impl KeySpace {
std::cmp::max(range.start, prev.start) < std::cmp::min(range.end, prev.end);
assert!(
!overlap,
"Attempt to merge ovelapping keyspaces: {:?} overlaps {:?}",
prev, range
"Attempt to merge ovelapping keyspaces: {prev:?} overlaps {range:?}"
);
}
@@ -1104,7 +1103,7 @@ mod tests {
// total range contains at least one shard-local page
let all_nonzero = fragments.iter().all(|f| f.0 > 0);
if !all_nonzero {
eprintln!("Found a zero-length fragment: {:?}", fragments);
eprintln!("Found a zero-length fragment: {fragments:?}");
}
assert!(all_nonzero);
} else {

View File

@@ -11,6 +11,7 @@ use std::time::{Duration, SystemTime};
#[cfg(feature = "testing")]
use camino::Utf8PathBuf;
use postgres_versioninfo::PgMajorVersion;
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use serde_with::serde_as;
pub use utilization::PageserverUtilization;
@@ -20,7 +21,9 @@ use utils::{completion, serde_system_time};
use crate::config::Ratio;
use crate::key::{CompactKey, Key};
use crate::shard::{DEFAULT_STRIPE_SIZE, ShardCount, ShardStripeSize, TenantShardId};
use crate::shard::{
DEFAULT_STRIPE_SIZE, ShardCount, ShardIdentity, ShardStripeSize, TenantShardId,
};
/// The state of a tenant in this pageserver.
///
@@ -398,7 +401,7 @@ pub enum TimelineCreateRequestMode {
// inherits the ancestor's pg_version. Earlier code wasn't
// using a flattened enum, so, it was an accepted field, and
// we continue to accept it by having it here.
pg_version: Option<u32>,
pg_version: Option<PgMajorVersion>,
#[serde(default, skip_serializing_if = "std::ops::Not::not")]
read_only: bool,
},
@@ -410,7 +413,7 @@ pub enum TimelineCreateRequestMode {
Bootstrap {
#[serde(default)]
existing_initdb_timeline_id: Option<TimelineId>,
pg_version: Option<u32>,
pg_version: Option<PgMajorVersion>,
},
}
@@ -474,7 +477,7 @@ pub struct TenantShardSplitResponse {
}
/// Parameters that apply to all shards in a tenant. Used during tenant creation.
#[derive(Serialize, Deserialize, Debug)]
#[derive(Clone, Copy, Serialize, Deserialize, Debug)]
#[serde(deny_unknown_fields)]
pub struct ShardParameters {
pub count: ShardCount,
@@ -496,6 +499,15 @@ impl Default for ShardParameters {
}
}
impl From<ShardIdentity> for ShardParameters {
fn from(identity: ShardIdentity) -> Self {
Self {
count: identity.count,
stripe_size: identity.stripe_size,
}
}
}
#[derive(Debug, Default, Clone, Eq, PartialEq)]
pub enum FieldPatch<T> {
Upsert(T),
@@ -1182,7 +1194,7 @@ impl Display for ImageCompressionAlgorithm {
ImageCompressionAlgorithm::Disabled => write!(f, "disabled"),
ImageCompressionAlgorithm::Zstd { level } => {
if let Some(level) = level {
write!(f, "zstd({})", level)
write!(f, "zstd({level})")
} else {
write!(f, "zstd")
}
@@ -1573,7 +1585,7 @@ pub struct TimelineInfo {
pub last_received_msg_lsn: Option<Lsn>,
/// the timestamp (in microseconds) of the last received message
pub last_received_msg_ts: Option<u128>,
pub pg_version: u32,
pub pg_version: PgMajorVersion,
pub state: TimelineState,
@@ -2011,8 +2023,7 @@ mod tests {
let err = serde_json::from_value::<TenantConfigRequest>(config_request).unwrap_err();
assert!(
err.to_string().contains("unknown field `unknown_field`"),
"expect unknown field `unknown_field` error, got: {}",
err
"expect unknown field `unknown_field` error, got: {err}"
);
}

View File

@@ -37,6 +37,7 @@ use std::hash::{Hash, Hasher};
pub use ::utils::shard::*;
use postgres_ffi_types::forknum::INIT_FORKNUM;
use serde::{Deserialize, Serialize};
use utils::critical;
use crate::key::Key;
use crate::models::ShardParameters;
@@ -179,7 +180,7 @@ impl ShardIdentity {
/// For use when creating ShardIdentity instances for new shards, where a creation request
/// specifies the ShardParameters that apply to all shards.
pub fn from_params(number: ShardNumber, params: &ShardParameters) -> Self {
pub fn from_params(number: ShardNumber, params: ShardParameters) -> Self {
Self {
number,
count: params.count,
@@ -188,6 +189,17 @@ impl ShardIdentity {
}
}
/// Asserts that the given shard identities are equal. Changes to shard parameters will likely
/// result in data corruption.
pub fn assert_equal(&self, other: ShardIdentity) {
if self != &other {
// TODO: for now, we're conservative and just log errors in production. Turn this into a
// real assertion when we're confident it doesn't misfire, and also reject requests that
// attempt to change it with an error response.
critical!("shard identity mismatch: {self:?} != {other:?}");
}
}
fn is_broken(&self) -> bool {
self.layout == LAYOUT_BROKEN
}
@@ -320,7 +332,11 @@ fn hash_combine(mut a: u32, mut b: u32) -> u32 {
///
/// The mapping of key to shard is not stable across changes to ShardCount: this is intentional
/// and will be handled at higher levels when shards are split.
fn key_to_shard_number(count: ShardCount, stripe_size: ShardStripeSize, key: &Key) -> ShardNumber {
pub fn key_to_shard_number(
count: ShardCount,
stripe_size: ShardStripeSize,
key: &Key,
) -> ShardNumber {
// Fast path for un-sharded tenants or broadcast keys
if count < ShardCount(2) || key_is_shard0(key) {
return ShardNumber(0);

View File

@@ -78,7 +78,13 @@ pub fn is_expected_io_error(e: &io::Error) -> bool {
use io::ErrorKind::*;
matches!(
e.kind(),
BrokenPipe | ConnectionRefused | ConnectionAborted | ConnectionReset | TimedOut
HostUnreachable
| NetworkUnreachable
| BrokenPipe
| ConnectionRefused
| ConnectionAborted
| ConnectionReset
| TimedOut,
)
}
@@ -939,7 +945,7 @@ impl<IO: AsyncRead + AsyncWrite + Unpin> PostgresBackendReader<IO> {
FeMessage::CopyFail => Err(CopyStreamHandlerEnd::CopyFail),
FeMessage::Terminate => Err(CopyStreamHandlerEnd::Terminate),
_ => Err(CopyStreamHandlerEnd::from(ConnectionError::Protocol(
ProtocolError::Protocol(format!("unexpected message in COPY stream {:?}", msg)),
ProtocolError::Protocol(format!("unexpected message in COPY stream {msg:?}")),
))),
},
None => Err(CopyStreamHandlerEnd::EOF),

View File

@@ -61,7 +61,7 @@ async fn simple_select() {
// so spawn it off to run on its own.
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
eprintln!("connection error: {e}");
}
});
@@ -137,7 +137,7 @@ async fn simple_select_ssl() {
// so spawn it off to run on its own.
tokio::spawn(async move {
if let Err(e) = connection.await {
eprintln!("connection error: {}", e);
eprintln!("connection error: {e}");
}
});

View File

@@ -223,7 +223,7 @@ mod tests_pg_connection_config {
assert_eq!(cfg.port(), 123);
assert_eq!(cfg.raw_address(), "stub.host.example:123");
assert_eq!(
format!("{:?}", cfg),
format!("{cfg:?}"),
"PgConnectionConfig { host: Domain(\"stub.host.example\"), port: 123, password: None }"
);
}
@@ -239,7 +239,7 @@ mod tests_pg_connection_config {
assert_eq!(cfg.port(), 123);
assert_eq!(cfg.raw_address(), "[::1]:123");
assert_eq!(
format!("{:?}", cfg),
format!("{cfg:?}"),
"PgConnectionConfig { host: Ipv6(::1), port: 123, password: None }"
);
}
@@ -252,7 +252,7 @@ mod tests_pg_connection_config {
assert_eq!(cfg.port(), 123);
assert_eq!(cfg.raw_address(), "stub.host.example:123");
assert_eq!(
format!("{:?}", cfg),
format!("{cfg:?}"),
"PgConnectionConfig { host: Domain(\"stub.host.example\"), port: 123, password: Some(REDACTED-STRING) }"
);
}

View File

@@ -19,6 +19,7 @@ serde.workspace = true
postgres_ffi_types.workspace = true
utils.workspace = true
tracing.workspace = true
postgres_versioninfo.workspace = true
[dev-dependencies]
env_logger.workspace = true

View File

@@ -4,6 +4,7 @@ use criterion::{Bencher, Criterion, criterion_group, criterion_main};
use postgres_ffi::v17::wal_generator::LogicalMessageGenerator;
use postgres_ffi::v17::waldecoder_handler::WalStreamDecoderHandler;
use postgres_ffi::waldecoder::WalStreamDecoder;
use postgres_versioninfo::PgMajorVersion;
use pprof::criterion::{Output, PProfProfiler};
use utils::lsn::Lsn;
@@ -32,7 +33,7 @@ fn bench_complete_record(c: &mut Criterion) {
let value_size = LogicalMessageGenerator::make_value_size(size, PREFIX);
let value = vec![1; value_size];
let mut decoder = WalStreamDecoder::new(Lsn(0), 170000);
let mut decoder = WalStreamDecoder::new(Lsn(0), PgMajorVersion::PG17);
let msg = LogicalMessageGenerator::new(PREFIX, &value)
.next()
.unwrap()

View File

@@ -14,6 +14,8 @@ use bytes::Bytes;
use utils::bin_ser::SerializeError;
use utils::lsn::Lsn;
pub use postgres_versioninfo::PgMajorVersion;
macro_rules! postgres_ffi {
($version:ident) => {
#[path = "."]
@@ -91,21 +93,22 @@ macro_rules! dispatch_pgversion {
$version => $code,
default = $invalid_pgver_handling,
pgversions = [
14 : v14,
15 : v15,
16 : v16,
17 : v17,
$crate::PgMajorVersion::PG14 => v14,
$crate::PgMajorVersion::PG15 => v15,
$crate::PgMajorVersion::PG16 => v16,
$crate::PgMajorVersion::PG17 => v17,
]
)
};
($pgversion:expr => $code:expr,
default = $default:expr,
pgversions = [$($sv:literal : $vsv:ident),+ $(,)?]) => {
match ($pgversion) {
pgversions = [$($sv:pat => $vsv:ident),+ $(,)?]) => {
match ($pgversion.clone().into()) {
$($sv => {
use $crate::$vsv as pgv;
$code
},)+
#[allow(unreachable_patterns)]
_ => {
$default
}
@@ -179,9 +182,9 @@ macro_rules! enum_pgversion {
$($variant ( $crate::$md::$t )),+
}
impl self::$name {
pub fn pg_version(&self) -> u32 {
pub fn pg_version(&self) -> PgMajorVersion {
enum_pgversion_dispatch!(self, $name, _ign, {
pgv::bindings::PG_MAJORVERSION_NUM
pgv::bindings::MY_PGVERSION
})
}
}
@@ -195,15 +198,15 @@ macro_rules! enum_pgversion {
};
{name = $name:ident,
path = $p:ident,
typ = $t:ident,
$(typ = $t:ident,)?
pgversions = [$($variant:ident : $md:ident),+ $(,)?]} => {
pub enum $name {
$($variant ($crate::$md::$p::$t)),+
$($variant $(($crate::$md::$p::$t))?),+
}
impl $name {
pub fn pg_version(&self) -> u32 {
pub fn pg_version(&self) -> PgMajorVersion {
enum_pgversion_dispatch!(self, $name, _ign, {
pgv::bindings::PG_MAJORVERSION_NUM
pgv::bindings::MY_PGVERSION
})
}
}
@@ -249,22 +252,21 @@ pub use v14::xlog_utils::{
try_from_pg_timestamp,
};
pub fn bkpimage_is_compressed(bimg_info: u8, version: u32) -> bool {
pub fn bkpimage_is_compressed(bimg_info: u8, version: PgMajorVersion) -> bool {
dispatch_pgversion!(version, pgv::bindings::bkpimg_is_compressed(bimg_info))
}
pub fn generate_wal_segment(
segno: u64,
system_id: u64,
pg_version: u32,
pg_version: PgMajorVersion,
lsn: Lsn,
) -> Result<Bytes, SerializeError> {
assert_eq!(segno, lsn.segment_number(WAL_SEGMENT_SIZE));
dispatch_pgversion!(
pg_version,
pgv::xlog_utils::generate_wal_segment(segno, system_id, lsn),
Err(SerializeError::BadInput)
pgv::xlog_utils::generate_wal_segment(segno, system_id, lsn)
)
}
@@ -272,7 +274,7 @@ pub fn generate_pg_control(
pg_control_bytes: &[u8],
checkpoint_bytes: &[u8],
lsn: Lsn,
pg_version: u32,
pg_version: PgMajorVersion,
) -> anyhow::Result<(Bytes, u64, bool)> {
dispatch_pgversion!(
pg_version,
@@ -352,6 +354,7 @@ pub fn fsm_logical_to_physical(addr: BlockNumber) -> BlockNumber {
pub mod waldecoder {
use std::num::NonZeroU32;
use crate::PgMajorVersion;
use bytes::{Buf, Bytes, BytesMut};
use thiserror::Error;
use utils::lsn::Lsn;
@@ -369,7 +372,7 @@ pub mod waldecoder {
pub struct WalStreamDecoder {
pub lsn: Lsn,
pub pg_version: u32,
pub pg_version: PgMajorVersion,
pub inputbuf: BytesMut,
pub state: State,
}
@@ -382,7 +385,7 @@ pub mod waldecoder {
}
impl WalStreamDecoder {
pub fn new(lsn: Lsn, pg_version: u32) -> WalStreamDecoder {
pub fn new(lsn: Lsn, pg_version: PgMajorVersion) -> WalStreamDecoder {
WalStreamDecoder {
lsn,
pg_version,

View File

@@ -1,3 +1,7 @@
use crate::PgMajorVersion;
pub const MY_PGVERSION: PgMajorVersion = PgMajorVersion::PG14;
pub const XLOG_DBASE_CREATE: u8 = 0x00;
pub const XLOG_DBASE_DROP: u8 = 0x10;

View File

@@ -1,3 +1,7 @@
use crate::PgMajorVersion;
pub const MY_PGVERSION: PgMajorVersion = PgMajorVersion::PG15;
pub const XACT_XINFO_HAS_DROPPED_STATS: u32 = 1u32 << 8;
pub const XLOG_DBASE_CREATE_FILE_COPY: u8 = 0x00;

View File

@@ -1,3 +1,7 @@
use crate::PgMajorVersion;
pub const MY_PGVERSION: PgMajorVersion = PgMajorVersion::PG16;
pub const XACT_XINFO_HAS_DROPPED_STATS: u32 = 1u32 << 8;
pub const XLOG_DBASE_CREATE_FILE_COPY: u8 = 0x00;

Some files were not shown because too many files have changed in this diff Show More