Compare commits

...

298 Commits

Author SHA1 Message Date
Heikki Linnakangas
cd14f6ca94 Fix LSN lease background thread with grpc
The spawned thread didn't have the tokio runtime active, which led to
this error:

    ERROR lsn_lease_bg_task{tenant_id=1bb647cb7d3974b52e74f7442fa7d059 timeline_id=cf41456d3202e1c3940cb8f372d160ab lsn=0/1576000}:panic{thread=<unnamed> location=compute_tools/src/lsn_lease.rs:201:5}: there is no reactor running, must be called from the context of a Tokio 1.x runtime

Fixes `test_readonly_node_gc`
2025-08-01 00:37:22 +03:00
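A minimal sketch of the pattern behind this fix, using the same Handle::current()/enter() approach that the lsn_lease diff further down shows; the task body here is a placeholder:

```
use std::thread;

// Capture the current Tokio runtime handle before spawning a plain OS thread,
// then enter the runtime inside the thread so tokio primitives (timers,
// spawned tasks) can be used there without panicking.
fn spawn_bg_task() {
    let runtime = tokio::runtime::Handle::current();
    thread::spawn(move || {
        let _rt_guard = runtime.enter(); // make the runtime context available
        // ... run code that relies on tokio, e.g. runtime-backed timers ...
    });
}
```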
Heikki Linnakangas
8ed56decfb Make LFC prewarming test case less sensitive to LFC chunk size
Namely, this makes it pass with the new communicator, which doesn't do
chunking at all.
2025-08-01 00:24:01 +03:00
Heikki Linnakangas
e466cd1eb2 fix prewarm test with grpc
I added a fixture to run these tests with and without grpc, but missed
passing the option to one endpoint creation.
2025-08-01 00:12:06 +03:00
Heikki Linnakangas
4a031b9467 Fix LFC prewarm cancellation 2025-08-01 00:11:50 +03:00
Heikki Linnakangas
26bd994852 reformat 2025-07-31 23:44:11 +03:00
Heikki Linnakangas
b78cdfe3ea Fix test_lfc_prewarm.py test failure 2025-07-31 23:44:11 +03:00
Heikki Linnakangas
50302499f5 Silence test failure with gRPC
The error message is just a little different with gRPC.
2025-07-31 22:02:30 +03:00
Heikki Linnakangas
ede37c5346 revert unintentional changes to submodules 2025-07-31 21:17:19 +03:00
Heikki Linnakangas
b72f410b6e cargo fmt 2025-07-31 20:47:52 +03:00
Heikki Linnakangas
e1c7d79e2a dial down smgr trace logging to same level as on 'main' 2025-07-31 20:47:20 +03:00
Heikki Linnakangas
bb1f50bf09 Set num_shards in shared memory.
The get_num_shards() function, called from the WAL proposer, requires
it.

Fixes test_timeline_size_quota_on_startup
2025-07-31 16:29:24 +03:00
Heikki Linnakangas
9871a3f9e7 tidy up error handling a bit
Pass back a suitable 'errno' from the communicator process to the
originating backend in all cases. Usually it's just EIO because we
don't have a good way to map from tonic StatusCodes to libc error
numbers. That's probably good enough; from the original backend's
perspective all errors are IO errors.

In the C code, set libc errno variable before calling ereport(), so
that errcode_for_file_access() works. And once we do that, we can
replace pg_strerror() calls with %m.
2025-07-31 15:31:19 +03:00
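A hedged sketch of the Rust-side mapping this describes; the function name and the NotFound case are assumptions, everything else collapses to EIO:

```
// Collapse a tonic status into a libc errno for the originating backend.
fn status_to_errno(status: &tonic::Status) -> libc::c_int {
    match status.code() {
        tonic::Code::NotFound => libc::ENOENT, // assumed special case
        // "from the backend's perspective all errors are IO errors"
        _ => libc::EIO,
    }
}
```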
Heikki Linnakangas
e1df05448c Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-31 15:01:34 +03:00
Heikki Linnakangas
17cd611ccc Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-31 14:45:22 +03:00
Heikki Linnakangas
c509d53cd1 fix clippy warnings 2025-07-31 14:45:13 +03:00
Heikki Linnakangas
84f4dcd2be fix test scripts to not set neon.use_communicator_worker anymore
compute_ctl does it based on prefer_protocol now
2025-07-31 14:36:26 +03:00
Heikki Linnakangas
b4808a4e5c Set neon.use_communicator_worker GUC based on prefer_protocol 2025-07-31 14:24:38 +03:00
Heikki Linnakangas
5e2a19ce73 cargo fmt 2025-07-31 14:24:17 +03:00
Heikki Linnakangas
8a4f16a471 More work on metrics
Switch to the 'measured' crate everywhere in the communicator. Connect
the allocator metrics to the metrics endpoint.
2025-07-31 14:09:39 +03:00
Heikki Linnakangas
0428164058 Fix LFC stats exposed by the built-in prometheus endpoint 2025-07-31 11:34:14 +03:00
Heikki Linnakangas
c8042f9e31 Run pgindent on the new communicator C code 2025-07-31 11:11:38 +03:00
Heikki Linnakangas
4016808dff Handle get_raw_page_at_lsn() debugging function properly
This adds a new request type between backend and communicator, to make
a getpage request at a given LSN, bypassing the LFC. Only used by the
get_raw_page_at_lsn() debugging/testing function.
2025-07-31 11:04:15 +03:00
Heikki Linnakangas
c8b875c93b Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-30 23:08:43 +03:00
Heikki Linnakangas
768fc101cc Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-30 23:08:18 +03:00
Heikki Linnakangas
3dfa2fc3e4 Fix relsize caching in hot standby mode
Fixes remaining test_hot_standby.py failures
2025-07-30 22:55:38 +03:00
Heikki Linnakangas
49204b6a59 don't try to update the legacy last-written LSN cache with new communicator 2025-07-30 22:01:04 +03:00
Heikki Linnakangas
c0360644a7 Evict and retry if the block hash map is full
I already made this change to the is_write==true case earlier, but
the is_write==false codepath needs the same treatment.
2025-07-30 21:48:25 +03:00
Heikki Linnakangas
688990e7ec Crank down the logging
More logging is useful during debugging, but it's time to crank it down a
notch...
2025-07-30 21:24:19 +03:00
Heikki Linnakangas
af5e3da381 Fix updating last-written LSN when WAL redo skips updating a block
This makes the test_replica_query_race test pass, and probably some
other read replica tests too.
2025-07-30 21:20:10 +03:00
Heikki Linnakangas
fca52af7e3 Don't update the legacy last-written LSN cache with new communicator
The new communicator has its own tracking
2025-07-30 17:31:51 +03:00
Heikki Linnakangas
95ef69ca95 Enable gRPC in the docker-compose setup 2025-07-30 15:16:50 +03:00
Heikki Linnakangas
9e250e382a Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-30 11:19:42 +03:00
Erik Grinaker
80d052f262 Merge branch 'main' into communicator-rewrite 2025-07-29 22:05:16 +02:00
Heikki Linnakangas
349a5c6724 cargo hakari generate 2025-07-29 16:52:00 +03:00
Heikki Linnakangas
aad301e083 cargo fmt 2025-07-29 16:46:54 +03:00
Heikki Linnakangas
e0db31456b remove leftover debugging change
I made this change a long time ago while debugging a test failure, but
I never meant to commit it.
2025-07-29 16:45:38 +03:00
Heikki Linnakangas
b6b3911063 Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-29 16:44:00 +03:00
Heikki Linnakangas
0e0aff7b8c fix metrics when not using the new communicator 2025-07-24 01:40:32 +03:00
Heikki Linnakangas
5a5ea9cb9f cargo fmt 2025-07-24 01:33:02 +03:00
Heikki Linnakangas
3d209dcaae Minor changes to minimize diff against 'main'
The `pgxn/neon/communicator/Cargo.lock` file was not used, since the
package is part of the workspace.
2025-07-24 00:42:00 +03:00
Heikki Linnakangas
f939691f6a remove leftover empty file 2025-07-24 00:27:49 +03:00
Erik Grinaker
f96c8f63c2 pageserver: route gRPC requests to child shards 2025-07-23 16:38:22 +02:00
Erik Grinaker
c8cdd25da4 Pass stripe size during shard map updates 2025-07-23 16:38:20 +02:00
Folke Behrens
90242416a6 otel: Use blocking reqwest in dedicated thread
OTel 0.28+ by default uses blocking operations in a dedicated thread.
2025-07-23 16:36:27 +02:00
Heikki Linnakangas
6d8b1cc754 silence compiler warning about an unused variable 2025-07-23 13:47:35 +03:00
Heikki Linnakangas
35da660200 more work on exposing LFC stats 2025-07-23 13:39:32 +03:00
Heikki Linnakangas
bfdd37b54e Fix segfault in unimplemented function
We need to implement this eventually, but for now let's at least
silence the segfault.

See also https://github.com/neondatabase/neon/pull/12696
2025-07-23 13:08:59 +03:00
Heikki Linnakangas
6cd1295d9f Refactor communicator process initialization when new communicator is not used
This should fix the 'cargo test' failures on xlog_utils tests, which
launch Postgres in stand-alone mode, i.e. without setting 'neon_tenant'
2025-07-23 13:01:19 +03:00
Erik Grinaker
eaec6e2fb4 Fix notify_local shard count 2025-07-23 11:16:35 +02:00
Heikki Linnakangas
f7e403eea1 Fix broken link in doc comment 2025-07-23 11:37:27 +03:00
Erik Grinaker
464ed0cbc7 rustfmt 2025-07-23 09:41:01 +02:00
Erik Grinaker
f55ccd2c17 Fix lints 2025-07-23 08:17:06 +02:00
Erik Grinaker
c9758dc46b Fix communicator build 2025-07-23 08:06:20 +02:00
Erik Grinaker
78c5d70b4c cargo hakari generate 2025-07-23 07:58:20 +02:00
Heikki Linnakangas
fc35be0397 Remove the half-baked Adaptive Radix Tree implementation
We are committed to using the resizeable hash table for now. ART is a
great data structure, but it's too much for now. Maybe later.
2025-07-23 01:49:56 +03:00
Heikki Linnakangas
a7a6df3d6f fix datatype used in test mock function 2025-07-23 01:44:45 +03:00
Heikki Linnakangas
bfb4b0991d Refactor the way lfc_get_stats() is implemented
This reduces the boilerplate a little, and makes it more
straightforward to dispatch the call to the old or the new communicator
2025-07-23 01:40:42 +03:00
Heikki Linnakangas
c18f4a52f8 refactor metrics to use 'measured' crate 2025-07-23 00:56:21 +03:00
Heikki Linnakangas
48535798ba Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-23 00:00:10 +03:00
Heikki Linnakangas
dc35bda074 WIP: Implement LFC prewarming
This doesn't pass the tests yet; the immediate issue is that we're missing
some stats that the tests depend on. And there's a lot more cleanup,
commenting etc. to do. But this is roughly how it should look.
2025-07-20 01:23:34 +03:00
Heikki Linnakangas
e2c3c2eccb Merge remote-tracking branch 'origin/main' into HEAD 2025-07-20 00:58:57 +03:00
Victor Polevoy
cb50291dcd Fetches the SLRU segment via the new communicator.
The fetch is done not into a buffer as earlier, but directly into the
file.
2025-07-18 10:02:31 +02:00
Heikki Linnakangas
10a7d49726 Use XLogRecPtr for LSNs in C generated code.
This hopefully silences the static assertion failure Erik is seeing:

```
pgxn/neon/communicator_new.c:1352:9: error: static assertion failed due to requirement '__builtin_types_compatible_p(unsigned long long, unsigned long)': (r->lsn) does not have type XLogRecPtr
 1352 |                                                                 LSN_FORMAT_ARGS(r->lsn));
      |                                                                 ^~~~~~~~~~~~~~~~~~~~~~~
```
2025-07-17 13:37:45 +03:00
Erik Grinaker
f765bd3677 pageserver: improve gRPC cancellation 2025-07-17 12:34:46 +02:00
Erik Grinaker
edcdd6ca9c Merge branch 'main' into communicator-rewrite 2025-07-17 10:59:37 +02:00
Heikki Linnakangas
62af2a14e2 Improve comments a little 2025-07-15 16:06:49 +03:00
Erik Grinaker
367d96e25b Merge branch 'main' into communicator-rewrite 2025-07-14 18:47:23 +02:00
Erik Grinaker
87f01a25ab pageserver/client_grpc: reap idle channels immediately 2025-07-13 18:44:05 +02:00
Erik Grinaker
56eb511618 pageserver/client_grpc: use unbounded pools 2025-07-13 13:29:27 +02:00
Erik Grinaker
ddeb3f3ed3 pageserver/client_grpc: don't pipeline GetPage requests 2025-07-13 12:24:17 +02:00
Heikki Linnakangas
69dbad700c Merge remote-tracking branch 'origin/main' into HEAD 2025-07-12 16:43:57 +03:00
Erik Grinaker
0d5f4dd979 pageserver/client_grpc: improve retry logic 2025-07-12 12:41:11 +02:00
Erik Grinaker
1637fbce25 Merge fix 2025-07-11 10:50:19 +02:00
Erik Grinaker
8cd5370c00 Merge branch 'main' into communicator-rewrite 2025-07-11 10:39:26 +02:00
Heikki Linnakangas
bceafc6c32 Update LFC cache hit/miss counters
Fixes EXPLAIN (FILECACHE) option
2025-07-10 16:36:53 +03:00
Heikki Linnakangas
dcf8e0565f Improve communicator README 2025-07-10 15:19:20 +03:00
Heikki Linnakangas
c14cf15b52 Tidy up the memory ordering instructions on request slot code
I believe the explicit memory fence instructions are
unnecessary. Performing a store with Release ordering makes all the
previous non-atomic writes visible too. Per rust docs for Ordering::Release
( https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html#variant.Release):

> When coupled with a store, all previous operations become ordered
> before any load of this value with Acquire (or stronger)
> ordering. In particular, all previous writes become visible to all
> threads that perform an Acquire (or stronger) load of this value.
>
> ...
>
> Corresponds to memory_order_release in C++20.

The "all previous writes" means non-atomic writes too. It's not very
clear from that text, but the C++20 docs that it links to are more
explicit about it:

> All memory writes (including non-atomic and relaxed atomic) that
> happened-before the atomic store from the point of view of thread A,
> become visible side-effects in thread B. That is, once the atomic
> load is completed, thread B is guaranteed to see everything thread A
> wrote to memory.

In addition to removing the fence instructions, fix the comments on
each atomic Acquire operation to point to the correct Release
counterpart. We had such comments, but they had gone out of date as
the code moved around.
2025-07-10 15:19:20 +03:00
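A self-contained illustration of the publish pattern relied on here (not the real slot struct): a plain write followed by a Release store, paired with an Acquire load on the reader side, needs no separate fence.

```
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU32, Ordering};

const READY: u32 = 1;

struct Slot {
    payload: UnsafeCell<u64>,
    state: AtomicU32,
}
unsafe impl Sync for Slot {}

fn submit(slot: &Slot, value: u64) {
    // Safety (sketch): the writer owns the slot until READY is published.
    unsafe { *slot.payload.get() = value }; // plain, non-atomic write
    slot.state.store(READY, Ordering::Release); // publishes the write above
}

fn poll(slot: &Slot) -> Option<u64> {
    if slot.state.load(Ordering::Acquire) == READY {
        // Pairs with the Release store in submit(); the payload write is visible.
        Some(unsafe { *slot.payload.get() })
    } else {
        None
    }
}
```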
Heikki Linnakangas
5da06d4129 Make start_neon_io_request() wake up the communicator process
All the callers did that previously. So rather than document that the
caller needs to do it, just do it in start_neon_io_request() straight
away. (We might want to revisit this if we get codepaths where the C
code submits multiple IO requests as a batch. In that case, it would
be more efficient to fill all the request slots first and only send
one notification to the pipe for all of them)
2025-07-10 15:19:20 +03:00
Heikki Linnakangas
f30c59bec9 Improve comments on request slots 2025-07-10 15:19:20 +03:00
Heikki Linnakangas
47c099a0fb Rename NeonIOHandle to NeonIORequestSlot
All the code talks about "request slots", so it's better to make the struct
name reflect that. The "Handle" term was borrowed from Postgres v18
AIO implementation, from the similar handles or slots used to submit
IO requests from backends to worker processes. But even though the
idea is similar, it's a completely separate implementation and there's
nothing shared between them other than the very high-level
design.
2025-07-10 14:52:16 +03:00
Heikki Linnakangas
b67e8f2edc Move some code, just for more natural logical ordering 2025-07-10 14:49:29 +03:00
Heikki Linnakangas
b5b1db29bb Implement shard map live-update 2025-07-10 12:25:15 +03:00
Heikki Linnakangas
ed4652b65b Update the relsize cache rather than forget it at end of index build
This greatly reduces the cases where we make a request to the
pageserver with a very recent LSN. Those cases are slow because the
pageserver needs to wait for the WAL to arrive. This speeds up the
Postgres pg_regress and isolation tests greatly.
2025-07-09 17:21:06 +03:00
Heikki Linnakangas
60d87966b8 minor comment improvement 2025-07-09 16:39:40 +03:00
Heikki Linnakangas
8db138ef64 Plumb through the stripe size to the communicator 2025-07-09 16:18:26 +03:00
Heikki Linnakangas
1ee24602d5 Implement working set size estimation 2025-07-09 16:18:26 +03:00
Heikki Linnakangas
732bd26e70 cargo fmt 2025-07-09 16:18:26 +03:00
Erik Grinaker
08399672be Temporary workaround for timeout retry errors 2025-07-09 09:49:15 +02:00
Heikki Linnakangas
d63f1d259a avoid assertion failure about calling palloc() in critical section 2025-07-08 21:33:25 +03:00
Heikki Linnakangas
4053092408 Fix LSN tracking on "unlogged index builds"
Fixes the test_gin_redo.py test failure, and probably some others
2025-07-08 17:22:24 +03:00
Heikki Linnakangas
ccf88e9375 Improve debug logging by printing IO request details 2025-07-08 17:16:09 +03:00
Heikki Linnakangas
a79fd3bda7 Move logic for picking request slot to the C code
With this refactoring, the Rust code deals with one giant array of
requests, and doesn't know that it's sliced up per backend
process. The C code is now responsible for slicing it up.

This also adds code, at backend start, to complete old IOs that were
started and left behind by a previous session. That was a little more
straightforward to do with the refactoring, which is why I tackled it
now.
2025-07-07 12:59:08 +03:00
Heikki Linnakangas
e1b58d5d69 Don't segfault if one of the unimplemented functions is called 2025-07-07 11:33:44 +03:00
We'll need to implement these, but let's stop the crashing for now
2025-07-07 11:33:44 +03:00
Erik Grinaker
9ae004f3bc Rename ShardMap to ShardSpec 2025-07-06 19:13:59 +02:00
Erik Grinaker
341c5f53d8 Restructure get_page retries 2025-07-06 18:35:47 +02:00
Erik Grinaker
4b06b547c1 pageserver/client_grpc: add shard map updates 2025-07-06 13:27:17 +02:00
Heikki Linnakangas
74e0d85a04 fix: Don't lose track of in-progress request if query is cancelled 2025-07-06 13:04:03 +03:00
Erik Grinaker
23ba42446b Fix accidental 1ms sleeps for GetPages 2025-07-06 11:09:58 +02:00
Heikki Linnakangas
71a83daac2 Revert crate dependencies to the versions in main branch
Some tests were failing with "Only request bodies with a known size
can be checksum validated." errors. This is a known issue with more
recent aws client versions, see
https://github.com/neondatabase/neon/issues/11363.
2025-07-05 18:03:19 +03:00
Heikki Linnakangas
1b8355a9f9 put back option lost in merge 2025-07-05 17:36:27 +03:00
Heikki Linnakangas
e14bb4be39 Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-05 16:59:51 +03:00
Heikki Linnakangas
f3a6c0d8ff cargo fmt 2025-07-05 16:26:24 +03:00
Heikki Linnakangas
17ec37aab2 Close gRPC getpage streams on shutdown
Some tests were failing, because pageserver didn't shut down promptly.
Tonic server graceful shutdown was a little too graceful; any open
streams linger until they're closed. Check the cancellation token
while waiting for the next request, and close the stream if
shutdown/cancellation was requested.
2025-07-05 16:26:24 +03:00
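A hedged sketch of that shutdown check, with placeholder types; the real server code differs in detail:

```
use futures::StreamExt;
use tokio_util::sync::CancellationToken;

struct GetPageRequest; // placeholder type for this sketch

async fn handle(_req: GetPageRequest) { /* process one request */ }

// Race the next request against the cancellation token, so an open stream
// closes promptly on shutdown instead of lingering.
async fn serve_stream(
    mut requests: impl futures::Stream<Item = GetPageRequest> + Unpin,
    cancel: CancellationToken,
) {
    loop {
        tokio::select! {
            _ = cancel.cancelled() => break,     // shutdown/cancellation requested
            req = requests.next() => match req {
                Some(req) => handle(req).await,
                None => break,                   // client closed the stream
            },
        }
    }
}
```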
Heikki Linnakangas
d6ec1f1a1c Skip legacy LFC initialization when communicator is used
It clashes with the initialization of the LFC file
2025-07-05 16:26:24 +03:00
Erik Grinaker
6f3fb4433f Add TODO 2025-07-05 14:15:34 +02:00
Erik Grinaker
d7678df445 Reap idle pool resources 2025-07-05 13:35:28 +02:00
Erik Grinaker
03d9f0ec41 Comment tweaks 2025-07-05 11:16:40 +02:00
Erik Grinaker
56845f2da2 Add GetPageClass::is_bulk 2025-07-05 11:15:28 +02:00
Heikki Linnakangas
9a37bfdf63 Fix re-finding an entry in bucket chain 2025-07-05 00:44:46 +03:00
Heikki Linnakangas
4c916552e8 Reduce logging noise
These are very useful while debugging, but also very noisy; let's dial
it down a little.
2025-07-04 23:11:36 +03:00
Heikki Linnakangas
50fbf4ac53 Fix hash table initialization across forked processes
attach_writer()/reader() are called from each forked process. It's too
late to do initialization there; in fact, we used to overwrite the
contents of the hash table (or at least the freelist?) every time a
new process attached to it. The initialization must be done earlier,
in the HashMapInit() constructors.
2025-07-04 23:08:34 +03:00
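A rough sketch of the intended ordering, reusing the HashMapInit/attach_writer names from this branch; the generics and sizes here are invented and the real signatures may differ:

```
use neon_shmem::hash::HashMapInit;

fn example() {
    // One-time initialization (bucket array, freelist) happens in the
    // constructor, run once before any process attaches.
    let init = HashMapInit::<u64, u64>::new_resizeable(1024, 2048);

    // Each forked process then only maps the already-initialized table;
    // attach_writer()/attach_reader() must not rebuild the freelist or
    // otherwise overwrite the shared contents.
    let _writer = init.attach_writer();
}
```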
Erik Grinaker
cb698a3951 Add dedicated client pools for bulk requests 2025-07-04 21:52:25 +02:00
Erik Grinaker
f6cc5cbd0c Split out retry handler to separate module 2025-07-04 20:20:09 +02:00
Heikki Linnakangas
00affada26 Add request ID to all communicator log lines as context information 2025-07-04 20:34:26 +03:00
Heikki Linnakangas
90d3c09c24 Minor cleanup
Tidy up and add some comments. Rename a few things for clarity.
2025-07-04 20:32:59 +03:00
Heikki Linnakangas
6c398aeae7 Fix dependency in Makefile 2025-07-04 20:24:21 +03:00
Heikki Linnakangas
1856bbbb9f Minor cleanup and commenting 2025-07-04 18:28:34 +03:00
Heikki Linnakangas
bd46dd60a0 Add a temporary timeout for handling an IO request in the communicator
It's nicer to time out in the communicator and return an error to the
backend than to PANIC the backend.
2025-07-04 16:08:22 +03:00
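A minimal sketch of such a guard, assuming a tokio::time::timeout wrapper; the 30-second value and the error type are placeholders:

```
use std::time::Duration;

// Bound the time spent waiting on a single IO request and turn expiry into an
// error for the backend rather than letting the backend PANIC.
async fn handle_with_timeout<T>(
    fut: impl std::future::Future<Output = Result<T, std::io::Error>>,
) -> Result<T, std::io::Error> {
    match tokio::time::timeout(Duration::from_secs(30), fut).await {
        Ok(result) => result, // completed in time
        Err(_elapsed) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "IO request timed out in communicator",
        )),
    }
}
```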
Heikki Linnakangas
5f2d476a58 Add request ID to io-in-progress locking table, to ease debugging
I also added INFO messages for when a backend blocks on the
io-in-progress lock. It's probably too noisy for production, but
useful now to get a picture of how much it happens.
2025-07-04 15:55:57 +03:00
Heikki Linnakangas
3231cb6138 Await the io-in-progress locking futures
Otherwise they don't do anything. Oops.
2025-07-04 15:55:57 +03:00
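For context, Rust futures are lazy, so a locking future that is constructed but never awaited does nothing; a tiny illustration with a stand-in locking function:

```
async fn lock_io_in_progress() { /* acquire the io-in-progress lock */ }

async fn read_block() {
    let lock_fut = lock_io_in_progress();
    // Bug: if `lock_fut` is dropped here, the lock is never taken.
    lock_fut.await; // the fix: actually await the future
    // ... perform the read while holding the lock ...
}
```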
Heikki Linnakangas
e558e0da5c Assign request_id earlier, in the originating backend
Makes it more useful for stitching together logs etc. for a specific
request.
2025-07-04 15:55:55 +03:00
Heikki Linnakangas
70bf2e088d Request multiple block numbers in a single GetPageV request
That's how it was always intended to be used
2025-07-04 15:49:04 +03:00
Heikki Linnakangas
da3f9ee72d cargo fmt 2025-07-04 12:39:41 +03:00
Erik Grinaker
88d1127bf4 Tweak GetPageSplitter 2025-07-03 21:12:26 +02:00
David Freifeld
794bb7a9e8 Merge branch 'quantumish/comm-lfc-integration' into communicator-rewrite 2025-07-03 10:52:29 -07:00
Erik Grinaker
42e4e5a418 Add GetPage request splitting 2025-07-03 18:31:12 +02:00
Heikki Linnakangas
96a817fa2b Fix the case where the storage auth token is _not_ used
I broke that in the previous commit while fixing the case where a token is used.
2025-07-03 18:39:06 +03:00
Heikki Linnakangas
e7b057f2e8 Fix passing storage JWT token to the communicator process
Makes the 'test_compute_auth_to_pageserver' test pass
2025-07-03 18:14:22 +03:00
Heikki Linnakangas
956c2f4378 cargo fmt 2025-07-03 16:16:42 +03:00
Heikki Linnakangas
3293e4685e Fix cases where pageserver gets stuck waiting for LSN
The compute might make a request with an LSN that it hasn't even
flushed yet.
2025-07-03 16:14:45 +03:00
Erik Grinaker
6f8650782f Client tweaks 2025-07-03 14:54:23 +02:00
Erik Grinaker
14214eb853 Add client shard routing 2025-07-03 14:42:35 +02:00
Erik Grinaker
d4b4724921 Sanity-check Pageserver URLs 2025-07-03 14:18:14 +02:00
Erik Grinaker
9aba9550dd Instrument client methods 2025-07-03 14:11:53 +02:00
Erik Grinaker
375e8e5592 Improve retries and logging 2025-07-03 14:02:43 +02:00
Erik Grinaker
52c586f678 Restructure shard management 2025-07-03 11:51:19 +02:00
Erik Grinaker
de97b73d6e Lint fixes 2025-07-03 10:38:14 +02:00
Heikki Linnakangas
d8556616c9 Fix running Postgres in "vanilla mode", without neon storage
Some tests do that
2025-07-03 00:32:40 +03:00
Heikki Linnakangas
d8296e60e6 Fix caching of newly extended pages
This fixes read errors e.g. in test_compute_catalog.py test (and
probably many others).
2025-07-02 23:21:42 +03:00
Heikki Linnakangas
7263d6e2e5 Clarify error message if not_modified_lsn > request_lsn
I'm seeing this error from some python tests, which means there's a
bug on the compute side, of course, but it took me a while to figure
that out.
2025-07-02 23:21:42 +03:00
David Freifeld
86fb7b966a Update integrated_cache.rs to use new hashmap API 2025-07-02 12:18:37 -07:00
David Freifeld
0c099b0944 Merge branch 'quantumish/lfc-resizable-map' into quantumish/comm-lfc-integration 2025-07-02 12:05:24 -07:00
David Freifeld
2fe27f510d Make neon-shmem tests thread-safe and report errno in panics 2025-07-02 11:57:49 -07:00
David Freifeld
19b5618578 Switch to neon_shmem::sync lock_api and integrate into hashmap 2025-07-02 11:44:38 -07:00
Erik Grinaker
12dade35fa Comment tweaks 2025-07-02 14:47:27 +02:00
Erik Grinaker
1ec63bd6bc Misc pool improvements 2025-07-02 14:42:06 +02:00
Heikki Linnakangas
7012b4aa90 Remove --grpc options from neon_local endpoint reconfigure and start calls
They don't exist in neon_local anymore, and aren't actually used in
tests either.
2025-07-02 15:10:18 +03:00
Heikki Linnakangas
2cc28c75be Fix "ERROR: could not read size of rel ..." in many regression tests.
We were incorrectly skipping the call to communicator_new_rel_create(),
which resulted in an error during index build, when the btree build code
tried to check the size of the newly-created relation.
2025-07-02 14:10:11 +03:00
Erik Grinaker
bf01145ae4 Remove some old code 2025-07-02 11:46:54 +02:00
Erik Grinaker
8ab8fc11a3 Use new PageserverClient 2025-07-02 11:27:56 +02:00
Erik Grinaker
6f0af96a54 Add new PageserverClient 2025-07-02 10:59:40 +02:00
Heikki Linnakangas
9913d2668a print retried pageserver requests to log
Not sure how verbose we want this to be in production, but for now,
more is better.

This shows that many tests are failing with errors like these:

    PG:2025-07-01 23:02:34.311 GMT [1456523] LOG:  [COMMUNICATOR] send_process_get_rel_size_request: got error status: NotFound, message: "Read error", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 01 Jul 2025 23:02:34 GMT"} }, retrying​

I haven't debugged why that is yet. Did the compute make a bogus request?
2025-07-02 02:04:04 +03:00
Heikki Linnakangas
2fefece77d temporary hack to make regression tests fail faster 2025-07-02 01:42:39 +03:00
Heikki Linnakangas
471191e64e Fix updating relsize cache during WAL replay
This makes some of the test_runner/regress/test_hot_standby.py tests
pass. (Others are still failing.)
2025-07-01 21:22:04 +03:00
Erik Grinaker
f6761760a2 Documentation and tweaks 2025-07-01 17:54:41 +02:00
Erik Grinaker
0bce818d5e Add stream pool 2025-07-01 17:54:41 +02:00
Erik Grinaker
48be1da6ef Add initial client pool 2025-07-01 17:54:41 +02:00
Erik Grinaker
d2efc80e40 Add initial ChannelPool 2025-07-01 17:54:41 +02:00
Erik Grinaker
958c2577f5 pageserver: tighten up page_api::Client 2025-07-01 17:54:41 +02:00
Heikki Linnakangas
175c2e11e3 Add assertions that the legacy relsize cache is not used with new communicator
And fix a few cases where it was being called
2025-07-01 16:44:25 +03:00
Heikki Linnakangas
efdb07e7b6 Implement function to check if page is in local cache
This is needed for read replicas. There's one more TODO that needs to be
implemented before read replicas work though, in
neon_extend_rel_size().
2025-07-01 16:22:51 +03:00
Heikki Linnakangas
b0970b415c Don't call legacy lfc function when new communicator is used 2025-07-01 15:47:26 +03:00
David Freifeld
9d3e07ef2c Add initial prototype of shmem sync primitives 2025-06-30 17:07:07 -07:00
Heikki Linnakangas
7429dd711c fix the .metrics.socket filename in the ignore list 2025-06-30 23:41:09 +03:00
Heikki Linnakangas
88ac1e356b Ignore the metrics unix domain socket in tests 2025-06-30 23:39:01 +03:00
Erik Grinaker
c3cb1ab98d Merge branch 'main' into communicator-rewrite 2025-06-30 21:07:01 +02:00
Erik Grinaker
81ac4ef43a Add a generic pool prototype 2025-06-30 14:49:34 +02:00
Erik Grinaker
a5b0fc560c Fix/allow remaining clippy lints 2025-06-30 12:36:20 +02:00
Erik Grinaker
67b04f8ab3 Fix a bunch of linter warnings 2025-06-30 11:10:02 +02:00
Erik Grinaker
9d9e3cd08a Fix test_normal_work grpc param 2025-06-30 10:13:46 +02:00
Heikki Linnakangas
97a8f4ef85 Handle unexpected EOF while doing an LFC read more gracefully
There's a bug somewhere because this happens in python regression
tests. We need to hunt that down, but in any case, let's not get stuck
in an infinite loop if it happens.
2025-06-30 00:59:53 +03:00
Heikki Linnakangas
39f31957e3 Handle pageserver response with different number of pages gracefully
Some tests are hitting this case, where pageserver returns 0 page
images in the response to a GetPage request. I suspect it's because
the code doesn't handle sharding correctly? In any case, let's not
panic on it, but return an IO error to the originating backend.
2025-06-29 23:44:28 +03:00
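A sketch of the kind of check this implies; the error construction is illustrative, not the actual communicator code:

```
// Surface a mismatched page count as an IO error to the originating backend
// instead of panicking in the communicator.
fn check_page_count(requested: usize, returned: usize) -> Result<(), std::io::Error> {
    if requested != returned {
        return Err(std::io::Error::other(format!(
            "pageserver returned {returned} page images for {requested} requested blocks"
        )));
    }
    Ok(())
}
```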
Heikki Linnakangas
924c6a6fdf Fix handling the case that server closes the stream
- avoid panic by checking for Ok(None) response from
  tonic::Streaming::message() instead of just using unwrap()
- There was a race condition if the caller sent a message while the
  receiver task concurrently received Ok(None) indicating the stream
  was closed. (I didn't see that in action, but reading the code, I
  think it could happen.)
2025-06-29 22:53:39 +03:00
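A hedged sketch of the receiver-side handling: tonic::Streaming::message() returns Ok(Some(msg)) for a message, Ok(None) when the server closes the stream, or Err(status), and matching all three avoids the unwrap() panic:

```
async fn receive_loop<T>(mut stream: tonic::Streaming<T>) -> Result<(), tonic::Status> {
    loop {
        match stream.message().await {
            Ok(Some(_response)) => { /* dispatch the response to the waiting backend */ }
            Ok(None) => return Ok(()),         // server closed the stream cleanly
            Err(status) => return Err(status), // transport or server error
        }
    }
}
```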
Heikki Linnakangas
7020476bf5 Run cargo fmt 2025-06-29 22:53:09 +03:00
Heikki Linnakangas
80e948db93 Remove unused mock factory
After reading the code a few times, I didn't quite understand what it
was, to be honest, or how it was going to be used. Remove it now to
reduce noise, but we can resurrect it from git history if we need it
in the future.
2025-06-29 22:52:48 +03:00
Heikki Linnakangas
bfb30d434c minor code tidy-up 2025-06-29 22:51:34 +03:00
Heikki Linnakangas
f3ba201800 Run cargo fmt 2025-06-29 21:21:07 +03:00
Heikki Linnakangas
8b7796cbfa wip 2025-06-29 21:20:48 +03:00
Heikki Linnakangas
fdc7e9c2a4 Extract repeated code to look up RequestTracker into a helper function 2025-06-29 21:20:14 +03:00
Heikki Linnakangas
a352d290eb Plumb through both libpq and grpc connection strings to the compute
Add a new 'pageserver_connection_info' field in the compute spec. It
replaces the old 'pageserver_connstring' field with a more complicated
struct that includes both libpq and grpc URLs, for each shard (or only
one of the URLs, depending on the configuration). It also includes
a flag suggesting which one to use; compute_ctl now uses it to decide
which protocol to use for the basebackup.

This is compatible with everything that's in production, because the
control plane never used the 'pageserver_connstring' field. That was
added a long time ago with the idea that it would replace the code
that digs the 'neon.pageserver_connstring' GUC from the list of
Postgres settings, but we never got around to do that in the control
plane. Hence, it was only used with neon_local. But the plan now is to
pass the 'pageserver_connection_info' from the control plane, and once
that's fully deployed everywhere, the code to parse
'neon.pageserver_connstring' in compute_ctl can be removed.

The 'grpc' flag on an endpoint in endpoint config is now more of a
suggestion. Compute_ctl gets both URLs, so it can choose to use libpq
or grpc as it wishes. It currently always obeys the 'prefer_grpc' flag
that's part of the connection info though. Postgres however uses grpc
iff the new rust-based communicator is enabled.

TODO/plan for the control plane:

- Start to pass `pageserver_connection_info` in the spec file.
- Also keep the current `neon.pageserver_connstring` setting for now,
  for backwards compatibility with old computes

After that, the `pageserver_connection_info.prefer_grpc` flag in the
spec file can be used to control whether compute_ctl uses grpc or
libpq.  The actual compute's grpc usage will be controlled by the
`neon.enable_new_communicator` GUC. It can be set separately from
'prefer_grpc'.

Later:

- Once all old computes are gone, remove the code to pass
  `neon.pageserver_connstring`
2025-06-29 18:16:49 +03:00
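A rough sketch of the shape of the new field; only the names that appear in this commit message and the diff below (pageserver_connection_info, prefer_protocol, PageserverProtocol, stripe_size, grpc_url) come from the source, the rest is assumed:

```
enum PageserverProtocol {
    Libpq,
    Grpc,
}

struct PageserverShardInfo {
    libpq_url: Option<String>, // absent if only gRPC is configured for the shard
    grpc_url: Option<String>,  // absent if only libpq is configured for the shard
}

struct PageserverConnectionInfo {
    prefer_protocol: PageserverProtocol, // hint used by compute_ctl, e.g. for the basebackup
    stripe_size: Option<u32>,            // sharding stripe size, if sharded
    shards: Vec<PageserverShardInfo>,    // one entry per shard
}
```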
Heikki Linnakangas
8c122a1c98 Don't call into the old LFC when using the new communicator
This fixes errors like `index "pg_class_relname_nsp_index" contains
unexpected zero page at block 2` when running the python tests

smgrzeroextend() still called into the old LFC's lfc_write() function,
even when using the new communicator, which zeroed some arbitrary
pages in the LFC file, overwriting pages managed by the new LFC
implementation in `integrated_cache.rs`.
2025-06-29 17:40:46 +03:00
David Freifeld
74330920ee Simplify API, squash bugs, and expand hashmap test suite 2025-06-27 17:11:22 -07:00
David Freifeld
c3c136ef3a Remove statistics utilities from neon_shmem crate 2025-06-27 17:10:52 -07:00
David Freifeld
78b6da270b Sketchily integrate hashmap rewrite with integrated_cache 2025-06-26 16:45:48 -07:00
David Freifeld
47664e40d4 Initial work in visualizing properties of hashmap 2025-06-26 16:00:33 -07:00
David Freifeld
b1e3161d4e Satisfy cargo clippy lints, simplify shrinking API 2025-06-26 14:32:32 -07:00
David Freifeld
4713715c59 Merge branch 'communicator-rewrite' of github.com:neondatabase/neon into communicator-rewrite 2025-06-26 10:26:41 -07:00
David Freifeld
1e74b52f7e Merge branch 'quantumish/lfc-resizable-map' into communicator-rewrite 2025-06-26 10:26:22 -07:00
Erik Grinaker
e3ecdfbecc pgxn/neon: actually use UNAME_S 2025-06-26 12:38:44 +02:00
Erik Grinaker
d08e553835 pgxn/neon: fix callback_get_request_lsn_unsafe return type 2025-06-26 12:33:59 +02:00
Erik Grinaker
7fffb5b4df pgxn/neon: fix macOS build 2025-06-26 12:33:39 +02:00
David Freifeld
1fb3639170 Properly change type of HashMapInit in .with_hasher() 2025-06-25 03:03:19 -07:00
David Freifeld
00dfaa2eb4 Add Criterion microbenchmarks for rehashing and insertions 2025-06-24 16:30:59 -07:00
David Freifeld
ae740ca1bb Document hashmap implementation, fix get_bucket_for_value
Previously, `get_bucket_for_value` incorrectly divided by the size of
`V` to get the bucket index. Now it divides by the size of `Bucket<K,V>`.
2025-06-24 16:27:17 -07:00
David Freifeld
24e6c68772 Remove prev entry tracking, refactor HashMapInit into proper builder 2025-06-24 13:34:22 -07:00
David Freifeld
93a45708ff Change finish_shrink to remap entries in shrunk space 2025-06-23 16:15:43 -07:00
Heikki Linnakangas
46b5c0be0b Remove duplicated migration script
I messed this up during the merge I guess?
2025-06-23 19:46:32 +03:00
Heikki Linnakangas
2d913ff125 fix some mismerges 2025-06-23 18:21:16 +03:00
Heikki Linnakangas
e90be06d46 silence a few compiler warnings
about unnecessary 'mut's and 'use's
2025-06-23 18:16:54 +03:00
Heikki Linnakangas
356ba67607 Merge remote-tracking branch 'origin/main' into HEAD
I also included build script changes from
https://github.com/neondatabase/neon/pull/12266, which is not yet
merged but will be soon.
2025-06-23 17:46:30 +03:00
David Freifeld
610ea22c46 Generalize map to allow arbitrary hash fns, add clear() helper method 2025-06-20 11:46:02 -07:00
Heikki Linnakangas
1847f4de54 Add missing #include.
Got a warning on macos without this
2025-06-18 17:26:20 +03:00
David Freifeld
477648b8cd Clean up hashmap implementation, add bucket tests 2025-06-17 11:23:10 -07:00
Heikki Linnakangas
e8af3a2811 remove unused struct in example code, to silence compiler warning 2025-06-17 02:09:21 +03:00
Heikki Linnakangas
b603e3dddb Silence compiler warnings in example code 2025-06-17 02:07:33 +03:00
Heikki Linnakangas
83007782fd fix compilation of example 2025-06-17 02:07:15 +03:00
David Freifeld
bb1e359872 Add testing utilities for hash map, freelist bugfixes 2025-06-16 16:02:39 -07:00
David Freifeld
ac87544e79 Implement shrinking, add basic tests for core operations 2025-06-16 13:13:38 -07:00
David Freifeld
b6b122e07b nw: add shrinking and deletion skeletons 2025-06-16 10:20:30 -07:00
Erik Grinaker
782062014e Fix test_normal_work endpoint restart 2025-06-16 10:16:27 +02:00
Erik Grinaker
d0b3629412 Tweak base backups 2025-06-13 13:47:26 -07:00
Heikki Linnakangas
16d6898e44 git add missing file 2025-06-12 02:37:59 +03:00
Erik Grinaker
f4d51c0f5c Use gRPC for test_normal_work 2025-06-09 22:51:15 +02:00
Erik Grinaker
ec17ae0658 Handle gRPC basebackups in compute_ctl 2025-06-09 22:50:57 +02:00
Erik Grinaker
9ecce60ded Plumb gRPC addr through storage-controller 2025-06-09 20:24:18 +02:00
Erik Grinaker
e74a957045 test_runner: initial gRPC protocol support 2025-06-06 16:56:33 +02:00
Erik Grinaker
396a16a3b2 test_runner: enable gRPC Pageserver 2025-06-06 14:55:29 +02:00
Elizabeth Murray
7140a50225 Minor changes to get integration tests to run for communicator. 2025-06-06 04:32:51 +02:00
Elizabeth Murray
68f18ccacf Request Tracker Prototype
Does not include splitting requests across shards.
2025-06-05 13:32:18 -07:00
Heikki Linnakangas
786888d93f Instead of a fixed TCP port for metrics, listen on a unix domain socket
That avoids clashes if you run two computes at the same time. More
secure too. We might want to have a TCP port in the long run, but this
is less trouble for now.

To see the metrics with curl you can use:

    curl --unix-socket .neon/endpoints/ep-main/pgdata/.metrics.socket http://localhost/metrics
2025-06-05 21:28:11 +03:00
Heikki Linnakangas
255537dda1 avoid hitting assertion failure in MarkPostmasterChildWalSender() 2025-06-05 20:08:32 +03:00
Erik Grinaker
8b494f6a24 Ignore communicator_bindings.h 2025-06-05 17:52:50 +02:00
Erik Grinaker
28a61741b3 Mangle gRPC connstrings to use port 51051 2025-06-05 17:46:58 +02:00
Heikki Linnakangas
10b936bf03 Use a custom Rust implementation to replace the LFC hash table
The new implementation lives in a separately allocated shared memory
area, which could be resized. Resizing it isn't actually implemented
yet, though. It would require some co-operation from the LFC code.
2025-06-05 18:31:29 +03:00
Erik Grinaker
2fb6164bf8 Misc build fixes 2025-06-05 17:22:11 +02:00
Erik Grinaker
328f28dfe5 impl Default for SlabBlockHeader 2025-06-05 17:18:28 +02:00
Erik Grinaker
95838056da Fix RelTag fields 2025-06-05 17:13:51 +02:00
Heikki Linnakangas
6145cfd1c2 Move neon-shmem facility to separate module within the crate 2025-06-05 18:13:03 +03:00
Erik Grinaker
6d451654f1 Remove generated communicator_bindings.h 2025-06-05 17:12:13 +02:00
Heikki Linnakangas
96b4de1de6 Make LFC chunk size a compile-time constant
A runtime setting is nicer, but the next commit will replace the hash
table with a different implementation that requires the value size to
be a compile-time constant.
2025-06-05 18:08:40 +03:00
Heikki Linnakangas
9fdf5fbb7e Use a separate freelist to track LFC "holes"
When the LFC is shrunk, we punch holes in the underlying file to
release the disk space to the OS. We tracked it in the same hash table
as the in-use entries, because that was convenient. However, I'm
working on being able to shrink the hash table too, and once we do
that, we'll need some other place to track the holes. Implement a
simple scheme of an in-memory array and a chain of on-disk blocks for
that.
2025-06-05 18:08:35 +03:00
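A very rough sketch of that scheme; the names and the on-disk layout here are invented:

```
// A bounded in-memory array holds the offsets of punched-out chunks, and
// spills into a chain of on-disk blocks when it fills up.
struct HoleFreelist {
    in_memory: Vec<u32>,        // chunk offsets of free "holes", bounded in size
    overflow_head: Option<u64>, // file offset of the first on-disk overflow block, if any
}

impl HoleFreelist {
    fn new() -> Self {
        HoleFreelist { in_memory: Vec::new(), overflow_head: None }
    }
}
```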
Erik Grinaker
37c58522a2 Merge branch 'main' into communicator-rewrite 2025-06-05 15:08:05 +02:00
Erik Grinaker
4b6f02e47d Merge branch 'main' into communicator-rewrite 2025-06-04 10:23:29 +02:00
Erik Grinaker
8202c6172f Merge branch 'main' into communicator-rewrite 2025-06-03 16:04:31 +02:00
Erik Grinaker
69a47d789d pageserver: remove gRPC compute service prototype 2025-06-03 13:47:21 +02:00
Erik Grinaker
b36f880710 Fix Linux build failures 2025-06-03 13:37:56 +02:00
Erik Grinaker
745b750f33 Merge branch 'main' into communicator-rewrite 2025-06-03 13:29:45 +02:00
Heikki Linnakangas
f06bb2bbd8 Implement growing the hash table. Fix unit tests. 2025-05-29 15:54:55 +03:00
Heikki Linnakangas
b3c25418a6 Add metrics to track memory usage of the rust communicator 2025-05-29 02:14:01 +03:00
Heikki Linnakangas
33549bad1d use separate hash tables for relsize cache and block mappings 2025-05-28 23:57:55 +03:00
Heikki Linnakangas
009168d711 Add placeholder shmem hashmap implementation
Use that instead of the half-baked Adaptive Radix Tree
implementation. ART would probably be better in the long run, but more
complicated to implement.
2025-05-28 11:08:35 +03:00
Elizabeth Murray
7c9bd542a6 Fix compile warnings, minor cleanup. 2025-05-26 06:30:48 -07:00
Elizabeth Murray
014823b305 Add a new iteration of a new client pool with some updates. 2025-05-26 05:29:32 -07:00
Elizabeth Murray
af9379ccf6 Use a semaphore to gate access to connections. Add metrics for testing. 2025-05-26 05:28:50 -07:00
Heikki Linnakangas
bb28109ffa Merge remote-tracking branch 'origin/main' into communicator-rewrite-with-integrated-cache
There were conflicts because of the differences in the page_api
protocol that was merged to main vs what was on the branch. I adapted
the code for the protocol in main.
2025-05-26 11:52:32 +03:00
Elizabeth Murray
60a0bec1c0 Set default max consumers per connection to a high number. 2025-05-19 07:00:39 -07:00
Elizabeth Murray
31fa7a545d Remove unnecessary info include now that the info message is gone. 2025-05-19 06:52:07 -07:00
Elizabeth Murray
ac464c5f2c Return info message that was used for debugging. 2025-05-19 06:39:16 -07:00
Elizabeth Murray
0dddb1e373 Add back whitespace that was removed. 2025-05-19 06:34:52 -07:00
Elizabeth Murray
3acb263e62 Add first iteration of simulating a flakey network with a custom TCP. 2025-05-19 06:33:30 -07:00
Elizabeth Murray
1e83398cdd Correct out-of-date comment. 2025-05-14 07:31:52 -07:00
Elizabeth Murray
be8ed81532 Connection pool: update error accounting, sweep idle connections, add config options. 2025-05-14 07:31:52 -07:00
Heikki Linnakangas
12b08c4b82 Fix shutdown 2025-05-14 01:49:55 +03:00
Heikki Linnakangas
827358dd03 Handle OOMs a little more gracefully 2025-05-12 23:33:22 +03:00
Heikki Linnakangas
d367273000 minor cleanup 2025-05-12 23:11:55 +03:00
Heikki Linnakangas
e2bad5d9e9 Add debugging HTTP endpoint for dumping the cache tree 2025-05-12 22:54:03 +03:00
Heikki Linnakangas
5623e4665b bunch of fixes 2025-05-12 18:40:54 +03:00
Heikki Linnakangas
8abb4dab6d implement shrinking nodes 2025-05-12 03:57:10 +03:00
Heikki Linnakangas
731667ac37 better metrics of the art tree 2025-05-12 02:08:51 +03:00
Heikki Linnakangas
6a1374d106 Pack tree node structs more tightly, avoiding alignment padding 2025-05-12 01:01:58 +03:00
Heikki Linnakangas
f7c908f2f0 more metrics 2025-05-12 01:01:50 +03:00
Heikki Linnakangas
86671e3a0b Add a bunch of metric counters 2025-05-11 20:11:13 +03:00
Heikki Linnakangas
319cd74f73 Fix eviction 2025-05-11 19:34:50 +03:00
Heikki Linnakangas
0efefbf77c Add a few metrics, fix page eviction 2025-05-10 03:13:28 +03:00
Heikki Linnakangas
e6a4171fa1 fix concurrency issues with the LFC
- Add another locking hash table to track which cached pages are currently being
  modified, by smgrwrite() or smgrread() or by prefetch.

- Use single-value Leaf pages in the art tree. That seems simpler after all,
  and it eliminates some corner cases where a Value needed to be cloned, which
  made it tricky to use atomics or other interior mutability on the Values
2025-05-10 02:36:48 +03:00
Heikki Linnakangas
0c25ea9e31 reduce LOG noise 2025-05-09 18:27:36 +03:00
Heikki Linnakangas
6692321026 Remove dependency on io_uring, use plain std::fs ops instead
io_uring is a great idea in the long term, but for now, let's make it
easier to develop locally on macos, where io_uring is not available.
2025-05-06 17:46:21 +03:00
Heikki Linnakangas
791df28755 Linked list fix and add unit test 2025-05-06 16:46:54 +03:00
Heikki Linnakangas
d20da994f4 git add missing file 2025-05-06 15:36:48 +03:00
Heikki Linnakangas
6dbbdaae73 run 'cargo fmt' 2025-05-06 15:35:56 +03:00
Heikki Linnakangas
977bc09d2a Bunch of fixes, smarter iterator, metrics exporter 2025-05-06 15:28:50 +03:00
Heikki Linnakangas
44269fcd5e Implement simple eviction and free block tracking 2025-05-06 15:28:15 +03:00
Heikki Linnakangas
44cc648dc8 Implement iterator over keys
the implementation is not very optimized, but probably good enough for an MVP
2025-05-06 15:27:38 +03:00
Heikki Linnakangas
884e028a4a implement deletion in art tree 2025-05-06 15:27:38 +03:00
Heikki Linnakangas
42df3e5453 debugging stats 2025-05-06 15:27:38 +03:00
Heikki Linnakangas
fc743e284f more work on allocators 2025-05-06 15:27:38 +03:00
Heikki Linnakangas
d02f9a2139 Collect garbage, handle OOMs 2025-05-06 15:27:38 +03:00
Heikki Linnakangas
083118e98e Implement epoch system 2025-05-06 15:27:38 +03:00
Heikki Linnakangas
54cd2272f1 more memory allocation stuff 2025-05-06 15:27:38 +03:00
Heikki Linnakangas
e40193e3c8 simple block-based allocator 2025-05-06 15:27:38 +03:00
Heikki Linnakangas
ce9f7bacc1 Fix communicator client for recent changes in protocol and client code 2025-05-06 15:26:51 +03:00
Heikki Linnakangas
b7891f8fe8 Include 'neon-shard-id' header in client requests 2025-05-06 15:23:30 +03:00
Elizabeth Murray
5f2adaa9ad Remove some additional debug info messages. 2025-05-02 10:50:53 -07:00
Elizabeth Murray
3e5e396c8d Remove some debug info messages. 2025-05-02 10:24:18 -07:00
Elizabeth Murray
9d781c6fda Add a connection pool module to the grpc client. 2025-05-02 10:22:33 -07:00
Erik Grinaker
cf5d038472 service documentation 2025-05-02 15:20:12 +02:00
Erik Grinaker
d785100c02 page_api: add GetPageRequest::class 2025-05-02 10:48:32 +02:00
Erik Grinaker
2c0d930e3d page_api: add GetPageResponse::status 2025-04-30 16:48:45 +02:00
Erik Grinaker
66171a117b page_api: add GetPageRequestBatch 2025-04-30 15:31:11 +02:00
Erik Grinaker
df2806e7a0 page_api: add GetPageRequest::id 2025-04-30 15:00:16 +02:00
Erik Grinaker
07631692db page_api: protobuf comments 2025-04-30 12:36:11 +02:00
Erik Grinaker
4c77397943 Add neon-shard-id header 2025-04-30 11:18:06 +02:00
Erik Grinaker
7bb58be546 Use authorization header instead of neon-auth-token 2025-04-30 10:38:44 +02:00
Erik Grinaker
b5373de208 page_api: add get_slru_segment() 2025-04-29 17:59:27 +02:00
Erik Grinaker
b86c610f42 page_api: tweaks 2025-04-29 17:23:51 +02:00
Erik Grinaker
0f520d79ab pageserver: rename data_api to page_api 2025-04-29 15:58:52 +02:00
Heikki Linnakangas
93eb7bb6b8 include lots of changes that went missing by accident 2025-04-29 15:32:27 +03:00
Heikki Linnakangas
e58d0fece1 New communicator, with "integrated" cache accessible from all processes 2025-04-29 11:52:44 +03:00
61 changed files with 7510 additions and 808 deletions

.gitignore (vendored, 1 line changed)

@@ -15,6 +15,7 @@ neon.iml
/.neon
/integration_tests/.neon
compaction-suite-results.*
pgxn/neon/communicator/communicator_bindings.h
docker-compose/docker-compose-parallel.yml
# Coverage

Cargo.lock (generated, 107 lines changed)

@@ -259,6 +259,17 @@ version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a8ab6b55fe97976e46f91ddbed8d147d966475dc29b2032757ba47e02376fbc3"
[[package]]
name = "atomic_enum"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "99e1aca718ea7b89985790c94aad72d77533063fe00bc497bb79a7c2dae6a661"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "autocfg"
version = "1.1.0"
@@ -1296,13 +1307,29 @@ dependencies = [
name = "communicator"
version = "0.1.0"
dependencies = [
"atomic_enum",
"axum",
"bytes",
"cbindgen",
"clashmap",
"http 1.3.1",
"libc",
"measured",
"neon-shmem",
"nix 0.30.1",
"pageserver_api",
"pageserver_client_grpc",
"pageserver_page_api",
"prometheus",
"prost 0.13.5",
"strum_macros",
"thiserror 1.0.69",
"tokio",
"tokio-pipe",
"tonic",
"tracing",
"tracing-subscriber",
"uring-common",
"utils",
"workspace_hack",
]
@@ -1643,9 +1670,9 @@ dependencies = [
[[package]]
name = "crossbeam-utils"
version = "0.8.19"
version = "0.8.21"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "248e3bacc7dc6baa3b21e405ee045c3047101a49145e7e9eca583ab4c2ca5345"
checksum = "d0a5c400df2834b80a4c3327b3aad3a4c4cd4de0629063962b03235697506a28"
[[package]]
name = "crossterm"
@@ -2361,6 +2388,12 @@ version = "1.0.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1"
[[package]]
name = "foldhash"
version = "0.1.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d9c4f5dac5e15c24eb999c26181a6ca40b39fe946cbe4c263c7209467bc83af2"
[[package]]
name = "form_urlencoded"
version = "1.2.1"
@@ -2742,6 +2775,16 @@ version = "0.15.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bf151400ff0baff5465007dd2f3e717f3fe502074ca563069ce3a6629d07b289"
[[package]]
name = "hashbrown"
version = "0.15.4"
source = "git+https://github.com/quantumish/hashbrown.git?rev=6610e6d#6610e6d2b1f288ef7b0709a3efefbc846395dc5e"
dependencies = [
"allocator-api2",
"equivalent",
"foldhash",
]
[[package]]
name = "hashlink"
version = "0.9.1"
@@ -3822,7 +3865,7 @@ dependencies = [
"prometheus",
"rand 0.9.1",
"rand_distr",
"twox-hash",
"twox-hash 1.6.3",
]
[[package]]
@@ -3928,15 +3971,21 @@ checksum = "e5ce46fe64a9d73be07dcbe690a38ce1b293be448fd8ce1e6c1b8062c9f72c6a"
name = "neon-shmem"
version = "0.1.0"
dependencies = [
"ahash",
"criterion",
"hashbrown 0.15.4",
"libc",
"lock_api",
"nix 0.30.1",
"rand 0.9.1",
"rand_distr",
"rustc-hash 2.1.1",
"seahash",
"tempfile",
"thiserror 1.0.69",
"twox-hash 2.1.1",
"workspace_hack",
"xxhash-rust",
]
[[package]]
@@ -4391,13 +4440,16 @@ version = "0.1.0"
dependencies = [
"anyhow",
"async-trait",
"axum",
"bytes",
"camino",
"clap",
"futures",
"hdrhistogram",
"http 1.3.1",
"humantime",
"humantime-serde",
"metrics",
"pageserver_api",
"pageserver_client",
"pageserver_client_grpc",
@@ -4487,6 +4539,7 @@ dependencies = [
"pageserver_client",
"pageserver_compaction",
"pageserver_page_api",
"peekable",
"pem",
"pin-project-lite",
"postgres-protocol",
@@ -4500,6 +4553,7 @@ dependencies = [
"pprof",
"pq_proto",
"procfs",
"prost 0.13.5",
"rand 0.9.1",
"range-set-blaze",
"regex",
@@ -4536,7 +4590,7 @@ dependencies = [
"tower 0.5.2",
"tracing",
"tracing-utils",
"twox-hash",
"twox-hash 1.6.3",
"url",
"utils",
"uuid",
@@ -4748,7 +4802,7 @@ dependencies = [
"paste",
"seq-macro",
"thrift",
"twox-hash",
"twox-hash 1.6.3",
"zstd",
"zstd-sys",
]
@@ -4794,6 +4848,15 @@ dependencies = [
"sha2",
]
[[package]]
name = "peekable"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "225f9651e475709164f871dc2f5724956be59cb9edb055372ffeeab01ec2d20b"
dependencies = [
"smallvec",
]
[[package]]
name = "pem"
version = "3.0.3"
@@ -6493,6 +6556,12 @@ version = "0.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "621e3680f3e07db4c9c2c3fb07c6223ab2fab2e54bd3c04c3ae037990f428c32"
[[package]]
name = "seahash"
version = "4.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1c107b6f4780854c8b126e228ea8869f4d7b71260f962fefb57b996b8959ba6b"
[[package]]
name = "sec1"
version = "0.3.0"
@@ -7646,6 +7715,16 @@ dependencies = [
"syn 2.0.100",
]
[[package]]
name = "tokio-pipe"
version = "0.2.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f213a84bffbd61b8fa0ba8a044b4bbe35d471d0b518867181e82bd5c15542784"
dependencies = [
"libc",
"tokio",
]
[[package]]
name = "tokio-postgres"
version = "0.7.10"
@@ -8183,6 +8262,15 @@ dependencies = [
"static_assertions",
]
[[package]]
name = "twox-hash"
version = "2.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8b907da542cbced5261bd3256de1b3a1bf340a3d37f93425a07362a1d687de56"
dependencies = [
"rand 0.9.1",
]
[[package]]
name = "typed-json"
version = "0.1.1"
@@ -9013,8 +9101,8 @@ dependencies = [
"clap",
"clap_builder",
"const-oid",
"criterion",
"crossbeam-epoch",
"crossbeam-utils",
"crypto-bigint 0.5.5",
"der 0.7.8",
"deranged",
@@ -9057,7 +9145,6 @@ dependencies = [
"num-iter",
"num-rational",
"num-traits",
"once_cell",
"p256 0.13.2",
"parquet",
"portable-atomic",
@@ -9166,6 +9253,12 @@ version = "0.13.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4d25c75bf9ea12c4040a97f829154768bbbce366287e2dc044af160cd79a13fd"
[[package]]
name = "xxhash-rust"
version = "0.8.15"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fdd20c5420375476fbd4394763288da7eb0cc0b8c11deed431a91562af7335d3"
[[package]]
name = "yansi"
version = "1.0.1"

View File

@@ -93,6 +93,7 @@ clap = { version = "4.0", features = ["derive", "env"] }
clashmap = { version = "1.0", features = ["raw-api"] }
comfy-table = "7.1"
const_format = "0.2"
crossbeam-utils = "0.8.21"
crc32c = "0.6"
diatomic-waker = { version = "0.2.3" }
either = "1.8"
@@ -152,6 +153,7 @@ parquet = { version = "53", default-features = false, features = ["zstd"] }
parquet_derive = "53"
pbkdf2 = { version = "0.12.1", features = ["simple", "std"] }
pem = "3.0.3"
peekable = "0.3.0"
pin-project-lite = "0.2"
pprof = { version = "0.14", features = ["criterion", "flamegraph", "frame-pointer", "prost-codec"] }
procfs = "0.16"
@@ -190,6 +192,7 @@ smallvec = "1.11"
smol_str = { version = "0.2.0", features = ["serde"] }
socket2 = "0.5"
spki = "0.7.3"
spin = "0.9.8"
strum = "0.26"
strum_macros = "0.26"
"subtle" = "2.5.0"
@@ -201,7 +204,6 @@ thiserror = "1.0"
tikv-jemallocator = { version = "0.6", features = ["profiling", "stats", "unprefixed_malloc_on_supported_platforms"] }
tikv-jemalloc-ctl = { version = "0.6", features = ["stats"] }
tokio = { version = "1.43.1", features = ["macros"] }
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.12.0"
tokio-rustls = { version = "0.26.0", default-features = false, features = ["tls12", "ring"]}
@@ -242,6 +244,9 @@ zeroize = "1.8"
env_logger = "0.11"
log = "0.4"
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
uring-common = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch = "neon" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", branch = "neon" }

View File

@@ -8,7 +8,7 @@ use std::path::Path;
use compute_api::responses::TlsConfig;
use compute_api::spec::{
ComputeAudit, ComputeMode, ComputeSpec, DatabricksSettings, GenericOption,
ComputeAudit, ComputeMode, ComputeSpec, DatabricksSettings, GenericOption, PageserverProtocol,
};
use crate::compute::ComputeNodeParams;
@@ -69,6 +69,15 @@ pub fn write_postgres_conf(
writeln!(file, "# Neon storage settings")?;
writeln!(file)?;
if let Some(conninfo) = &spec.pageserver_connection_info {
match conninfo.prefer_protocol {
PageserverProtocol::Libpq => {
writeln!(file, "neon.use_communicator_worker=false")?;
}
PageserverProtocol::Grpc => {
writeln!(file, "neon.use_communicator_worker=true")?;
}
}
// Stripe size GUC should be defined prior to connection string
if let Some(stripe_size) = conninfo.stripe_size {
writeln!(
@@ -79,6 +88,7 @@ pub fn write_postgres_conf(
}
let mut libpq_urls: Option<Vec<String>> = Some(Vec::new());
let mut grpc_urls: Option<Vec<String>> = Some(Vec::new());
let num_shards = if conninfo.shard_count.0 == 0 {
1 // unsharded, treat it as a single shard
} else {
@@ -111,6 +121,14 @@ pub fn write_postgres_conf(
} else {
libpq_urls = None
}
// Similarly for gRPC URLs
if let Some(url) = &first_pageserver.grpc_url {
if let Some(ref mut urls) = grpc_urls {
urls.push(url.clone());
}
} else {
grpc_urls = None
}
}
if let Some(libpq_urls) = libpq_urls {
writeln!(
@@ -125,7 +143,22 @@ pub fn write_postgres_conf(
} else {
writeln!(file, "# no neon.pageserver_connstring")?;
}
if let Some(grpc_urls) = grpc_urls {
writeln!(
file,
"# derived from compute spec's pageserver_conninfo field"
)?;
writeln!(
file,
"neon.pageserver_grpc_urls={}",
escape_conf_value(&grpc_urls.join(","))
)?;
} else {
writeln!(file, "# no neon.pageserver_grpc_urls")?;
}
} else {
writeln!(file, "neon.use_communicator_worker=false")?;
// Stripe size GUC should be defined prior to connection string
if let Some(stripe_size) = spec.shard_stripe_size {
writeln!(file, "# from compute spec's shard_stripe_size field")?;

View File

@@ -28,7 +28,10 @@ pub fn launch_lsn_lease_bg_task_for_static(compute: &Arc<ComputeNode>) {
let compute = compute.clone();
let span = tracing::info_span!("lsn_lease_bg_task", %tenant_id, %timeline_id, %lsn);
let runtime = tokio::runtime::Handle::current();
thread::spawn(move || {
let _rt_guard = runtime.enter();
let _entered = span.entered();
if let Err(e) = lsn_lease_bg_task(compute, tenant_id, timeline_id, lsn) {
// TODO: might need stronger error feedback than logging an warning.

View File

@@ -120,6 +120,11 @@
"value": "host=pageserver port=6400",
"vartype": "string"
},
{
"name": "neon.pageserver_grpc_urls",
"value": "grpc://pageserver:6401/",
"vartype": "string"
},
{
"name": "max_replication_write_lag",
"value": "500MB",

View File

@@ -1,6 +1,7 @@
broker_endpoint='http://storage_broker:50051'
pg_distrib_dir='/usr/local/'
listen_pg_addr='0.0.0.0:6400'
listen_grpc_addr='0.0.0.0:6401'
listen_http_addr='0.0.0.0:9898'
remote_storage={ endpoint='http://minio:9000', bucket_name='neon', bucket_region='eu-north-1', prefix_in_bucket='/pageserver' }
control_plane_api='http://0.0.0.0:6666' # No storage controller in docker compose, specify a junk address

View File

@@ -6,15 +6,26 @@ license.workspace = true
[dependencies]
thiserror.workspace = true
nix.workspace=true
nix.workspace = true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
libc.workspace = true
lock_api.workspace = true
rustc-hash.workspace = true
[dev-dependencies]
criterion = { workspace = true, features = ["html_reports"] }
rand = "0.9"
rand_distr = "0.5.1"
xxhash-rust = { version = "0.8.15", features = ["xxh3"] }
ahash.workspace = true
twox-hash = { version = "2.1.1" }
seahash = "4.1.0"
hashbrown = { git = "https://github.com/quantumish/hashbrown.git", rev = "6610e6d" }
[target.'cfg(target_os = "macos")'.dependencies]
tempfile = "3.14.0"
[dev-dependencies]
rand.workspace = true
rand_distr = "0.5.1"
[[bench]]
name = "hmap_resize"
harness = false

View File

@@ -0,0 +1,330 @@
use criterion::{BatchSize, BenchmarkId, Criterion, criterion_group, criterion_main};
use neon_shmem::hash::HashMapAccess;
use neon_shmem::hash::HashMapInit;
use neon_shmem::hash::entry::Entry;
use rand::distr::{Distribution, StandardUniform};
use rand::prelude::*;
use std::default::Default;
use std::hash::BuildHasher;
// Taken from bindings to C code
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
#[repr(C)]
pub struct FileCacheKey {
pub _spc_id: u32,
pub _db_id: u32,
pub _rel_number: u32,
pub _fork_num: u32,
pub _block_num: u32,
}
impl Distribution<FileCacheKey> for StandardUniform {
// questionable, but doesn't need to be good randomness
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> FileCacheKey {
FileCacheKey {
_spc_id: rng.random(),
_db_id: rng.random(),
_rel_number: rng.random(),
_fork_num: rng.random(),
_block_num: rng.random(),
}
}
}
#[derive(Clone, Debug)]
#[repr(C)]
pub struct FileCacheEntry {
pub _offset: u32,
pub _access_count: u32,
pub _prev: *mut FileCacheEntry,
pub _next: *mut FileCacheEntry,
pub _state: [u32; 8],
}
impl FileCacheEntry {
fn dummy() -> Self {
Self {
_offset: 0,
_access_count: 0,
_prev: std::ptr::null_mut(),
_next: std::ptr::null_mut(),
_state: [0; 8],
}
}
}
// Utilities for applying operations.
#[derive(Clone, Debug)]
struct TestOp<K, V>(K, Option<V>);
fn apply_op<K: Clone + std::hash::Hash + Eq, V, S: std::hash::BuildHasher>(
op: TestOp<K, V>,
map: &mut HashMapAccess<K, V, S>,
) {
let entry = map.entry(op.0);
match op.1 {
Some(new) => match entry {
Entry::Occupied(mut e) => Some(e.insert(new)),
Entry::Vacant(e) => {
_ = e.insert(new).unwrap();
None
}
},
None => match entry {
Entry::Occupied(e) => Some(e.remove()),
Entry::Vacant(_) => None,
},
};
}
// Hash utilities
struct SeaRandomState {
k1: u64,
k2: u64,
k3: u64,
k4: u64,
}
impl std::hash::BuildHasher for SeaRandomState {
type Hasher = seahash::SeaHasher;
fn build_hasher(&self) -> Self::Hasher {
seahash::SeaHasher::with_seeds(self.k1, self.k2, self.k3, self.k4)
}
}
impl SeaRandomState {
fn new() -> Self {
let mut rng = rand::rng();
Self {
k1: rng.random(),
k2: rng.random(),
k3: rng.random(),
k4: rng.random(),
}
}
}
fn small_benchs(c: &mut Criterion) {
let mut group = c.benchmark_group("Small maps");
group.sample_size(10);
group.bench_function("small_rehash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_xxhash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(twox_hash::xxhash64::RandomState::default())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_ahash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(ahash::RandomState::default())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_seahash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(SeaRandomState::new())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.finish();
}
fn real_benchs(c: &mut Criterion) {
let mut group = c.benchmark_group("Realistic workloads");
group.sample_size(10);
group.bench_function("real_bulk_insert", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut rng = rand::rng();
b.iter_batched(
|| HashMapInit::new_resizeable(size, size * 2).attach_writer(),
|writer| {
for _ in 0..ideal_filled {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
let entry = writer.entry(key);
match entry {
Entry::Occupied(mut e) => {
std::hint::black_box(e.insert(val));
}
Entry::Vacant(e) => {
let _ = std::hint::black_box(e.insert(val).unwrap());
}
}
}
},
BatchSize::SmallInput,
)
});
group.bench_function("real_rehash", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut writer = HashMapInit::new_resizeable(size, size).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("real_rehash_hashbrown", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut writer = hashbrown::raw::RawTable::new();
let mut rng = rand::rng();
let hasher = rustc_hash::FxBuildHasher;
unsafe {
writer
.resize(
size,
|(k, _)| hasher.hash_one(k),
hashbrown::raw::Fallibility::Infallible,
)
.unwrap();
}
while writer.len() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
writer.insert(hasher.hash_one(&key), (key, val), |(k, _)| {
hasher.hash_one(k)
});
}
b.iter(|| unsafe {
writer.table.rehash_in_place(
&|table, index| {
hasher.hash_one(
&table
.bucket::<(FileCacheKey, FileCacheEntry)>(index)
.as_ref()
.0,
)
},
std::mem::size_of::<(FileCacheKey, FileCacheEntry)>(),
if std::mem::needs_drop::<(FileCacheKey, FileCacheEntry)>() {
Some(|ptr| std::ptr::drop_in_place(ptr as *mut (FileCacheKey, FileCacheEntry)))
} else {
None
},
)
});
});
for elems in [2, 4, 8, 16, 32, 64, 96, 112] {
group.bench_with_input(
BenchmarkId::new("real_rehash_varied", elems),
&elems,
|b, &size| {
let ideal_filled = size * 1_000_000;
let size = 125_000_000;
let mut writer = HashMapInit::new_resizeable(size, size).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
},
);
group.bench_with_input(
BenchmarkId::new("real_rehash_varied_hashbrown", elems),
&elems,
|b, &size| {
let ideal_filled = size * 1_000_000;
let size = 125_000_000;
let mut writer = hashbrown::raw::RawTable::new();
let mut rng = rand::rng();
let hasher = rustc_hash::FxBuildHasher;
unsafe {
writer
.resize(
size,
|(k, _)| hasher.hash_one(k),
hashbrown::raw::Fallibility::Infallible,
)
.unwrap();
}
while writer.len() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
writer.insert(hasher.hash_one(&key), (key, val), |(k, _)| {
hasher.hash_one(k)
});
}
b.iter(|| unsafe {
writer.table.rehash_in_place(
&|table, index| {
hasher.hash_one(
&table
.bucket::<(FileCacheKey, FileCacheEntry)>(index)
.as_ref()
.0,
)
},
std::mem::size_of::<(FileCacheKey, FileCacheEntry)>(),
if std::mem::needs_drop::<(FileCacheKey, FileCacheEntry)>() {
Some(|ptr| {
std::ptr::drop_in_place(ptr as *mut (FileCacheKey, FileCacheEntry))
})
} else {
None
},
)
});
},
);
}
group.finish();
}
criterion_group!(benches, small_benchs, real_benchs);
criterion_main!(benches);


@@ -16,6 +16,7 @@
//!
//! Concurrency is managed very simply: the entire map is guarded by one shared-memory RwLock.
use std::fmt::Debug;
use std::hash::{BuildHasher, Hash};
use std::mem::MaybeUninit;
@@ -56,6 +57,22 @@ pub struct HashMapInit<'a, K, V, S = rustc_hash::FxBuildHasher> {
num_buckets: u32,
}
impl<'a, K, V, S> Debug for HashMapInit<'a, K, V, S>
where
K: Debug,
V: Debug,
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("HashMapInit")
.field("shmem_handle", &self.shmem_handle)
.field("shared_ptr", &self.shared_ptr)
.field("shared_size", &self.shared_size)
// .field("hasher", &self.hasher)
.field("num_buckets", &self.num_buckets)
.finish()
}
}
/// This is a per-process handle to a hash table that (possibly) lives in shared memory.
/// If a child process is launched with fork(), the child process should
/// get its own HashMapAccess by calling HashMapInit::attach_writer/reader().
@@ -71,6 +88,20 @@ pub struct HashMapAccess<'a, K, V, S = rustc_hash::FxBuildHasher> {
unsafe impl<K: Sync, V: Sync, S> Sync for HashMapAccess<'_, K, V, S> {}
unsafe impl<K: Send, V: Send, S> Send for HashMapAccess<'_, K, V, S> {}
impl<'a, K, V, S> Debug for HashMapAccess<'a, K, V, S>
where
K: Debug,
V: Debug,
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("HashMapAccess")
.field("shmem_handle", &self.shmem_handle)
.field("shared_ptr", &self.shared_ptr)
// .field("hasher", &self.hasher)
.finish()
}
}
impl<'a, K: Clone + Hash + Eq, V, S> HashMapInit<'a, K, V, S> {
/// Change the 'hasher' used by the hash table.
///
@@ -298,7 +329,7 @@ where
/// Get a reference to the entry containing a key.
///
/// NB: THis takes a write lock as there's no way to distinguish whether the intention
/// NB: This takes a write lock as there's no way to distinguish whether the intention
/// is to use the entry for reading or for writing in advance.
pub fn entry(&self, key: K) -> Entry<'a, '_, K, V> {
let hash = self.get_hash_value(&key);


@@ -1,5 +1,6 @@
//! Simple hash table with chaining.
use std::fmt::Debug;
use std::hash::Hash;
use std::mem::MaybeUninit;
@@ -17,6 +18,19 @@ pub(crate) struct Bucket<K, V> {
pub(crate) inner: Option<(K, V)>,
}
impl<K, V> Debug for Bucket<K, V>
where
K: Debug,
V: Debug,
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("Bucket")
.field("next", &self.next)
.field("inner", &self.inner)
.finish()
}
}
/// Core hash table implementation.
pub(crate) struct CoreHashMap<'a, K, V> {
/// Dictionary used to map hashes to bucket indices.
@@ -31,6 +45,22 @@ pub(crate) struct CoreHashMap<'a, K, V> {
pub(crate) buckets_in_use: u32,
}
impl<'a, K, V> Debug for CoreHashMap<'a, K, V>
where
K: Debug,
V: Debug,
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("CoreHashMap")
.field("dictionary", &self.dictionary)
.field("buckets", &self.buckets)
.field("free_head", &self.free_head)
.field("alloc_limit", &self.alloc_limit)
.field("buckets_in_use", &self.buckets_in_use)
.finish()
}
}
/// Error for when there are no empty buckets left but one is needed.
#[derive(Debug, PartialEq)]
pub struct FullError;


@@ -61,6 +61,10 @@ impl<K, V> OccupiedEntry<'_, '_, K, V> {
///
/// This may result in multiple bucket accesses if the entry was obtained by index as the
/// previous chain entry needs to be discovered in this case.
///
/// # Panics
/// Panics if the `prev_pos` field is equal to [`PrevPos::Unknown`]. In practice, this means
/// the entry was obtained via calling something like [`super::HashMapAccess::entry_at_bucket`].
pub fn remove(mut self) -> V {
// If this bucket was queried by index, go ahead and follow its chain from the start.
let prev = if let PrevPos::Unknown(hash) = self.prev_pos {


@@ -21,6 +21,7 @@ use nix::unistd::ftruncate as nix_ftruncate;
/// the underlying file is resized. Do not access the area beyond the current size. Currently, that
/// will cause the file to be expanded, but we might use `mprotect()` etc. to enforce that in the
/// future.
#[derive(Debug)]
pub struct ShmemHandle {
/// memfd file descriptor
fd: OwnedFd,
@@ -35,6 +36,7 @@ pub struct ShmemHandle {
}
/// This is stored at the beginning in the shared memory area.
#[derive(Debug)]
struct SharedStruct {
max_size: usize,


@@ -310,6 +310,11 @@ impl AtomicLsn {
}
}
/// Consumes the atomic and returns the contained value.
pub const fn into_inner(self) -> Lsn {
Lsn(self.inner.into_inner())
}
/// Atomically retrieve the `Lsn` value from memory.
pub fn load(&self) -> Lsn {
Lsn(self.inner.load(Ordering::Acquire))


@@ -54,6 +54,7 @@ pageserver_api.workspace = true
pageserver_client.workspace = true # for ResponseErrorMessageExt TODO refactor that
pageserver_compaction.workspace = true
pageserver_page_api.workspace = true
peekable.workspace = true
pem.workspace = true
pin-project-lite.workspace = true
postgres_backend.workspace = true
@@ -66,6 +67,7 @@ postgres-types.workspace = true
posthog_client_lite.workspace = true
pprof.workspace = true
pq_proto.workspace = true
prost.workspace = true
rand.workspace = true
range-set-blaze = { version = "0.1.16", features = ["alloc"] }
regex.workspace = true


@@ -3,3 +3,4 @@ mod pool;
mod retry;
pub use client::{PageserverClient, ShardSpec};
pub use pageserver_api::shard::ShardStripeSize; // used in ShardSpec


@@ -33,6 +33,8 @@ pub enum ProtocolError {
Invalid(&'static str, String),
#[error("required field '{0}' is missing")]
Missing(&'static str),
#[error("invalid combination of not_modified_lsn '{0}' and request_lsn '{1}'")]
InvalidLsns(Lsn, Lsn),
}
impl ProtocolError {
@@ -85,9 +87,9 @@ impl TryFrom<proto::ReadLsn> for ReadLsn {
return Err(ProtocolError::invalid("request_lsn", pb.request_lsn));
}
if pb.not_modified_since_lsn > pb.request_lsn {
return Err(ProtocolError::invalid(
"not_modified_since_lsn",
pb.not_modified_since_lsn,
return Err(ProtocolError::InvalidLsns(
Lsn(pb.not_modified_since_lsn),
Lsn(pb.request_lsn),
));
}
Ok(Self {


@@ -25,6 +25,9 @@ tracing.workspace = true
tokio.workspace = true
tokio-stream.workspace = true
tokio-util.workspace = true
axum.workspace = true
http.workspace = true
metrics.workspace = true
tonic.workspace = true
url.workspace = true


@@ -34,6 +34,10 @@ use crate::util::{request_stats, tokio_thread_local_stats};
/// GetPage@LatestLSN, uniformly distributed across the compute-accessible keyspace.
#[derive(clap::Parser)]
pub(crate) struct Args {
#[clap(long, default_value = "false")]
grpc: bool,
#[clap(long, default_value = "false")]
grpc_stream: bool,
#[clap(long, default_value = "http://localhost:9898")]
mgmt_api_endpoint: String,
/// Pageserver connection string. Supports postgresql:// and grpc:// protocols.
@@ -78,6 +82,9 @@ pub(crate) struct Args {
#[clap(long)]
set_io_mode: Option<pageserver_api::models::virtual_file::IoMode>,
#[clap(long)]
only_relnode: Option<u32>,
/// Queue depth generated in each client.
#[clap(long, default_value = "1")]
queue_depth: NonZeroUsize,
@@ -92,10 +99,31 @@ pub(crate) struct Args {
#[clap(long, default_value = "1")]
batch_size: NonZeroUsize,
#[clap(long)]
only_relnode: Option<u32>,
targets: Option<Vec<TenantTimelineId>>,
#[clap(long, default_value = "100")]
pool_max_consumers: NonZeroUsize,
#[clap(long, default_value = "5")]
pool_error_threshold: NonZeroUsize,
#[clap(long, default_value = "5000")]
pool_connect_timeout: NonZeroUsize,
#[clap(long, default_value = "1000")]
pool_connect_backoff: NonZeroUsize,
#[clap(long, default_value = "60000")]
pool_max_idle_duration: NonZeroUsize,
#[clap(long, default_value = "0")]
max_delay_ms: usize,
#[clap(long, default_value = "0")]
percent_drops: usize,
#[clap(long, default_value = "0")]
percent_hangs: usize,
}
/// State shared by all clients
@@ -152,7 +180,6 @@ pub(crate) fn main(args: Args) -> anyhow::Result<()> {
main_impl(args, thread_local_stats)
})
}
async fn main_impl(
args: Args,
all_thread_local_stats: AllThreadLocalStats<request_stats::Stats>,
@@ -317,6 +344,7 @@ async fn main_impl(
let rps_period = args
.per_client_rate
.map(|rps_limit| Duration::from_secs_f64(1.0 / (rps_limit as f64)));
let make_worker: &dyn Fn(WorkerId) -> Pin<Box<dyn Send + Future<Output = ()>>> = &|worker_id| {
let ss = shared_state.clone();
let cancel = cancel.clone();


@@ -453,6 +453,7 @@ impl TimelineHandles {
handles: Default::default(),
}
}
async fn get(
&mut self,
tenant_id: TenantId,


@@ -5,10 +5,12 @@ MODULE_big = neon
OBJS = \
$(WIN32RES) \
communicator.o \
communicator_new.o \
communicator_process.o \
extension_server.o \
file_cache.o \
hll.o \
lfc_prewarm.o \
libpagestore.o \
logical_replication_monitor.o \
neon.o \
@@ -67,6 +69,7 @@ WALPROP_OBJS = \
# libcommunicator.a is built by cargo from the Rust sources under communicator/
# subdirectory. `cargo build` also generates communicator_bindings.h.
communicator_new.o: communicator/communicator_bindings.h
communicator_process.o: communicator/communicator_bindings.h
file_cache.o: communicator/communicator_bindings.h


@@ -17,12 +17,30 @@ rest_broker = []
[dependencies]
axum.workspace = true
bytes.workspace = true
clashmap.workspace = true
http.workspace = true
libc.workspace = true
nix.workspace = true
atomic_enum = "0.3.0"
measured.workspace = true
prometheus.workspace = true
prost.workspace = true
strum_macros.workspace = true
thiserror.workspace = true
tonic = { workspace = true, default-features = false, features=["codegen", "prost", "transport"] }
tokio = { workspace = true, features = ["macros", "net", "io-util", "rt", "rt-multi-thread"] }
tokio-pipe = { version = "0.2.12" }
tracing.workspace = true
tracing-subscriber.workspace = true
measured.workspace = true
uring-common = { workspace = true, features = ["bytes"] }
pageserver_client_grpc.workspace = true
pageserver_api.workspace = true
pageserver_page_api.workspace = true
neon-shmem.workspace = true
utils.workspace = true
workspace_hack = { version = "0.1", path = "../../../workspace_hack" }


@@ -3,9 +3,18 @@
This package provides the so-called "compute-pageserver communicator",
or just "communicator" in short. The communicator is a separate
background worker process that runs in the PostgreSQL server. It's
part of the neon extension. Currently, it only provides an HTTP
endpoint for metrics, but in the future it will evolve to handle all
communications with the pageservers.
part of the neon extension.
The communicator handles the communication with the pageservers, and
also provides an HTTP endpoint for metrics over a local Unix Domain
socket (aka. the "communicator control socket"). On the PostgreSQL
side, the glue code in pgxn/neon/ uses the communicator to implement
the PostgreSQL Storage Manager (SMGR) interface.
## Design criteria
- Low latency
- Saturate a 10 Gbit/s network interface without becoming a bottleneck
## Source code view
@@ -14,10 +23,122 @@ pgxn/neon/communicator_process.c
the glue that interacts with PostgreSQL code and the Rust
code in the communicator process.
pgxn/neon/communicator_new.c
Contains the backend code that interacts with the communicator
process.
pgxn/neon/communicator/src/worker_process/
Worker process main loop and glue code
pgxn/neon/communicator/src/backend_interface.rs
The entry point for calls from each backend.
pgxn/neon/communicator/src/init.rs
Initialization at server startup
At compilation time, pgxn/neon/communicator/ produces a static
library, libcommunicator.a. It is linked to the neon.so extension
library.
The real networking code, which is independent of PostgreSQL, is in
the pageserver/client_grpc crate.
## Process view
The communicator runs in a dedicated background worker process, the
"communicator process". The communicator uses a multi-threaded Tokio
runtime to execute the IO requests. So the communicator process has
multiple threads running. That's unusual for Postgres processes and
care must be taken to make that work.
### Backend <-> worker communication
Each backend has a number of I/O request slots in shared memory. The
slots are statically allocated for each backend, and must not be
accessed by other backends. The worker process reads requests from the
shared memory slots, and writes responses back to the slots.
Here's an example snapshot of the system, when two requests from two
different backends are in progress:
```
Backends Request slots Communicator process
--------- ------------- --------------------
Backend 1 1: Idle
2: Idle
3: Processing tokio task handling request 3
Backend 2 4: Completed
5: Processing tokio task handling request 5
6: Idle
... ...
```
To submit an IO request, the backend first picks one of its Idle
slots, writes the IO request in the slot, and updates it to
'Submitted' state. That transfers the ownership of the slot to the
worker process, until the worker process marks the request as
Completed. The worker process spawns a separate Tokio task for each
request.
To inform the worker process that a request slot has a pending IO
request, there's a pipe shared by the worker process and all backend
processes. The backend writes the index of the request slot to the
pipe after changing the slot's state to Submitted. This wakes up the
worker process.
(Note that the pipe is just used for wakeups, but the worker process
is free to pick up Submitted IO requests even without receiving the
wakeup. As of this writing, it doesn't do that, but it might be useful
in the future to reduce latency even further, for example.)
When the worker process has completed processing the request, it
writes the result back into the request slot. A GetPage request can also
contain a pointer to a buffer in the shared buffer cache. In that case,
the worker process writes the resulting page contents directly to the
buffer, and only a result code into the request slot. It then updates
the 'state' field to Completed, which passes the ownership back to
the originating backend. Finally, it signals the process Latch of the
originating backend, waking it up.
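As a minimal sketch of the backend side of this protocol (an illustration, not
code in the tree; it uses the slot API from the communicator crate's
`backend_comms` module, while the pipe write and the latch wait only appear as
comments because they live in the C glue):
```
fn submit_and_wait(
    slots: &[NeonIORequestSlot],
    slot_idx: usize,
    request: &NeonIORequest,
    my_procno: i32,
) -> NeonIOResult {
    // 1. Fill one of this backend's own Idle slots: Idle -> Filling -> Submitted.
    slots[slot_idx].submit_request(request, my_procno);

    // 2. Wake the communicator up by writing `slot_idx` to the shared
    //    submission pipe (cf. notify_about_request()).

    // 3. Wait for completion. The communicator sets this backend's process
    //    latch once the slot reaches Completed, so poll again after each wakeup.
    loop {
        if let Some(result) = slots[slot_idx].try_get_result() {
            return result; // the slot is Idle again and can be reused
        }
        // WaitLatch(...) here in the real C code.
    }
}
```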
### Differences between PostgreSQL v16, v17 and v18
PostgreSQL v18 introduced the new AIO mechanism. The PostgreSQL AIO
mechanism uses a very similar mechanism as described in the previous
section, for the communication between AIO worker processes and
backends. With our communicator, the AIO worker processes are not
used, but we use the same PgAioHandle request slots as in upstream.
For Neon-specific IO requests like GetDbSize, a neon request slot is
used. But for the actual IO requests, the request slot merely contains
a pointer to the PgAioHandle slot. The worker process updates the
status of that slot, calls the IO callbacks upon completion, etc., just like
the upstream AIO worker processes do.
## Sequence diagram
neon
PostgreSQL extension backend_interface.rs worker_process.rs processor tonic
| . . . .
| smgr_read() . . . .
+-------------> + . . .
. | . . .
. | rcommunicator_ . . .
. | get_page_at_lsn . . .
. +------------------> + . .
| . .
| write request to . . .
| slot . .
| . .
| . .
| submit_request() . .
+-----------------> + .
| | .
| | db_size_request . .
+---------------->.
. TODO
### Compute <-> pageserver protocol
The protocol between Compute and the pageserver is based on gRPC. See `protos/`.


@@ -0,0 +1,224 @@
//! This module implements a request/response "slot" for submitting
//! requests from backends to the communicator process.
//!
//! NB: The "backend" side of this code runs in Postgres backend processes,
//! which means that it is not safe to use the 'tracing' crate for logging, nor
//! to launch threads or use tokio tasks!
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicI32, Ordering};
use crate::neon_request::{NeonIORequest, NeonIOResult};
use atomic_enum::atomic_enum;
/// One request/response slot. Each backend has its own set of slots that it
/// uses.
///
/// This is the moral equivalent of PgAioHandle for Postgres AIO requests
/// Like PgAioHandle, try to keep this small.
///
/// There is an array of these in shared memory. Therefore, this must be Sized.
///
/// ## Lifecycle of a request
///
/// A slot is always owned by either the backend process or the communicator
/// process, depending on the 'state'. Only the owning process is allowed to
/// read or modify the slot, except for reading the 'state' itself to check who
/// owns it.
///
/// A slot begins in the Idle state, where it is owned by the backend process.
/// To submit a request, the backend process fills the slot with the request
/// data, and changes it to the Submitted state. After changing the state, the
/// slot is owned by the communicator process, and the backend is not allowed
/// to access it until the communicator process marks it as Completed.
///
/// When the communicator process sees that the slot is in Submitted state, it
/// starts to process the request. After processing the request, it stores the
/// result in the slot, and changes the state to Completed. It is now owned by
/// the backend process again, which may now read the result, and reuse the
/// slot for a new request.
///
/// For correctness of the above protocol, we really only need two states:
/// "owned by backend" and "owned by communicator process". But to help with
/// debugging and better assertions, there are a few more states. When the
/// backend starts to fill in the request details in the slot, it first sets the
/// state from Idle to Filling, and when it's done with that, from Filling to
/// Submitted. In the Filling state, the slot is still owned by the
/// backend. Similarly, when the communicator process starts to process a
/// request, it sets it to Processing state first, but the slot is still owned
/// by the communicator process.
///
/// This struct doesn't handle waking up the communicator process when a request
/// has been submitted or when a response is ready. The 'owner_procno' is used
/// for waking up the backend on completion, but that happens elsewhere.
pub struct NeonIORequestSlot {
/// similar to PgAioHandleState
state: AtomicNeonIORequestSlotState,
/// The owning process's ProcNumber. The worker process uses this to set the
/// process's latch on completion.
///
/// (This could be calculated from num_neon_request_slots_per_backend and
/// the index of this slot in the overall 'neon_request_slots' array. But we
/// prefer the communicator process to not know how the request slots are
/// divided between the backends.)
owner_procno: AtomicI32,
/// SAFETY: This is modified by submit_request(), after it has established
/// ownership of the slot by setting state from Idle to Filling
request: UnsafeCell<NeonIORequest>,
/// Valid when state is Completed
///
/// SAFETY: This is modified by RequestProcessingGuard::complete(). There
/// can be only one RequestProcessingGuard outstanding for a slot at a time,
/// because it is returned by start_processing_request() which checks the
/// state, so RequestProcessingGuard has exclusive access to the slot.
result: UnsafeCell<NeonIOResult>,
}
// The protocol described in the "Lifecycle of a request" section above ensures
// the safe access to the fields
unsafe impl Send for NeonIORequestSlot {}
unsafe impl Sync for NeonIORequestSlot {}
impl Default for NeonIORequestSlot {
fn default() -> NeonIORequestSlot {
NeonIORequestSlot {
owner_procno: AtomicI32::new(-1),
request: UnsafeCell::new(NeonIORequest::Empty),
result: UnsafeCell::new(NeonIOResult::Empty),
state: AtomicNeonIORequestSlotState::new(NeonIORequestSlotState::Idle),
}
}
}
#[atomic_enum]
#[derive(Eq, PartialEq)]
pub enum NeonIORequestSlotState {
Idle,
/// Backend is filling in the request
Filling,
/// Backend has submitted the request to the communicator, but the
/// communicator process has not yet started processing it.
Submitted,
/// Communicator is processing the request
Processing,
/// Communicator has completed the request, and the 'result' field is now
/// valid, but the backend has not read the result yet.
Completed,
}
impl NeonIORequestSlot {
/// Write a request to the slot, and mark it as Submitted.
///
/// Note: This does not wake up the worker process to actually process
/// the request. It's the caller's responsibility to do that.
pub fn submit_request(&self, request: &NeonIORequest, proc_number: i32) {
// Verify that the slot is in Idle state previously, and put it in
// Filling state.
//
// XXX: This step isn't strictly necessary. Assuming the caller didn't
// screw up and try to use a slot that's already in use, we could fill
// the slot and switch it directly from Idle to Submitted state.
if let Err(s) = self.state.compare_exchange(
NeonIORequestSlotState::Idle,
NeonIORequestSlotState::Filling,
Ordering::Relaxed,
Ordering::Relaxed,
) {
panic!("unexpected state in request slot: {s:?}");
}
// Fill in the request details
self.owner_procno.store(proc_number, Ordering::Relaxed);
unsafe { *self.request.get() = *request }
// This synchronizes-with the store/swap in [`start_processing_request`].
// Note that this also makes the previous non-atomic writes visible
// to other threads.
self.state
.store(NeonIORequestSlotState::Submitted, Ordering::Release);
}
pub fn get_state(&self) -> NeonIORequestSlotState {
self.state.load(Ordering::Relaxed)
}
pub fn try_get_result(&self) -> Option<NeonIOResult> {
// This synchronizes-with the store/swap in [`RequestProcessingGuard::completed`]
let state = self.state.load(Ordering::Acquire);
if state == NeonIORequestSlotState::Completed {
let result = unsafe { *self.result.get() };
self.state
.store(NeonIORequestSlotState::Idle, Ordering::Relaxed);
Some(result)
} else {
None
}
}
/// Read the IO request from the slot indicated in the wakeup
pub fn start_processing_request<'a>(&'a self) -> Option<RequestProcessingGuard<'a>> {
// XXX: using atomic load rather than compare_exchange would be
// sufficient here, as long as the communicator process has _some_ means
// of tracking which requests it's already processing. That could be a
// flag somewhere in communicator's private memory, for example.
//
// This synchronizes-with the store in [`submit_request`].
if let Err(s) = self.state.compare_exchange(
NeonIORequestSlotState::Submitted,
NeonIORequestSlotState::Processing,
Ordering::Acquire,
Ordering::Relaxed,
) {
// FIXME surprising state. This is unexpected at the moment, but if we
// started to process requests more aggressively, without waiting for the
// read from the pipe, then this could happen
panic!("unexpected state in request slot: {s:?}");
}
Some(RequestProcessingGuard(self))
}
}
/// [`NeonIORequestSlot::start_processing_request`] returns this guard object to
/// indicate that the caller now "owns" the slot, until it calls
/// [`RequestProcessingGuard::completed`].
///
/// TODO: implement Drop on this, to mark the request as Aborted or Errored
/// if [`RequestProcessingGuard::completed`] is not called.
pub struct RequestProcessingGuard<'a>(&'a NeonIORequestSlot);
unsafe impl<'a> Send for RequestProcessingGuard<'a> {}
unsafe impl<'a> Sync for RequestProcessingGuard<'a> {}
impl<'a> RequestProcessingGuard<'a> {
pub fn get_request(&self) -> &NeonIORequest {
unsafe { &*self.0.request.get() }
}
pub fn get_owner_procno(&self) -> i32 {
self.0.owner_procno.load(Ordering::Relaxed)
}
pub fn completed(self, result: NeonIOResult) {
// Store the result to the slot.
unsafe {
*self.0.result.get() = result;
};
// Mark the request as completed. After that, we no longer have
// ownership of the slot, and must not modify it.
let old_state = self
.0
.state
.swap(NeonIORequestSlotState::Completed, Ordering::Release);
assert!(old_state == NeonIORequestSlotState::Processing);
}
}
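For illustration (not part of this patch), the communicator side is expected to
drive a slot roughly like this, following the lifecycle described above; the
pipe read and the latch set happen elsewhere and appear only as comments:
```
fn handle_wakeup(slots: &[NeonIORequestSlot], slot_idx: usize) {
    // Submitted -> Processing; the returned guard owns the slot until completed().
    let guard = slots[slot_idx]
        .start_processing_request()
        .expect("slot must be in Submitted state");

    let request: NeonIORequest = *guard.get_request();
    let owner_procno = guard.get_owner_procno();

    // ... perform the IO described by `request`, typically by sending a gRPC
    // request to the pageserver, producing a NeonIOResult ...
    let result = NeonIOResult::Empty; // placeholder result for this sketch

    // Processing -> Completed; ownership of the slot returns to the backend.
    guard.completed(result);

    // Finally the worker sets the latch of `owner_procno` (in the C glue) so
    // the originating backend wakes up and calls try_get_result().
    let _ = owner_procno;
}
```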


@@ -0,0 +1,296 @@
//! This code runs in each backend process. That means that launching Rust threads, panicking
//! etc. is forbidden!
use std::os::fd::OwnedFd;
use crate::backend_comms::NeonIORequestSlot;
use crate::init::CommunicatorInitStruct;
use crate::integrated_cache::{BackendCacheReadOp, IntegratedCacheReadAccess};
use crate::neon_request::{CCachedGetPageVResult, CLsn, COid};
use crate::neon_request::{NeonIORequest, NeonIOResult};
use utils::lsn::Lsn;
pub struct CommunicatorBackendStruct<'t> {
my_proc_number: i32,
neon_request_slots: &'t [NeonIORequestSlot],
submission_pipe_write_fd: OwnedFd,
pending_cache_read_op: Option<BackendCacheReadOp<'t>>,
integrated_cache: &'t IntegratedCacheReadAccess<'t>,
}
#[unsafe(no_mangle)]
pub extern "C" fn rcommunicator_backend_init(
cis: Box<CommunicatorInitStruct>,
my_proc_number: i32,
) -> &'static mut CommunicatorBackendStruct<'static> {
if my_proc_number < 0 {
panic!("cannot attach to communicator shared memory with procnumber {my_proc_number}");
}
let integrated_cache = Box::leak(Box::new(cis.integrated_cache_init_struct.backend_init()));
let bs: &'static mut CommunicatorBackendStruct =
Box::leak(Box::new(CommunicatorBackendStruct {
my_proc_number,
neon_request_slots: cis.neon_request_slots,
submission_pipe_write_fd: cis.submission_pipe_write_fd,
pending_cache_read_op: None,
integrated_cache,
}));
bs
}
/// Start a request. You can poll for its completion and get the result by
/// calling bcomm_poll_request_completion(). The communicator will wake
/// us up by setting our process latch, so to wait for the completion, wait on
/// the latch and call bcomm_poll_request_completion() every time the
/// latch is set.
///
/// Safety: The C caller must ensure that the references are valid.
/// The requested slot must be free, or this panics.
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_start_io_request(
bs: &'_ mut CommunicatorBackendStruct,
slot_idx: i32,
request: &NeonIORequest,
immediate_result_ptr: &mut NeonIOResult,
) -> i32 {
assert!(bs.pending_cache_read_op.is_none());
// Check if the request can be satisfied from the cache first
if let NeonIORequest::RelSize(req) = request {
if let Some(nblocks) = bs.integrated_cache.get_rel_size(&req.reltag()) {
*immediate_result_ptr = NeonIOResult::RelSize(nblocks);
return -1;
}
}
// Create neon request and submit it
bs.start_neon_io_request(slot_idx, request);
slot_idx
}
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_start_get_page_v_request(
bs: &mut CommunicatorBackendStruct,
slot_idx: i32,
request: &NeonIORequest,
immediate_result_ptr: &mut CCachedGetPageVResult,
) -> i32 {
let NeonIORequest::GetPageV(get_pagev_request) = request else {
panic!("invalid request passed to bcomm_start_get_page_v_request()");
};
assert!(matches!(request, NeonIORequest::GetPageV(_)));
assert!(bs.pending_cache_read_op.is_none());
// Check if the request can be satisfied from the cache first
let mut all_cached = true;
let mut read_op = bs.integrated_cache.start_read_op();
for i in 0..get_pagev_request.nblocks {
if let Some(cache_block) = read_op.get_page(
&get_pagev_request.reltag(),
get_pagev_request.block_number + i as u32,
) {
immediate_result_ptr.cache_block_numbers[i as usize] = cache_block;
} else {
// not found in cache
all_cached = false;
break;
}
}
if all_cached {
bs.pending_cache_read_op = Some(read_op);
return -1;
}
// Create neon request and submit it
bs.start_neon_io_request(slot_idx, request);
slot_idx
}
/// Check if a request has completed. Returns:
///
/// -1 if the request is still being processed
/// 0 on success
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_poll_request_completion(
bs: &mut CommunicatorBackendStruct,
request_slot_idx: u32,
result_p: &mut NeonIOResult,
) -> i32 {
match bs.neon_request_slots[request_slot_idx as usize].try_get_result() {
None => -1, // still processing
Some(result) => {
*result_p = result;
0
}
}
}
/// Check if a request has completed. Returns:
///
/// 'false' if the slot is Idle. The backend process has ownership.
/// 'true' if the slot is busy, and should be polled for result.
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_get_request_slot_status(
bs: &mut CommunicatorBackendStruct,
request_slot_idx: u32,
) -> bool {
use crate::backend_comms::NeonIORequestSlotState;
match bs.neon_request_slots[request_slot_idx as usize].get_state() {
NeonIORequestSlotState::Idle => false,
NeonIORequestSlotState::Filling => {
// 'false' would be the right result here. However, this
// is a very transient state. The C code should never
// leave a slot in this state, so if it sees that,
// something's gone wrong and it's not clear what to do
// with it.
panic!("unexpected Filling state in request slot {request_slot_idx}");
}
NeonIORequestSlotState::Submitted => true,
NeonIORequestSlotState::Processing => true,
NeonIORequestSlotState::Completed => true,
}
}
// LFC functions
/// Finish a local file cache read
///
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_finish_cache_read(bs: &mut CommunicatorBackendStruct) -> bool {
if let Some(op) = bs.pending_cache_read_op.take() {
op.finish()
} else {
panic!("bcomm_finish_cache_read() called with no cached read pending");
}
}
/// Check if LFC contains the given buffer, and update its last-written LSN if not.
///
/// This is used in WAL replay in read replica, to skip updating pages that are
/// not in cache.
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_update_lw_lsn_for_block_if_not_cached(
bs: &mut CommunicatorBackendStruct,
spc_oid: COid,
db_oid: COid,
rel_number: u32,
fork_number: u8,
block_number: u32,
lsn: CLsn,
) -> bool {
bs.integrated_cache.update_lw_lsn_for_block_if_not_cached(
&pageserver_page_api::RelTag {
spcnode: spc_oid,
dbnode: db_oid,
relnode: rel_number,
forknum: fork_number,
},
block_number,
Lsn(lsn),
)
}
#[repr(C)]
#[derive(Clone, Debug)]
pub struct FileCacheIterator {
next_bucket: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub block_number: u32,
}
/// Iterate over LFC contents
#[allow(clippy::missing_safety_doc)]
#[unsafe(no_mangle)]
pub unsafe extern "C" fn bcomm_cache_iterate_begin(
_bs: &mut CommunicatorBackendStruct,
iter: *mut FileCacheIterator,
) {
unsafe { (*iter).next_bucket = 0 };
}
#[allow(clippy::missing_safety_doc)]
#[unsafe(no_mangle)]
pub unsafe extern "C" fn bcomm_cache_iterate_next(
bs: &mut CommunicatorBackendStruct,
iter: *mut FileCacheIterator,
) -> bool {
use crate::integrated_cache::GetBucketResult;
loop {
let next_bucket = unsafe { (*iter).next_bucket } as usize;
match bs.integrated_cache.get_bucket(next_bucket) {
GetBucketResult::Occupied(rel, blk) => {
unsafe {
(*iter).spc_oid = rel.spcnode;
(*iter).db_oid = rel.dbnode;
(*iter).rel_number = rel.relnode;
(*iter).fork_number = rel.forknum;
(*iter).block_number = blk;
(*iter).next_bucket += 1;
}
break true;
}
GetBucketResult::Vacant => {
unsafe {
(*iter).next_bucket += 1;
}
continue;
}
GetBucketResult::OutOfBounds => {
break false;
}
}
}
}
#[allow(clippy::missing_safety_doc)]
#[unsafe(no_mangle)]
pub unsafe extern "C" fn bcomm_cache_get_num_pages_used(bs: &mut CommunicatorBackendStruct) -> u64 {
bs.integrated_cache.get_num_buckets_in_use() as u64
}
impl<'t> CommunicatorBackendStruct<'t> {
/// The slot must be free, or this panics.
pub(crate) fn start_neon_io_request(&mut self, request_slot_idx: i32, request: &NeonIORequest) {
let my_proc_number = self.my_proc_number;
self.neon_request_slots[request_slot_idx as usize].submit_request(request, my_proc_number);
// Tell the communicator about it
self.notify_about_request(request_slot_idx);
}
/// Send a wakeup to the communicator process
fn notify_about_request(self: &CommunicatorBackendStruct<'t>, request_slot_idx: i32) {
// wake up communicator by writing the idx to the submission pipe
//
// This can block, if the pipe is full. That should be very rare,
// because the communicator tries hard to drain the pipe to prevent
// that. Also, there's a natural upper bound on how many wakeups can be
// queued up: there is only a limited number of request slots for each
// backend.
//
// If it does block very briefly, that's not too serious.
let idxbuf = request_slot_idx.to_ne_bytes();
let _res = nix::unistd::write(&self.submission_pipe_write_fd, &idxbuf);
// FIXME: check result, return any errors
}
}


@@ -0,0 +1,156 @@
//! Implement the "low-level" parts of the file cache.
//!
//! This module just deals with reading and writing the file, and keeping track of
//! which blocks in the cache file are in use and which are free. The "high
//! level" parts of tracking which block in the cache file corresponds to which
//! relation block is handled in 'integrated_cache' instead.
//!
//! This module is only used to access the file from the communicator
//! process. The backend processes *also* read the file (and sometimes also
//! write it?), but the backends use direct C library calls for that.
use std::fs::File;
use std::os::unix::fs::FileExt;
use std::path::Path;
use std::sync::Arc;
use std::sync::Mutex;
use measured::metric;
use measured::metric::MetricEncoding;
use measured::metric::gauge::GaugeState;
use measured::{Gauge, MetricGroup};
use crate::BLCKSZ;
use tokio::task::spawn_blocking;
pub type CacheBlock = u64;
pub const INVALID_CACHE_BLOCK: CacheBlock = u64::MAX;
pub struct FileCache {
file: Arc<File>,
free_list: Mutex<FreeList>,
metrics: FileCacheMetricGroup,
}
#[derive(MetricGroup)]
#[metric(new())]
struct FileCacheMetricGroup {
/// Local File Cache size in 8KiB blocks
max_blocks: Gauge,
/// Number of free 8KiB blocks in Local File Cache
num_free_blocks: Gauge,
}
// TODO: We keep track of all free blocks in this vec. That doesn't really scale.
// Idea: when free_blocks fills up with more than 1024 entries, write them all to
// one block on disk.
#[derive(Debug)]
struct FreeList {
next_free_block: CacheBlock,
max_blocks: u64,
free_blocks: Vec<CacheBlock>,
}
impl FileCache {
pub fn new(file_cache_path: &Path, mut initial_size: u64) -> Result<FileCache, std::io::Error> {
if initial_size < 100 {
tracing::warn!(
"min size for file cache is 100 blocks, {} requested",
initial_size
);
initial_size = 100;
}
let file = std::fs::OpenOptions::new()
.read(true)
.write(true)
.truncate(true)
.create(true)
.open(file_cache_path)?;
tracing::info!("initialized file cache with {} blocks", initial_size);
Ok(FileCache {
file: Arc::new(file),
free_list: Mutex::new(FreeList {
next_free_block: 0,
max_blocks: initial_size,
free_blocks: Vec::new(),
}),
metrics: FileCacheMetricGroup::new(),
})
}
// File cache management
pub async fn read_block(
&self,
cache_block: CacheBlock,
mut dst: impl uring_common::buf::IoBufMut + Send + Sync,
) -> Result<(), std::io::Error> {
assert!(dst.bytes_total() == BLCKSZ);
let file = self.file.clone();
let dst_ref = unsafe { std::slice::from_raw_parts_mut(dst.stable_mut_ptr(), BLCKSZ) };
spawn_blocking(move || file.read_exact_at(dst_ref, cache_block * BLCKSZ as u64)).await??;
Ok(())
}
pub async fn write_block(
&self,
cache_block: CacheBlock,
src: impl uring_common::buf::IoBuf + Send + Sync,
) -> Result<(), std::io::Error> {
assert!(src.bytes_init() == BLCKSZ);
let file = self.file.clone();
let src_ref = unsafe { std::slice::from_raw_parts(src.stable_ptr(), BLCKSZ) };
spawn_blocking(move || file.write_all_at(src_ref, cache_block * BLCKSZ as u64)).await??;
Ok(())
}
pub fn alloc_block(&self) -> Option<CacheBlock> {
let mut free_list = self.free_list.lock().unwrap();
if let Some(x) = free_list.free_blocks.pop() {
return Some(x);
}
if free_list.next_free_block < free_list.max_blocks {
let result = free_list.next_free_block;
free_list.next_free_block += 1;
return Some(result);
}
None
}
pub fn dealloc_block(&self, cache_block: CacheBlock) {
let mut free_list = self.free_list.lock().unwrap();
free_list.free_blocks.push(cache_block);
}
}
impl<T: metric::group::Encoding> MetricGroup<T> for FileCache
where
GaugeState: MetricEncoding<T>,
{
fn collect_group_into(&self, enc: &mut T) -> Result<(), <T as metric::group::Encoding>::Err> {
// Update the gauges with fresh values first
{
let free_list = self.free_list.lock().unwrap();
self.metrics.max_blocks.set(free_list.max_blocks as i64);
let total_free_blocks: i64 = free_list.free_blocks.len() as i64
+ (free_list.max_blocks as i64 - free_list.next_free_block as i64);
self.metrics.num_free_blocks.set(total_free_blocks);
}
self.metrics.collect_group_into(enc)
}
}
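A brief usage sketch of the free-list API above (illustrative only; the path
and size are made up):
```
use std::path::Path;

fn file_cache_example() -> Result<(), std::io::Error> {
    // 200 blocks of 8 KiB; sizes below 100 blocks are clamped up with a warning.
    let cache = FileCache::new(Path::new("/tmp/example.lfc"), 200)?;

    // Blocks come from the free list first, then by advancing `next_free_block`
    // until `max_blocks` is reached; None means the cache file is full.
    let blk = cache.alloc_block().expect("cache not full yet");

    // read_block()/write_block() would then do the 8 KiB I/O at offset
    // blk * BLCKSZ, via spawn_blocking on the shared File handle.

    // Returning the block puts it back on the free list for reuse.
    cache.dealloc_block(blk);
    Ok(())
}
```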


@@ -0,0 +1,107 @@
//! Global allocator, for tracking memory usage of the Rust parts
//!
//! Postgres is designed to handle allocation failure (i.e. malloc() returning NULL) gracefully. It
//! rolls back the transaction and gives the user an "ERROR: out of memory" error. Rust code
//! however panics if an allocation fails. We don't want that to ever happen, because an unhandled
//! panic leads to Postgres crash and restart. Our strategy is to pre-allocate a large enough chunk
//! of memory for use by the Rust code, so that the allocations never fail.
//!
//! To pick the size for the pre-allocated chunk, we have a metric to track the high watermark
//! memory usage of all the Rust allocations in total.
//!
//! TODO:
//!
//! - Currently we just export the metrics. Actual allocations are still just passed through to
//! the system allocator.
//! - Take padding etc. overhead into account
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use measured::metric;
use measured::metric::MetricEncoding;
use measured::metric::gauge::GaugeState;
use measured::{Gauge, MetricGroup};
pub(crate) struct MyAllocator {
allocations: AtomicU64,
deallocations: AtomicU64,
allocated: AtomicUsize,
high: AtomicUsize,
}
#[derive(MetricGroup)]
#[metric(new())]
struct MyAllocatorMetricGroup {
/// Number of allocations in Rust code
communicator_mem_allocations: Gauge,
/// Number of deallocations in Rust code
communicator_mem_deallocations: Gauge,
/// Bytes currently allocated
communicator_mem_allocated: Gauge,
/// High watermark of allocated bytes
communicator_mem_high: Gauge,
}
unsafe impl GlobalAlloc for MyAllocator {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
self.allocations.fetch_add(1, Ordering::Relaxed);
let mut allocated = self.allocated.fetch_add(layout.size(), Ordering::Relaxed);
allocated += layout.size();
self.high.fetch_max(allocated, Ordering::Relaxed);
unsafe { System.alloc(layout) }
}
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
self.deallocations.fetch_add(1, Ordering::Relaxed);
self.allocated.fetch_sub(layout.size(), Ordering::Relaxed);
unsafe { System.dealloc(ptr, layout) }
}
}
#[global_allocator]
static GLOBAL: MyAllocator = MyAllocator {
allocations: AtomicU64::new(0),
deallocations: AtomicU64::new(0),
allocated: AtomicUsize::new(0),
high: AtomicUsize::new(0),
};
pub(crate) struct MyAllocatorCollector {
metrics: MyAllocatorMetricGroup,
}
impl MyAllocatorCollector {
pub(crate) fn new() -> Self {
Self {
metrics: MyAllocatorMetricGroup::new(),
}
}
}
impl<T: metric::group::Encoding> MetricGroup<T> for MyAllocatorCollector
where
GaugeState: MetricEncoding<T>,
{
fn collect_group_into(&self, enc: &mut T) -> Result<(), <T as metric::group::Encoding>::Err> {
// Update the gauges with fresh values first
self.metrics
.communicator_mem_allocations
.set(GLOBAL.allocations.load(Ordering::Relaxed) as i64);
self.metrics
.communicator_mem_deallocations
.set(GLOBAL.deallocations.load(Ordering::Relaxed) as i64);
self.metrics
.communicator_mem_allocated
.set(GLOBAL.allocated.load(Ordering::Relaxed) as i64);
self.metrics
.communicator_mem_high
.set(GLOBAL.high.load(Ordering::Relaxed) as i64);
self.metrics.collect_group_into(enc)
}
}


@@ -0,0 +1,166 @@
//! Initialization functions. These are executed in the postmaster process,
//! at different stages of server startup.
//!
//!
//! Communicator initialization steps:
//!
//! 1. At postmaster startup, before shared memory is allocated,
//! rcommunicator_shmem_size() is called to get the amount of
//! shared memory that this module needs.
//!
//! 2. Later, after the shared memory has been allocated,
//! rcommunicator_shmem_init() is called to initialize the shmem
//! area.
//!
//! Per process initialization:
//!
//! When a backend process starts up, it calls rcommunicator_backend_init().
//! In the communicator worker process, other functions are called, see
//! `worker_process` module.
use std::ffi::c_int;
use std::mem;
use std::mem::MaybeUninit;
use std::os::fd::OwnedFd;
use crate::backend_comms::NeonIORequestSlot;
use crate::integrated_cache::IntegratedCacheInitStruct;
/// This struct is created in the postmaster process, and inherited to
/// the communicator process and all backend processes through fork()
#[repr(C)]
pub struct CommunicatorInitStruct {
pub submission_pipe_read_fd: OwnedFd,
pub submission_pipe_write_fd: OwnedFd,
// Shared memory data structures
pub num_neon_request_slots: u32,
pub neon_request_slots: &'static [NeonIORequestSlot],
pub integrated_cache_init_struct: IntegratedCacheInitStruct<'static>,
}
impl std::fmt::Debug for CommunicatorInitStruct {
fn fmt(&self, fmt: &mut std::fmt::Formatter<'_>) -> Result<(), std::fmt::Error> {
fmt.debug_struct("CommunicatorInitStruct")
.field("submission_pipe_read_fd", &self.submission_pipe_read_fd)
.field("submission_pipe_write_fd", &self.submission_pipe_write_fd)
.field("num_neon_request_slots", &self.num_neon_request_slots)
.field("neon_request_slots length", &self.neon_request_slots.len())
.finish()
}
}
#[unsafe(no_mangle)]
pub extern "C" fn rcommunicator_shmem_size(num_neon_request_slots: u32) -> u64 {
let mut size = 0;
size += mem::size_of::<NeonIORequestSlot>() * num_neon_request_slots as usize;
// For integrated_cache's Allocator. TODO: make this adjustable
size += IntegratedCacheInitStruct::shmem_size();
size as u64
}
/// Initialize the shared memory segment. Returns a backend-private
/// struct, which will be inherited by backend processes through fork
#[unsafe(no_mangle)]
pub extern "C" fn rcommunicator_shmem_init(
submission_pipe_read_fd: c_int,
submission_pipe_write_fd: c_int,
num_neon_request_slots: u32,
shmem_area_ptr: *mut MaybeUninit<u8>,
shmem_area_len: u64,
initial_file_cache_size: u64,
max_file_cache_size: u64,
) -> &'static mut CommunicatorInitStruct {
let shmem_area: &'static mut [MaybeUninit<u8>] =
unsafe { std::slice::from_raw_parts_mut(shmem_area_ptr, shmem_area_len as usize) };
let (neon_request_slots, remaining_area) =
alloc_array_from_slice::<NeonIORequestSlot>(shmem_area, num_neon_request_slots as usize);
for slot in neon_request_slots.iter_mut() {
slot.write(NeonIORequestSlot::default());
}
// 'neon_request_slots' is initialized now. (MaybeUninit::slice_assume_init_mut() is nightly-only
// as of this writing.)
let neon_request_slots = unsafe {
std::mem::transmute::<&mut [MaybeUninit<NeonIORequestSlot>], &mut [NeonIORequestSlot]>(
neon_request_slots,
)
};
// Give the rest of the area to the integrated cache
let integrated_cache_init_struct = IntegratedCacheInitStruct::shmem_init(
remaining_area,
initial_file_cache_size,
max_file_cache_size,
);
let (submission_pipe_read_fd, submission_pipe_write_fd) = unsafe {
use std::os::fd::FromRawFd;
(
OwnedFd::from_raw_fd(submission_pipe_read_fd),
OwnedFd::from_raw_fd(submission_pipe_write_fd),
)
};
let cis: &'static mut CommunicatorInitStruct = Box::leak(Box::new(CommunicatorInitStruct {
submission_pipe_read_fd,
submission_pipe_write_fd,
num_neon_request_slots,
neon_request_slots,
integrated_cache_init_struct,
}));
cis
}
pub fn alloc_from_slice<T>(
area: &mut [MaybeUninit<u8>],
) -> (&mut MaybeUninit<T>, &mut [MaybeUninit<u8>]) {
let layout = std::alloc::Layout::new::<T>();
let area_start = area.as_mut_ptr();
// pad to satisfy alignment requirements
let padding = area_start.align_offset(layout.align());
if padding + layout.size() > area.len() {
panic!("out of memory");
}
let area = &mut area[padding..];
let (result_area, remain) = area.split_at_mut(layout.size());
let result_ptr: *mut MaybeUninit<T> = result_area.as_mut_ptr().cast();
let result = unsafe { result_ptr.as_mut().unwrap() };
(result, remain)
}
pub fn alloc_array_from_slice<T>(
area: &mut [MaybeUninit<u8>],
len: usize,
) -> (&mut [MaybeUninit<T>], &mut [MaybeUninit<u8>]) {
let layout = std::alloc::Layout::new::<T>();
let area_start = area.as_mut_ptr();
// pad to satisfy alignment requirements
let padding = area_start.align_offset(layout.align());
if padding + layout.size() * len > area.len() {
panic!("out of memory");
}
let area = &mut area[padding..];
let (result_area, remain) = area.split_at_mut(layout.size() * len);
let result_ptr: *mut MaybeUninit<T> = result_area.as_mut_ptr().cast();
let result = unsafe { std::slice::from_raw_parts_mut(result_ptr.as_mut().unwrap(), len) };
(result, remain)
}
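To show what the two allocation helpers above do (illustrative only; the buffer
size and element types are arbitrary):
```
use std::mem::MaybeUninit;

fn alloc_helpers_example() {
    // A raw, untyped area, like the shared memory slice handed to
    // rcommunicator_shmem_init().
    let mut area = [MaybeUninit::<u8>::uninit(); 1024];

    // Carve out one u64, then an array of 16 u32s. Each call pads the start
    // of the area to satisfy alignment and returns the unused remainder.
    let (one, rest) = alloc_from_slice::<u64>(&mut area);
    let one = one.write(42);

    let (arr, _rest) = alloc_array_from_slice::<u32>(rest, 16);
    for slot in arr.iter_mut() {
        slot.write(0);
    }

    assert_eq!(*one, 42);
}
```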


@@ -0,0 +1,990 @@
//! Integrated communicator cache
//!
//! It tracks:
//! - Relation sizes and existence
//! - Last-written LSN
//! - Block cache (also known as LFC)
//!
//! TODO: limit the size
//! TODO: concurrency
//!
//! Note: This deals with "relations" which is really just one "relation fork" in Postgres
//! terms. RelFileLocator + ForkNumber is the key.
//
// TODO: Thoughts on eviction:
//
// There are two things we need to track, and evict if we run out of space:
// - blocks in the file cache's file. If the file grows too large, need to evict something.
// Also if the cache is resized
//
// - entries in the cache map. If we run out of memory in the shmem area, need to evict
// something
//
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicBool, AtomicU32, AtomicU64, AtomicUsize, Ordering};
use utils::lsn::{AtomicLsn, Lsn};
use crate::file_cache::INVALID_CACHE_BLOCK;
use crate::file_cache::{CacheBlock, FileCache};
use crate::init::alloc_from_slice;
use pageserver_page_api::RelTag;
use measured::metric;
use measured::metric::MetricEncoding;
use measured::metric::counter::CounterState;
use measured::metric::gauge::GaugeState;
use measured::{Counter, Gauge, MetricGroup};
use neon_shmem::hash::{HashMapInit, entry::Entry};
use neon_shmem::shmem::ShmemHandle;
// in # of entries
const RELSIZE_CACHE_SIZE: u32 = 64 * 1024;
/// This struct is initialized at postmaster startup, and passed to all the processes via fork().
pub struct IntegratedCacheInitStruct<'t> {
shared: &'t IntegratedCacheShared,
relsize_cache_handle: HashMapInit<'t, RelKey, RelEntry>,
block_map_handle: HashMapInit<'t, BlockKey, BlockEntry>,
}
/// This struct is allocated in the (fixed-size) shared memory area at postmaster startup.
/// It is accessible by all the backends and the communicator process.
#[derive(Debug)]
pub struct IntegratedCacheShared {
global_lw_lsn: AtomicU64,
}
/// Represents write-access to the integrated cache. This is used by the communicator process.
pub struct IntegratedCacheWriteAccess<'t> {
shared: &'t IntegratedCacheShared,
relsize_cache: neon_shmem::hash::HashMapAccess<'t, RelKey, RelEntry>,
block_map: neon_shmem::hash::HashMapAccess<'t, BlockKey, BlockEntry>,
pub(crate) file_cache: Option<FileCache>,
// Fields for eviction
clock_hand: AtomicUsize,
metrics: IntegratedCacheMetricGroup,
}
#[derive(MetricGroup)]
#[metric(new())]
struct IntegratedCacheMetricGroup {
/// Page evictions from the Local File Cache
cache_page_evictions_counter: Counter,
/// Block entry evictions from the integrated cache
block_entry_evictions_counter: Counter,
/// Number of times the clock hand has moved
clock_iterations_counter: Counter,
// metrics from the hash map
/// Allocated size of the block cache hash map
block_map_num_buckets: Gauge,
/// Number of buckets in use in the block cache hash map
block_map_num_buckets_in_use: Gauge,
/// Allocated size of the relsize cache hash map
relsize_cache_num_buckets: Gauge,
/// Number of buckets in use in the relsize cache hash map
relsize_cache_num_buckets_in_use: Gauge,
}
/// Represents read-only access to the integrated cache. Backend processes have this.
pub struct IntegratedCacheReadAccess<'t> {
shared: &'t IntegratedCacheShared,
relsize_cache: neon_shmem::hash::HashMapAccess<'t, RelKey, RelEntry>,
block_map: neon_shmem::hash::HashMapAccess<'t, BlockKey, BlockEntry>,
}
impl<'t> IntegratedCacheInitStruct<'t> {
/// Return the desired size in bytes of the fixed-size shared memory area to reserve for the
/// integrated cache.
pub fn shmem_size() -> usize {
// The relsize cache is fixed-size. The block map is allocated in a separate resizable
// area.
let mut sz = 0;
sz += std::mem::size_of::<IntegratedCacheShared>();
sz += HashMapInit::<RelKey, RelEntry>::estimate_size(RELSIZE_CACHE_SIZE);
sz
}
/// Initialize the shared memory segment. This runs once in postmaster. Returns a struct which
/// will be inherited by all processes through fork.
pub fn shmem_init(
shmem_area: &'t mut [MaybeUninit<u8>],
initial_file_cache_size: u64,
max_file_cache_size: u64,
) -> IntegratedCacheInitStruct<'t> {
// Initialize the shared struct
let (shared, remain_shmem_area) = alloc_from_slice::<IntegratedCacheShared>(shmem_area);
let shared = shared.write(IntegratedCacheShared {
global_lw_lsn: AtomicU64::new(0),
});
// Use the remaining part of the fixed-size area for the relsize cache
let relsize_cache_handle =
neon_shmem::hash::HashMapInit::with_fixed(RELSIZE_CACHE_SIZE, remain_shmem_area);
let max_bytes =
HashMapInit::<BlockKey, BlockEntry>::estimate_size(max_file_cache_size as u32);
// Initialize the block map in a separate resizable shared memory area
let shmem_handle = ShmemHandle::new("block mapping", 0, max_bytes).unwrap();
let block_map_handle =
neon_shmem::hash::HashMapInit::with_shmem(initial_file_cache_size as u32, shmem_handle);
IntegratedCacheInitStruct {
shared,
relsize_cache_handle,
block_map_handle,
}
}
/// Initialize access to the integrated cache for the communicator worker process
pub fn worker_process_init(
self,
lsn: Lsn,
file_cache: Option<FileCache>,
) -> IntegratedCacheWriteAccess<'t> {
let IntegratedCacheInitStruct {
shared,
relsize_cache_handle,
block_map_handle,
} = self;
shared.global_lw_lsn.store(lsn.0, Ordering::Relaxed);
IntegratedCacheWriteAccess {
shared,
relsize_cache: relsize_cache_handle.attach_writer(),
block_map: block_map_handle.attach_writer(),
file_cache,
clock_hand: AtomicUsize::new(0),
metrics: IntegratedCacheMetricGroup::new(),
}
}
/// Initialize access to the integrated cache for a backend process
pub fn backend_init(self) -> IntegratedCacheReadAccess<'t> {
let IntegratedCacheInitStruct {
shared,
relsize_cache_handle,
block_map_handle,
} = self;
IntegratedCacheReadAccess {
shared,
relsize_cache: relsize_cache_handle.attach_reader(),
block_map: block_map_handle.attach_reader(),
}
}
}
/// Value stored in the cache mapping hash table.
struct BlockEntry {
lw_lsn: AtomicLsn,
cache_block: AtomicU64,
pinned: AtomicU64,
// 'referenced' bit for the clock algorithm
referenced: AtomicBool,
}
/// Value stored in the relsize cache hash table.
struct RelEntry {
/// cached size of the relation
/// u32::MAX means 'not known' (that's InvalidBlockNumber in Postgres)
nblocks: AtomicU32,
/// The LSN at which the "metadata" of this relation last changed, i.e. its size,
/// as opposed to the contents of its blocks.
lw_lsn: AtomicLsn,
}
impl std::fmt::Debug for RelEntry {
fn fmt(&self, fmt: &mut std::fmt::Formatter<'_>) -> Result<(), std::fmt::Error> {
fmt.debug_struct("Rel")
.field("nblocks", &self.nblocks.load(Ordering::Relaxed))
.finish()
}
}
impl std::fmt::Debug for BlockEntry {
fn fmt(&self, fmt: &mut std::fmt::Formatter<'_>) -> Result<(), std::fmt::Error> {
fmt.debug_struct("Block")
.field("lw_lsn", &self.lw_lsn.load())
.field("cache_block", &self.cache_block.load(Ordering::Relaxed))
.field("pinned", &self.pinned.load(Ordering::Relaxed))
.field("referenced", &self.referenced.load(Ordering::Relaxed))
.finish()
}
}
#[derive(Clone, Debug, PartialEq, PartialOrd, Eq, Hash, Ord)]
struct RelKey(RelTag);
impl From<&RelTag> for RelKey {
fn from(val: &RelTag) -> RelKey {
RelKey(*val)
}
}
#[derive(Clone, Debug, PartialEq, PartialOrd, Eq, Hash, Ord)]
struct BlockKey {
rel: RelTag,
block_number: u32,
}
impl From<(&RelTag, u32)> for BlockKey {
fn from(val: (&RelTag, u32)) -> BlockKey {
BlockKey {
rel: *val.0,
block_number: val.1,
}
}
}
/// Return type used in the cache's get_*() functions. 'Found' means that the page, or other
/// queried information, exists in the cache.
pub enum CacheResult<V> {
/// The queried page or other information existed in the cache.
Found(V),
/// The cache doesn't contain the page (or other queried information, like relation size). The
/// Lsn is the 'not_modified_since' LSN that should be used in the request to the pageserver to
/// read the page.
NotFound(Lsn),
}
/// Return type of [try_evict_entry]
enum EvictResult {
/// Could not evict page because it was pinned
Pinned,
/// The victim bucket was already vacant
Vacant,
/// Evicted an entry. If it had a cache block associated with it, it's returned
/// here, otherwise None
Evicted(Option<CacheBlock>),
}
impl<'t> IntegratedCacheWriteAccess<'t> {
pub fn get_rel_size(&'t self, rel: &RelTag) -> CacheResult<u32> {
if let Some(nblocks) = get_rel_size(&self.relsize_cache, rel) {
CacheResult::Found(nblocks)
} else {
let lsn = Lsn(self.shared.global_lw_lsn.load(Ordering::Relaxed));
CacheResult::NotFound(lsn)
}
}
pub async fn get_page(
&'t self,
rel: &RelTag,
block_number: u32,
dst: impl uring_common::buf::IoBufMut + Send + Sync,
) -> Result<CacheResult<()>, std::io::Error> {
let x = if let Some(block_entry) = self.block_map.get(&BlockKey::from((rel, block_number)))
{
block_entry.referenced.store(true, Ordering::Relaxed);
let cache_block = block_entry.cache_block.load(Ordering::Relaxed);
if cache_block != INVALID_CACHE_BLOCK {
// pin it and release lock
block_entry.pinned.fetch_add(1, Ordering::Relaxed);
(cache_block, DeferredUnpin(block_entry.pinned.as_ptr()))
} else {
return Ok(CacheResult::NotFound(block_entry.lw_lsn.load()));
}
} else {
let lsn = Lsn(self.shared.global_lw_lsn.load(Ordering::Relaxed));
return Ok(CacheResult::NotFound(lsn));
};
let (cache_block, _deferred_pin) = x;
self.file_cache
.as_ref()
.unwrap()
.read_block(cache_block, dst)
.await?;
// unpin the entry (by implicitly dropping deferred_pin)
Ok(CacheResult::Found(()))
}
pub async fn page_is_cached(
&'t self,
rel: &RelTag,
block_number: u32,
) -> Result<CacheResult<()>, std::io::Error> {
if let Some(block_entry) = self.block_map.get(&BlockKey::from((rel, block_number))) {
// This is used for prefetch requests. Treat the probe as an 'access', to keep it
// in cache.
block_entry.referenced.store(true, Ordering::Relaxed);
let cache_block = block_entry.cache_block.load(Ordering::Relaxed);
if cache_block != INVALID_CACHE_BLOCK {
Ok(CacheResult::Found(()))
} else {
Ok(CacheResult::NotFound(block_entry.lw_lsn.load()))
}
} else {
let lsn = Lsn(self.shared.global_lw_lsn.load(Ordering::Relaxed));
Ok(CacheResult::NotFound(lsn))
}
}
/// Does the relation exist? CacheResult::NotFound means that the cache doesn't contain that
/// information, i.e. we don't know whether the relation exists or not.
pub fn get_rel_exists(&'t self, rel: &RelTag) -> CacheResult<bool> {
// we don't currently cache negative entries, so if the relation is in the cache, it exists
if let Some(_rel_entry) = self.relsize_cache.get(&RelKey::from(rel)) {
CacheResult::Found(true)
} else {
let lsn = Lsn(self.shared.global_lw_lsn.load(Ordering::Relaxed));
CacheResult::NotFound(lsn)
}
}
pub fn get_db_size(&'t self, _db_oid: u32) -> CacheResult<u64> {
// TODO: it would be nice to cache database sizes too. Getting the database size
// is not a very common operation, but when you do it, it's often interactive, with
// e.g. psql \l+ command, so the user will feel the latency.
// fixme: is this right lsn?
let lsn = Lsn(self.shared.global_lw_lsn.load(Ordering::Relaxed));
CacheResult::NotFound(lsn)
}
pub fn remember_rel_size(&'t self, rel: &RelTag, nblocks: u32, lsn: Lsn) {
match self.relsize_cache.entry(RelKey::from(rel)) {
Entry::Vacant(e) => {
tracing::trace!("inserting rel entry for {rel:?}, {nblocks} blocks");
// FIXME: what to do if we run out of memory? Evict other relation entries?
_ = e
.insert(RelEntry {
nblocks: AtomicU32::new(nblocks),
lw_lsn: AtomicLsn::new(lsn.0),
})
.expect("out of memory");
}
Entry::Occupied(e) => {
tracing::trace!("updating rel entry for {rel:?}, {nblocks} blocks");
e.get().nblocks.store(nblocks, Ordering::Relaxed);
e.get().lw_lsn.store(lsn);
}
};
}
/// Remember the given page contents in the cache.
pub async fn remember_page(
&'t self,
rel: &RelTag,
block_number: u32,
src: impl uring_common::buf::IoBuf + Send + Sync,
lw_lsn: Lsn,
is_write: bool,
) {
let key = BlockKey::from((rel, block_number));
// FIXME: make this work when file cache is disabled. Or make it mandatory
let file_cache = self.file_cache.as_ref().unwrap();
if is_write {
// There should be no concurrent IOs. If a backend tries to read the page
// at the same time, it may see a torn page. That's the same behavior as with
// regular POSIX filesystem read() and write().
// First check if we have a block in cache already
let mut old_cache_block = None;
let mut found_existing = false;
// NOTE(quantumish): honoring original semantics here (used to be update_with_fn)
// but I don't see any reason why this has to take a write lock.
if let Entry::Occupied(e) = self.block_map.entry(key.clone()) {
let block_entry = e.get();
found_existing = true;
// Prevent this entry from being evicted
let pin_count = block_entry.pinned.fetch_add(1, Ordering::Relaxed);
if pin_count > 0 {
// this is unexpected, because the caller has obtained the io-in-progress lock,
// so no one else should try to modify the page at the same time.
// XXX: and I think a read should not be happening either, because the postgres
// buffer is held locked. TODO: check these conditions and tidy this up a little. Seems fragile to just panic.
panic!("block entry was unexpectedly pinned");
}
let cache_block = block_entry.cache_block.load(Ordering::Relaxed);
old_cache_block = if cache_block != INVALID_CACHE_BLOCK {
Some(cache_block)
} else {
None
};
}
// Allocate a new block if required
let cache_block = old_cache_block.unwrap_or_else(|| {
loop {
if let Some(x) = file_cache.alloc_block() {
break x;
}
if let Some(x) = self.try_evict_cache_block() {
break x;
}
}
});
// Write the page to the cache file
file_cache
.write_block(cache_block, src)
.await
.expect("error writing to cache");
// FIXME: handle errors gracefully.
// FIXME: unpin the block entry on error
// Update the block entry
loop {
let entry = self.block_map.entry(key.clone());
assert_eq!(found_existing, matches!(entry, Entry::Occupied(_)));
match entry {
Entry::Occupied(e) => {
let block_entry = e.get();
// Update the cache block
let old_blk = block_entry.cache_block.compare_exchange(
INVALID_CACHE_BLOCK,
cache_block,
Ordering::Relaxed,
Ordering::Relaxed,
);
assert!(old_blk == Ok(INVALID_CACHE_BLOCK) || old_blk == Err(cache_block));
block_entry.lw_lsn.store(lw_lsn);
block_entry.referenced.store(true, Ordering::Relaxed);
let pin_count = block_entry.pinned.fetch_sub(1, Ordering::Relaxed);
assert!(pin_count > 0);
break;
}
Entry::Vacant(e) => {
if e.insert(BlockEntry {
lw_lsn: AtomicLsn::new(lw_lsn.0),
cache_block: AtomicU64::new(cache_block),
pinned: AtomicU64::new(0),
referenced: AtomicBool::new(true),
})
.is_ok()
{
break;
} else {
// The hash map was full. Evict an entry and retry.
}
}
}
self.try_evict_block_entry();
}
} else {
// !is_write
//
// We can assume that it doesn't already exist, because the
// caller is assumed to have already checked it, and holds
// the io-in-progress lock. (The BlockEntry might exist, but no cache block)
// Allocate a new block first
let cache_block = {
loop {
if let Some(x) = file_cache.alloc_block() {
break x;
}
if let Some(x) = self.try_evict_cache_block() {
break x;
}
}
};
// Write the page to the cache file
file_cache
.write_block(cache_block, src)
.await
.expect("error writing to cache");
// FIXME: handle errors gracefully.
loop {
match self.block_map.entry(key.clone()) {
Entry::Occupied(e) => {
let block_entry = e.get();
// FIXME: could there be concurrent readers?
assert!(block_entry.pinned.load(Ordering::Relaxed) == 0);
let old_cache_block =
block_entry.cache_block.swap(cache_block, Ordering::Relaxed);
if old_cache_block != INVALID_CACHE_BLOCK {
panic!(
"remember_page called in !is_write mode, but page is already cached at blk {old_cache_block}"
);
}
break;
}
Entry::Vacant(e) => {
if e.insert(BlockEntry {
lw_lsn: AtomicLsn::new(lw_lsn.0),
cache_block: AtomicU64::new(cache_block),
pinned: AtomicU64::new(0),
referenced: AtomicBool::new(true),
})
.is_ok()
{
break;
} else {
// The hash map was full. Evict an entry and retry.
}
}
};
self.try_evict_block_entry();
}
}
}
/// Forget information about given relation in the cache. (For DROP TABLE and such)
pub fn forget_rel(&'t self, rel: &RelTag, _nblocks: Option<u32>, flush_lsn: Lsn) {
tracing::trace!("forgetting rel entry for {rel:?}");
self.relsize_cache.remove(&RelKey::from(rel));
// update with flush LSN
let _ = self
.shared
.global_lw_lsn
.fetch_max(flush_lsn.0, Ordering::Relaxed);
// also forget all cached blocks for the relation
// FIXME
/*
let mut iter = MapIterator::new(&key_range_for_rel_blocks(rel));
let r = self.cache_tree.start_read();
while let Some((k, _v)) = iter.next(&r) {
let w = self.cache_tree.start_write();
let mut evicted_cache_block = None;
let res = w.update_with_fn(&k, |e| {
if let Some(e) = e {
let block_entry = if let MapEntry::Block(e) = e {
e
} else {
panic!("unexpected map entry type for block key");
};
let cache_block = block_entry
.cache_block
.swap(INVALID_CACHE_BLOCK, Ordering::Relaxed);
if cache_block != INVALID_CACHE_BLOCK {
evicted_cache_block = Some(cache_block);
}
UpdateAction::Remove
} else {
UpdateAction::Nothing
}
});
// FIXME: It's pretty surprising to run out of memory while removing. But
// maybe it can happen because of trying to shrink a node?
res.expect("out of memory");
if let Some(evicted_cache_block) = evicted_cache_block {
self.file_cache
.as_ref()
.unwrap()
.dealloc_block(evicted_cache_block);
}
}
*/
}
// Maintenance routines
/// Evict one block entry from the cache.
///
/// This is called when the hash map is full, to make an entry available for a new
/// insertion. There's no guarantee that the entry is still free by the time this function
/// returns; it can be taken by a concurrent thread at any time. So you need to
/// call this and retry repeatedly until you succeed.
fn try_evict_block_entry(&self) {
let num_buckets = self.block_map.get_num_buckets();
loop {
self.metrics.clock_iterations_counter.inc();
let victim_bucket = self.clock_hand.fetch_add(1, Ordering::Relaxed) % num_buckets;
let evict_this = match self.block_map.get_at_bucket(victim_bucket).as_deref() {
None => {
// The caller wants to have a free bucket. If there's one already, we're good.
return;
}
Some((_, blk_entry)) => {
// Clear the 'referenced' flag. If it was already clear,
// release the lock (by exiting this scope), and try to
// evict it.
!blk_entry.referenced.swap(false, Ordering::Relaxed)
}
};
if evict_this {
match self.try_evict_entry(victim_bucket) {
EvictResult::Pinned => {
// keep looping
}
EvictResult::Vacant => {
// This was released by someone else. Return so that
// the caller will try to use it. (Chances are that it
// will be reused by someone else, but let's try.)
return;
}
EvictResult::Evicted(None) => {
// This is now free.
return;
}
EvictResult::Evicted(Some(cache_block)) => {
// This is now free. We must not leak the cache block, so put it on the freelist.
self.file_cache.as_ref().unwrap().dealloc_block(cache_block);
return;
}
}
}
// TODO: add some kind of a backstop to error out if we loop
// too many times without finding any unpinned entries
}
}
/// Evict one block from the file cache. This is called when the file cache fills up,
/// to release a cache block.
///
/// Returns the evicted block. It is not put on the free list, so it's available for
/// the caller to use immediately.
fn try_evict_cache_block(&self) -> Option<CacheBlock> {
let num_buckets = self.block_map.get_num_buckets();
let mut iterations = 0;
while iterations < 100 {
self.metrics.clock_iterations_counter.inc();
let victim_bucket = self.clock_hand.fetch_add(1, Ordering::Relaxed) % num_buckets;
let evict_this = match self.block_map.get_at_bucket(victim_bucket).as_deref() {
None => {
// This bucket was unused. It's no use for finding a free cache block
continue;
}
Some((_, blk_entry)) => {
// Clear the 'referenced' flag. If it was already clear,
// release the lock (by exiting this scope), and try to
// evict it.
!blk_entry.referenced.swap(false, Ordering::Relaxed)
}
};
if evict_this {
match self.try_evict_entry(victim_bucket) {
EvictResult::Pinned => {
// keep looping
}
EvictResult::Vacant => {
// This was released by someone else. Keep looping.
}
EvictResult::Evicted(None) => {
// This is now free, but it didn't have a cache block
// associated with it. Keep looping.
}
EvictResult::Evicted(Some(cache_block)) => {
// Reuse this
return Some(cache_block);
}
}
}
iterations += 1;
}
// Reached the max iteration count without finding an entry. Return
// to give the caller a chance to do other things
None
}
/// Returns [EvictResult::Pinned] if the page could not be evicted because it was pinned.
fn try_evict_entry(&self, victim: usize) -> EvictResult {
// grab the write lock
if let Some(e) = self.block_map.entry_at_bucket(victim) {
let old = e.get();
// note: all the accesses to 'pinned' currently happen
// within update_with_fn(), or while holding ValueReadGuard, which protects from concurrent
// updates. Otherwise, another thread could set the 'pinned'
// flag just after we have checked it here.
//
// FIXME: ^^ outdated comment, update_with_fn() is no more
if old.pinned.load(Ordering::Relaxed) == 0 {
let old_val = e.remove();
let _ = self
.shared
.global_lw_lsn
.fetch_max(old_val.lw_lsn.into_inner().0, Ordering::Relaxed);
let evicted_cache_block = match old_val.cache_block.into_inner() {
INVALID_CACHE_BLOCK => None,
n => Some(n),
};
if evicted_cache_block.is_some() {
self.metrics.cache_page_evictions_counter.inc();
}
self.metrics.block_entry_evictions_counter.inc();
EvictResult::Evicted(evicted_cache_block)
} else {
EvictResult::Pinned
}
} else {
EvictResult::Vacant
}
}
/// Resize the local file cache.
pub fn resize_file_cache(&self, num_blocks: u32) {
let old_num_blocks = self.block_map.get_num_buckets() as u32;
if old_num_blocks < num_blocks {
if let Err(err) = self.block_map.grow(num_blocks) {
tracing::warn!(
"could not grow file cache to {} blocks (old size {}): {}",
num_blocks,
old_num_blocks,
err
);
}
} else {
// TODO: Shrinking not implemented yet
}
}
pub fn dump_map(&self, _dst: &mut dyn std::io::Write) {
//FIXME self.cache_map.start_read().dump(dst);
}
}
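// --- Illustrative sketch, not part of the patch ---
// How the communicator worker is expected to combine get_rel_size() and
// remember_rel_size(): consult the cache first, and on a miss ask the pageserver at
// the returned 'not_modified_since' LSN, then populate the cache with the answer.
// `fetch_from_pageserver` is a hypothetical stand-in for the real gRPC call; see
// handle_request() in the worker main loop below for the actual version.
#[allow(dead_code)]
async fn example_rel_size_lookup<'t>(
cache: &'t IntegratedCacheWriteAccess<'t>,
rel: &RelTag,
fetch_from_pageserver: impl std::future::Future<Output = u32>,
) -> u32 {
match cache.get_rel_size(rel) {
CacheResult::Found(nblocks) => nblocks,
CacheResult::NotFound(not_modified_since) => {
let nblocks = fetch_from_pageserver.await;
cache.remember_rel_size(rel, nblocks, not_modified_since);
nblocks
}
}
}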
impl<T: metric::group::Encoding> MetricGroup<T> for IntegratedCacheWriteAccess<'_>
where
CounterState: MetricEncoding<T>,
GaugeState: MetricEncoding<T>,
{
fn collect_group_into(&self, enc: &mut T) -> Result<(), <T as metric::group::Encoding>::Err> {
// Update gauges
self.metrics
.block_map_num_buckets
.set(self.block_map.get_num_buckets() as i64);
self.metrics
.block_map_num_buckets_in_use
.set(self.block_map.get_num_buckets_in_use() as i64);
self.metrics
.relsize_cache_num_buckets
.set(self.relsize_cache.get_num_buckets() as i64);
self.metrics
.relsize_cache_num_buckets_in_use
.set(self.relsize_cache.get_num_buckets_in_use() as i64);
if let Some(file_cache) = &self.file_cache {
file_cache.collect_group_into(enc)?;
}
self.metrics.collect_group_into(enc)
}
}
/// Read relation size from the cache.
///
/// This is in a separate function so that it can be shared by
/// IntegratedCacheReadAccess::get_rel_size() and IntegratedCacheWriteAccess::get_rel_size()
fn get_rel_size(
r: &neon_shmem::hash::HashMapAccess<RelKey, RelEntry>,
rel: &RelTag,
) -> Option<u32> {
if let Some(rel_entry) = r.get(&RelKey::from(rel)) {
let nblocks = rel_entry.nblocks.load(Ordering::Relaxed);
if nblocks != u32::MAX {
Some(nblocks)
} else {
None
}
} else {
None
}
}
pub enum GetBucketResult {
Occupied(RelTag, u32),
Vacant,
OutOfBounds,
}
/// Accessor for other backends
///
/// This allows backends to read pages from the cache directly, on their own, without making a
/// request to the communicator process.
impl<'t> IntegratedCacheReadAccess<'t> {
pub fn get_rel_size(&'t self, rel: &RelTag) -> Option<u32> {
get_rel_size(&self.relsize_cache, rel)
}
pub fn start_read_op(&'t self) -> BackendCacheReadOp<'t> {
BackendCacheReadOp {
read_guards: Vec::new(),
map_access: self,
}
}
/// Check if LFC contains the given buffer, and update its last-written LSN if not.
///
/// Returns:
/// true if the block is in the LFC
/// false if it's not.
///
/// If the block was not in the LFC (i.e. when this returns false), the last-written LSN
/// value on the block is updated to the given 'lsn', so that the next read of the block
/// will read the new version. Otherwise the caller is assumed to modify the page and
/// to update the last-written LSN later by writing the new page.
pub fn update_lw_lsn_for_block_if_not_cached(
&'t self,
rel: &RelTag,
block_number: u32,
lsn: Lsn,
) -> bool {
let key = BlockKey::from((rel, block_number));
let entry = self.block_map.entry(key);
match entry {
Entry::Occupied(e) => {
let block_entry = e.get();
if block_entry.cache_block.load(Ordering::Relaxed) != INVALID_CACHE_BLOCK {
block_entry.referenced.store(true, Ordering::Relaxed);
true
} else {
let old_lwlsn = block_entry.lw_lsn.fetch_max(lsn);
if old_lwlsn >= lsn {
// shouldn't happen
tracing::warn!(
"attempted to move last-written LSN backwards from {old_lwlsn} to {lsn} for rel {rel} blk {block_number}"
);
}
false
}
}
Entry::Vacant(e) => {
if e.insert(BlockEntry {
lw_lsn: AtomicLsn::new(lsn.0),
cache_block: AtomicU64::new(INVALID_CACHE_BLOCK),
pinned: AtomicU64::new(0),
referenced: AtomicBool::new(true),
})
.is_ok()
{
false
} else {
// The hash table is full.
//
// TODO: Evict something. But for now, just set the global lw LSN instead.
// That's correct, but not very efficient for future reads
let _ = self
.shared
.global_lw_lsn
.fetch_max(lsn.0, Ordering::Relaxed);
false
}
}
}
}
pub fn get_bucket(&self, bucket_no: usize) -> GetBucketResult {
match self.block_map.get_at_bucket(bucket_no).as_deref() {
None => {
// free bucket, or out of bounds
if bucket_no >= self.block_map.get_num_buckets() {
GetBucketResult::OutOfBounds
} else {
GetBucketResult::Vacant
}
}
Some((key, _)) => GetBucketResult::Occupied(key.rel, key.block_number),
}
}
pub fn get_num_buckets_in_use(&self) -> usize {
self.block_map.get_num_buckets_in_use()
}
}
pub struct BackendCacheReadOp<'t> {
read_guards: Vec<DeferredUnpin>,
map_access: &'t IntegratedCacheReadAccess<'t>,
}
impl<'e> BackendCacheReadOp<'e> {
/// Initiate a read of the page from the cache.
///
/// This returns the "cache block number", i.e. the block number within the cache file, where
/// the page's contents is stored. To get the page contents, the caller needs to read that block
/// from the cache file. This returns a guard object that you must hold while it performs the
/// read. It's possible that while you are performing the read, the cache block is invalidated.
/// After you have completed the read, call BackendCacheReadResult::finish() to check if the
/// read was in fact valid or not. If it was concurrently invalidated, you need to retry.
pub fn get_page(&mut self, rel: &RelTag, block_number: u32) -> Option<u64> {
if let Some(block_entry) = self
.map_access
.block_map
.get(&BlockKey::from((rel, block_number)))
{
block_entry.referenced.store(true, Ordering::Relaxed);
let cache_block = block_entry.cache_block.load(Ordering::Relaxed);
if cache_block != INVALID_CACHE_BLOCK {
block_entry.pinned.fetch_add(1, Ordering::Relaxed);
self.read_guards
.push(DeferredUnpin(block_entry.pinned.as_ptr()));
Some(cache_block)
} else {
None
}
} else {
None
}
}
pub fn finish(self) -> bool {
// TODO: currently, we hold a pin on the in-memory map, so concurrent invalidations are not
// possible. But if we switch to optimistic locking, this would return 'false' if the
// optimistic locking failed and you need to retry.
true
}
}
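// --- Illustrative sketch, not part of the patch ---
// The backend-side read protocol described above: pin the entry via get_page(), read
// the block from the LFC file while the pin is held, then call finish() and retry if
// the read turned out to be invalid. `read_block_from_lfc_file` is a hypothetical
// stand-in for the backend's own pread() of the cache file.
#[allow(dead_code)]
fn example_backend_read<'t>(
cache: &'t IntegratedCacheReadAccess<'t>,
rel: &RelTag,
block_number: u32,
mut read_block_from_lfc_file: impl FnMut(u64) -> [u8; crate::BLCKSZ],
) -> Option<[u8; crate::BLCKSZ]> {
loop {
let mut op = cache.start_read_op();
let cache_block = op.get_page(rel, block_number)?;
let page = read_block_from_lfc_file(cache_block);
if op.finish() {
return Some(page);
}
// The cache block was invalidated concurrently; retry.
}
}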
/// A hack to decrement an AtomicU64 on drop. This is used to decrement the pin count
/// of a BlockEntry. The safety depends on the fact that the BlockEntry is not evicted
/// or moved while it's pinned.
struct DeferredUnpin(*mut u64);
unsafe impl Sync for DeferredUnpin {}
unsafe impl Send for DeferredUnpin {}
impl Drop for DeferredUnpin {
fn drop(&mut self) {
// unpin it
unsafe {
let pin_ref = AtomicU64::from_ptr(self.0);
pin_ref.fetch_sub(1, Ordering::Relaxed);
}
}
}


@@ -1,5 +1,29 @@
//! Three main parts:
//! - async tokio communicator core, which receives requests and processes them.
//! - Main loop and requests queues, which routes requests from backends to the core
//! - the per-backend glue code, which submits requests
mod backend_comms;
// mark this 'pub', because these functions are called from C code. Otherwise, the compiler
// complains about a bunch of structs and enum variants being unused, because it thinks
// the functions that use them are never called. There are some C-callable functions in
// other modules too, but marking this as pub is currently enough to silence the warnings
//
// TODO: perhaps collect *all* the extern "C" functions to one module?
pub mod backend_interface;
mod file_cache;
mod init;
mod integrated_cache;
mod neon_request;
mod worker_process;
mod global_allocator;
/// Name of the Unix Domain Socket that serves the metrics, and other APIs in the
/// future. This is within the Postgres data directory.
const NEON_COMMUNICATOR_SOCKET_NAME: &str = "neon-communicator.socket";
// FIXME: get this from postgres headers somehow
pub const BLCKSZ: usize = 8192;


@@ -0,0 +1,466 @@
// Definitions of some core PostgreSQL datatypes.
/// XLogRecPtr is defined in "access/xlogdefs.h" as:
///
/// ```
/// typedef uint64 XLogRecPtr;
/// ```
/// cbindgen:no-export
pub type XLogRecPtr = u64;
pub type CLsn = XLogRecPtr;
pub type COid = u32;
// This conveniently matches PG_IOV_MAX
pub const MAX_GETPAGEV_PAGES: usize = 32;
pub const INVALID_BLOCK_NUMBER: u32 = u32::MAX;
use std::ffi::CStr;
use pageserver_page_api::{self as page_api, SlruKind};
/// Request from a Postgres backend to the communicator process
#[allow(clippy::large_enum_variant)]
#[repr(C)]
#[derive(Copy, Clone, Debug, strum_macros::EnumDiscriminants)]
#[strum_discriminants(derive(measured::FixedCardinalityLabel))]
pub enum NeonIORequest {
Empty,
// Read requests. These are C-friendly variants of the corresponding structs in
// pageserver_page_api.
RelSize(CRelSizeRequest),
GetPageV(CGetPageVRequest),
ReadSlruSegment(CReadSlruSegmentRequest),
PrefetchV(CPrefetchVRequest),
DbSize(CDbSizeRequest),
/// This is like GetPageV, but bypasses the LFC and allows specifying the
/// request LSNs directly. For debugging purposes only.
GetPageVUncached(CGetPageVUncachedRequest),
// Write requests. These are needed to keep the relation size cache and LFC up-to-date.
// They are not sent to the pageserver.
WritePage(CWritePageRequest),
RelExtend(CRelExtendRequest),
RelZeroExtend(CRelZeroExtendRequest),
RelCreate(CRelCreateRequest),
RelTruncate(CRelTruncateRequest),
RelUnlink(CRelUnlinkRequest),
// Other requests
UpdateCachedRelSize(CUpdateCachedRelSizeRequest),
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub enum NeonIOResult {
Empty,
/// InvalidBlockNumber == 0xffffffff means "rel does not exist"
RelSize(u32),
/// the result pages are written to the shared memory addresses given in the request
GetPageV,
/// The result is written to the file, path to which is provided
/// in the request. The [`u64`] value here is the number of blocks.
ReadSlruSegment(u64),
/// A prefetch request returns as soon as the request has been received by the communicator.
/// It is processed in the background.
PrefetchVLaunched,
DbSize(u64),
// FIXME design compact error codes. Can't easily pass a string or other dynamic data.
// currently, this is 'errno'
Error(i32),
Aborted,
/// used for all write requests
WriteOK,
}
impl NeonIORequest {
/// All requests include a unique request ID, which can be used to trace the execution
/// of a request all the way to the pageservers. The request ID needs to be unique
/// within the lifetime of the Postgres instance (but not across servers or across
/// restarts of the same server).
pub fn request_id(&self) -> u64 {
use NeonIORequest::*;
match self {
Empty => 0,
RelSize(req) => req.request_id,
GetPageV(req) => req.request_id,
GetPageVUncached(req) => req.request_id,
ReadSlruSegment(req) => req.request_id,
PrefetchV(req) => req.request_id,
DbSize(req) => req.request_id,
WritePage(req) => req.request_id,
RelExtend(req) => req.request_id,
RelZeroExtend(req) => req.request_id,
RelCreate(req) => req.request_id,
RelTruncate(req) => req.request_id,
RelUnlink(req) => req.request_id,
UpdateCachedRelSize(req) => req.request_id,
}
}
}
/// Special quick result to a CGetPageVRequest request, indicating that the
/// requested pages are present in the local file cache. The backend can
/// read the blocks directly from the given LFC blocks.
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CCachedGetPageVResult {
pub cache_block_numbers: [u64; MAX_GETPAGEV_PAGES],
}
/// ShmemBuf represents a buffer in shared memory.
///
/// SAFETY: The pointer must point to an area in shared memory. The functions allow you to liberally
/// get a mutable pointer to the contents; it is the caller's responsibility to ensure that you
/// don't access a buffer that you're not allowed to. Inappropriate access to the buffer doesn't
/// violate Rust's safety semantics, but it will mess up and crash Postgres.
///
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct ShmemBuf {
// Pointer to where the result is written or where to read from. Must point into a buffer in shared memory!
pub ptr: *mut u8,
}
unsafe impl Send for ShmemBuf {}
unsafe impl Sync for ShmemBuf {}
unsafe impl uring_common::buf::IoBuf for ShmemBuf {
fn stable_ptr(&self) -> *const u8 {
self.ptr
}
fn bytes_init(&self) -> usize {
crate::BLCKSZ
}
fn bytes_total(&self) -> usize {
crate::BLCKSZ
}
}
unsafe impl uring_common::buf::IoBufMut for ShmemBuf {
fn stable_mut_ptr(&mut self) -> *mut u8 {
self.ptr
}
unsafe fn set_init(&mut self, pos: usize) {
if pos > crate::BLCKSZ {
panic!(
"set_init called past end of buffer, pos {}, buffer size {}",
pos,
crate::BLCKSZ
);
}
}
}
impl ShmemBuf {
pub fn as_mut_ptr(&self) -> *mut u8 {
self.ptr
}
}
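// --- Illustrative sketch, not part of the patch ---
// ShmemBuf is nothing more than a raw pointer plus the fixed BLCKSZ length reported by
// the IoBuf/IoBufMut impls above. In real use the pointer comes from Postgres shared
// memory; for illustration (e.g. in a test) it could just as well point at a leaked,
// page-sized buffer:
#[allow(dead_code)]
fn example_shmem_buf() -> ShmemBuf {
let page: &'static mut [u8; crate::BLCKSZ] = Box::leak(Box::new([0u8; crate::BLCKSZ]));
ShmemBuf {
ptr: page.as_mut_ptr(),
}
}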
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CRelSizeRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub allow_missing: bool,
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CGetPageVRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub block_number: u32,
pub nblocks: u8,
// These fields define where the result is written. Must point into a buffer in shared memory!
pub dest: [ShmemBuf; MAX_GETPAGEV_PAGES],
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CGetPageVUncachedRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub block_number: u32,
pub nblocks: u8,
pub request_lsn: CLsn,
pub not_modified_since: CLsn,
// These fields define where the result is written. Must point into a buffer in shared memory!
pub dest: [ShmemBuf; MAX_GETPAGEV_PAGES],
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CReadSlruSegmentRequest {
pub request_id: u64,
pub slru_kind: SlruKind,
pub segment_number: u32,
pub request_lsn: CLsn,
/// Must be a null-terminated C string containing the file path
/// where the communicator will write the SLRU segment.
pub destination_file_path: ShmemBuf,
}
impl CReadSlruSegmentRequest {
/// Returns the file path where the communicator will write the
/// SLRU segment.
pub(crate) fn destination_file_path(&self) -> String {
unsafe { CStr::from_ptr(self.destination_file_path.as_mut_ptr() as *const _) }
.to_string_lossy()
.into_owned()
}
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CPrefetchVRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub block_number: u32,
pub nblocks: u8,
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CDbSizeRequest {
pub request_id: u64,
pub db_oid: COid,
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CWritePageRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub block_number: u32,
pub lsn: CLsn,
// `src` defines the new page contents. Must point into a buffer in shared memory!
pub src: ShmemBuf,
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CRelExtendRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub block_number: u32,
pub lsn: CLsn,
// `src` defines the new page contents. Must point into a buffer in shared memory!
pub src: ShmemBuf,
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CRelZeroExtendRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub block_number: u32,
pub nblocks: u32,
pub lsn: CLsn,
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CRelCreateRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub lsn: CLsn,
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CRelTruncateRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub nblocks: u32,
pub lsn: CLsn,
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CRelUnlinkRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub lsn: CLsn,
}
impl CRelSizeRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
impl CGetPageVRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
impl CGetPageVUncachedRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
impl CPrefetchVRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
impl CWritePageRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
impl CRelExtendRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
impl CRelZeroExtendRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
impl CRelCreateRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
impl CRelTruncateRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
impl CRelUnlinkRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
#[repr(C)]
#[derive(Copy, Clone, Debug)]
pub struct CUpdateCachedRelSizeRequest {
pub request_id: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub nblocks: u32,
pub lsn: CLsn,
}
impl CUpdateCachedRelSizeRequest {
pub fn reltag(&self) -> page_api::RelTag {
page_api::RelTag {
spcnode: self.spc_oid,
dbnode: self.db_oid,
relnode: self.rel_number,
forknum: self.fork_number,
}
}
}
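// --- Illustrative sketch, not part of the patch ---
// How one of these request structs is filled in and wrapped in the NeonIORequest enum.
// All values here are made up; in reality the C glue code on the backend side fills in
// the OIDs and numbers for the relation being accessed.
#[allow(dead_code)]
fn example_rel_size_request() -> NeonIORequest {
let req = NeonIORequest::RelSize(CRelSizeRequest {
request_id: 42,
spc_oid: 1663, // pg_default tablespace (made-up example value)
db_oid: 5,
rel_number: 16384,
fork_number: 0, // main fork
allow_missing: false,
});
// request_id() extracts the per-request ID regardless of the variant.
assert_eq!(req.request_id(), 42);
req
}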


@@ -4,10 +4,13 @@
//!
//! These are called from the communicator threads! Careful what you do, most Postgres
//! functions are not safe to call in that context.
use utils::lsn::Lsn;
#[cfg(not(test))]
unsafe extern "C" {
pub fn notify_proc_unsafe(procno: std::ffi::c_int);
pub fn callback_set_my_latch_unsafe();
pub fn callback_get_request_lsn_unsafe() -> crate::neon_request::CLsn;
pub fn callback_get_lfc_metrics_unsafe() -> LfcMetrics;
}
@@ -16,20 +19,36 @@ unsafe extern "C" {
// package, but the code coverage build still builds these and tries to link with the
// external C code.)
#[cfg(test)]
unsafe fn notify_proc_unsafe(_procno: std::ffi::c_int) {
panic!("not usable in unit tests");
}
#[cfg(test)]
unsafe fn callback_set_my_latch_unsafe() {
panic!("not usable in unit tests");
}
#[cfg(test)]
unsafe fn callback_get_request_lsn_unsafe() -> crate::neon_request::CLsn {
panic!("not usable in unit tests");
}
#[cfg(test)]
unsafe fn callback_get_lfc_metrics_unsafe() -> LfcMetrics {
panic!("not usable in unit tests");
}
// safe wrappers
pub(super) fn notify_proc(procno: std::ffi::c_int) {
unsafe { notify_proc_unsafe(procno) };
}
pub(super) fn callback_set_my_latch() {
unsafe { callback_set_my_latch_unsafe() };
}
pub(super) fn get_request_lsn() -> Lsn {
Lsn(unsafe { callback_get_request_lsn_unsafe() })
}
pub(super) fn callback_get_lfc_metrics() -> LfcMetrics {
unsafe { callback_get_lfc_metrics_unsafe() }
}


@@ -19,71 +19,105 @@ use http::StatusCode;
use http::header::CONTENT_TYPE;
use measured::MetricGroup;
use measured::metric::MetricEncoding;
use measured::metric::gauge::GaugeState;
use measured::metric::group::Encoding;
use measured::text::BufferedTextEncoder;
use std::io::ErrorKind;
use std::sync::Arc;
use tokio::net::UnixListener;
use crate::NEON_COMMUNICATOR_SOCKET_NAME;
use crate::worker_process::lfc_metrics::LfcMetricsCollector;
use crate::worker_process::main_loop::CommunicatorWorkerProcessStruct;
impl CommunicatorWorkerProcessStruct {
/// Launch the listener
pub(crate) async fn launch_control_socket_listener(
&'static self,
) -> Result<(), std::io::Error> {
use axum::routing::get;
let app = Router::new()
.route("/metrics", get(get_metrics))
.route("/autoscaling_metrics", get(get_autoscaling_metrics))
.route("/debug/panic", get(handle_debug_panic))
.with_state(self);
enum ControlSocketState<'a> {
Full(&'a CommunicatorWorkerProcessStruct<'a>),
Legacy(LegacyControlSocketState),
}
// If the server is restarted, there might be an old socket still
// lying around. Remove it first.
match std::fs::remove_file(NEON_COMMUNICATOR_SOCKET_NAME) {
Ok(()) => {
tracing::warn!("removed stale control socket");
}
Err(e) if e.kind() == ErrorKind::NotFound => {}
Err(e) => {
tracing::error!("could not remove stale control socket: {e:#}");
// Try to proceed anyway. It will likely fail below though.
}
};
// Create the unix domain socket and start listening on it
let listener = UnixListener::bind(NEON_COMMUNICATOR_SOCKET_NAME)?;
tokio::spawn(async {
tracing::info!("control socket listener spawned");
axum::serve(listener, app)
.await
.expect("axum::serve never returns")
});
struct LegacyControlSocketState {
pub(crate) lfc_metrics: LfcMetricsCollector,
}
impl<T> MetricGroup<T> for LegacyControlSocketState
where
T: Encoding,
GaugeState: MetricEncoding<T>,
{
fn collect_group_into(&self, enc: &mut T) -> Result<(), T::Err> {
self.lfc_metrics.collect_group_into(enc)?;
Ok(())
}
}
/// Launch the listener
pub(crate) async fn launch_listener(
worker: Option<&'static CommunicatorWorkerProcessStruct<'static>>,
) -> Result<(), std::io::Error> {
use axum::routing::get;
let state = match worker {
Some(worker) => ControlSocketState::Full(worker),
None => ControlSocketState::Legacy(LegacyControlSocketState {
lfc_metrics: LfcMetricsCollector,
}),
};
let app = Router::new()
.route("/metrics", get(get_metrics))
.route("/autoscaling_metrics", get(get_autoscaling_metrics))
.route("/debug/panic", get(handle_debug_panic))
.route("/debug/dump_cache_map", get(dump_cache_map))
.with_state(Arc::new(state));
// If the server is restarted, there might be an old socket still
// lying around. Remove it first.
match std::fs::remove_file(NEON_COMMUNICATOR_SOCKET_NAME) {
Ok(()) => {
tracing::warn!("removed stale control socket");
}
Err(e) if e.kind() == ErrorKind::NotFound => {}
Err(e) => {
tracing::error!("could not remove stale control socket: {e:#}");
// Try to proceed anyway. It will likely fail below though.
}
};
// Create the unix domain socket and start listening on it
let listener = UnixListener::bind(NEON_COMMUNICATOR_SOCKET_NAME)?;
tokio::spawn(async {
tracing::info!("control socket listener spawned");
axum::serve(listener, app)
.await
.expect("axum::serve never returns")
});
Ok(())
}
/// Expose all Prometheus metrics.
async fn get_metrics(State(state): State<&CommunicatorWorkerProcessStruct>) -> Response {
tracing::trace!("/metrics requested");
metrics_to_response(&state).await
async fn get_metrics(State(state): State<Arc<ControlSocketState<'_>>>) -> Response {
match state.as_ref() {
ControlSocketState::Full(worker) => metrics_to_response(&worker).await,
ControlSocketState::Legacy(legacy) => metrics_to_response(&legacy).await,
}
}
/// Expose Prometheus metrics, for use by the autoscaling agent.
///
/// This is a subset of all the metrics.
async fn get_autoscaling_metrics(
State(state): State<&CommunicatorWorkerProcessStruct>,
) -> Response {
tracing::trace!("/metrics requested");
metrics_to_response(&state.lfc_metrics).await
async fn get_autoscaling_metrics(State(state): State<Arc<ControlSocketState<'_>>>) -> Response {
match state.as_ref() {
ControlSocketState::Full(worker) => metrics_to_response(&worker.lfc_metrics).await,
ControlSocketState::Legacy(legacy) => metrics_to_response(&legacy.lfc_metrics).await,
}
}
async fn handle_debug_panic(State(_state): State<&CommunicatorWorkerProcessStruct>) -> Response {
async fn handle_debug_panic(State(_state): State<Arc<ControlSocketState<'_>>>) -> Response {
panic!("test HTTP handler task panic");
}
@@ -100,3 +134,23 @@ async fn metrics_to_response(metrics: &(dyn MetricGroup<BufferedTextEncoder> + S
.body(Body::from(enc.finish()))
.unwrap()
}
async fn dump_cache_map(State(state): State<Arc<ControlSocketState<'_>>>) -> Response {
match state.as_ref() {
ControlSocketState::Full(worker) => {
let mut buf: Vec<u8> = Vec::new();
worker.cache.dump_map(&mut buf);
Response::builder()
.status(StatusCode::OK)
.header(CONTENT_TYPE, "application/text")
.body(Body::from(buf))
.unwrap()
}
ControlSocketState::Legacy(_) => Response::builder()
.status(StatusCode::NOT_FOUND)
.header(CONTENT_TYPE, "application/text")
.body(Body::from(Vec::new()))
.unwrap(),
}
}


@@ -0,0 +1,95 @@
//! Lock table to ensure that only one IO request is in flight for a given
//! block (or relation or database metadata) at a time
use std::cmp::Eq;
use std::hash::Hash;
use std::sync::Arc;
use tokio::sync::{Mutex, OwnedMutexGuard};
use clashmap::ClashMap;
use clashmap::Entry;
use pageserver_page_api::RelTag;
#[derive(Clone, Eq, Hash, PartialEq)]
pub enum RequestInProgressKey {
Db(u32),
Rel(RelTag),
Block(RelTag, u32),
}
type RequestId = u64;
pub type RequestInProgressTable = MutexHashMap<RequestInProgressKey, RequestId>;
// more primitive locking thingie:
pub struct MutexHashMap<K, V>
where
K: Clone + Eq + Hash,
{
lock_table: ClashMap<K, (V, Arc<Mutex<()>>)>,
}
pub struct MutexHashMapGuard<'a, K, V>
where
K: Clone + Eq + Hash,
{
pub key: K,
map: &'a MutexHashMap<K, V>,
mutex: Arc<Mutex<()>>,
_guard: OwnedMutexGuard<()>,
}
impl<'a, K, V> Drop for MutexHashMapGuard<'a, K, V>
where
K: Clone + Eq + Hash,
{
fn drop(&mut self) {
let (_old_key, old_val) = self.map.lock_table.remove(&self.key).unwrap();
assert!(Arc::ptr_eq(&old_val.1, &self.mutex));
// the guard will be dropped as we return
}
}
impl<K, V> MutexHashMap<K, V>
where
K: Clone + Eq + Hash,
V: std::fmt::Display + Copy,
{
pub fn new() -> MutexHashMap<K, V> {
MutexHashMap {
lock_table: ClashMap::new(),
}
}
pub async fn lock<'a>(&'a self, key: K, val: V) -> MutexHashMapGuard<'a, K, V> {
let my_mutex = Arc::new(Mutex::new(()));
let my_guard = Arc::clone(&my_mutex).lock_owned().await;
loop {
let (request_id, lock) = match self.lock_table.entry(key.clone()) {
Entry::Occupied(e) => {
let e = e.get();
(e.0, Arc::clone(&e.1))
}
Entry::Vacant(e) => {
e.insert((val, Arc::clone(&my_mutex)));
break;
}
};
tracing::info!("waiting for conflicting IO {request_id} to complete");
let _ = lock.lock().await;
tracing::info!("conflicting IO {request_id} completed");
}
MutexHashMapGuard {
key,
map: self,
mutex: my_mutex,
_guard: my_guard,
}
}
}
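// --- Illustrative usage sketch, not part of the patch ---
// Serializing IOs on the same block: the guard returned by lock() is held for the
// duration of the request, and a second caller with the same key waits until the first
// guard is dropped (which also removes the entry from the lock table).
#[allow(dead_code)]
async fn example_block_io_lock(
table: &RequestInProgressTable,
rel: RelTag,
block_number: u32,
request_id: RequestId,
) {
let _guard = table
.lock(RequestInProgressKey::Block(rel, block_number), request_id)
.await;
// ... perform the read or write for this block here ...
// Dropping _guard releases the lock and wakes any waiters.
}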


@@ -1,34 +1,126 @@
use std::collections::HashMap;
use std::os::fd::AsRawFd;
use std::os::fd::OwnedFd;
use std::path::PathBuf;
use std::str::FromStr as _;
use crate::backend_comms::NeonIORequestSlot;
use crate::file_cache::FileCache;
use crate::global_allocator::MyAllocatorCollector;
use crate::init::CommunicatorInitStruct;
use crate::integrated_cache::{CacheResult, IntegratedCacheWriteAccess};
use crate::neon_request::{CGetPageVRequest, CGetPageVUncachedRequest, CPrefetchVRequest};
use crate::neon_request::{INVALID_BLOCK_NUMBER, NeonIORequest, NeonIOResult};
use crate::worker_process::control_socket;
use crate::worker_process::in_progress_ios::{RequestInProgressKey, RequestInProgressTable};
use crate::worker_process::lfc_metrics::LfcMetricsCollector;
use pageserver_client_grpc::{PageserverClient, ShardSpec, ShardStripeSize};
use pageserver_page_api as page_api;
use tokio::io::AsyncReadExt;
use tokio_pipe::PipeRead;
use uring_common::buf::IoBuf;
use measured::MetricGroup;
use measured::metric::MetricEncoding;
use measured::metric::counter::CounterState;
use measured::metric::gauge::GaugeState;
use measured::metric::group::Encoding;
use measured::{Gauge, GaugeVec};
use utils::id::{TenantId, TimelineId};
pub struct CommunicatorWorkerProcessStruct {
use super::callbacks::{get_request_lsn, notify_proc};
use tracing::{error, info, info_span, trace};
use utils::lsn::Lsn;
pub struct CommunicatorWorkerProcessStruct<'a> {
/// Tokio runtime that the main loop and any other related tasks runs in.
runtime: tokio::runtime::Runtime,
/// Client to communicate with the pageserver
client: PageserverClient,
/// Request slots that backends use to send IO requests to the communicator.
neon_request_slots: &'a [NeonIORequestSlot],
/// Notification pipe. Backends use this to notify the communicator that a request is waiting to
/// be processed in one of the request slots.
submission_pipe_read_fd: OwnedFd,
/// Locking table for all in-progress IO requests.
in_progress_table: RequestInProgressTable,
/// Local File Cache, relation size tracking, last-written LSN tracking
pub(crate) cache: IntegratedCacheWriteAccess<'a>,
/*** Metrics ***/
pub(crate) lfc_metrics: LfcMetricsCollector,
request_counters: GaugeVec<RequestTypeLabelGroupSet>,
getpage_cache_misses_counter: Gauge,
getpage_cache_hits_counter: Gauge,
// For the requests that affect multiple blocks, have separate counters for the # of blocks affected
request_nblocks_counters: GaugeVec<RequestTypeLabelGroupSet>,
allocator_metrics: MyAllocatorCollector,
}
// Define a label group, consisting of 1 or more label values
#[derive(measured::LabelGroup)]
#[label(set = RequestTypeLabelGroupSet)]
struct RequestTypeLabelGroup {
request_type: crate::neon_request::NeonIORequestDiscriminants,
}
impl RequestTypeLabelGroup {
fn from_req(req: &NeonIORequest) -> Self {
RequestTypeLabelGroup {
request_type: req.into(),
}
}
}
/// Launch the communicator process's Rust subsystems
#[allow(clippy::too_many_arguments)]
pub(super) fn init_legacy() -> Result<(), String> {
let runtime = tokio::runtime::Builder::new_multi_thread()
.enable_all()
.thread_name("communicator thread")
.build()
.unwrap();
// Start the listener on the control socket
runtime
.block_on(control_socket::launch_listener(None))
.map_err(|e| e.to_string())?;
Box::leak(Box::new(runtime));
Ok(())
}
/// Launch the communicator process's Rust subsystems
#[allow(clippy::too_many_arguments)]
pub(super) fn init(
tenant_id: Option<&str>,
timeline_id: Option<&str>,
) -> Result<&'static CommunicatorWorkerProcessStruct, String> {
cis: CommunicatorInitStruct,
tenant_id: &str,
timeline_id: &str,
auth_token: Option<&str>,
shard_map: HashMap<utils::shard::ShardIndex, String>,
stripe_size: Option<ShardStripeSize>,
initial_file_cache_size: u64,
file_cache_path: Option<PathBuf>,
) -> Result<&'static CommunicatorWorkerProcessStruct<'static>, String> {
// The caller validated these already
let _tenant_id = tenant_id
.map(TenantId::from_str)
.transpose()
.map_err(|e| format!("invalid tenant ID: {e}"))?;
let _timeline_id = timeline_id
.map(TimelineId::from_str)
.transpose()
.map_err(|e| format!("invalid timeline ID: {e}"))?;
let tenant_id = TenantId::from_str(tenant_id).map_err(|e| format!("invalid tenant ID: {e}"))?;
let timeline_id =
TimelineId::from_str(timeline_id).map_err(|e| format!("invalid timeline ID: {e}"))?;
let shard_spec =
ShardSpec::new(shard_map, stripe_size).map_err(|e| format!("invalid shard spec: {e}:"))?;
let runtime = tokio::runtime::Builder::new_multi_thread()
.enable_all()
@@ -36,31 +128,716 @@ pub(super) fn init(
.build()
.unwrap();
let last_lsn = get_request_lsn();
let file_cache = if let Some(path) = file_cache_path {
Some(FileCache::new(&path, initial_file_cache_size).expect("could not create cache file"))
} else {
// FIXME: temporarily for testing, use LFC even if disabled
Some(
FileCache::new(&PathBuf::from("new_filecache"), 1000)
.expect("could not create cache file"),
)
};
// Initialize subsystems
let cache = cis
.integrated_cache_init_struct
.worker_process_init(last_lsn, file_cache);
let client = {
let _guard = runtime.enter();
PageserverClient::new(
tenant_id,
timeline_id,
shard_spec,
auth_token.map(|s| s.to_string()),
None,
)
.expect("could not create client")
};
let worker_struct = CommunicatorWorkerProcessStruct {
// Note: it's important to not drop the runtime, or all the tasks are dropped
// too. Including it in the returned struct is one way to keep it around.
runtime,
neon_request_slots: cis.neon_request_slots,
client,
cache,
submission_pipe_read_fd: cis.submission_pipe_read_fd,
in_progress_table: RequestInProgressTable::new(),
// metrics
lfc_metrics: LfcMetricsCollector,
request_counters: GaugeVec::new(),
getpage_cache_misses_counter: Gauge::new(),
getpage_cache_hits_counter: Gauge::new(),
request_nblocks_counters: GaugeVec::new(),
allocator_metrics: MyAllocatorCollector::new(),
};
let worker_struct = Box::leak(Box::new(worker_struct));
let main_loop_handle = worker_struct.runtime.spawn(worker_struct.run());
worker_struct.runtime.spawn(async {
let err = main_loop_handle.await.unwrap_err();
error!("error: {err:?}");
});
// Start the listener on the control socket
worker_struct
.runtime
.block_on(worker_struct.launch_control_socket_listener())
.block_on(control_socket::launch_listener(Some(worker_struct)))
.map_err(|e| e.to_string())?;
Ok(worker_struct)
}
impl<T> MetricGroup<T> for CommunicatorWorkerProcessStruct
impl<'t> CommunicatorWorkerProcessStruct<'t> {
/// Update the configuration
pub(super) fn update_shard_map(
&self,
new_shard_map: HashMap<utils::shard::ShardIndex, String>,
stripe_size: Option<ShardStripeSize>,
) {
let shard_spec = ShardSpec::new(new_shard_map, stripe_size).expect("invalid shard spec");
{
let _in_runtime = self.runtime.enter();
if let Err(err) = self.client.update_shards(shard_spec) {
tracing::error!("could not update shard map: {err:?}");
}
}
}
/// Main loop of the worker process. Receive requests from the backends and process them.
pub(super) async fn run(&'static self) {
let mut idxbuf: [u8; 4] = [0; 4];
let mut submission_pipe_read =
PipeRead::try_from(self.submission_pipe_read_fd.as_raw_fd()).expect("invalid pipe fd");
loop {
// Wait for a backend to ring the doorbell
match submission_pipe_read.read(&mut idxbuf).await {
Ok(4) => {}
Ok(nbytes) => panic!("short read ({nbytes} bytes) on communicator pipe"),
Err(e) => panic!("error reading from communicator pipe: {e}"),
}
let slot_idx = u32::from_ne_bytes(idxbuf) as usize;
// Read the IO request from the slot indicated in the wakeup
let Some(slot) = self.neon_request_slots[slot_idx].start_processing_request() else {
// This currently should not happen. But if we had multiple threads picking up
// requests without waiting for the notifications, it could.
panic!("no request in slot");
};
// Ok, we have ownership of this request now. We must process it now, there's no going
// back.
//
// Spawn a separate task for every request. That's a little excessive for requests that
// can be quickly satisfied from the cache, but we expect that to be rare, because the
// requesting backend would have already checked the cache.
tokio::spawn(async move {
use tracing::Instrument;
let request_id = slot.get_request().request_id();
let owner_procno = slot.get_owner_procno();
let span = info_span!(
"processing",
request_id = request_id,
slot_idx = slot_idx,
procno = owner_procno,
);
async {
// FIXME: as a temporary hack, abort the request if we don't get a response
// promptly.
//
// Lots of regression tests are getting stuck and failing at the moment,
// this makes them fail a little faster, which makes it faster to iterate.
// This needs to be removed once more regression tests are passing.
// See also similar hack in the backend code, in wait_request_completion()
let result = tokio::time::timeout(
tokio::time::Duration::from_secs(60),
self.handle_request(slot.get_request()),
)
.await
.unwrap_or_else(|_elapsed| {
info!("request {request_id} timed out");
NeonIOResult::Error(libc::ETIMEDOUT)
});
trace!("request {request_id} at slot {slot_idx} completed");
// Ok, we have completed the IO. Mark the request as completed. After that,
// we no longer have ownership of the slot, and must not modify it.
slot.completed(result);
// Notify the backend about the completion. (Note that the backend might see
// the completed status even before this; this is just a wakeup)
notify_proc(owner_procno);
}
.instrument(span)
.await
});
}
}
/// Compute the 'request_lsn' to use for a pageserver request
fn request_lsns(&self, not_modified_since_lsn: Lsn) -> page_api::ReadLsn {
let mut request_lsn = get_request_lsn();
// Is it possible that the last-written LSN is ahead of the last flush LSN? Generally not: we
// shouldn't evict a page from the buffer cache before all its modifications have been
// safely flushed. That's the "WAL before data" rule. However, there are a few exceptions:
//
// - when creating an index: _bt_blwritepage logs the full page without flushing WAL before
// smgrextend (files are fsynced before build ends).
//
// XXX: If we make a request LSN greater than the current WAL flush LSN, the pageserver would
// block waiting for the WAL to arrive, until we flush it and it propagates through the
// safekeepers to the pageserver. If there's nothing that forces the WAL to be flushed,
// the pageserver would get stuck waiting forever. To avoid that, all the write-
// functions in communicator_new.c call XLogSetAsyncXactLSN(). That nudges the WAL writer to
// perform the flush relatively soon.
//
// It would perhaps be nicer to do the WAL flush here, but it's tricky to call back into
// Postgres code to do that from here. That's why we rely on communicator_new.c to do the
// calls "pre-emptively".
//
// FIXME: Because of the above, it can still happen that the flush LSN is ahead of
// not_modified_since, if the WAL writer hasn't done the flush yet. It would be nice to know
// if there are other such cases that we have missed, but unfortunately we cannot turn
// this into an assertion because of that legit case.
//
// See also the old logic in neon_get_request_lsns() C function
if not_modified_since_lsn > request_lsn {
tracing::info!(
"not_modified_since_lsn {} is ahead of last flushed LSN {}",
not_modified_since_lsn,
request_lsn
);
request_lsn = not_modified_since_lsn;
}
page_api::ReadLsn {
request_lsn,
not_modified_since_lsn: Some(not_modified_since_lsn),
}
}
/// Handle one IO request
async fn handle_request(&'static self, request: &'_ NeonIORequest) -> NeonIOResult {
self.request_counters
.inc(RequestTypeLabelGroup::from_req(request));
match request {
NeonIORequest::Empty => {
error!("unexpected Empty IO request");
NeonIOResult::Error(0)
}
NeonIORequest::RelSize(req) => {
let rel = req.reltag();
let _in_progress_guard = self
.in_progress_table
.lock(RequestInProgressKey::Rel(rel), req.request_id)
.await;
// Check the cache first
let not_modified_since = match self.cache.get_rel_size(&rel) {
CacheResult::Found(nblocks) => {
tracing::trace!("found relsize for {:?} in cache: {}", rel, nblocks);
return NeonIOResult::RelSize(nblocks);
}
// XXX: we don't cache negative entries, so if there's no entry in the cache, it could mean
// that the relation doesn't exist or that we don't have it cached.
CacheResult::NotFound(lsn) => lsn,
};
let read_lsn = self.request_lsns(not_modified_since);
match self
.client
.get_rel_size(page_api::GetRelSizeRequest {
read_lsn,
rel,
allow_missing: req.allow_missing,
})
.await
{
Ok(Some(nblocks)) => {
// update the cache
tracing::trace!(
"updated relsize for {:?} in cache: {}, lsn {}",
rel,
nblocks,
read_lsn
);
self.cache
.remember_rel_size(&rel, nblocks, not_modified_since);
NeonIOResult::RelSize(nblocks)
}
Ok(None) => {
// TODO: cache negative entry?
NeonIOResult::RelSize(INVALID_BLOCK_NUMBER)
}
Err(err) => {
// FIXME: Could we map the tonic StatusCode to a libc errno in a more fine-grained way? Or pass the error message to the backend
info!("tonic error: {err:?}");
NeonIOResult::Error(libc::EIO)
}
}
}
NeonIORequest::GetPageV(req) => match self.handle_get_pagev_request(req).await {
Ok(()) => NeonIOResult::GetPageV,
Err(errno) => NeonIOResult::Error(errno),
},
NeonIORequest::GetPageVUncached(req) => {
match self.handle_get_pagev_uncached_request(req).await {
Ok(()) => NeonIOResult::GetPageV,
Err(errno) => NeonIOResult::Error(errno),
}
}
NeonIORequest::ReadSlruSegment(req) => {
let lsn = Lsn(req.request_lsn);
let file_path = req.destination_file_path();
match self
.client
.get_slru_segment(page_api::GetSlruSegmentRequest {
read_lsn: self.request_lsns(lsn),
kind: req.slru_kind,
segno: req.segment_number,
})
.await
{
Ok(slru_bytes) => {
if let Err(e) = tokio::fs::write(&file_path, &slru_bytes).await {
error!("could not write slru segment to file {file_path}: {e}");
return NeonIOResult::Error(e.raw_os_error().unwrap_or(libc::EIO));
}
let blocks_count = slru_bytes.len() / crate::BLCKSZ;
NeonIOResult::ReadSlruSegment(blocks_count as _)
}
Err(err) => {
// FIXME: Could we map the tonic StatusCode to a libc errno in a more fine-grained way? Or pass the error message to the backend
info!("tonic error: {err:?}");
NeonIOResult::Error(libc::EIO)
}
}
}
NeonIORequest::PrefetchV(req) => {
self.request_nblocks_counters
.inc_by(RequestTypeLabelGroup::from_req(request), req.nblocks as i64);
let req = *req;
// FIXME: handle_request() runs in a separate task already, do we really need to spawn a new one here?
tokio::spawn(async move { self.handle_prefetchv_request(&req).await });
NeonIOResult::PrefetchVLaunched
}
NeonIORequest::DbSize(req) => {
let _in_progress_guard = self
.in_progress_table
.lock(RequestInProgressKey::Db(req.db_oid), req.request_id)
.await;
// Check the cache first
let not_modified_since = match self.cache.get_db_size(req.db_oid) {
CacheResult::Found(db_size) => {
// found the database size in the cache
return NeonIOResult::DbSize(db_size);
}
CacheResult::NotFound(lsn) => lsn,
};
match self
.client
.get_db_size(page_api::GetDbSizeRequest {
read_lsn: self.request_lsns(not_modified_since),
db_oid: req.db_oid,
})
.await
{
Ok(db_size) => NeonIOResult::DbSize(db_size),
Err(err) => {
// FIXME: Could we map the tonic StatusCode to a libc errno in a more fine-grained way? Or pass the error message to the backend
info!("tonic error: {err:?}");
NeonIOResult::Error(libc::EIO)
}
}
}
// Write requests
NeonIORequest::WritePage(req) => {
let rel = req.reltag();
let _in_progress_guard = self
.in_progress_table
.lock(
RequestInProgressKey::Block(rel, req.block_number),
req.request_id,
)
.await;
// We must at least update the last-written LSN on the page, but also store the page
// image in the LFC while we still have it
self.cache
.remember_page(&rel, req.block_number, req.src, Lsn(req.lsn), true)
.await;
NeonIOResult::WriteOK
}
NeonIORequest::RelExtend(req) => {
let rel = req.reltag();
let _in_progress_guard = self
.in_progress_table
.lock(
RequestInProgressKey::Block(rel, req.block_number),
req.request_id,
)
.await;
// We must at least update the last-written LSN on the page and the relation size,
// but also store the page image in the LFC while we still have it
self.cache
.remember_page(&rel, req.block_number, req.src, Lsn(req.lsn), true)
.await;
self.cache
.remember_rel_size(&req.reltag(), req.block_number + 1, Lsn(req.lsn));
NeonIOResult::WriteOK
}
NeonIORequest::RelZeroExtend(req) => {
self.request_nblocks_counters
.inc_by(RequestTypeLabelGroup::from_req(request), req.nblocks as i64);
// TODO: need to grab an io-in-progress lock for this? I guess not
// TODO: We could put the empty pages to the cache. Maybe have
// a marker on the block entries for all-zero pages, instead of
// actually storing the empty pages.
self.cache.remember_rel_size(
&req.reltag(),
req.block_number + req.nblocks,
Lsn(req.lsn),
);
NeonIOResult::WriteOK
}
NeonIORequest::RelCreate(req) => {
// TODO: need to grab an io-in-progress lock for this? I guess not
self.cache.remember_rel_size(&req.reltag(), 0, Lsn(req.lsn));
NeonIOResult::WriteOK
}
NeonIORequest::RelTruncate(req) => {
// TODO: need to grab an io-in-progress lock for this? I guess not
self.cache
.remember_rel_size(&req.reltag(), req.nblocks, Lsn(req.lsn));
NeonIOResult::WriteOK
}
NeonIORequest::RelUnlink(req) => {
// TODO: need to grab an io-in-progress lock for this? I guess not
self.cache.forget_rel(&req.reltag(), None, Lsn(req.lsn));
NeonIOResult::WriteOK
}
NeonIORequest::UpdateCachedRelSize(req) => {
// TODO: need to grab an io-in-progress lock for this? I guess not
self.cache
.remember_rel_size(&req.reltag(), req.nblocks, Lsn(req.lsn));
NeonIOResult::WriteOK
}
}
}
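
The FIXME repeated in the error arms above and below asks whether tonic status codes could be mapped to errnos more precisely. One possible mapping, purely as a sketch of what that FIXME hints at and not something this patch implements, could look like this:

// Hypothetical finer-grained mapping; anything unlisted still degrades to EIO.
fn errno_for_status(status: &tonic::Status) -> i32 {
    use tonic::Code;
    match status.code() {
        Code::NotFound => libc::ENOENT,
        Code::PermissionDenied | Code::Unauthenticated => libc::EACCES,
        Code::DeadlineExceeded => libc::ETIMEDOUT,
        Code::ResourceExhausted => libc::EAGAIN,
        Code::InvalidArgument | Code::OutOfRange => libc::EINVAL,
        Code::Unimplemented => libc::ENOSYS,
        _ => libc::EIO,
    }
}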
/// Subroutine to handle a GetPageV request, since it's a little more complicated than
/// others.
async fn handle_get_pagev_request(&'t self, req: &CGetPageVRequest) -> Result<(), i32> {
let rel = req.reltag();
// Check the cache first
//
// Note: Because the backends perform a direct lookup in the cache before sending
// the request to the communicator process, we expect the pages to almost never
// be already in cache. It could happen if:
// 1. two backends try to read the same page at the same time, but that should never
// happen because there's higher level locking in the Postgres buffer manager, or
// 2. a prefetch request finished at the same time as a backend requested the
// page. That's much more likely.
let mut cache_misses = Vec::with_capacity(req.nblocks as usize);
for i in 0..req.nblocks {
let blkno = req.block_number + i as u32;
// note: this is deadlock-safe even though we hold multiple locks at the same time,
// because they're always acquired in the same order.
let in_progress_guard = self
.in_progress_table
.lock(RequestInProgressKey::Block(rel, blkno), req.request_id)
.await;
let dest = req.dest[i as usize];
let not_modified_since = match self.cache.get_page(&rel, blkno, dest).await {
Ok(CacheResult::Found(_)) => {
// get_page already copied the block content to the destination
trace!("found blk {} in rel {:?} in LFC", blkno, rel);
continue;
}
Ok(CacheResult::NotFound(lsn)) => lsn,
Err(_io_error) => return Err(libc::EIO), // FIXME print the error?
};
cache_misses.push((blkno, not_modified_since, dest, in_progress_guard));
}
self.getpage_cache_misses_counter
.inc_by(cache_misses.len() as i64);
self.getpage_cache_hits_counter
.inc_by(req.nblocks as i64 - cache_misses.len() as i64);
if cache_misses.is_empty() {
return Ok(());
}
let not_modified_since = cache_misses
.iter()
.map(|(_blkno, lsn, _dest, _guard)| *lsn)
.max()
.unwrap();
// Construct a pageserver request for the cache misses
let block_numbers: Vec<u32> = cache_misses
.iter()
.map(|(blkno, _lsn, _dest, _guard)| *blkno)
.collect();
let read_lsn = self.request_lsns(not_modified_since);
trace!(
"sending getpage request for blocks {:?} in rel {:?} lsns {}",
block_numbers, rel, read_lsn
);
match self
.client
.get_page(page_api::GetPageRequest {
request_id: req.request_id.into(),
request_class: page_api::GetPageClass::Normal,
read_lsn,
rel,
block_numbers: block_numbers.clone(),
})
.await
{
Ok(resp) => {
// Write the received page images directly to the shared memory location
// that the backend requested.
if resp.pages.len() != block_numbers.len() {
error!(
"received unexpected response with {} page images from pageserver for a request for {} pages",
resp.pages.len(),
block_numbers.len(),
);
return Err(libc::EIO);
}
trace!(
"received getpage response for blocks {:?} in rel {:?} lsns {}",
block_numbers, rel, read_lsn
);
for (page, (blkno, _lsn, dest, _guard)) in resp.pages.into_iter().zip(cache_misses)
{
let src: &[u8] = page.image.as_ref();
let len = std::cmp::min(src.len(), dest.bytes_total());
unsafe {
std::ptr::copy_nonoverlapping(src.as_ptr(), dest.as_mut_ptr(), len);
};
// Also store it in the LFC while we have it
self.cache
.remember_page(
&rel,
blkno,
page.image,
read_lsn.not_modified_since_lsn.unwrap(),
false,
)
.await;
}
}
Err(err) => {
// FIXME: Could we map the tonic StatusCode to a libc errno in a more fine-grained way? Or pass the error message to the backend
info!("tonic error: {err:?}");
return Err(libc::EIO);
}
}
Ok(())
}
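
In outline, the function above partitions the requested block range into LFC hits and misses, then issues one batched GetPage request for the misses at the newest not_modified_since LSN among them. A stripped-down sketch of that planning step, with a stubbed cache lookup (illustrative only, not the patch's code):

// Illustration: decide which blocks need a pageserver round trip, and at
// which not_modified_since LSN the batched request should be sent.
enum Lookup {
    Hit,
    Miss { not_modified_since: u64 },
}

fn plan_miss_request(
    base: u32,
    nblocks: u32,
    lookup: impl Fn(u32) -> Lookup,
) -> Option<(Vec<u32>, u64)> {
    let mut misses = Vec::new();
    let mut max_lsn = 0u64;
    for blkno in base..base + nblocks {
        if let Lookup::Miss { not_modified_since } = lookup(blkno) {
            misses.push(blkno);
            max_lsn = max_lsn.max(not_modified_since);
        }
    }
    // All blocks were already in the LFC: nothing to send.
    if misses.is_empty() { None } else { Some((misses, max_lsn)) }
}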
/// Subroutine to handle a GetPageVUncached request.
///
/// Note: this bypasses the cache, in-progress IO locking, and all other side-effects.
/// This request type is only used in tests.
async fn handle_get_pagev_uncached_request(
&'t self,
req: &CGetPageVUncachedRequest,
) -> Result<(), i32> {
let rel = req.reltag();
// Construct a pageserver request
let block_numbers: Vec<u32> =
(req.block_number..(req.block_number + (req.nblocks as u32))).collect();
let read_lsn = page_api::ReadLsn {
request_lsn: Lsn(req.request_lsn),
not_modified_since_lsn: Some(Lsn(req.not_modified_since)),
};
trace!(
"sending (uncached) getpage request for blocks {:?} in rel {:?} lsns {}",
block_numbers, rel, read_lsn
);
match self
.client
.get_page(page_api::GetPageRequest {
request_id: req.request_id.into(),
request_class: page_api::GetPageClass::Normal,
read_lsn,
rel,
block_numbers: block_numbers.clone(),
})
.await
{
Ok(resp) => {
// Write the received page images directly to the shared memory location
// that the backend requested.
if resp.pages.len() != block_numbers.len() {
error!(
"received unexpected response with {} page images from pageserver for a request for {} pages",
resp.pages.len(),
block_numbers.len(),
);
return Err(libc::EIO);
}
trace!(
"received getpage response for blocks {:?} in rel {:?} lsns {}",
block_numbers, rel, read_lsn
);
for (page, dest) in resp.pages.into_iter().zip(req.dest) {
let src: &[u8] = page.image.as_ref();
let len = std::cmp::min(src.len(), dest.bytes_total());
unsafe {
std::ptr::copy_nonoverlapping(src.as_ptr(), dest.as_mut_ptr(), len);
};
}
}
Err(err) => {
// FIXME: Could we map the tonic StatusCode to a libc errno in a more fine-grained way? Or pass the error message to the backend
info!("tonic error: {err:?}");
return Err(libc::EIO);
}
}
Ok(())
}
/// Subroutine to handle a PrefetchV request, since it's a little more complicated than
/// others.
///
/// This is very similar to a GetPageV request, but the results are only stored in the cache.
async fn handle_prefetchv_request(&'static self, req: &CPrefetchVRequest) -> Result<(), i32> {
let rel = req.reltag();
// Check the cache first
let mut cache_misses = Vec::with_capacity(req.nblocks as usize);
for i in 0..req.nblocks {
let blkno = req.block_number + i as u32;
// note: this is deadlock-safe even though we hold multiple locks at the same time,
// because they're always acquired in the same order.
let in_progress_guard = self
.in_progress_table
.lock(RequestInProgressKey::Block(rel, blkno), req.request_id)
.await;
let not_modified_since = match self.cache.page_is_cached(&rel, blkno).await {
Ok(CacheResult::Found(_)) => {
trace!("found blk {} in rel {:?} in LFC", blkno, rel);
continue;
}
Ok(CacheResult::NotFound(lsn)) => lsn,
Err(_io_error) => return Err(libc::EIO), // FIXME print the error?
};
cache_misses.push((blkno, not_modified_since, in_progress_guard));
}
if cache_misses.is_empty() {
return Ok(());
}
let not_modified_since = cache_misses
.iter()
.map(|(_blkno, lsn, _guard)| *lsn)
.max()
.unwrap();
let block_numbers: Vec<u32> = cache_misses
.iter()
.map(|(blkno, _lsn, _guard)| *blkno)
.collect();
// TODO: spawn separate tasks for these. Use the integrated cache to keep track of the
// in-flight requests
match self
.client
.get_page(page_api::GetPageRequest {
request_id: req.request_id.into(),
request_class: page_api::GetPageClass::Prefetch,
read_lsn: self.request_lsns(not_modified_since),
rel,
block_numbers: block_numbers.clone(),
})
.await
{
Ok(resp) => {
trace!(
"prefetch completed, remembering blocks {:?} in rel {:?} in LFC",
block_numbers, rel
);
if resp.pages.len() != block_numbers.len() {
error!(
"received unexpected response with {} page images from pageserver for a request for {} pages",
resp.pages.len(),
block_numbers.len(),
);
return Err(libc::EIO);
}
for (page, (blkno, _lsn, _guard)) in resp.pages.into_iter().zip(cache_misses) {
self.cache
.remember_page(&rel, blkno, page.image, not_modified_since, false)
.await;
}
}
Err(err) => {
// FIXME: Could we map the tonic StatusCode to a libc errno in a more fine-grained way? Or pass the error message to the backend
info!("tonic error: {err:?}");
return Err(libc::EIO);
}
}
Ok(())
}
}
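
The TODO in handle_prefetchv_request above suggests splitting the prefetch into separate tasks. A sketch of one possible shape, using tokio::task::JoinSet and a hypothetical fetch_and_cache helper; neither is part of this patch, this only illustrates the idea:

// Hypothetical: one pageserver request per chunk of misses, awaited together.
async fn prefetch_in_parallel(chunks: Vec<Vec<u32>>) -> Result<(), i32> {
    let mut tasks = tokio::task::JoinSet::new();
    for chunk in chunks {
        tasks.spawn(async move { fetch_and_cache(chunk).await });
    }
    while let Some(res) = tasks.join_next().await {
        // A panicked task surfaces as a JoinError; treat it as an IO error.
        res.map_err(|_join_err| libc::EIO)??;
    }
    Ok(())
}

// Stand-in for "send a GetPage request and store the results in the LFC".
async fn fetch_and_cache(_blocks: Vec<u32>) -> Result<(), i32> {
    Ok(())
}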
impl<T> MetricGroup<T> for CommunicatorWorkerProcessStruct<'_>
where
T: Encoding,
CounterState: MetricEncoding<T>,
GaugeState: MetricEncoding<T>,
{
fn collect_group_into(&self, enc: &mut T) -> Result<(), T::Err> {
use measured::metric::MetricFamilyEncoding;
use measured::metric::name::MetricName;
self.lfc_metrics.collect_group_into(enc)?;
self.cache.collect_group_into(enc)?;
self.request_counters
.collect_family_into(MetricName::from_str("request_counters"), enc)?;
self.request_nblocks_counters
.collect_family_into(MetricName::from_str("request_nblocks_counters"), enc)?;
self.allocator_metrics.collect_group_into(enc)?;
Ok(())
}
}


@@ -4,9 +4,9 @@
//! - launch the main loop,
//! - receive IO requests from backends and process them,
//! - write results back to backends.
mod callbacks;
mod control_socket;
mod in_progress_ios;
mod lfc_metrics;
mod logging;
mod main_loop;


@@ -1,14 +1,21 @@
//! Functions called from the C code in the worker process
use std::collections::HashMap;
use std::ffi::{CStr, CString, c_char};
use std::path::PathBuf;
use crate::init::CommunicatorInitStruct;
use crate::worker_process::main_loop;
use crate::worker_process::main_loop::CommunicatorWorkerProcessStruct;
use pageserver_client_grpc::ShardStripeSize;
/// Launch the communicator's tokio tasks, which do most of the work.
///
/// The caller has initialized the process as a regular PostgreSQL background worker
/// process.
/// process. The shared memory segment used to communicate with the backends has been
/// allocated and initialized earlier, at postmaster startup, in
/// rcommunicator_shmem_init().
///
/// Inputs:
/// `tenant_id` and `timeline_id` can be NULL, if we've been launched in "non-Neon" mode,
@@ -23,27 +30,63 @@ use crate::worker_process::main_loop::CommunicatorWorkerProcessStruct;
/// This is called only once in the process, so the returned struct, and error message in
/// case of failure, are simply leaked.
#[unsafe(no_mangle)]
pub extern "C" fn communicator_worker_launch(
pub extern "C" fn communicator_worker_process_launch(
cis: Box<CommunicatorInitStruct>,
tenant_id: *const c_char,
timeline_id: *const c_char,
auth_token: *const c_char,
shard_map: *mut *mut c_char,
nshards: u32,
stripe_size: u32,
file_cache_path: *const c_char,
initial_file_cache_size: u64,
error_p: *mut *const c_char,
) -> Option<&'static CommunicatorWorkerProcessStruct> {
) -> Option<&'static CommunicatorWorkerProcessStruct<'static>> {
tracing::warn!("starting threads in rust code");
// Convert the arguments into more convenient Rust types
let tenant_id = if tenant_id.is_null() {
let tenant_id = {
let cstr = unsafe { CStr::from_ptr(tenant_id) };
cstr.to_str().expect("assume UTF-8")
};
let timeline_id = {
let cstr = unsafe { CStr::from_ptr(timeline_id) };
cstr.to_str().expect("assume UTF-8")
};
let auth_token = if auth_token.is_null() {
None
} else {
let cstr = unsafe { CStr::from_ptr(tenant_id) };
let cstr = unsafe { CStr::from_ptr(auth_token) };
Some(cstr.to_str().expect("assume UTF-8"))
};
let timeline_id = if timeline_id.is_null() {
None
let file_cache_path = {
if file_cache_path.is_null() {
None
} else {
let c_str = unsafe { CStr::from_ptr(file_cache_path) };
Some(PathBuf::from(c_str.to_str().unwrap()))
}
};
let shard_map = shard_map_to_hash(nshards, shard_map);
// FIXME: distinguish between unsharded, and sharded with 1 shard
// Also, we might go from unsharded to sharded while the system
// is running.
let stripe_size = if stripe_size > 0 && nshards > 1 {
Some(ShardStripeSize(stripe_size))
} else {
let cstr = unsafe { CStr::from_ptr(timeline_id) };
Some(cstr.to_str().expect("assume UTF-8"))
None
};
// The `init` function does all the work.
let result = main_loop::init(tenant_id, timeline_id);
let result = main_loop::init(
*cis,
tenant_id,
timeline_id,
auth_token,
shard_map,
stripe_size,
initial_file_cache_size,
file_cache_path,
);
// On failure, return the error message to the C caller in *error_p.
match result {
@@ -58,3 +101,66 @@ pub extern "C" fn communicator_worker_launch(
}
}
}
#[unsafe(no_mangle)]
pub extern "C" fn communicator_worker_process_launch_legacy(error_p: *mut *const c_char) -> bool {
// The `init` function does all the work.
let result = main_loop::init_legacy();
// On failure, return the error message to the C caller in *error_p.
match result {
Ok(()) => true,
Err(errmsg) => {
let errmsg = CString::new(errmsg).expect("no nuls within error message");
let errmsg = Box::leak(errmsg.into_boxed_c_str());
let p: *const c_char = errmsg.as_ptr();
unsafe { *error_p = p };
false
}
}
}
/// Convert the "shard map" from an array of C strings, indexed by shard number, to a Rust HashMap
fn shard_map_to_hash(
nshards: u32,
shard_map: *mut *mut c_char,
) -> HashMap<utils::shard::ShardIndex, String> {
use utils::shard::*;
assert!(nshards <= u8::MAX as u32);
let mut result: HashMap<ShardIndex, String> = HashMap::new();
let mut p = shard_map;
for i in 0..nshards {
let c_str = unsafe { CStr::from_ptr(*p) };
p = unsafe { p.add(1) };
let s = c_str.to_str().unwrap();
let k = if nshards > 1 {
ShardIndex::new(ShardNumber(i as u8), ShardCount(nshards as u8))
} else {
ShardIndex::unsharded()
};
result.insert(k, s.into());
}
result
}
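
A unit-test-style sketch of how the conversion above behaves for a two-shard map. The addresses are made up, and the sketch assumes the usual utils::shard types and that shard_map_to_hash is visible from a sibling test module:

#[cfg(test)]
mod tests {
    use std::ffi::{CString, c_char};

    use utils::shard::{ShardCount, ShardIndex, ShardNumber};

    use super::shard_map_to_hash;

    #[test]
    fn two_shard_map() {
        // Keep the CStrings alive while the raw pointers point into them.
        let urls = [
            CString::new("grpc://ps-0:51051").unwrap(),
            CString::new("grpc://ps-1:51051").unwrap(),
        ];
        let mut ptrs: Vec<*mut c_char> =
            urls.iter().map(|s| s.as_ptr() as *mut c_char).collect();

        let map = shard_map_to_hash(2, ptrs.as_mut_ptr());

        // With more than one shard, entries are keyed by (shard number, shard count).
        assert_eq!(map.len(), 2);
        assert_eq!(
            map[&ShardIndex::new(ShardNumber(0), ShardCount(2))],
            "grpc://ps-0:51051"
        );
    }
}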
/// Inform the rust code about a configuration change
#[unsafe(no_mangle)]
pub extern "C" fn communicator_worker_config_reload(
proc_handle: &'static CommunicatorWorkerProcessStruct<'static>,
file_cache_size: u64,
shard_map: *mut *mut c_char,
nshards: u32,
stripe_size: u32,
) {
proc_handle.cache.resize_file_cache(file_cache_size as u32);
let shard_map = shard_map_to_hash(nshards, shard_map);
let stripe_size = (nshards > 1).then_some(ShardStripeSize(stripe_size));
proc_handle.update_shard_map(shard_map, stripe_size);
}

pgxn/neon/communicator_new.c (new file, 1575 lines; diff suppressed because it is too large)

@@ -0,0 +1,69 @@
/*-------------------------------------------------------------------------
*
* communicator_new.h
 *	  Interface to the new communicator implementation
*
*
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
*-------------------------------------------------------------------------
*/
#ifndef COMMUNICATOR_NEW_H
#define COMMUNICATOR_NEW_H
#include "storage/buf_internals.h"
#include "lfc_prewarm.h"
#include "neon.h"
#include "neon_pgversioncompat.h"
#include "pagestore_client.h"
/* initialization at postmaster startup */
extern void CommunicatorNewShmemRequest(void);
extern void CommunicatorNewShmemInit(void);
/* initialization at backend startup */
extern void communicator_new_init(void);
/* Read requests */
extern bool communicator_new_rel_exists(NRelFileInfo rinfo, ForkNumber forkNum);
extern BlockNumber communicator_new_rel_nblocks(NRelFileInfo rinfo, ForkNumber forknum);
extern int64 communicator_new_dbsize(Oid dbNode);
extern void communicator_new_readv(NRelFileInfo rinfo, ForkNumber forkNum,
BlockNumber base_blockno,
void **buffers, BlockNumber nblocks);
extern void communicator_new_read_at_lsn_uncached(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blockno,
void *buffer, XLogRecPtr request_lsn, XLogRecPtr not_modified_since);
extern void communicator_new_prefetch_register_bufferv(NRelFileInfo rinfo, ForkNumber forkNum,
BlockNumber blockno,
BlockNumber nblocks);
extern bool communicator_new_update_lwlsn_for_block_if_not_cached(NRelFileInfo rinfo, ForkNumber forkNum,
BlockNumber blockno, XLogRecPtr lsn);
extern int communicator_new_read_slru_segment(
SlruKind kind,
uint32_t segno,
neon_request_lsns * request_lsns,
const char *path
);
/* Write requests, to keep the caches up-to-date */
extern void communicator_new_write_page(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blockno,
const void *buffer, XLogRecPtr lsn);
extern void communicator_new_rel_extend(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blockno,
const void *buffer, XLogRecPtr lsn);
extern void communicator_new_rel_zeroextend(NRelFileInfo rinfo, ForkNumber forkNum,
BlockNumber blockno, BlockNumber nblocks,
XLogRecPtr lsn);
extern void communicator_new_rel_create(NRelFileInfo rinfo, ForkNumber forkNum, XLogRecPtr lsn);
extern void communicator_new_rel_truncate(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber nblocks, XLogRecPtr lsn);
extern void communicator_new_rel_unlink(NRelFileInfo rinfo, ForkNumber forkNum, XLogRecPtr lsn);
extern void communicator_new_update_cached_rel_size(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber nblocks, XLogRecPtr lsn);
/* other functions */
extern int32 communicator_new_approximate_working_set_size_seconds(time_t duration, bool reset);
extern struct LfcMetrics communicator_new_get_lfc_metrics_unsafe(void);
extern FileCacheState *communicator_new_get_lfc_state(size_t max_entries);
extern struct LfcStatsEntry *communicator_new_lfc_get_stats(size_t *num_entries);
#endif /* COMMUNICATOR_NEW_H */


@@ -18,6 +18,9 @@
#include <unistd.h>
#include "miscadmin.h"
#if PG_VERSION_NUM >= 150000
#include "access/xlogrecovery.h"
#endif
#include "postmaster/bgworker.h"
#include "postmaster/interrupt.h"
#include "postmaster/postmaster.h"
@@ -29,14 +32,18 @@
#include "tcop/tcopprot.h"
#include "utils/timestamp.h"
#include "communicator_new.h"
#include "communicator_process.h"
#include "file_cache.h"
#include "neon.h"
#include "neon_perf_counters.h"
#include "pagestore_client.h"
/* the rust bindings, generated by cbindgen */
#include "communicator/communicator_bindings.h"
struct CommunicatorInitStruct *cis;
static void pump_logging(struct LoggingReceiver *logging);
PGDLLEXPORT void communicator_new_bgworker_main(Datum main_arg);
@@ -70,9 +77,13 @@ pg_init_communicator_process(void)
void
communicator_new_bgworker_main(Datum main_arg)
{
char **connstrings;
ShardMap shard_map;
uint64 file_cache_size;
struct LoggingReceiver *logging;
const char *errmsg = NULL;
const struct CommunicatorWorkerProcessStruct *proc_handle;
bool success;
/*
* Pretend that this process is a WAL sender. That affects the shutdown
@@ -108,12 +119,43 @@ communicator_new_bgworker_main(Datum main_arg)
logging = communicator_worker_configure_logging();
proc_handle = communicator_worker_launch(
neon_tenant[0] == '\0' ? NULL : neon_tenant,
neon_timeline[0] == '\0' ? NULL : neon_timeline,
&errmsg
);
if (proc_handle == NULL)
if (cis != NULL)
{
/* lfc_size_limit is in MBs */
file_cache_size = lfc_size_limit * (1024 * 1024 / BLCKSZ);
if (file_cache_size < 100)
file_cache_size = 100;
if (!parse_shard_map(pageserver_grpc_urls, &shard_map))
{
/* shouldn't happen, as the GUC was verified already */
elog(FATAL, "could not parse neon.pageserver_grpc_urls");
}
connstrings = palloc(shard_map.num_shards * sizeof(char *));
for (int i = 0; i < shard_map.num_shards; i++)
connstrings[i] = shard_map.connstring[i];
AssignNumShards(shard_map.num_shards);
proc_handle = communicator_worker_process_launch(
cis,
neon_tenant,
neon_timeline,
neon_auth_token,
connstrings,
shard_map.num_shards,
neon_stripe_size,
lfc_path,
file_cache_size,
&errmsg);
pfree(connstrings);
cis = NULL;
success = proc_handle != NULL;
}
else
{
proc_handle = NULL;
success = communicator_worker_process_launch_legacy(&errmsg);
}
if (!success)
{
/*
* Something went wrong. Before exiting, forward any log messages that
@@ -173,6 +215,32 @@ communicator_new_bgworker_main(Datum main_arg)
{
ConfigReloadPending = false;
ProcessConfigFile(PGC_SIGHUP);
if (proc_handle)
{
/* lfc_size_limit is in MBs */
file_cache_size = lfc_size_limit * (1024 * 1024 / BLCKSZ);
if (file_cache_size < 100)
file_cache_size = 100;
/* Reload pageserver URLs */
if (!parse_shard_map(pageserver_grpc_urls, &shard_map))
{
/* shouldn't happen, as the GUC was verified already */
elog(FATAL, "could not parse neon.pageserver_grpc_urls");
}
connstrings = palloc(shard_map.num_shards * sizeof(char *));
for (int i = 0; i < shard_map.num_shards; i++)
connstrings[i] = shard_map.connstring[i];
AssignNumShards(shard_map.num_shards);
communicator_worker_config_reload(proc_handle,
file_cache_size,
connstrings,
shard_map.num_shards,
neon_stripe_size);
pfree(connstrings);
}
}
duration = TimestampDifferenceMilliseconds(before, GetCurrentTimestamp());
@@ -271,3 +339,49 @@ callback_set_my_latch_unsafe(void)
{
SetLatch(MyLatch);
}
/*
* FIXME: The logic from neon_get_request_lsns() needs to go here, except for
* the last-written LSN cache stuff, which is managed by the rust code now.
*/
XLogRecPtr
callback_get_request_lsn_unsafe(void)
{
/*
* NB: be very careful with what you do here! This is called from tokio
* threads, so anything that tries to take LWLocks is unsafe, for example.
*
* RecoveryInProgress() is OK
*/
if (RecoveryInProgress())
{
XLogRecPtr replay_lsn = GetXLogReplayRecPtr(NULL);
return replay_lsn;
}
else
{
XLogRecPtr flushlsn;
#if PG_VERSION_NUM >= 150000
flushlsn = GetFlushRecPtr(NULL);
#else
flushlsn = GetFlushRecPtr();
#endif
return flushlsn;
}
}
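
On the Rust side, this callback would be reached through an extern declaration along these lines. This is only a sketch; the real binding lives in the communicator's callbacks module and may be declared differently.

use utils::lsn::Lsn;

// XLogRecPtr is a plain u64 across the FFI boundary.
unsafe extern "C" {
    fn callback_get_request_lsn_unsafe() -> u64;
}

/// Thin wrapper. Calling this from a tokio worker thread is only acceptable
/// because the C side avoids LWLocks, elog() and allocation, as noted above.
fn get_request_lsn() -> Lsn {
    Lsn(unsafe { callback_get_request_lsn_unsafe() })
}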
/*
* Get metrics, for the built-in metrics exporter that's part of the
* communicator process.
*/
struct LfcMetrics
callback_get_lfc_metrics_unsafe(void)
{
if (neon_use_communicator_worker)
return communicator_new_get_lfc_metrics_unsafe();
else
return lfc_get_metrics_unsafe();
}


@@ -12,6 +12,9 @@
#ifndef COMMUNICATOR_PROCESS_H
#define COMMUNICATOR_PROCESS_H
extern struct CommunicatorInitStruct *cis;
/* initialization early at postmaster startup */
extern void pg_init_communicator_process(void);
#endif /* COMMUNICATOR_PROCESS_H */


@@ -137,15 +137,6 @@ typedef struct FileCacheEntry
#define N_COND_VARS 64
#define CV_WAIT_TIMEOUT 10
#define MAX_PREWARM_WORKERS 8
typedef struct PrewarmWorkerState
{
uint32 prewarmed_pages;
uint32 skipped_pages;
TimestampTz completed;
} PrewarmWorkerState;
typedef struct FileCacheControl
{
uint64 generation; /* generation is needed to handle correct hash
@@ -191,47 +182,27 @@ typedef struct FileCacheControl
* again.
*/
HyperLogLogState wss_estimation;
/* Prewarmer state */
PrewarmWorkerState prewarm_workers[MAX_PREWARM_WORKERS];
size_t n_prewarm_workers;
size_t n_prewarm_entries;
size_t total_prewarm_pages;
size_t prewarm_batch;
bool prewarm_active;
bool prewarm_canceled;
dsm_handle prewarm_lfc_state_handle;
} FileCacheControl;
#define FILE_CACHE_STATE_MAGIC 0xfcfcfcfc
#define FILE_CACHE_STATE_BITMAP(fcs) ((uint8*)&(fcs)->chunks[(fcs)->n_chunks])
#define FILE_CACHE_STATE_SIZE_FOR_CHUNKS(n_chunks) (sizeof(FileCacheState) + (n_chunks)*sizeof(BufferTag) + (((n_chunks) * lfc_blocks_per_chunk)+7)/8)
#define FILE_CACHE_STATE_SIZE(fcs) (sizeof(FileCacheState) + (fcs->n_chunks)*sizeof(BufferTag) + (((fcs->n_chunks) << fcs->chunk_size_log)+7)/8)
static HTAB *lfc_hash;
static int lfc_desc = -1;
static LWLockId lfc_lock;
static int lfc_max_size;
static int lfc_size_limit;
static int lfc_prewarm_limit;
static int lfc_prewarm_batch;
int lfc_max_size;
int lfc_size_limit;
static int lfc_chunk_size_log = MAX_BLOCKS_PER_CHUNK_LOG;
static int lfc_blocks_per_chunk = MAX_BLOCKS_PER_CHUNK;
static char *lfc_path;
char *lfc_path;
static uint64 lfc_generation;
static FileCacheControl *lfc_ctl;
static bool lfc_do_prewarm;
bool lfc_store_prefetch_result;
bool lfc_prewarm_update_ws_estimation;
bool AmPrewarmWorker;
bool lfc_do_prewarm;
bool lfc_prewarm_cancel;
#define LFC_ENABLED() (lfc_ctl->limit != 0)
PGDLLEXPORT void lfc_prewarm_main(Datum main_arg);
/*
* Close LFC file if opened.
* All backends should close their LFC files once LFC is disabled.
@@ -257,6 +228,8 @@ lfc_switch_off(void)
{
int fd;
Assert(!neon_use_communicator_worker);
if (LFC_ENABLED())
{
HASH_SEQ_STATUS status;
@@ -322,6 +295,8 @@ lfc_maybe_disabled(void)
static bool
lfc_ensure_opened(void)
{
Assert(!neon_use_communicator_worker);
if (lfc_generation != lfc_ctl->generation)
{
lfc_close_file();
@@ -347,6 +322,9 @@ LfcShmemInit(void)
bool found;
static HASHCTL info;
if (neon_use_communicator_worker)
return;
if (lfc_max_size <= 0)
return;
@@ -536,7 +514,6 @@ lfc_init(void)
if (!process_shared_preload_libraries_in_progress)
neon_log(ERROR, "Neon module should be loaded via shared_preload_libraries");
DefineCustomBoolVariable("neon.store_prefetch_result_in_lfc",
"Immediately store received prefetch result in LFC",
NULL,
@@ -608,32 +585,6 @@ lfc_init(void)
lfc_check_chunk_size,
lfc_change_chunk_size,
NULL);
DefineCustomIntVariable("neon.file_cache_prewarm_limit",
"Maximal number of prewarmed chunks",
NULL,
&lfc_prewarm_limit,
INT_MAX, /* no limit by default */
0,
INT_MAX,
PGC_SIGHUP,
0,
NULL,
NULL,
NULL);
DefineCustomIntVariable("neon.file_cache_prewarm_batch",
"Number of pages retrivied by prewarm from page server",
NULL,
&lfc_prewarm_batch,
64,
1,
INT_MAX,
PGC_SIGHUP,
0,
NULL,
NULL,
NULL);
}
/*
@@ -658,7 +609,7 @@ lfc_get_state(size_t max_entries)
uint8* bitmap;
size_t n_pages = 0;
size_t n_entries = Min(max_entries, lfc_ctl->used - lfc_ctl->pinned);
size_t state_size = FILE_CACHE_STATE_SIZE_FOR_CHUNKS(n_entries);
size_t state_size = FILE_CACHE_STATE_SIZE_FOR_CHUNKS(n_entries, lfc_blocks_per_chunk);
fcs = (FileCacheState*)palloc0(state_size);
SET_VARSIZE(fcs, state_size);
fcs->magic = FILE_CACHE_STATE_MAGIC;
@@ -703,278 +654,6 @@ lfc_get_state(size_t max_entries)
return fcs;
}
/*
* Prewarm LFC cache to the specified state. It uses lfc_prefetch function to load prewarmed page without hoilding shared buffer lock
* and avoid race conditions with other backends.
*/
void
lfc_prewarm(FileCacheState* fcs, uint32 n_workers)
{
size_t fcs_chunk_size_log;
size_t n_entries;
size_t prewarm_batch = Min(lfc_prewarm_batch, readahead_buffer_size);
size_t fcs_size;
uint32_t max_prefetch_pages;
dsm_segment *seg;
BackgroundWorkerHandle* bgw_handle[MAX_PREWARM_WORKERS];
if (!lfc_ensure_opened())
return;
if (prewarm_batch == 0 || lfc_prewarm_limit == 0 || n_workers == 0)
{
elog(LOG, "LFC: prewarm is disabled");
return;
}
if (n_workers > MAX_PREWARM_WORKERS)
{
elog(ERROR, "LFC: Too much prewarm workers, maximum is %d", MAX_PREWARM_WORKERS);
}
if (fcs == NULL || fcs->n_chunks == 0)
{
elog(LOG, "LFC: nothing to prewarm");
return;
}
if (fcs->magic != FILE_CACHE_STATE_MAGIC)
{
elog(ERROR, "LFC: Invalid file cache state magic: %X", fcs->magic);
}
fcs_size = VARSIZE(fcs);
if (FILE_CACHE_STATE_SIZE(fcs) != fcs_size)
{
elog(ERROR, "LFC: Invalid file cache state size: %u vs. %u", (unsigned)FILE_CACHE_STATE_SIZE(fcs), VARSIZE(fcs));
}
fcs_chunk_size_log = fcs->chunk_size_log;
if (fcs_chunk_size_log > MAX_BLOCKS_PER_CHUNK_LOG)
{
elog(ERROR, "LFC: Invalid chunk size log: %u", fcs->chunk_size_log);
}
n_entries = Min(fcs->n_chunks, lfc_prewarm_limit);
Assert(n_entries != 0);
max_prefetch_pages = n_entries << fcs_chunk_size_log;
if (fcs->n_pages > max_prefetch_pages) {
elog(ERROR, "LFC: Number of pages in file cache state (%d) is more than the limit (%d)", fcs->n_pages, max_prefetch_pages);
}
LWLockAcquire(lfc_lock, LW_EXCLUSIVE);
/* Do not prewarm more entries than LFC limit */
if (lfc_ctl->limit <= lfc_ctl->size)
{
elog(LOG, "LFC: skip prewarm because LFC is already filled");
LWLockRelease(lfc_lock);
return;
}
if (lfc_ctl->prewarm_active)
{
LWLockRelease(lfc_lock);
elog(ERROR, "LFC: skip prewarm because another prewarm is still active");
}
lfc_ctl->n_prewarm_entries = n_entries;
lfc_ctl->n_prewarm_workers = n_workers;
lfc_ctl->prewarm_active = true;
lfc_ctl->prewarm_canceled = false;
lfc_ctl->prewarm_batch = prewarm_batch;
memset(lfc_ctl->prewarm_workers, 0, n_workers*sizeof(PrewarmWorkerState));
LWLockRelease(lfc_lock);
/* Calculate total number of pages to be prewarmed */
lfc_ctl->total_prewarm_pages = fcs->n_pages;
seg = dsm_create(fcs_size, 0);
memcpy(dsm_segment_address(seg), fcs, fcs_size);
lfc_ctl->prewarm_lfc_state_handle = dsm_segment_handle(seg);
/* Spawn background workers */
for (uint32 i = 0; i < n_workers; i++)
{
BackgroundWorker worker = {0};
worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
worker.bgw_start_time = BgWorkerStart_ConsistentState;
worker.bgw_restart_time = BGW_NEVER_RESTART;
strcpy(worker.bgw_library_name, "neon");
strcpy(worker.bgw_function_name, "lfc_prewarm_main");
snprintf(worker.bgw_name, BGW_MAXLEN, "LFC prewarm worker %d", i+1);
strcpy(worker.bgw_type, "LFC prewarm worker");
worker.bgw_main_arg = Int32GetDatum(i);
/* must set notify PID to wait for shutdown */
worker.bgw_notify_pid = MyProcPid;
if (!RegisterDynamicBackgroundWorker(&worker, &bgw_handle[i]))
{
ereport(LOG,
(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
errmsg("LFC: registering dynamic bgworker prewarm failed"),
errhint("Consider increasing the configuration parameter \"%s\".", "max_worker_processes")));
n_workers = i;
lfc_ctl->prewarm_canceled = true;
break;
}
}
for (uint32 i = 0; i < n_workers; i++)
{
bool interrupted;
do
{
interrupted = false;
PG_TRY();
{
BgwHandleStatus status = WaitForBackgroundWorkerShutdown(bgw_handle[i]);
if (status != BGWH_STOPPED && status != BGWH_POSTMASTER_DIED)
{
elog(LOG, "LFC: Unexpected status of prewarm worker termination: %d", status);
}
}
PG_CATCH();
{
elog(LOG, "LFC: cancel prewarm");
lfc_ctl->prewarm_canceled = true;
interrupted = true;
}
PG_END_TRY();
} while (interrupted);
if (!lfc_ctl->prewarm_workers[i].completed)
{
/* Background worker doesn't set completion time: it means that it was abnormally terminated */
elog(LOG, "LFC: prewarm worker %d failed", i+1);
/* Set completion time to prevent get_prewarm_info from considering this worker as active */
lfc_ctl->prewarm_workers[i].completed = GetCurrentTimestamp();
}
}
dsm_detach(seg);
LWLockAcquire(lfc_lock, LW_EXCLUSIVE);
lfc_ctl->prewarm_active = false;
LWLockRelease(lfc_lock);
}
void
lfc_prewarm_main(Datum main_arg)
{
size_t snd_idx = 0, rcv_idx = 0;
size_t n_sent = 0, n_received = 0;
size_t fcs_chunk_size_log;
size_t max_prefetch_pages;
size_t prewarm_batch;
size_t n_workers;
dsm_segment *seg;
FileCacheState* fcs;
uint8* bitmap;
BufferTag tag;
PrewarmWorkerState* ws;
uint32 worker_id = DatumGetInt32(main_arg);
AmPrewarmWorker = true;
pqsignal(SIGTERM, die);
BackgroundWorkerUnblockSignals();
seg = dsm_attach(lfc_ctl->prewarm_lfc_state_handle);
if (seg == NULL)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("could not map dynamic shared memory segment")));
fcs = (FileCacheState*) dsm_segment_address(seg);
prewarm_batch = lfc_ctl->prewarm_batch;
fcs_chunk_size_log = fcs->chunk_size_log;
n_workers = lfc_ctl->n_prewarm_workers;
max_prefetch_pages = lfc_ctl->n_prewarm_entries << fcs_chunk_size_log;
ws = &lfc_ctl->prewarm_workers[worker_id];
bitmap = FILE_CACHE_STATE_BITMAP(fcs);
/* enable prefetch in LFC */
lfc_store_prefetch_result = true;
lfc_do_prewarm = true; /* Flag for lfc_prefetch preventing replacement of existed entries if LFC cache is full */
elog(LOG, "LFC: worker %d start prewarming", worker_id);
while (!lfc_ctl->prewarm_canceled)
{
if (snd_idx < max_prefetch_pages)
{
if ((snd_idx >> fcs_chunk_size_log) % n_workers != worker_id)
{
/* If there are multiple workers, split chunks between them */
snd_idx += 1 << fcs_chunk_size_log;
}
else
{
if (BITMAP_ISSET(bitmap, snd_idx))
{
tag = fcs->chunks[snd_idx >> fcs_chunk_size_log];
tag.blockNum += snd_idx & ((1 << fcs_chunk_size_log) - 1);
if (!BufferTagIsValid(&tag)) {
elog(ERROR, "LFC: Invalid buffer tag: %u", tag.blockNum);
}
if (!lfc_cache_contains(BufTagGetNRelFileInfo(tag), tag.forkNum, tag.blockNum))
{
(void)communicator_prefetch_register_bufferv(tag, NULL, 1, NULL);
n_sent += 1;
}
else
{
ws->skipped_pages += 1;
BITMAP_CLR(bitmap, snd_idx);
}
}
snd_idx += 1;
}
}
if (n_sent >= n_received + prewarm_batch || snd_idx == max_prefetch_pages)
{
if (n_received == n_sent && snd_idx == max_prefetch_pages)
{
break;
}
if ((rcv_idx >> fcs_chunk_size_log) % n_workers != worker_id)
{
/* Skip chunks processed by other workers */
rcv_idx += 1 << fcs_chunk_size_log;
continue;
}
/* Locate next block to prefetch */
while (!BITMAP_ISSET(bitmap, rcv_idx))
{
rcv_idx += 1;
}
tag = fcs->chunks[rcv_idx >> fcs_chunk_size_log];
tag.blockNum += rcv_idx & ((1 << fcs_chunk_size_log) - 1);
if (communicator_prefetch_receive(tag))
{
ws->prewarmed_pages += 1;
}
else
{
ws->skipped_pages += 1;
}
rcv_idx += 1;
n_received += 1;
}
}
/* No need to perform prefetch cleanup here because prewarm worker will be terminated and
* connection to PS dropped just after return from this function.
*/
Assert(n_sent == n_received || lfc_ctl->prewarm_canceled);
elog(LOG, "LFC: worker %d complete prewarming: loaded %ld pages", worker_id, (long)n_received);
lfc_ctl->prewarm_workers[worker_id].completed = GetCurrentTimestamp();
}
void
lfc_invalidate(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber nblocks)
{
@@ -982,6 +661,8 @@ lfc_invalidate(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber nblocks)
FileCacheEntry *entry;
uint32 hash;
Assert(!neon_use_communicator_worker);
if (lfc_maybe_disabled()) /* fast exit if file cache is disabled */
return;
@@ -1027,6 +708,8 @@ lfc_cache_contains(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno)
bool found = false;
uint32 hash;
Assert(!neon_use_communicator_worker);
if (lfc_maybe_disabled()) /* fast exit if file cache is disabled */
return false;
@@ -1062,6 +745,8 @@ lfc_cache_containsv(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
uint32 hash;
int i = 0;
Assert(!neon_use_communicator_worker);
if (lfc_maybe_disabled()) /* fast exit if file cache is disabled */
return 0;
@@ -1169,6 +854,8 @@ lfc_readv_select(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
int blocks_read = 0;
int buf_offset = 0;
Assert(!neon_use_communicator_worker);
if (lfc_maybe_disabled()) /* fast exit if file cache is disabled */
return -1;
@@ -1479,7 +1166,7 @@ lfc_init_new_entry(FileCacheEntry* entry, uint32 hash)
/* Can't add this chunk - we don't have the space for it */
hash_search_with_hash_value(lfc_hash, &entry->key, hash,
HASH_REMOVE, NULL);
lfc_ctl->prewarm_canceled = true; /* cancel prewarm if LFC limit is reached */
lfc_prewarm_cancel = true; /* cancel prewarm if LFC limit is reached */
return false;
}
@@ -1534,6 +1221,8 @@ lfc_prefetch(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
int chunk_offs = BLOCK_TO_CHUNK_OFF(blkno);
Assert(!neon_use_communicator_worker);
if (lfc_maybe_disabled()) /* fast exit if file cache is disabled */
return false;
@@ -1679,6 +1368,8 @@ lfc_writev(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
uint32 entry_offset;
int buf_offset = 0;
Assert(!neon_use_communicator_worker);
if (lfc_maybe_disabled()) /* fast exit if file cache is disabled */
return;
@@ -1897,7 +1588,6 @@ lfc_get_stats(size_t *num_entries)
return entries;
}
/*
* Function returning data from the local file cache
* relation node/tablespace/database/blocknum and access_counter
@@ -2001,15 +1691,15 @@ lfc_approximate_working_set_size_seconds(time_t duration, bool reset)
}
/*
* Get metrics, for the built-in metrics exporter that's part of the communicator
* process.
* Get metrics, for the built-in metrics exporter that's part of the
* communicator process.
*
* NB: This is called from a Rust tokio task inside the communicator process.
* Acquiring lwlocks, elog(), allocating memory or anything else non-trivial
* is strictly prohibited here!
*/
struct LfcMetrics
callback_get_lfc_metrics_unsafe(void)
lfc_get_metrics_unsafe(void)
{
struct LfcMetrics result = {
.lfc_cache_size_limit = (int64) lfc_size_limit * 1024 * 1024,
@@ -2030,82 +1720,3 @@ callback_get_lfc_metrics_unsafe(void)
return result;
}
PG_FUNCTION_INFO_V1(get_local_cache_state);
Datum
get_local_cache_state(PG_FUNCTION_ARGS)
{
size_t max_entries = PG_ARGISNULL(0) ? lfc_prewarm_limit : PG_GETARG_INT32(0);
FileCacheState* fcs = lfc_get_state(max_entries);
if (fcs != NULL)
PG_RETURN_BYTEA_P((bytea*)fcs);
else
PG_RETURN_NULL();
}
PG_FUNCTION_INFO_V1(prewarm_local_cache);
Datum
prewarm_local_cache(PG_FUNCTION_ARGS)
{
bytea* state = PG_GETARG_BYTEA_PP(0);
uint32 n_workers = PG_GETARG_INT32(1);
FileCacheState* fcs = (FileCacheState*)state;
lfc_prewarm(fcs, n_workers);
PG_RETURN_NULL();
}
PG_FUNCTION_INFO_V1(get_prewarm_info);
Datum
get_prewarm_info(PG_FUNCTION_ARGS)
{
Datum values[4];
bool nulls[4];
TupleDesc tupdesc;
uint32 prewarmed_pages = 0;
uint32 skipped_pages = 0;
uint32 active_workers = 0;
uint32 total_pages;
size_t n_workers;
if (lfc_size_limit == 0)
PG_RETURN_NULL();
LWLockAcquire(lfc_lock, LW_SHARED);
if (!lfc_ctl || lfc_ctl->n_prewarm_workers == 0)
{
LWLockRelease(lfc_lock);
PG_RETURN_NULL();
}
n_workers = lfc_ctl->n_prewarm_workers;
total_pages = lfc_ctl->total_prewarm_pages;
for (size_t i = 0; i < n_workers; i++)
{
PrewarmWorkerState* ws = &lfc_ctl->prewarm_workers[i];
prewarmed_pages += ws->prewarmed_pages;
skipped_pages += ws->skipped_pages;
active_workers += ws->completed != 0;
}
LWLockRelease(lfc_lock);
tupdesc = CreateTemplateTupleDesc(4);
TupleDescInitEntry(tupdesc, (AttrNumber) 1, "total_pages", INT4OID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 2, "prewarmed_pages", INT4OID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 3, "skipped_pages", INT4OID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 4, "active_workers", INT4OID, -1, 0);
tupdesc = BlessTupleDesc(tupdesc);
MemSet(nulls, 0, sizeof(nulls));
values[0] = Int32GetDatum(total_pages);
values[1] = Int32GetDatum(prewarmed_pages);
values[2] = Int32GetDatum(skipped_pages);
values[3] = Int32GetDatum(active_workers);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}


@@ -11,21 +11,19 @@
#ifndef FILE_CACHE_h
#define FILE_CACHE_h
#include "neon_pgversioncompat.h"
#include "lfc_prewarm.h"
#include "neon.h"
typedef struct FileCacheState
{
int32 vl_len_; /* varlena header (do not touch directly!) */
uint32 magic;
uint32 n_chunks;
uint32 n_pages;
uint16 chunk_size_log;
BufferTag chunks[FLEXIBLE_ARRAY_MEMBER];
/* followed by bitmap */
} FileCacheState;
#include "neon_pgversioncompat.h"
/* GUCs */
extern bool lfc_store_prefetch_result;
extern int lfc_max_size;
extern int lfc_size_limit;
extern char *lfc_path;
extern bool lfc_do_prewarm;
extern bool lfc_prewarm_cancel;
/* functions for local file cache */
extern void lfc_invalidate(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber nblocks);
@@ -44,17 +42,13 @@ extern int lfc_cache_containsv(NRelFileInfo rinfo, ForkNumber forkNum,
extern void lfc_init(void);
extern bool lfc_prefetch(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
const void* buffer, XLogRecPtr lsn);
extern FileCacheState* lfc_get_state(size_t max_entries);
extern void lfc_prewarm(FileCacheState* fcs, uint32 n_workers);
typedef struct LfcStatsEntry
{
const char *metric_name;
bool isnull;
uint64 value;
} LfcStatsEntry;
extern FileCacheState* lfc_get_state(size_t max_entries);
extern LfcStatsEntry *lfc_get_stats(size_t *num_entries);
struct LfcMetrics; /* defined in communicator_bindings.h */
extern struct LfcMetrics lfc_get_metrics_unsafe(void);
typedef struct
{
uint32 pageoffs;
@@ -69,7 +63,6 @@ extern LocalCachePagesRec *lfc_local_cache_pages(size_t *num_entries);
extern int32 lfc_approximate_working_set_size_seconds(time_t duration, bool reset);
static inline bool
lfc_read(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
void *buffer)

pgxn/neon/lfc_prewarm.c (new file, 671 lines)

@@ -0,0 +1,671 @@
/*-------------------------------------------------------------------------
*
* lfc_prewarm.c
* Functions related to LFC prewarming
*
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include "bitmap.h"
#include "communicator.h"
#include "communicator_new.h"
#include "file_cache.h"
#include "lfc_prewarm.h"
#include "neon.h"
#include "neon_utils.h"
#include "pagestore_client.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/dsm.h"
#include "tcop/tcopprot.h"
#include "utils/timestamp.h"
#define MAX_PREWARM_WORKERS 8
typedef struct PrewarmWorkerState
{
uint32 prewarmed_pages;
uint32 skipped_pages;
TimestampTz completed;
} PrewarmWorkerState;
typedef struct PrewarmControl
{
/* -1 when not using workers, 0 when no prewarm has been performed */
size_t n_prewarm_workers;
size_t total_prewarm_pages;
bool prewarm_active;
bool prewarm_canceled;
/* These are used in the non-worker mode */
uint32 prewarmed_pages;
uint32 skipped_pages;
TimestampTz completed;
/* These are used with workers */
PrewarmWorkerState prewarm_workers[MAX_PREWARM_WORKERS];
dsm_handle prewarm_lfc_state_handle;
size_t prewarm_batch;
size_t n_prewarm_entries;
} PrewarmControl;
static PrewarmControl *prewarm_ctl;
static int lfc_prewarm_limit;
static int lfc_prewarm_batch;
static LWLockId prewarm_lock;
bool AmPrewarmWorker;
static void lfc_prewarm_with_workers(FileCacheState *fcs, uint32 n_workers);
static void lfc_prewarm_with_async_requests(FileCacheState *fcs);
PGDLLEXPORT void lfc_prewarm_main(Datum main_arg);
void
pg_init_prewarm(void)
{
DefineCustomIntVariable("neon.file_cache_prewarm_limit",
"Maximal number of prewarmed chunks",
NULL,
&lfc_prewarm_limit,
INT_MAX, /* no limit by default */
0,
INT_MAX,
PGC_SIGHUP,
0,
NULL,
NULL,
NULL);
DefineCustomIntVariable("neon.file_cache_prewarm_batch",
"Number of pages retrivied by prewarm from page server",
NULL,
&lfc_prewarm_batch,
64,
1,
INT_MAX,
PGC_SIGHUP,
0,
NULL,
NULL,
NULL);
}
static size_t
PrewarmShmemSize(void)
{
return sizeof(PrewarmControl);
}
void
PrewarmShmemRequest(void)
{
RequestAddinShmemSpace(PrewarmShmemSize());
RequestNamedLWLockTranche("prewarm_lock", 1);
}
void
PrewarmShmemInit(void)
{
bool found;
prewarm_ctl = (PrewarmControl *) ShmemInitStruct("Prewarmer shmem state",
PrewarmShmemSize(),
&found);
if (!found)
{
/* it's zeroed already */
prewarm_lock = (LWLockId) GetNamedLWLockTranche("prewarm_lock");
}
}
static void
validate_fcs(FileCacheState *fcs)
{
size_t fcs_size;
#if 0
size_t fcs_chunk_size_log;
#endif
if (fcs->magic != FILE_CACHE_STATE_MAGIC)
{
elog(ERROR, "LFC: Invalid file cache state magic: %X", fcs->magic);
}
fcs_size = VARSIZE(fcs);
if (FILE_CACHE_STATE_SIZE(fcs) != fcs_size)
{
elog(ERROR, "LFC: Invalid file cache state size: %u vs. %u", (unsigned)FILE_CACHE_STATE_SIZE(fcs), VARSIZE(fcs));
}
/* FIXME */
#if 0
fcs_chunk_size_log = fcs->chunk_size_log;
if (fcs_chunk_size_log > MAX_BLOCKS_PER_CHUNK_LOG)
{
elog(ERROR, "LFC: Invalid chunk size log: %u", fcs->chunk_size_log);
}
#endif
}
/*
* Prewarm the LFC to the specified state. It uses the lfc_prefetch function to
* load prewarmed pages without holding a shared buffer lock, to avoid race
* conditions with other backends.
*/
void
lfc_prewarm_with_workers(FileCacheState *fcs, uint32 n_workers)
{
size_t n_entries;
size_t prewarm_batch = Min(lfc_prewarm_batch, readahead_buffer_size);
size_t fcs_size = VARSIZE(fcs);
dsm_segment *seg;
BackgroundWorkerHandle* bgw_handle[MAX_PREWARM_WORKERS];
Assert(!neon_use_communicator_worker);
if (prewarm_batch == 0 || lfc_prewarm_limit == 0 || n_workers == 0)
{
elog(LOG, "LFC: prewarm is disabled");
return;
}
if (n_workers > MAX_PREWARM_WORKERS)
{
elog(ERROR, "LFC: too many prewarm workers, maximum is %d", MAX_PREWARM_WORKERS);
}
if (fcs == NULL || fcs->n_chunks == 0)
{
elog(LOG, "LFC: nothing to prewarm");
return;
}
n_entries = Min(fcs->n_chunks, lfc_prewarm_limit);
Assert(n_entries != 0);
LWLockAcquire(prewarm_lock, LW_EXCLUSIVE);
/* Do not prewarm more entries than LFC limit */
/* FIXME */
#if 0
if (prewarm_ctl->limit <= prewarm_ctl->size)
{
elog(LOG, "LFC: skip prewarm because LFC is already filled");
LWLockRelease(prewarm_lock);
return;
}
#endif
if (prewarm_ctl->prewarm_active)
{
LWLockRelease(prewarm_lock);
elog(ERROR, "LFC: skip prewarm because another prewarm is still active");
}
prewarm_ctl->n_prewarm_entries = n_entries;
prewarm_ctl->n_prewarm_workers = n_workers;
prewarm_ctl->prewarm_active = true;
prewarm_ctl->prewarm_canceled = false;
prewarm_ctl->prewarm_batch = prewarm_batch;
memset(prewarm_ctl->prewarm_workers, 0, n_workers*sizeof(PrewarmWorkerState));
/* Calculate total number of pages to be prewarmed */
prewarm_ctl->total_prewarm_pages = fcs->n_pages;
LWLockRelease(prewarm_lock);
seg = dsm_create(fcs_size, 0);
memcpy(dsm_segment_address(seg), fcs, fcs_size);
prewarm_ctl->prewarm_lfc_state_handle = dsm_segment_handle(seg);
/* Spawn background workers */
for (uint32 i = 0; i < n_workers; i++)
{
BackgroundWorker worker = {0};
worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
worker.bgw_start_time = BgWorkerStart_ConsistentState;
worker.bgw_restart_time = BGW_NEVER_RESTART;
strcpy(worker.bgw_library_name, "neon");
strcpy(worker.bgw_function_name, "lfc_prewarm_main");
snprintf(worker.bgw_name, BGW_MAXLEN, "LFC prewarm worker %d", i+1);
strcpy(worker.bgw_type, "LFC prewarm worker");
worker.bgw_main_arg = Int32GetDatum(i);
/* must set notify PID to wait for shutdown */
worker.bgw_notify_pid = MyProcPid;
if (!RegisterDynamicBackgroundWorker(&worker, &bgw_handle[i]))
{
ereport(LOG,
(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
errmsg("LFC: registering dynamic bgworker prewarm failed"),
errhint("Consider increasing the configuration parameter \"%s\".", "max_worker_processes")));
n_workers = i;
prewarm_ctl->prewarm_canceled = true;
break;
}
}
for (uint32 i = 0; i < n_workers; i++)
{
bool interrupted;
do
{
interrupted = false;
PG_TRY();
{
BgwHandleStatus status = WaitForBackgroundWorkerShutdown(bgw_handle[i]);
if (status != BGWH_STOPPED && status != BGWH_POSTMASTER_DIED)
{
elog(LOG, "LFC: Unexpected status of prewarm worker termination: %d", status);
}
}
PG_CATCH();
{
elog(LOG, "LFC: cancel prewarm");
prewarm_ctl->prewarm_canceled = true;
interrupted = true;
}
PG_END_TRY();
} while (interrupted);
if (!prewarm_ctl->prewarm_workers[i].completed)
{
/* Background worker didn't set its completion time, which means it was terminated abnormally */
elog(LOG, "LFC: prewarm worker %d failed", i+1);
/* Set completion time to prevent get_prewarm_info from considering this worker as active */
prewarm_ctl->prewarm_workers[i].completed = GetCurrentTimestamp();
}
}
dsm_detach(seg);
LWLockAcquire(prewarm_lock, LW_EXCLUSIVE);
prewarm_ctl->prewarm_active = false;
LWLockRelease(prewarm_lock);
}
void
lfc_prewarm_main(Datum main_arg)
{
size_t snd_idx = 0, rcv_idx = 0;
size_t n_sent = 0, n_received = 0;
size_t fcs_chunk_size_log;
size_t max_prefetch_pages;
size_t prewarm_batch;
size_t n_workers;
dsm_segment *seg;
FileCacheState* fcs;
uint8* bitmap;
BufferTag tag;
PrewarmWorkerState* ws;
uint32 worker_id = DatumGetInt32(main_arg);
Assert(!neon_use_communicator_worker);
AmPrewarmWorker = true;
pqsignal(SIGTERM, die);
BackgroundWorkerUnblockSignals();
seg = dsm_attach(prewarm_ctl->prewarm_lfc_state_handle);
if (seg == NULL)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("could not map dynamic shared memory segment")));
fcs = (FileCacheState*) dsm_segment_address(seg);
prewarm_batch = prewarm_ctl->prewarm_batch;
fcs_chunk_size_log = fcs->chunk_size_log;
n_workers = prewarm_ctl->n_prewarm_workers;
max_prefetch_pages = prewarm_ctl->n_prewarm_entries << fcs_chunk_size_log;
ws = &prewarm_ctl->prewarm_workers[worker_id];
bitmap = FILE_CACHE_STATE_BITMAP(fcs);
/* enable prefetch in LFC */
lfc_store_prefetch_result = true;
lfc_do_prewarm = true; /* Flag for lfc_prefetch, preventing replacement of existing entries if the LFC is full */
elog(LOG, "LFC: worker %d start prewarming", worker_id);
while (!prewarm_ctl->prewarm_canceled)
{
if (snd_idx < max_prefetch_pages)
{
if ((snd_idx >> fcs_chunk_size_log) % n_workers != worker_id)
{
/* If there are multiple workers, split chunks between them */
snd_idx += 1 << fcs_chunk_size_log;
}
else
{
if (BITMAP_ISSET(bitmap, snd_idx))
{
tag = fcs->chunks[snd_idx >> fcs_chunk_size_log];
tag.blockNum += snd_idx & ((1 << fcs_chunk_size_log) - 1);
if (!BufferTagIsValid(&tag))
elog(ERROR, "LFC: Invalid buffer tag: %u", tag.blockNum);
if (!lfc_cache_contains(BufTagGetNRelFileInfo(tag), tag.forkNum, tag.blockNum))
{
(void) communicator_prefetch_register_bufferv(tag, NULL, 1, NULL);
n_sent += 1;
}
else
{
ws->skipped_pages += 1;
BITMAP_CLR(bitmap, snd_idx);
}
}
snd_idx += 1;
}
}
if (n_sent >= n_received + prewarm_batch || snd_idx == max_prefetch_pages)
{
if (n_received == n_sent && snd_idx == max_prefetch_pages)
{
break;
}
if ((rcv_idx >> fcs_chunk_size_log) % n_workers != worker_id)
{
/* Skip chunks processed by other workers */
rcv_idx += 1 << fcs_chunk_size_log;
continue;
}
/* Locate next block to prefetch */
while (!BITMAP_ISSET(bitmap, rcv_idx))
{
rcv_idx += 1;
}
tag = fcs->chunks[rcv_idx >> fcs_chunk_size_log];
tag.blockNum += rcv_idx & ((1 << fcs_chunk_size_log) - 1);
if (communicator_prefetch_receive(tag))
{
ws->prewarmed_pages += 1;
}
else
{
ws->skipped_pages += 1;
}
rcv_idx += 1;
n_received += 1;
}
}
/* No need to perform prefetch cleanup here, because the prewarm worker will be
* terminated and the connection to the pageserver dropped just after returning
* from this function.
*/
Assert(n_sent == n_received || prewarm_ctl->prewarm_canceled);
elog(LOG, "LFC: worker %d complete prewarming: loaded %ld pages", worker_id, (long)n_received);
prewarm_ctl->prewarm_workers[worker_id].completed = GetCurrentTimestamp();
}
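
The worker split above relies on a single rule: chunk k belongs to worker k mod n_workers, applied identically on the send and receive sides. The same rule in isolation, written in Rust purely for illustration:

// Which worker owns a given page index, when work is split by chunk
// (a chunk covers 1 << chunk_size_log consecutive pages).
fn owning_worker(page_idx: usize, chunk_size_log: u32, n_workers: usize) -> usize {
    (page_idx >> chunk_size_log) % n_workers
}

// With 8-page chunks and 2 workers: pages 0..8 -> worker 0, 8..16 -> worker 1,
// 16..24 -> worker 0 again, e.g. owning_worker(12, 3, 2) == 1.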
/*
* Prewarm LFC cache to the specified state. Uses the new communicator
*
* FIXME: Is there a race condition because we're not holding Postgres
* buffer manager locks?
*/
static void
lfc_prewarm_with_async_requests(FileCacheState *fcs)
{
size_t n_entries;
uint8 *bitmap;
uint64 bitno;
int blocks_per_chunk;
Assert(neon_use_communicator_worker);
if (lfc_prewarm_limit == 0)
{
elog(LOG, "LFC: prewarm is disabled");
return;
}
if (fcs == NULL || fcs->n_chunks == 0)
{
elog(LOG, "LFC: nothing to prewarm");
return;
}
n_entries = Min(fcs->n_chunks, lfc_prewarm_limit);
Assert(n_entries != 0);
PG_TRY();
{
LWLockAcquire(prewarm_lock, LW_EXCLUSIVE);
/* Do not prewarm more entries than LFC limit */
/* FIXME */
#if 0
if (prewarm_ctl->limit <= prewarm_ctl->size)
{
elog(LOG, "LFC: skip prewarm because LFC is already filled");
LWLockRelease(prewarm_lock);
return;
}
#endif
if (prewarm_ctl->prewarm_active)
{
LWLockRelease(prewarm_lock);
elog(ERROR, "LFC: skip prewarm because another prewarm is still active");
}
prewarm_ctl->n_prewarm_entries = n_entries;
prewarm_ctl->n_prewarm_workers = -1;
prewarm_ctl->prewarm_active = true;
prewarm_ctl->prewarm_canceled = false;
/* Calculate total number of pages to be prewarmed */
prewarm_ctl->total_prewarm_pages = fcs->n_pages;
LWLockRelease(prewarm_lock);
elog(LOG, "LFC: start prewarming");
lfc_do_prewarm = true;
lfc_prewarm_cancel = false;
bitmap = FILE_CACHE_STATE_BITMAP(fcs);
blocks_per_chunk = 1 << fcs->chunk_size_log;
bitno = 0;
for (uint32 chunkno = 0; chunkno < fcs->n_chunks; chunkno++)
{
BufferTag *chunk_tag = &fcs->chunks[chunkno];
BlockNumber request_startblkno = InvalidBlockNumber;
BlockNumber request_endblkno;
if (!BufferTagIsValid(chunk_tag))
elog(ERROR, "LFC: Invalid buffer tag: %u", chunk_tag->blockNum);
if (lfc_prewarm_cancel)
{
prewarm_ctl->prewarm_canceled = true;
break;
}
/* take next chunk */
for (int j = 0; j < blocks_per_chunk; j++)
{
BlockNumber blkno = chunk_tag->blockNum + j;
if (BITMAP_ISSET(bitmap, bitno))
{
if (request_startblkno != InvalidBlockNumber)
{
if (request_endblkno == blkno)
{
/* append this block to the request */
request_endblkno++;
}
else
{
/* flush this request, and start new one */
communicator_new_prefetch_register_bufferv(
BufTagGetNRelFileInfo(*chunk_tag),
chunk_tag->forkNum,
request_startblkno,
request_endblkno - request_startblkno
);
request_startblkno = blkno;
request_endblkno = blkno + 1;
}
}
else
{
/* flush this request, if any, and start new one */
if (request_startblkno != InvalidBlockNumber)
{
communicator_new_prefetch_register_bufferv(
BufTagGetNRelFileInfo(*chunk_tag),
chunk_tag->forkNum,
request_startblkno,
request_endblkno - request_startblkno
);
}
request_startblkno = blkno;
request_endblkno = blkno + 1;
}
prewarm_ctl->prewarmed_pages += 1;
}
bitno++;
}
/* flush the last request of this chunk, if any */
if (request_startblkno != InvalidBlockNumber)
{
communicator_new_prefetch_register_bufferv(
BufTagGetNRelFileInfo(*chunk_tag),
chunk_tag->forkNum,
request_startblkno,
request_endblkno - request_startblkno
);
}
request_startblkno = request_endblkno = InvalidBlockNumber;
}
elog(LOG, "LFC: complete prewarming: loaded %lu pages", (unsigned long) prewarm_ctl->prewarmed_pages);
prewarm_ctl->completed = GetCurrentTimestamp();
LWLockAcquire(prewarm_lock, LW_EXCLUSIVE);
prewarm_ctl->prewarm_active = false;
LWLockRelease(prewarm_lock);
}
PG_CATCH();
{
elog(LOG, "LFC: cancel prewarm");
prewarm_ctl->prewarm_canceled = true;
prewarm_ctl->prewarm_active = false;
}
PG_END_TRY();
}
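
The inner loop above coalesces runs of set bits within a chunk into ranged prefetch requests. The same idea reduced to a standalone routine over a plain bit vector, written in Rust for illustration only; the patch itself works directly on the FileCacheState bitmap:

// Turn set bits into (start_block, nblocks) ranges, the shape fed to
// communicator_new_prefetch_register_bufferv() above.
fn coalesce_ranges(first_block: u32, present: &[bool]) -> Vec<(u32, u32)> {
    let mut ranges: Vec<(u32, u32)> = Vec::new();
    for (i, &bit) in present.iter().enumerate() {
        if !bit {
            continue;
        }
        let blkno = first_block + i as u32;
        // Extend the current run if this block is contiguous with it...
        if let Some((start, len)) = ranges.last_mut() {
            if *start + *len == blkno {
                *len += 1;
                continue;
            }
        }
        // ...otherwise start a new run.
        ranges.push((blkno, 1));
    }
    ranges
}

// e.g. coalesce_ranges(10, &[true, true, false, true]) == vec![(10, 2), (13, 1)]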
PG_FUNCTION_INFO_V1(get_local_cache_state);
Datum
get_local_cache_state(PG_FUNCTION_ARGS)
{
size_t max_entries = PG_ARGISNULL(0) ? lfc_prewarm_limit : PG_GETARG_INT32(0);
FileCacheState* fcs;
if (neon_use_communicator_worker)
fcs = communicator_new_get_lfc_state(max_entries);
else
fcs = lfc_get_state(max_entries);
if (fcs != NULL)
PG_RETURN_BYTEA_P((bytea*)fcs);
else
PG_RETURN_NULL();
}
PG_FUNCTION_INFO_V1(prewarm_local_cache);
Datum
prewarm_local_cache(PG_FUNCTION_ARGS)
{
bytea* state = PG_GETARG_BYTEA_PP(0);
uint32 n_workers = PG_GETARG_INT32(1);
FileCacheState* fcs;
fcs = (FileCacheState *)state;
validate_fcs(fcs);
if (neon_use_communicator_worker)
lfc_prewarm_with_async_requests(fcs);
else
lfc_prewarm_with_workers(fcs, n_workers);
PG_RETURN_NULL();
}
PG_FUNCTION_INFO_V1(get_prewarm_info);
Datum
get_prewarm_info(PG_FUNCTION_ARGS)
{
Datum values[4];
bool nulls[4];
TupleDesc tupdesc;
uint32 prewarmed_pages = 0;
uint32 skipped_pages = 0;
uint32 active_workers = 0;
uint32 total_pages;
if (lfc_size_limit == 0)
PG_RETURN_NULL();
LWLockAcquire(prewarm_lock, LW_SHARED);
if (!prewarm_ctl || prewarm_ctl->n_prewarm_workers == 0)
{
LWLockRelease(prewarm_lock);
PG_RETURN_NULL();
}
if (prewarm_ctl->n_prewarm_workers == -1)
{
total_pages = prewarm_ctl->total_prewarm_pages;
prewarmed_pages = prewarm_ctl->prewarmed_pages;
skipped_pages = prewarm_ctl->skipped_pages;
active_workers = 1;
}
else
{
size_t n_workers;
n_workers = prewarm_ctl->n_prewarm_workers;
total_pages = prewarm_ctl->total_prewarm_pages;
for (size_t i = 0; i < n_workers; i++)
{
PrewarmWorkerState *ws = &prewarm_ctl->prewarm_workers[i];
prewarmed_pages += ws->prewarmed_pages;
skipped_pages += ws->skipped_pages;
active_workers += ws->completed != 0;
}
}
LWLockRelease(prewarm_lock);
tupdesc = CreateTemplateTupleDesc(4);
TupleDescInitEntry(tupdesc, (AttrNumber) 1, "total_pages", INT4OID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 2, "prewarmed_pages", INT4OID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 3, "skipped_pages", INT4OID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 4, "active_workers", INT4OID, -1, 0);
tupdesc = BlessTupleDesc(tupdesc);
MemSet(nulls, 0, sizeof(nulls));
values[0] = Int32GetDatum(total_pages);
values[1] = Int32GetDatum(prewarmed_pages);
values[2] = Int32GetDatum(skipped_pages);
values[3] = Int32GetDatum(active_workers);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}

pgxn/neon/lfc_prewarm.h (new file, 39 lines)

@@ -0,0 +1,39 @@
/*-------------------------------------------------------------------------
*
* lfc_prewarm.h
* Local File Cache prewarmer
*
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
*-------------------------------------------------------------------------
*/
#ifndef LFC_PREWARM_H
#define LFC_PREWARM_H
#include "storage/buf_internals.h"
typedef struct FileCacheState
{
int32 vl_len_; /* varlena header (do not touch directly!) */
uint32 magic;
uint32 n_chunks;
uint32 n_pages;
uint16 chunk_size_log;
BufferTag chunks[FLEXIBLE_ARRAY_MEMBER];
/* followed by bitmap */
} FileCacheState;
#define FILE_CACHE_STATE_MAGIC 0xfcfcfcfc
#define FILE_CACHE_STATE_BITMAP(fcs) ((uint8*)&(fcs)->chunks[(fcs)->n_chunks])
#define FILE_CACHE_STATE_SIZE_FOR_CHUNKS(n_chunks, blocks_per_chunk) (sizeof(FileCacheState) + (n_chunks)*sizeof(BufferTag) + (((n_chunks) * (blocks_per_chunk)) + 7) / 8)
#define FILE_CACHE_STATE_SIZE(fcs) (sizeof(FileCacheState) + ((fcs)->n_chunks)*sizeof(BufferTag) + ((((fcs)->n_chunks) << (fcs)->chunk_size_log) + 7) / 8)
extern void pg_init_prewarm(void);
extern void PrewarmShmemRequest(void);
extern void PrewarmShmemInit(void);
#endif /* LFC_PREWARM_H */
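The macros above encode the serialized layout: a FileCacheState is a single varlena whose length must cover the fixed header, the n_chunks BufferTags, and the trailing bitmap with one bit per block. prewarm_local_cache() calls validate_fcs() on the bytea it receives, but that function's body is not part of this hunk, so the following is only a sketch of the checks such a validator needs; validate_fcs_sketch is a hypothetical name, and VARSIZE is the standard PostgreSQL varlena-length macro.

/* Sketch only: approximates the magic and size checks implied by the layout above. */
static void
validate_fcs_sketch(FileCacheState *fcs)
{
    uint32      size = VARSIZE(fcs);

    if (size < sizeof(FileCacheState) || fcs->magic != FILE_CACHE_STATE_MAGIC)
        elog(ERROR, "LFC: invalid file cache state");

    /* the varlena must be large enough for n_chunks tags plus the bitmap */
    if (size < FILE_CACHE_STATE_SIZE(fcs))
        elog(ERROR, "LFC: truncated file cache state: %lu bytes, expected %lu",
             (unsigned long) size, (unsigned long) FILE_CACHE_STATE_SIZE(fcs));
}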


@@ -72,6 +72,7 @@ char *neon_branch_id;
char *neon_endpoint_id;
int32 max_cluster_size;
char *pageserver_connstring;
char *pageserver_grpc_urls;
char *neon_auth_token;
int readahead_buffer_size = 128;
@@ -81,7 +82,7 @@ int neon_protocol_version = 3;
static int neon_compute_mode = 0;
static int max_reconnect_attempts = 60;
static int stripe_size;
int neon_stripe_size;
static int max_sockets;
static int pageserver_response_log_timeout = 10000;
@@ -92,13 +93,6 @@ static int conf_refresh_reconnect_attempt_threshold = 16;
// Hadron: timeout for refresh errors (1 minute)
static uint64 kRefreshErrorTimeoutUSec = 1 * USECS_PER_MINUTE;
typedef struct
{
char connstring[MAX_SHARDS][MAX_PAGESERVER_CONNSTRING_SIZE];
size_t num_shards;
size_t stripe_size;
} ShardMap;
/*
* PagestoreShmemState is kept in shared memory. It contains the connection
* strings for each shard.
@@ -187,6 +181,8 @@ static void pageserver_disconnect_shard(shardno_t shard_no);
// HADRON
shardno_t get_num_shards(void);
static void AssignShardMap(const char *newval);
static bool
PagestoreShmemIsValid(void)
{
@@ -200,8 +196,8 @@ PagestoreShmemIsValid(void)
* not valid, returns false. The contents of *result are undefined in
* that case, and must not be relied on.
*/
static bool
ParseShardMap(const char *connstr, ShardMap *result)
bool
parse_shard_map(const char *connstr, ShardMap *result)
{
const char *p;
int nshards = 0;
@@ -246,24 +242,31 @@ ParseShardMap(const char *connstr, ShardMap *result)
if (result)
{
result->num_shards = nshards;
result->stripe_size = stripe_size;
result->stripe_size = neon_stripe_size;
}
return true;
}
/* GUC hooks for neon.pageserver_connstring */
static bool
CheckPageserverConnstring(char **newval, void **extra, GucSource source)
{
char *p = *newval;
return ParseShardMap(p, NULL);
return parse_shard_map(p, NULL);
}
static void
AssignPageserverConnstring(const char *newval, void *extra)
{
ShardMap shard_map;
/*
* 'neon.pageserver_connstring' is ignored if the new communicator is used.
* In that case, the shard map is loaded from 'neon.pageserver_grpc_urls'
* instead, and that happens in the communicator process only.
*/
if (neon_use_communicator_worker)
return;
/*
* Only postmaster updates the copy in shared memory.
@@ -271,11 +274,29 @@ AssignPageserverConnstring(const char *newval, void *extra)
if (!PagestoreShmemIsValid() || IsUnderPostmaster)
return;
if (!ParseShardMap(newval, &shard_map))
AssignShardMap(newval);
}
/* GUC hooks for neon.pageserver_grpc_urls */
static bool
CheckPageserverGrpcUrls(char **newval, void **extra, GucSource source)
{
char *p = *newval;
return parse_shard_map(p, NULL);
}
static void
AssignShardMap(const char *newval)
{
ShardMap shard_map;
if (!parse_shard_map(newval, &shard_map))
{
/*
* shouldn't happen, because we already checked the value in
* CheckPageserverConnstring
* CheckPageserverConnstring/CheckPageserverGrpcUrls
*/
elog(ERROR, "could not parse shard map");
}
@@ -294,6 +315,27 @@ AssignPageserverConnstring(const char *newval, void *extra)
}
}
/*
* Set the 'num_shards' variable in shared memory.
*
* This is only used with the new communicator. The new communicator doesn't
* use the shard_map in shared memory, except for the shard count, which is
* needed by get_num_shards() calls in the walproposer. This function sets
* that count. It is called only from the communicator process, at process
* startup or when the configuration is reloaded.
*/
void
AssignNumShards(shardno_t num_shards)
{
Assert(neon_use_communicator_worker);
pg_atomic_add_fetch_u64(&pagestore_shared->begin_update_counter, 1);
pg_write_barrier();
pagestore_shared->shard_map.num_shards = num_shards;
pg_write_barrier();
pg_atomic_add_fetch_u64(&pagestore_shared->end_update_counter, 1);
}
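The counter pair used above behaves like a seqlock: the writer bumps begin_update_counter, updates the map, then bumps end_update_counter, so a reader can detect that it raced with an update by comparing the two counters. The reader side (get_num_shards()) is not shown in this hunk, so the following is only a sketch of that pattern using the same shared-memory field names; read_num_shards_stable() is a hypothetical helper, not part of the patch.

/* Sketch only: retry until the shard count is observed without a concurrent update. */
static shardno_t
read_num_shards_stable(void)
{
    uint64      before;
    uint64      after;
    size_t      num_shards;

    do
    {
        before = pg_atomic_read_u64(&pagestore_shared->begin_update_counter);
        pg_read_barrier();
        num_shards = pagestore_shared->shard_map.num_shards;
        pg_read_barrier();
        after = pg_atomic_read_u64(&pagestore_shared->end_update_counter);
    } while (before != after);

    return (shardno_t) num_shards;
}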
/* BEGIN_HADRON */
/**
* Return the total number of shards seen in the shard map.
@@ -397,10 +439,10 @@ get_shard_number(BufferTag *tag)
#if PG_MAJORVERSION_NUM < 16
hash = murmurhash32(tag->rnode.relNode);
hash = hash_combine(hash, murmurhash32(tag->blockNum / stripe_size));
hash = hash_combine(hash, murmurhash32(tag->blockNum / neon_stripe_size));
#else
hash = murmurhash32(tag->relNumber);
hash = hash_combine(hash, murmurhash32(tag->blockNum / stripe_size));
hash = hash_combine(hash, murmurhash32(tag->blockNum / neon_stripe_size));
#endif
return hash % n_shards;
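/*
* Illustration of the striping above: blockNum / neon_stripe_size selects the
* stripe, so with the default stripe size of 2048 blocks (16 MB with 8 kB pages)
* blocks 0..2047 of a relation all map to one shard, blocks 2048..4095 to
* another pseudo-randomly chosen shard, and so on.
*/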
@@ -1478,6 +1520,15 @@ pg_init_libpagestore(void)
0, /* no flags required */
CheckPageserverConnstring, AssignPageserverConnstring, NULL);
DefineCustomStringVariable("neon.pageserver_grpc_urls",
"list of gRPC URLs for the page servers",
NULL,
&pageserver_grpc_urls,
"",
PGC_SIGHUP,
0, /* no flags required */
CheckPageserverGrpcUrls, NULL, NULL);
DefineCustomStringVariable("neon.timeline_id",
"Neon timeline_id the server is running on",
NULL,
@@ -1524,7 +1575,7 @@ pg_init_libpagestore(void)
DefineCustomIntVariable("neon.stripe_size",
"sharding stripe size",
NULL,
&stripe_size,
&neon_stripe_size,
2048, 1, INT_MAX,
PGC_SIGHUP,
GUC_UNIT_BLOCKS,
@@ -1643,7 +1694,7 @@ pg_init_libpagestore(void)
if (neon_auth_token)
neon_log(LOG, "using storage auth token from NEON_AUTH_TOKEN environment variable");
if (pageserver_connstring[0])
if (pageserver_connstring[0] || pageserver_grpc_urls[0])
{
neon_log(PageStoreTrace, "set neon_smgr hook");
smgr_hook = smgr_neon;


@@ -21,6 +21,7 @@
#include "replication/logicallauncher.h"
#include "replication/slot.h"
#include "replication/walsender.h"
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/ipc.h"
#include "funcapi.h"
@@ -31,6 +32,7 @@
#include "utils/guc_tables.h"
#include "communicator.h"
#include "communicator_new.h"
#include "communicator_process.h"
#include "extension_server.h"
#include "file_cache.h"
@@ -473,6 +475,16 @@ _PG_init(void)
load_file("$libdir/neon_rmgr", false);
#endif
DefineCustomBoolVariable(
"neon.use_communicator_worker",
"Uses the communicator worker implementation",
NULL,
&neon_use_communicator_worker,
true,
PGC_POSTMASTER,
0,
NULL, NULL, NULL);
if (lakebase_mode) {
prev_emit_log_hook = emit_log_hook;
emit_log_hook = DatabricksSqlErrorHookImpl;
@@ -512,12 +524,14 @@ _PG_init(void)
pg_init_libpagestore();
relsize_hash_init();
lfc_init();
pg_init_prewarm();
pg_init_walproposer();
init_lwlsncache();
pg_init_lwlsncache();
pg_init_communicator_process();
pg_init_communicator();
Custom_XLogReaderRoutines = NeonOnDemandXLogReaderRoutines;
InitUnstableExtensionsSupport();
@@ -723,7 +737,10 @@ approximate_working_set_size_seconds(PG_FUNCTION_ARGS)
duration = PG_ARGISNULL(0) ? (time_t) -1 : PG_GETARG_INT32(0);
dc = lfc_approximate_working_set_size_seconds(duration, false);
if (neon_use_communicator_worker)
dc = communicator_new_approximate_working_set_size_seconds(duration, false);
else
dc = lfc_approximate_working_set_size_seconds(duration, false);
if (dc < 0)
PG_RETURN_NULL();
else
@@ -736,7 +753,10 @@ approximate_working_set_size(PG_FUNCTION_ARGS)
bool reset = PG_GETARG_BOOL(0);
int32 dc;
dc = lfc_approximate_working_set_size_seconds(-1, reset);
if (neon_use_communicator_worker)
dc = communicator_new_approximate_working_set_size_seconds(-1, reset);
else
dc = lfc_approximate_working_set_size_seconds(-1, reset);
if (dc < 0)
PG_RETURN_NULL();
else
@@ -754,7 +774,10 @@ neon_get_lfc_stats(PG_FUNCTION_ARGS)
InitMaterializedSRF(fcinfo, 0);
/* lfc_get_stats() does all the heavy lifting */
entries = lfc_get_stats(&num_entries);
if (neon_use_communicator_worker)
entries = communicator_new_lfc_get_stats(&num_entries);
else
entries = lfc_get_stats(&num_entries);
/* Convert the LfcStatsEntrys to a result set */
for (size_t i = 0; i < num_entries; i++)
@@ -828,11 +851,13 @@ neon_shmem_request_hook(void)
#endif
LfcShmemRequest();
PrewarmShmemRequest();
NeonPerfCountersShmemRequest();
PagestoreShmemRequest();
RelsizeCacheShmemRequest();
WalproposerShmemRequest();
LwLsnCacheShmemRequest();
CommunicatorNewShmemRequest();
}
@@ -850,6 +875,7 @@ neon_shmem_startup_hook(void)
LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
LfcShmemInit();
PrewarmShmemInit();
NeonPerfCountersShmemInit();
if (lakebase_mode) {
DatabricksMetricsShmemInit();
@@ -858,6 +884,7 @@ neon_shmem_startup_hook(void)
RelsizeCacheShmemInit();
WalproposerShmemInit();
LwLsnCacheShmemInit();
CommunicatorNewShmemInit();
#if PG_MAJORVERSION_NUM >= 17
WAIT_EVENT_NEON_LFC_MAINTENANCE = WaitEventExtensionNew("Neon/FileCache_Maintenance");


@@ -85,5 +85,11 @@ extern void WalproposerShmemInit(void);
extern void LwLsnCacheShmemInit(void);
extern void NeonPerfCountersShmemInit(void);
typedef struct LfcStatsEntry
{
const char *metric_name;
bool isnull;
uint64 value;
} LfcStatsEntry;
#endif /* NEON_H */


@@ -85,12 +85,54 @@ static set_lwlsn_db_hook_type prev_set_lwlsn_db_hook = NULL;
static void neon_set_max_lwlsn(XLogRecPtr lsn);
void
init_lwlsncache(void)
pg_init_lwlsncache(void)
{
if (!process_shared_preload_libraries_in_progress)
ereport(ERROR, errcode(ERRCODE_INTERNAL_ERROR), errmsg("Loading of shared preload libraries is not in progress. Exiting"));
lwlc_register_gucs();
}
void
LwLsnCacheShmemRequest(void)
{
Size requested_size;
if (neon_use_communicator_worker)
return;
requested_size = sizeof(LwLsnCacheCtl);
requested_size += hash_estimate_size(lwlsn_cache_size, sizeof(LastWrittenLsnCacheEntry));
RequestAddinShmemSpace(requested_size);
}
void
LwLsnCacheShmemInit(void)
{
static HASHCTL info;
bool found;
if (neon_use_communicator_worker)
return;
Assert(lwlsn_cache_size > 0);
info.keysize = sizeof(BufferTag);
info.entrysize = sizeof(LastWrittenLsnCacheEntry);
lastWrittenLsnCache = ShmemInitHash("last_written_lsn_cache",
lwlsn_cache_size, lwlsn_cache_size,
&info,
HASH_ELEM | HASH_BLOBS);
LwLsnCache = ShmemInitStruct("neon/LwLsnCacheCtl", sizeof(LwLsnCacheCtl), &found);
// Now set the size in the struct
LwLsnCache->lastWrittenLsnCacheSize = lwlsn_cache_size;
if (found) {
return;
}
dlist_init(&LwLsnCache->lastWrittenLsnLRU);
LwLsnCache->maxLastWrittenLsn = GetRedoRecPtr();
prev_set_lwlsn_block_range_hook = set_lwlsn_block_range_hook;
set_lwlsn_block_range_hook = neon_set_lwlsn_block_range;
@@ -106,41 +148,6 @@ init_lwlsncache(void)
set_lwlsn_db_hook = neon_set_lwlsn_db;
}
void
LwLsnCacheShmemRequest(void)
{
Size requested_size = sizeof(LwLsnCacheCtl);
requested_size += hash_estimate_size(lwlsn_cache_size, sizeof(LastWrittenLsnCacheEntry));
RequestAddinShmemSpace(requested_size);
}
void
LwLsnCacheShmemInit(void)
{
static HASHCTL info;
bool found;
if (lwlsn_cache_size > 0)
{
info.keysize = sizeof(BufferTag);
info.entrysize = sizeof(LastWrittenLsnCacheEntry);
lastWrittenLsnCache = ShmemInitHash("last_written_lsn_cache",
lwlsn_cache_size, lwlsn_cache_size,
&info,
HASH_ELEM | HASH_BLOBS);
LwLsnCache = ShmemInitStruct("neon/LwLsnCacheCtl", sizeof(LwLsnCacheCtl), &found);
// Now set the size in the struct
LwLsnCache->lastWrittenLsnCacheSize = lwlsn_cache_size;
if (found) {
return;
}
}
dlist_init(&LwLsnCache->lastWrittenLsnLRU);
LwLsnCache->maxLastWrittenLsn = GetRedoRecPtr();
}
/*
* neon_get_lwlsn -- Returns maximal LSN of written page.
* It returns an upper bound for the last written LSN of a given page,
@@ -155,6 +162,7 @@ neon_get_lwlsn(NRelFileInfo rlocator, ForkNumber forknum, BlockNumber blkno)
XLogRecPtr lsn;
LastWrittenLsnCacheEntry* entry;
Assert(!neon_use_communicator_worker);
Assert(LwLsnCache->lastWrittenLsnCacheSize != 0);
LWLockAcquire(LastWrittenLsnLock, LW_SHARED);
@@ -207,7 +215,10 @@ neon_get_lwlsn(NRelFileInfo rlocator, ForkNumber forknum, BlockNumber blkno)
return lsn;
}
static void neon_set_max_lwlsn(XLogRecPtr lsn) {
static void
neon_set_max_lwlsn(XLogRecPtr lsn)
{
Assert(!neon_use_communicator_worker);
LWLockAcquire(LastWrittenLsnLock, LW_EXCLUSIVE);
LwLsnCache->maxLastWrittenLsn = lsn;
LWLockRelease(LastWrittenLsnLock);
@@ -228,6 +239,7 @@ neon_get_lwlsn_v(NRelFileInfo relfilenode, ForkNumber forknum,
LastWrittenLsnCacheEntry* entry;
XLogRecPtr lsn;
Assert(!neon_use_communicator_worker);
Assert(LwLsnCache->lastWrittenLsnCacheSize != 0);
Assert(nblocks > 0);
Assert(PointerIsValid(lsns));
@@ -376,6 +388,8 @@ SetLastWrittenLSNForBlockRangeInternal(XLogRecPtr lsn,
XLogRecPtr
neon_set_lwlsn_block_range(XLogRecPtr lsn, NRelFileInfo rlocator, ForkNumber forknum, BlockNumber from, BlockNumber n_blocks)
{
Assert(!neon_use_communicator_worker);
if (lsn == InvalidXLogRecPtr || n_blocks == 0 || LwLsnCache->lastWrittenLsnCacheSize == 0)
return lsn;
@@ -412,6 +426,8 @@ neon_set_lwlsn_block_v(const XLogRecPtr *lsns, NRelFileInfo relfilenode,
Oid dbOid = NInfoGetDbOid(relfilenode);
Oid relNumber = NInfoGetRelNumber(relfilenode);
Assert(!neon_use_communicator_worker);
if (lsns == NULL || nblocks == 0 || LwLsnCache->lastWrittenLsnCacheSize == 0 ||
NInfoGetRelNumber(relfilenode) == InvalidOid)
return InvalidXLogRecPtr;
@@ -469,6 +485,7 @@ neon_set_lwlsn_block_v(const XLogRecPtr *lsns, NRelFileInfo relfilenode,
XLogRecPtr
neon_set_lwlsn_block(XLogRecPtr lsn, NRelFileInfo rlocator, ForkNumber forknum, BlockNumber blkno)
{
Assert(!neon_use_communicator_worker);
return neon_set_lwlsn_block_range(lsn, rlocator, forknum, blkno, 1);
}
@@ -478,6 +495,7 @@ neon_set_lwlsn_block(XLogRecPtr lsn, NRelFileInfo rlocator, ForkNumber forknum,
XLogRecPtr
neon_set_lwlsn_relation(XLogRecPtr lsn, NRelFileInfo rlocator, ForkNumber forknum)
{
Assert(!neon_use_communicator_worker);
return neon_set_lwlsn_block(lsn, rlocator, forknum, REL_METADATA_PSEUDO_BLOCKNO);
}
@@ -488,6 +506,8 @@ XLogRecPtr
neon_set_lwlsn_db(XLogRecPtr lsn)
{
NRelFileInfo dummyNode = {InvalidOid, InvalidOid, InvalidOid};
Assert(!neon_use_communicator_worker);
return neon_set_lwlsn_block(lsn, dummyNode, MAIN_FORKNUM, 0);
}


@@ -3,7 +3,7 @@
#include "neon_pgversioncompat.h"
void init_lwlsncache(void);
extern void pg_init_lwlsncache(void);
/* Hooks */
XLogRecPtr neon_get_lwlsn(NRelFileInfo rlocator, ForkNumber forknum, BlockNumber blkno);
@@ -14,4 +14,4 @@ XLogRecPtr neon_set_lwlsn_block(XLogRecPtr lsn, NRelFileInfo rlocator, ForkNumbe
XLogRecPtr neon_set_lwlsn_relation(XLogRecPtr lsn, NRelFileInfo rlocator, ForkNumber forknum);
XLogRecPtr neon_set_lwlsn_db(XLogRecPtr lsn);
#endif /* NEON_LWLSNCACHE_H */
#endif /* NEON_LWLSNCACHE_H */


@@ -237,15 +237,27 @@ extern void prefetch_on_ps_disconnect(void);
extern page_server_api *page_server;
extern char *pageserver_connstring;
extern char *pageserver_grpc_urls;
extern int flush_every_n_requests;
extern int readahead_buffer_size;
extern char *neon_timeline;
extern char *neon_tenant;
extern int32 max_cluster_size;
extern int neon_protocol_version;
extern int neon_stripe_size;
typedef struct
{
char connstring[MAX_SHARDS][MAX_PAGESERVER_CONNSTRING_SIZE];
size_t num_shards;
size_t stripe_size;
} ShardMap;
extern bool parse_shard_map(const char *connstr, ShardMap *result);
extern shardno_t get_shard_number(BufferTag* tag);
extern void AssignNumShards(shardno_t num_shards);
extern const f_smgr *smgr_neon(ProcNumber backend, NRelFileInfo rinfo);
extern void smgr_init_neon(void);
extern void readahead_buffer_resize(int newsize, void *extra);
@@ -290,6 +302,7 @@ extern int64 neon_dbsize(Oid dbNode);
extern void neon_get_request_lsns(NRelFileInfo rinfo, ForkNumber forknum,
BlockNumber blkno, neon_request_lsns *output,
BlockNumber nblocks);
extern XLogRecPtr neon_get_write_lsn(void);
/* utils for neon relsize cache */
extern void relsize_hash_init(void);


@@ -62,6 +62,7 @@
#include "bitmap.h"
#include "communicator.h"
#include "communicator_new.h"
#include "file_cache.h"
#include "neon.h"
#include "neon_lwlsncache.h"
@@ -301,7 +302,7 @@ neon_wallog_pagev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
*/
lsns[batch_size++] = lsn;
if (batch_size >= BLOCK_BATCH_SIZE)
if (batch_size >= BLOCK_BATCH_SIZE && !neon_use_communicator_worker)
{
neon_set_lwlsn_block_v(lsns, InfoFromSMgrRel(reln), forknum,
batch_blockno,
@@ -311,7 +312,7 @@ neon_wallog_pagev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
if (batch_size != 0)
if (batch_size != 0 && !neon_use_communicator_worker)
{
neon_set_lwlsn_block_v(lsns, InfoFromSMgrRel(reln), forknum,
batch_blockno,
@@ -436,11 +437,17 @@ neon_wallog_page(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, co
forknum, LSN_FORMAT_ARGS(lsn))));
}
/*
* Remember the LSN on this page. When we read the page again, we must
* read the same or newer version of it.
*/
neon_set_lwlsn_block(lsn, InfoFromSMgrRel(reln), forknum, blocknum);
if (!neon_use_communicator_worker)
{
/*
* Remember the LSN on this page. When we read the page again, we must
* read the same or newer version of it.
*
* (With the new communicator, the caller will make a write-request
* for this page, which updates the last-written LSN too)
*/
neon_set_lwlsn_block(lsn, InfoFromSMgrRel(reln), forknum, blocknum);
}
}
/*
@@ -497,6 +504,60 @@ nm_adjust_lsn(XLogRecPtr lsn)
return lsn;
}
/*
* Get a LSN to use to stamp an operation like relation create or truncate.
* On operations on individual pages we use the LSN of the page, but when
* e.g. smgrcreate() is called, we have to do something else.
*/
XLogRecPtr
neon_get_write_lsn(void)
{
XLogRecPtr lsn;
if (RecoveryInProgress())
{
/*
* FIXME: v14 doesn't have GetCurrentReplayRecPtr(). Options:
* - add it in our fork
* - store a magic value meaning "use the latest possible LSN at the
* time the corresponding request is eventually made" (or some other
* recent enough LSN).
*/
#if PG_VERSION_NUM >= 150000
lsn = GetCurrentReplayRecPtr(NULL);
#else
lsn = GetXLogReplayRecPtr(NULL); /* FIXME: this is wrong, see above */
#endif
}
else
lsn = GetXLogInsertRecPtr();
/*
* If the insert LSN points to just after page header, round it down to
* the beginning of the page, because the page header might not have been
* inserted to the WAL yet, and if we tried to flush it, the WAL flushing
* code gets upset.
*/
{
int segoff;
segoff = XLogSegmentOffset(lsn, wal_segment_size);
if (segoff == SizeOfXLogLongPHD)
{
lsn = lsn - segoff;
}
else
{
int offset = lsn % XLOG_BLCKSZ;
if (offset == SizeOfXLogShortPHD)
lsn = lsn - offset;
}
}
return lsn;
}
/*
* Return LSN for requesting pages and number of blocks from page server
@@ -509,6 +570,7 @@ neon_get_request_lsns(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
{
XLogRecPtr last_written_lsns[PG_IOV_MAX];
Assert(!neon_use_communicator_worker);
Assert(nblocks <= PG_IOV_MAX);
neon_get_lwlsn_v(rinfo, forknum, blkno, (int) nblocks, last_written_lsns);
@@ -740,11 +802,6 @@ neon_exists(SMgrRelation reln, ForkNumber forkNum)
neon_log(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
}
if (get_cached_relsize(InfoFromSMgrRel(reln), forkNum, &n_blocks))
{
return true;
}
/*
* \d+ on a view calls smgrexists with 0/0/0 relfilenode. The page server
* will error out if you check that, because the whole dbdir for
@@ -768,10 +825,20 @@ neon_exists(SMgrRelation reln, ForkNumber forkNum)
return false;
}
neon_get_request_lsns(InfoFromSMgrRel(reln), forkNum,
REL_METADATA_PSEUDO_BLOCKNO, &request_lsns, 1);
if (neon_use_communicator_worker)
return communicator_new_rel_exists(InfoFromSMgrRel(reln), forkNum);
else
{
if (get_cached_relsize(InfoFromSMgrRel(reln), forkNum, &n_blocks))
{
return true;
}
return communicator_exists(InfoFromSMgrRel(reln), forkNum, &request_lsns);
neon_get_request_lsns(InfoFromSMgrRel(reln), forkNum,
REL_METADATA_PSEUDO_BLOCKNO, &request_lsns, 1);
return communicator_exists(InfoFromSMgrRel(reln), forkNum, &request_lsns);
}
}
/*
@@ -829,16 +896,53 @@ neon_create(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
* relation. Currently, we don't call SetLastWrittenLSN() when a new
* relation created, so if we didn't remember the size in the relsize
* cache, we might call smgrnblocks() on the newly-created relation before
* the creation WAL record hass been received by the page server.
* the creation WAL record has been received by the page server.
*
* XXX: with the new communicator, similar considerations apply. However,
* during replay, neon_get_write_lsn() returns the (end-)LSN of the record
* that's being replayed, so we should not have the correctness issue
* mentioned in previous paragraph.
*/
if (isRedo)
if (neon_use_communicator_worker)
{
update_cached_relsize(InfoFromSMgrRel(reln), forkNum, 0);
get_cached_relsize(InfoFromSMgrRel(reln), forkNum,
&reln->smgr_cached_nblocks[forkNum]);
XLogRecPtr lsn = neon_get_write_lsn();
if (isRedo)
{
/*
* TODO: the protocol can check for existence and get the relsize
* in one roundtrip. Add a similar call to the
* backend<->communicator API. (The size is cached on the
* rel_exists call, so this does only one roundtrip to the
* pageserver, but two function calls and two cache lookups.)
*/
if (!communicator_new_rel_exists(InfoFromSMgrRel(reln), forkNum))
{
communicator_new_rel_create(InfoFromSMgrRel(reln), forkNum, lsn);
reln->smgr_cached_nblocks[forkNum] = 0;
}
else
{
BlockNumber nblocks;
nblocks = communicator_new_rel_nblocks(InfoFromSMgrRel(reln), forkNum);
reln->smgr_cached_nblocks[forkNum] = nblocks;
}
}
else
communicator_new_rel_create(InfoFromSMgrRel(reln), forkNum, lsn);
}
else
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, 0);
{
if (isRedo)
{
update_cached_relsize(InfoFromSMgrRel(reln), forkNum, 0);
get_cached_relsize(InfoFromSMgrRel(reln), forkNum,
&reln->smgr_cached_nblocks[forkNum]);
}
else
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, 0);
}
if (debug_compare_local)
{
@@ -874,9 +978,17 @@ neon_unlink(NRelFileInfoBackend rinfo, ForkNumber forkNum, bool isRedo)
* unlink, it won't do any harm if the file doesn't exist.
*/
mdunlink(rinfo, forkNum, isRedo);
if (!NRelFileInfoBackendIsTemp(rinfo))
{
forget_cached_relsize(InfoFromNInfoB(rinfo), forkNum);
if (neon_use_communicator_worker)
{
XLogRecPtr lsn = neon_get_write_lsn();
communicator_new_rel_unlink(InfoFromNInfoB(rinfo), forkNum, lsn);
}
else
forget_cached_relsize(InfoFromNInfoB(rinfo), forkNum);
}
}
@@ -899,6 +1011,7 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
#endif
{
XLogRecPtr lsn;
bool lsn_was_zero;
BlockNumber n_blocks = 0;
switch (reln->smgr_relpersistence)
@@ -956,7 +1069,6 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
neon_wallog_page(reln, forkNum, n_blocks++, buffer, true);
neon_wallog_page(reln, forkNum, blkno, buffer, false);
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blkno + 1);
lsn = PageGetLSN((Page) buffer);
neon_log(SmgrTrace, "smgrextend called for %u/%u/%u.%u blk %u, page LSN: %X/%08X",
@@ -964,14 +1076,6 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
forkNum, blkno,
(uint32) (lsn >> 32), (uint32) lsn);
lfc_write(InfoFromSMgrRel(reln), forkNum, blkno, buffer);
if (debug_compare_local)
{
if (IS_LOCAL_REL(reln))
mdextend(reln, forkNum, blkno, buffer, skipFsync);
}
/*
* smgr_extend is often called with an all-zeroes page, so
* lsn==InvalidXLogRecPtr. An smgr_write() call will come for the buffer
@@ -979,20 +1083,51 @@ neon_extend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
* it is eventually evicted from the buffer cache. But we need a valid LSN
* for the relation metadata update now.
*/
if (lsn == InvalidXLogRecPtr)
{
lsn_was_zero = (lsn == InvalidXLogRecPtr);
if (lsn_was_zero)
lsn = GetXLogInsertRecPtr();
neon_set_lwlsn_block(lsn, InfoFromSMgrRel(reln), forkNum, blkno);
if (neon_use_communicator_worker)
{
communicator_new_rel_extend(InfoFromSMgrRel(reln), forkNum, blkno, (const void *) buffer, lsn);
if (debug_compare_local)
{
if (IS_LOCAL_REL(reln))
mdextend(reln, forkNum, blkno, buffer, skipFsync);
}
}
else
{
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blkno + 1);
lfc_write(InfoFromSMgrRel(reln), forkNum, blkno, buffer);
if (debug_compare_local)
{
if (IS_LOCAL_REL(reln))
mdextend(reln, forkNum, blkno, buffer, skipFsync);
}
/*
* smgr_extend is often called with an all-zeroes page, so
* lsn==InvalidXLogRecPtr. An smgr_write() call will come for the buffer
* later, after it has been initialized with the real page contents, and
* it is eventually evicted from the buffer cache. But we need a valid LSN
* for the relation metadata update now.
*/
if (lsn_was_zero)
neon_set_lwlsn_block(lsn, InfoFromSMgrRel(reln), forkNum, blkno);
neon_set_lwlsn_relation(lsn, InfoFromSMgrRel(reln), forkNum);
}
neon_set_lwlsn_relation(lsn, InfoFromSMgrRel(reln), forkNum);
}
#if PG_MAJORVERSION_NUM >= 16
static void
neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber start_block,
int nblocks, bool skipFsync)
{
const PGIOAlignedBlock buffer = {0};
BlockNumber blocknum = start_block;
int remblocks = nblocks;
XLogRecPtr lsn = 0;
@@ -1075,11 +1210,14 @@ neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
lsn = XLogInsert(RM_XLOG_ID, XLOG_FPI);
for (int i = 0; i < count; i++)
if (!neon_use_communicator_worker)
{
lfc_write(InfoFromSMgrRel(reln), forkNum, blocknum + i, buffer.data);
neon_set_lwlsn_block(lsn, InfoFromSMgrRel(reln), forkNum,
blocknum + i);
for (int i = 0; i < count; i++)
{
lfc_write(InfoFromSMgrRel(reln), forkNum, blocknum + i, buffer.data);
neon_set_lwlsn_block(lsn, InfoFromSMgrRel(reln), forkNum,
blocknum + i);
}
}
blocknum += count;
@@ -1088,8 +1226,15 @@ neon_zeroextend(SMgrRelation reln, ForkNumber forkNum, BlockNumber blocknum,
Assert(lsn != 0);
neon_set_lwlsn_relation(lsn, InfoFromSMgrRel(reln), forkNum);
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blocknum);
if (neon_use_communicator_worker)
{
communicator_new_rel_zeroextend(InfoFromSMgrRel(reln), forkNum, start_block, nblocks, lsn);
}
else
{
neon_set_lwlsn_relation(lsn, InfoFromSMgrRel(reln), forkNum);
set_cached_relsize(InfoFromSMgrRel(reln), forkNum, blocknum);
}
}
#endif
@@ -1149,6 +1294,12 @@ neon_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
neon_log(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
}
if (neon_use_communicator_worker)
{
communicator_new_prefetch_register_bufferv(InfoFromSMgrRel(reln), forknum, blocknum, nblocks);
return false;
}
tag.spcOid = reln->smgr_rlocator.locator.spcOid;
tag.dbOid = reln->smgr_rlocator.locator.dbOid;
tag.relNumber = reln->smgr_rlocator.locator.relNumber;
@@ -1175,7 +1326,8 @@ neon_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
blocknum += iterblocks;
}
communicator_prefetch_pump_state();
if (!neon_use_communicator_worker)
communicator_prefetch_pump_state();
return false;
}
@@ -1188,8 +1340,6 @@ neon_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
static bool
neon_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
BufferTag tag;
switch (reln->smgr_relpersistence)
{
case 0: /* probably shouldn't happen, but ignore it */
@@ -1204,17 +1354,25 @@ neon_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
neon_log(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
}
if (lfc_cache_contains(InfoFromSMgrRel(reln), forknum, blocknum))
return false;
if (neon_use_communicator_worker)
{
communicator_new_prefetch_register_bufferv(InfoFromSMgrRel(reln), forknum, blocknum, 1);
}
else
{
BufferTag tag;
tag.forkNum = forknum;
tag.blockNum = blocknum;
if (lfc_cache_contains(InfoFromSMgrRel(reln), forknum, blocknum))
return false;
CopyNRelFileInfoToBufTag(tag, InfoFromSMgrRel(reln));
tag.forkNum = forknum;
tag.blockNum = blocknum;
communicator_prefetch_register_bufferv(tag, NULL, 1, NULL);
CopyNRelFileInfoToBufTag(tag, InfoFromSMgrRel(reln));
communicator_prefetch_register_bufferv(tag, NULL, 1, NULL);
communicator_prefetch_pump_state();
communicator_prefetch_pump_state();
}
return false;
}
@@ -1258,7 +1416,8 @@ neon_writeback(SMgrRelation reln, ForkNumber forknum,
*/
neon_log(SmgrTrace, "writeback noop");
communicator_prefetch_pump_state();
if (!neon_use_communicator_worker)
communicator_prefetch_pump_state();
if (debug_compare_local)
{
@@ -1275,7 +1434,14 @@ void
neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
neon_request_lsns request_lsns, void *buffer)
{
communicator_read_at_lsnv(rinfo, forkNum, blkno, &request_lsns, &buffer, 1, NULL);
if (neon_use_communicator_worker)
{
// FIXME: request_lsns is ignored, which affects the neon_test_utils callers.
// Should we add the capability to specify the LSNs explicitly, for the sake of neon_test_utils?
communicator_new_read_at_lsn_uncached(rinfo, forkNum, blkno, buffer, request_lsns.request_lsn, request_lsns.not_modified_since);
}
else
communicator_read_at_lsnv(rinfo, forkNum, blkno, &request_lsns, &buffer, 1, NULL);
}
static void
@@ -1401,47 +1567,55 @@ neon_read(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno, void *buffer
neon_log(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
}
/* Try to read PS results if they are available */
communicator_prefetch_pump_state();
neon_get_request_lsns(InfoFromSMgrRel(reln), forkNum, blkno, &request_lsns, 1);
present = 0;
bufferp = buffer;
if (communicator_prefetch_lookupv(InfoFromSMgrRel(reln), forkNum, blkno, &request_lsns, 1, &bufferp, &present))
if (neon_use_communicator_worker)
{
/* Prefetch hit */
if (debug_compare_local >= DEBUG_COMPARE_LOCAL_PREFETCH)
{
compare_with_local(reln, forkNum, blkno, buffer, request_lsns.request_lsn);
}
if (debug_compare_local <= DEBUG_COMPARE_LOCAL_PREFETCH)
{
return;
}
communicator_new_readv(InfoFromSMgrRel(reln), forkNum, blkno,
(void *) &buffer, 1);
}
/* Try to read from local file cache */
if (lfc_read(InfoFromSMgrRel(reln), forkNum, blkno, buffer))
else
{
MyNeonCounters->file_cache_hits_total++;
if (debug_compare_local >= DEBUG_COMPARE_LOCAL_LFC)
/* Try to read PS results if they are available */
communicator_prefetch_pump_state();
neon_get_request_lsns(InfoFromSMgrRel(reln), forkNum, blkno, &request_lsns, 1);
present = 0;
bufferp = buffer;
if (communicator_prefetch_lookupv(InfoFromSMgrRel(reln), forkNum, blkno, &request_lsns, 1, &bufferp, &present))
{
compare_with_local(reln, forkNum, blkno, buffer, request_lsns.request_lsn);
/* Prefetch hit */
if (debug_compare_local >= DEBUG_COMPARE_LOCAL_PREFETCH)
{
compare_with_local(reln, forkNum, blkno, buffer, request_lsns.request_lsn);
}
if (debug_compare_local <= DEBUG_COMPARE_LOCAL_PREFETCH)
{
return;
}
}
if (debug_compare_local <= DEBUG_COMPARE_LOCAL_LFC)
/* Try to read from local file cache */
if (lfc_read(InfoFromSMgrRel(reln), forkNum, blkno, buffer))
{
return;
MyNeonCounters->file_cache_hits_total++;
if (debug_compare_local >= DEBUG_COMPARE_LOCAL_LFC)
{
compare_with_local(reln, forkNum, blkno, buffer, request_lsns.request_lsn);
}
if (debug_compare_local <= DEBUG_COMPARE_LOCAL_LFC)
{
return;
}
}
neon_read_at_lsn(InfoFromSMgrRel(reln), forkNum, blkno, request_lsns, buffer);
/*
* Try to receive prefetch results once again just to make sure we don't leave the smgr code while the OS might still have buffered bytes.
*/
communicator_prefetch_pump_state();
}
neon_read_at_lsn(InfoFromSMgrRel(reln), forkNum, blkno, request_lsns, buffer);
/*
* Try to receive prefetch results once again just to make sure we don't leave the smgr code while the OS might still have buffered bytes.
*/
communicator_prefetch_pump_state();
if (debug_compare_local)
{
compare_with_local(reln, forkNum, blkno, buffer, request_lsns.request_lsn);
@@ -1504,59 +1678,67 @@ neon_readv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nblocks, PG_IOV_MAX);
/* Try to read PS results if they are available */
communicator_prefetch_pump_state();
neon_get_request_lsns(InfoFromSMgrRel(reln), forknum, blocknum,
request_lsns, nblocks);
if (!neon_use_communicator_worker)
communicator_prefetch_pump_state();
memset(read_pages, 0, sizeof(read_pages));
prefetch_result = communicator_prefetch_lookupv(InfoFromSMgrRel(reln), forknum,
blocknum, request_lsns, nblocks,
buffers, read_pages);
if (debug_compare_local >= DEBUG_COMPARE_LOCAL_PREFETCH)
if (neon_use_communicator_worker)
{
compare_with_localv(reln, forknum, blocknum, buffers, nblocks, request_lsns, read_pages);
communicator_new_readv(InfoFromSMgrRel(reln), forknum, blocknum,
buffers, nblocks);
}
if (debug_compare_local <= DEBUG_COMPARE_LOCAL_PREFETCH && prefetch_result == nblocks)
else
{
return;
neon_get_request_lsns(InfoFromSMgrRel(reln), forknum, blocknum,
request_lsns, nblocks);
prefetch_result = communicator_prefetch_lookupv(InfoFromSMgrRel(reln), forknum,
blocknum, request_lsns, nblocks,
buffers, read_pages);
if (debug_compare_local >= DEBUG_COMPARE_LOCAL_PREFETCH)
{
compare_with_localv(reln, forknum, blocknum, buffers, nblocks, request_lsns, read_pages);
}
if (debug_compare_local <= DEBUG_COMPARE_LOCAL_PREFETCH && prefetch_result == nblocks)
{
return;
}
if (debug_compare_local > DEBUG_COMPARE_LOCAL_PREFETCH)
{
memset(read_pages, 0, sizeof(read_pages));
}
/* Try to read from local file cache */
lfc_result = lfc_readv_select(InfoFromSMgrRel(reln), forknum, blocknum, buffers,
nblocks, read_pages);
if (lfc_result > 0)
MyNeonCounters->file_cache_hits_total += lfc_result;
if (debug_compare_local >= DEBUG_COMPARE_LOCAL_LFC)
{
compare_with_localv(reln, forknum, blocknum, buffers, nblocks, request_lsns, read_pages);
}
if (debug_compare_local <= DEBUG_COMPARE_LOCAL_LFC && prefetch_result + lfc_result == nblocks)
{
/* Read all blocks from LFC, so we're done */
return;
}
if (debug_compare_local > DEBUG_COMPARE_LOCAL_LFC)
{
memset(read_pages, 0, sizeof(read_pages));
}
communicator_read_at_lsnv(InfoFromSMgrRel(reln), forknum, blocknum, request_lsns,
buffers, nblocks, read_pages);
/*
* Try to receive prefetch results once again just to make sure we don't leave the smgr code while the OS might still have buffered bytes.
*/
communicator_prefetch_pump_state();
}
if (debug_compare_local > DEBUG_COMPARE_LOCAL_PREFETCH)
{
memset(read_pages, 0, sizeof(read_pages));
}
/* Try to read from local file cache */
lfc_result = lfc_readv_select(InfoFromSMgrRel(reln), forknum, blocknum, buffers,
nblocks, read_pages);
if (lfc_result > 0)
MyNeonCounters->file_cache_hits_total += lfc_result;
if (debug_compare_local >= DEBUG_COMPARE_LOCAL_LFC)
{
compare_with_localv(reln, forknum, blocknum, buffers, nblocks, request_lsns, read_pages);
}
if (debug_compare_local <= DEBUG_COMPARE_LOCAL_LFC && prefetch_result + lfc_result == nblocks)
{
/* Read all blocks from LFC, so we're done */
return;
}
if (debug_compare_local > DEBUG_COMPARE_LOCAL_LFC)
{
memset(read_pages, 0, sizeof(read_pages));
}
communicator_read_at_lsnv(InfoFromSMgrRel(reln), forknum, blocknum, request_lsns,
buffers, nblocks, read_pages);
/*
* Try to receive prefetch results once again just to make sure we don't leave the smgr code while the OS might still have buffered bytes.
*/
communicator_prefetch_pump_state();
if (debug_compare_local)
{
@@ -1657,9 +1839,16 @@ neon_write(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const vo
forknum, blocknum,
(uint32) (lsn >> 32), (uint32) lsn);
lfc_write(InfoFromSMgrRel(reln), forknum, blocknum, buffer);
if (neon_use_communicator_worker)
{
communicator_new_write_page(InfoFromSMgrRel(reln), forknum, blocknum, buffer, lsn);
}
else
{
lfc_write(InfoFromSMgrRel(reln), forknum, blocknum, buffer);
communicator_prefetch_pump_state();
communicator_prefetch_pump_state();
}
if (debug_compare_local)
{
@@ -1720,9 +1909,21 @@ neon_writev(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
neon_wallog_pagev(reln, forknum, blkno, nblocks, (const char **) buffers, false);
lfc_writev(InfoFromSMgrRel(reln), forknum, blkno, buffers, nblocks);
if (neon_use_communicator_worker)
{
for (int i = 0; i < nblocks; i++)
{
XLogRecPtr lsn = PageGetLSN((Page) buffers[i]);
communicator_prefetch_pump_state();
communicator_new_write_page(InfoFromSMgrRel(reln), forknum, blkno + i, buffers[i], lsn);
}
}
else
{
lfc_writev(InfoFromSMgrRel(reln), forknum, blkno, buffers, nblocks);
communicator_prefetch_pump_state();
}
if (debug_compare_local)
{
@@ -1763,19 +1964,26 @@ neon_nblocks(SMgrRelation reln, ForkNumber forknum)
neon_log(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
}
if (get_cached_relsize(InfoFromSMgrRel(reln), forknum, &n_blocks))
if (neon_use_communicator_worker)
{
neon_log(SmgrTrace, "cached nblocks for %u/%u/%u.%u: %u blocks",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, n_blocks);
return n_blocks;
n_blocks = communicator_new_rel_nblocks(InfoFromSMgrRel(reln), forknum);
}
else
{
if (get_cached_relsize(InfoFromSMgrRel(reln), forknum, &n_blocks))
{
neon_log(SmgrTrace, "cached nblocks for %u/%u/%u.%u: %u blocks",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum, n_blocks);
return n_blocks;
}
neon_get_request_lsns(InfoFromSMgrRel(reln), forknum,
REL_METADATA_PSEUDO_BLOCKNO, &request_lsns, 1);
neon_get_request_lsns(InfoFromSMgrRel(reln), forknum,
REL_METADATA_PSEUDO_BLOCKNO, &request_lsns, 1);
n_blocks = communicator_nblocks(InfoFromSMgrRel(reln), forknum, &request_lsns);
update_cached_relsize(InfoFromSMgrRel(reln), forknum, n_blocks);
n_blocks = communicator_nblocks(InfoFromSMgrRel(reln), forknum, &request_lsns);
update_cached_relsize(InfoFromSMgrRel(reln), forknum, n_blocks);
}
neon_log(SmgrTrace, "neon_nblocks: rel %u/%u/%u fork %u (request LSN %X/%08X): %u blocks",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
@@ -1796,10 +2004,17 @@ neon_dbsize(Oid dbNode)
neon_request_lsns request_lsns;
NRelFileInfo dummy_node = {0};
neon_get_request_lsns(dummy_node, MAIN_FORKNUM,
REL_METADATA_PSEUDO_BLOCKNO, &request_lsns, 1);
if (neon_use_communicator_worker)
{
db_size = communicator_new_dbsize(dbNode);
}
else
{
neon_get_request_lsns(dummy_node, MAIN_FORKNUM,
REL_METADATA_PSEUDO_BLOCKNO, &request_lsns, 1);
db_size = communicator_dbsize(dbNode, &request_lsns);
db_size = communicator_dbsize(dbNode, &request_lsns);
}
neon_log(SmgrTrace, "neon_dbsize: db %u (request LSN %X/%08X): %ld bytes",
dbNode, LSN_FORMAT_ARGS(request_lsns.effective_request_lsn), db_size);
@@ -1813,8 +2028,6 @@ neon_dbsize(Oid dbNode)
static void
neon_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber old_blocks, BlockNumber nblocks)
{
XLogRecPtr lsn;
switch (reln->smgr_relpersistence)
{
case 0:
@@ -1838,34 +2051,45 @@ neon_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber old_blocks, Blo
neon_log(ERROR, "unknown relpersistence '%c'", reln->smgr_relpersistence);
}
set_cached_relsize(InfoFromSMgrRel(reln), forknum, nblocks);
if (neon_use_communicator_worker)
{
XLogRecPtr lsn = neon_get_write_lsn();
/*
* Truncating a relation drops all its buffers from the buffer cache
* without calling smgrwrite() on them. But we must account for that in
* our tracking of last-written-LSN all the same: any future smgrnblocks()
* request must return the new size after the truncation. We don't know
* what the LSN of the truncation record was, so be conservative and use
* the most recently inserted WAL record's LSN.
*/
lsn = GetXLogInsertRecPtr();
lsn = nm_adjust_lsn(lsn);
communicator_new_rel_truncate(InfoFromSMgrRel(reln), forknum, nblocks, lsn);
}
else
{
XLogRecPtr lsn;
/*
* Flush it, too. We don't actually care about it here, but let's uphold
* the invariant that last-written LSN <= flush LSN.
*/
XLogFlush(lsn);
set_cached_relsize(InfoFromSMgrRel(reln), forknum, nblocks);
/*
* Truncate may affect several chunks of the relation. So we should either
* update last written LSN for all of them, or update LSN for "dummy"
* metadata block. Second approach seems more efficient. If the relation
* is extended again later, the extension will update the last-written LSN
* for the extended pages, so there's no harm in leaving behind obsolete
* entries for the truncated chunks.
*/
neon_set_lwlsn_relation(lsn, InfoFromSMgrRel(reln), forknum);
/*
* Truncating a relation drops all its buffers from the buffer cache
* without calling smgrwrite() on them. But we must account for that in
* our tracking of last-written-LSN all the same: any future smgrnblocks()
* request must return the new size after the truncation. We don't know
* what the LSN of the truncation record was, so be conservative and use
* the most recently inserted WAL record's LSN.
*/
lsn = GetXLogInsertRecPtr();
lsn = nm_adjust_lsn(lsn);
/*
* Flush it, too. We don't actually care about it here, but let's uphold
* the invariant that last-written LSN <= flush LSN.
*/
XLogFlush(lsn);
/*
* Truncate may affect several chunks of the relation. So we should either
* update last written LSN for all of them, or update LSN for "dummy"
* metadata block. Second approach seems more efficient. If the relation
* is extended again later, the extension will update the last-written LSN
* for the extended pages, so there's no harm in leaving behind obsolete
* entries for the truncated chunks.
*/
neon_set_lwlsn_relation(lsn, InfoFromSMgrRel(reln), forknum);
}
if (debug_compare_local)
{
@@ -1908,7 +2132,8 @@ neon_immedsync(SMgrRelation reln, ForkNumber forknum)
neon_log(SmgrTrace, "[NEON_SMGR] immedsync noop");
communicator_prefetch_pump_state();
if (!neon_use_communicator_worker)
communicator_prefetch_pump_state();
if (debug_compare_local)
{
@@ -2094,12 +2319,15 @@ neon_end_unlogged_build(SMgrRelation reln)
nblocks = mdnblocks(reln, MAIN_FORKNUM);
recptr = GetXLogInsertRecPtr();
neon_set_lwlsn_block_range(recptr,
InfoFromNInfoB(rinfob),
MAIN_FORKNUM, 0, nblocks);
neon_set_lwlsn_relation(recptr,
InfoFromNInfoB(rinfob),
MAIN_FORKNUM);
if (!neon_use_communicator_worker)
{
neon_set_lwlsn_block_range(recptr,
InfoFromNInfoB(rinfob),
MAIN_FORKNUM, 0, nblocks);
neon_set_lwlsn_relation(recptr,
InfoFromNInfoB(rinfob),
MAIN_FORKNUM);
}
/* Remove local copy */
for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
@@ -2108,8 +2336,15 @@ neon_end_unlogged_build(SMgrRelation reln)
RelFileInfoFmt(InfoFromNInfoB(rinfob)),
forknum);
forget_cached_relsize(InfoFromNInfoB(rinfob), forknum);
lfc_invalidate(InfoFromNInfoB(rinfob), forknum, nblocks);
if (neon_use_communicator_worker)
{
communicator_new_update_cached_rel_size(InfoFromSMgrRel(reln), forknum, nblocks, recptr);
}
else
{
forget_cached_relsize(InfoFromNInfoB(rinfob), forknum);
lfc_invalidate(InfoFromNInfoB(rinfob), forknum, nblocks);
}
mdclose(reln, forknum);
if (!debug_compare_local)
@@ -2177,7 +2412,10 @@ neon_read_slru_segment(SMgrRelation reln, const char* path, int segno, void* buf
request_lsns.not_modified_since = not_modified_since;
request_lsns.effective_request_lsn = request_lsn;
n_blocks = communicator_read_slru_segment(kind, segno, &request_lsns, buffer);
if (neon_use_communicator_worker)
n_blocks = communicator_new_read_slru_segment(kind, (uint32_t)segno, &request_lsns, path);
else
n_blocks = communicator_read_slru_segment(kind, segno, &request_lsns, buffer);
return n_blocks;
}
@@ -2214,7 +2452,8 @@ AtEOXact_neon(XactEvent event, void *arg)
}
break;
}
communicator_reconfigure_timeout_if_needed();
if (!neon_use_communicator_worker)
communicator_reconfigure_timeout_if_needed();
}
static const struct f_smgr neon_smgr =
@@ -2272,7 +2511,10 @@ smgr_init_neon(void)
smgr_init_standard();
neon_init();
communicator_init();
if (neon_use_communicator_worker)
communicator_new_init();
else
communicator_init();
}
@@ -2284,6 +2526,16 @@ neon_extend_rel_size(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno,
/* This is only used in WAL replay */
Assert(RecoveryInProgress());
if (neon_use_communicator_worker)
{
relsize = communicator_new_rel_nblocks(rinfo, forknum);
if (blkno >= relsize)
communicator_new_rel_zeroextend(rinfo, forknum, relsize, (blkno - relsize) + 1, end_recptr);
return;
}
/* Extend the relation if we know its size */
if (get_cached_relsize(rinfo, forknum, &relsize))
{
@@ -2438,18 +2690,27 @@ neon_redo_read_buffer_filter(XLogReaderState *record, uint8 block_id)
}
/*
* we don't have the buffer in memory, update lwLsn past this record, also
* evict page from file cache
* We don't have the buffer in shared buffers. Check if it's in the LFC.
* If it's not there either, update the lwLsn past this record.
*/
if (no_redo_needed)
{
neon_set_lwlsn_block(end_recptr, rinfo, forknum, blkno);
bool in_cache;
/*
* Redo changes if page exists in LFC.
* We should perform this check after assigning LwLSN to prevent
* prefetching of some older version of the page by some other backend.
* Redo changes if the page is present in the LFC.
*/
no_redo_needed = !lfc_cache_contains(rinfo, forknum, blkno);
if (neon_use_communicator_worker)
{
in_cache = communicator_new_update_lwlsn_for_block_if_not_cached(rinfo, forknum, blkno, end_recptr);
}
else
{
in_cache = lfc_cache_contains(rinfo, forknum, blkno);
neon_set_lwlsn_block(end_recptr, rinfo, forknum, blkno);
}
no_redo_needed = !in_cache;
}
LWLockRelease(partitionLock);


@@ -87,6 +87,8 @@ get_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber *size)
{
bool found = false;
Assert(!neon_use_communicator_worker);
if (relsize_hash_size > 0)
{
RelTag tag;
@@ -118,6 +120,8 @@ get_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber *size)
void
set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size)
{
Assert(!neon_use_communicator_worker);
if (relsize_hash_size > 0)
{
RelTag tag;
@@ -166,6 +170,8 @@ set_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size)
void
update_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size)
{
Assert(!neon_use_communicator_worker);
if (relsize_hash_size > 0)
{
RelTag tag;
@@ -200,6 +206,8 @@ update_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber size)
void
forget_cached_relsize(NRelFileInfo rinfo, ForkNumber forknum)
{
Assert(!neon_use_communicator_worker);
if (relsize_hash_size > 0)
{
RelTag tag;


@@ -510,6 +510,7 @@ impl ApiMethod for ComputeHookTenant {
tracing::info!("Reconfiguring pageservers for endpoint {endpoint_name}");
let shard_count = match shards.len() {
0 => panic!("no shards"),
1 => ShardCount::unsharded(),
n => ShardCount(n.try_into().expect("too many shards")),
};


@@ -530,7 +530,10 @@ class NeonLocalCli(AbstractNeonCli):
args.extend(["--external-http-port", str(external_http_port)])
if internal_http_port is not None:
args.extend(["--internal-http-port", str(internal_http_port)])
if grpc:
# XXX: By checking for None, we enable the new communicator for all tests
# by default
if grpc or grpc is None:
args.append("--grpc")
if endpoint_id is not None:
args.append(endpoint_id)


@@ -4778,7 +4778,7 @@ class Endpoint(PgProtocol, LogUtils):
# set small 'max_replication_write_lag' to enable backpressure
# and make tests more stable.
config_lines = ["max_replication_write_lag=15MB"] + config_lines
config_lines += ["max_replication_write_lag=15MB"]
# Delete file cache if it exists (and we're recreating the endpoint)
if USE_LFC:


@@ -90,6 +90,8 @@ DEFAULT_PAGESERVER_ALLOWED_ERRORS = (
# During shutdown, DownloadError::Cancelled may be logged as an error. Cleaning this
# up is tracked in https://github.com/neondatabase/neon/issues/6096
".*Cancelled, shutting down.*",
# gRPC request failures during shutdown.
".*grpc:pageservice.*request failed with Unavailable: timeline is shutting down.*",
# Open layers are only rolled at LSN boundaries to avoid name clashes.
# Hence, we can overshoot the soft limit set by checkpoint distance.
# This is especially pronounced in tests that set small checkpoint


@@ -157,6 +157,7 @@ def test_cannot_create_endpoint_on_non_uploaded_timeline(neon_env_builder: NeonE
[
".*request{method=POST path=/v1/tenant/.*/timeline request_id=.*}: request was dropped before completing.*",
".*page_service_conn_main.*: query handler for 'basebackup .* ERROR: Not found: Timeline",
".*request failed with Unavailable: Timeline .* is not active",
]
)
ps_http = env.pageserver.http_client()
@@ -194,7 +195,10 @@ def test_cannot_create_endpoint_on_non_uploaded_timeline(neon_env_builder: NeonE
env.neon_cli.mappings_map_branch(initial_branch, env.initial_tenant, env.initial_timeline)
with pytest.raises(RuntimeError, match="ERROR: Not found: Timeline"):
with pytest.raises(
RuntimeError,
match=f"Timeline {env.initial_tenant}/{env.initial_timeline} is not active",
):
env.endpoints.create_start(
initial_branch, tenant_id=env.initial_tenant, basebackup_request_tries=2
)


@@ -101,20 +101,37 @@ def check_prewarmed_contains(
@pytest.mark.skipif(not USE_LFC, reason="LFC is disabled, skipping")
@pytest.mark.parametrize("grpc", [True, False])
@pytest.mark.parametrize("method", METHOD_VALUES, ids=METHOD_IDS)
def test_lfc_prewarm(neon_simple_env: NeonEnv, method: PrewarmMethod):
def test_lfc_prewarm(neon_simple_env: NeonEnv, method: PrewarmMethod, grpc: bool):
"""
Test we can offload endpoint's LFC cache to endpoint storage.
Test we can prewarm endpoint with LFC cache loaded from endpoint storage.
"""
env = neon_simple_env
n_records = 1000000
# The `neon.file_cache_prewarm_limit` GUC sets the max number of *chunks* to
# load, so the number of *pages* loaded depends on the chunk size. With the
# new communicator, the new LFC implementation doesn't do chunking, so the
# limit is effectively a number of pages, while the old implementation uses
# a default chunk size of 1 MB (128 pages).
#
# Therefore, with the old implementation 1000 chunks corresponds to up to
# 128000 pages if all the chunks are fully dense. In practice they are
# sparse, but should still amount to > 10000 pages. (We assert below that at
# least 10000 LFC pages are in use after prewarming.)
if grpc:
prewarm_limit = 15000
else:
prewarm_limit = 1000
cfg = [
"autovacuum = off",
"shared_buffers=1MB",
"neon.max_file_cache_size=1GB",
"neon.file_cache_size_limit=1GB",
"neon.file_cache_prewarm_limit=1000",
f"neon.file_cache_prewarm_limit={prewarm_limit}",
]
if method == PrewarmMethod.AUTOPREWARM:
@@ -123,9 +140,10 @@ def test_lfc_prewarm(neon_simple_env: NeonEnv, method: PrewarmMethod):
config_lines=cfg,
autoprewarm=True,
offload_lfc_interval_seconds=AUTOOFFLOAD_INTERVAL_SECS,
grpc=grpc,
)
else:
endpoint = env.endpoints.create_start(branch_name="main", config_lines=cfg)
endpoint = env.endpoints.create_start(branch_name="main", config_lines=cfg, grpc=grpc)
pg_conn = endpoint.connect()
pg_cur = pg_conn.cursor()
@@ -162,7 +180,7 @@ def test_lfc_prewarm(neon_simple_env: NeonEnv, method: PrewarmMethod):
log.info(f"Used LFC size: {lfc_used_pages}")
pg_cur.execute("select * from neon.get_prewarm_info()")
total, prewarmed, skipped, _ = pg_cur.fetchall()[0]
assert lfc_used_pages > 10000
assert lfc_used_pages >= 10000
assert total > 0
assert prewarmed > 0
assert total == prewarmed + skipped
@@ -186,7 +204,7 @@ def test_lfc_prewarm_cancel(neon_simple_env: NeonEnv):
"shared_buffers=1MB",
"neon.max_file_cache_size=1GB",
"neon.file_cache_size_limit=1GB",
"neon.file_cache_prewarm_limit=1000",
"neon.file_cache_prewarm_limit=2000000",
]
endpoint = env.endpoints.create_start(branch_name="main", config_lines=cfg)


@@ -17,7 +17,9 @@ def check_tenant(
config_lines = [
f"neon.safekeeper_proto_version = {safekeeper_proto_version}",
]
endpoint = env.endpoints.create_start("main", tenant_id=tenant_id, config_lines=config_lines)
endpoint = env.endpoints.create_start(
"main", tenant_id=tenant_id, config_lines=config_lines, grpc=True
)
# we rely upon autocommit after each statement
res_1 = endpoint.safe_psql_many(
queries=[


@@ -28,8 +28,8 @@ chrono = { version = "0.4", default-features = false, features = ["clock", "serd
clap = { version = "4", features = ["derive", "env", "string"] }
clap_builder = { version = "4", default-features = false, features = ["color", "env", "help", "std", "string", "suggestions", "usage"] }
const-oid = { version = "0.9", default-features = false, features = ["db", "std"] }
criterion = { version = "0.5", features = ["html_reports"] }
crossbeam-epoch = { version = "0.9" }
crossbeam-utils = { version = "0.8" }
crypto-bigint = { version = "0.5", features = ["generic-array", "zeroize"] }
der = { version = "0.7", default-features = false, features = ["derive", "flagset", "oid", "pem", "std"] }
deranged = { version = "0.3", default-features = false, features = ["powerfmt", "serde", "std"] }
@@ -72,7 +72,6 @@ num-integer = { version = "0.1", features = ["i128"] }
num-iter = { version = "0.1", default-features = false, features = ["i128", "std"] }
num-rational = { version = "0.4", default-features = false, features = ["num-bigint-std", "std"] }
num-traits = { version = "0.2", features = ["i128", "libm"] }
once_cell = { version = "1" }
p256 = { version = "0.13", features = ["jwk"] }
parquet = { version = "53", default-features = false, features = ["zstd"] }
portable-atomic = { version = "1", features = ["require-cas"] }
@@ -105,7 +104,7 @@ tokio-rustls = { version = "0.26", default-features = false, features = ["loggin
tokio-stream = { version = "0.1", features = ["net", "sync"] }
tokio-util = { version = "0.7", features = ["codec", "compat", "io-util", "rt"] }
toml_edit = { version = "0.22", features = ["serde"] }
tonic = { version = "0.13", default-features = false, features = ["codegen", "gzip", "prost", "router", "server", "tls-native-roots", "tls-ring", "zstd"] }
tonic = { version = "0.13", default-features = false, features = ["codegen", "gzip", "prost", "router", "tls-native-roots", "tls-ring", "transport", "zstd"] }
tower = { version = "0.5", default-features = false, features = ["balance", "buffer", "limit", "log"] }
tracing = { version = "0.1", features = ["log"] }
tracing-core = { version = "0.1" }
@@ -143,7 +142,6 @@ num-integer = { version = "0.1", features = ["i128"] }
num-iter = { version = "0.1", default-features = false, features = ["i128", "std"] }
num-rational = { version = "0.4", default-features = false, features = ["num-bigint-std", "std"] }
num-traits = { version = "0.2", features = ["i128", "libm"] }
once_cell = { version = "1" }
parquet = { version = "53", default-features = false, features = ["zstd"] }
prettyplease = { version = "0.2", default-features = false, features = ["verbatim"] }
proc-macro2 = { version = "1" }