Heikki Linnakangas
5f2d476a58
Add request ID to io-in-progress locking table, to ease debugging
...
I also added INFO messages for when a backend blocks on the
io-in-progress lock. It's probably too noisy for production, but
useful now to get a picture of how much it happens.
2025-07-04 15:55:57 +03:00
Heikki Linnakangas
3231cb6138
Await the io-in-progress locking futures
...
Otherwise they don't do anything. Oops.
2025-07-04 15:55:57 +03:00
Heikki Linnakangas
e558e0da5c
Assign request_id earlier, in the originating backend
...
Makes it more useful for stitching together logs etc. for a specific
request.
2025-07-04 15:55:55 +03:00
Heikki Linnakangas
70bf2e088d
Request multiple block numbers in a single GetPageV request
...
That's how it was always intended to be used
2025-07-04 15:49:04 +03:00
Heikki Linnakangas
da3f9ee72d
cargo fmt
2025-07-04 12:39:41 +03:00
Erik Grinaker
88d1127bf4
Tweak GetPageSplitter
2025-07-03 21:12:26 +02:00
David Freifeld
794bb7a9e8
Merge branch 'quantumish/comm-lfc-integration' into communicator-rewrite
2025-07-03 10:52:29 -07:00
Erik Grinaker
42e4e5a418
Add GetPage request splitting
2025-07-03 18:31:12 +02:00
Heikki Linnakangas
96a817fa2b
Fix the case that storage auth token is _not_ used
...
I broke that in the previous commit while fixing the case of using a token.
2025-07-03 18:39:06 +03:00
Heikki Linnakangas
e7b057f2e8
Fix passing storage JWT token to the communicator process
...
Makes the 'test_compute_auth_to_pageserver' test pass
2025-07-03 18:14:22 +03:00
Heikki Linnakangas
956c2f4378
cargo fmt
2025-07-03 16:16:42 +03:00
Heikki Linnakangas
3293e4685e
Fix cases where pageserver gets stuck waiting for LSN
...
The compute might make a request with an LSN that it hasn't even
flushed yet.
2025-07-03 16:14:45 +03:00
Erik Grinaker
6f8650782f
Client tweaks
2025-07-03 14:54:23 +02:00
Erik Grinaker
14214eb853
Add client shard routing
2025-07-03 14:42:35 +02:00
Erik Grinaker
d4b4724921
Sanity-check Pageserver URLs
2025-07-03 14:18:14 +02:00
Erik Grinaker
9aba9550dd
Instrument client methods
2025-07-03 14:11:53 +02:00
Erik Grinaker
375e8e5592
Improve retries and logging
2025-07-03 14:02:43 +02:00
Erik Grinaker
52c586f678
Restructure shard management
2025-07-03 11:51:19 +02:00
Erik Grinaker
de97b73d6e
Lint fixes
2025-07-03 10:38:14 +02:00
Heikki Linnakangas
d8556616c9
Fix running Postgres in "vanilla mode", without neon storage
...
Some tests do that
2025-07-03 00:32:40 +03:00
Heikki Linnakangas
d8296e60e6
Fix caching of newly extended pages
...
This fixes read errors e.g. in test_compute_catalog.py test (and
probably many others).
2025-07-02 23:21:42 +03:00
Heikki Linnakangas
7263d6e2e5
Clarify error message if not_modified_lsn > request_lsn
...
I'm seeing this error from some Python tests, which of course means
there's a bug on the compute side, but it took me a while to figure
that out.
2025-07-02 23:21:42 +03:00
David Freifeld
86fb7b966a
Update integrated_cache.rs to use new hashmap API
2025-07-02 12:18:37 -07:00
David Freifeld
0c099b0944
Merge branch 'quantumish/lfc-resizable-map' into quantumish/comm-lfc-integration
2025-07-02 12:05:24 -07:00
David Freifeld
2fe27f510d
Make neon-shmem tests thread-safe and report errno in panics
2025-07-02 11:57:49 -07:00
David Freifeld
19b5618578
Switch to neon_shmem::sync lock_api and integrate into hashmap
2025-07-02 11:44:38 -07:00
Erik Grinaker
12dade35fa
Comment tweaks
2025-07-02 14:47:27 +02:00
Erik Grinaker
1ec63bd6bc
Misc pool improvements
2025-07-02 14:42:06 +02:00
Heikki Linnakangas
7012b4aa90
Remove --grpc options from neon_local endpoint reconfigure and start calls
...
They don't exist in neon_local anymore, and aren't actually used in
tests either.
2025-07-02 15:10:18 +03:00
Heikki Linnakangas
2cc28c75be
Fix "ERROR: could not read size of rel ..." in many regression tests.
...
We were incorrectly skipping the call to communicator_new_rel_create(),
which resulted in an error during index build, when the btree build code
tried to check the size of the newly-created relation.
2025-07-02 14:10:11 +03:00
Erik Grinaker
bf01145ae4
Remove some old code
2025-07-02 11:46:54 +02:00
Erik Grinaker
8ab8fc11a3
Use new PageserverClient
2025-07-02 11:27:56 +02:00
Erik Grinaker
6f0af96a54
Add new PageserverClient
2025-07-02 10:59:40 +02:00
Heikki Linnakangas
9913d2668a
print retried pageserver requests to log
...
Not sure how verbose we want this to be in production, but for now,
more is better.
This shows that many tests are failing with errors like these:
PG:2025-07-01 23:02:34.311 GMT [1456523] LOG: [COMMUNICATOR] send_process_get_rel_size_request: got error status: NotFound, message: "Read error", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 01 Jul 2025 23:02:34 GMT"} }, retrying
I haven't debugged why that is yet. Did the compute make a bogus request?
2025-07-02 02:04:04 +03:00
Heikki Linnakangas
2fefece77d
temporary hack to make regression tests fail faster
2025-07-02 01:42:39 +03:00
Heikki Linnakangas
471191e64e
Fix updating relsize cache during WAL replay
...
This makes some of the test_runner/regress/test_hot_standby.py tests
pass. (Others are still failing.)
2025-07-01 21:22:04 +03:00
Erik Grinaker
f6761760a2
Documentation and tweaks
2025-07-01 17:54:41 +02:00
Erik Grinaker
0bce818d5e
Add stream pool
2025-07-01 17:54:41 +02:00
Erik Grinaker
48be1da6ef
Add initial client pool
2025-07-01 17:54:41 +02:00
Erik Grinaker
d2efc80e40
Add initial ChannelPool
2025-07-01 17:54:41 +02:00
Erik Grinaker
958c2577f5
pageserver: tighten up page_api::Client
2025-07-01 17:54:41 +02:00
Heikki Linnakangas
175c2e11e3
Add assertions that the legacy relsize cache is not used with new communicator
...
And fix a few cases where it was being called
2025-07-01 16:44:25 +03:00
Heikki Linnakangas
efdb07e7b6
Implement function to check if page is in local cache
...
This is needed for read replicas. There's one more TODO that needs to
be implemented before read replicas work though, in
neon_extend_rel_size()
2025-07-01 16:22:51 +03:00
Heikki Linnakangas
b0970b415c
Don't call legacy lfc function when new communicator is used
2025-07-01 15:47:26 +03:00
David Freifeld
9d3e07ef2c
Add initial prototype of shmem sync primitives
2025-06-30 17:07:07 -07:00
Heikki Linnakangas
7429dd711c
fix the .metrics.socket filename in the ignore list
2025-06-30 23:41:09 +03:00
Heikki Linnakangas
88ac1e356b
Ignore the metrics unix domain socket in tests
2025-06-30 23:39:01 +03:00
Erik Grinaker
c3cb1ab98d
Merge branch 'main' into communicator-rewrite
2025-06-30 21:07:01 +02:00
Dmitrii Kovalkov
8e216a3a59
storcon: notify cplane on safekeeper membership change (#12390)
...
## Problem
We don't notify cplane about safekeeper membership changes yet. Without
the notification, the compute needs to know all the safekeepers in the
cluster to be able to speak to them. Change notifications will make
that unnecessary.
- Closes: https://github.com/neondatabase/neon/issues/12188
## Summary of changes
- Implement `notify_safekeepers` method in `ComputeHook`
- Notify cplane about safekeepers in `safekeeper_migrate` handler.
- Update the test to make sure notifications work.
## Out of scope
- There is a `cplane_notified_generation` field in the `timelines` table in
storcon's database. It's not needed now, so it's not updated in the PR.
Probably we can remove it.
- e2e tests to make sure it works with a production cplane
2025-06-30 14:09:50 +00:00
Erik Grinaker
81ac4ef43a
Add a generic pool prototype
2025-06-30 14:49:34 +02:00