Commit Graph

2744 Commits

Author SHA1 Message Date
Christian Schwarz
4e72b22b41 make noise from IoConcurrency::drop instead of the task, for more context 2025-01-21 14:51:54 +01:00
Christian Schwarz
4014c390e2 initial logical size calculation can also reasonably use the sidecar because it's concurrency-limited 2025-01-21 12:50:33 +01:00
Christian Schwarz
bca4263eb8 inspect_image_layer can also have an IoConcurrency root, it's tests only 2025-01-21 12:40:22 +01:00
Christian Schwarz
a958febd7a reference issue that will remove the hard-coded sequential() 2025-01-21 12:36:23 +01:00
Christian Schwarz
fc27da43ff one more test can do without it 2025-01-21 12:25:30 +01:00
Christian Schwarz
cf2f0c27aa IoConcurrency roots for scan() and tests 2025-01-21 12:21:46 +01:00
Christian Schwarz
f54c5d5596 turns out create_image_layers is easy 2025-01-21 10:47:49 +01:00
Christian Schwarz
ce5452d2e5 followup 0a37164c29: also rename IoConcurrency::serial() 2025-01-21 00:47:37 +01:00
Christian Schwarz
081ff26519 fixup 40ab9c2c5e: deadlock
Reproduced by

test_runner/regress/test_branching.py::test_branching_with_pgbench[debug-pg16-flat-1-10]'

It kinda makes sense that this deadlocks in `sequential` mode.

However, it also deadlocks in `sidecar-task` mode.
I don't understand why.
2025-01-20 23:46:56 +01:00
Christian Schwarz
0a37164c29 replace env var with config variable; add test suite fixture env var to override default 2025-01-20 23:46:56 +01:00
Christian Schwarz
0eff09e35f Merge remote-tracking branch 'origin/main' into vlad/read-path-concurrent-io 2025-01-20 19:47:03 +01:00
Christian Schwarz
cdad6b2de5 we don't need the CancellationToken, the ios_rx.recv() will fail at the same time 2025-01-20 19:13:40 +01:00
Christian Schwarz
82d20b52b3 make noise when dropping an IoConcurrency with unfinished requests 2025-01-20 19:12:00 +01:00
Christian Schwarz
72130d7d6c fix(page_service / handle): panic when parallel client disconnect & Timeline shutdown (#10445)
## Refs
- fixes https://github.com/neondatabase/neon/issues/10444

## Problem

We're seeing a panic `handles are only shut down once in their lifetime`
in our performance testbed.

## Hypothesis

Annotated code in
https://github.com/neondatabase/neon/issues/10444#issuecomment-2602286415.

```
T1: drop Cache, executes up to (1)
=> HandleInner is now in state ShutDown
T2: Timeline::shutdown => PerTimelineState::shutdown  executes shutdown() again => panics
```

Likely this snuck in the final touches of #10386 where I narrowed down
the locking rules.

## Summary of changes

Make duplicate shutdowns a no-op.
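
A minimal sketch of the idempotency pattern, with stand-in names (the real `HandleInner` state machine in `mod handle` is more involved):

```rust
use std::sync::Mutex;

// Stand-in names, not the actual pageserver types.
enum HandleState {
    Open,
    ShutDown,
}

struct HandleInner {
    state: Mutex<HandleState>,
}

impl HandleInner {
    fn shutdown(&self) {
        let mut state = self.state.lock().unwrap();
        match *state {
            // A second caller (e.g. Timeline::shutdown racing a Cache drop)
            // just observes ShutDown and returns instead of panicking.
            HandleState::ShutDown => {}
            HandleState::Open => {
                // ... release per-handle resources here ...
                *state = HandleState::ShutDown;
            }
        }
    }
}
```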
2025-01-20 17:51:30 +00:00
Christian Schwarz
3b1328423e basebackup: fetch all SLRUs of one basebackup using the same IoConcurrency 2025-01-20 16:58:14 +01:00
Christian Schwarz
02fc58b878 impr(timeline handles): add more tests covering reference cyle (#10446)
The other tests focus on the external interface usage, while the tests
added in this PR add some testing around HandleInner's lifecycle,
ensuring we don't leak it once either connection gets dropped or
per-timeline-state is shut down explicitly.
2025-01-20 14:37:24 +00:00
Arpad Müller
b312a3c320 Move DeleteTimelineFlow::prepare to separate function and use enum (#10334)
It was requested by review in #10305 to use an enum or something like it
for distinguishing the different modes instead of two parameters,
because two flags allow four combinations, and two of them don't really
make sense / aren't used.

follow-up of #10305
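
As an illustration of the review request, a mode enum (variant names invented here) admits only the combinations that actually occur, unlike two boolean parameters:

```rust
// Before: DeleteTimelineFlow::prepare(tl, flag_a, flag_b) -- four combinations, two unused.
// After: an enum with only the modes that are actually used.
// Variant names are invented for illustration.
enum PrepareMode {
    Delete,
    Offload,
}

fn prepare(mode: PrepareMode) {
    match mode {
        PrepareMode::Delete => { /* delete-specific checks */ }
        PrepareMode::Offload => { /* offload-specific checks */ }
    }
}
```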
2025-01-20 12:50:44 +00:00
John Spray
8bdaee35f3 pageserver: safety checks on validity of uploaded indices (#10403)
## Problem

Occasionally, we encounter bugs in test environments that can be
detected at the point of uploading an index, but we proceed to upload it
anyway and leave a tenant in a broken state that's awkward to handle.

## Summary of changes

- Validate index when submitting it for upload, so that we can see the
issue quickly e.g. in an API invoking compaction
- Validate index before executing the upload, so that we have a hard
enforcement that any code path that tries to upload an index will not
overwrite a valid index with an invalid one.
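
A rough sketch of the two checks listed above (names are illustrative, not the pageserver's actual upload-queue API):

```rust
// Illustrative names only.
struct IndexPart;

fn validate(_index: &IndexPart) -> Result<(), String> {
    // e.g. consistency checks on layer references, generations, ...
    Ok(())
}

// Check 1: when the upload is scheduled, so the caller (e.g. an API invoking
// compaction) sees the problem immediately.
fn schedule_index_upload(index: IndexPart) -> Result<(), String> {
    validate(&index)?;
    // push onto the upload queue ...
    Ok(())
}

// Check 2: right before the upload executes, as a hard backstop so no code
// path can overwrite a valid remote index with an invalid one.
fn execute_index_upload(index: &IndexPart) -> Result<(), String> {
    validate(index)?;
    // perform the remote PUT ...
    Ok(())
}
```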
2025-01-20 09:20:31 +00:00
Christian Schwarz
2eb235e923 doc string explaining why we're deadlock free right now and why it's so brittle 2025-01-17 18:33:34 +01:00
Christian Schwarz
40ab9c2c5e we can avoid adding the Arc<Mutex<>> around EphemeralLayer if we instead extend the lifetime of the InMemoryLayer for the spawned IO future; plus it's semantically more similar to what we now do for Delta and Image layers 2025-01-17 18:16:17 +01:00
Christian Schwarz
c43400389f delta & image layer spawned IOs: keep layer resident until IO is done 2025-01-17 18:00:13 +01:00
Vlad Lazar
6975228a76 pageserver: add initdb metrics (#10434)
## Problem

Initdb observability is poor.

## Summary of changes

Add some metrics so we can figure out which part, if any, is slow.

Closes https://github.com/neondatabase/neon/issues/10423
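
A sketch of the kind of instrumentation this adds, using the `prometheus` crate; the metric name and labels here are assumptions, not necessarily the ones introduced by this PR:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Hypothetical metric name/labels.
static INITDB_SECONDS: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "pageserver_initdb_seconds",
        "Time spent in each phase of handling initdb",
        &["phase"]
    )
    .expect("failed to register metric")
});

fn timed_phase<T>(phase: &str, f: impl FnOnce() -> T) -> T {
    // The timer records the elapsed time into the histogram when dropped.
    let _timer = INITDB_SECONDS.with_label_values(&[phase]).start_timer();
    f()
}
```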
2025-01-17 14:51:33 +00:00
Christian Schwarz
c47c5f4ace fix(page_service pipelining): tenant cannot shut down because gate kept open while flushing responses (#10386)
# Refs

- fixes https://github.com/neondatabase/neon/issues/10309
- fixup of batching design, first introduced in
https://github.com/neondatabase/neon/pull/9851
- refinement of https://github.com/neondatabase/neon/pull/8339

# Problem

`Tenant::shutdown` was occasionally taking many minutes (sometimes up to
20) in staging and prod if the
`page_service_pipelining.mode="concurrent-futures"` is enabled.

# Symptoms

The issue happens during shard migration between pageservers.
There is page_service unavailability and hence effectively downtime for
customers in the following case:
1. The source (state `AttachedStale`) gets stuck in `Tenant::shutdown`,
waiting for the gate to close.
2. Cplane/Storcon decides to transition the target `AttachedMulti` to
`AttachedSingle`.
3. That transition comes with a bump of the generation number, causing
the `PUT .../location_config` endpoint to do a full `Tenant::shutdown` /
`Tenant::attach` cycle for the target location.
4. That `Tenant::shutdown` on the target gets stuck, waiting for the
gate to close.
5. Eventually the gate closes (`close completed`), correlating with a
`page_service` connection handler logging that it's exiting because of a
network error (`Connection reset by peer` or `Broken pipe`).

While in (4):
- `Tenant::shutdown` is stuck waiting for all `Timeline::shutdown` calls
to complete.
  So, really, this is a `Timeline::shutdown` bug.
- retries from Cplane/Storcon to complete the above state transitions fail
with errors related to the tenant mgr slot being in state
`TenantSlot::InProgress`, the tenant state being
`TenantState::Stopping`, and the timelines being in
`TimelineState::Stopping`, and the `Timeline::cancel` being cancelled.
- Existing (and/or new?) page_service connections log errors `error
reading relation or page version: Not found: Timed out waiting 30s for
tenant active state. Latest state: None`

# Root-Cause

After a lengthy investigation ([internal
write-up](https://www.notion.so/neondatabase/2025-01-09-batching-deadlock-Slow-Log-Analysis-in-Staging-176f189e00478050bc21c1a072157ca4?pvs=4))
I arrived at the following root cause.

The `spsc_fold` channel (`batch_tx`/`batch_rx`) that connects the
Batcher and Executor stages of the pipelined mode was storing a `Handle`
and thus `GateGuard` of the Timeline that was not shutting down.
The design assumption with pipelining was that this would always be a
short transient state.
However, that was incorrect: the Executor was stuck on writing/flushing
an earlier response into the connection to the client, i.e., socket
write being slow because of TCP backpressure.

The probable scenario of how we end up in that case:
1. Compute backend process sends a continuous stream of getpage prefetch
requests into the connection, but never reads the responses (why this
happens: see Appendix section).
2. Batch N is processed by Batcher and Executor, up to the point where
Executor starts flushing the response.
3. Batch N+1 is processed by Batcher and queued in the `spsc_fold`.
4. Executor is still waiting for batch N flush to finish.
5. Batcher eventually hits the `TimeoutReader` error (10min).
From here on it waits on the `spsc_fold.send(Err(QueryError(TimeoutReader_error)))`,
which will never finish because the batch already inside the `spsc_fold` is not
being read by the Executor, because the Executor is still stuck in the flush.
   (This state is not observable at our default `info` log level)
6. Eventually, Compute backend process is killed (`close()` on the
socket) or Compute as a whole gets killed (probably no clean TCP
shutdown happening in that case).
7. Eventually, Pageserver TCP stack learns about (6) through RST packets
and the Executor's flush() call fails with an error.
8. The Executor exits, dropping `cancel_batcher` and its end of the
spsc_fold.
   This wakes Batcher, causing the `spsc_fold.send` to fail.
   Batcher exits.
   The pipeline shuts down as intended.
We return from `process_query` and log the `Connection reset by peer` or
`Broken pipe` error.

The following diagram visualizes the wait-for graph at (5)

```mermaid
flowchart TD
   Batcher --spsc_fold.send(TimeoutReader_error)--> Executor
   Executor --flush batch N responses--> socket.write_end
   socket.write_end --wait for TCP window to move forward--> Compute
```

# Analysis

By holding a GateGuard inside the `spsc_fold`, and thus the timeline gate open,
the pipelining implementation violated the principle established in
https://github.com/neondatabase/neon/pull/8339.
That is, that `Handle`s must only be held across an await point if that
await point
is sensitive to the `<Handle as Deref<Target=Timeline>>::cancel` token.

In this case, we were holding the Handle inside the `spsc_fold` while
awaiting the
`pgb_writer.flush()` future.

One may jump to the conclusion that we should then simply peek into the
spsc_fold to get that Timeline's cancel token and be sensitive to it during
the flush.

But that violates another principle of the design from
https://github.com/neondatabase/neon/pull/8339.
That is, that the page_service connection lifecycle and the Timeline
lifecycles must be completely decoupled.
It must be possible to shut down one shard without shutting down the
page_service connection, because on that single connection we might be
serving other shards attached to this pageserver.
(The current compute client opens separate connections per shard, but,
there are plans to change that.)

# Solution

This PR adds a `handle::WeakHandle` struct that does _not_ hold the
timeline gate open.
It must be `upgrade()`d to get a `handle::Handle`.
That `handle::Handle` _does_ hold the timeline gate open.

The batch queued inside the `spsc_fold` only holds a `WeakHandle`.
We only upgrade it while calling into the various `handle_` methods,
i.e., while interacting with the `Timeline` via `<Handle as
Deref<Target=Timeline>>`.
All that code has always been required to be (and is!) sensitive to
`Timeline::cancel`, and therefore we're guaranteed to bail from it
quickly when `Timeline::shutdown` starts.
We will drop the `Handle` immediately, before we start
`pgb_writer.flush()`ing the responses.
Thereby letting go of our hold on the `GateGuard`, allowing the timeline
shutdown to complete while the page_service handler remains intact.
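
A condensed sketch of that ownership split, with stand-in types (the real `Handle`/`WeakHandle` in `mod handle` carry more state and error handling):

```rust
use std::sync::{Arc, Weak};

// Stand-in types for illustration; the real pageserver types differ.
struct Gate;
struct GateGuard;
impl Gate {
    // Returns None once shutdown has begun and the gate is closing.
    fn enter(&self) -> Option<GateGuard> {
        Some(GateGuard)
    }
}
struct Timeline {
    gate: Gate,
}

// Held only while a handle_* method runs; keeps the timeline gate open.
struct Handle {
    timeline: Arc<Timeline>,
    _gate_guard: Arc<GateGuard>,
}

// What sits in the spsc_fold between batches: keeps neither the gate open
// nor the Timeline allocation alive.
struct WeakHandle {
    timeline: Weak<Timeline>,
}

impl Handle {
    fn downgrade(&self) -> WeakHandle {
        WeakHandle {
            timeline: Arc::downgrade(&self.timeline),
        }
    }
}

impl WeakHandle {
    fn upgrade(&self) -> Option<Handle> {
        let timeline = self.timeline.upgrade()?; // gone once the timeline is dropped
        let gate_guard = Arc::new(timeline.gate.enter()?); // refused during shutdown
        Some(Handle {
            timeline,
            _gate_guard: gate_guard,
        })
    }
}
```

Because the gate is re-entered on every `upgrade()` and the guard is dropped before the flush, a slow `pgb_writer.flush()` no longer pins the gate open.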

# Code Changes

* Reproducer & Regression Test
* Developed and proven to reproduce the issue in
https://github.com/neondatabase/neon/pull/10399
* Add a `Test` message to the pagestream protocol (`cfg(feature =
"testing")`).
* Drive-by minimal improvement to the parsing code, we now have a
`PagestreamFeMessageTag`.
* Refactor `pageserver/client` to allow sending and receiving
`page_service` requests independently.
  * Add a Rust helper binary to produce situation (4) from above
* Rationale: (4) and (5) are the same bug class, we're holding a gate
open while `flush()`ing.
* Add a Python regression test that uses the helper binary to
demonstrate the problem.
* Fix
   * Introduce and use `WeakHandle` as explained earlier.
* Replace the `shut_down` atomic with two enum states for `HandleInner`,
wrapped in a `Mutex`.
* To make `WeakHandle::upgrade()` and `Handle::downgrade()`
cache-efficient:
     * Wrap the `Types::Timeline` in an `Arc`
     * Wrap the `GateGuard` in an `Arc`
* The separate `Arc`s enable uncontended cloning of the timeline
reference in `upgrade()` and `downgrade()`.
If we instead cloned a single `Arc<Timeline>`, different connection handlers
would be hitting the same cache line on every `upgrade()`/`downgrade()`,
causing contention.
* Please read the updated module-level comment in `mod handle` for details.

# Testing & Performance

The reproducer test that failed before the changes now passes, and
obviously other tests are passing as well.

We'll do more testing in staging, where the issue happens every ~4h if
chaos migrations are enabled in storcon.

Existing perf testing will be sufficient, no perf degradation is
expected.
It's a few more allocations due to the added `Arc`s, but they're
low-frequency.

# Appendix: Why Compute Sometimes Doesn't Read Responses

Remember, the whole problem surfaced because flush() was slow because
Compute was not reading responses. Why is that?

In short, the compute only advances the page_service protocol processing
when it has an interest in data, i.e., when the pagestore smgr is called to
return pages.

Thus, if compute issues a bunch of requests as part of prefetch but then
it turns out it can service the query without reading those pages, it
may very well happen that these messages stay in the TCP until the next
smgr read happens, either in that session, or possibly in another
session.

If there are too many unread responses in the TCP connection, the kernel on
the pageserver is going to backpressure into userspace, resulting in our
stuck flush().

All of this stems from the way vanilla Postgres does prefetching and
"async IO":
it issues `fadvise()` to make the kernel do the IO in the background,
buffering results in the kernel page cache.
It then consumes the results through synchronous `read()` system calls,
which hopefully will be fast because of the `fadvise()`.

If it turns out that some / all of the prefetch results are not needed,
Postgres will not be issuing those `read()` system calls.
The kernel will eventually react to that by reusing page cache pages
that hold completed prefetched data.
Uncompleted prefetch requests may or may not be processed -- it's up to
the kernel.

In Neon, the smgr + Pageserver together take on the role of the kernel
in above paragraphs.
In the current implementation, all prefetches are sent as GetPage
requests to Pageserver.
The responses are only processed in the places where vanilla Postgres
would do the synchronous `read()` system call.
If we never get to that, the responses are queued inside the TCP
connection, which, once buffers run full, will backpressure into
Pageserver's sending code, i.e., the `pgb_writer.flush()` that was the
root cause of the problems we're fixing in this PR.
2025-01-16 20:34:02 +00:00
Vlad Lazar
3a285a046b pageserver: include node id when subscribing to SK (#10432)
## Problem

All pageserver have the same application name which makes it hard to
distinguish them.

## Summary of changes

Include the node id in the application name sent to the safekeeper. This
should give us more visibility in logs. There are a few metrics that will
increase in
cardinality by `pageserver_count`,
but that's fine.
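
For illustration, a sketch with `tokio-postgres`; the actual connection setup and name format in the pageserver may differ:

```rust
// Hypothetical name format; the point is that each pageserver identifies itself.
fn safekeeper_connect_config(node_id: u64, host: &str) -> tokio_postgres::Config {
    let mut config = tokio_postgres::Config::new();
    config
        .host(host)
        .application_name(&format!("pageserver-{node_id}"));
    config
}
```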
2025-01-16 18:51:56 +00:00
Christian Schwarz
d2f8342080 Merge branch 'problame/hung-shutdown/fix' into vlad/read-path-concurrent-io 2025-01-16 18:16:36 +01:00
Christian Schwarz
c19a16792a address nit ; https://github.com/neondatabase/neon/pull/10386#discussion_r1918782034 2025-01-16 17:54:14 +01:00
Christian Schwarz
cf75eb7d86 Revert "hacky experiment: what if we had more walredo procs => doesn't move the needle on throughput"
This reverts commit 9fffe6e60d.
2025-01-16 16:46:49 +01:00
Christian Schwarz
6ededa17e2 Revert "experiment: buffered socket with 128k buffer size; not super needle-moving"
This reverts commit 7e13e5fc4a.
2025-01-16 16:42:10 +01:00
Christian Schwarz
7e13e5fc4a experiment: buffered socket with 128k buffer size; not super needle-moving 2025-01-16 16:42:01 +01:00
Alex Chi Z.
cccc196848 refactor(pageserver): make partitioning an ArcSwap (#10377)
## Problem

gc-compaction needs the partitioning data to decide the job split. This
refactor allows concurrent access/computing the partitioning.

## Summary of changes

Make `partitioning` an ArcSwap so that others can access the
partitioning while we compute it. Fully eliminate the `repartition is
called concurrently` warning when gc-compaction is going on.
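
A minimal sketch of the access pattern with the `arc_swap` crate (the stand-in `Partitioning` struct and method names are illustrative):

```rust
use std::sync::Arc;

use arc_swap::ArcSwap;

// Illustrative stand-in for the partitioning data.
#[derive(Default)]
struct Partitioning {
    // key partitions, partition LSN, ...
}

struct TimelineStateSketch {
    partitioning: ArcSwap<Partitioning>,
}

impl TimelineStateSketch {
    fn new() -> Self {
        Self {
            partitioning: ArcSwap::from_pointee(Partitioning::default()),
        }
    }

    // Readers (e.g. gc-compaction planning its job split) take a cheap
    // snapshot without blocking the writer.
    fn current_partitioning(&self) -> Arc<Partitioning> {
        self.partitioning.load_full()
    }

    // The repartition task computes a new value and swaps it in atomically.
    fn store_partitioning(&self, new: Partitioning) {
        self.partitioning.store(Arc::new(new));
    }
}
```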

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-01-16 15:33:37 +00:00
Christian Schwarz
9fffe6e60d hacky experiment: what if we had more walredo procs => doesn't move the needle on throughput 2025-01-16 13:58:23 +01:00
Alex Chi Z.
a753349cb0 feat(pageserver): validate data integrity during gc-compaction (#10131)
## Problem

part of https://github.com/neondatabase/neon/issues/9114
part of investigation of
https://github.com/neondatabase/neon/issues/10049

## Summary of changes

* If `cfg!(test)` or `cfg!(feature = "testing")` holds, we will always try
generating an image to ensure the history is replayable, but not put the
image layer into the final layer results, thereby discovering a wrong
key history before we hit a read error (see the sketch after this list).
* I suspect it's easier to trigger some races if gc-compaction is
continuously run on a timeline, so I increased the frequency to twice
per 10 churns.
* Also, create branches in gc-compaction smoke tests to get more test
coverage.
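
A sketch of the cfg-gated check from the first bullet (function names are invented; the real compaction code is more involved):

```rust
// Invented names; only the cfg-gating pattern is the point here.
fn validate_history_is_replayable() -> Result<(), String> {
    // Try to reconstruct an image from the retained key history; a wrong
    // history surfaces here instead of as a read error much later.
    Ok(())
}

fn maybe_run_extra_validation() -> Result<(), String> {
    if cfg!(test) || cfg!(feature = "testing") {
        // The image is generated only for validation; it is not added to the
        // final layer results.
        validate_history_is_replayable()?;
    }
    Ok(())
}
```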

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad@neon.tech>
2025-01-15 22:04:06 +00:00
Christian Schwarz
66c0df8109 doc comment on BatchedFeMessage explaining WeakHandle; https://github.com/neondatabase/neon/pull/10386#discussion_r1916968951 2025-01-15 21:50:00 +01:00
Christian Schwarz
9fe77c527f inline get_impl; https://github.com/neondatabase/neon/pull/10386#discussion_r1916939623 2025-01-15 21:47:39 +01:00
Christian Schwarz
7fb4595c7e fix: WeakHandle was holding on to the Timeline allocation
This made test_timeline_deletion_with_files_stuck_in_upload_queue fail
because the RemoteTimelineClient was being kept alive.

The fix is to stop keeping the timeline alive from WeakHandle.
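
The underlying `Weak` semantics, for reference: dropping the last strong reference runs `Drop` even while weak references remain, which is what lets the RemoteTimelineClient go away. A small standalone example:

```rust
use std::sync::{Arc, Weak};

struct Timeline;

impl Drop for Timeline {
    fn drop(&mut self) {
        // Resources owned by the Timeline (e.g. its RemoteTimelineClient)
        // are released here.
        println!("timeline dropped");
    }
}

fn main() {
    let strong: Arc<Timeline> = Arc::new(Timeline);
    let weak: Weak<Timeline> = Arc::downgrade(&strong);

    drop(strong); // last strong ref gone: Drop runs now, despite the Weak
    assert!(weak.upgrade().is_none());
}
```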
2025-01-15 21:46:37 +01:00
Christian Schwarz
350dc251df test case demonstrates the issue: we hold the Timeline object alive
--- STDERR:              pageserver tenant::timeline::handle::tests::test_weak_handles ---
thread 'tenant::timeline::handle::tests::test_weak_handles' panicked at pageserver/src/tenant/timeline/handle.rs:1131:9:
assertion `left == right` failed
  left: 3
 right: 2
2025-01-15 21:46:30 +01:00
John Spray
fb0e2acb2f pageserver: add page_trace API for debugging (#10293)
## Problem

When a pageserver is receiving high rates of requests, we don't have a
good way to efficiently discover what the client's access pattern is.

Closes: https://github.com/neondatabase/neon/issues/10275

## Summary of changes

- Add
`/v1/tenant/x/timeline/y/page_trace?size_limit_bytes=...&time_limit_secs=...`
API, which returns a binary buffer.
- Add `pagectl page-trace` tool to decode and analyze the output.

---------

Co-authored-by: Erik Grinaker <erik@neon.tech>
2025-01-15 19:07:22 +00:00
Christian Schwarz
5b77a6d3ce address clippy 2025-01-15 19:38:21 +01:00
Christian Schwarz
8c5005ff59 rename IoConcurrency::{todo=>serial} and remove deprecation warning 2025-01-15 19:38:05 +01:00
Christian Schwarz
f8218ac5fc Revert "investigation: add log_if_slow => shows that the io_futures are slow"
This reverts commit e81fa7137e.
2025-01-15 19:34:37 +01:00
Christian Schwarz
40470c66cd remove opportunistic poll, it seems slightly beneficial for perf
Especially before I remembered to configure pipelining, the unpipelined
configuration achieved ~10% higher tput.

In any case, it makes sense to not do the opportunistic polling because
it registers the wrong waker.
2025-01-15 19:34:05 +01:00
Christian Schwarz
8fafff37c5 remove the whole barriers business 2025-01-15 19:00:00 +01:00
Christian Schwarz
e81fa7137e investigation: add log_if_slow => shows that the io_futures are slow 2025-01-15 18:56:07 +01:00
Christian Schwarz
351da2349e Merge branch 'problame/hung-shutdown/fix' into vlad/read-path-concurrent-io 2025-01-15 17:09:02 +01:00
Christian Schwarz
c545d227b9 review doc comment 2025-01-15 16:24:39 +01:00
Christian Schwarz
a4fc6a92c9 fix cargo doc 2025-01-15 16:10:04 +01:00
Christian Schwarz
2205736262 doc comment & one fixup 2025-01-15 14:27:08 +01:00
Vlad Lazar
1577430408 safekeeper: decode and interpret for multiple shards in one go (#10201)
## Problem

Currently, we call `InterpretedWalRecord::from_bytes_filtered`
from each shard. To serve multiple shards at the same time,
the API needs to allow for enquiring about multiple shards.

## Summary of changes

This commit tweaks it in a pretty brute-force way. Naively, we could
just generate the shard for a key, but pre- and post-split shards
may be subscribed at the same time, so doing it efficiently is more
complex.
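
A hypothetical sketch of the multi-shard shape (the types and the function signature here are invented; the real `InterpretedWalRecord::from_bytes_filtered` differs):

```rust
use std::collections::HashMap;

// Invented stand-in types.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ShardIdentity {
    number: u8,
    count: u8,
}

struct InterpretedWalRecord {
    // shard-filtered key/value changes, metadata, ...
}

// Decode the WAL once, then produce a per-shard interpretation for every
// subscribed shard in one call, instead of re-running the whole decode
// per shard.
fn interpret_for_shards(
    wal: &[u8],
    shards: &[ShardIdentity],
) -> HashMap<ShardIdentity, InterpretedWalRecord> {
    let mut out = HashMap::with_capacity(shards.len());
    for &shard in shards {
        // Brute force, as described above: pre- and post-split shards may be
        // subscribed simultaneously, so a key can belong to several of them
        // and each shard gets its own filtered record.
        out.insert(shard, filter_for_shard(wal, shard));
    }
    out
}

fn filter_for_shard(_wal: &[u8], _shard: ShardIdentity) -> InterpretedWalRecord {
    InterpretedWalRecord {}
}
```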
2025-01-15 11:10:24 +00:00
Christian Schwarz
0340f00228 post-merge fix the handling of the new pagestream Test message, so that the regression test now passes
non-package-mode-py3.10christian@neon-hetzner-dev-christian:[~/src/neon]: BUILD_TYPE=debug DEFAULT_PG_VERSION=16 poetry run pytest ./test_runner/regress/test_page_service_batching_regressions.py --timeout=0 --pdb
2025-01-14 23:56:35 +01:00
Christian Schwarz
366ff9ffcc Merge branch 'problame/hung-shutdown/demo-hypothesis' into problame/hung-shutdown/fix 2025-01-14 23:51:53 +01:00