Commit Graph

7037 Commits

Author SHA1 Message Date
Christian Schwarz
cf75eb7d86 Revert "hacky experiment: what if we had more walredo procs => doesn't move the needle on throughput"
This reverts commit 9fffe6e60d.
2025-01-16 16:46:49 +01:00
Christian Schwarz
6ededa17e2 Revert "experiment: buffered socket with 128k buffer size; not super needle-moving"
This reverts commit 7e13e5fc4a.
2025-01-16 16:42:10 +01:00
Christian Schwarz
7e13e5fc4a experiment: buffered socket with 128k buffer size; not super needle-moving 2025-01-16 16:42:01 +01:00
Christian Schwarz
45358bcb65 in the deepl_layers_with_delta script, make the stack height an argument 2025-01-16 16:41:15 +01:00
Christian Schwarz
9fffe6e60d hacky experiment: what if we had more walredo procs => doesn't move the needle on throughput 2025-01-16 13:58:23 +01:00
Christian Schwarz
2ff0a4ae82 extract the l0stack generator into a reusable python module 2025-01-16 13:24:34 +01:00
Christian Schwarz
5b77a6d3ce address clippy 2025-01-15 19:38:21 +01:00
Christian Schwarz
8c5005ff59 rename IoConcurrency::{todo=>serial} and remove deprecation warning 2025-01-15 19:38:05 +01:00
Christian Schwarz
f8218ac5fc Revert "investigation: add log_if_slow => shows that the io_futures are slow"
This reverts commit e81fa7137e.
2025-01-15 19:34:37 +01:00
Christian Schwarz
40470c66cd remove opportunistic poll, it seems slightly beneficial for perf
Especially before I remembered to configure pipelining, the unpipelined
configuration achieved ~10% higher tput.

In any case, it makes sense to not do the opportunistic polling because
it registers the wrong waker.
2025-01-15 19:34:05 +01:00
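To illustrate the "wrong waker" reasoning above, here is a small self-contained Rust sketch (not pageserver code, all names invented): a leaf future remembers only the last waker it was polled with, so an "opportunistic" poll from a throwaway context means the later wake-up never reaches the task that actually owns the future.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// A leaf future that never completes and remembers only the *last* waker it
/// was polled with -- the common behaviour that makes opportunistic polling
/// from the wrong context dangerous.
struct RemembersLastWaker {
    last_waker: Arc<Mutex<Option<Waker>>>,
}

impl Future for RemembersLastWaker {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        *self.last_waker.lock().unwrap() = Some(cx.waker().clone());
        Poll::Pending
    }
}

/// A waker that does nothing, standing in for "some context other than the
/// task that actually owns this future".
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn main() {
    let slot = Arc::new(Mutex::new(None));
    let mut fut = RemembersLastWaker { last_waker: slot.clone() };

    // "Opportunistic" poll with a throwaway waker.
    let noop_waker: Waker = Arc::new(NoopWake).into();
    let mut cx = Context::from_waker(&noop_waker);
    assert!(Pin::new(&mut fut).poll(&mut cx).is_pending());

    // The future now holds the throwaway waker. When the awaited event fires
    // and wakes it, the real executor task is never notified, so the task can
    // appear hung until something else happens to poll it again.
    let stored = slot.lock().unwrap().clone().unwrap();
    stored.wake(); // notifies nobody
    println!("the stored waker was the throwaway one; the owning task was not woken");
}
```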
Christian Schwarz
9b9479881a extend script with instructions to configure batching 2025-01-15 19:30:15 +01:00
Christian Schwarz
af11b201bd now the issue is no longer reproducible, maybe it was the barriers? 2025-01-15 19:10:45 +01:00
Christian Schwarz
8fafff37c5 remove the whole barriers business 2025-01-15 19:00:00 +01:00
Christian Schwarz
e81fa7137e investigation: add log_if_slow => shows that the io_futures are slow 2025-01-15 18:56:07 +01:00
Christian Schwarz
e60738f029 it's reproducible before the merge, so, continuing to investigate and fix here 2025-01-15 18:43:01 +01:00
Christian Schwarz
f75b07a160 I find that if I ever go beyond queue-depth=4, something in the pageserver locks up. 2025-01-15 18:31:40 +01:00
Christian Schwarz
a5524fcf4d add comment to use queue-depthed pagebench to the script 2025-01-15 18:31:29 +01:00
Christian Schwarz
351da2349e Merge branch 'problame/hung-shutdown/fix' into vlad/read-path-concurrent-io 2025-01-15 17:09:02 +01:00
Christian Schwarz
c545d227b9 review doc comment 2025-01-15 16:24:39 +01:00
Christian Schwarz
a4fc6a92c9 fix cargo doc 2025-01-15 16:10:04 +01:00
Christian Schwarz
2205736262 doc comment & one fixup 2025-01-15 14:27:08 +01:00
Christian Schwarz
5f9ddbae2f Merge branch 'problame/hung-shutdown/demo-hypothesis' into problame/hung-shutdown/fix 2025-01-15 00:25:11 +01:00
Christian Schwarz
173f18832c fixup 2025-01-15 00:24:59 +01:00
Christian Schwarz
23bd5833e1 Merge branch 'problame/hung-shutdown/demo-hypothesis' into problame/hung-shutdown/fix 2025-01-15 00:21:54 +01:00
Christian Schwarz
dedd524d7e refinements 2025-01-15 00:21:28 +01:00
Christian Schwarz
0340f00228 post-merge fix the handling of the new pagestream Test message, so that the regression test now passes
non-package-mode-py3.10christian@neon-hetzner-dev-christian:[~/src/neon]: BUILD_TYPE=debug DEFAULT_PG_VERSION=16 poetry run pytest ./test_runner/regress/test_page_service_batching_regressions.py --timeout=0 --pdb
2025-01-14 23:56:35 +01:00
Christian Schwarz
366ff9ffcc Merge branch 'problame/hung-shutdown/demo-hypothesis' into problame/hung-shutdown/fix 2025-01-14 23:51:53 +01:00
Christian Schwarz
a8f9b564be fix the `cd pageserver && cargo clippy --features testing` build 2025-01-14 23:50:22 +01:00
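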
Christian Schwarz
5450e54dab bump ci 2025-01-14 22:47:16 +01:00
Christian Schwarz
53b05c4ba0 cleanups to make CI pass (well, fail because the bug isn't fixed yet) 2025-01-14 22:45:09 +01:00
Christian Schwarz
1f7d173235 Merge remote-tracking branch 'origin/main' into problame/hung-shutdown/demo-hypothesis 2025-01-14 22:33:20 +01:00
Christian Schwarz
8454e19a0f address warnings and such 2025-01-14 22:28:08 +01:00
Christian Schwarz
45e08d0aa5 it repros 2025-01-14 22:16:27 +01:00
Erik Grinaker
6debb49b87 pageserver: coalesce index uploads when possible (#10248)
## Problem

With upload queue reordering in #10218, we can easily get into a
situation where multiple index uploads are queued back to back, which
can't be parallelized. This will happen e.g. when multiple layer flushes
enqueue layer/index/layer/index/... and the layers skip the queue and
are uploaded in parallel.

These index uploads will incur serial S3 roundtrip latencies, and may
block later operations.

Touches #10096.

## Summary of changes

When multiple back-to-back index uploads are ready to upload, only
upload the most recent index and drop the rest.
2025-01-14 21:10:17 +00:00
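A minimal sketch of the coalescing idea, using hypothetical operation types rather than the actual upload queue code: when the next ready operations are several index uploads in a row, only the newest one needs to go to remote storage.

```rust
use std::collections::VecDeque;

/// Hypothetical upload queue operations, for illustration only.
#[derive(Debug)]
enum Op {
    UploadLayer(String),
    UploadIndex { generation: u64 },
}

/// Pop the next operation to execute, dropping index uploads that are
/// immediately superseded by a newer index upload right behind them.
fn next_op(queue: &mut VecDeque<Op>) -> Option<Op> {
    let mut op = queue.pop_front()?;
    while matches!(op, Op::UploadIndex { .. })
        && matches!(queue.front(), Some(Op::UploadIndex { .. }))
    {
        // The older index is obsolete: the newer one fully describes the
        // state, so uploading both would just add serial S3 round trips.
        op = queue.pop_front().unwrap();
    }
    Some(op)
}

fn main() {
    let mut queue: VecDeque<Op> = [
        Op::UploadLayer("000A".into()),
        Op::UploadIndex { generation: 1 },
        Op::UploadIndex { generation: 2 },
        Op::UploadIndex { generation: 3 },
        Op::UploadLayer("000B".into()),
    ]
    .into_iter()
    .collect();

    while let Some(op) = next_op(&mut queue) {
        println!("{op:?}");
    }
    // Prints UploadLayer("000A"), UploadIndex { generation: 3 }, UploadLayer("000B")
}
```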
Christian Schwarz
9a02bc0cfd try to repro root cause hypothesis for https://github.com/neondatabase/neon/issues/10309
This approach doesn't work because it slows down all the responses.

the workload() thread gets stuck in auth, prob with zero pipeline depth

  0x00007fa28fe48e63 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  (gdb) bt
  #0  0x00007fa28fe48e63 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x0000561a285bf44e in WaitEventSetWaitBlock (set=0x561a292723e8, cur_timeout=9999, occurred_events=0x7ffd1f11c970, nevents=1)
      at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/ipc/latch.c:1535
  #2  0x0000561a285bf338 in WaitEventSetWait (set=0x561a292723e8, timeout=9999, occurred_events=0x7ffd1f11c970, nevents=1, wait_event_info=117440512)
      at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/ipc/latch.c:1481
  #3  0x00007fa2904a7345 in call_PQgetCopyData (shard_no=0, buffer=0x7ffd1f11cad0) at /home/christian/src/neon//pgxn/neon/libpagestore.c:703
  #4  0x00007fa2904a7aec in pageserver_receive (shard_no=0) at /home/christian/src/neon//pgxn/neon/libpagestore.c:899
  #5  0x00007fa2904af471 in prefetch_read (slot=0x561a292863b0) at /home/christian/src/neon//pgxn/neon/pagestore_smgr.c:644
  #6  0x00007fa2904af26b in prefetch_wait_for (ring_index=0) at /home/christian/src/neon//pgxn/neon/pagestore_smgr.c:596
  #7  0x00007fa2904b489d in neon_read_at_lsnv (rinfo=..., forkNum=MAIN_FORKNUM, base_blockno=0, request_lsns=0x7ffd1f11cd60, buffers=0x7ffd1f11cd30, nblocks=1, mask=0x0)
      at /home/christian/src/neon//pgxn/neon/pagestore_smgr.c:3024
  #8  0x00007fa2904b4f34 in neon_read_at_lsn (rinfo=..., forkNum=MAIN_FORKNUM, blkno=0, request_lsns=..., buffer=0x7fa28b969000) at /home/christian/src/neon//pgxn/neon/pagestore_smgr.c:3104
  #9  0x00007fa2904b515d in neon_read (reln=0x561a292ef448, forkNum=MAIN_FORKNUM, blkno=0, buffer=0x7fa28b969000) at /home/christian/src/neon//pgxn/neon/pagestore_smgr.c:3146
  #10 0x0000561a285f1ed5 in smgrread (reln=0x561a292ef448, forknum=MAIN_FORKNUM, blocknum=0, buffer=0x7fa28b969000) at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/smgr/smgr.c:567
  #11 0x0000561a285a528a in ReadBuffer_common (smgr=0x561a292ef448, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7ffd1f11cf1b)
      at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/buffer/bufmgr.c:1134
  #12 0x0000561a285a47e3 in ReadBufferExtended (reln=0x7fa28ce1c888, forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0)
      at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/buffer/bufmgr.c:781
  #13 0x0000561a285a46ad in ReadBuffer (reln=0x7fa28ce1c888, blockNum=0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/storage/buffer/bufmgr.c:715
  #14 0x0000561a2811d511 in _bt_getbuf (rel=0x7fa28ce1c888, blkno=0, access=1) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/nbtree/nbtpage.c:852
  #15 0x0000561a2811d1b2 in _bt_metaversion (rel=0x7fa28ce1c888, heapkeyspace=0x7ffd1f11d9f0, allequalimage=0x7ffd1f11d9f1) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/nbtree/nbtpage.c:747
  #16 0x0000561a28126220 in _bt_first (scan=0x561a292d0348, dir=ForwardScanDirection) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/nbtree/nbtsearch.c:1465
  #17 0x0000561a28121a07 in btgettuple (scan=0x561a292d0348, dir=ForwardScanDirection) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/nbtree/nbtree.c:246
  #18 0x0000561a28111afa in index_getnext_tid (scan=0x561a292d0348, direction=ForwardScanDirection) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/index/indexam.c:583
  #19 0x0000561a28111d14 in index_getnext_slot (scan=0x561a292d0348, direction=ForwardScanDirection, slot=0x561a292d01a8) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/index/indexam.c:675
  #20 0x0000561a2810fbcc in systable_getnext (sysscan=0x561a292d0158) at /home/christian/src/neon//vendor/postgres-v16/src/backend/access/index/genam.c:512
  #21 0x0000561a287a1ee1 in SearchCatCacheMiss (cache=0x561a292a0f80, nkeys=1, hashValue=3028463561, hashIndex=1, v1=94670359561576, v2=0, v3=0, v4=0)
      at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/cache/catcache.c:1440
  #22 0x0000561a287a1d8a in SearchCatCacheInternal (cache=0x561a292a0f80, nkeys=1, v1=94670359561576, v2=0, v3=0, v4=0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/cache/catcache.c:1360
  #23 0x0000561a287a1a4f in SearchCatCache (cache=0x561a292a0f80, v1=94670359561576, v2=0, v3=0, v4=0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/cache/catcache.c:1214
  #24 0x0000561a287be060 in SearchSysCache (cacheId=10, key1=94670359561576, key2=0, key3=0, key4=0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/cache/syscache.c:817
  #25 0x0000561a287be66f in GetSysCacheOid (cacheId=10, oidcol=1, key1=94670359561576, key2=0, key3=0, key4=0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/cache/syscache.c:1055
  #26 0x0000561a286319a5 in get_role_oid (rolname=0x561a29270568 "cloud_admin", missing_ok=true) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/adt/acl.c:5251
  #27 0x0000561a283d42ca in check_hba (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/libpq/hba.c:2493
  #28 0x0000561a283d5537 in hba_getauthmethod (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/libpq/hba.c:3067
  #29 0x0000561a283c6fd7 in ClientAuthentication (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/libpq/auth.c:395
  #30 0x0000561a287dc943 in PerformAuthentication (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/init/postinit.c:247
  #31 0x0000561a287dd9cd in InitPostgres (in_dbname=0x561a29270588 "postgres", dboid=0, username=0x561a29270568 "cloud_admin", useroid=0, load_session_libraries=true, override_allow_connections=false,
      out_dbname=0x0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/utils/init/postinit.c:929
  #32 0x0000561a285fa10b in PostgresMain (dbname=0x561a29270588 "postgres", username=0x561a29270568 "cloud_admin") at /home/christian/src/neon//vendor/postgres-v16/src/backend/tcop/postgres.c:4293
  #33 0x0000561a28524ce4 in BackendRun (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/postmaster/postmaster.c:4465
  #34 0x0000561a285245da in BackendStartup (port=0x561a29268de0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/postmaster/postmaster.c:4193
  #35 0x0000561a285209c4 in ServerLoop () at /home/christian/src/neon//vendor/postgres-v16/src/backend/postmaster/postmaster.c:1782
  #36 0x0000561a2852030f in PostmasterMain (argc=3, argv=0x561a291c5fc0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/postmaster/postmaster.c:1466
  #37 0x0000561a283dd987 in main (argc=3, argv=0x561a291c5fc0) at /home/christian/src/neon//vendor/postgres-v16/src/backend/main/main.c:238
2025-01-14 20:42:01 +01:00
Erik Grinaker
e58e29e639 pageserver: limit number of upload queue tasks (#10384)
## Problem

The upload queue can currently schedule an arbitrary number of tasks.
This can both spawn an unbounded number of Tokio tasks and significantly
slow down upload queue scheduling, as it's quadratic in the number of
operations.

Touches #10096.

## Summary of changes

Limit the number of in-progress tasks to the remote storage upload
concurrency. While this concurrency limit is shared across all tenants,
there's certainly no point in scheduling more than this -- we could even
consider setting the limit lower, but don't for now to avoid
artificially constraining tenants.
2025-01-14 18:01:14 +00:00
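A minimal sketch of bounding the number of in-flight upload tasks with a semaphore, assuming tokio; the names and structure are illustrative, not the actual upload queue implementation.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Run each upload in its own task, but never have more than
/// `max_inprogress` of them in flight at once.
async fn run_uploads(ops: Vec<String>, max_inprogress: usize) {
    let semaphore = Arc::new(Semaphore::new(max_inprogress));
    let mut handles = Vec::new();
    for op in ops {
        // Acquire before spawning, so at most `max_inprogress` tasks exist.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released when this task finishes
            println!("uploading {op}");
            // ... the remote storage PUT would go here ...
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}

#[tokio::main]
async fn main() {
    let ops = (0..8).map(|i| format!("layer-{i}")).collect();
    run_uploads(ops, 2).await;
}
```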
Heikki Linnakangas
d36112d20f Simplify compute dockerfile by setting PATH just once (#10357)
By setting PATH in the 'pg-build' layer, all the extension build layers
will inherit it. No need to pass PG_CONFIG to all the various make
invocations either: once pg_config is in PATH, the Makefiles will pick
it up from there.
2025-01-14 17:02:35 +00:00
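A minimal multi-stage Dockerfile sketch of the idea (stage names and paths are hypothetical, not the actual compute Dockerfile): PATH is set once in the shared build stage, and every stage built FROM it inherits it, so pg_config is found without passing PG_CONFIG around.

```dockerfile
# Hypothetical stage names and paths, for illustration only.
FROM debian:bookworm-slim AS pg-build
# ... build Postgres into /usr/local/pgsql (steps elided) ...
# Put the Postgres bin dir on PATH once, in the shared base stage.
ENV PATH="/usr/local/pgsql/bin:${PATH}"

FROM pg-build AS extension-build
# PATH (and thus pg_config) is inherited from the pg-build stage, so a plain
# `make install` works without passing PG_CONFIG=... to every invocation.
# RUN make -C /ext install
```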
Erik Grinaker
ffaa52ff5d pageserver: reorder upload queue when possible (#10218)
## Problem

The upload queue currently sees significant head-of-line blocking. For
example, index uploads act as upload barriers, and for every layer flush
we schedule a layer and index upload, which effectively serializes layer
uploads.

Resolves #10096.

## Summary of changes

Allow upload queue operations to bypass the queue if they don't conflict
with preceding operations, increasing parallelism.

NB: the upload queue currently schedules an explicit barrier after every
layer flush as well (see #8550). This must be removed to enable
parallelism. This will require a better mechanism for compaction
backpressure, see e.g. #8390 or #5415.
2025-01-14 16:31:59 +00:00
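A minimal sketch of the bypass idea, with hypothetical operation kinds and deliberately simplified conflict rules (the real queue tracks more operation types and metadata): a queued operation may start ahead of older work only if it conflicts with none of it.

```rust
/// Hypothetical upload queue operations, for illustration only.
#[derive(Debug)]
enum Op {
    UploadLayer(String),
    UploadIndex,
    DeleteLayer(String),
}

/// Simplified ordering rules: does `op` have to wait for `earlier`?
fn conflicts(op: &Op, earlier: &Op) -> bool {
    use Op::*;
    match (op, earlier) {
        // An index upload describes the state produced by everything before
        // it, so it must wait for every earlier operation.
        (UploadIndex, _) => true,
        // Uploads and deletes of the same layer must keep their order.
        (UploadLayer(x), DeleteLayer(y)) => x == y,
        (DeleteLayer(x), UploadLayer(y)) => x == y,
        // A delete must not overtake an index that may still reference it.
        (DeleteLayer(_), UploadIndex) => true,
        _ => false,
    }
}

/// Pick the first queued operation that conflicts with nothing ahead of it;
/// it may "skip the line" and run concurrently with in-progress work.
fn next_ready(queue: &[Op], in_progress: &[Op]) -> Option<usize> {
    'outer: for (i, op) in queue.iter().enumerate() {
        for earlier in in_progress.iter().chain(&queue[..i]) {
            if conflicts(op, earlier) {
                continue 'outer;
            }
        }
        return Some(i);
    }
    None
}

fn main() {
    let in_progress = vec![Op::UploadLayer("000A".into())];
    let queue = vec![Op::UploadIndex, Op::UploadLayer("000B".into())];
    // The index upload must wait for the in-progress layer upload, but the
    // second layer upload conflicts with nothing, so it can bypass the index.
    assert_eq!(next_ready(&queue, &in_progress), Some(1));
    println!("next ready op: {:?}", queue[1]);
}
```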
John Spray
aa7323a384 storage controller: quality of life improvements for AZ handling (#10379)
## Problem

Since https://github.com/neondatabase/neon/pull/9916, the preferred AZ
of a tenant is much more impactful, and we would like to make it more
visible in tooling.

## Summary of changes

- Include AZ in node describe API
- Include AZ info in node & tenant outputs in CLI
- Add metrics for per-node shard counts, labelled by AZ
- Add a CLI for setting preferred AZ on a tenant
- Extend AZ-setting API+CLI to handle None for clearing preferred AZ
2025-01-14 15:30:43 +00:00
Christian Schwarz
2466a2f977 page_service: throttle individual requests instead of the batched request (#10353)
## Problem

Before this PR, the pagestream throttle was applied weighted on a
per-batch basis.
This had several problems:

1. The throttle occurrence counters were only bumped by `1` instead of
   `batch_size`.
2. The throttle wait time aggregator metric only counted one wait time,
   irrespective of `batch_size`. That makes sense in some ways of looking
   at it but not in others.
3. If the last request in the batch runs into the throttle, the other
   requests in the batch are also throttled, i.e., over-throttling happens
   (theoretical, didn't measure it in practice).

## Solution

It occurred to me that we can simply push the throttling upwards into
`pagestream_read_message`.

This has the added benefit that in pipeline mode, the `executor` stage
will, if it is idle, steal whatever requests already made it into the
`spsc_fold` and execute them; before this change, that was not the case -
the throttling happened in the `executor` stage instead of the `batcher`
stage.
## Code Changes

There are four changes in this PR:

1. Lift the throttling up into the `pagestream_read_message` method.
2. Move the throttling metrics out of the `Throttle` type into
   `SmgrOpMetrics`. Unlike the other smgr metrics, throttling is
   per-tenant, hence the Arc.
3. Refactor the `SmgrOpTimer` implementation to account for the new
   observation states, and simplify its design.
4. Drive-by fix for the flush time metrics: they were using the same `now`
   in the `observe_guard` every time.

The `SmgrOpTimer` is now a state machine.
Each observation point moves the state machine forward.
If a timer object is dropped early, some "pair"-like metrics still
require an increment or observation.
That's done in the Drop implementation, by driving the state machine to
completion.
2025-01-14 15:28:01 +00:00
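A minimal sketch of the "timer as a state machine" idea described above, with hypothetical names and metrics rather than the actual SmgrOpTimer: each observation point advances the state, and Drop drives the machine to completion so paired metrics stay balanced even when the timer is dropped early.

```rust
use std::time::Instant;

#[derive(Default)]
struct Metrics {
    wait_seconds: f64,
    execute_seconds: f64,
    started: u64,
    finished: u64,
}

#[derive(Clone, Copy)]
enum State {
    Received { at: Instant },
    Executing { at: Instant },
    Done,
}

struct OpTimer<'m> {
    state: State,
    metrics: &'m mut Metrics,
}

impl<'m> OpTimer<'m> {
    fn start(metrics: &'m mut Metrics) -> Self {
        metrics.started += 1;
        OpTimer { state: State::Received { at: Instant::now() }, metrics }
    }

    // Observation point: the request leaves the queue and starts executing.
    fn observe_execution_start(&mut self) {
        if let State::Received { at } = self.state {
            self.metrics.wait_seconds += at.elapsed().as_secs_f64();
            self.state = State::Executing { at: Instant::now() };
        }
    }

    // Observation point: the response has been flushed to the client.
    fn observe_flush_done(&mut self) {
        if let State::Executing { at } = self.state {
            self.metrics.execute_seconds += at.elapsed().as_secs_f64();
            self.metrics.finished += 1;
            self.state = State::Done;
        }
    }
}

impl Drop for OpTimer<'_> {
    // If the timer is dropped early (e.g. the connection went away), drive
    // the state machine to completion so "paired" metrics still balance out.
    fn drop(&mut self) {
        self.observe_execution_start();
        self.observe_flush_done();
    }
}

fn main() {
    let mut metrics = Metrics::default();
    {
        let mut timer = OpTimer::start(&mut metrics);
        timer.observe_execution_start();
        // Dropped here without observe_flush_done(): Drop completes the pair.
    }
    assert_eq!(metrics.started, metrics.finished);
    println!(
        "started={} finished={} wait={:.6}s execute={:.6}s",
        metrics.started, metrics.finished, metrics.wait_seconds, metrics.execute_seconds
    );
}
```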
Alex Chi Z.
9bdb14c1c0 fix(pageserver): ensure initial image layers have correct key ranges (#10374)
## Problem

Discovered during the relation dir refactor work.

If we do not create images as in this patch, we would get two sets of
image layers:

```
0000...METADATA_KEYS
0000...REL_KEYS
```

They overlap at the same LSN and would cause data loss for relation
keys. This doesn't happen in prod because initial image layer generation
is never called, but it is better to fix this now to avoid future issues
with the reldir refactors.

## Summary of changes

* Consolidate the create_image_layers calls into a single one.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-01-14 15:27:48 +00:00
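For illustration, a minimal sketch of the invariant at stake, with keys and LSNs reduced to plain integers (not the actual pageserver types): image layers created at the same LSN must cover disjoint key ranges, otherwise reads can resolve to the wrong layer.

```rust
use std::ops::Range;

#[derive(Debug)]
struct ImageLayer {
    lsn: u64,
    key_range: Range<u64>,
}

/// Returns true if any two layers at the same LSN have overlapping key ranges.
fn has_overlap_at_same_lsn(layers: &[ImageLayer]) -> bool {
    for (i, a) in layers.iter().enumerate() {
        for b in &layers[i + 1..] {
            if a.lsn == b.lsn
                && a.key_range.start < b.key_range.end
                && b.key_range.start < a.key_range.end
            {
                return true;
            }
        }
    }
    false
}

fn main() {
    // One layer covering "metadata" keys and one covering "relation" keys,
    // both created at the same LSN with overlapping key ranges: invalid.
    let layers = vec![
        ImageLayer { lsn: 100, key_range: 0..1_000 },
        ImageLayer { lsn: 100, key_range: 500..2_000 },
    ];
    assert!(has_overlap_at_same_lsn(&layers));
    println!("overlap at same LSN detected: {layers:?}");
}
```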
Christian Schwarz
9e1cd986d7 address "why mut" nits; https://github.com/neondatabase/neon/pull/10353#discussion_r1913685991 https://github.com/neondatabase/neon/pull/10353#discussion_r1913683065 https://github.com/neondatabase/neon/pull/10353#discussion_r1913683392 2025-01-14 15:30:33 +01:00
Christian Schwarz
47544dcc0b simplify SmgrOpTimerState variant names, and add some doc comments; https://github.com/neondatabase/neon/pull/10353#discussion_r1913676569 and https://github.com/neondatabase/neon/pull/10353#discussion_r1913676824 2025-01-14 15:28:09 +01:00
Christian Schwarz
4e094d9638 rearrange code & inline HandleInner::shutdown() to minimize the diff 2025-01-14 15:12:46 +01:00
Christian Schwarz
c8bee86586 in some early WIP commit we had removed the loop{} inside get(); re-establish it one level down 2025-01-14 15:12:03 +01:00
Christian Schwarz
768a867dcf doc comment fix 2025-01-14 14:54:15 +01:00
Christian Schwarz
3b65465e10 turns out with the switch to sync Mutex there's no reason for upgrade() to be async either 2025-01-14 14:53:41 +01:00
Christian Schwarz
e4ea706424 turns out PerTimelineState::shutdown() doesn't need to be async 2025-01-14 14:48:03 +01:00
Christian Schwarz
7034f54a9e remove the earlier-commented-out assertions on arc reference counts, they were too whiteboxy to begin with 2025-01-14 14:45:25 +01:00
Christian Schwarz
d68c5ddf7e avoid the tokio::sync::Mutex by wrapping the GateGuard into an Arc 2025-01-14 14:23:28 +01:00