neon/test_runner/performance/pageserver
page_service: batching observability & include throttled time in smgr metrics (#9870)
Christian Schwarz, commit cb10be710d, 2024-12-03

This PR:

- fixes the smgr metrics issue https://github.com/neondatabase/neon/issues/9925
- adds a startup log line that reports the current batching config
- adds global and per-tenant histograms of batch sizes
- adds a metric exposing the current batching config (see the sketch below)
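
For illustration, here is a minimal sketch of what the batch-size histogram and the config metric could look like, using the `prometheus` and `once_cell` crates. The metric names, buckets, and the `record_batch` / `publish_batching_config` helpers are hypothetical, not the exact ones added by this PR:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, register_int_gauge, Histogram, IntGauge};

// Hypothetical global histogram of batch sizes. A per-tenant variant would use
// a labeled HistogramVec keyed by tenant/timeline instead.
static PAGE_SERVICE_BATCH_SIZE: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "pageserver_page_service_batch_size_example",
        "Number of requests executed together in one batch (illustrative)",
        vec![1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
    )
    .expect("failed to register histogram")
});

// Hypothetical gauge exposing the currently configured maximum batch size.
static PAGE_SERVICE_MAX_BATCH_SIZE_CONFIG: Lazy<IntGauge> = Lazy::new(|| {
    register_int_gauge!(
        "pageserver_page_service_config_max_batch_size_example",
        "Configured maximum batch size (illustrative)"
    )
    .expect("failed to register gauge")
});

/// Called once at startup, alongside the log line announcing the config.
fn publish_batching_config(max_batch_size: usize) {
    PAGE_SERVICE_MAX_BATCH_SIZE_CONFIG.set(max_batch_size as i64);
}

/// Called once per executed batch.
fn record_batch(batch_len: usize) {
    PAGE_SERVICE_BATCH_SIZE.observe(batch_len as f64);
}
```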

The issue described in #9925 is that, before this PR, request latency was
only observed *after* batching.
This means that smgr latency metrics (most importantly getpage latency)
don't account for
- `wait_lsn` time 
- time spent waiting for the batch to fill up / for the executor stage to pick up the batch.

The fix is to use a per-request batching timer, like we did before the
initial batching PR.
We funnel those timers through the entire request lifecycle.

I noticed that even before the initial batching changes, we weren't
accounting for the time spent writing & flushing the response to the
wire.
This PR drive-by fixes that deficiency by dropping the timers at the
very end of processing the batch, i.e., after the `pgb.flush()` call.
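
As a rough illustration of how the per-request timers travel through the pipeline, here is a minimal sketch; `SmgrOpTimer`, `BatchedRequest`, and the `println!` stand in for the real pageserver types and histograms:

```rust
use std::time::Instant;

// Hypothetical stand-in for the real request type.
struct GetPageRequest { /* ... */ }

/// Per-request timer: started when the request is received (before batching)
/// and dropped only after the response has been flushed to the wire.
struct SmgrOpTimer {
    started_at: Instant,
}

impl Drop for SmgrOpTimer {
    fn drop(&mut self) {
        // The real code would observe into the smgr latency histogram here.
        println!("getpage latency: {:?}", self.started_at.elapsed());
    }
}

/// Each request in a batch carries its own timer through the pipeline.
struct BatchedRequest {
    req: GetPageRequest,
    timer: SmgrOpTimer,
}

fn on_request_received(req: GetPageRequest) -> BatchedRequest {
    BatchedRequest {
        req,
        timer: SmgrOpTimer { started_at: Instant::now() },
    }
}

fn execute_batch(batch: Vec<BatchedRequest>) {
    // ... wait_lsn, read pages, write responses to the connection buffer,
    // then pgb.flush() ...
    // Dropping the batch at the very end drops all timers, so the recorded
    // latency includes batching wait, execution, and the flush.
    drop(batch);
}
```

The important property is that each timer spans from request receipt until after the flush, so both the batching wait and the write/flush time are included in the observed latency.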

I was **unable to maintain the behavior that we deduct
time-spent-in-throttle from various latency metrics**.
The reason is that we're using a *single* counter in `RequestContext` to
track micros spent in throttle.
But there are *N* metrics timers in the batch, one per request.
As a consequence, the practice of consuming the counter in each timer's
drop handler no longer works: all but the first timer encounter the error
`close() called on closed state`.
A failed attempt to maintain the current behavior can be found in
https://github.com/neondatabase/neon/pull/9951.
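
Independent of the actual metrics code, the shape of the problem is roughly the following (illustrative only; `ThrottleAccumulator` is a simplified stand-in for the counter in `RequestContext`):

```rust
use std::sync::Mutex;
use std::time::Duration;

/// Simplified stand-in for the per-request context: a single accumulator
/// of time spent in the throttle, designed to be consumed exactly once.
struct ThrottleAccumulator {
    micros: Mutex<Option<u64>>, // None once consumed ("closed")
}

impl ThrottleAccumulator {
    fn consume(&self) -> Result<Duration, &'static str> {
        match self.micros.lock().unwrap().take() {
            Some(us) => Ok(Duration::from_micros(us)),
            // Analogous to the `close() called on closed state` error:
            None => Err("close() called on closed state"),
        }
    }
}

fn main() {
    let ctx_counter = ThrottleAccumulator { micros: Mutex::new(Some(1500)) };
    // With a single request per connection, one timer consumes the counter: fine.
    assert!(ctx_counter.consume().is_ok());
    // With a batch of N requests sharing one RequestContext, the second and
    // later timers hit the already-consumed state.
    assert!(ctx_counter.consume().is_err());
}
```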

So, this PR removes the deduction behavior from all metrics.
I started a discussion on Slack about the implications this has for
our internal SLO calculation:
https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029

# Refs

- fixes https://github.com/neondatabase/neon/issues/9925
- sub-issue https://github.com/neondatabase/neon/issues/9377
- epic: https://github.com/neondatabase/neon/issues/9376

How to reproduce benchmark results / run these benchmarks interactively.

  1. Get an EC2 instance with Instance Store. Use the same instance type as used for the benchmark run.
  2. Mount the Instance Store using neon.git/scripts/ps_ec2_setup_instance_store.
  3. Use a pytest command line (see other READMEs further up in the pytest hierarchy).

For tests that take a long time to set up / consume a lot of storage space, we use the test suite's repo_dir snapshotting functionality (from_repo_dir). It supports mounting snapshots using overlayfs, which improves iteration time.

Here's a full command line.

RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=16 BUILD_TYPE=release \
    ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py