## Problem

Get page batching stops when we encounter requests at different LSNs. We are leaving batching factor on the table.

## Summary of changes

The goal is to support keys with different LSNs in a single batch and still serve them with a single vectored get. Important restriction: the same key at different LSNs is not supported in one batch. Returning different key versions would be a much more intrusive change.

Firstly, the read path is changed to support "scattered" queries. This is a conceptually simple step from https://github.com/neondatabase/neon/pull/11463. Instead of initializing the fringe for one keyspace, we initialize it for multiple keyspaces at different LSNs and let the selection logic already present in the fringe handle the rest.

Secondly, the page service code is updated to support batching at different LSNs. Each request parsed from the wire determines its effective request LSN and keeps it in memory for the batcher to inspect. The batcher allows keys at different LSNs in one batch as long as no single key is requested at different LSNs (see the sketch below).

I'd suggest doing the first pass commit by commit to get a feel for the changes.

## Results

I used the batching test from [Christian's PR](https://github.com/neondatabase/neon/pull/11391), which increases the chance of batch breaks. Looking at the logs, I think the new code is at the maximum batching factor for this workload (we only break batches because they are oversized or because the executor is idle).

```
Main:
Reasons for stopping batching: {'LSN changed': 22843, 'of batch size': 33417}
test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 14.6662

My branch:
Reasons for stopping batching: {'of batch size': 37024}
test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 19.8333
```

Related: https://github.com/neondatabase/neon/issues/10765
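For intuition, here is a minimal Python sketch of the batch-admission rule described above. It is not the actual pageserver implementation (the real batcher lives in the Rust page service code), and the type and field names are made up for illustration: each request carries its effective request LSN, and a batch breaks only when it is full or when the same key shows up at a second LSN.

```python
from dataclasses import dataclass, field


@dataclass
class GetPageRequest:
    key: int            # page key, simplified to an int for the sketch
    effective_lsn: int  # effective request LSN determined when parsing the request


@dataclass
class Batch:
    max_batch_size: int = 32
    requests: list[GetPageRequest] = field(default_factory=list)
    lsn_by_key: dict[int, int] = field(default_factory=dict)

    def try_add(self, req: GetPageRequest) -> bool:
        """Return True if the request may join this batch, False if the batch must break."""
        if len(self.requests) >= self.max_batch_size:
            return False  # batch is full ("of batch size" in the stats above)
        seen_lsn = self.lsn_by_key.get(req.key)
        if seen_lsn is not None and seen_lsn != req.effective_lsn:
            return False  # same key at two different LSNs is not allowed in one batch
        self.lsn_by_key[req.key] = req.effective_lsn
        self.requests.append(req)
        return True


if __name__ == "__main__":
    b = Batch(max_batch_size=4)
    assert b.try_add(GetPageRequest(key=1, effective_lsn=100))
    assert b.try_add(GetPageRequest(key=2, effective_lsn=200))      # different key, different LSN: OK
    assert not b.try_add(GetPageRequest(key=1, effective_lsn=200))  # same key, different LSN: break
```

Per the results above, the only remaining break reasons in practice are batch size and an idle executor; the sketch models just the size and same-key checks.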
```python
# NB: there are benchmarks that double-serve as tests inside the `performance` directory.

import subprocess
from pathlib import Path

import pytest
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder


@pytest.mark.timeout(30)  # test takes <20s if pageserver impl is correct
@pytest.mark.parametrize("kind", ["pageserver-stop", "tenant-detach"])
def test_slow_flush(neon_env_builder: NeonEnvBuilder, neon_binpath: Path, kind: str):
    def patch_pageserver_toml(config):
        config["page_service_pipelining"] = {
            "mode": "pipelined",
            "max_batch_size": 32,
            "execution": "concurrent-futures",
            "batching": "uniform-lsn",
        }

    neon_env_builder.pageserver_config_override = patch_pageserver_toml
    env = neon_env_builder.init_start()

    log.info("make flush appear slow")

    log.info("sending requests until pageserver accepts no more")
    # TODO: extract this into a helper, like subprocess_capture,
    # so that we capture the stderr from the helper somewhere.
    child = subprocess.Popen(
        [
            neon_binpath / "test_helper_slow_client_reads",
            env.pageserver.connstr(),
            str(env.initial_tenant),
            str(env.initial_timeline),
        ],
        bufsize=0,  # unbuffered
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    assert child.stdout is not None
    buf = child.stdout.read(1)
    if len(buf) != 1:
        raise Exception("unexpected EOF")
    if buf != b"R":
        raise Exception(f"unexpected data: {buf!r}")
    log.info("helper reports pageserver accepts no more requests")
    log.info(
        "assuming pageserver connection handle is in a state where TCP has backpressured pageserver=>client response flush() into userspace"
    )

    if kind == "pageserver-stop":
        log.info("try to shut down the pageserver cleanly")
        env.pageserver.stop()
    elif kind == "tenant-detach":
        log.info("try to shut down the tenant")
        env.pageserver.tenant_detach(env.initial_tenant)
    else:
        raise ValueError(f"unexpected kind: {kind}")

    log.info("shutdown did not time out, test passed")
```