# Problem
The timeout-based batching adds latency to unbatchable workloads.
We can choose a short batching timeout (e.g. 10us) but that requires
high-resolution timers, which tokio doesn't have.
I thoroughly explored options to use OS timers (see
[this](https://github.com/neondatabase/neon/pull/9822) abandoned PR).
In short, it's not an attractive option because any timer implementation
adds non-trivial overheads.
# Solution
The insight is that, in the steady state of a batchable workload, the
time we spend in `get_vectored` will be hundreds of microseconds anyway.
If we prepare the next batch concurrently with `get_vectored`, we will
have a sizeable batch ready by the time `get_vectored` for the current
batch completes, and we do not need an explicit timeout.
This can be reasonably described as **pipelining of the protocol
handler**.
# Implementation
We model the sub-protocol handler for pagestream requests
(`handle_pagerequests`) as two futures that form a pipeline:
1. Batching: read requests from the connection and fill the current
batch.
2. Execution: `take` the current batch, execute it using `get_vectored`,
and send the responses.
The Batching and Execution stages are connected through a new type of
channel called `spsc_fold`.
See the long comment in `handle_pagerequests_pipelined` for details.
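For orientation, here is a minimal, self-contained sketch of that two-stage
structure. It is not the actual pageserver code: `tokio::sync::mpsc` stands in
for the `spsc_fold` channel (which, unlike a plain queue, folds incoming
requests into the pending batch), and the request/batch types and the
batch-size limit are made up for the example.

```rust
use tokio::sync::mpsc;

type Request = u32; // stand-in for a single get-page request
struct Batch(Vec<Request>);

#[tokio::main]
async fn main() {
    // Channel of ready batches. In the real code this is the new `spsc_fold`
    // channel, which merges incoming requests into the pending batch instead
    // of queueing them individually.
    let (tx, mut rx) = mpsc::channel::<Batch>(1);

    // Stage 1: Batching -- read requests from the connection and fill the
    // current batch. Here the "connection" is just a canned iterator.
    let batcher = async move {
        let mut conn = 0..8u32;
        loop {
            let mut batch = Batch(Vec::new());
            for _ in 0..4 {
                match conn.next() {
                    Some(req) => batch.0.push(req),
                    None => break,
                }
            }
            if batch.0.is_empty() {
                break; // "connection" closed
            }
            if tx.send(batch).await.is_err() {
                break; // executor is gone
            }
        }
    };

    // Stage 2: Execution -- take the current batch, execute it, and send the
    // responses. In the real code this is where `get_vectored` runs.
    let executor = async move {
        while let Some(batch) = rx.recv().await {
            println!("executing a batch of {} requests", batch.0.len());
        }
    };

    // Run both stages concurrently on the same task ("concurrent-futures"
    // execution mode); spawning the executor onto its own task would
    // correspond to the "tasks" mode.
    tokio::join!(batcher, executor);
}
```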
# Changes
- Refactor `handle_pagerequests`
- separate functions for
- reading one protocol message; produces a `BatchedFeMessage` with just
one page request in it
- batching; tries to merge an incoming `BatchedFeMessage` into an
existing `BatchedFeMessage`; returns `None` on success and hands the
incoming message back when merging isn't possible (see the sketch
after this list)
- execution of a batched message
- unify the timeline handle acquisition & request span construction; it
now happens in the function that reads the protocol message
- Implement serial and pipelined model
- serial: what we had before any of the batching changes
- read one protocol message
- execute it
- pipelined: the design described above
- optionality for how the pipeline is executed: either concurrent
futures or tokio tasks
- Pageserver config
- remove batching timeout field
- add ability to configure pipelining mode
- add ability to limit max batch size for pipelined configurations
(required for the rollout, cf.
https://github.com/neondatabase/cloud/issues/20620)
- ability to configure execution mode
- Tests
- remove `batch_timeout` parametrization
- rename `test_getpage_merge_smoke` to `test_throughput`
- add parametrization to test different max batch sizes and execution
modes
- rename `test_timer_precision` to `test_latency`
- rename the test case file to `test_page_service_batching.py`
- better descriptions of what the tests actually do
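The batching step's merge-or-hand-back contract mentioned above is easiest to
see in code. The following is a hedged sketch only: `BatchedFeMessage`,
`GetPageRequest`, and `try_merge` are simplified stand-ins, not the actual
pageserver types, and the real batching code applies additional merge
conditions that are omitted here.

```rust
// Hypothetical, simplified types for illustration.
struct GetPageRequest {
    key: u64,
}

enum BatchedFeMessage {
    // A batch of get-page requests that can be served by one `get_vectored`.
    GetPage { pages: Vec<GetPageRequest> },
    // Any other protocol message; never merged.
    Other,
}

/// Try to fold `incoming` into `existing`.
/// Returns `None` if the merge succeeded, or hands `incoming` back
/// (unchanged) if it cannot be merged and must start a new batch.
fn try_merge(
    existing: &mut BatchedFeMessage,
    incoming: BatchedFeMessage,
    max_batch_size: usize,
) -> Option<BatchedFeMessage> {
    match (existing, incoming) {
        (
            BatchedFeMessage::GetPage { pages },
            BatchedFeMessage::GetPage { pages: more },
        ) if pages.len() + more.len() <= max_batch_size => {
            pages.extend(more);
            None // merged into the existing batch
        }
        // Not mergeable (different message kind, or batch full): hand the
        // incoming message back; the caller flushes the current batch and
        // starts a new one with it.
        (_, incoming) => Some(incoming),
    }
}

fn main() {
    let mut batch = BatchedFeMessage::GetPage {
        pages: vec![GetPageRequest { key: 1 }],
    };
    // Mergeable: the incoming request is folded into the existing batch.
    let merged = try_merge(
        &mut batch,
        BatchedFeMessage::GetPage { pages: vec![GetPageRequest { key: 2 }] },
        32,
    );
    assert!(merged.is_none());
    if let BatchedFeMessage::GetPage { pages } = &batch {
        let keys: Vec<u64> = pages.iter().map(|r| r.key).collect();
        println!("batch now holds keys {keys:?}");
    }
    // Not mergeable: the incoming message is handed back unchanged.
    let handed_back = try_merge(&mut batch, BatchedFeMessage::Other, 32);
    assert!(handed_back.is_some());
}
```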
## On holding the `TimelineHandle` in the pending batch
While batching, we hold the `TimelineHandle` in the pending batch.
Therefore, the timeline will not finish shutting down while we're
batching.
This is not a problem in practice because the concurrently ongoing
`get_vectored` call will fail quickly with an error indicating that the
timeline is shutting down.
This results in the Execution stage returning a `QueryError::Shutdown`,
which causes the pipeline / entire page service connection to shut down.
This drops all references to the
`Arc<Mutex<Option<Box<BatchedFeMessage>>>>` object, thereby dropping the
contained `TimelineHandle`s.
This fixes https://github.com/neondatabase/neon/issues/9850.
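To illustrate the shutdown path, here is a small hedged sketch of how dropping
the shared pending-batch slot releases the contained handle via ordinary
`Drop` semantics. The types are hypothetical stand-ins, not the real
`TimelineHandle` or `BatchedFeMessage`.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in; the real `TimelineHandle` gates timeline shutdown.
struct TimelineHandle;

impl Drop for TimelineHandle {
    fn drop(&mut self) {
        println!("TimelineHandle dropped; timeline shutdown can now complete");
    }
}

struct BatchedFeMessage {
    _handle: TimelineHandle,
}

fn main() {
    // The pending-batch slot shared between the batching and execution stages.
    let slot: Arc<Mutex<Option<Box<BatchedFeMessage>>>> = Arc::new(Mutex::new(Some(
        Box::new(BatchedFeMessage {
            _handle: TimelineHandle,
        }),
    )));
    let executor_side = Arc::clone(&slot);

    // When the pipeline / page service connection shuts down, both stages
    // return and every clone of the Arc is dropped; dropping the last clone
    // drops the contained batch and hence the TimelineHandle.
    drop(slot);
    drop(executor_side); // prints the message from TimelineHandle::drop
}
```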
# Performance
Results from a local run of the benchmarks are in [this empty
commit](1cf5b1463f) in the PR branch.
Key take-aways:
* `concurrent-futures` and `tasks` deliver identical `batching_factor`
* tail latency impact unknown, cf
https://github.com/neondatabase/neon/issues/9837
* `concurrent-futures` has higher throughput than `tasks` in all
workloads (=lower `time` metric)
* In unbatchable workloads, `concurrent-futures` has 5% higher
`CPU-per-throughput` than `tasks`, and 15% higher than `serial`.
* In the batchable-32 workload, `concurrent-futures` has 8% lower
`CPU-per-throughput` than `tasks` (a comparison with the throughput of
`serial` is irrelevant).
* In unbatchable workloads, mean and tail latencies of
`concurrent-futures` are practically identical to those of `serial`,
whereas `tasks` adds 20-30us of overhead.
Overall, `concurrent-futures` seems like a slightly more attractive
choice.
# Rollout
This change is disabled by default.
Rollout plan:
- https://github.com/neondatabase/cloud/issues/20620
# Refs
- epic: https://github.com/neondatabase/neon/issues/9376
- this sub-task: https://github.com/neondatabase/neon/issues/9377
- the abandoned attempt to improve batching timeout resolution:
https://github.com/neondatabase/neon/pull/9820
- closes https://github.com/neondatabase/neon/issues/9850
- fixes https://github.com/neondatabase/neon/issues/9835
`Cargo.toml` of the `utils` crate:

```toml
[package]
name = "utils"
version = "0.1.0"
edition.workspace = true
license.workspace = true

[features]
default = []
# Enables test-only APIs, including failpoints. In particular, enables the `fail_point!` macro,
# which adds some runtime cost to run tests on outage conditions
testing = ["fail/failpoints"]

[dependencies]
arc-swap.workspace = true
sentry.workspace = true
async-compression.workspace = true
anyhow.workspace = true
bincode.workspace = true
bytes.workspace = true
camino.workspace = true
chrono.workspace = true
diatomic-waker.workspace = true
git-version.workspace = true
hex = { workspace = true, features = ["serde"] }
humantime.workspace = true
hyper0 = { workspace = true, features = ["full"] }
fail.workspace = true
futures = { workspace = true }
jsonwebtoken.workspace = true
nix.workspace = true
once_cell.workspace = true
pin-project-lite.workspace = true
pprof.workspace = true
regex.workspace = true
routerify.workspace = true
serde.workspace = true
serde_with.workspace = true
serde_json.workspace = true
signal-hook.workspace = true
thiserror.workspace = true
tokio.workspace = true
tokio-tar.workspace = true
tokio-util.workspace = true
toml_edit = { workspace = true, features = ["serde"] }
tracing.workspace = true
tracing-error.workspace = true
tracing-subscriber = { workspace = true, features = ["json", "registry"] }
rand.workspace = true
scopeguard.workspace = true
strum.workspace = true
strum_macros.workspace = true
url.workspace = true
uuid.workspace = true
walkdir.workspace = true

pq_proto.workspace = true
postgres_connection.workspace = true
metrics.workspace = true

const_format.workspace = true

# to use tokio channels as streams, this is faster to compile than async_stream
# why is it only here? no other crate should use it, streams are rarely needed.
tokio-stream = { version = "0.1.14" }

serde_path_to_error.workspace = true

[dev-dependencies]
byteorder.workspace = true
bytes.workspace = true
criterion.workspace = true
hex-literal.workspace = true
camino-tempfile.workspace = true
serde_assert.workspace = true
tokio = { workspace = true, features = ["test-util"] }

[[bench]]
name = "benchmarks"
harness = false
```