## Problem
The basebackup cache currently uses an unbounded channel for prepare
requests. In theory, the channel can grow without bound if the cache
hangs and stops processing requests.
- Part of https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Replace the unbounded channel with a bounded one whose size is
configurable.
- Add `pageserver_basebackup_cache_prepare_queue_size` to observe the
size of the queue.
- Refactor a bit to move all metrics logic to `basebackup_cache.rs`
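
A minimal sketch of the bounded-queue idea, using `std::sync::mpsc::sync_channel` as a stand-in for whatever channel type the pageserver actually uses. `PrepareRequest`, `PrepareQueue`, and `dequeue` are hypothetical names for illustration, and the atomic counter only stands in for the `pageserver_basebackup_cache_prepare_queue_size` gauge.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

/// Hypothetical prepare request; the real request type is not shown here.
#[derive(Debug, PartialEq)]
pub struct PrepareRequest {
    pub timeline_id: u32,
}

/// Sender side of a bounded prepare queue. `queue_size` stands in for the
/// `pageserver_basebackup_cache_prepare_queue_size` gauge.
pub struct PrepareQueue {
    tx: SyncSender<PrepareRequest>,
    queue_size: Arc<AtomicUsize>,
}

impl PrepareQueue {
    pub fn new(capacity: usize) -> (PrepareQueue, Receiver<PrepareRequest>, Arc<AtomicUsize>) {
        let (tx, rx) = sync_channel(capacity);
        let queue_size = Arc::new(AtomicUsize::new(0));
        let queue = PrepareQueue { tx, queue_size: Arc::clone(&queue_size) };
        (queue, rx, queue_size)
    }

    /// Enqueue without blocking. Returns `false` if the queue is full or the
    /// receiver is gone, in which case the request is dropped.
    pub fn try_enqueue(&self, req: PrepareRequest) -> bool {
        // Count the request first so the gauge can never underflow if the
        // receiver dequeues it immediately.
        self.queue_size.fetch_add(1, Ordering::Relaxed);
        match self.tx.try_send(req) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.queue_size.fetch_sub(1, Ordering::Relaxed);
                false
            }
        }
    }
}

/// Receiver side: take one request if available and update the gauge.
pub fn dequeue(rx: &Receiver<PrepareRequest>, queue_size: &AtomicUsize) -> Option<PrepareRequest> {
    let req = rx.try_recv().ok()?;
    queue_size.fetch_sub(1, Ordering::Relaxed);
    Some(req)
}

fn main() {
    let (queue, rx, size) = PrepareQueue::new(2);
    for id in 1..4 {
        let accepted = queue.try_enqueue(PrepareRequest { timeline_id: id });
        println!("request {}: accepted={} queue_size={}", id, accepted, size.load(Ordering::Relaxed));
    }
    while let Some(req) = dequeue(&rx, &size) {
        println!("processed {:?}, queue_size={}", req, size.load(Ordering::Relaxed));
    }
}
```

With a capacity of 2, the third request is rejected instead of growing the queue, which is the behaviour a hung consumer would otherwise prevent us from observing.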
## Problem
gRPC base backups use gRPC compression. However, this has two problems:
* Base backup caching will cache compressed base backups (making gRPC
compression pointless).
* Tonic does not support varying the compression level, and zstd default
level is 10% slower than gzip fastest level.
Touches https://github.com/neondatabase/neon/issues/11728.
Touches https://github.com/neondatabase/cloud/issues/29353.
## Summary of changes
This patch adds a gRPC parameter `BaseBackupRequest::compression`
specifying the compression algorithm. It also moves compression into
`send_basebackup_tarball` to reduce code duplication.
A follow-up PR will integrate the base backup cache with gRPC.
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.
This is therefore a preparation PR (like similar PRs before it).
There are a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).
The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
gRPC base backups send a stream of fixed-size 64 KB chunks.
pagebench basebackup with compression enabled shows that this reduces
throughput:
* 64 KB: 55 RPS
* 128 KB: 69 RPS
* 256 KB: 73 RPS
* 1024 KB: 73 RPS
This patch sets the base backup chunk size to 256 KB.
Introduce a separate `postgres_ffi_types` crate which contains a few
types and functions that were used in the API. `postgres_ffi_types` is a
much smaller crate than `postgres_ffi`, and it doesn't depend on bindgen
or the Postgres C headers.
Move NeonWalRecord and Value types to wal_decoder crate. They are only
used in the pageserver-safekeeper "ingest" API. The rest of the ingest
API types are defined in wal_decoder, so move these there as well.
## Problem
The gRPC page service should support compression.
Requires #12111.
Touches #11728.
Touches https://github.com/neondatabase/cloud/issues/25679.
## Summary of changes
Add support for gzip and zstd compression in the server, and a client
parameter to enable compression.
This will need further benchmarking under realistic network conditions.
## Problem
Full basebackups are used in tests, and may be useful for debugging as
well, so we should support them in the gRPC API.
Touches #11728.
## Summary of changes
Add `GetBaseBackupRequest::full` to generate full base backups.
The libpq implementation also allows specifying `prev_lsn` for full
backups, i.e. the end LSN of the previous WAL record. This is omitted in
the gRPC API, since it's not used by any tests, and presumably of
limited value since it's autodetected. We can add it later if we find
that we need it.
## Problem
When converting `proto::GetPageRequest` into `page_api::GetPageRequest`
and validating the request, errors are returned as `tonic::Status`. This
will tear down the GetPage stream, which is disruptive and unnecessary.
## Summary of changes
Emit invalid request errors as `GetPageResponse` with an appropriate
`status_code` instead.
Also move the conversion from `tonic::Status` to `GetPageResponse` out
into the stream handler.
## Problem
The gRPC base backup implementation has a few issues: chunks are not
properly bounded, and it's not possible to omit the LSN.
Touches #11728.
## Summary of changes
* Properly bound chunks by using a limited writer.
* Use an `Option<Lsn>` rather than a `ReadLsn` (the latter requires an
LSN).
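
As an illustration of bounding chunks with a limited writer, here is a minimal sketch assuming a writer that never lets a chunk exceed a fixed size. `ChunkingWriter` is a hypothetical name and not the type used in the patch; it collects chunks in memory, whereas the real implementation streams them.

```rust
use std::io::{self, Write};

/// Hypothetical writer that splits everything written to it into chunks of
/// at most `max_chunk` bytes.
pub struct ChunkingWriter {
    max_chunk: usize,
    current: Vec<u8>,
    chunks: Vec<Vec<u8>>,
}

impl ChunkingWriter {
    pub fn new(max_chunk: usize) -> ChunkingWriter {
        assert!(max_chunk > 0, "chunk size must be non-zero");
        ChunkingWriter {
            max_chunk,
            current: Vec::with_capacity(max_chunk),
            chunks: Vec::new(),
        }
    }

    /// Flush the trailing partial chunk (if any) and return all chunks.
    pub fn finish(mut self) -> Vec<Vec<u8>> {
        if !self.current.is_empty() {
            let last = std::mem::replace(&mut self.current, Vec::new());
            self.chunks.push(last);
        }
        self.chunks
    }
}

impl Write for ChunkingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Accept only as many bytes as still fit into the current chunk.
        // Callers using `write_all` will retry with the remainder.
        let room = self.max_chunk - self.current.len();
        let n = room.min(buf.len());
        self.current.extend_from_slice(&buf[..n]);
        if self.current.len() == self.max_chunk {
            let full = std::mem::replace(&mut self.current, Vec::with_capacity(self.max_chunk));
            self.chunks.push(full);
        }
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut writer = ChunkingWriter::new(4);
    writer.write_all(b"0123456789")?;
    for chunk in writer.finish() {
        println!("chunk of {} bytes", chunk.len());
    }
    Ok(())
}
```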
## Problem
The gRPC page service doesn't respect `get_vectored_concurrent_io` and
always uses sequential IO.
## Summary of changes
Spawn a sidecar task for concurrent IO when enabled.
Cancellation will be addressed separately.
## Problem
Currently, `page_api` domain types validate message invariants both when
converting Protobuf → domain and domain → Protobuf. This is annoying for
clients, because they can't use stream combinators to convert streamed
requests (needed for hot path performance), and also performs the
validation twice in the common case.
Blocks #12099.
## Summary of changes
Only validate the Protobuf → domain type conversion, i.e. on the
receiver side, and make domain → Protobuf infallible. This is where it
matters -- the Protobuf types are less strict than the domain types, and
receivers should expect all sorts of junk from senders (they're not
required to validate anyway, and can just construct an invalid message
manually).
Also adds a missing `impl From<CheckRelExistsRequest> for
proto::CheckRelExistsRequest`.
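
A minimal sketch of the resulting conversion pattern, assuming simplified stand-in types: the field names and the "non-empty block list" invariant are hypothetical and only illustrate that Protobuf → domain is fallible (`TryFrom`) while domain → Protobuf is infallible (`From`).

```rust
use std::convert::TryFrom;

/// Stand-in for the generated Protobuf types, which accept any field values.
pub mod proto {
    #[derive(Debug, Clone, PartialEq)]
    pub struct GetPageRequest {
        pub request_id: u64,
        pub block_numbers: Vec<u32>,
    }
}

/// Stand-in domain type. Assumed invariant: at least one block is requested.
#[derive(Debug, Clone, PartialEq)]
pub struct GetPageRequest {
    pub request_id: u64,
    pub block_numbers: Vec<u32>,
}

#[derive(Debug, PartialEq)]
pub struct InvalidRequest(pub &'static str);

/// Receiver side: validate when converting Protobuf into the domain type.
impl TryFrom<proto::GetPageRequest> for GetPageRequest {
    type Error = InvalidRequest;

    fn try_from(pb: proto::GetPageRequest) -> Result<Self, Self::Error> {
        if pb.block_numbers.is_empty() {
            return Err(InvalidRequest("empty block_numbers"));
        }
        Ok(GetPageRequest {
            request_id: pb.request_id,
            block_numbers: pb.block_numbers,
        })
    }
}

/// Sender side: no validation, so the conversion cannot fail.
impl From<GetPageRequest> for proto::GetPageRequest {
    fn from(req: GetPageRequest) -> Self {
        proto::GetPageRequest {
            request_id: req.request_id,
            block_numbers: req.block_numbers,
        }
    }
}

fn main() {
    let requests = vec![
        GetPageRequest { request_id: 1, block_numbers: vec![10] },
        GetPageRequest { request_id: 2, block_numbers: vec![11, 12] },
    ];
    // An infallible conversion composes with a plain `map`.
    let wire: Vec<proto::GetPageRequest> = requests
        .into_iter()
        .map(proto::GetPageRequest::from)
        .collect();
    println!("{:?}", wire);
}
```

Because the sender-side conversion is a plain `From`, it can be passed directly to `map`-style combinators without threading a `Result` through the stream.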
## Problem
The gRPC `page_api` domain types used smallvecs to avoid heap
allocations in the common case where a single page is requested.
However, this is pointless: the Protobuf types use a normal vec, and
converting a smallvec into a vec always causes a heap allocation anyway.
## Summary of changes
Use a normal `Vec` instead of a `SmallVec` in `page_api` domain types.
## Problem
We should expose the page service over gRPC.
Requires #12093.
Touches #11728.
## Summary of changes
This patch adds an initial page service implementation over gRPC. It
ties in with the existing `PageServerHandler` request logic, to avoid
the implementations drifting apart for the core read path.
This is just a bare-bones functional implementation. Several important
aspects have been omitted, and will be addressed in follow-up PRs:
* Limited observability: minimal tracing, no logging, limited metrics
and timing, etc.
* Rate limiting will currently block.
* No performance optimization.
* No cancellation handling.
* No tests.
I've only done rudimentary testing of this, but Pagebench passes at
least.
## Problem
The page service logic asserts that a tracing span is present with
tenant/timeline/shard IDs. An initial gRPC page service implementation
thus requires a tracing span.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
Adds an `ObservabilityLayer` middleware that generates a tracing span
and decorates it with IDs from the gRPC metadata.
This is a minimal implementation to address the tracing span assertion.
It will be extended with additional observability in later PRs.
## Problem
We have some gaps in our traces. This indicates missing spans.
## Summary of changes
This PR adds two new spans:
* WAIT_EXECUTOR: time a batched request spends in the batch waiting to
be picked up
* FLUSH_RESPONSE: time a get page request spends flushing the response
to the compute

## Problem
We need authentication for the gRPC server.
Requires #11972.
Touches #11728.
## Summary of changes
Add two request interceptors that decode the tenant/timeline/shard
metadata and authenticate the JWT token against them.
## Problem
We want to expose the page service over gRPC, for use with the
communicator.
Requires #11995.
Touches #11728.
## Summary of changes
This patch wires up a gRPC server in the Pageserver, using Tonic. It
does not yet implement the actual page service.
* Adds `listen_grpc_addr` and `grpc_auth_type` config options (disabled
by default).
* Enables gRPC by default with `neon_local`.
* Stub implementation of `page_api.PageService`, returning unimplemented
errors.
* gRPC reflection service for use with e.g. `grpcurl`.
Subsequent PRs will implement the actual page service, including
authentication and observability.
Notably, TLS support is not yet implemented. Certificate reloading
requires us to reimplement the entire Tonic gRPC server.
## Problem
Basebackup is on the hot path of compute startup and is generated
on every request, which may be slow.
- Issue: https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Add `BasebackupCache` which stores basebackups on local disk.
- Basebackup prepare requests are triggered by
`XLOG_CHECKPOINT_SHUTDOWN` records in the log.
- Limit the size of the cache by number of entries.
- Add `basebackup_cache_enabled` feature flag to TenantConfig.
- Write tests for the cache
## Not implemented yet
- Limit the size of the cache by total size in bytes
---------
Co-authored-by: Aleksandr Sarantsev <aleksandr@neon.tech>
## Problem
See
Discussion:
https://neondb.slack.com/archives/C033RQ5SPDH/p1746645666075799
Issue: https://github.com/neondatabase/cloud/issues/28609
Relation size cache is not correctly updated at PS in case of replicas.
## Summary of changes
1. Have two caches for relation size in timeline:
`rel_size_primary_cache` and `rel_size_replica_cache`.
2. `rel_size_primary_cache` is actually what we have now. The only
difference is that it is not updated in `get_rel_size`, only by WAL
ingestion
3. `rel_size_replica_cache` has limited size (LruCache) and its key is
`(Lsn,RelTag)`. It is updated in `get_rel_size`. Only strict LSN
matches are accepted as a cache hit.
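
A minimal sketch of the replica cache lookup rule, assuming simplified stand-ins for `Lsn` and `RelTag`. For brevity it evicts in insertion (FIFO) order rather than true LRU order, so it illustrates only the `(Lsn, RelTag)` key and the strict-match rule, not the eviction policy of the real `LruCache`.

```rust
use std::collections::{HashMap, VecDeque};

/// Stand-in for the real LSN type.
pub type Lsn = u64;

/// Stand-in for the real relation tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct RelTag(pub u32);

/// Bounded cache keyed by `(Lsn, RelTag)`.
/// NOTE: evicts the oldest inserted entry (FIFO), not the least recently used.
pub struct RelSizeReplicaCache {
    capacity: usize,
    entries: HashMap<(Lsn, RelTag), u32>,
    order: VecDeque<(Lsn, RelTag)>,
}

impl RelSizeReplicaCache {
    pub fn new(capacity: usize) -> RelSizeReplicaCache {
        RelSizeReplicaCache {
            capacity,
            entries: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Only an exact `(lsn, rel)` match counts as a hit.
    pub fn get(&self, lsn: Lsn, rel: RelTag) -> Option<u32> {
        self.entries.get(&(lsn, rel)).copied()
    }

    pub fn insert(&mut self, lsn: Lsn, rel: RelTag, nblocks: u32) {
        let key = (lsn, rel);
        if self.entries.insert(key, nblocks).is_none() {
            self.order.push_back(key);
            if self.order.len() > self.capacity {
                if let Some(oldest) = self.order.pop_front() {
                    self.entries.remove(&oldest);
                }
            }
        }
    }
}

fn main() {
    let mut cache = RelSizeReplicaCache::new(2);
    cache.insert(100, RelTag(1), 8);
    println!("exact LSN: {:?}", cache.get(100, RelTag(1)));
    println!("later LSN: {:?}", cache.get(101, RelTag(1)));
}
```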
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
The gRPC page service API will require decoupling the `PageHandler` from
the libpq protocol implementation. As preparation for this, avoid
passing in the entire server config to `PageHandler`, and instead
explicitly pass in the relevant fields.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
* Change `PageHandler` to take a `GetVectoredConcurrentIo` instead of
the entire config.
* Change `IoConcurrency::spawn_from_conf` to take a
`GetVectoredConcurrentIo`.
## Problem
There are a few rough edges around PS tracing.
## Summary of changes
* include compute request id in pageserver trace
* use the get page specific context for GET_REL_SIZE and GET_BATCH
* fix assertion in download layer trace

## Problem
`Tenant` isn't really a whole tenant: it's just one shard of a tenant.
## Summary of changes
- Automated rename of Tenant to TenantShard
- Followup commit to change references in comments
## Problem
https://github.com/neondatabase/neon/pull/11494 changes the batching
logic, but we don't have a way to evaluate it.
## Summary of changes
This PR introduces a global and per timeline metric which tracks the
reason for
which a batch was broken.
## Problem
Get page batching stops when we encounter requests at different LSNs.
We are leaving batching factor on the table.
## Summary of changes
The goal is to support keys with different LSNs in a single batch and
still serve them with a single vectored get.
Important restriction: the same key at different LSNs is not supported
in one batch. Returning different key
versions is a much more intrusive change.
Firstly, the read path is changed to support "scattered" queries. This
is a conceptually simple step from
https://github.com/neondatabase/neon/pull/11463. Instead of initializing
the fringe for one keyspace,
we do it for multiple at different LSNs and let the logic already
present in the fringe handle selection.
Secondly, page service code is updated to support batching at different
LSNs. Each request parsed from the wire determines its effective
request LSN and keeps it in memory for the batcher to inspect. The
batcher allows keys at different LSNs in one batch as long as no key is
requested at different LSNs.
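
The batching rule can be sketched as follows. This is a simplified model, not the page service code: `Key`, `Lsn`, `Batch`, and `BatchBreak` are hypothetical stand-ins, and the only break reasons modelled are a full batch and the same key requested at two different LSNs.

```rust
use std::collections::HashMap;

/// Stand-ins for the real key and LSN types.
pub type Key = u64;
pub type Lsn = u64;

/// Why a request could not be added to the current batch.
#[derive(Debug, PartialEq)]
pub enum BatchBreak {
    BatchFull,
    SameKeyDifferentLsn,
}

pub struct Batch {
    max_batch_size: usize,
    pages: Vec<(Key, Lsn)>,
    lsn_by_key: HashMap<Key, Lsn>,
}

impl Batch {
    pub fn new(max_batch_size: usize) -> Batch {
        Batch {
            max_batch_size,
            pages: Vec::new(),
            lsn_by_key: HashMap::new(),
        }
    }

    /// Add a request to the batch, or report why the batch must be broken.
    pub fn try_push(&mut self, key: Key, lsn: Lsn) -> Result<(), BatchBreak> {
        if self.pages.len() >= self.max_batch_size {
            return Err(BatchBreak::BatchFull);
        }
        if let Some(&existing) = self.lsn_by_key.get(&key) {
            if existing != lsn {
                return Err(BatchBreak::SameKeyDifferentLsn);
            }
        }
        self.lsn_by_key.insert(key, lsn);
        self.pages.push((key, lsn));
        Ok(())
    }

    pub fn len(&self) -> usize {
        self.pages.len()
    }
}

fn main() {
    let mut batch = Batch::new(32);
    println!("{:?}", batch.try_push(1, 100)); // first key
    println!("{:?}", batch.try_push(2, 200)); // different key, different LSN: allowed
    println!("{:?}", batch.try_push(1, 300)); // same key, different LSN: breaks the batch
    println!("batch length: {}", batch.len());
}
```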
I'd suggest doing the first pass commit by commit to get a feel for the
changes.
## Results
I used the batching test from [Christian's
PR](https://github.com/neondatabase/neon/pull/11391) which increases the
chance of batch breaks. Looking at the logs I think the new code is at
the max batching factor for the workload (we
only break batches due to them being oversized or because the executor
is idle).
```
Main:
Reasons for stopping batching: {'LSN changed': 22843, 'of batch size': 33417}
test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 14.6662
My branch:
Reasons for stopping batching: {'of batch size': 37024}
test_throughput[release-pg16-50-pipelining_config0-30-100-128-batchable {'max_batch_size': 32, 'execution': 'concurrent-futures', 'mode': 'pipelined'}].perfmetric.batching_factor: 19.8333
```
Related: https://github.com/neondatabase/neon/issues/10765
## Problem
Page service doesn't use TLS for incoming requests.
- Closes: https://github.com/neondatabase/cloud/issues/27236
## Summary of changes
- Add option `enable_tls_page_service_api` to pageserver config
- Propagate `tls_server_config` to `page_service` if the option is
enabled
No integration tests for now because I didn't find out how to call page
service API from python and AFAIK computes don't support TLS yet
# Problem
Current perf tracing fields do not allow answering the question what a
specific Postgres backend was waiting for.
# Background
For Pageserver logs, we set the backend PID as the libpq
`application_name` on the compute side, and funnel that into a
tracing field for the spans that emit to the global tracing subscriber.
# Solution
Funnel `application_name`, and the other fields that we use in the
logging spans, into the root span for perf tracing.
# Refs
- fixes https://github.com/neondatabase/neon/issues/11393
- stacked atop https://github.com/neondatabase/neon/pull/11433
- epic: https://github.com/neondatabase/neon/issues/9873
## Problem
https://github.com/neondatabase/neon/pull/11140 introduces performance
tracing with OTEL
and a pageserver config which configures the sampling ratio of get page
requests.
Enabling a non-zero sampling ratio on a per region basis is too
aggressive and comes with perf
impact that isn't very well understood yet.
## Summary of changes
Add a `sampling_ratio` tenant level config which overrides the
pageserver level config.
Note that we do not cache the config and load it on every get page
request such that changes propagate
timely.
Note that I've had to remove the `SHARD_SELECTION` span to get this to
work. The tracing library doesn't
expose a neat way to drop a span if one realises it's not needed at
runtime.
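
A minimal sketch of the override rule, assuming the ratio is a plain `f64` and that the tenant config holds an `Option`. The struct and function names are hypothetical, and the real config types may represent the ratio differently.

```rust
/// Hypothetical pageserver-level config.
pub struct PageserverConf {
    pub sampling_ratio: f64,
}

/// Hypothetical tenant-level config; `None` means "not overridden".
pub struct TenantConf {
    pub sampling_ratio: Option<f64>,
}

/// The tenant-level value, if set, wins over the pageserver-level value.
/// Called on every request, so config changes take effect immediately.
pub fn effective_sampling_ratio(ps: &PageserverConf, tenant: &TenantConf) -> f64 {
    tenant.sampling_ratio.unwrap_or(ps.sampling_ratio)
}

/// Decide whether to sample, given a uniform draw in `[0, 1)`.
pub fn should_sample(ratio: f64, draw: f64) -> bool {
    draw < ratio
}

fn main() {
    let ps = PageserverConf { sampling_ratio: 0.0 };
    let default_tenant = TenantConf { sampling_ratio: None };
    let traced_tenant = TenantConf { sampling_ratio: Some(0.5) };
    println!("default tenant: {}", effective_sampling_ratio(&ps, &default_tenant));
    println!("traced tenant: {}", effective_sampling_ratio(&ps, &traced_tenant));
}
```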
Closes https://github.com/neondatabase/neon/issues/11392
Based on https://github.com/neondatabase/neon/pull/11139
## Problem
We want to export performance traces from the pageserver in OTEL format.
End goal is to see them in Grafana.
## Summary of changes
https://github.com/neondatabase/neon/pull/11139 introduces the
infrastructure required to run the otel collector alongside the
pageserver.
### Design
Requirements:
1. We'd like to avoid implementing our own performance tracing stack and
use the `tracing` crate if possible.
2. Ideally, we'd like zero overhead at a sampling rate of zero and to be
able to change the tracing config for a tenant on the fly.
3. We should leave the current span hierarchy intact. This includes
adding perf traces without modifying existing tracing.
To satisfy (3) (and (2) in part) a separate span hierarchy is used.
`RequestContext` gains an optional `perf_span` member
that's only set when the request was chosen by sampling. All perf span
related methods added to `RequestContext` are no-ops for requests that
are not sampled.
This on its own is not enough for (3), so performance spans use a
separate tracing subscriber. The `tracing` crate doesn't have great
support for this, so there's a fair amount of boilerplate to override
the subscriber at all points of the perf span lifecycle.
### Perf Impact
[Periodic
pagebench](https://neonprod.grafana.net/d/ddqtbfykfqfi8d/e904990?orgId=1&from=2025-02-08T14:15:59.362Z&to=2025-03-10T14:15:59.362Z&timezone=utc)
shows no statistically significant regression with a sample ratio of 0.
There's an annotation on the dashboard on 2025-03-06.
### Overview of changes:
1. Clean up the `RequestContext` API a bit. Namely, get rid of the
`RequestContext::extend` API and use the builder instead.
2. Add pageserver level configs for tracing: sampling ratio, otel
endpoint, etc.
3. Introduce some perf span tracking utilities and expose them via
`RequestContext`. We add a `tracing::Span` wrapper to be used for perf
spans and a `tracing::Instrumented` equivalent for it. See doc comments
for reason.
4. Set up OTEL tracing infra according to configuration. A separate
runtime is used for the collector.
5. Add perf traces to the read path.
## Refs
- epic https://github.com/neondatabase/neon/issues/9873
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
# Problem
We leave too few observability breadcrumbs in the case where wait_lsn is
exceptionally slow.
# Changes
- refactor: extract the monitoring logic out of `log_slow` into
`monitor_slow_future`
- add global + per-timeline counter for time spent waiting for wait_lsn
- It is updated while we're still waiting, similar to what we do for
page_service response flush.
- add per-timeline counterpair for started & finished wait_lsn count
- add slow-logging to leave breadcrumbs in logs, not just metrics
For the slow-logging, we need to consider not flooding the logs during a
broker or network outage/blip.
The solution is a "log-streak-level" concurrency limit per timeline.
At any given time, there is at most one slow wait_lsn that is logging
the "still running" and "completed" sequence of logs.
Other concurrent slow wait_lsn's don't log at all.
This leaves at least one breadcrumb in each timeline's logs if some
wait_lsn was exceptionally slow during a given period.
The full degree of slowness can then be determined by looking at the
per-timeline metric.
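
A minimal sketch of the "at most one logging streak per timeline" idea. It uses an `AtomicBool` as a single-permit stand-in for the semaphore used in the PR; `LogStreakLimiter` and `LogStreakPermit` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// One limiter per timeline. At most one slow wait may hold the permit and
/// emit the "still running" / "completed" log sequence at any given time.
pub struct LogStreakLimiter {
    busy: AtomicBool,
}

/// Held by the one wait that is allowed to log. Released on drop.
pub struct LogStreakPermit<'a> {
    limiter: &'a LogStreakLimiter,
}

impl LogStreakLimiter {
    pub fn new() -> LogStreakLimiter {
        LogStreakLimiter { busy: AtomicBool::new(false) }
    }

    /// Non-blocking: call this only after the wait has been found to be slow,
    /// so fast waits never touch the shared state.
    pub fn try_acquire<'a>(&'a self) -> Option<LogStreakPermit<'a>> {
        match self.busy.compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed) {
            Ok(_) => Some(LogStreakPermit { limiter: self }),
            Err(_) => None,
        }
    }
}

impl<'a> Drop for LogStreakPermit<'a> {
    fn drop(&mut self) {
        self.limiter.busy.store(false, Ordering::Release);
    }
}

fn main() {
    let limiter = LogStreakLimiter::new();
    let first = limiter.try_acquire();
    let second = limiter.try_acquire();
    println!("first slow wait logs: {}", first.is_some());
    println!("concurrent slow wait logs: {}", second.is_some());
    drop(first);
    println!("later slow wait logs: {}", limiter.try_acquire().is_some());
}
```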
# Performance
Reran the `bench_log_slow` benchmark, no difference, so, existing call
sites are fine.
We do use a Semaphore, but only try_acquire it _after_ things have
already been determined to be slow. So, no baseline overhead
anticipated.
# Refs
-
https://github.com/neondatabase/cloud/issues/23486#issuecomment-2711587222
# Refs
- fixes https://github.com/neondatabase/neon/issues/6107
# Problem
`VirtualFile` currently parses the path it is opened with to identify
the `tenant,shard,timeline` labels to be used for the `STORAGE_IO_SIZE`
metric.
Further, for each read or write call to VirtualFile, it uses
`with_label_values` to retrieve the correct metrics object, which under
the hood is a global hashmap guarded by a parking_lot mutex.
We perform tens of thousands of reads and writes per second on every
pageserver instance; thus, doing the mutex lock + hashmap lookup is
wasteful.
# Changes
Apply the technique we use for all other timeline-scoped metrics to
avoid the repeat `with_label_values`: add it to `TimelineMetrics`.
Wrap `TimelineMetrics` into an `Arc`.
Propagate the `Arc<TimelineMetrics>` down to `VirtualFile`, and use
`Timeline::metrics::storage_io_size`.
To avoid contention on the `Arc<TimelineMetrics>`'s refcount atomics
between different connection handlers for the same timeline, we wrap it
into another Arc.
To avoid frequent allocations, we store that Arc<Arc<TimelineMetrics>>
inside the per-connection timeline cache.
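
To illustrate the technique, here is a minimal sketch with stand-in types. `CounterVec` below is a toy model of a labelled metric (a mutex-guarded hashmap, using `std::sync::Mutex` rather than parking_lot), not the `prometheus` crate API; the point is only that the lookup happens once when `TimelineMetrics` is built, and each IO then touches just an atomic.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Toy stand-in for a labelled counter vector: every lookup takes a lock
/// and does a hashmap lookup.
pub struct CounterVec {
    inner: Mutex<HashMap<String, Arc<AtomicU64>>>,
}

impl CounterVec {
    pub fn new() -> CounterVec {
        CounterVec { inner: Mutex::new(HashMap::new()) }
    }

    pub fn with_label_values(&self, timeline: &str) -> Arc<AtomicU64> {
        let mut map = self.inner.lock().unwrap();
        let counter = map
            .entry(timeline.to_string())
            .or_insert_with(|| Arc::new(AtomicU64::new(0)));
        Arc::clone(counter)
    }
}

/// Resolved once per timeline; holds the counter handle directly.
pub struct TimelineMetrics {
    pub storage_io_size: Arc<AtomicU64>,
}

impl TimelineMetrics {
    pub fn new(vec: &CounterVec, timeline: &str) -> TimelineMetrics {
        TimelineMetrics { storage_io_size: vec.with_label_values(timeline) }
    }
}

/// Simplified file handle that carries the pre-resolved metrics.
pub struct VirtualFile {
    metrics: Arc<TimelineMetrics>,
}

impl VirtualFile {
    pub fn open(metrics: Arc<TimelineMetrics>) -> VirtualFile {
        VirtualFile { metrics }
    }

    /// Hot path: a single atomic add, no lock and no hashmap lookup.
    pub fn record_io(&self, bytes: u64) {
        self.metrics.storage_io_size.fetch_add(bytes, Ordering::Relaxed);
    }
}

fn main() {
    let vec = CounterVec::new();
    let metrics = Arc::new(TimelineMetrics::new(&vec, "timeline-a"));
    let file = VirtualFile::open(Arc::clone(&metrics));
    file.record_io(8192);
    file.record_io(8192);
    println!("bytes: {}", vec.with_label_values("timeline-a").load(Ordering::Relaxed));
}
```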
Preliminary refactorings to enable this change:
- https://github.com/neondatabase/neon/pull/11001
- https://github.com/neondatabase/neon/pull/11030
# Performance
I ran the benchmarks in
`test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py`
on an `i3en.3xlarge` because that's what we currently run them on.
None of the benchmarks shows a meaningful difference in latency or
throughput or CPU utilization.
I would have expected some improvement in the
many-tenants-one-client-each workload because they all hit that hashmap
constantly, and clone the same `UintCounter` / `Arc` inside of it.
But apparently the overhead is minuscule compared to the remaining work
we do per getpage.
Yet, since the changes are already made, the added complexity is
manageable, and the perf overhead of `with_label_values` demonstrable in
micro-benchmarks, let's have this change anyway.
Also, propagating TimelineMetrics through RequestContext might come in
handy down the line.
The micro-benchmark that demonstrates perf impact of
`with_label_values`, along with other pitfalls and mitigation techniques
around the `metrics`/`prometheus` crate:
- https://github.com/neondatabase/neon/pull/11019
# Alternative Designs
An earlier iteration of this PR stored an `Arc<Arc<Timeline>>` inside
`RequestContext`.
The problem is that this risks reference cycles if the RequestContext
gets stored in an object that is owned directly or indirectly by
`Timeline`.
Ideally, we wouldn't be using this mess of Arc's at all and propagate
Rust references instead.
But tokio requires tasks to be `'static`, and so, we wouldn't be able to
propagate references across task boundaries, which is incompatible with
any sort of fan-out code we already have (e.g. concurrent IO) or future
code (parallel compaction).
So, opt for Arc for now.
## Problem
Timeline shutdown during basebackup logs at error level because the
cancellation error is smushed into BasebackupError::Server.
## Summary of changes
Introduce BasebackupError::Shutdown and use it. `log_query_error` will
now see `QueryError::Shutdown` and log at info level.
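
A minimal sketch of the error mapping, with simplified enums. The `String` payloads, `QueryError::Other`, the `Level` enum, and `log_level_for` are assumptions made for illustration; only the `Server` and `Shutdown` variant names and the info-level outcome come from the description above.

```rust
/// Simplified stand-in for the basebackup error type.
#[derive(Debug, PartialEq)]
pub enum BasebackupError {
    Server(String),
    /// Timeline or pageserver shutdown: expected, not a server fault.
    Shutdown,
}

/// Simplified stand-in for the page service query error type.
#[derive(Debug, PartialEq)]
pub enum QueryError {
    Other(String),
    Shutdown,
}

impl From<BasebackupError> for QueryError {
    fn from(err: BasebackupError) -> QueryError {
        match err {
            BasebackupError::Server(msg) => QueryError::Other(msg),
            BasebackupError::Shutdown => QueryError::Shutdown,
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Level {
    Info,
    Error,
}

/// Hypothetical helper mirroring the decision made when logging a query error.
pub fn log_level_for(err: &QueryError) -> Level {
    match err {
        QueryError::Shutdown => Level::Info,
        QueryError::Other(_) => Level::Error,
    }
}

fn main() {
    let shutdown: QueryError = BasebackupError::Shutdown.into();
    let failure: QueryError = BasebackupError::Server("disk error".to_string()).into();
    println!("{:?} -> {:?}", shutdown, log_level_for(&shutdown));
    println!("{:?} -> {:?}", failure, log_level_for(&failure));
}
```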
# Changes
While working on
- https://github.com/neondatabase/neon/pull/7202
I found myself needing to cache another expensive Arc::clone inside
the timeline::handle::Cache by wrapping it in another Arc.
Before this PR, it seemed like the only expensive thing we were caching
was the
connection handler tasks' clone of `Arc<Timeline>`.
But in fact the GateGuard was another such thing, but it was
special-cased
in the implementation.
So, this refactoring PR de-special-cases the GateGuard.
# Performance
With this PR we are doing strictly _fewer_ operations per `Cache::get`.
The reason is that we wrap the entire `Types::Timeline` into one Arc.
Before this PR, it was a separate Arc around the Arc<Timeline> and
one around the Arc<GateGuard>.
With this PR, we avoid an allocation per cached item, namely,
the separate Arc around the GateGuard.
This PR does not change the amount of shared mutable state.
So, all in all, it should be a net positive, albeit probably not
noticeable
with our small non-NUMA instances and generally high CPU usage per
request.
# Reviewing
To understand the refactoring logistics, look at the changes to the unit
test types first.
Then read the improved module doc comment.
Then the remaining changes.
In the future, we could rename things to be even more generic.
For example, `Types::TenantMgr` could really be a `Types::Resolver`.
And `Types::Timeline` should, to avoid constant confusion in the doc
comment,
be called `Types::Cached` or `Types::Resolved`.
Because the `handle` module, after this PR, really doesn't care that
we're
using it for storing Arc's and GateGuards.
Then again, specificity is sometimes more useful than being generic.
And writing the module doc comment in a totally generic way would
probably also be more confusing than helpful.
Updates storage components to edition 2024. We like to stay on the
latest edition if possible. There are no functional changes; however, some
code changes had to be done to accommodate the edition's breaking
changes.
The PR has two commits:
* the first commit updates storage crates to edition 2024 and appeases
`cargo clippy` by changing code. I accidentally ran the formatter
on some files that had other edits.
* the second commit performs a `cargo fmt`
I would recommend a closer review of the first commit and a less close
review of the second one (as it just runs `cargo fmt`).
part of https://github.com/neondatabase/neon/issues/10918
## Problem
We recently added slow GetPage request logging. However, this
unintentionally included the flush time when logging (which we already
have separate logging for). It also logs at WARN level, which is a bit
aggressive since we see this fire quite frequently.
Follows https://github.com/neondatabase/neon/pull/10906.
## Summary of changes
* Only log the request execution time, not the flush time.
* Extract a `pagestream_dispatch_batched_message()` helper.
* Rename `warn_slow()` to `log_slow()` and downgrade to INFO.
We've seen some cases in production where a compute doesn't get a
response to a pageserver request for several minutes, or even more. We
haven't found the root cause for that yet, but whatever the reason is,
it seems overly optimistic to think that if the pageserver hasn't
responded for 2 minutes, we'd get a response if we just wait patiently a
little longer. More likely, the pageserver is dead or there's some kind
of a network glitch so that the TCP connection is dead, or at least
stuck for a long time. Either way, it's better to disconnect and
reconnect. I set the default timeout to 2 minutes, which should be
enough for any GetPage request under normal circumstances, even if the
pageserver has to download several layer files from remote storage.
Make the disconnect timeout configurable. Also make the "log interval",
after which we print a message to the log, configurable, so that if you
change the disconnect timeout, you can set the log interval
correspondingly. The default log interval is still 10 s. The new GUCs
are called "neon.pageserver_response_log_timeout" and
"neon.pageserver_response_disconnect_timeout".
Includes a basic test for the log and disconnect timeouts.
Implements issue #10857
For archived timelines, we would like to prevent all non-pageserver
issued getpage requests, as users are not supposed to send these.
Instead, they should unarchive a timeline before issuing any external
read traffic.
As this is non-trivial to do, at least prevent launches of new computes,
by erroring on basebackup requests for archived timelines. In #10688,
we started issuing a warning instead of an error, because an error would
mean a stuck project. Now that we have confirmed that the warning has not
been present in the logs for about a week, we can issue errors.
Follow-up of #10688
Related: #9548
## Problem
We don't have good observability for "stuck" getpage requests.
Resolves https://github.com/neondatabase/cloud/issues/23808.
## Summary of changes
Log a periodic warning (every 30 seconds) if GetPage request execution
is slow to complete, to aid in debugging stuck GetPage requests.
This does not cover response flushing (we have separate logging for
that), nor reading the request from the socket and batching it (expected
to be insignificant and not straightforward to handle with the current
protocol).
This costs 95 nanoseconds on the happy path when awaiting a
`tokio::task::yield_now()`:
```
warn_slow/enabled=false time: [45.716 ns 46.116 ns 46.687 ns]
warn_slow/enabled=true time: [141.53 ns 141.83 ns 142.18 ns]
```
# Summary
In
- https://github.com/neondatabase/neon/pull/10813
we added slow flush logging but it didn't log the TCP send & recv queue
length.
This PR adds that data to the log message.
I believe the implementation to be safe & correct right now, but it's
brittle and thus this PR should be reverted or improved upon once the
investigation is over.
Refs:
- stacked atop https://github.com/neondatabase/neon/pull/10813
- context:
https://neondb.slack.com/archives/C08DE6Q9C3B/p1739464533762049?thread_ts=1739462628.361019&cid=C08DE6Q9C3B
- improves https://github.com/neondatabase/neon/issues/10668
- part of https://github.com/neondatabase/cloud/issues/23515
# How It Works
The trouble is two-fold:
1. getting to the raw socket file descriptor through the many Rust types
that wrap it and
2. integrating with the `measure()` function
Rust wraps it in types to model file descriptor lifetimes and ownership,
and usually one can get access using `as_raw_fd()`.
However, we `split()` the stream, and the resulting
[`tokio::io::WriteHalf`](https://docs.rs/tokio/latest/tokio/io/struct.WriteHalf.html)
does not expose the underlying stream.
Check the PR commit history for my attempts to do it.
My solution is to get the socket fd before we wrap it in our protocol
types, and to store that fd in the new `PostgresBackend::socket_fd`
field.
I believe it's safe because the lifetime of the `PostgresBackend::socket_fd`
value == the lifetime of the `TcpStream` that we wrap and store in
`PostgresBackend::framed`.
Specifically, the only place that close()s the socket is the `impl Drop
for TcpStream`.
I think the protocol stack calls `TcpStream::shutdown()`, but, that
doesn't `close()` the file descriptor underneath.
Regarding integration with the `measure()` function, the trouble is that
`flush_fut` is currently a generic `Future` type. So, we just pass in
the `socket_fd` as a separate argument.
A clean implementation would convert the `pgb_writer.flush()` to a named
future that provides an accessor for the socket fd while not being
polled.
I tried (see PR history), but failed to break through the `WriteHalf`.
# Testing
Tested locally by running
```
./target/debug/pagebench get-page-latest-lsn --num-clients=1000 --queue-depth=1000
```
in one terminal, waiting a bit, then
```
pkill -STOP pagebench
```
then wait for slow logs to show up in `pageserver.log`.
Pick one of the slow log message's port pairs, e.g., `127.0.0.1:39500`,
and then check the `ss` output
```
ss -ntp | grep '127.0.0.1:39500'
```
to ensure that send & recv queue size match those in the log message.
## Problem
We didn't catch all client errors causing alerts.
## Summary of changes
Client errors should be wrapped with ClientError so that it doesn't fire
alerts.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We expose `latest_gc_cutoff` in our API, and callers understandably were
using that to validate LSNs for branch creation. However, this is _not_
the true GC cutoff from a user's point of view: it's just the point at
which we last actually did GC. The actual cutoff used when validating
branch creations and page_service reads is the min() of latest_gc_cutoff
and the planned GC lsn in GcInfo.
Closes: https://github.com/neondatabase/neon/issues/10639
## Summary of changes
- Expose the more useful min() of GC cutoffs as `gc_cutoff_lsn` in the
API, so that the most obviously named field is really the one people
should use.
- Retain the ability to read the LSN at which GC was actually done, in
an `applied_gc_cutoff_lsn` field.
- Internally rename `latest_gc_cutoff_lsn` to `applied_gc_cutoff_lsn`
("latest" was a confusing name, as the value in GcInfo is more up to
date in terms of what a user experiences)
- Temporarily preserve the old `latest_gc_cutoff_lsn` field for compat
with control plane until we update it to use the new field.
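
A minimal sketch of the relationship between the two values, assuming a plain integer stand-in for `Lsn`. The `GcCutoffs` struct and the `planned_gc_cutoff_lsn` field name are hypothetical; only `gc_cutoff_lsn`, `applied_gc_cutoff_lsn`, and the use of `min()` come from the description above.

```rust
/// Stand-in for the real LSN type.
pub type Lsn = u64;

/// Hypothetical holder for the two cutoffs discussed above.
pub struct GcCutoffs {
    /// The LSN at which GC was last actually performed.
    pub applied_gc_cutoff_lsn: Lsn,
    /// The planned GC LSN (the value held in GcInfo).
    pub planned_gc_cutoff_lsn: Lsn,
}

impl GcCutoffs {
    /// The value exposed as `gc_cutoff_lsn`: the min() of the two cutoffs.
    pub fn gc_cutoff_lsn(&self) -> Lsn {
        self.applied_gc_cutoff_lsn.min(self.planned_gc_cutoff_lsn)
    }
}

fn main() {
    let cutoffs = GcCutoffs {
        applied_gc_cutoff_lsn: 1_000,
        planned_gc_cutoff_lsn: 1_500,
    };
    println!("applied_gc_cutoff_lsn = {}", cutoffs.applied_gc_cutoff_lsn);
    println!("gc_cutoff_lsn = {}", cutoffs.gc_cutoff_lsn());
}
```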
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
# Problem
Say we have a batch of 10 responses to send out.
Then, even with
- #10728
we've still only called observe_execution_end_flush_start for the first
3 responses.
The remaining 7 response timers are still ticking.
When compute now closes the connection, the waiting flush fails with an
error and we `drop()` the remaining 7 responses' smgr op timers. The
`impl Drop for SmgrOpTimer` will observe an execution time that includes
the flush time.
In practice, this is suspected to produce the `+Inf` observations in the
smgr op latency histogram we've seen since the introduction of
pipelining, even after shipping #10728.
refs:
- fixup of https://github.com/neondatabase/neon/pull/10042
- fixup of https://github.com/neondatabase/neon/pull/10728
- fixes https://github.com/neondatabase/neon/issues/10754