## Problem
The HTTP router allowlists matched both on the path and the query
string. This meant that only `/profile/cpu` would be allowed without
auth, while `/profile/cpu?format=svg` would require auth.
Follows #9764.
## Summary of changes
* Match allowlists on URI path, rather than the entire URI.
* Fix the allowlist for Safekeeper to use `/profile/cpu` rather than the
old `/pprof/profile`.
* Just use a constant slice for the allowlist; it's only a handful of
items, and these handlers are not on hot paths.
## Problem
We don't have a convenient way to gather CPU profiles from a running
binary, e.g. during production incidents or end-to-end benchmarks, nor
during microbenchmarks (particularly on macOS).
We would also like to have continuous profiling in production, likely
using [Grafana Cloud
Profiles](https://grafana.com/products/cloud/profiles-for-continuous-profiling/).
We may choose to use either eBPF profiles or pprof profiles for this
(pending testing and discussion with SREs), but pprof profiles appear
useful regardless for the reasons listed above. See
https://github.com/neondatabase/cloud/issues/14888.
This PR is intended as a proof of concept, to try it out in staging and
drive further discussions about profiling more broadly.
Touches #9534.
Touches https://github.com/neondatabase/cloud/issues/14888.
## Summary of changes
Adds a HTTP route `/profile/cpu` that takes a CPU profile and returns
it. Defaults to a 5-second pprof Protobuf profile for use with e.g.
`pprof` or Grafana Alloy, but can also emit an SVG flamegraph. Query
parameters:
* `format`: output format (`pprof` or `svg`)
* `frequency`: sampling frequency in microseconds (default 100)
* `seconds`: number of seconds to profile (default 5)
Also integrates pprof profiles into Criterion benchmarks, such that
flamegraph reports can be taken with `cargo bench ... --profile-duration
<seconds>`. Output under `target/criterion/*/profile/flamegraph.svg`.
Example profiles:
* pprof profile (use [`pprof`](https://github.com/google/pprof)):
[profile.pb.gz](https://github.com/user-attachments/files/17756788/profile.pb.gz)
* Web interface: `pprof -http :6060 profile.pb.gz`
* Interactive flamegraph:
[profile.svg.gz](https://github.com/user-attachments/files/17756782/profile.svg.gz)
## Problem
Two recently observed log errors indicate safekeeper tasks for a
timeline running after that timeline's deletion has started.
- https://github.com/neondatabase/neon/issues/8972
- https://github.com/neondatabase/neon/issues/8974
These code paths do not have a mechanism that coordinates task shutdown
with the overall shutdown of the timeline.
## Summary of changes
- Add a `Gate` to `Timeline`
- Take the gate as part of resident timeline guard: any code that holds
a guard over a timeline staying resident should also hold a guard over
the timeline's total lifetime.
- Take the gate from the wal removal task
- Respect Timeline::cancel in WAL send/recv code, so that we do not
block shutdown indefinitely.
- Add a test that deletes timelines with open pageserver+compute
connections, to check these get torn down as expected.
There is some risk to introducing gates: if there is code holding a gate
which does not properly respect a cancellation token, it can cause
shutdown hangs. The risk of this for safekeepers is lower in practice
than it is for other services, because in a healthy timeline deletion,
the compute is shutdown first, then the timeline is deleted on the
pageserver, and finally it is deleted on the safekeepers -- that makes
it much less likely that some protocol handler will still be running.
Closes: #8972Closes: #8974
## Problem
When processing pipelined `AppendRequest`s, we explicitly flush the WAL
every second and return an `AppendResponse`. However, the WAL is also
implicitly flushed on segment bounds, but this does not result in an
`AppendResponse`. Because of this, concurrent transactions may take up
to 1 second to commit and writes may take up to 1 second before sending
to the pageserver.
## Summary of changes
Advance `flush_lsn` when a WAL segment is closed and flushed, and emit
an `AppendResponse`. To accommodate this, track the `flush_lsn` in
addition to the `flush_record_lsn`.
If WAL truncation fails in the middle it might leave some data on disk
above the write/flush LSN. In theory, concatenated with previous records
it might form bogus WAL (though very unlikely in practice because CRC
would protect from that). To protect from that, set
pending_wal_truncation flag: means before any WAL writes truncation must
be retried until it succeeds. We already did that in case of safekeeper
restart, now extend this mechanism for failures without restart. Also,
importantly, reset LSNs in the beginning of the operation, not in the
end, because once on disk deletion starts previous pointers are wrong.
All this most likely haven't created any problems in practice because
CRC protects from the consequences.
Tests for this are hard; simulation infrastructure might be useful here
in the future, but not yet.
## Problem
`TimelinePersistentState::empty()`, used for tests and benchmarks, had a
hardcoded 16 MB WAL segment size. This caused confusion when attempting
to change the global segment size.
## Summary of changes
Inherit from `WAL_SEGMENT_SIZE` in `TimelinePersistentState::empty()`.
## Problem
The control file is flushed on the WAL ingest path when the commit LSN
advances by one segment, to bound the amount of recovery work in case of
a crash. This involves 3 additional fsyncs, which can have a significant
impact on WAL ingest throughput. This is to some extent mitigated by
`AppendResponse` not being emitted on segment bound flushes, since this
will prevent commit LSN advancement, which will be addressed separately.
## Summary of changes
Don't flush the control file on the WAL ingest path at all. Instead,
leave that responsibility to the timeline manager, but ask it to flush
eagerly if the control file lags the in-memory commit LSN by more than
one segment. This should not cause more than `REFRESH_INTERVAL` (300 ms)
additional latency before flushing the control file, which is
negligible.
## Problem
We don't have a metric capturing the latency of segment initialization.
This can be significant due to fsyncs.
## Summary of changes
Add an `initialize_segment` variant of
`safekeeper_wal_storage_operation_seconds`.
## Problem
We wish to stop using admin tokens in the infra repo, but step down
requests use the admin token.
## Summary of Changes
Introduce a new "ControllerPeer" scope and use it for step-down requests.
## Problem
When we create a new segment, we zero it out in order to avoid changing
the length and fsyncing metadata on every write. However, we zeroed it
out by writing 8 KB zero-pages, and Tokio file writes have non-trivial
overhead.
## Summary of changes
Zero out the segment using
[`File::set_len()`](https://docs.rs/tokio/latest/i686-unknown-linux-gnu/tokio/fs/struct.File.html#method.set_len)
instead. This will typically (depending on the filesystem) just write a
sparse file and omit the 16 MB of data entirely. This improves WAL
append throughput for large messages by over 400% with fsync disabled,
and 100% with fsync enabled.
## Problem
We don't have any benchmarks for Safekeeper WAL ingestion.
## Summary of changes
Add some basic benchmarks for WAL ingestion, specifically for
`SafeKeeper::process_msg()` (single append) and `WalAcceptor` (pipelined
batch ingestion). Also add some baseline file write benchmarks.
## Problem
Benchmarks need more control over the WAL generated by `WalGenerator`.
In particular, they need to vary the size of logical messages.
## Summary of changes
* Make `WalGenerator` generic over `RecordGenerator`, which constructs
WAL records.
* Add `LogicalMessageGenerator` which emits logical messages, with a
configurable payload.
* Minor tweaks and code reorganization.
There are no changes to the core logic or emitted WAL.
## Problem
While experimenting with `MAX_SEND_SIZE` for benchmarking, I saw stack
overflows when increasing it to 1 MB. Turns out a few buffers of this
size are stack-allocated rather than heap-allocated. Even at the default
128 KB size, that's a bit large to allocate on the stack.
## Summary of changes
Heap-allocate buffers of size `MAX_SEND_SIZE`.
## Problem
We don't have any observability for Safekeeper WAL receiver queues.
## Summary of changes
Adds a few WAL receiver metrics:
* `safekeeper_wal_receivers`: gauge of currently connected WAL
receivers.
* `safekeeper_wal_receiver_queue_depth`: histogram of queue depths per
receiver, sampled every 5 seconds.
* `safekeeper_wal_receiver_queue_depth_total`: gauge of total queued
messages across all receivers.
* `safekeeper_wal_receiver_queue_size_total`: gauge of total queued
message sizes across all receivers.
There are already metrics for ingested WAL volume: `written_wal_bytes`
counter per timeline, and `safekeeper_write_wal_bytes` per-request
histogram.
## Problem
We don't have any observability into full WalAcceptor queues per
timeline.
## Summary of changes
Logs a message when a WalAcceptor send has blocked for 5 seconds, and
another message when the send completes. This implies that the log
frequency is at most once every 5 seconds per timeline, so we don't need
further throttling.
## Problem
We don't have a convenient way to generate WAL records for benchmarks
and tests.
## Summary of changes
Adds a WAL generator, exposed as an iterator. It currently only
generates logical messages (noops), but will be extended to write actual
table rows later.
Some existing code for WAL generation has been replaced with this
generator, to reduce duplication.
## Problem
The `WalAcceptor` main loop currently uses two nested loops to consume
inbound messages. This makes it hard to slot in periodic events like
metrics collection. It also duplicates the event processing code, and assumes
all messages in steady state are AppendRequests (other messages types may
be dropped if following an AppendRequest).
## Summary of changes
Refactor the `WalAcceptor` loop to be event driven.
## Problem
When we use pull_timeline API on an evicted timeline, it gets downloaded
to serve the snapshot API request. That means that to evacuate all the
timelines from a node, the node needs enough disk space to download
partial segments from all timelines, which may not be physically the
case.
Closes: #8833
## Summary of changes
- Add a "try" variant of acquiring a residence guard, that returns None
if the timeline is offloaded
- During snapshot API handler, take a different code path if the
timeline isn't resident, where we just read the checkpoint and don't try
to read any segments.
## Problem
The storage components take an entire `SafekeeperConf` during
construction, but only actually use the `no_sync` field. This makes it
hard to understand the storage inputs (which fields do they actually
care about?), and is also inconvenient for tests and benchmarks that
need to set up a lot of unnecessary boilerplate.
## Summary of changes
* Don't take the entire config, but pass in the `no_sync` field
explicitly.
* Take the timeline dir instead of `ttid` as an input, since it's the
only thing it cares about.
* Fix a couple of tests to not leak tempdirs.
* Various minor tweaks.
## Problem
The Postgres version in `TimelinePersistentState::empty()` is incorrect:
the major version should be multiplied by 10000.
## Summary of changes
Multiply the version by 10000.
Always do timeline init through atomic rename of temp directory. Add
GlobalTimelines::load_temp_timeline which does this, and use it from
both pull_timeline and basic timeline creation. Fixes a collection
of issues:
- previously timeline creation didn't really flushed cfile to disk
due to 'nothing to do if state didn't change' check;
- even if it did, without tmp dir it is possible to lose the cfile
but leave timeline dir in place, making it look corrupted;
- tenant directory creation fsync was missing in timeline creation;
- pull_timeline is now protected from concurrent both itself and
timeline creation;
- now global timelines map entry got special CreationInProgress
entry type which prevents from anyone getting access to timeline
while it is being created (previously one could get access to it,
but it was locked during creation, which is valid but confusing if
creation failed).
fixes#8927
## Problem
In #9259, we found that the `check_safekeepers_synced` fast path could
result in a lower basebackup LSN than the `flush_lsn` reported by
Safekeepers in `VoteResponse`, causing the compute to panic once on
startup.
This would happen if the Safekeeper had unflushed WAL records due to a
compute disconnect. The `TIMELINE_STATUS` query would report a
`flush_lsn` below these unflushed records, while `VoteResponse` would
flush the WAL and report the advanced `flush_lsn`. See
https://github.com/neondatabase/neon/issues/9259#issuecomment-2410849032.
## Summary of changes
Flush the WAL if the compute disconnects during WAL processing.
## Problem
Storage controller `/control` API mostly requires admin tokens, for
interactive use by engineers. But for endpoints used by scripts, we
should not require admin tokens.
Discussion at
https://neondb.slack.com/archives/C033RQ5SPDH/p1728550081788989?thread_ts=1728548232.265019&cid=C033RQ5SPDH
## Summary of changes
- Introduce the 'infra' JWT scope, which was not previously used in the
neon repo
- For pageserver & safekeeper node registrations, require infra scope
instead of admin
Note that admin will still work, as the controller auth checks permit
admin tokens for all endpoints irrespective of what scope they require.
`download_byte_range()` is basically a copy of `download()` with an
additional option passed to the backend SDKs. This can cause these code
paths to diverge, and prevents combining various options.
This patch adds `DownloadOpts::byte_(start|end)` and move byte range
handling into `download()`.
Update hyper and tonic again in the storage broker, this time with a fix
for the issue that made us revert the update last time.
The first commit is a revert of #9268, the second a fix for the issue.
fixes#9231.
If peer safekeeper needs garbage collected segment it will be fetched
now from s3 using on-demand WAL download. Reduces danger of running out of disk space when safekeeper fails.
Follow-up of #9234 to give hyper 1.0 the version-free name, and the
legacy version of hyper the one with the version number inside. As we
move away from hyper 0.14, we can remove the `hyper0` name piece by
piece.
Part of #9255
Fixes#9231 .
Upgrade hyper to 1.4.0 and use hyper 1.4 instead of 0.14 in the storage
broker, together with tonic 0.12. The two upgrades go hand in hand.
Thanks to the broker being independent from other components, we can
upgrade its hyper version without touching the other components, which
makes things easier.
To make it faster. On my laptop, it takes about 30 before this commit.
In the arm64 debug variant in CI, it takes about 120 s. Reduce it by
factor of 4.
## Problem
Safekeeper's OpenAPI spec is incorrect:
```
Semantic error at paths./v1/tenant/{tenant_id}/timeline/{timeline_id}.get.responses.404.content.application/json.schema.$ref
$refs must reference a valid location in the document
Jump to line 126
```
Checked on https://editor.swagger.io
## Summary of changes
- Add `NotFoundError`
- Add `description` and `license` fields to make Cloud OpenAPI spec
linter happy
When walproposer observes now higher term it restarts instead of
crashing whole compute with PANIC; this avoids compute crash after
term_bump call. After successfull election we're still checking
last_log_term of the highest given vote to ensure basebackup is good,
and PANIC otherwise.
It will be used for migration per
035-safekeeper-dynamic-membership-change.md
and
https://github.com/neondatabase/docs/pull/21
ref https://github.com/neondatabase/neon/issues/8700
Check that truncation point is not from the future by comparing it with
write_record_lsn, not write_lsn, and explain that xlog switch changes
their normal order.
ref https://github.com/neondatabase/neon/issues/8911
Addresses the 1.82 beta clippy lint `too_long_first_doc_paragraph` by
adding newlines to the first sentence if it is short enough, and making
a short first sentence if there is the need.
wal_storage.rs already checks this, but since this is a quite legit scenario
check it at safekeeper.rs (consensus level) as well.
ref https://github.com/neondatabase/neon/issues/8212
This is a take 2; previous PR #8640 had been reverted because interplay
with another change broke test_last_log_term_switch.