# Problem
walredo shutdown is done in the compaction task. Let's move it to tenant
housekeeping.
# Summary of changes
* Rename "ingest housekeeping" to "tenant housekeeping".
* Move walredo shutdown into tenant housekeeping.
* Add a constant `WALREDO_IDLE_TIMEOUT` set to 3 minutes (previously 10x
compaction threshold).
This patch does a bunch of superficial cleanups of `tenant::tasks` to
avoid noise in subsequent PRs. There are no functional changes.
PS: enable "hide whitespace" when reviewing, due to the unindentation of
large async blocks.
# Problem
Before this PR, the `shard_id` field was missing when page_service logs
a reconstruct error.
This was caused by batching-related refactorings.
Example from staging:
```
2025-01-30T07:10:04.346022Z ERROR page_service_conn_main{peer_addr=...}:process_query{tenant_id=... timeline_id=...}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key ...
```
# Changes
Delay creation of the handler-specific span until after shard routing
This also avoids the need for the record() call in the pagestream hot
path.
# Testing
Manual testing with a failpoint that is part of this PR's history but
will be squashed away.
# Refs
- fixes https://github.com/neondatabase/neon/issues/10599
## Problem
There are a couple of log warnings tripping up
`test_timeline_archival_chaos`
- `[stopping left-over name="timeline_delete"
tenant_shard_id=2d526292b67dac0e6425266d7079c253
timeline_id=Some(44ba36bfdee5023672c93778985facd9)
kind=TimelineDeletionWorker\n')](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10672/13161357302/index.html#/testresult/716b997bb1d8a021)`
- `ignoring attempt to restart exited flush_loop
503d8f401d8887cfaae873040a6cc193/d5eed0673ba37d8992f7ec411363a7e3\n')`
Related: https://github.com/neondatabase/neon/issues/10389
## Summary of changes
- Downgrade the 'ignoring attempt to restart' to info -- there's nothing
in the design that forbids this happening, i.e. someone calling
maybe_spawn_flush_loop concurrently with shutdown()
- Prevent timeline deletion tasks outliving tenants by carrying a
gateguard. This logically makes sense because the deletion process does
call into Tenant to update manifests.
## Problem
L0 compaction can get starved by other background tasks. It needs to be
responsive to avoid read amp blowing up during heavy write workloads.
Touches #10694.
## Summary of changes
Add a separate semaphore for compaction, configurable via
`use_compaction_semaphore` (disabled by default). This is primarily for
testing in staging; it needs further work (in particular to split
image/L0 compaction jobs) before it can be enabled.
## Problem
This is tech debt. While we introduced generations for tenants, some
legacy situations without generations needed to delete things inline
(async operation) instead of enqueing them (sync operation).
## Summary of changes
- Remove the async code, replace calls with the sync variant, and assert
that the generation is always set
## Problem
We don't have visibility into how long an individual background job is
waiting for a semaphore permit.
## Summary of changes
* Make `pageserver_background_loop_semaphore_wait_seconds` a histogram
rather than a sum.
* Add a paced warning when a task takes more than 10 minutes to get a
permit (for now).
* Drive-by cleanup of some `EnumMap` usage.
## Problem
We need a metrics to know what's going on in pageserver's background
jobs.
## Summary of changes
* Waiting tasks: task still waiting for the semaphore.
* Running tasks: tasks doing their actual jobs.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Erik Grinaker <erik@neon.tech>
Often the output of the timestamp->lsn API is used as input for branch
creation, and branch creation takes the planned lsn into account, i.e.
rejects lsn's as branch lsns that are before the planned lsn.
This patch doesn't fix all race conditions, it's still racy. But at
least it is a step into the right direction.
For #10639
## Problem
Following #10641, let's add a few critical errors.
Resolves#10094.
## Summary of changes
Adds the following critical errors:
* WAL sender read/decode failure.
* WAL record ingestion failure.
* WAL redo failure.
* Missing key during compaction.
We don't add an error for missing keys during GetPage requests, since
we've seen a handful of these in production recently, and the cause is
still unclear (most likely a benign race).
Right now, branch creation doesn't care if a lsn lease exists or not, it
just fails if the passed lsn is older than either the last or the
planned gc cutoff.
However, if an lsn lease exists for a given lsn, we can actually create
a branch at that point: nothing has been gc'd away.
This prevents race conditions that #10678 still leaves around.
Related: #10639https://github.com/neondatabase/cloud/issues/23667
We don't want any external requests for an archived timeline. This
includes basebackup requests, i.e. when a compute is being started up.
Therefore, we'd like to forbid such basebackup requests: any attempt to
get a basebackup on an archived timeline (or any getpage request really)
is a cplane bug. Make this a warning for now so that, if there is
potentially a bug, we can detect cases in the wild before they cause
stuck operations, but the intention is to return an error eventually.
Related: #9548
On my AX102 Hetzner box, removing this line removes about 20us from the
`latency_mean` result in
`test_pageserver_characterize_latencies_with_1_client_and_throughput_with_many_clients_one_tenant`.
If the same 20us can be removed in the nightly benchmark run, this will
be a ~10% improvement because there, mean latencies are about ~220us.
This span was added during batching refactors, we didn't have it before,
and I don't think it's terribly useful.
refs
- https://github.com/neondatabase/cloud/issues/21759
## Problem
The local filesystem backend for remote storage doesn't set a
concurrency limit. While it can't/won't enforce a concurrency limit
itself, this also bounds the upload queue concurrency. Some tests create
thousands of uploads, which slows down the quadratic scheduling of the
upload queue, and there is no point spawning that many Tokio tasks.
Resolves#10409.
## Summary of changes
Set a concurrency limit of 100 for the LocalFS backend.
Before: `test_layer_map[release-pg17].test_query: 68.338 s`
After: `test_layer_map[release-pg17].test_query: 5.209 s`
## Problem
The code is intended to reschedule compaction immediately if there are
pending tasks. We set the duration to 0 before if there are pending
tasks, but this will go through the `if period == Duration::ZERO {`
branch and sleep for another 10 seconds.
## Summary of changes
Set duration to 1 so that it doesn't sleep for too long.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Any compaction should never produce l0 layers. This never happened in my
experiments, but would be good to guard it early.
## Summary of changes
Disallow gc-compaction to produce l0 layers.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
close https://github.com/neondatabase/neon/issues/10651
## Summary of changes
* Image layer creation starts from the next partition of the last
processed partition if the previous attempt was not complete.
* Add tests.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We don't have visibility into the ratio of image vs. delta pages
ingested in Pageservers. This might be useful to determine whether we
should compress WAL records before storing them, which in turn might
make compaction more efficient.
## Summary of changes
Add `pageserver_wal_ingest_values_committed` metric with dimensions
`class=metadata|data` and `kind=image|delta`.
## Problem
We wish to make heatmap generation additive in
https://github.com/neondatabase/neon/pull/10597.
However, if the pageserver restarts and has a heatmap on disk from when
it was a secondary long ago,
we can end up keeping extra layers on the secondary's disk.
## Summary of changes
Persist the heatmap after a successful upload.
## Problem
In #9895, we fixed some issues where `ClearVmBits` were broadcast to all
shards, even those not owning the VM relation. As part of that, we found
some ancient code from #1417, which discarded spurious incorrect
`ClearVmBits` records for pages outside of the VM relation. We added
observability in #9911 to see how often this actually happens in the
wild.
After two months, we have not seen this happen once in production or
staging. However, out of caution, we don't want a hard error and break
WAL ingestion.
Resolves#10067.
## Summary of changes
Log a critical error when ingesting `ClearVmBits` for unknown VM
relations or pages.
## Problem
Image layer generation could block L0 compactions for a long time.
## Summary of changes
* Refactored the return value of `create_image_layers_for_*` functions
to make it self-explainable.
* Preempt image layer generation in `Try` mode if L0 piles up.
Note that we might potentially run into a state that only the beginning
part of the keyspace gets image coverage. In that case, we either need
to implement something to prioritize some keyspaces with image coverage,
or tune the image_creation_threshold to ensure that the frequency of
image creation could keep up with L0 compaction.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Erik Grinaker <erik@neon.tech>
## Problem
When a client dropped before a request completed, and a handler returned
an ApiError, we would log that at error severity. That was excessive in
the case of a request erroring on a shutdown, and could cause test
flakes.
example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/13067651123/index.html#suites/ad9c266207b45eafe19909d1020dd987/6021ce86a0d72ae7/
```
Cancelled request finished with an error: ShuttingDown
```
## Summary of changes
- Log a different info-level on ShuttingDown and ResourceUnavailable API
errors from cancelled requests
## Problem
This assertion is incorrect: it is legal to see another shard's data at
this point, after a shard split.
Closes: https://github.com/neondatabase/neon/issues/10609
## Summary of changes
- Remove faulty assertion
## Problem
If offloading races with normal shutdown, we get a "failed to freeze and
flush: cannot flush frozen layers when flush_loop is not running, state
is Exited". This is harmless but points to it being quite strange to try
and freeze and flush such a timeline. flushing on shutdown for an
archived timeline isn't useful.
Related: https://github.com/neondatabase/neon/issues/10389
## Summary of changes
- During Timeline::shutdown, ignore ShutdownMode::FreezeAndFlush if the
timeline is archived
## Problem
Timeline bootstrap starts a flush loop, but doesn't reliably shut down
the timeline (incl. waiting for flush loop to exit) before destroying
UninitializedTimeline, and that destructor tries to clean up local
storage. If local storage is still being written to, then this is
unsound.
Currently the symptom is that we see a "Directory not empty" error log,
e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/main/12966756686/index.html#testresult/5523f7d15f46f7f7/retries
## Summary of changes
- Move fallible IO part of bootstrap into a function (notably, this is
fallible in the case of the tenant being shut down while creation is
happening)
- When that function returns an error, call shutdown() on the timeline
## Problem
close https://github.com/neondatabase/neon/issues/10482
## Summary of changes
Add an extra lock on the read path to protect against races. The read
path has an implication that only certain kind of compactions can be
performed. Garbage keys must first have an image layer covering the
range, and then being gc-ed -- they cannot be done in one operation. An
alternative to fix this is to move the layers read guard to be acquired
at the beginning of `get_vectored_reconstruct_data_timeline`, but that
was intentionally optimized out and I don't want to regress.
The race is not limited to image layers. Gc-compaction will consolidate
deltas automatically and produce a flat delta layer (i.e., when we have
retain_lsns below the gc-horizon). The same race would also cause
behaviors like getting an un-replayable key history as in
https://github.com/neondatabase/neon/issues/10049.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Related to #10308, we might have legitimate changes in file size or
generation. Those changes should not cause warn log lines.
In order to detect changes of the generation number while the file size
stayed the same, load the metadata that we store on disk on loading of
the timeline.
Still do a comparison with the on-disk layer sizes to find any
discrepancies that might occur due to race conditions (new metadata file
gets written but layer file has not been updated yet, and PS shuts
down). However, as it's possible to hit it in a race conditon, downgrade
it to a warning.
Also fix a mistake in #10529: we want to compare the old with the new
metadata, not the old metadata with itself.
## Problem
We don't have per-timeline observability for read amplification.
Touches https://github.com/neondatabase/cloud/issues/23283.
## Summary of changes
Add a per-timeline `pageserver_layers_per_read` histogram.
NB: per-timeline histograms are expensive, but probably worth it in this
case.
## Problem
We suspect that Postgres checkpoints will limit the number of page
deltas necessary to reconstruct a page, but don't know for certain.
Touches https://github.com/neondatabase/cloud/issues/23283.
## Summary of changes
Add `pageserver_deltas_per_read_global` metric.
This pairs with `pageserver_layers_per_read_global` from #10573.
## Problem
The current global `pageserver_layers_visited_per_vectored_read_global`
metric does not appear to accurately measure read amplification. It
divides the layer count by the number of reads in a batch, but this
means that e.g. 10 reads with 100 L0 layers will only measure a read amp
of 10 per read, while the actual read amp was 100.
While the cost of layer visits are amortized across the batch, and some
layers may not intersect with a given key, each visited layer
contributes directly to the observed latency for every read in the
batch, which is what we care about.
Touches https://github.com/neondatabase/cloud/issues/23283.
Extracted from #10566.
## Summary of changes
* Count the number of layers visited towards each read in the batch,
instead of the average across the batch.
* Rename `pageserver_layers_visited_per_vectored_read_global` to
`pageserver_layers_per_read_global`.
* Reduce the read amp log warning threshold down from 512 to 100.
## Problem
Follow up of https://github.com/neondatabase/neon/pull/10550 in case the
upper limit is set larger than threshold. It does not make sense for
someone to enforce the behavior like "if there are >= 50 L0s, only
compact 10 of them".
## Summary of changes
Use the maximum of compaction threshold and upper limit when selecting
L0 files to compact.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Repartition is slow, but it's only used in image layer creation. We can
skip it if we have a lot of L0 layers to ingest.
## Summary of changes
If L0 compaction is not complete, do not repartition and do not create
image layers.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We don't have good observability for per-timeline compaction debt,
specifically the number of delta layers in the frozen, L0, and L1
levels.
Touches https://github.com/neondatabase/cloud/issues/23283.
## Summary of changes
* Add a `level` label for `pageserver_layer_{count,size}` with values
`l0`, `l1`, and `frozen`.
* Track metrics for frozen layers.
There is already a `kind={delta,image}` label. `kind=image` is only
possible for `level=l1`.
We don't include the currently open ephemeral layer, only frozen layers.
There is always exactly 1 ephemeral layer, with a dynamic size which is
already tracked in `pageserver_timeline_ephemeral_bytes`.
## Problem
If we are GC-ing because a new image layer was added while traversing
the timeline, then it will remove layers that are required for
fulfilling the current get request (read-path cannot "look back" and
notice the new image layer).
## Summary of Changes
Prevent GC from progressing on the current timeline while it is being
visited for a read.
Epic: https://github.com/neondatabase/neon/issues/9376
Luckily they were the same version, so we didn't spend time compiling
two versions, which could have been the case in the future.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Follow-up of the incident, we should not use the same bound on
lower/upper limit of compaction files. This patch adds an upper bound
limit, which is set to 50 for now.
## Summary of changes
Add `compaction_upper_limit`.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
We've seen the ingest connection manager get stuck shortly after a
migration.
## Summary of changes
A speculative mitigation is to use the same mechanism as get page
requests for kicking LSN ingest. The connection manager monitors
LSN waits and queries the broker if no updates are received for the
timeline.
Closes https://github.com/neondatabase/neon/issues/10351
## Problem
We need a setting to disable the flush upload wait, to test L0 flush
backpressure in staging.
## Summary of changes
Add `l0_flush_wait_upload` setting.
Only a few things that needed updating:
- async_trait was removed
- Message::Text takes a Utf8Bytes object instead of a String
Signed-off-by: Tristan Partin <tristan@neon.tech>
Co-authored-by: Conrad Ludgate <connor@neon.tech>
In #10308, we noticed many warnings about the local layer having
different sizes on-disk compared to the metadata.
However, the layer downloader would never redownload layer files if the
sizes or generation numbers change. This is obviously a bug, which we
aim to fix with this PR.
This change also moves the code deciding what to do about a layer to a
dedicated function: before we handled the "routing" via control flow,
but now it's become too complicated and it is nicer to have the
different verdicts for a layer spelled out in a list/match.
This reverts commit 9e55d79803.
We'll still need this until we can tune L0 flush backpressure and
compaction. I'll add a setting to disable this separately.
## Problem
This one is fairly embarrassing. Safekeeper node id was used in the
pageserver application name
when connecting to safekeepers.
## Summary of changes
Use the right node id.
Closes https://github.com/neondatabase/neon/issues/10461