The old test based on the immutable `target_file_size` that was a
parameter to the function.
It makes no sense to go further once `current_level_target_height` has
reached `u64::MAX`, as lsn's are u64 typed. In practice, we should only
run into this if there is a bug, as the practical lsn range usually ends
much earlier.
Testing on `target_file_size` makes less sense, it basically implements
an invocation mode that turns off the looping and only runs one
iteration of it.
@hlinnaka agrees that `current_level_target_height` is better here.
Part of #7554
in addition to layer names, expand the input vocabulary to recognize
lines in the form of:
${kind}:${lsn}
where:
- kind in `gc_cutoff` or `branch`
- lsn is accepted in Lsn display format (x/y) or hex (as used in layer
names)
gc_cutoff and branch have different colors.
We had accidentally left two endpoints for `tenant`: `/synthetic_size`
and `/size`. Size had the more extensive description but has returned
404 since renaming. Remove the `/size` in favor of the working one and
describe the `text/html` output.
## Problem
This caused a variation of the stats bug fixed by
https://github.com/neondatabase/neon/pull/7662. That PR also fixed this
case, but we still shouldn't make redundant get calls.
## Summary of changes
- Only call get in the create image layers loop at the end of a range if
some keys have been accumulated
This PR does two things:
First, it fixes a bug with tiered compaction's k-merge implementation.
It ignored the lsn of a key during ordering, so multiple updates of the
same key could be read in arbitrary order, say from different layers.
For example there is layers `[(a, 2),(b, 3)]` and `[(a, 1),(c, 2)]` in
the heap, they might return `(a,2)` and `(a,1)`.
Ultimately, this change wasn't enough to fix the ordering issues in
#7296, in other words there is likely still bugs in the k-merge. So as
the second thing, we switch away from the k-merge to an in-memory based
one, similar to #4839, but leave the code around to be improved and
maybe switched to later on.
Part of #7296
For aux file keys (v1 or v2) the vectored read path does not return an
error when they're missing. Instead they are omitted from the resulting
btree (this is a requirement, not a bug). Skip updating the metric in
these cases to avoid infinite results.
- Rename "filename" types which no longer map directly to a filename
(LayerFileName -> LayerName)
- Add a -v1- part to local layer paths to smooth the path to future
updates (we anticipate a -v2- that uses checksums later)
- Rename methods that refer to the string-ized version of a LayerName to
no longer be called "filename"
- Refactor reconcile() function to use a LocalLayerFileMetadata type
that includes the local path, rather than carrying local path separately
in a tuple and unwrap()'ing it later.
Improves the tiered compaction tests:
* Adds a new test that is a simpler version of the ignored
`test_many_updates_for_single_key` test.
* Reduces the amount of data that `test_many_updates_for_single_key`
processes to make it execute more quickly.
* Adds logging support.
Adds ordering asserts to the output of the delta key iterator
`MergeDeltaKeys` that implements a k-merge.
Part of #7296 : the asserts added by this PR get hit in the reproducers
of #7296 as well, but they are earlier in the pipeline.
## Problem
In https://github.com/neondatabase/neon/pull/7531, we would like to be
able to rewrite layers safely. One option is to make `Layer` able to
rewrite files in place safely (e.g. by blocking evictions/deletions for
an old Layer while a new one is created), but that's relatively fragile.
It's more robust in general if we simply never overwrite the same local
file: we can do that by putting the generation number in the filename.
## Summary of changes
- Add `local_layer_path` (counterpart to `remote_layer_path`) and
convert all locations that manually constructed a local layer path by
joining LayerFileName to timeline path
- In the layer upload path, construct remote paths with
`remote_layer_path` rather than trying to build them out of a local
path.
- During startup, carry the full path to layer files through
`init::reconcile`, and pass it into `Layer::for_resident`
- Add a test to make sure we handle upgrades properly.
- Comment out the generation part of `local_layer_path`, since we need
to maintain forward compatibility for one release. A tiny followup PR
will enable it afterwards.
We could make this a bit simpler if we bulk renamed existing layers on
startup instead of carrying literal paths through init, but that is
operationally risky on existing servers with millions of layer files. We
can always do a renaming change in future if it becomes annoying, but
for the moment it's kind of nice to have a structure that enables us to
change local path names again in future quite easily.
We should rename `LayerFileName` to `LayerName` or somesuch, to make it
more obvious that it's not a literal filename: this was already a bit
confusing where that type is used in remote paths. That will be a
followup, to avoid polluting this PR's diff.
This pull request adds the new basebackup read path + aux file write
path. In the regression test, all logical replication tests are run with
matrix aux_file_v2=false/true.
Also fixed the vectored get code path to correctly return missing key
error when being called from the unified sequential get code path.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
RemoteTimelineClient has a lot of mandatory cloning. By using a single
way of creating IndexPart out of UploadQueueInitialized we can simplify
things and also avoid cloning the latest files for each
`index_part.json` upload (the contents will still be cloned).
## Problem
Timelines cannot be deleted if they have children. In many production
cases, a branch or a timeline has been created off the main branch for
various reasons to the effect of having now a "new main" branch. This
feature will make it possible to detach a timeline from its ancestor by
inheriting all of the data before the branchpoint to the detached
timeline and by also reparenting all of the ancestor's earlier branches
to the detached timeline.
## Summary of changes
- Earlier added copy_lsn_prefix functionality is used
- RemoteTimelineClient learns to adopt layers by copying them from
another timeline
- LayerManager adds support for adding adopted layers
-
`timeline::Timeline::{prepare_to_detach,complete_detaching}_from_ancestor`
and `timeline::detach_ancestor` are added
- HTTP PUT handler
Cc: #6994
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
After a shard split of a large existing tenant, child tenants can end up
with oversized historic layers indefinitely, if those layers are
prevented from being GC'd by branchpoints.
This PR is followed by https://github.com/neondatabase/neon/pull/7531
Related issue: https://github.com/neondatabase/neon/issues/7504
## Summary of changes
- Add a new compaction phase `compact_shard_ancestors`, which identifies
layers that are no longer needed after a shard split.
- Add a Timeline->LayerMap code path called `rewrite_layers` , which is
currently only used to drop layers, but will later be used to rewrite
them as well in https://github.com/neondatabase/neon/pull/7531
- Add a new test that compacts after a split, and checks that something
is deleted.
Note that this doesn't have much impact on a tenant's resident size
(since unused layers would end up evicted anyway), but it:
- Makes index_part.json much smaller
- Makes the system easier to reason about: avoid having tenants which
are like "my physical size is 4TiB but don't worry I'll never actually
download it", instead have tenants report the real physical size of what
they might download.
Why do we remove these layers in compaction rather than during the
split? Because we have existing split tenants that need cleaning up. We
can add it to the split operation in future as an optimization.
## Problem
The current `tenant_slots` metric becomes less useful once we have lots
of secondaries, because we can't tell how many tenants are really
attached (without doing a sum() on some other metric).
## Summary of changes
- Add a `mode` label to this metric
- Update the metric with `slot_added` and `slot_removed` helpers that
are called at all the places we mutate the tenants map.
- Add a debug assertion at shutdown that checks the metrics add up to
the right number, as a cheap way of validating that we're calling the
metric hooks in all the right places.
## Problem
In testing of the earlier fix for OOMs under heavy write load
(https://github.com/neondatabase/neon/pull/7218), we saw that the limit
on ephemeral layer size wasn't being reliably enforced. That was
diagnosed as being due to overwhelmed compaction loops: most tenants
were waiting on the semaphore for background tasks, and thereby not
running the function that proactively rolls layers frequently enough.
Related: https://github.com/neondatabase/neon/issues/6939
## Summary of changes
- Create a new per-tenant background loop for "ingest housekeeping",
which invokes maybe_freeze_ephemeral_layer() without taking the
background task semaphore.
- Downgrade to DEBUG a log line in maybe_freeze_ephemeral_layer that had
been INFO, but turns out to be pretty common in the field.
There's some discussion on the issue
(https://github.com/neondatabase/neon/issues/6939#issuecomment-2083554275)
about alternatives for calling this maybe_freeze_epemeral_layer
periodically without it getting stuck behind compaction. A whole task
just for this feels like kind of a big hammer, but we may in future find
that there are other pieces of lightweight housekeeping that we want to
do here too.
Why is it okay to call maybe_freeze_ephemeral_layer outside of the
background tasks semaphore?
- this is the same work we would do anyway if we receive writes from the
safekeeper, just done a bit sooner.
- The period of the new task is generously jittered (+/- 5%), so when
the ephemeral layer size tips over the threshold, we shouldn't see an
excessively aggressive thundering herd of layer freezes (and only layers
larger than the mean layer size will be frozen)
- All that said, this is an imperfect approach that relies on having a
generous amount of RAM to dip into when we need to freeze somewhat
urgently. It would be nice in future to also block compaction/GC when we
recognize resource stress and need to do other work (like layer
freezing) to reduce memory footprint.
This is the first step towards representing all of Pageserver
configuration as clean `serde::Serialize`able Rust structs in
`pageserver_api`.
The `neon_local` code will then use those structs instead of the crude
`toml_edit` / string concatenation that it does today.
refs https://github.com/neondatabase/neon/issues/7555
---------
Co-authored-by: Alex Chi Z <iskyzh@gmail.com>
This pull request adds the scan interface. Scan operates on a sparse
keyspace and retrieves all the key-value pairs from the keyspaces.
Currently, scan only supports the metadata keyspace, and by default do
not retrieve anything from the ancestor branch. This should be fixed in
the future if we need to have some keyspaces that inherits from the
parent.
The scan interface reuses the vectored get code path by disabling the
missing key errors.
This pull request also changes the behavior of vectored get on aux file
v1/v2 key/keyspace: if the key is not found, it is simply not included in the
result, instead of throwing a missing key error.
TODOs in future pull requests: limit memory consumption, ensure the
search stops when all keys are covered by the image layer, remove
`#[allow(dead_code)]` once the code path is used in basebackups / aux
files, remove unnecessary fine-grained keyspace tracking in vectored get
(or have another code path for scan) to improve performance.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The logic in Service::optimize_all would sometimes choose to migrate a
tenant to a secondary location that was only recently created, resulting
in Reconciler::live_migrate hitting its 5 minute timeout warming up the
location, and proceeding to attach a tenant to a location that doesn't
have a warm enough local set of layer files for good performance.
Closes: #7532
## Summary of changes
- Add a pageserver API for checking download progress of a secondary
location
- During `optimize_all`, connect to pageservers of candidate
optimization secondary locations, and check they are warm.
- During shard split, do heatmap uploads and start secondary downloads,
so that the new shards' secondary locations start downloading ASAP,
rather than waiting minutes for background downloads to kick in.
I have intentionally not implemented this by continuously reading the
status of locations, to avoid dealing with the scale challenge of
efficiently polling & updating 10k-100k locations status. If we
implement that in the future, then this code can be simplified to act
based on latest state of a location rather than fetching it inline
during optimize_all.
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.
Release notes: https://blog.rust-lang.org/2024/05/02/Rust-1.78.0.html
Prior update was in #7198
## Problem
After some time the load from heatmap uploads gets rather spiky. They're
unintentionally synchronising.
Chart (does this make a _boing_ sound in anyone else's head?):

## Summary of changes
- Add a helper `period_jitter` and apply a 5% jitter from downloader and
heatmap_uploader when updating the next runtime at the end of an
interation.
- Refactor existing places that we pick a startup interval into
`period_warmup`, so that the intent is obvious.
The current implementation of finding timeline gc cutoff Lsn(s) is done
while holding `Tenant::gc_cs`. In recent incidents long create branch
times were caused by holding the `Tenant::gc_cs` over extremely long
`Timeline::find_lsn_by_timestamp`. The fix is to find the GC cutoff
values before taking the `Tenant::gc_cs` lock. This change is safe to do
because the GC cutoff values and the branch points have no dependencies
on each other. In the case of `Timeline::find_gc_cutoff` taking a long
time with this change, we should no longer see `Tenant::gc_cs`
interfering with branch creation.
Additionally, the `Tenant::refresh_gc_info` is now tolerant of timeline
deletions (or any other failures to find the pitr_cutoff). This helps
with the synthetic size calculation being constantly completed instead
of having a break for a timely timeline deletion.
Fixes: #7560Fixes: #7587
Split `GcInfo` and replace `Timeline::update_gc_info` with a method that
simply finds gc cutoffs `Timeline::find_gc_cutoffs` to be combined as
`Timeline::gc_info` at the caller.
This change will be followed up with a change that finds the GC cutoff
values before taking the `Tenant::gc_cs` lock.
Cc: #7560
Some part of the code requires missing key error to be propagated to the
code path correctly (i.e., aux key range scan). Currently, it's an
anyhow error.
* remove `stuck_lsn` from the missing key error.
* as a result, when matching missing key, we do not distinguish the case
`stuck_lsn = false/true`.
* vectored get now use the unified missing key error.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Instead of showing the full path of layer traversal, we now only show
tenant (in tracing context)+timeline+filename.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
We had an incident where pageserver requests timed out because
pageserver couldn't fetch WAL from safekeepers. This incident was caused
by a bug in safekeeper logic for timeline activation, which prevented
pageserver from finding safekeepers.
This bug was since fixed, but there is still a chance of a similar bug
in the future due to overall complexity.
We add a new broker message to "signal interest" for timeline. This
signal will be sent by pageservers `wait_lsn`, and safekeepers will
receive this signal to start broadcasting broker messages. Then every
broker subscriber will be able to find the safekeepers and connect to
them (to start fetching WAL).
This feature is not limited to pageservers and any service that wants to
download WAL from safekeepers will be able to use this discovery
request.
This commit changes pageserver's connection_manager (walreceiver) to
send a SafekeeperDiscoveryRequest when there is no information about
safekeepers present in memory. Current implementation will send these
requests only if there is an active wait_lsn() call and no more often
than once per 10 seconds.
Add `test_broker_discovery` to test this: safekeepers started with
`--disable-periodic-broker-push` will not push info to broker so that
pageserver must use a discovery to start fetching WAL.
Add task_stats in safekeepers broker module to log a warning if there is
no message received from the broker for the last 10 seconds.
Closes#5471
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
Makes two of the tests work with the tiered compaction that I had to
ignore in #7283.
The issue was that tiered compaction actually created image layers, but
the keys didn't appear in them as `collect_keyspace` didn't include
them. Not a compaction problem, but due to how the test is structured.
Fixes#7287
## Problem
`init_tenant_mgr` blocks the rest of pageserver startup, including
starting the admin API.
This was noticeable in #7475 , where the init_tenant_mgr runtime could
be long enough to trip the controller's 30 second heartbeat timeout.
## Summary of changes
- When detaching tenants during startup, spawn the background deletes as
background tasks instead of doing them inline
- Write all configs before spawning any tenants, so that the config
writes aren't fighting tenants for system resources
- Write configs with some concurrency (16) rather than writing them all
sequentially.
extracted (and tested) from
https://github.com/neondatabase/neon/pull/7468, part of
https://github.com/neondatabase/neon/issues/7462.
The current codebase assumes the keyspace is dense -- which means that
if we have a keyspace of 0x00-0x100, we assume every key (e.g., 0x00,
0x01, 0x02, ...) exists in the storage engine. However, the assumption
does not hold any more in metadata keyspace. The metadata keyspace is
sparse. It is impossible to do per-key check.
Ideally, we should not have the assumption of dense keyspace at all, but
this would incur a lot of refactors. Therefore, we split the keyspaces
we have to dense/sparse and handle them differently in the code for now.
At some point in the future, we should assume all keyspaces are sparse.
## Summary of changes
* Split collect_keyspace to return dense+sparse keyspace.
* Do not allow generating image layers for sparse keyspace (for now --
will fix this next week, we need image layers anyways).
* Generate delta layers for sparse keyspace.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Followup to https://github.com/neondatabase/neon/pull/6776
While #6776 makes compaction safe on sharded tenants, the logic for
keyspace partitioning remains inefficient: it assumes that the size of
data on a pageserver can be calculated simply as the range between start
and end of a Range -- this is not the case in sharded tenants, where
data within a range belongs to a variety of shards.
Closes: https://github.com/neondatabase/neon/issues/6774
## Summary of changes
I experimented with using a sharding-aware range type in KeySpace to
replace all the Range<Key> uses, but the impact on other code was quite
large (many places use the ranges), and not all of them need this
property of being able to approximate the physical size of data within a
key range.
So I compromised on expressing this as a ShardedRange type, but only
using that type selctively: during keyspace repartition, and in tiered
compaction when accumulating key ranges.
- keyspace partitioning methods take sharding parameters as an input
- new `ShardedRange` type wraps a Range<Key> and a shard identity
- ShardedRange::page_count is the shard-aware replacement for
key_range_size
- Callers that don't need to be shard-aware (e.g. vectored get code that
just wants to count the number of keys in a keyspace) can use
ShardedRange::raw_size to get the faster, shard-naive code (same as old
`key_range_size`)
- Compaction code is updated to carry a shard identity so that it can
use shard aware calculations
- Unit tests for the new fragmentation logic.
- Add a test for compaction on sharded tenants, that validates that we
generate appropriately sized image layers (this fails before fixing
keyspace partitioning)
previously in https://github.com/neondatabase/neon/pull/7375, we
observed that for in-memory layers, we will need to iterate every key in
the key space in order to get the result. The operation can be more
efficient if we use BTreeMap as the in-memory layer representation, even
if we are doing vectored get in a dense keyspace. Imagine a case that
the in-memory layer covers a very little part of the keyspace, and most
of the keys need to be found in lower layers. Using a BTreeMap can
significantly reduce probes for nonexistent keys.
## Summary of changes
* Use BTreeMap as in-memory layer representation.
* Optimize the vectored get flow to utilize the range scan functionality
of BTreeMap.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Sequential get runs after vectored get, so it is possible for the later
to time out while waiting for its ancestor's Lsn to become ready and for
the former to succeed (it essentially has a doubled wait time).
## Summary of Changes
Relax the validation to allow for such rare cases.
## Problem
Sometimes we have test data in the form of S3 contents that we would
like to run live in a neon_local environment.
## Summary of changes
- Add a storage controller API that imports an existing tenant.
Currently this is equivalent to doing a create with a high generation
number, but in future this would be something smarter to probe S3 to
find the shards in a tenant and find generation numbers.
- Add a `neon_local` command that invokes the import API, and then
inspects timelines in the newly attached tenant to create matching
branches.
extracted from https://github.com/neondatabase/neon/pull/7468, part of
https://github.com/neondatabase/neon/issues/7462.
In the page server, we use i128 (instead of u128) to do the integer
representation of the key, which indicates that the highest bit of the
key should not be 1. This constraints our keyspace to <= 0x7F.
Also fix the bug of `to_i128` that dropped the highest 4b. Now we keep
3b of them, dropping the sign bit.
And on that, we shrink the metadata keyspace to 0x60-0x7F for now, and
once we add support for u128, we can have a larger metadata keyspace.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Changing metadata format is not easy. This pull request adds a
tenant-level flag on whether to enable aux file v2. As long as we don't
roll this out to the user and guarantee our staging projects can persist
tenant config correctly, we can test the aux file v2 change with setting
this flag. Previous discussion at
https://github.com/neondatabase/neon/pull/7424.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Implements an approach different from the one #7488 chose: We now return
`past` instead of `present` (or`future`) when encountering the edge case
where commit_lsn < min_lsn. In my opinion, both `past` and `present` are
correct responses, but past is slightly better as the lsn returned by
`present` with #7488 is one too "new". In practice, this shouldn't
matter much, but shrug.
We agreed in slack that this is the better approach:
https://neondb.slack.com/archives/C03F5SM1N02/p1713871064147029
## Problem
PR #7230 attempted to introduce a WAL ingest threshold for checking
whether enough deltas are stacked to warrant creating a new image layer.
However, this check was incorrectly performed at the compaction
partition level instead of the timeline level. Hence, it inhibited GC
for any keys outside of the first partition.
## Summary of Changes
Hoist the check up to the timeline level.
part of https://github.com/neondatabase/neon/issues/7124
# Problem
(Re-stating the problem from #7124 for posterity)
The `test_bulk_ingest` benchmark shows about 2x lower throughput with
`tokio-epoll-uring` compared to `std-fs`.
That's why we temporarily disabled it in #7238.
The reason for this regression is that the benchmark runs on a system
without memory pressure and thus std-fs writes don't block on disk IO
but only copy the data into the kernel page cache.
`tokio-epoll-uring` cannot beat that at this time, and possibly never.
(However, under memory pressure, std-fs would stall the executor thread
on kernel page cache writeback disk IO. That's why we want to use
`tokio-epoll-uring`. And we likely want to use O_DIRECT in the future,
at which point std-fs becomes an absolute show-stopper.)
More elaborate analysis:
https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4
# Changes
This PR increases the buffer size of `blob_io` and `EphemeralFile` from
PAGE_SZ=8k to 64k.
Longer-term, we probably want to do double-buffering / pipelined IO.
# Resource Usage
We currently do not flush the buffer when freezing the InMemoryLayer.
That means a single Timeline can have multiple 64k buffers alive, esp if
flushing is slow.
This poses an OOM risk.
We should either bound the number of frozen layers
(https://github.com/neondatabase/neon/issues/7317).
Or we should change the freezing code to flush the buffer and drop the
allocation.
However, that's future work.
# Performance
(Measurements done on i3en.3xlarge.)
The `test_bulk_insert.py` is too noisy, even with instance storage. It
varies by 30-40%. I suspect that's due to compaction. Raising amount of
data by 10x doesn't help with the noisiness.)
So, I used the `bench_ingest` from @jcsp 's #7409 .
Specifically, the `ingest-small-values/ingest 128MB/100b seq` and
`ingest-small-values/ingest 128MB/100b seq, no delta` benchmarks.
| | | seq | seq, no delta |
|-----|-------------------|-----|---------------|
| 8k | std-fs | 55 | 165 |
| 8k | tokio-epoll-uring | 37 | 107 |
| 64k | std-fs | 55 | 180 |
| 64k | tokio-epoll-uring | 48 | 164 |
The `8k` is from before this PR, the `64k` is with this PR.
The values are the throughput reported by the benchmark (MiB/s).
We see that this PR gets `tokio-epoll-uring` from 67% to 87% of `std-fs`
performance in the `seq` benchmark. Notably, `seq` appears to hit some
other bottleneck at `55 MiB/s`. CC'ing #7418 due to the apparent
bottlenecks in writing delta layers.
For `seq, no delta`, this PR gets `tokio-epoll-uring` from 64% to 91% of
`std-fs` performance.
part of https://github.com/neondatabase/neon/issues/7124
Changes
-------
This PR replaces the `EphemeralFile::write_blob`-specifc `struct Writer`
with re-use of `owned_buffers_io::write::BufferedWriter`.
Further, it restructures the code to cleanly separate
* the high-level aspect of EphemeralFile's write_blob / read_blk API
* the page-caching aspect
* the aspect of IO
* performing buffered write IO to an underlying VirtualFile
* serving reads from either the VirtualFile or the buffer if it hasn't
been flushed yet
* the annoying "feature" that reads past the end of the written range
are allowed and expected to return zeroed memory, as long as one remains
within one PAGE_SZ