Commit Graph

2371 Commits

Author SHA1 Message Date
Alex Chi Z.
6eb638f4b3 feat(pageserver): warn on aux v1 tenants + default to v2 (#8625)
part of https://github.com/neondatabase/neon/issues/8623

We want to discover potential aux v1 customers that we might have missed
from the migrations.

## Summary of changes

Log warnings on basebackup, load timeline, and the first put_file.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-22 22:31:38 +01:00
John Spray
7c74112b2a pageserver: batch InMemoryLayer puts, remove need to sort items by LSN during ingest (#8591)
## Problem/Solution

TimelineWriter::put_batch is simply a loop over individual puts. Each
put acquires and releases locks, and checks for potentially starting a
new layer. Batching these is more efficient, but more importantly
unlocks future changes where we can pre-build serialized buffers much
earlier in the ingest process, potentially even on the safekeeper
(imagine a future model where some variant of DatadirModification lives
on the safekeeper).

Ensuring that the values in put_batch are written to one layer also
enables a simplification upstream, where we no longer need to write
values in LSN-order. This saves us a sort, but also simplifies follow-on
refactors to DatadirModification: we can store metadata keys and data
keys separately at that level without needing to zip them together in
LSN order later.

## Why?

In this PR, these changes are simplify optimizations, but they are
motivated by evolving the ingest path in the direction of disentangling
extracting DatadirModification from Timeline. It may not obvious how
right now, but the general idea is that we'll end up with three phases
of ingest:
- A) Decode walrecords and build a datadirmodification with all the
simple data contents already in a big serialized buffer ready to write
to an ephemeral layer **<-- this part can be pipelined and parallelized,
and done on a safekeeper!**
- B) Let that datadirmodification see a Timeline, so that it can also
generate all the metadata updates that require a read-modify-write of
existing pages
- C) Dump the results of B into an ephemeral layer.

Related: https://github.com/neondatabase/neon/issues/8452

## Caveats

Doing a big monolithic buffer of values to write to disk is ordinarily
an anti-pattern: we prefer nice streaming I/O. However:
- In future, when we do this first decode stage on the safekeeper, it
would be inefficient to serialize a Vec of Value, and then later
deserialize it just to add blob size headers while writing into the
ephemeral layer format. The idea is that for bulk write data, we will
serialize exactly once.
- The monolithic buffer is a stepping stone to pipelining more of this:
by seriailizing earlier (rather than at the final put_value), we will be
able to parallelize the wal decoding and bulk serialization of data page
writes.
- The ephemeral layer's buffered writer already stalls writes while it
waits to flush: so while yes we'll stall for a couple milliseconds to
write a couple megabytes, we already have stalls like this, just
distributed across smaller writes.

## Benchmarks

This PR is primarily a stepping stone to safekeeper ingest filtering,
but also provides a modest efficiency improvement to the `wal_recovery`
part of `test_bulk_ingest`.

test_bulk_ingest:

```
test_bulk_insert[neon-release-pg16].insert: 23.659 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 626 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 18.981 s
test_bulk_insert[neon-release-pg16].compaction: 0.055 s

vs. tip of main:
test_bulk_insert[neon-release-pg16].insert: 24.001 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 604 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 23.586 s
test_bulk_insert[neon-release-pg16].compaction: 0.054 s
```
2024-08-22 10:04:42 +00:00
Alex Chi Z.
a968554a8c fix(pageserver): unify initdb optimization for sparse keyspaces; fix force img generation (#8776)
close https://github.com/neondatabase/neon/issues/8558

* Directly generate image layers for sparse keyspaces during initdb
optimization.
* Support force image layer generation for sparse keyspaces.
* Fix a bug of incorrect image layer key range in case of duplicated
keys. (The added line: `start = img_range.end;`) This can cause
overlapping image layers and keys to disappear.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-21 21:25:21 +01:00
Joonas Koivunen
07b7c63975 test: avoid some too long shutdowns by flushing before shutdown (#8772)
After #8655, we needed to mark some tests to shut down immediately. To
aid these tests, try the new pattern of `flush_ep_to_pageserver`
followed by a non-compacting checkpoint. This moves the general graceful
shutdown problem of having too much to flush at shutdown into the test.
Also, add logging for how long the graceful shutdown took, if we got to
complete it for faster log eyeballing.

Fixes: #8712
Cc: #8715, #8708
2024-08-21 14:26:27 -04:00
Christian Schwarz
21b684718e pageserver: add counter for wait time on background loop semaphore (#8769)
## Problem

Compaction jobs and other background loops are concurrency-limited
through a global semaphore.

The current counters allow quantifying how _many_ tasks are waiting.
But there is no way to tell how _much_ delay is added by the semaphore.

So, add a counter that aggregates the wall clock time seconds spent
acquiring the semaphore.

The metrics can be used as follows:

* retroactively calculate average acquisition time in a given time range
* compare the degree of background loop backlog among pageservers

The metric is insufficient to calculate

* run-up of ongoing acquisitions that haven't finished acquiring yet
* Not easily feasible because ["Cancelling a call to acquire makes you
lose your place in the
queue"](https://docs.rs/tokio/latest/tokio/sync/struct.Semaphore.html#method.acquire)

## Summary of changes

* Refactor the metrics to follow the current best practice for typed
metrics in `metrics.rs`.
* Add the new counter.
2024-08-21 10:55:01 +00:00
Alex Chi Z.
c8b9116a97 impr(pageserver): abort on fatal I/O writer error (#8777)
part of https://github.com/neondatabase/neon/issues/8140

The blob writer path now uses `maybe_fatal_err`

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-20 20:05:33 +01:00
John Spray
beefc7a810 pageserver: add metric pageserver_secondary_heatmap_total_size (#8768)
## Problem

We don't have a convenient way for a human to ask "how far are secondary
downloads along for this tenant".

This is useful when driving migrations of tenants to the storage
controller, as we first create a secondary location and want to see it
warm up before we cut over. That can already be done via storcon_cli,
but we would like a way that doesn't require direct API access.

## Summary of changes

Add a metric that reports to total size of layers in the heatmap: this
may be used in conjunction with the existing
`pageserver_secondary_resident_physical_size` to estimate "warmth" of
the secondary location.
2024-08-20 19:47:42 +01:00
Alexander Bayandin
c96593b473 Make Postgres 16 default version (#8745)
## Problem

The default Postgres version is set to 15 in code, while we use 16 in
most of the other places (and Postgres 17 is coming)

## Summary of changes
- Run `benchmarks` job with Postgres 16 (instead of Postgres 14)
- Set `DEFAULT_PG_VERSION` to 16 in all places
- Remove deprecated `--pg-version` pytest argument
- Update `test_metadata_bincode_serde_ensure_roundtrip` for Postgres 16
2024-08-20 10:46:58 +01:00
Christian Schwarz
ef57e73fbf task_mgr::spawn: require a TenantId (#8462)
… to dis-incentivize global tasks via task_mgr in the future

(As of https://github.com/neondatabase/neon/pull/8339 all remaining
task_mgr usage is tenant or timeline scoped.)
2024-08-20 08:26:44 +00:00
Christian Schwarz
eb7241c798 l0_flush: remove support for mode page-cached (#8739)
It's been rolled out everywhere, no configs are referencing it.

All code that's made dead by the removal of the config option is removed
as part of this PR.

The `page_caching::PreWarmingWriter` in `::No` mode is equivalent to a
`size_tracking_writer`, so, use that.

part of https://github.com/neondatabase/neon/issues/7418
2024-08-19 16:35:34 +02:00
Arpad Müller
188bde7f07 Default image compression to zstd at level 1 (#8677)
After the rollout has succeeded, we now set the default image
compression to be enabled.

We also remove its explicit mention from `neon_fixtures.py` added in
#8368 as it is now the default (and we switch to `zstd(1)` which is a
bit nicer on CPU time).

Part of https://github.com/neondatabase/neon/issues/5431
2024-08-18 18:32:10 +01:00
John Spray
e2d89f7991 pageserver: prioritize secondary downloads to get most recent layers first, except l0s (#8729)
## Problem

When a secondary location is trying to catch up while a tenant is
receiving new writes, it can become quite wasteful:
- Downloading L0s which are soon destroyed by compaction to L1s
- Downloading older layer files which are soon made irrelevant when
covered by image layers.

## Summary of changes

Sort the layer files in the heatmap:
- L0 layers are the lowest priority
- Other layers are sorted to download the highest LSNs first.
2024-08-16 14:35:02 +02:00
Joonas Koivunen
4763a960d1 chore: log if we have an open layer or any frozen on shutdown (#8740)
Some benchmarks are failing with a "long" flushing, which might be
because there is a queue of in-memory layers, or something else. Add
logging to narrow it down.

Private slack DM ref:
https://neondb.slack.com/archives/D049K7HJ9JM/p1723727305238099
2024-08-16 06:10:05 +01:00
Alexander Bayandin
a9c28be7d0 fix(pageserver): allow unused_imports in download.rs on macOS (#8733)
## Problem

On macOS, clippy fails with the following error:

```
error: unused import: `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt`
  --> pageserver/src/tenant/remote_timeline_client/download.rs:26:5
   |
26 | use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt;
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `-D unused-imports` implied by `-D warnings`
   = help: to override `-D warnings` add `#[allow(unused_imports)]`
```

Introduced in https://github.com/neondatabase/neon/pull/8717

## Summary of changes
- allow `unused_imports` for
`crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` on macOS
in download.rs
2024-08-15 10:06:28 +01:00
Christian Schwarz
168913bdf0 refactor(write path): newtype to enforce use of fully initialized slices (#8717)
The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird.
The new `FullSlice` newtype is better. See the doc comment for details.

The naming is not ideal, but we'll clean that up in a future refactoring
where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do
the following:
* tokio_epoll_uring::Slice is removed
* `FullSlice` becomes `tokio_epoll_uring::IoBufView`
* new type `tokio_epoll_uring::IoBufMutView` for the current
`tokio_epoll_uring::Slice<IoBufMut>`

Context
-------

I did this work in preparation for
https://github.com/neondatabase/neon/pull/8537.
There, I'm changing the type that the `inmemory_layer.rs` passes to
`DeltaLayerWriter::put_value_bytes` and thus it seemed like a good
opportunity to make this cleanup first.
2024-08-14 21:57:17 +02:00
Joonas Koivunen
60fc1e8cc8 chore: even more responsive compaction cancellation (#8725)
Some benchmarks and tests might still fail because of #8655 (tracked in
#8708) because we are not fast enough to shut down ([one evidence]).
Partially this is explained by the current validation mode of streaming
k-merge, but otherwise because that is where we use a lot of time in
compaction. Outside of L0 => L1 compaction, the image layer generation
is already guarded by vectored reads doing cancellation checks.

32768 is a wild guess based on looking how many keys we put in each
layer in a bench (1-2 million), but I assume it will be good enough
divisor. Doing checks more often will start showing up as contention
which we cannot currently measure. Doing checks less often might be
reasonable.

[one evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/

Earlier PR: #8706.
2024-08-14 14:48:15 +01:00
Joonas Koivunen
6c9e3c9551 refactor: error/anyhow::Error wrapping (#8697)
We can get CompactionError::Other(Cancelled) via the error handling with
a few ways.
[evidence](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8655/10301613380/index.html#suites/cae012a1e6acdd9fdd8b81541972b6ce/653a33de17802bb1/).
Hopefully fix it by:

1. replace the `map_err` which hid the
`GetReadyAncestorError::Cancelled` with `From<GetReadyAncestorError> for
GetVectoredError` conversion
2. simplifying the code in pgdatadir_mapping to eliminate the token
anyhow wrapping for deserialization errors
3. stop wrapping GetVectoredError as anyhow errors
4. stop wrapping PageReconstructError as anyhow errors

Additionally, produce warnings if we treat any other error (as was legal
before this PR) as missing key.

Cc: #8708.
2024-08-14 12:45:56 +01:00
John Spray
19d69d515c pageserver: evict covered layers earlier (#8679)
## Problem

When pageservers do compaction, they frequently create image layers that
make earlier layers un-needed for reads, but then keep those earlier
layers around for 24 hours waiting for time-based eviction to expire
them.

Now that we track layer visibility, we can use it as an input to
eviction, and avoid the 24 hour "disk bump" that happens around
pageserver restarts.

## Summary of changes

- During time-based eviction, if a layer is marked Covered, use the
eviction period as the threshold: i.e. these layers get to remain
resident for at least one iteration of the eviction loop, but then get
evicted. With current settings this means they get evicted after 1h
instead of 24h.
- During disk usage eviction, prioritized evicting covered layers above
all other layers.


Caveats:
- Using the period as the threshold for time based eviction in this case
is a bit of a hack, but it avoids adding yet another configuration
property, and in any case the value of a new property would be somewhat
arbitrary: there's no "right" length of time to keep covered layers
around just in case.
- We had previously planned on removing time-based eviction: this change
would motivate us to keep it around, but we can still simplify the code
later to just do the eviction of covered layers, rather than applying a
TTL policy to all layers.
2024-08-14 12:10:15 +01:00
Joonas Koivunen
485d76ac62 timeline_detach_ancestor: adjust error handling (#8528)
With additional phases from #8430 the `detach_ancestor::Error` became
untenable. Split it up into phases, and introduce laundering for
remaining `anyhow::Error` to propagate them as most often
`Error::ShuttingDown`.

Additionally, complete FIXMEs.

Cc: #6994
2024-08-14 10:16:18 +01:00
John Spray
4049d2b7e1 scrubber: fix spurious "Missed some shards" errors (#8661)
## Problem

The storage scrubber was reporting warnings for lots of timelines like:
```
WARN Missed some shards at count ShardCount(0) tenant_id=25eb7a83d9a2f90ac0b765b6ca84cf4c
```

These were spurious: these tenants are fine. There was a bug in
accumulating the ShardIndex for each tenant, whereby multiple timelines
would lead us to add the same ShardIndex more than one.

Closes: #8646 

## Summary of changes

- Accumulate ShardIndex in a BTreeSet instead of a Vec
- Extend the test to reproduce the issue
2024-08-14 09:29:06 +01:00
Joonas Koivunen
6d6e2c6a39 feat(detach_ancestor): better retries with persistent gc blocking (#8430)
With the persistent gc blocking, we can now retry reparenting timelines
which had failed for whatever reason on the previous attempt(s).
Restructure the detach_ancestor into three phases:

- prepare (insert persistent gc blocking, copy lsn prefix, layers)
- detach and reparent
    - reparenting can fail, so we might need to retry this portion
- complete (remove persistent gc blocking)

Cc: #6994
2024-08-13 18:51:51 +01:00
Joonas Koivunen
8f170c5105 fix: make compaction more sensitive to cancellation (#8706)
A few of the benchmarks have started failing after #8655 where they are
waiting for compactor task. Reads done by image layer creation should
already be cancellation sensitive because vectored get does a check each
time, but try sprinkling additional cancellation points to:

- each partition
- after each vectored read batch
2024-08-13 18:00:54 +01:00
John Spray
ecb01834d6 pageserver: implement utilization score (#8703)
## Problem

When the utilization API was added, it was just a stub with disk space
information.

Disk space information isn't a very good metric for assigning tenants to
pageservers, because pageservers making full use of their disks would
always just have 85% utilization, irrespective of how much pressure they
had for disk space.

## Summary of changes

- Use the new layer visibiilty metric to calculate a "wanted size" per
tenant, and sum these to get a total local disk space wanted per
pageserver. This acts as the primary signal for utilization.
- Also use the shard count to calculate a utilization score, and take
the max of this and the disk-driven utilization. The shard count limit
is currently set as a constant 20,000, which matches contemporary
operational practices when loading pageservers.

The shard count limit means that for tiny/empty tenants, on a machine
with 3.84TB disk, each tiny tenant influences the utilization score as
if it had size 160MB.
2024-08-13 15:15:55 +01:00
Vlad Lazar
b9d2c7bdd5 pageserver: remove vectored get related configs (#8695)
## Problem
Pageserver exposes some vectored get related configs which are not in
use.

## Summary of changes
Remove the following pageserver configs: get_impl, get_vectored_impl,
and `validate_get_vectored`.
They are not used in the pageserver since
https://github.com/neondatabase/neon/pull/8601.
Manual overrides have been removed from the aws repo in
https://github.com/neondatabase/aws/pull/1664.
2024-08-13 12:45:54 +01:00
John Spray
3379cbcaa4 pageserver: add CompactKey, use it in InMemoryLayer (#8652)
## Problem

This follows a PR that insists all input keys are representable in 16
bytes:
- https://github.com/neondatabase/neon/pull/8648

& a PR that prevents postgres from sending us keys that use the high
bits of field2:
- https://github.com/neondatabase/neon/pull/8657

Motivation for this change:
1. Ingest is bottlenecked on CPU
2. InMemoryLayer can create huge (~1M value) BTreeMap<Key,_> for its
index.
3. Maps over i128 are much faster than maps over an arbitrary 18 byte
struct.

It may still be worthwhile to make the index two-tier to optimize for
the case where only the last 4 bytes (blkno) of the key vary frequently,
but simply using the i128 representation of keys has a big impact for
very little effort.

Related: #8452 

## Summary of changes

- Introduce `CompactKey` type which contains an i128
- Use this instead of Key in InMemoryLayer's index, converting back and
forth as needed.

## Performance

All the small-value `bench_ingest` cases show improved throughput.

The one that exercises this index most directly shows a 35% throughput
increase:

```
ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [374.29 ms 378.56 ms 383.38 ms]
                        thrpt:  [333.88 MiB/s 338.13 MiB/s 341.98 MiB/s]
                 change:
                        time:   [-26.993% -26.117% -25.111%] (p = 0.00 < 0.05)
                        thrpt:  [+33.531% +35.349% +36.974%]
                        Performance has improved.
```
2024-08-13 11:48:23 +01:00
Christian Schwarz
ce0d0a204c fix(walredo): shutdown can complete too early (#8701)
Problem
-------

The following race is possible today:

```
walredo_extraordinary_shutdown_thread: shutdown gets until Poll::Pending of self.launched_processes.close().await call

other thread: drops the last Arc<Process>
  = 1. drop(_launched_processes_guard) runs, this ...

walredo_extraordinary_shutdown_thread: ... wakes self.launched_processes.close().await

walredo_extraordinary_shutdown_thread: logs `done`

other thread:
  = 2. drop(process): this kill & waits
```

Solution
--------

Change drop order so that `process` gets dropped first.

Context
-------


https://neondb.slack.com/archives/C06Q661FA4C/p1723478188785719?thread_ts=1723456706.465789&cid=C06Q661FA4C

refs https://github.com/neondatabase/neon/pull/8572
refs https://github.com/neondatabase/cloud/issues/11387
2024-08-12 18:15:48 +01:00
Joonas Koivunen
9dc9a9b2e9 test: do graceful shutdown by default (#8655)
It should give us all possible allowed_errors more consistently.

While getting the workflows to pass on
https://github.com/neondatabase/neon/pull/8632 it was noticed that
allowed_errors are rarely hit (1/4). This made me realize that we always
do an immediate stop by default. Doing a graceful shutdown would had
made the draining more apparent and likely we would not have needed the
#8632 hotfix.

Downside of doing this is that we will see more timeouts if tests are
randomly leaving pause failpoints which fail the shutdown.

The net outcome should however be positive, we could even detect too
slow shutdowns caused by a bug or deadlock.
2024-08-12 15:37:15 +03:00
Vlad Lazar
f5cef7bf7f storcon: skip draining shard if it's secondary is lagging too much (#8644)
## Problem
Migrations of tenant shards with cold secondaries are holding up drains
in during production deployments.

## Summary of changes
If a secondary locations is lagging by more than 256MiB (configurable,
but that's the default), then skip cutting it over to the secondary as part of the node drain.
2024-08-09 15:45:07 +01:00
John Spray
e6770d79fd pageserver: don't treat NotInitialized::Stopped as unexpected (#8675)
## Problem

This type of error can happen during shutdown & was triggering a circuit
breaker alert.

## Summary of changes

- Map NotIntialized::Stopped to CompactionError::ShuttingDown, so that
we may handle it cleanly
2024-08-09 14:01:56 +01:00
John Spray
953b7d4f7e pageserver: remove paranoia double-calculation of retain_lsns (#8617)
## Problem

This code was to mitigate risk in
https://github.com/neondatabase/neon/pull/8427

As expected, we did not hit this code path - the new continuous updates
of gc_info are working fine, we can remove this code now.

## Summary of changes

- Remove block that double-checks retain_lsns
2024-08-08 12:57:48 +01:00
Joonas Koivunen
8561b2c628 fix: stop leaking BackgroundPurges (#8650)
avoid "leaking" the completions of BackgroundPurges by:

1. switching it to TaskTracker for provided close+wait
2. stop using tokio::fs::remove_dir_all which will consume two units of
memory instead of one blocking task

Additionally, use more graceful shutdown in tests which do actually some
background cleanup.
2024-08-08 12:02:53 +01:00
Joonas Koivunen
21638ee96c fix(test): do not fail test for filesystem race (#8643)
evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8632/10287641784/index.html#suites/0e58fb04d9998963e98e45fe1880af7d/c7a46335515142b/
2024-08-08 10:34:47 +01:00
John Spray
cf3eac785b pageserver: make bench_ingest build (but panic) on macOS (#8641)
## Problem

Some developers build on MacOS, which doesn't have  io_uring.

## Summary of changes

- Add `io_engine_for_bench`, which on linux will give io_uring or panic
if it's unavailable, and on MacOS will always panic.

We do not want to run such benchmarks with StdFs: the results aren't
interesting, and will actively waste the time of any developers who
start investigating performance before they realize they're using a
known-slow I/O backend.

Why not just conditionally compile this benchmark on linux only? Because
even on linux, I still want it to refuse to run if it can't get
io_uring.
2024-08-07 21:17:08 +01:00
Yuchen Liang
542385e364 feat(pageserver): add direct io pageserver config (#8622)
Part of #8130, [RFC: Direct IO For Pageserver](https://github.com/neondatabase/neon/blob/problame/direct-io-rfc/docs/rfcs/034-direct-io-for-pageserver.md)

## Description

Add pageserver config for evaluating/enabling direct I/O. 

- Disabled: current default, uses buffered io as is.
- Evaluate: still uses buffered io, but could do alignment checking and
perf simulation (pad latency by direct io RW to a fake file).
- Enabled: uses direct io, behavior on alignment error is configurable.


Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-08-07 21:04:19 +01:00
Joonas Koivunen
05dd1ae9e0 fix: drain completed page_service connections (#8632)
We've noticed increased memory usage with the latest release. Drain the
joinset of `page_service` connection handlers to avoid leaking them
until shutdown. An alternative would be to use a TaskTracker.
TaskTracker was not discussed in original PR #8339 review, so not hot
fixing it in here either.
2024-08-07 17:14:45 +00:00
Joonas Koivunen
a81fab4826 refactor(timeline_detach_ancestor): replace ordered reparented with a hashset (#8629)
Earlier I was thinking we'd need a (ancestor_lsn, timeline_id) ordered
list of reparented. Turns out we did not need it at all. Replace it with
an unordered hashset. Additionally refactor the reparented direct
children query out, it will later be used from more places.

Split off from #8430.

Cc: #6994
2024-08-07 18:19:00 +02:00
Joonas Koivunen
fc78774f39 fix: EphemeralFiles can outlive their Timeline via enum LayerManager (#8229)
Ephemeral files cleanup on drop but did not delay shutdown, leading to
problems with restarting the tenant. The solution is as proposed:
- make ephemeral files carry the gate guard to delay `Timeline::gate`
closing
- flush in-memory layers and strong references to those on
`Timeline::shutdown`

The above are realized by making LayerManager an `enum` with `Open` and
`Closed` variants, and fail requests to modify `LayerMap`.

Additionally:

- fix too eager anyhow conversions in compaction
- unify how we freeze layers and handle errors
- optimize likely_resident_layers to read LayerFileManager hashmap
values instead of bouncing through LayerMap

Fixes: #7830
2024-08-07 17:50:09 +03:00
Arpad Müller
4d7c0dac93 Add missing colon to ArchivalConfigRequest specification (#8627)
Add a missing colon to the API specification of `ArchivalConfigRequest`.
The `state` field is required. Pointed out by Gleb.
2024-08-07 14:53:52 +02:00
Arpad Müller
00c981576a Lower level for timeline cancellations during gc (#8626)
Timeline cancellation running in parallel with gc yields error log lines
like:

```
Gc failed 1 times, retrying in 2s: TimelineCancelled
```

They are completely harmless though and normal to occur. Therefore, only
print those messages at an info level. Still print them at all so that
we know what is going on if we focus on a single timeline.
2024-08-07 09:29:52 +02:00
John Spray
ca5390a89d pageserver: add bench_ingest (#7409)
## Problem

We lack a rust bench for the inmemory layer and delta layer write paths:
it is useful to benchmark these components independent of postgres & WAL
decoding.

Related: https://github.com/neondatabase/neon/issues/8452

## Summary of changes

- Refactor DeltaLayerWriter to avoid carrying a Timeline, so that it can
be cleanly tested + benched without a Tenant/Timeline test harness. It
only needed the Timeline for building `Layer`, so this can be done in a
separate step.
- Add `bench_ingest`, which exercises a variety of workload "shapes"
(big values, small values, sequential keys, random keys)
- Include a small uncontroversial optimization: in `freeze`, only
exhaustively walk values to assert ordering relative to end_lsn in debug
mode.

These benches are limited by drive performance on a lot of machines, but
still useful as a local tool for iterating on CPU/memory improvements
around this code path.

Anecdotal measurements on Hetzner AX102 (Ryzen 7950xd):

```

ingest-small-values/ingest 128MB/100b seq
                        time:   [1.1160 s 1.1230 s 1.1289 s]
                        thrpt:  [113.38 MiB/s 113.98 MiB/s 114.70 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
Benchmarking ingest-small-values/ingest 128MB/100b rand: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 18.9s.
ingest-small-values/ingest 128MB/100b rand
                        time:   [1.9001 s 1.9056 s 1.9110 s]
                        thrpt:  [66.982 MiB/s 67.171 MiB/s 67.365 MiB/s]
Benchmarking ingest-small-values/ingest 128MB/100b rand-1024keys: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 11.0s.
ingest-small-values/ingest 128MB/100b rand-1024keys
                        time:   [1.0715 s 1.0828 s 1.0937 s]
                        thrpt:  [117.04 MiB/s 118.21 MiB/s 119.46 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [425.49 ms 429.07 ms 432.04 ms]
                        thrpt:  [296.27 MiB/s 298.32 MiB/s 300.83 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

ingest-big-values/ingest 128MB/8k seq
                        time:   [373.03 ms 375.84 ms 379.17 ms]
                        thrpt:  [337.58 MiB/s 340.57 MiB/s 343.13 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [81.534 ms 82.811 ms 83.364 ms]
                        thrpt:  [1.4994 GiB/s 1.5095 GiB/s 1.5331 GiB/s]
Found 1 outliers among 10 measurements (10.00%)


```
2024-08-06 16:39:40 +00:00
John Spray
3727c6fbbe pageserver: use layer visibility when composing heatmap (#8616)
## Problem

Sometimes, a layer is Covered by hasn't yet been evicted from local disk
(e.g. shortly after image layer generation). It is not good use of
resources to download these to a secondary location, as there's a good
chance they will never be read.

This follows the previous change that added layer visibility:
- #8511 

Part of epic:
- https://github.com/neondatabase/neon/issues/8398

## Summary of changes

- When generating heatmaps, only include Visible layers
- Update test_secondary_downloads to filter to visible layers when
listing layers from an attached location
2024-08-06 17:15:40 +01:00
John Spray
42229aacf6 pageserver: fixes for layer visibility metric (#8603)
## Problem

In staging, we could see that occasionally tenants were wrapping their
pageserver_visible_physical_size metric past zero to 2^64.

This is harmless right now, but will matter more later when we start
using visible size in things like the /utilization endpoint.

## Summary of changes

- Add debug asserts that detect this case. `test_gc_of_remote_layers`
works as a reproducer for this issue once the asserts are added.
- Tighten up the interface around access_stats so that only Layer can
mutate it.
- In Layer, wrap calls to `record_access` in code that will update the
visible size statistic if the access implicitly marks the layer visible
(this was what caused the bug)
- In LayerManager::rewrite_layers, use the proper set_visibility layer
function instead of directly using access_stats (this is an additional
path where metrics could go bad.)
- Removed unused instances of LayerAccessStats in DeltaLayer and
ImageLayer which I noticed while reviewing the code paths that call
record_access.
2024-08-06 14:47:01 +01:00
Joonas Koivunen
dc7eb5ae5a chore: bump index part version (#8611)
#8600 missed the hunk changing index_part.json informative version.
Include it in this PR, in addition add more non-warning index_part.json
versions to scrubber.
2024-08-06 11:45:41 +01:00
Vlad Lazar
44fedfd6c3 pageserver: remove legacy read path (#8601)
## Problem

We have been maintaining two read paths (legacy and vectored) for a
while now. The legacy read-path was only used for cross validation in some tests.

## Summary of changes
* Tweak all tests that were using the legacy read path to use the
vectored read path instead
* Remove the read path dispatching based on the pageserver configs
* Remove the legacy read path code

We will be able to remove the single blob io code in
`pageserver/src/tenant/blob_io.rs` when https://github.com/neondatabase/neon/issues/7386 is complete.

Closes https://github.com/neondatabase/neon/issues/8005
2024-08-06 10:14:01 +01:00
Joonas Koivunen
138f008bab feat: persistent gc blocking (#8600)
Currently, we do not have facilities to persistently block GC on a
tenant for whatever reason. We could do a tenant configuration update,
but that is risky for generation numbers and would also be transient.
Introduce a `gc_block` facility in the tenant, which manages per
timeline blocking reasons.

Additionally, add HTTP endpoints for enabling/disabling manual gc
blocking for a specific timeline. For debugging, individual tenant
status now includes a similar string representation logged when GC is
skipped.

Cc: #6994
2024-08-06 10:09:56 +01:00
Joonas Koivunen
6a6f30e378 fix: make Timeline::set_disk_consistent_lsn use fetch_max (#8311)
now it is safe to use from multiple callers, as we have two callers.
2024-08-06 08:52:01 +01:00
Alex Chi Z.
8f3bc5ae35 feat(pageserver): support dry-run for gc-compaction, add statistics (#8557)
Add dry-run mode that does not produce any image layer + delta layer. I
will use this code to do some experiments and see how much space we can
reclaim for tenants on staging. Part of
https://github.com/neondatabase/neon/issues/8002

* Add dry-run mode that runs the full compaction process without
updating the layer map. (We never call finish on the writers and the
files will be removed before exiting the function).
* Add compaction statistics and print them at the end of compaction.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-06 02:07:48 +00:00
Joonas Koivunen
c32807ac19 fix: allow awaiting logical size for root timelines (#8604)
Currently if `GET
/v1/tenant/x/timeline/y?force-await-initial-logical-size=true` is
requested for a root timeline created within the current pageserver
session, the request handler panics hitting the debug assertion. These
timelines will always have an accurate (at initdb import) calculated
logical size. Fix is to never attempt prioritizing timeline size
calculation if we already have an exact value.

Split off from #8528.
2024-08-05 21:21:33 +01:00
Alex Chi Z.
200fa56b04 feat(pageserver): support split delta layers (#8599)
part of https://github.com/neondatabase/neon/issues/8002

Similar to https://github.com/neondatabase/neon/pull/8574, we add
auto-split support for delta layers. Tests are reused from image layer
split writers.


---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-05 10:30:49 +00:00
Alex Chi Z.
1dc496a2c9 feat(pageserver): support auto split layers based on size (#8574)
part of https://github.com/neondatabase/neon/issues/8002

## Summary of changes

Add a `SplitImageWriter` that automatically splits image layer based on
estimated target image layer size. This does not consider compression
and we might need a better metrics.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-08-05 06:55:36 +01:00