(This PR is the successor of https://github.com/neondatabase/neon/pull/4984 )
## Summary
The current way in which `EphemeralFile` uses `PageCache` complicates
the Pageserver code base to a degree that isn't worth it.
This PR refactors how we cache `EphemeralFile` contents, by exploiting
the append-only nature of `EphemeralFile`.
The result is that `PageCache` only holds `ImmutableFilePage` and
`MaterializedPage`.
These types of pages are read-only and evictable without write-back.
This allows us to remove the writeback code from `PageCache`, also
eliminating an entire failure mode.
Further, many great open-source libraries exist to solve the problem of a
read-only cache,
much better than our `page_cache.rs` (e.g., better replacement policy,
less global locking).
With this PR, we can now explore using them.
## Problem & Analysis
Before this PR, `PageCache` had three types of pages:
* `ImmutableFilePage`: caches Delta / Image layer file contents
* `MaterializedPage`: caches results of Timeline::get (page
materialization)
* `EphemeralPage`: caches `EphemeralFile` contents
`EphemeralPage` is quite different from `ImmutableFilePage` and
`MaterializedPage`:
* Immutable and materialized pages are for the acceleration of (future)
reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of
DRAM.
* Ephemeral pages are a write-back cache of `EphemeralFile` contents,
i.e., if there is pressure in the page cache, we spill `EphemeralFile`
contents to disk.
`EphemeralFile` is only used by `InMemoryLayer`, for the following
purposes:
* **write**: when filling up the `InMemoryLayer`, via `impl BlobWriter
for EphemeralFile`
* **read**: when doing **page reconstruction** for a page@lsn that isn't
written to disk
* **read**: when writing L0 layer files, we re-read the `InMemoryLayer`
and put the contents into the L0 delta writer
(**`create_delta_layer`**). This happens every 10min or when
InMemoryLayer reaches 256MB in size.
The access patterns of the `InMemoryLayer` use case are as follows:
* **write**: via `BlobWriter`, strictly append-only
* **read for page reconstruction**: via `BlobReader`, random
* **read for `create_delta_layer`**: via `BlobReader`, dependent on
data, but generally random. Why?
* in classical LSM terms, this function is what writes the
memory-resident `C0` tree into the disk-resident `C1` tree
* in our system, though, the values of InMemoryLayer are stored in an
EphemeralFile, and hence they are not guaranteed to be memory-resident
* the function reads `Value`s in `Key, LSN` order, which is `!=` insert
order
What do these `EphemeralFile`-level access patterns mean for the page
cache?
* **write**:
* the common case is that `Value` is a WAL record, and if it isn't a
full-page-image WAL record, then it's smaller than `PAGE_SIZE`
* So, the `EphemeralPage` pages act as a buffer for these `< PAGE_SIZE` writes.
* If there's no page cache eviction between subsequent
`InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident,
so the page cache avoids doing a `write` system call.
* In practice, a busy page server will have page cache evictions because
we only configure 64MB of page cache size.
* **reads for page reconstruction**: read acceleration, just as for the
other page types.
* **reads for `create_delta_layer`**:
* The `Value` reads happen through a `BlockCursor`, which optimizes the
case of repeated reads from the same page.
* So, the best case is that subsequent values are located on the same
page; hence the `BlockCursor`'s buffer is maximally effective.
* The worst case is that each `Value` is on a different page; hence the
`BlockCursor`'s 1-page-sized buffer is ineffective.
* The best case translates into `256MB/PAGE_SIZE` page cache accesses,
one per page.
* The worst case translates into `#Values` page cache accesses, one per `Value`.
* Again, the page cache accesses must be assumed to be random because
the `Value`s aren't accessed in insertion order but in `Key, LSN` order.
## Summary of changes
Preliminaries for this PR were:
- #5003
- #5004
- #5005
- uncommitted microbenchmark in #5011
Based on the observations outlined above, this PR makes the following
changes:
* Rip out `EphemeralPage` from `page_cache.rs`
* Move the `block_io::FileId` to `page_cache::FileId`
* Add a `PAGE_SIZE`d buffer to the `EphemeralFile` struct.
It's called `mutable_tail` (see the sketch after this list).
* Change `write_blob` to use `mutable_tail` for the write buffering
instead of a page cache page.
* if `mutable_tail` is full, it writes it out to disk, zeroes it out,
and re-uses it.
* There is explicitly no double-buffering, so that memory allocation per
`EphemeralFile` instance is fixed.
* Change `read_blob` to return different `BlockLease` variants depending
on `blknum`
* for the `blknum` that corresponds to the `mutable_tail`, return a ref
to it
* Rust borrowing rules prevent `write_blob` calls while refs are
outstanding.
* for all non-tail blocks, return a page-cached `ImmutablePage`
* It is safe to page-cache these as `ImmutablePage` because `EphemeralFile`
is append-only.
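
For illustration, a minimal sketch of the append path and tail handling described above, with invented type and field names (the real `EphemeralFile` also tracks blob offsets and serves non-tail blocks through the page cache):

```rust
use std::fs::File;
use std::io::{self, Write};
use std::os::unix::fs::FileExt;

const PAGE_SZ: usize = 8192;

/// Illustrative stand-in for EphemeralFile: an append-only file with a single
/// page-sized write buffer ("mutable_tail"). Names are invented for the sketch.
struct AppendOnlyFile {
    file: File,
    /// Number of full pages already written to disk.
    written_blocks: u32,
    /// The page currently being filled; flushed only when full.
    mutable_tail: Box<[u8; PAGE_SZ]>,
    /// Bytes used inside `mutable_tail`.
    tail_len: usize,
}

impl AppendOnlyFile {
    /// Write path: fill the tail; when it is full, write it out, zero it, reuse it.
    /// There is deliberately no double buffering, so memory use per file is fixed.
    fn append(&mut self, mut buf: &[u8]) -> io::Result<()> {
        while !buf.is_empty() {
            let n = (PAGE_SZ - self.tail_len).min(buf.len());
            self.mutable_tail[self.tail_len..self.tail_len + n].copy_from_slice(&buf[..n]);
            self.tail_len += n;
            buf = &buf[n..];
            if self.tail_len == PAGE_SZ {
                // The real code would also pre-warm the page cache with this page here.
                self.file.write_all(&self.mutable_tail[..])?;
                self.written_blocks += 1;
                self.mutable_tail.fill(0);
                self.tail_len = 0;
            }
        }
        Ok(())
    }

    /// Read path: the tail block is served from memory; earlier blocks are
    /// immutable and read from disk (via the page cache in the real code).
    fn read_block(&self, blknum: u32, out: &mut [u8; PAGE_SZ]) -> io::Result<()> {
        if blknum == self.written_blocks {
            out.copy_from_slice(&self.mutable_tail[..]);
            Ok(())
        } else {
            self.file.read_exact_at(out, u64::from(blknum) * PAGE_SZ as u64)
        }
    }
}
```

The tail block is the only mutable state; everything before it is immutable on disk, which is what makes it safe to cache as `ImmutablePage`.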
## Performance
How do the changes above affect performance?
My claim is: not significantly.
* **write path**:
* before this PR, the `EphemeralFile::write_blob` didn't issue its own
`write` system calls.
* If there were enough free pages, it didn't issue *any* `write` system
calls.
* If it had to evict other `EphemeralPage`s to get a page for its
writes (`get_buf_for_write`), the page cache code would implicitly issue
the writeback of victim pages as needed.
* With this PR, `EphemeralFile::write_blob` *always* issues *all* of its
*own* `write` system calls.
* Also, the writes are explicit instead of implicit through page cache
write back, which will help #4743
* The perf impact of always doing the writes is the CPU overhead and
syscall latency.
* Before this PR, we might have never issued them if there were enough
free pages.
* We don't issue `fsync` and can expect the writes to only hit the
kernel page cache.
* There is also an advantage in issuing the writes directly: the perf
impact is paid by the tenant that caused the writes, instead of whatever
tenant evicts the `EphemeralPage`.
* **reads for page reconstruction**: no impact.
* The `write_blob` function pre-warms the page cache when it writes the
`mutable_tail` to disk.
* So, the behavior is the same as with the EphemeralPages before this
PR.
* **reads for `create_delta_layer`**: no impact.
* Same argument as for page reconstruction.
* Note for the future:
* going through the page cache likely causes read amplification here.
Why?
* Due to the `Key, Lsn`-ordered access pattern, we don't read all the
values in a page before moving to the next page. In the worst case, we
might read the same page multiple times to read different `Value`s from
it.
* So, it might be better to bypass the page cache here.
* Idea drafts:
* bypass PS page cache + prefetch pipeline + iovec-based IO
* bypass PS page cache + use `copy_file_range` to copy from ephemeral
file into the L0 delta file, without going through user space
## Problem
The performance benchmark in `test_runner/performance/test_layer_map.py`
is currently failing due to the warning added in #4888.
## Summary of changes
The test mentioned has a `compaction_target_size` of 8192, which is just
one page size. This is an unattainable goal, as we generate at least
three pages: one for the header, one for the b-tree (minimally sized
ones have just the root node in a single page), and one for the data.
Therefore, we add two pages to the warning limit. The warning text
becomes a bit less accurate, but I think this is okay.
## Problem
Errors and notices that happen during a pooled connection's lifecycle have
no session identifiers.
## Summary of changes
Using a watch channel, we set the session ID whenever it changes. This
way we can see the status of a connection for that session.
Also add a connection ID so that we can search across the entire connection
lifecycle.
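
A rough sketch of the idea, assuming `tokio`'s watch channel and `tracing` for logging (names and types here are hypothetical, not the proxy's actual ones):

```rust
use tokio::sync::watch;

/// Hypothetical sketch: the pool owns the sender and updates it whenever the
/// pooled connection is handed to a new client session; the task driving the
/// connection watches it so every log line can carry the current session ID.
async fn watch_session_id(mut rx: watch::Receiver<Option<String>>) {
    while rx.changed().await.is_ok() {
        let current = rx.borrow().clone();
        // From here on, errors and notices from the connection would be
        // logged with `current` attached.
        tracing::info!(session_id = ?current, "pooled connection picked up by a new session");
    }
}

/// Called by the pool when lending the connection out to a session.
fn assign_session(tx: &watch::Sender<Option<String>>, session_id: String) {
    let _ = tx.send(Some(session_id));
}
```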
I will have to change these as I change the remote_timeline_client API in
#4938, so this is a bit of cleanup, handling my comments that were resolved
during the initial review.
Cleanup:
- use unwrap in tests instead of mixed `?` and `unwrap`
- use `Handle` instead of `&'static Reactor` to make the
RemoteTimelineClient more natural
- use arrays in tests
- use plain `#[tokio::test]`
## Problem
A customer is having trouble connecting to neon from their production
environment. The logs show a mix of "Internal error" and "authentication
protocol violation", but not the full error.
## Summary of changes
Make sure we don't miss any logs during SASL/SCRAM.
A rather temporary solution before the proper one:
https://github.com/neondatabase/neon/issues/5006
That requires more plumbing, so for now let's just not attach deleted tenants,
and implement resume later.
Additionally, fix `assert_prefix_empty`. It had a buggy prefix calculation,
but since we always asserted the absence of stuff, it happened to work. Here I
started to assert the presence of stuff too, and it failed. I added more
"presence" asserts to other places to be confident that it works.
Resolves [#5016](https://github.com/neondatabase/neon/issues/5016)
## Problem
The previous version of neon (which we use in the forward compatibility test)
now has the `amcheck` extension installed, so we can run `pg_amcheck`
unconditionally.
## Summary of changes
- Run `pg_amcheck` in compatibility tests unconditionally
Before this patch, `off` and `blknum` were function-wide mutable state.
Now they are contained in the `Writer` struct.
The use of `push_bytes` instead of index-based filling of the buffer
also makes it easier to reason about what's going on.
This is prep for https://github.com/neondatabase/neon/pull/4994
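
A sketch of the resulting shape, under the assumption of a simple page-granular flush (names are illustrative, not the actual blob_io types):

```rust
use std::io::{self, Write};

const PAGE_SZ: usize = 8192;

/// Sketch only: the cursor state that used to be function-wide
/// `mut off` / `mut blknum` locals now lives on the writer itself.
struct Writer<W: Write> {
    dst: W,
    buf: Vec<u8>, // bytes of the current, not-yet-full page
    blknum: u32,  // number of pages already flushed to `dst`
}

impl<W: Write> Writer<W> {
    fn new(dst: W) -> Self {
        Writer { dst, buf: Vec::with_capacity(PAGE_SZ), blknum: 0 }
    }

    /// Callers just push bytes; there is no index-based filling of the buffer.
    fn push_bytes(&mut self, mut src: &[u8]) -> io::Result<()> {
        while !src.is_empty() {
            let n = (PAGE_SZ - self.buf.len()).min(src.len());
            self.buf.extend_from_slice(&src[..n]);
            src = &src[n..];
            if self.buf.len() == PAGE_SZ {
                self.dst.write_all(&self.buf)?;
                self.buf.clear();
                self.blknum += 1;
            }
        }
        Ok(())
    }
}
```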
Don't download `ext_index.json` from S3; instead, receive it as part of the
spec from the control plane.
This eliminates S3 access for most compute starts,
and also allows us to update the extensions spec on the fly.
Restores the #4937 work relating to the ability to use `ResidentDeltaLayer`
(which is an Arc wrapper) in #4938 for the `ValueRef`s, by removing the
borrow from `ValueRef` and providing it from an upper layer.
This should not have any functional changes; most importantly,
`main` will continue to use the borrowed `DeltaLayerInner`. It might be
that I can change #4938 to be like this. If that is so, I'll gladly rip
out the `Ref` and move the borrow back. But I'll first want to look at
the current test failures.
This makes it more explicit that these are different u64-sized
namespaces.
Re-using one in place of the other would be catastrophic.
Prep for https://github.com/neondatabase/neon/pull/4994
which will eliminate the ephemeral_file::FileId and move the
blob_io::FileId into page_cache.
It makes sense to have this preliminary commit, though,
to minimize the amount of new concepts in #4994 and the other
preliminaries that depend on that work.
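
For illustration only, the kind of newtype separation this is about (hypothetical definitions, not the exact ones in the tree):

```rust
/// Identifies a file for page-cache purposes.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct PageCacheFileId(u64);

/// Identifies an ephemeral file on disk.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct EphemeralFileId(u64);

fn lookup_page(file: PageCacheFileId, blknum: u32) {
    // ... look up (file, blknum) in the page cache ...
    let _ = (file, blknum);
}

fn example(eph: EphemeralFileId, pc: PageCacheFileId) {
    lookup_page(pc, 0);
    // lookup_page(eph, 0); // does not compile: the two u64 namespaces cannot be mixed up
    let _ = eph;
}
```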
## Problem
As documented, the global connection pool will be high contention.
## Summary of changes
Use DashMap rather than Mutex<HashMap>.
Of note, DashMap currently uses a RwLock internally, but it's sharded
to reduce contention by a factor of N. We could potentially use
flurry, which is a port of Java's ConcurrentHashMap, but I have no good
understanding of its performance characteristics. DashMap is at least
equivalent to a HashMap, but with less contention.
See the read-heavy benchmark to analyse our expected performance:
<https://github.com/xacrimon/conc-map-bench#ready-heavy>
I also spoke with the developer of dashmap recently, and they are
working on porting the implementation to a concurrent HAMT, FWIW.
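
A sketch of the swap, assuming the `dashmap` crate (the pool type, key, and method names here are made up): the single outer lock disappears, and each lookup only locks the shard holding its key.

```rust
use dashmap::DashMap;
use std::sync::Arc;

/// Hypothetical pooled-connection placeholder.
struct PooledConn;

/// Before: Mutex<HashMap<String, Vec<PooledConn>>>, where every lookup takes
/// the one global lock. After: DashMap shards the lock, so lookups for
/// different endpoints rarely contend.
#[derive(Clone)]
struct GlobalPool {
    per_endpoint: Arc<DashMap<String, Vec<PooledConn>>>,
}

impl GlobalPool {
    fn get(&self, endpoint: &str) -> Option<PooledConn> {
        // `get_mut` locks only the shard that holds this key.
        self.per_endpoint.get_mut(endpoint)?.value_mut().pop()
    }

    fn put(&self, endpoint: String, conn: PooledConn) {
        self.per_endpoint.entry(endpoint).or_default().push(conn);
    }
}
```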
## Problem
PR #4839 has already reduced the number of b-tree traversals and vec
creations from 3 to 2, but as pointed out in
https://github.com/neondatabase/neon/pull/4839#discussion_r1279167815 ,
we would ideally just traverse the b-tree once during compaction.
After #4836, the two vecs created are one for the list of keys, lsns and
sizes, and one for the list of `(key, lsn, value reference)`. However,
they are not equal, as pointed out in
https://github.com/neondatabase/neon/pull/4839#issuecomment-1660418012
and the following comment: the key vec creation combines multiple
entries for which the lsn is changing but the key stays the same into
one, with the size being the sum of the sub-sizes. In SQL, this would
correspond to something like `SELECT key, lsn, SUM(size) FROM b_tree
GROUP BY key;` and `SELECT key, lsn, val_ref FROM b_tree;`. Therefore,
the join operation is non-trivial.
## Summary of changes
This PR merges the two lists of keys and value references into one. It's
not a trivial change and affects the size pattern of the resulting
files, which is why this is in a separate PR from #4839 .
The key vec is used in compaction to determine when to start a new
layer file. The loop uses various thresholds to come to this conclusion,
but the grouping by key has led to the behaviour that, regardless of
the threshold, it only starts a new file when either a new key or a new
delta file is encountered.
The new code now does the combination after the merging and sorting of
the various keys from the delta files. This *mostly* does the same as
the old code, except for a detail: with the grouping done on a
per-delta-layer basis, the sorted and merged vec would still have
multiple entries for multiple delta files, but now, we don't have an
easy way to tell when a new input delta layer file is encountered, so we
cannot create multiple entries on that basis easily.
To prevent possibly infinite growth, the new grouping code compares the
combined size with the threshold, and if it is exceeded, it cuts a new
entry so that the downstream code can cut a new output file. Here we make
a tradeoff, however: if the threshold is too small, we risk putting entries
for the same key into multiple layer files, but if the threshold is too big,
we can in some instances exceed the target size. Currently, we set the
threshold to the target size, so in theory we would stay below or roughly at
double the `target_file_size` (see the sketch below).
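
In rough pseudocode terms, the new grouping looks like the following (types and names invented for the sketch; the real code works on the merged iterator of delta entries):

```rust
/// One entry of the merged, `(key, lsn)`-sorted stream.
struct DeltaEntry {
    key: u64,
    lsn: u64,
    size: u64,
}

/// Group consecutive same-key entries, but cut a new group once the
/// accumulated size exceeds `threshold` (set to the target file size), so a
/// single hot key cannot grow one group without bound.
fn group_entries(entries: &[DeltaEntry], threshold: u64) -> Vec<(u64, u64, u64)> {
    // Returns (key, lsn of the first entry in the group, summed size).
    let mut out: Vec<(u64, u64, u64)> = Vec::new();
    for e in entries {
        match out.last_mut() {
            Some((key, _lsn, size)) if *key == e.key && *size < threshold => {
                *size += e.size;
            }
            _ => out.push((e.key, e.lsn, e.size)),
        }
    }
    out
}
```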
We also fix the way the size was calculated for the last key. The calculation
was wrong and accounted for the old layer's btree, even though we
already account for the overhead of the in-construction btree.
Builds on top of #4839 .
Previously, `list_prefixes` was incorrectly used for that purpose. Change
it to use `list_files`, and add a test.
Also some drive-by refactorings on the Python side to move helpers out of a
specific test file so they are widely accessible.
Resolves https://github.com/neondatabase/neon/issues/4499
For `delete_objects`, failures were injected both for the whole `delete_objects`
operation and then for every delete it contains. Make it fail only once, for the whole operation.
## Problem
This was set to 5 seconds, which was very close to how long a compaction
took on my workstation, and when deletion is blocked on compaction the
test would fail.
We will fix this to make compactions drop out on deletion, but for the
moment let's stabilize the test.
## Summary of changes
Change timeout on timeline deletion in
`test_timeline_deletion_with_files_stuck_in_upload_queue` from 5 seconds
to 30 seconds.
To this end:
1) Add an `-e` option to the `neon_local safekeeper start` command, appending
extra options to the safekeeper invocation.
2) Allow multiple occurrences of the same option in safekeepers; the last
value is taken.
3) Allow specifying an empty string for the `*-auth-public-key-path` options,
which disables auth for the service.
The same option enables auth and specifies the public key, so this also allows
using different public keys. The motivation is to
1) allow e.g. changing the pageserver key/token without replacing all compute
tokens;
2) enable auth gradually.
## Problem
Deletions can be possibly reordered. Use fsync to avoid the case when
mark file doesnt exist but other tenant/timeline files do.
See added comments.
resolves#4987
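
Sketched for a hypothetical mark file (paths, file names, and ordering simplified): the point is to fsync the containing directory between steps so the directory-entry changes cannot be observed out of order after a crash.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;

/// Create a deletion mark and make its directory entry durable before we
/// start deleting the tenant/timeline files themselves.
fn create_delete_mark(timeline_dir: &Path) -> io::Result<()> {
    let mark = timeline_dir.join("deleted.___mark");
    File::create(&mark)?.sync_all()?;
    // Without this, a crash could leave the files deleted but the mark
    // missing, i.e. the observed order of metadata changes would be wrong.
    File::open(timeline_dir)?.sync_all()?;
    Ok(())
}

/// Delete the data files first, fsync the directory, and only then drop the mark.
fn remove_files_then_mark(timeline_dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(timeline_dir)? {
        let entry = entry?;
        if entry.file_name() != "deleted.___mark" {
            fs::remove_file(entry.path())?;
        }
    }
    // Make the removals durable before removing the mark itself.
    File::open(timeline_dir)?.sync_all()?;
    fs::remove_file(timeline_dir.join("deleted.___mark"))?;
    File::open(timeline_dir)?.sync_all()?;
    Ok(())
}
```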
## Problem
While adding the new test results format, I also changed the way we
upload Allure reports to S3
(722c7956bb)
to avoid duplicated results from previous runs. But this broke links to
earlier results (the results are still available, but at different URLs).
This PR fixes this (by reverting the logic changed in
722c7956bb)
and moves the logic for storing test results in the db to the allure
generate step. This allows us to avoid duplicate test results in the db
and saves some time on the extra S3 downloads that happened in a different
job before this PR.
Ref https://neondb.slack.com/archives/C059ZC138NR/p1691669522160229
## Summary of changes
- Move test results storing logic from a workflow to
`actions/allure-report-generate`
## Problem
HTTP batch queries currently allow us to set the isolation level and
read-only, but not deferrable.
## Summary of changes
Add support for deferrable.
Echo deferrable status in response headers only if true.
Likewise, now echo read-only status in response headers only if true.
#4942 left old metrics in place for migration purposes. It was noticed
that the total number of deleted objects was missing from the new metrics;
add it.
While reviewing, it was noticed that `delete_object` could just be
`delete_objects` of one.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
The `BlockReader` trait is not ready to be asyncified, as associated
types are not supported by asyncification strategies like via the
`async_trait` macro, or via adopting enums.
## Summary of changes
Remove the `BlockLease` associated type from the `BlockReader` trait and
turn it into an enum instead, bearing the same name. The enum has two
variants, one of which is gated by `#[cfg(test)]`. Therefore, outside of
test settings, the enum has zero overhead over just having the
`PageReadGuard`. Using the enum allows us to impl `BlockReader` without
needing the page cache.
Part of https://github.com/neondatabase/neon/issues/4743
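
Simplified, the enum looks roughly like this (the guard type and variant names are stand-ins; the real variants wrap the pageserver's page-cache read guard and a test-only owned page):

```rust
const PAGE_SZ: usize = 8192;

/// Stand-in for the page cache's read guard type.
struct PageReadGuard<'a>(&'a [u8; PAGE_SZ]);

/// An enum instead of an associated type, so `BlockReader` stays friendly to
/// `async_trait`-style rewrites. Outside of tests there is only one variant,
/// so there is no runtime overhead over the bare guard.
enum BlockLease<'a> {
    PageReadGuard(PageReadGuard<'a>),
    #[cfg(test)]
    Rc(std::rc::Rc<[u8; PAGE_SZ]>),
}

impl std::ops::Deref for BlockLease<'_> {
    type Target = [u8; PAGE_SZ];
    fn deref(&self) -> &Self::Target {
        match self {
            BlockLease::PageReadGuard(g) => g.0,
            #[cfg(test)]
            BlockLease::Rc(rc) => &**rc,
        }
    }
}

/// The trait no longer needs an associated type.
trait BlockReader {
    fn read_blk(&self, blknum: u32) -> std::io::Result<BlockLease<'_>>;
}
```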
## Problem
The `BlockCursor::read_blob` and `BlockCursor::read_blob_into_buf`
functions are calling `read_blk` internally, so if we want to make that
function async fn, they need to be async themselves.
## Summary of changes
* We first turn `ValueRef::load` into an async fn.
* Then, we switch the `RwLock` implementation in `InMemoryLayer` to use
the one from `tokio`.
* Last, we convert the `read_blob` and `read_blob_into_buf` functions
into async fn.
In three instances we use `Handle::block_on`:
* one use is in compaction code, which currently isn't async. We put the
entire loop into an `async` block to prevent the potentially hot loop
from doing cross-thread operations.
* one use is in dumping code for `DeltaLayer`. The "proper" way to
address this would be to enable the visit function to take async
closures, but then we'd need to be generic over async vs non-async,
which [isn't supported by Rust right
now](https://blog.rust-lang.org/inside-rust/2022/07/27/keyword-generics.html).
The other alternative would be to do a first pass where we cache the
data into memory, and only then to dump it.
* the third use is in writing code, inside a loop that copies from one
file to another. It is synchronous, and we'd like to keep it that way
(for now?).
Part of #4743
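
The compaction-loop case follows this pattern (a sketch, not the actual compaction code): enter `block_on` once for the whole loop instead of once per value.

```rust
use tokio::runtime::Handle;

/// Hypothetical async read, standing in for the now-async blob reads.
async fn read_value(i: u32) -> Vec<u8> {
    vec![i as u8]
}

/// Synchronous caller: wrap the entire hot loop in a single async block
/// instead of calling `block_on` per value, so we don't bounce between
/// threads on every iteration.
fn process_all_sync(handle: &Handle, n: u32) -> usize {
    handle.block_on(async {
        let mut total = 0;
        for i in 0..n {
            total += read_value(i).await.len();
        }
        total
    })
}
```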
## Problem
It's nice if `single query : single response :: batch query : batch
response`.
But at present, in the single case we send `{ query: '', params: [] }`
and get back a single `{ rows: [], ... }` object, while in the batch
case we send an array of `{ query: '', params: [] }` objects and get
back not an array of `{ rows: [], ... }` objects but a `{ results: [ {
rows: [] , ... }, { rows: [] , ... }, ... ] }` object instead.
## Summary of changes
With this change, the batch query body becomes `{ queries: [{ query: '',
params: [] }, ... ] }`, which restores a consistent relationship between
the request and response bodies.
Originated from a test failure where we got a SlowDown error from S3.
The patch generalizes `download_retry` to not be download specific.
The resulting `retry` function is moved to the utils crate; `download_retries`
is now a thin wrapper around this `retry` function.
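
Sketched, the generalized helper has roughly this shape (the signature, backoff, and names are illustrative, assuming tokio and tracing; the exact utils-crate API may differ):

```rust
use std::future::Future;
use std::time::Duration;

/// Retry `op` up to `max_retries` times with a simple backoff, logging the
/// failures with `description`. Illustrative only; in the real code this is
/// shared by the download and other remote-storage paths.
pub async fn retry<F, Fut, T, E>(mut op: F, description: &str, max_retries: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    let mut attempt = 0;
    loop {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) if attempt < max_retries => {
                tracing::warn!("{description} failed (attempt {attempt}): {e}, retrying");
                attempt += 1;
                tokio::time::sleep(Duration::from_millis(100 * u64::from(attempt))).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```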
To ensure that all needed retries are in place, the test code now uses the
`test_remote_failures=1` setting.
Ref https://neondb.slack.com/archives/C059ZC138NR/p1691743624353009
## Problem
Currently, image generation reads delta layers before writing out
subsequent image layers, which updates the access time of the delta
layers and effectively puts them at the back of the queue for eviction.
This is the opposite of what we want, because after a delta layer is
covered by a later image layer, it's likely that subsequent reads of
latest data will hit the image rather than the delta layer, so the delta
layer should be quite a good candidate for eviction.
## Summary of changes
`RequestContext` gets a new `ATimeBehavior` field, and a
`RequestContextBuilder` helper so that we can optionally add the new
field without growing `RequestContext::new` every time we add something
like this.
Request context is passed into the `record_access` function, and the
access time is not updated if `ATimeBehavior::Skip` is set.
The compaction background task constructs its request context with this
skip policy.
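
Roughly, the shape of the change (a sketch; only `ATimeBehavior::Skip` is named above, so the other variant, field names, and builder methods here are assumptions):

```rust
/// Sketch of the new context knob; the real RequestContext has more fields.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum ATimeBehavior {
    Update, // assumed name of the default variant
    Skip,
}

pub struct RequestContext {
    atime_behavior: ATimeBehavior,
    // ... task kind, download behavior, etc.
}

pub struct RequestContextBuilder {
    inner: RequestContext,
}

impl RequestContextBuilder {
    pub fn new() -> Self {
        RequestContextBuilder {
            inner: RequestContext { atime_behavior: ATimeBehavior::Update },
        }
    }

    /// Only callers that care (compaction) set this; everyone else keeps the
    /// default, so plain construction sites don't have to change.
    pub fn atime_behavior(mut self, b: ATimeBehavior) -> Self {
        self.inner.atime_behavior = b;
        self
    }

    pub fn build(self) -> RequestContext {
        self.inner
    }
}

/// Layer access recording consults the context before bumping the access time.
fn record_access(atime: &mut u64, now: u64, ctx: &RequestContext) {
    if ctx.atime_behavior == ATimeBehavior::Skip {
        // Compaction's image-layer generation reads delta layers without
        // making them look recently used to the eviction policy.
        return;
    }
    *atime = now;
}
```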
Closes: https://github.com/neondatabase/neon/issues/4969
This code was mostly copied from walsender.c, and the idea was to keep it
similar to walsender.c so that we could easily copy-paste future upstream
changes from walsender.c to walproposer_utils.c, too. But right now I see that
deleting it doesn't break anything, so it's better to remove the unused parts.
It allows the term leader to ensure it pulls data from the correct term. The
absence of this wasn't very problematic due to the CRC checks, but let's be strict.
walproposer still doesn't use it, as we're going to remove recovery from it
completely.
Patches a bug in vm-builder where it did not include enough parameters
in the query string. These parameters are `host=localhost port=5432`.
These parameters were not necessary for the monitor because the `pq` Go
postgres driver included them by default.
## Problem
In some places, the lock on `InMemoryLayerInner` is only created to
obtain `end_lsn`. This is not needed however, if we move `end_lsn` to
`InMemoryLayer` instead.
## Summary of changes
Make `end_lsn` a member of `InMemoryLayer`, and do less locking of
`InMemoryLayerInner`. `end_lsn` is changed from `Option<Lsn>` into an
`OnceLock<Lsn>`. Thanks to this change, we no longer need to take the lock
in three functions.
Part of #4743 . Suggested in
https://github.com/neondatabase/neon/pull/4905#issuecomment-1666458428 .
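
A simplified sketch of the new layout (field and method names approximate):

```rust
use std::sync::{Mutex, OnceLock};

#[derive(Debug, Clone, Copy)]
struct Lsn(u64);

struct InMemoryLayerInner {
    // index, ephemeral file, ...
}

/// `end_lsn` lives on the outer struct as a `OnceLock`, so readers no longer
/// take the inner lock just to learn where the layer ends.
struct InMemoryLayer {
    start_lsn: Lsn,
    /// Set exactly once, when the layer is frozen.
    end_lsn: OnceLock<Lsn>,
    inner: Mutex<InMemoryLayerInner>,
}

impl InMemoryLayer {
    /// Before: lock `inner` to read an `Option<Lsn>`. Now: no locking at all.
    fn end_lsn_or_max(&self) -> Lsn {
        self.end_lsn.get().copied().unwrap_or(Lsn(u64::MAX))
    }

    /// Freezing sets `end_lsn` once; setting it twice would be a bug.
    fn freeze(&self, end: Lsn) {
        self.end_lsn.set(end).expect("end_lsn set twice");
    }
}
```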
Found this log on staging:
```
2023-08-10T17:42:58.573790Z INFO handling interactive connection from client protocol="ws"
```
We seem to be losing the websocket span in spawn; this patch fixes it.
## Problem
The 1MB response limit is very small.
## Summary of changes
This data is not yet tracked, so we shouldn't raise the limit too high yet.
But as discussed with @kelvich and @conradludgate, this PR lifts it to
10MB and also adds details of the limit to the error response.