I was looking for metrics on how many computes are still using protocol
versions 1 and 2. This PR adds counters for the "pagestream" and
"pagestream_v2" commands, as well as for all the other commands. The new
metrics are global for the whole pageserver instance rather than
per-tenant, so the additional metrics bloat should be fairly small.
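For a rough idea of the shape of such a global per-command counter, here is a minimal sketch using the `prometheus` and `once_cell` crates directly; the metric and label names are hypothetical, not the ones added in this PR:
```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// One global counter vector, labelled by the libpq command name
// ("pagestream", "pagestream_v2", "basebackup", ...). Keeping it global
// rather than per-tenant keeps the metric cardinality small.
static COMPUTE_COMMANDS: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_compute_commands_total", // hypothetical metric name
        "Number of compute connection commands received, by command kind",
        &["command"]
    )
    .expect("failed to register metric")
});

// Called once per parsed libpq command.
fn record_command(command: &str) {
    COMPUTE_COMMANDS.with_label_values(&[command]).inc();
}

fn main() {
    record_command("pagestream_v2");
    record_command("basebackup");
}
```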
A follow-up on https://github.com/neondatabase/neon/pull/8103/.
Previously, the main branch failed with:
```
assertion `left == right` failed
left: b"value 3@0x10@0x30@0x28@0x40"
right: b"value 3@0x10@0x28@0x30@0x40"
```
This is fixed once #8103 is merged.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Adds a manual compaction trigger; adds gc-compaction to test_gc_feedback
Part of https://github.com/neondatabase/neon/issues/8002
```
test_gc_feedback[debug-pg15].logical_size: 50 Mb
test_gc_feedback[debug-pg15].physical_size: 2269 Mb
test_gc_feedback[debug-pg15].physical/logical ratio: 44.5302
test_gc_feedback[debug-pg15].max_total_num_of_deltas: 7
test_gc_feedback[debug-pg15].max_num_of_deltas_above_image: 2
test_gc_feedback[debug-pg15].logical_size_after_bottom_most_compaction: 50 Mb
test_gc_feedback[debug-pg15].physical_size_after_bottom_most_compaction: 287 Mb
test_gc_feedback[debug-pg15].physical/logical ratio after bottom_most_compaction: 5.6312
test_gc_feedback[debug-pg15].max_total_num_of_deltas_after_bottom_most_compaction: 4
test_gc_feedback[debug-pg15].max_num_of_deltas_above_image_after_bottom_most_compaction: 1
```
## Summary of changes
* Add the manual compaction trigger
* Use it in test_gc_feedback
* Add a guard to avoid running it with retain_lsns
* Fix: Do `schedule_compaction_update` after compaction
* Fix: Supply deltas in the correct order to reconstruct value
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
part of https://github.com/neondatabase/neon/issues/8002
## Summary of changes
This pull request adds the image layer iterator. It buffers a fixed
number of key-value pairs in memory and gives the developer an iterator
abstraction over the image layer. Once the buffer is exhausted, it
issues one I/O to fetch the next batch.
Due to Rust lifetime mysteries, the `get_stream_from` API has been
refactored to `into_stream`, which consumes `self`.
The delta layer iterator implementation will be similar, so I'll add
it after this pull request is merged.
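A minimal sketch of the buffering pattern described above, with stand-in types and a synchronous closure standing in for the single I/O per batch (the real iterator is async and works over the image layer's B-tree index):
```rust
use std::collections::VecDeque;

/// Illustrative batched iterator: holds a small in-memory buffer of
/// key-value pairs and refills it with one (simulated) I/O per batch.
struct BatchedIter<F> {
    buffer: VecDeque<(u64, Vec<u8>)>, // stand-in for (Key, Value) pairs
    next_start: Option<u64>,          // key where the next batch begins; None = exhausted
    batch_size: usize,
    fetch_batch: F,                   // stand-in for the single I/O per batch
}

impl<F> BatchedIter<F>
where
    // Returns one batch plus the start key of the following batch (if any).
    F: FnMut(u64, usize) -> (Vec<(u64, Vec<u8>)>, Option<u64>),
{
    fn new(start: u64, batch_size: usize, fetch_batch: F) -> Self {
        Self { buffer: VecDeque::new(), next_start: Some(start), batch_size, fetch_batch }
    }

    /// Returns the next key-value pair, refilling the buffer when it is empty.
    fn next(&mut self) -> Option<(u64, Vec<u8>)> {
        if self.buffer.is_empty() {
            let start = self.next_start?;
            let (batch, next_start) = (self.fetch_batch)(start, self.batch_size);
            self.buffer.extend(batch);
            self.next_start = next_start;
        }
        self.buffer.pop_front()
    }
}

fn main() {
    // Pretend the "layer" holds keys 0..10; each closure call is one simulated I/O.
    let mut iter = BatchedIter::new(0, 4, |start, n| {
        let end = (start + n as u64).min(10);
        let batch: Vec<_> = (start..end).map(|k| (k, vec![k as u8])).collect();
        let next = if end < 10 { Some(end) } else { None };
        (batch, next)
    });
    let mut count = 0;
    while let Some((_k, _v)) = iter.next() {
        count += 1;
    }
    println!("read {count} pairs");
}
```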
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
These APIs have been deprecated for some time, but were still used from
test code.
Closes: https://github.com/neondatabase/neon/issues/4282
## Summary of changes
- It is still convenient to do a "tenant_attach" from a test without
having to write out a location_conf body, so those test methods have
been retained with implementations that call through to their
location_conf equivalent.
Part of #7497, closes #8120.
## Summary of changes
This PR adds a metric to track the number of valid leases each time `GCInfo`
gets refreshed.
Besides this metric, we should also track disk space and synthetic size
(after #8071 is closed) to make sure leases are used properly.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Part of #7497, closes #8072.
## Problem
Currently the `get_lsn_by_timestamp` and branch creation pageserver APIs do not provide a pleasant client experience: the looked-up LSN might be GC-ed between the two API calls.
This PR attempts to prevent common races between GC and branch creation by making use of the LSN leases introduced in #8084. A lease can optionally be granted to a looked-up LSN; with the lease, GC will not touch the layers needed to reconstruct all pages at this LSN for the duration of the lease.
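Conceptually, a lease is just an expiring entry keyed by LSN that GC consults before picking its cutoff. A minimal sketch with stand-in types (not the pageserver's actual `GCInfo`/lease structures):
```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

type Lsn = u64; // stand-in for the real Lsn type

struct LsnLease {
    valid_until: Instant,
}

#[derive(Default)]
struct LeaseMap {
    leases: BTreeMap<Lsn, LsnLease>,
}

impl LeaseMap {
    /// Grant (or renew) a lease on `lsn` for `length`. Renewal never shortens
    /// an existing lease.
    fn renew(&mut self, lsn: Lsn, length: Duration) {
        let valid_until = Instant::now() + length;
        let lease = self.leases.entry(lsn).or_insert(LsnLease { valid_until });
        lease.valid_until = lease.valid_until.max(valid_until);
    }

    /// GC must not advance its cutoff past the oldest LSN with a valid lease.
    fn gc_cutoff_upper_bound(&mut self, planned_cutoff: Lsn) -> Lsn {
        let now = Instant::now();
        self.leases.retain(|_, lease| lease.valid_until > now);
        match self.leases.keys().next() {
            Some(&oldest_leased_lsn) => planned_cutoff.min(oldest_leased_lsn),
            None => planned_cutoff,
        }
    }
}

fn main() {
    let mut leases = LeaseMap::default();
    // get_lsn_by_timestamp found LSN 0x100 and the client asked for a lease.
    leases.renew(0x100, Duration::from_secs(600));
    // GC wanted to cut off at 0x200 but is held back by the lease.
    assert_eq!(leases.gc_cutoff_upper_bound(0x200), 0x100);
}
```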
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Part of https://github.com/neondatabase/neon/issues/8002
This pull request adds tests for bottom-most gc-compaction with delta
records. Also fixed a bug in the compaction process that creates
overlapping delta layers by force splitting at the original delta layer
boundary.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
We now have a `vX` number in the file name, e.g.,
`000000067F0000000400000B150100000000-000000067F0000000400000D350100000000__00000000014B7AC8-v1-00000001`
The related pull request for the new-style path was merged a month ago:
https://github.com/neondatabase/neon/pull/7660
## Summary of changes
Fixed the draw timeline dir command to handle it.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`test_sharding_autosplit` is occasionally failing on warnings about
shard splits taking longer than expected (`Exclusive lock by ShardSplit
was held for`...).
It's not obvious which part is taking the time (I suspect remote storage
uploads).
Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9618788427/index.html#testresult/b395294d5bdeb783/
## Summary of changes
- Since shard splits are infrequent events, we can afford to be very
chatty: add a bunch of info-level logging throughout the process.
#8082 removed the legacy deletion path, but retained code for completing
deletions that were started before a pageserver restart. This PR cleans
up that remaining code, and removes all the pageserver code that dealt
with tenant deletion markers and resuming tenant deletions.
The release at https://github.com/neondatabase/neon/pull/8138 contains
https://github.com/neondatabase/neon/pull/8082, so we can now merge this
to `main`
Adds a `Deserialize` impl to `RemoteStorageConfig`. We thus achieve the
same as #7743 but with less repetitive code, by deriving `Deserialize`
impls on `S3Config`, `AzureConfig`, and `RemoteStorageConfig`. The
disadvantage is less useful error messages.
The git history of this PR contains a state where we went via an
intermediate representation, leveraging the `serde_json` crate,
without it ever being actual JSON, though.
Also, the PR adds deserialization tests.
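For illustration, the derive-based approach looks roughly like this; the field and variant names are made up for the sketch and do not match the real config schema, and `toml` is used just to show a round-trip:
```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct S3Config {
    bucket_name: String,
    bucket_region: String,
    prefix_in_bucket: Option<String>,
}

#[derive(Debug, Deserialize)]
struct AzureConfig {
    container_name: String,
    container_region: String,
}

// Which backend is configured is decided by which fields are present;
// `untagged` tries each variant in order until one matches. This is exactly
// what makes the error messages less useful than hand-written parsing: on
// failure, serde can only say that no variant matched.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum RemoteStorageConfig {
    AwsS3(S3Config),
    AzureContainer(AzureConfig),
}

fn main() {
    let cfg: RemoteStorageConfig = toml::from_str(
        r#"
        bucket_name = "my-bucket"
        bucket_region = "eu-central-1"
        "#,
    )
    .expect("parse remote storage config");
    println!("{cfg:?}");
}
```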
Alternative to #7743 .
In #7957 we enabled deletion without attachment, but retained the
old-style deletion (return 202, delete in background) for attached
tenants. In this PR, we remove the old-style deletion path, such that if
the tenant delete API is invoked while a tenant is attached, the tenant is
simply detached before completing the deletion.
This intentionally doesn't rip out all the old deletion code: in case a
deletion was in progress at time of upgrade, we keep around the code for
finishing it for one release cycle. The rest of the code removal happens
in https://github.com/neondatabase/neon/pull/8091
Now that deletion will always be via the new path, the new path is also
updated to use some retries around remote storage operations, to avoid
tripping up the control plane with 500s if S3 has an intermittent issue.
## Problem
These APIs have been unused for some time. They were superseded by
/location_conf: the equivalent of ignoring a tenant is now to put it in
secondary mode.
## Summary of changes
- Remove APIs
- Remove tests & helpers that used them
- Remove error variants that are no longer needed.
part of Epic https://github.com/neondatabase/neon/issues/7386
# Motivation
The materialized page cache adds complexity to the code base, which
increases the maintenance burden and the risk of subtle and
hard-to-reproduce bugs such as #8050.
Further, the best hit rate that we currently achieve in production is ca.
1% of materialized page cache lookups for
`task_kind=PageRequestHandler`. Other task kinds have hit rates <0.2%.
Last, caching page images in Pageserver rewards under-sized caches in
Computes because reading from Pageserver's materialized page cache over
the network is often sufficiently fast (low hundreds of microseconds).
Such Computes should upscale their local caches to fit their working
set, rather than repeatedly requesting the same page from Pageserver.
Some more discussion and context in internal thread
https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459
# Changes
This PR removes the materialized page cache code & metrics.
The infrastructure for different key kinds in `PageCache` is left in
place, even though the "Immutable" key kind is the only remaining one.
This can be further simplified in a future commit.
Some tests started failing because their total runtime was dependent on
high materialized page cache hit rates. This PR makes them
fixed-runtime or raises their pytest timeouts:
* test_local_file_cache_unlink
* test_physical_replication
* test_pg_regress
# Performance
I focussed on ensuring that this PR will not result in a performance
regression in prod.
* **getpage** requests: our production metrics have shown the
materialized page cache to be irrelevant (low hit rate). Also,
Pageserver is the wrong place to cache page images; that should happen in
compute.
* **ingest** (`task_kind=WalReceiverConnectionHandler`): prod metrics
show 0 percent hit rate, so, removing will not be a regression.
* **get_lsn_by_timestamp**: important API for branch creation, used by
the control plane. The clog pages that this code uses are not
materialized-page-cached because they're not 8k. No risk of introducing a
regression here.
We will watch the various nightly benchmarks closely for more results
before shipping to prod.
The new image iterator and delta iterator use an iterator-based API.
https://github.com/neondatabase/neon/pull/8006 / part of
https://github.com/neondatabase/neon/issues/8002
This requires the underlying thing (the btree) to have an iterator API,
and the iterator should have a type name so that it can be stored
somewhere.
```rust
pub struct DeltaLayerIterator {
index_iterator: BTreeIterator
}
```
versus
```rust
pub struct DeltaLayerIterator {
index_iterator: impl Stream<....>
}
```
(this requires a nightly flag and is still buggy in the Rust compiler)
There are multiple ways to achieve this:
1. Either write a BTreeIterator from scratch that provides `async next`.
This is the most efficient way to do that.
2. Or wrap the current `get_stream` API, which is the approach taken
in this pull request.
In the future, we should do (1), and the `get_stream` API should be
refactored to use the iterator API. With (2), we have to wrap the
`get_stream` API with `Pin<Box<dyn Stream>>`, where we have the overhead
of dynamic dispatch. However, (1) needs a rewrite of the `visit`
function, which would take some time to write and review. I'd like to
define this iterator API first and work on a real iterator API later.
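For illustration, approach (2) boils down to hiding the stream behind a named type via `Pin<Box<dyn Stream>>`; a minimal sketch with stand-in key/value types and a placeholder for the real `into_stream()`:
```rust
use std::pin::Pin;

use futures::{stream, Stream, StreamExt};

type Key = u64;       // stand-ins for the real key/value types
type Value = Vec<u8>;

/// A named iterator type that can be stored in a struct field, at the cost
/// of one dynamic dispatch per `next()` call.
pub struct DeltaLayerIterator<'a> {
    index_iterator: Pin<Box<dyn Stream<Item = anyhow::Result<(Key, Value)>> + Send + 'a>>,
}

impl<'a> DeltaLayerIterator<'a> {
    pub fn new(
        inner: impl Stream<Item = anyhow::Result<(Key, Value)>> + Send + 'a,
    ) -> Self {
        Self {
            index_iterator: Box::pin(inner),
        }
    }

    pub async fn next(&mut self) -> Option<anyhow::Result<(Key, Value)>> {
        self.index_iterator.next().await
    }
}

#[tokio::main]
async fn main() {
    // Placeholder for the stream produced by the real `into_stream()`.
    let inner = stream::iter(vec![Ok((1, b"a".to_vec())), Ok((2, b"b".to_vec()))]);
    let mut iter = DeltaLayerIterator::new(inner);
    while let Some(item) = iter.next().await {
        println!("{:?}", item.unwrap());
    }
}
```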
## Summary of changes
Add `DiskBtreeIterator` and related tests.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Part of #7497, extracts from #7996, closes #8063.
## Problem
With the LSN lease API introduced in
https://github.com/neondatabase/neon/issues/7808, we want to implement
the real lease logic so that GC keeps all the layers needed to
reconstruct all pages at the leased LSNs for as long as their leases
are valid.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
```
ERROR synthetic_size_worker: failed to calculate synthetic size for tenant ae449af30216ac56d2c1173f894b1122: Could not find size at 0/218CA70 in timeline d8da32b5e3e0bf18cfdb560f9de29638\n')
```
e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9518948590/index.html#/testresult/30a6d1e2471d2775
This test had allow lists but was disrupted by
https://github.com/neondatabase/neon/pull/8051. In that PR, I had kept
an error path in fill_logical_sizes that covered the case where we
couldn't find sizes for some of the segments. However, that path could only
be hit if some Timeline was shut down concurrently with a
synthetic size calculation, so it makes sense to just leave the
segment's size `None` in this case: the subsequent size calculations do
not assume it is `Some`.
## Summary of changes
- Remove `CalculateSyntheticSizeError::LsnNotFound` and just proceed in
the case where we used to return it
- Remove defunct allow list entries in `test_metric_collection`
## Problem
Rust 1.79 enables new lints by default.
## Summary of changes
* update to rust 1.79
* `s/default_features/default-features/`
* fix proxy dead code.
* fix pageserver dead code.
## Problem
This PR refactors some error handling to avoid log spam on
tenant/timeline shutdown.
- "ignoring failure to find gc cutoffs: timeline shutting down." logs
(https://github.com/neondatabase/neon/issues/8012)
- "synthetic_size_worker: failed to calculate synthetic size for tenant
...: Failed to refresh gc_info before gathering inputs: tenant shutting
down", for example here:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8049/9502988669/index.html#suites/3fc871d9ee8127d8501d607e03205abb/1a074a66548bbcea
Closes: https://github.com/neondatabase/neon/issues/8012
## Summary of changes
- Refactor: Add a PageReconstructError variant to GcError: this is the
only kind of error that find_gc_cutoffs can emit.
- Functional change: only ignore shutdown PageReconstructError variant:
for other variants, treat it as a real error
- Refactor: add a structured CalculateSyntheticSizeError type and use it
instead of anyhow::Error in synthetic size calculations
- Functional change: while iterating through timelines gathering logical
sizes, only drop out if the whole tenant is cancelled: individual
timeline cancellations indicate deletion in progress and we can just
ignore those.
## Problem
Some code paths during secondary mode download return Ok() rather
than UpdateError::Cancelled. This is functionally okay, but
TenantDownloader::download ends with a sanity check that the
progress is 100% on success and prints a "Correcting drift..." warning
if not, and the Ok() return on cancellation trips that check. This
warning can be emitted in a test, e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8049/9503642976/index.html#/testresult/fff1624ba6adae9e.
## Summary of changes
- In secondary download cancellation paths, use
Err(UpdateError::Cancelled) rather than Ok(), so that we drop out of the
download function and do not reach the progress sanity check.
# Problem
Suppose our vectored get starts with an inexact materialized page cache
hit ("cached lsn") that is shadowed by a newer image layer.
Like so:
```
<inmemory layers>
+-+ < delta layer
| |
-|-|----- < image layer
| |
| |
-|-|----- < cached lsn for requested key
+_+
```
The correct visitation order is
1. inmemory layers
2. delta layer records in LSN range `[image_layer.lsn,
oldest_inmemory_layer.lsn_range.start)`
3. image layer
However, when the vectored get code visits the delta layer, it
(incorrectly!) returns with state `Complete`.
The reason it returns is that it calls `on_lsn_advanced` with
`self.lsn_range.start`, i.e., the start of the layer's LSN range.
Instead, it should use `lsn_range.start`, i.e., the start of the LSN
range from the correct visitation order listed above.
# Solution
Use `lsn_range.start` instead of `self.lsn_range.start`.
# Refs
discovered by & fixes https://github.com/neondatabase/neon/issues/6967
Co-authored-by: Vlad Lazar <vlad@neon.tech>
https://github.com/neondatabase/neon/issues/8002
We need a mock WAL record to make it easier to write unit tests. This pull
request adds such a record. It has a `clear` flag and an `append` field. The
tests for legacy-enhanced compaction are not modified yet and will be
part of the next pull request.
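For illustration, the mock record might look roughly like the sketch below; this is a simplified stand-in, not the actual test type added in this PR:
```rust
/// A mock "WAL record" for unit tests: either clears the existing value or
/// appends to it, so that tests can exercise value reconstruction without
/// real Postgres WAL.
#[derive(Debug, Clone)]
struct MockWalRecord {
    /// If set, discard whatever value was reconstructed so far.
    clear: bool,
    /// Bytes appended after the (possibly cleared) base value.
    append: String,
}

/// Replay a chain of mock records on top of an optional base image.
fn reconstruct(base: Option<&str>, records: &[MockWalRecord]) -> String {
    let mut value = base.unwrap_or("").to_string();
    for rec in records {
        if rec.clear {
            value.clear();
        }
        value.push_str(&rec.append);
    }
    value
}

fn main() {
    let records = [
        MockWalRecord { clear: false, append: "@0x10".to_string() },
        MockWalRecord { clear: false, append: "@0x20".to_string() },
    ];
    assert_eq!(reconstruct(Some("value 3"), &records), "value 3@0x10@0x20");
}
```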
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Some test cases add random keys into the timeline, but they are not part of
`collect_keyspace`; this causes compaction to remove those keys.
The pull request adds a field to supply extra keyspaces during unit
tests.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
A demo of a building block for compaction. The GC-compaction operation
iterates over all layers below/intersecting with the GC horizon and does a
full rewrite of all of them. The end result is an image layer
covering the full keyspace at the GC horizon, plus a bunch of delta layers
above the GC horizon. This helps us collect the garbage of the
test_gc_feedback test case to reduce space amplification (a simplified
sketch follows below).
This operation can be triggered manually using an HTTP API or be
triggered based on some metrics. The actual method is TBD.
The test is very basic and it's very likely that most parts of the
algorithm will be rewritten. I would like to get this merged so that I
can have a basic skeleton for the algorithm and then make incremental
changes.
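The per-key essence of the rewrite, as a heavily simplified sketch (the real implementation streams key-value pairs from the layer map and writes proper image/delta layer files):
```rust
type Lsn = u64;

/// Stand-in delta record: "replaying" it just appends its payload.
#[derive(Clone)]
struct DeltaRecord {
    lsn: Lsn,
    payload: String,
}

/// Bottom-most GC-compaction for a single key: replay every record at or
/// below the GC horizon (in LSN order!) into one image at the horizon, and
/// keep the records above the horizon as-is. The real code does this over
/// the whole keyspace and writes proper image/delta layer files.
fn gc_compact_key(
    base_image: String,
    deltas_in_lsn_order: &[DeltaRecord],
    gc_horizon: Lsn,
) -> (String, Vec<DeltaRecord>) {
    let mut image_at_horizon = base_image;
    let mut retained = Vec::new();
    for rec in deltas_in_lsn_order {
        if rec.lsn <= gc_horizon {
            image_at_horizon.push_str(&rec.payload);
        } else {
            retained.push(rec.clone());
        }
    }
    (image_at_horizon, retained)
}

fn main() {
    let deltas = vec![
        DeltaRecord { lsn: 0x10, payload: "@0x10".into() },
        DeltaRecord { lsn: 0x30, payload: "@0x30".into() },
        DeltaRecord { lsn: 0x40, payload: "@0x40".into() },
    ];
    let (image, above) = gc_compact_key("value 3".into(), &deltas, 0x30);
    assert_eq!(image, "value 3@0x10@0x30");
    assert_eq!(above.len(), 1);
}
```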
<img width="924" alt="image"
src="https://github.com/neondatabase/neon/assets/4198311/f3d49f4e-634f-4f56-986d-bfefc6ae6ee2">
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
We need unique tenant harness names in case you want to inspect the
results of the last failing run. We are not using any proc macros to get
the test name as there is no stable way of doing that, and there will
not be one in the future, so we need to fix these duplicates.
Also, clean up the duplicated tests to not mix `?` and `unwrap/assert`.
We've stored metadata as bytes within `index_part.json` for reasons
that have long since been fixed. #7693 added support for reading out a
normal JSON serialization of the `TimelineMetadata`.
Change the serialization to only write `TimelineMetadata` as JSON
going forward, keeping backward compatibility for reading the
metadata as bytes. Because of a failure to include `alias = "metadata"` in
#7693, one more follow-up is required to make the switch from the old
name to `"metadata": <json>`, but that affects only the field name in
the serialized format.
In documentation and naming, an effort is made to add enough warning
signs around TimelineMetadata so that it will receive no changes in the
future. We can add those fields to `IndexPart` directly instead.
Additionally, the path to cleaning up `metadata.rs` is documented in the
`metadata.rs` module comment. If we must extend `TimelineMetadata`
before that, the duplication suggested in [review comment] is the way to
go.
[review comment]:
https://github.com/neondatabase/neon/pull/7699#pullrequestreview-2107081558
New features have degraded layer flushing, most recently with
#7927. Changes:
- inline `Timeline::freeze_inmem_layer` to the only caller
- carry the TimelineWriterState guard to the actual point of freezing
the layer
- this allows us to `#[cfg(feature = "testing")]` the assertion added in
#7927
- remove duplicate `flush_frozen_layer` in favor of splitting the
`flush_frozen_layers_and_wait`
- this requires starting the flush loop earlier for `checkpoint_distance
< initdb size` tests
As seen with the pgvector 0.7.0 index builds, we can receive large
batches of images, leading to very large L0 layers in the range of 1GB.
These large layers are produced because we are only able to roll the
layer after we have witnessed two different Lsns in a single
`DataDirModification::commit`. As the single Lsn batches of images can
span over multiple `DataDirModification` lifespans, we will rarely get
to write two different Lsns in a single `put_batch` currently.
The solution is to remember the TimelineWriterState instead of eagerly
forgetting it, keeping it until we really open the next layer or until
someone else flushes (while holding the write_guard).
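A sketch of the idea with stand-in types (the real `TimelineWriterState` carries the open layer handle and more bookkeeping):
```rust
type Lsn = u64;

/// Kept across `DataDirModification` commits instead of being dropped after
/// each one, so the "have we seen a new LSN for this open layer?" check can
/// fire even when every individual batch carries a single LSN.
struct TimelineWriterState {
    prev_lsn: Option<Lsn>,
    bytes_in_open_layer: u64,
}

impl TimelineWriterState {
    /// Decide whether to roll the open layer before writing a batch at `lsn`.
    /// Only roll at an LSN boundary; rolling mid-LSN would split a batch.
    fn should_roll(&self, lsn: Lsn, checkpoint_distance: u64) -> bool {
        let at_lsn_boundary = self.prev_lsn.map_or(false, |prev| prev != lsn);
        at_lsn_boundary && self.bytes_in_open_layer >= checkpoint_distance
    }

    fn record_batch(&mut self, lsn: Lsn, batch_bytes: u64) {
        self.prev_lsn = Some(lsn);
        self.bytes_in_open_layer += batch_bytes;
    }
}

fn main() {
    let mut state = TimelineWriterState { prev_lsn: None, bytes_in_open_layer: 0 };
    // Many single-LSN image batches (as in the pgvector index build case) ...
    state.record_batch(0x100, 512 * 1024 * 1024);
    assert!(!state.should_roll(0x100, 256 * 1024 * 1024)); // same LSN: keep writing
    // ... and the roll finally happens once the next LSN shows up.
    assert!(state.should_roll(0x200, 256 * 1024 * 1024));
}
```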
Additional changes are test fixes to avoid "initdb image layer
optimization" or ignoring initdb layers for assertion.
Cc: #7197 because small `checkpoint_distance` will now trigger the
"initdb image layer optimization"
A simple API to collect some statistics after compaction to easily
understand the result.
The tool reads the layer map and analyzes it range by range instead of
doing single-key operations, which is more efficient than running a
benchmark to collect the result. It currently computes two key metrics:
* Latest data access efficiency, which finds how many delta layers /
image layers the system needs to iterate before returning any key in a
key range.
* (Approximate) PiTR efficiency, as in
https://github.com/neondatabase/neon/issues/7770, which is simply the
number of delta files in the range. The reasoning is that, assuming no
image layer is created, the PiTR cost is simply the cost of collecting
records from the delta layers plus the replay time. The number of delta
files (or, in the future, the estimated size of reads) is a simple yet
effective way of estimating how much effort the pageserver needs to
reconstruct a page.
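Roughly, both metrics reduce to counting layers per key range; a simplified sketch with a stand-in layer representation:
```rust
/// Simplified view of the layers covering one key range, ordered newest
/// first (in-memory and delta layers above, image layers below).
#[derive(Clone, Copy)]
enum LayerKind {
    Image,
    Delta,
}

/// Latest-data access efficiency: how many layers a read of the latest data
/// has to visit before it reaches an image layer (or runs out of layers).
fn latest_access_cost(layers_newest_first: &[LayerKind]) -> usize {
    let mut cost = 0;
    for layer in layers_newest_first {
        cost += 1;
        if matches!(layer, LayerKind::Image) {
            break;
        }
    }
    cost
}

/// Approximate PiTR efficiency: the number of delta layers in the range,
/// i.e. how many delta files would need their records collected and replayed.
fn approximate_pitr_cost(layers: &[LayerKind]) -> usize {
    layers.iter().filter(|l| matches!(l, LayerKind::Delta)).count()
}

fn main() {
    use LayerKind::*;
    let layers = [Delta, Delta, Image, Delta];
    assert_eq!(latest_access_cost(&layers), 3);
    assert_eq!(approximate_pitr_cost(&layers), 3);
}
```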
Signed-off-by: Alex Chi Z <chi@neon.tech>
Currently we warn even when going over by a single byte. Even that will be
hit much more rarely once #7927 lands, but let's get this in earlier.
Rationale for 2*checkpoint_distance: anything smaller is not really
worth a warning.
We have a global allowed_error for this warning, which cannot be
removed even with #7927 because many tests use a very small
`checkpoint_distance`.
M-series macOS has different alignments/sizes for some fields (which I
did not investigate in detail), and therefore this test cannot pass on
macOS. Fixed by using `<=` for the comparison so that we do not test for
an exact match.
Observed by @yliang412
Signed-off-by: Alex Chi Z <chi@neon.tech>
Related to #7341: tenant deletion will end up shutting down timelines
twice, once before actually starting and a second time when per-timeline
deletion is requested. Shutting down TimelineMetrics twice causes
underflows. Add an atomic boolean and only do the shutdown once.
Closes #7406.
## Problem
When a `get_lsn_by_timestamp` request is cancelled, an anyhow error is
exposed to handle that case, which verbosely logs the error. However, we
don't benefit from having the full backtrace provided by anyhow in this
case.
## Summary of changes
This PR introduces a new `ApiError` variant to handle errors caused by
cancelled requests more robustly.
- A new enum variant `ApiError::Cancelled`
- Currently the cancelled request is mapped to status code 500.
- Need to handle this error in proxy's `http_util` as well.
- Added a failpoint test to simulate a cancelled `get_lsn_by_timestamp`
request (a sketch of the error variant follows below).
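A minimal sketch of such a variant and its status-code mapping, using `thiserror` and the `http` crate; simplified relative to the real `ApiError`:
```rust
use http::StatusCode;
use thiserror::Error;

#[derive(Debug, Error)]
pub enum ApiError {
    /// The client went away or the request was aborted; expected during
    /// normal operation, so it should be logged without an anyhow backtrace.
    #[error("request was cancelled")]
    Cancelled,

    #[error("internal server error: {0}")]
    InternalServerError(anyhow::Error),
}

impl ApiError {
    pub fn status(&self) -> StatusCode {
        match self {
            // Mapped to 500 for now, mirroring the PR; a more specific
            // status code could be chosen later.
            ApiError::Cancelled => StatusCode::INTERNAL_SERVER_ERROR,
            ApiError::InternalServerError(_) => StatusCode::INTERNAL_SERVER_ERROR,
        }
    }
}

fn main() {
    let err = ApiError::Cancelled;
    println!("{} -> {}", err, err.status());
}
```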
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
As described in #7952, the controller's attempt to reconcile a tenant
before finally deleting it can get hung up waiting for the compute
notification hook to accept updates.
The fact that we try to reconcile a tenant at all during deletion is
part of a more general design issue (#5080), where deletion was
implemented as an operation on an attached tenant, requiring the tenant to
be attached in order to delete it, which is not in principle necessary.
Closes: #7952
## Summary of changes
- In the pageserver deletion API, only do the traditional deletion path
if the tenant is attached. If it's secondary, then tear down the
secondary location, and then do a remote delete. If it's not attached at
all, just do the remote delete.
- In the storage controller, instead of ensuring a tenant is attached
before deletion, do a best-effort detach of the tenant, and then call
into some arbitrary pageserver to issue a deletion of remote content.
The pageserver retains its existing delete behavior when invoked on
attached locations. We can remove this later once all users of the API
are updated to do a detach-before-delete. This will enable
removing the "weird" code paths during startup that sometimes load a
tenant and then immediately delete it, and removing the deletion markers
on tenants.
## Problem
Currently we serialize the `TimelineMetadata` into bytes to put it into
`index_part.json`. This `Vec<u8>` (hopefully `[u8; 512]`) representation
was chosen because of problems serializing TimelineId and Lsn between
different serializers (bincode, json). After #5335, the serialization of
those types became serialization format aware or format agnostic.
We've removed the pageserver local `metadata` file writing in #6769.
## Summary of changes
Allow switching from the current serialization format to plain JSON for
the legacy TimelineMetadata format in the future by adding an alternative
serialization method alongside the current one
(`crate::tenant::metadata::modern_serde`), which accepts both old bytes
and new plain JSON.
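The accept-both behavior can be sketched with an untagged serde enum (a simplification; the real `modern_serde` is a hand-written serde module, and the metadata fields here are illustrative):
```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct TimelineMetadataJson {
    disk_consistent_lsn: String,
    prev_record_lsn: Option<String>,
    pg_version: u32,
}

/// Accept either the legacy byte encoding or the new plain-JSON object for
/// the metadata field; new writes always produce the JSON form.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum MetadataCompat {
    /// Legacy: metadata serialized to bytes, embedded as a JSON array of u8.
    LegacyBytes(Vec<u8>),
    /// Modern: metadata as a regular JSON object.
    Modern(TimelineMetadataJson),
}

fn main() {
    let legacy: MetadataCompat = serde_json::from_str("[1, 2, 3, 4]").unwrap();
    let modern: MetadataCompat = serde_json::from_str(
        r#"{"disk_consistent_lsn": "0/15FD5D8", "prev_record_lsn": "0/15FD5A0", "pg_version": 15}"#,
    )
    .unwrap();
    println!("{legacy:?}\n{modern:?}");
}
```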
The benefit of this is that dumping index_part.json with pretty
printing no longer produces more than 500 lines of output; after
enabling it, the number of lines is only proportional to the layer count, like:
```json
{
"version": ???,
"layer_metadata": { ... },
"disk_consistent_lsn": "0/15FD5D8",
"legacy_metadata": {
"disk_consistent_lsn": "0/15FD5D8",
"prev_record_lsn": "0/15FD5A0",
"ancestor_timeline": null,
"ancestor_lsn": "0/0",
"latest_gc_cutoff_lsn": "0/149FD18",
"initdb_lsn": "0/149FD18",
"pg_version": 15
}
}
```
In the future, I propose we completely stop using this legacy metadata
type, stop wasting time trying to come up with another version-numbering
scheme in addition to the informative-only one already found in
`index_part.json`, and go ahead with storing metadata or feature flags
in `index_part.json` itself.
#7699 contains the "one release after" changes, which start producing
the metadata in index_part.json as JSON.
fixes https://github.com/neondatabase/neon/issues/7790 (duplicating most
of the issue description here for posterity)
# Background
From the time before always-authoritative `index_part.json`, we had to
handle duplicate layers. See the RFC for an illustration of how
duplicate layers could happen:
a8e6d259cb/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md (L41-L50)
As of #5198 , we should not be exposed to that problem anymore.
# Problem 1
We still have
1. [code in
Pageserver](82960b2175/pageserver/src/tenant/timeline.rs (L4502-L4521))
than handles duplicate layers
2. [tests in the test
suite](d9dcbffac3/test_runner/regress/test_duplicate_layers.py (L15))
that demonstrates the problem using a failpoint
However, the test in the test suite doesn't use the failpoint to induce
a crash that could legitimately happen in production.
What it does instead is return early with an `Ok()`, so that the code
in Pageserver that handles duplicate layers (item 1) actually gets
exercised.
That "return early" would be a bug in the routine if it happened in
production.
So, the tests in the test suite are tests for their own sake, but don't
serve to actually regress-test any production behavior.
# Problem 2
Further, if production code _did_ (it nowadays doesn't!) create a
duplicate layer, the code in Pageserver that handles the condition (item
1 above) is too little and too late:
* the code handles it by discarding the newer `struct Layer`; that's
good.
* however, on disk, we have already overwritten the old with the new
layer file
* the fact that we do it atomically doesn't matter because ...
* if the new layer file is not bit-identical, then we have a cache
coherency problem
* PS PageCache block cache: caches the old bit pattern
* blob_io offsets stored in variables, based on pre-overwrite bit
pattern / offsets
* => reading based on these offsets from the new file might yield
different data than before
# Solution
- Remove the test suite code pertaining to Problem 1
- Move & rename test suite code that actually tests RFC-27
crash-consistent layer map.
- Remove the Pageserver code that handles duplicate layers too late
(Problem 1)
- Use `RENAME_NOREPLACE` to prevent overwriting the file via rename during
`.finish()`, and bail with an error if it happens (Problem 2); see the
sketch after this list
- This bailing prevents the caller from even trying to insert into the
layer map, as they don't even get a `struct Layer` at hand.
- Add `abort`s in the place where we have the layer map lock and check
for duplicates (Problem 2)
- Note again, we can't reach there because we bail from `.finish()` much
earlier in the code.
- Share the logic to clean up after failed `.finish()` between image
layers and delta layers (drive-by cleanup)
- This exposed that test `image_layer_rewrite` was overwriting layer
files in place. Fix the test.
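For reference, a no-replace rename on Linux is the `renameat2(2)` syscall with the `RENAME_NOREPLACE` flag. A minimal, Linux-only sketch via the `libc` crate (illustrative only, not the helper used in this PR; requires a libc/glibc that exposes `renameat2`):
```rust
use std::ffi::CString;
use std::io;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

/// Rename `src` to `dst`, failing with EEXIST instead of silently replacing
/// an existing `dst`. Linux-only: renameat2(2) with RENAME_NOREPLACE.
fn rename_noreplace(src: &Path, dst: &Path) -> io::Result<()> {
    let src = CString::new(src.as_os_str().as_bytes())?;
    let dst = CString::new(dst.as_os_str().as_bytes())?;
    let ret = unsafe {
        libc::renameat2(
            libc::AT_FDCWD,
            src.as_ptr(),
            libc::AT_FDCWD,
            dst.as_ptr(),
            libc::RENAME_NOREPLACE,
        )
    };
    if ret == 0 {
        Ok(())
    } else {
        // EEXIST here means a would-be duplicate layer: bail instead of
        // silently overwriting the existing file.
        Err(io::Error::last_os_error())
    }
}

fn main() -> io::Result<()> {
    std::fs::write("layer.tmp", b"new layer contents")?;
    std::fs::write("layer.final", b"pre-existing layer")?;
    // This must fail with EEXIST rather than overwrite layer.final.
    assert!(rename_noreplace(Path::new("layer.tmp"), Path::new("layer.final")).is_err());
    Ok(())
}
```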
# Future Work
This PR adds a new failure scenario that was previously "papered over"
by the overwriting of layers:
1. Start a compaction that will produce 3 layers: A, B, C
2. Layer A is `finish()`ed successfully.
3. Layer B fails mid-way at some `put_value()`.
4. Compaction bails out, sleeps 20s.
5. Some disk space gets freed in the meantime.
6. Compaction wakes from sleep, another iteration starts, it attempts to
write Layer A again. But the `.finish()` **fails because A already
exists on disk**.
The failure in step 6 is new with this PR, and it **causes the
compaction to get stuck**.
Before, it would silently overwrite the file and "successfully" complete
the second iteration.
The mitigation for this is to `/reset` the tenant.
## Problem
This was an oversight when adding heatmaps: because they are at the top
level of the tenant, they aren't included in the catch-all list & delete
that happens for timeline paths.
This doesn't break anything, but it leaves behind a few kilobytes of
garbage in the S3 bucket after a tenant is deleted, generating work for
the scrubber.
## Summary of changes
- During deletion, explicitly remove the heatmap file
- In test_tenant_delete_smoke, upload a heatmap so that the test would
fail its "remote storage empty after delete" check if we didn't delete
it.
RemoteTimelineClient maintains a copy of the "next IndexPart" as a number of
fields which together look like an IndexPart, but this is not immediately obvious.
Instead of multiple fields, maintain a `dirty` ("next IndexPart") and
`clean` ("uploaded IndexPart") fields.
Additional cleanup:
- rename the `IndexPart::disk_consistent_lsn` accessor to
`duplicated_disk_consistent_lsn`
  - no one except the scrubber should be looking at it, and even the scrubber
  is a stretch
  - remove usage elsewhere (pagectl used by tests, metadata scan endpoint)
- serialize the index part *before* the index upload operation
  - avoids the upload operation being retried because of a serialization error
  - a serialization error is fatal for the timeline anyway -- it can only make
  transient local progress after that; at least the error is bubbled up now
- gather the exploded IndexPart fields into a single actual
`UploadQueueInitialized::dirty`, of which the uploaded snapshot is
serialized
- implement the long-wished monotonicity check against the `clean`
IndexPart with an assertion which is not expected to fire
Continued work from #7860 towards next step of #6994.
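The shape of the `dirty`/`clean` split, as a simplified sketch (the real `UploadQueueInitialized` tracks far more state, and the check runs when an index upload completes):
```rust
/// Simplified stand-in for IndexPart.
#[derive(Clone, Debug, PartialEq)]
struct IndexPart {
    disk_consistent_lsn: u64,
    layer_count: usize,
}

struct UploadQueueInitialized {
    /// "Next IndexPart": mutated as operations are scheduled.
    dirty: IndexPart,
    /// "Uploaded IndexPart": the last snapshot known to be in remote storage.
    clean: IndexPart,
}

impl UploadQueueInitialized {
    /// Called when an index upload completes, with the snapshot it carried.
    fn on_index_uploaded(&mut self, uploaded: IndexPart) {
        // The long-wished monotonicity check: uploads must never move the
        // remote disk_consistent_lsn backwards.
        assert!(
            uploaded.disk_consistent_lsn >= self.clean.disk_consistent_lsn,
            "uploaded index regressed: {:?} < {:?}",
            uploaded.disk_consistent_lsn,
            self.clean.disk_consistent_lsn,
        );
        self.clean = uploaded;
    }
}

fn main() {
    let initial = IndexPart { disk_consistent_lsn: 0x100, layer_count: 1 };
    let mut queue = UploadQueueInitialized { dirty: initial.clone(), clean: initial };
    queue.dirty.layer_count += 1;
    queue.dirty.disk_consistent_lsn = 0x200;
    // Snapshot `dirty` at scheduling time; the snapshot is what gets uploaded.
    let snapshot = queue.dirty.clone();
    queue.on_index_uploaded(snapshot);
    assert_eq!(queue.clean.disk_consistent_lsn, 0x200);
}
```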
I suspected a wakeup could be lost with
`remote_storage::support::DownloadStream` if the cancellation and inner
stream wakeups happen simultaneously. The next poll would only return
the cancellation error without setting the wakeup. However, there is no lost
wakeup, because the single future for getting the cancellation error is
consumed when the value is ready, and a new future is created for the
*next* value. The new future is always polled. Similarly, if only the
`Stream::poll_next` is being used after a `Some(_)` value has been
yielded, it makes no sense to have an expectation of a wakeup for the
*(N+1)th* stream value already set because when a value is wanted,
`Stream::poll_next` will be called.
A test is added to show that the above is true.
Additionally, there was a question of whether these cancellations and
timeouts flow to attached or secondary tenant downloads. A test is added to
show that they, in fact, do.
Lastly, a warning message is logged when a download stream is polled
after a timeout or cancellation error (currently unexpected) so we can
rule it out while troubleshooting.
## Problem
The page LSN is not set during VM updates.
This may be the reason for test_vm_bits flakiness,
but more serious issues can also be caused by a wrong LSN.
Related: https://github.com/neondatabase/neon/pull/7935
## Summary of changes
- In `apply_in_neon`, set the LSN bytes when applying records of type
`ClearVisibilityMapFlags`
## Problem
CreateImageLayersError and CompactionError had proper From
implementations, but compact_legacy was explicitly squashing all image
layer errors into an anyhow::Error anyway.
This led to errors like:
```
Error processing HTTP request: InternalServerError(timeline shutting down
Stack backtrace:
0: <<anyhow::Error as core::convert::From<pageserver::tenant::timeline::CreateImageLayersError>>::from as core::ops::function::FnOnce<(pageserver::tenant::timeline::CreateImageLayersError,)>>::call_once
at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
1: <core::result::Result<alloc::vec::Vec<pageserver::tenant::storage_layer::layer::ResidentLayer>, pageserver::tenant::timeline::CreateImageLayersError>>::map_err::<anyhow::Error, <anyhow::Error as core::convert::From<pageserver::tenant::timeline::CreateImageLayersError>>::from>
at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:829:27
2: <pageserver::tenant::timeline::Timeline>::compact_legacy::{closure#0}
at pageserver/src/tenant/timeline/compaction.rs:125:36
3: <pageserver::tenant::timeline::Timeline>::compact::{closure#0}
at pageserver/src/tenant/timeline.rs:1719:84
4: pageserver::http::routes::timeline_checkpoint_handler::{closure#0}::{closure#0}
```
Closes: https://github.com/neondatabase/neon/issues/7861
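A sketch of the intended mapping with simplified enums, where the shutdown case is preserved instead of being flattened into `anyhow::Error`:
```rust
use thiserror::Error;

#[derive(Debug, Error)]
enum CreateImageLayersError {
    #[error("timeline shutting down")]
    Cancelled,
    #[error("creating image layers failed: {0}")]
    Other(#[from] anyhow::Error),
}

#[derive(Debug, Error)]
enum CompactionError {
    /// Shutdown is expected during normal operation and should not be logged
    /// as an internal error with a backtrace.
    #[error("compaction cancelled, timeline shutting down")]
    ShuttingDown,
    #[error(transparent)]
    Other(#[from] anyhow::Error),
}

// The point of the change: keep the cancellation variant intact instead of
// squashing everything into anyhow::Error.
impl From<CreateImageLayersError> for CompactionError {
    fn from(e: CreateImageLayersError) -> Self {
        match e {
            CreateImageLayersError::Cancelled => CompactionError::ShuttingDown,
            CreateImageLayersError::Other(err) => CompactionError::Other(err),
        }
    }
}

fn main() {
    let shutdown: CompactionError = CreateImageLayersError::Cancelled.into();
    // The HTTP handler can now match on ShuttingDown and log it quietly.
    assert!(matches!(shutdown, CompactionError::ShuttingDown));
}
```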
Store logical replication origin in KV storage
## Problem
See #6977
## Summary of changes
* Extract origin_lsn from the commit WAL record
* Add ReplOrigin key to KV storage and store origin_lsn
* In basebackup, replace the snapshot origin_lsn with the last committed
origin_lsn at the basebackup LSN (a sketch follows below)
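Illustratively, the flow is: while ingesting a commit record that carries a replication origin, record the origin LSN under a per-origin key; at basebackup time, read the latest value of that key at the basebackup LSN. A sketch with stand-in types, not the pageserver's actual key encoding:
```rust
use std::collections::BTreeMap;

type Lsn = u64;
type RepOriginId = u16;

/// Stand-in for the pageserver key space: one logical key per replication
/// origin, versioned by the LSN at which it was written.
#[derive(Default)]
struct KvStore {
    // (origin id, write lsn) -> origin_lsn recorded in the commit record
    repl_origin: BTreeMap<(RepOriginId, Lsn), Lsn>,
}

impl KvStore {
    /// Called while ingesting a commit WAL record that carries an origin.
    fn put_origin(&mut self, origin: RepOriginId, commit_lsn: Lsn, origin_lsn: Lsn) {
        self.repl_origin.insert((origin, commit_lsn), origin_lsn);
    }

    /// Called by basebackup: last committed origin_lsn at or below `at_lsn`.
    fn origin_lsn_at(&self, origin: RepOriginId, at_lsn: Lsn) -> Option<Lsn> {
        self.repl_origin
            .range((origin, 0)..=(origin, at_lsn))
            .next_back()
            .map(|(_, origin_lsn)| *origin_lsn)
    }
}

fn main() {
    let mut store = KvStore::default();
    // Ingest: two commits from origin 1, at different LSNs.
    store.put_origin(1, 0x100, 0x1000);
    store.put_origin(1, 0x300, 0x2000);
    // Basebackup at LSN 0x200 sees only the first commit's origin_lsn.
    assert_eq!(store.origin_lsn_at(1, 0x200), Some(0x1000));
}
```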
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>