Commit Graph

2189 Commits

Author SHA1 Message Date
Christian Schwarz
82719542c6 fix: vectored get returns incorrect result on inexact materialized page cache hit (#8050)
# Problem

Suppose our vectored get starts with an inexact materialized page cache
hit ("cached lsn") that is shadowed by a newer image layer image layer.
Like so:


```
    <inmemory layers>

    +-+ < delta layer
    | |
   -|-|----- < image layer
    | |
    | |
   -|-|----- < cached lsn for requested key
    +_+
```

The correct visitation order is
1. inmemory layers
2. delta layer records in LSN range `[image_layer.lsn,
oldest_inmemory_layer.lsn_range.start)`
3. image layer

However, the vectored get code, when it visits the delta layer, it
(incorrectly!) returns with state `Complete`.

The reason why it returns is that it calls `on_lsn_advanced` with
`self.lsn_range.start`, i.e., the layer's LSN range.

Instead, it should use `lsn_range.start`, i.e., the LSN range from the
correct visitation order listed above.

# Solution

Use `lsn_range.start` instead of `self.lsn_range.start`.

# Refs

discovered by & fixes https://github.com/neondatabase/neon/issues/6967

Co-authored-by: Vlad Lazar <vlad@neon.tech>
2024-06-13 18:20:47 +00:00
Alex Chi Z
d25f7e3dd5 test(pageserver): add test wal record for unit testing (#8015)
https://github.com/neondatabase/neon/issues/8002

We need mock WAL record to make it easier to write unit tests. This pull
request adds such a record. It has `clear` flag and `append` field. The
tests for legacy-enhanced compaction are not modified yet and will be
part of the next pull request.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-13 09:44:37 -04:00
Alex Chi Z
836d1f4af7 test(pageserver): add test keyspace into collect_keyspace (#8016)
Some test cases add random keys into the timeline, but it is not part of
the `collect_keyspace`, this will cause compaction remove the keys.

The pull request adds a field to supply extra keyspaces during unit
tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-12 17:42:43 +00:00
Alex Chi Z
4c2100794b feat(pageserver): initial code sketch & test case for combined gc+compaction at gc_horizon (#7948)
A demo for a building block for compaction. The GC-compaction operation
iterates all layers below/intersect with the GC horizon, and do a full
layer rewrite of all of them. The end result will be image layer
covering the full keyspace at GC-horizon, and a bunch of delta layers
above the GC-horizon. This helps us collect the garbages of the
test_gc_feedback test case to reduce space amplification.

This operation can be manually triggered using an HTTP API or be
triggered based on some metrics. Actual method TBD.

The test is very basic and it's very likely that most part of the
algorithm will be rewritten. I would like to get this merged so that I
can have a basic skeleton for the algorithm and then make incremental
changes.

<img width="924" alt="image"
src="https://github.com/neondatabase/neon/assets/4198311/f3d49f4e-634f-4f56-986d-bfefc6ae6ee2">

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-11 14:14:51 +00:00
Joonas Koivunen
d3b892e9ad test: fix duplicated harness name (#8010)
We need unique tenant harness names in case you want to inspect the
results of the last failing run. We are not using any proc macros to get
the test name as there is no stable way of doing that, and there will
not be one in the future, so we need to fix these duplicates.

Also, clean up the duplicated tests to not mix `?` and `unwrap/assert`.
2024-06-11 10:10:05 -04:00
Joonas Koivunen
7515d0f368 fix: stop storing TimelineMetadata in index_part.json as bytes (#7699)
We've stored metadata as bytes within the `index_part.json` for 
long fixed reasons. #7693 added support for reading out normal json
serialization of the `TimelineMetadata`.

Change the serialization to only write `TimelineMetadata` as json for
going forward, keeping the backward compatibility to reading the
metadata as bytes. Because of failure to include `alias = "metadata"` in
#7693, one more follow-up is required to make the switch from the old
name to `"metadata": <json>`, but that affects only the field name in
serialized format.

In documentation and naming, an effort is made to add enough warning
signs around TimelineMetadata so that it will receive no changes in the
future. We can add those fields to `IndexPart` directly instead.

Additionally, the path to cleaning up `metadata.rs` is documented in the
`metadata.rs` module comment. If we must extend `TimelineMetadata`
before that, the duplication suggested in [review comment] is the way to
go.

[review comment]:
https://github.com/neondatabase/neon/pull/7699#pullrequestreview-2107081558
2024-06-11 15:38:54 +03:00
Joonas Koivunen
e46692788e refactor: Timeline layer flushing (#7993)
The new features have deteriorated layer flushing, most recently with
#7927. Changes:

- inline `Timeline::freeze_inmem_layer` to the only caller
- carry the TimelineWriterState guard to the actual point of freezing
the layer
- this allows us to `#[cfg(feature = "testing")]` the assertion added in
#7927
- remove duplicate `flush_frozen_layer` in favor of splitting the
`flush_frozen_layers_and_wait`
- this requires starting the flush loop earlier for `checkpoint_distance
< initdb size` tests
2024-06-10 19:34:34 +03:00
Joonas Koivunen
b52e31c1a4 fix: allow layer flushes more often (#7927)
As seen with the pgvector 0.7.0 index builds, we can receive large
batches of images, leading to very large L0 layers in the range of 1GB.
These large layers are produced because we are only able to roll the
layer after we have witnessed two different Lsns in a single
`DataDirModification::commit`. As the single Lsn batches of images can
span over multiple `DataDirModification` lifespans, we will rarely get
to write two different Lsns in a single `put_batch` currently.

The solution is to remember the TimelineWriterState instead of eagerly
forgetting it until we really open the next layer or someone else
flushes (while holding the write_guard).

Additional changes are test fixes to avoid "initdb image layer
optimization" or ignoring initdb layers for assertion.

Cc: #7197 because small `checkpoint_distance` will now trigger the
"initdb image layer optimization"
2024-06-10 13:50:17 +00:00
Alex Chi Z
3e63d0f9e0 test(pageserver): quantify compaction outcome (#7867)
A simple API to collect some statistics after compaction to easily
understand the result.

The tool reads the layer map, and analyze range by range instead of
doing single-key operations, which is more efficient than doing a
benchmark to collect the result. It currently computes two key metrics:

* Latest data access efficiency, which finds how many delta layers /
image layers the system needs to iterate before returning any key in a
key range.
* (Approximate) PiTR efficiency, as in
https://github.com/neondatabase/neon/issues/7770, which is simply the
number of delta files in the range. The reason behind that is, assume no
image layer is created, PiTR efficiency is simply the cost of collect
records from the delta layers, and the replay time. Number of delta
files (or in the future, estimated size of reads) is a simple yet
efficient way of estimating how much effort the page server needs to
reconstruct a page.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-10 10:42:13 +02:00
Joonas Koivunen
d46d19456d raise the warning for oversized L0 to 2*target (#7985)
currently we warn even by going over a single byte. even that will be
hit much more rarely once #7927 lands, but get this in earlier.

rationale for 2*checkpoint_distance: anything smaller is not really
worth a warn.

we have an global allowed_error for this warning, which still cannot be
removed nor can it be removed with #7927 because of many tests with very
small `checkpoint_distance`.
2024-06-06 20:18:39 +03:00
Alex Chi Z
5d05013857 fix(pageserver): skip metadata compaction is LSN is not accumulated enough (#7962)
close https://github.com/neondatabase/neon/issues/7937

Only trigger metadata image layer creation if enough delta layers are
accumulated.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-06 11:34:44 -04:00
Alex Chi Z
014509987d fix(pageserver): more flexible layer size test (#7945)
M-series macOS has different alignments/size for some fields (which I
did not investigate in detail) and therefore this test cannot pass on
macOS. Fixed by using `<=` for the comparison so that we do not test for
an exact match.

observed by @yliang412 

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-06 14:40:58 +00:00
Joonas Koivunen
a8be07785e fix: do TimelineMetrics::shutdown only once (#7983)
Related to #7341 tenant deletion will end up shutting down timelines
twice, once before actually starting and the second time when per
timeline deletion is requested. Shutting down TimelineMetrics causes
underflows. Add an atomic boolean and only do the shutdown once.
2024-06-06 14:20:54 +00:00
Yuchen Liang
630cfbe420 refactor(pageserver): designated api error type for cancelled request (#7949)
Closes #7406.

## Problem

When a `get_lsn_by_timestamp` request is cancelled, an anyhow error is
exposed to handle that case, which verbosely logs the error. However, we
don't benefit from having the full backtrace provided by anyhow in this
case.

## Summary of changes

This PR introduces a new `ApiError` type to handle errors caused by
cancelled request more robustly.
-  A new enum variant `ApiError::Cancelled`
- Currently the cancelled request is mapped to status code 500.
- Need to handle this error in proxy's `http_util` as well.
- Added a failpoint test to simulate cancelled `get_lsn_by_timestamp`
request.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-06-06 14:00:14 +00:00
Christian Schwarz
0a65333fff chore(walredo): avoid duplicate tenant_id and shard_slug fields (#7977)
spotted during reviews of async walredo work in #6628
2024-06-06 15:10:16 +02:00
John Spray
91dd99038e pageserver/controller: enable tenant deletion without attachment (#7957)
## Problem

As described in #7952, the controller's attempt to reconcile a tenant
before finally deleting it can get hung up waiting for the compute
notification hook to accept updates.

The fact that we try and reconcile a tenant at all during deletion is
part of a more general design issue (#5080), where deletion was
implemented as an operation on attached tenant, requiring the tenant to
be attached in order to delete it, which is not in principle necessary.

Closes: #7952

## Summary of changes

- In the pageserver deletion API, only do the traditional deletion path
if the tenant is attached. If it's secondary, then tear down the
secondary location, and then do a remote delete. If it's not attached at
all, just do the remote delete.
- In the storage controller, instead of ensuring a tenant is attached
before deletion, do a best-effort detach of the tenant, and then call
into some arbitrary pageserver to issue a deletion of remote content.

The pageserver retains its existing delete behavior when invoked on
attached locations. We can remove this later when all users of the API
are updated to either do a detach-before-delete. This will enable
removing the "weird" code paths during startup that sometimes load a
tenant and then immediately delete it, and removing the deletion markers
on tenants.
2024-06-05 20:22:54 +00:00
Christian Schwarz
83ab14e271 chore!: remove walredo_process_kind config option & kind type (#7756)
refs https://github.com/neondatabase/neon/issues/7753

Preceding PR https://github.com/neondatabase/neon/pull/7754
laid out the plan, this one wraps it up.
2024-06-05 14:21:10 +02:00
Alex Chi Z
1a8d53ab9d feat(pageserver): compute aux file size on initial logical size calculation (#7958)
close https://github.com/neondatabase/neon/issues/7822
close https://github.com/neondatabase/neon/issues/7443

Aux file metrics is computed incrementally. If the size is not
initialized, the metrics will never show up. This pull request adds the
functionality to compute the aux file size on initial logical size
calculation.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-04 13:47:48 -04:00
Joonas Koivunen
3d6e389aa2 feat: support changing IndexPart::metadata_bytes to json in future release (#7693)
## Problem

Currently we serialize the `TimelineMetadata` into bytes to put it into
`index_part.json`. This `Vec<u8>` (hopefully `[u8; 512]`) representation
was chosen because of problems serializing TimelineId and Lsn between
different serializers (bincode, json). After #5335, the serialization of
those types became serialization format aware or format agnostic.

We've removed the pageserver local `metadata` file writing in #6769.

## Summary of changes

Allow switching from the current serialization format to plain JSON for
the legacy TimelineMetadata format in the future by adding a competitive
serialization method to the current one
(`crate::tenant::metadata::modern_serde`), which accepts both old bytes
and new plain JSON.

The benefits of this are that dumping the index_part.json with pretty
printing no longer produces more than 500 lines of output, but after
enabling it produces lines only proportional to the layer count, like:

```json
{
  "version": ???,
  "layer_metadata": { ... },
  "disk_consistent_lsn": "0/15FD5D8",
  "legacy_metadata": {
    "disk_consistent_lsn": "0/15FD5D8",
    "prev_record_lsn": "0/15FD5A0",
    "ancestor_timeline": null,
    "ancestor_lsn": "0/0",
    "latest_gc_cutoff_lsn": "0/149FD18",
    "initdb_lsn": "0/149FD18",
    "pg_version": 15
  }
}
```

In the future, I propose we completely stop using this legacy metadata
type and wasting time trying to come up with another version numbering
scheme in addition to the informative-only one already found in
`index_part.json`, and go ahead with storing metadata or feature flags
on the `index_part.json` itself.

#7699 is the "one release after" changes which starts to produce
metadata in the index_part.json as json.
2024-06-04 19:36:22 +03:00
Christian Schwarz
17116f2ea9 fix(pageserver): abort on duplicate layers, before doing damage (#7799)
fixes https://github.com/neondatabase/neon/issues/7790 (duplicating most
of the issue description here for posterity)

# Background

From the time before always-authoritative `index_part.json`, we had to
handle duplicate layers. See the RFC for an illustration of how
duplicate layers could happen:
a8e6d259cb/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md (L41-L50)

As of #5198 , we should not be exposed to that problem anymore.

# Problem 1

We still have
1. [code in
Pageserver](82960b2175/pageserver/src/tenant/timeline.rs (L4502-L4521))
than handles duplicate layers
2. [tests in the test
suite](d9dcbffac3/test_runner/regress/test_duplicate_layers.py (L15))
that demonstrates the problem using a failpoint

However, the test in the test suite doesn't use the failpoint to induce
a crash that could legitimately happen in production.
What is does instead is to return early with an `Ok()`, so that the code
in Pageserver that handles duplicate layers (item 1) actually gets
exercised.

That "return early" would be a bug in the routine if it happened in
production.
So, the tests in the test suite are tests for their own sake, but don't
serve to actually regress-test any production behavior.

# Problem 2

Further, if production code _did_ (it nowawdays doesn't!) create a
duplicate layer, the code in Pageserver that handles the condition (item
1 above) is too little and too late:

* the code handles it by discarding the newer `struct Layer`; that's
good.
* however, on disk, we have already overwritten the old with the new
layer file
* the fact that we do it atomically doesn't matter because ...
* if the new layer file is not bit-identical, then we have a cache
coherency problem
  * PS PageCache block cache: caches old bit battern
* blob_io offsets stored in variables, based on pre-overwrite bit
pattern / offsets
* => reading based on these offsets from the new file might yield
different data than before
 
# Solution

- Remove the test suite code pertaining to Problem 1
- Move & rename test suite code that actually tests RFC-27
crash-consistent layer map.
- Remove the Pageserver code that handles duplicate layers too late
(Problem 1)
- Use `RENAME_NOREPLACE` to prevent over-rename the file during
`.finish()`, bail with an error if it happens (Problem 2)
- This bailing prevents the caller from even trying to insert into the
layer map, as they don't even get a `struct Layer` at hand.
- Add `abort`s in the place where we have the layer map lock and check
for duplicates (Problem 2)
- Note again, we can't reach there because we bail from `.finish()` much
earlier in the code.
- Share the logic to clean up after failed `.finish()` between image
layers and delta layers (drive-by cleanup)
- This exposed that test `image_layer_rewrite` was overwriting layer
files in place. Fix the test.

# Future Work

This PR adds a new failure scenario that was previously "papered over"
by the overwriting of layers:
1. Start a compaction that will produce 3 layers: A, B, C
2. Layer A is `finish()`ed successfully.
3. Layer B fails mid-way at some `put_value()`.
4. Compaction bails out, sleeps 20s.
5. Some disk space gets freed in the meantime.
6. Compaction wakes from sleep, another iteration starts, it attempts to
write Layer A again. But the `.finish()` **fails because A already
exists on disk**.

The failure in step 5 is new with this PR, and it **causes the
compaction to get stuck**.
Before, it would silently overwrite the file and "successfully" complete
the second iteration.

The mitigation for this is to `/reset` the tenant.
2024-06-04 16:16:23 +00:00
John Spray
fd22fc5b7d pageserver: include heatmap in tenant deletion (#7928)
## Problem

This was an oversight when adding heatmaps: because they are at the top
level of the tenant, they aren't included in the catch-all list & delete
that happens for timeline paths.

This doesn't break anything, but it leaves behind a few kilobytes of
garbage in the S3 bucket after a tenant is deleted, generating work for
the scrubber.

## Summary of changes

- During deletion, explicitly remove the heatmap file
- In test_tenant_delete_smoke, upload a heatmap so that the test would
fail its "remote storage empty after delete" check if we didn't delete
it.
2024-06-04 16:16:50 +01:00
Joonas Koivunen
0112097e13 feat(rtc): maintain dirty and uploaded IndexPart (#7833)
RemoteTimelineClient maintains a copy of "next IndexPart" as a number of
fields which are like an IndexPart but this is not immediately obvious.
Instead of multiple fields, maintain a `dirty` ("next IndexPart") and
`clean` ("uploaded IndexPart") fields.

Additional cleanup:
- rename `IndexPart::disk_consistent_lsn` accessor
`duplicated_disk_consistent_lsn`
- no one except scrubber should be looking at it, even scrubber is a
stretch
- remove usage elsewhere (pagectl used by tests, metadata scan endpoint)
- serialize index part *before* the index upload operation
- avoid upload operation being retried because of serialization error
- serialization error is fatal anyway for timeline -- it can only make
transient local progress after that, at least the error is bubbled up
now
- gather exploded IndexPart fields into single actual
`UploadQueueInitialized::dirty` of which the uploaded snapshot is
serialized
- implement the long wished monotonicity check with the `clean`
IndexPart with an assertion which is not expected to fire

Continued work from #7860 towards next step of #6994.
2024-06-04 17:27:08 +03:00
Joonas Koivunen
0acb604fa3 test: no missed wakeups, cancellation and timeout flow to downloads (#7863)
I suspected a wakeup could be lost with
`remote_storage::support::DownloadStream` if the cancellation and inner
stream wakeups happen simultaneously. The next poll would only return
the cancellation error without setting the wakeup. There is no lost
wakeup because the single future for getting the cancellation error is
consumed when the value is ready, and a new future is created for the
*next* value. The new future is always polled. Similarly, if only the
`Stream::poll_next` is being used after a `Some(_)` value has been
yielded, it makes no sense to have an expectation of a wakeup for the
*(N+1)th* stream value already set because when a value is wanted,
`Stream::poll_next` will be called.

A test is added to show that the above is true.

Additionally, there was a question of these cancellations and timeouts
flowing to attached or secondary tenant downloads. A test is added to
show that this, in fact, happens.

Lastly, a warning message is logged when a download stream is polled
after a timeout or cancellation error (currently unexpected) so we can
rule it out while troubleshooting.
2024-06-04 14:19:36 +03:00
Konstantin Knizhnik
387a36874c Set page LSN when reconstructing VM in page server (#7935)
## Problem

Page LSN is not set while VM update.
May be reason of test_vm_bits flukyness.
Buit more serious issues can be also caused by wrong LSN.

Related: https://github.com/neondatabase/neon/pull/7935

## Summary of changes

- In `apply_in_neon`, set the LSN bytes when applying records of type
`ClearVisibilityMapFlags`
2024-06-04 09:56:03 +01:00
John Spray
11bb265de1 pageserver: don't squash all image layer generation errors into anyhow::Error (#7943)
## Problem

CreateImageLayersError and CompactionError had proper From
implementations, but compact_legacy was explicitly squashing all image
layer errors into an anyhow::Error anyway.

This led to errors like:
```
 Error processing HTTP request: InternalServerError(timeline shutting down

Stack backtrace:
   0: <<anyhow::Error as core::convert::From<pageserver::tenant::timeline::CreateImageLayersError>>::from as core::ops::function::FnOnce<(pageserver::tenant::timeline::CreateImageLayersError,)>>::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
   1: <core::result::Result<alloc::vec::Vec<pageserver::tenant::storage_layer::layer::ResidentLayer>, pageserver::tenant::timeline::CreateImageLayersError>>::map_err::<anyhow::Error, <anyhow::Error as core::convert::From<pageserver::tenant::timeline::CreateImageLayersError>>::from>
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:829:27
   2: <pageserver::tenant::timeline::Timeline>::compact_legacy::{closure#0}
             at pageserver/src/tenant/timeline/compaction.rs:125:36
   3: <pageserver::tenant::timeline::Timeline>::compact::{closure#0}
             at pageserver/src/tenant/timeline.rs:1719:84
   4: pageserver::http::routes::timeline_checkpoint_handler::{closure#0}::{closure#0}
```

Closes: https://github.com/neondatabase/neon/issues/7861
2024-06-03 22:10:13 +02:00
Konstantin Knizhnik
7006caf3a1 Store logical replication origin in KV storage (#7099)
Store logical replication origin in KV storage

## Problem

See  #6977

## Summary of changes

* Extract origin_lsn from commit WAl record
* Add ReplOrigin key to KV storage and store origin_lsn
* In basebackup replace snapshot origin_lsn with last committed
origin_lsn at basebackup LSN

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>
2024-06-03 19:37:33 +03:00
Arpad Müller
acf0a11fea Move keyspace utils to inherent impls (#7929)
The keyspace utils like `is_rel_size_key` or `is_rel_fsm_block_key` and
many others are free functions and have to be either imported separately
or specified with the full path starting in `pageserver_api:🔑:`.
This is less convenient than if these functions were just inherent
impls.

Follow-up of #7890
Fixes #6438
2024-06-03 16:18:07 +02:00
Alex Chi Z
c1f55c1525 feat(pageserver): collect aux file tombstones (#7900)
close https://github.com/neondatabase/neon/issues/7800

This is a small change to enable the tombstone -> exclude from image
layer path. Most of the pull request is unit tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-03 09:56:36 -04:00
John Spray
7e60563910 pageserver: add GcError type (#7917)
## Problem

- Because GC exposes all errors as an anyhow::Error, we have
intermittent issues with spurious log errors during shutdown, e.g. in
this failure of a performance test
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9300804302/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/214a2154f6f0217a/

```
Gc failed 1 times, retrying in 2s: shutting down
```

GC really doesn't do a lot of complicated IO: it doesn't benefit from
the backtrace capabilities of anyhow::Error, and can be expressed more
robustly as an enum.

## Summary of changes

- Add GcError type and use it instead of anyhow::Error in GC functions
- In `gc_iteration_internal`, return GcError::Cancelled on shutdown
rather than Ok(()) (we only used Ok before because we didn't have a
clear cancellation error variant to use).
- In `gc_iteration_internal`, skip past timelines that are shutting
down, to avoid having to go through another GC iteration if we happen to
see a deleting timeline during a GC run.
- In `refresh_gc_info_internal`, avoid an error case where a timeline
might not be found after being looked up, by carrying an Arc<Timeline>
instead of a TimelineId between the first loop and second loop in the
function.
- In HTTP request handler, handle Cancelled variants as 503 instead of
turning all GC errors into 500s.
2024-05-31 22:20:06 +01:00
Joonas Koivunen
ef83f31e77 pagectl: key command for dumping what we know about the key (#7890)
What we know about the key via added `pagectl key $key` command:
- debug formatting
- shard placement when `--shard-count` is specified
- different boolean queries in `key.rs`
- aux files v2

Example:

```
$ cargo run -qp pagectl -- key 000000063F00004005000060270000100E2C
parsed from hex: 000000063F00004005000060270000100E2C:

Key { field1: 0, field2: 1599, field3: 16389, field4: 24615, field5: 0, field6: 1052204 }
rel_block:         true
rel_vm_block:      false
rel_fsm_block:     false
slru_block:        false
inherited:         true
rel_size:          false
slru_segment_size: false
recognized kind:   None
```
2024-05-31 18:19:41 +00:00
John Spray
9fda85b486 pageserver: remove AncestorStopping error variants (#7916)
## Problem

In all cases, AncestorStopping is equivalent to Cancelled.

This became more obvious in
https://github.com/neondatabase/neon/pull/7912#discussion_r1620582309
when updating these error types.

## Summary of changes

- Remove AncestorStopping, always use Cancelled instead
2024-05-31 17:02:10 +01:00
Alex Chi Z
87afbf6b24 test(pageserver): add test interface to create artificial layers (#7899)
This pull request adds necessary interfaces to deterministically create
scenarios we want to test. Simplify some test cases to use this
interface to make it stable + reproducible.

Compaction test will be able to use this interface. Also the upcoming
delete tombstone tests will use this interface to make test
reproducible.

## Summary of changes

* `force_create_image_layer`
* `force_create_delta_layer`
* `force_advance_lsn`
* `create_test_timeline_with_states`
* `branch_timeline_test_with_states`

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-31 12:00:40 -04:00
Arthur Petukhovsky
16b2e74037 Add FullAccessTimeline guard in safekeepers (#7887)
This is a preparation for
https://github.com/neondatabase/neon/issues/6337.

The idea is to add FullAccessTimeline, which will act as a guard for
tasks requiring access to WAL files. Eviction will be blocked on these
tasks and WAL won't be deleted from disk until there is at least one
active FullAccessTimeline.

To get FullAccessTimeline, tasks call `tli.full_access_guard().await?`.
After eviction is implemented, this function will be responsible for
downloading missing WAL file and waiting until the download finishes.

This commit also contains other small refactorings:
- Separate `get_tenant_dir` and `get_timeline_dir` functions for
building a local path. This is useful for looking at usages and finding
tasks requiring access to local filesystem.
- `timeline_manager` is now responsible for spawning all background
tasks
- WAL removal task is now spawned instantly after horizon is updated
2024-05-31 13:19:45 +00:00
John Spray
5a394fde56 pageserver: avoid spurious "bad state" logs/errors during shutdown (#7912)
## Problem

- Initial size calculations tend to fail with `Bad state (not active)`

Closes: https://github.com/neondatabase/neon/issues/7911

## Summary of changes

- In `wait_lsn`, return WaitLsnError::Cancelled rather than BadState
when the state is Stopping
- Replace PageReconstructError's `Other` variant with a specific
`BadState` variant
- Avoid returning anyhow::Error from get_ready_ancestor_timeline -- this
was only used for the case where there was no ancestor. All callers of
this function had implicitly checked that the ancestor timeline exists
before calling it, so they can pass in the ancestor instead of handling
an error.
2024-05-31 13:31:42 +01:00
John Spray
98dadf8543 pageserver: quieten some shutdown logs around logical size and flush (#7907)
## Problem

Looking at several noisy shutdown logs:
- In https://github.com/neondatabase/neon/issues/7861 we're hitting a
log error with `InternalServerError(timeline shutting down\n'` on the
checkpoint API handler.
- In the field, we see initial_logical_size_calculation errors on
shutdown, via DownloadError
- In the field, we see errors logged from layer download code
(independent of the error propagated) during shutdown

Closes: https://github.com/neondatabase/neon/issues/7861

## Summary of changes

The theme of these changes is to avoid propagating anyhow::Errors for
cases that aren't really unexpected error cases that we might want a
stacktrace for, and avoid "Other" error variants unless we really do
have unexpected error cases to propagate.

- On the flush_frozen_layers path, use the `FlushLayerError` type
throughout, rather than munging it into an anyhow::Error. Give
FlushLayerError an explicit from_anyhow helper that checks for timeline
cancellation, and uses it to give a Cancelled error instead of an Other
error when the timeline is shutting down.
- In logical size calculation, remove BackgroundCalculationError (this
type was just a Cancelled variant and an Other variant), and instead use
CalculateLogicalSizeError throughout. This can express a
PageReconstructError, and has a From impl that translates cancel-like
page reconstruct errors to Cancelled.
- Replace CalculateLogicalSizeError's Other(anyhow::Error) variant case
with a Decode(DeserializeError) variant, as this was the only kind of
error we actually used in the Other case.
- During layer download, drop out early if the timeline is shutting
down, so that we don't do an `error!()` log of the shutdown error in
this case.
2024-05-31 09:18:58 +01:00
Alex Chi Z
f20a9e760f chore(pageserver): warn on delete non-existing file (#7847)
Consider the following sequence of migration:

```
1. user starts compute
2. force migrate to v2
3. user continues to write data
```

At the time of (3), the compute node is not aware that the page server
does not contain replication states any more, and might continue to
ingest neon-file records into the safekeeper. This will leave the
pageserver store a partial replication state and cause some errors. For
example, the compute could issue a deletion of some aux files in v1, but
this file does not exist in v2. Therefore, we should ignore all these
errors until everyone is migrated to v2.

Also note that if we see this warning in prod, it is likely because we
did not fully suspend users' compute when flipping the v1/v2 flag.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-30 14:45:34 +00:00
Alex Chi Z
33395dcf4e perf(pageserver): postpone vectored get fringe keyspace construction (#7904)
Perf shows a significant amount of time is spent on `Keyspace::merge`.
This pull request postpones merging keyspace until retrieving the layer,
which contributes to a 30x improvement in aux keyspace basebackup time.

```
--- old
10000 files found in 0.580569459s
--- new
10000 files found in 0.02995075s
```

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-30 10:31:57 -04:00
YukiSeino
167394a073 refacter : VirtualFile::open uses AsRef (#7908)
## Problem
#7371 

## Summary of changes
* The VirtualFile::open, open_with_options, and create methods use
AsRef, similar to the standard library's std::fs APIs.
2024-05-30 15:58:20 +02:00
John Spray
352b08d0be pageserver: fix a warning on secondary mode downloads after evictions (#7877)
## Problem

In 4ce6e2d2fc we added a warning when progress stats don't look right at
the end of a secondary download pass.

This `Correcting drift in progress stats` warning fired in staging on a
pageserver that had been doing some disk usage eviction.

The impact is low because in the same place we log the warning, we also
fix up the progress values.

## Summary of changes

- When we skip downloading a layer because it was recently evicted,
update the progress stats to ensure they still reach a clean complete
state at the end of a download pass.
- Also add a log for evicting secondary location layers, for symmetry
with attached locations, so that we can clearly see when eviction has
happened for a particular tenant's layers when investigating issues.

This is a point fix -- the code would also benefit from being refactored
so that there is some "download result" type with a Skip variant, to
ensure that we are updating the progress stats uniformly for those
cases.
2024-05-28 16:06:47 +01:00
Arseny Sher
3797566c36 safekeeper: test pull_timeline with WAL gc.
Do pull_timeline while WAL is being removed. To this end
- extract pausable_failpoint to utils, sprinkle pull_timeline with it
- add 'checkpoint' sk http endpoint to force WAL removal.

After fixing checking for pull file status code test fails so far which is
expected.
2024-05-25 06:06:32 +03:00
John Spray
1455f5a261 pageserver: revert concurrent secondary downloads, make DownloadStream always yield Err after cancel (#7866)
## Problem

Ongoing hunt for secondary location shutdown hang issues.

## Summary of changes

- Revert the functional changes from #7675 
- Tweak a log in secondary downloads to make it more apparent when we
drop out on cancellation
- Modify DownloadStream's behavior to always return an Err after it has
been cancelled. This _should_ not impact anything, but it makes the
behavior simpler to reason about (e.g. even if the poll function somehow
got called again, it could never end up in an un-cancellable state)

Related #https://github.com/neondatabase/cloud/issues/13576
2024-05-24 11:45:34 +03:00
John Spray
3860bc9c6c pageserver: post-shard-split layer rewrites (2/2) (#7531)
## Problem

- After a shard split of a large existing tenant, child tenants can end
up with oversized historic layers indefinitely, if those layers are
prevented from being GC'd by branchpoints.

This PR follows https://github.com/neondatabase/neon/pull/7531, and adds
rewriting of layers that contain a mixture of needed & un-needed
contents, in addition to dropping un-needed layers.

Closes: https://github.com/neondatabase/neon/issues/7504

## Summary of changes

- Add methods to ImageLayer for reading back existing layers
- Extend `compact_shard_ancestors` to rewrite layer files that contain a
mixture of keys that we want and keys we do not, if unwanted keys are
the majority of those in the file.
- Amend initialization code to handle multiple layers with the same
LayerName properly
- Get rid of of renaming bad layer files to `.old` since that's now
expected on restarts during rewrites.
2024-05-24 08:33:19 +00:00
MMeent
0e4f182680 Rework PageStream connection state handling: (#7611)
* Make PS connection startup use async APIs
   This allows for improved query cancellation when we start connections
 * Make PS connections have per-shard connection retry state.
   Previously they shared global backoff state, which is bad for quickly
   getting all connections started and/or back online.
 * Make sure we clean up most connection state on failed connections.
   Previously, we could technically leak some resources that we'd otherwise
   clean up. Now, the resources are correctly cleaned up.
 * pagestore_smgr.c now PANICs on unexpected response message types.
   Unexpected responses are likely a symptom of having a desynchronized
   view of the connection state. As a desynchronized connection state can
   cause corruption, we PANIC, as we don't know what data may have been
   written to buffers: the only solution is to fail fast & hope we didn't
   write wrong data.
 * Catch errors in sync pagestream request handling.
   Previously, if a query was cancelled after a message was sent to
   the pageserver, but before the data was received, the backend
   could forget that it sent the synchronous request, and let others
   deal with the repercussions. This could then lead to incorrect
   responses, or errors such as "unexpected response from page
   server with tag 0x68"
2024-05-23 23:26:42 +02:00
Joonas Koivunen
7cf726e36e refactor(rtc): remove the duplicate IndexLayerMetadata (#7860)
Once upon a time, we used to have duplicated types for runtime IndexPart
and whatever we stored. Because of the serde fixes in #5335 we have no
need for duplicated IndexPart type anymore, but the `IndexLayerMetadata`
stayed.

- remove the type
- remove LayerFileMetadata::file_size() in favor of direct field access

Split off from #7833. Cc: #3072.
2024-05-23 23:24:31 +03:00
Alex Chi Z
6b3164269c chore(pageserver): reduce logging related to image layers (#7864)
* Reduce the logging level for create image layers of metadata keys.
(question: is it possible to adjust logging levels at runtime?)
* Do a info logging of image layers only after the layer is created. Now
there are a lot of cases where we create the image layer writer but then
discarding that image layer because it does not contain any key.
Therefore, I changed the new image layer logging to trace, and create
image layer logging to info.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-23 15:30:43 +00:00
Arpad Müller
75a52ac7fd Use Timeline::create_image_layer_for_rel_blocks in tiered compaction (#7850)
Reduces duplication between tiered and legacy compaction by using the
`Timeline::create_image_layer_for_rel_blocks` function. This way, we
also use vectored get in tiered compaction, so the change has two
benefits in one.

fixes #7659

---------

Co-authored-by: Alex Chi Z. <iskyzh@gmail.com>
2024-05-23 15:10:24 +00:00
Alex Chi Z
e28e46f20b fix(pageserver): make wal connstr a connstr (#7846)
The list timeline API gives something like
`"wal_source_connstr":"PgConnectionConfig { host:
Domain(\"safekeeper-5.us-east-2.aws.neon.build\"), port: 6500, password:
Some(REDACTED-STRING) }"`, which is weird. This pull request makes it
somehow like a connection string. This field is not used at least in the
neon database, so I assume no one is reading or parsing it.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-23 09:45:29 -04:00
Arpad Müller
d5d15eb6eb Warn if a blob in an image is larger than 256 MiB (#7852)
We'd like to get some bits reserved in the length field of image layers
for future usage (compression). This PR bases on the assumption that we
don't have any blobs that require more than 28 bits (3 bytes + 4 bits)
to store the length, but as a preparation, before erroring, we want to
first emit warnings as if the assumption is wrong, such warnings are less
disruptive than errors.

A metric would be even less disruptive (log messages are more slow, if
we have a LOT of such large blobs then it would take a lot of time to
print them). At the same time, likely such 256 MiB blobs will occupy an
entire layer file, as they are larger than our target size. For layer
files we already log something, so there shouldn't be a large increase
in overhead.

Part of #5431
2024-05-23 14:28:05 +02:00
John Spray
a43a1ad1df pageserver: fix API-driven secondary downloads possibly colliding with background downloads (#7848)
## Problem

We've seen some strange behaviors when doing lots of migrations
involving secondary locations. One of these was where a tenant was
apparently stuck in the `Scheduler::running` list, but didn't appear to
be making any progress. Another was a shutdown hang
(https://github.com/neondatabase/cloud/issues/13576).

## Summary of changes

- Fix one issue (probably not the only one) where a tenant in the
`pending` list could proceed to `spawn` even if the same tenant already
had a running task via `handle_command` (this could have resulted in a
weird value of SecondaryProgress)
- Add various extra logging:
- log before as well as after layer downloads so that it would be
obvious if we were stuck in remote storage code (we shouldn't be, it has
built in timeouts)
- log the number of running + pending jobs from the scheduler every time
it wakes up to do a scheduling iteration (~10s) -- this is quite chatty,
but not compared with the volume of logs on a busy pageserver. It should
give us confidence that the scheduler loop is still alive, and
visibility of how many tasks the scheduler thinks are running.
2024-05-23 09:13:55 +01:00
Alex Chi Z
ff560a1113 chore(pageserver): use kebab case for compaction algorithms (#7845)
Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-22 21:28:47 +00:00