Commit Graph

2509 Commits

Author SHA1 Message Date
Yuchen Liang
bf9a6d0f4c review: follow Buffer::extend_from_slice trait definition
panics if IoBufferMut does not enough capacity left to accomodate the source
buffer.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-12-02 16:07:57 +00:00
Yuchen Liang
9f384a8426 review: remove unused impl Buffer for BytesMut
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-12-02 15:58:08 +00:00
Yuchen Liang
b6a2516c1c Merge branch 'main' into yuchen/double-buffered-writer 2024-11-27 10:10:53 -05:00
Yuchen Liang
b54764bccb hold timeline open in background task using gate guard (#9825)
## Problem

The newly added flush task in
https://github.com/neondatabase/neon/pull/9693 should hold timeline gate
open, to avoid doing local IO after timeline shutdown completes.

## Solution

Pass timeline gate guard to flush background task. The flush task do not
need cancellation token b/c it will automatically shutdown when the
front writer task drop the channel.

- Refactor relevant paths to pass down `&Gate` instead of `GateGuard`.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-27 10:10:38 -05:00
Erik Grinaker
e4f437a354 pageserver: add relsize cache metrics (#9890)
## Problem

We don't have any observability for the relation size cache. We have
seen cache misses cause significant performance impact with high
relation counts.

Touches #9855.

## Summary of changes

Adds the following metrics:

* `pageserver_relsize_cache_entries`
* `pageserver_relsize_cache_hits`
* `pageserver_relsize_cache_misses`
* `pageserver_relsize_cache_misses_old`
2024-11-27 13:54:14 +00:00
Vlad Lazar
8fdf786217 pageserver: add tenant config override for wal receiver proto (#9888)
## Problem

Can't change protocol at tenant granularity.

## Summary of changes

Add tenant config level override for wal receiver protocol.

## Links

Related: https://github.com/neondatabase/neon/issues/9336
Epic: https://github.com/neondatabase/neon/issues/9329
2024-11-27 13:46:23 +00:00
Vlad Lazar
9e0148de11 safekeeper: use protobuf for sending compressed records to pageserver (#9821)
## Problem

https://github.com/neondatabase/neon/pull/9746 lifted decoding and
interpretation of WAL to the safekeeper.
This reduced the ingested amount on the pageservers by around 10x for a
tenant with 8 shards, but doubled
the ingested amount for single sharded tenants.

Also, https://github.com/neondatabase/neon/pull/9746 uses bincode which
doesn't support schema evolution.
Technically the schema can be evolved, but it's very cumbersome.

## Summary of changes

This patch set addresses both problems by adding protobuf support for
the interpreted wal records and adding compression support. Compressed
protobuf reduced the ingested amount by 100x on the 32 shards
`test_sharded_ingest` case (compared to non-interpreted proto). For the
1 shard case the reduction is 5x.

Sister change to `rust-postgres` is
[here](https://github.com/neondatabase/rust-postgres/pull/33).

## Links

Related: https://github.com/neondatabase/neon/issues/9336
Epic: https://github.com/neondatabase/neon/issues/9329
2024-11-27 12:12:21 +00:00
Peter Bendel
13feda0669 track how much time the flush loop is stalled waiting for uploads (#9885)
## Problem

We don't know how much time PS is losing during ingest when waiting for
remote storage uploads in the flush frozen layer loop.
Also we don't know how many remote storage requests get an permit
without waiting (not throttled by remote_storage concurrency_limit).

## Summary of changes

- Add a metric that accumulates the time waited per shard/PS
- in [remote storage semaphore wait
seconds](https://neonprod.grafana.net/d/febd9732-9bcf-4992-a821-49b1f6b02724/remote-storage?orgId=1&var-datasource=HUNg6jvVk&var-instance=pageserver-26.us-east-2.aws.neon.build&var-instance=pageserver-27.us-east-2.aws.neon.build&var-instance=pageserver-28.us-east-2.aws.neon.build&var-instance=pageserver-29.us-east-2.aws.neon.build&var-instance=pageserver-30.us-east-2.aws.neon.build&var-instance=pageserver-31.us-east-2.aws.neon.build&var-instance=pageserver-36.us-east-2.aws.neon.build&var-instance=pageserver-37.us-east-2.aws.neon.build&var-instance=pageserver-38.us-east-2.aws.neon.build&var-instance=pageserver-39.us-east-2.aws.neon.build&var-instance=pageserver-40.us-east-2.aws.neon.build&var-instance=pageserver-41.us-east-2.aws.neon.build&var-request_type=put_object&from=1731961336340&to=1731964762933&viewPanel=3)
add a first bucket with 100 microseconds to count requests that do not
need to wait on semaphore

Update: created a new version that uses a Gauge (one increasing value
per PS/shard) instead of histogram as suggested by review
2024-11-26 11:46:58 +00:00
Yuchen Liang
c3302ad7e1 Merge branch 'main' into yuchen/double-buffered-writer 2024-11-25 14:49:42 -05:00
Vlad Lazar
7a2f0ed8d4 safekeeper: lift decoding and interpretation of WAL to the safekeeper (#9746)
## Problem

For any given tenant shard, pageservers receive all of the tenant's WAL
from the safekeeper.
This soft-blocks us from using larger shard counts due to bandwidth
concerns and CPU overhead of filtering
out the records.

## Summary of changes

This PR lifts the decoding and interpretation of WAL from the pageserver
into the safekeeper.

A customised PG replication protocol is used where instead of sending
raw WAL, the safekeeper sends
filtered, interpreted records. The receiver drives the protocol
selection, so, on the pageserver side, usage
of the new protocol is gated by a new pageserver config:
`wal_receiver_protocol`.

 More granularly the changes are:
1. Optionally inject the protocol and shard identity into the arguments
used for starting replication
2. On the safekeeper side, implement a new wal sending primitive which
decodes and interprets records
 before sending them over
3. On the pageserver side, implement the ingestion of this new
replication message type. It's very similar
 to what we already have for raw wal (minus decoding and interpreting).
 
 ## Notes
 
* This PR currently uses my [branch of
rust-postgres](https://github.com/neondatabase/rust-postgres/tree/vlad/interpreted-wal-record-replication-support)
which includes the deserialization logic for the new replication message
type. PR for that is open
[here](https://github.com/neondatabase/rust-postgres/pull/32).
* This PR contains changes for both pageservers and safekeepers. It's
safe to merge because the new protocol is disabled by default on the
pageserver side. We can gradually start enabling it in subsequent
releases.
* CI tests are running on https://github.com/neondatabase/neon/pull/9747
 
 ## Links
 
 Related: https://github.com/neondatabase/neon/issues/9336
 Epic: https://github.com/neondatabase/neon/issues/9329
2024-11-25 17:29:28 +00:00
Yuchen Liang
4284fcd38c fix docs clippy
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-25 15:25:29 +00:00
Arpad Müller
77630e5408 Address beta clippy lint needless_lifetimes (#9877)
The 1.82.0 version of Rust will be stable soon, let's get the clippy
lint fixes in before the compiler version upgrade.
2024-11-25 14:59:12 +00:00
Yuchen Liang
8a37f412c2 remove resolved todos
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-25 05:42:16 +00:00
Yuchen Liang
d4ebd5ccd3 use CheapCloneForRead trait to prevent efficiency bugs
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-25 05:01:23 +00:00
Yuchen Liang
76f0e4fd1d review: remove save_buf_for_read
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-25 04:42:02 +00:00
Yuchen Liang
28718bfadc review: simplify FlushControl by using ZST for not(test)
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-25 04:39:13 +00:00
Yuchen Liang
e5bf2bec49 remove write_buffered; add notes for bypass-aligned-part-of-write
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-25 04:30:42 +00:00
Christian Schwarz
450be26bbb fast imports: initial Importer and Storage changes (#9218)
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Stas Kelvic <stas@neon.tech>

# Context

This PR contains PoC-level changes for a product feature that allows
onboarding large databases into Neon without going through the regular
data path.

# Changes

This internal RFC provides all the context
* https://github.com/neondatabase/cloud/pull/19799

In the language of the RFC, this PR covers

* the Importer code (`fast_import`) 
* all the Pageserver changes (mgmt API changes, flow implementation,
etc)
* a basic test for the Pageserver changes

# Reviewing

As acknowledged in the RFC, the code added in this PR is not ready for
general availability.
Also, the **architecture is not to be discussed in this PR**, but in the
RFC and associated Slack channel instead.

Reviewers of this PR should take that into consideration.
The quality bar to apply during review depends on what area of the code
is being reviewed:

* Importer code (`fast_import`): practically anything goes
* Core flow (`flow.rs`):
* Malicious input data must be expected and the existing threat models
apply.
* The code must not be safe to execute on *dedicated* Pageserver
instances:
* This means in particular that tenants *on other* Pageserver instances
must not be affected negatively wrt data confidentiality, integrity or
availability.
* Other code: the usual quality bar
* Pay special attention to correct use of gate guards, timeline
cancellation in all places during shutdown & migration, etc.
* Consider the broader system impact; if you find potentially
problematic interactions with Storage features that were not covered in
the RFC, bring that up during the review.

I recommend submitting three separate reviews, for the three high-level
areas with different quality bars.


# References

(Internal-only)

* refs https://github.com/neondatabase/cloud/issues/17507
* refs https://github.com/neondatabase/company_projects/issues/293
* refs https://github.com/neondatabase/company_projects/issues/309
* refs https://github.com/neondatabase/cloud/issues/20646

---------

Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2024-11-22 22:47:06 +00:00
Alex Chi Z.
c1937d073f fix(pageserver): ensure upload happens after delete (#9844)
## Problem

Follow up of https://github.com/neondatabase/neon/pull/9682, that patch
didn't fully address the problem: what if shutdown fails due to whatever
reason and then we reattach the tenant? Then we will still remove the
future layer. The underlying problem is that the fix for #5878 gets
voided because of the generation optimizations.

Of course, we also need to ensure that delete happens after uploads, but
note that we only schedule deletes when there are no ongoing upload
tasks, so that's fine.

## Summary of changes

* Add a test case to reproduce the behavior (by changing the original
test case to attach the same generation).
* If layer upload happens after the deletion, drain the deletion queue
before uploading.
* If blocked_deletion is enabled, directly remove it from the
blocked_deletion queue.
* Local fs backend fix to avoid race between deletion and preload.
* test_emergency_mode does not need to wait for uploads (and it's
generally not possible to wait for uploads).
* ~~Optimize deletion executor to skip validation if there are no files
to delete.~~ this doesn't work

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-22 18:30:53 +00:00
Erik Grinaker
e939d36dd4 safekeeper,pageserver: fix CPU profiling allowlists (#9856)
## Problem

The HTTP router allowlists matched both on the path and the query
string. This meant that only `/profile/cpu` would be allowed without
auth, while `/profile/cpu?format=svg` would require auth.

Follows #9764.

## Summary of changes

* Match allowlists on URI path, rather than the entire URI.
* Fix the allowlist for Safekeeper to use `/profile/cpu` rather than the
old `/pprof/profile`.
* Just use a constant slice for the allowlist; it's only a handful of
items, and these handlers are not on hot paths.
2024-11-22 17:50:33 +00:00
John Spray
d9de65ee8f pageserver: permit reads behind GC cutoff during LSN grace period (#9833)
## Problem

In https://github.com/neondatabase/neon/issues/9754 and the flakiness of
`test_readonly_node_gc`, we saw that although our logic for controlling
GC was sound, the validation of getpage requests was not, because it
could not consider LSN leases when requests arrived shortly after
restart.

Closes https://github.com/neondatabase/neon/issues/9754

## Summary of changes

This is the "Option 3" discussed verbally -- rather than holding back gc
cutoff, we waive the usual validation of request LSN if we are still
waiting for leases to be sent after startup

- When validating LSN in `wait_or_get_last_lsn`, skip the validation
relative to GC cutoff if the timeline is still in its LSN lease grace
period
- Re-enable test_readonly_node_gc
2024-11-22 09:24:23 +00:00
Erik Grinaker
190e8cebac safekeeper,pageserver: add CPU profiling (#9764)
## Problem

We don't have a convenient way to gather CPU profiles from a running
binary, e.g. during production incidents or end-to-end benchmarks, nor
during microbenchmarks (particularly on macOS).

We would also like to have continuous profiling in production, likely
using [Grafana Cloud
Profiles](https://grafana.com/products/cloud/profiles-for-continuous-profiling/).
We may choose to use either eBPF profiles or pprof profiles for this
(pending testing and discussion with SREs), but pprof profiles appear
useful regardless for the reasons listed above. See
https://github.com/neondatabase/cloud/issues/14888.

This PR is intended as a proof of concept, to try it out in staging and
drive further discussions about profiling more broadly.

Touches #9534.
Touches https://github.com/neondatabase/cloud/issues/14888.

## Summary of changes

Adds a HTTP route `/profile/cpu` that takes a CPU profile and returns
it. Defaults to a 5-second pprof Protobuf profile for use with e.g.
`pprof` or Grafana Alloy, but can also emit an SVG flamegraph. Query
parameters:

* `format`: output format (`pprof` or `svg`)
* `frequency`: sampling frequency in microseconds (default 100)
* `seconds`: number of seconds to profile (default 5)

Also integrates pprof profiles into Criterion benchmarks, such that
flamegraph reports can be taken with `cargo bench ... --profile-duration
<seconds>`. Output under `target/criterion/*/profile/flamegraph.svg`.

Example profiles:

* pprof profile (use [`pprof`](https://github.com/google/pprof)):
[profile.pb.gz](https://github.com/user-attachments/files/17756788/profile.pb.gz)
  * Web interface: `pprof -http :6060 profile.pb.gz`
* Interactive flamegraph:
[profile.svg.gz](https://github.com/user-attachments/files/17756782/profile.svg.gz)
2024-11-21 18:59:46 +00:00
John Spray
42bda5d632 pageserver: revise metrics lifetime for SecondaryTenant (#9818)
## Problem

We saw a scale test failure when one shard went
secondary->attached->secondary in a short period of time -- the metrics
for the shard failed a validation assertion that is meant to ensure the
size metric matches the sum of layer sizes in the SecondaryDetail
struct.

This appears to be due to two SecondaryTenants being alive at the same
time -- the first one was shut down but still had its contributions to
the metrics.

Closes: https://github.com/neondatabase/neon/issues/9628

## Summary of changes

- Refactor code for validating metrics and call it in shutdown as well
as during downloads
- Move code for dropping per-tenant secondary metrics from drop() into
shutdown(), so that once shutdown() completes it is definitely safe to
instantiate another SecondaryTenant for the same tenant.
2024-11-21 08:31:24 +00:00
Vlad Lazar
ee26f09e45 pageserver: remove shard split hard link assertion (#9829)
## Problem

We were hitting this assertion in debug mode tests sometimes.

This case was being hit when the parent shard has no resident layers.
For instance, this is the case on split retry where the previous attempt
shut-down the parent and deleted local state for it. If the logical size
calculation does not download some layers before we get to the
hardlinking, then the assertion is hit.

## Summary of Changes

Remove the assertion. It's fine for the ancestor to not have any
resident layers at the time of the split.

Closes https://github.com/neondatabase/neon/issues/9412
2024-11-20 18:33:05 +00:00
John Spray
5ff2f1ee7d pageserver: enable compaction to proceed while live-migrating (#5397)
## Problem

Long ago, in #5299 the tenant states for migration are added, but
respected only in a coarse-grained way: when hinted not to do deletions,
tenants will just avoid doing all GC or compaction.

Skipping compaction is not necessary for AttachedMulti, as we will soon
become the primary attached location, and it is not a waste of resources
to proceed with compaction. Instead, per the RFC
https://github.com/neondatabase/neon/pull/5029/files), deletions should
be queued up in this state, and executed later when we switch to
AttachedSingle.

Avoiding compaction in AttachedMulti can have an operational impact if a
tenant is under significant write load, as a long-running migration can
result in a large accumulation of delta layers with commensurate impact
on read latency.

Closes: https://github.com/neondatabase/neon/issues/5396

## Summary of changes

- Add a 'config' part to RemoteTimelineClient so that it can be aware of
the mode of the tenant it belongs to, and wire this through for
construction + updates
- Add a special buffer for delayed deletions, and when in AttachedMulti
route deletions here instead of into the main remote client queue. This
is drained when transitioning to AttachedSingle. If the tenant is
detached or our process dies before then, then these objects are leaked.
- As a quality of life improvement, also use the remote timeline
client's knowledge of the tenant state to avoid submitting remote
consistent LSN updates for validation when in AttachedStale (as we know
these will fail)

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-11-20 17:31:55 +00:00
Yuchen Liang
0f63c957a6 document and reorder flush background task invokation sequence
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-20 16:18:21 +00:00
John Spray
67f5f83edc pageserver: avoid reading SLRU blocks for GC on shards >0 (#9423)
## Problem

SLRU blocks, which can add up to several gigabytes, are currently
ingested by all shards, multiplying their capacity cost by the shard
count and slowing down ingest. We do this because all shards need the
SLRU pages to do timestamp->LSN lookup for GC.

Related: https://github.com/neondatabase/neon/issues/7512

## Summary of changes

- On non-zero shards, learn the GC offset from shard 0's index instead
of calculating it.
- Add a test `test_sharding_gc` that exercises this
- Do GC in test_pg_regress as a general smoke test that GC functions run
(e.g. this would fail if we were using SLRUs we didn't have)

In this PR we are still ingesting SLRUs everywhere, but not using them
any more. Part 2 PR (https://github.com/neondatabase/neon/pull/9786)
makes the change to not store them at all.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-11-20 15:56:14 +00:00
Arpad Müller
0a499a3176 Don't preload offloaded timelines (#9646)
In timeline preloading, we also do a preload for offloaded timelines.
This includes the download of `index-part.json`. Ultimately, such a
download is wasteful, therefore avoid it. Same goes for the remote
client, we just discard it immediately thereafter.

Part of #8088

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-11-20 05:44:23 +00:00
Yuchen Liang
77801fe3be Merge branch 'main' into yuchen/double-buffered-writer 2024-11-19 17:19:42 -05:00
Alex Chi Z.
b22a84a7bf feat(pageserver): support key range for manual compaction trigger (#9723)
part of https://github.com/neondatabase/neon/issues/9114, we want to be
able to run partial gc-compaction in tests. In the future, we can also
expand this functionality to legacy compaction, so that we can trigger
compaction for a specific key range.

## Summary of changes

* Support passing compaction key range through pageserver routes.
* Refactor input parameters of compact related function to take the new
`CompactOptions`.
* Add tests for partial compaction. Note that the test may or may not
trigger compaction based on GC horizon. We need to improve the test case
to ensure things always get below the gc_horizon and the gc-compaction
can be triggered.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-19 19:38:41 +00:00
Alex Chi Z.
5e3fbef721 fix(pageserver): queue stopped error should be ignored during create timeline (#9767)
close https://github.com/neondatabase/neon/issues/9730

The test case tests if anything goes wrong during pageserver restart +
*during timeline creation not complete*. Therefore, queue is stopped
error is normal in this case, except that it should be categorized as a
shutdown error instead of a real error.

## Summary of changes

* More comments for the test case.
* Queue stopped error will now be forwarded as
CreateTimelineError::ShuttingDown.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-19 14:10:09 -05:00
Yuchen Liang
78a17a7051 improve FullSlice semantics
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-19 18:38:59 +00:00
Yuchen Liang
826e2395a8 add comments
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-19 16:48:29 +00:00
Yuchen Liang
9db6b1e3c8 fix clippy
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-19 14:22:28 +00:00
Yuchen Liang
5acc61bdbc move duplex to utils; make flush behavior controllable in test
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-18 23:52:52 +00:00
Arpad Müller
4fc3af15dd Remove at most one retain_lsn entry from (possibly offloaded) timelne's parent (#9791)
There is a potential data corruption issue, not one I've encountered,
but it's still not hard to hit with some correct looking code given our
current architecture. It has to do with the timeline's memory object storage
via reference counted `Arc`s, and the removal of `retain_lsn` entries at
the drop of the last `Arc` reference.

The corruption steps are as follows:

1. timeline gets offloaded. timeline object A doesn't get dropped
though, because some long-running task accesses it
2. the same timeline gets unoffloaded again. timeline object B gets
created for it, timeline object A still referenced. both point to the
same timeline.
3. the task keeping the reference to timeline object A exits. destructor
for object A runs, removing `retain_lsn` in the timeline's parent.
4. the timeline's parent runs gc without the `retain_lsn` of the still
exant timleine's child, leading to data corruption.

In general we are susceptible each time when we recreate a `Timeline`
object in the same process, which happens both during a timeline
offload/unoffload cycle, as well as during an ancestor detach operation.

The solution this PR implements is to make the destructor for a timeline
as well as an offloaded timeline remove at most one `retain_lsn`.

PR #9760 has added a log line to print the refcounts at timeline
offload, but this only detects one of the places where we do such a
recycle operation. Plus it doesn't prevent the actual issue.

I doubt that this occurs in practice. It is more a defense in depth measure.
Usually I'd assume that the timeline gets dropped immediately in step 1,
as there is no background tasks referencing it after its shutdown.
But one never knows, and reducing the stakes of step 1 actually occurring
is a really good idea, from potential data corruption to waste of CPU time.

Part of #8088
2024-11-18 21:42:19 +01:00
Vlad Lazar
d7662fdc7b feat(page_service): timeout-based batching of requests (#9321)
## Problem

We don't take advantage of queue depth generated by the compute
on the pageserver. We can process getpage requests more efficiently
by batching them. 

## Summary of changes

Batch up incoming getpage requests that arrive within a configurable
time window (`server_side_batch_timeout`).
Then process the entire batch via one `get_vectored` timeline operation.
By default, no merging takes place.

## Testing

* **Functional**: https://github.com/neondatabase/neon/pull/9792
* **Performance**: will be done in staging/pre-prod

# Refs

* https://github.com/neondatabase/neon/issues/9377
* https://github.com/neondatabase/neon/issues/9376

Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-11-18 20:24:03 +00:00
Alex Chi Z.
e5c89f3da3 feat(pageserver): drop disposable keys during gc-compaction (#9765)
close https://github.com/neondatabase/neon/issues/9552, close
https://github.com/neondatabase/neon/issues/8920, part of
https://github.com/neondatabase/neon/issues/9114

## Summary of changes

* Drop keys not belonging to this shard during gc-compaction to avoid
constructing history that might have been truncated during shard
compaction.
* Run gc-compaction at the end of shard compaction test.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-18 19:27:52 +00:00
John Spray
261d065e6f pageserver: respect no_sync in VirtualFile (#9772)
## Problem

`no_sync` initially just skipped syncfs on startup (#9677). I'm also
interested in flaky tests that time out during pageserver shutdown while
flushing l0s, so to eliminate disk throughput as a source of issues
there,

## Summary of changes

- Drive-by change for test timeouts: add a couple more ::info logs
during pageserver startup so it's obvious which part got stuck.
- Add a SyncMode enum to configure VirtualFile and respect it in
sync_all and sync_data functions
- During pageserver startup, set SyncMode according to `no_sync`
2024-11-18 08:59:05 +00:00
Vlad Lazar
ac689ab014 wal_decoder: rename end_lsn to next_record_lsn (#9776)
## Problem

It turns out that `WalStreamDecoder::poll_decode` returns the start LSN
of the next record and not the end LSN of the current record. They are
not always equal. For example, they're not equal when the record in
question is an XLOG SWITCH record.

## Summary of changes

Rename things to reflect that.
2024-11-15 21:53:11 +00:00
Yuchen Liang
990bc65a20 review: https://github.com/neondatabase/neon/pull/9693#discussion_r1840293759
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-15 15:42:58 +00:00
Arpad Müller
7880c246f1 Correct mistakes in offloaded timeline retain_lsn management (#9760)
PR #9308 has modified tenant activation code to take offloaded child
timelines into account for populating the list of `retain_lsn` values.
However, there is more places than just tenant activation where one
needs to update the `retain_lsn`s.

This PR fixes some bugs of the current code that could lead to
corruption in the worst case:

1. Deleting of an offloaded timeline would not get its `retain_lsn`
purged from its parent. With the patch we now do it, but as the parent
can be offloaded as well, the situatoin is a bit trickier than for
non-offloaded timelines which can just keep a pointer to their parent.
Here we can't keep a pointer because the parent might get offloaded,
then unoffloaded again, creating a dangling pointer situation. Keeping a
pointer to the *tenant* is not good either, because we might drop the
offloaded timeline in a context where a `offloaded_timelines` lock is
already held: so we don't want to acquire a lock in the drop code of
OffloadedTimeline.
2. Unoffloading a timeline would not get its `retain_lsn` values
populated, leading to it maybe garbage collecting values that its
children might need. We now call `initialize_gc_info` on the parent.
3. Offloading of a timeline would not get its `retain_lsn` values
registered as offloaded at the parent. So if we drop the `Timeline`
object, and its registration is removed, the parent would not have any
of the child's `retain_lsn`s around. Also, before, the `Timeline` object
would delete anything related to its timeline ID, now it only deletes
`retain_lsn`s that have `MaybeOffloaded::No` set.

Incorporates Chi's reproducer from #9753. cc
https://github.com/neondatabase/cloud/issues/20199

The `test_timeline_retain_lsn` test is extended:

1. it gains a new dimension, duplicating each mode, to either have the
"main" branch be the direct parent of the timeline we archive, or the
"test_archived_parent" branch intermediary, creating a three timeline
structure. This doesn't test anything fixed by this PR in particular,
just explores the vast space of possible configurations a little bit
more.
2. it gains two new modes, `offload-parent`, which tests the second
point, and `offload-no-restart` which tests the third point.

It's easy to verify the test actually is "sharp" by removing one of the
respective `self.initialize_gc_info()`, `gc_info.insert_child()` or
`ancestor_children.push()`.

Part of #8088

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>
2024-11-15 14:22:29 +01:00
John Spray
93939f123f tests: add test_timeline_archival_chaos (#9609)
## Problem

- We lack test coverage of cases where multiple timelines fight for
updates to the same manifest
(https://github.com/neondatabase/neon/pull/9557), and in timeline
archival changes while dual-attached
(https://github.com/neondatabase/neon/pull/9555)

## Summary of changes

- Add a chaos test for timeline creation->archival->offload->deletion
2024-11-14 17:31:35 +00:00
Konstantin Knizhnik
f70611c8df Correctly truncate VM (#9342)
## Problem

https://github.com/neondatabase/neon/issues/9240

## Summary of changes

Correctly truncate VM page instead just replacing it with zero page.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-11-14 17:19:13 +02:00
John Spray
b4e00b8b22 pageserver: refuse to load tenants with suspiciously old indices in old generations (#9719)
## Problem

Historically, if a control component passed a pageserver "generation: 1"
this could be a quick way to corrupt a tenant by loading a historic
index.

Follows https://github.com/neondatabase/neon/pull/9383
Closes #6951 

## Summary of changes

- Introduce a Fatal variant to DownloadError, to enable index downloads
to signal when they have encountered a scary enough situation that we
shouldn't proceed to load the tenant.
- Handle this variant by putting the tenant into a broken state (no
matter which timeline within the tenant reported it)
- Add a test for this case

In the event that this behavior fires when we don't want it to, we have
ways to intervene:
- "Touch" an affected index to update its mtime (download+upload S3
object)
- If this behavior is triggered, it indicates we're attaching in some
old generation, so we should be able to fix that by manually bumping
generation numbers in the storage controller database (this should never
happen, but it's an option if it does)
2024-11-13 18:07:39 +00:00
Yuchen Liang
6844b5f460 add comments; make read buffering works with write_buffered (owned version)
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-12 16:52:46 +00:00
Yuchen Liang
ffd88ede38 fix clippy
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-12 15:59:24 +00:00
Alex Chi Z.
cef165818c test(pageserver): add gc-compaction tests with delta will_init (#9724)
I had an impression that gc-compaction didn't test the case where the
first record of the key history is will_init because of there are some
code path that will panic in this case. Luckily it got fixed in
https://github.com/neondatabase/neon/pull/9026 so we can now implement
such tests.

Part of https://github.com/neondatabase/neon/issues/9114

## Summary of changes

* Randomly changed some images into will_init neon wal record
* Split `test_simple_bottom_most_compaction_deltas` into two test cases,
one of them has the bottom layer as delta layer with will_init flags,
while the other is the original one with image layers.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-11-12 10:37:31 -05:00
Yuchen Liang
d6d8a16dbc Merge branch 'main' into yuchen/double-buffered-writer 2024-11-11 20:27:09 -05:00
Yuchen Liang
20e6a0c8a2 use open_with_options_v2 (O_DIRECT) for ephemeral file
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-11-12 01:26:29 +00:00