Compare commits

...

118 Commits

Author SHA1 Message Date
Stas Kelvich
2bf9a350ef Process chunks in parallel 2024-09-14 14:07:35 +01:00
Heikki Linnakangas
2e7e5f4f3a Refactor how the image layer partitioning is done
It got pretty ugly after the last commit. Refactor it so that we first
collect all the key ranges that need to be written out into a list of
tasks, then partition the tasks into image layers, and then write them
out. This will be much easier to parallelize, but that's not included
in this commit yet.
2024-09-13 05:41:35 +03:00
Heikki Linnakangas
9f3d5826be Fix with files > 4 GB 2024-09-13 02:27:26 +03:00
Heikki Linnakangas
4f39501641 Fix handling relations > 1 GB 2024-09-13 02:04:48 +03:00
Heikki Linnakangas
67d1606f82 Write out multiple image layers 2024-09-13 01:47:10 +03:00
Heikki Linnakangas
9351ba26ff Fake LSN
Test passes, yay!
2024-09-12 21:44:49 +03:00
Stas Kelvich
8df388330b Merge branch 'hack/fast-import' of github.com:neondatabase/neon into hack/fast-import 2024-09-12 19:26:16 +01:00
Stas Kelvich
357c07dd35 track rel file import time 2024-09-12 19:26:03 +01:00
Heikki Linnakangas
7b90ec6e19 Create controlfile and checkpoint entries
XXX: untested, not sure if it works..
2024-09-12 21:01:04 +03:00
Heikki Linnakangas
85f4e966e8 Import dummy pg_twophase dir entry 2024-09-12 20:54:16 +03:00
Heikki Linnakangas
4d27048d6d Import SLRUs 2024-09-12 20:46:20 +03:00
Stas Kelvich
3a452d8f56 remove old timeline init code 2024-09-12 18:20:13 +01:00
Stas Kelvich
b81dbc887b import relation sizes 2024-09-12 18:19:25 +01:00
Stas Kelvich
80fed9cfb1 fix oder of insertion for relmaps and reldirs 2024-09-12 15:43:54 +01:00
Stas Kelvich
189386b22f Merge branch 'hack/fast-import' of github.com:neondatabase/neon into hack/fast-import 2024-09-12 13:52:11 +01:00
Stas Kelvich
38dfecb026 clean imports 2024-09-12 13:51:48 +01:00
Stas Kelvich
be28bd8312 merge 2024-09-12 13:49:34 +01:00
Heikki Linnakangas
9759d6ec72 Rename the image layer to not have the temp suffix 2024-09-12 15:49:21 +03:00
Stas Kelvich
0c64d55a6b Import dbdir, relmaps, reldirs 2024-09-12 13:48:29 +01:00
Heikki Linnakangas
578da1dc02 Parse postgres version from control file 2024-09-12 15:21:59 +03:00
Stas Kelvich
842ac7cfda resolve conflicts 2024-09-12 13:13:16 +01:00
Stas Kelvich
71340e3c00 common iterators for pg data dirs 2024-09-12 13:10:35 +01:00
Heikki Linnakangas
e6e0b27dc3 Create index_part.json 2024-09-12 14:53:29 +03:00
Heikki Linnakangas
04ec8bd7de test: Attach the tenant, start endpoint on it
Doesn't work yet, I think because index_part.json is missing
2024-09-12 13:52:14 +03:00
Heikki Linnakangas
6563be1a4c Test passes now
It runs the command successfully. Doesn't try to attach it to the
pageserver on it yet

    BUILD_TYPE=debug DEFAULT_PG_VERSION=16 poetry run pytest --preserve-database-files test_runner/regress/test_pg_import.py
2024-09-12 13:36:42 +03:00
Heikki Linnakangas
fe975acc71 Add --tenant-id and --timeline-id options 2024-09-12 13:28:12 +03:00
Heikki Linnakangas
abed35589b Test fix 2024-09-12 12:59:45 +03:00
Stas Kelvich
3fe8b69968 Merge branch 'hack/fast-import' of github.com:neondatabase/neon into hack/fast-import 2024-09-12 10:59:24 +01:00
Stas Kelvich
0c856443c4 now it produces an image layer 2024-09-12 10:57:50 +01:00
Heikki Linnakangas
0fc584ef9a Add python test 2024-09-12 12:43:12 +03:00
Stas Kelvich
daedec65ac fix awaits 2024-09-12 10:42:08 +01:00
Stas Kelvich
94c393bf8f resolve conflicts 2024-09-12 10:37:07 +01:00
Stas Kelvich
28616b0907 compiles 2024-09-12 10:33:14 +01:00
Heikki Linnakangas
241724f3fc CLI args parsing 2024-09-12 12:31:07 +03:00
Stas Kelvich
98d128d993 first sketch 2024-09-12 09:59:36 +01:00
Erik Grinaker
b37da32c6f pageserver: reuse idempotency keys across metrics sinks (#8876)
## Problem

Metrics event idempotency keys differ across S3 and Vector. The events
should be identical.

Resolves #8605.

## Summary of changes

Pre-generate the idempotency keys and pass the same set into both
metrics sinks.

Co-authored-by: John Spray <john@neon.tech>
2024-09-03 09:05:24 +01:00
Christian Schwarz
3b317cae07 page_cache/layer load: correctly classify layer summary block reads (#8885)
Before this PR, we would classify layer summary block reads as "Unknown"
content kind.

<img width="1267" alt="image"
src="https://github.com/user-attachments/assets/508af034-5c2a-4c89-80db-2899967b337c">
2024-09-02 16:09:26 +01:00
Christian Schwarz
bf0531d107 fixup(#8839): test_forward_compatibility needs to allow lag warning as well (#8891)
Found in
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8885/10665614629/index.html#suites/0fbaeb107ef328d03993d44a1fb15690/ea10ba1c140fba1d
2024-09-02 15:10:10 +01:00
Christian Schwarz
15e90cc427 bottommost-compaction: remove dead code / rectify cfg!()s (#8884)
part of https://github.com/neondatabase/neon/issues/8002
2024-09-02 14:45:17 +01:00
Arpad Müller
9746b6ea31 Implement archival_config timeline endpoint in the storage controller (#8680)
Implement the timeline specific `archival_config` endpoint also in the
storage controller.

It's mostly a copy-paste of the detach handler: the task is the same: do
the same operation on all shards.

Part of #8088.
2024-09-02 13:51:45 +02:00
John Spray
516ac0591e storage controller: eliminate ensure_attached (#8875)
## Problem

This is a followup to #8783

- The old blocking ensure_attached function had been retained to handle
the case where a shard had a None generation_pageserver, but this wasn't
really necessary.
- There was a subtle `.1` in the code where a struct would have been
clearer

Closes #8819

## Summary of changes

- Add ShardGenerationState to represent the results of peek_generation
- Instead of calling ensure_attached when a tenant has a non-attached
shard, check the shard's policy and return 409 if it isn't Attached,
else return 503 if the shard's policy is attached but it hasn't been
reconciled yet (i.e. has a None generation_pageserver)
2024-09-02 11:36:57 +00:00
Arpad Müller
3ec785f30d Add safekeeper scrubber test (#8785)
The test is very rudimentary, it only checks that before and after
tenant deletion, we can run `scan_metadata` for the safekeeper node
kind. Also, we don't actually expect any uploaded data, for that we
don't have enough WAL (needs to create at least one S3-uploaded file,
the scrubber doesn't recognize partial files yet).

The `scan_metadata` scrubber subcommand is extended to support either
specifying a database connection string, which was previously the only
way, and required a database to be present, or specifying the timeline
information manually via json. This is ideal for testing scenarios
because in those, the number of timelines is usually limited,
but it is involved to spin up a database just to write the timeline
information.
2024-08-31 01:12:25 +02:00
Alex Chi Z.
05caaab850 fix(pageserver): fire layer eviction alert only when it's visible (#8882)
The pull request https://github.com/neondatabase/neon/pull/8679
explicitly mentioned that it will evict layers earlier than before.
Given that the eviction metrics is solely based on eviction threshold
(which is 86400s now), we should consider the early eviction and do not
fire alert if it's a covered layer.

## Summary of changes

Record eviction timer only when the layer is visible + accessed.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-30 17:22:26 -04:00
Yuchen Liang
cacb1ae333 pageserver: set default io_buffer_alignment to 512 bytes (#8878)
## Summary of changes

- Setting default io_buffer_alignment to 512 bytes. 
- Fix places that assumed `DEFAULT_IO_BUFFER_ALIGNMENT=0`
- Adapt unit tests to handle merge with `chunk size <= 4096`.

## Testing and Performance

We have done sufficient performance de-risking. 

Enabling it by default completes our correctness de-risking before the
next release.

Context: https://neondb.slack.com/archives/C07BZ38E6SD/p1725026845455259

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-08-30 19:53:52 +01:00
Alex Chi Z.
df971f995c feat(storage-scrubber): check layer map validity (#8867)
When implementing bottom-most gc-compaction, we analyzed the structure
of layer maps that the current compaction algorithm could produce, and
decided to only support structures without delta layer overlaps and LSN
intersections with the exception of single key layers.

## Summary of changes

This patch adds the layer map valid check in the storage scrubber.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-30 14:12:39 -04:00
Alexander Bayandin
e58e045ebb CI(promote-compatibility-data): fix job (#8871)
## Problem

`promote-compatibility-data` job got broken and slightly outdated after 
- https://github.com/neondatabase/neon/pull/8552 -- we don't upload
artifacts for ARM64
- https://github.com/neondatabase/neon/pull/8561 -- we don't prepare
`debug` artifacts in the release branch anymore

## Summary of changes
- Promote artifacts from release PRs to the latest version (but do it
from `release` branch)
- Upload artifacts for both X64 and ARM64
2024-08-30 13:18:30 +01:00
John Spray
20f82f9169 storage controller: sleep between compute notify retries (#8869)
## Problem

Live migration retries when it fails to notify the compute of the new
location. It should sleep between attempts.

Closes: https://github.com/neondatabase/neon/issues/8820

## Summary of changes

- Do an `exponential_backoff` in the retry loop for compute
notifications
2024-08-30 11:44:13 +01:00
Conrad Ludgate
72aa6b02da chore: speed up testing (#8874)
`safekeeper::random_test test_random_schedules` debug test takes over 2
minutes to run on our arm runners. Running it 6 times with pageserver
settings seems redundant.
2024-08-30 11:34:23 +01:00
Conrad Ludgate
022fad65eb proxy: fix password hash cancellation (#8868)
In #8863 I replaced the threadpool with tokio tasks, but there was a
behaviour I missed regarding cancellation. Adding the JoinHandle wrapper
that triggers abort on drop should fix this.

Another change, any panics that occur in password hashing will be
propagated through the resume_unwind functionality.
2024-08-29 20:16:44 +01:00
Arpad Müller
8eaa8ad358 Remove async_trait usages from safekeeper and neon_local (#8864)
Removes additional async_trait usages from safekeeper and neon_local.

Also removes now redundant dependencies of the `async_trait` crate.

cc earlier work: #6305, #6464, #7303, #7342, #7212, #8296
2024-08-29 18:24:25 +02:00
Alex Chi Z.
653a6532a2 fix(pageserver): reject non-i128 key on the write path (#8648)
It's better to reject invalid keys on the write path than storing it and
panic-ing the pageserver.
https://github.com/neondatabase/neon/issues/8636

## Summary of changes

If a key cannot be represented using i128, we don't allow writing that
key into the pageserver.

There are two versions of the check valid function: the normal one that
simply rejects i128 keys, and the stronger one that rejects all keys
that we don't support.

The current behavior when a key gets rejected is that safekeeper will
keep retrying streaming that key to the pageserver. And once such key
gets written, no new computes can be started. Therefore, there could be
a large amount of pageserver warnings if a key cannot be ingested. To
validate this behavior by yourself, the reviewer can (1) use the
stronger version of the valid check (2) run the following SQL.

```
set neon.regress_test_mode = true;
CREATE TABLESPACE regress_tblspace LOCATION '/Users/skyzh/Work/neon-test/tablespace';
CREATE SCHEMA testschema;
CREATE TABLE testschema.foo (i int) TABLESPACE regress_tblspace;
insert into testschema.foo values (1), (2), (3);
```

For now, I'd like to merge the patch with only rejecting non-i128 keys.
It's still unknown whether the stronger version covers all the cases
that basebackup doesn't support. Furthermore, the behavior of rejecting
a key will produce large amounts of warnings due to safekeeper retry.
Therefore, I'd like to reject the minimum set of keys that we don't
support (i128 ones) for now. (well, erroring out is better than panic on
`to_compact_key`)

The next step is to fix the safekeeper behavior (i.e., on such key
rejections, stop streaming WAL), so that we can properly stop writing.
An alternative solution is to simply drop these keys on the write path.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-29 10:07:05 -04:00
Alex Chi Z.
18bfc43fa7 fix(pageserver): add dry-run to force compact API (#8859)
Add `dry-run` flag to the compact API

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-29 10:01:54 -04:00
Conrad Ludgate
7ce49fe6e3 proxy: improve test performance (#8863)
Some tests were very slow and some tests occasionally stalled. This PR
improves some test performance and replaces the custom threadpool in
order to fix the stalling of tests.
2024-08-29 13:20:15 +00:00
Christian Schwarz
a8fbc63be2 tenant background loops: periodic log message if long-running iteration (#8832)
refs https://github.com/neondatabase/neon/issues/7524

Problem
-------

When browsing Pageserver logs, background loop iterations that take a
long time are hard to spot / easy to miss because they tend to not
produce any log messages unless:

- they overrun their period, but that's only one message when the
iteration completes late
- they do something that produces logs (e.g., create image layers)

Further, a slow iteration that is still running does will not
log nor bump the metrics of `warn_when_period_overrun`until _after_
it has finished. Again, that makes a still-running iteration hard to
spot.

Solution
--------

This PR adds a wrapper around the per-tenant background loops
that, while a slow iteration is ongoing, emit a log message
every $period.
2024-08-29 15:06:13 +02:00
Arpad Müller
96b5c4d33d Don't unarchive a timeline if its ancestor is archived (#8853)
If a timeline unarchival request comes in, give an error if the parent
timeline is archived. This prevents us from the situation of having an
archived timeline with children that are not archived.

Follow up of #8824

Part of #8088

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-08-29 12:54:02 +00:00
Christian Schwarz
c7481402a0 pageserver: default to 4MiB stack size and add env var to control it (#8862)
# Motivation

In https://github.com/neondatabase/neon/pull/8832 I get tokio runtime
worker stack overflow errors in debug builds.

In a similar vein, I had tokio runtimer worker stack overflow when
trying to eliminate `async_trait`
(https://github.com/neondatabase/neon/pull/8296).

The 2MiB default is kind of arbitrary - so this PR bumps it to 4MiB.

It also adds an env var to control it.

# Risk Assessment

With our 4 runtimes, the worst case stack memory usage is `4 (runtimes)
* ($num_cpus (executor threads) + 512 (blocking pool threads)) * 4MiB`.

On i3en.3xlarge, that's `8384 MiB`. 
On im4gn.2xlarge, that's `8320 MiB`.
Before this change, it was half that.

Looking at production metrics, we _do_ have the headroom to accomodate
this worst case case.

# Alternatives

The problems only occur with debug builds, so technically we could only
raise the stack size for debug builds.

However, it would be another configuration where `debug != release`.

# Future Work

If we ever enable single runtime mode in prod (=>
https://github.com/neondatabase/neon/issues/7312 ) then the worst case
will drop to 25% of its current value.

Eliminating the use of `tokio::spawn_blocking` / `tokio::fs` in favor of
`tokio-epoll-uring` (=> https://github.com/neondatabase/neon/issues/7370
) would reduce the worst case to `4 (runtimes) * $num_cpus (executor
threads) * 4 MiB`.
2024-08-29 14:02:27 +02:00
Conrad Ludgate
a644f01b6a proxy+pageserver: shared leaky bucket impl (#8539)
In proxy I switched to a leaky-bucket impl using the GCRA algorithm. I
figured I could share the code with pageserver and remove the
leaky_bucket crate dependency with some very basic tokio timers and
queues for fairness.

The underlying algorithm should be fairly clear how it works from the
comments I have left in the code.

---

In benchmarking pageserver, @problame found that the new implementation
fixes a getpage throughput discontinuity in pageserver under the
`pagebench get-page-latest-lsn` benchmark with the clickbench dataset
(`test_perf_olap.py`).
The discontinuity is that for any of `--num-clients={2,3,4}`, getpage
throughput remains 10k.
With `--num-clients=5` and greater, getpage throughput then jumps to the
configured 20k rate limit.
With the changes in this PR, the discontinuity is gone, and we scale
throughput linearly to `--num-clients` until the configured rate limit.

More context in
https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641.

closes https://github.com/neondatabase/cloud/issues/16886

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-08-29 11:26:52 +00:00
Christian Schwarz
c2f8fdccd7 ingest: rate-limited warning if WAL commit timestamps lags for > wait_lsn_timeout (#8839)
refs https://github.com/neondatabase/cloud/issues/13750

The logging in this commit will make it easier to detect lagging ingest.

We're trusting compute timestamps --- ideally we'd use SK timestmaps
instead.
But trusting the compute timestamp is ok for now.
2024-08-29 12:06:00 +01:00
Konstantin Knizhnik
cfa45ff5ee Undo walloging replorgin file on checkpoint (#8794)
## Problem

See #8620

## Summary of changes

Remove walloping of replorigin file because it is reconstructed by PS

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-08-29 07:45:33 +03:00
Andrew Rudenko
acc075071d feat(compute_ctl): add periodic lease lsn requests for static computes (#7994)
Part of #7497

## Problem

Static computes pinned at some fix LSN could be created initially within
PITR interval but eventually go out it. To make sure that Static
computes are not affected by GC, we need to start using the LSN lease
API (introduced in #8084) in compute_ctl.

## Summary of changes

**compute_ctl**
- Spawn a thread for when a static compute starts to periodically ping
pageserver(s) to make LSN lease requests.
- Add `test_readonly_node_gc` to test if static compute can read all
pages without error.
  - (test will fail on main without the code change here)

**page_service**
- `wait_or_get_last_lsn` will now allow `request_lsn` less than
`latest_gc_cutoff_lsn` to proceed if there is a lease on `request_lsn`.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
2024-08-28 19:09:26 +00:00
Christian Schwarz
9627747d35 bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush (#8537)
Part of [Epic: Bypass PageCache for user data
blocks](https://github.com/neondatabase/neon/issues/7386).

# Problem

`InMemoryLayer` still uses the `PageCache` for all data stored in the
`VirtualFile` that underlies the `EphemeralFile`.

# Background

Before this PR, `EphemeralFile` is a fancy and (code-bloated) buffered
writer around a `VirtualFile` that supports `blob_io`.

The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`.
At those offset, we find a varint length followed by the serialized
`Value`.

Vectored reads (`get_values_reconstruct_data`) are not in fact vectored
- each `Value` that needs to be read is read sequentially.

The `will_init` bit of information which we use to early-exit the
`get_values_reconstruct_data` for a given key is stored in the
serialized `Value`, meaning we have to read & deserialize the `Value`
from the `EphemeralFile`.

The L0 flushing **also** needs to re-determine the `will_init` bit of
information, by deserializing each value during L0 flush.

# Changes

1. Store the value length and `will_init` information in the
`InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the
values.
2. For `get_values_reconstruct_data`:
- Use the in-memory `index` figures out which values need to be read.
Having the `will_init` stored in the index enables us to do that.
- View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes
in size (adjustable constant). A "DIO chunk" is the minimal unit that we
can read under direct IO.
- Figure out which chunks need to be read to retrieve the serialized
bytes for thes values we need to read.
- Coalesce chunk reads such that each DIO chunk is only read once to
serve all value reads that need data from that chunk.
- Merge adjacent chunk reads into larger
`EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable
constant).
3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer
from the underlying VirtualFile and/or its in-memory buffer.
4. The L0 flush code is changed to use the `index` directly, `blob_io` 
5. We can remove the `ephemeral_file::page_caching` construct now.

The `get_values_reconstruct_data` changes seem like a bit overkill but
they are necessary so we issue the equivalent amount of read system
calls compared to before this PR where it was highly likely that even if
the first PageCache access was a miss, remaining reads within the same
`get_values_reconstruct_data` call from the same `EphemeralFile` page
were a hit.

The "DIO chunk" stuff is truly unnecessary for page cache bypass, but,
since we're working on [direct
IO](https://github.com/neondatabase/neon/issues/8130) and
https://github.com/neondatabase/neon/issues/8719 specifically, we need
to do _something_ like this anyways in the near future.

# Alternative Design

The original plan was to use the `vectored_blob_io` code it relies on
the invariant of Delta&Image layers that `index order == values order`.

Further, `vectored_blob_io` code's strategy for merging IOs is limited
to adjacent reads. However, with direct IO, there is another level of
merging that should be done, specifically, if multiple reads map to the
same "DIO chunk" (=alignment-requirement-sized and -aligned region of
the file), then it's "free" to read the chunk into an IO buffer and
serve the two reads from that buffer.
=> https://github.com/neondatabase/neon/issues/8719

# Testing / Performance

Correctness of the IO merging code is ensured by unit tests.

Additionally, minimal tests are added for the `EphemeralFile`
implementation and the bit-packed `InMemoryLayerIndexValue`.

Performance testing results are presented below.
All pref testing done on my M2 MacBook Pro, running a Linux VM.
It's a release build without `--features testing`.

We see definitive improvement in ingest performance microbenchmark and
an ad-hoc microbenchmark for getpage against InMemoryLayer.

```
baseline: commit 7c74112b2a origin/main
HEAD: ef1c55c52e
```

<details>

```
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [483.50 ms 498.73 ms 522.53 ms]
                        thrpt:  [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]

HEAD

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [479.22 ms 482.92 ms 487.35 ms]
                        thrpt:  [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]
```

</details>

We don't have a micro-benchmark for InMemoryLayer and it's quite
cumbersome to add one. So, I did manual testing in `neon_local`.

<details>

```

  ./target/release/neon_local stop
  rm -rf .neon
  ./target/release/neon_local init
  ./target/release/neon_local start
  ./target/release/neon_local tenant create --set-default
  ./target/release/neon_local endpoint create foo
  ./target/release/neon_local endpoint start foo
  psql 'postgresql://cloud_admin@127.0.0.1:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)

CREATE TABLE wal_test (
    id SERIAL PRIMARY KEY,
    data TEXT
);

DO $$
DECLARE
    i INTEGER := 1;
BEGIN
    WHILE i <= 500000 LOOP
        INSERT INTO wal_test (data) VALUES ('data');
        i := i + 1;
    END LOOP;
END $$;

-- => result is one L0 from initdb and one 137M-sized ephemeral-2

DO $$
DECLARE
    i INTEGER := 1;
    random_id INTEGER;
    random_record wal_test%ROWTYPE;
    start_time TIMESTAMP := clock_timestamp();
    selects_completed INTEGER := 0;
    min_id INTEGER := 1;  -- Minimum ID value
    max_id INTEGER := 100000;  -- Maximum ID value, based on your insert range
    iters INTEGER := 100000000;  -- Number of iterations to run
BEGIN
    WHILE i <= iters LOOP
        -- Generate a random ID within the known range
        random_id := min_id + floor(random() * (max_id - min_id + 1))::int;

        -- Select the row with the generated random ID
        SELECT * INTO random_record
        FROM wal_test
        WHERE id = random_id;

        -- Increment the select counter
        selects_completed := selects_completed + 1;

        -- Check if a second has passed
        IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
            -- Print the number of selects completed in the last second
            RAISE NOTICE 'Selects completed in last second: %', selects_completed;

            -- Reset counters for the next second
            selects_completed := 0;
            start_time := clock_timestamp();
        END IF;

        -- Increment the loop counter
        i := i + 1;
    END LOOP;
END $$;

./target/release/neon_local stop

baseline: commit 7c74112b2a origin/main

NOTICE:  Selects completed in last second: 1864
NOTICE:  Selects completed in last second: 1850
NOTICE:  Selects completed in last second: 1851
NOTICE:  Selects completed in last second: 1918
NOTICE:  Selects completed in last second: 1911
NOTICE:  Selects completed in last second: 1879
NOTICE:  Selects completed in last second: 1858
NOTICE:  Selects completed in last second: 1827
NOTICE:  Selects completed in last second: 1933

ours

NOTICE:  Selects completed in last second: 1915
NOTICE:  Selects completed in last second: 1928
NOTICE:  Selects completed in last second: 1913
NOTICE:  Selects completed in last second: 1932
NOTICE:  Selects completed in last second: 1846
NOTICE:  Selects completed in last second: 1955
NOTICE:  Selects completed in last second: 1991
NOTICE:  Selects completed in last second: 1973
```

NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller.

</details>

# Rollout

This PR changes the code in-place and  is not gated by a feature flag.
2024-08-28 18:31:41 +00:00
Alex Chi Z.
63a0d0d039 fix(storage-scrubber): make retry error into warnings (#8851)
We get many HTTP connect timeout errors from scrubber logs, and it
turned out that the scrubber is retrying, and this is not an actual
error. In the future, we should revisit all places where we log errors
in the storage scrubber, and only error when necessary (i.e., errors
that might need manual fixing)

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-28 13:39:21 -04:00
Vlad Lazar
793b5061ec storcon: track pageserver availability zone (#8852)
## Problem
In order to build AZ aware scheduling, the storage controller needs to
know what AZ pageservers are in.

Related https://github.com/neondatabase/neon/issues/8848

## Summary of changes
This patch set adds a new nullable column to the `nodes` table:
`availability_zone_id`. The node registration
request is extended to include the AZ id (pageservers already have this
in their `metadata.json` file).

If the node is already registered, then we update the persistent and
in-memory state with the provided AZ.
Otherwise, we add the node with the AZ to begin with.

A couple assumptions are made here:
1. Pageserver AZ ids are stable
2. AZ ids do not change over time

Once all pageservers have a configured AZ, we can remove the optionals
in the code and make the database column not nullable.
2024-08-28 18:23:55 +01:00
Yuchen Liang
a889a49e06 pageserver: do vectored read on each dio-aligned section once (#8763)
Part of #8130, closes #8719.

## Problem

Currently, vectored blob io only coalesce blocks if they are immediately
adjacent to each other. When we switch to Direct IO, we need a way to
coalesce blobs that are within the dio-aligned boundary but has gap
between them.

## Summary of changes

- Introduces a `VectoredReadCoalesceMode` for `VectoredReadPlanner` and
`StreamingVectoredReadPlanner` which has two modes:
  - `AdjacentOnly` (current implementation)
  - `Chunked(<alignment requirement>)`
- New `ChunkedVectorBuilder` that considers batching `dio-align`-sized
read, the start and end of the vectored read will respect
`stx_dio_offset_align` / `stx_dio_mem_align` (`vectored_read.start` and
`vectored_read.blobs_at.first().start_offset` will be two different
value).
- Since we break the assumption that blobs within single `VectoredRead`
are next to each other (implicit end offset), we start to store blob end
offsets in the `VectoredRead`.
- Adapted existing tests to run in both `VectoredReadCoalesceMode`.
- The io alignment can also be live configured at runtime.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-08-28 15:54:42 +01:00
Vlad Lazar
5eb7322d08 docs: rolling storage controller restarts RFC (#8310)
## Problem
Storage controller upgrades (restarts, more generally) can cause
multi-second availability gaps.
While the storage controller does not sit on the main data path, it's
generally not acceptable
to block management requests for extended periods of time (e.g.
https://github.com/neondatabase/neon/issues/8034).

## Summary of changes
This RFC describes the issues around the current storage controller
restart procedure
and describes an implementation which reduces downtime to a few
milliseconds on the happy path.

Related https://github.com/neondatabase/neon/issues/7797
2024-08-28 13:56:14 +00:00
Joonas Koivunen
c0ba18a112 bench: flush before shutting down (#8844)
while driving by:
- remove the extra tenant
- remove the extra timelines

implement this by turning the pg_compare to a yielding fixture.

evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10571779162/index.html#suites/9681106e61a1222669b9d22ab136d07b/3bbe9f007b3ffae1/
2024-08-28 10:20:43 +01:00
John Spray
992a951b5e .github: direct feature requests to the feedback form (#8849)
## Problem

When folks open github issues for feature requests, they don't have a
clear recipient: engineers usually see them during bug triage, but that
doesn't necessarily get the work prioritized.

## Summary of changes

Give end users a clearer path to submitting feedback to Neon
2024-08-28 09:22:19 +01:00
Heikki Linnakangas
c5ef779801 tests: Remove unnecessary entries from list of allowed errors (#8199)
The "manual_gc" context was removed in commit be0c73f8e7. The code that
generated the other error was removed in commit 9a6c0be823.
2024-08-27 17:47:05 +01:00
Heikki Linnakangas
2d10306f7a Remove support for pageserver <-> compute protocol version 1 (#8774)
Protocol version 2 has been the default for a while now, and we no
longer have any computes running in production that used protocol
version 1. This completes the migration by removing support for v1 in
both the pageserver and the compute.

See issue #6211.
2024-08-27 18:36:33 +03:00
Alexey Kondratov
9b9f90c562 fix(walproposer): Do not restart on safekeepers reordering (#8840)
## Problem

Currently, we compare `neon.safekeepers` values as is, so we
unnecessarily restart walproposer even if safekeepers set didn't change.
This leads to errors like:
```log
FATAL:  [WP] restarting walproposer to change safekeeper list
from safekeeper-8.us-east-2.aws.neon.tech:6401,safekeeper-11.us-east-2.aws.neon.tech:6401,safekeeper-10.us-east-2.aws.neon.tech:6401
to safekeeper-11.us-east-2.aws.neon.tech:6401,safekeeper-8.us-east-2.aws.neon.tech:6401,safekeeper-10.us-east-2.aws.neon.tech:6401
```

## Summary of changes

Split the GUC into the list of individual safekeepers and properly
compare. We could've done that somewhere on the upper level, e.g.,
control plane, but I think it's still better when the actual config
consumer is smarter and doesn't rely on upper levels.
2024-08-27 15:49:47 +02:00
Folke Behrens
52cb33770b proxy: Rename backend types and variants as prep for refactor (#8845)
* AuthBackend enum to AuthBackendType
* BackendType enum to Backend
* Link variants to Web
* Adjust messages, comments, etc.
2024-08-27 14:12:42 +02:00
Conrad Ludgate
12850dd5e9 proxy: remove dead code (#8847)
By marking everything possible as pub(crate), we find a few dead code
candidates.
2024-08-27 12:00:35 +01:00
a-masterov
5d527133a3 Fix the pg_hintplan flakyness (#8834)
## Problem
pg_hintplan test seems to be flaky, sometimes it fails, while usually it
passes

## Summary of changes

The regression test is changed to filter out the Neon service queries. The
expected file is changed as well.

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-08-27 12:39:42 +02:00
Arseny Sher
09362b6363 safekeeper: reorder routes and their handlers.
Routes and their handlers were in a bit different order in 1) routes
list 2) their implementation 3) python client 4) openapi spec, making
addition of new ones intimidating. Make it the same everywhere, roughly
lexicographically but preserving some of existing logic.

No functional changes.
2024-08-27 07:37:55 +03:00
Alexey Kondratov
7820c572e7 fix(sql-exporter): Remove tenant_id from compute_logical_snapshot_files
It appeared to be that it's already auto-added to all metrics [1]

[1]: 3a907c317c/apps/base/ext-vmagent/vmagent.yaml (L43)
2024-08-27 00:51:23 +02:00
Alexey Kondratov
bf03713fa1 fix(sql-exporter): Fix typo in gauge
In f4b3c317f there was a typo and I missed that on review
2024-08-27 00:51:23 +02:00
Alex Chi Z.
0f65684263 feat(pageserver): use split layer writer in gc-compaction (#8608)
Part of #8002, the final big PR in the batch.

## Summary of changes

This pull request uses the new split layer writer in the gc-compaction.

* It changes how layers are split. Previously, we split layers based on
the original split point, but this creates too many layers
(test_gc_feedback has one key per layer).
* Therefore, we first verify if the layer map can be processed by the
current algorithm (See https://github.com/neondatabase/neon/pull/8191,
it's basically the same check)
* On that, we proceed with the compaction. This way, it creates a large
enough layer close to the target layer size.
* Added a new set of functions `with_discard` in the split layer writer.
This helps us skip layers if we are going to produce the same persistent
key.
* The delta writer will keep the updates of the same key in a single
file. This might create a super large layer, but we can optimize it
later.
* The split layer writer is used in the gc-compaction algorithm, and it
will split layers based on size.
* Fix the image layer summary block encoded the wrong key range.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-08-26 14:19:47 -04:00
Christian Schwarz
97241776aa pageserver: startup: ensure local disk state is durable (#8835)
refs https://github.com/neondatabase/neon/issues/6989

Problem
-------

After unclean shutdown, we get restarted, start reading the local
filesystem,
and make decisions based on those reads. However, some of the data might
have
not yet been fsynced when the unclean shutdown completed.

Durability matters even though Pageservers are conceptually just a cache
of state in S3. For example:
- the cloud control plane is no control loop => pageserver responses
  to tenant attachmentm, etc, needs to be durable.
  - the storage controller does not rely on this (as much?)
- we don't have layer file checksumming, so, downloaded+renamed but not
  fsynced layer files are technically not to be trusted
  - https://github.com/neondatabase/neon/issues/2683

Solution
--------

`syncfs` the tenants directory during startup, before we start reading
from it.

This is a bit overkill because we do remove some temp files
(InMemoryLayer!)
later during startup. Further, these temp files are particularly likely
to
be dirty in the kernel page cache. However, we don't want to refactor
that
cleanup code right now, and the dirty data on pageservers is generally
not that high. Last, with [direct
IO](https://github.com/neondatabase/neon/issues/8130) we're going to
have near-zero kernel page cache anyway quite soon.
2024-08-26 18:07:55 +02:00
Arpad Müller
2dd53e7ae0 Timeline archival test (#8824)
This PR:

* Implements the rule that archived timelines require all of their
children to be archived as well, as specified in the RFC. There is no
fancy locking mechanism though, so the precondition can still be broken.
As a TODO for later, we still allow unarchiving timelines with archived
parents.
* Adds an `is_archived` flag to `TimelineInfo`
* Adds timeline_archival_config to `PageserverHttpClient`
* Adds a new `test_timeline_archive` test, loosely based on
`test_timeline_delete`

Part of #8088
2024-08-26 17:30:19 +02:00
Folke Behrens
d6eede515a proxy: clippy lints: handle some low hanging fruit (#8829)
Should be mostly uncontroversial ones.
2024-08-26 15:16:54 +02:00
Alexey Kondratov
d48229f50f feat(compute): Introduce new compute_subscriptions_count metric (#8796)
## Problem

We need some metric to sneak peek into how many people use inbound
logical replication (Neon is a subscriber).

## Summary of changes

This commit adds a new metric `compute_subscriptions_count`, which is
number of subscriptions grouped by enabled/disabled state.

Resolves: neondatabase/cloud#16146
2024-08-26 14:34:18 +02:00
Jakub Kołodziejczak
cdfdcd3e5d chore: improve markdown formatting (#8825)
fixes:

![Screenshot_2024-08-25_16-25-30](https://github.com/user-attachments/assets/c993309b-6c2d-4938-9fd0-ce0953fc63ff)

fixes:

![Screenshot_2024-08-25_16-26-29](https://github.com/user-attachments/assets/cf497f4a-d9e3-45a6-a1a5-7e215d96d022)
2024-08-25 16:33:45 +01:00
Conrad Ludgate
06795c6b9a proxy: new local-proxy application (#8736)
Add binary for local-proxy that uses the local auth backend. Runs only
the http serverless driver support and offers config reload based on a
config file and SIGHUP
2024-08-23 22:32:10 +01:00
Conrad Ludgate
701cb61b57 proxy: local auth backend (#8806)
Adds a Local authentication backend. Updates http to extract JWT bearer
tokens and passes them to the local backend to validate.
2024-08-23 18:48:06 +00:00
John Spray
0aa1450936 storage controller: enable timeline CRUD operations to run concurrently with reconciliation & make them safer (#8783)
## Problem

- If a reconciler was waiting to be able to notify computes about a
change, but the control plane was waiting for the controller to finish a
timeline creation/deletion, the overall system can deadlock.
- If a tenant shard was migrated concurrently with a timeline
creation/deletion, there was a risk that the timeline operation could be
applied to a non-latest-generation location, and thereby not really be
persistent. This has never happened in practice, but would eventually
happen at scale.

Closes: #8743 

## Summary of changes

- Introduce `Service::tenant_remote_mutation` helper, which looks up
shards & generations and passes them into an inner function that may do
remote I/O to pageservers. Before returning success, this helper checks
that generations haven't incremented, to guarantee that changes are
persistent.
- Convert tenant_timeline_create, tenant_timeline_delete, and
tenant_timeline_detach_ancestor to use this helper.
- These functions no longer block on ensure_attached unless the tenant
was never attached at all, so they should make progress even if we can't
complete compute notifications.

This increases the database load from timeline/create operations, but
only with cheap read transactions.
2024-08-23 18:56:05 +01:00
John Spray
b65a95f12e controller: use PageserverUtilization for scheduling (#8711)
## Problem

Previously, the controller only used the shard counts for scheduling.
This works well when hosting only many-sharded tenants, but works much
less well when hosting single-sharded tenants that have a greater
deviation in size-per-shard.

Closes: https://github.com/neondatabase/neon/issues/7798

## Summary of changes

- Instead of UtilizationScore, carry the full PageserverUtilization
through into the Scheduler.
- Use the PageserverUtilization::score() instead of shard count when
ordering nodes in scheduling.

Q: Why did test_sharding_split_smoke need updating in this PR?
A: There's an interesting side effect during shard splits: because we do
not decrement the shard count in the utilization when we de-schedule the
shards from before the split, the controller will now prefer to pick
_different_ nodes for shards compared with which ones held secondaries
before the split. We could use our knowledge of splitting to fix up the
utilizations more actively in this situation, but I'm leaning toward
leaving the code simpler, as in practical systems the impact of one
shard on the utilization of a node should be fairly low (single digit
%).
2024-08-23 18:32:56 +01:00
Conrad Ludgate
c1cb7a0fa0 proxy: flesh out JWT verification code (#8805)
This change adds in the necessary verification steps for the JWT
payload, and adds per-role querying of JWKs as needed for #8736
2024-08-23 18:01:02 +01:00
Alex Chi Z.
f4cac1f30f impr(pageserver): error if keys are unordered in merge iter (#8818)
In case of corrupted delta layers, we can detect the corruption and bail
out the compaction.

## Summary of changes

* Detect wrong delta desc of key range 
* Detect unordered deltas

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-23 16:38:42 +00:00
Conrad Ludgate
612b643315 update diesel (#8816)
https://rustsec.org/advisories/RUSTSEC-2024-0365
2024-08-23 15:28:22 +00:00
Vlad Lazar
bcc68a7866 storcon_cli: add support for drain and fill operations (#8791)
## Problem
We have been naughty and curl-ed storcon to fix-up drains and fills.

## Summary of changes
Add support for starting/cancelling drain/fill operations via
`storcon_cli`.
2024-08-23 14:48:06 +01:00
Joonas Koivunen
73286e6b9f test: copy dict to avoid error on retry (#8811)
there is no "const" in python, so when we modify the global dict, it
will remain that way on the retry. fix to not have it influence other
tests which might be run on the same runner.

evidence:
<https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8625/10513146742/index.html#/testresult/453c4ce05ada7496>
2024-08-23 14:43:08 +01:00
Alex Chi Z.
bc8cfe1b55 fix(pageserver): l0 check criteria (#8797)
close https://github.com/neondatabase/neon/issues/8579

## Summary of changes

The `is_l0` check now takes both layer key range and the layer type.
This allows us to have image layers covering the full key range in
btm-most compaction (upcoming PR). However, we still don't allow delta
layers to cover the full key range, and I will make btm-most compaction
to generate delta layers with the key range of the keys existing in the
layer instead of `Key::MIN..Key::HACK_MAX` (upcoming PR).


Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-23 09:42:45 -04:00
Alex Chi Z.
6a74bcadec feat(pageserver): remove features=testing restriction for compact (#8815)
A small PR to make it possible to run force compaction in staging for
btm-gc compaction testing.

Part of https://github.com/neondatabase/neon/issues/8002

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-23 14:32:00 +01:00
Alexander Bayandin
e62cd9e121 CI(autocomment): add arch to build type (#8809)
## Problem

Failed / flaky tests for different arches don't have any difference in
GitHub Autocomment

## Summary of changes
- Add arch to build type for GitHub autocomment
2024-08-23 14:29:11 +01:00
Arpad Müller
e80ab8fd6a Update serde_json to 1.0.125 (#8813)
Updates `serde_json` to `1.0.125`, rolling out speedups added by a
serde_json contributor.

Release [link](https://github.com/serde-rs/json/releases/tag/1.0.125).
Blog post
[link](https://purplesyringa.moe/blog/i-sped-up-serde-json-strings-by-20-percent/).
2024-08-23 12:14:14 +01:00
MMeent
d8ca495eae Require poetry >=1.8 (#8812)
This was already a requirement for installing the python packages after
https://github.com/neondatabase/neon/pull/8609 got merged, so this
updates the documentation to reflect that.
2024-08-23 11:48:26 +01:00
Heikki Linnakangas
dbdb8a1187 Document how to use "git merge" for PostgreSQL minor version upgrades. (#8692)
Our new policy is to use the "rebase" method and slice all the Neon
commits into a nice patch set when doing a new major version, and use
"merge" method on minor version upgrades on the release branches.

"git merge" preserves the git history of Neon commits on the Postgres
branches. While it's nice to rebase all the Neon changes to a logical
patch set against upstream, having to do it between every minor release
is a fair amount work, and it loses the history, and is more
error-prone.
2024-08-23 09:15:55 +03:00
Tristan Partin
f7ab3ffcb7 Check that TERM != dumb before using colors in pre-commit.py
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-08-22 18:03:45 -05:00
Tristan Partin
2f8d548a12 Update Postgres 16 to 16.4
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-08-22 18:03:45 -05:00
Tristan Partin
66db381dc9 Update Postgres 15 to 15.8
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-08-22 18:03:45 -05:00
Tristan Partin
6744ed19d8 Update Postgres 14 to 14.13
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-08-22 18:03:45 -05:00
Tristan Partin
ae63ac7488 Write messages field by field instead of bytes sheet in test_simple_sync_safekeepers
Co-authored-by: Arseny Sher <ars@neon.tech>
2024-08-22 18:03:45 -05:00
Alex Chi Z.
6eb638f4b3 feat(pageserver): warn on aux v1 tenants + default to v2 (#8625)
part of https://github.com/neondatabase/neon/issues/8623

We want to discover potential aux v1 customers that we might have missed
from the migrations.

## Summary of changes

Log warnings on basebackup, load timeline, and the first put_file.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-22 22:31:38 +01:00
Konstantin Knizhnik
7a485b599b Fix race condition in LRU list update in get_cached_relsize (#8807)
## Problem

See https://neondb.slack.com/archives/C07J14D8GTX/p1724347552023709
Manipulations with LRU list in relation size cache are performed under
shared lock

## Summary of changes

Take exclusive lock

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-08-22 23:53:37 +03:00
Joonas Koivunen
b1c457898b test_compatibility: flush in the end (#8804)
`test_forward_compatibility` is still often failing at graceful
shutdown. Fix this by explicit flush before shutdown.

Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10506613738/index.html#testresult/5e7111907f7ecfb2/

Cc: #8655 and #8708
Previous attempt: #8787
2024-08-22 16:38:03 +01:00
Folke Behrens
1a9d559be8 proxy: Enable stricter/pedantic clippy checks (#8775)
Create a list of currently allowed exceptions that should be reduced
over time.
2024-08-22 13:29:05 +02:00
Alexey Kondratov
0e6c0d47a5 Revert "Use sycnhronous commit for logical replicaiton worker (#8645)" (#8792)
This reverts commit cbe8c77997.

This change was originally made to test a hypothesis, but after that,
the proper fix #8669 was merged, so now it's not needed. Moreover, the
test is still flaky, so probably this bug was not a reason of the
flakiness.

Related to #8097
2024-08-22 12:52:36 +02:00
Arpad Müller
d645645fab Sleep in test_scrubber_physical_gc (#8798)
This copies a piece of code from `test_scrubber_physical_gc_ancestors`
to fix a source of flakiness: later on we rely on stuff being older than
a second, but the test can run faster under optimal conditions (as
happened to me locally, but also obvservable in
[this](https://neon-github-public-dev.s3.amazonaws.com/reports/main/10470762360/index.html#testresult/f713b02657db4b4c/retries)
allure report):

```
test_runner/regress/test_storage_scrubber.py:169: in test_scrubber_physical_gc
    assert gc_summary["remote_storage_errors"] == 0
E   assert 1 == 0
```
2024-08-22 12:45:29 +02:00
John Spray
7c74112b2a pageserver: batch InMemoryLayer puts, remove need to sort items by LSN during ingest (#8591)
## Problem/Solution

TimelineWriter::put_batch is simply a loop over individual puts. Each
put acquires and releases locks, and checks for potentially starting a
new layer. Batching these is more efficient, but more importantly
unlocks future changes where we can pre-build serialized buffers much
earlier in the ingest process, potentially even on the safekeeper
(imagine a future model where some variant of DatadirModification lives
on the safekeeper).

Ensuring that the values in put_batch are written to one layer also
enables a simplification upstream, where we no longer need to write
values in LSN-order. This saves us a sort, but also simplifies follow-on
refactors to DatadirModification: we can store metadata keys and data
keys separately at that level without needing to zip them together in
LSN order later.

## Why?

In this PR, these changes are simplify optimizations, but they are
motivated by evolving the ingest path in the direction of disentangling
extracting DatadirModification from Timeline. It may not obvious how
right now, but the general idea is that we'll end up with three phases
of ingest:
- A) Decode walrecords and build a datadirmodification with all the
simple data contents already in a big serialized buffer ready to write
to an ephemeral layer **<-- this part can be pipelined and parallelized,
and done on a safekeeper!**
- B) Let that datadirmodification see a Timeline, so that it can also
generate all the metadata updates that require a read-modify-write of
existing pages
- C) Dump the results of B into an ephemeral layer.

Related: https://github.com/neondatabase/neon/issues/8452

## Caveats

Doing a big monolithic buffer of values to write to disk is ordinarily
an anti-pattern: we prefer nice streaming I/O. However:
- In future, when we do this first decode stage on the safekeeper, it
would be inefficient to serialize a Vec of Value, and then later
deserialize it just to add blob size headers while writing into the
ephemeral layer format. The idea is that for bulk write data, we will
serialize exactly once.
- The monolithic buffer is a stepping stone to pipelining more of this:
by seriailizing earlier (rather than at the final put_value), we will be
able to parallelize the wal decoding and bulk serialization of data page
writes.
- The ephemeral layer's buffered writer already stalls writes while it
waits to flush: so while yes we'll stall for a couple milliseconds to
write a couple megabytes, we already have stalls like this, just
distributed across smaller writes.

## Benchmarks

This PR is primarily a stepping stone to safekeeper ingest filtering,
but also provides a modest efficiency improvement to the `wal_recovery`
part of `test_bulk_ingest`.

test_bulk_ingest:

```
test_bulk_insert[neon-release-pg16].insert: 23.659 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 626 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 18.981 s
test_bulk_insert[neon-release-pg16].compaction: 0.055 s

vs. tip of main:
test_bulk_insert[neon-release-pg16].insert: 24.001 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 604 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 23.586 s
test_bulk_insert[neon-release-pg16].compaction: 0.054 s
```
2024-08-22 10:04:42 +00:00
Alex Chi Z.
a968554a8c fix(pageserver): unify initdb optimization for sparse keyspaces; fix force img generation (#8776)
close https://github.com/neondatabase/neon/issues/8558

* Directly generate image layers for sparse keyspaces during initdb
optimization.
* Support force image layer generation for sparse keyspaces.
* Fix a bug of incorrect image layer key range in case of duplicated
keys. (The added line: `start = img_range.end;`) This can cause
overlapping image layers and keys to disappear.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-21 21:25:21 +01:00
Joonas Koivunen
07b7c63975 test: avoid some too long shutdowns by flushing before shutdown (#8772)
After #8655, we needed to mark some tests to shut down immediately. To
aid these tests, try the new pattern of `flush_ep_to_pageserver`
followed by a non-compacting checkpoint. This moves the general graceful
shutdown problem of having too much to flush at shutdown into the test.
Also, add logging for how long the graceful shutdown took, if we got to
complete it for faster log eyeballing.

Fixes: #8712
Cc: #8715, #8708
2024-08-21 14:26:27 -04:00
Tristan Partin
04752dfa75 Prefix current_lsn with compute_ 2024-08-21 12:39:02 -05:00
Tristan Partin
99c19cad24 Add compute_receive_lsn metric
Useful for dashboarding the replication metrics of a single endpoint.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-08-21 12:39:02 -05:00
Joonas Koivunen
b83d722369 test: fix more flaky due to graceful shutdown (#8787)
Going through the list of recent flaky tests, trying to fix those
related to graceful shutdown.

- test_forward_compatibility: flush and wait for uploads to avoid
graceful shutdown
- test_layer_bloating: in the end the endpoint and vanilla are still up
=> immediate shutdown
- test_lagging_sk: pageserver shutdown is not related to the test =>
immediate shutdown
- test_lsn_lease_size: pageserver flushing is not needed => immediate
shutdown

Additionally:
- remove `wait_for_upload` usage from workload fixture

Cc: #8708
Fixes: #8710
2024-08-21 17:22:47 +01:00
Arseny Sher
d919770c55 safekeeper: add listing timelines
Adds endpoint GET /tenant/timeline listing all not deleted timelines.
2024-08-21 18:38:08 +03:00
Tristan Partin
f4b3c317f3 Add compute_logical_snapshot_files metric
Track the number of logical snapshot files on an endpoint over time.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-08-21 10:33:44 -05:00
Conrad Ludgate
428b105dde remove workspace hack from libs (#8780)
This removes workspace hack from all libs, not from any binaries. This
does not change the behaviour of the hack.

Running
```
cargo clean
cargo build --release --bin proxy
```

Before this change took 5m16s. After this change took 3m3s. This is
because this allows the build to be parallelisable much more.
2024-08-21 14:45:32 +01:00
Alexander Bayandin
75175f3628 CI(build-and-test): run regression tests on arm (#8552)
## Problem

We want to run our regression test suite on ARM.

## Summary of changes
- run regression tests on release ARM builds
- run `build-neon` (including rust tests) on debug ARM builds
- add `arch` parameter to test to distinguish them in the allure report
and in a database
2024-08-21 14:29:11 +01:00
253 changed files with 9878 additions and 3932 deletions

View File

@@ -23,10 +23,30 @@ platforms = [
]
[final-excludes]
# vm_monitor benefits from the same Cargo.lock as the rest of our artifacts, but
# it is built primarly in separate repo neondatabase/autoscaling and thus is excluded
# from depending on workspace-hack because most of the dependencies are not used.
workspace-members = ["vm_monitor"]
workspace-members = [
# vm_monitor benefits from the same Cargo.lock as the rest of our artifacts, but
# it is built primarly in separate repo neondatabase/autoscaling and thus is excluded
# from depending on workspace-hack because most of the dependencies are not used.
"vm_monitor",
# All of these exist in libs and are not usually built independently.
# Putting workspace hack there adds a bottleneck for cargo builds.
"compute_api",
"consumption_metrics",
"desim",
"metrics",
"pageserver_api",
"postgres_backend",
"postgres_connection",
"postgres_ffi",
"pq_proto",
"remote_storage",
"safekeeper_api",
"tenant_size_model",
"tracing-utils",
"utils",
"wal_craft",
"walproposer",
]
# Write out exact versions rather than a semver range. (Defaults to false.)
# exact-versions = true

6
.github/ISSUE_TEMPLATE/config.yml vendored Normal file
View File

@@ -0,0 +1,6 @@
blank_issues_enabled: true
contact_links:
- name: Feature request
url: https://console.neon.tech/app/projects?modal=feedback
about: For feature requests in the Neon product, please submit via the feedback form on `https://console.neon.tech`

View File

@@ -71,7 +71,7 @@ runs:
if: inputs.build_type != 'remote'
uses: ./.github/actions/download
with:
name: compatibility-snapshot-${{ inputs.build_type }}-pg${{ inputs.pg_version }}
name: compatibility-snapshot-${{ runner.arch }}-${{ inputs.build_type }}-pg${{ inputs.pg_version }}
path: /tmp/compatibility_snapshot_pg${{ inputs.pg_version }}
prefix: latest
# The lack of compatibility snapshot (for example, for the new Postgres version)
@@ -169,10 +169,8 @@ runs:
EXTRA_PARAMS="--durations-path $TEST_OUTPUT/benchmark_durations.json $EXTRA_PARAMS"
fi
if [[ "${{ inputs.build_type }}" == "debug" ]]; then
if [[ $BUILD_TYPE == "debug" && $RUNNER_ARCH == 'X64' ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$GITHUB_JOB" --dir=/tmp/coverage run)
elif [[ "${{ inputs.build_type }}" == "release" ]]; then
cov_prefix=()
else
cov_prefix=()
fi
@@ -213,13 +211,13 @@ runs:
fi
- name: Upload compatibility snapshot
if: github.ref_name == 'release'
# Note, that we use `github.base_ref` which is a target branch for a PR
if: github.event_name == 'pull_request' && github.base_ref == 'release'
uses: ./.github/actions/upload
with:
name: compatibility-snapshot-${{ inputs.build_type }}-pg${{ inputs.pg_version }}-${{ github.run_id }}
name: compatibility-snapshot-${{ runner.arch }}-${{ inputs.build_type }}-pg${{ inputs.pg_version }}
# Directory is created by test_compatibility.py::test_create_snapshot, keep the path in sync with the test
path: /tmp/test_output/compatibility_snapshot_pg${{ inputs.pg_version }}/
prefix: latest
- name: Upload test results
if: ${{ !cancelled() }}

View File

@@ -94,11 +94,16 @@ jobs:
# We run tests with addtional features, that are turned off by default (e.g. in release builds), see
# corresponding Cargo.toml files for their descriptions.
- name: Set env variables
env:
ARCH: ${{ inputs.arch }}
run: |
CARGO_FEATURES="--features testing"
if [[ $BUILD_TYPE == "debug" ]]; then
if [[ $BUILD_TYPE == "debug" && $ARCH == 'x64' ]]; then
cov_prefix="scripts/coverage --profraw-prefix=$GITHUB_JOB --dir=/tmp/coverage run"
CARGO_FLAGS="--locked"
elif [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=""
CARGO_FLAGS="--locked"
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=""
CARGO_FLAGS="--locked --release"
@@ -158,6 +163,8 @@ jobs:
# Do install *before* running rust tests because they might recompile the
# binaries with different features/flags.
- name: Install rust binaries
env:
ARCH: ${{ inputs.arch }}
run: |
# Install target binaries
mkdir -p /tmp/neon/bin/
@@ -172,7 +179,7 @@ jobs:
done
# Install test executables and write list of all binaries (for code coverage)
if [[ $BUILD_TYPE == "debug" ]]; then
if [[ $BUILD_TYPE == "debug" && $ARCH == 'x64' ]]; then
# Keep bloated coverage data files away from the rest of the artifact
mkdir -p /tmp/coverage/
@@ -209,8 +216,14 @@ jobs:
#nextest does not yet support running doctests
${cov_prefix} cargo test --doc $CARGO_FLAGS $CARGO_FEATURES
# run all non-pageserver tests
${cov_prefix} cargo nextest run $CARGO_FLAGS $CARGO_FEATURES -E '!package(pageserver)'
# run pageserver tests with different settings
for io_engine in std-fs tokio-epoll-uring ; do
NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOENGINE=$io_engine ${cov_prefix} cargo nextest run $CARGO_FLAGS $CARGO_FEATURES
for io_buffer_alignment in 0 1 512 ; do
NEON_PAGESERVER_UNIT_TEST_VIRTUAL_FILE_IOENGINE=$io_engine NEON_PAGESERVER_UNIT_TEST_IO_BUFFER_ALIGNMENT=$io_buffer_alignment ${cov_prefix} cargo nextest run $CARGO_FLAGS $CARGO_FEATURES -E 'package(pageserver)'
done
done
# Run separate tests for real S3
@@ -243,8 +256,8 @@ jobs:
uses: ./.github/actions/save-coverage-data
regress-tests:
# Run test on x64 only
if: inputs.arch == 'x64'
# Don't run regression tests on debug arm64 builds
if: inputs.build-type != 'debug' || inputs.arch != 'arm64'
needs: [ build-neon ]
runs-on: ${{ fromJson(format('["self-hosted", "{0}"]', inputs.arch == 'arm64' && 'large-arm64' || 'large')) }}
container:

View File

@@ -198,7 +198,7 @@ jobs:
strategy:
fail-fast: false
matrix:
arch: [ x64 ]
arch: [ x64, arm64 ]
# Do not build or run tests in debug for release branches
build-type: ${{ fromJson((startsWith(github.ref_name, 'release') && github.event_name == 'push') && '["release"]' || '["debug", "release"]') }}
include:
@@ -1055,43 +1055,88 @@ jobs:
generate_release_notes: true,
})
# The job runs on `release` branch and copies compatibility data and Neon artifact from the last *release PR* to the latest directory
promote-compatibility-data:
needs: [ check-permissions, promote-images, tag, build-and-test-locally ]
needs: [ deploy ]
if: github.ref_name == 'release'
runs-on: [ self-hosted, small ]
container:
image: 369495373322.dkr.ecr.eu-central-1.amazonaws.com/base:pinned
options: --init
runs-on: ubuntu-22.04
steps:
- name: Promote compatibility snapshot for the release
- name: Fetch GITHUB_RUN_ID and COMMIT_SHA for the last merged release PR
id: fetch-last-release-pr-info
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
branch_name_and_pr_number=$(gh pr list \
--repo "${GITHUB_REPOSITORY}" \
--base release \
--state merged \
--limit 10 \
--json mergeCommit,headRefName,number \
--jq ".[] | select(.mergeCommit.oid==\"${GITHUB_SHA}\") | { branch_name: .headRefName, pr_number: .number }")
branch_name=$(echo "${branch_name_and_pr_number}" | jq -r '.branch_name')
pr_number=$(echo "${branch_name_and_pr_number}" | jq -r '.pr_number')
run_id=$(gh run list \
--repo "${GITHUB_REPOSITORY}" \
--workflow build_and_test.yml \
--branch "${branch_name}" \
--json databaseId \
--limit 1 \
--jq '.[].databaseId')
last_commit_sha=$(gh pr view "${pr_number}" \
--repo "${GITHUB_REPOSITORY}" \
--json commits \
--jq '.commits[-1].oid')
echo "run-id=${run_id}" | tee -a ${GITHUB_OUTPUT}
echo "commit-sha=${last_commit_sha}" | tee -a ${GITHUB_OUTPUT}
- name: Promote compatibility snapshot and Neon artifact
env:
BUCKET: neon-github-public-dev
PREFIX: artifacts/latest
COMMIT_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
AWS_REGION: eu-central-1
COMMIT_SHA: ${{ steps.fetch-last-release-pr-info.outputs.commit-sha }}
RUN_ID: ${{ steps.fetch-last-release-pr-info.outputs.run-id }}
run: |
# Update compatibility snapshot for the release
for pg_version in v14 v15 v16; do
for build_type in debug release; do
OLD_FILENAME=compatibility-snapshot-${build_type}-pg${pg_version}-${GITHUB_RUN_ID}.tar.zst
NEW_FILENAME=compatibility-snapshot-${build_type}-pg${pg_version}.tar.zst
old_prefix="artifacts/${COMMIT_SHA}/${RUN_ID}"
new_prefix="artifacts/latest"
time aws s3 mv --only-show-errors s3://${BUCKET}/${PREFIX}/${OLD_FILENAME} s3://${BUCKET}/${PREFIX}/${NEW_FILENAME}
files_to_promote=()
files_on_s3=$(aws s3api list-objects-v2 --bucket ${BUCKET} --prefix ${old_prefix} | jq -r '.Contents[]?.Key' || true)
for arch in X64 ARM64; do
for build_type in debug release; do
neon_artifact_filename="neon-Linux-${arch}-${build_type}-artifact.tar.zst"
s3_key=$(echo "${files_on_s3}" | grep ${neon_artifact_filename} | sort --version-sort | tail -1 || true)
if [ -z "${s3_key}" ]; then
echo >&2 "Neither s3://${BUCKET}/${old_prefix}/${neon_artifact_filename} nor its version from previous attempts exist"
exit 1
fi
files_to_promote+=("s3://${BUCKET}/${s3_key}")
for pg_version in v14 v15 v16; do
# We run less tests for debug builds, so we don't need to promote them
if [ "${build_type}" == "debug" ] && { [ "${arch}" == "ARM64" ] || [ "${pg_version}" != "v16" ] ; }; then
continue
fi
compatibility_data_filename="compatibility-snapshot-${arch}-${build_type}-pg${pg_version}.tar.zst"
s3_key=$(echo "${files_on_s3}" | grep ${compatibility_data_filename} | sort --version-sort | tail -1 || true)
if [ -z "${s3_key}" ]; then
echo >&2 "Neither s3://${BUCKET}/${old_prefix}/${compatibility_data_filename} nor its version from previous attempts exist"
exit 1
fi
files_to_promote+=("s3://${BUCKET}/${s3_key}")
done
done
done
# Update Neon artifact for the release (reuse already uploaded artifact)
for build_type in debug release; do
OLD_PREFIX=artifacts/${COMMIT_SHA}/${GITHUB_RUN_ID}
FILENAME=neon-${{ runner.os }}-${{ runner.arch }}-${build_type}-artifact.tar.zst
S3_KEY=$(aws s3api list-objects-v2 --bucket ${BUCKET} --prefix ${OLD_PREFIX} | jq -r '.Contents[]?.Key' | grep ${FILENAME} | sort --version-sort | tail -1 || true)
if [ -z "${S3_KEY}" ]; then
echo >&2 "Neither s3://${BUCKET}/${OLD_PREFIX}/${FILENAME} nor its version from previous attempts exist"
exit 1
fi
time aws s3 cp --only-show-errors s3://${BUCKET}/${S3_KEY} s3://${BUCKET}/${PREFIX}/${FILENAME}
for f in "${files_to_promote[@]}"; do
time aws s3 cp --only-show-errors ${f} s3://${BUCKET}/${new_prefix}/
done
pin-build-tools-image:

58
Cargo.lock generated
View File

@@ -936,6 +936,12 @@ dependencies = [
"which",
]
[[package]]
name = "bit_field"
version = "0.10.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dc827186963e592360843fb5ba4b973e145841266c1357f7180c43526f2e5b61"
[[package]]
name = "bitflags"
version = "1.3.2"
@@ -1208,7 +1214,6 @@ dependencies = [
"serde_json",
"serde_with",
"utils",
"workspace_hack",
]
[[package]]
@@ -1321,7 +1326,6 @@ dependencies = [
"serde",
"serde_with",
"utils",
"workspace_hack",
]
[[package]]
@@ -1329,7 +1333,6 @@ name = "control_plane"
version = "0.1.0"
dependencies = [
"anyhow",
"async-trait",
"camino",
"clap",
"comfy-table",
@@ -1670,14 +1673,13 @@ dependencies = [
"smallvec",
"tracing",
"utils",
"workspace_hack",
]
[[package]]
name = "diesel"
version = "2.2.1"
version = "2.2.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "62d6dcd069e7b5fe49a302411f759d4cf1cf2c27fe798ef46fb8baefc053dd2b"
checksum = "65e13bab2796f412722112327f3e575601a3e9cdcbe426f0d30dbf43f3f5dc71"
dependencies = [
"bitflags 2.4.1",
"byteorder",
@@ -2947,17 +2949,6 @@ version = "1.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "830d08ce1d1d941e6b30645f1a0eb5643013d835ce3779a5fc208261dbe10f55"
[[package]]
name = "leaky-bucket"
version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8eb491abd89e9794d50f93c8db610a29509123e3fbbc9c8c67a528e9391cd853"
dependencies = [
"parking_lot 0.12.1",
"tokio",
"tracing",
]
[[package]]
name = "libc"
version = "0.2.150"
@@ -3147,7 +3138,6 @@ dependencies = [
"rand 0.8.5",
"rand_distr",
"twox-hash",
"workspace_hack",
]
[[package]]
@@ -3687,6 +3677,7 @@ dependencies = [
"async-compression",
"async-stream",
"async-trait",
"bit_field",
"byteorder",
"bytes",
"camino",
@@ -3711,7 +3702,6 @@ dependencies = [
"humantime-serde",
"hyper 0.14.26",
"itertools 0.10.5",
"leaky-bucket",
"md5",
"metrics",
"nix 0.27.1",
@@ -3736,6 +3726,7 @@ dependencies = [
"reqwest 0.12.4",
"rpds",
"scopeguard",
"send-future",
"serde",
"serde_json",
"serde_path_to_error",
@@ -3791,7 +3782,6 @@ dependencies = [
"strum_macros",
"thiserror",
"utils",
"workspace_hack",
]
[[package]]
@@ -3799,7 +3789,6 @@ name = "pageserver_client"
version = "0.1.0"
dependencies = [
"anyhow",
"async-trait",
"bytes",
"futures",
"pageserver_api",
@@ -4193,7 +4182,6 @@ dependencies = [
"tokio-rustls 0.25.0",
"tokio-util",
"tracing",
"workspace_hack",
]
[[package]]
@@ -4206,7 +4194,6 @@ dependencies = [
"postgres",
"tokio-postgres",
"url",
"workspace_hack",
]
[[package]]
@@ -4229,7 +4216,6 @@ dependencies = [
"serde",
"thiserror",
"utils",
"workspace_hack",
]
[[package]]
@@ -4267,7 +4253,6 @@ dependencies = [
"thiserror",
"tokio",
"tracing",
"workspace_hack",
]
[[package]]
@@ -4832,7 +4817,6 @@ dependencies = [
"toml_edit 0.19.10",
"tracing",
"utils",
"workspace_hack",
]
[[package]]
@@ -5357,7 +5341,6 @@ dependencies = [
"serde",
"serde_with",
"utils",
"workspace_hack",
]
[[package]]
@@ -5466,6 +5449,12 @@ version = "1.0.17"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bebd363326d05ec3e2f532ab7660680f3b02130d780c299bca73469d521bc0ed"
[[package]]
name = "send-future"
version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "224e328af6e080cddbab3c770b1cf50f0351ba0577091ef2410c3951d835ff87"
[[package]]
name = "sentry"
version = "0.32.3"
@@ -5601,11 +5590,12 @@ dependencies = [
[[package]]
name = "serde_json"
version = "1.0.96"
version = "1.0.125"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "057d394a50403bcac12672b2b18fb387ab6d289d957dab67dd201875391e52f1"
checksum = "83c8e735a073ccf5be70aa8066aa984eaf2fa000db6c8d0100ae605b366d31ed"
dependencies = [
"itoa",
"memchr",
"ryu",
"serde",
]
@@ -5960,7 +5950,6 @@ name = "storage_controller_client"
version = "0.1.0"
dependencies = [
"anyhow",
"async-trait",
"bytes",
"futures",
"pageserver_api",
@@ -6193,7 +6182,6 @@ dependencies = [
"anyhow",
"serde",
"serde_json",
"workspace_hack",
]
[[package]]
@@ -6794,7 +6782,6 @@ dependencies = [
"tracing",
"tracing-opentelemetry",
"tracing-subscriber",
"workspace_hack",
]
[[package]]
@@ -6965,7 +6952,6 @@ dependencies = [
"anyhow",
"arc-swap",
"async-compression",
"async-trait",
"bincode",
"byteorder",
"bytes",
@@ -6981,7 +6967,6 @@ dependencies = [
"humantime",
"hyper 0.14.26",
"jsonwebtoken",
"leaky-bucket",
"metrics",
"nix 0.27.1",
"once_cell",
@@ -7012,7 +6997,6 @@ dependencies = [
"url",
"uuid",
"walkdir",
"workspace_hack",
]
[[package]]
@@ -7091,7 +7075,6 @@ dependencies = [
"postgres_ffi",
"regex",
"utils",
"workspace_hack",
]
[[package]]
@@ -7112,7 +7095,6 @@ dependencies = [
"bindgen",
"postgres_ffi",
"utils",
"workspace_hack",
]
[[package]]
@@ -7669,8 +7651,6 @@ dependencies = [
"tokio",
"tokio-rustls 0.24.0",
"tokio-util",
"toml_datetime",
"toml_edit 0.19.10",
"tonic",
"tower",
"tracing",

View File

@@ -65,6 +65,7 @@ axum = { version = "0.6.20", features = ["ws"] }
base64 = "0.13.0"
bincode = "1.3"
bindgen = "0.65"
bit_field = "0.10.2"
bstr = "1.0"
byteorder = "1.4"
bytes = "1.0"
@@ -107,13 +108,12 @@ ipnet = "2.9.0"
itertools = "0.10"
jsonwebtoken = "9"
lasso = "0.7"
leaky-bucket = "1.0.1"
libc = "0.2"
md5 = "0.7.0"
measured = { version = "0.0.22", features=["lasso"] }
measured-process = { version = "0.0.22" }
memoffset = "0.8"
nix = { version = "0.27", features = ["fs", "process", "socket", "signal", "poll"] }
nix = { version = "0.27", features = ["dir", "fs", "process", "socket", "signal", "poll"] }
notify = "6.0.0"
num_cpus = "1.15"
num-traits = "0.2.15"
@@ -145,6 +145,7 @@ rustls-split = "0.3"
scopeguard = "1.1"
sysinfo = "0.29.2"
sd-notify = "0.4.1"
send-future = "0.1.0"
sentry = { version = "0.32", default-features = false, features = ["backtrace", "contexts", "panic", "rustls", "reqwest" ] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"

View File

@@ -942,7 +942,7 @@ COPY --from=hll-pg-build /hll.tar.gz /ext-src
COPY --from=plpgsql-check-pg-build /plpgsql_check.tar.gz /ext-src
#COPY --from=timescaledb-pg-build /timescaledb.tar.gz /ext-src
COPY --from=pg-hint-plan-pg-build /pg_hint_plan.tar.gz /ext-src
COPY patches/pg_hintplan.patch /ext-src
COPY patches/pg_hint_plan.patch /ext-src
COPY --from=pg-cron-pg-build /pg_cron.tar.gz /ext-src
COPY patches/pg_cron.patch /ext-src
#COPY --from=pg-pgx-ulid-build /home/nonroot/pgx_ulid.tar.gz /ext-src
@@ -964,7 +964,7 @@ RUN cd /ext-src/pgvector-src && patch -p1 <../pgvector.patch
RUN cd /ext-src/rum-src && patch -p1 <../rum.patch
# cmake is required for the h3 test
RUN apt-get update && apt-get install -y cmake
RUN patch -p1 < /ext-src/pg_hintplan.patch
RUN cd /ext-src/pg_hint_plan-src && patch -p1 < /ext-src/pg_hint_plan.patch
COPY --chmod=755 docker-compose/run-tests.sh /run-tests.sh
RUN patch -p1 </ext-src/pg_anon.patch
RUN patch -p1 </ext-src/pg_cron.patch

View File

@@ -126,7 +126,7 @@ make -j`sysctl -n hw.logicalcpu` -s
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `pg_install/bin` and `pg_install/lib`, respectively.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.9 or higher), and install the python3 packages using `./scripts/pysync` (requires [poetry>=1.3](https://python-poetry.org/)) in the project directory.
Python (3.9 or higher), and install the python3 packages using `./scripts/pysync` (requires [poetry>=1.8](https://python-poetry.org/)) in the project directory.
#### Running neon database

View File

@@ -44,6 +44,7 @@ use std::{thread, time::Duration};
use anyhow::{Context, Result};
use chrono::Utc;
use clap::Arg;
use compute_tools::lsn_lease::launch_lsn_lease_bg_task_for_static;
use signal_hook::consts::{SIGQUIT, SIGTERM};
use signal_hook::{consts::SIGINT, iterator::Signals};
use tracing::{error, info, warn};
@@ -366,6 +367,8 @@ fn wait_spec(
state.start_time = now;
}
launch_lsn_lease_bg_task_for_static(&compute);
Ok(WaitSpecResult {
compute,
http_port,

View File

@@ -11,6 +11,7 @@ pub mod logger;
pub mod catalog;
pub mod compute;
pub mod extension_server;
pub mod lsn_lease;
mod migration;
pub mod monitor;
pub mod params;

View File

@@ -0,0 +1,186 @@
use anyhow::bail;
use anyhow::Result;
use postgres::{NoTls, SimpleQueryMessage};
use std::time::SystemTime;
use std::{str::FromStr, sync::Arc, thread, time::Duration};
use utils::id::TenantId;
use utils::id::TimelineId;
use compute_api::spec::ComputeMode;
use tracing::{info, warn};
use utils::{
lsn::Lsn,
shard::{ShardCount, ShardNumber, TenantShardId},
};
use crate::compute::ComputeNode;
/// Spawns a background thread to periodically renew LSN leases for static compute.
/// Do nothing if the compute is not in static mode.
pub fn launch_lsn_lease_bg_task_for_static(compute: &Arc<ComputeNode>) {
let (tenant_id, timeline_id, lsn) = {
let state = compute.state.lock().unwrap();
let spec = state.pspec.as_ref().expect("Spec must be set");
match spec.spec.mode {
ComputeMode::Static(lsn) => (spec.tenant_id, spec.timeline_id, lsn),
_ => return,
}
};
let compute = compute.clone();
let span = tracing::info_span!("lsn_lease_bg_task", %tenant_id, %timeline_id, %lsn);
thread::spawn(move || {
let _entered = span.entered();
if let Err(e) = lsn_lease_bg_task(compute, tenant_id, timeline_id, lsn) {
// TODO: might need stronger error feedback than logging an warning.
warn!("Exited with error: {e}");
}
});
}
/// Renews lsn lease periodically so static compute are not affected by GC.
fn lsn_lease_bg_task(
compute: Arc<ComputeNode>,
tenant_id: TenantId,
timeline_id: TimelineId,
lsn: Lsn,
) -> Result<()> {
loop {
let valid_until = acquire_lsn_lease_with_retry(&compute, tenant_id, timeline_id, lsn)?;
let valid_duration = valid_until
.duration_since(SystemTime::now())
.unwrap_or(Duration::ZERO);
// Sleep for 60 seconds less than the valid duration but no more than half of the valid duration.
let sleep_duration = valid_duration
.saturating_sub(Duration::from_secs(60))
.max(valid_duration / 2);
info!(
"Succeeded, sleeping for {} seconds",
sleep_duration.as_secs()
);
thread::sleep(sleep_duration);
}
}
/// Acquires lsn lease in a retry loop. Returns the expiration time if a lease is granted.
/// Returns an error if a lease is explicitly not granted. Otherwise, we keep sending requests.
fn acquire_lsn_lease_with_retry(
compute: &Arc<ComputeNode>,
tenant_id: TenantId,
timeline_id: TimelineId,
lsn: Lsn,
) -> Result<SystemTime> {
let mut attempts = 0usize;
let mut retry_period_ms: f64 = 500.0;
const MAX_RETRY_PERIOD_MS: f64 = 60.0 * 1000.0;
loop {
// Note: List of pageservers is dynamic, need to re-read configs before each attempt.
let configs = {
let state = compute.state.lock().unwrap();
let spec = state.pspec.as_ref().expect("spec must be set");
let conn_strings = spec.pageserver_connstr.split(',');
conn_strings
.map(|connstr| {
let mut config = postgres::Config::from_str(connstr).expect("Invalid connstr");
if let Some(storage_auth_token) = &spec.storage_auth_token {
info!("Got storage auth token from spec file");
config.password(storage_auth_token.clone());
} else {
info!("Storage auth token not set");
}
config
})
.collect::<Vec<_>>()
};
let result = try_acquire_lsn_lease(tenant_id, timeline_id, lsn, &configs);
match result {
Ok(Some(res)) => {
return Ok(res);
}
Ok(None) => {
bail!("Permanent error: lease could not be obtained, LSN is behind the GC cutoff");
}
Err(e) => {
warn!("Failed to acquire lsn lease: {e} (attempt {attempts}");
thread::sleep(Duration::from_millis(retry_period_ms as u64));
retry_period_ms *= 1.5;
retry_period_ms = retry_period_ms.min(MAX_RETRY_PERIOD_MS);
}
}
attempts += 1;
}
}
/// Tries to acquire an LSN lease through PS page_service API.
fn try_acquire_lsn_lease(
tenant_id: TenantId,
timeline_id: TimelineId,
lsn: Lsn,
configs: &[postgres::Config],
) -> Result<Option<SystemTime>> {
fn get_valid_until(
config: &postgres::Config,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
lsn: Lsn,
) -> Result<Option<SystemTime>> {
let mut client = config.connect(NoTls)?;
let cmd = format!("lease lsn {} {} {} ", tenant_shard_id, timeline_id, lsn);
let res = client.simple_query(&cmd)?;
let msg = match res.first() {
Some(msg) => msg,
None => bail!("empty response"),
};
let row = match msg {
SimpleQueryMessage::Row(row) => row,
_ => bail!("error parsing lsn lease response"),
};
// Note: this will be None if a lease is explicitly not granted.
let valid_until_str = row.get("valid_until");
let valid_until = valid_until_str.map(|s| {
SystemTime::UNIX_EPOCH
.checked_add(Duration::from_millis(u128::from_str(s).unwrap() as u64))
.expect("Time larger than max SystemTime could handle")
});
Ok(valid_until)
}
let shard_count = configs.len();
let valid_until = if shard_count > 1 {
configs
.iter()
.enumerate()
.map(|(shard_number, config)| {
let tenant_shard_id = TenantShardId {
tenant_id,
shard_count: ShardCount::new(shard_count as u8),
shard_number: ShardNumber(shard_number as u8),
};
get_valid_until(config, tenant_shard_id, timeline_id, lsn)
})
.collect::<Result<Vec<Option<SystemTime>>>>()?
.into_iter()
.min()
.unwrap()
} else {
get_valid_until(
&configs[0],
TenantShardId::unsharded(tenant_id),
timeline_id,
lsn,
)?
};
Ok(valid_until)
}

View File

@@ -6,7 +6,6 @@ license.workspace = true
[dependencies]
anyhow.workspace = true
async-trait.workspace = true
camino.workspace = true
clap.workspace = true
comfy-table.workspace = true

View File

@@ -5,6 +5,7 @@
//! ```text
//! .neon/safekeepers/<safekeeper id>
//! ```
use std::future::Future;
use std::io::Write;
use std::path::PathBuf;
use std::time::Duration;
@@ -34,12 +35,10 @@ pub enum SafekeeperHttpError {
type Result<T> = result::Result<T, SafekeeperHttpError>;
#[async_trait::async_trait]
pub trait ResponseErrorMessageExt: Sized {
async fn error_from_body(self) -> Result<Self>;
pub(crate) trait ResponseErrorMessageExt: Sized {
fn error_from_body(self) -> impl Future<Output = Result<Self>> + Send;
}
#[async_trait::async_trait]
impl ResponseErrorMessageExt for reqwest::Response {
async fn error_from_body(self) -> Result<Self> {
let status = self.status();

View File

@@ -41,6 +41,8 @@ enum Command {
listen_http_addr: String,
#[arg(long)]
listen_http_port: u16,
#[arg(long)]
availability_zone_id: String,
},
/// Modify a node's configuration in the storage controller
@@ -147,9 +149,9 @@ enum Command {
#[arg(long)]
threshold: humantime::Duration,
},
// Drain a set of specified pageservers by moving the primary attachments to pageservers
// Migrate away from a set of specified pageservers by moving the primary attachments to pageservers
// outside of the specified set.
Drain {
BulkMigrate {
// Set of pageserver node ids to drain.
#[arg(long)]
nodes: Vec<NodeId>,
@@ -163,6 +165,34 @@ enum Command {
#[arg(long)]
dry_run: Option<bool>,
},
/// Start draining the specified pageserver.
/// The drain is complete when the schedulling policy returns to active.
StartDrain {
#[arg(long)]
node_id: NodeId,
},
/// Cancel draining the specified pageserver and wait for `timeout`
/// for the operation to be canceled. May be retried.
CancelDrain {
#[arg(long)]
node_id: NodeId,
#[arg(long)]
timeout: humantime::Duration,
},
/// Start filling the specified pageserver.
/// The drain is complete when the schedulling policy returns to active.
StartFill {
#[arg(long)]
node_id: NodeId,
},
/// Cancel filling the specified pageserver and wait for `timeout`
/// for the operation to be canceled. May be retried.
CancelFill {
#[arg(long)]
node_id: NodeId,
#[arg(long)]
timeout: humantime::Duration,
},
}
#[derive(Parser)]
@@ -249,6 +279,34 @@ impl FromStr for NodeAvailabilityArg {
}
}
async fn wait_for_scheduling_policy<F>(
client: Client,
node_id: NodeId,
timeout: Duration,
f: F,
) -> anyhow::Result<NodeSchedulingPolicy>
where
F: Fn(NodeSchedulingPolicy) -> bool,
{
let waiter = tokio::time::timeout(timeout, async move {
loop {
let node = client
.dispatch::<(), NodeDescribeResponse>(
Method::GET,
format!("control/v1/node/{node_id}"),
None,
)
.await?;
if f(node.scheduling) {
return Ok::<NodeSchedulingPolicy, mgmt_api::Error>(node.scheduling);
}
}
});
Ok(waiter.await??)
}
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let cli = Cli::parse();
@@ -266,6 +324,7 @@ async fn main() -> anyhow::Result<()> {
listen_pg_port,
listen_http_addr,
listen_http_port,
availability_zone_id,
} => {
storcon_client
.dispatch::<_, ()>(
@@ -277,6 +336,7 @@ async fn main() -> anyhow::Result<()> {
listen_pg_port,
listen_http_addr,
listen_http_port,
availability_zone_id: Some(availability_zone_id),
}),
)
.await?;
@@ -628,7 +688,7 @@ async fn main() -> anyhow::Result<()> {
})
.await?;
}
Command::Drain {
Command::BulkMigrate {
nodes,
concurrency,
max_shards,
@@ -657,7 +717,7 @@ async fn main() -> anyhow::Result<()> {
}
if nodes.len() != node_to_drain_descs.len() {
anyhow::bail!("Drain requested for node which doesn't exist.")
anyhow::bail!("Bulk migration requested away from node which doesn't exist.")
}
node_to_fill_descs.retain(|desc| {
@@ -669,7 +729,7 @@ async fn main() -> anyhow::Result<()> {
});
if node_to_fill_descs.is_empty() {
anyhow::bail!("There are no nodes to drain to")
anyhow::bail!("There are no nodes to migrate to")
}
// Set the node scheduling policy to draining for the nodes which
@@ -690,7 +750,7 @@ async fn main() -> anyhow::Result<()> {
.await?;
}
// Perform the drain: move each tenant shard scheduled on a node to
// Perform the migration: move each tenant shard scheduled on a node to
// be drained to a node which is being filled. A simple round robin
// strategy is used to pick the new node.
let tenants = storcon_client
@@ -703,13 +763,13 @@ async fn main() -> anyhow::Result<()> {
let mut selected_node_idx = 0;
struct DrainMove {
struct MigrationMove {
tenant_shard_id: TenantShardId,
from: NodeId,
to: NodeId,
}
let mut moves: Vec<DrainMove> = Vec::new();
let mut moves: Vec<MigrationMove> = Vec::new();
let shards = tenants
.into_iter()
@@ -739,7 +799,7 @@ async fn main() -> anyhow::Result<()> {
continue;
}
moves.push(DrainMove {
moves.push(MigrationMove {
tenant_shard_id: shard.tenant_shard_id,
from: shard
.node_attached
@@ -816,6 +876,67 @@ async fn main() -> anyhow::Result<()> {
failure
);
}
Command::StartDrain { node_id } => {
storcon_client
.dispatch::<(), ()>(
Method::PUT,
format!("control/v1/node/{node_id}/drain"),
None,
)
.await?;
println!("Drain started for {node_id}");
}
Command::CancelDrain { node_id, timeout } => {
storcon_client
.dispatch::<(), ()>(
Method::DELETE,
format!("control/v1/node/{node_id}/drain"),
None,
)
.await?;
println!("Waiting for node {node_id} to quiesce on scheduling policy ...");
let final_policy =
wait_for_scheduling_policy(storcon_client, node_id, *timeout, |sched| {
use NodeSchedulingPolicy::*;
matches!(sched, Active | PauseForRestart)
})
.await?;
println!(
"Drain was cancelled for node {node_id}. Schedulling policy is now {final_policy:?}"
);
}
Command::StartFill { node_id } => {
storcon_client
.dispatch::<(), ()>(Method::PUT, format!("control/v1/node/{node_id}/fill"), None)
.await?;
println!("Fill started for {node_id}");
}
Command::CancelFill { node_id, timeout } => {
storcon_client
.dispatch::<(), ()>(
Method::DELETE,
format!("control/v1/node/{node_id}/fill"),
None,
)
.await?;
println!("Waiting for node {node_id} to quiesce on scheduling policy ...");
let final_policy =
wait_for_scheduling_policy(storcon_client, node_id, *timeout, |sched| {
use NodeSchedulingPolicy::*;
matches!(sched, Active)
})
.await?;
println!(
"Fill was cancelled for node {node_id}. Schedulling policy is now {final_policy:?}"
);
}
}
Ok(())

View File

@@ -3,7 +3,7 @@ set -x
cd /ext-src || exit 2
FAILED=
LIST=$( (echo "${SKIP//","/"\n"}"; ls -d -- *-src) | sort | uniq -u)
LIST=$( (echo -e "${SKIP//","/"\n"}"; ls -d -- *-src) | sort | uniq -u)
for d in ${LIST}
do
[ -d "${d}" ] || continue

View File

@@ -0,0 +1,259 @@
# Rolling Storage Controller Restarts
## Summary
This RFC describes the issues around the current storage controller restart procedure
and describes an implementation which reduces downtime to a few milliseconds on the happy path.
## Motivation
Storage controller upgrades (restarts, more generally) can cause multi-second availability gaps.
While the storage controller does not sit on the main data path, it's generally not acceptable
to block management requests for extended periods of time (e.g. https://github.com/neondatabase/neon/issues/8034).
### Current Implementation
The storage controller runs in a Kubernetes Deployment configured for one replica and strategy set to [Recreate](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment).
In non Kubernetes terms, during an upgrade, the currently running storage controller is stopped and, only after,
a new instance is created.
At start-up, the storage controller calls into all the pageservers it manages (retrieved from DB) to learn the
latest locations of all tenant shards present on them. This is usually fast, but can push into tens of seconds
under unfavourable circumstances: pageservers are heavily loaded or unavailable.
## Prior Art
There's probably as many ways of handling restarts gracefully as there are distributed systems. Some examples include:
* Active/Standby architectures: Two or more instance of the same service run, but traffic is only routed to one of them.
For fail-over, traffic is routed to one of the standbys (which becomes active).
* Consensus Algorithms (Raft, Paxos and friends): The part of consensus we care about here is leader election: peers communicate to each other
and use a voting scheme that ensures the existence of a single leader (e.g. Raft epochs).
## Requirements
* Reduce storage controller unavailability during upgrades to milliseconds
* Minimize the interval in which it's possible for more than one storage controller
to issue reconciles.
* Have one uniform implementation for restarts and upgrades
* Fit in with the current Kubernetes deployment scheme
## Non Goals
* Implement our own consensus algorithm from scratch
* Completely eliminate downtime storage controller downtime. Instead we aim to reduce it to the point where it looks
like a transient error to the control plane
## Impacted Components
* storage controller
* deployment orchestration (i.e. Ansible)
* helm charts
## Terminology
* Observed State: in-memory mapping between tenant shards and their current pageserver locations - currently built up
at start-up by quering pageservers
* Deployment: Kubernetes [primitive](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) that models
a set of replicas
## Implementation
### High Level Flow
At a very high level the proposed idea is to start a new storage controller instance while
the previous one is still running and cut-over to it when it becomes ready. The new instance,
should coordinate with the existing one and transition responsibility gracefully. While the controller
has built in safety against split-brain situations (via generation numbers), we'd like to avoid such
scenarios since they can lead to availability issues for tenants that underwent changes while two controllers
were operating at the same time and require operator intervention to remedy.
### Kubernetes Deployment Configuration
On the Kubernetes configuration side, the proposal is to update the storage controller `Deployment`
to use `spec.strategy.type = RollingUpdate`, `spec.strategy.rollingUpdate.maxSurge=1` and `spec.strategy.maxUnavailable=0`.
Under the hood, Kubernetes creates a new replica set and adds one pod to it (`maxSurge=1`). The old replica set does not
scale down until the new replica set has one replica in the ready state (`maxUnavailable=0`).
The various possible failure scenarios are investigated in the [Handling Failures](#handling-failures) section.
### Storage Controller Start-Up
This section describes the primitives required on the storage controller side and the flow of the happy path.
#### Database Table For Leader Synchronization
A new table should be added to the storage controller database for leader synchronization during startup.
This table will always contain at most one row. The proposed name for the table is `leader` and the schema
contains two elements:
* `hostname`: represents the hostname for the current storage controller leader - should be addressible
from other pods in the deployment
* `start_timestamp`: holds the start timestamp for the current storage controller leader (UTC timezone) - only required
for failure case handling: see [Previous Leader Crashes Before New Leader Readiness](#previous-leader-crashes-before-new-leader-readiness)
Storage controllers will read the leader row at start-up and then update it to mark themselves as the leader
at the end of the start-up sequence. We want compare-and-exchange semantics for the update: avoid the
situation where two concurrent updates succeed and overwrite each other. The default Postgres isolation
level is `READ COMMITTED`, which isn't strict enough here. This update transaction should use at least `REPEATABLE
READ` isolation level in order to [prevent lost updates](https://www.interdb.jp/pg/pgsql05/08.html). Currently,
the storage controller uses the stricter `SERIALIZABLE` isolation level for all transactions. This more than suits
our needs here.
```
START TRANSACTION ISOLATION LEVEL REPEATABLE READ
UPDATE leader SET hostname=<new_hostname>, start_timestamp=<new_start_ts>
WHERE hostname=<old_hostname>, start_timestampt=<old_start_ts>;
```
If the transaction fails or if no rows have been updated, then the compare-and-exchange is regarded as a failure.
#### Step Down API
A new HTTP endpoint should be added to the storage controller: `POST /control/v1/step_down`. Upon receiving this
request the leader cancels any pending reconciles and goes into a mode where it replies with 503 to all other APIs
and does not issue any location configurations to its pageservers. The successful HTTP response will return a serialized
snapshot of the observed state.
If other step down requests come in after the initial one, the request is handled and the observed state is returned (required
for failure scenario handling - see [Handling Failures](#handling-failures)).
#### Graceful Restart Happy Path
At start-up, the first thing the storage controller does is retrieve the sole row from the new
`leader` table. If such an entry exists, send a `/step_down` PUT API call to the current leader.
This should be retried a few times with a short backoff (see [1]). The aspiring leader loads the
observed state into memory and the start-up sequence proceeds as usual, but *without* querying the
pageservers in order to build up the observed state.
Before doing any reconciliations or persistence change, update the `leader` database table as described in the [Database Table For Leader Synchronization](database-table-for-leader-synchronization)
section. If this step fails, the storage controller process exits.
Note that no row will exist in the `leaders` table for the first graceful restart. In that case, force update the `leader` table
(without the WHERE clause) and perform with the pre-existing start-up procedure (i.e. build observed state by querying pageservers).
Summary of proposed new start-up sequence:
1. Call `/step_down`
2. Perform any pending database migrations
3. Load state from database
4. Load observed state returned in step (1) into memory
5. Do initial heartbeat round (may be moved after 5)
7. Mark self as leader by updating the database
8. Reschedule and reconcile everything
Some things to note from the steps above:
* The storage controller makes no changes to the cluster state before step (5) (i.e. no location config
calls to the pageserver and no compute notifications)
* Ask the current leader to step down before loading state from database so we don't get a lost update
if the transactions overlap.
* Before loading the observed state at step (3), cross-validate against the database. If validation fails,
fall back to asking the pageservers about their current locations.
* Database migrations should only run **after** the previous instance steps down (or the step down times out).
[1] The API call might fail because there's no storage controller running (i.e. [restart](#storage-controller-crash-or-restart)),
so we don't want to extend the unavailability period by much. We still want to retry since that's not the common case.
### Handling Failures
#### Storage Controller Crash Or Restart
The storage controller may crash or be restarted outside of roll-outs. When a new pod is created, its call to
`/step_down` will fail since the previous leader is no longer reachable. In this case perform the pre-existing
start-up procedure and update the leader table (with the WHERE clause). If the update fails, the storage controller
exists and consistency is maintained.
#### Previous Leader Crashes Before New Leader Readiness
When the previous leader (P1) crashes before the new leader (P2) passses the readiness check, Kubernetes will
reconcile the old replica set and create a new pod for it (P1'). The `/step_down` API call will fail for P1'
(see [2]).
Now we have two cases to consider:
* P2 updates the `leader` table first: The database update from P1' will fail and P1' will exit, or be terminated
by Kubernetes depending on timings.
* P1' updates the `leader` table first: The `hostname` field of the `leader` row stays the same, but the `start_timestamp` field changes.
The database update from P2 will fail (since `start_timestamp` does not match). P2 will exit and Kubernetes will
create a new replacement pod for it (P2'). Now the entire dance starts again, but with P1' as the leader and P2' as the incumbent.
[2] P1 and P1' may (more likely than not) be the same pod and have the same hostname. The implementation
should avoid this self reference and fail the API call at the client if the persisted hostname matches
the current one.
#### Previous Leader Crashes After New Leader Readiness
The deployment's replica sets already satisfy the deployment's replica count requirements and the
Kubernetes deployment rollout will just clean up the dead pod.
#### New Leader Crashes Before Pasing Readiness Check
The deployment controller scales up the new replica sets by creating a new pod. The entire procedure is repeated
with the new pod.
#### Network Partition Between New Pod and Previous Leader
This feels very unlikely, but should be considered in any case. P2 (the new aspiring leader) fails the `/step_down`
API call into P1 (the current leader). P2 proceeds with the pre-existing startup procedure and updates the `leader` table.
Kubernetes will terminate P1, but there may be a brief period where both storage controller can drive reconciles.
### Dealing With Split Brain Scenarios
As we've seen in the previous section, we can end up with two storage controller running at the same time. The split brain
duration is not bounded since the Kubernetes controller might become partitioned from the pods (unlikely though). While these
scenarios are not fatal, they can cause tenant unavailability, so we'd like to reduce the chances of this happening.
The rest of this section sketches some safety measure. It's likely overkill to implement all of them however.
### Ensure Leadership Before Producing Side Effects
The storage controller has two types of side effects: location config requests into pageservers and compute notifications into the control plane.
Before issuing either, the storage controller could check that it is indeed still the leader by querying the database. Side effects might still be
applied if they race with the database updatem, but the situation will eventually be detected. The storage controller process should terminate in these cases.
### Leadership Lease
Up until now, the leadership defined by this RFC is static. In order to bound the length of the split brain scenario, we could require the leadership
to be renewed periodically. Two new columns would be added to the leaders table:
1. `last_renewed` - timestamp indicating when the lease was last renewed
2. `lease_duration` - duration indicating the amount of time after which the lease expires
The leader periodically attempts to renew the lease by checking that it is in fact still the legitimate leader and updating `last_renewed` in the
same transaction. If the update fails, the process exits. New storage controller instances wishing to become leaders must wait for the current lease
to expire before acquiring leadership if they have not succesfully received a response to the `/step_down` request.
### Notify Pageserver Of Storage Controller Term
Each time that leadership changes, we can bump a `term` integer column in the `leader` table. This term uniquely identifies a leader.
Location config requests and re-attach responses can include this term. On the pageserver side, keep the latest term in memory and refuse
anything which contains a stale term (i.e. smaller than the current one).
### Observability
* The storage controller should expose a metric which describes it's state (`Active | WarmingUp | SteppedDown`).
Per region alerts should be added on this metric which triggers when:
+ no storage controller has been in the `Active` state for an extended period of time
+ more than one storage controllers are in the `Active` state
* An alert that periodically verifies that the `leader` table is in sync with the metric above would be very useful.
We'd have to expose the storage controller read only database to Grafana (perhaps it is already done).
## Alternatives
### Kubernetes Leases
Kubernetes has a [lease primitive](https://kubernetes.io/docs/concepts/architecture/leases/) which can be used to implement leader election.
Only one instance may hold a lease at any given time. This lease needs to be periodically renewed and has an expiration period.
In our case, it would work something like this:
* `/step_down` deletes the lease or stops it from renewing
* lease acquisition becomes part of the start-up procedure
The kubert crate implements a [lightweight lease API](https://docs.rs/kubert/latest/kubert/lease/struct.LeaseManager.html), but it's still
not exactly trivial to implement.
This approach has the benefit of baked in observability (`kubectl describe lease`), but:
* We offload the responsibility to Kubernetes which makes it harder to debug when things go wrong.
* More code surface than the simple "row in database" approach. Also, most of this code would be in
a dependency not subject to code review, etc.
* Hard to test. Our testing infra does not run the storage controller in Kubernetes and changing it do
so is not simple and complictes and the test set-up.
To my mind, the "row in database" approach is straightforward enough that we don't have to offload this
to something external.

View File

@@ -21,30 +21,21 @@ _Example: 15.4 is the new minor version to upgrade to from 15.3._
1. Create a new branch based on the stable branch you are updating.
```shell
git checkout -b my-branch REL_15_STABLE_neon
git checkout -b my-branch-15 REL_15_STABLE_neon
```
1. Tag the last commit on the stable branch you are updating.
1. Find the upstream release tags you're looking for. They are of the form `REL_X_Y`.
```shell
git tag REL_15_3_neon
```
1. Push the new tag to the Neon Postgres repository.
```shell
git push origin REL_15_3_neon
```
1. Find the release tags you're looking for. They are of the form `REL_X_Y`.
1. Rebase the branch you created on the tag and resolve any conflicts.
1. Merge the upstream tag into the branch you created on the tag and resolve any conflicts.
```shell
git fetch upstream REL_15_4
git rebase REL_15_4
git merge REL_15_4
```
In the commit message of the merge commit, mention if there were
any non-trivial conflicts or other issues.
1. Run the Postgres test suite to make sure our commits have not affected
Postgres in a negative way.
@@ -57,7 +48,7 @@ Postgres in a negative way.
1. Push your branch to the Neon Postgres repository.
```shell
git push origin my-branch
git push origin my-branch-15
```
1. Clone the Neon repository if you have not done so already.
@@ -74,7 +65,7 @@ branch.
1. Update the Git submodule.
```shell
git submodule set-branch --branch my-branch vendor/postgres-v15
git submodule set-branch --branch my-branch-15 vendor/postgres-v15
git submodule update --remote vendor/postgres-v15
```
@@ -89,14 +80,12 @@ minor Postgres release.
1. Create a pull request, and wait for CI to go green.
1. Force push the rebased Postgres branches into the Neon Postgres repository.
1. Push the Postgres branches with the merge commits into the Neon Postgres repository.
```shell
git push --force origin my-branch:REL_15_STABLE_neon
git push origin my-branch-15:REL_15_STABLE_neon
```
It may require disabling various branch protections.
1. Update your Neon PR to point at the branches.
```shell

View File

@@ -14,5 +14,3 @@ regex.workspace = true
utils = { path = "../utils" }
remote_storage = { version = "0.1", path = "../remote_storage/" }
workspace_hack.workspace = true

View File

@@ -6,10 +6,8 @@ license = "Apache-2.0"
[dependencies]
anyhow.workspace = true
chrono.workspace = true
chrono = { workspace = true, features = ["serde"] }
rand.workspace = true
serde.workspace = true
serde_with.workspace = true
utils.workspace = true
workspace_hack.workspace = true

View File

@@ -14,5 +14,3 @@ parking_lot.workspace = true
hex.workspace = true
scopeguard.workspace = true
smallvec = { workspace = true, features = ["write"] }
workspace_hack.workspace = true

View File

@@ -12,8 +12,6 @@ chrono.workspace = true
twox-hash.workspace = true
measured.workspace = true
workspace_hack.workspace = true
[target.'cfg(target_os = "linux")'.dependencies]
procfs.workspace = true
measured-process.workspace = true

View File

@@ -21,11 +21,9 @@ hex.workspace = true
humantime.workspace = true
thiserror.workspace = true
humantime-serde.workspace = true
chrono.workspace = true
chrono = { workspace = true, features = ["serde"] }
itertools.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
bincode.workspace = true
rand.workspace = true

View File

@@ -8,6 +8,7 @@ use std::time::{Duration, Instant};
use serde::{Deserialize, Serialize};
use utils::id::{NodeId, TenantId};
use crate::models::PageserverUtilization;
use crate::{
models::{ShardParameters, TenantConfig},
shard::{ShardStripeSize, TenantShardId},
@@ -55,6 +56,8 @@ pub struct NodeRegisterRequest {
pub listen_http_addr: String,
pub listen_http_port: u16,
pub availability_zone_id: Option<String>,
}
#[derive(Serialize, Deserialize)]
@@ -140,23 +143,11 @@ pub struct TenantShardMigrateRequest {
pub node_id: NodeId,
}
/// Utilisation score indicating how good a candidate a pageserver
/// is for scheduling the next tenant. See [`crate::models::PageserverUtilization`].
/// Lower values are better.
#[derive(Serialize, Deserialize, Clone, Copy, Eq, PartialEq, PartialOrd, Ord, Debug)]
pub struct UtilizationScore(pub u64);
impl UtilizationScore {
pub fn worst() -> Self {
UtilizationScore(u64::MAX)
}
}
#[derive(Serialize, Clone, Copy, Debug)]
#[derive(Serialize, Clone, Debug)]
#[serde(into = "NodeAvailabilityWrapper")]
pub enum NodeAvailability {
// Normal, happy state
Active(UtilizationScore),
Active(PageserverUtilization),
// Node is warming up, but we expect it to become available soon. Covers
// the time span between the re-attach response being composed on the storage controller
// and the first successful heartbeat after the processing of the re-attach response
@@ -195,7 +186,9 @@ impl From<NodeAvailabilityWrapper> for NodeAvailability {
match val {
// Assume the worst utilisation score to begin with. It will later be updated by
// the heartbeats.
NodeAvailabilityWrapper::Active => NodeAvailability::Active(UtilizationScore::worst()),
NodeAvailabilityWrapper::Active => {
NodeAvailability::Active(PageserverUtilization::full())
}
NodeAvailabilityWrapper::WarmingUp => NodeAvailability::WarmingUp(Instant::now()),
NodeAvailabilityWrapper::Offline => NodeAvailability::Offline,
}

View File

@@ -108,14 +108,41 @@ impl Key {
}
}
/// This function checks more extensively what keys we can take on the write path.
/// If a key beginning with 00 does not have a global/default tablespace OID, it
/// will be rejected on the write path.
#[allow(dead_code)]
pub fn is_valid_key_on_write_path_strong(&self) -> bool {
use postgres_ffi::pg_constants::{DEFAULTTABLESPACE_OID, GLOBALTABLESPACE_OID};
if !self.is_i128_representable() {
return false;
}
if self.field1 == 0
&& !(self.field2 == GLOBALTABLESPACE_OID
|| self.field2 == DEFAULTTABLESPACE_OID
|| self.field2 == 0)
{
return false; // User defined tablespaces are not supported
}
true
}
/// This is a weaker version of `is_valid_key_on_write_path_strong` that simply
/// checks if the key is i128 representable. Note that some keys can be successfully
/// ingested into the pageserver, but will cause errors on generating basebackup.
pub fn is_valid_key_on_write_path(&self) -> bool {
self.is_i128_representable()
}
pub fn is_i128_representable(&self) -> bool {
self.field2 <= 0xFFFF || self.field2 == 0xFFFFFFFF || self.field2 == 0x22222222
}
/// 'field2' is used to store tablespaceid for relations and small enum numbers for other relish.
/// As long as Neon does not support tablespace (because of lack of access to local file system),
/// we can assume that only some predefined namespace OIDs are used which can fit in u16
pub fn to_i128(&self) -> i128 {
assert!(
self.field2 <= 0xFFFF || self.field2 == 0xFFFFFFFF || self.field2 == 0x22222222,
"invalid key: {self}",
);
assert!(self.is_i128_representable(), "invalid key: {self}");
(((self.field1 & 0x7F) as i128) << 120)
| (((self.field2 & 0xFFFF) as i128) << 104)
| ((self.field3 as i128) << 72)
@@ -236,6 +263,15 @@ impl Key {
field5: u8::MAX,
field6: u32::MAX,
};
/// A key slightly smaller than [`Key::MAX`] for use in layer key ranges to avoid them to be confused with L0 layers
pub const NON_L0_MAX: Key = Key {
field1: u8::MAX,
field2: u32::MAX,
field3: u32::MAX,
field4: u32::MAX,
field5: u8::MAX,
field6: u32::MAX - 1,
};
pub fn from_hex(s: &str) -> Result<Self> {
if s.len() != 36 {

View File

@@ -48,7 +48,7 @@ pub struct ShardedRange<'a> {
// Calculate the size of a range within the blocks of the same relation, or spanning only the
// top page in the previous relation's space.
fn contiguous_range_len(range: &Range<Key>) -> u32 {
pub fn contiguous_range_len(range: &Range<Key>) -> u32 {
debug_assert!(is_contiguous_range(range));
if range.start.field6 == 0xffffffff {
range.end.field6 + 1
@@ -67,7 +67,7 @@ fn contiguous_range_len(range: &Range<Key>) -> u32 {
/// This matters, because:
/// - Within such ranges, keys are used contiguously. Outside such ranges it is sparse.
/// - Within such ranges, we may calculate distances using simple subtraction of field6.
fn is_contiguous_range(range: &Range<Key>) -> bool {
pub fn is_contiguous_range(range: &Range<Key>) -> bool {
range.start.field1 == range.end.field1
&& range.start.field2 == range.end.field2
&& range.start.field3 == range.end.field3

View File

@@ -7,7 +7,7 @@ pub use utilization::PageserverUtilization;
use std::{
collections::HashMap,
io::{BufRead, Read},
num::{NonZeroU64, NonZeroUsize},
num::{NonZeroU32, NonZeroU64, NonZeroUsize},
str::FromStr,
sync::atomic::AtomicUsize,
time::{Duration, SystemTime},
@@ -348,7 +348,7 @@ impl AuxFilePolicy {
/// If a tenant writes aux files without setting `switch_aux_policy`, this value will be used.
pub fn default_tenant_config() -> Self {
Self::V1
Self::V2
}
}
@@ -486,12 +486,11 @@ pub struct EvictionPolicyLayerAccessThreshold {
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
pub struct ThrottleConfig {
pub task_kinds: Vec<String>, // TaskKind
pub initial: usize,
pub initial: u32,
#[serde(with = "humantime_serde")]
pub refill_interval: Duration,
pub refill_amount: NonZeroUsize,
pub max: usize,
pub fair: bool,
pub refill_amount: NonZeroU32,
pub max: u32,
}
impl ThrottleConfig {
@@ -501,9 +500,8 @@ impl ThrottleConfig {
// other values don't matter with emtpy `task_kinds`.
initial: 0,
refill_interval: Duration::from_millis(1),
refill_amount: NonZeroUsize::new(1).unwrap(),
refill_amount: NonZeroU32::new(1).unwrap(),
max: 1,
fair: true,
}
}
/// The requests per second allowed by the given config.
@@ -718,6 +716,7 @@ pub struct TimelineInfo {
pub pg_version: u32,
pub state: TimelineState,
pub is_archived: bool,
pub walreceiver_status: String,
@@ -1062,7 +1061,7 @@ impl TryFrom<u8> for PagestreamBeMessageTag {
}
}
// In the V2 protocol version, a GetPage request contains two LSN values:
// A GetPage request contains two LSN values:
//
// request_lsn: Get the page version at this point in time. Lsn::Max is a special value that means
// "get the latest version present". It's used by the primary server, which knows that no one else
@@ -1075,7 +1074,7 @@ impl TryFrom<u8> for PagestreamBeMessageTag {
// passing an earlier LSN can speed up the request, by allowing the pageserver to process the
// request without waiting for 'request_lsn' to arrive.
//
// The legacy V1 interface contained only one LSN, and a boolean 'latest' flag. The V1 interface was
// The now-defunct V1 interface contained only one LSN, and a boolean 'latest' flag. The V1 interface was
// sufficient for the primary; the 'lsn' was equivalent to the 'not_modified_since' value, and
// 'latest' was set to true. The V2 interface was added because there was no correct way for a
// standby to request a page at a particular non-latest LSN, and also include the
@@ -1083,15 +1082,11 @@ impl TryFrom<u8> for PagestreamBeMessageTag {
// request, if the standby knows that the page hasn't been modified since, and risk getting an error
// if that LSN has fallen behind the GC horizon, or requesting the current replay LSN, which could
// require the pageserver unnecessarily to wait for the WAL to arrive up to that point. The new V2
// interface allows sending both LSNs, and let the pageserver do the right thing. There is no
// interface allows sending both LSNs, and let the pageserver do the right thing. There was no
// difference in the responses between V1 and V2.
//
// The Request structs below reflect the V2 interface. If V1 is used, the parse function
// maps the old format requests to the new format.
//
#[derive(Clone, Copy)]
pub enum PagestreamProtocolVersion {
V1,
V2,
}
@@ -1230,36 +1225,17 @@ impl PagestreamFeMessage {
bytes.into()
}
pub fn parse<R: std::io::Read>(
body: &mut R,
protocol_version: PagestreamProtocolVersion,
) -> anyhow::Result<PagestreamFeMessage> {
pub fn parse<R: std::io::Read>(body: &mut R) -> anyhow::Result<PagestreamFeMessage> {
// these correspond to the NeonMessageTag enum in pagestore_client.h
//
// TODO: consider using protobuf or serde bincode for less error prone
// serialization.
let msg_tag = body.read_u8()?;
let (request_lsn, not_modified_since) = match protocol_version {
PagestreamProtocolVersion::V2 => (
Lsn::from(body.read_u64::<BigEndian>()?),
Lsn::from(body.read_u64::<BigEndian>()?),
),
PagestreamProtocolVersion::V1 => {
// In the old protocol, each message starts with a boolean 'latest' flag,
// followed by 'lsn'. Convert that to the two LSNs, 'request_lsn' and
// 'not_modified_since', used in the new protocol version.
let latest = body.read_u8()? != 0;
let request_lsn = Lsn::from(body.read_u64::<BigEndian>()?);
if latest {
(Lsn::MAX, request_lsn) // get latest version
} else {
(request_lsn, request_lsn) // get version at specified LSN
}
}
};
// these two fields are the same for every request type
let request_lsn = Lsn::from(body.read_u64::<BigEndian>()?);
let not_modified_since = Lsn::from(body.read_u64::<BigEndian>()?);
// The rest of the messages are the same between V1 and V2
match msg_tag {
0 => Ok(PagestreamFeMessage::Exists(PagestreamExistsRequest {
request_lsn,
@@ -1467,9 +1443,7 @@ mod tests {
];
for msg in messages {
let bytes = msg.serialize();
let reconstructed =
PagestreamFeMessage::parse(&mut bytes.reader(), PagestreamProtocolVersion::V2)
.unwrap();
let reconstructed = PagestreamFeMessage::parse(&mut bytes.reader()).unwrap();
assert!(msg == reconstructed);
}
}

View File

@@ -38,7 +38,7 @@ pub struct PageserverUtilization {
pub max_shard_count: u32,
/// Cached result of [`Self::score`]
pub utilization_score: u64,
pub utilization_score: Option<u64>,
/// When was this snapshot captured, pageserver local time.
///
@@ -50,6 +50,8 @@ fn unity_percent() -> Percent {
Percent::new(0).unwrap()
}
pub type RawScore = u64;
impl PageserverUtilization {
const UTILIZATION_FULL: u64 = 1000000;
@@ -62,7 +64,7 @@ impl PageserverUtilization {
/// - Negative values are forbidden
/// - Values over UTILIZATION_FULL indicate an overloaded node, which may show degraded performance due to
/// layer eviction.
pub fn score(&self) -> u64 {
pub fn score(&self) -> RawScore {
let disk_usable_capacity = ((self.disk_usage_bytes + self.free_space_bytes)
* self.disk_usable_pct.get() as u64)
/ 100;
@@ -74,8 +76,30 @@ impl PageserverUtilization {
std::cmp::max(disk_utilization_score, shard_utilization_score)
}
pub fn refresh_score(&mut self) {
self.utilization_score = self.score();
pub fn cached_score(&mut self) -> RawScore {
match self.utilization_score {
None => {
let s = self.score();
self.utilization_score = Some(s);
s
}
Some(s) => s,
}
}
/// If a node is currently hosting more work than it can comfortably handle. This does not indicate that
/// it will fail, but it is a strong signal that more work should not be added unless there is no alternative.
pub fn is_overloaded(score: RawScore) -> bool {
score >= Self::UTILIZATION_FULL
}
pub fn adjust_shard_count_max(&mut self, shard_count: u32) {
if self.shard_count < shard_count {
self.shard_count = shard_count;
// Dirty cache: this will be calculated next time someone retrives the score
self.utilization_score = None;
}
}
/// A utilization structure that has a full utilization score: use this as a placeholder when
@@ -88,7 +112,38 @@ impl PageserverUtilization {
disk_usable_pct: Percent::new(100).unwrap(),
shard_count: 1,
max_shard_count: 1,
utilization_score: Self::UTILIZATION_FULL,
utilization_score: Some(Self::UTILIZATION_FULL),
captured_at: serde_system_time::SystemTime(SystemTime::now()),
}
}
}
/// Test helper
pub mod test_utilization {
use super::PageserverUtilization;
use std::time::SystemTime;
use utils::{
serde_percent::Percent,
serde_system_time::{self},
};
// Parameters of the imaginary node used for test utilization instances
const TEST_DISK_SIZE: u64 = 1024 * 1024 * 1024 * 1024;
const TEST_SHARDS_MAX: u32 = 1000;
/// Unit test helper. Unconditionally compiled because cfg(test) doesn't carry across crates. Do
/// not abuse this function from non-test code.
///
/// Emulates a node with a 1000 shard limit and a 1TB disk.
pub fn simple(shard_count: u32, disk_wanted_bytes: u64) -> PageserverUtilization {
PageserverUtilization {
disk_usage_bytes: disk_wanted_bytes,
free_space_bytes: TEST_DISK_SIZE - std::cmp::min(disk_wanted_bytes, TEST_DISK_SIZE),
disk_wanted_bytes,
disk_usable_pct: Percent::new(100).unwrap(),
shard_count,
max_shard_count: TEST_SHARDS_MAX,
utilization_score: None,
captured_at: serde_system_time::SystemTime(SystemTime::now()),
}
}
@@ -120,7 +175,7 @@ mod tests {
disk_usage_bytes: u64::MAX,
free_space_bytes: 0,
disk_wanted_bytes: u64::MAX,
utilization_score: 13,
utilization_score: Some(13),
disk_usable_pct: Percent::new(90).unwrap(),
shard_count: 100,
max_shard_count: 200,

View File

@@ -18,7 +18,6 @@ tokio-rustls.workspace = true
tracing.workspace = true
pq_proto.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
once_cell.workspace = true

View File

@@ -11,7 +11,5 @@ postgres.workspace = true
tokio-postgres.workspace = true
url.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
once_cell.workspace = true

View File

@@ -19,8 +19,6 @@ thiserror.workspace = true
serde.workspace = true
utils.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
env_logger.workspace = true
postgres.workspace = true

View File

@@ -136,9 +136,9 @@ pub const MAX_SEND_SIZE: usize = XLOG_BLCKSZ * 16;
// Export some version independent functions that are used outside of this mod
pub use v14::xlog_utils::encode_logical_message;
pub use v14::xlog_utils::from_pg_timestamp;
pub use v14::xlog_utils::get_current_timestamp;
pub use v14::xlog_utils::to_pg_timestamp;
pub use v14::xlog_utils::try_from_pg_timestamp;
pub use v14::xlog_utils::XLogFileName;
pub use v14::bindings::DBState_DB_SHUTDOWNED;

View File

@@ -135,6 +135,8 @@ pub fn get_current_timestamp() -> TimestampTz {
mod timestamp_conversions {
use std::time::Duration;
use anyhow::Context;
use super::*;
const UNIX_EPOCH_JDATE: u64 = 2440588; // == date2j(1970, 1, 1)
@@ -154,18 +156,18 @@ mod timestamp_conversions {
}
}
pub fn from_pg_timestamp(time: TimestampTz) -> SystemTime {
pub fn try_from_pg_timestamp(time: TimestampTz) -> anyhow::Result<SystemTime> {
let time: u64 = time
.try_into()
.expect("timestamp before millenium (postgres epoch)");
.context("timestamp before millenium (postgres epoch)")?;
let since_unix_epoch = time + SECS_DIFF_UNIX_TO_POSTGRES_EPOCH * USECS_PER_SEC;
SystemTime::UNIX_EPOCH
.checked_add(Duration::from_micros(since_unix_epoch))
.expect("SystemTime overflow")
.context("SystemTime overflow")
}
}
pub use timestamp_conversions::{from_pg_timestamp, to_pg_timestamp};
pub use timestamp_conversions::{to_pg_timestamp, try_from_pg_timestamp};
// Returns (aligned) end_lsn of the last record in data_dir with WAL segments.
// start_lsn must point to some previously known record boundary (beginning of
@@ -545,14 +547,14 @@ mod tests {
#[test]
fn test_ts_conversion() {
let now = SystemTime::now();
let round_trip = from_pg_timestamp(to_pg_timestamp(now));
let round_trip = try_from_pg_timestamp(to_pg_timestamp(now)).unwrap();
let now_since = now.duration_since(SystemTime::UNIX_EPOCH).unwrap();
let round_trip_since = round_trip.duration_since(SystemTime::UNIX_EPOCH).unwrap();
assert_eq!(now_since.as_micros(), round_trip_since.as_micros());
let now_pg = get_current_timestamp();
let round_trip_pg = to_pg_timestamp(from_pg_timestamp(now_pg));
let round_trip_pg = to_pg_timestamp(try_from_pg_timestamp(now_pg).unwrap());
assert_eq!(now_pg, round_trip_pg);
}

View File

@@ -14,8 +14,6 @@ postgres.workspace = true
postgres_ffi.workspace = true
camino-tempfile.workspace = true
workspace_hack.workspace = true
[dev-dependencies]
regex.workspace = true
utils.workspace = true

View File

@@ -11,9 +11,7 @@ itertools.workspace = true
pin-project-lite.workspace = true
postgres-protocol.workspace = true
rand.workspace = true
tokio.workspace = true
tokio = { workspace = true, features = ["io-util"] }
tracing.workspace = true
thiserror.workspace = true
serde.workspace = true
workspace_hack.workspace = true

View File

@@ -32,7 +32,7 @@ scopeguard.workspace = true
metrics.workspace = true
utils.workspace = true
pin-project-lite.workspace = true
workspace_hack.workspace = true
azure_core.workspace = true
azure_identity.workspace = true
azure_storage.workspace = true
@@ -46,3 +46,4 @@ sync_wrapper = { workspace = true, features = ["futures"] }
camino-tempfile.workspace = true
test-context.workspace = true
rand.workspace = true
tokio = { workspace = true, features = ["test-util"] }

View File

@@ -9,5 +9,3 @@ serde.workspace = true
serde_with.workspace = true
const_format.workspace = true
utils.workspace = true
workspace_hack.workspace = true

View File

@@ -9,5 +9,3 @@ license.workspace = true
anyhow.workspace = true
serde.workspace = true
serde_json.workspace = true
workspace_hack.workspace = true

View File

@@ -14,5 +14,3 @@ tokio = { workspace = true, features = ["rt", "rt-multi-thread"] }
tracing.workspace = true
tracing-opentelemetry.workspace = true
tracing-subscriber.workspace = true
workspace_hack.workspace = true

View File

@@ -14,7 +14,6 @@ testing = ["fail/failpoints"]
arc-swap.workspace = true
sentry.workspace = true
async-compression.workspace = true
async-trait.workspace = true
anyhow.workspace = true
bincode.workspace = true
bytes.workspace = true
@@ -26,7 +25,6 @@ hyper = { workspace = true, features = ["full"] }
fail.workspace = true
futures = { workspace = true}
jsonwebtoken.workspace = true
leaky-bucket.workspace = true
nix.workspace = true
once_cell.workspace = true
pin-project-lite.workspace = true
@@ -39,7 +37,7 @@ thiserror.workspace = true
tokio.workspace = true
tokio-tar.workspace = true
tokio-util.workspace = true
toml_edit.workspace = true
toml_edit = { workspace = true, features = ["serde"] }
tracing.workspace = true
tracing-error.workspace = true
tracing-subscriber = { workspace = true, features = ["json", "registry"] }
@@ -54,7 +52,6 @@ walkdir.workspace = true
pq_proto.workspace = true
postgres_connection.workspace = true
metrics.workspace = true
workspace_hack.workspace = true
const_format.workspace = true
@@ -71,6 +68,7 @@ criterion.workspace = true
hex-literal.workspace = true
camino-tempfile.workspace = true
serde_assert.workspace = true
tokio = { workspace = true, features = ["test-util"] }
[[bench]]
name = "benchmarks"

View File

@@ -0,0 +1,280 @@
//! This module implements the Generic Cell Rate Algorithm for a simplified
//! version of the Leaky Bucket rate limiting system.
//!
//! # Leaky Bucket
//!
//! If the bucket is full, no new requests are allowed and are throttled/errored.
//! If the bucket is partially full/empty, new requests are added to the bucket in
//! terms of "tokens".
//!
//! Over time, tokens are removed from the bucket, naturally allowing new requests at a steady rate.
//!
//! The bucket size tunes the burst support. The drain rate tunes the steady-rate requests per second.
//!
//! # [GCRA](https://en.wikipedia.org/wiki/Generic_cell_rate_algorithm)
//!
//! GCRA is a continuous rate leaky-bucket impl that stores minimal state and requires
//! no background jobs to drain tokens, as the design utilises timestamps to drain automatically over time.
//!
//! We store an "empty_at" timestamp as the only state. As time progresses, we will naturally approach
//! the empty state. The full-bucket state is calculated from `empty_at - config.bucket_width`.
//!
//! Another explaination can be found here: <https://brandur.org/rate-limiting>
use std::{sync::Mutex, time::Duration};
use tokio::{sync::Notify, time::Instant};
pub struct LeakyBucketConfig {
/// This is the "time cost" of a single request unit.
/// Should loosely represent how long it takes to handle a request unit in active resource time.
/// Loosely speaking this is the inverse of the steady-rate requests-per-second
pub cost: Duration,
/// total size of the bucket
pub bucket_width: Duration,
}
impl LeakyBucketConfig {
pub fn new(rps: f64, bucket_size: f64) -> Self {
let cost = Duration::from_secs_f64(rps.recip());
let bucket_width = cost.mul_f64(bucket_size);
Self { cost, bucket_width }
}
}
pub struct LeakyBucketState {
/// Bucket is represented by `allow_at..empty_at` where `allow_at = empty_at - config.bucket_width`.
///
/// At any given time, `empty_at - now` represents the number of tokens in the bucket, multiplied by the "time_cost".
/// Adding `n` tokens to the bucket is done by moving `empty_at` forward by `n * config.time_cost`.
/// If `now < allow_at`, the bucket is considered filled and cannot accept any more tokens.
/// Draining the bucket will happen naturally as `now` moves forward.
///
/// Let `n` be some "time cost" for the request,
/// If now is after empty_at, the bucket is empty and the empty_at is reset to now,
/// If now is within the `bucket window + n`, we are within time budget.
/// If now is before the `bucket window + n`, we have run out of budget.
///
/// This is inspired by the generic cell rate algorithm (GCRA) and works
/// exactly the same as a leaky-bucket.
pub empty_at: Instant,
}
impl LeakyBucketState {
pub fn with_initial_tokens(config: &LeakyBucketConfig, initial_tokens: f64) -> Self {
LeakyBucketState {
empty_at: Instant::now() + config.cost.mul_f64(initial_tokens),
}
}
pub fn bucket_is_empty(&self, now: Instant) -> bool {
// if self.end is after now, the bucket is not empty
self.empty_at <= now
}
/// Immediately adds tokens to the bucket, if there is space.
///
/// In a scenario where you are waiting for available rate,
/// rather than just erroring immediately, `started` corresponds to when this waiting started.
///
/// `n` is the number of tokens that will be filled in the bucket.
///
/// # Errors
///
/// If there is not enough space, no tokens are added. Instead, an error is returned with the time when
/// there will be space again.
pub fn add_tokens(
&mut self,
config: &LeakyBucketConfig,
started: Instant,
n: f64,
) -> Result<(), Instant> {
let now = Instant::now();
// invariant: started <= now
debug_assert!(started <= now);
// If the bucket was empty when we started our search,
// we should update the `empty_at` value accordingly.
// this prevents us from having negative tokens in the bucket.
let mut empty_at = self.empty_at;
if empty_at < started {
empty_at = started;
}
let n = config.cost.mul_f64(n);
let new_empty_at = empty_at + n;
let allow_at = new_empty_at.checked_sub(config.bucket_width);
// empty_at
// allow_at | new_empty_at
// / | /
// -------o-[---------o-|--]---------
// now1 ^ now2 ^
//
// at now1, the bucket would be completely filled if we add n tokens.
// at now2, the bucket would be partially filled if we add n tokens.
match allow_at {
Some(allow_at) if now < allow_at => Err(allow_at),
_ => {
self.empty_at = new_empty_at;
Ok(())
}
}
}
}
pub struct RateLimiter {
pub config: LeakyBucketConfig,
pub state: Mutex<LeakyBucketState>,
/// a queue to provide this fair ordering.
pub queue: Notify,
}
struct Requeue<'a>(&'a Notify);
impl Drop for Requeue<'_> {
fn drop(&mut self) {
self.0.notify_one();
}
}
impl RateLimiter {
pub fn with_initial_tokens(config: LeakyBucketConfig, initial_tokens: f64) -> Self {
RateLimiter {
state: Mutex::new(LeakyBucketState::with_initial_tokens(
&config,
initial_tokens,
)),
config,
queue: {
let queue = Notify::new();
queue.notify_one();
queue
},
}
}
pub fn steady_rps(&self) -> f64 {
self.config.cost.as_secs_f64().recip()
}
/// returns true if we did throttle
pub async fn acquire(&self, count: usize) -> bool {
let mut throttled = false;
let start = tokio::time::Instant::now();
// wait until we are the first in the queue
let mut notified = std::pin::pin!(self.queue.notified());
if !notified.as_mut().enable() {
throttled = true;
notified.await;
}
// notify the next waiter in the queue when we are done.
let _guard = Requeue(&self.queue);
loop {
let res = self
.state
.lock()
.unwrap()
.add_tokens(&self.config, start, count as f64);
match res {
Ok(()) => return throttled,
Err(ready_at) => {
throttled = true;
tokio::time::sleep_until(ready_at).await;
}
}
}
}
}
#[cfg(test)]
mod tests {
use std::time::Duration;
use tokio::time::Instant;
use super::{LeakyBucketConfig, LeakyBucketState};
#[tokio::test(start_paused = true)]
async fn check() {
let config = LeakyBucketConfig {
// average 100rps
cost: Duration::from_millis(10),
// burst up to 100 requests
bucket_width: Duration::from_millis(1000),
};
let mut state = LeakyBucketState {
empty_at: Instant::now(),
};
// supports burst
{
// should work for 100 requests this instant
for _ in 0..100 {
state.add_tokens(&config, Instant::now(), 1.0).unwrap();
}
let ready = state.add_tokens(&config, Instant::now(), 1.0).unwrap_err();
assert_eq!(ready - Instant::now(), Duration::from_millis(10));
}
// doesn't overfill
{
// after 1s we should have an empty bucket again.
tokio::time::advance(Duration::from_secs(1)).await;
assert!(state.bucket_is_empty(Instant::now()));
// after 1s more, we should not over count the tokens and allow more than 200 requests.
tokio::time::advance(Duration::from_secs(1)).await;
for _ in 0..100 {
state.add_tokens(&config, Instant::now(), 1.0).unwrap();
}
let ready = state.add_tokens(&config, Instant::now(), 1.0).unwrap_err();
assert_eq!(ready - Instant::now(), Duration::from_millis(10));
}
// supports sustained rate over a long period
{
tokio::time::advance(Duration::from_secs(1)).await;
// should sustain 100rps
for _ in 0..2000 {
tokio::time::advance(Duration::from_millis(10)).await;
state.add_tokens(&config, Instant::now(), 1.0).unwrap();
}
}
// supports requesting more tokens than can be stored in the bucket
// we just wait a little bit longer upfront.
{
// start the bucket completely empty
tokio::time::advance(Duration::from_secs(5)).await;
assert!(state.bucket_is_empty(Instant::now()));
// requesting 200 tokens of space should take 200*cost = 2s
// but we already have 1s available, so we wait 1s from start.
let start = Instant::now();
let ready = state.add_tokens(&config, start, 200.0).unwrap_err();
assert_eq!(ready - Instant::now(), Duration::from_secs(1));
tokio::time::advance(Duration::from_millis(500)).await;
let ready = state.add_tokens(&config, start, 200.0).unwrap_err();
assert_eq!(ready - Instant::now(), Duration::from_millis(500));
tokio::time::advance(Duration::from_millis(500)).await;
state.add_tokens(&config, start, 200.0).unwrap();
// bucket should be completely full now
let ready = state.add_tokens(&config, Instant::now(), 1.0).unwrap_err();
assert_eq!(ready - Instant::now(), Duration::from_millis(10));
}
}
}

View File

@@ -71,6 +71,7 @@ pub mod postgres_client;
pub mod tracing_span_assert;
pub mod leaky_bucket;
pub mod rate_limit;
/// Simple once-barrier and a guard which keeps barrier awaiting.

View File

@@ -5,6 +5,15 @@ use std::time::{Duration, Instant};
pub struct RateLimit {
last: Option<Instant>,
interval: Duration,
dropped: u64,
}
pub struct RateLimitStats(u64);
impl std::fmt::Display for RateLimitStats {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(f, "{} dropped calls", self.0)
}
}
impl RateLimit {
@@ -12,20 +21,27 @@ impl RateLimit {
Self {
last: None,
interval,
dropped: 0,
}
}
/// Call `f` if the rate limit allows.
/// Don't call it otherwise.
pub fn call<F: FnOnce()>(&mut self, f: F) {
self.call2(|_| f())
}
pub fn call2<F: FnOnce(RateLimitStats)>(&mut self, f: F) {
let now = Instant::now();
match self.last {
Some(last) if now - last <= self.interval => {
// ratelimit
self.dropped += 1;
}
_ => {
self.last = Some(now);
f();
f(RateLimitStats(self.dropped));
self.dropped = 0;
}
}
}

View File

@@ -9,8 +9,6 @@ anyhow.workspace = true
utils.workspace = true
postgres_ffi.workspace = true
workspace_hack.workspace = true
[build-dependencies]
anyhow.workspace = true
bindgen.workspace = true

View File

@@ -95,6 +95,7 @@ fn main() -> anyhow::Result<()> {
.allowlist_var("ERROR")
.allowlist_var("FATAL")
.allowlist_var("PANIC")
.allowlist_var("PG_VERSION_NUM")
.allowlist_var("WPEVENT")
.allowlist_var("WL_LATCH_SET")
.allowlist_var("WL_SOCKET_READABLE")

View File

@@ -282,7 +282,11 @@ mod tests {
use std::cell::UnsafeCell;
use utils::id::TenantTimelineId;
use crate::{api_bindings::Level, bindings::NeonWALReadResult, walproposer::Wrapper};
use crate::{
api_bindings::Level,
bindings::{NeonWALReadResult, PG_VERSION_NUM},
walproposer::Wrapper,
};
use super::ApiImpl;
@@ -489,41 +493,79 @@ mod tests {
let (sender, receiver) = sync_channel(1);
// Messages definitions are at walproposer.h
// xxx: it would be better to extract them from safekeeper crate and
// use serialization/deserialization here.
let greeting_tag = (b'g' as u64).to_ne_bytes();
let proto_version = 2_u32.to_ne_bytes();
let pg_version: [u8; 4] = PG_VERSION_NUM.to_ne_bytes();
let proposer_id = [0; 16];
let system_id = 0_u64.to_ne_bytes();
let tenant_id = ttid.tenant_id.as_arr();
let timeline_id = ttid.timeline_id.as_arr();
let pg_tli = 1_u32.to_ne_bytes();
let wal_seg_size = 16777216_u32.to_ne_bytes();
let proposer_greeting = [
greeting_tag.as_slice(),
proto_version.as_slice(),
pg_version.as_slice(),
proposer_id.as_slice(),
system_id.as_slice(),
tenant_id.as_slice(),
timeline_id.as_slice(),
pg_tli.as_slice(),
wal_seg_size.as_slice(),
]
.concat();
let voting_tag = (b'v' as u64).to_ne_bytes();
let vote_request_term = 3_u64.to_ne_bytes();
let proposer_id = [0; 16];
let vote_request = [
voting_tag.as_slice(),
vote_request_term.as_slice(),
proposer_id.as_slice(),
]
.concat();
let acceptor_greeting_term = 2_u64.to_ne_bytes();
let acceptor_greeting_node_id = 1_u64.to_ne_bytes();
let acceptor_greeting = [
greeting_tag.as_slice(),
acceptor_greeting_term.as_slice(),
acceptor_greeting_node_id.as_slice(),
]
.concat();
let vote_response_term = 3_u64.to_ne_bytes();
let vote_given = 1_u64.to_ne_bytes();
let flush_lsn = 0x539_u64.to_ne_bytes();
let truncate_lsn = 0x539_u64.to_ne_bytes();
let th_len = 1_u32.to_ne_bytes();
let th_term = 2_u64.to_ne_bytes();
let th_lsn = 0x539_u64.to_ne_bytes();
let timeline_start_lsn = 0x539_u64.to_ne_bytes();
let vote_response = [
voting_tag.as_slice(),
vote_response_term.as_slice(),
vote_given.as_slice(),
flush_lsn.as_slice(),
truncate_lsn.as_slice(),
th_len.as_slice(),
th_term.as_slice(),
th_lsn.as_slice(),
timeline_start_lsn.as_slice(),
]
.concat();
let my_impl: Box<dyn ApiImpl> = Box::new(MockImpl {
wait_events: Cell::new(WaitEventsData {
sk: std::ptr::null_mut(),
event_mask: 0,
}),
expected_messages: vec![
// TODO: When updating Postgres versions, this test will cause
// problems. Postgres version in message needs updating.
//
// Greeting(ProposerGreeting { protocol_version: 2, pg_version: 160003, proposer_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], system_id: 0, timeline_id: 9e4c8f36063c6c6e93bc20d65a820f3d, tenant_id: 9e4c8f36063c6c6e93bc20d65a820f3d, tli: 1, wal_seg_size: 16777216 })
vec![
103, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 3, 113, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 158, 76, 143, 54, 6, 60, 108, 110,
147, 188, 32, 214, 90, 130, 15, 61, 158, 76, 143, 54, 6, 60, 108, 110, 147,
188, 32, 214, 90, 130, 15, 61, 1, 0, 0, 0, 0, 0, 0, 1,
],
// VoteRequest(VoteRequest { term: 3 })
vec![
118, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0,
],
],
expected_messages: vec![proposer_greeting, vote_request],
expected_ptr: AtomicUsize::new(0),
safekeeper_replies: vec![
// Greeting(AcceptorGreeting { term: 2, node_id: NodeId(1) })
vec![
103, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
],
// VoteResponse(VoteResponse { term: 3, vote_given: 1, flush_lsn: 0/539, truncate_lsn: 0/539, term_history: [(2, 0/539)], timeline_start_lsn: 0/539 })
vec![
118, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 57,
5, 0, 0, 0, 0, 0, 0, 57, 5, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
0, 57, 5, 0, 0, 0, 0, 0, 0, 57, 5, 0, 0, 0, 0, 0, 0,
],
],
safekeeper_replies: vec![acceptor_greeting, vote_response],
replies_ptr: AtomicUsize::new(0),
sync_channel: sender,
shmem: UnsafeCell::new(crate::api_bindings::empty_shmem()),

View File

@@ -16,6 +16,7 @@ arc-swap.workspace = true
async-compression.workspace = true
async-stream.workspace = true
async-trait.workspace = true
bit_field.workspace = true
byteorder.workspace = true
bytes.workspace = true
camino.workspace = true
@@ -36,7 +37,6 @@ humantime.workspace = true
humantime-serde.workspace = true
hyper.workspace = true
itertools.workspace = true
leaky-bucket.workspace = true
md5.workspace = true
nix.workspace = true
# hack to get the number of worker threads tokio uses
@@ -52,6 +52,7 @@ rand.workspace = true
range-set-blaze = { version = "0.1.16", features = ["alloc"] }
regex.workspace = true
scopeguard.workspace = true
send-future.workspace = true
serde.workspace = true
serde_json = { workspace = true, features = ["raw_value"] }
serde_path_to_error.workspace = true

View File

@@ -4,12 +4,13 @@ use bytes::Bytes;
use camino::Utf8PathBuf;
use criterion::{criterion_group, criterion_main, Criterion};
use pageserver::{
config::PageServerConf,
config::{defaults::DEFAULT_IO_BUFFER_ALIGNMENT, PageServerConf},
context::{DownloadBehavior, RequestContext},
l0_flush::{L0FlushConfig, L0FlushGlobalState},
page_cache,
repository::Value,
task_mgr::TaskKind,
tenant::storage_layer::inmemory_layer::SerializedBatch,
tenant::storage_layer::InMemoryLayer,
virtual_file,
};
@@ -67,12 +68,16 @@ async fn ingest(
let layer =
InMemoryLayer::create(conf, timeline_id, tenant_shard_id, lsn, entered, &ctx).await?;
let data = Value::Image(Bytes::from(vec![0u8; put_size])).ser()?;
let data = Value::Image(Bytes::from(vec![0u8; put_size]));
let data_ser_size = data.serialized_size().unwrap() as usize;
let ctx = RequestContext::new(
pageserver::task_mgr::TaskKind::WalReceiverConnectionHandler,
pageserver::context::DownloadBehavior::Download,
);
const BATCH_SIZE: usize = 16;
let mut batch = Vec::new();
for i in 0..put_count {
lsn += put_size as u64;
@@ -95,7 +100,17 @@ async fn ingest(
}
}
layer.put_value(key.to_compact(), lsn, &data, &ctx).await?;
batch.push((key.to_compact(), lsn, data_ser_size, data.clone()));
if batch.len() >= BATCH_SIZE {
let this_batch = std::mem::take(&mut batch);
let serialized = SerializedBatch::from_values(this_batch).unwrap();
layer.put_batch(serialized, &ctx).await?;
}
}
if !batch.is_empty() {
let this_batch = std::mem::take(&mut batch);
let serialized = SerializedBatch::from_values(this_batch).unwrap();
layer.put_batch(serialized, &ctx).await?;
}
layer.freeze(lsn + 1).await;
@@ -149,7 +164,11 @@ fn criterion_benchmark(c: &mut Criterion) {
let conf: &'static PageServerConf = Box::leak(Box::new(
pageserver::config::PageServerConf::dummy_conf(temp_dir.path().to_path_buf()),
));
virtual_file::init(16384, virtual_file::io_engine_for_bench());
virtual_file::init(
16384,
virtual_file::io_engine_for_bench(),
DEFAULT_IO_BUFFER_ALIGNMENT,
);
page_cache::init(conf.page_cache_size);
{

View File

@@ -7,7 +7,6 @@ license.workspace = true
[dependencies]
pageserver_api.workspace = true
thiserror.workspace = true
async-trait.workspace = true
reqwest = { workspace = true, features = [ "stream" ] }
utils.workspace = true
serde.workspace = true

View File

@@ -419,6 +419,24 @@ impl Client {
}
}
pub async fn timeline_archival_config(
&self,
tenant_shard_id: TenantShardId,
timeline_id: TimelineId,
req: &TimelineArchivalConfigRequest,
) -> Result<()> {
let uri = format!(
"{}/v1/tenant/{tenant_shard_id}/timeline/{timeline_id}/archival_config",
self.mgmt_api_endpoint
);
self.request(Method::POST, &uri, req)
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
pub async fn timeline_detach_ancestor(
&self,
tenant_shard_id: TenantShardId,
@@ -506,6 +524,16 @@ impl Client {
.map_err(Error::ReceiveBody)
}
/// Configs io buffer alignment at runtime.
pub async fn put_io_alignment(&self, align: usize) -> Result<()> {
let uri = format!("{}/v1/io_alignment", self.mgmt_api_endpoint);
self.request(Method::PUT, uri, align)
.await?
.json()
.await
.map_err(Error::ReceiveBody)
}
pub async fn get_utilization(&self) -> Result<PageserverUtilization> {
let uri = format!("{}/v1/utilization", self.mgmt_api_endpoint);
self.get(uri)

View File

@@ -4,6 +4,7 @@
use anyhow::Result;
use camino::{Utf8Path, Utf8PathBuf};
use pageserver::config::defaults::DEFAULT_IO_BUFFER_ALIGNMENT;
use pageserver::context::{DownloadBehavior, RequestContext};
use pageserver::task_mgr::TaskKind;
use pageserver::tenant::{TENANTS_SEGMENT_NAME, TIMELINES_SEGMENT_NAME};
@@ -144,7 +145,11 @@ pub(crate) async fn main(cmd: &AnalyzeLayerMapCmd) -> Result<()> {
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
// Initialize virtual_file (file desriptor cache) and page cache which are needed to access layer persistent B-Tree.
pageserver::virtual_file::init(10, virtual_file::api::IoEngineKind::StdFs);
pageserver::virtual_file::init(
10,
virtual_file::api::IoEngineKind::StdFs,
DEFAULT_IO_BUFFER_ALIGNMENT,
);
pageserver::page_cache::init(100);
let mut total_delta_layers = 0usize;

View File

@@ -3,6 +3,7 @@ use std::path::{Path, PathBuf};
use anyhow::Result;
use camino::{Utf8Path, Utf8PathBuf};
use clap::Subcommand;
use pageserver::config::defaults::DEFAULT_IO_BUFFER_ALIGNMENT;
use pageserver::context::{DownloadBehavior, RequestContext};
use pageserver::task_mgr::TaskKind;
use pageserver::tenant::block_io::BlockCursor;
@@ -59,7 +60,7 @@ pub(crate) enum LayerCmd {
async fn read_delta_file(path: impl AsRef<Path>, ctx: &RequestContext) -> Result<()> {
let path = Utf8Path::from_path(path.as_ref()).expect("non-Unicode path");
virtual_file::init(10, virtual_file::api::IoEngineKind::StdFs);
virtual_file::init(10, virtual_file::api::IoEngineKind::StdFs, 1);
page_cache::init(100);
let file = VirtualFile::open(path, ctx).await?;
let file_id = page_cache::next_file_id();
@@ -89,6 +90,7 @@ async fn read_delta_file(path: impl AsRef<Path>, ctx: &RequestContext) -> Result
for (k, v) in all {
let value = cursor.read_blob(v.pos(), ctx).await?;
println!("key:{} value_len:{}", k, value.len());
assert!(k.is_i128_representable(), "invalid key: ");
}
// TODO(chi): special handling for last key?
Ok(())
@@ -189,7 +191,11 @@ pub(crate) async fn main(cmd: &LayerCmd) -> Result<()> {
new_tenant_id,
new_timeline_id,
} => {
pageserver::virtual_file::init(10, virtual_file::api::IoEngineKind::StdFs);
pageserver::virtual_file::init(
10,
virtual_file::api::IoEngineKind::StdFs,
DEFAULT_IO_BUFFER_ALIGNMENT,
);
pageserver::page_cache::init(100);
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);

View File

@@ -20,6 +20,7 @@ use clap::{Parser, Subcommand};
use index_part::IndexPartCmd;
use layers::LayerCmd;
use pageserver::{
config::defaults::DEFAULT_IO_BUFFER_ALIGNMENT,
context::{DownloadBehavior, RequestContext},
page_cache,
task_mgr::TaskKind,
@@ -205,7 +206,11 @@ fn read_pg_control_file(control_file_path: &Utf8Path) -> anyhow::Result<()> {
async fn print_layerfile(path: &Utf8Path) -> anyhow::Result<()> {
// Basic initialization of things that don't change after startup
virtual_file::init(10, virtual_file::api::IoEngineKind::StdFs);
virtual_file::init(
10,
virtual_file::api::IoEngineKind::StdFs,
DEFAULT_IO_BUFFER_ALIGNMENT,
);
page_cache::init(100);
let ctx = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
dump_layerfile_from_path(path, true, &ctx).await

View File

@@ -58,6 +58,11 @@ pub(crate) struct Args {
/// [`pageserver_api::models::virtual_file::IoEngineKind`].
#[clap(long)]
set_io_engine: Option<pageserver_api::models::virtual_file::IoEngineKind>,
/// Before starting the benchmark, live-reconfigure the pageserver to use specified alignment for io buffers.
#[clap(long)]
set_io_alignment: Option<usize>,
targets: Option<Vec<TenantTimelineId>>,
}
@@ -124,6 +129,10 @@ async fn main_impl(
mgmt_api_client.put_io_engine(engine_str).await?;
}
if let Some(align) = args.set_io_alignment {
mgmt_api_client.put_io_alignment(align).await?;
}
// discover targets
let timelines: Vec<TenantTimelineId> = crate::util::cli::targets::discover(
&mgmt_api_client,

View File

@@ -0,0 +1,39 @@
//! `u64`` and `usize`` aren't guaranteed to be identical in Rust, but life is much simpler if that's the case.
pub(crate) const _ASSERT_U64_EQ_USIZE: () = {
if std::mem::size_of::<usize>() != std::mem::size_of::<u64>() {
panic!("the traits defined in this module assume that usize and u64 can be converted to each other without loss of information");
}
};
pub(crate) trait U64IsUsize {
fn into_usize(self) -> usize;
}
impl U64IsUsize for u64 {
#[inline(always)]
fn into_usize(self) -> usize {
#[allow(clippy::let_unit_value)]
let _ = _ASSERT_U64_EQ_USIZE;
self as usize
}
}
pub(crate) trait UsizeIsU64 {
fn into_u64(self) -> u64;
}
impl UsizeIsU64 for usize {
#[inline(always)]
fn into_u64(self) -> u64 {
#[allow(clippy::let_unit_value)]
let _ = _ASSERT_U64_EQ_USIZE;
self as u64
}
}
pub const fn u64_to_usize(x: u64) -> usize {
#[allow(clippy::let_unit_value)]
let _ = _ASSERT_U64_EQ_USIZE;
x as usize
}

View File

@@ -0,0 +1,61 @@
use anyhow;
use camino::Utf8PathBuf;
use clap::Parser;
use pageserver::{pg_import, virtual_file::{self, api::IoEngineKind}};
use utils::id::{TenantId, TimelineId};
use utils::logging::{self, LogFormat, TracingErrorLayerEnablement};
use std::str::FromStr;
//project_git_version!(GIT_VERSION);
#[derive(Parser)]
#[command(
//version = GIT_VERSION,
about = "Utility to import a Postgres data directory directly into image layers",
//long_about = "..."
)]
struct CliOpts {
/// Input Postgres data directory
pgdata: Utf8PathBuf,
/// Path to local dir where the layer files will be stored
dest_path: Utf8PathBuf,
#[arg(long, default_value_t = TenantId::from_str("42424242424242424242424242424242").unwrap())]
tenant_id: TenantId,
#[arg(long, default_value_t = TimelineId::from_str("42424242424242424242424242424242").unwrap())]
timeline_id: TimelineId,
}
fn main() -> anyhow::Result<()> {
logging::init(
LogFormat::Plain,
TracingErrorLayerEnablement::EnableWithRustLogFilter,
logging::Output::Stdout,
)?;
virtual_file::init(
100,
IoEngineKind::StdFs,
512,
);
let rt = tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()?;
let cli = CliOpts::parse();
rt.block_on(async_main(cli))?;
Ok(())
}
async fn async_main(cli: CliOpts) -> anyhow::Result<()> {
let mut import = pg_import::PgImportEnv::init(&cli.dest_path, cli.tenant_id, cli.timeline_id).await?;
import.import_datadir(&cli.pgdata).await?;
Ok(())
}

View File

@@ -125,18 +125,69 @@ fn main() -> anyhow::Result<()> {
info!(?conf.virtual_file_io_engine, "starting with virtual_file IO engine");
info!(?conf.virtual_file_direct_io, "starting with virtual_file Direct IO settings");
info!(?conf.compact_level0_phase1_value_access, "starting with setting for compact_level0_phase1_value_access");
info!(?conf.io_buffer_alignment, "starting with setting for IO buffer alignment");
// The tenants directory contains all the pageserver local disk state.
// Create if not exists and make sure all the contents are durable before proceeding.
// Ensuring durability eliminates a whole bug class where we come up after an unclean shutdown.
// After unclea shutdown, we don't know if all the filesystem content we can read via syscalls is actually durable or not.
// Examples for that: OOM kill, systemd killing us during shutdown, self abort due to unrecoverable IO error.
let tenants_path = conf.tenants_path();
if !tenants_path.exists() {
utils::crashsafe::create_dir_all(conf.tenants_path())
.with_context(|| format!("Failed to create tenants root dir at '{tenants_path}'"))?;
{
let open = || {
nix::dir::Dir::open(
tenants_path.as_std_path(),
nix::fcntl::OFlag::O_DIRECTORY | nix::fcntl::OFlag::O_RDONLY,
nix::sys::stat::Mode::empty(),
)
};
let dirfd = match open() {
Ok(dirfd) => dirfd,
Err(e) => match e {
nix::errno::Errno::ENOENT => {
utils::crashsafe::create_dir_all(&tenants_path).with_context(|| {
format!("Failed to create tenants root dir at '{tenants_path}'")
})?;
open().context("open tenants dir after creating it")?
}
e => anyhow::bail!(e),
},
};
let started = Instant::now();
// Linux guarantees durability for syncfs.
// POSIX doesn't have syncfs, and further does not actually guarantee durability of sync().
#[cfg(target_os = "linux")]
{
use std::os::fd::AsRawFd;
nix::unistd::syncfs(dirfd.as_raw_fd()).context("syncfs")?;
}
#[cfg(target_os = "macos")]
{
// macOS is not a production platform for Neon, don't even bother.
drop(dirfd);
}
#[cfg(not(any(target_os = "linux", target_os = "macos")))]
{
compile_error!("Unsupported OS");
}
let elapsed = started.elapsed();
info!(
elapsed_ms = elapsed.as_millis(),
"made tenant directory contents durable"
);
}
// Initialize up failpoints support
let scenario = failpoint_support::init();
// Basic initialization of things that don't change after startup
virtual_file::init(conf.max_file_descriptors, conf.virtual_file_io_engine);
virtual_file::init(
conf.max_file_descriptors,
conf.virtual_file_io_engine,
conf.io_buffer_alignment,
);
page_cache::init(conf.page_cache_size);
start_pageserver(launch_ts, conf).context("Failed to start pageserver")?;

View File

@@ -31,6 +31,7 @@ use utils::{
use crate::l0_flush::L0FlushConfig;
use crate::tenant::config::TenantConfOpt;
use crate::tenant::storage_layer::inmemory_layer::IndexEntry;
use crate::tenant::timeline::compaction::CompactL0Phase1ValueAccess;
use crate::tenant::vectored_blob_io::MaxVectoredReadBytes;
use crate::tenant::{TENANTS_SEGMENT_NAME, TIMELINES_SEGMENT_NAME};
@@ -95,6 +96,8 @@ pub mod defaults {
pub const DEFAULT_EPHEMERAL_BYTES_PER_MEMORY_KB: usize = 0;
pub const DEFAULT_IO_BUFFER_ALIGNMENT: usize = 512;
///
/// Default built-in configuration file.
///
@@ -289,6 +292,8 @@ pub struct PageServerConf {
/// Direct IO settings
pub virtual_file_direct_io: virtual_file::DirectIoMode,
pub io_buffer_alignment: usize,
}
/// We do not want to store this in a PageServerConf because the latter may be logged
@@ -393,6 +398,8 @@ struct PageServerConfigBuilder {
compact_level0_phase1_value_access: BuilderValue<CompactL0Phase1ValueAccess>,
virtual_file_direct_io: BuilderValue<virtual_file::DirectIoMode>,
io_buffer_alignment: BuilderValue<usize>,
}
impl PageServerConfigBuilder {
@@ -481,6 +488,7 @@ impl PageServerConfigBuilder {
l0_flush: Set(L0FlushConfig::default()),
compact_level0_phase1_value_access: Set(CompactL0Phase1ValueAccess::default()),
virtual_file_direct_io: Set(virtual_file::DirectIoMode::default()),
io_buffer_alignment: Set(DEFAULT_IO_BUFFER_ALIGNMENT),
}
}
}
@@ -660,6 +668,10 @@ impl PageServerConfigBuilder {
self.virtual_file_direct_io = BuilderValue::Set(value);
}
pub fn io_buffer_alignment(&mut self, value: usize) {
self.io_buffer_alignment = BuilderValue::Set(value);
}
pub fn build(self, id: NodeId) -> anyhow::Result<PageServerConf> {
let default = Self::default_values();
@@ -716,6 +728,7 @@ impl PageServerConfigBuilder {
l0_flush,
compact_level0_phase1_value_access,
virtual_file_direct_io,
io_buffer_alignment,
}
CUSTOM LOGIC
{
@@ -985,6 +998,9 @@ impl PageServerConf {
"virtual_file_direct_io" => {
builder.virtual_file_direct_io(utils::toml_edit_ext::deserialize_item(item).context("virtual_file_direct_io")?)
}
"io_buffer_alignment" => {
builder.io_buffer_alignment(parse_toml_u64("io_buffer_alignment", item)? as usize)
}
_ => bail!("unrecognized pageserver option '{key}'"),
}
}
@@ -1005,6 +1021,15 @@ impl PageServerConf {
conf.default_tenant_conf = t_conf.merge(TenantConf::default());
IndexEntry::validate_checkpoint_distance(conf.default_tenant_conf.checkpoint_distance)
.map_err(|msg| anyhow::anyhow!("{msg}"))
.with_context(|| {
format!(
"effective checkpoint distance is unsupported: {}",
conf.default_tenant_conf.checkpoint_distance
)
})?;
Ok(conf)
}
@@ -1068,6 +1093,7 @@ impl PageServerConf {
l0_flush: L0FlushConfig::default(),
compact_level0_phase1_value_access: CompactL0Phase1ValueAccess::default(),
virtual_file_direct_io: virtual_file::DirectIoMode::default(),
io_buffer_alignment: defaults::DEFAULT_IO_BUFFER_ALIGNMENT,
}
}
}
@@ -1308,6 +1334,7 @@ background_task_maximum_delay = '334 s'
l0_flush: L0FlushConfig::default(),
compact_level0_phase1_value_access: CompactL0Phase1ValueAccess::default(),
virtual_file_direct_io: virtual_file::DirectIoMode::default(),
io_buffer_alignment: defaults::DEFAULT_IO_BUFFER_ALIGNMENT,
},
"Correct defaults should be used when no config values are provided"
);
@@ -1381,6 +1408,7 @@ background_task_maximum_delay = '334 s'
l0_flush: L0FlushConfig::default(),
compact_level0_phase1_value_access: CompactL0Phase1ValueAccess::default(),
virtual_file_direct_io: virtual_file::DirectIoMode::default(),
io_buffer_alignment: defaults::DEFAULT_IO_BUFFER_ALIGNMENT,
},
"Should be able to parse all basic config values correctly"
);

View File

@@ -1,6 +1,8 @@
//! Periodically collect consumption metrics for all active tenants
//! and push them to a HTTP endpoint.
use crate::config::PageServerConf;
use crate::consumption_metrics::metrics::MetricsKey;
use crate::consumption_metrics::upload::KeyGen as _;
use crate::context::{DownloadBehavior, RequestContext};
use crate::task_mgr::{self, TaskKind, BACKGROUND_RUNTIME};
use crate::tenant::size::CalculateSyntheticSizeError;
@@ -8,6 +10,7 @@ use crate::tenant::tasks::BackgroundLoopKind;
use crate::tenant::{mgr::TenantManager, LogicalSizeCalculationCause, Tenant};
use camino::Utf8PathBuf;
use consumption_metrics::EventType;
use itertools::Itertools as _;
use pageserver_api::models::TenantState;
use remote_storage::{GenericRemoteStorage, RemoteStorageConfig};
use reqwest::Url;
@@ -19,9 +22,8 @@ use tokio_util::sync::CancellationToken;
use tracing::*;
use utils::id::NodeId;
mod metrics;
use crate::consumption_metrics::metrics::MetricsKey;
mod disk_cache;
mod metrics;
mod upload;
const DEFAULT_HTTP_REPORTING_TIMEOUT: Duration = Duration::from_secs(60);
@@ -143,6 +145,12 @@ async fn collect_metrics(
// these are point in time, with variable "now"
let metrics = metrics::collect_all_metrics(&tenant_manager, &cached_metrics, &ctx).await;
// Pre-generate event idempotency keys, to reuse them across the bucket
// and HTTP sinks.
let idempotency_keys = std::iter::repeat_with(|| node_id.as_str().generate())
.take(metrics.len())
.collect_vec();
let metrics = Arc::new(metrics);
// why not race cancellation here? because we are one of the last tasks, and if we are
@@ -161,8 +169,14 @@ async fn collect_metrics(
}
if let Some(bucket_client) = &bucket_client {
let res =
upload::upload_metrics_bucket(bucket_client, &cancel, &node_id, &metrics).await;
let res = upload::upload_metrics_bucket(
bucket_client,
&cancel,
&node_id,
&metrics,
&idempotency_keys,
)
.await;
if let Err(e) = res {
tracing::error!("failed to upload to S3: {e:#}");
}
@@ -174,9 +188,9 @@ async fn collect_metrics(
&client,
metric_collection_endpoint,
&cancel,
&node_id,
&metrics,
&mut cached_metrics,
&idempotency_keys,
)
.await;
if let Err(e) = res {

View File

@@ -24,16 +24,16 @@ pub(super) async fn upload_metrics_http(
client: &reqwest::Client,
metric_collection_endpoint: &reqwest::Url,
cancel: &CancellationToken,
node_id: &str,
metrics: &[RawMetric],
cached_metrics: &mut Cache,
idempotency_keys: &[IdempotencyKey<'_>],
) -> anyhow::Result<()> {
let mut uploaded = 0;
let mut failed = 0;
let started_at = std::time::Instant::now();
let mut iter = serialize_in_chunks(CHUNK_SIZE, metrics, node_id);
let mut iter = serialize_in_chunks(CHUNK_SIZE, metrics, idempotency_keys);
while let Some(res) = iter.next() {
let (chunk, body) = res?;
@@ -87,6 +87,7 @@ pub(super) async fn upload_metrics_bucket(
cancel: &CancellationToken,
node_id: &str,
metrics: &[RawMetric],
idempotency_keys: &[IdempotencyKey<'_>],
) -> anyhow::Result<()> {
if metrics.is_empty() {
// Skip uploads if we have no metrics, so that readers don't have to handle the edge case
@@ -106,7 +107,7 @@ pub(super) async fn upload_metrics_bucket(
// Serialize and write into compressed buffer
let started_at = std::time::Instant::now();
for res in serialize_in_chunks(CHUNK_SIZE, metrics, node_id) {
for res in serialize_in_chunks(CHUNK_SIZE, metrics, idempotency_keys) {
let (_chunk, body) = res?;
gzip_writer.write_all(&body).await?;
}
@@ -134,29 +135,31 @@ pub(super) async fn upload_metrics_bucket(
Ok(())
}
// The return type is quite ugly, but we gain testability in isolation
fn serialize_in_chunks<'a, F>(
/// Serializes the input metrics as JSON in chunks of chunk_size. The provided
/// idempotency keys are injected into the corresponding metric events (reused
/// across different metrics sinks), and must have the same length as input.
fn serialize_in_chunks<'a>(
chunk_size: usize,
input: &'a [RawMetric],
factory: F,
idempotency_keys: &'a [IdempotencyKey<'a>],
) -> impl ExactSizeIterator<Item = Result<(&'a [RawMetric], bytes::Bytes), serde_json::Error>> + 'a
where
F: KeyGen<'a> + 'a,
{
use bytes::BufMut;
struct Iter<'a, F> {
assert_eq!(input.len(), idempotency_keys.len());
struct Iter<'a> {
inner: std::slice::Chunks<'a, RawMetric>,
idempotency_keys: std::slice::Iter<'a, IdempotencyKey<'a>>,
chunk_size: usize,
// write to a BytesMut so that we can cheaply clone the frozen Bytes for retries
buffer: bytes::BytesMut,
// chunk amount of events are reused to produce the serialized document
scratch: Vec<Event<Ids, Name>>,
factory: F,
}
impl<'a, F: KeyGen<'a>> Iterator for Iter<'a, F> {
impl<'a> Iterator for Iter<'a> {
type Item = Result<(&'a [RawMetric], bytes::Bytes), serde_json::Error>;
fn next(&mut self) -> Option<Self::Item> {
@@ -167,17 +170,14 @@ where
self.scratch.extend(
chunk
.iter()
.map(|raw_metric| raw_metric.as_event(&self.factory.generate())),
.zip(&mut self.idempotency_keys)
.map(|(raw_metric, key)| raw_metric.as_event(key)),
);
} else {
// next rounds: update_in_place to reuse allocations
assert_eq!(self.scratch.len(), self.chunk_size);
self.scratch
.iter_mut()
.zip(chunk.iter())
.for_each(|(slot, raw_metric)| {
raw_metric.update_in_place(slot, &self.factory.generate())
});
itertools::izip!(self.scratch.iter_mut(), chunk, &mut self.idempotency_keys)
.for_each(|(slot, raw_metric, key)| raw_metric.update_in_place(slot, key));
}
let res = serde_json::to_writer(
@@ -198,18 +198,19 @@ where
}
}
impl<'a, F: KeyGen<'a>> ExactSizeIterator for Iter<'a, F> {}
impl<'a> ExactSizeIterator for Iter<'a> {}
let buffer = bytes::BytesMut::new();
let inner = input.chunks(chunk_size);
let idempotency_keys = idempotency_keys.iter();
let scratch = Vec::new();
Iter {
inner,
idempotency_keys,
chunk_size,
buffer,
scratch,
factory,
}
}
@@ -268,7 +269,7 @@ impl RawMetricExt for RawMetric {
}
}
trait KeyGen<'a>: Copy {
pub(crate) trait KeyGen<'a> {
fn generate(&self) -> IdempotencyKey<'a>;
}
@@ -389,7 +390,10 @@ mod tests {
let examples = metric_samples();
assert!(examples.len() > 1);
let factory = FixedGen::new(Utc::now(), "1", 42);
let now = Utc::now();
let idempotency_keys = (0..examples.len())
.map(|i| FixedGen::new(now, "1", i as u16).generate())
.collect::<Vec<_>>();
// need to use Event here because serde_json::Value uses default hashmap, not linked
// hashmap
@@ -398,13 +402,13 @@ mod tests {
events: Vec<Event<Ids, Name>>,
}
let correct = serialize_in_chunks(examples.len(), &examples, factory)
let correct = serialize_in_chunks(examples.len(), &examples, &idempotency_keys)
.map(|res| res.unwrap().1)
.flat_map(|body| serde_json::from_slice::<EventChunk>(&body).unwrap().events)
.collect::<Vec<_>>();
for chunk_size in 1..examples.len() {
let actual = serialize_in_chunks(chunk_size, &examples, factory)
let actual = serialize_in_chunks(chunk_size, &examples, &idempotency_keys)
.map(|res| res.unwrap().1)
.flat_map(|body| serde_json::from_slice::<EventChunk>(&body).unwrap().events)
.collect::<Vec<_>>();

View File

@@ -105,8 +105,10 @@ pub struct RequestContext {
#[derive(Clone, Copy, PartialEq, Eq, Debug, enum_map::Enum, strum_macros::IntoStaticStr)]
pub enum PageContentKind {
Unknown,
DeltaLayerSummary,
DeltaLayerBtreeNode,
DeltaLayerValue,
ImageLayerSummary,
ImageLayerBtreeNode,
ImageLayerValue,
InMemoryLayer,

View File

@@ -141,12 +141,18 @@ impl ControlPlaneGenerationsApi for ControlPlaneClient {
m.other
);
let az_id = m
.other
.get("availability_zone_id")
.and_then(|jv| jv.as_str().map(|str| str.to_owned()));
Some(NodeRegisterRequest {
node_id: conf.id,
listen_pg_addr: m.postgres_host,
listen_pg_port: m.postgres_port,
listen_http_addr: m.http_host,
listen_http_port: m.http_port,
availability_zone_id: az_id,
})
}
Err(e) => {

View File

@@ -318,6 +318,27 @@ impl From<crate::tenant::DeleteTimelineError> for ApiError {
}
}
impl From<crate::tenant::TimelineArchivalError> for ApiError {
fn from(value: crate::tenant::TimelineArchivalError) -> Self {
use crate::tenant::TimelineArchivalError::*;
match value {
NotFound => ApiError::NotFound(anyhow::anyhow!("timeline not found").into()),
Timeout => ApiError::Timeout("hit pageserver internal timeout".into()),
e @ HasArchivedParent(_) => {
ApiError::PreconditionFailed(e.to_string().into_boxed_str())
}
HasUnarchivedChildren(children) => ApiError::PreconditionFailed(
format!(
"Cannot archive timeline which has non-archived child timelines: {children:?}"
)
.into_boxed_str(),
),
a @ AlreadyInProgress => ApiError::Conflict(a.to_string()),
Other(e) => ApiError::InternalServerError(e),
}
}
}
impl From<crate::tenant::mgr::DeleteTimelineError> for ApiError {
fn from(value: crate::tenant::mgr::DeleteTimelineError) -> Self {
use crate::tenant::mgr::DeleteTimelineError::*;
@@ -405,6 +426,8 @@ async fn build_timeline_info_common(
let current_logical_size = timeline.get_current_logical_size(logical_size_task_priority, ctx);
let current_physical_size = Some(timeline.layer_size_sum().await);
let state = timeline.current_state();
// Report is_archived = false if the timeline is still loading
let is_archived = timeline.is_archived().unwrap_or(false);
let remote_consistent_lsn_projected = timeline
.get_remote_consistent_lsn_projected()
.unwrap_or(Lsn(0));
@@ -445,6 +468,7 @@ async fn build_timeline_info_common(
pg_version: timeline.pg_version,
state,
is_archived,
walreceiver_status,
@@ -686,9 +710,7 @@ async fn timeline_archival_config_handler(
tenant
.apply_timeline_archival_config(timeline_id, request_data.state)
.await
.context("applying archival config")
.map_err(ApiError::InternalServerError)?;
.await?;
Ok::<_, ApiError>(())
}
.instrument(info_span!("timeline_archival_config",
@@ -852,7 +874,10 @@ async fn get_timestamp_of_lsn_handler(
match result {
Some(time) => {
let time = format_rfc3339(postgres_ffi::from_pg_timestamp(time)).to_string();
let time = format_rfc3339(
postgres_ffi::try_from_pg_timestamp(time).map_err(ApiError::InternalServerError)?,
)
.to_string();
json_response(StatusCode::OK, time)
}
None => Err(ApiError::NotFound(
@@ -1706,13 +1731,12 @@ async fn timeline_compact_handler(
flags |= CompactFlags::ForceImageLayerCreation;
}
if Some(true) == parse_query_param::<_, bool>(&request, "enhanced_gc_bottom_most_compaction")? {
if !cfg!(feature = "testing") {
return Err(ApiError::InternalServerError(anyhow!(
"enhanced_gc_bottom_most_compaction is only available in testing mode"
)));
}
flags |= CompactFlags::EnhancedGcBottomMostCompaction;
}
if Some(true) == parse_query_param::<_, bool>(&request, "dry_run")? {
flags |= CompactFlags::DryRun;
}
let wait_until_uploaded =
parse_query_param::<_, bool>(&request, "wait_until_uploaded")?.unwrap_or(false);
@@ -2330,6 +2354,20 @@ async fn put_io_engine_handler(
json_response(StatusCode::OK, ())
}
async fn put_io_alignment_handler(
mut r: Request<Body>,
_cancel: CancellationToken,
) -> Result<Response<Body>, ApiError> {
check_permission(&r, None)?;
let align: usize = json_request(&mut r).await?;
crate::virtual_file::set_io_buffer_alignment(align).map_err(|align| {
ApiError::PreconditionFailed(
format!("Requested io alignment ({align}) is not a power of two").into(),
)
})?;
json_response(StatusCode::OK, ())
}
/// Polled by control plane.
///
/// See [`crate::utilization`].
@@ -2942,7 +2980,7 @@ pub fn make_router(
)
.put(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/compact",
|r| testing_api_handler("run timeline compaction", r, timeline_compact_handler),
|r| api_handler(r, timeline_compact_handler),
)
.put(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/checkpoint",
@@ -3017,6 +3055,9 @@ pub fn make_router(
|r| api_handler(r, timeline_collect_keyspace),
)
.put("/v1/io_engine", |r| api_handler(r, put_io_engine_handler))
.put("/v1/io_alignment", |r| {
api_handler(r, put_io_alignment_handler)
})
.put(
"/v1/tenant/:tenant_shard_id/timeline/:timeline_id/force_aux_policy_switch",
|r| api_handler(r, force_aux_policy_switch_handler),

View File

@@ -16,6 +16,7 @@ pub mod l0_flush;
use futures::{stream::FuturesUnordered, StreamExt};
pub use pageserver_api::keyspace;
use tokio_util::sync::CancellationToken;
mod assert_u64_eq_usize;
pub mod aux_file;
pub mod metrics;
pub mod page_cache;
@@ -31,6 +32,7 @@ pub mod virtual_file;
pub mod walingest;
pub mod walrecord;
pub mod walredo;
pub mod pg_import;
use camino::Utf8Path;
use deletion_queue::DeletionQueue;
@@ -88,6 +90,8 @@ pub async fn shutdown_pageserver(
) {
use std::time::Duration;
let started_at = std::time::Instant::now();
// If the orderly shutdown below takes too long, we still want to make
// sure that all walredo processes are killed and wait()ed on by us, not systemd.
//
@@ -241,7 +245,10 @@ pub async fn shutdown_pageserver(
walredo_extraordinary_shutdown_thread.join().unwrap();
info!("walredo_extraordinary_shutdown_thread done");
info!("Shut down successfully completed");
info!(
elapsed_ms = started_at.elapsed().as_millis(),
"Shut down successfully completed"
);
std::process::exit(exit_code);
}

View File

@@ -1552,7 +1552,6 @@ pub(crate) static LIVE_CONNECTIONS: Lazy<IntCounterPairVec> = Lazy::new(|| {
#[derive(Clone, Copy, enum_map::Enum, IntoStaticStr)]
pub(crate) enum ComputeCommandKind {
PageStreamV2,
PageStream,
Basebackup,
Fullbackup,
LeaseLsn,
@@ -1803,6 +1802,14 @@ pub(crate) static SECONDARY_RESIDENT_PHYSICAL_SIZE: Lazy<UIntGaugeVec> = Lazy::n
.expect("failed to define a metric")
});
pub(crate) static NODE_UTILIZATION_SCORE: Lazy<UIntGauge> = Lazy::new(|| {
register_uint_gauge!(
"pageserver_utilization_score",
"The utilization score we report to the storage controller for scheduling, where 0 is empty, 1000000 is full, and anything above is considered overloaded",
)
.expect("failed to define a metric")
});
pub(crate) static SECONDARY_HEATMAP_TOTAL_SIZE: Lazy<UIntGaugeVec> = Lazy::new(|| {
register_uint_gauge_vec!(
"pageserver_secondary_heatmap_total_size",

View File

@@ -557,7 +557,7 @@ impl PageServerHandler {
pgb: &mut PostgresBackend<IO>,
tenant_id: TenantId,
timeline_id: TimelineId,
protocol_version: PagestreamProtocolVersion,
_protocol_version: PagestreamProtocolVersion,
ctx: RequestContext,
) -> Result<(), QueryError>
where
@@ -601,8 +601,7 @@ impl PageServerHandler {
fail::fail_point!("ps::handle-pagerequest-message");
// parse request
let neon_fe_msg =
PagestreamFeMessage::parse(&mut copy_data_bytes.reader(), protocol_version)?;
let neon_fe_msg = PagestreamFeMessage::parse(&mut copy_data_bytes.reader())?;
// invoke handler function
let (handler_result, span) = match neon_fe_msg {
@@ -754,16 +753,21 @@ impl PageServerHandler {
}
if request_lsn < **latest_gc_cutoff_lsn {
// Check explicitly for INVALID just to get a less scary error message if the
// request is obviously bogus
return Err(if request_lsn == Lsn::INVALID {
PageStreamError::BadRequest("invalid LSN(0) in request".into())
} else {
PageStreamError::BadRequest(format!(
let gc_info = &timeline.gc_info.read().unwrap();
if !gc_info.leases.contains_key(&request_lsn) {
// The requested LSN is below gc cutoff and is not guarded by a lease.
// Check explicitly for INVALID just to get a less scary error message if the
// request is obviously bogus
return Err(if request_lsn == Lsn::INVALID {
PageStreamError::BadRequest("invalid LSN(0) in request".into())
} else {
PageStreamError::BadRequest(format!(
"tried to request a page version that was garbage collected. requested at {} gc cutoff {}",
request_lsn, **latest_gc_cutoff_lsn
).into())
});
});
}
}
// Wait for WAL up to 'not_modified_since' to arrive, if necessary
@@ -790,6 +794,8 @@ impl PageServerHandler {
}
}
/// Handles the lsn lease request.
/// If a lease cannot be obtained, the client will receive NULL.
#[instrument(skip_all, fields(shard_id, %lsn))]
async fn handle_make_lsn_lease<IO>(
&mut self,
@@ -812,19 +818,25 @@ impl PageServerHandler {
.await?;
set_tracing_field_shard_id(&timeline);
let lease = timeline.make_lsn_lease(lsn, timeline.get_lsn_lease_length(), ctx)?;
let valid_until = lease
.valid_until
.duration_since(SystemTime::UNIX_EPOCH)
.map_err(|e| QueryError::Other(e.into()))?;
let lease = timeline
.make_lsn_lease(lsn, timeline.get_lsn_lease_length(), ctx)
.inspect_err(|e| {
warn!("{e}");
})
.ok();
let valid_until_str = lease.map(|l| {
l.valid_until
.duration_since(SystemTime::UNIX_EPOCH)
.expect("valid_until is earlier than UNIX_EPOCH")
.as_millis()
.to_string()
});
let bytes = valid_until_str.as_ref().map(|x| x.as_bytes());
pgb.write_message_noflush(&BeMessage::RowDescription(&[RowDescriptor::text_col(
b"valid_until",
)]))?
.write_message_noflush(&BeMessage::DataRow(&[Some(
&valid_until.as_millis().to_be_bytes(),
)]))?
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
.write_message_noflush(&BeMessage::DataRow(&[bytes]))?;
Ok(())
}
@@ -1275,35 +1287,6 @@ where
ctx,
)
.await?;
} else if let Some(params) = parts.strip_prefix(&["pagestream"]) {
if params.len() != 2 {
return Err(QueryError::Other(anyhow::anyhow!(
"invalid param number for pagestream command"
)));
}
let tenant_id = TenantId::from_str(params[0])
.with_context(|| format!("Failed to parse tenant id from {}", params[0]))?;
let timeline_id = TimelineId::from_str(params[1])
.with_context(|| format!("Failed to parse timeline id from {}", params[1]))?;
tracing::Span::current()
.record("tenant_id", field::display(tenant_id))
.record("timeline_id", field::display(timeline_id));
self.check_permission(Some(tenant_id))?;
COMPUTE_COMMANDS_COUNTERS
.for_command(ComputeCommandKind::PageStream)
.inc();
self.handle_pagerequests(
pgb,
tenant_id,
timeline_id,
PagestreamProtocolVersion::V1,
ctx,
)
.await?;
} else if let Some(params) = parts.strip_prefix(&["basebackup"]) {
if params.len() < 2 {
return Err(QueryError::Other(anyhow::anyhow!(

650
pageserver/src/pg_import.rs Normal file
View File

@@ -0,0 +1,650 @@
use std::fs::metadata;
use anyhow::{bail, ensure, Context};
use bytes::Bytes;
use camino::{Utf8Path, Utf8PathBuf};
use itertools::Itertools;
use pageserver_api::{key::{rel_block_to_key, rel_dir_to_key, rel_size_to_key, relmap_file_key, DBDIR_KEY}, reltag::RelTag};
use postgres_ffi::{pg_constants, relfile_utils::parse_relfilename, ControlFileData, BLCKSZ};
use tokio::{io::AsyncRead, task::{self, JoinHandle}};
use tracing::debug;
use utils::{id::{NodeId, TenantId, TimelineId}, shard::{ShardCount, ShardNumber, TenantShardId}};
use walkdir::WalkDir;
use crate::{context::{DownloadBehavior, RequestContext}, pgdatadir_mapping::{DbDirectory, RelDirectory}, task_mgr::TaskKind, tenant::storage_layer::ImageLayerWriter};
use crate::pgdatadir_mapping::{SlruSegmentDirectory, TwoPhaseDirectory};
use crate::config::PageServerConf;
use tokio::io::AsyncReadExt;
use crate::tenant::storage_layer::PersistentLayerDesc;
use utils::generation::Generation;
use utils::lsn::Lsn;
use crate::tenant::IndexPart;
use crate::tenant::metadata::TimelineMetadata;
use crate::tenant::remote_timeline_client;
use crate::tenant::remote_timeline_client::LayerFileMetadata;
use pageserver_api::shard::ShardIndex;
use pageserver_api::key::Key;
use pageserver_api::keyspace::{is_contiguous_range, contiguous_range_len};
use pageserver_api::keyspace::singleton_range;
use pageserver_api::reltag::SlruKind;
use pageserver_api::key::{slru_block_to_key, slru_dir_to_key, slru_segment_size_to_key, TWOPHASEDIR_KEY, CONTROLFILE_KEY, CHECKPOINT_KEY};
use utils::bin_ser::BeSer;
use std::collections::HashSet;
use std::ops::Range;
pub struct PgImportEnv {
conf: &'static PageServerConf,
tli: TimelineId,
tsi: TenantShardId,
pgdata_lsn: Lsn,
tasks: Vec<AnyImportTask>,
layers: Vec<PersistentLayerDesc>,
}
impl PgImportEnv {
pub async fn init(dstdir: &Utf8Path, tenant_id: TenantId, timeline_id: TimelineId) -> anyhow::Result<PgImportEnv> {
let config = toml_edit::Document::new();
let conf = PageServerConf::parse_and_validate(
NodeId(42),
&config,
dstdir
)?;
let conf = Box::leak(Box::new(conf));
let tsi = TenantShardId {
tenant_id,
shard_number: ShardNumber(0),
shard_count: ShardCount(0),
};
Ok(PgImportEnv {
conf,
tli: timeline_id,
tsi,
pgdata_lsn: Lsn(0), // Will be filled in later, when the control file is imported
tasks: Vec::new(),
layers: Vec::new(),
})
}
pub async fn import_datadir(&mut self, pgdata_path: &Utf8PathBuf) -> anyhow::Result<()> {
// Read control file
let controlfile_path = pgdata_path.join("global").join("pg_control");
let controlfile_buf = std::fs::read(&controlfile_path)
.with_context(|| format!("reading controlfile: {controlfile_path}"))?;
let control_file = ControlFileData::decode(&controlfile_buf)?;
let pgdata_lsn = Lsn(control_file.checkPoint).align();
let timeline_path = self.conf.timeline_path(&self.tsi, &self.tli);
println!("Importing {pgdata_path} to {timeline_path} as lsn {pgdata_lsn}...");
self.pgdata_lsn = pgdata_lsn;
let datadir = PgDataDir::new(pgdata_path);
// Import dbdir (00:00:00 keyspace)
// This is just constructed here, but will be written to the image layer in the first call to import_db()
let dbdir_buf = Bytes::from(DbDirectory::ser(&DbDirectory {
dbdirs: datadir.dbs.iter().map(|db| ((db.spcnode, db.dboid), true)).collect(),
})?);
self.tasks.push(ImportSingleKeyTask::new(DBDIR_KEY, dbdir_buf).into());
// Import databases (00:spcnode:dbnode keyspace for each db)
for db in datadir.dbs {
self.import_db(&db).await?;
}
// Import SLRUs
// pg_xact (01:00 keyspace)
self.import_slru(SlruKind::Clog, &pgdata_path.join("pg_xact")).await?;
// pg_multixact/members (01:01 keyspace)
self.import_slru(SlruKind::MultiXactMembers, &pgdata_path.join("pg_multixact/members")).await?;
// pg_multixact/offsets (01:02 keyspace)
self.import_slru(SlruKind::MultiXactOffsets, &pgdata_path.join("pg_multixact/offsets")).await?;
// Import pg_twophase.
// TODO: as empty
let twophasedir_buf = TwoPhaseDirectory::ser(
&TwoPhaseDirectory { xids: HashSet::new() }
)?;
self.tasks.push(AnyImportTask::SingleKey(ImportSingleKeyTask::new(TWOPHASEDIR_KEY, Bytes::from(twophasedir_buf))));
// Controlfile, checkpoint
self.tasks.push(AnyImportTask::SingleKey(ImportSingleKeyTask::new(CONTROLFILE_KEY, Bytes::from(controlfile_buf))));
let checkpoint_buf = control_file.checkPointCopy.encode()?;
self.tasks.push(AnyImportTask::SingleKey(ImportSingleKeyTask::new(CHECKPOINT_KEY, checkpoint_buf)));
// Assigns parts of key space to later parallel jobs
let mut last_end_key = Key::MIN;
let mut current_chunk = Vec::new();
let mut current_chunk_size: usize = 0;
let mut parallel_jobs = Vec::new();
for task in std::mem::take(&mut self.tasks).into_iter() {
if current_chunk_size + task.total_size() > 1024*1024*1024 {
let key_range = last_end_key..task.key_range().start;
parallel_jobs.push(ChunkProcessingJob::new(
key_range.clone(),
std::mem::take(&mut current_chunk),
self
));
last_end_key = key_range.end;
current_chunk_size = 0;
}
current_chunk_size += task.total_size();
current_chunk.push(task);
}
parallel_jobs.push(ChunkProcessingJob::new(
last_end_key..Key::NON_L0_MAX,
current_chunk,
self
));
// Start all jobs simultaneosly
// TODO: semaphore?
let mut handles = vec![];
for job in parallel_jobs {
let handle: JoinHandle<anyhow::Result<PersistentLayerDesc>> = task::spawn(async move {
let layerdesc = job.run().await?;
Ok(layerdesc)
});
handles.push(handle);
}
// Wait for all jobs to complete
for handle in handles {
let layerdesc = handle.await??;
self.layers.push(layerdesc);
}
// Create index_part.json file
self.create_index_part(&control_file).await?;
Ok(())
}
async fn import_db(
&mut self,
db: &PgDataDirDb,
) -> anyhow::Result<()> {
debug!(
"Importing database (path={}, tablespace={}, dboid={})",
db.path, db.spcnode, db.dboid
);
// Import relmap (00:spcnode:dbnode:00:*:00)
let relmap_key = relmap_file_key(db.spcnode, db.dboid);
debug!("Constructing relmap entry, key {relmap_key}");
let mut relmap_file = tokio::fs::File::open(&db.path.join("pg_filenode.map")).await?;
let relmap_buf = read_all_bytes(&mut relmap_file).await?;
self.tasks.push(AnyImportTask::SingleKey(ImportSingleKeyTask::new(relmap_key, relmap_buf)));
// Import reldir (00:spcnode:dbnode:00:*:01)
let reldir_key = rel_dir_to_key(db.spcnode, db.dboid);
debug!("Constructing reldirs entry, key {reldir_key}");
let reldir_buf = RelDirectory::ser(&RelDirectory {
rels: db.files.iter().map(|f| (f.rel_tag.relnode, f.rel_tag.forknum)).collect(),
})?;
self.tasks.push(AnyImportTask::SingleKey(ImportSingleKeyTask::new(reldir_key, Bytes::from(reldir_buf))));
// Import data (00:spcnode:dbnode:reloid:fork:blk) and set sizes for each last
// segment in a given relation (00:spcnode:dbnode:reloid:fork:ff)
for file in &db.files {
let len = metadata(&file.path)?.len() as usize;
ensure!(len % 8192 == 0);
let start_blk: u32 = file.segno * (1024 * 1024 * 1024 / 8192);
let start_key = rel_block_to_key(file.rel_tag, start_blk);
let end_key = rel_block_to_key(file.rel_tag, start_blk + (len / 8192) as u32);
self.tasks.push(AnyImportTask::RelBlocks(ImportRelBlocksTask::new(start_key..end_key, &file.path)));
// Set relsize for the last segment (00:spcnode:dbnode:reloid:fork:ff)
if let Some(nblocks) = file.nblocks {
let size_key = rel_size_to_key(file.rel_tag);
//debug!("Setting relation size (path={path}, rel_tag={rel_tag}, segno={segno}) to {nblocks}, key {size_key}");
let buf = nblocks.to_le_bytes();
self.tasks.push(AnyImportTask::SingleKey(ImportSingleKeyTask::new(size_key, Bytes::from(buf.to_vec()))));
}
}
Ok(())
}
async fn import_slru(
&mut self,
kind: SlruKind,
path: &Utf8PathBuf,
) -> anyhow::Result<()> {
let segments: Vec<(String, u32)> = WalkDir::new(path)
.max_depth(1)
.into_iter()
.filter_map(|entry| {
let entry = entry.ok()?;
let filename = entry.file_name();
let filename = filename.to_string_lossy();
let segno = u32::from_str_radix(&filename, 16).ok()?;
Some((filename.to_string(), segno))
}).collect();
// Write SlruDir
let slrudir_key = slru_dir_to_key(kind);
let segnos: HashSet<u32> = segments.iter().map(|(_path, segno)| { *segno }).collect();
let slrudir = SlruSegmentDirectory {
segments: segnos,
};
let slrudir_buf = SlruSegmentDirectory::ser(&slrudir)?;
self.tasks.push(AnyImportTask::SingleKey(ImportSingleKeyTask::new(slrudir_key, Bytes::from(slrudir_buf))));
for (segpath, segno) in segments {
// SlruSegBlocks for each segment
let p = path.join(Utf8PathBuf::from(segpath));
let file_size = std::fs::metadata(&p)?.len();
ensure!(file_size % 8192 == 0);
let nblocks = u32::try_from(file_size / 8192)?;
let start_key = slru_block_to_key(kind, segno, 0);
let end_key = slru_block_to_key(kind, segno, nblocks);
self.tasks.push(AnyImportTask::SlruBlocks(ImportSlruBlocksTask::new(start_key..end_key, &p)));
// Followed by SlruSegSize
let segsize_key = slru_segment_size_to_key(kind, segno);
let segsize_buf = nblocks.to_le_bytes();
self.tasks.push(AnyImportTask::SingleKey(ImportSingleKeyTask::new(segsize_key, Bytes::copy_from_slice(&segsize_buf))));
}
Ok(())
}
async fn create_index_part(&mut self, control_file: &ControlFileData) -> anyhow::Result<()> {
let dstdir = &self.conf.workdir;
let pg_version = match control_file.catalog_version_no {
// thesea are from catversion.h
202107181 => 14,
202209061 => 15,
202307071 => 16,
catversion => { bail!("unrecognized catalog version {catversion}")},
};
let metadata = TimelineMetadata::new(
// FIXME: The 'disk_consistent_lsn' should be the LSN at the *end* of the
// checkpoint record, and prev_record_lsn should point to its beginning.
// We should read the real end of the record from the WAL, but here we
// just fake it.
Lsn(self.pgdata_lsn.0 + 8),
Some(self.pgdata_lsn),
None, // no ancestor
Lsn(0),
self.pgdata_lsn, // latest_gc_cutoff_lsn
self.pgdata_lsn, // initdb_lsn
pg_version,
);
let generation = Generation::none();
let mut index_part = IndexPart::empty(metadata);
for l in self.layers.iter() {
let name = l.layer_name();
let metadata = LayerFileMetadata::new(l.file_size, generation, ShardIndex::unsharded());
if let Some(_) = index_part.layer_metadata.insert(name.clone(), metadata) {
bail!("duplicate layer filename {name}");
}
}
let data = index_part.to_s3_bytes()?;
let path = remote_timeline_client::remote_index_path(&self.tsi, &self.tli, generation);
let path = dstdir.join(path.get_path());
std::fs::write(&path, data)
.context("could not write {path}")?;
Ok(())
}
}
//
// dbdir iteration tools
//
struct PgDataDir {
pub dbs: Vec<PgDataDirDb> // spcnode, dboid, path
}
struct PgDataDirDb {
pub spcnode: u32,
pub dboid: u32,
pub path: Utf8PathBuf,
pub files: Vec<PgDataDirDbFile>
}
struct PgDataDirDbFile {
pub path: Utf8PathBuf,
pub rel_tag: RelTag,
pub segno: u32,
// Cummulative size of the given fork, set only for the last segment of that fork
pub nblocks: Option<usize>,
}
impl PgDataDir {
fn new(datadir_path: &Utf8PathBuf) -> Self {
// Import ordinary databases, DEFAULTTABLESPACE_OID is smaller than GLOBALTABLESPACE_OID, so import them first
// Traverse database in increasing oid order
let mut databases = WalkDir::new(datadir_path.join("base"))
.max_depth(1)
.into_iter()
.filter_map(|entry| {
entry.ok().and_then(|path| {
path.file_name().to_string_lossy().parse::<u32>().ok()
})
})
.sorted()
.map(|dboid| {
PgDataDirDb::new(
datadir_path.join("base").join(dboid.to_string()),
pg_constants::DEFAULTTABLESPACE_OID,
dboid,
datadir_path
)
})
.collect::<Vec<_>>();
// special case for global catalogs
databases.push(PgDataDirDb::new(
datadir_path.join("global"),
postgres_ffi::pg_constants::GLOBALTABLESPACE_OID,
0,
datadir_path,
));
databases.sort_by_key(|db| (db.spcnode, db.dboid));
Self {
dbs: databases
}
}
}
impl PgDataDirDb {
fn new(db_path: Utf8PathBuf, spcnode: u32, dboid: u32, datadir_path: &Utf8PathBuf) -> Self {
let mut files: Vec<PgDataDirDbFile> = WalkDir::new(&db_path)
.min_depth(1)
.max_depth(2)
.into_iter()
.filter_map(|entry| {
entry.ok().and_then(|path| {
let relfile = path.file_name().to_string_lossy();
// returns (relnode, forknum, segno)
parse_relfilename(&relfile).ok()
})
})
.sorted()
.map(|(relnode, forknum, segno)| {
let rel_tag = RelTag {
spcnode,
dbnode: dboid,
relnode,
forknum,
};
let path = datadir_path.join(rel_tag.to_segfile_name(segno));
let len = metadata(&path).unwrap().len() as usize;
assert!(len % BLCKSZ as usize == 0);
let nblocks = len / BLCKSZ as usize;
PgDataDirDbFile {
path,
rel_tag,
segno,
nblocks: Some(nblocks), // first non-cummulative sizes
}
})
.collect();
// Set cummulative sizes. Do all of that math here, so that later we could easier
// parallelize over segments and know with which segments we need to write relsize
// entry.
let mut cumulative_nblocks: usize= 0;
let mut prev_rel_tag: Option<RelTag> = None;
for i in 0..files.len() {
if prev_rel_tag == Some(files[i].rel_tag) {
cumulative_nblocks += files[i].nblocks.unwrap();
} else {
cumulative_nblocks = files[i].nblocks.unwrap();
}
files[i].nblocks = if i == files.len() - 1 || files[i+1].rel_tag != files[i].rel_tag {
Some(cumulative_nblocks)
} else {
None
};
prev_rel_tag = Some(files[i].rel_tag);
}
PgDataDirDb {
files,
path: db_path,
spcnode,
dboid,
}
}
}
async fn read_all_bytes(reader: &mut (impl AsyncRead + Unpin)) -> anyhow::Result<Bytes> {
let mut buf: Vec<u8> = vec![];
reader.read_to_end(&mut buf).await?;
Ok(Bytes::from(buf))
}
trait ImportTask {
fn key_range(&self) -> Range<Key>;
fn total_size(&self) -> usize {
if is_contiguous_range(&self.key_range()) {
contiguous_range_len(&self.key_range()) as usize * 8192
} else {
u32::MAX as usize
}
}
async fn doit(self, layer_writer: &mut ImageLayerWriter, ctx: &RequestContext) -> anyhow::Result<()>;
}
struct ImportSingleKeyTask {
key: Key,
buf: Bytes,
}
impl ImportSingleKeyTask {
fn new(key: Key, buf: Bytes) -> Self {
ImportSingleKeyTask { key, buf }
}
}
impl ImportTask for ImportSingleKeyTask {
fn key_range(&self) -> Range<Key> {
singleton_range(self.key)
}
async fn doit(self, layer_writer: &mut ImageLayerWriter, ctx: &RequestContext) -> anyhow::Result<()> {
layer_writer.put_image(self.key, self.buf, ctx).await?;
Ok(())
}
}
struct ImportRelBlocksTask {
key_range: Range<Key>,
path: Utf8PathBuf,
}
impl ImportRelBlocksTask {
fn new(key_range: Range<Key>, path: &Utf8Path) -> Self {
ImportRelBlocksTask {
key_range,
path: path.into()
}
}
}
impl ImportTask for ImportRelBlocksTask {
fn key_range(&self) -> Range<Key> {
self.key_range.clone()
}
async fn doit(self, layer_writer: &mut ImageLayerWriter, ctx: &RequestContext) -> anyhow::Result<()> {
debug!("Importing relation file {}", self.path);
let mut reader = tokio::fs::File::open(&self.path).await?;
let mut buf: [u8; 8192] = [0u8; 8192];
let (rel_tag, start_blk) = self.key_range.start.to_rel_block()?;
let (_rel_tag, end_blk) = self.key_range.end.to_rel_block()?;
let mut blknum = start_blk;
while blknum < end_blk {
reader.read_exact(&mut buf).await?;
let key = rel_block_to_key(rel_tag.clone(), blknum);
layer_writer.put_image(key, Bytes::copy_from_slice(&buf), ctx).await?;
blknum += 1;
}
Ok(())
}
}
struct ImportSlruBlocksTask {
key_range: Range<Key>,
path: Utf8PathBuf,
}
impl ImportSlruBlocksTask {
fn new(key_range: Range<Key>, path: &Utf8Path) -> Self {
ImportSlruBlocksTask {
key_range,
path: path.into()
}
}
}
impl ImportTask for ImportSlruBlocksTask {
fn key_range(&self) -> Range<Key> {
self.key_range.clone()
}
async fn doit(self, layer_writer: &mut ImageLayerWriter, ctx: &RequestContext) -> anyhow::Result<()> {
debug!("Importing SLRU segment file {}", self.path);
let mut reader = tokio::fs::File::open(&self.path).await
.context(format!("opening {}", &self.path))?;
let mut buf: [u8; 8192] = [0u8; 8192];
let (kind, segno, start_blk) = self.key_range.start.to_slru_block()?;
let (_kind, _segno, end_blk) = self.key_range.end.to_slru_block()?;
let mut blknum = start_blk;
while blknum < end_blk {
reader.read_exact(&mut buf).await?;
let key = slru_block_to_key(kind, segno, blknum);
layer_writer.put_image(key, Bytes::copy_from_slice(&buf), ctx).await?;
blknum += 1;
}
Ok(())
}
}
enum AnyImportTask {
SingleKey(ImportSingleKeyTask),
RelBlocks(ImportRelBlocksTask),
SlruBlocks(ImportSlruBlocksTask),
}
impl ImportTask for AnyImportTask {
fn key_range(&self) -> Range<Key> {
match self {
Self::SingleKey(t) => t.key_range(),
Self::RelBlocks(t) => t.key_range(),
Self::SlruBlocks(t) => t.key_range()
}
}
async fn doit(self, layer_writer: &mut ImageLayerWriter, ctx: &RequestContext) -> anyhow::Result<()> {
match self {
Self::SingleKey(t) => t.doit(layer_writer, ctx).await,
Self::RelBlocks(t) => t.doit(layer_writer, ctx).await,
Self::SlruBlocks(t) => t.doit(layer_writer, ctx).await,
}
}
}
impl From<ImportSingleKeyTask> for AnyImportTask {
fn from(t: ImportSingleKeyTask) -> Self {
Self::SingleKey(t)
}
}
impl From<ImportRelBlocksTask> for AnyImportTask {
fn from(t: ImportRelBlocksTask) -> Self {
Self::RelBlocks(t)
}
}
impl From<ImportSlruBlocksTask> for AnyImportTask {
fn from(t: ImportSlruBlocksTask) -> Self {
Self::SlruBlocks(t)
}
}
struct ChunkProcessingJob {
range: Range<Key>,
tasks: Vec<AnyImportTask>,
dstdir: Utf8PathBuf,
tenant_id: TenantId,
timeline_id: TimelineId,
pgdata_lsn: Lsn,
}
impl ChunkProcessingJob {
fn new(range: Range<Key>, tasks: Vec<AnyImportTask>, env: &PgImportEnv) -> Self {
assert!(env.pgdata_lsn.is_valid());
Self {
range,
tasks,
dstdir: env.conf.workdir.clone(),
tenant_id: env.tsi.tenant_id,
timeline_id: env.tli,
pgdata_lsn: env.pgdata_lsn,
}
}
async fn run(self) -> anyhow::Result<PersistentLayerDesc> {
let ctx: RequestContext = RequestContext::new(TaskKind::DebugTool, DownloadBehavior::Error);
let config = toml_edit::Document::new();
let conf: &'static PageServerConf = Box::leak(Box::new(PageServerConf::parse_and_validate(
NodeId(42),
&config,
&self.dstdir
)?));
let tsi = TenantShardId {
tenant_id: self.tenant_id,
shard_number: ShardNumber(0),
shard_count: ShardCount(0),
};
let mut layer = ImageLayerWriter::new(
&conf,
self.timeline_id,
tsi,
&self.range,
self.pgdata_lsn,
&ctx,
).await?;
for task in self.tasks {
task.doit(&mut layer, &ctx).await?;
}
let layerdesc = layer.finish_raw(&ctx).await?;
Ok(layerdesc)
}
}

View File

@@ -12,15 +12,14 @@ use crate::keyspace::{KeySpace, KeySpaceAccum};
use crate::span::debug_assert_current_span_has_tenant_and_timeline_id_no_shard_id;
use crate::walrecord::NeonWalRecord;
use crate::{aux_file, repository::*};
use anyhow::{ensure, Context};
use anyhow::{bail, ensure, Context};
use bytes::{Buf, Bytes, BytesMut};
use enum_map::Enum;
use itertools::Itertools;
use pageserver_api::key::{
dbdir_key_range, rel_block_to_key, rel_dir_to_key, rel_key_range, rel_size_to_key,
relmap_file_key, repl_origin_key, repl_origin_key_range, slru_block_to_key, slru_dir_to_key,
slru_segment_key_range, slru_segment_size_to_key, twophase_file_key, twophase_key_range,
AUX_FILES_KEY, CHECKPOINT_KEY, CONTROLFILE_KEY, DBDIR_KEY, TWOPHASEDIR_KEY,
CompactKey, AUX_FILES_KEY, CHECKPOINT_KEY, CONTROLFILE_KEY, DBDIR_KEY, TWOPHASEDIR_KEY,
};
use pageserver_api::keyspace::SparseKeySpace;
use pageserver_api::models::AuxFilePolicy;
@@ -37,7 +36,6 @@ use tokio_util::sync::CancellationToken;
use tracing::{debug, info, trace, warn};
use utils::bin_ser::DeserializeError;
use utils::pausable_failpoint;
use utils::vec_map::{VecMap, VecMapOrdering};
use utils::{bin_ser::BeSer, lsn::Lsn};
/// Max delta records appended to the AUX_FILES_KEY (for aux v1). The write path will write a full image once this threshold is reached.
@@ -174,6 +172,7 @@ impl Timeline {
pending_deletions: Vec::new(),
pending_nblocks: 0,
pending_directory_entries: Vec::new(),
pending_bytes: 0,
lsn,
}
}
@@ -727,7 +726,17 @@ impl Timeline {
) -> Result<HashMap<String, Bytes>, PageReconstructError> {
let current_policy = self.last_aux_file_policy.load();
match current_policy {
Some(AuxFilePolicy::V1) | None => self.list_aux_files_v1(lsn, ctx).await,
Some(AuxFilePolicy::V1) => {
warn!("this timeline is using deprecated aux file policy V1 (policy=V1)");
self.list_aux_files_v1(lsn, ctx).await
}
None => {
let res = self.list_aux_files_v1(lsn, ctx).await?;
if !res.is_empty() {
warn!("this timeline is using deprecated aux file policy V1 (policy=None)");
}
Ok(res)
}
Some(AuxFilePolicy::V2) => self.list_aux_files_v2(lsn, ctx).await,
Some(AuxFilePolicy::CrossValidation) => {
let v1_result = self.list_aux_files_v1(lsn, ctx).await;
@@ -1022,21 +1031,33 @@ pub struct DatadirModification<'a> {
// The put-functions add the modifications here, and they are flushed to the
// underlying key-value store by the 'finish' function.
pending_lsns: Vec<Lsn>,
pending_updates: HashMap<Key, Vec<(Lsn, Value)>>,
pending_updates: HashMap<Key, Vec<(Lsn, usize, Value)>>,
pending_deletions: Vec<(Range<Key>, Lsn)>,
pending_nblocks: i64,
/// For special "directory" keys that store key-value maps, track the size of the map
/// if it was updated in this modification.
pending_directory_entries: Vec<(DirectoryKind, usize)>,
/// An **approximation** of how large our EphemeralFile write will be when committed.
pending_bytes: usize,
}
impl<'a> DatadirModification<'a> {
// When a DatadirModification is committed, we do a monolithic serialization of all its contents. WAL records can
// contain multiple pages, so the pageserver's record-based batch size isn't sufficient to bound this allocation: we
// additionally specify a limit on how much payload a DatadirModification may contain before it should be committed.
pub(crate) const MAX_PENDING_BYTES: usize = 8 * 1024 * 1024;
/// Get the current lsn
pub(crate) fn get_lsn(&self) -> Lsn {
self.lsn
}
pub(crate) fn approx_pending_bytes(&self) -> usize {
self.pending_bytes
}
/// Set the current lsn
pub(crate) fn set_lsn(&mut self, lsn: Lsn) -> anyhow::Result<()> {
ensure!(
@@ -1576,6 +1597,7 @@ impl<'a> DatadirModification<'a> {
if aux_files_key_v1.is_empty() {
None
} else {
warn!("this timeline is using deprecated aux file policy V1");
self.tline.do_switch_aux_policy(AuxFilePolicy::V1)?;
Some(AuxFilePolicy::V1)
}
@@ -1769,21 +1791,30 @@ impl<'a> DatadirModification<'a> {
// Flush relation and SLRU data blocks, keep metadata.
let mut retained_pending_updates = HashMap::<_, Vec<_>>::new();
for (key, values) in self.pending_updates.drain() {
for (lsn, value) in values {
if !key.is_valid_key_on_write_path() {
bail!(
"the request contains data not supported by pageserver at TimelineWriter::put: {}", key
);
}
let mut write_batch = Vec::new();
for (lsn, value_ser_size, value) in values {
if key.is_rel_block_key() || key.is_slru_block_key() {
// This bails out on first error without modifying pending_updates.
// That's Ok, cf this function's doc comment.
writer.put(key, lsn, &value, ctx).await?;
write_batch.push((key.to_compact(), lsn, value_ser_size, value));
} else {
retained_pending_updates
.entry(key)
.or_default()
.push((lsn, value));
retained_pending_updates.entry(key).or_default().push((
lsn,
value_ser_size,
value,
));
}
}
writer.put_batch(write_batch, ctx).await?;
}
self.pending_updates = retained_pending_updates;
self.pending_bytes = 0;
if pending_nblocks != 0 {
writer.update_current_logical_size(pending_nblocks * i64::from(BLCKSZ));
@@ -1809,17 +1840,23 @@ impl<'a> DatadirModification<'a> {
self.pending_nblocks = 0;
if !self.pending_updates.is_empty() {
// The put_batch call below expects expects the inputs to be sorted by Lsn,
// so we do that first.
let lsn_ordered_batch: VecMap<Lsn, (Key, Value)> = VecMap::from_iter(
self.pending_updates
.drain()
.map(|(key, vals)| vals.into_iter().map(move |(lsn, val)| (lsn, (key, val))))
.kmerge_by(|lhs, rhs| lhs.0 < rhs.0),
VecMapOrdering::GreaterOrEqual,
);
// Ordering: the items in this batch do not need to be in any global order, but values for
// a particular Key must be in Lsn order relative to one another. InMemoryLayer relies on
// this to do efficient updates to its index.
let batch: Vec<(CompactKey, Lsn, usize, Value)> = self
.pending_updates
.drain()
.flat_map(|(key, values)| {
values.into_iter().map(move |(lsn, val_ser_size, value)| {
if !key.is_valid_key_on_write_path() {
bail!("the request contains data not supported by pageserver at TimelineWriter::put: {}", key);
}
Ok((key.to_compact(), lsn, val_ser_size, value))
})
})
.collect::<anyhow::Result<Vec<_>>>()?;
writer.put_batch(lsn_ordered_batch, ctx).await?;
writer.put_batch(batch, ctx).await?;
}
if !self.pending_deletions.is_empty() {
@@ -1844,6 +1881,8 @@ impl<'a> DatadirModification<'a> {
writer.update_directory_entries_count(kind, count as u64);
}
self.pending_bytes = 0;
Ok(())
}
@@ -1860,7 +1899,7 @@ impl<'a> DatadirModification<'a> {
// Note: we don't check pending_deletions. It is an error to request a
// value that has been removed, deletion only avoids leaking storage.
if let Some(values) = self.pending_updates.get(&key) {
if let Some((_, value)) = values.last() {
if let Some((_, _, value)) = values.last() {
return if let Value::Image(img) = value {
Ok(img.clone())
} else {
@@ -1888,13 +1927,17 @@ impl<'a> DatadirModification<'a> {
fn put(&mut self, key: Key, val: Value) {
let values = self.pending_updates.entry(key).or_default();
// Replace the previous value if it exists at the same lsn
if let Some((last_lsn, last_value)) = values.last_mut() {
if let Some((last_lsn, last_value_ser_size, last_value)) = values.last_mut() {
if *last_lsn == self.lsn {
*last_value_ser_size = val.serialized_size().unwrap() as usize;
*last_value = val;
return;
}
}
values.push((self.lsn, val));
let val_serialized_size = val.serialized_size().unwrap() as usize;
self.pending_bytes += val_serialized_size;
values.push((self.lsn, val_serialized_size, val));
}
fn delete(&mut self, key_range: Range<Key>) {
@@ -1939,23 +1982,23 @@ impl<'a> Version<'a> {
//--- Metadata structs stored in key-value pairs in the repository.
#[derive(Debug, Serialize, Deserialize)]
struct DbDirectory {
pub struct DbDirectory {
// (spcnode, dbnode) -> (do relmapper and PG_VERSION files exist)
dbdirs: HashMap<(Oid, Oid), bool>,
pub dbdirs: HashMap<(Oid, Oid), bool>,
}
#[derive(Debug, Serialize, Deserialize)]
struct TwoPhaseDirectory {
xids: HashSet<TransactionId>,
pub(crate) struct TwoPhaseDirectory {
pub(crate) xids: HashSet<TransactionId>,
}
#[derive(Debug, Serialize, Deserialize, Default)]
struct RelDirectory {
pub struct RelDirectory {
// Set of relations that exist. (relfilenode, forknum)
//
// TODO: Store it as a btree or radix tree or something else that spans multiple
// key-value pairs, if you have a lot of relations
rels: HashSet<(Oid, u8)>,
pub rels: HashSet<(Oid, u8)>,
}
#[derive(Debug, Serialize, Deserialize, Default, PartialEq)]
@@ -1979,9 +2022,9 @@ struct RelSizeEntry {
}
#[derive(Debug, Serialize, Deserialize, Default)]
struct SlruSegmentDirectory {
pub(crate) struct SlruSegmentDirectory {
// Set of SLRU segments that exist.
segments: HashSet<u32>,
pub(crate) segments: HashSet<u32>,
}
#[derive(Copy, Clone, PartialEq, Eq, Debug, enum_map::Enum)]
@@ -2024,7 +2067,7 @@ mod tests {
let (tenant, ctx) = harness.load().await;
let tline = tenant
.create_empty_timeline(TIMELINE_ID, Lsn(0), DEFAULT_PG_VERSION, &ctx)
.create_empty_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
.await?;
let tline = tline.raw_timeline().unwrap();

View File

@@ -146,6 +146,12 @@ impl FromStr for TokioRuntimeMode {
}
}
static TOKIO_THREAD_STACK_SIZE: Lazy<NonZeroUsize> = Lazy::new(|| {
env::var("NEON_PAGESERVER_TOKIO_THREAD_STACK_SIZE")
// the default 2MiB are insufficent, especially in debug mode
.unwrap_or_else(|| NonZeroUsize::new(4 * 1024 * 1024).unwrap())
});
static ONE_RUNTIME: Lazy<Option<tokio::runtime::Runtime>> = Lazy::new(|| {
let thread_name = "pageserver-tokio";
let Some(mode) = env::var("NEON_PAGESERVER_USE_ONE_RUNTIME") else {
@@ -164,6 +170,7 @@ static ONE_RUNTIME: Lazy<Option<tokio::runtime::Runtime>> = Lazy::new(|| {
tokio::runtime::Builder::new_current_thread()
.thread_name(thread_name)
.enable_all()
.thread_stack_size(TOKIO_THREAD_STACK_SIZE.get())
.build()
.expect("failed to create one single runtime")
}
@@ -173,6 +180,7 @@ static ONE_RUNTIME: Lazy<Option<tokio::runtime::Runtime>> = Lazy::new(|| {
.thread_name(thread_name)
.enable_all()
.worker_threads(num_workers.get())
.thread_stack_size(TOKIO_THREAD_STACK_SIZE.get())
.build()
.expect("failed to create one multi-threaded runtime")
}
@@ -199,6 +207,7 @@ macro_rules! pageserver_runtime {
.thread_name($name)
.worker_threads(TOKIO_WORKER_THREADS.get())
.enable_all()
.thread_stack_size(TOKIO_THREAD_STACK_SIZE.get())
.build()
.expect(std::concat!("Failed to create runtime ", $name))
});

View File

@@ -501,6 +501,42 @@ impl Debug for DeleteTimelineError {
}
}
#[derive(thiserror::Error)]
pub enum TimelineArchivalError {
#[error("NotFound")]
NotFound,
#[error("Timeout")]
Timeout,
#[error("ancestor is archived: {}", .0)]
HasArchivedParent(TimelineId),
#[error("HasUnarchivedChildren")]
HasUnarchivedChildren(Vec<TimelineId>),
#[error("Timeline archival is already in progress")]
AlreadyInProgress,
#[error(transparent)]
Other(#[from] anyhow::Error),
}
impl Debug for TimelineArchivalError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::NotFound => write!(f, "NotFound"),
Self::Timeout => write!(f, "Timeout"),
Self::HasArchivedParent(p) => f.debug_tuple("HasArchivedParent").field(p).finish(),
Self::HasUnarchivedChildren(c) => {
f.debug_tuple("HasUnarchivedChildren").field(c).finish()
}
Self::AlreadyInProgress => f.debug_tuple("AlreadyInProgress").finish(),
Self::Other(e) => f.debug_tuple("Other").field(e).finish(),
}
}
}
pub enum SetStoppingError {
AlreadyStopping(completion::Barrier),
Broken,
@@ -845,6 +881,12 @@ impl Tenant {
});
};
// TODO: should also be rejecting tenant conf changes that violate this check.
if let Err(e) = crate::tenant::storage_layer::inmemory_layer::IndexEntry::validate_checkpoint_distance(tenant_clone.get_checkpoint_distance()) {
make_broken(&tenant_clone, anyhow::anyhow!(e), BrokenVerbosity::Error);
return Ok(());
}
let mut init_order = init_order;
// take the completion because initial tenant loading will complete when all of
// these tasks complete.
@@ -1326,24 +1368,59 @@ impl Tenant {
&self,
timeline_id: TimelineId,
state: TimelineArchivalState,
) -> anyhow::Result<()> {
let timeline = self
.get_timeline(timeline_id, false)
.context("Cannot apply timeline archival config to inexistent timeline")?;
) -> Result<(), TimelineArchivalError> {
info!("setting timeline archival config");
let timeline = {
let timelines = self.timelines.lock().unwrap();
let Some(timeline) = timelines.get(&timeline_id) else {
return Err(TimelineArchivalError::NotFound);
};
if state == TimelineArchivalState::Unarchived {
if let Some(ancestor_timeline) = timeline.ancestor_timeline() {
if ancestor_timeline.is_archived() == Some(true) {
return Err(TimelineArchivalError::HasArchivedParent(
ancestor_timeline.timeline_id,
));
}
}
}
// Ensure that there are no non-archived child timelines
let children: Vec<TimelineId> = timelines
.iter()
.filter_map(|(id, entry)| {
if entry.get_ancestor_timeline_id() != Some(timeline_id) {
return None;
}
if entry.is_archived() == Some(true) {
return None;
}
Some(*id)
})
.collect();
if !children.is_empty() && state == TimelineArchivalState::Archived {
return Err(TimelineArchivalError::HasUnarchivedChildren(children));
}
Arc::clone(timeline)
};
let upload_needed = timeline
.remote_client
.schedule_index_upload_for_timeline_archival_state(state)?;
if upload_needed {
info!("Uploading new state");
const MAX_WAIT: Duration = Duration::from_secs(10);
let Ok(v) =
tokio::time::timeout(MAX_WAIT, timeline.remote_client.wait_completion()).await
else {
tracing::warn!("reached timeout for waiting on upload queue");
bail!("reached timeout for upload queue flush");
return Err(TimelineArchivalError::Timeout);
};
v?;
v.map_err(|e| TimelineArchivalError::Other(anyhow::anyhow!(e)))?;
}
Ok(())
}
@@ -3741,13 +3818,21 @@ impl Tenant {
/// less than this (via eviction and on-demand downloads), but this function enables
/// the Tenant to advertise how much storage it would prefer to have to provide fast I/O
/// by keeping important things on local disk.
///
/// This is a heuristic, not a guarantee: tenants that are long-idle will actually use less
/// than they report here, due to layer eviction. Tenants with many active branches may
/// actually use more than they report here.
pub(crate) fn local_storage_wanted(&self) -> u64 {
let mut wanted = 0;
let timelines = self.timelines.lock().unwrap();
for timeline in timelines.values() {
wanted += timeline.metrics.visible_physical_size_gauge.get();
}
wanted
// Heuristic: we use the max() of the timelines' visible sizes, rather than the sum. This
// reflects the observation that on tenants with multiple large branches, typically only one
// of them is used actively enough to occupy space on disk.
timelines
.values()
.map(|t| t.metrics.visible_physical_size_gauge.get())
.max()
.unwrap_or(0)
}
}
@@ -5932,10 +6017,10 @@ mod tests {
.await
.unwrap();
// the default aux file policy to switch is v1 if not set by the admins
// the default aux file policy to switch is v2 if not set by the admins
assert_eq!(
harness.tenant_conf.switch_aux_file_policy,
AuxFilePolicy::V1
AuxFilePolicy::default_tenant_config()
);
let (tenant, ctx) = harness.load().await;
@@ -5979,8 +6064,8 @@ mod tests {
);
assert_eq!(
tline.last_aux_file_policy.load(),
Some(AuxFilePolicy::V1),
"aux file is written with switch_aux_file_policy unset (which is v1), so we should keep v1"
Some(AuxFilePolicy::V2),
"aux file is written with switch_aux_file_policy unset (which is v2), so we should use v2 there"
);
// we can read everything from the storage
@@ -6002,8 +6087,8 @@ mod tests {
assert_eq!(
tline.last_aux_file_policy.load(),
Some(AuxFilePolicy::V1),
"keep v1 storage format when new files are written"
Some(AuxFilePolicy::V2),
"keep v2 storage format when new files are written"
);
let files = tline.list_aux_files(lsn, &ctx).await.unwrap();
@@ -6019,7 +6104,7 @@ mod tests {
// child copies the last flag even if that is not on remote storage yet
assert_eq!(child.get_switch_aux_file_policy(), AuxFilePolicy::V2);
assert_eq!(child.last_aux_file_policy.load(), Some(AuxFilePolicy::V1));
assert_eq!(child.last_aux_file_policy.load(), Some(AuxFilePolicy::V2));
let files = child.list_aux_files(lsn, &ctx).await.unwrap();
assert_eq!(files.get("pg_logical/mappings/test1"), None);
@@ -7005,18 +7090,14 @@ mod tests {
vec![
// Image layer at GC horizon
PersistentLayerKey {
key_range: {
let mut key = Key::MAX;
key.field6 -= 1;
Key::MIN..key
},
key_range: Key::MIN..Key::NON_L0_MAX,
lsn_range: Lsn(0x30)..Lsn(0x31),
is_delta: false
},
// The delta layer that is cut in the middle
// The delta layer covers the full range (with the layer key hack to avoid being recognized as L0)
PersistentLayerKey {
key_range: get_key(3)..get_key(4),
lsn_range: Lsn(0x30)..Lsn(0x41),
key_range: Key::MIN..Key::NON_L0_MAX,
lsn_range: Lsn(0x30)..Lsn(0x48),
is_delta: true
},
// The delta3 layer that should not be picked for the compaction
@@ -7996,6 +8077,214 @@ mod tests {
Ok(())
}
#[tokio::test]
async fn test_simple_bottom_most_compaction_with_retain_lsns_single_key() -> anyhow::Result<()>
{
let harness =
TenantHarness::create("test_simple_bottom_most_compaction_with_retain_lsns_single_key")
.await?;
let (tenant, ctx) = harness.load().await;
fn get_key(id: u32) -> Key {
// using aux key here b/c they are guaranteed to be inside `collect_keyspace`.
let mut key = Key::from_hex("620000000033333333444444445500000000").unwrap();
key.field6 = id;
key
}
let img_layer = (0..10)
.map(|id| (get_key(id), Bytes::from(format!("value {id}@0x10"))))
.collect_vec();
let delta1 = vec![
(
get_key(1),
Lsn(0x20),
Value::WalRecord(NeonWalRecord::wal_append("@0x20")),
),
(
get_key(1),
Lsn(0x28),
Value::WalRecord(NeonWalRecord::wal_append("@0x28")),
),
];
let delta2 = vec![
(
get_key(1),
Lsn(0x30),
Value::WalRecord(NeonWalRecord::wal_append("@0x30")),
),
(
get_key(1),
Lsn(0x38),
Value::WalRecord(NeonWalRecord::wal_append("@0x38")),
),
];
let delta3 = vec![
(
get_key(8),
Lsn(0x48),
Value::WalRecord(NeonWalRecord::wal_append("@0x48")),
),
(
get_key(9),
Lsn(0x48),
Value::WalRecord(NeonWalRecord::wal_append("@0x48")),
),
];
let tline = tenant
.create_test_timeline_with_layers(
TIMELINE_ID,
Lsn(0x10),
DEFAULT_PG_VERSION,
&ctx,
vec![
// delta1 and delta 2 only contain a single key but multiple updates
DeltaLayerTestDesc::new_with_inferred_key_range(Lsn(0x10)..Lsn(0x30), delta1),
DeltaLayerTestDesc::new_with_inferred_key_range(Lsn(0x30)..Lsn(0x50), delta2),
DeltaLayerTestDesc::new_with_inferred_key_range(Lsn(0x10)..Lsn(0x50), delta3),
], // delta layers
vec![(Lsn(0x10), img_layer)], // image layers
Lsn(0x50),
)
.await?;
{
// Update GC info
let mut guard = tline.gc_info.write().unwrap();
*guard = GcInfo {
retain_lsns: vec![
(Lsn(0x10), tline.timeline_id),
(Lsn(0x20), tline.timeline_id),
],
cutoffs: GcCutoffs {
time: Lsn(0x30),
space: Lsn(0x30),
},
leases: Default::default(),
within_ancestor_pitr: false,
};
}
let expected_result = [
Bytes::from_static(b"value 0@0x10"),
Bytes::from_static(b"value 1@0x10@0x20@0x28@0x30@0x38"),
Bytes::from_static(b"value 2@0x10"),
Bytes::from_static(b"value 3@0x10"),
Bytes::from_static(b"value 4@0x10"),
Bytes::from_static(b"value 5@0x10"),
Bytes::from_static(b"value 6@0x10"),
Bytes::from_static(b"value 7@0x10"),
Bytes::from_static(b"value 8@0x10@0x48"),
Bytes::from_static(b"value 9@0x10@0x48"),
];
let expected_result_at_gc_horizon = [
Bytes::from_static(b"value 0@0x10"),
Bytes::from_static(b"value 1@0x10@0x20@0x28@0x30"),
Bytes::from_static(b"value 2@0x10"),
Bytes::from_static(b"value 3@0x10"),
Bytes::from_static(b"value 4@0x10"),
Bytes::from_static(b"value 5@0x10"),
Bytes::from_static(b"value 6@0x10"),
Bytes::from_static(b"value 7@0x10"),
Bytes::from_static(b"value 8@0x10"),
Bytes::from_static(b"value 9@0x10"),
];
let expected_result_at_lsn_20 = [
Bytes::from_static(b"value 0@0x10"),
Bytes::from_static(b"value 1@0x10@0x20"),
Bytes::from_static(b"value 2@0x10"),
Bytes::from_static(b"value 3@0x10"),
Bytes::from_static(b"value 4@0x10"),
Bytes::from_static(b"value 5@0x10"),
Bytes::from_static(b"value 6@0x10"),
Bytes::from_static(b"value 7@0x10"),
Bytes::from_static(b"value 8@0x10"),
Bytes::from_static(b"value 9@0x10"),
];
let expected_result_at_lsn_10 = [
Bytes::from_static(b"value 0@0x10"),
Bytes::from_static(b"value 1@0x10"),
Bytes::from_static(b"value 2@0x10"),
Bytes::from_static(b"value 3@0x10"),
Bytes::from_static(b"value 4@0x10"),
Bytes::from_static(b"value 5@0x10"),
Bytes::from_static(b"value 6@0x10"),
Bytes::from_static(b"value 7@0x10"),
Bytes::from_static(b"value 8@0x10"),
Bytes::from_static(b"value 9@0x10"),
];
let verify_result = || async {
let gc_horizon = {
let gc_info = tline.gc_info.read().unwrap();
gc_info.cutoffs.time
};
for idx in 0..10 {
assert_eq!(
tline
.get(get_key(idx as u32), Lsn(0x50), &ctx)
.await
.unwrap(),
&expected_result[idx]
);
assert_eq!(
tline
.get(get_key(idx as u32), gc_horizon, &ctx)
.await
.unwrap(),
&expected_result_at_gc_horizon[idx]
);
assert_eq!(
tline
.get(get_key(idx as u32), Lsn(0x20), &ctx)
.await
.unwrap(),
&expected_result_at_lsn_20[idx]
);
assert_eq!(
tline
.get(get_key(idx as u32), Lsn(0x10), &ctx)
.await
.unwrap(),
&expected_result_at_lsn_10[idx]
);
}
};
verify_result().await;
let cancel = CancellationToken::new();
let mut dryrun_flags = EnumSet::new();
dryrun_flags.insert(CompactFlags::DryRun);
tline
.compact_with_gc(&cancel, dryrun_flags, &ctx)
.await
.unwrap();
// We expect layer map to be the same b/c the dry run flag, but we don't know whether there will be other background jobs
// cleaning things up, and therefore, we don't do sanity checks on the layer map during unit tests.
verify_result().await;
tline
.compact_with_gc(&cancel, EnumSet::new(), &ctx)
.await
.unwrap();
verify_result().await;
// compact again
tline
.compact_with_gc(&cancel, EnumSet::new(), &ctx)
.await
.unwrap();
verify_result().await;
Ok(())
}
#[tokio::test]
async fn test_simple_bottom_most_compaction_on_branch() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_simple_bottom_most_compaction_on_branch").await?;

View File

@@ -148,7 +148,7 @@ pub(super) const LEN_COMPRESSION_BIT_MASK: u8 = 0xf0;
/// The maximum size of blobs we support. The highest few bits
/// are reserved for compression and other further uses.
const MAX_SUPPORTED_LEN: usize = 0x0fff_ffff;
pub(crate) const MAX_SUPPORTED_BLOB_LEN: usize = 0x0fff_ffff;
pub(super) const BYTE_UNCOMPRESSED: u8 = 0x80;
pub(super) const BYTE_ZSTD: u8 = BYTE_UNCOMPRESSED | 0x10;
@@ -326,7 +326,7 @@ impl<const BUFFERED: bool> BlobWriter<BUFFERED> {
(self.write_all(io_buf.slice_len(), ctx).await, srcbuf)
} else {
// Write a 4-byte length header
if len > MAX_SUPPORTED_LEN {
if len > MAX_SUPPORTED_BLOB_LEN {
return (
(
io_buf.slice_len(),

View File

@@ -2,7 +2,6 @@
//! Low-level Block-oriented I/O functions
//!
use super::ephemeral_file::EphemeralFile;
use super::storage_layer::delta_layer::{Adapter, DeltaLayerInner};
use crate::context::RequestContext;
use crate::page_cache::{self, FileId, PageReadGuard, PageWriteGuard, ReadBufResult, PAGE_SZ};
@@ -81,9 +80,7 @@ impl<'a> Deref for BlockLease<'a> {
/// Unlike traits, we also support the read function to be async though.
pub(crate) enum BlockReaderRef<'a> {
FileBlockReader(&'a FileBlockReader<'a>),
EphemeralFile(&'a EphemeralFile),
Adapter(Adapter<&'a DeltaLayerInner>),
Slice(&'a [u8]),
#[cfg(test)]
TestDisk(&'a super::disk_btree::tests::TestDisk),
#[cfg(test)]
@@ -100,9 +97,7 @@ impl<'a> BlockReaderRef<'a> {
use BlockReaderRef::*;
match self {
FileBlockReader(r) => r.read_blk(blknum, ctx).await,
EphemeralFile(r) => r.read_blk(blknum, ctx).await,
Adapter(r) => r.read_blk(blknum, ctx).await,
Slice(s) => Self::read_blk_slice(s, blknum),
#[cfg(test)]
TestDisk(r) => r.read_blk(blknum),
#[cfg(test)]
@@ -111,24 +106,6 @@ impl<'a> BlockReaderRef<'a> {
}
}
impl<'a> BlockReaderRef<'a> {
fn read_blk_slice(slice: &[u8], blknum: u32) -> std::io::Result<BlockLease> {
let start = (blknum as usize).checked_mul(PAGE_SZ).unwrap();
let end = start.checked_add(PAGE_SZ).unwrap();
if end > slice.len() {
return Err(std::io::Error::new(
std::io::ErrorKind::UnexpectedEof,
format!("slice too short, len={} end={}", slice.len(), end),
));
}
let slice = &slice[start..end];
let page_sized: &[u8; PAGE_SZ] = slice
.try_into()
.expect("we add PAGE_SZ to start, so the slice must have PAGE_SZ");
Ok(BlockLease::Slice(page_sized))
}
}
///
/// A "cursor" for efficiently reading multiple pages from a BlockReader
///

View File

@@ -1,13 +1,21 @@
//! Implementation of append-only file data structure
//! used to keep in-memory layers spilled on disk.
use crate::assert_u64_eq_usize::{U64IsUsize, UsizeIsU64};
use crate::config::PageServerConf;
use crate::context::RequestContext;
use crate::page_cache;
use crate::tenant::block_io::{BlockCursor, BlockLease, BlockReader};
use crate::virtual_file::{self, VirtualFile};
use crate::tenant::storage_layer::inmemory_layer::vectored_dio_read::File;
use crate::virtual_file::owned_buffers_io::slice::SliceMutExt;
use crate::virtual_file::owned_buffers_io::util::size_tracking_writer;
use crate::virtual_file::owned_buffers_io::write::Buffer;
use crate::virtual_file::{self, owned_buffers_io, VirtualFile};
use bytes::BytesMut;
use camino::Utf8PathBuf;
use num_traits::Num;
use pageserver_api::shard::TenantShardId;
use tokio_epoll_uring::{BoundedBuf, Slice};
use tracing::error;
use std::io;
use std::sync::atomic::AtomicU64;
@@ -16,12 +24,17 @@ use utils::id::TimelineId;
pub struct EphemeralFile {
_tenant_shard_id: TenantShardId,
_timeline_id: TimelineId,
rw: page_caching::RW,
page_cache_file_id: page_cache::FileId,
bytes_written: u64,
buffered_writer: owned_buffers_io::write::BufferedWriter<
BytesMut,
size_tracking_writer::Writer<VirtualFile>,
>,
/// Gate guard is held on as long as we need to do operations in the path (delete on drop)
_gate_guard: utils::sync::gate::GateGuard,
}
mod page_caching;
mod zero_padded_read_write;
const TAIL_SZ: usize = 64 * 1024;
impl EphemeralFile {
pub async fn create(
@@ -51,60 +64,178 @@ impl EphemeralFile {
)
.await?;
let page_cache_file_id = page_cache::next_file_id(); // XXX get rid, we're not page-caching anymore
Ok(EphemeralFile {
_tenant_shard_id: tenant_shard_id,
_timeline_id: timeline_id,
rw: page_caching::RW::new(file, gate_guard),
page_cache_file_id,
bytes_written: 0,
buffered_writer: owned_buffers_io::write::BufferedWriter::new(
size_tracking_writer::Writer::new(file),
BytesMut::with_capacity(TAIL_SZ),
),
_gate_guard: gate_guard,
})
}
}
impl Drop for EphemeralFile {
fn drop(&mut self) {
// unlink the file
// we are clear to do this, because we have entered a gate
let path = &self.buffered_writer.as_inner().as_inner().path;
let res = std::fs::remove_file(path);
if let Err(e) = res {
if e.kind() != std::io::ErrorKind::NotFound {
// just never log the not found errors, we cannot do anything for them; on detach
// the tenant directory is already gone.
//
// not found files might also be related to https://github.com/neondatabase/neon/issues/2442
error!("could not remove ephemeral file '{path}': {e}");
}
}
}
}
impl EphemeralFile {
pub(crate) fn len(&self) -> u64 {
self.rw.bytes_written()
self.bytes_written
}
pub(crate) fn page_cache_file_id(&self) -> page_cache::FileId {
self.rw.page_cache_file_id()
self.page_cache_file_id
}
/// See [`self::page_caching::RW::load_to_vec`].
pub(crate) async fn load_to_vec(&self, ctx: &RequestContext) -> Result<Vec<u8>, io::Error> {
self.rw.load_to_vec(ctx).await
let size = self.len().into_usize();
let vec = Vec::with_capacity(size);
let (slice, nread) = self.read_exact_at_eof_ok(0, vec.slice_full(), ctx).await?;
assert_eq!(nread, size);
let vec = slice.into_inner();
assert_eq!(vec.len(), nread);
assert_eq!(vec.capacity(), size, "we shouldn't be reallocating");
Ok(vec)
}
pub(crate) async fn read_blk(
&self,
blknum: u32,
ctx: &RequestContext,
) -> Result<BlockLease, io::Error> {
self.rw.read_blk(blknum, ctx).await
}
pub(crate) async fn write_blob(
/// Returns the offset at which the first byte of the input was written, for use
/// in constructing indices over the written value.
///
/// Panics if the write is short because there's no way we can recover from that.
/// TODO: make upstack handle this as an error.
pub(crate) async fn write_raw(
&mut self,
srcbuf: &[u8],
ctx: &RequestContext,
) -> Result<u64, io::Error> {
let pos = self.rw.bytes_written();
) -> std::io::Result<u64> {
let pos = self.bytes_written;
// Write the length field
if srcbuf.len() < 0x80 {
// short one-byte length header
let len_buf = [srcbuf.len() as u8];
self.rw.write_all_borrowed(&len_buf, ctx).await?;
} else {
let mut len_buf = u32::to_be_bytes(srcbuf.len() as u32);
len_buf[0] |= 0x80;
self.rw.write_all_borrowed(&len_buf, ctx).await?;
}
let new_bytes_written = pos.checked_add(srcbuf.len().into_u64()).ok_or_else(|| {
std::io::Error::new(
std::io::ErrorKind::Other,
format!(
"write would grow EphemeralFile beyond u64::MAX: len={pos} writen={srcbuf_len}",
srcbuf_len = srcbuf.len(),
),
)
})?;
// Write the payload
self.rw.write_all_borrowed(srcbuf, ctx).await?;
let nwritten = self
.buffered_writer
.write_buffered_borrowed(srcbuf, ctx)
.await?;
assert_eq!(
nwritten,
srcbuf.len(),
"buffered writer has no short writes"
);
self.bytes_written = new_bytes_written;
Ok(pos)
}
}
impl super::storage_layer::inmemory_layer::vectored_dio_read::File for EphemeralFile {
async fn read_exact_at_eof_ok<'a, 'b, B: tokio_epoll_uring::IoBufMut + Send>(
&'b self,
start: u64,
dst: tokio_epoll_uring::Slice<B>,
ctx: &'a RequestContext,
) -> std::io::Result<(tokio_epoll_uring::Slice<B>, usize)> {
let file_size_tracking_writer = self.buffered_writer.as_inner();
let flushed_offset = file_size_tracking_writer.bytes_written();
let buffer = self.buffered_writer.inspect_buffer();
let buffered = &buffer[0..buffer.pending()];
let dst_cap = dst.bytes_total().into_u64();
let end = {
// saturating_add is correct here because the max file size is u64::MAX, so,
// if start + dst.len() > u64::MAX, then we know it will be a short read
let mut end: u64 = start.saturating_add(dst_cap);
if end > self.bytes_written {
end = self.bytes_written;
}
end
};
// inclusive, exclusive
#[derive(Debug)]
struct Range<N>(N, N);
impl<N: Num + Clone + Copy + PartialOrd + Ord> Range<N> {
fn len(&self) -> N {
if self.0 > self.1 {
N::zero()
} else {
self.1 - self.0
}
}
}
let written_range = Range(start, std::cmp::min(end, flushed_offset));
let buffered_range = Range(std::cmp::max(start, flushed_offset), end);
let dst = if written_range.len() > 0 {
let file: &VirtualFile = file_size_tracking_writer.as_inner();
let bounds = dst.bounds();
let slice = file
.read_exact_at(dst.slice(0..written_range.len().into_usize()), start, ctx)
.await?;
Slice::from_buf_bounds(Slice::into_inner(slice), bounds)
} else {
dst
};
let dst = if buffered_range.len() > 0 {
let offset_in_buffer = buffered_range
.0
.checked_sub(flushed_offset)
.unwrap()
.into_usize();
let to_copy =
&buffered[offset_in_buffer..(offset_in_buffer + buffered_range.len().into_usize())];
let bounds = dst.bounds();
let mut view = dst.slice({
let start = written_range.len().into_usize();
let end = start
.checked_add(buffered_range.len().into_usize())
.unwrap();
start..end
});
view.as_mut_rust_slice_full_zeroed()
.copy_from_slice(to_copy);
Slice::from_buf_bounds(Slice::into_inner(view), bounds)
} else {
dst
};
// TODO: in debug mode, randomize the remaining bytes in `dst` to catch bugs
Ok((dst, (end - start).into_usize()))
}
}
/// Does the given filename look like an ephemeral file?
pub fn is_ephemeral_file(filename: &str) -> bool {
if let Some(rest) = filename.strip_prefix("ephemeral-") {
@@ -114,19 +245,13 @@ pub fn is_ephemeral_file(filename: &str) -> bool {
}
}
impl BlockReader for EphemeralFile {
fn block_cursor(&self) -> super::block_io::BlockCursor<'_> {
BlockCursor::new(super::block_io::BlockReaderRef::EphemeralFile(self))
}
}
#[cfg(test)]
mod tests {
use rand::Rng;
use super::*;
use crate::context::DownloadBehavior;
use crate::task_mgr::TaskKind;
use crate::tenant::block_io::BlockReaderRef;
use rand::{thread_rng, RngCore};
use std::fs;
use std::str::FromStr;
@@ -157,69 +282,6 @@ mod tests {
Ok((conf, tenant_shard_id, timeline_id, ctx))
}
#[tokio::test]
async fn test_ephemeral_blobs() -> Result<(), io::Error> {
let (conf, tenant_id, timeline_id, ctx) = harness("ephemeral_blobs")?;
let gate = utils::sync::gate::Gate::default();
let entered = gate.enter().unwrap();
let mut file = EphemeralFile::create(conf, tenant_id, timeline_id, entered, &ctx).await?;
let pos_foo = file.write_blob(b"foo", &ctx).await?;
assert_eq!(
b"foo",
file.block_cursor()
.read_blob(pos_foo, &ctx)
.await?
.as_slice()
);
let pos_bar = file.write_blob(b"bar", &ctx).await?;
assert_eq!(
b"foo",
file.block_cursor()
.read_blob(pos_foo, &ctx)
.await?
.as_slice()
);
assert_eq!(
b"bar",
file.block_cursor()
.read_blob(pos_bar, &ctx)
.await?
.as_slice()
);
let mut blobs = Vec::new();
for i in 0..10000 {
let data = Vec::from(format!("blob{}", i).as_bytes());
let pos = file.write_blob(&data, &ctx).await?;
blobs.push((pos, data));
}
// also test with a large blobs
for i in 0..100 {
let data = format!("blob{}", i).as_bytes().repeat(100);
let pos = file.write_blob(&data, &ctx).await?;
blobs.push((pos, data));
}
let cursor = BlockCursor::new(BlockReaderRef::EphemeralFile(&file));
for (pos, expected) in blobs {
let actual = cursor.read_blob(pos, &ctx).await?;
assert_eq!(actual, expected);
}
// Test a large blob that spans multiple pages
let mut large_data = vec![0; 20000];
thread_rng().fill_bytes(&mut large_data);
let pos_large = file.write_blob(&large_data, &ctx).await?;
let result = file.block_cursor().read_blob(pos_large, &ctx).await?;
assert_eq!(result, large_data);
Ok(())
}
#[tokio::test]
async fn ephemeral_file_holds_gate_open() {
const FOREVER: std::time::Duration = std::time::Duration::from_secs(5);
@@ -253,4 +315,151 @@ mod tests {
.expect("closing completes right away")
.expect("closing does not panic");
}
#[tokio::test]
async fn test_ephemeral_file_basics() {
let (conf, tenant_id, timeline_id, ctx) = harness("test_ephemeral_file_basics").unwrap();
let gate = utils::sync::gate::Gate::default();
let mut file =
EphemeralFile::create(conf, tenant_id, timeline_id, gate.enter().unwrap(), &ctx)
.await
.unwrap();
let cap = file.buffered_writer.inspect_buffer().capacity();
let write_nbytes = cap + cap / 2;
let content: Vec<u8> = rand::thread_rng()
.sample_iter(rand::distributions::Standard)
.take(write_nbytes)
.collect();
let mut value_offsets = Vec::new();
for i in 0..write_nbytes {
let off = file.write_raw(&content[i..i + 1], &ctx).await.unwrap();
value_offsets.push(off);
}
assert!(file.len() as usize == write_nbytes);
for i in 0..write_nbytes {
assert_eq!(value_offsets[i], i.into_u64());
let buf = Vec::with_capacity(1);
let (buf_slice, nread) = file
.read_exact_at_eof_ok(i.into_u64(), buf.slice_full(), &ctx)
.await
.unwrap();
let buf = buf_slice.into_inner();
assert_eq!(nread, 1);
assert_eq!(&buf, &content[i..i + 1]);
}
let file_contents =
std::fs::read(&file.buffered_writer.as_inner().as_inner().path).unwrap();
assert_eq!(file_contents, &content[0..cap]);
let buffer_contents = file.buffered_writer.inspect_buffer();
assert_eq!(buffer_contents, &content[cap..write_nbytes]);
}
#[tokio::test]
async fn test_flushes_do_happen() {
let (conf, tenant_id, timeline_id, ctx) = harness("test_flushes_do_happen").unwrap();
let gate = utils::sync::gate::Gate::default();
let mut file =
EphemeralFile::create(conf, tenant_id, timeline_id, gate.enter().unwrap(), &ctx)
.await
.unwrap();
let cap = file.buffered_writer.inspect_buffer().capacity();
let content: Vec<u8> = rand::thread_rng()
.sample_iter(rand::distributions::Standard)
.take(cap + cap / 2)
.collect();
file.write_raw(&content, &ctx).await.unwrap();
// assert the state is as this test expects it to be
assert_eq!(
&file.load_to_vec(&ctx).await.unwrap(),
&content[0..cap + cap / 2]
);
let md = file
.buffered_writer
.as_inner()
.as_inner()
.path
.metadata()
.unwrap();
assert_eq!(
md.len(),
cap.into_u64(),
"buffered writer does one write if we write 1.5x buffer capacity"
);
assert_eq!(
&file.buffered_writer.inspect_buffer()[0..cap / 2],
&content[cap..cap + cap / 2]
);
}
#[tokio::test]
async fn test_read_split_across_file_and_buffer() {
// This test exercises the logic on the read path that splits the logical read
// into a read from the flushed part (= the file) and a copy from the buffered writer's buffer.
//
// This test build on the assertions in test_flushes_do_happen
let (conf, tenant_id, timeline_id, ctx) =
harness("test_read_split_across_file_and_buffer").unwrap();
let gate = utils::sync::gate::Gate::default();
let mut file =
EphemeralFile::create(conf, tenant_id, timeline_id, gate.enter().unwrap(), &ctx)
.await
.unwrap();
let cap = file.buffered_writer.inspect_buffer().capacity();
let content: Vec<u8> = rand::thread_rng()
.sample_iter(rand::distributions::Standard)
.take(cap + cap / 2)
.collect();
file.write_raw(&content, &ctx).await.unwrap();
let test_read = |start: usize, len: usize| {
let file = &file;
let ctx = &ctx;
let content = &content;
async move {
let (buf, nread) = file
.read_exact_at_eof_ok(
start.into_u64(),
Vec::with_capacity(len).slice_full(),
ctx,
)
.await
.unwrap();
assert_eq!(nread, len);
assert_eq!(&buf.into_inner(), &content[start..(start + len)]);
}
};
// completely within the file range
assert!(20 < cap, "test assumption");
test_read(10, 10).await;
// border onto edge of file
test_read(cap - 10, 10).await;
// read across file and buffer
test_read(cap - 10, 20).await;
// stay from start of buffer
test_read(cap, 10).await;
// completely within buffer
test_read(cap + 10, 10).await;
}
}

View File

@@ -1,153 +0,0 @@
//! Wrapper around [`super::zero_padded_read_write::RW`] that uses the
//! [`crate::page_cache`] to serve reads that need to go to the underlying [`VirtualFile`].
//!
//! Subject to removal in <https://github.com/neondatabase/neon/pull/8537>
use crate::context::RequestContext;
use crate::page_cache::{self, PAGE_SZ};
use crate::tenant::block_io::BlockLease;
use crate::virtual_file::owned_buffers_io::util::size_tracking_writer;
use crate::virtual_file::VirtualFile;
use std::io::{self};
use tokio_epoll_uring::BoundedBuf;
use tracing::*;
use super::zero_padded_read_write;
/// See module-level comment.
pub struct RW {
page_cache_file_id: page_cache::FileId,
rw: super::zero_padded_read_write::RW<size_tracking_writer::Writer<VirtualFile>>,
/// Gate guard is held on as long as we need to do operations in the path (delete on drop).
_gate_guard: utils::sync::gate::GateGuard,
}
impl RW {
pub fn new(file: VirtualFile, _gate_guard: utils::sync::gate::GateGuard) -> Self {
let page_cache_file_id = page_cache::next_file_id();
Self {
page_cache_file_id,
rw: super::zero_padded_read_write::RW::new(size_tracking_writer::Writer::new(file)),
_gate_guard,
}
}
pub fn page_cache_file_id(&self) -> page_cache::FileId {
self.page_cache_file_id
}
pub(crate) async fn write_all_borrowed(
&mut self,
srcbuf: &[u8],
ctx: &RequestContext,
) -> Result<usize, io::Error> {
// It doesn't make sense to proactively fill the page cache on the Pageserver write path
// because Compute is unlikely to access recently written data.
self.rw.write_all_borrowed(srcbuf, ctx).await
}
pub(crate) fn bytes_written(&self) -> u64 {
self.rw.bytes_written()
}
/// Load all blocks that can be read via [`Self::read_blk`] into a contiguous memory buffer.
///
/// This includes the blocks that aren't yet flushed to disk by the internal buffered writer.
/// The last block is zero-padded to [`PAGE_SZ`], so, the returned buffer is always a multiple of [`PAGE_SZ`].
pub(super) async fn load_to_vec(&self, ctx: &RequestContext) -> Result<Vec<u8>, io::Error> {
// round up to the next PAGE_SZ multiple, required by blob_io
let size = {
let s = usize::try_from(self.bytes_written()).unwrap();
if s % PAGE_SZ == 0 {
s
} else {
s.checked_add(PAGE_SZ - (s % PAGE_SZ)).unwrap()
}
};
let vec = Vec::with_capacity(size);
// read from disk what we've already flushed
let file_size_tracking_writer = self.rw.as_writer();
let flushed_range = 0..usize::try_from(file_size_tracking_writer.bytes_written()).unwrap();
let mut vec = file_size_tracking_writer
.as_inner()
.read_exact_at(
vec.slice(0..(flushed_range.end - flushed_range.start)),
u64::try_from(flushed_range.start).unwrap(),
ctx,
)
.await?
.into_inner();
// copy from in-memory buffer what we haven't flushed yet but would return when accessed via read_blk
let buffered = self.rw.get_tail_zero_padded();
vec.extend_from_slice(buffered);
assert_eq!(vec.len(), size);
assert_eq!(vec.len() % PAGE_SZ, 0);
Ok(vec)
}
pub(crate) async fn read_blk(
&self,
blknum: u32,
ctx: &RequestContext,
) -> Result<BlockLease, io::Error> {
match self.rw.read_blk(blknum).await? {
zero_padded_read_write::ReadResult::NeedsReadFromWriter { writer } => {
let cache = page_cache::get();
match cache
.read_immutable_buf(self.page_cache_file_id, blknum, ctx)
.await
.map_err(|e| {
std::io::Error::new(
std::io::ErrorKind::Other,
// order path before error because error is anyhow::Error => might have many contexts
format!(
"ephemeral file: read immutable page #{}: {}: {:#}",
blknum,
self.rw.as_writer().as_inner().path,
e,
),
)
})? {
page_cache::ReadBufResult::Found(guard) => {
return Ok(BlockLease::PageReadGuard(guard))
}
page_cache::ReadBufResult::NotFound(write_guard) => {
let write_guard = writer
.as_inner()
.read_exact_at_page(write_guard, blknum as u64 * PAGE_SZ as u64, ctx)
.await?;
let read_guard = write_guard.mark_valid();
return Ok(BlockLease::PageReadGuard(read_guard));
}
}
}
zero_padded_read_write::ReadResult::ServedFromZeroPaddedMutableTail { buffer } => {
Ok(BlockLease::EphemeralFileMutableTail(buffer))
}
}
}
}
impl Drop for RW {
fn drop(&mut self) {
// There might still be pages in the [`crate::page_cache`] for this file.
// We leave them there, [`crate::page_cache::PageCache::find_victim`] will evict them when needed.
// unlink the file
// we are clear to do this, because we have entered a gate
let path = &self.rw.as_writer().as_inner().path;
let res = std::fs::remove_file(path);
if let Err(e) = res {
if e.kind() != std::io::ErrorKind::NotFound {
// just never log the not found errors, we cannot do anything for them; on detach
// the tenant directory is already gone.
//
// not found files might also be related to https://github.com/neondatabase/neon/issues/2442
error!("could not remove ephemeral file '{path}': {e}");
}
}
}
}

View File

@@ -1,145 +0,0 @@
//! The heart of how [`super::EphemeralFile`] does its reads and writes.
//!
//! # Writes
//!
//! [`super::EphemeralFile`] writes small, borrowed buffers using [`RW::write_all_borrowed`].
//! The [`RW`] batches these into [`TAIL_SZ`] bigger writes, using [`owned_buffers_io::write::BufferedWriter`].
//!
//! # Reads
//!
//! [`super::EphemeralFile`] always reads full [`PAGE_SZ`]ed blocks using [`RW::read_blk`].
//!
//! The [`RW`] serves these reads either from the buffered writer's in-memory buffer
//! or redirects the caller to read from the underlying [`OwnedAsyncWriter`]
//! if the read is for the prefix that has already been flushed.
//!
//! # Current Usage
//!
//! The current user of this module is [`super::page_caching::RW`].
mod zero_padded;
use crate::{
context::RequestContext,
page_cache::PAGE_SZ,
virtual_file::owned_buffers_io::{
self,
write::{Buffer, OwnedAsyncWriter},
},
};
const TAIL_SZ: usize = 64 * 1024;
/// See module-level comment.
pub struct RW<W: OwnedAsyncWriter> {
buffered_writer: owned_buffers_io::write::BufferedWriter<
zero_padded::Buffer<TAIL_SZ>,
owned_buffers_io::util::size_tracking_writer::Writer<W>,
>,
}
pub enum ReadResult<'a, W> {
NeedsReadFromWriter { writer: &'a W },
ServedFromZeroPaddedMutableTail { buffer: &'a [u8; PAGE_SZ] },
}
impl<W> RW<W>
where
W: OwnedAsyncWriter,
{
pub fn new(writer: W) -> Self {
let bytes_flushed_tracker =
owned_buffers_io::util::size_tracking_writer::Writer::new(writer);
let buffered_writer = owned_buffers_io::write::BufferedWriter::new(
bytes_flushed_tracker,
zero_padded::Buffer::default(),
);
Self { buffered_writer }
}
pub(crate) fn as_writer(&self) -> &W {
self.buffered_writer.as_inner().as_inner()
}
pub async fn write_all_borrowed(
&mut self,
buf: &[u8],
ctx: &RequestContext,
) -> std::io::Result<usize> {
self.buffered_writer.write_buffered_borrowed(buf, ctx).await
}
pub fn bytes_written(&self) -> u64 {
let flushed_offset = self.buffered_writer.as_inner().bytes_written();
let buffer: &zero_padded::Buffer<TAIL_SZ> = self.buffered_writer.inspect_buffer();
flushed_offset + u64::try_from(buffer.pending()).unwrap()
}
/// Get a slice of all blocks that [`Self::read_blk`] would return as [`ReadResult::ServedFromZeroPaddedMutableTail`].
pub fn get_tail_zero_padded(&self) -> &[u8] {
let buffer: &zero_padded::Buffer<TAIL_SZ> = self.buffered_writer.inspect_buffer();
let buffer_written_up_to = buffer.pending();
// pad to next page boundary
let read_up_to = if buffer_written_up_to % PAGE_SZ == 0 {
buffer_written_up_to
} else {
buffer_written_up_to
.checked_add(PAGE_SZ - (buffer_written_up_to % PAGE_SZ))
.unwrap()
};
&buffer.as_zero_padded_slice()[0..read_up_to]
}
pub(crate) async fn read_blk(&self, blknum: u32) -> Result<ReadResult<'_, W>, std::io::Error> {
let flushed_offset = self.buffered_writer.as_inner().bytes_written();
let buffer: &zero_padded::Buffer<TAIL_SZ> = self.buffered_writer.inspect_buffer();
let buffered_offset = flushed_offset + u64::try_from(buffer.pending()).unwrap();
let read_offset = (blknum as u64) * (PAGE_SZ as u64);
// The trailing page ("block") might only be partially filled,
// yet the blob_io code relies on us to return a full PAGE_SZed slice anyway.
// Moreover, it has to be zero-padded, because when we still had
// a write-back page cache, it provided pre-zeroed pages, and blob_io came to rely on it.
// DeltaLayer probably has the same issue, not sure why it needs no special treatment.
// => check here that the read doesn't go beyond this potentially trailing
// => the zero-padding is done in the `else` branch below
let blocks_written = if buffered_offset % (PAGE_SZ as u64) == 0 {
buffered_offset / (PAGE_SZ as u64)
} else {
(buffered_offset / (PAGE_SZ as u64)) + 1
};
if (blknum as u64) >= blocks_written {
return Err(std::io::Error::new(std::io::ErrorKind::Other, anyhow::anyhow!("read past end of ephemeral_file: read=0x{read_offset:x} buffered=0x{buffered_offset:x} flushed=0x{flushed_offset}")));
}
// assertions for the `if-else` below
assert_eq!(
flushed_offset % (TAIL_SZ as u64), 0,
"we only use write_buffered_borrowed to write to the buffered writer, so it's guaranteed that flushes happen buffer.cap()-sized chunks"
);
assert_eq!(
flushed_offset % (PAGE_SZ as u64),
0,
"the logic below can't handle if the page is spread across the flushed part and the buffer"
);
if read_offset < flushed_offset {
assert!(read_offset + (PAGE_SZ as u64) <= flushed_offset);
Ok(ReadResult::NeedsReadFromWriter {
writer: self.as_writer(),
})
} else {
let read_offset_in_buffer = read_offset
.checked_sub(flushed_offset)
.expect("would have taken `if` branch instead of this one");
let read_offset_in_buffer = usize::try_from(read_offset_in_buffer).unwrap();
let zero_padded_slice = buffer.as_zero_padded_slice();
let page = &zero_padded_slice[read_offset_in_buffer..(read_offset_in_buffer + PAGE_SZ)];
Ok(ReadResult::ServedFromZeroPaddedMutableTail {
buffer: page
.try_into()
.expect("the slice above got it as page-size slice"),
})
}
}
}

View File

@@ -1,110 +0,0 @@
//! A [`crate::virtual_file::owned_buffers_io::write::Buffer`] whose
//! unwritten range is guaranteed to be zero-initialized.
//! This is used by [`crate::tenant::ephemeral_file::zero_padded_read_write::RW::read_blk`]
//! to serve page-sized reads of the trailing page when the trailing page has only been partially filled.
use std::mem::MaybeUninit;
use crate::virtual_file::owned_buffers_io::io_buf_ext::FullSlice;
/// See module-level comment.
pub struct Buffer<const N: usize> {
allocation: Box<[u8; N]>,
written: usize,
}
impl<const N: usize> Default for Buffer<N> {
fn default() -> Self {
Self {
allocation: Box::new(
// SAFETY: zeroed memory is a valid [u8; N]
unsafe { MaybeUninit::zeroed().assume_init() },
),
written: 0,
}
}
}
impl<const N: usize> Buffer<N> {
#[inline(always)]
fn invariants(&self) {
// don't check by default, unoptimized is too expensive even for debug mode
if false {
debug_assert!(self.written <= N, "{}", self.written);
debug_assert!(self.allocation[self.written..N].iter().all(|v| *v == 0));
}
}
pub fn as_zero_padded_slice(&self) -> &[u8; N] {
&self.allocation
}
}
impl<const N: usize> crate::virtual_file::owned_buffers_io::write::Buffer for Buffer<N> {
type IoBuf = Self;
fn cap(&self) -> usize {
self.allocation.len()
}
fn extend_from_slice(&mut self, other: &[u8]) {
self.invariants();
let remaining = self.allocation.len() - self.written;
if other.len() > remaining {
panic!("calling extend_from_slice() with insufficient remaining capacity");
}
self.allocation[self.written..(self.written + other.len())].copy_from_slice(other);
self.written += other.len();
self.invariants();
}
fn pending(&self) -> usize {
self.written
}
fn flush(self) -> FullSlice<Self> {
self.invariants();
let written = self.written;
FullSlice::must_new(tokio_epoll_uring::BoundedBuf::slice(self, 0..written))
}
fn reuse_after_flush(iobuf: Self::IoBuf) -> Self {
let Self {
mut allocation,
written,
} = iobuf;
allocation[0..written].fill(0);
let new = Self {
allocation,
written: 0,
};
new.invariants();
new
}
}
/// We have this trait impl so that the `flush` method in the `Buffer` impl above can produce a
/// [`tokio_epoll_uring::BoundedBuf::slice`] of the [`Self::written`] range of the data.
///
/// Remember that bytes_init is generally _not_ a tracker of the amount
/// of valid data in the io buffer; we use `Slice` for that.
/// The `IoBuf` is _only_ for keeping track of uninitialized memory, a bit like MaybeUninit.
///
/// SAFETY:
///
/// The [`Self::allocation`] is stable becauses boxes are stable.
/// The memory is zero-initialized, so, bytes_init is always N.
unsafe impl<const N: usize> tokio_epoll_uring::IoBuf for Buffer<N> {
fn stable_ptr(&self) -> *const u8 {
self.allocation.as_ptr()
}
fn bytes_init(&self) -> usize {
// Yes, N, not self.written; Read the full comment of this impl block!
N
}
fn bytes_total(&self) -> usize {
N
}
}

View File

@@ -464,7 +464,7 @@ impl LayerMap {
pub(self) fn insert_historic_noflush(&mut self, layer_desc: PersistentLayerDesc) {
// TODO: See #3869, resulting #4088, attempted fix and repro #4094
if Self::is_l0(&layer_desc.key_range) {
if Self::is_l0(&layer_desc.key_range, layer_desc.is_delta) {
self.l0_delta_layers.push(layer_desc.clone().into());
}
@@ -483,7 +483,7 @@ impl LayerMap {
self.historic
.remove(historic_layer_coverage::LayerKey::from(layer_desc));
let layer_key = layer_desc.key();
if Self::is_l0(&layer_desc.key_range) {
if Self::is_l0(&layer_desc.key_range, layer_desc.is_delta) {
let len_before = self.l0_delta_layers.len();
let mut l0_delta_layers = std::mem::take(&mut self.l0_delta_layers);
l0_delta_layers.retain(|other| other.key() != layer_key);
@@ -600,8 +600,8 @@ impl LayerMap {
}
/// Check if the key range resembles that of an L0 layer.
pub fn is_l0(key_range: &Range<Key>) -> bool {
key_range == &(Key::MIN..Key::MAX)
pub fn is_l0(key_range: &Range<Key>, is_delta_layer: bool) -> bool {
is_delta_layer && key_range == &(Key::MIN..Key::MAX)
}
/// This function determines which layers are counted in `count_deltas`:
@@ -628,7 +628,7 @@ impl LayerMap {
/// than just the current partition_range.
pub fn is_reimage_worthy(layer: &PersistentLayerDesc, partition_range: &Range<Key>) -> bool {
// Case 1
if !Self::is_l0(&layer.key_range) {
if !Self::is_l0(&layer.key_range, layer.is_delta) {
return true;
}

View File

@@ -2,13 +2,12 @@
pub mod delta_layer;
pub mod image_layer;
pub(crate) mod inmemory_layer;
pub mod inmemory_layer;
pub(crate) mod layer;
mod layer_desc;
mod layer_name;
pub mod merge_iterator;
#[cfg(test)]
pub mod split_writer;
use crate::context::{AccessStatsBehavior, RequestContext};

View File

@@ -36,10 +36,11 @@ use crate::tenant::block_io::{BlockBuf, BlockCursor, BlockLease, BlockReader, Fi
use crate::tenant::disk_btree::{
DiskBtreeBuilder, DiskBtreeIterator, DiskBtreeReader, VisitDirection,
};
use crate::tenant::storage_layer::layer::S3_UPLOAD_LIMIT;
use crate::tenant::timeline::GetVectoredError;
use crate::tenant::vectored_blob_io::{
BlobFlag, MaxVectoredReadBytes, StreamingVectoredReadPlanner, VectoredBlobReader, VectoredRead,
VectoredReadPlanner,
VectoredReadCoalesceMode, VectoredReadPlanner,
};
use crate::tenant::PageReconstructError;
use crate::virtual_file::owned_buffers_io::io_buf_ext::{FullSlice, IoBufExt};
@@ -64,7 +65,7 @@ use std::os::unix::fs::FileExt;
use std::str::FromStr;
use std::sync::Arc;
use tokio::sync::OnceCell;
use tokio_epoll_uring::IoBufMut;
use tokio_epoll_uring::IoBuf;
use tracing::*;
use utils::{
@@ -224,14 +225,24 @@ pub struct DeltaLayerInner {
file: VirtualFile,
file_id: FileId,
#[allow(dead_code)]
layer_key_range: Range<Key>,
#[allow(dead_code)]
layer_lsn_range: Range<Lsn>,
max_vectored_read_bytes: Option<MaxVectoredReadBytes>,
}
impl DeltaLayerInner {
pub(crate) fn layer_dbg_info(&self) -> String {
format!(
"delta {}..{} {}..{}",
self.key_range().start,
self.key_range().end,
self.lsn_range().start,
self.lsn_range().end
)
}
}
impl std::fmt::Debug for DeltaLayerInner {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("DeltaLayerInner")
@@ -458,7 +469,7 @@ impl DeltaLayerWriterInner {
ctx: &RequestContext,
) -> (FullSlice<Buf>, anyhow::Result<()>)
where
Buf: IoBufMut + Send,
Buf: IoBuf + Send,
{
assert!(
self.lsn_range.start <= lsn,
@@ -556,7 +567,6 @@ impl DeltaLayerWriterInner {
// 5GB limit for objects without multipart upload (which we don't want to use)
// Make it a little bit below to account for differing GB units
// https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html
const S3_UPLOAD_LIMIT: u64 = 4_500_000_000;
ensure!(
metadata.len() <= S3_UPLOAD_LIMIT,
"Created delta layer file at {} of size {} above limit {S3_UPLOAD_LIMIT}!",
@@ -666,7 +676,7 @@ impl DeltaLayerWriter {
ctx: &RequestContext,
) -> (FullSlice<Buf>, anyhow::Result<()>)
where
Buf: IoBufMut + Send,
Buf: IoBuf + Send,
{
self.inner
.as_mut()
@@ -690,12 +700,10 @@ impl DeltaLayerWriter {
self.inner.take().unwrap().finish(key_end, ctx).await
}
#[cfg(test)]
pub(crate) fn num_keys(&self) -> usize {
self.inner.as_ref().unwrap().num_keys
}
#[cfg(test)]
pub(crate) fn estimated_size(&self) -> u64 {
let inner = self.inner.as_ref().unwrap();
inner.blob_writer.size() + inner.tree.borrow_writer().size() + PAGE_SZ as u64
@@ -872,44 +880,6 @@ impl DeltaLayerInner {
Ok(())
}
/// Load all key-values in the delta layer, should be replaced by an iterator-based interface in the future.
pub(super) async fn load_key_values(
&self,
ctx: &RequestContext,
) -> anyhow::Result<Vec<(Key, Lsn, Value)>> {
let block_reader = FileBlockReader::new(&self.file, self.file_id);
let index_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new(
self.index_start_blk,
self.index_root_blk,
block_reader,
);
let mut result = Vec::new();
let mut stream =
Box::pin(self.stream_index_forwards(index_reader, &[0; DELTA_KEY_SIZE], ctx));
let block_reader = FileBlockReader::new(&self.file, self.file_id);
let cursor = block_reader.block_cursor();
let mut buf = Vec::new();
while let Some(item) = stream.next().await {
let (key, lsn, pos) = item?;
// TODO: dedup code with get_reconstruct_value
// TODO: ctx handling and sharding
cursor
.read_blob_into_buf(pos.pos(), &mut buf, ctx)
.await
.with_context(|| {
format!("Failed to read blob from virtual file {}", self.file.path)
})?;
let val = Value::des(&buf).with_context(|| {
format!(
"Failed to deserialize file blob from virtual file {}",
self.file.path
)
})?;
result.push((key, lsn, val));
}
Ok(result)
}
async fn plan_reads<Reader>(
keyspace: &KeySpace,
lsn_range: Range<Lsn>,
@@ -1195,6 +1165,7 @@ impl DeltaLayerInner {
let mut prev: Option<(Key, Lsn, BlobRef)> = None;
let mut read_builder: Option<VectoredReadBuilder> = None;
let read_mode = VectoredReadCoalesceMode::get();
let max_read_size = self
.max_vectored_read_bytes
@@ -1243,6 +1214,7 @@ impl DeltaLayerInner {
offsets.end.pos(),
meta,
max_read_size,
read_mode,
))
}
} else {
@@ -1527,6 +1499,10 @@ pub struct DeltaLayerIterator<'a> {
}
impl<'a> DeltaLayerIterator<'a> {
pub(crate) fn layer_dbg_info(&self) -> String {
self.delta_layer.layer_dbg_info()
}
/// Retrieve a batch of key-value pairs into the iterator buffer.
async fn next_batch(&mut self) -> anyhow::Result<()> {
assert!(self.key_values_batch.is_empty());
@@ -2281,7 +2257,7 @@ pub(crate) mod test {
// every key should be a batch b/c the value is larger than max_read_size
assert_eq!(iter.key_values_batch.len(), 1);
} else {
assert_eq!(iter.key_values_batch.len(), batch_size);
assert!(iter.key_values_batch.len() <= batch_size);
}
if num_items >= N {
break;

View File

@@ -28,7 +28,7 @@ use crate::context::{PageContentKind, RequestContext, RequestContextBuilder};
use crate::page_cache::{self, FileId, PAGE_SZ};
use crate::repository::{Key, Value, KEY_SIZE};
use crate::tenant::blob_io::BlobWriter;
use crate::tenant::block_io::{BlockBuf, BlockReader, FileBlockReader};
use crate::tenant::block_io::{BlockBuf, FileBlockReader};
use crate::tenant::disk_btree::{
DiskBtreeBuilder, DiskBtreeIterator, DiskBtreeReader, VisitDirection,
};
@@ -167,6 +167,17 @@ pub struct ImageLayerInner {
max_vectored_read_bytes: Option<MaxVectoredReadBytes>,
}
impl ImageLayerInner {
pub(crate) fn layer_dbg_info(&self) -> String {
format!(
"image {}..{} {}",
self.key_range().start,
self.key_range().end,
self.lsn()
)
}
}
impl std::fmt::Debug for ImageLayerInner {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("ImageLayerInner")
@@ -442,33 +453,6 @@ impl ImageLayerInner {
Ok(())
}
/// Load all key-values in the delta layer, should be replaced by an iterator-based interface in the future.
pub(super) async fn load_key_values(
&self,
ctx: &RequestContext,
) -> anyhow::Result<Vec<(Key, Lsn, Value)>> {
let block_reader = FileBlockReader::new(&self.file, self.file_id);
let tree_reader =
DiskBtreeReader::new(self.index_start_blk, self.index_root_blk, &block_reader);
let mut result = Vec::new();
let mut stream = Box::pin(tree_reader.into_stream(&[0; KEY_SIZE], ctx));
let block_reader = FileBlockReader::new(&self.file, self.file_id);
let cursor = block_reader.block_cursor();
while let Some(item) = stream.next().await {
// TODO: dedup code with get_reconstruct_value
let (raw_key, offset) = item?;
let key = Key::from_slice(&raw_key[..KEY_SIZE]);
// TODO: ctx handling and sharding
let blob = cursor
.read_blob(offset, ctx)
.await
.with_context(|| format!("failed to read value from offset {}", offset))?;
let value = Bytes::from(blob);
result.push((key, self.lsn, Value::Image(value)));
}
Ok(result)
}
/// Traverse the layer's index to build read operations on the overlap of the input keyspace
/// and the keys in this layer.
///
@@ -700,15 +684,11 @@ struct ImageLayerWriterInner {
blob_writer: BlobWriter<false>,
tree: DiskBtreeBuilder<BlockBuf, KEY_SIZE>,
#[cfg_attr(not(feature = "testing"), allow(dead_code))]
#[cfg(feature = "testing")]
last_written_key: Key,
}
impl ImageLayerWriterInner {
fn size(&self) -> u64 {
self.tree.borrow_writer().size() + self.blob_writer.size()
}
///
/// Start building a new image layer.
///
@@ -763,6 +743,7 @@ impl ImageLayerWriterInner {
uncompressed_bytes_eligible: 0,
uncompressed_bytes_chosen: 0,
num_keys: 0,
#[cfg(feature = "testing")]
last_written_key: Key::MIN,
};
@@ -815,12 +796,11 @@ impl ImageLayerWriterInner {
///
/// Finish writing the image layer.
///
async fn finish(
async fn finish_layer(
self,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
end_key: Option<Key>,
) -> anyhow::Result<ResidentLayer> {
) -> anyhow::Result<PersistentLayerDesc> {
let index_start_blk =
((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;
@@ -843,13 +823,19 @@ impl ImageLayerWriterInner {
res?;
}
let final_key_range = if let Some(end_key) = end_key {
self.key_range.start..end_key
} else {
self.key_range.clone()
};
// Fill in the summary on blk 0
let summary = Summary {
magic: IMAGE_FILE_MAGIC,
format_version: STORAGE_FORMAT_VERSION,
tenant_id: self.tenant_shard_id.tenant_id,
timeline_id: self.timeline_id,
key_range: self.key_range.clone(),
key_range: final_key_range.clone(),
lsn: self.lsn,
index_start_blk,
index_root_blk,
@@ -870,11 +856,7 @@ impl ImageLayerWriterInner {
let desc = PersistentLayerDesc::new_img(
self.tenant_shard_id,
self.timeline_id,
if let Some(end_key) = end_key {
self.key_range.start..end_key
} else {
self.key_range.clone()
},
final_key_range,
self.lsn,
metadata.len(),
);
@@ -894,8 +876,22 @@ impl ImageLayerWriterInner {
// fsync the file
file.sync_all().await?;
Ok(desc)
}
async fn finish(
self,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
end_key: Option<Key>,
) -> anyhow::Result<ResidentLayer> {
let path = self.path.clone();
let conf = self.conf;
let desc = self.finish_layer(ctx, end_key).await?;
// FIXME: why not carry the virtualfile here, it supports renaming?
let layer = Layer::finish_creating(self.conf, timeline, desc, &self.path)?;
let layer = Layer::finish_creating(conf, timeline, desc, &path)?;
info!("created image layer {}", layer.local_path());
@@ -963,14 +959,12 @@ impl ImageLayerWriter {
self.inner.as_mut().unwrap().put_image(key, img, ctx).await
}
#[cfg(test)]
/// Estimated size of the image layer.
pub(crate) fn estimated_size(&self) -> u64 {
let inner = self.inner.as_ref().unwrap();
inner.blob_writer.size() + inner.tree.borrow_writer().size() + PAGE_SZ as u64
}
#[cfg(test)]
pub(crate) fn num_keys(&self) -> usize {
self.inner.as_ref().unwrap().num_keys
}
@@ -986,7 +980,32 @@ impl ImageLayerWriter {
self.inner.take().unwrap().finish(timeline, ctx, None).await
}
#[cfg(test)]
/// Like finish(), but doesn't create the ResidentLayer struct. This can be used
/// by utilities that don't have a full-blown Timeline.
pub(crate) async fn finish_raw(
mut self,
ctx: &RequestContext,
) -> anyhow::Result<super::PersistentLayerDesc> {
let inner = self.inner.take().unwrap();
let name = ImageLayerName {
key_range: inner.key_range.clone(),
lsn: inner.lsn,
};
let temp_path = inner.path.clone();
let final_path = inner.conf.timeline_path(&inner.tenant_shard_id, &inner.timeline_id)
.join(name.to_string());
let desc = inner.finish_layer(ctx, None).await?;
// Rename the file to final name like Layer::finish_creating() does
utils::fs_ext::rename_noreplace(temp_path.as_std_path(), final_path.as_std_path())
.with_context(|| format!("rename temporary file as {final_path}"))?;
Ok(desc)
}
/// Finish writing the image layer with an end key, used in [`super::split_writer::SplitImageLayerWriter`]. The end key determines the end of the image layer's covered range and is exclusive.
pub(super) async fn finish_with_end_key(
mut self,
@@ -1000,10 +1019,6 @@ impl ImageLayerWriter {
.finish(timeline, ctx, Some(end_key))
.await
}
pub(crate) fn size(&self) -> u64 {
self.inner.as_ref().unwrap().size()
}
}
impl Drop for ImageLayerWriter {
@@ -1024,6 +1039,10 @@ pub struct ImageLayerIterator<'a> {
}
impl<'a> ImageLayerIterator<'a> {
pub(crate) fn layer_dbg_info(&self) -> String {
self.image_layer.layer_dbg_info()
}
/// Retrieve a batch of key-value pairs into the iterator buffer.
async fn next_batch(&mut self) -> anyhow::Result<()> {
assert!(self.key_values_batch.is_empty());
@@ -1375,7 +1394,7 @@ mod test {
// every key should be a batch b/c the value is larger than max_read_size
assert_eq!(iter.key_values_batch.len(), 1);
} else {
assert_eq!(iter.key_values_batch.len(), batch_size);
assert!(iter.key_values_batch.len() <= batch_size);
}
if num_items >= N {
break;

View File

@@ -4,23 +4,23 @@
//! held in an ephemeral file, not in memory. The metadata for each page version, i.e.
//! its position in the file, is kept in memory, though.
//!
use crate::assert_u64_eq_usize::{u64_to_usize, U64IsUsize, UsizeIsU64};
use crate::config::PageServerConf;
use crate::context::{PageContentKind, RequestContext, RequestContextBuilder};
use crate::page_cache::PAGE_SZ;
use crate::repository::{Key, Value};
use crate::tenant::block_io::{BlockCursor, BlockReader, BlockReaderRef};
use crate::tenant::ephemeral_file::EphemeralFile;
use crate::tenant::timeline::GetVectoredError;
use crate::tenant::PageReconstructError;
use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt;
use crate::{l0_flush, page_cache};
use anyhow::{anyhow, Result};
use anyhow::{anyhow, Context, Result};
use bytes::Bytes;
use camino::Utf8PathBuf;
use pageserver_api::key::CompactKey;
use pageserver_api::keyspace::KeySpace;
use pageserver_api::models::InMemoryLayerInfo;
use pageserver_api::shard::TenantShardId;
use std::collections::BTreeMap;
use std::collections::{BTreeMap, HashMap};
use std::sync::{Arc, OnceLock};
use std::time::Instant;
use tracing::*;
@@ -33,12 +33,14 @@ use std::fmt::Write;
use std::ops::Range;
use std::sync::atomic::Ordering as AtomicOrdering;
use std::sync::atomic::{AtomicU64, AtomicUsize};
use tokio::sync::{RwLock, RwLockWriteGuard};
use tokio::sync::RwLock;
use super::{
DeltaLayerWriter, PersistentLayerDesc, ValueReconstructSituation, ValuesReconstructState,
};
pub(crate) mod vectored_dio_read;
#[derive(Debug, PartialEq, Eq, Clone, Copy, Hash)]
pub(crate) struct InMemoryLayerFileId(page_cache::FileId);
@@ -78,9 +80,9 @@ impl std::fmt::Debug for InMemoryLayer {
pub struct InMemoryLayerInner {
/// All versions of all pages in the layer are kept here. Indexed
/// by block number and LSN. The value is an offset into the
/// by block number and LSN. The [`IndexEntry`] is an offset into the
/// ephemeral file where the page version is stored.
index: BTreeMap<CompactKey, VecMap<Lsn, u64>>,
index: BTreeMap<CompactKey, VecMap<Lsn, IndexEntry>>,
/// The values are stored in a serialized format in this file.
/// Each serialized Value is preceded by a 'u32' length field.
@@ -90,6 +92,154 @@ pub struct InMemoryLayerInner {
resource_units: GlobalResourceUnits,
}
/// Support the same max blob length as blob_io, because ultimately
/// all the InMemoryLayer contents end up being written into a delta layer,
/// using the [`crate::tenant::blob_io`].
const MAX_SUPPORTED_BLOB_LEN: usize = crate::tenant::blob_io::MAX_SUPPORTED_BLOB_LEN;
const MAX_SUPPORTED_BLOB_LEN_BITS: usize = {
let trailing_ones = MAX_SUPPORTED_BLOB_LEN.trailing_ones() as usize;
let leading_zeroes = MAX_SUPPORTED_BLOB_LEN.leading_zeros() as usize;
assert!(trailing_ones + leading_zeroes == std::mem::size_of::<usize>() * 8);
trailing_ones
};
/// See [`InMemoryLayerInner::index`].
///
/// For memory efficiency, the data is packed into a u64.
///
/// Layout:
/// - 1 bit: `will_init`
/// - [`MAX_SUPPORTED_BLOB_LEN_BITS`]: `len`
/// - [`MAX_SUPPORTED_POS_BITS`]: `pos`
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IndexEntry(u64);
impl IndexEntry {
/// See [`Self::MAX_SUPPORTED_POS`].
const MAX_SUPPORTED_POS_BITS: usize = {
let remainder = 64 - 1 - MAX_SUPPORTED_BLOB_LEN_BITS;
if remainder < 32 {
panic!("pos can be u32 as per type system, support that");
}
remainder
};
/// The maximum supported blob offset that can be represented by [`Self`].
/// See also [`Self::validate_checkpoint_distance`].
const MAX_SUPPORTED_POS: usize = (1 << Self::MAX_SUPPORTED_POS_BITS) - 1;
// Layout
const WILL_INIT_RANGE: Range<usize> = 0..1;
const LEN_RANGE: Range<usize> =
Self::WILL_INIT_RANGE.end..Self::WILL_INIT_RANGE.end + MAX_SUPPORTED_BLOB_LEN_BITS;
const POS_RANGE: Range<usize> =
Self::LEN_RANGE.end..Self::LEN_RANGE.end + Self::MAX_SUPPORTED_POS_BITS;
const _ASSERT: () = {
if Self::POS_RANGE.end != 64 {
panic!("we don't want undefined bits for our own sanity")
}
};
/// Fails if and only if the offset or length encoded in `arg` is too large to be represented by [`Self`].
///
/// The only reason why that can happen in the system is if the [`InMemoryLayer`] grows too long.
/// The [`InMemoryLayer`] size is determined by the checkpoint distance, enforced by [`crate::tenant::Timeline::should_roll`].
///
/// Thus, to avoid failure of this function, whenever we start up and/or change checkpoint distance,
/// call [`Self::validate_checkpoint_distance`] with the new checkpoint distance value.
///
/// TODO: this check should happen ideally at config parsing time (and in the request handler when a change to checkpoint distance is requested)
/// When cleaning this up, also look into the s3 max file size check that is performed in delta layer writer.
#[inline(always)]
fn new(arg: IndexEntryNewArgs) -> anyhow::Result<Self> {
let IndexEntryNewArgs {
base_offset,
batch_offset,
len,
will_init,
} = arg;
let pos = base_offset
.checked_add(batch_offset)
.ok_or_else(|| anyhow::anyhow!("base_offset + batch_offset overflows u64: base_offset={base_offset} batch_offset={batch_offset}"))?;
if pos.into_usize() > Self::MAX_SUPPORTED_POS {
anyhow::bail!(
"base_offset+batch_offset exceeds the maximum supported value: base_offset={base_offset} batch_offset={batch_offset} (+)={pos} max={max}",
max = Self::MAX_SUPPORTED_POS
);
}
if len > MAX_SUPPORTED_BLOB_LEN {
anyhow::bail!(
"len exceeds the maximum supported length: len={len} max={MAX_SUPPORTED_BLOB_LEN}",
);
}
let mut data: u64 = 0;
use bit_field::BitField;
data.set_bits(Self::WILL_INIT_RANGE, if will_init { 1 } else { 0 });
data.set_bits(Self::LEN_RANGE, len.into_u64());
data.set_bits(Self::POS_RANGE, pos);
Ok(Self(data))
}
#[inline(always)]
fn unpack(&self) -> IndexEntryUnpacked {
use bit_field::BitField;
IndexEntryUnpacked {
will_init: self.0.get_bits(Self::WILL_INIT_RANGE) != 0,
len: self.0.get_bits(Self::LEN_RANGE),
pos: self.0.get_bits(Self::POS_RANGE),
}
}
/// See [`Self::new`].
pub(crate) const fn validate_checkpoint_distance(
checkpoint_distance: u64,
) -> Result<(), &'static str> {
if checkpoint_distance > Self::MAX_SUPPORTED_POS as u64 {
return Err("exceeds the maximum supported value");
}
let res = u64_to_usize(checkpoint_distance).checked_add(MAX_SUPPORTED_BLOB_LEN);
if res.is_none() {
return Err(
"checkpoint distance + max supported blob len overflows in-memory addition",
);
}
// NB: it is ok for the result of the addition to be larger than MAX_SUPPORTED_POS
Ok(())
}
const _ASSERT_DEFAULT_CHECKPOINT_DISTANCE_IS_VALID: () = {
let res = Self::validate_checkpoint_distance(
crate::tenant::config::defaults::DEFAULT_CHECKPOINT_DISTANCE,
);
if res.is_err() {
panic!("default checkpoint distance is valid")
}
};
}
/// Args to [`IndexEntry::new`].
#[derive(Clone, Copy)]
struct IndexEntryNewArgs {
base_offset: u64,
batch_offset: u64,
len: usize,
will_init: bool,
}
/// Unpacked representation of the bitfielded [`IndexEntry`].
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct IndexEntryUnpacked {
will_init: bool,
len: u64,
pos: u64,
}
impl std::fmt::Debug for InMemoryLayerInner {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("InMemoryLayerInner").finish()
@@ -276,7 +426,12 @@ impl InMemoryLayer {
.build();
let inner = self.inner.read().await;
let reader = inner.file.block_cursor();
struct ValueRead {
entry_lsn: Lsn,
read: vectored_dio_read::LogicalRead<Vec<u8>>,
}
let mut reads: HashMap<Key, Vec<ValueRead>> = HashMap::new();
for range in keyspace.ranges.iter() {
for (key, vec_map) in inner
@@ -291,24 +446,62 @@ impl InMemoryLayer {
let slice = vec_map.slice_range(lsn_range);
for (entry_lsn, pos) in slice.iter().rev() {
// TODO: this uses the page cache => https://github.com/neondatabase/neon/issues/8183
let buf = reader.read_blob(*pos, &ctx).await;
if let Err(e) = buf {
reconstruct_state.on_key_error(key, PageReconstructError::from(anyhow!(e)));
for (entry_lsn, index_entry) in slice.iter().rev() {
let IndexEntryUnpacked {
pos,
len,
will_init,
} = index_entry.unpack();
reads.entry(key).or_default().push(ValueRead {
entry_lsn: *entry_lsn,
read: vectored_dio_read::LogicalRead::new(
pos,
Vec::with_capacity(len as usize),
),
});
if will_init {
break;
}
}
}
}
let value = Value::des(&buf.unwrap());
if let Err(e) = value {
// Execute the reads.
let f = vectored_dio_read::execute(
&inner.file,
reads
.iter()
.flat_map(|(_, value_reads)| value_reads.iter().map(|v| &v.read)),
&ctx,
);
send_future::SendFuture::send(f) // https://github.com/rust-lang/rust/issues/96865
.await;
// Process results into the reconstruct state
'next_key: for (key, value_reads) in reads {
for ValueRead { entry_lsn, read } in value_reads {
match read.into_result().expect("we run execute() above") {
Err(e) => {
reconstruct_state.on_key_error(key, PageReconstructError::from(anyhow!(e)));
break;
continue 'next_key;
}
Ok(value_buf) => {
let value = Value::des(&value_buf);
if let Err(e) = value {
reconstruct_state
.on_key_error(key, PageReconstructError::from(anyhow!(e)));
continue 'next_key;
}
let key_situation =
reconstruct_state.update_key(&key, *entry_lsn, value.unwrap());
if key_situation == ValueReconstructSituation::Complete {
break;
let key_situation =
reconstruct_state.update_key(&key, entry_lsn, value.unwrap());
if key_situation == ValueReconstructSituation::Complete {
// TODO: metric to see if we fetched more values than necessary
continue 'next_key;
}
// process the next value in the next iteration of the loop
}
}
}
@@ -320,6 +513,68 @@ impl InMemoryLayer {
}
}
/// Offset of a particular Value within a serialized batch.
struct SerializedBatchOffset {
key: CompactKey,
lsn: Lsn,
// TODO: separate type when we start serde-serializing this value, to avoid coupling
// in-memory representation to serialization format.
index_entry: IndexEntry,
}
pub struct SerializedBatch {
/// Blobs serialized in EphemeralFile's native format, ready for passing to [`EphemeralFile::write_raw`].
pub(crate) raw: Vec<u8>,
/// Index of values in [`Self::raw`], using offsets relative to the start of the buffer.
offsets: Vec<SerializedBatchOffset>,
/// The highest LSN of any value in the batch
pub(crate) max_lsn: Lsn,
}
impl SerializedBatch {
pub fn from_values(batch: Vec<(CompactKey, Lsn, usize, Value)>) -> anyhow::Result<Self> {
// Pre-allocate a big flat buffer to write into. This should be large but not huge: it is soft-limited in practice by
// [`crate::pgdatadir_mapping::DatadirModification::MAX_PENDING_BYTES`]
let buffer_size = batch.iter().map(|i| i.2).sum::<usize>();
let mut cursor = std::io::Cursor::new(Vec::<u8>::with_capacity(buffer_size));
let mut offsets: Vec<SerializedBatchOffset> = Vec::with_capacity(batch.len());
let mut max_lsn: Lsn = Lsn(0);
for (key, lsn, val_ser_size, val) in batch {
let relative_off = cursor.position();
val.ser_into(&mut cursor)
.expect("Writing into in-memory buffer is infallible");
offsets.push(SerializedBatchOffset {
key,
lsn,
index_entry: IndexEntry::new(IndexEntryNewArgs {
base_offset: 0,
batch_offset: relative_off,
len: val_ser_size,
will_init: val.will_init(),
})
.context("higher-level code ensures that values are within supported ranges")?,
});
max_lsn = std::cmp::max(max_lsn, lsn);
}
let buffer = cursor.into_inner();
// Assert that we didn't do any extra allocations while building buffer.
debug_assert!(buffer.len() <= buffer_size);
Ok(Self {
raw: buffer,
offsets,
max_lsn,
})
}
}
fn inmem_layer_display(mut f: impl Write, start_lsn: Lsn, end_lsn: Lsn) -> std::fmt::Result {
write!(f, "inmem-{:016X}-{:016X}", start_lsn.0, end_lsn.0)
}
@@ -380,53 +635,69 @@ impl InMemoryLayer {
})
}
// Write operations
/// Common subroutine of the public put_wal_record() and put_page_image() functions.
/// Adds the page version to the in-memory tree
pub async fn put_value(
/// Write path.
///
/// Errors are not retryable, the [`InMemoryLayer`] must be discarded, and not be read from.
/// The reason why it's not retryable is that the [`EphemeralFile`] writes are not retryable.
/// TODO: it can be made retryable if we aborted the process on EphemeralFile write errors.
pub async fn put_batch(
&self,
key: CompactKey,
lsn: Lsn,
buf: &[u8],
serialized_batch: SerializedBatch,
ctx: &RequestContext,
) -> Result<()> {
) -> anyhow::Result<()> {
let mut inner = self.inner.write().await;
self.assert_writable();
self.put_value_locked(&mut inner, key, lsn, buf, ctx).await
}
async fn put_value_locked(
&self,
locked_inner: &mut RwLockWriteGuard<'_, InMemoryLayerInner>,
key: CompactKey,
lsn: Lsn,
buf: &[u8],
ctx: &RequestContext,
) -> Result<()> {
trace!("put_value key {} at {}/{}", key, self.timeline_id, lsn);
let base_offset = inner.file.len();
let off = {
locked_inner
.file
.write_blob(
buf,
&RequestContextBuilder::extend(ctx)
.page_content_kind(PageContentKind::InMemoryLayer)
.build(),
)
.await?
};
let SerializedBatch {
raw,
mut offsets,
max_lsn: _,
} = serialized_batch;
let vec_map = locked_inner.index.entry(key).or_default();
let old = vec_map.append_or_update_last(lsn, off).unwrap().0;
if old.is_some() {
// We already had an entry for this LSN. That's odd..
warn!("Key {} at {} already exists", key, lsn);
// Add the base_offset to the batch's index entries which are relative to the batch start.
for offset in &mut offsets {
let IndexEntryUnpacked {
will_init,
len,
pos,
} = offset.index_entry.unpack();
offset.index_entry = IndexEntry::new(IndexEntryNewArgs {
base_offset,
batch_offset: pos,
len: len.into_usize(),
will_init,
})?;
}
let size = locked_inner.file.len();
locked_inner.resource_units.maybe_publish_size(size);
// Write the batch to the file
inner.file.write_raw(&raw, ctx).await?;
let new_size = inner.file.len();
let expected_new_len = base_offset
.checked_add(raw.len().into_u64())
// write_raw would error if we were to overflow u64.
// also IndexEntry and higher levels in
//the code don't allow the file to grow that large
.unwrap();
assert_eq!(new_size, expected_new_len);
// Update the index with the new entries
for SerializedBatchOffset {
key,
lsn,
index_entry,
} in offsets
{
let vec_map = inner.index.entry(key).or_default();
let old = vec_map.append_or_update_last(lsn, index_entry).unwrap().0;
if old.is_some() {
// We already had an entry for this LSN. That's odd..
warn!("Key {} at {} already exists", key, lsn);
}
}
inner.resource_units.maybe_publish_size(new_size);
Ok(())
}
@@ -470,7 +741,7 @@ impl InMemoryLayer {
{
let inner = self.inner.write().await;
for vec_map in inner.index.values() {
for (lsn, _pos) in vec_map.as_slice() {
for (lsn, _) in vec_map.as_slice() {
assert!(*lsn < end_lsn);
}
}
@@ -534,36 +805,23 @@ impl InMemoryLayer {
match l0_flush_global_state {
l0_flush::Inner::Direct { .. } => {
let file_contents: Vec<u8> = inner.file.load_to_vec(ctx).await?;
assert_eq!(
file_contents.len() % PAGE_SZ,
0,
"needed by BlockReaderRef::Slice"
);
assert_eq!(file_contents.len(), {
let written = usize::try_from(inner.file.len()).unwrap();
if written % PAGE_SZ == 0 {
written
} else {
written.checked_add(PAGE_SZ - (written % PAGE_SZ)).unwrap()
}
});
let cursor = BlockCursor::new(BlockReaderRef::Slice(&file_contents));
let mut buf = Vec::new();
let file_contents = Bytes::from(file_contents);
for (key, vec_map) in inner.index.iter() {
// Write all page versions
for (lsn, pos) in vec_map.as_slice() {
// TODO: once we have blob lengths in the in-memory index, we can
// 1. get rid of the blob_io / BlockReaderRef::Slice business and
// 2. load the file contents into a Bytes and
// 3. the use `Bytes::slice` to get the `buf` that is our blob
// 4. pass that `buf` into `put_value_bytes`
// => https://github.com/neondatabase/neon/issues/8183
cursor.read_blob_into_buf(*pos, &mut buf, ctx).await?;
let will_init = Value::des(&buf)?.will_init();
let (tmp, res) = delta_layer_writer
for (lsn, entry) in vec_map
.as_slice()
.iter()
.map(|(lsn, entry)| (lsn, entry.unpack()))
{
let IndexEntryUnpacked {
pos,
len,
will_init,
} = entry;
let buf = Bytes::slice(&file_contents, pos as usize..(pos + len) as usize);
let (_buf, res) = delta_layer_writer
.put_value_bytes(
Key::from_compact(*key),
*lsn,
@@ -573,7 +831,6 @@ impl InMemoryLayer {
)
.await;
res?;
buf = tmp.into_raw_slice().into_inner();
}
}
}
@@ -595,3 +852,134 @@ impl InMemoryLayer {
Ok(Some((desc, path)))
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_index_entry() {
const MAX_SUPPORTED_POS: usize = IndexEntry::MAX_SUPPORTED_POS;
use IndexEntryNewArgs as Args;
use IndexEntryUnpacked as Unpacked;
let roundtrip = |args, expect: Unpacked| {
let res = IndexEntry::new(args).expect("this tests expects no errors");
let IndexEntryUnpacked {
will_init,
len,
pos,
} = res.unpack();
assert_eq!(will_init, expect.will_init);
assert_eq!(len, expect.len);
assert_eq!(pos, expect.pos);
};
// basic roundtrip
for pos in [0, MAX_SUPPORTED_POS] {
for len in [0, MAX_SUPPORTED_BLOB_LEN] {
for will_init in [true, false] {
let expect = Unpacked {
will_init,
len: len.into_u64(),
pos: pos.into_u64(),
};
roundtrip(
Args {
will_init,
base_offset: pos.into_u64(),
batch_offset: 0,
len,
},
expect,
);
roundtrip(
Args {
will_init,
base_offset: 0,
batch_offset: pos.into_u64(),
len,
},
expect,
);
}
}
}
// too-large len
let too_large = Args {
will_init: false,
len: MAX_SUPPORTED_BLOB_LEN + 1,
base_offset: 0,
batch_offset: 0,
};
assert!(IndexEntry::new(too_large).is_err());
// too-large pos
{
let too_large = Args {
will_init: false,
len: 0,
base_offset: MAX_SUPPORTED_POS.into_u64() + 1,
batch_offset: 0,
};
assert!(IndexEntry::new(too_large).is_err());
let too_large = Args {
will_init: false,
len: 0,
base_offset: 0,
batch_offset: MAX_SUPPORTED_POS.into_u64() + 1,
};
assert!(IndexEntry::new(too_large).is_err());
}
// too large (base_offset + batch_offset)
{
let too_large = Args {
will_init: false,
len: 0,
base_offset: MAX_SUPPORTED_POS.into_u64(),
batch_offset: 1,
};
assert!(IndexEntry::new(too_large).is_err());
let too_large = Args {
will_init: false,
len: 0,
base_offset: MAX_SUPPORTED_POS.into_u64() - 1,
batch_offset: MAX_SUPPORTED_POS.into_u64() - 1,
};
assert!(IndexEntry::new(too_large).is_err());
}
// valid special cases
// - area past the max supported pos that is accessible by len
for len in [1, MAX_SUPPORTED_BLOB_LEN] {
roundtrip(
Args {
will_init: false,
len,
base_offset: MAX_SUPPORTED_POS.into_u64(),
batch_offset: 0,
},
Unpacked {
will_init: false,
len: len as u64,
pos: MAX_SUPPORTED_POS.into_u64(),
},
);
roundtrip(
Args {
will_init: false,
len,
base_offset: 0,
batch_offset: MAX_SUPPORTED_POS.into_u64(),
},
Unpacked {
will_init: false,
len: len as u64,
pos: MAX_SUPPORTED_POS.into_u64(),
},
);
}
}
}

View File

@@ -0,0 +1,937 @@
use std::{
collections::BTreeMap,
sync::{Arc, RwLock},
};
use itertools::Itertools;
use tokio_epoll_uring::{BoundedBuf, IoBufMut, Slice};
use crate::{
assert_u64_eq_usize::{U64IsUsize, UsizeIsU64},
context::RequestContext,
};
/// The file interface we require. At runtime, this is a [`crate::tenant::ephemeral_file::EphemeralFile`].
pub trait File: Send {
/// Attempt to read the bytes in `self` in range `[start,start+dst.bytes_total())`
/// and return the number of bytes read (let's call it `nread`).
/// The bytes read are placed in `dst`, i.e., `&dst[..nread]` will contain the read bytes.
///
/// The only reason why the read may be short (i.e., `nread != dst.bytes_total()`)
/// is if the file is shorter than `start+dst.len()`.
///
/// This is unlike [`std::os::unix::fs::FileExt::read_exact_at`] which returns an
/// [`std::io::ErrorKind::UnexpectedEof`] error if the file is shorter than `start+dst.len()`.
///
/// No guarantees are made about the remaining bytes in `dst` in case of a short read.
async fn read_exact_at_eof_ok<'a, 'b, B: IoBufMut + Send>(
&'b self,
start: u64,
dst: Slice<B>,
ctx: &'a RequestContext,
) -> std::io::Result<(Slice<B>, usize)>;
}
/// A logical read from [`File`]. See [`Self::new`].
pub struct LogicalRead<B: Buffer> {
pos: u64,
state: RwLockRefCell<LogicalReadState<B>>,
}
enum LogicalReadState<B: Buffer> {
NotStarted(B),
Ongoing(B),
Ok(B),
Error(Arc<std::io::Error>),
Undefined,
}
impl<B: Buffer> LogicalRead<B> {
/// Create a new [`LogicalRead`] from [`File`] of the data in the file in range `[ pos, pos + buf.cap() )`.
pub fn new(pos: u64, buf: B) -> Self {
Self {
pos,
state: RwLockRefCell::new(LogicalReadState::NotStarted(buf)),
}
}
pub fn into_result(self) -> Option<Result<B, Arc<std::io::Error>>> {
match self.state.into_inner() {
LogicalReadState::Ok(buf) => Some(Ok(buf)),
LogicalReadState::Error(e) => Some(Err(e)),
LogicalReadState::NotStarted(_) | LogicalReadState::Ongoing(_) => None,
LogicalReadState::Undefined => unreachable!(),
}
}
}
/// The buffer into which a [`LogicalRead`] result is placed.
pub trait Buffer: std::ops::Deref<Target = [u8]> {
/// Immutable.
fn cap(&self) -> usize;
/// Changes only through [`Self::extend_from_slice`].
fn len(&self) -> usize;
/// Panics if the total length would exceed the initialized capacity.
fn extend_from_slice(&mut self, src: &[u8]);
}
/// The minimum alignment and size requirement for disk offsets and memory buffer size for direct IO.
const DIO_CHUNK_SIZE: usize = 512;
/// If multiple chunks need to be read, merge adjacent chunk reads into batches of max size `MAX_CHUNK_BATCH_SIZE`.
/// (The unit is the number of chunks.)
const MAX_CHUNK_BATCH_SIZE: usize = {
let desired = 128 * 1024; // 128k
if desired % DIO_CHUNK_SIZE != 0 {
panic!("MAX_CHUNK_BATCH_SIZE must be a multiple of DIO_CHUNK_SIZE")
// compile-time error
}
desired / DIO_CHUNK_SIZE
};
/// Execute the given logical `reads` against `file`.
/// The results are placed in the buffers of the [`LogicalRead`]s.
/// Retrieve the results by calling [`LogicalRead::into_result`] on each [`LogicalRead`].
///
/// The [`LogicalRead`]s must be freshly created using [`LogicalRead::new`] when calling this function.
/// Otherwise, this function panics.
pub async fn execute<'a, I, F, B>(file: &F, reads: I, ctx: &RequestContext)
where
I: IntoIterator<Item = &'a LogicalRead<B>>,
F: File,
B: Buffer + IoBufMut + Send,
{
// Terminology:
// logical read = a request to read an arbitrary range of bytes from `file`; byte-level granularity
// chunk = we conceptually divide up the byte range of `file` into DIO_CHUNK_SIZEs ranges
// interest = a range within a chunk that a logical read is interested in; one logical read gets turned into many interests
// physical read = the read request we're going to issue to the OS; covers a range of chunks; chunk-level granularity
// Preserve a copy of the logical reads for debug assertions at the end
#[cfg(debug_assertions)]
let (reads, assert_logical_reads) = {
let (reads, assert) = reads.into_iter().tee();
(reads, Some(Vec::from_iter(assert)))
};
#[cfg(not(debug_assertions))]
let (reads, assert_logical_reads): (_, Option<Vec<&'a LogicalRead<B>>>) = (reads, None);
// Plan which parts of which chunks need to be appended to which buffer
let mut by_chunk: BTreeMap<u64, Vec<Interest<B>>> = BTreeMap::new();
struct Interest<'a, B: Buffer> {
logical_read: &'a LogicalRead<B>,
offset_in_chunk: u64,
len: u64,
}
for logical_read in reads {
let LogicalRead { pos, state } = logical_read;
let mut state = state.borrow_mut();
// transition from NotStarted to Ongoing
let cur = std::mem::replace(&mut *state, LogicalReadState::Undefined);
let req_len = match cur {
LogicalReadState::NotStarted(buf) => {
if buf.len() != 0 {
panic!("The `LogicalRead`s that are passed in must be freshly created using `LogicalRead::new`");
}
// buf.cap() == 0 is ok
// transition into Ongoing state
let req_len = buf.cap();
*state = LogicalReadState::Ongoing(buf);
req_len
}
x => panic!("must only call with fresh LogicalReads, got another state, leaving Undefined state behind state={x:?}"),
};
// plan which chunks we need to read from
let mut remaining = req_len;
let mut chunk_no = *pos / (DIO_CHUNK_SIZE.into_u64());
let mut offset_in_chunk = pos.into_usize() % DIO_CHUNK_SIZE;
while remaining > 0 {
let remaining_in_chunk = std::cmp::min(remaining, DIO_CHUNK_SIZE - offset_in_chunk);
by_chunk.entry(chunk_no).or_default().push(Interest {
logical_read,
offset_in_chunk: offset_in_chunk.into_u64(),
len: remaining_in_chunk.into_u64(),
});
offset_in_chunk = 0;
chunk_no += 1;
remaining -= remaining_in_chunk;
}
}
// At this point, we could iterate over by_chunk, in chunk order,
// read each chunk from disk, and fill the buffers.
// However, we can merge adjacent chunks into batches of MAX_CHUNK_BATCH_SIZE
// so we issue fewer IOs = fewer roundtrips = lower overall latency.
struct PhysicalRead<'a, B: Buffer> {
start_chunk_no: u64,
nchunks: usize,
dsts: Vec<PhysicalInterest<'a, B>>,
}
struct PhysicalInterest<'a, B: Buffer> {
logical_read: &'a LogicalRead<B>,
offset_in_physical_read: u64,
len: u64,
}
let mut physical_reads: Vec<PhysicalRead<B>> = Vec::new();
let mut by_chunk = by_chunk.into_iter().peekable();
loop {
let mut last_chunk_no = None;
let to_merge: Vec<(u64, Vec<Interest<B>>)> = by_chunk
.peeking_take_while(|(chunk_no, _)| {
if let Some(last_chunk_no) = last_chunk_no {
if *chunk_no != last_chunk_no + 1 {
return false;
}
}
last_chunk_no = Some(*chunk_no);
true
})
.take(MAX_CHUNK_BATCH_SIZE)
.collect(); // TODO: avoid this .collect()
let Some(start_chunk_no) = to_merge.first().map(|(chunk_no, _)| *chunk_no) else {
break;
};
let nchunks = to_merge.len();
let dsts = to_merge
.into_iter()
.enumerate()
.flat_map(|(i, (_, dsts))| {
dsts.into_iter().map(
move |Interest {
logical_read,
offset_in_chunk,
len,
}| {
PhysicalInterest {
logical_read,
offset_in_physical_read: i
.checked_mul(DIO_CHUNK_SIZE)
.unwrap()
.into_u64()
+ offset_in_chunk,
len,
}
},
)
})
.collect();
physical_reads.push(PhysicalRead {
start_chunk_no,
nchunks,
dsts,
});
}
drop(by_chunk);
// Execute physical reads and fill the logical read buffers
// TODO: pipelined reads; prefetch;
let get_io_buffer = |nchunks| Vec::with_capacity(nchunks * DIO_CHUNK_SIZE);
for PhysicalRead {
start_chunk_no,
nchunks,
dsts,
} in physical_reads
{
let all_done = dsts
.iter()
.all(|PhysicalInterest { logical_read, .. }| logical_read.state.borrow().is_terminal());
if all_done {
continue;
}
let read_offset = start_chunk_no
.checked_mul(DIO_CHUNK_SIZE.into_u64())
.expect("we produce chunk_nos by dividing by DIO_CHUNK_SIZE earlier");
let io_buf = get_io_buffer(nchunks).slice_full();
let req_len = io_buf.len();
let (io_buf_slice, nread) = match file.read_exact_at_eof_ok(read_offset, io_buf, ctx).await
{
Ok(t) => t,
Err(e) => {
let e = Arc::new(e);
for PhysicalInterest { logical_read, .. } in dsts {
*logical_read.state.borrow_mut() = LogicalReadState::Error(Arc::clone(&e));
// this will make later reads for the given LogicalRead short-circuit, see top of loop body
}
continue;
}
};
let io_buf = io_buf_slice.into_inner();
assert!(
nread <= io_buf.len(),
"the last chunk in the file can be a short read, so, no =="
);
let io_buf = &io_buf[..nread];
for PhysicalInterest {
logical_read,
offset_in_physical_read,
len,
} in dsts
{
let mut logical_read_state_borrow = logical_read.state.borrow_mut();
let logical_read_buf = match &mut *logical_read_state_borrow {
LogicalReadState::NotStarted(_) => {
unreachable!("we transition it into Ongoing at function entry")
}
LogicalReadState::Ongoing(buf) => buf,
LogicalReadState::Ok(_) | LogicalReadState::Error(_) => {
continue;
}
LogicalReadState::Undefined => unreachable!(),
};
let range_in_io_buf = std::ops::Range {
start: offset_in_physical_read as usize,
end: offset_in_physical_read as usize + len as usize,
};
assert!(range_in_io_buf.end >= range_in_io_buf.start);
if range_in_io_buf.end > nread {
let msg = format!(
"physical read returned EOF where this logical read expected more data in the file: offset=0x{read_offset:x} req_len=0x{req_len:x} nread=0x{nread:x} {:?}",
&*logical_read_state_borrow
);
logical_read_state_borrow.transition_to_terminal(Err(std::io::Error::new(
std::io::ErrorKind::UnexpectedEof,
msg,
)));
continue;
}
let data = &io_buf[range_in_io_buf];
// Copy data from io buffer into the logical read buffer.
// (And in debug mode, validate that the buffer impl adheres to the Buffer trait spec.)
let pre = if cfg!(debug_assertions) {
Some((logical_read_buf.len(), logical_read_buf.cap()))
} else {
None
};
logical_read_buf.extend_from_slice(data);
let post = if cfg!(debug_assertions) {
Some((logical_read_buf.len(), logical_read_buf.cap()))
} else {
None
};
match (pre, post) {
(None, None) => {}
(Some(_), None) | (None, Some(_)) => unreachable!(),
(Some((pre_len, pre_cap)), Some((post_len, post_cap))) => {
assert_eq!(pre_len + len as usize, post_len);
assert_eq!(pre_cap, post_cap);
}
}
if logical_read_buf.len() == logical_read_buf.cap() {
logical_read_state_borrow.transition_to_terminal(Ok(()));
}
}
}
if let Some(assert_logical_reads) = assert_logical_reads {
for logical_read in assert_logical_reads {
assert!(logical_read.state.borrow().is_terminal());
}
}
}
impl<B: Buffer> LogicalReadState<B> {
fn is_terminal(&self) -> bool {
match self {
LogicalReadState::NotStarted(_) | LogicalReadState::Ongoing(_) => false,
LogicalReadState::Ok(_) | LogicalReadState::Error(_) => true,
LogicalReadState::Undefined => unreachable!(),
}
}
fn transition_to_terminal(&mut self, err: std::io::Result<()>) {
let cur = std::mem::replace(self, LogicalReadState::Undefined);
let buf = match cur {
LogicalReadState::Ongoing(buf) => buf,
x => panic!("must only call in state Ongoing, got {x:?}"),
};
*self = match err {
Ok(()) => LogicalReadState::Ok(buf),
Err(e) => LogicalReadState::Error(Arc::new(e)),
};
}
}
impl<B: Buffer> std::fmt::Debug for LogicalReadState<B> {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
#[derive(Debug)]
#[allow(unused)]
struct BufferDebug {
len: usize,
cap: usize,
}
impl<'a> From<&'a dyn Buffer> for BufferDebug {
fn from(buf: &'a dyn Buffer) -> Self {
Self {
len: buf.len(),
cap: buf.cap(),
}
}
}
match self {
LogicalReadState::NotStarted(b) => {
write!(f, "NotStarted({:?})", BufferDebug::from(b as &dyn Buffer))
}
LogicalReadState::Ongoing(b) => {
write!(f, "Ongoing({:?})", BufferDebug::from(b as &dyn Buffer))
}
LogicalReadState::Ok(b) => write!(f, "Ok({:?})", BufferDebug::from(b as &dyn Buffer)),
LogicalReadState::Error(e) => write!(f, "Error({:?})", e),
LogicalReadState::Undefined => write!(f, "Undefined"),
}
}
}
#[derive(Debug)]
struct RwLockRefCell<T>(RwLock<T>);
impl<T> RwLockRefCell<T> {
fn new(value: T) -> Self {
Self(RwLock::new(value))
}
fn borrow(&self) -> impl std::ops::Deref<Target = T> + '_ {
self.0.try_read().unwrap()
}
fn borrow_mut(&self) -> impl std::ops::DerefMut<Target = T> + '_ {
self.0.try_write().unwrap()
}
fn into_inner(self) -> T {
self.0.into_inner().unwrap()
}
}
impl Buffer for Vec<u8> {
fn cap(&self) -> usize {
self.capacity()
}
fn len(&self) -> usize {
self.len()
}
fn extend_from_slice(&mut self, src: &[u8]) {
if self.len() + src.len() > self.cap() {
panic!("Buffer capacity exceeded");
}
Vec::extend_from_slice(self, src);
}
}
#[cfg(test)]
#[allow(clippy::assertions_on_constants)]
mod tests {
use rand::Rng;
use crate::{
context::DownloadBehavior, task_mgr::TaskKind,
virtual_file::owned_buffers_io::slice::SliceMutExt,
};
use super::*;
use std::{cell::RefCell, collections::VecDeque};
struct InMemoryFile {
content: Vec<u8>,
}
impl InMemoryFile {
fn new_random(len: usize) -> Self {
Self {
content: rand::thread_rng()
.sample_iter(rand::distributions::Standard)
.take(len)
.collect(),
}
}
fn test_logical_read(&self, pos: u64, len: usize) -> TestLogicalRead {
let expected_result = if pos as usize + len > self.content.len() {
Err("InMemoryFile short read".to_string())
} else {
Ok(self.content[pos as usize..pos as usize + len].to_vec())
};
TestLogicalRead::new(pos, len, expected_result)
}
}
#[test]
fn test_in_memory_file() {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let file = InMemoryFile::new_random(10);
let test_read = |pos, len| {
let buf = vec![0; len];
let fut = file.read_exact_at_eof_ok(pos, buf.slice_full(), &ctx);
use futures::FutureExt;
let (slice, nread) = fut
.now_or_never()
.expect("impl never awaits")
.expect("impl never errors");
let mut buf = slice.into_inner();
buf.truncate(nread);
buf
};
assert_eq!(test_read(0, 1), &file.content[0..1]);
assert_eq!(test_read(1, 2), &file.content[1..3]);
assert_eq!(test_read(9, 2), &file.content[9..]);
assert!(test_read(10, 2).is_empty());
assert!(test_read(11, 2).is_empty());
}
impl File for InMemoryFile {
async fn read_exact_at_eof_ok<'a, 'b, B: IoBufMut + Send>(
&'b self,
start: u64,
mut dst: Slice<B>,
_ctx: &'a RequestContext,
) -> std::io::Result<(Slice<B>, usize)> {
let dst_slice: &mut [u8] = dst.as_mut_rust_slice_full_zeroed();
let nread = {
let req_len = dst_slice.len();
let len = std::cmp::min(req_len, self.content.len().saturating_sub(start as usize));
if start as usize >= self.content.len() {
0
} else {
dst_slice[..len]
.copy_from_slice(&self.content[start as usize..start as usize + len]);
len
}
};
rand::Rng::fill(&mut rand::thread_rng(), &mut dst_slice[nread..]); // to discover bugs
Ok((dst, nread))
}
}
#[derive(Clone)]
struct TestLogicalRead {
pos: u64,
len: usize,
expected_result: Result<Vec<u8>, String>,
}
impl TestLogicalRead {
fn new(pos: u64, len: usize, expected_result: Result<Vec<u8>, String>) -> Self {
Self {
pos,
len,
expected_result,
}
}
fn make_logical_read(&self) -> LogicalRead<Vec<u8>> {
LogicalRead::new(self.pos, Vec::with_capacity(self.len))
}
}
async fn execute_and_validate_test_logical_reads<I, F>(
file: &F,
test_logical_reads: I,
ctx: &RequestContext,
) where
I: IntoIterator<Item = TestLogicalRead>,
F: File,
{
let (tmp, test_logical_reads) = test_logical_reads.into_iter().tee();
let logical_reads = tmp.map(|tr| tr.make_logical_read()).collect::<Vec<_>>();
execute(file, logical_reads.iter(), ctx).await;
for (logical_read, test_logical_read) in logical_reads.into_iter().zip(test_logical_reads) {
let actual = logical_read.into_result().expect("we call execute()");
match (actual, test_logical_read.expected_result) {
(Ok(actual), Ok(expected)) if actual == expected => {}
(Err(actual), Err(expected)) => {
assert_eq!(actual.to_string(), expected);
}
(actual, expected) => panic!("expected {expected:?}\nactual {actual:?}"),
}
}
}
#[tokio::test]
async fn test_blackbox() {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let cs = DIO_CHUNK_SIZE;
let cs_u64 = cs.into_u64();
let file = InMemoryFile::new_random(10 * cs);
let test_logical_reads = vec![
file.test_logical_read(0, 1),
// adjacent to logical_read0
file.test_logical_read(1, 2),
// gap
// spans adjacent chunks
file.test_logical_read(cs_u64 - 1, 2),
// gap
// tail of chunk 3, all of chunk 4, and 2 bytes of chunk 5
file.test_logical_read(3 * cs_u64 - 1, cs + 2),
// gap
file.test_logical_read(5 * cs_u64, 1),
];
let num_test_logical_reads = test_logical_reads.len();
let test_logical_reads_perms = test_logical_reads
.into_iter()
.permutations(num_test_logical_reads);
// test all orderings of LogicalReads, the order shouldn't matter for the results
for test_logical_reads in test_logical_reads_perms {
execute_and_validate_test_logical_reads(&file, test_logical_reads, &ctx).await;
}
}
#[tokio::test]
#[should_panic]
async fn test_reusing_logical_reads_panics() {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let file = InMemoryFile::new_random(DIO_CHUNK_SIZE);
let a = file.test_logical_read(23, 10);
let logical_reads = vec![a.make_logical_read()];
execute(&file, &logical_reads, &ctx).await;
// reuse pancis
execute(&file, &logical_reads, &ctx).await;
}
struct RecorderFile<'a> {
recorded: RefCell<Vec<RecordedRead>>,
file: &'a InMemoryFile,
}
struct RecordedRead {
pos: u64,
req_len: usize,
res: Vec<u8>,
}
impl<'a> RecorderFile<'a> {
fn new(file: &'a InMemoryFile) -> RecorderFile<'a> {
Self {
recorded: Default::default(),
file,
}
}
}
impl<'x> File for RecorderFile<'x> {
async fn read_exact_at_eof_ok<'a, 'b, B: IoBufMut + Send>(
&'b self,
start: u64,
dst: Slice<B>,
ctx: &'a RequestContext,
) -> std::io::Result<(Slice<B>, usize)> {
let (dst, nread) = self.file.read_exact_at_eof_ok(start, dst, ctx).await?;
self.recorded.borrow_mut().push(RecordedRead {
pos: start,
req_len: dst.bytes_total(),
res: Vec::from(&dst[..nread]),
});
Ok((dst, nread))
}
}
#[tokio::test]
async fn test_logical_reads_to_same_chunk_are_merged_into_one_chunk_read() {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let file = InMemoryFile::new_random(2 * DIO_CHUNK_SIZE);
let a = file.test_logical_read(DIO_CHUNK_SIZE.into_u64(), 10);
let b = file.test_logical_read(DIO_CHUNK_SIZE.into_u64() + 30, 20);
let recorder = RecorderFile::new(&file);
execute_and_validate_test_logical_reads(&recorder, vec![a, b], &ctx).await;
let recorded = recorder.recorded.borrow();
assert_eq!(recorded.len(), 1);
let RecordedRead { pos, req_len, .. } = &recorded[0];
assert_eq!(*pos, DIO_CHUNK_SIZE.into_u64());
assert_eq!(*req_len, DIO_CHUNK_SIZE);
}
#[tokio::test]
async fn test_max_chunk_batch_size_is_respected() {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let file = InMemoryFile::new_random(4 * MAX_CHUNK_BATCH_SIZE * DIO_CHUNK_SIZE);
// read the 10th byte of each chunk 3 .. 3+2*MAX_CHUNK_BATCH_SIZE
assert!(3 < MAX_CHUNK_BATCH_SIZE, "test assumption");
assert!(10 < DIO_CHUNK_SIZE, "test assumption");
let mut test_logical_reads = Vec::new();
for i in 3..3 + MAX_CHUNK_BATCH_SIZE + MAX_CHUNK_BATCH_SIZE / 2 {
test_logical_reads
.push(file.test_logical_read(i.into_u64() * DIO_CHUNK_SIZE.into_u64() + 10, 1));
}
let recorder = RecorderFile::new(&file);
execute_and_validate_test_logical_reads(&recorder, test_logical_reads, &ctx).await;
let recorded = recorder.recorded.borrow();
assert_eq!(recorded.len(), 2);
{
let RecordedRead { pos, req_len, .. } = &recorded[0];
assert_eq!(*pos as usize, 3 * DIO_CHUNK_SIZE);
assert_eq!(*req_len, MAX_CHUNK_BATCH_SIZE * DIO_CHUNK_SIZE);
}
{
let RecordedRead { pos, req_len, .. } = &recorded[1];
assert_eq!(*pos as usize, (3 + MAX_CHUNK_BATCH_SIZE) * DIO_CHUNK_SIZE);
assert_eq!(*req_len, MAX_CHUNK_BATCH_SIZE / 2 * DIO_CHUNK_SIZE);
}
}
#[tokio::test]
async fn test_batch_breaks_if_chunk_is_not_interesting() {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
assert!(MAX_CHUNK_BATCH_SIZE > 10, "test assumption");
let file = InMemoryFile::new_random(3 * DIO_CHUNK_SIZE);
let a = file.test_logical_read(0, 1); // chunk 0
let b = file.test_logical_read(2 * DIO_CHUNK_SIZE.into_u64(), 1); // chunk 2
let recorder = RecorderFile::new(&file);
execute_and_validate_test_logical_reads(&recorder, vec![a, b], &ctx).await;
let recorded = recorder.recorded.borrow();
assert_eq!(recorded.len(), 2);
{
let RecordedRead { pos, req_len, .. } = &recorded[0];
assert_eq!(*pos, 0);
assert_eq!(*req_len, DIO_CHUNK_SIZE);
}
{
let RecordedRead { pos, req_len, .. } = &recorded[1];
assert_eq!(*pos, 2 * DIO_CHUNK_SIZE.into_u64());
assert_eq!(*req_len, DIO_CHUNK_SIZE);
}
}
struct ExpectedRead {
expect_pos: u64,
expect_len: usize,
respond: Result<Vec<u8>, String>,
}
struct MockFile {
expected: RefCell<VecDeque<ExpectedRead>>,
}
impl Drop for MockFile {
fn drop(&mut self) {
assert!(
self.expected.borrow().is_empty(),
"expected reads not satisfied"
);
}
}
macro_rules! mock_file {
($($pos:expr , $len:expr => $respond:expr),* $(,)?) => {{
MockFile {
expected: RefCell::new(VecDeque::from(vec![$(ExpectedRead {
expect_pos: $pos,
expect_len: $len,
respond: $respond,
}),*])),
}
}};
}
impl File for MockFile {
async fn read_exact_at_eof_ok<'a, 'b, B: IoBufMut + Send>(
&'b self,
start: u64,
mut dst: Slice<B>,
_ctx: &'a RequestContext,
) -> std::io::Result<(Slice<B>, usize)> {
let ExpectedRead {
expect_pos,
expect_len,
respond,
} = self
.expected
.borrow_mut()
.pop_front()
.expect("unexpected read");
assert_eq!(start, expect_pos);
assert_eq!(dst.bytes_total(), expect_len);
match respond {
Ok(mocked_bytes) => {
let len = std::cmp::min(dst.bytes_total(), mocked_bytes.len());
let dst_slice: &mut [u8] = dst.as_mut_rust_slice_full_zeroed();
dst_slice[..len].copy_from_slice(&mocked_bytes[..len]);
rand::Rng::fill(&mut rand::thread_rng(), &mut dst_slice[len..]); // to discover bugs
Ok((dst, len))
}
Err(e) => Err(std::io::Error::new(std::io::ErrorKind::Other, e)),
}
}
}
#[tokio::test]
async fn test_mock_file() {
// Self-test to ensure the relevant features of mock file work as expected.
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let mock_file = mock_file! {
0 , 512 => Ok(vec![0; 512]),
512 , 512 => Ok(vec![1; 512]),
1024 , 512 => Ok(vec![2; 10]),
2048, 1024 => Err("foo".to_owned()),
};
let buf = Vec::with_capacity(512);
let (buf, nread) = mock_file
.read_exact_at_eof_ok(0, buf.slice_full(), &ctx)
.await
.unwrap();
assert_eq!(nread, 512);
assert_eq!(&buf.into_inner()[..nread], &[0; 512]);
let buf = Vec::with_capacity(512);
let (buf, nread) = mock_file
.read_exact_at_eof_ok(512, buf.slice_full(), &ctx)
.await
.unwrap();
assert_eq!(nread, 512);
assert_eq!(&buf.into_inner()[..nread], &[1; 512]);
let buf = Vec::with_capacity(512);
let (buf, nread) = mock_file
.read_exact_at_eof_ok(1024, buf.slice_full(), &ctx)
.await
.unwrap();
assert_eq!(nread, 10);
assert_eq!(&buf.into_inner()[..nread], &[2; 10]);
let buf = Vec::with_capacity(1024);
let err = mock_file
.read_exact_at_eof_ok(2048, buf.slice_full(), &ctx)
.await
.err()
.unwrap();
assert_eq!(err.to_string(), "foo");
}
#[tokio::test]
async fn test_error_on_one_chunk_read_fails_only_dependent_logical_reads() {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
let test_logical_reads = vec![
// read spanning two batches
TestLogicalRead::new(
DIO_CHUNK_SIZE.into_u64() / 2,
MAX_CHUNK_BATCH_SIZE * DIO_CHUNK_SIZE,
Err("foo".to_owned()),
),
// second read in failing chunk
TestLogicalRead::new(
(MAX_CHUNK_BATCH_SIZE * DIO_CHUNK_SIZE).into_u64() + DIO_CHUNK_SIZE.into_u64() - 10,
5,
Err("foo".to_owned()),
),
// read unaffected
TestLogicalRead::new(
(MAX_CHUNK_BATCH_SIZE * DIO_CHUNK_SIZE).into_u64()
+ 2 * DIO_CHUNK_SIZE.into_u64()
+ 10,
5,
Ok(vec![1; 5]),
),
];
let (tmp, test_logical_reads) = test_logical_reads.into_iter().tee();
let test_logical_read_perms = tmp.permutations(test_logical_reads.len());
for test_logical_reads in test_logical_read_perms {
let file = mock_file!(
0, MAX_CHUNK_BATCH_SIZE*DIO_CHUNK_SIZE => Ok(vec![0; MAX_CHUNK_BATCH_SIZE*DIO_CHUNK_SIZE]),
(MAX_CHUNK_BATCH_SIZE*DIO_CHUNK_SIZE).into_u64(), DIO_CHUNK_SIZE => Err("foo".to_owned()),
(MAX_CHUNK_BATCH_SIZE*DIO_CHUNK_SIZE + 2*DIO_CHUNK_SIZE).into_u64(), DIO_CHUNK_SIZE => Ok(vec![1; DIO_CHUNK_SIZE]),
);
execute_and_validate_test_logical_reads(&file, test_logical_reads, &ctx).await;
}
}
struct TestShortReadsSetup {
ctx: RequestContext,
file: InMemoryFile,
written: u64,
}
fn setup_short_chunk_read_tests() -> TestShortReadsSetup {
let ctx = RequestContext::new(TaskKind::UnitTest, DownloadBehavior::Error);
assert!(DIO_CHUNK_SIZE > 20, "test assumption");
let written = (2 * DIO_CHUNK_SIZE - 10).into_u64();
let file = InMemoryFile::new_random(written as usize);
TestShortReadsSetup { ctx, file, written }
}
#[tokio::test]
async fn test_short_chunk_read_from_written_range() {
// Test what happens if there are logical reads
// that start within the last chunk, and
// the last chunk is not the full chunk length.
//
// The read should succeed despite the short chunk length.
let TestShortReadsSetup { ctx, file, written } = setup_short_chunk_read_tests();
let a = file.test_logical_read(written - 10, 5);
let recorder = RecorderFile::new(&file);
execute_and_validate_test_logical_reads(&recorder, vec![a], &ctx).await;
let recorded = recorder.recorded.borrow();
assert_eq!(recorded.len(), 1);
let RecordedRead { pos, req_len, res } = &recorded[0];
assert_eq!(*pos, DIO_CHUNK_SIZE.into_u64());
assert_eq!(*req_len, DIO_CHUNK_SIZE);
assert_eq!(res, &file.content[DIO_CHUNK_SIZE..(written as usize)]);
}
#[tokio::test]
async fn test_short_chunk_read_and_logical_read_from_unwritten_range() {
// Test what happens if there are logical reads
// that start within the last chunk, and
// the last chunk is not the full chunk length, and
// the logical reads end in the unwritten range.
//
// All should fail with UnexpectedEof and have the same IO pattern.
async fn the_impl(offset_delta: i64) {
let TestShortReadsSetup { ctx, file, written } = setup_short_chunk_read_tests();
let offset = u64::try_from(
i64::try_from(written)
.unwrap()
.checked_add(offset_delta)
.unwrap(),
)
.unwrap();
let a = file.test_logical_read(offset, 5);
let recorder = RecorderFile::new(&file);
let a_vr = a.make_logical_read();
execute(&recorder, vec![&a_vr], &ctx).await;
// validate the LogicalRead result
let a_res = a_vr.into_result().unwrap();
let a_err = a_res.unwrap_err();
assert_eq!(a_err.kind(), std::io::ErrorKind::UnexpectedEof);
// validate the IO pattern
let recorded = recorder.recorded.borrow();
assert_eq!(recorded.len(), 1);
let RecordedRead { pos, req_len, res } = &recorded[0];
assert_eq!(*pos, DIO_CHUNK_SIZE.into_u64());
assert_eq!(*req_len, DIO_CHUNK_SIZE);
assert_eq!(res, &file.content[DIO_CHUNK_SIZE..(written as usize)]);
}
the_impl(-1).await; // start == length - 1
the_impl(0).await; // start == length
the_impl(1).await; // start == length + 1
}
// TODO: mixed: some valid, some UnexpectedEof
// TODO: same tests but with merges
}

View File

@@ -13,8 +13,7 @@ use utils::lsn::Lsn;
use utils::sync::{gate, heavier_once_cell};
use crate::config::PageServerConf;
use crate::context::{DownloadBehavior, RequestContext};
use crate::repository::Key;
use crate::context::{DownloadBehavior, RequestContext, RequestContextBuilder};
use crate::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::task_mgr::TaskKind;
use crate::tenant::timeline::{CompactionError, GetVectoredError};
@@ -35,6 +34,8 @@ mod tests;
#[cfg(test)]
mod failpoints;
pub const S3_UPLOAD_LIMIT: u64 = 4_500_000_000;
/// A Layer contains all data in a "rectangle" consisting of a range of keys and
/// range of LSNs.
///
@@ -332,23 +333,6 @@ impl Layer {
})
}
/// Get all key/values in the layer. Should be replaced with an iterator-based API in the future.
#[allow(dead_code)]
pub(crate) async fn load_key_values(
&self,
ctx: &RequestContext,
) -> anyhow::Result<Vec<(Key, Lsn, crate::repository::Value)>> {
let layer = self
.0
.get_or_maybe_download(true, Some(ctx))
.await
.map_err(|err| match err {
DownloadError::DownloadCancelled => GetVectoredError::Cancelled,
other => GetVectoredError::Other(anyhow::anyhow!(other)),
})?;
layer.load_key_values(&self.0, ctx).await
}
/// Download the layer if evicted.
///
/// Will not error when the layer is already downloaded.
@@ -1296,7 +1280,10 @@ impl LayerInner {
lsn_end: lsn_range.end,
remote: !resident,
access_stats,
l0: crate::tenant::layer_map::LayerMap::is_l0(&self.layer_desc().key_range),
l0: crate::tenant::layer_map::LayerMap::is_l0(
&self.layer_desc().key_range,
self.layer_desc().is_delta,
),
}
} else {
let lsn = self.desc.image_layer_lsn();
@@ -1489,8 +1476,9 @@ impl LayerInner {
let duration = SystemTime::now().duration_since(local_layer_mtime);
match duration {
Ok(elapsed) => {
let accessed = self.access_stats.accessed();
if accessed {
let accessed_and_visible = self.access_stats.accessed()
&& self.access_stats.visibility() == LayerVisibilityHint::Visible;
if accessed_and_visible {
// Only layers used for reads contribute to our "low residence" metric that is used
// to detect thrashing. Layers promoted for other reasons (e.g. compaction) are allowed
// to be rapidly evicted without contributing to this metric.
@@ -1504,7 +1492,7 @@ impl LayerInner {
tracing::info!(
residence_millis = elapsed.as_millis(),
accessed,
accessed_and_visible,
"evicted layer after known residence period"
);
}
@@ -1690,6 +1678,9 @@ impl DownloadedLayer {
);
let res = if owner.desc.is_delta {
let ctx = RequestContextBuilder::extend(ctx)
.page_content_kind(crate::context::PageContentKind::DeltaLayerSummary)
.build();
let summary = Some(delta_layer::Summary::expected(
owner.desc.tenant_shard_id.tenant_id,
owner.desc.timeline_id,
@@ -1700,11 +1691,14 @@ impl DownloadedLayer {
&owner.path,
summary,
Some(owner.conf.max_vectored_read_bytes),
ctx,
&ctx,
)
.await
.map(LayerKind::Delta)
} else {
let ctx = RequestContextBuilder::extend(ctx)
.page_content_kind(crate::context::PageContentKind::ImageLayerSummary)
.build();
let lsn = owner.desc.image_layer_lsn();
let summary = Some(image_layer::Summary::expected(
owner.desc.tenant_shard_id.tenant_id,
@@ -1717,7 +1711,7 @@ impl DownloadedLayer {
lsn,
summary,
Some(owner.conf.max_vectored_read_bytes),
ctx,
&ctx,
)
.await
.map(LayerKind::Image)
@@ -1771,19 +1765,6 @@ impl DownloadedLayer {
}
}
async fn load_key_values(
&self,
owner: &Arc<LayerInner>,
ctx: &RequestContext,
) -> anyhow::Result<Vec<(Key, Lsn, crate::repository::Value)>> {
use LayerKind::*;
match self.get(owner, ctx).await? {
Delta(d) => d.load_key_values(ctx).await,
Image(i) => i.load_key_values(ctx).await,
}
}
async fn dump(&self, owner: &Arc<LayerInner>, ctx: &RequestContext) -> anyhow::Result<()> {
use LayerKind::*;
match self.get(owner, ctx).await? {

View File

@@ -782,7 +782,7 @@ async fn eviction_cancellation_on_drop() {
let mut writer = timeline.writer().await;
writer
.put(
Key::from_i128(5),
crate::repository::Key::from_i128(5),
Lsn(0x20),
&Value::Image(Bytes::from_static(b"this does not matter either")),
&ctx,

View File

@@ -256,6 +256,10 @@ impl LayerName {
LayerName::Delta(layer) => &layer.key_range,
}
}
pub fn is_delta(&self) -> bool {
matches!(self, LayerName::Delta(_))
}
}
impl fmt::Display for LayerName {

View File

@@ -3,6 +3,7 @@ use std::{
collections::{binary_heap, BinaryHeap},
};
use anyhow::bail;
use pageserver_api::key::Key;
use utils::lsn::Lsn;
@@ -26,6 +27,13 @@ impl<'a> LayerRef<'a> {
Self::Delta(x) => LayerIterRef::Delta(x.iter(ctx)),
}
}
fn layer_dbg_info(&self) -> String {
match self {
Self::Image(x) => x.layer_dbg_info(),
Self::Delta(x) => x.layer_dbg_info(),
}
}
}
enum LayerIterRef<'a> {
@@ -40,6 +48,13 @@ impl LayerIterRef<'_> {
Self::Image(x) => x.next().await,
}
}
fn layer_dbg_info(&self) -> String {
match self {
Self::Image(x) => x.layer_dbg_info(),
Self::Delta(x) => x.layer_dbg_info(),
}
}
}
/// This type plays several roles at once
@@ -75,6 +90,11 @@ impl<'a> PeekableLayerIterRef<'a> {
async fn next(&mut self) -> anyhow::Result<Option<(Key, Lsn, Value)>> {
let result = self.peeked.take();
self.peeked = self.iter.next().await?;
if let (Some((k1, l1, _)), Some((k2, l2, _))) = (&self.peeked, &result) {
if (k1, l1) < (k2, l2) {
bail!("iterator is not ordered: {}", self.iter.layer_dbg_info());
}
}
Ok(result)
}
}
@@ -178,7 +198,12 @@ impl<'a> IteratorWrapper<'a> {
let iter = PeekableLayerIterRef::create(iter).await?;
if let Some((k1, l1, _)) = iter.peek() {
let (k2, l2) = first_key_lower_bound;
debug_assert!((k1, l1) >= (k2, l2));
if (k1, l1) < (k2, l2) {
bail!(
"layer key range did not include the first key in the layer: {}",
layer.layer_dbg_info()
);
}
}
*self = Self::Loaded { iter };
Ok(())

View File

@@ -1,4 +1,4 @@
use std::{ops::Range, sync::Arc};
use std::{future::Future, ops::Range, sync::Arc};
use bytes::Bytes;
use pageserver_api::key::{Key, KEY_SIZE};
@@ -7,7 +7,32 @@ use utils::{id::TimelineId, lsn::Lsn, shard::TenantShardId};
use crate::tenant::storage_layer::Layer;
use crate::{config::PageServerConf, context::RequestContext, repository::Value, tenant::Timeline};
use super::{DeltaLayerWriter, ImageLayerWriter, ResidentLayer};
use super::layer::S3_UPLOAD_LIMIT;
use super::{
DeltaLayerWriter, ImageLayerWriter, PersistentLayerDesc, PersistentLayerKey, ResidentLayer,
};
pub(crate) enum SplitWriterResult {
Produced(ResidentLayer),
Discarded(PersistentLayerKey),
}
#[cfg(test)]
impl SplitWriterResult {
fn into_resident_layer(self) -> ResidentLayer {
match self {
SplitWriterResult::Produced(layer) => layer,
SplitWriterResult::Discarded(_) => panic!("unexpected discarded layer"),
}
}
fn into_discarded_layer(self) -> PersistentLayerKey {
match self {
SplitWriterResult::Produced(_) => panic!("unexpected produced layer"),
SplitWriterResult::Discarded(layer) => layer,
}
}
}
/// An image writer that takes images and produces multiple image layers. The interface does not
/// guarantee atomicity (i.e., if the image layer generation fails, there might be leftover files
@@ -16,11 +41,12 @@ use super::{DeltaLayerWriter, ImageLayerWriter, ResidentLayer};
pub struct SplitImageLayerWriter {
inner: ImageLayerWriter,
target_layer_size: u64,
generated_layers: Vec<ResidentLayer>,
generated_layers: Vec<SplitWriterResult>,
conf: &'static PageServerConf,
timeline_id: TimelineId,
tenant_shard_id: TenantShardId,
lsn: Lsn,
start_key: Key,
}
impl SplitImageLayerWriter {
@@ -49,16 +75,22 @@ impl SplitImageLayerWriter {
timeline_id,
tenant_shard_id,
lsn,
start_key,
})
}
pub async fn put_image(
pub async fn put_image_with_discard_fn<D, F>(
&mut self,
key: Key,
img: Bytes,
tline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<()> {
discard: D,
) -> anyhow::Result<()>
where
D: FnOnce(&PersistentLayerKey) -> F,
F: Future<Output = bool>,
{
// The current estimation is an upper bound of the space that the key/image could take
// because we did not consider compression in this estimation. The resulting image layer
// could be smaller than the target size.
@@ -76,33 +108,87 @@ impl SplitImageLayerWriter {
)
.await?;
let prev_image_writer = std::mem::replace(&mut self.inner, next_image_writer);
self.generated_layers.push(
prev_image_writer
.finish_with_end_key(tline, key, ctx)
.await?,
);
let layer_key = PersistentLayerKey {
key_range: self.start_key..key,
lsn_range: PersistentLayerDesc::image_layer_lsn_range(self.lsn),
is_delta: false,
};
self.start_key = key;
if discard(&layer_key).await {
drop(prev_image_writer);
self.generated_layers
.push(SplitWriterResult::Discarded(layer_key));
} else {
self.generated_layers.push(SplitWriterResult::Produced(
prev_image_writer
.finish_with_end_key(tline, key, ctx)
.await?,
));
}
}
self.inner.put_image(key, img, ctx).await
}
pub(crate) async fn finish(
#[cfg(test)]
pub async fn put_image(
&mut self,
key: Key,
img: Bytes,
tline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<()> {
self.put_image_with_discard_fn(key, img, tline, ctx, |_| async { false })
.await
}
pub(crate) async fn finish_with_discard_fn<D, F>(
self,
tline: &Arc<Timeline>,
ctx: &RequestContext,
end_key: Key,
) -> anyhow::Result<Vec<ResidentLayer>> {
discard: D,
) -> anyhow::Result<Vec<SplitWriterResult>>
where
D: FnOnce(&PersistentLayerKey) -> F,
F: Future<Output = bool>,
{
let Self {
mut generated_layers,
inner,
..
} = self;
generated_layers.push(inner.finish_with_end_key(tline, end_key, ctx).await?);
if inner.num_keys() == 0 {
return Ok(generated_layers);
}
let layer_key = PersistentLayerKey {
key_range: self.start_key..end_key,
lsn_range: PersistentLayerDesc::image_layer_lsn_range(self.lsn),
is_delta: false,
};
if discard(&layer_key).await {
generated_layers.push(SplitWriterResult::Discarded(layer_key));
} else {
generated_layers.push(SplitWriterResult::Produced(
inner.finish_with_end_key(tline, end_key, ctx).await?,
));
}
Ok(generated_layers)
}
#[cfg(test)]
pub(crate) async fn finish(
self,
tline: &Arc<Timeline>,
ctx: &RequestContext,
end_key: Key,
) -> anyhow::Result<Vec<SplitWriterResult>> {
self.finish_with_discard_fn(tline, ctx, end_key, |_| async { false })
.await
}
/// When split writer fails, the caller should call this function and handle partially generated layers.
#[allow(dead_code)]
pub(crate) async fn take(self) -> anyhow::Result<(Vec<ResidentLayer>, ImageLayerWriter)> {
pub(crate) fn take(self) -> anyhow::Result<(Vec<SplitWriterResult>, ImageLayerWriter)> {
Ok((self.generated_layers, self.inner))
}
}
@@ -110,15 +196,21 @@ impl SplitImageLayerWriter {
/// A delta writer that takes key-lsn-values and produces multiple delta layers. The interface does not
/// guarantee atomicity (i.e., if the delta layer generation fails, there might be leftover files
/// to be cleaned up).
///
/// Note that if updates of a single key exceed the target size limit, all of the updates will be batched
/// into a single file. This behavior might change in the future. For reference, the legacy compaction algorithm
/// will split them into multiple files based on size.
#[must_use]
pub struct SplitDeltaLayerWriter {
inner: DeltaLayerWriter,
target_layer_size: u64,
generated_layers: Vec<ResidentLayer>,
generated_layers: Vec<SplitWriterResult>,
conf: &'static PageServerConf,
timeline_id: TimelineId,
tenant_shard_id: TenantShardId,
lsn_range: Range<Lsn>,
last_key_written: Key,
start_key: Key,
}
impl SplitDeltaLayerWriter {
@@ -147,9 +239,74 @@ impl SplitDeltaLayerWriter {
timeline_id,
tenant_shard_id,
lsn_range,
last_key_written: Key::MIN,
start_key,
})
}
/// Put value into the layer writer. In the case the writer decides to produce a layer, and the discard fn returns true, no layer will be written in the end.
pub async fn put_value_with_discard_fn<D, F>(
&mut self,
key: Key,
lsn: Lsn,
val: Value,
tline: &Arc<Timeline>,
ctx: &RequestContext,
discard: D,
) -> anyhow::Result<()>
where
D: FnOnce(&PersistentLayerKey) -> F,
F: Future<Output = bool>,
{
// The current estimation is key size plus LSN size plus value size estimation. This is not an accurate
// number, and therefore the final layer size could be a little bit larger or smaller than the target.
//
// Also, keep all updates of a single key in a single file. TODO: split them using the legacy compaction
// strategy. https://github.com/neondatabase/neon/issues/8837
let addition_size_estimation = KEY_SIZE as u64 + 8 /* LSN u64 size */ + 80 /* value size estimation */;
if self.inner.num_keys() >= 1
&& self.inner.estimated_size() + addition_size_estimation >= self.target_layer_size
{
if key != self.last_key_written {
let next_delta_writer = DeltaLayerWriter::new(
self.conf,
self.timeline_id,
self.tenant_shard_id,
key,
self.lsn_range.clone(),
ctx,
)
.await?;
let prev_delta_writer = std::mem::replace(&mut self.inner, next_delta_writer);
let layer_key = PersistentLayerKey {
key_range: self.start_key..key,
lsn_range: self.lsn_range.clone(),
is_delta: true,
};
self.start_key = key;
if discard(&layer_key).await {
drop(prev_delta_writer);
self.generated_layers
.push(SplitWriterResult::Discarded(layer_key));
} else {
let (desc, path) = prev_delta_writer.finish(key, ctx).await?;
let delta_layer = Layer::finish_creating(self.conf, tline, desc, &path)?;
self.generated_layers
.push(SplitWriterResult::Produced(delta_layer));
}
} else if self.inner.estimated_size() >= S3_UPLOAD_LIMIT {
// We have to produce a very large file b/c a key is updated too often.
anyhow::bail!(
"a single key is updated too often: key={}, estimated_size={}, and the layer file cannot be produced",
key,
self.inner.estimated_size()
);
}
}
self.last_key_written = key;
self.inner.put_value(key, lsn, val, ctx).await
}
pub async fn put_value(
&mut self,
key: Key,
@@ -158,56 +315,64 @@ impl SplitDeltaLayerWriter {
tline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<()> {
// The current estimation is key size plus LSN size plus value size estimation. This is not an accurate
// number, and therefore the final layer size could be a little bit larger or smaller than the target.
let addition_size_estimation = KEY_SIZE as u64 + 8 /* LSN u64 size */ + 80 /* value size estimation */;
if self.inner.num_keys() >= 1
&& self.inner.estimated_size() + addition_size_estimation >= self.target_layer_size
{
let next_delta_writer = DeltaLayerWriter::new(
self.conf,
self.timeline_id,
self.tenant_shard_id,
key,
self.lsn_range.clone(),
ctx,
)
.await?;
let prev_delta_writer = std::mem::replace(&mut self.inner, next_delta_writer);
let (desc, path) = prev_delta_writer.finish(key, ctx).await?;
let delta_layer = Layer::finish_creating(self.conf, tline, desc, &path)?;
self.generated_layers.push(delta_layer);
}
self.inner.put_value(key, lsn, val, ctx).await
self.put_value_with_discard_fn(key, lsn, val, tline, ctx, |_| async { false })
.await
}
pub(crate) async fn finish(
pub(crate) async fn finish_with_discard_fn<D, F>(
self,
tline: &Arc<Timeline>,
ctx: &RequestContext,
end_key: Key,
) -> anyhow::Result<Vec<ResidentLayer>> {
discard: D,
) -> anyhow::Result<Vec<SplitWriterResult>>
where
D: FnOnce(&PersistentLayerKey) -> F,
F: Future<Output = bool>,
{
let Self {
mut generated_layers,
inner,
..
} = self;
let (desc, path) = inner.finish(end_key, ctx).await?;
let delta_layer = Layer::finish_creating(self.conf, tline, desc, &path)?;
generated_layers.push(delta_layer);
if inner.num_keys() == 0 {
return Ok(generated_layers);
}
let layer_key = PersistentLayerKey {
key_range: self.start_key..end_key,
lsn_range: self.lsn_range.clone(),
is_delta: true,
};
if discard(&layer_key).await {
generated_layers.push(SplitWriterResult::Discarded(layer_key));
} else {
let (desc, path) = inner.finish(end_key, ctx).await?;
let delta_layer = Layer::finish_creating(self.conf, tline, desc, &path)?;
generated_layers.push(SplitWriterResult::Produced(delta_layer));
}
Ok(generated_layers)
}
#[cfg(test)]
pub(crate) async fn finish(
self,
tline: &Arc<Timeline>,
ctx: &RequestContext,
end_key: Key,
) -> anyhow::Result<Vec<SplitWriterResult>> {
self.finish_with_discard_fn(tline, ctx, end_key, |_| async { false })
.await
}
/// When split writer fails, the caller should call this function and handle partially generated layers.
#[allow(dead_code)]
pub(crate) async fn take(self) -> anyhow::Result<(Vec<ResidentLayer>, DeltaLayerWriter)> {
pub(crate) fn take(self) -> anyhow::Result<(Vec<SplitWriterResult>, DeltaLayerWriter)> {
Ok((self.generated_layers, self.inner))
}
}
#[cfg(test)]
mod tests {
use itertools::Itertools;
use rand::{RngCore, SeedableRng};
use crate::{
@@ -302,9 +467,16 @@ mod tests {
#[tokio::test]
async fn write_split() {
let harness = TenantHarness::create("split_writer_write_split")
.await
.unwrap();
write_split_helper("split_writer_write_split", false).await;
}
#[tokio::test]
async fn write_split_discard() {
write_split_helper("split_writer_write_split_discard", false).await;
}
async fn write_split_helper(harness_name: &'static str, discard: bool) {
let harness = TenantHarness::create(harness_name).await.unwrap();
let (tenant, ctx) = harness.load().await;
let tline = tenant
@@ -338,16 +510,19 @@ mod tests {
for i in 0..N {
let i = i as u32;
image_writer
.put_image(get_key(i), get_large_img(), &tline, &ctx)
.put_image_with_discard_fn(get_key(i), get_large_img(), &tline, &ctx, |_| async {
discard
})
.await
.unwrap();
delta_writer
.put_value(
.put_value_with_discard_fn(
get_key(i),
Lsn(0x20),
Value::Image(get_large_img()),
&tline,
&ctx,
|_| async { discard },
)
.await
.unwrap();
@@ -360,22 +535,39 @@ mod tests {
.finish(&tline, &ctx, get_key(N as u32))
.await
.unwrap();
assert_eq!(image_layers.len(), N / 512 + 1);
assert_eq!(delta_layers.len(), N / 512 + 1);
for idx in 0..image_layers.len() {
assert_ne!(image_layers[idx].layer_desc().key_range.start, Key::MIN);
assert_ne!(image_layers[idx].layer_desc().key_range.end, Key::MAX);
assert_ne!(delta_layers[idx].layer_desc().key_range.start, Key::MIN);
assert_ne!(delta_layers[idx].layer_desc().key_range.end, Key::MAX);
if idx > 0 {
assert_eq!(
image_layers[idx - 1].layer_desc().key_range.end,
image_layers[idx].layer_desc().key_range.start
);
assert_eq!(
delta_layers[idx - 1].layer_desc().key_range.end,
delta_layers[idx].layer_desc().key_range.start
);
if discard {
for layer in image_layers {
layer.into_discarded_layer();
}
for layer in delta_layers {
layer.into_discarded_layer();
}
} else {
let image_layers = image_layers
.into_iter()
.map(|x| x.into_resident_layer())
.collect_vec();
let delta_layers = delta_layers
.into_iter()
.map(|x| x.into_resident_layer())
.collect_vec();
assert_eq!(image_layers.len(), N / 512 + 1);
assert_eq!(delta_layers.len(), N / 512 + 1);
for idx in 0..image_layers.len() {
assert_ne!(image_layers[idx].layer_desc().key_range.start, Key::MIN);
assert_ne!(image_layers[idx].layer_desc().key_range.end, Key::MAX);
assert_ne!(delta_layers[idx].layer_desc().key_range.start, Key::MIN);
assert_ne!(delta_layers[idx].layer_desc().key_range.end, Key::MAX);
if idx > 0 {
assert_eq!(
image_layers[idx - 1].layer_desc().key_range.end,
image_layers[idx].layer_desc().key_range.start
);
assert_eq!(
delta_layers[idx - 1].layer_desc().key_range.end,
delta_layers[idx].layer_desc().key_range.start
);
}
}
}
}
@@ -456,4 +648,49 @@ mod tests {
.unwrap();
assert_eq!(layers.len(), 2);
}
#[tokio::test]
async fn write_split_single_key() {
let harness = TenantHarness::create("split_writer_write_split_single_key")
.await
.unwrap();
let (tenant, ctx) = harness.load().await;
let tline = tenant
.create_test_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
.await
.unwrap();
const N: usize = 2000;
let mut delta_writer = SplitDeltaLayerWriter::new(
tenant.conf,
tline.timeline_id,
tenant.tenant_shard_id,
get_key(0),
Lsn(0x10)..Lsn(N as u64 * 16 + 0x10),
4 * 1024 * 1024,
&ctx,
)
.await
.unwrap();
for i in 0..N {
let i = i as u32;
delta_writer
.put_value(
get_key(0),
Lsn(i as u64 * 16 + 0x10),
Value::Image(get_large_img()),
&tline,
&ctx,
)
.await
.unwrap();
}
let delta_layers = delta_writer
.finish(&tline, &ctx, get_key(N as u32))
.await
.unwrap();
assert_eq!(delta_layers.len(), 1);
}
}

View File

@@ -192,20 +192,28 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
}
}
let started_at = Instant::now();
let sleep_duration = if period == Duration::ZERO {
let sleep_duration;
if period == Duration::ZERO {
#[cfg(not(feature = "testing"))]
info!("automatic compaction is disabled");
// check again in 10 seconds, in case it's been enabled again.
Duration::from_secs(10)
sleep_duration = Duration::from_secs(10)
} else {
let iteration = Iteration {
started_at: Instant::now(),
period,
kind: BackgroundLoopKind::Compaction,
};
// Run compaction
match tenant.compaction_iteration(&cancel, &ctx).await {
let IterationResult { output, elapsed } = iteration.run(tenant.compaction_iteration(&cancel, &ctx)).await;
match output {
Ok(has_pending_task) => {
error_run_count = 0;
// schedule the next compaction immediately in case there is a pending compaction task
if has_pending_task { Duration::ZERO } else { period }
sleep_duration = if has_pending_task { Duration::ZERO } else { period };
}
Err(e) => {
let wait_duration = backoff::exponential_backoff_duration_seconds(
@@ -221,16 +229,14 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
&wait_duration,
cancel.is_cancelled(),
);
wait_duration
sleep_duration = wait_duration;
}
}
// the duration is recorded by performance tests by enabling debug in this function
tracing::debug!(elapsed_ms=elapsed.as_millis(), "compaction iteration complete");
};
let elapsed = started_at.elapsed();
warn_when_period_overrun(elapsed, period, BackgroundLoopKind::Compaction);
// the duration is recorded by performance tests by enabling debug in this function
tracing::debug!(elapsed_ms=elapsed.as_millis(), "compaction iteration complete");
// Perhaps we did no work and the walredo process has been idle for some time:
// give it a chance to shut down to avoid leaving walredo process running indefinitely.
@@ -368,23 +374,27 @@ async fn gc_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
}
}
let started_at = Instant::now();
let gc_horizon = tenant.get_gc_horizon();
let sleep_duration = if period == Duration::ZERO || gc_horizon == 0 {
let sleep_duration;
if period == Duration::ZERO || gc_horizon == 0 {
#[cfg(not(feature = "testing"))]
info!("automatic GC is disabled");
// check again in 10 seconds, in case it's been enabled again.
Duration::from_secs(10)
sleep_duration = Duration::from_secs(10);
} else {
let iteration = Iteration {
started_at: Instant::now(),
period,
kind: BackgroundLoopKind::Gc,
};
// Run gc
let res = tenant
.gc_iteration(None, gc_horizon, tenant.get_pitr_interval(), &cancel, &ctx)
let IterationResult { output, elapsed: _ } =
iteration.run(tenant.gc_iteration(None, gc_horizon, tenant.get_pitr_interval(), &cancel, &ctx))
.await;
match res {
match output {
Ok(_) => {
error_run_count = 0;
period
sleep_duration = period;
}
Err(crate::tenant::GcError::TenantCancelled) => {
return;
@@ -408,13 +418,11 @@ async fn gc_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
error!("Gc failed {error_run_count} times, retrying in {wait_duration:?}: {e:?}");
}
wait_duration
sleep_duration = wait_duration;
}
}
};
warn_when_period_overrun(started_at.elapsed(), period, BackgroundLoopKind::Gc);
if tokio::time::timeout(sleep_duration, cancel.cancelled())
.await
.is_ok()
@@ -468,14 +476,12 @@ async fn ingest_housekeeping_loop(tenant: Arc<Tenant>, cancel: CancellationToken
break;
}
let started_at = Instant::now();
tenant.ingest_housekeeping().await;
warn_when_period_overrun(
started_at.elapsed(),
let iteration = Iteration {
started_at: Instant::now(),
period,
BackgroundLoopKind::IngestHouseKeeping,
);
kind: BackgroundLoopKind::IngestHouseKeeping,
};
iteration.run(tenant.ingest_housekeeping()).await;
}
}
.await;
@@ -553,6 +559,54 @@ pub(crate) async fn delay_by_lease_length(
}
}
struct Iteration {
started_at: Instant,
period: Duration,
kind: BackgroundLoopKind,
}
struct IterationResult<O> {
output: O,
elapsed: Duration,
}
impl Iteration {
#[instrument(skip_all)]
pub(crate) async fn run<Fut, O>(self, fut: Fut) -> IterationResult<O>
where
Fut: std::future::Future<Output = O>,
{
let Self {
started_at,
period,
kind,
} = self;
let mut fut = std::pin::pin!(fut);
// Wrap `fut` into a future that logs a message every `period` so that we get a
// very obvious breadcrumb in the logs _while_ a slow iteration is happening.
let liveness_logger = async move {
loop {
match tokio::time::timeout(period, &mut fut).await {
Ok(x) => return x,
Err(_) => {
// info level as per the same rationale why warn_when_period_overrun is info
// => https://github.com/neondatabase/neon/pull/5724
info!("still running");
}
}
}
};
let output = liveness_logger.await;
let elapsed = started_at.elapsed();
warn_when_period_overrun(elapsed, period, kind);
IterationResult { output, elapsed }
}
}
/// Attention: the `task` and `period` beocme labels of a pageserver-wide prometheus metric.
pub(crate) fn warn_when_period_overrun(
elapsed: Duration,

View File

@@ -10,6 +10,7 @@ use std::{
use arc_swap::ArcSwap;
use enumset::EnumSet;
use tracing::{error, warn};
use utils::leaky_bucket::{LeakyBucketConfig, RateLimiter};
use crate::{context::RequestContext, task_mgr::TaskKind};
@@ -33,8 +34,7 @@ pub struct Throttle<M: Metric> {
pub struct Inner {
task_kinds: EnumSet<TaskKind>,
rate_limiter: Arc<leaky_bucket::RateLimiter>,
config: Config,
rate_limiter: Arc<RateLimiter>,
}
pub type Config = pageserver_api::models::ThrottleConfig;
@@ -77,8 +77,7 @@ where
refill_interval,
refill_amount,
max,
fair,
} = &config;
} = config;
let task_kinds: EnumSet<TaskKind> = task_kinds
.iter()
.filter_map(|s| match TaskKind::from_str(s) {
@@ -93,18 +92,21 @@ where
}
})
.collect();
// steady rate, we expect `refill_amount` requests per `refill_interval`.
// dividing gives us the rps.
let rps = f64::from(refill_amount.get()) / refill_interval.as_secs_f64();
let config = LeakyBucketConfig::new(rps, f64::from(max));
// initial tracks how many tokens are available to put in the bucket
// we want how many tokens are currently in the bucket
let initial_tokens = max - initial;
let rate_limiter = RateLimiter::with_initial_tokens(config, f64::from(initial_tokens));
Inner {
task_kinds,
rate_limiter: Arc::new(
leaky_bucket::RateLimiter::builder()
.initial(*initial)
.interval(*refill_interval)
.refill(refill_amount.get())
.max(*max)
.fair(*fair)
.build(),
),
config,
rate_limiter: Arc::new(rate_limiter),
}
}
pub fn reconfigure(&self, config: Config) {
@@ -127,7 +129,7 @@ where
/// See [`Config::steady_rps`].
pub fn steady_rps(&self) -> f64 {
self.inner.load().config.steady_rps()
self.inner.load().rate_limiter.steady_rps()
}
pub async fn throttle(&self, ctx: &RequestContext, key_count: usize) -> Option<Duration> {
@@ -136,18 +138,9 @@ where
return None;
};
let start = std::time::Instant::now();
let mut did_throttle = false;
let acquire = inner.rate_limiter.acquire(key_count);
// turn off runtime-induced preemption (aka coop) so our `did_throttle` is accurate
let acquire = tokio::task::unconstrained(acquire);
let mut acquire = std::pin::pin!(acquire);
std::future::poll_fn(|cx| {
use std::future::Future;
let poll = acquire.as_mut().poll(cx);
did_throttle = did_throttle || poll.is_pending();
poll
})
.await;
let did_throttle = inner.rate_limiter.acquire(key_count).await;
self.count_accounted.fetch_add(1, Ordering::Relaxed);
if did_throttle {
self.count_throttled.fetch_add(1, Ordering::Relaxed);

View File

@@ -22,8 +22,8 @@ use handle::ShardTimelineId;
use once_cell::sync::Lazy;
use pageserver_api::{
key::{
KEY_SIZE, METADATA_KEY_BEGIN_PREFIX, METADATA_KEY_END_PREFIX, NON_INHERITED_RANGE,
NON_INHERITED_SPARSE_RANGE,
CompactKey, KEY_SIZE, METADATA_KEY_BEGIN_PREFIX, METADATA_KEY_END_PREFIX,
NON_INHERITED_RANGE, NON_INHERITED_SPARSE_RANGE,
},
keyspace::{KeySpaceAccum, KeySpaceRandomAccum, SparseKeyPartitioning},
models::{
@@ -44,10 +44,8 @@ use tokio::{
use tokio_util::sync::CancellationToken;
use tracing::*;
use utils::{
bin_ser::BeSer,
fs_ext, pausable_failpoint,
sync::gate::{Gate, GateGuard},
vec_map::VecMap,
};
use std::pin::pin;
@@ -71,7 +69,7 @@ use crate::{
config::defaults::DEFAULT_PITR_INTERVAL,
layer_map::{LayerMap, SearchResult},
metadata::TimelineMetadata,
storage_layer::PersistentLayerDesc,
storage_layer::{inmemory_layer::IndexEntry, PersistentLayerDesc},
},
walredo,
};
@@ -137,7 +135,10 @@ use self::layer_manager::LayerManager;
use self::logical_size::LogicalSize;
use self::walreceiver::{WalReceiver, WalReceiverConf};
use super::{config::TenantConf, storage_layer::LayerVisibilityHint, upload_queue::NotInitialized};
use super::{
config::TenantConf, storage_layer::inmemory_layer, storage_layer::LayerVisibilityHint,
upload_queue::NotInitialized,
};
use super::{debug_assert_current_span_has_tenant_and_timeline_id, AttachedTenantConf};
use super::{remote_timeline_client::index::IndexPart, storage_layer::LayerFringe};
use super::{
@@ -217,7 +218,7 @@ pub(crate) struct RelSizeCache {
}
pub struct Timeline {
conf: &'static PageServerConf,
pub(crate) conf: &'static PageServerConf,
tenant_conf: Arc<ArcSwap<AttachedTenantConf>>,
myself: Weak<Self>,
@@ -866,6 +867,11 @@ impl Timeline {
.map(|ancestor| ancestor.timeline_id)
}
/// Get the ancestor timeline
pub(crate) fn ancestor_timeline(&self) -> Option<&Arc<Timeline>> {
self.ancestor_timeline.as_ref()
}
/// Get the bytes written since the PITR cutoff on this branch, and
/// whether this branch's ancestor_lsn is within its parent's PITR.
pub(crate) fn get_pitr_history_stats(&self) -> (u64, bool) {
@@ -1906,6 +1912,8 @@ impl Timeline {
true
} else if projected_layer_size >= checkpoint_distance {
// NB: this check is relied upon by:
let _ = IndexEntry::validate_checkpoint_distance;
info!(
"Will roll layer at {} with layer size {} due to layer size ({})",
projected_lsn, layer_size, projected_layer_size
@@ -2233,6 +2241,11 @@ impl Timeline {
handles: Default::default(),
};
if aux_file_policy == Some(AuxFilePolicy::V1) {
warn!("this timeline is using deprecated aux file policy V1");
}
result.repartition_threshold =
result.get_checkpoint_distance() / REPARTITION_FREQ_IN_CHECKPOINT_DISTANCE;
@@ -2996,7 +3009,10 @@ impl Timeline {
// - For L1 & image layers, download most recent LSNs first: the older the LSN, the sooner
// the layer is likely to be covered by an image layer during compaction.
layers.sort_by_key(|(desc, _meta, _atime)| {
std::cmp::Reverse((!LayerMap::is_l0(&desc.key_range), desc.lsn_range.end))
std::cmp::Reverse((
!LayerMap::is_l0(&desc.key_range, desc.is_delta),
desc.lsn_range.end,
))
});
let layers = layers
@@ -3589,34 +3605,6 @@ impl Timeline {
return Err(FlushLayerError::Cancelled);
}
// FIXME(auxfilesv2): support multiple metadata key partitions might need initdb support as well?
// This code path will not be hit during regression tests. After #7099 we have a single partition
// with two key ranges. If someone wants to fix initdb optimization in the future, this might need
// to be fixed.
// For metadata, always create delta layers.
let delta_layer = if !metadata_partition.parts.is_empty() {
assert_eq!(
metadata_partition.parts.len(),
1,
"currently sparse keyspace should only contain a single metadata keyspace"
);
let metadata_keyspace = &metadata_partition.parts[0];
self.create_delta_layer(
&frozen_layer,
Some(
metadata_keyspace.0.ranges.first().unwrap().start
..metadata_keyspace.0.ranges.last().unwrap().end,
),
ctx,
)
.await
.map_err(|e| FlushLayerError::from_anyhow(self, e))?
} else {
None
};
// For image layers, we add them immediately into the layer map.
let mut layers_to_upload = Vec::new();
layers_to_upload.extend(
self.create_image_layers(
@@ -3627,13 +3615,27 @@ impl Timeline {
)
.await?,
);
if let Some(delta_layer) = delta_layer {
layers_to_upload.push(delta_layer.clone());
(layers_to_upload, Some(delta_layer))
} else {
(layers_to_upload, None)
if !metadata_partition.parts.is_empty() {
assert_eq!(
metadata_partition.parts.len(),
1,
"currently sparse keyspace should only contain a single metadata keyspace"
);
layers_to_upload.extend(
self.create_image_layers(
// Safety: create_image_layers treat sparse keyspaces differently that it does not scan
// every single key within the keyspace, and therefore, it's safe to force converting it
// into a dense keyspace before calling this function.
&metadata_partition.into_dense(),
self.initdb_lsn,
ImageLayerCreationMode::Initial,
ctx,
)
.await?,
);
}
(layers_to_upload, None)
} else {
// Normal case, write out a L0 delta layer file.
// `create_delta_layer` will not modify the layer map.
@@ -4043,8 +4045,6 @@ impl Timeline {
mode: ImageLayerCreationMode,
start: Key,
) -> Result<ImageLayerCreationOutcome, CreateImageLayersError> {
assert!(!matches!(mode, ImageLayerCreationMode::Initial));
// Metadata keys image layer creation.
let mut reconstruct_state = ValuesReconstructState::default();
let data = self
@@ -4210,15 +4210,13 @@ impl Timeline {
"metadata keys must be partitioned separately"
);
}
if mode == ImageLayerCreationMode::Initial {
return Err(CreateImageLayersError::Other(anyhow::anyhow!("no image layer should be created for metadata keys when flushing frozen layers")));
}
if mode == ImageLayerCreationMode::Try && !check_for_image_layers {
// Skip compaction if there are not enough updates. Metadata compaction will do a scan and
// might mess up with evictions.
start = img_range.end;
continue;
}
// For initial and force modes, we always generate image layers for metadata keys.
} else if let ImageLayerCreationMode::Try = mode {
// check_for_image_layers = false -> skip
// check_for_image_layers = true -> check time_for_new_image_layer -> skip/generate
@@ -4226,7 +4224,8 @@ impl Timeline {
start = img_range.end;
continue;
}
} else if let ImageLayerCreationMode::Force = mode {
}
if let ImageLayerCreationMode::Force = mode {
// When forced to create image layers, we might try and create them where they already
// exist. This mode is only used in tests/debug.
let layers = self.layers.read().await;
@@ -4240,6 +4239,7 @@ impl Timeline {
img_range.start,
img_range.end
);
start = img_range.end;
continue;
}
}
@@ -4537,7 +4537,6 @@ pub struct DeltaLayerTestDesc {
#[cfg(test)]
impl DeltaLayerTestDesc {
#[allow(dead_code)]
pub fn new(lsn_range: Range<Lsn>, key_range: Range<Key>, data: Vec<(Key, Lsn, Value)>) -> Self {
Self {
lsn_range,
@@ -4595,7 +4594,7 @@ impl Timeline {
// for compact_level0_phase1 creating an L0, which does not happen in practice
// because we have not implemented L0 => L0 compaction.
duplicated_layers.insert(l.layer_desc().key());
} else if LayerMap::is_l0(&l.layer_desc().key_range) {
} else if LayerMap::is_l0(&l.layer_desc().key_range, l.layer_desc().is_delta) {
return Err(CompactionError::Other(anyhow::anyhow!("compaction generates a L0 layer file as output, which will cause infinite compaction.")));
} else {
insert_layers.push(l.clone());
@@ -5451,12 +5450,17 @@ impl Timeline {
!(a.end <= b.start || b.end <= a.start)
}
let guard = self.layers.read().await;
for layer in guard.layer_map()?.iter_historic_layers() {
if layer.is_delta()
&& overlaps_with(&layer.lsn_range, &deltas.lsn_range)
&& layer.lsn_range != deltas.lsn_range
{
if deltas.key_range.start.next() != deltas.key_range.end {
let guard = self.layers.read().await;
let mut invalid_layers =
guard.layer_map()?.iter_historic_layers().filter(|layer| {
layer.is_delta()
&& overlaps_with(&layer.lsn_range, &deltas.lsn_range)
&& layer.lsn_range != deltas.lsn_range
// skip single-key layer files
&& layer.key_range.start.next() != layer.key_range.end
});
if let Some(layer) = invalid_layers.next() {
// If a delta layer overlaps with another delta layer AND their LSN range is not the same, panic
panic!(
"inserted layer violates delta layer LSN invariant: current_lsn_range={}..{}, conflict_lsn_range={}..{}",
@@ -5590,44 +5594,6 @@ enum OpenLayerAction {
}
impl<'a> TimelineWriter<'a> {
/// Put a new page version that can be constructed from a WAL record
///
/// This will implicitly extend the relation, if the page is beyond the
/// current end-of-file.
pub(crate) async fn put(
&mut self,
key: Key,
lsn: Lsn,
value: &Value,
ctx: &RequestContext,
) -> anyhow::Result<()> {
// Avoid doing allocations for "small" values.
// In the regression test suite, the limit of 256 avoided allocations in 95% of cases:
// https://github.com/neondatabase/neon/pull/5056#discussion_r1301975061
let mut buf = smallvec::SmallVec::<[u8; 256]>::new();
value.ser_into(&mut buf)?;
let buf_size: u64 = buf.len().try_into().expect("oversized value buf");
let action = self.get_open_layer_action(lsn, buf_size);
let layer = self.handle_open_layer_action(lsn, action, ctx).await?;
let res = layer.put_value(key.to_compact(), lsn, &buf, ctx).await;
if res.is_ok() {
// Update the current size only when the entire write was ok.
// In case of failures, we may have had partial writes which
// render the size tracking out of sync. That's ok because
// the checkpoint distance should be significantly smaller
// than the S3 single shot upload limit of 5GiB.
let state = self.write_guard.as_mut().unwrap();
state.current_size += buf_size;
state.prev_lsn = Some(lsn);
state.max_lsn = std::cmp::max(state.max_lsn, Some(lsn));
}
res
}
async fn handle_open_layer_action(
&mut self,
at: Lsn,
@@ -5733,18 +5699,64 @@ impl<'a> TimelineWriter<'a> {
}
/// Put a batch of keys at the specified Lsns.
///
/// The batch is sorted by Lsn (enforced by usage of [`utils::vec_map::VecMap`].
pub(crate) async fn put_batch(
&mut self,
batch: VecMap<Lsn, (Key, Value)>,
batch: Vec<(CompactKey, Lsn, usize, Value)>,
ctx: &RequestContext,
) -> anyhow::Result<()> {
for (lsn, (key, val)) in batch {
self.put(key, lsn, &val, ctx).await?
if batch.is_empty() {
return Ok(());
}
Ok(())
let serialized_batch = inmemory_layer::SerializedBatch::from_values(batch)?;
let batch_max_lsn = serialized_batch.max_lsn;
let buf_size: u64 = serialized_batch.raw.len() as u64;
let action = self.get_open_layer_action(batch_max_lsn, buf_size);
let layer = self
.handle_open_layer_action(batch_max_lsn, action, ctx)
.await?;
let res = layer.put_batch(serialized_batch, ctx).await;
if res.is_ok() {
// Update the current size only when the entire write was ok.
// In case of failures, we may have had partial writes which
// render the size tracking out of sync. That's ok because
// the checkpoint distance should be significantly smaller
// than the S3 single shot upload limit of 5GiB.
let state = self.write_guard.as_mut().unwrap();
state.current_size += buf_size;
state.prev_lsn = Some(batch_max_lsn);
state.max_lsn = std::cmp::max(state.max_lsn, Some(batch_max_lsn));
}
res
}
#[cfg(test)]
/// Test helper, for tests that would like to poke individual values without composing a batch
pub(crate) async fn put(
&mut self,
key: Key,
lsn: Lsn,
value: &Value,
ctx: &RequestContext,
) -> anyhow::Result<()> {
use utils::bin_ser::BeSer;
if !key.is_valid_key_on_write_path() {
bail!(
"the request contains data not supported by pageserver at TimelineWriter::put: {}",
key
);
}
let val_ser_size = value.serialized_size().unwrap() as usize;
self.put_batch(
vec![(key.to_compact(), lsn, val_ser_size, value.clone())],
ctx,
)
.await
}
pub(crate) async fn delete_batch(
@@ -5885,7 +5897,7 @@ mod tests {
};
// Apart from L0s, newest Layers should come first
if !LayerMap::is_l0(layer.name.key_range()) {
if !LayerMap::is_l0(layer.name.key_range(), layer.name.is_delta()) {
assert!(layer_lsn <= last_lsn);
last_lsn = layer_lsn;
}

View File

@@ -14,7 +14,7 @@ use super::{
RecordedDuration, Timeline,
};
use anyhow::{anyhow, Context};
use anyhow::{anyhow, bail, Context};
use bytes::Bytes;
use enumset::EnumSet;
use fail::fail_point;
@@ -32,6 +32,9 @@ use crate::page_cache;
use crate::tenant::config::defaults::{DEFAULT_CHECKPOINT_DISTANCE, DEFAULT_COMPACTION_THRESHOLD};
use crate::tenant::remote_timeline_client::WaitCompletionError;
use crate::tenant::storage_layer::merge_iterator::MergeIterator;
use crate::tenant::storage_layer::split_writer::{
SplitDeltaLayerWriter, SplitImageLayerWriter, SplitWriterResult,
};
use crate::tenant::storage_layer::{
AsLayerDesc, PersistentLayerDesc, PersistentLayerKey, ValueReconstructState,
};
@@ -71,15 +74,60 @@ pub(crate) struct KeyHistoryRetention {
}
impl KeyHistoryRetention {
/// Hack: skip delta layer if we need to produce a layer of a same key-lsn.
///
/// This can happen if we have removed some deltas in "the middle" of some existing layer's key-lsn-range.
/// For example, consider the case where a single delta with range [0x10,0x50) exists.
/// And we have branches at LSN 0x10, 0x20, 0x30.
/// Then we delete branch @ 0x20.
/// Bottom-most compaction may now delete the delta [0x20,0x30).
/// And that wouldnt' change the shape of the layer.
///
/// Note that bottom-most-gc-compaction never _adds_ new data in that case, only removes.
///
/// `discard_key` will only be called when the writer reaches its target (instead of for every key), so it's fine to grab a lock inside.
async fn discard_key(key: &PersistentLayerKey, tline: &Arc<Timeline>, dry_run: bool) -> bool {
if dry_run {
return true;
}
let guard = tline.layers.read().await;
if !guard.contains_key(key) {
return false;
}
let layer_generation = guard.get_from_key(key).metadata().generation;
drop(guard);
if layer_generation == tline.generation {
info!(
key=%key,
?layer_generation,
"discard layer due to duplicated layer key in the same generation",
);
true
} else {
false
}
}
/// Pipe a history of a single key to the writers.
///
/// If `image_writer` is none, the images will be placed into the delta layers.
/// The delta writer will contain all images and deltas (below and above the horizon) except the bottom-most images.
#[allow(clippy::too_many_arguments)]
async fn pipe_to(
self,
key: Key,
delta_writer: &mut Vec<(Key, Lsn, Value)>,
mut image_writer: Option<&mut ImageLayerWriter>,
tline: &Arc<Timeline>,
delta_writer: &mut SplitDeltaLayerWriter,
mut image_writer: Option<&mut SplitImageLayerWriter>,
stat: &mut CompactionStatistics,
dry_run: bool,
ctx: &RequestContext,
) -> anyhow::Result<()> {
let mut first_batch = true;
let discard = |key: &PersistentLayerKey| {
let key = key.clone();
async move { Self::discard_key(&key, tline, dry_run).await }
};
for (cutoff_lsn, KeyLogAtLsn(logs)) in self.below_horizon {
if first_batch {
if logs.len() == 1 && logs[0].1.is_image() {
@@ -88,28 +136,45 @@ impl KeyHistoryRetention {
};
stat.produce_image_key(img);
if let Some(image_writer) = image_writer.as_mut() {
image_writer.put_image(key, img.clone(), ctx).await?;
image_writer
.put_image_with_discard_fn(key, img.clone(), tline, ctx, discard)
.await?;
} else {
delta_writer.push((key, cutoff_lsn, Value::Image(img.clone())));
delta_writer
.put_value_with_discard_fn(
key,
cutoff_lsn,
Value::Image(img.clone()),
tline,
ctx,
discard,
)
.await?;
}
} else {
for (lsn, val) in logs {
stat.produce_key(&val);
delta_writer.push((key, lsn, val));
delta_writer
.put_value_with_discard_fn(key, lsn, val, tline, ctx, discard)
.await?;
}
}
first_batch = false;
} else {
for (lsn, val) in logs {
stat.produce_key(&val);
delta_writer.push((key, lsn, val));
delta_writer
.put_value_with_discard_fn(key, lsn, val, tline, ctx, discard)
.await?;
}
}
}
let KeyLogAtLsn(above_horizon_logs) = self.above_horizon;
for (lsn, val) in above_horizon_logs {
stat.produce_key(&val);
delta_writer.push((key, lsn, val));
delta_writer
.put_value_with_discard_fn(key, lsn, val, tline, ctx, discard)
.await?;
}
Ok(())
}
@@ -1814,11 +1879,27 @@ impl Timeline {
}
let mut selected_layers = Vec::new();
drop(gc_info);
// Pick all the layers intersect or below the gc_cutoff, get the largest LSN in the selected layers.
let Some(max_layer_lsn) = layers
.iter_historic_layers()
.filter(|desc| desc.get_lsn_range().start <= gc_cutoff)
.map(|desc| desc.get_lsn_range().end)
.max()
else {
info!("no layers to compact with gc");
return Ok(());
};
// Then, pick all the layers that are below the max_layer_lsn. This is to ensure we can pick all single-key
// layers to compact.
for desc in layers.iter_historic_layers() {
if desc.get_lsn_range().start <= gc_cutoff {
if desc.get_lsn_range().end <= max_layer_lsn {
selected_layers.push(guard.get_from_desc(&desc));
}
}
if selected_layers.is_empty() {
info!("no layers to compact with gc");
return Ok(());
}
retain_lsns_below_horizon.sort();
(selected_layers, gc_cutoff, retain_lsns_below_horizon)
};
@@ -1848,27 +1929,53 @@ impl Timeline {
lowest_retain_lsn
);
// Step 1: (In the future) construct a k-merge iterator over all layers. For now, simply collect all keys + LSNs.
// Also, collect the layer information to decide when to split the new delta layers.
let mut downloaded_layers = Vec::new();
let mut delta_split_points = BTreeSet::new();
// Also, verify if the layer map can be split by drawing a horizontal line at every LSN start/end split point.
let mut lsn_split_point = BTreeSet::new(); // TODO: use a better data structure (range tree / range set?)
for layer in &layer_selection {
let resident_layer = layer.download_and_keep_resident().await?;
downloaded_layers.push(resident_layer);
let desc = layer.layer_desc();
if desc.is_delta() {
// TODO: is it correct to only record split points for deltas intersecting with the GC horizon? (exclude those below/above the horizon)
// so that we can avoid having too many small delta layers.
let key_range = desc.get_key_range();
delta_split_points.insert(key_range.start);
delta_split_points.insert(key_range.end);
// ignore single-key layer files
if desc.key_range.start.next() != desc.key_range.end {
let lsn_range = &desc.lsn_range;
lsn_split_point.insert(lsn_range.start);
lsn_split_point.insert(lsn_range.end);
}
stat.visit_delta_layer(desc.file_size());
} else {
stat.visit_image_layer(desc.file_size());
}
}
for layer in &layer_selection {
let desc = layer.layer_desc();
let key_range = &desc.key_range;
if desc.is_delta() && key_range.start.next() != key_range.end {
let lsn_range = desc.lsn_range.clone();
let intersects = lsn_split_point.range(lsn_range).collect_vec();
if intersects.len() > 1 {
bail!(
"cannot run gc-compaction because it violates the layer map LSN split assumption: layer {} intersects with LSN [{}]",
desc.key(),
intersects.into_iter().map(|lsn| lsn.to_string()).join(", ")
);
}
}
}
// The maximum LSN we are processing in this compaction loop
let end_lsn = layer_selection
.iter()
.map(|l| l.layer_desc().lsn_range.end)
.max()
.unwrap();
// We don't want any of the produced layers to cover the full key range (i.e., MIN..MAX) b/c it will then be recognized
// as an L0 layer.
let hack_end_key = Key::NON_L0_MAX;
let mut delta_layers = Vec::new();
let mut image_layers = Vec::new();
let mut downloaded_layers = Vec::new();
for layer in &layer_selection {
let resident_layer = layer.download_and_keep_resident().await?;
downloaded_layers.push(resident_layer);
}
for resident_layer in &downloaded_layers {
if resident_layer.layer_desc().is_delta() {
let layer = resident_layer.get_as_delta(ctx).await?;
@@ -1884,138 +1991,17 @@ impl Timeline {
let mut accumulated_values = Vec::new();
let mut last_key: Option<Key> = None;
enum FlushDeltaResult {
/// Create a new resident layer
CreateResidentLayer(ResidentLayer),
/// Keep an original delta layer
KeepLayer(PersistentLayerKey),
}
#[allow(clippy::too_many_arguments)]
async fn flush_deltas(
deltas: &mut Vec<(Key, Lsn, crate::repository::Value)>,
last_key: Key,
delta_split_points: &[Key],
current_delta_split_point: &mut usize,
tline: &Arc<Timeline>,
lowest_retain_lsn: Lsn,
ctx: &RequestContext,
stats: &mut CompactionStatistics,
dry_run: bool,
last_batch: bool,
) -> anyhow::Result<Option<FlushDeltaResult>> {
// Check if we need to split the delta layer. We split at the original delta layer boundary to avoid
// overlapping layers.
//
// If we have a structure like this:
//
// | Delta 1 | | Delta 4 |
// |---------| Delta 2 |---------|
// | Delta 3 | | Delta 5 |
//
// And we choose to compact delta 2+3+5. We will get an overlapping delta layer with delta 1+4.
// A simple solution here is to split the delta layers using the original boundary, while this
// might produce a lot of small layers. This should be improved and fixed in the future.
let mut need_split = false;
while *current_delta_split_point < delta_split_points.len()
&& last_key >= delta_split_points[*current_delta_split_point]
{
*current_delta_split_point += 1;
need_split = true;
}
if !need_split && !last_batch {
return Ok(None);
}
let deltas: Vec<(Key, Lsn, Value)> = std::mem::take(deltas);
if deltas.is_empty() {
return Ok(None);
}
let end_lsn = deltas.iter().map(|(_, lsn, _)| lsn).max().copied().unwrap() + 1;
let delta_key = PersistentLayerKey {
key_range: {
let key_start = deltas.first().unwrap().0;
let key_end = deltas.last().unwrap().0.next();
key_start..key_end
},
lsn_range: lowest_retain_lsn..end_lsn,
is_delta: true,
};
{
// Hack: skip delta layer if we need to produce a layer of a same key-lsn.
//
// This can happen if we have removed some deltas in "the middle" of some existing layer's key-lsn-range.
// For example, consider the case where a single delta with range [0x10,0x50) exists.
// And we have branches at LSN 0x10, 0x20, 0x30.
// Then we delete branch @ 0x20.
// Bottom-most compaction may now delete the delta [0x20,0x30).
// And that wouldnt' change the shape of the layer.
//
// Note that bottom-most-gc-compaction never _adds_ new data in that case, only removes.
// That's why it's safe to skip.
let guard = tline.layers.read().await;
if guard.contains_key(&delta_key) {
let layer_generation = guard.get_from_key(&delta_key).metadata().generation;
drop(guard);
if layer_generation == tline.generation {
stats.discard_delta_layer();
// TODO: depending on whether we design this compaction process to run along with
// other compactions, there could be layer map modifications after we drop the
// layer guard, and in case it creates duplicated layer key, we will still error
// in the end.
info!(
key=%delta_key,
?layer_generation,
"discard delta layer due to duplicated layer in the same generation"
);
return Ok(Some(FlushDeltaResult::KeepLayer(delta_key)));
}
}
}
let mut delta_layer_writer = DeltaLayerWriter::new(
tline.conf,
tline.timeline_id,
tline.tenant_shard_id,
delta_key.key_range.start,
lowest_retain_lsn..end_lsn,
ctx,
)
.await?;
for (key, lsn, val) in deltas {
delta_layer_writer.put_value(key, lsn, val, ctx).await?;
}
stats.produce_delta_layer(delta_layer_writer.size());
if dry_run {
return Ok(None);
}
let (desc, path) = delta_layer_writer
.finish(delta_key.key_range.end, ctx)
.await?;
let delta_layer = Layer::finish_creating(tline.conf, tline, desc, &path)?;
Ok(Some(FlushDeltaResult::CreateResidentLayer(delta_layer)))
}
// Hack the key range to be min..(max-1). Otherwise, the image layer will be
// interpreted as an L0 delta layer.
let hack_image_layer_range = {
let mut end_key = Key::MAX;
end_key.field6 -= 1;
Key::MIN..end_key
};
// Only create image layers when there is no ancestor branches. TODO: create covering image layer
// when some condition meet.
let mut image_layer_writer = if self.ancestor_timeline.is_none() {
Some(
ImageLayerWriter::new(
SplitImageLayerWriter::new(
self.conf,
self.timeline_id,
self.tenant_shard_id,
&hack_image_layer_range, // covers the full key range
Key::MIN,
lowest_retain_lsn,
self.get_compaction_target_size(),
ctx,
)
.await?,
@@ -2024,6 +2010,17 @@ impl Timeline {
None
};
let mut delta_layer_writer = SplitDeltaLayerWriter::new(
self.conf,
self.timeline_id,
self.tenant_shard_id,
Key::MIN,
lowest_retain_lsn..end_lsn,
self.get_compaction_target_size(),
ctx,
)
.await?;
/// Returns None if there is no ancestor branch. Throw an error when the key is not found.
///
/// Currently, we always get the ancestor image for each key in the child branch no matter whether the image
@@ -2044,47 +2041,11 @@ impl Timeline {
let img = tline.get(key, tline.ancestor_lsn, ctx).await?;
Ok(Some((key, tline.ancestor_lsn, img)))
}
let image_layer_key = PersistentLayerKey {
key_range: hack_image_layer_range,
lsn_range: PersistentLayerDesc::image_layer_lsn_range(lowest_retain_lsn),
is_delta: false,
};
// Like with delta layers, it can happen that we re-produce an already existing image layer.
// This could happen when a user triggers force compaction and image generation. In this case,
// it's always safe to rewrite the layer.
let discard_image_layer = {
let guard = self.layers.read().await;
if guard.contains_key(&image_layer_key) {
let layer_generation = guard.get_from_key(&image_layer_key).metadata().generation;
drop(guard);
if layer_generation == self.generation {
// TODO: depending on whether we design this compaction process to run along with
// other compactions, there could be layer map modifications after we drop the
// layer guard, and in case it creates duplicated layer key, we will still error
// in the end.
info!(
key=%image_layer_key,
?layer_generation,
"discard image layer due to duplicated layer key in the same generation",
);
true
} else {
false
}
} else {
false
}
};
// Actually, we can decide not to write to the image layer at all at this point because
// the key and LSN range are determined. However, to keep things simple here, we still
// create this writer, and discard the writer in the end.
let mut delta_values = Vec::new();
let delta_split_points = delta_split_points.into_iter().collect_vec();
let mut current_delta_split_point = 0;
let mut delta_layers = Vec::new();
while let Some((key, lsn, val)) = merge_iter.next().await? {
if cancel.is_cancelled() {
return Err(anyhow!("cancelled")); // TODO: refactor to CompactionError and pass cancel error
@@ -2115,27 +2076,14 @@ impl Timeline {
retention
.pipe_to(
*last_key,
&mut delta_values,
self,
&mut delta_layer_writer,
image_layer_writer.as_mut(),
&mut stat,
dry_run,
ctx,
)
.await?;
delta_layers.extend(
flush_deltas(
&mut delta_values,
*last_key,
&delta_split_points,
&mut current_delta_split_point,
self,
lowest_retain_lsn,
ctx,
&mut stat,
dry_run,
false,
)
.await?,
);
accumulated_values.clear();
*last_key = key;
accumulated_values.push((key, lsn, val));
@@ -2159,43 +2107,75 @@ impl Timeline {
retention
.pipe_to(
last_key,
&mut delta_values,
self,
&mut delta_layer_writer,
image_layer_writer.as_mut(),
&mut stat,
dry_run,
ctx,
)
.await?;
delta_layers.extend(
flush_deltas(
&mut delta_values,
last_key,
&delta_split_points,
&mut current_delta_split_point,
self,
lowest_retain_lsn,
ctx,
&mut stat,
dry_run,
true,
)
.await?,
);
assert!(delta_values.is_empty(), "unprocessed keys");
let image_layer = if discard_image_layer {
stat.discard_image_layer();
None
} else if let Some(writer) = image_layer_writer {
stat.produce_image_layer(writer.size());
let discard = |key: &PersistentLayerKey| {
let key = key.clone();
async move { KeyHistoryRetention::discard_key(&key, self, dry_run).await }
};
let produced_image_layers = if let Some(writer) = image_layer_writer {
if !dry_run {
Some(writer.finish(self, ctx).await?)
writer
.finish_with_discard_fn(self, ctx, hack_end_key, discard)
.await?
} else {
None
let (layers, _) = writer.take()?;
assert!(layers.is_empty(), "image layers produced in dry run mode?");
Vec::new()
}
} else {
None
Vec::new()
};
let produced_delta_layers = if !dry_run {
delta_layer_writer
.finish_with_discard_fn(self, ctx, hack_end_key, discard)
.await?
} else {
let (layers, _) = delta_layer_writer.take()?;
assert!(layers.is_empty(), "delta layers produced in dry run mode?");
Vec::new()
};
let mut compact_to = Vec::new();
let mut keep_layers = HashSet::new();
let produced_delta_layers_len = produced_delta_layers.len();
let produced_image_layers_len = produced_image_layers.len();
for action in produced_delta_layers {
match action {
SplitWriterResult::Produced(layer) => {
stat.produce_delta_layer(layer.layer_desc().file_size());
compact_to.push(layer);
}
SplitWriterResult::Discarded(l) => {
keep_layers.insert(l);
stat.discard_delta_layer();
}
}
}
for action in produced_image_layers {
match action {
SplitWriterResult::Produced(layer) => {
stat.produce_image_layer(layer.layer_desc().file_size());
compact_to.push(layer);
}
SplitWriterResult::Discarded(l) => {
keep_layers.insert(l);
stat.discard_image_layer();
}
}
}
let mut layer_selection = layer_selection;
layer_selection.retain(|x| !keep_layers.contains(&x.layer_desc().key()));
info!(
"gc-compaction statistics: {}",
serde_json::to_string(&stat)?
@@ -2206,28 +2186,11 @@ impl Timeline {
}
info!(
"produced {} delta layers and {} image layers",
delta_layers.len(),
if image_layer.is_some() { 1 } else { 0 }
"produced {} delta layers and {} image layers, {} layers are kept",
produced_delta_layers_len,
produced_image_layers_len,
layer_selection.len()
);
let mut compact_to = Vec::new();
let mut keep_layers = HashSet::new();
for action in delta_layers {
match action {
FlushDeltaResult::CreateResidentLayer(layer) => {
compact_to.push(layer);
}
FlushDeltaResult::KeepLayer(l) => {
keep_layers.insert(l);
}
}
}
if discard_image_layer {
keep_layers.insert(image_layer_key);
}
let mut layer_selection = layer_selection;
layer_selection.retain(|x| !keep_layers.contains(&x.layer_desc().key()));
compact_to.extend(image_layer);
// Step 3: Place back to the layer map.
{

View File

@@ -27,8 +27,8 @@ use super::TaskStateUpdate;
use crate::{
context::RequestContext,
metrics::{LIVE_CONNECTIONS, WALRECEIVER_STARTED_CONNECTIONS, WAL_INGEST},
task_mgr::TaskKind,
task_mgr::WALRECEIVER_RUNTIME,
pgdatadir_mapping::DatadirModification,
task_mgr::{TaskKind, WALRECEIVER_RUNTIME},
tenant::{debug_assert_current_span_has_tenant_and_timeline_id, Timeline, WalReceiverInfo},
walingest::WalIngest,
walrecord::DecodedWALRecord,
@@ -345,7 +345,10 @@ pub(super) async fn handle_walreceiver_connection(
// Commit every ingest_batch_size records. Even if we filtered out
// all records, we still need to call commit to advance the LSN.
uncommitted_records += 1;
if uncommitted_records >= ingest_batch_size {
if uncommitted_records >= ingest_batch_size
|| modification.approx_pending_bytes()
> DatadirModification::MAX_PENDING_BYTES
{
WAL_INGEST
.records_committed
.inc_by(uncommitted_records - filtered_records);

View File

@@ -27,7 +27,7 @@ use utils::vec_map::VecMap;
use crate::context::RequestContext;
use crate::tenant::blob_io::{BYTE_UNCOMPRESSED, BYTE_ZSTD, LEN_COMPRESSION_BIT_MASK};
use crate::virtual_file::VirtualFile;
use crate::virtual_file::{self, VirtualFile};
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct MaxVectoredReadBytes(pub NonZeroUsize);
@@ -60,7 +60,7 @@ pub struct VectoredBlobsBuf {
pub struct VectoredRead {
pub start: u64,
pub end: u64,
/// Starting offsets and metadata for each blob in this read
/// Start offset and metadata for each blob in this read
pub blobs_at: VecMap<u64, BlobMeta>,
}
@@ -76,14 +76,109 @@ pub(crate) enum VectoredReadExtended {
No,
}
pub(crate) struct VectoredReadBuilder {
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum VectoredReadCoalesceMode {
/// Only coalesce exactly adjacent reads.
AdjacentOnly,
/// In addition to adjacent reads, also consider reads whose corresponding
/// `end` and `start` offsets reside at the same chunk.
Chunked(usize),
}
impl VectoredReadCoalesceMode {
/// [`AdjacentVectoredReadBuilder`] is used if alignment requirement is 0,
/// whereas [`ChunkedVectoredReadBuilder`] is used for alignment requirement 1 and higher.
pub(crate) fn get() -> Self {
let align = virtual_file::get_io_buffer_alignment_raw();
if align == 0 {
VectoredReadCoalesceMode::AdjacentOnly
} else {
VectoredReadCoalesceMode::Chunked(align)
}
}
}
pub(crate) enum VectoredReadBuilder {
Adjacent(AdjacentVectoredReadBuilder),
Chunked(ChunkedVectoredReadBuilder),
}
impl VectoredReadBuilder {
fn new_impl(
start_offset: u64,
end_offset: u64,
meta: BlobMeta,
max_read_size: Option<usize>,
mode: VectoredReadCoalesceMode,
) -> Self {
match mode {
VectoredReadCoalesceMode::AdjacentOnly => Self::Adjacent(
AdjacentVectoredReadBuilder::new(start_offset, end_offset, meta, max_read_size),
),
VectoredReadCoalesceMode::Chunked(chunk_size) => {
Self::Chunked(ChunkedVectoredReadBuilder::new(
start_offset,
end_offset,
meta,
max_read_size,
chunk_size,
))
}
}
}
pub(crate) fn new(
start_offset: u64,
end_offset: u64,
meta: BlobMeta,
max_read_size: usize,
mode: VectoredReadCoalesceMode,
) -> Self {
Self::new_impl(start_offset, end_offset, meta, Some(max_read_size), mode)
}
pub(crate) fn new_streaming(
start_offset: u64,
end_offset: u64,
meta: BlobMeta,
mode: VectoredReadCoalesceMode,
) -> Self {
Self::new_impl(start_offset, end_offset, meta, None, mode)
}
pub(crate) fn extend(&mut self, start: u64, end: u64, meta: BlobMeta) -> VectoredReadExtended {
match self {
VectoredReadBuilder::Adjacent(builder) => builder.extend(start, end, meta),
VectoredReadBuilder::Chunked(builder) => builder.extend(start, end, meta),
}
}
pub(crate) fn build(self) -> VectoredRead {
match self {
VectoredReadBuilder::Adjacent(builder) => builder.build(),
VectoredReadBuilder::Chunked(builder) => builder.build(),
}
}
pub(crate) fn size(&self) -> usize {
match self {
VectoredReadBuilder::Adjacent(builder) => builder.size(),
VectoredReadBuilder::Chunked(builder) => builder.size(),
}
}
}
pub(crate) struct AdjacentVectoredReadBuilder {
/// Start offset of the read.
start: u64,
// End offset of the read.
end: u64,
/// Start offset and metadata for each blob in this read
blobs_at: VecMap<u64, BlobMeta>,
max_read_size: Option<usize>,
}
impl VectoredReadBuilder {
impl AdjacentVectoredReadBuilder {
/// Start building a new vectored read.
///
/// Note that by design, this does not check against reading more than `max_read_size` to
@@ -93,7 +188,7 @@ impl VectoredReadBuilder {
start_offset: u64,
end_offset: u64,
meta: BlobMeta,
max_read_size: usize,
max_read_size: Option<usize>,
) -> Self {
let mut blobs_at = VecMap::default();
blobs_at
@@ -104,7 +199,7 @@ impl VectoredReadBuilder {
start: start_offset,
end: end_offset,
blobs_at,
max_read_size: Some(max_read_size),
max_read_size,
}
}
/// Attempt to extend the current read with a new blob if the start
@@ -113,13 +208,15 @@ impl VectoredReadBuilder {
pub(crate) fn extend(&mut self, start: u64, end: u64, meta: BlobMeta) -> VectoredReadExtended {
tracing::trace!(start, end, "trying to extend");
let size = (end - start) as usize;
if self.end == start && {
let not_limited_by_max_read_size = {
if let Some(max_read_size) = self.max_read_size {
self.size() + size <= max_read_size
} else {
true
}
} {
};
if self.end == start && not_limited_by_max_read_size {
self.end = end;
self.blobs_at
.append(start, meta)
@@ -144,6 +241,107 @@ impl VectoredReadBuilder {
}
}
pub(crate) struct ChunkedVectoredReadBuilder {
/// Start block number
start_blk_no: usize,
/// End block number (exclusive).
end_blk_no: usize,
/// Start offset and metadata for each blob in this read
blobs_at: VecMap<u64, BlobMeta>,
max_read_size: Option<usize>,
/// Chunk size reads are coalesced into.
chunk_size: usize,
}
/// Computes x / d rounded up.
fn div_round_up(x: usize, d: usize) -> usize {
(x + (d - 1)) / d
}
impl ChunkedVectoredReadBuilder {
/// Start building a new vectored read.
///
/// Note that by design, this does not check against reading more than `max_read_size` to
/// support reading larger blobs than the configuration value. The builder will be single use
/// however after that.
pub(crate) fn new(
start_offset: u64,
end_offset: u64,
meta: BlobMeta,
max_read_size: Option<usize>,
chunk_size: usize,
) -> Self {
let mut blobs_at = VecMap::default();
blobs_at
.append(start_offset, meta)
.expect("First insertion always succeeds");
let start_blk_no = start_offset as usize / chunk_size;
let end_blk_no = div_round_up(end_offset as usize, chunk_size);
Self {
start_blk_no,
end_blk_no,
blobs_at,
max_read_size,
chunk_size,
}
}
/// Attempts to extend the current read with a new blob if the new blob resides in the same or the immediate next chunk.
///
/// The resulting size also must be below the max read size.
pub(crate) fn extend(&mut self, start: u64, end: u64, meta: BlobMeta) -> VectoredReadExtended {
tracing::trace!(start, end, "trying to extend");
let start_blk_no = start as usize / self.chunk_size;
let end_blk_no = div_round_up(end as usize, self.chunk_size);
let not_limited_by_max_read_size = {
if let Some(max_read_size) = self.max_read_size {
let coalesced_size = (end_blk_no - self.start_blk_no) * self.chunk_size;
coalesced_size <= max_read_size
} else {
true
}
};
// True if the second block starts in the same block or the immediate next block where the first block ended.
//
// Note: This automatically handles the case where two blocks are adjacent to each other,
// whether they starts on chunk size boundary or not.
let is_adjacent_chunk_read = {
// 1. first.end & second.start are in the same block
self.end_blk_no == start_blk_no + 1 ||
// 2. first.end ends one block before second.start
self.end_blk_no == start_blk_no
};
if is_adjacent_chunk_read && not_limited_by_max_read_size {
self.end_blk_no = end_blk_no;
self.blobs_at
.append(start, meta)
.expect("LSNs are ordered within vectored reads");
return VectoredReadExtended::Yes;
}
VectoredReadExtended::No
}
pub(crate) fn size(&self) -> usize {
(self.end_blk_no - self.start_blk_no) * self.chunk_size
}
pub(crate) fn build(self) -> VectoredRead {
let start = (self.start_blk_no * self.chunk_size) as u64;
let end = (self.end_blk_no * self.chunk_size) as u64;
VectoredRead {
start,
end,
blobs_at: self.blobs_at,
}
}
}
#[derive(Copy, Clone, Debug)]
pub enum BlobFlag {
None,
@@ -166,14 +364,18 @@ pub struct VectoredReadPlanner {
prev: Option<(Key, Lsn, u64, BlobFlag)>,
max_read_size: usize,
mode: VectoredReadCoalesceMode,
}
impl VectoredReadPlanner {
pub fn new(max_read_size: usize) -> Self {
let mode = VectoredReadCoalesceMode::get();
Self {
blobs: BTreeMap::new(),
prev: None,
max_read_size,
mode,
}
}
@@ -252,6 +454,7 @@ impl VectoredReadPlanner {
end_offset,
BlobMeta { key, lsn },
self.max_read_size,
self.mode,
);
let prev_read_builder = current_read_builder.replace(next_read_builder);
@@ -303,6 +506,18 @@ impl<'a> VectoredBlobReader<'a> {
read.size(),
buf.capacity()
);
if cfg!(debug_assertions) {
let align = virtual_file::get_io_buffer_alignment() as u64;
debug_assert_eq!(
read.start % align,
0,
"Read start at {} does not satisfy the required io buffer alignment ({} bytes)",
read.start,
align
);
}
let mut buf = self
.file
.read_exact_at(buf.slice(0..read.size()), read.start, ctx)
@@ -310,27 +525,20 @@ impl<'a> VectoredBlobReader<'a> {
.into_inner();
let blobs_at = read.blobs_at.as_slice();
let start_offset = blobs_at.first().expect("VectoredRead is never empty").0;
let start_offset = read.start;
let mut metas = Vec::with_capacity(blobs_at.len());
// Blobs in `read` only provide their starting offset. The end offset
// of a blob is implicit: the start of the next blob if one exists
// or the end of the read.
let pairs = blobs_at.iter().zip(
blobs_at
.iter()
.map(Some)
.skip(1)
.chain(std::iter::once(None)),
);
// Some scratch space, put here for reusing the allocation
let mut decompressed_vec = Vec::new();
for ((offset, meta), next) in pairs {
let offset_in_buf = offset - start_offset;
let first_len_byte = buf[offset_in_buf as usize];
for (blob_start, meta) in blobs_at {
let blob_start_in_buf = blob_start - start_offset;
let first_len_byte = buf[blob_start_in_buf as usize];
// Each blob is prefixed by a header containing its size and compression information.
// Extract the size and skip that header to find the start of the data.
@@ -340,7 +548,7 @@ impl<'a> VectoredBlobReader<'a> {
(1, first_len_byte as u64, BYTE_UNCOMPRESSED)
} else {
let mut blob_size_buf = [0u8; 4];
let offset_in_buf = offset_in_buf as usize;
let offset_in_buf = blob_start_in_buf as usize;
blob_size_buf.copy_from_slice(&buf[offset_in_buf..offset_in_buf + 4]);
blob_size_buf[0] &= !LEN_COMPRESSION_BIT_MASK;
@@ -353,12 +561,8 @@ impl<'a> VectoredBlobReader<'a> {
)
};
let start_raw = offset_in_buf + size_length;
let end_raw = match next {
Some((next_blob_start_offset, _)) => next_blob_start_offset - start_offset,
None => start_raw + blob_size,
};
assert_eq!(end_raw - start_raw, blob_size);
let start_raw = blob_start_in_buf + size_length;
let end_raw = start_raw + blob_size;
let (start, end);
if compression_bits == BYTE_UNCOMPRESSED {
start = start_raw as usize;
@@ -407,18 +611,22 @@ pub struct StreamingVectoredReadPlanner {
max_cnt: usize,
/// Size of the current batch
cnt: usize,
mode: VectoredReadCoalesceMode,
}
impl StreamingVectoredReadPlanner {
pub fn new(max_read_size: u64, max_cnt: usize) -> Self {
assert!(max_cnt > 0);
assert!(max_read_size > 0);
let mode = VectoredReadCoalesceMode::get();
Self {
read_builder: None,
prev: None,
max_cnt,
max_read_size,
cnt: 0,
mode,
}
}
@@ -467,17 +675,12 @@ impl StreamingVectoredReadPlanner {
}
None => {
self.read_builder = {
let mut blobs_at = VecMap::default();
blobs_at
.append(start_offset, BlobMeta { key, lsn })
.expect("First insertion always succeeds");
Some(VectoredReadBuilder {
start: start_offset,
end: end_offset,
blobs_at,
max_read_size: None,
})
Some(VectoredReadBuilder::new_streaming(
start_offset,
end_offset,
BlobMeta { key, lsn },
self.mode,
))
};
}
}
@@ -511,7 +714,9 @@ mod tests {
use super::*;
fn validate_read(read: &VectoredRead, offset_range: &[(Key, Lsn, u64, BlobFlag)]) {
assert_eq!(read.start, offset_range.first().unwrap().2);
let align = virtual_file::get_io_buffer_alignment() as u64;
assert_eq!(read.start % align, 0);
assert_eq!(read.start / align, offset_range.first().unwrap().2 / align);
let expected_offsets_in_read: Vec<_> = offset_range.iter().map(|o| o.2).collect();
@@ -525,6 +730,68 @@ mod tests {
assert_eq!(expected_offsets_in_read, offsets_in_read);
}
#[test]
fn planner_chunked_coalesce_all_test() {
use crate::virtual_file;
let chunk_size = virtual_file::get_io_buffer_alignment() as u64;
// The test explicitly does not check chunk size < 512
if chunk_size < 512 {
return;
}
let max_read_size = chunk_size as usize * 8;
let key = Key::MIN;
let lsn = Lsn(0);
let blob_descriptions = [
(key, lsn, chunk_size / 8, BlobFlag::None), // Read 1 BEGIN
(key, lsn, chunk_size / 4, BlobFlag::Ignore), // Gap
(key, lsn, chunk_size / 2, BlobFlag::None),
(key, lsn, chunk_size - 2, BlobFlag::Ignore), // Gap
(key, lsn, chunk_size, BlobFlag::None),
(key, lsn, chunk_size * 2 - 1, BlobFlag::None),
(key, lsn, chunk_size * 2 + 1, BlobFlag::Ignore), // Gap
(key, lsn, chunk_size * 3 + 1, BlobFlag::None),
(key, lsn, chunk_size * 5 + 1, BlobFlag::None),
(key, lsn, chunk_size * 6 + 1, BlobFlag::Ignore), // skipped chunk size, but not a chunk: should coalesce.
(key, lsn, chunk_size * 7 + 1, BlobFlag::None),
(key, lsn, chunk_size * 8, BlobFlag::None), // Read 2 BEGIN (b/c max_read_size)
(key, lsn, chunk_size * 9, BlobFlag::Ignore), // ==== skipped a chunk
(key, lsn, chunk_size * 10, BlobFlag::None), // Read 3 BEGIN (cannot coalesce)
];
let ranges = [
&[
blob_descriptions[0],
blob_descriptions[2],
blob_descriptions[4],
blob_descriptions[5],
blob_descriptions[7],
blob_descriptions[8],
blob_descriptions[10],
],
&blob_descriptions[11..12],
&blob_descriptions[13..],
];
let mut planner = VectoredReadPlanner::new(max_read_size);
for (key, lsn, offset, flag) in blob_descriptions {
planner.handle(key, lsn, offset, flag);
}
planner.handle_range_end(652 * 1024);
let reads = planner.finish();
assert_eq!(reads.len(), ranges.len());
for (idx, read) in reads.iter().enumerate() {
validate_read(read, ranges[idx]);
}
}
#[test]
fn planner_max_read_size_test() {
let max_read_size = 128 * 1024;
@@ -571,18 +838,19 @@ mod tests {
#[test]
fn planner_replacement_test() {
let max_read_size = 128 * 1024;
let chunk_size = virtual_file::get_io_buffer_alignment() as u64;
let max_read_size = 128 * chunk_size as usize;
let first_key = Key::MIN;
let second_key = first_key.next();
let lsn = Lsn(0);
let blob_descriptions = vec![
(first_key, lsn, 0, BlobFlag::None), // First in read 1
(first_key, lsn, 1024, BlobFlag::None), // Last in read 1
(second_key, lsn, 2 * 1024, BlobFlag::ReplaceAll),
(second_key, lsn, 3 * 1024, BlobFlag::None),
(second_key, lsn, 4 * 1024, BlobFlag::ReplaceAll), // First in read 2
(second_key, lsn, 5 * 1024, BlobFlag::None), // Last in read 2
(first_key, lsn, 0, BlobFlag::None), // First in read 1
(first_key, lsn, chunk_size, BlobFlag::None), // Last in read 1
(second_key, lsn, 2 * chunk_size, BlobFlag::ReplaceAll),
(second_key, lsn, 3 * chunk_size, BlobFlag::None),
(second_key, lsn, 4 * chunk_size, BlobFlag::ReplaceAll), // First in read 2
(second_key, lsn, 5 * chunk_size, BlobFlag::None), // Last in read 2
];
let ranges = [&blob_descriptions[0..2], &blob_descriptions[4..]];
@@ -592,7 +860,7 @@ mod tests {
planner.handle(key, lsn, offset, flag);
}
planner.handle_range_end(6 * 1024);
planner.handle_range_end(6 * chunk_size);
let reads = planner.finish();
assert_eq!(reads.len(), 2);
@@ -737,6 +1005,7 @@ mod tests {
let reserved_bytes = blobs.iter().map(|bl| bl.len()).max().unwrap() * 2 + 16;
let mut buf = BytesMut::with_capacity(reserved_bytes);
let mode = VectoredReadCoalesceMode::get();
let vectored_blob_reader = VectoredBlobReader::new(&file);
let meta = BlobMeta {
key: Key::MIN,
@@ -748,7 +1017,7 @@ mod tests {
if idx + 1 == offsets.len() {
continue;
}
let read_builder = VectoredReadBuilder::new(*offset, *end, meta, 16 * 4096);
let read_builder = VectoredReadBuilder::new(*offset, *end, meta, 16 * 4096, mode);
let read = read_builder.build();
let result = vectored_blob_reader.read_blobs(&read, buf, &ctx).await?;
assert_eq!(result.blobs.len(), 1);
@@ -784,4 +1053,12 @@ mod tests {
round_trip_test_compressed(&blobs, true).await?;
Ok(())
}
#[test]
fn test_div_round_up() {
const CHUNK_SIZE: usize = 512;
assert_eq!(1, div_round_up(200, CHUNK_SIZE));
assert_eq!(1, div_round_up(CHUNK_SIZE, CHUNK_SIZE));
assert_eq!(2, div_round_up(CHUNK_SIZE + 1, CHUNK_SIZE));
}
}

View File

@@ -9,7 +9,7 @@ use utils::serde_percent::Percent;
use pageserver_api::models::PageserverUtilization;
use crate::{config::PageServerConf, tenant::mgr::TenantManager};
use crate::{config::PageServerConf, metrics::NODE_UTILIZATION_SCORE, tenant::mgr::TenantManager};
pub(crate) fn regenerate(
conf: &PageServerConf,
@@ -58,13 +58,13 @@ pub(crate) fn regenerate(
disk_usable_pct,
shard_count,
max_shard_count: MAX_SHARDS,
utilization_score: 0,
utilization_score: None,
captured_at: utils::serde_system_time::SystemTime(captured_at),
};
doc.refresh_score();
// TODO: make utilization_score into a metric
// Initialize `PageserverUtilization::utilization_score`
let score = doc.cached_score();
NODE_UTILIZATION_SCORE.set(score);
Ok(doc)
}

View File

@@ -10,6 +10,7 @@
//! This is similar to PostgreSQL's virtual file descriptor facility in
//! src/backend/storage/file/fd.c
//!
use crate::config::defaults::DEFAULT_IO_BUFFER_ALIGNMENT;
use crate::context::RequestContext;
use crate::metrics::{StorageIoOperation, STORAGE_IO_SIZE, STORAGE_IO_TIME_METRIC};
@@ -1140,10 +1141,13 @@ impl OpenFiles {
/// server startup.
///
#[cfg(not(test))]
pub fn init(num_slots: usize, engine: IoEngineKind) {
pub fn init(num_slots: usize, engine: IoEngineKind, io_buffer_alignment: usize) {
if OPEN_FILES.set(OpenFiles::new(num_slots)).is_err() {
panic!("virtual_file::init called twice");
}
if set_io_buffer_alignment(io_buffer_alignment).is_err() {
panic!("IO buffer alignment ({io_buffer_alignment}) is not a power of two");
}
io_engine::init(engine);
crate::metrics::virtual_file_descriptor_cache::SIZE_MAX.set(num_slots as u64);
}
@@ -1167,6 +1171,53 @@ fn get_open_files() -> &'static OpenFiles {
}
}
static IO_BUFFER_ALIGNMENT: AtomicUsize = AtomicUsize::new(DEFAULT_IO_BUFFER_ALIGNMENT);
/// Returns true if `x` is zero or a power of two.
fn is_zero_or_power_of_two(x: usize) -> bool {
(x == 0) || ((x & (x - 1)) == 0)
}
#[allow(unused)]
pub(crate) fn set_io_buffer_alignment(align: usize) -> Result<(), usize> {
if is_zero_or_power_of_two(align) {
IO_BUFFER_ALIGNMENT.store(align, std::sync::atomic::Ordering::Relaxed);
Ok(())
} else {
Err(align)
}
}
/// Gets the io buffer alignment requirement. Returns 0 if there is no requirement specified.
///
/// This function should be used to check the raw config value.
pub(crate) fn get_io_buffer_alignment_raw() -> usize {
let align = IO_BUFFER_ALIGNMENT.load(std::sync::atomic::Ordering::Relaxed);
if cfg!(test) {
let env_var_name = "NEON_PAGESERVER_UNIT_TEST_IO_BUFFER_ALIGNMENT";
if let Some(test_align) = utils::env::var(env_var_name) {
if is_zero_or_power_of_two(test_align) {
test_align
} else {
panic!("IO buffer alignment ({test_align}) is not a power of two");
}
} else {
align
}
} else {
align
}
}
/// Gets the io buffer alignment requirement. Returns 1 if the alignment config is set to zero.
///
/// This function should be used for getting the actual alignment value to use.
pub(crate) fn get_io_buffer_alignment() -> usize {
let align = get_io_buffer_alignment_raw();
align.max(1)
}
#[cfg(test)]
mod tests {
use crate::context::DownloadBehavior;

View File

@@ -78,6 +78,7 @@ where
.expect("must not use after we returned an error")
}
/// Guarantees that if Ok() is returned, all bytes in `chunk` have been accepted.
#[cfg_attr(target_os = "macos", allow(dead_code))]
pub async fn write_buffered<S: IoBuf + Send>(
&mut self,

View File

@@ -21,19 +21,25 @@
//! redo Postgres process, but some records it can handle directly with
//! bespoken Rust code.
use std::time::Duration;
use std::time::SystemTime;
use pageserver_api::shard::ShardIdentity;
use postgres_ffi::v14::nonrelfile_utils::clogpage_precedes;
use postgres_ffi::v14::nonrelfile_utils::slru_may_delete_clogsegment;
use postgres_ffi::TimestampTz;
use postgres_ffi::{fsm_logical_to_physical, page_is_new, page_set_lsn};
use anyhow::{bail, Context, Result};
use bytes::{Buf, Bytes, BytesMut};
use tracing::*;
use utils::failpoint_support;
use utils::rate_limit::RateLimit;
use crate::context::RequestContext;
use crate::metrics::WAL_INGEST;
use crate::pgdatadir_mapping::{DatadirModification, Version};
use crate::span::debug_assert_current_span_has_tenant_and_timeline_id;
use crate::tenant::PageReconstructError;
use crate::tenant::Timeline;
use crate::walrecord::*;
@@ -53,6 +59,13 @@ pub struct WalIngest {
shard: ShardIdentity,
checkpoint: CheckPoint,
checkpoint_modified: bool,
warn_ingest_lag: WarnIngestLag,
}
struct WarnIngestLag {
lag_msg_ratelimit: RateLimit,
future_lsn_msg_ratelimit: RateLimit,
timestamp_invalid_msg_ratelimit: RateLimit,
}
impl WalIngest {
@@ -71,6 +84,11 @@ impl WalIngest {
shard: *timeline.get_shard_identity(),
checkpoint,
checkpoint_modified: false,
warn_ingest_lag: WarnIngestLag {
lag_msg_ratelimit: RateLimit::new(std::time::Duration::from_secs(10)),
future_lsn_msg_ratelimit: RateLimit::new(std::time::Duration::from_secs(10)),
timestamp_invalid_msg_ratelimit: RateLimit::new(std::time::Duration::from_secs(10)),
},
})
}
@@ -1212,6 +1230,48 @@ impl WalIngest {
Ok(())
}
fn warn_on_ingest_lag(
&mut self,
conf: &crate::config::PageServerConf,
wal_timestmap: TimestampTz,
) {
debug_assert_current_span_has_tenant_and_timeline_id();
let now = SystemTime::now();
let rate_limits = &mut self.warn_ingest_lag;
match try_from_pg_timestamp(wal_timestmap) {
Ok(ts) => {
match now.duration_since(ts) {
Ok(lag) => {
if lag > conf.wait_lsn_timeout {
rate_limits.lag_msg_ratelimit.call2(|rate_limit_stats| {
let lag = humantime::format_duration(lag);
warn!(%rate_limit_stats, %lag, "ingesting record with timestamp lagging more than wait_lsn_timeout");
})
}
},
Err(e) => {
let delta_t = e.duration();
// determined by prod victoriametrics query: 1000 * (timestamp(node_time_seconds{neon_service="pageserver"}) - node_time_seconds)
// => https://www.robustperception.io/time-metric-from-the-node-exporter/
const IGNORED_DRIFT: Duration = Duration::from_millis(100);
if delta_t > IGNORED_DRIFT {
let delta_t = humantime::format_duration(delta_t);
rate_limits.future_lsn_msg_ratelimit.call2(|rate_limit_stats| {
warn!(%rate_limit_stats, %delta_t, "ingesting record with timestamp from future");
})
}
}
};
}
Err(error) => {
rate_limits.timestamp_invalid_msg_ratelimit.call2(|rate_limit_stats| {
warn!(%rate_limit_stats, %error, "ingesting record with invalid timestamp, cannot calculate lag and will fail find-lsn-for-timestamp type queries");
})
}
}
}
/// Subroutine of ingest_record(), to handle an XLOG_XACT_* records.
///
async fn ingest_xact_record(
@@ -1228,6 +1288,8 @@ impl WalIngest {
let mut rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT;
let mut page_xids: Vec<TransactionId> = vec![parsed.xid];
self.warn_on_ingest_lag(modification.tline.conf, parsed.xact_time);
for subxact in &parsed.subxacts {
let subxact_pageno = subxact / pg_constants::CLOG_XACTS_PER_PAGE;
if subxact_pageno != pageno {
@@ -2303,6 +2365,9 @@ mod tests {
let _endpoint = Lsn::from_hex("1FFFF98").unwrap();
let harness = TenantHarness::create("test_ingest_real_wal").await.unwrap();
let span = harness
.span()
.in_scope(|| info_span!("timeline_span", timeline_id=%TIMELINE_ID));
let (tenant, ctx) = harness.load().await;
let remote_initdb_path =
@@ -2354,6 +2419,7 @@ mod tests {
while let Some((lsn, recdata)) = decoder.poll_decode().unwrap() {
walingest
.ingest_record(recdata, lsn, &mut modification, &mut decoded, &ctx)
.instrument(span.clone())
.await
.unwrap();
}

View File

@@ -1,13 +1,7 @@
commit f7925d4d1406c0f0229e3c691c94b69e381899b1 (HEAD -> master)
Author: Alexey Masterov <alexeymasterov@neon.tech>
Date: Thu Jun 6 08:02:42 2024 +0000
Patch expected files to consider Neon's log messages
diff --git a/ext-src/pg_hint_plan-src/expected/ut-A.out b/ext-src/pg_hint_plan-src/expected/ut-A.out
index da723b8..f8d0102 100644
--- a/ext-src/pg_hint_plan-src/expected/ut-A.out
+++ b/ext-src/pg_hint_plan-src/expected/ut-A.out
diff --git a/expected/ut-A.out b/expected/ut-A.out
index da723b8..5328114 100644
--- a/expected/ut-A.out
+++ b/expected/ut-A.out
@@ -9,13 +9,16 @@ SET search_path TO public;
----
-- No.A-1-1-3
@@ -25,10 +19,18 @@ index da723b8..f8d0102 100644
DROP SCHEMA other_schema;
----
---- No. A-5-1 comment pattern
diff --git a/ext-src/pg_hint_plan-src/expected/ut-fdw.out b/ext-src/pg_hint_plan-src/expected/ut-fdw.out
@@ -3175,6 +3178,7 @@ SELECT s.query, s.calls
FROM public.pg_stat_statements s
JOIN pg_catalog.pg_database d
ON (s.dbid = d.oid)
+ WHERE s.query LIKE 'SELECT * FROM s1.t1%' OR s.query LIKE '%pg_stat_statements_reset%'
ORDER BY 1;
query | calls
--------------------------------------+-------
diff --git a/expected/ut-fdw.out b/expected/ut-fdw.out
index d372459..6282afe 100644
--- a/ext-src/pg_hint_plan-src/expected/ut-fdw.out
+++ b/ext-src/pg_hint_plan-src/expected/ut-fdw.out
--- a/expected/ut-fdw.out
+++ b/expected/ut-fdw.out
@@ -7,6 +7,7 @@ SET pg_hint_plan.debug_print TO on;
SET client_min_messages TO LOG;
SET pg_hint_plan.enable_hint TO on;
@@ -37,3 +39,15 @@ index d372459..6282afe 100644
CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;
CREATE USER MAPPING FOR PUBLIC SERVER file_server;
CREATE FOREIGN TABLE ft1 (id int, val int) SERVER file_server OPTIONS (format 'csv', filename :'filename');
diff --git a/sql/ut-A.sql b/sql/ut-A.sql
index 7c7d58a..4fd1a07 100644
--- a/sql/ut-A.sql
+++ b/sql/ut-A.sql
@@ -963,6 +963,7 @@ SELECT s.query, s.calls
FROM public.pg_stat_statements s
JOIN pg_catalog.pg_database d
ON (s.dbid = d.oid)
+ WHERE s.query LIKE 'SELECT * FROM s1.t1%' OR s.query LIKE '%pg_stat_statements_reset%'
ORDER BY 1;
----

View File

@@ -550,9 +550,6 @@ pageserver_connect(shardno_t shard_no, int elevel)
case 2:
pagestream_query = psprintf("pagestream_v2 %s %s", neon_tenant, neon_timeline);
break;
case 1:
pagestream_query = psprintf("pagestream %s %s", neon_tenant, neon_timeline);
break;
default:
elog(ERROR, "unexpected neon_protocol_version %d", neon_protocol_version);
}
@@ -1063,7 +1060,7 @@ pg_init_libpagestore(void)
NULL,
&neon_protocol_version,
2, /* use protocol version 2 */
1, /* min */
2, /* min */
2, /* max */
PGC_SU_BACKEND,
0, /* no flags required */

View File

@@ -87,9 +87,8 @@ typedef enum {
* can skip traversing through recent layers which we know to not contain any
* versions for the requested page.
*
* These structs describe the V2 of these requests. The old V1 protocol contained
* just one LSN and a boolean 'latest' flag. If the neon_protocol_version GUC is
* set to 1, we will convert these to the V1 requests before sending.
* These structs describe the V2 of these requests. (The old now-defunct V1
* protocol contained just one LSN and a boolean 'latest' flag.)
*/
typedef struct
{

Some files were not shown because too many files have changed in this diff Show More