Commit Graph

1585 Commits

Author SHA1 Message Date
Arpad Müller
f842b22b90 Add endpoint for querying time info for lsn (#5497)
## Problem

See #5468.

## Summary of changes

Add a new `get_timestamp_of_lsn` endpoint, returning the timestamp
associated with the given lsn.

Fixes #5468.

---------

Co-authored-by: Shany Pozin <shany@neon.tech>
2023-10-19 04:50:49 +02:00
Konstantin Knizhnik
5c88213eaf Logical replication (#5271)
## Problem

See https://github.com/neondatabase/company_projects/issues/111

## Summary of changes

Save logical replication files in WAL at compute and include them in
basebackup at pate server.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-10-18 16:42:22 +03:00
John Spray
607d19f0e0 pageserver: clean up page service Result handling for shutdown/disconnect (#5504)
## Problem

- QueryError always logged at error severity, even though disconnections
are not true errors.
- QueryError type is not expressive enough to distinguish actual errors
from shutdowns.
- In some functions we're returning Ok(()) on shutdown, in others we're
returning an error

## Summary of changes

- Add QueryError::Shutdown and use it in places we check for
cancellation
- Adopt consistent Result behavior: disconnects and shutdowns are always
QueryError, not ok
- Transform shutdown+disconnect errors to Ok(()) at the very top of the
task that runs query handler
- Use the postgres protocol error code for "admin shutdown" in responses
to clients when we are shutting down.

Closes: #5517
2023-10-18 13:28:38 +01:00
Christian Schwarz
9da67c4f19 walredo: make request_redo() an async fn (#5559)
Stacked atop https://github.com/neondatabase/neon/pull/5557
Prep work for https://github.com/neondatabase/neon/pull/5560

These changes have a 2% impact on `bench_walredo`.
That's likely because of the `block_on() in the innermost piece of
benchmark-only code.
So, it doesn't affect production code.
The use of closures in the benchmarking code prevents a straightforward
conversion of the whole benchmarking code to async.

before:

```
    $ cargo bench --features testing --bench bench_walredo
       Compiling pageserver v0.1.0 (/home/cs/src/neon/pageserver)
        Finished bench [optimized + debuginfo] target(s) in 2m 11s
         Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70)
    Gnuplot not found, using plotters backend
    short/short/1           time:   [26.363 µs 27.451 µs 28.573 µs]
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
    short/short/2           time:   [64.340 µs 64.927 µs 65.485 µs]
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) low mild
    short/short/4           time:   [101.98 µs 104.06 µs 106.13 µs]
    short/short/8           time:   [151.42 µs 152.74 µs 154.03 µs]
    short/short/16          time:   [296.30 µs 297.53 µs 298.88 µs]
    Found 14 outliers among 100 measurements (14.00%)
      10 (10.00%) high mild
      4 (4.00%) high severe

    medium/medium/1         time:   [225.12 µs 225.90 µs 226.66 µs]
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) low mild
    medium/medium/2         time:   [490.80 µs 491.64 µs 492.49 µs]
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) low mild
    medium/medium/4         time:   [934.47 µs 936.49 µs 938.52 µs]
    Found 5 outliers among 100 measurements (5.00%)
      3 (3.00%) low mild
      1 (1.00%) high mild
      1 (1.00%) high severe
    medium/medium/8         time:   [1.8364 ms 1.8412 ms 1.8463 ms]
    Found 4 outliers among 100 measurements (4.00%)
      4 (4.00%) high mild
    medium/medium/16        time:   [3.6694 ms 3.6896 ms 3.7104 ms]

```

after:

```

    $ cargo bench --features testing --bench bench_walredo
       Compiling pageserver v0.1.0 (/home/cs/src/neon/pageserver)
        Finished bench [optimized + debuginfo] target(s) in 2m 11s
         Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70)
    Gnuplot not found, using plotters backend
    short/short/1           time:   [28.345 µs 28.529 µs 28.699 µs]
                            change: [-0.2201% +3.9276% +8.2451%] (p = 0.07 > 0.05)
                            No change in performance detected.
    Found 17 outliers among 100 measurements (17.00%)
      4 (4.00%) low severe
      5 (5.00%) high mild
      8 (8.00%) high severe
    short/short/2           time:   [66.145 µs 66.719 µs 67.274 µs]
                            change: [+1.5467% +2.7605% +3.9927%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 5 outliers among 100 measurements (5.00%)
      5 (5.00%) low mild
    short/short/4           time:   [105.51 µs 107.52 µs 109.49 µs]
                            change: [+0.5023% +3.3196% +6.1986%] (p = 0.02 < 0.05)
                            Change within noise threshold.
    short/short/8           time:   [151.90 µs 153.16 µs 154.41 µs]
                            change: [-1.0001% +0.2779% +1.4221%] (p = 0.65 > 0.05)
                            No change in performance detected.
    short/short/16          time:   [297.38 µs 298.26 µs 299.20 µs]
                            change: [-0.2953% +0.2462% +0.7763%] (p = 0.37 > 0.05)
                            No change in performance detected.
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) high mild

    medium/medium/1         time:   [229.76 µs 230.72 µs 231.69 µs]
                            change: [+1.5804% +2.1354% +2.6635%] (p = 0.00 < 0.05)
                            Performance has regressed.
    medium/medium/2         time:   [501.14 µs 502.31 µs 503.64 µs]
                            change: [+1.8730% +2.1709% +2.5199%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 7 outliers among 100 measurements (7.00%)
      1 (1.00%) low mild
      1 (1.00%) high mild
      5 (5.00%) high severe
    medium/medium/4         time:   [954.15 µs 956.74 µs 959.33 µs]
                            change: [+1.7962% +2.1627% +2.4905%] (p = 0.00 < 0.05)
                            Performance has regressed.
    medium/medium/8         time:   [1.8726 ms 1.8785 ms 1.8848 ms]
                            change: [+1.5858% +2.0240% +2.4626%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 6 outliers among 100 measurements (6.00%)
      1 (1.00%) low mild
      3 (3.00%) high mild
      2 (2.00%) high severe
    medium/medium/16        time:   [3.7565 ms 3.7746 ms 3.7934 ms]
                            change: [+1.5503% +2.3044% +3.0818%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 3 outliers among 100 measurements (3.00%)
      3 (3.00%) high mild
```
2023-10-18 11:23:06 +01:00
Christian Schwarz
3a8959a4c4 page_cache: remove dead code (#5493) 2023-10-17 15:56:16 +01:00
Arpad Müller
093f8c5f45 Update rust to 1.73.0 (#5574)
[Release notes](https://blog.rust-lang.org/2023/10/05/Rust-1.73.0.html)
2023-10-17 13:13:12 +01:00
Christian Schwarz
9256788273 limit imitate accesses concurrency, using same semaphore as compactions (#5578)
Before this PR, when we restarted pageserver, we'd see a rush of
`$number_of_tenants` concurrent eviction tasks starting to do imitate
accesses building up in the period of `[init_order allows activations,
$random_access_delay + EvictionPolicyLayerAccessThreshold::period]`.

We simply cannot handle that degree of concurrent IO.

We already solved the problem for compactions by adding a semaphore.
So, this PR shares that semaphore for use by evictions.

Part of https://github.com/neondatabase/neon/issues/5479

Which is again part of https://github.com/neondatabase/neon/issues/4743

Risks / Changes In System Behavior
==================================

* we don't do evictions as timely as we currently do
* we log a bunch of warnings about eviction taking too long
* imitate accesses and compactions compete for the same concurrency
limit, so, they'll slow each other down through this shares semaphore


Changes
=======

- Move the `CONCURRENT_COMPACTIONS` semaphore into `tasks.rs`
- Rename it to `CONCURRENT_BACKGROUND_TASKS`
- Use it also for the eviction imitate accesses:
    - Imitate acceses are both per-TIMELINE and per-TENANT
    - The per-TENANT is done through coalescing all the per-TIMELINE
      tasks via a tokio mutex `eviction_task_tenant_state`.
    - We acquire the CONCURRENT_BACKGROUND_TASKS permit early, at the
      beginning of the eviction iteration, much before the imitate
      acesses start (and they may not even start at all in the given
      iteration, as they happen only every $threshold).
    - Acquiring early is **sub-optimal** because when the per-timline
      tasks coalesce on the `eviction_task_tenant_state` mutex,
      they are already holding a CONCURRENT_BACKGROUND_TASKS permit.
    - It's also unfair because tenants with many timelines win
      the CONCURRENT_BACKGROUND_TASKS more often.
    - I don't think there's another way though, without refactoring
      more of the imitate accesses logic, e.g, making it all per-tenant.
- Add metrics for queue depth behind the semaphore.
I found these very useful to understand what work is queued in the
system.

    - The metrics are tagged by the new `BackgroundLoopKind`.
    - On a green slate, I would have used `TaskKind`, but we already had
      pre-existing labels whose names didn't map exactly to task kind.
      Also the task kind is kind of a lower-level detail, so, I think
it's fine to have a separate enum to identify background work kinds.


Future Work
===========

I guess I could move the eviction tasks from a ticker to "sleep for
$period".
The benefit would be that the semaphore automatically "smears" the
eviction task scheduling over time, so, we only have the rush on restart
but a smeared-out rush afterward.

The downside is that this perverts the meaning of "$period", as we'd
actually not run the eviction at a fixed period. It also means the the
"took to long" warning & metric becomes meaningless.

Then again, that is already the case for the compaction and gc tasks,
which do sleep for `$period` instead of using a ticker.
2023-10-17 11:29:48 +02:00
Joonas Koivunen
9e1449353d crash-consistent layer map through index_part.json (#5198)
Fixes #5172 as it:
- removes recoinciliation with remote index_part.json and accepts remote
index_part.json as the truth, deleting any local progress which is yet
to be reflected in remote
- moves to prefer remote metadata

Additionally:
- tests with single LOCAL_FS parametrization are cleaned up
- adds a test case for branched (non-bootstrap) local only timeline
availability after restart

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2023-10-17 10:04:56 +01:00
John Spray
b06dffe3dc pageserver: fixes to /location_config API (#5548)
## Problem

I found some issues with the `/location_config` API when writing new
tests.

## Summary of changes

- Calling the API with the "Detached" state is now idempotent.
- `Tenant::spawn_attach` now takes a boolean to indicate whether to
expect a marker file. Marker files are used in the old attach path, but
not in the new location conf API. They aren't needed because in the New
World, the choice of whether to attach via remote state ("attach") or to
trust local state ("load") will be revised to cope with the transitions
between secondary & attached (see
https://github.com/neondatabase/neon/issues/5550). It is okay to merge
this change ahead of that ticket, because the API is not used in the
wild yet.
- Instead of using `schedule_local_tenant_processing`, the location conf
API handler does its own directory creation and calls `spawn_attach`
directly.
- A new `unsafe_create_dir_all` is added. This differs from
crashsafe::create_dir_all in two ways:
- It is intentionally not crashsafe, because in the location conf API we
are no longer using directory or config existence as the signal for any
important business logic.
   - It is async and uses `tokio::fs`.
2023-10-17 10:21:31 +02:00
Christian Schwarz
b08a0ee186 walredo: fix race condition where shutdown kills the wrong process (#5557)
Before this PR, the following race condition existed:

```
T1: does the apply_wal_records() call and gets back an error
T2: does the apply_wal_records() call and gets back an error
T2: does the kill_and_shutdown
T2: new loop iteration
T2: launches new walredo process
T1: does the kill_and_shutdown of the new process
```

That last step is wrong, T2 already did the kill_and_shutdown.

The symptom of this race condition was that T2 would observe an error
when it tried to do something with the process after T1 killed it.
For example, but not limited to:
`POLLHUP` /  `"WAL redo process closed its stderr unexpectedly"`.

The fix in this PR is the following:

* Use Arc to represent walredo processes.
  The Arc lives at least as long as the walredo process.
* Use Arc::ptr_eq to determine whether to kill the process or not.

The price is an additional RwLock to protect the new `redo_process`
field
that holds the Arc. I guess that could perhaps be an atomic pointer
swap some day. But, let's get one race fixed without risking introducing
a new one.

The use of Arc/drop is also not super great here because it now allows
for an unlimited number of to-be-killed processes to exist concurrently.
See the various `NB` comments above `drop(proc)` for why it's "ok" right
now due to the blocking `wait` inside `drop`.

Note: an earlier fix attempt was
https://github.com/neondatabase/neon/pull/5545
where we apply_batch_postgres would compare stdout_fd for equality.
That's incorrect because the kernel can reuse the file descriptor when
T2 launches the new process.
Details:
https://github.com/neondatabase/neon/pull/5545#pullrequestreview-1676589373
2023-10-17 09:55:39 +02:00
John Spray
ded7f48565 pageserver: measure startup duration spent fetching remote indices (#5564)
## Problem

Currently it's unclear how much of the `initial_tenant_load` period is
in S3 objects, and therefore how impactful it is to make changes to
remote operations during startup.

## Summary of changes

- `Tenant::load` is refactored to load remote indices in parallel and to
wait for all these remote downloads to finish before it proceeds to
construct any `Timeline` objects.
- `pageserver_startup_duration_seconds` gets a new `phase` value of
`initial_tenant_load_remote` which counts the time from startup to when
the last tenant finishes loading remote content.
- `test_pageserver_restart` is extended to validate this phase. The
previous version of the test was relying on order of dict entries, which
stopped working when adding a phase, so this is refactored a bit.
- `test_pageserver_restart` used to explicitly create a branch, now it
uses the default initial_timeline. This avoids startup getting held up
waiting for logical sizes, when one of the branches is not in use.
2023-10-16 18:21:37 +01:00
John Spray
44b1c4c456 pageserver: fix eviction across generations (#5538)
## Problem

Bug was introduced by me in 83ae2bd82c

When eviction constructs a RemoteLayer to replace the layer it just
evicted, it is building a LayerFileMetadata using its _current_
generation, rather than the generation of the layer.

## Summary of changes

- Retrieve Generation from RemoteTimelineClient when evicting. This will
no longer be necessary when #4938 lands.
- Add a test for the scenario in question (this fails without the fix).
2023-10-15 20:23:18 +01:00
Christian Schwarz
99c15907c1 walredo: trim public interfaces (#5556)
Stacked atop https://github.com/neondatabase/neon/pull/5554.
2023-10-13 19:35:53 +01:00
Christian Schwarz
c3626e3432 walredo: remove legacy wal-redo-datadir cleanup code (#5554)
It says it in the comment.
2023-10-13 19:16:15 +01:00
Christian Schwarz
dd6990567f walredo: apply_batch_postgres: get a backtrace whenever it encounters an error (#5541)
For 2 weeks we've seen rare, spurious, not-reproducible page
reconstruction
failures with PG16 in prod.

One of the commits we deployed this week was

Commit

    commit fc467941f9
    Author: Joonas Koivunen <joonas@neon.tech>
    Date:   Wed Oct 4 16:19:19 2023 +0300

        walredo: log retryed error (#546)

With the logs from that commit, we learned that some read() or write()
system call that walredo does fails with `EAGAIN`, aka
`Resource temporarily unavailable (os error 11)`.

But we have no idea where exactly in the code we get back that error.

So, use anyhow instead of fake std::io::Error's as an easy way to get
a backtrace when the error happens, and change the logging to print
that backtrace (i.e., use `{:?}` instead of
`utils::error::report_compact_sources(e)`).

The `WalRedoError` type had to go because we add additional `.context()`
further up the call chain before we `{:?}`-print it. That additional
`.context()` further up doesn't see that there's already an
anyhow::Error
inside the `WalRedoError::ApplyWalRecords` variant, and hence captures
another backtrace and prints that one on `{:?}`-print instead of the
original one inside `WalRedoError::ApplyWalRecords`.

If we ever switch back to `report_compact_sources`, we should make sure
we have some other way to uniquely identify the places where we return
an error in the error message.
2023-10-13 14:08:23 +00:00
Joonas Koivunen
ddceb9e6cd fix(branching): read last record lsn only after Tenant::gc_cs (#5535)
Fixes #5531, at least the latest error of not being able to create a
branch from the head under write and gc pressure.
2023-10-11 16:24:36 +01:00
John Spray
0fc3708de2 pageserver: use a backoff::retry in Deleter (#5534)
## Problem

The `Deleter` currently doesn't use a backoff::retry because it doesn't
need to: it is already inside a loop when doing the deletion, so can
just let the loop go around.

However, this is a problem for logging, because we log on errors, which
includes things like 503/429 cases that would usually be swallowed by a
backoff::retry in most places we use the RemoteStorage interface.

The underlying problem is that RemoteStorage doesn't have a proper error
type, and an anyhow::Error can't easily be interrogated for its original
S3 SdkError because downcast_ref requires a concrete type, but SdkError
is parametrized on response type.

## Summary of changes

Wrap remote deletions in Deleter in a backoff::retry to avoid logging
warnings on transient 429/503 conditions, and for symmetry with how
RemoteStorage is used in other places.
2023-10-11 15:25:08 +01:00
John Spray
39e144696f pageserver: clean up mgr.rs types that needn't be public (#5529)
## Problem

These types/functions are public and it prevents clippy from catching
unused things.

## Summary of changes

Move to `pub(crate)` and remove the error enum that becomes clearly
unused as a result.
2023-10-11 11:50:16 +00:00
Arseny Sher
685add2009 Enable /metrics without auth.
To enable auth faster.
2023-10-10 20:06:25 +03:00
John Spray
acefee9a32 pageserver: flush deletion queue on detach (#5452)
## Problem

If a caller detaches a tenant and then attaches it again, pending
deletions from the old attachment might not have happened yet. This is
not a correctness problem, but it causes:
- Risk of leaking some objects in S3
- Some warnings from the deletion queue when pending LSN updates and
pending deletions don't pass validation.

## Summary of changes

- Deletion queue now uses UnboundedChannel so that the push interfaces
don't have to be async.
- This was pulled out of https://github.com/neondatabase/neon/pull/5397,
where it is also useful to be able to drive the queue from non-async
contexts.
- Why is it okay for this to be unbounded? The only way the
unbounded-ness of the channel can become a problem is if writing out
deletion lists can't keep up, but if the system were that overloaded
then the code generating deletions (GC, compaction) would also be
impacted.
- DeletionQueueClient gets a new `flush_advisory` function, which is
like flush_execute, but doesn't wait for completion: this is appropriate
for use in contexts where we would like to encourage the deletion queue
to flush, but don't need to block on it.
- This function is also expected to be useful in next steps for seamless
migration, where the option to flush to S3 while transitioning into
AttachedStale will also include flushing deletion queue, but we wouldn't
want to block on that flush.
- The tenant_detach code in mgr.rs invokes flush_advisory after stopping
the `Tenant` object.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-10-10 10:46:24 +01:00
John Spray
b3195afd20 tests: fix a race in test_deletion_queue_recovery on loaded nodes (#5495)
## Problem

Seen in CI for https://github.com/neondatabase/neon/pull/5453 -- the
time gap between validation completing and the header getting written is
long enough to fail the test, where it was doing a cheeky 1 second
sleep.

## Summary of changes

- Replace 1 second sleep with a wait_until to see the header file get
written
- Use enums as test params to make the results more readable (instead of
True-False parameters)
- Fix the temp suffix used for deletion queue headers: this worked fine,
but resulted in `..tmp` extension.
2023-10-09 16:28:28 +01:00
John Spray
7eaa7a496b pageserver: cancellation handling in writes to postgres client socket (#5503)
## Problem

Writes to the postgres client socket from the page server were not
wrapped in cancellation handling, so a stuck client connection could
prevent tenant shutdowwn.

## Summary of changes

All the places we call flush() to write to the socket, we should be
respecting the cancellation token for the task.

In this PR, I explicitly pass around a CancellationToken rather than
doing inline `task_mgr::shutdown_token` calls, to avoid coupling it to
the global task_mgr state and make it easier to refactor later.

I have some follow-on commits that add a Shutdown variant to QueryError
and use it more extensively, but that's pure refactor so will keep
separate from this bug fix PR.

Closes: https://github.com/neondatabase/neon/issues/5341
2023-10-09 15:54:17 +01:00
Joonas Koivunen
4772cd6c93 fix: deny branching, starting compute from not yet uploaded timelines (#5484)
Part of #5172. First commits show that we used to allow starting up a
compute or creating a branch off a not yet uploaded timeline. This PR
moves activation of a timeline to happen **after** initial layer file(s)
(if any) and `index_part.json` have been uploaded. Simply moving
activation to be *after* downloads have finished works because we now
spawn a task per http request handler.

Current behaviour of uploading on the timelines on next startup is kept,
to be removed later as part of #5172.

Adds:
- `NeonCli.map_branch` and corresponding `neon_local` implementation:
allow creating computes for timelines managed via pageserver http
client/api
- possibly duplicate tests (I did not want to search for, will cleanup
in a follow-up if these duplicated)

Changes:
- make `wait_until_tenant_state` return immediatedly on `Broken` and not
wait more
2023-10-09 17:03:38 +03:00
John Spray
ea5a97e7b4 pageserver: implement emergency mode for operating without control plane (#5469)
## Problem

Pageservers with `control_plane_api` configured require a control plane
to start up: in an incident this might be a problem.

## Summary of changes

Note to reviewers: most of the code churn in mgr.rs is the refactor
commit that enables the later emergency mode commit: you may want to
review commits separately.

- Add `control_plane_emergency_mode` configuration property
- Refactor init_tenant_mgr to separate loading configurations from the
main loop where we construct Tenant, so that the generations fetch can
peek at the configs in emergency mode.
- During startup, in emergency mode, attach any tenants that were
attached on their last run, using the same generation number.

Closes: #5381 
Closes: https://github.com/neondatabase/neon/issues/5492
2023-10-06 17:25:21 +01:00
John Spray
547914fe19 pageserver: adjust timeline deletion for generations (#5453)
## Problem

Spun off from https://github.com/neondatabase/neon/pull/5449

Timeline deletion does the following:
1. Delete layers referenced in the index
2. Delete everything else in the timeline prefix, except the index
3. Delete the index.

When generations were added, the filter in step 2 got outdated, such
that the index objects were deleted along with everything else at step
2. That didn't really break anything, but it makes an automated test
unhappy and is a violation of the original intent of the code, which
presumably intends to upload an invariant that as long as any objects
for a timeline exist, the index exists.

(Eventually, this index-object-last complexity can go away: when we do
https://github.com/neondatabase/neon/issues/5080, there is no need to
keep the index_part around, as deletions can always be retried any time
any where.)

## Summary of changes

After object listing, split the listed objects into layers and index
objects. Delete the layers first, then the index objects.
2023-10-06 16:15:18 +00:00
Arpad Müller
607b185a49 Fix 1.73.0 clippy lints (#5494)
Doesn't do an upgrade of rustc to 1.73.0 as we want to wait for the
cargo response of the curl CVE before updating. In preparation for an
update, we address the clippy lints that are newly firing in 1.73.0.
2023-10-06 14:17:19 +01:00
Christian Schwarz
bfba5e3aca page_cache: ensure forward progress on miss (#5482)
Problem
=======

Prior to this PR, when we had a cache miss, we'd get back a write guard,
fill it, the drop it and retry the read from cache.

If there's severe contention for the cache, it could happen that the
just-filled data gets evicted before our retry, resulting in lost work
and no forward progress.

Solution
========

This PR leverages the now-available `tokio::sync::RwLockWriteGuard`'s
`downgrade()` functionality to turn the filled slot write guard into a
read guard.
We don't drop the guard at any point, so, forward progress is ensured.


Refs
====

Stacked atop https://github.com/neondatabase/neon/pull/5480 

part of https://github.com/neondatabase/neon/issues/4743
specifically part of https://github.com/neondatabase/neon/issues/5479
2023-10-06 13:41:13 +01:00
Christian Schwarz
ecc7a9567b page_cache: inline {,try_}lock_for_write into memorize_materialized_page (#5480)
Motivation
==========

It's the only user, and the name of `_for_write` is wrong as of

    commit 7a63685cde
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Fri Aug 18 19:31:03 2023 +0200

        simplify page-caching of EphemeralFile (#4994)

Notes
=====

This also allows us to get rid of the WriteBufResult type.

Also rename `search_mapping_for_write` to `search_mapping_exact`. It
makes more sense that way because there is `_for_write`-locking anymore.

Refs
====

part of https://github.com/neondatabase/neon/issues/4743
specifically https://github.com/neondatabase/neon/issues/5479

this is prep work for https://github.com/neondatabase/neon/pull/5482
2023-10-06 13:38:02 +02:00
Joonas Koivunen
45f98dd018 debug_tool: get page at lsn and keyspace via http api (#5057)
If there are any layermap or layer file related problems, having a
reproducable `get_page@lsn` easily usable for fast debugging iteration
is helpful.

Split off from #4938.

Later evolved to add http apis for:
- `get_page@lsn` at
`/v1/tenant/:tenant_id/timeline/:timeline_id/get?key=<hex>&lsn=<lsn
string>`
- collecting the keyspace at
`/v1/tenant/:tenant_id/timeline/:timeline_id/keyspace?[at_lsn=<lsn
string>]`
    - defaults to `last_record_lsn`

collecting the keyspace seems to yield some ranges for which there is no
key.
2023-10-06 12:17:38 +01:00
John Spray
bdfe27f3ac swagger: add a 503 definition to each endpoint (#5476)
## Problem

The control plane doesn't have generic handling for this.

## Summary of changes

Add a 503 response to every endpoint.
2023-10-06 11:31:49 +01:00
Joonas Koivunen
a15f9b3baa pageserver: Tune 503 Resource unavailable (#5489)
503 Resource Unavailable appears as error in logs, but is not really an
error which should ever fail a test on, or even log an error in prod,
[evidence].

Changes:
- log 503 as `info!` level
- use `Cow<'static, str>` instead of `String`
- add an additional `wait_until_tenant_active` in
`test_actually_duplicate_l1`
 
We ought to have in tests "wait for tenants to complete loading" but
this is easier to implement for now.

[evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5485/6423110295/index.html#/testresult/182de66203864fc0
2023-10-06 09:59:14 +01:00
John Spray
7cbb39063a tests: stabilize + extend deletion queue recovery test (#5457)
## Problem

This test was unstable when run in parallel with lots of others: if the
pageserver stayed up long enough for some of the deletions to get
validated, they won't be discarded on restart the way the test expects
when keep_attachment=True. This was a test bug, not a pageserver bug.

## Summary of changes

- Add failpoints to control plane api client
- Use failpoint to pause validation in the test to cover the case where
it had been flaky
- Add a metric for the number of deleted keys validated
- Add a permutation to the test to additionally exercise the case where
we _do_ validate lists before restart: this is a coverage enhancement
that seemed sensible when realizing that the test was relying on nothing
being validated before restart.
- the test will now always enter the restart with nothing or everything
validated.
2023-10-05 11:22:05 +01:00
John Spray
baa5fa1e77 pageserver: location configuration API, attachment modes, secondary locations (#5299)
## Problem

These changes are part of building seamless tenant migration, as
described in the RFC:
- https://github.com/neondatabase/neon/pull/5029

## Summary of changes

- A new configuration type `LocationConf` supersedes `TenantConfOpt` for
storing a tenant's configuration in the pageserver repo dir. It contains
`TenantConfOpt`, as well as a new `mode` attribute that describes what
kind of location this is (secondary, attached, attachment mode etc). It
is written to a file called `config-v1` instead of `config` -- this
prepares us for neatly making any other profound changes to the format
of the file in future. Forward compat for existing pageserver code is
achieved by writing out both old and new style files. Backward compat is
achieved by checking for the old-style file if the new one isn't found.
- The `TenantMap` type changes, to hold `TenantSlot` instead of just
`Tenant`. The `Tenant` type continues to be used for attached tenants
only. Tenants in other states (such as secondaries) are represented by a
different variant of `TenantSlot`.
- Where `Tenant` & `Timeline` used to hold an Arc<Mutex<TenantConfOpt>>,
they now hold a reference to a AttachedTenantConf, which includes the
extra information from LocationConf. This enables them to know the
current attachment mode.
- The attachment mode is used as an advisory input to decide whether to
do compaction and GC (AttachedStale is meant to avoid doing uploads,
AttachedMulti is meant to avoid doing deletions).
- A new HTTP API is added at `PUT /tenants/<tenant_id>/location_config`
to drive new location configuration. This provides a superset of the
functionality of attach/detach/load/ignore:
  - Attaching a tenant is just configuring it in an attached state
  - Detaching a tenant is configuring it to a detached state
  - Loading a tenant is just the same as attaching it
- Ignoring a tenant is the same as configuring it into Secondary with
warm=false (i.e. retain the files on disk but do nothing else).

Caveats:
- AttachedMulti tenants don't do compaction in this PR, but they do in
the follow on #5397
- Concurrent updates to the `location_config` API are not handled
elegantly in this PR, a better mechanism is added in the follow on
https://github.com/neondatabase/neon/pull/5367
- Secondary mode is just a placeholder in this PR: the code to upload
heatmaps and do downloads on secondary locations will be added in a
later PR (but that shouldn't change any external interfaces)

Closes: https://github.com/neondatabase/neon/issues/5379

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-10-05 09:55:10 +01:00
John Spray
c5ea91f831 pageserver: fix loading control plane JWT token (#5470)
## Problem

In #5383 this configuration was added, but it missed the parts of the
Builder class that let it actually be used.

## Summary of changes

Add `control_plane_api_token` hooks to PageserverConfigBuilder
2023-10-05 01:31:17 +01:00
Joonas Koivunen
7dce62a9ee test: duplicate L1 layer (#5412)
We overwrite L1 layers if compaction gets interrupted. We did not have a
test showing that we do in fact do this.

The test might be a bit flaky due to timestamp usage, but separating for
smaller diff in as part of #5172.

Also removes an unrelated 200s pgbench from the test suite.
2023-10-04 16:52:32 +01:00
duguorong009
25a37215f3 fix: replace all std::PathBufs with camino::Utf8PathBuf (#5352)
Fixes #4689 by replacing all of `std::Path` , `std::PathBuf` with
`camino::Utf8Path`, `camino::Utf8PathBuf` in
- pageserver
- safekeeper
- control_plane
- libs/remote_storage

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-04 17:52:23 +03:00
Joonas Koivunen
fc467941f9 walredo: log retryed error (#5462)
We currently lose the actual reason the first walredo attempt failed.
Together with implicit retry making it difficult to eyeball what is
happening.

PR version keeps the logging the same error message twice, which is what
we've been doing all along. However correlating the retrying case and
the finally returned error is difficult, because the actual error
message was left out before this PR.

Lastly, log the final error we present to postgres *in the same span*,
not outside it. Additionally, suppress the stacktrace as the comment
suggested.
2023-10-04 14:19:19 +01:00
Christian Schwarz
25bf791568 metrics: distinguish page reconstruction success & failure (#5463)
Here's the existing dashboards that use the metric:


https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_getpage_reconstruct_seconds&type=code

Looks like only `_count` and `_sum` values are used currently.
We can fix them up easily post merge.

I think the histogram is worth keeping, though.

follow-up to
https://github.com/neondatabase/neon/pull/5459#pullrequestreview-1657072882

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-04 13:40:00 +01:00
Joonas Koivunen
dee2bcca44 fix: time the reconstruction, not future creation (#5459)
`pageserver_getpage_reconstruct_seconds` histogram had been only
recording the time it takes to create a future, not await on it. Since:
eb0a698adc.
2023-10-04 11:01:07 +01:00
Joonas Koivunen
db8ff9d64b testing: record walredo failures to test reports (#5451)
We have rare walredo failures with pg16.

Let's introduce recording of failing walredo input in `#[cfg(feature =
"testing")]`. There is additional logging (the value reconstruction path
logging usually shown with not found keys), keeping it for
`#[cfg(features = "testing")]`.

Cc: #5404.
2023-10-04 11:24:30 +03:00
Rahul Modpur
af6a20dfc2 Improve CrashsafeOverwriteError source printing (#5410)
## Problem

Duplication of error in log

Fixes #5366 

## Summary of changes

Removed `{0}` from error description above each enum due to presence of
`#[source]` to avoid duplication

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
2023-10-04 02:38:42 +02:00
John Spray
ace0c775fc pageserver: prefer 503 to 500 for transient unavailability (#5439)
## Problem

The 500 status code should only be used for bugs or unrecoverable
failures: situations we did not expect. Currently, the pageserver is
misusing this response code for some situations that are totally normal,
like requests targeting tenants that are in the process of activating.

The 503 response is a convenient catch-all for "I can't right now, but I
will be able to".

## Summary of changes

- Change some transient availability error conditions to return 503
instead of 500
- Update the HTTP client configuration in integration tests to retry on
503

After these changes, things like creating a tenant and then trying to
create a timeline within it will no longer require carefully checking
its status first, or retrying on 500s. Instead, a client which is
properly configured to retry on 503 can quietly handle such situations.
2023-10-03 17:00:55 +01:00
Christian Schwarz
c07eef8ea5 page_cache: find_victim: don't spin while there's no chance for a slot (#5319)
It is wasteful to cycle through the page cache slots trying to find a
victim slot if all the slots are currently un-evictable because a read /
write guard is alive.

We suspect this wasteful cycling to be the root cause for an
"indigestion" we observed in staging (#5291).
The hypothesis is that we `.await` after we get ahold of a read / write
guard, and that tokio actually deschedules us in favor of another
future.
If that other future then needs a page slot, it can't get ours because
we're holding the guard.
Repeat this, and eventually, the other future(s) will find themselves
doing `find_victim` until they hit `exceeded evict iter limit`.

The `find_victim` is wasteful and CPU-starves the futures that are
already holding the read/write guard. A `yield` inside `find_victim`
could mitigate the starvation, but wouldn't fix the wasting of CPU
cycles.

So instead, this PR queues waiters behind a tokio semaphore that counts
evictable slots.
The downside is that this stops the clock page replacement if we have 0
evictable slots.

Also, as explained by the big block comment in `find_victims`, the
semaphore doesn't fully prevent starvation because because we can't make
tokio prioritize those tasks executing `find_victim` that have been
trying the longest.

Implementation
===============
We need to acquire the semaphore permit before locking the slot.
Otherwise, we could deadlock / discover that all permits are gone and
would have to relinquish the slot, having moved forward the Clock LRU
without making progress.

The downside is that, we never get full throughput for read-heavy
workloads, because, until the reader coalesces onto an existing permit,
it'll hold its own permit.


Addendum To Root-Cause Analysis In #5291
========================================

Since merging that PR, @arpad-m pointed out that we couldn't have
reached the `slot.write().await` with his patches because the
VirtualFile slots can't have all been write-locked, because we only hold
them locked while the IO is ongoing, and the IO is still done with
synchronous system calls in that patch set, so, we can have had at most
$number_of_executor_threads locked at any given time.
I count 3 tokio runtimes that do `Timeline::get`, each with 8 executor
threads in our deployment => $number_of_executor_threads = 3*8 = 24 .
But the virtual file cache has 100 slots.

We both agree that nothing changed about the core hypothesis, i.e.,
additional await points inside VirtualFile caused higher concurrency
resulting in exhaustion of page cache slots.
But we'll need to reproduce the issue and investigate further to truly
understand the root cause, or find out that & why we were indeed using
100 VirtualFile slots.

TODO: could it be compaction that needs to hold guards of many
VirtualFile's in its iterators?
2023-09-29 20:03:56 +02:00
John Spray
ca3ca2bb9c pageserver: don't try and recover deletion queue if no remote storage (#5419)
## Problem

Because `neon_local` by default runs with no remote storage, it was not
running the deletion queue workers, and the attempt to call into
`recover()` was failing.

This is a bogus configuration that will go away when we make remote
storage mandatory.

## Summary of changes

Don't try and do deletion queue recovery when remote storage is
disabled.

The reason we don't just unset `control_plane_api` to avoid this is that
generations will soon become mandatory, irrespective of when we make
remote storage mandatory.
2023-09-28 17:20:34 +01:00
Joonas Koivunen
af28362a47 tests: Default to LOCAL_FS for pageserver remote storage (#5402)
Part of #5172. Builds upon #5243, #5298. Includes the test changes:
- no more RemoteStorageKind.NOOP
- no more testing of pageserver without remote storage
- benchmarks now use LOCAL_FS as well

Support for running without RemoteStorage is still kept but in practice,
there are no tests and should not be any tests.

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-09-28 12:25:20 +03:00
Christian Schwarz
090a644392 metrics for resident & remote physical size without tenant/timeline dimension (#5389)
So that we can compute worst-case /storage size dashboard panel more
cheaply.
2023-09-27 13:18:05 +01:00
John Spray
2cced770da pageserver: add control_plane_api_token config (#5383)
## Problem

Control plane API calls in prod will need authentication.

## Summary of changes

`control_plane_api_token` config is loaded and set as HTTP
`Authorization` header.

Closes: https://github.com/neondatabase/neon/issues/5139
2023-09-27 13:12:13 +01:00
John Spray
ba92668e37 pageserver: deletion queue & generation validation for deletions (#5207)
## Problem

Pageservers must not delete objects or advertise updates to
remote_consistent_lsn without checking that they hold the latest
generation for the tenant in question (see [the RFC](
https://github.com/neondatabase/neon/blob/main/docs/rfcs/025-generation-numbers.md))

In this PR:
- A new "deletion queue" subsystem is introduced, through which
deletions flow
- `RemoteTimelineClient` is modified to send deletions through the
deletion queue:
- For GC & compaction, deletions flow through the full generation
verifying process
- For timeline deletions, deletions take a fast path that bypasses
generation verification
- The `last_uploaded_consistent_lsn` value in `UploadQueue` is replaced
with a mechanism that maintains a "projected" lsn (equivalent to the
previous property), and a "visible" LSN (which is the one that we may
share with safekeepers).
- Until `control_plane_api` is set, all deletions skip generation
validation
- Tests are introduced for the new functionality in
`test_pageserver_generations.py`

Once this lands, if a pageserver is configured with the
`control_plane_api` configuration added in
https://github.com/neondatabase/neon/pull/5163, it becomes safe to
attach a tenant to multiple pageservers concurrently.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-09-26 16:11:55 +01:00
Christian Schwarz
3322b6c5b0 page cache: metrics: add page content kind dimension (#5373)
The TaskKind dimension added in #5339 is insufficient to understand what
kind of data causes the cache hits.

Regarding performance considerations: I'm not too worried because we're
moving from 3 to 4 one-byte sized fields; likely the space now used by
the new field was padding before. Didn't check this, though, and it
doesn't matter, we need the data.

What I don't like about this PR is that we have an `Unknown` content
type, and I also don't like that there's no compile-time way to assert
that it's set to something != `Unknown` when calling the page cache.
But, this is what I could come up with before tomorrow’s release, and I
think it covers the hot paths.
2023-09-26 10:01:09 +03:00
Christian Schwarz
1d98d3e4c1 VirtualFile::atomic_overwrite: add basic unit tests (#5191)
Should have added them in the initial PR #5186.

Would have been nice to test the failure cases as well, but, without
mocking the FS, that's too hard / platform-dependent.
2023-09-25 17:16:36 +00:00