Commit Graph

196 Commits

Author SHA1 Message Date
Christian Schwarz
0d7dfdbb3e WIP duplicate based on LayerMap 2023-11-08 18:37:21 +00:00
John Spray
85cd97af61 pageserver: add InProgress tenant map state, use a sync lock for the map (#5367)
## Problem

Follows on from #5299 
- We didn't have a generic way to protect a tenant undergoing changes:
`Tenant` had states, but for our arbitrary transitions between
secondary/attached, we need a general way to say "reserve this tenant
ID, and don't allow any other ops on it, but don't try and report it as
being in any particular state".
- The TenantsMap structure was behind an async RwLock, but it was never
correct to hold it across await points: that would block any other
changes for all tenants.


## Summary of changes

- Add the `TenantSlot::InProgress` value.  This means:
  - Incoming administrative operations on the tenant should retry later
- Anything trying to read the live state of the tenant (e.g. a page
service reader) should retry later or block.
- Store TenantsMap in `std::sync::RwLock`
- Provide an extended `get_active_tenant_with_timeout` for page_service
to use, which will wait on InProgress slots as well as non-active
tenants.

Closes: https://github.com/neondatabase/neon/issues/5378

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-06 14:03:22 +00:00
bojanserafimov
dc72567288 Layer flush minor speedup (#5765)
Convert keys to `i128` before sorting
2023-11-06 08:58:20 -05:00
John Spray
6defa2b5d5 pageserver: add Gate as a partner to CancellationToken for safe shutdown of Tenant & Timeline (#5711)
## Problem

When shutting down a Tenant, it isn't just important to cause any
background tasks to stop. It's also important to wait until they have
stopped before declaring shutdown complete, in cases where we may re-use
the tenant's local storage for something else, such as running in
secondary mode, or creating a new tenant with the same ID.

## Summary of changes

A `Gate` class is added, inspired by
[seastar::gate](https://docs.seastar.io/master/classseastar_1_1gate.html).
For types that have an important lifetime that corresponds to some
physical resource, use of a Gate as well as a CancellationToken provides
a robust pattern for async requests & shutdown:
- Requests must always acquire the gate as long as they are using the
object
- Shutdown must set the cancellation token, and then `close()` the gate
to wait for requests in progress before returning.

This is not for memory safety: it's for expressing the difference
between "Arc<Tenant> exists", and "This tenant's files on disk are
eligible to be read/written".

- Both Tenant and Timeline get a Gate & CancellationToken.
- The Timeline gate is held during eviction of layers, and during
page_service requests.
- Existing cancellation support in page_service is refined to use the
timeline-scope cancellation token instead of a process-scope
cancellation token. This replaces the use of `task_mgr::associate_with`:
tasks no longer change their tenant/timelineidentity after being
spawned.

The Tenant's Gate is not yet used, but will be important for
Tenant-scoped operations in secondary mode, where we must ensure that
our secondary-mode downloads for a tenant are gated wrt the activity of
an attached Tenant.

This is part of a broader move away from using the global-state driven
`task_mgr` shutdown tokens:
- less global state where we rely on implicit knowledge of what task a
given function is running in, and more explicit references to the
cancellation token that a particular function/type will respect, making
shutdown easier to reason about.
- eventually avoid the big global TASKS mutex.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-06 12:39:20 +00:00
John Spray
de90bf4663 pageserver: always load remote metadata (no more spawn_load) (#5580)
## Problem

The pageserver had two ways of loading a tenant:
- `spawn_load` would trust on-disk content to reflect all existing
timelines
- `spawn_attach` would list timelines in remote storage.

It was incorrect for `spawn_load` to trust local disk content, because
it doesn't know if the tenant might have been attached and written
somewhere else. To make this correct would requires some generation
number checks, but the payoff is to avoid one S3 op per tenant at
startup, so it's not worth the complexity -- it is much simpler to have
one way to load a tenant.

## Summary of changes

- `Tenant` objects are always created with `Tenant::spawn`: there is no
more distinction between "load" and "attach".
- The ability to run without remote storage (for `neon_local`) is
preserved by adding a branch inside `attach` that uses a fallback
`load_local` if no remote_storage is present.
- Fix attaching a tenant when it has a timeline with no IndexPart: this
can occur if a newly created timeline manages to upload a layer before
it has uploaded an index.
- The attach marker file that used to indicate whether a tenant should
be "loaded" or "attached" is no longer needed, and is removed.
- The GenericRemoteStorage interface gets a `list()` method that maps
more directly to what ListObjects does, returning both keys and common
prefixes. The existing `list_files` and `list_prefixes` methods are just
calls into `list()` now -- these can be removed later if we would like
to shrink the interface a bit.
- The remote deletion marker is moved into `timelines/` and detected as
part of listing timelines rather than as a separate GET request. If any
existing tenants have a marker in the old location (unlikely, only
happens if something crashes mid-delete), then they will rely on the
control plane retrying to complete their deletion.
- Revise S3 calls for timeline listing and tenant load to take a
cancellation token, and retry forever: it never makes sense to make a
Tenant broken because of a transient S3 issue.

## Breaking changes

- The remote deletion marker is moved from `deleted` to
`timelines/deleted` within the tenant prefix. Markers in the old
location will be ignored: it is the control plane's responsibility to
retry deletions until they succeed. Markers in the new location will be
tolerated by the previous release of pageserver via
https://github.com/neondatabase/neon/pull/5632
- The local `attaching` marker file is no longer written. Therefore, if
the pageserver is downgraded after running this code, the old pageserver
will not be able to distinguish between partially attached tenants and
fully attached tenants. This would only impact tenants that were partway
through attaching at the moment of downgrade. In the unlikely even t
that we do experience an incident that prompts us to roll back, then we
may check for attach operations in flight, and manually insert
`attaching` marker files as needed.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-10-26 14:48:44 +01:00
John Spray
e0ebdfc7ce pageserver: suppress compaction/gc errors while stopping (#5670)
## Problem

Tenant deletions would sometimes be accompanied by compaction stack
traces, because `shutdown()` puts the tenant into stopping state before
it joins background tasks.

## Summary of changes

Treat GC+Compaction as no-ops on a Stopping tenant.
2023-10-26 10:59:24 +01:00
Joonas Koivunen
c508d3b5fa reimpl Layer, remove remote layer, trait Layer, trait PersistentLayer (#4938)
Implement a new `struct Layer` abstraction which manages downloadness
internally, requiring no LayerMap locking or rewriting to download or
evict providing a property "you have a layer, you can read it". The new
`struct Layer` provides ability to keep the file resident via a RAII
structure for new layers which still need to be uploaded. Previous
solution solved this `RemoteTimelineClient::wait_completion` which lead
to bugs like #5639. Evicting or the final local deletion after garbage
collection is done using Arc'd value `Drop`.

With a single `struct Layer` the closed open ended `trait Layer`, `trait
PersistentLayer` and `struct RemoteLayer` are removed following noting
that compaction could be simplified by simply not using any of the
traits in between: #4839.

The new `struct Layer` is a preliminary to remove
`Timeline::layer_removal_cs` documented in #4745.

Preliminaries: #4936, #4937, #5013, #5014, #5022, #5033, #5044, #5058,
#5059, #5061, #5074, #5103, epic #5172, #5645, #5649. Related split off:
#5057, #5134.
2023-10-26 12:36:38 +03:00
Joonas Koivunen
c5949e1fd6 misc smaller improvements (#5527)
- finally add an `#[instrument]` to Timeline::create_image_layers,
making it easier to see that something is happening because we create
image layers
- format some macro context code
- add a warning not to create new validation functions a la parse do not
validate

Split off from #5198.
2023-10-25 14:59:43 +03:00
John Spray
41ee75bc71 pageserver: do config writes in a spawn_blocking (#5603)
## Problem

We now persist tenant configuration every time we spawn a tenant. The
persist_tenant_config function is doing a series of non-async filesystem
I/O, because `crashsafe::` isn't async yet. This isn't a demonstrated
problem, but is a source of uncertainty when reasoning about what's
happening with our startup times.

## Summary of changes

- Wrap `crashsafe_overwrite` in `spawn_blocking`.
- Although I think this change makes sense, it does not have a
measurable impact on load time when testing with 10k tenants.
- This can be reverted when we have full async I/O
2023-10-23 09:19:01 +01:00
Christian Schwarz
11e523f503 walredo: fix EGAGAIN/"os error 11" false page reconstruction failures (#5560)
Stacked atop https://github.com/neondatabase/neon/pull/5559

Before this PR, there was the following race condition:

```
T1: polls for writeable stdin
T1: writes to stdin
T1: enters poll for stdout/stderr
T2: enters poll for stdin write
WALREDO: writes to stderr
KERNEL: wakes up T1 and T2
Tx: reads stderr and prints it
Ty: reads stderr and gets EAGAIN
(valid values for (x, y) are (1, 2) or (2, 1))
```

The concrete symptom that we observed repeatedly was with PG16,
which started logging `registered custom resource manager`
to stderr always, during startup, thereby giving us repeated
opportunity to hit above race condition. PG14 and PG15 didn't log
anything to stderr, hence we could have only hit this race condition
if there was an actual error happening.

This PR fixes the race by moving the reading of stderr into a tokio
task. It exits when the stderr is closed by the child process, which
in turn happens when the child exits, either by itself or because
we killed it.

The downside is that the async scheduling can reorder the log messages,
which can be seen in the new `test_stderr`, which runs in a
single-threaded runtime. I included the output below.

Overall I think we should move the entire walredo to async, as Joonas
proposed many months ago. This PR's asyncification is just the first
step to resolve these
false page reconstruction errors.

After this is fixed, we should stop printing that annoying stderr
message
on walredo startup; it causes noise in the pageserver logs.
That work is tracked in #5399 .

```
2023-10-13T19:05:21.878858Z ERROR apply_wal_records{tenant_id=d546fb76ba529195392fb4d19e243991 pid=753986}: failed to write out the walredo errored input: No such file or directory (os error 2) target=walredo-1697223921878-1132-0.walredo length=1132
2023-10-13T19:05:21.878932Z DEBUG postgres applied 2 WAL records (1062 bytes) in 114666 us to reconstruct page image at LSN 0/0
2023-10-13T19:05:21.878942Z ERROR error applying 2 WAL records 0/16A9388..0/16D4080 (1062 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/0 n_attempts=0: apply_wal_records

Caused by:
    WAL redo process closed its stdout unexpectedly
2023-10-13T19:05:21.879027Z  INFO kill_and_wait_impl{pid=753986}: wait successful exit_status=signal: 11 (SIGSEGV) (core dumped)
2023-10-13T19:05:21.879079Z DEBUG wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task started
2023-10-13T19:05:21.879104Z ERROR wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: received output output="2023-10-13 19:05:21.769 GMT [753986] LOG:  registered custom resource manager \"neon\" with ID 134\n"
2023-10-13T19:05:21.879116Z DEBUG wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task finished
2023-10-13T19:05:22.004439Z ERROR apply_wal_records{tenant_id=d546fb76ba529195392fb4d19e243991 pid=754000}: failed to write out the walredo errored input: No such file or directory (os error 2) target=walredo-1697223922004-1132-0.walredo length=1132
2023-10-13T19:05:22.004493Z DEBUG postgres applied 2 WAL records (1062 bytes) in 125344 us to reconstruct page image at LSN 0/0
2023-10-13T19:05:22.004501Z ERROR error applying 2 WAL records 0/16A9388..0/16D4080 (1062 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/0 n_attempts=1: apply_wal_records

Caused by:
    WAL redo process closed its stdout unexpectedly
2023-10-13T19:05:22.004588Z  INFO kill_and_wait_impl{pid=754000}: wait successful exit_status=signal: 11 (SIGSEGV) (core dumped)
2023-10-13T19:05:22.004624Z DEBUG wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task started
2023-10-13T19:05:22.004653Z ERROR wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: received output output="2023-10-13 19:05:21.884 GMT [754000] LOG:  registered custom resource manager \"neon\" with ID 134\n"
2023-10-13T19:05:22.004666Z DEBUG wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task finished
```
2023-10-23 09:00:13 +01:00
Christian Schwarz
9da67c4f19 walredo: make request_redo() an async fn (#5559)
Stacked atop https://github.com/neondatabase/neon/pull/5557
Prep work for https://github.com/neondatabase/neon/pull/5560

These changes have a 2% impact on `bench_walredo`.
That's likely because of the `block_on() in the innermost piece of
benchmark-only code.
So, it doesn't affect production code.
The use of closures in the benchmarking code prevents a straightforward
conversion of the whole benchmarking code to async.

before:

```
    $ cargo bench --features testing --bench bench_walredo
       Compiling pageserver v0.1.0 (/home/cs/src/neon/pageserver)
        Finished bench [optimized + debuginfo] target(s) in 2m 11s
         Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70)
    Gnuplot not found, using plotters backend
    short/short/1           time:   [26.363 µs 27.451 µs 28.573 µs]
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
    short/short/2           time:   [64.340 µs 64.927 µs 65.485 µs]
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) low mild
    short/short/4           time:   [101.98 µs 104.06 µs 106.13 µs]
    short/short/8           time:   [151.42 µs 152.74 µs 154.03 µs]
    short/short/16          time:   [296.30 µs 297.53 µs 298.88 µs]
    Found 14 outliers among 100 measurements (14.00%)
      10 (10.00%) high mild
      4 (4.00%) high severe

    medium/medium/1         time:   [225.12 µs 225.90 µs 226.66 µs]
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) low mild
    medium/medium/2         time:   [490.80 µs 491.64 µs 492.49 µs]
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) low mild
    medium/medium/4         time:   [934.47 µs 936.49 µs 938.52 µs]
    Found 5 outliers among 100 measurements (5.00%)
      3 (3.00%) low mild
      1 (1.00%) high mild
      1 (1.00%) high severe
    medium/medium/8         time:   [1.8364 ms 1.8412 ms 1.8463 ms]
    Found 4 outliers among 100 measurements (4.00%)
      4 (4.00%) high mild
    medium/medium/16        time:   [3.6694 ms 3.6896 ms 3.7104 ms]

```

after:

```

    $ cargo bench --features testing --bench bench_walredo
       Compiling pageserver v0.1.0 (/home/cs/src/neon/pageserver)
        Finished bench [optimized + debuginfo] target(s) in 2m 11s
         Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70)
    Gnuplot not found, using plotters backend
    short/short/1           time:   [28.345 µs 28.529 µs 28.699 µs]
                            change: [-0.2201% +3.9276% +8.2451%] (p = 0.07 > 0.05)
                            No change in performance detected.
    Found 17 outliers among 100 measurements (17.00%)
      4 (4.00%) low severe
      5 (5.00%) high mild
      8 (8.00%) high severe
    short/short/2           time:   [66.145 µs 66.719 µs 67.274 µs]
                            change: [+1.5467% +2.7605% +3.9927%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 5 outliers among 100 measurements (5.00%)
      5 (5.00%) low mild
    short/short/4           time:   [105.51 µs 107.52 µs 109.49 µs]
                            change: [+0.5023% +3.3196% +6.1986%] (p = 0.02 < 0.05)
                            Change within noise threshold.
    short/short/8           time:   [151.90 µs 153.16 µs 154.41 µs]
                            change: [-1.0001% +0.2779% +1.4221%] (p = 0.65 > 0.05)
                            No change in performance detected.
    short/short/16          time:   [297.38 µs 298.26 µs 299.20 µs]
                            change: [-0.2953% +0.2462% +0.7763%] (p = 0.37 > 0.05)
                            No change in performance detected.
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) high mild

    medium/medium/1         time:   [229.76 µs 230.72 µs 231.69 µs]
                            change: [+1.5804% +2.1354% +2.6635%] (p = 0.00 < 0.05)
                            Performance has regressed.
    medium/medium/2         time:   [501.14 µs 502.31 µs 503.64 µs]
                            change: [+1.8730% +2.1709% +2.5199%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 7 outliers among 100 measurements (7.00%)
      1 (1.00%) low mild
      1 (1.00%) high mild
      5 (5.00%) high severe
    medium/medium/4         time:   [954.15 µs 956.74 µs 959.33 µs]
                            change: [+1.7962% +2.1627% +2.4905%] (p = 0.00 < 0.05)
                            Performance has regressed.
    medium/medium/8         time:   [1.8726 ms 1.8785 ms 1.8848 ms]
                            change: [+1.5858% +2.0240% +2.4626%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 6 outliers among 100 measurements (6.00%)
      1 (1.00%) low mild
      3 (3.00%) high mild
      2 (2.00%) high severe
    medium/medium/16        time:   [3.7565 ms 3.7746 ms 3.7934 ms]
                            change: [+1.5503% +2.3044% +3.0818%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 3 outliers among 100 measurements (3.00%)
      3 (3.00%) high mild
```
2023-10-18 11:23:06 +01:00
Joonas Koivunen
9e1449353d crash-consistent layer map through index_part.json (#5198)
Fixes #5172 as it:
- removes recoinciliation with remote index_part.json and accepts remote
index_part.json as the truth, deleting any local progress which is yet
to be reflected in remote
- moves to prefer remote metadata

Additionally:
- tests with single LOCAL_FS parametrization are cleaned up
- adds a test case for branched (non-bootstrap) local only timeline
availability after restart

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2023-10-17 10:04:56 +01:00
John Spray
b06dffe3dc pageserver: fixes to /location_config API (#5548)
## Problem

I found some issues with the `/location_config` API when writing new
tests.

## Summary of changes

- Calling the API with the "Detached" state is now idempotent.
- `Tenant::spawn_attach` now takes a boolean to indicate whether to
expect a marker file. Marker files are used in the old attach path, but
not in the new location conf API. They aren't needed because in the New
World, the choice of whether to attach via remote state ("attach") or to
trust local state ("load") will be revised to cope with the transitions
between secondary & attached (see
https://github.com/neondatabase/neon/issues/5550). It is okay to merge
this change ahead of that ticket, because the API is not used in the
wild yet.
- Instead of using `schedule_local_tenant_processing`, the location conf
API handler does its own directory creation and calls `spawn_attach`
directly.
- A new `unsafe_create_dir_all` is added. This differs from
crashsafe::create_dir_all in two ways:
- It is intentionally not crashsafe, because in the location conf API we
are no longer using directory or config existence as the signal for any
important business logic.
   - It is async and uses `tokio::fs`.
2023-10-17 10:21:31 +02:00
John Spray
ded7f48565 pageserver: measure startup duration spent fetching remote indices (#5564)
## Problem

Currently it's unclear how much of the `initial_tenant_load` period is
in S3 objects, and therefore how impactful it is to make changes to
remote operations during startup.

## Summary of changes

- `Tenant::load` is refactored to load remote indices in parallel and to
wait for all these remote downloads to finish before it proceeds to
construct any `Timeline` objects.
- `pageserver_startup_duration_seconds` gets a new `phase` value of
`initial_tenant_load_remote` which counts the time from startup to when
the last tenant finishes loading remote content.
- `test_pageserver_restart` is extended to validate this phase. The
previous version of the test was relying on order of dict entries, which
stopped working when adding a phase, so this is refactored a bit.
- `test_pageserver_restart` used to explicitly create a branch, now it
uses the default initial_timeline. This avoids startup getting held up
waiting for logical sizes, when one of the branches is not in use.
2023-10-16 18:21:37 +01:00
Christian Schwarz
dd6990567f walredo: apply_batch_postgres: get a backtrace whenever it encounters an error (#5541)
For 2 weeks we've seen rare, spurious, not-reproducible page
reconstruction
failures with PG16 in prod.

One of the commits we deployed this week was

Commit

    commit fc467941f9
    Author: Joonas Koivunen <joonas@neon.tech>
    Date:   Wed Oct 4 16:19:19 2023 +0300

        walredo: log retryed error (#546)

With the logs from that commit, we learned that some read() or write()
system call that walredo does fails with `EAGAIN`, aka
`Resource temporarily unavailable (os error 11)`.

But we have no idea where exactly in the code we get back that error.

So, use anyhow instead of fake std::io::Error's as an easy way to get
a backtrace when the error happens, and change the logging to print
that backtrace (i.e., use `{:?}` instead of
`utils::error::report_compact_sources(e)`).

The `WalRedoError` type had to go because we add additional `.context()`
further up the call chain before we `{:?}`-print it. That additional
`.context()` further up doesn't see that there's already an
anyhow::Error
inside the `WalRedoError::ApplyWalRecords` variant, and hence captures
another backtrace and prints that one on `{:?}`-print instead of the
original one inside `WalRedoError::ApplyWalRecords`.

If we ever switch back to `report_compact_sources`, we should make sure
we have some other way to uniquely identify the places where we return
an error in the error message.
2023-10-13 14:08:23 +00:00
Joonas Koivunen
ddceb9e6cd fix(branching): read last record lsn only after Tenant::gc_cs (#5535)
Fixes #5531, at least the latest error of not being able to create a
branch from the head under write and gc pressure.
2023-10-11 16:24:36 +01:00
John Spray
acefee9a32 pageserver: flush deletion queue on detach (#5452)
## Problem

If a caller detaches a tenant and then attaches it again, pending
deletions from the old attachment might not have happened yet. This is
not a correctness problem, but it causes:
- Risk of leaking some objects in S3
- Some warnings from the deletion queue when pending LSN updates and
pending deletions don't pass validation.

## Summary of changes

- Deletion queue now uses UnboundedChannel so that the push interfaces
don't have to be async.
- This was pulled out of https://github.com/neondatabase/neon/pull/5397,
where it is also useful to be able to drive the queue from non-async
contexts.
- Why is it okay for this to be unbounded? The only way the
unbounded-ness of the channel can become a problem is if writing out
deletion lists can't keep up, but if the system were that overloaded
then the code generating deletions (GC, compaction) would also be
impacted.
- DeletionQueueClient gets a new `flush_advisory` function, which is
like flush_execute, but doesn't wait for completion: this is appropriate
for use in contexts where we would like to encourage the deletion queue
to flush, but don't need to block on it.
- This function is also expected to be useful in next steps for seamless
migration, where the option to flush to S3 while transitioning into
AttachedStale will also include flushing deletion queue, but we wouldn't
want to block on that flush.
- The tenant_detach code in mgr.rs invokes flush_advisory after stopping
the `Tenant` object.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-10-10 10:46:24 +01:00
Joonas Koivunen
4772cd6c93 fix: deny branching, starting compute from not yet uploaded timelines (#5484)
Part of #5172. First commits show that we used to allow starting up a
compute or creating a branch off a not yet uploaded timeline. This PR
moves activation of a timeline to happen **after** initial layer file(s)
(if any) and `index_part.json` have been uploaded. Simply moving
activation to be *after* downloads have finished works because we now
spawn a task per http request handler.

Current behaviour of uploading on the timelines on next startup is kept,
to be removed later as part of #5172.

Adds:
- `NeonCli.map_branch` and corresponding `neon_local` implementation:
allow creating computes for timelines managed via pageserver http
client/api
- possibly duplicate tests (I did not want to search for, will cleanup
in a follow-up if these duplicated)

Changes:
- make `wait_until_tenant_state` return immediatedly on `Broken` and not
wait more
2023-10-09 17:03:38 +03:00
John Spray
baa5fa1e77 pageserver: location configuration API, attachment modes, secondary locations (#5299)
## Problem

These changes are part of building seamless tenant migration, as
described in the RFC:
- https://github.com/neondatabase/neon/pull/5029

## Summary of changes

- A new configuration type `LocationConf` supersedes `TenantConfOpt` for
storing a tenant's configuration in the pageserver repo dir. It contains
`TenantConfOpt`, as well as a new `mode` attribute that describes what
kind of location this is (secondary, attached, attachment mode etc). It
is written to a file called `config-v1` instead of `config` -- this
prepares us for neatly making any other profound changes to the format
of the file in future. Forward compat for existing pageserver code is
achieved by writing out both old and new style files. Backward compat is
achieved by checking for the old-style file if the new one isn't found.
- The `TenantMap` type changes, to hold `TenantSlot` instead of just
`Tenant`. The `Tenant` type continues to be used for attached tenants
only. Tenants in other states (such as secondaries) are represented by a
different variant of `TenantSlot`.
- Where `Tenant` & `Timeline` used to hold an Arc<Mutex<TenantConfOpt>>,
they now hold a reference to a AttachedTenantConf, which includes the
extra information from LocationConf. This enables them to know the
current attachment mode.
- The attachment mode is used as an advisory input to decide whether to
do compaction and GC (AttachedStale is meant to avoid doing uploads,
AttachedMulti is meant to avoid doing deletions).
- A new HTTP API is added at `PUT /tenants/<tenant_id>/location_config`
to drive new location configuration. This provides a superset of the
functionality of attach/detach/load/ignore:
  - Attaching a tenant is just configuring it in an attached state
  - Detaching a tenant is configuring it to a detached state
  - Loading a tenant is just the same as attaching it
- Ignoring a tenant is the same as configuring it into Secondary with
warm=false (i.e. retain the files on disk but do nothing else).

Caveats:
- AttachedMulti tenants don't do compaction in this PR, but they do in
the follow on #5397
- Concurrent updates to the `location_config` API are not handled
elegantly in this PR, a better mechanism is added in the follow on
https://github.com/neondatabase/neon/pull/5367
- Secondary mode is just a placeholder in this PR: the code to upload
heatmaps and do downloads on secondary locations will be added in a
later PR (but that shouldn't change any external interfaces)

Closes: https://github.com/neondatabase/neon/issues/5379

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-10-05 09:55:10 +01:00
duguorong009
25a37215f3 fix: replace all std::PathBufs with camino::Utf8PathBuf (#5352)
Fixes #4689 by replacing all of `std::Path` , `std::PathBuf` with
`camino::Utf8Path`, `camino::Utf8PathBuf` in
- pageserver
- safekeeper
- control_plane
- libs/remote_storage

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-04 17:52:23 +03:00
John Spray
ba92668e37 pageserver: deletion queue & generation validation for deletions (#5207)
## Problem

Pageservers must not delete objects or advertise updates to
remote_consistent_lsn without checking that they hold the latest
generation for the tenant in question (see [the RFC](
https://github.com/neondatabase/neon/blob/main/docs/rfcs/025-generation-numbers.md))

In this PR:
- A new "deletion queue" subsystem is introduced, through which
deletions flow
- `RemoteTimelineClient` is modified to send deletions through the
deletion queue:
- For GC & compaction, deletions flow through the full generation
verifying process
- For timeline deletions, deletions take a fast path that bypasses
generation verification
- The `last_uploaded_consistent_lsn` value in `UploadQueue` is replaced
with a mechanism that maintains a "projected" lsn (equivalent to the
previous property), and a "visible" LSN (which is the one that we may
share with safekeepers).
- Until `control_plane_api` is set, all deletions skip generation
validation
- Tests are introduced for the new functionality in
`test_pageserver_generations.py`

Once this lands, if a pageserver is configured with the
`control_plane_api` configuration added in
https://github.com/neondatabase/neon/pull/5163, it becomes safe to
attach a tenant to multiple pageservers concurrently.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-09-26 16:11:55 +01:00
Christian Schwarz
a0c82969a2 page cache: per-task-kind access stats (#5339)
This PR adds a `task_kind` label to page cache access metrics.

These are to validate our hypothesis that the high hit page cache rate
we observe in prod is due to internal tasks, not getpage requests from
compute.
We believe the latter should near-always be a pageserver-page-cache
_miss_ because compute has it's own page cache, and hence there is no
locality of reference for its accesses to pageserver page cache.

Before this PR, we didn't have `RequestContext` propagation to any code
below the on-demand downloader.
The vast majority of changes in this PR is concerned with adding that
propagation.
2023-09-25 18:30:10 +02:00
John Spray
61d661a6c3 pageserver: generation number fetch on startup and use in /attach (#5163)
## Problem

- #5050 

Closes: https://github.com/neondatabase/neon/issues/5136

## Summary of changes

- A new configuration property `control_plane_api` controls other
functionality in this PR: if it is unset (default) then everything still
works as it does today.
- If `control_plane_api` is set, then on startup we call out to control
plane `/re-attach` endpoint to discover our attachments and their
generations. If an attachment is missing from the response we implicitly
detach the tenant.
- Calls to pageserver `/attach` API may include a `generation`
parameter. If `control_plane_api` is set, then this parameter is
mandatory.
- RemoteTimelineClient's loading of index_part.json is generation-aware,
and will try to load the index_part with the most recent generation <=
its own generation.
- The `neon_local` testing environment now includes a new binary
`attachment_service` which implements the endpoints that the pageserver
requires to operate. This is on by default if running `cargo neon` by
hand. In `test_runner/` tests, it is off by default: existing tests
continue to run with in the legacy generation-less mode.

Caveats:
- The re-attachment during startup assumes that we are only re-attaching
tenants that have previously been attached, and not totally new tenants
-- this relies on the control plane's attachment logic to keep retrying
so that we should eventually see the attach API call. That's important
because the `/re-attach` API doesn't tell us which timelines we should
attach -- we still use local disk state for that. Ref:
https://github.com/neondatabase/neon/issues/5173
- Testing: generations are only enabled for one integration test right
now (test_pageserver_restart), as a smoke test that all the machinery
basically works. Writing fuller tests that stress tenant migration will
come later, and involve extending our test fixtures to deal with
multiple pageservers.
- I'm not in love with "attachment_service" as a name for the neon_local
component, but it's not very important because we can easily rename
these test bits whenever we want.
- Limited observability when in re-attach on startup: when I add
generation validation for deletions in a later PR, I want to wrap up the
control plane API calls in some small client class that will expose
metrics for things like errors calling the control plane API, which will
act as a strong red signal that something is not right.

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-06 14:44:48 +01:00
John Spray
743933176e scrubber: add scan-metadata and hook into integration tests (#5176)
## Problem

- Scrubber's `tidy` command requires presence of a control plane
- Scrubber has no tests at all 

## Summary of changes

- Add re-usable async streams for reading metadata from a bucket
- Add a `scan-metadata` command that reads from those streams and calls
existing `checks.rs` code to validate metadata, then returns a summary
struct for the bucket. Command returns nonzero status if errors are
found.
- Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that
tests using remote storage can request to have the scrubber run after
they finish
- Enable remote storarge and scrub_on_exit in test_pageserver_restart
and test_pageserver_chaos

This is a "toe in the water" of the overall space of validating the
scrubber. Later, we should:
- Enable scrubbing at end of tests using remote storage by default
- Make the success condition stricter than "no errors": tests should
declare what tenants+timelines they expect to see in the bucket (or
sniff these from the functions tests use to create them) and we should
require that the scrubber reports on these particular tenants/timelines.

The `tidy` command is untouched in this PR, but it should be refactored
later to use similar async streaming interface instead of the current
batch-reading approach (the streams are faster with large buckets), and
to also be covered by some tests.


---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2023-09-06 11:55:24 +01:00
Arpad Müller
6cd497bb44 Make VirtualFile::crashsafe_overwrite async fn (#5189)
## Problem

The `VirtualFile::crashsafe_overwrite` function was introduced by #5186
but it was not turned `async fn` yet. We want to make these functions
async fn as part of #4743.

## Summary of changes

Make `VirtualFile::crashsafe_overwrite` async fn, as well as all the
functions calling it. Don't make anything inside `crashsafe_overwrite`
use async functionalities, as per #4743 instructions.

Also, add rustdoc to `crashsafe_overwrite`.

Part of #4743.
2023-09-04 12:52:35 +02:00
Christian Schwarz
7e817789d5 VirtualFile: add crash-safe overwrite abstraction & use it (#5186)
(part of #4743)
(preliminary to #5180)
 
This PR adds a special-purpose API to `VirtualFile` for write-once
files.
It adopts it for `save_metadata` and `persist_tenant_conf`.

This is helpful for the asyncification efforts (#4743) and specifically
asyncification of `VirtualFile` because above two functions were the
only ones that needed the VirtualFile to be an `std::io::Write`.
(There was also `manifest.rs` that needed the `std::io::Write`, but, it
isn't used right now, and likely won't be used because we're taking a
different route for crash consistency, see #5172. So, let's remove it.
It'll be in Git history if we need to re-introduce it when picking up
the compaction work again; that's why it was introduced in the first
place).

We can't remove the `impl std::io::Write for VirtualFile` just yet
because of the `BufWriter` in

```rust
struct DeltaLayerWriterInner {
...
    blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>,
}
```

But, @arpad-m and I have a plan to get rid of that by extracting the
append-only-ness-on-top-of-VirtualFile that #4994 added to
`EphemeralFile` into an abstraction that can be re-used in the
`DeltaLayerWriterInner` and `ImageLayerWriterInner`.
That'll be another PR.


### Performance Impact

This PR adds more fsyncs compared to before because we fsync the parent
directory every time.

1. For `save_metadata`, the additional fsyncs are unnecessary because we
know that `metadata` fits into a kernel page, and hence the write won't
be torn on the way into the kernel. However, the `metadata` file in
general is going to lose signficance very soon (=> see #5172), and the
NVMes in prod can definitely handle the additional fsync. So, let's not
worry about it.
2. For `persist_tenant_conf`, which we don't check to fit into a single
kernel page, this PR makes it actually crash-consistent. Before, we
could crash while writing out the tenant conf, leaving a prefix of the
tenant conf on disk.
2023-09-02 10:06:14 +02:00
Christian Schwarz
cfc0fb573d pageserver: run all Rust tests with remote storage enabled (#5164)
For
[#5086](https://github.com/neondatabase/neon/pull/5086#issuecomment-1701331777)
we will require remote storage to be configured in pageserver.

This PR enables `localfs`-based storage for all Rust unit tests.

Changes:

- In `TenantHarness`, set up localfs remote storage for the tenant.
- `create_test_timeline` should mimic what real timeline creation does,
and real timeline creation waits for the timeline to reach remote
storage. With this PR, `create_test_timeline` now does that as well.
- All the places that create the harness tenant twice need to shut down
the tenant before the re-create through a second call to `try_load` or
`load`.
- Without shutting down, upload tasks initiated by/through the first
incarnation of the harness tenant might still be ongoing when the second
incarnation of the harness tenant is `try_load`/`load`ed. That doesn't
make sense in the tests that do that, they generally try to set up a
scenario similar to pageserver stop & start.
- There was one test that recreates a timeline, not the tenant. For that
case, I needed to create a `Timeline::shutdown` method. It's a
refactoring of the existing `Tenant::shutdown` method.
- The remote_timeline_client tests previously set up their own
`GenericRemoteStorage` and `RemoteTimelineClient`. Now they re-use the
one that's pre-created by the TenantHarness. Some adjustments to the
assertions were needed because the assertions now need to account for
the initial image layer that's created by `create_test_timeline` to be
present.
2023-09-01 18:10:40 +02:00
John Spray
616e7046c7 s3_scrubber: import into the main neon repository (#5141)
## Problem

The S3 scrubber currently lives at
https://github.com/neondatabase/s3-scrubber

We don't have tests that use it, and it has copies of some data
structures that can get stale.

## Summary of changes

- Import the s3-scrubber as `s3_scrubber/
- Replace copied_definitions/ in the scrubber with direct access to the
`utils` and `pageserver` crates
- Modify visibility of a few definitions in `pageserver` to allow the
scrubber to use them
- Update scrubber code for recent changes to `IndexPart`
- Update `KNOWN_VERSIONS` for IndexPart and move the definition into
index.rs so that it is easier to keep up to date

As a future refinement, it would be good to pull the remote persistence
types (like IndexPart) out of `pageserver` into a separate library so
that the scrubber doesn't have to link against the whole pageserver, and
so that it's clearer which types need to be public.

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-31 19:01:39 +01:00
John Spray
83ae2bd82c pageserver: generation number support in keys and indices (#5140)
## Problem

To implement split brain protection, we need tenants and timelines to be
aware of their current generation, and use it when composing S3 keys.


## Summary of changes

- A `Generation` type is introduced in the `utils` crate -- it is in
this broadly-visible location because it will later be used from
`control_plane/` as well as `pageserver/`. Generations can be a number,
None, or Broken, to support legacy content (None), and Tenants in the
broken state (Broken).
- Tenant, Timeline, and RemoteTimelineClient all get a generation
attribute
- IndexPart's IndexLayerMetadata has a new `generation` attribute.
Legacy layers' metadata will deserialize to Generation::none().
- Remote paths are composed with a trailing generation suffix. If a
generation is equal to Generation::none() (as it currently always is),
then this suffix is an empty string.
- Functions for composing remote storage paths added in
remote_timeline_client: these avoid the way that we currently always
compose a local path and then strip the prefix, and avoid requiring a
PageserverConf reference on functions that want to create remote paths
(the conf is only needed for local paths). These are less DRY than the
old functions, but remote storage paths are a very rarely changing
thing, so it's better to write out our paths clearly in the functions
than to compose timeline paths from tenant paths, etc.
- Code paths that construct a Tenant take a `generation` argument in
anticipation that we will soon load generations on startup before
constructing Tenant.

Until the whole feature is done, we don't want any generation-ful keys
though: so initially we will carry this everywhere with the special
Generation::none() value.

Closes: https://github.com/neondatabase/neon/issues/5135

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-08-31 09:19:34 +01:00
John Spray
b758bf47ca pageserver: refactor TimelineMetadata serialization in IndexPart (#5091)
## Problem

The `metadata_bytes` field of IndexPart required explicit
deserialization & error checking everywhere it was used -- there isn't
anything special about this structure that should prevent it from being
serialized & deserialized along with the rest of the structure.

## Summary of changes

- Implement Serialize and Deserialize for TimelineMetadata
- Replace IndexPart::metadata_bytes with a simpler `metadata`, that can
be used directly.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-25 16:16:20 +01:00
Joonas Koivunen
76aa01c90f refactor: single phase Timeline::load_layer_map (#5074)
Current implementation first calls `load_layer_map`, which loads all
local layers, cleans up files, leave cleaning up stuff to "second
function". Then the "second function" is finally called, it does not do
the cleanup and some of the first functions setup can torn down. "Second
function" is actually both `reconcile_with_remote` and
`create_remote_layers`.

This change makes it a bit more verbose but in one phase with the
following sub-steps:
1. scan the timeline directory
2. delete extra files
    - now including on-demand download files
    - fixes #3660
3. recoincile the two sources of layers (directory, index_part)
4. rename_to_backup future layers, short layers
5. create the remaining as layers

Needed by #4938.

It was also noticed that this is blocking code in an `async fn` so just
do it in a `spawn_blocking`, which should be healthy for our startup
times. Other effects includes hopefully halving of `stat` calls; extra
calls which were not done previously are now done for the future layers.

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2023-08-24 16:07:40 +03:00
Joonas Koivunen
533a92636c refactor: pre-cleanup Layer, PersistentLayer and impls (#5059)
Remove pub but dead code, move trait methods as inherent methods, remove
unnecessary. Split off from #4938.
2023-08-22 21:14:28 +03:00
John Spray
615a490239 pageserver: refactor Tenant/Timeline args into structs (#5053)
## Problem

There are some common types that we pass into tenants and timelines as
we construct them, such as remote storage and the broker client.
Currently the list is small, but this is likely to grow -- the deletion
queue PR (#4960) pushed some methods to the point of clippy complaining
they had too many args, because of the extra deletion queue client being
passed around.

There are some shared objects that currently aren't passed around
explicitly because they use a static `once_cell` (e.g.
CONCURRENT_COMPACTIONS), but as we add more resource management and
concurreny control over time, it will be more readable & testable to
pass a type around in the respective Resources object, rather than to
coordinate via static objects. The `Resources` structures in this PR
will make it easier to add references to central coordination functions,
without having to rely on statics.

## Summary of changes

- For `Tenant`, the `broker_client` and `remote_storage` are bundled
into `TenantSharedResources`
- For `Timeline`, the `remote_client` is wrapped into
`TimelineResources`.

Both of these structures will get an additional deletion queue member in
#4960.
2023-08-21 17:30:28 +01:00
Dmitry Rodionov
9140a950f4 Resume tenant deletion on attach (#5039)
I'm still a bit nervous about attach -> crash case. But it should work.
(unlike case with timeline). Ideally would be cool to cover this with
test.

This continues tradition of adding bool flags for Tenant::set_stopping.
Probably lifecycle project will help with fixing it.
2023-08-20 12:28:50 +03:00
Arpad Müller
a23b0773f1 Fix DeltaLayer dumping (#5045)
## Problem

Before, DeltaLayer dumping (via `cargo run --release -p pagectl --
print-layer-file` ) would crash as one can't call `Handle::block_on` in
an async executor thread.

## Summary of changes

Avoid the problem by using `DeltaLayerInner::load_keys` to load the keys
into RAM (which we already do during compaction), and then load the
values one by one during dumping.
2023-08-19 00:56:03 +02:00
Joonas Koivunen
368ee6c8ca refactor: failpoint support (#5033)
- move them to pageserver which is the only dependant on the crate fail
- "move" the exported macro to the new module
- support at init time the same failpoints as runtime

Found while debugging test failures and making tests more repeatable by
allowing "exit" from pageserver start via environment variables. Made
those changes to `test_gc_cutoff.py`.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-08-19 01:01:44 +03:00
Dmitry Rodionov
f6c671c140 resume timeline deletions on attach (#5030)
closes [#5036](https://github.com/neondatabase/neon/issues/5036)
2023-08-18 20:48:33 +03:00
Christian Schwarz
7a63685cde simplify page-caching of EphemeralFile (#4994)
(This PR is the successor of https://github.com/neondatabase/neon/pull/4984 )

## Summary

The current way in which `EphemeralFile` uses `PageCache` complicates
the Pageserver code base to a degree that isn't worth it.
This PR refactors how we cache `EphemeralFile` contents, by exploiting
the append-only nature of `EphemeralFile`.

The result is that `PageCache` only holds `ImmutableFilePage` and
`MaterializedPage`.
These types of pages are read-only and evictable without write-back.
This allows us to remove the writeback code from `PageCache`, also
eliminating an entire failure mode.

Futher, many great open-source libraries exist to solve the problem of a
read-only cache,
much better than our `page_cache.rs` (e.g., better replacement policy,
less global locking).
With this PR, we can now explore using them.

## Problem & Analysis

Before this PR, `PageCache` had three types of pages:

* `ImmutableFilePage`: caches Delta / Image layer file contents
* `MaterializedPage`: caches results of Timeline::get (page
materialization)
* `EphemeralPage`: caches `EphemeralFile` contents

`EphemeralPage` is quite different from `ImmutableFilePage` and
`MaterializedPage`:

* Immutable and materialized pages are for the acceleration of (future)
reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of
DRAM.
* Ephemeral pages are a write-back cache of `EphemeralFile` contents,
i.e., if there is pressure in the page cache, we spill `EphemeralFile`
contents to disk.

`EphemeralFile` is only used by `InMemoryLayer`, for the following
purposes:
* **write**: when filling up the `InMemoryLayer`, via `impl BlobWriter
for EphemeralFile`
* **read**: when doing **page reconstruction** for a page@lsn that isn't
written to disk
* **read**: when writing L0 layer files, we re-read the `InMemoryLayer`
and put the contents into the L0 delta writer
(**`create_delta_layer`**). This happens every 10min or when
InMemoryLayer reaches 256MB in size.

The access patterns of the `InMemoryLayer` use case are as follows:

* **write**: via `BlobWriter`, strictly append-only
* **read for page reconstruction**: via `BlobReader`, random
* **read for `create_delta_layer`**: via `BlobReader`, dependent on
data, but generally random. Why?
* in classical LSM terms, this function is what writes the
memory-resident `C0` tree into the disk-resident `C1` tree
* in our system, though, the values of InMemoryLayer are stored in an
EphemeralFile, and hence they are not guaranteed to be memory-resident
* the function reads `Value`s in `Key, LSN` order, which is `!=` insert
order

What do these `EphemeralFile`-level access patterns mean for the page
cache?

* **write**:
* the common case is that `Value` is a WAL record, and if it isn't a
full-page-image WAL record, then it's smaller than `PAGE_SIZE`
* So, the `EphemeralPage` pages act as a buffer for these `< PAGE_CACHE`
sized writes.
* If there's no page cache eviction between subsequent
`InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident,
so the page cache avoids doing a `write` system call.
* In practice, a busy page server will have page cache evictions because
we only configure 64MB of page cache size.
* **reads for page reconstruction**: read acceleration, just as for the
other page types.
* **reads for `create_delta_layer`**:
* The `Value` reads happen through a `BlockCursor`, which optimizes the
case of repeated reads from the same page.
* So, the best case is that subsequent values are located on the same
page; hence `BlockCursor`s buffer is maximally effective.
* The worst case is that each `Value` is on a different page; hence the
`BlockCursor`'s 1-page-sized buffer is ineffective.
* The best case translates into `256MB/PAGE_SIZE` page cache accesses,
one per page.
    * the worst case translates into `#Values` page cache accesses
* again, the page cache accesses must be assumed to be random because
the `Value`s aren't accessed in insertion order but `Key, LSN` order.

## Summary of changes

Preliminaries for this PR were:

- #5003
- #5004 
- #5005 
  - uncommitted microbenchmark in #5011 

Based on the observations outlined above, this PR makes the following
changes:

* Rip out `EphemeralPage` from `page_cache.rs`
* Move the `block_io::FileId` to `page_cache::FileId`
* Add a `PAGE_SIZE`d buffer to the `EphemeralPage` struct.
  It's called `mutable_tail`.
* Change `write_blob` to use `mutable_tail` for the write buffering
instead of a page cache page.
* if `mutable_tail` is full, it writes it out to disk, zeroes it out,
and re-uses it.
* There is explicitly no double-buffering, so that memory allocation per
`EphemeralFile` instance is fixed.
* Change `read_blob` to return different `BlockLease` variants depending
on `blknum`
* for the `blknum` that corresponds to the `mutable_tail`, return a ref
to it
* Rust borrowing rules prevent `write_blob` calls while refs are
outstanding.
  * for all non-tail blocks, return a page-cached `ImmutablePage`
* It is safe to page-cache these as ImmutablePage because EphemeralFile
is append-only.

## Performance 

How doe the changes above affect performance?
M claim is: not significantly.

* **write path**:
* before this PR, the `EphemeralFile::write_blob` didn't issue its own
`write` system calls.
* If there were enough free pages, it didn't issue *any* `write` system
calls.
* If it had to evict other `EphemeralPage`s to get pages a page for its
writes (`get_buf_for_write`), the page cache code would implicitly issue
the writeback of victim pages as needed.
* With this PR, `EphemeralFile::write_blob` *always* issues *all* of its
*own* `write` system calls.
* Also, the writes are explicit instead of implicit through page cache
write back, which will help #4743
* The perf impact of always doing the writes is the CPU overhead and
syscall latency.
* Before this PR, we might have never issued them if there were enough
free pages.
* We don't issue `fsync` and can expect the writes to only hit the
kernel page cache.
* There is also an advantage in issuing the writes directly: the perf
impact is paid by the tenant that caused the writes, instead of whatever
tenant evicts the `EphemeralPage`.
* **reads for page reconstruction**: no impact.
* The `write_blob` function pre-warms the page cache when it writes the
`mutable_tail` to disk.
* So, the behavior is the same as with the EphemeralPages before this
PR.
* **reads for `create_delta_layer`**: no impact.
  * Same argument as for page reconstruction.
  * Note for the future:
* going through the page cache likely causes read amplification here.
Why?
* Due to the `Key,Lsn`-ordered access pattern, we don't read all the
values in the page before moving to the next page. In the worst case, we
might read the same page multiple times to read different `Values` from
it.
    * So, it might be better to bypass the page cache here.
    * Idea drafts:
      * bypass PS page cache + prefetch pipeline + iovec-based IO
* bypass PS page cache + use `copy_file_range` to copy from ephemeral
file into the L0 delta file, without going through user space
2023-08-18 20:31:03 +03:00
Joonas Koivunen
6af5f9bfe0 fix: format context (#5022)
We return an error with unformatted `{timeline_id}`.
2023-08-17 14:30:25 +00:00
Dmitry Rodionov
c58b22bacb Delete tenant's data from s3 (#4855)
## Summary of changes

For context see
https://github.com/neondatabase/neon/blob/main/docs/rfcs/022-pageserver-delete-from-s3.md

Create Flow to delete tenant's data from pageserver. The approach
heavily mimics previously implemented timeline deletion implemented
mostly in https://github.com/neondatabase/neon/pull/4384 and followed up
in https://github.com/neondatabase/neon/pull/4552

For remaining deletion related issues consult with deletion project
here: https://github.com/orgs/neondatabase/projects/33

resolves #4250
resolves https://github.com/neondatabase/neon/issues/3889

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-08-10 18:53:16 +03:00
John Spray
4dc644612b pageserver: expose prometheus metrics for startup time (#4893)
## Problem

Currently to know how long pageserver startup took requires inspecting
logs.

## Summary of changes

`pageserver_startup_duration_ms` metric is added, with label `phase` for
different phases of startup.

These are broken down by phase, where the phases correspond to the
existing wait points in the code:
- Start of doing I/O
- When tenant load is done
- When initial size calculation is done
- When background jobs start
- Then "complete" when everything is done.

`pageserver_startup_is_loading` is a 0/1 gauge that indicates whether we are in the initial load of tenants.

`pageserver_tenant_activation_seconds` is a histogram of time in seconds taken to activate a tenant.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-08-08 12:41:37 +03:00
Joonas Koivunen
3a00a5deb2 refactor: tidy consumption metrics (#4860)
Tidying up I've been wanting to do for some time.

Follow-up to #4857.
2023-08-01 18:14:16 +03:00
Dmitry Rodionov
cafbe8237e Move tenant/delete.rs to tenant/timeline/delete.rs (#4825)
move tenant/delete.rs to tenant/timeline/delete.rs to prepare for
appearance of tenant deletion routines in tenant/delete.rs
2023-07-27 15:52:36 +03:00
Dmitry Rodionov
874c31976e dedup cleanup fs traces (#4778)
This is a follow up for discussion:
https://github.com/neondatabase/neon/pull/4552#discussion_r1253417777
see context there
2023-07-26 18:39:32 +03:00
Dmitry Rodionov
700d929529 Init Timeline in Stopping state in create_timeline_struct when Cause::Delete (#4780)
See
https://github.com/neondatabase/neon/pull/4552#discussion_r1258368127
for context.

TLDR: use CreateTimelineCause to infer desired state instead of using
.set_stopping after initialization
2023-07-26 14:05:18 +03:00
Dmitry Rodionov
6d023484ed Use mark file to allow for deletion operations to continue through restarts (#4552)
## Problem

Currently we delete local files first, so if pageserver restarts after
local files deletion then remote deletion is not continued. This can be
solved with inversion of these steps.

But even if these steps are inverted when index_part.json is deleted
there is no way to distinguish between "this timeline is good, we just
didnt upload it to remote" and "this timeline is deleted we should
continue with removal of local state". So to solve it we use another
mark file. After index part is deleted presence of this mark file
indentifies that it was a deletion intention.

Alternative approach that was discussed was to delete all except
metadata first, and then delete metadata and index part. In this case we
still do not support local only configs making them rather unsafe
(deletion in them is already unsafe, but this direction solidifies this
direction instead of fixing it). Another downside is that if we crash
after local metadata gets removed we may leave dangling index part on
the remote which in theory shouldnt be a big deal because the file is
small.

It is not a big change to choose another approach at this point.

## Summary of changes

Timeline deletion sequence:
1. Set deleted_at in remote index part.
2. Create local mark file.
3. Delete local files except metadata (it is simpler this way, to be
able to reuse timeline initialization code that expects metadata)
4. Delete remote layers
5. Delete index part
6. Delete meta, timeline directory.
7. Delete mark file.

This works for local only configuration without remote storage.
Sequence is resumable from any point.

resolves #4453
resolves https://github.com/neondatabase/neon/pull/4552 (the issue was
created with async cancellation in mind, but we can still have issues
with retries if metadata is deleted among the first by remove_dir_all
(which doesnt have any ordering guarantees))

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-07-25 16:25:27 +03:00
Joonas Koivunen
f9214771b4 fix: count broken tenant more correct (#4800)
count only once; on startup create the counter right away because we
will not observe any changes later.

small, probably never reachable from outside fix for #4796.
2023-07-25 12:31:24 +03:00
Joonas Koivunen
77a68326c5 Thin out TenantState metric, keep set of broken tenants (#4796)
We currently have a timeseries for each of the tenants in different
states. We only want this for Broken. Other states could be counters.

Fix this by making the `pageserver_tenant_states_count` a counter
without a `tenant_id` and
add a `pageserver_broken_tenants_count` which has a `tenant_id` label,
each broken tenant being 1.
2023-07-25 11:15:54 +03:00
Joonas Koivunen
a25504deae Limit concurrent compactions (#4777)
Compactions can create a lot of concurrent work right now with #4265.

Limit compactions to use at most 6/8 background runtime threads.
2023-07-25 10:19:04 +03:00
arpad-m
e5b7ddfeee Preparatory pageserver async conversions (#4773)
In #4743, we'd like to convert the read path to use `async` rust. In
preparation of that, this PR switches some functions that are calling
lower level functions like `BlockReader::read_blk`,
`BlockCursor::read_blob`, etc into `async`. The PR does not switch all
functions however, and only focuses on the ones which are easy to
switch.

This leaves around some async functions that are (currently)
unnecessarily `async`, but on the other hand it makes future changes
smaller in diff.

Part of #4743 (but does not completely address it).
2023-07-24 14:01:54 +02:00