Commit Graph

1522 Commits

Author SHA1 Message Date
Christian Schwarz
5be8d38a63 fix deadlock around TENANTS (#5285)
The sequence that can lead to a deadlock:

1. DELETE request gets all the way to `tenant.shutdown(progress,
false).await.is_err() ` , while holding TENANTS.read()
2. POST request for tenant creation comes in, calls `tenant_map_insert`,
it does `let mut guard = TENANTS.write().await;`
3. Something that `tenant.shutdown()` needs to wait for needs a
`TENANTS.read().await`.
The only case identified in exhaustive manual scanning of the code base
is this one:
Imitate size access does `get_tenant().await`, which does
`TENANTS.read().await` under the hood.

In the above case (1) waits for (3), (3)'s read-lock request is queued
behind (2)'s write-lock, and (2) waits for (1).
Deadlock.

I made a reproducer/proof-that-above-hypothesis-holds in
https://github.com/neondatabase/neon/pull/5281 , but, it's not ready for
merge yet and we want the fix _now_.

fixes https://github.com/neondatabase/neon/issues/5284
2023-09-12 11:23:46 +02:00
Arpad Müller
a18d6d9ae3 Make File opening in VirtualFile async-compatible (#5280)
## Problem

Previously, we were using `observe_closure_duration` in `VirtualFile`
file opening code, but this doesn't support async open operations, which
we want to use as part of #4743.

## Summary of changes

* Move the duration measurement from the `with_file` macro into a
`observe_duration` macro.
* Some smaller drive-by fixes to replace the old strings with the new
variant names introduced by #5273

Part of #4743, follow-up of #5247.
2023-09-11 18:41:08 +02:00
Arpad Müller
76cc87398c Use tokio locks in VirtualFile and turn with_file into macro (#5247)
## Problem

For #4743, we want to convert everything up to the actual I/O operations
of `VirtualFile` to `async fn`.

## Summary of changes

This PR is the last change in a series of changes to `VirtualFile`:
#5189, #5190, #5195, #5203, and #5224.

It does the last preparations before the I/O operations are actually
made async. We are doing the following things:

* First, we change the locks for the file descriptor cache to tokio's
locks that support Send. This is important when one wants to hold locks
across await points (which we want to do), otherwise the Future won't be
Send. Also, one shouldn't generally block in async code as executors
don't like that.
* Due to the lock change, we now take an approach for the `VirtualFile`
destructors similar to the one proposed by #5122 for the page cache, to
use `try_write`. Similarly to the situation in the linked PR, one can
make an argument that if we are in the destructor and the slot has not
been reused yet, we are the only user accessing the slot due to owning
the lock mutably. It is still possible that we are not obtaining the
lock, but the only cause for that is the clock algorithm touching the
slot, which should be quite an unlikely occurence. For the instance of
`try_write` failing, we spawn an async task to destroy the lock. As just
argued however, most of the time the code path where we spawn the task
should not be visited.
* Lastly, we split `with_file` into a macro part, and a function part
that contains most of the logic. The function part returns a lock
object, that the macro uses. The macro exists to perform the operation
in a more compact fashion, saving code from putting the lock into a
variable and then doing the operation while measuring the time to run
it. We take the locks approach because Rust has no support for async
closures. One can make normal closures return a future, but that
approach gets into lifetime issues the moment you want to pass data to
these closures via parameters that has a lifetime (captures work). For
details, see
[this](https://smallcultfollowing.com/babysteps/blog/2023/03/29/thoughts-on-async-closures/)
and
[this](https://users.rust-lang.org/t/function-that-takes-an-async-closure/61663)
link. In #5224, we ran into a similar problem with the `test_files`
function, and we ended up passing the path and the `OpenOptions`
by-value instead of by-ref, at the expense of a few extra copies. This
can be done as the data is cheaply copyable, and we are in test code.
But here, we are not, and while `File::try_clone` exists, it [issues
system calls
internally](1e746d7741/library/std/src/os/fd/owned.rs (L94-L111)).
Also, it would allocate an entirely new file descriptor, something that
the fd cache was built to prevent.
* We change the `STORAGE_IO_TIME` metrics to support async.

Part of #4743.
2023-09-11 17:35:05 +02:00
duguorong009
d7fa2dba2d fix(pageserver): update the STORAGE_IO_TIME metrics to avoid expensive operations (#5273)
Introduce the `StorageIoOperation` enum, `StorageIoTime` struct, and
`STORAGE_IO_TIME_METRIC` static which provides lockless access to
histograms consumed by `VirtualFile`.

Closes #5131

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-11 14:58:15 +03:00
Rahul Modpur
999fe668e7 Ack tenant detach before local files are deleted (#5211)
## Problem

Detaching a tenant can involve many thousands of local filesystem
metadata writes, but the control plane would benefit from us not
blocking detach/delete responses on these.

## Summary of changes

After rename of local tenant directory ack tenant detach and delete
tenant directory in background

#5183 

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
2023-09-10 22:59:51 +03:00
Joonas Koivunen
720d59737a rust-1.72.0 changes (#5255)
Prepare to upgrade rust version to latest stable.

- `rustfmt` has learned to format `let irrefutable = $expr else { ...
};` blocks
- There's a new warning about virtual (workspace) crate resolver, picked
the latest resolver as I suspect everyone would expect it to be the
latest; should not matter anyways
- Some new clippies, which seem alright
2023-09-08 16:28:41 +03:00
Arpad Müller
d206655a63 Make VirtualFile::{open, open_with_options, create,sync_all,with_file} async fn (#5224)
## Problem

Once we use async file system APIs for `VirtualFile`, these functions
will also need to be async fn.

## Summary of changes

Makes the functions `open, open_with_options, create,sync_all,with_file`
of `VirtualFile` async fn, including all functions that call it. Like in
the prior PRs, the actual I/O operations are not using async APIs yet,
as per request in the #4743 epic.

We switch towards not using `VirtualFile` in the par_fsync module,
hopefully this is only temporary until we can actually do fully async
I/O in `VirtualFile`. This might cause us to exhaust fd limits in the
tests, but it should only be an issue for the local developer as we have
high ulimits in prod.

This PR is a follow-up of #5189, #5190, #5195, and #5203. Part of #4743.
2023-09-08 00:50:50 +02:00
duguorong009
706977fb77 fix(pageserver): add the walreceiver state to tenant timeline GET api endpoint (#5196)
Add a `walreceiver_state` field to `TimelineInfo` (response of `GET /v1/tenant/:tenant_id/timeline/:timeline_id`) and while doing that, refactor out a common `Timeline::walreceiver_state(..)`. No OpenAPI changes, because this is an internal debugging addition.

Fixes #3115.

Co-authored-by: Joonas Koivunen <joonas.koivunen@gmail.com>
2023-09-07 14:17:18 +03:00
Arpad Müller
7ba0f5c08d Improve comment in page cache (#5220)
It was easy to interpret comment in the page cache initialization code
to be about justifying why we leak here at all, not just why this
specific type of leaking is done (which the comment was actually meant
to describe).

See
https://github.com/neondatabase/neon/pull/5125#discussion_r1308445993

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-06 21:44:54 +02:00
Arpad Müller
6243b44dea Remove Virtual from FileBlockReaderVirtual variant name (#5225)
With #5181, the generics for `FileBlockReader` have been removed, so
having a `Virtual` postfix makes less sense now.
2023-09-06 20:54:57 +02:00
duguorong009
31e1568dee refactor(pageserver): refactor pageserver router state creation (#5165)
Fixes #3894 by:
- Refactor the pageserver router creation flow
- Create the router state in `pageserver/src/bin/pageserver.rs`
2023-09-06 21:31:49 +03:00
Chengpeng Yan
9a9187b81a Complete the missing metrics for files_created/bytes_written (#5120) 2023-09-06 14:00:15 -04:00
Arpad Müller
5e00c44169 Add WriteBlobWriter buffering and make VirtualFile::{write,write_all} async (#5203)
## Problem

We want to convert the `VirtualFile` APIs to async fn so that we can
adopt one of the async I/O solutions.

## Summary of changes

This PR is a follow-up of #5189, #5190, and #5195, and does the
following:

* Move the used `Write` trait functions of `VirtualFile` into inherent
functions
* Add optional buffering to `WriteBlobWriter`. The buffer is discarded
on drop, similarly to how tokio's `BufWriter` does it: drop is neither
async nor does it support errors.
* Remove the generics by `Write` impl of `WriteBlobWriter`, alwaays
using `VirtualFile`
* Rename `WriteBlobWriter` to `BlobWriter`
* Make various functions in the write path async, like
`VirtualFile::{write,write_all}`.

Part of #4743.
2023-09-06 18:17:12 +02:00
John Spray
61d661a6c3 pageserver: generation number fetch on startup and use in /attach (#5163)
## Problem

- #5050 

Closes: https://github.com/neondatabase/neon/issues/5136

## Summary of changes

- A new configuration property `control_plane_api` controls other
functionality in this PR: if it is unset (default) then everything still
works as it does today.
- If `control_plane_api` is set, then on startup we call out to control
plane `/re-attach` endpoint to discover our attachments and their
generations. If an attachment is missing from the response we implicitly
detach the tenant.
- Calls to pageserver `/attach` API may include a `generation`
parameter. If `control_plane_api` is set, then this parameter is
mandatory.
- RemoteTimelineClient's loading of index_part.json is generation-aware,
and will try to load the index_part with the most recent generation <=
its own generation.
- The `neon_local` testing environment now includes a new binary
`attachment_service` which implements the endpoints that the pageserver
requires to operate. This is on by default if running `cargo neon` by
hand. In `test_runner/` tests, it is off by default: existing tests
continue to run with in the legacy generation-less mode.

Caveats:
- The re-attachment during startup assumes that we are only re-attaching
tenants that have previously been attached, and not totally new tenants
-- this relies on the control plane's attachment logic to keep retrying
so that we should eventually see the attach API call. That's important
because the `/re-attach` API doesn't tell us which timelines we should
attach -- we still use local disk state for that. Ref:
https://github.com/neondatabase/neon/issues/5173
- Testing: generations are only enabled for one integration test right
now (test_pageserver_restart), as a smoke test that all the machinery
basically works. Writing fuller tests that stress tenant migration will
come later, and involve extending our test fixtures to deal with
multiple pageservers.
- I'm not in love with "attachment_service" as a name for the neon_local
component, but it's not very important because we can easily rename
these test bits whenever we want.
- Limited observability when in re-attach on startup: when I add
generation validation for deletions in a later PR, I want to wrap up the
control plane API calls in some small client class that will expose
metrics for things like errors calling the control plane API, which will
act as a strong red signal that something is not right.

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-06 14:44:48 +01:00
John Spray
743933176e scrubber: add scan-metadata and hook into integration tests (#5176)
## Problem

- Scrubber's `tidy` command requires presence of a control plane
- Scrubber has no tests at all 

## Summary of changes

- Add re-usable async streams for reading metadata from a bucket
- Add a `scan-metadata` command that reads from those streams and calls
existing `checks.rs` code to validate metadata, then returns a summary
struct for the bucket. Command returns nonzero status if errors are
found.
- Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that
tests using remote storage can request to have the scrubber run after
they finish
- Enable remote storarge and scrub_on_exit in test_pageserver_restart
and test_pageserver_chaos

This is a "toe in the water" of the overall space of validating the
scrubber. Later, we should:
- Enable scrubbing at end of tests using remote storage by default
- Make the success condition stricter than "no errors": tests should
declare what tenants+timelines they expect to see in the bucket (or
sniff these from the functions tests use to create them) and we should
require that the scrubber reports on these particular tenants/timelines.

The `tidy` command is untouched in this PR, but it should be refactored
later to use similar async streaming interface instead of the current
batch-reading approach (the streams are faster with large buckets), and
to also be covered by some tests.


---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2023-09-06 11:55:24 +01:00
duguorong009
4fec48f2b5 chore(pageserver): remove unnecessary logging in tenant task loops (#5188)
Fixes #3830 by adding the `#[cfg(not(feature = "testing"))]` attribute
to unnecessary loggings in `pageserver/src/tenant/tasks.rs`.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-06 13:19:19 +03:00
Arpad Müller
4904613aaa Convert VirtualFile::{seek,metadata} to async (#5195)
## Problem

We want to convert the `VirtualFile` APIs to async fn so that we can
adopt one of the async I/O solutions.

## Summary of changes

Convert the following APIs of `VirtualFile` to async fn (as well as all
of the APIs calling it):

* `VirtualFile::seek`
* `VirtualFile::metadata`
* Also, prepare for deletion of the write impl by writing the summary to
a buffer before writing it to disk, as suggested in
https://github.com/neondatabase/neon/issues/4743#issuecomment-1700663864
. This change adds an additional warning for the case when the summary
exceeds a block. Previously, we'd have silently corrupted data in this
(unlikely) case.
* `WriteBlobWriter::write_blob`, in preparation for making
`VirtualFile::write_all` async.
2023-09-05 12:55:45 +02:00
Arpad Müller
128a85ba5e Convert many VirtualFile APIs to async (#5190)
## Problem

`VirtualFile` does both reading and writing, and it would be nice if
both could be converted to async, so that it doesn't have to support an
async read path and a blocking write path (especially for the locks this
is annoying as none of the lock implementations in std, tokio or
parking_lot have support for both async and blocking access).

## Summary of changes

This PR is some initial work on making the `VirtualFile` APIs async. It
can be reviewed commit-by-commit.

* Introduce the `MaybeVirtualFile` enum to be generic in a test that
compares real files with virtual files.
* Make various APIs of `VirtualFile` async, including `write_all_at`,
`read_at`, `read_exact_at`.

Part of #4743 , successor of #5180.

Co-authored-by: Christian Schwarz <me@cschwarz.com>
2023-09-04 17:05:20 +02:00
Arpad Müller
6cd497bb44 Make VirtualFile::crashsafe_overwrite async fn (#5189)
## Problem

The `VirtualFile::crashsafe_overwrite` function was introduced by #5186
but it was not turned `async fn` yet. We want to make these functions
async fn as part of #4743.

## Summary of changes

Make `VirtualFile::crashsafe_overwrite` async fn, as well as all the
functions calling it. Don't make anything inside `crashsafe_overwrite`
use async functionalities, as per #4743 instructions.

Also, add rustdoc to `crashsafe_overwrite`.

Part of #4743.
2023-09-04 12:52:35 +02:00
John Spray
80f10d5ced pageserver: safe deletion for tenant directories (#5182)
## Problem

If a pageserver crashes partway through deleting a tenant's directory,
it might leave a partial state that confuses a subsequent
startup/attach.

## Summary of changes

Rename tenant directory to a temporary path before deleting.

Timeline deletions already have deletion markers to provide safety.

In future, it would be nice to exploit this to send responses to detach
requests earlier: https://github.com/neondatabase/neon/issues/5183
2023-09-04 08:31:55 +01:00
Christian Schwarz
7e817789d5 VirtualFile: add crash-safe overwrite abstraction & use it (#5186)
(part of #4743)
(preliminary to #5180)
 
This PR adds a special-purpose API to `VirtualFile` for write-once
files.
It adopts it for `save_metadata` and `persist_tenant_conf`.

This is helpful for the asyncification efforts (#4743) and specifically
asyncification of `VirtualFile` because above two functions were the
only ones that needed the VirtualFile to be an `std::io::Write`.
(There was also `manifest.rs` that needed the `std::io::Write`, but, it
isn't used right now, and likely won't be used because we're taking a
different route for crash consistency, see #5172. So, let's remove it.
It'll be in Git history if we need to re-introduce it when picking up
the compaction work again; that's why it was introduced in the first
place).

We can't remove the `impl std::io::Write for VirtualFile` just yet
because of the `BufWriter` in

```rust
struct DeltaLayerWriterInner {
...
    blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>,
}
```

But, @arpad-m and I have a plan to get rid of that by extracting the
append-only-ness-on-top-of-VirtualFile that #4994 added to
`EphemeralFile` into an abstraction that can be re-used in the
`DeltaLayerWriterInner` and `ImageLayerWriterInner`.
That'll be another PR.


### Performance Impact

This PR adds more fsyncs compared to before because we fsync the parent
directory every time.

1. For `save_metadata`, the additional fsyncs are unnecessary because we
know that `metadata` fits into a kernel page, and hence the write won't
be torn on the way into the kernel. However, the `metadata` file in
general is going to lose signficance very soon (=> see #5172), and the
NVMes in prod can definitely handle the additional fsync. So, let's not
worry about it.
2. For `persist_tenant_conf`, which we don't check to fit into a single
kernel page, this PR makes it actually crash-consistent. Before, we
could crash while writing out the tenant conf, leaving a prefix of the
tenant conf on disk.
2023-09-02 10:06:14 +02:00
Christian Schwarz
cfc0fb573d pageserver: run all Rust tests with remote storage enabled (#5164)
For
[#5086](https://github.com/neondatabase/neon/pull/5086#issuecomment-1701331777)
we will require remote storage to be configured in pageserver.

This PR enables `localfs`-based storage for all Rust unit tests.

Changes:

- In `TenantHarness`, set up localfs remote storage for the tenant.
- `create_test_timeline` should mimic what real timeline creation does,
and real timeline creation waits for the timeline to reach remote
storage. With this PR, `create_test_timeline` now does that as well.
- All the places that create the harness tenant twice need to shut down
the tenant before the re-create through a second call to `try_load` or
`load`.
- Without shutting down, upload tasks initiated by/through the first
incarnation of the harness tenant might still be ongoing when the second
incarnation of the harness tenant is `try_load`/`load`ed. That doesn't
make sense in the tests that do that, they generally try to set up a
scenario similar to pageserver stop & start.
- There was one test that recreates a timeline, not the tenant. For that
case, I needed to create a `Timeline::shutdown` method. It's a
refactoring of the existing `Tenant::shutdown` method.
- The remote_timeline_client tests previously set up their own
`GenericRemoteStorage` and `RemoteTimelineClient`. Now they re-use the
one that's pre-created by the TenantHarness. Some adjustments to the
assertions were needed because the assertions now need to account for
the initial image layer that's created by `create_test_timeline` to be
present.
2023-09-01 18:10:40 +02:00
Christian Schwarz
aa22000e67 FileBlockReader<File> is never used (#5181)
part of #4743

preliminary to #5180
2023-09-01 17:30:22 +02:00
Christian Schwarz
40ce520c07 remote_timeline_client: tests: run upload ops on the tokio::test runtime (#5177)
The `remote_timeline_client` tests use `#[tokio::test]` and rely on the
fact that the test runtime that is set up by this macro is
single-threaded.

In PR https://github.com/neondatabase/neon/pull/5164, we observed
interesting flakiness with the `upload_scheduling` test case:
it would observe the upload of the third layer (`layer_file_name_3`)
before we did `wait_completion`.

Under the single-threaded-runtime assumption, that wouldn't be possible,
because the test code doesn't await inbetween scheduling the upload
and calling `wait_completion`.

However, RemoteTimelineClient was actually using `BACKGROUND_RUNTIME`.
That means there was parallelism where the tests didn't expect it,
leading to flakiness such as execution of an UploadOp task before
the test calls `wait_completion`.

The most confusing scenario is code like this:

```
schedule upload(A);
wait_completion.await; // B
schedule_upload(C);
wait_completion.await; // D
```

On a single-threaded executor, it is guaranteed that the upload up C
doesn't run before D, because we (the test) don't relinquish control
to the executor before D's `await` point.

However, RemoteTimelineClient actually scheduled onto the
BACKGROUND_RUNTIME, so, `A` could start running before `B` and
`C` could start running before `D`.

This would cause flaky tests when making assertions about the state
manipulated by the operations. The concrete issue that led to discover
of this bug was an assertion about `remote_fs_dir` state in #5164.
2023-09-01 16:24:04 +03:00
John Spray
616e7046c7 s3_scrubber: import into the main neon repository (#5141)
## Problem

The S3 scrubber currently lives at
https://github.com/neondatabase/s3-scrubber

We don't have tests that use it, and it has copies of some data
structures that can get stale.

## Summary of changes

- Import the s3-scrubber as `s3_scrubber/
- Replace copied_definitions/ in the scrubber with direct access to the
`utils` and `pageserver` crates
- Modify visibility of a few definitions in `pageserver` to allow the
scrubber to use them
- Update scrubber code for recent changes to `IndexPart`
- Update `KNOWN_VERSIONS` for IndexPart and move the definition into
index.rs so that it is easier to keep up to date

As a future refinement, it would be good to pull the remote persistence
types (like IndexPart) out of `pageserver` into a separate library so
that the scrubber doesn't have to link against the whole pageserver, and
so that it's clearer which types need to be public.

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-31 19:01:39 +01:00
John Spray
300a5aa05e pageserver: fix test v4_indexpart_is_parsed (#5157)
## Problem

Two recent PRs raced:
- https://github.com/neondatabase/neon/pull/5153
- https://github.com/neondatabase/neon/pull/5140

## Summary of changes

Add missing `generation` argument to IndexLayerMetadata construction
2023-08-31 10:40:46 +01:00
John Spray
83ae2bd82c pageserver: generation number support in keys and indices (#5140)
## Problem

To implement split brain protection, we need tenants and timelines to be
aware of their current generation, and use it when composing S3 keys.


## Summary of changes

- A `Generation` type is introduced in the `utils` crate -- it is in
this broadly-visible location because it will later be used from
`control_plane/` as well as `pageserver/`. Generations can be a number,
None, or Broken, to support legacy content (None), and Tenants in the
broken state (Broken).
- Tenant, Timeline, and RemoteTimelineClient all get a generation
attribute
- IndexPart's IndexLayerMetadata has a new `generation` attribute.
Legacy layers' metadata will deserialize to Generation::none().
- Remote paths are composed with a trailing generation suffix. If a
generation is equal to Generation::none() (as it currently always is),
then this suffix is an empty string.
- Functions for composing remote storage paths added in
remote_timeline_client: these avoid the way that we currently always
compose a local path and then strip the prefix, and avoid requiring a
PageserverConf reference on functions that want to create remote paths
(the conf is only needed for local paths). These are less DRY than the
old functions, but remote storage paths are a very rarely changing
thing, so it's better to write out our paths clearly in the functions
than to compose timeline paths from tenant paths, etc.
- Code paths that construct a Tenant take a `generation` argument in
anticipation that we will soon load generations on startup before
constructing Tenant.

Until the whole feature is done, we don't want any generation-ful keys
though: so initially we will carry this everywhere with the special
Generation::none() value.

Closes: https://github.com/neondatabase/neon/issues/5135

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-08-31 09:19:34 +01:00
John Spray
a93274b389 pageserver: remove vestigial timeline_layers attribute (#5153)
## Problem

`timeline_layers` was write-only since
b95addddd5

We deployed the version that no longer requires it for deserializing, so
now we can stop including it when serializing.

## Summary of changes

Fully remove `timeline_layers`.
2023-08-30 16:14:04 +01:00
Joonas Koivunen
05773708d3 fix: add context for ancestor lsn wait (#5143)
In logs it is confusing to see seqwait timeouts which seemingly arise
from the branched lsn but actually are about the ancestor, leading to
questions like "has the last_record_lsn went back".

Noticed by @problame.
2023-08-30 12:21:41 +03:00
Arpad Müller
eb0a698adc Make page cache and read_blk async (#5023)
## Problem

`read_blk` does I/O and thus we would like to make it async. We can't
make the function async as long as the `PageReadGuard` returned by
`read_blk` isn't `Send`. The page cache is called by `read_blk`, and
thus it can't be async without `read_blk` being async. Thus, we have a
circular dependency.

## Summary of changes

Due to the circular dependency, we convert both the page cache and
`read_blk` to async at the same time:

We make the page cache use `tokio::sync` synchronization primitives as
those are `Send`. This makes all the places that acquire a lock require
async though, which we then also do. This includes also asyncification
of the `read_blk` function.

Builds upon #4994, #5015, #5056, and #5129.

Part of #4743.
2023-08-30 09:04:31 +02:00
Arseny Sher
bc49c73fee Move wal_stream_connection_config to utils.
It will be used by safekeeper as well.
2023-08-29 23:19:40 +03:00
Arseny Sher
e98580b092 Add term and http endpoint to broker messaged SkTimelineInfo.
We need them for safekeeper peer recovery
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arpad Müller
805fee1483 page cache: small code cleanups (#5125)
## Problem

I saw these things while working on #5111.

## Summary of changes

* Add a comment explaining why we use `Vec::leak` instead of
`Vec::into_boxed_slice` plus `Box::leak`.
* Add another comment explaining what `valid` is doing, it wasn't very
clear before.
* Add a function `set_usage_count` to not set it directly.
2023-08-29 11:49:04 +03:00
Christian Schwarz
0fe3b3646a page cache: don't proactively evict EphemeralFile pages (#5129)
Before this patch, when dropping an EphemeralFile, we'd scan the entire
`slots` to proactively evict its pages (`drop_buffers_for_immutable`).

This was _necessary_ before #4994 because the page cache was a
write-back cache: we'd be deleting the EphemeralFile from disk after,
so, if we hadn't evicted its pages before that, write-back in
`find_victim` wouldhave failed.

But, since #4994, the page cache is a read-only cache, so, it's safe
to keep read-only data cached. It's never going to get accessed again
and eventually, `find_victim` will evict it.

The only remaining advantage of `drop_buffers_for_immutable` over
relying on `find_victim` is that `find_victim` has to do the clock
page replacement iterations until the count reaches 0,
whereas `drop_buffers_for_immutable` can kick the page out right away.

However, weigh that against the cost of `drop_buffers_for_immutable`,
which currently scans the entire `slots` array to find the
EphemeralFile's pages.

Alternatives have been proposed in #5122 and #5128, but, they come
with their own overheads & trade-offs.

Also, the real reason why we're looking into this piece of code is
that we want to make the slots rwlock async in #5023.
Since `drop_buffers_for_immutable` is called from drop, and there
is no async drop, it would be nice to not have to deal with this.

So, let's just stop doing `drop_buffers_for_immutable` and observe
the performance impact in benchmarks.
2023-08-28 20:42:18 +02:00
Joonas Koivunen
fbcd174489 load_layer_map: schedule deletions for any future layers (#5103)
Unrelated fixes noticed while integrating #4938.

- Stop leaking future layers in remote storage
- We schedule extra index_part uploads if layer name to be removed was
not actually present
2023-08-28 10:51:49 +03:00
John Spray
b758bf47ca pageserver: refactor TimelineMetadata serialization in IndexPart (#5091)
## Problem

The `metadata_bytes` field of IndexPart required explicit
deserialization & error checking everywhere it was used -- there isn't
anything special about this structure that should prevent it from being
serialized & deserialized along with the rest of the structure.

## Summary of changes

- Implement Serialize and Deserialize for TimelineMetadata
- Replace IndexPart::metadata_bytes with a simpler `metadata`, that can
be used directly.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-08-25 16:16:20 +01:00
Arpad Müller
8c13296add Remove BlockReader::read_blk in favour of BlockCursor (#5015)
## Problem

We want to make `read_blk` an async function, but outside of
`async_trait`, which allocates, and nightly features, we can't use async
fn's in traits.

## Summary of changes

* Remove all uses of `BlockReader::read_blk` in favour of using block
  cursors, at least where the type of the `BlockReader` is behind a
  generic
* Introduce a `BlockReaderRef` enum that lists all implementors of
  `BlockReader::read_blk`.
* Remove `BlockReader::read_blk` and move its implementations into
  inherent functions on the types instead.

We don't turn `read_blk` into an async fn yet, for that we also need to
modify the page cache. So this is a preparatory PR, albeit an important
one.

Part of #4743.
2023-08-25 12:28:01 +02:00
Arpad Müller
227c87e333 Make EphemeralFile::write_blob function async (#5056)
## Problem

The `EphemeralFile::write_blob` function accesses the page cache
internally. We want to require `async` for these accesses in #5023.

## Summary of changes

This removes the implementaiton of the `BlobWriter` trait for
`EphemeralFile` and turns the `write_blob` function into an inherent
function. We can then make it async as well as the `push_bytes`
function. We move the `SER_BUFFER` thread-local into the
`InMemoryLayerInner` so that the same buffer can be accessed by
different threads as the async is (potentially) moved between threads.

Part of #4743, preparation for #5023.
2023-08-24 19:18:30 +02:00
Chengpeng Yan
fa74d5649e rename EphmeralFile::size to EphemeralFile::len (#5076)
## Problem
close https://github.com/neondatabase/neon/issues/5034

## Summary of changes
Based on the
[comment](https://github.com/neondatabase/neon/pull/4994#discussion_r1297277922).
Just rename the `EphmeralFile::size` to `EphemeralFile::len`.
2023-08-24 16:41:57 +02:00
Joonas Koivunen
f70871dfd0 internal-devx: pageserver future layers (#5092)
I've personally forgotten why/how can we have future layers during
reconciliation. Adds `#[cfg(feature = "testing")]` logging when we
upload such index_part.json, with a cross reference to where the cleanup
happens.

Latest private slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1692879032573809?thread_ts=1692792276.173979&cid=C033RQ5SPDH

Builds upon #5074. Should had been considered on #4837.
2023-08-24 17:22:36 +03:00
Joonas Koivunen
76aa01c90f refactor: single phase Timeline::load_layer_map (#5074)
Current implementation first calls `load_layer_map`, which loads all
local layers, cleans up files, leave cleaning up stuff to "second
function". Then the "second function" is finally called, it does not do
the cleanup and some of the first functions setup can torn down. "Second
function" is actually both `reconcile_with_remote` and
`create_remote_layers`.

This change makes it a bit more verbose but in one phase with the
following sub-steps:
1. scan the timeline directory
2. delete extra files
    - now including on-demand download files
    - fixes #3660
3. recoincile the two sources of layers (directory, index_part)
4. rename_to_backup future layers, short layers
5. create the remaining as layers

Needed by #4938.

It was also noticed that this is blocking code in an `async fn` so just
do it in a `spawn_blocking`, which should be healthy for our startup
times. Other effects includes hopefully halving of `stat` calls; extra
calls which were not done previously are now done for the future layers.

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2023-08-24 16:07:40 +03:00
John Spray
3e2f0ffb11 libs: make backoff::retry() take a cancellation token (#5065)
## Problem

Currently, anything that uses backoff::retry will delay the join of its
task by however long its backoff sleep is, multiplied by its max
retries.

Whenever we call a function that sleeps, we should be passing in a
CancellationToken.

## Summary of changes

- Add a `Cancel` type to backoff::retry that wraps a CancellationToken
and an error `Fn` to generate an error if the cancellation token fires.
- In call sites that already run in a `task_mgr` task, use
`shutdown_token()` to provide the token. In other locations, use a dead
`CancellationToken` to satisfy the interface, and leave a TODO to fix it
up when we broaden the use of explicit cancellation tokens.
2023-08-24 14:54:46 +03:00
Joonas Koivunen
ad8d777c1c refactor: remove is_incremental=true for ImageLayers footgun (#5061)
Accidentially giving is_incremental=true for ImageLayers costs a lot of
debugging time. Removes all API which would allow to do that. They can
easily be restored later *when needed*.

Split off from #4938.
2023-08-22 22:12:05 +03:00
Joonas Koivunen
533a92636c refactor: pre-cleanup Layer, PersistentLayer and impls (#5059)
Remove pub but dead code, move trait methods as inherent methods, remove
unnecessary. Split off from #4938.
2023-08-22 21:14:28 +03:00
Christian Schwarz
8cd20485f8 metrics: smgr query time: add a pre-aggregated histogram (#5064)
When doing global queries in VictoriaMetrics, the per-timeline
histograms make us run into cardinality limits.

We don't want to give them up just yet because we don't
have an alternative for drilling down on timeline-specific
performance issues.

So, add a pre-aggregated histogram and add observations to it
whenever we add observations to the per-timeline histogram.

While we're at it, switch to using a strummed enum for the operation
type names.
2023-08-22 20:08:31 +03:00
Joonas Koivunen
933a869f00 refactor: compaction becomes async again (#5058)
#4938 will make on-demand download of layers in compaction possible, so
it's not suitable for our "policy" of no `spawn_blocking(|| ...
Handle::block_on(async { spawn_blocking(...).await })` because this
poses a clear deadlock risk. Nested spawn_blockings are because of the
download using `tokio::fs::File`.

- Remove `spawn_blocking` from caller of `compact_level0_phase1`
- Remove `Handle::block_on` from `compact_level0_phase1` (indentation
change)
- Revert to `AsLayerDesc::layer_desc` usage temporarily (until it
becomes field access in #4938)
2023-08-22 20:03:14 +03:00
John Spray
615a490239 pageserver: refactor Tenant/Timeline args into structs (#5053)
## Problem

There are some common types that we pass into tenants and timelines as
we construct them, such as remote storage and the broker client.
Currently the list is small, but this is likely to grow -- the deletion
queue PR (#4960) pushed some methods to the point of clippy complaining
they had too many args, because of the extra deletion queue client being
passed around.

There are some shared objects that currently aren't passed around
explicitly because they use a static `once_cell` (e.g.
CONCURRENT_COMPACTIONS), but as we add more resource management and
concurreny control over time, it will be more readable & testable to
pass a type around in the respective Resources object, rather than to
coordinate via static objects. The `Resources` structures in this PR
will make it easier to add references to central coordination functions,
without having to rely on statics.

## Summary of changes

- For `Tenant`, the `broker_client` and `remote_storage` are bundled
into `TenantSharedResources`
- For `Timeline`, the `remote_client` is wrapped into
`TimelineResources`.

Both of these structures will get an additional deletion queue member in
#4960.
2023-08-21 17:30:28 +01:00
John Spray
b95addddd5 pageserver: do not read redundant timeline_layers from IndexPart, so that we can remove it later (#4972)
## Problem

IndexPart contains two redundant lists of layer names: a set of the
names, and then a map of name to metadata.

We already required that all the layers in `timeline_layers` are also in
`layers_metadata`, in `initialize_with_current_remote_index_part`, so if
there were any index_part.json files in the field that relied on these
sets being different, they would already be broken.

## Summary of changes

`timeline_layers` is made private and no longer read at runtime. It is
still serialized, but not deserialized.

`disk_consistent_lsn` is also made private, as this field only exists
for convenience of humans reading the serialized JSON.

This prepares us to entirely remove `timeline_layers` in a future
release, once this change is fully deployed, and therefore no
pageservers are trying to read the field.
2023-08-21 14:29:36 +03:00
Dmitry Rodionov
9140a950f4 Resume tenant deletion on attach (#5039)
I'm still a bit nervous about attach -> crash case. But it should work.
(unlike case with timeline). Ideally would be cool to cover this with
test.

This continues tradition of adding bool flags for Tenant::set_stopping.
Probably lifecycle project will help with fixing it.
2023-08-20 12:28:50 +03:00
Arpad Müller
a23b0773f1 Fix DeltaLayer dumping (#5045)
## Problem

Before, DeltaLayer dumping (via `cargo run --release -p pagectl --
print-layer-file` ) would crash as one can't call `Handle::block_on` in
an async executor thread.

## Summary of changes

Avoid the problem by using `DeltaLayerInner::load_keys` to load the keys
into RAM (which we already do during compaction), and then load the
values one by one during dumping.
2023-08-19 00:56:03 +02:00