Commit Graph

1822 Commits

Author SHA1 Message Date
Vlad Lazar
221531c9db pageserver: lift ancestor timeline logic from read path (#6543)
When the read path needs to follow a key into the ancestor timeline, it
needs to wait for said ancestor to become active and aware of its
branching LSN. The logic is lifted into a separate function with its
own new error type.

This is done because the vectored read path needs the same logic. It's
also the reason for the newly introduced error type.

When we switch the read path to proxy into `get_vectored`, we can
remove the duplicated variants from `PageReconstructError`.
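
A minimal sketch of the shape of such a helper, with illustrative names and a polling loop rather than the pageserver's actual state machinery:

```
use std::time::Duration;

// Illustrative sketch only: wait until the ancestor timeline reports itself
// active before the read path follows a key into it. The real helper and its
// error type live in the pageserver crate and wait on state-change
// notifications instead of polling.
#[derive(Debug)]
enum GetReadyAncestorError {
    AncestorNotActive,
}

async fn wait_for_ancestor_active(
    is_active: impl Fn() -> bool, // stand-in for checking the ancestor's state
    timeout: Duration,
) -> Result<(), GetReadyAncestorError> {
    let deadline = tokio::time::Instant::now() + timeout;
    while !is_active() {
        if tokio::time::Instant::now() >= deadline {
            return Err(GetReadyAncestorError::AncestorNotActive);
        }
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
    Ok(())
}
```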
2024-02-01 10:35:18 +00:00
Christian Schwarz
4c173456dc pagebench: fix percentiles reporting (#6547)
Before this patch, pagebench was always showing the same value.

refs https://github.com/neondatabase/neon/issues/6509
2024-01-31 23:29:48 +00:00
Christian Schwarz
e82625b77d refactor(pageserver main): signal handling (#6554)
This refactoring makes it easier to experimentally replace
BACKGROUND_RUNTIME with a single-threaded runtime. Found this useful
[during benchmarking](https://github.com/neondatabase/neon/pull/6555).
2024-01-31 23:25:57 +00:00
Joonas Koivunen
3d5fab127a rewrite Gate impl for better observability (#6542)
changes:
- two messages instead of a message every second while the gate is closing
- replace the gate name string by using a pointer
- slow GateGuards are likely to log who they were (see example)

example found in regress tests: <https://github.com/neondatabase/neon/pull/6542#issuecomment-1919009256>
2024-01-31 22:15:58 +00:00
Joonas Koivunen
66719d7eaf logging: fix span usage (#6549)
Fixes some span duplication due to extra or misconfigured `#[instrument]`
attributes, and fills in the `timeline_id` on the delete-timeline flow calls.
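
For illustration, a hypothetical span setup in the spirit of this fix: the identifying fields are attached once at the top of the flow instead of re-instrumenting callees with overlapping spans.

```
use tracing::instrument;

// Hypothetical function, not the actual pageserver code: `timeline_id` is
// recorded as a span field here, so every log line emitted inside the flow
// carries it without further #[instrument] attributes on callees.
#[instrument(skip_all, fields(tenant_id = %tenant_id, timeline_id = %timeline_id))]
async fn delete_timeline_flow(tenant_id: &str, timeline_id: &str) -> anyhow::Result<()> {
    tracing::info!("deleting timeline"); // inherits the span fields above
    Ok(())
}
```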
2024-01-31 20:52:00 +00:00
Konstantin Knizhnik
9a9d9beaee Download SLRU segments on demand (#6151)
## Problem

See https://github.com/neondatabase/cloud/issues/8673

## Summary of changes


Download missing SLRU segments from the page server on demand.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-31 21:39:18 +02:00
Arpad Müller
47380be12d Remove version param from get_lsn_by_timestamp (#6551)
This removes the last remnants of the version param added by #5608,
concluding the transition plan laid out in
https://github.com/neondatabase/cloud/pull/7553#discussion_r1370473911.
It follows PR https://github.com/neondatabase/cloud/pull/9202, which we
now assume has been deployed to all environments.

Full history:

* https://github.com/neondatabase/neon/pull/5608 
* https://github.com/neondatabase/cloud/pull/7553
* https://github.com/neondatabase/neon/pull/6178
* https://github.com/neondatabase/cloud/pull/9202
2024-01-31 15:30:19 +01:00
John Spray
4010adf653 control_plane/attachment_service: complete APIs (#6394)
Depends on: https://github.com/neondatabase/neon/pull/6468

## Problem

The sharding service will be used as a "virtual pageserver" by the
control plane -- so it needs the set of pageserver APIs that the control
plane uses, and to present them under identical URLs, including prefix
(/v1).

## Summary of changes

- Add missing APIs:
  - Tenant deletion
  - Timeline deletion
  - Node list (used in test now, later in tools)
  - `/location_config` API (for migrating tenants into the sharding service)
- Rework attachment service URLs:
  - `/v1` prefix is used for pageserver-compatible APIs
  - `/upcall/v1` prefix is used for APIs that are called by the pageserver (re-attach and validate)
  - `/debug/v1` prefix is used for endpoints that are for testing
  - `/control/v1` prefix is used for new sharding service APIs that do not mimic a pageserver API, such as registering and configuring nodes.
- Add test_sharding_service. The sharding service already had some
collateral coverage from its use in general tests, but this is the first
dedicated testing for it.
2024-01-31 12:23:06 +00:00
Christian Schwarz
79137a089f fix(#6366): pageserver: incorrect log level for Tenant not found during basebackup (#6400)
Before this patch, when requesting basebackup for a not-found tenant or
timeline, we'd emit an ERROR-level log entry with a huge stack trace.
See the "Details" section of #6366 for an example.

With this patch, we log at INFO level and only a single line.
Example:

```
2024-01-19T14:16:11.479800Z  INFO page_service_conn_main{peer_addr=127.0.0.1:43448}: query handler for 'basebackup d69a536d529a68fcf85bc070030cdf4b 035484e9c28d8d0138a492caadd03ffd 0/2204340 --gzip' entity not found: Tenant d69a536d529a68fcf85bc070030cdf4b not found
2024-01-19T14:19:35.807819Z  INFO page_service_conn_main{peer_addr=127.0.0.1:48862}: query handler for 'basebackup d69a536d529a68fcf85bc070030cdf4a 035484e9c28d8d0138a492caadd03ffd 0/2204340 --gzip' entity not found: Timeline d69a536d529a68fcf85bc070030cdf4a/035484e9c28d8d0138a492caadd03ffd was not found
```

fixes https://github.com/neondatabase/neon/issues/6366

Changes
-------

- Change `handle_basebackup_request` to return a `QueryError`
- The new `impl From<WaitLsnError> for QueryError` is needed so the `?`
at `wait_lsn()` call in `handle_basebackup_request` works again. It's
duplicating `impl From<WaitLsnError> for PageStreamError`.
- Remove the hard-to-spot conversion of the `handle_basebackup_request` return
value to anyhow::Result (the place where I replaced `anyhow::Ok` with
`Result::<(), QueryError>::Ok(())`)
- Add forgotten distinguished handling for "Tenant not found" case in
`impl From<GetActiveTenantError> for QueryError`

This was not at all pleasant, and I find it very hard to follow the
various error conversions.
It took me a while to spot the hard-to-spot `anyhow::Ok` thing above.
It would have been caught by the compiler if we weren't auto-converting
`anyhow::Error` into `QueryError::Other`.
We should move away from that, in my opinion, instead forcing each
`.context()` site to become `.context().map_err(QueryError::Other)`.
But that's for a future PR.
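
A minimal sketch of that suggested direction, with hypothetical names: without a blanket `From<anyhow::Error>` impl, each fallible call site has to name the variant it maps into, so the compiler catches mistakes like the `anyhow::Ok` one above.

```
use anyhow::Context;

// Hypothetical error type for illustration; the real QueryError lives in the
// pageserver's page_service module.
#[allow(dead_code)]
#[derive(Debug)]
enum QueryError {
    Shutdown,
    Other(anyhow::Error),
}

fn read_control_file(path: &str) -> Result<Vec<u8>, QueryError> {
    std::fs::read(path)
        .context("read control file") // still an anyhow::Result here
        .map_err(QueryError::Other)   // explicit, visible conversion
}
```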
2024-01-30 13:10:48 +00:00
Joonas Koivunen
e3cb715e8a fix: capture initdb stderr, discard others (#6524)
When using spawn + wait_with_output instead of
std::process::Command::output or tokio::process::Command::output, we must
configure the redirection ourselves.

Fixes #6523 by discarding stdout completely; we only care about
stderr, if any.
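
A sketch of the pattern, with placeholder arguments: when using `spawn()` + `wait_with_output()`, the child's stdio must be redirected explicitly, so stdout is discarded and only stderr is captured.

```
use std::process::Stdio;
use tokio::process::Command;

// Illustrative only; the real initdb invocation and arguments differ.
async fn run_initdb() -> anyhow::Result<()> {
    let child = Command::new("initdb")
        .arg("--version")       // placeholder argument
        .stdin(Stdio::null())
        .stdout(Stdio::null())  // discard stdout completely
        .stderr(Stdio::piped()) // capture stderr for error reporting
        .spawn()?;
    let output = child.wait_with_output().await?;
    if !output.status.success() {
        anyhow::bail!(
            "initdb failed: {}",
            String::from_utf8_lossy(&output.stderr)
        );
    }
    Ok(())
}
```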
2024-01-30 14:07:58 +01:00
Vlad Lazar
0c7b89235c pageserver: add range layer map search implementation (#6469)
## Problem
There's no efficient way of querying the layer map for a range.

## Summary of changes
Introduce a range query for the layer map (`LayerMap::range_search`).
There are two broad steps to it:
1. Find all coverage changes for layers that intersect the queried range
(see `LayerCoverage::range_overlaps`).
The slightly tricky part is dealing with the start of the range. We can
either be aligned with a layer or not and we need
to treat these cases differently.
2. Iterate over the coverage changes and collect the result. For this we
use a two pointer approach: the trailing pointer tracks the start of the
current range (current location in the key space) and the forward
pointer tracks the next coverage change.
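
A simplified sketch of step 2 with hypothetical types (the real code operates on `LayerCoverage` and layer handles): a two-pointer walk over the coverage changes, assuming step 1 already resolved the layer covering the start of the range.

```
// Emit (key sub-range, covering layer) pairs for the queried range.
// `initial` is the layer covering `start`; `changes` are the coverage
// changes strictly inside (start, end), sorted by key.
fn collect_range(
    start: u32,
    end: u32,
    initial: Option<String>,
    changes: &[(u32, Option<String>)],
) -> Vec<(std::ops::Range<u32>, Option<String>)> {
    let mut results = Vec::new();
    let mut range_start = start; // trailing pointer: start of current sub-range
    let mut covering = initial;  // layer covering `range_start`
    for (change_key, layer) in changes {
        // forward pointer hit the next coverage change: close the sub-range
        results.push((range_start..*change_key, covering.clone()));
        range_start = *change_key;
        covering = layer.clone();
    }
    results.push((range_start..end, covering));
    results
}
```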

Plugging the range search into the read path is deferred to a future PR.

## Performance
I adapted the layer map benchmarks on a local branch. Range searches are
between 2x and 2.5x slower than point searches. That's in line with what I
expected since we query the layer map twice.

Since `Timeline::get` will proxy to `Timeline::get_vectored`, we can
special-case the one-element layer map range search at that point.
2024-01-29 09:47:12 +00:00
Joonas Koivunen
1e9a50bca8 disk_usage_eviction_task: cleanup summaries (#6490)
This is the "partial revert" of #6384. The summaries turned out to be
expensive due to naive vec usage, but also inconclusive because of the
additional context required. In addition to removing summary traces,
small refactoring is done.
2024-01-29 10:38:40 +02:00
Konstantin Knizhnik
c1148dc9ac Fix calculation of maximal multixact in ingest_multixact_create_record (#6502)
## Problem

See https://neondb.slack.com/archives/C06F5UJH601/p1706373716661439

## Summary of changes

Use None instead of 0 as initial accumulator value for calculating
maximal multixact XID.
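
An illustrative, much-simplified version of the fix (not the actual ingest code): folding with an `Option` accumulator instead of starting from 0, so an empty input cannot masquerade as a real maximum.

```
// Hypothetical helper: find the maximal member XID, treating "no members"
// as None rather than as XID 0.
fn max_member_xid(members: &[u32]) -> Option<u32> {
    members.iter().copied().fold(None, |acc, xid| match acc {
        Some(max) => Some(max.max(xid)),
        None => Some(xid),
    })
}
```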

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-29 07:39:16 +02:00
John Spray
58f6cb649e control_plane: database persistence for attachment_service (#6468)
## Problem

Spun off from https://github.com/neondatabase/neon/pull/6394 -- this PR
is just the persistence parts and the changes that enable it to work
nicely.


## Summary of changes

- Revert #6444 and #6450
- In neon_local, start a vanilla postgres instance for the attachment
service to use.
- Adopt `diesel` crate for database access in attachment service. This
uses raw SQL migrations as the source of truth for the schema, so it's a
soft dependency: we can switch libraries pretty easily.
- Rewrite persistence.rs to use postgres (via diesel) instead of JSON.
- Preserve JSON read+write at startup and shutdown: this enables using
the JSON format in compatibility tests, so that we don't have to commit
to our DB schema yet.
- In neon_local, run database creation + migrations before starting
attachment service
- Run the initial reconciliation in Service::spawn in the background, so
that the pageserver + attachment service don't get stuck waiting for
each other to start, when restarting both together in a test.
2024-01-26 17:20:44 +00:00
John Spray
55b7cde665 tests: add basic coverage for sharding (#6380)
## Problem

The support for sharding in the pageserver was written before
https://github.com/neondatabase/neon/pull/6205 landed, so when it landed
we couldn't directly test sharding.

## Summary of changes

- Add `test_sharding_smoke`, which tests the basics of creating a
sharded tenant, creating a timeline within it, and checking that the data
within it is distributed.
- Add modes to pg_regress tests for running with 4 shards as well as
with 1.
2024-01-26 14:40:47 +00:00
Vlad Lazar
5b34d5f561 pageserver: add vectored get latency histogram (#6461)
This patch introduces a new set of grafana metrics for a histogram:
pageserver_get_vectored_seconds_bucket{task_kind="Compaction|PageRequestHandler"}.

While it has a `task_kind` label, only compaction and SLRU fetches are
tracked. This reduces the increase in cardinality to 24.

The metric should allow us to isolate performance regressions while the
vectorized get is being implemented. Once the implementation is
complete, it'll also allow us to quantify the improvements.
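
As a rough sketch of how such a histogram can be declared with the `prometheus` crate (the pageserver uses its own `metrics` wrapper, so the helper and buckets here are illustrative; only the metric name comes from the message above):

```
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Illustrative declaration of a histogram with a task_kind label.
static GET_VECTORED_SECONDS: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "pageserver_get_vectored_seconds",
        "Time spent in Timeline::get_vectored",
        &["task_kind"]
    )
    .expect("failed to register metric")
});

fn observe_get_vectored(task_kind: &str, seconds: f64) {
    GET_VECTORED_SECONDS
        .with_label_values(&[task_kind])
        .observe(seconds);
}
```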
2024-01-26 13:40:03 +00:00
Christian Schwarz
918b03b3b0 integrate tokio-epoll-uring as alternative VirtualFile IO engine (#5824) 2024-01-26 09:25:07 +01:00
Joonas Koivunen
8dee9908f8 fix(compaction_task): wrong log levels (#6442)
Filter what we log in the compaction task. Per discussion in the last
triage call, fix these by introducing and inspecting the root cause within
anyhow::Error instead of rolling out proper conversions.

Fixes: #6365
Fixes: #6367
2024-01-25 18:45:17 +02:00
John Spray
c9b1657e4c pageserver: fixes for creation operations overlapping with shutdown/startup (#6436)
## Problem

For #6423, creating a reproducer turned out to be very easy, as an
extension to test_ondemand_activation.

However, before I had diagnosed the issue, I started with a more
brute-force approach of running creation API calls in the background
while restarting a pageserver, and that turned up a bunch of other
interesting issues.

In this PR:
- Add the reproducer for #6423 by extending `test_ondemand_activation`
(confirmed that this test fails if I revert the fix from
https://github.com/neondatabase/neon/pull/6430)
- In timeline creation, return 503 responses when we get an error and
the tenant's cancellation token is set: this covers the cases where we
get an anyhow::Error from something during timeline creation as a result
of shutdown.
- While waiting for tenants to become active during creation, don't
.map_err() the result to a 500: instead let the `From` impl map the
result to something appropriate (this includes mapping shutdown to 503)
- During tenant creation, we were calling `Tenant::load_local` because
no Preload object is provided. This is usually harmless because the
tenant dir is empty, but if there are some half-created timelines in
there, bad things can happen. Propagate the SpawnMode into
Tenant::attach, so that it can properly skip _any_ attempt to load
timelines if creating.
- When we call upsert_location, there's a SpawnMode that tells us
whether to load from remote storage or not. But if the operation is a
retry and we already have the tenant, it is not correct to skip loading
from remote storage: there might be a timeline there. This isn't
strictly a correctness issue as long as the caller behaves correctly
(does not assume that any timelines are persistent until the creation is
acked), but it's a more defensive position.
- If we shut down while the task in Tenant::attach is running, it can
end up spawning rogue tasks. Fix this by holding a GateGuard through
here, and in upsert_location shutting down a tenant after calling
tenant_spawn if we can't insert it into tenants_map. This fixes the
expected behavior that after shutdown_all_tenants returns, no tenant
tasks are running.
- Add `test_create_churn_during_restart`, which runs tenant & timeline
creations across pageserver restarts.
- Update a couple of tests that covered cancellation, to reflect the
cleaner errors we now return.
2024-01-25 12:35:52 +00:00
Joonas Koivunen
a0a3ba85e7 fix(page_service): walredo logging problem (#6460)
Fixes #6459 by formatting the full cause chain of an error into the log,
while keeping only the top-level string for the end user.

Changes user visible error detail from:

```
-DETAIL:  page server returned error: Read error: Failed to reconstruct a page image:
+DETAIL:  page server returned error: Read error
```

However on pageserver logs:

```
-ERROR page_service_conn_main{...}: error reading relation or page version: Read error: Failed to reconstruct a page image:
+ERROR page_service_conn_main{...}: error reading relation or page version: Read error: reconstruct a page image: launch walredo process: spawn process: Permission denied (os error 13)
```
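
A minimal sketch of the formatting difference (assuming `anyhow` errors, as the pageserver uses): `{:#}` renders the whole cause chain on one line for the log, while plain `{}` keeps only the top-level message for the client-facing string.

```
// Illustrative only; the real call sites are in page_service.
fn log_and_client_message(err: &anyhow::Error) -> String {
    // Full chain for the pageserver log, e.g.
    // "Read error: reconstruct a page image: launch walredo process: ..."
    tracing::error!("error reading relation or page version: {:#}", err);
    // Top-level message only, suitable for the client-facing DETAIL line.
    format!("{}", err)
}
```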
2024-01-24 15:47:17 +00:00
Arpad Müller
d820aa1d08 Disable initdb cancellation (#6451)
## Problem

The initdb cancellation added in #5921 is not sufficient to reliably
abort the entire initdb process. Initdb also spawns children. The tests
added by #6310 (#6385) and #6436 now do initdb cancellations on a more
regular basis.

In #6385, I attempted to issue `killpg` (after giving it a new process
group ID) to kill not just the initdb but all its spawned subprocesses,
but this didn't work. Initdb doesn't take *that* long in the end either,
so we just wait until it concludes.

## Summary of changes

* revert initdb cancellation support added in #5921
* still return `Err(Cancelled)` upon cancellation, but this is just to
not have to remove the cancellation infrastructure
* fixes to the `test_tenant_delete_races_timeline_creation` test to make
it reliably pass

Fixes #6385
2024-01-24 13:06:05 +01:00
Christian Schwarz
996abc9563 pagebench-based GetPage@LSN performance test (#6214) 2024-01-24 12:51:53 +01:00
Arpad Müller
faf275d4a2 Remove initdb on timeline delete (#6387)
This PR:

* makes `initdb.tar.zst` be deleted by default on timeline deletion
(#6226), mirroring the safekeeper:
https://github.com/neondatabase/neon/pull/6381
* adds a new `preserve_initdb_archive` endpoint for a timeline, to be
used during the disaster recovery process, see reasoning
[here](https://github.com/neondatabase/neon/issues/6226#issuecomment-1894574778)
* makes the creation code look for `initdb-preserved.tar.zst` in
addition to `initdb.tar.zst`.
* makes the tests use the new endpoint

fixes #6226
2024-01-23 18:22:59 +00:00
Vlad Lazar
001f0d6db7 pageserver: fix import failure caused by merge race (#6448)
PR #6406 raced with #6372 and broke main.
2024-01-23 18:07:01 +01:00
Christian Schwarz
42c17a6fc6 attachment_service: use atomic overwrite to persist attachments.json (#6444)
The pagebench integration PR (#6214) is the first to SIGQUIT & then
restart attachment_service.

With many tenants (100), we have found frequent failures on restart in
the CI[^1].

[^1]:
[Allure](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6214/7615750160/index.html#suites/e26265675583c610f99af77084ae58f1/851ff709578c4452/)

```
2024-01-22T19:07:57.932021Z  INFO request{method=POST path=/attach-hook request_id=2697503c-7b3e-4529-b8c1-d12ef912d3eb}: Request handled, status: 200 OK
2024-01-22T19:07:58.898213Z  INFO Got SIGQUIT. Terminating
2024-01-22T19:08:02.176588Z  INFO version: git-env:d56f31639356ed8e8ce832097f132f27ee19ac8a, launch_timestamp: 2024-01-22 19:08:02.174634554 UTC, build_tag build_tag-env:7615750160, state at /tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json, listening on 127.0.0.1:15048
thread 'main' panicked at /__w/neon/neon/control_plane/attachment_service/src/persistence.rs:95:17:
Failed to load state from '/tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json': trailing characters at line 1 column 8957 (maybe your .neon/ dir was written by an older version?)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: attachment_service::persistence::PersistentState::load_or_new::{{closure}}
             at ./control_plane/attachment_service/src/persistence.rs:95:17
   3: attachment_service::persistence::Persistence::new::{{closure}}
             at ./control_plane/attachment_service/src/persistence.rs:103:56
   4: attachment_service::main::{{closure}}
             at ./control_plane/attachment_service/src/main.rs:69:61
   5: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
   6: tokio::runtime::coop::with_budget
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
   7: tokio::runtime::coop::budget
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
   8: tokio::runtime::park::CachedParkThread::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
   9: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
  10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  11: tokio::runtime::context::runtime::enter_runtime
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
  12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  13: tokio::runtime::runtime::Runtime::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:50
  14: attachment_service::main
             at ./control_plane/attachment_service/src/main.rs:99:5
  15: core::ops::function::FnOnce::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```

The attachment_service handles SIGQUIT by just exiting the process.
In theory, the SIGQUIT could come in while we're writing out the
`attachments.json`.

Now, in the above log output, there's a 1-second gap between the last
request completing and the SIGQUIT coming in. So, there must be some
other issue.

But let's have this change anyway; maybe it helps uncover the real
cause for the test failure.
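
The "atomic overwrite" in the title refers to the usual write-temp-then-rename pattern; a generic sketch (illustrative names, not the actual persistence.rs code):

```
use std::io::Write;
use std::path::Path;

// Write to a temporary file in the same directory, flush it to disk, then
// rename it over the destination so readers never observe a partial file.
fn persist_atomically(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp_path = path.with_extension("tmp");
    let mut tmp = std::fs::File::create(&tmp_path)?;
    tmp.write_all(contents)?;
    tmp.sync_all()?;                   // flush file contents to disk
    std::fs::rename(&tmp_path, path)?; // atomic replace on POSIX filesystems
    Ok(())
}
```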
2024-01-23 17:21:06 +01:00
Vlad Lazar
37638fce79 pageserver: introduce vectored Timeline::get interface (#6372)
1. Introduce a naive `Timeline::get_vectored` implementation

The return type is intended to be flexible enough for various types of
callers. We return the pages in a map keyed by `Key` such that the
caller doesn't have to map back to the key if it needs to know it. Some
callers can ignore errors
for specific pages, so we return a separate `Result<Bytes,
PageReconstructError>` for each page and an overarching
`GetVectoredError` for API misuse. The overhead of the mapping will be
small and bounded since we enforce a maximum key count for the
operation.

2. Use the `get_vectored` API for SLRU segment reconstruction and image
layer creation.
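
A hypothetical sketch of the return shape described above (the real `Key`, `Bytes`, `PageReconstructError`, and `GetVectoredError` types live in the pageserver crates):

```
use std::collections::BTreeMap;

// Placeholder types for illustration only.
type Key = u64;
type Bytes = Vec<u8>;

#[derive(Debug)]
struct PageReconstructError(String); // per-page failure
#[derive(Debug)]
struct GetVectoredError(String); // API misuse, e.g. too many keys requested

// Each requested page gets its own Result, keyed by Key, so callers that can
// tolerate individual page failures don't need to map results back to keys.
type VectoredReadResult =
    Result<BTreeMap<Key, Result<Bytes, PageReconstructError>>, GetVectoredError>;

fn enforce_max_keys(keys: &[Key], max_keys: usize) -> Result<(), GetVectoredError> {
    if keys.len() > max_keys {
        return Err(GetVectoredError(format!(
            "requested {} keys, limit is {max_keys}",
            keys.len()
        )));
    }
    Ok(())
}
```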
2024-01-23 14:23:53 +00:00
Christian Schwarz
50288c16b1 fix(pagebench): avoid CopyFail error in success case (#6443)
PR #6392 fixed CopyFail in the case where we get cancelled.
But, we also want to use `client.shutdown()` if we don't get cancelled.
2024-01-23 15:11:32 +01:00
Conrad Ludgate
72de1cb511 remove some duped deps (#6422)
## Problem

duplicated deps

## Summary of changes

little bit of fiddling with deps to reduce duplicates

needs consideration:
https://github.com/notify-rs/notify/blob/main/CHANGELOG.md#notify-600-2023-05-17
2024-01-23 11:17:15 +00:00
Vlad Lazar
f1901833a6 pageserver_api: migrate keyspace related functions from pgdatadir_mapping (#6406)
The idea is to achieve separation between keyspace layout definition
and operating on said keyspace. I've inlined all these functions since
they're small and we don't use LTO in the storage release builds
at the moment.

Closes https://github.com/neondatabase/neon/issues/6347
2024-01-22 19:16:38 +00:00
John Spray
93572a3e99 pageserver: mark tenant broken when cancelling attach (#6430)
## Problem

When a tenant is in Attaching state, and waiting for the
`concurrent_tenant_warmup` semaphore, it also listens for the tenant
cancellation token. When that token fires, Tenant::attach drops out.
Meanwhile, Tenant::set_stopping waits forever for the tenant to exit
Attaching state.

Fixes: https://github.com/neondatabase/neon/issues/6423

## Summary of changes

- In the absence of a valid state for the tenant, it is set to Broken in
this path. A more elegant solution will require more refactoring, beyond
this minimal fix.
2024-01-22 15:50:32 +00:00
Christian Schwarz
15c0df4de7 fixup(#6037): actually fix the issue, #6388 failed to do so (#6429)
Before this patch, the select! still returned immediately if `futs` was
empty. Must have tested a stale build in my manual testing of #6388.
2024-01-22 14:27:29 +00:00
Heikki Linnakangas
e4898a6e60 Don't pass InvalidTransactionId to update_next_xid. (#6410)
update_next_xid() doesn't have any special treatment for the invalid or
other special XIDs, so it will treat InvalidTransactionId (0) as a
regular XID. If old nextXid is smaller than 2^31, 0 will look like a
very old XID, and nothing happens. But if nextXid is greater than 2^31, 0
will look like a very new XID, and update_next_xid() will incorrectly
bump up nextXid.
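
To illustrate why 0 flips from "very old" to "very new" at the 2^31 boundary, a small sketch of PostgreSQL-style circular XID comparison (not the actual update_next_xid code):

```
// Wraparound-aware comparison: `a` is logically older than `b` if the
// modulo-2^32 signed difference is negative.
fn xid_precedes(a: u32, b: u32) -> bool {
    (a.wrapping_sub(b) as i32) < 0
}

fn main() {
    let next_xid: u32 = 0x9000_0000; // greater than 2^31
    // InvalidTransactionId (0) now compares as *newer* than next_xid, so a
    // naive update would incorrectly bump nextXid forward.
    assert!(xid_precedes(next_xid, 0));
}
```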
2024-01-20 18:04:16 +02:00
Christian Schwarz
760a48207d fixup(#6037): page_service hangs up within 10ms if there's no message (#6388)
From #6037 on, until this patch, if the client opens the connection but
doesn't send a `PagestreamFeMessage` within the first 10ms, we'd close
the connection because `self.timeline_cancelled()` returns.
It returns because `self.shard_timelines` is still empty at that point:
it gets filled lazily within the handlers for the incoming messages.

Changes
-------

The question is: if we can't check for timeline cancellation, what else
do we need to be cancellable for? `tenant.cancel` is also a bad choice
because the `tenant` (shard) we pick at the top of handle_pagerequests
might indeed go away over the course of the connection lifetime, but
other shards may still be there.

The correct solution, I think, is to be responsive to task_mgr
cancellation, because the connection handler runs in a task_mgr task and
it is already the current canonical way to shut down a tenant's /
timeline's page_service connections (see `Tenant::shutdown` /
`Timeline::shutdown`).

So, rename the function and make it sensitive to task_mgr cancellation.
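
A rough sketch of the resulting shape (hypothetical helper; the real handler uses task_mgr's shutdown token): the connection handler races the next message read against the task-level cancellation token rather than a per-timeline one.

```
use tokio_util::sync::CancellationToken;

// Returns None if shutdown was requested before the next message arrived.
async fn read_message_or_shutdown<T>(
    cancel: &CancellationToken,
    read_next: impl std::future::Future<Output = std::io::Result<T>>,
) -> Option<std::io::Result<T>> {
    tokio::select! {
        _ = cancel.cancelled() => None, // task_mgr-level shutdown
        msg = read_next => Some(msg),   // next PagestreamFeMessage
    }
}
```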
2024-01-19 19:16:01 +00:00
Christian Schwarz
e8f773387d pagebench: avoid noise about CopyFail in PS logs (#6392)
Before this patch, pagebench get-page-latest-lsn would sometimes cause
noisy errors in the pageserver log about the `CopyFail` protocol message.

refs https://github.com/neondatabase/neon/issues/6390
2024-01-18 18:50:42 +00:00
Christian Schwarz
00936d19e1 pagebench: use tracing panic hook (#6393) 2024-01-18 18:39:38 +00:00
Joonas Koivunen
57155ada77 temp: human readable summaries for relative access time compared to absolute (#6384)
When testing the new eviction order there is a problem: the (currently
rare) disk-usage-based evictions are all rare and unique. This PR adds a
human-readable summary of what the absolute order would have done and
what the relative order does. The assumption is that this logging will
make the few eviction runs in staging more useful.

Cc: #5304 for allowing testing in staging
2024-01-18 17:21:08 +02:00
John Spray
bd19290d9f pageserver: add shard_id to metric labels (#6308)
## Problem

tenant_id/timeline_id is no longer a full identifier for metrics from a
`Tenant` or `Timeline` object.

Closes: https://github.com/neondatabase/neon/issues/5953

## Summary of changes

Include a `shard_id` label everywhere we have a `tenant_id`/`timeline_id`
label.
2024-01-18 10:52:18 +00:00
John Spray
b6ec11ad78 control_plane: generalize attachment_service to handle sharding (#6251)
## Problem

To test sharding, we need something to control it. We could write Python
code for doing this from the test runner, but that wouldn't be usable
when running neon_local directly, and when we want to write tests with a
large number of shards/tenants, Rust is a better fit for efficiently
handling all the required state.

This service enables automated tests to easily get a system with
sharding/HA without the test itself having to set this all up by hand:
existing tests can be run against sharded tenants just by setting a
shard count when creating the tenant.

## Summary of changes

Attachment service was previously a map of TenantId->TenantState, where
the principal state stored for each tenant was the generation and the
last attached pageserver. This enabled it to serve the re-attach and
validate requests that the pageserver requires.

In this PR, the scope of the service is extended substantially to do
overall management of tenants in the pageserver, including
tenant/timeline creation, live migration, evacuation of offline
pageservers etc. This is done using synchronous code to make declarative
changes to the tenant's intended state (`TenantState.policy` and
`TenantState.intent`), which are then translated into calls into the
pageserver by the `Reconciler`.

Top level summary of modules within
`control_plane/attachment_service/src`:
- `tenant_state`: structure that represents one tenant shard.
- `service`: implements the main high-level operations, such as
tenant/timeline creation, marking a node offline, etc.
- `scheduler`: for operations that need to pick a pageserver for a
tenant, construct a scheduler and call into it.
- `compute_hook`: receive notifications when a tenant shard is attached
somewhere new. Once we have locations for all the shards in a tenant,
emit an update to postgres configuration via the neon_local `LocalEnv`.
- `http`: HTTP stubs. These mostly map to methods on `Service`, but are
separated for readability and so that it'll be easier to adapt if/when
we switch to another RPC layer.
- `node`: structure that describes a pageserver node. The most important
attribute of a node is its availability: marking a node offline causes
tenant shards to reschedule away from it.

This PR is a precursor to implementing the full sharding service for
prod (#6342). What's the difference between this and a production-ready
controller for pageservers?
- JSON file persistence to be replaced with a database
- Limited observability.
- No concurrency limits. Marking a pageserver offline will try and
migrate every tenant to a new pageserver concurrently, even if there are
thousands.
- Very simple scheduler that only knows to pick the pageserver with
fewest tenants, and place secondary locations on a different pageserver
than attached locations: it does not try to place shards for the same
tenant on different pageservers. This matters little in tests, because
picking the least-used pageserver usually results in round-robin
placement.
- Scheduler state is rebuilt exhaustively for each operation that
requires a scheduler.
- Relies on neon_local mechanisms for updating postgres: in production
this would be something that flows through the real control plane.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-01-17 18:01:08 +00:00
John Spray
4cec95ba13 pageserver: add list API for LocationConf (#6329)
## Problem

The `/v1/tenant` listing API only applies to attached tenants.

For an external service to implement a global reconciliation of its list
of shards vs. what's on the pageserver, we need a full view of what's in
TenantManager, including secondary tenant locations, and InProgress
locations.

Dependency of https://github.com/neondatabase/neon/pull/6251

## Summary of changes

- Add methods to Tenant and SecondaryTenant to reconstruct the
LocationConf used to create them.
- Add `GET /v1/location_config` API
2024-01-17 13:34:51 +00:00
Arpad Müller
ab86060d97 Copy initdb if loading from different timeline ID (#6363)
Previously, if we:

1. created a new timeline B from a different timeline A's initdb
2. deleted timeline A

the initdb for timeline B would be gone, at least in a world where we
are deleting initdbs upon timeline deletion. This world is imminent
(#6226).

Therefore, if the pageserver is instructed to load the initdb from a
different timeline ID, copy it to the newly created timeline's directory
in S3. This ensures that we can disaster recover the new timeline as
well, regardless of whether the original timeline was deleted or not.

Part of https://github.com/neondatabase/neon/issues/5282.
2024-01-17 12:42:42 +01:00
John Spray
bf4e708646 pageserver: eviction for secondary mode tenants (#6225)
Follows #6123 

Closes: https://github.com/neondatabase/neon/issues/5342

The approach here is to avoid using `Layer` from secondary tenants, and
instead make the eviction types (e.g. `EvictionCandidate`) have a
variant that carries a Layer for attached tenants, and a different
variant for secondary tenants.

Other changes:
- EvictionCandidate no longer carries a `Timeline`: this was only used
for providing a witness reference to remote timeline client.
- The types for returning eviction candidates are all in
disk_usage_eviction_task.rs now, whereas some of them were in
timeline.rs before.
- The EvictionCandidate type replaces the LocalLayerInfoForDiskUsageEviction
type, which was basically the same thing.
2024-01-16 10:29:26 +00:00
John Spray
887e94d7da page_service: more efficient page_service -> shard lookup (#6037)
## Problem

In #5980 the page service connection handler gets a simple piece of
logic for finding the right Timeline: at connection time, it picks an
arbitrary Timeline, and then when handling individual page requests it
checks if the original timeline is the correct shard, and if not looks
one up.

This is pretty slow in the case where we have to go look up the other
timeline, because we take the big tenants manager lock.

## Summary of changes

- Add a `shard_timelines` map of ShardIndex to Timeline on the page
service connection handler (see the sketch after this list)
- When looking up a Timeline for a particular ShardIndex, consult
`shard_timelines` to avoid hitting the TenantsManager unless we really
need to.
- Re-work the CancellationToken handling, because the handler now holds
gateguards on multiple timelines, and so must respect cancellation of
_any_ timeline it has in its cache, not just the timeline related to the
request it is currently servicing.
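
A simplified sketch of the per-connection cache (hypothetical types; the real handler keys by ShardIndex and holds gate guards as described above):

```
use std::collections::HashMap;
use std::sync::Arc;

// Placeholder types for illustration only.
type ShardIndex = u8;
struct Timeline;

struct PageServerHandler {
    shard_timelines: HashMap<ShardIndex, Arc<Timeline>>,
}

impl PageServerHandler {
    // Consult the local map first; only fall back to the (locked) tenants
    // manager lookup on a miss, then cache the result for later requests.
    fn timeline_for_shard(
        &mut self,
        shard: ShardIndex,
        lookup: impl FnOnce(ShardIndex) -> Option<Arc<Timeline>>,
    ) -> Option<Arc<Timeline>> {
        if let Some(tl) = self.shard_timelines.get(&shard) {
            return Some(Arc::clone(tl));
        }
        let tl = lookup(shard)?;
        self.shard_timelines.insert(shard, Arc::clone(&tl));
        Some(tl)
    }
}
```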

---------

Co-authored-by: Vlad Lazar <vlad@neon.tech>
2024-01-16 09:39:19 +00:00
John Spray
df9e9de541 pageserver: API updates for sharding (#6330)
The theme of the changes in this PR is that they're enablers for #6251
which are superficial struct/api changes.

This is a spinoff from #6251:
- Various APIs + clients thereof take TenantShardId rather than TenantId
- The creation API gets a ShardParameters member, which may be used to
configure shard count and stripe size. This enables the attachment
service to present a "virtual pageserver" creation endpoint that creates
multiple shards.
- The attachment service will use tenant size information to drive shard
splitting. Make a version of `TenantHistorySize` that is usable for
decoding these API responses.
- ComputeSpec includes a shard stripe size.
2024-01-16 09:21:00 +00:00
John Khvatov
2a3cfc9665 Remove PAGE_CACHE_ACQUIRE_PINNED_SLOT_TIME histogram. (#6356)
Fixes #6343.

## Problem

PAGE_CACHE_ACQUIRE_PINNED_SLOT_TIME is used on the hot path and it adds
noticeable latency to GetPage@LSN.

## Refs

https://discordapp.com/channels/1176467419317940276/1195022264115151001/1196370689268125716
2024-01-15 17:19:19 +01:00
Christian Schwarz
0e1ef3713e fix(pagebench): #6325 broke running without --runtime (#6351)
After PR #6325, when running without --runtime, we wouldn't wait for
start_work_barrier, causing the benchmark to not start at all.
2024-01-15 08:54:19 +00:00
Arpad Müller
60ced06586 Fix timeline creation and tenant deletion race (#6310)
Fixes the race condition between timeline creation and tenant deletion
outlined in #6255.

Related: #5914, which is a similar race condition about the uninit
marker file.

Fixes #6255
2024-01-13 09:15:58 +01:00
Christian Schwarz
cd48ea784f TenantInfo: expose generation number (#6348)
Generally useful when debugging / troubleshooting.

I found this useful when manually duplicating a tenant from a script[^1]
where I can't use `neon_fixtures.Pageserver.tenant_attach`'s automatic
integration with the neon_local's attachment_service.

[^1]: https://github.com/neondatabase/neon/pull/6349
2024-01-12 18:27:11 +01:00
Vlad Lazar
02c6abadf0 pageserver: remove depenency of pagebench on pageserver (#6334)
To achieve this I had to lift the BlockNumber and key_to_rel_block
definitions to pageserver_api (similar to a change in #5980).

Closes #6299
2024-01-12 17:11:19 +00:00
John Spray
7af4c676c0 pageserver: only upload initdb from shard 0 (#6331)
## Problem

When creating a timeline on a sharded tenant, we call into each shard.
We don't need to upload the initdb from every shard: only do it on shard
zero.

## Summary of changes

- Move the initdb upload into a function, and only call it on shard
zero.
2024-01-12 15:32:27 +01:00
John Spray
aafe79873c page_service: handle GetActiveTenantError::Cancelled (#6344)
## Problem

Occasional test failures with QueryError::Other errors saying
"cancelled" that get logged at error severity.

## Summary of changes

Avoid casting GetActiveTenantError::Cancelled into QueryError::Other --
it should be QueryError::Shutdown, which is not logged as an error.
2024-01-12 12:43:14 +00:00