The pagebench integration PR (#6214) is the first to send SIGQUIT to,
and then restart, the attachment_service.
With many tenants (100), we have seen frequent failures on restart in
CI[^1].
[^1]:
[Allure](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6214/7615750160/index.html#suites/e26265675583c610f99af77084ae58f1/851ff709578c4452/)
```
2024-01-22T19:07:57.932021Z INFO request{method=POST path=/attach-hook request_id=2697503c-7b3e-4529-b8c1-d12ef912d3eb}: Request handled, status: 200 OK
2024-01-22T19:07:58.898213Z INFO Got SIGQUIT. Terminating
2024-01-22T19:08:02.176588Z INFO version: git-env:d56f31639356ed8e8ce832097f132f27ee19ac8a, launch_timestamp: 2024-01-22 19:08:02.174634554 UTC, build_tag build_tag-env:7615750160, state at /tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json, listening on 127.0.0.1:15048
thread 'main' panicked at /__w/neon/neon/control_plane/attachment_service/src/persistence.rs:95:17:
Failed to load state from '/tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json': trailing characters at line 1 column 8957 (maybe your .neon/ dir was written by an older version?)
stack backtrace:
0: rust_begin_unwind
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
1: core::panicking::panic_fmt
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
2: attachment_service::persistence::PersistentState::load_or_new::{{closure}}
at ./control_plane/attachment_service/src/persistence.rs:95:17
3: attachment_service::persistence::Persistence::new::{{closure}}
at ./control_plane/attachment_service/src/persistence.rs:103:56
4: attachment_service::main::{{closure}}
at ./control_plane/attachment_service/src/main.rs:69:61
5: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
6: tokio::runtime::coop::with_budget
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
7: tokio::runtime::coop::budget
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
8: tokio::runtime::park::CachedParkThread::block_on
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
9: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
11: tokio::runtime::context::runtime::enter_runtime
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
13: tokio::runtime::runtime::Runtime::block_on
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:50
14: attachment_service::main
at ./control_plane/attachment_service/src/main.rs:99:5
15: core::ops::function::FnOnce::call_once
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
The attachment_service handles SIGQUIT by simply exiting the process.
In theory, the SIGQUIT could arrive while we're writing out
`attachments.json`.
However, in the log output above there is a 1-second gap between the
last request completing and the SIGQUIT arriving, so there must be some
other issue.
Still, let's make this change anyway; maybe it helps uncover the real
cause of the test failure.
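A minimal sketch of what such a change could look like, assuming a
write-to-temp-then-rename scheme (the helper name is hypothetical, not
the actual code): even if SIGQUIT lands mid-write, `attachments.json` is
either the old or the new content, never a truncated mix.
```rust
use std::{fs, io::Write, path::Path};

// Hypothetical sketch: crash-safe persistence via temp file + atomic rename.
fn persist_atomically(path: &Path, json: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("json.tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(json)?;
    f.sync_all()?; // flush file contents before the rename publishes them
    fs::rename(&tmp, path)?; // atomic on POSIX filesystems
    Ok(())
}
```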
1. Introduce a naive `Timeline::get_vectored` implementation
The return type is intended to be flexible enough for various types of
callers. We return the pages in a map keyed by `Key`, so that the
caller doesn't have to map results back to keys itself. Some callers can
ignore errors for specific pages, so we return a separate `Result<Bytes,
PageReconstructError>` for each page, plus an overarching
`GetVectoredError` for API misuse. The overhead of the mapping is
small and bounded since we enforce a maximum key count per operation;
a sketch of the shape of the API follows after this list.
2. Use the `get_vectored` API for SLRU segment reconstruction and image
layer creation.
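A self-contained sketch of the shape described above; the stand-in
types, the constant, and the `get_one` callback are assumptions for
illustration, not the actual pageserver API.
```rust
use std::collections::BTreeMap;

// Stand-in types so the sketch compiles on its own.
type Key = u128;
type Bytes = Vec<u8>;
#[derive(Debug)]
struct PageReconstructError;
#[derive(Debug)]
enum GetVectoredError {
    TooManyKeys(usize), // API misuse: caller exceeded the bounded key count
}

const MAX_GET_VECTORED_KEYS: usize = 32; // hypothetical bound

// Naive vectored get: validate the key count up front, then fetch each page
// individually, recording a per-page Result so callers may ignore errors
// for specific pages while still seeing the rest.
fn get_vectored(
    keys: &[Key],
    get_one: impl Fn(Key) -> Result<Bytes, PageReconstructError>,
) -> Result<BTreeMap<Key, Result<Bytes, PageReconstructError>>, GetVectoredError> {
    if keys.len() > MAX_GET_VECTORED_KEYS {
        return Err(GetVectoredError::TooManyKeys(keys.len()));
    }
    Ok(keys.iter().map(|&k| (k, get_one(k))).collect())
}
```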
## Problem
Parsing the IP address at check time is a little wasteful.
## Summary of changes
Parse the IP when we get it from cplane, adding a `None` variant to
still allow malformed patterns.
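A hedged sketch of that shape (names are hypothetical; the real proxy
type covers more pattern forms): parse once on receipt, match cheaply at
check time.
```rust
use std::net::IpAddr;

// Hypothetical pattern type: parsed once when received from cplane.
enum IpPattern {
    Single(IpAddr),
    None, // malformed entry: kept, but never matches
}

fn parse_pattern(s: &str) -> IpPattern {
    s.parse::<IpAddr>().map(IpPattern::Single).unwrap_or(IpPattern::None)
}

// At check time no parsing happens anymore, only comparisons.
fn is_allowed(patterns: &[IpPattern], peer: IpAddr) -> bool {
    patterns
        .iter()
        .any(|p| matches!(p, IpPattern::Single(ip) if *ip == peer))
}
```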
## Problem
There are a lot of responses with a 404 "role not found" error, which
are not getting cached in proxy.
## Summary of changes
If an empty secret is returned together with a project_id, store it in
the cache.
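A minimal sketch of that rule, with hypothetical types: negative results
("role not found") become cacheable as long as a project_id exists to
invalidate by.
```rust
use std::collections::HashMap;

// Hypothetical cache entry: secret == None encodes "role not found".
struct AuthInfo {
    secret: Option<Vec<u8>>,
    project_id: Option<String>,
}

fn maybe_cache(cache: &mut HashMap<String, AuthInfo>, key: String, info: AuthInfo) {
    // Cache real secrets, and also empty ones if we know the project_id,
    // so repeated 404 lookups hit the cache until invalidation.
    if info.secret.is_some() || info.project_id.is_some() {
        cache.insert(key, info);
    }
}
```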
## Problem
There is "neon.pageserver_connstring" GUC with PGC_SIGHUP option,
allowing to change it using
pg_reload_conf(). It is used by control plane to update pageserver
connection string if page server is crashed,
relocated or new shards are added.
It is copied to shared memory because config can not be loaded during
query execution and we need to
reestablish connection to page server.
## Summary of changes
Copying the connection string to shared memory is done by the
postmaster. Other backends check an update counter to determine whether
the connection URL has changed and the connection needs to be
reestablished.
We cannot use standard Postgres LWLocks, because the postmaster has no
proc entry and so cannot wait on this primitive. Instead, a lockless
access algorithm is implemented using two atomic counters to enforce
consistent reads of the connection string value from shared memory.
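A seqlock-style sketch of the two-counter scheme, written in Rust for
brevity (the real implementation is C in the extension; names are
stand-ins, memory orderings are simplified, and the racy buffer copy is
only well-defined in the C original):
```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU64, Ordering};

struct SharedConnstring {
    started: AtomicU64,         // bumped by the postmaster before a write
    finished: AtomicU64,        // bumped by the postmaster after a write
    buf: UnsafeCell<[u8; 256]>, // connection string in "shared memory"
}
unsafe impl Sync for SharedConnstring {}

impl SharedConnstring {
    // Single writer (postmaster only): needs no lock at all.
    fn write(&self, s: &[u8]) {
        self.started.fetch_add(1, Ordering::AcqRel);
        unsafe { (*self.buf.get())[..s.len()].copy_from_slice(s) };
        self.finished.fetch_add(1, Ordering::AcqRel);
    }
    // Readers (backends): retry until both counters agree around the copy,
    // i.e. no write overlapped our read of the buffer.
    fn read(&self) -> Vec<u8> {
        loop {
            let before = self.finished.load(Ordering::Acquire);
            let copy = unsafe { (*self.buf.get()).to_vec() };
            if self.started.load(Ordering::Acquire) == before {
                return copy;
            }
            std::hint::spin_loop();
        }
    }
}
```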
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
The idea is to achieve separation between the keyspace layout definition
and operations on said keyspace. I've inlined all these functions since
they're small and we don't use LTO in the storage release builds
at the moment (see the sketch below).
Closes https://github.com/neondatabase/neon/issues/6347
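For illustration, the pattern looks roughly like this (the helper and
its body are hypothetical): without LTO, small cross-crate helpers only
get inlined when marked explicitly.
```rust
pub struct Key {
    pub field1: u8,
    // ... remaining fields elided ...
}

// Hypothetical layout predicate: trivial enough that an inline hint is
// what makes it free across crate boundaries in non-LTO release builds.
#[inline(always)]
pub fn is_rel_block_key(key: &Key) -> bool {
    key.field1 == 0x00
}
```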
Before this patch, we would update the `tenant_state.intent` in memory
but not persist the detachment to disk.
I noticed this in https://github.com/neondatabase/neon/pull/6214 where
we stop, then restart, the attachment service.
## Problem
When a tenant is in Attaching state, and waiting for the
`concurrent_tenant_warmup` semaphore, it also listens for the tenant
cancellation token. When that token fires, Tenant::attach drops out.
Meanwhile, Tenant::set_stopping waits forever for the tenant to exit
Attaching state.
Fixes: https://github.com/neondatabase/neon/issues/6423
## Summary of changes
- In the absence of a valid state for the tenant, it is set to Broken in
this path (see the sketch below). A more elegant solution will require
more refactoring, beyond this minimal fix.
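An illustrative sketch of the shape of the fix, with stand-in names (the
actual control flow in `Tenant::attach` differs): race the warmup permit
against cancellation, and leave the tenant in a terminal state either
way, so `set_stopping` can't wait forever on Attaching.
```rust
use tokio::sync::Semaphore;
use tokio_util::sync::CancellationToken;

#[derive(Debug)]
enum TenantState { Active, Broken }

async fn attach_with_warmup(
    warmup: &Semaphore,
    cancel: &CancellationToken,
) -> TenantState {
    tokio::select! {
        permit = warmup.acquire() => {
            let _permit = permit.expect("warmup semaphore closed");
            // ... perform the actual attach work here ...
            TenantState::Active
        }
        _ = cancel.cancelled() => {
            // Cancelled before reaching any valid state: mark Broken so
            // shutdown observes the tenant leaving Attaching.
            TenantState::Broken
        }
    }
}
```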
## Problem
If you build the compute-node dockerfile with the PG_VERSION argument
passed in (e.g. `docker build -f Dockerfile.compute-node --build-arg
PG_VERSION=v15 .`), it fails, as some of the stages don't have the
PG_VERSION arg defined.
## Summary of changes
Added the PG_VERSION arg to the plv8-build, neon-pg-ext-build, and
pg-embedding-pg-build stages of Dockerfile.compute-node
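This is standard Docker ARG scoping: a top-level ARG is not visible
inside a build stage unless redeclared there. An illustrative excerpt
(the base image name is an assumption):
```dockerfile
ARG PG_VERSION

FROM build-deps AS plv8-build
# Redeclare to bring the global ARG into this stage's scope.
ARG PG_VERSION
RUN echo "building plv8 for Postgres ${PG_VERSION}"
```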
## Problem
See https://neondb.slack.com/archives/C06F5UJH601/p1705731304237889
Adding 1 to the xid in `update_next_xid` can cause an overflow panic in
debug mode, since 0xFFFFFFFF is a valid transaction ID.
## Summary of changes
Use `wrapping_add`
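A one-function illustration of why (the function name is just for the
example): XIDs are modulo-2^32 values, so the increment is supposed to
wrap.
```rust
// In debug builds, `0xFFFF_FFFFu32 + 1` panics with "attempt to add with
// overflow"; wrapping_add makes the intended modulo-2^32 wraparound explicit.
fn next_xid(xid: u32) -> u32 {
    xid.wrapping_add(1) // 0xFFFF_FFFF -> 0
}
```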
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
In https://github.com/neondatabase/neon/pull/6283 I made a couple of
changes that weren't directly related to the goal of extracting the
state machine, so I'm putting them here.
## Summary of changes
- move postgres vs console provider into another enum
- reduce error cases for link auth
- slightly refactor link flow
## Problem
Currently we store entries in the cache even if the project is
undefined. That makes invalidation impossible.
## Summary of changes
Do not store the entry if the project id is empty.
`update_next_xid()` doesn't have any special treatment for the invalid
or other special XIDs, so it treats `InvalidTransactionId` (0) as a
regular XID. If the old nextXid is smaller than 2^31, 0 looks like a
very old XID and nothing happens. But if nextXid is greater than 2^31,
0 looks like a very new XID, and `update_next_xid()` incorrectly bumps
up nextXid.
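A worked example of the circular comparison behind this (hedged: this
mirrors the usual modulo-2^32 XID arithmetic, not necessarily the exact
code in question):
```rust
// A candidate xid counts as "newer" than next_xid when the wrapping
// difference, reinterpreted as i32, is positive (modulo-2^32 ordering).
fn xid_follows(xid: u32, next_xid: u32) -> bool {
    xid.wrapping_sub(next_xid) as i32 > 0
}

fn main() {
    assert!(!xid_follows(0, 0x1000_0000)); // nextXid < 2^31: 0 looks very old
    assert!(xid_follows(0, 0x9000_0000)); // nextXid > 2^31: 0 looks very new!
}
```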
Configures nextest to kill tests after 1 minute. The slow period is set
to 20s, which is how long our tests currently take in total; there will
be 2 warnings, and then the test will be killed and its output logged.
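For reference, the matching nextest configuration would look roughly
like this (values inferred from the description above):
```toml
# .config/nextest.toml: warn every 20s; terminate after 3 periods = 1 minute,
# which yields 2 "slow" warnings before the kill.
[profile.default]
slow-timeout = { period = "20s", terminate-after = 3 }
```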
Cc: #6361
Cc: #6368 -- likely this will be enough for a long time, but it will be
counterproductive when we want to attach and debug; the added line would
have to be commented out.
## Problem
Some fields were missed in the initial spec.
## Summary of changes
Adds a `success` boolean (defaults to false unless specifically marked
as successful).
Adds a `duration_us` integer that tracks how many microseconds elapsed
from session start to request completion.
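A hedged sketch of the two fields in context (struct name and field
placement are assumptions from the description, not the actual spec):
```rust
use serde::Serialize;

#[derive(Serialize, Default)]
struct ProxyRequestEvent {
    // ... existing spec fields elided ...
    success: bool,    // false unless explicitly marked successful
    duration_us: u64, // microseconds from session start to request completion
}

fn finish(started: std::time::Instant, success: bool) -> ProxyRequestEvent {
    ProxyRequestEvent {
        success,
        duration_us: started.elapsed().as_micros() as u64,
        ..Default::default()
    }
}
```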
From #6037 on, until this patch, if the client opens the connection but
doesn't send a `PagestreamFeMessage` within the first 10ms, we'd close
the connection because `self.timeline_cancelled()` returns.
It returns because `self.shard_timelines` is still empty at that point:
it gets filled lazily within the handlers for the incoming messages.
Changes
-------
The question is: if we can't check for timeline cancellation, what else
do we need to be cancellable for? `tenant.cancel` is also a bad choice
because the `tenant` (shard) we pick at the top of handle_pagerequests
might indeed go away over the course of the connection lifetime, but
other shards may still be there.
The correct solution, I think, is to be responsive to task_mgr
cancellation, because the connection handler runs in a task_mgr task,
and that is already the canonical way we shut down a tenant's /
timeline's page_service connections (see `Tenant::shutdown` /
`Timeline::shutdown`).
So, rename the function and make it sensitive to task_mgr cancellation.
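An illustrative read loop under that design, with stand-in types (an
mpsc channel stands in for the socket; the real handler reads
`PagestreamFeMessage` frames): shutdown is observed even before the
first message arrives.
```rust
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;

async fn handle_pagerequests(
    shutdown: CancellationToken, // task_mgr-style cancellation
    mut messages: mpsc::Receiver<Vec<u8>>,
) {
    loop {
        tokio::select! {
            _ = shutdown.cancelled() => break, // tenant/timeline shutdown
            msg = messages.recv() => match msg {
                Some(_frame) => { /* decode and dispatch the message */ }
                None => break, // client closed the connection
            },
        }
    }
}
```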
In the most straightforward way: the safekeeper performs it in the
DELETE endpoint implementation, with no coordination between
safekeepers.
The delete_force endpoint in the code is renamed to delete, as there is
now only one way to delete.
## Problem
For PRs with the `run-benchmarks` label, we don't upload results to the
database, making it harder to debug such tests. The only way to see the
numbers is by examining the GitHub Actions output, which is really
inconvenient.
This PR adds zenbenchmark metrics to Allure reports.
## Summary of changes
- Create a JSON file with zenbenchmark results and attach it to the
Allure report
In 7f828890cf we changed the logic for persisting control files.
Previously, the control file was updated if `peer_horizon_lsn` jumped by
more than one segment, which meant `peer_horizon_lsn` was initialized on
disk as soon as the safekeeper received its first `AppendRequest`.
This caused an issue with `truncateLsn`, which can now sometimes be
zero. This PR fixes it: `truncateLsn`/`peer_horizon_lsn` can never be
zero once we know `timeline_start_lsn`.
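A tiny sketch of the resulting invariant (names are stand-ins; the
actual fix lives in the safekeeper's control-file update path):
```rust
type Lsn = u64; // stand-in for the real Lsn newtype

// Once timeline_start_lsn is known, never let the persisted horizon be
// zero: clamp it from below. A fresh timeline with peer_horizon_lsn == 0
// then persists timeline_start_lsn instead of 0.
fn effective_peer_horizon(peer_horizon_lsn: Lsn, timeline_start_lsn: Lsn) -> Lsn {
    std::cmp::max(peer_horizon_lsn, timeline_start_lsn)
}
```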
Closes https://github.com/neondatabase/neon/issues/6248