Stacks on:
- https://github.com/neondatabase/neon/pull/7165
Fixes while working on background optimization of scheduling after a
split:
- When a tenant has secondary locations, we weren't detaching the parent
shards' secondary locations when doing a split
- When a reconciler detaches a location, it was feeding back a
locationconf with `Detached` mode in its `observed` object, whereas it
should omit that location. This could cause the background reconcile
task to keep kicking off no-op reconcilers forever (harmless but
annoying).
- During shard split, we were scheduling secondary locations for the
child shards, but no reconcile was run for these until the next time the
background reconcile task ran. Creating these ASAP is useful, because
they'll be used shortly after a shard split as the destination locations
for migrating the new shards to different nodes.
## Problem
Storage controller had basically no metrics.
## Summary of changes
1. Migrate the existing metrics to use Conrad's
[`measured`](https://docs.rs/measured/0.0.14/measured/) crate.
2. Add metrics for incoming http requests
3. Add metrics for outgoing http requests to the pageserver
4. Add metrics for outgoing pass through requests to the pageserver
5. Add metrics for database queries
Note that the metrics response for the attachment service does not use
chunked encoding like the rest of the metrics endpoints. Conrad has
kindly extended the crate such that it can now be done. Let's leave it
for a follow-up since the payload shouldn't be that big at this point.
Fixes https://github.com/neondatabase/neon/issues/6875
## Problem
The current implementation of struct Layer supports canceled read
requests, but those will leave the internal state such that a following
`Layer::keep_resident` call will need to repair the state. In
pathological cases seen during generation numbers resetting in staging
or with too many in-progress on-demand downloads, this repair activity
will need to wait for the download to complete, which stalls disk
usage-based eviction. Similar stalls have been observed in staging near
disk-full situations, where downloads failed because the disk was full.
Fixes#6028 or the "layer is present on filesystem but not evictable"
problems by:
1. not canceling pending evictions by a canceled
`LayerInner::get_or_maybe_download`
2. completing post-download initialization of the `LayerInner::inner`
from the download task
Not canceling evictions above case (1) and always initializing (2) lead
to plain `LayerInner::inner` always having the up-to-date information,
which leads to the old `Layer::keep_resident` never having to wait for
downloads to complete. Finally, the `Layer::keep_resident` is replaced
with `Layer::is_likely_resident`. These fix#7145.
## Summary of changes
- add a new test showing that a canceled get_or_maybe_download should
not cancel the eviction
- switch to using a `watch` internally rather than a `broadcast` to
avoid hanging eviction while a download is ongoing
- doc changes for new semantics and cleanup
- fix `Layer::keep_resident` to use just `self.0.inner.get()` as truth
as `Layer::is_likely_resident`
- remove `LayerInner::wanted_evicted` boolean as no longer needed
Builds upon: #7185. Cc: #5331.
`pgxn/` also contains WAL proposer code, so modifications to this
directory should be able to be approved by the safekeeper team.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Before this PR, cancellation for `LayerInner::get_or_maybe_download`
could occur so that we have downloaded the layer file in the filesystem,
but because of the cancellation chance, we have not set the internal
`LayerInner::inner` or initialized the state. With the detached init
support introduced in #7135 and in place in #7152, we can now initialize
the internal state after successfully downloading in the spawned task.
The next PR will fix the remaining problems that this PR leaves:
- `Layer::keep_resident` is still used because
- `Layer::get_or_maybe_download` always cancels an eviction, even when
canceled
Split off from #7030. Stacked on top of #7152. Cc: #5331.
- Enable debug logs for this test
- Add some debug logging detail in downloader.rs
- Add an info-level message in scheduler.rs that makes it obvious if a
command is waiting for an existing task rather than spawning a new one.
The second part of work towards fixing `Layer::keep_resident` so that it
does not need to repair the internal state. #7135 added a nicer API for
initialization. This PR uses it to remove a few indentation levels and
the loop construction. The next PR #7175 will use the refactorings done
in this PR, and always initialize the internal state after a download.
Cc: #5331
Since #6115 with more often used get_value_reconstruct_data and friends,
we should not have needless INFO level span creation near hot paths. In
our prod configuration, INFO spans are always created, but in practice,
very rarely anything at INFO level is logged underneath.
`ResidentLayer::load_keys` is only used during compaction so it is not
that hot, but this aligns the access paths and their span usage.
PR changes the span level to debug to align with others, and adds the
layer name to the error which was missing.
Split off from #7030.
## Problem
faster sha2 hashing.
## Summary of changes
enable asm feature for sha2. this feature will be default in sha2 0.11,
so we might as well lean into it now. It provides a noticeable speed
boost on macos aarch64. Haven't tested on x86 though
Warm-up (and the "tenant startup complete" metric update) happens in
a background tokio task. The tenant map is eagerly updated (can happen
before the task finishes).
The test assumed that if the tenant map was updated, then the metric
should reflect that. That's not the case, so we tweak the test to wait
for the metric.
Fixes https://github.com/neondatabase/neon/issues/7158
This is a mixed bag of changes split out for separate review while
working on other things, and batched together to reduce load on CI
runners. Each commits stands alone for review purposes:
- do_tenant_shard_split was a long function and had a synchronous
validation phase at the start that could readily be pulled out into a
separate function. This also avoids the special casing of
ApiError::BadRequest when deciding whether an abort is needed on errors
- Add a 'describe' API (GET on tenant ID) that will enable storcon-cli
to see what's going on with a tenant
- the 'locate' API wasn't really meant for use in the field. It's for
tests: demote it to the /debug/ prefix
- The `Single` placement policy was a redundant duplicate of Double(0),
and Double was a bad name. Rename it Attached.
(https://github.com/neondatabase/neon/issues/7107)
- Some neon_local commands were added for debug/demos, which are now
replaced by commands in storcon-cli (#7114 ). Even though that's not
merged yet, we don't need the neon_local ones any more.
Closes https://github.com/neondatabase/neon/issues/7107
## Backward compat of Single/Double -> `Attached(n)` change
A database migration is used to convert any existing values.
e2e tests cannot run on macOS unless the file engine env var is
supplied.
```
./scripts/pytest test_runner/regress/test_neon_superuser.py -s
```
will fail with tokio-epoll-uring not supported.
This is because we persist the file engine config by default. In this
pull request, we only persist when someone specifies it, so that it can
use the default platform-variant config in the page server.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
As with the pageserver, we should fail tests that emit unexpected log
errors/warnings.
## Summary of changes
- Refactor existing log checks to be reusable
- Run log checks for attachment_service
- Add allow lists as needed.
Add shard_number to PageserverFeedback and parse it on the compute side.
When compute receives a new ps_feedback, it calculates min LSNs among
feedbacks from all shards, and uses those LSNs for backpressure.
Add `test_sharding_backpressure` to verify that backpressure slows down
compute to wait for the slowest shard.
Manual testing of the changes in #7160 revealed that, if the
thread-local destructor ever runs (it apparently doesn't in our test
suite runs, otherwise #7160 would not have auto-merged), we can
encounter an `abort()` due to a double-panic in the tracing code.
This github comment here contains the stack trace:
https://github.com/neondatabase/neon/pull/7160#issuecomment-2003778176
This PR reverts #7160 and uses a atomic counter to identify the
thread-local in log messages, instead of the memory address of the
thread local, which may be re-used.
with `immediate_gc` the span only covered the `gc_iteration`, make it
cover the whole needless spawned task, which also does waiting for layer
drops and stray logging in tests.
also clarify some comments while we are here.
Fixes: #6910
The PR #7141 added log message
```
ThreadLocalState is being dropped and id might be re-used in the future
```
which was supposed to be emitted when the thread-local is destroyed.
Instead, it was emitted on _each_ call to `thread_local_system()`,
ie.., on each tokio-epoll-uring operation.
Testing
-------
Reproduced the issue locally and verified that this PR fixes the issue.
## Problem
Followup to https://github.com/neondatabase/neon/pull/6725
In that PR, code for purging local files from a tenant shard was
duplicated.
## Summary of changes
- Refactor detach code into TenantManager
- `spawn_background_purge` method can now be common between detach and
split operations
refs https://github.com/neondatabase/neon/issues/7136
Problem
-------
Before this PR, we were using
`tokio_epoll_uring::thread_local_system()`,
which panics on tokio_epoll_uring::System::launch() failure
As we've learned in [the
past](https://github.com/neondatabase/neon/issues/6373#issuecomment-1905814391),
some older Linux kernels account io_uring instances as locked memory.
And while we've raised the limit in prod considerably, we did hit it
once on 2024-03-11 16:30 UTC.
That was after we enabled tokio-epoll-uring fleet-wide, but before
we had shipped release-5090 (c6ed86d3d0)
which did away with the last mass-creation of tokio-epoll-uring
instances as per
commit 3da410c8fe
Author: Christian Schwarz <christian@neon.tech>
Date: Tue Mar 5 10:03:54 2024 +0100
tokio-epoll-uring: use it on the layer-creating code paths (#6378)
Nonetheless, it highlighted that panicking in this situation is probably
not ideal, as it can leave the pageserver process in a semi-broken
state.
Further, due to low sampling rate of Prometheus metrics, we don't know
much about the circumstances of this failure instance.
Solution
--------
This PR implements a custom thread_local_system() that is
pageserver-aware
and will do the following on failure:
- dump relevant stats to `tracing!`, hopefully they will be useful to
understand the circumstances better
- if it's the locked memory failure (or any other ENOMEM): abort() the
process
- if it's ENOMEM, retry with exponential back-off, capped at 3s.
- add metric counters so we can create an alert
This makes sense in the production environment where we know that
_usually_, there's ample locked memory allowance available, and we know
the failure rate is rare.
## Problem
The existing secondary download API relied on the caller to wait as long
as it took to complete -- for large shards that could be a long time, so
typical clients that might have a baked-in ~30s timeout would have a
problem.
## Summary of changes
- Take a `wait_ms` query parameter to instruct the pageserver how long
to wait: if the download isn't complete in this duration, then 201 is
returned instead of 200.
- For both 200 and 201 responses, include response body describing
download progress, in terms of layers and bytes. This is sufficient for
the caller to track how much data is being transferred and log/present
that status.
- In storage controller live migrations, use this API to apply a much
longer outer timeout, with smaller individual per-request timeouts, and
log the progress of the downloads.
- Add a test that injects layer download delays to exercise the new
behavior
# Problem
On-demand downloads are still using `tokio::fs`, which we know is
inefficient.
# Changes
- Add `pagebench ondemand-download-churn` to quantify on-demand download
throughput
- Requires dumping layer map, which required making `history_buffer`
impl `Deserialize`
- Implement an equivalent of `tokio::io::copy_buf` for owned buffers =>
`owned_buffers_io` module and children.
- Make layer file download sensitive to `io_engine::get()`, using
VirtualFile + above copy loop
- For this, I had to move some code into the `retry_download`, e.g.,
`sync_all()` call.
Drive-by:
- fix missing escaping in `scripts/ps_ec2_setup_instance_store`
- if we failed in retry_download to create a file, we'd try to remove
it, encounter `NotFound`, and `abort()` the process using
`on_fatal_io_error`. This PR adds treats `NotFound` as a success.
# Testing
Functional
- The copy loop is generic & unit tested.
Performance
- Used the `ondemand-download-churn` benchmark to manually test against
real S3.
- Results (public Notion page):
https://neondatabase.notion.site/Benchmarking-tokio-epoll-uring-on-demand-downloads-2024-04-15-newer-code-03c0fdc475c54492b44d9627b6e4e710?pvs=4
- Performance is equivalent at low concurrency. Jumpier situation at
high concurrency, but, still less CPU / throughput with
tokio-epoll-uring.
- It’s a win.
# Future Work
Turn the manual performance testing described in the above results
document into a performance regression test:
https://github.com/neondatabase/neon/issues/7146
## Problem
Tenant deletion had a couple of TODOs where we weren't using proper
cancellation tokens that would have aborted the deletions during process
shutdown.
## Summary of changes
- Refactor enough that deletion/shutdown code has access to the
TenantManager's cancellation toke
- Use that cancellation token in tenant deletion instead of dummy
tokens.
fixes https://github.com/neondatabase/neon/issues/7116
Changes:
- refactor PageServerConfigBuilder: support not-set values
- implement runtime feature test
- use runtime feature test to determine `virtual_file_io_engine` if not
explicitly configured in the config
- log the effective engine at startup
- drive-by: improve assertion messages in `test_pageserver_init_node_id`
This needed a tiny bit of tokio-epoll-uring work, hence bumping it.
Changelog:
```
git log --no-decorate --oneline --reverse 868d2c42b5d54ca82fead6e8f2f233b69a540d3e..342ddd197a060a8354e8f11f4d12994419fff939
c7a74c6 Bump mio from 0.8.8 to 0.8.11
4df3466 Bump mio from 0.8.8 to 0.8.11 (#47)
342ddd1 lifecycle: expose `LaunchResult` enum (#49)
```
## Problem
See:
- https://github.com/neondatabase/neon/issues/6374
## Summary of changes
Whereas previously we calculated synthetic size from the gc_horizon or
the pitr_interval (whichever is the lower LSN), now we ignore gc_horizon
and exclusively start from the `pitr_interval`. This is a more generous
calculation for billing, where we do not charge users for data retained
due to gc_horizon.
These test runs usually take 20-30 minutes. if something hangs, we see
actions proceeding for several hours: it's more convenient to have them
time out sooner so that we notice that something has hung faster.
Aiming for the design where `heavier_once_cell::OnceCell` is initialized
by a future factory lead to awkwardness with how
`LayerInner::get_or_maybe_download` looks right now with the `loop`. The
loop helps with two situations:
- an eviction has been scheduled but has not yet happened, and a read
access should cancel the eviction
- a previous `LayerInner::get_or_maybe_download` that canceled a pending
eviction was canceled leaving the `heavier_once_cell::OnceCell`
uninitialized but needing repair by the next
`LayerInner::get_or_maybe_download`
By instead supporting detached initialization in
`heavier_once_cell::OnceCell` via an `OnceCell::get_or_detached_init`,
we can fix what the monolithic #7030 does:
- spawned off download task initializes the
`heavier_once_cell::OnceCell` regardless of the download starter being
canceled
- a canceled `LayerInner::get_or_maybe_download` no longer stops
eviction but can win it if not canceled
Split off from #7030.
Cc: #5331
Split off from #7030:
- each early exit is counted as canceled init, even though it most
likely was just `LayerInner::keep_resident` doing the no-download repair
check
- `downloaded_after` could had been accounted for multiple times, and
also when repairing to match on-disk state
Cc: #5331
Switched the order; doing https://github.com/neondatabase/neon/pull/6139
first then can remove uninit marker after.
## Problem
Previously, existence of a timeline directory was treated as evidence of
the timeline's logical existence. That is no longer the case since we
treat remote storage as the source of truth on each startup: we can
therefore do without this mark file.
The mark file had also been used as a pseudo-lock to guard against
concurrent creations of the same TimelineId -- now that persistence is
no longer required, this is a bit unwieldy.
In #6139 the `Tenant::timelines_creating` was added to protect against
concurrent creations on the same TimelineId, making the uninit mark file
entirely redundant.
## Summary of changes
- Code that writes & reads mark file is removed
- Some nearby `pub` definitions are amended to `pub(crate)`
- `test_duplicate_creation` is added to demonstrate that mutual
exclusion of creations still works.
## Problem
These fields were only optional for the convenience of the `local_fs`
test helper -- real remote storage backends provide them. It complicated
any code that actually wanted to use them for anything.
## Summary of changes
- Make these fields non-optional
- For azure/S3 it is an error if the server doesn't provide them
- For local_fs, use random strings as etags and the file's mtime for
last_modified.
We need to shard our Tenants to support larger databases without those
large databases dominating our pageservers and/or requiring dedicated
pageservers.
This RFC aims to define an initial capability that will permit creating
large-capacity databases using a static configuration
defined at time of Tenant creation.
Online re-sharding is deferred as future work, as is offloading layers
for historical reads. However, both of these capabilities would be
implementable without further changes to the control plane or compute:
this RFC aims to define the cross-component work needed to bootstrap
sharding end-to-end.
## Problem
We have no regression tests for websocket flow
## Summary of changes
Add a hacky implementation of the postgres protocol over websockets just
to verify the protocol behaviour does not regress over time.
This pull request disables neon extension auto upgrade to help the next
compute image upgrade smooth.
## Summary of changes
We have two places to auto-upgrade neon extension: during compute spec
update, and when the compute node starts. The compute spec update logic
is always there, and the compute node start logic is added in
https://github.com/neondatabase/neon/pull/7029. In this pull request, we
disable both of them, so that we can still roll back to an older version
of compute before figuring out the best way of extension
upgrade-downgrade. https://github.com/neondatabase/neon/issues/6936
We will enable auto-upgrade in the next release following this release.
There are no other extension upgrades from release 4917 and therefore
after this pull request, it would be safe to revert to release 4917.
Impact:
* Project created after unpinning the compute image -> if we need to
roll back, **they will stuck**, because the default neon extension
version is 1.3. Need to manually pin the compute image version if such
things happen.
* Projects already stuck on staging due to not downgradeable -> I don't
know their current status, maybe they are already running the latest
compute image?
* Other projects -> can be rolled back to release 4917.
Signed-off-by: Alex Chi Z <chi@neon.tech>
We have been using #5681 for quite some time, and at least since #6931
the tests have assumed `cargo-nextest` to work around our use of global
statics. Unlike the `cargo test`, the `cargo nextest run` runs each test
as a separate process that can be timeouted.
Add a mention of using `cargo-nextest` in the top-level README.md.
Sub-crates can still declare they support `cargo test`, like
`compute_tools/README.md` does.
A node with a bad DNS configuration can register itself with the storage
controller, and the controller will try and schedule work onto the node,
but never succeed because it can't reach the node.
The DNS case is a special case of asymmetric network issues. The general
case isn't covered here -- but might make sense to tighten up after
#6844 merges -- then we can avoid assuming a node is immediately
available in re_attach.
## Problem
If a pageserver was offline when the storage controller started, there
was no mechanism to update the
storage controller state when the pageserver becomes active.
## Summary of changes
* Add a heartbeater module. The heartbeater must be driven by an
external loop.
* Integrate the heartbeater into the service.
- Extend the types used by the service and scheduler to keep track of a
nodes' utilisation score.
- Add a background loop to drive the heartbeater and update the state
based on the deltas it generated
- Do an initial round of heartbeats at start-up