Stacks on:
- https://github.com/neondatabase/neon/pull/7165
Fixes while working on background optimization of scheduling after a
split:
- When a tenant has secondary locations, we weren't detaching the parent
shards' secondary locations when doing a split
- When a reconciler detaches a location, it was feeding back a
locationconf with `Detached` mode in its `observed` object, whereas it
should omit that location. This could cause the background reconcile
task to keep kicking off no-op reconcilers forever (harmless but
annoying).
- During shard split, we were scheduling secondary locations for the
child shards, but no reconcile was run for these until the next time the
background reconcile task ran. Creating these ASAP is useful, because
they'll be used shortly after a shard split as the destination locations
for migrating the new shards to different nodes.
## Problem
Storage controller had basically no metrics.
## Summary of changes
1. Migrate the existing metrics to use Conrad's
[`measured`](https://docs.rs/measured/0.0.14/measured/) crate.
2. Add metrics for incoming http requests
3. Add metrics for outgoing http requests to the pageserver
4. Add metrics for outgoing pass through requests to the pageserver
5. Add metrics for database queries
Note that the metrics response for the attachment service does not use
chunked encoding like the rest of the metrics endpoints. Conrad has
kindly extended the crate such that it can now be done. Let's leave it
for a follow-up since the payload shouldn't be that big at this point.
Fixes https://github.com/neondatabase/neon/issues/6875
- Enable debug logs for this test
- Add some debug logging detail in downloader.rs
- Add an info-level message in scheduler.rs that makes it obvious if a
command is waiting for an existing task rather than spawning a new one.
Warm-up (and the "tenant startup complete" metric update) happens in
a background tokio task. The tenant map is eagerly updated (can happen
before the task finishes).
The test assumed that if the tenant map was updated, then the metric
should reflect that. That's not the case, so we tweak the test to wait
for the metric.
Fixes https://github.com/neondatabase/neon/issues/7158
This is a mixed bag of changes split out for separate review while
working on other things, and batched together to reduce load on CI
runners. Each commits stands alone for review purposes:
- do_tenant_shard_split was a long function and had a synchronous
validation phase at the start that could readily be pulled out into a
separate function. This also avoids the special casing of
ApiError::BadRequest when deciding whether an abort is needed on errors
- Add a 'describe' API (GET on tenant ID) that will enable storcon-cli
to see what's going on with a tenant
- the 'locate' API wasn't really meant for use in the field. It's for
tests: demote it to the /debug/ prefix
- The `Single` placement policy was a redundant duplicate of Double(0),
and Double was a bad name. Rename it Attached.
(https://github.com/neondatabase/neon/issues/7107)
- Some neon_local commands were added for debug/demos, which are now
replaced by commands in storcon-cli (#7114 ). Even though that's not
merged yet, we don't need the neon_local ones any more.
Closes https://github.com/neondatabase/neon/issues/7107
## Backward compat of Single/Double -> `Attached(n)` change
A database migration is used to convert any existing values.
## Problem
As with the pageserver, we should fail tests that emit unexpected log
errors/warnings.
## Summary of changes
- Refactor existing log checks to be reusable
- Run log checks for attachment_service
- Add allow lists as needed.
Add shard_number to PageserverFeedback and parse it on the compute side.
When compute receives a new ps_feedback, it calculates min LSNs among
feedbacks from all shards, and uses those LSNs for backpressure.
Add `test_sharding_backpressure` to verify that backpressure slows down
compute to wait for the slowest shard.
## Problem
The existing secondary download API relied on the caller to wait as long
as it took to complete -- for large shards that could be a long time, so
typical clients that might have a baked-in ~30s timeout would have a
problem.
## Summary of changes
- Take a `wait_ms` query parameter to instruct the pageserver how long
to wait: if the download isn't complete in this duration, then 201 is
returned instead of 200.
- For both 200 and 201 responses, include response body describing
download progress, in terms of layers and bytes. This is sufficient for
the caller to track how much data is being transferred and log/present
that status.
- In storage controller live migrations, use this API to apply a much
longer outer timeout, with smaller individual per-request timeouts, and
log the progress of the downloads.
- Add a test that injects layer download delays to exercise the new
behavior
## Problem
Tenant deletion had a couple of TODOs where we weren't using proper
cancellation tokens that would have aborted the deletions during process
shutdown.
## Summary of changes
- Refactor enough that deletion/shutdown code has access to the
TenantManager's cancellation toke
- Use that cancellation token in tenant deletion instead of dummy
tokens.
fixes https://github.com/neondatabase/neon/issues/7116
Changes:
- refactor PageServerConfigBuilder: support not-set values
- implement runtime feature test
- use runtime feature test to determine `virtual_file_io_engine` if not
explicitly configured in the config
- log the effective engine at startup
- drive-by: improve assertion messages in `test_pageserver_init_node_id`
This needed a tiny bit of tokio-epoll-uring work, hence bumping it.
Changelog:
```
git log --no-decorate --oneline --reverse 868d2c42b5d54ca82fead6e8f2f233b69a540d3e..342ddd197a060a8354e8f11f4d12994419fff939
c7a74c6 Bump mio from 0.8.8 to 0.8.11
4df3466 Bump mio from 0.8.8 to 0.8.11 (#47)
342ddd1 lifecycle: expose `LaunchResult` enum (#49)
```
## Problem
See:
- https://github.com/neondatabase/neon/issues/6374
## Summary of changes
Whereas previously we calculated synthetic size from the gc_horizon or
the pitr_interval (whichever is the lower LSN), now we ignore gc_horizon
and exclusively start from the `pitr_interval`. This is a more generous
calculation for billing, where we do not charge users for data retained
due to gc_horizon.
Switched the order; doing https://github.com/neondatabase/neon/pull/6139
first then can remove uninit marker after.
## Problem
Previously, existence of a timeline directory was treated as evidence of
the timeline's logical existence. That is no longer the case since we
treat remote storage as the source of truth on each startup: we can
therefore do without this mark file.
The mark file had also been used as a pseudo-lock to guard against
concurrent creations of the same TimelineId -- now that persistence is
no longer required, this is a bit unwieldy.
In #6139 the `Tenant::timelines_creating` was added to protect against
concurrent creations on the same TimelineId, making the uninit mark file
entirely redundant.
## Summary of changes
- Code that writes & reads mark file is removed
- Some nearby `pub` definitions are amended to `pub(crate)`
- `test_duplicate_creation` is added to demonstrate that mutual
exclusion of creations still works.
## Problem
We have no regression tests for websocket flow
## Summary of changes
Add a hacky implementation of the postgres protocol over websockets just
to verify the protocol behaviour does not regress over time.
## Problem
If a pageserver was offline when the storage controller started, there
was no mechanism to update the
storage controller state when the pageserver becomes active.
## Summary of changes
* Add a heartbeater module. The heartbeater must be driven by an
external loop.
* Integrate the heartbeater into the service.
- Extend the types used by the service and scheduler to keep track of a
nodes' utilisation score.
- Add a background loop to drive the heartbeater and update the state
based on the deltas it generated
- Do an initial round of heartbeats at start-up
## Problem
Shard splits worked, but weren't safe against failures (e.g. node crash
during split) yet.
Related: #6676
## Summary of changes
- Introduce async rwlocks at the scope of Tenant and Node:
- exclusive tenant lock is used to protect splits
- exclusive node lock is used to protect new reconciliation process that
happens when setting node active
- exclusive locks used in both cases when doing persistent updates (e.g.
node scheduling conf) where the update to DB & in-memory state needs to
be atomic.
- Add failpoints to shard splitting in control plane and pageserver
code.
- Implement error handling in control plane for shard splits: this
detaches child chards and ensures parent shards are re-attached.
- Crash-safety for storage controller restarts requires little effort:
we already reconcile with nodes over a storage controller restart, so as
long as we reset any incomplete splits in the DB on restart (added in
this PR), things are implicitly cleaned up.
- Implement reconciliation with offline nodes before they transition to
active:
- (in this context reconciliation means something like
startup_reconcile, not literally the Reconciler)
- This covers cases where split abort cannot reach a node to clean it
up: the cleanup will eventually happen when the node is marked active,
as part of reconciliation.
- This also covers the case where a node was unavailable when the
storage controller started, but becomes available later: previously this
allowed it to skip the startup reconcile.
- Storage controller now terminates on panics. We only use panics for
true "should never happen" assertions, and these cases can leave us in
an un-usable state if we keep running (e.g. panicking in a shard split).
In the unlikely event that we get into a crashloop as a result, we'll
rely on kubernetes to back us off.
- Add `test_sharding_split_failures` which exercises a variety of
failure cases during shard split.
## Problem
hyper auto-cancels the request futures on connection close.
`sql_over_http::handle` is not 'drop cancel safe', so we need to do some
other work to make sure connections are queries in the right way.
## Summary of changes
1. tokio::spawn the request handler to resolve the initial cancel-safety
issue
2. share a cancellation token, and cancel it when the request `Service`
is dropped.
3. Add a new log span to be able to track the HTTP connection lifecycle.
## Summary
- Currently we can set stripe size at tenant creation, but it doesn't
mean anything until we have multiple shards
- When onboarding an existing tenant, it will always get a default shard
stripe size, so we would like to be able to pick the actual stripe size
at the point we split.
## Why do this inline with a split?
The alternative to this change would be to have a separate endpoint on
the storage controller for setting the stripe size on a tenant, and only
permit writes to that endpoint when the tenant has only a single shard.
That would work, but be a little bit more work for a client, and not
appreciably simpler (instead of having a special argument to the split
functions, we'd have a special separate endpoint, and a requirement that
the controller must sync its config down to the pageserver before
calling the split API). Either approach would work, but this one feels a
bit more robust end-to-end: the split API is the _very last moment_ that
the stripe size is mutable, so if we aim to set it before splitting, it
makes sense to do it as part of the same operation.
## Problem
Currently we manually register nodes with the storage controller, and
use a script during deploy to register with the cloud control plane.
Rather than extend that script further, nodes should just register on
startup.
## Summary of changes
- Extend the re-attach request to include an optional
NodeRegisterRequest
- If the `register` field is set, handle it like a normal node
registration before executing the normal re-attach work.
- Update tests/neon_local that used to rely on doing an explicit
register step that could be enabled/disabled.
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
On HTTP query timeout, we should try and cancel the current in-flight
SQL query.
## Summary of changes
Trigger a cancellation command in postgres once the timeout is reach
Not a user-facing change, but can break any existing `.neon` directories
created by neon_local, as the name of the database used by the storage
controller changes.
This PR changes all the locations apart from the path of
`control_plane/attachment_service` (waiting for an opportune moment to
do that one, because it's the most conflict-ish wrt ongoing PRs like
#6676 )
The walproposer pretends to be a walsender in many ways. It has a
WalSnd slot, it claims to be a walsender by calling
MarkPostmasterChildWalSender() etc. But one different to real
walsenders was that the postmaster still treated it as a bgworker
rather than a walsender. The difference is that at shutdown,
walsenders are not killed until the very end, after the checkpointer
process has written the shutdown checkpoint and exited.
As a result, the walproposer always got killed before the shutdown
checkpoint was written, so the shutdown checkpoint never made it to
safekeepers. That's fine in principle, we don't require a clean
shutdown after all. But it also feels a bit silly not to stream the
shutdown checkpoint. It could be useful for initializing hot standby
mode in a read replica, for example.
Change postmaster to treat background workers that have called
MarkPostmasterChildWalSender() as walsenders. That unfortunately
requires another small change in postgres core.
After doing that, walproposers stay alive longer. However, it also
means that the checkpointer will wait for the walproposer to switch to
WALSNDSTATE_STOPPING state, when the checkpointer sends the
PROCSIG_WALSND_INIT_STOPPING signal. We don't have the machinery in
walproposer to receive and handle that signal reliably. Instead, we
mark walproposer as being in WALSNDSTATE_STOPPING always.
In commit 568f91420a, I assumed that shutdown will wait for all the
remaining WAL to be streamed to safekeepers, but before this commit
that was not true, and the test became flaky. This should make it
stable again.
Some tests wrongly assumed that no WAL could have been written between
pg_current_wal_flush_lsn and quick pg stop after it. Fix them by introducing
flush_ep_to_pageserver which first stops the endpoint and then waits till all
committed WAL reaches the pageserver.
In passing extract safekeeper http client to its own module.
## Problem
Before this PR, it was possible that on-demand downloads were started
after `Timeline::shutdown()`.
For example, we have observed a walreceiver-connection-handler-initiated
on-demand download that was started after `Timeline::shutdown()`s final
`task_mgr::shutdown_tasks()` call.
The underlying issue is that `task_mgr::shutdown_tasks()` isn't sticky,
i.e., new tasks can be spawned during or after
`task_mgr::shutdown_tasks()`.
Cc: https://github.com/neondatabase/neon/issues/4175 in lieu of a more
specific issue for task_mgr. We already decided we want to get rid of it
anyways.
Original investigation:
https://neondb.slack.com/archives/C033RQ5SPDH/p1709824952465949
## Changes
- enter gate while downloading
- use timeline cancellation token for cancelling download
thereby, fixes#7054
Entering the gate might also remove recent "kept the gate from closing"
in staging.
## Problem
When we start compute with newer version of extension (i.e. 1.2) and
then rollback the release, downgrading the compute version, next compute
start will try to update extension to the latest version available in
neon.control (i.e. 1.1).
Thus we need to provide downgrade scripts like neon--1.2--1.1.sql
These scripts must revert the changes made by the upgrade scripts in the
reverse order. This is necessary to ensure that the next upgrade will
work correctly.
In general, we need to write upgrade and downgrade scripts to be more
robust and add IF EXISTS / CREATE OR REPLACE clauses to all statements
(where applicable).
## Summary of changes
Adds downgrade scripts.
Adds test cases for extension downgrade/upgrade.
fixes#7066
This is a follow-up for
https://app.incident.io/neondb/incidents/167?tab=follow-ups
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z <iskyzh@gmail.com>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
## Problem
- The storage controller is the source of truth for a tenant's stripe
size, but doesn't currently have a way to propagate that to compute:
we're just using the default stripe size everywhere.
Closes: https://github.com/neondatabase/neon/issues/6903
## Summary of changes
- Include stripe size in `ComputeHookNotifyRequest`
- Include stripe size in `LocationConfigResponse`
The stripe size is optional: it will only be advertised for
multi-sharded tenants. This enables the controller to defer the choice
of stripe size until we split a tenant for the first time.
part of https://github.com/neondatabase/neon/issues/5899
Problem
-------
Before this PR, the time spent waiting on the throttle was charged
towards the higher-level page_service metrics, i.e.,
`pageserver_smgr_query_seconds`.
The metrics are the foundation of internal SLIs / SLOs.
A throttled tenant would cause the SLI to degrade / SLO alerts to fire.
Changes
-------
- don't charge time spent in throttle towards the page_service metrics
- record time spent in throttle in RequestContext and subtract it from
the elapsed time
- this works because the page_service path doesn't create child context,
so, all the throttle time is recorded in the parent
- it's quite brittle and will break if we ever decide to spawn child
tasks that need child RequestContexts, which would have separate
instances of the `micros_spent_throttled` counter.
- however, let's punt that to a more general refactoring of
RequestContext
- add a test case that ensures that
- throttling happens for getpage requests; this aspect of the test
passed before this PR
- throttling delays aren't charged towards the page_service metrics;
this aspect of the test only passes with this PR
- drive-by: make the throttle log message `info!`, it's an expected
condition
Performance
-----------
I took the same measurements as in #6706 , no meaningful change in CPU
overhead.
Future Work
-----------
This PR enables us to experiment with the throttle for select tenants
without affecting the SLI metrics / triggering SLO alerts.
Before declaring this feature done, we need more work to happen,
specifically:
- decide on whether we want to retain the flexibility of throttling any
`Timeline::get` call, filtered by TaskKind
- versus: separate throttles for each page_service endpoint, potentially
with separate config options
- the trouble here is that this decision implies changes to the
TenantConfig, so, if we start using the current config style now, then
decide to switch to a different config, it'll be a breaking change
Nice-to-haves but probably not worth the time right now:
- Equivalent tests to ensure the throttle applies to all other
page_service handlers.
## Problem
- #6966
- Existing logs aren't pointing to a cause: it looks like heatmap upload
and download are happening, but for some reason the evicted layer isn't
removed on the secondary location.
## Summary of changes
- Assert evicted layer is gone from heatmap before checking its gone
from local disk: this will give clarity on whether the issue is with the
uploads or downloads.
- On assertion failures, log the contents of heatmap.
During onboarding, the control plane may attempt ad-hoc creation of a
secondary location to facilitate live migration. This gives us two
problems to solve:
- Accept 'Secondary' mode in /location_config and use it to put the
tenant into secondary mode on some physical pageserver, then pass
through /tenant/xyz/secondary/download requests
- Create tenants with no generation initially, since the initial
`Secondary` mode call will not provide us a generation.
This PR also fixes modification of a tenant's TenantConf during
/location_conf, which was previously ignored, and refines the flow for
config modification:
- avoid bumping generations when the only reason we're reconciling an
attached location is a config change
- increment TenantState.sequence when spawning a reconciler: usually
schedule() does this, but when we do config changes that doesn't happen,
so without this change waiters would think reconciliation was done
immediately. `sequence` is a bit of a murky thing right now, as it's
dual-purposed for tracking waiters, and for checking if an existing
reconciliation is already making updates to our current sequence. I'll
follow up at some point to clarify it's purpose.
- test config modification at the end of onboarding test
## Problem
At high ingest rates, pageservers spuriously disconnect from safekeepers
because stats updates don't come in frequently enough to keep the
broker/safekeeper LSN delta under the wal lag limit.
## Summary of changes
- Increase DEFAULT_MAX_WALRECEIVER_LSN_WAL_LAG from 10MiB to 1GiB. This
should be enough for realistic per-timeline throughputs.
## Problem
https://github.com/neondatabase/neon/pull/6661 changed the layer
flushing logic and led to OOMs in staging.
The issue turned out to be holding on to in-memory layers for too long.
After OOMing we'd need to replay potentially
a lot of WAL.
## Summary of changes
Test that open layers get flushed after the `checkpoint_timeout` config
and do not require WAL reingest upon restart.
The workload creates a number of timelines and writes some data to each,
but not enough to trigger flushes via the `checkpoint_distance` config.
I ran this test against https://github.com/neondatabase/neon/pull/6661
and it was indeed failing.
## Problem
PR https://github.com/neondatabase/neon/pull/6851 implemented new output
in PostgreSQL explain.
this is a test case for the new function.
## Summary of changes
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [x] If it is a core feature, I have added thorough tests.
- [no ] Do we need to implement analytics? if so did you add the
relevant metrics to the dashboard?
- [no] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Extracted from https://github.com/neondatabase/neon/pull/6953
Part of https://github.com/neondatabase/neon/issues/5899
Core Change
-----------
In #6953, we need the ability to scan the log _after_ a specific line
and ignore anything before that line.
This PR changes `log_contains` to returns a tuple of `(matching line,
cursor)`.
Hand that cursor to a subsequent `log_contains` call to search the log
for the next occurrence of the pattern.
Other Changes
-------------
- Inspect all the callsites of `log_contains` to handle the new tuple
return type.
- Above inspection unveiled many callers aren't using `assert
log_contains(...) is not None` but some weaker version of the code that
breaks if `log_contains` ever returns a not-None but falsy value. Fix
that.
- Above changes unveiled that `test_remote_storage_upload_queue_retries`
was using `wait_until` incorrectly; after fixing the usage, I had to
raise the `wait_until` timeout. So, maybe this will fix its flakiness.
## Summary of changes
Calculate number of unique page accesses at compute.
It can be used to estimate working set size and adjust cache size
(shared_buffers or local file cache).
Approximation is made using HyperLogLog algorithm.
It is performed by local file cache and so is available only when local
file cache is enabled.
This calculation doesn't take in account access to the pages present in
shared buffers, but includes pages available in local file cache.
This information can be retrieved using
approximate_working_set_size(reset bool) function from neon extension.
reset parameter can be used to reset statistic and so collect unique
accesses for the particular interval.
Below is an example of estimating working set size after pgbench -c 10
-S -T 100 -s 10:
```
postgres=# select approximate_working_set_size(false);
approximate_working_set_size
------------------------------
19052
(1 row)
postgres=# select pg_table_size('pgbench_accounts')/8192;
?column?
----------
16402
(1 row)
```
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Add off-by-default support for lazy queued tenant activation on attach.
This should be useful on bulk migrations as some tenants will be
activated faster due to operations or endpoint startup. Eventually all
tenants will get activated by reusing the same mechanism we have at
startup (`PageserverConf::concurrent_tenant_warmup`).
The difference to lazy attached tenants to startup ones is that we leave
their initial logical size calculation be triggered by WalReceiver or
consumption metrics.
Fixes: #6315
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
Starting up the pageserver before the storage controller is ready can
lead
to a round of reconciliation, which leads to the previous tenant being
shut down.
This disturbs some tests.
## Summary of changes
Wait for the storage controller to become ready on neon env start-up.
Closes https://github.com/neondatabase/neon/issues/6724
## Problem
After commit [840abe3954] (store AUX files
as deltas) we avoid quadratic growth of storage size when storing LR
snapshots but get quadratic slowdown of reconstruct time.
As a result storing 70k snapshots at my local Neon instance took more
than 3 hours and starting node (creation of basecbackup): ~10 minutes.
In prod 70k AUX files cause increase of startup time to 40 minutes:
https://neondb.slack.com/archives/C03F5SM1N02/p1708513010480179
## Summary of changes
Enforce storing full AUX directory (some analog of FPI) each 1024 files.
Time of creation 70k snapshots is reduced to 6 minutes and startup time
- to 1.5 minutes (100 seconds).
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Rebased version of #5234, part of #6768
This consists of three parts:
1. A refactoring and new contract for implementing and testing
compaction.
The logic is now in a separate crate, with no dependency on the
'pageserver' crate. It defines an interface that the real pageserver
must implement, in order to call the compaction algorithm. The interface
models things like delta and image layers, but just the parts that the
compaction algorithm needs to make decisions. That makes it easier unit
test the algorithm and experiment with different implementations.
I did not convert the current code to the new abstraction, however. When
compaction algorithm is set to "Legacy", we just use the old code. It
might be worthwhile to convert the old code to the new abstraction, so
that we can compare the behavior of the new algorithm against the old
one, using the same simulated cases. If we do that, have to be careful
that the converted code really is equivalent to the old.
This inclues only trivial changes to the main pageserver code. All the
new code is behind a tenant config option. So this should be pretty safe
to merge, even if the new implementation is buggy, as long as we don't
enable it.
2. A new compaction algorithm, implemented using the new abstraction.
The new algorithm is tiered compaction. It is inspired by the PoC at PR
#4539, although I did not use that code directly, as I needed the new
implementation to fit the new abstraction. The algorithm here is less
advanced, I did not implement partial image layers, for example. I
wanted to keep it simple on purpose, so that as we add bells and
whistles, we can see the effects using the included simulator.
One difference to #4539 and your typical LSM tree implementations is how
we keep track of the LSM tree levels. This PR doesn't have a permanent
concept of a level, tier or sorted run at all. There are just delta and
image layers. However, when compaction starts, we look at the layers
that exist, and arrange them into levels, depending on their shapes.
That is ephemeral: when the compaction finishes, we forget that
information. This allows the new algorithm to work without any extra
bookkeeping. That makes it easier to transition from the old algorithm
to new, and back again.
There is just a new tenant config option to choose the compaction
algorithm. The default is "Legacy", meaning the current algorithm in
'main'. If you set it to "Tiered", the new algorithm is used.
3. A simulator, which implements the new abstraction.
The simulator can be used to analyze write and storage amplification,
without running a test with the full pageserver. It can also draw an SVG
animation of the simulation, to visualize how layers are created and
deleted.
To run the simulator:
cargo run --bin compaction-simulator run-suite
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Noticed that we are failing to handle `Result::Err` when entering a gate
for logical size calculation. Audited rest of the gate enters, which
seem fine, unified two instances.
Noticed that the gate guard allows to remove a failpoint, then noticed
that adjacent failpoint was blocking the executor thread instead of
using `pausable_failpoint!`, fix both.
eviction_task.rs now maintains a gate guard as well.
Cc: #4733
fixes https://github.com/neondatabase/neon/issues/6889
# Problem
The failure in the last 3 flaky runs on `main` is
```
test_runner/regress/test_remote_storage.py:460: in test_remote_timeline_client_calls_started_metric
churn("a", "b")
test_runner/regress/test_remote_storage.py:457: in churn
assert gc_result["layers_removed"] > 0
E assert 0 > 0
```
That's this code
cd449d66ea/test_runner/regress/test_remote_storage.py (L448-L460)
So, the test expects GC to remove some layers but the GC doesn't.
# Fix
My impression is that the VACUUM isn't re-using pages aggressively
enough, but I can't really prove that. Tried to analyze the layer map
dump but it's too complex.
So, this PR:
- Creates more churn by doing the overwrite twice.
- Forces image layer creation.
It also drive-by removes the redundant call to timeline_compact,
because, timeline_checkpoint already does that internally.
## Problem
Attachment service does not do auth based on JWT scopes.
## Summary of changes
Do JWT based permission checking for requests coming into the attachment
service.
Requests into the attachment service must use different tokens based on
the endpoint:
* `/control` and `/debug` require `admin` scope
* `/upcall` requires `generations_api` scope
* `/v1/...` requires `pageserverapi` scope
Requests into the pageserver from the attachment service must use
`pageserverapi` scope.
## Problem
Customers should be able to determine the size of their workload's
working set to right size their compute.
Since Neon uses Local file cache (LFC) instead of shared buffers on
bigger compute nodes to cache pages we need to externalize a means to
determine LFC hit ratio in addition to shared buffer hit ratio.
Currently the following end user documentation
fb7cd3af0e/content/docs/manage/endpoints.md (L137)
is wrong because it describes how to right size a compute node based on
shared buffer hit ratio.
Note that the existing functionality in extension "neon" is NOT
available to end users but only to superuser / cloud_admin.
## Summary of changes
- externalize functions and views in neon extension to end users
- introduce a new view `NEON_STAT_FILE_CACHE` with the following DDL
```sql
CREATE OR REPLACE VIEW NEON_STAT_FILE_CACHE AS
WITH lfc_stats AS (
SELECT
stat_name,
count
FROM neon_get_lfc_stats() AS t(stat_name text, count bigint)
),
lfc_values AS (
SELECT
MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE NULL END) AS file_cache_misses,
MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE NULL END) AS file_cache_hits,
MAX(CASE WHEN stat_name = 'file_cache_used' THEN count ELSE NULL END) AS file_cache_used,
MAX(CASE WHEN stat_name = 'file_cache_writes' THEN count ELSE NULL END) AS file_cache_writes,
-- Calculate the file_cache_hit_ratio within the same CTE for simplicity
CASE
WHEN MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE 0 END) + MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END) = 0 THEN NULL
ELSE ROUND((MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END)::DECIMAL /
(MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END) + MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE 0 END))) * 100, 2)
END AS file_cache_hit_ratio
FROM lfc_stats
)
SELECT file_cache_misses, file_cache_hits, file_cache_used, file_cache_writes, file_cache_hit_ratio from lfc_values;
```
This view can be used by an end user as follows:
```sql
CREATE EXTENSION NEON;
SELECT * from neon. NEON_STAT_FILE_CACHE"
```
The output looks like the following:
```
select * from NEON_STAT_FILE_CACHE;
file_cache_misses | file_cache_hits | file_cache_used | file_cache_writes | file_cache_hit_ratio
-------------------+-----------------+-----------------+-------------------+----------------------
2133643 | 108999742 | 607 | 10767410 | 98.08
(1 row)
```
## Checklist before requesting a review
- [x ] I have performed a self-review of my code.
- [x ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [x ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
* decreases checkpointing and compaction targets for even more layer
files
* write 10 thousand rows 2 times instead of writing 20 thousand rows 1
time so that there is more to GC. Before it was noisily jumping between
1 and 0 layer files, now it's jumping between 19 and 20 layer files. The
0 caused an assertion error that gave the test most of its flakiness.
* larger timeout for the churn while failpoints are active thread: this
is mostly so that the test is more robust on systems with more load
Fixes#3051
## Problem
Previously we always wrote out both legacy and modern tenant config
files. The legacy write enabled rollbacks, but we are long past the
point where that is needed.
We still need the legacy format for situations where someone is running
tenants without generations (that will be yanked as well eventually),
but we can avoid writing it out at all if we do have a generation number
set. We implicitly also avoid writing the legacy config if our mode is
Secondary (secondary mode is newer than generations).
## Summary of changes
- Make writing legacy tenant config conditional on there being no
generation number set.
It's been dead-code-at-runtime for 9 months, let's remove it.
We can always re-introduce it at a later point.
Came across this while working on #6861, which will touch
`time_for_new_image_layer`. This is an opporunity to make that function
simpler.
We set it for neon replica, if primary is running.
Postgres uses this GUC at the start,
to determine if replica should wait for
RUNNING_XACTS from primary or not.
Corresponding cloud PR is
https://github.com/neondatabase/cloud/pull/10183
* Add test hot-standby replica startup.
* Extract oldest_running_xid from XlRunningXits WAL records.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>