Add `--walsenders-keep-horizon` argument to safekeeper cmdline. It will
prevent deleting WAL segments from disk if they are needed by the active
START_REPLICATION connection.
This is useful for sharding. Without this option, if one of the shard
falls behind, it starts to read WAL from S3, which is much slower than
disk. This can result in huge shard lagging.
## Problem
Accidentally merged #6852 without this test stability change. The test
as-written could sometimes fail on debug-pg14.
## Summary of changes
- Write more data so that the test can more reliably assert on the ratio
of total layers to small layers
- Skip the test in debug mode, since writing any more than a tiny bit of
data tends to result in a flaky test in the much slower debug
environment.
## Problem
PR #6834 introduced an assertion that the sets of metric labels on
finished operations should equal those on started operations, which is
not true if no operations have finished yet for a particular set of
labels.
## Summary of changes
- Instead of asserting out, wait and re-check in the case that finished
metrics don't match started
## Problem
- Current file has ambiguous ownership for some paths
- The /control_plane/attachment_service is storage specific & updates
there don't need to request reviews from other teams.
## Summary of changes
- Define a single owning team per path, so that we can make reviews by
that team mandatory in future.
- Remove the top-level /control_plane as no one specific team owns
neon_local, and we would rarely see a PR that exclusively touches that
path.
- Add an entry for /control_plane/attachment_service, which is newer
storage-specific code.
The sharding service didn't have support for S3 disaster recovery.
This PR adds a new endpoint to the attachment service, which is slightly
different from the endpoint on the pageserver, in that it takes the
shard count history of the tenant as json parameters: we need to do
time travel recovery for both the shard count at the target time and the
shard count at the current moment in time, as well as the past shard
counts that either still reference.
Fixes#6604, part of https://github.com/neondatabase/cloud/issues/8233
---------
Co-authored-by: John Spray <john@neon.tech>
As noticed in #6836 some occurances of error conversions were missed in
#6697:
- `std::io::Error` popped up by `tokio::io::copy_buf` containing
`DownloadError` was turned into `DownloadError::Other`
- similarly for secondary downloader errors
These changes come at the loss of pathname context.
Cc: #6096
## Problem
Where the stripe size is the same order of magnitude as the checkpoint
distance (such as with default settings), tenant shards can easily pass
through `checkpoint_distance` bytes of LSN without actually ingesting
anything. This results in emitting many tiny L0 delta layers.
## Summary of changes
- Multiply checkpoint distance by shard count before comparing with LSN
distance. This is a heuristic and does not guarantee that we won't emit
small layers, but it fixes the issue for typical cases where the writes
in a (checkpoint_distance * shard_count) range of LSN bytes are somewhat
distributed across shards.
- Add a test that checks the size of layers after ingesting to a sharded
tenant; this fails before the fix.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
## Summary of changes
1. Classify further cplane API errors
2. add 'serviceratelimit' and make a few of the timeout errors return
that.
3. a few additional minor changes
## Problem
During startup_reconcile we do a couple of potentially-slow things:
- Calling out to all nodes to read their locations
- Calling out to the cloud control plane to notify it of all tenants'
attached nodes
The read of node locations was not being done concurrently across nodes,
and neither operation was bounded by a well defined deadline.
## Summary of changes
- Refactor the async parts of startup_reconcile into separate functions
- Add concurrency and deadline to `scan_node_locations`
- Add deadline to `compute_notify_many`
- Run `cleanup_locations` in the background: there's no need for
startup_reconcile to wait for this to complete.
This PR introduces a new vectored implementation of the read path.
The search is basically a DFS if you squint at it long enough.
LayerFringe tracks the next layers to visit and acts as our stack.
Vertices are tuples of (layer, keyspace, lsn range). Continuously
pop the top of the stack (most recent layer) and do all the reads
for one layer at once.
The search maintains a fringe (`LayerFringe`) which tracks all the
layers that intersect the current keyspace being searched. Continuously
pop the top of the fringe (layer with highest LSN) and get all the data
required from the layer in one go.
Said search is done on one timeline at a time. If data is still required for
some keys, then search the ancestor timeline.
Apart from the high level layer traversal, vectored variants have been
introduced for grabbing data from each layer type. They still suffer from
read amplification issues and that will be addressed in a different PR.
You might notice that in some places we duplicate the code for the
existing read path. All of that code will be removed when we switch
the non-vectored read path to proxy into the vectored read path.
In the meantime, we'll have to contend with the extra cruft for the sake
of testing and gentle releasing.
This pull request adds two flags: `--update-catalog true` for `endpoint
create`, and `--create-test-user true` for `endpoint start`. The former
enables catalog updates for neon_superuser permission and many other
things, while the latter adds the user `test` and the database `neondb`
when setting up the database. A combination of these two flags will
create a Postgres similar to the production environment so that it would
be easier for us to test if extensions behave correctly when added to
Neon Postgres.
Example output:
```
❯ cargo neon endpoint start main --create-test-user true
Finished dev [unoptimized + debuginfo] target(s) in 0.22s
Running `target/debug/neon_local endpoint start main --create-test-user true`
Starting existing endpoint main...
Starting postgres node at 'postgresql://cloud_admin@127.0.0.1:55432/postgres'
Also at 'postgresql://user@127.0.0.1:55432/neondb'
```
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
A set of small changes that are too small to open a separate for each.
A notable change is adding `pytest-repeat` plugin, which can help to
ensure that a flaky test is fixed by running such a test several times.
## Summary of changes
- Update Allure from 2.24.0 to 2.27.0
- Update Ruff from 0.1.11 to 0.2.2 (update `[tool.ruff]` section of
`pyproject.toml` for it)
- Install pytest-repeat plugin
587cb705b8
changed the layer rolling logic to more closely obey the
`checkpoint_distance` config. Previously, this test was getting
layers significantly larger than the 8K it was asking for. Now the
payload in the layers is closer to 8K (which means more layers in
total).
Tweak the `checkpoint_distance` to get a number of layers more
reasonable for this test. Note that we still get more layers than
before (~8K vs ~5K).
this is to speed up suspends, see
https://github.com/neondatabase/cloud/issues/10284
## Problem
## Summary of changes
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
refs #6737
# Problem
Before this PR, on-demand downloads weren't measured per tenant_id.
This makes root-cause analysis of latency spikes harder, requiring us to
resort to log scraping for
```
{neon_service="pageserver"} |= `downloading on-demand` |= `$tenant_id`
```
which can be expensive when zooming out in Grafana.
Context: https://neondb.slack.com/archives/C033RQ5SPDH/p1707809037868189
# Solution / Changes
- Remove the calls_started histogram
- I did the dilegence, there are only 2 dashboards using this histogram,
and in fact only one uses it as a histogram, the other just as a
a counter.
- [Link
1](8115b54d9f/neonprod/dashboards/hkXNF7oVz/dashboard-Z31XmM24k.yaml (L1454)):
`Pageserver Thrashing` dashboard, linked from playbook, will fix.
- [Link
2](8115b54d9f/neonprod/dashboards/CEllzAO4z/dashboard-sJqfNFL4k.yaml (L599)):
one of my personal dashboards, unused for a long time, already broken in
other ways, no need to fix.
- replace `pageserver_remote_timeline_client_calls_unfinished` gauge
with a counter pair
- Required `Clone`-able `IntCounterPair`, made the necessary changes in
the `libs/metrics` crate
- fix tests to deal with the fallout
A subsequent PR will remove a timeline-scoped metric to compensate.
Note that we don't need additional global counters for the per-timeline
counters affected by this PR; we can use the `remote_storage` histogram
for those, which, conveniently, also include the secondary-mode
downloads, which aren't covered by the remote timeline client metrics
(should they?).
## Problem
`download_retry` correctly uses a fatal check to avoid retrying forever
on cancellations and NotFound cases. However, `download_layer_file` was
casting all download errors to "Other" in order to attach an
anyhow::Context.
Noticed this issue in the context of secondary downloads, where requests
to download layers that might not exist are issued intentionally, and
this resulted in lots of error spam from retries that shouldn't have
happened.
## Summary of changes
- Remove the `.context()` so that the original DownloadError is visible
to backoff::retry
refs https://github.com/neondatabase/neon/issues/4112
amends https://github.com/neondatabase/neon/pull/6687
Since my last PR #6687 regarding this test, the type of flakiness that
has been observed has shifted to the beginning of the test, where we
create the layers:
```
timed out while waiting for remote_consistent_lsn to reach 0/411A5D8, was 0/411A5A0
```
[Example Allure
Report](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6789/7932503173/index.html#/testresult/ddb877cfa4062f7d)
Analysis
--------
I suspect there was the following race condition:
- endpoints push out some tiny piece of WAL during their
endpoints.stop_all()
- that WAL reaches the SK (it's just one SK according to logs)
- the SKs send it into the walreceiver connection
- the SK gets shut down
- the checkpoint is taken, with last_record_lsn = 0/411A5A0
- the PS's walreceiver_connection_handler processes the WAL that was
sent into the connection by the SKs; this advances
last_record_lsn to 0/411A5D8
- we get current_lsn = 0/411A5D8
- nothing flushes a layer
Changes
-------
There's no testing / debug interface to shut down / server all
walreceiver connections.
So, this PR restarts pageserver to achieve it.
Also, it lifts the "wait for image layer uploads" further up, so that
after this first
restart, the pageserver really does _nothing_ by itself, and so, the
origianl physical size mismatch issue quoted in #6687 should be fixed.
(My initial suspicion hasn't changed that it was due to the tiny chunk
of endpoint.stop_all() WAL being ingested after the second PS restart.)
Before this PR, if remote storage is configured, `load_layer_map`'s call
to `RemoteTimelineClient::schedule_layer_file_deletion` would schedule
an empty UploadOp::Delete for each timeline.
It's jsut CPU overhead, no actual interaction with deletion queue
on-disk state or S3, as far as I can tell.
However, it shows up in the "RemoteTimelineClient calls started
metrics", which I'm refining in an orthogonal PR.
## Problem
Proxy already supported HTTP2, but I expect no one is using it because
we don't advertise it in the TLS handshake.
## Summary of changes
#6335 without the websocket changes.
## Problem
Timeline creation is meant to be very fast: it should only take
approximately on S3 PUT latency. When we have many shards in a tenant,
we should preserve that responsiveness.
## Summary of changes
- Issue create/delete pageserver API calls concurrently across all >0
shards
- During tenant deletion, delete shard zero last, separately, to avoid
confusing anything using GETs on the timeline.
- Return 201 instead of 200 on creations to make cloud control plane
happy
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
We use a bunch of deprecated actions.
See https://github.com/neondatabase/neon/actions/runs/7958569728
(Annotations section)
```
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3, actions/setup-java@v3, actions/cache@v3, actions/github-script@v6. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
```
## Summary of changes
- `actions/cache@v3` -> `actions/cache@v4`
- `actions/checkout@v3` -> `actions/checkout@v4`
- `actions/github-script@v6` -> `actions/github-script@v7`
- `actions/setup-java@v3` -> `actions/setup-java@v4`
- `actions/upload-artifact@v3` -> `actions/upload-artifact@v4`
This PR stacks on https://github.com/neondatabase/neon/pull/6814
Observability:
- Because we only persist a subset of our state, and our external API is
pretty high level, it can be hard to get at the detail of what's going
on internally (e.g. the IntentState of a shard).
- Add debug endpoints for getting a full dump of all TenantState and
SchedulerNode objects
- Enrich the /control/v1/node listing endpoint to include full in-memory
detail of `Node` rather than just the `NodePersistence` subset
Consistency checks:
- The storage controller maintains separate in-memory and on-disk
states, by design. To catch subtle bugs, it is useful to occasionally
cross-check these.
- The Scheduler maintains reference counts for shard->node
relationships, which could drift if there was a bug in IntentState:
exhausively cross check them in tests.
## Problem
When investigating test failures
(https://github.com/neondatabase/neon/issues/6813) I noticed we were
doing a bunch of Reconciler runs right after splitting a tenant.
It's because the splitting test does a pageserver restart, and there was
a bug in /re-attach handling, where we would update the generation
correctly in the database and intent state, but not observed state,
thereby triggering a reconciliation on the next call to maybe_reconcile.
This didn't break anything profound (underlying rules about generations
were respected), but caused the storage controller to do an un-needed
extra round of bumping the generation and reconciling.
## Summary of changes
- Start adding metrics to the storage controller
- Assert on the number of reconciles done in test_sharding_split_smoke
- Fix /re-attach to update `observed` such that we don't spuriously
re-reconcile tenants.
Often times the tenants we want to (WAL) DR are the ones which the
pageserver marks as broken. Therefore, we should allow initdb
preservation also for broken tenants.
Fixes#6781.
## Problem
Sharded tenants could panic during compaction when they try to generate
an L1 delta layer for a region that contains no keys on a particular
shard.
This is a variant of https://github.com/neondatabase/neon/issues/6755,
where we attempt to save a delta layer with no keys. It is harder to
reproduce than the case of image layers fixed in
https://github.com/neondatabase/neon/pull/6776.
It will become even less likely once
https://github.com/neondatabase/neon/pull/6778 tweaks keyspace
generation, but even then, we should not rely on keyspace partitioning
to guarantee at least one stored key in each partition.
## Summary of changes
- Move construction of `writer` in `compact_level0_phase1`, so that we
never leave a writer constructed but without any keys.
## Problem
One of the major shortcuts in the initial version of this code was to
construct a fresh `Scheduler` each time we need it, which is an O(N^2)
cost as the tenant count increases.
## Summary of changes
- Keep `Scheduler` alive through the lifetime of ServiceState
- Use `IntentState` as a reference tracking helper, updating Scheduler
refcounts as nodes are added/removed from the intent.
There is an automated test that checks things don't get pathologically
slow with thousands of shards, but it's not included in this PR because
tests that implicitly test the runner node performance take some thought
to stabilize/land in CI.
## Problem
Secondary mode locations keep a local copy of the heatmap, which needs
cleaning up during deletion.
Closes: https://github.com/neondatabase/neon/issues/6802
## Summary of changes
- Extend test_live_migration to reproduce the issue
- Remove heatmap-v1.json during tenant deletion
1) `scram::password` was used in tests only. can be replaced with
`postgres_protocol::password`.
2) `postgres_protocol::authentication::sasl` provides a client impl of
SASL which improves our ability to test
## Problem
One WAL record can actually produce an arbitrary amount of key value pairs.
This is problematic since it might cause our frozen layers to bloat past the
max allowed size of S3 single shot uploads.
[#6639](https://github.com/neondatabase/neon/pull/6639) introduced a "should roll"
check after every batch of `ingest_batch_size` (100 WAL records by default). This helps,
but the original problem still exists.
## Summary of changes
This patch moves the responsibility of rolling the currently open layer
to the `TimelineWriter`. Previously, this was done ad-hoc via calls
to `check_checkpoint_distance`. The advantages of this approach are:
* ability to split one batch over multiple open layers
* less layer map locking
* remove ad-hoc check_checkpoint_distance calls
More specifically, we track the current size of the open layer in the
writer. On each `put` check whether the current layer should be closed
and a new one opened. Keeping track of the currently open layer results
in less contention on the layer map lock. It only needs to be acquired
on the first write and on writes that require a roll afterwards.
Rolling the open layer can be triggered by:
1. The distance from the last LSN we rolled at. This bounds the amount
of WAL that the safekeepers need to store.
2. The size of the currently open layer.
3. The time since the last roll. It helps safekeepers to regard
pageserver as caught up and suspend activity.
Closes#6624
## Problem
test_sharding_split_unsharded was flaky with log errors from tenants not
being active. This was happening when the split function enters
wait_lsn() while the child shard might still be activating. It's flaky
rather than an outright failure because activation is usually very fast.
This is also a real bug fix, because in realistic scenarios we could
proceed to detach the parent shard before the children are ready,
leading to an availability gap for clients.
## Summary of changes
- Do a short wait_to_become_active on the child shards before proceeding
to wait for their LSNs to advance
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
`test_create_snapshot` is flaky[0] on CI and fails constantly on macOS,
but with a slightly different error:
```
shutil.Error: [('/Users/bayandin/work/neon/test_output/test_create_snapshot[release-pg15-1-100]/repo/endpoints/ep-1/pgdata/pg_dynshmem', '/Users/bayandin/work/neon/test_output/compatibility_snapshot_pgv15/repo/endpoints/ep-1/pgdata/pg_dynshmem', "[Errno 2] No such file or directory: '/Users/bayandin/work/neon/test_output/test_create_snapshot[release-pg15-1-100]/repo/endpoints/ep-1/pgdata/pg_dynshmem'")]
```
Also (on macOS) `repo/endpoints/ep-1/pgdata/pg_dynshmem` is a symlink
to `/dev/shm/`.
- [0] https://github.com/neondatabase/neon/issues/6784
## Summary of changes
Ignore `pg_dynshmem` directory while copying a snapshot
## Problem
Sharded tenants would sometimes try to write empty image layers during
compaction: this was more noticeable on larger databases.
- https://github.com/neondatabase/neon/issues/6755
**Note to reviewers: the last commit is a refactor that de-intents a
whole block, I recommend reviewing the earlier commits one by one to see
the real changes**
## Summary of changes
- Fix a case where when we drop a key during compaction, we might fail
to write out keys (this was broken when vectored get was added)
- If an image layer is empty, then do not try and write it out, but
leave `start` where it is so that if the subsequent key range meets
criteria for writing an image layer, we will extend its key range to
cover the empty area.
- Add a compaction test that configures small layers and compaction
thresholds, and asserts that we really successfully did image layer
generation. This fails before the fix.
## Problem
See https://github.com/neondatabase/cloud/issues/10268
## Summary of changes
Add pg_ivm extension
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
This reverts commit 9ad940086c.
This pull request reverts #6733 to avoid incompatibility with pgvector
and I will push further fixes later. Note that after reverting this pull
request, the postgres submodule will point to some detached branches.
## Problem
Even if you're not enforcing auth, the JwtAuth middleware barfs on
scopes it doesn't know about.
Add `generations_api` scope, which was invented in the cloud control
plane for the pageserver's /re-attach and /validate upcalls: this will
be enforced in storage controller's implementation of these in a later
PR.
Unfortunately the scope's naming doesn't match the other scope's naming
styles, so needs a manual serde decorator to give it an underscore.
## Summary of changes
- Add `Scope::GenerationsApi` variant
- Update pageserver + safekeeper auth code to print appropriate message
if they see it.
## Problem
`test_pageserver_max_throughput_getpage_at_latest_lsn` is flaky which
makes CI status red pretty frequently. `benchmarks` is not a blocking
job (doesn't block `deploy`), so having it red might hide failures in
other jobs
Ref: https://github.com/neondatabase/neon/issues/6724
## Summary of changes
- Disable `test_pageserver_max_throughput_getpage_at_latest_lsn` on CI
until it fixed
## Problem
Now that the storage controller is working end to end, we start burning
down the robustness aspects.
## Summary of changes
- Add a background task that periodically calls `reconcile_all`. This
ensures that if earlier operations couldn't succeed (e.g. because a node
was unavailable), we will eventually retry. This is a naive initial
implementation can start an unlimited number of reconcile tasks:
limiting reconcile concurrency is a later item in #6342
- Add a number of tracing spans in key locations: each background task,
each reconciler task.
- Add a top level CancellationToken and Gate, and use these to implement
a graceful shutdown that waits for tasks to shut down. This is not
bulletproof yet, because within these tasks we have remote HTTP calls
that aren't wrapped in cancellation/timeouts, but it creates the
structure, and if we don't shutdown promptly then k8s will kill us.
- To protect shard splits from background reconciliation, expose the `SplitState`
in memory and use it to guard any APIs that require an attached tenant.