## Problem
One WAL record can actually produce an arbitrary amount of key value pairs.
This is problematic since it might cause our frozen layers to bloat past the
max allowed size of S3 single shot uploads.
[#6639](https://github.com/neondatabase/neon/pull/6639) introduced a "should roll"
check after every batch of `ingest_batch_size` (100 WAL records by default). This helps,
but the original problem still exists.
## Summary of changes
This patch moves the responsibility of rolling the currently open layer
to the `TimelineWriter`. Previously, this was done ad-hoc via calls
to `check_checkpoint_distance`. The advantages of this approach are:
* ability to split one batch over multiple open layers
* less layer map locking
* remove ad-hoc check_checkpoint_distance calls
More specifically, we track the current size of the open layer in the
writer. On each `put` check whether the current layer should be closed
and a new one opened. Keeping track of the currently open layer results
in less contention on the layer map lock. It only needs to be acquired
on the first write and on writes that require a roll afterwards.
Rolling the open layer can be triggered by:
1. The distance from the last LSN we rolled at. This bounds the amount
of WAL that the safekeepers need to store.
2. The size of the currently open layer.
3. The time since the last roll. It helps safekeepers to regard
pageserver as caught up and suspend activity.
Closes#6624
## Problem
`test_create_snapshot` is flaky[0] on CI and fails constantly on macOS,
but with a slightly different error:
```
shutil.Error: [('/Users/bayandin/work/neon/test_output/test_create_snapshot[release-pg15-1-100]/repo/endpoints/ep-1/pgdata/pg_dynshmem', '/Users/bayandin/work/neon/test_output/compatibility_snapshot_pgv15/repo/endpoints/ep-1/pgdata/pg_dynshmem', "[Errno 2] No such file or directory: '/Users/bayandin/work/neon/test_output/test_create_snapshot[release-pg15-1-100]/repo/endpoints/ep-1/pgdata/pg_dynshmem'")]
```
Also (on macOS) `repo/endpoints/ep-1/pgdata/pg_dynshmem` is a symlink
to `/dev/shm/`.
- [0] https://github.com/neondatabase/neon/issues/6784
## Summary of changes
Ignore `pg_dynshmem` directory while copying a snapshot
## Problem
Even if you're not enforcing auth, the JwtAuth middleware barfs on
scopes it doesn't know about.
Add `generations_api` scope, which was invented in the cloud control
plane for the pageserver's /re-attach and /validate upcalls: this will
be enforced in storage controller's implementation of these in a later
PR.
Unfortunately the scope's naming doesn't match the other scope's naming
styles, so needs a manual serde decorator to give it an underscore.
## Summary of changes
- Add `Scope::GenerationsApi` variant
- Update pageserver + safekeeper auth code to print appropriate message
if they see it.
## Problem
`test_pageserver_max_throughput_getpage_at_latest_lsn` is flaky which
makes CI status red pretty frequently. `benchmarks` is not a blocking
job (doesn't block `deploy`), so having it red might hide failures in
other jobs
Ref: https://github.com/neondatabase/neon/issues/6724
## Summary of changes
- Disable `test_pageserver_max_throughput_getpage_at_latest_lsn` on CI
until it fixed
## Problem
Now that the storage controller is working end to end, we start burning
down the robustness aspects.
## Summary of changes
- Add a background task that periodically calls `reconcile_all`. This
ensures that if earlier operations couldn't succeed (e.g. because a node
was unavailable), we will eventually retry. This is a naive initial
implementation can start an unlimited number of reconcile tasks:
limiting reconcile concurrency is a later item in #6342
- Add a number of tracing spans in key locations: each background task,
each reconciler task.
- Add a top level CancellationToken and Gate, and use these to implement
a graceful shutdown that waits for tasks to shut down. This is not
bulletproof yet, because within these tasks we have remote HTTP calls
that aren't wrapped in cancellation/timeouts, but it creates the
structure, and if we don't shutdown promptly then k8s will kill us.
- To protect shard splits from background reconciliation, expose the `SplitState`
in memory and use it to guard any APIs that require an attached tenant.
## Problem
Currently, we don't store `PLATFORM` for Nightly Benchmarks. It
causes them to be merged as reruns in Allure report (because they have
the same test name).
## Summary of changes
- Parametrize benchmarks by
- Postgres Version (14/15/16)
- Build Type (debug/release/remote)
- PLATFORM (neon-staging/github-actions-selfhosted/...)
---------
Co-authored-by: Bodobolero <peterbendel@neon.tech>
## Problem
- We weren't deleting parent shard contents once the split was done
- Re-downloading layers into child shards is wasteful
## Summary of changes
- Hard-link layers into child chart local storage during split
- Delete parent shards content at the end
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
Building on #5875 to add handy test functions for autoscaling.
Resolves#5609
## Summary of changes
This PR makes the following changes to #5875:
- Enable `neon_test_utils` extension in the compute node docker image,
so we could use it in the e2e tests (as discussed with @kelvich).
- Removed test functions related to disk as we don't use them for
autoscaling.
- Fix the warning with printf-ing unsigned long variables.
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
There is O(n^2) issues due to how we store these directories (#6626), so
it's good to keep an eye on them and ensure the numbers stay low.
The new per-timeline metric `pageserver_directory_entries_count`
isn't perfect, namely we don't calculate it every time we attach
the timeline, but only if there is an actual change.
Also, it is a collective metric over multiple scalars. Lastly,
we only emit the metric if it is above a certain threshold.
However, the metric still give a feel for the general size of the timeline.
We care less for small values as the metric is mainly there to
detect and track tenants with large directory counts.
We also expose the directory counts in `TimelineInfo` so that one can
get the detailed size distribution directly via the pageserver's API.
Related: #6642 , https://github.com/neondatabase/cloud/issues/10273
## Problem
Test sometimes fails with `used_blocks > total_blocks`, because when
using mocked statvfs with the total blocks set to the size of data on
disk before starting, we are implicitly asserting that nothing at all
can be written to disk between startup and calling statvfs.
Related: https://github.com/neondatabase/neon/issues/6511
## Summary of changes
- Use HTTP API to invoke disk usage eviction instead of mocked statvfs
In #6079 it was found that there is no test that executes the scrubber.
We now add such a test, which does the following things:
* create a tenant, write some data
* run the scrubber
* remove the tenant
* run the scrubber again
Each time, the scrubber runs the scan-metadata command. Before #6079 we
would have errored, now we don't.
Fixes#6080
VanillaPostgres constructor prints the "port={port}" line to the
config file, no need to do it in the callers.
The TODO comment that it would be nice if VanillaPostgres could pick
the port by itself is still valid though.
Commit 9a6c0be823 removed the code that printed these warnings:
marking {} as locally complete, while it doesnt exist in remote index
No timelines to attach received
Remove those warnings from all the allowlists in tests.
## Problem
When debugging/supporting this service, we sometimes need it to just
forget about a tenant or node, e.g. because of an issue cleanly tearing
them down. For example, if I create a tenant with a PlacementPolicy that
can't be scheduled on the nodes we have, we would never be able to
schedule it for a DELETE to work.
## Summary of changes
- Add APIs for dropping nodes and tenants that do no teardown other than
removing the entity from the DB and removing any references to it.
Turn the warning into an error, if there is garbage after the end of
imported tar file. However, it's normal for 'tar' to append extra
empty blocks to the end, so tolerate those without warnings or errors.
## Problem
usernames and passwords can be URL 'percent' encoded in the connection
string URL provided by serverless driver.
## Summary of changes
Decode the parameters when getting conn info
The test was supposed to reproduce the bug fixed in commit 66fa176cc8,
i.e. that the clearing of the VM bit was not replayed in the
pageserver on HEAP_LOCK records. But it was broken in many ways and
failed to reproduce the original problem if you reverted the fix:
- The comparison of XIDs was broken. The test read the XID in to a
variable in python, but it was treated as a string rather than an
integer. As a result, e.g. "999" > "1000".
- The test accessed the locked tuple too early, in the loop. Accessing
it early, before the pg_xact page had been removed, set the hint bits.
That masked the problem on subsequent accesses.
- The on-demand SLRU download that was introduced in commit 9a9d9beaee
hid the issue. Even though an SLRU segment was removed by Postgres,
when it later tried to access it, it could still download it from
the pageserver. To ensure that doesn't happen, shorten the GC period
and compact and GC aggressively in the test.
I also added a more direct check that the VM page is updated, using
the get_page_at_lsn() debugging function. Right after locking the row,
we now fetch the VM page from pageserver and directly compare it with
the VM page in the page cache. They should match. That assertion is
more robust to things like on-demand SLRU download that could mask the
bug.
In neon_local, the default mode is now always 'fast', regardless of
'destroy'. You can override it with the "neon_local endpoint stop
--mode=immediate" flag.
In python tests, we still default to 'immediate' mode when using the
stop_and_destroy() function, and 'fast' with plain stop(). I kept that
to avoid changing behavior in existing tests. I don't think existing
tests depend on it, but I wasn't 100% certain.
## Problem
This test was a subset of the larger sharding test, and it missed the
validate() call on workload that was implicitly waiting for a tenant to
become active before trying to split it. It could therefore fail to
split due to tenant not yet being active.
## Summary of changes
- Insert .validate() call, and move the Workload setup to after the
check of shard ID (as the shard ID check should pass immediately)
## Problem
Drizzle needs to be able to configure the array_mode flag per query.
## Summary of changes
Adds an array_mode flag to the query data json that will otherwise
default to the header flag.
## Problem
Previous test started with a new-style TenantShardId with a non-zero
ShardCount. We also need to handle the case of a ShardCount() (aka
`unsharded`) parent shard.
**A followup PR will refactor ShardCount to make its inner value private
and thereby make this kind of mistake harder**
## Summary of changes
- Fix a place we were incorrectly treating a ShardCount as a number of
shards rather than as thing that can be zero or the number of shards.
- Add a test for this case.
This test occasionally fails with a difference in "pg_xact/0000" file
between the local and restored datadirs. My hypothesis is that something
changed in the database between the last explicit checkpoint and the
shutdown. I suspect autovacuum, it could certainly create transactions.
To fix, be more precise about the point in time that we compare. Shut
down the endpoint first, then read the last LSN (i.e. the shutdown
checkpoint's LSN), from the local disk with pg_controldata. And use
exactly that LSN in the basebackup.
Closes#559.
I'm proposing this as an alternative to
https://github.com/neondatabase/neon/pull/6662.
- Automatically set a node's availability to Active if it is responsive
in startup_reconcile
- Impose a 5s timeout of HTTP request to list location conf, so that an
unresponsive node can't hang it for minutes
- Do several retries if the request fails with a retryable error, to be
tolerant of concurrent pageserver & storage controller restarts
- Add a readiness hook for use with k8s so that we can tell when the
startup reconciliaton is done and the service is fully ready to do work.
- Add /metrics to the list of un-authenticated endpoints (this is
unrelated but we're touching the line in this PR already, and it fixes
auth error spam in deployed container.)
- A test for the above.
Closes: #6670
## Problem
One doesn't know at tenant creation time how large the tenant will grow.
We need to be able to dynamically adjust the shard count at runtime.
This is implemented as "splitting" of shards into smaller child shards,
which cover a subset of the keyspace that the parent covered.
Refer to RFC: https://github.com/neondatabase/neon/pull/6358
Part of epic: #6278
## Summary of changes
This PR implements the happy path (does not cleanly recover from a crash
mid-split, although won't lose any data), without any optimizations
(e.g. child shards re-download their own copies of layers that the
parent shard already had on local disk)
- Add `/v1/tenant/:tenant_shard_id/shard_split` API to pageserver: this
copies the shard's index to the child shards' paths, instantiates child
`Tenant` object, and tears down parent `Tenant` object.
- Add `splitting` column to `tenant_shards` table. This is written into
an existing migration because we haven't deployed yet, so don't need to
cleanly upgrade.
- Add `/control/v1/tenant/:tenant_id/shard_split` API to
attachment_service,
- Add `test_sharding_split_smoke` test. This covers the happy path:
future PRs will add tests that exercise failure cases.
## Problem
The password check logic for the sql-over-http is a bit non-intuitive.
## Summary of changes
1. Perform scram auth using the same logic as for websocket cleartext
password.
2. Split establish connection logic and connection pool.
3. Parallelize param parsing logic with authentication + wake compute.
4. Limit the total number of clients
## Problem
See
https://neondb.slack.com/archives/C04DGM6SMTM/p1707149618314539?thread_ts=1707081520.140049&cid=C04DGM6SMTM
## Summary of changes
Perform checkpoint check after processing `ingest_batch_size` (default
100) WAL records.
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Fix several test flakes:
- test_sharding_service_smoke had log failures on "Dropped LSN updates"
- test_emergency_mode had log failures on a deletion queue shutdown
check, where the check was incorrect because it was expecting channel
receiver to stay alive after cancellation token was fired.
- test_secondary_mode_eviction had racing heatmap uploads because the
test was using a live migration hook to set up locations, where that
migration was itself uploading heatmaps and generally making the
situation more complex than it needed to be.
These are the failure modes that I saw when spot checking the last few
failures of each test.
This will mostly/completely address #6511, but I'll leave that ticket
open for a couple days and then check if either of the tests named in
that ticket are flaky.
Related #6511
Cleanups from https://github.com/neondatabase/neon/pull/6394
- There was a rogue `*` breaking the `GET /tenant/:tenant_id`, which
passes through to shard zero
- There was a duplicate migrate endpoint
- There are un-prefixed API endpoints that were only needed for compat
tests and can now be removed.
Before tenant migration it made sense to leak broken tenants in the
metrics until restart. Nowdays it makes less sense because on
cancellations we set the tenant broken. The set metric still allows
filterable alerting.
Fixes: #6507
I was on-call this week, these would had made me understand more/faster
of the system:
- move stray attaching start logging inside the span it starts, add
generation
- log ancestor timeline_id or bootstrapping in the beginning of timeline
creation
## Problem
I didn't know about `wait_until` and was relying on `sleep` to wait for
stuff. This caused some tests to be flaky.
https://github.com/neondatabase/neon/issues/6561
## Summary of changes
Switch to `wait_until`, this should make it tests less flaky
At the end of ApplyRecord(), we called pfree on the decoded record, if
it was "oversized". However, we had alread linked it to the "decode
queue" list in XLogReaderState. If we later called XLogBeginRead(), it
called ResetDecoder and tried to free the same record again.
The conditions to hit this are:
- a large WAL record (larger than aboue 64 kB I think, per
DEFAULT_DECODE_BUFFER_SIZE), and
- another WAL record processed by the same WAL redo process after the
large one.
I think the reason we haven't seen this earlier is that you don't get
WAL records that large that are sent to the WAL redo process, except
when logical replication is enabled. Logical replication adds data to
the WAL records, making them larger.
To fix, allocate the buffer ourselves, and don't link it to the decode
queue. Alternatively, we could perhaps have just removed the pfree(),
but frankly I'm a bit scared about the whole queue thing.
## Problem
When we change which physical pageservers a tenant is attached to, we
must update the control plane so that it can update computes. This will
be done via an HTTP hook, as described in
https://www.notion.so/neondatabase/Sharding-Service-Control-Plane-interface-6de56dd310a043bfa5c2f5564fa98365#1fe185a35d6d41f0a54279ac1a41bc94
## Summary of changes
- Optional CLI args `--control-plane-jwt-token` and `-compute-hook-url`
are added. If these are set, then we will use this HTTP endpoint,
instead of trying to use neon_local LocalEnv to update compute
configuration.
- Implement an HTTP-driven version of ComputeHook that calls into the
configured URL
- Notify for all tenants on startup, to ensure that we don't miss
notifications if we crash partway through a change, and carry a
`pending_compute_notification` flag at runtime to allow notifications to
fail without risking never sending the update.
- Add a test for all this
One might wonder: why not do a "forever" retry for compute hook
notifications, rather than carrying a flag on the shard to call
reconcile() again later. The reason is that we will later limit
concurreny of reconciles, when dealing with larger numbers of shards,
and if reconcile is stuck waiting for the control plane to accept a
notification request, it could jam up the whole system and prevent us
making other changes. Anyway: from the perspective of the outside world,
we _do_ retry forever, but we don't retry forever within a given
Reconciler lifetime.
The `pending_compute_notification` logic is predicated on later adding a
background task that just calls `Service::reconcile_all` on a schedule
to make sure that anything+everything that can fail a
Reconciler::reconcile call will eventually be retried.
This test became flaky when postgres retry handling was fixed to use
backoff delays -- each iteration in this test's loop was taking much
longer because pgbench doesn't fail until postgres has given up on
retrying to the pageserver.
We are just removing it, because the condition it tests is no longer
risky: we reload all metadata from remote storage on restart, so
crashing directly between making local changes and doing remote uploads
isn't interesting any more.
Closes: https://github.com/neondatabase/neon/issues/2856
Closes: https://github.com/neondatabase/neon/issues/5329
in `test_statvfs_pressure_{usage,min_avail_bytes}` we now race against
initial logical size calculation on-demand downloading the layers. first
wait out the initial logical sizes, then change the final asserts to be
"eventual", which is not great but it is faster than failing and
retrying.
this issue seems to happen only in debug mode tests.
Fixes: #6510
Adds an endpoint to the pageserver to S3-recover an entire tenant to a
specific given timestamp.
Required input parameters:
* `travel_to`: the target timestamp to recover the S3 state to
* `done_if_after`: a timestamp that marks the beginning of the recovery
process. retries of the query should keep this value constant. it *must*
be after `travel_to`, and also after any changes we want to revert, and
must represent a point in time before the endpoint is being called, all
of these time points in terms of the time source used by S3. these
criteria need to hold even in the face of clock differences, so I
recommend waiting a specific amount of time, then taking
`done_if_after`, then waiting some amount of time again, and only then
issuing the request.
Also important to note: the timestamps in S3 work at second accuracy, so
one needs to add generous waits before and after for the process to work
smoothly (at least 2-3 seconds).
We ignore the added test for the mocked S3 for now due to a limitation
in moto: https://github.com/getmoto/moto/issues/7300 .
Part of https://github.com/neondatabase/cloud/issues/8233