## Problem
After https://github.com/neondatabase/neon/pull/7990 `regress_test` job
started to fail with an error:
```
...
File "/__w/neon/neon/test_runner/fixtures/benchmark_fixture.py", line 485, in pytest_terminal_summary
terminalreporter.write(f"{test_report.head_line}.{recorded_property['name']}: ")
TypeError: 'bool' object is not subscriptable
```
https://github.com/neondatabase/neon/actions/runs/10125750938/job/28002582582
It happens because the current implementation doesn't expect that pytest's
`user_properties` can be used for anything other than benchmarks (and
https://github.com/neondatabase/neon/pull/7990 started to use it for
tracking the `preserve_database_files` parameter).
## Summary of changes
- Make NeonBenchmarker use only records with the `neon_benchmarker_` prefix
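
A minimal sketch of the prefix filter, assuming `user_properties` holds `(name, value)` tuples as pytest defines them. The helper name `benchmark_records` and the sample record are hypothetical and are not taken from `benchmark_fixture.py`.

```python
PREFIX = "neon_benchmarker_"  # prefix named in the summary above


def benchmark_records(user_properties):
    """Yield (short_name, value) pairs that belong to the benchmarker.

    Any fixture may append to user_properties, so entries without the
    prefix (for example a plain bool) are skipped instead of subscripted.
    """
    for name, value in user_properties:
        if isinstance(name, str) and name.startswith(PREFIX):
            yield name[len(PREFIX):], value


# Hypothetical mix of an unrelated property and a benchmark record.
props = [
    ("preserve_database_files", True),
    ("neon_benchmarker_wal_written", {"name": "wal_written", "value": 12.5, "unit": "MB"}),
]
records = list(benchmark_records(props))
```

With this filter the `bool` stored for `preserve_database_files` never reaches the code that indexes into a record.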
## Problem
There's a `NeonEnvBuilder#preserve_database_files` parameter that allows
you to keep database files for debugging purposes (by default, files get
cleaned up), but there's no way to get these files from a CI run.
This PR adds handling of `NeonEnvBuilder#preserve_database_files` and
adds the compressed test output directory to Allure reports (for tests
with this parameter enabled).
Ref https://github.com/neondatabase/neon/issues/6967
## Summary of changes
- Compress and add the whole test output directory to Allure reports
- Currently works only with `neon_env_builder` fixture
- Remove `preserve_database_files = True` from sharding tests as
unneeded
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
Persists whether a timeline is archived or not in `index_part.json`. We
only return success if the upload has actually worked successfully.
Also introduces a new `index_part.json` version number.
Fixes #8459
Part of #8088
close https://github.com/neondatabase/neon/issues/8435
## Summary of changes
If L0 compaction did not include all L0 layers, skip image generation.
There are multiple possible solutions to the original issue, i.e., an
alternative is to wrap the partial L0 compaction in a loop until it
compacts all L0 layers. However, considering that we should weight all
tenants equally, the current solution can ensure everyone gets a chance
to run compaction, and those who write too much won't get a chance to
create image layers. This creates natural backpressure: they get slower
reads because no image layers are created, which slows down their
writes, until compaction can eventually keep up with their writes and
generate image layers.
For deployment, we should add an alert on "skipping image layer
generation", so that we don't again run into incidents caused by image
layers not being generated.
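
The decision above can be sketched as follows. This is an illustration only: `compaction_pass` and `max_l0_per_pass` are hypothetical names, and the real pageserver logic is in Rust.

```python
def compaction_pass(l0_layers, max_l0_per_pass):
    """Compact at most max_l0_per_pass L0 layers in one pass.

    Returns (batch, remaining, fully_compacted). The caller would generate
    image layers only when fully_compacted is True, and otherwise log that
    image layer generation was skipped.
    """
    batch = l0_layers[:max_l0_per_pass]
    remaining = l0_layers[len(batch):]
    fully_compacted = len(remaining) == 0
    return batch, remaining, fully_compacted


# A tenant with more L0 layers than one pass handles skips image generation.
_, _, busy_tenant_done = compaction_pass(["l0-1", "l0-2", "l0-3"], max_l0_per_pass=2)
_, _, quiet_tenant_done = compaction_pass(["l0-1"], max_l0_per_pass=2)
```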
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Problem
-------
wait_lsn timeouts result in user-facing errors like
```
$ /tmp/neon/pg_install/v16/bin/pgbench -s3424 -i -I dtGvp user=neondb_owner dbname=neondb host=ep-tiny-wave-w23owa37.eastus2.azure.neon.build sslmode=require options='-cstatement_timeout=0 '
dropping old tables...
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (server-side)...
vacuuming...
pgbench: error: query failed: ERROR: [NEON_SMGR] [shard 0] could not read block 214338 in rel 1663/16389/16839.0 from page server at lsn C/E1C12828
DETAIL: page server returned error: LSN timeout: Timed out while waiting for WAL record at LSN C/E1418528 to arrive, last_record_lsn 6/999D9CA8 disk consistent LSN=6/999D9CA8, WalReceiver status: (update 2024-07-25 08:30:07): connecting to node 25, safekeeper candidates (id|update_time|commit_lsn): [(21|08:30:16|C/E1C129E0), (23|08:30:16|C/E1C129E0), (25|08:30:17|C/E1C129E0)]
CONTEXT: while scanning block 214338 of relation "public.pgbench_accounts"
pgbench: detail: Query was: vacuum analyze pgbench_accounts
```
Solution
--------
It's better to be slow than to fail the queries.
If the app has a deadline, it can use `statement_timeout`.
In the long term, we want to eliminate wait_lsn timeout.
In the short term (this PR), we bump the wait_lsn timeout to
a larger value to reduce the frequency at which these wait_lsn timeouts
occur.
We will observe SLOs and specifically
`pageserver_wait_lsn_seconds_bucket`
before we eliminate the timeout completely.
## Problem
We are missing the step-down primitive required to implement rolling
restarts of the storage controller.
## Summary of changes
Add `/control/v1/step_down` endpoint which puts the storage controller
into a state where it rejects
all API requests apart from `/control/v1/step_down`, `/status` and
`/metrics`. When receiving the request,
storage controller cancels all pending reconciles and waits for them to
exit gracefully. The response contains
a snapshot of the in-memory observed state.
Related:
* https://github.com/neondatabase/cloud/issues/14701
* https://github.com/neondatabase/neon/issues/7797
* https://github.com/neondatabase/neon/pull/8310
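
A rough sketch of the request gating described above. The class name, the sample non-control path, the pending reconcile list, and the 503 status code are assumptions made for illustration; only the three allowed paths come from the description.

```python
# Paths that stay reachable after step-down, per the summary above.
ALLOWED_AFTER_STEP_DOWN = {"/control/v1/step_down", "/status", "/metrics"}


class StorageControllerSketch:
    def __init__(self, observed_state):
        self.observed_state = observed_state
        self.stepped_down = False
        self.pending_reconciles = ["tenant-a", "tenant-b"]  # hypothetical

    def handle(self, path):
        """Return (status_code, body) for a request path."""
        if path == "/control/v1/step_down":
            self.stepped_down = True
            # Stand-in for cancelling reconciles and waiting for them to exit.
            self.pending_reconciles.clear()
            # Respond with a snapshot of the in-memory observed state.
            return 200, dict(self.observed_state)
        if self.stepped_down and path not in ALLOWED_AFTER_STEP_DOWN:
            return 503, None  # assumed status code for rejected requests
        return 200, None
```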
## Problem
Vectored get is already enabled in all prod regions without validation.
The pageserver defaults
are out of sync however.
## Summary of changes
Update the pageserver defaults to match the prod config. Also means that
when running tests locally,
people don't have to use the env vars to get the prod config.
## Problem
This is an experiment to see if 16x concurrency is actually helping, or
if it's just giving us very noisy results. If the total runtime with a
lower concurrency is similar, then a lower concurrency is preferable to
reduce the impact of resource-hungry tests running concurrently.
## Problem
This test relies on writing image layers before the split. It can fail
to do so durably if the image layers are written ahead of the remote
consistent LSN, so we should have been doing a checkpoint rather than
just a compaction
## Problem
The scrubber would like to check the highest mtime in a tenant's objects
as a safety check during purges. It recently switched to use
GenericRemoteStorage, so we need to expose that in the listing methods.
## Summary of changes
- In Listing.keys, return a ListingObject{} including a last_modified
field, instead of a RemotePath
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
follow up for #8475
## Summary of changes
Use our own private Docker registry in the `cache-from` and `cache-to`
settings of the docker build-push actions
There is a race condition between timeline shutdown and the split task.
Timeline shutdown first shuts down the upload queue, and only then fires
the cancellation token. A parallel running timeline split operation
might thus encounter a cancelled upload queue before the cancellation
token is fired, and print a noisy error.
Fix this by mapping `anyhow::Error{ NotInitialized::ShuttingDown }` to
`FlushLayerError::Cancelled` instead of `FlushLayerError::Other(_)`.
Fixes #8496
update pg_jsonschema extension to v0.3.1
update pg_graphql extension to v1.5.7
update pgx_ulid extension to v0.1.5
update pg_tiktoken extension, patch Cargo.toml to use new pgrx
This pull request (should) fix the failure of test_gc_feedback. See the
explanation in the newly-added test case.
Part of https://github.com/neondatabase/neon/issues/8002
Allow incomplete history for the compaction algorithm.
Signed-off-by: Alex Chi Z <chi@neon.tech>
In general, replace:
* 'lfc_approximate_working_set_size' with
* 'lfc_approximate_working_set_size_windows'
For the "main" metrics that are actually scraped and used internally,
the old one is just marked as deprecated.
For the "autoscaling" metrics, we're not currently using the old one, so
we can get away with just replacing it.
Also, for the user-visible metrics we'll only store & expose a few
different time windows, to avoid making the UI overly busy or bloating
our internal metrics storage.
But for the autoscaling-related scraper, we aren't storing the metrics,
and it's useful to be able to programmatically operate on the trendline
of how WSS increases (or doesn't!) with window size. So there, we can
just output datapoints for each minute.
Part of neondatabase/autoscaling#872
See also https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6
## Problem
Storcon shutdown did not produce a clean observed state. This is not a
problem at the moment, but we will need to stop all reconciles with
clean observed state for rolling restarts.
I tried to test this by collecting the observed state during shutdown
and comparing it with the in-memory observed
state, but it doesn't work because a lot of tests use the cursed attach
hook to create tenants directly through the ps.
## Summary of Changes
Rework storcon shutdown as follows:
* Reconcilers get a separate cancellation token which is a child token
of the global `Service::cancel`.
* Reconcilers get a separate gate
* Add a mechanism to drain the reconciler result queue before
* Put all of this together into a clean shutdown sequence
Related https://github.com/neondatabase/cloud/issues/14701
## Problem
This test was destabilized by
https://github.com/neondatabase/neon/pull/8431. The threshold is
arbitrary & failures are still quite close to it. At a high level the
test is asserting "eviction was approximately fair to these tenants",
which appears to still be the case when the abs diff between ratios is
slightly higher at ~0.06-0.07.
## Summary of changes
- Change threshold from 0.06 to 0.065. Based on the last ~10 failures
that should be sufficient.
## Problem
Currently, tests may have a scrub during teardown if they ask for it,
but most tests don't request it. To detect "unknown unknowns", let's run
it at the end of every test where possible. This is similar to asserting
that there are no errors in the log at the end of tests.
## Summary of changes
- Remove explicit `enable_scrub_on_exit`
- Always scrub if remote storage is an S3Storage.
## Problem
Re-attach blocks the pageserver http server from starting up. Hence, it
can't reply to heartbeats
until that's done. This makes the storage controller mark the node
offline (not good). We worked
around this by setting the interval after which nodes are marked offline
to 5 minutes. This isn't a
long term solution.
## Summary of changes
* Introduce a new `NodeAvailability` state: `WarmingUp`. This state
models the following time interval:
* From receiving the re-attach request until the pageserver replies to
the first heartbeat post re-attach
* The heartbeat delta generator becomes aware of this state and uses a
separate longer interval
* Flag `max-warming-up-interval` now models the longer timeout and
`max-offline-interval` the shorter one to
match the names of the states
Closes https://github.com/neondatabase/neon/issues/7552
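
A simplified sketch of how a heartbeat delta could pick the interval by state. The function name, the `silent_for_secs` parameter, and the enum spelling are hypothetical; only the state names and the two interval flags come from the description.

```python
from enum import Enum


class NodeAvailability(Enum):
    ACTIVE = "active"
    WARMING_UP = "warming_up"
    OFFLINE = "offline"


def heartbeat_delta(current, heartbeat_ok, silent_for_secs,
                    max_offline_interval, max_warming_up_interval):
    """Compute the next availability state for one node.

    silent_for_secs is the time since the last successful heartbeat, or
    since the re-attach request for a node that is warming up.
    """
    if heartbeat_ok:
        return NodeAvailability.ACTIVE
    # A warming-up node gets the separate, longer interval.
    if current is NodeAvailability.WARMING_UP:
        limit = max_warming_up_interval
    else:
        limit = max_offline_interval
    if silent_for_secs > limit:
        return NodeAvailability.OFFLINE
    return current
```

A node that is still re-attaching therefore stays in `WarmingUp` for silences that would already mark an active node offline.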
## Problem
The rds-aurora endpoint connection cannot be reached from GitHub action
runners.
Temporarily remove this DBMS from the pgbench comparison runs.
## Summary of changes
On Saturday we normally run Neon in comparison with AWS RDS-Postgres and
AWS RDS-Aurora.
Remove Aurora until we have a working setup
Before this PR
1. The circuit breaker would trip on CompactionError::Shutdown. That's
wrong, we want to ignore those cases.
2. remote timeline client shutdown would not be mapped to
CompactionError::Shutdown in all circumstances.
We observed this in staging, see
https://neondb.slack.com/archives/C033RQ5SPDH/p1721829745384449
This PR fixes (1) with a simple `match` statement, and (2) by switching
a bunch of `anyhow` usage over to distinguished errors that ultimately
get mapped to `CompactionError::Shutdown`.
I removed the implicit `#[from]` conversion from `anyhow::Error` to
`CompactionError::Other` to discover all the places that were mapping
remote timeline client shutdown to `anyhow::Error`.
In my opinion `#[from]` is an antipattern and we should avoid it,
especially for `anyhow::Error`. If some callee is going to return
anyhow, the very least the caller should do is acknowledge, through a
`map_err(MyError::Other)` that they're conflating different failure
reasons.
## Problem
Jobs `check-linux-arm-build` and `check-codestyle-rust-arm` (from
`.github/workflows/neon_extra_builds.yml`) duplicate `build-neon` and
`check-codestyle-rust` jobs in the main pipeline.
## Summary of changes
- Move `check-linux-arm-build` and `check-codestyle-rust-arm` from extra
builds to the main pipeline
By default, git does not find a nice hunk header for Rust. New(er)
versions ship with a handy xfuncname pattern, so let's enable that for
all developers.
Example of how this should help:
39046172ab
## Problem
PR that modified compaction raced with PR that modified the GcInfo
structure
## Summary of changes
Fix it
Co-authored-by: Vlad Lazar <vlalazar.vlad@gmail.com>
## Problem
The in-memory layer vectored read was very slow in some conditions
(the `walingest::test_large_rel` test). Upon profiling, I realised that 80% of
the time was spent building up the binary heap of reads. This stage
isn't actually needed.
## Summary of changes
Remove the planning stage as we never took advantage of it in order to
merge reads. There should be no functional change from this patch.
## Problem
- `build-and-test` workflow is pretty big
- jobs that depend on the matrix job don't start before all variations
are done. I.e. `regress-tests` depend on `build-neon`, but we can't
start `regress-tests` on the release configuration until `build-neon` is
done on release **and debug** configurations. This will be more visible
once we add ARM to the matrix.
## Summary of changes
- Move jobs related to building (`build-neon`) and testing
(`regress-tests`) to a separate job
## Problem
The current bucket based rate limiter is not very intuitive and has some
bad failure cases.
## Summary of changes
Switches from fixed interval buckets to leaky bucket impl. A single
bucket per endpoint,
drains over time. Drains by checking the time since the last check, and
draining tokens en-masse. Garbage collection works similarly to before: it
drains a shard (1/64th of the set) every 2048 checks, and it only
removes buckets that are empty.
To be compatible with the existing config, I've faffed to make it take
the min and the max rps of each as the sustained rps and the max bucket
size which should be roughly equivalent.
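
A minimal sketch of the leaky bucket idea, assuming one bucket per endpoint with a sustained rate and a maximum size. The class and method names are hypothetical, time is passed in explicitly to keep the example deterministic, and the sharded garbage collection is not modelled.

```python
class LeakyBucket:
    def __init__(self, rps, max_size):
        self.rps = rps            # sustained drain rate, tokens per second
        self.max_size = max_size  # burst capacity
        self.level = 0.0
        self.last_check = None

    def drain(self, now):
        """Drain en masse, based on the time since the last check."""
        if self.last_check is not None:
            elapsed = now - self.last_check
            self.level = max(0.0, self.level - elapsed * self.rps)
        self.last_check = now

    def check(self, now, n=1.0):
        """Admit n requests if they fit into the bucket."""
        self.drain(now)
        if self.level + n > self.max_size:
            return False
        self.level += n
        return True

    def is_empty(self):
        # Only empty buckets would be eligible for garbage collection.
        return self.level == 0.0


bucket = LeakyBucket(rps=1.0, max_size=2.0)
decisions = [bucket.check(0.0), bucket.check(0.0), bucket.check(0.0), bucket.check(1.0)]
```

Two requests fit into the burst capacity at once, the third is rejected, and after one second one token has drained so the next request is admitted.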
## Problem
Previously, Timeline::gc_info was only updated in a batch operation at
the start of GC. That means that timelines didn't generally have
accurate information about who their children were before the first GC,
or between GC cycles.
Knowledge of child branches is important for calculating layer
visibility in #8398
## Summary of changes
- Split out part of refresh_gc_info into initialize_gc_info, which is
now called early in startup
- Include TimelineId in retain_lsns so that we can later add/remove the
LSNs for particular children
- When timelines are added/removed, update their parent's retain_lsns
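
The bookkeeping in the last two bullets can be sketched roughly as below. `GcInfoSketch` and its method names are hypothetical stand-ins for the Rust structure, and LSNs are plain integers here.

```python
class GcInfoSketch:
    def __init__(self):
        # Each entry pairs a branch point with the child that needs it, so a
        # single child's entry can be added or removed later.
        self.retain_lsns = []  # list of (lsn, child_timeline_id)

    def insert_child(self, child_id, branch_lsn):
        self.retain_lsns.append((branch_lsn, child_id))
        self.retain_lsns.sort()

    def remove_child(self, child_id):
        self.retain_lsns = [
            (lsn, cid) for lsn, cid in self.retain_lsns if cid != child_id
        ]


parent = GcInfoSketch()
parent.insert_child("child-b", 0x30)
parent.insert_child("child-a", 0x10)
after_insert = list(parent.retain_lsns)
parent.remove_child("child-a")
after_remove = list(parent.retain_lsns)
```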
## Problem
In `test_basebackup_with_high_slru_count`, the pageserver is sometimes
mysteriously hanging on startup, having been started+stopped earlier in
the test setup while populating template tenant data.
- #7586
We can't see why this is hanging in this particular test. The test does
some weird stuff though, like attaching a load of broken tenants and
then doing a SIGQUIT kill of a pageserver.
## Summary of changes
- Attach tenants normally instead of doing a failpoint dance to attach
them as broken
- Shut the pageserver down gracefully during init instead of using
immediate mode
- Remove the "sequential" variant of the unstable test, as this is going
away soon anyway
- Log before trying to acquire lock file, so that if it hangs we have a
clearer sense of if that's really where it's hanging. It seems like it
is, but that code does a non-blocking flock so it's surprising.
Implements the TODO from #8466 about retries: now the user of the stream
returned by `list_streaming` is able to obtain the next item in the
stream as often as they want, and retry it if it is an error.
Also extends the test for paginated listing to include a dedicated
test for `list_streaming`.
follow-up of #8466
fixes #8457
part of #7547
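
A small sketch of the retry property: a failed fetch leaves the stream position unchanged, so the caller can simply ask for the same item again. `PageStream`, `fetch_page`, and the continuation tokens are hypothetical and do not reflect the actual `list_streaming` signature.

```python
class PageStream:
    """Pull-based stream of listing pages that tolerates retries."""

    def __init__(self, fetch_page):
        self.fetch_page = fetch_page  # callable: token -> (keys, next_token)
        self.token = None
        self.done = False

    def next_page(self):
        if self.done:
            return None
        # If fetch_page raises, token and done are untouched, so the next
        # call retries the same page.
        keys, next_token = self.fetch_page(self.token)
        self.token = next_token
        self.done = next_token is None
        return keys


# Hypothetical backend whose second request fails once.
PAGES = {None: (["a", "b"], "t1"), "t1": (["c"], None)}
attempts = {"n": 0}


def flaky_fetch(token):
    attempts["n"] += 1
    if attempts["n"] == 2:
        raise IOError("transient listing failure")
    return PAGES[token]


stream = PageStream(flaky_fetch)
collected = []
while True:
    try:
        page = stream.next_page()
    except IOError:
        continue  # retry the same item
    if page is None:
        break
    collected.extend(page)
```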
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
LayerAccessStats contains a lot of detail that we don't use: short
histories of most recent accesses, specifics on what kind of task
accessed a layer, etc. This is all stored inside a Mutex, which is
locked every time something accesses a layer.
## Summary of changes
- Store timestamps at a very low resolution (to the nearest second),
sufficient for use on the timescales of eviction.
- Pack access time and last residence change time into a single u64
- Use the high bits of the u64 for other flags, including the new layer
visibility concept.
- Simplify the external-facing model for access stats to just include
what we now track.
Note that the `HistoryBufferWithDropCounter` is removed here because it
is no longer used. I do not dislike this type, we just happen not to use
it for anything else at present.
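
To illustrate the packing idea, here is a sketch under an assumed layout: 30 bits per second-resolution timestamp and the top bit as a visibility flag. The actual bit widths and flag positions are not stated above, so every constant below is hypothetical.

```python
TS_BITS = 30                   # assumed width of each timestamp field
TS_MASK = (1 << TS_BITS) - 1
RTIME_SHIFT = TS_BITS          # residence-change time sits above access time
VISIBLE_FLAG = 1 << 63         # assumed flag in the high bits


def pack(atime_secs, rtime_secs, visible):
    """Pack access time, residence-change time and a flag into one u64."""
    word = (atime_secs & TS_MASK) | ((rtime_secs & TS_MASK) << RTIME_SHIFT)
    if visible:
        word |= VISIBLE_FLAG
    return word


def unpack(word):
    atime = word & TS_MASK
    rtime = (word >> RTIME_SHIFT) & TS_MASK
    return atime, rtime, bool(word & VISIBLE_FLAG)


word = pack(atime_secs=100, rtime_secs=200, visible=True)
fields = unpack(word)
```

A single word like this can be updated with one atomic operation, which is what would let the Mutex be dropped from the access path.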
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
While investigating the flakiness of test_subscriber_restart, I
found that this test does not pass at all for PG 14/15 on macOS
(while it passes for PG16).
## Summary of changes
Rewrite the async connect state machine in exactly the same way as in
Vanilla: call `WaitLatchOrSocket` with `WL_SOCKET_WRITEABLE` before
calling `PQconnectPoll`.
Please note that this most likely will not fix the flakiness of
test_subscriber_restart.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
This adds the ability to list many prefixes in a streaming fashion to
both the `RemoteStorage` trait as well as `GenericRemoteStorage`.
* The `list` function of the `RemoteStorage` trait is implemented by
default in terms of `list_streaming`.
* For the production users (S3, Azure), `list_streaming` is implemented
and the default `list` implementation is used.
* For `LocalFs`, we keep the `list` implementation and make
`list_streaming` call it.
The `list_streaming` function is implemented for both S3 and Azure.
A TODO for later is retries, which the scrubber currently has while the
`list_streaming` implementations lack them.
part of #8457 and #7547
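
The "default `list` in terms of `list_streaming`" arrangement can be sketched as below. The class names and the in-memory backend are hypothetical, and pagination is reduced to fixed-size pages.

```python
class RemoteStorageSketch:
    def list_streaming(self, prefix):
        """Yield pages (lists of keys) under prefix. Backends override this."""
        raise NotImplementedError

    def list(self, prefix):
        # Default implementation: drain the stream and concatenate the pages.
        keys = []
        for page in self.list_streaming(prefix):
            keys.extend(page)
        return keys


class InMemoryStorage(RemoteStorageSketch):
    def __init__(self, keys, page_size=2):
        self.keys = sorted(keys)
        self.page_size = page_size

    def list_streaming(self, prefix):
        matching = [k for k in self.keys if k.startswith(prefix)]
        for start in range(0, len(matching), self.page_size):
            yield matching[start:start + self.page_size]


storage = InMemoryStorage(["t1/a", "t1/b", "t1/c", "t2/x"])
all_keys = storage.list("t1/")
pages = list(storage.list_streaming("t1/"))
```

A backend like `LocalFs` would do the reverse: keep its own `list` and have `list_streaming` yield that result as a single page.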