this should allow test
test_delete_tenant_exercise_crash_safety_failpoints with
debug-pg16-Check.RETRY_WITH_RESTART-mock_s3-tenant-delete-before-remove-timelines-dir-True
to pass more reliably.
## Problem
In many places in test code, paths are built manually from what
NeonEnv.tenant_dir and NeonEnv.timeline_dir could do.
## Summary of changes
1. NeonEnv.tenant_dir and NeonEnv.timeline_dir moved under class
NeonPageserver as the path they use is per-pageserver instance.
2. Used these everywhere to replace manual path building
Closes#5258
---------
Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Refactor tests to have less globals.
This will allow to hopefully write more complex tests for our new metric
collection requirements in #5297. Includes reverted work from #4761
related to test globals.
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: MMeent <matthias@neon.tech>
This adds PostgreSQL 16 as a vendored postgresql version, and adapts the
code to support this version.
The important changes to PostgreSQL 16 compared to the PostgreSQL 15
changeset include the addition of a neon_rmgr instead of altering Postgres's
original WAL format.
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
Currently our testing environment only supports running a single
pageserver at a time. This is insufficient for testing failover and
migrations.
- Dependency of writing tests for #5207
## Summary of changes
- `neon_local` and `neon_fixture` now handle multiple pageservers
- This is a breaking change to the `.neon/config` format: any local
environments will need recreating
- Existing tests continue to work unchanged:
- The default number of pageservers is 1
- `NeonEnv.pageserver` is now a helper property that retrieves the first
pageserver if there is only one, else throws.
- Pageserver data directories are now at `.neon/pageserver_{n}` where n
is 1,2,3...
- Compatibility tests get some special casing to migrate neon_local
configs: these are not meant to be backward/forward compatible, but they
were treated that way by the test.
Remote storage cleanup split from #5198:
- pageserver, extensions, and safekeepers now have their separate remote
storage
- RemoteStorageKind has the configuration code
- S3Storage has the cleanup code
- with MOCK_S3, pageserver, extensions, safekeepers use different
buckets
- with LOCAL_FS, `repo_dir / "local_fs_remote_storage" / $user` is used
as path, where $user is `pageserver`, `safekeeper`
- no more `NeonEnvBuilder.enable_xxx_remote_storage` but one
`enable_{pageserver,extensions,safekeeper}_remote_storage`
Should not have any real changes. These will allow us to default to
`LOCAL_FS` for pageserver on the next PR, remove
`RemoteStorageKind.NOOP`, work towards #5172.
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
- #5050
Closes: https://github.com/neondatabase/neon/issues/5136
## Summary of changes
- A new configuration property `control_plane_api` controls other
functionality in this PR: if it is unset (default) then everything still
works as it does today.
- If `control_plane_api` is set, then on startup we call out to control
plane `/re-attach` endpoint to discover our attachments and their
generations. If an attachment is missing from the response we implicitly
detach the tenant.
- Calls to pageserver `/attach` API may include a `generation`
parameter. If `control_plane_api` is set, then this parameter is
mandatory.
- RemoteTimelineClient's loading of index_part.json is generation-aware,
and will try to load the index_part with the most recent generation <=
its own generation.
- The `neon_local` testing environment now includes a new binary
`attachment_service` which implements the endpoints that the pageserver
requires to operate. This is on by default if running `cargo neon` by
hand. In `test_runner/` tests, it is off by default: existing tests
continue to run with in the legacy generation-less mode.
Caveats:
- The re-attachment during startup assumes that we are only re-attaching
tenants that have previously been attached, and not totally new tenants
-- this relies on the control plane's attachment logic to keep retrying
so that we should eventually see the attach API call. That's important
because the `/re-attach` API doesn't tell us which timelines we should
attach -- we still use local disk state for that. Ref:
https://github.com/neondatabase/neon/issues/5173
- Testing: generations are only enabled for one integration test right
now (test_pageserver_restart), as a smoke test that all the machinery
basically works. Writing fuller tests that stress tenant migration will
come later, and involve extending our test fixtures to deal with
multiple pageservers.
- I'm not in love with "attachment_service" as a name for the neon_local
component, but it's not very important because we can easily rename
these test bits whenever we want.
- Limited observability when in re-attach on startup: when I add
generation validation for deletions in a later PR, I want to wrap up the
control plane API calls in some small client class that will expose
metrics for things like errors calling the control plane API, which will
act as a strong red signal that something is not right.
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
- Scrubber's `tidy` command requires presence of a control plane
- Scrubber has no tests at all
## Summary of changes
- Add re-usable async streams for reading metadata from a bucket
- Add a `scan-metadata` command that reads from those streams and calls
existing `checks.rs` code to validate metadata, then returns a summary
struct for the bucket. Command returns nonzero status if errors are
found.
- Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that
tests using remote storage can request to have the scrubber run after
they finish
- Enable remote storarge and scrub_on_exit in test_pageserver_restart
and test_pageserver_chaos
This is a "toe in the water" of the overall space of validating the
scrubber. Later, we should:
- Enable scrubbing at end of tests using remote storage by default
- Make the success condition stricter than "no errors": tests should
declare what tenants+timelines they expect to see in the bucket (or
sniff these from the functions tests use to create them) and we should
require that the scrubber reports on these particular tenants/timelines.
The `tidy` command is untouched in this PR, but it should be refactored
later to use similar async streaming interface instead of the current
batch-reading approach (the streams are faster with large buckets), and
to also be covered by some tests.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
Tests using remote storage have manually entered `test_name` parameters,
which:
- Are easy to accidentally duplicate when copying code to make a new
test
- Omit parameters, so don't actually create unique S3 buckets when
running many tests concurrently.
## Summary of changes
- Use the `request` fixture in neon_env_builder fixture to get the test
name, then munge that into an S3 compatible bucket name.
- Remove the explicit `test_name` parameters to enable_remote_storage
When doing global queries in VictoriaMetrics, the per-timeline
histograms make us run into cardinality limits.
We don't want to give them up just yet because we don't
have an alternative for drilling down on timeline-specific
performance issues.
So, add a pre-aggregated histogram and add observations to it
whenever we add observations to the per-timeline histogram.
While we're at it, switch to using a strummed enum for the operation
type names.
I made a mistake when I adding `env.initial_timeline:
Optional[TimelineId]` in the #3839, should had just generated it and
used it to create a specific timeline. This PR fixes those mistakes, and
some extra calling into psql which must be slower than python field
access.
I'm still a bit nervous about attach -> crash case. But it should work.
(unlike case with timeline). Ideally would be cool to cover this with
test.
This continues tradition of adding bool flags for Tenant::set_stopping.
Probably lifecycle project will help with fixing it.
Rather temporary solution before proper:
https://github.com/neondatabase/neon/issues/5006
It requires more plumbing so lets not attach deleted tenants first and
then implement resume.
Additionally fix `assert_prefix_empty`. It had a buggy prefix calculation,
and since we always asserted for absence of stuff it worked. Here I
started to assert for presence of stuff too and it failed. Added more
"presence" asserts to other places to be confident that it works.
Resolves [#5016](https://github.com/neondatabase/neon/issues/5016)
## Problem
As documented, the global connection pool will be high contention.
## Summary of changes
Use DashMap rather than Mutex<HashMap>.
Of note, DashMap currently uses a RwLock internally, but it's partially
sharded to reduce contention by a factor of N. We could potentially use
flurry which is a port of Java's concurrent hashmap, but I have no good
understanding of it's performance characteristics. Dashmap is at least
equivalent to hashmap but less contention.
See the read heavy benchmark to analyse our expected performance
<https://github.com/xacrimon/conc-map-bench#ready-heavy>
I also spoke with the developer of dashmap recently, and they are
working on porting the implementation to use concurrent HAMT FWIW
Previously list_prefixes was incorrectly used for that purpose. Change
to use list_files. Add a test.
Some drive by refactorings on python side to move helpers out of
specific test file to be widely accessible
resolves https://github.com/neondatabase/neon/issues/4499
## Problem
This was set to 5 seconds, which was very close to how long a compaction
took on my workstation, and when deletion is blocked on compaction the
test would fail.
We will fix this to make compactions drop out on deletion, but for the
moment let's stabilize the test.
## Summary of changes
Change timeout on timeline deletion in
`test_timeline_deletion_with_files_stuck_in_upload_queue` from 5 seconds
to 30 seconds.
To this end add
1) -e option to 'neon_local safekeeper start' command appending extra options
to safekeeper invocation;
2) Allow multiple occurrences of the same option in safekeepers, the last
value is taken.
3) Allow to specify empty string for *-auth-public-key-path opts, it
disables auth for the service.
## Problem
Deletions can be possibly reordered. Use fsync to avoid the case when
mark file doesnt exist but other tenant/timeline files do.
See added comments.
resolves#4987
Originated from test failure where we got SlowDown error from s3.
The patch generalizes `download_retry` to not be download specific.
Resulting `retry` function is moved to utils crate. `download_retries`
is now a thin wrapper around this `retry` function.
To ensure that all needed retries are in place test code now uses
`test_remote_failures=1` setting.
Ref https://neondb.slack.com/archives/C059ZC138NR/p1691743624353009
`pg_regress` is flaky: https://github.com/neondatabase/neon/issues/559
Consolidated `CHECKPOINT` to `check_restored_datadir_content`, add a
wait for `wait_for_last_flush_lsn`.
Some recently introduced flakyness was fixed with #4948.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
The pageserver<->safekeeper protocol uses error messages to indicate end
of stream. pageserver already logs these at INFO level, but the inner
error message includes the word "ERROR", which interferes with log
searching.
Example:
```
walreceiver connection handling ended: db error: ERROR: ending streaming to Some("pageserver") at 0/4031CA8
```
The inner DbError has a severity of ERROR so DbError's Display
implementation includes that ERROR, even though we are actually
logging the error at INFO level.
## Summary of changes
Introduce an explicit WalReceiverError type, and in its From<>
for postgres errors, apply the logic from ExpectedError, for
expected errors, and a new condition for successes.
The new output looks like:
```
walreceiver connection handling ended: Successful completion: ending streaming to Some("pageserver") at 0/154E9C0, receiver is caughtup and there is no computes
```
## Problem
I spent a few minutes seeing how fast I could get our regression test
suite to run on my workstation, for when I want to run a "did I break
anything?" smoke test before pushing to CI.
- Test runtime was dominated by a couple of tests that run for longer
than all the others take together
- Test concurrency was limited to <16 by the ports-per-worker setting
There's no "right answer" for how long a test should
be, but as a rule of thumb, no one test should run
for much longer than the time it takes to run all the
other tests together.
## Summary of changes
- Make the ports per worker setting dynamic depending on worker count
- Modify the longest running tests to run for a shorter time
(`test_duplicate_layers` which uses a pgbench runtime) or fewer
iterations (`test_restarts_frequent_checkpoints`).
We currently cannot drop tenant before removing it's directory, or use
Tenant::drop for this. This creates unnecessary or inactionable warnings
during detach at least. Silence the most typical, file not found. Log
remaining at `error!`.
Cc: #2442
We don't know how our s3 remote_storage is performing, or if it's
blocking the shutdown. Well, for sampling reasons, we will not really
know even after this PR.
Add metrics:
- align remote_storage metrics towards #4813 goals
- histogram
`remote_storage_s3_request_seconds{request_type=(get_object|put_object|delete_object|list_objects),
result=(ok|err|cancelled)}`
- histogram `remote_storage_s3_wait_seconds{request_type=(same kinds)}`
- counter `remote_storage_s3_cancelled_waits_total{request_type=(same
kinds)}`
Follow-up work:
- After release, remove the old metrics, migrate dashboards
Histogram buckets are rough guesses, need to be tuned. In pageserver we
have a download timeout of 120s, so I think the 100s bucket is quite
nice.
During deploys of 2023-08-03 we logged too much on shutdown. Fix the
logging by timing each top level shutdown step, and possibly warn on it
taking more than a rough threshold, based on how long I think it
possibly should be taking. Also remove all shutdown logging from
background tasks since there is already "shutdown is taking a long time"
logging.
Co-authored-by: John Spray <john@neon.tech>
## Problem
neon_fixtures.py has grown to unmanageable size. It attracts conflicts.
When adding specific utils under for example `fixtures/pageserver`
things sometimes need to import stuff from `neon_fixtures.py` which
creates circular import. This is usually only needed for type
annotations, so `typing.TYPE_CHECKING` flag can mask the issue.
Nevertheless I believe that splitting neon_fixtures.py into smaller
parts is a better approach.
Currently the PR contains small things, but I plan to continue and move
NeonEnv to its own `fixtures.env` module. To keep the diff small I think
this PR can already be merged to cause less conflicts.
UPD: it looks like currently its not really possible to fully avoid
usage of `typing.TYPE_CHECKING`, because some components directly depend
on each other. I e Env -> Cli -> Env cycle. But its still worth it to
avoid it in as many places as possible. And decreasing neon_fixture's
size still makes sense.
## Problem
If AWS credentials are not set locally (via
AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars)
`test_remote_library[release-pg15-mock_s3]` test fails with the
following error:
```
ERROR could not start the compute node: Failed to download a remote file: Failed to download S3 object: failed to construct request
```
## Summary of changes
- set AWS credentials for endpoints programmatically
Add infrastructure to dynamically load postgres extensions and shared libraries from remote extension storage.
Before postgres start downloads list of available remote extensions and libraries, and also downloads 'shared_preload_libraries'. After postgres is running, 'compute_ctl' listens for HTTP requests to load files.
Postgres has new GUC 'extension_server_port' to specify port on which 'compute_ctl' listens for requests.
When PostgreSQL requests a file, 'compute_ctl' downloads it.
See more details about feature design and remote extension storage layout in docs/rfcs/024-extension-loading.md
---------
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alek Westover <alek.westover@gmail.com>
The test was starting two endpoints on the same branch as discovered by
@petuhovskiy.
The fix is to allow passing branch-name from the python side over to
neon_local, which already accepted it.
Split from #4824, which will handle making this more misuse resistant.
## Problem
Currently we delete local files first, so if pageserver restarts after
local files deletion then remote deletion is not continued. This can be
solved with inversion of these steps.
But even if these steps are inverted when index_part.json is deleted
there is no way to distinguish between "this timeline is good, we just
didnt upload it to remote" and "this timeline is deleted we should
continue with removal of local state". So to solve it we use another
mark file. After index part is deleted presence of this mark file
indentifies that it was a deletion intention.
Alternative approach that was discussed was to delete all except
metadata first, and then delete metadata and index part. In this case we
still do not support local only configs making them rather unsafe
(deletion in them is already unsafe, but this direction solidifies this
direction instead of fixing it). Another downside is that if we crash
after local metadata gets removed we may leave dangling index part on
the remote which in theory shouldnt be a big deal because the file is
small.
It is not a big change to choose another approach at this point.
## Summary of changes
Timeline deletion sequence:
1. Set deleted_at in remote index part.
2. Create local mark file.
3. Delete local files except metadata (it is simpler this way, to be
able to reuse timeline initialization code that expects metadata)
4. Delete remote layers
5. Delete index part
6. Delete meta, timeline directory.
7. Delete mark file.
This works for local only configuration without remote storage.
Sequence is resumable from any point.
resolves#4453
resolves https://github.com/neondatabase/neon/pull/4552 (the issue was
created with async cancellation in mind, but we can still have issues
with retries if metadata is deleted among the first by remove_dir_all
(which doesnt have any ordering guarantees))
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
We currently have a timeseries for each of the tenants in different
states. We only want this for Broken. Other states could be counters.
Fix this by making the `pageserver_tenant_states_count` a counter
without a `tenant_id` and
add a `pageserver_broken_tenants_count` which has a `tenant_id` label,
each broken tenant being 1.
Compute now uses special safekeeper WAL service port allowing auth tokens with
only tenant scope. Adds understanding of this port to neon_local and fixtures,
as well as test of both ports behaviour with different tokens.
ref https://github.com/neondatabase/neon/issues/4730
## Problem
Binaries created from PRs (both in docker images and for tests) have
wrong git-env versions, they point to phantom merge commits.
## Summary of changes
- Prefer GIT_VERSION env variable even if git information was accessible
- Use `${{ github.event.pull_request.head.sha || github.sha }}` instead
of `${{ github.sha }}` for `GIT_VERSION` in workflows
So the builds will still happen from this phantom commit, but we will
report the PR commit.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
All tests have already been parametrised by Postgres version and build
type (to have them distinguishable in the Allure report), but despite
it, it's anyway required to have DEFAULT_PG_VERSION and BUILD_TYPE env
vars set to corresponding values, for example to
run`test_timeline_deletion_with_files_stuck_in_upload_queue[release-pg14-local_fs]`
test it's required to set `DEFAULT_PG_VERSION=14` and
`BUILD_TYPE=release`.
This PR makes the test framework pick up parameters from the test name
itself.
## Summary of changes
- Postgres version and build type related fixtures now are
function-scoped (instead of being sessions scoped before)
- Deprecate `--pg-version` argument in favour of DEFAULT_PG_VERSION env
variable (it's easier to parse)
- GitHub autocomment now includes only one command with all the failed
tests + runs them in parallel
The histogram distinguishes by ok/err.
I took the liberty to create a small abstraction for such use cases.
It helps keep the label values inside `metrics.rs`, right next
to the place where the metric and its labels are declared.