Rather temporary solution before proper:
https://github.com/neondatabase/neon/issues/5006
It requires more plumbing so lets not attach deleted tenants first and
then implement resume.
Additionally fix `assert_prefix_empty`. It had a buggy prefix calculation,
and since we always asserted for absence of stuff it worked. Here I
started to assert for presence of stuff too and it failed. Added more
"presence" asserts to other places to be confident that it works.
Resolves [#5016](https://github.com/neondatabase/neon/issues/5016)
## Problem
The previous version of neon (that we use in the forward compatibility test)
has installed `amcheck` extension now. We can run `pg_amcheck`
unconditionally.
## Summary of changes
- Run `pg_amcheck` in compatibility tests unconditionally
## Problem
As documented, the global connection pool will be high contention.
## Summary of changes
Use DashMap rather than Mutex<HashMap>.
Of note, DashMap currently uses a RwLock internally, but it's partially
sharded to reduce contention by a factor of N. We could potentially use
flurry which is a port of Java's concurrent hashmap, but I have no good
understanding of it's performance characteristics. Dashmap is at least
equivalent to hashmap but less contention.
See the read heavy benchmark to analyse our expected performance
<https://github.com/xacrimon/conc-map-bench#ready-heavy>
I also spoke with the developer of dashmap recently, and they are
working on porting the implementation to use concurrent HAMT FWIW
## Problem
PR #4839 has already reduced the number of b-tree traversals and vec
creations from 3 to 2, but as pointed out in
https://github.com/neondatabase/neon/pull/4839#discussion_r1279167815 ,
we would ideally just traverse the b-tree once during compaction.
Afer #4836, the two vecs created are one for the list of keys, lsns and
sizes, and one for the list of `(key, lsn, value reference)`. However,
they are not equal, as pointed out in
https://github.com/neondatabase/neon/pull/4839#issuecomment-1660418012
and the following comment: the key vec creation combines multiple
entries for which the lsn is changing but the key stays the same into
one, with the size being the sum of the sub-sizes. In SQL, this would
correspond to something like `SELECT key, lsn, SUM(size) FROM b_tree
GROUP BY key;` and `SELECT key, lsn, val_ref FROM b_tree;`. Therefore,
the join operation is non-trivial.
## Summary of changes
This PR merges the two lists of keys and value references into one. It's
not a trivial change and affects the size pattern of the resulting
files, which is why this is in a separate PR from #4839 .
The key vec is used in compaction for determining when to start a new
layer file. The loop uses various thresholds to come to this conclusion,
but the grouping via the key has led to the behaviour that regardless of
the threshold, it only starts a new file when either a new key is
encountered, or a new delta file.
The new code now does the combination after the merging and sorting of
the various keys from the delta files. This *mostly* does the same as
the old code, except for a detail: with the grouping done on a
per-delta-layer basis, the sorted and merged vec would still have
multiple entries for multiple delta files, but now, we don't have an
easy way to tell when a new input delta layer file is encountered, so we
cannot create multiple entries on that basis easily.
To prevent possibly infinite growth, our new grouping code compares the
combined size with the threshold, and if it is exceeded, it cuts a new
entry so that the downstream code can cut a new output file. Here, we
perform a tradeoff however, as if the threshold is too small, we risk
putting entries for the same key into multiple layer files, but if the
threshold is too big, we can in some instances exceed the target size.
Currently, we set the threshold to the target size, so in theory we
would stay below or roughly at double the `target_file_size`.
We also fix the way the size was calculated for the last key. The calculation
was wrong and accounted for the old layer's btree, even though we
already account for the overhead of the in-construction btree.
Builds on top of #4839 .
Previously list_prefixes was incorrectly used for that purpose. Change
to use list_files. Add a test.
Some drive by refactorings on python side to move helpers out of
specific test file to be widely accessible
resolves https://github.com/neondatabase/neon/issues/4499
## Problem
This was set to 5 seconds, which was very close to how long a compaction
took on my workstation, and when deletion is blocked on compaction the
test would fail.
We will fix this to make compactions drop out on deletion, but for the
moment let's stabilize the test.
## Summary of changes
Change timeout on timeline deletion in
`test_timeline_deletion_with_files_stuck_in_upload_queue` from 5 seconds
to 30 seconds.
To this end add
1) -e option to 'neon_local safekeeper start' command appending extra options
to safekeeper invocation;
2) Allow multiple occurrences of the same option in safekeepers, the last
value is taken.
3) Allow to specify empty string for *-auth-public-key-path opts, it
disables auth for the service.
## Problem
Deletions can be possibly reordered. Use fsync to avoid the case when
mark file doesnt exist but other tenant/timeline files do.
See added comments.
resolves#4987
## Problem
HTTP batch queries currently allow us to set the isolation level and
read only, but not deferrable.
## Summary of changes
Add support for deferrable.
Echo deferrable status in response headers only if true.
Likewise, now echo read-only status in response headers only if true.
## Problem
It's nice if `single query : single response :: batch query : batch
response`.
But at present, in the single case we send `{ query: '', params: [] }`
and get back a single `{ rows: [], ... }` object, while in the batch
case we send an array of `{ query: '', params: [] }` objects and get
back not an array of `{ rows: [], ... }` objects but a `{ results: [ {
rows: [] , ... }, { rows: [] , ... }, ... ] }` object instead.
## Summary of changes
With this change, the batch query body becomes `{ queries: [{ query: '',
params: [] }, ... ] }`, which restores a consistent relationship between
the request and response bodies.
Originated from test failure where we got SlowDown error from s3.
The patch generalizes `download_retry` to not be download specific.
Resulting `retry` function is moved to utils crate. `download_retries`
is now a thin wrapper around this `retry` function.
To ensure that all needed retries are in place test code now uses
`test_remote_failures=1` setting.
Ref https://neondb.slack.com/archives/C059ZC138NR/p1691743624353009
It allows term leader to ensure he pulls data from the correct term. Absense of
it wasn't very problematic due to CRC checks, but let's be strict.
walproposer still doesn't use it as we're going to remove recovery completely
from it.
`pg_regress` is flaky: https://github.com/neondatabase/neon/issues/559
Consolidated `CHECKPOINT` to `check_restored_datadir_content`, add a
wait for `wait_for_last_flush_lsn`.
Some recently introduced flakyness was fixed with #4948.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
The test mutates a shared directory which does not work with multiple
concurrent tests. It is being fixed, so this should be a very temporary
band-aid.
Cc: #4949.
## Problem
For some tests, we override the default timeout (300s / 5m) with a larger
values like 600s / 10m or even 1800s / 30m, even if it's not required.
I've collected some statistics (for the last 60 days) for tests
duration:
| test | max (s) | p99 (s) | p50 (s) | count |
|-----------------------------------|---------|---------|---------|-------|
| test_hot_standby | 9 | 2 | 2 | 5319 |
| test_import_from_vanilla | 16 | 9 | 6 | 5692 |
| test_import_from_pageserver_small | 37 | 7 | 5 | 5719 |
| test_pg_regress | 101 | 73 | 44 | 5642 |
| test_isolation | 65 | 56 | 39 | 5692 |
A couple of tests that I left with custom 600s / 10m timeout.
| test | max (s) | p99 (s) | p50 (s) | count |
|-----------------------------------|---------|---------|---------|-------|
| test_gc_cutoff | 456 | 224 | 109 | 5694 |
| test_pageserver_chaos | 528 | 267 | 121 | 5712 |
## Summary of changes
- Remove `@pytest.mark.timeout` annotation from several tests
## Problem
Currently to know how long pageserver startup took requires inspecting
logs.
## Summary of changes
`pageserver_startup_duration_ms` metric is added, with label `phase` for
different phases of startup.
These are broken down by phase, where the phases correspond to the
existing wait points in the code:
- Start of doing I/O
- When tenant load is done
- When initial size calculation is done
- When background jobs start
- Then "complete" when everything is done.
`pageserver_startup_is_loading` is a 0/1 gauge that indicates whether we are in the initial load of tenants.
`pageserver_tenant_activation_seconds` is a histogram of time in seconds taken to activate a tenant.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
The pageserver<->safekeeper protocol uses error messages to indicate end
of stream. pageserver already logs these at INFO level, but the inner
error message includes the word "ERROR", which interferes with log
searching.
Example:
```
walreceiver connection handling ended: db error: ERROR: ending streaming to Some("pageserver") at 0/4031CA8
```
The inner DbError has a severity of ERROR so DbError's Display
implementation includes that ERROR, even though we are actually
logging the error at INFO level.
## Summary of changes
Introduce an explicit WalReceiverError type, and in its From<>
for postgres errors, apply the logic from ExpectedError, for
expected errors, and a new condition for successes.
The new output looks like:
```
walreceiver connection handling ended: Successful completion: ending streaming to Some("pageserver") at 0/154E9C0, receiver is caughtup and there is no computes
```
## Problem
I spent a few minutes seeing how fast I could get our regression test
suite to run on my workstation, for when I want to run a "did I break
anything?" smoke test before pushing to CI.
- Test runtime was dominated by a couple of tests that run for longer
than all the others take together
- Test concurrency was limited to <16 by the ports-per-worker setting
There's no "right answer" for how long a test should
be, but as a rule of thumb, no one test should run
for much longer than the time it takes to run all the
other tests together.
## Summary of changes
- Make the ports per worker setting dynamic depending on worker count
- Modify the longest running tests to run for a shorter time
(`test_duplicate_layers` which uses a pgbench runtime) or fewer
iterations (`test_restarts_frequent_checkpoints`).
We currently cannot drop tenant before removing it's directory, or use
Tenant::drop for this. This creates unnecessary or inactionable warnings
during detach at least. Silence the most typical, file not found. Log
remaining at `error!`.
Cc: #2442
We don't know how our s3 remote_storage is performing, or if it's
blocking the shutdown. Well, for sampling reasons, we will not really
know even after this PR.
Add metrics:
- align remote_storage metrics towards #4813 goals
- histogram
`remote_storage_s3_request_seconds{request_type=(get_object|put_object|delete_object|list_objects),
result=(ok|err|cancelled)}`
- histogram `remote_storage_s3_wait_seconds{request_type=(same kinds)}`
- counter `remote_storage_s3_cancelled_waits_total{request_type=(same
kinds)}`
Follow-up work:
- After release, remove the old metrics, migrate dashboards
Histogram buckets are rough guesses, need to be tuned. In pageserver we
have a download timeout of 120s, so I think the 100s bucket is quite
nice.
During deploys of 2023-08-03 we logged too much on shutdown. Fix the
logging by timing each top level shutdown step, and possibly warn on it
taking more than a rough threshold, based on how long I think it
possibly should be taking. Also remove all shutdown logging from
background tasks since there is already "shutdown is taking a long time"
logging.
Co-authored-by: John Spray <john@neon.tech>
## Problem
neon_fixtures.py has grown to unmanageable size. It attracts conflicts.
When adding specific utils under for example `fixtures/pageserver`
things sometimes need to import stuff from `neon_fixtures.py` which
creates circular import. This is usually only needed for type
annotations, so `typing.TYPE_CHECKING` flag can mask the issue.
Nevertheless I believe that splitting neon_fixtures.py into smaller
parts is a better approach.
Currently the PR contains small things, but I plan to continue and move
NeonEnv to its own `fixtures.env` module. To keep the diff small I think
this PR can already be merged to cause less conflicts.
UPD: it looks like currently its not really possible to fully avoid
usage of `typing.TYPE_CHECKING`, because some components directly depend
on each other. I e Env -> Cli -> Env cycle. But its still worth it to
avoid it in as many places as possible. And decreasing neon_fixture's
size still makes sense.
## Problem
If AWS credentials are not set locally (via
AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars)
`test_remote_library[release-pg15-mock_s3]` test fails with the
following error:
```
ERROR could not start the compute node: Failed to download a remote file: Failed to download S3 object: failed to construct request
```
## Summary of changes
- set AWS credentials for endpoints programmatically
Add infrastructure to dynamically load postgres extensions and shared libraries from remote extension storage.
Before postgres start downloads list of available remote extensions and libraries, and also downloads 'shared_preload_libraries'. After postgres is running, 'compute_ctl' listens for HTTP requests to load files.
Postgres has new GUC 'extension_server_port' to specify port on which 'compute_ctl' listens for requests.
When PostgreSQL requests a file, 'compute_ctl' downloads it.
See more details about feature design and remote extension storage layout in docs/rfcs/024-extension-loading.md
---------
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alek Westover <alek.westover@gmail.com>
## Problem
Compatibility tests fail from time to time due to `pg_tenant_only_port`
port collision (added in https://github.com/neondatabase/neon/pull/4731)
## Summary of changes
- replace `pg_tenant_only_port` value in config with new port
- remove old logic, than we don't need anymore
- unify config overrides
## Problem
We have some amount of outdated logic in test_compatibility, that we
don't need anymore.
## Summary of changes
- Remove `PR4425_ALLOWED_DIFF` and tune `dump_differs` method to accept
allowed diffs in the future (a cleanup after
https://github.com/neondatabase/neon/pull/4425)
- Remote etcd related code (a cleanup after
https://github.com/neondatabase/neon/pull/2733)
- Don't set `preserve_database_files`
Run `pg_amcheck` in forward and backward compatibility tests to catch
some data corruption.
## Summary of changes
- Add amcheck compiling to Makefile
- Add `pg_amcheck` to test_compatibility
The test was starting two endpoints on the same branch as discovered by
@petuhovskiy.
The fix is to allow passing branch-name from the python side over to
neon_local, which already accepted it.
Split from #4824, which will handle making this more misuse resistant.
## Problem
Currently we delete local files first, so if pageserver restarts after
local files deletion then remote deletion is not continued. This can be
solved with inversion of these steps.
But even if these steps are inverted when index_part.json is deleted
there is no way to distinguish between "this timeline is good, we just
didnt upload it to remote" and "this timeline is deleted we should
continue with removal of local state". So to solve it we use another
mark file. After index part is deleted presence of this mark file
indentifies that it was a deletion intention.
Alternative approach that was discussed was to delete all except
metadata first, and then delete metadata and index part. In this case we
still do not support local only configs making them rather unsafe
(deletion in them is already unsafe, but this direction solidifies this
direction instead of fixing it). Another downside is that if we crash
after local metadata gets removed we may leave dangling index part on
the remote which in theory shouldnt be a big deal because the file is
small.
It is not a big change to choose another approach at this point.
## Summary of changes
Timeline deletion sequence:
1. Set deleted_at in remote index part.
2. Create local mark file.
3. Delete local files except metadata (it is simpler this way, to be
able to reuse timeline initialization code that expects metadata)
4. Delete remote layers
5. Delete index part
6. Delete meta, timeline directory.
7. Delete mark file.
This works for local only configuration without remote storage.
Sequence is resumable from any point.
resolves#4453
resolves https://github.com/neondatabase/neon/pull/4552 (the issue was
created with async cancellation in mind, but we can still have issues
with retries if metadata is deleted among the first by remove_dir_all
(which doesnt have any ordering guarantees))
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
We currently have a timeseries for each of the tenants in different
states. We only want this for Broken. Other states could be counters.
Fix this by making the `pageserver_tenant_states_count` a counter
without a `tenant_id` and
add a `pageserver_broken_tenants_count` which has a `tenant_id` label,
each broken tenant being 1.
Fix mx_offset_to_flags_offset() function
Fixes issue #4774
Postgres `MXOffsetToFlagsOffset` was not correctly converted to Rust
because cast to u16 is done before division by modulo. It is possible
only if divider is power of two.
Add a small rust unit test to check that the function produces same results
as the PostgreSQL macro, and extend the existing python test to cover
this bug.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
close https://github.com/neondatabase/neon/issues/4712
## Summary of changes
Previously, when flushing frozen layers, it was split into two
operations: add delta layer to disk + remove frozen layer from memory.
This would cause a short period of time where we will have the same data
both in frozen and delta layer. In this PR, we merge them into one
atomic operation in layer map manager, therefore simplifying the code.
Note that if we decide to create image layers for L0 flush, it will
still be split into two operations on layer map.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>