Commit Graph

6084 Commits

Author SHA1 Message Date
Alexey Masterov
3b6449fb7b Fix an error 2024-09-09 12:48:37 +02:00
Alexey Masterov
130066898e Fix an error 2024-09-09 12:09:55 +02:00
Alexey Masterov
6f2d7b4662 Try to avoid passwords in clean text 2024-09-09 11:58:18 +02:00
Alexey Masterov
8a00cc817c Revert "make the test fail for debug purposes"
This reverts commit 9d8ba21f65.
2024-09-06 15:54:31 +02:00
Alexey Masterov
7469656b72 Add regression.out to allure reports 2024-09-06 15:49:43 +02:00
Alexey Masterov
9d8ba21f65 make the test fail for debug purposes 2024-09-06 15:01:15 +02:00
Alexey Masterov
e54f8bc5ff Change the workdir to test_output_dir 2024-09-06 14:53:40 +02:00
Alexey Masterov
ac72832589 Change the runner 2024-09-06 14:33:35 +02:00
Alexey Masterov
2098184d67 Revert "Revert "Fix an error in the path""
This reverts commit c7f2a26cb9.
2024-09-06 13:56:20 +02:00
Alexey Masterov
c7f2a26cb9 Revert "Fix an error in the path"
This reverts commit ebdd187398.
2024-09-06 13:51:15 +02:00
Alexey Masterov
ebdd187398 Fix an error in the path 2024-09-06 13:36:49 +02:00
Alexey Masterov
6c679f722c Fix an error in the path 2024-09-06 13:27:05 +02:00
Alexey Masterov
d0cf670b76 Fix an error in the path 2024-09-06 13:19:06 +02:00
Alexey Masterov
6d66a2ebe7 Fix an error in the path 2024-09-06 13:01:43 +02:00
Alexey Masterov
a8d1cbe376 Change the directories calculation 2024-09-06 12:58:10 +02:00
Alexey Masterov
222f483ce8 Add a debug 2024-09-06 12:19:08 +02:00
Alexey Masterov
6f6d5f1ea3 Add AWS access keys 2024-09-06 12:03:54 +02:00
Alexey Masterov
7cd76ee351 add an allure report and slack posting 2024-09-06 11:52:04 +02:00
Alexey Masterov
0510676a3f Some refactoring 2024-09-06 11:30:21 +02:00
Alexey Masterov
c7d9eda56a Some refactoring 2024-09-06 11:25:59 +02:00
Alexey Masterov
6140e3b6b1 Some refactoring 2024-09-06 11:09:26 +02:00
Alexey Masterov
74eec88125 Some refactoring 2024-09-06 11:08:33 +02:00
Alexey Masterov
195c7a359d Some refactoring 2024-09-06 11:06:43 +02:00
Alexey Masterov
8bb0e97880 Some refactoring 2024-09-06 11:03:29 +02:00
Alexey Masterov
243db8ab4a Some refactoring 2024-09-05 17:06:56 +02:00
a-masterov
815d7d6ab1 Merge branch 'main' into amasterov/regress-arm 2024-09-05 15:30:05 +02:00
Joonas Koivunen
efe03d5a1c build: sync between benchies (#8919)
Sometimes, the benchmarks fail to start up pageserver in 10s without any
obvious reason. Benchmarks run sequentially on otherwise idle runners.
Try running `sync(2)` after each bench to force a cleaner slate.

Implement this via:
- an autouse fixture enabled by the SYNC_AFTER_EACH_TEST environment variable
- autouse fixture seems to be outermost fixture, so it works as expected
- set SYNC_AFTER_EACH_TEST=true for benchmarks in build_and_test
workflow

Evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10678984691/index.html#suites/5008d72a1ba3c0d618a030a938fc035c/1210266507534c0f/

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-09-05 14:29:48 +01:00
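
The mechanism described in efe03d5a1c could be sketched roughly as follows. This is a stdlib-only approximation, not the code from the commit: `sync_requested` and `sync_after_test` are hypothetical names, and the only detail taken from the message is that `SYNC_AFTER_EACH_TEST=true` enables the behavior. In a pytest suite the same generator body would presumably sit behind `@pytest.fixture(autouse=True)` instead of a context manager.

```python
import os
from contextlib import contextmanager
from typing import Callable, Iterator, Mapping, Optional


def sync_requested(environ: Mapping[str, str]) -> bool:
    """Return True when SYNC_AFTER_EACH_TEST is set to "true" (case-insensitive)."""
    return environ.get("SYNC_AFTER_EACH_TEST", "").strip().lower() == "true"


@contextmanager
def sync_after_test(
    environ: Optional[Mapping[str, str]] = None,
    sync: Optional[Callable[[], None]] = None,
) -> Iterator[None]:
    """Run the wrapped test body, then flush dirty pages to disk if requested.

    Hypothetical helper: `sync` defaults to os.sync(), Python's wrapper
    around sync(2) on Unix. It is injectable so the logic can be unit tested.
    """
    env = os.environ if environ is None else environ
    try:
        yield
    finally:
        # Runs even if the test body raised, as fixture teardown would.
        if sync_requested(env):
            (os.sync if sync is None else sync)()
```

Wrapping a benchmark body in `with sync_after_test(): ...` then calls `os.sync()` once the body finishes, but only when the variable is set.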
Christian Schwarz
850421ec06 refactor(pageserver): rely on serde derive for toml deserialization (#7656)
This PR simplifies the pageserver configuration parsing as follows:

* introduce the `pageserver_api::config::ConfigToml` type
* implement `Default` for `ConfigToml`
* use serde derive to do the brain-dead leg-work of processing the toml
document
  * use `serde(default)` to fill in default values
* in `pageserver` crate:
  * use `toml_edit` to deserialize the pageserver.toml string into a
    `ConfigToml`
  * `PageServerConfig::parse_and_validate` then
    * consumes the `ConfigToml`
    * destructures it exhaustively into its constituent fields
    * constructs the `PageServerConfig`

The rules are:

* in `ConfigToml`, use `deny_unknown_fields` everywhere
* static default values go in `pageserver_api`
* if there cannot be a static default value (e.g. which default IO
engine to use, because it depends on the runtime), make the field in
`ConfigToml` an `Option`
* if runtime-augmentation of a value is needed, do that in
`parse_and_validate`
  * a good example is `virtual_file_io_engine` or `l0_flush`, both of
    which need to execute code to determine the effective value in
    `PageServerConf`

The benefits:

* massive amount of brain-dead repetitive code can be deleted
* "unused variable" compile-time errors when removing a config value,
due to the exhaustive destructuring in `parse_and_validate`
* compile-time errors guide you when adding a new config field

Drawbacks:

* serde derive is sometimes a bit too magical
* `deny_unknown_fields` is easy to miss

Future Work / Benefits:
* make `neon_local` use `pageserver_api` to construct `ConfigToml` and
write it to `pageserver.toml`
* This provides more type safety / compile-time errors than the current
approach.

### Refs

Fixes #3682 

### Future Work

* `remote_storage` deser doesn't reject unknown fields
https://github.com/neondatabase/neon/issues/8915
* clean up `libs/pageserver_api/src/config.rs` further
  * break up into multiple files, at least for tenant config
* move `models` as appropriate / refine distinction between config and
API models / be explicit about when it's the same
  * use `pub(crate)` visibility on `mod defaults` to detect stale values
2024-09-05 14:59:49 +02:00
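
The rules listed in 850421ec06 can be illustrated with a small analogy. The sketch below is Python with dataclasses, not the Rust serde code from the PR, and it only mirrors three of the ideas: static defaults live on the raw config type, unknown fields are rejected, and a value without a static default is optional until `parse_and_validate` resolves it. `listen_addr`, its default, `load_config_toml`, and `runtime_default_engine` are assumptions made for illustration; Python also has no counterpart to the compile-time errors that exhaustive destructuring gives in Rust.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ConfigToml:
    """Raw, file-level view of the config; static defaults live here."""
    listen_addr: str = "127.0.0.1:9898"           # hypothetical field and default
    virtual_file_io_engine: Optional[str] = None  # no static default: resolved at runtime


@dataclass(frozen=True)
class PageServerConf:
    """Validated, runtime-level config with every value resolved."""
    listen_addr: str
    virtual_file_io_engine: str


def load_config_toml(raw: Mapping[str, Any]) -> ConfigToml:
    """Build ConfigToml from already-parsed TOML, rejecting unknown keys."""
    known = {f.name for f in fields(ConfigToml)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"unknown config fields: {unknown}")
    return ConfigToml(**raw)


def parse_and_validate(cfg: ConfigToml, runtime_default_engine: str) -> PageServerConf:
    """Resolve runtime-dependent values and produce the final config."""
    engine = cfg.virtual_file_io_engine or runtime_default_engine
    return PageServerConf(listen_addr=cfg.listen_addr, virtual_file_io_engine=engine)
```

With this shape, a misspelled key fails loudly at load time instead of being silently ignored, which is the property `deny_unknown_fields` is meant to provide.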
Folke Behrens
6dfbf49128 proxy: don't let one timeout eat entire retry budget (#8924)
This reduces the per-request timeout to 10sec while keeping the total
retry duration at 1min.

Relates: neondatabase/cloud#15944
2024-09-05 13:34:27 +02:00
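
The idea behind 6dfbf49128 (a 10 s cap per request inside a 1 min total retry window) can be sketched as below. This is an illustrative Python loop, not the proxy's implementation: `retry_with_budget`, the `attempt` callback, and the injectable `clock` are hypothetical, and backoff between attempts is omitted.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_budget(
    attempt: Callable[[float], T],
    per_request_timeout: float = 10.0,
    total_budget: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call attempt(timeout) until it succeeds or the total budget is used up.

    `attempt` is expected to raise TimeoutError when it exceeds the
    timeout it is given.
    """
    deadline = clock() + total_budget
    last_error: Optional[TimeoutError] = None
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("retry budget exhausted") from last_error
        try:
            # No single attempt may use more than its own cap,
            # nor more than what is left of the overall budget.
            return attempt(min(per_request_timeout, remaining))
        except TimeoutError as err:
            last_error = err
```

With these defaults a request that always times out is tried up to six times before the caller sees an error, rather than a single attempt consuming the whole minute.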
Alexey Masterov
226464e6b5 Fix format 2024-09-05 12:53:39 +02:00
Alexey Masterov
e4dc7fe4a5 Remove running the cloud test on a pull request 2024-09-05 12:28:07 +02:00
Alexey Masterov
7a324f84e4 Fix Line 2024-09-05 12:22:13 +02:00
Alexey Masterov
b54a919d51 Fix Line 2024-09-05 12:19:32 +02:00
Alexey Masterov
afd25c896c Get rid of redundant local variables 2024-09-05 12:14:54 +02:00
Alexey Masterov
99f9ab2c07 Fix regex 2024-09-05 12:04:16 +02:00
Alexey Masterov
e8676ffff7 Remove regress.so from image as we use the extension for this now 2024-09-05 11:58:13 +02:00
Alexey Masterov
9e61284d10 fix mypy warnings 2024-09-05 11:55:23 +02:00
Alexey Masterov
288388f14e remove the temp script 2024-09-05 11:11:26 +02:00
Alexey Masterov
bfb7bf92f2 fix linters' warnings 2024-09-05 11:07:51 +02:00
Vlad Lazar
708322ce3c storcon: handle fills including high tput tenants more gracefully (#8865)
## Problem
A tenant may ingest a lot of data between being drained for node restart
and being moved back in the fill phase. This is expensive and causes the
fill to stall.

## Summary of changes
We make a tactical change to reduce secondary warm-up time for
migrations in fills.
2024-09-05 09:56:26 +01:00
Alexey Masterov
f8c9966aff modify the patch 2024-09-05 10:10:54 +02:00
Alexey Masterov
2e1725c570 modify the patch 2024-09-05 09:56:48 +02:00
Alex Chi Z.
99fa1c3600 fix(pageserver): more information on aux v1 warnings (#8906)
Part of https://github.com/neondatabase/neon/issues/8623

## Summary of changes

It seems that we have tenants with aux policy set to v1 that don't have
any aux files in the storage. It is still safe to force migrate them
without notifying the customers. This patch adds more details to the
warning to identify the cases where we have to reach out to the users
before retiring aux v1.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-09-04 21:45:04 +01:00
Alexey Masterov
9414976c4c uncomment the extension creation 2024-09-04 17:36:48 +02:00
Heikki Linnakangas
0205ce1849 Update submodule reference for vendor/postgres-v14 (#8913)
There was confusion on the REL_14_STABLE_neon branch. PR
https://github.com/neondatabase/postgres/pull/471 was merged to the
branch, but the corresponding PRs on the other REL_15_STABLE_neon and
REL_16_STABLE_neon branches were not merged. Also, the submodule
reference in the neon repository was never updated, so even though the
REL_14_STABLE_neon branch contained the commit, it was never used.

That PR https://github.com/neondatabase/postgres/pull/471 was a few
bricks shy of a load (no tests, some differences between the different
branches), so to get us to a good state, revert that change from the
REL_14_STABLE_neon branch. This PR in the neon repository updates the
submodule reference past two commits on the REL_14_STABLE_neon branch:
first the commit from PR
https://github.com/neondatabase/postgres/pull/471, and immediately after
that the revert of the same commit. This brings us back to square one,
but now the submodule reference matches the tip of the
REL_14_STABLE_neon branch again.
2024-09-04 15:41:51 +01:00
John Spray
1a9b54f1d9 storage controller: read from database in validate API (#8784)
## Problem

The initial implementation of the validate API treats the in-memory
generations as authoritative.
- This is true when only one storage controller is running, but if a
rogue controller was running that hadn't been shut down properly, and
some pageserver requests were routed to that bad controller, it could
incorrectly return valid=true for stale generations.
- The generation in the main in-memory map gets out of date while a live
migration is in flight, and if the origin location for the migration
tries to do some deletions even though it is in AttachedStale (for
example because it had already started compaction), these might be
wrongly validated + executed.

## Summary of changes

- Continue to do the in-memory check: if this returns valid=false it is
sufficient to reject requests.
- When valid=true, do an additional read from the database to confirm
the generation is fresh.
- Revise behavior for validation on missing shards: this used to always
return valid=true as a convenience for deletions and shard splits, so
that pageservers weren't prevented from completing any enqueued
deletions for these shards after they're gone. However, this becomes
unsafe when we consider split brain scenarios. We could reinstate this
in future if we wanted to store some tombstones for deleted shards.
- Update test_scrubber_physical_gc to cope with the behavioral change:
they must now explicitly flush the deletion queue before splits, to
avoid tripping up on deletions that are enqueued at the time of the
split (these tests assert "scrubber deletes nothing", a check that fails
if the split leaves behind some remote objects that are legitimately
GC'able)
- Add `test_storage_controller_validate_during_migration`, which uses
failpoints to create a situation where incorrect generation validation
during a live migration could result in a corruption

The rate of validate calls for tenants is pretty low: it happens as a
consequence of deletions from GC and compaction, which are both
concurrency-limited on the pageserver side.
2024-09-04 15:00:40 +01:00
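
The two-step check described in 1a9b54f1d9 can be sketched as follows. This is a simplified Python model under stated assumptions, not the storage controller's code: `validate_generation` is a hypothetical name, shards are keyed by plain strings, and the in-memory map and the database are both modeled as dictionaries from shard to generation number.

```python
from typing import Mapping, Optional


def validate_generation(
    shard: str,
    claimed: int,
    in_memory: Mapping[str, int],
    database: Mapping[str, int],
) -> bool:
    """Return True only if `claimed` is the current generation for `shard`."""
    cached: Optional[int] = in_memory.get(shard)
    # Missing shards are no longer treated as valid, and a negative
    # in-memory answer is sufficient on its own to reject the request.
    if cached is None or cached != claimed:
        return False
    # A positive in-memory answer may be stale (rogue controller, live
    # migration in flight), so confirm it against the database.
    return database.get(shard) == claimed
```

The database read only happens on the `valid=true` path, so rejected requests stay cheap while accepted ones pay for one extra lookup.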
Alexey Masterov
777c01938d fix 2024-09-04 15:42:19 +02:00
Alexey Masterov
302a2203a1 change path 2024-09-04 15:27:36 +02:00
Alexey Masterov
bc1697ab28 change path 2024-09-04 15:18:22 +02:00
Alexey Masterov
61f3ac3fbf change path 2024-09-04 14:58:41 +02:00