Commit Graph

1589 Commits

Author SHA1 Message Date
Alexey Masterov
b45560db75 Fix the error 2024-09-10 13:18:47 +02:00
Alexey Masterov
c4d98915ff Refactoring 2024-09-10 13:12:46 +02:00
Alexey Masterov
9ac06ea3d9 Debug 2024-09-10 13:04:04 +02:00
Alexey Masterov
841b39f7c5 Some refactoring 2024-09-10 12:52:46 +02:00
Alexey Masterov
fe8fee0b88 Add debug 2024-09-10 12:26:22 +02:00
Alexey Masterov
dbde226f38 Add debug 2024-09-10 12:21:09 +02:00
Alexey Masterov
01c37c6c6c Refactor; delete roles accidentally left in the project 2024-09-10 12:04:15 +02:00
Alexey Masterov
e989bf1887 remove unused import os 2024-09-10 11:17:55 +02:00
Alexey Masterov
287e05f49d Fix the error 2024-09-09 16:22:04 +02:00
Alexey Masterov
650fb7b2d7 Drop subscriptions if exist 2024-09-09 16:18:26 +02:00
Alexey Masterov
7469656b72 Add regression.out to allure reports 2024-09-06 15:49:43 +02:00
Alexey Masterov
e54f8bc5ff Change the workdir to test_output_dir 2024-09-06 14:53:40 +02:00
Alexey Masterov
2098184d67 Revert "Revert "Fix an error in the path""
This reverts commit c7f2a26cb9.
2024-09-06 13:56:20 +02:00
Alexey Masterov
c7f2a26cb9 Revert "Fix an error in the path"
This reverts commit ebdd187398.
2024-09-06 13:51:15 +02:00
Alexey Masterov
ebdd187398 Fix an error in the path 2024-09-06 13:36:49 +02:00
Alexey Masterov
6c679f722c Fix an error in the path 2024-09-06 13:27:05 +02:00
Alexey Masterov
d0cf670b76 Fix an error in the path 2024-09-06 13:19:06 +02:00
Alexey Masterov
6d66a2ebe7 Fix an error in the path 2024-09-06 13:01:43 +02:00
Alexey Masterov
a8d1cbe376 Change the directories calculation 2024-09-06 12:58:10 +02:00
Alexey Masterov
222f483ce8 Add a debug 2024-09-06 12:19:08 +02:00
Alexey Masterov
c7d9eda56a Some refactoring 2024-09-06 11:25:59 +02:00
Alexey Masterov
195c7a359d Some refactoring 2024-09-06 11:06:43 +02:00
Alexey Masterov
8bb0e97880 Some refactoring 2024-09-06 11:03:29 +02:00
a-masterov
815d7d6ab1 Merge branch 'main' into amasterov/regress-arm 2024-09-05 15:30:05 +02:00
Joonas Koivunen
efe03d5a1c build: sync between benchies (#8919)
Sometimes, the benchmarks fail to start up pageserver in 10s without any
obvious reason. Benchmarks run sequentially on otherwise idle runners.
Try running `sync(2)` after each bench to force a cleaner slate.

Implement this via:
- an autouse fixture enabled by the SYNC_AFTER_EACH_TEST environment variable
- the autouse fixture appears to be the outermost fixture, so it works as expected
- set SYNC_AFTER_EACH_TEST=true for benchmarks in the build_and_test workflow

Evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10678984691/index.html#suites/5008d72a1ba3c0d618a030a938fc035c/1210266507534c0f/

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-09-05 14:29:48 +01:00
Christian Schwarz
850421ec06 refactor(pageserver): rely on serde derive for toml deserialization (#7656)
This PR simplifies the pageserver configuration parsing as follows:

* introduce the `pageserver_api::config::ConfigToml` type
* implement `Default` for `ConfigToml`
* use serde derive to do the brain-dead leg-work of processing the toml document
  * use `serde(default)` to fill in default values
* in `pageserver` crate:
  * use `toml_edit` to deserialize the pageserver.toml string into a `ConfigToml`
  * `PageServerConfig::parse_and_validate` then
    * consumes the `ConfigToml`
    * destructures it exhaustively into its constituent fields
    * constructs the `PageServerConfig`

The rules are:

* in `ConfigToml`, use `deny_unknown_fields` everywhere
* static default values go in `pageserver_api`
* if there cannot be a static default value (e.g. which default IO
engine to use, because it depends on the runtime), make the field in
`ConfigToml` an `Option`
* if runtime-augmentation of a value is needed, do that in
`parse_and_validate`
* good examples are `virtual_file_io_engine` and `l0_flush`, both of which need to execute code to determine the effective value in `PageServerConf`

The benefits:

* a massive amount of brain-dead repetitive code can be deleted
* "unused variable" compile-time errors when removing a config value,
due to the exhaustive destructuring in `parse_and_validate`
* compile-time errors guide you when adding a new config field

Drawbacks:

* serde derive is sometimes a bit too magical
* `deny_unknown_fields` is easy to miss

Future Work / Benefits:
* make `neon_local` use `pageserver_api` to construct `ConfigToml` and
write it to `pageserver.toml`
* This provides more type safety / compile-time errors than the current approach.

### Refs

Fixes #3682 

### Future Work

* `remote_storage` deser doesn't reject unknown fields
https://github.com/neondatabase/neon/issues/8915
* clean up `libs/pageserver_api/src/config.rs` further
  * break up into multiple files, at least for tenant config
* move `models` as appropriate / refine distinction between config and
API models / be explicit about when it's the same
  * use `pub(crate)` visibility on `mod defaults` to detect stale values
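
To make the rules above concrete, here is a minimal, self-contained sketch of the pattern (field names, defaults, and error handling are illustrative rather than the actual neon types, and the `toml` crate stands in for `toml_edit`):

```rust
// Minimal sketch only: names and default values are hypothetical, not neon's.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(default, deny_unknown_fields)]
struct ConfigToml {
    listen_pg_addr: String,
    // No static default is possible (the effective value depends on the runtime),
    // so keep it as an Option and resolve it in a `parse_and_validate`-style step.
    virtual_file_io_engine: Option<String>,
}

impl Default for ConfigToml {
    fn default() -> Self {
        Self {
            listen_pg_addr: "127.0.0.1:64000".to_string(),
            virtual_file_io_engine: None,
        }
    }
}

fn main() {
    // Missing fields fall back to `Default` (serde(default)); unknown fields are
    // rejected (deny_unknown_fields).
    let input = r#"listen_pg_addr = "0.0.0.0:6400""#;
    let config: ConfigToml = toml::from_str(input).expect("invalid pageserver.toml");

    // Exhaustive destructuring: adding or removing a field in ConfigToml becomes a
    // compile-time error here, mirroring the parse_and_validate step.
    let ConfigToml { listen_pg_addr, virtual_file_io_engine } = config;
    let io_engine = virtual_file_io_engine.unwrap_or_else(|| "runtime-default".to_string());
    println!("listen_pg_addr={listen_pg_addr} io_engine={io_engine}");
}
```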
2024-09-05 14:59:49 +02:00
Alexey Masterov
226464e6b5 Fix format 2024-09-05 12:53:39 +02:00
Alexey Masterov
7a324f84e4 Fix Line 2024-09-05 12:22:13 +02:00
Alexey Masterov
b54a919d51 Fix Line 2024-09-05 12:19:32 +02:00
Alexey Masterov
afd25c896c Get rid of redundant local variables 2024-09-05 12:14:54 +02:00
Alexey Masterov
99f9ab2c07 Fix regex 2024-09-05 12:04:16 +02:00
Alexey Masterov
9e61284d10 fix mypy warnings 2024-09-05 11:55:23 +02:00
Alexey Masterov
bfb7bf92f2 fix linters' warnings 2024-09-05 11:07:51 +02:00
Alexey Masterov
9414976c4c uncomment the extension creation 2024-09-04 17:36:48 +02:00
John Spray
1a9b54f1d9 storage controller: read from database in validate API (#8784)
## Problem

The initial implementation of the validate API treats the in-memory
generations as authoritative.
- This is safe when only one storage controller is running, but if a rogue
controller that hadn't been shut down properly was still running, and some
pageserver requests were routed to that bad controller, it could incorrectly
return valid=true for stale generations.
- The generation in the main in-memory map gets out of date while a live
migration is in flight, and if the origin location for the migration
tries to do some deletions even though it is in AttachedStale (for
example because it had already started compaction), these might be
wrongly validated + executed.

## Summary of changes

- Continue to do the in-memory check: if this returns valid=false it is
sufficient to reject requests.
- When valid=true, do an additional read from the database to confirm
the generation is fresh.
- Revise behavior for validation on missing shards: this used to always
return valid=true as a convenience for deletions and shard splits, so
that pageservers weren't prevented from completing any enqueued
deletions for these shards after they're gone. However, this becomes
unsafe when we consider split brain scenarios. We could reinstate this
in future if we wanted to store some tombstones for deleted shards.
- Update test_scrubber_physical_gc to cope with the behavioral change: the tests
must now explicitly flush the deletion queue before splits, to avoid tripping up
on deletions that are enqueued at the time of the split (these tests assert
"scrubber deletes nothing", a check that fails if the split leaves behind some
remote objects that are legitimately GC'able)
- Add `test_storage_controller_validate_during_migration`, which uses
failpoints to create a situation where incorrect generation validation
during a live migration could result in a corruption

The rate of validate calls for tenants is pretty low: they happen as a
consequence of deletions from GC and compaction, which are both
concurrency-limited on the pageserver side.
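
A rough sketch of the resulting two-phase check (the types and the database accessor are illustrative stand-ins, not the storage controller's real code):

```rust
use std::collections::HashMap;

type TenantShardId = String;
type Generation = u32;

struct Validator {
    // In-memory view of the latest issued generations; may lag during a live migration.
    in_memory: HashMap<TenantShardId, Generation>,
}

impl Validator {
    fn validate(
        &self,
        shard: &TenantShardId,
        requested: Generation,
        db_generation: impl Fn(&TenantShardId) -> Option<Generation>, // stand-in for the DB read
    ) -> bool {
        // Missing shards are no longer treated as valid: that convenience is unsafe
        // under split-brain, so enqueued deletions for deleted shards get rejected.
        let Some(&mem_gen) = self.in_memory.get(shard) else {
            return false;
        };
        // Phase 1: the in-memory check alone is sufficient to reject.
        if requested != mem_gen {
            return false;
        }
        // Phase 2: the in-memory map can be stale (e.g. mid-migration), so only a
        // fresh read from the database may confirm valid=true.
        db_generation(shard) == Some(requested)
    }
}

fn main() {
    let validator = Validator {
        in_memory: HashMap::from([("tenant-shard-0".to_string(), 3)]),
    };
    // Looks current in memory, but the database has already issued generation 4 to a
    // newer attachment, so the stale holder's deletions are not validated.
    let db = |_: &TenantShardId| Some(4);
    assert!(!validator.validate(&"tenant-shard-0".to_string(), 3, db));
}
```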
2024-09-04 15:00:40 +01:00
Alexey Masterov
777c01938d fix 2024-09-04 15:42:19 +02:00
Alexey Masterov
302a2203a1 change path 2024-09-04 15:27:36 +02:00
Alexey Masterov
bc1697ab28 change path 2024-09-04 15:18:22 +02:00
Alexey Masterov
61f3ac3fbf change path 2024-09-04 14:58:41 +02:00
Alexey Masterov
f7f0be8727 Temporarily disable the extension. 2024-09-04 14:55:02 +02:00
Joonas Koivunen
7a1397cf37 storcon: boilerplate to upsert safekeeper records on deploy (#8879)
We currently do not record safekeepers in the storage controller
database. We want to migrate timelines across safekeepers eventually, so
start recording the safekeepers on deploy.

Cc: #8698
2024-09-04 10:10:05 +00:00
Vlad Lazar
75310fe441 storcon: make hb interval an argument and speed up tests (#8880)
## Problem
Each test might wait for up to 5s in order to heartbeat the pageserver.

## Summary of changes
Make the heartbeat interval configurable and use a really tight one for
neon local => quicker startup
2024-09-04 10:09:41 +01:00
Arseny Sher
80512e2779 safekeeper: add endpoint resetting uploaded partial segment state.
The endpoint implementation sends a message to the manager requesting the reset.
The manager stops the current partial backup upload task, if it exists, and
performs the reset.

Also slightly tweak the eviction condition: all full segments before flush_lsn
must be uploaded (and committed), and there must be only one segment left on
disk (the partial one). This makes it possible to evict timelines that didn't
start on the first segment and didn't fill a whole segment (the previous
condition wasn't good because last_removed_segno was 0).

ref https://github.com/neondatabase/neon/issues/8759
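
A hedged sketch of the tweaked eviction condition (the names and the segment bookkeeping are made up for illustration; the real safekeeper tracks far more state):

```rust
struct TimelineState {
    flush_lsn: u64,
    /// Everything below this LSN has been uploaded to remote storage and committed.
    remote_committed_lsn: u64,
    /// Number of WAL segments currently present on local disk.
    segments_on_disk: usize,
}

fn segno(lsn: u64, segment_size: u64) -> u64 {
    lsn / segment_size
}

fn can_evict(t: &TimelineState, segment_size: u64) -> bool {
    // All *full* segments before flush_lsn are uploaded: the remote committed position
    // must have reached the (partial) segment that flush_lsn lives in.
    let full_segments_uploaded =
        segno(t.remote_committed_lsn, segment_size) >= segno(t.flush_lsn, segment_size);
    // Only the single partial segment may remain on disk. Counting segments directly,
    // rather than relying on last_removed_segno, also covers timelines that started on
    // a later segment and never filled a whole one.
    full_segments_uploaded && t.segments_on_disk == 1
}

fn main() {
    const SEG: u64 = 16 * 1024 * 1024;
    let timeline = TimelineState {
        flush_lsn: 42 * SEG + 100,
        remote_committed_lsn: 42 * SEG + 100,
        segments_on_disk: 1,
    };
    assert!(can_evict(&timeline, SEG));
}
```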
2024-09-03 17:21:36 +03:00
Vlad Lazar
c43e664ff5 storcon: provide an az id in metadata.json from neon local (#8897)
## Problem
The neon local setup does not inject an az id into `metadata.json`. See the real
change in https://github.com/neondatabase/neon/pull/8852.

## Summary of changes
We piggyback on the existing `availability_zone` pageserver
configuration in order to avoid making neon local even more complex.
2024-09-03 15:11:30 +01:00
Christian Schwarz
bf0531d107 fixup(#8839): test_forward_compatibility needs to allow lag warning as well (#8891)
Found in
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8885/10665614629/index.html#suites/0fbaeb107ef328d03993d44a1fb15690/ea10ba1c140fba1d
2024-09-02 15:10:10 +01:00
Arpad Müller
9746b6ea31 Implement archival_config timeline endpoint in the storage controller (#8680)
Implement the timeline-specific `archival_config` endpoint in the storage
controller as well.

It's mostly a copy-paste of the detach handler; the task is the same: do the
same operation on all shards.

Part of #8088.
2024-09-02 13:51:45 +02:00
Alexey Masterov
d4f656daa2 Change the python file 2024-09-02 09:07:11 +02:00
Arpad Müller
3ec785f30d Add safekeeper scrubber test (#8785)
The test is very rudimentary: it only checks that, before and after tenant
deletion, we can run `scan_metadata` for the safekeeper node kind. Also, we
don't actually expect any uploaded data, since we don't have enough WAL for
that (it would need to create at least one S3-uploaded file, and the scrubber
doesn't recognize partial files yet).

The `scan_metadata` scrubber subcommand is extended to support either specifying
a database connection string (previously the only way, which required a database
to be present) or specifying the timeline information manually via JSON. This is
ideal for testing scenarios, because the number of timelines there is usually
limited, while spinning up a database just to write the timeline information is
rather involved.
2024-08-31 01:12:25 +02:00
Alexey Masterov
8fb8ec57ea Add python script, rename patch file 2024-08-30 16:39:07 +02:00
Arpad Müller
96b5c4d33d Don't unarchive a timeline if its ancestor is archived (#8853)
If a timeline unarchival request comes in, return an error if the parent
timeline is archived. This prevents us from ending up with an archived timeline
whose children are not archived.

Follow up of #8824

Part of #8088
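
A toy sketch of the invariant this enforces (hypothetical types, not the pageserver's actual timeline model):

```rust
struct Timeline {
    archived: bool,
    ancestor: Option<Box<Timeline>>,
}

fn unarchive(timeline: &mut Timeline) -> Result<(), String> {
    // Refuse to unarchive while the ancestor is still archived, so we can never end up
    // with an archived timeline whose children are unarchived.
    if let Some(ancestor) = &timeline.ancestor {
        if ancestor.archived {
            return Err("ancestor timeline is archived; unarchive it first".to_string());
        }
    }
    timeline.archived = false;
    Ok(())
}

fn main() {
    let mut child = Timeline {
        archived: true,
        ancestor: Some(Box::new(Timeline { archived: true, ancestor: None })),
    };
    assert!(unarchive(&mut child).is_err());
}
```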

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-08-29 12:54:02 +00:00