Commit Graph

429 Commits

Author SHA1 Message Date
Arpad Müller
e310533ed3 Support JWT key reload in pageserver (#5594)
## Problem

For quickly rotating JWT secrets, we want to be able to reload the JWT
public key file in the pageserver, and also support multiple JWT keys.

See #4897.

## Summary of changes

* Allow directories for the `auth_validation_public_key_path` config
param instead of just files. for the safekeepers, all of their config options
also support multiple JWT keys.
* For the pageservers, make the JWT public keys easily globally swappable
by using the `arc-swap` crate.
* Add an endpoint to the pageserver, triggered by a POST to
`/v1/reload_auth_validation_keys`, that reloads the JWT public keys from
the pre-configured path (for security reasons, you cannot upload any
keys yourself).

Fixes #4897

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 15:43:29 +01:00
John Spray
85cd97af61 pageserver: add InProgress tenant map state, use a sync lock for the map (#5367)
## Problem

Follows on from #5299 
- We didn't have a generic way to protect a tenant undergoing changes:
`Tenant` had states, but for our arbitrary transitions between
secondary/attached, we need a general way to say "reserve this tenant
ID, and don't allow any other ops on it, but don't try and report it as
being in any particular state".
- The TenantsMap structure was behind an async RwLock, but it was never
correct to hold it across await points: that would block any other
changes for all tenants.


## Summary of changes

- Add the `TenantSlot::InProgress` value.  This means:
  - Incoming administrative operations on the tenant should retry later
- Anything trying to read the live state of the tenant (e.g. a page
service reader) should retry later or block.
- Store TenantsMap in `std::sync::RwLock`
- Provide an extended `get_active_tenant_with_timeout` for page_service
to use, which will wait on InProgress slots as well as non-active
tenants.

Closes: https://github.com/neondatabase/neon/issues/5378

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-11-06 14:03:22 +00:00
duguorong009
09b5954526 refactor: use streaming in safekeeper /v1/debug_dump http response (#5731)
- Update the handler for `/v1/debug_dump` http response in safekeeper
- Update the `debug_dump::build()` to use the streaming in JSON build
process
2023-11-05 10:16:54 +00:00
John Spray
306c4f9967 s3_scrubber: prepare for scrubbing buckets with generation-aware content (#5700)
## Problem

The scrubber didn't know how to find the latest index_part when
generations were in use.

## Summary of changes

- Teach the scrubber to do the same dance that pageserver does when
finding the latest index_part.json
- Teach the scrubber how to understand layer files with generation
suffixes.
- General improvement to testability: scan_metadata has a machine
readable output that the testing `S3Scrubber` wrapper can read.
- Existing test coverage of scrubber was false-passing because it just
didn't see any data due to prefixing of data in the bucket. Fix that.

This is incremental improvement: the more confidence we can have in the
scrubber, the more we can use it in integration tests to validate the
state of remote storage.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-11-03 17:36:02 +00:00
John Spray
e5c81fef86 tests: minor improvements (#5674)
Minor changes from while I have been working on HA tests:
- Manual pytest executions came with some warnings from `log.warn()`
usage
- When something fails in a generations-enabled test, it it useful to
have a log from the attachment service of what attached when, and with
which generation.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-31 11:44:35 +00:00
Gleb Novikov
a5292f7e67 Some minor renames in attachment service API (#5687)
## Problem

## Summary of changes

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] ~~If it is a core feature, I have added thorough tests.~~
- [ ] ~~Do we need to implement analytics? if so did you add the
relevant metrics to the dashboard?~~
- [ ] ~~If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.~~

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2023-10-27 12:36:34 +01:00
Sasha Krassovsky
116c342cad Support changing pageserver dynamically (#5542)
## Problem
We currently require full restart of compute if we change the pageserver
url
## Summary of changes
Makes it so that we don't have to do a full restart, but can just send
SIGHUP
2023-10-26 10:56:07 -07:00
John Spray
de90bf4663 pageserver: always load remote metadata (no more spawn_load) (#5580)
## Problem

The pageserver had two ways of loading a tenant:
- `spawn_load` would trust on-disk content to reflect all existing
timelines
- `spawn_attach` would list timelines in remote storage.

It was incorrect for `spawn_load` to trust local disk content, because
it doesn't know if the tenant might have been attached and written
somewhere else. To make this correct would requires some generation
number checks, but the payoff is to avoid one S3 op per tenant at
startup, so it's not worth the complexity -- it is much simpler to have
one way to load a tenant.

## Summary of changes

- `Tenant` objects are always created with `Tenant::spawn`: there is no
more distinction between "load" and "attach".
- The ability to run without remote storage (for `neon_local`) is
preserved by adding a branch inside `attach` that uses a fallback
`load_local` if no remote_storage is present.
- Fix attaching a tenant when it has a timeline with no IndexPart: this
can occur if a newly created timeline manages to upload a layer before
it has uploaded an index.
- The attach marker file that used to indicate whether a tenant should
be "loaded" or "attached" is no longer needed, and is removed.
- The GenericRemoteStorage interface gets a `list()` method that maps
more directly to what ListObjects does, returning both keys and common
prefixes. The existing `list_files` and `list_prefixes` methods are just
calls into `list()` now -- these can be removed later if we would like
to shrink the interface a bit.
- The remote deletion marker is moved into `timelines/` and detected as
part of listing timelines rather than as a separate GET request. If any
existing tenants have a marker in the old location (unlikely, only
happens if something crashes mid-delete), then they will rely on the
control plane retrying to complete their deletion.
- Revise S3 calls for timeline listing and tenant load to take a
cancellation token, and retry forever: it never makes sense to make a
Tenant broken because of a transient S3 issue.

## Breaking changes

- The remote deletion marker is moved from `deleted` to
`timelines/deleted` within the tenant prefix. Markers in the old
location will be ignored: it is the control plane's responsibility to
retry deletions until they succeed. Markers in the new location will be
tolerated by the previous release of pageserver via
https://github.com/neondatabase/neon/pull/5632
- The local `attaching` marker file is no longer written. Therefore, if
the pageserver is downgraded after running this code, the old pageserver
will not be able to distinguish between partially attached tenants and
fully attached tenants. This would only impact tenants that were partway
through attaching at the moment of downgrade. In the unlikely even t
that we do experience an incident that prompts us to roll back, then we
may check for attach operations in flight, and manually insert
`attaching` marker files as needed.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-10-26 14:48:44 +01:00
John Spray
8360307ea0 pageserver: exponential backoff on compaction/GC failures (#5672)
Previously, if walredo process crashed we would try to spawn a fresh one
every 2 seconds, which is expensive in itself, but also results in a
high I/O load from the part of the compaction prior to the failure,
which we re-run every 2 seconds.

Closes: https://github.com/neondatabase/neon/issues/5671
2023-10-26 14:00:26 +01:00
MMeent
6129077d31 WALRedo: Limit logging to log_level = ERROR and above (#5587)
This fixes issues in pageserver's walredo process where WALRedo
logs of loglevel=LOG are interpreted as errors.

## Problem

See #5560

## Summary of changes

Set the log level to something that doesn't include LOG.
2023-10-26 12:21:41 +01:00
Joonas Koivunen
c508d3b5fa reimpl Layer, remove remote layer, trait Layer, trait PersistentLayer (#4938)
Implement a new `struct Layer` abstraction which manages downloadness
internally, requiring no LayerMap locking or rewriting to download or
evict providing a property "you have a layer, you can read it". The new
`struct Layer` provides ability to keep the file resident via a RAII
structure for new layers which still need to be uploaded. Previous
solution solved this `RemoteTimelineClient::wait_completion` which lead
to bugs like #5639. Evicting or the final local deletion after garbage
collection is done using Arc'd value `Drop`.

With a single `struct Layer` the closed open ended `trait Layer`, `trait
PersistentLayer` and `struct RemoteLayer` are removed following noting
that compaction could be simplified by simply not using any of the
traits in between: #4839.

The new `struct Layer` is a preliminary to remove
`Timeline::layer_removal_cs` documented in #4745.

Preliminaries: #4936, #4937, #5013, #5014, #5022, #5033, #5044, #5058,
#5059, #5061, #5074, #5103, epic #5172, #5645, #5649. Related split off:
#5057, #5134.
2023-10-26 12:36:38 +03:00
Arpad Müller
a673e4e7a9 Optionally return json from get_lsn_by_timestamp (#5608)
This does two things: first a minor refactor to not use HTTP/1.x
style header names and also to not panic if some certain requests had no
"Accept" header. As a second thing, it addresses the third bullet point
from #3689:

> Change `get_lsn_by_timestamp` API method to return LSN even if we only
found commit before the specified timestamp.

This is done by adding a version parameter to the `get_lsn_by_timestamp`
API call and making its behaviour depend on the version number.

Part of #3414 (but doesn't address it in its entirety).

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-25 18:46:34 +02:00
John Spray
8b8be7bed4 tests: don't fail tests on torn log lines (#5655)
## Problem

Tests that force-kill and restart a service can generate torn log lines
that might match WARN|ERROR, but not match the allow expression that a
test has loaded, e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5651/6638398772/index.html#suites/7538959189f4501983ddd9e167836c8b/d272ba8a73e6945c

## Summary of changes

Ignore log lines which match a regex for torn log lines on restart: they
have two timestamps and the second line is an "INFO version"... message.
2023-10-25 13:29:30 +01:00
Christian Schwarz
11e523f503 walredo: fix EGAGAIN/"os error 11" false page reconstruction failures (#5560)
Stacked atop https://github.com/neondatabase/neon/pull/5559

Before this PR, there was the following race condition:

```
T1: polls for writeable stdin
T1: writes to stdin
T1: enters poll for stdout/stderr
T2: enters poll for stdin write
WALREDO: writes to stderr
KERNEL: wakes up T1 and T2
Tx: reads stderr and prints it
Ty: reads stderr and gets EAGAIN
(valid values for (x, y) are (1, 2) or (2, 1))
```

The concrete symptom that we observed repeatedly was with PG16,
which started logging `registered custom resource manager`
to stderr always, during startup, thereby giving us repeated
opportunity to hit above race condition. PG14 and PG15 didn't log
anything to stderr, hence we could have only hit this race condition
if there was an actual error happening.

This PR fixes the race by moving the reading of stderr into a tokio
task. It exits when the stderr is closed by the child process, which
in turn happens when the child exits, either by itself or because
we killed it.

The downside is that the async scheduling can reorder the log messages,
which can be seen in the new `test_stderr`, which runs in a
single-threaded runtime. I included the output below.

Overall I think we should move the entire walredo to async, as Joonas
proposed many months ago. This PR's asyncification is just the first
step to resolve these
false page reconstruction errors.

After this is fixed, we should stop printing that annoying stderr
message
on walredo startup; it causes noise in the pageserver logs.
That work is tracked in #5399 .

```
2023-10-13T19:05:21.878858Z ERROR apply_wal_records{tenant_id=d546fb76ba529195392fb4d19e243991 pid=753986}: failed to write out the walredo errored input: No such file or directory (os error 2) target=walredo-1697223921878-1132-0.walredo length=1132
2023-10-13T19:05:21.878932Z DEBUG postgres applied 2 WAL records (1062 bytes) in 114666 us to reconstruct page image at LSN 0/0
2023-10-13T19:05:21.878942Z ERROR error applying 2 WAL records 0/16A9388..0/16D4080 (1062 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/0 n_attempts=0: apply_wal_records

Caused by:
    WAL redo process closed its stdout unexpectedly
2023-10-13T19:05:21.879027Z  INFO kill_and_wait_impl{pid=753986}: wait successful exit_status=signal: 11 (SIGSEGV) (core dumped)
2023-10-13T19:05:21.879079Z DEBUG wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task started
2023-10-13T19:05:21.879104Z ERROR wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: received output output="2023-10-13 19:05:21.769 GMT [753986] LOG:  registered custom resource manager \"neon\" with ID 134\n"
2023-10-13T19:05:21.879116Z DEBUG wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task finished
2023-10-13T19:05:22.004439Z ERROR apply_wal_records{tenant_id=d546fb76ba529195392fb4d19e243991 pid=754000}: failed to write out the walredo errored input: No such file or directory (os error 2) target=walredo-1697223922004-1132-0.walredo length=1132
2023-10-13T19:05:22.004493Z DEBUG postgres applied 2 WAL records (1062 bytes) in 125344 us to reconstruct page image at LSN 0/0
2023-10-13T19:05:22.004501Z ERROR error applying 2 WAL records 0/16A9388..0/16D4080 (1062 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/0 n_attempts=1: apply_wal_records

Caused by:
    WAL redo process closed its stdout unexpectedly
2023-10-13T19:05:22.004588Z  INFO kill_and_wait_impl{pid=754000}: wait successful exit_status=signal: 11 (SIGSEGV) (core dumped)
2023-10-13T19:05:22.004624Z DEBUG wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task started
2023-10-13T19:05:22.004653Z ERROR wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: received output output="2023-10-13 19:05:21.884 GMT [754000] LOG:  registered custom resource manager \"neon\" with ID 134\n"
2023-10-13T19:05:22.004666Z DEBUG wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task finished
```
2023-10-23 09:00:13 +01:00
Arseny Sher
702382e99a Add check that WAL segments are identical after recovery. 2023-10-20 10:57:59 +03:00
Arpad Müller
f842b22b90 Add endpoint for querying time info for lsn (#5497)
## Problem

See #5468.

## Summary of changes

Add a new `get_timestamp_of_lsn` endpoint, returning the timestamp
associated with the given lsn.

Fixes #5468.

---------

Co-authored-by: Shany Pozin <shany@neon.tech>
2023-10-19 04:50:49 +02:00
John Spray
ecf759be6d tests: allow-list S3 500 on DeleteObjects key (#5586)
## Problem

S3 can give us a 500 whenever it likes: when this happens at request
level we eat it in `backoff::retry`, but when it happens for a key
inside a DeleteObjects request, we log it at warn level.

## Summary of changes

Allow-list this class of log message in all tests.
2023-10-18 15:16:58 +00:00
Konstantin Knizhnik
5c88213eaf Logical replication (#5271)
## Problem

See https://github.com/neondatabase/company_projects/issues/111

## Summary of changes

Save logical replication files in WAL at compute and include them in
basebackup at pate server.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-10-18 16:42:22 +03:00
Joonas Koivunen
4772cd6c93 fix: deny branching, starting compute from not yet uploaded timelines (#5484)
Part of #5172. First commits show that we used to allow starting up a
compute or creating a branch off a not yet uploaded timeline. This PR
moves activation of a timeline to happen **after** initial layer file(s)
(if any) and `index_part.json` have been uploaded. Simply moving
activation to be *after* downloads have finished works because we now
spawn a task per http request handler.

Current behaviour of uploading on the timelines on next startup is kept,
to be removed later as part of #5172.

Adds:
- `NeonCli.map_branch` and corresponding `neon_local` implementation:
allow creating computes for timelines managed via pageserver http
client/api
- possibly duplicate tests (I did not want to search for, will cleanup
in a follow-up if these duplicated)

Changes:
- make `wait_until_tenant_state` return immediatedly on `Broken` and not
wait more
2023-10-09 17:03:38 +03:00
Joonas Koivunen
a3c82f19b8 tests: prettier subprocess output in test log (#5485)
Clean subprocess output so that:
- one line of output is just one line without a linebreak
    - like shells handle `echo subshell says: $(echo foo)`
- multiple lines are indented like other pytest output
- error output is dedented and then indented to be like other pytest
output

Minor readability changes remove friction.
2023-10-05 20:15:55 +00:00
Alexander Bayandin
7a2cafb34d Use zstd to compress large allure artifacts (#5458)
## Problem

- Because we compress artifacts file by file, we don't need to put them
into `tar` containers (ie instead of `tar.gz` we can use just `gz`).
- Pythons gz single-threaded and pretty slow.

A benchmark has shown ~20 times speedup (19.876176291 vs
0.8748335830000009) on my laptop (for a pageserver.log size is 1.3M)

## Summary of changes
- Replace tarfile with zstandart
- Update allure to 2.24.0
2023-10-04 16:20:16 +01:00
Joonas Koivunen
db8ff9d64b testing: record walredo failures to test reports (#5451)
We have rare walredo failures with pg16.

Let's introduce recording of failing walredo input in `#[cfg(feature =
"testing")]`. There is additional logging (the value reconstruction path
logging usually shown with not found keys), keeping it for
`#[cfg(features = "testing")]`.

Cc: #5404.
2023-10-04 11:24:30 +03:00
John Spray
ace0c775fc pageserver: prefer 503 to 500 for transient unavailability (#5439)
## Problem

The 500 status code should only be used for bugs or unrecoverable
failures: situations we did not expect. Currently, the pageserver is
misusing this response code for some situations that are totally normal,
like requests targeting tenants that are in the process of activating.

The 503 response is a convenient catch-all for "I can't right now, but I
will be able to".

## Summary of changes

- Change some transient availability error conditions to return 503
instead of 500
- Update the HTTP client configuration in integration tests to retry on
503

After these changes, things like creating a tenant and then trying to
create a timeline within it will no longer require carefully checking
its status first, or retrying on 500s. Instead, a client which is
properly configured to retry on 503 can quietly handle such situations.
2023-10-03 17:00:55 +01:00
Joonas Koivunen
af28362a47 tests: Default to LOCAL_FS for pageserver remote storage (#5402)
Part of #5172. Builds upon #5243, #5298. Includes the test changes:
- no more RemoteStorageKind.NOOP
- no more testing of pageserver without remote storage
- benchmarks now use LOCAL_FS as well

Support for running without RemoteStorage is still kept but in practice,
there are no tests and should not be any tests.

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-09-28 12:25:20 +03:00
John Spray
ba92668e37 pageserver: deletion queue & generation validation for deletions (#5207)
## Problem

Pageservers must not delete objects or advertise updates to
remote_consistent_lsn without checking that they hold the latest
generation for the tenant in question (see [the RFC](
https://github.com/neondatabase/neon/blob/main/docs/rfcs/025-generation-numbers.md))

In this PR:
- A new "deletion queue" subsystem is introduced, through which
deletions flow
- `RemoteTimelineClient` is modified to send deletions through the
deletion queue:
- For GC & compaction, deletions flow through the full generation
verifying process
- For timeline deletions, deletions take a fast path that bypasses
generation verification
- The `last_uploaded_consistent_lsn` value in `UploadQueue` is replaced
with a mechanism that maintains a "projected" lsn (equivalent to the
previous property), and a "visible" LSN (which is the one that we may
share with safekeepers).
- Until `control_plane_api` is set, all deletions skip generation
validation
- Tests are introduced for the new functionality in
`test_pageserver_generations.py`

Once this lands, if a pageserver is configured with the
`control_plane_api` configuration added in
https://github.com/neondatabase/neon/pull/5163, it becomes safe to
attach a tenant to multiple pageservers concurrently.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-09-26 16:11:55 +01:00
Joonas Koivunen
55371af711 test: workaround known bad mock_s3 ListObjectsV2 response (#5330)
this should allow test
test_delete_tenant_exercise_crash_safety_failpoints with
debug-pg16-Check.RETRY_WITH_RESTART-mock_s3-tenant-delete-before-remove-timelines-dir-True
to pass more reliably.
2023-09-18 09:24:53 +02:00
Rahul Modpur
e6985bd098 Move tenant & timeline dir method to NeonPageserver and use them everywhere (#5262)
## Problem
In many places in test code, paths are built manually from what
NeonEnv.tenant_dir and NeonEnv.timeline_dir could do.

## Summary of changes
1. NeonEnv.tenant_dir and NeonEnv.timeline_dir moved under class
NeonPageserver as the path they use is per-pageserver instance.
2. Used these everywhere to replace manual path building

Closes #5258

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
2023-09-15 11:17:18 +01:00
Joonas Koivunen
ffd146c3e5 refactor: globals in tests (#5298)
Refactor tests to have less globals.

This will allow to hopefully write more complex tests for our new metric
collection requirements in #5297. Includes reverted work from #4761
related to test globals.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: MMeent <matthias@neon.tech>
2023-09-13 22:05:30 +03:00
bojanserafimov
8556d94740 proxy http: reproduce issue with transactions in pool (#5293)
xfail test reproducing issue https://github.com/neondatabase/neon/issues/4698
2023-09-12 17:13:25 -04:00
MMeent
83e7e5dbbd Feat/postgres 16 (#4761)
This adds PostgreSQL 16 as a vendored postgresql version, and adapts the
code to support this version. 
The important changes to PostgreSQL 16 compared to the PostgreSQL 15
changeset include the addition of a neon_rmgr instead of altering Postgres's
original WAL format.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-09-12 15:11:32 +02:00
bojanserafimov
c0ed362790 Measure pageserver wal recovery time and fix flush() method (#5240) 2023-09-11 09:46:06 -04:00
John Spray
7b6337db58 tests: enable multiple pageservers in neon_local and neon_fixture (#5231)
## Problem

Currently our testing environment only supports running a single
pageserver at a time. This is insufficient for testing failover and
migrations.
- Dependency of writing tests for #5207 

## Summary of changes

- `neon_local` and `neon_fixture` now handle multiple pageservers
- This is a breaking change to the `.neon/config` format: any local
environments will need recreating
- Existing tests continue to work unchanged:
  - The default number of pageservers is 1
- `NeonEnv.pageserver` is now a helper property that retrieves the first
pageserver if there is only one, else throws.
- Pageserver data directories are now at `.neon/pageserver_{n}` where n
is 1,2,3...
- Compatibility tests get some special casing to migrate neon_local
configs: these are not meant to be backward/forward compatible, but they
were treated that way by the test.
2023-09-08 16:19:57 +01:00
Joonas Koivunen
ff87fc569d test: Remote storage refactorings (#5243)
Remote storage cleanup split from #5198:
- pageserver, extensions, and safekeepers now have their separate remote
storage
- RemoteStorageKind has the configuration code
- S3Storage has the cleanup code
- with MOCK_S3, pageserver, extensions, safekeepers use different
buckets
- with LOCAL_FS, `repo_dir / "local_fs_remote_storage" / $user` is used
as path, where $user is `pageserver`, `safekeeper`
- no more `NeonEnvBuilder.enable_xxx_remote_storage` but one
`enable_{pageserver,extensions,safekeeper}_remote_storage`

Should not have any real changes. These will allow us to default to
`LOCAL_FS` for pageserver on the next PR, remove
`RemoteStorageKind.NOOP`, work towards #5172.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-09-08 13:54:23 +03:00
John Spray
61d661a6c3 pageserver: generation number fetch on startup and use in /attach (#5163)
## Problem

- #5050 

Closes: https://github.com/neondatabase/neon/issues/5136

## Summary of changes

- A new configuration property `control_plane_api` controls other
functionality in this PR: if it is unset (default) then everything still
works as it does today.
- If `control_plane_api` is set, then on startup we call out to control
plane `/re-attach` endpoint to discover our attachments and their
generations. If an attachment is missing from the response we implicitly
detach the tenant.
- Calls to pageserver `/attach` API may include a `generation`
parameter. If `control_plane_api` is set, then this parameter is
mandatory.
- RemoteTimelineClient's loading of index_part.json is generation-aware,
and will try to load the index_part with the most recent generation <=
its own generation.
- The `neon_local` testing environment now includes a new binary
`attachment_service` which implements the endpoints that the pageserver
requires to operate. This is on by default if running `cargo neon` by
hand. In `test_runner/` tests, it is off by default: existing tests
continue to run with in the legacy generation-less mode.

Caveats:
- The re-attachment during startup assumes that we are only re-attaching
tenants that have previously been attached, and not totally new tenants
-- this relies on the control plane's attachment logic to keep retrying
so that we should eventually see the attach API call. That's important
because the `/re-attach` API doesn't tell us which timelines we should
attach -- we still use local disk state for that. Ref:
https://github.com/neondatabase/neon/issues/5173
- Testing: generations are only enabled for one integration test right
now (test_pageserver_restart), as a smoke test that all the machinery
basically works. Writing fuller tests that stress tenant migration will
come later, and involve extending our test fixtures to deal with
multiple pageservers.
- I'm not in love with "attachment_service" as a name for the neon_local
component, but it's not very important because we can easily rename
these test bits whenever we want.
- Limited observability when in re-attach on startup: when I add
generation validation for deletions in a later PR, I want to wrap up the
control plane API calls in some small client class that will expose
metrics for things like errors calling the control plane API, which will
act as a strong red signal that something is not right.

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-06 14:44:48 +01:00
John Spray
743933176e scrubber: add scan-metadata and hook into integration tests (#5176)
## Problem

- Scrubber's `tidy` command requires presence of a control plane
- Scrubber has no tests at all 

## Summary of changes

- Add re-usable async streams for reading metadata from a bucket
- Add a `scan-metadata` command that reads from those streams and calls
existing `checks.rs` code to validate metadata, then returns a summary
struct for the bucket. Command returns nonzero status if errors are
found.
- Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that
tests using remote storage can request to have the scrubber run after
they finish
- Enable remote storarge and scrub_on_exit in test_pageserver_restart
and test_pageserver_chaos

This is a "toe in the water" of the overall space of validating the
scrubber. Later, we should:
- Enable scrubbing at end of tests using remote storage by default
- Make the success condition stricter than "no errors": tests should
declare what tenants+timelines they expect to see in the bucket (or
sniff these from the functions tests use to create them) and we should
require that the scrubber reports on these particular tenants/timelines.

The `tidy` command is untouched in this PR, but it should be refactored
later to use similar async streaming interface instead of the current
batch-reading approach (the streams are faster with large buckets), and
to also be covered by some tests.


---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2023-09-06 11:55:24 +01:00
John Spray
41aa627ec0 tests: get test name automatically for remote storage (#5184)
## Problem

Tests using remote storage have manually entered `test_name` parameters,
which:
- Are easy to accidentally duplicate when copying code to make a new
test
- Omit parameters, so don't actually create unique S3 buckets when
running many tests concurrently.

## Summary of changes

- Use the `request` fixture in neon_env_builder fixture to get the test
name, then munge that into an S3 compatible bucket name.
- Remove the explicit `test_name` parameters to enable_remote_storage
2023-09-01 17:29:38 +01:00
Joonas Koivunen
d1fcdf75b3 test: enhanced logging for curious mock_s3 (#5134)
Possible flakyness with mock_s3. Add logging in hopes this will happen
again.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-08-29 14:48:50 +03:00
Christian Schwarz
8cd20485f8 metrics: smgr query time: add a pre-aggregated histogram (#5064)
When doing global queries in VictoriaMetrics, the per-timeline
histograms make us run into cardinality limits.

We don't want to give them up just yet because we don't
have an alternative for drilling down on timeline-specific
performance issues.

So, add a pre-aggregated histogram and add observations to it
whenever we add observations to the per-timeline histogram.

While we're at it, switch to using a strummed enum for the operation
type names.
2023-08-22 20:08:31 +03:00
Joonas Koivunen
130ccb4b67 Remove initial timeline id troubles (#5044)
I made a mistake when I adding `env.initial_timeline:
Optional[TimelineId]` in the #3839, should had just generated it and
used it to create a specific timeline. This PR fixes those mistakes, and
some extra calling into psql which must be slower than python field
access.
2023-08-20 12:33:19 +03:00
Dmitry Rodionov
9140a950f4 Resume tenant deletion on attach (#5039)
I'm still a bit nervous about attach -> crash case. But it should work.
(unlike case with timeline). Ideally would be cool to cover this with
test.

This continues tradition of adding bool flags for Tenant::set_stopping.
Probably lifecycle project will help with fixing it.
2023-08-20 12:28:50 +03:00
Dmitry Rodionov
64fc7eafcd Increase timeout once again. (#5021)
When failpoint is early in deletion process it takes longer to complete
after failpoint is removed.

Example was: https://neon-github-public-dev.s3.amazonaws.com/reports/main/5889544346/index.html#suites/3556ed71f2d69272a7014df6dcb02317/49826c68ce8492b1
2023-08-17 15:37:28 +03:00
Dmitry Rodionov
d8b0a298b7 Do not attach deleted tenants (#5008)
Rather temporary solution before proper:
https://github.com/neondatabase/neon/issues/5006

It requires more plumbing so lets not attach deleted tenants first and
then implement resume.

Additionally fix `assert_prefix_empty`. It had a buggy prefix calculation,
and since we always asserted for absence of stuff it worked. Here I
started to assert for presence of stuff too and it failed. Added more
"presence" asserts to other places to be confident that it works.

Resolves [#5016](https://github.com/neondatabase/neon/issues/5016)
2023-08-17 13:46:49 +03:00
Conrad Ludgate
25934ec1ba proxy: reduce global conn pool contention (#4747)
## Problem

As documented, the global connection pool will be high contention.

## Summary of changes

Use DashMap rather than Mutex<HashMap>.

Of note, DashMap currently uses a RwLock internally, but it's partially
sharded to reduce contention by a factor of N. We could potentially use
flurry which is a port of Java's concurrent hashmap, but I have no good
understanding of it's performance characteristics. Dashmap is at least
equivalent to hashmap but less contention.

See the read heavy benchmark to analyse our expected performance
<https://github.com/xacrimon/conc-map-bench#ready-heavy>

I also spoke with the developer of dashmap recently, and they are
working on porting the implementation to use concurrent HAMT FWIW
2023-08-16 17:20:28 +01:00
Dmitry Rodionov
96b84ace89 Correctly remove orphaned objects in RemoteTimelineClient::delete_all (#5000)
Previously list_prefixes was incorrectly used for that purpose. Change
to use list_files. Add a test.

Some drive by refactorings on python side to move helpers out of
specific test file to be widely accessible

resolves https://github.com/neondatabase/neon/issues/4499
2023-08-16 17:31:16 +03:00
John Spray
5c836ee5b4 tests: extend timeout in timeline deletion test (#4992)
## Problem

This was set to 5 seconds, which was very close to how long a compaction
took on my workstation, and when deletion is blocked on compaction the
test would fail.

We will fix this to make compactions drop out on deletion, but for the
moment let's stabilize the test.

## Summary of changes

Change timeout on timeline deletion in
`test_timeline_deletion_with_files_stuck_in_upload_queue` from 5 seconds
to 30 seconds.
2023-08-15 20:14:03 +03:00
Arseny Sher
4687b2e597 Test that auth on pg/http services can be enabled separately in sks.
To this end add
1) -e option to 'neon_local safekeeper start' command appending extra options
   to safekeeper invocation;
2) Allow multiple occurrences of the same option in safekeepers, the last
   value is taken.
3) Allow to specify empty string for *-auth-public-key-path opts, it
   disables auth for the service.
2023-08-15 19:31:20 +03:00
Dmitry Rodionov
52c2c69351 fsync directory before mark file removal (#4986)
## Problem

Deletions can be possibly reordered. Use fsync to avoid the case when
mark file doesnt exist but other tenant/timeline files do.

See added comments.

resolves #4987
2023-08-15 19:24:23 +03:00
Dmitry Rodionov
4626d89eda Harden retries on tenant/timeline deletion path. (#4973)
Originated from test failure where we got SlowDown error from s3.
The patch generalizes `download_retry` to not be download specific.
Resulting `retry` function is moved to utils crate. `download_retries`
is now a thin wrapper around this `retry` function.

To ensure that all needed retries are in place test code now uses
`test_remote_failures=1` setting.

Ref https://neondb.slack.com/archives/C059ZC138NR/p1691743624353009
2023-08-14 17:16:49 +03:00
Dmitry Rodionov
d39fd66773 tests: remove redundant wait_while (#4952)
Remove redundant `wait_while` in tests. It had only one usage. Use
`wait_tenant_status404`.

Related:
https://github.com/neondatabase/neon/pull/4855#discussion_r1289610641
2023-08-11 10:18:13 +03:00
Dmitry Rodionov
c58b22bacb Delete tenant's data from s3 (#4855)
## Summary of changes

For context see
https://github.com/neondatabase/neon/blob/main/docs/rfcs/022-pageserver-delete-from-s3.md

Create Flow to delete tenant's data from pageserver. The approach
heavily mimics previously implemented timeline deletion implemented
mostly in https://github.com/neondatabase/neon/pull/4384 and followed up
in https://github.com/neondatabase/neon/pull/4552

For remaining deletion related issues consult with deletion project
here: https://github.com/orgs/neondatabase/projects/33

resolves #4250
resolves https://github.com/neondatabase/neon/issues/3889

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-08-10 18:53:16 +03:00