Commit Graph

389 Commits

Author SHA1 Message Date
Arthur Petukhovsky
f3b5db1443 Add API for safekeeper timeline copy (#6091)
Implement API for cloning a single timeline inside a safekeeper. Also
add API for calculating a sha256 hash of WAL, which is used in tests.

`/copy` API works by copying objects inside S3 for all but the last
segments, and the last segments are copied on-disk. A special temporary
directory is created for a timeline, because copy can take a lot of
time, especially for large timelines. After all files segments have been
prepared, this directory is mounted to the main tree and timeline is
loaded to memory.

Some caveats:
- large timelines can take a lot of time to copy, because we need to
copy many S3 segments
- caller should wait for HTTP call to finish indefinetely and don't
close the HTTP connection, because it will stop the process, which is
not continued in the background
- `until_lsn` must be a valid LSN, otherwise bad things can happen
- API will return 200 if specified `timeline_id` already exists, even if
it's not a copy
- each safekeeper will try to copy S3 segments, so it's better to not
call this API in-parallel on different safekeepers
2024-01-04 17:40:38 +00:00
Arseny Sher
e92c9f42c0 Don't split WAL record across two XLogData's when sending from safekeepers.
As protocol demands. Not following this makes standby complain about corrupted
WAL in various ways.

https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719
closes https://github.com/neondatabase/cloud/issues/9057
2024-01-02 10:50:20 +04:00
Arseny Sher
aaaa39d9f5 Add large insertion and slow WAL sending to test_hot_standby.
To exercise MAX_SEND_SIZE sending from safekeeper; we've had a bug with WAL
records torn across several XLogData messages. Add failpoint to safekeeper to
slow down sending. Also check for corrupted WAL complains in standby log.

Make the test a bit simpler in passing, e.g. we don't need explicit commits as
autocommit is enabled by default.

https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719
https://github.com/neondatabase/cloud/issues/9057
2024-01-02 10:50:20 +04:00
Arseny Sher
e79a19339c Add failpoint support to safekeeper.
Just a copy paste from pageserver.
2024-01-02 10:50:20 +04:00
Arseny Sher
90ef48aab8 Fix safekeeper START_REPLICATION (term=n).
It was giving WAL only up to commit_lsn instead of flush_lsn, so recovery of
uncommitted WAL since cdb08f03 hanged. Add test for this.
2024-01-01 20:44:05 +04:00
Conrad Ludgate
cc633585dc gauge guards (#6138)
## Problem

The websockets gauge for active db connections seems to be growing more
than the gauge for client connections over websockets, which does not
make sense.

## Summary of changes

refactor how our counter-pair gauges are represented. not sure if this
will improve the problem, but it should be harder to mess-up the
counters. The API is much nicer though now and doesn't require
scopeguard::defer hacks
2023-12-14 17:21:39 +00:00
Joonas Koivunen
37fdbc3aaa fix: use larger buffers for remote storage (#6069)
Currently using 8kB buffers, raise that to 32kB to hopefully 1/4 of
`spawn_blocking` usage. Also a drive-by fixing of last `tokio::io::copy`
to `tokio::io::copy_buf`.
2023-12-07 19:36:44 +00:00
Joonas Koivunen
b492cedf51 fix(remote_storage): buffering, by using streams for upload and download (#5446)
There is double buffering in remote_storage and in pageserver for 8KiB
in using `tokio::io::copy` to read `BufReader<ReaderStream<_>>`.

Switches downloads and uploads to use `Stream<Item =
std::io::Result<Bytes>>`. Caller and only caller now handles setting up
buffering. For reading, `Stream<Item = ...>` is also a `AsyncBufRead`,
so when writing to a file, we now have `tokio::io::copy_buf` reading
full buffers and writing them to `tokio::io::BufWriter` which handles
the buffering before dispatching over to `tokio::fs::File`.

Additionally implements streaming uploads for azure. With azure
downloads are a bit nicer than before, but not much; instead of one huge
vec they just hold on to N allocations we got over the wire.

This PR will also make it trivial to switch reading and writing to
io-uring based methods.

Cc: #5563.
2023-12-07 15:52:22 +00:00
Arseny Sher
207c527270 Safekeepers: persist state before timeline deactivation.
Without it, sometimes on restart we lose latest remote_consistent_lsn which
leads to excessive ps -> sk reconnections.

https://github.com/neondatabase/neon/issues/5993
2023-12-04 18:22:36 +04:00
Arseny Sher
78e73b20e1 Notify safekeeper readiness with systemd.
To avoid downtime during deploy, as in busy regions initial load can currently
take ~30s.
2023-11-29 14:07:06 +04:00
Christian Schwarz
9e3c07611c logging: support output to stderr (#5896)
(part of the getpage benchmarking epic #5771)

The plan is to make the benchmarking tool log on stderr and emit results
as JSON on stdout. That way, the test suite can simply take captures
stdout and json.loads() it, while interactive users of the benchmarking
tool have a reasonable experience as well.

Existing logging users continue to print to stdout, so, this change
should be a no-op functionally and performance-wise.
2023-11-22 11:08:35 +00:00
Arpad Müller
ea118a238a JWT logging improvements (#5823)
* lower level on auth success from info to debug (fixes #5820)
* don't log stacktraces on auth errors (as requested on slack). we do this by introducing an `AuthError` type instead of using `anyhow` and `bail`.
* return errors that have been censored for improved security.
2023-11-08 16:56:53 +00:00
Arpad Müller
e310533ed3 Support JWT key reload in pageserver (#5594)
## Problem

For quickly rotating JWT secrets, we want to be able to reload the JWT
public key file in the pageserver, and also support multiple JWT keys.

See #4897.

## Summary of changes

* Allow directories for the `auth_validation_public_key_path` config
param instead of just files. for the safekeepers, all of their config options
also support multiple JWT keys.
* For the pageservers, make the JWT public keys easily globally swappable
by using the `arc-swap` crate.
* Add an endpoint to the pageserver, triggered by a POST to
`/v1/reload_auth_validation_keys`, that reloads the JWT public keys from
the pre-configured path (for security reasons, you cannot upload any
keys yourself).

Fixes #4897

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 15:43:29 +01:00
Joonas Koivunen
4be6bc7251 refactor: remove unnecessary unsafe (#5802)
unsafe impls for `Send` and `Sync` should not be added by default. in
the case of `SlotGuard` removing them does not cause any issues, as the
compiler automatically derives those.

This PR adds requirement to document the unsafety (see
[clippy::undocumented_unsafe_blocks]) and opportunistically adds
`#![deny(unsafe_code)]` to most places where we don't have unsafe code
right now.

TRPL on Send and Sync:
https://doc.rust-lang.org/book/ch16-04-extensible-concurrency-sync-and-send.html

[clippy::undocumented_unsafe_blocks]:
https://rust-lang.github.io/rust-clippy/master/#/undocumented_unsafe_blocks
2023-11-07 10:26:25 +00:00
Richy Wang
bea8efac24 Fix comments in 'receive_wal.rs'. (#5807)
## Problem
Some comments in 'receive_wal.rs' is not suitable. It may copy from
'send_wal.rs' and leave it unchanged.
## Summary of changes
This commit fixes two comments in the code:
Changed "/// Unregister walsender." to "/// Unregister walreceiver."
Changed "///Scope guard to access slot in WalSenders registry" to
"///Scope guard to access slot in WalReceivers registry."
2023-11-07 09:13:01 +01:00
Arpad Müller
e6470ee92e Add API description for safekeeper copy endpoint (#5770)
Adds a yaml API description for a new endpoint that allows creation of a
new timeline as the copy of an existing one.
 
Part of #5282
2023-11-06 15:00:07 +01:00
duguorong009
b3d3a2587d feat: improve the serde impl for several types(Lsn, TenantId, TimelineId ...) (#5335)
Improve the serde impl for several types (`Lsn`, `TenantId`,
`TimelineId`) by making them sensitive to
`Serializer::is_human_readadable` (true for json, false for bincode).

Fixes #3511 by:
- Implement the custom serde for `Lsn`
- Implement the custom serde for `Id`
- Add the helper module `serde_as_u64` in `libs/utils/src/lsn.rs`
- Remove the unnecessary attr `#[serde_as(as = "DisplayFromStr")]` in
all possible structs

Additionally some safekeeper types gained serde tests.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-06 11:40:03 +02:00
duguorong009
09b5954526 refactor: use streaming in safekeeper /v1/debug_dump http response (#5731)
- Update the handler for `/v1/debug_dump` http response in safekeeper
- Update the `debug_dump::build()` to use the streaming in JSON build
process
2023-11-05 10:16:54 +00:00
Arpad Müller
bd59349af3 Fix Rust 1.74 warnings (#5702)
Fixes new warnings and clippy changes introduced by version 1.74 of the
rust compiler toolchain.
2023-10-28 03:47:26 +02:00
Arthur Petukhovsky
262348e41b Fix safekeeper log spans (#5643)
We were missing spans with ttid in "WAL backup" and several other
places, this commit should fix it.

Here are the examples of logs before and after:
https://gist.github.com/petuhovskiy/711a4a4e7ddde3cab3fa6419b2f70fb9
2023-10-27 12:09:02 +01:00
duguorong009
39f8fd6945 feat: add build_tag env support for set_build_info_metric (#5576)
- Add a new util `project_build_tag` macro, similar to
`project_git_version`
- Update the `set_build_info_metric` to accept and make use of
`build_tag` info
- Update all codes which use the `set_build_info_metric`
2023-10-27 10:47:11 +01:00
Arseny Sher
b332268cec Introduce safekeeper peer recovery.
Implements fetching of WAL by safekeeper from another safekeeper by imitating
behaviour of last elected leader. This allows to avoid WAL accumulation on
compute and facilitates faster compute startup as it doesn't need to download
any WAL. Actually removing WAL download in walproposer is a matter of another
patch though.

There is a per timeline task which always runs, checking regularly if it should
start recovery frome someone, meaning there is something to fetch and there is
no streaming compute. It then proceeds with fetching, finishing when there is
nothing more to receive.

Implements https://github.com/neondatabase/neon/pull/4875
2023-10-20 10:57:59 +03:00
Arseny Sher
76c702219c Don't use AppenRequestHeader.epoch_start_lsn.
It is simpler to get it once from ProposerEelected.
2023-10-20 10:57:59 +03:00
Arseny Sher
1f4805baf8 Remove remnants of num_computes field.
Fixes https://github.com/neondatabase/neon/issues/5581
2023-10-18 17:09:26 +03:00
Arseny Sher
685add2009 Enable /metrics without auth.
To enable auth faster.
2023-10-10 20:06:25 +03:00
Arpad Müller
607b185a49 Fix 1.73.0 clippy lints (#5494)
Doesn't do an upgrade of rustc to 1.73.0 as we want to wait for the
cargo response of the curl CVE before updating. In preparation for an
update, we address the clippy lints that are newly firing in 1.73.0.
2023-10-06 14:17:19 +01:00
duguorong009
25a37215f3 fix: replace all std::PathBufs with camino::Utf8PathBuf (#5352)
Fixes #4689 by replacing all of `std::Path` , `std::PathBuf` with
`camino::Utf8Path`, `camino::Utf8PathBuf` in
- pageserver
- safekeeper
- control_plane
- libs/remote_storage

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-10-04 17:52:23 +03:00
MMeent
83e7e5dbbd Feat/postgres 16 (#4761)
This adds PostgreSQL 16 as a vendored postgresql version, and adapts the
code to support this version. 
The important changes to PostgreSQL 16 compared to the PostgreSQL 15
changeset include the addition of a neon_rmgr instead of altering Postgres's
original WAL format.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-09-12 15:11:32 +02:00
Arseny Sher
81b6578c44 Allow walsender in recovery mode give WAL till dynamic flush_lsn.
Instead of fixed during the start of replication. To this end, create
term_flush_lsn watch channel similar to commit_lsn one. This allows to continue
recovery streaming if new data appears.
2023-08-29 23:19:40 +03:00
Arseny Sher
e98580b092 Add term and http endpoint to broker messaged SkTimelineInfo.
We need them for safekeeper peer recovery
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
804ef23043 Rename TermSwitchEntry to TermLsn.
Add derive Ord for easy comparison of <term, lsn> pairs.

part of https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
87f7d6bce3 Start and stop per timeline recovery task.
Slightly refactors init: now load_tenant_timelines is also async to properly
init the timeline, but to keep global map lock sync we just acquire it anew for
each timeline.

Recovery task itself is just a stub here.

part of
https://github.com/neondatabase/neon/pull/4875
2023-08-29 23:19:40 +03:00
Arseny Sher
39e3fbbeb0 Add safekeeper peers to TimelineInfo.
Now available under GET /tenant/xxx/timeline/yyy for inspection.
2023-08-29 23:19:40 +03:00
Arseny Sher
d597e6d42b Track list of walreceivers and their voting/streaming state in shmem.
Also add both walsenders and walreceivers to TimelineStatus (available under
v1/tenant/xxx/timeline/yyy).

Prepares for
https://github.com/neondatabase/neon/pull/4875
2023-08-23 16:04:08 +03:00
Arseny Sher
4687b2e597 Test that auth on pg/http services can be enabled separately in sks.
To this end add
1) -e option to 'neon_local safekeeper start' command appending extra options
   to safekeeper invocation;
2) Allow multiple occurrences of the same option in safekeepers, the last
   value is taken.
3) Allow to specify empty string for *-auth-public-key-path opts, it
   disables auth for the service.
2023-08-15 19:31:20 +03:00
Arseny Sher
13adc83fc3 Allow to enable http/pg/pg tenant only auth separately in safekeeper.
The same option enables auth and specifies public key, so this allows to use
different public keys as well. The motivation is to
1) Allow to e.g. change pageserver key/token without replacing all compute
  tokens.
2) Enable auth gradually.
2023-08-15 19:31:20 +03:00
Arseny Sher
8173813584 Add term=n option to safekeeper START_REPLICATION command.
It allows term leader to ensure he pulls data from the correct term. Absense of
it wasn't very problematic due to CRC checks, but let's be strict.

walproposer still doesn't use it as we're going to remove recovery completely
from it.
2023-08-12 12:20:13 +03:00
Cuong Nguyen
039017cb4b Add new flag for advertising pg address (#4898)
## Problem

The safekeeper advertises the same address specified in `--listen-pg`,
which is problematic when the listening address is different from the
address that the pageserver can use to connect to the safekeeper.

## Summary of changes

Add a new optional flag called `--advertise-pg` for the address to be
advertised. If this flag is not specified, the behavior is the same as
before.
2023-08-08 14:26:38 +03:00
John Spray
4892a5c5b7 pageserver: avoid logging the "ERROR" part of DbErrors that are successes (#4902)
## Problem

The pageserver<->safekeeper protocol uses error messages to indicate end
of stream. pageserver already logs these at INFO level, but the inner
error message includes the word "ERROR", which interferes with log
searching.
   
Example:
```
  walreceiver connection handling ended: db error: ERROR: ending streaming to Some("pageserver") at 0/4031CA8
```
    
The inner DbError has a severity of ERROR so DbError's Display
implementation includes that ERROR, even though we are actually
logging the error at INFO level.

## Summary of changes

Introduce an explicit WalReceiverError type, and in its From<>
for postgres errors, apply the logic from ExpectedError, for
expected errors, and a new condition for successes.
    
The new output looks like:
```
    walreceiver connection handling ended: Successful completion: ending streaming to Some("pageserver") at 0/154E9C0, receiver is caughtup and there is no computes
 ```
2023-08-08 12:35:24 +03:00
Yinnan Yao
705ae2dce9 Fix error message for listen_pg_addr_tenant_only binding (#4787)
## Problem

Wrong use of `conf.listen_pg_addr` in `error!()`.

## Summary of changes

Use `listen_pg_addr_tenant_only` instead of `conf.listen_pg_addr`.

Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
2023-07-31 14:40:52 +01:00
bojanserafimov
520046f5bd cold starts: Add sync-safekeepers fast path (#4804) 2023-07-25 19:44:18 -04:00
Arseny Sher
1e7db5458f Add one more WAL service port allowing only tenant scoped auth tokens.
It will make it easier to limit access at network level, with e.g. k8s network
policies.

ref https://github.com/neondatabase/neon/issues/4730
2023-07-19 06:03:51 +04:00
arpad-m
982fce1e72 Fix rustdoc warnings and test cargo doc in CI (#4711)
## Problem

`cargo +nightly doc` is giving a lot of warnings: broken links, naked
URLs, etc.

## Summary of changes

* update the `proc-macro2` dependency so that it can compile on latest
Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and
https://github.com/dtolnay/proc-macro2/issues/398
* allow the `private_intra_doc_links` lint, as linking to something
that's private is always more useful than just mentioning it without a
link: if the link breaks in the future, at least there is a warning due
to that. Also, one might enable
[`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options)
in the future and make these links work in general.
* fix all the remaining warnings given by `cargo +nightly doc`
* make it possible to run `cargo doc` on stable Rust by updating
`opentelemetry` and associated crates to version 0.19, pulling in a fix
that previously broke `cargo doc` on stable:
https://github.com/open-telemetry/opentelemetry-rust/pull/904
* Add `cargo doc` to CI to ensure that it won't get broken in the
future.

Fixes #2557

## Future work
* Potentially, it might make sense, for development purposes, to publish
the generated rustdocs somewhere, like for example [how the rust
compiler does
it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html).
I will file an issue for discussion.
2023-07-15 05:11:25 +03:00
Arseny Sher
634be4f4e0 Fix async write in safekeepers.
General Rust Write trait semantics (as well as its async brother) is that write
definitely happens only after Write::flush(). This wasn't needed in sync where
rust write calls the syscall directly, but is required in async.

Also fix setting initial end_pos in walsender, sometimes it was from the future.

fixes https://github.com/neondatabase/neon/issues/4518
2023-07-06 19:56:28 +04:00
Dmitry Rodionov
75d583c04a Tenant::load: fix uninit timeline marker processing (#4458)
## Problem

During timeline creation we create special mark file which presense
indicates that initialization didnt complete successfully. In case of a
crash restart we can remove such half-initialized timeline and following
retry from control plane side should perform another attempt.

So in case of a possible crash restart during initial loading we have
following picture:

```
timelines
| - <timeline_id>___uninit
| - <timeline_id>
| - | <timeline files>
```

We call `std::fs::read_dir` to walk files in `timelines` directory one
by one. If we see uninit file
we proceed with deletion of both, timeline directory and uninit file. If
we see timeline we check if uninit file exists and do the same cleanup.
But in fact its possible to get both branches to be true at the same
time. Result of readdir doesnt reflect following directory state
modifications. So you can still get "valid" entry on the next iteration
of the loop despite the fact that it was deleted in one of the previous
iterations of the loop.

To see that you can apply the following patch (it disables uninit mark
cleanup on successful timeline creation):
```diff
diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 4beb2664..b3cdad8f 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -224,11 +224,6 @@ impl UninitializedTimeline<'_> {
                             )
                         })?;
                 }
-                uninit_mark.remove_uninit_mark().with_context(|| {
-                    format!(
-                        "Failed to remove uninit mark file for timeline {tenant_id}/{timeline_id}"
-                    )
-                })?;
                 v.insert(Arc::clone(&new_timeline));
 
                 new_timeline.maybe_spawn_flush_loop();
```
And perform the following steps:

```bash
neon_local init
neon_local start
neon_local tenant create
neon_local stop
neon_local start
```

The error is:
```log
INFO load{tenant_id=X}:blocking: Found an uninit mark file .neon/tenants/X/timelines/Y.___uninit, removing the timeline and its uninit mark
2023-06-09T18:43:41.664247Z ERROR load{tenant_id=X}: load failed, setting tenant state to Broken: failed to load metadata

Caused by:
    0: Failed to read metadata bytes from path .neon/tenants/X/timelines/Y/metadata
    1: No such file or directory (os error 2)
```

So uninit mark got deleted together with timeline directory but we still
got directory entry for it and tried to load it.

The bug prevented tenant from being successfully loaded.

## Summary of changes

Ideally I think we shouldnt place uninit marks in the same directory as timeline directories but move them to separate directory and
gather them as an input to actual listing, but that would be sort of an
on-disk format change, so just check whether entries are still valid
before operating on them.
2023-06-21 14:25:58 +03:00
Arthur Petukhovsky
b0286e3c46 Always truncate WAL after restart (#4464)
c058e1cec2 skipped `truncate_wal()` it if `write_lsn` is equal to
truncation position, but didn't took into account that `write_lsn` is
reset on restart.

Fixes regression looking like:

```
ERROR WAL acceptor{cid=22 ...}:panic{thread=WAL acceptor 19b6c1743666ec02991a7633c57178db/b07db8c88f4c76ea5ed0954c04cc1e74 location=safekeeper/src/wal_storage.rs:230:13}: unexpected write into non-partial segment file
```

This fix will prevent skipping WAL truncation when we are running for
the first time after restart.
2023-06-12 13:42:28 +00:00
Joonas Koivunen
7e17979d7a feat: http request logging on safekeepers.
With RequestSpan, successfull GETs are not logged, but all others, errors and
warns on cancellations are.
2023-06-11 22:53:08 +04:00
Arseny Sher
227271ccad Switch safekeepers to async.
This is a full switch, fs io operations are also tokio ones, working through
thread pool. Similar to pageserver, we have multiple runtimes for easier `top`
usage and isolation.

Notable points:
- Now that guts of safekeeper.rs are full of .await's, we need to be very
  careful not to drop task at random point, leaving timeline in unclear
  state. Currently the only writer is walreceiver and we don't have top
  level cancellation there, so we are good. But to be safe probably we should
  add a fuse panicking if task is being dropped while operation on a timeline
  is in progress.
- Timeline lock is Tokio one now, as we do disk IO under it.
- Collecting metrics got a crutch: since prometheus Collector is
  synchronous, it spawns a thread with current thread runtime collecting data.
- Anything involving closures becomes significantly more complicated, as
  async fns are already kinda closures + 'async closures are unstable'.
- Main thread now tracks other main tasks, which got much easier.
- The only sync place left is initial data loading, as otherwise clippy
  complains on timeline map lock being held across await points -- which is
  not bad here as it happens only in single threaded runtime of main thread.
  But having it sync doesn't hurt either.

I'm concerned about performance of thread pool io offloading, async traits and
many await points; but we can try and see how it goes.

fixes https://github.com/neondatabase/neon/issues/3036
fixes https://github.com/neondatabase/neon/issues/3966
2023-06-11 22:53:08 +04:00
Arthur Petukhovsky
a21b55fe0b Use connect_timeout for broker::connect (#4452)
Use `storage_broker::connect` everywhere. Add a default 5 seconds
timeout for opening new connection.
2023-06-09 17:38:53 +03:00
Arseny Sher
37bf2cac4f Persist safekeeper control file once in a while.
It should make remote_consistent_lsn commonly up-to-date on non actively writing
projects, which removes spike or pageserver -> safekeeper reconnections on
storage nodes restart.
2023-06-07 17:23:37 +04:00