The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.
This is therefore a preparation PR (like similar PRs before it).
There is a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).
The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
We're about to implement a gRPC interface for Pageserver. Let's upgrade
Tonic first, to avoid a more painful migration later. It's currently
only used by storage-broker.
Touches #11728.
## Summary of changes
Upgrade Tonic 0.12.3 → 0.13.1. Also opportunistically upgrade Prost
0.13.3 → 0.13.5. This transitively pulls in Indexmap 2.0.1 → 2.9.0, but
it doesn't appear to be used in any particularly critical code paths.
## Problem
Broker supports only HTTP, no HTTPS
- Closes: https://github.com/neondatabase/cloud/issues/27492
## Summary of changes
- Add `listen_https_addr`, `ssl_key_file`, `ssl_cert_file`,
`ssl_cert_reload_period` arguments to storage broker
- Make `listen_addr` argument optional
- Listen https in storage broker
- Support https for storage broker request in neon_local
- Add `use_https_storage_broker_api` option to NeonEnvBuilder
## Problem
Pageservers and safakeepers do not pass CA certificates to broker
client, so the client do not trust locally issued certificates.
- Part of https://github.com/neondatabase/cloud/issues/27492
## Summary of changes
- Change `ssl_ca_certs` type in PS/SK's config to `Pem` which may be
converted to both `reqwest` and `tonic` certificates.
- Pass CA certificates to storage broker client in PS and SK
I like to run nightly clippy every so often to make our future rust
upgrades easier. Some notable changes:
* Prefer `next_back()` over `last()`. Generic iterators will implement
`last()` to run forward through the iterator until the end.
* Prefer `io::Error::other()`.
* Use implicit returns
One case where I haven't dealt with the issues is the now
[more-sensitive "large enum variant"
lint](https://github.com/rust-lang/rust-clippy/pull/13833). I chose not
to take any decisions around it here, and simply marked them as allow
for now.
## Problem
There are some places in the code where we create `reqwest::Client`
without providing SSL CA certs from `ssl_ca_file`. These will break
after we enable TLS everywhere.
- Part of https://github.com/neondatabase/cloud/issues/22686
## Summary of changes
- Support `ssl_ca_file` in storage scrubber.
- Add `use_https_safekeeper_api` option to safekeeper to use https for
peer requests.
- Propagate SSL CA certs to storage_controller/client, storcon's
ComputeHook, PeerClient and maybe_forward.
Updates storage components to edition 2024. We like to stay on the
latest edition if possible. There is no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.
The PR has two commits:
* the first commit updates storage crates to edition 2024 and appeases
`cargo clippy` by changing code. i have accidentially ran the formatter
on some files that had other edits.
* the second commit performs a `cargo fmt`
I would recommend a closer review of the first commit and a less close
review of the second one (as it just runs `cargo fmt`).
part of https://github.com/neondatabase/neon/issues/10918
It seems the ecosystem is not so keen on moving to aws-lc-rs as it's
build setup is more complicated than ring (requiring cmake).
Eventually I expect the ecosystem should pivot to
https://github.com/ctz/graviola/tree/main/rustls-graviola as it
stabilises (it has a very simply build step and license), but for now
let's try not have a headache of juggling two crypto libs.
I also noticed that tonic will just fail with tls without a default
provider, so I added some defensive code for that.
Update hyper and tonic again in the storage broker, this time with a fix
for the issue that made us revert the update last time.
The first commit is a revert of #9268, the second a fix for the issue.
fixes#9231.
Follow-up of #9234 to give hyper 1.0 the version-free name, and the
legacy version of hyper the one with the version number inside. As we
move away from hyper 0.14, we can remove the `hyper0` name piece by
piece.
Part of #9255
Fixes#9231 .
Upgrade hyper to 1.4.0 and use hyper 1.4 instead of 0.14 in the storage
broker, together with tonic 0.12. The two upgrades go hand in hand.
Thanks to the broker being independent from other components, we can
upgrade its hyper version without touching the other components, which
makes things easier.
To avoid pageserver gc'ing data needed by standby, propagate standby apply LSN
through standby -> safekeeper -> broker -> pageserver flow and hold off GC for
it. Iteration of GC resets the value to remove the horizon when standby goes
away -- pushes are assumed to happen at least once between gc iterations. As a
safety guard max allowed lag compared to normal GC horizon is hardcoded as 10GB.
Add test for the feature.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
We had an incident where pageserver requests timed out because
pageserver couldn't fetch WAL from safekeepers. This incident was caused
by a bug in safekeeper logic for timeline activation, which prevented
pageserver from finding safekeepers.
This bug was since fixed, but there is still a chance of a similar bug
in the future due to overall complexity.
We add a new broker message to "signal interest" for timeline. This
signal will be sent by pageservers `wait_lsn`, and safekeepers will
receive this signal to start broadcasting broker messages. Then every
broker subscriber will be able to find the safekeepers and connect to
them (to start fetching WAL).
This feature is not limited to pageservers and any service that wants to
download WAL from safekeepers will be able to use this discovery
request.
This commit changes pageserver's connection_manager (walreceiver) to
send a SafekeeperDiscoveryRequest when there is no information about
safekeepers present in memory. Current implementation will send these
requests only if there is an active wait_lsn() call and no more often
than once per 10 seconds.
Add `test_broker_discovery` to test this: safekeepers started with
`--disable-periodic-broker-push` will not push info to broker so that
pageserver must use a discovery to start fetching WAL.
Add task_stats in safekeepers broker module to log a warning if there is
no message received from the broker for the last 10 seconds.
Closes#5471
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
Old methods are unchanged for backwards compatibility. Added
`SafekeeperDiscoveryRequest` and `SafekeeperDiscoveryResponse` types to
serve as example, and also as a prerequisite for
https://github.com/neondatabase/neon/issues/5471
(part of the getpage benchmarking epic #5771)
The plan is to make the benchmarking tool log on stderr and emit results
as JSON on stdout. That way, the test suite can simply take captures
stdout and json.loads() it, while interactive users of the benchmarking
tool have a reasonable experience as well.
Existing logging users continue to print to stdout, so, this change
should be a no-op functionally and performance-wise.
- Add a new util `project_build_tag` macro, similar to
`project_git_version`
- Update the `set_build_info_metric` to accept and make use of
`build_tag` info
- Update all codes which use the `set_build_info_metric`
Two of the most common spurious log messages:
- broker connections terminate & we log at error severity. Unfortunately
tonic gives us an "Unknown" error so to suppress these we're doing
string matching. It's hacky but worthwhile for operations.
- the first iteration of tenant background tasks tends to over-run its
schedule and emit a warning. Ultimately we should fix these to run on
time, but for now we are not benefiting from polluting our logs with the
warnings.
(This is prep work to make `Timeline::activate` infallible.)
This patch removes the global storage_broker client instance from the
pageserver codebase.
Instead, pageserver startup instantiates it and passes it down to the
`Timeline::activate` function, which in turn passes it to the
WalReceiver, which is the entity that actually uses it.
Patch series:
- #4316
- #4317
- #4318
- #4319
- Remove repeated tenant & timeline from span
- Demote logging of the path to debug level
- Log completion at info level, in the same function where we log errors
- distinguish between layer file download success & on-demand download
succeeding as a whole in the log message wording
- Assert that the span contains a tenant id and a timeline id
fixes https://github.com/neondatabase/neon/issues/3945
Before:
```
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: download complete: /storage/pageserver/data/tenants/$TENANT_ID/timelines/$TIMELINE_ID/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates.
```
After:
```
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: layer file download finished
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates.
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: on-demand download successful
```
See https://github.com/neondatabase/neon/pull/3991
Brings the changes back with the right way to use new `toml_edit` to
deserialize values and a unit test for this.
All non-trivial updates extracted into separate commits, also `carho hakari` data and its manifest format were updated.
3 sets of crates remain unupdated:
* `base64` — touches proxy in a lot of places and changed its api (by 0.21 version) quite strongly since our version (0.13).
* `opentelemetry` and `opentelemetry-*` crates
```
error[E0308]: mismatched types
--> libs/tracing-utils/src/http.rs:65:21
|
65 | span.set_parent(parent_ctx);
| ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
| |
| arguments to this method are incorrect
|
= note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
|
77 | pub struct Context {
| ^^^^^^^^^^^^^^^^^^
note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
|
77 | pub struct Context {
| ^^^^^^^^^^^^^^^^^^
= note: perhaps two different versions of crate `opentelemetry_api` are being used?
note: associated function defined here
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
|
43 | fn set_parent(&self, cx: Context);
| ^^^^^^^^^^
For more information about this error, try `rustc --explain E0308`.
error: could not compile `tracing-utils` due to previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `tracing-utils` due to previous error
```
`tracing-opentelemetry` of version `0.19` is not yet released, that is supposed to have the update we need.
* similarly, `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener` crates have similar issue:
```
error[E0308]: mismatched types
--> libs/postgres_backend/tests/simple_select.rs:112:78
|
112 | let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
| --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
| |
| arguments to this function are incorrect
|
= note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
note: struct `ClientConfig` is defined in crate `rustls`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
|
125 | pub struct ClientConfig {
| ^^^^^^^^^^^^^^^^^^^^^^^
note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
|
91 | pub struct ClientConfig {
| ^^^^^^^^^^^^^^^^^^^^^^^
= note: perhaps two different versions of crate `rustls` are being used?
note: associated function defined here
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
|
23 | pub fn new(config: ClientConfig) -> Self {
| ^^^
For more information about this error, try `rustc --explain E0308`.
error: could not compile `postgres_backend` due to previous error
warning: build failed, waiting for other jobs to finish...
```
* aws crates: I could not make new API to work with bucket endpoint overload, and console e2e tests failed.
Other our tests passed, further investigation is worth to be done in https://github.com/neondatabase/neon/issues/4008
All non-trivial updates extracted into separate commits, also `carho
hakari` data and its manifest format were updated.
3 sets of crates remain unupdated:
* `base64` — touches proxy in a lot of places and changed its api (by
0.21 version) quite strongly since our version (0.13).
* `opentelemetry` and `opentelemetry-*` crates
```
error[E0308]: mismatched types
--> libs/tracing-utils/src/http.rs:65:21
|
65 | span.set_parent(parent_ctx);
| ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
| |
| arguments to this method are incorrect
|
= note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
|
77 | pub struct Context {
| ^^^^^^^^^^^^^^^^^^
note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
|
77 | pub struct Context {
| ^^^^^^^^^^^^^^^^^^
= note: perhaps two different versions of crate `opentelemetry_api` are being used?
note: associated function defined here
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
|
43 | fn set_parent(&self, cx: Context);
| ^^^^^^^^^^
For more information about this error, try `rustc --explain E0308`.
error: could not compile `tracing-utils` due to previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `tracing-utils` due to previous error
```
`tracing-opentelemetry` of version `0.19` is not yet released, that is
supposed to have the update we need.
* similarly, `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener`
crates have similar issue:
```
error[E0308]: mismatched types
--> libs/postgres_backend/tests/simple_select.rs:112:78
|
112 | let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
| --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
| |
| arguments to this function are incorrect
|
= note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
note: struct `ClientConfig` is defined in crate `rustls`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
|
125 | pub struct ClientConfig {
| ^^^^^^^^^^^^^^^^^^^^^^^
note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
|
91 | pub struct ClientConfig {
| ^^^^^^^^^^^^^^^^^^^^^^^
= note: perhaps two different versions of crate `rustls` are being used?
note: associated function defined here
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
|
23 | pub fn new(config: ClientConfig) -> Self {
| ^^^
For more information about this error, try `rustc --explain E0308`.
error: could not compile `postgres_backend` due to previous error
warning: build failed, waiting for other jobs to finish...
```
* aws crates: I could not make new API to work with bucket endpoint
overload, and console e2e tests failed.
Other our tests passed, further investigation is worth to be done in
https://github.com/neondatabase/neon/issues/4008
Add a condition to switch walreceiver connection to safekeeper that is
located in the same availability zone. Switch happens when commit_lsn of
a candidate is not less than commit_lsn from the active connection. This
condition is expected not to trigger instantly, because commit_lsn of a
current connection is usually greater than commit_lsn of updates from
the broker. That means that if WAL is written continuously, switch can
take a lot of time, but it should happen eventually.
Now protoc 3.15+ is required for building neon.
Fixes https://github.com/neondatabase/neon/issues/3200
neon_local sends SIGQUIT, which otherwise dumps core by default. Also, remove
obsolete install_shutdown_handlers; in all binaries it was overridden by
ShutdownSignals::handle later.
ref https://github.com/neondatabase/neon/issues/3847
Enables tracing panic hook in addition to pageserver introduced in
#3475:
- proxy
- safekeeper
- storage_broker
For proxy, a drop guard which resets the original std panic hook was
added on the first commit. Other binaries don't need it so they never
reset anything by `disarm`ing the drop guard.
The aim of the change is to make sure all panics a) have span
information b) are logged similar to other messages, not interleaved
with other messages as happens right now. Interleaving happens right now
because std prints panics to stderr, and other logging happens in
stdout. If this was handled gracefully by some utility, the log message
splitter would treat panics as belonging to the previous message because
it expects a message to start with a timestamp.
Cc: #3468
## Describe your changes
Add libmetrics_build_info metrics with commit sha to storage_broken
/metrics, to match behaviour of proxy, pageserver and safekeeper.
1.66 release speeds up compile times for over 10% according to tests.
Also its Clippy finds plenty of old nits in our code:
* useless conversion, `foo as u8` where `foo: u8` and similar, removed
`as u8` and similar
* useless references and dereferenced (that were automatically adjusted
by the compiler), removed various `&` and `*`
* bool -> u8 conversion via `if/else`, changed to `u8::from`
* Map `.iter()` calls where only values were used, changed to
`.values()` instead
Standing out lints:
* `Eq` is missing in our protoc generated structs. Silenced, does not
seem crucial for us.
* `fn default` looks like the one from `Default` trait, so I've
implemented that instead and replaced the `dummy_*` method in tests with
`::default()` invocation
* Clippy detected that
```
if retry_attempt < u32::MAX {
retry_attempt += 1;
}
```
is a saturating add and proposed to replace it.
Sometimes CI build fails with
error: couldn't read storage_broker/src/../proto/storage_broker.rs: No such file or directory (os error 2)
--> storage_broker/src/lib.rs:14:5
|
14 | include!("../proto/storage_broker.rs");
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The root cause is not clear, but it looks like interference with cachepot. Per
cargo docs, build scripts shouldn't output to anywhere but OUT_DIR; let's follow
this and see if it helps.
Which ought to replace etcd. This patch only adds the binary and adjusts
Dockerfile to include it; subsequent ones will add deploy of helm chart and the
actual replacement.
It is a simple and fast pub-sub message bus. In this patch only safekeeper
message is supported, but others can be easily added.
Compilation now requires protoc to be installed. Installing protobuf-compiler
package is fine for Debian/Ubuntu.
ref
https://github.com/neondatabase/neon/pull/2733https://github.com/neondatabase/neon/issues/2394