Before this PR, using the AWS SDK profile feature for running against
minio didn't work because
* our SDK versions were too old and didn't include
https://github.com/awslabs/aws-sdk-rust/issues/1060 and
* we didn't massage the s3 client config builder correctly.
This PR
* udpates all the AWS SDKs we use to, respectively, the latest version I
could find on crates.io (Is there a better process?)
* changes the way remote_storage constructs the S3 client, and
* documents how to run the test suite against real S3 & local minio.
Regarding the changes to `remote_storage`: if one reads the SDK docs, it
is clear that the recommended way is to use `aws_config::from_env`, then
customize.
What we were doing instead is to use the `aws_sdk_s3` builder directly.
To get the `local-minio` in the added docs working, I needed to update
both the SDKs and make the changes to the `remote_storage`. See the
commit history in this PR for details.
Refs:
* byproduct: https://github.com/smithy-lang/smithy-rs/pull/3633
* follow-up on deprecation:
https://github.com/neondatabase/neon/issues/7665
* follow-up for scrubber S3 setup:
https://github.com/neondatabase/neon/issues/7667
## Problem
Some HTTP client connections can stay open for quite a long time.
## Summary of changes
When there are too many HTTP client connections, pick a random
connection and gracefully cancel it.
Preceding PR https://github.com/neondatabase/neon/pull/7613 reduced the
usage of `--pageserver-config-override`.
This PR builds on top of that work and fully removes the `neon_local
--pageserver-config-override`.
Tests that need a non-default `pageserver.toml` control it using two
options:
1. Specify `NeonEnvBuilder.pageserver_config_override` before
`NeonEnvBuilder.init_start()`. This uses a new `neon_local init
--pageserver-config` flag.
2. After `init_start()`: `env.pageserver.stop()` +
`NeonPageserver.edit_config_toml()` + `env.pageserver.start()`
A few test cases were using
`env.pageserver.start(overrides=("--pageserver-config-override...",))`.
I changed them to use one of the options above.
Future Work
-----------
The `neon_local init --pageserver-config` flag still uses `pageserver
--config-override` under the hood. In the future, neon_local should just
write the `pageserver.toml` directly.
The `NeonEnvBuilder.pageserver_config_override` field should be renamed
to `pageserver_initial_config`. Let's save this churn for a separate
refactor commit.
## Problem
Issues around operation and tenant locks would have been hard to debug
since there was little observability around them.
## Summary of changes
- As suggested in the issue, a wrapper was added around
`OwnedRwLockWriteGuard` called `IdentifierLock` that removes the
operation currently holding the exclusive lock when it's dropped.
- The value in `IdLockMap` was extended to hold a pair of locks and
operations that can be accessed and locked independently.
- When requesting an exclusive lock besides returning the lock on that
resource, an operation is changed if the lock is acquired.
Closes https://github.com/neondatabase/neon/issues/7108
The main challenge was in the second commit, as `DownloadStream`
requires the inner to be Sync but the stream returned by the Azure SDK
wasn't Sync.
This left us with three options:
* Change the Azure SDK to return Sync streams. This was abandoned after
we realized that we couldn't just make `TokenCredential`'s returned
future Sync: it uses the `async_trait` macro and as the
`TokenCredential` trait is used in dyn form, one can't use Rust's new
"async fn in Trait" feature.
* Change `DownloadStream` to not require `Sync`. This was abandoned
after it turned into a safekeeper refactoring project.
* Put the stream into a `Mutex` and make it obtain a lock on every poll.
This adds some performance overhead but locks that actually don't do
anything should be comparatively cheap.
We went with the third option in the end as the change still represents
an improvement.
Follow up of #5446 , fixes#5563
## Problem
Storage controller was observed to have unexpectedly large memory
consumption when loaded with many thousands of shards.
This was recently fixed:
- https://github.com/neondatabase/neon/pull/7493
...but we need a general test that the controller is well behaved with
thousands of shards.
Closes: https://github.com/neondatabase/neon/issues/7460
Closes: https://github.com/neondatabase/neon/issues/7463
## Summary of changes
- Add test test_storage_controller_many_tenants to exercise the system's
behaviour with a more substantial workload. This test measures memory
consumption and reproduces #7460 before the other changes in this PR.
- Tweak reconcile_all's return value to make it nonzero if it spawns no
reconcilers, but _would_ have spawned some reconcilers if they weren't
blocked by the reconcile concurrency limit. This makes the test's
reconcile_until_idle behave as expected (i.e. not complete until the
system is nice and calm).
- Fix an issue where tenant migrations would leave a spurious secondary
location when migrated to some location that was not already their
secondary (this was an existing low-impact bug that tripped up the
test's consistency checks).
On the test with 8000 shards, the resident memory per shard is about
20KiB. This is not really per-shard memory: the primary source of memory
growth is the number of concurrent network/db clients we create.
With 8000 shards, the test takes 125s to run on my workstation.
It works by listing postgres table with memory dump of safekeepers state. s3
contents for each timeline are checked then against timeline_start_lsn and
backup_lsn. If inconsistency is found, before complaining timeline (branch) is
checked at control plane; it might have been deleted between the dump take and
s3 check.
## Problem
Downloading tenant data for analysis/debug with `aws s3 cp` works well
for small tenants, but for larger tenants it is unlikely that one ends
up with an index that matches layer files, due to the time taken to
download.
## Summary of changes
- Add a `tenant-snapshot` command to the scrubber, which reads timeline
indices and then downloads the layers referenced in the index, even if
they were deleted. The result is a snapshot of the tenant's remote
storage state that should be usable when imported (#7399 ).
Extracted from https://github.com/neondatabase/neon/pull/7375. We assume
everything >= 0x80 are metadata keys. AUX file keys are part of the
metadata keys, and we use `0x90` as the prefix for AUX file keys.
The AUX file encoding is described in the code comment. We use xxhash128
as the hash algorithm. It seems to be portable according to the
introduction,
> xxHash is an Extremely fast Hash algorithm, processing at RAM speed
limits. Code is highly portable, and produces hashes identical across
all platforms (little / big endian).
...though whether the Rust version follows the same convention is
unknown and might need manual review of the library. Anyways, we can
always change the hash algorithm before rolling it out in
staging/end-user, and I made a quick decision to use xxhash here because
it generates 128b hash + portable. We can save the discussion of which
hash algorithm to use later.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`cargo deny check` is complaining about our rustls versions, causing
CI to fail:
```
error[vulnerability]: `rustls::ConnectionCommon::complete_io` could fall into an infinite loop based on network input
┌─ /__w/neon/neon/Cargo.lock:395:1
│
395 │ rustls 0.21.9 registry+https://github.com/rust-lang/crates.io-index
│ ------------------------------------------------------------------- security vulnerability detected
│
= ID: RUSTSEC-2024-0336
= Advisory: https://rustsec.org/advisories/RUSTSEC-2024-0336
= If a `close_notify` alert is received during a handshake, `complete_io`
does not terminate.
Callers which do not call `complete_io` are not affected.
`rustls-tokio` and `rustls-ffi` do not call `complete_io`
and are not affected.
`rustls::Stream` and `rustls::StreamOwned` types use
`complete_io` and are affected.
= Announcement: https://github.com/rustls/rustls/security/advisories/GHSA-6g7w-8wpp-frhj
= Solution: Upgrade to >=0.23.5 OR >=0.22.4, <0.23.0 OR >=0.21.11, <0.22.0 (try `cargo update -p rustls`)
error[vulnerability]: `rustls::ConnectionCommon::complete_io` could fall into an infinite loop based on network input
┌─ /__w/neon/neon/Cargo.lock:396:1
│
396 │ rustls 0.22.2 registry+https://github.com/rust-lang/crates.io-index
│ ------------------------------------------------------------------- security vulnerability detected
│
= ID: RUSTSEC-2024-0336
= Advisory: https://rustsec.org/advisories/RUSTSEC-2024-0336
= If a `close_notify` alert is received during a handshake, `complete_io`
does not terminate.
Callers which do not call `complete_io` are not affected.
`rustls-tokio` and `rustls-ffi` do not call `complete_io`
and are not affected.
`rustls::Stream` and `rustls::StreamOwned` types use
`complete_io` and are affected.
= Announcement: https://github.com/rustls/rustls/security/advisories/GHSA-6g7w-8wpp-frhj
= Solution: Upgrade to >=0.23.5 OR >=0.22.4, <0.23.0 OR >=0.21.11, <0.22.0 (try `cargo update -p rustls`)
```
## Summary of changes
`cargo update -p rustls@0.21.9 -p rustls@0.22.2`
## Problem
My benchmarks show that prometheus is not very good.
https://github.com/conradludgate/measured
We're already using it in storage_controller and it seems to be working
well.
## Summary of changes
Replace prometheus with my new measured crate in proxy only.
Apologies for the large diff. I tried to keep it as minimal as I could.
The label types add a bit of boiler plate (but reduce the chance we
mistype the labels), and some of our custom metrics like CounterPair and
HLL needed to be rewritten.
## Problem
hyper1 offers control over the HTTP connection that hyper0_14 does not.
We're blocked on switching all services to hyper1 because of how we use
tonic, but no reason we can't switch proxy over.
## Summary of changes
1. hyper0.14 -> hyper1
1. self managed server
2. Remove the `WithConnectionGuard` wrapper from `protocol2`
2. Remove TLS listener as it's no longer necessary
3. include first session ID in connection startup logs
Adds another tool to the DR toolbox: ability in pagectl to
recover arbitrary prefixes in remote storage. Requires remote storage config,
the prefix, and the travel-to timestamp parameter
to be specified as cli args.
The done-if-after parameter is also supported.
Example invocation (after `aws login --profile dev`):
```
RUST_LOG=remote_storage=debug AWS_PROFILE=dev cargo run -p pagectl time-travel-remote-prefix 'remote_storage = { bucket_name = "neon-test-bucket-name", bucket_region = "us-east-2" }' wal/3aa8fcc61f6d357410b7de754b1d9001/641e5342083b2235ee3deb8066819683/ 2024-04-05T17:00:00Z
```
This has been written to resolve a customer recovery case:
https://neondb.slack.com/archives/C033RQ5SPDH/p1712256888468009
There is validation of the prefix to prevent accidentially specifying
too generic prefixes, which can cause corruption and data
loss if used wrongly. Still, the validation is not perfect and it is
important that the command is used with caution.
If possible, `time_travel_remote_storage` should
be used instead which has additional checks in place.
## Problem
Some awkwardness in the measured API.
Missing process metrics.
## Summary of changes
Update measured to use the new convenience setup features.
Added measured-process lib.
Added measured support for libmetrics
## Problem
We have two places that use a helper (`ser_rfc3339_millis`) to get serde
to stringify SystemTimes into the desired format.
## Summary of changes
Created a new module `utils::serde_system_time` and inside it a wrapper
type `SystemTime` for `std::time::SystemTime` that
serializes/deserializes to the RFC3339 format.
This new type is then used in the two places that were previously using
the helper for serialization, thereby eliminating the need to decorate
structs.
Closes#7151.
The binary etc were renamed some time ago, but the path in the source
tree remained "attachment_service" to avoid disruption to ongoing PRs.
There aren't any big PRs out right now, so it's a good time to cut over.
- Rename `attachment_service` to `storage_controller`
- Move it to the top level for symmetry with `storage_broker` & to avoid
mixing the non-prod neon_local stuff (`control_plane/`) with the storage
controller which is a production component.
Updates the `test-context` dev-dependency of the `remote_storage` crate
to 0.3. This removes a lot of `async_trait` instances.
Related earlier work: #6305, #6464
## Problem
During incidents, we may need to quickly access the storage controller's
API without trying API client code or crafting `curl` CLIs on the fly. A
basic CLI client is needed for this.
## Summary of changes
- Update storage controller node listing API to only use public types in
controller_api.rs
- Add a storage controller API for listing tenants
- Add a basic test that the CLI can list and modify nodes and tenants.
## Problem
## Summary of changes
`cargo update -p tokio`.
The only risky change I could see is the `tokio::io::split` moving from
a spin-lock to a mutex but I think that's ok.
## Problem
Follows: https://github.com/neondatabase/neon/pull/7182
- Sufficient concurrent writes could OOM a pageserver from the size of
indices on all the InMemoryLayer instances.
- Enforcement of checkpoint_period only happened if there were some
writes.
Closes: https://github.com/neondatabase/neon/issues/6916
## Summary of changes
- Add `ephemeral_bytes_per_memory_kb` config property. This controls the
ratio of ephemeral layer capacity to memory capacity. The weird unit is
to enable making the ratio less than 1:1 (set this property to 1024 to
use 1MB of ephemeral layers for every 1MB of RAM, set it smaller to get
a fraction).
- Implement background layer rolling checks in
Timeline::compaction_iteration -- this ensures we apply layer rolling
policy in the absence of writes.
- During background checks, if the total ephemeral layer size has
exceeded the limit, then roll layers whose size is greater than the mean
size of all ephemeral layers.
- Remove the tick() path from walreceiver: it isn't needed any more now
that we do equivalent checks from compaction_iteration.
- Add tests for the above.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
- Remove code for using AWS secrets manager, as we're deploying with
k8s->env vars instead
- Load each secret independently, so that one can mix CLI args with
environment variables, rather than requiring that all secrets are loaded
with the same mechanism.
- Add a 'strict mode', enabled by default, which will refuse to start if
secrets are not loaded. This avoids the risk of accidentially disabling
auth by omitting the public key, for example
## Problem
I noticed code coverage for auth_quirks was pretty bare
## Summary of changes
Adds 3 happy path unit tests for auth_quirks
* scram
* cleartext (websockets)
* cleartext (password hack)
## Problem
Support of IAM Roles for Service Accounts for authentication.
## Summary of changes
* Obtain aws 15m-long credentials
* Retrieve redis password from credentials
* Update every 1h to keep connection for more than 12h
* For now allow to have different endpoints for pubsub/stream redis.
TODOs:
* PubSub doesn't support credentials refresh, consider using stream
instead.
* We need an AWS role for proxy to be able to connect to both: S3 and
elasticache.
Credentials obtaining and connection refresh was tested on xenon
preview.
https://github.com/neondatabase/cloud/issues/10365
## Problem
for HTTP/WS/password hack flows we imitate SCRAM to validate passwords.
This code was unnecessarily complicated.
## Summary of changes
Copy in the `pbkdf2` and 'derive keys' steps from the
`postgres_protocol` crate in our `rust-postgres` fork. Derive the
`client_key`, `server_key` and `stored_key` from the password directly.
Use constant time equality to compare the `stored_key` and `server_key`
with the ones we are sent from cplane.
## Problem
Storage controller had basically no metrics.
## Summary of changes
1. Migrate the existing metrics to use Conrad's
[`measured`](https://docs.rs/measured/0.0.14/measured/) crate.
2. Add metrics for incoming http requests
3. Add metrics for outgoing http requests to the pageserver
4. Add metrics for outgoing pass through requests to the pageserver
5. Add metrics for database queries
Note that the metrics response for the attachment service does not use
chunked encoding like the rest of the metrics endpoints. Conrad has
kindly extended the crate such that it can now be done. Let's leave it
for a follow-up since the payload shouldn't be that big at this point.
Fixes https://github.com/neondatabase/neon/issues/6875
## Problem
faster sha2 hashing.
## Summary of changes
enable asm feature for sha2. this feature will be default in sha2 0.11,
so we might as well lean into it now. It provides a noticeable speed
boost on macos aarch64. Haven't tested on x86 though
refs https://github.com/neondatabase/neon/issues/7136
Problem
-------
Before this PR, we were using
`tokio_epoll_uring::thread_local_system()`,
which panics on tokio_epoll_uring::System::launch() failure
As we've learned in [the
past](https://github.com/neondatabase/neon/issues/6373#issuecomment-1905814391),
some older Linux kernels account io_uring instances as locked memory.
And while we've raised the limit in prod considerably, we did hit it
once on 2024-03-11 16:30 UTC.
That was after we enabled tokio-epoll-uring fleet-wide, but before
we had shipped release-5090 (c6ed86d3d0)
which did away with the last mass-creation of tokio-epoll-uring
instances as per
commit 3da410c8fe
Author: Christian Schwarz <christian@neon.tech>
Date: Tue Mar 5 10:03:54 2024 +0100
tokio-epoll-uring: use it on the layer-creating code paths (#6378)
Nonetheless, it highlighted that panicking in this situation is probably
not ideal, as it can leave the pageserver process in a semi-broken
state.
Further, due to low sampling rate of Prometheus metrics, we don't know
much about the circumstances of this failure instance.
Solution
--------
This PR implements a custom thread_local_system() that is
pageserver-aware
and will do the following on failure:
- dump relevant stats to `tracing!`, hopefully they will be useful to
understand the circumstances better
- if it's the locked memory failure (or any other ENOMEM): abort() the
process
- if it's ENOMEM, retry with exponential back-off, capped at 3s.
- add metric counters so we can create an alert
This makes sense in the production environment where we know that
_usually_, there's ample locked memory allowance available, and we know
the failure rate is rare.
fixes https://github.com/neondatabase/neon/issues/7116
Changes:
- refactor PageServerConfigBuilder: support not-set values
- implement runtime feature test
- use runtime feature test to determine `virtual_file_io_engine` if not
explicitly configured in the config
- log the effective engine at startup
- drive-by: improve assertion messages in `test_pageserver_init_node_id`
This needed a tiny bit of tokio-epoll-uring work, hence bumping it.
Changelog:
```
git log --no-decorate --oneline --reverse 868d2c42b5d54ca82fead6e8f2f233b69a540d3e..342ddd197a060a8354e8f11f4d12994419fff939
c7a74c6 Bump mio from 0.8.8 to 0.8.11
4df3466 Bump mio from 0.8.8 to 0.8.11 (#47)
342ddd1 lifecycle: expose `LaunchResult` enum (#49)
```
## Problem
If a pageserver was offline when the storage controller started, there
was no mechanism to update the
storage controller state when the pageserver becomes active.
## Summary of changes
* Add a heartbeater module. The heartbeater must be driven by an
external loop.
* Integrate the heartbeater into the service.
- Extend the types used by the service and scheduler to keep track of a
nodes' utilisation score.
- Add a background loop to drive the heartbeater and update the state
based on the deltas it generated
- Do an initial round of heartbeats at start-up
## Problem
Shard splits worked, but weren't safe against failures (e.g. node crash
during split) yet.
Related: #6676
## Summary of changes
- Introduce async rwlocks at the scope of Tenant and Node:
- exclusive tenant lock is used to protect splits
- exclusive node lock is used to protect new reconciliation process that
happens when setting node active
- exclusive locks used in both cases when doing persistent updates (e.g.
node scheduling conf) where the update to DB & in-memory state needs to
be atomic.
- Add failpoints to shard splitting in control plane and pageserver
code.
- Implement error handling in control plane for shard splits: this
detaches child chards and ensures parent shards are re-attached.
- Crash-safety for storage controller restarts requires little effort:
we already reconcile with nodes over a storage controller restart, so as
long as we reset any incomplete splits in the DB on restart (added in
this PR), things are implicitly cleaned up.
- Implement reconciliation with offline nodes before they transition to
active:
- (in this context reconciliation means something like
startup_reconcile, not literally the Reconciler)
- This covers cases where split abort cannot reach a node to clean it
up: the cleanup will eventually happen when the node is marked active,
as part of reconciliation.
- This also covers the case where a node was unavailable when the
storage controller started, but becomes available later: previously this
allowed it to skip the startup reconcile.
- Storage controller now terminates on panics. We only use panics for
true "should never happen" assertions, and these cases can leave us in
an un-usable state if we keep running (e.g. panicking in a shard split).
In the unlikely event that we get into a crashloop as a result, we'll
rely on kubernetes to back us off.
- Add `test_sharding_split_failures` which exercises a variety of
failure cases during shard split.
tokio 1.36 has been out for a month.
Release notes don't indicate major changes.
Skimming through their issue tracker, I can't find open `C-bug` issues
that would affect us.
(My personal motivation for this is `JoinSet::try_join_next`.)
Nightly has added a bunch of compiler and linter warnings. There is also
two dependencies that fail compilation on latest nightly due to using
the old `stdsimd` feature name. This PR fixes them.
Rebased version of #5234, part of #6768
This consists of three parts:
1. A refactoring and new contract for implementing and testing
compaction.
The logic is now in a separate crate, with no dependency on the
'pageserver' crate. It defines an interface that the real pageserver
must implement, in order to call the compaction algorithm. The interface
models things like delta and image layers, but just the parts that the
compaction algorithm needs to make decisions. That makes it easier unit
test the algorithm and experiment with different implementations.
I did not convert the current code to the new abstraction, however. When
compaction algorithm is set to "Legacy", we just use the old code. It
might be worthwhile to convert the old code to the new abstraction, so
that we can compare the behavior of the new algorithm against the old
one, using the same simulated cases. If we do that, have to be careful
that the converted code really is equivalent to the old.
This inclues only trivial changes to the main pageserver code. All the
new code is behind a tenant config option. So this should be pretty safe
to merge, even if the new implementation is buggy, as long as we don't
enable it.
2. A new compaction algorithm, implemented using the new abstraction.
The new algorithm is tiered compaction. It is inspired by the PoC at PR
#4539, although I did not use that code directly, as I needed the new
implementation to fit the new abstraction. The algorithm here is less
advanced, I did not implement partial image layers, for example. I
wanted to keep it simple on purpose, so that as we add bells and
whistles, we can see the effects using the included simulator.
One difference to #4539 and your typical LSM tree implementations is how
we keep track of the LSM tree levels. This PR doesn't have a permanent
concept of a level, tier or sorted run at all. There are just delta and
image layers. However, when compaction starts, we look at the layers
that exist, and arrange them into levels, depending on their shapes.
That is ephemeral: when the compaction finishes, we forget that
information. This allows the new algorithm to work without any extra
bookkeeping. That makes it easier to transition from the old algorithm
to new, and back again.
There is just a new tenant config option to choose the compaction
algorithm. The default is "Legacy", meaning the current algorithm in
'main'. If you set it to "Tiered", the new algorithm is used.
3. A simulator, which implements the new abstraction.
The simulator can be used to analyze write and storage amplification,
without running a test with the full pageserver. It can also draw an SVG
animation of the simulation, to visualize how layers are created and
deleted.
To run the simulator:
cargo run --bin compaction-simulator run-suite
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
PR adds a simple at most 1Hz refreshed informational API for querying
pageserver utilization. In this first phase, no actual background
calculation is performed. Instead, the worst possible score is always
returned. The returned bytes information is however correct.
Cc: #6835
Cc: #5331