The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.
This is therefore a preparation PR (like similar PRs before it).
There is a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).
The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
`Attached(0)` tenant migrations can get stuck if the heatmap file has
not been uploaded.
## Summary of Changes
- Added a test to reproduce the issue.
- Introduced a `kick_secondary_downloads` config flag:
- Enabled in testing environments.
- Disabled in production (and in the new test).
- Updated `Attached(0)` locations to consider the number of secondaries
in their intent when deciding whether to download the heatmap.
## Problem
We need to specify the number of safekeepers for neon_local without
`testing` feature.
Also we need this option for testing different configurations of
safekeeper migration code.
We cannot set it in `neon_fixtures.py` and in the default config of
`neon_local` yet, because it will fail compatibility tests. I'll make a
separate PR with removing `cfg!("testing")` completely and specifying
this option in the config when this option reaches the release branch.
- Part of https://github.com/neondatabase/neon/issues/12298
## Summary of changes
- Add `timeline_safekeeper_count` config option to storcon and
neon_local
## Problem
`test_sharding_failures` is flaky due to interference from the
`background_reconcile` process.
The details are in the issue
https://github.com/neondatabase/neon/issues/12029.
## Summary of changes
- Use `reconcile_until_idle` to ensure a stable state before running
test assertions
- Added error tolerance in `reconcile_until_idle` test function (Failure
cases: 1, 3, 19, 20)
- Ignore the `Keeping extra secondaries` warning message since it i
retryable (Failure case: 2)
- Deduplicated code in `assert_rolled_back` and `assert_split_done`
- Added a log message before printing plenty of Node `X` seen on
pageserver `Y`
## Problem
Pageservers now expose a gRPC API on a separate address and port. This
must be registered with the storage controller such that it can be
plumbed through to the compute via cplane.
Touches #11926.
## Summary of changes
This patch registers the gRPC address and port with the storage
controller:
* Add gRPC address to `nodes` database table and `NodePersistence`, with
a Diesel migration.
* Add gRPC address in `NodeMetadata`, `NodeRegisterRequest`,
`NodeDescribeResponse`, and `TenantLocateResponseShard`.
* Add gRPC address flags to `storcon_cli node-register`.
These changes are backwards-compatible, since all structs will ignore
unknown fields during deserialization.
## Problem
`test_historic_storage_formats` uses `/tenant_import` to import historic
data. Tenant import does not create timelines onto safekeepers, because
they might already exist on some safekeeper set. If it does, then we may
end up with two different quorums accepting WAL for the same timeline.
If the tenant import is used in a real deployment, the administrator is
responsible for looking for the proper safekeeper set and migrate
timelines into storcon-managed timelines.
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Create timelines onto safekeepers manually after tenant import in
`test_historic_storage_formats`
- Add a note to tenant import that timelines will be not storcon-managed
after the import.
## Problem
The location config (which includes the stripe size) is stored on
pageserver disk.
For unsharded tenants we [do not include the shard identity in the
serialized
description](ad88ec9257/pageserver/src/tenant/config.rs (L64-L66)).
When the pageserver restarts, it reads that configuration and will use
the stripe size from there
and rely on storcon input from reattach for generation and mode.
The default deserialization is ShardIdentity::unsharded. This has the
new default stripe size of 2048.
Hence, for unsharded tenants we can be running with a stripe size
different from that the one in the
storcon observed state. This is not a problem until we shard split
without specifying a stripe size (i.e. manual splits via the UI or
storcon_cli). When that happens the new shards will use the 2048 stripe
size until storcon realises and switches them back. At that point it's
too late, since we've ingested data with the wrong stripe sizes.
## Summary of changes
Ideally, we would always have the full shard identity on disk. To
achieve this over two releases we do:
1. Always persist the shard identity in the location config on the PS.
2. Storage controller includes the stripe size to use in the re attach
response.
After the first release, we will start persisting correct stripe sizes
for any tenant shard that the storage controller
explicitly sends a location_conf. After the second release, the
re-attach change kicks in and we'll persist the
shard identity for all shards.
## Problem
If a shard split fails and must roll back, the tenant may hit a cold
start as the parent shard's files have already been removed from local
disk.
External contribution with minor adjustments, see
https://neondb.slack.com/archives/C08TE3203RQ/p1748246398269309.
## Summary of changes
Keep the parent shard's files on local disk until the split has been
committed, such that they are available if the spilt is rolled back. If
all else fails, the files will be removed on the next Pageserver
restart.
This should also be fine in a mixed version:
* New storcon, old Pageserver: the Pageserver will delete the files
during the split, storcon will log an error when the cleanup detach
fails.
* Old storcon, new Pageserver: the Pageserver will leave the parent's
files around until the next Pageserver restart.
The change looks good to me, but shard splits are delicate so I'd like
some extra eyes on this.
## Problem
Removed nodes can re-add themselves on restart if not properly
tombstoned. We need a mechanism (e.g. soft-delete flag) to prevent this,
especially in cases where the node is unreachable.
More details there: #12036
## Summary of changes
- Introduced `NodeLifecycle` enum to represent node lifecycle states.
- Added a string representation of `NodeLifecycle` to the `nodes` table.
- Implemented node removal using a tombstone mechanism.
- Introduced `/debug/v1/tombstone*` handlers to manage the tombstone
state.
## Problem
Importing timelines can't currently be deleted. This is problematic
because:
1. Cplane cannot delete failed imports and we leave the timeline behind.
2. The flow does not support user driven cancellation of the import
## Summary of changes
On the pageserver: I've taken the path of least resistance, extended
`TimelineOrOffloaded`
with a new variant and added handling in the right places. I'm open to
thoughts here,
but I think it turned out better than I was envisioning.
On the storage controller: Again, fairly simple business: when a DELETE
timeline request is
received, we remove the import from the DB and stop any finalization
tasks/futures. In order
to stop finalizations, we track them in-memory. For each finalizing
import, we associate a gate
and a cancellation token.
Note that we delete the entry from the database before cancelling any
finalizations. This is such
that a concurrent request can't progress the import into finalize state
and race with the deletion.
This concern about deleting an import with on-going finalization is
theoretical in the near future.
We are only going to delete importing timelines after the storage
controller reports the failure to
cplane. Alas, the design works for user driven cancellation too.
Closes https://github.com/neondatabase/neon/issues/11897
## Summary
The optimiser background loop could get delayed a lot by waiting for
timeouts trying to talk to offline nodes.
Fixes: #12056
## Solution
- Skip offline nodes in `get_top_tenant_shards`
Link to Devin run:
https://app.devin.ai/sessions/065afd6756734d33bbd4d012428c4b6e
Requested by: John Spray (john@neon.tech)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: John Spray <john@neon.tech>
## Problem
Test coverage of timeline imports is lacking.
## Summary of changes
This PR adds a chaos import test. It runs an import while injecting
various chaos events
in the environment. All the commits that follow the test fix various
issues that were surfaced by it.
Closes https://github.com/neondatabase/neon/issues/10191
Support timeline creations on the storage controller to opt out from
their creation on the safekeepers, introducing the read-only timelines
concept. Read only timelines:
* will never receive WAL of their own, so it's fine to not create them
on the safekeepers
* the property is non-transitive. children of read-only timelines aren't
neccessarily read-only themselves.
This feature can be used for snapshots, to prevent the safekeepers from
being overloaded by empty timelines that won't ever get written to. In
the current world, this is not a problem, because timelines are created
implicitly by the compute connecting to a safekeeper that doesn't have
the timeline yet. In the future however, where the storage controller
creates timelines eagerly, we should watch out for that.
We represent read-only timelines in the storage controller database so
that we ensure that they never touch the safekeepers at all. Especially
we don't want them to cause a mess during the importing process of the
timelines from the cplane to the storcon database.
In a hypothetical future where we have a feature to detach timelines
from safekeepers, we'll either need to find a way to distinguish the
two, or if not, asking safekeepers to list the (empty) timeline prefix
and delete everything from it isn't a big issue either.
This patch will unconditionally hit the new safekeeper timeline creation
path for read-only timelines, without them needing the
`--timelines-onto-safekeepers` flag enabled. This is done because it's
lower risk (no safekeepers or computes involved at all) and gives us
some initial way to verify at least some parts of that code in prod.
https://github.com/neondatabase/cloud/issues/29435https://github.com/neondatabase/neon/issues/11670
## Problem
Timeline imports do not have progress checkpointing. Any time that the
tenant is shut-down, all progress is lost
and the import restarts from the beginning when the tenant is
re-attached.
## Summary of changes
This PR adds progress checkpointing.
### Preliminaries
The **unit of work** is a `ChunkProcessingJob`. Each
`ChunkProcessingJob` deals with the import for a set of key ranges. The
job split is done by using an estimation of how many pages each job will
produce.
The planning stage must be **pure**: given a fixed set of contents in
the import bucket, it will always yield the same plan. This property is
enforced by checking that the hash of the plan is identical when
resuming from a checkpoint.
The storage controller tracks the progress of each shard in the import
in the database in the form of the **latest
job** that has has completed.
### Flow
This is the high level flow for the happy path:
1. On the first run of the import task, the import task queries storcon
for the progress and sees that none is recorded.
2. Execute the preparatory stage of the import
3. Import jobs start running concurrently in a `FuturesOrdered`. Every
time the checkpointing threshold of jobs has been reached, notify the
storage controller.
4. Tenant is detached and re-attached
5. Import task starts up again and gets the latest progress checkpoint
from the storage controller in the form of a job index.
6. The plan is computed again and we check that the hash matches with
the original plan.
7. Jobs are spawned from where the previous import task left off. Note
that we will not report progress after the completion of each job, so
some jobs might run twice.
Closes https://github.com/neondatabase/neon/issues/11568
Closes https://github.com/neondatabase/neon/issues/11664
## Problem
Import up-calls did not enforce the usage of the latest generation. The
import might have finished in one previous generation, but not in the
latest one. Hence, the controller might try to activate a timeline
before it is ready. In theory, that would be fine, but it's tricky to
reason about.
## Summary of Changes
Pageserver provides the current generation in the upcall to the storage
controller and the later validates the generation. If the generation is
stale, we return an error which stops progress of the import job. Note
that the import job will retry the upcall until the stale location is
detached.
I'll add some proper tests for this as part of the [checkpointing
PR](https://github.com/neondatabase/neon/pull/11862).
Closes https://github.com/neondatabase/neon/issues/11884
## Problem
Lifetime of imported timelines (and implicitly the import background
task) has some shortcomings:
1. Timeline activation upon import completion is tricky. Previously, a
timeline that finished importing
after a tenant detach would not get activated and there's concerns about
the safety of activating
concurrently with shut-down.
2. Import jobs can prevent tenant shut down since they hold the tenant
gate
## Summary of Changes
Track the import tasks in memory and abort them explicitly on tenant
shutdown.
Integrate more closely with the storage controller:
1. When an import task has finished all of its jobs, it notifies the
storage controller, but **does not** mark the import as done in the
index_part. When all shards have finished importing, the storage
controller will call the `/activate_post_import` idempotent endpoint for
all of them. The handler, marks the import complete in index part,
resets the tenant if required and checks if the timeline is active yet.
2. Not directly related, but the import job now gets the starting state
from the storage controller instead of the import bucket. This paves the
way for progress checkpointing.
Related: https://github.com/neondatabase/neon/issues/11568
## Problem
SK timeline creations were skipped for imported timelines since we
didn't know the correct start LSN
of the timeline at that point.
## Summary of changes
Created imported timelines on the SK as part of the import finalize
step.
We use the last record LSN of shard 0 as the start LSN for the
safekeeper timeline.
Closes https://github.com/neondatabase/neon/issues/11569
## Problem
During deployment drains/fills, we often see the storage controller
giving up on warmups after 20 seconds, when the warmup is nearly
complete (~90%). This can cause latency spikes for migrated tenants if
they block on layer downloads.
Touches https://github.com/neondatabase/cloud/issues/26193.
## Summary of changes
Increase the drain and fill secondary warmup timeout from 20 to 30
seconds.
## Problem
When aborting a split, the code accidentally removes all other tenant
shards from the in-memory map that have the same shard count as the
aborted split, causing "tenant not found" errors. It will recover on a
storcon restart, when it loads the persisted state. This issue has been
present for at least a year.
Resolves https://github.com/neondatabase/cloud/issues/28589.
## Summary of changes
Only remove shards belonging to the relevant tenant when aborting a
split.
Also adds a regression test.
## Problem
In #11727 I overlooked the case of multiple attached locations for shard
0.
I misread the code and thought `create_one` acts on one location, but it
actually acts on one _shard_, which is potentially multiple locations.
This was not a regression, but it meant that the fix was incomplete.
## Summary of changes
- In `create_one`, when updating shard zero, have any "other" locations
use the initdb from shard 0
Right now we only support running one reconciliation per safekeeper.
This is of course usually way below of what a safekeeper can do.
Therefore, introduce a semaphore and spawn the tasks asynchronously as
they come in.
Part of #11670
## Problem
In princple, pageservers with different postgres binaries might generate
different initdbs, resulting in inconsistency between shards. To avoid
that, we should have shard 0 generate the initdb and other shards re-use
it.
Fixes: https://github.com/neondatabase/neon/issues/11340
## Summary of changes
- For shards with index greater than zero, set
`existing_initdb_timeline_id` in timeline creation to consume the
existing initdb rather than creating a new one
## Problem
Shard splits break timeline imports.
## Summary of Changes
Ensure mutual exclusion for imports and shard splits.
On the shard split code path:
1. Right before shard splitting, check the database to ensure that
no-import is on-going for the tenant. Exclusion is guaranteed because
this validation is done while holding the exclusive tenant lock.
Timeline creation (and import creation implicitly) requires a shared
tenant lock.
2. When selecting a shard to split, use the in-mem state to exclude
shards with an on-going import. This is opportunistic since an import
might start after the check, but allows shard splits to make progres
instead of continously retrying to split the same shard.
On the timeline creation code path:
1. Check the in-memory splitting flag on all shards of the tenant. If
any of them are splitting, error out asking the client to retry. On the
happy path this is not required, due to the tenant lock set-up described
above, but it covers the case where we restart with a pending
shard-split.
Closes https://github.com/neondatabase/neon/issues/11567
## Problem
In https://github.com/neondatabase/neon/pull/11345 coordination of
imports moved to the storage controller.
It involves notifying cplane when the import has been completed by
calling an idempotent endpoint.
If the storage controller shuts down in the middle of finalizing an
import, it would never be retried.
## Summary of changes
Reconcile imports at start-up by fetching the complete imports from the
database and spawning a background
task which notifies cplane.
Closes: https://github.com/neondatabase/neon/issues/11570
## Problem
We saw the following scenario in staging:
1. Pod A starts up. Becomes leader and steps down the previous pod
cleanly.
2. Pod B starts up (deployment).
3. Step down request from pod B to pod A times out. Pod A did not manage
to stop its reconciliations within 10 seconds and exited with return
code 1
([code](7ba8519b43/storage_controller/src/service.rs (L8686-L8702))).
4. Pod B marks itself as the leader and finishes start-up
5. k8s restarts pod A
6. k8s marks pod B as ready
7. pod A sends step down request to pod A - this succeeds => pod A is
now the leader
8. k8s kills pod A because it thinks pod B is healthy and pod A is part
of the old replica set
We end up in a situation where the only pod we have (B) is stepped down
and attempts to forward requests to a leader that doesn't exist. k8s
can't detect that pod B is in a bad state since the /status endpoint
simply returns 200 hundred if the pod is running.
## Summary of changes
This PR includes a number of robustness improvements to the leadership
protocol:
* use a single step down task per controller
* add a new endpoint to be used as k8s liveness probe and check
leadership status there
* handle restarts explicitly (i.e. don't step yourself down)
* increase the step down retry count
* don't kill the process on long step down since k8s will just restart
it
## Problem
Pageservers notify control plane directly when a shard import has
completed.
Control plane has to download the status of each shard from S3 and
figure out if everything is truly done,
before proceeding with branch activation.
Issues with this approach are:
* We can't control shard split behaviour on the storage controller side.
It's unsafe to split
during import.
* Control plane needs to know about shards and implement logic to check
all timelines are indeed ready.
## Summary of changes
In short, storage controller coordinates imports, and, only when
everything is done, notifies control plane.
Big rocks:
1. Store timeline imports in the storage controller database. Each
import stores the status of its shards in the database.
We hook into the timeline creation call as our entry point for this.
2. Pageservers get a new upcall endpoint to notify the storage
controller of shard import updates.
3. Storage controller handles these updates by updating persisted state.
If an update finalizes the import,
then poll pageservers until timeline activation, and, then, notify the
control plane that the import is complete.
Cplane side change with new endpoint is in
https://github.com/neondatabase/cloud/pull/26166
Closes https://github.com/neondatabase/neon/issues/11566
This delivers some additional fixes and improvements to storcon managed
safekeeper timelines:
* use `i32::MAX` for the generation number of timeline deletion
* start the generation for new timelines at 1 instead of 0: this ensures
that the other components actually are generation enabled
* fix database operations we use for metrics
* use join in list_pending_ops to prevent the classical ORM issue where
one does many db queries
* use enums in `test_storcon_create_delete_sk_down`. we are adding a
second parameter, and having two bool parameters is weird.
* extend `test_storcon_create_delete_sk_down` with a test of whole
tenant deletion. this hasn't been tested before.
* remove some redundant logging contexts
* Don't require mutable access to the service lock for scheduling
pending ops in memory. In order to pull this off, create reconcilers
eagerly. The advantage is that we don't need mutable access to the
service lock that way any more.
Part of #9011
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
We have already migrated the storage controller to
`--control-plane-url`, added in #11173. The new param was added to
support also safekeeper specific endpoints. See the docs changes in
#11195 for further details.
Part of #11163
## Problem
The `pagebench` benchmarks set up an initial dataset by creating a
template tenant, copying the remote storage to a bunch of new tenants,
and attaching them to Pageservers.
In #11420, we found that
`test_pageserver_characterize_throughput_with_n_tenants` had degraded
performance because it set a custom tenant config in Pageservers that
was then replaced with the default tenant config by the storage
controller.
The initial fix was to register the tenants directly in the storage
controller, but this created the tenants with generation 1. This broke
`test_basebackup_with_high_slru_count`, where the template tenant was at
generation 2, leading to all layer files at generation 2 being ignored.
Resolves#11485.
Touches #11381.
## Summary of changes
This patch addresses both test issues by modifying `attach_hook` to also
take a custom tenant config. This allows attaching tenants to
Pageservers from pre-existing remote storage, specifying both the
generation and tenant config when registering them in the storage
controller.
## Problem
The graceful leadership transfer process involves calling step_down on
the old controller, but this was not waiting for shard splits to
complete, and the new controller could therefore end up trying to abort
a shard split while it was still going on.
We mitigated this already in #11256 by avoiding the case where shard
split completion would update the database incorrectly, but this was a
fragile fix because it assumes that is the only problematic part of the
split running concurrently.
Precursors:
- #11290
- #11256Closes: #11254
## Summary of changes
- Hold the reconciler gate from shard splits, so that step_down will
wait for them. Splits should always be fairly prompt, so it is okay to
wait here.
- Defense in depth: if step_down times out (hardcoded 10 second limit),
then fully terminate the controller process rather than letting it
continue running, potentially doing split-brainy things. This makes
sense because the new controller will always declare itself leader
unilaterally if step_down fails, so leaving an old controller running is
not beneficial.
- Tests: extend
`test_storage_controller_leadership_transfer_during_split` to separately
exercise the case of a split holding up step_down, and the case where
the overall timeout on step_down is hit and the controller terminates.
## Problem
The current stripe size of 256 MB is a bit large, and can cause load
imbalances across shards. A stripe size of 16 MB appears more reasonable
to avoid hotspots, although we don't see evidence of this in benchmarks.
Resolves https://github.com/neondatabase/cloud/issues/25634.
Touches https://github.com/neondatabase/cloud/issues/21870.
## Summary of changes
* Change the default stripe size to 16 MB.
* Remove `ShardParameters::DEFAULT_STRIPE_SIZE`, and only use
`pageserver_api::shard::DEFAULT_STRIPE_SIZE`.
* Update a bunch of tests that assumed a certain stripe size.
## Problem
Storcon will not start up if `use_https` is on and there are some
pageservers or safekeepers without https port in the database. Metrics
"how many nodes with https we have in DB" will help us to make sure that
`use_https` may be turned on safely.
- Part of https://github.com/neondatabase/cloud/issues/25526
## Summary of changes
- Add `storage_controller_https_pageserver_nodes`,
`storage_controller_safekeeper_nodes` and
`storage_controller_https_safekeeper_nodes` Prometheus metrics.
## Problem
`tenant_import`, used to import an existing tenant from remote storage
into a storage controller for support and debugging, assumed
`DEFAULT_STRIPE_SIZE` since this can't be recovered from remote storage.
In #11168, we are changing the stripe size, which will break
`tenant_import`.
Resolves#11175.
## Summary of changes
* Add `stripe_size` to the tenant manifest.
* Add `TenantScanRemoteStorageShard::stripe_size` and return from
`tenant_scan_remote` if present.
* Recover the stripe size during`tenant_import`, or fall back to 32768
(the original default stripe size).
* Add tenant manifest compatibility snapshot:
`2025-04-08-pgv17-tenant-manifest-v1.tar.zst`
There are no cross-version concerns here, since unknown fields are
ignored during deserialization where relevant.
## Problem
`test_scrubber_tenant_snapshot` is flaky with `request was dropped`
errors. More details are in the issue.
- Closes: https://github.com/neondatabase/neon/issues/11278
## Summary of changes
- Disable shard scheduling during pageservers restart
- Add `reconcile_until_idle` in the end of the test
## Problem
There are some places in the code where we create `reqwest::Client`
without providing SSL CA certs from `ssl_ca_file`. These will break
after we enable TLS everywhere.
- Part of https://github.com/neondatabase/cloud/issues/22686
## Summary of changes
- Support `ssl_ca_file` in storage scrubber.
- Add `use_https_safekeeper_api` option to safekeeper to use https for
peer requests.
- Propagate SSL CA certs to storage_controller/client, storcon's
ComputeHook, PeerClient and maybe_forward.
## Problem
While working on bulk import, I want to use the `control-plane-url` flag
for a different request.
Currently, the local compute hook is used whenever no control plane is
specified in the config.
My test requires local compute notifications and a configured
`control-plane-url` which isn't supported.
## Summary of changes
Add a `use-local-compute-notifications` flag. When this is set, we use
the local flow regardless of other config values.
It's enabled by default in neon_local and disabled by default in all
other envs. I had to turn the flag off in tests
that wish to bypass the local flow, but that's expected.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
- We need to support multiple SSL CA certificates for graceful root CA
certificate rotation.
- Closes: https://github.com/neondatabase/cloud/issues/25971
## Summary of changes
- Parses `ssl_ca_file` as a pem bundle, which may contain multiple
certificates. Single pem cert is a valid pem bundle, so the change is
backward compatible.
## Problem
- Part of https://github.com/neondatabase/neon/issues/11113
- Building a new `reqwest::Client` for every request is expensive
because it parses CA certs under the hood. It's noticeable in storcon's
flamegraph.
## Summary of changes
- Reuse one `reqwest::Client` for all API calls to avoid parsing CA
certificates every time.
## Problem
Issue https://github.com/neondatabase/neon/issues/11254 describes a case
where restart during a shard split can result in a bad end state in the
database.
## Summary of changes
- Add a reproducer for the issue
- Tighten an existing safety check around updated row counts in
complete_shard_split
## Problem
Currently, we only split tenants into 8 shards once, at the 64 GB split
threshold. For very large tenants, we need to keep splitting to avoid
huge shards. And we also want to eagerly split at a lower threshold to
improve throughput during initial ingestion.
See
https://github.com/neondatabase/cloud/issues/22532#issuecomment-2706215907
for details.
Touches https://github.com/neondatabase/cloud/issues/22532.
Requires #11157.
## Summary of changes
This adds parameters and logic to enable repeated splits when a tenant's
largest timeline divided by shard count exceeds `split_threshold`, as
well as eager initial splits at a lower threshold to speed up initial
ingestion. The default parameters are all set such that they retain the
current behavior in production (only split into 8 shards once, at 64
GB).
* `split_threshold` now specifies a maximum shard size. When a shard
exceeds it, all tenant shards are split by powers of 2 such that all
tenant shards fall below `split_threshold`. Disabled by default, like
today.
* Add `max_split_shards` to specify a max shard count for autosplits.
Defaults to 8 to retain current behavior.
* Add `initial_split_threshold` and `initial_split_shards` to specify a
threshold and target count for eager splits of unsharded tenants.
Defaults to 64 GB and 8 shards to retain current production behavior.
Because this PR sets `initial_split_threshold` to 64 GB by default, it
has the effect of enabling autosplits by default. This was not the case
previously, since `split_threshold` defaults to None, but it is already
enabled across production and staging. This is temporary until we
complete the production rollout.
For more details, see code comments.
This must wait until #11157 has been deployed to Pageservers.
Once this has been deployed to production, we plan to change the
parameters to:
* `split-threshold`: 256 GB
* `initial-split-threshold`: 16 GB
* `initial-split-shards`: 4
* `max-split-shards`: 16
The final split points will thus be:
* Start: 1 shard
* 16 GB: 4 shards
* 1 TB: 8 shards
* 2 TB: 16 shards
We will then change the default settings to be disabled by default.
---------
Co-authored-by: John Spray <john@neon.tech>
If a tenant gets deleted, delete also all of its timelines. We assume
that by the time a tenant is being deleted, no new timelines are being
created, so we don't need to worry about races with creation in this
situation.
Unlike #11233, which was very simple because it listed the timelines and
invoked timeline deletion, this PR obtains a list of safekeepers to
invoke the tenant deletion on, and then invokes tenant deletion on each
safekeeper that has one or multiple timelines.
Alternative to #11233
Builds on #11288
Part of #9011
## Problem
Pageservers use unencrypted HTTP requests for storage controller API.
- Closes: https://github.com/neondatabase/cloud/issues/25524
## Summary of changes
- Replace hyper0::server::Server with http_utils::server::Server in
storage controller.
- Add HTTPS handler for storage controller API.
- Support `ssl_ca_file` in pageserver.
There is no functional change here. We move safekeeper related code from
`service.rs` to `service/safekeeper_service.rs`, so that safekeeper
related stuff is contained in a single file. This also helps with
preventing `service.rs` from growing even further.
Part of #9011.
PRs #10891 and #10902 have time-bounded the safekeeper heartbeating of
the storage controller. Those timeouts were not meant to be permanent,
but temporary until we figured out the reasons for the safekeeper
heartbeating causing problems.
Now they are better understood and resolved. A comment is
[here](https://github.com/neondatabase/cloud/issues/24396#issuecomment-2679342929),
but most importantly, we've had:
* #10954 to send heartbeats concurrently (before the issue was we sent
them sequentially, so the total time time was number of nodes times time
for timeout to be hit, now the total time is the maximum of all things
we are heartbeating)
* work to actually make heartbeats work and not error, i.e. JWT rollout
for storcon, not sending heartbeats to decomissioned safekeepers,
removal of decomissioned safekeepers from the databases
Part of https://github.com/neondatabase/cloud/issues/25473
Adds a basic test that makes the storcon issue explicit creation of a
timeline on safeekepers (main storcon PR in #11058). It was adapted from
`test_explicit_timeline_creation` from #11002.
Also, do a bunch of fixes needed to get the test work (the API
definitions weren't correct), and log more stuff when we can't create a
new timeline due to no safekeepers being active.
Part of #9011
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
## Problem
There is a rare race between controller graceful deployment and shard
splitting where we may incorrectly both abort _and_ complete the split
(on different pods), and thereby leave no shards at all in the database.
Related: #11254
## Summary of changes
- In complete_shard_split, refuse to delete anything if child shards are
not found
# Refs
- fixes https://github.com/neondatabase/neon/issues/11228
# Problem High-Level
When storcon validates whether a `ScheduleOptimizationAction` should be
applied, it retrieves the `tenant_secondary_status` to determine whether
a secondary is ready for the optimization.
When collecting results, it associates secondary statuses with the wrong
optimization actions in the batch of optimizations that we're
validating.
The result is that we make the decision for shard/location X based on
the SecondaryStatus of a random secondary location Y in the current
batch of optimizations.
A possible symptom is an early cutover, as seen in this engineering
investigation here:
- https://github.com/neondatabase/cloud/issues/25734
# Problem Code-Level
This code here in `optimize_all_validate`
97e2e27f68/storage_controller/src/service.rs (L7012-L7029)
zips the `want_secondary_status` with the Vec returned from
`tenant_for_shards_api` .
However, the Vec returned from `want_secondary_status` is not ordered
(it uses FuturesUnordered internally).
# Solution
Sort the Vec in input order before returning it.
`optimize_all_validate` was the only caller affected by this problem
While at it, also future-proof similar-looking function
`tenant_for_shards`.
None of its callers care about the order, but this type of function
signature is easy to use incorrectly.
# Future Work
Avoid the additional iteration, map, and allocation.
Change API to leverage AsyncFn (async closure).
And/or invert `tenant_for_shards_api` into a Future ext trait / iterator
adaptor thing.
## Problem
Autosplits always request `DEFAULT_STRIPE_SIZE` for splits. However,
splits do not allow changing the stripe size of already-sharded tenants,
and will error out if it differs.
In #11168, we are changing the stripe size, which could hit this when
attempting to autosplit already sharded tenants.
Touches #11168.
## Summary of changes
Pass `new_stripe_size: None` when autosplitting already sharded tenants.
Otherwise, pass `DEFAULT_STRIPE_SIZE` instead of the shard identity's
stripe size, since we want to use the current default rather than an
old, persisted default.