Commit Graph

4852 Commits

Author SHA1 Message Date
John Spray
23416cc358 docs: sharding phase 1 RFC (#5432)
We need to shard our Tenants to support larger databases without those
large databases dominating our pageservers and/or requiring dedicated
pageservers.

This RFC aims to define an initial capability that will permit creating
large-capacity databases using a static configuration
defined at time of Tenant creation.

Online re-sharding is deferred as future work, as is offloading layers
for historical reads. However, both of these capabilities would be
implementable without further changes to the control plane or compute:
this RFC aims to define the cross-component work needed to bootstrap
sharding end-to-end.
2024-03-15 11:14:25 +00:00
Anna Khanova
46098ea0ea proxy: add more missing warm logging (#7133)
## Problem

There is one more missing thing about cached connections for
`cold_start_info`.

## Summary of changes

Fix and add comments.
2024-03-15 11:13:15 +00:00
Conrad Ludgate
49bc734e02 proxy: add websocket regression tests (#7121)
## Problem

We have no regression tests for websocket flow

## Summary of changes

Add a hacky implementation of the postgres protocol over websockets just
to verify the protocol behaviour does not regress over time.
2024-03-15 10:21:48 +01:00
Alex Chi Z
76c44dc140 spec: disable neon extension auto upgrade (#7128)
This pull request disables neon extension auto upgrade to help the next
compute image upgrade smooth.

## Summary of changes

We have two places to auto-upgrade neon extension: during compute spec
update, and when the compute node starts. The compute spec update logic
is always there, and the compute node start logic is added in
https://github.com/neondatabase/neon/pull/7029. In this pull request, we
disable both of them, so that we can still roll back to an older version
of compute before figuring out the best way of extension
upgrade-downgrade. https://github.com/neondatabase/neon/issues/6936

We will enable auto-upgrade in the next release following this release.

There are no other extension upgrades from release 4917 and therefore
after this pull request, it would be safe to revert to release 4917.

Impact:

* Project created after unpinning the compute image -> if we need to
roll back, **they will stuck**, because the default neon extension
version is 1.3. Need to manually pin the compute image version if such
things happen.
* Projects already stuck on staging due to not downgradeable -> I don't
know their current status, maybe they are already running the latest
compute image?
* Other projects -> can be rolled back to release 4917.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-14 19:45:38 +00:00
Joonas Koivunen
58ef78cf41 doc(README): note cargo-nextest usage (#7122)
We have been using #5681 for quite some time, and at least since #6931
the tests have assumed `cargo-nextest` to work around our use of global
statics. Unlike the `cargo test`, the `cargo nextest run` runs each test
as a separate process that can be timeouted.

Add a mention of using `cargo-nextest` in the top-level README.md.
Sub-crates can still declare they support `cargo test`, like
`compute_tools/README.md` does.
2024-03-14 18:49:42 +00:00
John Spray
678ed39de2 storage controller: validate DNS of registering nodes (#7101)
A node with a bad DNS configuration can register itself with the storage
controller, and the controller will try and schedule work onto the node,
but never succeed because it can't reach the node.

The DNS case is a special case of asymmetric network issues. The general
case isn't covered here -- but might make sense to tighten up after
#6844 merges -- then we can avoid assuming a node is immediately
available in re_attach.
2024-03-14 16:48:38 +00:00
Vlad Lazar
3d8830ac35 test_runner: re-enable large slru benchmark (#7125)
Previously disabled due to
https://github.com/neondatabase/neon/issues/7006.
2024-03-14 16:47:32 +00:00
Vlad Lazar
38767ace68 storage_controller: periodic pageserver heartbeats (#7092)
## Problem
If a pageserver was offline when the storage controller started, there
was no mechanism to update the
storage controller state when the pageserver becomes active.

## Summary of changes
* Add a heartbeater module. The heartbeater must be driven by an
external loop.
* Integrate the heartbeater into the service.
- Extend the types used by the service and scheduler to keep track of a
nodes' utilisation score.
- Add a background loop to drive the heartbeater and update the state
based on the deltas it generated
  - Do an initial round of heartbeats at start-up
2024-03-14 15:21:36 +00:00
Arseny Sher
9fe0193e51 Bump vendor/postgres v15 v14. 2024-03-14 18:06:53 +04:00
Christian Schwarz
8075f0965a fix(test suite) virtual_file_io_engine and get_vectored_impl patametrization doesn't work (#7113)
# Problem

While investigating #7124, I noticed that the benchmark was always using
the `DEFAULT_*` `virtual_file_io_engine` , i.e., `tokio-epoll-uring` as
of https://github.com/neondatabase/neon/pull/7077.

The fundamental problem is that the `control_plane` code has its own
view of `PageServerConfig`, which, I believe, will always be a subset of
the real pageserver's `pageserver/src/config.rs`.

For the `virtual_file_io_engine` and `get_vectored_impl` parametrization
of the test suite, we were constructing a dict on the Python side that
contained these parameters, then handed it to
`control_plane::PageServerConfig`'s derived `serde::Deserialize`.
The default in serde is to ignore unknown fields, so, the Deserialize
impl silently ignored the fields.
In consequence, the fields weren't propagated to the `pageserver --init`
call, and the tests ended up using the
`pageserver/src/config.rs::DEFAULT_` values for the respective options
all the time.

Tests that explicitly used overrides in `env.pageserver.start()` and
similar were not affected by this.

But, it means that all the test suite runs where with parametrization
didn't properly exercise the code path.

# Changes

- use `serde(deny_unknown_fields)` to expose the problem  
- With this change, the Python tests that override
`virtual_file_io_engine` and
`get_vectored_impl` fail on `pageserver --init`, exposing the problem.
- use destructuring to uncover the issue in the future
- fix the issue by adding the missing fields to the `control_plane`
crate's `PageServerConf`
- A better solution would be for control plane to re-use a struct
provided
    by the pageserver crate, so that everything is in one place in
    `pageserver/src/config.rs`, but, our config parsing code is (almost)
    beyond repair anyways.
- fix the `pageserver_virtual_file_io_engine` to be responsive to the
env var
  - => required to make parametrization work in benchmarks

# Testing

Before merging this PR, I re-ran the regression tests & CI with the full
matrix of `virtual_file_io_engine` and `tokio-epoll-uring`, see
9c7ea364e0
2024-03-14 11:18:55 +00:00
John Spray
44f42627dd pageserver/controller: error handling for shard splitting (#7074)
## Problem

Shard splits worked, but weren't safe against failures (e.g. node crash
during split) yet.

Related: #6676 

## Summary of changes

- Introduce async rwlocks at the scope of Tenant and Node:
  - exclusive tenant lock is used to protect splits
- exclusive node lock is used to protect new reconciliation process that
happens when setting node active
- exclusive locks used in both cases when doing persistent updates (e.g.
node scheduling conf) where the update to DB & in-memory state needs to
be atomic.
- Add failpoints to shard splitting in control plane and pageserver
code.
- Implement error handling in control plane for shard splits: this
detaches child chards and ensures parent shards are re-attached.
- Crash-safety for storage controller restarts requires little effort:
we already reconcile with nodes over a storage controller restart, so as
long as we reset any incomplete splits in the DB on restart (added in
this PR), things are implicitly cleaned up.
- Implement reconciliation with offline nodes before they transition to
active:
- (in this context reconciliation means something like
startup_reconcile, not literally the Reconciler)
- This covers cases where split abort cannot reach a node to clean it
up: the cleanup will eventually happen when the node is marked active,
as part of reconciliation.
- This also covers the case where a node was unavailable when the
storage controller started, but becomes available later: previously this
allowed it to skip the startup reconcile.
- Storage controller now terminates on panics. We only use panics for
true "should never happen" assertions, and these cases can leave us in
an un-usable state if we keep running (e.g. panicking in a shard split).
In the unlikely event that we get into a crashloop as a result, we'll
rely on kubernetes to back us off.
- Add `test_sharding_split_failures` which exercises a variety of
failure cases during shard split.
2024-03-14 09:11:57 +00:00
Conrad Ludgate
3bd6551b36 proxy http cancellation safety (#7117)
## Problem

hyper auto-cancels the request futures on connection close.
`sql_over_http::handle` is not 'drop cancel safe', so we need to do some
other work to make sure connections are queries in the right way.

## Summary of changes

1. tokio::spawn the request handler to resolve the initial cancel-safety
issue
2. share a cancellation token, and cancel it when the request `Service`
is dropped.
3. Add a new log span to be able to track the HTTP connection lifecycle.
2024-03-14 08:20:56 +00:00
Christian Schwarz
69338e53e3 throttling: fixup interactions with Timeline::get_vectored (#7089)
## Problem

Before this PR, `Timeline::get_vectored` would be throttled twice if the
sequential option was enabled or if validation was enabled.

Also, `pageserver_get_vectored_seconds` included the time spent in the
throttle, which turns out to be undesirable for what we use that metric
for.

## Summary of changes

Double-throttle:

* Add `Timeline::get0` method which is unthrottled.
* Use that method from within the `Timeline::get_vectored` code path.

Metric:

* return throttled time from `throttle()` method
* deduct the value from the observed time
* globally rate-limited logging of duration subtraction errors, like in
all other places that do the throttled-time deduction from observations
2024-03-13 17:49:17 +00:00
Arpad Müller
5309711691 Make tenant_id in TenantLocationConfigRequest optional (#7055)
The `tenant_id` in `TenantLocationConfigRequest` in the
`location_config` endpoint was only used in the storage
controller/attachment service, and there it was only used for assertions
and the creation part.
2024-03-13 17:30:29 +01:00
Joonas Koivunen
8a53d576e6 fix(metrics): time individual layer flush operations (#7109)
Currently, the flushing operation could flush multiple frozen layers to
the disk and store the aggregate time in the histogram. The result is a
bimodal distribution with short and over 1000-second flushes. Change it
so that we record how long one layer flush takes.
2024-03-13 15:10:20 +00:00
Anna Khanova
b0aff04157 proxy: add new dimension to exclude cplane latency (#7011)
## Problem

Currently cplane communication is a part of the latency monitoring. It
doesn't allow to setup the proper alerting based on proxy latency.

## Summary of changes

Added dimension to exclude cplane latency.
2024-03-13 13:50:05 +01:00
Anna Khanova
0554bee022 proxy: Report warm cold start if connection is from the local cache (#7104)
## Problem

* quotes in serialized string
* no status if connection is from local cache

## Summary of changes

* remove quotes
* report warm if connection if from local cache
2024-03-13 11:45:19 +00:00
Conrad Ludgate
83855a907c proxy http error classification (#7098)
## Problem

Missing error classification for SQL-over-HTTP queries.
Not respecting `UserFacingError` for SQL-over-HTTP queries.

## Summary of changes

Adds error classification.
Adds user facing errors.
2024-03-13 07:35:49 +01:00
John Spray
1b41db8bdd pageserver: enable setting stripe size inline with split request. (#7093)
## Summary

- Currently we can set stripe size at tenant creation, but it doesn't
mean anything until we have multiple shards
- When onboarding an existing tenant, it will always get a default shard
stripe size, so we would like to be able to pick the actual stripe size
at the point we split.

## Why do this inline with a split?

The alternative to this change would be to have a separate endpoint on
the storage controller for setting the stripe size on a tenant, and only
permit writes to that endpoint when the tenant has only a single shard.
That would work, but be a little bit more work for a client, and not
appreciably simpler (instead of having a special argument to the split
functions, we'd have a special separate endpoint, and a requirement that
the controller must sync its config down to the pageserver before
calling the split API). Either approach would work, but this one feels a
bit more robust end-to-end: the split API is the _very last moment_ that
the stripe size is mutable, so if we aim to set it before splitting, it
makes sense to do it as part of the same operation.
2024-03-12 20:41:08 +00:00
Jure Bajic
bac06ea1ac pageserver: fix read path max lsn bug (#7007)
## Summary of changes
The problem it fixes is when `request_lsn` is `u64::MAX-1` the
`cont_lsn` becomes `u64::MAX` which is the same as `prev_lsn` which
stops the loop.

Closes https://github.com/neondatabase/neon/issues/6812
2024-03-12 16:32:47 +00:00
John Spray
7ae8364b0b storage controller: register nodes in re-attach request (#7040)
## Problem

Currently we manually register nodes with the storage controller, and
use a script during deploy to register with the cloud control plane.
Rather than extend that script further, nodes should just register on
startup.

## Summary of changes

- Extend the re-attach request to include an optional
NodeRegisterRequest
- If the `register` field is set, handle it like a normal node
registration before executing the normal re-attach work.
- Update tests/neon_local that used to rely on doing an explicit
register step that could be enabled/disabled.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-03-12 14:47:12 +00:00
Conrad Ludgate
1f7d54f987 proxy refactor tls listener (#7056)
## Problem

Now that we have tls-listener vendored, we can refactor and remove a lot
of bloated code and make the whole flow a bit simpler

## Summary of changes

1. Remove dead code
2. Move the error handling to inside the `TlsListener` accept() function
3. Extract the peer_addr from the PROXY protocol header and log it with
errors
2024-03-12 13:05:40 +00:00
Arthur Petukhovsky
580e136b2e Forward all backpressure feedback to compute (#7079)
Previously we aggregated ps_feedback on each safekeeper and sent it to
walproposer with every AppendResponse. This PR changes it to send
ps_feedback to walproposer right after receiving it from pageserver,
without aggregating it in memory. Also contains some preparations for
implementing backpressure support for sharding.
2024-03-12 12:14:02 +00:00
Conrad Ludgate
09699d4bd8 proxy: cancel http queries on timeout (#7031)
## Problem

On HTTP query timeout, we should try and cancel the current in-flight
SQL query.

## Summary of changes

Trigger a cancellation command in postgres once the timeout is reach
2024-03-12 11:52:00 +00:00
John Spray
89cf714890 tests/neon_local: rename "attachment service" -> "storage controller" (#7087)
Not a user-facing change, but can break any existing `.neon` directories
created by neon_local, as the name of the database used by the storage
controller changes.

This PR changes all the locations apart from the path of
`control_plane/attachment_service` (waiting for an opportune moment to
do that one, because it's the most conflict-ish wrt ongoing PRs like
#6676 )
2024-03-12 11:36:27 +00:00
Heikki Linnakangas
621ea2ec44 tests: try to make restored-datadir comparison tests not flaky v2
This test occasionally fails with a difference in "pg_xact/0000" file
between the local and restored datadirs. My hypothesis is that
something changed in the database between the last explicit checkpoint
and the shutdown. I suspect autovacuum, it could certainly create
transactions.

To fix, be more precise about the point in time that we compare. Shut
down the endpoint first, then read the last LSN (i.e. the shutdown
checkpoint's LSN), from the local disk with pg_controldata. And use
exactly that LSN in the basebackup.

Closes #559
2024-03-11 23:29:32 +04:00
Heikki Linnakangas
74d09b78c7 Keep walproposer alive until shutdown checkpoint is safe on safekepeers
The walproposer pretends to be a walsender in many ways. It has a
WalSnd slot, it claims to be a walsender by calling
MarkPostmasterChildWalSender() etc. But one different to real
walsenders was that the postmaster still treated it as a bgworker
rather than a walsender. The difference is that at shutdown,
walsenders are not killed until the very end, after the checkpointer
process has written the shutdown checkpoint and exited.

As a result, the walproposer always got killed before the shutdown
checkpoint was written, so the shutdown checkpoint never made it to
safekeepers. That's fine in principle, we don't require a clean
shutdown after all. But it also feels a bit silly not to stream the
shutdown checkpoint. It could be useful for initializing hot standby
mode in a read replica, for example.

Change postmaster to treat background workers that have called
MarkPostmasterChildWalSender() as walsenders. That unfortunately
requires another small change in postgres core.

After doing that, walproposers stay alive longer. However, it also
means that the checkpointer will wait for the walproposer to switch to
WALSNDSTATE_STOPPING state, when the checkpointer sends the
PROCSIG_WALSND_INIT_STOPPING signal. We don't have the machinery in
walproposer to receive and handle that signal reliably. Instead, we
mark walproposer as being in WALSNDSTATE_STOPPING always.

In commit 568f91420a, I assumed that shutdown will wait for all the
remaining WAL to be streamed to safekeepers, but before this commit
that was not true, and the test became flaky. This should make it
stable again.

Some tests wrongly assumed that no WAL could have been written between
pg_current_wal_flush_lsn and quick pg stop after it. Fix them by introducing
flush_ep_to_pageserver which first stops the endpoint and then waits till all
committed WAL reaches the pageserver.

In passing extract safekeeper http client to its own module.
2024-03-11 23:29:32 +04:00
Arseny Sher
0cf0731d8b SIGQUIT instead of SIGKILL prewarmed postgres.
To avoid orphaned processes using wiped datadir with confusing logging.
2024-03-11 22:36:52 +04:00
Sasha Krassovsky
98723844ee Don't return from inside PG_TRY (#7095)
## Problem
Returning from PG_TRY is a bug, and we currently do that

## Summary of changes
Make it break and then return false. This should also help stabilize
test_bad_connection.py
2024-03-11 18:36:39 +00:00
Alex Chi Z
73a8c97ac8 fix: warnings when compiling neon extensions (#7053)
proceeding https://github.com/neondatabase/neon/pull/7010, close
https://github.com/neondatabase/neon/issues/6188

## Summary of changes

This pull request (should) fix all warnings except
`-Wdeclaration-after-statement` in the neon extension compilation.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-11 17:49:58 +00:00
Christian Schwarz
17a3c9036e follow-up(#7077): adjust flaky-test-detection cutoff date for tokio-epoll-uring (#7090)
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-03-11 16:36:49 +00:00
Joonas Koivunen
8c5b310090 fix: Layer delete on drop and eviction can outlive timeline shutdown (#7082)
This is a follow-up to #7051 where `LayerInner::drop` and
`LayerInner::evict_blocking` were not noticed to require a gate before
the file deletion. The lack of entering a gate opens up a similar
possibility of deleting a layer file which a newer Timeline instance has
already checked out to be resident in a similar case as #7051.
2024-03-11 16:54:06 +01:00
Christian Schwarz
8224580f3e fix(tenant/timeline metrics): race condition during shutdown + recreation (#7064)
Tenant::shutdown or Timeline::shutdown completes and becomes externally
observable before the corresponding Tenant/Timeline object is dropped.

For example, after observing a Tenant::shutdown to complete, we could
attach the same tenant_id again. The shut down Tenant object might still
be around at the time of the attach.

The race is then the following:
- old object's metrics are still around
- new object uses with_label_values
- old object calls remove_label_values

The outcome is that the new object will have the metric objects (they're
an Arc internall) but the metrics won't be part of the internal registry
and hence they'll be missing in `/metrics`.

Later, when the new object gets shut down and tries to
remove_label_value, it will observe an error because
the metric was already removed by the old object.

Changes
-------

This PR moves metric removal to `shutdown()`.

An alternative design would be to multi-version the metrics using a
distinguishing label, or, to use a better metrics crate that allows
removing metrics from the registry through the locally held metric
handle instead of interacting with the (globally shared) registry.

refs https://github.com/neondatabase/neon/pull/7051
2024-03-11 15:41:41 +01:00
Christian Schwarz
2b0f3549f7 default to tokio-epoll-uring in CI tests & on Linux (#7077)
All of production is using it now as of
https://github.com/neondatabase/aws/pull/1121

The change in `flaky_tests.py` resets the flakiness detection logic.

The alternative would have been to repeat the choice of io engine in
each test name, which would junk up the various test reports too much.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-03-11 14:35:59 +00:00
John Spray
b4972d07d4 storage controller: refactor non-mutable members up into Service (#7086)
result_tx and compute_hook were in ServiceState (i.e. behind a sync
mutex), but didn't need to be.

Moving them up into Service removes a bunch of boilerplate clones.

While we're here, create a helper `Service::maybe_reconcile_shard` which
avoids writing out all the `&self.` arguments to
`TenantState::maybe_reconcile` everywhere we call it.
2024-03-11 14:29:32 +00:00
Joonas Koivunen
26ae7b0b3e fix(metrics): reset TENANT_STATE metric on startup (#7084)
Otherwise, it might happen that we never get to witness the same state
on subsequent restarts, thus the time series will show the value from a
few restarts ago.

The actual case here was that "Activating" was showing `3` while I was
doing tenant migration testing on staging. The number 3 was however from
a startup that happened some time ago which had been interrupted by
another deployment.
2024-03-11 13:25:53 +00:00
John Spray
f8483cc4a3 pageserver: update swagger for HA APIs (#7070)
- The type of heatmap_period in tenant config was wrrong
- Secondary download and heatmap upload endpoints weren't in swagger.
2024-03-11 09:32:17 +00:00
Conrad Ludgate
cc5d6c66b3 proxy: categorise new cplane error message (#7057)
## Problem

`422 Unprocessable Entity: compute time quota of non-primary branches is
exceeded` being marked as a control plane error.

## Summary of changes

Add the manual checks to make this a user error that should not be
retried.
2024-03-11 09:20:09 +01:00
Roman Zaynetdinov
d894d2b450 Export db size, deadlocks and changed row metrics (#7050)
## Problem

We want to report metrics for the oldest user database.
2024-03-11 08:10:04 +00:00
Joonas Koivunen
b09d686335 fix: on-demand downloads can outlive timeline shutdown (#7051)
## Problem

Before this PR, it was possible that on-demand downloads were started
after `Timeline::shutdown()`.

For example, we have observed a walreceiver-connection-handler-initiated
on-demand download that was started after `Timeline::shutdown()`s final
`task_mgr::shutdown_tasks()` call.

The underlying issue is that `task_mgr::shutdown_tasks()` isn't sticky,
i.e., new tasks can be spawned during or after
`task_mgr::shutdown_tasks()`.

Cc: https://github.com/neondatabase/neon/issues/4175 in lieu of a more
specific issue for task_mgr. We already decided we want to get rid of it
anyways.

Original investigation:
https://neondb.slack.com/archives/C033RQ5SPDH/p1709824952465949

## Changes

- enter gate while downloading
- use timeline cancellation token for cancelling download

thereby, fixes #7054

Entering the gate might also remove recent "kept the gate from closing"
in staging.
2024-03-09 13:09:08 +00:00
Christian Schwarz
74d24582cf throttling: exclude throttled time from basebackup (fixup of #6953) (#7072)
PR #6953 only excluded throttled time from the handle_pagerequests
(aka smgr metrics).

This PR implements the deduction for `basebackup ` queries.

The other page_service methods either don't use Timeline::get
or they aren't used in production.

Found by manually inspecting in [staging
logs](https://neonprod.grafana.net/explore?schemaVersion=1&panes=%7B%22wx8%22:%7B%22datasource%22:%22xHHYY0dVz%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bhostname%3D%5C%22pageserver-0.eu-west-1.aws.neon.build%5C%22%7D%20%7C~%20%60git-env%7CERR%7CWARN%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22xHHYY0dVz%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22to%22:%221709919114642%22,%22from%22:%221709904430898%22%7D%7D%7D).
2024-03-09 13:37:02 +01:00
Sasha Krassovsky
4834d22d2d Revoke REPLICATION (#7052)
## Problem
Currently users can cause problems with replication
## Summary of changes
Don't let them replicate
2024-03-08 22:24:30 +00:00
Anastasia Lubennikova
86e8c43ddf Add downgrade scripts for neon extension. (#7065)
## Problem

When we start compute with newer version of extension (i.e. 1.2) and
then rollback the release, downgrading the compute version, next compute
start will try to update extension to the latest version available in
neon.control (i.e. 1.1).

Thus we need to provide downgrade scripts like neon--1.2--1.1.sql

These scripts must revert the changes made by the upgrade scripts in the
reverse order. This is necessary to ensure that the next upgrade will
work correctly.

In general, we need to write upgrade and downgrade scripts to be more
robust and add IF EXISTS / CREATE OR REPLACE clauses to all statements
(where applicable).

## Summary of changes
Adds downgrade scripts.
Adds test cases for extension downgrade/upgrade. 

fixes #7066

This is a follow-up for
https://app.incident.io/neondb/incidents/167?tab=follow-ups

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z <iskyzh@gmail.com>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2024-03-08 20:42:35 +00:00
John Spray
7329413705 storage controller: enable setting PlacementPolicy in tenant creation (#7037)
## Problem

Tenants created via the storage controller have a `PlacementPolicy` that
defines their HA/secondary/detach intent. For backward compat we can
just set it to Single, for onboarding tenants using /location_conf it is
automatically set to Double(1) if there are at least two pageservers,
but for freshly created tenants we didn't have a way to specify it.

This unblocks writing tests that create HA tenants on the storage
controller and do failure injection testing.

## Summary of changes

- Add optional fields to TenantCreateRequest for specifying
PlacementPolicy. This request structure is used both on pageserver API
and storage controller API, but this method is only meaningful for the
storage controller (same as existing `shard_parameters` attribute).
- Use the value from the creation request in tenant creation, if
provided.
2024-03-08 15:34:53 +00:00
Conrad Ludgate
2c132e45cb proxy: do not store ephemeral endpoints in http pool (#6819)
## Problem

For the ephemeral endpoint feature, it's not really too helpful to keep
them around in the connection pool. This isn't really pressing but I
think it's still a bit better this way.

## Summary of changes

Add `is_ephemeral` function to `NeonOptions`. Allow
`serverless::ConnInfo::endpoint_cache_key()` to return an `Option`.
Handle that option appropriately
2024-03-08 07:56:23 +00:00
Vlad Lazar
0f05ef67e2 pageserver: revert open layer rolling revert (#6962)
## Problem
We reverted https://github.com/neondatabase/neon/pull/6661 a few days
ago. The change led to OOMs in
benchmarks followed by large WAL reingests.

The issue was that we removed [this
code](d04af08567/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L409-L417)).
That call may trigger a roll of the open layer due to
the keepalive messages received from the safekeeper. Removing it meant
that enforcing
of checkpoint timeout became even more lax and led to using up large
amounts of memory
for the in memory layer indices.

## Summary of changes
Piggyback on keep alive messages to enforce checkpoint timeout. This is
a hack, but it's exactly what
the current code is doing.

## Alternatives
Christhian, Joonas and myself sketched out a timer based approach
[here](https://github.com/neondatabase/neon/pull/6940). While discussing
it further, it became obvious that's also a bit of a hack and not the
desired end state. I chose not
to take that further since it's not what we ultimately want and it'll be
harder to rip out.

Right now it's unclear what the ideal system behaviour is:
* early flushing on memory pressure, or ...
* detaching tenants on memory pressure
2024-03-07 19:53:10 +00:00
Conrad Ludgate
02358b21a4 update rustls (#7048)
## Summary of changes

Update rustls from 0.21 to 0.22.

reqwest/tonic/aws-smithy still use rustls 0.21. no upgrade route
available yet.
2024-03-07 18:23:19 +00:00
Sasha Krassovsky
2fc89428c3 Hopefully stabilize test_bad_connection.py (#6976)
## Problem
It seems that even though we have a retry on basebackup, it still
sometimes fails to fetch it with the failpoint enabled, resulting in a
test error.

## Summary of changes
If we fail to get the basebackup, disable the failpoint and try again.
2024-03-07 10:12:06 -08:00
Arpad Müller
ce7a82db05 Update svg_fmt (#7049)
Gets upstream PR https://github.com/nical/rust_debug/pull/3 , removes
trailing "s from output.
2024-03-07 17:32:09 +00:00
John Spray
d5a6a2a16d storage controller: robustness improvements (#7027)
## Problem


Closes: https://github.com/neondatabase/neon/issues/6847
Closes: https://github.com/neondatabase/neon/issues/7006

## Summary of changes

- Pageserver API calls are wrapped in timeout/retry logic: this prevents
a reconciler getting hung on a pageserver API hang, and prevents
reconcilers having to totally retry if one API call returns a retryable
error (e.g. 503).
- Add a cancellation token to `Node`, so that when we mark a node
offline we will cancel any API calls in progress to that node, and avoid
issuing any more API calls to that offline node.
- If the dirty locations of a shard are all on offline nodes, then don't
spawn a reconciler
- In re-attach, if we have no observed state object for a tenant then
construct one with conf: None (which means "unknown"). Then in
Reconciler, implement a TODO for scanning such locations before running,
so that we will avoid spuriously incrementing a generation in the case
of a node that was offline while we started (this is the case that
tripped up #7006)
- Refactoring: make Node contents private (and thereby guarantee that
updates to availability mode reliably update the cancellation token.)
- Refactoring: don't pass the whole map of nodes into Reconciler (and
thereby remove a bunch of .expect() calls)

Some of this was discovered/tested with a new failure injection test
that will come in a separate PR, once it is stable enough for CI.
2024-03-07 17:10:03 +00:00