Commit Graph

522 Commits

Author SHA1 Message Date
Conrad Ludgate
7e08fbd1b9 Revert "proxy: update tokio-postgres to allow arbitrary config params (#8076)" (#8654)
This reverts #8076 - which was already reverted from the release branch
since forever (it would have been a breaking change to release for all
users who currently set TimeZone options). It's causing conflicts now so
we should revert it here as well.
2024-08-09 09:09:29 +01:00
Conrad Ludgate
ad0988f278 proxy: random changes (#8602)
## Problem

1. Hard to correlate startup parameters with the endpoint that provided
them.
2. Some configurations are not needed in the `ProxyConfig` struct.

## Summary of changes

Because of some borrow checker fun, I needed to switch to an
interior-mutability implementation of our `RequestMonitoring` context
system, using https://docs.rs/try-lock/latest/try_lock/ as a cheap lock
for this use-case (it needs to be thread-safe).
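
A minimal sketch of the pattern, assuming `try_lock::TryLock` as the cheap lock (names and fields are illustrative, not the real `RequestMonitoring` type):

```rust
// Shared, thread-safe context that can be updated through &self via a cheap
// non-blocking lock, instead of requiring &mut self everywhere.
use std::sync::Arc;
use try_lock::TryLock;

#[derive(Default)]
struct ContextData {
    application_name: Option<String>,
    success: bool,
}

#[derive(Clone)]
struct RequestContextSketch(Arc<TryLock<ContextData>>);

impl RequestContextSketch {
    fn new() -> Self {
        Self(Arc::new(TryLock::new(ContextData::default())))
    }

    fn set_application_name(&self, name: &str) {
        // The lock is uncontended in practice; if it were ever held we simply
        // skip the update rather than block.
        if let Some(mut data) = self.0.try_lock() {
            data.application_name = Some(name.to_string());
        }
    }
}
```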

Removed the log line for each startup message; instead, the startup
params are logged only on a successful handshake.

Also removed some values from `ProxyConfig` and kept them as arguments
(needed for the local-proxy config).
2024-08-07 14:37:03 +01:00
John Spray
2334fed762 storage_controller: start adding chaos hooks (#7946)
Chaos injection bridges the gap between automated testing (where we do
lots of different things with small, short-lived tenants), and staging
(where we do many fewer things, but with larger, long-lived tenants).

This PR adds a first type of chaos which isn't really very chaotic: it's
live migration of tenants between healthy pageservers. This nevertheless
provides continuous checks that things like clean, prompt shutdown of
tenants work for realistically deployed pageservers with realistically
large tenants.
2024-08-02 09:37:44 +01:00
Alex Chi Z.
970f2923b2 storage-scrubber: log version on start (#8571)
Helps us better identify which version of storage scrubber is running.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-08-01 13:52:34 +00:00
John Spray
1678dea20f pageserver: add layer visibility calculation (#8511)
## Problem

We recently added a "visibility" state to layers, but nothing
initializes it.

Part of:
- #8398 

## Summary of changes

- Add a dependency on `range-set-blaze`, which is used as a fast
incrementally updated alternative to KeySpace. We could also use this to
replace the internals of KeySpaceRandomAccum if we wanted to. Writing a
type that does this kind of "BtreeMap & merge overlapping entries" thing
isn't super complicated, but no reason to write this ourselves when
there's a third party impl available.
- Add a function to layermap to calculate visibilities for each layer
- Add a function to Timeline to call into layermap and then apply these
visibilities to the Layer objects.
- Invoke the calculation during startup, after image layer creations,
and when removing branches. Branch removal and image layer creation are
the two ways that a layer can go from Visible to Covered.
- Add unit test & benchmark for the visibility calculation
- Expose `pageserver_visible_physical_size` metric, which should always
be <= `pageserver_remote_physical_size`.
- This metric will feed into the /v1/utilization endpoint later: the
visible size indicates how much space we would like to use on this
pageserver for this tenant.
- When `pageserver_visible_physical_size` is greater than
`pageserver_resident_physical_size`, this is a sign that the tenant has
long-idle branches, which result in layers that are visible in
principle, but not used in practice.

This does not keep visibility hints up to date in all cases:
particularly, when creating a child timeline, any previously covered
layers will not get marked Visible until they are accessed.

Updates after image layer creation could be implemented as more of a
special case, but this would require more new code: the existing depth
calculation code doesn't maintain+yield the list of deltas that would be
covered by an image layer.
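
For intuition, a toy version of the sweep (one-dimensional key ranges, no LSNs, illustrative types only, not the real layer map code):

```rust
// Walk layers from most recent to oldest; a layer whose key range is already
// fully covered by image layers above it is Covered, everything else is Visible.
#[derive(Debug, PartialEq)]
enum LayerVisibility {
    Visible,
    Covered,
}

struct LayerSketch {
    key_range: std::ops::Range<u64>,
    is_image: bool,
}

fn compute_visibility(layers_newest_first: &[LayerSketch]) -> Vec<LayerVisibility> {
    // Key ranges covered so far; the real code tracks this with range-set-blaze.
    let mut covered: Vec<std::ops::Range<u64>> = Vec::new();
    layers_newest_first
        .iter()
        .map(|layer| {
            // Naive containment check: some already-seen image range contains
            // this layer's whole range.
            let fully_covered = covered
                .iter()
                .any(|c| c.start <= layer.key_range.start && layer.key_range.end <= c.end);
            if layer.is_image && !fully_covered {
                covered.push(layer.key_range.clone());
            }
            if fully_covered {
                LayerVisibility::Covered
            } else {
                LayerVisibility::Visible
            }
        })
        .collect()
}
```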

## Performance

This operation is done rarely (at startup and at timeline deletion), so
it needs to be efficient but not ultra-fast.

There is a new `visibility` bench that measures runtime for a synthetic
100k layers case (`sequential`) and a real layer map (`real_map`) with
~26k layers.

The benchmark shows runtimes of single digit milliseconds (on a ryzen
7950). This confirms that the runtime shouldn't be a problem at startup
(as we already incur S3-level latencies there), but that it's slow
enough that we definitely shouldn't call it more often than necessary,
and it may be worthwhile to optimize further later (things like: when
removing a branch, only bother scanning layers below the branchpoint).

```
visibility/sequential   time:   [4.5087 ms 4.5894 ms 4.6775 ms]
                        change: [+2.0826% +3.9097% +5.8995%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 24 outliers among 100 measurements (24.00%)
  2 (2.00%) high mild
  22 (22.00%) high severe
min: 0/1696070, max: 93/1C0887F0
visibility/real_map     time:   [7.0796 ms 7.0832 ms 7.0871 ms]
                        change: [+0.3900% +0.4505% +0.5164%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
min: 0/1696070, max: 93/1C0887F0
visibility/real_map_many_branches
                        time:   [4.5285 ms 4.5355 ms 4.5434 ms]
                        change: [-1.0012% -0.8004% -0.5969%] (p = 0.00 < 0.05)
                        Change within noise threshold.
```
2024-08-01 09:25:35 +00:00
Arpad Müller
163f2eaf79 Reduce linux-raw-sys duplication (#8577)
Before, we had four versions of linux-raw-sys in our dependency graph:

```
  linux-raw-sys@0.1.4
  linux-raw-sys@0.3.8
  linux-raw-sys@0.4.13
  linux-raw-sys@0.6.4
```

now it's only two:

```
  linux-raw-sys@0.4.13
  linux-raw-sys@0.6.4
```

The changes in this PR are minimal. In order to get to its state one
only has to update procfs in Cargo.toml to 0.16 and do `cargo update -p
tempfile -p is-terminal -p prometheus`.
2024-08-01 08:22:21 +00:00
Yuchen Liang
e374d6778e feat(storcon): store scrubber metadata scan result (#8480)
Part of #8128, followed by #8502.

## Problem

Currently we lack a mechanism to alert on an unhealthy `scan_metadata`
status if we start running this scrubber command as part of a cronjob.
With the storage controller client introduced to the storage scrubber in
#8196, it is viable to set up alerting by storing the health status in
the storage controller database.

We intentionally do not store the full output in the database, as the
JSON blobs could make the table very large. Instead, we store only a
health status and a timestamp recording the last time a metadata health
status was posted for a tenant shard.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-07-30 14:32:00 +01:00
John Spray
f5db655447 pageserver: simplify LayerAccessStats (#8431)
## Problem

LayerAccessStats contains a lot of detail that we don't use: short
histories of most recent accesses, specifics on what kind of task
accessed a layer, etc. This is all stored inside a Mutex, which is
locked every time something accesses a layer.

## Summary of changes

- Store timestamps at a very low resolution (to the nearest second),
sufficient for use on the timescales of eviction.
- Pack access time and last residence change time into a single u64
- Use the high bits of the u64 for other flags, including the new layer
visibility concept.
- Simplify the external-facing model for access stats to just include
what we now track.

Note that the `HistoryBufferWithDropCounter` is removed here because it
is no longer used. I do not dislike this type, we just happen not to use
it for anything else at present.
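
A minimal sketch of the packing idea, with a made-up bit layout (the PR's actual layout may differ):

```rust
// One atomic word: low bits hold a coarse last-access time in seconds, the top
// bits hold flags such as layer visibility.
use std::sync::atomic::{AtomicU64, Ordering};

const TS_BITS: u64 = 31;
const TS_MASK: u64 = (1 << TS_BITS) - 1;
const VISIBLE_FLAG: u64 = 1 << 62;

struct AccessStatsSketch(AtomicU64);

impl AccessStatsSketch {
    fn record_access(&self, now_secs: u64) {
        // Replace the access-time bits, keep flags and everything else.
        self.0
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |old| {
                Some((old & !TS_MASK) | (now_secs & TS_MASK))
            })
            .ok();
    }

    fn last_access_secs(&self) -> u64 {
        self.0.load(Ordering::Relaxed) & TS_MASK
    }

    fn set_visible(&self, visible: bool) {
        if visible {
            self.0.fetch_or(VISIBLE_FLAG, Ordering::Relaxed);
        } else {
            self.0.fetch_and(!VISIBLE_FLAG, Ordering::Relaxed);
        }
    }
}
```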


Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-07-24 08:17:28 +01:00
Arpad Müller
2c0d311a54 remote_storage: add list_streaming API call (#8466)
This adds the ability to list many prefixes in a streaming fashion to
both the `RemoteStorage` trait as well as `GenericRemoteStorage`.

* The `list` function of the `RemoteStorage` trait is implemented by
default in terms of `list_streaming`.
* For the production users (S3, Azure), `list_streaming` is implemented
and the default `list` implementation is used.
* For `LocalFs`, we keep the `list` implementation and make
`list_streaming` call it.

The `list_streaming` function is implemented for both S3 and Azure.

A TODO for later is retries: the scrubber currently has them, while the
`list_streaming` implementations do not.
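
A rough sketch of the `list`-in-terms-of-`list_streaming` arrangement described above (illustrative trait and types, not the actual `RemoteStorage` API):

```rust
// The non-streaming `list` gets a default implementation that simply drains
// the stream returned by `list_streaming`.
use futures::{Stream, StreamExt};

pub struct Listing {
    pub keys: Vec<String>,
}

pub trait ListingSketch {
    fn list_streaming(&self, prefix: &str) -> impl Stream<Item = anyhow::Result<Listing>>;

    async fn list(&self, prefix: &str) -> anyhow::Result<Listing> {
        let mut combined = Listing { keys: Vec::new() };
        let mut stream = std::pin::pin!(self.list_streaming(prefix));
        while let Some(batch) = stream.next().await {
            combined.keys.extend(batch?.keys);
        }
        Ok(combined)
    }
}
```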

part of #8457 and #7547
2024-07-24 02:09:01 +02:00
Yuchen Liang
595c450036 fix(scrubber): more robust metadata consistency checks (#8344)
Part of #8128.

## Problem

The scrubber uses the `scan_metadata` command to flag metadata
inconsistencies. To trust it at scale, we need to make sure the errors we
emit reflect real scenarios. One check performed in the scrubber is to
see whether the layers listed in the latest `index_part.json` are present
in the object listing. Currently, the scrubber does not robustly handle
the case where objects are uploaded/deleted during the scan.

## Summary of changes

**Condition for success:** An object in the index is (1) in the object
listing we acquire from S3 or (2) found in a HeadObject request (new
object).

- Add in the `HeadObject` requests for the layers missing from the
object listing.
- Keep the order of first getting the object listing and then
downloading the layers.
- Update check to only consider shards with highest shard count.
- Skip analyzing a timeline if `deleted_at` tombstone is marked in
`index_part.json`.
- Add a new test to check that the scrubber actually detects the metadata
inconsistency.
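
A sketch of that success condition in isolation (`head_object_exists` stands in for a hypothetical probe built on the S3 client):

```rust
// A key referenced by index_part.json is considered present if it was in the
// listing snapshot, or if a follow-up HeadObject probe finds it (it may have
// been uploaded after the listing was taken).
use std::collections::HashSet;

async fn index_key_is_present(
    listed_keys: &HashSet<String>,
    key: &str,
    head_object_exists: impl std::future::Future<Output = anyhow::Result<bool>>,
) -> anyhow::Result<bool> {
    if listed_keys.contains(key) {
        return Ok(true);
    }
    // Missing from the listing snapshot: fall back to a point lookup before
    // reporting an inconsistency.
    head_object_exists.await
}
```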

_Misc_

- A timeline with no ancestor should always have some layers.
- Removed experimental histograms

_Caveat_

- Ancestor layers are not cleaned up until #8308 is implemented. If the
index references ancestor layers that no longer exist, the scrubber will
emit false positives.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-07-22 14:53:33 +01:00
John Spray
44781518d0 storage scrubber: GC ancestor shard layers (#8196)
## Problem

After a shard split, the pageserver leaves the ancestor shard's content
in place. It may be referenced by child shards, but eventually child
shards will de-reference most ancestor layers as they write their own
data and do GC. We would like to eventually clean up those ancestor
layers to reclaim space.

## Summary of changes

- Extend the physical GC command with `--mode=full`, which includes
cleaning up unreferenced ancestor shard layers
- Add test `test_scrubber_physical_gc_ancestors`
- Remove colored log output: in testing this is irritating ANSI code
spam in logs, and in interactive use doesn't add much.
- Refactor storage controller API client code out of storcon_client into
a `storage_controller/client` crate
- During physical GC of ancestors, call into the storage controller to
check that the latest shards seen in S3 reflect the latest state of the
tenant, and there is no shard split in progress.
2024-07-19 19:07:59 +03:00
Christian Schwarz
a2d170b6d0 NeonEnv.from_repo_dir: use storage_controller_db instead of attachments.json (#8382)
When `NeonEnv.from_repo_dir` was introduced, the storage controller
stored its state exclusively in `attachments.json`.
Since then, it has moved to using Postgres, which stores its state in
`storage_controller_db`.

But `NeonEnv.from_repo_dir` wasn't adjusted to account for this.
This PR rectifies the situation.

Context for this is failures in
`test_pageserver_characterize_throughput_with_n_tenants`
CF:
https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH

Notably, `from_repo_dir` is also used by the backwards- and
forwards-compatibility tests, so the changes in this PR affect those
tests as well. However, it turns out that the compatibility snapshot
already contains the `storage_controller_db`, so it should just work,
and in fact we can remove hacks like `fixup_storage_controller`.

Follow-ups created as part of this work:
* https://github.com/neondatabase/neon/issues/8399
* https://github.com/neondatabase/neon/issues/8400
2024-07-18 10:56:07 +02:00
Luca Bruno
8da3b547f8 proxy/http: switch to typed_json (#8377)
## Summary of changes

This switches JSON rendering logic to `typed_json` in order to
reduce the number of allocations in the HTTP responder path.
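
Roughly, the idea (assuming `typed_json::json!` is a drop-in for `serde_json::json!`, which is the crate's stated goal) is that the macro builds an anonymous `Serialize` type rather than an allocated `serde_json::Value` tree:

```rust
// Serialization writes straight to the output buffer; no intermediate Value
// allocations for the object and array nodes.
fn render_response() -> serde_json::Result<Vec<u8>> {
    let rows = vec!["r1", "r2"]; // stand-in payload
    let body = typed_json::json!({
        "rowCount": rows.len(),
        "rows": rows,
    });
    serde_json::to_vec(&body)
}
```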

Followup from
https://github.com/neondatabase/neon/pull/8319#issuecomment-2216991760.

---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2024-07-15 12:38:52 +01:00
Japin Li
86d6ef305a Remove fs2 dependency (#8350)
The fs2 dependency is not needed anymore after commit d42700280.
2024-07-12 12:56:06 +03:00
John Spray
0159ae9536 safekeeper: eviction metrics (#8348)
## Problem

Follow up to https://github.com/neondatabase/neon/pull/8335, to improve
observability of how many evict/restores we are doing.

## Summary of changes

- Add `safekeeper_eviction_events_started_total` and
`safekeeper_eviction_events_completed_total`, with a "kind" label of
evict or restore. This gives us rates, and also ability to calculate how
many are in progress.
- Generalize SafekeeperMetrics test type to use the same helpers as
pageserver, and enable querying any metric.
- Read the new metrics at the end of the eviction test.
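
A sketch of what such a counter pair looks like, using the prometheus crate's macros directly (the tree routes this through its own `metrics` wrapper):

```rust
// Counter vec with a "kind" label ("evict" or "restore"); completed_total is
// defined the same way, and started - completed gives the number in progress.
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

static EVICTION_EVENTS_STARTED: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "safekeeper_eviction_events_started_total",
        "Number of eviction state changes, incremented when they start",
        &["kind"]
    )
    .expect("failed to register metric")
});

fn on_eviction_started(kind: &str) {
    EVICTION_EVENTS_STARTED.with_label_values(&[kind]).inc();
}
```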
2024-07-11 17:05:35 +01:00
Christian Schwarz
e26ef640c1 pageserver: remove trace_read_requests (#8338)
`trace_read_requests` is a per `Tenant`-object option.
But the `handle_pagerequests` loop doesn't know which
`Tenant` object (i.e., which shard) the request is for.

The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple
shards](https://github.com/neondatabase/neon/issues/7427#issuecomment-2220577518).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](https://github.com/neondatabase/neon/pull/8286) for
the
high level idea.

Note that we can always bring back the tracing functionality if we need it.
But since it's actually about logging the `page_service` wire bytes,
it should be a `page_service`-level config option, not per-Tenant.
And for enabling tracing on a single connection, we can implement
a `set pageserver_trace_connection;` option.
2024-07-11 15:17:07 +02:00
Stas Kelvich
6bbd34a216 Enable core dumps for postgres (#8272)
Set the core rlimit to unlimited in compute_ctl, so that all child
processes inherit it. We could also set the rlimit in the relevant
startup script, but that way we would depend on external setup and might
inadvertently disable it again (core dumping worked in pods, but not in
VMs with inittab-based startup).
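
A sketch of what that amounts to, assuming the `libc` crate (child processes inherit the limit):

```rust
// Raise the core-file size limit to unlimited for this process and its children.
fn enable_core_dumps() {
    let rlim = libc::rlimit {
        rlim_cur: libc::RLIM_INFINITY,
        rlim_max: libc::RLIM_INFINITY,
    };
    // SAFETY: setrlimit only reads the struct we pass in.
    let ret = unsafe { libc::setrlimit(libc::RLIMIT_CORE, &rlim) };
    if ret != 0 {
        eprintln!(
            "failed to set RLIMIT_CORE: {}",
            std::io::Error::last_os_error()
        );
    }
}
```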
2024-07-11 10:20:14 +03:00
Christian Schwarz
3f7aebb01c refactor: postgres_backend: replace abstract shutdown_watcher with CancellationToken (#8295)
Preliminary refactoring while working on
https://github.com/neondatabase/neon/issues/7427
and specifically https://github.com/neondatabase/neon/pull/8286
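
The shape of the change, sketched (illustrative function, not the actual `postgres_backend` code):

```rust
// Instead of a caller-supplied abstract shutdown_watcher future, the serving
// loop selects on a shared tokio_util CancellationToken.
use tokio_util::sync::CancellationToken;

async fn serve_connection(cancel: CancellationToken) {
    loop {
        tokio::select! {
            _ = cancel.cancelled() => {
                // Someone called cancel.cancel(); shut down gracefully.
                break;
            }
            _ = handle_next_message() => {}
        }
    }
}

async fn handle_next_message() {
    // Read and dispatch one protocol message (elided).
}
```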
2024-07-09 21:11:11 +03:00
Conrad Ludgate
4a5b55c834 chore: fix nightly build (#8142)
## Problem

`cargo +nightly check` fails

## Summary of changes

Updates `measured`, `time`, and `crc32c`.

* `measured`: updated to fix
https://github.com/rust-lang/rust/issues/125763.
* `time`: updated to fix https://github.com/rust-lang/rust/issues/125319
* `crc32c`: updated to remove some nightly feature detection with a
removed nightly feature
2024-07-09 18:25:49 +01:00
John Spray
1121a1cbac pageserver: switch to jemalloc (#8307)
## Problem

- Resident memory on long running pageserver processes tends to climb:
memory fragmentation is suspected.
- Total resident memory may be a limiting factor for running on smaller
nodes.

## Summary of changes

- As a low-energy experiment, switch the pageserver to use jemalloc (not
a net-new dependency, proxy already use it)
- Decide at end of week whether to revert before next release.
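
For reference, with the `tikv-jemallocator` crate (an assumption about the exact crate; the PR only notes that proxy already links jemalloc), opting a binary in is a one-liner:

```rust
// Route all heap allocations in this binary through jemalloc.
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
```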
2024-07-08 14:10:42 +01:00
Christian Schwarz
7dcdbaa25e remote_storage config: move handling of empty inline table {} to callers (#8193)
Before this PR, `RemoteStorageConfig::from_toml` would deserialize an
empty `{}` TOML inline table to `None`, and otherwise try `Some(...)`.

We can instead:
* in proxy: let clap derive handle the Option
* in PS & SK: assume that if the field is specified, it must be a valid
  RemoteStorageConfig

(This PR started with a much simpler goal of factoring out the
`deserialize_item` function because I need that in another PR).
2024-07-02 12:53:08 +02:00
Vlad Lazar
7026dde9eb storcon: update db related dependencies (#8155)
## Problem
The storage controller runs into a memory corruption issue on the
drain/fill code paths.

## Summary of changes
Update db-related dependencies in the unlikely case that the issue was
fixed in diesel.
2024-06-25 15:06:18 +01:00
Conrad Ludgate
78d9059fc7 proxy: update tokio-postgres to allow arbitrary config params (#8076)
## Problem

Fixes https://github.com/neondatabase/neon/issues/1287

## Summary of changes

tokio-postgres now supports arbitrary server params through the
`param(key, value)` method. Some keys are special so we explicitly
filter them out.
2024-06-24 10:20:27 +00:00
Arpad Müller
75747cdbff Use serde for RemoteStorageConfig parsing (#8126)
Adds a `Deserialize` impl to `RemoteStorageConfig`. We thus achieve the
same as #7743 but with less repetitive code, by deriving `Deserialize`
impls on `S3Config`, `AzureConfig`, and `RemoteStorageConfig`. The
disadvantage is less useful error messages.

The git history of this PR contains a state where we go via an
intermediate representation, leveraging the `serde_json` crate,
without it ever being actual json though.

Also, the PR adds deserialization tests.

Alternative to #7743 .
2024-06-22 17:57:09 +00:00
Vlad Lazar
5778d714f0 storcon: add drain and fill background operations for graceful cluster restarts (#8014)
## Problem
Pageserver restarts cause read availability downtime for tenants. See
`Motivation` section in the
[RFC](https://github.com/neondatabase/neon/pull/7704).

## Summary of changes
* Introduce a new `NodeSchedulingPolicy`: `PauseForRestart`
* Implement the first take of drain and fill algorithms
* Add a node status endpoint which can be polled to figure out when an
operation is done

The implementation follows the RFC, so it might be useful to peek at it
as you're reviewing.
Since the PR is rather chunky, I've made sure all commits build (with
warnings), so you can
review by commit if you prefer that.

RFC: https://github.com/neondatabase/neon/pull/7704
Related https://github.com/neondatabase/neon/issues/7387
2024-06-19 11:55:30 +01:00
Arseny Sher
d8b2a49c55 safekeeper: streaming pull_timeline
- Add a /snapshot HTTP endpoint streaming a tar archive of timeline contents up to
  flush_lsn.
- Add a check that the term doesn't change; the corresponding test passes now.
- Also prepare infra to hold off WAL removal during the basebackup.
- Sprinkle fsyncs to persist the pull_timeline result.

ref https://github.com/neondatabase/neon/issues/6340
2024-06-18 15:45:39 +03:00
Arpad Müller
27518676d7 Rename S3 scrubber to storage scrubber (#8013)
The S3 scrubber contains "S3" in its name, but we want to make it
generic in terms of which storage is used (#7547). Therefore, rename it
to "storage scrubber", following the naming scheme of already existing
components "storage broker" and "storage controller".

Part of #7547
2024-06-11 22:45:22 +00:00
Vlad Lazar
7121db3669 storcon_cli: add 'drain' command (#8007)
## Problem
We need the ability to prepare a subset of storage controller managed
pageservers for decommissioning. The storage controller cannot currently
express this in terms of scheduling constraints (it's a pretty special
case, so I'm not sure it even should).

## Summary of Changes
A new `drain` command is added to `storcon_cli`. It takes a set of nodes
to drain and migrates primary attachments outside of said set. Simple
round-robin assignment is used under the assumption that nodes outside
of the draining set are evenly balanced.

Note that secondary locations are not migrated. This is fine for
staging, but the migration API will have to be extended for prod in
order to allow migration of secondaries as well.

I've tested this out against a neon local cluster. The immediate use for
this command will be to migrate staging to ARM (aarch64) pageservers.

Related https://github.com/neondatabase/cloud/issues/14029
2024-06-11 16:39:38 +00:00
John Spray
69026a9a36 storcon_cli: add 'drop' and eviction interval utilities (#7938)
The storage controller has 'drop' APIs for tenants and nodes, for use in
situations where something weird has happened:
- node-drop is useful until we implement proper node decom, or if we
have a partially provisioned node that somehow gets registered with the
storage controller but is then dead.
- tenant-drop is useful if we accidentally add a tenant that shouldn't
be there at all, or if we want to make the controller forget about a
tenant without deleting its data. For example, if one uses the
tenant-warmup command with a bad tenant ID and needs to clean that up.

The drop commands require an `--unsafe` parameter, to reduce the chance
that someone incorrectly assumes these are the normal/clean ways to
delete things.

This PR also adds a convenience command for setting the time based
eviction parameters on a tenant. This is useful when onboarding an
existing tenant that has high resident size due to storage amplification
in compaction: setting a lower time based eviction threshold brings down
the resident size ahead of doing a shard split.
2024-06-03 18:13:01 +00:00
John Spray
69d18d6429 s3_scrubber: add pageserver-physical-gc (#7925)
## Problem

Currently, we leave `index_part.json` objects from old generations
behind each time a pageserver restarts or a tenant is migrated. This
doesn't break anything, but it's annoying when a tenant has been around
for a long time and starts to accumulate 10s-100s of these.

Partially implements: #7043 

## Summary of changes

- Add a new `pageserver-physical-gc` command to `s3_scrubber`

The name is a bit of a mouthful, but I think it makes sense:
- GC is the accurate term for what we are doing here: removing data that
takes up storage but can never be accessed.
- "physical" is a necessary distinction from the "normal" GC that we do
online in the pageserver, which operates at a higher level in terms of
LSNs+layers, whereas this type of GC is purely about S3 objects.
- "pageserver" makes clear that this command deals exclusively with
pageserver data, not safekeeper.
2024-06-03 17:16:23 +01:00
Joonas Koivunen
ef83f31e77 pagectl: key command for dumping what we know about the key (#7890)
What we know about the key via added `pagectl key $key` command:
- debug formatting
- shard placement when `--shard-count` is specified
- different boolean queries in `key.rs`
- aux files v2

Example:

```
$ cargo run -qp pagectl -- key 000000063F00004005000060270000100E2C
parsed from hex: 000000063F00004005000060270000100E2C:

Key { field1: 0, field2: 1599, field3: 16389, field4: 24615, field5: 0, field6: 1052204 }
rel_block:         true
rel_vm_block:      false
rel_fsm_block:     false
slru_block:        false
inherited:         true
rel_size:          false
slru_segment_size: false
recognized kind:   None
```
2024-05-31 18:19:41 +00:00
Arpad Müller
c18b1c0646 Update tokio-epoll-uring for linux-raw-sys (#7918)
Updates the `tokio-epoll-uring` dependency.

There is [only one change](342ddd197a...08ccfa94ff),
the adoption of linux-raw-sys for `statx` instead of using libc.

Part of #7889.
2024-05-30 17:45:48 +02:00
Conrad Ludgate
9a081c230f proxy: lazily parse startup pg params (#7905)
## Problem

proxy params being a `HashMap<String,String>` when it contains just
```
application_name: psql
database: neondb
user: neondb_owner
```
is quite wasteful allocation-wise.

## Summary of changes

Keep the params in the wire protocol form, eg:
```
application_name\0psql\0database\0neondb\0user\0neondb_owner\0
```

Using a linear search for the map is fast enough at small sizes, which
is the normal case.
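
A toy sketch of that representation and lookup (illustrative, not the proxy's actual type):

```rust
// Keep the startup parameters in wire form ("key\0value\0key\0value\0...") and
// do a linear scan on lookup; with a handful of params this beats building a
// HashMap<String, String>.
struct StartupParams {
    raw: Vec<u8>,
}

impl StartupParams {
    fn iter(&self) -> impl Iterator<Item = (&str, &str)> + '_ {
        let mut parts = std::str::from_utf8(&self.raw)
            .into_iter()
            .flat_map(|s| s.split('\0'));
        std::iter::from_fn(move || Some((parts.next()?, parts.next()?)))
    }

    fn get(&self, key: &str) -> Option<&str> {
        self.iter().find(|(k, _)| *k == key).map(|(_, v)| v)
    }
}
```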
2024-05-30 11:02:38 +00:00
Conrad Ludgate
c8cebecabf proxy: reintroduce dynamic limiter for compute lock (#7737)
## Problem

Computes that are healthy can manage many connection attempts at a time.
Unhealthy computes cannot. We initially handled this with a fixed
concurrency limit, but it seems this inhibits pgbench.

## Summary of changes

Support AIMD for connect_to_compute lock to allow varying the
concurrency limit based on compute health
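
A toy AIMD limiter, to illustrate the policy (not the proxy's actual implementation):

```rust
// Additive increase on success, multiplicative decrease on failure: healthy
// computes quickly earn a high concurrency limit, unhealthy ones get throttled.
struct AimdLimit {
    limit: f64,
    min: f64,
    max: f64,
}

impl AimdLimit {
    fn on_success(&mut self) {
        self.limit = (self.limit + 1.0).min(self.max);
    }

    fn on_error(&mut self) {
        self.limit = (self.limit / 2.0).max(self.min);
    }

    fn current(&self) -> usize {
        self.limit as usize
    }
}
```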
2024-05-29 11:17:05 +01:00
Arpad Müller
14df69d0e3 Drop postgres-native-tls in favour of tokio-postgres-rustls (#7883)
Get rid of postgres-native-tls and openssl in favour of rustls in our
dependency tree.

Do further steps to completely remove native-tls and openssl.

Among other advantages, this allows us to do static musl builds more
easily: #7889
2024-05-28 15:40:52 +00:00
Conrad Ludgate
9cfe08e3d9 proxy password threadpool (#7806)
## Problem

Despite making password hashing async, it can still take time away from
the network code.

## Summary of changes

Introduce a custom threadpool, inspired by rayon. Features:

### Fairness

Each task is tagged with its endpoint ID. The more times we have seen
the endpoint, the more likely we are to skip the task if it comes up in
the queue. This uses a count-min sketch estimator for the number of
times we have seen the endpoint, resetting it every 1000+ steps.
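
A toy count-min sketch for that estimate (illustrative only; the pool's real estimator and reset policy differ in detail):

```rust
// Fixed-size 2D array of counters; each row hashes the key differently, and
// the estimate is the minimum across rows, which can over- but never under-count.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct CountMinSketch {
    rows: Vec<Vec<u32>>,
    width: usize,
}

impl CountMinSketch {
    fn new(depth: usize, width: usize) -> Self {
        Self { rows: vec![vec![0; width]; depth], width }
    }

    fn index(&self, row: usize, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        (row as u64).hash(&mut h);
        key.hash(&mut h);
        (h.finish() as usize) % self.width
    }

    fn observe(&mut self, key: &str) {
        for row in 0..self.rows.len() {
            let i = self.index(row, key);
            self.rows[row][i] = self.rows[row][i].saturating_add(1);
        }
    }

    fn estimate(&self, key: &str) -> u32 {
        (0..self.rows.len())
            .map(|row| self.rows[row][self.index(row, key)])
            .min()
            .unwrap_or(0)
    }
}
```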

Since tasks are immediately rescheduled if they do not complete, the
worker could get stuck in an "always work available" loop. To combat
this, we check the global queue every 61 steps to ensure all tasks
quickly get a worker assigned to them.

### Balanced

Using crossbeam_deque, like rayon does, we have workstealing out of the
box. I've tested it a fair amount and it seems to balance the workload
accordingly
2024-05-22 17:05:43 +00:00
Conrad Ludgate
a5ecca976e proxy: bump parquet (#7782)
## Summary of changes

Updates the parquet lib. One change we still need is in an open PR
against upstream; hopefully we can remove the git dependency by 52.0.0
https://github.com/apache/arrow-rs/pull/5773

I'm not sure why the parquet files got a little bit bigger. I tested
them and they still open fine. 🤷

As a side effect of the update, chrono was updated and added yet another
deprecation warning (hence the safekeepers change)
2024-05-19 19:45:53 +00:00
Conrad Ludgate
790c05d675 proxy: swap tungstenite for a simpler impl (#7353)
## Problem

I wanted to do a deep dive of the tungstenite codebase.
tokio-tungstenite is incredibly convoluted... In my searching I found
[fastwebsockets by deno](https://github.com/denoland/fastwebsockets),
but it wasn't quite sufficient.

This also removes the default 16MB/64MB frame/message size limitation.
framed-websockets solves this by inserting continuation frames for
partially received messages, so the whole message does not need to be
entirely read into memory.

## Summary of changes

I took the fastwebsockets code as a starting point and rewrote it to
be simpler, server-only, and poll-based to support our Read/Write
wrappers.

I have replaced our tungstenite code with my framed-websockets fork.

<https://github.com/neondatabase/framed-websockets>
2024-05-16 13:05:50 +02:00
Andrew Rudenko
923cf91aa4 compute_ctl: catalog API endpoints (#7575)
## Problem

There are two cloud features that require extra compute endpoints.

1. We are running pg_dump to get DB schemas. Currently, we are using a
special service for this. But it would be great to execute pg_dump in an
isolated environment. And we already have such an environment, it's our
compute! And likely enough pg_dump already exists there too! (see
https://github.com/neondatabase/cloud/issues/11644#issuecomment-2084617832)
2. We need to have a way to get databases and roles from compute after
time travel (see https://github.com/neondatabase/cloud/issues/12109)

## Summary of changes

It adds two API endpoints to compute_ctl HTTP API that target both of
the aforementioned cases.

---------

Co-authored-by: Tristan Partin <tristan@neon.tech>
2024-05-16 12:04:16 +02:00
Christian Schwarz
bc78b0e9cc chore(deps): use upstream svg_fmt after they merged our PR (#7764)
They have merged our PR https://github.com/nical/rust_debug/pull/4 but
they haven't released a new crate version yet.

refs https://github.com/neondatabase/neon/issues/7763
2024-05-15 14:18:02 +02:00
Christian Schwarz
ab10523cc1 remote_storage: AWS_PROFILE with endpoint overrides in ~/.aws/config (updates AWS SDKs) (#7664)
Before this PR, using the AWS SDK profile feature for running against
minio didn't work because
* our SDK versions were too old and didn't include
  https://github.com/awslabs/aws-sdk-rust/issues/1060 and 
* we didn't massage the s3 client config builder correctly.

This PR
* updates all the AWS SDKs we use to, respectively, the latest version I
could find on crates.io (Is there a better process?)
* changes the way remote_storage constructs the S3 client, and
* documents how to run the test suite against real S3 & local minio.

Regarding the changes to `remote_storage`: if one reads the SDK docs, it
is clear that the recommended way is to use `aws_config::from_env`, then
customize.
What we were doing instead is to use the `aws_sdk_s3` builder directly.
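
A sketch of that recommended construction path (the endpoint override for minio is the hypothetical part):

```rust
// Load shared config (env vars, AWS_PROFILE, ~/.aws/config), then customize
// the S3 client from it instead of building aws_sdk_s3 config from scratch.
async fn build_s3_client(custom_endpoint: Option<String>) -> aws_sdk_s3::Client {
    let sdk_config = aws_config::from_env().load().await;
    let mut builder = aws_sdk_s3::config::Builder::from(&sdk_config);
    if let Some(endpoint) = custom_endpoint {
        // e.g. a local minio at http://127.0.0.1:9000
        builder = builder.endpoint_url(endpoint).force_path_style(true);
    }
    aws_sdk_s3::Client::from_conf(builder.build())
}
```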

To get the `local-minio` in the added docs working, I needed to update
both the SDKs and make the changes to the `remote_storage`. See the
commit history in this PR for details.

Refs:
* byproduct: https://github.com/smithy-lang/smithy-rs/pull/3633
* follow-up on deprecation:
https://github.com/neondatabase/neon/issues/7665
* follow-up for scrubber S3 setup:
https://github.com/neondatabase/neon/issues/7667
2024-05-09 10:58:38 +02:00
Conrad Ludgate
0c99e5ec6d proxy: cull http connections (#7632)
## Problem

Some HTTP client connections can stay open for quite a long time.

## Summary of changes

When there are too many HTTP client connections, pick a random
connection and gracefully cancel it.
2024-05-07 18:15:06 +01:00
Christian Schwarz
308227fa51 remove neon_local --pageserver-config-override (#7614)
Preceding PR https://github.com/neondatabase/neon/pull/7613 reduced the
usage of `--pageserver-config-override`.

This PR builds on top of that work and fully removes the `neon_local
--pageserver-config-override`.

Tests that need a non-default `pageserver.toml` control it using two
options:

1. Specify `NeonEnvBuilder.pageserver_config_override` before
`NeonEnvBuilder.init_start()`. This uses a new `neon_local init
--pageserver-config` flag.
2. After `init_start()`: `env.pageserver.stop()` +
`NeonPageserver.edit_config_toml()` + `env.pageserver.start()`

A few test cases were using
`env.pageserver.start(overrides=("--pageserver-config-override...",))`.
I changed them to use one of the options above. 

Future Work
-----------

The `neon_local init --pageserver-config` flag still uses `pageserver
--config-override` under the hood. In the future, neon_local should just
write the `pageserver.toml` directly.

The `NeonEnvBuilder.pageserver_config_override` field should be renamed
to `pageserver_initial_config`. Let's save this churn for a separate
refactor commit.
2024-05-07 16:29:59 +00:00
Jure Bajic
00423152c6 Store operation identifier in IdLockMap on exclusive lock (#7397)
## Problem

Issues around operation and tenant locks would have been hard to debug
since there was little observability around them.

## Summary of changes

- As suggested in the issue, a wrapper called `IdentifierLock` was added
around `OwnedRwLockWriteGuard`; it clears the operation currently holding
the exclusive lock when it is dropped.
- The value in `IdLockMap` was extended to hold a pair of locks and
operations that can be accessed and locked independently.
- When requesting an exclusive lock, besides returning the lock on that
resource, the operation holding it is recorded once the lock is acquired.
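
A minimal sketch of the wrapper idea (illustrative names, not the actual type):

```rust
// Store the operation in the slot protected by the write lock, and clear it
// again when the guard is dropped, so a stuck lock shows who holds it.
use tokio::sync::OwnedRwLockWriteGuard;

struct TracingExclusiveGuard<T> {
    guard: OwnedRwLockWriteGuard<Option<T>>,
}

impl<T> TracingExclusiveGuard<T> {
    fn new(mut guard: OwnedRwLockWriteGuard<Option<T>>, operation: T) -> Self {
        *guard = Some(operation);
        Self { guard }
    }
}

impl<T> Drop for TracingExclusiveGuard<T> {
    fn drop(&mut self) {
        // The exclusive lock is released right after this; forget the operation.
        *self.guard = None;
    }
}
```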


Closes https://github.com/neondatabase/neon/issues/7108
2024-05-03 09:38:19 +01:00
Arpad Müller
5f099dc760 Use streaming downloads for Azure as well (#7579)
The main challenge was in the second commit, as `DownloadStream`
requires the inner stream to be Sync, but the stream returned by the
Azure SDK wasn't Sync.

This left us with three options:

* Change the Azure SDK to return Sync streams. This was abandoned after
we realized that we couldn't just make `TokenCredential`'s returned
future Sync: it uses the `async_trait` macro and as the
`TokenCredential` trait is used in dyn form, one can't use Rust's new
"async fn in Trait" feature.
* Change `DownloadStream` to not require `Sync`. This was abandoned
after it turned into a safekeeper refactoring project.
* Put the stream into a `Mutex` and make it obtain a lock on every poll.
This adds some performance overhead but locks that actually don't do
anything should be comparatively cheap.

We went with the third option in the end as the change still represents
an improvement.
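
A sketch of that third option (illustrative wrapper; the real code lives in the download stream plumbing):

```rust
// Wrap a non-Sync stream in a Mutex and take the lock on every poll; the
// wrapper is Sync as long as the inner stream is Send, and the lock is
// uncontended because a single consumer polls it.
use std::pin::Pin;
use std::sync::Mutex;
use std::task::{Context, Poll};
use futures::Stream;

struct SyncStream<S>(Mutex<Pin<Box<S>>>);

impl<S> SyncStream<S> {
    fn new(inner: S) -> Self {
        Self(Mutex::new(Box::pin(inner)))
    }
}

impl<S: Stream> Stream for SyncStream<S> {
    type Item = S::Item;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let mut inner = self.0.lock().unwrap();
        inner.as_mut().poll_next(cx)
    }
}
```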

Follow up of #5446 , fixes #5563
2024-05-02 20:19:00 +02:00
Arpad Müller
7a49e5d5c2 Remove tenant_id from TenantLocationConfigRequest (#7469)
Follow-up of #7055 and #7476 to remove `tenant_id` from
`TenantLocationConfigRequest` completely. All components of our system
should now not specify the `tenant_id`.

cc https://github.com/neondatabase/cloud/pull/11791
2024-05-02 20:18:13 +02:00
Conrad Ludgate
cb4b4750ba update to reqwest 0.12 (#7561)
## Problem

#7557

## Summary of changes
2024-05-02 11:16:04 +02:00
John Spray
a74b60066c storage controller: test for large shard counts (#7475)
## Problem

Storage controller was observed to have unexpectedly large memory
consumption when loaded with many thousands of shards.

This was recently fixed:
- https://github.com/neondatabase/neon/pull/7493

...but we need a general test that the controller is well behaved with
thousands of shards.

Closes: https://github.com/neondatabase/neon/issues/7460
Closes: https://github.com/neondatabase/neon/issues/7463

## Summary of changes

- Add test test_storage_controller_many_tenants to exercise the system's
behaviour with a more substantial workload. This test measures memory
consumption and reproduces #7460 before the other changes in this PR.
- Tweak reconcile_all's return value to make it nonzero if it spawns no
reconcilers, but _would_ have spawned some reconcilers if they weren't
blocked by the reconcile concurrency limit. This makes the test's
reconcile_until_idle behave as expected (i.e. not complete until the
system is nice and calm).
- Fix an issue where tenant migrations would leave a spurious secondary
location when migrated to some location that was not already their
secondary (this was an existing low-impact bug that tripped up the
test's consistency checks).

On the test with 8000 shards, the resident memory per shard is about
20KiB. This is not really per-shard memory: the primary source of memory
growth is the number of concurrent network/db clients we create.

With 8000 shards, the test takes 125s to run on my workstation.
2024-04-30 15:21:54 +00:00
Arseny Sher
4ac4b21598 Add retries to cloud_admin client. 2024-04-30 18:19:52 +03:00
Arseny Sher
3da54e6d90 s3_scrubber: implement scan-metadata for safekeepers.
It works by listing a Postgres table with a memory dump of safekeeper state. S3
contents for each timeline are then checked against timeline_start_lsn and
backup_lsn. If an inconsistency is found, before complaining the timeline (branch) is
checked at the control plane; it might have been deleted between the dump being taken
and the S3 check.
2024-04-30 18:19:52 +03:00