Commit Graph

5172 Commits

Author SHA1 Message Date
Christian Schwarz
b1e21c7705 Add trailing dot 2024-05-05 17:17:42 +00:00
Christian Schwarz
004af53035 git diff reduction & polish 2024-05-05 17:15:09 +00:00
Christian Schwarz
d8702dd819 Merge branch 'problame/test-suite-narrow-pageserver-config-override' into problame/remove-pageserver-config-overrides 2024-05-05 16:57:51 +00:00
Christian Schwarz
5f04224817 remove NEON_PAGESERVER_OVERRIDES env var 2024-05-05 16:53:36 +00:00
Christian Schwarz
9c547da6a6 reduce scope of this PR & fix naming 2024-05-05 16:49:20 +00:00
Christian Schwarz
6d343feef0 whitespace diff reduction 2024-05-05 16:20:58 +00:00
Christian Schwarz
f73c8c6bd6 Merge branch 'problame/test-suite-narrow-pageserver-config-override' into problame/remove-pageserver-config-overrides
Conflicts:
	control_plane/src/pageserver.rs
    => pick ours
2024-05-05 16:19:11 +00:00
Christian Schwarz
51224c84c2 even more diff reduction 2024-05-05 16:16:59 +00:00
Christian Schwarz
6bccd64514 miimize diff by re-adding whitespace 2024-05-05 15:59:05 +00:00
Christian Schwarz
28c95e4207 Merge branch 'problame/test-suite-narrow-pageserver-config-override' into problame/remove-pageserver-config-overrides 2024-05-04 16:15:45 +00:00
Christian Schwarz
aacf8110a0 pretty up the inlined override code, long option --config-override 2024-05-04 16:12:50 +00:00
Christian Schwarz
70977afd07 ruff check & format 2024-05-04 15:14:34 +00:00
Christian Schwarz
7363b44b50 neon_local: remove --pageserver-config-overrides, neon_local init takes a toml tempfile 2024-05-04 15:13:28 +00:00
Christian Schwarz
cc64e1b17f Merge branch 'problame/test-suite-narrow-pageserver-config-override' into problame/remove-pageserver-config-overrides 2024-05-04 13:17:25 +00:00
Christian Schwarz
25dfafc2df undo the renaming, it's too much churn for review; will do in a separate PR 2024-05-04 13:14:41 +00:00
Christian Schwarz
d72fe6f5ee no neon_local_overrides during start(); inline it into PageServerNode::init 2024-05-04 13:10:45 +00:00
Christian Schwarz
0bca1a5de3 Revert "neon_local: only set --pageserver-config-override=remote_storage during init, not start"
This reverts commit 511f593360.
2024-05-04 12:33:02 +00:00
Christian Schwarz
511f593360 neon_local: only set --pageserver-config-override=remote_storage during init, not start 2024-05-04 12:30:06 +00:00
Christian Schwarz
b96e0b2458 rely on init to store remote storage config in pageserver.toml
This allows inlining append_pageserver_param_overrides into NeonCli.init()
2024-05-04 12:07:33 +00:00
Christian Schwarz
b4ed3b15b9 remove support for pageserver -c/--config-override and neon_local --pageserver-config-override 2024-05-04 11:50:59 +00:00
Christian Schwarz
ad185dd594 test_suite: remove usage of --pageserver-config-override
Rewrite the pageserver.toml instead.
2024-05-04 11:50:59 +00:00
Christian Schwarz
58055c7a96 remove NEON_PAGESERVER_OVERRIDES env var (no committed code uses it) 2024-05-04 11:50:59 +00:00
Christian Schwarz
ec04f0f4d4 Merge branch 'problame/remove-pageserver-update-config-flag' into problame/test-suite-narrow-pageserver-config-override 2024-05-04 11:48:12 +00:00
Christian Schwarz
a52b563b59 fixups 2024-05-04 11:47:38 +00:00
Christian Schwarz
89afba066c refactor(test_suite): rely less on --pageserver-config-override outside of neon_local init
The `NeonCli.init()` persists the non-default pageserver config values
for remote storage & `NeonEnvBuilder.pageserver_config_override` in
`pageserver.toml`.

We don't need to repeat them on each pageserver start after that.
2024-05-04 11:34:59 +00:00
Christian Schwarz
6bcb0959ad ruff format 2024-05-04 10:25:53 +00:00
Christian Schwarz
8f3051b416 Merge branch 'main' into problame/remove-pageserver-update-config-flag 2024-05-04 10:22:05 +00:00
Christian Schwarz
998dc6255e refactor(pageserver): remove --update-init flag 2024-05-04 10:20:41 +00:00
Heikki Linnakangas
4deb8dc52e compute_ctl: Be more precise in how startup time is calculated (#7601)
- On a non-pooled start, do not reset the 'start_time' after launching
the HTTP service. In a non-pooled start, it's fair to include that in
the total startup time.

- When setting wait_for_spec_ms and resetting start_time, call
Utc::now() only once. It's a waste of cycles to call it twice, but also,
it ensures the time between setting wait_for_spec_ms and resetting
start_time is included in one or the other time period.

These differences should be insignificant in practice, in the
microsecond range, but IMHO it seems more logical and readable this way
too. Also fix and clarify some of the surrounding comments.

(This caught my eye while reviewing PR #7577)
2024-05-04 08:44:18 +03:00
Em Sharnoff
64f0613edf compute_ctl: Add support for swap resizing (#7434)
Part of neondatabase/cloud#12047. Resolves #7239.

In short, this PR:

1. Adds `ComputeSpec.swap_size_bytes: Option<u64>`
2. Adds a flag to compute_ctl: `--resize-swap-on-bind`
3. Implements running `/neonvm/bin/resize-swap` with the value from the
   compute spec before starting postgres, if both the value in the spec
   *AND* the flag are specified.
4. Adds `sudo` to the final image
5. Adds a file in `/etc/sudoers.d` to allow `compute_ctl` to resize swap

Various bits of reasoning about design decisions in the added comments.
In short: We have both a compute spec field and a flag to make rollout
easier to implement. The flag will most likely be removed as part of
cleanups for neondatabase/cloud#12047.
2024-05-03 12:57:45 -07:00
Christian Schwarz
1e7cd6ac9f refactor: move NodeMetadata to pageserver_api; use it from neon_local (#7606)
This is the first step towards representing all of Pageserver
configuration as clean `serde::Serialize`able Rust structs in
`pageserver_api`.

The `neon_local` code will then use those structs instead of the crude
`toml_edit` / string concatenation that it does today.

refs https://github.com/neondatabase/neon/issues/7555

---------

Co-authored-by: Alex Chi Z <iskyzh@gmail.com>
2024-05-03 13:15:38 -04:00
Alex Chi Z
ef03b38e52 fix(pageserver): remove update_gc_info calls in tests (#7608)
introduced by https://github.com/neondatabase/neon/pull/7468 conflicting
with https://github.com/neondatabase/neon/pull/7584

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-03 16:01:33 +00:00
Conrad Ludgate
9b65946566 proxy: add connect compute concurrency lock (#7607)
## Problem

Too many connect_compute attempts can overwhelm postgres, getting the
connections stuck.

## Summary of changes

Limit number of connection attempts that can happen at a given time.
2024-05-03 15:45:24 +00:00
Christian Schwarz
700aa96770 Merge branch 'main' into problame/move-pageserver-config-into-api-crate 2024-05-03 15:20:24 +00:00
Christian Schwarz
4a72fe0908 add requested backward-compatibility test 2024-05-03 15:19:18 +00:00
Alex Chi Z
a3fe12b6d8 feat(pageserver): add scan interface (#7468)
This pull request adds the scan interface. Scan operates on a sparse
keyspace and retrieves all the key-value pairs from the keyspaces.

Currently, scan only supports the metadata keyspace, and by default do
not retrieve anything from the ancestor branch. This should be fixed in
the future if we need to have some keyspaces that inherits from the
parent.

The scan interface reuses the vectored get code path by disabling the
missing key errors.

This pull request also changes the behavior of vectored get on aux file
v1/v2 key/keyspace: if the key is not found, it is simply not included in the
result, instead of throwing a missing key error.

TODOs in future pull requests: limit memory consumption, ensure the
search stops when all keys are covered by the image layer, remove
`#[allow(dead_code)]` once the code path is used in basebackups / aux
files, remove unnecessary fine-grained keyspace tracking in vectored get
(or have another code path for scan) to improve performance.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-03 10:43:30 -04:00
John Spray
b5a6e68e68 storage controller: check warmth of secondary before doing proactive migration (#7583)
## Problem

The logic in Service::optimize_all would sometimes choose to migrate a
tenant to a secondary location that was only recently created, resulting
in Reconciler::live_migrate hitting its 5 minute timeout warming up the
location, and proceeding to attach a tenant to a location that doesn't
have a warm enough local set of layer files for good performance.

Closes: #7532 

## Summary of changes

- Add a pageserver API for checking download progress of a secondary
location
- During `optimize_all`, connect to pageservers of candidate
optimization secondary locations, and check they are warm.
- During shard split, do heatmap uploads and start secondary downloads,
so that the new shards' secondary locations start downloading ASAP,
rather than waiting minutes for background downloads to kick in.

I have intentionally not implemented this by continuously reading the
status of locations, to avoid dealing with the scale challenge of
efficiently polling & updating 10k-100k locations status. If we
implement that in the future, then this code can be simplified to act
based on latest state of a location rather than fetching it inline
during optimize_all.
2024-05-03 14:28:23 +00:00
Christian Schwarz
ce0ddd749c test_runner: remove unused NeonPageserver.config_override field (#7605)
refs https://github.com/neondatabase/neon/issues/7555
2024-05-03 16:05:00 +02:00
Arpad Müller
426598cf76 Update rust to 1.78.0 (#7598)
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.

Release notes: https://blog.rust-lang.org/2024/05/02/Rust-1.78.0.html

Prior update was in #7198
2024-05-03 15:59:28 +02:00
Christian Schwarz
923cdff13d Merge branch 'main' into problame/move-pageserver-config-into-api-crate 2024-05-03 12:36:18 +00:00
Christian Schwarz
498edfc0ff use NodeMetadata struct for writing metadata.json from neon_local 2024-05-03 12:35:41 +00:00
Christian Schwarz
d2e2a88737 move NodeMetadata type to pageserver_api::config 2024-05-03 12:35:41 +00:00
Christian Schwarz
6f720eb38f create config module inside pageserver_api crate 2024-05-03 12:35:41 +00:00
John Spray
8b4dd5dc27 pageserver: jitter secondary periods (#7544)
## Problem

After some time the load from heatmap uploads gets rather spiky. They're
unintentionally synchronising.

Chart (does this make a _boing_ sound in anyone else's head?):

![image](https://github.com/neondatabase/neon/assets/944640/18829fc8-c5b7-4739-9a9b-491b5d6fcade)


## Summary of changes

- Add a helper `period_jitter` and apply a 5% jitter from downloader and
heatmap_uploader when updating the next runtime at the end of an
interation.
- Refactor existing places that we pick a startup interval into
`period_warmup`, so that the intent is obvious.
2024-05-03 12:31:25 +00:00
Joonas Koivunen
ed9a114bde fix: find gc cutoff points without holding Tenant::gc_cs (#7585)
The current implementation of finding timeline gc cutoff Lsn(s) is done
while holding `Tenant::gc_cs`. In recent incidents long create branch
times were caused by holding the `Tenant::gc_cs` over extremely long
`Timeline::find_lsn_by_timestamp`. The fix is to find the GC cutoff
values before taking the `Tenant::gc_cs` lock. This change is safe to do
because the GC cutoff values and the branch points have no dependencies
on each other. In the case of `Timeline::find_gc_cutoff` taking a long
time with this change, we should no longer see `Tenant::gc_cs`
interfering with branch creation.

Additionally, the `Tenant::refresh_gc_info` is now tolerant of timeline
deletions (or any other failures to find the pitr_cutoff). This helps
with the synthetic size calculation being constantly completed instead
of having a break for a timely timeline deletion.

Fixes: #7560
Fixes: #7587
2024-05-03 14:57:26 +03:00
John Spray
b7385bb016 storage_controller: fix non-timeline passthrough GETs (#7602)
## Problem

We were matching on `/tenant/:tenant_id` and
`/tenant/:tenant_id/timeline*`, but not non-timeline tenant sub-paths.
There aren't many: this was only noticeable when using the
synthetic_size endpoint by hand.

## Summary of changes

- Change the wildcard from `/tenant/:tenant_id/timeline*` to
`/tenant/:tenant_id/*`
- Add test lines that exercise this
2024-05-03 12:52:43 +01:00
Vlad Lazar
37b1930b2f tests: relax test download remote layers api (#7604)
## Problem
This test triggers layer download failures on demand. It is possible to
modify the failpoint
during a `Timeline::get_vectored` right between the vectored read and
it's validation read.
This means that one of the reads can fail while the other one succeeds
and vice versa.

## Summary of changes
These errors are expected, so allow them to happen.
2024-05-03 12:40:09 +01:00
Arpad Müller
d76963691f Increase Azure parallelism limit to 100 (#7597)
After #5563 has been addressed we can now set the Azure strorage
parallelism limit to 100 like it is for S3.

Part of #5567
2024-05-03 13:23:11 +02:00
Joonas Koivunen
60f570c70d refactor(update_gc_info): split GcInfo to compose out of GcCutoffs (#7584)
Split `GcInfo` and replace `Timeline::update_gc_info` with a method that
simply finds gc cutoffs `Timeline::find_gc_cutoffs` to be combined as
`Timeline::gc_info` at the caller.

This change will be followed up with a change that finds the GC cutoff
values before taking the `Tenant::gc_cs` lock.

Cc: #7560
2024-05-03 13:11:51 +03:00
Alex Chi Z
3582a95c87 fix(pageserver): compile warning of download_object.ctx on macos (#7596)
fix macOS compile warning introduced in
45ec8688ea

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-05-03 10:55:48 +02:00