Commit Graph

6059 Commits

Author SHA1 Message Date
Heikki Linnakangas
d211f00f05 Remove unnecessary dependencies (#9000)
Found by "cargo machete"
2024-09-17 17:55:45 +03:00
Alexander Bayandin
cd4276fd65 CI: fix release pipeline (#9017)
## Problem

We've got 2 non-blocking failures on the release pipeline:
- `promote-compatibility-data` job got skipped _presumably_ because one
of the dependencies of `deploy` job (`push-to-acr-dev`) got skipped
(https://github.com/neondatabase/neon/pull/8940)
- `coverage-report` job fails because we don't build debug artifacts in
the release branch (https://github.com/neondatabase/neon/pull/8561)

## Summary of changes
- Always run `push-to-acr-dev` / `push-to-acr-prod` jobs, but add
`skip_if` parameter to the reusable workflow, which can skip the job
internally, without skipping externally
- Do not run `coverage-report` on release branches
2024-09-17 10:17:48 +01:00
Vlad Lazar
b719d58863 storcon: forward requests from stepped down instance to the current leader (#8954)
## Problem
It turns out that we can't rely on external orchestration to promptly
route trafic to the new leader. This is downtime inducing.
Forwarding provides a safe way out.

## Safety
We forward when:
1. Request is not one of ["/control/v1/step_down", "/status", "/ready",
"/metrics"]
2. Current instance is in [`LeadershipStatus::SteppedDown`] state
3. There is a leader in the database to forward to
4. Leader from step (3) is not the current instance

If a storcon instance is persisted in the database, then we know that it
is the current leader.
There's one exception: time between handling step-down request and the
new leader updating the
database.

Let's treat the happy case first. The stepped down node does not produce
any side effects,
since all request handling happens on the leader.

As for the edge case, we are guaranteed to always have a maximum of two
running instances.
Hence, if we are in the edge case scenario the leader persisted in the
database is the
stepped down instance that received the request. Condition (4) above
covers this scenario.

## Summary of changes
* Conversion utilities for reqwest <-> hyper. I'm not happy with these,
but I don't see a better way. Open to suggestions.
* Add request forwarding logic
* Update each request handler. Again, not happy with this. If anyone
knows a nice to wrap the handlers, lmk. Me and Joonas tried :/
* Update each handler to maybe forward
* Tweak tests to showcase new behaviour
2024-09-17 09:25:42 +01:00
Heikki Linnakangas
2db840d8b8 Move a few test functions related to auth tokens to separate file (#9018)
For readability. neon_fixtures.py is huge.
2024-09-17 06:53:18 +03:00
Heikki Linnakangas
4295ff0f07 Mark a couple of test fixtures as session-scoped (#9018)
pg_distrib_dir doesn't include the Postgres version and only depends
on env variables which cannot change during a test run, so it can be
marked as session-scoped. Similarly, the platform cannot change during
a test run.
2024-09-17 06:53:18 +03:00
Heikki Linnakangas
c6f56b8462 Remove redundant get_dir_size() function (#9018)
There was another copy of it in utils.py. The only difference is that
the version in utils.py tolerates files that are concurrently
removed. That seems fine for the few callers in neon_fixtures.py too.
2024-09-17 06:53:18 +03:00
Heikki Linnakangas
fec9321fc0 Use Path type in a few more places in neon_fixtures.py (#9018)
This is in preparation of replacing neon_fixtures.get_dir_size with
neon_fixtures.utils.get_dir_size() in next commit.
2024-09-17 06:53:18 +03:00
Heikki Linnakangas
3a52e356c1 Remove unused function (#9018) 2024-09-17 06:53:18 +03:00
Tristan Partin
5e16c7bb0b Generate pgbench data on the server for most tests
This should generally be faster when running tests, especially those
that run with higher scales.

Ignoring test_lfc_resize since it seems like we are hitting a query
timeout for some reason that I have yet to investigate. A little bit of
improvemnt is better than none.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-16 23:37:36 +01:00
Heikki Linnakangas
2bbb4d3e1c Remove misc unused code (#9014) 2024-09-16 18:45:19 +00:00
Matthias van de Meent
c8bedca582 Fix PG17's extension modifications (#9010)
This also reduces the GRANT statements to one per created _reset
function
2024-09-16 17:06:31 +01:00
Tristan Partin
5876c441ab Grant access to pg_show_replication_origin_status for neon_superuser
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-16 16:38:55 +01:00
Alexander Bayandin
b2c83db54d CI(gather-rust-build-stats): set PQ_LIB_DIR to Postgres 17 (#9001)
## Problem

`gather-rust-build-stats` extra CI job fails with 
```
"PQ_LIB_DIR" doesn't exist in the configured path: "/__w/neon/neon/pg_install/v16/lib"
```

## Summary of changes
- Use the path to Postgres 17 for the `gather-rust-build-stats` job. 
The job uses Postgres built by `make walproposer-lib`
2024-09-16 12:44:26 +01:00
Matthias van de Meent
0a8c5e1214 Fix broken image for PG17 (#8998)
Most extensions are not required to run Neon-based PostgreSQL, but the
Neon extension is _quite_ critical, so let's make sure we include it.

## Problem

Staging doesn't have working compute images for PG17

## Summary of changes

Disable some PG17 filters so that we get the critical components into the PG17 image
2024-09-13 15:10:52 +01:00
Matthias van de Meent
78938d1b59 [compute/postgres] feature: PostgreSQL 17 (#8573)
This adds preliminary PG17 support to Neon, based on RC1 / 2024-09-04
07b828e9d4

NOTICE: The data produced by the included version of the PostgreSQL fork
may not be compatible with the future full release of PostgreSQL 17 due to
expected or unexpected future changes in magic numbers and internals.
DO NOT EXPECT DATA IN V17-TENANTS TO BE COMPATIBLE WITH THE 17.0
RELEASE!

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-09-12 23:18:41 +01:00
Stefan Radig
fcab61bdcd Prototype implementation for private access poc (#8976)
## Problem
For the Private Access POC we want users to be able to disable access
from the public proxy. To limit the number of changes this can be done
by configuring an IP allowlist [ "255.255.255.255" ]. For the Private
Access proxy a new commandline flag allows to disable IP allowlist
completely.

See
https://www.notion.so/neondatabase/Neon-Private-Access-POC-Proposal-8f707754e1ab4190ad5709da7832f020?d=887495c15e884aa4973f973a8a0a582a#7ac6ec249b524a74adbeddc4b84b8f5f
for details about the POC.,

## Summary of changes
- Adding the commandline flag is_private_access_proxy=true will disable
IP allowlist
2024-09-12 15:55:12 +01:00
Tristan Partin
9e3ead3689 Collect the last of on-demand WAL download in CreateReplicationSlot reverts
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-12 11:31:38 +01:00
Heikki Linnakangas
8dc069037b Remove NeonEnvBuilder.start() function
It feels wrong to me to start() from the builder object. Surely the
thing you start is the environment itself, not its configuration.
2024-09-12 01:28:56 +03:00
Heikki Linnakangas
0a363c3dce Add --timeline-id option to "neon_local timeline branch" command
Makes it consistent with the "timeline create" and "timeline import"
commands, which allowed you to pass the timeline id as argument. This
also makes it unnecessary to parse the timeline ID from the output in
the python function that calls it.
2024-09-12 01:28:56 +03:00
Heikki Linnakangas
aeca15008c Remove obsolete and misleading comment
The tenant ID was not actually generated here but in NeonEnvBuilder.
And the "neon_local init" command hasn't been able to generate the
initial tenant since 8712e1899e anyway.
2024-09-12 01:28:56 +03:00
Heikki Linnakangas
43846b72fa Remove unused "neon_local init --pg-version" arg
It has been unused since commit 8712e1899e, when it stopped creating
the initial timeline.
2024-09-12 01:28:56 +03:00
John Spray
cb060548fb libs: tweak PageserverUtilization::is_overloaded (#8946)
## Problem

Having run in production for a while, we see that nodes are generally
safely oversubscribed by about a factor of 2.

## Summary of changes

Tweak the is_overloaded method to check for utililzation over 200%
rather than over 100%
2024-09-11 18:45:34 +01:00
Folke Behrens
bae793ffcd proxy: Handle all let underscore instances (#8898)
* Most can be simply replaced
* One instance renamed to _rtchk (return-type check)
2024-09-10 15:36:08 +02:00
John Spray
26b5fcdc50 reinstate write-path key check (#8973)
## Problem

In https://github.com/neondatabase/neon/pull/8621, validation of keys
during ingest was removed because the places where we actually store
keys are now past the point where we have already converted them to
CompactKey (i128) representation.

## Summary of changes

Reinstate validation at an earlier stage in ingest. This doesn't cover
literally every place we write a key, but it covers most cases where
we're trusting postgres to give us a valid key (i.e. one that doesn't
try and use a custom spacenode).
2024-09-10 12:54:25 +01:00
Arpad Müller
97582178cb Remove async_trait from the Handler trait (#8958)
Newest attempt to remove `async_trait` from the Handler trait.

Earlier attempts were in #7301 and #8296 .
2024-09-10 02:40:00 +02:00
Matthias van de Meent
842be0ba74 Specialize WalIngest on PostgreSQL version (#8904)
The current code assumes that most of this functionality is
version-independent, which is only true up to v16 - PostgreSQL 17 has a
new field in CheckPoint that we need to keep track of.

This basically removes the file-level dependency on v14, and replaces it
with switches that load the correct version dependencies where required.
2024-09-09 23:01:52 +01:00
Heikki Linnakangas
982b376ea2 Update parquet crate to a released version (#8961)
PR #7782 set the dependency in Cargo.toml to 'master', and locked the
version to commit that contained a specific fix, because we needed the
fix before it was included in a versioned release. The fix was later
included in parquet crate version 52.0.0, so we can now switch back to
using a released version. The latest release is 53.0.0, switch straight
to that.

---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2024-09-10 00:04:00 +03:00
Alex Chi Z.
e158df4e86 feat(pageserver): split delta writer automatically determines key range (#8850)
close https://github.com/neondatabase/neon/issues/8838

## Summary of changes

This patch modifies the split delta layer writer to avoid taking
start_key and end_key when creating/finishing the layer writer. The
start_key for the delta layers will be the first key provided to the
layer writer, and the end_key would be the `last_key.next()`. This
simplifies the delta layer writer API.

On that, the layer key hack is removed. Image layers now use the full
key range, and delta layers use the first/last key provided by the user.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-09-09 22:03:27 +01:00
Heikki Linnakangas
723c0971e8 Don't create 'empty' branch in neon_simple_env (#8965)
Now that we've given up hope on sharing the neon_simple_env between
tests, there's no reason to not use the 'main' branch directly.
2024-09-09 12:38:34 +03:00
Heikki Linnakangas
c8f67eed8f Remove TEST_SHARED_FIXTURES (#8965)
I wish it worked, but it's been broken for a long time, so let's admit
defeat and remove it.

The idea of sharing the same pageserver and safekeeper environment
between tests is still sound, and it could save a lot of time in our
CI. We should perhaps put some time into doing that, but we're better
off starting from scratch than trying to make TEST_SHARED_FIXTURES
work in its current form.
2024-09-09 12:38:34 +03:00
Heikki Linnakangas
2d885ac07a Update strum (#8962)
I wanted to use some features from the newer version. The PR that needed
the new version is not ready yet (and might never be), but seems nice to
stay up in any case.
2024-09-08 21:47:57 +03:00
Heikki Linnakangas
89c5e80b3f Update toml and toml_edit crates (#8963)
Eliminates a few duplicate versions from the dependency tree.
2024-09-08 21:47:23 +03:00
Heikki Linnakangas
93ec7503e0 Lock the correct revision of rust-postgres crates (#8960)
We modified the crate in an incompatible way and upgraded to the new
version in PR #8076. However, it was reverted in #8654. The revert
reverted the Cargo.lock reference to it, but since Cargo.toml still
points to the (tip of the) 'neon' branch, every time you make any other
unrelated changes to Cargo.toml, it also tries to update the
rust-postgres crates to the tip of the 'neon' branch again, which
doesn't work.

To fix, lock the crates to the exact commit SHA that works.
2024-09-07 14:11:36 +01:00
Alexander Bayandin
7d7d1f354b Fix rust warnings on macOS (#8955)
## Problem
```
error: unused import: `anyhow::Context`
 --> libs/utils/src/crashsafe.rs:8:5
  |
8 | use anyhow::Context;
  |     ^^^^^^^^^^^^^^^
  |
  = note: `-D unused-imports` implied by `-D warnings`
  = help: to override `-D warnings` add `#[allow(unused_imports)]`

error: unused variable: `fd`
   --> libs/utils/src/crashsafe.rs:209:15
    |
209 | pub fn syncfs(fd: impl AsRawFd) -> anyhow::Result<()> {
    |               ^^ help: if this is intentional, prefix it with an underscore: `_fd`
    |
    = note: `-D unused-variables` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(unused_variables)]`
```

## Summary of changes
- Fix rust warnings on macOS
2024-09-07 08:17:25 +01:00
Cihan Demirci
16c200d6d9 push images to prod ACR (#8940)
Used `vars` for new storing non-sensitive information, changed dev
secrets to vars as well but
didn't cleanup any secrets.

https://github.com/neondatabase/cloud/issues/16925

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-09-07 00:20:36 +01:00
Joonas Koivunen
3dbd34aa78 feat(storcon): forward gc blocking and unblocking (#8956)
Currently using gc blocking and unblocking with storage controller
managed pageservers is painful. Implement the API on storage controller.

Fixes: #8893
2024-09-06 22:42:55 +01:00
Arpad Müller
fa3fc73c1b Address 1.82 clippy lints (#8944)
Addresses the clippy lints of the beta 1.82 toolchain.

The `too_long_first_doc_paragraph` lint complained a lot and was
addressed separately: #8941
2024-09-06 21:05:18 +02:00
Alex Chi Z.
ac5815b594 feat(storage-controller): add node shards api (#8896)
For control-plane managed tenants, we have the page in the admin console
that lists all tenants on a specific pageserver. But for
storage-controller managed ones, we don't have that functionality for
now.

## Summary of changes

Adds an API that lists all shards on a given node (intention + observed)

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-09-06 14:14:21 -04:00
Alexander Bayandin
30583cb626 CI(label-for-external-users): add retry logic for unexpected errors (#8938)
## Problem

One of the PRs opened by a `neondatabase` org member got labelled as
`external` because the `gh api` call failed in the wrong way:
```
Get "https://api.github.com/orgs/neondatabase/members/<username>": dial tcp 140.82.114.5:443: i/o timeout
is-member=false
```

## Summary of changes
- Check that the error message is expected before labelling PRs
- Retry `gh api` call for 10 times in case of unexpected error messages
- Add `workflow_dispatch` trigger
2024-09-06 17:42:35 +01:00
Arseny Sher
c1a51416db safekeeper: fsync filesystem on start.
We can't really rely on files contents after boot without fsync'ing
them.
2024-09-06 19:14:25 +03:00
Arseny Sher
8eab7009c1 safekeeper: do pid file lock before id init 2024-09-06 19:14:25 +03:00
Arseny Sher
11cf16e3f3 safekeeper: add term_bump endpoint.
When walproposer observes now higher term it restarts instead of
crashing whole compute with PANIC; this avoids compute crash after
term_bump call. After successfull election we're still checking
last_log_term of the highest given vote to ensure basebackup is good,
and PANIC otherwise.

It will be used for migration per
035-safekeeper-dynamic-membership-change.md
and
https://github.com/neondatabase/docs/pull/21

ref https://github.com/neondatabase/neon/issues/8700
2024-09-06 19:13:50 +03:00
Folke Behrens
af6f63617e proxy: clean up code and lints for 1.81 and 1.82 (#8945) 2024-09-06 17:13:30 +02:00
Arseny Sher
e287f36a05 safekeeper: fix endpoint restart immediately after xlog switch.
Check that truncation point is not from the future by comparing it with
write_record_lsn, not write_lsn, and explain that xlog switch changes
their normal order.

ref https://github.com/neondatabase/neon/issues/8911
2024-09-06 18:09:21 +03:00
Arpad Müller
cbcd4058ed Fix 1.82 clippy lint too_long_first_doc_paragraph (#8941)
Addresses the 1.82 beta clippy lint `too_long_first_doc_paragraph` by
adding newlines to the first sentence if it is short enough, and making
a short first sentence if there is the need.
2024-09-06 14:33:52 +02:00
Vlad Lazar
e86fef05dd storcon: track preferred AZ for each tenant shard (#8937)
## Problem
We want to do AZ aware scheduling, but don't have enough metadata.

## Summary of changes
Introduce a `preferred_az_id` concept for each managed tenant shard.

In a future PR, the scheduler will use this as a soft preference. 
The idea is to try and keep the shard attachments within the same AZ.
Under the assumption that the compute was placed in the correct AZ,
this reduces the chances of cross AZ trafic from between compute and PS.

In terms of code changes we:
1. Add a new nullable `preferred_az_id` column to the `tenant_shards`
table. Also include an in-memory counterpart.
2. Populate the preferred az on tenant creation and shard splits.
3. Add an endpoint which allows to bulk-set preferred AZs.

(3) gives us the migration path. I'll write a script which queries the
cplane db in the region and sets the preferred az of all shards with an 
active compute to the AZ of said compute. For shards without an active compute, 
I'll use the AZ of the currently attached pageserver
since this is what cplane uses now to schedule computes.
2024-09-06 13:11:17 +01:00
Arpad Müller
a1323231bc Update Rust to 1.81.0 (#8939)
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.

[Release notes](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1810-2024-09-05).

Prior update was in #8667 and #8518
2024-09-06 12:40:19 +02:00
Christian Schwarz
06e840b884 compact_level0_phase1: ignore access mode config, always do streaming-kmerge without validation (#8934)
refs https://github.com/neondatabase/neon/issues/8184

PR https://github.com/neondatabase/infra/pull/1905 enabled
streaming-kmerge without validation everywhere.

It rolls into prod sooner or in the same release as this PR.
2024-09-06 10:58:48 +02:00
Christian Schwarz
cf11c8ab6a update svg_fmt to 0.4.3 (#8930)
Audited

```
diff -r -u ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/svg_fmt-0.4.{2,3}
```

fixes https://github.com/neondatabase/neon/issues/7763
2024-09-06 10:52:29 +02:00
Vlad Lazar
04f99a87bf storcon: make pageserver AZ id mandatory (#8856)
## Problem
https://github.com/neondatabase/neon/pull/8852 introduced a new nullable
column for the `nodes` table: `availability_zone_id`

## Summary of changes
* Make neon local and the test suite always provide an az id
* Make the az id field in the ps registration request mandatory
* Migrate the column to non-nullable and adjust in memory state
accordingly
* Remove the code that was used to populate the az id for pre-existing
nodes
2024-09-05 19:14:21 +01:00