## Problem
We want to have the data-api served by the proxy directly instead of
relying on a 3rd party to run a deployment for each project/endpoint.
## Summary of changes
With the changes below, the proxy (auth-broker) becomes also a
"rest-broker", that can be thought of as a "Multi-tenant" data-api which
provides an automated REST api for all the databases in the region.
The core of the implementation (that leverages the subzero library) is
in proxy/src/serverless/rest.rs and this is the only place that has "new
logic".
---------
Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
## Problem
Metrics are saved in https://github.com/neondatabase/neon/pull/11559,
but the file is not matched by the attachment regex.
## Summary of changes
Make attachment regex match the metrics file.
## Problem
In Neon DBaaS we adjust the shared_buffers to the size of the compute,
or better described we adjust the max number of connections to the
compute size and we adjust the shared_buffers size to the number of max
connections according to about the following sizes
`2 CU: 225mb; 4 CU: 450mb; 8 CU: 900mb`
[see](877e33b428/goapp/controlplane/internal/pkg/compute/computespec/pg_settings.go (L405))
## Summary of changes
We should run perf unit tests with settings that is realistic for a
paying customer and select 8 CU as the reference for those tests.
## Problem
`TYPE_CHECKING` is used inconsistently across Python tests.
## Summary of changes
- Update `ruff`: 0.7.0 -> 0.11.2
- Enable TC (flake8-type-checking):
https://docs.astral.sh/ruff/rules/#flake8-type-checking-tc
- (auto)fix all new issues
## Problem
Our benchmarking workflows contain links to grafana dashboards to
troubleshoot problems.
This works fine for non-pooled endpoints.
For pooled endpoints we need to remove the `-pooler` suffix from the
endpoint's hostname to get a valid endpoint ID.
Example link that doesn't work in this run
https://github.com/neondatabase/neon/actions/runs/13678933253/job/38246028316#step:8:311
## Summary of changes
Check if connection string is a -pooler connection string and if so
remove this suffix from the endpoint ID.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
Tests with mixed versions of binaries always pick up new versions if
services are started using `neon_local`.
## Summary of changes
- Set `neon_local_binpath` along with `neon_binpath` and
`pg_distrib_dir` for tests with mixed versions
## Problem
https://github.com/neondatabase/neon/issues/9965
## Summary of changes
Add safekeeper membership configuration struct itself and storing it in
the control file. In passing also add creation timestamp to the control
file (there were cases where I wanted it in the past).
Remove obsolete unused PersistedPeerInfo struct from control file (still
keep it control_file_upgrade.rs to have it in old upgrade code).
Remove the binary representation of cfile in the roundtrip test.
Updating it is annoying, and we still test the actual roundtrip.
Also add configuration to timeline creation http request, currently used
only in one python test. In passing, slightly change LSNs meaning in the
request: normally start_lsn is passed (the same as ancestor_start_lsn in
similar pageserver call), but we allow specifying higher commit_lsn for
manual intervention if needed. Also when given LSN initialize
term_history with it.
Improves `wait_until` by:
* Use `timeout` instead of `iterations`. This allows changing the
timeout/interval parameters independently.
* Make `timeout` and `interval` optional (default 20s and 0.5s). Most
callers don't care.
* Only output status every 1s by default, and add optional
`status_interval` parameter.
* Remove `show_intermediate_error`, this was always emitted anyway.
Most callers have been updated to use the defaults, except where they
had good reason otherwise.
## Problem
LFC is not enabled by default in tests, but it is enabled in production.
This increases the risk of errors in the production environment, which
were not found during the routine workflow.
However, enabling LFC for all the tests may overload the disk on our
servers and increase the number of failures.
So, we try enabling LFC in one case to evaluate the possible risk.
## Summary of changes
A new environment variable, USE_LFC is introduced. If it is set to true,
LFC is enabled by default in all the tests.
In our workflow, we enable LFC for PG17, release, x86-64, and disabled
for all other combinations.
---------
Co-authored-by: Alexey Masterov <alexeymasterov@neon.tech>
Co-authored-by: a-masterov <72613290+a-masterov@users.noreply.github.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Stas Kelvic <stas@neon.tech>
# Context
This PR contains PoC-level changes for a product feature that allows
onboarding large databases into Neon without going through the regular
data path.
# Changes
This internal RFC provides all the context
* https://github.com/neondatabase/cloud/pull/19799
In the language of the RFC, this PR covers
* the Importer code (`fast_import`)
* all the Pageserver changes (mgmt API changes, flow implementation,
etc)
* a basic test for the Pageserver changes
# Reviewing
As acknowledged in the RFC, the code added in this PR is not ready for
general availability.
Also, the **architecture is not to be discussed in this PR**, but in the
RFC and associated Slack channel instead.
Reviewers of this PR should take that into consideration.
The quality bar to apply during review depends on what area of the code
is being reviewed:
* Importer code (`fast_import`): practically anything goes
* Core flow (`flow.rs`):
* Malicious input data must be expected and the existing threat models
apply.
* The code must not be safe to execute on *dedicated* Pageserver
instances:
* This means in particular that tenants *on other* Pageserver instances
must not be affected negatively wrt data confidentiality, integrity or
availability.
* Other code: the usual quality bar
* Pay special attention to correct use of gate guards, timeline
cancellation in all places during shutdown & migration, etc.
* Consider the broader system impact; if you find potentially
problematic interactions with Storage features that were not covered in
the RFC, bring that up during the review.
I recommend submitting three separate reviews, for the three high-level
areas with different quality bars.
# References
(Internal-only)
* refs https://github.com/neondatabase/cloud/issues/17507
* refs https://github.com/neondatabase/company_projects/issues/293
* refs https://github.com/neondatabase/company_projects/issues/309
* refs https://github.com/neondatabase/cloud/issues/20646
---------
Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
## Problem
On Debian 12 (Bookworm), Python 3.11 is the latest available version.
## Summary of changes
- Update Python to 3.11 in build-tools
- Fix ruff check / format
- Fix mypy
- Use `StrEnum` instead of pair `str`, `Enum`
- Update docs
## Problem
I've noticed that we have 2 flaky tests which failed with error:
```
re.error: missing ), unterminated subpattern at position 21
```
- `test_timeline_archival_chaos` — has been already fixed
- `test_sharded_tad_interleaved_after_partial_success` — I didn't manage
to find the incorrect regex
[Internal link](https://neonprod.grafana.net/goto/yfmVHV7NR?orgId=1)
## Summary of changes
- Wrap `re.match` in `try..except` block and print incorrect regex
## Problem
Running `pytest.skip(...)` in a test body instead of marking the test
with `@pytest.mark.skipif(...)` makes all fixtures to be initialised,
which is not necessary if the test is going to be skipped anyway.
Also, some tests are unnecessarily skipped (e.g. `test_layer_bloating`
on Postgres 17, or `test_idle_reconnections` at all) or run (e.g.
`test_parse_project_git_version_output_positive` more than on once
configuration) according to comments.
## Summary of changes
- Move `skip_on_postgres` / `xfail_on_postgres` /
`run_only_on_default_postgres` decorators to `fixture.utils`
- Add new `skip_in_debug_build` and `skip_on_ci` decorators
- Replace `pytest.skip(...)` calls with decorators where possible
## Problem
Part of https://github.com/neondatabase/neon/issues/8623
## Summary of changes
Removed all aux-v1 config processing code. Note that we persisted it
into the index part file, so we cannot really remove the field from
index part. I also kept the config item within the tenant config, but we
will not read it any more.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
If the environment variables `COMPATIBILITY_NEON_BIN` or
`COMPATIBILITY_POSTGRES_DISTRIB_DIR` are not set (this is usual during a
local run), the tests with the versions mix cannot run.
## Summary of changes
If these variables are not set turn off the version mix.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
We faced the problem of incompatibility of the different components of
different versions.
This should be detected automatically to prevent production bugs.
## Summary of changes
The test for this situation was implemented
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
It didn't serve much value, and was only used twice.
Path(__file__).parent is a pretty easy invocation to use.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We need to be able to run the regression tests against a cloud-based
Neon staging instance to prepare the migration to the arm architecture.
## Summary of changes
Some tests were modified to work on the cloud instance (i.e. added
passwords, server-side copy changed to client-side, etc)
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
https://github.com/neondatabase/neon/pull/8588 implemented the mechanism
for storage controller
leadership transfers. However, there's no tests that exercise the
behaviour.
## Summary of changes
1. Teach `neon_local` how to handle multiple storage controller
instances. Each storage controller
instance gets its own subdirectory (`storage_controller_1, ...`).
`storage_controller start|stop` subcommands
have also been extended to optionally accept an instance id.
2. Add a storage controller proxy test fixture. It's a basic HTTP server
that forwards requests from pageserver
and test env to the currently configured storage controller.
3. Add a test which exercises storage controller leadership transfer.
4. Finally fix a couple bugs that the test surfaced
## Problem
We need to test the logical replication with some external consumers.
## Summary of changes
A test of the logical replication with Debezium as a consumer was added.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
There's a `NeonEnvBuilder#preserve_database_files` parameter that allows
you to keep database files for debugging purposes (by default, files get
cleaned up), but there's no way to get these files from a CI run.
This PR adds handling of `NeonEnvBuilder#preserve_database_files` and
adds the compressed test output directory to Allure reports (for tests
with this parameter enabled).
Ref https://github.com/neondatabase/neon/issues/6967
## Summary of changes
- Compress and add the whole test output directory to Allure reports
- Currently works only with `neon_env_builder` fixture
- Remove `preserve_database_files = True` from sharding tests as
unneeded
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
As seen with the pgvector 0.7.0 index builds, we can receive large
batches of images, leading to very large L0 layers in the range of 1GB.
These large layers are produced because we are only able to roll the
layer after we have witnessed two different Lsns in a single
`DataDirModification::commit`. As the single Lsn batches of images can
span over multiple `DataDirModification` lifespans, we will rarely get
to write two different Lsns in a single `put_batch` currently.
The solution is to remember the TimelineWriterState instead of eagerly
forgetting it until we really open the next layer or someone else
flushes (while holding the write_guard).
Additional changes are test fixes to avoid "initdb image layer
optimization" or ignoring initdb layers for assertion.
Cc: #7197 because small `checkpoint_distance` will now trigger the
"initdb image layer optimization"
Do pull_timeline while WAL is being removed. To this end
- extract pausable_failpoint to utils, sprinkle pull_timeline with it
- add 'checkpoint' sk http endpoint to force WAL removal.
After fixing checking for pull file status code test fails so far which is
expected.
"taking a fullbackup" is an ugly multi-liner copypasted in multiple
places, most recently with timeline ancestor detach tests. move it under
`PgBin` which is not a great place, but better than yet another utility
function.
Additionally:
- cleanup `psql_env` repetition (PgBin already configures that)
- move the backup tar comparison as a yet another free utility function
- use backup tar comparison in `test_import.py` where a size check was
done previously
- cleanup extra timeline creation from test
Cc: #7715
This pull request adds the new basebackup read path + aux file write
path. In the regression test, all logical replication tests are run with
matrix aux_file_v2=false/true.
Also fixed the vectored get code path to correctly return missing key
error when being called from the unified sequential get code path.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
As with the pageserver, we should fail tests that emit unexpected log
errors/warnings.
## Summary of changes
- Refactor existing log checks to be reusable
- Run log checks for attachment_service
- Add allow lists as needed.
Extracted from https://github.com/neondatabase/neon/pull/6953
Part of https://github.com/neondatabase/neon/issues/5899
Core Change
-----------
In #6953, we need the ability to scan the log _after_ a specific line
and ignore anything before that line.
This PR changes `log_contains` to returns a tuple of `(matching line,
cursor)`.
Hand that cursor to a subsequent `log_contains` call to search the log
for the next occurrence of the pattern.
Other Changes
-------------
- Inspect all the callsites of `log_contains` to handle the new tuple
return type.
- Above inspection unveiled many callers aren't using `assert
log_contains(...) is not None` but some weaker version of the code that
breaks if `log_contains` ever returns a not-None but falsy value. Fix
that.
- Above changes unveiled that `test_remote_storage_upload_queue_retries`
was using `wait_until` incorrectly; after fixing the usage, I had to
raise the `wait_until` timeout. So, maybe this will fix its flakiness.
aiming for faster to understand a bunch of `.stdout` and `.stderr`
files, see example echo_1.stdout differences:
```
+# echo foobar abbacd
+
foobar abbacd
```
it can be disabled and is disabled in this PR for some tests; use
`pg_bin.run_capture(..., with_command_header=False)` for that.
as a bonus this cleans up the echoed newlines from s3_scrubber
output which are also saved to file but echoed to test log.
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
we have test cases which launch processes from threads, and they capture
output assuming this counter is thread-safe. at least according to my
understanding this operation in python requires a lock to be
thread-safe.
fixes https://github.com/neondatabase/neon/issues/5878
obsoletes https://github.com/neondatabase/neon/issues/5879
Before this PR, it could happen that `load_layer_map` schedules removal
of the future
image layer. Then a later compaction run could re-create the same image
layer, scheduling a PUT.
Due to lack of an upload queue barrier, the PUT and DELETE could be
re-ordered.
The result was IndexPart referencing a non-existent object.
## Summary of changes
* Add support to `pagectl` / Python tests to decode `IndexPart`
* Rust
* new `pagectl` Subcommand
* `IndexPart::{from,to}_s3_bytes()` methods to internalize knowledge
about encoding of `IndexPart`
* Python
* new `NeonCli` subclass
* Add regression test
* Rust
* Ability to force repartitioning; required to ensure image layer
creation at last_record_lsn
* Python
* The regression test.
* Fix the issue
* Insert an `UploadOp::Barrier` after scheduling the deletions.
## Problem
The scrubber didn't know how to find the latest index_part when
generations were in use.
## Summary of changes
- Teach the scrubber to do the same dance that pageserver does when
finding the latest index_part.json
- Teach the scrubber how to understand layer files with generation
suffixes.
- General improvement to testability: scan_metadata has a machine
readable output that the testing `S3Scrubber` wrapper can read.
- Existing test coverage of scrubber was false-passing because it just
didn't see any data due to prefixing of data in the bucket. Fix that.
This is incremental improvement: the more confidence we can have in the
scrubber, the more we can use it in integration tests to validate the
state of remote storage.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
- Because we compress artifacts file by file, we don't need to put them
into `tar` containers (ie instead of `tar.gz` we can use just `gz`).
- Pythons gz single-threaded and pretty slow.
A benchmark has shown ~20 times speedup (19.876176291 vs
0.8748335830000009) on my laptop (for a pageserver.log size is 1.3M)
## Summary of changes
- Replace tarfile with zstandart
- Update allure to 2.24.0
We have rare walredo failures with pg16.
Let's introduce recording of failing walredo input in `#[cfg(feature =
"testing")]`. There is additional logging (the value reconstruction path
logging usually shown with not found keys), keeping it for
`#[cfg(features = "testing")]`.
Cc: #5404.
## Problem
- Scrubber's `tidy` command requires presence of a control plane
- Scrubber has no tests at all
## Summary of changes
- Add re-usable async streams for reading metadata from a bucket
- Add a `scan-metadata` command that reads from those streams and calls
existing `checks.rs` code to validate metadata, then returns a summary
struct for the bucket. Command returns nonzero status if errors are
found.
- Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that
tests using remote storage can request to have the scrubber run after
they finish
- Enable remote storarge and scrub_on_exit in test_pageserver_restart
and test_pageserver_chaos
This is a "toe in the water" of the overall space of validating the
scrubber. Later, we should:
- Enable scrubbing at end of tests using remote storage by default
- Make the success condition stricter than "no errors": tests should
declare what tenants+timelines they expect to see in the bucket (or
sniff these from the functions tests use to create them) and we should
require that the scrubber reports on these particular tenants/timelines.
The `tidy` command is untouched in this PR, but it should be refactored
later to use similar async streaming interface instead of the current
batch-reading approach (the streams are faster with large buckets), and
to also be covered by some tests.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Conrad Ludgate <conrad@neon.tech>