## Problem
`test_runner` integration tests should support gRPC.
Touches #11926.
## Summary of changes
* Enable gRPC for Pageservers, with dynamic port allocations.
* Add a `grpc` parameter for endpoint creation, plumbed through to
`neon_local endpoint create`.
No tests actually use gRPC yet, since computes don't support it yet.
## Problem
`test_sharding_failures` is flaky due to interference from the
`background_reconcile` process.
The details are in the issue
https://github.com/neondatabase/neon/issues/12029.
## Summary of changes
- Use `reconcile_until_idle` to ensure a stable state before running
test assertions
- Added error tolerance in `reconcile_until_idle` test function (Failure
cases: 1, 3, 19, 20)
- Ignore the `Keeping extra secondaries` warning message since it i
retryable (Failure case: 2)
- Deduplicated code in `assert_rolled_back` and `assert_split_done`
- Added a log message before printing plenty of Node `X` seen on
pageserver `Y`
## Problem
- We run the large tenant oltp workload with a fixed size (larger than
existing customers' workloads).
Our customer's workloads are continuously growing and our testing should
stay ahead of the customers' production workloads.
- we want to touch all tables in the tenant's database (updates) so that
we simulate a continuous change in layer files like in a real production
workload
- our current oltp benchmark uses a mixture of read and write
transactions, however we also want a separate test run with read-only
transactions only
## Summary of changes
- modify the existing workload to have a separate run with pgbench
custom scripts that are read-only
- create a new workload that
- grows all large tables in each run (for the reuse branch in the large
oltp tenant's project)
- updates a percentage of rows in all large tables in each run (to
enforce table bloat and auto-vacuum runs and layer rebuild in
pageservers
Each run of the new workflow increases the logical database size about
16 GB.
We start with 6 runs per day which will give us about 96-100 GB growth
per day.
---------
Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>
- Add optional `?mode=fast|immediate` to `/terminate`, `fast` is
default. Immediate avoids waiting 30
seconds before returning from `terminate`.
- Add `TerminateMode` to `ComputeStatus::TerminationPending`
- Use `/terminate?mode=immediate` in `neon_local` instead of `pg_ctl
stop` for `test_replica_promotes`.
- Change `test_replica_promotes` to check returned LSN
- Annotate `finish_sync_safekeepers` as `noreturn`.
https://github.com/neondatabase/cloud/issues/29807
This check API only cheks the safekeeper_connstrings at the moment, and
the validation is limited to checking we have at least one entry in
there, and no duplicates.
## Problem
If the compute_ctl service is started with an empty list of safekeepers,
then hard-to-debug errors may happen at runtime, where it would be much
easier to catch them early.
## Summary of changes
Add an entry point in the compute_ctl API to validate the configuration
for safekeeper_connstrings.
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Create a separate stage for downloading the Rust toolchain for pgrx, so
that it can be cached independently of the pg-build layer. Before this,
the 'pg-build-nonroot=with-cargo' layer was unnecessarily rebuilt every
time there was a change in PostgreSQL sources. Furthermore, this allows
using the same cached layer for building the compute images of all
Postgres versions.
## Problem
Full basebackups are used in tests, and may be useful for debugging as
well, so we should support them in the gRPC API.
Touches #11728.
## Summary of changes
Add `GetBaseBackupRequest::full` to generate full base backups.
The libpq implementation also allows specifying `prev_lsn` for full
backups, i.e. the end LSN of the previous WAL record. This is omitted in
the gRPC API, since it's not used by any tests, and presumably of
limited value since it's autodetected. We can add it later if we find
that we need it.
## Problem
The checksum for eigen (a dependency for onnxruntime) has changed which
breaks compute image build.
## Summary of changes
- Add a patch for onnxruntime which backports changes from
f57db79743
(we keep the current version)
Ref https://github.com/microsoft/onnxruntime/issues/24861
## Problem
https://github.com/neondatabase/neon/issues/7570
Even triggers are supported only for superusers.
## Summary of changes
Temporary switch to superuser when even trigger is created and disable
execution of user's even triggers under superuser.
---------
Co-authored-by: Dimitri Fontaine <dim@tapoueh.org>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
When converting `proto::GetPageRequest` into `page_api::GetPageRequest`
and validating the request, errors are returned as `tonic::Status`. This
will tear down the GetPage stream, which is disruptive and unnecessary.
## Summary of changes
Emit invalid request errors as `GetPageResponse` with an appropriate
`status_code` instead.
Also move the conversion from `tonic::Status` to `GetPageResponse` out
into the stream handler.
## Problem
Compatibility tests may be run against a compatibility snapshot
generated with --timelines-onto-safekeepers=false. We need to start the
compute without a generation (or with 0 generation) if the timeline is
not storcon-managed, otherwise the compute will hang.
- Follow up on https://github.com/neondatabase/neon/pull/12203
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Handle compatibility snapshot generated with no
`--timelines-onot-safekeepers` properly
## Problem
The cache was introduced as a hackathon project and the only supported
limit was the number of entries.
The basebackup entry size may vary. We need to have more control over
disk space usage to ship it to production.
- Part of https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Store the size of entries in the cache and use it to limit
`max_total_size_bytes`
- Add the size of the cache in bytes to metrics.
## Problem
`neon_local` should support endpoints using gRPC, by providing `grpc://`
connstrings with the Pageservers' gRPC ports.
Requires #12268.
Touches #11926.
## Summary of changes
* Add `--grpc` switch for `neon_local endpoint create`.
* Generate `grpc://` connstrings for endpoints when enabled.
Computes don't actually support `grpc://` connstrings yet, but will
soon.
gRPC is configured when the endpoint is created, not when it's started,
such that it continues to use gRPC across restarts and reconfigurations.
In particular, this is necessary for the storage controller's local
notify hook, which can't easily plumb through gRPC configuration from
the start/reconfigure commands but has access to the endpoint's
configuration.
## Problem
The metrics `storage_controller_safekeeper_request_error` and
`storage_controller_safekeeper_request_latency` currently use
`pageserver_id` as a label.
This can be misleading, as the metrics are about safekeeper requests.
We want to replace this with a more accurate label — either
`safekeeper_id` or `node_id`.
## Summary of changes
- Introduced `SafekeeperRequestLabelGroup` with `safekeeper_id`.
- Updated the affected metrics to use the new label group.
- Fixed incorrect metric usage in safekeeper_client.rs
## Follow-up
- Review usage of these metrics in alerting rules and existing Grafana
dashboards to ensure this change does not break something.
## Problem
Pageservers now expose a gRPC API on a separate address and port. This
must be registered with the storage controller such that it can be
plumbed through to the compute via cplane.
Touches #11926.
## Summary of changes
This patch registers the gRPC address and port with the storage
controller:
* Add gRPC address to `nodes` database table and `NodePersistence`, with
a Diesel migration.
* Add gRPC address in `NodeMetadata`, `NodeRegisterRequest`,
`NodeDescribeResponse`, and `TenantLocateResponseShard`.
* Add gRPC address flags to `storcon_cli node-register`.
These changes are backwards-compatible, since all structs will ignore
unknown fields during deserialization.
## Problem
The gRPC base backup implementation has a few issues: chunks are not
properly bounded, and it's not possible to omit the LSN.
Touches #11728.
## Summary of changes
* Properly bound chunks by using a limited writer.
* Use an `Option<Lsn>` rather than a `ReadLsn` (the latter requires an
LSN).
## Problem
The `stably_attached` function is hard to read due to deeply nested
conditionals
## Summary of Changes
- Refactored `stably_attached` to use early returns and the `?` operator
for improved readability
## Problem
If the node intent includes more than one secondary, we can generate a
replace optimization using a candidate node that is already a secondary
location.
## Summary of changes
- Exclude all other secondary nodes from the scoring process to ensure
optimal candidate selection.
## Problem
We could easily miss a sanitizer-detected defect, if it occurred due to
some race condition, as we just rerun the test and if it succeeds, the
overall test run is considered successful. It was more reasonable
before, when we had much more unstable tests in main, but now we can
track all test failures.
## Summary of changes
Don't rerun failed tests.
## Problem
We don't test debug builds for v14..v16 in the regular "Build and Test"
runs to perform the testing faster, but it means we can't detect
assertion failures in those versions.
(See https://github.com/neondatabase/neon/issues/11891,
https://github.com/neondatabase/neon/issues/11997)
## Summary of changes
Add a new workflow to test all build types and all versions on all
architectures.
## Problem
In our environment, we don't always have access to the pagectl tool on
the pageserver. We have to download the page files to local env to
introspect them. Hence, it'll be useful to be able to parse the local
files using `pagectl`.
## Summary of changes
* Add `dump-layer-local` to `pagectl` that takes a local path as
argument and returns the layer content:
```
cargo run -p pagectl layer dump-layer-local ~/Desktop/000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00003E7A53EDE611-00003E7AF27BFD19-v1-00000001
```
* Bonus: Fix a bug in `pageserver/ctl/src/draw_timeline_dir.rs` in which
we don't filter out temporary files.
## Problem
The `page_api` domain types are missing a few derives.
## Summary of changes
Add `Clone`, `Copy`, and `Debug` derives for all types where
appropriate.
## Problem
Comment for `storage_controller_reconcile_long_running` metric was
copy-pasted and not updated in #9207
## Summary of changes
- Fixed comment
## Problem
Comment is in incorrect place: `/metrics` code is above its description
comment.
## Summary of changes
- `/metrics` code is now below the comment
## Problem
Need to fix naming `safkeeper` -> `safekeeper`
## Summary of changes
- `storage_controller_safkeeper_reconciles_*` renamed to
`storage_controller_safekeeper_reconciles_*`
## Problem
ChaosInjector is intended to skip non-active scheduling policies, but
the current logic skips active shards instead.
## Summary of changes
- Fixed shard eligibility condition to correctly allow chaos injection
for shards with an Active scheduling policy.
## Problem
We would easily hit this limit for a tenant running for enough long
time.
## Summary of changes
Remove the max key limit for time-travel recovery if the command is
running locally.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Part of #11813
PostHog has two endpoints to retrieve feature flags: the old project ID
one that uses personal API token, and the new one using a special
feature flag secure token that can only retrieve feature flag. The new
API I added in this patch is not documented in the PostHog API doc but
it's used in their Python SDK.
## Summary of changes
Add support for "feature flag secure token API". The API has no way of
providing a project ID so we verify if the retrieved spec is consistent
with the project ID specified by comparing the `team_id` field.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>