## Problem
Names are not consistent between safekeeper migration RFC and the actual
implementation.
It's not used anywhere in production yet, so it's safe to rename. We
don't need to worry about backward compatibility.
- Follow up on https://github.com/neondatabase/neon/pull/12432
## Summary of changes
- rename term -> last_log_term in TimelineMembershipSwitchResponse
- add missing fields to TimelineMembershipSwitchResponse in python
- Add ComputeSpec flag `offload_lfc_interval_seconds` controlling
whether LFC should be offloaded to endpoint storage. Default value
(None) means "don't offload".
- Add glue code around it for `neon_local` and integration tests.
- Add `autoprewarm` mode for `test_lfc_prewarm` testing
`offload_lfc_interval_seconds` and `autoprewarm` flags in conjunction.
- Rename `compute_ctl_lfc_prewarm_requests_total` and
`compute_ctl_lfc_offload_requests_total` to
`compute_ctl_lfc_prewarms_total`
and `compute_ctl_lfc_offloads_total` to reflect we count prewarms and
offloads, not `compute_ctl` requests of those.
Don't count request in metrics if there is a prewarm/offload already
ongoing.
https://github.com/neondatabase/cloud/issues/19011
Resolves: https://github.com/neondatabase/cloud/issues/30770
## Problem
The current deletion operation is synchronous and blocking, which is
unsuitable for potentially long-running tasks like. In such cases, the
standard HTTP request-response pattern is not a good fit.
## Summary of Changes
- Added new `storcon_cli` commands: `NodeStartDelete` and
`NodeCancelDelete` to initiate and cancel deletion asynchronously.
- Added corresponding `storcon` HTTP handlers to support the new
start/cancel deletion flow.
- Introduced a new type of background operation: `Delete`, to track and
manage the deletion process outside the request lifecycle.
---------
Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
## Problem
I suspect that the pageservers get stuck on receiving broker updates.
## Summary of changes
This is a an opportunistic (staging only) patch that resets the
susbscription
stream if it's been idle for a while. This won't go to prod in this
form.
I'll revert or update it before Friday.
## Problem
Ancestor detach from a previously detached parent when there were no
writes panics since it tries to upload the tombstone layer twice.
## Summary of Changes
If we're gonna copy the tombstone from the ancestor, don't bother
creating it.
Fixes https://github.com/neondatabase/neon/issues/12458
When deploying new safekeepers, we don't immediately want to send
traffic to them. Maybe they are not ready yet by the time the deploy
script is registering them with the storage controller.
For pageservers, the storcon solves the problem by not scheduling stuff
to them unless there has been a positive heartbeat response. We can't do
the same for safekeepers though, otherwise a single down safekeeper
would mean we can't create new timelines in smaller regions where there
is only three safekeepers in total.
So far we have created safekeepers as `pause` but this adds a manual
step to safekeeper deployment which is prone to oversight. We want
things to be automatted. So we introduce a new state `activating` that
acts just like `pause`, except that we automatically transition the
policy to `active` once we get a positive heartbeat from the safekeeper.
For `pause`, we always keep the safekeeper paused.
## Problem
Currently, if `storcon` (storage controller) reconciliations repeatedly
fail, the system will indefinitely freeze optimizations. This can result
in optimization starvation for several days until the reconciliation
issues are manually resolved. To mitigate this, we should detect
persistently failing reconciliations and exclude them from influencing
the optimization decision.
## Summary of Changes
- A tenant shard reconciliation is now considered "keep-failing" if it
fails 5 consecutive times. These failures are excluded from the
optimization readiness check.
- Added a new metric: `storage_controller_keep_failing_reconciles` to
monitor such cases.
- Added a warning log message when a reconciliation is marked as
"keep-failing".
---------
Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
## TLDR
This PR is a no-op. The changes are disabled by default.
## Problem
I. Currently we don't have a way to detect disk I/O failures from WAL
operations.
II.
We observe that the offloader fails to upload a segment due to race
conditions on XLOG SWITCH and PG start streaming WALs. wal_backup task
continously failing to upload a full segment while the segment remains
partial on the disk.
The consequence is that commit_lsn for all SKs move forward but
backup_lsn stays the same. Then, all SKs run out of disk space.
III.
We have discovered SK bugs where the WAL offload owner cannot keep up
with WAL backup/upload to S3, which results in an unbounded accumulation
of WAL segment files on the Safekeeper's disk until the disk becomes
full. This is a somewhat dangerous operation that is hard to recover
from because the Safekeeper cannot write its control files when it is
out of disk space. There are actually 2 problems here:
1. A single problematic timeline can take over the entire disk for the
SK
2. Once out of disk, it's difficult to recover SK
IV.
Neon reports certain storage errors as "critical" errors using a marco,
which will increment a counter/metric that can be used to raise alerts.
However, this metric isn't sliced by tenant and/or timeline today. We
need the tenant/timeline dimension to better respond to incidents and
for blast radius analysis.
## Summary of changes
I.
The PR adds a `safekeeper_wal_disk_io_errors ` which is incremented when
SK fails to create or flush WALs.
II.
To mitigate this issue, we will re-elect a new offloader if the current
offloader is lagging behind too much.
Each SK makes the decision locally but they are aware of each other's
commit and backup lsns.
The new algorithm is
- determine_offloader will pick a SK. say SK-1.
- Each SK checks
-- if commit_lsn - back_lsn > threshold,
-- -- remove SK-1 from the candidate and call determine_offloader again.
SK-1 will step down and all SKs will elect the same leader again.
After the backup is caught up, the leader will become SK-1 again.
This also helps when SK-1 is slow to backup.
I'll set the reelect backup lag to 4 GB later. Setting to 128 MB in dev
to trigger the code more frequently.
III.
This change addresses problem no. 1 by having the Safekeeper perform a
timeline disk utilization check check when processing WAL proposal
messages from Postgres/compute. The Safekeeper now rejects the WAL
proposal message, effectively stops writing more WAL for the timeline to
disk, if the existing WAL files for the timeline on the SK disk exceeds
a certain size (the default threshold is 100GB). The disk utilization is
calculated based on a `last_removed_segno` variable tracked by the
background task removing WAL files, which produces an accurate and
conservative estimate (>= than actual disk usage) of the actual disk
usage.
IV.
* Add a new metric `hadron_critical_storage_event_count` that has the
`tenant_shard_id` and `timeline_id` as dimensions.
* Modified the `crtitical!` marco to include tenant_id and timeline_id
as additional arguments and adapted existing call sites to populate the
tenant shard and timeline ID fields. The `critical!` marco invocation
now increments the `hadron_critical_storage_event_count` with the extra
dimensions. (In SK there isn't the notion of a tenant-shard, so just the
tenant ID is recorded in lieu of tenant shard ID.)
I considered adding a separate marco to avoid merge conflicts, but I
think in this case (detecting critical errors) conflicts are probably
more desirable so that we can be aware whenever Neon adds another
`critical!` invocation in their code.
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: William Huang <william.huang@databricks.com>
## Problem
The billing team wants to change the billing events pipeline and use a
common events format in S3 buckets across different event producers.
## Summary of changes
Change the events storage format for billing events from JSON to NDJSON.
Also partition files by hours, rather than days.
Resolves: https://github.com/neondatabase/cloud/issues/29995
## Problem
https://github.com/neondatabase/neon/pull/11712 changed how computes are
started in the test: the lsn is specified, making them read-only static
replicas. Lsn is `last_record_lsn` from pageserver. It works fine with
read-only branches (because their `last_record_lsn` is equal to
`start_lsn` and always valid). But with writable timelines, the
`last_record_lsn` on the pageserver might be stale.
Particularly in this test, after the `detach_branch` operation, the
tenant is reset on the pagesever. It leads to `last_record_lsn` going
back to `disk_consistent_lsn`, so basically rolling back some recent
writes.
If we start a primary compute, it will start at safekeepers' commit Lsn,
which is the correct one , and will wait till pageserver catches up with
this Lsn after reset.
- Closes: https://github.com/neondatabase/neon/issues/12365
## Summary of changes
- Start `primary` compute for writable timelines.
## Problem
We persist safekeeper host/port in the storcon DB after
https://github.com/neondatabase/neon/pull/11712, so the storcon fails to
ping safekeepers in the compatibility tests, where we start the cluster
from the snapshot.
PR also adds some small code improvements related to the test failure.
- Closes: https://github.com/neondatabase/neon/issues/12339
## Summary of changes
- Update safekeeper ports in the storcon DB when starting the neon from
the dir (snapshot)
- Fail the response on all not-success codes (e.g. 3xx). Should not
happen, but just to be more safe.
- Add `neon_previous/` to .gitignore to make it easier to run compat
tests.
- Add missing EXPORT to the instruction for running compat tests
## Problem
We don't notify cplane about safekeeper membership change yet. Without
the notification the compute needs to know all the safekeepers on the
cluster to be able to speak to them. Change notifications will allow to
avoid it.
- Closes: https://github.com/neondatabase/neon/issues/12188
## Summary of changes
- Implement `notify_safekeepers` method in `ComputeHook`
- Notify cplane about safekeepers in `safekeeper_migrate` handler.
- Update the test to make sure notifications work.
## Out of scope
- There is `cplane_notified_generation` field in `timelines` table in
strocon's database. It's not needed now, so it's not updated in the PR.
Probably we can remove it.
- e2e tests to make sure it works with a production cplane
## Problem
Some of the design decisions in PR #12256 were influenced by the
requirements of consistency tests. These decisions introduced
intermediate logic that is no longer needed and should be cleaned up.
## Summary of Changes
- Remove the `feature("testing")` flag related to
`kick_secondary_download`.
- Set the default value of `kick_secondary_download` back to false,
reflecting the intended production behavior.
Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>
We don't have cancellation support for timeline deletions. In other
words, timeline deletion might still go on in an older generation while
we are attaching it in a newer generation already, because the
cancellation simply hasn't reached the deletion code.
This has caused us to hit a situation with offloaded timelines in which
the timeline was in an unrecoverable state: always returning an accepted
response, but never a 404 like it should be.
The detailed description can be found in
[here](https://github.com/neondatabase/cloud/issues/30406#issuecomment-3008667859)
(private repo link).
TLDR:
1. we ask to delete timeline on old pageserver/generation, starts
process in background
2. the storcon migrates the tenant to a different pageserver.
- during attach, the pageserver still finds an index part, so it adds it
to `offloaded_timelines`
4. the timeline deletion finishes, removing the index part in S3
5. there is a retry of the timeline deletion endpoint, sent to the new
pageserver location. it is bound to fail however:
- as the index part is gone, we print `Timeline already deleted in
remote storage`.
- the problem is that we then return an accepted response code, and not
a 404.
- this confuses the code calling us. it thinks the timeline is not
deleted, so keeps retrying.
- this state never gets recovered from until a reset/detach, because of
the `offloaded_timelines` entry staying there.
This is where this PR fixes things: if no index part can be found, we
can safely assume that the timeline is gone in S3 (it's the last thing
to be deleted), so we can remove it from `offloaded_timelines` and
trigger a reupload of the manifest. Subsequent retries will pick that
up.
Why not improve the cancellation support? It is a more disruptive code
change, that might have its own risks. So we don't do it for now.
Fixes https://github.com/neondatabase/cloud/issues/30406
Make the safekeeper `pull_timeline` endpoint support timelines that
haven't had any writes yet. In the storcon managed sk timelines world,
if a safekeeper goes down temporarily, the storcon will schedule a
`pull_timeline` call. There is no guarantee however that by when the
safekeeper is online again, there have been writes to the timeline yet.
The `snapshot` endpoint gives an error if the timeline hasn't had
writes, so we avoid calling it if `timeline_start_lsn` indicates a
freshly created timeline.
Fixes#11422
Part of #11670
In `test_layer_download_cancelled_by_config_location`, we simulate hung
downloads via the `before-downloading-layer-stream-pausable` failpoint.
Then, we cancel a timeline via the `location_config` endpoint.
With the new default as of
https://github.com/neondatabase/neon/pull/11712, we would be creating
the timeline on safekeepers regardless if there have been writes or not,
and it turns out the test relied on the timeline not existing on
safekeepers, due to a cancellation bug:
* as established before, the test makes the read path hang
* the timeline cancellation function first cancels the walreceiver, and
only then cancels the timeline's token
* `WalIngest::new` is requesting a checkpoint, which hits the read path
* at cancellation time, we'd be hanging inside the read, not seeing the
cancellation of the walreceiver
* the test would time out due to the hang
This is probably also reproducible in the wild when there is S3
unavailabilies or bottlenecks. So we thought that it's worthwhile to fix
the hang issue. The approach chosen in the end involves the
`tokio::select` macro.
In PR 11712, we originally punted on the test due to the hang and opted
it out from the new default, but now we can use the new default.
Part of https://github.com/neondatabase/neon/issues/12299
pgaudit can spam logs due to all the monitoring that we do. Logs from
these connections are not necessary for HIPPA compliance, so we can stop
logging from those connections.
Part-of: https://github.com/neondatabase/cloud/issues/29574
Signed-off-by: Tristan Partin <tristan@neon.tech>
This makes it possible for the compiler to validate that a match block
matched all PostgreSQL versions we support.
## Problem
We did not have a complete picture about which places we had to test
against PG versions, and what format these versions were: The full PG
version ID format (Major/minor/bugfix `MMmmbb`) as transfered in
protocol messages, or only the Major release version (`MM`). This meant
type confusion was rampant.
With this change, it becomes easier to develop new version-dependent
features, by making type and niche confusion impossible.
## Summary of changes
Every use of `pg_version` is now typed as either `PgVersionId` (u32,
valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for
every major version we support, serialized and stored like a u32 with
the value of that major version)
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
`Attached(0)` tenant migrations can get stuck if the heatmap file has
not been uploaded.
## Summary of Changes
- Added a test to reproduce the issue.
- Introduced a `kick_secondary_downloads` config flag:
- Enabled in testing environments.
- Disabled in production (and in the new test).
- Updated `Attached(0)` locations to consider the number of secondaries
in their intent when deciding whether to download the heatmap.
## Problem
Part of #11813
## Summary of changes
Add a test API to make it easier to manipulate the feature flags within
tests.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
The `--gzip-probability` parameter was removed in #12250. However,
`test_basebackup_with_high_slru_count` still uses it, and keeps failing.
This patch removes the use of the parameter (gzip is enabled by
default).
Enable it across tests and set it as default. Marks the first milestone
of https://github.com/neondatabase/neon/issues/9114. We already enabled
it in all AWS regions and planning to enable it in all Azure regions
next week.
will merge after we roll out in all regions.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`test_runner` integration tests should support gRPC.
Touches #11926.
## Summary of changes
* Enable gRPC for Pageservers, with dynamic port allocations.
* Add a `grpc` parameter for endpoint creation, plumbed through to
`neon_local endpoint create`.
No tests actually use gRPC yet, since computes don't support it yet.
## Problem
`test_sharding_failures` is flaky due to interference from the
`background_reconcile` process.
The details are in the issue
https://github.com/neondatabase/neon/issues/12029.
## Summary of changes
- Use `reconcile_until_idle` to ensure a stable state before running
test assertions
- Added error tolerance in `reconcile_until_idle` test function (Failure
cases: 1, 3, 19, 20)
- Ignore the `Keeping extra secondaries` warning message since it i
retryable (Failure case: 2)
- Deduplicated code in `assert_rolled_back` and `assert_split_done`
- Added a log message before printing plenty of Node `X` seen on
pageserver `Y`
## Problem
- We run the large tenant oltp workload with a fixed size (larger than
existing customers' workloads).
Our customer's workloads are continuously growing and our testing should
stay ahead of the customers' production workloads.
- we want to touch all tables in the tenant's database (updates) so that
we simulate a continuous change in layer files like in a real production
workload
- our current oltp benchmark uses a mixture of read and write
transactions, however we also want a separate test run with read-only
transactions only
## Summary of changes
- modify the existing workload to have a separate run with pgbench
custom scripts that are read-only
- create a new workload that
- grows all large tables in each run (for the reuse branch in the large
oltp tenant's project)
- updates a percentage of rows in all large tables in each run (to
enforce table bloat and auto-vacuum runs and layer rebuild in
pageservers
Each run of the new workflow increases the logical database size about
16 GB.
We start with 6 runs per day which will give us about 96-100 GB growth
per day.
---------
Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>
- Add optional `?mode=fast|immediate` to `/terminate`, `fast` is
default. Immediate avoids waiting 30
seconds before returning from `terminate`.
- Add `TerminateMode` to `ComputeStatus::TerminationPending`
- Use `/terminate?mode=immediate` in `neon_local` instead of `pg_ctl
stop` for `test_replica_promotes`.
- Change `test_replica_promotes` to check returned LSN
- Annotate `finish_sync_safekeepers` as `noreturn`.
https://github.com/neondatabase/cloud/issues/29807
This check API only cheks the safekeeper_connstrings at the moment, and
the validation is limited to checking we have at least one entry in
there, and no duplicates.
## Problem
If the compute_ctl service is started with an empty list of safekeepers,
then hard-to-debug errors may happen at runtime, where it would be much
easier to catch them early.
## Summary of changes
Add an entry point in the compute_ctl API to validate the configuration
for safekeeper_connstrings.
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
https://github.com/neondatabase/neon/issues/7570
Even triggers are supported only for superusers.
## Summary of changes
Temporary switch to superuser when even trigger is created and disable
execution of user's even triggers under superuser.
---------
Co-authored-by: Dimitri Fontaine <dim@tapoueh.org>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Compatibility tests may be run against a compatibility snapshot
generated with --timelines-onto-safekeepers=false. We need to start the
compute without a generation (or with 0 generation) if the timeline is
not storcon-managed, otherwise the compute will hang.
- Follow up on https://github.com/neondatabase/neon/pull/12203
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Handle compatibility snapshot generated with no
`--timelines-onot-safekeepers` properly
## Problem
The cache was introduced as a hackathon project and the only supported
limit was the number of entries.
The basebackup entry size may vary. We need to have more control over
disk space usage to ship it to production.
- Part of https://github.com/neondatabase/cloud/issues/29353
## Summary of changes
- Store the size of entries in the cache and use it to limit
`max_total_size_bytes`
- Add the size of the cache in bytes to metrics.
## Problem
We don't test debug builds for v14..v16 in the regular "Build and Test"
runs to perform the testing faster, but it means we can't detect
assertion failures in those versions.
(See https://github.com/neondatabase/neon/issues/11891,
https://github.com/neondatabase/neon/issues/11997)
## Summary of changes
Add a new workflow to test all build types and all versions on all
architectures.
## Problem
`test_historic_storage_formats` uses `/tenant_import` to import historic
data. Tenant import does not create timelines onto safekeepers, because
they might already exist on some safekeeper set. If it does, then we may
end up with two different quorums accepting WAL for the same timeline.
If the tenant import is used in a real deployment, the administrator is
responsible for looking for the proper safekeeper set and migrate
timelines into storcon-managed timelines.
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Create timelines onto safekeepers manually after tenant import in
`test_historic_storage_formats`
- Add a note to tenant import that timelines will be not storcon-managed
after the import.
## Problem
The test creates an endpoint and deletes its tenant. The compute cannot
stop gracefully because it tries to write a checkpoint shutdown record
into the WAL, but the timeline had been already deleted from
safekeepers.
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
Stop the compute before deleting a tenant
## Problem
Compatibility tests may be run against a compatibility snapshot
generated with `--timelines-onto-safekeepers=false`. We need to start
the compute without a generation (or with 0 generation) if the timeline
is not storcon-managed, otherwise the compute will hang.
This handler is needed to check if the timeline is storcon-managed.
It's also needed for better test coverage of safekeeper migration code.
- Relates to https://github.com/neondatabase/neon/pull/11712
## Summary of changes
- Implement `tenant_timeline_locate` handler in storcon to get
safekeeper info from storcon's DB
## Problem
The project limits were not respected, resulting in errors.
## Summary of changes
Now limits are checked before running an action, and if the action is
not possible to run, another random action will be run.
---------
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
## Problem
Looks like our sql-over-http tests get to rely on "trust"
authentication, so the path that made sure the authkeys data was set was
never being hit.
## Summary of changes
Slight refactor to WakeComputeBackends, as well as making sure auth keys
are propagated. Fix tests to ensure passwords are tested.