N.B: No-op for the neon-env.
## Problem
We added a per-timeline disk utilization protection circuit breaker,
which will stop the safekeeper from accepting more WAL writes if the
disk utilization by the timeline has exceeded a configured limit. We
mainly designed the mechanism as a guard against WAL upload/backup bugs,
and we assumed that as long as WAL uploads are proceeding as normal we
will not run into disk pressure. This turned out to be not true. In one
of our load tests where we have 500 PGs ingesting data at the same time,
safekeeper disk utilization started to creep up even though WAL uploads
were completely normal (we likely just maxed out our S3 upload bandwidth
from the single SK). This means the per-timeline disk utilization
protection won't be enough if too many timelines are ingesting data at
the same time.
## Summary of changes
Added a global disk utilization protection circuit breaker which will
stop a safekeeper from accepting more WAL writes if the total disk usage
on the safekeeper (across all tenants) exceeds a limit. We implemented
this circuit breaker through two parts:
1. A "global disk usage watcher" background task that runs at a
configured interval (default every minute) to see how much disk space is
being used in the safekeeper's filesystem. This background task also
performs the check against the limit and publishes the result to a
global atomic boolean flag.
2. The `hadron_check_disk_usage()` routine (in `timeline.rs`) now also
checks this global boolean flag published in the step above, and fails
the `WalAcceptor` (triggers the circuit breaker) if the flag was raised.
The disk usage limit is disabled by default.
It can be tuned with the `--max-global-disk-usage-ratio` CLI arg.
## How is this tested?
Added integration test
`test_wal_acceptor.py::test_global_disk_usage_limit`.
Also noticed that I haven't been using the `wait_until(f)` test function
correctly (the `f` passed in is supposed to raise an exception if the
condition is not met, instead of returning `False`...). Fixed it in both
circuit breaker tests.
---------
Co-authored-by: William Huang <william.huang@databricks.com>
We don't want to depend on postgres_ffi in an API crate. If there is no
such dependency, we can compile stuff like `storcon_cli` without needing
a full working postgres build. Fixes regression of #12548 (before we
could compile it).
# TLDR
This PR is a no-op.
## Problem
When a SK loses a disk, it must recover all WALs from the very
beginning. This may take days/weeks to catch up to the latest WALs for
all timelines it owns.
## Summary of changes
When SK starts up,
if it finds that it has 0 timelines,
- it will ask SC for the timeline it owns.
- Then, pulls the timeline from its peer safekeepers to restore the WAL
redundancy right away.
After pulling timeline is complete, it will become active and accepts
new WALs.
The current impl is a prototype. We can optimize the impl further, e.g.,
parallel pull timelines.
---------
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
# TLDR
All changes are no-op except some metrics.
## Summary of changes I
### Pageserver
Added a new global counter metric
`pageserver_pagestream_handler_results_total` that categorizes
pagestream request results according to their outcomes:
1. Success
2. Internal errors
3. Other errors
Internal errors include:
1. Page reconstruction error: This probably indicates a pageserver
bug/corruption
2. LSN timeout error: Could indicate overload or bugs with PS's ability
to reach other components
3. Misrouted request error: Indicates bugs in the Storage Controller/HCC
Other errors include transient errors that are expected during normal
operation or errors indicating bugs with other parts of the system
(e.g., malformed requests, errors due to cancelled operations during PS
shutdown, etc.)
## Summary of changes II
This PR adds a pageserver endpoint and its counterpart in storage
controller to list visible size of all tenant shards. This will be a
prerequisite of the tenant rebalance command.
## Problem III
We need a way to download WAL
segments/layerfiles from S3 and replay WAL records. We cannot access
production S3 from our laptops directly, and we also can't transfer any
user data out of production systems for GDPR compliance, so we need
solutions.
## Summary of changes III
This PR adds a couple of tools to support the debugging
workflow in production:
1. A new `pagectl download-remote-object` command that can be used to
download remote storage objects assuming the correct access is set up.
## Summary of changes IV
This PR adds a command to list all visible delta and image layers from
index_part. This is useful to debug compaction issues as index_part
often contain a lot of covered layers due to PITR.
---------
Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
## Problem
Safekeeper and pageserver metrics collection might time out. We've seen
this in both hadron and neon.
## Summary of changes
This PR moves metrics collection in PS/SK to the background so that we
will always get some metrics, despite there may be some delays. Will
leave it to the future work to reduce metrics collection time.
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
## Problem
We only trim the senders if we tried to send a message to them and
discovered that the channel is closed. This is problematic if the
pageserver keeps connecting while there's nothing to send back for the
shard. In this scenario we never trim down the senders list and can
panic due to the u8 limit.
## Summary of Changes
Trim down the dead senders before adding a new one.
Closes LKB-178
## Problem
Names are not consistent between safekeeper migration RFC and the actual
implementation.
It's not used anywhere in production yet, so it's safe to rename. We
don't need to worry about backward compatibility.
- Follow up on https://github.com/neondatabase/neon/pull/12432
## Summary of changes
- rename term -> last_log_term in TimelineMembershipSwitchResponse
- add missing fields to TimelineMembershipSwitchResponse in python
## TLDR
This PR is a no-op. The changes are disabled by default.
## Problem
I. Currently we don't have a way to detect disk I/O failures from WAL
operations.
II.
We observe that the offloader fails to upload a segment due to race
conditions on XLOG SWITCH and PG start streaming WALs. wal_backup task
continously failing to upload a full segment while the segment remains
partial on the disk.
The consequence is that commit_lsn for all SKs move forward but
backup_lsn stays the same. Then, all SKs run out of disk space.
III.
We have discovered SK bugs where the WAL offload owner cannot keep up
with WAL backup/upload to S3, which results in an unbounded accumulation
of WAL segment files on the Safekeeper's disk until the disk becomes
full. This is a somewhat dangerous operation that is hard to recover
from because the Safekeeper cannot write its control files when it is
out of disk space. There are actually 2 problems here:
1. A single problematic timeline can take over the entire disk for the
SK
2. Once out of disk, it's difficult to recover SK
IV.
Neon reports certain storage errors as "critical" errors using a marco,
which will increment a counter/metric that can be used to raise alerts.
However, this metric isn't sliced by tenant and/or timeline today. We
need the tenant/timeline dimension to better respond to incidents and
for blast radius analysis.
## Summary of changes
I.
The PR adds a `safekeeper_wal_disk_io_errors ` which is incremented when
SK fails to create or flush WALs.
II.
To mitigate this issue, we will re-elect a new offloader if the current
offloader is lagging behind too much.
Each SK makes the decision locally but they are aware of each other's
commit and backup lsns.
The new algorithm is
- determine_offloader will pick a SK. say SK-1.
- Each SK checks
-- if commit_lsn - back_lsn > threshold,
-- -- remove SK-1 from the candidate and call determine_offloader again.
SK-1 will step down and all SKs will elect the same leader again.
After the backup is caught up, the leader will become SK-1 again.
This also helps when SK-1 is slow to backup.
I'll set the reelect backup lag to 4 GB later. Setting to 128 MB in dev
to trigger the code more frequently.
III.
This change addresses problem no. 1 by having the Safekeeper perform a
timeline disk utilization check check when processing WAL proposal
messages from Postgres/compute. The Safekeeper now rejects the WAL
proposal message, effectively stops writing more WAL for the timeline to
disk, if the existing WAL files for the timeline on the SK disk exceeds
a certain size (the default threshold is 100GB). The disk utilization is
calculated based on a `last_removed_segno` variable tracked by the
background task removing WAL files, which produces an accurate and
conservative estimate (>= than actual disk usage) of the actual disk
usage.
IV.
* Add a new metric `hadron_critical_storage_event_count` that has the
`tenant_shard_id` and `timeline_id` as dimensions.
* Modified the `crtitical!` marco to include tenant_id and timeline_id
as additional arguments and adapted existing call sites to populate the
tenant shard and timeline ID fields. The `critical!` marco invocation
now increments the `hadron_critical_storage_event_count` with the extra
dimensions. (In SK there isn't the notion of a tenant-shard, so just the
tenant ID is recorded in lieu of tenant shard ID.)
I considered adding a separate marco to avoid merge conflicts, but I
think in this case (detecting critical errors) conflicts are probably
more desirable so that we can be aware whenever Neon adds another
`critical!` invocation in their code.
---------
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: William Huang <william.huang@databricks.com>
## Problem
Similarly to #12217, the following endpoints may result in a stripe size
mismatch between the storage controller and Pageserver if an unsharded
tenant has a different stripe size set than the default. This can lead
to data corruption if the tenant is later manually split without
specifying an explicit stripe size, since the storage controller and
Pageserver will apply different defaults. This commonly happens with
tenants that were created before the default stripe size was changed
from 32k to 2k.
* `PUT /v1/tenant/config`
* `PATCH /v1/tenant/config`
These endpoints are no longer in regular production use (they were used
when cplane still managed Pageserver directly), but can still be called
manually or by tests.
## Summary of changes
Retain the current shard parameters when updating the location config in
`PUT | PATCH /v1/tenant/config`.
Also opportunistically derive `Copy` for `ShardParameters`.
Make the safekeeper `pull_timeline` endpoint support timelines that
haven't had any writes yet. In the storcon managed sk timelines world,
if a safekeeper goes down temporarily, the storcon will schedule a
`pull_timeline` call. There is no guarantee however that by when the
safekeeper is online again, there have been writes to the timeline yet.
The `snapshot` endpoint gives an error if the timeline hasn't had
writes, so we avoid calling it if `timeline_start_lsn` indicates a
freshly created timeline.
Fixes#11422
Part of #11670
This makes it possible for the compiler to validate that a match block
matched all PostgreSQL versions we support.
## Problem
We did not have a complete picture about which places we had to test
against PG versions, and what format these versions were: The full PG
version ID format (Major/minor/bugfix `MMmmbb`) as transfered in
protocol messages, or only the Major release version (`MM`). This meant
type confusion was rampant.
With this change, it becomes easier to develop new version-dependent
features, by making type and niche confusion impossible.
## Summary of changes
Every use of `pg_version` is now typed as either `PgVersionId` (u32,
valued in decimal `MMmmbb`) or PgMajorVersion (an enum, with a value for
every major version we support, serialized and stored like a u32 with
the value of that major version)
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.
This is therefore a preparation PR (like similar PRs before it).
There is a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).
The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
We might delete timelines on safekeepers before we are deleting them on
pageservers. This should be an exceptional situation, but can occur. As
the first step to improve behaviour here, emit a special error that is
less scary/obscure than "was not found in global map".
It is for example emitted when the pageserver tries to run
`IDENTIFY_SYSTEM` on a timeline that has been deleted on the safekeeper.
Found when analyzing the failure of
`test_scrubber_physical_gc_timeline_deletion` when enabling
`--timelines-onto-safekeepers` on the pytests.
Due to safekeeper restarts, there is no hard guarantee that we will keep
issuing this error, so we need to think of something better if we start
encountering this in staging/prod. But I would say that the introduction
of `--timelines-onto-safekeepers` in the pytests and into staging won't
change much about this: we are already deleting timelines from there. In
`test_scrubber_physical_gc_timeline_deletion`, we'd just be leaking the
timeline before on the safekeepers.
Part of #11712
The `test_storcon_create_delete_sk_down` test is still flaky. This test
addresses two possible causes for flakiness. both causes are related to
deletion racing with `pull_timeline` which hasn't finished yet.
* the first cause is timeline deletion racing with `pull_timeline`:
* the first deletion attempt doesn't contain the line because the
timeline doesn't exist yet
* the subsequent deletion attempts don't contain it either, only a note
that the timeline is already deleted.
* so this patch adds the note that the timeline is already deleted to
the regex
* the second cause is about tenant deletion racing with `pull_timeline`:
* there were no tenant specific tombstones so if a tenant was deleted,
we only added tombstones for the specific timelines being deleted, not
for the tenant itself.
* This patch changes this, so we now have tenant specific tombstones as
well as timeline specific ones, and creation of a timeline checks both.
* we also don't see any retries of the tenant deletion in the logs. once
it's done it's done. so extend the regex to contain the tenant deletion
message as well.
One could wonder why the regex and why not using the API to check
whether the timeline is just "gone". The issue with the API is that it
doesn't allow one to distinguish between "deleted" and "has never
existed", and latter case might race with `pull_timeline`. I.e. the
second case flakiness helped in the discovery of a real bug (no tenant
tombstones), so the more precise check was helpful.
Before, I could easily reproduce 2-9 occurences of flakiness when
running the test with an additional `range(128)` parameter (i.e. 218
times 4 times). With this patch, I ran it three times, not a single
failure.
Fixes#11838
Greetings! Please add `w=1` to github url when viewing diff
(sepcifically `wal_backup.rs`)
## Problem
This PR is aimed at addressing the remaining work of #8200. Namely,
removing static usage of remote storage in favour of arc. I did not opt
to pass `Arc<RemoteStorage>` directly since it is actually
`Optional<RemoteStorage>` as it is not necessarily always configured. I
wanted to avoid having to pass `Arc<Optional<RemoteStorage>>` everywhere
with individual consuming functions likely needing to handle unwrapping.
Instead I've added a `WalBackup` struct that holds
`Optional<RemoteStorage>` and handles initialization/unwrapping
RemoteStorage internally. wal_backup functions now take self and
`Arc<WalBackup>` is passed as a dependency through the various consumers
that need it.
## Summary of changes
- Add `WalBackup` that holds `Optional<RemoteStorage>` and handles
initialization and unwrapping
- Modify wal_backup functions to take `WalBackup` as self (Add `w=1` to
github url when viewing diff here)
- Initialize `WalBackup` in safekeeper root
- Store `Arc<WalBackup>` in `GlobalTimelineMap` and pass and store in
each Timeline as loaded
- use `WalBackup` through Timeline as needed
## Refs
- task to remove global variables
https://github.com/neondatabase/neon/issues/8200
- drive-by fixes https://github.com/neondatabase/neon/issues/11501
by turning the panic reported there into an error `remote storage not
configured`
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
1. Safekeeper selection on the pageserver side isn't very dynamic. Once
you connect to one safekeeper, you'll use that one for as long as the
safekeeper keeps the connection alive. In principle, we could be more
eager, since the wal receiver connection can be cancelled but we don't
do that. We wait until the "session" is done and then we pick a new SK.
2. Picking a new SK is quite conservative. We will switch if:
a. We haven't received anything from the SK within the last 10 seconds
(wal_connect_timeout) or
b. The candidate SK is 1GiB ahead or
c. The candidate SK is in the same AZ as the PS or d. There's a
candidate that is ahead and we've not had any WAL within the last 10
seconds (lagging_wal_timeout)
Hence, we can end up with pageservers that are requesting WAL which
their safekeeper hasn't seen yet.
## Summary of changes
Downgrade warning log to info.
Make `pull_timeline` check tombstones by default. Otherwise, we'd be
recreating timelines if the order between creation and deletion got
mixed up, as seen in #11838.
Fixes#11838.
Second PR with fixes extracted from #11712, relating to
`--timelines-onto-safekeepers`. Does the following:
* Moves safekeeper registration to `neon_local` instead of the test
fixtures
* Pass safekeeper JWT token if `--timelines-onto-safekeepers` is enabled
* Allow some warnings related to offline safekeepers (similarly to how
we allow them for offline pageservers)
* Enable generations on the compute's config if
`--timelines-onto-safekeepers` is enabled
* fix parallel `pull_timeline` race condition (the one that #11786 put
for later)
Fixes#11424
Part of #11670
This patch contains some fixes of issues I ran into for #11712:
* make `pull_timeline` return success for timeline that already exists.
This follows general API design of storage components: API endpoints are
retryable and converge to a status code, instead of starting to error.
We change the `pull_timeline`'s return type a little bit, because we
might not actually have a source sk to pull from. Note that the fix is
not enough, there is still a race when two `pull_timeline` instances
happen in parallel: we might try to enter both pulled timelines at the
same time. That can be fixed later.
* make `pull_timeline` support one safekeeper being down. In general, if
one safekeeper is down, that's not a problem. the added comment explains
a potential situation (found in the `test_lagging_sk` test for example)
* don't log very long errors when computes try to connect to safekeepers
that don't have the timeline yet, if `allow_timeline_creation` is false.
That flag is enabled when a sk connection string with generation numbers
is passed to the compute, so we'll hit this code path more often. E.g.
when a safekeeper missed a timeline creation, but the compute connects
to it first before the `pull_timeline` gets requested by the storcon
reconciler: this is a perfectly normal situation. So don't log the whole
error backtrace, and don't log it on the error log level, but only on
info.
part of #11670
# Remove SAFEKEEPER_AUTH_TOKEN env var parsing from safekeeper
This PR is a follow-up to #11443 that removes the parsing of the
`SAFEKEEPER_AUTH_TOKEN` environment variable from the safekeeper
codebase while keeping the `auth_token_path` CLI flag functionality.
## Changes:
- Removed code that checks for the `SAFEKEEPER_AUTH_TOKEN` environment
variable
- Updated comments to reflect that only the `auth_token_path` CLI flag
is now used
As mentioned in PR #11443, the environment variable approach was planned
to be deprecated and removed in favor of the file-based approach, which
is more secure since environment variables can be quite public in both
procfs and unit files.
Link to Devin run:
https://app.devin.ai/sessions/d6f56cf1b4164ea9880a9a06358a58ac
Requested by: arpad@neon.tech
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: arpad@neon.tech <arpad@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
# Add --dev CLI flag to pageserver and safekeeper binaries
This PR adds the `--dev` CLI flag to both the pageserver and safekeeper
binaries without implementing any functionality yet. This is a precursor
to PR #11517, which will implement the full functionality to require
authentication by default unless the `--dev` flag is specified.
## Changes
- Add `dev_mode` config field to pageserver binary
- Add `--dev` CLI flag to safekeeper binary
This PR is needed for forward compatibility tests to work properly, when
we try to merge #11517
Link to Devin run:
https://app.devin.ai/sessions/ad8231b4e2be430398072b6fc4e85d46
Requested by: John Spray (john@neon.tech)
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: John Spray <john@neon.tech>
## Problem
Safekeeper doesn't use TLS in wal service
- Closes: https://github.com/neondatabase/cloud/issues/27302
## Summary of changes
- Add `enable_tls_wal_service_api` option to safekeeper's cmd arguments
- Propagate `tls_server_config` to `wal_service` if the option is
enabled
- Create `BACKGROUND_RUNTIME` for small background tasks and offload SSL
certificate reloader to it.
No integration tests for now because support from compute side is
required: https://github.com/neondatabase/cloud/issues/25823
## Problem
Pageservers and safakeepers do not pass CA certificates to broker
client, so the client do not trust locally issued certificates.
- Part of https://github.com/neondatabase/cloud/issues/27492
## Summary of changes
- Change `ssl_ca_certs` type in PS/SK's config to `Pem` which may be
converted to both `reqwest` and `tonic` certificates.
- Pass CA certificates to storage broker client in PS and SK
## Problem
We need to export some metrics about certs/connections to configure
alerts and make sure that all HTTP requests are gone before turning
https-only mode on.
- Closes: https://github.com/neondatabase/cloud/issues/25526
## Summary of changes
- Add started connection and connection error metrics to http/https
Server.
- Add certificate expiration time and reload metrics to
ReloadingCertificateResolver.
I like to run nightly clippy every so often to make our future rust
upgrades easier. Some notable changes:
* Prefer `next_back()` over `last()`. Generic iterators will implement
`last()` to run forward through the iterator until the end.
* Prefer `io::Error::other()`.
* Use implicit returns
One case where I haven't dealt with the issues is the now
[more-sensitive "large enum variant"
lint](https://github.com/rust-lang/rust-clippy/pull/13833). I chose not
to take any decisions around it here, and simply marked them as allow
for now.
Before we specified the JWT via `SAFEKEEPER_AUTH_TOKEN`, but env vars
are quite public, both in procfs as well as the unit files. So add a way
to put the auth token into a file directly.
context: https://neondb.slack.com/archives/C033RQ5SPDH/p1743692566311099
## Problem
There are some places in the code where we create `reqwest::Client`
without providing SSL CA certs from `ssl_ca_file`. These will break
after we enable TLS everywhere.
- Part of https://github.com/neondatabase/cloud/issues/22686
## Summary of changes
- Support `ssl_ca_file` in storage scrubber.
- Add `use_https_safekeeper_api` option to safekeeper to use https for
peer requests.
- Propagate SSL CA certs to storage_controller/client, storcon's
ComputeHook, PeerClient and maybe_forward.
Adds a test `test_storcon_create_delete_sk_down` which tests the
reconciler and pending op persistence if faced with a temporary
safekeeper downtime during timeline creation or deletion. This is in
contrast to `test_explicit_timeline_creation_storcon`, which tests the
happy path.
We also do some fixes:
* timeline and tenant deletion http requests didn't expect a body, but
`()` sent one.
* we got the tenant deletion http request's return type wrong: it's
supposed to be a hash map
* we add some logging to improve observability
* We fix `list_pending_ops` which had broken code meant to make it
possible to restrict oneself to a single pageserver. But diesel doesn't
support that sadly, or at least I couldn't figure out a way to make it
work. We don't need that functionality, so remove it.
* We add an info span to the heartbeater futures with the node id, so
that there is no context-free msgs like "Backoff: waiting 1.1 seconds
before processing with the task" in the storcon logs. we could also add
the full base url of the node but don't do it as most other log lines
contain that information already, and if we do duplication it should at
least not be verbose. One can always find out the base url from the node
id.
Successor of #11261
Part of #9011
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.
[Announcement blog
post](https://blog.rust-lang.org/2025/04/03/Rust-1.86.0.html).
Prior update was in #10914.
## Problem
- We need to support multiple SSL CA certificates for graceful root CA
certificate rotation.
- Closes: https://github.com/neondatabase/cloud/issues/25971
## Summary of changes
- Parses `ssl_ca_file` as a pem bundle, which may contain multiple
certificates. Single pem cert is a valid pem bundle, so the change is
backward compatible.
## Problem
SSL certs are loaded only during start up. It doesn't allow the rotation
of short-lived certificates without server restart.
- Closes: https://github.com/neondatabase/cloud/issues/25525
## Summary of changes
- Implement `ReloadingCertificateResolver` which reloads certificates
from disk periodically.
Adds API definitions for the safekeeper API endpoints `exclude_timeline`
and `term_bump`. Also does a bugfix to return the correct type from
`delete_timeline`.
Part of #8614
## Problem
Partial reads are still problematic. They are stored in the buffer of
the wal decoder and result in gaps being reported too eagerly on the
pageserver side.
## Summary of changes
Previously, we always used the start LSN of the chunk of WAL that was
just read. This patch switches to using the end LSN of the last record
that was decoded in the previous iteration.
## Problem
Storage controller uses http for requests to safekeeper management API.
Closes: https://github.com/neondatabase/cloud/issues/24835
## Summary of changes
- Add `use_https_safekeeper_api` option to storcon to use https api
- Use https for requests to safekeeper management API if this option is
enabled
- Add `ssl_ca_file` option to storcon for ability to specify custom root
CA certificate
## Problem
Shard zero needs to track the start LSN of the latest record
in adition to the LSN of the next record to ingest. The former
is included in basebackup persisted by the compute in WAL.
Previously, empty records were skipped for all shards. This caused
the prev LSN tracking on the PS to fall behind and led to logical
replication
issues.
## Summary of changes
Shard zero now receives emtpy interpreted records for LSN tracking
purposes.
A test is included too.
The interpreted reader tracks a record aligned current position in the
WAL stream.
Partial reads move the stream internally, but not from the pov of the
interpreted WAL reader.
Hence, where new shards subscribe with a start position that matches the
reader's current position,
but we've also done some partial reads. This confuses the gap tracking.
To make it more robust,
update the current batch start to the min between the new start position
and its current value.
Since no record has been decoded yet (position matches), we can't have
lost it
## Problem
If a caller times out on safekeeper timeline deletion on a large
timeline, and waits a while before retrying, the deletion will not
progress while the retry is waiting. The net effect is very very slow
deletion as it only proceeds in 30 second bursts across 5 minute idle
periods.
Related: https://github.com/neondatabase/neon/issues/10265
## Summary of changes
- Run remote deletion in a background task
- Carry a watch::Receiver on the Timeline for other callers to join the
wait
- Restart deletion if the API is called again and the previous attempt
failed
## Problem
This `critical!` could fire on IO errors, which is just noisy.
Resolves#11027.
## Summary of changes
Downgrade to error, except for decode errors. These could be either data
corruption or a bug, but seem worth investigating either way.
## Problem
safekeepers must ignore walproposer messages with non matching
membership conf.
## Summary of changes
Make safekeepers reject vote request, proposer elected and append
request messages with non matching generation. Switch to the
configuration in the greeting message if it is higher.
In passing, fix one comment and WAL truncation.
Last part of https://github.com/neondatabase/neon/issues/9965
## Problem
https://github.com/neondatabase/neon/pull/10241 added configuration
switch endpoint, but it didn't delete timeline if node was excluded.
## Summary of changes
Add separate /exclude API endpoint which similarly accepts membership
configuration where sk is supposed by be excluded. Implementation
deletes the timeline locally.
Some more small related tweaks:
- make mconf switch API PUT instead of POST as it is idempotent;
- return 409 if switch was refused instead of 200 with requested &
current;
- remove unused was_active flag from delete response;
- remove meaningless _force suffix from delete functions names;
- reuse timeline.rs delete_dir function in timelines_global_map instead
of its own copy.
part of https://github.com/neondatabase/neon/issues/9965
## Problem
json_ctrl.rs is an obsolete attempt to have tests with fine control of
feeding messages into safekeeper superseded by desim framework.
## Summary of changes
Drop it.
Updates storage components to edition 2024. We like to stay on the
latest edition if possible. There is no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.
The PR has two commits:
* the first commit updates storage crates to edition 2024 and appeases
`cargo clippy` by changing code. i have accidentially ran the formatter
on some files that had other edits.
* the second commit performs a `cargo fmt`
I would recommend a closer review of the first commit and a less close
review of the second one (as it just runs `cargo fmt`).
part of https://github.com/neondatabase/neon/issues/10918
## Problem
As part of https://github.com/neondatabase/neon/issues/8614 we need to
pass membership configurations between compute and safekeepers.
## Summary of changes
Add version 3 of the protocol carrying membership configurations.
Greeting message in both sides gets full conf, and other messages
generation number only. Use protocol bump to include other accumulated
changes:
- stop packing whole structs on the wire as is;
- make the tag u8 instead of u64;
- send all ints in network order;
- drop proposer_uuid, we can pass it in START_WAL_PUSH and it wasn't
much useful anyway.
Per message changes, apart from mconf:
- ProposerGreeting: tenant / timeline id is sent now as hex cstring.
Remove proto version, it is passed outside in START_WAL_PUSH. Remove
postgres timeline, it is unused. Reorder fields a bit.
- AcceptorGreeting: reorder fields
- VoteResponse: timeline_start_lsn is removed. It can be taken from
first member of term history, and later we won't need it at all when all
timelines will be explicitly created. Vote itself is u8 instead of u64.
- ProposerElected: timeline_start_lsn is removed for the same reasons.
- AppendRequest: epoch_start_lsn removed, it is known from term history
in ProposerElected.
Both compute and sk are able to talk v2 and v3 to make rollbacks (in
case we need them) easier; neon.safekeeper_proto_version GUC sets the
client version. v2 code can be dropped later.
So far empty conf is passed everywhere, future PRs will handle them.
To test, add param to some tests choosing proto version; we want to test
both 2 and 3 until we fully migrate.
ref https://github.com/neondatabase/neon/issues/10326
---------
Co-authored-by: Arthur Petukhovsky <petuhovskiy@yandex.ru>
## Problem
The interpreted WAL reader tracks the start of the current logical
batch.
This needs to be invalidated when the reader is reset.
This bug caused a couple of WAL gap alerts in staging.
## Summary of changes
* Refactor to make it possible to write a reproducer
* Add repro unit test
* Fix by resetting the start with the reader
Related https://github.com/neondatabase/cloud/issues/23935
The synchronous 'postgres' crate is just a wrapper around the async
'tokio_postgres' crate. Some places were unnecessarily using the
re-exported NoTls and Error from the synchronous 'postgres' crate, even
though they were otherwise using the 'tokio_postgres' crate. Tidy up by
using the tokio_postgres types directly.