After https://github.com/neondatabase/neon/pull/8022 was deployed to
staging, I noticed many cases of timeouts. After inspecting the logs, I
realized that some operations are taking ~20 seconds and they're doing
while holding shared state lock. Usually it happens right after
redeploy, because compute reconnections put high load on disks. This
commit tries to improve observability around slow operations.
Non-observability changes:
- `TimelineState::finish_change` now skips update if nothing has changed
- `wal_residence_guard()` timeout is set to 30s
While investigating Pageserver logs from the cases where systemd hangs
during shutdown (https://github.com/neondatabase/cloud/issues/11387), I
noticed that even if Pageserver shuts down cleanly[^1], there are
lingering walredo processes.
[^1]: Meaning, pageserver finishes its shutdown procedure and calls
`exit(0)` on its own terms, instead of hitting the systemd unit's
`TimeoutSec=` limit and getting SIGKILLed.
While systemd should never lock up like it does, maybe we can avoid
hitting that bug by cleaning up properly.
Changes
-------
This PR adds a shutdown method to `WalRedoManager` and hooks it up to
tenant shutdown.
We keep track of intent to shutdown through the new `enum
ProcessOnceCell` stored inside the pre-existing `redo_process` field.
A gate is added to keep track of running processes, using the new type
`struct Process`.
Future Work
-----------
Requests that don't need the redo process will not observe the shutdown
(see doc comment).
Doing so would be nice for completeness sake, but doesn't provide much
benefit because `Tenant` and `Timeline` already shut down all walredo
users.
Testing
-------
I did manual testing to confirm that the problem exists before this PR
and that it's gone after.
Setup:
* `neon_local` with a single tenant, create some data using `pgbench`
* ensure walredo process is running, not pid
* watch `strace -e kill,wait4 -f -p "$(pgrep pageserver)"`
* `neon_local pageserver stop`
With this PR, we always observe
```
$ strace -e kill,wait4 -f -p "$(pgrep pageserver)"
...
[pid 591120] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=591215, si_uid=1000} ---
[pid 591134] kill(591174, SIGKILL) = 0
[pid 591134] wait4(591174, <unfinished ...>
[pid 591142] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=591174, si_uid=1000, si_status=SIGKILL, si_utime=0, si_stime=0} ---
[pid 591134] <... wait4 resumed>[{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], 0, NULL) = 591174
...
+++ exited with 0 +++
```
Before this PR, we'd usually observe just
```
...
[pid 596239] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=596455, si_uid=1000} ---
...
+++ exited with 0 +++
```
Refs
----
refs https://github.com/neondatabase/cloud/issues/11387
## Problem
Background node operations take a long time for loaded nodes.
## Summary of changes
Increase number of concurrent reconciles an operation is allowed to
spawn.
This should make drain and fill operations faster and the new value is
still well below the total limit of concurrent reconciles.
## Problem
The cache keys that we use on CI are the same for X64 and ARM64
(`runner.arch`)
## Summary of changes
- Include `runner.arch` along with `runner.os` into cache keys
- Add --safekeepers option to neon_local reconfigure
- Add it to python Endpoint reconfigure
- Implement config reload in walproposer by restarting the whole bgw when
safekeeper list changes.
ref https://github.com/neondatabase/neon/issues/6341
## Problem
The fill requirement was not taken into account when looking through the
shards of a given node to fill from.
## Summary of Changes
Ensure that we do not fill a node past the recommendation from
`Scheduler::compute_fill_requirement`.
Also, modify the "neon_local timeline import" command so that it
doesn't create the endpoint any more. I don't see any reason to bundle
that in the same command, the "timeline create" and "timeline branch"
commands don't do that either.
I plan to add more tests similar to 'test_import_at_2bil', this will
help to reduce the copy-pasting.
## Problem
In https://github.com/neondatabase/neon/pull/5299, the new config-v1
tenant config file was added to hold the LocationConf type. We left the
old config file in place for forward compat, and because running without
generations (therefore without LocationConf) as still useful before the
storage controller was ready for prime-time.
Closes: https://github.com/neondatabase/neon/issues/5388
## Summary of changes
- Remove code for reading and writing the legacy config file
- Remove Generation::Broken: it was unused.
- Treat missing config file on disk as an error loading a tenant, rather
than defaulting it. We can now remove LocationConf::default, and thereby
guarantee that we never construct a tenant with a None generation.
- Update some comments + add some assertions to clarify that
Generation::None is only used in layer metadata, not in the state of a
running tenant.
- Update docker compose test to create tenants with a generation
Fixes https://github.com/neondatabase/neon/issues/6337
Add safekeeper support to switch between `Present` and
`Offloaded(flush_lsn)` states. The offloading is disabled by default,
but can be controlled using new cmdline arguments:
```
--enable-offload
Enable automatic switching to offloaded state
--delete-offloaded-wal
Delete local WAL files after offloading. When disabled, they will be left on disk
--control-file-save-interval <CONTROL_FILE_SAVE_INTERVAL>
Pending updates to control file will be automatically saved after this interval [default: 300s]
```
Manager watches state updates and detects when there are no actvity on
the timeline and actual partial backup upload in remote storage. When
all conditions are met, the state can be switched to offloaded.
In `timeline.rs` there is `StateSK` enum to support switching between
states. When offloaded, code can access only control file structure and
cannot use `SafeKeeper` to accept new WAL.
`FullAccessTimeline` is now renamed to `WalResidentTimeline`. This
struct contains guard to notify manager about active tasks requiring
on-disk WAL access. All guards are issued by the manager, all requests
are sent via channel using `ManagerCtl`. When manager receives request
to issue a guard, it unevicts timeline if it's currently evicted.
Fixed a bug in partial WAL backup, it used `term` instead of
`last_log_term` previously.
After this commit is merged, next step is to roll this change out, as in
issue #6338.
## Problem
We have seen numerous segfault and memory corruption issue for clients
using libpq and openssl 3.2.2. I don't know if this is a bug in openssl
or libpq. Downgrading to 1.1.1w fixes the issues for the storage
controller and pgbench.
## Summary of Changes:
Use openssl 1.1.1w instead of 3.2.2
I was looking for metrics on how many computes are still using protocol
version 1 and 2. This provides counters for that as "pagestream" and
"pagestream_v2" commands, but also all the other commands. The new
metrics are global for the whole pageserver instance rather than
per-tenant, so the additional metrics bloat should be fairly small.
I saw this compiler warning on my laptop:
pgxn/neon_walredo/walredoproc.c:178:10: warning: using the result of an
assignment as a condition without parentheses [-Wparentheses]
if (err = close_range_syscall(3, ~0U, 0)) {
~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pgxn/neon_walredo/walredoproc.c:178:10: note: place parentheses around
the assignment to silence this warning
if (err = close_range_syscall(3, ~0U, 0)) {
^
( )
pgxn/neon_walredo/walredoproc.c:178:10: note: use '==' to turn this
assignment into an equality comparison
if (err = close_range_syscall(3, ~0U, 0)) {
^
==
1 warning generated.
I'm not sure what compiler version or options cause that, but it's a
good warning. Write the call a little differently, to avoid the warning
and to make it a little more clear anyway. (The 'err' variable wasn't
used for anything, so I'm surprised we were not seeing a compiler
warning on the unused value, too.)
## Problem
We install Postgres 14 in `build-tools` image, but we don't need
it. We use Postgres binaries, which we build ourselves.
## Summary of changes
- Remove Postgresql 14 installation from `build-tools` image
## Problem
Hard to debug the disconnection reason currently.
## Summary of changes
Keep track of error-direction, and therefore error source (client vs
compute) during passthrough.
A follow-up on https://github.com/neondatabase/neon/pull/8103/.
Previously, main branch fails with:
```
assertion `left == right` failed
left: b"value 3@0x10@0x30@0x28@0x40"
right: b"value 3@0x10@0x28@0x30@0x40"
```
This gets fixed after #8103 gets merged.
Signed-off-by: Alex Chi Z <chi@neon.tech>
This was a half-finished mechanism to allow a replica to enter hot
standby mode sooner, without waiting for a running-xacts record. It had
issues, and we are working on a better mechanism to replace it.
The control plane might still set the flag in the spec file, but
compute_ctl will simply ignore it.
In future we may want to run periodic tests on dedicated cloud instances
that are not GitHub action runners.
To allow these to download artifact binaries for a specific commit hash
we want to make the search by commit hash possible and prefix the S3
objects with
`artifacts/${GITHUB_SHA}/${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}`
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Adds manual compaction trigger; add gc compaction to test_gc_feedback
Part of https://github.com/neondatabase/neon/issues/8002
```
test_gc_feedback[debug-pg15].logical_size: 50 Mb
test_gc_feedback[debug-pg15].physical_size: 2269 Mb
test_gc_feedback[debug-pg15].physical/logical ratio: 44.5302
test_gc_feedback[debug-pg15].max_total_num_of_deltas: 7
test_gc_feedback[debug-pg15].max_num_of_deltas_above_image: 2
test_gc_feedback[debug-pg15].logical_size_after_bottom_most_compaction: 50 Mb
test_gc_feedback[debug-pg15].physical_size_after_bottom_most_compaction: 287 Mb
test_gc_feedback[debug-pg15].physical/logical ratio after bottom_most_compaction: 5.6312
test_gc_feedback[debug-pg15].max_total_num_of_deltas_after_bottom_most_compaction: 4
test_gc_feedback[debug-pg15].max_num_of_deltas_above_image_after_bottom_most_compaction: 1
```
## Summary of changes
* Add the manual compaction trigger
* Use in test_gc_feedback
* Add a guard to avoid running it with retain_lsns
* Fix: Do `schedule_compaction_update` after compaction
* Fix: Supply deltas in the correct order to reconstruct value
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
part of https://github.com/neondatabase/neon/issues/8002
## Summary of changes
This pull request adds the image layer iterator. It buffers a fixed
amount of key-value pairs in memory, and give developer an iterator
abstraction over the image layer. Once the buffer is exhausted, it will
issue 1 I/O to fetch the next batch.
Due to the Rust lifetime mysteries, the `get_stream_from` API has been
refactored to `into_stream` and consumes `self`.
Delta layer iterator implementation will be similar, therefore I'll add
it after this pull request gets merged.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
1. Proxy is retrying errors from cplane that shouldn't be retried
2. ~~Proxy is not using the retry_after_ms value~~
## Summary of changes
1. Correct the could_retry impl for ConsoleError.
2. ~~Update could_retry interface to support returning a fixed wait
duration.~~
## Problem
These APIs have been deprecated for some time, but were still used from
test code.
Closes: https://github.com/neondatabase/neon/issues/4282
## Summary of changes
- It is still convenient to do a "tenant_attach" from a test without
having to write out a location_conf body, so those test methods have
been retained with implementations that call through to their
location_conf equivalent.
Part of #7497, closes#8120.
## Summary of changes
This PR adds a metric to track the number of valid leases after `GCInfo`
gets refreshed each time.
Besides this metric, we should also track disk space and synthetic size
(after #8071 is closed) to make sure leases are used properly.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
Storage controller runs into memory corruption issue on the drain/fill
code paths.
## Summary of changes
Update db related depdencies in the unlikely case that the issue was
fixed in diesel.
We did not recover the subtransaction IDs of prepared transactions when
starting a hot standby from a shutdown checkpoint. As a result, such
subtransactions were considered as aborted, rather than in-progress.
That would lead to hint bits being set incorrectly, and the
subtransactions suddenly becoming visible to old snapshots when the
prepared transaction was committed.
To fix, update pg_subtrans with prepared transactions's subxids when
starting hot standby from a shutdown checkpoint. The snapshots taken
from that state need to be marked as "suboverflowed", so that we also
check the pg_subtrans.
Discussion:
https://www.postgresql.org/message-id/6b852e98-2d49-4ca1-9e95-db419a2696e0%40iki.fi
NEON: cherry-picked from the upstream thread ahead of time, to unblock
https://github.com/neondatabase/neon/pull/7288. I expect this to be
committed to upstream in the next few days, superseding this. NOTE: I
did not include the new regression test on v15 and v14 branches, because
the test would need some adapting, and we don't run the perl tests on
Neon anyway.
Part of #7497, closes#8072.
## Problem
Currently the `get_lsn_by_timestamp` and branch creation pageserver APIs do not provide a pleasant client experience where the looked-up LSN might be GC-ed between the two API calls.
This PR attempts to prevent common races between GC and branch creation by making use of LSN leases provided in #8084. A lease can be optionally granted to a looked-up LSN. With the lease, GC will not touch layers needed to reconstruct all pages at this LSN for the duration of the lease.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
- `test_storage_controller_many_tenants` can fail with warnings in the
storage controller about tenant creation holding a lock for too long,
because this test stresses the machine running the test with many
concurrent timeline creations
- `test_tenant_delete_smoke` can fail when synthetic remote storage
errors show up
## Summary of changes
- tolerate warnings about slow timeline creation in
test_storage_controller_many_tenants
- tolerate both possible errors during error_tolerant_delete
## Problem
This command was used in the very early days of sharding, before the
storage controller had anti-affinity + scheduling optimization to spread
out shards.
## Summary of changes
- Remove `storcon_cli tenant-scatter`
Part of https://github.com/neondatabase/neon/issues/8002
This pull request adds tests for bottom-most gc-compaction with delta
records. Also fixed a bug in the compaction process that creates
overlapping delta layers by force splitting at the original delta layer
boundary.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
We now have a `vX` number in the file name, i.e.,
`000000067F0000000400000B150100000000-000000067F0000000400000D350100000000__00000000014B7AC8-v1-00000001`
The related pull request for new-style path was merged a month ago
https://github.com/neondatabase/neon/pull/7660
## Summary of changes
Fixed the draw timeline dir command to handle it.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Previously in safekeeper code, new segment file was opened without
truncate option. I don't think there is a reason to do it, this commit
replaces it with `File::create` to make it simpler and remove
`clippy::suspicious_open_options` linter warning.
## Problem
This test could occasionally trigger a "removing local file ... because
it has unexpected length log" when using the
`compact-shard-ancestors-persistent` failpoint is in use, which is
unexpected because that failpoint stops the process when the remote
metadata is in sync with local files.
It was because there are two shards on the same pageserver, and while
the one being compacted explicitly stops at the failpoint, another shard
was compacting in the background and failing at an unclean point. The
test intends to disable background compaction, but was mistakenly
revoking the value of `compaction_period` when it updated
`pitr_interval`.
Example failure:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8123/9602976462/index.html#/testresult/7dd6165da7daef40
## Summary of changes
- Update `TENANT_CONF` in the test to use properly typed values, so that
it is usable in pageserver APIs as well as via neon_local.
- When updating tenant config with `pitr_interval`, retain the overrides
from the start of the test, so that there won't be any background
compaction going on during the test.
## Problem
`test_sharding_autosplit` is occasionally failing on warnings about
shard splits taking longer than expected (`Exclusive lock by ShardSplit
was held for`...)
It's not obvious which part is taking the time (I suspect remote storage
uploads).
Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9618788427/index.html#testresult/b395294d5bdeb783/
## Summary of changes
- Since shard splits are infrequent events, we can afford to be very
chatty: add a bunch of info-level logging throughout the process.
#8082 removed the legacy deletion path, but retained code for completing
deletions that were started before a pageserver restart. This PR cleans
up that remaining code, and removes all the pageserver code that dealt
with tenant deletion markers and resuming tenant deletions.
The release at https://github.com/neondatabase/neon/pull/8138 contains
https://github.com/neondatabase/neon/pull/8082, so we can now merge this
to `main`
Moves `RemoteStorageConfig` and related structs and functions into a
dedicated module. Also implements `Serialize` for the config structs
(requested in #8126).
Follow-up of #8126
## Problem
Fixes https://github.com/neondatabase/neon/issues/1287
## Summary of changes
tokio-postgres now supports arbitrary server params through the
`param(key, value)` method. Some keys are special so we explicitly
filter them out.
Adds a `Deserialize` impl to `RemoteStorageConfig`. We thus achieve the
same as #7743 but with less repetitive code, by deriving `Deserialize`
impls on `S3Config`, `AzureConfig`, and `RemoteStorageConfig`. The
disadvantage is less useful error messages.
The git history of this PR contains a state where we go via an
intermediate representation, leveraging the `serde_json` crate,
without it ever being actual json though.
Also, the PR adds deserialization tests.
Alternative to #7743 .
## Problem
While adapting the storage controller scale test to do graceful rolling
restarts via
drain and fill, I noticed that secondaries are also being rescheduled,
which, in turn,
caused the storage controller to optimise attachments.
## Summary of changes
* Introduce a transactional looking rescheduling primitive (i.e. "try to
schedule to this
secondary, but leave everything as is if you can't")
* Use it for the drain and fill stages to avoid calling into
`Scheduler::schedule` and having
secondaries move around.