neon/src at 980d506bdaba05955e3d9316d9d385228a16f39f - neon

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-09 14:32:57 +00:00

Christian Schwarz 980d506bda pageserver: shutdown all walredo managers 8s into shutdown (#8572)

# Motivation

The working theory for hung systemd during PS deploy
(https://github.com/neondatabase/cloud/issues/11387) is that leftover
walredo processes trigger a race condition.

In https://github.com/neondatabase/neon/pull/8150 I arranged that a
clean Tenant shutdown does actually kill its walredo processes.

But many prod machines don't manage to shut down all their tenants until
the 10s systemd timeout hits and, presumably, triggers the race
condition in systemd / the Linux kernel that causes the frozen systemd.

# Solution

This PR bolts on a rather ugly mechanism to shut down tenant managers
out of order 8s after we've received the SIGTERM from systemd.

# Changes

- add a global registry of `Weak<WalRedoManager>`
- add a special thread spawned during `shutdown_pageserver` that sleeps
for 8s, then shuts down all redo managers in the registry and prevents
new redo managers from being created
- propagate the new failure mode of tenant spawning throughout the code
base
- make sure a shut-down tenant manager results in
`PageReconstructError::Cancelled` so that `Timeline::get` calls that come in
after the shutdown do the right thing
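
The changes listed above can be illustrated with a minimal sketch. This is not the code from the PR: `Manager`, `Registry`, `ShuttingDown`, and `spawn_delayed_shutdown` are hypothetical names, and the real registry, error types, and walredo process handling are assumed to be more involved. The sketch only shows the pattern of a registry of `Weak` handles that a delayed thread drains, after which creating new managers fails.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex, Weak};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a walredo manager.
pub struct Manager {
    shut_down: AtomicBool,
}

impl Manager {
    pub fn shutdown(&self) {
        // A real manager would kill its walredo process here.
        self.shut_down.store(true, Ordering::SeqCst);
    }

    pub fn is_shut_down(&self) -> bool {
        self.shut_down.load(Ordering::SeqCst)
    }
}

/// Hypothetical error returned once the registry no longer accepts managers.
#[derive(Debug, PartialEq)]
pub struct ShuttingDown;

enum State {
    /// Weak handles, so the registry never keeps a manager alive by itself.
    Open(Vec<Weak<Manager>>),
    Closed,
}

pub struct Registry {
    state: Mutex<State>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { state: Mutex::new(State::Open(Vec::new())) }
    }

    /// Creates and registers a manager, or fails if shutdown already ran.
    pub fn create(&self) -> Result<Arc<Manager>, ShuttingDown> {
        let mut state = self.state.lock().unwrap();
        match &mut *state {
            State::Open(managers) => {
                let manager = Arc::new(Manager { shut_down: AtomicBool::new(false) });
                managers.push(Arc::downgrade(&manager));
                Ok(manager)
            }
            State::Closed => Err(ShuttingDown),
        }
    }

    /// Closes the registry and shuts down every manager that is still alive.
    /// Returns how many managers were shut down.
    pub fn shutdown_all(&self) -> usize {
        let mut state = self.state.lock().unwrap();
        let managers = match std::mem::replace(&mut *state, State::Closed) {
            State::Open(managers) => managers,
            State::Closed => return 0,
        };
        let mut count = 0;
        for weak in managers {
            // Managers that were already dropped fail to upgrade and are skipped.
            if let Some(manager) = weak.upgrade() {
                manager.shutdown();
                count += 1;
            }
        }
        count
    }
}

/// Spawns a thread that sleeps for `delay`, then shuts down all managers.
pub fn spawn_delayed_shutdown(
    registry: Arc<Registry>,
    delay: Duration,
) -> thread::JoinHandle<usize> {
    thread::spawn(move || {
        thread::sleep(delay);
        registry.shutdown_all()
    })
}
```

In this sketch, a caller would invoke `spawn_delayed_shutdown(registry, Duration::from_secs(8))` on receiving SIGTERM. Holding the lock across the whole of `shutdown_all` means a concurrent `create` either registers before the sweep, and is shut down by it, or observes `Closed` and fails.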

2024-08-01 07:57:09 +02:00

refactor(page_service): Timeline gate guard holding + cancellation + shutdown (#8339)

2024-07-31 17:05:45 +02:00

consumption_metrics

pageserver: improve readability of shard.rs (#7330)

2024-04-15 11:50:26 +01:00

feat(per-tenant throttling): exclude throttled time from page_service metrics + regression test (#6953)

2024-03-05 13:44:00 +00:00

pageserver: downgrade stale generation messages to INFO (#8256)

2024-07-04 15:05:41 +01:00

refactor(page_service): Timeline gate guard holding + cancellation + shutdown (#8339)

2024-07-31 17:05:45 +02:00

pageserver: shutdown all walredo managers 8s into shutdown (#8572)

2024-08-01 07:57:09 +02:00

Fix nightly warnings 2024 june (#8151)

2024-07-12 13:58:04 +01:00

test(pageserver): add test wal record for unit testing (#8015)

2024-06-13 09:44:37 -04:00

auth.rs

storage scrubber: GC ancestor shard layers (#8196)

2024-07-19 19:07:59 +03:00

aux_file.rs

feat(pageserver): compute aux file size on initial logical size calculation (#7958)

2024-06-04 13:47:48 -04:00

basebackup.rs

fix(pageserver): include aux file in basebackup only once (#8207)

2024-07-01 14:36:49 +00:00

config.rs

compaction_level0_phase1: bypass PS PageCache for data blocks (#8543)

2024-07-31 14:17:59 +02:00

consumption_metrics.rs

refactor(pageserver) remove task_mgr for most global tasks (#8449)

2024-07-22 17:25:06 +02:00

context.rs

Fix nightly warnings 2024 june (#8151)

2024-07-12 13:58:04 +01:00

control_plane_client.rs

storcon: make heartbeats restart aware (#8222)

2024-07-25 14:09:12 +01:00

deletion_queue.rs

Use DefaultCredentialsChain AWS authentication in remote_storage (#8440)

2024-07-19 21:19:30 +02:00

disk_usage_eviction_task.rs

refactor(pageserver) remove task_mgr for most global tasks (#8449)

2024-07-22 17:25:06 +02:00

import_datadir.rs

pageserver: apply shard filtering to blocks ingested during initdb (#7319)

2024-04-05 18:07:35 +01:00

l0_flush.rs

l0_flush: use mode=direct by default => coverage in automated tests (#8534)

2024-07-29 16:49:22 +02:00

lib.rs

pageserver: shutdown all walredo managers 8s into shutdown (#8572)

2024-08-01 07:57:09 +02:00

metrics.rs

Add metrics for input data considered and taken for compression (#8522)

2024-07-30 09:59:15 +02:00

page_cache.rs

remove materialized page cache (#8105)

2024-06-20 11:56:14 +02:00

page_service.rs

refactor(page_service): Timeline gate guard holding + cancellation + shutdown (#8339)

2024-07-31 17:05:45 +02:00

pgdatadir_mapping.rs

Use smgrexists() instead of access() to enforce uniqueness of generated relfilenumber (#7992)

2024-07-23 18:41:55 +03:00

repository.rs

compaction_level0_phase1: bypass PS PageCache for data blocks (#8543)

2024-07-31 14:17:59 +02:00

span.rs

debug_assert presence of shard_id tracing field (#6572)

2024-02-06 14:43:33 +00:00

statvfs.rs

Also allow unnecessary_fallible_conversions lint (#6294)

2024-01-09 04:22:36 +00:00

task_mgr.rs

refactor(pageserver) remove task_mgr for most global tasks (#8449)

2024-07-22 17:25:06 +02:00

tenant.rs

pageserver: shutdown all walredo managers 8s into shutdown (#8572)

2024-08-01 07:57:09 +02:00

utilization.rs

implement Serialize/Deserialize for SystemTime with RFC3339 format (#7203)

2024-04-08 15:53:07 +01:00

virtual_file.rs

virtual_file: take a Slice in the read APIs, eliminate read_exact_at_n, fix UB for engine std-fs (#8186)

2024-06-28 11:20:37 +02:00

walingest.rs

Update Rust to 1.80.0 (#8518)

2024-07-26 11:17:33 +02:00

walrecord.rs

test(pageserver): add test wal record for unit testing (#8015)

2024-06-13 09:44:37 -04:00

walredo.rs

pageserver: shutdown all walredo managers 8s into shutdown (#8572)

2024-08-01 07:57:09 +02:00