From 50297bef9fb5163e067a76001062d68ab5498b84 Mon Sep 17 00:00:00 2001
From: Kirill Bulatov
Date: Thu, 20 Oct 2022 12:49:54 +0300
Subject: [PATCH] RFC about Tenant / Timeline guard objects (#2660)

Co-authored-by: Heikki Linnakangas
---
 docs/SUMMARY.md                             |  2 +
 docs/rfcs/019-tenant-timeline-lifecycles.md | 91 +++++++++++++++++++++
 2 files changed, 93 insertions(+)
 create mode 100644 docs/rfcs/019-tenant-timeline-lifecycles.md

diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index fb6467ffd5..faf2b2336f 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -80,4 +80,6 @@
 - [015-storage-messaging](rfcs/015-storage-messaging.md)
 - [016-connection-routing](rfcs/016-connection-routing.md)
 - [017-timeline-data-management](rfcs/017-timeline-data-management.md)
+- [018-storage-messaging-2](rfcs/018-storage-messaging-2.md)
+- [019-tenant-timeline-lifecycles](rfcs/019-tenant-timeline-lifecycles.md)
 - [cluster-size-limits](rfcs/cluster-size-limits.md)

diff --git a/docs/rfcs/019-tenant-timeline-lifecycles.md b/docs/rfcs/019-tenant-timeline-lifecycles.md
new file mode 100644
index 0000000000..2734bf17b9
--- /dev/null
+++ b/docs/rfcs/019-tenant-timeline-lifecycles.md
@@ -0,0 +1,91 @@
+# Managing Tenant and Timeline lifecycles
+
+## Summary
+
+The pageserver has a Tenant object in memory for each tenant it manages, and a
+Timeline object for each timeline. Many tasks operate on the tenants and
+timelines through references to those objects. We have some mechanisms to
+track which tasks are operating on each Tenant and Timeline, and to request
+them to shut down when a tenant or timeline is deleted, but they do not cover
+all uses, and as a result we have many race conditions around tenant/timeline
+shutdown.
+
+## Motivation
+
+We have a number of race conditions that can produce weird errors and can be
+hard to track down.
+
+## Non Goals
+
+This RFC only covers the problem of ensuring that a task/thread isn't
+operating on a Tenant or Timeline.
+It does not cover what states, aside from Active and non-Active, each Tenant
+and Timeline should have, or when exactly the transitions should happen.
+
+## Impacted components (e.g. pageserver, safekeeper, console, etc)
+
+Pageserver. Although I wonder if the safekeeper should have a similar
+mechanism.
+
+## Current situation
+
+Most pageserver tasks are managed by task_mgr.rs:
+
+- LibpqEndpointListener
+- HttpEndPointListener
+- WalReceiverManager and -Connection
+- GarbageCollector and Compaction
+- InitialLogicalSizeCalculation
+
+In addition to those tasks, the walreceiver performs some direct tokio::spawn
+calls to spawn tasks that are not registered with 'task_mgr'. And all of these
+tasks can spawn extra operations with tokio spawn_blocking.
+
+Whenever a tenant or timeline is removed from the system, by pageserver
+shutdown, delete_timeline, or a tenant-detach operation, we rely on the task
+registry in 'task_mgr.rs' to wait until there are no tasks operating on the
+tenant or timeline before its Tenant/Timeline object is removed. That relies
+on each task registering itself with the tenant/timeline ID in 'task_mgr.rs'.
+However, there are many gaps in that. For example, GarbageCollection and
+Compaction tasks are registered with the tenant, but when they proceed to
+operate on a particular timeline of the tenant, they don't register with the
+timeline ID. Because of that, the timeline can be deleted while GC or
+compaction is running on it, causing failures in the GC or compaction (see
+https://github.com/neondatabase/neon/issues/2442).
+
+Another problem is that the task registry only works for tokio Tasks. There is
+no way to register a piece of code that runs inside spawn_blocking(), for
+example.
+
+## Proposed implementation
+
+This "voluntary" registration of tasks is fragile. Let's use Rust language
+features to enforce that a tenant/timeline cannot be removed from the system
+while there is still some code operating on it.
+
+Let's introduce new Guard objects for Tenant and Timeline, and do all actions
+through the Guard object. Something like:
+
+TenantActiveGuard: Guard object over Arc<Tenant>. When you acquire the guard,
+the code checks that the tenant is in Active state. If it's not, you get an
+error. You can change the state of the tenant to Stopping while there are
+TenantActiveGuard objects still on it, to prevent new TenantActiveGuards from
+being acquired, but the Tenant cannot be removed until all the guards are
+gone.
+
+TenantMaintenanceGuard: Like TenantActiveGuard, but can be held even when the
+tenant is not in Active state. Used for operations like attach/detach.
+Perhaps allow only one such guard on a Tenant at a time.
+
+Similarly for Timelines. We don't currently have a "state" on Timeline, but I
+think we need at least two states: Active and Stopping. The Stopping state is
+used at deletion, to prevent new TimelineActiveGuards from appearing, while
+you wait for existing TimelineActiveGuards to die out.
+
+The shutdown signaling, using shutdown_watcher() and is_shutdown_requested(),
+probably also needs changes to deal with the new Guards. The rule is that if
+you have a TenantActiveGuard, and the tenant's state changes from Active to
+Stopping, the is_shutdown_requested() function should return true, and the
+shutdown_watcher() future should return.
+
+This signaling doesn't necessarily need to cover all cases. For example, if
+you have a block of code in spawn_blocking(), it might be acceptable if
+is_shutdown_requested() doesn't return true even though the tenant is in
+Stopping state, as long as the code finishes reasonably fast.