fix: on-demand downloads can outlive timeline shutdown (#7051)

## Problem

Before this PR, it was possible that on-demand downloads were started
after `Timeline::shutdown()`.

For example, we have observed a walreceiver-connection-handler-initiated
on-demand download that was started after `Timeline::shutdown()`s final
`task_mgr::shutdown_tasks()` call.

The underlying issue is that `task_mgr::shutdown_tasks()` isn't sticky,
i.e., new tasks can be spawned during or after
`task_mgr::shutdown_tasks()`.

Cc: https://github.com/neondatabase/neon/issues/4175 in lieu of a more
specific issue for task_mgr. We already decided we want to get rid of it
anyways.

Original investigation:
https://neondb.slack.com/archives/C033RQ5SPDH/p1709824952465949

## Changes

- enter gate while downloading
- use timeline cancellation token for cancelling download

thereby, fixes #7054

Entering the gate might also remove recent "kept the gate from closing"
in staging.
This commit is contained in:
Joonas Koivunen
2024-03-09 15:09:08 +02:00
committed by GitHub
parent 74d24582cf
commit b09d686335
5 changed files with 31 additions and 31 deletions

View File

@@ -190,6 +190,8 @@ def test_delete_tenant_exercise_crash_safety_failpoints(
# So by ignoring these instead of waiting for empty upload queue
# we execute more distinct code paths.
'.*stopping left-over name="remote upload".*',
# an on-demand is cancelled by shutdown
".*initial size calculation failed: downloading failed, possibly for shutdown",
]
)

View File

@@ -213,7 +213,9 @@ def test_delete_timeline_exercise_crash_safety_failpoints(
# This happens when timeline remains are cleaned up during loading
".*Timeline dir entry become invalid.*",
# In one of the branches we poll for tenant to become active. Polls can generate this log message:
f".*Tenant {env.initial_tenant} is not active*",
f".*Tenant {env.initial_tenant} is not active.*",
# an on-demand is cancelled by shutdown
".*initial size calculation failed: downloading failed, possibly for shutdown",
]
)