test_lazy_attach_activation: unblock failpoints before test exit

Before this patch, we would leave the
`timeline-calculate-logical-size-pause` failpoint in `pause` mode at the
end of the test.

With the switch to a single runtime, we would somehow end up in a state
where the pageserver was half shut down while the failpoint's
`spawn_blocking` thread was still waiting for an `off` event that never
arrived.
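
The hang mechanism can be modeled in a few lines of Python (this is a sketch, not the pageserver's actual Rust `fail` crate machinery): a thread that hits a `pause`-mode failpoint parks until the failpoint is reconfigured to `off`; if the test exits without doing that, the thread stays parked forever.

```python
import threading

class Failpoint:
    """Toy model of a 'pause'-mode failpoint (hypothetical, for illustration)."""

    def __init__(self):
        self._off = threading.Event()

    def hit(self):
        # A thread hitting a paused failpoint blocks here until "off" is set.
        self._off.wait()

    def set_off(self):
        self._off.set()

fp = Failpoint()
results = []
worker = threading.Thread(target=lambda: (fp.hit(), results.append("done")))
worker.start()

worker.join(timeout=0.2)
assert worker.is_alive()  # still parked on the paused failpoint

fp.set_off()              # analogous to configuring the failpoint to "off"
worker.join()
assert results == ["done"]
```

If `set_off()` is never called, `worker.join()` would block indefinitely, which is the teardown hang this patch avoids.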

Failures were reproducible quite well in CI: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6555/8396322235/index.html#suites/c39429f093f87547b2a3b0943e2522d9/4dacb1efb232b98/

I couldn't repro it locally.

I managed to repro it once on an i3en.3xlarge, where I then attached
gdb to capture the backtrace.

For posterity: https://www.notion.so/neondatabase/debug-test_lazy_attach_activation-teardown-hang-as-part-of-PR-6555-421cb61dc45d4d4e90220c86567f50da?pvs=4
Author: Christian Schwarz
Date: 2024-03-23 13:33:54 +00:00
parent 53e9173757
commit 84b7588293


@@ -1057,9 +1057,10 @@ def test_lazy_attach_activation(neon_env_builder: NeonEnvBuilder, activation_met
     env.pageserver.stop()
     # pause at logical size calculation, also pause before walreceiver can give feedback so it will give priority to logical size calculation
+    paused_failpoints = ["timeline-calculate-logical-size-pause", "walreceiver-after-ingest"]
     env.pageserver.start(
         extra_env_vars={
-            "FAILPOINTS": "timeline-calculate-logical-size-pause=pause;walreceiver-after-ingest=pause"
+            "FAILPOINTS": ";".join([f"{fp}=pause" for fp in paused_failpoints]),
         }
     )
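
The hunk above replaces the hand-written `FAILPOINTS` string with one derived from the `paused_failpoints` list, so the startup configuration and the later cleanup can never drift apart. The join expression produces exactly the string it replaces:

```python
# List literal copied from the test; the join reproduces the old literal string.
paused_failpoints = ["timeline-calculate-logical-size-pause", "walreceiver-after-ingest"]
failpoints_env = ";".join(f"{fp}=pause" for fp in paused_failpoints)
print(failpoints_env)
# prints: timeline-calculate-logical-size-pause=pause;walreceiver-after-ingest=pause
```
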
@@ -1111,3 +1112,6 @@ def test_lazy_attach_activation(neon_env_builder: NeonEnvBuilder, activation_met
         delete_lazy_activating(lazy_tenant, env.pageserver, expect_attaching=True)
     else:
         raise RuntimeError(activation_method)
+
+    ps_http = env.pageserver.http_client()
+    ps_http.configure_failpoints([(fp, "off") for fp in paused_failpoints])
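
A more defensive variant of this cleanup (not what the commit does, just a sketch with hypothetical stand-ins for the pageserver HTTP client) would wrap the paused failpoints in a context manager, so that an exception in the test body also cannot leave a failpoint paused:

```python
from contextlib import contextmanager

@contextmanager
def paused(http_client, failpoints):
    """Pause the given failpoints, and guarantee they are turned off on exit.

    `http_client` is a hypothetical stand-in for the pageserver HTTP client;
    the real test sets the initial 'pause' state via the FAILPOINTS env var
    at pageserver startup instead.
    """
    http_client.configure_failpoints([(fp, "pause") for fp in failpoints])
    try:
        yield
    finally:
        # Always unblock, even if the test body raises.
        http_client.configure_failpoints([(fp, "off") for fp in failpoints])

class RecordingClient:
    """Fake client that records configure_failpoints calls, for the demo."""

    def __init__(self):
        self.calls = []

    def configure_failpoints(self, pairs):
        self.calls.extend(pairs)

client = RecordingClient()
try:
    with paused(client, ["a", "b"]):
        raise RuntimeError("test body failed")
except RuntimeError:
    pass

# The "off" reconfiguration ran despite the exception.
assert ("a", "off") in client.calls and ("b", "off") in client.calls
```
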