mirror of
https://github.com/neondatabase/neon.git
synced 2026-05-16 12:40:36 +00:00
test_lazy_attach_activation: unblock failpoints before test exit
Before this patch, we would leave the `timeline-calculate-logical-size-pause` failpoint in `pause` mode at the end of the test. With the switch to a single runtime, we would somehow end up in a state where the pageserver was half shut down while the failpoint's spawn_blocking thread was still waiting for the `off` event, which never arrived.

Failures were quite reproducible in CI: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6555/8396322235/index.html#suites/c39429f093f87547b2a3b0943e2522d9/4dacb1efb232b98/

I couldn't reproduce it locally. I managed to reproduce it once on an i3en.3xlarge, where I attached gdb and captured the backtrace. For posterity: https://www.notion.so/neondatabase/debug-test_lazy_attach_activation-teardown-hang-as-part-of-PR-6555-421cb61dc45d4d4e90220c86567f50da?pvs=4
@@ -1057,9 +1057,10 @@ def test_lazy_attach_activation(neon_env_builder: NeonEnvBuilder, activation_met
     env.pageserver.stop()

     # pause at logical size calculation, also pause before walreceiver can give feedback so it will give priority to logical size calculation
+    paused_failpoints = ["timeline-calculate-logical-size-pause", "walreceiver-after-ingest"]
     env.pageserver.start(
         extra_env_vars={
-            "FAILPOINTS": "timeline-calculate-logical-size-pause=pause;walreceiver-after-ingest=pause"
+            "FAILPOINTS": ";".join([f"{fp}=pause" for fp in paused_failpoints]),
         }
     )

@@ -1111,3 +1112,6 @@ def test_lazy_attach_activation(neon_env_builder: NeonEnvBuilder, activation_met
         delete_lazy_activating(lazy_tenant, env.pageserver, expect_attaching=True)
     else:
         raise RuntimeError(activation_method)
+
+    ps_http = env.pageserver.http_client()
+    ps_http.configure_failpoints([(fp, "off") for fp in paused_failpoints])
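The core of the fix is driving both sides of the failpoint lifecycle from one list: the same names that are joined into the `FAILPOINTS=...=pause` startup string are later flipped to `off` before the test exits, so no failpoint can be left paused during shutdown. A minimal standalone sketch of just that bookkeeping (plain Python; no pageserver or test fixtures involved):

```python
# Sketch of the list-driven failpoint bookkeeping from this commit.
# Only the string/list construction is shown; the pageserver itself is not used here.
paused_failpoints = ["timeline-calculate-logical-size-pause", "walreceiver-after-ingest"]

# Value for the FAILPOINTS env var at startup: every failpoint set to `pause`.
failpoints_env = ";".join(f"{fp}=pause" for fp in paused_failpoints)
print(failpoints_env)

# Argument for configure_failpoints() at test exit: the same failpoints set to
# `off`, so shutdown cannot hang on a thread waiting at a paused failpoint.
off_actions = [(fp, "off") for fp in paused_failpoints]
print(off_actions)
```

Because both the startup string and the teardown list derive from `paused_failpoints`, adding a third failpoint later cannot silently reintroduce the leave-it-paused bug.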