fix(test_ondemand_download_timetravel): occasionally fails with WAL timeout during layer creation (#6818)

refs https://github.com/neondatabase/neon/issues/4112
amends https://github.com/neondatabase/neon/pull/6687

Since my last PR #6687 regarding this test, the observed flakiness has
shifted to the beginning of the test, where we create the layers:

```
timed out while waiting for remote_consistent_lsn to reach 0/411A5D8, was 0/411A5A0
```

[Example Allure
Report](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6789/7932503173/index.html#/testresult/ddb877cfa4062f7d)
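For scale: the two LSNs in the timeout message differ by only a handful of bytes of WAL. LSNs are printed Postgres-style as two hex halves `hi/lo`, so the gap can be computed directly (a quick standalone sketch, not part of the test):

```python
# Compute the WAL gap between the two LSNs from the timeout message.
# Postgres-style LSNs are printed as two hex halves: "hi/lo".
def parse_lsn(s: str) -> int:
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

gap = parse_lsn("0/411A5D8") - parse_lsn("0/411A5A0")
print(gap)  # 56 -- just 56 bytes of trailing WAL
```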

Analysis
--------

I suspect the following race condition:
- the endpoints push out some tiny piece of WAL during
  endpoints.stop_all()
- that WAL reaches the SK (it's just one SK according to the logs)
- the SK sends it into the walreceiver connection
- the SK gets shut down
- the checkpoint is taken, with last_record_lsn = 0/411A5A0
- the PS's walreceiver_connection_handler processes the WAL that was
  sent into the connection by the SK; this advances
  last_record_lsn to 0/411A5D8
- we read current_lsn = 0/411A5D8
- nothing flushes a layer, so remote_consistent_lsn never reaches
  0/411A5D8 and the wait times out
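The ordering above can be made concrete with a tiny self-contained model (all names here are hypothetical and illustrative, not the pageserver's real code):

```python
# Toy model of the race; names are illustrative, not Neon's actual API.
class ToyPageserver:
    def __init__(self) -> None:
        self.last_record_lsn = 0        # advanced by WAL ingestion
        self.remote_consistent_lsn = 0  # advanced only by flush + upload

    def ingest_wal(self, nbytes: int) -> None:
        # models walreceiver_connection_handler: only moves last_record_lsn
        self.last_record_lsn += nbytes

    def checkpoint(self) -> None:
        # models flush + upload of everything ingested so far
        self.remote_consistent_lsn = self.last_record_lsn

ps = ToyPageserver()
ps.ingest_wal(100)   # WAL from the test workload
ps.checkpoint()      # flushes at last_record_lsn == 100

# The tiny chunk of endpoints.stop_all() WAL was already in flight and
# lands *after* the checkpoint:
ps.ingest_wal(56)

current_lsn = ps.last_record_lsn  # 156, i.e. the "0/411A5D8" of the report
# The test then waits for remote_consistent_lsn to reach current_lsn,
# but nothing flushes the trailing WAL, so the wait can only time out:
assert ps.remote_consistent_lsn < current_lsn  # 100 < 156
```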

Changes
-------

There's no testing / debug interface to shut down / sever all
walreceiver connections.
So, this PR restarts the pageserver to achieve that.

Also, it lifts the "wait for image layer uploads" further up, so that
after this first restart the pageserver really does _nothing_ by
itself; thus, the original physical size mismatch issue quoted in
#6687 should be fixed.
(My initial suspicion hasn't changed: it was due to the tiny chunk of
endpoints.stop_all() WAL being ingested after the second PS restart.)
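In the same toy model as in the analysis above, the fix works because the second checkpoint is taken after all WAL producers have stopped, and the restart guarantees no connection can deliver straggler WAL afterwards (again, hypothetical names, not the real API):

```python
# Toy model of the fix; names are illustrative, not Neon's actual API.
class ToyPageserver:
    def __init__(self) -> None:
        self.last_record_lsn = 0
        self.remote_consistent_lsn = 0
        self.walreceiver_connected = True

    def ingest_wal(self, nbytes: int) -> None:
        if self.walreceiver_connected:
            self.last_record_lsn += nbytes

    def checkpoint(self) -> None:
        self.remote_consistent_lsn = self.last_record_lsn

    def restart(self) -> None:
        # severs all walreceiver connections; with the SKs stopped,
        # nothing reconnects, so no more WAL can arrive
        self.walreceiver_connected = False

ps = ToyPageserver()
ps.ingest_wal(100)
ps.checkpoint()
ps.ingest_wal(56)     # straggler WAL from endpoint shutdown
ps.checkpoint()       # second checkpoint, taken after the SKs are stopped
ps.restart()
ps.ingest_wal(1)      # dropped: connection is severed

current_lsn = ps.last_record_lsn
assert ps.remote_consistent_lsn == current_lsn  # 156 == 156: wait succeeds
```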
commit b467d8067b (parent a48b23d777)
Author: Christian Schwarz
Date: 2024-02-20 14:09:15 +01:00
Committed by: GitHub

```
@@ -17,6 +17,7 @@ from fixtures.pageserver.utils import (
    wait_for_last_record_lsn,
    wait_for_upload,
    wait_for_upload_queue_empty,
    wait_until_tenant_active,
)
from fixtures.remote_storage import RemoteStorageKind
from fixtures.types import Lsn
```

```
@@ -165,6 +166,10 @@ def test_ondemand_download_timetravel(neon_env_builder: NeonEnvBuilder):
    tenant_id = env.initial_tenant
    timeline_id = env.initial_timeline

    ####
    # Produce layers
    ####

    lsns = []
    table_len = 10000
```

```
@@ -194,19 +199,29 @@ def test_ondemand_download_timetravel(neon_env_builder: NeonEnvBuilder):
    # run checkpoint manually to be sure that data landed in remote storage
    client.timeline_checkpoint(tenant_id, timeline_id)

    ##### Stop the first pageserver instance, erase all its data
    # prevent new WAL from being produced, wait for layers to reach remote storage
    env.endpoints.stop_all()

    # Stop safekeepers and take another checkpoint. The endpoints might
    # have written a few more bytes during shutdown.
    for sk in env.safekeepers:
        sk.stop()
    client.timeline_checkpoint(tenant_id, timeline_id)

    current_lsn = Lsn(client.timeline_detail(tenant_id, timeline_id)["last_record_lsn"])

    # wait until pageserver has successfully uploaded all the data to remote storage
    # NB: the wait_for_upload returns as soon as remote_consistent_lsn == current_lsn.
    # But the checkpoint also triggers a compaction
    # => image layer generation =>
    # => doesn't advance LSN
    # => but we want the remote state to deterministic, so additionally, wait for upload queue to drain
    wait_for_upload(client, tenant_id, timeline_id, current_lsn)
    wait_for_upload_queue_empty(pageserver_http, env.initial_tenant, timeline_id)
    client.deletion_queue_flush(execute=True)
    del current_lsn

    env.pageserver.stop()
    env.pageserver.start()
    # We've shut down the SKs, then restarted the PSes to sever all walreceiver connections;
    # This means pageserver's remote_consistent_lsn is now frozen to whatever it was after the pageserver.stop() call.
    wait_until_tenant_active(client, tenant_id)

    ###
    # Produce layers complete;
    # Start the actual testing.
    ###

    def get_api_current_physical_size():
        d = client.timeline_detail(tenant_id, timeline_id)
```

```
@@ -223,9 +238,7 @@ def test_ondemand_download_timetravel(neon_env_builder: NeonEnvBuilder):
    log.info(filled_size)
    assert filled_current_physical == filled_size, "we don't yet do layer eviction"

    # Wait until generated image layers are uploaded to S3
    wait_for_upload_queue_empty(pageserver_http, env.initial_tenant, timeline_id)

    # Stop the first pageserver instance, erase all its data
    env.pageserver.stop()

    # remove all the layer files
```