refs https://github.com/neondatabase/neon/issues/4112
amends https://github.com/neondatabase/neon/pull/6687
Since my last PR #6687 for this test, the observed flakiness has
shifted to the beginning of the test, where we create the layers:
```
timed out while waiting for remote_consistent_lsn to reach 0/411A5D8, was 0/411A5A0
```
[Example Allure
Report](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6789/7932503173/index.html#/testresult/ddb877cfa4062f7d)
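To put a number on the mismatch in the log line above, here is a minimal sketch that converts both LSNs to byte positions. `parse_lsn` is a hypothetical helper written for this description, not a function from the test suite; it only assumes the usual Postgres `hi/lo` hexadecimal LSN notation.

```python
def parse_lsn(text: str) -> int:
    """Convert a Postgres-style LSN such as '0/411A5D8' to a byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


wanted = parse_lsn("0/411A5D8")   # LSN the test waited for
reached = parse_lsn("0/411A5A0")  # LSN that remote_consistent_lsn was stuck at
gap_bytes = wanted - reached
print(gap_bytes)  # 56
```

The gap is only 56 bytes, which fits the idea of a very small amount of WAL arriving late.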
Analysis
--------
I suspect there was the following race condition:
- endpoints push out some tiny piece of WAL during their
endpoints.stop_all()
- that WAL reaches the SK (it's just one SK according to logs)
- the SK sends it into the walreceiver connection
- the SK gets shut down
- the checkpoint is taken, with last_record_lsn = 0/411A5A0
- the PS's walreceiver_connection_handler processes the WAL that the
SK sent into the connection; this advances last_record_lsn to
0/411A5D8
- we get current_lsn = 0/411A5D8
- nothing flushes a layer, so remote_consistent_lsn stays at 0/411A5A0
and the wait for 0/411A5D8 times out
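The suspected interleaving can be replayed with a toy model. This is only a sketch of the ordering described in the list above: `ToyPageserver` and its methods are hypothetical stand-ins, not pageserver or test-fixture APIs, and LSNs are plain integers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPageserver:
    """Hypothetical stand-in that tracks only the two LSNs of interest."""
    last_record_lsn: int
    remote_consistent_lsn: int = 0
    # End LSNs of WAL already sent into the walreceiver connection
    # but not yet processed by the handler.
    inflight: list = field(default_factory=list)

    def checkpoint(self) -> None:
        # Model "flush and upload everything ingested so far".
        self.remote_consistent_lsn = self.last_record_lsn

    def process_inflight(self) -> None:
        # Model the walreceiver handler ingesting buffered WAL.
        for end_lsn in self.inflight:
            self.last_record_lsn = max(self.last_record_lsn, end_lsn)
        self.inflight.clear()


ps = ToyPageserver(last_record_lsn=0x411A5A0)
ps.inflight.append(0x411A5D8)  # SK sent the WAL, then was shut down
ps.checkpoint()                # checkpoint sees last_record_lsn = 0/411A5A0
ps.process_inflight()          # late WAL advances last_record_lsn
current_lsn = ps.last_record_lsn

# Nothing flushes again, so the wait for current_lsn can never succeed.
stuck = ps.remote_consistent_lsn < current_lsn
print(hex(ps.remote_consistent_lsn), hex(current_lsn), stuck)
```

In this model the checkpoint and the late ingest are the only two events that matter; swapping their order makes the mismatch disappear.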
Changes
-------
There's no testing / debug interface to shut down / sever all
walreceiver connections.
So, this PR restarts the pageserver to achieve it.
Also, it lifts the "wait for image layer uploads" further up, so that
after this first restart, the pageserver really does _nothing_ by
itself, and so, the original physical size mismatch issue quoted in
#6687 should be fixed.
(My initial suspicion is unchanged: it was due to the tiny chunk
of endpoints.stop_all() WAL being ingested after the second PS restart.)
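A toy sketch of why a restart helps, under the assumption that restarting the pageserver drops whatever WAL is still buffered in the walreceiver connection and that, with the SK already shut down, no new WAL can arrive afterwards. `ToyPageserver` is again a hypothetical stand-in, and the exact ordering of steps in the real test may differ.

```python
class ToyPageserver:
    """Hypothetical model: a restart severs the walreceiver connection."""

    def __init__(self, last_record_lsn: int) -> None:
        self.last_record_lsn = last_record_lsn
        self.remote_consistent_lsn = 0
        self.inflight = []  # WAL sent by the SK but not yet processed

    def restart(self) -> None:
        # Assumption: buffered, unprocessed WAL is lost with the connection.
        self.inflight.clear()

    def checkpoint(self) -> None:
        # Model "flush and upload everything ingested so far".
        self.remote_consistent_lsn = self.last_record_lsn

    def process_inflight(self) -> None:
        # Model the walreceiver handler ingesting buffered WAL.
        for end_lsn in self.inflight:
            self.last_record_lsn = max(self.last_record_lsn, end_lsn)
        self.inflight.clear()


ps = ToyPageserver(last_record_lsn=0x411A5A0)
ps.inflight.append(0x411A5D8)  # late WAL from endpoints.stop_all()
ps.restart()                   # sever the connection before checkpointing
ps.checkpoint()
ps.process_inflight()          # nothing left to ingest
current_lsn = ps.last_record_lsn
print(current_lsn == ps.remote_consistent_lsn)  # True
```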