Saw a failure like this, from 'test_tenants_attached_after_download' and
'test_tenant_redownloads_truncated_file_on_startup':
> test_runner/fixtures/neon_fixtures.py:1064: in verbose_error
> res.raise_for_status()
> /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/requests/models.py:1021: in raise_for_status
> raise HTTPError(http_error_msg, response=self)
> E requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:18150/v1/tenant/2334c9c113a82b5dd1651a0a23c53448/timeline
>
> The above exception was the direct cause of the following exception:
> test_runner/regress/test_tenants_with_remote_storage.py:185: in test_tenants_attached_after_download
> restored_timelines = client.timeline_list(tenant_id)
> test_runner/fixtures/neon_fixtures.py:1148: in timeline_list
> self.verbose_error(res)
> test_runner/fixtures/neon_fixtures.py:1070: in verbose_error
> raise PageserverApiException(msg) from e
> E fixtures.neon_fixtures.PageserverApiException: NotFound: Tenant 2334c9c113a82b5dd1651a0a23c53448 is not active. Current state: Loading
These tests starts the pageserver, wait until
assert_no_in_progress_downloads_for_tenant says that
has_downloads_in_progress is false, and then call timeline_list on the
tenant. But has_downloads_in_progress was only returned as true when
the tenant was being attached, not when it was being loaded at
pageserver startup. Change tenant_status API endpoint
(/v1/tenant/:tenant_id) so that it returns
has_downloads_in_progress=true also for tenants that are still in
Loading state.
If a connection from compute arrives while a tenant is still in
Loading state, wait for it to become Active instead of throwing an
error to the client. This should fix the errors from test_gc_cutoff
test that repeatedly restarts the pageserver and immediately tries to
connect to it.
- Pass through FAILPOINTS environment variable to the pageserver in
"neon_local pageserver start" command
- On startup, list any failpoints that were set with FAILPOINTS to the log
- Add optional "extra_env_vars" argument to the NeonPageserver.start()
function in the python fixture, so that you can pass FAILPOINTS
None of the tests use this functionality yet; that comes in a separate
commit.
closes https://github.com/neondatabase/neon/pull/2865
The test removes all the timelines from local disk, to test how
pageserver startup works when it's missing. But if the pageserver
hasn't finished loading the timeline yet, you can get an error:
> 2022-11-20T01:30:41.053207Z INFO load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: no index file was found on the remote
> 2022-11-20T01:30:41.054045Z ERROR load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: Failed to initialize timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807: Failed to load layermap for timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807
>
> Caused by:
> No such file or directory (os error 2)
I saw this in CI, here:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2785/debug/3505805425/index.html#suites/ec4311502db344eee91f1354e9dc839b/725c7d0ecec1ec4d/
And was able to reproduce it with this:
> --- a/pageserver/src/tenant.rs
> +++ b/pageserver/src/tenant.rs
> @@ -946,6 +946,8 @@ impl Tenant {
> None => None,
> };
>
> + tokio::time::sleep(std::time::Duration::from_secs(2)).await;
> +
> self.setup_timeline(
> timeline_id,
> remote_client,
Even on 'main', it's pretty sketchy to remote the directory while the
pageserver is still running, but it didn't lead to an error because
the pagesever finished loading the local layer maps before starting
up. Now that that's spawned into background, the directory might get
removed before the loading finishes.
Increse the pgbench runtimes even further. The theory is that when
there are many other tests running at the same time, one pgbench run
could take a long time until it generates enough layers for GC to kick
in.
I saw these from the build of the compute docker image in the CI
(compute-node-image-v15):
pagestore_smgr.c: In function 'neon_prefetch':
pagestore_smgr.c:1654:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
1654 | BufferTag tag = (BufferTag) {
| ^~~~~~~~~
walproposer.c:197:1: warning: no previous prototype for 'WalProposerSync' [-Wmissing-prototypes]
197 | WalProposerSync(int argc, char *argv[])
| ^~~~~~~~~~~~~~~
libpagestore.c: In function 'pageserver_connect':
libpagestore.c💯9: warning: variable 'wc' set but not used [-Wunused-but-set-variable]
100 | int wc;
| ^~
libpagestore.c: In function 'call_PQgetCopyData':
libpagestore.c:144:9: warning: variable 'wc' set but not used [-Wunused-but-set-variable]
144 | int wc;
| ^~
Harmless warnings, but let's be tidy.
In the passing, I added some "extern" to a few function declarations
that were missing them, and marked WalProposerSync as "static". Those
changes are also purely cosmetic.
Commit d013a2b227 changed the test, so that it fails if pgbench runs
to completion without triggering the failpoint. That has now happened
several times in the CI. That's not expected, so this needs some
investigation, but as a quick fix just make the pgbench runs longer so
that we're closer to the situation before commit d013a2b227.
See https://github.com/neondatabase/neon/issues/2856
This allows us to error out in the case where we request flush but the
flush loop is not running.
Before, we would only track whether it was started, but not when it
exited.
Better to use an enum with 3 states than a 2-state bool because then
the error message can answer the question whether we ever started
the flush loop or not.
This allows us to error out in the case where we request flush but the
flush loop is not running.
Before, we would only track whether it was started, but not when it
exited.
Better to use an enum with 3 states than a 2-state bool because then
the error message can answer the question whether we ever started
the flush loop or not.
In a CI run, I got a test failure because of this error in the log,
from the test_get_tenant_size_with_multiple_branches test:
ERROR gc_loop{tenant_id=f1630516d4b526139836ced93be0c878}: Gc failed, retrying in 2s: No such file or directory (os error 2)
There are known race conditions between GC and timeline deletion,
which surely caused that error. But if we didn't know the cause, it
would be pretty hard to debug without a stack trace.
Previously in some cases local metadata was confused with remote one and
there was a check, that we write locally only if remote metadata has
greater disk_consistent_lsn. So because they were equal we didnt write
anything. For attach scenario this ended up in not writing metadata at
all.
Rearrange code so we decide on proper metadata value earlier on and
initialize timeline with correct one without need to update it late in
the initialization process in .reconsile_with_remote
As pointed out in b8488e70a9 (r1024319620)
the following is wrong for the case where the remote storage is empty:
metadata = whatever the local-ONLY metadata is
...
upload_queue.latest_metadata = Some(metadata.clone());
upload_queue.last_uploaded_consistent_lsn = Some(metadata.disk_consistent_lsn());
The reason why it's wrong is that we return last_uploaded_consistent_lsn
to safekeepers. So, we'd be returning an Lsn that is not yet uploaded to
S3.
Before this change, test_pageserver_with_empty_tenants was failing at:
assert loaded_tenant["state"] == {
"Active": {"background_jobs_running": False}
}, "Tenant {tenant_with_empty_timelines_dir} with empty timelines dir should be active and ready for timeline creation"
because background_jobs_running was True instead of False.
Personally I think we should simply always start the background loops
and not bother, but let's punt this until after we've merged this PR.
* Poll more frequently when waiting for process start/stop. This
speeds up startup and shutdown in tests. We did this already in
commit 52ce1c9d53, which reduced the interval to 100 ms, but it was
inadvertently increased back to 500 ms in commit d42700280f. Reduce
it to 100 ms again, for both start and stop operations.
* Harmonize the start and stop loops, printing the dots and notices
the same way in both. I considered extracting the logic to a
separate retry-function that takes a closure as argument that does
the polling, but as long as we only have two copies, the code
duplication isn't that bad.
* Remove newline after "Starting pageserver" and "Starting etcd"
messages, so that the progress-indicator dots that are printed once
a second are printed on the same line. Before:
Starting pageserver at '127.0.0.1:64000' in '.neon'
...
pageserver started, pid: 2538937
After:
Starting pageserver at '127.0.0.1:64000' in '.neon'...
pageserver started, pid: 2538937
The "Starting safekeeper" message already got this right.
* Update example output in README.md to match