Commit Graph

2395 Commits

Author SHA1 Message Date
Christian Schwarz
5a55dce282 ensure layers_removed > 0 2022-11-22 12:13:36 -05:00
Heikki Linnakangas
e6de1b0e8c Increase the timeout in test.
Downloading all files can take more than 5 seconds, if there are even
small network glitches or similar.
2022-11-22 17:22:33 +02:00
Heikki Linnakangas
58be279be1 Fix python flake8 errors 2022-11-22 16:40:51 +02:00
Heikki Linnakangas
d1b92e976a When a new layer file is created in compaction, also upload it.
I can't believe this was missing..
2022-11-22 16:35:31 +02:00
Heikki Linnakangas
7d46c7c118 Fix python formatting 2022-11-22 15:16:31 +02:00
Heikki Linnakangas
d6c1a9aa18 Fix formatting 2022-11-22 15:02:09 +02:00
Heikki Linnakangas
49f2eac934 Add missing import
Was causing the test to fail. It's annoying that you don't get an error
from that earlier..
2022-11-22 15:01:03 +02:00
Heikki Linnakangas
5c8387aff1 Fix setting failpoints in test_remote_storage_backup_and_restore
The checkpoints in the test were numbered 1 and 2, but the code only
tried to set the failpoints for checkpoint 0. So they were never set.
2022-11-22 14:52:11 +02:00
Heikki Linnakangas
3c4680b718 Turn upload_queue_items_metric into a regular function 2022-11-22 14:45:22 +02:00
Christian Schwarz
6b7cbec9b3 WIP: add test for storage_sync upload retries 2022-11-22 05:49:46 -05:00
Christian Schwarz
6de9a33c05 metric REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS 2022-11-22 05:49:46 -05:00
Christian Schwarz
77f883c95c prometheus metrics for storage_sync2
Extracted from https://github.com/neondatabase/neon/pull/2595

Use pub(crate) to highlight unused metrics.
2022-11-22 05:49:25 -05:00
Heikki Linnakangas
e6557f4f91 Silence test failures, where we operate on a tenant before it's loaded
Saw a failure like this, from 'test_tenants_attached_after_download' and
'test_tenant_redownloads_truncated_file_on_startup':

> test_runner/fixtures/neon_fixtures.py:1064: in verbose_error
>     res.raise_for_status()
> /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/requests/models.py:1021: in raise_for_status
>     raise HTTPError(http_error_msg, response=self)
> E   requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:18150/v1/tenant/2334c9c113a82b5dd1651a0a23c53448/timeline
>
> The above exception was the direct cause of the following exception:
> test_runner/regress/test_tenants_with_remote_storage.py:185: in test_tenants_attached_after_download
>     restored_timelines = client.timeline_list(tenant_id)
> test_runner/fixtures/neon_fixtures.py:1148: in timeline_list
>     self.verbose_error(res)
> test_runner/fixtures/neon_fixtures.py:1070: in verbose_error
>     raise PageserverApiException(msg) from e
> E   fixtures.neon_fixtures.PageserverApiException: NotFound: Tenant 2334c9c113a82b5dd1651a0a23c53448 is not active. Current state: Loading

These tests starts the pageserver, wait until
assert_no_in_progress_downloads_for_tenant says that
has_downloads_in_progress is false, and then call timeline_list on the
tenant. But has_downloads_in_progress was only returned as true when
the tenant was being attached, not when it was being loaded at
pageserver startup. Change tenant_status API endpoint
(/v1/tenant/:tenant_id) so that it returns
has_downloads_in_progress=true also for tenants that are still in
Loading state.
2022-11-21 20:50:42 +01:00
Heikki Linnakangas
e8db20eb26 On connection from compute, wait for tenant to become Active.
If a connection from compute arrives while a tenant is still in
Loading state, wait for it to become Active instead of throwing an
error to the client. This should fix the errors from test_gc_cutoff
test that repeatedly restarts the pageserver and immediately tries to
connect to it.
2022-11-21 20:50:42 +01:00
Heikki Linnakangas
7552e2d25f Enable passing FAILPOINTS at startup.
- Pass through FAILPOINTS environment variable to the pageserver in
  "neon_local pageserver start" command

- On startup, list any failpoints that were set with FAILPOINTS to the log

- Add optional "extra_env_vars" argument to the NeonPageserver.start()
  function in the python fixture, so that you can pass FAILPOINTS

None of the tests use this functionality yet; that comes in a separate
commit.

closes https://github.com/neondatabase/neon/pull/2865
2022-11-21 20:50:42 +01:00
Heikki Linnakangas
dfb4160403 Don't remove timelines directory while pageserver is running.
The test removes all the timelines from local disk, to test how
pageserver startup works when it's missing. But if the pageserver
hasn't finished loading the timeline yet, you can get an error:

> 2022-11-20T01:30:41.053207Z  INFO load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: no index file was found on the remote
> 2022-11-20T01:30:41.054045Z ERROR load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: Failed to initialize timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807: Failed to load layermap for timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807
>
> Caused by:
>     No such file or directory (os error 2)

I saw this in CI, here:
  https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2785/debug/3505805425/index.html#suites/ec4311502db344eee91f1354e9dc839b/725c7d0ecec1ec4d/

And was able to reproduce it with this:

> --- a/pageserver/src/tenant.rs
> +++ b/pageserver/src/tenant.rs
> @@ -946,6 +946,8 @@ impl Tenant {
>              None => None,
>          };
>
> +        tokio::time::sleep(std::time::Duration::from_secs(2)).await;
> +
>          self.setup_timeline(
>              timeline_id,
>              remote_client,

Even on 'main', it's pretty sketchy to remote the directory while the
pageserver is still running, but it didn't lead to an error because
the pagesever finished loading the local layer maps before starting
up. Now that that's spawned into background, the directory might get
removed before the loading finishes.
2022-11-20 21:52:22 +02:00
Heikki Linnakangas
f9bdc030e8 Fix python formatting failure 2022-11-20 02:37:15 +02:00
Heikki Linnakangas
e4c9b83a39 Merge remote-tracking branch 'origin/main' into HEAD 2022-11-20 02:31:21 +02:00
Alexander Bayandin
cb9b26776e Fix test_seqscans on remote cluster (#2869)
A remote project is reused between tests, so we need to ensure that we
don't have a table with the same name already created.
2022-11-19 23:39:42 +00:00
Heikki Linnakangas
a78c16328e Silence test_remote_storage failure, caused by error message change 2022-11-20 00:13:43 +02:00
Heikki Linnakangas
684329d4d2 Another attempt at silencing test_gc_cutoff failures.
Increse the pgbench runtimes even further. The theory is that when
there are many other tests running at the same time, one pgbench run
could take a long time until it generates enough layers for GC to kick
in.
2022-11-19 19:28:56 +02:00
Heikki Linnakangas
eed99b7251 Silence clippy warning 2022-11-19 18:26:18 +02:00
Heikki Linnakangas
ed40a045c0 Add more logging to track down test_gc_cutoff failure.
see https://github.com/neondatabase/neon/issues/2856
2022-11-19 14:12:21 +02:00
Heikki Linnakangas
3f39327622 Silence a few compiler warnings
I saw these from the build of the compute docker image in the CI
(compute-node-image-v15):

    pagestore_smgr.c: In function 'neon_prefetch':
    pagestore_smgr.c:1654:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     1654 |  BufferTag tag = (BufferTag) {
          |  ^~~~~~~~~
    walproposer.c:197:1: warning: no previous prototype for 'WalProposerSync' [-Wmissing-prototypes]
      197 | WalProposerSync(int argc, char *argv[])
          | ^~~~~~~~~~~~~~~
    libpagestore.c: In function 'pageserver_connect':
    libpagestore.c💯9: warning: variable 'wc' set but not used [-Wunused-but-set-variable]
      100 |   int   wc;
          |         ^~
    libpagestore.c: In function 'call_PQgetCopyData':
    libpagestore.c:144:9: warning: variable 'wc' set but not used [-Wunused-but-set-variable]
      144 |   int   wc;
          |         ^~

Harmless warnings, but let's be tidy.

In the passing, I added some "extern" to a few function declarations
that were missing them, and marked WalProposerSync as "static". Those
changes are also purely cosmetic.
2022-11-19 14:11:04 +02:00
Heikki Linnakangas
a50a7e8ac0 Try to silence test_gc_cutoff flakiness.
Commit d013a2b227 changed the test, so that it fails if pgbench runs
to completion without triggering the failpoint. That has now happened
several times in the CI. That's not expected, so this needs some
investigation, but as a quick fix just make the pgbench runs longer so
that we're closer to the situation before commit d013a2b227.

See https://github.com/neondatabase/neon/issues/2856
2022-11-19 01:19:09 +02:00
Egor Suvorov
e28eda7939 sourcetree/docs: mention hakari generate (#2864) 2022-11-18 22:30:41 +00:00
Heikki Linnakangas
a2fbd93e91 Fix python isort codestyle check 2022-11-18 18:29:27 +02:00
Christian Schwarz
f564dff0e3 make test_tenant_detach_smoke fail reproducibly
Add failpoint that triggers the race condition.
Skip test until we'll land the fix from
https://github.com/neondatabase/neon/pull/2851
with
https://github.com/neondatabase/neon/pull/2785
2022-11-18 17:15:34 +01:00
Christian Schwarz
d783889a1f timeline: explicit tracking of flush loop state: NotStarted, Running, Exited
This allows us to error out in the case where we request flush but the
flush loop is not running.
Before, we would only track whether it was started, but not when it
exited.
Better to use an enum with 3 states than a 2-state bool because then
the error message can answer the question whether we ever started
the flush loop or not.
2022-11-18 17:15:34 +01:00
bojanserafimov
2655bdbb2e Add remote seqscans test (#2840) 2022-11-18 09:05:13 -05:00
Konstantin Knizhnik
b9152f1ef4 Correctly terminate prefetch in case of pageserver restart (#2850)
refer #2819

This patch requires deep knowledge of prefetch internals.
So @MMeent  please review it or suggest better solution.
2022-11-18 15:04:58 +02:00
Christian Schwarz
66f8f686a0 run manual gc in a task_mgr task to prevent race with detach
This fixes flaky test_tenant_detach_smoke.
2022-11-18 12:15:14 +02:00
Christian Schwarz
919f2b261a make test_tenant_detach_smoke fail reproducibly
Add failpoint that triggers the race condition.
2022-11-18 12:15:14 +02:00
Christian Schwarz
9d273c840a timeline: explicit tracking of flush loop state: NotStarted, Running, Exited
This allows us to error out in the case where we request flush but the
flush loop is not running.
Before, we would only track whether it was started, but not when it
exited.
Better to use an enum with 3 states than a 2-state bool because then
the error message can answer the question whether we ever started
the flush loop or not.
2022-11-18 12:15:14 +02:00
Heikki Linnakangas
328ec1ce24 Print a more full error message, with stack trace, on GC failure.
In a CI run, I got a test failure because of this error in the log,
from the test_get_tenant_size_with_multiple_branches test:

    ERROR gc_loop{tenant_id=f1630516d4b526139836ced93be0c878}: Gc failed, retrying in 2s: No such file or directory (os error 2)

There are known race conditions between GC and timeline deletion,
which surely caused that error. But if we didn't know the cause, it
would be pretty hard to debug without a stack trace.
2022-11-18 11:44:00 +02:00
Heikki Linnakangas
dcb79ef08f Silence yet another test failure from race condition between GC and delete.
Another similar case to commit 9ae4da4f31.
2022-11-18 10:18:15 +02:00
Konstantin Knizhnik
fd99e0fbc4 Build pg_prewrm extension (#2794) 2022-11-18 09:10:32 +02:00
Dmitry Rodionov
6600e1f896 initialize upload queue before starting download operations 2022-11-17 23:53:31 +02:00
Dmitry Rodionov
348369414b fix incorrect metadata update
Previously in some cases local metadata was confused with remote one and
there was a check, that we write locally only if remote metadata has
greater disk_consistent_lsn. So because they were equal we didnt write
anything. For attach scenario this ended up in not writing metadata at
all.

Rearrange code so we decide on proper metadata value earlier on and
initialize timeline with correct one without need to update it late in
the initialization process in .reconsile_with_remote
2022-11-17 23:53:31 +02:00
Christian Schwarz
3890acaf7f stop using Option for UploadQueueInitialized::{latest_metadata,last_uploaded_consistent_lsn} 2022-11-17 21:34:16 +02:00
Christian Schwarz
f537a7a873 add explainer comments regarding UploadQueueInitialized::{latest_files,latest_metadata,last_uploaded_consistent_lsn} 2022-11-17 21:34:16 +02:00
Christian Schwarz
71bc45a21b storage_sync: track upload queue initialization state using enum & fix last_uploaded_consistent_lsn initialization for empty remote storage
As pointed out in b8488e70a9 (r1024319620)
the following is wrong for the case where the remote storage is empty:

    metadata = whatever the local-ONLY metadata is
    ...
    upload_queue.latest_metadata = Some(metadata.clone());
    upload_queue.last_uploaded_consistent_lsn = Some(metadata.disk_consistent_lsn());

The reason why it's wrong is that we return last_uploaded_consistent_lsn
to safekeepers. So, we'd be returning an Lsn that is not yet uploaded to
S3.
2022-11-17 21:34:16 +02:00
Kirill Bulatov
60ac227196 Use modern flex and bison in macOS compilations (#2847) 2022-11-17 14:48:21 +00:00
MMeent
4a60051b0d Add codeowners section for /vendor/ (#2849)
After this, consent of @neondatabase/compute is required to update the
vendored PostgreSQL versions.
2022-11-17 14:31:34 +00:00
Heikki Linnakangas
24d3ed0952 Ignore another ERROR that's expected in test.
Got a test failure in CI because of this.
2022-11-17 12:42:56 +02:00
Christian Schwarz
decef74503 don't start background jobs if tenant has not timelines
Before this change, test_pageserver_with_empty_tenants was failing at:

  assert loaded_tenant["state"] == {
        "Active": {"background_jobs_running": False}
    }, "Tenant {tenant_with_empty_timelines_dir} with empty timelines dir should be active and ready for timeline creation"

because background_jobs_running was True instead of False.

Personally I think we should simply always start the background loops
and not bother, but let's punt this until after we've merged this PR.
2022-11-17 11:22:02 +02:00
Christian Schwarz
8aed805933 refactor: remove misleading log messages and redundant asserts from test_pageserver_with_empty_tenants
At least since commit 'Return broken tenants due to non existing timelines dir (#2552) (#2575)'
some of these messages are wrong.
2022-11-17 11:21:49 +02:00
Alexander Bayandin
0a87d71294 test_runner: make proxy mgmt port mandatory (#2839)
Make `mgmt` port mandatory argument for `NeonProxy` (and set it for
`static_proxy`) to avoid port collision when tests run in parallel.
2022-11-16 17:57:48 +00:00
Heikki Linnakangas
150bddb929 Clean up process start/stop handling
* Poll more frequently when waiting for process start/stop. This
  speeds up startup and shutdown in tests. We did this already in
  commit 52ce1c9d53, which reduced the interval to 100 ms, but it was
  inadvertently increased back to 500 ms in commit d42700280f. Reduce
  it to 100 ms again, for both start and stop operations.

* Harmonize the start and stop loops, printing the dots and notices
  the same way in both. I considered extracting the logic to a
  separate retry-function that takes a closure as argument that does
  the polling, but as long as we only have two copies, the code
  duplication isn't that bad.

* Remove newline after "Starting pageserver" and "Starting etcd"
  messages, so that the progress-indicator dots that are printed once
  a second are printed on the same line. Before:

    Starting pageserver at '127.0.0.1:64000' in '.neon'
    ...
    pageserver started, pid: 2538937

  After:

    Starting pageserver at '127.0.0.1:64000' in '.neon'...
    pageserver started, pid: 2538937

  The "Starting safekeeper" message already got this right.

* Update example output in README.md to match
2022-11-16 19:51:37 +02:00
Dmitry Rodionov
b8488e70a9 run clippy/fmt 2022-11-16 17:25:04 +02:00