
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-14 08:52:56 +00:00

Author	SHA1	Message	Date
Christian Schwarz	5a55dce282	ensure layers_removed > 0	2022-11-22 12:13:36 -05:00
Heikki Linnakangas	e6de1b0e8c	Increase the timeout in test. Downloading all files can take more than 5 seconds, if there are even small network glitches or similar.	2022-11-22 17:22:33 +02:00
Heikki Linnakangas	58be279be1	Fix python flake8 errors	2022-11-22 16:40:51 +02:00
Heikki Linnakangas	7d46c7c118	Fix python formatting	2022-11-22 15:16:31 +02:00
Heikki Linnakangas	49f2eac934	Add missing import Was causing the test to fail. It's annoying that you don't get an error from that earlier..	2022-11-22 15:01:03 +02:00
Heikki Linnakangas	5c8387aff1	Fix setting failpoints in test_remote_storage_backup_and_restore The checkpoints in the test were numbered 1 and 2, but the code only tried to set the failpoints for checkpoint 0. So they were never set.	2022-11-22 14:52:11 +02:00
Christian Schwarz	6b7cbec9b3	WIP: add test for storage_sync upload retries	2022-11-22 05:49:46 -05:00
Heikki Linnakangas	e6557f4f91	Silence test failures, where we operate on a tenant before it's loaded Saw a failure like this, from 'test_tenants_attached_after_download' and 'test_tenant_redownloads_truncated_file_on_startup': > test_runner/fixtures/neon_fixtures.py:1064: in verbose_error > res.raise_for_status() > /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/requests/models.py:1021: in raise_for_status > raise HTTPError(http_error_msg, response=self) > E requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:18150/v1/tenant/2334c9c113a82b5dd1651a0a23c53448/timeline > > The above exception was the direct cause of the following exception: > test_runner/regress/test_tenants_with_remote_storage.py:185: in test_tenants_attached_after_download > restored_timelines = client.timeline_list(tenant_id) > test_runner/fixtures/neon_fixtures.py:1148: in timeline_list > self.verbose_error(res) > test_runner/fixtures/neon_fixtures.py:1070: in verbose_error > raise PageserverApiException(msg) from e > E fixtures.neon_fixtures.PageserverApiException: NotFound: Tenant 2334c9c113a82b5dd1651a0a23c53448 is not active. Current state: Loading These tests start the pageserver, wait until assert_no_in_progress_downloads_for_tenant says that has_downloads_in_progress is false, and then call timeline_list on the tenant. But has_downloads_in_progress was only returned as true when the tenant was being attached, not when it was being loaded at pageserver startup. Change tenant_status API endpoint (/v1/tenant/:tenant_id) so that it returns has_downloads_in_progress=true also for tenants that are still in Loading state.	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	e8db20eb26	On connection from compute, wait for tenant to become Active. If a connection from compute arrives while a tenant is still in Loading state, wait for it to become Active instead of throwing an error to the client. This should fix the errors from test_gc_cutoff test that repeatedly restarts the pageserver and immediately tries to connect to it.	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	7552e2d25f	Enable passing FAILPOINTS at startup. - Pass through FAILPOINTS environment variable to the pageserver in "neon_local pageserver start" command - On startup, list any failpoints that were set with FAILPOINTS to the log - Add optional "extra_env_vars" argument to the NeonPageserver.start() function in the python fixture, so that you can pass FAILPOINTS None of the tests use this functionality yet; that comes in a separate commit. closes https://github.com/neondatabase/neon/pull/2865	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	dfb4160403	Don't remove timelines directory while pageserver is running. The test removes all the timelines from local disk, to test how pageserver startup works when it's missing. But if the pageserver hasn't finished loading the timeline yet, you can get an error: > 2022-11-20T01:30:41.053207Z INFO load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: no index file was found on the remote > 2022-11-20T01:30:41.054045Z ERROR load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: Failed to initialize timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807: Failed to load layermap for timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807 > > Caused by: > No such file or directory (os error 2) I saw this in CI, here: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2785/debug/3505805425/index.html#suites/ec4311502db344eee91f1354e9dc839b/725c7d0ecec1ec4d/ And was able to reproduce it with this: > --- a/pageserver/src/tenant.rs > +++ b/pageserver/src/tenant.rs > @@ -946,6 +946,8 @@ impl Tenant { > None => None, > }; > > + tokio::time::sleep(std::time::Duration::from_secs(2)).await; > + > self.setup_timeline( > timeline_id, > remote_client, Even on 'main', it's pretty sketchy to remove the directory while the pageserver is still running, but it didn't lead to an error because the pageserver finished loading the local layer maps before starting up. Now that that's spawned into background, the directory might get removed before the loading finishes.	2022-11-20 21:52:22 +02:00
Heikki Linnakangas	f9bdc030e8	Fix python formatting failure	2022-11-20 02:37:15 +02:00
Heikki Linnakangas	e4c9b83a39	Merge remote-tracking branch 'origin/main' into HEAD	2022-11-20 02:31:21 +02:00
Alexander Bayandin	cb9b26776e	Fix test_seqscans on remote cluster (#2869) A remote project is reused between tests, so we need to ensure that we don't have a table with the same name already created.	2022-11-19 23:39:42 +00:00
Heikki Linnakangas	a78c16328e	Silence test_remote_storage failure, caused by error message change	2022-11-20 00:13:43 +02:00
Heikki Linnakangas	684329d4d2	Another attempt at silencing test_gc_cutoff failures. Increase the pgbench runtimes even further. The theory is that when there are many other tests running at the same time, one pgbench run could take a long time until it generates enough layers for GC to kick in.	2022-11-19 19:28:56 +02:00
Heikki Linnakangas	ed40a045c0	Add more logging to track down test_gc_cutoff failure. see https://github.com/neondatabase/neon/issues/2856	2022-11-19 14:12:21 +02:00
Heikki Linnakangas	a50a7e8ac0	Try to silence test_gc_cutoff flakiness. Commit `d013a2b227` changed the test, so that it fails if pgbench runs to completion without triggering the failpoint. That has now happened several times in the CI. That's not expected, so this needs some investigation, but as a quick fix just make the pgbench runs longer so that we're closer to the situation before commit `d013a2b227`. See https://github.com/neondatabase/neon/issues/2856	2022-11-19 01:19:09 +02:00
Heikki Linnakangas	a2fbd93e91	Fix python isort codestyle check	2022-11-18 18:29:27 +02:00
Christian Schwarz	f564dff0e3	make test_tenant_detach_smoke fail reproducibly Add failpoint that triggers the race condition. Skip test until we'll land the fix from https://github.com/neondatabase/neon/pull/2851 with https://github.com/neondatabase/neon/pull/2785	2022-11-18 17:15:34 +01:00
bojanserafimov	2655bdbb2e	Add remote seqscans test (#2840)	2022-11-18 09:05:13 -05:00
Christian Schwarz	66f8f686a0	run manual gc in a task_mgr task to prevent race with detach This fixes flaky test_tenant_detach_smoke.	2022-11-18 12:15:14 +02:00
Christian Schwarz	919f2b261a	make test_tenant_detach_smoke fail reproducibly Add failpoint that triggers the race condition.	2022-11-18 12:15:14 +02:00
Heikki Linnakangas	dcb79ef08f	Silence yet another test failure from race condition between GC and delete. Another similar case to commit `9ae4da4f31`.	2022-11-18 10:18:15 +02:00
Heikki Linnakangas	24d3ed0952	Ignore another ERROR that's expected in test. Got a test failure in CI because of this.	2022-11-17 12:42:56 +02:00
Christian Schwarz	decef74503	don't start background jobs if tenant has no timelines Before this change, test_pageserver_with_empty_tenants was failing at: assert loaded_tenant["state"] == { "Active": {"background_jobs_running": False} }, "Tenant {tenant_with_empty_timelines_dir} with empty timelines dir should be active and ready for timeline creation" because background_jobs_running was True instead of False. Personally I think we should simply always start the background loops and not bother, but let's punt this until after we've merged this PR.	2022-11-17 11:22:02 +02:00
Christian Schwarz	8aed805933	refactor: remove misleading log messages and redundant asserts from test_pageserver_with_empty_tenants At least since commit 'Return broken tenants due to non existing timelines dir (#2552) (#2575)' some of these messages are wrong.	2022-11-17 11:21:49 +02:00
Alexander Bayandin	0a87d71294	test_runner: make proxy mgmt port mandatory (#2839) Make `mgmt` port mandatory argument for `NeonProxy` (and set it for `static_proxy`) to avoid port collision when tests run in parallel.	2022-11-16 17:57:48 +00:00
Alexander Bayandin	2b728bc69e	test_forward_compatibility: fix path to pg_distrib_dir (#2826) Set correct `pg_distrib_dir` in `pageserver.toml` and in neon_local `config`. `test_forward_compatibility` shows flakiness during `neon_local pg start`, so hopefully, the patch will help. ``` 2022-11-15 16:07:34.091 GMT [13338] LOG: starting with zenith basebackup at LSN 0/A6A9310, prev 0/0 2022-11-15 16:07:34.091 GMT [13338] FATAL: cannot start in read-write mode from this base backup 2022-11-15 16:07:34.091 GMT [13337] LOG: startup process (PID 13338) exited with exit code 1 ```	2022-11-16 15:14:36 +00:00
Heikki Linnakangas	9ae4da4f31	Silence test failure caused by race condition between GC and detach. Thanks to the race condition, GC sometimes fails with "no such file or directory" error, if the tenant is detached concurrently. That's a known issue, but it didn't cause test failures until we started to check for unexpected ERRORs in the log in commit `46d30bf054`. We should fix the race condition, of course, but until we do, let's silence the failures.	2022-11-16 15:50:49 +02:00
Christian Schwarz	bb6dbd2f43	crash-safe and resumable tenant attach This change introduces a marker file $repo/tenants/$tenant_id/attaching that is present while a tenant is in Attaching state. When pageserver restarts, we use it to resume the tenant attach operation. Before this change, a crash during tenant attach would result in one of the following: 1. crash upon restart due to missing metadata file (IIRC) 2. "successful" loading of the tenant with a subset of timelines	2022-11-16 14:57:26 +02:00
Dmitry Rodionov	c4c4558736	adjust allowed errors for test_broken_timeline	2022-11-16 14:57:26 +02:00
Heikki Linnakangas	bfdc09cf4a	Improve comments and checks in test_broken_timeline.py	2022-11-16 14:57:21 +02:00
Dmitry Rodionov	795c3ca131	Port per-tenant upload queue and startup changes from #2595 This is a part of https://github.com/neondatabase/neon/pull/2595. It takes out the switch to a per-tenant upload queue and the changes to the pageserver startup sequence, because these two are highly interleaved with each other. I'm still not happy with the size of the diff, but splitting it even more will probably consume even more time. Ideally we should do it, but this patch is already a step forward and should be easier to get in, though still quite difficult, mainly because of the size and the fixes for existing concerns, which will extend the diff even further. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-11-16 14:42:15 +02:00
Heikki Linnakangas	d013a2b227	Make test_gc_cutoff test more robust. Previously, if the failpoint was not reached for some reason, the test would only fail because it would reach the 5 minute timeout we have on all python tests. That's very subtle. Make it fail explicitly, if the failpoint is not hit on each iteration of the loop. Extracted from a larger PR, see https://github.com/neondatabase/neon/pull/2785/files#r1022765794	2022-11-16 13:24:02 +02:00
Heikki Linnakangas	3f93c6c6f0	Improve checks for broken tenants in test_broken_timeline.py - Refactor the code a little bit, removing the silly for-loop over a single element. - Make it more clear in log messages that the errors are expected - Check for a more precise error message "Failed to load delta layer" instead of just "extracting base backup failed".	2022-11-16 13:16:00 +02:00
Heikki Linnakangas	46d30bf054	Check for errors in pageserver log after each test. If there are any unexpected ERRORs or WARNs in pageserver.log after test finishes, fail the test. This requires whitelisting the errors that are expected in each test, and there are also a few common errors that are printed by most tests, which are whitelisted in the fixture itself. With this, we don't need the special abort() call in testing mode, when compaction or GC fails. Those failures will print ERRORs to the logs, which will be picked up by this new mechanism. A bunch of errors are currently whitelisted that we probably shouldn't be emitting in the first place, but fixing those is out of scope for this commit, so I just left FIXME comments on them.	2022-11-15 18:47:28 +02:00
Heikki Linnakangas	d0105cea1f	Avoid errors when removing a timeline that's still active	2022-11-15 18:47:28 +02:00
Heikki Linnakangas	223834a420	Fix confusion between Postgres and pageserver connection string in test. We passed the pageserver's libpq endpoint URL as the 'compute_ctl --connstr' argument, but that was bogus: the --connstr URL is supposed to be the URL to the Postgres instance that compute_ctl launches and monitors, not to the pageserver. compute_ctl does need the pageserver URL too, but it is read from the cluster spec JSON, not --connstr. That was pretty confusing, as you got a lot of "unknown command" errors in the pageserver log, when compute_tools tries to run regular SQL commands on the pageserver. The test still passed, however, as it doesn't require the SQL commands to succeed. But to make this less confusing, use an invalid hostname instead, so that the queries will fail to even connect.	2022-11-15 18:47:28 +02:00
andres	c11cbf0f5c	fix test_compare_child_and_root_pgbench_perf to do a fair comparison	2022-11-13 21:03:54 +02:00
bojanserafimov	7fd88fab59	Trace read requests (#2762)	2022-11-10 16:43:04 -05:00
bojanserafimov	7edc098c40	Add perf test instructions (#2777)	2022-11-10 16:05:57 -05:00
Alexander Bayandin	d5b7832c21	Fix test_wal_backpressure tests (#2792) Fix expected return type for `fetchone`: ``` AssertionError: assert False + where False = isinstance((Decimal('56048'), '55 kB', '0/1CF52D8', '0/1CE77E8'), list) ```	2022-11-10 16:15:04 +00:00
Christian Schwarz	8654e95fae	walredo: fix zombie processes ([postgres] <defunct>) This change wraps the std::process::Child that we spawn for WAL redo into a type that ensures that we try to SIGKILL + waitpid() on it. If there is no explicit call to kill_and_wait(), the Drop implementation will spawn a task that does it in the BACKGROUND_RUNTIME. That's an ugly hack but I think it's better than doing kill+wait synchronously from Drop, since I think the general assumption in the Rust ecosystem is that Drop doesn't block. Especially since the drop sites can be _any_ place that drops the last Arc<PostgresRedoManager>, e.g., compaction or GC. The benefit of having the new type over just adding a Drop impl to PostgresRedoProcess is that we can construct it earlier than the full PostgresRedoProcess in PostgresRedoProcess::launch(). That allows us to correctly kill+wait the child if there is an error in PostgresRedoProcess::launch() after spawning it. I also took a stab at a regression test. I manually verified that it fails before the fix to walredo.rs. fixes https://github.com/neondatabase/neon/issues/2761 closes https://github.com/neondatabase/neon/pull/2776	2022-11-10 12:50:50 +01:00
Vadim Kharitonov	f720dd735e	Stricter mypy linters for `test_runner/fixtures/*`	2022-11-10 12:47:27 +01:00
Alexander Bayandin	c4f9f1dc6d	Add data format forward compatibility tests (#2766) Add `test_forward_compatibility`, which checks if it's going to be possible to roll back a release to the previous version. The test uses artifacts (Neon & Postgres binaries) from the previous release to start Neon on the repo created by the current version. It performs exactly the same checks as `test_backward_compatibility` does. Single `ALLOW_BREAKING_CHANGES` env var got replaced by `ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` & `ALLOW_FORWARD_COMPATIBILITY_BREAKAGE` and can be set by `backward compatibility breakage` and `forward compatibility breakage` labels respectively.	2022-11-10 09:06:34 +00:00
Alexander Bayandin	c1a76eb0e5	test_runner: replace global variables with fixtures (#2754) This PR replaces the following global variables in the test framework with fixtures to make tests more configurable. I mainly need this for the forward compatibility tests (draft in https://github.com/neondatabase/neon/pull/2766). ``` base_dir neon_binpath pg_distrib_dir top_output_dir default_pg_version (this one got replaced with a fixture named pg_version) ``` Also, this PR adds more `Path` type where the code implies it.	2022-11-07 18:39:51 +00:00
Joonas Koivunen	548d472b12	fix: logical size query at before initdb_lsn (#2755 ) With more realistic selection of gc_horizon in tests there is an immediate failure with trying to query logical size with lsn < initdb_lsn. Fixes that, adds illustration gathered from clarity of explaining this tenant size calculation to more people. Cc: #2748, #2599.	2022-11-07 12:03:57 +02:00
Joonas Koivunen	cf68963b18	Add initial tenant sizing model and a http route to query it (#2714) Tenant size information is gathered by using existing parts of `Tenant::gc_iteration` which are now separated as `Tenant::refresh_gc_info`. `Tenant::refresh_gc_info` collects branch points, and invokes `Timeline::update_gc_info`; nothing was supposed to be changed there. The gathered branch points (through Timeline's `GcInfo::retain_lsns`), `GcInfo::horizon_cutoff`, and `GcInfo::pitr_cutoff` are used to build up a Vec of updates fed into the `libs/tenant_size_model` to calculate the history size. The gathered information is now exposed using `GET /v1/tenant/{tenant_id}/size`, which will respond with the actual calculated size. Initially the idea was to have this delivered as tenant background task and exported via metric, but it might be too computationally expensive to run it periodically as we don't yet know if the returned values are any good. Adds one new metric: - pageserver_storage_operations_seconds with label `logical_size` - separating from original `init_logical_size` Adds a pageserver wide configuration variable: - `concurrent_tenant_size_logical_size_queries` with default 1 This leaves a lot of TODO's, tracked on issue #2748.	2022-11-03 12:39:19 +00:00
Joonas Koivunen	5112142997	fix: use different port for temporary postgres (#2743) `test_tenant_relocation` ends up starting a temporary postgres instance with a fixed port. the change makes the port configurable at scripts/export_import_between_pageservers.py and uses that in test_tenant_relocation.	2022-11-02 18:37:48 +00:00
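
Commit `46d30bf054` above describes failing a test when pageserver.log contains ERROR or WARN lines that are not on a per-test allowlist. The following is only a rough sketch of that idea, not the code from the commit: the function name `find_unexpected_log_lines`, the regex-based allowlist, and the sample log lines are all assumptions made for illustration.

```python
import re
from typing import Iterable, List

# Hypothetical helper, not taken from the neon test fixtures.
# Matches a log level token surrounded by whitespace, as in
# "2022-11-20T01:30:41.054045Z ERROR load{...}: ...".
LEVEL_RE = re.compile(r"\s(ERROR|WARN)\s")


def find_unexpected_log_lines(
    log_lines: Iterable[str], allowed_patterns: Iterable[str]
) -> List[str]:
    """Return ERROR/WARN lines that match none of the allowed regexes."""
    allowed = [re.compile(p) for p in allowed_patterns]
    unexpected = []
    for line in log_lines:
        if not LEVEL_RE.search(line):
            continue  # INFO and other levels are ignored
        if any(rx.search(line) for rx in allowed):
            continue  # explicitly expected by this test
        unexpected.append(line.rstrip("\n"))
    return unexpected


if __name__ == "__main__":
    # Made-up sample lines in the general shape of the log excerpts above.
    sample = [
        "2022-11-20T01:30:41.053207Z INFO load: no index file was found on the remote",
        "2022-11-20T01:30:41.054045Z ERROR load: Failed to initialize timeline",
        "2022-11-20T01:30:42.000000Z WARN gc: No such file or directory (os error 2)",
    ]
    leftovers = find_unexpected_log_lines(sample, [r".*No such file or directory.*"])
    # A test harness would fail the test if this list is non-empty.
    for line in leftovers:
        print("unexpected:", line)
```

A test teardown could call such a helper on the log file's lines and assert that the result is empty, so that each test has to list the errors it expects up front.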


512 Commits