rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-07 06:00:38 +00:00

Author	SHA1	Message	Date
Christian Schwarz	dd2a77c2ef	Abort uploads if the tenant/timeline is requested to shut down - Introduce another UploadQueue::Stopped enum variant to indicate the state where the UploadQueue is shut down. - In perform_upload_task, wait concurrently for tenant/timeline shutdown. If we are requested to shut down, the first in-progress task that notices the shutdown request transitions the queue from UploadQueue::Initialized to UploadQueue::Stopped state. This involves dropping all the queued ops that are not yet in progress, which conveniently unblocks wait_completion() calls that are waiting for their barrier to be executed. They will receive an Err(), and do something sensible. Right now, wait_completion() is only used by tests, but I suspect that we should be using it wherever we delete layer files, e.g., GC and compaction, as explained in the storage_sync.rs block comment section "Consistency". This change also fixes test_timeline_deletion_with_files_stuck_in_upload_queue which I added in the previous commit. Before, timeline delete would wait until all in-progress tasks and queued tasks were done. If, like in the test, a task was stuck due to upload error, timeline deletion would wait forever. Now it gets an error. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-11-25 16:32:31 +02:00
Christian Schwarz	ef95637c65	add test case to cover race between timeline delete and stuck upload	2022-11-25 16:32:31 +02:00
Christian Schwarz	206b5d2ada	remove obsolete TODO (see github PR discussion https://github.com/neondatabase/neon/pull/2785#discussion_r1030547179 )	2022-11-24 13:20:38 -05:00
Christian Schwarz	77fea61fcc	address FIXME in tenant_attach regarding tenant config	2022-11-24 13:20:38 -05:00
Christian Schwarz	e3f4c0e4ac	remove FIXME addressed by 'On tenant load, start WAL receivers only after all timelines have been loaded.'	2022-11-24 13:20:38 -05:00
Christian Schwarz	2302ecda04	don't run background loops in unit tests	2022-11-24 13:20:38 -05:00
Heikki Linnakangas	5257bbe2b9	Remove information-free ".context" messages We capture stack traces of all errors, so these don't really add any value. As a thought experiment, if we had to add a line like this, with the function name in it, every time we use the ?-operator, we're doing something wrong. test_tenants.py::test_tenant_creation_fails creates a failpoint and checks that the error returned by the pageserver contains the failpoint name, and that was failing because it wasn't on the first line of the error. We should probably improve our error-scraping logic in the tests to not rely so heavily on string matching, but that's a different topic. FWIW, these are also pretty unlikely to fail in practice.	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	6b61ed5fab	Fix clippy warning and some typos	2022-11-24 19:20:17 +01:00
Christian Schwarz	f28bf70596	tenant creation: re-use load_local_tenant()	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	dc9c33139b	On tenant load, start WAL receivers only after all timelines have been loaded. And similarly on attach. This way, if the tenant load/attach fails halfway through, we don't have any leftover WAL receivers still running on the broken tenant.	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	0260ee23b9	Start background loops on create_tenant correctly. This was caught by the test_gc_cutoff test.	2022-11-24 19:20:17 +01:00
Christian Schwarz	c2f5d011c7	fix python linter complaints	2022-11-24 19:20:17 +01:00
Christian Schwarz	1022b4b98f	use utils::failpoint_sleep_millis_async!("attach-before-activate") The detach_while_attaching test still passes, presumably because the two requests execute on different OS threads.	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	791eebefe2	Silence clippy warning	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	50b686c3e4	rustfmt	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	39b10696e9	Fix unit tests. `activate` is now more strict and errors out if the tenant is already Active.	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	264b0ada9f	Handle concurrent detach and attach more gracefully. If tenant detach is requested while the tenant is still in Attaching state, we set the state to Paused, but when the attach completed, it changed it to Active again, and worse, it started the background jobs. To fix, rewrite the set_state() function so that when you activate a tenant that is already in Paused state, it stays in Paused state and we don't start the background loops.	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	78338f7b94	Remove background_jobs_enabled, move code from tenant_mgr.rs to tenant.rs	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	0d533ce840	Test detach while attach is still in progress	2022-11-24 19:20:17 +01:00
Christian Schwarz	978f1879b9	fix typo in storage_sync.rs module comment Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2022-11-24 19:17:41 +01:00
Christian Schwarz	c863d679f8	storage_sync: update module doc comment to reflect our changes	2022-11-23 11:46:03 -05:00
Heikki Linnakangas	28e9eb6539	Add one more compaction to test, to ensure GC finds old layers to remove	2022-11-22 20:00:41 +02:00
Christian Schwarz	d07a4d02bb	test_remote_storage_upload_queue_retries: actually generate Delete ops	2022-11-22 12:13:38 -05:00
Christian Schwarz	c61731a31f	metric pageserver_remote_upload_queue_unfinished_tasks: add labels for file & op kind, and check for them in the test Fails right now because turns out we don't actually generate layer removal tasks with the current test code. That will be the next commit.	2022-11-22 12:13:38 -05:00
Christian Schwarz	5a55dce282	ensure layers_removed > 0	2022-11-22 12:13:36 -05:00
Heikki Linnakangas	e6de1b0e8c	Increase the timeout in test. Downloading all files can take more than 5 seconds, if there are even small network glitches or similar.	2022-11-22 17:22:33 +02:00
Heikki Linnakangas	58be279be1	Fix python flake8 errors	2022-11-22 16:40:51 +02:00
Heikki Linnakangas	d1b92e976a	When a new layer file is created in compaction, also upload it. I can't believe this was missing..	2022-11-22 16:35:31 +02:00
Heikki Linnakangas	7d46c7c118	Fix python formatting	2022-11-22 15:16:31 +02:00
Heikki Linnakangas	d6c1a9aa18	Fix formatting	2022-11-22 15:02:09 +02:00
Heikki Linnakangas	49f2eac934	Add missing import Was causing the test to fail. It's annoying that you don't get an error from that earlier..	2022-11-22 15:01:03 +02:00
Heikki Linnakangas	5c8387aff1	Fix setting failpoints in test_remote_storage_backup_and_restore The checkpoints in the test were numbered 1 and 2, but the code only tried to set the failpoints for checkpoint 0. So they were never set.	2022-11-22 14:52:11 +02:00
Heikki Linnakangas	3c4680b718	Turn upload_queue_items_metric into a regular function	2022-11-22 14:45:22 +02:00
Christian Schwarz	6b7cbec9b3	WIP: add test for storage_sync upload retries	2022-11-22 05:49:46 -05:00
Christian Schwarz	6de9a33c05	metric REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS	2022-11-22 05:49:46 -05:00
Christian Schwarz	77f883c95c	prometheus metrics for storage_sync2 Extracted from https://github.com/neondatabase/neon/pull/2595 Use pub(crate) to highlight unused metrics.	2022-11-22 05:49:25 -05:00
Heikki Linnakangas	e6557f4f91	Silence test failures, where we operate on a tenant before it's loaded Saw a failure like this, from 'test_tenants_attached_after_download' and 'test_tenant_redownloads_truncated_file_on_startup': > test_runner/fixtures/neon_fixtures.py:1064: in verbose_error > res.raise_for_status() > /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/requests/models.py:1021: in raise_for_status > raise HTTPError(http_error_msg, response=self) > E requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:18150/v1/tenant/2334c9c113a82b5dd1651a0a23c53448/timeline > > The above exception was the direct cause of the following exception: > test_runner/regress/test_tenants_with_remote_storage.py:185: in test_tenants_attached_after_download > restored_timelines = client.timeline_list(tenant_id) > test_runner/fixtures/neon_fixtures.py:1148: in timeline_list > self.verbose_error(res) > test_runner/fixtures/neon_fixtures.py:1070: in verbose_error > raise PageserverApiException(msg) from e > E fixtures.neon_fixtures.PageserverApiException: NotFound: Tenant 2334c9c113a82b5dd1651a0a23c53448 is not active. Current state: Loading These tests start the pageserver, wait until assert_no_in_progress_downloads_for_tenant says that has_downloads_in_progress is false, and then call timeline_list on the tenant. But has_downloads_in_progress was only returned as true when the tenant was being attached, not when it was being loaded at pageserver startup. Change tenant_status API endpoint (/v1/tenant/:tenant_id) so that it returns has_downloads_in_progress=true also for tenants that are still in Loading state.	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	e8db20eb26	On connection from compute, wait for tenant to become Active. If a connection from compute arrives while a tenant is still in Loading state, wait for it to become Active instead of throwing an error to the client. This should fix the errors from test_gc_cutoff test that repeatedly restarts the pageserver and immediately tries to connect to it.	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	7552e2d25f	Enable passing FAILPOINTS at startup. - Pass through FAILPOINTS environment variable to the pageserver in "neon_local pageserver start" command - On startup, list any failpoints that were set with FAILPOINTS to the log - Add optional "extra_env_vars" argument to the NeonPageserver.start() function in the python fixture, so that you can pass FAILPOINTS None of the tests use this functionality yet; that comes in a separate commit. closes https://github.com/neondatabase/neon/pull/2865	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	dfb4160403	Don't remove timelines directory while pageserver is running. The test removes all the timelines from local disk, to test how pageserver startup works when it's missing. But if the pageserver hasn't finished loading the timeline yet, you can get an error: > 2022-11-20T01:30:41.053207Z INFO load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: no index file was found on the remote > 2022-11-20T01:30:41.054045Z ERROR load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: Failed to initialize timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807: Failed to load layermap for timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807 > > Caused by: > No such file or directory (os error 2) I saw this in CI, here: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2785/debug/3505805425/index.html#suites/ec4311502db344eee91f1354e9dc839b/725c7d0ecec1ec4d/ And was able to reproduce it with this: > --- a/pageserver/src/tenant.rs > +++ b/pageserver/src/tenant.rs > @@ -946,6 +946,8 @@ impl Tenant { > None => None, > }; > > + tokio::time::sleep(std::time::Duration::from_secs(2)).await; > + > self.setup_timeline( > timeline_id, > remote_client, Even on 'main', it's pretty sketchy to remove the directory while the pageserver is still running, but it didn't lead to an error because the pageserver finished loading the local layer maps before starting up. Now that that's spawned into background, the directory might get removed before the loading finishes.	2022-11-20 21:52:22 +02:00
Heikki Linnakangas	f9bdc030e8	Fix python formatting failure	2022-11-20 02:37:15 +02:00
Heikki Linnakangas	e4c9b83a39	Merge remote-tracking branch 'origin/main' into HEAD	2022-11-20 02:31:21 +02:00
Alexander Bayandin	cb9b26776e	Fix test_seqscans on remote cluster (#2869) A remote project is reused between tests, so we need to ensure that we don't have a table with the same name already created.	2022-11-19 23:39:42 +00:00
Heikki Linnakangas	a78c16328e	Silence test_remote_storage failure, caused by error message change	2022-11-20 00:13:43 +02:00
Heikki Linnakangas	684329d4d2	Another attempt at silencing test_gc_cutoff failures. Increase the pgbench runtimes even further. The theory is that when there are many other tests running at the same time, one pgbench run could take a long time until it generates enough layers for GC to kick in.	2022-11-19 19:28:56 +02:00
Heikki Linnakangas	eed99b7251	Silence clippy warning	2022-11-19 18:26:18 +02:00
Heikki Linnakangas	ed40a045c0	Add more logging to track down test_gc_cutoff failure. see https://github.com/neondatabase/neon/issues/2856	2022-11-19 14:12:21 +02:00
Heikki Linnakangas	3f39327622	Silence a few compiler warnings I saw these from the build of the compute docker image in the CI (compute-node-image-v15): pagestore_smgr.c: In function 'neon_prefetch': pagestore_smgr.c:1654:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] 1654 \| BufferTag tag = (BufferTag) { \| ^~~~~~~~~ walproposer.c:197:1: warning: no previous prototype for 'WalProposerSync' [-Wmissing-prototypes] 197 \| WalProposerSync(int argc, char *argv[]) \| ^~~~~~~~~~~~~~~ libpagestore.c: In function 'pageserver_connect': libpagestore.c:100:9: warning: variable 'wc' set but not used [-Wunused-but-set-variable] 100 \| int wc; \| ^~ libpagestore.c: In function 'call_PQgetCopyData': libpagestore.c:144:9: warning: variable 'wc' set but not used [-Wunused-but-set-variable] 144 \| int wc; \| ^~ Harmless warnings, but let's be tidy. In passing, I added some "extern" to a few function declarations that were missing them, and marked WalProposerSync as "static". Those changes are also purely cosmetic.	2022-11-19 14:11:04 +02:00
Heikki Linnakangas	a50a7e8ac0	Try to silence test_gc_cutoff flakiness. Commit `d013a2b227` changed the test, so that it fails if pgbench runs to completion without triggering the failpoint. That has now happened several times in the CI. That's not expected, so this needs some investigation, but as a quick fix just make the pgbench runs longer so that we're closer to the situation before commit `d013a2b227`. See https://github.com/neondatabase/neon/issues/2856	2022-11-19 01:19:09 +02:00
Egor Suvorov	e28eda7939	sourcetree/docs: mention hakari generate (#2864)	2022-11-18 22:30:41 +00:00
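
Commits 264b0ada9f and 39b10696e9 in the list above describe a tenant state rule: activating a tenant that was set to Paused during attach leaves it Paused and does not start background loops, and activating a tenant that is already Active is an error. The following is a minimal sketch of that rule only. The `Tenant` struct, its fields, the `pause` method, and the `String` error type are hypothetical simplifications for illustration, not the pageserver's actual code.

```rust
// Hypothetical, simplified model of the behavior described in the commit
// messages; the names below are assumptions, not the real pageserver types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TenantState {
    Attaching,
    Active,
    Paused,
}

#[derive(Debug)]
struct Tenant {
    state: TenantState,
    // Stand-in for "background jobs have been started".
    background_loops_running: bool,
}

impl Tenant {
    fn new() -> Self {
        Tenant {
            state: TenantState::Attaching,
            background_loops_running: false,
        }
    }

    // Models a detach request arriving, possibly while attach is in progress.
    fn pause(&mut self) {
        self.state = TenantState::Paused;
        self.background_loops_running = false;
    }

    // Models the end of attach. A Paused tenant stays Paused and no loops
    // are started; an already Active tenant is reported as an error.
    fn activate(&mut self) -> Result<(), String> {
        match self.state {
            TenantState::Active => Err("tenant is already Active".to_string()),
            TenantState::Paused => Ok(()),
            TenantState::Attaching => {
                self.state = TenantState::Active;
                self.background_loops_running = true;
                Ok(())
            }
        }
    }
}
```

In this sketch, calling `pause()` and then `activate()` leaves the tenant in `Paused` with no background loops, which is the race the detach-while-attaching test in commit 0d533ce840 is described as covering.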

2419 Commits