rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-27 18:10:37 +00:00

Author	SHA1	Message	Date
Christian Schwarz	dd2a77c2ef	Abort uploads if the tenant/timeline is requested to shut down - Introduce another UploadQueue::Stopped enum variant to indicate the state where the UploadQueue is shut down. - In perform_upload_task, wait concurrently for tenant/timeline shutdown. If we are requested to shut down, the first in-progress task that notices the shutdown request transitions the queue from UploadQueue::Initialized to UploadQueue::Stopped state. This involves dropping all the queued ops that are not yet in progress, which conveniently unblocks wait_completion() calls that are waiting for their barrier to be executed. They will receive an Err(), and do something sensible. Right now, wait_completion() is only used by tests, but I suspect that we should be using it wherever we delete layer files, e.g., GC and compaction, as explained in the storage_sync.rs block comment section "Consistency". This change also fixes test_timeline_deletion_with_files_stuck_in_upload_queue which I added in the previous commit. Before, timeline delete would wait until all in-progress tasks and queued tasks were done. If, like in the test, a task was stuck due to upload error, timeline deletion would wait forever. Now it gets an error. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-11-25 16:32:31 +02:00
Christian Schwarz	206b5d2ada	remove obsolete TODO (see github PR discussion https://github.com/neondatabase/neon/pull/2785#discussion_r1030547179 )	2022-11-24 13:20:38 -05:00
Christian Schwarz	77fea61fcc	address FIXME in tenant_attach regarding tenant config	2022-11-24 13:20:38 -05:00
Christian Schwarz	e3f4c0e4ac	remove FIXME addressed by 'On tenant load, start WAL receivers only after all timelines have been loaded.'	2022-11-24 13:20:38 -05:00
Christian Schwarz	2302ecda04	don't run background loops in unit tests	2022-11-24 13:20:38 -05:00
Heikki Linnakangas	5257bbe2b9	Remove information-free ".context" messages We capture stack traces of all errors, so these don't really add any value. As a thought experiment, if we had to add a line like this, with the function name in it, every time we use the ?-operator, we're doing something wrong. test_tenants.py::test_tenant_creation_fails creates a failpoint and checks that the error returned by the pageserver contains the failpoint name, and that was failing because it wasn't on the first line of the error. We should probably improve our error-scraping logic in the tests to not rely so heavily on string matching, but that's a different topic. FWIW, these are also pretty unlikely to fail in practice.	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	6b61ed5fab	Fix clippy warning and some typos	2022-11-24 19:20:17 +01:00
Christian Schwarz	f28bf70596	tenant creation: re-use load_local_tenant()	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	dc9c33139b	On tenant load, start WAL receivers only after all timelines have been loaded. And similarly on attach. This way, if the tenant load/attach fails halfway through, we don't have any leftover WAL receivers still running on the broken tenant.	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	0260ee23b9	Start background loops on create_tenant correctly. This was caught by the test_gc_cutoff test.	2022-11-24 19:20:17 +01:00
Christian Schwarz	1022b4b98f	use utils::failpoint_sleep_millis_async!("attach-before-activate") The detach_while_attaching test still passes, presumably because the two requests execute on different OS threads.	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	791eebefe2	Silence clippy warning	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	50b686c3e4	rustfmt	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	39b10696e9	Fix unit tests. `activate` is now more strict and errors out if the tenant is already Active.	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	264b0ada9f	Handle concurrent detach and attach more gracefully. If tenant detach is requested while the tenant is still in Attaching state, we set the state to Paused, but when the attach completed, it changed it to Active again, and worse, it started the background jobs. To fix, rewrite the set_state() function so that when you activate a tenant that is already in Paused state, it stays in Paused state and we don't start the background loops.	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	78338f7b94	Remove background_jobs_enabled, move code from tenant_mgr.rs to tenant.rs	2022-11-24 19:20:17 +01:00
Heikki Linnakangas	0d533ce840	Test detach while attach is still in progress	2022-11-24 19:20:17 +01:00
Christian Schwarz	978f1879b9	fix typo in storage_sync.rs module comment Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2022-11-24 19:17:41 +01:00
Christian Schwarz	c863d679f8	storage_sync: update module doc comment to reflect our changes	2022-11-23 11:46:03 -05:00
Christian Schwarz	c61731a31f	metric pageserver_remote_upload_queue_unfinished_tasks: add labels for file & op kind, and check for them in the test Fails right now because turns out we don't actually generate layer removal tasks with the current test code. That will be the next commit.	2022-11-22 12:13:38 -05:00
Heikki Linnakangas	d1b92e976a	When a new layer file is created in compaction, also upload it. I can't believe this was missing..	2022-11-22 16:35:31 +02:00
Heikki Linnakangas	d6c1a9aa18	Fix formatting	2022-11-22 15:02:09 +02:00
Heikki Linnakangas	3c4680b718	Turn upload_queue_items_metric into a regular function	2022-11-22 14:45:22 +02:00
Christian Schwarz	6b7cbec9b3	WIP: add test for storage_sync upload retries	2022-11-22 05:49:46 -05:00
Christian Schwarz	6de9a33c05	metric REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS	2022-11-22 05:49:46 -05:00
Christian Schwarz	77f883c95c	prometheus metrics for storage_sync2 Extracted from https://github.com/neondatabase/neon/pull/2595 Use pub(crate) to highlight unused metrics.	2022-11-22 05:49:25 -05:00
Heikki Linnakangas	e6557f4f91	Silence test failures, where we operate on a tenant before it's loaded Saw a failure like this, from 'test_tenants_attached_after_download' and 'test_tenant_redownloads_truncated_file_on_startup': > test_runner/fixtures/neon_fixtures.py:1064: in verbose_error > res.raise_for_status() > /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/requests/models.py:1021: in raise_for_status > raise HTTPError(http_error_msg, response=self) > E requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:18150/v1/tenant/2334c9c113a82b5dd1651a0a23c53448/timeline > > The above exception was the direct cause of the following exception: > test_runner/regress/test_tenants_with_remote_storage.py:185: in test_tenants_attached_after_download > restored_timelines = client.timeline_list(tenant_id) > test_runner/fixtures/neon_fixtures.py:1148: in timeline_list > self.verbose_error(res) > test_runner/fixtures/neon_fixtures.py:1070: in verbose_error > raise PageserverApiException(msg) from e > E fixtures.neon_fixtures.PageserverApiException: NotFound: Tenant 2334c9c113a82b5dd1651a0a23c53448 is not active. Current state: Loading These tests start the pageserver, wait until assert_no_in_progress_downloads_for_tenant says that has_downloads_in_progress is false, and then call timeline_list on the tenant. But has_downloads_in_progress was only returned as true when the tenant was being attached, not when it was being loaded at pageserver startup. Change tenant_status API endpoint (/v1/tenant/:tenant_id) so that it returns has_downloads_in_progress=true also for tenants that are still in Loading state.	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	e8db20eb26	On connection from compute, wait for tenant to become Active. If a connection from compute arrives while a tenant is still in Loading state, wait for it to become Active instead of throwing an error to the client. This should fix the errors from test_gc_cutoff test that repeatedly restarts the pageserver and immediately tries to connect to it.	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	7552e2d25f	Enable passing FAILPOINTS at startup. - Pass through FAILPOINTS environment variable to the pageserver in "neon_local pageserver start" command - On startup, list any failpoints that were set with FAILPOINTS to the log - Add optional "extra_env_vars" argument to the NeonPageserver.start() function in the python fixture, so that you can pass FAILPOINTS None of the tests use this functionality yet; that comes in a separate commit. closes https://github.com/neondatabase/neon/pull/2865	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	e4c9b83a39	Merge remote-tracking branch 'origin/main' into HEAD	2022-11-20 02:31:21 +02:00
Heikki Linnakangas	eed99b7251	Silence clippy warning	2022-11-19 18:26:18 +02:00
Christian Schwarz	f564dff0e3	make test_tenant_detach_smoke fail reproducibly Add failpoint that triggers the race condition. Skip test until we'll land the fix from https://github.com/neondatabase/neon/pull/2851 with https://github.com/neondatabase/neon/pull/2785	2022-11-18 17:15:34 +01:00
Christian Schwarz	d783889a1f	timeline: explicit tracking of flush loop state: NotStarted, Running, Exited This allows us to error out in the case where we request flush but the flush loop is not running. Before, we would only track whether it was started, but not when it exited. Better to use an enum with 3 states than a 2-state bool because then the error message can answer the question whether we ever started the flush loop or not.	2022-11-18 17:15:34 +01:00
Christian Schwarz	66f8f686a0	run manual gc in a task_mgr task to prevent race with detach This fixes flaky test_tenant_detach_smoke.	2022-11-18 12:15:14 +02:00
Christian Schwarz	919f2b261a	make test_tenant_detach_smoke fail reproducibly Add failpoint that triggers the race condition.	2022-11-18 12:15:14 +02:00
Christian Schwarz	9d273c840a	timeline: explicit tracking of flush loop state: NotStarted, Running, Exited This allows us to error out in the case where we request flush but the flush loop is not running. Before, we would only track whether it was started, but not when it exited. Better to use an enum with 3 states than a 2-state bool because then the error message can answer the question whether we ever started the flush loop or not.	2022-11-18 12:15:14 +02:00
Heikki Linnakangas	328ec1ce24	Print a more full error message, with stack trace, on GC failure. In a CI run, I got a test failure because of this error in the log, from the test_get_tenant_size_with_multiple_branches test: ERROR gc_loop{tenant_id=f1630516d4b526139836ced93be0c878}: Gc failed, retrying in 2s: No such file or directory (os error 2) There are known race conditions between GC and timeline deletion, which surely caused that error. But if we didn't know the cause, it would be pretty hard to debug without a stack trace.	2022-11-18 11:44:00 +02:00
Dmitry Rodionov	6600e1f896	initialize upload queue before starting download operations	2022-11-17 23:53:31 +02:00
Dmitry Rodionov	348369414b	fix incorrect metadata update Previously in some cases local metadata was confused with the remote one, and there was a check that we write locally only if remote metadata has a greater disk_consistent_lsn. So because they were equal we didn't write anything. For the attach scenario this ended up in not writing metadata at all. Rearrange code so we decide on the proper metadata value earlier on and initialize the timeline with the correct one, without needing to update it late in the initialization process in .reconcile_with_remote	2022-11-17 23:53:31 +02:00
Christian Schwarz	3890acaf7f	stop using Option for UploadQueueInitialized::{latest_metadata,last_uploaded_consistent_lsn}	2022-11-17 21:34:16 +02:00
Christian Schwarz	f537a7a873	add explainer comments regarding UploadQueueInitialized::{latest_files,latest_metadata,last_uploaded_consistent_lsn}	2022-11-17 21:34:16 +02:00
Christian Schwarz	71bc45a21b	storage_sync: track upload queue initialization state using enum & fix last_uploaded_consistent_lsn initialization for empty remote storage As pointed out in `b8488e70a9 (r1024319620)` the following is wrong for the case where the remote storage is empty: metadata = whatever the local-ONLY metadata is ... upload_queue.latest_metadata = Some(metadata.clone()); upload_queue.last_uploaded_consistent_lsn = Some(metadata.disk_consistent_lsn()); The reason why it's wrong is that we return last_uploaded_consistent_lsn to safekeepers. So, we'd be returning an Lsn that is not yet uploaded to S3.	2022-11-17 21:34:16 +02:00
Christian Schwarz	decef74503	don't start background jobs if tenant has no timelines Before this change, test_pageserver_with_empty_tenants was failing at: assert loaded_tenant["state"] == { "Active": {"background_jobs_running": False} }, "Tenant {tenant_with_empty_timelines_dir} with empty timelines dir should be active and ready for timeline creation" because background_jobs_running was True instead of False. Personally I think we should simply always start the background loops and not bother, but let's punt this until after we've merged this PR.	2022-11-17 11:22:02 +02:00
Dmitry Rodionov	b8488e70a9	run clippy/fmt	2022-11-16 17:25:04 +02:00
Christian Schwarz	16fdd104ac	bring back HTTP API `has_in_progress_downloads` and `awaits_download` field, derived from TenantState	2022-11-16 14:57:26 +02:00
Christian Schwarz	bb6dbd2f43	crash-safe and resumable tenant attach This change introduces a marker file $repo/tenants/$tenant_id/attaching that is present while a tenant is in Attaching state. When pageserver restarts, we use it to resume the tenant attach operation. Before this change, a crash during tenant attach would result in one of the following: 1. crash upon restart due to missing metadata file (IIRC) 2. "successful" loading of the tenant with a subset of timelines	2022-11-16 14:57:26 +02:00
Dmitry Rodionov	1839ce0545	properly merge remote metadata with local one	2022-11-16 14:42:15 +02:00
Dmitry Rodionov	8e04f0455e	add a bunch of .context calls	2022-11-16 14:42:15 +02:00
Dmitry Rodionov	6839773538	handle temporary files during layer map loading	2022-11-16 14:42:15 +02:00
Dmitry Rodionov	2a96c4cfcd	start walreceiver after reconcile_with_remote	2022-11-16 14:42:15 +02:00
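
Commits d783889a1f and 9d273c840a above argue that a three-state enum beats a boolean for tracking the flush loop, because the error can then say whether the loop was ever started. A minimal sketch of that idea follows. Only the state names NotStarted, Running and Exited come from the commit message; `FlushLoopState` as a type name, `FlushError`, `request_flush` and the error strings are hypothetical and are not taken from the neon source.

```rust
use std::fmt;

/// Flush loop lifecycle. The three state names come from the commit message;
/// the type itself is an illustration, not the pageserver's definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FlushLoopState {
    NotStarted,
    Running,
    Exited,
}

/// Hypothetical error type: one variant per non-running state, so the caller
/// can tell "never started" apart from "started, then exited".
#[derive(Debug, PartialEq, Eq)]
pub enum FlushError {
    NeverStarted,
    AlreadyExited,
}

impl fmt::Display for FlushError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FlushError::NeverStarted => write!(f, "flush loop was never started"),
            FlushError::AlreadyExited => write!(f, "flush loop has already exited"),
        }
    }
}

/// Hypothetical entry point: accept a flush request only while the loop runs.
/// With a plain `started: bool`, the Running and Exited cases would be
/// indistinguishable here.
pub fn request_flush(state: FlushLoopState) -> Result<(), FlushError> {
    match state {
        FlushLoopState::Running => Ok(()),
        FlushLoopState::NotStarted => Err(FlushError::NeverStarted),
        FlushLoopState::Exited => Err(FlushError::AlreadyExited),
    }
}
```

In this sketch a caller would pass the current state to `request_flush` and surface the returned error's `Display` text in logs, which is where the extra state pays off.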

1044 Commits