rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-14 08:52:56 +00:00

Author	SHA1	Message	Date
Christian Schwarz	5a55dce282	ensure layers_removed > 0	2022-11-22 12:13:36 -05:00
Heikki Linnakangas	e6de1b0e8c	Increase the timeout in test. Downloading all files can take more than 5 seconds, if there are even small network glitches or similar.	2022-11-22 17:22:33 +02:00
Heikki Linnakangas	58be279be1	Fix python flake8 errors	2022-11-22 16:40:51 +02:00
Heikki Linnakangas	d1b92e976a	When a new layer file is created in compaction, also upload it. I can't believe this was missing..	2022-11-22 16:35:31 +02:00
Heikki Linnakangas	7d46c7c118	Fix python formatting	2022-11-22 15:16:31 +02:00
Heikki Linnakangas	d6c1a9aa18	Fix formatting	2022-11-22 15:02:09 +02:00
Heikki Linnakangas	49f2eac934	Add missing import Was causing the test to fail. It's annoying that you don't get an error from that earlier..	2022-11-22 15:01:03 +02:00
Heikki Linnakangas	5c8387aff1	Fix setting failpoints in test_remote_storage_backup_and_restore The checkpoints in the test were numbered 1 and 2, but the code only tried to set the failpoints for checkpoint 0. So they were never set.	2022-11-22 14:52:11 +02:00
Heikki Linnakangas	3c4680b718	Turn upload_queue_items_metric into a regular function	2022-11-22 14:45:22 +02:00
Christian Schwarz	6b7cbec9b3	WIP: add test for storage_sync upload retries	2022-11-22 05:49:46 -05:00
Christian Schwarz	6de9a33c05	metric REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS	2022-11-22 05:49:46 -05:00
Christian Schwarz	77f883c95c	prometheus metrics for storage_sync2 Extracted from https://github.com/neondatabase/neon/pull/2595 Use pub(crate) to highlight unused metrics.	2022-11-22 05:49:25 -05:00
Heikki Linnakangas	e6557f4f91	Silence test failures, where we operate on a tenant before it's loaded Saw a failure like this, from 'test_tenants_attached_after_download' and 'test_tenant_redownloads_truncated_file_on_startup': > test_runner/fixtures/neon_fixtures.py:1064: in verbose_error > res.raise_for_status() > /github/home/.cache/pypoetry/virtualenvs/neon-_pxWMzVK-py3.9/lib/python3.9/site-packages/requests/models.py:1021: in raise_for_status > raise HTTPError(http_error_msg, response=self) > E requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:18150/v1/tenant/2334c9c113a82b5dd1651a0a23c53448/timeline > > The above exception was the direct cause of the following exception: > test_runner/regress/test_tenants_with_remote_storage.py:185: in test_tenants_attached_after_download > restored_timelines = client.timeline_list(tenant_id) > test_runner/fixtures/neon_fixtures.py:1148: in timeline_list > self.verbose_error(res) > test_runner/fixtures/neon_fixtures.py:1070: in verbose_error > raise PageserverApiException(msg) from e > E fixtures.neon_fixtures.PageserverApiException: NotFound: Tenant 2334c9c113a82b5dd1651a0a23c53448 is not active. Current state: Loading These tests start the pageserver, wait until assert_no_in_progress_downloads_for_tenant says that has_downloads_in_progress is false, and then call timeline_list on the tenant. But has_downloads_in_progress was only returned as true when the tenant was being attached, not when it was being loaded at pageserver startup. Change tenant_status API endpoint (/v1/tenant/:tenant_id) so that it returns has_downloads_in_progress=true also for tenants that are still in Loading state.	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	e8db20eb26	On connection from compute, wait for tenant to become Active. If a connection from compute arrives while a tenant is still in Loading state, wait for it to become Active instead of throwing an error to the client. This should fix the errors from test_gc_cutoff test that repeatedly restarts the pageserver and immediately tries to connect to it.	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	7552e2d25f	Enable passing FAILPOINTS at startup. - Pass through FAILPOINTS environment variable to the pageserver in "neon_local pageserver start" command - On startup, list any failpoints that were set with FAILPOINTS to the log - Add optional "extra_env_vars" argument to the NeonPageserver.start() function in the python fixture, so that you can pass FAILPOINTS None of the tests use this functionality yet; that comes in a separate commit. closes https://github.com/neondatabase/neon/pull/2865	2022-11-21 20:50:42 +01:00
Heikki Linnakangas	dfb4160403	Don't remove timelines directory while pageserver is running. The test removes all the timelines from local disk, to test how pageserver startup works when it's missing. But if the pageserver hasn't finished loading the timeline yet, you can get an error: > 2022-11-20T01:30:41.053207Z INFO load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: no index file was found on the remote > 2022-11-20T01:30:41.054045Z ERROR load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: Failed to initialize timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807: Failed to load layermap for timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807 > > Caused by: > No such file or directory (os error 2) I saw this in CI, here: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2785/debug/3505805425/index.html#suites/ec4311502db344eee91f1354e9dc839b/725c7d0ecec1ec4d/ And was able to reproduce it with this: > --- a/pageserver/src/tenant.rs > +++ b/pageserver/src/tenant.rs > @@ -946,6 +946,8 @@ impl Tenant { > None => None, > }; > > + tokio::time::sleep(std::time::Duration::from_secs(2)).await; > + > self.setup_timeline( > timeline_id, > remote_client, Even on 'main', it's pretty sketchy to remove the directory while the pageserver is still running, but it didn't lead to an error because the pageserver finished loading the local layer maps before starting up. Now that that's spawned into background, the directory might get removed before the loading finishes.	2022-11-20 21:52:22 +02:00
Heikki Linnakangas	f9bdc030e8	Fix python formatting failure	2022-11-20 02:37:15 +02:00
Heikki Linnakangas	e4c9b83a39	Merge remote-tracking branch 'origin/main' into HEAD	2022-11-20 02:31:21 +02:00
Alexander Bayandin	cb9b26776e	Fix test_seqscans on remote cluster (#2869) A remote project is reused between tests, so we need to ensure that we don't have a table with the same name already created.	2022-11-19 23:39:42 +00:00
Heikki Linnakangas	a78c16328e	Silence test_remote_storage failure, caused by error message change	2022-11-20 00:13:43 +02:00
Heikki Linnakangas	684329d4d2	Another attempt at silencing test_gc_cutoff failures. Increase the pgbench runtimes even further. The theory is that when there are many other tests running at the same time, one pgbench run could take a long time until it generates enough layers for GC to kick in.	2022-11-19 19:28:56 +02:00
Heikki Linnakangas	eed99b7251	Silence clippy warning	2022-11-19 18:26:18 +02:00
Heikki Linnakangas	ed40a045c0	Add more logging to track down test_gc_cutoff failure. see https://github.com/neondatabase/neon/issues/2856	2022-11-19 14:12:21 +02:00
Heikki Linnakangas	3f39327622	Silence a few compiler warnings I saw these from the build of the compute docker image in the CI (compute-node-image-v15): pagestore_smgr.c: In function 'neon_prefetch': pagestore_smgr.c:1654:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] 1654 \| BufferTag tag = (BufferTag) { \| ^~~~~~~~~ walproposer.c:197:1: warning: no previous prototype for 'WalProposerSync' [-Wmissing-prototypes] 197 \| WalProposerSync(int argc, char *argv[]) \| ^~~~~~~~~~~~~~~ libpagestore.c: In function 'pageserver_connect': libpagestore.c:100:9: warning: variable 'wc' set but not used [-Wunused-but-set-variable] 100 \| int wc; \| ^~ libpagestore.c: In function 'call_PQgetCopyData': libpagestore.c:144:9: warning: variable 'wc' set but not used [-Wunused-but-set-variable] 144 \| int wc; \| ^~ Harmless warnings, but let's be tidy. In the passing, I added some "extern" to a few function declarations that were missing them, and marked WalProposerSync as "static". Those changes are also purely cosmetic.	2022-11-19 14:11:04 +02:00
Heikki Linnakangas	a50a7e8ac0	Try to silence test_gc_cutoff flakiness. Commit `d013a2b227` changed the test, so that it fails if pgbench runs to completion without triggering the failpoint. That has now happened several times in the CI. That's not expected, so this needs some investigation, but as a quick fix just make the pgbench runs longer so that we're closer to the situation before commit `d013a2b227`. See https://github.com/neondatabase/neon/issues/2856	2022-11-19 01:19:09 +02:00
Egor Suvorov	e28eda7939	sourcetree/docs: mention hakari generate (#2864)	2022-11-18 22:30:41 +00:00
Heikki Linnakangas	a2fbd93e91	Fix python isort codestyle check	2022-11-18 18:29:27 +02:00
Christian Schwarz	f564dff0e3	make test_tenant_detach_smoke fail reproducibly Add failpoint that triggers the race condition. Skip test until we'll land the fix from https://github.com/neondatabase/neon/pull/2851 with https://github.com/neondatabase/neon/pull/2785	2022-11-18 17:15:34 +01:00
Christian Schwarz	d783889a1f	timeline: explicit tracking of flush loop state: NotStarted, Running, Exited This allows us to error out in the case where we request flush but the flush loop is not running. Before, we would only track whether it was started, but not when it exited. Better to use an enum with 3 states than a 2-state bool because then the error message can answer the question whether we ever started the flush loop or not.	2022-11-18 17:15:34 +01:00
bojanserafimov	2655bdbb2e	Add remote seqscans test (#2840)	2022-11-18 09:05:13 -05:00
Konstantin Knizhnik	b9152f1ef4	Correctly terminate prefetch in case of pageserver restart (#2850) refer #2819 This patch requires deep knowledge of prefetch internals. So @MMeent please review it or suggest better solution.	2022-11-18 15:04:58 +02:00
Christian Schwarz	66f8f686a0	run manual gc in a task_mgr task to prevent race with detach This fixes flaky test_tenant_detach_smoke.	2022-11-18 12:15:14 +02:00
Christian Schwarz	919f2b261a	make test_tenant_detach_smoke fail reproducibly Add failpoint that triggers the race condition.	2022-11-18 12:15:14 +02:00
Christian Schwarz	9d273c840a	timeline: explicit tracking of flush loop state: NotStarted, Running, Exited This allows us to error out in the case where we request flush but the flush loop is not running. Before, we would only track whether it was started, but not when it exited. Better to use an enum with 3 states than a 2-state bool because then the error message can answer the question whether we ever started the flush loop or not.	2022-11-18 12:15:14 +02:00
Heikki Linnakangas	328ec1ce24	Print a more full error message, with stack trace, on GC failure. In a CI run, I got a test failure because of this error in the log, from the test_get_tenant_size_with_multiple_branches test: ERROR gc_loop{tenant_id=f1630516d4b526139836ced93be0c878}: Gc failed, retrying in 2s: No such file or directory (os error 2) There are known race conditions between GC and timeline deletion, which surely caused that error. But if we didn't know the cause, it would be pretty hard to debug without a stack trace.	2022-11-18 11:44:00 +02:00
Heikki Linnakangas	dcb79ef08f	Silence yet another test failure from race condition between GC and delete. Another similar case to commit `9ae4da4f31`.	2022-11-18 10:18:15 +02:00
Konstantin Knizhnik	fd99e0fbc4	Build pg_prewarm extension (#2794)	2022-11-18 09:10:32 +02:00
Dmitry Rodionov	6600e1f896	initialize upload queue before starting download operations	2022-11-17 23:53:31 +02:00
Dmitry Rodionov	348369414b	fix incorrect metadata update Previously in some cases local metadata was confused with the remote one, and there was a check that we write locally only if remote metadata has greater disk_consistent_lsn. So because they were equal we didn't write anything. For the attach scenario this ended up in not writing metadata at all. Rearrange code so we decide on the proper metadata value earlier on and initialize the timeline with the correct one, without needing to update it late in the initialization process in .reconsile_with_remote	2022-11-17 23:53:31 +02:00
Christian Schwarz	3890acaf7f	stop using Option for UploadQueueInitialized::{latest_metadata,last_uploaded_consistent_lsn}	2022-11-17 21:34:16 +02:00
Christian Schwarz	f537a7a873	add explainer comments regarding UploadQueueInitialized::{latest_files,latest_metadata,last_uploaded_consistent_lsn}	2022-11-17 21:34:16 +02:00
Christian Schwarz	71bc45a21b	storage_sync: track upload queue initialization state using enum & fix last_uploaded_consistent_lsn initialization for empty remote storage As pointed out in `b8488e70a9 (r1024319620)` the following is wrong for the case where the remote storage is empty: metadata = whatever the local-ONLY metadata is ... upload_queue.latest_metadata = Some(metadata.clone()); upload_queue.last_uploaded_consistent_lsn = Some(metadata.disk_consistent_lsn()); The reason why it's wrong is that we return last_uploaded_consistent_lsn to safekeepers. So, we'd be returning an Lsn that is not yet uploaded to S3.	2022-11-17 21:34:16 +02:00
Kirill Bulatov	60ac227196	Use modern flex and bison in macOS compilations (#2847)	2022-11-17 14:48:21 +00:00
MMeent	4a60051b0d	Add codeowners section for /vendor/ (#2849) After this, consent of @neondatabase/compute is required to update the vendored PostgreSQL versions.	2022-11-17 14:31:34 +00:00
Heikki Linnakangas	24d3ed0952	Ignore another ERROR that's expected in test. Got a test failure in CI because of this.	2022-11-17 12:42:56 +02:00
Christian Schwarz	decef74503	don't start background jobs if tenant has no timelines Before this change, test_pageserver_with_empty_tenants was failing at: assert loaded_tenant["state"] == { "Active": {"background_jobs_running": False} }, "Tenant {tenant_with_empty_timelines_dir} with empty timelines dir should be active and ready for timeline creation" because background_jobs_running was True instead of False. Personally I think we should simply always start the background loops and not bother, but let's punt this until after we've merged this PR.	2022-11-17 11:22:02 +02:00
Christian Schwarz	8aed805933	refactor: remove misleading log messages and redundant asserts from test_pageserver_with_empty_tenants At least since commit 'Return broken tenants due to non existing timelines dir (#2552) (#2575)' some of these messages are wrong.	2022-11-17 11:21:49 +02:00
Alexander Bayandin	0a87d71294	test_runner: make proxy mgmt port mandatory (#2839) Make `mgmt` port mandatory argument for `NeonProxy` (and set it for `static_proxy`) to avoid port collision when tests run in parallel.	2022-11-16 17:57:48 +00:00
Heikki Linnakangas	150bddb929	Clean up process start/stop handling * Poll more frequently when waiting for process start/stop. This speeds up startup and shutdown in tests. We did this already in commit `52ce1c9d53`, which reduced the interval to 100 ms, but it was inadvertently increased back to 500 ms in commit `d42700280f`. Reduce it to 100 ms again, for both start and stop operations. * Harmonize the start and stop loops, printing the dots and notices the same way in both. I considered extracting the logic to a separate retry-function that takes a closure as argument that does the polling, but as long as we only have two copies, the code duplication isn't that bad. * Remove newline after "Starting pageserver" and "Starting etcd" messages, so that the progress-indicator dots that are printed once a second are printed on the same line. Before: Starting pageserver at '127.0.0.1:64000' in '.neon' ... pageserver started, pid: 2538937 After: Starting pageserver at '127.0.0.1:64000' in '.neon'... pageserver started, pid: 2538937 The "Starting safekeeper" message already got this right. * Update example output in README.md to match	2022-11-16 19:51:37 +02:00
Dmitry Rodionov	b8488e70a9	run clippy/fmt	2022-11-16 17:25:04 +02:00
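
Commits `d783889a1f` and `9d273c840a` above argue for tracking the flush loop with a three-state enum (NotStarted, Running, Exited) rather than a bool, so that the error can say whether the loop was ever started. The following is a minimal sketch of that idea only. The names `FlushLoopState`, `FlushError`, and `request_flush` are hypothetical illustrations and are not taken from the neon source.

```rust
use std::fmt;

/// Hypothetical state tracker for a background flush loop.
/// The three variant names mirror the commit message; everything else is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FlushLoopState {
    NotStarted,
    Running,
    Exited,
}

/// Hypothetical error type: one variant per "loop is not running" reason.
#[derive(Debug, PartialEq, Eq)]
enum FlushError {
    NeverStarted,
    AlreadyExited,
}

impl fmt::Display for FlushError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FlushError::NeverStarted => {
                write!(f, "flush requested, but the flush loop was never started")
            }
            FlushError::AlreadyExited => {
                write!(f, "flush requested, but the flush loop has already exited")
            }
        }
    }
}

/// Accept a flush request only while the loop is running.
/// With a plain `bool` the two error cases below could not be told apart.
fn request_flush(state: FlushLoopState) -> Result<(), FlushError> {
    match state {
        FlushLoopState::Running => Ok(()),
        FlushLoopState::NotStarted => Err(FlushError::NeverStarted),
        FlushLoopState::Exited => Err(FlushError::AlreadyExited),
    }
}
```

Calling `request_flush(FlushLoopState::Exited)` in this sketch yields an error whose message distinguishes "exited" from "never started", which is the property the commit message asks for.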

2395 Commits