- Pass the FAILPOINTS environment variable through to the pageserver in
the "neon_local pageserver start" command
- On startup, log any failpoints that were set via FAILPOINTS (a sketch of
the idea is shown below)
- Add an optional "extra_env_vars" argument to the NeonPageserver.start()
function in the python fixture, so that tests can pass FAILPOINTS
None of the tests use this functionality yet; that comes in a separate
commit.
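For illustration, consuming FAILPOINTS on the pageserver side can look roughly like this (a sketch using the `fail` crate's `name=action;name2=action2` format; the function name is made up and the real code may differ):
```rust
/// Sketch: apply failpoints from the FAILPOINTS environment variable and log
/// each one that was set, so they show up in the pageserver log on startup.
fn init_failpoints_from_env() {
    if let Ok(failpoints) = std::env::var("FAILPOINTS") {
        for cfg in failpoints.split(';').filter(|s| !s.is_empty()) {
            if let Some((name, action)) = cfg.split_once('=') {
                tracing::info!("failpoint set from FAILPOINTS: {}={}", name, action);
                fail::cfg(name, action).expect("invalid failpoint config in FAILPOINTS");
            }
        }
    }
}
```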
closes https://github.com/neondatabase/neon/pull/2865
The test removes all the timelines from local disk, to test how
pageserver startup works when they are missing. But if the pageserver
hasn't finished loading the timeline yet, you can get an error:
> 2022-11-20T01:30:41.053207Z INFO load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: no index file was found on the remote
> 2022-11-20T01:30:41.054045Z ERROR load{tenant_id=0f6ba053925a997b99b5eb45f9c548ac}:load_local_timeline{timeline_id=308ada17f4c3d790b631805d2dd51807}: Failed to initialize timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807: Failed to load layermap for timeline 0f6ba053925a997b99b5eb45f9c548ac/308ada17f4c3d790b631805d2dd51807
>
> Caused by:
> No such file or directory (os error 2)
I saw this in CI, here:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-2785/debug/3505805425/index.html#suites/ec4311502db344eee91f1354e9dc839b/725c7d0ecec1ec4d/
And was able to reproduce it with this:
> --- a/pageserver/src/tenant.rs
> +++ b/pageserver/src/tenant.rs
> @@ -946,6 +946,8 @@ impl Tenant {
> None => None,
> };
>
> + tokio::time::sleep(std::time::Duration::from_secs(2)).await;
> +
> self.setup_timeline(
> timeline_id,
> remote_client,
Even on 'main', it's pretty sketchy to remove the directory while the
pageserver is still running, but it didn't lead to an error because
the pageserver finished loading the local layer maps before starting
up. Now that that's spawned into the background, the directory might get
removed before the loading finishes.
Increase the pgbench runtimes even further. The theory is that when
there are many other tests running at the same time, one pgbench run
could take a long time until it generates enough layers for GC to kick
in.
Commit d013a2b227 changed the test, so that it fails if pgbench runs
to completion without triggering the failpoint. That has now happened
several times in the CI. That's not expected, so this needs some
investigation, but as a quick fix just make the pgbench runs longer so
that we're closer to the situation before commit d013a2b227.
See https://github.com/neondatabase/neon/issues/2856
Before this change, test_pageserver_with_empty_tenants was failing at:
assert loaded_tenant["state"] == {
"Active": {"background_jobs_running": False}
}, "Tenant {tenant_with_empty_timelines_dir} with empty timelines dir should be active and ready for timeline creation"
because background_jobs_running was True instead of False.
Personally I think we should simply always start the background loops
and not bother, but let's punt this until after we've merged this PR.
Set correct `pg_distrib_dir` in `pageserver.toml` and in neon_local
`config`.
`test_forward_compatibility` shows flakiness during `neon_local pg
start`, so hopefully the patch will help.
```
2022-11-15 16:07:34.091 GMT [13338] LOG: starting with zenith basebackup at LSN 0/A6A9310, prev 0/0
2022-11-15 16:07:34.091 GMT [13338] FATAL: cannot start in read-write mode from this base backup
2022-11-15 16:07:34.091 GMT [13337] LOG: startup process (PID 13338) exited with exit code 1
```
Due to a race condition, GC sometimes fails with a "no such file or
directory" error if the tenant is detached concurrently. That's a
known issue, but it didn't cause test failures until we started to
check for unexpected ERRORs in the log in commit 46d30bf054. We should
fix the race condition, of course, but until we do, let's silence the
failures.
This change introduces a marker file
$repo/tenants/$tenant_id/attaching
that is present while a tenant is in Attaching state.
When pageserver restarts, we use it to resume the tenant attach operation.
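A minimal sketch of the marker-file protocol (hypothetical function names; the real code also has to care about fsync, error handling and concurrency):
```rust
use std::io;
use std::path::Path;

/// Create the marker before starting the attach; remove it only after all
/// timelines have been downloaded and initialized.
fn run_attach(tenant_dir: &Path) -> io::Result<()> {
    let marker = tenant_dir.join("attaching");
    std::fs::File::create(&marker)?;
    // ... download index files and layers from remote storage ...
    std::fs::remove_file(&marker)?;
    Ok(())
}

/// On pageserver startup: a leftover marker means the previous attach was
/// interrupted, so resume attaching instead of loading possibly incomplete
/// local state.
fn attach_was_interrupted(tenant_dir: &Path) -> bool {
    tenant_dir.join("attaching").exists()
}
```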
Before this change, a crash during tenant attach would result in one of
the following:
1. crash upon restart due to missing metadata file (IIRC)
2. "successful" loading of the tenant with a subset of timelines
This is a part of https://github.com/neondatabase/neon/pull/2595.
It takes out the switch to a per-tenant upload queue and the changes to
the pageserver startup sequence, because these two are highly interleaved
with each other. I'm still not happy with the size of the diff, but
splitting it even further would probably consume even more time. Ideally
we should do it, but this patch is already a step forward and should be
easier to get in, though it is still quite difficult to review, mainly
because of the size and the fixes for existing concerns, which extend the
diff even further.
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Previously, if the failpoint was not reached for some reason, the test
would only fail because it would reach the 5 minute timeout we have on
all python tests. That's very subtle. Make it fail explicitly if the
failpoint is not hit on each iteration of the loop.
Extracted from a larger PR, see
https://github.com/neondatabase/neon/pull/2785/files#r1022765794
- Refactor the code a little bit, removing the silly for-loop over a
single element.
- Make it more clear in log messages that the errors are expected
- Check for a more precise error message "Failed to load delta layer"
instead of just "extracting base backup failed".
If there are any unexpected ERRORs or WARNs in pageserver.log after a test
finishes, fail the test. This requires whitelisting the errors that *are*
expected in each test, and there are also a few common errors printed by
most tests, which are whitelisted in the fixture itself.
With this, we don't need the special abort() call in testing mode when
compaction or GC fails. Those failures will print ERRORs to the logs,
which will be picked up by this new mechanism.
A bunch of errors are currently whitelisted that we probably shouldn't
be emitting in the first place, but fixing those is out of scope for this
commit, so I just left FIXME comments on them.
We passed the pageserver's libpq endpoint URL as the 'compute_ctl
--connstr' argument, but that was bogus: the --connstr URL is supposed
to be the URL to the *Postgres* instance that compute_ctl launches and
monitors, not to the pageserver. compute_ctl does need the pageserver
URL too, but it is read from the cluster spec JSON, not --connstr.
That was pretty confusing, as you got a lot of "unknown command"
errors in the pageserver log when compute_tools tried to run regular
SQL commands on the pageserver. The test still passed, however, as it
doesn't require the SQL commands to succeed. But to make this less
confusing, use an invalid hostname instead, so that the queries will
fail to even connect.
This change wraps the std::process::Child that we spawn for WAL redo
into a type that ensures that we try to SIGKILL + waitpid() on it.
If there is no explicit call to kill_and_wait(), the Drop implementation
will spawn a task that does it in the BACKGROUND_RUNTIME.
That's an ugly hack but I think it's better than doing kill+wait
synchronously from Drop, since I think the general assumption in the
Rust ecosystem is that Drop doesn't block.
Especially since the drop sites can be _any_ place that drops the last
Arc<PostgresRedoManager>, e.g., compaction or GC.
The benefit of having the new type over just adding a Drop impl to
PostgresRedoProcess is that we can construct it earlier than the full
PostgresRedoProcess in PostgresRedoProcess::launch().
That allows us to correctly kill+wait the child if there is an error in
PostgresRedoProcess::launch() after spawning it.
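A rough, self-contained sketch of the wrapper idea (illustrative names and structure, not the actual walredo.rs code):
```rust
use std::process::Child;
use tokio::runtime::Handle;

/// Wraps the spawned WAL redo child process and guarantees that we
/// kill + wait on it, either explicitly or from Drop.
struct KillOnDrop {
    child: Option<Child>,
    // Handle to a long-lived runtime (BACKGROUND_RUNTIME in the pageserver).
    background: Handle,
}

impl KillOnDrop {
    fn kill_and_wait(mut self) {
        if let Some(mut child) = self.child.take() {
            let _ = child.kill();
            let _ = child.wait();
        }
        // Drop still runs afterwards, but `child` is already None.
    }
}

impl Drop for KillOnDrop {
    fn drop(&mut self) {
        if let Some(mut child) = self.child.take() {
            // Don't block in Drop: hand the kill + wait off to the background runtime.
            self.background.spawn_blocking(move || {
                let _ = child.kill();
                let _ = child.wait();
            });
        }
    }
}
```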
I also took a stab at a regression test. I manually verified
that it fails before the fix to walredo.rs.
fixes https://github.com/neondatabase/neon/issues/2761
closes https://github.com/neondatabase/neon/pull/2776
Add `test_forward_compatibility`, which checks if it's going to
be possible to roll back a release to the previous version.
The test uses artifacts (Neon & Postgres binaries) from the previous
release to start Neon on the repo created by the current version. It
performs exactly the same checks as `test_backward_compatibility` does.
Single `ALLOW_BREAKING_CHANGES` env var got replaced by
`ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` &
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE` and can be set by `backward
compatibility breakage` and `forward compatibility breakage` labels
respectively.
This PR replaces the following global variables in the test framework
with fixtures to make tests more configurable. I mainly need this for
the forward compatibility tests (draft in
https://github.com/neondatabase/neon/pull/2766).
```
base_dir
neon_binpath
pg_distrib_dir
top_output_dir
default_pg_version (this one got replaced with a fixture named pg_version)
```
Also, this PR uses the `Path` type in more places where the code implies it.
With a more realistic selection of gc_horizon in tests, there is an
immediate failure when trying to query the logical size with an lsn <
initdb_lsn. This fixes that, and adds an illustration that came out of
explaining this tenant size calculation to more people.
Cc: #2748, #2599.
Tenant size information is gathered by using existing parts of
`Tenant::gc_iteration` which are now separated as
`Tenant::refresh_gc_info`. `Tenant::refresh_gc_info` collects branch
points, and invokes `Timeline::update_gc_info`; nothing was supposed to
be changed there. The gathered branch points (through Timeline's
`GcInfo::retain_lsns`), `GcInfo::horizon_cutoff`, and
`GcInfo::pitr_cutoff` are used to build up a Vec of updates fed into the
`libs/tenant_size_model` to calculate the history size.
The gathered information is now exposed using `GET
/v1/tenant/{tenant_id}/size`, which will respond with the actual
calculated size. Initially the idea was to have this delivered as a
tenant background task and exported via a metric, but it might be too
computationally expensive to run periodically, as we don't yet know if
the returned values are any good.
Adds one new metric:
- pageserver_storage_operations_seconds with label `logical_size`
- separating it from the original `init_logical_size`
Adds a pageserver wide configuration variable:
- `concurrent_tenant_size_logical_size_queries` with default 1
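One common way to enforce such a limit is a semaphore sized from the config value; the sketch below only illustrates that idea and is not the pageserver's actual implementation:
```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Illustration only: the semaphore would be created once with the configured
/// number of permits, e.g. Semaphore::new(concurrent_tenant_size_logical_size_queries).
async fn calculate_logical_size(limit: Arc<Semaphore>) -> u64 {
    let _permit = limit
        .acquire()
        .await
        .expect("the semaphore is never closed in this sketch");
    // ... walk the timeline's layer files and sum up the logical size ...
    0
}
```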
This leaves a lot of TODOs, tracked in issue #2748.
`test_tenant_relocation` ends up starting a temporary postgres instance with a fixed port. The change makes the port configurable in scripts/export_import_between_pageservers.py and uses that in test_tenant_relocation.
Similar to https://github.com/neondatabase/neon/pull/2395, introduces a state field in Timeline that can be subscribed to (see the sketch after the list).
Adjusts:
* walreceiver to not have any connections if the timeline is not Active
* remote storage sync to not schedule uploads if the timeline is Broken
* timeline creation to not create timelines if the tenant/timeline is broken
* timelines' states to automatically switch based on the tenant state
Does not adjust the timeline's gc, checkpointing and layer flush behaviour much, since it's not safe to cancel these processes abruptly, and there's task_mgr::shutdown_tasks that does a similar thing.
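A sketch of what a subscribable state field can look like, using tokio's watch channel (illustrative names; the real Timeline differs in details):
```rust
use tokio::sync::watch;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TimelineState {
    Loading,
    Active,
    Stopping,
    Broken,
}

struct Timeline {
    state: watch::Sender<TimelineState>,
}

impl Timeline {
    fn new() -> Self {
        let (tx, _rx) = watch::channel(TimelineState::Loading);
        Timeline { state: tx }
    }

    fn set_state(&self, new_state: TimelineState) {
        // send_replace succeeds even if nobody is currently subscribed.
        self.state.send_replace(new_state);
    }

    fn subscribe(&self) -> watch::Receiver<TimelineState> {
        self.state.subscribe()
    }
}

// A subscriber (e.g. the walreceiver) can wait until the timeline becomes Active:
async fn wait_until_active(mut rx: watch::Receiver<TimelineState>) {
    while *rx.borrow() != TimelineState::Active {
        if rx.changed().await.is_err() {
            return; // sender dropped, the timeline is gone
        }
    }
}
```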
* Fix bogus early exit from GC.
Commit 91411c415a added this failpoint, but the early exit was not
intentional.
* Cleanup test_gc_cutoff.py test.
- Remove the 'scale' parameter, this isn't a benchmark
- Tweak pgbench and pageserver options to create garbage faster than the
GC can collect it away. The test used to take just under 5 minutes,
which was uncomfortably close to the default 5 minute test timeout, and
annoyingly long even without the hard limit. These changes bring it down to
about 1-2 minutes.
- Improve comments, fix typos
- Rename the failpoint. The old name, 'gc-before-save-metadata' implied
that the failpoint was before the metadata update, but it was in fact
much later in the function.
- Move the call to persist the metadata outside the lock, to avoid
holding it for too long.
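For the last point, the general pattern is to read what is needed while holding the lock and drop the guard before the slow write; an illustrative sketch (not the actual timeline.rs code):
```rust
use std::path::Path;
use std::sync::Mutex;

// Illustration: take the value under the lock, then drop the guard before
// doing the (potentially slow) metadata write.
fn persist_gc_cutoff(shared_cutoff: &Mutex<u64>, metadata_path: &Path) -> std::io::Result<()> {
    let new_cutoff = {
        let guard = shared_cutoff.lock().unwrap();
        *guard
    }; // guard dropped here, so the lock is not held during I/O
    std::fs::write(metadata_path, new_cutoff.to_string())
}
```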
To verify that this test still covers the original bug,
https://github.com/neondatabase/neon/issues/2539, I commented out
updating the metadata file like this:
```
diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs
index 1e857a9a..f8a9f34a 100644
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -1962,7 +1962,7 @@ impl Timeline {
}
// Persist the new GC cutoff value in the metadata file, before
// we actually remove anything.
- self.update_metadata_file(self.disk_consistent_lsn.load(), HashMap::new())?;
+ //self.update_metadata_file(self.disk_consistent_lsn.load(), HashMap::new())?;
info!("GC starting");
```
It doesn't fail every time with that, but it did fail after about 5
runs.