rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-07 06:00:38 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	eed99b7251	Silence clippy warning	2022-11-19 18:26:18 +02:00
Heikki Linnakangas	a2fbd93e91	Fix python isort codestyle check	2022-11-18 18:29:27 +02:00
Christian Schwarz	66f8f686a0	run manual gc in a task_mgr task to prevent race with detach This fixes flaky test_tenant_detach_smoke.	2022-11-18 12:15:14 +02:00
Christian Schwarz	919f2b261a	make test_tenant_detach_smoke fail reproducibly Add failpoint that triggers the race condition.	2022-11-18 12:15:14 +02:00
Christian Schwarz	9d273c840a	timeline: explicit tracking of flush loop state: NotStarted, Running, Exited This allows us to error out in the case where we request flush but the flush loop is not running. Before, we would only track whether it was started, but not when it exited. Better to use an enum with 3 states than a 2-state bool because then the error message can answer the question whether we ever started the flush loop or not.	2022-11-18 12:15:14 +02:00
Dmitry Rodionov	6600e1f896	initialize upload queue before starting download operations	2022-11-17 23:53:31 +02:00
Dmitry Rodionov	348369414b	fix incorrect metadata update Previously in some cases local metadata was confused with remote one and there was a check, that we write locally only if remote metadata has greater disk_consistent_lsn. So because they were equal we didn't write anything. For attach scenario this ended up in not writing metadata at all. Rearrange code so we decide on proper metadata value earlier on and initialize timeline with correct one without need to update it late in the initialization process in .reconcile_with_remote	2022-11-17 23:53:31 +02:00
Christian Schwarz	3890acaf7f	stop using Option for UploadQueueInitialized::{latest_metadata,last_uploaded_consistent_lsn}	2022-11-17 21:34:16 +02:00
Christian Schwarz	f537a7a873	add explainer comments regarding UploadQueueInitialized::{latest_files,latest_metadata,last_uploaded_consistent_lsn}	2022-11-17 21:34:16 +02:00
Christian Schwarz	71bc45a21b	storage_sync: track upload queue initialization state using enum & fix last_uploaded_consistent_lsn initialization for empty remote storage As pointed out in `b8488e70a9 (r1024319620)` the following is wrong for the case where the remote storage is empty: metadata = whatever the local-ONLY metadata is ... upload_queue.latest_metadata = Some(metadata.clone()); upload_queue.last_uploaded_consistent_lsn = Some(metadata.disk_consistent_lsn()); The reason why it's wrong is that we return last_uploaded_consistent_lsn to safekeepers. So, we'd be returning an Lsn that is not yet uploaded to S3.	2022-11-17 21:34:16 +02:00
Christian Schwarz	decef74503	don't start background jobs if tenant has no timelines Before this change, test_pageserver_with_empty_tenants was failing at: assert loaded_tenant["state"] == { "Active": {"background_jobs_running": False} }, "Tenant {tenant_with_empty_timelines_dir} with empty timelines dir should be active and ready for timeline creation" because background_jobs_running was True instead of False. Personally I think we should simply always start the background loops and not bother, but let's punt this until after we've merged this PR.	2022-11-17 11:22:02 +02:00
Christian Schwarz	8aed805933	refactor: remove misleading log messages and redundant asserts from test_pageserver_with_empty_tenants At least since commit 'Return broken tenants due to non existing timelines dir (#2552) (#2575)' some of these messages are wrong.	2022-11-17 11:21:49 +02:00
Dmitry Rodionov	b8488e70a9	run clippy/fmt	2022-11-16 17:25:04 +02:00
Christian Schwarz	16fdd104ac	bring back HTTP API `has_in_progress_downloads` and `awaits_download` field, derived from TenantState	2022-11-16 14:57:26 +02:00
Christian Schwarz	bb6dbd2f43	crash-safe and resumable tenant attach This change introduces a marker file $repo/tenants/$tenant_id/attaching that is present while a tenant is in Attaching state. When pageserver restarts, we use it to resume the tenant attach operation. Before this change, a crash during tenant attach would result in one of the following: 1. crash upon restart due to missing metadata file (IIRC) 2. "successful" loading of the tenant with a subset of timelines	2022-11-16 14:57:26 +02:00
Dmitry Rodionov	c4c4558736	adjust allowed errors for test_broken_timeline	2022-11-16 14:57:26 +02:00
Heikki Linnakangas	bfdc09cf4a	Improve comments and checks in test_broken_timeline.py	2022-11-16 14:57:21 +02:00
Dmitry Rodionov	1839ce0545	properly merge remote metadata with local one	2022-11-16 14:42:15 +02:00
Dmitry Rodionov	8e04f0455e	add a bunch of .context calls	2022-11-16 14:42:15 +02:00
Dmitry Rodionov	6839773538	handle temporary files during layer map loading	2022-11-16 14:42:15 +02:00
Dmitry Rodionov	2a96c4cfcd	start walreceiver after reconcile_with_remote	2022-11-16 14:42:15 +02:00
Christian Schwarz	027cf22663	fix layer download during reconcile_with_remote reconcile_with_remote, in this PR, is supposed to download all the layer files synchronously. I don't know why, but, download_missing was 1. not doing the download at all for DeltaLayer 2. not using the right RelativePath for image layer This patch fixes both.	2022-11-16 14:42:15 +02:00
Christian Schwarz	f4daa877b5	load_remote_timeline already has an #[instrument]	2022-11-16 14:42:15 +02:00
Christian Schwarz	ed28ced3bc	schedule_index_upload: remove unused Option() around metadata param	2022-11-16 14:42:15 +02:00
Christian Schwarz	d7c120574b	dedup download_missing() calls	2022-11-16 14:42:15 +02:00
Dmitry Rodionov	c9188ffa67	fix wrong path handling in reconcile_with_remote, refine spans	2022-11-16 14:42:15 +02:00
Dmitry Rodionov	c631fa1f50	fix test_gc_cutoff test Also improve it so it fails earlier if something is not working, because otherwise it was failing only because of the timeout. And if the timeout was big enough, the test could even pass	2022-11-16 14:42:15 +02:00
Dmitry Rodionov	795c3ca131	Port per-tenant upload queue and startup changes from #2595 This is a part of https://github.com/neondatabase/neon/pull/2595. It takes out switch to per tenant upload queue and changes to pageserver startup sequence because these two are highly interleaved with each other. I'm still not happy with the size of the diff, but splitting it even more will probably consume even more time. Ideally we should do it, but this patch is already a step forward and should be easier to get this patch in yet still quite difficult. Mainly because of the size and fixes for existing concerns which will extend the diff even further Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-11-16 14:42:15 +02:00
Heikki Linnakangas	d013a2b227	Make test_gc_cutoff test more robust. Previously, if the failpoint was not reached for some reason, the test would only fail because it would reach the 5 minute timeout we have on all python tests. That's very subtle. Make it fail explicitly, if the failpoint is not hit on each iteration of the loop. Extracted from a larger PR, see https://github.com/neondatabase/neon/pull/2785/files#r1022765794	2022-11-16 13:24:02 +02:00
Heikki Linnakangas	3f93c6c6f0	Improve checks for broken tenants in test_broken_timeline.py - Refactor the code a little bit, removing the silly for-loop over a single element. - Make it more clear in log messages that the errors are expected - Check for a more precise error message "Failed to load delta layer" instead of just "extracting base backup failed".	2022-11-16 13:16:00 +02:00
Rory de Zoete	53267969d7	Preparation for ARM runners (#2751) Need to make the runner tag more specific else we inadvertently might run workloads on the wrong arch Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>	2022-11-16 11:28:57 +01:00
Andrés	c4b417ecdb	Introduce aws-sdk-rust as rusoto S3 replacement (#2802) - `aws-smithy-http`: Needed because of `SdkBody` see https://github.com/awslabs/smithy-rs/issues/1759 - `aws-types`: Needed because of `SharedCredentialsProvider`, the recommended way from aws is something like `aws_config::from_env().region("us-east-1").load().await` but that is problematic because of: - `sync -> async` in the creation of S3Client and I don't want to change the signature of any method in this class. - We do not need the four default steps in https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-config/src/default_provider/credentials.rs#L235 - `Hyper`: Similar to what Rusoto currently does in https://github.com/rusoto/rusoto/blob/master/rusoto/signature/src/signature.rs#L59 to stream the body, see also https://github.com/awslabs/aws-sdk-rust/discussions/361 Co-authored-by: andres <andres.rodriguez@outlook.es>	2022-11-16 11:28:37 +02:00
Joonas Koivunen	1d105727cb	perf: simple walredo bench (#2816) adds a simple walredo bench to allow some comparison of the walredo throughput. Cc: #1339, #2778	2022-11-16 11:13:56 +02:00
Heikki Linnakangas	4787a744c2	Add documentation page about error handling and logging. (#2681) Add a page to the internal documentation, on how we do error handling and logging.	2022-11-16 10:38:03 +02:00
Sergey Melnikov	ac3ccac56c	Add zenith-1-ps-4 and zenith-1-ps-5 (#2815)	2022-11-16 11:25:24 +04:00
Alexander Bayandin	638af96c51	postgres-v15: fix expected results for regress tests (#2822) Fix expected output for regress tests for Postgres 15. Required for https://github.com/neondatabase/neon/pull/2809	2022-11-15 22:32:12 +00:00
Kirill Bulatov	1e21ca1afe	Trim whitespaces off Lsn strings when parsing (#2827)	2022-11-15 22:39:44 +02:00
Heikki Linnakangas	46d30bf054	Check for errors in pageserver log after each test. If there are any unexpected ERRORs or WARNs in pageserver.log after test finishes, fail the test. This requires whitelisting the errors that are expected in each test, and there are also a few common errors that are printed by most tests, which are whitelisted in the fixture itself. With this, we don't need the special abort() call in testing mode, when compaction or GC fails. Those failures will print ERRORs to the logs, which will be picked up by this new mechanism. A bunch of errors are currently whitelisted that we probably shouldn't be emitting in the first place, but fixing those is out of scope for this commit, so I just left FIXME comments on them.	2022-11-15 18:47:28 +02:00
Heikki Linnakangas	d0105cea1f	Avoid errors when removing a timeline that's still active	2022-11-15 18:47:28 +02:00
Heikki Linnakangas	e44e4a699b	Downgrade log message, if client terminates COPY during basebackup import It's more or less expected from pageserver's point of view. Change the error kind to ConnectionReset, so that it gets logged at INFO level instead of ERROR.	2022-11-15 18:47:28 +02:00
Heikki Linnakangas	223834a420	Fix confusion between Postgres and pageserver connection string in test. We passed the pageserver's libpq endpoint URL as the 'compute_ctl --connstr' argument, but that was bogus: the --connstr URL is supposed to be the URL to the Postgres instance that compute_ctl launches and monitors, not to the pageserver. compute_ctl does need the pageserver URL too, but it is read from the cluster spec JSON, not --connstr. That was pretty confusing, as you got a lot of "unknown command" errors in the pageserver log, when compute_tools tries to run regular SQL commands on the pageserver. The test still passed, however, as it doesn't require the SQL commands to succeed. But to make this less confusing, use an invalid hostname instead, so that the queries will fail to even connect.	2022-11-15 18:47:28 +02:00
MMeent	01778e37cc	Address issues in the pagestore prefetch mechanism: (#2790) - Update vendored PostgreSQL to address prefetch issues - Make flushed state explicit in PrefetchState - Move flush logic into prefetch_wait_for, where possible - Clean up some prefetch state handling code in the various code elements handling state transitions. - Fix a race condition in neon_read_at_lsn where a hash entry pointer was used after the hash table was updated. This could result in incorrect state transitions and assertion failures after disconnects during prefetch_wait_for in that neon_read_at_lsn. Fixes #2780	2022-11-15 15:12:38 +01:00
Alexander Bayandin	03190a2161	GitHub Actions: Do not create Allure report for cancelled jobs (#2813) If a workflow is cancelled, do not delay its finishing by creating an allure report.	2022-11-15 10:27:59 +00:00
Kirill Bulatov	f87017c04d	Omit dependencies' debug info (#2803) Based on https://neondb.slack.com/archives/C0277TKAJCA/p1668079753506749 Co-authored-by: Arseny Sher <sher-ars@yandex.ru>	2022-11-14 12:44:41 +00:00
andres	c11cbf0f5c	fix test_compare_child_and_root_pgbench_perf to do a fair comparison	2022-11-13 21:03:54 +02:00
Heikki Linnakangas	f30ef00439	Stop building the legacy "compute-node" docker image. Before we had separate images for v14 and v15, the compute node image was called just "neondatabase/compute-node". It has been superseded by the "neondatabase/compute-node-v14" and "neondatabase/compute-node-v15" images. The old image is not used by the cloud console build or tests anymore.	2022-11-12 20:48:10 +02:00
Heikki Linnakangas	dbe5b52494	Avoid some vector-growing overhead. I saw this in 'perf' profile of a sequential scan: > - 31.93% 0.21% compute request pageserver [.] <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo > - 31.72% <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo > - 31.26% pageserver::walredo::PostgresRedoManager::apply_batch_postgres > + 7.64% <std::process::ChildStdin as std::io::Write>::write > + 6.17% nix::poll::poll > + 3.58% <std::process::ChildStderr as std::io::Read>::read > + 2.96% std::sync::condvar::Condvar::notify_one > + 2.48% std::sys::unix::locks::futex::Condvar::wait > + 2.19% alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle > + 1.14% std::sys::unix::locks::futex::Mutex::lock_contended > 0.67% __rust_alloc_zeroed > 0.62% __stpcpy_ssse3 > 0.56% std::sys::unix::locks::futex::Mutex::wake Note the 'do_reserve_and_handle' overhead. That's caused by having to grow the buffer used to construct the WAL redo request. This commit eliminates that overhead. It's only about 2% of the overall CPU usage, but every little helps. Also reuse the temp buffer when reading records from a DeltaLayer, and call Vec::reserve to avoid growing a buffer when reading a blob across pages. I saw a reduction from 2% to 1% of CPU spent in do_reserve_and_handle in that codepath, but that's such a small change that it could be just noise. Seems like it shouldn't hurt though.	2022-11-12 18:52:25 +02:00
Heikki Linnakangas	4131a6efae	Remove unused Dockerfile.compute-node.legacy. The cloud end-to-end tests use the docker images built by the neon PR now, and don't need this legacy Dockerfile anymore.	2022-11-12 18:51:51 +02:00
Kirill Bulatov	03695261fc	Test storage Docker images (#2767) Closes https://github.com/neondatabase/neon/issues/2697 Example: https://github.com/neondatabase/neon/actions/runs/3416774593/jobs/5688394855 Adds a set of tests on the storage Docker images before they are pushed to the public registries: * tests that pageserver binary has the correct version string (other binaries are built with the same library, so it should be enough to test one) * tests that the compose file set-up works and all components are able to start and perform a single SQL query (CREATE TABLE)	2022-11-11 19:42:26 +02:00
bojanserafimov	7fd88fab59	Trace read requests (#2762)	2022-11-10 16:43:04 -05:00
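
Commit 9d273c840a above replaces a two-state bool with a three-state enum so that a failed flush request can say whether the flush loop never started or has already exited. The sketch below illustrates that idea only; it is not the pageserver code. The variant names NotStarted, Running and Exited come from the commit message, while the `FlushLoopState` and `Timeline` types, the `request_flush` helper, and the error strings are hypothetical.

```rust
/// Flush loop lifecycle. Variant names follow the commit message;
/// the type name itself is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FlushLoopState {
    NotStarted,
    Running,
    Exited,
}

/// Hypothetical stand-in for the real timeline type.
struct Timeline {
    flush_loop_state: FlushLoopState,
}

impl Timeline {
    /// Hypothetical helper: accept a flush request only while the loop runs,
    /// and report which of the two non-running states we are in otherwise.
    fn request_flush(&self) -> Result<(), String> {
        match self.flush_loop_state {
            FlushLoopState::Running => Ok(()),
            FlushLoopState::NotStarted => Err("flush loop was never started".to_string()),
            FlushLoopState::Exited => Err("flush loop has already exited".to_string()),
        }
    }
}

fn main() {
    let states = [
        FlushLoopState::NotStarted,
        FlushLoopState::Running,
        FlushLoopState::Exited,
    ];
    for state in states {
        let timeline = Timeline { flush_loop_state: state };
        println!("{:?}: {:?}", state, timeline.request_flush());
    }
}
```

With a bool, the NotStarted and Exited cases would collapse into the same "not running" error, which is the ambiguity the commit message describes.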

2354 Commits