rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-15 17:32:56 +00:00

Author	SHA1	Message	Date
Arpad Müller	5e32cce2d9	Add script for plots	2024-06-22 18:14:05 +02:00
Arpad Müller	6b67135bd3	Fix offset stream issue	2024-06-18 17:50:47 +02:00
Arpad Müller	66d3bef947	More printing and assertions	2024-06-18 17:50:47 +02:00
Arpad Müller	e749e73ad6	Add the compression algo name	2024-06-18 17:50:47 +02:00
Arpad Müller	c5034f0e45	Fix build	2024-06-18 17:50:47 +02:00
Arpad Müller	d2533c06a5	Always delete the file, even on error	2024-06-18 17:50:47 +02:00
Arpad Müller	0fdeca882c	Fix tests	2024-06-18 17:50:47 +02:00
Arpad Müller	f1bebda713	Fix failing test	2024-06-18 17:50:47 +02:00
Arpad Müller	983972f812	Add tests for compression	2024-06-18 17:50:47 +02:00
Arpad Müller	2ea8d1b151	Shutdown instead of flush	2024-06-18 17:50:47 +02:00
Arpad Müller	2f70221503	Don't forget the flush	2024-06-18 17:50:47 +02:00
Arpad Müller	07bd0ce69e	Also measure decompression time	2024-06-18 17:50:47 +02:00
Arpad Müller	40e79712eb	Add decompression	2024-06-18 17:50:47 +02:00
Arpad Müller	88b24e1593	Move constants out into file	2024-06-18 17:50:47 +02:00
Arpad Müller	8fcdc22283	Add stats info	2024-06-18 17:50:47 +02:00
Arpad Müller	2eb8b428cc	Also support the generation-less legacy naming scheme	2024-06-18 17:50:47 +02:00
Arpad Müller	e6a0e7ec61	Add zstd with low compression quality	2024-06-18 17:50:47 +02:00
Arpad Müller	843d996cb1	More precise printing	2024-06-18 17:50:47 +02:00
Arpad Müller	c824ffe1dc	Add ZstdHigh compression mode	2024-06-18 17:50:47 +02:00
Arpad Müller	dadbd87ac1	Add percent to output	2024-06-18 17:50:47 +02:00
Arpad Müller	0e667dcd93	more yielding	2024-06-18 17:50:47 +02:00
Arpad Müller	14447b98ce	Yield in between	2024-06-18 17:50:47 +02:00
Arpad Müller	8fcb236783	Increase listing limit	2024-06-18 17:50:47 +02:00
Arpad Müller	a9963db8c3	Create timeline dir in temp location if not existent	2024-06-18 17:50:47 +02:00
Arpad Müller	0c500450fe	Print error better	2024-06-18 17:50:47 +02:00
Arpad Müller	3182c3361a	Corrections	2024-06-18 17:50:47 +02:00
Arpad Müller	80803ff098	Printing tweaks	2024-06-18 17:50:47 +02:00
Arpad Müller	9b74d554b4	Remove generation suffix	2024-06-18 17:50:47 +02:00
Arpad Müller	fce252fb2c	Rename dest_path to tmp_dir	2024-06-18 17:50:47 +02:00
Arpad Müller	f132658bd9	Some prints	2024-06-18 17:50:47 +02:00
Arpad Müller	d030cbffec	Print number of keys	2024-06-18 17:50:47 +02:00
Arpad Müller	554a6bd4a6	Two separate commands. Easier to have an overview	2024-06-18 17:50:47 +02:00
Arpad Müller	9850794250	Remove layer file after done	2024-06-18 17:50:47 +02:00
Arpad Müller	f5baac2579	clippy	2024-06-18 17:50:47 +02:00
Arpad Müller	2d37db234a	Add mode to compare multiple files from a tenant	2024-06-18 17:50:47 +02:00
Arpad Müller	8745c0d6f2	Add a pagectl tool to recompress image layers	2024-06-18 17:50:47 +02:00
Christian Schwarz	6c6a7f9ace	[v2] Include openssl and ICU statically linked (#8074) We had to revert the earlier static linking change due to libicu version incompatibilities: - original PR: https://github.com/neondatabase/neon/pull/7956 - revert PR: https://github.com/neondatabase/neon/pull/8003 Specifically, the problem manifests for existing projects as the error ``` DETAIL: The collation in the database was created using version 153.120.42, but the operating system provides version 153.14.37. ``` So, this PR reintroduces the original change but with the exact same libicu version as in Debian `bullseye`, i.e., the libicu version that we're using today. This avoids the version incompatibility. Additional changes made by Christian ==================================== - `hashFiles` can take multiple arguments, use that feature - validation of the libicu tarball checksum - parallel build (`-j $(nproc)`) for openssl and libicu Follow-ups ========== Debian bullseye has a few patches on top of libicu: https://sources.debian.org/patches/icu/67.1-7/ We still need to decide whether to include these patches or not. => https://github.com/neondatabase/cloud/issues/14527 Eventually, we'll have to figure out an upgrade story for libicu. That work is tracked in epic https://github.com/neondatabase/cloud/issues/14525. The OpenSSL version in this PR is arbitrary. We should use `1.1.1w` + Debian patches if applicable. See https://github.com/neondatabase/cloud/issues/14526. Longer-term: * https://github.com/neondatabase/cloud/issues/14519 * https://github.com/neondatabase/cloud/issues/14525 Refs ==== Co-authored-by: Christian Schwarz <christian@neon.tech> refs https://github.com/neondatabase/cloud/issues/12648 --------- Co-authored-by: Rahul Patil <rahul@neon.tech>	2024-06-18 09:42:22 +02:00
MMeent	e729f28205	Fix log rates (#8035) ## Summary of changes - Stop logging HealthCheck message passing at INFO level (moved to DEBUG) - Stop logging /status accesses at INFO (moved to DEBUG) - Stop logging most occurrences of `missing config file "compute_ctl_temp_override.conf"` - Log memory usage only when the data has changed significantly, or if we've not recently logged the data, rather than always every 2 seconds.	2024-06-17 18:57:49 +00:00
Alexander Bayandin	b6e1c09c73	CI(check-build-tools-image): change build-tools image persistent tag (#8059) ## Problem We don't rebuild `build-tools` image for changes in a workflow that builds this image itself (`.github/workflows/build-build-tools-image.yml`) or in a workflow that determines which tag to use (`.github/workflows/check-build-tools-image.yml`) ## Summary of changes - Use a hash of `Dockerfile.build-tools` and workflow files as a persistent tag instead of using a commit sha.	2024-06-17 12:47:20 +01:00
Vlad Lazar	16d80128ee	storcon: handle entire cluster going unavailable correctly (#8060) ## Problem A period of unavailability for all pageservers in a cluster produced the following fallout in staging: all tenants became detached and required manual operation to re-attach. Manually restarting the storage controller re-attached all tenants due to a consistency bug. Turns out there are two related bugs which caused the issue: 1. Pageserver re-attach can be processed before the first heartbeat. Hence, when handling the availability delta produced by the heartbeater, `Node::get_availability_transition` claims that there's no need to reconfigure the node. 2. We would still attempt to reschedule tenant shards when handling offline transitions even if the entire cluster is down. This puts tenant shards into a state where the reconciler believes they have to be detached (no pageserver shows up in their intent state). This is doubly wrong because we don't mark the tenant shards as detached in the database, thus causing memory vs database consistency issues. Luckily, this bug allowed all tenant shards to re-attach after restart. ## Summary of changes * For (1), abuse the fact that re-attach requests do not contain a utilisation score and use that to differentiate from a node that replied to heartbeats. * For (2), introduce a special case that skips any rescheduling if the entire cluster is unavailable. * Update the storage controller heartbeat test with an extra scenario where the entire cluster goes for lunch. Fixes https://github.com/neondatabase/neon/issues/8044	2024-06-17 11:40:35 +01:00
Arseny Sher	2ba414525e	Install rust binaries before running rust tests. cargo test (or nextest) might rebuild the binaries with different features/flags, so do install immediately after the build. Triggered by the particular case of nextest invocations missing $CARGO_FEATURES, which recompiled safekeeper without 'testing' feature which made python tests needing it (failpoints) not run in the CI. Also add CARGO_FEATURES to the nextest runs anyway because there doesn't seem to be an important reason not to.	2024-06-17 06:23:32 +03:00
Peter Bendel	46210035c5	add halfvec indexing and queries to periodic pgvector performance tests (#8057) ## Problem halfvec data type was introduced in pgvector 0.7.0 and is popular because it allows smaller vectors, smaller indexes and potentially better performance. So far we have not tested halfvec in our periodic performance tests. This PR adds halfvec indexing and halfvec queries to the test.	2024-06-14 18:36:50 +02:00
Alex Chi Z	81892199f6	chore(pageserver): vectored get target_keyspace directly accums (#8055) follow up on https://github.com/neondatabase/neon/pull/7904 avoid a layer of indirection introduced by `Vec<Range<Key>>` Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-14 11:57:58 -04:00
Alexander Bayandin	83eb02b07a	CI: downgrade docker/setup-buildx-action (#8062) ## Problem I've bumped `docker/setup-buildx-action` in #8042 because I wasn't able to reproduce the issue from #7445. But now the issue appears again in https://github.com/neondatabase/neon/actions/runs/9514373620/job/26226626923?pr=8059 The steps to reproduce aren't clear, it required `docker/setup-buildx-action@v3` and rebuilding the image without cache, probably ## Summary of changes - Downgrade `docker/setup-buildx-action@v3` to `docker/setup-buildx-action@v2`	2024-06-14 11:43:51 +00:00
Arseny Sher	a71f58e69c	Fix test_segment_init_failure. Graceful shutdown broke it.	2024-06-14 14:24:15 +03:00
Conrad Ludgate	e6eb0020a1	update rust to 1.79.0 (#8048) ## Problem Rust 1.79 enables new lints by default ## Summary of changes * update to rust 1.79 * `s/default_features/default-features/` * fix proxy dead code. * fix pageserver dead code.	2024-06-14 13:23:52 +02:00
John Spray	eb0ca9b648	pageserver: improved synthetic size & find_gc_cutoff error handling (#8051) ## Problem This PR refactors some error handling to avoid log spam on tenant/timeline shutdown. - "ignoring failure to find gc cutoffs: timeline shutting down." logs (https://github.com/neondatabase/neon/issues/8012) - "synthetic_size_worker: failed to calculate synthetic size for tenant ...: Failed to refresh gc_info before gathering inputs: tenant shutting down", for example here: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8049/9502988669/index.html#suites/3fc871d9ee8127d8501d607e03205abb/1a074a66548bbcea Closes: https://github.com/neondatabase/neon/issues/8012 ## Summary of changes - Refactor: Add a PageReconstructError variant to GcError: this is the only kind of error that find_gc_cutoffs can emit. - Functional change: only ignore shutdown PageReconstructError variant: for other variants, treat it as a real error - Refactor: add a structured CalculateSyntheticSizeError type and use it instead of anyhow::Error in synthetic size calculations - Functional change: while iterating through timelines gathering logical sizes, only drop out if the whole tenant is cancelled: individual timeline cancellations indicate deletion in progress and we can just ignore those.	2024-06-14 11:08:11 +01:00
John Spray	6843fd8f89	storage controller: always wait for tenant detach before delete (#8049) ## Problem This test could fail with a timeout waiting for tenant deletions. Tenant deletions could get tripped up on nodes transitioning from offline to online at the moment of the deletion. In a previous reconciliation, the reconciler would skip detaching a particular location because the node was offline, but then when we do the delete the node is marked as online and can be picked as the node to use for issuing a deletion request. This hits the "Unexpectedly still attached" path, which would still work if the caller kept calling DELETE, but if a caller does a DELETE, GET, GET, GET poll, then it doesn't work because the GET calls fail after we've marked the tenant as detached. ## Summary of changes Fix the undesirable storage controller behavior highlighted by this test failure: - Change tenant deletion flow to _always_ wait for reconciliation to succeed: it was unsound to proceed and return 202 if something was still attached, because after the 202 callers can no longer GET the tenant. Stabilize the test: - Add a reconcile_until_idle to the test, so that it will not have reconciliations running in the background while we mark a node online. This test is not meant to be a chaos test: we should test that kind of complexity elsewhere. - This reconcile_until_idle also fixes another failure mode where the test might see a None for a tenant location because a reconcile was mutating it (https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7288/9500177581/index.html#suites/8fc5d1648d2225380766afde7c428d81/4acece42ae00c442/) It remains the case that a motivated tester could produce a situation where a DELETE gives a 500, when precisely the wrong node transitions from offline to available at the precise moment of a deletion (but the 500 is better than returning 202 and then failing all subsequent GETs). Note that nodes don't go through the offline state during normal restarts, so this is super rare. We should eventually fix this by making DELETE to the pageserver implicitly detach the tenant if it's attached, but that should wait until nobody is using the legacy-style deletes (the ones that use 202 + polling)	2024-06-14 10:37:30 +01:00
Alexander Bayandin	edc900028e	CI: Update outdated GitHub Actions (#8042) ## Problem We have some outdated actions in the CI pipeline, and GitHub complains about some of them. ## Summary of changes - Update `actions/checkout@1` (a really old one) in `vm-compute-node-image` - Update `actions/checkout@3` in `build-build-tools-image` - Update `docker/setup-buildx-action` in all workflows / jobs, it was downgraded in https://github.com/neondatabase/neon/pull/7445, but it seems it works fine now	2024-06-14 10:24:13 +01:00
Heikki Linnakangas	789196572e	Fix test_replica_query_race flakiness (#8038) This failed once with `relation "test" does not exist` when trying to run the query on the standby. It's possible that the standby is started before the CREATE TABLE is processed in the pageserver, and the standby opens up for queries before it has received the CREATE TABLE transaction from the primary. To fix, wait for the standby to catch up to the primary before starting to run the queries. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8025/9483658488/index.html	2024-06-14 11:51:12 +03:00

5466 Commits