rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-07 06:00:38 +00:00

Author	SHA1	Message	Date
Andrey Taranik	fe445b2945	more parallelism	2024-08-16 13:21:19 +03:00
Andrey Taranik	7f49f45a53	tune parallelism	2024-08-16 13:08:36 +03:00
Andrey Taranik	980b212789	try more parallelism	2024-08-16 11:56:11 +03:00
Andrey Taranik	2b17a03911	arm64 80 cores debian 12	2024-08-16 10:20:23 +03:00
Andrey Taranik	b50a9d84d4	Merge branch 'main' into cicd/debug-regress-tests-on-arm	2024-08-16 09:06:09 +03:00
Joonas Koivunen	4763a960d1	chore: log if we have an open layer or any frozen on shutdown (#8740 ) Some benchmarks are failing with a "long" flushing, which might be because there is a queue of in-memory layers, or something else. Add logging to narrow it down. Private slack DM ref: https://neondb.slack.com/archives/D049K7HJ9JM/p1723727305238099	2024-08-16 06:10:05 +01:00
Sasha Krassovsky	df086cd139	Add logical replication test to exercise snapfiles (#8364 )	2024-08-15 15:34:45 -07:00
Alexander Bayandin	69cb1ee479	CI(replication-tests): store test results & change notification channel (#8687 ) ## Problem We want to store Nightly Replication test results in the database and notify the relevant Slack channel about failures ## Summary of changes - Store test results in the database - Notify `on-call-compute-staging-stream` about failures	2024-08-15 22:41:58 +01:00
Alexander Bayandin	4e58fd9321	CI(label-for-external-users): use CI_ACCESS_TOKEN (#8738 ) ## Problem `secrets.GITHUB_TOKEN` (with any permissions) is not enough to get a user's membership info if they decide to hide it. ## Summary of changes - Use `secrets.CI_ACCESS_TOKEN` for `gh api` call - Use `pull_request_target` instead of `pull_request` event to get access to secrets	2024-08-15 18:37:15 +01:00
Andrey Taranik	6fd3c9daa5	trigger build	2024-08-15 20:27:24 +03:00
Andrey Taranik	409171ab08	try 16 cores	2024-08-15 16:46:57 +03:00
Andrey Taranik	f26240c9dc	Merge branch 'main' into cicd/debug-regress-tests-on-arm	2024-08-15 16:46:00 +03:00
Konstantin Knizhnik	f087423a01	Handle reload config file request in LR monitor (#8732 ) ## Problem Logical replication BGW checking replication lag is not reloading config ## Summary of changes Add handling of reload config request Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-15 16:28:25 +03:00
Joonas Koivunen	24d347f50b	storcon: use tracing for logging panics (#8734 ) this gives spans for panics, and does not clobber loki output by writing to stderr while all of the other logging is to stdout. See: #3475	2024-08-15 16:27:07 +03:00
Andrey Taranik	1c771267ab	Merge branch 'main' into cicd/debug-regress-tests-on-arm	2024-08-15 16:15:17 +03:00
Joonas Koivunen	52641eb853	storcon: add spans to drain/fill ops (#8735 ) this way we do not need to repeat the %node_id everywhere, and we get no stray messages in logs from within the op.	2024-08-15 15:30:04 +03:00
Andrey Taranik	9f1b7b72ed	Merge branch 'main' into cicd/debug-regress-tests-on-arm	2024-08-15 14:56:37 +03:00
Andrey Taranik	2476f7ef74	try arm64 with 80 cores	2024-08-15 14:56:14 +03:00
Joonas Koivunen	d9a57aeed9	storcon: deny external node configuration if an operation is ongoing (#8727 ) Per #8674, disallow node configuration while drain/fill are ongoing. Implement it by adding an HTTP-only wrapper `Service::external_node_configure` which checks for an existing operation before configuring. Additionally: - allow cancelling drain/fill after a pageserver has restarted and transitioned to WarmingUp Fixes: #8674	2024-08-15 10:54:05 +01:00
Alexander Bayandin	a9c28be7d0	fix(pageserver): allow unused_imports in download.rs on macOS (#8733 ) ## Problem On macOS, clippy fails with the following error: ``` error: unused import: `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` --> pageserver/src/tenant/remote_timeline_client/download.rs:26:5 \| 26 \| use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt; \| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ \| = note: `-D unused-imports` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_imports)]` ``` Introduced in https://github.com/neondatabase/neon/pull/8717 ## Summary of changes - allow `unused_imports` for `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` on macOS in download.rs	2024-08-15 10:06:28 +01:00
Vlad Lazar	fef77b0cc9	safekeeper: consider partial uploads when pulling timeline (#8628 ) ## Problem The control file contains the id of the safekeeper that uploaded it. Previously, when sending a snapshot of the control file to another sk, it would eventually be gc-ed by the receiving sk. This is incorrect because the original sk might still need it later. ## Summary of Changes When sending a snapshot and the control file contains an uploaded segment: * Create a copy of the segment in s3 with the destination sk in the object name * Tweak the streamed control file to point to the object created in the previous step Note that the snapshot endpoint now has to know the id of the requestor, so the api has been extended to include the node id of the destination sk. Closes https://github.com/neondatabase/neon/issues/8542	2024-08-15 09:02:33 +01:00
Andrey Taranik	f555cb3970	try cloud arm64 servers	2024-08-15 03:36:00 +03:00
Andrey Taranik	10a726503a	Merge branch 'main' into cicd/debug-regress-tests-on-arm	2024-08-15 03:34:07 +03:00
Christian Schwarz	168913bdf0	refactor(write path): newtype to enforce use of fully initialized slices (#8717 ) The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird. The new `FullSlice` newtype is better. See the doc comment for details. The naming is not ideal, but we'll clean that up in a future refactoring where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do the following: * tokio_epoll_uring::Slice is removed * `FullSlice` becomes `tokio_epoll_uring::IoBufView` * new type `tokio_epoll_uring::IoBufMutView` for the current `tokio_epoll_uring::Slice<IoBufMut>` Context ------- I did this work in preparation for https://github.com/neondatabase/neon/pull/8537. There, I'm changing the type that the `inmemory_layer.rs` passes to `DeltaLayerWriter::put_value_bytes` and thus it seemed like a good opportunity to make this cleanup first.	2024-08-14 21:57:17 +02:00
Alexander Bayandin	aa2e16f307	CI: misc cleanup & fixes (#8559 ) ## Problem A bunch of small fixes and improvements for CI, that are too small to have a separate PR for them ## Summary of changes - CI(build-and-test): fix parenthesis - CI(actionlint): fix path to workflow file - CI: remove default args from actions/checkout - CI: remove `gen3` label, using a couple `self-hosted` + `small{,-arm64}`/`large{,-arm64}` is enough - CI: prettify Slack messages, hide links behind text messages - CI(build-and-test): add more dependencies to `conclusion` job	2024-08-14 17:56:59 +01:00
Alexander Bayandin	70b18ff481	CI(neon-image): add ARM-specific RUSTFLAGS (#8566 ) ## Problem It's recommended that a couple of additional RUSTFLAGS be set up to improve the performance of Rust applications on AWS Graviton. See `57dc813626/rust.md` Note: Apple Silicon is compatible with neoverse-n1: ``` $ clang --version Apple clang version 15.0.0 (clang-1500.3.9.4) Target: arm64-apple-darwin23.6.0 Thread model: posix InstalledDir: /Applications/Xcode_15.4.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin $ $ clang --print-supported-cpus 2>&1 \| grep neoverse- neoverse-512tvb neoverse-e1 neoverse-n1 neoverse-n2 neoverse-v1 neoverse-v2 ``` ## Summary of changes - Add `-Ctarget-feature=+lse -Ctarget-cpu=neoverse-n1` to RUSTFLAGS for ARM images	2024-08-14 17:03:21 +01:00
Andrey Taranik	7ba86e15fa	debug arm64 builds	2024-08-14 18:57:28 +03:00
Andrey Taranik	7ba42bfdb4	Merge branch 'main' into cicd/debug-regress-tests-on-arm	2024-08-14 18:33:19 +03:00
Andrey Taranik	4f8a39d6c6	try metal arm64 runners	2024-08-14 17:46:13 +03:00
Joonas Koivunen	60fc1e8cc8	chore: even more responsive compaction cancellation (#8725 ) Some benchmarks and tests might still fail because of #8655 (tracked in #8708) because we are not fast enough to shut down ([one evidence]). Partially this is explained by the current validation mode of streaming k-merge, but otherwise because that is where we use a lot of time in compaction. Outside of L0 => L1 compaction, the image layer generation is already guarded by vectored reads doing cancellation checks. 32768 is a wild guess based on looking how many keys we put in each layer in a bench (1-2 million), but I assume it will be good enough divisor. Doing checks more often will start showing up as contention which we cannot currently measure. Doing checks less often might be reasonable. [one evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/ Earlier PR: #8706.	2024-08-14 14:48:15 +01:00
Alexander Bayandin	54c5da3981	CI(build-and-test): set RUSTFLAGS for ARM	2024-08-14 13:57:20 +01:00
Alexander Bayandin	c1378dc43b	CI: don't collect code coverage on arm64 runners	2024-08-14 13:53:16 +01:00
Alexander Bayandin	50b9fb430a	test_runner: add __arch parameter to Allure report	2024-08-14 13:53:16 +01:00
Alexander Bayandin	486eaba028	CI(build-and-test): run regression tests on arm	2024-08-14 13:53:16 +01:00
Alexander Bayandin	d4d70cc314	CI(build-and-test): make pg-versions configurable	2024-08-14 13:53:16 +01:00
Alexander Bayandin	176eefa47a	CI(regress-tests): run debug builds only with the latest Postgres version	2024-08-14 13:53:16 +01:00
Alexander Bayandin	36c1719a07	CI(build-neon): fix accidental neon rebuild on `cargo test` (#8721 ) ## Problem During `Run rust tests` step (for debug builds), we accidentally rebuild neon twice (by `cargo test --doc` and by `cargo nextest run`). It happens because we don't set `cov_prefix` for the `cargo test --doc` command, which triggers rebuilding with different build flags, and one more rebuild by `cargo nextest run`. ## Summary of changes - Set `cov_prefix` for `cargo test --doc` to prevent unneeded rebuilds	2024-08-14 13:38:25 +01:00
John Spray	abb53ba36d	storcon_cli: don't clobber heatmap interval when setting eviction (#8722 ) ## Problem This command is kind of a hack, used when we're migrating large tenants and want to get their resident size down. It sets the tenant config to a fixed value, which omitted heatmap_period, so caused secondaries to get out of date. ## Summary of changes - Set heatmap period to the same 300s default that we use elsewhere when updating eviction settings This is not as elegant as some general purpose partial modification of the config, but it practically makes the command safer to use.	2024-08-14 13:37:03 +01:00
Conrad Ludgate	a7028d92b7	proxy: start of jwk cache (#8690 ) basic JWT implementation that caches JWKs and verifies signatures. this code is currently not reachable from proxy, I just wanted to get something merged in.	2024-08-14 13:35:29 +01:00
Joonas Koivunen	6c9e3c9551	refactor: error/anyhow::Error wrapping (#8697 ) We can get CompactionError::Other(Cancelled) via the error handling with a few ways. [evidence](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8655/10301613380/index.html#suites/cae012a1e6acdd9fdd8b81541972b6ce/653a33de17802bb1/). Hopefully fix it by: 1. replace the `map_err` which hid the `GetReadyAncestorError::Cancelled` with `From<GetReadyAncestorError> for GetVectoredError` conversion 2. simplifying the code in pgdatadir_mapping to eliminate the token anyhow wrapping for deserialization errors 3. stop wrapping GetVectoredError as anyhow errors 4. stop wrapping PageReconstructError as anyhow errors Additionally, produce warnings if we treat any other error (as was legal before this PR) as missing key. Cc: #8708.	2024-08-14 12:45:56 +01:00
Alexander Bayandin	fc3d372f3a	CI(label-for-external-users): check membership using GitHub API (#8724 ) ## Problem `author_association` doesn't properly work if a GitHub user decides not to show affiliation with the org in their profile (i.e. if it's private) ## Summary of changes - Call `/orgs/ORG/members/USERNAME` API to check whether a PR/issue author is a member of the org	2024-08-14 12:27:52 +01:00
John Spray	19d69d515c	pageserver: evict covered layers earlier (#8679 ) ## Problem When pageservers do compaction, they frequently create image layers that make earlier layers un-needed for reads, but then keep those earlier layers around for 24 hours waiting for time-based eviction to expire them. Now that we track layer visibility, we can use it as an input to eviction, and avoid the 24 hour "disk bump" that happens around pageserver restarts. ## Summary of changes - During time-based eviction, if a layer is marked Covered, use the eviction period as the threshold: i.e. these layers get to remain resident for at least one iteration of the eviction loop, but then get evicted. With current settings this means they get evicted after 1h instead of 24h. - During disk usage eviction, prioritize evicting covered layers above all other layers. Caveats: - Using the period as the threshold for time based eviction in this case is a bit of a hack, but it avoids adding yet another configuration property, and in any case the value of a new property would be somewhat arbitrary: there's no "right" length of time to keep covered layers around just in case. - We had previously planned on removing time-based eviction: this change would motivate us to keep it around, but we can still simplify the code later to just do the eviction of covered layers, rather than applying a TTL policy to all layers.	2024-08-14 12:10:15 +01:00
Joonas Koivunen	485d76ac62	timeline_detach_ancestor: adjust error handling (#8528 ) With additional phases from #8430 the `detach_ancestor::Error` became untenable. Split it up into phases, and introduce laundering for remaining `anyhow::Error` to propagate them as most often `Error::ShuttingDown`. Additionally, complete FIXMEs. Cc: #6994	2024-08-14 10:16:18 +01:00
John Spray	4049d2b7e1	scrubber: fix spurious "Missed some shards" errors (#8661 ) ## Problem The storage scrubber was reporting warnings for lots of timelines like: ``` WARN Missed some shards at count ShardCount(0) tenant_id=25eb7a83d9a2f90ac0b765b6ca84cf4c ``` These were spurious: these tenants are fine. There was a bug in accumulating the ShardIndex for each tenant, whereby multiple timelines would lead us to add the same ShardIndex more than once. Closes: #8646 ## Summary of changes - Accumulate ShardIndex in a BTreeSet instead of a Vec - Extend the test to reproduce the issue	2024-08-14 09:29:06 +01:00
Konstantin Knizhnik	7a1736ddcf	Preserve HEAP_COMBOCID when restoring t_cid from WAL (#8503 ) ## Problem See https://github.com/neondatabase/neon/issues/8499 ## Summary of changes Save HEAP_COMBOCID flag in WAL and do not clear it in redo handlers. Related Postgres PRs: https://github.com/neondatabase/postgres/pull/457 https://github.com/neondatabase/postgres/pull/458 https://github.com/neondatabase/postgres/pull/459 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-08-14 08:13:20 +03:00
Tristan Partin	c624317b0e	Decode the database name in SQL/HTTP connections A url::Url does not hand you back a URL decoded value for path values, so we must decode them ourselves. Link: https://docs.rs/url/2.5.2/url/struct.Url.html#method.path Link: https://docs.rs/url/2.5.2/url/struct.Url.html#method.path_segments Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-08-13 16:32:58 -05:00
Tristan Partin	0f43b7c51b	Loosen type on PgProtocol::safe_psql(queries:) Using Iterable allows us to also use tuples, among other things. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-08-13 16:32:58 -05:00
Joonas Koivunen	6d6e2c6a39	feat(detach_ancestor): better retries with persistent gc blocking (#8430 ) With the persistent gc blocking, we can now retry reparenting timelines which had failed for whatever reason on the previous attempt(s). Restructure the detach_ancestor into three phases: - prepare (insert persistent gc blocking, copy lsn prefix, layers) - detach and reparent - reparenting can fail, so we might need to retry this portion - complete (remove persistent gc blocking) Cc: #6994	2024-08-13 18:51:51 +01:00
Joonas Koivunen	87a5d7db9e	test: do better job of shutting everything down (#8714 ) After #8655 we've had a few issues (mostly tracked on #8708) with the graceful shutdown. In order to shutdown more of the processes and catch more errors, for example, from all pageservers, do an immediate shutdown for those nodes which fail the initial (possibly graceful) shutdown. Cc: #6485	2024-08-13 18:49:50 +01:00
Peter Bendel	9d2276323d	Benchmarking tests: automatically restore Neon reuse databases, too and migrate to pg16 (#8707 ) ## Problem We use a set of Neon reuse databases in benchmarking.yml which are still using pg14. Because we want to compare apples to apples and have migrated the AWS reuse clusters to pg16 we should also use pg16 for Neon. ## Summary of changes - Automatically restore the test databases for Neon project	2024-08-13 19:36:39 +02:00

5900 Commits