rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 13:40:37 +00:00

Author	SHA1	Message	Date
Alex Chi Z.	e2db76b9be	feat(pageserver): ondemand download reason observability (#11780 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11615 ## Summary of changes We don't understand the root cause of why we get resident size surge every now and then. This patch adds observability for that, and in the next week, we might have a better understanding of what's going on. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-30 16:04:00 +00:00
Alex Chi Z.	6b4b8e0d8b	fix(pageserver): do not increase basebackup err counter when shutdown (#11778 ) ## Problem We occasionally see basebackup errors alerts but there were no errors logged. Looking at the code, the only codepath that will cause this is shutting down. ## Summary of changes Do not increase any counter (ok/err) when basebackup request gets cancelled due to shutdowns. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-30 15:50:12 +00:00
Konstantin Knizhnik	1d68577fbd	Check target slot state in prefetch_wait_for (#11779 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1745599814030679 Assume the following scenario: `prefetch_wait_for` runs `CHECK_FOR_INTERRUPTS`, which tries to load prefetch responses. On error it calls `pageserver_disconnect`, which aborts all in-flight requests. But this failure is not detected by `prefetch_wait_for`, which returns true. As a result, `communicator_read_at_lsnv` assumes that the slot has been received, but since asserts are disabled in prod, this is not actually checked. It then tries to interpret the response and ... SIGSEGV ## Summary of changes Check target slot state in `prefetch_wait_for`. Resolves https://github.com/neondatabase/cloud/issues/28258 Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-30 12:44:59 +00:00
Arseny Sher	60f63c076f	Make safekeeper proto version 3 default (#11518 ) ## Problem We have been running compute <-> sk protocol version 3 for a while on staging with no issues observed, and want to fully migrate to it eventually. ## Summary of changes Let's make v3 the default. ref https://github.com/neondatabase/neon/issues/10326 --------- Co-authored-by: Arpad Müller <arpad@neon.tech>	2025-04-30 12:23:20 +00:00
Mikhail Kot	8da4ec9740	Postgres metrics for stuck getpage requests (#11710 ) https://github.com/neondatabase/neon/issues/10327 Resolves: #11720 New metrics: - `compute_getpage_stuck_requests_total` - `compute_getpage_max_inflight_stuck_time_ms`	2025-04-30 12:01:41 +00:00
Em Sharnoff	b48404952d	Bump vm-builder: v0.42.2 -> v0.46.0 (#11782 ) Bumped to pick up the changes from neondatabase/autoscaling#1366 — specifically including `uname` in the logs. Other changes included: * neondatabase/autoscaling#1301 * neondatabase/autoscaling#1296	2025-04-30 11:32:25 +00:00
devin-ai-integration[bot]	1d06172d59	pageserver: remove resident size from billing metrics (#11699 ) This is a rebase of PR #10739 by @henryliu2014 on the current main branch. ## Problem pageserver: remove resident size from billing metrics Fixes #10388 ## Summary of changes The following changes have been made to remove resident size from billing metrics: * removed the metric "resident_size" and related codes in consumption_metrics/metrics.rs * removed the item of the description of metric "resident_size" in consumption_metrics.md * refactored the metric "resident_size" related test case Requested by: John Spray (john@neon.tech) --------- Co-authored-by: liuheqing <hq.liu@qq.com> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: John Spray <john@neon.tech>	2025-04-29 18:34:56 +00:00
Elizabeth Murray	a08c1a23eb	Upgrade the pgrag version in the compute Dockerfile. (#11687 ) Update the compute Dockerfile to use a new version of pgrag. The new version of pgrag uses the latest pgrx, and has a fix that terminates background workers on postmaster exit.	2025-04-29 16:50:18 +00:00
John Spray	a2adc7dbd3	storcon: avoid multiple initdbs when shard 0 has stale locations (#11760 ) ## Problem In #11727 I overlooked the case of multiple attached locations for shard 0. I misread the code and thought `create_one` acts on one location, but it actually acts on one _shard_, which is potentially multiple locations. This was not a regression, but it meant that the fix was incomplete. ## Summary of changes - In `create_one`, when updating shard zero, have any "other" locations use the initdb from shard 0	2025-04-29 15:31:52 +00:00
Vlad Lazar	768a580373	pageserver: add not modified since lsn to get page span (#11774 ) It's useful when debugging.	2025-04-29 14:07:23 +00:00
Folke Behrens	09247de8d5	proxy: Enable JSON logging by default (#11772 ) This does not affect local_proxy.	2025-04-29 13:11:24 +00:00
Arpad Müller	0b35929211	Make SafekeeperReconciler parallel via semaphore (#11757 ) Right now we only support running one reconciliation per safekeeper. This is usually far below what a safekeeper can handle. Therefore, introduce a semaphore and spawn the tasks asynchronously as they come in. Part of #11670	2025-04-29 12:46:15 +00:00
Ivan Efremov	b3db7f66ac	fix(compute): Change the local_proxy log level (#11770 ) Related to the INC-496	2025-04-29 11:49:16 +00:00
a-masterov	498d852bde	Fix the empty region if run on schedule (#11764 ) ## Problem When the workflow ran on a schedule, the `region_id` input was not set. As a result, an empty region value was used, which caused errors during execution. ## Summary of Changes - Added fallback logic to set a default region (`aws-us-east-2`) when `region_id` is not provided. - Ensures the workflow works correctly both when triggered manually (`workflow_dispatch`) and on schedule (`cron`).	2025-04-29 09:12:14 +00:00
Busra Kugler	7f8b1d79c0	Replace dorny/paths-filter with step-security maintained version (#11663 ) ## Problem Our CI/CD security tool StepSecurity maintains safer forks of popular GitHub Actions with low security scores. We're replacing dorny/paths-filter with the maintained step-security/paths-filter version to reduce risk of supply chain breaches and potential CVEs. ## Summary of changes replace ```uses: dorny/paths-filter@de90cc6fb3 ``` with ```uses: step-security/paths-filter@v3``` This PR will fix: neondatabase/cloud#26141	2025-04-29 09:02:01 +00:00
JC Grünhage	d15f2ff57a	fix(lint-release-pr): adjust lint and action to match (#11766 ) ## Problem The `lint-release-pr` workflow run for https://github.com/neondatabase/neon/pull/11763 failed, because the new action did not match the lint. ## Summary of changes Include time in expected merge message regex.	2025-04-29 08:56:44 +00:00
Konstantin Knizhnik	3593356c10	Prewarm sql api (#11742 ) ## Problem Continue work on prewarm, see https://github.com/neondatabase/neon/pull/11740 https://github.com/neondatabase/neon/pull/11741 ## Summary of changes Add SQL API to prewarm --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-29 06:44:28 +00:00
Tristan Partin	9e8ab2ab4f	Skip remote extensions WITH_LIB test when sanitizers are enabled (#11758 ) In order for the test to work when sanitizers are enabled, we would need to compile the dummy Postgres extension with the same sanitizer flags that we compile Postgres and the neon extension with. Doing this work would be a little more than trivial, so skipping is the best option, at least for now. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-28 19:13:35 +00:00
Alex Chi Z.	c1ff7db187	fix(pageserver): consider tombstones in replorigin (#11752 ) ## Problem We didn't consider tombstones in replorigin read path in the past. This was fine because tombstones are stored as LSN::Invalid before we universally define what the tombstone is for sparse keyspaces. Now we remove non-inherited keys during detach ancestor and write the universal tombstone "empty image". So we need to consider it across all the read paths. related: https://github.com/neondatabase/neon/pull/11299 ## Summary of changes Empty value gets ignored for replorigin scans. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-28 18:54:26 +00:00
Konstantin Knizhnik	6d6b83e737	Prewarm implementation (#11741 ) ## Problem Continue work on prewarm started in PR https://github.com/neondatabase/neon/pull/11740 ## Summary of changes Implement prewarm using prefetch --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-28 18:17:03 +00:00
John Spray	0482690534	pageserver: make control_plane_api & generations fully mandatory (#10715 ) ## Problem We had retained the ability to run in a generation-less mode to support test_generations_upgrade, which was replaced with a cleaner backward compat test in https://github.com/neondatabase/neon/pull/10701 ## Summary of changes - Remove all the special cases for "if no generation" or "if no control plane api" - Make control_plane_api config mandatory --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-04-28 17:24:55 +00:00
Tristan Partin	a750026c2e	Fix compiler warning in libpagestore.c when WITH_SANITIZERS=yes (#11755 ) Postgres has a nice self-documenting macro called pg_unreachable() when you want to assert that a location in code won't be hit. Warning in question: ``` /home/tristan957/Projects/work/neon//pgxn/neon/libpagestore.c: In function ‘pageserver_connect’: /home/tristan957/Projects/work/neon//pgxn/neon/libpagestore.c:739:1: warning: control reaches end of non-void function [-Wreturn-type] 739 \| } \| ^ ``` Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-28 17:09:48 +00:00
John Spray	998d2c2ce9	storcon: use shard 0's initdb for timeline creation (#11727 ) ## Problem In principle, pageservers with different postgres binaries might generate different initdbs, resulting in inconsistency between shards. To avoid that, we should have shard 0 generate the initdb and other shards re-use it. Fixes: https://github.com/neondatabase/neon/issues/11340 ## Summary of changes - For shards with index greater than zero, set `existing_initdb_timeline_id` in timeline creation to consume the existing initdb rather than creating a new one	2025-04-28 16:43:35 +00:00
JC Grünhage	b1fa68f659	impr(ci): switch release PR creation over to use python based action (#11679 ) ## Problem Our different repositories both had code to achieve very similar results in terms of release PR creation, but they were structured differently and had different extensions. This was likely to cause maintainability problems in the long run. ## Summary of changes Switch to a python cli based composite action for creating the release PRs that will also be introduced in our other repos later. ## To Do - [ ] Adjust our docs to reflect the changes from this.	2025-04-28 16:37:36 +00:00
devin-ai-integration[bot]	84bc3380cc	Remove SAFEKEEPER_AUTH_TOKEN env var parsing from safekeeper (#11698 ) # Remove SAFEKEEPER_AUTH_TOKEN env var parsing from safekeeper This PR is a follow-up to #11443 that removes the parsing of the `SAFEKEEPER_AUTH_TOKEN` environment variable from the safekeeper codebase while keeping the `auth_token_path` CLI flag functionality. ## Changes: - Removed code that checks for the `SAFEKEEPER_AUTH_TOKEN` environment variable - Updated comments to reflect that only the `auth_token_path` CLI flag is now used As mentioned in PR #11443, the environment variable approach was planned to be deprecated and removed in favor of the file-based approach, which is more secure since environment variables can be quite public in both procfs and unit files. Link to Devin run: https://app.devin.ai/sessions/d6f56cf1b4164ea9880a9a06358a58ac Requested by: arpad@neon.tech --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: arpad@neon.tech <arpad@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-04-28 15:34:47 +00:00
Alex Chi Z.	11f6044338	fix(pageserver): report synthetic size = 1 if all tls offloaded (2) (#11731 ) ## Problem https://github.com/neondatabase/neon/pull/11648 did this for resident size instead of synthetic size. ## Summary of changes Report synthetic_size == 1 if all timelines are offloaded. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-28 13:45:45 +00:00
Konstantin Knizhnik	692c0f3fb8	Prepare to prewarm support (#11740 ) ## Problem See (original prewarm implementation) https://github.com/neondatabase/neon/pull/9197 (functions for storing/restoring LFC state) https://github.com/neondatabase/neon/pull/9587 (store prefetch results in LFC) https://github.com/neondatabase/neon/pull/10442 ## Summary of changes Preparation for prewarm implementation. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-28 13:24:18 +00:00
Alexander Bayandin	2b1d2a55d6	CI: fix typo oicd -> oidc (#11747 ) ## Problem It's OIDC (OpenID Connect), not OICD ## Summary of changes - Rename actions input `aws-oicd-role-arn` -> `aws-oidc-role-arn`	2025-04-28 12:44:28 +00:00
Konstantin Knizhnik	60b9fb1baf	Ignore unlogged LSNs in set last written LSN (#11743 ) ## Problem See https://github.com/neondatabase/neon/issues/11718 and https://neondb.slack.com/archives/C033RQ5SPDH/p1745122797538509 GIST and other indexes performing an "unlogged build" use so-called fake LSNs: not real LSNs, but something like 0/1. When stored in the lwlsn cache, they cause incorrect lookups at PS. ## Summary of changes Do not store fake LSNs in LwLSN hash. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-28 12:16:29 +00:00
Erik Grinaker	606f14034e	pageserver: improve `pageserver_smgr_query_seconds` buckets (#11680 ) ## Problem The `pageserver_smgr_query_seconds` buckets are too coarse, using powers of 10: 1 µs, 10 µs, 100 µs, 1 ms, 10 ms, 100 ms, 1 s, 10 s, 100 s. This is one of our most crucial latency metrics, and needs better resolution. Touches #11594. ## Summary of changes This patch uses buckets with better resolution around 1 ms (the typical latency): * 0.6 ms * 1 ms * 3 ms * 6 ms * 10 ms * 30 ms * 100 ms * 1 s * 3 s These will be the same as the compute's `compute_getpage_wait_seconds`, to make them comparable across the compute and Pageserver: https://github.com/neondatabase/flux-fleet/pull/579. We sacrifice buckets above 3 s, since these can already be considered "too slow". This does not change the previously used `CRITICAL_OP_BUCKETS`, which is also used for other operations on different timescales (e.g. LSN waits). We should consider replacing this with more appropriate buckets for specific operations, since it covers a large span with low resolution.	2025-04-28 11:52:44 +00:00
Conrad Ludgate	32393b4393	pg-sni-router: support compute TLS on different port (#11732 ) ## Problem pg-sni-router isn't aware of compute TLS ## Summary of changes If connections come in on port 4433, we require TLS to compute from pg-sni-router	2025-04-28 11:29:44 +00:00
Alexander Bayandin	1a29f5672a	CI(check-macos-build): trigger workflow automatically for PRs (#11706 ) ## Problem - if-conditions for the `check-macos-build` workflow don't trigger it on PRs with relevant changes (in Rust code or Postgres submodules). - Jobs in the workflow depend on the presence of a cache, which is not guaranteed. ## Summary of changes - Fix if-conditions - Use artifacts on top of cache whenever the workflow depends on it — the cache might not be available	2025-04-28 09:03:10 +00:00
a-masterov	b8d47b5acf	Run the extensions' tests on staging (#11164 ) ## Problem We currently don't run end-to-end tests for PostgreSQL extensions on our cloud infrastructure, which means we might miss problems that only occur in a real cloud environment. ## Summary of changes - Added a workflow to run extension tests against a cloud staging instance - Set up proper project configuration for extension testing - Implemented test execution with appropriate environment settings - Added error handling and reporting for test failures --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2025-04-28 08:13:49 +00:00
Alexander Lakhin	97e01ae6fd	Add workflow to run particular test(s) N times (#11050 ) ## Problem Provide an easy way to run particular test(s) N times on CI. ## Summary of changes * Allow for passing the test selection and the number of test runs to the existing "Build and Test Locally" workflow * Allow for running multiple selected tests by the "Pytest regression tests" step * Introduce a new workflow to run specified test(s) several times * Store results in a separate database to distinguish between testing tests for stability and usual testing	2025-04-28 04:04:37 +00:00
Lokesh	459d51974c	doc: minor updates to consumption-metrics document (#7153 ) ## Problem Proposed minor changes to the `consumption_metrics` document. ## Summary of changes - Fixed minor typos in the document. - Minor formatting in the description of metrics `timeline_logical_size` and `synthetic_storage_size`. Makes this consistent as with description of other metrics in the document. ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Mikhail Kot <mikhail@neon.tech>	2025-04-25 19:46:40 +00:00
StepSecurity Bot	902d361107	CI/CD Hardening: Fixing StepSecurity Flagged Issues (#11724 ) ### Summary I'm fixing one or more of the following CI/CD misconfigurations to improve security. Please feel free to leave a comment if you think the current permissions for the GITHUB_TOKEN should not be restricted so I can take a note of it as accepted behaviour. - Restrict permissions for GITHUB_TOKEN - Add step-security/harden-runner - Pin Actions to a full length commit SHA ### Security Fixes will fix https://github.com/neondatabase/cloud/issues/26141	2025-04-25 14:36:45 +00:00
Dmitrii Kovalkov	ef53a76434	storage_broker: https handler (#11603 ) ## Problem Broker supports only HTTP, no HTTPS - Closes: https://github.com/neondatabase/cloud/issues/27492 ## Summary of changes - Add `listen_https_addr`, `ssl_key_file`, `ssl_cert_file`, `ssl_cert_reload_period` arguments to storage broker - Make `listen_addr` argument optional - Listen https in storage broker - Support https for storage broker request in neon_local - Add `use_https_storage_broker_api` option to NeonEnvBuilder	2025-04-25 14:28:56 +00:00
Vlad Lazar	6f0046b688	storage_controller: ensure mutual exclusion for imports and shard splits (#11632 ) ## Problem Shard splits break timeline imports. ## Summary of Changes Ensure mutual exclusion for imports and shard splits. On the shard split code path: 1. Right before shard splitting, check the database to ensure that no import is ongoing for the tenant. Exclusion is guaranteed because this validation is done while holding the exclusive tenant lock. Timeline creation (and import creation implicitly) requires a shared tenant lock. 2. When selecting a shard to split, use the in-mem state to exclude shards with an ongoing import. This is opportunistic since an import might start after the check, but allows shard splits to make progress instead of continuously retrying to split the same shard. On the timeline creation code path: 1. Check the in-memory splitting flag on all shards of the tenant. If any of them are splitting, error out asking the client to retry. On the happy path this is not required, due to the tenant lock set-up described above, but it covers the case where we restart with a pending shard-split. Closes https://github.com/neondatabase/neon/issues/11567	2025-04-25 11:46:15 +00:00
Em Sharnoff	2b0248cd76	fix(proxy): s/Console/Control plane/ in cplane error (#11716 ) I got bamboozled by the error message while debugging, seems no objections to updating it. ref https://neondb.slack.com/archives/C060N3SEF9D/p1745570961111509 ref https://neondb.slack.com/archives/C039YKBRZB4/p1745570811957019?thread_ts=1745393368.283599	2025-04-25 11:09:56 +00:00
Fedor Dikarev	7b03216dca	CI(check-macos-build): use gh native cache (#11707 ) ## Problem - using Hetzner buckets for cache requires secrets, so we need `secrets: inherit` to make it work - we don't have self-hosted macOS runners, so the GH native cache is a better fit there ## Summary of changes - switch to GH native cache for macos builds	2025-04-25 09:18:20 +00:00
a-masterov	992aa91075	Refresh the codestyle of docker compose test script (#11715 ) ## Problem The docker compose test script (`docker_compose_test.sh`) had inconsistent codestyle, mixing legacy syntax with modern approaches and not following best practices at all. This inconsistency could lead to potential issues with variable expansion, path handling, and maintainability. ## Summary of changes This PR modernizes the test script with several codestyle improvements: * Variable scoping and exports: * Added proper export declarations for environment variables * Added explicit COMPOSE_PROFILES export to avoid repetitive flags * Modern Bash syntax: * Replaced [ ] with [[ ]] for safer conditional testing * Used arithmetic operations (( cnt += 3 )) instead of expr * Added proper variable expansion with braces ${variable} * Added proper quoting around variables and paths with "${variable}" * Docker Compose commands: * Replaced hardcoded container names with service names * Used docker compose exec instead of docker exec $CONTAINER_NAME * Removed repetitive flags by using environment variables * Shell script best practices: * Added function keyword before function definition * Used safer path handling with "$(dirname "${0}")" These changes make the script more maintainable, less error-prone, and more consistent with modern shell scripting standards.	2025-04-25 09:13:35 +00:00
Conrad Ludgate	afe9b27983	fix(compute/tls): support for checking certificate chains (#11683 ) ## Problem It seems our production-ready cert-manager setup now includes a full certificate chain. This was not accounted for and the decoder would error. ## Summary of changes Change the way we decode certificates to support cert-chains, ignoring all but the first cert. This also changes a log line to not use multi-line errors. ~~I have tested this code manually against real certificates/keys, I didn't want to embed those in a test just yet, not until the cert expires in 24 hours.~~	2025-04-25 09:09:14 +00:00
Alex Chi Z.	5d91d4e843	fix(pageserver): reduce gc-compaction memory usage (#11709 ) ## Problem close https://github.com/neondatabase/neon/issues/11694 We had the delta layer iterator and image layer iterator set to buffer at most 8MB data. Note that 8MB is the compressed size, so it is possible for those iterators to contain more than 8MB data in memory. For the recent OOM case, gc-compaction was running over 556 layers, which means that we will have 556 active iterators. So in theory, it could take up to 556*8=4448MB memory when the compaction is going on. If images get compressed and the compression ratio is high (for that tenant, we see 3x compression ratio across image layers), then that's 13344MB memory. Also we have layer rewrites, which explains the memory taken by gc-compaction itself (versus the iterators). We rewrite 424 out of 556 layers, and each such rewrite needs a pair of delta layer writers. So we are buffering a lot of deltas in the memory. The flamegraph shows that gc-compaction itself takes 6GB memory, delta iterator 7GB, and image iterator 2GB, which can be explained by the above theory. ## Summary of changes - Reduce the buffer sizes. - Estimate memory consumption and give up if it is too high. - Also give up if the number of layers-to-rewrite is too high. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-25 08:45:31 +00:00
Alexander Bayandin	2465e9141f	test_runner: bump `httpcore` to 1.0.9 and `h11` to 0.16.0 (#11711 ) ## Problem https://github.com/advisories/GHSA-vqfr-h8mv-ghfj ## Summary of changes - Bump `h11` to 0.16.0 (required to bump `httpcore` to 1.0.9)	2025-04-25 08:44:40 +00:00
Tristan Partin	2526f6aea1	Add remote extension test with library component (#11301 ) The current test was just SQL files only, but we also want to test a remote extension which includes a loadable library. With both extensions we should cover a larger portion of compute_ctl's remote extension code paths. Fixes: https://github.com/neondatabase/neon/issues/11146 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-24 22:33:46 +00:00
Vlad Lazar	5ba7315c84	storage_controller: reconcile completed imports at start-up (#11614 ) ## Problem In https://github.com/neondatabase/neon/pull/11345 coordination of imports moved to the storage controller. It involves notifying cplane when the import has been completed by calling an idempotent endpoint. If the storage controller shuts down in the middle of finalizing an import, it would never be retried. ## Summary of changes Reconcile imports at start-up by fetching the complete imports from the database and spawning a background task which notifies cplane. Closes: https://github.com/neondatabase/neon/issues/11570	2025-04-24 18:39:19 +00:00
Vlad Lazar	6f7e3c18e4	storage_controller: make leadership protocol more robust (#11703 ) ## Problem We saw the following scenario in staging: 1. Pod A starts up. Becomes leader and steps down the previous pod cleanly. 2. Pod B starts up (deployment). 3. Step down request from pod B to pod A times out. Pod A did not manage to stop its reconciliations within 10 seconds and exited with return code 1 ([code](`7ba8519b43/storage_controller/src/service.rs (L8686-L8702)`)). 4. Pod B marks itself as the leader and finishes start-up 5. k8s restarts pod A 6. k8s marks pod B as ready 7. pod A sends step down request to pod B - this succeeds => pod A is now the leader 8. k8s kills pod A because it thinks pod B is healthy and pod A is part of the old replica set We end up in a situation where the only pod we have (B) is stepped down and attempts to forward requests to a leader that doesn't exist. k8s can't detect that pod B is in a bad state since the /status endpoint simply returns 200 if the pod is running. ## Summary of changes This PR includes a number of robustness improvements to the leadership protocol: * use a single step down task per controller * add a new endpoint to be used as k8s liveness probe and check leadership status there * handle restarts explicitly (i.e. don't step yourself down) * increase the step down retry count * don't kill the process on long step down since k8s will just restart it	2025-04-24 16:59:56 +00:00
Christian Schwarz	8afb783708	feat: Direct IO for the pageserver write path (#11558 ) # Problem The Pageserver read path exclusively uses direct IO if `virtual_file_io_mode=direct`. The write path is half-finished. Here is what the various writing components use: \|what\|buffering\|flags on <br/>`v_f_io_mode`<br/>=`buffered`\|flags on <br/>`virtual_file_io_mode`<br/>=`direct`\| \|-\|-\|-\|-\| \|`DeltaLayerWriter`\| BlobWriter<BUFFERED=true> \| () \| () \| \|`ImageLayerWriter`\| BlobWriter<BUFFERED=false> \| () \| () \| \|`download_layer_file`\|BufferedWriter\|()\|()\| \|`InMemoryLayer`\|BufferedWriter\|()\|O_DIRECT\| The vehicle towards direct IO support is `BufferedWriter` which - largely takes care of O_DIRECT alignment & size-multiple requirements - double-buffering to mask latency `DeltaLayerWriter`, `ImageLayerWriter` use `blob_io::BlobWriter` , which has neither of these. # Changes ## High-Level At a high-level this PR makes the following primary changes: - switch the two layer writer types to use `BufferedWriter` & make sensitive to `virtual_file_io_mode` (via open_with_options_v2) - make `download_layer_file` sensitive to `virtual_file_io_mode` (also via open_with_options_v2) - add `virtual_file_io_mode=direct-rw` as a feature gate - we're hackish-ly piggybacking on OpenOptions's ask for write access here - this means with just `=direct` InMemoryLayer reads and writes no longer uses O_DIRECT - this is transitory and we'll remove the `direct-rw` variant once the rollout is complete (The `_v2` APIs for opening / creating VirtualFile are those that are sensitive to `virtual_file_io_mode`) The result is: \|what\|uses <br/>`BufferedWriter`\|flags on <br/>`v_f_io_mode`<br/>=`buffered`\|flags on <br/>`v_f_io_mode`<br/>=`direct`\|flags on <br/>`v_f_io_mode`<br/>=`direct-rw`\| \|-\|-\|-\|-\|-\| \|`DeltaLayerWriter`\| ~~Blob~~BufferedWriter \| () \| () \| O_DIRECT \| \|`ImageLayerWriter`\| ~~Blob~~BufferedWriter \| () \| () \| O_DIRECT \| 
\|`download_layer_file`\|BufferedWriter\|()\|()\|O_DIRECT\| \|`InMemoryLayer`\|BufferedWriter\|()\|~~O_DIRECT~~()\|O_DIRECT\| ## Code-Level The main change is: - Switch `blob_io::BlobWriter` away from its own buffering method to use `BufferedWriter`. Additional prep for upholding `O_DIRECT` requirements: - Layer writer `finish()` methods switched to use IoBufferMut for guaranteed buffer address alignment. The size of the buffers is PAGE_SZ and thereby implicitly assumed to fulfill O_DIRECT requirements. For the hacky feature-gating via `=direct-rw`: - Track `OpenOptions::write(true\|false)` in a field; bunch of mechanical churn. - Consolidate the APIs in which we "open" or "create" VirtualFile for better overview over which parts of the code use the `_v2` APIs. Necessary refactorings & infra work: - Add doc comments explaining how BufferedWriter ensures that writes are compliant with O_DIRECT alignment & size constraints. This isn't new, but should be spelled out. - Add the concept of shutdown modes to `BufferedWriter::shutdown` to make writer shutdown adhere to these constraints. - The `PadThenTruncate` mode might not be necessary in practice because I believe all layer files ever written are sized in multiples of `PAGE_SZ` and since `PAGE_SZ` is larger than the current alignment requirements (512/4k depending on platform), it won't be necessary to pad. - Some test (I believe `round_trip_test_compressed`?) required it though - [ ] TODO: decide if we want to accept that complexity; if we do then address TODO in the code to separate alignment requirement from buffer capacity - Add `set_len` (=`ftruncate`) VirtualFile operation to support the above. - Allow `BufferedWriter` to start at a non-zero offset (to make room for the summary block). Cleanups unlocked by this change: - Remove non-positional APIs from VirtualFile (e.g. 
seek, write_full, read_full) Drive-by fixes: - PR https://github.com/neondatabase/neon/pull/11585 aimed to run unit tests for all `virtual_file_io_mode` combinations but didn't because of a missing `_` in the env var. # Performance This section assesses this PR's impact on deployments with current production setting (`=direct`) and anticipated impact of switching to (`=direct-rw`). For `DeltaLayerWriter`, `=direct` should remain unchanged to slightly improved on throughput because the `BlobWriter`'s buffer had the same size as the `BufferedWriter`'s buffer, but it didn't have the double-buffering that `BufferedWriter` has. The `=direct-rw` enables direct IO; throughput should not be suffering because of double-buffering; benchmarks will show if this is true. The `ImageLayerWriter` was previously not doing any buffering (`BUFFERED=false`). It went straight to issuing the IO operation to the underlying VirtualFile and the buffering was done by the kernel. The switch to `BufferedWriter` under `=direct` adds an additional memcpy into the BufferedWriter's buffer. We will win back that memcpy when enabling direct IO via `=direct-rw`. A nice win from the switch to `BufferedWriter` is that ImageLayerWriter performs >=16x fewer write operations to VirtualFile (the BlobWriter performs one write per len field and one write per image value). This should save low tens of microseconds of CPU overhead from doing all these syscalls/io_uring operations, regardless of `=direct` or `=direct-rw`. Aside from problems with alignment, this write frequency without double-buffering is prohibitive if we actually have to wait for the disk, which is what will happen when we enable direct IO via (`=direct-rw`). Throughput should not be suffering because of BufferedWrite's double-buffering; benchmarks will show if this is true. `InMemoryLayer` at `=direct` will flip back to using buffered IO but remain on BufferedWriter. The buffered IO adds back one memcpy of CPU overhead. 
Throughput should not suffer and might improve on not-memory-pressured Pageservers but let's remember that we're doing the whole direct IO thing to eliminate global memory pressure as a source of perf variability. ## bench_ingest I reran `bench_ingest` on `im4gn.2xlarge` and `Hetzner AX102`. Use `git diff` with `--word-diff` or similar to see the change. General guidance on interpretation: - immediate production impact of this PR without production config change can be gauged by comparing the same `io_mode=Direct` - end state of production switched over to `io_mode=DirectRw` can be gauged by comparing old results' `io_mode=Direct` to new results' `io_mode=DirectRw` Given above guidance, on `im4gn.2xlarge` - immediate impact is a significant improvement in all cases - end state after switching has same significant improvements in all cases - ... except `ingest/io_mode=DirectRw volume_mib=128 key_size_bytes=8192 key_layout=Sequential write_delta=Yes` which only achieves `238 MiB/s` instead of `253.43 MiB/s` - this is a 6% degradation - this workload is typical for image layer creation # Refs - epic https://github.com/neondatabase/neon/issues/9868 - stacked atop - preliminary refactor https://github.com/neondatabase/neon/pull/11549 - bench_ingest overhaul https://github.com/neondatabase/neon/pull/11667 - derived from https://github.com/neondatabase/neon/pull/10063 Co-authored-by: Yuchen Liang <yuchen@neon.tech>	2025-04-24 14:57:36 +00:00
Konstantin Knizhnik	1531712555	Undo commit `d1728a6bcd` because it causes problems with creating pg_search extension (#11700 ) ## Problem See https://neondb.slack.com/archives/C03H1K0PGKH/p1745489241982209 pg_search extension now can not be created. ## Summary of changes Undo `d1728a6bcd` Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-24 14:46:10 +00:00
Alexander Bayandin	5e989a3148	CI(build-tools): bump packages in build-tools image (#11697 ) ## Problem `cargo-deny` 0.16.2 spits a bunch of warnings like: ``` warning[index-failure]: unable to check for yanked crates ``` The issue is fixed for the latest version of `cargo-deny` (0.18.2). And while we're here, let's bump all the packages we have in `build-tools` image ## Summary of changes - bump cargo-hakari to 0.9.36 - bump cargo-deny to 0.18.2 - bump cargo-hack to 0.6.36 - bump cargo-nextest to 0.9.94 - bump diesel_cli to 2.2.9 - bump s5cmd to 2.3.0 - bump mold to 2.37.1 - bump python to 3.11.12	2025-04-24 14:13:04 +00:00
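
The worst-case iterator memory figures quoted in commit 5d91d4e843 (#11709) can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in that commit message, not the pageserver's actual estimator; the function name `iterator_memory_mb` and its parameters are hypothetical and do not appear in the repository.

```python
# Sketch of the worst-case estimate described in commit 5d91d4e843.
# All names here are hypothetical; only the numbers come from the commit message.

def iterator_memory_mb(active_iterators: int, buffer_mb: int,
                       compression_ratio: float = 1.0) -> float:
    """Upper bound on memory held by layer iterators, in MB.

    buffer_mb is the compressed size each iterator may buffer; multiplying by
    compression_ratio approximates the decompressed in-memory footprint.
    """
    return active_iterators * buffer_mb * compression_ratio


# 556 layers means 556 active iterators, each buffering up to 8 MB (compressed).
compressed_bound = iterator_memory_mb(556, 8)
# With the 3x compression ratio reported for that tenant's image layers.
decompressed_bound = iterator_memory_mb(556, 8, 3.0)

print(compressed_bound, decompressed_bound)
```

The two values correspond to the 4448MB and 13344MB figures in the commit message. Both are upper bounds: they assume every iterator fills its buffer at the same time.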

7823 Commits