rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-07 14:10:43 +00:00

Author	SHA1	Message	Date
Vlad Lazar	d0aaec2abb	storage_controller: create imported timelines on safekeepers (#11801 ) ## Problem SK timeline creations were skipped for imported timelines since we didn't know the correct start LSN of the timeline at that point. ## Summary of changes Created imported timelines on the SK as part of the import finalize step. We use the last record LSN of shard 0 as the start LSN for the safekeeper timeline. Closes https://github.com/neondatabase/neon/issues/11569	2025-05-09 10:55:26 +00:00
Alex Chi Z.	d0dc65da12	fix(pageserver): give up gc-compaction if one key has too long history (#11869 ) ## Problem The limitation we imposed last week https://github.com/neondatabase/neon/pull/11709 is not enough to protect against excessive memory usage. ## Summary of changes If a single key accumulated too much history, give up compaction. In the future, we can make the `generate_key_retention` function take a stream of keys instead of first accumulating them in memory, and thus easily support such long key history cases. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-09 10:12:49 +00:00
Konstantin Knizhnik	03d635b916	Add more guards for prefetch_pump_state (#11859 ) ## Problem See https://neondb.slack.com/archives/C08PJ07BZ44/p1746566292750689 Looks like there are more cases where `prefetch_pump_state` can be called in an unexpected place and cause a core dump. ## Summary of changes Add more guards. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-05-09 09:07:08 +00:00
Conrad Ludgate	5cd7f936f9	fix(neon-rls): optimistically assume role grants are already assigned for replicas (#11811 ) ## Problem Read replicas cannot grant permissions for roles for Neon RLS. Usually the permission is already granted, so we can optimistically check. See INC-509 ## Summary of changes Perform a permission lookup prior to actually executing any grants.	2025-05-09 07:48:30 +00:00
Konstantin Knizhnik	101e115b38	Change prefetch logic in vacuum (#11650 ) ## Problem See https://neondb.slack.com/archives/C03QLRH7PPD/p1745003314183649 Vacuum doesn't use prefetch because of this strange logic in `lazy_scan_heap`: ``` /* And only up to the next unskippable block */ if (next_prefetch_block + prefetch_budget > vacrel->next_unskippable_block) prefetch_budget = vacrel->next_unskippable_block - next_prefetch_block; ``` ## Summary of changes Disable prefetch only if vacuum jumps to the next skippable block (there is a SKIP_PAGES_THRESHOLD, which cancels the seqscan and performs the jump only if the gap is large enough). Postgres PRs: https://github.com/neondatabase/postgres/pull/620 https://github.com/neondatabase/postgres/pull/621 https://github.com/neondatabase/postgres/pull/622 https://github.com/neondatabase/postgres/pull/623 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-05-09 06:54:40 +00:00
Christian Schwarz	b37bb7d7ed	pageserver: timeline shutdown: fully quiesce ingest path before `freeze_and_flush` (#11851 ) # Problem Before this PR, timeline shutdown would - cancel the walreceiver cancellation token subtree (child token of Timeline::cancel) - call freeze_and_flush - Timeline::cancel.cancel() - ... bunch of waiting for things ... - Timeline::gate.close() As noted by the comment that is deleted by this PR, this left a window where, after freeze_and_flush, walreceiver could still be running and ingest data into a new InMemoryLayer. This presents a potential source of log noise during Timeline shutdown where the InMemoryLayer created after the freeze_and_flush observes that Timeline::cancel is cancelled, failing the ingest with some anyhow::Error wrapping (deeply) a `FlushTaskError::Cancelled` instance (`flush task cancelled` error message). # Solution It turns out that it is quite easy to shut down, not just cancel, walreceiver completely because the only subtask spawned by walreceiver connection manager is the `handle_walreceiver_connection` task, which is properly shut down and waited upon when the manager task observes cancellation and exits its retry loop. The alternative is to replace all the usage of `anyhow` on the ingest path with differentiated error types. A lot of busywork for little gain to fix a potential logging noise nuisance, so, not doing that for now. # Correctness / Risk We do not risk leaking walreceiver child tasks because existing discipline is to hold a gate guard. We will prolong `Timeline::shutdown` to the degree that we're no longer making progress with the rest of shutdown while the walreceiver task hasn't yet observed cancellation. In practice, this should be negligible. `Timeline::shutdown` could fail to complete if there is a hidden dependency of walreceiver shutdown on some subsystem. The code certainly suggests there isn't, and I'm not aware of any such dependency. 
Anyway, impact will be low because we only shut down Timeline instances that are obsolete, either because there is a newer attachment at a different location, or because the timeline got deleted by the user. We would learn about this through stuck cplane operations or stuck storcon reconciliations. We would be able to mitigate by cancelling such stuck operations/reconciliations and/or by rolling back pageserver. # Refs - identified this while investigating https://github.com/neondatabase/neon/issues/11762 - PR that _does_ fix a bunch of _real_ `flush task cancelled` noise on the compaction path: https://github.com/neondatabase/neon/pull/11853	2025-05-08 18:48:24 +00:00
Conrad Ludgate	bef5954fd7	feat(proxy): track SNI usage by protocol, including for http (#11863 ) ## Problem We want to see how many users of the legacy serverless driver are still using the old URL for SQL-over-HTTP traffic. ## Summary of changes Adds a protocol field to the connections_by_sni metric. Ensures it's incremented for sql-over-http.	2025-05-08 16:46:57 +00:00
Christian Schwarz	8477d15f95	feat(direct IO): remove special case in test suite for compat tests (#11864 ) PR - https://github.com/neondatabase/neon/pull/11558 adds special treatment for compat snapshot binaries which don't understand the `direct-rw` mode. A new compat snapshot has been published since, so, we can remove the special case. refs: - fixes https://github.com/neondatabase/neon/issues/11598	2025-05-08 16:11:45 +00:00
Arpad Müller	622b3b2993	Fixes for enabling --timelines-onto-safekeepers in tests (#11854 ) Second PR with fixes extracted from #11712, relating to `--timelines-onto-safekeepers`. Does the following: * Moves safekeeper registration to `neon_local` instead of the test fixtures * Pass safekeeper JWT token if `--timelines-onto-safekeepers` is enabled * Allow some warnings related to offline safekeepers (similarly to how we allow them for offline pageservers) * Enable generations on the compute's config if `--timelines-onto-safekeepers` is enabled * fix parallel `pull_timeline` race condition (the one that #11786 put for later) Fixes #11424 Part of #11670	2025-05-08 15:13:11 +00:00
Santosh Pingale	659366060d	Reuse remote_client from the SnapshotDownloader instead of recreating in download function (#11812 ) ## Problem At the moment, remote_client and target are recreated in download function. We could reuse it from SnapshotDownloader instance. This isn't a problem per se, just a quality of life improvement but it caught my attention when we were trying out snapshot downloading in one of the older version and ran into a curious case of s3 clients behaving in two different manners. One client that used `force_path_style` and other one didn't. Logs from this run: ``` 2025-05-02T12:56:22.384626Z DEBUG /data/snappie/2739e7da34e625e3934ef0b76fa12483/timelines/d44b831adb0a6ba96792dc3a5cc30910/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000014E8F20-00000000014E8F99-00000001 requires download... 2025-05-02T12:56:22.384689Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:apply_configuration: timeout settings for this operation: TimeoutConfig { connect_timeout: Set(3.1s), read_timeout: Disabled, operation_timeout: Disabled, operation_attempt_timeout: Disabled } 2025-05-02T12:56:22.384730Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op: entering 'serialization' phase 2025-05-02T12:56:22.384784Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op: entering 'before transmit' phase 2025-05-02T12:56:22.384813Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op: retry strategy has OKed initial request 2025-05-02T12:56:22.384841Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op: beginning attempt #1 2025-05-02T12:56:22.384870Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: resolving endpoint endpoint_params=EndpointResolverParams(TypeErasedBox[!Clone]:Params { bucket: Some("bucket"), region: 
Some("eu-north-1"), use_fips: false, use_dual_stack: false, endpoint: Some("https://s3.self-hosted.company.com"), force_path_style: false, accelerate: false, use_global_endpoint: false, use_object_lambda_endpoint: None, key: None, prefix: Some("/pageserver/tenants/2739e7da34e625e3934ef0b76fa12483/timelines/d44b831adb0a6ba96792dc3a5cc30910/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000014E8F20-00000000014E8F99-00000001"), copy_source: None, disable_access_points: None, disable_multi_region_access_points: false, use_arn_region: None, use_s3_express_control_endpoint: None, disable_s3_express_session_auth: None }) endpoint_prefix=None 2025-05-02T12:56:22.384979Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: will use endpoint Endpoint { url: "https://neon.s3.self-hosted.company.com", headers: {}, properties: {"authSchemes": Array([Object({"signingRegion": String("eu-north-1"), "disableDoubleEncoding": Bool(true), "name": String("sigv4"), "signingName": String("s3")})])} } 2025-05-02T12:56:22.385042Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt:lazy_load_identity:provide_credentials{provider=default_chain}: loaded credentials provider=Environment 2025-05-02T12:56:22.385066Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt:lazy_load_identity: identity cache miss occurred; added new identity (took 35.958µs) new_expiration=2025-05-02T13:11:22.385028Z valid_for=899.999961437s partition=IdentityCachePartition(5) 2025-05-02T12:56:22.385090Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: loaded identity 2025-05-02T12:56:22.385162Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: entering 'transmit' phase 2025-05-02T12:56:22.385211Z DEBUG invoke{service=s3 operation=ListObjectVersions 
sdk_invocation_id=7315885}:try_op:try_attempt: new TCP connector created in 361ns 2025-05-02T12:56:22.385288Z DEBUG resolving host="neon.s3.self-hosted.company.com" 2025-05-02T12:56:22.390796Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: encountered orchestrator error; halting ```	2025-05-08 14:09:15 +00:00
Christian Schwarz	42d93031a1	fixup(#11819 ): broken macOS build (#11861 ) refs - fixes https://github.com/neondatabase/neon/issues/11860	2025-05-08 11:48:29 +00:00
Mark Novikov	d22377c754	Skip event triggers in dump-restore (#11794 ) ## Problem Data import fails if the src db has any event triggers, because those can only be restored by a superuser. Specifically imports from Heroku and Supabase are guaranteed to fail. Closes https://github.com/neondatabase/cloud/issues/27353 ## Summary of changes Depends on `pg_dump` patches per each supported PostgreSQL version: - https://github.com/neondatabase/postgres/pull/630 - https://github.com/neondatabase/postgres/pull/629 - https://github.com/neondatabase/postgres/pull/627 - https://github.com/neondatabase/postgres/pull/628	2025-05-08 11:04:28 +00:00
Erik Grinaker	6c70789cfd	storcon: increase drain+fill secondary warmup timeout from 20 to 30 seconds (#11848 ) ## Problem During deployment drains/fills, we often see the storage controller giving up on warmups after 20 seconds, when the warmup is nearly complete (~90%). This can cause latency spikes for migrated tenants if they block on layer downloads. Touches https://github.com/neondatabase/cloud/issues/26193. ## Summary of changes Increase the drain and fill secondary warmup timeout from 20 to 30 seconds.	2025-05-08 10:14:41 +00:00
Dmitrii Kovalkov	7e55497e13	tests: flush wal before waiting for last record lsn (#11726 ) ## Problem Compute may flush WAL on page boundaries, leaving some records partially flushed for a long time. It leads to `wait_for_last_flush_lsn` stuck waiting for this partial LSN. - Closes: https://github.com/neondatabase/cloud/issues/27876 ## Summary of changes - Flush WAL via CHECKPOINT after requesting current_wal_lsn to make sure that the record we point to is flushed in full - Use proper endpoint in `test_timeline_detach_with_aux_files_with_detach_v1`	2025-05-08 10:00:45 +00:00
Vlad Lazar	40f32ea326	pageserver: refactor import flow and add job concurrency limiting (#11816 ) ## Problem Import code is one big block. Separating planning and execution will help with reporting progress of import to storcon (building block for resuming import). ## Summary of changes Split up the import into planning and execution. A concurrency limit driven by PS config is also added.	2025-05-08 09:19:14 +00:00
Christian Schwarz	1d1502bc16	fix(pageserver): `flush task cancelled` errors during timeline shutdown (#11853 ) # Refs - fixes https://github.com/neondatabase/neon/issues/11762 # Problem PR #10993 introduced internal retries for BufferedWriter flushes. PR #11052 added cancellation sensitivity to that retry loop. That cancellation sensitivity is an error path that didn't exist before. The result is that during timeline shutdown, after we `Timeline::cancel`, compaction can now fail with error `flush task cancelled`. The problem with that: 1. We mis-classify this as an `error!`-worthy event. 2. This causes tests to become flaky because the error is not in global `allowed_errors`. Technically we also trip the `compaction_circuit_breaker` because the resulting `CompactionError` is variant `::Other`. But since this is Timeline shutdown, it doesn't matter practically speaking. # Solution / Changes - Log the anyhow stack trace when classifying a compaction error as `error!`. This was helpful to identify sources of `flush task cancelled` errors. We only log at `error!` level in exceptional circumstances, so, it's ok to have a bit verbose logs. - Introduce typed errors along the `BufferedWriter::write_` => `BlobWriter::write_blob` => `{Delta,Image}LayerWriter::put_` => `Split{Delta,Image}LayerWriter::put_{value,image}` chain. - Proper mapping to `CompactionError`/`CreateImageLayersError` via new `From` impls. I am usually opposed to any magic `From` impls, but, it's how most of the compaction code works today. # Testing The symptoms are most prevalent in `test_runner/regress/test_branch_and_gc.py::test_branch_and_gc`. Before this PR, I was able to reproduce locally 1 or 2 times per 400 runs using `DEFAULT_PG_VERSION=15 BUILD_TYPE=release poetry run pytest --count 400 -n 8`. After this PR, it doesn't reproduce anymore after 2000 runs. # Future Work Technically the ingest path is also exposed to this new source of errors because `InMemoryLayer` is backed by `BufferedWriter`. 
But we haven't seen it occur in flaky tests yet. Details and a fix in - https://github.com/neondatabase/neon/pull/11851	2025-05-08 06:57:53 +00:00
Christian Schwarz	7eb85c56ac	tokio-epoll-uring: avoid warn! noise due to `ECANCELED` during shutdowns (#11819 ) # Problem Before this PR, `test_pageserver_catchup_while_compute_down` would occasionally fail due to a scary-looking WARN log line ``` WARN ephemeral_file_buffered_writer{...}:flush_attempt{attempt=1}: \ error flushing buffered writer buffer to disk, retrying after backoff err=Operation canceled (os error 125) ``` After lengthy investigation, the conclusion is that this is likely due to a kernel bug related to io_uring async workers (io-wq) and signals. The main indicator is that the error only ever happens in correlation with pageserver shutdown when SIGTERM is received. There is a fix that is merged in 6.14 kernels (`io-wq: backoff when retrying worker creation`). However, even when I revert that patch, the issue is not reproducible on 6.14, so, it remains a speculation. It was ruled out that the ECANCELED is due to the executor thread exiting before the async worker starts processing the operation. # Solution The workaround in this issue is to retry the operation on ECANCELED once. Retries are safe because the low-level io_engine operations are idempotent. (We don't use O_APPEND and I can't think of another flag that would make the APIs covered by this patch not idempotent.) # Testing With this PR, the warn! log no longer happens on [my reproducer setup](https://github.com/neondatabase/neon/issues/11446#issuecomment-2843015111). And the new rate-limited `info!`-level log line informing about the internal retry shows up instead, as expected. # Refs - fixes https://github.com/neondatabase/neon/issues/11446	2025-05-08 06:33:29 +00:00
Dmitrii Kovalkov	24d62c647f	storcon: add missing switch_timeline_membership method to sk client (#11850 ) ## Problem `switch_timeline_membership` is implemented on safekeeper's server side, but is missing in the client. - Part of https://github.com/neondatabase/neon/issues/11823 ## Summary of changes - Add `switch_timeline_membership` method to `SafekeeperClient`	2025-05-07 17:00:41 +00:00
Shockingly Good	4d2e4b19c3	fix(compute) Correct the PGXN s3 gateway URL. (#11796 ) Corrects the postgres extension s3 gateway address to be not just a domain name but a full base URL. To make the code more readable, the option is renamed to "remote_ext_base_url", while keeping the old name also accessible by providing a clap argument alias. Also provides a very simple and, perhaps, even redundant unit test to confirm the logic behind parsing of the corresponding CLI argument. ## Problem As clearly stated in https://github.com/neondatabase/cloud/issues/26005, using the short version of the domain name might work for now, but in the future, we should get rid of using the `default` namespace and this is where it will, most likely, break down. ## Summary of changes The changes adjust the domain name of the extension s3 gateway to use the proper base url format instead of just the domain name, which assumes the "default" namespace, and add a new CLI argument name to reflect the change and the expectation.	2025-05-07 16:34:08 +00:00
Alexey Kondratov	0691b73f53	fix(compute): Enforce cloud_admin role in compute_ctl connections (#11827 ) ## Problem Users can override some configuration parameters on the DB level with `ALTER DATABASE ... SET ...`. Some of these overrides, like `role` or `default_transaction_read_only`, affect `compute_ctl`'s ability to configure the DB schema properly. ## Summary of changes Enforce `role=cloud_admin`, `statement_timeout=0`, and move `default_transaction_read_only=off` override from control plane [1] to `compute_ctl`. Also, enforce `search_path=public` just in case, although we do not call any functions in user databases. [1]: `133dd8c4db/goapp/controlplane/internal/pkg/compute/provisioner/provisioner_common.go (L70)` Fixes https://github.com/neondatabase/cloud/issues/28532	2025-05-07 12:14:24 +00:00
Vlad Lazar	3cf5e1386c	pageserver: fix rough edges of pageserver tracing (#11842 ) ## Problem There's a few rough edges around PS tracing. ## Summary of changes * include compute request id in pageserver trace * use the get page specific context for GET_REL_SIZE and GET_BATCH * fix assertion in download layer trace ![image](https://github.com/user-attachments/assets/2ff6779c-7c2d-4102-8013-ada8203aa42f)	2025-05-07 10:13:26 +00:00
Alex Chi Z.	608afc3055	fix(scrubber): log download error (#11833 ) ## Problem We use `head_object` to determine whether an object exists or not. However, it does not always error due to a missing object. ## Summary of changes Log the error so that we can have a better idea what's going on with the scrubber errors in prod. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-07 09:21:17 +00:00
Tristan Partin	0ef6851219	Make the audience claim in compute JWTs a vector (#11845 ) According to RFC 7519, `aud` is generally an array of StringOrURI, but in special cases may be a single StringOrURI value. To accommodate future control plane work where a single token may work for multiple services, make the claim a vector. Link: https://www.rfc-editor.org/rfc/rfc7519#section-4.1.3 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-05-06 22:19:15 +00:00
Mikhail	5c356c63eb	endpoint_storage compute_ctl integration (#11550 ) Add `/lfc/(prewarm\|offload)` routes to `compute_ctl` which interact with endpoint storage. Add `prewarm_lfc_on_startup` spec option which, if enabled, downloads LFC prewarm data on compute startup. Resolves: https://github.com/neondatabase/cloud/issues/26343	2025-05-06 22:02:12 +00:00
Suhas Thalanki	384e3df2ad	fix: pinned anon extension to v2.1.0 (#11844 ) ## Problem Currently the setup for `anon` v2 in the compute image downloads the latest version of the extension. This can be problematic as on a compute start/restart it can download a version that is newer than what we have tested and potentially break things, hence not giving us the ability to control when the extension is updated. We were also using `v2.2.0`, which is not ready for production yet, as clarified by the maintainer. Additional context: https://gitlab.com/dalibo/postgresql_anonymizer/-/issues/530 ## Summary of changes Changed the URL from which we download the `anon` extension to point to `v2.1.0` instead of `latest`.	2025-05-06 21:52:15 +00:00
Tristan Partin	f9b3a2e059	Add scoping to compute_ctl JWT claims (#11639 ) Currently we only have an admin scope which allows a user to bypass the compute_id check. When the admin scope is provided, validate the audience of the JWT to be "compute". Closes: https://github.com/neondatabase/cloud/issues/27614 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-05-06 19:51:10 +00:00
Jakub Kołodziejczak	79ee78ea32	feat(compute): enable audit logs for pg_session_jwt extension (#11829 ) related to https://github.com/neondatabase/cloud/issues/28480 related to https://github.com/neondatabase/pg_session_jwt/pull/36 cc @MihaiBojin @conradludgate @lneves12	2025-05-06 15:18:50 +00:00
Erik Grinaker	0e0ad073bf	storcon: fix split aborts removing other tenants (#11837 ) ## Problem When aborting a split, the code accidentally removes all other tenant shards from the in-memory map that have the same shard count as the aborted split, causing "tenant not found" errors. It will recover on a storcon restart, when it loads the persisted state. This issue has been present for at least a year. Resolves https://github.com/neondatabase/cloud/issues/28589. ## Summary of changes Only remove shards belonging to the relevant tenant when aborting a split. Also adds a regression test.	2025-05-06 13:57:34 +00:00
Alex Chi Z.	6827f2f58c	fix(pageserver): only keep `iter_with_options` API, improve docs in gc-compact (#11804 ) ## Problem Address comments in https://github.com/neondatabase/neon/pull/11709 ## Summary of changes - remove `iter` API, users always need to specify buffer size depending on the expected memory usage. - several doc improvements --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-05-06 12:27:16 +00:00
Peter Bendel	c82e363ed9	cleanup orphan projects created by python tests, too (#11836 ) ## Problem - some projects are created during GitHub workflows not by the project_create action but by python test scripts. If the python test fails, the project is not deleted ## Summary of changes - make sure we clean up those python-created projects a few days after they are no longer used, too	2025-05-06 12:26:13 +00:00
Alexander Bayandin	50dc2fae77	compute-node.Dockerfile: remove layer with duplicated name (#11807 ) ## Problem Two `rust-extensions-build-pgrx14` layers were added independently in two different PRs, and the layers are exactly the same ## Summary of changes - Remove one of `rust-extensions-build-pgrx14` layers	2025-05-06 10:52:21 +00:00
Folke Behrens	62ac5b94b3	proxy: Include the exp/nbf timestamps in the errors (#11828 ) ## Problem It's difficult to tell when the JWT expired from current logs and error messages. ## Summary of changes Add exp/nbf timestamps to the respective error variants. Also use checked_add when deserializing a SystemTime from JWT. Related to INC-509	2025-05-06 09:28:25 +00:00
Konstantin Knizhnik	f0e7b3e0ef	Use unlogged build for gist_indexsortbuild_flush_ready_pages (#11753 ) ## Problem See https://github.com/neondatabase/neon/issues/11718 GIST index can be constructed in two ways: GIST_SORTED_BUILD and GIST_BUFFERING. We used unlogged build in the second case but not in the first. ## Summary of changes Use unlogged build in `gist_indexsortbuild_flush_ready_pages` Corresponding Postgres PRs: https://github.com/neondatabase/postgres/pull/624 https://github.com/neondatabase/postgres/pull/625 https://github.com/neondatabase/postgres/pull/626 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-05-06 07:24:27 +00:00
Dmitrii Kovalkov	c6ff18affc	cosmetics(pgxn/neon): WP code small clean up (#11824 ) ## Problem Some small cosmetic changes I made while reading the code. Should not affect anything. ## Summary of changes - Remove `n_votes` field because it's not used anymore - Explicitly initialize `safekeepers_generation` with `INVALID_GENERATION` if the generation is not present (the struct is zero-initialized anyway, but the explicit initialization is better IMHO) - Access SafekeeperId via pointer `sk_id` created above	2025-05-06 06:51:51 +00:00
Heikki Linnakangas	16ca74a3f4	Add SAFETY comment on libc::sysconf() call (#11581 ) I got an 'undocumented_unsafe_blocks' clippy warning about it. Not sure why I got the warning now and not before, but in any case a comment is a good idea.	2025-05-06 06:49:23 +00:00
Peter Bendel	cb67f9a651	delete orphan left over projects (#11826 ) ## Problem sometimes our benchmarking GitHub workflow is terminated by side-effects beyond our control (e.g. GitHub runner loses connection to server) and then we have left-over Neon projects created during the workflow [Example where GitHub runner lost connection and project was not deleted](https://github.com/neondatabase/neon/actions/runs/14017400543/job/39244816485) Fixes https://github.com/neondatabase/cloud/issues/28546 ## Summary of changes - Add a cleanup step that cleans up left-over projects - also give each project created during workflows a name that references the testcase and GitHub runid ## Example run (test of new job steps) https://github.com/neondatabase/neon/actions/runs/14837092399/job/41650741922#step:6:63 --------- Co-authored-by: a-masterov <72613290+a-masterov@users.noreply.github.com>	2025-05-05 14:30:13 +00:00
devin-ai-integration[bot]	baf425a2cd	[pageserver/virtual_file] impr: Improve OpenOptions API ergonomics (#11789 ) # Improve OpenOptions API ergonomics Closes #11787 This PR improves the OpenOptions API ergonomics by: 1. Making OpenOptions methods take and return owned Self instead of &mut self 2. Changing VirtualFile::open_with_options_v2 to take an owned OpenOptions 3. Removing unnecessary .clone() and .to_owned() calls These changes make the API more idiomatic Rust by leveraging the builder pattern with owned values, which is cleaner and more ergonomic than the previous approach. Link to Devin run: https://app.devin.ai/sessions/c2a4b24f7aca40a3b3777f4259bf8ee1 Requested by: christian@neon.tech --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: christian@neon.tech <christian@neon.tech>	2025-05-05 13:06:37 +00:00
Alex Chi Z.	0b243242df	fix(test): allow flush error in gc-compaction tests (#11822 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11762 ## Summary of changes While #11762 needs some work to refactor the error propagating thing, we can do a hacky fix for the gc-compaction tests to allow flush error during shutdown. It does not affect correctness. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-05 12:15:22 +00:00
Conrad Ludgate	6131d86ec9	proxy: allow invalid SNI (#11792 ) ## Problem Some PrivateLink customers are unable to use Private DNS. As such they use an invalid domain name to address Neon. We currently are rejecting those connections because we cannot resolve the correct certificate. ## Summary of changes 1. Ensure a certificate is always returned. 2. If there is an SNI field, use endpoint fallback if it doesn't match. I suggest reviewing each commit separately.	2025-05-05 11:18:55 +00:00
Konstantin Knizhnik	4b9087651c	Checked that stored LwLSN >= FirstNormalUnloggedLSN (#11750 ) ## Problem Undo unintended change `60b9fb1baf` ## Summary of changes Add assert that we are not storing fake LSN in LwLSN. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-05-02 19:27:59 +00:00
Konstantin Knizhnik	79699aebc8	Reserve in file descriptor pool sockets used for connections to page servers (#11798 ) ## Problem See https://github.com/neondatabase/neon/issues/11790 The neon extension opens connections to the pageservers, which consumes file descriptors. Postgres has a mechanism to count how many FDs are in use, but it doesn't know about those FDs. We should call ReserveExternalFD() or AcquireExternalFD() to account for them. ## Summary of changes Call `ReserveExternalFD()` for each shard --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Mikhail Kot <mikhail@neon.tech>	2025-05-02 14:36:10 +00:00
Alexander Bayandin	22290eb7ba	CI: notify relevant team about release deploy failures (#11797 ) ## Problem We notify only Storage team about failed deploys, but Compute and Proxy teams can also benefit from that ## Summary of changes - Adjust `notify-storage-release-deploy-failure` to notify the relevant team about failed deploy	2025-05-02 12:46:21 +00:00
Alex Chi Z.	bbc35e10b8	fix(test): increase timeouts for some tests (#11781 ) ## Problem Those tests are timing out more frequently after https://github.com/neondatabase/neon/pull/11585 ## Summary of changes Increase timeout for `test_pageserver_gc_compaction_smoke` Increase rollback wait timeout for `test_tx_abort_with_many_relations` Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-01 18:36:26 +00:00
Alex Chi Z.	ae2c3ac12f	test: revert relsizev2 config (#11759 ) ## Problem part of https://github.com/neondatabase/neon/issues/9516 One thing I realized in the past few months is that "no-way-back" things like this are scary to roll out without a fine-grained rollout infra. The plan was to flip the flag in the repo and roll it out soon, but I don't think rolling out would happen in the near future. So I'd rather revert the flag to avoid creating a discrepancy between staging and the regress tests. ## Summary of changes Not using rel_size_v2 by default in unit tests; we still have a few tests to explicitly test the new format so we still get some test coverage. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-01 17:51:10 +00:00
Vlad Lazar	16d594b7b3	pagectl: list layers for given key in decreasing LSN order (#11799 ) Adds an extra key CLI arg to `pagectl layer list-layer`. When provided, only layers with key ranges containing the key will be listed in decreasing LSN order (indices are preserved for `dump-layer`).	2025-05-01 15:56:43 +00:00
Suhas Thalanki	f999632327	Adding `anon` v2 support to the dockerfile (#11313 ) ## Problem Removed `anon` v1 support as described here: https://github.com/neondatabase/cloud/issues/22663 Adding `anon` v2 support to re-introduce the `pg_anon` extension. Related Issues: https://github.com/neondatabase/cloud/issues/20456 ## Summary of changes Adding `anon` v2 support by building it in the dockerfile	2025-05-01 15:22:01 +00:00
Shockingly Good	5bd850d15a	Fix the leaked tracing context for the "compute_monitor:run". (#11791 ) Removes the leaked tracing context for the "compute_monitor:run" log, which either inherited the "start_compute" span or also the HTTP request context. ## Problem The problem is that the context of the monitor's trace is unnecessarily populated with the span data inherited from previously within the same thread. ## Summary of changes The context is completely reset by moving the span from the thread spawning the monitor into the thread where the monitor will actually start working. Addresses https://github.com/neondatabase/cloud/issues/28145 ## Examples ### Before ``` 2025-04-30T16:39:05.840298Z INFO start_compute:compute_monitor:run: compute is not running, waiting before monitoring activity ``` ### After ``` 2025-04-30T16:39:05.840298Z INFO compute_monitor:run: compute is not running, waiting before monitoring activity ```	2025-05-01 09:09:10 +00:00
Dmitrii Kovalkov	1b789e8d7c	fix(pgxn/neon): Use proper member size in TermsCollectedMset and VotesCollectedMset (#11785 ) ## Problem `TermsCollectedMset` and `VotesCollectedMset` accept a MemberSet argument to find a quorum in. It may be either `wp->mconf.members` or `wp->mconf.new_members`. But the loops inside always use `wp->mconf.members.len`. If the sizes of member sets are different, it may lead to these functions not scanning all the safekeepers from `mset`. We are not planning to change the member set size dynamically now, but it's worth fixing anyway. - Part of https://github.com/neondatabase/neon/issues/11669 ## Summary of changes - Use proper size of member set in `TermsCollectedMset` and `VotesCollectedMset`	2025-04-30 16:50:21 +00:00
Arpad Müller	bec7427d9e	pull_timeline and sk logging fixes (#11786 ) This patch contains some fixes of issues I ran into for #11712: * make `pull_timeline` return success for a timeline that already exists. This follows general API design of storage components: API endpoints are retryable and converge to a status code, instead of starting to error. We change the `pull_timeline`'s return type a little bit, because we might not actually have a source sk to pull from. Note that the fix is not enough: there is still a race when two `pull_timeline` instances happen in parallel: we might try to enter both pulled timelines at the same time. That can be fixed later. * make `pull_timeline` support one safekeeper being down. In general, if one safekeeper is down, that's not a problem. The added comment explains a potential situation (found in the `test_lagging_sk` test for example) * don't log very long errors when computes try to connect to safekeepers that don't have the timeline yet, if `allow_timeline_creation` is false. That flag is enabled when a sk connection string with generation numbers is passed to the compute, so we'll hit this code path more often. E.g. when a safekeeper missed a timeline creation, but the compute connects to it first before the `pull_timeline` gets requested by the storcon reconciler: this is a perfectly normal situation. So don't log the whole error backtrace, and don't log it on the error log level, but only on info. Part of #11670	2025-04-30 16:24:01 +00:00
Alex Chi Z.	e2db76b9be	feat(pageserver): ondemand download reason observability (#11780 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11615 ## Summary of changes We don't understand the root cause of why we get resident size surge every now and then. This patch adds observability for that, and in the next week, we might have a better understanding of what's going on. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-30 16:04:00 +00:00
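
As an illustration of the behaviour described in 16d594b7b3 (pagectl: list layers for a given key in decreasing LSN order), here is a minimal sketch. It is not the actual `pagectl` code: `LayerDesc`, its fields, and `layers_for_key` are hypothetical names, and keys and LSNs are simplified to `u64`. It only shows the idea of filtering by key range, sorting by LSN descending, and keeping each layer's original index so the index can still be used with `dump-layer`.

```rust
use std::ops::Range;

/// Hypothetical, simplified layer descriptor (not the pageserver's real type).
#[derive(Debug, Clone, PartialEq)]
struct LayerDesc {
    /// Half-open key range covered by the layer; keys simplified to u64.
    key_range: Range<u64>,
    /// Upper LSN bound of the layer; simplified to u64.
    lsn_end: u64,
}

/// Returns `(original_index, layer)` pairs for every layer whose key range
/// contains `key`, ordered by decreasing LSN. The index is the position in
/// the unfiltered input, so it stays valid for a later per-index lookup.
fn layers_for_key(layers: &[LayerDesc], key: u64) -> Vec<(usize, &LayerDesc)> {
    let mut hits: Vec<(usize, &LayerDesc)> = layers
        .iter()
        .enumerate() // remember the original position before filtering
        .filter(|(_, layer)| layer.key_range.contains(&key))
        .collect();
    // Highest LSN first.
    hits.sort_by(|a, b| b.1.lsn_end.cmp(&a.1.lsn_end));
    hits
}
```

Enumerating before filtering is the design choice that keeps indices stable: a layer listed at index 3 in the filtered output is still layer 3 in the full listing.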

7872 Commits