rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-28 02:20:42 +00:00

Author	SHA1	Message	Date
Conrad Ludgate	a113c48c43	proxy: fix redis batching support (#11905 ) ## Problem For `StoreCancelKey`, we were inserting 2 commands, but we were not inserting two replies. This mismatch leads to errors when decoding the response. ## Summary of changes Abstract the command + reply pipeline so that commands and replies are registered at the same time.	2025-05-13 08:33:53 +00:00
Tristan Partin	9971fba584	Properly configure the dynamic loader to load our compiled libraries (#11858 ) The first line in /etc/ld.so.conf is: /etc/ld.so.conf.d/* We want to control library load order so that our compiled binaries are picked up before others from system packages. The previous solution allowed the system libraries to load before ours. Part-of: https://github.com/neondatabase/neon/issues/11857 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-05-12 17:36:07 +00:00
Conrad Ludgate	a77919f4b2	merge pg-sni-router into proxy (#11882 ) ## Problem We realised that pg-sni-router doesn't need to be separate from proxy. just a separate port. ## Summary of changes Add pg-sni-router config to proxy and expose the service.	2025-05-12 15:48:48 +00:00
Jakub Kołodziejczak	a618056770	chore(compute): skip audit logs for pg_session_jwt extension (#11883 ) references https://github.com/neondatabase/cloud/issues/28480#issuecomment-2866961124 related https://github.com/neondatabase/cloud/issues/28863 cc @MihaiBojin @conradludgate	2025-05-12 11:24:33 +00:00
Alex Chi Z.	307e1e64c8	fix(scrubber): more logs wrt relic timelines (#11895 ) ## Problem Further investigation on https://github.com/neondatabase/neon/issues/11159 reveals that the list_tenant function can find all the shards of the tenant, but then the shard gets missing during the gc timeline list blob. One reason could be that in some ways the timeline gets recognized as a relic timeline. ## Summary of changes Add logging to help identify the issue. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-12 09:17:35 +00:00
Arpad Müller	a537b2ffd0	pull_timeline: check tombstones by default (#11873 ) Make `pull_timeline` check tombstones by default. Otherwise, we'd be recreating timelines if the order between creation and deletion got mixed up, as seen in #11838. Fixes #11838.	2025-05-12 07:25:54 +00:00
Christian Schwarz	64353b48db	direct+concurrent IO: retroactive RFC (#11788 ) refs - direct IO epic: https://github.com/neondatabase/neon/issues/8130 - concurrent IO epic https://github.com/neondatabase/neon/issues/9378 - obsoletes direct IO proposal RFC: https://github.com/neondatabase/neon/pull/8240 - discussion in https://neondb.slack.com/archives/C07BZ38E6SD/p1746028030574349	2025-05-10 15:06:06 +00:00
Christian Schwarz	79ddc803af	feat(direct IO): runtime alignment validation; support config flag on macOS; default to `DirectRw` (#11868 ) This PR adds a runtime validation mode to check adherence to alignment and size-multiple requirements at the VirtualFile level. This can help prevent alignment bugs from slipping into production because test systems may have more lax requirements than production. (This is not the case today, but it could change in the future). It also allows catching O_DIRECT bugs on systems that don't have O_DIRECT (macOS). Consequently, we can now accept `virtual_file_io_mode={direct,direct-rw}` on macOS now. This has the side benefit of removing some annoying conditional compilation around `IoMode`. A third benefit is that it helped weed out size-multiple requirement violation bugs in how the VirtualFile unit tests exercise read and write APIs. I seized the opportunity to trim these tests down to what actually matters, i.e., exercising of the `OpenFiles` file descriptor cache. Lastly, this PR flips the binary-built-in default to `DirectRw` so that when running Python regress tests and benchmarks without specifying `PAGESERVER_VIRTUAL_FILE_IO_MODE`, one gets the production behavior. Refs - fixes https://github.com/neondatabase/neon/issues/11676	2025-05-10 14:19:52 +00:00
Christian Schwarz	f5070f6aa4	fixup(direct IO): PR #11864 broke test suite parametrization (#11887 ) PR - github.com/neondatabase/neon/pull/11864 committed yesterday rendered the `PAGESERVER_VIRTUAL_FILE_IO_MODE` env-var-based parametrization ineffective. As a consequence, the tests and benchmarks in `test_runner/` were using the binary built-in-default, i.e., `buffered`.	2025-05-09 18:13:35 +00:00
Matthias van de Meent	3b7cc4234c	Fix PS connect attempt timeouts when facing interrupts (#11880 ) With the 50ms timeouts of pumping state in connector.c, we need to correctly handle these timeouts that also wake up pg_usleep. This new approach makes the connection attempts re-start the wait whenever it gets woken up early; and CHECK_FOR_INTERRUPTS() is called to make sure we don't miss query cancellations. ## Problem https://neondb.slack.com/archives/C04DGM6SMTM/p1746794528680269 ## Summary of changes Make sure we start sleeping again if pg_usleep got woken up ahead of time.	2025-05-09 17:02:24 +00:00
Arpad Müller	33abfc2b74	storcon: remove finished safekeeper reconciliations from in-memory hashmap (#11876 ) ## Problem Currently there is a memory leak, in that finished safekeeper reconciliations leave a cancellation token behind which is never cleaned up. ## Summary of changes The change adds cleanup after finishing of a reconciliation. In order to ensure we remove the correct cancellation token, and we haven't raced with another reconciliation, we introduce a `TokenId` counter to tell tokens apart. Part of https://github.com/neondatabase/neon/issues/11670	2025-05-09 13:34:22 +00:00
Alex Chi Z.	93b964f829	fix(pageserver): do not do image compaction if it's below gc cutoff (#11872 ) ## Problem We observe image compaction errors after gc-compaction finishes compacting below the gc_cutoff. This is because `repartition` returns an LSN below the gc horizon as we (likely) determined that `distance <= self.repartition_threshold`. I think it's better to keep the current behavior of when to trigger compaction but we should skip image compaction if the returned LSN is below the gc horizon. ## Summary of changes If the repartition returns an invalid LSN, skip image compaction. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-09 12:07:52 +00:00
Vlad Lazar	d0aaec2abb	storage_controller: create imported timelines on safekeepers (#11801 ) ## Problem SK timeline creations were skipped for imported timelines since we didn't know the correct start LSN of the timeline at that point. ## Summary of changes Created imported timelines on the SK as part of the import finalize step. We use the last record LSN of shard 0 as the start LSN for the safekeeper timeline. Closes https://github.com/neondatabase/neon/issues/11569	2025-05-09 10:55:26 +00:00
Alex Chi Z.	d0dc65da12	fix(pageserver): give up gc-compaction if one key has too long history (#11869 ) ## Problem The limitation we imposed last week https://github.com/neondatabase/neon/pull/11709 is not enough to protect excessive memory usage. ## Summary of changes If a single key accumulated too much history, give up compaction. In the future, we can make the `generate_key_retention` function take a stream of keys instead of first accumulating them in memory, thus easily support such long key history cases. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-09 10:12:49 +00:00
Konstantin Knizhnik	03d635b916	Add more guards for prefetch_pump_state (#11859 ) ## Problem See https://neondb.slack.com/archives/C08PJ07BZ44/p1746566292750689 Looks like there are more cases when `prefetch_pump_state` can be called in unexpected place and cause core dump. ## Summary of changes Add more guards. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-05-09 09:07:08 +00:00
Conrad Ludgate	5cd7f936f9	fix(neon-rls): optimistically assume role grants are already assigned for replicas (#11811 ) ## Problem Read replicas cannot grant permissions for roles for Neon RLS. Usually the permission is already granted, so we can optimistically check. See INC-509 ## Summary of changes Perform a permission lookup prior to actually executing any grants.	2025-05-09 07:48:30 +00:00
Konstantin Knizhnik	101e115b38	Change prefetch logic in vacuum (#11650 ) ## Problem See https://neondb.slack.com/archives/C03QLRH7PPD/p1745003314183649 Vacuum doesn't use prefetch because this strange logic in `lazy_scan_heap`: ``` /* And only up to the next unskippable block */ if (next_prefetch_block + prefetch_budget > vacrel->next_unskippable_block) prefetch_budget = vacrel->next_unskippable_block - next_prefetch_block; ``` ## Summary of changes Disable prefetch only if vacuum jumps to next skippable block (there is SKIP_PAGES_THRESHOLD) which cancel seqscan and perform jump only if gap is large enough). Postgres PRs: https://github.com/neondatabase/postgres/pull/620 https://github.com/neondatabase/postgres/pull/621 https://github.com/neondatabase/postgres/pull/622 https://github.com/neondatabase/postgres/pull/623 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-05-09 06:54:40 +00:00
Christian Schwarz	b37bb7d7ed	pageserver: timeline shutdown: fully quiesce ingest path before`freeze_and_flush` (#11851 ) # Problem Before this PR, timeline shutdown would - cancel the walreceiver cancellation token subtree (child token of Timeline::cancel) - call freeze_and_flush - Timeline::cancel.cancel() - ... bunch of waiting for things ... - Timeline::gate.close() As noted by the comment that is deleted by this PR, this left a window where, after freeze_and_flush, walreceiver could still be running and ingest data into a new InMemoryLayer. This presents a potential source of log noise during Timeline shutdown where the InMemoryLayer created after the freeze_and_flush observes that Timeline::cancel is cancelled, failing the ingest with some anyhow::Error wrapping (deeply) a `FlushTaskError::Cancelled` instance (`flush task cancelled` error message). # Solution It turns out that it is quite easy to shut down, not just cancel, walreceiver completely because the only subtask spawned by walreceiver connection manager is the `handle_walreceiver_connection` task, which is properly shut down and waited upon when the manager task observes cancellation and exits its retry loop. The alternative is to replace all the usage of `anyhow` on the ingest path with differentiated error types. A lot of busywork for little gain to fix a potential logging noise nuisance, so, not doing that for now. # Correctness / Risk We do not risk leaking walreceiver child tasks because existing discipline is to hold a gate guard. We will prolong `Timeline::shutdown` to the degree that we're no longer making progress with the rest of shutdown while the walreceiver task hasn't yet observed cancellation. In practice, this should be negligible. `Timeline::shutdown` could fail to complete if there is a hidden dependency of walreceiver shutdown on some subsystem. The code certainly suggests there isn't, and I'm not aware of any such dependency. Anyway, impact will be low because we only shut down Timeline instances that are obsolete, either because there is a newer attachment at a different location, or because the timeline got deleted by the user. We would learn about this through stuck cplane operations or stuck storcon reconciliations. We would be able to mitigate by cancelling such stuck operations/reconciliations and/or by rolling back pageserver. # Refs - identified this while investigating https://github.com/neondatabase/neon/issues/11762 - PR that _does_ fix a bunch _real_ `flush task cancelled` noise on the compaction path: https://github.com/neondatabase/neon/pull/11853	2025-05-08 18:48:24 +00:00
Conrad Ludgate	bef5954fd7	feat(proxy): track SNI usage by protocol, including for http (#11863 ) ## Problem We want to see how many users of the legacy serverless driver are still using the old URL for SQL-over-HTTP traffic. ## Summary of changes Adds a protocol field to the connections_by_sni metric. Ensures it's incremented for sql-over-http.	2025-05-08 16:46:57 +00:00
Christian Schwarz	8477d15f95	feat(direct IO): remove special case in test suite for compat tests (#11864 ) PR - https://github.com/neondatabase/neon/pull/11558 adds special treatment for compat snapshot binaries which don't understand the `direct-rw` mode. A new compat snapshot has been published since, so, we can remove the special case. refs: - fixes https://github.com/neondatabase/neon/issues/11598	2025-05-08 16:11:45 +00:00
Arpad Müller	622b3b2993	Fixes for enabling --timelines-onto-safekeepers in tests (#11854 ) Second PR with fixes extracted from #11712, relating to `--timelines-onto-safekeepers`. Does the following: * Moves safekeeper registration to `neon_local` instead of the test fixtures * Pass safekeeper JWT token if `--timelines-onto-safekeepers` is enabled * Allow some warnings related to offline safekeepers (similarly to how we allow them for offline pageservers) * Enable generations on the compute's config if `--timelines-onto-safekeepers` is enabled * fix parallel `pull_timeline` race condition (the one that #11786 put for later) Fixes #11424 Part of #11670	2025-05-08 15:13:11 +00:00
Santosh Pingale	659366060d	Reuse remote_client from the SnapshotDownloader instead of recreating in download function (#11812 ) ## Problem At the moment, remote_client and target are recreated in download function. We could reuse it from SnapshotDownloader instance. This isn't a problem per se, just a quality of life improvement but it caught my attention when we were trying out snapshot downloading in one of the older version and ran into a curious case of s3 clients behaving in two different manners. One client that used `force_path_style` and other one didn't. Logs from this run: ``` 2025-05-02T12:56:22.384626Z DEBUG /data/snappie/2739e7da34e625e3934ef0b76fa12483/timelines/d44b831adb0a6ba96792dc3a5cc30910/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000014E8F20-00000000014E8F99-00000001 requires download... 2025-05-02T12:56:22.384689Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:apply_configuration: timeout settings for this operation: TimeoutConfig { connect_timeout: Set(3.1s), read_timeout: Disabled, operation_timeout: Disabled, operation_attempt_timeout: Disabled } 2025-05-02T12:56:22.384730Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op: entering 'serialization' phase 2025-05-02T12:56:22.384784Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op: entering 'before transmit' phase 2025-05-02T12:56:22.384813Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op: retry strategy has OKed initial request 2025-05-02T12:56:22.384841Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op: beginning attempt #1 2025-05-02T12:56:22.384870Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: resolving endpoint endpoint_params=EndpointResolverParams(TypeErasedBox[!Clone]:Params { bucket: Some("bucket"), region: Some("eu-north-1"), use_fips: false, use_dual_stack: false, endpoint: Some("https://s3.self-hosted.company.com"), force_path_style: false, accelerate: false, use_global_endpoint: false, use_object_lambda_endpoint: None, key: None, prefix: Some("/pageserver/tenants/2739e7da34e625e3934ef0b76fa12483/timelines/d44b831adb0a6ba96792dc3a5cc30910/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000014E8F20-00000000014E8F99-00000001"), copy_source: None, disable_access_points: None, disable_multi_region_access_points: false, use_arn_region: None, use_s3_express_control_endpoint: None, disable_s3_express_session_auth: None }) endpoint_prefix=None 2025-05-02T12:56:22.384979Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: will use endpoint Endpoint { url: "https://neon.s3.self-hosted.company.com", headers: {}, properties: {"authSchemes": Array([Object({"signingRegion": String("eu-north-1"), "disableDoubleEncoding": Bool(true), "name": String("sigv4"), "signingName": String("s3")})])} } 2025-05-02T12:56:22.385042Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt:lazy_load_identity:provide_credentials{provider=default_chain}: loaded credentials provider=Environment 2025-05-02T12:56:22.385066Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt:lazy_load_identity: identity cache miss occurred; added new identity (took 35.958µs) new_expiration=2025-05-02T13:11:22.385028Z valid_for=899.999961437s partition=IdentityCachePartition(5) 2025-05-02T12:56:22.385090Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: loaded identity 2025-05-02T12:56:22.385162Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: entering 'transmit' phase 2025-05-02T12:56:22.385211Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: new TCP connector created in 361ns 2025-05-02T12:56:22.385288Z DEBUG resolving host="neon.s3.self-hosted.company.com" 2025-05-02T12:56:22.390796Z DEBUG invoke{service=s3 operation=ListObjectVersions sdk_invocation_id=7315885}:try_op:try_attempt: encountered orchestrator error; halting ```	2025-05-08 14:09:15 +00:00
Christian Schwarz	42d93031a1	fixup(#11819 ): broken macOS build (#11861 ) refs - fixes https://github.com/neondatabase/neon/issues/11860	2025-05-08 11:48:29 +00:00
Mark Novikov	d22377c754	Skip event triggers in dump-restore (#11794 ) ## Problem Data import fails if the src db has any event triggers, because those can only be restored by a superuser. Specifically imports from Heroku and Supabase are guaranteed to fail. Closes https://github.com/neondatabase/cloud/issues/27353 ## Summary of changes Depends on `pg_dump` patches per each supported PostgreSQL version: - https://github.com/neondatabase/postgres/pull/630 - https://github.com/neondatabase/postgres/pull/629 - https://github.com/neondatabase/postgres/pull/627 - https://github.com/neondatabase/postgres/pull/628	2025-05-08 11:04:28 +00:00
Erik Grinaker	6c70789cfd	storcon: increase drain+fill secondary warmup timeout from 20 to 30 seconds (#11848 ) ## Problem During deployment drains/fills, we often see the storage controller giving up on warmups after 20 seconds, when the warmup is nearly complete (~90%). This can cause latency spikes for migrated tenants if they block on layer downloads. Touches https://github.com/neondatabase/cloud/issues/26193. ## Summary of changes Increase the drain and fill secondary warmup timeout from 20 to 30 seconds.	2025-05-08 10:14:41 +00:00
Dmitrii Kovalkov	7e55497e13	tests: flush wal before waiting for last record lsn (#11726 ) ## Problem Compute may flush WAL on page boundaries, leaving some records partially flushed for a long time. It leads to `wait_for_last_flush_lsn` stuck waiting for this partial LSN. - Closes: https://github.com/neondatabase/cloud/issues/27876 ## Summary of changes - Flush WAL via CHECKPOINT after requesting current_wal_lsn to make sure that the record we point to is flushed in full - Use proper endpoint in `test_timeline_detach_with_aux_files_with_detach_v1`	2025-05-08 10:00:45 +00:00
Vlad Lazar	40f32ea326	pageserver: refactor import flow and add job concurrency limiting (#11816 ) ## Problem Import code is one big block. Separating planning and execution will help with reporting progress of import to storcon (building block for resuming import). ## Summary of changes Split up the import into planning and execution. A concurrency limit driven by PS config is also added.	2025-05-08 09:19:14 +00:00
Christian Schwarz	1d1502bc16	fix(pageserver): `flush task cancelled` errors during timeline shutdown (#11853 ) # Refs - fixes https://github.com/neondatabase/neon/issues/11762 # Problem PR #10993 introduced internal retries for BufferedWriter flushes. PR #11052 added cancellation sensitivity to that retry loop. That cancellation sensitivity is an error path that didn't exist before. The result is that during timeline shutdown, after we `Timeline::cancel`, compaction can now fail with error `flush task cancelled`. The problem with that: 1. We mis-classify this as an `error!`-worthy event. 2. This causes tests to become flaky because the error is not in global `allowed_errors`. Technically we also trip the `compaction_circuit_breaker` because the resulting `CompactionError` is variant `::Other`. But since this is Timeline shutdown, is doesn't matter practically speaking. # Solution / Changes - Log the anyhow stack trace when classifying a compaction error as `error!`. This was helpful to identify sources of `flush task cancelled` errors. We only log at `error!` level in exceptional circumstances, so, it's ok to have bit verbose logs. - Introduce typed errors along the `BufferedWriter::write_`=> `BlobWriter::write_blob` => `{Delta,Image}LayerWriter::put_` => `Split{Delta,Image}LayerWriter::put_{value,image}` chain. - Proper mapping to `CompactionError`/`CreateImageLayersError` via new `From` impls. I am usually opposed to any magic `From` impls, but, it's how most of the compaction code works today. # Testing The symptoms are most prevalent in `test_runner/regress/test_branch_and_gc.py::test_branch_and_gc`. Before this PR, I was able to reproduce locally 1 or 2 times per 400 runs using `DEFAULT_PG_VERSION=15 BUILD_TYPE=release poetry run pytest --count 400 -n 8`. After this PR, it doesn't reproduce anymore after 2000 runs. # Future Work Technically the ingest path is also exposed to this new source of errors because `InMemoryLayer` is backed by `BufferedWriter`. But we haven't seen it occur in flaky tests yet. Details and a fix in - https://github.com/neondatabase/neon/pull/11851	2025-05-08 06:57:53 +00:00
Christian Schwarz	7eb85c56ac	tokio-epoll-uring: avoid warn! noise due to `ECANCELED` during shutdowns (#11819 ) # Problem Before this PR, `test_pageserver_catchup_while_compute_down` would occasionally fail due to scary-looking WARN log line ``` WARN ephemeral_file_buffered_writer{...}:flush_attempt{attempt=1}: \ error flushing buffered writer buffer to disk, retrying after backoff err=Operation canceled (os error 125) ``` After lengthy investigation, the conclusion is that this is likely due to a kernel bug related due to io_uring async workers (io-wq) and signals. The main indicator is that the error only ever happens in correlation with pageserver shtudown when SIGTERM is received. There is a fix that is merged in 6.14 kernels (`io-wq: backoff when retrying worker creation`). However, even when I revert that patch, the issue is not reproducible on 6.14, so, it remains a speculation. It was ruled out that the ECANCELED is due to the executor thread exiting before the async worker starts processing the operation. # Solution The workaround in this issue is to retry the operation on ECANCELED once. Retries are safe because the low-level io_engine operations are idempotent. (We don't use O_APPEND and I can't think of another flag that would make the APIs covered by this patch not idempotent.) # Testing With this PR, the warn! log no longer happens on [my reproducer setup](https://github.com/neondatabase/neon/issues/11446#issuecomment-2843015111). And the new rate-limited `info!`-level log line informing about the internal retry shows up instead, as expected. # Refs - fixes https://github.com/neondatabase/neon/issues/11446	2025-05-08 06:33:29 +00:00
Dmitrii Kovalkov	24d62c647f	storcon: add missing switch_timeline_membership method to sk client (#11850 ) ## Problem `switch_timeline_membership` is implemented on safekeeper's server side, but the is missing in the client. - Part of https://github.com/neondatabase/neon/issues/11823 ## Summary of changes - Add `switch_timeline_membership` method to `SafekeeperClient`	2025-05-07 17:00:41 +00:00
Shockingly Good	4d2e4b19c3	fix(compute) Correct the PGXN s3 gateway URL. (#11796 ) Corrects the postgres extension s3 gateway address to be not just a domain name but a full base URL. To make the code more readable, the option is renamed to "remote_ext_base_url", while keeping the old name also accessible by providing a clap argument alias. Also provides a very simple and, perhaps, even redundant unit test to confirm the logic behind parsing of the corresponding CLI argument. ## Problem As it is clearly stated in https://github.com/neondatabase/cloud/issues/26005, using of the short version of the domain name might work for now, but in the future, we should get rid of using the `default` namespace and this is where it will, most likely, break down. ## Summary of changes The changes adjust the domain name of the extension s3 gateway to use the proper base url format instead of the just domain name assuming the "default" namespace and add a new CLI argument name for to reflect the change and the expectance.	2025-05-07 16:34:08 +00:00
Alexey Kondratov	0691b73f53	fix(compute): Enforce cloud_admin role in compute_ctl connections (#11827 ) ## Problem Users can override some configuration parameters on the DB level with `ALTER DATABASE ... SET ...`. Some of these overrides, like `role` or `default_transaction_read_only`, affect `compute_ctl`'s ability to configure the DB schema properly. ## Summary of changes Enforce `role=cloud_admin`, `statement_timeout=0`, and move `default_transaction_read_only=off` override from control plane [1] to `compute_ctl`. Also, enforce `search_path=public` just in case, although we do not call any functions in user databases. [1]: `133dd8c4db/goapp/controlplane/internal/pkg/compute/provisioner/provisioner_common.go (L70)` Fixes https://github.com/neondatabase/cloud/issues/28532	2025-05-07 12:14:24 +00:00
Vlad Lazar	3cf5e1386c	pageserver: fix rough edges of pageserver tracing (#11842 ) ## Problem There's a few rough edges around PS tracing. ## Summary of changes * include compute request id in pageserver trace * use the get page specific context for GET_REL_SIZE and GET_BATCH * fix assertion in download layer trace ![image](https://github.com/user-attachments/assets/2ff6779c-7c2d-4102-8013-ada8203aa42f)	2025-05-07 10:13:26 +00:00
Alex Chi Z.	608afc3055	fix(scrubber): log download error (#11833 ) ## Problem We use `head_object` to determine whether an object exists or not. However, it does not always error due to a missing object. ## Summary of changes Log the error so that we can have a better idea what's going on with the scrubber errors in prod. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-07 09:21:17 +00:00
Tristan Partin	0ef6851219	Make the audience claim in compute JWTs a vector (#11845 ) According to RFC 7519, `aud` is generally an array of StringOrURI, but in special cases may be a single StringOrURI value. To accomodate future control plane work where a single token may work for multiple services, make the claim a vector. Link: https://www.rfc-editor.org/rfc/rfc7519#section-4.1.3 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-05-06 22:19:15 +00:00
Mikhail	5c356c63eb	endpoint_storage compute_ctl integration (#11550 ) Add `/lfc/(prewarm\|offload)` routes to `compute_ctl` which interact with endpoint storage. Add `prewarm_lfc_on_startup` spec option which, if enabled, downloads LFC prewarm data on compute startup. Resolves: https://github.com/neondatabase/cloud/issues/26343	2025-05-06 22:02:12 +00:00
Suhas Thalanki	384e3df2ad	fix: pinned anon extension to v2.1.0 (#11844 ) ## Problem Currently the setup for `anon` v2 in the compute image downloads the latest version of the extension. This can be problematic as on a compute start/restart it can download a version that is newer than what we have tested and potentially break things, hence not giving us the ability to control when the extension is updated. We were also using `v2.2.0`, which is not ready for production yet and has been clarified by the maintainer. Additional context: https://gitlab.com/dalibo/postgresql_anonymizer/-/issues/530 ## Summary of changes Changed the URL from which we download the `anon` extension to point to `v2.1.0` instead of `latest`.	2025-05-06 21:52:15 +00:00
Tristan Partin	f9b3a2e059	Add scoping to compute_ctl JWT claims (#11639 ) Currently we only have an admin scope which allows a user to bypass the compute_id check. When the admin scope is provided, validate the audience of the JWT to be "compute". Closes: https://github.com/neondatabase/cloud/issues/27614 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-05-06 19:51:10 +00:00
Jakub Kołodziejczak	79ee78ea32	feat(compute): enable audit logs for pg_session_jwt extension (#11829 ) related to https://github.com/neondatabase/cloud/issues/28480 related to https://github.com/neondatabase/pg_session_jwt/pull/36 cc @MihaiBojin @conradludgate @lneves12	2025-05-06 15:18:50 +00:00
Erik Grinaker	0e0ad073bf	storcon: fix split aborts removing other tenants (#11837 ) ## Problem When aborting a split, the code accidentally removes all other tenant shards from the in-memory map that have the same shard count as the aborted split, causing "tenant not found" errors. It will recover on a storcon restart, when it loads the persisted state. This issue has been present for at least a year. Resolves https://github.com/neondatabase/cloud/issues/28589. ## Summary of changes Only remove shards belonging to the relevant tenant when aborting a split. Also adds a regression test.	2025-05-06 13:57:34 +00:00
Alex Chi Z.	6827f2f58c	fix(pageserver): only keep `iter_with_options` API, improve docs in gc-compact (#11804 ) ## Problem Address comments in https://github.com/neondatabase/neon/pull/11709 ## Summary of changes - remove `iter` API, users always need to specify buffer size depending on the expected memory usage. - several doc improvements --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-05-06 12:27:16 +00:00
Peter Bendel	c82e363ed9	cleanup orphan projects created by python tests, too (#11836 ) ## Problem - some projects are created during GitHub workflows but not by action project_create but by python test scripts. If the python test fails the project is not deleted ## Summary of changes - make sure we cleanup those python created projects a few days after they are no longer used, too	2025-05-06 12:26:13 +00:00
Alexander Bayandin	50dc2fae77	compute-node.Dockerfile: remove layer with duplicated name (#11807 ) ## Problem Two `rust-extensions-build-pgrx14` layers were added independently in two different PRs, and the layers are exactly the same ## Summary of changes - Remove one of `rust-extensions-build-pgrx14` layers	2025-05-06 10:52:21 +00:00
Folke Behrens	62ac5b94b3	proxy: Include the exp/nbf timestamps in the errors (#11828 ) ## Problem It's difficult to tell when the JWT expired from current logs and error messages. ## Summary of changes Add exp/nbf timestamps to the respective error variants. Also use checked_add when deserializing a SystemTime from JWT. Related to INC-509	2025-05-06 09:28:25 +00:00
Konstantin Knizhnik	f0e7b3e0ef	Use unlogged build for gist_indexsortbuild_flush_ready_pages (#11753 ) ## Problem See https://github.com/neondatabase/neon/issues/11718 GIST index can be constructed in two ways: GIST_SORTED_BUILD and GIST_BUFFERING. We used unlogged build in the second case but not in the first. ## Summary of changes Use unlogged build in `gist_indexsortbuild_flush_ready_pages` Correspondent Postgres PRsL: https://github.com/neondatabase/postgres/pull/624 https://github.com/neondatabase/postgres/pull/625 https://github.com/neondatabase/postgres/pull/626 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-05-06 07:24:27 +00:00
Dmitrii Kovalkov	c6ff18affc	cosmetics(pgxn/neon): WP code small clean up (#11824 ) ## Problem Some small cosmetic changes I made while reading the code. Should not affect anything. ## Summary of changes - Remove `n_votes` field because it's not used anymore - Explicitly initialize `safekeepers_generation` with `INVALID_GENERATION` if the generation is not present (the struct is zero-initialized anyway, but the explicit initialization is better IMHO) - Access SafekeeperId via pointer `sk_id` created above	2025-05-06 06:51:51 +00:00
Heikki Linnakangas	16ca74a3f4	Add SAFETY comment on libc::sysconf() call (#11581 ) I got an 'undocumented_unsafe_blocks' clippy warning about it. Not sure why I got the warning now and not before, but in any case a comment is a good idea.	2025-05-06 06:49:23 +00:00
Peter Bendel	cb67f9a651	delete orphan left over projects (#11826 ) ## Problem sometimes our benchmarking GitHub workflow is terminated by side-effects beyond our control (e.g. GitHub runner looses connection to server) and then we have left-over Neon projects created during the workflow [Example where GitHub runner lost connection and project was not deleted](https://github.com/neondatabase/neon/actions/runs/14017400543/job/39244816485) Fixes https://github.com/neondatabase/cloud/issues/28546 ## Summary of changes - Add a cleanup step that cleans up left-over projects - also give each project created during workflows a name that references the testcase and GitHub runid ## Example run (test of new job steps) https://github.com/neondatabase/neon/actions/runs/14837092399/job/41650741922#step:6:63 --------- Co-authored-by: a-masterov <72613290+a-masterov@users.noreply.github.com>	2025-05-05 14:30:13 +00:00
devin-ai-integration[bot]	baf425a2cd	[pageserver/virtual_file] impr: Improve OpenOptions API ergonomics (#11789 ) # Improve OpenOptions API ergonomics Closes #11787 This PR improves the OpenOptions API ergonomics by: 1. Making OpenOptions methods take and return owned Self instead of &mut self 2. Changing VirtualFile::open_with_options_v2 to take an owned OpenOptions 3. Removing unnecessary .clone() and .to_owned() calls These changes make the API more idiomatic Rust by leveraging the builder pattern with owned values, which is cleaner and more ergonomic than the previous approach. Link to Devin run: https://app.devin.ai/sessions/c2a4b24f7aca40a3b3777f4259bf8ee1 Requested by: christian@neon.tech --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: christian@neon.tech <christian@neon.tech>	2025-05-05 13:06:37 +00:00
Alex Chi Z.	0b243242df	fix(test): allow flush error in gc-compaction tests (#11822 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11762 ## Summary of changes While #11762 needs some work to refactor the error propagating thing, we can do a hacky fix for the gc-compaction tests to allow flush error during shutdown. It does not affect correctness. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-05-05 12:15:22 +00:00

1 2 3 4 5 ...

7884 Commits