rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 21:20:37 +00:00

Author	SHA1	Message	Date
Arpad Müller	7d58647f5d	fix	2025-07-15 20:02:54 +02:00
Arpad Müller	783c66ea0b	Log all GET requests in safekeepers, pageservers, etc	2025-07-15 19:57:21 +02:00
Vlad Lazar	ffeede085e	libs: move metric collection for pageserver and safekeeper in a background task (#12525 ) ## Problem Safekeeper and pageserver metrics collection might time out. We've seen this in both hadron and neon. ## Summary of changes This PR moves metrics collection in PS/SK to the background so that we will always get some metrics, even though they may be somewhat delayed. Reducing metrics collection time is left to future work. --------- Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-10 11:58:22 +00:00
Mikhail	bdca5b500b	Fix test_lfc_prewarm: reduce number of prewarms, sleep before LFC offloading (#12515 ) Fixes: - Sleep before LFC offloading in `test_lfc_prewarm[autoprewarm]` to ensure offloaded LFC is the one exported after all writes finish - Reduce number of prewarms and increase timeout in `test_lfc_prewarm_under_workload` as debug builds were failing due to timeout. Additional changes: - Remove `check_pinned_entries`: https://github.com/neondatabase/neon/pull/12447#discussion_r2185946210 - Fix LFC error metrics description: https://github.com/neondatabase/neon/pull/12486#discussion_r2190763107	2025-07-10 11:11:53 +00:00
Erik Grinaker	f4b03ddd7b	pageserver/client_grpc: reap idle pool resources (#12476 ) ## Problem The gRPC client pools don't reap idle resources. Touches #11735. Requires #12475. ## Summary of changes Reap idle pool resources (channels/clients/streams) after 3 minutes of inactivity. Also restructure the `StreamPool` to use a mutex rather than atomics for synchronization, for simplicity. This will be optimized later.	2025-07-10 10:18:37 +00:00
Vlad Lazar	08b19f001c	pageserver: optionally force image layer creation on timeout (#12529 ) This PR introduces an `image_creation_timeout` to page servers so that we can force image creation after a certain period. This is set to 1 day on dev/staging for now, and will roll out to production 1/2 weeks later. The majority of the PR is boilerplate code to add the new knob. The specific changes of the PR are: 1. During L0 compaction, check if we should force a compaction if min(LSN) of all delta layers < force_image_creation LSN. 2. During image creation, check if we should force a compaction if the image's LSN < force_image_creation LSN and there are newer deltas with overlapping key ranges. 3. Also tweaked the image creation check interval to make sure we honor image_creation_timeout. Vlad's note: This should be a no-op. I added an extra PS config for the large timeline threshold to enable this. --------- Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-10 10:07:21 +00:00
Dimitri Fontaine	1a45b2ec90	Review security model for executing Event Trigger code. (#12463 ) When a function is owned by a superuser (bootstrap user or otherwise), we consider it safe to run it. Only a superuser could have installed it, typically from CREATE EXTENSION script: we trust the code to execute. ## Problem This is intended to solve running pg_graphql Event Triggers graphql_watch_ddl and graphql_watch_drop which are executing the secdef function graphql.increment_schema_version(). ## Summary of changes Allow executing Event Trigger function owned by a superuser and with SECURITY DEFINER properties. The Event Trigger code runs with superuser privileges, and we consider that it's fine. --------- Co-authored-by: Tristan Partin <tristan.partin@databricks.com>	2025-07-10 08:06:33 +00:00
Tristan Partin	13e38a58a1	Grant pg_signal_backend to neon_superuser (#12533 ) Allow neon_superuser to cancel backends from non-neon_superusers, excluding Postgres superusers. Signed-off-by: Tristan Partin <tristan.partin@databricks.com> Co-authored-by: Vikas Jain <vikas.jain@databricks.com>	2025-07-09 21:35:39 +00:00
Christian Schwarz	2edd59aefb	impr(compaction): unify checking of `CompactionError` for cancellation reason (#12392 ) There are a couple of places that call `CompactionError::is_cancel` but don't check the `::Other` variant via downcasting for root cause being cancellation. The only place that does it is `log_compaction_error`. It's sad we have to do it, but, until we get around cleaning up all the culprits, a step forward is to unify the behavior so that all places that inspect a `CompactionError` for cancellation reason follow the same behavior. Thus, this PR ... - moves the downcasting checks against the `::Other` variant from `log_compaction_error` into `is_cancel()` and - enforces via type system that `.is_cancel()` is used to check whether a CompactionError is due to cancellation (matching on the `CompactionError::ShuttingDown` will cause a compile-time error). I don't think there's a _serious_ case right now where matching instead of using `is_cancel` causes problems. The worst case I could find is the circuit breaker and `compaction_failed`, which don't really matter if we're shutting down the timeline anyway. But it's unaesthetic and might cause log/alert noise down the line, so, this PR fixes that at least. Refs - https://databricks.atlassian.net/browse/LKB-182 - slack conversation about this PR: https://databricks.slack.com/archives/C09254R641L/p1751284317955159	2025-07-09 21:15:44 +00:00
Alex Chi Z.	0b639ba608	fix(storcon): correctly pass through lease error code (#12519 ) ## Problem close LKB-199 ## Summary of changes Previously, we always returned the error as 500 to the cplane if an LSN lease request failed. This causes issues for the cplane, as it doesn't retry on 500. This patch correctly passes through the error and assigns the error code so that cplane can tell whether the error is retryable. (TODO: look at the cplane code and learn the retry logic). Note that this patch does not resolve LKB-253 -- we need to handle the not found error separately in the lsn lease path, e.g. wait until the tenant gets attached, or return 503 so that cplane can retry. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-09 20:22:55 +00:00
Tristan Partin	28f604d628	Make pg_monitor neon_superuser test more robust (#12532 ) Make sure to check for NULL just in case. Signed-off-by: Tristan Partin <tristan.partin@databricks.com> Co-authored-by: Vikas Jain <vikas.jain@databricks.com>	2025-07-09 18:45:50 +00:00
Vlad Lazar	fe0ddb7169	libs: make remote storage failure injection probabilistic (#12526 ) Change the unreliable storage wrapper to fail by probability when there are more failure attempts left. Co-authored-by: Yecheng Yang <carlton.yang@databricks.com>	2025-07-09 17:41:34 +00:00
Dmitrii Kovalkov	4bbabc092a	tests: wait for flush lsn in test_branch_creation_before_gc (#12527 ) ## Problem Test `test_branch_creation_before_gc` is flaky in the internal repo. Pageserver sometimes lags behind write LSN. When we call GC it might not reach the LSN we try to create the branch at yet. ## Summary of changes - Wait till flush lsn on pageserver reaches the latest LSN before calling GC.	2025-07-09 17:16:06 +00:00
Tristan Partin	12c26243fc	Fix typo in migration testing related to pg_monitor (#12530 ) We should be joining on the neon_superuser roleid, not the pg_monitor roleid. Signed-off-by: Tristan Partin <tristan.partin@databricks.com>	2025-07-09 16:47:21 +00:00
Erik Grinaker	2f71eda00f	pageserver/client_grpc: add separate pools for bulk requests (#12475 ) ## Problem GetPage bulk requests such as prefetches and vacuum can head-of-line block foreground requests, causing increased latency. Touches #11735. Requires #12469. ## Summary of changes * Use dedicated channel/client/stream pools for bulk GetPage requests. * Use lower concurrency but higher queue depth for bulk pools. * Make pool limits configurable. * Require unbounded client pool for stream pool, to avoid accidental starvation.	2025-07-09 16:12:59 +00:00
Alex Chi Z.	5ec82105cc	fix(pageserver): ensure remote size gets computed (#12520 ) ## Problem Follow up of #12400 ## Summary of changes We didn't set remote_size_mb to Some when initialized so it never gets computed :( Also added a new API to force refresh the properties. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-09 15:35:19 +00:00
a-masterov	78a6daa874	Add retrying in Random ops test if parent branch is not found. (#12513 ) ## Problem Due to a lag in replication, we sometimes cannot get the parent branch definition just after completion of the Public API restore call. This leads to test failures. https://databricks.atlassian.net/browse/LKB-279 ## Summary of changes A workaround is implemented: the test now retries for up to 30 seconds, waiting for the branch definition to appear. --------- Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>	2025-07-09 15:28:04 +00:00
Alexander Lakhin	5c0de4ee8c	Fix parameter name in workload for test_multiple_subscription_branching (#12522 ) ## Problem As discovered in https://github.com/neondatabase/neon/issues/12394, test_multiple_subscription_branching generates skewed data distribution, that leads to test failures when the unevenly filled last table receives even more data. for table t0: pub_res = (42001,), sub_res = (42001,) for table t1: pub_res = (29001,), sub_res = (29001,) for table t2: pub_res = (21001,), sub_res = (21001,) for table t3: pub_res = (21001,), sub_res = (21001,) for table t4: pub_res = (1711001,), sub_res = (1711001,) ## Summary of changes Fix the name of the workload parameter to generate data as expected.	2025-07-09 15:22:54 +00:00
Mikhail	bc6a756f1c	ci: lint openapi specs using redocly (#12510 ) We need to lint specs for pageserver, endpoint storage, and safekeeper #0000	2025-07-09 14:29:45 +00:00
Erik Grinaker	8f3351fa91	pageserver/client_grpc: split GetPage batches across shards (#12469 ) ## Problem The rich gRPC Pageserver client needs to split GetPage batches that straddle multiple shards. Touches #11735. Requires #12462. ## Summary of changes Adds a `GetPageSplitter` which splits `GetPageRequest` that span multiple shards, and then reassembles the responses. Dispatches per-shard requests in parallel.	2025-07-09 14:17:22 +00:00
Mikhail	e7d18bc188	Replica promotion in compute_ctl (#12183 ) Add `/promote` method for `compute_ctl` promoting secondary replica to primary, depends on secondary being prewarmed. Add `compute-ctl` mode to `test_replica_promotes`, testing happy path only (no corner cases yet) Add openapi spec for `/promote` and `/lfc` handlers https://github.com/neondatabase/cloud/issues/19011 Resolves: https://github.com/neondatabase/cloud/issues/29807	2025-07-09 12:55:10 +00:00
Konstantin Knizhnik	4ee0da0a20	Check prefetch response before assignment to slot (#12371 ) ## Problem See [Slack Channel](https://databricks.enterprise.slack.com/archives/C091LHU6NNB) Dropping a connection without resetting the prefetch state can cause a request/response mismatch, and the lack of a response correctness check in communicator_prefetch_lookupv can cause data corruption. ## Summary of changes 1. Validate response before assignment to prefetch slot. 2. Consume prefetch requests before sending any other requests. --------- Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-07-09 12:49:21 +00:00
Arpad Müller	7049003cf7	storcon: print viability of --timelines-onto-safekeepers (#12485 ) The `--timelines-onto-safekeepers` flag is very consequential in the sense that it controls every single timeline creation. However, we don't have any automatic insight whether enabling the option will break things or not. The main way things can break is by misconfigured safekeepers, say they are marked as paused in the storcon db. The best input so far we can obtain via manually connecting via storcon_cli and listing safekeepers, but this is cumbersome and manual so prone to human error. So at storcon startup, do a simulated "test creation" in which we call `timelines_onto_safekeepers` with the configuration provided to us, and print whether it was successful or not. No actual timeline is created, and nothing is written into the storcon db. The heartbeat info will not have reached us at that point yet, but that's okay, because we still fall back to safekeepers that don't have any heartbeat. Also print some general scheduling policy stats on initial safekeeper load. Part of #11670.	2025-07-09 12:02:44 +00:00
Erik Grinaker	3915995530	pageserver/client_grpc: add rich Pageserver gRPC client (#12462 ) ## Problem For the communicator, we need a rich Pageserver gRPC client. Touches #11735. Requires #12434. ## Summary of changes This patch adds an initial rich Pageserver gRPC client. It supports: * Sharded tenants across multiple Pageservers. * Pooling of connections, clients, and streams for efficient resource use. * Concurrent use by many callers. * Internal handling of GetPage bidirectional streams, with pipelining and error handling. * Automatic retries. * Observability. The client is still under development. In particular, it needs GetPage batch splitting, shard map updates, and performance optimization. This will be addressed in follow-up PRs.	2025-07-09 11:42:46 +00:00
Folke Behrens	5ea0bb2d4f	proxy: Drop unused metrics (#12521 ) * proxy_control_plane_token_acquire_seconds * proxy_allowed_ips_cache_misses * proxy_vpc_endpoint_id_cache_stats * proxy_access_blocker_flags_cache_stats * proxy_requests_auth_rate_limits_total * proxy_endpoints_auth_rate_limits * proxy_invalid_endpoints_total	2025-07-09 09:58:46 +00:00
Christian Schwarz	aac1f8efb1	refactor(compaction): eliminate `CompactionError::CollectKeyspaceError` variant (#12517 ) The only differentiated handling of it is for `is_critical`, which in turn is a `matches!()` on several variants of the `enum CollectKeyspaceError`, which is the value contained inside `CompactionError::CollectKeyspaceError`. This PR introduces a new error for `repartition()`, allowing its immediate callers to inspect it like `is_critical` did. A drive-by fix is a more precise classification of WaitLsnError::BadState when mapping to `tonic::Status`. refs - https://databricks.atlassian.net/browse/LKB-182	2025-07-09 08:41:36 +00:00
Alex Chi Z.	43dbded8c8	fix(pageserver): disallow lease creation below the applied gc cutoff (#12489 ) ## Problem close LKB-209 ## Summary of changes - We should not allow lease creation below the applied gc cutoff. - Also removed the condition for `AttachedSingle`. We should always check the lease against the gc cutoff in all attach modes. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-08 22:32:51 +00:00
Vlad Lazar	c848b995b2	safekeeper: trim dead senders before adding more (#12490 ) ## Problem We only trim the senders if we tried to send a message to them and discovered that the channel is closed. This is problematic if the pageserver keeps connecting while there's nothing to send back for the shard. In this scenario we never trim down the senders list and can panic due to the u8 limit. ## Summary of Changes Trim down the dead senders before adding a new one. Closes LKB-178	2025-07-08 21:24:59 +00:00
Trung Dinh	4dee2bfd82	pageserver: Introduce config to enable/disable eviction task (#12496 ) ## Problem We lost capability to explicitly disable the global eviction task (for testing). ## Summary of changes Add an `enabled` flag to `DiskUsageEvictionTaskConfig` to indicate whether we should run the eviction job or not.	2025-07-08 21:14:04 +00:00
Suhas Thalanki	09ff22a4d4	fix(compute): removing `NEON_EXT_INT_UPD` log statement added for debugging verbosity (#12509 ) Removes the `NEON_EXT_INT_UPD` log statement that was added for debugging verbosity.	2025-07-08 21:12:26 +00:00
Erik Grinaker	8223c1ba9d	pageserver/client_grpc: add initial gRPC client pools (#12434 ) ## Problem The communicator will need gRPC channel/client/stream pools for efficient reuse across many backends. Touches #11735. Requires #12396. ## Summary of changes Adds three nested resource pools: * `ChannelPool` for gRPC channels (i.e. TCP connections). * `ClientPool` for gRPC clients (i.e. `page_api::Client`). Acquires channels from `ChannelPool`. * `StreamPool` for gRPC GetPage streams. Acquires clients from `ClientPool`. These are minimal functional implementations that will need further improvements and performance optimization. However, the overall structure is expected to be roughly final, so reviews should focus on that. The pools are not yet in use, but will form the foundation of a rich gRPC Pageserver client used by the communicator (see #12462). This PR also adds the initial crate scaffolding for that client. See doc comments for details.	2025-07-08 20:58:18 +00:00
HaoyuHuang	3dad4698ec	PS changes #1 (#12467 ) # TLDR All changes are no-op except 1. publishing additional metrics. 2. problem VI ## Problem I It has come to my attention that the Neon Storage Controller doesn't correctly update its "observed" state of tenants previously associated with PSs that have come back up after a local data loss. It would still think that the old tenants are still attached to page servers and won't ask more questions. The pageserver has enough information from the reattach request/response to tell that something is wrong, but it doesn't do anything about it either. We need to detect this situation in production while I work on a fix. (I think there is just some misunderstanding about how Neon manages their pageserver deployments which got me confused about all the invariants.) ## Summary of changes I Added a `pageserver_local_data_loss_suspected` gauge metric that will be set to 1 if we detect a problematic situation from the reattach response. The problematic situation is when the PS doesn't have any local tenants but received a reattach response containing tenants. We can set up an alert using this metric. The alert should be raised whenever this metric reports a non-zero number. Also added an HTTP PUT `http://pageserver/hadron-internal/reset_alert_gauges` API on the pageserver that can be used to reset the gauge and the alert once we manually rectify the situation (by restarting the HCC). ## Problem II Azure upload is 3x slower than AWS. -> 3x slower ingestion. The reason for the slower upload is that Azure upload in page server is much slower => higher flush latency => higher disk consistent LSN => higher back pressure. ## Summary of changes II Use the Azure put_block API to upload a 1 GB layer file in 8 blocks in parallel. I set the put_block block size to be 128 MB by default in azure config. To minimize neon changes, the upload function passes the layer file path to the azure upload code through the storage metadata. 
This allows the azure put block to use FileChunkStreamRead to stream read from one partition in the file instead of loading all file data in memory and splitting it into eight 128 MB chunks. ## How is this tested? II 1. rust test_real_azure tests the put_block change. 2. I deployed the change in azure dev and saw flush latency drop from ~30 seconds to 10 seconds. 3. I also did a bunch of stress tests using sqlsmith and 100 GB TPCDS runs. ## Problem III Currently Neon limits the compaction tasks to 3/4 * CPU cores. This limits the overall compaction throughput and it can easily cause head-of-the-line blocking problems when a few large tenants are compacting. ## Summary of changes III This PR increases the limit of compaction tasks to `BG_TASKS_PER_THREAD` (default 4) * CPU cores. Note that `CONCURRENT_BACKGROUND_TASKS` also limits some other tasks, `logical_size_calculation` and `layer eviction`. But compaction should be the most frequent and time-consuming task. ## Summary of changes IV This PR adds the following PageServer metrics: 1. `pageserver_disk_usage_based_eviction_evicted_bytes_total`: captures the total amount of bytes evicted. It's more straightforward to see the bytes directly instead of layers. 2. `pageserver_active_storage_operations_count`: captures the active storage operations, e.g., flush, L0 compaction, image creation etc. It's useful to visualize these active operations to get a better idea of what PageServers are spending cycles on in the background. ## Summary of changes V When investigating data corruptions, it's useful to search the base image and all WAL records of a page up to an LSN, i.e., a breakdown of a GetPage@LSN request. This PR implements this functionality with two tools: 1. Extended `pagectl` with a new command to search the layer files for a given key up to a given LSN from the `index_part.json` file. The output can be used to download the files from S3 and then search the file contents using the second tool. 
Example usage: ``` cargo run --bin pagectl index-part search --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --path ~/Downloads/corruption/index_part.json-0000000c-formatted --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8 ``` Example output: ``` tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008028000002FEFF__000007089F0B5381-0000070C7679EEB9-0000000c tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000000000000000000000000000000000-000000067F0000801400008028000002F3F1__000006DD95B6F609-000006E2BA14C369-0000000c tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F000080140000802100001B0973__000006D33429F539-000006DD95B6F609-0000000c tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000164D81__000006C6343B2D31-000006D33429F539-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008021000017687B__000006BA344FA7F1-000006C6343B2D31-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000165BAB__000006AD34613D19-000006BA344FA7F1-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000137A39__0000069F34773461-000006AD34613D19-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F000080140000802100000D4000-000000067F000080140000802100000F0000__0000069F34773460-0000000b ``` 2. 
Added a unit test to search the layer file contents. It's not implemented as part of `pagectl` because it depends on some test harness code, which can only be used by unit tests. Example usage: ``` cargo test --package pageserver --lib -- tenant::debug::test_search_key --exact --nocapture -- --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --data-dir /Users/chen.luo/Downloads/corruption --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8 ``` Example output: ``` # omitted image for brevity delta: 69F/769D8180: will_init: false, "OgAAALGkuwXwYp12nwYAAECGAAASIqLHAAAAAH8GAAAUgAAAIYAAAL1hDQD/DLGkuwUDAAAAEAAWAA==" delta: 69F/769CB6D8: will_init: false, "PQAAALGkuwXotZx2nwYAABAJAAAFk7tpACAGAH8GAAAUgAAAIYAAAL1hDQD/CQUAEAASALExuwUBAAAAAA==" ``` ## Problem VI Currently when page service resolves shards from page numbers, it doesn't fully support the case that the shard could be split in the middle. This will lead to query failures during the tenant split for either commit or abort cases (it's mostly for abort). ## Summary of changes VI This PR adds retry logic in `Cache::get()` to deal with shard resolution errors more gracefully. Specifically, it'll clear the cache and retry, instead of failing the query immediately. It also reduces the internal timeout to make retries faster. The PR also fixes a very obvious bug in `TenantManager::resolve_attached_shard` where the code tries to cache the computed shard number, but forgot to recompute it when the shard count is different. --------- Co-authored-by: William Huang <william.huang@databricks.com> Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com> Co-authored-by: Chen Luo <chen.luo@databricks.com> Co-authored-by: Vlad Lazar <vlad.lazar@databricks.com> Co-authored-by: Vlad Lazar <vlad@neon.tech>	2025-07-08 19:43:01 +00:00
Erik Grinaker	81e7218c27	pageserver: tighten up gRPC `page_api::Client` (#12396 ) This patch tightens up `page_api::Client`. It's mostly superficial changes, but also adds a new constructor that takes an existing gRPC channel, for use with the communicator connection pool.	2025-07-08 18:15:13 +00:00
Alex Chi Z.	a06c560ad0	feat(pageserver): critical path feature flags (#12449 ) ## Problem Some feature flags are used heavily on the critical path and we want the "get feature flag" operation as cheap as possible. ## Summary of changes Add a `test_remote_size_flag` as an example of such flags. In the future, we can use macro to generate all those fields. The flag is updated in the housekeeping loop. The retrieval of the flag is simply reading an atomic flag. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-08 16:55:00 +00:00
Vlad Lazar	477ab12b69	pageserver: touch up broker subscription reset (#12503 ) ## Problem The goal of this code was to test out if resetting the broker subscription helps alleviate the issues we've been seeing in staging. Looks like it did the trick. However, the original version was too eager. ## Summary of Changes Only reset the stream when: * we are waiting for WAL * there's no connection candidates lined up * we're not already connected to a safekeeper	2025-07-08 16:46:55 +00:00
Christian Schwarz	f9b05a42d7	refactor(compaction): remove `CompactionError::AlreadyRunning` variant, use `::Other` instead (#12512 ) The only call stack that can emit the `::AlreadyRunning` variant is ``` -> iteration_inner -> iteration -> compaction_iteration -> compaction_loop -> start_background_loops ``` And on that call stack, the only differentiated handling of it is its invocations of `log_compaction_error -> CompactionError::is_cancel()`, which returns `true` for `::AlreadyRunning`. I think the condition of `AlreadyRunning` is severe; it really shouldn't happen. So, this PR starts treating it as something that is to be logged at `ERROR` / `WARN` level, depending on the `degrate_to_warning` argument to `log_compaction_error`. refs - https://databricks.atlassian.net/browse/LKB-182	2025-07-08 16:45:34 +00:00
Folke Behrens	29d73e1404	http-utils: Temporarily accept duplicate params (#12504 ) ## Problem Grafana Alloy in cluster mode seems to send duplicate "seconds" scrape URL parameters when one of its instances is disrupted. ## Summary of changes Temporarily accept duplicate parameters as long as their value is identical.	2025-07-08 15:49:42 +00:00
Christian Schwarz	8a042fb8ed	refactor(compaction): eliminate `CompactionError::Offload` variant, map to `::Other` (#12505 ) Looks can be deceiving: the match blocks in `maybe_trip_compaction_breaker` and at the end of `compact_with_options` seem like differentiated error handling, but in reality, these branches are unreachable at runtime because the only source of `CompactionError::Offload` within the compaction code is at the end of `Tenant::compaction_iteration`. We can simply map offload cancellation to CompactionError::Cancelled and all other offload errors to ::Other, since there's no differentiated handling for them in the compaction code. Also, the OffloadError::RemoteStorage variant has no differentiated handling, but was wrapping the remote storage anyhow::Error in an `anyhow(thiserror(anyhow))` sandwich. This PR removes that variant, mapping all RemoteStorage errors to `OffloadError::Other`. Thereby, the sandwich is gone and we will get a proper anyhow backtrace to the remote storage error location when we debug-print the OffloadError (or the CompactionError if we map it to that). refs - https://databricks.atlassian.net/browse/LKB-182 - the observation that there's no need for differentiated handling of CompactionError::Offload was made in https://databricks.slack.com/archives/C09254R641L/p1751286453930269?thread_ts=1751284317.955159&cid=C09254R641L	2025-07-08 15:03:32 +00:00
Mikhail	f72115d0a9	Endpoint storage openapi spec (#12361 ) https://github.com/neondatabase/cloud/issues/19011	2025-07-08 14:37:24 +00:00
Christian Schwarz	7458d031b1	clippy: fix unfounded warning on macOS (#12501 ) Before this PR, macOS builds would get clippy warning ``` warning: `tokio_epoll_uring::thread_local_system` does not refer to an existing function ``` The reason is that the `thread_local_system` function is only defined on Linux. Add `allow-invalid = true` to make macOS clippy pass, and manually test that on Linux builds, clippy still fails when we use it. refs - https://databricks.slack.com/archives/C09254R641L/p1751917655527099 Co-authored-by: Christian Schwarz <Christian Schwarz>	2025-07-08 13:59:45 +00:00
Aleksandr Sarantsev	38384c37ac	Make node deletion context-aware (#12494 ) ## Problem Deletion process does not calculate preferred nodes correctly - it doesn't consider current tenant-shard layout among all pageservers. ## Summary of changes - Added a schedule context calculation for node deletion Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-08 13:15:14 +00:00
Christian Schwarz	2b2a547671	fix(tests): periodic and immediate gc is effectively a no-op in tests (#12431 ) The introduction of the default lease deadline feature 9 months ago made it so that after PS restart, `.timeline_gc()` calls in Python tests are no-ops for 10 minutes after pageserver startup: the `gc_iteration()` bails with `Skipping GC because lsn lease deadline is not reached`. I did some impact analysis in the following PR. About 30 Python tests are affected: - https://github.com/neondatabase/neon/pull/12411 Rust tests that don't explicitly enable periodic GC or invoke GC manually are unaffected because we disable periodic GC by default in the `TenantHarness`'s tenant config. Two tests explicitly did `start_paused=true` + `tokio::time::advance()`, but it would add cognitive and code bloat to each existing and future test case that uses TenantHarness if we took that route. So, this PR sets the default lease deadline feature in both Python and Rust tests to zero by default. Tests that test the feature were thus identified by failing the test: - Python test `test_readonly_node_gc` + `test_lsn_lease_size` - Rust test `test_lsn_lease`. To accomplish the above, I changed the code that computes the initial lease deadline to respect the pageserver.toml's default tenant config, which it didn't before (and which I would consider a bug). The Python test harness and the Rust TenantHarness test harness then simply set the default tenant config field to zero. Drive-by: - `test_lsn_lease_size` was writing a lot of data unnecessarily; reduce the amount and speed up the test refs - PR that introduced default lease deadline: https://github.com/neondatabase/neon/pull/9055/files - fixes https://databricks.atlassian.net/browse/LKB-92 --------- Co-authored-by: Christian Schwarz <Christian Schwarz>	2025-07-08 12:56:22 +00:00
a-masterov	59e393aef3	Enable parallel execution of extension tests (#12118 ) ## Problem Extension tests were previously run sequentially, resulting in unnecessary wait time and underutilization of available CPU cores. ## Summary of changes Tests are now executed in a customizable number of parallel threads using separate database branches. This reduces overall test time by approximately 50% (e.g., on my laptop, parallel test lasts 173s, while sequential one lasts 340s) and increases the load on the pageserver, providing better test coverage. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>	2025-07-08 11:28:39 +00:00
Peter Bendel	f51ed4a2c4	"disable" disk eviction in pagebench periodic benchmark (#12487 ) ## Problem https://github.com/neondatabase/neon/pull/12464 introduced new defaults for pageserver disk based eviction which activated disk based eviction for the pagebench periodic benchmark. This caused the testcase to fail. ## Summary of changes Override the new defaults during testcase execution. ## Test run https://github.com/neondatabase/neon/actions/runs/16120217757/job/45483869734 Test run was successful, so merging this now	2025-07-08 09:38:06 +00:00
Mikhail	4f16ab3f56	add lfc offload and prewarm error metrics (#12486 ) Add `compute_ctl_lfc_prewarm_errors_total` and `compute_ctl_lfc_offload_errors_total` metrics. Add comments in `test_lfc_prewarm`. Correction PR for https://github.com/neondatabase/neon/pull/12447 https://github.com/neondatabase/cloud/issues/19011	2025-07-08 09:34:01 +00:00
Dmitrii Kovalkov	18796fd1dd	tests: more allowed errors for test_safekeeper_migration (#12495 ) ## Problem Pageserver now writes errors in the log during the safekeeper migration. Some errors are added to allowed errors, but "timeline not found in global map" is not. - Will be properly fixed in https://github.com/neondatabase/neon/issues/12191 ## Summary of changes Add the "timeline not found in global map" error to the list of allowed errors in `test_safekeeper_migration_simple`	2025-07-08 09:15:29 +00:00
Aleksandr Sarantsev	2f3fc7cb57	Fix keep-failing reconciles test & add logs (#12497 ) ## Problem Test is flaky due to the following warning in the logs: ``` Keeping extra secondaries: can't determine which of [NodeId(1), NodeId(2)] to remove (some nodes offline?) ``` Some nodes being offline is expected behavior in this test. ## Summary of changes - Added `Keeping extra secondaries` to the list of allowed errors - Improved logging for better debugging experience Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-08 08:51:50 +00:00
Folke Behrens	e65d5f7369	proxy: Remove the endpoint filter cache (#12488 ) ## Problem The endpoint filter cache is still unused because it's not yet reliable enough to be used. It only consumes a lot of memory. ## Summary of changes Remove the code. Needs a new design. neondatabase/cloud#30634	2025-07-07 17:46:33 +00:00
Conrad Ludgate	55aef2993d	introduce a JSON serialization lib (#12417 ) See #11992 and #11961 for some examples of use cases. This introduces a JSON serialization lib, designed for more flexibility than serde_json offers. ## Dynamic construction Sometimes you have dynamic values you want to serialize, that are not already in a serde-aware model like a struct or a Vec etc. To achieve this with serde, you need to implement a lot of different traits on a lot of different new-types. Because of this, it's often easier to give in and pull all the data into a serde-aware model (serde_json::Value or some intermediate struct), but that is often not very efficient. This crate allows full control over the JSON encoding without needing to implement any extra traits. Just call the relevant functions, and it will guarantee a correctly encoded JSON value. ## Async construction Similar to the above, sometimes the values arrive asynchronously. Often collecting those values in memory is more expensive than writing them as JSON, since the overheads of `Vec` and `String` are much higher, however there are exceptions. Serializing to JSON all in one go is also more CPU intensive and can cause lag spikes, whereas serializing values incrementally spreads out the CPU load and reduces lag.	2025-07-07 15:12:02 +00:00
Erik Grinaker	1eef961f09	pageserver: add gRPC error logging (#12445 ) ## Problem We don't log gRPC request errors on the server. Touches #11728. ## Summary of changes Automatically log non-OK gRPC response statuses in the observability middleware, and add corresponding logging for the `get_pages` stream. Also adds the peer address and gRPC method to the gRPC tracing span. Example output: ``` 2025-07-02T20:18:16.813718Z WARN grpc:pageservice{peer=127.0.0.1:56698 method=CheckRelExists tenant_id=c7b45faa1924b1958f05c5fdee8b0d04 timeline_id=4a36ee64fd2f97781b9dcc2c3cddd51b shard_id=0000}: request failed with NotFound: Tenant c7b45faa1924b1958f05c5fdee8b0d04 not found ```	2025-07-07 12:24:06 +00:00

8259 Commits