rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 21:50:37 +00:00

Author	SHA1	Message	Date
Christian Schwarz	28a2cd05d5	Merge commit '5ec82105c' into problame/standby-horizon-leases	2025-08-06 17:46:37 +02:00
Christian Schwarz	1635390a96	fix all clippy complaints in this branch	2025-08-06 17:39:17 +02:00
Christian Schwarz	1877b70a35	Merge commit 'e7d18bc18' into problame/standby-horizon-leases	2025-08-06 17:19:37 +02:00
Christian Schwarz	fb7a027211	Merge commit '4ee0da0a2' into problame/standby-horizon-leases	2025-08-06 17:17:45 +02:00
Christian Schwarz	47146fe1d6	Merge commit '7049003cf' into problame/standby-horizon-leases	2025-08-06 17:17:11 +02:00
Christian Schwarz	577eee16f9	https://github.com/neondatabase/neon/pull/12676#discussion_r2220512343 ; concern about backward compat of TimelineInfo	2025-08-05 23:07:26 +02:00
Christian Schwarz	2ee0f4271c	fix(page_service): lsn lease API puts tenant_shard_id in tenant_id tracing field The LSN lease api actually accepts a tenant_shard_id, not a tenant_id. But we put the Display of the tenant_shard_id into the tenant_id field. This PR fixes it. Refs - fixes https://databricks.atlassian.net/browse/LKB-2930	2025-08-05 22:48:27 +02:00
Christian Schwarz	8a9f1dd5e7	use tokio::time::Instant internally, chrono::DateTime<Utc> externally; communicate expiration through rfc3339 format; chrono::DateTime has good Debug fmt so this also serves observability; finish implementing release valve mechanism	2025-08-05 22:47:53 +02:00
Christian Schwarz	9f01840c18	use standby_horizon leases feature in the test, demonstrating that it passes now	2025-08-05 22:47:28 +02:00
Christian Schwarz	44466cebdb	WIP better observability for return values (SystemTime Debug is useless)	2025-08-05 22:46:54 +02:00
Christian Schwarz	b865e85de3	previous commit broke the tests because of the cfg business, see this commit's TODO	2025-08-05 22:46:24 +02:00
Christian Schwarz	73336962a8	finalize 3-stepped feature-gating (legacy,all,leases) + more tests + observability + fixes	2025-08-05 19:24:06 +02:00
Christian Schwarz	fc7267a760	feature-gate compute side code	2025-08-05 19:22:58 +02:00
Christian Schwarz	3365c8c648	enforce standby_horizon leases are always above applied_gc_cutoff (check against cutoff on upsert + block gc for lease length to allow renewals after attach)	2025-07-26 16:38:44 +02:00
Christian Schwarz	bc09df8823	add todo about init deadline	2025-07-26 16:23:59 +02:00
Christian Schwarz	e1eb98c0e9	add basic test & fix embarrassing bug in cull (needs comment out todo!())	2025-07-26 16:23:59 +02:00
Christian Schwarz	1e61ac6af2	cargo fmt (unrelated to prev commit)	2025-07-26 16:23:59 +02:00
Christian Schwarz	a948054db3	naming orthodoxy: always refer to leases as LSN leases	2025-07-26 16:23:59 +02:00
Christian Schwarz	2ee24900ca	have claude generate plumbing for standby_horizon_lease_length	2025-07-25 13:16:20 +02:00
Christian Schwarz	23d1029afd	explain why there's no need to check standby_horizon lease deadline for getpage requests	2025-07-25 09:30:27 +00:00
Christian Schwarz	b47d3900b9	observability and debugging facilities	2025-07-20 23:13:02 +00:00
Christian Schwarz	f4b38d5975	expand comment on why we normalize_lsn	2025-07-20 22:38:07 +00:00
Christian Schwarz	2a89f72389	rudimentary leases impl, lacks initial lease deadline stuff	2025-07-20 22:17:03 +00:00
Christian Schwarz	2b5bb850f2	WIP	2025-07-20 20:46:20 +00:00
Christian Schwarz	4dc3acf3ed	WIP: standby_horizon leases	2025-07-20 19:21:47 +00:00
Alex Chi Z.	5ec82105cc	fix(pageserver): ensure remote size gets computed (#12520 ) ## Problem Follow up of #12400 ## Summary of changes We didn't set remote_size_mb to Some when initialized so it never gets computed :( Also added a new API to force refresh the properties. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-09 15:35:19 +00:00
a-masterov	78a6daa874	Add retrying in Random ops test if parent branch is not found. (#12513 ) ## Problem Due to a lag in replication, we sometimes cannot get the parent branch definition just after completion of the Public API restore call. This leads to the test failures. https://databricks.atlassian.net/browse/LKB-279 ## Summary of changes The workaround is implemented. Now test retries up to 30 seconds, waiting for the branch definition to appear. --------- Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>	2025-07-09 15:28:04 +00:00
Alexander Lakhin	5c0de4ee8c	Fix parameter name in workload for test_multiple_subscription_branching (#12522 ) ## Problem As discovered in https://github.com/neondatabase/neon/issues/12394, test_multiple_subscription_branching generates skewed data distribution, that leads to test failures when the unevenly filled last table receives even more data. for table t0: pub_res = (42001,), sub_res = (42001,) for table t1: pub_res = (29001,), sub_res = (29001,) for table t2: pub_res = (21001,), sub_res = (21001,) for table t3: pub_res = (21001,), sub_res = (21001,) for table t4: pub_res = (1711001,), sub_res = (1711001,) ## Summary of changes Fix the name of the workload parameter to generate data as expected.	2025-07-09 15:22:54 +00:00
Mikhail	bc6a756f1c	ci: lint openapi specs using redocly (#12510 ) We need to lint specs for pageserver, endpoint storage, and safekeeper #0000	2025-07-09 14:29:45 +00:00
Erik Grinaker	8f3351fa91	pageserver/client_grpc: split GetPage batches across shards (#12469 ) ## Problem The rich gRPC Pageserver client needs to split GetPage batches that straddle multiple shards. Touches #11735. Requires #12462. ## Summary of changes Adds a `GetPageSplitter` which splits `GetPageRequest` that span multiple shards, and then reassembles the responses. Dispatches per-shard requests in parallel.	2025-07-09 14:17:22 +00:00
Mikhail	e7d18bc188	Replica promotion in compute_ctl (#12183 ) Add `/promote` method for `compute_ctl` promoting secondary replica to primary, depends on secondary being prewarmed. Add `compute-ctl` mode to `test_replica_promotes`, testing happy path only (no corner cases yet) Add openapi spec for `/promote` and `/lfc` handlers https://github.com/neondatabase/cloud/issues/19011 Resolves: https://github.com/neondatabase/cloud/issues/29807	2025-07-09 12:55:10 +00:00
Konstantin Knizhnik	4ee0da0a20	Check prefetch response before assignment to slot (#12371 ) ## Problem See [Slack Channel](https://databricks.enterprise.slack.com/archives/C091LHU6NNB) Dropping connection without resetting prefetch state can cause request/response mismatch. And lack of check response correctness in communicator_prefetch_lookupv can cause data corruption. ## Summary of changes 1. Validate response before assignment to prefetch slot. 2. Consume prefetch requests before sending any other requests. --------- Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-07-09 12:49:21 +00:00
Arpad Müller	7049003cf7	storcon: print viability of --timelines-onto-safekeepers (#12485 ) The `--timelines-onto-safekeepers` flag is very consequential in the sense that it controls every single timeline creation. However, we don't have any automatic insight whether enabling the option will break things or not. The main way things can break is by misconfigured safekeepers, say they are marked as paused in the storcon db. The best input so far we can obtain via manually connecting via storcon_cli and listing safekeepers, but this is cumbersome and manual so prone to human error. So at storcon startup, do a simulated "test creation" in which we call `timelines_onto_safekeepers` with the configuration provided to us, and print whether it was successful or not. No actual timeline is created, and nothing is written into the storcon db. The heartbeat info will not have reached us at that point yet, but that's okay, because we still fall back to safekeepers that don't have any heartbeat. Also print some general scheduling policy stats on initial safekeeper load. Part of #11670.	2025-07-09 12:02:44 +00:00
Erik Grinaker	3915995530	pageserver/client_grpc: add rich Pageserver gRPC client (#12462 ) ## Problem For the communicator, we need a rich Pageserver gRPC client. Touches #11735. Requires #12434. ## Summary of changes This patch adds an initial rich Pageserver gRPC client. It supports: * Sharded tenants across multiple Pageservers. * Pooling of connections, clients, and streams for efficient resource use. * Concurrent use by many callers. * Internal handling of GetPage bidirectional streams, with pipelining and error handling. * Automatic retries. * Observability. The client is still under development. In particular, it needs GetPage batch splitting, shard map updates, and performance optimization. This will be addressed in follow-up PRs.	2025-07-09 11:42:46 +00:00
Folke Behrens	5ea0bb2d4f	proxy: Drop unused metrics (#12521 ) * proxy_control_plane_token_acquire_seconds * proxy_allowed_ips_cache_misses * proxy_vpc_endpoint_id_cache_stats * proxy_access_blocker_flags_cache_stats * proxy_requests_auth_rate_limits_total * proxy_endpoints_auth_rate_limits * proxy_invalid_endpoints_total	2025-07-09 09:58:46 +00:00
Christian Schwarz	aac1f8efb1	refactor(compaction): eliminate `CompactionError::CollectKeyspaceError` variant (#12517 ) The only differentiated handling of it is for `is_critical`, which in turn is a `matches!()` on several variants of the `enum CollectKeyspaceError` which is the value contained inside `CompactionError::CollectKeyspaceError`. This PR introduces a new error for `repartition()`, allowing its immediate callers to inspect it like `is_critical` did. A drive-by fix is more precise classification of WaitLsnError::BadState when mapping to `tonic::Status`. refs - https://databricks.atlassian.net/browse/LKB-182	2025-07-09 08:41:36 +00:00
Alex Chi Z.	43dbded8c8	fix(pageserver): disallow lease creation below the applied gc cutoff (#12489 ) ## Problem close LKB-209 ## Summary of changes - We should not allow lease creation below the applied gc cutoff. - Also removed the condition for `AttachedSingle`. We should always check the lease against the gc cutoff in all attach modes. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-08 22:32:51 +00:00
Vlad Lazar	c848b995b2	safekeeper: trim dead senders before adding more (#12490 ) ## Problem We only trim the senders if we tried to send a message to them and discovered that the channel is closed. This is problematic if the pageserver keeps connecting while there's nothing to send back for the shard. In this scenario we never trim down the senders list and can panic due to the u8 limit. ## Summary of Changes Trim down the dead senders before adding a new one. Closes LKB-178	2025-07-08 21:24:59 +00:00
Trung Dinh	4dee2bfd82	pageserver: Introduce config to enable/disable eviction task (#12496 ) ## Problem We lost capability to explicitly disable the global eviction task (for testing). ## Summary of changes Add an `enabled` flag to `DiskUsageEvictionTaskConfig` to indicate whether we should run the eviction job or not.	2025-07-08 21:14:04 +00:00
Suhas Thalanki	09ff22a4d4	fix(compute): removing `NEON_EXT_INT_UPD` log statement added for debugging verbosity (#12509 ) Removes the `NEON_EXT_INT_UPD` log statement that was added for debugging verbosity.	2025-07-08 21:12:26 +00:00
Erik Grinaker	8223c1ba9d	pageserver/client_grpc: add initial gRPC client pools (#12434 ) ## Problem The communicator will need gRPC channel/client/stream pools for efficient reuse across many backends. Touches #11735. Requires #12396. ## Summary of changes Adds three nested resource pools: * `ChannelPool` for gRPC channels (i.e. TCP connections). * `ClientPool` for gRPC clients (i.e. `page_api::Client`). Acquires channels from `ChannelPool`. * `StreamPool` for gRPC GetPage streams. Acquires clients from `ClientPool`. These are minimal functional implementations that will need further improvements and performance optimization. However, the overall structure is expected to be roughly final, so reviews should focus on that. The pools are not yet in use, but will form the foundation of a rich gRPC Pageserver client used by the communicator (see #12462). This PR also adds the initial crate scaffolding for that client. See doc comments for details.	2025-07-08 20:58:18 +00:00
HaoyuHuang	3dad4698ec	PS changes #1 (#12467 ) # TLDR All changes are no-op except 1. publishing additional metrics. 2. problem VI ## Problem I It has come to my attention that the Neon Storage Controller doesn't correctly update its "observed" state of tenants previously associated with PSs that has come back up after a local data loss. It would still think that the old tenants are still attached to page servers and won't ask more questions. The pageserver has enough information from the reattach request/response to tell that something is wrong, but it doesn't do anything about it either. We need to detect this situation in production while I work on a fix. (I think there is just some misunderstanding about how Neon manages their pageserver deployments which got me confused about all the invariants.) ## Summary of changes I Added a `pageserver_local_data_loss_suspected` gauge metric that will be set to 1 if we detect a problematic situation from the reattch response. The problematic situation is when the PS doesn't have any local tenants but received a reattach response containing tenants. We can set up an alert using this metric. The alert should be raised whenever this metric reports non-zero number. Also added a HTTP PUT `http://pageserver/hadron-internal/reset_alert_gauges` API on the pageserver that can be used to reset the gauge and the alert once we manually rectify the situation (by restarting the HCC). ## Problem II Azure upload is 3x slower than AWS. -> 3x slower ingestion. The reason for the slower upload is that Azure upload in page server is much slower => higher flush latency => higher disk consistent LSN => higher back pressure. ## Summary of changes II Use Azure put_block API to uploads a 1 GB layer file in 8 blocks in parallel. I set the put_block block size to be 128 MB by default in azure config. To minimize neon changes, upload function passes the layer file path to the azure upload code through the storage metadata. 
This allows the azure put block to use FileChunkStreamRead to stream read from one partition in the file instead of loading all file data in memory and split it into 8 128 MB chunks. ## How is this tested? II 1. rust test_real_azure tests the put_block change. 3. I deployed the change in azure dev and saw flush latency reduces from ~30 seconds to 10 seconds. 4. I also did a bunch of stress test using sqlsmith and 100 GB TPCDS runs. ## Problem III Currently Neon limits the compaction tasks as 3/4 * CPU cores. This limits the overall compaction throughput and it can easily cause head-of-the-line blocking problems when a few large tenants are compacting. ## Summary of changes III This PR increases the limit of compaction tasks as `BG_TASKS_PER_THREAD` (default 4) * CPU cores. Note that `CONCURRENT_BACKGROUND_TASKS` also limits some other tasks `logical_size_calculation` and `layer eviction` . But compaction should be the most frequent and time-consuming task. ## Summary of changes IV This PR adds the following PageServer metrics: 1. `pageserver_disk_usage_based_eviction_evicted_bytes_total`: captures the total amount of bytes evicted. It's more straightforward to see the bytes directly instead of layers. 2. `pageserver_active_storage_operations_count`: captures the active storage operation, e.g., flush, L0 compaction, image creation etc. It's useful to visualize these active operations to get a better idea of what PageServers are spending cycles on in the background. ## Summary of changes V When investigating data corruptions, it's useful to search the base image and all WAL records of a page up to an LSN, i.e., a breakdown of GetPage@LSN request. This PR implements this functionality with two tools: 1. Extended `pagectl` with a new command to search the layer files for a given key up to a given LSN from the `index_part.json` file. The output can be used to download the files from S3 and then search the file contents using the second tool. 
Example usage: ``` cargo run --bin pagectl index-part search --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --path ~/Downloads/corruption/index_part.json-0000000c-formatted --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8 ``` Example output: ``` tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008028000002FEFF__000007089F0B5381-0000070C7679EEB9-0000000c tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000000000000000000000000000000000-000000067F0000801400008028000002F3F1__000006DD95B6F609-000006E2BA14C369-0000000c tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F000080140000802100001B0973__000006D33429F539-000006DD95B6F609-0000000c tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000164D81__000006C6343B2D31-000006D33429F539-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008021000017687B__000006BA344FA7F1-000006C6343B2D31-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000165BAB__000006AD34613D19-000006BA344FA7F1-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000137A39__0000069F34773461-000006AD34613D19-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F000080140000802100000D4000-000000067F000080140000802100000F0000__0000069F34773460-0000000b ``` 2. 
Added a unit test to search the layer file contents. It's not implemented part of `pagectl` because it depends on some test harness code, which can only be used by unit tests. Example usage: ``` cargo test --package pageserver --lib -- tenant::debug::test_search_key --exact --nocapture -- --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --data-dir /Users/chen.luo/Downloads/corruption --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8 ``` Example output: ``` # omitted image for brievity delta: 69F/769D8180: will_init: false, "OgAAALGkuwXwYp12nwYAAECGAAASIqLHAAAAAH8GAAAUgAAAIYAAAL1hDQD/DLGkuwUDAAAAEAAWAA==" delta: 69F/769CB6D8: will_init: false, "PQAAALGkuwXotZx2nwYAABAJAAAFk7tpACAGAH8GAAAUgAAAIYAAAL1hDQD/CQUAEAASALExuwUBAAAAAA==" ``` ## Problem VI Currently when page service resolves shards from page numbers, it doesn't fully support the case that the shard could be split in the middle. This will lead to query failures during the tenant split for either commit or abort cases (it's mostly for abort). ## Summary of changes VI This PR adds retry logic in `Cache::get()` to deal with shard resolution errors more gracefully. Specifically, it'll clear the cache and retry, instead of failing the query immediately. It also reduces the internal timeout to make retries faster. The PR also fixes a very obvious bug in `TenantManager::resolve_attached_shard` where the code tries to cache the computed the shard number, but forgot to recompute when the shard count is different. --------- Co-authored-by: William Huang <william.huang@databricks.com> Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com> Co-authored-by: Chen Luo <chen.luo@databricks.com> Co-authored-by: Vlad Lazar <vlad.lazar@databricks.com> Co-authored-by: Vlad Lazar <vlad@neon.tech>	2025-07-08 19:43:01 +00:00
Erik Grinaker	81e7218c27	pageserver: tighten up gRPC `page_api::Client` (#12396 ) This patch tightens up `page_api::Client`. It's mostly superficial changes, but also adds a new constructor that takes an existing gRPC channel, for use with the communicator connection pool.	2025-07-08 18:15:13 +00:00
Alex Chi Z.	a06c560ad0	feat(pageserver): critical path feature flags (#12449 ) ## Problem Some feature flags are used heavily on the critical path and we want the "get feature flag" operation as cheap as possible. ## Summary of changes Add a `test_remote_size_flag` as an example of such flags. In the future, we can use macro to generate all those fields. The flag is updated in the housekeeping loop. The retrieval of the flag is simply reading an atomic flag. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-08 16:55:00 +00:00
Vlad Lazar	477ab12b69	pageserver: touch up broker subscription reset (#12503 ) ## Problem The goal of this code was to test out if resetting the broker subscription helps alleviate the issues we've been seeing in staging. Looks like it did the trick. However, the original version was too eager. ## Summary of Changes Only reset the stream when: * we are waiting for WAL * there's no connection candidates lined up * we're not already connected to a safekeeper	2025-07-08 16:46:55 +00:00
Christian Schwarz	f9b05a42d7	refactor(compaction): remove `CompactionError::AlreadyRunning` variant, use `::Other` instead (#12512 ) The only call stack that can emit the `::AlreadyRunning` variant is ``` -> iteration_inner -> iteration -> compaction_iteration -> compaction_loop -> start_background_loops ``` And on that call stack, the only differentiated handling of it is its invocations of `log_compaction_error -> CompactionError::is_cancel()`, which returns `true` for `::AlreadyRunning`. I think the condition of `AlreadyRunning` is severe; it really shouldn't happen. So, this PR starts treating it as something that is to be logged at `ERROR` / `WARN` level, depending on the `degrate_to_warning` argument to `log_compaction_error`. refs - https://databricks.atlassian.net/browse/LKB-182	2025-07-08 16:45:34 +00:00
Folke Behrens	29d73e1404	http-utils: Temporarily accept duplicate params (#12504 ) ## Problem Grafana Alloy in cluster mode seems to send duplicate "seconds" scrape URL parameters when one of its instances is disrupted. ## Summary of changes Temporarily accept duplicate parameters as long as their value is identical.	2025-07-08 15:49:42 +00:00
Christian Schwarz	8a042fb8ed	refactor(compaction): eliminate `CompactionError::Offload` variant, map to `::Other` (#12505 ) Looks can be deceiving: the match blocks in `maybe_trip_compaction_breaker` and at the end of `compact_with_options` seem like differentiated error handling, but in reality, these branches are unreachable at runtime because the only source of `CompactionError::Offload` within the compaction code is at the end of `Tenant::compaction_iteration`. We can simply map offload cancellation to CompactionError::Cancelled and all other offload errors to ::Other, since there's no differentiated handling for them in the compaction code. Also, the OffloadError::RemoteStorage variant has no differentiated handling, but was wrapping the remote storage anyhow::Error in a `anyhow(thiserror(anyhow))` sandwich. This PR removes that variant, mapping all RemoteStorage errors to `OffloadError::Other`. Thereby, the sandwich is gone and we will get a proper anyhow backtrace to the remote storage error location when we debug-print the OffloadError (or the CompactionError if we map it to that). refs - https://databricks.atlassian.net/browse/LKB-182 - the observation that there's no need for differentiated handling of CompactionError::Offload was made in https://databricks.slack.com/archives/C09254R641L/p1751286453930269?thread_ts=1751284317.955159&cid=C09254R641L	2025-07-08 15:03:32 +00:00
Mikhail	f72115d0a9	Endpoint storage openapi spec (#12361 ) https://github.com/neondatabase/cloud/issues/19011	2025-07-08 14:37:24 +00:00
Christian Schwarz	7458d031b1	clippy: fix unfounded warning on macOS (#12501 ) Before this PR, macOS builds would get clippy warning ``` warning: `tokio_epoll_uring::thread_local_system` does not refer to an existing function ``` The reason is that the `thread_local_system` function is only defined on Linux. Add `allow-invalid = true` to make macOS clippy pass, and manually test that on Linux builds, clippy still fails when we use it. refs - https://databricks.slack.com/archives/C09254R641L/p1751917655527099 Co-authored-by: Christian Schwarz <Christian Schwarz>	2025-07-08 13:59:45 +00:00

8269 Commits