rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-08 06:30:37 +00:00

Author	SHA1	Message	Date
Christian Schwarz	fb7a027211	Merge commit '4ee0da0a2' into problame/standby-horizon-leases	2025-08-06 17:17:45 +02:00
Christian Schwarz	47146fe1d6	Merge commit '7049003cf' into problame/standby-horizon-leases	2025-08-06 17:17:11 +02:00
Christian Schwarz	577eee16f9	https://github.com/neondatabase/neon/pull/12676#discussion_r2220512343 ; concern about backward compat of TimelineInfo	2025-08-05 23:07:26 +02:00
Christian Schwarz	2ee0f4271c	fix(page_service): lsn lease API puts tenant_shard_id in tenant_id tracing field The LSN lease api actually accepts a tenant_shard_id, not a tenant_id. But we put the Display of the tenant_shard_id into the tenant_id field. This PR fixes it. Refs - fixes https://databricks.atlassian.net/browse/LKB-2930	2025-08-05 22:48:27 +02:00
Christian Schwarz	8a9f1dd5e7	use tokio::time::Instant internally, chrono::DateTime<Utc> externally; commuicate expiration through rfc3339 format; chrono::DateTime has good Debug fmt so this also serves observability; finish implementing release valve mechanism	2025-08-05 22:47:53 +02:00
Christian Schwarz	9f01840c18	use standby_horizon leases feature in the test, demonstrating that it passes now	2025-08-05 22:47:28 +02:00
Christian Schwarz	44466cebdb	WIP better observability for return values (SystemTime Debug is useless)	2025-08-05 22:46:54 +02:00
Christian Schwarz	b865e85de3	previous commit broke the tests because of the cfg business, see this commit's TODO	2025-08-05 22:46:24 +02:00
Christian Schwarz	73336962a8	finalize 3-stepped feature-gating (legacy,all,leases) + more tests + observability + fixes	2025-08-05 19:24:06 +02:00
Christian Schwarz	fc7267a760	feature-gate compute side code	2025-08-05 19:22:58 +02:00
Christian Schwarz	3365c8c648	enforce standby_horizon leases are always above applied_gc_cutoff (check against cutoff on upsert + block gc for lease length to allow renewals after attach)	2025-07-26 16:38:44 +02:00
Christian Schwarz	bc09df8823	add todo about init deadline	2025-07-26 16:23:59 +02:00
Christian Schwarz	e1eb98c0e9	add basic test & fix embarrasing bug in cull (needs comment out todo!())	2025-07-26 16:23:59 +02:00
Christian Schwarz	1e61ac6af2	cargo fmt (unrelated to prev commit)	2025-07-26 16:23:59 +02:00
Christian Schwarz	a948054db3	naming orhtodoxy: always refere to leases as LSN leases	2025-07-26 16:23:59 +02:00
Christian Schwarz	2ee24900ca	have claude generate plumbing for standby_horizon_lease_length	2025-07-25 13:16:20 +02:00
Christian Schwarz	23d1029afd	explain why there's no need to check standby_horizon lease deadline for getpage requests	2025-07-25 09:30:27 +00:00
Christian Schwarz	b47d3900b9	observability and debugging facilities	2025-07-20 23:13:02 +00:00
Christian Schwarz	f4b38d5975	expand comment on why we normalize_lsn	2025-07-20 22:38:07 +00:00
Christian Schwarz	2a89f72389	rudimentary leases impl, lacks initial lease deadline stuff	2025-07-20 22:17:03 +00:00
Christian Schwarz	2b5bb850f2	WIP	2025-07-20 20:46:20 +00:00
Christian Schwarz	4dc3acf3ed	WIP: standby_horizon leases	2025-07-20 19:21:47 +00:00
Konstantin Knizhnik	4ee0da0a20	Check prefetch response before assignment to slot (#12371 ) ## Problem See [Slack Channel](https://databricks.enterprise.slack.com/archives/C091LHU6NNB) Dropping connection without resetting prefetch state can cause request/response mismatch. And lack of check response correctness in communicator_prefetch_lookupv can cause data corruption. ## Summary of changes 1. Validate response before assignment to prefetch slot. 2. Consume prefetch requests before sending any other requests. --------- Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-07-09 12:49:21 +00:00
Arpad Müller	7049003cf7	storcon: print viability of --timelines-onto-safekeepers (#12485 ) The `--timelines-onto-safekeepers` flag is very consequential in the sense that it controls every single timeline creation. However, we don't have any automatic insight whether enabling the option will break things or not. The main way things can break is by misconfigured safekeepers, say they are marked as paused in the storcon db. The best input so far we can obtain via manually connecting via storcon_cli and listing safekeepers, but this is cumbersome and manual so prone to human error. So at storcon startup, do a simulated "test creation" in which we call `timelines_onto_safekeepers` with the configuration provided to us, and print whether it was successful or not. No actual timeline is created, and nothing is written into the storcon db. The heartbeat info will not have reached us at that point yet, but that's okay, because we still fall back to safekeepers that don't have any heartbeat. Also print some general scheduling policy stats on initial safekeeper load. Part of #11670.	2025-07-09 12:02:44 +00:00
Erik Grinaker	3915995530	pageserver/client_grpc: add rich Pageserver gRPC client (#12462 ) ## Problem For the communicator, we need a rich Pageserver gRPC client. Touches #11735. Requires #12434. ## Summary of changes This patch adds an initial rich Pageserver gRPC client. It supports: * Sharded tenants across multiple Pageservers. * Pooling of connections, clients, and streams for efficient resource use. * Concurrent use by many callers. * Internal handling of GetPage bidirectional streams, with pipelining and error handling. * Automatic retries. * Observability. The client is still under development. In particular, it needs GetPage batch splitting, shard map updates, and performance optimization. This will be addressed in follow-up PRs.	2025-07-09 11:42:46 +00:00
Folke Behrens	5ea0bb2d4f	proxy: Drop unused metrics (#12521 ) * proxy_control_plane_token_acquire_seconds * proxy_allowed_ips_cache_misses * proxy_vpc_endpoint_id_cache_stats * proxy_access_blocker_flags_cache_stats * proxy_requests_auth_rate_limits_total * proxy_endpoints_auth_rate_limits * proxy_invalid_endpoints_total	2025-07-09 09:58:46 +00:00
Christian Schwarz	aac1f8efb1	refactor(compaction): eliminate `CompactionError::CollectKeyspaceError` variant (#12517 ) The only differentiated handling of it is for `is_critical`, which in turn is a `matches!()` on several variants of the `enum CollectKeyspaceError` which is the value contained insided `CompactionError::CollectKeyspaceError`. This PR introduces a new error for `repartition()`, allowing its immediate callers to inspect it like `is_critical` did. A drive-by fix is more precise classification of WaitLsnError::BadState when mapping to `tonic::Status`. refs - https://databricks.atlassian.net/browse/LKB-182	2025-07-09 08:41:36 +00:00
Alex Chi Z.	43dbded8c8	fix(pageserver): disallow lease creation below the applied gc cutoff (#12489 ) ## Problem close LKB-209 ## Summary of changes - We should not allow lease creation below the applied gc cutoff. - Also removed the condition for `AttachedSingle`. We should always check the lease against the gc cutoff in all attach modes. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-08 22:32:51 +00:00
Vlad Lazar	c848b995b2	safekeeper: trim dead senders before adding more (#12490 ) ## Problem We only trim the senders if we tried to send a message to them and discovered that the channel is closed. This is problematic if the pageserver keeps connecting while there's nothing to send back for the shard. In this scenario we never trim down the senders list and can panic due to the u8 limit. ## Summary of Changes Trim down the dead senders before adding a new one. Closes LKB-178	2025-07-08 21:24:59 +00:00
Trung Dinh	4dee2bfd82	pageserver: Introduce config to enable/disable eviction task (#12496 ) ## Problem We lost capability to explicitly disable the global eviction task (for testing). ## Summary of changes Add an `enabled` flag to `DiskUsageEvictionTaskConfig` to indicate whether we should run the eviction job or not.	2025-07-08 21:14:04 +00:00
Suhas Thalanki	09ff22a4d4	fix(compute): removing `NEON_EXT_INT_UPD` log statement added for debugging verbosity (#12509 ) Removes the `NEON_EXT_INT_UPD` log statement that was added for debugging verbosity.	2025-07-08 21:12:26 +00:00
Erik Grinaker	8223c1ba9d	pageserver/client_grpc: add initial gRPC client pools (#12434 ) ## Problem The communicator will need gRPC channel/client/stream pools for efficient reuse across many backends. Touches #11735. Requires #12396. ## Summary of changes Adds three nested resource pools: * `ChannelPool` for gRPC channels (i.e. TCP connections). * `ClientPool` for gRPC clients (i.e. `page_api::Client`). Acquires channels from `ChannelPool`. * `StreamPool` for gRPC GetPage streams. Acquires clients from `ClientPool`. These are minimal functional implementations that will need further improvements and performance optimization. However, the overall structure is expected to be roughly final, so reviews should focus on that. The pools are not yet in use, but will form the foundation of a rich gRPC Pageserver client used by the communicator (see #12462). This PR also adds the initial crate scaffolding for that client. See doc comments for details.	2025-07-08 20:58:18 +00:00
HaoyuHuang	3dad4698ec	PS changes #1 (#12467 ) # TLDR All changes are no-op except 1. publishing additional metrics. 2. problem VI ## Problem I It has come to my attention that the Neon Storage Controller doesn't correctly update its "observed" state of tenants previously associated with PSs that has come back up after a local data loss. It would still think that the old tenants are still attached to page servers and won't ask more questions. The pageserver has enough information from the reattach request/response to tell that something is wrong, but it doesn't do anything about it either. We need to detect this situation in production while I work on a fix. (I think there is just some misunderstanding about how Neon manages their pageserver deployments which got me confused about all the invariants.) ## Summary of changes I Added a `pageserver_local_data_loss_suspected` gauge metric that will be set to 1 if we detect a problematic situation from the reattch response. The problematic situation is when the PS doesn't have any local tenants but received a reattach response containing tenants. We can set up an alert using this metric. The alert should be raised whenever this metric reports non-zero number. Also added a HTTP PUT `http://pageserver/hadron-internal/reset_alert_gauges` API on the pageserver that can be used to reset the gauge and the alert once we manually rectify the situation (by restarting the HCC). ## Problem II Azure upload is 3x slower than AWS. -> 3x slower ingestion. The reason for the slower upload is that Azure upload in page server is much slower => higher flush latency => higher disk consistent LSN => higher back pressure. ## Summary of changes II Use Azure put_block API to uploads a 1 GB layer file in 8 blocks in parallel. I set the put_block block size to be 128 MB by default in azure config. To minimize neon changes, upload function passes the layer file path to the azure upload code through the storage metadata. This allows the azure put block to use FileChunkStreamRead to stream read from one partition in the file instead of loading all file data in memory and split it into 8 128 MB chunks. ## How is this tested? II 1. rust test_real_azure tests the put_block change. 3. I deployed the change in azure dev and saw flush latency reduces from ~30 seconds to 10 seconds. 4. I also did a bunch of stress test using sqlsmith and 100 GB TPCDS runs. ## Problem III Currently Neon limits the compaction tasks as 3/4 * CPU cores. This limits the overall compaction throughput and it can easily cause head-of-the-line blocking problems when a few large tenants are compacting. ## Summary of changes III This PR increases the limit of compaction tasks as `BG_TASKS_PER_THREAD` (default 4) * CPU cores. Note that `CONCURRENT_BACKGROUND_TASKS` also limits some other tasks `logical_size_calculation` and `layer eviction` . But compaction should be the most frequent and time-consuming task. ## Summary of changes IV This PR adds the following PageServer metrics: 1. `pageserver_disk_usage_based_eviction_evicted_bytes_total`: captures the total amount of bytes evicted. It's more straightforward to see the bytes directly instead of layers. 2. `pageserver_active_storage_operations_count`: captures the active storage operation, e.g., flush, L0 compaction, image creation etc. It's useful to visualize these active operations to get a better idea of what PageServers are spending cycles on in the background. ## Summary of changes V When investigating data corruptions, it's useful to search the base image and all WAL records of a page up to an LSN, i.e., a breakdown of GetPage@LSN request. This PR implements this functionality with two tools: 1. Extended `pagectl` with a new command to search the layer files for a given key up to a given LSN from the `index_part.json` file. The output can be used to download the files from S3 and then search the file contents using the second tool. Example usage: ``` cargo run --bin pagectl index-part search --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --path ~/Downloads/corruption/index_part.json-0000000c-formatted --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8 ``` Example output: ``` tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008028000002FEFF__000007089F0B5381-0000070C7679EEB9-0000000c tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000000000000000000000000000000000-000000067F0000801400008028000002F3F1__000006DD95B6F609-000006E2BA14C369-0000000c tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F000080140000802100001B0973__000006D33429F539-000006DD95B6F609-0000000c tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000164D81__000006C6343B2D31-000006D33429F539-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008021000017687B__000006BA344FA7F1-000006C6343B2D31-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000165BAB__000006AD34613D19-000006BA344FA7F1-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000137A39__0000069F34773461-000006AD34613D19-0000000b tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F000080140000802100000D4000-000000067F000080140000802100000F0000__0000069F34773460-0000000b ``` 2. Added a unit test to search the layer file contents. It's not implemented part of `pagectl` because it depends on some test harness code, which can only be used by unit tests. Example usage: ``` cargo test --package pageserver --lib -- tenant::debug::test_search_key --exact --nocapture -- --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --data-dir /Users/chen.luo/Downloads/corruption --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8 ``` Example output: ``` # omitted image for brievity delta: 69F/769D8180: will_init: false, "OgAAALGkuwXwYp12nwYAAECGAAASIqLHAAAAAH8GAAAUgAAAIYAAAL1hDQD/DLGkuwUDAAAAEAAWAA==" delta: 69F/769CB6D8: will_init: false, "PQAAALGkuwXotZx2nwYAABAJAAAFk7tpACAGAH8GAAAUgAAAIYAAAL1hDQD/CQUAEAASALExuwUBAAAAAA==" ``` ## Problem VI Currently when page service resolves shards from page numbers, it doesn't fully support the case that the shard could be split in the middle. This will lead to query failures during the tenant split for either commit or abort cases (it's mostly for abort). ## Summary of changes VI This PR adds retry logic in `Cache::get()` to deal with shard resolution errors more gracefully. Specifically, it'll clear the cache and retry, instead of failing the query immediately. It also reduces the internal timeout to make retries faster. The PR also fixes a very obvious bug in `TenantManager::resolve_attached_shard` where the code tries to cache the computed the shard number, but forgot to recompute when the shard count is different. --------- Co-authored-by: William Huang <william.huang@databricks.com> Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com> Co-authored-by: Chen Luo <chen.luo@databricks.com> Co-authored-by: Vlad Lazar <vlad.lazar@databricks.com> Co-authored-by: Vlad Lazar <vlad@neon.tech>	2025-07-08 19:43:01 +00:00
Erik Grinaker	81e7218c27	pageserver: tighten up gRPC `page_api::Client` (#12396 ) This patch tightens up `page_api::Client`. It's mostly superficial changes, but also adds a new constructor that takes an existing gRPC channel, for use with the communicator connection pool.	2025-07-08 18:15:13 +00:00
Alex Chi Z.	a06c560ad0	feat(pageserver): critical path feature flags (#12449 ) ## Problem Some feature flags are used heavily on the critical path and we want the "get feature flag" operation as cheap as possible. ## Summary of changes Add a `test_remote_size_flag` as an example of such flags. In the future, we can use macro to generate all those fields. The flag is updated in the housekeeping loop. The retrieval of the flag is simply reading an atomic flag. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-08 16:55:00 +00:00
Vlad Lazar	477ab12b69	pageserver: touch up broker subscription reset (#12503 ) ## Problem The goal of this code was to test out if resetting the broker subscription helps alleviate the issues we've been seeing in staging. Looks like it did the trick. However, the original version was too eager. ## Summary of Changes Only reset the stream when: * we are waiting for WAL * there's no connection candidates lined up * we're not already connected to a safekeeper	2025-07-08 16:46:55 +00:00
Christian Schwarz	f9b05a42d7	refactor(compaction): remove `CompactionError::AlreadyRunning` variant, use `::Other` instead (#12512 ) The only call stack that can emit the `::AlreadyRunning` variant is ``` -> iteration_inner -> iteration -> compaction_iteration -> compaction_loop -> start_background_loops ``` And on that call stack, the only differentiated handling of it is its invocations of `log_compaction_error -> CompactionError::is_cancel()`, which returns `true` for `::AlreadyRunning`. I think the condition of `AlreadyRunning` is severe; it really shouldn't happen. So, this PR starts treating it as something that is to be logged at `ERROR` / `WARN` level, depending on the `degrate_to_warning` argument to `log_compaction_error`. refs - https://databricks.atlassian.net/browse/LKB-182	2025-07-08 16:45:34 +00:00
Folke Behrens	29d73e1404	http-utils: Temporarily accept duplicate params (#12504 ) ## Problem Grafana Alloy in cluster mode seems to send duplicate "seconds" scrape URL parameters when one of its instances is disrupted. ## Summary of changes Temporarily accept duplicate parameters as long as their value is identical.	2025-07-08 15:49:42 +00:00
Christian Schwarz	8a042fb8ed	refactor(compaction): eliminate `CompactionError::Offload` variant, map to `::Other` (#12505 ) Looks can be deceiving: the match blocks in `maybe_trip_compaction_breaker` and at the end of `compact_with_options` seem like differentiated error handling, but in reality, these branches are unreachable at runtime because the only source of `CompactionError::Offload` within the compaction code is at the end of `Tenant::compaction_iteration`. We can simply map offload cancellation to CompactionError::Cancelled and all other offload errors to ::Other, since there's no differentiated handling for them in the compaction code. Also, the OffloadError::RemoteStorage variant has no differentiated handling, but was wrapping the remote storage anyhow::Error in a `anyhow(thiserror(anyhow))` sandwich. This PR removes that variant, mapping all RemoteStorage errors to `OffloadError::Other`. Thereby, the sandwich is gone and we will get a proper anyhow backtrace to the remote storage error location if when we debug-print the OffloadError (or the CompactionError if we map it to that). refs - https://databricks.atlassian.net/browse/LKB-182 - the observation that there's no need for differentiated handling of CompactionError::Offload was made in https://databricks.slack.com/archives/C09254R641L/p1751286453930269?thread_ts=1751284317.955159&cid=C09254R641L	2025-07-08 15:03:32 +00:00
Mikhail	f72115d0a9	Endpoint storage openapi spec (#12361 ) https://github.com/neondatabase/cloud/issues/19011	2025-07-08 14:37:24 +00:00
Christian Schwarz	7458d031b1	clippy: fix unfounded warning on macOS (#12501 ) Before this PR, macOS builds would get clippy warning ``` warning: `tokio_epoll_uring::thread_local_system` does not refer to an existing function ``` The reason is that the `thread_local_system` function is only defined on Linux. Add `allow-invalid = true` to make macOS clippy pass, and manually test that on Linux builds, clippy still fails when we use it. refs - https://databricks.slack.com/archives/C09254R641L/p1751917655527099 Co-authored-by: Christian Schwarz <Christian Schwarz>	2025-07-08 13:59:45 +00:00
Aleksandr Sarantsev	38384c37ac	Make node deletion context-aware (#12494 ) ## Problem Deletion process does not calculate preferred nodes correctly - it doesn't consider current tenant-shard layout among all pageservers. ## Summary of changes - Added a schedule context calculation for node deletion Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-08 13:15:14 +00:00
Christian Schwarz	2b2a547671	fix(tests): periodic and immediate gc is effectively a no-op in tests (#12431 ) The introduction of the default lease deadline feature 9 months ago made it so that after PS restart, `.timeline_gc()` calls in Python tests are no-ops for 10 minute after pageserver startup: the `gc_iteration()` bails with `Skipping GC because lsn lease deadline is not reached`. I did some impact analysis in the following PR. About 30 Python tests are affected: - https://github.com/neondatabase/neon/pull/12411 Rust tests that don't explicitly enable periodic GC or invoke GC manually are unaffected because we disable periodic GC by default in the `TenantHarness`'s tenant config. Two tests explicitly did `start_paused=true` + `tokio::time::advance()`, but it would add cognitive and code bloat to each existing and future test case that uses TenantHarness if we took that route. So, this PR sets the default lease deadline feature in both Python and Rust tests to zero by default. Tests that test the feature were thus identified by failing the test: - Python test `test_readonly_node_gc` + `test_lsn_lease_size` - Rust test `test_lsn_lease`. To accomplish the above, I changed the code that computes the initial lease deadline to respect the pageserver.toml's default tenant config, which it didn't before (and I would consider a bug). The Python test harness and the Rust TenantHarness test harness then simply set the default tenant config field to zero. Drive-by: - `test_lsn_lease_size` was writing a lot of data unnecessarily; reduce the amount and speed up the test refs - PR that introduced default lease deadline: https://github.com/neondatabase/neon/pull/9055/files - fixes https://databricks.atlassian.net/browse/LKB-92 --------- Co-authored-by: Christian Schwarz <Christian Schwarz>	2025-07-08 12:56:22 +00:00
a-masterov	59e393aef3	Enable parallel execution of extension tests (#12118 ) ## Problem Extension tests were previously run sequentially, resulting in unnecessary wait time and underutilization of available CPU cores. ## Summary of changes Tests are now executed in a customizable number of parallel threads using separate database branches. This reduces overall test time by approximately 50% (e.g., on my laptop, parallel test lasts 173s, while sequential one lasts 340s) and increases the load on the pageserver, providing better test coverage. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>	2025-07-08 11:28:39 +00:00
Peter Bendel	f51ed4a2c4	"disable" disk eviction in pagebench periodic benchmark (#12487 ) ## Problem https://github.com/neondatabase/neon/pull/12464 introduced new defaults for pageserver disk based eviction which activated disk based eviction for pagebench periodic pagebench. This caused the testcase to fail. ## Summary of changes Override the new defaults during testcase execution. ## Test run https://github.com/neondatabase/neon/actions/runs/16120217757/job/45483869734 Test run was successful, so merging this now	2025-07-08 09:38:06 +00:00
Mikhail	4f16ab3f56	add lfc offload and prewarm error metrics (#12486 ) Add `compute_ctl_lfc_prewarm_errors_total` and `compute_ctl_lfc_offload_errors_total` metrics. Add comments in `test_lfc_prewarm`. Correction PR for https://github.com/neondatabase/neon/pull/12447 https://github.com/neondatabase/cloud/issues/19011	2025-07-08 09:34:01 +00:00
Dmitrii Kovalkov	18796fd1dd	tests: more allowed errors for test_safekeeper_migration (#12495 ) ## Problem Pageserver now writes errors in the log during the safekeeper migration. Some errors are added to allowed errors, but "timeline not found in global map" is not. - Will be properly fixed in https://github.com/neondatabase/neon/issues/12191 ## Summary of changes Add "timeline not found in global map" error in a list of allowed errors in `test_safekeeper_migration_simple`	2025-07-08 09:15:29 +00:00
Aleksandr Sarantsev	2f3fc7cb57	Fix keep-failing reconciles test & add logs (#12497 ) ## Problem Test is flaky due to the following warning in the logs: ``` Keeping extra secondaries: can't determine which of [NodeId(1), NodeId(2)] to remove (some nodes offline?) ``` Some nodes being offline is expected behavior in this test. ## Summary of changes - Added `Keeping extra secondaries` to the list of allowed errors - Improved logging for better debugging experience Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-08 08:51:50 +00:00
Folke Behrens	e65d5f7369	proxy: Remove the endpoint filter cache (#12488 ) ## Problem The endpoint filter cache is still unused because it's not yet reliable enough to be used. It only consumes a lot of memory. ## Summary of changes Remove the code. Needs a new design. neondatabase/cloud#30634	2025-07-07 17:46:33 +00:00
Conrad Ludgate	55aef2993d	introduce a JSON serialization lib (#12417 ) See #11992 and #11961 for some examples of usecases. This introduces a JSON serialization lib, designed for more flexibility than serde_json offers. ## Dynamic construction Sometimes you have dynamic values you want to serialize, that are not already in a serde-aware model like a struct or a Vec etc. To achieve this with serde, you need to implement a lot of different traits on a lot of different new-types. Because of this, it's often easier to give-in and pull all the data into a serde-aware model (serde_json::Value or some intermediate struct), but that is often not very efficient. This crate allows full control over the JSON encoding without needing to implement any extra traits. Just call the relevant functions, and it will guarantee a correctly encoded JSON value. ## Async construction Similar to the above, sometimes the values arrive asynchronously. Often collecting those values in memory is more expensive than writing them as JSON, since the overheads of `Vec` and `String` is much higher, however there are exceptions. Serializing to JSON all in one go is also more CPU intensive and can cause lag spikes, whereas serializing values incrementally spreads out the CPU load and reduces lag.	2025-07-07 15:12:02 +00:00

1 2 3 4 5 ...

8260 Commits