rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-22 21:59:59 +00:00

Author	SHA1	Message	Date
github-actions[bot]	c1de242c1c	Storage release 2025-07-18 06:11 UTC release-9067	2025-07-18 06:11:53 +00:00
Alex Chi Z.	f3ef60d236	fix(storcon): use unified interface to handle 404 lsn lease (#12650 ) ## Problem Close LKB-270. This is part of our series of efforts to make sure lsn_lease API prompts clients to retry. Follow up of https://github.com/neondatabase/neon/pull/12631. Slack thread w/ Vlad: https://databricks.slack.com/archives/C09254R641L/p1752677940697529 ## Summary of changes - Use `tenant_remote_mutation` API for LSN leases. Makes it consistent with new APIs added to storcon. - For 404, we now always retry because we know the tenant is to-be-attached and will eventually reach a point that we can find that tenant on the intent pageserver. - Using the `tenant_remote_mutation` API also prevents us from the case where the intent pageserver changes within the lease request. The wrapper function will error with 503 if such things happen. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-18 04:40:35 +00:00
HaoyuHuang	8f627ea0ab	A few more SC changes (#12649 ) ## Problem ## Summary of changes	2025-07-17 23:17:01 +00:00
Arpad Müller	6a353c33e3	print more timestamps in find_lsn_for_timestamp (#12641 ) Observability of `find_lsn_for_timestamp` is lacking, as well as how and when we update gc space and time cutoffs. Log them.	2025-07-17 22:13:21 +00:00
Folke Behrens	64d0008389	proxy: Shorten the initial TTL of cancel keys (#12647 ) ## Problem A high rate of short-lived connections means that there are a lot of cancel keys in Redis with TTL=10min that could be avoided by having a much shorter initial TTL. ## Summary of changes * Introduce an initial TTL of 1min used with the SET command. * Fix: don't delay repushing cancel data when expired. * Prepare for exponentially increasing TTLs. ## Alternatives A best-effort UNLINK command on connection termination would clean up cancel keys right away. This needs a bigger refactor due to how batching is handled.	2025-07-17 21:52:20 +00:00
Alexey Kondratov	53a05e8ccb	fix(compute_ctl): Only offload LFC state if no prewarming is in progress (#12645 ) ## Problem We currently offload LFC state unconditionally, which can cause problems. Imagine a situation: 1. Endpoint started with `autoprewarm: true`. 2. While prewarming is not completed, we upload the new incomplete state. 3. Compute gets interrupted and restarts. 4. We start again and try to prewarm with the state from 2. instead of the previous complete state. During the orchestrated prewarming, it's probably not a big issue, but it's still better not to interfere with the prewarm process. ## Summary of changes Do not offload LFC state if we are currently prewarming or any issue occurred. While on it, also introduce `Skipped` LFC prewarm status, which is used when the corresponding LFC state is not present in the endpoint storage. It's primarily needed to distinguish the first compute start for a particular endpoint, as it's completely valid to not have LFC state yet.	2025-07-17 21:43:43 +00:00
Vlad Lazar	62c0152e6b	pageserver: shut down compute connections at libpq level (#12642 ) ## Problem Previously, if a get page failure was caused by timeline shutdown, the pageserver would attempt to tear down the connection gracefully: `shutdown(SHUT_WR)` followed by `close()`. This triggers a code path on the compute where it has to distinguish between an idle connection and a closed one. That code is bug prone, so we can just side-step the issue by shutting down the connection via a libpq error message. This surfaced as instability in test_shard_resolve_during_split_abort. It's a new test, but the issue existed for ages. ## Summary of Changes Send a libpq error message instead of doing graceful TCP connection shutdown. Closes LKB-648	2025-07-17 21:03:55 +00:00
Konstantin Knizhnik	7fef4435c1	Store stripe_size in shared memory (#12560 ) ## Problem See https://databricks.slack.com/archives/C09254R641L/p1752004515032899 stripe_size GUC update may be delayed at different backends and so cause inconsistency with connection strings (shard map). ## Summary of changes Postmaster should store stripe_size in shared memory as well as connection strings. It should be also enforced that stripe size is defined prior to connection strings in postgresql.conf --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com>	2025-07-17 20:32:34 +00:00
Konstantin Knizhnik	43fd5b218b	Refactor shmem initialization in Neon extension (#12630 ) ## Problem Initialization of shared memory in an extension is complex and non-portable. In the neon extension this boilerplate code is duplicated in several files. ## Summary of changes Perform all initialization in one place - neon.c All other modules provide ShmemRequest() and ShmemInit() functions which are called from neon.c --------- Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-07-17 20:20:38 +00:00
Alex Chi Z.	29ee273d78	fix(storcon): correctly converts 404 for tenant passthrough requests (#12631 ) ## Problem Follow up of https://github.com/neondatabase/neon/pull/12620 Discussions: https://databricks.slack.com/archives/C09254R641L/p1752677940697529 With the original code, and still after the patch above, we convert 404s to 503s regardless of the type of 404. We should only do that for tenant not found errors. For other 404s like timeline not found, we should not prompt clients to retry. ## Summary of changes - Inspect the response body to figure out the type of 404. If it's a tenant not found error, return 503. - Otherwise, fall through and return 404 as-is. - Add `tenant_shard_remote_mutation` that manipulates a single shard. - Use `Service::tenant_shard_remote_mutation` for tenant shard passthrough requests. This protects us from another race in which the attach state changes within the request. (This patch mainly addresses the case that the tenant is "not yet attached"). - TODO: lease API is still using the old code path. We should refactor it to use `tenant_remote_mutation`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-17 19:42:48 +00:00
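The 404 handling described in the commit above can be sketched as a small classifier. This is only an illustration under stated assumptions: `Passthrough`, `classify_upstream`, and the `"tenant not found"` body marker are hypothetical names and formats, not the storage controller's actual types or the pageserver's actual error text.

```rust
/// What the controller hands back to the client after looking at a pageserver reply.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum Passthrough {
    /// Forward the upstream status code unchanged.
    AsIs(u16),
    /// Reply with 503 so that the client retries later.
    RetryLater,
}

/// Classify an upstream response by status and body.
/// Assumption: a tenant-level 404 carries the text "tenant not found" in its body.
pub fn classify_upstream(status: u16, body: &str) -> Passthrough {
    let tenant_missing = body.to_ascii_lowercase().contains("tenant not found");
    if status == 404 && tenant_missing {
        // The tenant is expected to be attached eventually, so prompt a retry.
        Passthrough::RetryLater
    } else {
        // Any other response, including a timeline-level 404, is passed through.
        Passthrough::AsIs(status)
    }
}
```

Under these assumptions a timeline-level 404 stays a 404, so clients are not prompted to retry a request that cannot succeed.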
Conrad Ludgate	8b0f2efa57	experiment with an InfoMetrics metric family (#12612 ) Putting this in the neon codebase for now, to experiment. Can be lifted into measured at a later date. This metric family is like a MetricVec, but it only supports 1 label being set at a time. It is useful for reporting info, rather than reporting metrics. https://www.robustperception.io/exposing-the-software-version-to-prometheus/	2025-07-17 17:58:47 +00:00
quantumish	b309cbc6e9	Add resizable hashmap and RwLock implementations to `neon-shmem` (#12596 ) Second PR for the hashmap behind the updated LFC implementation ([see first here](https://github.com/neondatabase/neon/pull/12595)). This only adds the raw code for the hashmap/lock implementations and doesn't plug it into the crate (that's dependent on the previous PR and should probably be done when the full integration into the new communicator is merged alongside `communicator-rewrite` changes?). Some high level details: the communicator codebase expects to be able to store references to entries within this hashmap for arbitrary periods of time and so the hashmap cannot be allowed to move them during a rehash. As a result, this implementation has a slightly unusual structure where key-value pairs (and hash chains) are allocated in a separate region with a freelist. The core hashmap structure is then an array of "dictionary entries" that are just indexes into this region of key-value pairs. Concurrency support is very naive at the moment with the entire map guarded by one big `RwLock` (which is implemented on top of a `pthread_rwlock_t` since Rust doesn't guarantee that a `std::sync::RwLock` is safe to use in shared memory). This (along with a lot of other things) is being changed on the `quantumish/lfc-resizable-map` branch.	2025-07-17 17:40:53 +00:00
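The layout described in the commit above (entries in a separate region with a freelist, plus an array of indexes into that region) can be sketched roughly as follows. This is not the `neon-shmem` code: `SlabMap`, `Slot`, and every method name are hypothetical, the sketch uses ordinary heap `Vec`s rather than shared memory, and it has no locking.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NIL: usize = usize::MAX; // "no entry" marker

enum Slot<K, V> {
    Occupied { key: K, value: V, next: usize }, // `next` links the hash chain
    Free { next_free: usize },                  // `next_free` links the freelist
}

pub struct SlabMap<K, V> {
    buckets: Vec<usize>,    // "dictionary": bucket -> slot index of chain head
    slots: Vec<Slot<K, V>>, // separate region; an entry never changes index
    free_head: usize,
}

impl<K: Hash + Eq, V> SlabMap<K, V> {
    pub fn new(num_buckets: usize) -> Self {
        assert!(num_buckets > 0);
        SlabMap { buckets: vec![NIL; num_buckets], slots: Vec::new(), free_head: NIL }
    }

    fn bucket_of(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.buckets.len()
    }

    /// Returns the stable slot index of `key`, if present.
    pub fn find(&self, key: &K) -> Option<usize> {
        let mut cur = self.buckets[self.bucket_of(key)];
        while cur != NIL {
            match &self.slots[cur] {
                Slot::Occupied { key: k, next, .. } => {
                    if k == key {
                        return Some(cur);
                    }
                    cur = *next;
                }
                Slot::Free { .. } => unreachable!("chains only link occupied slots"),
            }
        }
        None
    }

    pub fn get(&self, idx: usize) -> Option<&V> {
        match self.slots.get(idx) {
            Some(Slot::Occupied { value, .. }) => Some(value),
            _ => None,
        }
    }

    /// Inserts or overwrites, returning the entry's slot index.
    pub fn insert(&mut self, key: K, value: V) -> usize {
        if let Some(idx) = self.find(&key) {
            if let Slot::Occupied { value: v, .. } = &mut self.slots[idx] {
                *v = value;
            }
            return idx;
        }
        let b = self.bucket_of(&key);
        let entry = Slot::Occupied { key, value, next: self.buckets[b] };
        let idx = if self.free_head != NIL {
            // Reuse a freed slot before growing the region.
            let idx = self.free_head;
            if let Slot::Free { next_free } = &self.slots[idx] {
                self.free_head = *next_free;
            }
            self.slots[idx] = entry;
            idx
        } else {
            self.slots.push(entry);
            self.slots.len() - 1
        };
        self.buckets[b] = idx;
        idx
    }

    pub fn remove(&mut self, key: &K) -> Option<V> {
        let b = self.bucket_of(key);
        let (mut prev, mut cur) = (NIL, self.buckets[b]);
        while cur != NIL {
            let (matches, next) = match &self.slots[cur] {
                Slot::Occupied { key: k, next, .. } => (k == key, *next),
                Slot::Free { .. } => unreachable!("chains only link occupied slots"),
            };
            if matches {
                // Unlink from the chain, then push the slot onto the freelist.
                if prev == NIL {
                    self.buckets[b] = next;
                } else if let Slot::Occupied { next: pn, .. } = &mut self.slots[prev] {
                    *pn = next;
                }
                let freed = Slot::Free { next_free: self.free_head };
                let old = std::mem::replace(&mut self.slots[cur], freed);
                self.free_head = cur;
                return match old {
                    Slot::Occupied { value, .. } => Some(value),
                    Slot::Free { .. } => None,
                };
            }
            prev = cur;
            cur = next;
        }
        None
    }

    /// Rehash: only the bucket array is rebuilt; slot indexes stay valid.
    pub fn resize_buckets(&mut self, num_buckets: usize) {
        assert!(num_buckets > 0);
        self.buckets = vec![NIL; num_buckets];
        for idx in 0..self.slots.len() {
            let b = match &self.slots[idx] {
                Slot::Occupied { key, .. } => self.bucket_of(key),
                Slot::Free { .. } => continue,
            };
            let head = self.buckets[b];
            if let Slot::Occupied { next, .. } = &mut self.slots[idx] {
                *next = head;
            }
            self.buckets[b] = idx;
        }
    }
}
```

The point of the indirection is that a caller holding a slot index keeps a valid handle across `resize_buckets`, which is the property the commit message says the communicator needs for long-lived references.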
Aleksandr Sarantsev	f0c0733a64	storcon: Ignore stuck reconciles when considering optimizations (#12589 ) ## Problem The `keep_failing_reconciles` counter was introduced in #12391, but there is a special case: > if a reconciliation loop claims to have succeeded, but maybe_reconcile still thinks the tenant is in need of reconciliation, then that's a probable bug and we should activate a similar backoff to prevent flapping. This PR redefines "flapping" to include not just repeated failures, but also consecutive reconciliations of any kind (success or failure). ## Summary of Changes - Replace `keep_failing_reconciles` with a new `stuck_reconciles` metric - Replace `MAX_CONSECUTIVE_RECONCILIATION_ERRORS` with `MAX_CONSECUTIVE_RECONCILES`, and increasing that from 5 to 10 - Increment the consecutive reconciles counter for all reconciles, not just failures - Reset the counter in `reconcile_all` when no reconcile is needed for a shard - Improve and fix the related test --------- Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-17 14:52:57 +00:00
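The counting rule in the commit above can be sketched as a per-shard counter. Only `MAX_CONSECUTIVE_RECONCILES` and its value of 10 come from the commit message; `ShardReconcileState`, its methods, and the `>=` comparison are assumptions made for illustration.

```rust
/// Threshold named in the commit message (raised from 5 to 10).
pub const MAX_CONSECUTIVE_RECONCILES: u32 = 10;

/// Hypothetical per-shard bookkeeping.
#[derive(Debug, Default)]
pub struct ShardReconcileState {
    consecutive_reconciles: u32,
}

impl ShardReconcileState {
    /// Called after every reconcile attempt, whether it succeeded or failed.
    pub fn on_reconcile(&mut self) {
        self.consecutive_reconciles = self.consecutive_reconciles.saturating_add(1);
    }

    /// Called when a pass over all shards finds this shard needs no reconcile.
    pub fn on_no_reconcile_needed(&mut self) {
        self.consecutive_reconciles = 0;
    }

    /// A shard that keeps reconciling is treated as stuck.
    /// Assumption: the threshold is inclusive.
    pub fn is_stuck(&self) -> bool {
        self.consecutive_reconciles >= MAX_CONSECUTIVE_RECONCILES
    }
}
```

Counting successes as well as failures is what lets this catch the case quoted in the commit, where a reconcile claims success but the shard still looks like it needs reconciling.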
Vlad Lazar	8862e7c4bf	tests: use new snapshot in test_forward_compat (#12637 ) ## Problem The forward compatibility test is erroneously using the downloaded (old) compatibility data. This test is meant to test that old binaries can work with new data. Using the old compatibility data renders this test useless. ## Summary of changes Use new snapshot in test_forward_compat Closes LKB-666 Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-17 13:20:40 +00:00
HaoyuHuang	b7fc5a2fe0	A few SC changes (#12615 ) ## Summary of changes A bunch of no-op changes. --------- Co-authored-by: Vlad Lazar <vlad@neon.tech>	2025-07-17 13:14:36 +00:00
Aleksandr Sarantsev	4559ba79b6	Introduce force flag for new deletion API (#12588 ) ## Problem The force deletion API should behave like the graceful deletion API - it needs to support cancellation, persistence, and be non-blocking. ## Summary of Changes - Added a `force` flag to the `NodeStartDelete` command. - Passed the `force` flag through the `start_node_delete` handler in the storage controller. - Handled the `force` flag in the `delete_node` function. - Set the tombstone after removing the node from memory. - Minor cleanup, like adding a `get_error_on_cancel` closure. --------- Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-17 11:51:31 +00:00
Alexander Bayandin	5dd24c7ad8	test_total_size_limit: support hosts with up to 256 GB of RAM (#12617 ) ## Problem `test_total_size_limit` fails on runners with 256 GB of RAM ## Summary of changes - Generate more data in `test_total_size_limit`	2025-07-17 08:57:36 +00:00
Alex Chi Z.	f2828bbe19	fix(pageserver): skip gc-compaction for metadata key ranges (#12618 ) ## Problem part of https://github.com/neondatabase/neon/issues/11318 ; it is not entirely safe to run gc-compaction over the metadata key range due to tombstones and implications of image layers (missing key in image layer == key not exist). The auto gc-compaction trigger already skips metadata key ranges (see `schedule_auto_compaction` call in `trigger_auto_compaction`). In this patch we enforce it directly in gc_compact_inner so that compactions triggered via HTTP API will also be subject to this restriction. ## Summary of changes Ensure gc-compaction only runs on rel key ranges. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-16 21:52:18 +00:00
Alexander Bayandin	fb796229bf	Fix `make neon-pgindent` (#12535 ) ## Problem `make neon-pgindent` doesn't work: - there's no `$(BUILD_DIR)/neon-v17` dir - `make -C ...` along with relative `BUILD_DIR` resolves to a path that doesn't exist ## Summary of changes - Fix the path to the neon extension for `make neon-pgindent` - Make `BUILD_DIR` absolute - Remove trailing slash from `POSTGRES_INSTALL_DIR` to avoid duplicated slashes in commands (doesn't break anything, it just makes it look nicer)	2025-07-16 21:20:44 +00:00
Dimitri Fontaine	267fb49908	Update Postgres branches. (#12628 ) ## Problem ## Summary of changes	2025-07-16 18:39:54 +00:00
Krzysztof Szafrański	e2982ed3ec	[proxy] Cache node info only for TTL, even if Redis is available (#12626 ) This PR simplifies our node info cache. Now we'll store entries for at most the TTL duration, even if Redis notifications are available. This will allow us to cache intermittent errors later (e.g. due to rate limits) with more predictable behavior. Related to https://github.com/neondatabase/cloud/issues/19353	2025-07-16 16:23:05 +00:00
Tristan Partin	9e154a8130	PG: smooth max wal rate (#12514 ) ## Problem We were only resetting the limit in the wal proposer. If backends are back pressured, it might take a while for the wal proposer to receive a new WAL to reset the limit. ## Summary of changes Backend also checks the time and resets the limit. ## How is this tested? pgbench has more smooth tps Signed-off-by: Tristan Partin <tristan.partin@databricks.com> Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>	2025-07-16 16:11:25 +00:00
JC Grünhage	79d72c94e8	reformat cargo install invocations in build-tools image (#12629 ) ## Problem Same change with different formatting happened in multiple branches. ## Summary of changes Realign formatting with the other branch.	2025-07-16 16:02:07 +00:00
Alex Chi Z.	80e5771c67	fix(storcon): passthrough 404 as 503 during migrations (#12620 ) ## Problem close LKB-270, close LKB-253 We periodically saw the pageserver return 404, which storcon converted to a 500 for cplane, causing branch operations to fail. This happens because storcon is migrating tenants across pageservers and the request was forwarded from storcon to the pageservers while the tenant was not attached yet. Such operations should be retried from cplane and storcon should return 503 in such cases. ## Summary of changes - Refactor `tenant_timeline_lsn_lease` to have a single function process and passthrough such requests: `collect_tenant_shards` for collecting all shards and checking if they're consistent with the observed state, `process_result_and_passthrough_errors` to convert 404 into 503 if necessary. - `tenant_shard_node` also checks observed state now. Note that for passthrough shard0, we originally had a check to convert 404 to 503: ``` // Transform 404 into 503 if we raced with a migration if resp.status() == reqwest::StatusCode::NOT_FOUND { // Look up node again: if we migrated it will be different let new_node = service.tenant_shard_node(tenant_shard_id).await?; if new_node.get_id() != node.get_id() { // Rather than retry here, send the client a 503 to prompt a retry: this matches // the pageserver's use of 503, and all clients calling this API should retry on 503. return Err(ApiError::ResourceUnavailable( format!("Pageserver {node} returned 404, was migrated to {new_node}").into(), )); } } ``` However, this only checks the intent state. It is possible that the migration is in progress before/after the request is processed and the intent state stays the same throughout the API call, so the 404 is not handled by this branch. Also, I'm not sure whether this new code is correct; it needs a second pair of eyes: ``` // As a reconciliation is in flight, we do not have the observed state yet, and therefore we assume it is always inconsistent. 
Ok((node.clone(), false)) ``` --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-16 15:51:20 +00:00
Aleksandr Sarantsev	1178f6fe7c	pageserver: Downgrade log level of 'No broker updates' (#12627 ) ## Problem The warning message was seen during deployment, but it's actually OK. ## Summary of changes - Treat `"No broker updates received for a while ..."` as an info message. Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-07-16 15:02:01 +00:00
Vlad Lazar	8b18d8b31b	safekeeper: add global disk usage utilization limit (#12605 ) N.B: No-op for the neon-env. ## Problem We added a per-timeline disk utilization protection circuit breaker, which will stop the safekeeper from accepting more WAL writes if the disk utilization by the timeline has exceeded a configured limit. We mainly designed the mechanism as a guard against WAL upload/backup bugs, and we assumed that as long as WAL uploads are proceeding as normal we will not run into disk pressure. This turned out to be not true. In one of our load tests where we have 500 PGs ingesting data at the same time, safekeeper disk utilization started to creep up even though WAL uploads were completely normal (we likely just maxed out our S3 upload bandwidth from the single SK). This means the per-timeline disk utilization protection won't be enough if too many timelines are ingesting data at the same time. ## Summary of changes Added a global disk utilization protection circuit breaker which will stop a safekeeper from accepting more WAL writes if the total disk usage on the safekeeper (across all tenants) exceeds a limit. We implemented this circuit breaker through two parts: 1. A "global disk usage watcher" background task that runs at a configured interval (default every minute) to see how much disk space is being used in the safekeeper's filesystem. This background task also performs the check against the limit and publishes the result to a global atomic boolean flag. 2. The `hadron_check_disk_usage()` routine (in `timeline.rs`) now also checks this global boolean flag published in the step above, and fails the `WalAcceptor` (triggers the circuit breaker) if the flag was raised. The disk usage limit is disabled by default. It can be tuned with the `--max-global-disk-usage-ratio` CLI arg. ## How is this tested? Added integration test `test_wal_acceptor.py::test_global_disk_usage_limit`. 
Also noticed that I haven't been using the `wait_until(f)` test function correctly (the `f` passed in is supposed to raise an exception if the condition is not met, instead of returning `False`...). Fixed it in both circuit breaker tests. --------- Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-16 14:43:17 +00:00
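The two-part circuit breaker described in the commit above (a background watcher that publishes a flag, and a write-path check that reads it) can be sketched as below. `GlobalDiskUsageGuard` and its methods are hypothetical names; the real watcher task, the `hadron_check_disk_usage()` routine, and the `--max-global-disk-usage-ratio` plumbing are not reproduced here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical holder for the configured limit and the published flag.
pub struct GlobalDiskUsageGuard {
    /// `None` models the "disabled by default" behaviour.
    max_ratio: Option<f64>,
    over_limit: AtomicBool,
}

impl GlobalDiskUsageGuard {
    pub fn new(max_ratio: Option<f64>) -> Self {
        GlobalDiskUsageGuard { max_ratio, over_limit: AtomicBool::new(false) }
    }

    /// Watcher side: called periodically with the filesystem's usage numbers.
    pub fn publish(&self, used_bytes: u64, total_bytes: u64) {
        let over = match self.max_ratio {
            Some(limit) if total_bytes > 0 => {
                (used_bytes as f64 / total_bytes as f64) > limit
            }
            _ => false,
        };
        self.over_limit.store(over, Ordering::Relaxed);
    }

    /// Write path: cheap check before accepting more WAL.
    pub fn check(&self) -> Result<(), &'static str> {
        if self.over_limit.load(Ordering::Relaxed) {
            Err("global disk usage limit exceeded")
        } else {
            Ok(())
        }
    }
}
```

Keeping the filesystem query in the watcher means the write path only pays for one atomic load per check, at the cost of reacting up to one watcher interval late.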
Vlad Lazar	3e4cbaed67	storcon: validate intent state before applying optimization (#12593 ) ## Problem In the gap between picking an optimization and applying it, something might insert a change to the intent state that makes it incompatible. If the change is done via the `schedule()` method, we are covered by the increased sequence number, but otherwise we can panic if we violate the intent state invariants. ## Summary of Changes Validate the optimization right before applying it. Since we hold the service lock at that point, nothing else can sneak in. Closes LKB-65	2025-07-16 14:37:40 +00:00
Conrad Ludgate	c71aea0223	proxy: for json logging, only use callsite IDs if span name is duplicated (#12625 ) ## Problem We run multiple proxies, we get logs like ``` ... spans={"http_conn#22":{"conn_id": ... ... spans={"http_conn#24":{"conn_id": ... ``` these are the same span, and the difference is confusing. ## Summary of changes Introduce a counter per span name, rather than a global counter. If the counter is 0, no change to the span name is made. To follow up: see which span names are duplicated within the codebase in different callsites	2025-07-16 13:29:18 +00:00
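One way to read the per-name counter from the commit above is sketched here. It is an interpretation, not the proxy's logging code: `SpanNamer`, `key`, and the use of plain `u64` callsite IDs are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper that assigns JSON keys to spans.
#[derive(Default)]
pub struct SpanNamer {
    /// For each span name, the callsite IDs seen so far, in first-seen order.
    callsites: HashMap<String, Vec<u64>>,
}

impl SpanNamer {
    /// Returns the key to log for a span with this name and callsite.
    pub fn key(&mut self, name: &str, callsite_id: u64) -> String {
        let seen = self.callsites.entry(name.to_string()).or_default();
        // The per-name counter is the position of this callsite among those
        // that share the name.
        let idx = match seen.iter().position(|&c| c == callsite_id) {
            Some(i) => i,
            None => {
                seen.push(callsite_id);
                seen.len() - 1
            }
        };
        if idx == 0 {
            // Counter is 0: leave the span name unchanged.
            name.to_string()
        } else {
            format!("{}#{}", name, idx)
        }
    }
}
```

With a counter per name, a span name that is used at only one callsite is always logged without a suffix, regardless of how many other callsites exist in the process.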
Conrad Ludgate	87915df2fa	proxy: replace serde_json with our new json ser crate in the logging impl (#12602 ) This doesn't solve any particular problem, but it does simplify some of the code that was forced to round-trip through verbose Serialize impls.	2025-07-16 13:27:00 +00:00
Alexander Bayandin	caca08fe78	CI: rework and merge `lint-openapi-spec` and `validate-compute-manifest` jobs (#12575 ) ## Problem We have several linters that use Node.js, but they are currently set up differently, both locally and on CI. ## Summary of changes - Add Node.js to `build-tools` image - Move `compute/package.json` -> `build-tools/package.json` and add `redocly` to it `@redocly/cli` - Unify and merge into one job `lint-openapi-spec` and `validate-compute-manifest`	2025-07-16 11:08:27 +00:00
Alexander Bayandin	0c99f16c60	CI(run-python-test-set): don't collect code coverage for real (#12611 ) ## Problem neondatabase/neon#12601 didn't completely disable writing `.profraw` files, but instead of `/tmp/coverage` it started to write into the current directory ## Summary of changes - Set `LLVM_PROFILE_FILE=/dev/null` to avoid writing `.profraw` at all	2025-07-16 08:26:52 +00:00
Alexey Kondratov	dd7fff655a	feat(compute): Introduce privileged_role_name parameter (#12539 ) ## Problem Currently `neon_superuser` is hardcoded in many places. It makes it harder to reuse the same code in different envs. ## Summary of changes Parametrize `neon_superuser` in `compute_ctl` via `--privileged-role-name` and in `neon` extensions via `neon.privileged_role_name`, so it's now possible to use different 'superuser' role names if needed. Everything still defaults to `neon_superuser`, so no control plane code changes are needed and I intentionally do not touch regression and migrations tests. Postgres PRs: - https://github.com/neondatabase/postgres/pull/674 - https://github.com/neondatabase/postgres/pull/675 - https://github.com/neondatabase/postgres/pull/676 - https://github.com/neondatabase/postgres/pull/677 Cloud PR: - https://github.com/neondatabase/cloud/pull/31138	2025-07-15 20:22:57 +00:00
quantumish	809633903d	Move `ShmemHandle` into separate module, tweak documentation (#12595 ) Initial PR for the hashmap behind the updated LFC implementation. This refactors `neon-shmem` so that the actual shared memory utilities are in a separate module within the crate. Beyond that, it slightly changes some of the docstrings so that they play nicer with `cargo doc`.	2025-07-15 17:40:40 +00:00
Arpad Müller	5c934efb29	Don't depend on the postgres_ffi just for one type (#12610 ) We don't want to depend on postgres_ffi in an API crate. If there is no such dependency, we can compile stuff like `storcon_cli` without needing a full working postgres build. Fixes regression of #12548 (before we could compile it).	2025-07-15 17:28:08 +00:00
Heikki Linnakangas	5c9c3b3317	Misc cosmetic cleanups (#12598 ) - Remove a few obsolete "allowed error messages" from tests. The pageserver doesn't emit those messages anymore. - Remove misplaced and outdated docstring comment from `test_tenants.py`. A docstring is supposed to be the first thing in a function, but we had added some code before it. And it was outdated, as we haven't supported running without safekeepers for a long time. - Fix misc typos in comments - Remove obsolete comment about backwards compatibility with safekeepers without `TIMELINE_STATUS` API. All safekeepers have it by now.	2025-07-15 14:36:28 +00:00
Alexander Bayandin	921a4f2009	CI(run-python-test-set): don't collect code coverage (#12601 ) ## Problem We don't use code coverage produced by `regress-tests` (neondatabase/neon#6798), so there's no need to collect it. Potentially, disabling it should reduce the load on disks and improve the stability of debug builds. ## Summary of changes - Disable code coverage collection for regression tests	2025-07-15 11:16:29 +00:00
dependabot[bot]	eb93c3e3c6	build(deps): bump aiohttp from 3.10.11 to 3.12.14 in the pip group across 1 directory (#12600 ) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-07-15 11:06:58 +00:00
Alexander Bayandin	7a7ab2a1d1	Move `build-tools.Dockerfile` -> `build-tools/Dockerfile` (#12590 ) ## Problem This is a prerequisite for neondatabase/neon#12575 to keep all things relevant to `build-tools` image in a single directory ## Summary of changes - Rename `build_tools/` to `build-tools/` - Move `build-tools.Dockerfile` to `build-tools/Dockerfile`	2025-07-15 10:45:49 +00:00
Krzysztof Szafrański	ff526a1051	[proxy] Recognize more cplane errors, use retry_delay_ms as TTL (#12543 ) ## Problem Not all cplane errors are properly recognized and cached/retried. ## Summary of changes Add more cplane error reasons. Also, use retry_delay_ms as cache TTL if present. Related to https://github.com/neondatabase/cloud/issues/19353	2025-07-15 07:42:48 +00:00
Heikki Linnakangas	9a2456bea5	Reduce noise from get_installed_extensions during e.g shut down (#12479 ) All Errors that can occur during get_installed_extensions() come from tokio-postgres functions, e.g. if the database is being shut down ("FATAL: terminating connection due to administrator command"). I'm seeing a lot of such errors in the logs with the regression tests, with very verbose stack traces. The compute_ctl stack trace is pretty useless for errors originating from the Postgres connection, the error message has all the information, so stop printing the stack trace. I changed the result type of the functions to return the originating tokio_postgres Error rather than anyhow::Error, so that if we introduce other error sources to the functions where the stack trace might be useful, we'll be forced to revisit this, probably by introducing a new Error type that separates postgres errors from other errors. But this will do for now.	2025-07-14 18:42:36 +00:00
Mikhail	a456e818af	LFC prewarm perftest: increase timeout for initialization job (#12594 ) Tests on https://github.com/neondatabase/neon/actions/runs/16268609007/job/45930162686 time out due to pgbench init job taking more than 30 minutes to run. Increase test timeout duration to 2 hours.	2025-07-14 17:37:47 +00:00
Matthias van de Meent	3e6fdb0aa6	Add and use [U]INT64_[HEX_]FORMAT for various [u]int64 needs (#12592 ) We didn't consistently apply these, and it wasn't consistently solved. With this patch we should have a more consistent approach to this, and have less issues porting changes to newer versions. This also removes some potentially buggy casts to `long` from `uint64` - they could've truncated the value in systems where `long` only has 32 bits.	2025-07-14 16:47:07 +00:00
Vlad Lazar	f8d3f86f58	pageserver: include records in get page debug handler (#12578 ) Include records and image in the debug get page handler. This endpoint does not update the metrics and does not support tracing. Note that this now returns individual bytes which need to be encoded properly for debugging. Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>	2025-07-14 16:37:28 +00:00
HaoyuHuang	f67a8a173e	A few SK changes (#12577 ) # TLDR This PR is a no-op. ## Problem When a SK loses a disk, it must recover all WALs from the very beginning. This may take days/weeks to catch up to the latest WALs for all timelines it owns. ## Summary of changes When SK starts up, if it finds that it has 0 timelines, - it will ask SC for the timeline it owns. - Then, pulls the timeline from its peer safekeepers to restore the WAL redundancy right away. After pulling timeline is complete, it will become active and accepts new WALs. The current impl is a prototype. We can optimize the impl further, e.g., parallel pull timelines. --------- Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>	2025-07-14 16:37:04 +00:00
Mikhail	2288efae66	Performance test for LFC prewarm (#12524 ) https://github.com/neondatabase/cloud/issues/19011 Measure relative performance for prewarmed and non-prewarmed endpoints. Add test that runs on every commit, and one performance test with a remote cluster.	2025-07-14 13:41:31 +00:00
a-masterov	4fedcbc0ac	Leverage the existing mechanism to retry 404 errors instead of implementing new code. (#12567 ) ## Problem In https://github.com/neondatabase/neon/pull/12513, the new code was implemented to retry 404 errors caused by the replication lag. However, this implemented the new logic, making the script more complicated, while we have an existing one in `neon_api.py`. ## Summary of changes The existing mechanism is used to retry 404 errors. --------- Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>	2025-07-14 13:25:25 +00:00
Erik Grinaker	eb830fa547	pageserver/client_grpc: use unbounded pools (#12585 ) ## Problem The communicator gRPC client currently uses bounded client/stream pools. This can artificially constrain clients, especially after we remove pipelining in #12584. [Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that the cost of an idle server-side GetPage worker task is about 26 KB (2.5 GB for 100,000), so we can afford to scale out. In the worst case, we'll degenerate to the current libpq state with one stream per backend, but without the TCP connection overhead. In the common case we expect significantly lower stream counts due to stream sharing, driven e.g. by idle backends, LFC hits, read coalescing, sharding (backends typically only talk to one shard at a time), etc. Currently, Pageservers rarely serve more than 4000 backend connections, so we have at least 2 orders of magnitude of headroom. Touches #11735. Requires #12584. ## Summary of changes Remove the pool limits, and restructure the pools. We still keep a separate bulk pool for Getpage batches of >4 pages (>32 KB), with fewer streams per connection. This reduces TCP-level congestion and head-of-line blocking for non-bulk requests, and concentrates larger window sizes on a smaller set of streams/connections, presumably reducing memory usage. Apart from this, bulk requests don't have any latency penalty compared to other requests.	2025-07-14 13:22:38 +00:00
Erik Grinaker	a203f9829a	pageserver: add timeline_id span when freezing layers (#12572 ) ## Problem We don't log the timeline ID when rolling ephemeral layers during housekeeping. Resolves [LKB-179](https://databricks.atlassian.net/browse/LKB-179) ## Summary of changes Add a span with timeline ID when calling `maybe_freeze_ephemeral_layer` from the housekeeping loop. We don't instrument the function itself, since future callers may not have a span including the tenant_id already, but we don't want to duplicate the tenant_id for these spans.	2025-07-14 12:30:28 +00:00
Erik Grinaker	42ab34dc36	pageserver/client_grpc: don't pipeline GetPage requests (#12584 ) ## Problem The communicator gRPC client currently attempts to pipeline GetPage requests from multiple callers onto the same gRPC stream. This has a number of issues: * Head-of-line blocking: the request may block on e.g. layer download or LSN wait, delaying the next request. * Cancellation: we can't easily cancel in-progress requests (e.g. due to timeout or backend termination), so it may keep blocking the next request (even its own retry). * Complex stream scheduling: picking a stream becomes harder/slower, and additional Tokio tasks and synchronization is needed for stream management. Touches #11735. Requires #12579. ## Summary of changes This patch removes pipelining of gRPC stream requests, and instead prefers to scale out the number of streams to achieve the same throughput. Stream scheduling has been rewritten, and mostly follows the same pattern as the client pool with exclusive acquisition by a single caller. [Benchmarks](https://github.com/neondatabase/neon/pull/12583) show that the cost of an idle server-side GetPage worker task is about 26 KB (2.5 GB for 100,000), so we can afford to scale out. This has a number of advantages: * It (mostly) eliminates head-of-line blocking (except at the TCP level). * Cancellation becomes trivial, by closing the stream. * Stream scheduling becomes significantly simpler and cheaper. * Individual callers can still use client-side batching for pipelining.	2025-07-14 12:11:33 +00:00
Erik Grinaker	30b877074c	pagebench: add CPU profiling support (#12478 ) ## Problem The new communicator gRPC client has significantly worse Pagebench performance than a basic gRPC client. We need to find out why. ## Summary of changes Add a `pagebench --profile` flag which takes a client CPU profile of the benchmark and writes a flamegraph to `profile.svg`.	2025-07-14 11:44:53 +00:00

9067 Commits