rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-22 21:59:59 +00:00

Author	SHA1	Message	Date
Alex Chi Z.	1bb434ab74	fix(test): test_readonly_node_gc compute needs time to acquire lease (#12747 ) ## Problem Part of LKB-2368. Compute fails to obtain LSN lease in this test case. There're many assumptions around how compute obtains the leases, and in this particular test case, as the LSN lease length is only 8s (which is shorter than the amount of time where pageserver can restart and compute can reconnect in terms of force stop), it sometimes cause issues. ## Summary of changes Add more sleeps around the test case to ensure it's stable at least. We need to find a more reliable way to test this in the future. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-29 14:23:42 +00:00
Alex Chi Z.	dbde37c53a	fix(safekeeper): retry if open segment fail (#12757 ) ## Problem Fix LKB-2632. The safekeeper wal read path does not seem to retry at all. This would cause client read errors on the customer side. ## Summary of changes - Retry on `safekeeper::wal_backup::read_object`. - Note that this only retries on S3 HTTP connection errors. Subsequent reads could fail, and that needs more refactors to make the retry mechanism work across the path. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-29 14:20:43 +00:00
Heikki Linnakangas	5e3cb2ab07	Refactor LFC stats functions (#12696 ) Split the functions into two parts: an internal function in file_cache.c which returns an array of structs representing the result set, and another function in neon.c with the glue code to expose it as a SQL function. This is in preparation for the new communicator, which needs to implement the same SQL functions, but getting the information from a different place. In the glue code, use the more modern Postgres way of building a result set using a tuplestore.	2025-07-29 13:12:44 +00:00
Erik Grinaker	61f267d8f9	pageserver: only retry `WaitForActiveTimeout` during shard resolution (#12772 ) ## Problem In https://github.com/neondatabase/neon/pull/12467, timeouts and retries were added to `Cache::get` tenant shard resolution to paper over an issue with read unavailability during shard splits. However, this retries _all_ errors, including irrecoverable errors like `NotFound`. This causes problems with gRPC child shard routing in #12702, which targets specific shards with `ShardSelector::Known` and relies on prompt `NotFound` errors to reroute requests to child shards. These retries introduce a 1s delay for all reads during child routing. The broader problem of read unavailability during shard splits is left as future work, see https://databricks.atlassian.net/browse/LKB-672. Touches #12702. Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191). ## Summary of changes * Change `TenantManager` to always return a concrete `GetActiveTimelineError`. * Only retry `WaitForActiveTimeout` errors. * Lots of code unindentation due to the simplified error handling. Out of caution, we do not gate the retries on `ShardSelector`, since this can trigger other races. Improvements here are left as future work.	2025-07-29 12:33:02 +00:00
JC Grünhage	e2411818ef	Add SBOMs and provenance attestations to container images (#12768 ) ## Problem Given a container image it is difficult to figure out dependencies and doesn't work automatically. ## Summary of changes - Build all rust binaries with `cargo auditable`, to allow sbom scanners to find it's dependencies. - Adjust `attests` for `docker/build-push-action`, so that buildkit creates sbom and provenance attestations. - Dropping `--locked` for `rustfilt`, because `rustfilt` can't build with locked dependencies[^5] ## Further details Building with `cargo auditable`[^1] embeds a dependency list into Linux, Windows, MacOS and WebAssembly artifacts. A bunch of tools support discovering dependencies from this, among them `syft`[^2], which is used by the BuildKit Syft scanner[^3] plugin. This BuildKit plugin is the default[^4] used in docker for generating sbom attestations, but we're making that default explicit by referencing the container image. [^1]: https://github.com/rust-secure-code/cargo-auditable [^2]: https://github.com/anchore/syft [^3]: https://github.com/docker/buildkit-syft-scanner [^4]: https://docs.docker.com/build/metadata/attestations/sbom/#sbom-generator [^5]: https://github.com/luser/rustfilt/issues/23	2025-07-29 12:12:14 +00:00
Dmitrii Kovalkov	58327cbba8	storcon: wait for the migration from the drained node in the draining loop (#12754 ) ## Problem We have seen some errors in staging when the shard migration was triggered by optimizations, and it was ongoing during draining the node it was migrating from. It happens because the node draining loop only waits for the migrations started by the drain loop itself. The ongoing migrations are ignored. Closes: https://databricks.atlassian.net/browse/LKB-1625 ## Summary of changes - Wait for the shard reconciliation during the drain if it is being migrated from the drained node.	2025-07-29 11:58:31 +00:00
Heikki Linnakangas	568927a8a0	Remove unnecessary dependency to 'log' crate (#12763 ) We use 'tracing' everywhere.	2025-07-29 11:08:22 +00:00
a-masterov	1ed7252950	Add a workaround for the clickhouse 24.9+ problem causing an error (#12767 ) ## Problem We used ClickHouse v. 24.8, which is outdated, for logical replication testing. We could miss some problems. ## Summary of changes The version was updated to 25.6, with a workaround using the environment variable `PGSSLCERT`. Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>	2025-07-29 10:19:10 +00:00
Alexander Bayandin	30b57334ef	test_lsn_lease_storcon: ignore ShardSplit warning in debug builds (#12770 ) ## Problem `test_lsn_lease_storcon` might fail in debug builds due to slow ShardSplit ## Summary of changes - Make `test_lsn_lease_storcon ` test to ignore `.Exclusive lock by ShardSplit was held.` warning in debug builds Ref: https://databricks.slack.com/archives/C09254R641L/p1753777051481029	2025-07-29 09:47:39 +00:00
Heikki Linnakangas	d487ba2b9b	Replace 'memoffset' crate with core functionality (#12761 ) The `std::mem::offset_of` macro was introduced in Rust 1.77.0. In the passing, mark the function as `const`, as suggested in the comment. Not sure which compiler version that requires, but it works with what have currently.	2025-07-29 08:01:31 +00:00
Conrad Ludgate	e7a1d5de94	proxy: cache for password hashing (#12011 ) ## Problem Password hashing for sql-over-http takes up a lot of CPU. Perhaps we can get away with temporarily caching some steps so we only need fewer rounds, which will save some CPU time. ## Summary of changes The output of pbkdf2 is the XOR of the outputs of each iteration round, eg `U1 ^ U2 ^ ... U15 ^ U16 ^ U17 ^ ... ^ Un`. We cache the suffix of the expression `U16 ^ U17 ^ ... ^ Un`. To compute the result from the cached suffix, we only need to compute the prefix `U1 ^ U2 ^ ... U15`. The suffix by itself is useless, which prevent's its use in brute-force attacks should this cached memory leak. We are also caching the full 4096 round hash in memory, which can be used for brute-force attacks, where this suffix could be used to speed it up. My hope/expectation is that since these will be in different allocations, it makes any such memory exploitation much much harder. Since the full hash cache might be invalidated while the suffix is cached, I'm storing the timestamp of the computation as a way to identity the match. I also added `zeroize()` to clear the sensitive state from the stack/heap. For the most security conscious customers, we hope to roll out OIDC soon, so they can disable passwords entirely. --- The numbers for the threadpool were pretty random, but according to our busiest region for sql-over-http, we only see about 150 unique endpoints every minute. So storing ~100 of the most common endpoints for that minute should be the vast majority of requests. 1 minute was chosen so we don't keep data in memory for too long.	2025-07-29 06:48:14 +00:00
Ivan Efremov	6be572177c	chore: Fix nightly lints (#12746 ) - Remove some unused code - Use `is_multiple_of()` instead of '%' - Collapse consecuative "if let" statements - Elided lifetime fixes It is enough just to review the code of your team	2025-07-28 21:36:30 +00:00
Alex Chi Z.	fe7a4e1ab6	fix(test): wait compaction in timeline offload test (#12673 ) ## Problem close LKB-753. `test_pageserver_metrics_removed_after_offload` is unstable and it sometimes leave the metrics behind after tenant offloading. It turns out that we triggered an image compaction before the offload and the job was stopped after the offload request was completed. ## Summary of changes Wait all background tasks to finish before checking the metrics. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-28 16:27:55 +00:00
Heikki Linnakangas	40cae8cc36	Fix misc typos and some cosmetic code cleanup (#12695 )	2025-07-28 16:21:35 +00:00
Heikki Linnakangas	02fc8b7c70	Add compatibility macros for MyProcNumber and PGIOAlignedBlock (#12715 ) There were a few uses of these already, so collect them to the compatibility header to avoid the repetition and scattered #ifdefs. The definition of MyProcNumber is a little different from what was used before, but the end result is the same. (PGPROC->pgprocno values were just assigned sequentially to all PGPROC array members, see InitProcGlobal(). That's a bit silly, which is why it was removed in v17.)	2025-07-28 15:05:36 +00:00
John Spray	60feb168e2	pageserver: decrease MAX_SHARDS in utilization (#12668 ) ## Problem When tenants have a lot of timelines, the number of tenants that a pageserver can comfortably handle goes down. Branching is much more widely used in practice now than it was when this code was written, and we generally run pageservers with a few thousand tenants (where each tenant has many timelines), rather than the 10k-20k we might have done historically. This should really be something configurable, or a more direct proxy for resource utilization (such as non-archived timeline count), but this change should be a low effort improvement. ## Summary of changes * Change the target shard count (MAX_SHARDS) to 2500 from 5000 when calculating pageserver utilization (i.e. a 200% overcommit now corresponds to 5000 shards, not 10000 shards) Co-authored-by: John Spray <john.spray@databricks.com>	2025-07-28 13:50:18 +00:00
a-masterov	da596a5162	Update the versions for ClickHouse and Debezium (#12741 ) ## Problem The test for logical replication used the year-old versions of ClickHouse and Debezium so that we may miss problems related to up-to-date versions. ## Summary of changes The ClickHouse version has been updated to 24.8. The Debezium version has been updated to the latest stable one, 3.1.3Final. Some problems with locally running the Debezium test have been fixed. --------- Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com> Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-07-28 13:26:33 +00:00
Conrad Ludgate	effd6bf829	[proxy] add metrics for caches (#12752 ) Exposes metrics for caches. LKB-2594 This exposes a high level namespace, `cache`, that all cache metrics can be added to - this makes it easier to make library panels for the caches as I understand it. To calculate the current cache fill ratio, you could use the following query: ``` ( cache_inserted_total{cache="node_info"} - sum (cache_evicted_total{cache="node_info"}) without (cause) ) / cache_capacity{cache="node_info"} ``` To calculate the cache hit ratio, you could use the following query: ``` cache_request_total{cache="node_info", outcome="hit"} / sum (cache_request_total{cache="node_info"}) without (outcome) ```	2025-07-28 10:41:49 +00:00
Tristan Partin	a6e0baf31a	[BRC-1405] Mount databricks pg_hba and pg_ident from configmap (#12733 ) ## Problem For certificate auth, we need to configure pg_hba and pg_ident for it to work. HCC needs to mount this config map to all pg compute pod. ## Summary of changes Create `databricks_pg_hba` and `databricks_pg_ident` to configure where the files are located on the pod. These configs are pass down to `compute_ctl`. Compute_ctl uses these config to update `pg_hba.conf` and `pg_ident.conf` file. We append `include_if_exists {databricks_pg_hba}` to `pg_hba.conf` and similarly to `pg_ident.conf`. So that it will refer to databricks config file without much change to existing pg default config file. --------- Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com> Co-authored-by: William Huang <william.huang@databricks.com> Co-authored-by: HaoyuHuang <haoyu.huang.68@gmail.com>	2025-07-25 20:50:03 +00:00
Christian Schwarz	19b74b8837	fix(page_service): getpage requests don't hold `applied_gc_cutoff_lsn` guard (#12743 ) Before this PR, getpage requests wouldn't hold the `applied_gc_cutoff_lsn` guard until they were done. Theoretical impact: if we’re not holding the `RcuReadGuard`, gc can theoretically concurrently delete reconstruct data that we need to reconstruct the page. I don't think this practically occurs in production because the odds of it happening are quite low, especially for primary read_write computes. But RO replicas / standby_horizon relies on correct `applied_gc_cutofff_lsn`, so, I'm fixing this as part of the work ok replacing standby_horizon propagation mechanism with leases (LKB-88). The change is feature-gated with a feature flag, and evaluated once when entering `handle_pagestream` to avoid performance impact. For observability, we add a field to the `handle_pagestream` span, and a slow-log to the place in `gc_loop` where it waits for the in-flight RcuReadGuard's to drain. refs - fixes https://databricks.atlassian.net/browse/LKB-2572 - standby_horizon leases epic: https://databricks.atlassian.net/browse/LKB-2572 --------- Co-authored-by: Christian Schwarz <Christian Schwarz>	2025-07-25 20:25:04 +00:00
Folke Behrens	25718e324a	proxy: Define service_info metric showing the run state (#12749 ) ## Problem Monitoring dashboards show aggregates of all proxy instances, including terminating ones. This can skew the results or make graphs less readable. Also, alerts must be tuned to ignore certain signals from terminating proxies. ## Summary of changes Add a `service_info` metric currently with one label, `state`, showing if an instance is in state `init`, `running`, or `terminating`. The metric can be joined with other metrics to filter the presented time series.	2025-07-25 18:27:21 +00:00
Dmitrii Kovalkov	ac8f44c70e	tests: stop ps immediately in test_ps_unavailable_after_delete (#12728 ) ## Problem test_ps_unavailable_after_delete is flaky. All test failures I've looked at are because of ERROR log messages in pageserver, which happen because storage controller tries runs a reconciliations during the graceful shutdown of the pageserver. I wasn't able to reproduce it locally, but I think stopping PS immediately instead of gracefully should help. If not, we might just silence those errors. - Closes: https://databricks.atlassian.net/browse/LKB-745	2025-07-25 18:09:34 +00:00
Conrad Ludgate	d09664f039	[proxy] replace TimedLru with moka (#12726 ) LKB-2536 TimedLru is hard to maintain. Let's use moka instead. Stacked on top of #12710.	2025-07-25 17:39:48 +00:00
Mikhail	6689d6fd89	LFC prewarm perftest fixes: use existing staging project (#12651 ) https://github.com/neondatabase/cloud/issues/19011 - Prewarm config changes are not publicly available. Correct the test by using a pre-filled 50 GB project on staging - Create extension neon with schema neon to fix read performance tests on staging, error example in https://neon-github-public-dev.s3.amazonaws.com/reports/main/16483462789/index.html#suites/3d632da6dda4a70f5b4bd24904ab444c/919841e331089fc4/ - Don't create extra endpoint in LFC prewarm performance tests	2025-07-25 16:56:41 +00:00
Tristan Partin	33b400beae	[BRC-1425] Plumb through and set the requisite GUCs when starting the compute instance (#12732 ) ## Problem We need the set the following Postgres GUCs to the correct value before starting Postgres in the compute instance: ``` databricks.workspace_url databricks.enable_databricks_identity_login databricks.enable_sql_restrictions ``` ## Summary of changes Plumbed through `workspace_url` and other GUC settings via `DatabricksSettings` in `ComputeSpec`. The spec is sent to the compute instance when it starts up and the GUCs are written to `postgresql.conf` before the postgres process is launched. --------- Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com> Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-25 15:20:05 +00:00
Tristan Partin	ca07f7dba5	Copy pg server cert and key to pgdata with correct permission (#12731 ) ## Problem Copy certificate and key from secret mount directory to `pgdata` directory where `postgres` is the owner and we can set the key permission to 0600. ## Summary of changes - Added new pgparam `pg_compute_tls_settings` to specify where k8s secret for certificate and key are mounted. - Added a new field to `ComputeSpec` called `databricks_settings`. This is a struct that will be used to store any other settings that needs to be propagate to Compute but should not be persisted to `ComputeSpec` in the database. - Then when the compute container start up, as part of `prepare_pgdata` function, it will copied `server.key` and `server.crt` from k8s mounted directory to `pgdata` directory. ## How is this tested? Add unit tests. Manual test via KIND Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com>	2025-07-25 15:05:05 +00:00
Vlad Lazar	b0dfe0ffa6	storcon: attempt all non-essential location config calls during reconciliations (#12745 ) ## Problem We saw the following in the field: Context and observations: * The storage controller keeps track of the latest generations and the pageserver that issued the latest generation in the database * When the storage controller needs to proxy a request (e.g. timeline creation) to the pageservers, it will find use the pageserver that issued the latest generation from the db (generation_pageserver). * pageserver-2.cell-2 got into a bad state and wasn't able to apply location_config (e.g. detach a shard) What happened: 1. pageserver-2.cell-2 was a secondary for our shard since we were not able to detach it 2. control plane asked to detach a tenant (presumably because it was idle) a. In response storcon clears the generation_pageserver from the db and attempts to detach all locations b. it tries to detach pageserver-2.cell-2 first, but fails, which fails the entire reconciliation leaving the good attached location still there c. return success to cplane 3. control plane asks to re-attach the tenant a. In response storcon performs a reconciliation b. it finds that the observed state matches the intent (remember we did not detach the primary at step(2)) c. skips incrementing the genration and setting the generation_pageserver column Now any requests that need to be proxied to pageservers and rely on the generation_pageserver db column fail because that's not set ## Summary of changes 1. We do all non-essential location config calls (setting up secondaries, detaches) at the end of the reconciliation. Previously, we bailed out of the reconciliation on the first failure. With this patch we attempt all of the RPCs. This allows the observed state to update even if another RPC failed for unrelated reasons. 2. If the overall reconciliation failed, we don't want to remove nodes from the observed state as a safe-guard. With the previous patch, we'll get a deletion delta to process, which would be ignored. Ignoring it is not the right thing to do since it's out of sync with the db state. Hence, on reconciliation failures map deletion from the observed state to the uncertain state. Future reconciliation will query the node to refresh their observed state. Closes LKB-204	2025-07-25 14:03:17 +00:00
Erik Grinaker	185ead8395	pageserver: verify gRPC GetPages on correct shard (#12722 ) Verify that gRPC `GetPageRequest` has been sent to the shard that owns the pages. This avoid spurious `NotFound` errors if a compute misroutes a request, which can appear scarier (e.g. data loss). Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).	2025-07-25 13:43:04 +00:00
Erik Grinaker	37e322438b	pageserver: document gRPC compute accessibility (#12724 ) Document that the Pageserver gRPC port is accessible by computes, and should not provide internal services. Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).	2025-07-25 13:35:44 +00:00
Gustavo Bazan	fca2c32e59	[ci/docker] task: Apply some quick wins for tools dockerfile (#12740 ) ## Problem The Dockerfile for build tools has some small issues that are easy to fix to make it follow some of docker best practices ## Summary of changes Apply some small quick wins on the Dockerfile for build tools - Usage of apt-get over apt - usage of --no-cache-dir for pip install	2025-07-25 12:39:01 +00:00
Conrad Ludgate	d19aebcf12	[proxy] introduce moka for the project-info cache (#12710 ) ## Problem LKB-2502 The garbage collection of the project info cache is garbage. What we observed: If we get unlucky, we might throw away a very hot entry if the cache is full. The GC loop is dependent on getting a lucky shard of the projects2ep table that clears a lot of cold entries. The GC does not take into account active use, and the interval it runs at is too sparse to do any good. Can we switch to a proper cache implementation? Complications: 1. We need to invalidate by project/account. 2. We need to expire based on `retry_delay_ms`. ## Summary of changes 1. Replace `retry_delay_ms: Duration` with `retry_at: Instant` when deserializing. 2. Split the EndpointControls from the RoleControls into two different caches. 3. Introduce an expiry policy based on error retry info. 4. Introduce `moka` as a dependency, replacing our `TimedLru`. See the follow up PR for changing all TimedLru instances to use moka: #12726.	2025-07-25 11:40:47 +00:00
Conrad Ludgate	a70a5bccff	move subzero_core to proxy libs (#12742 ) We have a dedicated libs folder for proxy related libraries. Let's move the subzero_core stub there.	2025-07-25 10:44:28 +00:00
Conrad Ludgate	d9cedb4a95	[tokio-postgres] fix regression in buffer reuse (#12739 ) Follow up to #12701, which introduced a new regression. When profiling locally I noticed that writes have the tendency to always reallocate. On investigation I found that even if the `Connection`'s write buffer is empty, if it still shares the same data pointer as the `Client`'s write buffer then the client cannot reclaim it. The best way I found to fix this is to just drop the `Connection`'s write buffer each time we fully flush it. Additionally, I remembered that `BytesMut` has an `unsplit` method which is allows even better sharing over the previous optimisation I had when 'encoding'.	2025-07-25 09:03:21 +00:00
Tristan Partin	b623fbae0c	Cancel PG query if stuck at refreshing configuration (#12717 ) ## Problem While configuring or reconfiguring PG due to PageServer movements, it's possible PG may get stuck if PageServer is moved around after fetching the spec from StorageController. ## Summary of changes To fix this issue, this PR introduces two changes: 1. Fail the PG query directly if the query cannot request configuration for certain number of times. 2. Introduce a new state `RefreshConfiguration` in compute tools to differentiate it from `RefreshConfigurationPending`. If compute tool is already in `RefreshConfiguration` state, then it will not accept new request configuration requests. ## How is this tested? Chaos testing. Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-25 00:01:59 +00:00
Tristan Partin	512210bb5a	[BRC-2368] Add PS and compute_ctl metrics to report pagestream request errors (#12716 ) ## Problem In our experience running the system so far, almost all of the "hang compute" situations are due to the compute (postgres) pointing at the wrong pageservers. We currently mainly rely on the promethesus exporter (PGExporter) running on PG to detect and report any down time, but these can be unreliable because the read and write probes the PGExporter runs do not always generate pageserver requests due to caching, even though the real user might be experiencing down time when touching uncached pages. We are also about to start disk-wiping node pool rotation operations in prod clusters for our pageservers, and it is critical to have a convenient way to monitor the impact of these node pool rotations so that we can quickly respond to any issues. These metrics should provide very clear signals to address this operational need. ## Summary of changes Added a pair of metrics to detect issues between postgres' PageStream protocol (e.g. get_page_at_lsn, get_base_backup, etc.) communications with pageservers: * On the compute node (compute_ctl), exports a counter metric that is incremented every time postgres requests a configuration refresh. Postgres today only requests these configuration refreshes when it cannot connect to a pageserver or if the pageserver rejects its request by disconnecting. * On the pageserver, exports a counter metric that is incremented every time it receives a PageStream request that cannot be handled because the tenant is not known or if the request was routed to the wrong shard (e.g. secondary). ### How I plan to use metrics I plan to use the metrics added here to create alerts. The alerts can fire, for example, if these counters have been continuously increasing for over a certain period of time. During rollouts, misrouted requests may occasionally happen, but they should soon die down as reconfigurations make progress. We can start with something like raising the alert if the counters have been increasing continuously for over 5 minutes. ## How is this tested? New integration tests in `test_runner/regress/test_hadron_ps_connectivity_metrics.py` Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-24 19:05:00 +00:00
HaoyuHuang	9eebd6fc79	A few more compute_ctl changes (#12713 ) ## Summary of changes A bunch of no-op changes. The only other thing is that the lock is released early in the terminate func.	2025-07-24 19:01:30 +00:00
Tristan Partin	11527b9df7	[BRC-2951] Enforce PG backpressure parameters at the shard level (#12694 ) ## Problem Currently PG backpressure parameters are enforced globally. With tenant splitting, this makes it hard to balance small tenants and large tenants. For large tenants with more shards, we need to increase the lagging because each shard receives total/shard_count amount of data, while doing so could be suboptimal to small tenants with fewer shards. ## Summary of changes This PR makes these parameters to be enforced at the shard level, i.e., PG will compute the actual lag limit by multiply the shard count. ## How is this tested? Added regression test. Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-24 18:41:29 +00:00
Tristan Partin	89554af1bd	[BRC-1778] Have PG signal compute_ctl to refresh configuration if it suspects that it is talking to the wrong PSs (#12712 ) ## Problem This is a follow-up to TODO, as part of the effort to rewire the compute reconfiguration/notification mechanism to make it more robust. Please refer to that commit or ticket BRC-1778 for full context of the problem. ## Summary of changes The previous change added mechanism in `compute_ctl` that makes it possible to refresh the configuration of PG on-demand by having `compute_ctl` go out to download a new config from the control plane/HCC. This change wired this mechanism up with PG so that PG will signal `compute_ctl` to refresh its configuration when it suspects that it could be talking to incorrect pageservers due to a stale configuration. PG will become suspicious that it is talking to the wrong pageservers in the following situations: 1. It cannot connect to a pageserver (e.g., getting a network-level connection refused error) 2. It can connect to a pageserver, but the pageserver does not return any data for the GetPage request 3. It can connect to a pageserver, but the pageserver returns a malformed response 4. It can connect to a pageserver, but there is an error receiving the GetPage request response for any other reason This change also includes a minor tweak to `compute_ctl`'s config refresh behavior. Upon receiving a request to refresh PG configuration, `compute_ctl` will reach out to download a config, but it will not attempt to apply the configuration if the config is the same as the old config is it replacing. This optimization is added because the act of reconfiguring itself requires working pageserver connections. In many failure situations it is likely that PG detects an issue with a pageserver before the control plane can detect the issue, migrate tenants, and update the compute config. In this case even the latest compute config won't point PG to working pageservers, causing the configuration attempt to hang and negatively impact PG's time-to-recovery. With this change, `compute_ctl` only attempts reconfiguration if the refreshed config points PG to different pageservers. ## How is this tested? The new code paths are exercised in all existing tests because this mechanism is on by default. Explicitly tested in `test_runner/regress/test_change_pageserver.py`. Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-24 16:44:45 +00:00
Peter Bendel	f391186aa7	TPC-C like periodic benchmark using benchbase (#12665 ) ## Problem We don't have a well-documented, periodic benchmark for TPC-C like OLTP workload. ## Summary of changes # Benchbase TPC-C-like Performance Results Runs TPC-C-like benchmarks on Neon databases using [Benchbase](https://github.com/cmu-db/benchbase). Docker images are built [here](https://github.com/neondatabase-labs/benchbase-docker-images) We run the benchmarks at different scale factors aligned with different compute sizes we offer to customers. For each scale factor, we determine a max rate (see Throughput in warmup phase) and then run the benchmark at a target rate of approx. 70 % of the max rate. We use different warehouse sizes which determine the working set size - it is optimized for LFC size of the respected pricing tier. Usually we should get LFC hit rates above 70 % for this setup and quite good, consistent (non-flaky) latencies. ## Expected performance as of first testing this \| Tier \| CU \| Warehouses \| Terminals \| Max TPS \| LFC size \| Working set size \| LFC hit rate \| Median latency \| p95 latency \| \|------------\|------------\|---------------\|-----------\|---------\|----------\|------------------\|--------------\|----------------\|-------------\| \| free \| 0.25-2 \| 50 - 5 GB \| 150 \| 800 \| 5 GB \| 6.3 GB \| 95 % \| 170 ms \| 600 ms \| \| serverless \| 2-8 \| 500 - 50 GB \| 230 \| 2000 \| 26 GB \| ?? GB \| 91 % \| 50 ms \| 200 ms \| \| business \| 2-16 \| 1000 - 100 GB \| 330 \| 2900 \| 51 GB \| 50 GB \| 72 % \| 40 ms \| 180 ms \| Each run - first loads the database (not shown in the dashboard). - Then we run a warmup phase for 20 minutes to warm up the database and the LFC at unlimited target rate (max rate) (highest throughput but flaky latencies). The warmup phase can be used to determine the max rate and adjust it in the github workflow in case Neon is faster in the future. - Then we run the benchmark at a target rate of approx. 70 % of the max rate for 1 hour (expecting consistent latencies and throughput). ## Important notes on implementation: - we want to eventually publish the process how to reproduce these benchmarks - thus we want to reduce all dependencies necessary to run the benchmark, the only thing needed are - docker - the docker images referenced above for benchbase - python >= 3.9 to run some config generation steps and create diagrams - to reduce dependencies we deliberatly do NOT use some of our python fixture test infrastructure to make the dependency chain really small - so pls don't add a review comment "should reuse fixture xy" - we also upload all generator python scripts, generated bash shell scripts and configs as well as raw results to S3 bucket that we later want to publish once this benchmark is reviewed and approved.	2025-07-24 16:26:54 +00:00
Paul Banks	94b41b531b	storecon: Fix panic due to race with chaos migration on staging (#12727 ) ## Problem * Fixes LKB-743 We get regular assertion failures on staging caused by a race with chaos injector. If chaos injector decides to migrate a tenant shard between the background optimisation planning and applying optimisations then we attempt to migrate and already migrated shard and hit an assertion failure. ## Summary of changes @VladLazar fixed a variant of this issue by adding`validate_optimization` recently, however it didn't validate the specific property this other assertion requires. Fix is just to update it to cover all the expected properties.	2025-07-24 16:14:47 +00:00
Erik Grinaker	d793088225	pgxn: set `MACOSX_DEPLOYMENT_TARGET` (#12723 ) ## Problem Compiling `neon-pg-ext-v17` results in these linker warnings for `libcommunicator.a`: ``` $ make -j`nproc` -s neon-pg-ext-v17 Installing PostgreSQL v17 headers Compiling PostgreSQL v17 Compiling neon-specific Postgres extensions for v17 ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1159](25ac62e5b3c53843-curve25519.o)) was built for newer 'macOS' version (15.5) than being linked (15.0) ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1160](0bbbd18bda93c05b-aes_nohw.o)) was built for newer 'macOS' version (15.5) than being linked (15.0) ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1161](00c879ee3285a50d-montgomery.o)) was built for newer 'macOS' version (15.5) than being linked (15.0) [...] ``` ## Summary of changes Set `MACOSX_DEPLOYMENT_TARGET` to the current local SDK version (15.5 in this case), which links against object files for that version.	2025-07-24 14:48:35 +00:00
John Spray	67ad420e26	tests: turn down error rate in test_compute_pageserver_connection_stress (#12721 ) ## Problem Compute retries are finite (e.g. 5x in a basebackup) -- with a 50% failure rate we have pretty good chance of exceeding that and the test failing. Fixes: https://databricks.atlassian.net/browse/LKB-2278 ## Summary of changes - Turn connection error rate down to 20% Co-authored-by: John Spray <john.spray@databricks.com>	2025-07-24 14:42:39 +00:00
Tristan Partin	90cd5a5be8	[BRC-1778] Add mechanism to `compute_ctl` to pull a new config (#12711 ) ## Problem We have been dealing with a number of issues with the SC compute notification mechanism. Various race conditions exist in the PG/HCC/cplane/PS distributed system, and relying on the SC to send notifications to the compute node to notify it of PS changes is not robust. We decided to pursue a more robust option where the compute node itself discovers whether it may be pointing to the incorrect PSs and proactively reconfigure itself if issues are suspected. ## Summary of changes To support this self-healing reconfiguration mechanism several pieces are needed. This PR adds a mechanism to `compute_ctl` called "refresh configuration", where the compute node reaches out to the control plane to pull a new config and reconfigure PG using the new config, instead of listening for a notification message containing a config to arrive from the control plane. Main changes to compute_ctl: 1. The `compute_ctl` state machine now has a new State, `RefreshConfigurationPending`. The compute node may enter this state upon receiving a signal that it may be using the incorrect page servers. 2. Upon entering the `RefreshConfigurationPending` state, the background configurator thread in `compute_ctl` wakes up, pulls a new config from the control plane, and reconfigures PG (with `pg_ctl reload`) according to the new config. 3. The compute node may enter the new `RefreshConfigurationPending` state from `Running` or `Failed` states. If the configurator managed to configure the compute node successfully, it will enter the `Running` state, otherwise, it stays in `RefreshConfigurationPending` and the configurator thread will wait for the next notification if an incorrect config is still suspected. 4. Added various plumbing in `compute_ctl` data structures to allow the configurator thread to perform the config fetch. The "incorrect config suspected" notification is delivered using a HTTP endpoint, `/refresh_configuration`, on `compute_ctl`. This endpoint is currently not called by anyone other than the tests. In a follow up PR I will set up some code in the PG extension/libpagestore to call this HTTP endpoint whenever PG suspects that it is pointing to the wrong page servers. ## How is this tested? Modified `test_runner/regress/test_change_pageserver.py` to add a scenario where we use the new `/refresh_configuration` mechanism instead of the existing `/configure` mechanism (which requires us sending a full config to compute_ctl) to have the compute node reload and reconfigure its pageservers. I took one shortcut to reduce the scope of this change when it comes to testing: the compute node uses a local config file instead of pulling a config over the network from the HCC. This simplifies the test setup in the following ways: * The existing test framework is set up to use local config files for compute nodes only, so it's convenient if I just stick with it. * The HCC today generates a compute config with production settings (e.g., assuming 4 CPUs, 16GB RAM, with local file caches), which is probably not suitable in tests. We may need to add another test-only endpoint config to the control plane to make this work. The config-fetch part of the code is relatively straightforward (and well-covered in both production and the KIND test) so it is probably fine to replace it with loading from the local config file for these integration tests. In addition to making sure that the tests pass, I also manually inspected the logs to make sure that the compute node is indeed reloading the config using the new mechanism instead of going down the old `/configure` path (it turns out the test has bugs which causes compute `/configure` messages to be sent despite the test intending to disable/blackhole them). ```test 2024-09-24T18:53:29.573650Z INFO http request{otel.name=/refresh_configuration http.method=POST}: serving /refresh_configuration POST request 2024-09-24T18:53:29.573689Z INFO configurator_main_loop: compute node suspects its configuration is out of date, now refreshing configuration 2024-09-24T18:53:29.573706Z INFO configurator_main_loop: reloading config.json from path: /workspaces/hadron/test_output/test_change_pageserver_using_refresh[release-pg16]/repo/endpoints/ep-1/spec.json PG:2024-09-24 18:53:29.574 GMT [52799] LOG: received SIGHUP, reloading configuration files PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.extension_server_port" cannot be changed without restarting the server PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.pageserver_connstring" changed to "postgresql://no_user@localhost:15008" ... ``` Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-24 14:26:21 +00:00
Christian Schwarz	643448b1a2	test_hot_standby_gc: work around standby_horizon-related flakiness/raciness uncovered by #12431 (#12704 ) PR #12431 set initial lease deadline = 0s for tests. This turned test_hot_standby_gc flaky because it now runs GC: it started failing with `tried to request a page version that was garbage collected` because the replica reads below applied gc cutoff. The leading theory is that, we run the timeline_gc() before the first standby_horizon push arrives at PS. That is definitively a thing that can happen with the current standby_horizon mechanism, and it's now tracked as such in https://databricks.atlassian.net/browse/LKB-2499. We don't have logs to confirm this theory though, but regardless, try the fix in this PR and see if it stabilizes things. Refs - flaky test issue: https://databricks.atlassian.net/browse/LKB-2465 ## Problem ## Summary of changes	2025-07-24 14:00:22 +00:00
Conrad Ludgate	8daebb6ed4	[proxy] remove TokioMechanism and HyperMechanism (#12672 ) Another go at #12341. LKB-2497 We now only need 1 connect mechanism (and 1 more for testing) which saves us some code and complexity. We should be able to remove the final connect mechanism when we create a separate worker task for pglb->compute connections - either via QUIC streams or via in-memory channels. This also now ensures that connect_once always returns a ConnectionError type - something simple enough we can probably define a serialisation for in pglb. * I've abstracted connect_to_compute to always use TcpMechanism and the ProxyConfig. * I've abstracted connect_to_compute_and_auth to perform authentication, managing any retries for stale computes * I had to introduce a separate `managed` function for taking ownership of the compute connection into the Client/Connection pair	2025-07-24 12:37:04 +00:00
Alexey Kondratov	ab14521ea5	fix(compute): Turn off database collector in postgres_exporter (#12684 ) ## Problem `postgres_exporter` has database collector enabled by default and it doesn't filter out invalid databases, see `06a553c816/collector/pg_database.go (L67)` so if it hits one, it starts spamming logs ``` ERROR: [NEON_SMGR] [reqid d9700000018] could not read db size of db 705302 from page server at lsn 5/A2457EB0 ``` ## Summary of changes We don't use `pg_database_size_bytes` metric anyway, see `5e19b3fd89/apps/base/compute-metrics/scrape-compute-pg-exporter-neon.yaml (L29)` so just turn it off by passing `--no-collector.database`.	2025-07-24 11:52:31 +00:00
dependabot[bot]	e82021d6fe	build(deps): bump the npm_and_yarn group across 1 directory with 2 updates (#12678 ) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-07-24 10:51:09 +00:00
Conrad Ludgate	9997661138	[proxy/tokio-postgres] garbage collection for codec buffers (#12701 ) ## Problem A large insert or a large row will cause the codec to allocate a large buffer. The codec never shrinks the buffer however. LKB-2496 ## Summary of changes 1. Introduce a naive GC system for codec buffers 2. Try and reduce copies as much as possible	2025-07-24 10:30:02 +00:00
Ivan Efremov	0e427fc117	Update proxy-bench workflow to use bare-metal script (#12703 ) Pass the params for run.sh in proxy-bench repo to use bare-metal config. Fix the paths and cleanup procedure.	2025-07-24 08:23:07 +00:00
Tristan Partin	9b2e6f862a	Set an upper limit on PG backpressure throttling (#12675 ) ## Problem Tenant split test revealed another bug with PG backpressure throttling that under some cases PS may never report its progress back to SK (e.g., observed when aborting tenant shard where the old shard needs to re-establish SK connection and re-ingest WALs from a much older LSN). In this case, PG may get stuck forever. ## Summary of changes As a general precaution that PS feedback mechanism may not always be reliable, this PR uses the previously introduced WAL write rate limit mechanism to slow down write rates instead of completely pausing it. The idea is to introduce a new `databricks_effective_max_wal_bytes_per_second`, which is set to `databricks_max_wal_mb_per_second` when no PS back pressure and is set to `10KB` when there is back pressure. This way, PG can still write to SK, though at a very low speed. The PR also fixes the problem that the current WAL rate limiting mechanism is too coarse grained and cannot enforce limits < 1MB. This is because it always resets the rate limiter after 1 second, even if PG could have written more data in the past second. The fix is to introduce a `batch_end_time_us` which records the expected end time of the current batch. For example, if PG writes 10MB of data in a single batch, and max WAL write rate is set as `1MB/s`, then `batch_end_time_us` will be set as 10 seconds later. ## How is this tested? Tweaked the existing test, and also did manual testing on dev. I set `max_replication_flush_lag` as 1GB, and loaded 500GB pgbench tables. It's expected to see PG gets throttled periodically because PS will accumulate 4GB of data before flushing. Results: when PG is throttled: ``` 9500000 of 3300000000 tuples (0%) done (elapsed 10.36 s, remaining 3587.62 s) 9600000 of 3300000000 tuples (0%) done (elapsed 124.07 s, remaining 42523.59 s) 9700000 of 3300000000 tuples (0%) done (elapsed 255.79 s, remaining 86763.97 s) 9800000 of 3300000000 tuples (0%) done (elapsed 315.89 s, remaining 106056.52 s) 9900000 of 3300000000 tuples (0%) done (elapsed 412.75 s, remaining 137170.58 s) ``` when PS just flushed: ``` 18100000 of 3300000000 tuples (0%) done (elapsed 433.80 s, remaining 78655.96 s) 18200000 of 3300000000 tuples (0%) done (elapsed 433.85 s, remaining 78231.71 s) 18300000 of 3300000000 tuples (0%) done (elapsed 433.90 s, remaining 77810.62 s) 18400000 of 3300000000 tuples (0%) done (elapsed 433.96 s, remaining 77395.86 s) 18500000 of 3300000000 tuples (0%) done (elapsed 434.03 s, remaining 76987.27 s) 18600000 of 3300000000 tuples (0%) done (elapsed 434.08 s, remaining 76579.59 s) 18700000 of 3300000000 tuples (0%) done (elapsed 434.13 s, remaining 76177.12 s) 18800000 of 3300000000 tuples (0%) done (elapsed 434.19 s, remaining 75779.45 s) 18900000 of 3300000000 tuples (0%) done (elapsed 434.84 s, remaining 75489.40 s) 19000000 of 3300000000 tuples (0%) done (elapsed 434.89 s, remaining 75097.90 s) 19100000 of 3300000000 tuples (0%) done (elapsed 434.94 s, remaining 74712.56 s) 19200000 of 3300000000 tuples (0%) done (elapsed 498.93 s, remaining 85254.20 s) 19300000 of 3300000000 tuples (0%) done (elapsed 498.97 s, remaining 84817.95 s) 19400000 of 3300000000 tuples (0%) done (elapsed 623.80 s, remaining 105486.76 s) 19500000 of 3300000000 tuples (0%) done (elapsed 745.86 s, remaining 125476.51 s) ``` Co-authored-by: Chen Luo <chen.luo@databricks.com>	2025-07-23 22:37:27 +00:00

1 2 3 4 5 ...

8433 Commits