Commit Graph

8488 Commits

Author SHA1 Message Date
Christian Schwarz
e0d2d293b1 Merge commit '7cd006621' into problame/standby-horizon-leases 2025-08-06 18:00:38 +02:00
Christian Schwarz
8799e87ae3 Merge commit '62d844e65' into problame/standby-horizon-leases 2025-08-06 18:00:09 +02:00
Christian Schwarz
737f5825bb Merge commit 'b623fbae0' into problame/standby-horizon-leases 2025-08-06 17:59:58 +02:00
Christian Schwarz
b95106cd79 Merge commit '5c57e8a11' into problame/standby-horizon-leases 2025-08-06 17:59:21 +02:00
Christian Schwarz
be1c1df6aa Merge commit '84a2556c9' into problame/standby-horizon-leases 2025-08-06 17:58:54 +02:00
Christian Schwarz
7d28fb118b Merge commit 'f85935446' into problame/standby-horizon-leases 2025-08-06 17:58:36 +02:00
Christian Schwarz
daf2b5a806 Merge commit 'b00a0096b' into problame/standby-horizon-leases 2025-08-06 17:56:37 +02:00
Christian Schwarz
e52d0ef311 Merge commit '5b0972151' into problame/standby-horizon-leases 2025-08-06 17:56:07 +02:00
Christian Schwarz
d22e23f66d Merge commit '108f7ec54' into problame/standby-horizon-leases 2025-08-06 17:55:56 +02:00
Christian Schwarz
54480167dc Merge commit '9c0efba91' into problame/standby-horizon-leases 2025-08-06 17:55:48 +02:00
Christian Schwarz
30e7c4b75d Merge commit '187170be4' into problame/standby-horizon-leases 2025-08-06 17:55:39 +02:00
Christian Schwarz
d380111428 Merge commit '87915df2f' into problame/standby-horizon-leases 2025-08-06 17:55:06 +02:00
Christian Schwarz
78a8ac7be9 ruff format 2025-08-06 17:54:36 +02:00
Christian Schwarz
279865c68a Merge commit 'dd7fff655' into problame/standby-horizon-leases 2025-08-06 17:54:17 +02:00
Christian Schwarz
1ace4bcf23 Merge commit '809633903' into problame/standby-horizon-leases 2025-08-06 17:50:43 +02:00
Christian Schwarz
35c916c062 Merge commit '5c934efb2' into problame/standby-horizon-leases 2025-08-06 17:50:33 +02:00
Christian Schwarz
02e1aeef66 Merge commit 'a456e818a' into problame/standby-horizon-leases 2025-08-06 17:49:56 +02:00
Christian Schwarz
e2c88c1929 Merge commit '296c9190b' into problame/standby-horizon-leases 2025-08-06 17:49:50 +02:00
Christian Schwarz
553a120075 Merge commit '15f633922' into problame/standby-horizon-leases 2025-08-06 17:49:41 +02:00
Christian Schwarz
cfe345d3e6 Merge commit 'c34d36d8a' into problame/standby-horizon-leases 2025-08-06 17:47:29 +02:00
Christian Schwarz
e2facbde4e Merge commit 'cec0543b5' into problame/standby-horizon-leases 2025-08-06 17:47:10 +02:00
Christian Schwarz
b8c8168378 Merge commit 'be5bbaeca' into problame/standby-horizon-leases 2025-08-06 17:46:44 +02:00
Christian Schwarz
28a2cd05d5 Merge commit '5ec82105c' into problame/standby-horizon-leases 2025-08-06 17:46:37 +02:00
Christian Schwarz
1635390a96 fix all clippy complaints in this branch 2025-08-06 17:39:17 +02:00
Christian Schwarz
1877b70a35 Merge commit 'e7d18bc18' into problame/standby-horizon-leases 2025-08-06 17:19:37 +02:00
Christian Schwarz
fb7a027211 Merge commit '4ee0da0a2' into problame/standby-horizon-leases 2025-08-06 17:17:45 +02:00
Christian Schwarz
47146fe1d6 Merge commit '7049003cf' into problame/standby-horizon-leases 2025-08-06 17:17:11 +02:00
Christian Schwarz
577eee16f9 https://github.com/neondatabase/neon/pull/12676#discussion_r2220512343; concern about backward compat of TimelineInfo 2025-08-05 23:07:26 +02:00
Christian Schwarz
2ee0f4271c fix(page_service): lsn lease API puts tenant_shard_id in tenant_id tracing field
The LSN lease API actually accepts a tenant_shard_id, not a tenant_id.
But we put the Display of the tenant_shard_id into the tenant_id field.
This PR fixes it.

Refs
- fixes https://databricks.atlassian.net/browse/LKB-2930
2025-08-05 22:48:27 +02:00
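
A minimal sketch of the kind of span-field fix this describes, assuming hypothetical type and field names (the real span definitions live in the pageserver's page_service code):

```rust
use tracing::info_span;

// Hypothetical stand-in for the pageserver's TenantShardId.
struct TenantShardId {
    tenant_id: String,
    shard_slug: String,
}

fn lease_span(id: &TenantShardId) -> tracing::Span {
    // Before (the bug): the Display of the whole shard id was recorded into the
    // tenant_id field. After (the fix): record tenant and shard parts separately.
    info_span!(
        "lsn_lease",
        tenant_id = %id.tenant_id,
        shard_id = %id.shard_slug,
    )
}

fn main() {
    let id = TenantShardId { tenant_id: "abc123".into(), shard_slug: "0001".into() };
    let _guard = lease_span(&id).entered();
    tracing::info!("lease granted");
}
```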
Christian Schwarz
8a9f1dd5e7 use tokio::time::Instant internally, chrono::DateTime<Utc> externally; communicate expiration through rfc3339 format; chrono::DateTime has good Debug fmt so this also serves observability; finish implementing release valve mechanism 2025-08-05 22:47:53 +02:00
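
A hedged sketch of the internal/external split that commit message describes: keep the lease deadline as a `tokio::time::Instant` for expiry logic and report it to callers as an RFC 3339 `chrono::DateTime<Utc>`. The `Lease` type and method names below are illustrative, not the actual lease implementation:

```rust
use chrono::{DateTime, Utc};
use tokio::time::{Duration, Instant};

/// Illustrative lease record: the monotonic deadline drives expiry decisions,
/// the wall-clock timestamp is what gets serialized for clients.
struct Lease {
    valid_until: Instant,
}

impl Lease {
    fn new(length: Duration) -> Self {
        Lease { valid_until: Instant::now() + length }
    }

    /// Convert the monotonic deadline into a wall-clock timestamp for the API
    /// response: measure the remaining time on the monotonic clock, then add it
    /// to `Utc::now()`, so wall-clock jumps do not affect expiry decisions.
    fn expires_at_utc(&self) -> DateTime<Utc> {
        let remaining = self.valid_until.saturating_duration_since(Instant::now());
        Utc::now() + chrono::Duration::from_std(remaining).unwrap_or_else(|_| chrono::Duration::zero())
    }
}

fn main() {
    let lease = Lease::new(Duration::from_secs(300));
    // RFC 3339 has a readable textual form, which also helps observability.
    println!("lease expires at {}", lease.expires_at_utc().to_rfc3339());
}
```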
Christian Schwarz
9f01840c18 use standby_horizon leases feature in the test, demonstrating that it passes now 2025-08-05 22:47:28 +02:00
Christian Schwarz
44466cebdb WIP better observability for return values (SystemTime Debug is useless) 2025-08-05 22:46:54 +02:00
Christian Schwarz
b865e85de3 previous commit broke the tests because of the cfg business, see this commit's TODO 2025-08-05 22:46:24 +02:00
Christian Schwarz
73336962a8 finalize 3-stepped feature-gating (legacy,all,leases) + more tests + observability + fixes 2025-08-05 19:24:06 +02:00
Christian Schwarz
fc7267a760 feature-gate compute side code 2025-08-05 19:22:58 +02:00
Erik Grinaker
7cd0066212 page_api: add SplitError for GetPageSplitter (#12709)
Add a `SplitError` for `GetPageSplitter`, with an `Into<tonic::Status>`
implementation. This avoids a bunch of boilerplate to convert
`GetPageSplitter` errors into `tonic::Status`.

Requires #12702.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
2025-07-29 18:26:20 +00:00
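
A hedged sketch of the conversion pattern described above: a dedicated split error type whose `From` impl into `tonic::Status` lets handlers use `?` instead of hand-rolled mapping. The variant names are illustrative, not the actual `page_api` definitions:

```rust
use tonic::Status;

/// Illustrative error type for splitting a GetPage batch across shards.
#[derive(Debug, thiserror::Error)]
enum SplitError {
    #[error("request contains no pages")]
    EmptyRequest,
    #[error("page {0} does not belong to any known shard")]
    UnknownShard(u64),
}

impl From<SplitError> for Status {
    fn from(err: SplitError) -> Status {
        let msg = err.to_string();
        match err {
            SplitError::EmptyRequest => Status::invalid_argument(msg),
            SplitError::UnknownShard(_) => Status::not_found(msg),
        }
    }
}

/// With the From impl in place, handlers returning Result<_, Status> can use
/// `.into()` or `?` without per-call-site boilerplate.
fn split_batch(pages: &[u64]) -> Result<Vec<Vec<u64>>, Status> {
    if pages.is_empty() {
        return Err(SplitError::EmptyRequest.into());
    }
    // Placeholder: a real splitter would group pages by owning shard.
    Ok(vec![pages.to_vec()])
}

fn main() {
    match split_batch(&[]) {
        Err(status) => println!("gRPC status: {status}"),
        Ok(batches) => println!("{} sub-batches", batches.len()),
    }
}
```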
Suhas Thalanki
bf3a1529bf Report metrics on data/index corruption (#12729)
## Problem

We don't have visibility into data/index corruption.

## Summary of changes
Add data/index corruptions metrics.

PG calls elog at ERROR level with an errcode to emit these corruption errors.

PG Changes: https://github.com/neondatabase/postgres/pull/698
2025-07-29 18:08:24 +00:00
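
A hedged sketch of what such a corruption metric could look like on the Rust side, using the `prometheus` crate with an illustrative metric name; the actual metric names and the plumbing from Postgres's elog errcodes are defined in the PR, not here:

```rust
use prometheus::{IntCounterVec, Opts, Registry};

fn main() -> Result<(), prometheus::Error> {
    let registry = Registry::new();

    // Illustrative counter, labelled by the kind of corruption reported.
    let corruption_events = IntCounterVec::new(
        Opts::new(
            "compute_corruption_events_total",
            "Number of data/index corruption errors reported by Postgres",
        ),
        &["kind"],
    )?;
    registry.register(Box::new(corruption_events.clone()))?;

    // Whenever a corruption errcode is observed (e.g. from an ERROR emitted by
    // Postgres), bump the matching label.
    corruption_events.with_label_values(&["data"]).inc();
    corruption_events.with_label_values(&["index"]).inc();

    Ok(())
}
```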
Erik Grinaker
65d1be6e90 pageserver: route gRPC requests to child shards (#12702)
## Problem

During shard splits, each parent shard is split and removed
incrementally. Only when all parent shards have split is the split
committed and the compute notified. This can take several minutes for
large tenants. In the meantime, the compute will be sending requests to
the (now-removed) parent shards.

This was (mostly) not a problem for the libpq protocol, because it does
shard routing on the server-side. The compute just sends requests to
some Pageserver, and the server will figure out which local shard should
serve it.

It is a problem for the gRPC protocol, where the client explicitly says
which shard it's talking to.

Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
Requires #12772.

## Summary of changes

* Add server-side routing of gRPC requests to any local child shards if
the parent does not exist.
* Add server-side splitting of GetPage batch requests straddling
multiple child shards.
* Move the `GetPageSplitter` into `pageserver_page_api`.

I really don't like this approach, but it avoids making changes to the
split protocol. I could be convinced we should change the split protocol
instead, e.g. to keep the parent shard alive until the split commits and
the compute has been notified, but we can also do that as a later change
without blocking the communicator on it.
2025-07-29 16:28:57 +00:00
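
A hedged sketch of the routing idea, with toy types and a toy key-to-shard mapping (the real pageserver hashes keys and uses the tenant's stripe size): if the requested parent shard is gone, find the local child shard that owns the key under the post-split layout:

```rust
use std::collections::HashMap;

/// Illustrative shard identity: index within a shard count. Real shard
/// identities also carry the stripe size and tenant id.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ShardIndex {
    number: u8,
    count: u8,
}

/// Toy key-to-shard mapping; the real mapping hashes the key and applies the
/// tenant's stripe size. The point is only that it depends on the shard count.
fn owning_shard(key: u64, count: u8) -> u8 {
    (key % count as u64) as u8
}

/// Route a request addressed to `requested` (possibly a now-removed parent
/// shard) to a shard that actually exists locally.
fn route<'a>(
    local_shards: &'a HashMap<ShardIndex, String>, // value = handle, here just a name
    requested: ShardIndex,
    key: u64,
) -> Option<&'a String> {
    // Fast path: the requested shard still exists locally.
    if let Some(handle) = local_shards.get(&requested) {
        return Some(handle);
    }
    // Parent is gone: look for a local child shard (higher shard count) that
    // owns this key under the post-split layout.
    local_shards
        .iter()
        .filter(|(idx, _)| idx.count > requested.count)
        .find(|(idx, _)| owning_shard(key, idx.count) == idx.number)
        .map(|(_, handle)| handle)
}

fn main() {
    let mut shards = HashMap::new();
    // After a split from count=1 to count=2, only the children exist locally.
    shards.insert(ShardIndex { number: 0, count: 2 }, "child-0".to_string());
    shards.insert(ShardIndex { number: 1, count: 2 }, "child-1".to_string());

    let parent = ShardIndex { number: 0, count: 1 };
    println!("{:?}", route(&shards, parent, 7)); // routed to whichever child owns key 7
}
```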
Suhas Thalanki
16eb8dda3d some compute ctl changes from hadron (#12760)
Some compute ctl changes from hadron
2025-07-29 16:01:56 +00:00
Heikki Linnakangas
bb32f1b3d0 Move 'criterion' to a dev-dependency (#12762)
It is only used in micro-benchmarks.
2025-07-29 15:35:00 +00:00
a-masterov
5585c32cee Disable autovacuum while running pg_repack test (#12755)
## Problem
Sometimes, the regression test of `pg_repack` fails due to an extra line
in the output.
The most probable cause of this is autovacuum.  
https://databricks.atlassian.net/browse/LKB-2637
## Summary of changes
Autovacuum is disabled during the test.

Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>
2025-07-29 15:34:02 +00:00
Krzysztof Szafrański
0ffdc98e20 [proxy] Classify "database not found" errors as user errors (#12603)
## Problem

If a user provides a wrong database name in the connection string, it
should be logged as a user error, not a postgres error.

I found 4 different places where we log such errors:
1. `proxy/src/stream.rs:193`, e.g.:
```
{"timestamp":"2025-07-15T11:33:35.660026Z","level":"INFO","message":"forwarding error to user","fields":{"kind":"postgres","msg":"database \"[redacted]\" does not exist"},"spans":{"connect_request#9":{"protocol":"tcp","session_id":"ce1f2c90-dfb5-44f7-b9e9-8b8535e8b9b8","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":22,"task_id":"370407867","target":"proxy::stream","src":"proxy/src/stream.rs:193","extract":{"ep":"[redacted]","session_id":"ce1f2c90-dfb5-44f7-b9e9-8b8535e8b9b8"}}
```
2. `proxy/src/pglb/mod.rs:137`, e.g.:
```
{"timestamp":"2025-07-15T11:37:44.340497Z","level":"WARN","message":"per-client task finished with an error: Couldn't connect to compute node: db error: FATAL: database \"[redacted]\" does not exist","spans":{"connect_request#8":{"protocol":"tcp","session_id":"763baaac-d039-4f4d-9446-c149e32660eb","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":14,"task_id":"866658139","target":"proxy::pglb","src":"proxy/src/pglb/mod.rs:137","extract":{"ep":"[redacted]","session_id":"763baaac-d039-4f4d-9446-c149e32660eb"}}
```
3. `proxy/src/serverless/mod.rs:451`, e.g. (note that the error is
repeated 4 times — retries?):
```
{"timestamp":"2025-07-15T11:37:54.515891Z","level":"WARN","message":"error in websocket connection: Couldn't connect to compute node: db error: FATAL: database \"[redacted]\" does not exist: Couldn't connect to compute node: db error: FATAL: database \"[redacted]\" does not exist: db error: FATAL: database \"[redacted]\" does not exist: FATAL: database \"[redacted]\" does not exist","spans":{"http_conn#8":{"conn_id":"ec7780db-a145-4f0e-90df-0ba35f41b828"},"connect_request#9":{"protocol":"ws","session_id":"1eaaeeec-b671-4153-b1f4-247839e4b1c7","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":10,"task_id":"366331699","target":"proxy::serverless","src":"proxy/src/serverless/mod.rs:451","extract":{"conn_id":"ec7780db-a145-4f0e-90df-0ba35f41b828","ep":"[redacted]","session_id":"1eaaeeec-b671-4153-b1f4-247839e4b1c7"}}
```
4. `proxy/src/serverless/sql_over_http.rs:219`, e.g.
```
{"timestamp":"2025-07-15T10:32:34.866603Z","level":"INFO","message":"forwarding error to user","fields":{"kind":"postgres","error":"could not connect to postgres in compute","msg":"database \"[redacted]\" does not exist"},"spans":{"http_conn#19":{"conn_id":"7da08203-5dab-45e8-809f-503c9019ec6b"},"connect_request#5":{"protocol":"http","session_id":"68387f1c-cbc8-45b3-a7db-8bb1c55ca809","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":17,"task_id":"16432250","target":"proxy::serverless::sql_over_http","src":"proxy/src/serverless/sql_over_http.rs:219","extract":{"conn_id":"7da08203-5dab-45e8-809f-503c9019ec6b","ep":"[redacted]","session_id":"68387f1c-cbc8-45b3-a7db-8bb1c55ca809"}}
```

This PR directly addresses 1 and 4. I _think_ it _should_ also help with
2 and 3, although in those places we don't seem to log `kind`, so I'm
not quite sure. I'm also confused why in 3 the error is repeated
multiple times.

## Summary of changes

Resolves https://github.com/neondatabase/neon/issues/9440
2025-07-29 15:25:22 +00:00
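
A hedged sketch of the classification: inspect the SQLSTATE of the error coming back from compute and tag `3D000` (invalid_catalog_name, i.e. the database does not exist) as a user error rather than a postgres error. The enum and function are illustrative, not proxy's actual types:

```rust
/// Illustrative error classes used when logging errors forwarded to users.
#[derive(Debug, PartialEq)]
enum ErrorKind {
    User,
    Postgres,
}

/// Classify a compute connection error by its SQLSTATE code.
/// 3D000 = invalid_catalog_name ("database ... does not exist"): the user
/// supplied a wrong database name, so it is their error, not postgres's.
fn classify(sqlstate: &str) -> ErrorKind {
    match sqlstate {
        "3D000" => ErrorKind::User,
        _ => ErrorKind::Postgres,
    }
}

fn main() {
    assert_eq!(classify("3D000"), ErrorKind::User);
    assert_eq!(classify("57P01"), ErrorKind::Postgres); // admin_shutdown, for example
    println!("classification ok");
}
```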
HaoyuHuang
62d844e657 Add changes in spec apply (#12759)
## Problem
All changes are no-op. 

## Summary of changes
2025-07-29 15:22:04 +00:00
Alex Chi Z.
1bb434ab74 fix(test): test_readonly_node_gc compute needs time to acquire lease (#12747)
## Problem

Part of LKB-2368. Compute fails to obtain LSN lease in this test case.
There are many assumptions around how the compute obtains leases. In
this particular test case, the LSN lease length is only 8s, which is
shorter than the time it can take for the pageserver to restart (after a
force stop) and for the compute to reconnect, so it sometimes causes issues.

## Summary of changes

Add more sleeps around the test case to ensure it's stable at least. We
need to find a more reliable way to test this in the future.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-29 14:23:42 +00:00
Alex Chi Z.
dbde37c53a fix(safekeeper): retry if open segment fail (#12757)
## Problem

Fix LKB-2632.

The safekeeper wal read path does not seem to retry at all. This would
cause client read errors on the customer side.

## Summary of changes

- Retry on `safekeeper::wal_backup::read_object`.
- Note that this only retries on S3 HTTP connection errors. Subsequent
reads could fail, and that needs more refactors to make the retry
mechanism work across the path.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-29 14:20:43 +00:00
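
A hedged sketch of a retry wrapper of the sort described; the real change retries inside `safekeeper::wal_backup::read_object`, and the names, attempt count, and backoff values here are illustrative:

```rust
use std::time::Duration;

/// Illustrative error type: only connection-level failures are worth retrying;
/// anything else is returned to the caller immediately.
#[derive(Debug)]
enum ReadError {
    Connection(String),
    Other(String),
}

/// Retry a fallible async read a few times, backing off between attempts, but
/// only when the failure looks like a transient connection error.
async fn read_with_retries<T, F, Fut>(mut attempt: F) -> Result<T, ReadError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, ReadError>>,
{
    const MAX_ATTEMPTS: u32 = 3;
    let mut backoff = Duration::from_millis(100);
    for i in 1..=MAX_ATTEMPTS {
        match attempt().await {
            Ok(v) => return Ok(v),
            Err(ReadError::Connection(msg)) if i < MAX_ATTEMPTS => {
                eprintln!("transient read error (attempt {i}): {msg}, retrying");
                tokio::time::sleep(backoff).await;
                backoff *= 2;
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("loop always returns");
}

#[tokio::main]
async fn main() {
    let mut calls = 0;
    let result = read_with_retries(|| {
        calls += 1;
        let fail = calls < 2;
        async move {
            if fail {
                Err(ReadError::Connection("connection reset".into()))
            } else {
                Ok(vec![0u8; 8])
            }
        }
    })
    .await;
    println!("read {:?} after {calls} attempts", result.map(|b| b.len()));
    let _ = ReadError::Other(String::new()); // variant kept for illustration
}
```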
Heikki Linnakangas
5e3cb2ab07 Refactor LFC stats functions (#12696)
Split the functions into two parts: an internal function in file_cache.c
which returns an array of structs representing the result set, and
another function in neon.c with the glue code to expose it as a SQL
function. This is in preparation for the new communicator, which needs
to implement the same SQL functions, but getting the information from a
different place.

In the glue code, use the more modern Postgres way of building a result
set using a tuplestore.
2025-07-29 13:12:44 +00:00
Erik Grinaker
61f267d8f9 pageserver: only retry WaitForActiveTimeout during shard resolution (#12772)
## Problem

In https://github.com/neondatabase/neon/pull/12467, timeouts and retries
were added to `Cache::get` tenant shard resolution to paper over an
issue with read unavailability during shard splits. However, this
retries _all_ errors, including irrecoverable errors like `NotFound`.

This causes problems with gRPC child shard routing in #12702, which
targets specific shards with `ShardSelector::Known` and relies on prompt
`NotFound` errors to reroute requests to child shards. These retries
introduce a 1s delay for all reads during child routing.

The broader problem of read unavailability during shard splits is left
as future work, see https://databricks.atlassian.net/browse/LKB-672.

Touches #12702.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).

## Summary of changes

* Change `TenantManager` to always return a concrete
`GetActiveTimelineError`.
* Only retry `WaitForActiveTimeout` errors.
* Lots of code unindentation due to the simplified error handling.

Out of caution, we do not gate the retries on `ShardSelector`, since
this can trigger other races. Improvements here are left as future work.
2025-07-29 12:33:02 +00:00
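
A hedged sketch of the narrowed retry policy: match on the concrete error and only loop on the timeout variant, so `NotFound` (and everything else) surfaces immediately and child-shard routing can react promptly. The type and function names are illustrative:

```rust
use std::time::Duration;

/// Illustrative stand-in for GetActiveTimelineError.
#[derive(Debug)]
enum GetActiveTimelineError {
    NotFound,
    WaitForActiveTimeout,
}

/// Resolve a tenant shard, retrying only when the shard exists but has not yet
/// become active. Irrecoverable errors (e.g. NotFound) are returned at once.
fn resolve_with_retries<T>(
    mut resolve: impl FnMut() -> Result<T, GetActiveTimelineError>,
) -> Result<T, GetActiveTimelineError> {
    const MAX_ATTEMPTS: u32 = 5;
    for attempt in 1..=MAX_ATTEMPTS {
        match resolve() {
            Ok(v) => return Ok(v),
            Err(GetActiveTimelineError::WaitForActiveTimeout) if attempt < MAX_ATTEMPTS => {
                // Only this variant is transient; give the shard a moment and retry.
                std::thread::sleep(Duration::from_millis(100));
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("loop always returns")
}

fn main() {
    let mut calls = 0;
    let res = resolve_with_retries(|| {
        calls += 1;
        if calls < 3 {
            Err(GetActiveTimelineError::WaitForActiveTimeout)
        } else {
            Ok("timeline handle")
        }
    });
    println!("resolved {res:?} after {calls} attempts");

    // NotFound is not retried, so callers can reroute to child shards promptly.
    let res: Result<&str, _> = resolve_with_retries(|| Err(GetActiveTimelineError::NotFound));
    println!("{res:?}");
}
```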
JC Grünhage
e2411818ef Add SBOMs and provenance attestations to container images (#12768)
## Problem
Given a container image, it is difficult to figure out its dependencies,
and this doesn't work automatically.

## Summary of changes
- Build all rust binaries with `cargo auditable`, to allow sbom scanners
to find their dependencies.
- Adjust `attests` for `docker/build-push-action`, so that buildkit
creates sbom and provenance attestations.
- Drop `--locked` for `rustfilt`, because `rustfilt` can't build
with locked dependencies[^5]

## Further details
Building with `cargo auditable`[^1] embeds a dependency list into Linux,
Windows, MacOS and WebAssembly artifacts. A bunch of tools support
discovering dependencies from this, among them `syft`[^2], which is used
by the BuildKit Syft scanner[^3] plugin. This BuildKit plugin is the
default[^4] used in docker for generating sbom attestations, but we're
making that default explicit by referencing the container image.
[^1]: https://github.com/rust-secure-code/cargo-auditable
[^2]: https://github.com/anchore/syft
[^3]: https://github.com/docker/buildkit-syft-scanner
[^4]:
https://docs.docker.com/build/metadata/attestations/sbom/#sbom-generator
[^5]: https://github.com/luser/rustfilt/issues/23
2025-07-29 12:12:14 +00:00
Dmitrii Kovalkov
58327cbba8 storcon: wait for the migration from the drained node in the draining loop (#12754)
## Problem
We have seen some errors in staging when the shard migration was
triggered by optimizations, and it was ongoing during draining the node
it was migrating from. It happens because the node draining loop only
waits for the migrations started by the drain loop itself. The ongoing
migrations are ignored.

Closes: https://databricks.atlassian.net/browse/LKB-1625

## Summary of changes
- Wait for the shard reconciliation during the drain if it is being
migrated from the drained node.
2025-07-29 11:58:31 +00:00
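
A hedged sketch of the behavioral change: instead of awaiting only the reconciliations the drain loop spawned itself, consider every shard still migrating off the node, whoever started the migration. The types below are illustrative, not storcon's actual ones:

```rust
use std::collections::HashMap;

/// Illustrative reconciliation state per tenant shard.
#[derive(Clone, Copy, PartialEq)]
enum ReconcileState {
    Idle,
    MigratingFrom(u64), // node id the shard is moving away from
}

/// During a drain of `node_id`, wait for every shard that is migrating off the
/// node, regardless of whether the drain loop or an optimization started it.
fn shards_to_wait_for(
    shards: &HashMap<String, ReconcileState>,
    node_id: u64,
) -> Vec<&String> {
    shards
        .iter()
        .filter(|(_, state)| **state == ReconcileState::MigratingFrom(node_id))
        .map(|(shard, _)| shard)
        .collect()
}

fn main() {
    let mut shards = HashMap::new();
    shards.insert("tenant-a/0002".to_string(), ReconcileState::MigratingFrom(7));
    shards.insert("tenant-b/0001".to_string(), ReconcileState::Idle);

    // The drain loop for node 7 must also wait for tenant-a/0002, even though
    // that migration was started by an optimization, not by the drain itself.
    println!("{:?}", shards_to_wait_for(&shards, 7));
}
```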
Heikki Linnakangas
568927a8a0 Remove unnecessary dependency to 'log' crate (#12763)
We use 'tracing' everywhere.
2025-07-29 11:08:22 +00:00