rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 14:02:55 +00:00

Author	SHA1	Message	Date
Christian Schwarz	d888835555	Merge commit 'ca8852165' into problame/standby-horizon-leases	2025-08-06 18:01:04 +02:00
Christian Schwarz	668e82d713	Merge commit '07c3cfd2a' into problame/standby-horizon-leases	2025-08-06 18:00:59 +02:00
Christian Schwarz	e0d2d293b1	Merge commit '7cd006621' into problame/standby-horizon-leases	2025-08-06 18:00:38 +02:00
Christian Schwarz	8799e87ae3	Merge commit '62d844e65' into problame/standby-horizon-leases	2025-08-06 18:00:09 +02:00
Christian Schwarz	737f5825bb	Merge commit 'b623fbae0' into problame/standby-horizon-leases	2025-08-06 17:59:58 +02:00
Christian Schwarz	b95106cd79	Merge commit '5c57e8a11' into problame/standby-horizon-leases	2025-08-06 17:59:21 +02:00
Christian Schwarz	be1c1df6aa	Merge commit '84a2556c9' into problame/standby-horizon-leases	2025-08-06 17:58:54 +02:00
Christian Schwarz	7d28fb118b	Merge commit 'f85935446' into problame/standby-horizon-leases	2025-08-06 17:58:36 +02:00
Christian Schwarz	daf2b5a806	Merge commit 'b00a0096b' into problame/standby-horizon-leases	2025-08-06 17:56:37 +02:00
Christian Schwarz	e52d0ef311	Merge commit '5b0972151' into problame/standby-horizon-leases	2025-08-06 17:56:07 +02:00
Christian Schwarz	d22e23f66d	Merge commit '108f7ec54' into problame/standby-horizon-leases	2025-08-06 17:55:56 +02:00
Christian Schwarz	54480167dc	Merge commit '9c0efba91' into problame/standby-horizon-leases	2025-08-06 17:55:48 +02:00
Christian Schwarz	30e7c4b75d	Merge commit '187170be4' into problame/standby-horizon-leases	2025-08-06 17:55:39 +02:00
Christian Schwarz	d380111428	Merge commit '87915df2f' into problame/standby-horizon-leases	2025-08-06 17:55:06 +02:00
Christian Schwarz	78a8ac7be9	ruff format	2025-08-06 17:54:36 +02:00
Christian Schwarz	279865c68a	Merge commit 'dd7fff655' into problame/standby-horizon-leases	2025-08-06 17:54:17 +02:00
Christian Schwarz	1ace4bcf23	Merge commit '809633903' into problame/standby-horizon-leases	2025-08-06 17:50:43 +02:00
Christian Schwarz	35c916c062	Merge commit '5c934efb2' into problame/standby-horizon-leases	2025-08-06 17:50:33 +02:00
Christian Schwarz	02e1aeef66	Merge commit 'a456e818a' into problame/standby-horizon-leases	2025-08-06 17:49:56 +02:00
Christian Schwarz	e2c88c1929	Merge commit '296c9190b' into problame/standby-horizon-leases	2025-08-06 17:49:50 +02:00
Christian Schwarz	553a120075	Merge commit '15f633922' into problame/standby-horizon-leases	2025-08-06 17:49:41 +02:00
Christian Schwarz	cfe345d3e6	Merge commit 'c34d36d8a' into problame/standby-horizon-leases	2025-08-06 17:47:29 +02:00
Christian Schwarz	e2facbde4e	Merge commit 'cec0543b5' into problame/standby-horizon-leases	2025-08-06 17:47:10 +02:00
Christian Schwarz	b8c8168378	Merge commit 'be5bbaeca' into problame/standby-horizon-leases	2025-08-06 17:46:44 +02:00
Christian Schwarz	28a2cd05d5	Merge commit '5ec82105c' into problame/standby-horizon-leases	2025-08-06 17:46:37 +02:00
Christian Schwarz	1635390a96	fix all clippy complaints in this branch	2025-08-06 17:39:17 +02:00
Christian Schwarz	1877b70a35	Merge commit 'e7d18bc18' into problame/standby-horizon-leases	2025-08-06 17:19:37 +02:00
Christian Schwarz	fb7a027211	Merge commit '4ee0da0a2' into problame/standby-horizon-leases	2025-08-06 17:17:45 +02:00
Christian Schwarz	47146fe1d6	Merge commit '7049003cf' into problame/standby-horizon-leases	2025-08-06 17:17:11 +02:00
Christian Schwarz	577eee16f9	https://github.com/neondatabase/neon/pull/12676#discussion_r2220512343 ; concern about backward compat of TimelineInfo	2025-08-05 23:07:26 +02:00
Christian Schwarz	2ee0f4271c	fix(page_service): lsn lease API puts tenant_shard_id in tenant_id tracing field The LSN lease api actually accepts a tenant_shard_id, not a tenant_id. But we put the Display of the tenant_shard_id into the tenant_id field. This PR fixes it. Refs - fixes https://databricks.atlassian.net/browse/LKB-2930	2025-08-05 22:48:27 +02:00
Christian Schwarz	8a9f1dd5e7	use tokio::time::Instant internally, chrono::DateTime<Utc> externally; communicate expiration through rfc3339 format; chrono::DateTime has good Debug fmt so this also serves observability; finish implementing release valve mechanism	2025-08-05 22:47:53 +02:00
Christian Schwarz	9f01840c18	use standby_horizon leases feature in the test, demonstrating that it passes now	2025-08-05 22:47:28 +02:00
Christian Schwarz	44466cebdb	WIP better observability for return values (SystemTime Debug is useless)	2025-08-05 22:46:54 +02:00
Christian Schwarz	b865e85de3	previous commit broke the tests because of the cfg business, see this commit's TODO	2025-08-05 22:46:24 +02:00
Christian Schwarz	73336962a8	finalize 3-stepped feature-gating (legacy,all,leases) + more tests + observability + fixes	2025-08-05 19:24:06 +02:00
Christian Schwarz	fc7267a760	feature-gate compute side code	2025-08-05 19:22:58 +02:00
HaoyuHuang	ca88521653	Set neon_superuser privilege under lakebase mode (#12775 )	2025-07-29 21:30:34 +00:00
Suhas Thalanki	07c3cfd2a0	[BRC-2905] Feed back PS-detected data corruption signals to SK and PG… (#12748 ) … walproposer (#895) Data corruptions are typically detected on the pageserver side when it replays WAL records. However, since PS doesn't synchronously replay WAL records as they are being ingested through safekeepers, we need some extra plumbing to feed information about pageserver-detected corruptions during compaction (and/or WAL redo in general) back to SK and PG for proper action. We don't yet know what actions PG/SK should take upon receiving the signal, but we should have the detection and feedback in place. Add an extra `corruption_detected` field to the `PageserverFeedback` message that is sent from PS -> SK -> PG. It's a boolean value that is set to true when PS detects a "critical error" that signals data corruption, and it's sent in all `PageserverFeedback` messages. Upon receiving this signal, the safekeeper raises a `safekeeper_ps_corruption_detected` gauge metric (value set to 1). The safekeeper then forwards this signal to PG where a `ps_corruption_detected` gauge metric (value also set to 1) is raised in the `neon_perf_counters` view. Added an integration test in `test_compaction.py::test_ps_corruption_detection_feedback` that confirms that the safekeeper and PG can receive the data corruption signal in the `PageserverFeedback` message in a simulated data corruption. ## Problem ## Summary of changes --------- Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-29 20:40:07 +00:00
Erik Grinaker	7cd0066212	page_api: add `SplitError` for `GetPageSplitter` (#12709 ) Add a `SplitError` for `GetPageSplitter`, with an `Into<tonic::Status>` implementation. This avoids a bunch of boilerplate to convert `GetPageSplitter` errors into `tonic::Status`. Requires #12702. Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).	2025-07-29 18:26:20 +00:00
Suhas Thalanki	bf3a1529bf	Report metrics on data/index corruption (#12729 ) ## Problem We don't have visibility into data/index corruption. ## Summary of changes Add data/index corruptions metrics. PG calls elog ERROR errcode to emit these corruption errors. PG Changes: https://github.com/neondatabase/postgres/pull/698	2025-07-29 18:08:24 +00:00
Erik Grinaker	65d1be6e90	pageserver: route gRPC requests to child shards (#12702 ) ## Problem During shard splits, each parent shard is split and removed incrementally. Only when all parent shards have split is the split committed and the compute notified. This can take several minutes for large tenants. In the meanwhile, the compute will be sending requests to the (now-removed) parent shards. This was (mostly) not a problem for the libpq protocol, because it does shard routing on the server-side. The compute just sends requests to some Pageserver, and the server will figure out which local shard should serve it. It is a problem for the gRPC protocol, where the client explicitly says which shard it's talking to. Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191). Requires #12772. ## Summary of changes * Add server-side routing of gRPC requests to any local child shards if the parent does not exist. * Add server-side splitting of GetPage batch requests straddling multiple child shards. * Move the `GetPageSplitter` into `pageserver_page_api`. I really don't like this approach, but it avoids making changes to the split protocol. I could be convinced we should change the split protocol instead, e.g. to keep the parent shard alive until the split commits and the compute has been notified, but we can also do that as a later change without blocking the communicator on it.	2025-07-29 16:28:57 +00:00
Suhas Thalanki	16eb8dda3d	some compute ctl changes from hadron (#12760 ) Some compute ctl changes from hadron	2025-07-29 16:01:56 +00:00
Heikki Linnakangas	bb32f1b3d0	Move 'criterion' to a dev-dependency (#12762 ) It is only used in micro-benchmarks.	2025-07-29 15:35:00 +00:00
a-masterov	5585c32cee	Disable autovacuum while running pg_repack test (#12755 ) ## Problem Sometimes, the regression test of `pg_repack` fails due to an extra line in the output. The most probable cause of this is autovacuum. https://databricks.atlassian.net/browse/LKB-2637 ## Summary of changes Autovacuum is disabled during the test. Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>	2025-07-29 15:34:02 +00:00
Krzysztof Szafrański	0ffdc98e20	[proxy] Classify "database not found" errors as user errors (#12603 ) ## Problem If a user provides a wrong database name in the connection string, it should be logged as a user error, not postgres error. I found 4 different places where we log such errors: 1. `proxy/src/stream.rs:193`, e.g.: ``` {"timestamp":"2025-07-15T11:33:35.660026Z","level":"INFO","message":"forwarding error to user","fields":{"kind":"postgres","msg":"database \"[redacted]\" does not exist"},"spans":{"connect_request#9":{"protocol":"tcp","session_id":"ce1f2c90-dfb5-44f7-b9e9-8b8535e8b9b8","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":22,"task_id":"370407867","target":"proxy::stream","src":"proxy/src/stream.rs:193","extract":{"ep":"[redacted]","session_id":"ce1f2c90-dfb5-44f7-b9e9-8b8535e8b9b8"}} ``` 2. `proxy/src/pglb/mod.rs:137`, e.g.: ``` {"timestamp":"2025-07-15T11:37:44.340497Z","level":"WARN","message":"per-client task finished with an error: Couldn't connect to compute node: db error: FATAL: database \"[redacted]\" does not exist","spans":{"connect_request#8":{"protocol":"tcp","session_id":"763baaac-d039-4f4d-9446-c149e32660eb","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":14,"task_id":"866658139","target":"proxy::pglb","src":"proxy/src/pglb/mod.rs:137","extract":{"ep":"[redacted]","session_id":"763baaac-d039-4f4d-9446-c149e32660eb"}} ``` 3. `proxy/src/serverless/mod.rs:451`, e.g. 
(note that the error is repeated 4 times — retries?): ``` {"timestamp":"2025-07-15T11:37:54.515891Z","level":"WARN","message":"error in websocket connection: Couldn't connect to compute node: db error: FATAL: database \"[redacted]\" does not exist: Couldn't connect to compute node: db error: FATAL: database \"[redacted]\" does not exist: db error: FATAL: database \"[redacted]\" does not exist: FATAL: database \"[redacted]\" does not exist","spans":{"http_conn#8":{"conn_id":"ec7780db-a145-4f0e-90df-0ba35f41b828"},"connect_request#9":{"protocol":"ws","session_id":"1eaaeeec-b671-4153-b1f4-247839e4b1c7","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":10,"task_id":"366331699","target":"proxy::serverless","src":"proxy/src/serverless/mod.rs:451","extract":{"conn_id":"ec7780db-a145-4f0e-90df-0ba35f41b828","ep":"[redacted]","session_id":"1eaaeeec-b671-4153-b1f4-247839e4b1c7"}} ``` 4. `proxy/src/serverless/sql_over_http.rs:219`, e.g. ``` {"timestamp":"2025-07-15T10:32:34.866603Z","level":"INFO","message":"forwarding error to user","fields":{"kind":"postgres","error":"could not connect to postgres in compute","msg":"database \"[redacted]\" does not exist"},"spans":{"http_conn#19":{"conn_id":"7da08203-5dab-45e8-809f-503c9019ec6b"},"connect_request#5":{"protocol":"http","session_id":"68387f1c-cbc8-45b3-a7db-8bb1c55ca809","conn_info":"[redacted]","ep":"[redacted]","role":"[redacted]"}},"thread_id":17,"task_id":"16432250","target":"proxy::serverless::sql_over_http","src":"proxy/src/serverless/sql_over_http.rs:219","extract":{"conn_id":"7da08203-5dab-45e8-809f-503c9019ec6b","ep":"[redacted]","session_id":"68387f1c-cbc8-45b3-a7db-8bb1c55ca809"}} ``` This PR directly addresses 1 and 4. I _think_ it _should_ also help with 2 and 3, although in those places we don't seem to log `kind`, so I'm not quite sure. I'm also confused why in 3 the error is repeated multiple times. 
## Summary of changes Resolves https://github.com/neondatabase/neon/issues/9440	2025-07-29 15:25:22 +00:00
HaoyuHuang	62d844e657	Add changes in spec apply (#12759 ) ## Problem All changes are no-op. ## Summary of changes	2025-07-29 15:22:04 +00:00
Alex Chi Z.	1bb434ab74	fix(test): test_readonly_node_gc compute needs time to acquire lease (#12747 ) ## Problem Part of LKB-2368. Compute fails to obtain LSN lease in this test case. There are many assumptions around how compute obtains the leases, and in this particular test case, as the LSN lease length is only 8s (which is shorter than the amount of time where pageserver can restart and compute can reconnect in terms of force stop), it sometimes causes issues. ## Summary of changes Add more sleeps around the test case to ensure it's stable at least. We need to find a more reliable way to test this in the future. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-29 14:23:42 +00:00
Alex Chi Z.	dbde37c53a	fix(safekeeper): retry if open segment fail (#12757 ) ## Problem Fix LKB-2632. The safekeeper wal read path does not seem to retry at all. This would cause client read errors on the customer side. ## Summary of changes - Retry on `safekeeper::wal_backup::read_object`. - Note that this only retries on S3 HTTP connection errors. Subsequent reads could fail, and that needs more refactors to make the retry mechanism work across the path. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-29 14:20:43 +00:00
Heikki Linnakangas	5e3cb2ab07	Refactor LFC stats functions (#12696 ) Split the functions into two parts: an internal function in file_cache.c which returns an array of structs representing the result set, and another function in neon.c with the glue code to expose it as a SQL function. This is in preparation for the new communicator, which needs to implement the same SQL functions, but getting the information from a different place. In the glue code, use the more modern Postgres way of building a result set using a tuplestore.	2025-07-29 13:12:44 +00:00

8492 Commits