rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 05:22:56 +00:00

Author	SHA1	Message	Date
Trung Dinh	225267b3ae	Make disk eviction run by default (#12464 ) ## Problem ## Summary of changes Provide a sane set of default values for disk_usage_based_eviction. Closes https://github.com/neondatabase/neon/issues/12301.	2025-07-04 12:06:10 +00:00
Vlad Lazar	d378726e38	pageserver: reset the broker subscription if it's been idle for a while (#12436) ## Problem I suspect that the pageservers get stuck on receiving broker updates. ## Summary of changes This is an opportunistic (staging only) patch that resets the subscription stream if it's been idle for a while. This won't go to prod in this form. I'll revert or update it before Friday.	2025-07-04 10:25:03 +00:00
Erik Grinaker	88d1127bf4	Tweak GetPageSplitter	2025-07-03 21:12:26 +02:00
Alex Chi Z.	305fe61ac1	fix(pageserver): also print open layer size in backpressure (#12440 ) ## Problem Better investigate memory usage during backpressure ## Summary of changes Print open layer size if backpressure is activated Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-03 16:37:11 +00:00
Vlad Lazar	f95fdf5b44	pageserver: fix duplicate tombstones in ancestor detach (#12460 ) ## Problem Ancestor detach from a previously detached parent when there were no writes panics since it tries to upload the tombstone layer twice. ## Summary of Changes If we're gonna copy the tombstone from the ancestor, don't bother creating it. Fixes https://github.com/neondatabase/neon/issues/12458	2025-07-03 16:35:46 +00:00
Erik Grinaker	42e4e5a418	Add GetPage request splitting	2025-07-03 18:31:12 +02:00
HaoyuHuang	4db934407a	SK changes #1 (#12448) ## TLDR This PR is a no-op. The changes are disabled by default. ## Problem I. Currently we don't have a way to detect disk I/O failures from WAL operations. II. We observe that the offloader fails to upload a segment due to race conditions on XLOG SWITCH and PG start streaming WALs. The wal_backup task continuously fails to upload a full segment while the segment remains partial on the disk. The consequence is that commit_lsn for all SKs moves forward but backup_lsn stays the same. Then, all SKs run out of disk space. III. We have discovered SK bugs where the WAL offload owner cannot keep up with WAL backup/upload to S3, which results in an unbounded accumulation of WAL segment files on the Safekeeper's disk until the disk becomes full. This is a somewhat dangerous situation that is hard to recover from because the Safekeeper cannot write its control files when it is out of disk space. There are actually 2 problems here: 1. A single problematic timeline can take over the entire disk for the SK 2. Once out of disk, it's difficult to recover SK IV. Neon reports certain storage errors as "critical" errors using a macro, which will increment a counter/metric that can be used to raise alerts. However, this metric isn't sliced by tenant and/or timeline today. We need the tenant/timeline dimension to better respond to incidents and for blast radius analysis. ## Summary of changes I. The PR adds a `safekeeper_wal_disk_io_errors` metric which is incremented when SK fails to create or flush WALs. II. To mitigate this issue, we will re-elect a new offloader if the current offloader is lagging behind too much. Each SK makes the decision locally but they are aware of each other's commit and backup lsns. The new algorithm is - determine_offloader will pick a SK, say SK-1. - Each SK checks -- if commit_lsn - backup_lsn > threshold, -- -- remove SK-1 from the candidates and call determine_offloader again. SK-1 will step down and all SKs will elect the same leader again. After the backup is caught up, the leader will become SK-1 again. This also helps when SK-1 is slow to backup. I'll set the reelect backup lag to 4 GB later. Setting to 128 MB in dev to trigger the code more frequently. III. This change addresses problem no. 1 by having the Safekeeper perform a timeline disk utilization check when processing WAL proposal messages from Postgres/compute. The Safekeeper now rejects the WAL proposal message, effectively stopping writes of more WAL for the timeline to disk, if the existing WAL files for the timeline on the SK disk exceed a certain size (the default threshold is 100GB). The disk utilization is calculated based on a `last_removed_segno` variable tracked by the background task removing WAL files, which produces an accurate and conservative estimate (>= the actual disk usage). IV. * Add a new metric `hadron_critical_storage_event_count` that has the `tenant_shard_id` and `timeline_id` as dimensions. * Modified the `critical!` macro to include tenant_id and timeline_id as additional arguments and adapted existing call sites to populate the tenant shard and timeline ID fields. The `critical!` macro invocation now increments the `hadron_critical_storage_event_count` with the extra dimensions. (In SK there isn't the notion of a tenant-shard, so just the tenant ID is recorded in lieu of tenant shard ID.) I considered adding a separate macro to avoid merge conflicts, but I think in this case (detecting critical errors) conflicts are probably more desirable so that we can be aware whenever Neon adds another `critical!` invocation in their code. --------- Co-authored-by: Chen Luo <chen.luo@databricks.com> Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com> Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-03 14:32:53 +00:00
Erik Grinaker	6f8650782f	Client tweaks	2025-07-03 14:54:23 +02:00
Erik Grinaker	14214eb853	Add client shard routing	2025-07-03 14:42:35 +02:00
Erik Grinaker	d4b4724921	Sanity-check Pageserver URLs	2025-07-03 14:18:14 +02:00
Erik Grinaker	9aba9550dd	Instrument client methods	2025-07-03 14:11:53 +02:00
Erik Grinaker	375e8e5592	Improve retries and logging	2025-07-03 14:02:43 +02:00
Erik Grinaker	52c586f678	Restructure shard management	2025-07-03 11:51:19 +02:00
Heikki Linnakangas	7263d6e2e5	Clarify error message if not_modified_lsn > request_lsn I'm seeing this error from some python tests. Which means there's a bug in the compute side of course, but it took me a while to figure that out.	2025-07-02 23:21:42 +03:00
Dmitry Savelev	0429a0db16	Switch the billing metrics storage format to ndjson. (#12427 ) ## Problem The billing team wants to change the billing events pipeline and use a common events format in S3 buckets across different event producers. ## Summary of changes Change the events storage format for billing events from JSON to NDJSON. Also partition files by hours, rather than days. Resolves: https://github.com/neondatabase/cloud/issues/29995	2025-07-02 16:30:47 +00:00
Erik Grinaker	12dade35fa	Comment tweaks	2025-07-02 14:47:27 +02:00
Erik Grinaker	1ec63bd6bc	Misc pool improvements	2025-07-02 14:42:06 +02:00
Arpad Müller	efd7e52812	Don't error if timeline offload is already in progress (#12428 ) Don't print errors like: ``` Compaction failed 1 times, retrying in 2s: Failed to offload timeline: Unexpected offload error: Timeline deletion is already in progress ``` Print it at info log level instead. https://github.com/neondatabase/cloud/issues/30666	2025-07-02 12:06:55 +00:00
Erik Grinaker	bf01145ae4	Remove some old code	2025-07-02 11:46:54 +02:00
Erik Grinaker	6f0af96a54	Add new PageserverClient	2025-07-02 10:59:40 +02:00
Heikki Linnakangas	9913d2668a	print retried pageserver requests to log Not sure how verbose we want this to be in production, but for now, more is better. This shows that many tests are failing with errors like these: PG:2025-07-01 23:02:34.311 GMT [1456523] LOG: [COMMUNICATOR] send_process_get_rel_size_request: got error status: NotFound, message: "Read error", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 01 Jul 2025 23:02:34 GMT"} }, retrying I haven't debugged why that is yet. Did the compute make a bogus request?	2025-07-02 02:04:04 +03:00
Alex Chi Z.	5ec8881c0b	feat(pageserver): resolve feature flag based on remote size (#12400 ) ## Problem Part of #11813 ## Summary of changes * Compute tenant remote size in the housekeeping loop. * Add a new `TenantFeatureResolver` struct to cache the tenant-specific properties. * Evaluate feature flag based on the remote size. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-01 18:11:24 +00:00
Alex Chi Z.	b254dce8a1	feat(pageserver): report compaction progress (#12401) ## Problem close https://github.com/neondatabase/neon/issues/11528 ## Summary of changes Gives us better observability of compaction progress. - Image creation: number of partitions processed / total partitions - Gc-compaction: index of the item in the queue / total items for a full compaction - Shard ancestor compaction: layers to rewrite / total layers Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-01 17:00:27 +00:00
Alex Chi Z.	3815e3b2b5	feat(pageserver): reduce lock contention in l0 compaction (#12360 ) ## Problem L0 compaction currently holds the read lock for a long region while it doesn't need to. ## Summary of changes This patch reduces the one long contention region into 2 short ones: gather the layers to compact at the beginning, and several short read locks when querying the image coverage. Co-Authored-By: Chen Luo --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-01 16:58:41 +00:00
Erik Grinaker	f6761760a2	Documentation and tweaks	2025-07-01 17:54:41 +02:00
Erik Grinaker	0bce818d5e	Add stream pool	2025-07-01 17:54:41 +02:00
Erik Grinaker	48be1da6ef	Add initial client pool	2025-07-01 17:54:41 +02:00
Erik Grinaker	d2efc80e40	Add initial ChannelPool	2025-07-01 17:54:41 +02:00
Erik Grinaker	958c2577f5	pageserver: tighten up `page_api::Client`	2025-07-01 17:54:41 +02:00
Trung Dinh	daa402f35a	pageserver: Make ImageLayerWriter sync, infallible and lazy (#12403 ) ## Problem ## Summary of changes Make ImageLayerWriter sync, infallible and lazy. Address https://github.com/neondatabase/neon/issues/12389. All unit tests passed.	2025-07-01 09:53:11 +00:00
Arpad Müller	2e681e0ef8	detach_ancestor: delete the right layer when hardlink fails (#12397 ) If a hardlink operation inside `detach_ancestor` fails due to the layer already existing, we delete the layer to make sure the source is one we know about, and then retry. But we deleted the wrong file, namely, the one we wanted to use as the source of the hardlink. As a result, the follow up hard link operation failed. Our PR corrects this mistake.	2025-06-30 21:36:15 +00:00
Erik Grinaker	c3cb1ab98d	Merge branch 'main' into communicator-rewrite	2025-06-30 21:07:01 +02:00
Erik Grinaker	81ac4ef43a	Add a generic pool prototype	2025-06-30 14:49:34 +02:00
Erik Grinaker	d0a4ae3e8f	pageserver: add gRPC LSN lease support (#12384 ) ## Problem The gRPC API does not provide LSN leases. ## Summary of changes * Add LSN lease support to the gRPC API. * Use gRPC LSN leases for static computes with `grpc://` connstrings. * Move `PageserverProtocol` into the `compute_api::spec` module and reuse it.	2025-06-30 12:44:17 +00:00
Erik Grinaker	a384d7d501	pageserver: assert no changes to shard identity (#12379 ) ## Problem Location config changes can currently result in changes to the shard identity. Such changes will cause data corruption, as seen with #12217. Resolves #12227. Requires #12377. ## Summary of changes Assert that the shard identity does not change on location config updates and on (re)attach. This is currently asserted with `critical!`, in case it misfires in production. Later, we should reject such requests with an error and turn this into a proper assertion.	2025-06-30 12:36:45 +00:00
Christian Schwarz	66f53d9d34	refactor(pageserver): force explicit mapping to `CreateImageLayersError::Other` (#12382) Implicit mapping to an `anyhow::Error` when we do `?` is discouraged because tooling to find those places isn't great. As a drive-by, also make `SplitImageLayerWriter::new` infallible and sync. I think we should also make `ImageLayerWriter::new` completely lazy, then `BatchLayerWriter::new` infallible and async.	2025-06-30 11:03:48 +00:00
Erik Grinaker	67b04f8ab3	Fix a bunch of linter warnings	2025-06-30 11:10:02 +02:00
Erik Grinaker	1d43f3bee8	pageserver: fix stripe size persistence in legacy HTTP handlers (#12377) ## Problem Similarly to #12217, the following endpoints may result in a stripe size mismatch between the storage controller and Pageserver if an unsharded tenant has a different stripe size set than the default. This can lead to data corruption if the tenant is later manually split without specifying an explicit stripe size, since the storage controller and Pageserver will apply different defaults. This commonly happens with tenants that were created before the default stripe size was changed from 32k to 2k. * `PUT /v1/tenant/config` * `PATCH /v1/tenant/config` These endpoints are no longer in regular production use (they were used when cplane still managed Pageserver directly), but can still be called manually or by tests. ## Summary of changes Retain the current shard parameters when updating the location config in `PUT | PATCH /v1/tenant/config`. Also opportunistically derive `Copy` for `ShardParameters`.	2025-06-30 09:08:44 +00:00
Heikki Linnakangas	924c6a6fdf	Fix handling the case that server closes the stream - avoid panic by checking for Ok(None) response from tonic::Streaming::message() instead of just using unwrap() - There was a race condition, if the caller sent the message, but the receiver task concurrently received Ok(None) indicating the stream was closed. (I didn't see that in action, but I think it could happen by reading the code)	2025-06-29 22:53:39 +03:00
Heikki Linnakangas	7020476bf5	Run `cargo fmt`	2025-06-29 22:53:09 +03:00
Heikki Linnakangas	80e948db93	Remove unused mock factory After reading the code a few times, I didn't quite understand what it was, to be honest, or how it was going to be used. Remove it now to reduce noise, but we can resurrect it from git history if we need it in the future.	2025-06-29 22:52:48 +03:00
Heikki Linnakangas	bfb30d434c	minor code tidy-up	2025-06-29 22:51:34 +03:00
Heikki Linnakangas	f3ba201800	Run `cargo fmt`	2025-06-29 21:21:07 +03:00
Heikki Linnakangas	8b7796cbfa	wip	2025-06-29 21:20:48 +03:00
Heikki Linnakangas	fdc7e9c2a4	Extract repeated code to look up RequestTracker into a helper function	2025-06-29 21:20:14 +03:00
Heikki Linnakangas	a352d290eb	Plumb through both libpq and grpc connection strings to the compute Add a new 'pageserver_connection_info' field in the compute spec. It replaces the old 'pageserver_connstring' field with a more complicated struct that includes both libpq and grpc URLs, for each shard (or only one of the URLs, depending on the configuration). It also includes a flag suggesting which one to use; compute_ctl now uses it to decide which protocol to use for the basebackup. This is compatible with everything that's in production, because the control plane never used the 'pageserver_connstring' field. That was added a long time ago with the idea that it would replace the code that digs the 'neon.pageserver_connstring' GUC from the list of Postgres settings, but we never got around to doing that in the control plane. Hence, it was only used with neon_local. But the plan now is to pass the 'pageserver_connection_info' from the control plane, and once that's fully deployed everywhere, the code to parse 'neon.pageserver_connstring' in compute_ctl can be removed. The 'grpc' flag on an endpoint in endpoint config is now more of a suggestion. Compute_ctl gets both URLs, so it can choose to use libpq or grpc as it wishes. It currently always obeys the 'prefer_grpc' flag that's part of the connection info though. Postgres however uses grpc iff the new rust-based communicator is enabled. TODO/plan for the control plane: - Start to pass `pageserver_connection_info` in the spec file. - Also keep the current `neon.pageserver_connstring` setting for now, for backwards compatibility with old computes After that, the `pageserver_connection_info.prefer_grpc` flag in the spec file can be used to control whether compute_ctl uses grpc or libpq. The actual compute's grpc usage will be controlled by the `neon.enable_new_communicator` GUC. It can be set separately from 'prefer_grpc'. Later: - Once all old computes are gone, remove the code to pass `neon.pageserver_connstring`	2025-06-29 18:16:49 +03:00
Erik Grinaker	e50b914a8e	compute_tools: support gRPC base backups in `compute_ctl` (#12244 ) ## Problem `compute_ctl` should support gRPC base backups. Requires #12111. Requires #12243. Touches #11926. ## Summary of changes Support `grpc://` connstrings for `compute_ctl` base backups.	2025-06-27 16:39:00 +00:00
Christian Schwarz	e33e109403	fix(pageserver): buffered writer cancellation error handling (#12376) ## Problem The problem has been well described in already-committed PR #11853. tl;dr: BufferedWriter is sensitive to cancellation, which the previous approach was not. The write path was most affected (ingest & compaction), which was mostly fixed in #11853: it introduced `PutError` and mapped instances of `PutError` that were due to cancellation of underlying buffered writer into `CreateImageLayersError::Cancelled`. However, there is a long tail of remaining errors that weren't caught by #11853 that result in `CompactionError::Other`s, which we log with great noise. ## Solution The stack trace logging for CompactionError::Other added in #11853 allows us to chop away at that long tail using the following pattern: - look at the stack trace - from leaf up, identify the place where we incorrectly map from the distinguished variant X indicating cancellation to an `anyhow::Error` - follow that anyhow further up, ensuring it stays the same anyhow all the way up in the `CompactionError::Other` - since it stayed one anyhow chain all the way up, root_cause() will yield us X - so, in `log_compaction_error`, add an additional `downcast_ref` check for X This PR specifically adds checks for - the flush task cancelling (FlushTaskError, BlobWriterError) - opening of the layer writer (GateError) That should cover all the reports in issues - https://github.com/neondatabase/cloud/issues/29434 - https://github.com/neondatabase/neon/issues/12162 ## Refs - follow-up to #11853 - fixup of / fixes https://github.com/neondatabase/neon/issues/11762 - fixes https://github.com/neondatabase/neon/issues/12162 - refs https://github.com/neondatabase/cloud/issues/29434	2025-06-27 15:26:00 +00:00
Arpad Müller	4c7956fa56	Fix hang deleting offloaded timelines (#12366 ) We don't have cancellation support for timeline deletions. In other words, timeline deletion might still go on in an older generation while we are attaching it in a newer generation already, because the cancellation simply hasn't reached the deletion code. This has caused us to hit a situation with offloaded timelines in which the timeline was in an unrecoverable state: always returning an accepted response, but never a 404 like it should be. The detailed description can be found in [here](https://github.com/neondatabase/cloud/issues/30406#issuecomment-3008667859) (private repo link). TLDR: 1. we ask to delete timeline on old pageserver/generation, starts process in background 2. the storcon migrates the tenant to a different pageserver. - during attach, the pageserver still finds an index part, so it adds it to `offloaded_timelines` 4. the timeline deletion finishes, removing the index part in S3 5. there is a retry of the timeline deletion endpoint, sent to the new pageserver location. it is bound to fail however: - as the index part is gone, we print `Timeline already deleted in remote storage`. - the problem is that we then return an accepted response code, and not a 404. - this confuses the code calling us. it thinks the timeline is not deleted, so keeps retrying. - this state never gets recovered from until a reset/detach, because of the `offloaded_timelines` entry staying there. This is where this PR fixes things: if no index part can be found, we can safely assume that the timeline is gone in S3 (it's the last thing to be deleted), so we can remove it from `offloaded_timelines` and trigger a reupload of the manifest. Subsequent retries will pick that up. Why not improve the cancellation support? It is a more disruptive code change, that might have its own risks. So we don't do it for now. Fixes https://github.com/neondatabase/cloud/issues/30406	2025-06-27 15:14:55 +00:00
Arpad Müller	37e181af8a	Update rust to 1.88.0 (#12364 ) We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Announcement blog post](https://blog.rust-lang.org/2025/06/26/Rust-1.88.0/) Prior update was in https://github.com/neondatabase/neon/pull/11938	2025-06-27 13:51:59 +00:00
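
The offloader re-election described in commit 4db934407a (`SK changes #1`) can be sketched roughly as below. This is an illustration only, not the Safekeeper code: `SafekeeperId`, `elect_offloader`, the `max_lag` parameter, and the "lowest ID wins" rule inside `determine_offloader` are all assumptions made for the example, as is keeping the current offloader when no other candidate exists. The commit message only states that the lagging offloader is removed from the candidates and `determine_offloader` is called again.

```rust
// Hypothetical sketch of the re-election rule; names and the selection
// rule are assumptions, not taken from the Neon source tree.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SafekeeperId(u64);

/// Stand-in for the real selection logic: deterministically pick the
/// candidate with the lowest ID so that every SK reaches the same answer.
fn determine_offloader(candidates: &[SafekeeperId]) -> Option<SafekeeperId> {
    candidates.iter().copied().min_by_key(|sk| sk.0)
}

/// Pick an offloader, stepping past the first choice if the backup lag
/// (commit_lsn - backup_lsn) exceeds `max_lag`.
fn elect_offloader(
    candidates: &[SafekeeperId],
    commit_lsn: u64,
    backup_lsn: u64,
    max_lag: u64,
) -> Option<SafekeeperId> {
    let first = determine_offloader(candidates)?;
    if commit_lsn.saturating_sub(backup_lsn) <= max_lag {
        // Backup is keeping up: the first choice stays the offloader.
        return Some(first);
    }
    // Backup is lagging: drop the first choice and pick again.
    let remaining: Vec<SafekeeperId> = candidates
        .iter()
        .copied()
        .filter(|sk| *sk != first)
        .collect();
    // Assumption: with no other candidate, keep the current offloader.
    determine_offloader(&remaining).or(Some(first))
}
```

Because every Safekeeper evaluates the same inputs with the same deterministic rule, they converge on the same replacement, and the original choice is selected again once the lag drops back under the threshold.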

3199 Commits