rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-21 15:10:44 +00:00

Author	SHA1	Message	Date
a-masterov	edc874e1b3	Use the same test image version as the computer one (#11448 ) ## Problem Changes in compute can cause errors in tests if another version of `neon-test-extensions` image is used. ## Summary of changes Use the same version of `neon-test-extensions` image as `compute` one for docker-compose based extension tests.	2025-04-04 10:13:00 +00:00
Dmitrii Kovalkov	181af302b5	storcon + safekeeper + scrubber: propagate root CA certs everywhere (#11418 ) ## Problem There are some places in the code where we create `reqwest::Client` without providing SSL CA certs from `ssl_ca_file`. These will break after we enable TLS everywhere. - Part of https://github.com/neondatabase/cloud/issues/22686 ## Summary of changes - Support `ssl_ca_file` in storage scrubber. - Add `use_https_safekeeper_api` option to safekeeper to use https for peer requests. - Propagate SSL CA certs to storage_controller/client, storcon's ComputeHook, PeerClient and maybe_forward.	2025-04-04 06:30:48 +00:00
Tristan Partin	497116b76d	Download extension if it does not exist on the filesystem (#11315 ) Previously we attempted to download all extensions in CREATE EXTENSION statements. Extensions like pg_stat_statements and neon are not remote extensions, but still we were requesting them when skip_pg_catalog_updates was set to false. Fixes: https://github.com/neondatabase/neon/issues/11127 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-04 01:06:22 +00:00
Arpad Müller	a917952b30	Add test_storcon_create_delete_sk_down and make it work (#11400 ) Adds a test `test_storcon_create_delete_sk_down` which tests the reconciler and pending op persistence if faced with a temporary safekeeper downtime during timeline creation or deletion. This is in contrast to `test_explicit_timeline_creation_storcon`, which tests the happy path. We also do some fixes: * timeline and tenant deletion http requests didn't expect a body, but `()` sent one. * we got the tenant deletion http request's return type wrong: it's supposed to be a hash map * we add some logging to improve observability * We fix `list_pending_ops` which had broken code meant to make it possible to restrict oneself to a single pageserver. But diesel doesn't support that sadly, or at least I couldn't figure out a way to make it work. We don't need that functionality, so remove it. * We add an info span to the heartbeater futures with the node id, so that there is no context-free msgs like "Backoff: waiting 1.1 seconds before processing with the task" in the storcon logs. we could also add the full base url of the node but don't do it as most other log lines contain that information already, and if we do duplication it should at least not be verbose. One can always find out the base url from the node id. Successor of #11261 Part of #9011	2025-04-04 00:17:40 +00:00
Tristan Partin	e581b670f4	Improve nightly physical replication benchmark (#11389 ) Log the created project and endpoint IDs and improve typing in the source code to improve readability. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-03 23:00:58 +00:00
Alexander Bayandin	8ed79ed773	build(deps): bump h2 to 4.2.0 (#11437 ) ## Problem We switched `h2` from 4.1.0 to a git commit to fix stubgen (in https://github.com/neondatabase/neon/pull/10491). `h2` 4.2.0 was released soon after that, so we can switch back to a pinned version. Expected no changes, as 4.2.0 is the right next commit after the commit we currently use: `dacd614fed`%5E ## Summary of changes - Bump `h2` to 4.2.0	2025-04-03 21:42:34 +00:00
Alex Chi Z.	381f42519e	fix(pageserver): skip gc-compaction over sparse keyspaces (#11404 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11318 It's not 100% safe for now to run gc-compaction over the sparse keyspace. It might cause deleted file to re-appear if a specific sequence of operations are done as in the issue, which in reality doesn't happen due to how we split delta/image layers based on the key range. A long-term fix would be either having a separate gc-compaction code path for metadata keys (as how we have a different code path for metadata image layer generation), or let the compaction process aware of the information of "there's an image layer that doesn't contain a key" so that we can skip the keys. ## Summary of changes * gc-compaction auto trigger only triggers compaction over the normal data range. * do not hold gc_block_guard across the full compaction job, only hold it during each subcompaction. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-03 19:40:44 +00:00
Arpad Müller	375df517a0	storcon: return 503 instead of 500 if there is no new leader yet (#11417 ) The leadership transfer protocol between storage controller instances is as follows, listing the steps for the new pod: The new pod does these things: 1. new pod comes online. looks in database if there is a leader. if there is, it asks that leader to step down. 2. the new pod does some operations to come online. they should be fairly short timed, but it's not zero. 3. the new pod updates the leader entry in the database. The old pod, once it gets the step down request, changes its internal state to stepped down. It treats all incoming requests specially now: instead of processing, it wants to forward them to the new pod. The forwarding however only works if the new pod is online already, so before forwarding it reads from the db for a leader (also to get the address to forward to in the first place). If the new pod is not online yet, i.e. during step 2 above, the old pod might legitimately land in the branch which this patch is editing: the leader in the database is a stepped down instance. Before, we've returned a `ApiError::InternalServerError`, but that would print the full backtrace plus an error log. With this patch, we cut down on the noise, as it's an expected situation to have a short storcon downtime while we are cutting over to the new instance. A `ResourceUnavailable` error is not just more fitting, it also doesn't print a backtrace once encountered, and only prints on the INFO log level (see `api_error_handler` function). Fixes #11320 cc #8954	2025-04-03 18:43:16 +00:00
Vlad Lazar	9db63fea7a	pageserver: optionally export perf traces in OTEL format (#11140 ) Based on https://github.com/neondatabase/neon/pull/11139 ## Problem We want to export performance traces from the pageserver in OTEL format. End goal is to see them in Grafana. ## Summary of changes https://github.com/neondatabase/neon/pull/11139 introduces the infrastructure required to run the otel collector alongside the pageserver. ### Design Requirements: 1. We'd like to avoid implementing our own performance tracing stack if possible and use the `tracing` crate if possible. 2. Ideally, we'd like zero overhead of a sampling rate of zero and be a be able to change the tracing config for a tenant on the fly. 3. We should leave the current span hierarchy intact. This includes adding perf traces without modifying existing tracing. To satisfy (3) (and (2) in part) a separate span hierarchy is used. `RequestContext` gains an optional `perf_span` member that's only set when the request was chosen by sampling. All perf span related methods added to `RequestContext` are no-ops for requests that are not sampled. This on its own is not enough for (3), so performance spans use a separate tracing subscriber. The `tracing` crate doesn't have great support for this, so there's a fair amount of boilerplate to override the subscriber at all points of the perf span lifecycle. ### Perf Impact [Periodic pagebench](https://neonprod.grafana.net/d/ddqtbfykfqfi8d/e904990?orgId=1&from=2025-02-08T14:15:59.362Z&to=2025-03-10T14:15:59.362Z&timezone=utc) shows no statistically significant regression with a sample ratio of 0. There's an annotation on the dashboard on 2025-03-06. ### Overview of changes: 1. Clean up the `RequestContext` API a bit. Namely, get rid of the `RequestContext::extend` API and use the builder instead. 2. Add pageserver level configs for tracing: sampling ratio, otel endpoint, etc. 3. Introduce some perf span tracking utilities and expose them via `RequestContext`. We add a `tracing::Span` wrapper to be used for perf spans and a `tracing::Instrumented` equivalent for it. See doc comments for reason. 4. Set up OTEL tracing infra according to configuration. A separate runtime is used for the collector. 5. Add perf traces to the read path. ## Refs - epic https://github.com/neondatabase/neon/issues/9873 --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-04-03 17:56:51 +00:00
Alex Chi Z.	bfc767d60d	fix(test): wait for shard split complete for test_lsn_lease_storcon (#11436 ) ## Problem close https://github.com/neondatabase/neon/issues/11397 ref https://github.com/neondatabase/cloud/issues/23667 ## Summary of changes We need to wait until the shard split is complete, otherwise it will print warning like waiting for shard split exclusive lock for 30s. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-03 17:49:45 +00:00
Alex Chi Z.	109c54a300	fix(pageserver): avoid gc-compaction triggering circuit breaker (#11403 ) ## Problem There are some cases where traditional gc might collect some layer files causing gc-compaction cannot read the full history of the key. This needs to be resolved in the long-term by improving the compaction process. For now, let's simply avoid such errors triggering the circuit breaker. ## Summary of changes * Move the place where we trigger the circuit breaker. We only trigger it during compactions other than L0 compactions. We added the trigger a year ago due to file cleanup concerns in image layer compaction. * For gc-compaction, only return errors to the upper compaction_iteration if it's a shutdown error. Otherwise, just log it and skip the compaction for a key range. Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-04-03 17:18:37 +00:00
Vlad Lazar	74920d8cd8	storcon: notify compute if correct observed state was refreshed (#11342 ) ## Problem Previously, if the observed state was refreshed and matching the intent, we wouldn't send a compute notification. This is unsafe. There's no guarantee that the location landed on the pageserver _and_ a compute notification for it was delivered. See https://github.com/neondatabase/neon/issues/11291#issuecomment-2743205411 for one such example. ## Summary of changes Add a reproducer and notify the compute if the correct observed state required a refresh. Closes https://github.com/neondatabase/neon/issues/11291	2025-04-03 16:35:55 +00:00
Alex Chi Z.	131b32ef48	fix(pageserver): clean up aux files before detaching (#11299 ) ## Problem Related to https://github.com/neondatabase/cloud/issues/26091 and https://github.com/neondatabase/cloud/issues/25840 Close https://github.com/neondatabase/neon/issues/11297 Discussion on Slack: https://neondb.slack.com/archives/C033RQ5SPDH/p1742320666313969 ## Summary of changes * When detaching, scan all aux files within `sparse_non_inherited_keyspace` in the ancestor timeline and create an image layer exactly at the ancestor LSN. All scanned keys will map to an empty value, which is a delete tombstone. - Note that end_lsn for rewritten delta layers = ancestor_lsn + 1, so the image layer will have image_end_lsn=end_lsn. With the current `select_layer` logic, the read path will always first read the image layer. * Add a test case. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2025-04-03 15:55:22 +00:00
Suhas Thalanki	581bb5d7d5	removed pg_anon setup from compute dockerfile (#10960 ) ## Problem Removing the `anon` v1 extension in postgres as described in https://github.com/neondatabase/cloud/issues/22663. This extension is not built for postgres v17 and is out of date when compared to the upstream variant which is v2 (we have v1.4). ## Summary of changes Removed the `anon` v1 extension from the compute docker image Related to https://github.com/neondatabase/cloud/issues/22663	2025-04-03 15:26:35 +00:00
JC Grünhage	3c78133477	feat(ci): add 'released' tag to container images from release runs (#11425 ) ## Problem We had a problem with https://github.com/neondatabase/neon/pull/11413 having e2e tests failing, because an e2e test (`8d271bed47`) depended on an unreleased pageserver fix (`0ee5bfa2fc`). This came up because neon release CI runs against the most recent releases of the other components, but cloud e2e tests run against latest, which is tagged from main. ## Summary of changes Add an additional `released` tag for released versions. ## Alternative to consider We could (and maybe should) instead switch to `latest` being used for released versions and `main` being used where we use `latest` right now. That'd also mean we don't have to adjust the CI in the cloud repo.	2025-04-03 14:57:44 +00:00
Suhas Thalanki	46e046e779	Exporting `file_cache_used` to calculate LFC utilization (#11384 ) ## Problem Exporting `file_cache_used` which specifies the number of used chunks in the LFC. This helps calculate LFC utilization as: `file_cache_used_pages / (file_cache_used * file_cache_chunk_size_pages)` ## Summary of changes Exporting `file_cache_used`. Related Issue: https://github.com/neondatabase/cloud/issues/26688	2025-04-03 14:54:45 +00:00
Arpad Müller	d8cee52637	Update rust to 1.86.0 (#11431 ) We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Announcement blog post](https://blog.rust-lang.org/2025/04/03/Rust-1.86.0.html). Prior update was in #10914.	2025-04-03 14:53:28 +00:00
Dmitrii Kovalkov	2e11d129d0	tests: suppress mgm api timeout error in sotrcon (#11428 ) ## Problem Since `0f367cb665` the timeout in `with_client_retries` is implemented via `tokio::timeout` instead of `reqwest::ClientBuilder::timeout` (because we reuse the client). It changed the error representation if the timeout is exceeded. Such errors were suppressed in `allowed_errors.py`, but old regexps do not match the new error. Discussion: https://neondb.slack.com/archives/C033RQ5SPDH/p1743533184736319 ## Summary of changes - Add new `Timeout` error to `allowed_errors.py`	2025-04-03 14:18:50 +00:00
Luís Tavares	43a7423f72	feat: bump pg_session_jwt extension to 0.3.0 (#11399 ) ## Problem Bumps https://github.com/neondatabase/pg_session_jwt to the latest release [v0.3.0](https://github.com/neondatabase/pg_session_jwt/releases/tag/v0.3.0) that introduces PostgREST fallback mechanisms. ## Summary of changes Updates the extension download tar and the extension version in the proxy constant. ## Subscribers @mrl5	2025-04-03 13:01:18 +00:00
Arpad Müller	374736a4de	Print remote_addr span for Failed to serve HTTP connection error (#11423 ) I've encountered this error in #11422. Ideally we'd have the URL as well to associate it with a tenant, but at this level we only have the remote addr I guess. Better than nothing.	2025-04-03 11:58:12 +00:00
Alexander Bayandin	5e507776bc	Compute: update plv8 patch (#11426 ) ## Problem https://github.com/neondatabase/cloud/issues/26866 ## Summary of changes - Update plv8 patch Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2025-04-03 11:29:53 +00:00
Alexander Lakhin	4e8e0951be	Increase timeout for test_pageserver_gc_compaction_smoke (#11410 ) ## Problem The test_pageserver_gc_compaction_smoke fails rather often due to a timeout on slow machines. See https://github.com/neondatabase/neon/issues/11355. ## Summary of changes Increase the timeout for the test.	2025-04-03 11:23:30 +00:00
JC Grünhage	64a8d0c2e6	impr(ci): retry container image pushing and send slack messages for failures (#11416 ) ## Problem We've seen quite a few CI failures related to pushes to docker hub failing with weird error messages that indicate maybe docker hub is just not reliable. ## Summary of changes Retry container image pushing up to 10 times, and send a slack message if we had to retry, regardless of the job succeeding or not.	2025-04-03 08:59:42 +00:00
Tristan Partin	7602e6ffc0	Skip compute_ctl authorization checks in testing builds (#11186 ) We will require authorization in production. We need to skip in testing builds for now because regression tests would fail. See https://github.com/neondatabase/neon/issues/11316 for more information. Signed-off-by: Tristan Partin <tristan@neon.tech> Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-04-03 00:00:28 +00:00
Erik Grinaker	17193d6a33	test_runner: fix pagebench tenant configs (#11420 ) ## Problem Pagebench creates a bunch of tenants by first creating a template tenant and copying its remote storage, then attaching the copies to the Pageserver. These tenants had custom configurations to disable GC and compaction. However, these configs were only picked up by the Pageserver on attach, and not registered with the storage controller. This caused the storage controller to replace the tenant configs with the default tenant config, re-enabling GC and compaction which interferes with benchmark performance. Resolves #11381. ## Summary of changes Register the copied tenants with the storage controller, instead of directly attaching them to the Pageserver.	2025-04-02 20:11:39 +00:00
Erik Grinaker	03ae57236f	docs: add compaction notes (#11415 ) Lifted from https://www.notion.so/neondatabase/Rough-Notes-on-Compaction-1baf189e004780859e65ef63b85cfa81?pvs=4.	2025-04-02 19:55:08 +00:00
Arpad Müller	e3d27b2f68	Start safekeeper node IDs with 0 and forbid 0 from registering (#11419 ) Right now we start safekeeper node ids at 0. However, other code treats 0 as invalid (see #11407). We decided on latter. Therefore, make the register python tests register safekeepers starting at node id 1 instead of 0, and forbid safekeepers with id 0 from registering. Context: https://github.com/neondatabase/neon/pull/11407#discussion_r2024852328	2025-04-02 18:36:50 +00:00
Alex Chi Z.	dd1299f337	feat(storcon): passthrough mark invisible and add tests (#11401 ) ## Problem close https://github.com/neondatabase/neon/issues/11279 ## Summary of changes * Allow passthrough of other methods in tenant timeline shard0 passthrough of storcon. * Passthrough mark invisible API in storcon. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-02 17:11:49 +00:00
John Spray	cb19e4e05d	pageserver: remove legacy TimelineInfo::latest_gc_cutoff field (2/2) (#11136 ) ## Problem This field was retained for backward compat only in #10707. Once https://github.com/neondatabase/cloud/pull/25233 is released, nothing will be reading this field. Related: https://github.com/neondatabase/cloud/issues/24250 ## Summary of changes - Remove TimelineInfo::latest_gc_cutoff_lsn	2025-04-02 15:21:58 +00:00
Erik Grinaker	9df230c837	storcon: improve autosplit defaults (#11332 ) ## Problem In #11122, we changed the autosplit behavior to allow repeated and initial splits. The defaults were set such that they retain the current production settings (8 shards at 64 GB). However, these defaults don't really make sense by themselves. Once we deploy new settings to production, we should change the defaults to something more reasonable. ## Summary of changes Changes the following default settings: * `max_split_shards`: 8 → 16 * `initial_split_threshold`: 64 GB → disabled * `initial_split_shards`: 8 → 2	2025-04-02 15:11:52 +00:00
JC Grünhage	3c2bc5baba	fix(ci): run checks on release PRs (#11375 ) ## Problem Hotfix releases mean that sometimes changes in release PRs haven't been tested and linted yet. Disabling tests and lints is therefore not necessarily safe. In the future we will check whether tests have run on the same git tree already to speed things up, but for now we need to turn tests back on fully. This partially reverts: https://github.com/neondatabase/neon/pull/11272 ## Summary of changes Run checks on `.*-rc-pr` runs.	2025-04-02 14:32:53 +00:00
Alexey Kondratov	6667810800	chore(compute_ctl): Minor code and comment fixes (#11411 ) ## Problem In #11376 I mistakenly reworded one comment and also forgot to commit one of the suggestions. ## Summary of changes Fix it here.	2025-04-02 14:20:52 +00:00
Erik Grinaker	47f5bcf2bc	pageserver: don't periodically flush layers for stale attachments (#11317 ) ## Problem Tenants in attachment state `Stale` can't upload layers, and don't run compaction, but still do periodic L0 layer flushes in the tenant housekeeping loop. If the tenant remains stuck in stale mode, this causes a large buildup of L0 layers, causing logging, metrics increases, and possibly alerts. Resolves #11245. ## Summary of changes Don't perform periodic layer flushes in stale attachment state.	2025-04-02 12:55:15 +00:00
a-masterov	c179d098ef	Increase the timeout for extensions upgrade tests (on schedule). (#11406 ) ## Problem Sometimes the forced extension upgrade test fails (on schedule) due to a timeout. ## Summary of changes The timeout is increased to 60 mins.	2025-04-02 11:18:37 +00:00
Peter Bendel	4bc6dbdd5f	use a prod-like shared_buffers size for some perf unit tests (#11373 ) ## Problem In Neon DBaaS we adjust the shared_buffers to the size of the compute, or better described we adjust the max number of connections to the compute size and we adjust the shared_buffers size to the number of max connections according to about the following sizes `2 CU: 225mb; 4 CU: 450mb; 8 CU: 900mb` [see](`877e33b428/goapp/controlplane/internal/pkg/compute/computespec/pg_settings.go (L405)`) ## Summary of changes We should run perf unit tests with settings that is realistic for a paying customer and select 8 CU as the reference for those tests.	2025-04-02 10:43:05 +00:00
John Spray	7dc8370848	storcon: do graceful migrations from chaos injector (#11028 ) ## Problem Followup to https://github.com/neondatabase/neon/pull/10913 Existing chaos injection just does simple cutovers to secondary locations. Let's also exercise code for doing graceful migrations. This should implicitly test how such migrations cope with overlapping with service restarts. ## Summary of changes	2025-04-02 10:08:05 +00:00
Alex Chi Z.	c4fc602115	feat(pageserver): support synthetic size calculation for invisible branches (#11335 ) ## Problem ref https://github.com/neondatabase/neon/issues/11279 Imagine we have a branch with 3 snapshots A, B, and C: ``` base---+---+---+---main \-A \-B \-C base=100G, base-A=1G, A-B=1G, B-C=1G, C-main=1G ``` at this point, the synthetic size should be 100+1+1+1+1=104G. after the deletion, the structure looks like: ``` base---+---+---+ \-A \-B \-C ``` If we simply assume main never exists, the size will be calculated as size(A) + size(B) + size(C)=300GB, which obviously is not what the user would expect. The correct way to do this is to assume part of main still exists, that is to say, set C-main=1G: ``` base---+---+---+main \-A \-B \-C ``` And we will get the correct synthetic size of 100G+1+1+1=103G. ## Summary of changes * Do not generate gc cutoff point for invisible branches. * Use the same LSN as the last branchpoint for branch end. * Remove test_api_handler for mark_invisible. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-01 18:50:58 +00:00
Konstantin Knizhnik	02936b82c5	Fix effective_lsn calculation for prefetch (#11219 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1741594233757489 Consider the following scenario: 1. Backend A wants to prefetch some block B 2. Backend A checks that block B is not present in shared buffer 3. Backend A registers new prefetch request and calls prefetch_do_request 4. prefetch_do_request calls neon_get_request_lsns 5. neon_get_request_lsns obtains LwLSN for block B 6. Backend B downloads B, updates and wallogs it (let say to Lsn1) 7. Block B is once again thrown from shared buffers, its LwLSN is set to Lsn1 8. Backend A obtains current flush LSN, let's say that it is Lsn1 9. Backend A stores Lsn1 as effective_lsn in prefetch slot. 10. Backend A reads page B with LwLSN=Lsn1 11. Backend A finds in prefetch ring response for prefetch request for block B with effective_lsn=Lsn1, so that it satisfies neon_prefetch_response_usable condition 12. Backend A uses deteriorated version of the page! ## Summary of changes Use `not_modified_since` as `effective_lsn`. It should not cause some degrade of performance because we store LwLSN when it was not found in LwLSN hash, so if page is not changed till prefetch response is arrived, then LwLSN should not be changed. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-04-01 15:48:02 +00:00
Arpad Müller	1fad1abb24	storcon: timeline deletion improvements and fixes (#11334 ) This PR contains a bunch of smaller followups and fixes of the original PR #11058. most of these implement suggestions from Arseny: * remove `Queryable, Selectable` from `TimelinePersistence`: they are not needed. * no `Arc` around `CancellationToken`: it itself is an arc wrapper * only schedule deletes instead of scheduling excludes and deletes * persist and delete deletion ops * delete rows in timelines table upon tenant and timeline deletion * set `deleted_at` for timelines we are deleting before we start any reconciles: this flag will help us later to recognize half-executed deletions, or when we crashed before we could remove the timeline row but after we removed the last pending op (handling these situations are left for later). Part of #9011	2025-04-01 15:16:33 +00:00
Arpad Müller	016068b966	API spec: add safekeepers field returned by storcon (#11385 ) Add an optional `safekeepers` field to `TimelineInfo` which is returned by the storcon upon timeline creation if the `--timelines-onto-safekeepers` flag is enabled. It contains the list of safekeepers chosen. Other contexts where we return `TimelineInfo` do not contain the `safekeepers` field, sadly I couldn't make this more type safe like done in Rust via `TimelineCreateResponseStorcon`, as there is no way of flattening or inheritance (and I don't that duplicating the entire type for some minor type safety improvements is worth it). The storcon side has been done in #11058. Part of https://github.com/neondatabase/cloud/issues/16176 cc https://github.com/neondatabase/cloud/issues/16796	2025-04-01 12:39:10 +00:00
Erik Grinaker	80596feeaa	pageserver: invert `CompactFlags::NoYield` as `YieldForL0` (#11382 ) ## Problem `CompactFlags::NoYield` was a bit inconvenient, since every caller except for the background compaction loop should generally set it (e.g. HTTP API calls, tests, etc). It was also inconsistent with `CompactionOutcome::YieldForL0`. ## Summary of changes Invert `CompactFlags::NoYield` as `CompactFlags::YieldForL0`. There should be no behavioral changes.	2025-04-01 11:43:58 +00:00
Erik Grinaker	225cabd84d	pageserver: update upload queue TODOs (#11377 ) Update some upload queue TODOs, particularly to track https://github.com/neondatabase/neon/issues/10283, which I won't get around to.	2025-04-01 11:38:12 +00:00
Alexey Kondratov	557127550c	feat(compute): Add compute_ctl_up metric (#11376 ) ## Problem For computes running inside NeonVM, the actual compute image tag is buried inside the NeonVM spec, and we cannot get it as part of standard k8s container metrics (it's always an image and a tag of the NeonVM runner container). The workaround we currently use is to extract the running computes info from the control plane database with SQL. It has several drawbacks: i) it's complicated, separate DB per region; ii) it's slow; iii) it's still an indirect source of info, i.e. k8s state could be different from what the control plane expects. ## Summary of changes Add a new `compute_ctl_up` gauge metric with `build_tag` and `status` labels. It will help us to both overview what are the tags/versions of all running computes; and to break them down by current status (`empty`, `running`, `failed`, etc.) Later, we could introduce low cardinality (no endpoint or compute ids) streaming aggregates for such metrics, so they will be blazingly fast and usable for monitoring the fleet-wide state.	2025-04-01 08:51:17 +00:00
Konstantin Knizhnik	cfe3e6d4e1	Remove loop from pageserver_try_receive (#11387 ) ## Problem Commit `3da70abfa5` cause noticeable performance regression (40% in update-with-prefetch in test_bulk_update): https://neondb.slack.com/archives/C04BLQ4LW7K/p1742633167580879 ## Summary of changes Remove loop from pageserver_try_receive to make it fetch not more than one response. There is still loop in `pump_prefetch_state` which can fetch as many responses as available. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-31 19:49:32 +00:00
Alex Chi Z.	47d47000df	fix(pageserver): passthrough lsn lease in storcon API (#11386 ) ## Problem part of https://github.com/neondatabase/cloud/issues/23667 ## Summary of changes lsn_lease API can only be used on pageservers. This patch enables storcon passthrough. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-31 19:16:42 +00:00
Matthias van de Meent	e5b95bc9dc	Neon LFC/prefetch: Improve page read handling (#11380 ) Previously we had different meanings for the bitmask of vector IOps. That has now been unified to "bit set = final result, no more scribbling". Furthermore, the LFC read path scribbled on pages that were already read; that's probably not a good thing so that's been fixed too. In passing, the read path of LFC has been updated to read only the requested pages into the provided buffers, thus reducing the IO size of vectorized IOs. ## Problem ## Summary of changes	2025-03-31 17:04:00 +00:00
Alex Chi Z.	0ee5bfa2fc	fix(pageserver): allow sibling archived branch for detaching (#11383 ) ## Problem close https://github.com/neondatabase/neon/issues/11379 ## Summary of changes Remove checks around archived branches for detach v2. I also updated the comments `ancestor_retain_lsn`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-31 16:32:55 +00:00
Fedor Dikarev	00bcafe82e	chore(ci): upgrade stats action with docker images from ghcr.io (#11378 ) ## Problem Current version of GitHub Workflow Stats action pull docker images from DockerHub, that could be an issue with the new pull limits on DockerHub side. ## Summary of changes Switch to version `v0.2.2`, with docker images hosted on `ghcr.io`	2025-03-31 14:21:07 +00:00
Conrad Ludgate	ed117af73e	chore(proxy/tokio-postgres): remove phf from sqlstate and switch to tracing (#11249 ) In sqlstate, we have a manual `phf` construction, which is not explicitly guaranteed to be stable - you're intended to use a build.rs or the macro to make sure it's constructed correctly each time. This was inherited from tokio-postgres upstream, which has the same issue (https://github.com/rust-phf/rust-phf/pull/321#issuecomment-2724521193). We don't need this encoding of sqlstate, so I've switched it to simply parse 5 bytes (https://www.postgresql.org/docs/current/errcodes-appendix.html). While here, I switched out log for tracing.	2025-03-31 12:35:51 +00:00
Konstantin Knizhnik	21a891a06d	Fix IS_LOCAL_REL macro (first class has oid=FirstNormalObjectId) (#11369 ) ## Problem Macro IS_LOCAL_REL used for DEBUG_COMPARE_LOCAL mode use greater-than rather than greater-or-equal comparison while first table really is assigned FirstNormalObjectId. ## Summary of changes Replace strict greater with greater-or-equal comparison. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-03-31 11:16:35 +00:00

1 2 3 4 5 ...

7612 Commits