There are a couple of places that call CompactionError::is_cancel but
don't check the Other variant for a cancellation root cause. They should,
because some cancellations are observed by code that surfaces them as
::Other errors.
I don't think there's a _serious_ case where this causes problems.
The worst one is the circuit breaker, which we currently trip on ::Other
errors that are due to cancellation. Tripping the circuit breaker for
timelines that are shutting down doesn't matter much in practice, but it's
unaesthetic and might cause noise down the line, so this PR fixes that at least.
In any case, this PR forces future callers of is_cancel() to explicitly
acknowledge the suboptimal state of error handling in compaction, thereby
hopefully preventing errors of this kind from creeping in.
(The _right_ solution for the compaction code is probably the approach
I took in #11853: keep using anyhow but have a unified way / pattern
of bubbling up cancellation, so that we don't need the downcast trickery.)
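Below is a minimal, self-contained sketch of the check this pushes callers toward; the variant and type names are simplified stand-ins for the actual pageserver definitions, and the `Cancelled` error is hypothetical. It only illustrates how a cancellation can hide behind the Other variant's anyhow chain:

```rust
#[derive(Debug)]
struct Cancelled;

impl std::fmt::Display for Cancelled {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "operation cancelled")
    }
}

impl std::error::Error for Cancelled {}

enum CompactionError {
    ShuttingDown,
    Other(anyhow::Error),
}

impl CompactionError {
    fn is_cancel(&self) -> bool {
        matches!(self, CompactionError::ShuttingDown)
    }

    /// What callers such as the circuit breaker actually want to know:
    /// was this a cancellation, whether surfaced directly or wrapped in
    /// an anyhow chain inside the Other variant?
    fn is_cancel_or_cancelled_root_cause(&self) -> bool {
        match self {
            CompactionError::ShuttingDown => true,
            CompactionError::Other(err) => {
                err.root_cause().downcast_ref::<Cancelled>().is_some()
            }
        }
    }
}

fn main() {
    let wrapped =
        CompactionError::Other(anyhow::Error::new(Cancelled).context("compaction failed"));
    assert!(!wrapped.is_cancel()); // the naive check misses the cancellation
    assert!(wrapped.is_cancel_or_cancelled_root_cause());
}
```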
## Problem
We want to reproduce an OOM situation, which requires large partial reads.
## Summary of Changes
Make the max partial read size configurable for import jobs.
## Problem
We don't currently run tests for PostGIS in our test environment.
## Summary of Changes
- Added PostGIS test support for PostgreSQL v16 and v17
- Configured different PostGIS versions based on PostgreSQL version:
- PostgreSQL v17: PostGIS 3.5.0
- PostgreSQL v14/v15/v16: PostGIS 3.3.3
- Added necessary test scripts and configurations
This ensures our PostgreSQL implementation remains compatible with this
widely-used extension.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
## Problem
Part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
Add a counter on the feature evaluation outcome; we will set up
alerts for too many failed evaluations in the future.
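A rough sketch of the counter being described, assuming a Prometheus-style counter vec; the metric and label names here are illustrative, not the actual ones:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

static FEATURE_FLAG_EVALUATION: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "feature_flag_evaluation_total",
        "Outcome of feature flag evaluations",
        &["flag", "outcome"] // outcome: "ok" or "error"
    )
    .expect("failed to register metric")
});

/// Record one evaluation; an alert can later fire on the rate of the
/// `outcome="error"` series.
fn record_evaluation(flag: &str, ok: bool) {
    let outcome = if ok { "ok" } else { "error" };
    FEATURE_FLAG_EVALUATION
        .with_label_values(&[flag, outcome])
        .inc();
}
```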
Signed-off-by: Alex Chi Z <chi@neon.tech>
JsonResponse::error() properly logs an error message which can be read
in the compute logs. invalid_status() was not going through that helper
function, thus not logging anything.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
This is already the default in production and in our test suite.
## Summary of changes
Set the default proto to interpreted to reduce friction when spinning up
new regions or cells.
## Problem
Currently, `page_api` domain types validate message invariants both when
converting Protobuf → domain and domain → Protobuf. This is annoying for
clients, because they can't use stream combinators to convert streamed
requests (needed for hot path performance), and also performs the
validation twice in the common case.
Blocks #12099.
## Summary of changes
Only validate the Protobuf → domain type conversion, i.e. on the
receiver side, and make domain → Protobuf infallible. This is where it
matters -- the Protobuf types are less strict than the domain types, and
receivers should expect all sorts of junk from senders (they're not
required to validate anyway, and can just construct an invalid message
manually).
Also adds a missing `impl From<CheckRelExistsRequest> for
proto::CheckRelExistsRequest`.
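An illustrative sketch of the asymmetry, using `CheckRelExistsRequest` as an example; the `rel_kind` field and its validation rule are made up, only the direction of fallibility mirrors the change:

```rust
mod proto {
    // Stand-in for a Protobuf-generated type: loose wire representation.
    pub struct CheckRelExistsRequest {
        pub rel_kind: u32,
    }
}

// Stand-in domain type: stricter invariants than the wire type.
pub struct CheckRelExistsRequest {
    pub rel_kind: u8,
}

// Receiver side: Protobuf -> domain validates, because senders may put
// arbitrary junk on the wire.
impl TryFrom<proto::CheckRelExistsRequest> for CheckRelExistsRequest {
    type Error = String;

    fn try_from(p: proto::CheckRelExistsRequest) -> Result<Self, Self::Error> {
        let rel_kind =
            u8::try_from(p.rel_kind).map_err(|_| format!("invalid rel_kind: {}", p.rel_kind))?;
        Ok(Self { rel_kind })
    }
}

// Sender side: domain -> Protobuf is infallible, since the domain value is
// already valid and the wire type is strictly more permissive.
impl From<CheckRelExistsRequest> for proto::CheckRelExistsRequest {
    fn from(d: CheckRelExistsRequest) -> Self {
        Self {
            rel_kind: d.rel_kind.into(),
        }
    }
}
```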
## Problem
Setting `max_batch_size` to anything higher than
`Timeline::MAX_GET_VECTORED_KEYS` will cause a runtime error. We should
instead fail fast at startup if this is the case.
## Summary of changes
* Create `max_get_vectored_keys` as a new configuration option (defaulting to 32);
* Validate `max_batch_size` against `max_get_vectored_keys` right at
config parsing and validation time.
Closes https://github.com/neondatabase/neon/issues/11994
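A hedged sketch of the fail-fast check; the surrounding config struct is simplified, and only the two fields named above come from the PR:

```rust
struct PageServerConf {
    max_get_vectored_keys: usize, // new knob, default 32
    max_batch_size: usize,
}

impl PageServerConf {
    /// Called during config parsing/validation so a bad combination fails at
    /// startup instead of at request time.
    fn validate(&self) -> Result<(), String> {
        if self.max_batch_size > self.max_get_vectored_keys {
            return Err(format!(
                "max_batch_size ({}) must not exceed max_get_vectored_keys ({})",
                self.max_batch_size, self.max_get_vectored_keys
            ));
        }
        Ok(())
    }
}
```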
## Problem
The gRPC `page_api` domain types used smallvecs to avoid heap
allocations in the common case where a single page is requested.
However, this is pointless: the Protobuf types use a normal vec, and
converting a smallvec into a vec always causes a heap allocation anyway.
## Summary of changes
Use a normal `Vec` instead of a `SmallVec` in `page_api` domain types.
## Problem
We print a backtrace in an info level log every 10 seconds while waiting
for the import data to land in the bucket.
## Summary of changes
The backtrace is not useful. Remove it.
## Problem
We should use the full host name for computes, according to
https://github.com/neondatabase/cloud/issues/26005, but currently a
truncated host name is used.
## Summary of changes
The URL for REMOTE_ONNX is rewritten using the FQDN.
## Problem
Fixes https://github.com/neondatabase/neon/issues/12101; this is a quick
hack and we need a better API in the future.
In `get_db_size`, we call `get_reldir_size` for every relation. However,
this deserializes the reldir directory again for every relation, which
creates huge CPU overhead.
## Summary of changes
Get and deserialize the reldir v1 key once and use it across all
get_rel_size requests.
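A simplified sketch of the caching, with stand-in types; the real code fetches and deserializes the reldir v1 key, and the point is only that this now happens once per `get_db_size` call:

```rust
type RelTag = u32;

struct RelDirectory {
    rels: Vec<RelTag>,
}

fn load_and_deserialize_reldir(_db_oid: u32) -> RelDirectory {
    // In the real code this fetches and deserializes the reldir v1 key.
    RelDirectory { rels: vec![1, 2, 3] }
}

fn get_rel_size(_rel: RelTag, _dir: &RelDirectory) -> u64 {
    // Per-relation size lookup no longer re-deserializes the directory.
    8192
}

fn get_db_size(db_oid: u32) -> u64 {
    let dir = load_and_deserialize_reldir(db_oid); // deserialize once
    dir.rels.iter().map(|rel| get_rel_size(*rel, &dir)).sum()
}
```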
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We should expose the page service over gRPC.
Requires #12093.
Touches #11728.
## Summary of changes
This patch adds an initial page service implementation over gRPC. It
ties in with the existing `PageServerHandler` request logic, to avoid
the implementations drifting apart for the core read path.
This is just a bare-bones functional implementation. Several important
aspects have been omitted, and will be addressed in follow-up PRs:
* Limited observability: minimal tracing, no logging, limited metrics
and timing, etc.
* Rate limiting will currently block.
* No performance optimization.
* No cancellation handling.
* No tests.
I've only done rudimentary testing of this, but Pagebench passes at
least.
A smaller version of #12066 that is somewhat easier to review.
Now that I've been using https://crates.io/crates/top-type-sizes, I've
found a lot more low-hanging fruit that can be tweaked to reduce
memory usage.
Some context for the optimisations:
Rust's stack allocation in futures is quite naive. Stack variables, even
if moved, often still end up taking space in the future. Rearranging the
order in which variables are defined, and properly scoping them can go a
long way.
One consequence of `async fn` and `async move {}` is that they always
duplicate their "upvars" (aka captures). All captures are permanently
allocated in the future, even if they are moved, so when writing futures
we can be mindful to capture as little as possible.
`TlsStream` is massive; it needs boxing so it doesn't contribute to the
above issue.
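A toy illustration of both points, assuming a hypothetical oversized stream type; this is not the actual proxy code, it just shows how boxing a large capture and scoping temporaries keep a future's layout small:

```rust
// A large value we'd rather not embed in every future that touches it.
struct BigTlsStream {
    _state: [u8; 4096],
}

async fn handle_client(stream: Box<BigTlsStream>) {
    // Boxing means this future's layout holds a pointer, not the 4 KiB state.
    let _stream = stream;

    {
        // A properly scoped temporary: it is dropped before the await below,
        // which helps keep it out of the future's persistent state.
        let scratch = [0u8; 1024];
        let _ = scratch;
    }

    std::future::ready(()).await;
}

fn main() {
    let fut = handle_client(Box::new(BigTlsStream { _state: [0; 4096] }));
    // With the Box, the stream lives behind a pointer, so the future stays small.
    println!("future size: {} bytes", std::mem::size_of_val(&fut));
}
```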
## Measurements from `top-type-sizes`:
### Before
```
10328 {async block@proxy::proxy::task_main::{closure#0}::{closure#0}} align=8
6120 {async fn body of proxy::proxy::handle_client<proxy::protocol2::ChainRW<tokio::net::TcpStream>>()} align=8
```
### After
```
4040 {async block@proxy::proxy::task_main::{closure#0}::{closure#0}}
4704 {async fn body of proxy::proxy::handle_client<proxy::protocol2::ChainRW<tokio::net::TcpStream>>()} align=8
```
## Problem
We need gRPC support in Pagebench to benchmark the new gRPC Pageserver
implementation.
Touches #11728.
## Summary of changes
Adds a `Client` trait to make the client transport swappable, and a gRPC
client selected via a `--protocol grpc` parameter. The connstring must then
also specify the gRPC port:
```
pagebench get-page-latest-lsn --protocol grpc --page-service-connstring grpc://localhost:51051
```
The client is implemented using the raw Tonic-generated gRPC client, to
minimize client overhead.
## Problem
The page service logic asserts that a tracing span is present with
tenant/timeline/shard IDs. An initial gRPC page service implementation
thus requires a tracing span.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
Adds an `ObservabilityLayer` middleware that generates a tracing span
and decorates it with IDs from the gRPC metadata.
This is a minimal implementation to address the tracing span assertion.
It will be extended with additional observability in later PRs.
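A hedged sketch of the span decoration; the metadata key names are assumptions, and the actual `ObservabilityLayer` is a tower middleware that instruments the request future with a span like this:

```rust
use tonic::metadata::MetadataMap;
use tracing::{info_span, Span};

// Build a tracing span from the IDs carried in the gRPC request metadata.
fn request_span(metadata: &MetadataMap) -> Span {
    let get = |key: &str| -> String {
        metadata
            .get(key)
            .and_then(|v| v.to_str().ok())
            .unwrap_or("unknown")
            .to_string()
    };
    // The middleware enters/instruments with this span, so the downstream
    // handler's span assertion is satisfied.
    info_span!(
        "grpc_page_service_request",
        tenant_id = %get("neon-tenant-id"),
        timeline_id = %get("neon-timeline-id"),
        shard_id = %get("neon-shard-id"),
    )
}
```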
## Problem
If the wal receiver is cancelled, there's a 50% chance that it will
ingest yet more WAL.
## Summary of Changes
Always check cancellation first.
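A sketch of the pattern, assuming the 50/50 behaviour comes from an unbiased `tokio::select!` (which polls ready branches in random order); this is not the actual walreceiver code:

```rust
use tokio_util::sync::CancellationToken;

// `biased;` makes the cancellation branch win whenever it is ready, instead
// of racing it against a ready WAL item.
async fn ingest_loop(cancel: CancellationToken, mut wal: tokio::sync::mpsc::Receiver<Vec<u8>>) {
    loop {
        tokio::select! {
            biased;
            _ = cancel.cancelled() => break,
            maybe_batch = wal.recv() => match maybe_batch {
                Some(batch) => {
                    // ingest the batch
                    let _ = batch;
                }
                None => break, // sender dropped
            },
        }
    }
}
```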
Precursor to https://github.com/neondatabase/cloud/issues/28333.
We want per-endpoint configuration for rate limits, which will be
distributed via the `GetEndpointAccessControl` API. This lays some of
the groundwork.
1. Allow the endpoint rate limiter to accept a custom leaky bucket
config on check (see the sketch at the end of this section).
2. Remove the unused auth rate limiter, as I don't want to think about
how it fits into this.
3. Refactor the caching of `GetEndpointAccessControl`, as it adds
friction for adding new cached data to the API.
That third one was rather large; I couldn't find any way to split it up.
The core idea is that there are now only two cache APIs:
`get_endpoint_access_controls` and `get_role_access_controls`.
I'm pretty sure the behaviour is unchanged, except for a drive-by change
to fix #8989 because it felt harmless. The change in question is that when
password validation fails, we eagerly expire the role cache entry if the
role has been cached for 5 minutes. This allows for edge cases where a user
tries to connect with a reset password but the cache never expires the
entry due to some Redis-related quirk (lag, misconfiguration, or a cplane
error).
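A rough sketch of item 1 above; the type and field names are illustrative, not proxy's actual rate limiter API:

```rust
/// Per-endpoint leaky bucket parameters; field names are illustrative.
#[derive(Clone, Copy)]
struct LeakyBucketConfig {
    rps: f64,
    max_burst: f64,
}

const DEFAULT_BUCKET: LeakyBucketConfig = LeakyBucketConfig {
    rps: 100.0,
    max_burst: 200.0,
};

struct EndpointRateLimiter {
    // per-endpoint bucket state elided
}

impl EndpointRateLimiter {
    /// The check now accepts an optional per-endpoint config (eventually
    /// distributed via GetEndpointAccessControl), falling back to the
    /// global default when none is supplied.
    fn check(&self, endpoint_id: &str, config: Option<LeakyBucketConfig>) -> bool {
        let config = config.unwrap_or(DEFAULT_BUCKET);
        // ... consult/update the bucket for `endpoint_id` using `config` ...
        let _ = (endpoint_id, config.rps, config.max_burst);
        true
    }
}
```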
libs/pqproto is designed for the safekeeper/pageserver with maximum
throughput in mind.
Proxy only needs it for handshakes/authentication, where throughput is
not a concern but memory efficiency is. For this reason, we switch to
using read_exact and only allocate as much memory as we need.
All reads return a `&'a [u8]` instead of a `Bytes` because accidental
sharing of bytes can cause fragmentation. Returning the reference
enforces that callers only hold onto the bytes they absolutely need. For
example, before this change, `pqproto` was allocating 8KiB for the
initial read `BytesMut`, and proxy was holding the `Bytes` in the
`StartupMessageParams` for the entire connection through to passthrough.
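A minimal sketch of the read pattern, assuming startup-packet-style framing; this is not the actual pqproto API, just an illustration of `read_exact` plus a borrowed return:

```rust
use std::io::{self, Read};

// Read one length-prefixed message: a 4-byte big-endian length (which, as in
// the postgres startup packet, includes the length field itself), then exactly
// that many body bytes. The caller gets a borrowed slice, so the bytes cannot
// accidentally be retained past the next read.
fn read_message<'a, R: Read>(reader: &mut R, buf: &'a mut Vec<u8>) -> io::Result<&'a [u8]> {
    let mut len_bytes = [0u8; 4];
    reader.read_exact(&mut len_bytes)?;
    let body_len = (u32::from_be_bytes(len_bytes) as usize).saturating_sub(4);

    buf.resize(body_len, 0); // allocate only what this message needs
    reader.read_exact(buf)?;
    Ok(&buf[..])
}
```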
## Problem
The release pipeline is failing because some tools cannot be installed.
## Summary of changes
Install with `--locked`.
Signed-off-by: Alex Chi Z <chi@neon.tech>
The expected operating range for the production NVMe drives is
50 to 250us.
The bucket boundaries before this PR were not well suited
for reasoning about the utilization / queuing / latency variability
of those devices.
# Performance
There was some concern about the perf impact of having so many buckets,
considering the impl does a linear search on each observe().
I added a benchmark and measured on the relevant machines.
In any case, this PR ends up with 40 buckets, so it won't make a
meaningful difference on production machines (im4gn.2xlarge):
observe() goes from 30ns to 35ns.
## Problem
There are a bunch of TODOs in the import code.
## Summary of changes
1. Bound max import byte range to 128MiB. This might still be too high,
given the default job concurrency, but it needs to be balanced with
going back and forth to S3.
2. Prevent unsigned overflow when determining key range splits for
concurrent jobs.
3. Use sharded ranges to estimate task size when splitting jobs.
4. Bubble up errors that we might hit due to invalid data in the bucket
back to the storage controller.
5. Tweak the import bucket S3 client configuration.
## Problem
I noticed a small percentage of flakes on some import tests.
They were all instances of the storage controller being too eager with
finalization.
As a refresher: the pageserver notifies the storage controller from the
import task when it's done, and the storage controller has to call back
into it in order to finalize the import. The pageserver checks that the
import task is done before serving that request. Hence, we can get this
race.
In practice, this has no impact since the storage controller will simply
retry.
## Summary of changes
Allow-list such cases.
This patch is a fixup for
- https://github.com/neondatabase/neon/pull/6788
Background
----------
That PR 6788 added artificial advancement of `disk_consistent_lsn`
and `remote_consistent_lsn` for shards that weren't written to
while other shards _were_ written to.
See the PR description for more context.
At the time of that PR, Pageserver shards were doing WAL filtering.
Nowadays, the WAL filtering happens in Safekeepers.
Shards learn about the WAL gaps via
`InterpretedWalRecords::next_record_lsn`.
The Bug
-------
That artificial advancement code also runs if the flush failed.
So, we advance disk_consistent_lsn / remote_consistent_lsn
without having added the corresponding L0 to `index_part.json`.
The frozen layer remains in the layer map until detach,
so we continue to serve data correctly.
We're not advancing the flush loop variable `flushed_to_lsn` either,
so subsequent flush requests will retry the flush and repair the
situation if they succeed.
But if there aren't any successful retries, eventually the tenant
will be detached, and when it is attached somewhere else, the
`index_part.json` and therefore the layer map...
1. ... do not contain the frozen layer that failed to flush, and
2. ... the WAL won't be re-ingested either, because walreceiver
starts up with the advanced disk_consistent_lsn/remote_consistent_lsn.
The result is that the read path will have a gap in the reconstruct
data for the keys whose modifications were lost, resulting in
a) either walredo failure
b) or an incorrect page@lsn image if walredo doesn't error.
The Fix
-------
The fix is to only do the artificial advancement if `result.is_ok()`.
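A minimal sketch of the guard, with LSNs and the flush result type simplified; the real logic lives in the pageserver flush loop:

```rust
fn maybe_advance_after_flush<E>(
    flush_result: &Result<(), E>,
    disk_consistent_lsn: &mut u64,
    frozen_to_lsn: u64,
) {
    // Only do the artificial advancement when the flush succeeded; otherwise
    // we would record an LSN for which no L0 made it into index_part.json.
    if flush_result.is_ok() {
        *disk_consistent_lsn = (*disk_consistent_lsn).max(frozen_to_lsn);
    }
}
```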
Misc
----
As an aside, I took some time to re-review the flush loop and its
callers.
I found one more bug related to error handling that I filed here:
- https://github.com/neondatabase/neon/issues/12025
## Problem
## Summary of changes
## Problem
Checking the most recent state of pageservers was insufficient to
evaluate whether another pageserver may read in a particular generation,
since the latest state might mask some earlier AttachedSingle state.
Related: https://github.com/neondatabase/neon/issues/11348
## Summary of changes
- Maintain a history of all attachments
- Write out explicit rules for when a pageserver may read
## Problem
Part of #11813. This pull request adds misc observability improvements
for the functionality.
## Summary of changes
* Info span for the PostHog feature background loop.
* New evaluate feature flag API.
* Put the request error into the error message.
* Log when feature flag gets updated.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
I have a hypothesis that import might be using a lower number of jobs than
the maximum for the VM where the job is running. This change will help us
find that out from the logs.
## Summary of changes
Added logging of the number of jobs passed into both `pg_dump` and
`pg_restore`.
Adds two metrics to the storcon that are related to the safekeeper
reconciler:
* `storage_controller_safkeeper_reconciles_queued` to indicate current
queue depth
* `storage_controller_safkeeper_reconciles_complete` to indicate the
number of complete reconciles
Both metrics operate on a per-safekeeper basis (as reconcilers run on a
per-safekeeper basis too).
These metrics mirror the `storage_controller_pending_reconciles` and
`storage_controller_reconcile_complete` metrics, although those are not
scoped on a per-pageserver basis but are global for the entire storage
controller.
Part of #11670
## Problem
Disk usage eviction isn't sensitive to layers of imported timelines.
## Summary of changes
Hook importing timelines up into eviction and add a test for it.
I don't think we need any special eviction logic for this. These layers
will all be visible and
their access time will be their creation time. Hence, we'll remove
covered layers first
and get to the imported layers if there's still disk pressure.
## Problem
Importing timelines can't currently be deleted. This is problematic
because:
1. Cplane cannot delete failed imports and we leave the timeline behind.
2. The flow does not support user-driven cancellation of the import.
## Summary of changes
On the pageserver: I've taken the path of least resistance, extended
`TimelineOrOffloaded`
with a new variant and added handling in the right places. I'm open to
thoughts here,
but I think it turned out better than I was envisioning.
On the storage controller: Again, fairly simple business: when a DELETE
timeline request is
received, we remove the import from the DB and stop any finalization
tasks/futures. In order
to stop finalizations, we track them in-memory. For each finalizing
import, we associate a gate
and a cancellation token.
Note that we delete the entry from the database before cancelling any
finalizations. This is such
that a concurrent request can't progress the import into finalize state
and race with the deletion.
This concern about deleting an import with ongoing finalization is
theoretical in the near term: we are only going to delete importing
timelines after the storage controller reports the failure to cplane.
Still, the design works for user-driven cancellation too.
Closes https://github.com/neondatabase/neon/issues/11897
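A rough sketch of the in-memory tracking, simplified to tokio primitives (a join handle stands in for the gate mentioned above); names are illustrative:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

type TimelineId = u64;

#[derive(Default)]
struct FinalizingImports {
    inner: Mutex<HashMap<TimelineId, (CancellationToken, JoinHandle<()>)>>,
}

impl FinalizingImports {
    fn register(&self, id: TimelineId, handle: JoinHandle<()>) -> CancellationToken {
        let token = CancellationToken::new();
        self.inner.lock().unwrap().insert(id, (token.clone(), handle));
        token
    }

    async fn cancel_and_wait(&self, id: TimelineId) {
        // Called after the DB entry has been removed, so a concurrent request
        // can no longer move the import into the finalize state.
        let entry = self.inner.lock().unwrap().remove(&id);
        if let Some((token, handle)) = entry {
            token.cancel();
            let _ = handle.await;
        }
    }
}
```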
## Summary
The optimiser background loop could get delayed a lot by waiting for
timeouts trying to talk to offline nodes.
Fixes: #12056
## Solution
- Skip offline nodes in `get_top_tenant_shards`
Link to Devin run:
https://app.devin.ai/sessions/065afd6756734d33bbd4d012428c4b6e
Requested by: John Spray (john@neon.tech)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: John Spray <john@neon.tech>
## Problem
Temporarily reduce the concurrency of gc-compaction to 1 job at a time.
We are going to roll out in the largest AWS region next week. Having one
job running at a time makes it easier to identify which tenant causes
problems if it's not running well, and to pause gc-compaction for that
specific tenant.
(We can make this configurable via the pageserver config in the future!)
## Summary of changes
Reduce `CONCURRENT_GC_COMPACTION_TASKS` from 2 to 1.
Signed-off-by: Alex Chi Z <chi@neon.tech>
The `test_storcon_create_delete_sk_down` test is still flaky. This patch
addresses two possible causes of flakiness. Both are related to deletion
racing with a `pull_timeline` that hasn't finished yet.
* the first cause is timeline deletion racing with `pull_timeline`:
* the first deletion attempt doesn't contain the line because the
timeline doesn't exist yet
* the subsequent deletion attempts don't contain it either, only a note
that the timeline is already deleted.
* so this patch adds the note that the timeline is already deleted to
the regex
* the second cause is about tenant deletion racing with `pull_timeline`:
* there were no tenant-specific tombstones, so if a tenant was deleted,
we only added tombstones for the specific timelines being deleted, not
for the tenant itself.
* This patch changes this, so we now have tenant-specific tombstones as
well as timeline-specific ones, and creation of a timeline checks both.
* we also don't see any retries of the tenant deletion in the logs: once
it's done, it's done. So extend the regex to contain the tenant deletion
message as well.
One could wonder why we use the regex instead of the API to check
whether the timeline is just "gone". The issue with the API is that it
doesn't allow one to distinguish between "deleted" and "has never
existed", and the latter case might race with `pull_timeline`. I.e. the
second flakiness cause helped in the discovery of a real bug (no tenant
tombstones), so the more precise check was helpful.
Before, I could easily reproduce 2-9 occurrences of flakiness when
running the test with an additional `range(128)` parameter (i.e. 218
times 4 times). With this patch, I ran it three times without a single
failure.
Fixes #11838
## Problem
Import planning takes a job size limit as its input. Previously, the job
size came from a pageserver config field. This field may change while
imports are in progress. If this happens, plans will no longer be
identical and the import would fail permanently.
## Summary of Changes
Bake the job size into the import progress reported to the storage
controller. For new imports, use the value from the pageserver config,
and, for existing imports, use the value present in the shard progress.
This value is identical for all shards, but we want it to be versioned
since future versions of the planner might split the jobs up
differently. Hence, it ends up in `ShardImportProgress`.
Closes https://github.com/neondatabase/neon/issues/11983
## Problem
Previous attempt https://github.com/neondatabase/neon/pull/10548 caused
some issues in staging and we reverted it. This is a re-attempt to
address https://github.com/neondatabase/neon/issues/11063.
Currently we create image layers at the latest record LSN. We would create
"future image layers" (i.e., image layers with an LSN larger than the disk
consistent LSN) that need special handling at startup. We also waste a
lot of read operations reconstructing from L0 layers when we could have
compacted all of the L0 layers and operated on a flat level of historic
layers.
## Summary of changes
* Run repartition at L0-L1 boundary.
* Roll out with feature flags.
* Piggyback a change that downgrades "image layer creating below
gc_cutoff" to debug level.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
There is a new API that I plan to use. We generate the client from the
spec, so the API should be in the spec.
## Summary of changes
Document the existing API in OpenAPI format.
## Problem
Currently, we collect metrics on what extensions are installed on
computes at startup time. We do not have a mechanism that does this at
runtime.
## Summary of changes
Added a background thread that queries all DBs at regular intervals and
collects a list of installed extensions.
## Problem
Part of https://github.com/neondatabase/neon/issues/11813
## Summary of changes
* Support evaluating boolean flags.
* Add docs on how to handle errors.
* Add test cases based on real PostHog config.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We have some gaps in our traces. This indicates missing spans.
## Summary of changes
This PR adds two new spans:
* WAIT_EXECUTOR: time a batched request spends in the batch waiting to
be picked up
* FLUSH_RESPONSE: time a get page request spends flushing the response
to the compute

## Problem
The page API gRPC errors need a few tweaks to integrate better with the
GetPage machinery.
Touches https://github.com/neondatabase/neon/issues/11728.
## Summary of changes
* Add `GetPageStatus::InternalError` for internal server errors.
* Rename `GetPageStatus::Invalid` to `InvalidRequest` for clarity.
* Rename `status` and `GetPageStatus` to `status_code` and
`GetPageStatusCode`.
* Add an `Into<tonic::Status>` implementation for `ProtocolError`.
Some tests still explicitly specify version 3 of the safekeeper
walproposer protocol. Remove the explicit opt-in from the tests, as v3
has been the default since #11518.
We don't touch the places where a test exercises both v2 and v3. Those
we leave for #12021.
Part of https://github.com/neondatabase/neon/issues/10326
## Problem
Test coverage of timeline imports is lacking.
## Summary of changes
This PR adds a chaos import test. It runs an import while injecting
various chaos events
in the environment. All the commits that follow the test fix various
issues that were surfaced by it.
Closes https://github.com/neondatabase/neon/issues/10191