rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-31 03:50:37 +00:00

Author	SHA1	Message	Date
Tristan Partin	8b4fbefc29	Patch pgaudit to disable logging in parallel workers (#12325 ) We want to turn logging in parallel workers off to reduce log amplification in queries which use parallel workers. Part-of: https://github.com/neondatabase/cloud/issues/28483 Signed-off-by: Tristan Partin <tristan.partin@databricks.com>	2025-07-02 19:54:47 +00:00
Alex Chi Z.	a9a51c038b	rfc: storage feature flags (#11805 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11813 ## Summary of changes --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-02 17:41:36 +00:00
Alexey Kondratov	44121cc175	docs(compute): RFC for compute rolling restart with prewarm (#11294 ) ## Problem Neon currently implements several features that guarantee high uptime of compute nodes: 1. Storage high-availability (HA), i.e. each tenant shard has a secondary pageserver location, so we can quickly switch over compute to it in case of primary pageserver failure. 2. Fast compute provisioning, i.e. we have a fleet of pre-created empty computes, that are ready to serve workload, so restarting unresponsive compute is very fast. 3. Preemptive NeonVM compute provisioning in case of k8s node unavailability. This helps us to be well-within the uptime SLO of 99.95% most of the time. Problems begin when we go up to multi-TB workloads and 32-64 CU computes. During restart, compute looses all caches: LFC, shared buffers, file system cache. Depending on the workload, it can take a lot of time to warm up the caches, so that performance could be degraded and might be even unacceptable for certain workloads. The latter means that although current approach works well for small to medium workloads, we still have to do some additional work to avoid performance degradation after restart of large instances. [Rendered version](https://github.com/neondatabase/neon/blob/alexk/pg-prewarm-rfc/docs/rfcs/2025-03-17-compute-prewarm.md) Part of https://github.com/neondatabase/cloud/issues/19011	2025-07-02 17:16:00 +00:00
Dmitry Savelev	0429a0db16	Switch the billing metrics storage format to ndjson. (#12427 ) ## Problem The billing team wants to change the billing events pipeline and use a common events format in S3 buckets across different event producers. ## Summary of changes Change the events storage format for billing events from JSON to NDJSON. Also partition files by hours, rather than days. Resolves: https://github.com/neondatabase/cloud/issues/29995	2025-07-02 16:30:47 +00:00
Conrad Ludgate	d6beb3ffbb	[proxy] rewrite pg-text to json routines (#12413 ) We would like to move towards an arena system for JSON encoding the responses. This change pushes an "out" parameter into the pg-test to json routines to make swapping in an arena system easier in the future. (see #11992) This additionally removes the redundant `column: &[Type]` argument, as well as rewriting the pg_array parser. --- I rewrote the pg_array parser since while making these changes I found it hard to reason about. I went back to the specification and rewrote it from scratch. There's 4 separate routines: 1. pg_array_parse - checks for any prelude (multidimensional array ranges) 2. pg_array_parse_inner - only deals with the arrays themselves 3. pg_array_parse_item - parses a single item from the array, this might be quoted, unquoted, or another nested array. 4. pg_array_parse_quoted - parses a quoted string, following the relevant string escaping rules.	2025-07-02 12:46:11 +00:00
Arpad Müller	efd7e52812	Don't error if timeline offload is already in progress (#12428 ) Don't print errors like: ``` Compaction failed 1 times, retrying in 2s: Failed to offload timeline: Unexpected offload error: Timeline deletion is already in progress ``` Print it at info log level instead. https://github.com/neondatabase/cloud/issues/30666	2025-07-02 12:06:55 +00:00
Ivan Efremov	0f879a2e8f	[proxy]: Fix redis IRSA expiration failure errors (#12430 ) Relates to the [#30688](https://github.com/neondatabase/cloud/issues/30688)	2025-07-02 08:55:44 +00:00
Dmitrii Kovalkov	8e7ce42229	tests: start primary compute on not-readonly branches (#12408 ) ## Problem https://github.com/neondatabase/neon/pull/11712 changed how computes are started in the test: the lsn is specified, making them read-only static replicas. Lsn is `last_record_lsn` from pageserver. It works fine with read-only branches (because their `last_record_lsn` is equal to `start_lsn` and always valid). But with writable timelines, the `last_record_lsn` on the pageserver might be stale. Particularly in this test, after the `detach_branch` operation, the tenant is reset on the pagesever. It leads to `last_record_lsn` going back to `disk_consistent_lsn`, so basically rolling back some recent writes. If we start a primary compute, it will start at safekeepers' commit Lsn, which is the correct one , and will wait till pageserver catches up with this Lsn after reset. - Closes: https://github.com/neondatabase/neon/issues/12365 ## Summary of changes - Start `primary` compute for writable timelines.	2025-07-02 05:41:17 +00:00
Alex Chi Z.	5ec8881c0b	feat(pageserver): resolve feature flag based on remote size (#12400 ) ## Problem Part of #11813 ## Summary of changes * Compute tenant remote size in the housekeeping loop. * Add a new `TenantFeatureResolver` struct to cache the tenant-specific properties. * Evaluate feature flag based on the remote size. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-01 18:11:24 +00:00
Alex Chi Z.	b254dce8a1	feat(pageserver): report compaction progress (#12401 ) ## Problem close https://github.com/neondatabase/neon/issues/11528 ## Summary of changes Gives us better observability of compaction progress. - Image creation: num of partition processed / total partition - Gc-compaction: index of the in the queue / total items for a full compaction - Shard ancestor compaction: layers to rewrite / total layers Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-01 17:00:27 +00:00
Alex Chi Z.	3815e3b2b5	feat(pageserver): reduce lock contention in l0 compaction (#12360 ) ## Problem L0 compaction currently holds the read lock for a long region while it doesn't need to. ## Summary of changes This patch reduces the one long contention region into 2 short ones: gather the layers to compact at the beginning, and several short read locks when querying the image coverage. Co-Authored-By: Chen Luo --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-01 16:58:41 +00:00
Suhas Thalanki	bbcd70eab3	Dynamic Masking Support for `anon` v2 (#11733 ) ## Problem This PR works on adding dynamic masking support for `anon` v2. It currently only supports static masking. ## Summary of changes Added a security definer function that sets the dynamic masking guc to `true` with superuser permissions. Added a security definer function that adds `anon` to `session_preload_libraries` if it's not already present. Related to: https://github.com/neondatabase/cloud/issues/20456	2025-07-01 16:50:27 +00:00
Suhas Thalanki	0934ce9bce	compute: metrics for autovacuum (mxid, postgres) (#12294 ) ## Problem Currently we do not have metrics for autovacuum. ## Summary of changes Added a metric that extracts the top 5 DBs with oldest mxid and frozen xid. Tables that were vacuumed recently should have younger value (or younger age). Related Issue: https://github.com/neondatabase/cloud/issues/27296	2025-07-01 15:33:23 +00:00
Conrad Ludgate	4932963bac	[proxy]: dont log user errors from postgres (#12412 ) ## Problem #8843 User initiated sql queries are being classified as "postgres" errors, whereas they're really user errors. ## Summary of changes Classify user-initiated postgres errors as user errors if they are related to a sql query that we ran on their behalf. Do not log those errors.	2025-07-01 13:03:34 +00:00
Lassi Pölönen	6d73cfa608	Support audit syslog over TLS (#12124 ) Add support to transport syslogs over TLS. Since TLS params essentially require passing host and port separately, add a boolean flag to the configuration template and also use the same `action` format for plaintext logs. This allows seamless transition. The plaintext host:port is picked from `AUDIT_LOGGING_ENDPOINT` (as earlier) and from `AUDIT_LOGGING_TLS_ENDPOINT`. The TLS host:port is used when defined and non-empty. `remote_endpoint` is split separately to hostname and port as required by `omfwd` module. Also the address parsing and config content generation are split to more testable functions with basic tests added.	2025-07-01 12:53:46 +00:00
Dmitrii Kovalkov	d2d9946bab	tests: override safekeeper ports in storcon DB (#12410 ) ## Problem We persist safekeeper host/port in the storcon DB after https://github.com/neondatabase/neon/pull/11712, so the storcon fails to ping safekeepers in the compatibility tests, where we start the cluster from the snapshot. PR also adds some small code improvements related to the test failure. - Closes: https://github.com/neondatabase/neon/issues/12339 ## Summary of changes - Update safekeeper ports in the storcon DB when starting the neon from the dir (snapshot) - Fail the response on all not-success codes (e.g. 3xx). Should not happen, but just to be more safe. - Add `neon_previous/` to .gitignore to make it easier to run compat tests. - Add missing EXPORT to the instruction for running compat tests	2025-07-01 12:47:16 +00:00
Trung Dinh	daa402f35a	pageserver: Make ImageLayerWriter sync, infallible and lazy (#12403 ) ## Problem ## Summary of changes Make ImageLayerWriter sync, infallible and lazy. Address https://github.com/neondatabase/neon/issues/12389. All unit tests passed.	2025-07-01 09:53:11 +00:00
Suhas Thalanki	5f3532970e	[compute] fix: background worker that collects installed extension metrics now updates collection interval (#12277 ) ## Problem Previously, the background worker that collects the list of installed extensions across DBs had a timeout set to 1 hour. This cause a problem with computes that had a `suspend_timeout` > 1 hour as this collection was treated as activity, preventing compute shutdown. Issue: https://github.com/neondatabase/cloud/issues/30147 ## Summary of changes Passing the `suspend_timeout` as part of the `ComputeSpec` so that any updates to this are taken into account by the background worker and updates its collection interval.	2025-06-30 22:12:37 +00:00
Arpad Müller	2e681e0ef8	detach_ancestor: delete the right layer when hardlink fails (#12397 ) If a hardlink operation inside `detach_ancestor` fails due to the layer already existing, we delete the layer to make sure the source is one we know about, and then retry. But we deleted the wrong file, namely, the one we wanted to use as the source of the hardlink. As a result, the follow up hard link operation failed. Our PR corrects this mistake.	2025-06-30 21:36:15 +00:00
Dmitrii Kovalkov	8e216a3a59	storcon: notify cplane on safekeeper membership change (#12390 ) ## Problem We don't notify cplane about safekeeper membership change yet. Without the notification the compute needs to know all the safekeepers on the cluster to be able to speak to them. Change notifications will allow to avoid it. - Closes: https://github.com/neondatabase/neon/issues/12188 ## Summary of changes - Implement `notify_safekeepers` method in `ComputeHook` - Notify cplane about safekeepers in `safekeeper_migrate` handler. - Update the test to make sure notifications work. ## Out of scope - There is `cplane_notified_generation` field in `timelines` table in strocon's database. It's not needed now, so it's not updated in the PR. Probably we can remove it. - e2e tests to make sure it works with a production cplane	2025-06-30 14:09:50 +00:00
Erik Grinaker	d0a4ae3e8f	pageserver: add gRPC LSN lease support (#12384 ) ## Problem The gRPC API does not provide LSN leases. ## Summary of changes * Add LSN lease support to the gRPC API. * Use gRPC LSN leases for static computes with `grpc://` connstrings. * Move `PageserverProtocol` into the `compute_api::spec` module and reuse it.	2025-06-30 12:44:17 +00:00
Erik Grinaker	a384d7d501	pageserver: assert no changes to shard identity (#12379 ) ## Problem Location config changes can currently result in changes to the shard identity. Such changes will cause data corruption, as seen with #12217. Resolves #12227. Requires #12377. ## Summary of changes Assert that the shard identity does not change on location config updates and on (re)attach. This is currently asserted with `critical!`, in case it misfires in production. Later, we should reject such requests with an error and turn this into a proper assertion.	2025-06-30 12:36:45 +00:00
Christian Schwarz	66f53d9d34	refactor(pageserver): force explicit mapping to `CreateImageLayersError::Other` (#12382 ) Implicit mapping to an `anyhow::Error` when we do `?` is discouraged because tooling to find those places isn't great. As a drive-by, also make SplitImageLayerWriter::new infallible and sync. I think we should also make ImageLayerWriter::new completely lazy, then `BatchLayerWriter:new` infallible and async.	2025-06-30 11:03:48 +00:00
Busra Kugler	2af9380962	Revert "Replace step-security maintained actions" (#12386 ) Reverts neondatabase/neon#11663 and https://github.com/neondatabase/neon/pull/11265/ Step Security is not yet approved by Databricks team, in order to prevent issues during Github org migration, I'll revert this PR to use the previous action instead of Step Security maintained action.	2025-06-30 10:15:10 +00:00
Ivan Efremov	620d50432c	Fix path issue in the proxy-bernch CI workflow (#12388 )	2025-06-30 09:33:57 +00:00
Erik Grinaker	1d43f3bee8	pageserver: fix stripe size persistence in legacy HTTP handlers (#12377 ) ## Problem Similarly to #12217, the following endpoints may result in a stripe size mismatch between the storage controller and Pageserver if an unsharded tenant has a different stripe size set than the default. This can lead to data corruption if the tenant is later manually split without specifying an explicit stripe size, since the storage controller and Pageserver will apply different defaults. This commonly happens with tenants that were created before the default stripe size was changed from 32k to 2k. * `PUT /v1/tenant/config` * `PATCH /v1/tenant/config` These endpoints are no longer in regular production use (they were used when cplane still managed Pageserver directly), but can still be called manually or by tests. ## Summary of changes Retain the current shard parameters when updating the location config in `PUT \| PATCH /v1/tenant/config`. Also opportunistically derive `Copy` for `ShardParameters`.	2025-06-30 09:08:44 +00:00
Dmitrii Kovalkov	c746678bbc	storcon: implement safekeeper_migrate handler (#11849 ) This PR implements a safekeeper migration algorithm from RFC-035 https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#change-algorithm - Closes: https://github.com/neondatabase/neon/issues/11823 It is not production-ready yet, but I think it's good enough to commit and start testing. There are some known issues which will be addressed in later PRs: - https://github.com/neondatabase/neon/issues/12186 - https://github.com/neondatabase/neon/issues/12187 - https://github.com/neondatabase/neon/issues/12188 - https://github.com/neondatabase/neon/issues/12189 - https://github.com/neondatabase/neon/issues/12190 - https://github.com/neondatabase/neon/issues/12191 - https://github.com/neondatabase/neon/issues/12192 ## Summary of changes - Implement `tenant_timeline_safekeeper_migrate` handler to drive the migration - Add possibility to specify number of safekeepers per timeline in tests (`timeline_safekeeper_count`) - Add `term` and `flush_lsn` to `TimelineMembershipSwitchResponse` - Implement compare-and-swap (CAS) operation over timeline in DB for updating membership configuration safely. - Write simple test to verify that migration code works	2025-06-30 08:30:05 +00:00
Aleksandr Sarantsev	9bb4688c54	storcon: Remove testing feature from kick_secondary_downloads (#12383 ) ## Problem Some of the design decisions in PR #12256 were influenced by the requirements of consistency tests. These decisions introduced intermediate logic that is no longer needed and should be cleaned up. ## Summary of Changes - Remove the `feature("testing")` flag related to `kick_secondary_download`. - Set the default value of `kick_secondary_download` back to false, reflecting the intended production behavior. Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-06-30 05:41:05 +00:00
Dmitrii Kovalkov	47553dbaf9	neon_local: set timeline_safekeeper_count if we have less than 3 safekeepers (#12378 ) ## Problem - Closes: https://github.com/neondatabase/neon/issues/12298 ## Summary of changes - Set `timeline_safekeeper_count` in `neon_local` if we have less than 3 safekeepers - Remove `cfg!(feature = "testing")` code from `safekeepers_for_new_timeline` - Change `timeline_safekeeper_count` type to `usize`	2025-06-28 12:59:29 +00:00
Erik Grinaker	e50b914a8e	compute_tools: support gRPC base backups in `compute_ctl` (#12244 ) ## Problem `compute_ctl` should support gRPC base backups. Requires #12111. Requires #12243. Touches #11926. ## Summary of changes Support `grpc://` connstrings for `compute_ctl` base backups.	2025-06-27 16:39:00 +00:00
Christian Schwarz	e33e109403	fix(pageserver): buffered writer cancellation error handling (#12376 ) ## Problem The problem has been well described in already-commited PR #11853. tl;dr: BufferedWriter is sensitive to cancellation, which the previous approach was not. The write path was most affected (ingest & compaction), which was mostly fixed in #11853: it introduced `PutError` and mapped instances of `PutError` that were due to cancellation of underlying buffered writer into `CreateImageLayersError::Cancelled`. However, there is a long tail of remaining errors that weren't caught by #11853 that result in `CompactionError::Other`s, which we log with great noise. ## Solution The stack trace logging for CompactionError::Other added in #11853 allows us to chop away at that long tail using the following pattern: - look at the stack trace - from leaf up, identify the place where we incorrectly map from the distinguished variant X indicating cancellation to an `anyhow::Error` - follow that anyhow further up, ensuring it stays the same anyhow all the way up in the `CompactionError::Other` - since it stayed one anyhow chain all the way up, root_cause() will yield us X - so, in `log_compaction_error`, add an additional `downcast_ref` check for X This PR specifically adds checks for - the flush task cancelling (FlushTaskError, BlobWriterError) - opening of the layer writer (GateError) That should cover all the reports in issues - https://github.com/neondatabase/cloud/issues/29434 - https://github.com/neondatabase/neon/issues/12162 ## Refs - follow-up to #11853 - fixup of / fixes https://github.com/neondatabase/neon/issues/11762 - fixes https://github.com/neondatabase/neon/issues/12162 - refs https://github.com/neondatabase/cloud/issues/29434	2025-06-27 15:26:00 +00:00
Folke Behrens	0ee15002fc	proxy: Move client connection accept and handshake to pglb (#12380 ) * This must be a no-op. * Move proxy::task_main to pglb::task_main. * Move client accept, TLS and handshake to pglb. * Keep auth and wake in proxy.	2025-06-27 15:20:23 +00:00
Arpad Müller	4c7956fa56	Fix hang deleting offloaded timelines (#12366 ) We don't have cancellation support for timeline deletions. In other words, timeline deletion might still go on in an older generation while we are attaching it in a newer generation already, because the cancellation simply hasn't reached the deletion code. This has caused us to hit a situation with offloaded timelines in which the timeline was in an unrecoverable state: always returning an accepted response, but never a 404 like it should be. The detailed description can be found in [here](https://github.com/neondatabase/cloud/issues/30406#issuecomment-3008667859) (private repo link). TLDR: 1. we ask to delete timeline on old pageserver/generation, starts process in background 2. the storcon migrates the tenant to a different pageserver. - during attach, the pageserver still finds an index part, so it adds it to `offloaded_timelines` 4. the timeline deletion finishes, removing the index part in S3 5. there is a retry of the timeline deletion endpoint, sent to the new pageserver location. it is bound to fail however: - as the index part is gone, we print `Timeline already deleted in remote storage`. - the problem is that we then return an accepted response code, and not a 404. - this confuses the code calling us. it thinks the timeline is not deleted, so keeps retrying. - this state never gets recovered from until a reset/detach, because of the `offloaded_timelines` entry staying there. This is where this PR fixes things: if no index part can be found, we can safely assume that the timeline is gone in S3 (it's the last thing to be deleted), so we can remove it from `offloaded_timelines` and trigger a reupload of the manifest. Subsequent retries will pick that up. Why not improve the cancellation support? It is a more disruptive code change, that might have its own risks. So we don't do it for now. Fixes https://github.com/neondatabase/cloud/issues/30406	2025-06-27 15:14:55 +00:00
Heikki Linnakangas	5a82182c48	impr(ci): Refactor postgres Makefile targets to a separate makefile (#12363 ) Mainly for general readability. Some notable changes: - Postgres can be built without the rest of the repository, and in particular without any of the Rust bits. Some CI scripts took advantage of that, so let's make that more explicit by separating those parts. Also add an explicit comment about that in the new postgres.mk file. - Add a new PG_INSTALL_CACHED variable. If it's set, `make all` and other top-Makefile targets skip checking if Postgres is up-to-date. This is also to be used in CI scripts that build and cache Postgres as separate steps. (It is currently only used in the macos walproposer-lib rule, but stay tuned for more.) - Introduce a POSTGRES_VERSIONS variable that lists all supported PostgreSQL versions. Refactor a few Makefile rules to use that.	2025-06-27 14:49:52 +00:00
Arpad Müller	37e181af8a	Update rust to 1.88.0 (#12364 ) We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Announcement blog post](https://blog.rust-lang.org/2025/06/26/Rust-1.88.0/) Prior update was in https://github.com/neondatabase/neon/pull/11938	2025-06-27 13:51:59 +00:00
Peter Bendel	6f4198c78a	treat strategy flag test_maintenance as boolean data type (#12373 ) ## Problem In large oltp test run https://github.com/neondatabase/neon/actions/runs/15905488707/job/44859116742 we see that the `Benchmark database maintenance` step is skipped in all 3 strategy variants, however it should be executed in two. This is due to treating the `test_maintenance` boolean type in the strategy in the condition of the `Benchmark database maintenance` step ## Summary of changes Use a boolean condition instead of a string comparison ## Test run from this pull request branch https://github.com/neondatabase/neon/actions/runs/15923605412	2025-06-27 13:49:26 +00:00
Vlad Lazar	cc1664ef93	pageserver: allow flush task cancelled error in sharding autosplit test (#12374 ) ## Problem Test is failing due to compaction shutdown noise (see https://github.com/neondatabase/neon/issues/12162). ## Summary of changes Allow list the noise.	2025-06-27 13:13:11 +00:00
Vlad Lazar	ebb6e26a64	pageserver: handle multiple attached children in shard resolution (#12336 ) ## Problem When resolving a shard during a split we might have multiple attached shards with the old shard count (i.e. not all of them are marked in progress and ignored). Hence, we can compute the desired shard number based on the old shard count and misroute the request. ## Summary of Changes Recompute the desired shard every time the shard count changes during the iteration	2025-06-27 12:46:18 +00:00
Mikhail	ebc12a388c	fix: endpoint_storage_addr as String (#12359 ) It's not a SocketAddr as we use k8s DNS https://github.com/neondatabase/cloud/issues/19011	2025-06-27 11:06:27 +00:00
Conrad Ludgate	abc1efd5a6	[proxy] fix connect_to_compute retry handling (#12351 ) # Problem In #12335 I moved the `authenticate` method outside of the `connect_to_compute` loop. This triggered [e2e tests to become flaky](https://github.com/neondatabase/cloud/pull/30533). This highlighted an edge case we forgot to consider with that change. When we connect to compute, the compute IP might be cached. This cache hit might however be stale. Because we can't validate the IP is associated with a specific compute-id☨, we will succeed the connect_to_compute operation and fail when it comes to password authentication☨☨. Before the change, we were invalidating the cache and triggering wake_compute if the authentication failed. Additionally, I noticed some faulty logic I introduced 1 year ago https://github.com/neondatabase/neon/pull/8141/files#diff-5491e3afe62d8c5c77178149c665603b29d88d3ec2e47fc1b3bb119a0a970afaL145-R147 ☨ We can when we roll out TLS, as the certificate common name includes the compute-id. ☨☨ Technically password authentication could pass for the wrong compute, but I think this would only happen in the very very rare event that the IP got reused and the compute's endpoint happened to be a branch/replica. # Solution 1. Fix the broken logic 2. Simplify cache invalidation (I don't know why it was so convoluted) 3. Add a loop around connect_to_compute + authenticate to re-introduce the wake_compute invalidation we accidentally removed. I went with this approach to try and avoid interfering with https://github.com/neondatabase/neon/compare/main...cloneable/proxy-pglb-connect-compute-split. The changes made in commit 3 will move into `handle_client_request` I suspect,	2025-06-27 10:36:27 +00:00
Dmitrii Kovalkov	6fa1562b57	pageserver: increase default max_size_entries limit for basebackup cache (#12343 ) ## Problem Some pageservers hit `max_size_entries` limit in staging with only ~25 MiB storage used by basebackup cache. The limit is too strict. It should be safe to relax it. - Part of https://github.com/neondatabase/cloud/issues/29353 ## Summary of changes - Increase the default `max_size_entries` from 1000 to 10000	2025-06-27 09:18:18 +00:00
Heikki Linnakangas	10afac87e7	impr(ci): Remove unnecessary 'make postgres-headers' build step (#12354 ) The 'make postgres' step includes installation of the headers, no need to do that separately.	2025-06-26 16:45:34 +00:00
Vlad Lazar	72b3c9cd11	pageserver: fix wal receiver hang on remote client shutdown (#12348 ) ## Problem Druing shard splits we shut down the remote client early and allow the parent shard to keep ingesting data. While ingesting data, the wal receiver task may wait for the current flush to complete in order to apply backpressure. Notifications are delivered via `Timeline::layer_flush_done_tx`. When the remote client was being shut down the flush loop exited whithout delivering a notification. This left `Timeline::wait_flush_completion` hanging indefinitely which blocked the shutdown of the wal receiver task, and, hence, the shard split. ## Summary of Changes Deliver a final notification when the flush loop is shutting down without the timeline cancel cancellation token having fired. I tried writing a test for this, but got stuck in failpoint hell and decided it's not worth it. `test_sharding_autosplit`, which reproduces this reliably in CI, passed with the proposed fix in https://github.com/neondatabase/neon/pull/12304. Closes https://github.com/neondatabase/neon/issues/12060	2025-06-26 16:35:34 +00:00
Arpad Müller	232f2447d4	Support pull_timeline of timelines without writes (#12028 ) Make the safekeeper `pull_timeline` endpoint support timelines that haven't had any writes yet. In the storcon managed sk timelines world, if a safekeeper goes down temporarily, the storcon will schedule a `pull_timeline` call. There is no guarantee however that by when the safekeeper is online again, there have been writes to the timeline yet. The `snapshot` endpoint gives an error if the timeline hasn't had writes, so we avoid calling it if `timeline_start_lsn` indicates a freshly created timeline. Fixes #11422 Part of #11670	2025-06-26 16:29:03 +00:00
Erik Grinaker	a2d2108e6a	pageserver: use base backup cache with gRPC (#12352 ) ## Problem gRPC base backups do not use the base backup cache. Touches https://github.com/neondatabase/neon/issues/11728. ## Summary of changes Integrate gRPC base backups with the base backup cache. Also fixes a bug where the base backup cache did not differentiate between primary/replica base backups (at least I think that's a bug?).	2025-06-26 15:52:15 +00:00
Alex Chi Z.	33c0d5e2f4	fix(pageserver): make posthog config parsing more robust (#12356 ) ## Problem In our infra config, we have to split server_api_key and other fields in two files: the former one in the sops file, and the latter one in the normal config. It creates the situation that we might misconfigure some regions that it only has part of the fields available, causing storcon/pageserver refuse to start. ## Summary of changes Allow PostHog config to have part of the fields available. Parse it later. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-06-26 15:49:08 +00:00
Dmitrii Kovalkov	605fb04f89	pageserver: use bounded sender for basebackup cache (#12342 ) ## Problem Basebackup cache now uses unbounded channel for prepare requests. In theory it can grow large if the cache is hung and does not process the requests. - Part of https://github.com/neondatabase/cloud/issues/29353 ## Summary of changes - Replace an unbounded channel with a bounded one, the size is configurable. - Add `pageserver_basebackup_cache_prepare_queue_size` to observe the size of the queue. - Refactor a bit to move all metrics logic to `basebackup_cache.rs`	2025-06-26 13:26:24 +00:00
Conrad Ludgate	fd1e8ec257	[proxy] review and cleanup CLI args (#12167 ) I was looking at how we could expose our proxy config as toml again, and as I was writing out the schema format, I noticed some cruft in our CLI args that no longer seem to be in use. The redis change is the most complex, but I am pretty sure it's sound. Since https://github.com/neondatabase/cloud/pull/15613 cplane longer publishes to the global redis instance.	2025-06-26 11:25:41 +00:00
Konstantin Knizhnik	be23eae3b6	Mark pages as avaiable in LFC only after generation check (#12350 ) ## Problem If LFC generation is changed then `lfc_readv_select` will return -1 but pages are still marked as available in bitmap. ## Summary of changes Update bitmap after generation check. Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com>	2025-06-26 07:06:27 +00:00
Alex Chi Z.	6f70885e11	fix(pageserver): allow refresh_interval to be empty (#12349 ) ## Problem Fix for https://github.com/neondatabase/neon/pull/12324 ## Summary of changes Need `serde(default)` to allow this field not present in the config, otherwise there will be a config deserialization error. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-06-25 22:15:03 +00:00

1 2 3 4 5 ...

8282 Commits