rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-26 15:49:58 +00:00

Author	SHA1	Message	Date
Konstantin Knizhnik	fe6dcbdecf	Switch on DEBUG_COMPARE_LOCAL mode	2025-06-19 14:44:22 +03:00
Arpad Müller	ec1452a559	Switch on --timelines-onto-safekeepers in integration tests (#11712 ) Switch on the `--timelines-onto-safekeepers` param in integration tests. Some changes that were needed to enable this but which I put into other PRs to not clutter up this one: * #11786 * #11854 * #12129 * #12138 Further fixes that were needed for this: * https://github.com/neondatabase/neon/pull/11801 * https://github.com/neondatabase/neon/pull/12143 * https://github.com/neondatabase/neon/pull/12204 Not strictly needed, but helpful: * https://github.com/neondatabase/neon/pull/12155 Part of #11670 Closes #11424	2025-06-19 11:17:01 +00:00
Heikki Linnakangas	1950ccfe33	Eliminate dependency from pageserver_api to postgres_ffi (#12273 ) Introduce a separate `postgres_ffi_types` crate which contains a few types and functions that were used in the API. `postgres_ffi_types` is a much smaller crate than `postgres_ffi`, and it doesn't depend on bindgen or the Postgres C headers. Move NeonWalRecord and Value types to wal_decoder crate. They are only used in the pageserver-safekeeper "ingest" API. The rest of the ingest API types are defined in wal_decoder, so move these there as well.	2025-06-19 10:31:27 +00:00
Heikki Linnakangas	2ca6665f4a	Remove outdated 'clean' Makefile targets (#12288 ) We have been bad at keeping them up-to-date, several contrib modules and neon extensions were missing from the clean rules. Give up trying, and remove the targets altogether. In practice, it's straightforward to just do `rm -rf pg_install/build`, so the clean-targets are hardly worth the maintenance effort. I kept `make distclean` though. The rule for that is simple enough.	2025-06-19 10:24:09 +00:00
Heikki Linnakangas	fa954671b2	Remove unnecessary Postgres libs from the storage docker image (#12286 ) Since commit `87ad50c925`, storage_controller has used diesel_async, which in turn uses tokio-postgres as the Postgres client, which doesn't require libpq. Thus we no longer need libpq in the storage image.	2025-06-19 10:00:01 +00:00
Erik Grinaker	6f4ffdb48b	pageserver: add gRPC compression (#12280 ) ## Problem The gRPC page service should support compression. Requires #12111. Touches #11728. Touches https://github.com/neondatabase/cloud/issues/25679. ## Summary of changes Add support for gzip and zstd compression in the server, and a client parameter to enable compression. This will need further benchmarking under realistic network conditions.	2025-06-19 09:54:34 +00:00
Vlad Lazar	3f676df3d5	pageserver: fix initial layer visibility calculation (#12206 ) ## Problem GC info is an input to updating layer visibility. Currently, gc info is updated on timeline activation and visibility is computed on tenant attach, so we ignore branch points and compute visibility by taking all layers into account. Side note: gc info is also updated when timelines are created and dropped. That doesn't help because we create the timelines in topological order from the root. Hence the root timeline goes first, without context of where the branch points are. The impact of this in prod is that shards need to rehydrate layers after live migration since the non-visible ones were excluded from the heatmap. ## Summary of Changes Move the visibility calculation into tenant attachment instead of activation.	2025-06-19 09:53:18 +00:00
Suhas Thalanki	20f4febce1	fix: additional changes to terminate pgbouncer on compute suspend (#12153 ) (#12284 ) Addressed [retrospective comments](https://github.com/neondatabase/neon/pull/12153#discussion_r2154197503) to https://github.com/neondatabase/neon/pull/12153	2025-06-18 19:31:22 +00:00
Mikhail	762905cf8d	endpoint storage: parse config with type:LocalFs|AwsS3|AzureContainer (#12282 ) https://github.com/neondatabase/cloud/issues/27195	2025-06-18 17:45:20 +00:00
Elizabeth Murray	830ef35ed3	Domain client for Pageserver GRPC. (#12111 ) Add domain client for new communicator GRPC types.	2025-06-18 15:51:49 +00:00
Erik Grinaker	d8d62fb7cb	test_runner: add gRPC support (#12279 ) ## Problem `test_runner` integration tests should support gRPC. Touches #11926. ## Summary of changes * Enable gRPC for Pageservers, with dynamic port allocations. * Add a `grpc` parameter for endpoint creation, plumbed through to `neon_local endpoint create`. No tests actually use gRPC yet, since computes don't support it yet.	2025-06-18 14:05:13 +00:00
Aleksandr Sarantsev	e6a404c66d	Fix flaky test_sharding_split_failures (#12199 ) ## Problem `test_sharding_failures` is flaky due to interference from the `background_reconcile` process. The details are in the issue https://github.com/neondatabase/neon/issues/12029. ## Summary of changes - Use `reconcile_until_idle` to ensure a stable state before running test assertions - Added error tolerance in `reconcile_until_idle` test function (Failure cases: 1, 3, 19, 20) - Ignore the `Keeping extra secondaries` warning message since it is retryable (Failure case: 2) - Deduplicated code in `assert_rolled_back` and `assert_split_done` - Added a log message before printing plenty of Node `X` seen on pageserver `Y`	2025-06-18 13:27:41 +00:00
Peter Bendel	7e711ede44	Increase tenant size for large tenant oltp workload (#12260 ) ## Problem - We run the large tenant oltp workload with a fixed size (larger than existing customers' workloads). Our customers' workloads are continuously growing and our testing should stay ahead of the customers' production workloads. - we want to touch all tables in the tenant's database (updates) so that we simulate a continuous change in layer files like in a real production workload - our current oltp benchmark uses a mixture of read and write transactions, however we also want a separate test run with read-only transactions only ## Summary of changes - modify the existing workload to have a separate run with pgbench custom scripts that are read-only - create a new workload that - grows all large tables in each run (for the reuse branch in the large oltp tenant's project) - updates a percentage of rows in all large tables in each run (to enforce table bloat and auto-vacuum runs and layer rebuild in pageservers) Each run of the new workflow increases the logical database size by about 16 GB. We start with 6 runs per day which will give us about 96-100 GB growth per day. --------- Co-authored-by: Alexander Lakhin <alexander.lakhin@neon.tech>	2025-06-18 12:40:25 +00:00
Mikhail	e95f2f9a67	compute_ctl: return LSN in /terminate (#12240 ) - Add optional `?mode=fast|immediate` to `/terminate`, `fast` is default. Immediate avoids waiting 30 seconds before returning from `terminate`. - Add `TerminateMode` to `ComputeStatus::TerminationPending` - Use `/terminate?mode=immediate` in `neon_local` instead of `pg_ctl stop` for `test_replica_promotes`. - Change `test_replica_promotes` to check returned LSN - Annotate `finish_sync_safekeepers` as `noreturn`. https://github.com/neondatabase/cloud/issues/29807	2025-06-18 12:25:19 +00:00
Heikki Linnakangas	5a045e7d52	Move pagestream_api to separate module (#12272 ) For general readability.	2025-06-18 12:03:14 +00:00
Dimitri Fontaine	67fbc0582e	Validate safekeeper_connstrings when parsing compute specs. (#11906 ) This check API only checks the safekeeper_connstrings at the moment, and the validation is limited to checking we have at least one entry in there, and no duplicates. ## Problem If the compute_ctl service is started with an empty list of safekeepers, then hard-to-debug errors may happen at runtime, where it would be much easier to catch them early. ## Summary of changes Add an entry point in the compute_ctl API to validate the configuration for safekeeper_connstrings. --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-06-18 10:01:05 +00:00
Heikki Linnakangas	3af6b3a2bf	Avoid redownloading rust toolchain on Postgres changes (#12265 ) Create a separate stage for downloading the Rust toolchain for pgrx, so that it can be cached independently of the pg-build layer. Before this, the 'pg-build-nonroot=with-cargo' layer was unnecessarily rebuilt every time there was a change in PostgreSQL sources. Furthermore, this allows using the same cached layer for building the compute images of all Postgres versions.	2025-06-18 09:49:42 +00:00
Erik Grinaker	04013929cb	pageserver: support full gRPC basebackups (#12269 ) ## Problem Full basebackups are used in tests, and may be useful for debugging as well, so we should support them in the gRPC API. Touches #11728. ## Summary of changes Add `GetBaseBackupRequest::full` to generate full base backups. The libpq implementation also allows specifying `prev_lsn` for full backups, i.e. the end LSN of the previous WAL record. This is omitted in the gRPC API, since it's not used by any tests, and presumably of limited value since it's autodetected. We can add it later if we find that we need it.	2025-06-18 06:48:39 +00:00
Suhas Thalanki	83069f6ca1	fix: terminate pgbouncer on compute suspend (#12153 ) ## Problem PgBouncer does not terminate connections on a suspend: https://github.com/neondatabase/cloud/issues/16282 ## Summary of changes 1. Adds a pid file to store the pid of PgBouncer 2. Terminates connections on a compute suspend --------- Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2025-06-17 22:56:05 +00:00
Mikhail	7d4f662fbf	upgrade default neon version to 1.6 (#12185 ) Changes for 1.6 were merged and deployed two months ago https://github.com/neondatabase/neon/blob/main/pgxn/neon/neon--1.6--1.5.sql. In order to deploy https://github.com/neondatabase/neon/pull/12183, we need 1.6 to be default, otherwise we can't use prewarm API on read-only replica (`ALTER EXTENSION` won't work) and we need it for promotion	2025-06-17 17:46:35 +00:00
Alexander Bayandin	a5cac52e26	compute-image: add a patch for onnxruntime (#12274 ) ## Problem The checksum for eigen (a dependency for onnxruntime) has changed which breaks compute image build. ## Summary of changes - Add a patch for onnxruntime which backports changes from `f57db79743` (we keep the current version) Ref https://github.com/microsoft/onnxruntime/issues/24861	2025-06-17 16:35:20 +00:00
Konstantin Knizhnik	dfa055f4be	Support event trigger for Neon users (#10624 ) ## Problem https://github.com/neondatabase/neon/issues/7570 Event triggers are supported only for superusers. ## Summary of changes Temporarily switch to superuser when an event trigger is created, and disable execution of user's event triggers under superuser. --------- Co-authored-by: Dimitri Fontaine <dim@tapoueh.org> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-06-17 15:44:50 +00:00
Erik Grinaker	a4c76740c0	pageserver: emit gRPC GetPage errors as responses (#12255 ) ## Problem When converting `proto::GetPageRequest` into `page_api::GetPageRequest` and validating the request, errors are returned as `tonic::Status`. This will tear down the GetPage stream, which is disruptive and unnecessary. ## Summary of changes Emit invalid request errors as `GetPageResponse` with an appropriate `status_code` instead. Also move the conversion from `tonic::Status` to `GetPageResponse` out into the stream handler.	2025-06-17 15:41:17 +00:00
Dmitrii Kovalkov	f2e96b2323	tests: prepare test_compatibility.py for --timelines-onto-safekeepers (#12204 ) ## Problem Compatibility tests may be run against a compatibility snapshot generated with --timelines-onto-safekeepers=false. We need to start the compute without a generation (or with 0 generation) if the timeline is not storcon-managed, otherwise the compute will hang. - Follow up on https://github.com/neondatabase/neon/pull/12203 - Relates to https://github.com/neondatabase/neon/pull/11712 ## Summary of changes - Handle compatibility snapshot generated with no `--timelines-onto-safekeepers` properly	2025-06-17 15:16:07 +00:00
Dmitrii Kovalkov	dee73f0cb4	pageserver: implement max_total_size_bytes limit for basebackup cache (#12230 ) ## Problem The cache was introduced as a hackathon project and the only supported limit was the number of entries. The basebackup entry size may vary. We need to have more control over disk space usage to ship it to production. - Part of https://github.com/neondatabase/cloud/issues/29353 ## Summary of changes - Store the size of entries in the cache and use it to limit `max_total_size_bytes` - Add the size of the cache in bytes to metrics.	2025-06-17 15:08:59 +00:00
Erik Grinaker	edf51688bc	neon_local: support gRPC connstrings for endpoints (#12271 ) ## Problem `neon_local` should support endpoints using gRPC, by providing `grpc://` connstrings with the Pageservers' gRPC ports. Requires #12268. Touches #11926. ## Summary of changes * Add `--grpc` switch for `neon_local endpoint create`. * Generate `grpc://` connstrings for endpoints when enabled. Computes don't actually support `grpc://` connstrings yet, but will soon. gRPC is configured when the endpoint is created, not when it's started, such that it continues to use gRPC across restarts and reconfigurations. In particular, this is necessary for the storage controller's local notify hook, which can't easily plumb through gRPC configuration from the start/reconfigure commands but has access to the endpoint's configuration.	2025-06-17 14:39:42 +00:00
Aleksandr Sarantsev	4a8f3508f9	storcon: Add safekeeper request label group (#12239 ) ## Problem The metrics `storage_controller_safekeeper_request_error` and `storage_controller_safekeeper_request_latency` currently use `pageserver_id` as a label. This can be misleading, as the metrics are about safekeeper requests. We want to replace this with a more accurate label — either `safekeeper_id` or `node_id`. ## Summary of changes - Introduced `SafekeeperRequestLabelGroup` with `safekeeper_id`. - Updated the affected metrics to use the new label group. - Fixed incorrect metric usage in safekeeper_client.rs ## Follow-up - Review usage of these metrics in alerting rules and existing Grafana dashboards to ensure this change does not break something.	2025-06-17 13:33:01 +00:00
Erik Grinaker	48052477b4	storcon: register Pageserver gRPC address (#12268 ) ## Problem Pageservers now expose a gRPC API on a separate address and port. This must be registered with the storage controller such that it can be plumbed through to the compute via cplane. Touches #11926. ## Summary of changes This patch registers the gRPC address and port with the storage controller: * Add gRPC address to `nodes` database table and `NodePersistence`, with a Diesel migration. * Add gRPC address in `NodeMetadata`, `NodeRegisterRequest`, `NodeDescribeResponse`, and `TenantLocateResponseShard`. * Add gRPC address flags to `storcon_cli node-register`. These changes are backwards-compatible, since all structs will ignore unknown fields during deserialization.	2025-06-17 13:27:10 +00:00
Erik Grinaker	d81353b2d1	pageserver: gRPC base backup fixes (#12243 ) ## Problem The gRPC base backup implementation has a few issues: chunks are not properly bounded, and it's not possible to omit the LSN. Touches #11728. ## Summary of changes * Properly bound chunks by using a limited writer. * Use an `Option<Lsn>` rather than a `ReadLsn` (the latter requires an LSN).	2025-06-17 12:37:43 +00:00
Aleksandr Sarantsev	143500dc4f	storcon: Improve stably_attached readability (#12249 ) ## Problem The `stably_attached` function is hard to read due to deeply nested conditionals ## Summary of Changes - Refactored `stably_attached` to use early returns and the `?` operator for improved readability	2025-06-17 10:10:10 +00:00
Aleksandr Sarantsev	1a5f7ce6ad	storcon: Exclude other secondaries while optimizing secondary (#12251 ) ## Problem If the node intent includes more than one secondary, we can generate a replace optimization using a candidate node that is already a secondary location. ## Summary of changes - Exclude all other secondary nodes from the scoring process to ensure optimal candidate selection.	2025-06-17 10:09:55 +00:00
Alexander Lakhin	01ccb34118	Don't rerun failed tests in 'Build and Test with Sanitizers' workflow (#12259 ) ## Problem We could easily miss a sanitizer-detected defect, if it occurred due to some race condition, as we just rerun the test and if it succeeds, the overall test run is considered successful. It was more reasonable before, when we had much more unstable tests in main, but now we can track all test failures. ## Summary of changes Don't rerun failed tests.	2025-06-17 08:08:43 +00:00
Tristan Partin	f669e18477	Remove TODO comment related to default_transaction_read_only (#12261 ) This code has been deployed for a while, so let's remove the TODO, and remove the option passed from the control plane. Link: https://github.com/neondatabase/cloud/pull/30274 Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-06-16 19:38:26 +00:00
Suhas Thalanki	632cde7f13	schema and github workflow for validation of compute manifest (#12069 ) Adds a schema to validate the manifest.yaml described in [this RFC](https://github.com/neondatabase/neon/blob/main/docs/rfcs/038-independent-compute-release.md) and a github workflow to test this.	2025-06-16 19:30:41 +00:00
Alexander Lakhin	118e13438d	Add "Build and Test Fully" workflow (#11931 ) ## Problem We don't test debug builds for v14..v16 in the regular "Build and Test" runs to perform the testing faster, but it means we can't detect assertion failures in those versions. (See https://github.com/neondatabase/neon/issues/11891, https://github.com/neondatabase/neon/issues/11997) ## Summary of changes Add a new workflow to test all build types and all versions on all architectures.	2025-06-16 13:29:39 +00:00
Trung Dinh	fc136eec8f	pagectl: add dump layer local (#12245 ) ## Problem In our environment, we don't always have access to the pagectl tool on the pageserver. We have to download the page files to local env to introspect them. Hence, it'll be useful to be able to parse the local files using `pagectl`. ## Summary of changes * Add `dump-layer-local` to `pagectl` that takes a local path as argument and returns the layer content: ``` cargo run -p pagectl layer dump-layer-local ~/Desktop/000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00003E7A53EDE611-00003E7AF27BFD19-v1-00000001 ``` * Bonus: Fix a bug in `pageserver/ctl/src/draw_timeline_dir.rs` in which we don't filter out temporary files.	2025-06-16 10:29:42 +00:00
Erik Grinaker	818e5130f1	page_api: add a few derives (#12253 ) ## Problem The `page_api` domain types are missing a few derives. ## Summary of changes Add `Clone`, `Copy`, and `Debug` derives for all types where appropriate.	2025-06-16 09:45:50 +00:00
Alexander Sarantcev	c243521ae5	Fix reconcile_long_running metric comment (#12234 ) ## Problem Comment for `storage_controller_reconcile_long_running` metric was copy-pasted and not updated in #9207 ## Summary of changes - Fixed comment	2025-06-16 05:51:57 +00:00
Alexander Sarantcev	5303c71589	Move comment above metrics handler (#12236 ) ## Problem Comment is in incorrect place: `/metrics` code is above its description comment. ## Summary of changes - `/metrics` code is now below the comment	2025-06-13 18:18:51 +00:00
Alexander Sarantcev	d146897415	Fix reconciles metrics typo (#12235 ) ## Problem Need to fix naming `safkeeper` -> `safekeeper` ## Summary of changes - `storage_controller_safkeeper_reconciles_` renamed to `storage_controller_safekeeper_reconciles_`	2025-06-13 17:47:09 +00:00
Alexander Sarantcev	d63815fa40	Fix ChaosInjector shard eligibility bug (#12231 ) ## Problem ChaosInjector is intended to skip non-active scheduling policies, but the current logic skips active shards instead. ## Summary of changes - Fixed shard eligibility condition to correctly allow chaos injection for shards with an Active scheduling policy.	2025-06-13 13:34:29 +00:00
Dmitrii Kovalkov	385324ee8a	pageserver: fix post-merge PR comments on basebackup cache (#12216 ) ## Problem This PR addresses all but the direct IO post-merge comments on basebackup cache implementation. - Follow up on https://github.com/neondatabase/neon/pull/11989#pullrequestreview-2867966119 - Part of https://github.com/neondatabase/cloud/issues/29353 ## Summary of changes - Clean up the tmp directory by recreating it. - Recreate the tmp directory on startup. - Add comments why it's safe to not fsync the inode after renaming.	2025-06-13 08:49:31 +00:00
Alex Chi Z.	8a68d463f6	feat(pagectl): no max key limit if time travel recover locally (#12222 ) ## Problem We would easily hit this limit for a tenant running for a long enough time. ## Summary of changes Remove the max key limit for time-travel recovery if the command is running locally. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-06-13 08:41:10 +00:00
Alex Chi Z.	3046c307da	feat(posthog_client): support feature flag secure API (#12201 ) ## Problem Part of #11813 PostHog has two endpoints to retrieve feature flags: the old project ID one that uses a personal API token, and the new one using a special feature flag secure token that can only retrieve feature flags. The new API I added in this patch is not documented in the PostHog API doc but it's used in their Python SDK. ## Summary of changes Add support for "feature flag secure token API". The API has no way of providing a project ID so we verify if the retrieved spec is consistent with the project ID specified by comparing the `team_id` field. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-06-13 07:22:02 +00:00
Dmitrii Kovalkov	e83f1d8ba5	tests: prepare test_historic_storage_formats for --timelines-onto-safekeepers (#12214 ) ## Problem `test_historic_storage_formats` uses `/tenant_import` to import historic data. Tenant import does not create timelines onto safekeepers, because they might already exist on some safekeeper set. If it does, then we may end up with two different quorums accepting WAL for the same timeline. If the tenant import is used in a real deployment, the administrator is responsible for finding the proper safekeeper set and migrating timelines into storcon-managed timelines. - Relates to https://github.com/neondatabase/neon/pull/11712 ## Summary of changes - Create timelines onto safekeepers manually after tenant import in `test_historic_storage_formats` - Add a note to tenant import that timelines will not be storcon-managed after the import.	2025-06-13 06:28:18 +00:00
Trung Dinh	8917676e86	Improve logging for gc-compaction (#12219 ) ## Problem * Inside `compact_with_gc_inner`, there is a similar log line: `db24ba95d1/pageserver/src/tenant/timeline/compaction.rs (L3181-L3187)` * Also, I think it would be useful when debugging to have the ability to select a particular sub-compaction job (e.g., `1/100`) to see all the logs for that job. ## Summary of changes * Attach a span to the `compact_with_gc_inner`. CC: @skyzh	2025-06-13 06:07:18 +00:00
Ivan Efremov	43acabd4c2	[proxy]: Improve backoff strategy for redis reconnection (#12218 ) Sometimes, during a failed redis connection attempt at the init stage, the proxy pod can restart continuously. This, in turn, can aggravate the problem if redis is overloaded. Solves #11114	2025-06-12 19:46:02 +00:00
Vlad Lazar	db24ba95d1	pageserver: always persist shard identity (#12217 ) ## Problem The location config (which includes the stripe size) is stored on pageserver disk. For unsharded tenants we [do not include the shard identity in the serialized description](`ad88ec9257/pageserver/src/tenant/config.rs (L64-L66)`). When the pageserver restarts, it reads that configuration and will use the stripe size from there and rely on storcon input from reattach for generation and mode. The default deserialization is ShardIdentity::unsharded. This has the new default stripe size of 2048. Hence, for unsharded tenants we can be running with a stripe size different from the one in the storcon observed state. This is not a problem until we shard split without specifying a stripe size (i.e. manual splits via the UI or storcon_cli). When that happens the new shards will use the 2048 stripe size until storcon realises and switches them back. At that point it's too late, since we've ingested data with the wrong stripe sizes. ## Summary of changes Ideally, we would always have the full shard identity on disk. To achieve this over two releases we do: 1. Always persist the shard identity in the location config on the PS. 2. Storage controller includes the stripe size to use in the re-attach response. After the first release, we will start persisting correct stripe sizes for any tenant shard that the storage controller explicitly sends a location_conf. After the second release, the re-attach change kicks in and we'll persist the shard identity for all shards.	2025-06-12 17:15:02 +00:00
Folke Behrens	1dce65308d	Update base64 to 0.22 (#12215 ) ## Problem Base64 0.13 is outdated. ## Summary of changes Update base64 to 0.22. Affects mostly proxy and proxy libs. Also upgrade serde_with to remove another dep on base64 0.13 from dep tree.	2025-06-12 16:12:47 +00:00
Alex Chi Z.	ad88ec9257	fix(pageserver): extend layer manager read guard threshold (#12211 ) ## Problem Follow up of https://github.com/neondatabase/neon/pull/12194 to make the benchmarks run without warnings. ## Summary of changes Extend read guard hold timeout to 30s. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-06-12 08:39:54 +00:00

8094 Commits