Commit Graph

8158 Commits

Author SHA1 Message Date
Ruslan Talpa
424004ec95 apply cargo fmt 2025-06-30 12:32:47 +03:00
Ruslan Talpa
88d1a78260 cleanup the rest path code 2025-06-30 12:30:33 +03:00
Ruslan Talpa
8e544c7f99 import introspection queries instead of loading from files 2025-06-27 14:40:02 +03:00
Ruslan Talpa
4f49fc5b79 move common error types and http related functions to error.rs and http_util.rs 2025-06-27 13:37:29 +03:00
Ruslan Talpa
5461039c3f implement remote config fetch from the db and cache introspected schema 2025-06-27 10:20:26 +03:00
Ruslan Talpa
d6c36d103e subzero integration WIP6
beginning work on introspection of config and schema shape from the database
2025-06-26 14:42:31 +03:00
Ruslan Talpa
fbb2416685 pg 14 vendor commit changed 2025-06-26 10:25:54 +03:00
Ruslan Talpa
8072fae2fe Merge branch 'main' into ruslan/subzero-integration 2025-06-26 10:19:16 +03:00
Ruslan Talpa
3869d680f9 use a global parsed/cached schema 2025-06-26 10:14:28 +03:00
Konstantin Knizhnik
be23eae3b6 Mark pages as available in LFC only after generation check (#12350)
## Problem

If the LFC generation has changed, then `lfc_readv_select` will return -1, but
pages are still marked as available in the bitmap.

## Summary of changes

Update bitmap after generation check.

Co-authored-by: Konstantin Knizhnik <konstantin.knizhnik@databricks.com>
2025-06-26 07:06:27 +00:00
Alex Chi Z.
6f70885e11 fix(pageserver): allow refresh_interval to be empty (#12349)
## Problem

Fix for https://github.com/neondatabase/neon/pull/12324

## Summary of changes

`serde(default)` is needed to allow this field to be absent from the config;
otherwise there will be a config deserialization error.
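
A minimal sketch of the mechanism, with a hypothetical struct, field name, and default value (not the actual pageserver config types):

```rust
use serde::Deserialize;

// Hypothetical config section for illustration only. Without the
// `#[serde(default = ...)]` attribute, a config that omits
// `refresh_interval_secs` fails to deserialize; with it, the field falls back
// to the given default.
#[derive(Deserialize, Debug)]
struct FeatureFlagSection {
    #[serde(default = "default_refresh_interval_secs")]
    refresh_interval_secs: u64,
}

fn default_refresh_interval_secs() -> u64 {
    600 // illustrative default, not necessarily the real one
}

fn main() {
    // The key is absent, yet deserialization succeeds and the default applies.
    let section: FeatureFlagSection = serde_json::from_str("{}").unwrap();
    assert_eq!(section.refresh_interval_secs, 600);
}
```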

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-25 22:15:03 +00:00
Erik Grinaker
f755979102 pageserver: payload compression for gRPC base backups (#12346)
## Problem

gRPC base backups use gRPC compression. However, this has two problems:

* Base backup caching will cache compressed base backups (making gRPC
compression pointless).
* Tonic does not support varying the compression level, and zstd's default
level is 10% slower than gzip's fastest level.

Touches https://github.com/neondatabase/neon/issues/11728.
Touches https://github.com/neondatabase/cloud/issues/29353.

## Summary of changes

This patch adds a gRPC parameter `BaseBackupRequest::compression`
specifying the compression algorithm. It also moves compression into
`send_basebackup_tarball` to reduce code duplication.

A follow-up PR will integrate the base backup cache with gRPC.
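
A hedged sketch of the dispatch this describes, assuming the `async-compression` crate with its tokio/gzip/zstd features; the enum and function names are illustrative, not the actual pageserver API:

```rust
use std::pin::Pin;

use async_compression::Level;
use async_compression::tokio::write::{GzipEncoder, ZstdEncoder};
use tokio::io::AsyncWrite;

// The request carries a compression choice, and the server wraps the tarball
// writer in the matching encoder instead of relying on transport-level gRPC
// compression.
pub enum BaseBackupCompression {
    None,
    Gzip,
    Zstd,
}

pub fn wrap_writer<W>(
    writer: W,
    compression: BaseBackupCompression,
) -> Pin<Box<dyn AsyncWrite + Send>>
where
    W: AsyncWrite + Send + 'static,
{
    match compression {
        BaseBackupCompression::None => Box::pin(writer),
        // Fastest levels keep CPU cost down; per the description above, zstd's
        // default level is ~10% slower than gzip's fastest level.
        BaseBackupCompression::Gzip => Box::pin(GzipEncoder::with_quality(writer, Level::Fastest)),
        BaseBackupCompression::Zstd => Box::pin(ZstdEncoder::with_quality(writer, Level::Fastest)),
    }
}
```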
2025-06-25 18:16:23 +00:00
Matthias van de Meent
1d49eefbbb RFC: Endpoint Persistent Unlogged Files Storage (#9661)
## Summary
A design for a storage system that stores the files required to give
Neon's Endpoints a better experience at or after a reboot.

## Motivation
Several systems inside PostgreSQL (and Neon) need some persistent
storage to work optimally across reboots and restarts, but still work
without it. Examples are the cumulative statistics file in
`pg_stat/global.stat`, `pg_stat_statements`' `pg_stat/pg_stat_statements.stat`,
and `pg_prewarm`'s `autoprewarm.blocks`. We need a storage system that
can store and manage these files for each Endpoint.

[GH rendered
file](https://github.com/neondatabase/neon/blob/MMeent/rfc-unlogged-file/docs/rfcs/040-Endpoint-Persistent-Unlogged-Files-Storage.md)

Part of https://github.com/neondatabase/cloud/issues/24225
2025-06-25 16:25:57 +00:00
Alex Chi Z.
6c77638ea1 feat(storcon): retrieve feature flag and pass to pageservers (#12324)
## Problem

part of https://github.com/neondatabase/neon/issues/11813

## Summary of changes

It costs $$$ for each pageserver to retrieve the feature flags directly.
Therefore, this patch adds new APIs so that the feature flag spec is
retrieved by the storcon and pushed to the pageservers.

* Storcon retrieves the feature flags and sends them to the pageservers.
* If the feature flags get updated outside of the pageserver's normal
refresh loop, the pageserver won't fetch the flags on its own as long as
the time since the last update is <= refresh_period.
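
A minimal sketch of that skip rule; the type and field names are hypothetical:

```rust
use std::time::{Duration, Instant};

// Illustrative state, not the actual pageserver code: if the flags were
// pushed by the storage controller recently, the pageserver's own refresh
// loop skips fetching, as long as the last update is within `refresh_period`.
struct FeatureFlagState {
    last_updated: Instant,
    refresh_period: Duration,
}

impl FeatureFlagState {
    fn should_fetch(&self, now: Instant) -> bool {
        // Fetch only when the last update (from any source) is older than the
        // refresh period.
        now.duration_since(self.last_updated) > self.refresh_period
    }
}
```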

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-25 14:58:18 +00:00
Conrad Ludgate
517a3d0d86 [proxy]: BatchQueue::call is not cancel safe - make it directly cancellation aware (#12345)
## Problem

https://github.com/neondatabase/cloud/issues/30539

If the current leader cancels the `call` function, then it has removed
the jobs from the queue, but will never finish sending the responses.
Because of this, it is not cancellation safe.

## Summary of changes

Document these functions as not cancellation safe. Move cancellation of
the queued jobs into the queue itself.

## Alternatives considered

1. We could spawn the task that runs the batch, since that won't get
cancelled.
* This requires `fn call(self: Arc<Self>)` or `fn call(&'static self)`.
2. We could add another scopeguard and return the requests back to the
queue.
* This requires that requests are always retry safe, and also requires
requests to be `Clone`.
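
A hedged sketch of the hazard, with illustrative types rather than the actual proxy code; dropping the future at the await point loses the drained jobs and their response channels:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

use tokio::sync::oneshot;

// Illustrative job/queue shapes. The fix described above keeps ownership of
// queued jobs (and their cancellation) inside the queue itself.
struct Job {
    payload: u64,
    respond_to: oneshot::Sender<u64>,
}

struct BatchQueue {
    jobs: Mutex<VecDeque<Job>>,
}

impl BatchQueue {
    // NOT cancel-safe: jobs leave the queue before the work completes.
    async fn call_not_cancel_safe(&self) {
        let batch: Vec<Job> = self.jobs.lock().unwrap().drain(..).collect();

        // A cancellation at this await point drops `batch` and its response
        // channels, so the waiting callers never receive their results.
        tokio::task::yield_now().await;

        for job in batch {
            let _ = job.respond_to.send(job.payload * 2);
        }
    }
}
```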
2025-06-25 14:19:20 +00:00
Ruslan Talpa
d3fa228d92 move subzero local test files to a "dot" folder 2025-06-25 16:43:50 +03:00
Conrad Ludgate
27ca1e21be [console_redirect_proxy]: fix channel binding (#12238)
## Problem

While working more on TLS to compute, I realised that Console Redirect
-> pg-sni-router -> compute would break if channel binding was set to
prefer. This is because the channel binding data would differ between
Console Redirect -> pg-sni-router vs pg-sni-router -> compute.

I also noticed that I actually disabled channel binding in #12145, since
`connect_raw` would think that the connection didn't support TLS.

## Summary of changes

Make sure we specify the channel binding.
Make sure that `connect_raw` can see if we have TLS support.
2025-06-25 13:41:30 +00:00
Arpad Müller
1dc01c9bed Support cancellations of timelines with hanging ondemand downloads (#12330)
In `test_layer_download_cancelled_by_config_location`, we simulate hung
downloads via the `before-downloading-layer-stream-pausable` failpoint.
Then, we cancel a timeline via the `location_config` endpoint.

With the new default as of
https://github.com/neondatabase/neon/pull/11712, we would be creating
the timeline on safekeepers regardless of whether there have been writes or not,
and it turns out the test relied on the timeline not existing on
safekeepers, due to a cancellation bug:

* as established before, the test makes the read path hang
* the timeline cancellation function first cancels the walreceiver, and
only then cancels the timeline's token
* `WalIngest::new` is requesting a checkpoint, which hits the read path
* at cancellation time, we'd be hanging inside the read, not seeing the
cancellation of the walreceiver
* the test would time out due to the hang

This is probably also reproducible in the wild when there are S3
unavailabilities or bottlenecks, so we thought it worthwhile to fix
the hang issue. The approach chosen in the end involves the
`tokio::select` macro.

In PR 11712, we originally punted on the test due to the hang and opted
it out of the new default, but now we can use the new default.

Part of https://github.com/neondatabase/neon/issues/12299
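
A minimal sketch of the shape of the fix, assuming `tokio_util`'s `CancellationToken`; the names are illustrative, not the actual pageserver code:

```rust
use tokio_util::sync::CancellationToken;

// Race the potentially hanging operation against the timeline's cancellation
// token, so a shutdown/cancel request is observed even while the download or
// read is stuck.
async fn read_with_cancellation<F, T>(
    fut: F,
    cancel: &CancellationToken,
) -> Result<T, &'static str>
where
    F: std::future::Future<Output = T>,
{
    tokio::select! {
        res = fut => Ok(res),
        _ = cancel.cancelled() => Err("timeline cancelled while waiting on read"),
    }
}
```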
2025-06-25 13:40:38 +00:00
Ruslan Talpa
be6a259b85 subzero integration WIP5
cleanup and postprocess the response and set the correct headers/status and handle errors
2025-06-25 14:33:45 +03:00
Ruslan Talpa
af3ca24a5e remove unused enum values 2025-06-25 14:32:46 +03:00
Heikki Linnakangas
7c4c36f5ac Remove unnecessary separate installation of libpq (#12287)
`make install` compiles and installs libpq. Remove redundant separate
step to compile and install it.
2025-06-25 10:47:56 +00:00
Tristan Partin
a2d623696c Update pgaudit to latest versions (#12328)
These updates contain some bug fixes and are completely backwards
compatible with what we currently support in Neon.

Link: https://github.com/pgaudit/pgaudit/compare/1.6.2...1.6.3
Link: https://github.com/pgaudit/pgaudit/compare/1.7.0...1.7.1
Link: https://github.com/pgaudit/pgaudit/compare/16.0...16.1
Link: https://github.com/pgaudit/pgaudit/compare/17.0...17.1
Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
2025-06-25 09:03:02 +00:00
Tristan Partin
aa75722010 Set pgaudit.log=none for monitoring connections (#12137)
pgaudit can spam logs due to all the monitoring that we do. Logs from
these connections are not necessary for HIPAA compliance, so we can stop
logging from those connections.

Part-of: https://github.com/neondatabase/cloud/issues/29574

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-06-24 17:42:23 +00:00
Matthias van de Meent
6c6de6382a Use enum-typed PG versions (#12317)
This makes it possible for the compiler to validate that a match block
matched all PostgreSQL versions we support.

## Problem
We did not have a complete picture of which places we had to test
against PG versions, and what format those versions were in: the full PG
version ID format (major/minor/bugfix, `MMmmbb`) as transferred in
protocol messages, or only the major release version (`MM`). This meant
type confusion was rampant.

With this change, it becomes easier to develop new version-dependent
features, by making type and niche confusion impossible.

## Summary of changes
Every use of `pg_version` is now typed as either `PgVersionId` (a u32
valued in decimal `MMmmbb`) or `PgMajorVersion` (an enum with a variant for
every major version we support, serialized and stored like a u32 with
the value of that major version).
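
A hedged sketch of the two types described above; the exact definitions in the codebase may differ:

```rust
// Full version ID in the `MMmmbb` form described above, e.g. 160003.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PgVersionId(u32);

// Restricted to the major versions we support, so a `match` must cover every
// variant and the compiler flags any gap.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
enum PgMajorVersion {
    V14 = 14,
    V15 = 15,
    V16 = 16,
    V17 = 17,
}

impl PgVersionId {
    fn major(self) -> Option<PgMajorVersion> {
        match self.0 / 10000 {
            14 => Some(PgMajorVersion::V14),
            15 => Some(PgMajorVersion::V15),
            16 => Some(PgMajorVersion::V16),
            17 => Some(PgMajorVersion::V17),
            _ => None, // unknown or unsupported major version
        }
    }
}
```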

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2025-06-24 17:25:31 +00:00
Dmitry Savelev
158d84ea30 Switch the billing metrics storage format to ndjson. (#12338)
## Problem

The billing team wants to change the billing events pipeline and use a
common events format in S3 buckets across different event producers.

## Summary of changes

Change the events storage format for billing events from JSON to NDJSON.

Resolves: https://github.com/neondatabase/cloud/issues/29994
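
A minimal sketch of the format change, with an illustrative event shape: NDJSON is simply one JSON object per line rather than a single JSON document, which keeps the files appendable and line-splittable:

```rust
use serde::Serialize;

// Illustrative event shape, not the actual billing schema.
#[derive(Serialize)]
struct BillingEvent {
    tenant_id: String,
    metric: String,
    value: u64,
}

fn to_ndjson(events: &[BillingEvent]) -> Result<String, serde_json::Error> {
    let mut out = String::new();
    for event in events {
        out.push_str(&serde_json::to_string(event)?); // one compact object...
        out.push('\n'); // ...terminated by a newline
    }
    Ok(out)
}
```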
2025-06-24 15:36:36 +00:00
Conrad Ludgate
4dd9ca7b04 [proxy]: authenticate to compute after connect_to_compute (#12335)
## Problem

PGLB will do the connect_to_compute logic, and neonkeeper will do the
session establishment logic. We should split it.

## Summary of changes

Moves postgres authentication to compute to a separate routine that
happens after connect_to_compute.
2025-06-24 14:15:36 +00:00
Ruslan Talpa
8b44f5b479 subzero integration WIP5
extract the response body from the local proxy response
2025-06-24 17:03:42 +03:00
Ruslan Talpa
d1445cf3eb subzero integration WIP4
queries generated by subzero reach the database and execute successfully
2025-06-24 15:33:51 +03:00
Arpad Müller
552249607d apply clippy fixes for 1.88.0 beta (#12331)
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.

This is therefore a preparation PR (like similar PRs before it).

There are a lot of changes in this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).

The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.
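
A small illustration of the `uninlined_format_args` lint that drives most of the diff:

```rust
fn main() {
    let tenant = "example";
    // Before (now warned by clippy's `style` group):
    println!("loading tenant {}", tenant);
    // After (`cargo clippy --fix` rewrites it to):
    println!("loading tenant {tenant}");
}
```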

---------

Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2025-06-24 10:12:42 +00:00
Ivan Efremov
a29772bf6e Create proxy-bench periodic run in CI (#12242)
Currently run for testing only, via pushing to the test-proxy-bench branch.

Relates to #22681
2025-06-24 09:54:43 +00:00
Arpad Müller
0efff1db26 Allow cancellation errors in tests that allow timeline deletion errors (#12315)
After merging PR https://github.com/neondatabase/neon/pull/11712, we
saw some tests become flaky, with errors showing up about the timeline
having been cancelled instead of having been deleted. This outcome is
inherently racy with the "has been deleted" error.

In some instances, https://github.com/neondatabase/neon/pull/11712 had
already added the error about the timeline having been cancelled. This
PR adds it to the remaining instances affected by
https://github.com/neondatabase/neon/pull/11712, fixing the flakiness.
2025-06-23 22:26:38 +00:00
Aleksandr Sarantsev
5eecde461d storcon: Fix migration for Attached(0) tenants (#12256)
## Problem

`Attached(0)` tenant migrations can get stuck if the heatmap file has
not been uploaded.

## Summary of Changes

- Added a test to reproduce the issue.
- Introduced a `kick_secondary_downloads` config flag:
  - Enabled in testing environments.
  - Disabled in production (and in the new test).
- Updated `Attached(0)` locations to consider the number of secondaries
in their intent when deciding whether to download the heatmap.
2025-06-23 18:55:26 +00:00
Alex Chi Z.
85164422d0 feat(pageserver): support force overriding feature flags (#12233)
## Problem

Part of #11813 

## Summary of changes

Add a test API to make it easier to manipulate the feature flags within
tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-23 17:31:53 +00:00
John Spray
6c3aba7c44 storcon: adjust AZ selection for heterogeneous AZs (#12296)
## Problem

The scheduler uses total shards per AZ to select the AZ for newly
created or attached tenants.

This makes bad decisions when we have different node counts per AZ -- we
might have 2 very busy pageservers in one AZ, and 4 more lightly loaded
pageservers in other AZs, and the scheduler picks the busy pageservers
because the total shard count in their AZ is lower.

## Summary of changes

- Divide the shard count by the number of nodes in the AZ when scoring
in `get_az_for_new_tenant`
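
A minimal sketch of the adjusted scoring; the function is illustrative, not the actual `get_az_for_new_tenant` code:

```rust
// Normalise the shard count by the number of nodes in each AZ, so a small AZ
// with a few busy pageservers no longer looks "emptier" than a larger,
// lightly loaded one.
fn az_score(total_shards_in_az: usize, nodes_in_az: usize) -> f64 {
    total_shards_in_az as f64 / nodes_in_az.max(1) as f64
}

fn main() {
    // 2 busy pageservers with 200 shards vs 4 lighter ones with 300 shards:
    // raw shard counts would favour the busy AZ (200 < 300), but the
    // per-node score correctly prefers the larger AZ (100.0 vs 75.0).
    assert!(az_score(200, 2) > az_score(300, 4));
}
```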

---------

Co-authored-by: John Spray <john.spray@databricks.com>
2025-06-23 15:50:31 +00:00
Erik Grinaker
68a175d545 test_runner: fix test_basebackup_with_high_slru_count gzip param (#12319)
The `--gzip-probability` parameter was removed in #12250. However,
`test_basebackup_with_high_slru_count` still uses it, and keeps failing.

This patch removes the use of the parameter (gzip is enabled by
default).
2025-06-23 15:33:45 +00:00
Alex Chi Z.
5e2c444525 fix(pageserver): reduce default feature flag refresh interval (#12246)
## Problem

Part of #11813 

## Summary of changes

The current interval is 30s and it costs a lot of $$$. This patch
reduces it to a 600s refresh interval (which means it takes 10 min for
feature flags to propagate from the UI to the pageserver). In the future
we can let storcon retrieve the feature flags and push them to pageservers.
We can consider creating a new release, or we can postpone this to the
week after next.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-23 13:51:21 +00:00
Heikki Linnakangas
8d711229c1 ci: Fix bogus skipping of 'make all' step in CI (#12318)
The 'make all' step must always run. PR #12311 accidentally left in the
condition to skip it if there were no changes in the postgres v14
sources. That condition belonged to a whole different step that was
removed altogether in PR #12311, and the condition should've been removed
too.

Per CI failure:
https://github.com/neondatabase/neon/actions/runs/15820148967/job/44587394469
2025-06-23 13:23:33 +00:00
Vlad Lazar
0e490f3be7 pageserver: allow concurrent rw IO on in-mem layer (#12151)
## Problem

Previously, we couldn't read from an in-memory layer while a batch was
being written to it. Vice versa, we couldn't write to it while there
was an ongoing read.

## Summary of Changes

The goal of this change is to improve concurrency. Writes happened
through a `&mut self` method, so the exclusion was enforced at the type
system level.

We attempt to improve by:
1. Adding interior mutability to EphemeralLayer. This involves wrapping
   the buffered writer in a read-write lock.
2. Minimise the time that the read lock is held for. Only hold the read
   lock while reading from the buffers (recently flushed or pending
   flush). If we need to read from the file, drop the lock and allow IO
   to be concurrent.
   
The new benchmark variants with concurrent reads improve by 70 to
200 percent (against main).
Benchmark results are in this
[commit](891f094ce6).
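
A hedged sketch of the locking scheme, with illustrative types rather than the actual `EphemeralLayer` code:

```rust
use std::sync::RwLock;

// Illustrative in-memory layer: the writer state lives behind a RwLock so
// reads and writes can interleave, and the read lock is held only while
// copying out of the in-memory buffer. A miss would fall back to file IO done
// without holding the lock.
struct InMemoryLayerInner {
    // Pending/recently flushed data that can be served without file IO.
    buffer: Vec<u8>,
}

struct InMemoryLayer {
    inner: RwLock<InMemoryLayerInner>,
}

impl InMemoryLayer {
    fn write(&self, bytes: &[u8]) {
        // Interior mutability: writes no longer need `&mut self`.
        self.inner.write().unwrap().buffer.extend_from_slice(bytes);
    }

    fn read_from_buffer(&self, offset: usize, len: usize) -> Option<Vec<u8>> {
        // Hold the read lock only while reading from the in-memory buffer;
        // it is released as soon as this function returns.
        let inner = self.inner.read().unwrap();
        inner.buffer.get(offset..offset + len).map(|s| s.to_vec())
    }
}
```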

## Future Changes

We can push the interior mutability into the buffered writer. The
mutable tail goes under a read lock, the flushed part goes into an
ArcSwap and then we can read from anything that is flushed _without_ any
locking.
2025-06-23 13:17:30 +00:00
Erik Grinaker
7e41ef1bec pageserver: set gRPC basebackup chunk size to 256 KB (#12314)
gRPC base backups send a stream of fixed-size 64KB chunks.

pagebench basebackup with compression enabled shows that this reduces
throughput:

* 64 KB: 55 RPS
* 128 KB: 69 RPS
* 256 KB: 73 RPS
* 1024 KB: 73 RPS

This patch sets the base backup chunk size to 256 KB.
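
A hedged sketch of the chunking itself (illustrative, not the actual gRPC streaming code):

```rust
// Larger chunks amortise per-message overhead; 256 KB is the size chosen
// above based on the pagebench numbers.
const BASE_BACKUP_CHUNK_SIZE: usize = 256 * 1024;

fn into_chunks(mut data: Vec<u8>) -> Vec<Vec<u8>> {
    let mut chunks = Vec::new();
    while data.len() > BASE_BACKUP_CHUNK_SIZE {
        // Keep the first 256 KB as a chunk and continue with the rest.
        let rest = data.split_off(BASE_BACKUP_CHUNK_SIZE);
        chunks.push(data);
        data = rest;
    }
    if !data.is_empty() {
        chunks.push(data);
    }
    chunks
}
```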
2025-06-23 12:41:11 +00:00
Heikki Linnakangas
7916aa26e0 Stop using build-tools image in compute image build (#12306)
The build-tools image contains various build tools and dependencies,
mostly Rust-related. The compute image build used it to build
compute_ctl and a few other little rust binaries that are included in
the compute image. However, for extensions built in Rust (pgrx), the
build used a different layer which installed the rust toolchain using
rustup.

Switch to using the same rust toolchain for both pgrx-based extensions
and compute_ctl et al. Since we don't need anything else from the
build-tools image, I switched to using the toolchain installed with
rustup, and eliminated the dependency on build-tools altogether. The
compute image build no longer depends on build-tools.

Note: We no longer use 'mold' for linking compute_ctl et al, since mold
is not included in the build-deps-with-cargo layer. We could add it
there, but it doesn't seem worth it. I proposed stopping using mold
altogether in https://github.com/neondatabase/neon/pull/10735, but that
was rejected because 'mold' is faster for incremental builds. That
doesn't matter much for docker builds however, since they're not
incremental, and the compute binaries are not as large as the storage
server binaries anyway.
2025-06-23 09:11:05 +00:00
Heikki Linnakangas
52ab8f3e65 Use make all in the "Build and Test locally" CI workflow (#12311)
To avoid duplicating the build logic. `make all` covers the separate
`postgres-*` and `neon-pg-ext` steps, and also does `cargo build`.
That's how you would typically do a full local build anyway.
2025-06-23 09:10:32 +00:00
Ruslan Talpa
67d3026fc4 subzero integration WIP3
* query makes it to the database
2025-06-23 11:48:55 +03:00
Ruslan Talpa
09e62e9b98 subzero integration WIP2 2025-06-23 10:11:06 +03:00
Heikki Linnakangas
3d822dbbde Refactor Makefile rules for building the extensions under pgxn/ (#12305) 2025-06-22 19:43:14 +00:00
Heikki Linnakangas
af46b5286f Avoid recompiling postgres_ffi when there have been no changes (#12292)
Every time you run `make`, it runs `make install` on all the PostgreSQL
sources, which copies the header files. That in turn triggers a rebuild
of the `postgres_ffi` crate, and everything that depends on it. We had
worked around this earlier (see #2458), by passing a custom INSTALL
script to the Postgres makefiles, which refrains from updating the
modification timestamp on headers when they have not been changed, but
the v14 makefile didn't obey INSTALL for the header files. Backporting
c0a1d7621b to v14 fixes that.

This backports upstream PostgreSQL commit c0a1d7621b to v14.

Corresponding PR in the 'postgres' repo:
https://github.com/neondatabase/postgres/pull/660
2025-06-21 21:07:38 +00:00
Erik Grinaker
47f7efee06 pageserver: require stripe size (#12257)
## Problem

In #12217, we began passing the stripe size in reattach responses, and
persisting it in the on-disk state. This is necessary to ensure the
storage controller and Pageserver have a consistent view of the intended
stripe size of unsharded tenants, which will be used for splits that do
not specify a stripe size. However, for backwards compatibility, these
stripe sizes were optional.

## Summary of changes

Make the stripe sizes required for reattach responses and on-disk
location configs. These will always be provided by the previous
(current) release.
2025-06-21 15:01:29 +00:00
Tristan Partin
868c38f522 Rename the compute_ctl admin scope to compute_ctl:admin (#12263)
Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-06-20 22:49:05 +00:00
Tristan Partin
c8b2ac93cf Allow the control plane to override any Postgres connection options (#12262)
The previous behavior was for the compute to override control plane
options if there was a conflict. We want to change the behavior so that
the control plane has absolute authority over what is right. In the event
that we need a new option passed to the compute as soon as possible, we
can initially roll it out in the control plane, and then migrate the
option to EXTRA_OPTIONS within the compute later, for instance.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-06-20 18:46:30 +00:00
Dmitrii Kovalkov
b2954d16ff storcon, neon_local: add timeline_safekeeper_count (#12303)
## Problem
We need to specify the number of safekeepers for neon_local without the
`testing` feature.
We also need this option for testing different configurations of the
safekeeper migration code.

We cannot set it in `neon_fixtures.py` or in the default config of
`neon_local` yet, because that would fail the compatibility tests. I'll make a
separate PR that removes `cfg!("testing")` completely and specifies
this option in the config once it reaches the release branch.

- Part of https://github.com/neondatabase/neon/issues/12298

## Summary of changes
- Add `timeline_safekeeper_count` config option to storcon and
neon_local
2025-06-20 16:03:17 +00:00
Alex Chi Z.
79485e7c3a feat(pageserver): enable gc-compaction by default everywhere (#12105)
Enable it across tests and set it as the default. This marks the first milestone
of https://github.com/neondatabase/neon/issues/9114. We already enabled
it in all AWS regions and are planning to enable it in all Azure regions
next week.

Will merge after we roll out in all regions.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-06-20 15:35:11 +00:00