rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-10 15:02:56 +00:00

Author	SHA1	Message	Date
a-masterov	e226d7a3d1	Fix docker compose with PG17 (#10165 ) ## Problem It's impossible to run docker compose with compute v17 due to `pg_anon` extension which is not supported under PG17. ## Summary of changes The auto-loading of `pg_anon` is disabled by default	2024-12-17 08:16:54 +00:00
Folke Behrens	aa7ab9b3ac	proxy: Allow dumping TLS session keys for debugging (#10163 ) ## Problem To debug issues with TLS connections there's no easy way to decrypt packets unless a client has special support for logging the keys. ## Summary of changes Add TLS session keys logging to proxy via `SSLKEYLOGFILE` env var gated by flag.	2024-12-16 18:56:24 +00:00
Erik Grinaker	28ccda0a63	test_runner: ignore error in `test_timeline_archival_chaos` (#10161 ) Resolves #10159.	2024-12-16 17:10:55 +00:00
Conrad Ludgate	59b7ff8988	chore(proxy): disallow unwrap and unimplemented (#10142 ) As the title says, I updated the lint rules to no longer allow unwrap or unimplemented. Three special cases: * Tests are allowed to use them * std::sync::Mutex lock().unwrap() is common because it's usually correct to continue panicking on poison * `tokio::spawn_blocking(...).await.unwrap()` is common because it will only error if the blocking fn panics, so continuing the panic is also correct I've introduced two extension traits to help with these last two, that are a bit more explicit so they don't need an expect message every time.	2024-12-16 16:37:15 +00:00
Conrad Ludgate	2e4c9c5704	chore(proxy): remove allow_self_signed from regular proxy (#10157 ) I noticed that the only place we use this flag is for testing console redirect proxy. Makes sense to me to make this assumption more explicit.	2024-12-16 16:11:39 +00:00
Erik Grinaker	3d30a7a934	pageserver: make `RemoteTimelineClient::schedule_index_upload` infallible (#10155 ) Remove an unnecessary `Result` and address a `FIXME`.	2024-12-16 15:54:47 +00:00
Conrad Ludgate	6565fd4056	chore: fix clippy lints 2024-12-06 (#10138 )	2024-12-16 15:33:21 +00:00
Arseny Sher	c5e3314c6e	Add test restarting compute at WAL page boundary (#10111 ) ## Problem We've had similar test in test_logical_replication, but then removed it because it wasn't needed to trigger LR related bug. Restarting at WAL page boundary is still a useful test, so add it separately back. ## Summary of changes Add the test.	2024-12-16 14:53:04 +00:00
Arseny Sher	1ed0e52bc8	Extract safekeeper http client to separate crate. (#10140 ) ## Problem We want to use safekeeper http client in storage controller and neon_local. ## Summary of changes Extract it to separate crate. No functional changes.	2024-12-16 12:07:24 +00:00
Conrad Ludgate	24d6587914	chore(proxy): refactor self-signed config (#10154 ) ## Problem While reviewing #10152 I found it tricky to actually determine whether the connection used `allow_self_signed_compute` or not. I've tried to remove this setting in the past: * https://github.com/neondatabase/neon/pull/7884 * https://github.com/neondatabase/neon/pull/7437 * https://github.com/neondatabase/cloud/pull/13702 But each time it seems it is used by e2e tests ## Summary of changes The `node_info.allow_self_signed_computes` is always initialised to false, and then sometimes inherits the proxy config value. There's no need this needs to be in the node_info, so removing it and propagating it via `TcpMechansim` is simpler.	2024-12-16 11:15:25 +00:00
John Spray	ebcbc1a482	pageserver: tighten up code around SLRU dir key handling (#10082 ) ## Problem Changes in #9786 were functionally complete but missed some edges that made testing less robust than it should have been: - `is_key_disposable` didn't consider SLRU dir keys disposable - Timeline `init_empty` was always creating SLRU dir keys on all shards The result was that when we had a bug (https://github.com/neondatabase/neon/pull/10080), it wasn't apparent in tests, because one would only encounter the issue if running on a long-lived timeline with enough compaction to drop the initially created empty SLRU dir keys, _and_ some CLog truncation going on. Closes: https://github.com/neondatabase/cloud/issues/21516 ## Summary of changes - Update is_key_global and init_empty to handle SLRU dir keys properly -- the only functional impact is that we avoid writing some spurious keys in shards >0, but this makes testing much more robust. - Make `test_clog_truncate` explicitly use a sharded tenant The net result is that if one reverts #10080, then tests fail (i.e. this PR is a reproducer for the issue)	2024-12-16 10:06:08 +00:00
Konstantin Knizhnik	117c1b5dde	Do not perform prefetch for temp relations (#10146 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1734002916827019 With recent prefetch fixes for pg17 and `effective_io_concurrency=100` pg_regress test stats.sql is failed when set temp_buffers to 100. Stream API will try to lock all this 100 buffers for prefetch. ## Summary of changes Disable such behaviour for temp relations. Postgres PR: https://github.com/neondatabase/postgres/pull/548 Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-12-16 06:03:53 +00:00
Erik Grinaker	f3ecd5d76a	pageserver: revert flush backpressure (#8550 ) (#10135 ) ## Problem In #8550, we made the flush loop wait for uploads after every layer. This was to avoid unbounded buildup of uploads, and to reduce compaction debt. However, the approach has several problems: * It prevents upload parallelism. * It prevents flush and upload pipelining. * It slows down ingestion even when there is no need to backpressure. * It does not directly backpressure WAL ingestion (only via `disk_consistent_lsn`), and will build up in-memory layers. * It does not directly backpressure based on compaction debt and read amplification. An alternative solution to these problems is proposed in #8390. In the meanwhile, we revert the change to reduce the impact on ingest throughput. This does reintroduce some risk of unbounded upload/compaction buildup. Until https://github.com/neondatabase/neon/issues/8390, this can be addressed in other ways: * Use `max_replication_apply_lag` (aka `remote_consistent_lsn`), which will more directly limit upload debt. * Shard the tenant, which will spread the flush/upload work across more Pageservers and move the bottleneck to Safekeeper. Touches #10095. ## Summary of changes Remove waiting on the upload queue in the flush loop.	2024-12-15 09:45:12 +00:00
Mikhail Kot	cf161e1556	fix(adapter): password not set in role drop (#10130 ) ## Problem When entry was dropped and password wasn't set, new entry had uninitialized memory in controlplane adapter Resolves: https://github.com/neondatabase/cloud/issues/14914 ## Summary of changes Initialize password in all cases, add tests. Minor formatting for less indentation	2024-12-14 17:37:13 +00:00
Konstantin Knizhnik	2521eba674	Check for invalid down link while prefetching B-Tree leave pages for index-only scan (#9867 ) ## Problem See #9866 Index-only scan prefetch implementation doesn't take in account that down link may be invalid ## Summary of changes Check that downlink is valid block number Correspondent Postgres PRs: https://github.com/neondatabase/postgres/pull/534 https://github.com/neondatabase/postgres/pull/535 https://github.com/neondatabase/postgres/pull/536 https://github.com/neondatabase/postgres/pull/537 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-12-13 20:46:41 +00:00
Alexander Bayandin	d56fea680e	CI: always require aws-oicd-role-arn input to be set (#10145 ) ## Problem `benchmarking` job fails because `aws-oicd-role-arn` input is not set ## Summary of changes: - Set `aws-oicd-role-arn` for `benchmarking job - Always require `aws-oicd-role-arn` to be set - Rename `aws_oicd_role_arn` to `aws-oicd-role-arn` for consistency	2024-12-13 19:56:32 +00:00
Alex Chi Z.	7ee5dca752	fix(pageserver): race between gc-compaction and repartition (#10127 ) ## Problem close https://github.com/neondatabase/neon/issues/10124 gc-compaction split_gc_jobs is holding the repartition lock for too long time. ## Summary of changes * Ensure split_gc_compaction_jobs drops the repartition lock once it finishes cloning the structures. * Update comments. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-12-13 18:22:25 +00:00
Tristan Partin	07d1db54b3	Improve comments and log messages in the logical replication monitor (#9974 ) Improved comments will help others when they read the code, and the log messages will help others understand why the logical replication monitor works the way it does. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-12-13 18:10:42 +00:00
Konstantin Knizhnik	eeabecd89f	Correctly update LFC used_pages in case of LFC resize (#10128 ) ## Problem LFC used_pages statistic is not updated in case of LFC resize (shrinking `neon.file_cache_size_limit`) ## Summary of changes Update `lfc_ctl->used_pages` in `lfc_change_limit_hook` Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-12-13 17:40:26 +00:00
Christian Schwarz	fcff752851	fix(test_timeline_archival_chaos): flakiness caused by orphan layers (#10083 ) The test was failing with the scary but generic message `Remote storage metadata corrupted`. The underlying scrubber error is `Orphan layer detected: ...`. The test kills pageserver at random points, hence it's expected that we leak layers if we're killed in the window after layer upload but before it's referenced from index part. Refer to generation numbers RFC for details. Refs: - fixes https://github.com/neondatabase/neon/issues/9988 - root-cause analysis https://github.com/neondatabase/neon/issues/9988#issuecomment-2520673167	2024-12-13 16:28:21 +00:00
Alexander Bayandin	2c91062828	test_prefetch: reduce timeout to default 5m from 10m (#10105 ) ## Problem `test_prefetch` is flaky (https://github.com/neondatabase/neon/issues/9961), but if it passes, the run time is less than 30 seconds — we don't need an extended timeout for it. ## Summary of changes - Remove extended test timeout for `test_prefetch`	2024-12-13 14:52:54 +00:00
Arseny Sher	ce8eb089f3	Extract public sk types to safekeeper_api (#10137 ) ## Problem We want to extract safekeeper http client to separate crate for use in storage controller and neon_local. However, many types used in the API are internal to safekeeper. ## Summary of changes Move them to safekeeper_api crate. No functional changes. ref https://github.com/neondatabase/neon/issues/9011	2024-12-13 14:06:27 +00:00
a-masterov	7dc382601c	Fix pg_regress tests on a cloud staging instance (#10134 ) ## Problem pg_regress tests start failing due to unique ids added to Neon error messages ## Summary of changes Patches updated	2024-12-13 13:59:04 +00:00
Rahul Patil	2451969d5c	fix(ci): Allow github-action-script to post reports (#10136 ) Allow github-action-script to post reports. Failed CI: https://github.com/neondatabase/neon/actions/runs/12304655364/job/34342554049#step:13:514	2024-12-13 12:22:15 +00:00
JC Grünhage	59ef701925	CI(deploy): fix git tag/release creation (#10119 ) ## Problem When moving the comment on proxy-releases from the yaml doc into a javascript code block, I missed converting the comment marker from `#` to `//`. ## Summary of changes Correctly convert comment marker.	2024-12-12 23:38:20 +00:00
Alexander Bayandin	ac04bad457	CI: don't run debug builds with LFC (#10123 ) ## Problem I've noticed that debug builds with LFC fail more frequently and for some reason ,their failure do block merging (but it should not) ## Summary of changes - Do not run Debug builds with LFC	2024-12-12 22:55:38 +00:00
Peter Bendel	2f3f98a319	use OIDC role instead of AWS access keys for managing test runner (#10117 ) in periodic pagebench workflow ## Problem for background see https://github.com/neondatabase/cloud/issues/21545 ## Summary of changes use OIDC role to manage runners instead of AWS access key which needs to be periodically rotated ## logs seems to work in https://github.com/neondatabase/neon/actions/runs/12298575888/job/34322306127#step:6:1	2024-12-12 20:25:39 +00:00
Alex Chi Z.	5ff4b991c7	feat(pageserver): gc-compaction split over LSN (#9900 ) ## Problem part of https://github.com/neondatabase/neon/issues/9114, stacked PR over https://github.com/neondatabase/neon/pull/9897, partially refactored to help with https://github.com/neondatabase/neon/issues/10031 ## Summary of changes * gc-compaction takes `above_lsn` parameter. We only compact the layers above this LSN, and all data below the LSN are treated as if they are on the ancestor branch. * refactored gc-compaction to take `GcCompactJob` that describes the rectangular range to be compacted. * Added unit test for this case. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-12-12 20:23:24 +00:00
John Spray	a93e3d31cc	storcon: refine logic for choosing AZ on tenant creation (#10054 ) ## Problem When we update our scheduler/optimization code to respect AZs properly (https://github.com/neondatabase/neon/pull/9916), the choice of AZ becomes a much higher-stakes decision. We will pretty much always run a tenant in its preferred AZ, and that AZ is fixed for the lifetime of the tenant (unless a human intervenes) Eventually, when we do auto-balancing based on utilization, I anticipate that part of that will be to automatically change the AZ of tenants if our original scheduling decisions have caused imbalance, but as an interim measure, we can at least avoid making this scheduling decision based purely on which AZ contains the emptiest node. This is a precursor to https://github.com/neondatabase/neon/pull/9947 ## Summary of changes - When creating a tenant, instead of scheduling a shard and then reading its preferred AZ back, make the AZ decision first. - Instead of choosing AZ based on which node is emptiest, use the median utilization of nodes in each AZ to pick the AZ to use. This avoids bad AZ decisions during periods when some node has very low utilization (such as after replacing a dead node) I considered also making the selection a weighted pseudo-random choice based on utilization, but wanted to avoid destabilising tests with that for now.	2024-12-12 19:35:38 +00:00
Rahul Patil	6d5687521b	fix(ci): Allow github-script to post test reports (#10120 ) Allow github-script to post test reports	2024-12-12 18:53:35 +00:00
Heikki Linnakangas	53721266f1	Disable connection logging in pgbouncer by default (#10118 ) It can produce a lot of logs, making pgbouncer itself consume all CPU in extreme cases. We saw that happen in stress testing.	2024-12-12 17:05:58 +00:00
a-masterov	2f3433876f	Change the channel for notification. (#10112 ) ## Problem Now notifications about failures in `pg_regress` tests run on the staging cloud instance, reach the channel `on-call-staging-stream`, while they should reach `on-call-qa-staging-stream` ## Summary of changes The channel changed.	2024-12-12 16:34:07 +00:00
Rahul Patil	58d45c6e86	ci(fix): Use OIDC auth to login on ECR (#10055 ) ## Problem CI currently uses static credentials in some places. These are less secure and hard to maintain, so we are going to deprecate them and use OIDC auth. ## Summary of changes - ci(fix): Use OIDC auth to upload artifact on s3 - ci(fix): Use OIDC auth to login on ECR	2024-12-12 15:13:08 +00:00
Conrad Ludgate	e502e880b5	chore(proxy): remove code for old API (#10109 ) ## Problem Now that https://github.com/neondatabase/cloud/issues/15245 is done, we can remove the old code. ## Summary of changes Removes support for the ManagementV2 API, in favour of the ProxyV1 API.	2024-12-12 13:42:50 +00:00
Arseny Sher	c9a773af37	Fix test_subscriber_synchronous_commit flakiness. (#10057 ) `6f7aeaa` configured LFC for USE_LFC case, but omitted setting shared_buffers for non USE_LFC, causing flakiness. ref https://github.com/neondatabase/neon/issues/9989	2024-12-12 11:57:00 +00:00
Vlad Lazar	ec0ce06c16	tests: default interpreted proto in tests (#10079 ) ## Problem We aren't using the sharded interpreted wal receiver protocol in all tests. ## Summary of changes Default to the interpreted protocol.	2024-12-12 10:53:10 +00:00
Alexander Bayandin	0bd8eca9ca	Storage: create release PRs On Fridays (#10017 ) ## Problem To give Storage more time on preprod — create a release branch on Friday ## Summary of changes - Automatically create Storage release PR on Friday instead of Monday	2024-12-12 09:18:50 +00:00
Misha Sakhnov	739f627b96	Bump vm-builder v0.35.0 -> v0.37.1 (#10015 ) Bump version to pick up changes introduced in the neonvm-daemon to support sys fs based CPU scaling (https://github.com/neondatabase/autoscaling/issues/1082). Previous update: https://github.com/neondatabase/neon/pull/9208	2024-12-12 08:45:52 +00:00
Arpad Müller	342cbea255	storcon: add safekeeper list API (#10089 ) This adds an API to the storage controller to list safekeepers registered to it. This PR does a `diesel print-schema > storage_controller/src/schema.rs` because of an inconsistency between up.sql and schema.rs, introduced by [this](`2c142f14f7`) commit, so there is some updates of `schema.rs` due to that. As a followup to this, we should maybe think about running `diesel print-schema` in CI. Part of #9981	2024-12-12 01:09:24 +00:00
Tristan Partin	b391b29bdc	Improve typing in test_runner/fixtures/httpserver.py (#10103 ) Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-12-11 22:21:42 +00:00
Erik Grinaker	5126ebbfed	test_runner: bump test_check_visibility_map timeout (#10091 ) ## Problem `test_check_visibility_map` has been seen to time out in debug tests. ## Summary of changes Bump the timeout to 10 minutes (test reports indicate 7 minutes is sufficient). We don't want to disable the test entirely in debug builds, to exercise this with debug assertions enabled. Resolves #10069.	2024-12-11 21:37:25 +00:00
Arpad Müller	7fa986bc92	Do tenant manifest validation with index-part (#10007 ) This adds some validation of invariants that we want to uphold wrt the tenant manifest and `index_part.json`: * the data the manifest has about a timeline must match with the data in `index_part.json`. It might actually change, e.g. when we do reparenting during detach ancestor, but that requires the timeline to be unoffloaded, i.e. removed from the manifest. * any timeline mentioned in index part, must, if present, be archived. If we unarchive, we first update the tenant manifest to unoffload, and only then update index part. And one needs to archive before offloading. * it is legal for timelines to be mentioned in the manifest but have no `index_part`: this is a temporary state visible during deletion of the timeline. if the pageserver crashed, an attach of the tenant will clean the state up. * it is also legal for offloaded timelines to have an `ancestor_retain_lsn` of None while having an `ancestor_timeline_id`. This is for the to-be-added flattening functionality: the plan is to set former to None if we have flattened a timeline. follow-up of #9942 part of #8088	2024-12-11 20:10:22 +00:00
Vlad Lazar	e8395807a5	storcon: allow for more concurrency in drain/fill operations (#10093 ) ## Problem We saw the drain/fill operations not drain fast enough in ap-southeast. ## Summary of changes These are some quick changes to speed it up: * double reconcile concurrency - this is now half of the available reconcile bandwidth * reduce the waiter polling timeout - this way we can spawn new reconciliations faster	2024-12-11 19:43:40 +00:00
Vlad Lazar	a3e80448e8	pageserver/storcon: add patch endpoints for tenant config metrics (#10020 ) ## Problem Cplane and storage controller tenant config changes are not additive. Any change overrides all existing tenant configs. This would be fine if both did client side patching, but that's not the case. Once this merges, we must update cplane to use the PATCH endpoint. ## Summary of changes ### High Level Allow for patching of tenant configuration with a `PATCH /v1/tenant/config` endpoint. It takes the same data as it's PUT counterpart. For example the payload below will update `gc_period` and unset `compaction_period`. All other fields are left in their original state. ``` { "tenant_id": "1234", "gc_period": "10s", "compaction_period": null } ``` ### Low Level * PS and storcon gain `PATCH /v1/tenant/config` endpoints. PS endpoint is only used for cplane managed instances. * `storcon_cli` is updated to have separate commands for `set-tenant-config` and `patch-tenant-config` Related https://github.com/neondatabase/cloud/issues/21043	2024-12-11 19:16:33 +00:00
Anastasia Lubennikova	ef233e91ef	Update compute_installed_extensions metric: (#9891 ) add owned_by_superuser field to filter out system extensions. While on it, also correct related code: - fix the metric setting: use set() instead of inc() in a loop. inc() is not idempotent and can lead to incorrect results if the function called multiple times. Currently it is only called at compute start, but this will change soon. - fix the return type of the installed_extensions endpoint to match the metric. Currently it is only used in the test.	2024-12-11 16:43:26 +00:00
Mikhail Kot	dee2041cd3	walproposer: fix link error on debian 12 / ubuntu 22 (#10090 ) ## Problem Linking walproposer library (e.g. `cargo t`) produces linker errors: /home/myrrc/neon/pgxn/neon/walproposer_compat.c:169: undefined reference to `pg_snprintf' The library with these symbols (libpgcommon.a) is present ## Summary of changes Changed order of libraries resolution for linker	2024-12-11 16:23:59 +00:00
Arseny Sher	e4bb1ca7d8	Increase neon_local http client to compute timeout in reconfigure. (#10088 ) Seems like 30s sometimes not enough when CI runners are overloaded, causing pull_timeline flakiness. ref https://github.com/neondatabase/neon/issues/9731#issuecomment-2535946443	2024-12-11 15:46:50 +00:00
a-masterov	b987648e71	Enable LFC for all the PG versions. (#10068 ) ## Problem We added support for LFC for tests but are still using it only for the PG17 release. ## Summary of changes LFC is enabled for all PG versions. Errors in tests with LFC enabled now block merging as usual. We keep tests with disabled LFC for PG17 release. Tests on debug builds with LFC enabled still don't affect permission to merge.	2024-12-11 15:28:10 +00:00
Mikhail Kot	c79c1dd8e9	compute_ctl: don't panic if control plane can't be reached (#10078 ) ## Problem If the control plane cannot be reached for some reason, compute_ctl panics ## Summary of changes panic is removed in favour of returning an error. Code is reformatted a bit for more flat control flow Resolves: #5391	2024-12-11 15:03:11 +00:00
Vlad Lazar	a53db73851	pageserver: don't drop multixact slrus on non zero shards (#10086 ) ## Problem We get slru truncation commands on non-zero shards. Compaction will drop the slru dir keys and ingest will fail when receiving such records. https://github.com/neondatabase/neon/pull/10080 fixed it for clog, but not for multixact. ## Summary of changes Only truncate multixact slrus on shard zero. I audited the rest of the ingest code and it looks fine from this pov.	2024-12-11 14:28:18 +00:00

1 2 3 4 5 ...

6792 Commits