
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-15 17:32:56 +00:00

Author	SHA1	Message	Date
Erik Grinaker	42e4e5a418	Add GetPage request splitting	2025-07-03 18:31:12 +02:00
Heikki Linnakangas	96a817fa2b	Fix the case that storage auth token is _not_ used. I broke that in the previous commit while fixing the case of using a token.	2025-07-03 18:39:06 +03:00
Heikki Linnakangas	e7b057f2e8	Fix passing storage JWT token to the communicator process. Makes the 'test_compute_auth_to_pageserver' test pass.	2025-07-03 18:14:22 +03:00
Heikki Linnakangas	956c2f4378	cargo fmt	2025-07-03 16:16:42 +03:00
Heikki Linnakangas	3293e4685e	Fix cases where pageserver gets stuck waiting for LSN. The compute might make a request with an LSN that it hasn't even flushed yet.	2025-07-03 16:14:45 +03:00
Erik Grinaker	6f8650782f	Client tweaks	2025-07-03 14:54:23 +02:00
Erik Grinaker	14214eb853	Add client shard routing	2025-07-03 14:42:35 +02:00
Erik Grinaker	d4b4724921	Sanity-check Pageserver URLs	2025-07-03 14:18:14 +02:00
Erik Grinaker	9aba9550dd	Instrument client methods	2025-07-03 14:11:53 +02:00
Erik Grinaker	375e8e5592	Improve retries and logging	2025-07-03 14:02:43 +02:00
Erik Grinaker	52c586f678	Restructure shard management	2025-07-03 11:51:19 +02:00
Erik Grinaker	de97b73d6e	Lint fixes	2025-07-03 10:38:14 +02:00
Heikki Linnakangas	d8556616c9	Fix running Postgres in "vanilla mode", without neon storage. Some tests do that.	2025-07-03 00:32:40 +03:00
Heikki Linnakangas	d8296e60e6	Fix caching of newly extended pages. This fixes read errors e.g. in the test_compute_catalog.py test (and probably many others).	2025-07-02 23:21:42 +03:00
Heikki Linnakangas	7263d6e2e5	Clarify error message if not_modified_lsn > request_lsn. I'm seeing this error from some Python tests, which means there's a bug in the compute side of course, but it took me a while to figure that out.	2025-07-02 23:21:42 +03:00
Erik Grinaker	12dade35fa	Comment tweaks	2025-07-02 14:47:27 +02:00
Erik Grinaker	1ec63bd6bc	Misc pool improvements	2025-07-02 14:42:06 +02:00
Heikki Linnakangas	7012b4aa90	Remove --grpc options from neon_local endpoint reconfigure and start calls. They don't exist in neon_local anymore, and aren't actually used in tests either.	2025-07-02 15:10:18 +03:00
Heikki Linnakangas	2cc28c75be	Fix "ERROR: could not read size of rel ..." in many regression tests. We were incorrectly skipping the call to communicator_new_rel_create(), which resulted in an error during index build, when the btree build code tried to check the size of the newly-created relation.	2025-07-02 14:10:11 +03:00
Erik Grinaker	bf01145ae4	Remove some old code	2025-07-02 11:46:54 +02:00
Erik Grinaker	8ab8fc11a3	Use new `PageserverClient`	2025-07-02 11:27:56 +02:00
Erik Grinaker	6f0af96a54	Add new PageserverClient	2025-07-02 10:59:40 +02:00
Heikki Linnakangas	9913d2668a	print retried pageserver requests to log. Not sure how verbose we want this to be in production, but for now, more is better. This shows that many tests are failing with errors like these: PG:2025-07-01 23:02:34.311 GMT [1456523] LOG: [COMMUNICATOR] send_process_get_rel_size_request: got error status: NotFound, message: "Read error", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 01 Jul 2025 23:02:34 GMT"} }, retrying I haven't debugged why that is yet. Did the compute make a bogus request?	2025-07-02 02:04:04 +03:00
Heikki Linnakangas	2fefece77d	temporary hack to make regression tests fail faster	2025-07-02 01:42:39 +03:00
Heikki Linnakangas	471191e64e	Fix updating relsize cache during WAL replay. This makes some of the test_runner/regress/test_hot_standby.py tests pass. (Others are still failing.)	2025-07-01 21:22:04 +03:00
Erik Grinaker	f6761760a2	Documentation and tweaks	2025-07-01 17:54:41 +02:00
Erik Grinaker	0bce818d5e	Add stream pool	2025-07-01 17:54:41 +02:00
Erik Grinaker	48be1da6ef	Add initial client pool	2025-07-01 17:54:41 +02:00
Erik Grinaker	d2efc80e40	Add initial ChannelPool	2025-07-01 17:54:41 +02:00
Erik Grinaker	958c2577f5	pageserver: tighten up `page_api::Client`	2025-07-01 17:54:41 +02:00
Heikki Linnakangas	175c2e11e3	Add assertions that the legacy relsize cache is not used with new communicator. And fix a few cases where it was being called.	2025-07-01 16:44:25 +03:00
Heikki Linnakangas	efdb07e7b6	Implement function to check if page is in local cache. This is needed for read replicas. There's one more TODO that needs to be implemented before read replicas work though, in neon_extend_rel_size()	2025-07-01 16:22:51 +03:00
Heikki Linnakangas	b0970b415c	Don't call legacy lfc function when new communicator is used	2025-07-01 15:47:26 +03:00
Heikki Linnakangas	7429dd711c	fix the .metrics.socket filename in the ignore list	2025-06-30 23:41:09 +03:00
Heikki Linnakangas	88ac1e356b	Ignore the metrics unix domain socket in tests	2025-06-30 23:39:01 +03:00
Erik Grinaker	c3cb1ab98d	Merge branch 'main' into communicator-rewrite	2025-06-30 21:07:01 +02:00
Dmitrii Kovalkov	8e216a3a59	storcon: notify cplane on safekeeper membership change (#12390) ## Problem We don't notify cplane about safekeeper membership change yet. Without the notification the compute needs to know all the safekeepers on the cluster to be able to speak to them. Change notifications will allow us to avoid that. - Closes: https://github.com/neondatabase/neon/issues/12188 ## Summary of changes - Implement `notify_safekeepers` method in `ComputeHook` - Notify cplane about safekeepers in `safekeeper_migrate` handler. - Update the test to make sure notifications work. ## Out of scope - There is `cplane_notified_generation` field in `timelines` table in storcon's database. It's not needed now, so it's not updated in the PR. Probably we can remove it. - e2e tests to make sure it works with a production cplane	2025-06-30 14:09:50 +00:00
Erik Grinaker	81ac4ef43a	Add a generic pool prototype	2025-06-30 14:49:34 +02:00
Erik Grinaker	d0a4ae3e8f	pageserver: add gRPC LSN lease support (#12384) ## Problem The gRPC API does not provide LSN leases. ## Summary of changes * Add LSN lease support to the gRPC API. * Use gRPC LSN leases for static computes with `grpc://` connstrings. * Move `PageserverProtocol` into the `compute_api::spec` module and reuse it.	2025-06-30 12:44:17 +00:00
Erik Grinaker	a384d7d501	pageserver: assert no changes to shard identity (#12379) ## Problem Location config changes can currently result in changes to the shard identity. Such changes will cause data corruption, as seen with #12217. Resolves #12227. Requires #12377. ## Summary of changes Assert that the shard identity does not change on location config updates and on (re)attach. This is currently asserted with `critical!`, in case it misfires in production. Later, we should reject such requests with an error and turn this into a proper assertion.	2025-06-30 12:36:45 +00:00
Christian Schwarz	66f53d9d34	refactor(pageserver): force explicit mapping to `CreateImageLayersError::Other` (#12382) Implicit mapping to an `anyhow::Error` when we do `?` is discouraged because tooling to find those places isn't great. As a drive-by, also make `SplitImageLayerWriter::new` infallible and sync. I think we should also make `ImageLayerWriter::new` completely lazy, then `BatchLayerWriter::new` infallible and async.	2025-06-30 11:03:48 +00:00
Erik Grinaker	a5b0fc560c	Fix/allow remaining clippy lints	2025-06-30 12:36:20 +02:00
Busra Kugler	2af9380962	Revert "Replace step-security maintained actions" (#12386) Reverts neondatabase/neon#11663 and https://github.com/neondatabase/neon/pull/11265/ Step Security is not yet approved by the Databricks team. To prevent issues during the GitHub org migration, I'll revert this PR to use the previous action instead of the Step Security maintained action.	2025-06-30 10:15:10 +00:00
Ivan Efremov	620d50432c	Fix path issue in the proxy-bench CI workflow (#12388)	2025-06-30 09:33:57 +00:00
Erik Grinaker	67b04f8ab3	Fix a bunch of linter warnings	2025-06-30 11:10:02 +02:00
Erik Grinaker	1d43f3bee8	pageserver: fix stripe size persistence in legacy HTTP handlers (#12377) ## Problem Similarly to #12217, the following endpoints may result in a stripe size mismatch between the storage controller and Pageserver if an unsharded tenant has a different stripe size set than the default. This can lead to data corruption if the tenant is later manually split without specifying an explicit stripe size, since the storage controller and Pageserver will apply different defaults. This commonly happens with tenants that were created before the default stripe size was changed from 32k to 2k. * `PUT /v1/tenant/config` * `PATCH /v1/tenant/config` These endpoints are no longer in regular production use (they were used when cplane still managed Pageserver directly), but can still be called manually or by tests. ## Summary of changes Retain the current shard parameters when updating the location config in `PUT | PATCH /v1/tenant/config`. Also opportunistically derive `Copy` for `ShardParameters`.	2025-06-30 09:08:44 +00:00
Dmitrii Kovalkov	c746678bbc	storcon: implement safekeeper_migrate handler (#11849) This PR implements a safekeeper migration algorithm from RFC-035 https://github.com/neondatabase/neon/blob/main/docs/rfcs/035-safekeeper-dynamic-membership-change.md#change-algorithm - Closes: https://github.com/neondatabase/neon/issues/11823 It is not production-ready yet, but I think it's good enough to commit and start testing. There are some known issues which will be addressed in later PRs: - https://github.com/neondatabase/neon/issues/12186 - https://github.com/neondatabase/neon/issues/12187 - https://github.com/neondatabase/neon/issues/12188 - https://github.com/neondatabase/neon/issues/12189 - https://github.com/neondatabase/neon/issues/12190 - https://github.com/neondatabase/neon/issues/12191 - https://github.com/neondatabase/neon/issues/12192 ## Summary of changes - Implement `tenant_timeline_safekeeper_migrate` handler to drive the migration - Add possibility to specify number of safekeepers per timeline in tests (`timeline_safekeeper_count`) - Add `term` and `flush_lsn` to `TimelineMembershipSwitchResponse` - Implement compare-and-swap (CAS) operation over timeline in DB for updating membership configuration safely. - Write simple test to verify that migration code works	2025-06-30 08:30:05 +00:00
Erik Grinaker	9d9e3cd08a	Fix `test_normal_work` grpc param	2025-06-30 10:13:46 +02:00
Aleksandr Sarantsev	9bb4688c54	storcon: Remove testing feature from kick_secondary_downloads (#12383) ## Problem Some of the design decisions in PR #12256 were influenced by the requirements of consistency tests. These decisions introduced intermediate logic that is no longer needed and should be cleaned up. ## Summary of Changes - Remove the `feature("testing")` flag related to `kick_secondary_download`. - Set the default value of `kick_secondary_download` back to false, reflecting the intended production behavior. Co-authored-by: Aleksandr Sarantsev <aleksandr.sarantsev@databricks.com>	2025-06-30 05:41:05 +00:00
Heikki Linnakangas	97a8f4ef85	Handle unexpected EOF while doing an LFC read more gracefully. There's a bug somewhere because this happens in python regression tests. We need to hunt that down, but in any case, let's not get stuck in an infinite loop if it happens.	2025-06-30 00:59:53 +03:00


8309 Commits