rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-04 12:02:55 +00:00

Author	SHA1	Message	Date
Tristan Partin	89554af1bd	[BRC-1778] Have PG signal compute_ctl to refresh configuration if it suspects that it is talking to the wrong PSs (#12712 ) ## Problem This is a follow-up to TODO, as part of the effort to rewire the compute reconfiguration/notification mechanism to make it more robust. Please refer to that commit or ticket BRC-1778 for full context of the problem. ## Summary of changes The previous change added mechanism in `compute_ctl` that makes it possible to refresh the configuration of PG on-demand by having `compute_ctl` go out to download a new config from the control plane/HCC. This change wired this mechanism up with PG so that PG will signal `compute_ctl` to refresh its configuration when it suspects that it could be talking to incorrect pageservers due to a stale configuration. PG will become suspicious that it is talking to the wrong pageservers in the following situations: 1. It cannot connect to a pageserver (e.g., getting a network-level connection refused error) 2. It can connect to a pageserver, but the pageserver does not return any data for the GetPage request 3. It can connect to a pageserver, but the pageserver returns a malformed response 4. It can connect to a pageserver, but there is an error receiving the GetPage request response for any other reason This change also includes a minor tweak to `compute_ctl`'s config refresh behavior. Upon receiving a request to refresh PG configuration, `compute_ctl` will reach out to download a config, but it will not attempt to apply the configuration if the config is the same as the old config is it replacing. This optimization is added because the act of reconfiguring itself requires working pageserver connections. In many failure situations it is likely that PG detects an issue with a pageserver before the control plane can detect the issue, migrate tenants, and update the compute config. In this case even the latest compute config won't point PG to working pageservers, causing the configuration attempt to hang and negatively impact PG's time-to-recovery. With this change, `compute_ctl` only attempts reconfiguration if the refreshed config points PG to different pageservers. ## How is this tested? The new code paths are exercised in all existing tests because this mechanism is on by default. Explicitly tested in `test_runner/regress/test_change_pageserver.py`. Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-24 16:44:45 +00:00
Alexander Bayandin	30a7dd630c	ruff: enable TC — flake8-type-checking (#11368 ) ## Problem `TYPE_CHECKING` is used inconsistently across Python tests. ## Summary of changes - Update `ruff`: 0.7.0 -> 0.11.2 - Enable TC (flake8-type-checking): https://docs.astral.sh/ruff/rules/#flake8-type-checking-tc - (auto)fix all new issues	2025-03-30 18:58:33 +00:00
John Spray	9989d8bfae	tests: make Workload more determinstic (#10741 ) ## Problem Previously, Workload was reconfiguring the compute before each run of writes, which was meant to be a no-op when nothing changed, but was actually writing extra data due to an issue being fixed in https://github.com/neondatabase/neon/pull/10696. The row counts in tests were too low in some cases, these tests were only working because of those extra writes that shouldn't have been happening, and moreover were relying on checkpoints happening. ## Summary of changes - Only reconfigure compute if the attached pageserver actually changed. If pageserver is set to None, that means controller is managing everything, so never reconfigure compute. - Update tests that wrote too few rows. --------- Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2025-02-12 12:35:29 +00:00
Alex Chi Z.	a753349cb0	feat(pageserver): validate data integrity during gc-compaction (#10131 ) ## Problem part of https://github.com/neondatabase/neon/issues/9114 part of investigation of https://github.com/neondatabase/neon/issues/10049 ## Summary of changes * If `cfg!(test) or cfg!(feature = testing)`, then we will always try generating an image to ensure the history is replayable, but not put the image layer into the final layer results, therefore discovering wrong key history before we hit a read error. * I suspect it's easier to trigger some races if gc-compaction is continuously run on a timeline, so I increased the frequency to twice per 10 churns. * Also, create branches in gc-compaction smoke tests to get more test coverage. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad@neon.tech>	2025-01-15 22:04:06 +00:00
Alexander Bayandin	6f7aeaa1c5	test_runner: use LFC by default (#8613 ) ## Problem LFC is not enabled by default in tests, but it is enabled in production. This increases the risk of errors in the production environment, which were not found during the routine workflow. However, enabling LFC for all the tests may overload the disk on our servers and increase the number of failures. So, we try enabling LFC in one case to evaluate the possible risk. ## Summary of changes A new environment variable, USE_LFC is introduced. If it is set to true, LFC is enabled by default in all the tests. In our workflow, we enable LFC for PG17, release, x86-64, and disabled for all other combinations. --------- Co-authored-by: Alexey Masterov <alexeymasterov@neon.tech> Co-authored-by: a-masterov <72613290+a-masterov@users.noreply.github.com>	2024-11-25 09:01:05 +00:00
Alex Chi Z.	6f8b1eb5a6	test(pageserver): add detach ancestor smoke test (#9842 ) ## Problem Follow up to https://github.com/neondatabase/neon/pull/9682, hopefully we can detect some issues or assure ourselves that this is ready for production. ## Summary of changes * Add a compaction-detach-ancestor smoke test. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-22 18:21:51 +00:00
Alexander Bayandin	51d26a261b	build(deps): bump mypy from 1.3.0 to 1.13.0 (#9670 ) ## Problem We use a pretty old version of `mypy` 1.3 (released 1.5 years ago), it produces false positives for `typing.Self`. ## Summary of changes - Bump `mypy` from 1.3 to 1.13 - Fix new warnings and errors - Use `typing.Self` whenever we `return self`	2024-11-22 14:31:36 +00:00
Alexander Bayandin	8d1c44039e	Python 3.11 (#9515 ) ## Problem On Debian 12 (Bookworm), Python 3.11 is the latest available version. ## Summary of changes - Update Python to 3.11 in build-tools - Fix ruff check / format - Fix mypy - Use `StrEnum` instead of pair `str`, `Enum` - Update docs	2024-11-21 16:25:31 +00:00
Tristan Partin	d3464584a6	Improve some typing in test_runner Fixes some types, adds some types, and adds some override annotations. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-09 15:42:22 -05:00
Tristan Partin	5bd8e2363a	Enable all pyupgrade checks in ruff This will help to keep us from using deprecated Python features going forward. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-08 14:32:26 -05:00
Alex Chi Z.	700885471f	fix(test): only test num of L1 layers in compaction smoke test (#9186 ) close https://github.com/neondatabase/neon/issues/9160 For whatever reason, pg17's WAL pattern seems different from others, which triggers some flaky behavior within the compaction smoke test. ## Summary of changes * Run L0 compaction before proceeding with the read benchmark. * So that we can ensure the num of L0 layers is 0 and test the compaction behavior only with L1 layers. We have a threshold for triggering L0 compaction. In some cases, the test case did not produce enough L0 layers to do a L0 compaction, therefore leaving the layer map with 3+ L0 layers above the L1 layers. This increases the average read depth for the timeline. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-02 17:42:35 +01:00
Joonas Koivunen	b83d722369	test: fix more flaky due to graceful shutdown (#8787 ) Going through the list of recent flaky tests, trying to fix those related to graceful shutdown. - test_forward_compatibility: flush and wait for uploads to avoid graceful shutdown - test_layer_bloating: in the end the endpoint and vanilla are still up => immediate shutdown - test_lagging_sk: pageserver shutdown is not related to the test => immediate shutdown - test_lsn_lease_size: pageserver flushing is not needed => immediate shutdown Additionally: - remove `wait_for_upload` usage from workload fixture Cc: #8708 Fixes: #8710	2024-08-21 17:22:47 +01:00
Joonas Koivunen	ae6e27274c	refactor(test): unify how we clear shared buffers (#8634 ) so that we can easily plug in LFC clearing as well. Private discussion reference: <https://neondb.slack.com/archives/C033A2WE6BZ/p1722942856987979>	2024-08-13 20:14:42 +03:00
Joonas Koivunen	d9dcbffac3	python: allow using allowed_errors.py (#7719 ) See #7718. Fix it by renaming all `types.py` to `common_types.py`. Additionally, add an advert for using `allowed_errors.py` to test any added regex.	2024-05-13 15:16:23 +03:00
Christian Schwarz	edcaae6290	fixup: PR #7319 defined workload.py `def stop()` twice (#7333 ) Somehow it made it through CI.	2024-04-05 19:11:04 +00:00
John Spray	4fc95d2d71	pageserver: apply shard filtering to blocks ingested during initdb (#7319 ) ## Problem Ingest filtering wasn't being applied to timeline creations, so a timeline created on a sharded tenant would use 20MB+ on each shard (each shard got a full copy). This didn't break anything, but is inefficient and leaves the system in a harder-to-validate state where shards initially have some data that they will eventually drop during compaction. Closes: https://github.com/neondatabase/neon/issues/6649 ## Summary of changes - in `import_rel`, filter block-by-block with is_key_local - During test_sharding_smoke, check that per-shard physical sizes are as expected - Also extend the test to check deletion works as expected (this was an outstanding tech debt task)	2024-04-05 18:07:35 +01:00
John Spray	ac7fc6110b	pageserver: handle WAL gaps on sharded tenants (#6788 ) ## Problem In the test for https://github.com/neondatabase/neon/pull/6776, a test cases uses tiny layer sizes and tiny stripe sizes. This hits a scenario where a shard's checkpoint interval spans a region where none of the content in the WAL is ingested by this shard. Since there is no layer to flush, we do not advance disk_consistent_lsn, and this causes the test to fail while waiting for LSN to advance. ## Summary of changes - Pass an LSN through `layer_flush_start_tx`. This is the LSN to which we have frozen at the time we ask the flush to flush layers frozen up to this point. - In the layer flush task, if the layers we flush do not reach `frozen_to_lsn`, then advance disk_consistent_lsn up to this point. - In `maybe_freeze_ephemeral_layer`, handle the case where last_record_lsn has advanced without writing a layer file: this ensures that disk_consistent_lsn and remote_consistent_lsn advance anyway. The net effect is that the disk_consistent_lsn is allowed to advance past regions in the WAL where a shard ingests no data, and that we uphold our guarantee that remote_consistent_lsn always eventually reaches the tip of the WAL. The case of no layer at all is hard to test at present due to >0 shards being polluted with SLRU writes, but I have tested it locally with a branch that disables SLRU writes on shards >0. We can tighten up the testing on this in future as/when we refine shard filtering (currently shards >0 need the SLRU because they use it to figure out cutoff in GC using timestamp-to-lsn).	2024-04-04 16:54:38 +00:00
Arthur Petukhovsky	ad5efb49ee	Support backpressure for sharding (#7100 ) Add shard_number to PageserverFeedback and parse it on the compute side. When compute receives a new ps_feedback, it calculates min LSNs among feedbacks from all shards, and uses those LSNs for backpressure. Add `test_sharding_backpressure` to verify that backpressure slows down compute to wait for the slowest shard.	2024-03-18 21:54:44 +00:00
John Spray	44f42627dd	pageserver/controller: error handling for shard splitting (#7074 ) ## Problem Shard splits worked, but weren't safe against failures (e.g. node crash during split) yet. Related: #6676 ## Summary of changes - Introduce async rwlocks at the scope of Tenant and Node: - exclusive tenant lock is used to protect splits - exclusive node lock is used to protect new reconciliation process that happens when setting node active - exclusive locks used in both cases when doing persistent updates (e.g. node scheduling conf) where the update to DB & in-memory state needs to be atomic. - Add failpoints to shard splitting in control plane and pageserver code. - Implement error handling in control plane for shard splits: this detaches child chards and ensures parent shards are re-attached. - Crash-safety for storage controller restarts requires little effort: we already reconcile with nodes over a storage controller restart, so as long as we reset any incomplete splits in the DB on restart (added in this PR), things are implicitly cleaned up. - Implement reconciliation with offline nodes before they transition to active: - (in this context reconciliation means something like startup_reconcile, not literally the Reconciler) - This covers cases where split abort cannot reach a node to clean it up: the cleanup will eventually happen when the node is marked active, as part of reconciliation. - This also covers the case where a node was unavailable when the storage controller started, but becomes available later: previously this allowed it to skip the startup reconcile. - Storage controller now terminates on panics. We only use panics for true "should never happen" assertions, and these cases can leave us in an un-usable state if we keep running (e.g. panicking in a shard split). In the unlikely event that we get into a crashloop as a result, we'll rely on kubernetes to back us off. - Add `test_sharding_split_failures` which exercises a variety of failure cases during shard split.	2024-03-14 09:11:57 +00:00
John Spray	84f027357d	pageserver: adjust checkpoint distance for sharded tenants (#6852 ) ## Problem Where the stripe size is the same order of magnitude as the checkpoint distance (such as with default settings), tenant shards can easily pass through `checkpoint_distance` bytes of LSN without actually ingesting anything. This results in emitting many tiny L0 delta layers. ## Summary of changes - Multiply checkpoint distance by shard count before comparing with LSN distance. This is a heuristic and does not guarantee that we won't emit small layers, but it fixes the issue for typical cases where the writes in a (checkpoint_distance * shard_count) range of LSN bytes are somewhat distributed across shards. - Add a test that checks the size of layers after ingesting to a sharded tenant; this fails before the fix. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-02-21 14:12:35 +00:00
John Spray	55b7cde665	tests: add basic coverage for sharding (#6380 ) ## Problem The support for sharding in the pageserver was written before https://github.com/neondatabase/neon/pull/6205 landed, so when it landed we couldn't directly test sharding. ## Summary of changes - Add `test_sharding_smoke` which tests the basics of creating a sharding tenant, creating a timeline within it, checking that data within it is distributed. - Add modes to pg_regress tests for running with 4 shards as well as with 1.	2024-01-26 14:40:47 +00:00
John Spray	b6ec11ad78	control_plane: generalize attachment_service to handle sharding (#6251 ) ## Problem To test sharding, we need something to control it. We could write python code for doing this from the test runner, but this wouldn't be usable with neon_local run directly, and when we want to write tests with large number of shards/tenants, Rust is a better fit efficiently handling all the required state. This service enables automated tests to easily get a system with sharding/HA without the test itself having to set this all up by hand: existing tests can be run against sharded tenants just by setting a shard count when creating the tenant. ## Summary of changes Attachment service was previously a map of TenantId->TenantState, where the principal state stored for each tenant was the generation and the last attached pageserver. This enabled it to serve the re-attach and validate requests that the pageserver requires. In this PR, the scope of the service is extended substantially to do overall management of tenants in the pageserver, including tenant/timeline creation, live migration, evacuation of offline pageservers etc. This is done using synchronous code to make declarative changes to the tenant's intended state (`TenantState.policy` and `TenantState.intent`), which are then translated into calls into the pageserver by the `Reconciler`. Top level summary of modules within `control_plane/attachment_service/src`: - `tenant_state`: structure that represents one tenant shard. - `service`: implements the main high level such as tenant/timeline creation, marking a node offline, etc. - `scheduler`: for operations that need to pick a pageserver for a tenant, construct a scheduler and call into it. - `compute_hook`: receive notifications when a tenant shard is attached somewhere new. Once we have locations for all the shards in a tenant, emit an update to postgres configuration via the neon_local `LocalEnv`. - `http`: HTTP stubs. These mostly map to methods on `Service`, but are separated for readability and so that it'll be easier to adapt if/when we switch to another RPC layer. - `node`: structure that describes a pageserver node. The most important attribute of a node is its availability: marking a node offline causes tenant shards to reschedule away from it. This PR is a precursor to implementing the full sharding service for prod (#6342). What's the difference between this and a production-ready controller for pageservers? - JSON file persistence to be replaced with a database - Limited observability. - No concurrency limits. Marking a pageserver offline will try and migrate every tenant to a new pageserver concurrently, even if there are thousands. - Very simple scheduler that only knows to pick the pageserver with fewest tenants, and place secondary locations on a different pageserver than attached locations: it does not try to place shards for the same tenant on different pageservers. This matters little in tests, because picking the least-used pageserver usually results in round-robin placement. - Scheduler state is rebuilt exhaustively for each operation that requires a scheduler. - Relies on neon_local mechanisms for updating postgres: in production this would be something that flows through the real control plane. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-01-17 18:01:08 +00:00
John Spray	6a922b1a75	tests: start adding tests for secondary mode, live migration (#5842 ) These tests have been loitering on a branch of mine for a while: they already provide value even without all the secondary mode bits landed yet, and the Workload helper is handy for other tests too. - `Workload` is a re-usable test workload that replaces some of the arbitrary "write a few rows" SQL that I've found my self repeating, and adds a systematic way to append data and check that reads properly reflect the changes. This append+validate stuff is important when doing migrations, as we want to detect situations where we might be reading from a pageserver that has not properly seen latest changes. - test_multi_attach is a validation of how the pageserver handles attaching the same tenant to multiple pageservers, from a safety point of view. This is intentionally separate from the larger testing of migration, to provide an isolated environment for multi-attachment. - test_location_conf_churn is a pseudo-random walk through the various states that TenantSlot can be put into, with validation that attached tenants remain externally readable when they should, and as a side effect validating that the compute endpoint's online configuration changes work as expected. - test_live_migration is the reference implementation of how to drive a pair of pageservers through a zero-downtime migration of a tenant. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-12-11 16:55:43 +00:00

23 Commits