rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-29 19:10:38 +00:00

Author	SHA1	Message	Date
Alexander Bayandin	30a7dd630c	ruff: enable TC — flake8-type-checking (#11368 ) ## Problem `TYPE_CHECKING` is used inconsistently across Python tests. ## Summary of changes - Update `ruff`: 0.7.0 -> 0.11.2 - Enable TC (flake8-type-checking): https://docs.astral.sh/ruff/rules/#flake8-type-checking-tc - (auto)fix all new issues	2025-03-30 18:58:33 +00:00
Vlad Lazar	9fc7c22cc9	storcon: add use_local_compute_notifications flag (#11333 ) ## Problem While working on bulk import, I want to use the `control-plane-url` flag for a different request. Currently, the local compute hook is used whenever no control plane is specified in the config. My test requires local compute notifications and a configured `control-plane-url` which isn't supported. ## Summary of changes Add a `use-local-compute-notifications` flag. When this is set, we use the local flow regardless of other config values. It's enabled by default in neon_local and disabled by default in all other envs. I had to turn the flag off in tests that wish to bypass the local flow, but that's expected. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2025-03-21 15:31:06 +00:00
Arpad Müller	b1a1be6a4c	switch pytests and neon_local to control_plane_hooks_api (#11195 ) We want to switch away from and deprecate the `--compute-hook-url` param for the storcon in favour of `--control-plane-url` because it allows us to construct urls with `notify-safekeepers`. This PR switches the pytests and neon_local from a `control_plane_compute_hook_api` to a new param named `control_plane_hooks_api` which is supposed to point to the parent of the `notify-attach` URL. We still support reading the old url from disk to not be too disruptive with existing deployments, but we just ignore it. Also add docs for the `notify-safekeepers` upcall API. Follow-up of #11173 Part of https://github.com/neondatabase/neon/issues/11163	2025-03-13 19:50:52 +00:00
Vlad Lazar	435bf452e6	tests: remove obsolete err log whitelisting (#11069 ) The pageserver read path now supports overlapped in-memory and image layers via https://github.com/neondatabase/neon/pull/11000. These allowed errors are now obsolete.	2025-03-04 08:18:19 +00:00
John Spray	ae463f366b	tests: broaden allow-list for #10720 workaround (#10807 ) ## Problem In #10752 I used an overly-strict regex that only ignored error on a particular key. ## Summary of changes - Drop key from regex so it matches all such errors	2025-02-13 16:15:04 +00:00
John Spray	9989d8bfae	tests: make Workload more determinstic (#10741 ) ## Problem Previously, Workload was reconfiguring the compute before each run of writes, which was meant to be a no-op when nothing changed, but was actually writing extra data due to an issue being fixed in https://github.com/neondatabase/neon/pull/10696. The row counts in tests were too low in some cases, these tests were only working because of those extra writes that shouldn't have been happening, and moreover were relying on checkpoints happening. ## Summary of changes - Only reconfigure compute if the attached pageserver actually changed. If pageserver is set to None, that means controller is managing everything, so never reconfigure compute. - Update tests that wrote too few rows. --------- Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2025-02-12 12:35:29 +00:00
John Spray	fcedd10226	tests: temporarily permit a log error (#10752 ) ## Problem These tests can encounter a bug in the pageserver read path (#9185) which occurs under the very specific circumstances that the tests create, but is very unlikely to happen in the field. We will fix the bug, but in the meantime let's un-flake the tests. Related: https://github.com/neondatabase/neon/issues/10720 ## Summary of changes - Permit "could not find data for key" errors in tests affected by #9185	2025-02-11 12:37:09 +00:00
John Spray	fd1368d31e	storcon: rework scheduler optimisation, prioritize AZ (#9916 ) ## Problem We want to do a more robust job of scheduling tenants into their home AZ: https://github.com/neondatabase/neon/issues/8264. Closes: https://github.com/neondatabase/neon/issues/8969 ## Summary of changes ### Scope This PR combines prioritizing AZ with a larger rework of how we do optimisation. The rationale is that just bumping AZ in the order of Score attributes is a very tiny change: the interesting part is lining up all the optimisation logic to respect this properly, which means rewriting it to use the same scores as the scheduler, rather than the fragile hand-crafted logic that we had before. Separating these changes out is possible, but would involve doing two rounds of test updates instead of one. ### Scheduling optimisation `TenantShard`'s `optimize_attachment` and `optimize_secondary` methods now both use the scheduler to pick a new "favourite" location. Then there is some refined logic for whether + how to migrate to it: - To decide if a new location is sufficiently "better", we generate scores using some projected ScheduleContexts that exclude the shard under consideration, so that we avoid migrating from a node with AffinityScore(2) to a node with AffinityScore(1), only to migrate back later. - Score types get a `for_optimization` method so that when we compare scores, we will only do an optimisation if the scores differ by their highest-ranking attributes, not just because one pageserver is lower in utilization. Eventually we _will_ want a mode that does this, but doing it here would make scheduling logic unstable and harder to test, and to do this correctly one needs to know the size of the tenant that one is migrating. - When we find a new attached location that we would like to move to, we will create a new secondary location there, even if we already had one on some other node. This handles the case where we have a home AZ A, and want to migrate the attachment between pageservers in that AZ while retaining a secondary location in some other AZ as well. - A unit test is added for https://github.com/neondatabase/neon/issues/8969, which is implicitly fixed by reworking optimisation to use the same scheduling scores as scheduling.	2025-01-13 19:33:00 +00:00
Vlad Lazar	f4739d49e3	pageserver: tweak interpreted ingest record metrics (#10291 ) ## Problem The filtered record metric doesn't make sense for interpreted ingest. ## Summary of changes While of dubious utility in the first place, this patch replaces them with records received and records observed metrics for interpreted ingest: * received records cause the pageserver to do _something_: write a key, value pair to storage, update some metadata or flush pending modifications * observed records are a shard 0 concept and contain only key metadata used in tracking relation sizes (received records include observed records)	2025-01-09 12:31:02 +00:00
John Spray	fd230227f2	storcon: include preferred AZ in compute notifications (#9953 ) ## Problem It is unreliable for the control plane to infer the AZ for computes from where the tenant is currently attached, because if a tenant happens to be in a degraded state or a release is ongoing while a compute starts, then the tenant's attached AZ can be a different one to where it will run long-term, and the control plane doesn't check back later to restart the compute. This can land in parallel with https://github.com/neondatabase/neon/pull/9947 ## Summary of changes - Thread through the preferred AZ into the compute hook code via the reconciler - Include the preferred AZ in the body of compute hook notifications	2024-12-17 20:04:09 +00:00
Tristan Partin	b391b29bdc	Improve typing in test_runner/fixtures/httpserver.py (#10103 ) Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-12-11 22:21:42 +00:00
John Spray	aaee713e53	storcon: use proper schedule context during node delete (#9958 ) ## Problem I was touching `test_storage_controller_node_deletion` because for AZ scheduling work I was adding a change to the storage controller (kick secondaries during optimisation) that made a FIXME in this test defunct. While looking at it I also realized that we can easily fix the way node deletion currently doesn't use a proper ScheduleContext, using the iterator type recently added for that purpose. ## Summary of changes - A testing-only behavior in storage controller where if a secondary location isn't yet ready during optimisation, it will be actively polled. - Remove workaround in `test_storage_controller_node_deletion` that previously was needed because optimisation would get stuck on cold secondaries. - Update node deletion code to use a `TenantShardContextIterator` and thereby a proper ScheduleContext	2024-12-03 08:59:38 +00:00
Erik Grinaker	5330122049	test_runner: improve `wait_until` (#9936 ) Improves `wait_until` by: * Use `timeout` instead of `iterations`. This allows changing the timeout/interval parameters independently. * Make `timeout` and `interval` optional (default 20s and 0.5s). Most callers don't care. * Only output status every 1s by default, and add optional `status_interval` parameter. * Remove `show_intermediate_error`, this was always emitted anyway. Most callers have been updated to use the defaults, except where they had good reason otherwise.	2024-12-02 10:26:15 +00:00
Alexander Bayandin	51d26a261b	build(deps): bump mypy from 1.3.0 to 1.13.0 (#9670 ) ## Problem We use a pretty old version of `mypy` 1.3 (released 1.5 years ago), it produces false positives for `typing.Self`. ## Summary of changes - Bump `mypy` from 1.3 to 1.13 - Fix new warnings and errors - Use `typing.Self` whenever we `return self`	2024-11-22 14:31:36 +00:00
Alexander Bayandin	8d1c44039e	Python 3.11 (#9515 ) ## Problem On Debian 12 (Bookworm), Python 3.11 is the latest available version. ## Summary of changes - Update Python to 3.11 in build-tools - Fix ruff check / format - Fix mypy - Use `StrEnum` instead of pair `str`, `Enum` - Update docs	2024-11-21 16:25:31 +00:00
John Spray	67f5f83edc	pageserver: avoid reading SLRU blocks for GC on shards >0 (#9423 ) ## Problem SLRU blocks, which can add up to several gigabytes, are currently ingested by all shards, multiplying their capacity cost by the shard count and slowing down ingest. We do this because all shards need the SLRU pages to do timestamp->LSN lookup for GC. Related: https://github.com/neondatabase/neon/issues/7512 ## Summary of changes - On non-zero shards, learn the GC offset from shard 0's index instead of calculating it. - Add a test `test_sharding_gc` that exercises this - Do GC in test_pg_regress as a general smoke test that GC functions run (e.g. this would fail if we were using SLRUs we didn't have) In this PR we are still ingesting SLRUs everywhere, but not using them any more. Part 2 PR (https://github.com/neondatabase/neon/pull/9786) makes the change to not store them at all. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2024-11-20 15:56:14 +00:00
John Spray	593e35027a	tests: use fewer pageservers in test_sharding_split_smoke (#9804 ) ## Problem This test uses a gratuitous number of pageservers (16). This works fine when there are plenty of system resources, but causes issues on test runners that have limited resources and run many tests concurrently. Related: https://github.com/neondatabase/neon/issues/9802 ## Summary of changes - Split from 2 shards to 4, instead of 4 to 8 - Don't give every shard a separate pageserver, let two locations share each pageserver. Net result is 4 pageservers instead of 16	2024-11-20 14:57:59 +00:00
Alexander Bayandin	e9dcfa2eb2	test_runner: skip more tests using decorator instead of pytest.skip (#9704 ) ## Problem Running `pytest.skip(...)` in a test body instead of marking the test with `@pytest.mark.skipif(...)` makes all fixtures to be initialised, which is not necessary if the test is going to be skipped anyway. Also, some tests are unnecessarily skipped (e.g. `test_layer_bloating` on Postgres 17, or `test_idle_reconnections` at all) or run (e.g. `test_parse_project_git_version_output_positive` more than on once configuration) according to comments. ## Summary of changes - Move `skip_on_postgres` / `xfail_on_postgres` / `run_only_on_default_postgres` decorators to `fixture.utils` - Add new `skip_in_debug_build` and `skip_on_ci` decorators - Replace `pytest.skip(...)` calls with decorators where possible	2024-11-11 18:07:01 +00:00
Tristan Partin	ecde8d7632	Improve type safety according to pyright Pyright found many issues that mypy doesn't seem to want to catch or mypy isn't configured to catch. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-08 14:43:15 -06:00
Christian Schwarz	6f5c262684	pageserver: add testing API to scan layers for disposable keys (#9393 ) This PR adds a pageserver mgmt API to scan a layer file for disposable keys. It hooks it up to the sharding compaction test, demonstrating that we're not filtering out all disposable keys. This is extracted from PGDATA import (https://github.com/neondatabase/neon/pull/9218) where I do the filtering of layer files based on `is_key_disposable`.	2024-10-25 14:16:45 +02:00
Arpad Müller	4d9036bf1f	Support offloaded timelines during shard split (#9489 ) Before, we didn't copy over the `index-part.json` of offloaded timelines to the new shard's location, resulting in the new shard not knowing the timeline even exists. In #9444, we copy over the manifest, but we also need to do this for `index-part.json`. As the operations to do are mostly the same between offloaded and non-offloaded timelines, we can iterate over all of them in the same loop, after the introduction of a `TimelineOrOffloadedArcRef` type to generalize over the two cases. This is analogous to the deletion code added in #8907. The added test also ensures that the sharded archival config endpoint works, something that has not yet been ensured by tests. Part of #8088	2024-10-25 12:32:46 +02:00
Tristan Partin	d3464584a6	Improve some typing in test_runner Fixes some types, adds some types, and adds some override annotations. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-09 15:42:22 -05:00
Tristan Partin	5bd8e2363a	Enable all pyupgrade checks in ruff This will help to keep us from using deprecated Python features going forward. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-08 14:32:26 -05:00
Heikki Linnakangas	19db9e9aad	tests: Replace direct calls to neon_cli with wrappers in NeonEnv (#9195 ) Add wrappers for a few commands that didn't have them before. Move the logic to generate tenant and timeline IDs from NeonCli to the callers, so that NeonCli is more purely just a type-safe wrapper around 'neon_local'.	2024-10-03 22:03:22 +03:00
Yuchen Liang	1708743e78	pageserver: wait for lsn lease duration after transition into AttachedSingle (#9024 ) Part of #7497, closes https://github.com/neondatabase/neon/issues/8890. ## Problem Since leases are in-memory objects, we need to take special care of them after pageserver restarts and while doing a live migration. The approach we took for pageserver restart is to wait for at least lease duration before doing first GC. We want to do the same for live migration. Since we do not do any GC when a tenant is in `AttachedStale` or `AttachedMulti` mode, only the transition from `AttachedMulti` to `AttachedSingle` requires this treatment. ## Summary of changes - Added `lsn_lease_deadline` field in `GcBlock::reasons`: the tenant is temporarily blocked from GC until we reach the deadline. This information does not persist to S3. - In `GCBlock::start`, skip the GC iteration if we are blocked by the lsn lease deadline. - In `TenantManager::upsert_location`, set the lsn_lease_deadline to `Instant::now() + lsn_lease_length` so the granted leases have a chance to be renewed before we run GC for the first time after transitioned from AttachedMulti to AttachedSingle. Signed-off-by: Yuchen Liang <yuchen@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-09-19 17:27:10 +01:00
Heikki Linnakangas	8dc069037b	Remove NeonEnvBuilder.start() function It feels wrong to me to start() from the builder object. Surely the thing you start is the environment itself, not its configuration.	2024-09-12 01:28:56 +03:00
John Spray	b65a95f12e	controller: use PageserverUtilization for scheduling (#8711 ) ## Problem Previously, the controller only used the shard counts for scheduling. This works well when hosting only many-sharded tenants, but works much less well when hosting single-sharded tenants that have a greater deviation in size-per-shard. Closes: https://github.com/neondatabase/neon/issues/7798 ## Summary of changes - Instead of UtilizationScore, carry the full PageserverUtilization through into the Scheduler. - Use the PageserverUtilization::score() instead of shard count when ordering nodes in scheduling. Q: Why did test_sharding_split_smoke need updating in this PR? A: There's an interesting side effect during shard splits: because we do not decrement the shard count in the utilization when we de-schedule the shards from before the split, the controller will now prefer to pick _different_ nodes for shards compared with which ones held secondaries before the split. We could use our knowledge of splitting to fix up the utilizations more actively in this situation, but I'm leaning toward leaving the code simpler, as in practical systems the impact of one shard on the utilization of a node should be fairly low (single digit %).	2024-08-23 18:32:56 +01:00
Yuchen Liang	ed5724d79d	scrubber: clean up `scan_metadata` before prod (#8565 ) Part of #8128. ## Problem Currently, scrubber `scan_metadata` command will return with an error code if the metadata on remote storage is corrupted with fatal errors. To safely deploy this command in a cronjob, we want to differentiate between failures while running scrubber command and the erroneous metadata. At the same time, we also want our regression tests to catch corrupted metadata using the scrubber command. ## Summary of changes - Return with error code only when the scrubber command fails - Uses explicit checks on errors and warnings to determine metadata health in regression tests. Resolve conflict with `tenant-snapshot` command (after shard split): [`test_scrubber_tenant_snapshot`](https://github.com/neondatabase/neon/blob/yuchen/scrubber-scan-cleanup-before-prod/test_runner/regress/test_storage_scrubber.py#L23) failed before applying `422a8443dd` - When taking a snapshot, the old `index_part.json` in the unsharded tenant directory is not kept. - The current `list_timeline_blobs` implementation consider no `index_part.json` as a parse error. - During the scan, we are only analyzing shards with highest shard count, so we will not get a parse error. but we do need to add the layers to tenant object listing, otherwise we will get index is referencing a layer that is not in remote storage error. - Action: Add s3_layers from `list_timeline_blobs` regardless of parsing error Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-08-06 18:55:42 +01:00
Alexander Bayandin	6cad0455b0	CI(test_runner): Upload all test artifacts if preserve_database_files is enabled (#7990 ) ## Problem There's a `NeonEnvBuilder#preserve_database_files` parameter that allows you to keep database files for debugging purposes (by default, files get cleaned up), but there's no way to get these files from a CI run. This PR adds handling of `NeonEnvBuilder#preserve_database_files` and adds the compressed test output directory to Allure reports (for tests with this parameter enabled). Ref https://github.com/neondatabase/neon/issues/6967 ## Summary of changes - Compress and add the whole test output directory to Allure reports - Currently works only with `neon_env_builder` fixture - Remove `preserve_database_files = True` from sharding tests as unneeded --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-27 20:01:10 +01:00
John Spray	65868258d2	tests: checkpoint instead of compact in test_sharding_split_compaction (#8473 ) ## Problem This test relies on writing image layers before the split. It can fail to do so durably if the image layers are written ahead of the remote consistent LSN, so we should have been doing a checkpoint rather than just a compaction	2024-07-26 11:03:44 +01:00
John Spray	24ea9f9f60	tests: always scrub on test exit when using S3Storage (#8437 ) ## Problem Currently, tests may have a scrub during teardown if they ask for it, but most tests don't request it. To detect "unknown unknowns", let's run it at the end of every test where possible. This is similar to asserting that there are no errors in the log at the end of tests. ## Summary of changes - Remove explicit `enable_scrub_on_exit` - Always scrub if remote storage is an S3Storage.	2024-07-25 14:19:38 +01:00
Vlad Lazar	9c5ad21341	storcon: make heartbeats restart aware (#8222 ) ## Problem Re-attach blocks the pageserver http server from starting up. Hence, it can't reply to heartbeats until that's done. This makes the storage controller mark the node off-line (not good). We worked around this by setting the interval after which nodes are marked offline to 5 minutes. This isn't a long term solution. ## Summary of changes * Introduce a new `NodeAvailability` state: `WarmingUp`. This state models the following time interval: * From receiving the re-attach request until the pageserver replies to the first heartbeat post re-attach * The heartbeat delta generator becomes aware of this state and uses a separate longer interval * Flag `max-warming-up-interval` now models the longer timeout and `max-offline-interval` the shorter one to match the names of the states Closes https://github.com/neondatabase/neon/issues/7552	2024-07-25 14:09:12 +01:00
John Spray	44781518d0	storage scrubber: GC ancestor shard layers (#8196 ) ## Problem After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space. ## Summary of changes - Extend the physical GC command with `--mode=full`, which includes cleaning up unreferenced ancestor shard layers - Add test `test_scrubber_physical_gc_ancestors` - Remove colored log output: in testing this is irritating ANSI code spam in logs, and in interactive use doesn't add much. - Refactor storage controller API client code out of storcon_client into a `storage_controller/client` crate - During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant, and there is no shard split in progress.	2024-07-19 19:07:59 +03:00
John Spray	e89ec55ea5	tests: stabilize test_sharding_split_compaction (#8318 ) ## Problem This test incorrectly assumed that a post-split compaction would only drop content. This was easily destabilized by any changes to image generation rules. ## Summary of changes - Before split, do a full image layer generation pass, to guarantee that post-split compaction should only drop data, never create it. - Fix the force_image_layer_creation mode of compaction that we use from tests like this: previously it would try and generate image layers even if one already existed with the same layer key, which caused compaction to fail.	2024-07-10 14:14:10 +01:00
John Spray	c9e6dd45d3	pageserver: downgrade stale generation messages to INFO (#8256 ) ## Problem When generations were new, these messages were an important way of noticing if something unexpected was going on. We found some real issues when investigating tests that unexpectedly tripped them. At time has gone on, this code is now pretty battle-tested, and as we do more live migrations etc, it's fairly normal to see the occasional message from a node with a stale generation. At this point the cognitive load on developers to selectively allow-list these logs outweighs the benefit of having them at warn severity. Closes: https://github.com/neondatabase/neon/issues/8080 ## Summary of changes - Downgrade "Dropped remote consistent LSN updates" and "Dropping stale deletions" messages to INFO - Remove all the allow-list entries for these logs.	2024-07-04 15:05:41 +01:00
John Spray	b8bbaafc03	storage controller: fix heatmaps getting disabled during shard split (#8197 ) ## Problem At the start of do_tenant_shard_split, we drop any secondary location for the parent shards. The reconciler uses presence of secondary locations as a condition for enabling heatmaps. On the pageserver, child shards inherit their configuration from parents, but the storage controller assumes the child's ObservedState is the same as the parent's config from the prepare phase. The result is that some child shards end up with inaccurate ObservedState, and until something next migrates or restarts, those tenant shards aren't uploading heatmaps, so their secondary locations are downloading everything that was resident at the moment of the split (including ancestor layers which are often cleaned up shortly after the split). Closes: https://github.com/neondatabase/neon/issues/8189 ## Summary of changes - Use PlacementPolicy to control enablement of heatmap upload, rather than the literal presence of secondaries in IntentState: this way we avoid switching them off during shard split - test: during tenant split test, assert that the child shards have heatmap uploads enabled.	2024-06-28 18:27:13 +01:00
John Spray	47fdf93cf0	tests: fix a flake in test_sharding_split_compaction (#8136 ) ## Problem This test could occasionally trigger a "removing local file ... because it has unexpected length log" when using the `compact-shard-ancestors-persistent` failpoint is in use, which is unexpected because that failpoint stops the process when the remote metadata is in sync with local files. It was because there are two shards on the same pageserver, and while the one being compacted explicitly stops at the failpoint, another shard was compacting in the background and failing at an unclean point. The test intends to disable background compaction, but was mistakenly revoking the value of `compaction_period` when it updated `pitr_interval`. Example failure: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8123/9602976462/index.html#/testresult/7dd6165da7daef40 ## Summary of changes - Update `TENANT_CONF` in the test to use properly typed values, so that it is usable in pageserver APIs as well as via neon_local. - When updating tenant config with `pitr_interval`, retain the overrides from the start of the test, so that there won't be any background compaction going on during the test.	2024-06-24 14:54:54 +01:00
Arpad Müller	27518676d7	Rename S3 scrubber to storage scrubber (#8013 ) The S3 scrubber contains "S3" in its name, but we want to make it generic in terms of which storage is used (#7547). Therefore, rename it to "storage scrubber", following the naming scheme of already existing components "storage broker" and "storage controller". Part of #7547	2024-06-11 22:45:22 +00:00
Joonas Koivunen	b52e31c1a4	fix: allow layer flushes more often (#7927 ) As seen with the pgvector 0.7.0 index builds, we can receive large batches of images, leading to very large L0 layers in the range of 1GB. These large layers are produced because we are only able to roll the layer after we have witnessed two different Lsns in a single `DataDirModification::commit`. As the single Lsn batches of images can span over multiple `DataDirModification` lifespans, we will rarely get to write two different Lsns in a single `put_batch` currently. The solution is to remember the TimelineWriterState instead of eagerly forgetting it until we really open the next layer or someone else flushes (while holding the write_guard). Additional changes are test fixes to avoid "initdb image layer optimization" or ignoring initdb layers for assertion. Cc: #7197 because small `checkpoint_distance` will now trigger the "initdb image layer optimization"	2024-06-10 13:50:17 +00:00
John Spray	3860bc9c6c	pageserver: post-shard-split layer rewrites (2/2) (#7531 ) ## Problem - After a shard split of a large existing tenant, child tenants can end up with oversized historic layers indefinitely, if those layers are prevented from being GC'd by branchpoints. This PR follows https://github.com/neondatabase/neon/pull/7531, and adds rewriting of layers that contain a mixture of needed & un-needed contents, in addition to dropping un-needed layers. Closes: https://github.com/neondatabase/neon/issues/7504 ## Summary of changes - Add methods to ImageLayer for reading back existing layers - Extend `compact_shard_ancestors` to rewrite layer files that contain a mixture of keys that we want and keys we do not, if unwanted keys are the majority of those in the file. - Amend initialization code to handle multiple layers with the same LayerName properly - Get rid of of renaming bad layer files to `.old` since that's now expected on restarts during rewrites.	2024-05-24 08:33:19 +00:00
Joonas Koivunen	a8a88ba7bc	test(detach_ancestor): ensure L0 compaction in history is ok (#7813 ) detaching a timeline from its ancestor can leave the resulting timeline with more L0 layers than the compaction threshold. most of the time, the detached timeline has made progress, and next time the L0 -> L1 compaction happens near the original branch point and not near the last_record_lsn. add a test to ensure that inheriting the historical L0s does not change fullbackup. additionally: - add `wait_until_completed` to test-only timeline checkpoint and compact HTTP endpoints. with `?wait_until_completed=true` the endpoints will wait until the remote client has completed uploads. - for delta layers, describe L0-ness with the `/layer` endpoint Cc: #6994	2024-05-21 20:08:43 +03:00
John Spray	c84656a53e	pageserver: implement auto-splitting (#7681 ) ## Problem Currently tenants are only split into multiple shards if a human being calls the API to do it. Issue: #7388 ## Summary of changes - Add a pageserver API for returning the top tenants by size - Add a step to the controller's background loop where if there is no reconciliation or optimization to be done, it looks for things to split. - Add a test that runs pgbench on many tenants concurrently, and checks that splitting happens as expected as tenants grow, without interrupting the client I/O. This PR is quite basic: there is a tasklist in https://github.com/neondatabase/neon/issues/7388 for further work. This PR is meant to be safe (off by default), and sufficient to enable our staging environment to run lots of sharded tenants without a human having to set them up.	2024-05-17 16:01:24 +00:00
Joonas Koivunen	d9dcbffac3	python: allow using allowed_errors.py (#7719 ) See #7718. Fix it by renaming all `types.py` to `common_types.py`. Additionally, add an advert for using `allowed_errors.py` to test any added regex.	2024-05-13 15:16:23 +03:00
John Spray	af849a1f61	pageserver: post-shard-split layer trimming (1/2) (#7572 ) ## Problem After a shard split of a large existing tenant, child tenants can end up with oversized historic layers indefinitely, if those layers are prevented from being GC'd by branchpoints. This PR is followed by https://github.com/neondatabase/neon/pull/7531 Related issue: https://github.com/neondatabase/neon/issues/7504 ## Summary of changes - Add a new compaction phase `compact_shard_ancestors`, which identifies layers that are no longer needed after a shard split. - Add a Timeline->LayerMap code path called `rewrite_layers` , which is currently only used to drop layers, but will later be used to rewrite them as well in https://github.com/neondatabase/neon/pull/7531 - Add a new test that compacts after a split, and checks that something is deleted. Note that this doesn't have much impact on a tenant's resident size (since unused layers would end up evicted anyway), but it: - Makes index_part.json much smaller - Makes the system easier to reason about: avoid having tenants which are like "my physical size is 4TiB but don't worry I'll never actually download it", instead have tenants report the real physical size of what they might download. Why do we remove these layers in compaction rather than during the split? Because we have existing split tenants that need cleaning up. We can add it to the split operation in future as an optimization.	2024-05-07 11:15:58 +01:00
John Spray	b5a6e68e68	storage controller: check warmth of secondary before doing proactive migration (#7583 ) ## Problem The logic in Service::optimize_all would sometimes choose to migrate a tenant to a secondary location that was only recently created, resulting in Reconciler::live_migrate hitting its 5 minute timeout warming up the location, and proceeding to attach a tenant to a location that doesn't have a warm enough local set of layer files for good performance. Closes: #7532 ## Summary of changes - Add a pageserver API for checking download progress of a secondary location - During `optimize_all`, connect to pageservers of candidate optimization secondary locations, and check they are warm. - During shard split, do heatmap uploads and start secondary downloads, so that the new shards' secondary locations start downloading ASAP, rather than waiting minutes for background downloads to kick in. I have intentionally not implemented this by continuously reading the status of locations, to avoid dealing with the scale challenge of efficiently polling & updating 10k-100k locations status. If we implement that in the future, then this code can be simplified to act based on latest state of a location rather than fetching it inline during optimize_all.	2024-05-03 14:28:23 +00:00
John Spray	e018cac1f7	tests: tweak log allow list in test_sharding_split_failures (#7549 ) ## Problem This test became flaky recently with failures like: ``` AssertionError: Log errors on storage_controller: (129, '2024-04-29T16:41:03.591506Z ERROR request{method=PUT path=/control/v1/tenant/b38c0447fbdbcf4e1c023f00b0f7c221/shard_split request_id=34df4975-2ef3-4ed8-b167-2956650e365c}: Error processing HTTP request: InternalServerError(Reconcile error on shard b38c0447fbdbcf4e1c023f00b0f7c221-0002: Cancelled\n') ``` Likely due to #7508 changing how errors are reported from Reconcilers. ## Summary of changes - Tolerate `Reconcile error.*Cancelled` log errors	2024-04-30 18:00:24 +01:00
John Spray	0bd16182f7	pageserver: fix unlogged relations with sharding (#7454 ) ## Problem - #7451 INIT_FORKNUM blocks must be stored on shard 0 to enable including them in basebackup. This issue can be missed in simple tests because creating an unlogged table isn't sufficient -- to repro I had to create an _index_ on an unlogged table (then restart the endpoint). Closes: #7451 ## Summary of changes - Add a reproducer for the issue. - Tweak the condition for `key_is_shard0` to include anything that isn't a normal relation block _and_ any normal relation block whose forknum is INIT_FORKNUM. - To enable existing databases to recover from the issue, add a special case that omits relations if they were stored on the wrong INITFORK. This enables postgres to start and the user to drop the table and recreate it.	2024-04-22 11:47:24 +00:00
John Spray	4fc95d2d71	pageserver: apply shard filtering to blocks ingested during initdb (#7319 ) ## Problem Ingest filtering wasn't being applied to timeline creations, so a timeline created on a sharded tenant would use 20MB+ on each shard (each shard got a full copy). This didn't break anything, but is inefficient and leaves the system in a harder-to-validate state where shards initially have some data that they will eventually drop during compaction. Closes: https://github.com/neondatabase/neon/issues/6649 ## Summary of changes - in `import_rel`, filter block-by-block with is_key_local - During test_sharding_smoke, check that per-shard physical sizes are as expected - Also extend the test to check deletion works as expected (this was an outstanding tech debt task)	2024-04-05 18:07:35 +01:00
John Spray	ac7fc6110b	pageserver: handle WAL gaps on sharded tenants (#6788 ) ## Problem In the test for https://github.com/neondatabase/neon/pull/6776, a test cases uses tiny layer sizes and tiny stripe sizes. This hits a scenario where a shard's checkpoint interval spans a region where none of the content in the WAL is ingested by this shard. Since there is no layer to flush, we do not advance disk_consistent_lsn, and this causes the test to fail while waiting for LSN to advance. ## Summary of changes - Pass an LSN through `layer_flush_start_tx`. This is the LSN to which we have frozen at the time we ask the flush to flush layers frozen up to this point. - In the layer flush task, if the layers we flush do not reach `frozen_to_lsn`, then advance disk_consistent_lsn up to this point. - In `maybe_freeze_ephemeral_layer`, handle the case where last_record_lsn has advanced without writing a layer file: this ensures that disk_consistent_lsn and remote_consistent_lsn advance anyway. The net effect is that the disk_consistent_lsn is allowed to advance past regions in the WAL where a shard ingests no data, and that we uphold our guarantee that remote_consistent_lsn always eventually reaches the tip of the WAL. The case of no layer at all is hard to test at present due to >0 shards being polluted with SLRU writes, but I have tested it locally with a branch that disables SLRU writes on shards >0. We can tighten up the testing on this in future as/when we refine shard filtering (currently shards >0 need the SLRU because they use it to figure out cutoff in GC using timestamp-to-lsn).	2024-04-04 16:54:38 +00:00
John Spray	63213fc814	storage controller: scheduling optimization for sharded tenants (#7181 ) ## Problem - When we scheduled locations, we were doing it without any context about other shards in the same tenant - After a shard split, there wasn't an automatic mechanism to migrate the attachments away from the split location - After a shard split and the migration away from the split location, there wasn't an automatic mechanism to pick new secondary locations so that the end state has no concentration of locations on the nodes where the split happened. Partially completes: https://github.com/neondatabase/neon/issues/7139 ## Summary of changes - Scheduler now takes a `ScheduleContext` object that can be populated with information about other shards - During tenant creation and shard split, we incrementally build up the ScheduleContext, updating it for each shard as we proceed. - When scheduling new locations, the ScheduleContext is used to apply a soft anti-affinity to nodes where a tenant already has shards. - The background reconciler task now has an extra phase `optimize_all`, which runs only if the primary `reconcile_all` phase didn't generate any work. The separation is that `reconcile_all` is needed for availability, but optimize_all is purely "nice to have" work to balance work across the nodes better. - optimize_all calls into two new TenantState methods called optimize_attachment and optimize_secondary, which seek out opportunities to improve placment: - optimize_attachment: if the node where we're currently attached has an excess of attached shard locations for this tenant compared with the node where we have a secondary location, then cut over to the secondary location. - optimize_secondary: if the node holding our secondary location has an excessive number of locations for this tenant compared with some other node where we don't currently have a location, then create a new secondary location on that other node. - a new debug API endpoint is provided to run background tasks on-demand. This returns a number of reconciliations in progress, so callers can keep calling until they get a `0` to advance the system to its final state without waiting for many iterations of the background task. Optimization is run at an implicitly low priority by: - Omitting the phase entirely if reconcile_all has work to do - Skipping optimization of any tenant that has reconciles in flight - Limiting the total number of optimizations that will be run from one call to optimize_all to a constant (currently 2). The idea of that low priority execution is to minimize the operational risk that optimization work overloads any part of the system. It happens to also make the system easier to observe and debug, as we avoid running large numbers of concurrent changes. Eventually we may relax these limitations: there is no correctness problem with optimizing lots of tenants concurrently, and optimizing multiple shards in one tenant just requires housekeeping changes to update ShardContext with the result of one optimization before proceeding to the next shard.	2024-03-28 18:48:52 +00:00

1 2

70 Commits